Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH iproute2] Revert "iproute: "list/flush/save default" selected all of the routes"
From: Alexander Zubkov @ 2018-03-13 11:05 UTC (permalink / raw)
  To: Luca Boccassi, Stephen Hemminger; +Cc: netdev@vger.kernel.org
In-Reply-To: <3677951520930804@web6j.yandex.ru>

Hello again,

The fun thing is that before the commit "ip route ls all" showed all routes, but "ip -[4|6] route ls all" showed only default. So it was broken too, but in other way.
I see parsing of prefix was changed since my patch. So I need several days to propose fix. I think if "ip route ls [all|any]" shows all routes and "ip route ls default" shows only default, everybody will be happy with that?

13.03.2018, 09:46, "Alexander Zubkov" <green@msu.ru>:
> Hello.
>
> May be the better way would be to change how "all"/"any" argument behaves? My original concern was about "default" only. I agree too, that "all" or "any" should work for all routes. But not for the default.
>
> 12.03.2018, 22:37, "Luca Boccassi" <bluca@debian.org>:
>>  On Mon, 2018-03-12 at 14:03 -0700, Stephen Hemminger wrote:
>>>   This reverts commit 9135c4d6037ff9f1818507bac0049fc44db8c3d2.
>>>
>>>   Debian maintainer found that basic command:
>>>           # ip route flush all
>>>   No longer worked as expected which breaks user scripts and
>>>   expectations. It no longer flushed all IPv4 routes.
>>>
>>>   Reported-by: Luca Boccassi <bluca@debian.org>
>>>   Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
>>>   ---
>>>    ip/iproute.c | 65 ++++++++++++++++++------------------------------
>>>   ------------
>>>    lib/utils.c  | 13 ++++++++++++
>>>    2 files changed, 32 insertions(+), 46 deletions(-)
>>
>>  Tested-by: Luca Boccassi <bluca@debian.org>
>>
>>  Thanks, solves the problem. I'll backport it to Debian.
>>
>>  Alexander, reproducing the issue is quite simple - before that commit,
>>  ip route ls all showed all routes, but with the change it started
>>  showing only the default table. Same for ip route flush.
>>
>>  --
>>  Kind regards,
>>  Luca Boccassi

^ permalink raw reply

* [PATCH net-next] net: Add comment about pernet_operations methods and synchronization
From: Kirill Tkhai @ 2018-03-13 10:55 UTC (permalink / raw)
  To: davem, ktkhai, netdev

Make locking scheme be visible for users, and provide
a comment what for we are need exit_batch() methods,
and when it should be used.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 include/net/net_namespace.h |   14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index d4417495773a..71abc8d79178 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -312,6 +312,20 @@ struct net *get_net_ns_by_id(struct net *net, int id);
 
 struct pernet_operations {
 	struct list_head list;
+	/*
+	 * Below methods are called without any exclusive locks.
+	 * More than one net may be constructed and destructed
+	 * in parallel on several cpus. Every pernet_operations
+	 * have to keep in mind all other pernet_operations and
+	 * to introduce a locking, if they share common resources.
+	 *
+	 * Exit methods using blocking RCU primitives, such as
+	 * synchronize_rcu(), should be implemented via exit_batch.
+	 * Then, destruction of a group of net requires single
+	 * synchronize_rcu() related to these pernet_operations,
+	 * instead of separate synchronize_rcu() for every net.
+	 * Please, avoid synchronize_rcu() at all, where it's possible.
+	 */
 	int (*init)(struct net *net);
 	void (*exit)(struct net *net);
 	void (*exit_batch)(struct list_head *net_exit_list);

^ permalink raw reply related

* [PATCH net-next nfs 6/6] net: Convert rxrpc_net_ops
From: Kirill Tkhai @ 2018-03-13 10:50 UTC (permalink / raw)
  To: davem, trond.myklebust, anna.schumaker, bfields, jlayton,
	dhowells, ktkhai, keescook, dwindsor, ishkamiel, elena.reshetova,
	linux-nfs, linux-afs, netdev
In-Reply-To: <152093778442.8636.10592672493816457119.stgit@localhost.localdomain>

These pernet_operations modifies rxrpc_net_id-pointed
per-net entities. There is external link to AF_RXRPC
in fs/afs/Kconfig, but it seems there is no other
pernet_operations interested in that per-net entities.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 net/rxrpc/net_ns.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/net/rxrpc/net_ns.c b/net/rxrpc/net_ns.c
index f18c9248e0d4..5fd939dabf41 100644
--- a/net/rxrpc/net_ns.c
+++ b/net/rxrpc/net_ns.c
@@ -106,4 +106,5 @@ struct pernet_operations rxrpc_net_ops = {
 	.exit	= rxrpc_exit_net,
 	.id	= &rxrpc_net_id,
 	.size	= sizeof(struct rxrpc_net),
+	.async	= true,
 };

^ permalink raw reply related

* [PATCH net-next nfs 5/6] net: Convert nfs4blocklayout_net_ops
From: Kirill Tkhai @ 2018-03-13 10:49 UTC (permalink / raw)
  To: davem, trond.myklebust, anna.schumaker, bfields, jlayton,
	dhowells, ktkhai, keescook, dwindsor, ishkamiel, elena.reshetova,
	linux-nfs, linux-afs, netdev
In-Reply-To: <152093778442.8636.10592672493816457119.stgit@localhost.localdomain>

These pernet_operations create and destroy per-net pipe
and dentry, and they seem safe to be marked as async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 fs/nfs/blocklayout/rpc_pipefs.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/nfs/blocklayout/rpc_pipefs.c b/fs/nfs/blocklayout/rpc_pipefs.c
index 9fb067a6f7e0..ef9fa111b009 100644
--- a/fs/nfs/blocklayout/rpc_pipefs.c
+++ b/fs/nfs/blocklayout/rpc_pipefs.c
@@ -261,6 +261,7 @@ static void nfs4blocklayout_net_exit(struct net *net)
 static struct pernet_operations nfs4blocklayout_net_ops = {
 	.init = nfs4blocklayout_net_init,
 	.exit = nfs4blocklayout_net_exit,
+	.async = true,
 };
 
 int __init bl_init_pipefs(void)

^ permalink raw reply related

* [PATCH net-next nfs 4/6] net: Convert nfs4_dns_resolver_ops
From: Kirill Tkhai @ 2018-03-13 10:49 UTC (permalink / raw)
  To: davem, trond.myklebust, anna.schumaker, bfields, jlayton,
	dhowells, ktkhai, keescook, dwindsor, ishkamiel, elena.reshetova,
	linux-nfs, linux-afs, netdev
In-Reply-To: <152093778442.8636.10592672493816457119.stgit@localhost.localdomain>

These pernet_operations look similar to rpcsec_gss_net_ops,
they just create and destroy another cache. Also they create
and destroy directory. So, they also look safe to be async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 fs/nfs/dns_resolve.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/nfs/dns_resolve.c b/fs/nfs/dns_resolve.c
index 060c658eab66..e90bd69ab653 100644
--- a/fs/nfs/dns_resolve.c
+++ b/fs/nfs/dns_resolve.c
@@ -410,6 +410,7 @@ static void nfs4_dns_net_exit(struct net *net)
 static struct pernet_operations nfs4_dns_resolver_ops = {
 	.init = nfs4_dns_net_init,
 	.exit = nfs4_dns_net_exit,
+	.async = true,
 };
 
 static int rpc_pipefs_event(struct notifier_block *nb, unsigned long event,

^ permalink raw reply related

* [PATCH net-next nfs 3/6] net: Convert nfsd_net_ops
From: Kirill Tkhai @ 2018-03-13 10:49 UTC (permalink / raw)
  To: davem, trond.myklebust, anna.schumaker, bfields, jlayton,
	dhowells, ktkhai, keescook, dwindsor, ishkamiel, elena.reshetova,
	linux-nfs, linux-afs, netdev
In-Reply-To: <152093778442.8636.10592672493816457119.stgit@localhost.localdomain>

These pernet_operations look similar to rpcsec_gss_net_ops,
they just create and destroy another caches. So, they also
can be async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 fs/nfsd/nfsctl.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index d107b4426f7e..1e3824e6cce0 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1263,6 +1263,7 @@ static struct pernet_operations nfsd_net_ops = {
 	.exit = nfsd_exit_net,
 	.id   = &nfsd_net_id,
 	.size = sizeof(struct nfsd_net),
+	.async = true,
 };
 
 static int __init init_nfsd(void)

^ permalink raw reply related

* [PATCH net-next nfs 2/6] net: Convert sunrpc_net_ops
From: Kirill Tkhai @ 2018-03-13 10:49 UTC (permalink / raw)
  To: davem, trond.myklebust, anna.schumaker, bfields, jlayton,
	dhowells, ktkhai, keescook, dwindsor, ishkamiel, elena.reshetova,
	linux-nfs, linux-afs, netdev
In-Reply-To: <152093778442.8636.10592672493816457119.stgit@localhost.localdomain>

These pernet_operations look similar to rpcsec_gss_net_ops,
they just create and destroy another caches. So, they also
can be async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 net/sunrpc/sunrpc_syms.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/net/sunrpc/sunrpc_syms.c b/net/sunrpc/sunrpc_syms.c
index 56f9eff74150..68287e921847 100644
--- a/net/sunrpc/sunrpc_syms.c
+++ b/net/sunrpc/sunrpc_syms.c
@@ -79,6 +79,7 @@ static struct pernet_operations sunrpc_net_ops = {
 	.exit = sunrpc_exit_net,
 	.id = &sunrpc_net_id,
 	.size = sizeof(struct sunrpc_net),
+	.async = true,
 };
 
 static int __init

^ permalink raw reply related

* [PATCH net-next nfs 1/6] net: Convert rpcsec_gss_net_ops
From: Kirill Tkhai @ 2018-03-13 10:49 UTC (permalink / raw)
  To: davem, trond.myklebust, anna.schumaker, bfields, jlayton,
	dhowells, ktkhai, keescook, dwindsor, ishkamiel, elena.reshetova,
	linux-nfs, linux-afs, netdev
In-Reply-To: <152093778442.8636.10592672493816457119.stgit@localhost.localdomain>

These pernet_operations initialize and destroy sunrpc_net_id
refered per-net items. Only used global list is cache_list,
and accesses already serialized.

sunrpc_destroy_cache_detail() check for list_empty() without
cache_list_lock, but when it's called from unregister_pernet_subsys(),
there can't be callers in parallel, so we won't miss list_empty()
in this case.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 net/sunrpc/auth_gss/auth_gss.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
index 9463af4b32e8..44f939cb6bc8 100644
--- a/net/sunrpc/auth_gss/auth_gss.c
+++ b/net/sunrpc/auth_gss/auth_gss.c
@@ -2063,6 +2063,7 @@ static __net_exit void rpcsec_gss_exit_net(struct net *net)
 static struct pernet_operations rpcsec_gss_net_ops = {
 	.init = rpcsec_gss_init_net,
 	.exit = rpcsec_gss_exit_net,
+	.async = true,
 };
 
 /*

^ permalink raw reply related

* [PATCH net-next nfs 0/6] Converting pernet_operations (part #7)
From: Kirill Tkhai @ 2018-03-13 10:49 UTC (permalink / raw)
  To: davem, trond.myklebust, anna.schumaker, bfields, jlayton,
	dhowells, ktkhai, keescook, dwindsor, ishkamiel, elena.reshetova,
	linux-nfs, linux-afs, netdev

Hi,

this series continues to review and to convert pernet_operations
to make them possible to be executed in parallel for several
net namespaces in the same time. There are nfs pernet_operations
in this series. All of them look similar each other, they mostly
create and destroy caches with small exceptions.

Also, there is rxrpc_net_ops, which is used in AFS.

Thanks,
Kirill
---

Kirill Tkhai (6):
      net: Convert rpcsec_gss_net_ops
      net: Convert sunrpc_net_ops
      net: Convert nfsd_net_ops
      net: Convert nfs4_dns_resolver_ops
      net: Convert nfs4blocklayout_net_ops
      net: Convert rxrpc_net_ops


 fs/nfs/blocklayout/rpc_pipefs.c |    1 +
 fs/nfs/dns_resolve.c            |    1 +
 fs/nfsd/nfsctl.c                |    1 +
 net/rxrpc/net_ns.c              |    1 +
 net/sunrpc/auth_gss/auth_gss.c  |    1 +
 net/sunrpc/sunrpc_syms.c        |    1 +
 6 files changed, 6 insertions(+)

--
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>

^ permalink raw reply

* Re: [PATCH net-next v2] sctp: fix error return code in sctp_sendmsg_new_asoc()
From: Neil Horman @ 2018-03-13 10:48 UTC (permalink / raw)
  To: Wei Yongjun; +Cc: Vlad Yasevich, Xin Long, linux-sctp, netdev, kernel-janitors
In-Reply-To: <1520910210-147500-1-git-send-email-weiyongjun1@huawei.com>

On Tue, Mar 13, 2018 at 03:03:30AM +0000, Wei Yongjun wrote:
> Return error code -EINVAL in the address len check error handling
> case since 'err' can be overwrite to 0 by 'err = sctp_verify_addr()'
> in the for loop.
> 
> Fixes: 2c0dbaa0c43d ("sctp: add support for SCTP_DSTADDRV4/6 Information for sendmsg")
> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
> Acked-by: Neil Horman <nhorman@tuxdriver.com>
> ---
> v1 -> v2: remove the 'err' initialization
> ---
>  net/sctp/socket.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 7d3476a..af5cf29 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -1677,7 +1677,7 @@ static int sctp_sendmsg_new_asoc(struct sock *sk, __u16 sflags,
>  	struct sctp_association *asoc;
>  	enum sctp_scope scope;
>  	struct cmsghdr *cmsg;
> -	int err = -EINVAL;
> +	int err;
>  
>  	*tp = NULL;
>  
> @@ -1761,16 +1761,20 @@ static int sctp_sendmsg_new_asoc(struct sock *sk, __u16 sflags,
>  		memset(daddr, 0, sizeof(*daddr));
>  		dlen = cmsg->cmsg_len - sizeof(struct cmsghdr);
>  		if (cmsg->cmsg_type == SCTP_DSTADDRV4) {
> -			if (dlen < sizeof(struct in_addr))
> +			if (dlen < sizeof(struct in_addr)) {
> +				err = -EINVAL;
>  				goto free;
> +			}
>  
>  			dlen = sizeof(struct in_addr);
>  			daddr->v4.sin_family = AF_INET;
>  			daddr->v4.sin_port = htons(asoc->peer.port);
>  			memcpy(&daddr->v4.sin_addr, CMSG_DATA(cmsg), dlen);
>  		} else {
> -			if (dlen < sizeof(struct in6_addr))
> +			if (dlen < sizeof(struct in6_addr)) {
> +				err = -EINVAL;
>  				goto free;
> +			}
>  
>  			dlen = sizeof(struct in6_addr);
>  			daddr->v6.sin6_family = AF_INET6;
> 
> 
Acked-by: Neil Horman <nhorman@tuxdriver.com>

^ permalink raw reply

* Re: BUG: corrupted list in sctp_association_free
From: Xin Long @ 2018-03-13 10:44 UTC (permalink / raw)
  To: syzbot
  Cc: davem, LKML, linux-sctp, network dev, Neil Horman, syzkaller-bugs,
	Vlad Yasevich
In-Reply-To: <001a113ec036450c260567464832@google.com>

On Tue, Mar 13, 2018 at 3:34 PM, syzbot
<syzbot+e56a5d45f832ef33ad2f@syzkaller.appspotmail.com> wrote:
> Hello,
>
> syzbot hit the following crash on net-next commit
> fd372a7a9e5e9d8011a0222d10edd3523abcd3b1 (Thu Mar 8 19:43:48 2018 +0000)
> Merge tag 'mlx5-updates-2018-02-28-2' of
> git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
>
> Unfortunately, I don't have any reproducer for this crash yet.
> Raw console output is attached.
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+e56a5d45f832ef33ad2f@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
>
> selinux_nlmsg_perm: 1 callbacks suppressed
> SELinux: unrecognized netlink message: protocol=0 nlmsg_type=0
> sclass=netlink_route_socket pig=12502 comm=syz-executor3
> SELinux: unrecognized netlink message: protocol=0 nlmsg_type=0
> sclass=netlink_route_socket pig=12528 comm=syz-executor3
> list_del corruption, 00000000fcc5fb27->next is LIST_POISON1
> (00000000cb16e51d)
> ------------[ cut here ]------------
> kernel BUG at lib/list_debug.c:47!
> invalid opcode: 0000 [#1] SMP KASAN
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Modules linked in:
> CPU: 0 PID: 12537 Comm: syz-executor2 Not tainted 4.16.0-rc4+ #258
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:__list_del_entry_valid+0xd3/0x150 lib/list_debug.c:45
> RSP: 0018:ffff8801b6387778 EFLAGS: 00010286
> RAX: 000000000000004e RBX: dead000000000200 RCX: 0000000000000000
> RDX: 000000000000004e RSI: ffffc90002ed6000 RDI: ffffed0036c70ee3
> RBP: ffff8801b6387790 R08: 1ffff10036c70e3b R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: dead000000000100
> R13: ffff8801d3164000 R14: ffff8801d8502220 R15: ffff8801b6387c58
> FS:  00007ff42042f700(0000) GS:ffff8801db200000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007ff42040ddb8 CR3: 00000001bd840003 CR4: 00000000001606f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  __list_del_entry include/linux/list.h:117 [inline]
>  list_del include/linux/list.h:125 [inline]
>  sctp_association_free+0x133/0x930 net/sctp/associola.c:341
>  sctp_sendmsg+0xc67/0x1a80 net/sctp/socket.c:2075
>  inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
>  sock_sendmsg_nosec net/socket.c:629 [inline]
>  sock_sendmsg+0xca/0x110 net/socket.c:639
>  SYSC_sendto+0x361/0x5c0 net/socket.c:1748
>  SyS_sendto+0x40/0x50 net/socket.c:1716
>  do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> RIP: 0033:0x453e69
> RSP: 002b:00007ff42042ec68 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
> RAX: ffffffffffffffda RBX: 00007ff42042f6d4 RCX: 0000000000453e69
> RDX: 0000000000000001 RSI: 0000000020000340 RDI: 0000000000000015
> RBP: 000000000072c0c8 R08: 00000000204d9000 R09: 000000000000001c
> R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
> R13: 00000000000004cd R14: 00000000006f73d8 R15: 0000000000000003
> Code: 8f 00 00 00 49 8b 54 24 08 48 39 f2 75 3b 48 83 c4 08 b8 01 00 00 00
> 5b 41 5c 5d c3 4c 89 e2 48 c7 c7 c0 7c 40 86 e8 75 f6 fb fe <0f> 0b 48 c7 c7
> 20 7d 40 86 e8 67 f6 fb fe 0f 0b 48 c7 c7 80 7d
> RIP: __list_del_entry_valid+0xd3/0x150 lib/list_debug.c:45 RSP:
> ffff8801b6387778
> ---[ end trace a6b157f61f9bd43a ]---
> Kernel panic - not syncing: Fatal exception
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
>
>
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is
> merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug
> report.
> Note: all commands must start from beginning of the line in the email body.
I'd think the patch Neil just posted would fix it.

^ permalink raw reply

* Re: linux-next: manual merge of the net-next tree with the net tree
From: Petr Machata @ 2018-03-13 10:41 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: David Miller, Networking, Linux-Next Mailing List,
	Linux Kernel Mailing List, Jiri Pirko, Ido Schimmel
In-Reply-To: <20180313112941.7971ba0c@canb.auug.org.au>

Stephen Rothwell <sfr@canb.auug.org.au> writes:

> Today's linux-next merge of the net-next tree got conflicts in:
>
>   drivers/net/ethernet/mellanox/mlxsw/spectrum.h
>   drivers/net/ethernet/mellanox/mlxsw/spectrum.c
>
> between commit:
>
>   663f1b26f9c1 ("mlxsw: spectrum: Prevent duplicate mirrors")
>
> from the net tree and commit:
>
>   a629ef210d89 ("mlxsw: spectrum: Move SPAN code to separate module")
>
> from the net-next tree.
>
> I fixed it up

Looks good.

Thanks,
Petr

^ permalink raw reply

* [PATCH net-next 4/4] net: Convert rds_tcp_net_ops
From: Kirill Tkhai @ 2018-03-13 10:37 UTC (permalink / raw)
  To: davem, vyasevich, nhorman, jon.maloy, ying.xue, santosh.shilimkar,
	netdev, ktkhai
In-Reply-To: <152093726634.31266.4973922580771632781.stgit@localhost.localdomain>

These pernet_operations create and destroy sysctl table
and listen socket. Also, exit method flushes global
workqueue and work. Everything looks per-net safe,
so we can mark them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 net/rds/tcp.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 08230a145042..eb04e7fa2467 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -515,6 +515,7 @@ static struct pernet_operations rds_tcp_net_ops = {
 	.exit = rds_tcp_exit_net,
 	.id = &rds_tcp_netid,
 	.size = sizeof(struct rds_tcp_net),
+	.async = true,
 };
 
 static void rds_tcp_kill_sock(struct net *net)

^ permalink raw reply related

* [PATCH net-next 3/4] net: Convert tipc_net_ops
From: Kirill Tkhai @ 2018-03-13 10:37 UTC (permalink / raw)
  To: davem, vyasevich, nhorman, jon.maloy, ying.xue, santosh.shilimkar,
	netdev, ktkhai
In-Reply-To: <152093726634.31266.4973922580771632781.stgit@localhost.localdomain>

TIPC looks concentrated in itself, and other pernet_operations
seem not touching its entities.

tipc_net_ops look pernet-divided, and they should be safe to
be executed in parallel for several net the same time.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 net/tipc/core.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/net/tipc/core.c b/net/tipc/core.c
index 0b982d048fb9..04fd91bb11d7 100644
--- a/net/tipc/core.c
+++ b/net/tipc/core.c
@@ -105,6 +105,7 @@ static struct pernet_operations tipc_net_ops = {
 	.exit = tipc_exit_net,
 	.id   = &tipc_net_id,
 	.size = sizeof(struct tipc_net),
+	.async = true,
 };
 
 static int __init tipc_init(void)

^ permalink raw reply related

* [PATCH net-next 2/4] net: Convert sctp_ctrlsock_ops
From: Kirill Tkhai @ 2018-03-13 10:37 UTC (permalink / raw)
  To: davem, vyasevich, nhorman, jon.maloy, ying.xue, santosh.shilimkar,
	netdev, ktkhai
In-Reply-To: <152093726634.31266.4973922580771632781.stgit@localhost.localdomain>

These pernet_operations create and destroy net::sctp::ctl_sock.
Since pernet_operations do not send sctp packets each other,
they look safe to be marked as async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 net/sctp/protocol.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 32be52304f98..606361ee9e4a 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1354,6 +1354,7 @@ static void __net_init sctp_ctrlsock_exit(struct net *net)
 static struct pernet_operations sctp_ctrlsock_ops = {
 	.init = sctp_ctrlsock_init,
 	.exit = sctp_ctrlsock_exit,
+	.async = true,
 };
 
 /* Initialize the universe into something sensible.  */

^ permalink raw reply related

* [PATCH net-next 1/4] net: Convert sctp_defaults_ops
From: Kirill Tkhai @ 2018-03-13 10:36 UTC (permalink / raw)
  To: davem, vyasevich, nhorman, jon.maloy, ying.xue, santosh.shilimkar,
	netdev, ktkhai
In-Reply-To: <152093726634.31266.4973922580771632781.stgit@localhost.localdomain>

These pernet_operations have a deal with sysctl, /proc
entries and statistics. Also, there are freeing of
net::sctp::addr_waitq queue and net::sctp::local_addr_list
in exit method. All of them look pernet-divided, and it
seems these items are only interesting for sctp_defaults_ops,
which are safe to be executed in parallel.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 net/sctp/protocol.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 91813e686c67..32be52304f98 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1330,6 +1330,7 @@ static void __net_exit sctp_defaults_exit(struct net *net)
 static struct pernet_operations sctp_defaults_ops = {
 	.init = sctp_defaults_init,
 	.exit = sctp_defaults_exit,
+	.async = true,
 };
 
 static int __net_init sctp_ctrlsock_init(struct net *net)

^ permalink raw reply related

* [PATCH net-next 0/4] Converting pernet_operations (part #6)
From: Kirill Tkhai @ 2018-03-13 10:36 UTC (permalink / raw)
  To: davem, vyasevich, nhorman, jon.maloy, ying.xue, santosh.shilimkar,
	netdev, ktkhai

Hi,

this series continues to review and to convert pernet_operations
to make them possible to be executed in parallel for several
net namespaces in the same time. There are sctp, tipc and rds
in this series.

Thanks,
Kirill
---

Kirill Tkhai (4):
      net: Convert sctp_defaults_ops
      net: Convert sctp_ctrlsock_ops
      net: Convert tipc_net_ops
      net: Convert rds_tcp_net_ops


 net/rds/tcp.c       |    1 +
 net/sctp/protocol.c |    2 ++
 net/tipc/core.c     |    1 +
 3 files changed, 4 insertions(+)

--
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>

^ permalink raw reply

* [PATCH net-next] cxgb4: Add HMA support
From: Arjun Vynipadath @ 2018-03-13 10:54 UTC (permalink / raw)
  To: netdev, davem
  Cc: nirranjan, indranil, venkatesh, Arjun Vynipadath,
	Santosh Rastapur, Michael Werner, Ganesh GR

HMA(Host Memory Access) maps a part of host memory for T6-SO memfree cards.

This commit does the following:
- Query FW to check if we have HMA support. If yes, the params will
  return HMA size configured in FW. We will dma map memory based
  on this size.
- Also contains changes to get HMA memory information via debugfs.

Signed-off-by: Arjun Vynipadath <arjun@chelsio.com>
Signed-off-by: Santosh Rastapur <santosh@chelsio.com>
Signed-off-by: Michael Werner <werner@chelsio.com>
Signed-off-by: Ganesh GR <ganeshgr@chelsio.com>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h         |  13 ++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c |  10 +-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c    | 228 ++++++++++++++++++++-
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c         |   2 +-
 drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h      |  56 +++++
 5 files changed, 303 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index d3fa53d..b2df0ff 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -831,6 +831,16 @@ struct vf_info {
 	u16 vlan;
 };
 
+enum {
+	HMA_DMA_MAPPED_FLAG = 1
+};
+
+struct hma_data {
+	unsigned char flags;
+	struct sg_table *sgt;
+	dma_addr_t *phy_addr;	/* physical address of the page */
+};
+
 struct mbox_list {
 	struct list_head list;
 };
@@ -946,6 +956,9 @@ struct adapter {
 
 	/* Ethtool Dump */
 	struct ethtool_dump eth_dump;
+
+	/* HMA */
+	struct hma_data hma;
 };
 
 /* Support for "sched-class" command to allow a TX Scheduling Class to be
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index 2822bbf..de2ba86 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -2617,7 +2617,7 @@ int mem_open(struct inode *inode, struct file *file)
 
 	file->private_data = inode->i_private;
 
-	mem = (uintptr_t)file->private_data & 0x3;
+	mem = (uintptr_t)file->private_data & 0x7;
 	adap = file->private_data - mem;
 
 	(void)t4_fwcache(adap, FW_PARAM_DEV_FWCACHE_FLUSH);
@@ -2630,7 +2630,7 @@ static ssize_t mem_read(struct file *file, char __user *buf, size_t count,
 {
 	loff_t pos = *ppos;
 	loff_t avail = file_inode(file)->i_size;
-	unsigned int mem = (uintptr_t)file->private_data & 3;
+	unsigned int mem = (uintptr_t)file->private_data & 0x7;
 	struct adapter *adap = file->private_data - mem;
 	__be32 *data;
 	int ret;
@@ -3042,6 +3042,12 @@ int t4_setup_debugfs(struct adapter *adap)
 			add_debugfs_mem(adap, "mc", MEM_MC,
 					EXT_MEM_SIZE_G(size));
 		}
+
+		if (i & HMA_MUX_F) {
+			size = t4_read_reg(adap, MA_EXT_MEMORY1_BAR_A);
+			add_debugfs_mem(adap, "hma", MEM_HMA,
+					EXT_MEM1_SIZE_G(size));
+		}
 	}
 
 	de = debugfs_create_file_size("flash", S_IRUSR, adap->debugfs_root, adap,
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 1b44652..d1e2786 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -1733,10 +1733,11 @@ EXPORT_SYMBOL(cxgb4_sync_txq_pidx);
 
 int cxgb4_read_tpte(struct net_device *dev, u32 stag, __be32 *tpte)
 {
-	struct adapter *adap;
-	u32 offset, memtype, memaddr;
 	u32 edc0_size, edc1_size, mc0_size, mc1_size, size;
 	u32 edc0_end, edc1_end, mc0_end, mc1_end;
+	u32 offset, memtype, memaddr;
+	struct adapter *adap;
+	u32 hma_size = 0;
 	int ret;
 
 	adap = netdev2adap(dev);
@@ -1756,6 +1757,10 @@ int cxgb4_read_tpte(struct net_device *dev, u32 stag, __be32 *tpte)
 	size = t4_read_reg(adap, MA_EXT_MEMORY0_BAR_A);
 	mc0_size = EXT_MEM0_SIZE_G(size) << 20;
 
+	if (t4_read_reg(adap, MA_TARGET_MEM_ENABLE_A) & HMA_MUX_F) {
+		size = t4_read_reg(adap, MA_EXT_MEMORY1_BAR_A);
+		hma_size = EXT_MEM1_SIZE_G(size) << 20;
+	}
 	edc0_end = edc0_size;
 	edc1_end = edc0_end + edc1_size;
 	mc0_end = edc1_end + mc0_size;
@@ -1767,7 +1772,10 @@ int cxgb4_read_tpte(struct net_device *dev, u32 stag, __be32 *tpte)
 		memtype = MEM_EDC1;
 		memaddr = offset - edc0_end;
 	} else {
-		if (offset < mc0_end) {
+		if (hma_size && (offset < (edc1_end + hma_size))) {
+			memtype = MEM_HMA;
+			memaddr = offset - edc1_end;
+		} else if (offset < mc0_end) {
 			memtype = MEM_MC0;
 			memaddr = offset - edc1_end;
 		} else if (is_t5(adap->params.chip)) {
@@ -3298,6 +3306,206 @@ static void setup_memwin_rdma(struct adapter *adap)
 	}
 }
 
+/* HMA Definitions */
+
+/* The maximum number of address that can be send in a single FW cmd */
+#define HMA_MAX_ADDR_IN_CMD	5
+
+#define HMA_PAGE_SIZE		PAGE_SIZE
+
+#define HMA_MAX_NO_FW_ADDRESS	(16 << 10)  /* FW supports 16K addresses */
+
+#define HMA_PAGE_ORDER					\
+	((HMA_PAGE_SIZE < HMA_MAX_NO_FW_ADDRESS) ?	\
+	ilog2(HMA_MAX_NO_FW_ADDRESS / HMA_PAGE_SIZE) : 0)
+
+/* The minimum and maximum possible HMA sizes that can be specified in the FW
+ * configuration(in units of MB).
+ */
+#define HMA_MIN_TOTAL_SIZE	1
+#define HMA_MAX_TOTAL_SIZE				\
+	(((HMA_PAGE_SIZE << HMA_PAGE_ORDER) *		\
+	  HMA_MAX_NO_FW_ADDRESS) >> 20)
+
+static void adap_free_hma_mem(struct adapter *adapter)
+{
+	struct scatterlist *iter;
+	struct page *page;
+	int i;
+
+	if (!adapter->hma.sgt)
+		return;
+
+	if (adapter->hma.flags & HMA_DMA_MAPPED_FLAG) {
+		dma_unmap_sg(adapter->pdev_dev, adapter->hma.sgt->sgl,
+			     adapter->hma.sgt->nents, PCI_DMA_BIDIRECTIONAL);
+		adapter->hma.flags &= ~HMA_DMA_MAPPED_FLAG;
+	}
+
+	for_each_sg(adapter->hma.sgt->sgl, iter,
+		    adapter->hma.sgt->orig_nents, i) {
+		page = sg_page(iter);
+		if (page)
+			__free_pages(page, HMA_PAGE_ORDER);
+	}
+
+	kfree(adapter->hma.phy_addr);
+	sg_free_table(adapter->hma.sgt);
+	kfree(adapter->hma.sgt);
+	adapter->hma.sgt = NULL;
+}
+
+static int adap_config_hma(struct adapter *adapter)
+{
+	struct scatterlist *sgl, *iter;
+	struct sg_table *sgt;
+	struct page *newpage;
+	unsigned int i, j, k;
+	u32 param, hma_size;
+	unsigned int ncmds;
+	size_t page_size;
+	u32 page_order;
+	int node, ret;
+
+	/* HMA is supported only for T6+ cards.
+	 * Avoid initializing HMA in kdump kernels.
+	 */
+	if (is_kdump_kernel() ||
+	    CHELSIO_CHIP_VERSION(adapter->params.chip) < CHELSIO_T6)
+		return 0;
+
+	/* Get the HMA region size required by fw */
+	param = (FW_PARAMS_MNEM_V(FW_PARAMS_MNEM_DEV) |
+		 FW_PARAMS_PARAM_X_V(FW_PARAMS_PARAM_DEV_HMA_SIZE));
+	ret = t4_query_params(adapter, adapter->mbox, adapter->pf, 0,
+			      1, &param, &hma_size);
+	/* An error means card has its own memory or HMA is not supported by
+	 * the firmware. Return without any errors.
+	 */
+	if (ret || !hma_size)
+		return 0;
+
+	if (hma_size < HMA_MIN_TOTAL_SIZE ||
+	    hma_size > HMA_MAX_TOTAL_SIZE) {
+		dev_err(adapter->pdev_dev,
+			"HMA size %uMB beyond bounds(%u-%lu)MB\n",
+			hma_size, HMA_MIN_TOTAL_SIZE, HMA_MAX_TOTAL_SIZE);
+		return -EINVAL;
+	}
+
+	page_size = HMA_PAGE_SIZE;
+	page_order = HMA_PAGE_ORDER;
+	adapter->hma.sgt = kzalloc(sizeof(*adapter->hma.sgt), GFP_KERNEL);
+	if (unlikely(!adapter->hma.sgt)) {
+		dev_err(adapter->pdev_dev, "HMA SG table allocation failed\n");
+		return -ENOMEM;
+	}
+	sgt = adapter->hma.sgt;
+	/* FW returned value will be in MB's
+	 */
+	sgt->orig_nents = (hma_size << 20) / (page_size << page_order);
+	if (sg_alloc_table(sgt, sgt->orig_nents, GFP_KERNEL)) {
+		dev_err(adapter->pdev_dev, "HMA SGL allocation failed\n");
+		kfree(adapter->hma.sgt);
+		adapter->hma.sgt = NULL;
+		return -ENOMEM;
+	}
+
+	sgl = adapter->hma.sgt->sgl;
+	node = dev_to_node(adapter->pdev_dev);
+	for_each_sg(sgl, iter, sgt->orig_nents, i) {
+		newpage = alloc_pages_node(node, __GFP_NOWARN | GFP_KERNEL,
+					   page_order);
+		if (!newpage) {
+			dev_err(adapter->pdev_dev,
+				"Not enough memory for HMA page allocation\n");
+			ret = -ENOMEM;
+			goto free_hma;
+		}
+		sg_set_page(iter, newpage, page_size << page_order, 0);
+	}
+
+	sgt->nents = dma_map_sg(adapter->pdev_dev, sgl, sgt->orig_nents,
+				DMA_BIDIRECTIONAL);
+	if (!sgt->nents) {
+		dev_err(adapter->pdev_dev,
+			"Not enough memory for HMA DMA mapping");
+		ret = -ENOMEM;
+		goto free_hma;
+	}
+	adapter->hma.flags |= HMA_DMA_MAPPED_FLAG;
+
+	adapter->hma.phy_addr = kcalloc(sgt->nents, sizeof(dma_addr_t),
+					GFP_KERNEL);
+	if (unlikely(!adapter->hma.phy_addr))
+		goto free_hma;
+
+	for_each_sg(sgl, iter, sgt->nents, i) {
+		newpage = sg_page(iter);
+		adapter->hma.phy_addr[i] = sg_dma_address(iter);
+	}
+
+	ncmds = DIV_ROUND_UP(sgt->nents, HMA_MAX_ADDR_IN_CMD);
+	/* Pass on the addresses to firmware */
+	for (i = 0, k = 0; i < ncmds; i++, k += HMA_MAX_ADDR_IN_CMD) {
+		struct fw_hma_cmd hma_cmd;
+		u8 naddr = HMA_MAX_ADDR_IN_CMD;
+		u8 soc = 0, eoc = 0;
+		u8 hma_mode = 1; /* Presently we support only Page table mode */
+
+		soc = (i == 0) ? 1 : 0;
+		eoc = (i == ncmds - 1) ? 1 : 0;
+
+		/* For last cmd, set naddr corresponding to remaining
+		 * addresses
+		 */
+		if (i == ncmds - 1) {
+			naddr = sgt->nents % HMA_MAX_ADDR_IN_CMD;
+			naddr = naddr ? naddr : HMA_MAX_ADDR_IN_CMD;
+		}
+		memset(&hma_cmd, 0, sizeof(hma_cmd));
+		hma_cmd.op_pkd = htonl(FW_CMD_OP_V(FW_HMA_CMD) |
+				       FW_CMD_REQUEST_F | FW_CMD_WRITE_F);
+		hma_cmd.retval_len16 = htonl(FW_LEN16(hma_cmd));
+
+		hma_cmd.mode_to_pcie_params =
+			htonl(FW_HMA_CMD_MODE_V(hma_mode) |
+			      FW_HMA_CMD_SOC_V(soc) | FW_HMA_CMD_EOC_V(eoc));
+
+		/* HMA cmd size specified in MB's */
+		hma_cmd.naddr_size =
+			htonl(FW_HMA_CMD_SIZE_V(hma_size) |
+			      FW_HMA_CMD_NADDR_V(naddr));
+
+		/* Total Page size specified in units of 4K */
+		hma_cmd.addr_size_pkd =
+			htonl(FW_HMA_CMD_ADDR_SIZE_V
+				((page_size << page_order) >> 12));
+
+		/* Fill the 5 addresses */
+		for (j = 0; j < naddr; j++) {
+			hma_cmd.phy_address[j] =
+				cpu_to_be64(adapter->hma.phy_addr[j + k]);
+		}
+		ret = t4_wr_mbox(adapter, adapter->mbox, &hma_cmd,
+				 sizeof(hma_cmd), &hma_cmd);
+		if (ret) {
+			dev_err(adapter->pdev_dev,
+				"HMA FW command failed with err %d\n", ret);
+			goto free_hma;
+		}
+	}
+
+	if (!ret)
+		dev_info(adapter->pdev_dev,
+			 "Reserved %uMB host memory for HMA\n", hma_size);
+	return ret;
+
+free_hma:
+	adap_free_hma_mem(adapter);
+	return ret;
+}
+
 static int adap_init1(struct adapter *adap, struct fw_caps_config_cmd *c)
 {
 	u32 v;
@@ -3751,6 +3959,12 @@ static int adap_init0_config(struct adapter *adapter, int reset)
 	if (ret < 0)
 		goto bye;
 
+	/* We will proceed even if HMA init fails. */
+	ret = adap_config_hma(adapter);
+	if (ret)
+		dev_err(adapter->pdev_dev,
+			"HMA configuration failed with error %d\n", ret);
+
 	/*
 	 * And finally tell the firmware to initialize itself using the
 	 * parameters from the Configuration File.
@@ -3957,6 +4171,11 @@ static int adap_init0(struct adapter *adap)
 	 * effect. Otherwise, it's time to try initializing the adapter.
 	 */
 	if (state == DEV_STATE_INIT) {
+		ret = adap_config_hma(adap);
+		if (ret)
+			dev_err(adap->pdev_dev,
+				"HMA configuration failed with error %d\n",
+				ret);
 		dev_info(adap->pdev_dev, "Coming up as %s: "\
 			 "Adapter already initialized\n",
 			 adap->flags & MASTER_PF ? "MASTER" : "SLAVE");
@@ -4346,6 +4565,7 @@ static int adap_init0(struct adapter *adap)
 	 * happened to HW/FW, stop issuing commands.
 	 */
 bye:
+	adap_free_hma_mem(adap);
 	kfree(adap->sge.egr_map);
 	kfree(adap->sge.ingr_map);
 	kfree(adap->sge.starving_fl);
@@ -5573,6 +5793,8 @@ static void remove_one(struct pci_dev *pdev)
 			t4_uld_clean_up(adapter);
 		}
 
+		adap_free_hma_mem(adapter);
+
 		disable_interrupts(adapter);
 
 		for_each_port(adapter, i)
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index 2c889ef..38e38dc 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -487,7 +487,7 @@ static int t4_edc_err_read(struct adapter *adap, int idx)
  * t4_memory_rw_init - Get memory window relative offset, base, and size.
  * @adap: the adapter
  * @win: PCI-E Memory Window to use
- * @mtype: memory type: MEM_EDC0, MEM_EDC1 or MEM_MC
+ * @mtype: memory type: MEM_EDC0, MEM_EDC1, MEM_HMA or MEM_MC
  * @mem_off: memory relative offset with respect to @mtype.
  * @mem_base: configured memory base address.
  * @mem_aperture: configured memory window aperture.
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h b/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
index 0d83b40..e40217a 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
@@ -766,6 +766,7 @@ enum fw_cmd_opcodes {
 	FW_DEVLOG_CMD                  = 0x25,
 	FW_CLIP_CMD                    = 0x28,
 	FW_PTP_CMD                     = 0x3e,
+	FW_HMA_CMD                     = 0x3f,
 	FW_LASTC2E_CMD                 = 0x40,
 	FW_ERROR_CMD                   = 0x80,
 	FW_DEBUG_CMD                   = 0x81,
@@ -1132,6 +1133,7 @@ enum fw_memtype_cf {
 	FW_MEMTYPE_CF_FLASH		= 0x4,
 	FW_MEMTYPE_CF_INTERNAL		= 0x5,
 	FW_MEMTYPE_CF_EXTMEM1           = 0x6,
+	FW_MEMTYPE_CF_HMA		= 0x7,
 };
 
 struct fw_caps_config_cmd {
@@ -1210,6 +1212,7 @@ enum fw_params_param_dev {
 	FW_PARAMS_PARAM_DEV_RI_FR_NSMR_TPTE_WR	= 0x1C,
 	FW_PARAMS_PARAM_DEV_FILTER2_WR  = 0x1D,
 	FW_PARAMS_PARAM_DEV_MPSBGMAP	= 0x1E,
+	FW_PARAMS_PARAM_DEV_HMA_SIZE	= 0x20,
 };
 
 /*
@@ -3435,6 +3438,59 @@ struct fw_debug_cmd {
 #define FW_DEBUG_CMD_TYPE_G(x)	\
 	(((x) >> FW_DEBUG_CMD_TYPE_S) & FW_DEBUG_CMD_TYPE_M)
 
+struct fw_hma_cmd {
+	__be32 op_pkd;
+	__be32 retval_len16;
+	__be32 mode_to_pcie_params;
+	__be32 naddr_size;
+	__be32 addr_size_pkd;
+	__be32 r6;
+	__be64 phy_address[5];
+};
+
+#define FW_HMA_CMD_MODE_S	31
+#define FW_HMA_CMD_MODE_M	0x1
+#define FW_HMA_CMD_MODE_V(x)	((x) << FW_HMA_CMD_MODE_S)
+#define FW_HMA_CMD_MODE_G(x)	\
+	(((x) >> FW_HMA_CMD_MODE_S) & FW_HMA_CMD_MODE_M)
+#define FW_HMA_CMD_MODE_F	FW_HMA_CMD_MODE_V(1U)
+
+#define FW_HMA_CMD_SOC_S	30
+#define FW_HMA_CMD_SOC_M	0x1
+#define FW_HMA_CMD_SOC_V(x)	((x) << FW_HMA_CMD_SOC_S)
+#define FW_HMA_CMD_SOC_G(x)	(((x) >> FW_HMA_CMD_SOC_S) & FW_HMA_CMD_SOC_M)
+#define FW_HMA_CMD_SOC_F	FW_HMA_CMD_SOC_V(1U)
+
+#define FW_HMA_CMD_EOC_S	29
+#define FW_HMA_CMD_EOC_M	0x1
+#define FW_HMA_CMD_EOC_V(x)	((x) << FW_HMA_CMD_EOC_S)
+#define FW_HMA_CMD_EOC_G(x)	(((x) >> FW_HMA_CMD_EOC_S) & FW_HMA_CMD_EOC_M)
+#define FW_HMA_CMD_EOC_F	FW_HMA_CMD_EOC_V(1U)
+
+#define FW_HMA_CMD_PCIE_PARAMS_S	0
+#define FW_HMA_CMD_PCIE_PARAMS_M	0x7ffffff
+#define FW_HMA_CMD_PCIE_PARAMS_V(x)	((x) << FW_HMA_CMD_PCIE_PARAMS_S)
+#define FW_HMA_CMD_PCIE_PARAMS_G(x)	\
+	(((x) >> FW_HMA_CMD_PCIE_PARAMS_S) & FW_HMA_CMD_PCIE_PARAMS_M)
+
+#define FW_HMA_CMD_NADDR_S	12
+#define FW_HMA_CMD_NADDR_M	0x3f
+#define FW_HMA_CMD_NADDR_V(x)	((x) << FW_HMA_CMD_NADDR_S)
+#define FW_HMA_CMD_NADDR_G(x)	\
+	(((x) >> FW_HMA_CMD_NADDR_S) & FW_HMA_CMD_NADDR_M)
+
+#define FW_HMA_CMD_SIZE_S	0
+#define FW_HMA_CMD_SIZE_M	0xfff
+#define FW_HMA_CMD_SIZE_V(x)	((x) << FW_HMA_CMD_SIZE_S)
+#define FW_HMA_CMD_SIZE_G(x)	\
+	(((x) >> FW_HMA_CMD_SIZE_S) & FW_HMA_CMD_SIZE_M)
+
+#define FW_HMA_CMD_ADDR_SIZE_S		11
+#define FW_HMA_CMD_ADDR_SIZE_M		0x1fffff
+#define FW_HMA_CMD_ADDR_SIZE_V(x)	((x) << FW_HMA_CMD_ADDR_SIZE_S)
+#define FW_HMA_CMD_ADDR_SIZE_G(x)	\
+	(((x) >> FW_HMA_CMD_ADDR_SIZE_S) & FW_HMA_CMD_ADDR_SIZE_M)
+
 enum pcie_fw_eval {
 	PCIE_FW_EVAL_CRASH = 0,
 };
-- 
2.3.5

^ permalink raw reply related

* Re: [2/2] net/usb/ax88179_178a: Delete three unnecessary variables in ax88179_chk_eee()
From: Oliver Neukum @ 2018-03-13 10:26 UTC (permalink / raw)
  To: SF Markus Elfring, linux-usb, netdev
  Cc: David S. Miller, Philippe Reynes, Andrew Lunn, Bjørn Mork,
	Yuval Shaia, Andrew F. Davis, kernel-janitors, LKML
In-Reply-To: <7d45bd0c-5ab2-a2c0-d746-ef2d859eea08@users.sourceforge.net>

Am Dienstag, den 13.03.2018, 08:24 +0100 schrieb SF Markus Elfring:
> > 
> > > 
> > > Use three values directly for a condition check without assigning them
> > > to intermediate variables.
> > 
> > Hi,
> > 
> > what is the benefit of this?
> 
> I proposed a small source code reduction.
> 
> Other software design directions might become more interesting for this use case.

Yes and doing so you killed three meaningful names that tell
us what these checks actually test for. That is not an improvement.

	Regards
		Oliver

^ permalink raw reply

* [PATCH v3 net 5/5] tcp: send real dupACKs by locking advertized window for non-SACK flows
From: Ilpo Järvinen @ 2018-03-13 10:25 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Sergei Shtylyov
In-Reply-To: <1520936711-16784-1-git-send-email-ilpo.jarvinen@helsinki.fi>

Currently, the TCP code is overly eager to increase window on
almost every ACK. It makes those ACKs that the receiver should
sent as dupACKs look like they update window that is not
considered a real dupACK by the non-SACK sender-side code.
Therefore the sender needs to resort to RTO to recover
losses as fast retransmit/fast recovery cannot be triggered
by such masked duplicate ACKs.

This change makes sure that when an ofo segment is received,
no change to window is applied if we are going to send a dupACK.
Even with this change, the window updates keep being maintained
but that occurs "behind the scenes". That is, this change does
not interfere with memory management of the flow which could
have long-term impact for the progress of the flow but only
prevents those updates being seen on the wire on short-term.
It's ok to change the window for non-dupACKs such as the first
ACK after ofo arrivals start if that ACK was using delayed ACKs
and also whenever the ack field advances. As ack field typically
advances once per RTT as the first hole is retransmitted, the
window updates are not delayed entirely during long recoveries.

Even before this fix, tcp_select_window did not allow ACK
shrinking the window for duplicate ACKs (that was previously
even called "treason" but the old charmy message is gone now).
The advertized window can only shrink when also ack field
changes which will not be blocked by this change as it is not
duplicate ACK.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 include/linux/tcp.h   |  3 ++-
 net/ipv4/tcp_input.c  |  5 ++++-
 net/ipv4/tcp_output.c | 43 +++++++++++++++++++++++++------------------
 3 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 8f4c549..e239662 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -225,7 +225,8 @@ struct tcp_sock {
 		fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */
 		fastopen_no_cookie:1, /* Allow send/recv SYN+data without a cookie */
 		is_sack_reneg:1,    /* in recovery from loss with SACK reneg? */
-		unused:2;
+		dupack_wnd_lock :1, /* Non-SACK constant rwnd dupacks needed? */
+		unused:1;
 	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		unused1	    : 1,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 270aa48..4ff192b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4626,6 +4626,7 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	bool fragstolen;
 	int eaten;
 
@@ -4669,7 +4670,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 			 * gap in queue is filled.
 			 */
 			if (RB_EMPTY_ROOT(&tp->out_of_order_queue))
-				inet_csk(sk)->icsk_ack.pingpong = 0;
+				icsk->icsk_ack.pingpong = 0;
 		}
 
 		if (tp->rx_opt.num_sacks)
@@ -4719,6 +4720,8 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 		goto queue_and_out;
 	}
 
+	if (tcp_is_reno(tp) && !(icsk->icsk_ack.pending & ICSK_ACK_TIMER))
+		tp->dupack_wnd_lock = 1;
 	tcp_data_queue_ofo(sk, skb);
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6818042..45fbe92 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -249,25 +249,32 @@ static u16 tcp_select_window(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 old_win = tp->rcv_wnd;
-	u32 cur_win = tcp_receive_window(tp);
-	u32 new_win = __tcp_select_window(sk);
-
-	/* Never shrink the offered window */
-	if (new_win < cur_win) {
-		/* Danger Will Robinson!
-		 * Don't update rcv_wup/rcv_wnd here or else
-		 * we will not be able to advertise a zero
-		 * window in time.  --DaveM
-		 *
-		 * Relax Will Robinson.
-		 */
-		if (new_win == 0)
-			NET_INC_STATS(sock_net(sk),
-				      LINUX_MIB_TCPWANTZEROWINDOWADV);
-		new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+	u32 cur_win;
+	u32 new_win;
+
+	if (tp->dupack_wnd_lock) {
+		new_win = old_win;
+		tp->dupack_wnd_lock = 0;
+	} else {
+		cur_win = tcp_receive_window(tp);
+		new_win = __tcp_select_window(sk);
+		/* Never shrink the offered window */
+		if (new_win < cur_win) {
+			/* Danger Will Robinson!
+			 * Don't update rcv_wup/rcv_wnd here or else
+			 * we will not be able to advertise a zero
+			 * window in time.  --DaveM
+			 *
+			 * Relax Will Robinson.
+			 */
+			if (new_win == 0)
+				NET_INC_STATS(sock_net(sk),
+					      LINUX_MIB_TCPWANTZEROWINDOWADV);
+			new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+		}
+		tp->rcv_wnd = new_win;
+		tp->rcv_wup = tp->rcv_nxt;
 	}
-	tp->rcv_wnd = new_win;
-	tp->rcv_wup = tp->rcv_nxt;
 
 	/* Make sure we do not exceed the maximum possible
 	 * scaled window.
-- 
2.7.4

^ permalink raw reply related

* [PATCH v3 net 3/5] tcp: move false FR condition into tcp_false_fast_retrans_possible()
From: Ilpo Järvinen @ 2018-03-13 10:25 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Sergei Shtylyov
In-Reply-To: <1520936711-16784-1-git-send-email-ilpo.jarvinen@helsinki.fi>

No functional changes. This change simplifies the next change
slightly.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c60745c..72ecfbb 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2211,6 +2211,17 @@ static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit)
 	}
 }
 
+/* False fast retransmits may occur when SACK is not in use under certain
+ * conditions (RFC6582). The sender MUST hold old state until something
+ * *above* high_seq is ACKed to prevent triggering such false fast
+ * retransmits. SACK TCP is safe.
+ */
+static bool tcp_false_fast_retrans_possible(const struct tcp_sock *tp,
+					    const u32 snd_una)
+{
+	return tcp_is_reno(tp) && (snd_una == tp->high_seq);
+}
+
 static bool tcp_tsopt_ecr_before(const struct tcp_sock *tp, u32 when)
 {
 	return tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
@@ -2350,10 +2361,10 @@ static bool tcp_try_undo_recovery(struct sock *sk)
 	} else if (tp->rack.reo_wnd_persist) {
 		tp->rack.reo_wnd_persist--;
 	}
-	if (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) {
-		/* Hold old state until something *above* high_seq
-		 * is ACKed. For Reno it is MUST to prevent false
-		 * fast retransmits (RFC2582). SACK TCP is safe. */
+	if (tcp_false_fast_retrans_possible(tp, tp->snd_una)) {
+		/* Hold old state until something *above* high_seq is ACKed
+		 * if false fast retransmit is possible.
+		 */
 		if (!tcp_any_retrans_done(sk))
 			tp->retrans_stamp = 0;
 		return true;
-- 
2.7.4

^ permalink raw reply related

* [PATCH v3 net 0/5] tcp: fixes to non-SACK TCP
From: Ilpo Järvinen @ 2018-03-13 10:25 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Sergei Shtylyov

Here is a series of fixes to issues that occur when SACK is not
enabled for a TCP connection. These are not fixes to just some
remote corner cases of recovery but many/almost all recoveries
without SACK will be impacted by one (or even more than one) of
them. The sender-side changes (1-4) are not mainly, if any, to
improve performance (throughput) but address correctness
(congestion control responses should not get incorrectly
reverted) and burstiness (that may cause additional problems
later as some of the packets in such bursts may get dropped
needing again to resort to loss recovery that is likely
similarly bursty).

v1 -> v2:
- Tried to improve changelogs
- Reworked FRTO undo fix location
- Removed extra parenthesis from EXPR (and while at it, reverse
  also the sides of &&)
- Pass prior_snd_una rather than flag around to avoid moving
  tcp_packet_delayed call
- Pass tp instead of sk. Sk was there only due to a subsequent
  change (that I think is only net-next material) limiting the
  use of the transient state to only RTO recoveries as it won't
  be needed after NewReno recovery that won't do unnecessary
  rexmits like the non-SACK RTO recovery does

v2 -> v3:
- Remove unnecessarily added braces

tcp: feed correct number of pkts acked to cc
tcp: prevent bogus FRTO undos with non-SACK flows
tcp: move false FR condition into
tcp: prevent bogus undos when SACK is not enabled
tcp: send real dupACKs by locking advertized

^ permalink raw reply

* [PATCH v3 net 4/5] tcp: prevent bogus undos when SACK is not enabled
From: Ilpo Järvinen @ 2018-03-13 10:25 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Sergei Shtylyov
In-Reply-To: <1520936711-16784-1-git-send-email-ilpo.jarvinen@helsinki.fi>

When a cumulative ACK lands to high_seq at the end of loss
recovery and SACK is not enabled, the sender needs to avoid
false fast retransmits (RFC6582). The avoidance mechanisms is
implemented by remaining in the loss recovery CA state until
one additional cumulative ACK arrives. During the operation of
this avoidance mechanism, there is internal transient in the
use of state variables which will always trigger a bogus undo.

When we enter to this transient state in tcp_try_undo_recovery,
tcp_any_retrans_done is often (always?) false resulting in
clearing retrans_stamp. On the next cumulative ACK,
tcp_try_undo_recovery again executes because CA state still
remains in the same recovery state and tcp_may_undo will always
return true because tcp_packet_delayed has this condition:
    return !tp->retrans_stamp || ...

Check if the false fast retransmit transient avoidance is in
progress in tcp_packet_delayed to avoid bogus undos. Since snd_una
has advanced already on this ACK but CA state still remains
unchanged (CA state is updated slightly later than undo is
checked), prior_snd_una needs to be passed to tcp_packet_delayed
(instead of tp->snd_una). Passing prior_snd_una around to
the tcp_packet_delayed makes this change look more involved than
it really is.

The additional checks done in this change only affect non-SACK
case, the SACK case remains the same.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c | 42 ++++++++++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 72ecfbb..270aa48 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2241,10 +2241,17 @@ static bool tcp_skb_spurious_retrans(const struct tcp_sock *tp,
 /* Nothing was retransmitted or returned timestamp is less
  * than timestamp of the first retransmission.
  */
-static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
+static inline bool tcp_packet_delayed(const struct tcp_sock *tp,
+				      const u32 prior_snd_una)
 {
-	return !tp->retrans_stamp ||
-	       tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
+	if (!tp->retrans_stamp) {
+		/* Sender will be in a transient state with cleared
+		 * retrans_stamp during false fast retransmit prevention
+		 * mechanism
+		 */
+		return !tcp_false_fast_retrans_possible(tp, prior_snd_una);
+	}
+	return tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
 }
 
 /* Undo procedures. */
@@ -2334,17 +2341,19 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, bool unmark_loss)
 	tp->rack.advanced = 1; /* Force RACK to re-exam losses */
 }
 
-static inline bool tcp_may_undo(const struct tcp_sock *tp)
+static inline bool tcp_may_undo(const struct tcp_sock *tp,
+				const u32 prior_snd_una)
 {
-	return tp->undo_marker && (!tp->undo_retrans || tcp_packet_delayed(tp));
+	return tp->undo_marker &&
+	       (!tp->undo_retrans || tcp_packet_delayed(tp, prior_snd_una));
 }
 
 /* People celebrate: "We love our President!" */
-static bool tcp_try_undo_recovery(struct sock *sk)
+static bool tcp_try_undo_recovery(struct sock *sk, const u32 prior_snd_una)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (tcp_may_undo(tp)) {
+	if (tcp_may_undo(tp, prior_snd_una)) {
 		int mib_idx;
 
 		/* Happy end! We did not retransmit anything
@@ -2391,11 +2400,12 @@ static bool tcp_try_undo_dsack(struct sock *sk)
 }
 
 /* Undo during loss recovery after partial ACK or using F-RTO. */
-static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo)
+static bool tcp_try_undo_loss(struct sock *sk, const u32 prior_snd_una,
+			      bool frto_undo)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (frto_undo || tcp_may_undo(tp)) {
+	if (frto_undo || tcp_may_undo(tp, prior_snd_una)) {
 		tcp_undo_cwnd_reduction(sk, true);
 
 		DBGUNDO(sk, "partial loss");
@@ -2628,13 +2638,13 @@ void tcp_enter_recovery(struct sock *sk, bool ece_ack)
  * recovered or spurious. Otherwise retransmits more on partial ACKs.
  */
 static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
-			     int *rexmit)
+			     int *rexmit, const u32 prior_snd_una)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	bool recovered = !before(tp->snd_una, tp->high_seq);
 
 	if ((flag & FLAG_SND_UNA_ADVANCED) &&
-	    tcp_try_undo_loss(sk, false))
+	    tcp_try_undo_loss(sk, prior_snd_una, false))
 		return;
 
 	if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */
@@ -2642,7 +2652,7 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
 		 * lost, i.e., never-retransmitted data are (s)acked.
 		 */
 		if ((flag & FLAG_ORIG_SACK_ACKED) &&
-		    tcp_try_undo_loss(sk, true))
+		    tcp_try_undo_loss(sk, prior_snd_una, true))
 			return;
 
 		if (after(tp->snd_nxt, tp->high_seq)) {
@@ -2665,7 +2675,7 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
 
 	if (recovered) {
 		/* F-RTO RFC5682 sec 3.1 step 2.a and 1st part of step 3.a */
-		tcp_try_undo_recovery(sk);
+		tcp_try_undo_recovery(sk, prior_snd_una);
 		return;
 	}
 	if (tcp_is_reno(tp)) {
@@ -2685,7 +2695,7 @@ static bool tcp_try_undo_partial(struct sock *sk, u32 prior_snd_una)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (tp->undo_marker && tcp_packet_delayed(tp)) {
+	if (tp->undo_marker && tcp_packet_delayed(tp, prior_snd_una)) {
 		/* Plain luck! Hole if filled with delayed
 		 * packet, rather than with a retransmit. Check reordering.
 		 */
@@ -2788,7 +2798,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
 		case TCP_CA_Recovery:
 			if (tcp_is_reno(tp))
 				tcp_reset_reno_sack(tp);
-			if (tcp_try_undo_recovery(sk))
+			if (tcp_try_undo_recovery(sk, prior_snd_una))
 				return;
 			tcp_end_cwnd_reduction(sk);
 			break;
@@ -2815,7 +2825,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
 		tcp_rack_identify_loss(sk, ack_flag);
 		break;
 	case TCP_CA_Loss:
-		tcp_process_loss(sk, flag, is_dupack, rexmit);
+		tcp_process_loss(sk, flag, is_dupack, rexmit, prior_snd_una);
 		tcp_rack_identify_loss(sk, ack_flag);
 		if (!(icsk->icsk_ca_state == TCP_CA_Open ||
 		      (*ack_flag & FLAG_LOST_RETRANS)))
-- 
2.7.4

^ permalink raw reply related

* [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery
From: Ilpo Järvinen @ 2018-03-13 10:25 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Sergei Shtylyov
In-Reply-To: <1520936711-16784-1-git-send-email-ilpo.jarvinen@helsinki.fi>

A miscalculation for the number of acknowledged packets occurs during
RTO recovery whenever SACK is not enabled and a cumulative ACK covers
any non-retransmitted skbs. The reason is that pkts_acked value
calculated in tcp_clean_rtx_queue is not correct for slow start after
RTO as it may include segments that were not lost and therefore did
not need retransmissions in the slow start following the RTO. Then
tcp_slow_start will add the excess into cwnd bloating it and
triggering a burst.

Instead, we want to pass only the number of retransmitted segments
that were covered by the cumulative ACK (and potentially newly sent
data segments too if the cumulative ACK covers that far).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9a1b3c1..4a26c09 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 	long seq_rtt_us = -1L;
 	long ca_rtt_us = -1L;
 	u32 pkts_acked = 0;
+	u32 rexmit_acked = 0;
+	u32 newdata_acked = 0;
 	u32 last_in_flight = 0;
 	bool rtt_update;
 	int flag = 0;
@@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 		}
 
 		if (unlikely(sacked & TCPCB_RETRANS)) {
-			if (sacked & TCPCB_SACKED_RETRANS)
+			if (sacked & TCPCB_SACKED_RETRANS) {
 				tp->retrans_out -= acked_pcount;
+				rexmit_acked += acked_pcount;
+			}
 			flag |= FLAG_RETRANS_DATA_ACKED;
 		} else if (!(sacked & TCPCB_SACKED_ACKED)) {
 			last_ackt = skb->skb_mstamp;
@@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 				reord = start_seq;
 			if (!after(scb->end_seq, tp->high_seq))
 				flag |= FLAG_ORIG_SACK_ACKED;
+			else
+				newdata_acked += acked_pcount;
 		}
 
 		if (sacked & TCPCB_SACKED_ACKED) {
@@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 		}
 
 		if (tcp_is_reno(tp)) {
+			/* Due to discontinuity on RTO in the artificial
+			 * sacked_out calculations, TCP must restrict
+			 * pkts_acked without SACK to rexmits and new data
+			 * segments
+			 */
+			if (icsk->icsk_ca_state == TCP_CA_Loss)
+				pkts_acked = rexmit_acked + newdata_acked;
+
 			tcp_remove_reno_sacks(sk, pkts_acked);
 		} else {
 			int delta;
-- 
2.7.4

^ permalink raw reply related

* [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
From: Ilpo Järvinen @ 2018-03-13 10:25 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Sergei Shtylyov
In-Reply-To: <1520936711-16784-1-git-send-email-ilpo.jarvinen@helsinki.fi>

If SACK is not enabled and the first cumulative ACK after the RTO
retransmission covers more than the retransmitted skb, a spurious
FRTO undo will trigger (assuming FRTO is enabled for that RTO).
The reason is that any non-retransmitted segment acknowledged will
set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
no indication that it would have been delivered for real (the
scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
case so the check for that bit won't help like it does with SACK).
Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
in tcp_process_loss.

We need to use more strict condition for non-SACK case and check
that none of the cumulatively ACKed segments were retransmitted
to prove that progress is due to original transmissions. Only then
keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
non-SACK case.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4a26c09..c60745c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3166,6 +3166,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 				pkts_acked = rexmit_acked + newdata_acked;
 
 			tcp_remove_reno_sacks(sk, pkts_acked);
+
+			/* If any of the cumulatively ACKed segments was
+			 * retransmitted, non-SACK case cannot confirm that
+			 * progress was due to original transmission due to
+			 * lack of TCPCB_SACKED_ACKED bits even if some of
+			 * the packets may have been never retransmitted.
+			 */
+			if (flag & FLAG_RETRANS_DATA_ACKED)
+				flag &= ~FLAG_ORIG_SACK_ACKED;
 		} else {
 			int delta;
 
-- 
2.7.4

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox