Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH bpf-next v10 2/4] bpf: new helper to obtain namespace data from current task New bpf helper bpf_get_current_pidns_info.
From: Yonghong Song @ 2019-09-13 17:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, Carlos Antonio Neira Bustos, netdev@vger.kernel.org,
	brouer@redhat.com, bpf@vger.kernel.org
In-Reply-To: <87a7b8gmj8.fsf@x220.int.ebiederm.org>



On 9/13/19 5:59 PM, Eric W. Biederman wrote:
> Yonghong Song <yhs@fb.com> writes:
> 
>> On 9/11/19 9:16 AM, Eric W. Biederman wrote:
>>> Al Viro <viro@zeniv.linux.org.uk> writes:
>>>
>>>> On Tue, Sep 10, 2019 at 10:35:09PM +0000, Yonghong Song wrote:
>>>>>
>>>>> Carlos,
>>>>>
>>>>> Discussed with Eric today for what is the best way to get
>>>>> the device number for a namespace. The following patch seems
>>>>> a reasonable start although Eric would like to see
>>>>> how the helper is used in order to decide whether the
>>>>> interface looks right.
>>>>>
>>>>> commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2)
>>>>> Author: Yonghong Song <yhs@fb.com>
>>>>> Date:   Mon Sep 9 21:50:51 2019 -0700
>>>>>
>>>>>        nsfs: add an interface function ns_get_inum_dev()
>>>>>
>>>>>        This patch added an interface function
>>>>>        ns_get_inum_dev(). Given a ns_common structure,
>>>>>        the function returns the inode and device
>>>>>        numbers. The function will be used later
>>>>>        by a newly added bpf helper.
>>>>>
>>>>>        Signed-off-by: Yonghong Song <yhs@fb.com>
>>>>>
>>>>> diff --git a/fs/nsfs.c b/fs/nsfs.c
>>>>> index a0431642c6b5..a603c6fc3f54 100644
>>>>> --- a/fs/nsfs.c
>>>>> +++ b/fs/nsfs.c
>>>>> @@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd)
>>>>>            return ERR_PTR(-EINVAL);
>>>>>     }
>>>>>
>>>>> +/* Get the device number for the current task pidns.
>>>>> + */
>>>>> +void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev)
>>>>> +{
>>>>> +       *inum = ns->inum;
>>>>> +       *dev = nsfs_mnt->mnt_sb->s_dev;
>>>>> +}
>>>>
>>>> Umm...  Where would it get the device number once we get (hell knows
>>>> what for) multiple nsfs instances?  I still don't understand what
>>>> would that be about, TBH...  Is it really per-userns?  Or something
>>>> else entirely?  Eric, could you give some context?
>>>
>>> My goal is not to paint things into a corner, with future changes.
>>> Right now it is possible to stat a namespace file descriptor and
>>> get a device and inode number.  Then compare that.
>>>
>>> I don't want people using the inode number in nsfd as some magic
>>> namespace id.
>>>
>>> We have had times in the past where there was more than one superblock
>>> and thus more than one device number.  Further if userspace ever uses
>>> this heavily there may be times in the future where for
>>> checkpoint/restart purposes we will want multiple nsfd's so we can
>>> preserve the inode number accross a migration.
>>>
>>> Realistically there will probably just some kind of hotplug notification
>>> to userspace to say we have hotplugged your operatining system as
>>> a migration notification.
>>>
>>> Now the halway discussion did not quite capture everything I was trying
>>> to say but it at least got to the right ballpark.
>>>
>>> The helper in fs/nsfs.c should be:
>>>
>>> bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino)
>>> {
>>>           return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev));
>>> }
>>>
>>> That way if/when there are multiple inodes identifying the same
>>> namespace the bpf programs don't need to change.
>>
>> Thanks, Eric. This is indeed better. The bpf helper should focus
>> on comparing dev/ino, instead of return the dev/ino to bpf program.
>>
>> So overall, nsfs related change will look like:
>>
>> diff --git a/fs/nsfs.c b/fs/nsfs.c
>> index a0431642c6b5..7e78d89c2172 100644
>> --- a/fs/nsfs.c
>> +++ b/fs/nsfs.c
>> @@ -245,6 +245,11 @@ struct file *proc_ns_fget(int fd)
>>           return ERR_PTR(-EINVAL);
>>    }
>>
>> +bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino)
>> +{
>> +       return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev));
>> +}
>> +
>>    static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry)
>>    {
>>           struct inode *inode = d_inode(dentry);
>> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
>> index d31cb6215905..79639807e960 100644
>> --- a/include/linux/proc_ns.h
>> +++ b/include/linux/proc_ns.h
>> @@ -81,6 +81,7 @@ extern void *ns_get_path(struct path *path, struct
>> task_struct *task,
>>    typedef struct ns_common *ns_get_path_helper_t(void *);
>>    extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t
>> ns_get_cb,
>>                               void *private_data);
>> +extern bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino);
>>
>>    extern int ns_get_name(char *buf, size_t size, struct task_struct *task,
>>                           const struct proc_ns_operations *ns_ops);
>>
>>>
>>> Up farther in the stack it should be something like:
>>>
>>>> BPF_CALL_2(bpf_current_pidns_match, dev_t *dev, ino_t *ino)
>>>> {
>>>>           return ns_match(&task_active_pid_ns(current)->ns, *dev, *ino);
>>>> }
>>>>
>>>> const struct bpf_func_proto bpf_current_pidns_match_proto = {
>>>> 	.func		= bpf_current_pins_match,
>>>> 	.gpl_only	= true,
>>>> 	.ret_type	= RET_INTEGER
>>>> 	.arg1_type	= ARG_PTR_TO_DEVICE_NUMBER,
>>>> 	.arg2_type	= ARG_PTR_TO_INODE_NUMBER,
>>>> };
>>>
>>> That allows comparing what the bpf came up with with whatever value
>>> userspace generated by stating the file descriptor.
>>>
>>>
>>> That is the least bad suggestion I currently have for that
>>> functionality.  It really would be better to not have that filter in the
>>> bpf program itself but in the infrastructure that binds a program to a
>>> set of tasks.
>>>
>>> The problem with this approach is whatever device/inode you have when
>>> the namespace they refer to exits there is the possibility that the
>>> inode will be reused.  So your filter will eventually start matching on
>>> the wrong thing.
>>
>> I come up with a differeent helper definition, which is much more
>> similar to existing bpf_get_current_pid_tgid() and helper definition
>> much more conforms to bpf convention.
> 
> There is a problem with your bpf_get_ns_current_pid_tgid below.
> The inode number is a 64bit number.  To be nice to old userspace
> we try and not use 64bit inode numbers where they are not required
> but in this case we should not use an interface that assumes inode
> numbers are 32bit.  They just aren't.
> 
> I didn't know how to express that in the bpf proto so I did what
> I could.

We can change inum to u64. Just change the prototype like
`u64, inum` should be good.

> 
> The alternative to this would be to simply restrict this
> helper to bpf programs registered in the initial pid namespace.
> At which point you could just ensure all the numbers are in
> the global pid namespace.
> 
> Hmm.  Looing at the comment below I am confused.
> 
>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> index 5e28718928ca..bc26903c80c7 100644
>> --- a/kernel/bpf/helpers.c
>> +++ b/kernel/bpf/helpers.c
>> @@ -11,6 +11,8 @@
>>    #include <linux/uidgid.h>
>>    #include <linux/filter.h>
>>    #include <linux/ctype.h>
>> +#include <linux/pid_namespace.h>
>> +#include <linux/proc_ns.h>
>>
>>    #include "../../lib/kstrtox.h"
>>
>> @@ -487,3 +489,33 @@ const struct bpf_func_proto bpf_strtoul_proto = {
>>           .arg4_type      = ARG_PTR_TO_LONG,
>>    };
>>    #endif
>> +
>> +BPF_CALL_2(bpf_get_ns_current_pid_tgid, u32, dev, u32, inum)
>> +{
>> +       struct task_struct *task = current;
>> +       struct pid_namespace *pidns;
>> +       pid_t pid, tgid;
>> +
>> +       if (unlikely(!task))
>> +               return -EINVAL;
>> +
>> +
>> +       pidns = task_active_pid_ns(task);
>> +       if (unlikely(!pidns))
>> +               return -ENOENT;
>> +
>> +       if (!ns_match(&pidns->ns, (dev_t)dev, inum))
>> +               return -EINVAL;
>> +
>> +       pid = task_pid_nr_ns(task, pidns);
>> +       tgid = task_tgid_nr_ns(task, pidns);
>> +
>> +       return (u64) tgid << 32 | pid;
>> +}
>> +
>> +const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto = {
>> +       .func           = bpf_get_ns_current_pid_tgid,
>> +       .gpl_only       = false,
>> +       .ret_type       = RET_INTEGER,
>> +       .arg1_type      = ARG_ANYTHING,
>> +       .arg2_type      = ARG_ANYTHING,
>> +};
>>
>> Existing usage of bpf_get_current_pid_tgid() can be converted
>> to bpf_get_ns_current_pid_tgid() if ns dev/inode number
>> is supplied. For bpf_get_ns_current_pid_tgid(), checking
>> return value ( < 0 or not) is needed.
> 
> Ok.  I missed something.
> 
> What is the problem bpf_get_ns_current_pid_tgid trying to solve
> that bpf_get_current_pid_tgid does not solve.
> 
> I would think since much of tracing ebpf is fundamentally restricted
> to the global root user.  Limiting the ebpf programs to the initial
> pid namespace should not be a problem.
> 
> So I don't understand why you need to specify the namespace in
> the ebpf call.
> 
> Can someone give me a clue what problem is being sovled by this
> new call?

We want to run the bpf program inside the namespace.
There are parallel work to introduce CAP_BPF and CAP_TRACING
(https://lore.kernel.org/bpf/20190906231053.1276792-1-ast@kernel.org/T/#t)
to facilitate this.

We have users requesting to use bcc tools inside the containers.
https://github.com/iovisor/bcc/issues/1875
https://github.com/iovisor/bcc/issues/1366
https://github.com/iovisor/bcc/issues/1329
https://github.com/iovisor/bcc/issues/1532
...

Yes, this may require granting `root` privilege to containers.
This can be done outside container as well. But it is just a
big usability improvement if people can do inside the containers.

In addition, we have requests below and internal requests as well
to filter based on containers.
https://github.com/iovisor/bcc/issues/1119
The new helper permits at root that you can filter based on
a particular container (not container id, but dev/inode should
identifying one).

Hope this clarify why this helper is useful for tracing community.

> 
> Eric
> 

^ permalink raw reply

* Re: VRF Issue Since kernel 5
From: David Ahern @ 2019-09-13 17:41 UTC (permalink / raw)
  To: Gowen, netdev@vger.kernel.org
In-Reply-To: <CWLP265MB15547011D9510DEA6475B469FDB00@CWLP265MB1554.GBRP265.PROD.OUTLOOK.COM>

[ FYI: you should not 'top post' in responses to netdev; rather comment
inline with the previous message ]

On 9/12/19 7:50 AM, Gowen wrote:
> 
> Hi David -thanks for getting back to me
> 
> 
> 
> The DNS servers are 10.24.65.203 or 10.24.64.203 which you want to go
> 
> out mgmt-vrf. correct? No - 10.24.65.203 10.25.65.203, so should hit the route leak rule as below (if I've put the 10.24.64.0/24 subnet anywhere it is a typo)
> 
> vmAdmin@NETM06:~$ ip ro get 10.24.65.203 fibmatch
> 10.24.65.0/24 via 10.24.12.1 dev eth0
> 
> 
> I've added the 127/8 route - no difference.

you mean address on mgmt-vrf right?

> 
> The reason for what you might think is an odd design is that I wanted any non-VRF aware users to be able to come in and run all commands in default context without issue, while production and mgmt traffic was separated still
> 
> DNS is now working as long as /etc/resolv.conf is populated with my DNS servers - a lot of people would be using this on Azure which uses netplan, so they'll have the same issue, is there documentation I could update or raise a bug to check the systemd-resolve servers as well?

That is going to be the fundamental system problem: handing DNS queries
off to systemd is losing the VRF context of the process doing the DNS
query.

^ permalink raw reply

* Re: [PATCH net] udp: correct reuseport selection with connected sockets
From: Craig Gallek @ 2019-09-13 18:40 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: netdev, David Miller, Eric Dumazet, zabele, Paolo Abeni,
	mark.keaton, Willem de Bruijn
In-Reply-To: <20190913011639.55895-1-willemdebruijn.kernel@gmail.com>

On Thu, Sep 12, 2019 at 9:16 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> From: Willem de Bruijn <willemb@google.com>
>
> UDP reuseport groups can hold a mix unconnected and connected sockets.
> Ensure that connections only receive all traffic to their 4-tuple.
>
> Fast reuseport returns on the first reuseport match on the assumption
> that all matches are equal. Only if connections are present, return to
> the previous behavior of scoring all sockets.
>
> Record if connections are present and if so (1) treat such connected
> sockets as an independent match from the group, (2) only return
> 2-tuple matches from reuseport and (3) do not return on the first
> 2-tuple reuseport match to allow for a higher scoring match later.
>
> New field has_conns is set without locks. No other fields in the
> bitmap are modified at runtime and the field is only ever set
> unconditionally, so an RMW cannot miss a change.
>
> Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
> Link: http://lkml.kernel.org/r/CA+FuTSfRP09aJNYRt04SS6qj22ViiOEWaWmLAwX0psk8-PGNxw@mail.gmail.com
> Signed-off-by: Willem de Bruijn <willemb@google.com>

Slick, no additional cost for the BPF case and just a single branch
for the unconnected udp, tcp listener case!

Acked-by: Craig Gallek <kraig@google.com>

^ permalink raw reply

* Re: [PATCH bpf] bpf: respect CAP_IPC_LOCK in RLIMIT_MEMLOCK check
From: Yonghong Song @ 2019-09-13 18:48 UTC (permalink / raw)
  To: Christian Barcenas, Alexei Starovoitov, Daniel Borkmann,
	netdev@vger.kernel.org
  Cc: Martin Lau, Song Liu, bpf@vger.kernel.org
In-Reply-To: <20190911181816.89874-1-christian@cbarcenas.com>



On 9/11/19 7:18 PM, Christian Barcenas wrote:
> A process can lock memory addresses into physical RAM explicitly
> (via mlock, mlockall, shmctl, etc.) or implicitly (via VFIO,
> perf ring-buffers, bpf maps, etc.), subject to RLIMIT_MEMLOCK limits.
> 
> CAP_IPC_LOCK allows a process to exceed these limits, and throughout
> the kernel this capability is checked before allowing/denying an attempt
> to lock memory regions into RAM.
> 
> Because bpf locks its programs and maps into RAM, it should respect
> CAP_IPC_LOCK. Previously, bpf would return EPERM when RLIMIT_MEMLOCK was
> exceeded by a privileged process, which is contrary to documented
> RLIMIT_MEMLOCK+CAP_IPC_LOCK behavior.
> 
> Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
> Signed-off-by: Christian Barcenas <christian@cbarcenas.com>

Acked-by: Yonghong Song <yhs@fb.com>

> ---
>   kernel/bpf/syscall.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 272071e9112f..e551961f364b 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -183,8 +183,9 @@ void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
>   static int bpf_charge_memlock(struct user_struct *user, u32 pages)
>   {
>   	unsigned long memlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +	unsigned long locked = atomic_long_add_return(pages, &user->locked_vm);
>   
> -	if (atomic_long_add_return(pages, &user->locked_vm) > memlock_limit) {
> +	if (locked > memlock_limit && !capable(CAP_IPC_LOCK)) {
>   		atomic_long_sub(pages, &user->locked_vm);
>   		return -EPERM;
>   	}
> @@ -1231,7 +1232,7 @@ int __bpf_prog_charge(struct user_struct *user, u32 pages)
>   
>   	if (user) {
>   		user_bufs = atomic_long_add_return(pages, &user->locked_vm);
> -		if (user_bufs > memlock_limit) {
> +		if (user_bufs > memlock_limit && !capable(CAP_IPC_LOCK)) {
>   			atomic_long_sub(pages, &user->locked_vm);
>   			return -EPERM;
>   		}
> 

^ permalink raw reply

* Re: [PATCH] libbpf: Don't error out if getsockopt() fails for XDP_OPTIONS
From: Yonghong Song @ 2019-09-13 18:53 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, netdev@vger.kernel.org, bpf@vger.kernel.org,
	maximmi@mellanox.com
In-Reply-To: <87lfuxul2b.fsf@toke.dk>



On 9/10/19 12:06 AM, Toke Høiland-Jørgensen wrote:
> Yonghong Song <yhs@fb.com> writes:
> 
>> On 9/9/19 10:46 AM, Toke Høiland-Jørgensen wrote:
>>> The xsk_socket__create() function fails and returns an error if it cannot
>>> get the XDP_OPTIONS through getsockopt(). However, support for XDP_OPTIONS
>>> was not added until kernel 5.3, so this means that creating XSK sockets
>>> always fails on older kernels.
>>>
>>> Since the option is just used to set the zero-copy flag in the xsk struct,
>>> there really is no need to error out if the getsockopt() call fails.
>>>
>>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>>> ---
>>>    tools/lib/bpf/xsk.c | 8 ++------
>>>    1 file changed, 2 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
>>> index 680e63066cf3..598e487d9ce8 100644
>>> --- a/tools/lib/bpf/xsk.c
>>> +++ b/tools/lib/bpf/xsk.c
>>> @@ -603,12 +603,8 @@ int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname,
>>>    
>>>    	optlen = sizeof(opts);
>>>    	err = getsockopt(xsk->fd, SOL_XDP, XDP_OPTIONS, &opts, &optlen);
>>> -	if (err) {
>>> -		err = -errno;
>>> -		goto out_mmap_tx;
>>> -	}
>>> -
>>> -	xsk->zc = opts.flags & XDP_OPTIONS_ZEROCOPY;
>>> +	if (!err)
>>> +		xsk->zc = opts.flags & XDP_OPTIONS_ZEROCOPY;
>>>    
>>>    	if (!(xsk->config.libbpf_flags & XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD)) {
>>>    		err = xsk_setup_xdp_prog(xsk);
>>
>> Since 'zc' is not used by anybody, maybe all codes 'zc' related can be
>> removed? It can be added back back once there is an interface to use
>> 'zc'?
> 
> Fine with me; up to the maintainers what they prefer, I guess? :)

Maxim,

Your originally introduced `'zc' and getting XDP_OPTIONS.
What is your opinion of how to deal with the unused xsk->zc?

> 
> -Toke
> 

^ permalink raw reply

* [PATCH v4 2/2] tcp: Add snd_wnd to TCP_INFO
From: Thomas Higdon @ 2019-09-13 19:36 UTC (permalink / raw)
  To: netdev@vger.kernel.org
  Cc: Jonathan Lemon, Dave Jones, Eric Dumazet, Neal Cardwell,
	Dave Taht, Yuchung Cheng, Soheil Hassas Yeganeh
In-Reply-To: <20190913193629.55201-1-tph@fb.com>

Neal Cardwell mentioned that snd_wnd would be useful for diagnosing TCP
performance problems --
> (1) Usually when we're diagnosing TCP performance problems, we do so
> from the sender, since the sender makes most of the
> performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc).
> From the sender-side the thing that would be most useful is to see
> tp->snd_wnd, the receive window that the receiver has advertised to
> the sender.

This serves the purpose of adding an additional __u32 to avoid the
would-be hole caused by the addition of the tcpi_rcvi_ooopack field.

Signed-off-by: Thomas Higdon <tph@fb.com>
---
changes from v3:
 - changed from rcv_wnd to snd_wnd

 include/uapi/linux/tcp.h | 1 +
 net/ipv4/tcp.c           | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 20237987ccc8..240654f22d98 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -272,6 +272,7 @@ struct tcp_info {
 	__u32	tcpi_reord_seen;     /* reordering events seen */
 
 	__u32	tcpi_rcv_ooopack;    /* Out-of-order packets received */
+	__u32	tcpi_snd_wnd;        /* Remote peer's advertised recv window size */
 };
 
 /* netlink attributes types for SCM_TIMESTAMPING_OPT_STATS */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4cf58208270e..79c325a07ba5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3297,6 +3297,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
 	info->tcpi_dsack_dups = tp->dsack_dups;
 	info->tcpi_reord_seen = tp->reord_seen;
 	info->tcpi_rcv_ooopack = tp->rcv_ooopack;
+	info->tcpi_snd_wnd = tp->snd_wnd;
 	unlock_sock_fast(sk, slow);
 }
 EXPORT_SYMBOL_GPL(tcp_get_info);
-- 
2.17.1


^ permalink raw reply related

* [PATCH v4 1/2] tcp: Add TCP_INFO counter for packets received out-of-order
From: Thomas Higdon @ 2019-09-13 19:36 UTC (permalink / raw)
  To: netdev@vger.kernel.org
  Cc: Jonathan Lemon, Dave Jones, Eric Dumazet, Neal Cardwell,
	Dave Taht, Yuchung Cheng, Soheil Hassas Yeganeh

For receive-heavy cases on the server-side, we want to track the
connection quality for individual client IPs. This counter, similar to
the existing system-wide TCPOFOQueue counter in /proc/net/netstat,
tracks out-of-order packet reception. By providing this counter in
TCP_INFO, it will allow understanding to what degree receive-heavy
sockets are experiencing out-of-order delivery and packet drops
indicating congestion.

Please note that this is similar to the counter in NetBSD TCP_INFO, and
has the same name.

Signed-off-by: Thomas Higdon <tph@fb.com>
---

no changes from v3

 include/linux/tcp.h      | 2 ++
 include/uapi/linux/tcp.h | 2 ++
 net/ipv4/tcp.c           | 2 ++
 net/ipv4/tcp_input.c     | 1 +
 4 files changed, 7 insertions(+)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index f3a85a7fb4b1..a01dc78218f1 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -393,6 +393,8 @@ struct tcp_sock {
 	 */
 	struct request_sock *fastopen_rsk;
 	u32	*saved_syn;
+
+	u32 rcv_ooopack; /* Received out-of-order packets, for tcpinfo */
 };
 
 enum tsq_enum {
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index b3564f85a762..20237987ccc8 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -270,6 +270,8 @@ struct tcp_info {
 	__u64	tcpi_bytes_retrans;  /* RFC4898 tcpEStatsPerfOctetsRetrans */
 	__u32	tcpi_dsack_dups;     /* RFC4898 tcpEStatsStackDSACKDups */
 	__u32	tcpi_reord_seen;     /* reordering events seen */
+
+	__u32	tcpi_rcv_ooopack;    /* Out-of-order packets received */
 };
 
 /* netlink attributes types for SCM_TIMESTAMPING_OPT_STATS */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 94df48bcecc2..4cf58208270e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2653,6 +2653,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 	tp->rx_opt.saw_tstamp = 0;
 	tp->rx_opt.dsack = 0;
 	tp->rx_opt.num_sacks = 0;
+	tp->rcv_ooopack = 0;
 
 
 	/* Clean up fastopen related fields */
@@ -3295,6 +3296,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
 	info->tcpi_bytes_retrans = tp->bytes_retrans;
 	info->tcpi_dsack_dups = tp->dsack_dups;
 	info->tcpi_reord_seen = tp->reord_seen;
+	info->tcpi_rcv_ooopack = tp->rcv_ooopack;
 	unlock_sock_fast(sk, slow);
 }
 EXPORT_SYMBOL_GPL(tcp_get_info);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 706cbb3b2986..2ef333354026 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4555,6 +4555,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	tp->pred_flags = 0;
 	inet_csk_schedule_ack(sk);
 
+	tp->rcv_ooopack += max_t(u16, 1, skb_shinfo(skb)->gso_segs);
 	NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPOFOQUEUE);
 	seq = TCP_SKB_CB(skb)->seq;
 	end_seq = TCP_SKB_CB(skb)->end_seq;
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH bpf-next 01/11] samples: bpf: makefile: fix HDR_PROBE "echo"
From: Ivan Khoronzhuk @ 2019-09-13 19:56 UTC (permalink / raw)
  To: Sergei Shtylyov
  Cc: ast, daniel, yhs, davem, jakub.kicinski, hawk, john.fastabend,
	linux-kernel, netdev, bpf, clang-built-linux
In-Reply-To: <4251fe86-ccc7-f1ce-e954-2d488d2a95a9@cogentembedded.com>

On Wed, Sep 11, 2019 at 02:02:11PM +0300, Sergei Shtylyov wrote:
>On 10.09.2019 17:54, Ivan Khoronzhuk wrote:
>
>>>Hello!
>>>
>>>On 10.09.2019 13:38, Ivan Khoronzhuk wrote:
>>>
>>>>echo should be replaced on echo -e to handle \n correctly, but instead,
>>>
>>> s/on/with/?
>>s/echo/printf/ instead of s/echo/echo -e/
>
>   I only pointed that 'on' is incorrect there. You replace something 
>/with/ something other...
>
>>
>>printf looks better.
>>
>>>
>>>>replace it on printf as some systems can't handle echo -e.
>>>
>>>  Likewise?
>
>   Same grammatical mistake.
Oh, will correct it next v.


-- 
Regards,
Ivan Khoronzhuk

^ permalink raw reply

* Re: [PATCH net-next] ip: support SO_MARK cmsg
From: David Miller @ 2019-09-13 19:44 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: netdev, willemb
In-Reply-To: <20190911195051.166062-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Wed, 11 Sep 2019 15:50:51 -0400

> From: Willem de Bruijn <willemb@google.com>
> 
> Enable setting skb->mark for UDP and RAW sockets using cmsg.
> 
> This is analogous to existing support for TOS, TTL, txtime, etc.
> 
> Packet sockets already support this as of commit c7d39e32632e
> ("packet: support per-packet fwmark for af_packet sendmsg").
> 
> Similar to other fields, implement by
> 1. initialize the sockcm_cookie.mark from socket option sk_mark
> 2. optionally overwrite this in ip_cmsg_send/ip6_datagram_send_ctl
> 3. initialize inet_cork.mark from sockcm_cookie.mark
> 4. initialize each (usually just one) skb->mark from inet_cork.mark
> 
> Step 1 is handled in one location for most protocols by ipcm_init_sk
> as of commit 351782067b6b ("ipv4: ipcm_cookie initializers").
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>

Looks good, applied.

^ permalink raw reply

* Re: [PATCH][PATCH net-next] hv_sock: Add the support of hibernation
From: David Miller @ 2019-09-13 20:03 UTC (permalink / raw)
  To: decui
  Cc: kys, haiyangz, sthemmin, sashal, linux-hyperv, netdev,
	linux-kernel, mikelley
In-Reply-To: <1568245042-66967-1-git-send-email-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com>
Date: Wed, 11 Sep 2019 23:37:27 +0000

> Add the necessary dummy callbacks for hibernation.
> 
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
> This patch is basically a pure Hyper-V specific change and it has a
> build dependency on the commit 271b2224d42f ("Drivers: hv: vmbus: Implement
> suspend/resume for VSC drivers for hibernation"), which is on Sasha Levin's
> Hyper-V tree's hyperv-next branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-next
> 
> I request this patch should go through Sasha's tree rather than the
> net-next tree.

That's fine:

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply

* Re: [PATCH v2 net 0/3] fix memory leak for sctp_do_bind
From: David Miller @ 2019-09-13 20:06 UTC (permalink / raw)
  To: maowenan
  Cc: vyasevich, nhorman, marcelo.leitner, linux-sctp, netdev,
	linux-kernel, kernel-janitors
In-Reply-To: <20190912040219.67517-1-maowenan@huawei.com>

From: Mao Wenan <maowenan@huawei.com>
Date: Thu, 12 Sep 2019 12:02:16 +0800

> First two patches are to do cleanup, remove redundant assignment,
> and change return type of sctp_get_port_local.
> Third patch is to fix memory leak for sctp_do_bind if failed
> to bind address.
> 
> ---
>  v2: add one patch to change return type of sctp_get_port_local.

Series applied with Fixes: tag removed from patch #1.

Thanks.

^ permalink raw reply

* [PATCH v4.14-stable 0/2] Fixes to commit fdfc5c8594c2 (tcp: remove empty skb from write queue in error cases)
From: Christoph Paasch @ 2019-09-13 20:08 UTC (permalink / raw)
  To: stable, netdev, gregkh, Sasha Levin; +Cc: David Miller, Eric Dumazet

The above referenced commit has problems on older non-rbTree kernels.

AFAICS, the commit has only been backported to 4.14 up to now, but the
commit that fdfc5c8594c2 is fixing (namely ce5ec440994b ("tcp: ensure epoll
edge trigger wakeup when write queue is empty"), is in v4.2.

Christoph Paasch (2):
  tcp: Reset send_head when removing skb from write-queue
  tcp: Don't dequeue SYN/FIN-segments from write-queue

 net/ipv4/tcp.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

-- 
2.21.0

^ permalink raw reply

* Re: [PATCH net,stable] cdc_ether: fix rndis support for Mediatek based smartphones
From: David Miller @ 2019-09-13 20:09 UTC (permalink / raw)
  To: bjorn; +Cc: netdev, oliver, linux-usb, larsm17
In-Reply-To: <20190912084200.6359-1-bjorn@mork.no>

From: Bjørn Mork <bjorn@mork.no>
Date: Thu, 12 Sep 2019 10:42:00 +0200

> A Mediatek based smartphone owner reports problems with USB
> tethering in Linux.  The verbose USB listing shows a rndis_host
> interface pair (e0/01/03 + 10/00/00), but the driver fails to
> bind with
> 
> [  355.960428] usb 1-4: bad CDC descriptors
> 
> The problem is a failsafe test intended to filter out ACM serial
> functions using the same 02/02/ff class/subclass/protocol as RNDIS.
> The serial functions are recognized by their non-zero bmCapabilities.
> 
> No RNDIS function with non-zero bmCapabilities were known at the time
> this failsafe was added. But it turns out that some Wireless class
> RNDIS functions are using the bmCapabilities field. These functions
> are uniquely identified as RNDIS by their class/subclass/protocol, so
> the failing test can safely be disabled.  The same applies to the two
> types of Misc class RNDIS functions.
> 
> Applying the failsafe to Communication class functions only retains
> the original functionality, and fixes the problem for the Mediatek based
> smartphone.
> 
> Tow examples of CDC functional descriptors with non-zero bmCapabilities
> from Wireless class RNDIS functions are:
 ...
> The Mediatek example is believed to apply to most smartphones with
> Mediatek firmware.  The ZTE example is most likely also part of a larger
> family of devices/firmwares.
> 
> Suggested-by: Lars Melin <larsm17@gmail.com>
> Signed-off-by: Bjørn Mork <bjorn@mork.no>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [patch net-next v3 0/3] net: devlink: move reload fail indication to devlink core and expose to user
From: David Miller @ 2019-09-13 20:11 UTC (permalink / raw)
  To: jiri; +Cc: netdev, idosch, dsahern, jakub.kicinski, tariqt, mlxsw
In-Reply-To: <20190912084946.7468-1-jiri@resnulli.us>

From: Jiri Pirko <jiri@resnulli.us>
Date: Thu, 12 Sep 2019 10:49:43 +0200

> From: Jiri Pirko <jiri@mellanox.com>
> 
> First two patches are dependencies of the last one. That moves devlink
> reload failure indication to the devlink code, so the drivers do not
> have to track it themselves. Currently it is only mlxsw, but I will send
> a follow-up patchset that introduces this in netdevsim too.

Series applied.

^ permalink raw reply

* RE: [PATCH][PATCH net-next] hv_sock: Add the support of hibernation
From: Dexuan Cui @ 2019-09-13 20:13 UTC (permalink / raw)
  To: David Miller, sashal@kernel.org
  Cc: KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Michael Kelley
In-Reply-To: <20190913.210343.724088723062134961.davem@davemloft.net>

> From: David Miller <davem@davemloft.net>
> Sent: Friday, September 13, 2019 1:04 PM
> 
> From: Dexuan Cui <decui@microsoft.com>
> Date: Wed, 11 Sep 2019 23:37:27 +0000
> > I request this patch should go through Sasha's tree rather than the
> > net-next tree.
> 
> That's fine:
> 
> Acked-by: David S. Miller <davem@davemloft.net>

Thanks, David!

@Sasha: I found a few typos in my comment below. I'll post a v2.

> > +/* hv_sock connections can not persist across hibernation, and all the hv_sock
> >  + * channels are forceed to be rescinded before hibernation: see

forceed -> forced

> >  + * are only needed because hibernation requires that every device's driver

every device's driver -> every vmbus device's driver

Thanks,
-- Dexuan

^ permalink raw reply

* Re: [PATCH bpf-next 02/11] samples: bpf: makefile: fix cookie_uid_helper_example obj build
From: Yonghong Song @ 2019-09-13 20:48 UTC (permalink / raw)
  To: Ivan Khoronzhuk, ast@kernel.org, daniel@iogearbox.net,
	davem@davemloft.net, jakub.kicinski@netronome.com,
	hawk@kernel.org, john.fastabend@gmail.com
  Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	bpf@vger.kernel.org, clang-built-linux@googlegroups.com
In-Reply-To: <20190910103830.20794-3-ivan.khoronzhuk@linaro.org>



On 9/10/19 11:38 AM, Ivan Khoronzhuk wrote:
> Don't list userspace "cookie_uid_helper_example" object in list for
> bpf objects.
> 
> per_socket_stats_example-opjs is used to list additional dependencies

s/opjs/objs

> for user space binary from hostprogs-y list. Kbuild system creates
> rules for objects listed this way anyway and no need to worry about
> this. Despite on it, the samples bpf uses logic that hostporgs-y are
> build for userspace with includes needed for this, but "always"
> target, if it's not in hostprog-y list, uses CLANG-bpf rule and is
> intended to create bpf obj but not arch obj and uses only kernel
> includes for that. So correct it, as it breaks cross-compiling at
> least.

The above description is a little tricky to understand.
Maybe something like:
    'always' target is for bpf programs.
    'cookie_uid_helper_example.o' is a user space ELF file, and
    covered by rule `per_socket_stats_example`.
    Let us remove `always += cookie_uid_helper_example.o`,
    which avoids breaking cross compilation due to
    mismatched includes.

> 
> Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
> ---
>   samples/bpf/Makefile | 1 -
>   1 file changed, 1 deletion(-)
> 
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index f50ca852c2a8..43dee90dffa4 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -145,7 +145,6 @@ always += sampleip_kern.o
>   always += lwt_len_hist_kern.o
>   always += xdp_tx_iptunnel_kern.o
>   always += test_map_in_map_kern.o
> -always += cookie_uid_helper_example.o
>   always += tcp_synrto_kern.o
>   always += tcp_rwnd_kern.o
>   always += tcp_bufs_kern.o
> 

^ permalink raw reply

* [GIT] Networking
From: David Miller @ 2019-09-13 20:55 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


1) Don't corrupt xfrm_interface parms before validation, from Nicolas
   Dichtel.

2) Revert use of usb-wakeup in btusb, from Mario Limonciello.

3) Block ipv6 packets in bridge netfilter if ipv6 is disabled,
   from Leonardo Bras.

4) IPS_OFFLOAD not honored in ctnetlink, from Pablo Neira Ayuso.

5) Missing ULP check in sock_map, from John Fastabend.

6) Fix receive statistic handling in forcedeth, from Zhu Yanjun.

7) Fix length of SKB allocated in 6pack driver, from Christophe
   JAILLET.

8) ip6_route_info_create() returns an error pointer, not NULL.
   From Maciej Żenczykowski.

9) Only add RDS sock to the hashes after rs_transport is set, from
   Ka-Cheong Poon.

10) Don't double clean TX descriptors in ixgbe, from Ilya Maximets.

11) Presence of transmit IPSEC offload in an SKB is not tested for
    correctly in ixgbe and ixgbevf.  From Steffen Klassert and
    Jeff Kirsher.

12) Need rcu_barrier() when register_netdevice() takes one of the
    notifier based failure paths, from Subash Abhinov Kasiviswanathan.

13) Fix leak in sctp_do_bind(), from Mao Wenan.

Please pull, thanks a lot!

The following changes since commit 089cf7f6ecb266b6a4164919a2e69bd2f938374a:

  Linux 5.3-rc7 (2019-09-02 09:57:40 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git 

for you to fetch changes up to 4d7ffcf3bf1be98d876c570cab8fc31d9fa92725:

  cdc_ether: fix rndis support for Mediatek based smartphones (2019-09-13 22:08:13 +0200)

----------------------------------------------------------------
Alexander Duyck (1):
      ixgbe: Prevent u8 wrapping of ITR value to something less than 10us

Alexei Starovoitov (1):
      bpf: fix precision tracking of stack slots

Bjørn Mork (1):
      cdc_ether: fix rndis support for Mediatek based smartphones

Christophe JAILLET (3):
      net/hamradio/6pack: Fix the size of a sk_buff used in 'sp_bump()'
      ipv6: Fix the link time qualifier of 'ping_v6_proc_exit_net()'
      sctp: Fix the link time qualifier of 'sctp_ctrlsock_exit()'

Colin Ian King (4):
      NFC: st95hf: fix spelling mistake "receieve" -> "receive"
      net: lmc: fix spelling mistake "runnin" -> "running"
      net: hns3: fix spelling mistake "undeflow" -> "underflow"
      mlx4: fix spelling mistake "veify" -> "verify"

Cong Wang (2):
      net_sched: check cops->tcf_block in tc_bind_tclass()
      sch_hhf: ensure quantum and hhf_non_hh_weight are non-zero

David Ahern (2):
      ipv6: Fix RTA_MULTIPATH with nexthop objects
      selftest: A few cleanups for fib_nexthops.sh

David Howells (1):
      rxrpc: Fix misplaced traceline

David S. Miller (8):
      Merge git://git.kernel.org/.../pablo/nf
      Merge branch 'for-upstream' of git://git.kernel.org/.../bluetooth/bluetooth
      Merge branch 'nexthops-Fix-multipath-notifications-for-IPv6-and-selftests'
      Merge branch 'master' of git://git.kernel.org/.../klassert/ipsec
      Merge tag 'wireless-drivers-for-davem-2019-09-05' of git://git.kernel.org/.../kvalo/wireless-drivers
      Merge git://git.kernel.org/.../bpf/bpf
      Merge branch '10GbE' of git://git.kernel.org/.../jkirsher/net-queue
      Merge branch 'sctp_do_bind-leak'

Donald Sharp (1):
      net: Properly update v4 routes with v6 nexthop

Eric Biggers (1):
      isdn/capi: check message length in capi_write()

Eric Dumazet (1):
      net: sched: fix reordering issues

Fernando Fernandez Mancera (1):
      netfilter: nft_socket: fix erroneous socket assignment

Florian Westphal (1):
      xfrm: policy: avoid warning splat when merging nodes

Fred Lotter (1):
      nfp: flower: cmsg rtnl locks can timeout reify messages

Harish Bandi (1):
      Bluetooth: hci_qca: disable irqs when spinlock is acquired

Hui Peng (1):
      rsi: fix a double free bug in rsi_91x_deinit()

Ilya Maximets (1):
      ixgbe: fix double clean of Tx descriptors with xdp

Jeff Kirsher (1):
      ixgbevf: Fix secpath usage for IPsec Tx offload

Jian-Hong Pan (1):
      Bluetooth: btrtl: Additional Realtek 8822CE Bluetooth devices

John Fastabend (1):
      net: sock_map, fix missing ulp check in sock hash case

Jouni Malinen (1):
      mac80211: Do not send Layer 2 Update frame before authorization

Juliet Kim (1):
      net/ibmvnic: free reset work of removed device from queue

Ka-Cheong Poon (1):
      net/rds: An rds_sock is added too early to the hash table

Leonardo Bras (2):
      netfilter: bridge: Drops IPv6 packets if IPv6 module is not loaded
      netfilter: nft_fib_netdev: Terminate rule eval if protocol=IPv6 and ipv6 module is disabled

Luca Coelho (1):
      iwlwifi: assign directly to iwl_trans->cfg in QuZ detection

Maciej Żenczykowski (2):
      net-ipv6: fix excessive RTF_ADDRCONF flag on ::1/128 local route (and others)
      ipv6: addrconf_f6i_alloc - fix non-null pointer check to !IS_ERR()

Mao Wenan (5):
      net: sonic: return NETDEV_TX_OK if failed to map buffer
      net: sonic: replace dev_kfree_skb in sonic_send_packet
      sctp: change return type of sctp_get_port_local
      sctp: remove redundant assignment when call sctp_get_port_local
      sctp: destroy bucket if failed to bind addr

Marcel Holtmann (1):
      Revert "Bluetooth: validate BLE connection interval updates"

Mario Limonciello (1):
      Revert "Bluetooth: btusb: driver to enable the usb-wakeup feature"

Michal Suchanek (1):
      net/ibmvnic: Fix missing { in __ibmvnic_reset

Moritz Fischer (1):
      net: fixed_phy: Add forward declaration for struct gpio_desc;

Navid Emamdoost (3):
      Bluetooth: bpa10x: change return value
      wimax: i2400: fix memory leak
      net: qrtr: fix memort leak in qrtr_tun_write_iter

Neal Cardwell (1):
      tcp: fix tcp_ecn_withdraw_cwr() to clear TCP_ECN_QUEUE_CWR

Nicolas Dichtel (5):
      xfrm interface: avoid corruption on changelink
      xfrm interface: ifname may be wrong in logs
      xfrm interface: fix list corruption for x-netns
      xfrm interface: fix management of phydev
      bridge/mdb: remove wrong use of NLM_F_MULTI

Pablo Neira Ayuso (2):
      netfilter: ctnetlink: honor IPS_OFFLOAD flag
      netfilter: nf_flow_table: set default timeout after successful insertion

Radhey Shyam Pandey (1):
      MAINTAINERS: add myself as maintainer for xilinx axiethernet driver

Randy Dunlap (1):
      lib/Kconfig: fix OBJAGG in lib/ menu structure

Shmulik Ladkani (1):
      net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list

Stanislaw Gruszka (4):
      mt76: mt76x0e: don't use hw encryption for MT7630E
      mt76: mt76x0e: disable 5GHz band for MT7630E
      rt2x00: clear up IV's on key removal
      Revert "rt2800: enable TX_PIN_CFG_LNA_PE_ bits per band"

Stefan Chulski (1):
      net: phylink: Fix flow control resolution

Steffen Klassert (1):
      ixgbe: Fix secpath usage for IPsec TX offload.

Subash Abhinov Kasiviswanathan (1):
      net: Fix null de-reference of device refcount

Wen Huang (1):
      mwifiex: Fix three heap overflow at parsing element in cfg80211_ap_settings

Xin Long (3):
      sctp: use transport pf_retrans in sctp_do_8_2_transport_strike
      tipc: add NULL pointer check before calling kfree_rcu
      sctp: fix the missing put_user when dumping transport thresholds

Yang Yingliang (1):
      tun: fix use-after-free when register netdev failed

Yizhuo (1):
      net: stmmac: dwmac-sun8i: Variable "val" in function sun8i_dwmac_set_syscon() could be uninitialized

Zhu Yanjun (1):
      forcedeth: use per cpu to collect xmit/recv statistics

 MAINTAINERS                                            |   3 +--
 drivers/bluetooth/bpa10x.c                             |   2 +-
 drivers/bluetooth/btusb.c                              |   8 +++----
 drivers/bluetooth/hci_qca.c                            |  10 ++++----
 drivers/isdn/capi/capi.c                               |  10 +++++++-
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c |   2 +-
 drivers/net/ethernet/ibm/ibmvnic.c                     |   9 ++++---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c          |   7 ++++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c           |  29 +++++++++--------------
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c      |   3 ++-
 drivers/net/ethernet/mellanox/mlx4/main.c              |   2 +-
 drivers/net/ethernet/natsemi/sonic.c                   |   6 ++---
 drivers/net/ethernet/netronome/nfp/flower/cmsg.c       |  10 ++++----
 drivers/net/ethernet/nvidia/forcedeth.c                | 143 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------
 drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c      |   7 +++++-
 drivers/net/hamradio/6pack.c                           |   4 ++--
 drivers/net/phy/phylink.c                              |   6 ++---
 drivers/net/tun.c                                      |  16 +++++++++----
 drivers/net/usb/cdc_ether.c                            |  10 +++++++-
 drivers/net/wan/lmc/lmc_main.c                         |   2 +-
 drivers/net/wimax/i2400m/op-rfkill.c                   |   1 +
 drivers/net/wireless/intel/iwlwifi/pcie/drv.c          |  24 +++++++++----------
 drivers/net/wireless/marvell/mwifiex/ie.c              |   3 +++
 drivers/net/wireless/marvell/mwifiex/uap_cmd.c         |   9 ++++++-
 drivers/net/wireless/mediatek/mt76/mt76x0/eeprom.c     |   5 ++++
 drivers/net/wireless/mediatek/mt76/mt76x0/pci.c        |  15 +++++++++++-
 drivers/net/wireless/ralink/rt2x00/rt2800lib.c         |  37 ++++++++++++++---------------
 drivers/net/wireless/rsi/rsi_91x_usb.c                 |   1 -
 drivers/nfc/st95hf/core.c                              |   2 +-
 include/linux/phy_fixed.h                              |   1 +
 include/net/ip_fib.h                                   |   4 ++--
 include/net/nexthop.h                                  |   5 ++--
 include/net/xfrm.h                                     |   2 --
 include/uapi/linux/isdn/capicmd.h                      |   1 +
 kernel/bpf/verifier.c                                  |  23 +++++++++++-------
 lib/Kconfig                                            |   6 ++---
 net/bluetooth/hci_event.c                              |   5 ----
 net/bluetooth/l2cap_core.c                             |   9 +------
 net/bridge/br_mdb.c                                    |   2 +-
 net/bridge/br_netfilter_hooks.c                        |   4 ++++
 net/core/dev.c                                         |   2 ++
 net/core/skbuff.c                                      |  19 +++++++++++++++
 net/core/sock_map.c                                    |   3 +++
 net/ipv4/fib_semantics.c                               |  15 ++++++------
 net/ipv4/tcp_input.c                                   |   2 +-
 net/ipv6/ping.c                                        |   2 +-
 net/ipv6/route.c                                       |  21 ++++++++++-------
 net/mac80211/cfg.c                                     |  14 ++++-------
 net/mac80211/sta_info.c                                |   4 ++++
 net/netfilter/nf_conntrack_netlink.c                   |   7 ++++--
 net/netfilter/nf_flow_table_core.c                     |   2 +-
 net/netfilter/nft_fib_netdev.c                         |   3 +++
 net/netfilter/nft_socket.c                             |   6 ++---
 net/qrtr/tun.c                                         |   5 +++-
 net/rds/bind.c                                         |  40 ++++++++++++++------------------
 net/rxrpc/input.c                                      |   2 +-
 net/sched/sch_api.c                                    |   2 ++
 net/sched/sch_generic.c                                |   9 +++++--
 net/sched/sch_hhf.c                                    |   2 +-
 net/sctp/protocol.c                                    |   2 +-
 net/sctp/sm_sideeffect.c                               |   2 +-
 net/sctp/socket.c                                      |  24 ++++++++++---------
 net/tipc/name_distr.c                                  |   3 ++-
 net/xfrm/xfrm_interface.c                              |  56 ++++++++++++++++++++------------------------
 net/xfrm/xfrm_policy.c                                 |   6 +++--
 tools/testing/selftests/net/fib_nexthops.sh            |  24 ++++++++++---------
 tools/testing/selftests/net/xfrm_policy.sh             |   7 ++++++
 67 files changed, 443 insertions(+), 289 deletions(-)

^ permalink raw reply

* Re: [PATCH v4 1/2] tcp: Add TCP_INFO counter for packets received out-of-order
From: Neal Cardwell @ 2019-09-13 20:55 UTC (permalink / raw)
  To: Thomas Higdon
  Cc: netdev@vger.kernel.org, Jonathan Lemon, Dave Jones, Eric Dumazet,
	Dave Taht, Yuchung Cheng, Soheil Hassas Yeganeh
In-Reply-To: <20190913193629.55201-1-tph@fb.com>

On Fri, Sep 13, 2019 at 3:37 PM Thomas Higdon <tph@fb.com> wrote:
>
> For receive-heavy cases on the server-side, we want to track the
> connection quality for individual client IPs. This counter, similar to
> the existing system-wide TCPOFOQueue counter in /proc/net/netstat,
> tracks out-of-order packet reception. By providing this counter in
> TCP_INFO, it will allow understanding to what degree receive-heavy
> sockets are experiencing out-of-order delivery and packet drops
> indicating congestion.
>
> Please note that this is similar to the counter in NetBSD TCP_INFO, and
> has the same name.
>
> Signed-off-by: Thomas Higdon <tph@fb.com>
> ---
>
> no changes from v3
>
>  include/linux/tcp.h      | 2 ++
>  include/uapi/linux/tcp.h | 2 ++
>  net/ipv4/tcp.c           | 2 ++
>  net/ipv4/tcp_input.c     | 1 +
>  4 files changed, 7 insertions(+)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index f3a85a7fb4b1..a01dc78218f1 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -393,6 +393,8 @@ struct tcp_sock {
>          */
>         struct request_sock *fastopen_rsk;
>         u32     *saved_syn;
> +
> +       u32 rcv_ooopack; /* Received out-of-order packets, for tcpinfo */

Thanks for adding this.

A thought: putting the new rcv_ooopack field here makes struct
tcp_sock bigger, and increases the odds of taking a cache miss
(according to "pahole" this field is the only one in a new cache
line).

I'd suggest putting the new rcv_ooopack field immediately before
rcv_rtt_last_tsecr. That would use up a 4-byte hole, and would place
it in a cache line already used on TCP receivers (for rcv_rtt logic).
This would make it less likely this new field causes more cache misses
or uses more space.

Details: looking at the output of "pahole" for tcp_sock in various cases:

net-next before this patch:
-------------------------------------
...
        u8                         bpf_sock_ops_cb_flags; /*  2076     1 */

        /* XXX 3 bytes hole, try to pack */

        u32                        rcv_rtt_last_tsecr;   /*  2080     4 */

        /* XXX 4 bytes hole, try to pack */

        struct {
                u32                rtt_us;               /*  2088     4 */
                u32                seq;                  /*  2092     4 */
                u64                time;                 /*  2096     8 */
        } rcv_rtt_est;                                   /*  2088    16 */
...
        /* size: 2176, cachelines: 34, members: 134 */
        /* sum members: 2164, holes: 4, sum holes: 12 */
        /* paddings: 3, sum paddings: 12 */


net-next with this patch:
-------------------------------------
...
        u32 *                      saved_syn;            /*  2168     8 */
        /* --- cacheline 34 boundary (2176 bytes) --- */
        u32                        rcv_ooopack;          /*  2176     4 */
...
        /* size: 2184, cachelines: 35, members: 135 */
        /* sum members: 2168, holes: 4, sum holes: 12 */
        /* padding: 4 */
        /* paddings: 3, sum paddings: 12 */
        /* last cacheline: 8 bytes */


net-next with this field in the suggested spot:
-------------------------------------
...
       /* XXX 3 bytes hole, try to pack */

        u32                        rcv_ooopack;          /*  2080     4 */
        u32                        rcv_rtt_last_tsecr;   /*  2084     4 */
        struct {
                u32                rtt_us;               /*  2088     4 */
                u32                seq;                  /*  2092     4 */
                u64                time;                 /*  2096     8 */
        } rcv_rtt_est;                                   /*  2088    16 */
...
        /* size: 2176, cachelines: 34, members: 135 */
        /* sum members: 2168, holes: 3, sum holes: 8 */
        /* paddings: 3, sum paddings: 12 */

neal


neal

^ permalink raw reply

* Re: [PATCH v4 2/2] tcp: Add snd_wnd to TCP_INFO
From: Neal Cardwell @ 2019-09-13 21:02 UTC (permalink / raw)
  To: Thomas Higdon
  Cc: netdev@vger.kernel.org, Jonathan Lemon, Dave Jones, Eric Dumazet,
	Dave Taht, Yuchung Cheng, Soheil Hassas Yeganeh
In-Reply-To: <20190913193629.55201-2-tph@fb.com>

On Fri, Sep 13, 2019 at 3:36 PM Thomas Higdon <tph@fb.com> wrote:
>
> Neal Cardwell mentioned that snd_wnd would be useful for diagnosing TCP
> performance problems --
> > (1) Usually when we're diagnosing TCP performance problems, we do so
> > from the sender, since the sender makes most of the
> > performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc).
> > From the sender-side the thing that would be most useful is to see
> > tp->snd_wnd, the receive window that the receiver has advertised to
> > the sender.
>
> This serves the purpose of adding an additional __u32 to avoid the
> would-be hole caused by the addition of the tcpi_rcvi_ooopack field.
>
> Signed-off-by: Thomas Higdon <tph@fb.com>
> ---
> changes from v3:
>  - changed from rcv_wnd to snd_wnd
>
>  include/uapi/linux/tcp.h | 1 +
>  net/ipv4/tcp.c           | 1 +
>  2 files changed, 2 insertions(+)
>
> diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> index 20237987ccc8..240654f22d98 100644
> --- a/include/uapi/linux/tcp.h
> +++ b/include/uapi/linux/tcp.h
> @@ -272,6 +272,7 @@ struct tcp_info {
>         __u32   tcpi_reord_seen;     /* reordering events seen */
>
>         __u32   tcpi_rcv_ooopack;    /* Out-of-order packets received */
> +       __u32   tcpi_snd_wnd;        /* Remote peer's advertised recv window size */
>  };

Thanks for adding this!

My run of ./scripts/checkpatch.pl is showing a warning on this line:

WARNING: line over 80 characters
#19: FILE: include/uapi/linux/tcp.h:273:
+       __u32   tcpi_snd_wnd;        /* Remote peer's advertised recv
window size */

What if the comment is shortened up to fit in 80 columns and the units
(bytes) are added, something like:

        __u32   tcpi_snd_wnd;        /* peer's advertised recv window (bytes) */

neal

^ permalink raw reply

* Re: [PATCH bpf-next 05/11] samples: bpf: makefile: use D vars from KBUILD_CFLAGS to handle headers
From: Yonghong Song @ 2019-09-13 21:12 UTC (permalink / raw)
  To: Ivan Khoronzhuk, ast@kernel.org, daniel@iogearbox.net,
	davem@davemloft.net, jakub.kicinski@netronome.com,
	hawk@kernel.org, john.fastabend@gmail.com
  Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	bpf@vger.kernel.org, clang-built-linux@googlegroups.com
In-Reply-To: <20190910103830.20794-6-ivan.khoronzhuk@linaro.org>



On 9/10/19 11:38 AM, Ivan Khoronzhuk wrote:
> The kernel headers are reused from samples bpf, and autoconf.h is not
> enough to reflect complete arch configuration for clang. But CLANG-bpf
> cmds are sensitive for assembler part taken from linux headers and -D
> vars, usually used in CFLAGS, should be carefully added for each arch.
> For that, for CLANG-bpf, lets filter them only for arm arch as it
> definitely requires __LINUX_ARM_ARCH__ to be set, but ignore for
> others till it's really needed. For arm, -D__LINUX_ARM_ARCH__ is min
> version used as instruction set selector. In another case errors
> like "SMP is not supported" for arm and bunch of other errors are
> issued resulting to incorrect final object.
> 
> Later D_OPTIONS can be used for gcc part.
> ---
>   samples/bpf/Makefile | 9 +++++++++
>   1 file changed, 9 insertions(+)
> 
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index 8ecc5d0c2d5b..6492b7e65c08 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -185,6 +185,15 @@ HOSTLDLIBS_map_perf_test	+= -lrt
>   HOSTLDLIBS_test_overhead	+= -lrt
>   HOSTLDLIBS_xdpsock		+= -pthread
>   
> +# Strip all expet -D options needed to handle linux headers
> +# for arm it's __LINUX_ARM_ARCH__ and potentially others fork vars
> +D_OPTIONS = $(shell echo "$(KBUILD_CFLAGS) " | sed 's/[[:blank:]]/\n/g' | \
> +	sed '/^-D/!d' | tr '\n' ' ')
> +
> +ifeq ($(ARCH), arm)
> +CLANG_EXTRA_CFLAGS := $(D_OPTIONS)
> +endif

Do you need this for native compilation?

so arm64 compilation does not need this?
If only -D__LINUX_ARM_ARCH__ is needed, maybe just
with
    CLANG_EXTRA_CFLAGS := -D__LINUX_ARM_ARCH__
Otherwise, people will wonder whether this is needed for
other architectures. Or just do
    CLANG_EXTRA_CFLAGS := $(D_OPTIONS)
for all cross compilation?

> +
>   # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
>   #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
>   LLC ?= llc
> 

^ permalink raw reply

* [PATCH v4.14-stable 1/2] tcp: Reset send_head when removing skb from write-queue
From: Christoph Paasch @ 2019-09-13 20:08 UTC (permalink / raw)
  To: stable, netdev, gregkh, Sasha Levin
  Cc: David Miller, Eric Dumazet, Jason Baron, Vladimir Rutsky,
	Soheil Hassas Yeganeh, Neal Cardwell
In-Reply-To: <20190913200819.32686-1-cpaasch@apple.com>

syzkaller is not happy since commit fdfc5c8594c2 ("tcp: remove empty skb
from write queue in error cases"):

CPU: 1 PID: 13814 Comm: syz-executor.4 Not tainted 4.14.143 #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
task: ffff888040105c00 task.stack: ffff8880649c0000
RIP: 0010:tcp_sendmsg_locked+0x6b4/0x4390 net/ipv4/tcp.c:1350
RSP: 0018:ffff8880649cf718 EFLAGS: 00010206
RAX: 0000000000000014 RBX: 000000000000001e RCX: ffffc90000717000
RDX: 0000000000000077 RSI: ffffffff82e760f7 RDI: 00000000000000a0
RBP: ffff8880649cfaa8 R08: 1ffff1100c939e7a R09: ffff8880401063c8
R10: 0000000000000003 R11: 0000000000000001 R12: dffffc0000000000
R13: ffff888043d74750 R14: ffff888043d74500 R15: 000000000000001e
FS:  00007f0afcb6d700(0000) GS:ffff88806cf00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2ca22000 CR3: 0000000040496004 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 tcp_sendmsg+0x2a/0x40 net/ipv4/tcp.c:1533
 inet_sendmsg+0x173/0x4e0 net/ipv4/af_inet.c:784
 sock_sendmsg_nosec net/socket.c:646 [inline]
 sock_sendmsg+0xc3/0x100 net/socket.c:656
 SYSC_sendto+0x35d/0x5e0 net/socket.c:1766
 do_syscall_64+0x241/0x680 arch/x86/entry/common.c:292
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

The problem is that we are removing an skb from the write-queue that
could have been referenced by the sk_send_head. Thus, we need to check
for the send_head's sanity after removing it.

This patch needs to be backported only to 4.14 and older (among those
that applied the backport of fdfc5c8594c2).

Fixes: fdfc5c8594c2 ("tcp: remove empty skb from write queue in error cases")
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Vladimir Rutsky <rutsky@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
---
 net/ipv4/tcp.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5ce069ce2a97..efe767e20d01 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -924,8 +924,7 @@ static void tcp_remove_empty_skb(struct sock *sk, struct sk_buff *skb)
 {
 	if (skb && !skb->len) {
 		tcp_unlink_write_queue(skb, sk);
-		if (tcp_write_queue_empty(sk))
-			tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
+		tcp_check_send_head(sk, skb);
 		sk_wmem_free_skb(sk, skb);
 	}
 }
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH bpf-next 05/11] samples: bpf: makefile: use D vars from KBUILD_CFLAGS to handle headers
From: Ivan Khoronzhuk @ 2019-09-13 21:24 UTC (permalink / raw)
  To: Yonghong Song
  Cc: ast@kernel.org, daniel@iogearbox.net, davem@davemloft.net,
	jakub.kicinski@netronome.com, hawk@kernel.org,
	john.fastabend@gmail.com, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, bpf@vger.kernel.org,
	clang-built-linux@googlegroups.com
In-Reply-To: <97ca4228-145a-2449-b4ba-8e79380a54f4@fb.com>

On Fri, Sep 13, 2019 at 09:12:01PM +0000, Yonghong Song wrote:
>
>
>On 9/10/19 11:38 AM, Ivan Khoronzhuk wrote:
>> The kernel headers are reused from samples bpf, and autoconf.h is not
>> enough to reflect complete arch configuration for clang. But CLANG-bpf
>> cmds are sensitive for assembler part taken from linux headers and -D
>> vars, usually used in CFLAGS, should be carefully added for each arch.
>> For that, for CLANG-bpf, lets filter them only for arm arch as it
>> definitely requires __LINUX_ARM_ARCH__ to be set, but ignore for
>> others till it's really needed. For arm, -D__LINUX_ARM_ARCH__ is min
>> version used as instruction set selector. In another case errors
>> like "SMP is not supported" for arm and bunch of other errors are
>> issued resulting to incorrect final object.
>>
>> Later D_OPTIONS can be used for gcc part.
>> ---
>>   samples/bpf/Makefile | 9 +++++++++
>>   1 file changed, 9 insertions(+)
>>
>> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
>> index 8ecc5d0c2d5b..6492b7e65c08 100644
>> --- a/samples/bpf/Makefile
>> +++ b/samples/bpf/Makefile
>> @@ -185,6 +185,15 @@ HOSTLDLIBS_map_perf_test	+= -lrt
>>   HOSTLDLIBS_test_overhead	+= -lrt
>>   HOSTLDLIBS_xdpsock		+= -pthread
>>
>> +# Strip all expet -D options needed to handle linux headers
>> +# for arm it's __LINUX_ARM_ARCH__ and potentially others fork vars
>> +D_OPTIONS = $(shell echo "$(KBUILD_CFLAGS) " | sed 's/[[:blank:]]/\n/g' | \
>> +	sed '/^-D/!d' | tr '\n' ' ')
>> +
>> +ifeq ($(ARCH), arm)
>> +CLANG_EXTRA_CFLAGS := $(D_OPTIONS)
>> +endif
>
>Do you need this for native compilation?
Yes, native "arm" also requires it.

>
>so arm64 compilation does not need this?
yes, now only arm

>If only -D__LINUX_ARM_ARCH__ is needed, maybe just
>with
>    CLANG_EXTRA_CFLAGS := -D__LINUX_ARM_ARCH__
Value also needed: -D__LINUX_ARM_ARCH_=7 or -D__LINUX_ARM_ARCH_=6
So, need retrieve it.

>Otherwise, people will wonder whether this is needed for
>other architectures. Or just do
>    CLANG_EXTRA_CFLAGS := $(D_OPTIONS)
>for all cross compilation?
arm, cross and native requires it.

Will do this:

# Strip all expet -D options needed to handle linux headers
# for arm it's __LINUX_ARM_ARCH__ and potentially others fork vars
ifeq ($(ARCH), arm)
D_OPTIONS = $(shell echo "$(KBUILD_CFLAGS) " | sed 's/[[:blank:]]/\n/g' | \
	sed '/^-D/!d' | tr '\n' ' ')
endif

CLANG_EXTRA_CFLAGS := $(D_OPTIONS)



>
>> +
>>   # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
>>   #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
>>   LLC ?= llc
>>

-- 
Regards,
Ivan Khoronzhuk

^ permalink raw reply

* Re: [PATCH bpf-next 02/11] samples: bpf: makefile: fix cookie_uid_helper_example obj build
From: Ivan Khoronzhuk @ 2019-09-13 21:25 UTC (permalink / raw)
  To: Yonghong Song
  Cc: ast@kernel.org, daniel@iogearbox.net, davem@davemloft.net,
	jakub.kicinski@netronome.com, hawk@kernel.org,
	john.fastabend@gmail.com, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, bpf@vger.kernel.org,
	clang-built-linux@googlegroups.com
In-Reply-To: <7f556c1c-abee-41a9-af83-1d72fc33af4b@fb.com>

On Fri, Sep 13, 2019 at 08:48:37PM +0000, Yonghong Song wrote:
>
>
>On 9/10/19 11:38 AM, Ivan Khoronzhuk wrote:
>> Don't list userspace "cookie_uid_helper_example" object in list for
>> bpf objects.
>>
>> per_socket_stats_example-opjs is used to list additional dependencies
>
>s/opjs/objs
>
>> for user space binary from hostprogs-y list. Kbuild system creates
>> rules for objects listed this way anyway and no need to worry about
>> this. Despite on it, the samples bpf uses logic that hostporgs-y are
>> build for userspace with includes needed for this, but "always"
>> target, if it's not in hostprog-y list, uses CLANG-bpf rule and is
>> intended to create bpf obj but not arch obj and uses only kernel
>> includes for that. So correct it, as it breaks cross-compiling at
>> least.
>
>The above description is a little tricky to understand.
>Maybe something like:
>    'always' target is for bpf programs.
>    'cookie_uid_helper_example.o' is a user space ELF file, and
>    covered by rule `per_socket_stats_example`.
>    Let us remove `always += cookie_uid_helper_example.o`,
>    which avoids breaking cross compilation due to
>    mismatched includes.

Yes, looks better, thanks.

-- 
Regards,
Ivan Khoronzhuk

^ permalink raw reply

* [PATCH v4.14-stable 2/2] tcp: Don't dequeue SYN/FIN-segments from write-queue
From: Christoph Paasch @ 2019-09-13 20:08 UTC (permalink / raw)
  To: stable, netdev, gregkh, Sasha Levin
  Cc: David Miller, Eric Dumazet, Jason Baron, Vladimir Rutsky,
	Soheil Hassas Yeganeh, Neal Cardwell
In-Reply-To: <20190913200819.32686-1-cpaasch@apple.com>

If a SYN/FIN-segment is on the write-queue, skb->len is 0, but the
segment actually has been transmitted. end_seq and seq of the tcp_skb_cb
in that case will indicate this difference.

We should not remove such segments from the write-queue as we might be
in SYN_SENT-state and a retransmission-timer is running. When that one
fires, packets_out will be 1, but the write-queue would be empty,
resulting in:

[   61.280214] ------------[ cut here ]------------
[   61.281307] WARNING: CPU: 0 PID: 0 at net/ipv4/tcp_timer.c:429 tcp_retransmit_timer+0x18f9/0x2660
[   61.283498] Modules linked in:
[   61.284084] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.142 #58
[   61.285214] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
[   61.286644] task: ffffffff8401e1c0 task.stack: ffffffff84000000
[   61.287758] RIP: 0010:tcp_retransmit_timer+0x18f9/0x2660
[   61.288715] RSP: 0018:ffff88806ce07cb8 EFLAGS: 00010206
[   61.289669] RAX: ffffffff8401e1c0 RBX: ffff88805c998b00 RCX: 0000000000000006
[   61.290968] RDX: 0000000000000100 RSI: 0000000000000000 RDI: ffff88805c9994d8
[   61.292314] RBP: ffff88805c99919a R08: ffff88807fff901c R09: ffff88807fff9008
[   61.293547] R10: ffff88807fff9017 R11: ffff88807fff9010 R12: ffff88805c998b30
[   61.294834] R13: ffffffff844b9380 R14: 0000000000000000 R15: ffff88805c99930c
[   61.296086] FS:  0000000000000000(0000) GS:ffff88806ce00000(0000) knlGS:0000000000000000
[   61.297523] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   61.298646] CR2: 00007f721da50ff8 CR3: 0000000004014002 CR4: 00000000001606f0
[   61.299944] Call Trace:
[   61.300403]  <IRQ>
[   61.300806]  ? kvm_sched_clock_read+0x21/0x30
[   61.301689]  ? sched_clock+0x5/0x10
[   61.302433]  ? sched_clock_cpu+0x18/0x170
[   61.303173]  tcp_write_timer_handler+0x2c1/0x7a0
[   61.304038]  tcp_write_timer+0x13e/0x160
[   61.304794]  call_timer_fn+0x14a/0x5f0
[   61.305480]  ? tcp_write_timer_handler+0x7a0/0x7a0
[   61.306364]  ? __next_timer_interrupt+0x140/0x140
[   61.307229]  ? _raw_spin_unlock_irq+0x24/0x40
[   61.308033]  ? tcp_write_timer_handler+0x7a0/0x7a0
[   61.308887]  ? tcp_write_timer_handler+0x7a0/0x7a0
[   61.309760]  run_timer_softirq+0xc41/0x1080
[   61.310539]  ? trigger_dyntick_cpu.isra.33+0x180/0x180
[   61.311506]  ? ktime_get+0x13f/0x1c0
[   61.312232]  ? clockevents_program_event+0x10d/0x2f0
[   61.313158]  __do_softirq+0x20b/0x96b
[   61.313889]  irq_exit+0x1a7/0x1e0
[   61.314513]  smp_apic_timer_interrupt+0xfc/0x4d0
[   61.315386]  apic_timer_interrupt+0x8f/0xa0
[   61.316129]  </IRQ>

Followed by a panic.

So, before removing an skb with skb->len == 0, let's make sure that the
skb is really empty by checking the end_seq and seq.

This patch needs to be backported only to 4.14 and older (among those
that applied the backport of fdfc5c8594c2).

Fixes: fdfc5c8594c2 ("tcp: remove empty skb from write queue in error cases")
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Vladimir Rutsky <rutsky@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
---
 net/ipv4/tcp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index efe767e20d01..c1f59a53f68f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -922,7 +922,8 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
  */
 static void tcp_remove_empty_skb(struct sock *sk, struct sk_buff *skb)
 {
-	if (skb && !skb->len) {
+	if (skb && !skb->len &&
+	    TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq) {
 		tcp_unlink_write_queue(skb, sk);
 		tcp_check_send_head(sk, skb);
 		sk_wmem_free_skb(sk, skb);
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH v4 2/2] tcp: Add snd_wnd to TCP_INFO
From: Yuchung Cheng @ 2019-09-13 21:28 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Thomas Higdon, netdev@vger.kernel.org, Jonathan Lemon, Dave Jones,
	Eric Dumazet, Dave Taht, Soheil Hassas Yeganeh
In-Reply-To: <CADVnQymKS6-jztAbLu_QYWiPYMqoTf5ODzSg3UPJxH+vBt=bmw@mail.gmail.com>

On Fri, Sep 13, 2019 at 2:02 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Fri, Sep 13, 2019 at 3:36 PM Thomas Higdon <tph@fb.com> wrote:
> >
> > Neal Cardwell mentioned that snd_wnd would be useful for diagnosing TCP
> > performance problems --
> > > (1) Usually when we're diagnosing TCP performance problems, we do so
> > > from the sender, since the sender makes most of the
> > > performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc).
> > > From the sender-side the thing that would be most useful is to see
> > > tp->snd_wnd, the receive window that the receiver has advertised to
> > > the sender.
> >
> > This serves the purpose of adding an additional __u32 to avoid the
> > would-be hole caused by the addition of the tcpi_rcvi_ooopack field.
> >
> > Signed-off-by: Thomas Higdon <tph@fb.com>
> > ---
> > changes from v3:
> >  - changed from rcv_wnd to snd_wnd
> >
> >  include/uapi/linux/tcp.h | 1 +
> >  net/ipv4/tcp.c           | 1 +
> >  2 files changed, 2 insertions(+)
> >
> > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> > index 20237987ccc8..240654f22d98 100644
> > --- a/include/uapi/linux/tcp.h
> > +++ b/include/uapi/linux/tcp.h
> > @@ -272,6 +272,7 @@ struct tcp_info {
> >         __u32   tcpi_reord_seen;     /* reordering events seen */
> >
> >         __u32   tcpi_rcv_ooopack;    /* Out-of-order packets received */
> > +       __u32   tcpi_snd_wnd;        /* Remote peer's advertised recv window size */
> >  };
>
> Thanks for adding this!
>
> My run of ./scripts/checkpatch.pl is showing a warning on this line:
>
> WARNING: line over 80 characters
> #19: FILE: include/uapi/linux/tcp.h:273:
> +       __u32   tcpi_snd_wnd;        /* Remote peer's advertised recv
> window size */
>
> What if the comment is shortened up to fit in 80 columns and the units
> (bytes) are added, something like:
>
>         __u32   tcpi_snd_wnd;        /* peer's advertised recv window (bytes) */
just a thought: will tcpi_peer_rcv_wnd be more self-explanatory?
>
> neal

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox