Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH stable 4.9] tcp: reset sk_send_head in tcp_write_queue_purge
From: Sasha Levin @ 2019-07-30 20:17 UTC (permalink / raw)
  To: maowenan; +Cc: gregkh, stable, netdev, linux-kernel
In-Reply-To: <29c1ee9c-4a5d-4f61-f526-85980185f0bd@huawei.com>

On Tue, Jul 30, 2019 at 09:31:19AM +0800, maowenan wrote:
>
>
>On 2019/7/29 23:32, Sasha Levin wrote:
>> On Mon, Jul 29, 2019 at 09:21:08PM +0800, Mao Wenan wrote:
>>> From: Soheil Hassas Yeganeh <soheil@google.com>
>>>
>>> tcp_write_queue_purge clears all the SKBs in the write queue
>>> but does not reset the sk_send_head. As a result, we can have
>>> a NULL pointer dereference anywhere that we use tcp_send_head
>>> instead of the tcp_write_queue_tail.
>>>
>>> For example, after a27fd7a8ed38 (tcp: purge write queue upon RST),
>>> we can purge the write queue on RST. Prior to
>>> 75c119afe14f (tcp: implement rb-tree based retransmit queue),
>>> tcp_push will only check tcp_send_head and then accesses
>>> tcp_write_queue_tail to send the actual SKB. As a result, it will
>>> dereference a NULL pointer.
>>>
>>> This has been reported twice for 4.14 where we don't have
>>> 75c119afe14f:
>>>
>>> By Timofey Titovets:
>>>
>>> [  422.081094] BUG: unable to handle kernel NULL pointer dereference
>>> at 0000000000000038
>>> [  422.081254] IP: tcp_push+0x42/0x110
>>> [  422.081314] PGD 0 P4D 0
>>> [  422.081364] Oops: 0002 [#1] SMP PTI
>>>
>>> By Yongjian Xu:
>>>
>>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
>>> IP: tcp_push+0x48/0x120
>>> PGD 80000007ff77b067 P4D 80000007ff77b067 PUD 7fd989067 PMD 0
>>> Oops: 0002 [#18] SMP PTI
>>> Modules linked in: tcp_diag inet_diag tcp_bbr sch_fq iTCO_wdt
>>> iTCO_vendor_support pcspkr ixgbe mdio i2c_i801 lpc_ich joydev input_leds shpchp
>>> e1000e igb dca ptp pps_core hwmon mei_me mei ipmi_si ipmi_msghandler sg ses
>>> scsi_transport_sas enclosure ext4 jbd2 mbcache sd_mod ahci libahci megaraid_sas
>>> wmi ast ttm dm_mirror dm_region_hash dm_log dm_mod dax
>>> CPU: 6 PID: 14156 Comm: [ET_NET 6] Tainted: G D 4.14.26-1.el6.x86_64 #1
>>> Hardware name: LENOVO ThinkServer RD440 /ThinkServer RD440, BIOS A0TS80A
>>> 09/22/2014
>>> task: ffff8807d78d8140 task.stack: ffffc9000e944000
>>> RIP: 0010:tcp_push+0x48/0x120
>>> RSP: 0018:ffffc9000e947a88 EFLAGS: 00010246
>>> RAX: 00000000000005b4 RBX: ffff880f7cce9c00 RCX: 0000000000000000
>>> RDX: 0000000000000000 RSI: 0000000000000040 RDI: ffff8807d00f5000
>>> RBP: ffffc9000e947aa8 R08: 0000000000001c84 R09: 0000000000000000
>>> R10: ffff8807d00f5158 R11: 0000000000000000 R12: ffff8807d00f5000
>>> R13: 0000000000000020 R14: 00000000000256d4 R15: 0000000000000000
>>> FS: 00007f5916de9700(0000) GS:ffff88107fd00000(0000) knlGS:0000000000000000
>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 0000000000000038 CR3: 00000007f8226004 CR4: 00000000001606e0
>>> Call Trace:
>>> tcp_sendmsg_locked+0x33d/0xe50
>>> tcp_sendmsg+0x37/0x60
>>> inet_sendmsg+0x39/0xc0
>>> sock_sendmsg+0x49/0x60
>>> sock_write_iter+0xb6/0x100
>>> do_iter_readv_writev+0xec/0x130
>>> ? rw_verify_area+0x49/0xb0
>>> do_iter_write+0x97/0xd0
>>> vfs_writev+0x7e/0xe0
>>> ? __wake_up_common_lock+0x80/0xa0
>>> ? __fget_light+0x2c/0x70
>>> ? __do_page_fault+0x1e7/0x530
>>> do_writev+0x60/0xf0
>>> ? inet_shutdown+0xac/0x110
>>> SyS_writev+0x10/0x20
>>> do_syscall_64+0x6f/0x140
>>> ? prepare_exit_to_usermode+0x8b/0xa0
>>> entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>> RIP: 0033:0x3135ce0c57
>>> RSP: 002b:00007f5916de4b00 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
>>> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000003135ce0c57
>>> RDX: 0000000000000002 RSI: 00007f5916de4b90 RDI: 000000000000606f
>>> RBP: 0000000000000000 R08: 0000000000000000 R09: 00007f5916de8c38
>>> R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000464cc
>>> R13: 00007f5916de8c30 R14: 00007f58d8bef080 R15: 0000000000000002
>>> Code: 48 8b 97 60 01 00 00 4c 8d 97 58 01 00 00 41 b9 00 00 00 00 41 89 f3 4c 39
>>> d2 49 0f 44 d1 41 81 e3 00 80 00 00 0f 85 b0 00 00 00 <80> 4a 38 08 44 8b 8f 74
>>> 06 00 00 44 89 8f 7c 06 00 00 83 e6 01
>>> RIP: tcp_push+0x48/0x120 RSP: ffffc9000e947a88
>>> CR2: 0000000000000038
>>> ---[ end trace 8d545c2e93515549 ]---
>>>
>>> There is other scenario which found in stable 4.4:
>>> Allocated:
>>> [<ffffffff82f380a6>] __alloc_skb+0xe6/0x600 net/core/skbuff.c:218
>>> [<ffffffff832466c3>] alloc_skb_fclone include/linux/skbuff.h:856 [inline]
>>> [<ffffffff832466c3>] sk_stream_alloc_skb+0xa3/0x5d0 net/ipv4/tcp.c:833
>>> [<ffffffff83249164>] tcp_sendmsg+0xd34/0x2b00 net/ipv4/tcp.c:1178
>>> [<ffffffff83300ef3>] inet_sendmsg+0x203/0x4d0 net/ipv4/af_inet.c:755
>>> Freed:
>>> [<ffffffff82f372fd>] __kfree_skb+0x1d/0x20 net/core/skbuff.c:676
>>> [<ffffffff83288834>] sk_wmem_free_skb include/net/sock.h:1447 [inline]
>>> [<ffffffff83288834>] tcp_write_queue_purge include/net/tcp.h:1460 [inline]
>>> [<ffffffff83288834>] tcp_connect_init net/ipv4/tcp_output.c:3122 [inline]
>>> [<ffffffff83288834>] tcp_connect+0xb24/0x30c0 net/ipv4/tcp_output.c:3261
>>> [<ffffffff8329b991>] tcp_v4_connect+0xf31/0x1890 net/ipv4/tcp_ipv4.c:246
>>>
>>> BUG: KASAN: use-after-free in tcp_skb_pcount include/net/tcp.h:796 [inline]
>>> BUG: KASAN: use-after-free in tcp_init_tso_segs net/ipv4/tcp_output.c:1619 [inline]
>>> BUG: KASAN: use-after-free in tcp_write_xmit+0x3fc2/0x4cb0 net/ipv4/tcp_output.c:2056
>>> [<ffffffff81515cd5>] kasan_report.cold.7+0x175/0x2f7 mm/kasan/report.c:408
>>> [<ffffffff814f9784>] __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:427
>>> [<ffffffff83286582>] tcp_skb_pcount include/net/tcp.h:796 [inline]
>>> [<ffffffff83286582>] tcp_init_tso_segs net/ipv4/tcp_output.c:1619 [inline]
>>> [<ffffffff83286582>] tcp_write_xmit+0x3fc2/0x4cb0 net/ipv4/tcp_output.c:2056
>>> [<ffffffff83287a40>] __tcp_push_pending_frames+0xa0/0x290 net/ipv4/tcp_output.c:2307
>>>
>>> stable 4.4 and stable 4.9 don't have the commit abb4a8b870b5 ("tcp: purge write queue upon RST")
>>> which is referred in dbbf2d1e4077,
>>> in tcp_connect_init, it calls tcp_write_queue_purge, and does not reset sk_send_head, then UAF.
>>>
>>> stable 4.14 have the commit abb4a8b870b5 ("tcp: purge write queue upon RST"),
>>> in tcp_reset, it calls tcp_write_queue_purge(sk), and does not reset sk_send_head, then UAF.
>>>
>>> So this patch can be used to fix stable 4.4 and 4.9.
>>>
>>> Fixes: a27fd7a8ed38 (tcp: purge write queue upon RST)
>>> Reported-by: Timofey Titovets <nefelim4ag@gmail.com>
>>> Reported-by: Yongjian Xu <yongjianchn@gmail.com>
>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>>> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
>>> Tested-by: Yongjian Xu <yongjianchn@gmail.com>
>>>
>>> Signed-off-by: David S. Miller <davem@davemloft.net>
>>> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>> Signed-off-by: Mao Wenan <maowenan@huawei.com>
>>
>> So the "Fixes:" commit in the commit message is wrong? What's the actual
>> commit that this fixes?
>
>Upstream commit is 7f582b248d0a ("tcp: purge write queue in tcp_connect_init()")
>linux-4.4.y
>Fixes: 5bbe138a250e ("tcp: purge write queue in tcp_connect_init()")
>linux-4.9.y
>Fixes: 74a4c09d4b05 ("tcp: purge write queue in tcp_connect_init()")
>linux-4.14.y
>Fixes: a27fd7a8ed38 ("tcp: purge write queue upon RST")

Okay, I've queued this for 4.9 and 4.4, thank you.

--
Thanks,
Sasha

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-07-30 20:20 UTC (permalink / raw)
  To: Song Liu
  Cc: Andy Lutomirski, Kees Cook, linux-security@vger.kernel.org,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API
In-Reply-To: <77354A95-4107-41A7-8936-D144F01C3CA4@fb.com>

On Sat, Jul 27, 2019 at 11:20 AM Song Liu <songliubraving@fb.com> wrote:
>
> Hi Andy,
>
> >>>>
> >>>
> >>> Well, yes. sys_bpf() is pretty powerful.
> >>>
> >>> The goal of /dev/bpf is to enable special users to call sys_bpf(). In
> >>> the meanwhile, such users should not take down the whole system easily
> >>> by accident, e.g., with rm -rf /.
> >>
> >> That’s easy, though — bpftool could learn to read /etc/bpfusers before allowing ruid != 0.
> >
> > This is a great idea! fscaps + /etc/bpfusers should do the trick.
>
> After some discussions and more thinking on this, I have some concerns
> with the user space only approach.
>
> IIUC, your proposal for user space only approach is like:
>
> 1. bpftool (and other tools) check /etc/bpfusers and only do
>    setuid for allowed users:
>
>         int main()
>         {
>                 if (/* uid in /etc/bpfusers */)
>                         setuid(0);
>                 sys_bpf(...);
>         }
>
> 2. bpftool (and other tools) is installed with CAP_SETUID:
>
>         setcap cap_setuid=e+p /bin/bpftool
>

You have this a bit backwards.  You wouldn't use CAP_SETUID.  You
would use the setuid *mode* bit, i.e. chmod 4111 (or 4100 and use ACLs
to further lock it down).  Or you could use setcap cap_sys_admin=p,
although the details vary.  It woks a bit like this:

First, if you are running with elevated privilege due to SUID or
fscaps, the kernel and glibc offer you a degree of protection: you are
protected from ptrace(), LD_PRELOAD, etc.  You are *not* protected
from yourself.  For example, you may be running in a context in which
an attacker has malicious values in your environment variables, CWD,
etc.  Do you need to carefully decide whether you are willing to run
with elevated privilege on behalf of the user, which you learn like
this:

uid_t real_uid = getuid();

Your decision may may depend on command-line arguments as well (i.e.
you might want to allow tracing but not filtering, say).  Once you've
made this decision, the details vary:

For SUID, you either continue to run with euid == 0, or you drop
privilege using something like:

if (setresuid(real_uid, real_uid, real_uid) != 0) {
 /* optionally print an error to stderr */
 exit(1);
}

For fscaps, if you want to be privileged, you use something like
capng_update(); capng_apply() to set CAP_SYS_ADMIN to be effective
when you want privilege.  If you want to be unprivileged (because
bpfusers says so, for example), you could use capng_update() to drop
CAP_SYS_ADMIN entirely and see if the calls still work without
privilege.  But this is a little bit awkward, since you don't directly
know whether the user that invoked you in the first place had
CAP_SYS_ADMIN to begin with.

In general, SUID is a bit easier to work with.

> This approach is not ideal, because we need to trust the tool to give
> it CAP_SETUID. A hacked tool could easily bypass /etc/bpfusers check
> or use other root only sys calls after setuid(0).

How?  The whole SUID mechanism is designed fairly carefully to prevent
this.  /bin/sudo is likely to be SUID on your system, but you can't
just "hack" it to become root.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-07-30 20:24 UTC (permalink / raw)
  To: Song Liu
  Cc: Andy Lutomirski, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List
In-Reply-To: <369476A8-4CE1-43DA-9239-06437C0384C7@fb.com>

On Mon, Jul 29, 2019 at 10:07 PM Song Liu <songliubraving@fb.com> wrote:
>
> Hi Andy,
>
> > On Jul 27, 2019, at 11:20 AM, Song Liu <songliubraving@fb.com> wrote:
> >
> > Hi Andy,
> >
> >>>>>
> >>>>
> >>>> Well, yes. sys_bpf() is pretty powerful.
> >>>>
> >>>> The goal of /dev/bpf is to enable special users to call sys_bpf(). In
> >>>> the meanwhile, such users should not take down the whole system easily
> >>>> by accident, e.g., with rm -rf /.
> >>>
> >>> That’s easy, though — bpftool could learn to read /etc/bpfusers before allowing ruid != 0.
> >>
> >> This is a great idea! fscaps + /etc/bpfusers should do the trick.
> >
> > After some discussions and more thinking on this, I have some concerns
> > with the user space only approach.
> >
> > IIUC, your proposal for user space only approach is like:
> >
> > 1. bpftool (and other tools) check /etc/bpfusers and only do
> >   setuid for allowed users:
> >
> >       int main()
> >       {
> >               if (/* uid in /etc/bpfusers */)
> >                       setuid(0);
> >               sys_bpf(...);
> >       }
> >
> > 2. bpftool (and other tools) is installed with CAP_SETUID:
> >
> >       setcap cap_setuid=e+p /bin/bpftool
> >
> > 3. sys admin maintains proper /etc/bpfusers.
> >
> > This approach is not ideal, because we need to trust the tool to give
> > it CAP_SETUID. A hacked tool could easily bypass /etc/bpfusers check
> > or use other root only sys calls after setuid(0).
> >
>
> I would like more comments on this.
>
> Currently, bpf permission is more or less "root or nothing", which we
> would like to change.
>
> The short term goal is to separate bpf from root, in other words, it is
> "all or nothing". Special user space utilities, such as systemd, would
> benefit from this. Once this is implemented, systemd can call sys_bpf()
> when it is not running as root.

As generally nasty as Linux capabilities are, this sounds like a good
use for CAP_BPF_ADMIN.

But what do you have in mind?  Isn't non-root systemd mostly just the
user systemd session?  That should *not* have bpf() privileges until
bpf() is improved such that you can't use it to compromise the system.

>
> In longer term, it may be useful to provide finer grain permission of
> sys_bpf(). For example, sys_bpf() should be aware of containers; and
> user may only have access to certain bpf maps. Let's call this
> "fine grain" capability.
>
>
> Since we are seeing new use cases every year, we will need many
> iterations to implement the fine grain permission. I think we need an
> API that is flexible enough to cover different types of permission
> control.
>
> For example, bpf_with_cap() can be flexible:
>
>         bpf_with_cap(cmd, attr, size, perm_fd);
>
> We can get different types of permission via different combinations of
> arguments:
>
>     A perm_fd to /dev/bpf gives access to all sys_bpf() commands, so
>     this is "all or nothing" permission.
>
>     A perm_fd to /sys/fs/cgroup/.../bpf.xxx would only allow some
>     commands to this specific cgroup.
>

I don't see why you need to invent a whole new mechanism for this.
The entire cgroup ecosystem outside bpf() does just fine using the
write permission on files in cgroupfs to control access.  Why can't
bpf() do the same thing?

>
> Alexei raised another idea in offline discussions: instead of adding
> bpf_with_cap(), we add a command LOAD_PERM_FD, which enables special
> permission for the _next_ sys_bpf() from current task:
>
>     bpf(LOAD_PERM_FD, perm_fd);
>     /* the next sys_bpf() uses permission from perm_fd */
>     bpf(cmd, attr, size);
>
> This is equivalent to bpf_with_cap(cmd, attr, size, perm_fd), but
> doesn't require the new sys call.

That sounds almost every bit as problematic as the approach where you
ask for permission once and it sticks.

>
> 1. User space only approach doesn't work, even for "all or nothing"
>    permission control. I expanded the discussion in the previous
>    email. Please let me know if I missed anything there.

As in my previous email, I disagree.

^ permalink raw reply

* Re: [PATCH v2 0/3 net-next] Finish conversion of skb_frag_t to bio_vec
From: Jakub Kicinski @ 2019-07-30 20:39 UTC (permalink / raw)
  To: Jonathan Lemon; +Cc: willy, davem, kernel-team, netdev
In-Reply-To: <20190730144034.444022-1-jonathan.lemon@gmail.com>

On Tue, 30 Jul 2019 07:40:31 -0700, Jonathan Lemon wrote:
> The recent conversion of skb_frag_t to bio_vec did not include
> skb_frag's page_offset.  Add accessor functions for this field,
> utilize them, and remove the union, restoring the original structure.

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Thanks!

^ permalink raw reply

* Re: [PATCH v4 1/5] vsock/virtio: limit the memory used per-socket
From: Michael S. Tsirkin @ 2019-07-30 20:42 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: netdev, linux-kernel, Stefan Hajnoczi, David S. Miller,
	virtualization, Jason Wang, kvm
In-Reply-To: <20190730093539.dcksure3vrykir3g@steredhat>

On Tue, Jul 30, 2019 at 11:35:39AM +0200, Stefano Garzarella wrote:
> On Mon, Jul 29, 2019 at 03:10:15PM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jul 29, 2019 at 06:50:56PM +0200, Stefano Garzarella wrote:
> > > On Mon, Jul 29, 2019 at 06:19:03PM +0200, Stefano Garzarella wrote:
> > > > On Mon, Jul 29, 2019 at 11:49:02AM -0400, Michael S. Tsirkin wrote:
> > > > > On Mon, Jul 29, 2019 at 05:36:56PM +0200, Stefano Garzarella wrote:
> > > > > > On Mon, Jul 29, 2019 at 10:04:29AM -0400, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Jul 17, 2019 at 01:30:26PM +0200, Stefano Garzarella wrote:
> > > > > > > > Since virtio-vsock was introduced, the buffers filled by the host
> > > > > > > > and pushed to the guest using the vring, are directly queued in
> > > > > > > > a per-socket list. These buffers are preallocated by the guest
> > > > > > > > with a fixed size (4 KB).
> > > > > > > > 
> > > > > > > > The maximum amount of memory used by each socket should be
> > > > > > > > controlled by the credit mechanism.
> > > > > > > > The default credit available per-socket is 256 KB, but if we use
> > > > > > > > only 1 byte per packet, the guest can queue up to 262144 of 4 KB
> > > > > > > > buffers, using up to 1 GB of memory per-socket. In addition, the
> > > > > > > > guest will continue to fill the vring with new 4 KB free buffers
> > > > > > > > to avoid starvation of other sockets.
> > > > > > > > 
> > > > > > > > This patch mitigates this issue copying the payload of small
> > > > > > > > packets (< 128 bytes) into the buffer of last packet queued, in
> > > > > > > > order to avoid wasting memory.
> > > > > > > > 
> > > > > > > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> > > > > > > 
> > > > > > > This is good enough for net-next, but for net I think we
> > > > > > > should figure out how to address the issue completely.
> > > > > > > Can we make the accounting precise? What happens to
> > > > > > > performance if we do?
> > > > > > > 
> > > > > > 
> > > > > > In order to do more precise accounting maybe we can use the buffer size,
> > > > > > instead of payload size when we update the credit available.
> > > > > > In this way, the credit available for each socket will reflect the memory
> > > > > > actually used.
> > > > > > 
> > > > > > I should check better, because I'm not sure what happen if the peer sees
> > > > > > 1KB of space available, then it sends 1KB of payload (using a 4KB
> > > > > > buffer).
> > > > > > 
> > > > > > The other option is to copy each packet in a new buffer like I did in
> > > > > > the v2 [2], but this forces us to make a copy for each packet that does
> > > > > > not fill the entire buffer, perhaps too expensive.
> > > > > > 
> > > > > > [2] https://patchwork.kernel.org/patch/10938741/
> > > > > > 
> > > > > > 
> > > > > > Thanks,
> > > > > > Stefano
> > > > > 
> > > > > Interesting. You are right, and at some level the protocol forces copies.
> > > > > 
> > > > > We could try to detect that the actual memory is getting close to
> > > > > admin limits and force copies on queued packets after the fact.
> > > > > Is that practical?
> > > > 
> > > > Yes, I think it is doable!
> > > > We can decrease the credit available with the buffer size queued, and
> > > > when the buffer size of packet to queue is bigger than the credit
> > > > available, we can copy it.
> > > > 
> > > > > 
> > > > > And yes we can extend the credit accounting to include buffer size.
> > > > > That's a protocol change but maybe it makes sense.
> > > > 
> > > > Since we send to the other peer the credit available, maybe this
> > > > change can be backwards compatible (I'll check better this).
> > > 
> > > What I said was wrong.
> > > 
> > > We send a counter (increased when the user consumes the packets) and the
> > > "buf_alloc" (the max memory allowed) to the other peer.
> > > It makes a difference between a local counter (increased when the
> > > packets are sent) and the remote counter to calculate the credit available:
> > > 
> > >     u32 virtio_transport_get_credit(struct virtio_vsock_sock *vvs, u32 credit)
> > >     {
> > >     	u32 ret;
> > > 
> > >     	spin_lock_bh(&vvs->tx_lock);
> > >     	ret = vvs->peer_buf_alloc - (vvs->tx_cnt - vvs->peer_fwd_cnt);
> > >     	if (ret > credit)
> > >     		ret = credit;
> > >     	vvs->tx_cnt += ret;
> > >     	spin_unlock_bh(&vvs->tx_lock);
> > > 
> > >     	return ret;
> > >     }
> > > 
> > > Maybe I can play with "buf_alloc" to take care of bytes queued but not
> > > used.
> > > 
> > > Thanks,
> > > Stefano
> > 
> > Right. And the idea behind it all was that if we send a credit
> > to remote then we have space for it.
> 
> Yes.
> 
> > I think the basic idea was that if we have actual allocated
> > memory and can copy data there, then we send the credit to
> > remote.
> > 
> > Of course that means an extra copy every packet.
> > So as an optimization, it seems that we just assume
> > that we will be able to allocate a new buffer.
> 
> Yes, we refill the virtqueue when half of the buffers were used.
> 
> > 
> > First this is not the best we can do. We can actually do
> > allocate memory in the socket before sending credit.
> 
> In this case, IIUC we should allocate an entire buffer (4KB),
> so we can reuse it if the packet is big.
> 
> > If packet is small then we copy it there.
> > If packet is big then we queue the packet,
> > take the buffer out of socket and add it to the virtqueue.
> > 
> > Second question is what to do about medium sized packets.
> > Packet is 1K but buffer is 4K, what do we do?
> > And here I wonder - why don't we add the 3K buffer
> > to the vq?
> 
> This would allow us to have an accurate credit account.
> 
> The problem here is the compatibility. Before this series virtio-vsock
> and vhost-vsock modules had the RX buffer size hard-coded
> (VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE = 4K). So, if we send a buffer smaller
> of 4K, there might be issues.

Shouldn't be if they are following the spec. If not let's fix
the broken parts.

> 
> Maybe it is the time to add add 'features' to virtio-vsock device.
> 
> Thanks,
> Stefano

Why would a remote care about buffer sizes?

Let's first see what the issues are. If they exist
we can either fix the bugs, or code the bug as a feature in spec.

-- 
MST

^ permalink raw reply

* Re: [PATCH 4/4] net: dsa: mv88e6xxx: add PTP support for MV88E6250 family
From: Vladimir Oltean @ 2019-07-30 20:46 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Hubert Feurstein, netdev, lkml, Andrew Lunn, Vivien Didelot,
	Florian Fainelli, David S. Miller, Rasmus Villemoes
In-Reply-To: <20190730171246.GB1251@localhost>

Hi Hubert, Richard,

On Tue, 30 Jul 2019 at 19:44, Richard Cochran <richardcochran@gmail.com> wrote:
>
> On Tue, Jul 30, 2019 at 12:04:29PM +0200, Hubert Feurstein wrote:
> > diff --git a/drivers/net/dsa/mv88e6xxx/ptp.c b/drivers/net/dsa/mv88e6xxx/ptp.c
> > index 768d256f7c9f..51cdf4712517 100644
> > --- a/drivers/net/dsa/mv88e6xxx/ptp.c
> > +++ b/drivers/net/dsa/mv88e6xxx/ptp.c
> > @@ -15,11 +15,38 @@
> >  #include "hwtstamp.h"
> >  #include "ptp.h"
> >
> > -/* Raw timestamps are in units of 8-ns clock periods. */
> > -#define CC_SHIFT     28
> > -#define CC_MULT              (8 << CC_SHIFT)
> > -#define CC_MULT_NUM  (1 << 9)
> > -#define CC_MULT_DEM  15625ULL
> > +/* The adjfine API clamps ppb between [-32,768,000, 32,768,000], and
>
> That is not true.
>

I was referring to this:
https://github.com/richardcochran/linuxptp/blob/master/phc.c#L38

/*
* On 32 bit platforms, the PHC driver's maximum adjustment (type
* 'int' in units of ppb) can overflow the timex.freq field (type
* 'long'). So in this case we clamp the maximum to the largest
* possible adjustment that fits into a 32 bit long.
*/
#define BITS_PER_LONG (sizeof(long)*8)
#define MAX_PPB_32 32767999 /* 2^31 - 1 / 65.536 */

Technically it is not "not true".

[snip]

On Tue, 30 Jul 2019 at 21:09, Richard Cochran <richardcochran@gmail.com> wrote:
>
> On Tue, Jul 30, 2019 at 06:20:00PM +0200, Hubert Feurstein wrote:
> > > Please don't re-write this logic.  It is written like that for a reason.
> > I used the sja1105_ptp.c as a reference. So it is also wrong there.
>
> I'll let that driver's author worry about that.
>
> Thanks,
> Richard
>

And what is the reason for the neg_adj thing? Can you give an example
of when does the "normal way" of doing signed arithmetics not work?

Thanks,
-Vladimir

^ permalink raw reply

* Re: [PATCH v2 0/3 net-next] Finish conversion of skb_frag_t to bio_vec
From: Jens Axboe @ 2019-07-30 20:49 UTC (permalink / raw)
  To: Jonathan Lemon, willy@infradead.org, davem@davemloft.net,
	jakub.kicinski@netronome.com
  Cc: Kernel Team, netdev@vger.kernel.org
In-Reply-To: <20190730144034.444022-1-jonathan.lemon@gmail.com>

On 7/30/19 8:40 AM, Jonathan Lemon wrote:
> The recent conversion of skb_frag_t to bio_vec did not include
> skb_frag's page_offset.  Add accessor functions for this field,
> utilize them, and remove the union, restoring the original structure.

You can add:

Reviewed-by: Jens Axboe <axboe@kernel.dk>

Pretty appalled to see this abomination:

net: Convert skb_frag_t to bio_vec

There are a lot of users of frag->page_offset, so use a union
to avoid converting those users today.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

show up in the net tree without even having been posted on a
block list...

At least this kills this ugly thing.

-- 
Jens Axboe

^ permalink raw reply

* [PATCH v4 3/3] net/xdp: convert put_page() to put_user_page*()
From: john.hubbard @ 2019-07-30 20:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Christian Benvenuti, Christoph Hellwig, Dan Williams,
	Darrick J . Wong, Dave Chinner, Ira Weiny, Jan Kara,
	Jason Gunthorpe, Jens Axboe, Jerome Glisse, Kirill A . Shutemov,
	Matthew Wilcox, Michal Hocko, Mike Marciniszyn, Mike Rapoport,
	linux-block, linux-fsdevel, linux-mm, linux-xfs, LKML,
	John Hubbard, Björn Töpel, Magnus Karlsson,
	David S . Miller, netdev
In-Reply-To: <20190730205705.9018-1-jhubbard@nvidia.com>

From: John Hubbard <jhubbard@nvidia.com>

For pages that were retained via get_user_pages*(), release those pages
via the new put_user_page*() routines, instead of via put_page() or
release_pages().

This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
("mm: introduce put_user_page*(), placeholder versions").

Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 net/xdp/xdp_umem.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 83de74ca729a..17c4b3d3dc34 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -166,14 +166,7 @@ void xdp_umem_clear_dev(struct xdp_umem *umem)
 
 static void xdp_umem_unpin_pages(struct xdp_umem *umem)
 {
-	unsigned int i;
-
-	for (i = 0; i < umem->npgs; i++) {
-		struct page *page = umem->pgs[i];
-
-		set_page_dirty_lock(page);
-		put_page(page);
-	}
+	put_user_pages_dirty_lock(umem->pgs, umem->npgs, true);
 
 	kfree(umem->pgs);
 	umem->pgs = NULL;
-- 
2.22.0


^ permalink raw reply related

* Re: [PATCH bpf-next] tools: bpftool: add support for reporting the effective cgroup progs
From: Jakub Kicinski @ 2019-07-30 21:00 UTC (permalink / raw)
  To: Takshak Chahande
  Cc: alexei.starovoitov@gmail.com, daniel@iogearbox.net,
	netdev@vger.kernel.org, bpf@vger.kernel.org,
	oss-drivers@netronome.com, Kernel Team, Quentin Monnet
In-Reply-To: <20190730180443.GA48276@ctakshak-mbp.dhcp.thefacebook.com>

On Tue, 30 Jul 2019 18:04:53 +0000, Takshak Chahande wrote:
> Jakub Kicinski <jakub.kicinski@netronome.com> wrote on Mon [2019-Jul-29 14:35:38 -0700]:
> > @@ -158,20 +161,30 @@ static int show_attached_bpf_progs(int cgroup_fd, enum bpf_attach_type type,
> >  static int do_show(int argc, char **argv)
> >  {
> >  	enum bpf_attach_type type;
> > +	const char *path;
> >  	int cgroup_fd;
> >  	int ret = -1;
> >  
> > -	if (argc < 1) {
> > -		p_err("too few parameters for cgroup show");
> > -		goto exit;
> > -	} else if (argc > 1) {
> > -		p_err("too many parameters for cgroup show");
> > -		goto exit;
> > +	query_flags = 0;
> > +
> > +	if (!REQ_ARGS(1))
> > +		return -1;
> > +	path = GET_ARG();
> > +
> > +	while (argc) {
> > +		if (is_prefix(*argv, "effective")) {
> > +			query_flags |= BPF_F_QUERY_EFFECTIVE;
> > +			NEXT_ARG();
> > +		} else {
> > +			p_err("expected no more arguments, 'effective', got: '%s'?",
> > +			      *argv);
> > +			return -1;
> > +		}
> >  	}  
> This while loop will allow multiple 'effective' keywords in the argument
> unnecessarily. IMO, we should strictly restrict only for single
> occurance of 'effective' word.

It's kind of the way all bpftool works to date :(

But perhaps not checking is worse than inconsistency? Okay, let's fix
this up.

> > -	cgroup_fd = open(argv[0], O_RDONLY);
> > +	cgroup_fd = open(path, O_RDONLY);
> >  	if (cgroup_fd < 0) {
> > -		p_err("can't open cgroup %s", argv[0]);
> > +		p_err("can't open cgroup %s", path);
> >  		goto exit;
> >  	}
> >  
> > @@ -297,23 +310,29 @@ static int do_show_tree(int argc, char **argv)
> >  	char *cgroup_root;
> >  	int ret;
> >  
> > -	switch (argc) {
> > -	case 0:
> > +	query_flags = 0;
> > +
> > +	if (!argc) {
> >  		cgroup_root = find_cgroup_root();
> >  		if (!cgroup_root) {
> >  			p_err("cgroup v2 isn't mounted");
> >  			return -1;
> >  		}
> > -		break;
> > -	case 1:
> > -		cgroup_root = argv[0];
> > -		break;
> > -	default:
> > -		p_err("too many parameters for cgroup tree");
> > -		return -1;
> > +	} else {
> > +		cgroup_root = GET_ARG();
> > +
> > +		while (argc) {
> > +			if (is_prefix(*argv, "effective")) {
> > +				query_flags |= BPF_F_QUERY_EFFECTIVE;
> > +				NEXT_ARG();  
> 
> NEXT_ARG() does update argc value; that means after this outer if/else we need 
> to know how argc has become 0 (through which path) before freeing up `cgroup_root` allocated
> memory later at the end of this function.

Good catch!

> > +			} else {
> > +				p_err("expected no more arguments, 'effective', got: '%s'?",
> > +				      *argv);
> > +				return -1;
> > +			}
> > +		}
> >  	}  
 
> Thanks for the patch. Apart from above two issues, patch looks good.

Thanks for the review.

^ permalink raw reply

* Re: [patch net-next 3/3] netdevsim: create devlink and netdev instances in namespace
From: Jiri Pirko @ 2019-07-30 21:03 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, davem, sthemmin, dsahern, mlxsw
In-Reply-To: <20190730101411.7dc1e83d@cakuba.netronome.com>

Tue, Jul 30, 2019 at 07:14:11PM CEST, jakub.kicinski@netronome.com wrote:
>On Tue, 30 Jul 2019 08:06:55 +0200, Jiri Pirko wrote:
>> >> diff --git a/drivers/net/netdevsim/netdevsim.h b/drivers/net/netdevsim/netdevsim.h
>> >> index 79c05af2a7c0..cdf53d0e0c49 100644
>> >> --- a/drivers/net/netdevsim/netdevsim.h
>> >> +++ b/drivers/net/netdevsim/netdevsim.h
>> >> @@ -19,6 +19,7 @@
>> >>  #include <linux/netdevice.h>
>> >>  #include <linux/u64_stats_sync.h>
>> >>  #include <net/devlink.h>
>> >> +#include <net/net_namespace.h>  
>> >
>> >You can just do a forward declaration, no need to pull in the header.  
>> 
>> Sure, but why?
>
>Less time to compile the kernel after net_namespace.h was touched.
>Don't we all spend more time that we would like to recompiling the
>kernel? :(  Not a huge deal if you have a strong preference.

I removed it in v2. I don't care that much either :)

^ permalink raw reply

* [PATCH bpf-next v2] tools: bpftool: add support for reporting the effective cgroup progs
From: Jakub Kicinski @ 2019-07-30 21:03 UTC (permalink / raw)
  To: alexei.starovoitov, daniel
  Cc: netdev, bpf, oss-drivers, ctakshak, kernel-team, Jakub Kicinski,
	Quentin Monnet

Takshak said in the original submission:

With different bpf attach_flags available to attach bpf programs specially
with BPF_F_ALLOW_OVERRIDE and BPF_F_ALLOW_MULTI, the list of effective
bpf-programs available to any sub-cgroups really needs to be available for
easy debugging.

Using BPF_F_QUERY_EFFECTIVE flag, one can get the list of not only attached
bpf-programs to a cgroup but also the inherited ones from parent cgroup.

So a new option is introduced to use BPF_F_QUERY_EFFECTIVE query flag here
to list all the effective bpf-programs available for execution at a specified
cgroup.

Reused modified test program test_cgroup_attach from tools/testing/selftests/bpf:
  # ./test_cgroup_attach

With old bpftool:

 # bpftool cgroup show /sys/fs/cgroup/cgroup-test-work-dir/cg1/
  ID       AttachType      AttachFlags     Name
  271      egress          multi           pkt_cntr_1
  272      egress          multi           pkt_cntr_2

Attached new program pkt_cntr_4 in cg2 gives following:

 # bpftool cgroup show /sys/fs/cgroup/cgroup-test-work-dir/cg1/cg2
  ID       AttachType      AttachFlags     Name
  273      egress          override        pkt_cntr_4

And with new "effective" option it shows all effective programs for cg2:

 # bpftool cgroup show /sys/fs/cgroup/cgroup-test-work-dir/cg1/cg2 effective
  ID       AttachType      AttachFlags     Name
  273      egress          override        pkt_cntr_4
  271      egress          override        pkt_cntr_1
  272      egress          override        pkt_cntr_2

Compared to original submission use a local flag instead of global
option.

We need to clear query_flags on every command, in case batch mode
wants to use varying settings.

v2: (Takshak)
 - forbid duplicated flags;
 - fix cgroup path freeing.

Signed-off-by: Takshak Chahande <ctakshak@fb.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
---
 .../bpftool/Documentation/bpftool-cgroup.rst  | 16 +++-
 tools/bpf/bpftool/bash-completion/bpftool     | 15 ++--
 tools/bpf/bpftool/cgroup.c                    | 83 ++++++++++++-------
 3 files changed, 76 insertions(+), 38 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
index 585f270c2d25..06a28b07787d 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
@@ -20,8 +20,8 @@ SYNOPSIS
 CGROUP COMMANDS
 ===============
 
-|	**bpftool** **cgroup { show | list }** *CGROUP*
-|	**bpftool** **cgroup tree** [*CGROUP_ROOT*]
+|	**bpftool** **cgroup { show | list }** *CGROUP* [**effective**]
+|	**bpftool** **cgroup tree** [*CGROUP_ROOT*] [**effective**]
 |	**bpftool** **cgroup attach** *CGROUP* *ATTACH_TYPE* *PROG* [*ATTACH_FLAGS*]
 |	**bpftool** **cgroup detach** *CGROUP* *ATTACH_TYPE* *PROG*
 |	**bpftool** **cgroup help**
@@ -35,13 +35,17 @@ CGROUP COMMANDS
 
 DESCRIPTION
 ===========
-	**bpftool cgroup { show | list }** *CGROUP*
+	**bpftool cgroup { show | list }** *CGROUP* [**effective**]
 		  List all programs attached to the cgroup *CGROUP*.
 
 		  Output will start with program ID followed by attach type,
 		  attach flags and program name.
 
-	**bpftool cgroup tree** [*CGROUP_ROOT*]
+		  If **effective** is specified retrieve effective programs that
+		  will execute for events within a cgroup. This includes
+		  inherited along with attached ones.
+
+	**bpftool cgroup tree** [*CGROUP_ROOT*] [**effective**]
 		  Iterate over all cgroups in *CGROUP_ROOT* and list all
 		  attached programs. If *CGROUP_ROOT* is not specified,
 		  bpftool uses cgroup v2 mountpoint.
@@ -50,6 +54,10 @@ DESCRIPTION
 		  commands: it starts with absolute cgroup path, followed by
 		  program ID, attach type, attach flags and program name.
 
+		  If **effective** is specified retrieve effective programs that
+		  will execute for events within a cgroup. This includes
+		  inherited along with attached ones.
+
 	**bpftool cgroup attach** *CGROUP* *ATTACH_TYPE* *PROG* [*ATTACH_FLAGS*]
 		  Attach program *PROG* to the cgroup *CGROUP* with attach type
 		  *ATTACH_TYPE* and optional *ATTACH_FLAGS*.
diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
index 6b961a5ed100..df16c5415444 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -710,12 +710,15 @@ _bpftool()
             ;;
         cgroup)
             case $command in
-                show|list)
-                    _filedir
-                    return 0
-                    ;;
-                tree)
-                    _filedir
+                show|list|tree)
+                    case $cword in
+                        3)
+                            _filedir
+                            ;;
+                        4)
+                            COMPREPLY=( $( compgen -W 'effective' -- "$cur" ) )
+                            ;;
+                    esac
                     return 0
                     ;;
                 attach|detach)
diff --git a/tools/bpf/bpftool/cgroup.c b/tools/bpf/bpftool/cgroup.c
index f3c05b08c68c..44352b5aca85 100644
--- a/tools/bpf/bpftool/cgroup.c
+++ b/tools/bpf/bpftool/cgroup.c
@@ -29,6 +29,8 @@
 	"                        recvmsg4 | recvmsg6 | sysctl |\n"	       \
 	"                        getsockopt | setsockopt }"
 
+static unsigned int query_flags;
+
 static const char * const attach_type_strings[] = {
 	[BPF_CGROUP_INET_INGRESS] = "ingress",
 	[BPF_CGROUP_INET_EGRESS] = "egress",
@@ -107,7 +109,8 @@ static int count_attached_bpf_progs(int cgroup_fd, enum bpf_attach_type type)
 	__u32 prog_cnt = 0;
 	int ret;
 
-	ret = bpf_prog_query(cgroup_fd, type, 0, NULL, NULL, &prog_cnt);
+	ret = bpf_prog_query(cgroup_fd, type, query_flags, NULL,
+			     NULL, &prog_cnt);
 	if (ret)
 		return -1;
 
@@ -125,8 +128,8 @@ static int show_attached_bpf_progs(int cgroup_fd, enum bpf_attach_type type,
 	int ret;
 
 	prog_cnt = ARRAY_SIZE(prog_ids);
-	ret = bpf_prog_query(cgroup_fd, type, 0, &attach_flags, prog_ids,
-			     &prog_cnt);
+	ret = bpf_prog_query(cgroup_fd, type, query_flags, &attach_flags,
+			     prog_ids, &prog_cnt);
 	if (ret)
 		return ret;
 
@@ -158,20 +161,34 @@ static int show_attached_bpf_progs(int cgroup_fd, enum bpf_attach_type type,
 static int do_show(int argc, char **argv)
 {
 	enum bpf_attach_type type;
+	const char *path;
 	int cgroup_fd;
 	int ret = -1;
 
-	if (argc < 1) {
-		p_err("too few parameters for cgroup show");
-		goto exit;
-	} else if (argc > 1) {
-		p_err("too many parameters for cgroup show");
-		goto exit;
+	query_flags = 0;
+
+	if (!REQ_ARGS(1))
+		return -1;
+	path = GET_ARG();
+
+	while (argc) {
+		if (is_prefix(*argv, "effective")) {
+			if (query_flags & BPF_F_QUERY_EFFECTIVE) {
+				p_err("duplicated argument: %s", *argv);
+				return -1;
+			}
+			query_flags |= BPF_F_QUERY_EFFECTIVE;
+			NEXT_ARG();
+		} else {
+			p_err("expected no more arguments, 'effective', got: '%s'?",
+			      *argv);
+			return -1;
+		}
 	}
 
-	cgroup_fd = open(argv[0], O_RDONLY);
+	cgroup_fd = open(path, O_RDONLY);
 	if (cgroup_fd < 0) {
-		p_err("can't open cgroup %s", argv[0]);
+		p_err("can't open cgroup %s", path);
 		goto exit;
 	}
 
@@ -294,26 +311,37 @@ static char *find_cgroup_root(void)
 
 static int do_show_tree(int argc, char **argv)
 {
-	char *cgroup_root;
+	char *cgroup_root, *cgroup_alloced = NULL;
 	int ret;
 
-	switch (argc) {
-	case 0:
-		cgroup_root = find_cgroup_root();
-		if (!cgroup_root) {
+	query_flags = 0;
+
+	if (!argc) {
+		cgroup_alloced = find_cgroup_root();
+		if (!cgroup_alloced) {
 			p_err("cgroup v2 isn't mounted");
 			return -1;
 		}
-		break;
-	case 1:
-		cgroup_root = argv[0];
-		break;
-	default:
-		p_err("too many parameters for cgroup tree");
-		return -1;
+		cgroup_root = cgroup_alloced;
+	} else {
+		cgroup_root = GET_ARG();
+
+		while (argc) {
+			if (is_prefix(*argv, "effective")) {
+				if (query_flags & BPF_F_QUERY_EFFECTIVE) {
+					p_err("duplicated argument: %s", *argv);
+					return -1;
+				}
+				query_flags |= BPF_F_QUERY_EFFECTIVE;
+				NEXT_ARG();
+			} else {
+				p_err("expected no more arguments, 'effective', got: '%s'?",
+				      *argv);
+				return -1;
+			}
+		}
 	}
 
-
 	if (json_output)
 		jsonw_start_array(json_wtr);
 	else
@@ -338,8 +366,7 @@ static int do_show_tree(int argc, char **argv)
 	if (json_output)
 		jsonw_end_array(json_wtr);
 
-	if (argc == 0)
-		free(cgroup_root);
+	free(cgroup_alloced);
 
 	return ret;
 }
@@ -459,8 +486,8 @@ static int do_help(int argc, char **argv)
 	}
 
 	fprintf(stderr,
-		"Usage: %s %s { show | list } CGROUP\n"
-		"       %s %s tree [CGROUP_ROOT]\n"
+		"Usage: %s %s { show | list } CGROUP [**effective**]\n"
+		"       %s %s tree [CGROUP_ROOT] [**effective**]\n"
 		"       %s %s attach CGROUP ATTACH_TYPE PROG [ATTACH_FLAGS]\n"
 		"       %s %s detach CGROUP ATTACH_TYPE PROG\n"
 		"       %s %s help\n"
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH] net: usb: pegasus: fix improper read if get_registers() fail
From: David Miller @ 2019-07-30 21:10 UTC (permalink / raw)
  To: kirjanov; +Cc: kda, petkan, netdev
In-Reply-To: <CAHj3AVm2EZB7n9UBxiBmA+6XN+EgAC_FRoHjh6kO3WMT8KVd6g@mail.gmail.com>

From: Denis Kirjanov <kirjanov@gmail.com>
Date: Tue, 30 Jul 2019 21:19:46 +0300

> On Tuesday, July 30, 2019, David Miller <davem@davemloft.net> wrote:
> 
>> From: Denis Kirjanov <kda@linux-powerpc.org>
>> Date: Tue, 30 Jul 2019 15:13:57 +0200
>>
>> > get_registers() may fail with -ENOMEM and in this
>> > case we can read a garbage from the status variable tmp.
>> >
>> > Reported-by: syzbot+3499a83b2d062ae409d4@syzkaller.appspotmail.com
>> > Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
>>
>> Why did you post this patch twice?  What is different between the two
>> versions?
> 
> 
> Looks like it’s the issue with git send-email :/
> Do you want me to figure out the reason and resend?

No need, I was just curious.

^ permalink raw reply

* [PATCH net-next] net/tls: prevent skb_orphan() from leaking TLS plain text with offload
From: Jakub Kicinski @ 2019-07-30 21:12 UTC (permalink / raw)
  To: davem
  Cc: netdev, oss-drivers, edumazet, borisp, aviadye, davejwatson,
	john.fastabend, daniel, willemb, xiyou.wangcong, fw,
	alexei.starovoitov, Jakub Kicinski

sk_validate_xmit_skb() and drivers depend on the sk member of
struct sk_buff to identify segments requiring encryption.
Any operation which removes or does not preserve the original TLS
socket such as skb_orphan() or skb_clone() will cause clear text
leaks.

Make the TCP socket underlying an offloaded TLS connection
mark all skbs as decrypted, if TLS TX is in offload mode.
Then in sk_validate_xmit_skb() catch skbs which have no socket
(or a socket with no validation) and decrypted flag set.

Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
sk->sk_validate_xmit_skb are slightly interchangeable right now,
they all imply TLS offload. The new checks are guarded by
CONFIG_TLS_DEVICE because that's the option guarding the
sk_buff->decrypted member.

Second, smaller issue with orphaning is that it breaks
the guarantee that packets will be delivered to device
queues in-order. All TLS offload drivers depend on that
scheduling property. This means skb_orphan_partial()'s
trick of preserving partial socket references will cause
issues in the drivers. We need a full orphan, and as a
result netem delay/throttling will cause all TLS offload
skbs to be dropped.

Reusing the sk_buff->decrypted flag also protects from
leaking clear text when incoming, decrypted skb is redirected
(e.g. by TC).

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
I'm sending this for net-next because of lack of confidence
in my own abilities. It should apply cleanly to net... :)

 Documentation/networking/tls-offload.rst |  9 --------
 include/net/sock.h                       | 28 +++++++++++++++++++++++-
 net/core/skbuff.c                        |  3 +++
 net/core/sock.c                          | 22 ++++++++++++++-----
 net/ipv4/tcp.c                           |  4 +++-
 net/ipv4/tcp_output.c                    |  3 +++
 net/tls/tls_device.c                     |  2 ++
 7 files changed, 55 insertions(+), 16 deletions(-)

diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst
index 048e5ca44824..2bc3ab5515d8 100644
--- a/Documentation/networking/tls-offload.rst
+++ b/Documentation/networking/tls-offload.rst
@@ -499,15 +499,6 @@ offloads, old connections will remain active after flags are cleared.
 Known bugs
 ==========
 
-skb_orphan() leaks clear text
------------------------------
-
-Currently drivers depend on the :c:member:`sk` member of
-:c:type:`struct sk_buff <sk_buff>` to identify segments requiring
-encryption. Any operation which removes or does not preserve the socket
-association such as :c:func:`skb_orphan` or :c:func:`skb_clone`
-will cause the driver to miss the packets and lead to clear text leaks.
-
 Redirects leak clear text
 -------------------------
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 228db3998e46..90f3f552c789 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -814,6 +814,7 @@ enum sock_flags {
 	SOCK_TXTIME,
 	SOCK_XDP, /* XDP is attached */
 	SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
+	SOCK_CRYPTO_TX_PLAIN_TEXT, /* Generate skbs with decrypted flag set */
 };
 
 #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
@@ -2480,8 +2481,26 @@ static inline bool sk_fullsock(const struct sock *sk)
 	return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
 }
 
+static inline void sk_set_tx_crypto(const struct sock *sk, struct sk_buff *skb)
+{
+#ifdef CONFIG_TLS_DEVICE
+	skb->decrypted = sock_flag(sk, SOCK_CRYPTO_TX_PLAIN_TEXT);
+#endif
+}
+
+static inline bool sk_tx_crypto_match(const struct sock *sk,
+				      const struct sk_buff *skb)
+{
+#ifdef CONFIG_TLS_DEVICE
+	return sock_flag(sk, SOCK_CRYPTO_TX_PLAIN_TEXT) == !!skb->decrypted;
+#else
+	return true;
+#endif
+}
+
 /* Checks if this SKB belongs to an HW offloaded socket
  * and whether any SW fallbacks are required based on dev.
+ * Check decrypted mark in case skb_orphan() cleared socket.
  */
 static inline struct sk_buff *sk_validate_xmit_skb(struct sk_buff *skb,
 						   struct net_device *dev)
@@ -2489,8 +2508,15 @@ static inline struct sk_buff *sk_validate_xmit_skb(struct sk_buff *skb,
 #ifdef CONFIG_SOCK_VALIDATE_XMIT
 	struct sock *sk = skb->sk;
 
-	if (sk && sk_fullsock(sk) && sk->sk_validate_xmit_skb)
+	if (sk && sk_fullsock(sk) && sk->sk_validate_xmit_skb) {
 		skb = sk->sk_validate_xmit_skb(sk, dev, skb);
+#ifdef CONFIG_TLS_DEVICE
+	} else if (unlikely(skb->decrypted)) {
+		pr_warn_ratelimited("unencrypted skb with no associated socket - dropping\n");
+		kfree_skb(skb);
+		skb = NULL;
+#endif
+	}
 #endif
 
 	return skb;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0b788df5a75b..9e92684479b9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3794,6 +3794,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
 			skb_reserve(nskb, headroom);
 			__skb_put(nskb, doffset);
+#ifdef CONFIG_TLS_DEVICE
+			nskb->decrypted = head_skb->decrypted;
+#endif
 		}
 
 		if (segs)
diff --git a/net/core/sock.c b/net/core/sock.c
index d57b0cc995a0..b0c10b518e65 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1992,6 +1992,22 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 }
 EXPORT_SYMBOL(skb_set_owner_w);
 
+static bool can_skb_orphan_partial(const struct sk_buff *skb)
+{
+#ifdef CONFIG_TLS_DEVICE
+	/* Drivers depend on in-order delivery for crypto offload,
+	 * partial orphan breaks out-of-order-OK logic.
+	 */
+	if (skb->decrypted)
+		return false;
+#endif
+#ifdef CONFIG_INET
+	if (skb->destructor == tcp_wfree)
+		return true;
+#endif
+	return skb->destructor == sock_wfree;
+}
+
 /* This helper is used by netem, as it can hold packets in its
  * delay queue. We want to allow the owner socket to send more
  * packets, as if they were already TX completed by a typical driver.
@@ -2003,11 +2019,7 @@ void skb_orphan_partial(struct sk_buff *skb)
 	if (skb_is_tcp_pure_ack(skb))
 		return;
 
-	if (skb->destructor == sock_wfree
-#ifdef CONFIG_INET
-	    || skb->destructor == tcp_wfree
-#endif
-		) {
+	if (can_skb_orphan_partial(skb)) {
 		struct sock *sk = skb->sk;
 
 		if (refcount_inc_not_zero(&sk->sk_refcnt)) {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f62f0e7e3cdd..dee93eab02fe 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -984,6 +984,7 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 			if (!skb)
 				goto wait_for_memory;
 
+			sk_set_tx_crypto(sk, skb);
 			skb_entail(sk, skb);
 			copy = size_goal;
 		}
@@ -993,7 +994,8 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 
 		i = skb_shinfo(skb)->nr_frags;
 		can_coalesce = skb_can_coalesce(skb, i, page, offset);
-		if (!can_coalesce && i >= sysctl_max_skb_frags) {
+		if ((!can_coalesce && i >= sysctl_max_skb_frags) ||
+		    !sk_tx_crypto_match(sk, skb)) {
 			tcp_mark_push(tp, skb);
 			goto new_segment;
 		}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6e4afc48d7bb..9efd0ca44d49 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1320,6 +1320,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
 	buff = sk_stream_alloc_skb(sk, nsize, gfp, true);
 	if (!buff)
 		return -ENOMEM; /* We'll just try again later. */
+	sk_set_tx_crypto(sk, buff);
 
 	sk->sk_wmem_queued += buff->truesize;
 	sk_mem_charge(sk, buff->truesize);
@@ -1874,6 +1875,7 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
 	buff = sk_stream_alloc_skb(sk, 0, gfp, true);
 	if (unlikely(!buff))
 		return -ENOMEM;
+	sk_set_tx_crypto(sk, buff);
 
 	sk->sk_wmem_queued += buff->truesize;
 	sk_mem_charge(sk, buff->truesize);
@@ -2139,6 +2141,7 @@ static int tcp_mtu_probe(struct sock *sk)
 	nskb = sk_stream_alloc_skb(sk, probe_size, GFP_ATOMIC, false);
 	if (!nskb)
 		return -1;
+	sk_set_tx_crypto(sk, nskb);
 	sk->sk_wmem_queued += nskb->truesize;
 	sk_mem_charge(sk, nskb->truesize);
 
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 4ec8a06fa5d1..3d78742b3b1b 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -970,6 +970,8 @@ int tls_set_device_offload(struct sock *sk, struct tls_context *ctx)
 
 	tls_device_attach(ctx, sk, netdev);
 
+	sock_set_flag(sk, SOCK_CRYPTO_TX_PLAIN_TEXT);
+
 	/* following this assignment tls_is_sk_tx_device_offloaded
 	 * will return true and the context might be accessed
 	 * by the netdev's xmit function.
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCHv2 net-next 0/5] sctp: clean up __sctp_connect function
From: David Miller @ 2019-07-30 21:18 UTC (permalink / raw)
  To: marcelo.leitner; +Cc: lucien.xin, netdev, linux-sctp, nhorman
In-Reply-To: <20190730194231.GE4064@localhost.localdomain>

From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Date: Tue, 30 Jul 2019 16:42:31 -0300

> On Tue, Jul 30, 2019 at 08:38:18PM +0800, Xin Long wrote:
>> This patchset is to factor out some common code for
>> sctp_sendmsg_new_asoc() and __sctp_connect() into 2
>> new functioins.
>> 
>> v1->v2:
>>   - add the patch 1/5 to avoid a slab-out-of-bounds warning.
>>   - add some code comment for the check change in patch 2/5.
>>   - remove unused 'addrcnt' as Marcelo noticed in patch 3/5.
>> 
>> Xin Long (5):
>>   sctp: only copy the available addr data in sctp_transport_init
>>   sctp: check addr_size with sa_family_t size in
>>     __sctp_setsockopt_connectx
>>   sctp: clean up __sctp_connect
>>   sctp: factor out sctp_connect_new_asoc
>>   sctp: factor out sctp_connect_add_peer
> 
> Series,
> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 01/12] libbpf: add .BTF.ext offset relocation section loading
From: Song Liu @ 2019-07-30 21:19 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann,
	Yonghong Song, Andrii Nakryiko, Kernel Team
In-Reply-To: <20190730195408.670063-2-andriin@fb.com>



> On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
> 
> Add support for BPF CO-RE offset relocations. Add section/record
> iteration macros for .BTF.ext. These macro are useful for iterating over
> each .BTF.ext record, either for dumping out contents or later for BPF
> CO-RE relocation handling.
> 
> To enable other parts of libbpf to work with .BTF.ext contents, moved
> a bunch of type definitions into libbpf_internal.h.
> 
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>

Acked-by: Song Liu <songliubraving@fb.com>

^ permalink raw reply

* Re: [PATCH v2 0/3 net-next] Finish conversion of skb_frag_t to bio_vec
From: David Miller @ 2019-07-30 21:22 UTC (permalink / raw)
  To: jonathan.lemon; +Cc: willy, jakub.kicinski, kernel-team, netdev
In-Reply-To: <20190730144034.444022-1-jonathan.lemon@gmail.com>

From: Jonathan Lemon <jonathan.lemon@gmail.com>
Date: Tue, 30 Jul 2019 07:40:31 -0700

> The recent conversion of skb_frag_t to bio_vec did not include
> skb_frag's page_offset.  Add accessor functions for this field,
> utilize them, and remove the union, restoring the original structure.
> 
> v2:
>   - rename accessors
>   - follow kdoc conventions

Series applied, thanks Jonathan.

^ permalink raw reply

* Re: [PATCH v2 0/3 net-next] Finish conversion of skb_frag_t to bio_vec
From: David Miller @ 2019-07-30 21:22 UTC (permalink / raw)
  To: axboe; +Cc: jonathan.lemon, willy, jakub.kicinski, Kernel-team, netdev
In-Reply-To: <1d34658b-a807-44ae-756a-d55dead27f94@fb.com>

From: Jens Axboe <axboe@fb.com>
Date: Tue, 30 Jul 2019 20:49:09 +0000

> Pretty appalled to see this abomination:
> 
> net: Convert skb_frag_t to bio_vec
> 
> There are a lot of users of frag->page_offset, so use a union
> to avoid converting those users today.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> show up in the net tree without even having been posted on a
> block list...
> 
> At least this kills this ugly thing.

Sorry about that Jens, but at least as you say it's gone now.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 03/12] selftests/bpf: add BPF_CORE_READ relocatable read macro
From: Song Liu @ 2019-07-30 21:24 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Networking, Alexei Starovoitov, daniel@iogearbox.net,
	Yonghong Song, andrii.nakryiko@gmail.com, Kernel Team
In-Reply-To: <20190730195408.670063-4-andriin@fb.com>



> On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
> 
> Add BPF_CORE_READ macro used in tests to do bpf_core_read(), which
> automatically captures offset relocation.
> 
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---
> tools/testing/selftests/bpf/bpf_helpers.h | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)
> 
> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
> index f804f210244e..81bc51293d11 100644
> --- a/tools/testing/selftests/bpf/bpf_helpers.h
> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
> @@ -501,4 +501,23 @@ struct pt_regs;
> 				(void *)(PT_REGS_FP(ctx) + sizeof(ip))); })
> #endif
> 
> +/*
> + * BPF_CORE_READ abstracts away bpf_probe_read() call and captures offset
> + * relocation for source address using __builtin_preserve_access_index()
> + * built-in, provided by Clang.
> + *
> + * __builtin_preserve_access_index() takes as an argument an expression of
> + * taking an address of a field within struct/union. It makes compiler emit
> + * a relocation, which records BTF type ID describing root struct/union and an
> + * accessor string which describes exact embedded field that was used to take
> + * an address. See detailed description of this relocation format and
> + * semantics in comments to struct bpf_offset_reloc in libbpf_internal.h.
> + *
> + * This relocation allows libbpf to adjust BPF instruction to use correct
> + * actual field offset, based on target kernel BTF type that matches original
> + * (local) BTF, used to record relocation.
> + */
> +#define BPF_CORE_READ(dst, src) \
> +	bpf_probe_read(dst, sizeof(*src), __builtin_preserve_access_index(src))

We should use "sizeof(*(src))"


^ permalink raw reply

* Re: [PATCH v6 51/57] net: Remove dev_err() usage after platform_get_irq()
From: David Miller @ 2019-07-30 21:25 UTC (permalink / raw)
  To: swboyd
  Cc: linux-kernel, kvalo, saeedm, jeffrey.t.kirsher, nbd, lorenzo,
	netdev, gregkh
In-Reply-To: <20190730181557.90391-52-swboyd@chromium.org>

From: Stephen Boyd <swboyd@chromium.org>
Date: Tue, 30 Jul 2019 11:15:51 -0700

> We don't need dev_err() messages when platform_get_irq() fails now that
> platform_get_irq() prints an error message itself when something goes
> wrong. Let's remove these prints with a simple semantic patch.
 ...
> While we're here, remove braces on if statements that only have one
> statement (manually).
 ...
> Signed-off-by: Stephen Boyd <swboyd@chromium.org>
> ---
> 
> Please apply directly to subsystem trees

I'll take this into net-next.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 03/12] selftests/bpf: add BPF_CORE_READ relocatable read macro
From: Andrii Nakryiko @ 2019-07-30 21:26 UTC (permalink / raw)
  To: Song Liu
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	daniel@iogearbox.net, Yonghong Song, Kernel Team
In-Reply-To: <87422673-525B-461B-B487-EB16386CAB25@fb.com>

On Tue, Jul 30, 2019 at 2:24 PM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
> >
> > Add BPF_CORE_READ macro used in tests to do bpf_core_read(), which
> > automatically captures offset relocation.
> >
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > ---
> > tools/testing/selftests/bpf/bpf_helpers.h | 19 +++++++++++++++++++
> > 1 file changed, 19 insertions(+)
> >
> > diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
> > index f804f210244e..81bc51293d11 100644
> > --- a/tools/testing/selftests/bpf/bpf_helpers.h
> > +++ b/tools/testing/selftests/bpf/bpf_helpers.h
> > @@ -501,4 +501,23 @@ struct pt_regs;
> >                               (void *)(PT_REGS_FP(ctx) + sizeof(ip))); })
> > #endif
> >
> > +/*
> > + * BPF_CORE_READ abstracts away bpf_probe_read() call and captures offset
> > + * relocation for source address using __builtin_preserve_access_index()
> > + * built-in, provided by Clang.
> > + *
> > + * __builtin_preserve_access_index() takes as an argument an expression of
> > + * taking an address of a field within struct/union. It makes compiler emit
> > + * a relocation, which records BTF type ID describing root struct/union and an
> > + * accessor string which describes exact embedded field that was used to take
> > + * an address. See detailed description of this relocation format and
> > + * semantics in comments to struct bpf_offset_reloc in libbpf_internal.h.
> > + *
> > + * This relocation allows libbpf to adjust BPF instruction to use correct
> > + * actual field offset, based on target kernel BTF type that matches original
> > + * (local) BTF, used to record relocation.
> > + */
> > +#define BPF_CORE_READ(dst, src) \
> > +     bpf_probe_read(dst, sizeof(*src), __builtin_preserve_access_index(src))
>
> We should use "sizeof(*(src))"
>

Good point. Also (dst) instead of just (dst). Will update.

^ permalink raw reply

* Re: [PATCH v2 0/3 net-next] Finish conversion of skb_frag_t to bio_vec
From: Jens Axboe @ 2019-07-30 21:30 UTC (permalink / raw)
  To: David Miller
  Cc: jonathan.lemon@gmail.com, willy@infradead.org,
	jakub.kicinski@netronome.com, Kernel Team, netdev@vger.kernel.org
In-Reply-To: <20190730.142238.1475873068715429404.davem@davemloft.net>

On 7/30/19 3:22 PM, David Miller wrote:
> From: Jens Axboe <axboe@fb.com>
> Date: Tue, 30 Jul 2019 20:49:09 +0000
> 
>> Pretty appalled to see this abomination:
>>
>> net: Convert skb_frag_t to bio_vec
>>
>> There are a lot of users of frag->page_offset, so use a union
>> to avoid converting those users today.
>>
>> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>> Signed-off-by: David S. Miller <davem@davemloft.net>
>>
>> show up in the net tree without even having been posted on a
>> block list...
>>
>> At least this kills this ugly thing.
> 
> Sorry about that Jens, but at least as you say it's gone now.

Yeah all good now, thanks.

-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH v2 bpf-next 03/12] selftests/bpf: add BPF_CORE_READ relocatable read macro
From: Song Liu @ 2019-07-30 21:33 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	daniel@iogearbox.net, Yonghong Song, Kernel Team
In-Reply-To: <CAEf4BzakowCqkCkBdGPJCZNU3MpDf1yBhzOXL2pos1tPiUH0mQ@mail.gmail.com>



> On Jul 30, 2019, at 2:26 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> 
> On Tue, Jul 30, 2019 at 2:24 PM Song Liu <songliubraving@fb.com> wrote:
>> 
>> 
>> 
>>> On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
>>> 
>>> Add BPF_CORE_READ macro used in tests to do bpf_core_read(), which
>>> automatically captures offset relocation.
>>> 
>>> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>>> ---
>>> tools/testing/selftests/bpf/bpf_helpers.h | 19 +++++++++++++++++++
>>> 1 file changed, 19 insertions(+)
>>> 
>>> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
>>> index f804f210244e..81bc51293d11 100644
>>> --- a/tools/testing/selftests/bpf/bpf_helpers.h
>>> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
>>> @@ -501,4 +501,23 @@ struct pt_regs;
>>>                              (void *)(PT_REGS_FP(ctx) + sizeof(ip))); })
>>> #endif
>>> 
>>> +/*
>>> + * BPF_CORE_READ abstracts away bpf_probe_read() call and captures offset
>>> + * relocation for source address using __builtin_preserve_access_index()
>>> + * built-in, provided by Clang.
>>> + *
>>> + * __builtin_preserve_access_index() takes as an argument an expression of
>>> + * taking an address of a field within struct/union. It makes compiler emit
>>> + * a relocation, which records BTF type ID describing root struct/union and an
>>> + * accessor string which describes exact embedded field that was used to take
>>> + * an address. See detailed description of this relocation format and
>>> + * semantics in comments to struct bpf_offset_reloc in libbpf_internal.h.
>>> + *
>>> + * This relocation allows libbpf to adjust BPF instruction to use correct
>>> + * actual field offset, based on target kernel BTF type that matches original
>>> + * (local) BTF, used to record relocation.
>>> + */
>>> +#define BPF_CORE_READ(dst, src) \
>>> +     bpf_probe_read(dst, sizeof(*src), __builtin_preserve_access_index(src))
>> 
>> We should use "sizeof(*(src))"
>> 
> 
> Good point. Also (dst) instead of just (dst). Will update.

I think dst as-is is fine. "," is the very last in precedence list. 


^ permalink raw reply

* Re: [net-next 01/13] net/mlx5e: Print a warning when LRO feature is dropped or not allowed
From: Willem de Bruijn @ 2019-07-30 21:34 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: Huy Nguyen, davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <cb43e9dadb8e48d27df8f08464bf40f7a81eafe9.camel@mellanox.com>

On Tue, Jul 30, 2019 at 4:08 PM Saeed Mahameed <saeedm@mellanox.com> wrote:
>
> On Tue, 2019-07-30 at 11:52 -0400, Willem de Bruijn wrote:
> > On Mon, Jul 29, 2019 at 7:50 PM Saeed Mahameed <saeedm@mellanox.com>
> > wrote:
> > > From: Huy Nguyen <huyn@mellanox.com>
> > >
> > > When user enables LRO via ethtool and if the RQ mode is legacy,
> > > mlx5e_fix_features drops the request without any explanation.
> > > Add netdev_warn to cover this case.
> > >
> > > Fixes: 6c3a823e1e9c ("net/mlx5e: RX, Remove HW LRO support in
> > > legacy RQ")
> > > Signed-off-by: Huy Nguyen <huyn@mellanox.com>
> > > Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> > > ---
> > >  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > > b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > > index 47eea6b3a1c3..776eb46d263d 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > > @@ -3788,9 +3788,10 @@ static netdev_features_t
> > > mlx5e_fix_features(struct net_device *netdev,
> > >                         netdev_warn(netdev, "Dropping C-tag vlan
> > > stripping offload due to S-tag vlan\n");
> > >         }
> > >         if (!MLX5E_GET_PFLAG(params, MLX5E_PFLAG_RX_STRIDING_RQ)) {
> > > -               features &= ~NETIF_F_LRO;
> > > -               if (params->lro_en)
> > > +               if (features & NETIF_F_LRO) {
> > >                         netdev_warn(netdev, "Disabling LRO, not
> > > supported in legacy RQ\n");
> >
> > This warns about "Disabling LRO" on an enable request?
> >
>
> no, this warning appears only when lro is already enabled and might
> conflict with any other feature requested by user (hence
> mlx5e_fix_features), e.g when moving away from striding rq in this
> example, we will force lro to off.

Ok. The previous commit mentioned "totally remove LRO support in
Legacy RQ". This handles the additional case when moving a queue into
legacy mode that still had LRO enabled. I see.

>
>
> > More fundamentally, it appears that the device does not advertise
> > the feature as configurable in netdev_hw_features as of commit
> > 6c3a823e1e9c ("net/mlx5e: RX, Remove HW LRO support in
> > legacy RQ"), so shouldn't this be caught by the device driver
> > independent ethtool code?
>
> when hw doesn't support MLX5E_PFLAG_RX_STRIDING_RQ then yes, you will
> never hit this code path, but when hw does support
> MLX5E_PFLAG_RX_STRIDING_RQ and you want to turn striding rq off, then
> lro will be forced to off (if it was enabled in first space) and a
> warning msg will be shown.

Got it, thanks.

^ permalink raw reply

* Re: [net 1/1] tipc: fix unitilized skb list crash
From: David Miller @ 2019-07-30 21:40 UTC (permalink / raw)
  To: jon.maloy
  Cc: netdev, tung.q.nguyen, hoang.h.le, lxin, shuali, ying.xue,
	tipc-discussion
In-Reply-To: <1564510750-19531-1-git-send-email-jon.maloy@ericsson.com>

From: Jon Maloy <jon.maloy@ericsson.com>
Date: Tue, 30 Jul 2019 20:19:10 +0200

> Our test suite somtimes provokes the following crash:
> 
> Description of problem:
 ...
> The reason is that the skb list tipc_socket::mc_method.deferredq only
> is initialized for connectionless sockets, while nothing stops arriving
> multicast messages from being filtered by connection oriented sockets,
> with subsequent access to the said list.
> 
> We fix this by initializing the list unconditionally at socket creation.
> This eliminates the crash, while the message still is dropped further
> down in tipc_sk_filter_rcv() as it should be.
> 
> Reported-by: Li Shuang <shuali@redhat.com>
> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>

Applied and queued up for -stable, thank you.

^ permalink raw reply

* Re: [PATCH v5 09/29] compat_ioctl: pppoe: fix PPPOEIOCSFWD handling
From: David Miller @ 2019-07-30 21:42 UTC (permalink / raw)
  To: arnd
  Cc: viro, linux-fsdevel, linux-kernel, g.nault, mostrows, xeb,
	jchapman, netdev
In-Reply-To: <20190730192552.4014288-10-arnd@arndb.de>

From: Arnd Bergmann <arnd@arndb.de>
Date: Tue, 30 Jul 2019 21:25:20 +0200

> Support for handling the PPPOEIOCSFWD ioctl in compat mode was added in
> linux-2.5.69 along with hundreds of other commands, but was always broken
> sincen only the structure is compatible, but the command number is not,
> due to the size being sizeof(size_t), or at first sizeof(sizeof((struct
> sockaddr_pppox)), which is different on 64-bit architectures.
> 
> Guillaume Nault adds:
> 
>   And the implementation was broken until 2016 (see 29e73269aa4d ("pppoe:
>   fix reference counting in PPPoE proxy")), and nobody ever noticed. I
>   should probably have removed this ioctl entirely instead of fixing it.
>   Clearly, it has never been used.
> 
> Fix it by adding a compat_ioctl handler for all pppoe variants that
> translates the command number and then calls the regular ioctl function.
> 
> All other ioctl commands handled by pppoe are compatible between 32-bit
> and 64-bit, and require compat_ptr() conversion.
> 
> This should apply to all stable kernels.
> 
> Acked-by: Guillaume Nault <g.nault@alphalink.fr>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Applied and queued up for -stable, thanks everyone.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox