Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: linux-next: Tree for Jul 15 (HEADERS_TEST w/ netfilter tables offload)
From: Pablo Neira Ayuso @ 2019-07-15 18:09 UTC (permalink / raw)
  To: Masahiro Yamada
  Cc: Laura Garcia, Randy Dunlap, Stephen Rothwell,
	Linux Next Mailing List, Linux Kernel Mailing List, linux-kbuild,
	netdev@vger.kernel.org, Netfilter Development Mailing list
In-Reply-To: <CAK7LNAS0rX_SRXqb=N=Td-DFNWd=PytDFje12gYh2pYNRBVAJA@mail.gmail.com>

On Tue, Jul 16, 2019 at 02:56:09AM +0900, Masahiro Yamada wrote:
> On Tue, Jul 16, 2019 at 2:33 AM Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> >
> > On Mon, Jul 15, 2019 at 07:28:04PM +0200, Laura Garcia wrote:
> > > CC'ing netfilter.
> > >
> > > On Mon, Jul 15, 2019 at 6:45 PM Randy Dunlap <rdunlap@infradead.org> wrote:
> > > >
> > > > On 7/14/19 9:48 PM, Stephen Rothwell wrote:
> > > > > Hi all,
> > > > >
> > > > > Please do not add v5.4 material to your linux-next included branches
> > > > > until after v5.3-rc1 has been released.
> > > > >
> > > > > Changes since 20190712:
> > > > >
> > > >
> > > > Hi,
> > > >
> > > > I am seeing these build errors from HEADERS_TEST (or KERNEL_HEADERS_TEST)
> > > > for include/net/netfilter/nf_tables_offload.h.s:
> > > >
> > > >   CC      include/net/netfilter/nf_tables_offload.h.s
> > [...]
> > > > Should this header file not be tested?
> 
> This means you must endlessly exclude
> headers that include nf_tables.h
> 
> 
> > Yes, it should indeed be added.
> 
> Adding 'header-test-' is the last resort.

OK, so policy now is that all internal headers should compile
standalone, right?

I can have a look and make it compile standalone.

^ permalink raw reply

* Re: [PATCH] mm/gup: Use put_user_page*() instead of put_page*()
From: John Hubbard @ 2019-07-15 18:10 UTC (permalink / raw)
  To: Bharath Vedartham
  Cc: akpm, ira.weiny, Mauro Carvalho Chehab, Dimitri Sivanich,
	Arnd Bergmann, Greg Kroah-Hartman, Alex Williamson, Cornelia Huck,
	Jens Axboe, Alexander Viro, Björn Töpel,
	Magnus Karlsson, David S. Miller, Alexei Starovoitov,
	Daniel Borkmann, Jakub Kicinski, Jesper Dangaard Brouer,
	John Fastabend, Enrico Weigelt, Thomas Gleixner, Alexios Zavras,
	Dan Carpenter, Max Filippov, Matt Sickler, Kirill A. Shutemov,
	Keith Busch, YueHaibing, linux-media, linux-kernel, devel, kvm,
	linux-block, linux-fsdevel, linux-mm, netdev, bpf, xdp-newbies,
	Jason Gunthorpe
In-Reply-To: <20190715065654.GA3716@bharath12345-Inspiron-5559>

On 7/14/19 11:56 PM, Bharath Vedartham wrote:
> On Sun, Jul 14, 2019 at 04:33:42PM -0700, John Hubbard wrote:
>> On 7/14/19 12:08 PM, Bharath Vedartham wrote:
[...]
>> 1. Pull down https://github.com/johnhubbard/linux/commits/gup_dma_core
>> and find missing conversions: look for any additional missing 
>> get_user_pages/put_page conversions. You've already found a couple missing 
>> ones. I haven't re-run a search in a long time, so there's probably even more.
>> 	a) And find more, after I rebase to 5.3-rc1: people probably are adding
>> 	get_user_pages() calls as we speak. :)
> Shouldn't this be documented then? I don't see any docs for using
> put_user_page*() in v5.2.1 in the memory management API section?

Yes, it needs documentation. My first try (which is still in the above git
repo) was reviewed and found badly wanting, so I'm going to rewrite it. Meanwhile,
I agree that an interim note would be helpful, let me put something together.

[...]
>>     https://github.com/johnhubbard/linux/commits/gup_dma_core
>>
>>     a) gets rebased often, and
>>
>>     b) has a bunch of commits (iov_iter and related) that conflict
>>        with the latest linux.git,
>>
>>     c) has some bugs in the bio area, that I'm fixing, so I don't trust
>>        that's it's safely runnable, for a few more days.
> I assume your repo contains only work related to fixing gup issues and
> not the main repo for gup development? i.e where gup changes are merged?

Correct, this is just a private tree, not a maintainer tree. But I'll try to
keep the gup_dma_core branch something that is usable by others, during the
transition over to put_user_page(), because the page-tracking patches are the
main way to test any put_user_page() conversions.

As Ira said, we're using linux-mm as the real (maintainer) tree.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply

* Re: [PATCH] ISDN: hfcsusb: checking idx of ep configuration
From: David Miller @ 2019-07-15 18:10 UTC (permalink / raw)
  To: tranmanphong
  Cc: syzbot+8750abbc3a46ef47d509, isdn, gregkh, andreyknvl, bigeasy,
	gustavo, pakki001, linux-kernel, linux-usb, syzkaller-bugs,
	netdev, linux-kernel-mentees, skhan
In-Reply-To: <20190715150814.20022-1-tranmanphong@gmail.com>

From: Phong Tran <tranmanphong@gmail.com>
Date: Mon, 15 Jul 2019 22:08:14 +0700

> The syzbot test with random endpoint address which made the idx is
> overflow in the table of endpoint configuations.
> 
> this adds the checking for fixing the error report from
> syzbot
> 
> KASAN: stack-out-of-bounds Read in hfcsusb_probe [1]
> The patch tested by syzbot [2]
> 
> Reported-by: syzbot+8750abbc3a46ef47d509@syzkaller.appspotmail.com
> 
> [1]:
> https://syzkaller.appspot.com/bug?id=30a04378dac680c5d521304a00a86156bb913522
> [2]:
> https://groups.google.com/d/msg/syzkaller-bugs/_6HBdge8F3E/OJn7wVNpBAAJ
> 
> Signed-off-by: Phong Tran <tranmanphong@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] net: openvswitch: do not update max_headroom if new headroom is equal to old headroom
From: Pravin Shelar @ 2019-07-15 18:15 UTC (permalink / raw)
  To: Taehee Yoo; +Cc: David S. Miller, Linux Kernel Network Developers, ovs dev
In-Reply-To: <20190705160809.5202-1-ap420073@gmail.com>

On Fri, Jul 5, 2019 at 9:08 AM Taehee Yoo <ap420073@gmail.com> wrote:
>
> When a vport is deleted, the maximum headroom size would be changed.
> If the vport which has the largest headroom is deleted,
> the new max_headroom would be set.
> But, if the new headroom size is equal to the old headroom size,
> updating routine is unnecessary.
>
> Signed-off-by: Taehee Yoo <ap420073@gmail.com>

Sorry for late reply. This patch looks good to me too.

I am curious about reason to avoid adjustment to headroom. why are you
trying to avoid unnecessary changes to headroom.

Thanks,
Pravin.

^ permalink raw reply

* [PATCH] can: flexcan: free error skb if enqueueing failed
From: Martin Hundebøll @ 2019-07-15 18:53 UTC (permalink / raw)
  To: Wolfgang Grandegger, Marc Kleine-Budde, linux-can
  Cc: Martin Hundebøll, David S . Miller, netdev, Sean Nyekjaer

If the call to can_rx_offload_queue_sorted() fails, the passed skb isn't
consumed, so the caller must do so.

Fixes: 30164759db1b ("can: flexcan: make use of rx-offload's irq_offload_fifo")
Signed-off-by: Martin Hundebøll <martin@geanix.com>
---
 drivers/net/can/flexcan.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/can/flexcan.c b/drivers/net/can/flexcan.c
index 1c66fb2ad76b..21f39e805d42 100644
--- a/drivers/net/can/flexcan.c
+++ b/drivers/net/can/flexcan.c
@@ -688,7 +688,8 @@ static void flexcan_irq_bus_err(struct net_device *dev, u32 reg_esr)
 	if (tx_errors)
 		dev->stats.tx_errors++;
 
-	can_rx_offload_queue_sorted(&priv->offload, skb, timestamp);
+	if (can_rx_offload_queue_sorted(&priv->offload, skb, timestamp))
+		kfree_skb(skb);
 }
 
 static void flexcan_irq_state(struct net_device *dev, u32 reg_esr)
@@ -732,7 +733,8 @@ static void flexcan_irq_state(struct net_device *dev, u32 reg_esr)
 	if (unlikely(new_state == CAN_STATE_BUS_OFF))
 		can_bus_off(dev);
 
-	can_rx_offload_queue_sorted(&priv->offload, skb, timestamp);
+	if (can_rx_offload_queue_sorted(&priv->offload, skb, timestamp))
+		kfree_skb(skb);
 }
 
 static inline struct flexcan_priv *rx_offload_to_priv(struct can_rx_offload *offload)
-- 
2.22.0


^ permalink raw reply related

* Re: IPv6 L2TP issues related to 93531c67
From: David Ahern @ 2019-07-15 18:55 UTC (permalink / raw)
  To: Paul Donohue; +Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, netdev
In-Reply-To: <20190715161827.GB2622@TopQuark.net>

[-- Attachment #1: Type: text/plain, Size: 3233 bytes --]

Hi Paul:

As an FYI, gmail thinks your emails are spam.

On 7/15/19 10:18 AM, Paul Donohue wrote:
> I have a system that establishes four L2TP over IPv6 tunnels using site-local addresses via the following:
...

> 
> These tunnels worked fine on kernel 4.4.  On kernel 4.15, there was a bug that caused intermittent L2TP packet errors, but everything worked fine after applying 4522a70db7aa5e77526a4079628578599821b193.
> 
> However, after upgrading to kernel 4.18 with 4522a70d (or upgrading to kernel 5.0 which includes 4522a70d, or upgrading to the current master kernel branch), two of the four tunnels always fail to work properly after a reboot, although it appears random which two work and which two fail.
> 
> When I say "fail to work properly", the problem is that packets generated by the l2tp kernel modules (in response to a packet being sent to the associated net_l2tpX interface) are silently dropped.  The l2tp_debugfs kernel module reports that L2TP packets are being transmitted with no errors, iptables counters and nflog rules can be used to confirm that well-formed packets are generated and sent, but tcpdump does not see the packets being sent on any interface on the system.  iptables reports that the destination interface of the lost packets is "lo" (which is clearly incorrect and probably an indicator of the underlying issue), but `tcpdump -nnn -i lo` doesn't show any packets.  Incoming L2TP packets appear to be processed correctly, only outgoing L2TP packets appear affected.
> 
> Reverting commit 93531c6743157d7e8c5792f8ed1a57641149d62c (identified by bisection) fixes this issue.

That commit can not be reverted. It is a foundational piece for a lot of
other changes. Did you mean the commit before it works and this commit
fails?

> 
> IPv4 L2TP tunnels do not appear affected by this issue.  Based on a few quick tests, it appears that switching to publicly-routable IPv6 addresses instead of site-local addresses seems to prevent this issue, although I haven't done sufficient testing of this, and it is not clear to me how the code in 93531c67 might be affected by the type of IPv6 address, so this observation may be a red herring.  Manually deleting and re-creating a broken interface seems to make it work again, although I have not thoroughly experimented with making changes after boot time to see if the problem is entirely random, if it is based on the number of existing interfaces, if it is based on a boot-time timing issue, etc.
> 
> It is not obvious to me how commit 93531c6743157d7e8c5792f8ed1a57641149d62c causes this issue, or how it should be fixed.  Could someone take a look and point me in the right direction for further troubleshooting?
> 

Let's get a complete example that demonstrates the problem, and I can go
from there. Can you take the attached script and update it so that it
reflects the problem you are reporting? That script works on latest
kernel as well as 4.14.133. It uses network namespaces for 2 hosts with
a router between them.

Also, check the return of the fib lookups using:
    perf record -e fib6:* -a
    <run test, ctrl-c on the record>
    perf script

Checkout the fib lookup parameters and result. Do they look correct to
you for your setup?

[-- Attachment #2: l2tp.sh --]
[-- Type: application/x-sh, Size: 3750 bytes --]

^ permalink raw reply

* Re: INFO: task hung in unregister_netdevice_notifier (3)
From: Cong Wang @ 2019-07-15 18:58 UTC (permalink / raw)
  To: Oliver Hartkopp
  Cc: syzbot, David Miller, linux-can, LKML, Marc Kleine-Budde,
	Linux Kernel Network Developers, syzkaller-bugs, Kirill Tkhai
In-Reply-To: <be6c249e-3b99-8388-5b13-547645b2fac9@hartkopp.net>

On Mon, Jul 15, 2019 at 10:23 AM Oliver Hartkopp <socketcan@hartkopp.net> wrote:
>
> Hello all,
>
> On 14.07.19 06:07, syzbot wrote:
> > syzbot has found a reproducer for the following crash on:
>
> the internal users of the CAN networking subsystem like CAN_BCM and
> CAN_RAW hold a number of CAN identifier subscriptions ('filters') for
> CAN netdevices (only type ARPHRD_CAN) in their socket data structures.
>
> The per-socket netdevice notifier is used to manage the ad-hoc removal
> of these filters at netdevice removal time.
>
> What I can see in the console output at
>
> https://syzkaller.appspot.com/x/log.txt?x=10e45f0fa00000
>
> seems to be a race between an unknown register_netdevice_notifier() call
> ("A") and the unregister_netdevice_notifier() ("B") likely invoked by
> bcm_release() ("C"):
>
> [ 1047.294207][ T1049]  schedule+0xa8/0x270
> [ 1047.318401][ T1049]  rwsem_down_write_slowpath+0x70a/0xf70
> [ 1047.324114][ T1049]  ? downgrade_write+0x3c0/0x3c0
> [ 1047.438644][ T1049]  ? mark_held_locks+0xf0/0xf0
> [ 1047.443483][ T1049]  ? lock_acquire+0x190/0x410
> [ 1047.448191][ T1049]  ? unregister_netdevice_notifier+0x7e/0x390
> [ 1047.547227][ T1049]  down_write+0x13c/0x150
> [ 1047.579535][ T1049]  ? down_write+0x13c/0x150
> [ 1047.584106][ T1049]  ? __down_timeout+0x2d0/0x2d0
> [ 1047.635356][ T1049]  ? mark_held_locks+0xf0/0xf0
> [ 1047.640721][ T1049]  unregister_netdevice_notifier+0x7e/0x390  <- "B"
> [ 1047.646667][ T1049]  ? __sock_release+0x89/0x280
> [ 1047.709126][ T1049]  ? register_netdevice_notifier+0x630/0x630 <- "A"
> [ 1047.715203][ T1049]  ? __kasan_check_write+0x14/0x20
> [ 1047.775138][ T1049]  bcm_release+0x93/0x5e0                    <- "C"
> [ 1047.795337][ T1049]  __sock_release+0xce/0x280
> [ 1047.829016][ T1049]  sock_close+0x1e/0x30
>
> The question to me is now:
>
> Is the problem located in an (un)register_netdevice_notifier race OR is
> it generally a bad idea to call unregister_netdevice_notifier() in a
> sock release?

To me it doesn't look like anything wrong in CAN. If you look at a few
more reports from syzbot for the same bug, it actually indicates
something else is holding the pernet_ops_rwsem which caused this
hung task.

When NMI is kicked, it shows nf_ct_iterate_cleanup() was getting
stuck:


NMI backtrace for cpu 0
CPU: 0 PID: 1044 Comm: khungtaskd Not tainted 5.2.0-rc5+ #42
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 nmi_cpu_backtrace.cold+0x63/0xa4 lib/nmi_backtrace.c:101
 nmi_trigger_cpumask_backtrace+0x1be/0x236 lib/nmi_backtrace.c:62
 arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
 trigger_all_cpu_backtrace include/linux/nmi.h:146 [inline]
 check_hung_uninterruptible_tasks kernel/hung_task.c:205 [inline]
 watchdog+0x9b7/0xec0 kernel/hung_task.c:289
 kthread+0x354/0x420 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
Sending NMI from CPU 0 to CPUs 1:
NMI backtrace for cpu 1
CPU: 1 PID: 6877 Comm: kworker/u4:1 Not tainted 5.2.0-rc5+ #42
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011
Workqueue: netns cleanup_net
RIP: 0010:__sanitizer_cov_trace_pc+0x0/0x50 kernel/kcov.c:95
Code: 89 25 e4 bc 15 09 41 bc f4 ff ff ff e8 6d 04 ea ff 48 c7 05 ce
bc 15 09 00 00 00 00 e9 a4 e9 ff ff 90 90 90 90 90 90 90 90 90 <55> 48
89 e5 48 8b 75 08 65 48 8b 04 25 c0 fd 01 00 65 8b 15 f0 3a
RSP: 0018:ffff888088fe7ac0 EFLAGS: 00000046
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff817637c6
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
RBP: ffff888088fe7af8 R08: ffff888096406340 R09: fffffbfff118bda9
R10: fffffbfff118bda8 R11: ffffffff88c5ed43 R12: ffffffff85b86d11
R13: ffff888096406340 R14: 0000000000000001 R15: dffffc0000000000
FS:  0000000000000000(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffff600400 CR3: 0000000099e59000 CR4: 00000000001406e0
Call Trace:
 __local_bh_enable_ip+0x11a/0x270 kernel/softirq.c:171
 local_bh_enable include/linux/bottom_half.h:32 [inline]
 get_next_corpse net/netfilter/nf_conntrack_core.c:2015 [inline]
 nf_ct_iterate_cleanup+0x217/0x4e0 net/netfilter/nf_conntrack_core.c:2038
 nf_conntrack_cleanup_net_list+0x7a/0x240 net/netfilter/nf_conntrack_core.c:2225
 nf_conntrack_pernet_exit+0x159/0x1a0
net/netfilter/nf_conntrack_standalone.c:1151
 ops_exit_list.isra.0+0xfc/0x150 net/core/net_namespace.c:168
 cleanup_net+0x4e2/0xa40 net/core/net_namespace.c:575
 process_one_work+0x989/0x1790 kernel/workqueue.c:2269
 worker_thread+0x98/0xe40 kernel/workqueue.c:2415
 kthread+0x354/0x420 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

^ permalink raw reply

* Re: [PATCH v2] ethtool: igb: dump RR2DCDELAY register
From: Michal Kubecek @ 2019-07-15 19:26 UTC (permalink / raw)
  To: netdev; +Cc: Artem Bityutskiy, John W . Linville
In-Reply-To: <20190715105933.40924-1-dedekind1@gmail.com>

On Mon, Jul 15, 2019 at 01:59:33PM +0300, Artem Bityutskiy wrote:
> From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> 
> Decode 'RR2DCDELAY' register which Linux kernel provides starting from version
> 5.3. The corresponding commit in the Linux kernel is:
>     cd502a7f7c9c igb: add RR2DCDELAY to ethtool registers dump
> 
> The RR2DCDELAY register is present in I210 and I211 Intel Gigabit Ethernet
> chips and it stands for "Read Request To Data Completion Delay". Here is how
> this register is described in the I210 datasheet:
> 
> "This field captures the maximum PCIe split time in 16 ns units, which is the
> maximum delay between the read request to the first data completion. This is
> giving an estimation of the PCIe round trip time."
> 
> In practice, this register can be used to measure the time it takes the NIC to
> read data from the host memory.
> 
> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> ---
>  igb.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> Changelog:
> v2: Fixed a typo in the commentary.
> v1: Initial pach version.
> 
> 
> diff --git a/igb.c b/igb.c
> index e0ccef9..cb24877 100644
> --- a/igb.c
> +++ b/igb.c
> @@ -859,6 +859,18 @@ igb_dump_regs(struct ethtool_drvinfo *info, struct ethtool_regs *regs)
>  		"0x03430: TDFPC       (Tx data FIFO packet count)      0x%08X\n",
>  		regs_buff[550]);
>  
> +	/*
> +	 * Starting from kernel version 5.3 the registers dump buffer grew from
> +	 * 739 4-byte words to 740 words, and word 740 contains the RR2DCDELAY
> +	 * register.
> +	 */
> +	if (regs->len < 740)
> +		return 0;
> +
> +	fprintf(stdout,
> +		"0x05BF4: RR2DCDELAY  (Max. DMA read delay)            0x%08X\n",
> +		regs_buff[739]);
> +
>  	return 0;
>  }
>  

Reviewed-by: Michal Kubecek <mkubecek@suse.cz>

^ permalink raw reply

* Re: [PATCH] mm/gup: Use put_user_page*() instead of put_page*()
From: Bharath Vedartham @ 2019-07-15 19:35 UTC (permalink / raw)
  To: Ira Weiny
  Cc: John Hubbard, akpm, Mauro Carvalho Chehab, Dimitri Sivanich,
	Arnd Bergmann, Greg Kroah-Hartman, Alex Williamson, Cornelia Huck,
	Jens Axboe, Alexander Viro, Björn Töpel,
	Magnus Karlsson, David S. Miller, Alexei Starovoitov,
	Daniel Borkmann, Jakub Kicinski, Jesper Dangaard Brouer,
	John Fastabend, Enrico Weigelt, Thomas Gleixner, Alexios Zavras,
	Dan Carpenter, Max Filippov, Matt Sickler, Kirill A. Shutemov,
	Keith Busch, YueHaibing, linux-media, linux-kernel, devel, kvm,
	linux-block, linux-fsdevel, linux-mm, netdev, bpf, xdp-newbies,
	Jason Gunthorpe
In-Reply-To: <20190715162952.GA7953@iweiny-DESK2.sc.intel.com>

On Mon, Jul 15, 2019 at 09:29:53AM -0700, Ira Weiny wrote:
> On Mon, Jul 15, 2019 at 12:26:54PM +0530, Bharath Vedartham wrote:
> > On Sun, Jul 14, 2019 at 04:33:42PM -0700, John Hubbard wrote:
> > > On 7/14/19 12:08 PM, Bharath Vedartham wrote:
> > > > This patch converts all call sites of get_user_pages
> > > > to use put_user_page*() instead of put_page*() functions to
> > > > release reference to gup pinned pages.
> > Hi John, 
> > > Hi Bharath,
> > > 
> > > Thanks for jumping in to help, and welcome to the party!
> > > 
> > > You've caught everyone in the middle of a merge window, btw.  As a
> > > result, I'm busy rebasing and reworking the get_user_pages call sites, 
> > > and gup tracking, in the wake of some semi-traumatic changes to bio 
> > > and gup and such. I plan to re-post right after 5.3-rc1 shows up, from 
> > > here:
> > > 
> > >     https://github.com/johnhubbard/linux/commits/gup_dma_core
> > > 
> > > ...which you'll find already covers the changes you've posted, except for:
> > > 
> > >     drivers/misc/sgi-gru/grufault.c
> > >     drivers/staging/kpc2000/kpc_dma/fileops.c
> > > 
> > > ...and this one, which is undergoing to larger local changes, due to
> > > bvec, so let's leave it out of the choices:
> > > 
> > >     fs/io_uring.c
> > > 
> > > Therefore, until -rc1, if you'd like to help, I'd recommend one or more
> > > of the following ideas:
> > > 
> > > 1. Pull down https://github.com/johnhubbard/linux/commits/gup_dma_core
> > > and find missing conversions: look for any additional missing 
> > > get_user_pages/put_page conversions. You've already found a couple missing 
> > > ones. I haven't re-run a search in a long time, so there's probably even more.
> > > 	a) And find more, after I rebase to 5.3-rc1: people probably are adding
> > > 	get_user_pages() calls as we speak. :)
> > Shouldn't this be documented then? I don't see any docs for using
> > put_user_page*() in v5.2.1 in the memory management API section?
> > > 2. Patches: Focus on just one subsystem at a time, and perfect the patch for
> > > it. For example, I think this the staging driver would be perfect to start with:
> > > 
> > >     drivers/staging/kpc2000/kpc_dma/fileops.c
> > > 
> > > 	a) verify that you've really, corrected converted the whole
> > > 	driver. (Hint: I think you might be overlooking a put_page call.)
> > Yup. I did see that! Will fix it!
> > > 	b) Attempt to test it if you can (I'm being hypocritical in
> > > 	the extreme here, but one of my problems is that testing
> > > 	has been light, so any help is very valuable). qemu...?
> > > 	OTOH, maybe even qemu cannot easily test a kpc2000, but
> > > 	perhaps `git blame` and talking to the authors would help
> > > 	figure out a way to validate the changes.
> > Great! I ll do that, I ll mail the patch authors and ask them for help
> > in testing. 
> > > 	Thinking about whether you can run a test that would prove or
> > > 	disprove my claim in (a), above, could be useful in coming up
> > > 	with tests to run.
> > 
> > > In other words, a few very high quality conversions (even just one) that
> > > we can really put our faith in, is what I value most here. Tested patches
> > > are awesome.
> > I understand that! 
> > > 3. Once I re-post, turn on the new CONFIG_DEBUG_GET_USER_PAGES_REFERENCES
> > > and run things such as xfstest/fstest. (Again, doing so would be going
> > > further than I have yet--very helpful). Help clarify what conversions have
> > > actually been tested and work, and which ones remain unvalidated.
> > > Other: Please note that this:
> > Yup will do that.
> > >     https://github.com/johnhubbard/linux/commits/gup_dma_core
> > > 
> > >     a) gets rebased often, and
> > > 
> > >     b) has a bunch of commits (iov_iter and related) that conflict
> > >        with the latest linux.git,
> > > 
> > >     c) has some bugs in the bio area, that I'm fixing, so I don't trust
> > >        that's it's safely runnable, for a few more days.
> > I assume your repo contains only work related to fixing gup issues and
> > not the main repo for gup development? i.e where gup changes are merged?
> 
> We have been using Andrews tree for merging.
> 
> > Also are release_pages and put_user_pages interchangable? 
> 
> Conceptually yes.  But release_pages is more efficient.  There was some
> discussion around this starting here:
> 
> https://lore.kernel.org/lkml/20190523172852.GA27175@iweiny-DESK2.sc.intel.com/
> 
> And a resulting bug fix.
> 
> https://lkml.org/lkml/2019/6/21/95
> 
> Ira
Thanks! Will have a look at these links! 
> > > One note below, for the future:
> > > 
> > > > 
> > > > This is a bunch of trivial conversions which is a part of an effort
> > > > by John Hubbard to solve issues with gup pinned pages and 
> > > > filesystem writeback.
> > > > 
> > > > The issue is more clearly described in John Hubbard's patch[1] where
> > > > put_user_page*() functions are introduced.
> > > > 
> > > > Currently put_user_page*() simply does put_page but future implementations
> > > > look to change that once treewide change of put_page callsites to 
> > > > put_user_page*() is finished.
> > > > 
> > > > The lwn article describing the issue with gup pinned pages and filesystem 
> > > > writeback [2].
> > > > 
> > > > This patch has been tested by building and booting the kernel as I don't
> > > > have the required hardware to test the device drivers.
> > > > 
> > > > I did not modify gpu/drm drivers which use release_pages instead of
> > > > put_page() to release reference of gup pinned pages as I am not clear
> > > > whether release_pages and put_page are interchangable. 
> > > > 
> > > > [1] https://lkml.org/lkml/2019/3/26/1396
> > > 
> > > When referring to patches in a commit description, please use the 
> > > commit hash, not an external link. See Submitting Patches [1] for details.
> > > 
> > > Also, once you figure out the right maintainers and other involved people,
> > > putting Cc: in the commit description is common practice, too.
> > > 
> > > [1] https://www.kernel.org/doc/html/latest/process/submitting-patches.html
> > Will work on that! Thanks!
> > > thanks,
> > > -- 
> > > John Hubbard
> > > NVIDIA
> > > 
> > > > 
> > > > [2] https://lwn.net/Articles/784574/
> > > > 
> > > > Signed-off-by: Bharath Vedartham <linux.bhar@gmail.com>
> > > > ---
> > > >  drivers/media/v4l2-core/videobuf-dma-sg.c | 3 +--
> > > >  drivers/misc/sgi-gru/grufault.c           | 2 +-
> > > >  drivers/staging/kpc2000/kpc_dma/fileops.c | 4 +---
> > > >  drivers/vfio/vfio_iommu_type1.c           | 2 +-
> > > >  fs/io_uring.c                             | 7 +++----
> > > >  mm/gup_benchmark.c                        | 6 +-----
> > > >  net/xdp/xdp_umem.c                        | 7 +------
> > > >  7 files changed, 9 insertions(+), 22 deletions(-)
> > > > 
> > > > diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
> > > > index 66a6c6c..d6eeb43 100644
> > > > --- a/drivers/media/v4l2-core/videobuf-dma-sg.c
> > > > +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
> > > > @@ -349,8 +349,7 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
> > > >  	BUG_ON(dma->sglen);
> > > >  
> > > >  	if (dma->pages) {
> > > > -		for (i = 0; i < dma->nr_pages; i++)
> > > > -			put_page(dma->pages[i]);
> > > > +		put_user_pages(dma->pages, dma->nr_pages);
> > > >  		kfree(dma->pages);
> > > >  		dma->pages = NULL;
> > > >  	}
> > > > diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
> > > > index 4b713a8..61b3447 100644
> > > > --- a/drivers/misc/sgi-gru/grufault.c
> > > > +++ b/drivers/misc/sgi-gru/grufault.c
> > > > @@ -188,7 +188,7 @@ static int non_atomic_pte_lookup(struct vm_area_struct *vma,
> > > >  	if (get_user_pages(vaddr, 1, write ? FOLL_WRITE : 0, &page, NULL) <= 0)
> > > >  		return -EFAULT;
> > > >  	*paddr = page_to_phys(page);
> > > > -	put_page(page);
> > > > +	put_user_page(page);
> > > >  	return 0;
> > > >  }
> > > >  
> > > > diff --git a/drivers/staging/kpc2000/kpc_dma/fileops.c b/drivers/staging/kpc2000/kpc_dma/fileops.c
> > > > index 6166587..26dceed 100644
> > > > --- a/drivers/staging/kpc2000/kpc_dma/fileops.c
> > > > +++ b/drivers/staging/kpc2000/kpc_dma/fileops.c
> > > > @@ -198,9 +198,7 @@ int  kpc_dma_transfer(struct dev_private_data *priv, struct kiocb *kcb, unsigned
> > > >  	sg_free_table(&acd->sgt);
> > > >   err_dma_map_sg:
> > > >   err_alloc_sg_table:
> > > > -	for (i = 0 ; i < acd->page_count ; i++){
> > > > -		put_page(acd->user_pages[i]);
> > > > -	}
> > > > +	put_user_pages(acd->user_pages, acd->page_count);
> > > >   err_get_user_pages:
> > > >  	kfree(acd->user_pages);
> > > >   err_alloc_userpages:
> > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > index add34ad..c491524 100644
> > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > @@ -369,7 +369,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> > > >  		 */
> > > >  		if (ret > 0 && vma_is_fsdax(vmas[0])) {
> > > >  			ret = -EOPNOTSUPP;
> > > > -			put_page(page[0]);
> > > > +			put_user_page(page[0]);
> > > >  		}
> > > >  	}
> > > >  	up_read(&mm->mmap_sem);
> > > > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > > > index 4ef62a4..b4a4549 100644
> > > > --- a/fs/io_uring.c
> > > > +++ b/fs/io_uring.c
> > > > @@ -2694,10 +2694,9 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
> > > >  			 * if we did partial map, or found file backed vmas,
> > > >  			 * release any pages we did get
> > > >  			 */
> > > > -			if (pret > 0) {
> > > > -				for (j = 0; j < pret; j++)
> > > > -					put_page(pages[j]);
> > > > -			}
> > > > +			if (pret > 0)
> > > > +				put_user_pages(pages, pret);
> > > > +
> > > >  			if (ctx->account_mem)
> > > >  				io_unaccount_mem(ctx->user, nr_pages);
> > > >  			kvfree(imu->bvec);
> > > > diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
> > > > index 7dd602d..15fc7a2 100644
> > > > --- a/mm/gup_benchmark.c
> > > > +++ b/mm/gup_benchmark.c
> > > > @@ -76,11 +76,7 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
> > > >  	gup->size = addr - gup->addr;
> > > >  
> > > >  	start_time = ktime_get();
> > > > -	for (i = 0; i < nr_pages; i++) {
> > > > -		if (!pages[i])
> > > > -			break;
> > > > -		put_page(pages[i]);
> > > > -	}
> > > > +	put_user_pages(pages, nr_pages);
> > > >  	end_time = ktime_get();
> > > >  	gup->put_delta_usec = ktime_us_delta(end_time, start_time);
> > > >  
> > > > diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> > > > index 9c6de4f..6103e19 100644
> > > > --- a/net/xdp/xdp_umem.c
> > > > +++ b/net/xdp/xdp_umem.c
> > > > @@ -173,12 +173,7 @@ static void xdp_umem_unpin_pages(struct xdp_umem *umem)
> > > >  {
> > > >  	unsigned int i;
> > > >  
> > > > -	for (i = 0; i < umem->npgs; i++) {
> > > > -		struct page *page = umem->pgs[i];
> > > > -
> > > > -		set_page_dirty_lock(page);
> > > > -		put_page(page);
> > > > -	}
> > > > +	put_user_pages_dirty_lock(umem->pgs, umem->npgs);
> > > >  
> > > >  	kfree(umem->pgs);
> > > >  	umem->pgs = NULL;
> > > > 

^ permalink raw reply

* Re: [PATCH] mm/gup: Use put_user_page*() instead of put_page*()
From: Bharath Vedartham @ 2019-07-15 19:36 UTC (permalink / raw)
  To: John Hubbard
  Cc: akpm, ira.weiny, Mauro Carvalho Chehab, Dimitri Sivanich,
	Arnd Bergmann, Greg Kroah-Hartman, Alex Williamson, Cornelia Huck,
	Jens Axboe, Alexander Viro, Björn Töpel,
	Magnus Karlsson, David S. Miller, Alexei Starovoitov,
	Daniel Borkmann, Jakub Kicinski, Jesper Dangaard Brouer,
	John Fastabend, Enrico Weigelt, Thomas Gleixner, Alexios Zavras,
	Dan Carpenter, Max Filippov, Matt Sickler, Kirill A. Shutemov,
	Keith Busch, YueHaibing, linux-media, linux-kernel, devel, kvm,
	linux-block, linux-fsdevel, linux-mm, netdev, bpf, xdp-newbies,
	Jason Gunthorpe
In-Reply-To: <1aeb21d9-6dc6-c7d2-58b6-279b1dfc523b@nvidia.com>

On Mon, Jul 15, 2019 at 11:10:20AM -0700, John Hubbard wrote:
> On 7/14/19 11:56 PM, Bharath Vedartham wrote:
> > On Sun, Jul 14, 2019 at 04:33:42PM -0700, John Hubbard wrote:
> >> On 7/14/19 12:08 PM, Bharath Vedartham wrote:
> [...]
> >> 1. Pull down https://github.com/johnhubbard/linux/commits/gup_dma_core
> >> and find missing conversions: look for any additional missing 
> >> get_user_pages/put_page conversions. You've already found a couple missing 
> >> ones. I haven't re-run a search in a long time, so there's probably even more.
> >> 	a) And find more, after I rebase to 5.3-rc1: people probably are adding
> >> 	get_user_pages() calls as we speak. :)
> > Shouldn't this be documented then? I don't see any docs for using
> > put_user_page*() in v5.2.1 in the memory management API section?
> 
> Yes, it needs documentation. My first try (which is still in the above git
> repo) was reviewed and found badly wanting, so I'm going to rewrite it. Meanwhile,
> I agree that an interim note would be helpful, let me put something together.
> 
> [...]
> >>     https://github.com/johnhubbard/linux/commits/gup_dma_core
> >>
> >>     a) gets rebased often, and
> >>
> >>     b) has a bunch of commits (iov_iter and related) that conflict
> >>        with the latest linux.git,
> >>
> >>     c) has some bugs in the bio area, that I'm fixing, so I don't trust
> >>        that's it's safely runnable, for a few more days.
> > I assume your repo contains only work related to fixing gup issues and
> > not the main repo for gup development? i.e where gup changes are merged?
> 
> Correct, this is just a private tree, not a maintainer tree. But I'll try to
> keep the gup_dma_core branch something that is usable by others, during the
> transition over to put_user_page(), because the page-tracking patches are the
> main way to test any put_user_page() conversions.
> 
> As Ira said, we're using linux-mm as the real (maintainer) tree.
Thanks for the info! 
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA

^ permalink raw reply

* Re: [PATCH iproute2 net-next v2 1/6] Kernel header update for hardware offloading changes.
From: Patel, Vedang @ 2019-07-15 19:40 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Gomes, Vinicius,
	netdev@vger.kernel.org, Dorileo, Leandro, Jakub Kicinski,
	Murali Karicheri
In-Reply-To: <1559859735-17237-1-git-send-email-vedang.patel@intel.com>

Hi Stephen, 

The kernel patches corresponding to this series have been merged. I just wanted to check whether these iproute2 related patches are on your TODO list.

Let me know if you need any information from me on these patches.

Thanks,
Vedang Patel
> On Jun 6, 2019, at 3:22 PM, Patel, Vedang <vedang.patel@intel.com> wrote:
> 
> This should only be updated after the kernel patches related to
> txtime-offload have been merged into the kernel.
> 
> Signed-off-by: Vedang Patel <vedang.patel@intel.com>
> ---
> include/uapi/linux/pkt_sched.h | 5 +++++
> 1 file changed, 5 insertions(+)
> 
> diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
> index 8b2f993cbb77..c085860ff637 100644
> --- a/include/uapi/linux/pkt_sched.h
> +++ b/include/uapi/linux/pkt_sched.h
> @@ -990,6 +990,7 @@ struct tc_etf_qopt {
> 	__u32 flags;
> #define TC_ETF_DEADLINE_MODE_ON	BIT(0)
> #define TC_ETF_OFFLOAD_ON	BIT(1)
> +#define TC_ETF_SKIP_SOCK_CHECK  BIT(2)
> };
> 
> enum {
> @@ -1158,6 +1159,8 @@ enum {
>  *       [TCA_TAPRIO_ATTR_SCHED_ENTRY_INTERVAL]
>  */
> 
> +#define TCA_TAPRIO_ATTR_FLAG_TXTIME_ASSIST 0x1
> +
> enum {
> 	TCA_TAPRIO_ATTR_UNSPEC,
> 	TCA_TAPRIO_ATTR_PRIOMAP, /* struct tc_mqprio_qopt */
> @@ -1169,6 +1172,8 @@ enum {
> 	TCA_TAPRIO_ATTR_ADMIN_SCHED, /* The admin sched, only used in dump */
> 	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME, /* s64 */
> 	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME_EXTENSION, /* s64 */
> +	TCA_TAPRIO_ATTR_FLAGS, /* u32 */
> +	TCA_TAPRIO_ATTR_TXTIME_DELAY, /* s32 */
> 	__TCA_TAPRIO_ATTR_MAX,
> };
> 
> -- 
> 2.7.3
> 


^ permalink raw reply

* Re: [PATCH iproute2 net-next v2 1/6] Kernel header update for hardware offloading changes.
From: Stephen Hemminger @ 2019-07-15 19:50 UTC (permalink / raw)
  To: Patel, Vedang
  Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Gomes, Vinicius,
	netdev@vger.kernel.org, Dorileo, Leandro, Jakub Kicinski,
	Murali Karicheri, David Ahern
In-Reply-To: <0AFDC65C-2A16-47B7-96F6-F6844AF75095@intel.com>

On Mon, 15 Jul 2019 19:40:19 +0000
"Patel, Vedang" <vedang.patel@intel.com> wrote:

> Hi Stephen, 
> 
> The kernel patches corresponding to this series have been merged. I just wanted to check whether these iproute2 related patches are on your TODO list.
> 
> Let me know if you need any information from me on these patches.
> 
> Thanks,
> Vedang Patel


David Ahern handles iproute2 next

https://patchwork.ozlabs.org/patch/1111466/

^ permalink raw reply

* Re: [PATCH bpf 0/5] bpf: allow wide (u64) aligned loads for some fields of bpf_sock_addr
From: Yonghong Song @ 2019-07-15 19:57 UTC (permalink / raw)
  To: Stanislav Fomichev, netdev@vger.kernel.org, bpf@vger.kernel.org
  Cc: davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net
In-Reply-To: <20190715163956.204061-1-sdf@google.com>



On 7/15/19 9:39 AM, Stanislav Fomichev wrote:
> When fixing selftests by adding support for wide stores, Yonghong
> reported that he had seen some examples where clang generates
> single u64 loads for two adjacent u32s as well:
> http://lore.kernel.org/netdev/a66c937f-94c0-eaf8-5b37-8587d66c0c62@fb.com
> 
> Let's support aligned u64 reads for some bpf_sock_addr fields
> as well.
> 
> (This can probably wait for bpf-next, I'll defer to Younhong and the
> maintainers.)
> 
> Cc: Yonghong Song <yhs@fb.com>
> 
> Stanislav Fomichev (5):
>    bpf: rename bpf_ctx_wide_store_ok to bpf_ctx_wide_access_ok
>    bpf: allow wide aligned loads for bpf_sock_addr user_ip6 and
>      msg_src_ip6
>    selftests/bpf: rename verifier/wide_store.c to verifier/wide_access.c
>    selftests/bpf: add selftests for wide loads
>    bpf: sync bpf.h to tools/

Thanks for fixing. Maybe getting into bpf is better as this indeed
a potential issue? I do not have strong feeling either as the
issue can be easily workarounded with "volatile" tricks.

Acked-by: Yonghong Song <yhs@fb.com>

> 
>   include/linux/filter.h                        |  2 +-
>   include/uapi/linux/bpf.h                      |  4 +-
>   net/core/filter.c                             | 24 ++++--
>   tools/include/uapi/linux/bpf.h                |  4 +-
>   .../selftests/bpf/verifier/wide_access.c      | 73 +++++++++++++++++++
>   .../selftests/bpf/verifier/wide_store.c       | 36 ---------
>   6 files changed, 95 insertions(+), 48 deletions(-)
>   create mode 100644 tools/testing/selftests/bpf/verifier/wide_access.c
>   delete mode 100644 tools/testing/selftests/bpf/verifier/wide_store.c
> 

^ permalink raw reply

* [PATCH V35 23/29] bpf: Restrict bpf when kernel lockdown is in confidentiality mode
From: Matthew Garrett @ 2019-07-15 19:59 UTC (permalink / raw)
  To: jmorris
  Cc: linux-security-module, linux-kernel, linux-api, David Howells,
	Alexei Starovoitov, Matthew Garrett, netdev, Chun-Yi Lee,
	Daniel Borkmann
In-Reply-To: <20190715195946.223443-1-matthewgarrett@google.com>

From: David Howells <dhowells@redhat.com>

bpf_read() and bpf_read_str() could potentially be abused to (eg) allow
private keys in kernel memory to be leaked. Disable them if the kernel
has been locked down in confidentiality mode.

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
cc: netdev@vger.kernel.org
cc: Chun-Yi Lee <jlee@suse.com>
cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
---
 include/linux/security.h     |  1 +
 kernel/trace/bpf_trace.c     | 10 ++++++++++
 security/lockdown/lockdown.c |  1 +
 3 files changed, 12 insertions(+)

diff --git a/include/linux/security.h b/include/linux/security.h
index 987d8427f091..8dd1741a52cd 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -118,6 +118,7 @@ enum lockdown_reason {
 	LOCKDOWN_INTEGRITY_MAX,
 	LOCKDOWN_KCORE,
 	LOCKDOWN_KPROBES,
+	LOCKDOWN_BPF_READ,
 	LOCKDOWN_CONFIDENTIALITY_MAX,
 };
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ca1255d14576..605908da61c5 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -142,7 +142,12 @@ BPF_CALL_3(bpf_probe_read, void *, dst, u32, size, const void *, unsafe_ptr)
 {
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_BPF_READ);
+	if (ret)
+		goto out;
+
 	ret = probe_kernel_read(dst, unsafe_ptr, size);
+out:
 	if (unlikely(ret < 0))
 		memset(dst, 0, size);
 
@@ -569,6 +574,10 @@ BPF_CALL_3(bpf_probe_read_str, void *, dst, u32, size,
 {
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_BPF_READ);
+	if (ret)
+		goto out;
+
 	/*
 	 * The strncpy_from_unsafe() call will likely not fill the entire
 	 * buffer, but that's okay in this circumstance as we're probing
@@ -579,6 +588,7 @@ BPF_CALL_3(bpf_probe_read_str, void *, dst, u32, size,
 	 * is returned that can be used for bpf_perf_event_output() et al.
 	 */
 	ret = strncpy_from_unsafe(dst, unsafe_ptr, size);
+out:
 	if (unlikely(ret < 0))
 		memset(dst, 0, size);
 
diff --git a/security/lockdown/lockdown.c b/security/lockdown/lockdown.c
index ccb3e9a2a47c..d14b89784412 100644
--- a/security/lockdown/lockdown.c
+++ b/security/lockdown/lockdown.c
@@ -33,6 +33,7 @@ static char *lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
 	[LOCKDOWN_INTEGRITY_MAX] = "integrity",
 	[LOCKDOWN_KCORE] = "/proc/kcore access",
 	[LOCKDOWN_KPROBES] = "use of kprobes",
+	[LOCKDOWN_BPF_READ] = "use of bpf to read kernel RAM",
 	[LOCKDOWN_CONFIDENTIALITY_MAX] = "confidentiality",
 };
 
-- 
2.22.0.510.g264f2c817a-goog


^ permalink raw reply related

* Re: [PATCH 7/9] x86/pci: Pass lockdep condition to pcm_mmcfg_list iterator (v1)
From: Bjorn Helgaas @ 2019-07-15 20:02 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Alexey Kuznetsov, Borislav Petkov, c0d1n61at3,
	David S. Miller, edumazet, Greg Kroah-Hartman, Hideaki YOSHIFUJI,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Josh Triplett,
	keescook, kernel-hardening, kernel-team, Lai Jiangshan, Len Brown,
	linux-acpi, linux-doc, linux-pci, linux-pm, Mathieu Desnoyers,
	neilb, netdev, Oleg Nesterov, Paul E. McKenney, Pavel Machek,
	peterz, Rafael J. Wysocki, Rasmus Villemoes, rcu, Steven Rostedt,
	Tejun Heo, Thomas Gleixner, will,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)
In-Reply-To: <20190715143705.117908-8-joel@joelfernandes.org>

On Mon, Jul 15, 2019 at 10:37:03AM -0400, Joel Fernandes (Google) wrote:
> The pcm_mmcfg_list is traversed with list_for_each_entry_rcu without a
> reader-lock held, because the pci_mmcfg_lock is already held. Make this
> known to the list macro so that it fixes new lockdep warnings that
> trigger due to lockdep checks added to list_for_each_entry_rcu().
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Ingo takes care of most patches to this file, but FWIW,

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

I would personally prefer if you capitalized the subject to match the
"x86/PCI:" convention that's used fairly consistently in
arch/x86/pci/.

Also, I didn't apply this to be sure, but it looks like this might
make a line or two wider than 80 columns, which I would rewrap if I
were applying this.

> ---
>  arch/x86/pci/mmconfig-shared.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c
> index 7389db538c30..6fa42e9c4e6f 100644
> --- a/arch/x86/pci/mmconfig-shared.c
> +++ b/arch/x86/pci/mmconfig-shared.c
> @@ -29,6 +29,7 @@
>  static bool pci_mmcfg_running_state;
>  static bool pci_mmcfg_arch_init_failed;
>  static DEFINE_MUTEX(pci_mmcfg_lock);
> +#define pci_mmcfg_lock_held() lock_is_held(&(pci_mmcfg_lock).dep_map)
>  
>  LIST_HEAD(pci_mmcfg_list);
>  
> @@ -54,7 +55,7 @@ static void list_add_sorted(struct pci_mmcfg_region *new)
>  	struct pci_mmcfg_region *cfg;
>  
>  	/* keep list sorted by segment and starting bus number */
> -	list_for_each_entry_rcu(cfg, &pci_mmcfg_list, list) {
> +	list_for_each_entry_rcu(cfg, &pci_mmcfg_list, list, pci_mmcfg_lock_held()) {
>  		if (cfg->segment > new->segment ||
>  		    (cfg->segment == new->segment &&
>  		     cfg->start_bus >= new->start_bus)) {
> @@ -118,7 +119,7 @@ struct pci_mmcfg_region *pci_mmconfig_lookup(int segment, int bus)
>  {
>  	struct pci_mmcfg_region *cfg;
>  
> -	list_for_each_entry_rcu(cfg, &pci_mmcfg_list, list)
> +	list_for_each_entry_rcu(cfg, &pci_mmcfg_list, list, pci_mmcfg_lock_held())
>  		if (cfg->segment == segment &&
>  		    cfg->start_bus <= bus && bus <= cfg->end_bus)
>  			return cfg;
> -- 
> 2.22.0.510.g264f2c817a-goog
> 

^ permalink raw reply

* [pull request][net 0/3] Mellanox, mlx5 fixes 2019-07-15
From: Saeed Mahameed @ 2019-07-15 20:09 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev@vger.kernel.org, Saeed Mahameed

Hi Dave,

This pull request provides mlx5 TC flower and tunnel fixes for kernel 5.2
from Eli and Vlad.

Please pull and let me know if there is any problem.

Thanks,
Saeed.

---
The following changes since commit f384e62a82ba5d85408405fdd6aeff89354deaa9:

  ISDN: hfcsusb: checking idx of ep configuration (2019-07-15 11:10:31 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-fixes-2019-07-15

for you to fetch changes up to 3d144578c91a2db417923ba905ce7a84ce0c274b:

  net/mlx5e: Allow dissector meta key in tc flower (2019-07-15 13:04:04 -0700)

----------------------------------------------------------------
mlx5-fixes-2019-07-15

----------------------------------------------------------------
Eli Cohen (1):
      net/mlx5e: Verify encapsulation is supported

Vlad Buslov (2):
      net/mlx5e: Rely on filter_dev instead of dissector keys for tunnels
      net/mlx5e: Allow dissector meta key in tc flower

 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

^ permalink raw reply

* [net 1/3] net/mlx5e: Verify encapsulation is supported
From: Saeed Mahameed @ 2019-07-15 20:09 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Eli Cohen, Roi Dayan, Saeed Mahameed
In-Reply-To: <20190715200940.31799-1-saeedm@mellanox.com>

From: Eli Cohen <eli@mellanox.com>

When mlx5e_attach_encap() calls mlx5e_get_tc_tun() to get the tunnel
info data struct, check that returned value is not NULL, as would be in
the case of unsupported encapsulation.

Fixes: d386939a327d2 ("net/mlx5e: Rearrange tc tunnel code in a modular way")
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 2d6436257f9d..018709a4343f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2647,6 +2647,10 @@ static int mlx5e_attach_encap(struct mlx5e_priv *priv,
 	family = ip_tunnel_info_af(tun_info);
 	key.ip_tun_key = &tun_info->key;
 	key.tc_tunnel = mlx5e_get_tc_tun(mirred_dev);
+	if (!key.tc_tunnel) {
+		NL_SET_ERR_MSG_MOD(extack, "Unsupported tunnel");
+		return -EOPNOTSUPP;
+	}

 	hash_key = hash_encap_info(&key);

-- 
2.21.0

^ permalink raw reply related

* [net 3/3] net/mlx5e: Allow dissector meta key in tc flower
From: Saeed Mahameed @ 2019-07-15 20:09 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Vlad Buslov, Roi Dayan, Saeed Mahameed
In-Reply-To: <20190715200940.31799-1-saeedm@mellanox.com>

From: Vlad Buslov <vladbu@mellanox.com>

Recently, fl_flow_key->indev_ifindex int field was refactored into
flow_dissector_key_meta field. With this, flower classifier also sets
FLOW_DISSECTOR_KEY_META flow dissector key. However, mlx5 flower dissector
validation code rejects filters that use flow dissector keys that are not
supported. Add FLOW_DISSECTOR_KEY_META to the list of allowed dissector
keys in __parse_cls_flower() to prevent following error when offloading
flower classifier to mlx5:

Error: mlx5_core: Unsupported key.

Fixes: 8212ed777f40 ("net: sched: cls_flower: use flow_dissector for ingress ifindex")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index b95e0ae4d7fd..cc096f6011d9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1499,7 +1499,8 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
 	*match_level = MLX5_MATCH_NONE;

 	if (dissector->used_keys &
-	    ~(BIT(FLOW_DISSECTOR_KEY_CONTROL) |
+	    ~(BIT(FLOW_DISSECTOR_KEY_META) |
+	      BIT(FLOW_DISSECTOR_KEY_CONTROL) |
 	      BIT(FLOW_DISSECTOR_KEY_BASIC) |
 	      BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) |
 	      BIT(FLOW_DISSECTOR_KEY_VLAN) |
-- 
2.21.0

^ permalink raw reply related

* [net 2/3] net/mlx5e: Rely on filter_dev instead of dissector keys for tunnels
From: Saeed Mahameed @ 2019-07-15 20:09 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Vlad Buslov, Roi Dayan, Saeed Mahameed
In-Reply-To: <20190715200940.31799-1-saeedm@mellanox.com>

From: Vlad Buslov <vladbu@mellanox.com>

Currently, tunnel attributes are parsed and inner header matching is used
only when flow dissector specifies match on some of the supported
encapsulation fields. When user tries to offload tc filter that doesn't
match any encapsulation fields on tunnel device, mlx5 tc layer incorrectly
sets to match packet header keys on encap header (outer header) and
firmware rejects the rule with syndrome 0x7e1579 when creating new flow
group.

Change __parse_cls_flower() to determine whether tunnel is used based on
fitler_dev tunnel info, instead of determining it indirectly by checking
flow dissector enc keys.

Fixes: bbd00f7e2349 ("net/mlx5e: Add TC tunnel release action for SRIOV offloads")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 018709a4343f..b95e0ae4d7fd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1522,11 +1522,7 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
 		return -EOPNOTSUPP;
 	}

-	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS) ||
-	    flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS) ||
-	    flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ENC_KEYID) ||
-	    flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ENC_PORTS) ||
-	    flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ENC_OPTS)) {
+	if (mlx5e_get_tc_tun(filter_dev)) {
 		if (parse_tunnel_attr(priv, spec, f, filter_dev, tunnel_match_level))
 			return -EOPNOTSUPP;

-- 
2.21.0

^ permalink raw reply related

* Re: [PATCH iproute2 net-next v2 1/6] Kernel header update for hardware offloading changes.
From: David Ahern @ 2019-07-15 20:16 UTC (permalink / raw)
  To: Stephen Hemminger, Patel, Vedang
  Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Gomes, Vinicius,
	netdev@vger.kernel.org, Dorileo, Leandro, Jakub Kicinski,
	Murali Karicheri
In-Reply-To: <20190715125059.70470f9e@hermes.lan>

On 7/15/19 1:50 PM, Stephen Hemminger wrote:
> On Mon, 15 Jul 2019 19:40:19 +0000
> "Patel, Vedang" <vedang.patel@intel.com> wrote:
> 
>> Hi Stephen, 
>>
>> The kernel patches corresponding to this series have been merged. I just wanted to check whether these iproute2 related patches are on your TODO list.
>>
>> Let me know if you need any information from me on these patches.
>>
>> Thanks,
>> Vedang Patel
> 
> 
> David Ahern handles iproute2 next
> 
> https://patchwork.ozlabs.org/patch/1111466/
> 

given the long time delay between when the iproute2 patches were posted
and when the kernel side was accepted you will need to re-send the
iproute2 patches.

^ permalink raw reply

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Paul Moore @ 2019-07-15 20:38 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Tycho Andersen, containers, linux-api, Linux-Audit Mailing List,
	linux-fsdevel, LKML, netdev, netfilter-devel, sgrubb, omosnace,
	dhowells, simo, Eric Paris, Serge Hallyn, ebiederm, nhorman
In-Reply-To: <20190708175105.7zb6mikjw2wmnwln@madcap2.tricolour.ca>

On Mon, Jul 8, 2019 at 1:51 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-05-29 11:29, Paul Moore wrote:

...

> > The idea is that only container orchestrators should be able to
> > set/modify the audit container ID, and since setting the audit
> > container ID can have a significant effect on the records captured
> > (and their routing to multiple daemons when we get there) modifying
> > the audit container ID is akin to modifying the audit configuration
> > which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
> > is that you would only change the audit container ID from one
> > set/inherited value to another if you were nesting containers, in
> > which case the nested container orchestrator would need to be granted
> > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > compromise).  We did consider allowing for a chain of nested audit
> > container IDs, but the implications of doing so are significant
> > (implementation mess, runtime cost, etc.) so we are leaving that out
> > of this effort.
>
> We had previously discussed the idea of restricting
> orchestrators/engines from only being able to set the audit container
> identifier on their own descendants, but it was discarded.  I've added a
> check to ensure this is now enforced.

When we weren't allowing nested orchestrators it wasn't necessary, but
with the move to support nesting I believe this will be a requirement.
We might also need/want to restrict audit container ID changes if a
descendant is acting as a container orchestrator and managing one or
more audit container IDs; although I'm less certain of the need for
this.

> I've also added a check to ensure that a process can't set its own audit
> container identifier ...

What does this protect against, or what problem does this solve?
Considering how easy it is to fork/exec, it seems like this could be
trivially bypassed.

> ... and that if the identifier is already set, then the
> orchestrator/engine must be in a descendant user namespace from the
> orchestrator that set the previously inherited audit container
> identifier.

You lost me here ... although I don't like the idea of relying on X
namespace inheritance for a hard coded policy on setting the audit
container ID; we've worked hard to keep this independent of any
definition of a "container" and it would sadden me greatly if we had
to go back on that.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [net-next 1/2] ipvs: batch __ip_vs_cleanup
From: Julian Anastasov @ 2019-07-15 20:39 UTC (permalink / raw)
  To: Haishuang Yan
  Cc: David S. Miller, Pablo Neira Ayuso, Simon Horman, netdev,
	lvs-devel, linux-kernel, netfilter-devel
In-Reply-To: <1563031186-2101-2-git-send-email-yanhaishuang@cmss.chinamobile.com>


	Hello,

On Sat, 13 Jul 2019, Haishuang Yan wrote:

> It's better to batch __ip_vs_cleanup to speedup ipvs
> connections dismantle.
> 
> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
> ---
>  include/net/ip_vs.h             |  2 +-
>  net/netfilter/ipvs/ip_vs_core.c | 29 +++++++++++++++++------------
>  net/netfilter/ipvs/ip_vs_ctl.c  | 13 ++++++++++---
>  3 files changed, 28 insertions(+), 16 deletions(-)
> 
> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index 3759167..93e7a25 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -1324,7 +1324,7 @@ static inline void ip_vs_control_del(struct ip_vs_conn *cp)
>  void ip_vs_control_net_cleanup(struct netns_ipvs *ipvs);
>  void ip_vs_estimator_net_cleanup(struct netns_ipvs *ipvs);
>  void ip_vs_sync_net_cleanup(struct netns_ipvs *ipvs);
> -void ip_vs_service_net_cleanup(struct netns_ipvs *ipvs);
> +void ip_vs_service_nets_cleanup(struct list_head *net_list);
>  
>  /* IPVS application functions
>   * (from ip_vs_app.c)
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index 46f06f9..b4d79b7 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -2402,18 +2402,23 @@ static int __net_init __ip_vs_init(struct net *net)
>  	return -ENOMEM;
>  }
>  
> -static void __net_exit __ip_vs_cleanup(struct net *net)
> +static void __net_exit __ip_vs_cleanup_batch(struct list_head *net_list)
>  {
> -	struct netns_ipvs *ipvs = net_ipvs(net);
> -
> -	ip_vs_service_net_cleanup(ipvs);	/* ip_vs_flush() with locks */
> -	ip_vs_conn_net_cleanup(ipvs);
> -	ip_vs_app_net_cleanup(ipvs);
> -	ip_vs_protocol_net_cleanup(ipvs);
> -	ip_vs_control_net_cleanup(ipvs);
> -	ip_vs_estimator_net_cleanup(ipvs);
> -	IP_VS_DBG(2, "ipvs netns %d released\n", ipvs->gen);
> -	net->ipvs = NULL;
> +	struct netns_ipvs *ipvs;
> +	struct net *net;
> +	LIST_HEAD(list);
> +
> +	ip_vs_service_nets_cleanup(net_list);	/* ip_vs_flush() with locks */
> +	list_for_each_entry(net, net_list, exit_list) {

	How much faster is to replace list_for_each_entry in
ops_exit_list() with this one. IPVS can waste time in calls
such as kthread_stop() and del_timer_sync() but I'm not sure
we can solve it easily. What gain do you see in benchmarks?

> +		ipvs = net_ipvs(net);
> +		ip_vs_conn_net_cleanup(ipvs);
> +		ip_vs_app_net_cleanup(ipvs);
> +		ip_vs_protocol_net_cleanup(ipvs);
> +		ip_vs_control_net_cleanup(ipvs);
> +		ip_vs_estimator_net_cleanup(ipvs);
> +		IP_VS_DBG(2, "ipvs netns %d released\n", ipvs->gen);
> +		net->ipvs = NULL;
> +	}
>  }

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [PATCH iproute2] tc: util: constrain percentage in 0-100 interval
From: Stephen Hemminger @ 2019-07-15 20:48 UTC (permalink / raw)
  To: Andrea Claudi; +Cc: netdev, dsahern
In-Reply-To: <c0a9b4ce15d5389ac59fbf572f5f1b3030ec4c90.1563011008.git.aclaudi@redhat.com>

On Sat, 13 Jul 2019 11:44:07 +0200
Andrea Claudi <aclaudi@redhat.com> wrote:

> parse_percent() currently allows to specify negative percentages
> or value above 100%. However this does not seems to make sense,
> as the function is used for probabilities or bandiwidth rates.
> 
> Moreover, using negative values leads to erroneous results
> (using Bernoulli loss model as example):
> 
> $ ip link add test type dummy
> $ ip link set test up
> $ tc qdisc add dev test root netem loss gemodel -10% limit 10
> $ tc qdisc show dev test
> qdisc netem 800c: root refcnt 2 limit 10 loss gemodel p 90% r 10% 1-h 100% 1-k 0%
> 
> Using values above 100% we have instead:
> 
> $ ip link add test type dummy
> $ ip link set test up
> $ tc qdisc add dev test root netem loss gemodel 140% limit 10
> $ tc qdisc show dev test
> qdisc netem 800f: root refcnt 2 limit 10 loss gemodel p 40% r 60% 1-h 100% 1-k 0%
> 
> This commit changes parse_percent() with a check to ensure
> percentage values stay between 1.0 and 0.0.
> parse_percent_rate() function, which already employs a similar
> check, is adjusted accordingly.
> 
> With this check in place, we have:
> 
> $ ip link add test type dummy
> $ ip link set test up
> $ tc qdisc add dev test root netem loss gemodel -10% limit 10
> Illegal "loss gemodel p"
> 
> Fixes: 927e3cfb52b58 ("tc: B.W limits can now be specified in %.")
> Signed-off-by: Andrea Claudi <aclaudi@redhat.com>

Looks good. Applied

^ permalink raw reply

* [bpf PATCH v3 0/8] sockmap/tls fixes
From: John Fastabend @ 2019-07-15 20:49 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf

Resolve a series of splats discovered by syzbot and an unhash
TLS issue noted by Eric Dumazet.

The main issues revolved around interaction between TLS and
sockmap tear down. TLS and sockmap could both reset sk->prot
ops creating a condition where a close or unhash op could be
called forever. A rare race condition resulting from a missing
rcu sync operation was causing a use after free. Then on the
TLS side dropping the sock lock and re-acquiring it during the
close op could hang. Finally, sockmap must be deployed before
tls for current stack assumptions to be met. This is enforced
now. A feature series can enable it.

To fix this first refactor TLS code so the lock is held for the
entire teardown operation. Then add an unhash callback to ensure
TLS can not transition from ESTABLISHED to LISTEN state. This
transition is a similar bug to the one found and fixed previously
in sockmap. Then apply three fixes to sockmap to fix up races
on tear down around map free and close. Finally, if sockmap
is destroyed before TLS we add a new ULP op update to inform
the TLS stack it should not call sockmap ops. This last one
appears to be the most commonly found issue from syzbot.

---

Jakub Kicinski (1):
      net/tls: don't arm strparser immediately in tls_set_sw_offload()

John Fastabend (7):
      tls: remove close callback sock unlock/lock around TX work flush
      tls: remove sock unlock/lock around strp_done()
      bpf: tls fix transition through disconnect with close
      bpf: sockmap, sock_map_delete needs to use xchg
      bpf: sockmap, synchronize_rcu before free'ing map
      bpf: sockmap, only create entry if ulp is not already enabled
      bpf: sockmap/tls, close can race with map free

 include/linux/skmsg.h |    8 ++-
 include/net/tcp.h     |    3 +
 include/net/tls.h     |   12 +++-
 net/core/skmsg.c      |    4 +
 net/core/sock_map.c   |   19 ++++--
 net/ipv4/tcp_ulp.c    |   13 ++++
 net/tls/tls_main.c    |  155 ++++++++++++++++++++++++++++++++++++++-----------
 net/tls/tls_sw.c      |   76 +++++++++++++++++-------
 8 files changed, 221 insertions(+), 69 deletions(-)

--
Signature

^ permalink raw reply

* [bpf PATCH v3 1/8] net/tls: don't arm strparser immediately in tls_set_sw_offload()
From: John Fastabend @ 2019-07-15 20:49 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

From: Jakub Kicinski <jakub.kicinski@netronome.com>

In tls_set_device_offload_rx() we prepare the software context
for RX fallback and proceed to add the connection to the device.
Unfortunately, software context prep includes arming strparser
so in case of a later error we have to release the socket lock
to call strp_done().

In preparation for not releasing the socket lock half way through
callbacks move arming strparser into a separate function.
Following patches will make use of that.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
---
 include/net/tls.h    |    1 +
 net/tls/tls_device.c |    1 +
 net/tls/tls_main.c   |    8 +++++---
 net/tls/tls_sw.c     |   19 ++++++++++++-------
 4 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 584609174fe0..43f551cd508b 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -355,6 +355,7 @@ int tls_sk_attach(struct sock *sk, int optname, char __user *optval,
 		  unsigned int optlen);
 
 int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx);
+void tls_sw_strparser_arm(struct sock *sk, struct tls_context *ctx);
 int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 int tls_sw_sendpage(struct sock *sk, struct page *page,
 		    int offset, size_t size, int flags);
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 7c0b2b778703..4d67d72f007c 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -1045,6 +1045,7 @@ int tls_set_device_offload_rx(struct sock *sk, struct tls_context *ctx)
 	rc = tls_set_sw_offload(sk, ctx, 0);
 	if (rc)
 		goto release_ctx;
+	tls_sw_strparser_arm(sk, ctx);
 
 	rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_RX,
 					     &ctx->crypto_recv.info,
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 4674e57e66b0..85a9d7d57b32 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -526,6 +526,8 @@ static int do_tls_setsockopt_conf(struct sock *sk, char __user *optval,
 		{
 #endif
 			rc = tls_set_sw_offload(sk, ctx, 1);
+			if (rc)
+				goto err_crypto_info;
 			conf = TLS_SW;
 		}
 	} else {
@@ -537,13 +539,13 @@ static int do_tls_setsockopt_conf(struct sock *sk, char __user *optval,
 		{
 #endif
 			rc = tls_set_sw_offload(sk, ctx, 0);
+			if (rc)
+				goto err_crypto_info;
+			tls_sw_strparser_arm(sk, ctx);
 			conf = TLS_SW;
 		}
 	}
 
-	if (rc)
-		goto err_crypto_info;
-
 	if (tx)
 		ctx->tx_conf = conf;
 	else
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 53b4ad94e74a..f58a8ffc2a9c 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2160,6 +2160,18 @@ void tls_sw_write_space(struct sock *sk, struct tls_context *ctx)
 	}
 }
 
+void tls_sw_strparser_arm(struct sock *sk, struct tls_context *tls_ctx)
+{
+	struct tls_sw_context_rx *rx_ctx = tls_sw_ctx_rx(tls_ctx);
+
+	write_lock_bh(&sk->sk_callback_lock);
+	rx_ctx->saved_data_ready = sk->sk_data_ready;
+	sk->sk_data_ready = tls_data_ready;
+	write_unlock_bh(&sk->sk_callback_lock);
+
+	strp_check_rcv(&rx_ctx->strp);
+}
+
 int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
@@ -2357,13 +2369,6 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 		cb.parse_msg = tls_read_size;
 
 		strp_init(&sw_ctx_rx->strp, sk, &cb);
-
-		write_lock_bh(&sk->sk_callback_lock);
-		sw_ctx_rx->saved_data_ready = sk->sk_data_ready;
-		sk->sk_data_ready = tls_data_ready;
-		write_unlock_bh(&sk->sk_callback_lock);
-
-		strp_check_rcv(&sw_ctx_rx->strp);
 	}
 
 	goto out;


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox