Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v2 0/4] Deterministic SPI latency with NXP DSPI driver
From: Mark Brown @ 2019-09-09 10:06 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: h.feurstein, mlichvar, richardcochran, andrew, f.fainelli,
	linux-spi, netdev
In-Reply-To: <20190905010114.26718-1-olteanv@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 773 bytes --]

On Thu, Sep 05, 2019 at 04:01:10AM +0300, Vladimir Oltean wrote:

> This patchset proposes an interface from the SPI subsystem for
> software timestamping SPI transfers. There is a default implementation
> provided in the core, as well as a mechanism for SPI slave drivers to
> check which byte was in fact timestamped post-facto. The patchset also
> adds the first user of this interface (the NXP DSPI driver in TCFQ mode).

I think this is about as good as we're going to get but we're
very near the merge window now so I'll leave this until after the
merge window is done in case there's more review comments before
applying.  I need to reread the implementation code a bit as
well, it looked fine on a first scan through but it's possible I
might spot something later.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: net/dst_cache.c: preemption bug in net/dst_cache.c
From: Xin Long @ 2019-09-09  9:48 UTC (permalink / raw)
  To: Bharath Vedartham
  Cc: davem, Greg Kroah-Hartman, allison, tglx, network dev, LKML,
	Jon Maloy
In-Reply-To: <20190822195132.GA2100@bharath12345-Inspiron-5559>

On Fri, Aug 23, 2019 at 3:58 PM Bharath Vedartham <linux.bhar@gmail.com> wrote:
>
> Hi all,
>
> I just want to bring attention to the syzbot bug [1]
>
> Even though syzbot claims the bug to be in net/tipc, I feel it is in
> net/dst_cache.c. Please correct me if I am wrong.
>
> This bug is being triggered a lot of times by syzbot since the day it
> was reported. Also given that this is core networking code, I felt it
> was important to bring this to attention.
>
> It looks like preemption needs to be disabled before using this_cpu_ptr
> or maybe we would be better of using a get_cpu_var and put_cpu_var combo
> here.
b->media->send_msg (tipc_udp_send_msg)
-> tipc_udp_xmit() -> dst_cache_get()

send_msg() is always called under the protection of rcu_read_lock(), which
already disabled preemption. If not, there must be some unbalanced calls of
disable/enable preemption elsewhere.

Agree that this could be a serious issue, do you have any reproducer for this?

Thanks.

>
> [1] https://syzkaller.appspot.com/bug?id=dc6352b92862eb79373fe03fdf9af5928753e057
>
> Thank you
> Bharath

^ permalink raw reply

* Re: WARNING in hso_free_net_device
From: Oliver Neukum @ 2019-09-09  9:47 UTC (permalink / raw)
  To: Hui Peng, Andrey Konovalov
  Cc: David S. Miller, syzkaller-bugs, alexios.zavras, Thomas Gleixner,
	Greg Kroah-Hartman, Mathias Payer, Stephen Hemminger, rfontana,
	syzbot+44d53c7255bb1aea22d2, LKML, USB list, netdev
In-Reply-To: <02ef64cc-5053-e6da-fc59-9970f48064c5@gmail.com>

Am Donnerstag, den 05.09.2019, 22:05 -0400 schrieb Hui Peng:
> 
> On 9/5/2019 7:24 AM, Andrey Konovalov wrote:
> > On Thu, Sep 5, 2019 at 4:20 AM Hui Peng <benquike@gmail.com> wrote:
> > > 
> > > Can you guys have  a look at the attached patch?
> > 
> > Let's try it:
> > 
> > #syz test: https://github.com/google/kasan.git eea39f24
> > 
> > FYI: there are two more reports coming from this driver, which might
> > (or might not) have the same root cause. One of them has a suggested
> > fix by Oliver.
> > 
> > https://syzkaller.appspot.com/bug?extid=67b2bd0e34f952d0321e
> > https://syzkaller.appspot.com/bug?extid=93f2f45b19519b289613
> > 
> 
> I think they are different, though similar.
> This one is resulted from unregistering a network device.
> These 2 are resulted from unregistering a tty device.

Hi,

looks like it. That may indeed be the issue.
Please try to have syzbot test your patch and we will
know more.

	Regards
		Oliver


^ permalink raw reply

* Re: VRF Issue Since kernel 5
From: Alexis Bauvin @ 2019-09-09  9:28 UTC (permalink / raw)
  To: Gowen; +Cc: netdev@vger.kernel.org
In-Reply-To: <CWLP265MB1554308A1373D9ECE68CB854FDB70@CWLP265MB1554.GBRP265.PROD.OUTLOOK.COM>

Hi,

There has been some changes regarding VRF isolation in Linux 5 IIRC, namely proper
isolation of the default VRF.

Some things you may try:

- looking at the l3mdev_accept sysctls (e.g. `net.ipv4.tcp_l3mdev_accept`)
- querying stuff from the management vrf through `ip vrf exec vrf-mgmt <stuff>`
  e.g. `ip vrf exec vrf-mgmt curl kernel.org`
       `ip vrf exec vrf-mgmt dig @1.1.1.1 kernel.org`
- reversing your logic: default VRF is your management one, the other one is for your
  other boxes

Also, your `unreachable default metric 4278198272` route looks odd to me.

What are your routing rules? (`ip rule`)

Alexis

> Le 9 sept. 2019 à 09:46, Gowen <gowen@potatocomputing.co.uk> a écrit :
> 
> Hi there,
> 
> Dave A said this was the mailer to send this to:
> 
> 
> I’ve been using my management interface in a VRF for several months now and it’s worked perfectly – I’ve been able to update/upgrade the packages just fine and iptables works excellently with it – exactly as I needed.
> 
> 
> Since Kernel 5 though I am no longer able to update – but the issue is quite a curious one as some traffic appears to be fine (DNS lookups use VRF correctly) but others don’t (updating/upgrading the packages)
> 
> 
> I have on this device 2 interfaces:
> Eth0 for management – inbound SSH, DNS, updates/upgrades
> Eth1 for managing other boxes (ansible using SSH)
> 
> 
> Link and addr info shown below:
> 
> 
> Admin@NETM06:~$ ip link show
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master mgmt-vrf state UP mode DEFAULT group default qlen 1000
>     link/ether 00:22:48:07:cc:ad brd ff:ff:ff:ff:ff:ff
> 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
>     link/ether 00:22:48:07:c9:6c brd ff:ff:ff:ff:ff:ff
> 4: mgmt-vrf: <NOARP,MASTER,UP,LOWER_UP> mtu 65536 qdisc noqueue state UP mode DEFAULT group default qlen 1000
>     link/ether 8a:f6:26:65:02:5a brd ff:ff:ff:ff:ff:ff
> 
> 
> Admin@NETM06:~$ ip addr
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>     inet 127.0.0.1/8 scope host lo
>        valid_lft forever preferred_lft forever
>     inet6 ::1/128 scope host
>        valid_lft forever preferred_lft forever
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master mgmt-vrf state UP group default qlen 1000
>     link/ether 00:22:48:07:cc:ad brd ff:ff:ff:ff:ff:ff
>     inet 10.24.12.10/24 brd 10.24.12.255 scope global eth0
>        valid_lft forever preferred_lft forever
>     inet6 fe80::222:48ff:fe07:ccad/64 scope link
>        valid_lft forever preferred_lft forever
> 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
>     link/ether 00:22:48:07:c9:6c brd ff:ff:ff:ff:ff:ff
>     inet 10.24.12.9/24 brd 10.24.12.255 scope global eth1
>        valid_lft forever preferred_lft forever
>     inet6 fe80::222:48ff:fe07:c96c/64 scope link
>        valid_lft forever preferred_lft forever
> 4: mgmt-vrf: <NOARP,MASTER,UP,LOWER_UP> mtu 65536 qdisc noqueue state UP group default qlen 1000
>     link/ether 8a:f6:26:65:02:5a brd ff:ff:ff:ff:ff:ff
> 
> 
> 
> the production traffic is all in the 10.0.0.0/8 network (eth1 global VRF) except for a few subnets (DNS) which are routed out eth0 (mgmt-vrf)
> 
> 
> Admin@NETM06:~$ ip route show
> default via 10.24.12.1 dev eth0
> 10.0.0.0/8 via 10.24.12.1 dev eth1
> 10.24.12.0/24 dev eth1 proto kernel scope link src 10.24.12.9
> 10.24.65.0/24 via 10.24.12.1 dev eth0
> 10.25.65.0/24 via 10.24.12.1 dev eth0
> 10.26.0.0/21 via 10.24.12.1 dev eth0
> 10.26.64.0/21 via 10.24.12.1 dev eth0
> 
> 
> Admin@NETM06:~$ ip route show vrf mgmt-vrf
> default via 10.24.12.1 dev eth0
> unreachable default metric 4278198272
> 10.24.12.0/24 dev eth0 proto kernel scope link src 10.24.12.10
> 10.24.65.0/24 via 10.24.12.1 dev eth0
> 10.25.65.0/24 via 10.24.12.1 dev eth0
> 10.26.0.0/21 via 10.24.12.1 dev eth0
> 10.26.64.0/21 via 10.24.12.1 dev eth0
> 
> 
> 
> The strange activity occurs when I enter the command “sudo apt update” as I can resolve the DNS request (10.24.65.203 or 10.24.64.203, verified with tcpdump) out eth0 but for the actual update traffic there is no activity:
> 
> 
> sudo tcpdump -i eth0 '(host 10.24.65.203 or host 10.25.65.203) and port 53' -n
> <OUTPUT OMITTED FOR BREVITY>
> 10:06:05.268735 IP 10.24.12.10.39963 > 10.24.65.203.53: 48798+ [1au] A? security.ubuntu.com. (48)
> <OUTPUT OMITTED FOR BREVITY>
> 10:06:05.284403 IP 10.24.65.203.53 > 10.24.12.10.39963: 48798 13/0/1 A 91.189.91.23, A 91.189.88.24, A 91.189.91.26, A 91.189.88.162, A 91.189.88.149, A 91.189.91.24, A 91.189.88.173, A 91.189.88.177, A 91.189.88.31, A 91.189.91.14, A 91.189.88.176, A 91.189.88.175, A 91.189.88.174 (256)
> 
> 
> 
> You can see that the update traffic is returned but is not accepted by the stack and a RST is sent
> 
> 
> Admin@NETM06:~$ sudo tcpdump -i eth0 '(not host 168.63.129.16 and port 80)' -n
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
> 10:17:12.690658 IP 10.24.12.10.40216 > 91.189.88.175.80: Flags [S], seq 2279624826, win 64240, options [mss 1460,sackOK,TS val 2029365856 ecr 0,nop,wscale 7], length 0
> 10:17:12.691929 IP 10.24.12.10.52362 > 91.189.95.83.80: Flags [S], seq 1465797256, win 64240, options [mss 1460,sackOK,TS val 3833463674 ecr 0,nop,wscale 7], length 0
> 10:17:12.696270 IP 91.189.88.175.80 > 10.24.12.10.40216: Flags [S.], seq 968450722, ack 2279624827, win 28960, options [mss 1418,sackOK,TS val 81957103 ecr 2029365856,nop,wscale 7], length 0                                                                                                                            
> 10:17:12.696301 IP 10.24.12.10.40216 > 91.189.88.175.80: Flags [R], seq 2279624827, win 0, length 0
> 10:17:12.697884 IP 91.189.95.83.80 > 10.24.12.10.52362: Flags [S.], seq 4148330738, ack 1465797257, win 28960, options [mss 1418,sackOK,TS val 2257624414 ecr 3833463674,nop,wscale 8], length 0                                                                                                                         
> 10:17:12.697909 IP 10.24.12.10.52362 > 91.189.95.83.80: Flags [R], seq 1465797257, win 0, length 0
> 
> 
> 
> 
> I can emulate the DNS lookup using netcat in the vrf:
> 
> 
> sudo ip vrf exec mgmt-vrf nc -u 10.24.65.203 53
> 
> 
> then interactively enter the binary for a www.google.co.uk request:
> 
> 
> 0035624be394010000010000000000010377777706676f6f676c6502636f02756b00000100010000290200000000000000
> 
> 
> This returns as expected:
> 
> 
> 00624be394010000010000000000010377777706676f6f676c6502636f02756b00000100010000290200000000000000
> 
> 
> I can run:
> 
> 
> Admin@NETM06:~$ host www.google.co.uk
> www.google.co.uk has address 172.217.169.3
> www.google.co.uk has IPv6 address 2a00:1450:4009:80d::2003
> 
> 
> but I get a timeout for:
> 
> 
> sudo ip vrf  exec mgmt-vrf host www.google.co.uk
> ;; connection timed out; no servers could be reached
> 
> 
> 
> However I can take a repo address and vrf exec to it on port 80:
> 
> 
> Admin@NETM06:~$ sudo ip vrf  exec mgmt-vrf nc 91.189.91.23 80
> hello
> HTTP/1.1 400 Bad Request
> <OUTPUT OMITTED>
> 
> My iptables rule:
> 
> 
> sudo iptables -Z
> Admin@NETM06:~$ sudo iptables -L -v
> Chain INPUT (policy DROP 16 packets, 3592 bytes)
> pkts bytes target     prot opt in     out     source               destination
>    44  2360 ACCEPT     tcp  --  any    any     anywhere             anywhere             tcp spt:http ctstate RELATED,ESTABLISHED
>    83 10243 ACCEPT     udp  --  any    any     anywhere             anywhere             udp spt:domain ctstate RELATED,ESTABLISHED
> 
> 
> 
> I cannot find out why the update isn’t working. Any help greatly appreciated
> 
> 
> Kind Regards,
> 
> 
> Gareth


^ permalink raw reply

* Re: [PATCH] net: stmmac: socfpga: re-use the `interface` parameter from platform data
From: Ardelean, Alexandru @ 2019-09-09  8:53 UTC (permalink / raw)
  To: lkp@intel.com
  Cc: davem@davemloft.net, linux-stm32@st-md-mailman.stormreply.com,
	joabreu@synopsys.com, linux-kernel@vger.kernel.org,
	kbuild-all@01.org, netdev@vger.kernel.org,
	alexandre.torgue@st.com, peppe.cavallaro@st.com,
	linux-arm-kernel@lists.infradead.org, mcoquelin.stm32@gmail.com
In-Reply-To: <201909072036.v1rX0Vwh%lkp@intel.com>

On Sat, 2019-09-07 at 20:54 +0800, kbuild test robot wrote:
> [External]
> 
> Hi Alexandru,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on linus/master]

Hmm, this error should be expectable I guess: I applied this on net-next/master.

Alex

> [cannot apply to v5.3-rc7 next-20190904]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    
> https://github.com/0day-ci/linux/commits/Alexandru-Ardelean/net-stmmac-socfpga-re-use-the-interface-parameter-from-platform-data/20190907-190627
> config: sparc64-allmodconfig (attached as .config)
> compiler: sparc64-linux-gcc (GCC) 7.4.0
> reproduce:
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         GCC_VERSION=7.4.0 make.cross ARCH=sparc64 
> 
> If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot <lkp@intel.com>
> 
> All errors (new ones prefixed by >>):
> 
>    In file included from include/linux/dma-mapping.h:7:0,
>                     from include/linux/skbuff.h:30,
>                     from include/linux/if_ether.h:19,
>                     from include/uapi/linux/ethtool.h:19,
>                     from include/linux/ethtool.h:18,
>                     from include/linux/phy.h:16,
>                     from drivers/net//ethernet/stmicro/stmmac/dwmac-socfpga.c:11:
>    drivers/net//ethernet/stmicro/stmmac/dwmac-socfpga.c: In function 'socfpga_gen5_set_phy_mode':
> > > drivers/net//ethernet/stmicro/stmmac/dwmac-socfpga.c:264:44: error: 'phymode' undeclared (first use in this
> > > function); did you mean 'phy_modes'?
>       dev_err(dwmac->dev, "bad phy mode %d\n", phymode);
>                                                ^
>    include/linux/device.h:1499:32: note: in definition of macro 'dev_err'
>      _dev_err(dev, dev_fmt(fmt), ##__VA_ARGS__)
>                                    ^~~~~~~~~~~
>    drivers/net//ethernet/stmicro/stmmac/dwmac-socfpga.c:264:44: note: each undeclared identifier is reported only once
> for each function it appears in
>       dev_err(dwmac->dev, "bad phy mode %d\n", phymode);
>                                                ^
>    include/linux/device.h:1499:32: note: in definition of macro 'dev_err'
>      _dev_err(dev, dev_fmt(fmt), ##__VA_ARGS__)
>                                    ^~~~~~~~~~~
>    drivers/net//ethernet/stmicro/stmmac/dwmac-socfpga.c: In function 'socfpga_gen10_set_phy_mode':
>    drivers/net//ethernet/stmicro/stmmac/dwmac-socfpga.c:340:6: error: 'phymode' undeclared (first use in this
> function); did you mean 'phy_modes'?
>          phymode == PHY_INTERFACE_MODE_MII ||
>          ^~~~~~~
>          phy_modes
> 
> vim +264 drivers/net//ethernet/stmicro/stmmac/dwmac-socfpga.c
> 
> 40ae25505fe834 Dinh Nguyen        2019-06-05  255  
> 40ae25505fe834 Dinh Nguyen        2019-06-05  256  static int socfpga_gen5_set_phy_mode(struct socfpga_dwmac *dwmac)
> 40ae25505fe834 Dinh Nguyen        2019-06-05  257  {
> 40ae25505fe834 Dinh Nguyen        2019-06-05  258  	struct regmap *sys_mgr_base_addr = dwmac->sys_mgr_base_addr;
> 40ae25505fe834 Dinh Nguyen        2019-06-05  259  	u32 reg_offset = dwmac->reg_offset;
> 40ae25505fe834 Dinh Nguyen        2019-06-05  260  	u32 reg_shift = dwmac->reg_shift;
> 40ae25505fe834 Dinh Nguyen        2019-06-05  261  	u32 ctrl, val, module;
> 40ae25505fe834 Dinh Nguyen        2019-06-05  262  
> 6169afbe4a340b Alexandru Ardelean 2019-09-06  263  	if (socfpga_set_phy_mode_common(dwmac, &val)) {
> 801d233b7302ee Dinh Nguyen        2014-03-26 @264  		dev_err(dwmac->dev, "bad phy mode %d\n", phymode);
> 801d233b7302ee Dinh Nguyen        2014-03-26  265  		return -EINVAL;
> 801d233b7302ee Dinh Nguyen        2014-03-26  266  	}
> 801d233b7302ee Dinh Nguyen        2014-03-26  267  
> b4834c86e11baf Ley Foon Tan       2014-08-20  268  	/* Overwrite val to GMII if splitter core is enabled. The
> phymode here
> b4834c86e11baf Ley Foon Tan       2014-08-20  269  	 * is the actual phy mode on phy hardware, but phy interface
> from
> b4834c86e11baf Ley Foon Tan       2014-08-20  270  	 * EMAC core is GMII.
> b4834c86e11baf Ley Foon Tan       2014-08-20  271  	 */
> b4834c86e11baf Ley Foon Tan       2014-08-20  272  	if (dwmac->splitter_base)
> b4834c86e11baf Ley Foon Tan       2014-08-20  273  		val = SYSMGR_EMACGRP_CTRL_PHYSEL_ENUM_GMII_MII;
> b4834c86e11baf Ley Foon Tan       2014-08-20  274  
> 70cb136f773083 Joachim Eastwood   2016-05-01  275  	/* Assert reset to the enet controller before changing the phy
> mode */
> bc8a2d9bcbf1ca Dinh Nguyen        2018-06-19  276  	reset_control_assert(dwmac->stmmac_ocp_rst);
> 70cb136f773083 Joachim Eastwood   2016-05-01  277  	reset_control_assert(dwmac->stmmac_rst);
> 70cb136f773083 Joachim Eastwood   2016-05-01  278  
> 801d233b7302ee Dinh Nguyen        2014-03-26  279  	regmap_read(sys_mgr_base_addr, reg_offset, &ctrl);
> 801d233b7302ee Dinh Nguyen        2014-03-26  280  	ctrl &= ~(SYSMGR_EMACGRP_CTRL_PHYSEL_MASK << reg_shift);
> 801d233b7302ee Dinh Nguyen        2014-03-26  281  	ctrl |= val << reg_shift;
> 801d233b7302ee Dinh Nguyen        2014-03-26  282  
> 013dae5dbc07aa Stephan Gatzka     2017-08-22  283  	if (dwmac->f2h_ptp_ref_clk ||
> 013dae5dbc07aa Stephan Gatzka     2017-08-22  284  	    phymode == PHY_INTERFACE_MODE_MII ||
> 013dae5dbc07aa Stephan Gatzka     2017-08-22  285  	    phymode == PHY_INTERFACE_MODE_GMII ||
> 013dae5dbc07aa Stephan Gatzka     2017-08-22  286  	    phymode == PHY_INTERFACE_MODE_SGMII) {
> 43569814fa35b2 Phil Reid          2015-12-14  287  		ctrl |= SYSMGR_EMACGRP_CTRL_PTP_REF_CLK_MASK <<
> (reg_shift / 2);
> 734e00fa02eff5 Phil Reid          2016-04-07  288  		regmap_read(sys_mgr_base_addr,
> SYSMGR_FPGAGRP_MODULE_REG,
> 734e00fa02eff5 Phil Reid          2016-04-07  289  			    &module);
> 734e00fa02eff5 Phil Reid          2016-04-07  290  		module |= (SYSMGR_FPGAGRP_MODULE_EMAC << (reg_shift /
> 2));
> 734e00fa02eff5 Phil Reid          2016-04-07  291  		regmap_write(sys_mgr_base_addr,
> SYSMGR_FPGAGRP_MODULE_REG,
> 734e00fa02eff5 Phil Reid          2016-04-07  292  			     module);
> 734e00fa02eff5 Phil Reid          2016-04-07  293  	} else {
> 43569814fa35b2 Phil Reid          2015-12-14  294  		ctrl &= ~(SYSMGR_EMACGRP_CTRL_PTP_REF_CLK_MASK <<
> (reg_shift / 2));
> 734e00fa02eff5 Phil Reid          2016-04-07  295  	}
> 43569814fa35b2 Phil Reid          2015-12-14  296  
> 801d233b7302ee Dinh Nguyen        2014-03-26  297  	regmap_write(sys_mgr_base_addr, reg_offset, ctrl);
> 734e00fa02eff5 Phil Reid          2016-04-07  298  
> 70cb136f773083 Joachim Eastwood   2016-05-01  299  	/* Deassert reset for the phy configuration to be sampled by
> 70cb136f773083 Joachim Eastwood   2016-05-01  300  	 * the enet controller, and operation to start in requested mode
> 70cb136f773083 Joachim Eastwood   2016-05-01  301  	 */
> bc8a2d9bcbf1ca Dinh Nguyen        2018-06-19  302  	reset_control_deassert(dwmac->stmmac_ocp_rst);
> 70cb136f773083 Joachim Eastwood   2016-05-01  303  	reset_control_deassert(dwmac->stmmac_rst);
> fb3bbdb859891e Tien Hock Loh      2016-07-07  304  	if (phymode == PHY_INTERFACE_MODE_SGMII) {
> fb3bbdb859891e Tien Hock Loh      2016-07-07  305  		if (tse_pcs_init(dwmac->pcs.tse_pcs_base, &dwmac->pcs)
> != 0) {
> fb3bbdb859891e Tien Hock Loh      2016-07-07  306  			dev_err(dwmac->dev, "Unable to initialize TSE
> PCS");
> fb3bbdb859891e Tien Hock Loh      2016-07-07  307  			return -EINVAL;
> fb3bbdb859891e Tien Hock Loh      2016-07-07  308  		}
> fb3bbdb859891e Tien Hock Loh      2016-07-07  309  	}
> 70cb136f773083 Joachim Eastwood   2016-05-01  310  
> 801d233b7302ee Dinh Nguyen        2014-03-26  311  	return 0;
> 801d233b7302ee Dinh Nguyen        2014-03-26  312  }
> 801d233b7302ee Dinh Nguyen        2014-03-26  313  
> 
> :::::: The code at line 264 was first introduced by commit
> :::::: 801d233b7302eeab94750427a623c10c044cb0ca net: stmmac: Add SOCFPGA glue driver
> 
> :::::: TO: Dinh Nguyen <dinguyen@altera.com>
> :::::: CC: David S. Miller <davem@davemloft.net>
> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* [PATCHv2 ipsec-next] xfrm: remove the unnecessary .net_exit for xfrmi
From: Xin Long @ 2019-09-09  8:30 UTC (permalink / raw)
  To: network dev; +Cc: davem, steffen.klassert

The xfrm_if(s) on each netns can be deleted when its xfrmi dev is
deleted. xfrmi dev's removal can happen when:

  a. netns is being removed and all xfrmi devs will be deleted.

  b. rtnl_link_unregister(&xfrmi_link_ops) in xfrmi_fini() when
     xfrm_interface.ko is being unloaded.

So there's no need to use xfrmi_exit_net() to clean any xfrm_if up.

v1->v2:
  - Fix some changelog.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/xfrm/xfrm_interface.c | 23 -----------------------
 1 file changed, 23 deletions(-)

diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 74868f9..faa0518 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -738,30 +738,7 @@ static struct rtnl_link_ops xfrmi_link_ops __read_mostly = {
 	.get_link_net	= xfrmi_get_link_net,
 };
 
-static void __net_exit xfrmi_destroy_interfaces(struct xfrmi_net *xfrmn)
-{
-	struct xfrm_if *xi;
-	LIST_HEAD(list);
-
-	xi = rtnl_dereference(xfrmn->xfrmi[0]);
-	if (!xi)
-		return;
-
-	unregister_netdevice_queue(xi->dev, &list);
-	unregister_netdevice_many(&list);
-}
-
-static void __net_exit xfrmi_exit_net(struct net *net)
-{
-	struct xfrmi_net *xfrmn = net_generic(net, xfrmi_net_id);
-
-	rtnl_lock();
-	xfrmi_destroy_interfaces(xfrmn);
-	rtnl_unlock();
-}
-
 static struct pernet_operations xfrmi_net_ops = {
-	.exit = xfrmi_exit_net,
 	.id   = &xfrmi_net_id,
 	.size = sizeof(struct xfrmi_net),
 };
-- 
2.1.0


^ permalink raw reply related

* [PATCH v3 1/2] PTP: introduce new versions of IOCTLs
From: Felipe Balbi @ 2019-09-09  7:59 UTC (permalink / raw)
  To: Richard Cochran; +Cc: Christopher S Hall, netdev, linux-kernel, Felipe Balbi

The current version of the IOCTL have a small problem which prevents us
from extending the API by making use of reserved fields. In these new
IOCTLs, we are now making sure that flags and rsv fields are zero which
will allow us to extend the API in the future.

Reviewed-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
---

Changes since v2:
	- Define PTP_{PEROUT,EXTTS}_VALID_FLAGS
	- Fix comment above PTP_*_FLAGS
	- 

Changes since v1:
	- Add a blank line after memset()
	- Move memset(req) to the three places where it's needed
	- Fix the accidental removal of GETFUNC and SETFUNC

 drivers/ptp/ptp_chardev.c      | 63 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/ptp_clock.h | 24 ++++++++++++-
 2 files changed, 86 insertions(+), 1 deletion(-)

diff --git a/drivers/ptp/ptp_chardev.c b/drivers/ptp/ptp_chardev.c
index 18ffe449efdf..9c18476d8d10 100644
--- a/drivers/ptp/ptp_chardev.c
+++ b/drivers/ptp/ptp_chardev.c
@@ -126,7 +126,9 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
 	switch (cmd) {
 
 	case PTP_CLOCK_GETCAPS:
+	case PTP_CLOCK_GETCAPS2:
 		memset(&caps, 0, sizeof(caps));
+
 		caps.max_adj = ptp->info->max_adj;
 		caps.n_alarm = ptp->info->n_alarm;
 		caps.n_ext_ts = ptp->info->n_ext_ts;
@@ -139,11 +141,24 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
 		break;
 
 	case PTP_EXTTS_REQUEST:
+	case PTP_EXTTS_REQUEST2:
+		memset(&req, 0, sizeof(req));
+
 		if (copy_from_user(&req.extts, (void __user *)arg,
 				   sizeof(req.extts))) {
 			err = -EFAULT;
 			break;
 		}
+		if (((req.extts.flags & ~PTP_EXTTS_VALID_FLAGS) ||
+			req.extts.rsv[0] || req.extts.rsv[1]) &&
+			cmd == PTP_EXTTS_REQUEST2) {
+			err = -EINVAL;
+			break;
+		} else if (cmd == PTP_EXTTS_REQUEST) {
+			req.extts.flags &= ~PTP_EXTTS_VALID_FLAGS;
+			req.extts.rsv[0] = 0;
+			req.extts.rsv[1] = 0;
+		}
 		if (req.extts.index >= ops->n_ext_ts) {
 			err = -EINVAL;
 			break;
@@ -154,11 +169,27 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
 		break;
 
 	case PTP_PEROUT_REQUEST:
+	case PTP_PEROUT_REQUEST2:
+		memset(&req, 0, sizeof(req));
+
 		if (copy_from_user(&req.perout, (void __user *)arg,
 				   sizeof(req.perout))) {
 			err = -EFAULT;
 			break;
 		}
+		if (((req.perout.flags & ~PTP_PEROUT_VALID_FLAGS) ||
+			req.perout.rsv[0] || req.perout.rsv[1] ||
+			req.perout.rsv[2] || req.perout.rsv[3]) &&
+			cmd == PTP_PEROUT_REQUEST2) {
+			err = -EINVAL;
+			break;
+		} else if (cmd == PTP_PEROUT_REQUEST) {
+			req.perout.flags &= ~PTP_PEROUT_VALID_FLAGS;
+			req.perout.rsv[0] = 0;
+			req.perout.rsv[1] = 0;
+			req.perout.rsv[2] = 0;
+			req.perout.rsv[3] = 0;
+		}
 		if (req.perout.index >= ops->n_per_out) {
 			err = -EINVAL;
 			break;
@@ -169,6 +200,9 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
 		break;
 
 	case PTP_ENABLE_PPS:
+	case PTP_ENABLE_PPS2:
+		memset(&req, 0, sizeof(req));
+
 		if (!capable(CAP_SYS_TIME))
 			return -EPERM;
 		req.type = PTP_CLK_REQ_PPS;
@@ -177,6 +211,7 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
 		break;
 
 	case PTP_SYS_OFFSET_PRECISE:
+	case PTP_SYS_OFFSET_PRECISE2:
 		if (!ptp->info->getcrosststamp) {
 			err = -EOPNOTSUPP;
 			break;
@@ -201,6 +236,7 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
 		break;
 
 	case PTP_SYS_OFFSET_EXTENDED:
+	case PTP_SYS_OFFSET_EXTENDED2:
 		if (!ptp->info->gettimex64) {
 			err = -EOPNOTSUPP;
 			break;
@@ -232,6 +268,7 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
 		break;
 
 	case PTP_SYS_OFFSET:
+	case PTP_SYS_OFFSET2:
 		sysoff = memdup_user((void __user *)arg, sizeof(*sysoff));
 		if (IS_ERR(sysoff)) {
 			err = PTR_ERR(sysoff);
@@ -266,10 +303,23 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
 		break;
 
 	case PTP_PIN_GETFUNC:
+	case PTP_PIN_GETFUNC2:
 		if (copy_from_user(&pd, (void __user *)arg, sizeof(pd))) {
 			err = -EFAULT;
 			break;
 		}
+		if ((pd.rsv[0] || pd.rsv[1] || pd.rsv[2]
+				|| pd.rsv[3] || pd.rsv[4])
+			&& cmd == PTP_PIN_GETFUNC2) {
+			err = -EINVAL;
+			break;
+		} else if (cmd == PTP_PIN_GETFUNC) {
+			pd.rsv[0] = 0;
+			pd.rsv[1] = 0;
+			pd.rsv[2] = 0;
+			pd.rsv[3] = 0;
+			pd.rsv[4] = 0;
+		}
 		pin_index = pd.index;
 		if (pin_index >= ops->n_pins) {
 			err = -EINVAL;
@@ -285,10 +335,23 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
 		break;
 
 	case PTP_PIN_SETFUNC:
+	case PTP_PIN_SETFUNC2:
 		if (copy_from_user(&pd, (void __user *)arg, sizeof(pd))) {
 			err = -EFAULT;
 			break;
 		}
+		if ((pd.rsv[0] || pd.rsv[1] || pd.rsv[2]
+				|| pd.rsv[3] || pd.rsv[4])
+			&& cmd == PTP_PIN_SETFUNC2) {
+			err = -EINVAL;
+			break;
+		} else if (cmd == PTP_PIN_SETFUNC) {
+			pd.rsv[0] = 0;
+			pd.rsv[1] = 0;
+			pd.rsv[2] = 0;
+			pd.rsv[3] = 0;
+			pd.rsv[4] = 0;
+		}
 		pin_index = pd.index;
 		if (pin_index >= ops->n_pins) {
 			err = -EINVAL;
diff --git a/include/uapi/linux/ptp_clock.h b/include/uapi/linux/ptp_clock.h
index 1bc794ad957a..12911785991b 100644
--- a/include/uapi/linux/ptp_clock.h
+++ b/include/uapi/linux/ptp_clock.h
@@ -25,10 +25,20 @@
 #include <linux/ioctl.h>
 #include <linux/types.h>
 
-/* PTP_xxx bits, for the flags field within the request structures. */
+/*
+ * Bits of the ptp_extts_request.flags field:
+ */
 #define PTP_ENABLE_FEATURE (1<<0)
 #define PTP_RISING_EDGE    (1<<1)
 #define PTP_FALLING_EDGE   (1<<2)
+#define PTP_EXTTS_VALID_FLAGS	(PTP_ENABLE_FEATURE |	\
+				 PTP_RISING_EDGE |	\
+				 PTP_FALLING_EDGE)
+
+/*
+ * Bits of the ptp_perout_request.flags field:
+ */
+#define PTP_PEROUT_VALID_FLAGS (~0)
 
 /*
  * struct ptp_clock_time - represents a time value
@@ -149,6 +159,18 @@ struct ptp_pin_desc {
 #define PTP_SYS_OFFSET_EXTENDED \
 	_IOWR(PTP_CLK_MAGIC, 9, struct ptp_sys_offset_extended)
 
+#define PTP_CLOCK_GETCAPS2  _IOR(PTP_CLK_MAGIC, 10, struct ptp_clock_caps)
+#define PTP_EXTTS_REQUEST2  _IOW(PTP_CLK_MAGIC, 11, struct ptp_extts_request)
+#define PTP_PEROUT_REQUEST2 _IOW(PTP_CLK_MAGIC, 12, struct ptp_perout_request)
+#define PTP_ENABLE_PPS2     _IOW(PTP_CLK_MAGIC, 13, int)
+#define PTP_SYS_OFFSET2     _IOW(PTP_CLK_MAGIC, 14, struct ptp_sys_offset)
+#define PTP_PIN_GETFUNC2    _IOWR(PTP_CLK_MAGIC, 15, struct ptp_pin_desc)
+#define PTP_PIN_SETFUNC2    _IOW(PTP_CLK_MAGIC, 16, struct ptp_pin_desc)
+#define PTP_SYS_OFFSET_PRECISE2 \
+	_IOWR(PTP_CLK_MAGIC, 17, struct ptp_sys_offset_precise)
+#define PTP_SYS_OFFSET_EXTENDED2 \
+	_IOWR(PTP_CLK_MAGIC, 18, struct ptp_sys_offset_extended)
+
 struct ptp_extts_event {
 	struct ptp_clock_time t; /* Time event occured. */
 	unsigned int index;      /* Which channel produced the event. */
-- 
2.23.0


^ permalink raw reply related

* [PATCH v3 2/2] PTP: add support for one-shot output
From: Felipe Balbi @ 2019-09-09  7:59 UTC (permalink / raw)
  To: Richard Cochran; +Cc: Christopher S Hall, netdev, linux-kernel, Felipe Balbi
In-Reply-To: <20190909075940.12843-1-felipe.balbi@linux.intel.com>

Some controllers allow for a one-shot output pulse, in contrast to
periodic output. Now that we have extensible versions of our IOCTLs, we
can finally make use of the 'flags' field to pass a bit telling driver
that if we want one-shot pulse output.

Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
---

Changes since v2:
	- Add _PEROUT_ to bit macro

Changes since v1:
	- remove comment from .flags field

 include/uapi/linux/ptp_clock.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/ptp_clock.h b/include/uapi/linux/ptp_clock.h
index 12911785991b..cbdc0d97b471 100644
--- a/include/uapi/linux/ptp_clock.h
+++ b/include/uapi/linux/ptp_clock.h
@@ -38,8 +38,8 @@
 /*
  * Bits of the ptp_perout_request.flags field:
  */
-#define PTP_PEROUT_VALID_FLAGS (~0)
-
+#define PTP_PEROUT_ONE_SHOT (1<<0)
+#define PTP_PEROUT_VALID_FLAGS	(~PTP_PEROUT_ONE_SHOT)
 /*
  * struct ptp_clock_time - represents a time value
  *
@@ -77,7 +77,7 @@ struct ptp_perout_request {
 	struct ptp_clock_time start;  /* Absolute start time. */
 	struct ptp_clock_time period; /* Desired period, zero means disable. */
 	unsigned int index;           /* Which channel to configure. */
-	unsigned int flags;           /* Reserved for future use. */
+	unsigned int flags;
 	unsigned int rsv[4];          /* Reserved for future use. */
 };
 
-- 
2.23.0


^ permalink raw reply related

* [PATCH net-next 5/5] sctp: add spt_pathcpthld in struct sctp_paddrthlds
From: Xin Long @ 2019-09-09  7:56 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: Marcelo Ricardo Leitner, Neil Horman, davem
In-Reply-To: <cover.1568015756.git.lucien.xin@gmail.com>

Section 7.2 of rfc7829: "Peer Address Thresholds (SCTP_PEER_ADDR_THLDS)
Socket Option" extends 'struct sctp_paddrthlds' with 'spt_pathcpthld'
added to allow a user to change ps_retrans per sock/asoc/transport, as
other 2 paddrthlds: pf_retrans, pathmaxrxt.

Note that ps_retrans is not allowed to be greater than pf_retrans.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/uapi/linux/sctp.h |  1 +
 net/sctp/socket.c         | 10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index a15cc28..dfd81e1 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -1069,6 +1069,7 @@ struct sctp_paddrthlds {
 	struct sockaddr_storage spt_address;
 	__u16 spt_pathmaxrxt;
 	__u16 spt_pathpfthld;
+	__u16 spt_pathcpthld;
 };
 
 /*
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 5e2098b..5b9774d 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -3954,6 +3954,9 @@ static int sctp_setsockopt_paddr_thresholds(struct sock *sk,
 			   sizeof(struct sctp_paddrthlds)))
 		return -EFAULT;
 
+	if (val.spt_pathpfthld > val.spt_pathcpthld)
+		return -EINVAL;
+
 	if (!sctp_is_any(sk, (const union sctp_addr *)&val.spt_address)) {
 		trans = sctp_addr_id2transport(sk, &val.spt_address,
 					       val.spt_assoc_id);
@@ -3963,6 +3966,7 @@ static int sctp_setsockopt_paddr_thresholds(struct sock *sk,
 		if (val.spt_pathmaxrxt)
 			trans->pathmaxrxt = val.spt_pathmaxrxt;
 		trans->pf_retrans = val.spt_pathpfthld;
+		trans->ps_retrans = val.spt_pathcpthld;
 
 		return 0;
 	}
@@ -3978,17 +3982,20 @@ static int sctp_setsockopt_paddr_thresholds(struct sock *sk,
 			if (val.spt_pathmaxrxt)
 				trans->pathmaxrxt = val.spt_pathmaxrxt;
 			trans->pf_retrans = val.spt_pathpfthld;
+			trans->ps_retrans = val.spt_pathcpthld;
 		}
 
 		if (val.spt_pathmaxrxt)
 			asoc->pathmaxrxt = val.spt_pathmaxrxt;
 		asoc->pf_retrans = val.spt_pathpfthld;
+		asoc->ps_retrans = val.spt_pathcpthld;
 	} else {
 		struct sctp_sock *sp = sctp_sk(sk);
 
 		if (val.spt_pathmaxrxt)
 			sp->pathmaxrxt = val.spt_pathmaxrxt;
 		sp->pf_retrans = val.spt_pathpfthld;
+		sp->ps_retrans = val.spt_pathcpthld;
 	}
 
 	return 0;
@@ -7232,6 +7239,7 @@ static int sctp_getsockopt_paddr_thresholds(struct sock *sk,
 		if (!trans)
 			return -ENOENT;
 
+		val.spt_pathcpthld = trans->ps_retrans;
 		val.spt_pathmaxrxt = trans->pathmaxrxt;
 		val.spt_pathpfthld = trans->pf_retrans;
 
@@ -7244,11 +7252,13 @@ static int sctp_getsockopt_paddr_thresholds(struct sock *sk,
 		return -EINVAL;
 
 	if (asoc) {
+		val.spt_pathcpthld = asoc->ps_retrans;
 		val.spt_pathpfthld = asoc->pf_retrans;
 		val.spt_pathmaxrxt = asoc->pathmaxrxt;
 	} else {
 		struct sctp_sock *sp = sctp_sk(sk);
 
+		val.spt_pathcpthld = sp->ps_retrans;
 		val.spt_pathpfthld = sp->pf_retrans;
 		val.spt_pathmaxrxt = sp->pathmaxrxt;
 	}
-- 
2.1.0


^ permalink raw reply related

* [PATCH net-next 4/5] sctp: add support for Primary Path Switchover
From: Xin Long @ 2019-09-09  7:56 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: Marcelo Ricardo Leitner, Neil Horman, davem
In-Reply-To: <cover.1568015756.git.lucien.xin@gmail.com>

This is a new feature defined in section 5 of rfc7829: "Primary Path
Switchover". By introducing a new tunable parameter:

  Primary.Switchover.Max.Retrans (PSMR)

The primary path will be changed to another active path when the path
error counter on the old primary path exceeds PSMR, so that "the SCTP
sender is allowed to continue data transmission on a new working path
even when the old primary destination address becomes active again".

This patch is to add this tunable parameter, 'ps_retrans' per netns,
sock, asoc and transport. It also allows a user to change ps_retrans
per netns by sysctl, and ps_retrans per sock/asoc/transport will be
initialized with it.

The check will be done in sctp_do_8_2_transport_strike() when this
feature is enabled.

Note this feature is disabled by initializing 'ps_retrans' per netns
as 0xffff by default, and its value can't be less than 'pf_retrans'
when changing by sysctl.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/net/netns/sctp.h   |  6 ++++++
 include/net/sctp/structs.h | 11 ++++++++---
 net/sctp/associola.c       |  3 +++
 net/sctp/protocol.c        |  3 +++
 net/sctp/sm_sideeffect.c   |  5 +++++
 net/sctp/socket.c          |  1 +
 net/sctp/sysctl.c          |  9 +++++++++
 7 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/include/net/netns/sctp.h b/include/net/netns/sctp.h
index 5234940c..cab0903 100644
--- a/include/net/netns/sctp.h
+++ b/include/net/netns/sctp.h
@@ -89,6 +89,12 @@ struct netns_sctp {
 	 */
 	int pf_retrans;
 
+	/* Primary.Switchover.Max.Retrans sysctl value
+	 * taken from:
+	 * https://tools.ietf.org/html/rfc7829
+	 */
+	int ps_retrans;
+
 	/*
 	 * Disable Potentially-Failed feature, the feature is enabled by default
 	 * pf_enable	-  0  : disable pf
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index c2d3317..3680a93 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -184,7 +184,8 @@ struct sctp_sock {
 	__u32 flowlabel;
 	__u8  dscp;
 
-	int pf_retrans;
+	__u16 pf_retrans;
+	__u16 ps_retrans;
 
 	/* The initial Path MTU to use for new associations. */
 	__u32 pathmtu;
@@ -897,7 +898,9 @@ struct sctp_transport {
 	 * and will be initialized from the assocs value.  This can be changed
 	 * using the SCTP_PEER_ADDR_THLDS socket option
 	 */
-	int pf_retrans;
+	__u16 pf_retrans;
+	/* Used for primary path switchover. */
+	__u16 ps_retrans;
 	/* PMTU	      : The current known path MTU.  */
 	__u32 pathmtu;
 
@@ -1773,7 +1776,9 @@ struct sctp_association {
 	 * and will be initialized from the assocs value.  This can be
 	 * changed using the SCTP_PEER_ADDR_THLDS socket option
 	 */
-	int pf_retrans;
+	__u16 pf_retrans;
+	/* Used for primary path switchover. */
+	__u16 ps_retrans;
 
 	/* Maximum number of times the endpoint will retransmit INIT  */
 	__u16 max_init_attempts;
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 461b1ffa..f63c56d 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -86,6 +86,7 @@ static struct sctp_association *sctp_association_init(
 	 */
 	asoc->max_retrans = sp->assocparams.sasoc_asocmaxrxt;
 	asoc->pf_retrans  = sp->pf_retrans;
+	asoc->ps_retrans  = sp->ps_retrans;
 	asoc->pf_expose   = sp->pf_expose;
 
 	asoc->rto_initial = msecs_to_jiffies(sp->rtoinfo.srto_initial);
@@ -625,6 +626,8 @@ struct sctp_transport *sctp_assoc_add_peer(struct sctp_association *asoc,
 
 	/* And the partial failure retrans threshold */
 	peer->pf_retrans = asoc->pf_retrans;
+	/* And the primary path switchover retrans threshold */
+	peer->ps_retrans = asoc->ps_retrans;
 
 	/* Initialize the peer's SACK delay timeout based on the
 	 * association configured value.
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index de0f15f..23f7631 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1217,6 +1217,9 @@ static int __net_init sctp_defaults_init(struct net *net)
 	/* Max.Burst		    - 4 */
 	net->sctp.max_burst			= SCTP_DEFAULT_MAX_BURST;
 
+	/* Disable of Primary Path Switchover by default */
+	net->sctp.ps_retrans = 0xffff;
+
 	/* Enable pf state by default */
 	net->sctp.pf_enable = 1;
 
diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
index 1cf5bb5..3948697 100644
--- a/net/sctp/sm_sideeffect.c
+++ b/net/sctp/sm_sideeffect.c
@@ -567,6 +567,11 @@ static void sctp_do_8_2_transport_strike(struct sctp_cmd_seq *commands,
 					     SCTP_FAILED_THRESHOLD);
 	}
 
+	if (transport->error_count > transport->ps_retrans &&
+	    asoc->peer.primary_path == transport &&
+	    asoc->peer.active_path != transport)
+		sctp_assoc_set_primary(asoc, asoc->peer.active_path);
+
 	/* E2) For the destination address for which the timer
 	 * expires, set RTO <- RTO * 2 ("back off the timer").  The
 	 * maximum value discussed in rule C7 above (RTO.max) may be
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index e3a8e25..5e2098b 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -5074,6 +5074,7 @@ static int sctp_init_sock(struct sock *sk)
 	sp->hbinterval  = net->sctp.hb_interval;
 	sp->pathmaxrxt  = net->sctp.max_retrans_path;
 	sp->pf_retrans  = net->sctp.pf_retrans;
+	sp->ps_retrans  = net->sctp.ps_retrans;
 	sp->pf_expose   = net->sctp.pf_expose;
 	sp->pathmtu     = 0; /* allow default discovery */
 	sp->sackdelay   = net->sctp.sack_timeout;
diff --git a/net/sctp/sysctl.c b/net/sctp/sysctl.c
index eacc9a1..c9ebfc2 100644
--- a/net/sctp/sysctl.c
+++ b/net/sctp/sysctl.c
@@ -212,6 +212,15 @@ static struct ctl_table sctp_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
+		.extra2		= &init_net.sctp.ps_retrans,
+	},
+	{
+		.procname	= "ps_retrans",
+		.data		= &init_net.sctp.ps_retrans,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &init_net.sctp.pf_retrans,
 		.extra2		= SYSCTL_INT_MAX,
 	},
 	{
-- 
2.1.0


^ permalink raw reply related

* [PATCH net-next 3/5] sctp: add SCTP_EXPOSE_POTENTIALLY_FAILED_STATE sockopt
From: Xin Long @ 2019-09-09  7:56 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: Marcelo Ricardo Leitner, Neil Horman, davem
In-Reply-To: <cover.1568015756.git.lucien.xin@gmail.com>

This is a sockopt defined in section 7.3 of rfc7829: "Exposing
the Potentially Failed Path State", by which users can change
pf_expose per sock and asoc.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/uapi/linux/sctp.h |  1 +
 net/sctp/socket.c         | 76 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 77 insertions(+)

diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index b6649d6..a15cc28 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -137,6 +137,7 @@ typedef __s32 sctp_assoc_t;
 #define SCTP_ASCONF_SUPPORTED	128
 #define SCTP_AUTH_SUPPORTED	129
 #define SCTP_ECN_SUPPORTED	130
+#define SCTP_EXPOSE_POTENTIALLY_FAILED_STATE	131
 
 /* PR-SCTP policies */
 #define SCTP_PR_SCTP_NONE	0x0000
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 33b93bb..e3a8e25 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -4588,6 +4588,37 @@ static int sctp_setsockopt_ecn_supported(struct sock *sk,
 	return retval;
 }
 
+static int sctp_setsockopt_pf_expose(struct sock *sk,
+				     char __user *optval,
+				     unsigned int optlen)
+{
+	struct sctp_assoc_value params;
+	struct sctp_association *asoc;
+	int retval = -EINVAL;
+
+	if (optlen != sizeof(params))
+		goto out;
+
+	if (copy_from_user(&params, optval, optlen)) {
+		retval = -EFAULT;
+		goto out;
+	}
+
+	asoc = sctp_id2assoc(sk, params.assoc_id);
+	if (!asoc && params.assoc_id != SCTP_FUTURE_ASSOC &&
+	    sctp_style(sk, UDP))
+		goto out;
+
+	if (asoc)
+		asoc->pf_expose = !!params.assoc_value;
+	else
+		sctp_sk(sk)->pf_expose = !!params.assoc_value;
+	retval = 0;
+
+out:
+	return retval;
+}
+
 /* API 6.2 setsockopt(), getsockopt()
  *
  * Applications use setsockopt() and getsockopt() to set or retrieve
@@ -4797,6 +4828,9 @@ static int sctp_setsockopt(struct sock *sk, int level, int optname,
 	case SCTP_ECN_SUPPORTED:
 		retval = sctp_setsockopt_ecn_supported(sk, optval, optlen);
 		break;
+	case SCTP_EXPOSE_POTENTIALLY_FAILED_STATE:
+		retval = sctp_setsockopt_pf_expose(sk, optval, optlen);
+		break;
 	default:
 		retval = -ENOPROTOOPT;
 		break;
@@ -7906,6 +7940,45 @@ static int sctp_getsockopt_ecn_supported(struct sock *sk, int len,
 	return retval;
 }
 
+static int sctp_getsockopt_pf_expose(struct sock *sk, int len,
+				     char __user *optval,
+				     int __user *optlen)
+{
+	struct sctp_assoc_value params;
+	struct sctp_association *asoc;
+	int retval = -EFAULT;
+
+	if (len < sizeof(params)) {
+		retval = -EINVAL;
+		goto out;
+	}
+
+	len = sizeof(params);
+	if (copy_from_user(&params, optval, len))
+		goto out;
+
+	asoc = sctp_id2assoc(sk, params.assoc_id);
+	if (!asoc && params.assoc_id != SCTP_FUTURE_ASSOC &&
+	    sctp_style(sk, UDP)) {
+		retval = -EINVAL;
+		goto out;
+	}
+
+	params.assoc_value = asoc ? asoc->pf_expose
+				  : sctp_sk(sk)->pf_expose;
+
+	if (put_user(len, optlen))
+		goto out;
+
+	if (copy_to_user(optval, &params, len))
+		goto out;
+
+	retval = 0;
+
+out:
+	return retval;
+}
+
 static int sctp_getsockopt(struct sock *sk, int level, int optname,
 			   char __user *optval, int __user *optlen)
 {
@@ -8118,6 +8191,9 @@ static int sctp_getsockopt(struct sock *sk, int level, int optname,
 	case SCTP_ECN_SUPPORTED:
 		retval = sctp_getsockopt_ecn_supported(sk, len, optval, optlen);
 		break;
+	case SCTP_EXPOSE_POTENTIALLY_FAILED_STATE:
+		retval = sctp_getsockopt_pf_expose(sk, len, optval, optlen);
+		break;
 	default:
 		retval = -ENOPROTOOPT;
 		break;
-- 
2.1.0


^ permalink raw reply related

* [PATCH net-next 2/5] sctp: add pf_expose per netns and sock and asoc
From: Xin Long @ 2019-09-09  7:56 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: Marcelo Ricardo Leitner, Neil Horman, davem
In-Reply-To: <cover.1568015756.git.lucien.xin@gmail.com>

As said in rfc7829, section 3, point 12:

  The SCTP stack SHOULD expose the PF state of its destination
  addresses to the ULP as well as provide the means to notify the
  ULP of state transitions of its destination addresses from
  active to PF, and vice versa.  However, it is recommended that
  an SCTP stack implementing SCTP-PF also allows for the ULP to be
  kept ignorant of the PF state of its destinations and the
  associated state transitions, thus allowing for retention of the
  simpler state transition model of [RFC4960] in the ULP.

Not only does it allow to expose the PF state to ULP, but also
allow to ignore sctp-pf to ULP.

So this patch is to add pf_expose per netns, sock and asoc. And in
sctp_assoc_control_transport(), ulp_notify will be set to false if
asoc->expose is not set.

It also allows a user to change pf_expose per netns by sysctl, and
pf_expose per sock and asoc will be initialized with it.

Note that pf_expose also works for SCTP_GET_PEER_ADDR_INFO sockopt,
to not allow a user to query the state of a sctp-pf peer address
when pf_expose is not enabled, as said in section 7.3.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/net/netns/sctp.h   |  7 +++++++
 include/net/sctp/structs.h |  2 ++
 include/uapi/linux/sctp.h  |  1 +
 net/sctp/associola.c       | 10 ++++++++--
 net/sctp/protocol.c        |  3 +++
 net/sctp/socket.c          | 12 ++++++++++--
 net/sctp/sysctl.c          |  7 +++++++
 7 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/include/net/netns/sctp.h b/include/net/netns/sctp.h
index bdc0f27..5234940c 100644
--- a/include/net/netns/sctp.h
+++ b/include/net/netns/sctp.h
@@ -97,6 +97,13 @@ struct netns_sctp {
 	int pf_enable;
 
 	/*
+	 * Disable Potentially-Failed state exposure, enabled by default
+	 * pf_expose	-  0  : disable pf state exposure
+	 *		- >0  : enable  pf state exposure
+	 */
+	int pf_expose;
+
+	/*
 	 * Policy for preforming sctp/socket accounting
 	 * 0   - do socket level accounting, all assocs share sk_sndbuf
 	 * 1   - do sctp accounting, each asoc may use sk_sndbuf bytes
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 503fbc3..c2d3317 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -215,6 +215,7 @@ struct sctp_sock {
 	__u32 adaptation_ind;
 	__u32 pd_point;
 	__u16	nodelay:1,
+		pf_expose:1,
 		reuse:1,
 		disable_fragments:1,
 		v4mapped:1,
@@ -2053,6 +2054,7 @@ struct sctp_association {
 
 	__u8 need_ecne:1,	/* Need to send an ECNE Chunk? */
 	     temp:1,		/* Is it a temporary association? */
+	     pf_expose:1,       /* Expose pf state? */
 	     force_delay:1;
 
 	__u8 strreset_enable;
diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index 45a85d7..b6649d6 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -920,6 +920,7 @@ struct sctp_paddrinfo {
 enum sctp_spinfo_state {
 	SCTP_INACTIVE,
 	SCTP_PF,
+#define	SCTP_POTENTIALLY_FAILED		SCTP_PF
 	SCTP_ACTIVE,
 	SCTP_UNCONFIRMED,
 	SCTP_UNKNOWN = 0xffff  /* Value used for transport state unknown */
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 7278b7e..461b1ffa 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -86,6 +86,7 @@ static struct sctp_association *sctp_association_init(
 	 */
 	asoc->max_retrans = sp->assocparams.sasoc_asocmaxrxt;
 	asoc->pf_retrans  = sp->pf_retrans;
+	asoc->pf_expose   = sp->pf_expose;
 
 	asoc->rto_initial = msecs_to_jiffies(sp->rtoinfo.srto_initial);
 	asoc->rto_max = msecs_to_jiffies(sp->rtoinfo.srto_max);
@@ -793,7 +794,9 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
 		 * to heartbeat success, report the SCTP_ADDR_CONFIRMED
 		 * state to the user, otherwise report SCTP_ADDR_AVAILABLE.
 		 */
-		if (SCTP_UNCONFIRMED == transport->state &&
+		if (transport->state == SCTP_PF && !asoc->pf_expose)
+			ulp_notify = false;
+		if (transport->state && SCTP_UNCONFIRMED &&
 		    SCTP_HEARTBEAT_SUCCESS == error)
 			spc_state = SCTP_ADDR_CONFIRMED;
 		else
@@ -817,7 +820,10 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
 
 	case SCTP_TRANSPORT_PF:
 		transport->state = SCTP_PF;
-		spc_state = SCTP_ADDR_POTENTIALLY_FAILED;
+		if (!asoc->pf_expose)
+			ulp_notify = false;
+		else
+			spc_state = SCTP_ADDR_POTENTIALLY_FAILED;
 		break;
 
 	default:
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index b48ffe8..de0f15f 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1220,6 +1220,9 @@ static int __net_init sctp_defaults_init(struct net *net)
 	/* Enable pf state by default */
 	net->sctp.pf_enable = 1;
 
+	/* Enable pf state exposure by default */
+	net->sctp.pf_expose = 1;
+
 	/* Association.Max.Retrans  - 10 attempts
 	 * Path.Max.Retrans         - 5  attempts (per destination address)
 	 * Max.Init.Retransmits     - 8  attempts
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 3e50a97..33b93bb 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -5040,6 +5040,7 @@ static int sctp_init_sock(struct sock *sk)
 	sp->hbinterval  = net->sctp.hb_interval;
 	sp->pathmaxrxt  = net->sctp.max_retrans_path;
 	sp->pf_retrans  = net->sctp.pf_retrans;
+	sp->pf_expose   = net->sctp.pf_expose;
 	sp->pathmtu     = 0; /* allow default discovery */
 	sp->sackdelay   = net->sctp.sack_timeout;
 	sp->sackfreq	= 2;
@@ -5520,8 +5521,15 @@ static int sctp_getsockopt_peer_addr_info(struct sock *sk, int len,
 
 	transport = sctp_addr_id2transport(sk, &pinfo.spinfo_address,
 					   pinfo.spinfo_assoc_id);
-	if (!transport)
-		return -EINVAL;
+	if (!transport) {
+		retval = -EINVAL;
+		goto out;
+	}
+
+	if (transport->state == SCTP_PF && !transport->asoc->pf_expose) {
+		retval = -EACCES;
+		goto out;
+	}
 
 	pinfo.spinfo_assoc_id = sctp_assoc2id(transport->asoc);
 	pinfo.spinfo_state = transport->state;
diff --git a/net/sctp/sysctl.c b/net/sctp/sysctl.c
index 238cf17..eacc9a1 100644
--- a/net/sctp/sysctl.c
+++ b/net/sctp/sysctl.c
@@ -318,6 +318,13 @@ static struct ctl_table sctp_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "pf_expose",
+		.data		= &init_net.sctp.pf_expose,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 
 	{ /* sentinel */ }
 };
-- 
2.1.0


^ permalink raw reply related

* [PATCH net-next 1/5] sctp: add SCTP_ADDR_POTENTIALLY_FAILED notification
From: Xin Long @ 2019-09-09  7:56 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: Marcelo Ricardo Leitner, Neil Horman, davem
In-Reply-To: <cover.1568015756.git.lucien.xin@gmail.com>

SCTP Quick failover draft section 5.1, point 5 has been removed
from rfc7829. Instead, "the sender SHOULD (i) notify the Upper
Layer Protocol (ULP) about this state transition", as said in
section 3.2, point 8.

So this patch is to add SCTP_ADDR_POTENTIALLY_FAILED, defined
in section 7.1, "which is reported if the affected address
becomes PF". Also remove transport cwnd's update when moving
from PF back to ACTIVE , which is no longer in rfc7829 either.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/uapi/linux/sctp.h |  1 +
 net/sctp/associola.c      | 17 ++++-------------
 2 files changed, 5 insertions(+), 13 deletions(-)

diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index 6d5b164..45a85d7 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -410,6 +410,7 @@ enum sctp_spc_state {
 	SCTP_ADDR_ADDED,
 	SCTP_ADDR_MADE_PRIM,
 	SCTP_ADDR_CONFIRMED,
+	SCTP_ADDR_POTENTIALLY_FAILED,
 };
 
 
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index d2ffc9a..7278b7e 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -798,14 +798,6 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
 			spc_state = SCTP_ADDR_CONFIRMED;
 		else
 			spc_state = SCTP_ADDR_AVAILABLE;
-		/* Don't inform ULP about transition from PF to
-		 * active state and set cwnd to 1 MTU, see SCTP
-		 * Quick failover draft section 5.1, point 5
-		 */
-		if (transport->state == SCTP_PF) {
-			ulp_notify = false;
-			transport->cwnd = asoc->pathmtu;
-		}
 		transport->state = SCTP_ACTIVE;
 		break;
 
@@ -814,19 +806,18 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
 		 * to inactive state.  Also, release the cached route since
 		 * there may be a better route next time.
 		 */
-		if (transport->state != SCTP_UNCONFIRMED)
+		if (transport->state != SCTP_UNCONFIRMED) {
 			transport->state = SCTP_INACTIVE;
-		else {
+			spc_state = SCTP_ADDR_UNREACHABLE;
+		} else {
 			sctp_transport_dst_release(transport);
 			ulp_notify = false;
 		}
-
-		spc_state = SCTP_ADDR_UNREACHABLE;
 		break;
 
 	case SCTP_TRANSPORT_PF:
 		transport->state = SCTP_PF;
-		ulp_notify = false;
+		spc_state = SCTP_ADDR_POTENTIALLY_FAILED;
 		break;
 
 	default:
-- 
2.1.0


^ permalink raw reply related

* [PATCH net-next 0/5] sctp: update from rfc7829
From: Xin Long @ 2019-09-09  7:56 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: Marcelo Ricardo Leitner, Neil Horman, davem

SCTP-PF was implemented based on a Internet-Draft in 2012:

  https://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05

It's been updated quite a few by rfc7829 in 2016.

This patchset adds the following features:

  1. add SCTP_ADDR_POTENTIALLY_FAILED notification
  2. add pf_expose per netns/sock/asoc
  3. add SCTP_EXPOSE_POTENTIALLY_FAILED_STATE sockopt
  4. add ps_retrans per netns/sock/asoc/transport
     (Primary Path Switchover)
  5. add spt_pathcpthld for SCTP_PEER_ADDR_THLDS sockopt

Xin Long (5):
  sctp: add SCTP_ADDR_POTENTIALLY_FAILED notification
  sctp: add pf_expose per netns and sock and asoc
  sctp: add SCTP_EXPOSE_POTENTIALLY_FAILED_STATE sockopt
  sctp: add support for Primary Path Switchover
  sctp: add spt_pathcpthld in struct sctp_paddrthlds

 include/net/netns/sctp.h   | 13 ++++++
 include/net/sctp/structs.h | 13 ++++--
 include/uapi/linux/sctp.h  |  4 ++
 net/sctp/associola.c       | 28 ++++++-------
 net/sctp/protocol.c        |  6 +++
 net/sctp/sm_sideeffect.c   |  5 +++
 net/sctp/socket.c          | 99 +++++++++++++++++++++++++++++++++++++++++++++-
 net/sctp/sysctl.c          | 16 ++++++++
 8 files changed, 165 insertions(+), 19 deletions(-)

-- 
2.1.0


^ permalink raw reply

* VRF Issue Since kernel 5
From: Gowen @ 2019-09-09  7:46 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Hi there,

Dave A said this was the mailer to send this to:


I’ve been using my management interface in a VRF for several months now and it’s worked perfectly – I’ve been able to update/upgrade the packages just fine and iptables works excellently with it – exactly as I needed.


Since Kernel 5 though I am no longer able to update – but the issue is quite a curious one as some traffic appears to be fine (DNS lookups use VRF correctly) but others don’t (updating/upgrading the packages)


I have on this device 2 interfaces:
Eth0 for management – inbound SSH, DNS, updates/upgrades
Eth1 for managing other boxes (ansible using SSH)


Link and addr info shown below:


Admin@NETM06:~$ ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master mgmt-vrf state UP mode DEFAULT group default qlen 1000
    link/ether 00:22:48:07:cc:ad brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 00:22:48:07:c9:6c brd ff:ff:ff:ff:ff:ff
4: mgmt-vrf: <NOARP,MASTER,UP,LOWER_UP> mtu 65536 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 8a:f6:26:65:02:5a brd ff:ff:ff:ff:ff:ff


Admin@NETM06:~$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master mgmt-vrf state UP group default qlen 1000
    link/ether 00:22:48:07:cc:ad brd ff:ff:ff:ff:ff:ff
    inet 10.24.12.10/24 brd 10.24.12.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::222:48ff:fe07:ccad/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:22:48:07:c9:6c brd ff:ff:ff:ff:ff:ff
    inet 10.24.12.9/24 brd 10.24.12.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::222:48ff:fe07:c96c/64 scope link
       valid_lft forever preferred_lft forever
4: mgmt-vrf: <NOARP,MASTER,UP,LOWER_UP> mtu 65536 qdisc noqueue state UP group default qlen 1000
    link/ether 8a:f6:26:65:02:5a brd ff:ff:ff:ff:ff:ff



the production traffic is all in the 10.0.0.0/8 network (eth1 global VRF) except for a few subnets (DNS) which are routed out eth0 (mgmt-vrf)


Admin@NETM06:~$ ip route show
default via 10.24.12.1 dev eth0
10.0.0.0/8 via 10.24.12.1 dev eth1
10.24.12.0/24 dev eth1 proto kernel scope link src 10.24.12.9
10.24.65.0/24 via 10.24.12.1 dev eth0
10.25.65.0/24 via 10.24.12.1 dev eth0
10.26.0.0/21 via 10.24.12.1 dev eth0
10.26.64.0/21 via 10.24.12.1 dev eth0


Admin@NETM06:~$ ip route show vrf mgmt-vrf
default via 10.24.12.1 dev eth0
unreachable default metric 4278198272
10.24.12.0/24 dev eth0 proto kernel scope link src 10.24.12.10
10.24.65.0/24 via 10.24.12.1 dev eth0
10.25.65.0/24 via 10.24.12.1 dev eth0
10.26.0.0/21 via 10.24.12.1 dev eth0
10.26.64.0/21 via 10.24.12.1 dev eth0



The strange activity occurs when I enter the command “sudo apt update” as I can resolve the DNS request (10.24.65.203 or 10.24.64.203, verified with tcpdump) out eth0 but for the actual update traffic there is no activity:


sudo tcpdump -i eth0 '(host 10.24.65.203 or host 10.25.65.203) and port 53' -n
<OUTPUT OMITTED FOR BREVITY>
10:06:05.268735 IP 10.24.12.10.39963 > 10.24.65.203.53: 48798+ [1au] A? security.ubuntu.com. (48)
<OUTPUT OMITTED FOR BREVITY>
10:06:05.284403 IP 10.24.65.203.53 > 10.24.12.10.39963: 48798 13/0/1 A 91.189.91.23, A 91.189.88.24, A 91.189.91.26, A 91.189.88.162, A 91.189.88.149, A 91.189.91.24, A 91.189.88.173, A 91.189.88.177, A 91.189.88.31, A 91.189.91.14, A 91.189.88.176, A 91.189.88.175, A 91.189.88.174 (256)



You can see that the update traffic is returned but is not accepted by the stack and a RST is sent


Admin@NETM06:~$ sudo tcpdump -i eth0 '(not host 168.63.129.16 and port 80)' -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
10:17:12.690658 IP 10.24.12.10.40216 > 91.189.88.175.80: Flags [S], seq 2279624826, win 64240, options [mss 1460,sackOK,TS val 2029365856 ecr 0,nop,wscale 7], length 0
10:17:12.691929 IP 10.24.12.10.52362 > 91.189.95.83.80: Flags [S], seq 1465797256, win 64240, options [mss 1460,sackOK,TS val 3833463674 ecr 0,nop,wscale 7], length 0
10:17:12.696270 IP 91.189.88.175.80 > 10.24.12.10.40216: Flags [S.], seq 968450722, ack 2279624827, win 28960, options [mss 1418,sackOK,TS val 81957103 ecr 2029365856,nop,wscale 7], length 0                                                                                                                            
10:17:12.696301 IP 10.24.12.10.40216 > 91.189.88.175.80: Flags [R], seq 2279624827, win 0, length 0
10:17:12.697884 IP 91.189.95.83.80 > 10.24.12.10.52362: Flags [S.], seq 4148330738, ack 1465797257, win 28960, options [mss 1418,sackOK,TS val 2257624414 ecr 3833463674,nop,wscale 8], length 0                                                                                                                         
10:17:12.697909 IP 10.24.12.10.52362 > 91.189.95.83.80: Flags [R], seq 1465797257, win 0, length 0




I can emulate the DNS lookup using netcat in the vrf:


sudo ip vrf exec mgmt-vrf nc -u 10.24.65.203 53


then interactively enter the binary for a www.google.co.uk request:


0035624be394010000010000000000010377777706676f6f676c6502636f02756b00000100010000290200000000000000


This returns as expected:


00624be394010000010000000000010377777706676f6f676c6502636f02756b00000100010000290200000000000000


I can run:


Admin@NETM06:~$ host www.google.co.uk
www.google.co.uk has address 172.217.169.3
www.google.co.uk has IPv6 address 2a00:1450:4009:80d::2003


but I get a timeout for:


sudo ip vrf  exec mgmt-vrf host www.google.co.uk
;; connection timed out; no servers could be reached



However I can take a repo address and vrf exec to it on port 80:


Admin@NETM06:~$ sudo ip vrf  exec mgmt-vrf nc 91.189.91.23 80
hello
HTTP/1.1 400 Bad Request
<OUTPUT OMITTED>

My iptables rule:


sudo iptables -Z
Admin@NETM06:~$ sudo iptables -L -v
Chain INPUT (policy DROP 16 packets, 3592 bytes)
pkts bytes target     prot opt in     out     source               destination
   44  2360 ACCEPT     tcp  --  any    any     anywhere             anywhere             tcp spt:http ctstate RELATED,ESTABLISHED
   83 10243 ACCEPT     udp  --  any    any     anywhere             anywhere             udp spt:domain ctstate RELATED,ESTABLISHED



I cannot find out why the update isn’t working. Any help greatly appreciated


Kind Regards,


Gareth

^ permalink raw reply

* [PATCH net] sctp: fix the missing put_user when dumping transport thresholds
From: Xin Long @ 2019-09-09  7:33 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: davem, Marcelo Ricardo Leitner, Neil Horman

This issue causes SCTP_PEER_ADDR_THLDS sockopt not to be able to dump
a transport thresholds info.

Fix it by adding 'goto' put_user in sctp_getsockopt_paddr_thresholds.

Fixes: 8add543e369d ("sctp: add SCTP_FUTURE_ASSOC for SCTP_PEER_ADDR_THLDS sockopt")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/sctp/socket.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 9d1f83b..ad87518 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -7173,7 +7173,7 @@ static int sctp_getsockopt_paddr_thresholds(struct sock *sk,
 		val.spt_pathmaxrxt = trans->pathmaxrxt;
 		val.spt_pathpfthld = trans->pf_retrans;
 
-		return 0;
+		goto out;
 	}
 
 	asoc = sctp_id2assoc(sk, val.spt_assoc_id);
@@ -7191,6 +7191,7 @@ static int sctp_getsockopt_paddr_thresholds(struct sock *sk,
 		val.spt_pathmaxrxt = sp->pathmaxrxt;
 	}
 
+out:
 	if (put_user(len, optlen) || copy_to_user(optval, &val, len))
 		return -EFAULT;
 
-- 
2.1.0


^ permalink raw reply related

* Re: [PATCH 2/2] vhost: re-introducing metadata acceleration through kernel virtual address
From: Jason Wang @ 2019-09-09  7:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, virtualization, netdev, linux-kernel, jgg, aarcange, jglisse,
	linux-mm, James Bottomley, Christoph Hellwig, David Miller,
	linux-arm-kernel, linux-parisc
In-Reply-To: <20190909004504-mutt-send-email-mst@kernel.org>


On 2019/9/9 下午12:45, Michael S. Tsirkin wrote:
>>> Since idx can be speculated, I guess we need array_index_nospec here?
>> So we have
>>
>> ACQUIRE(mmu_lock)
>>
>> get idx
>>
>> RELEASE(mmu_lock)
>>
>> ACQUIRE(mmu_lock)
>>
>> read array[idx]
>>
>> RELEASE(mmu_lock)
>>
>> Then I think idx can't be speculated consider we've passed RELEASE +
>> ACQUIRE?
> I don't think memory barriers have anything to do with speculation,
> they are architectural.
>

Oh right. Let me add array_index_nospec() in next version.

Thanks


^ permalink raw reply

* Re: [RFC PATCH untested] vhost: block speculation of translated descriptors
From: Jason Wang @ 2019-09-09  7:19 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel; +Cc: kvm, virtualization, netdev
In-Reply-To: <20190908110521.4031-1-mst@redhat.com>


On 2019/9/8 下午7:05, Michael S. Tsirkin wrote:
> iovec addresses coming from vhost are assumed to be
> pre-validated, but in fact can be speculated to a value
> out of range.
>
> Userspace address are later validated with array_index_nospec so we can
> be sure kernel info does not leak through these addresses, but vhost
> must also not leak userspace info outside the allowed memory table to
> guests.
>
> Following the defence in depth principle, make sure
> the address is not validated out of node range.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>   drivers/vhost/vhost.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 5dc174ac8cac..0ee375fb7145 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2072,7 +2072,9 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
>   		size = node->size - addr + node->start;
>   		_iov->iov_len = min((u64)len - s, size);
>   		_iov->iov_base = (void __user *)(unsigned long)
> -			(node->userspace_addr + addr - node->start);
> +			(node->userspace_addr +
> +			 array_index_nospec(addr - node->start,
> +					    node->size));
>   		s += size;
>   		addr += size;
>   		++ret;


I've tried this on Kaby Lake smap off metadata acceleration off using 
testpmd (virtio-user) + vhost_net. I don't see obvious performance 
difference with TX PPS.

Thanks


^ permalink raw reply

* Re: [PATCH 0/2] Revert and rework on the metadata accelreation
From: Jason Wang @ 2019-09-09  7:18 UTC (permalink / raw)
  To: David Miller
  Cc: jgg, mst, kvm, virtualization, netdev, linux-kernel, aarcange,
	jglisse, linux-mm
In-Reply-To: <20190906.151505.1486178691190611604.davem@davemloft.net>


On 2019/9/6 下午9:15, David Miller wrote:
> From: Jason Wang <jasowang@redhat.com>
> Date: Fri, 6 Sep 2019 18:02:35 +0800
>
>> On 2019/9/5 下午9:59, Jason Gunthorpe wrote:
>>> I think you should apply the revert this cycle and rebase the other
>>> patch for next..
>>>
>>> Jason
>> Yes, the plan is to revert in this release cycle.
> Then you should reset patch #1 all by itself targetting 'net'.


Thanks for the reminding. I want the patch to go through Michael's vhost  
tree, that's why I don't put 'net' prefix. For next time, maybe I can  
use "vhost" as a prefix for classification?


^ permalink raw reply

* mta test
From: Gowen @ 2019-09-09  7:17 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Cannot get anything through the filter - another test

^ permalink raw reply

* Re: [PATCH 1/2] Revert "vhost: access vq metadata through kernel virtual address"
From: Jason Wang @ 2019-09-09  7:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, virtualization, netdev, linux-kernel, jgg, aarcange, jglisse,
	linux-mm
In-Reply-To: <20190906094332-mutt-send-email-mst@kernel.org>


On 2019/9/6 下午9:46, Michael S. Tsirkin wrote:
> On Thu, Sep 05, 2019 at 08:27:35PM +0800, Jason Wang wrote:
>> It was reported that metadata acceleration introduces several issues,
>> so this patch reverts commit ff466032dc9e5a61217f22ea34b2df932786bbfc,
>> 73f628ec9e6bcc45b77c53fe6d0c0ec55eaf82af and
>> 0b4a7092ffe568a55bf8f3cefdf79ff666586d91.
>>
>> We will rework it on the next version.
>>
>> Cc: Jason Gunthorpe <jgg@mellanox.com>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>
> I am confused by the above.
> What I see upstream is 7f466032dc.
>
> commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
> Author: Jason Wang <jasowang@redhat.com>
> Date:   Fri May 24 04:12:18 2019 -0400
>
>      vhost: access vq metadata through kernel virtual address
>
> so this is what I reverted.
>
> Pls take a look, and let me know if you see issues.
>
> Thanks!


Yes, my fault.

Thanks


>> ---
>>   drivers/vhost/vhost.c | 515 +-----------------------------------------
>>   drivers/vhost/vhost.h |  41 ----
>>   2 files changed, 3 insertions(+), 553 deletions(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index 0536f8526359..791562e03fe0 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -298,160 +298,6 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
>>   		__vhost_vq_meta_reset(d->vqs[i]);
>>   }
>>   
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -static void vhost_map_unprefetch(struct vhost_map *map)
>> -{
>> -	kfree(map->pages);
>> -	map->pages = NULL;
>> -	map->npages = 0;
>> -	map->addr = NULL;
>> -}
>> -
>> -static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
>> -{
>> -	struct vhost_map *map[VHOST_NUM_ADDRS];
>> -	int i;
>> -
>> -	spin_lock(&vq->mmu_lock);
>> -	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
>> -		map[i] = rcu_dereference_protected(vq->maps[i],
>> -				  lockdep_is_held(&vq->mmu_lock));
>> -		if (map[i])
>> -			rcu_assign_pointer(vq->maps[i], NULL);
>> -	}
>> -	spin_unlock(&vq->mmu_lock);
>> -
>> -	synchronize_rcu();
>> -
>> -	for (i = 0; i < VHOST_NUM_ADDRS; i++)
>> -		if (map[i])
>> -			vhost_map_unprefetch(map[i]);
>> -
>> -}
>> -
>> -static void vhost_reset_vq_maps(struct vhost_virtqueue *vq)
>> -{
>> -	int i;
>> -
>> -	vhost_uninit_vq_maps(vq);
>> -	for (i = 0; i < VHOST_NUM_ADDRS; i++)
>> -		vq->uaddrs[i].size = 0;
>> -}
>> -
>> -static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
>> -				     unsigned long start,
>> -				     unsigned long end)
>> -{
>> -	if (unlikely(!uaddr->size))
>> -		return false;
>> -
>> -	return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
>> -}
>> -
>> -static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>> -				      int index,
>> -				      unsigned long start,
>> -				      unsigned long end)
>> -{
>> -	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
>> -	struct vhost_map *map;
>> -	int i;
>> -
>> -	if (!vhost_map_range_overlap(uaddr, start, end))
>> -		return;
>> -
>> -	spin_lock(&vq->mmu_lock);
>> -	++vq->invalidate_count;
>> -
>> -	map = rcu_dereference_protected(vq->maps[index],
>> -					lockdep_is_held(&vq->mmu_lock));
>> -	if (map) {
>> -		if (uaddr->write) {
>> -			for (i = 0; i < map->npages; i++)
>> -				set_page_dirty(map->pages[i]);
>> -		}
>> -		rcu_assign_pointer(vq->maps[index], NULL);
>> -	}
>> -	spin_unlock(&vq->mmu_lock);
>> -
>> -	if (map) {
>> -		synchronize_rcu();
>> -		vhost_map_unprefetch(map);
>> -	}
>> -}
>> -
>> -static void vhost_invalidate_vq_end(struct vhost_virtqueue *vq,
>> -				    int index,
>> -				    unsigned long start,
>> -				    unsigned long end)
>> -{
>> -	if (!vhost_map_range_overlap(&vq->uaddrs[index], start, end))
>> -		return;
>> -
>> -	spin_lock(&vq->mmu_lock);
>> -	--vq->invalidate_count;
>> -	spin_unlock(&vq->mmu_lock);
>> -}
>> -
>> -static int vhost_invalidate_range_start(struct mmu_notifier *mn,
>> -					const struct mmu_notifier_range *range)
>> -{
>> -	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
>> -					     mmu_notifier);
>> -	int i, j;
>> -
>> -	if (!mmu_notifier_range_blockable(range))
>> -		return -EAGAIN;
>> -
>> -	for (i = 0; i < dev->nvqs; i++) {
>> -		struct vhost_virtqueue *vq = dev->vqs[i];
>> -
>> -		for (j = 0; j < VHOST_NUM_ADDRS; j++)
>> -			vhost_invalidate_vq_start(vq, j,
>> -						  range->start,
>> -						  range->end);
>> -	}
>> -
>> -	return 0;
>> -}
>> -
>> -static void vhost_invalidate_range_end(struct mmu_notifier *mn,
>> -				       const struct mmu_notifier_range *range)
>> -{
>> -	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
>> -					     mmu_notifier);
>> -	int i, j;
>> -
>> -	for (i = 0; i < dev->nvqs; i++) {
>> -		struct vhost_virtqueue *vq = dev->vqs[i];
>> -
>> -		for (j = 0; j < VHOST_NUM_ADDRS; j++)
>> -			vhost_invalidate_vq_end(vq, j,
>> -						range->start,
>> -						range->end);
>> -	}
>> -}
>> -
>> -static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
>> -	.invalidate_range_start = vhost_invalidate_range_start,
>> -	.invalidate_range_end = vhost_invalidate_range_end,
>> -};
>> -
>> -static void vhost_init_maps(struct vhost_dev *dev)
>> -{
>> -	struct vhost_virtqueue *vq;
>> -	int i, j;
>> -
>> -	dev->mmu_notifier.ops = &vhost_mmu_notifier_ops;
>> -
>> -	for (i = 0; i < dev->nvqs; ++i) {
>> -		vq = dev->vqs[i];
>> -		for (j = 0; j < VHOST_NUM_ADDRS; j++)
>> -			RCU_INIT_POINTER(vq->maps[j], NULL);
>> -	}
>> -}
>> -#endif
>> -
>>   static void vhost_vq_reset(struct vhost_dev *dev,
>>   			   struct vhost_virtqueue *vq)
>>   {
>> @@ -480,11 +326,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>>   	vq->busyloop_timeout = 0;
>>   	vq->umem = NULL;
>>   	vq->iotlb = NULL;
>> -	vq->invalidate_count = 0;
>>   	__vhost_vq_meta_reset(vq);
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -	vhost_reset_vq_maps(vq);
>> -#endif
>>   }
>>   
>>   static int vhost_worker(void *data)
>> @@ -634,9 +476,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>>   	INIT_LIST_HEAD(&dev->read_list);
>>   	INIT_LIST_HEAD(&dev->pending_list);
>>   	spin_lock_init(&dev->iotlb_lock);
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -	vhost_init_maps(dev);
>> -#endif
>> +
>>   
>>   	for (i = 0; i < dev->nvqs; ++i) {
>>   		vq = dev->vqs[i];
>> @@ -645,7 +485,6 @@ void vhost_dev_init(struct vhost_dev *dev,
>>   		vq->heads = NULL;
>>   		vq->dev = dev;
>>   		mutex_init(&vq->mutex);
>> -		spin_lock_init(&vq->mmu_lock);
>>   		vhost_vq_reset(dev, vq);
>>   		if (vq->handle_kick)
>>   			vhost_poll_init(&vq->poll, vq->handle_kick,
>> @@ -725,18 +564,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
>>   	if (err)
>>   		goto err_cgroup;
>>   
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -	err = mmu_notifier_register(&dev->mmu_notifier, dev->mm);
>> -	if (err)
>> -		goto err_mmu_notifier;
>> -#endif
>> -
>>   	return 0;
>> -
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -err_mmu_notifier:
>> -	vhost_dev_free_iovecs(dev);
>> -#endif
>>   err_cgroup:
>>   	kthread_stop(worker);
>>   	dev->worker = NULL;
>> @@ -827,107 +655,6 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>>   	spin_unlock(&dev->iotlb_lock);
>>   }
>>   
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -static void vhost_setup_uaddr(struct vhost_virtqueue *vq,
>> -			      int index, unsigned long uaddr,
>> -			      size_t size, bool write)
>> -{
>> -	struct vhost_uaddr *addr = &vq->uaddrs[index];
>> -
>> -	addr->uaddr = uaddr;
>> -	addr->size = size;
>> -	addr->write = write;
>> -}
>> -
>> -static void vhost_setup_vq_uaddr(struct vhost_virtqueue *vq)
>> -{
>> -	vhost_setup_uaddr(vq, VHOST_ADDR_DESC,
>> -			  (unsigned long)vq->desc,
>> -			  vhost_get_desc_size(vq, vq->num),
>> -			  false);
>> -	vhost_setup_uaddr(vq, VHOST_ADDR_AVAIL,
>> -			  (unsigned long)vq->avail,
>> -			  vhost_get_avail_size(vq, vq->num),
>> -			  false);
>> -	vhost_setup_uaddr(vq, VHOST_ADDR_USED,
>> -			  (unsigned long)vq->used,
>> -			  vhost_get_used_size(vq, vq->num),
>> -			  true);
>> -}
>> -
>> -static int vhost_map_prefetch(struct vhost_virtqueue *vq,
>> -			       int index)
>> -{
>> -	struct vhost_map *map;
>> -	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
>> -	struct page **pages;
>> -	int npages = DIV_ROUND_UP(uaddr->size, PAGE_SIZE);
>> -	int npinned;
>> -	void *vaddr, *v;
>> -	int err;
>> -	int i;
>> -
>> -	spin_lock(&vq->mmu_lock);
>> -
>> -	err = -EFAULT;
>> -	if (vq->invalidate_count)
>> -		goto err;
>> -
>> -	err = -ENOMEM;
>> -	map = kmalloc(sizeof(*map), GFP_ATOMIC);
>> -	if (!map)
>> -		goto err;
>> -
>> -	pages = kmalloc_array(npages, sizeof(struct page *), GFP_ATOMIC);
>> -	if (!pages)
>> -		goto err_pages;
>> -
>> -	err = EFAULT;
>> -	npinned = __get_user_pages_fast(uaddr->uaddr, npages,
>> -					uaddr->write, pages);
>> -	if (npinned > 0)
>> -		release_pages(pages, npinned);
>> -	if (npinned != npages)
>> -		goto err_gup;
>> -
>> -	for (i = 0; i < npinned; i++)
>> -		if (PageHighMem(pages[i]))
>> -			goto err_gup;
>> -
>> -	vaddr = v = page_address(pages[0]);
>> -
>> -	/* For simplicity, fallback to userspace address if VA is not
>> -	 * contigious.
>> -	 */
>> -	for (i = 1; i < npinned; i++) {
>> -		v += PAGE_SIZE;
>> -		if (v != page_address(pages[i]))
>> -			goto err_gup;
>> -	}
>> -
>> -	map->addr = vaddr + (uaddr->uaddr & (PAGE_SIZE - 1));
>> -	map->npages = npages;
>> -	map->pages = pages;
>> -
>> -	rcu_assign_pointer(vq->maps[index], map);
>> -	/* No need for a synchronize_rcu(). This function should be
>> -	 * called by dev->worker so we are serialized with all
>> -	 * readers.
>> -	 */
>> -	spin_unlock(&vq->mmu_lock);
>> -
>> -	return 0;
>> -
>> -err_gup:
>> -	kfree(pages);
>> -err_pages:
>> -	kfree(map);
>> -err:
>> -	spin_unlock(&vq->mmu_lock);
>> -	return err;
>> -}
>> -#endif
>> -
>>   void vhost_dev_cleanup(struct vhost_dev *dev)
>>   {
>>   	int i;
>> @@ -957,16 +684,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
>>   		kthread_stop(dev->worker);
>>   		dev->worker = NULL;
>>   	}
>> -	if (dev->mm) {
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -		mmu_notifier_unregister(&dev->mmu_notifier, dev->mm);
>> -#endif
>> +	if (dev->mm)
>>   		mmput(dev->mm);
>> -	}
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -	for (i = 0; i < dev->nvqs; i++)
>> -		vhost_uninit_vq_maps(dev->vqs[i]);
>> -#endif
>>   	dev->mm = NULL;
>>   }
>>   EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
>> @@ -1195,26 +914,6 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
>>   
>>   static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
>>   {
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -	struct vhost_map *map;
>> -	struct vring_used *used;
>> -
>> -	if (!vq->iotlb) {
>> -		rcu_read_lock();
>> -
>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
>> -		if (likely(map)) {
>> -			used = map->addr;
>> -			*((__virtio16 *)&used->ring[vq->num]) =
>> -				cpu_to_vhost16(vq, vq->avail_idx);
>> -			rcu_read_unlock();
>> -			return 0;
>> -		}
>> -
>> -		rcu_read_unlock();
>> -	}
>> -#endif
>> -
>>   	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
>>   			      vhost_avail_event(vq));
>>   }
>> @@ -1223,27 +922,6 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>>   				 struct vring_used_elem *head, int idx,
>>   				 int count)
>>   {
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -	struct vhost_map *map;
>> -	struct vring_used *used;
>> -	size_t size;
>> -
>> -	if (!vq->iotlb) {
>> -		rcu_read_lock();
>> -
>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
>> -		if (likely(map)) {
>> -			used = map->addr;
>> -			size = count * sizeof(*head);
>> -			memcpy(used->ring + idx, head, size);
>> -			rcu_read_unlock();
>> -			return 0;
>> -		}
>> -
>> -		rcu_read_unlock();
>> -	}
>> -#endif
>> -
>>   	return vhost_copy_to_user(vq, vq->used->ring + idx, head,
>>   				  count * sizeof(*head));
>>   }
>> @@ -1251,25 +929,6 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>>   static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>>   
>>   {
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -	struct vhost_map *map;
>> -	struct vring_used *used;
>> -
>> -	if (!vq->iotlb) {
>> -		rcu_read_lock();
>> -
>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
>> -		if (likely(map)) {
>> -			used = map->addr;
>> -			used->flags = cpu_to_vhost16(vq, vq->used_flags);
>> -			rcu_read_unlock();
>> -			return 0;
>> -		}
>> -
>> -		rcu_read_unlock();
>> -	}
>> -#endif
>> -
>>   	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
>>   			      &vq->used->flags);
>>   }
>> @@ -1277,25 +936,6 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>>   static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
>>   
>>   {
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -	struct vhost_map *map;
>> -	struct vring_used *used;
>> -
>> -	if (!vq->iotlb) {
>> -		rcu_read_lock();
>> -
>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
>> -		if (likely(map)) {
>> -			used = map->addr;
>> -			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
>> -			rcu_read_unlock();
>> -			return 0;
>> -		}
>> -
>> -		rcu_read_unlock();
>> -	}
>> -#endif
>> -
>>   	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
>>   			      &vq->used->idx);
>>   }
>> @@ -1341,50 +981,12 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
>>   static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
>>   				      __virtio16 *idx)
>>   {
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -	struct vhost_map *map;
>> -	struct vring_avail *avail;
>> -
>> -	if (!vq->iotlb) {
>> -		rcu_read_lock();
>> -
>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
>> -		if (likely(map)) {
>> -			avail = map->addr;
>> -			*idx = avail->idx;
>> -			rcu_read_unlock();
>> -			return 0;
>> -		}
>> -
>> -		rcu_read_unlock();
>> -	}
>> -#endif
>> -
>>   	return vhost_get_avail(vq, *idx, &vq->avail->idx);
>>   }
>>   
>>   static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>>   				       __virtio16 *head, int idx)
>>   {
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -	struct vhost_map *map;
>> -	struct vring_avail *avail;
>> -
>> -	if (!vq->iotlb) {
>> -		rcu_read_lock();
>> -
>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
>> -		if (likely(map)) {
>> -			avail = map->addr;
>> -			*head = avail->ring[idx & (vq->num - 1)];
>> -			rcu_read_unlock();
>> -			return 0;
>> -		}
>> -
>> -		rcu_read_unlock();
>> -	}
>> -#endif
>> -
>>   	return vhost_get_avail(vq, *head,
>>   			       &vq->avail->ring[idx & (vq->num - 1)]);
>>   }
>> @@ -1392,98 +994,24 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>>   static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
>>   					__virtio16 *flags)
>>   {
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -	struct vhost_map *map;
>> -	struct vring_avail *avail;
>> -
>> -	if (!vq->iotlb) {
>> -		rcu_read_lock();
>> -
>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
>> -		if (likely(map)) {
>> -			avail = map->addr;
>> -			*flags = avail->flags;
>> -			rcu_read_unlock();
>> -			return 0;
>> -		}
>> -
>> -		rcu_read_unlock();
>> -	}
>> -#endif
>> -
>>   	return vhost_get_avail(vq, *flags, &vq->avail->flags);
>>   }
>>   
>>   static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
>>   				       __virtio16 *event)
>>   {
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -	struct vhost_map *map;
>> -	struct vring_avail *avail;
>> -
>> -	if (!vq->iotlb) {
>> -		rcu_read_lock();
>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
>> -		if (likely(map)) {
>> -			avail = map->addr;
>> -			*event = (__virtio16)avail->ring[vq->num];
>> -			rcu_read_unlock();
>> -			return 0;
>> -		}
>> -		rcu_read_unlock();
>> -	}
>> -#endif
>> -
>>   	return vhost_get_avail(vq, *event, vhost_used_event(vq));
>>   }
>>   
>>   static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
>>   				     __virtio16 *idx)
>>   {
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -	struct vhost_map *map;
>> -	struct vring_used *used;
>> -
>> -	if (!vq->iotlb) {
>> -		rcu_read_lock();
>> -
>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
>> -		if (likely(map)) {
>> -			used = map->addr;
>> -			*idx = used->idx;
>> -			rcu_read_unlock();
>> -			return 0;
>> -		}
>> -
>> -		rcu_read_unlock();
>> -	}
>> -#endif
>> -
>>   	return vhost_get_used(vq, *idx, &vq->used->idx);
>>   }
>>   
>>   static inline int vhost_get_desc(struct vhost_virtqueue *vq,
>>   				 struct vring_desc *desc, int idx)
>>   {
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -	struct vhost_map *map;
>> -	struct vring_desc *d;
>> -
>> -	if (!vq->iotlb) {
>> -		rcu_read_lock();
>> -
>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]);
>> -		if (likely(map)) {
>> -			d = map->addr;
>> -			*desc = *(d + idx);
>> -			rcu_read_unlock();
>> -			return 0;
>> -		}
>> -
>> -		rcu_read_unlock();
>> -	}
>> -#endif
>> -
>>   	return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
>>   }
>>   
>> @@ -1824,32 +1352,12 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
>>   	return true;
>>   }
>>   
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq)
>> -{
>> -	struct vhost_map __rcu *map;
>> -	int i;
>> -
>> -	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
>> -		rcu_read_lock();
>> -		map = rcu_dereference(vq->maps[i]);
>> -		rcu_read_unlock();
>> -		if (unlikely(!map))
>> -			vhost_map_prefetch(vq, i);
>> -	}
>> -}
>> -#endif
>> -
>>   int vq_meta_prefetch(struct vhost_virtqueue *vq)
>>   {
>>   	unsigned int num = vq->num;
>>   
>> -	if (!vq->iotlb) {
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -		vhost_vq_map_prefetch(vq);
>> -#endif
>> +	if (!vq->iotlb)
>>   		return 1;
>> -	}
>>   
>>   	return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
>>   			       vhost_get_desc_size(vq, num), VHOST_ADDR_DESC) &&
>> @@ -2060,16 +1568,6 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d,
>>   
>>   	mutex_lock(&vq->mutex);
>>   
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -	/* Unregister MMU notifer to allow invalidation callback
>> -	 * can access vq->uaddrs[] without holding a lock.
>> -	 */
>> -	if (d->mm)
>> -		mmu_notifier_unregister(&d->mmu_notifier, d->mm);
>> -
>> -	vhost_uninit_vq_maps(vq);
>> -#endif
>> -
>>   	switch (ioctl) {
>>   	case VHOST_SET_VRING_NUM:
>>   		r = vhost_vring_set_num(d, vq, argp);
>> @@ -2081,13 +1579,6 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d,
>>   		BUG();
>>   	}
>>   
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -	vhost_setup_vq_uaddr(vq);
>> -
>> -	if (d->mm)
>> -		mmu_notifier_register(&d->mmu_notifier, d->mm);
>> -#endif
>> -
>>   	mutex_unlock(&vq->mutex);
>>   
>>   	return r;
>> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
>> index 42a8c2a13ab1..e9ed2722b633 100644
>> --- a/drivers/vhost/vhost.h
>> +++ b/drivers/vhost/vhost.h
>> @@ -12,9 +12,6 @@
>>   #include <linux/virtio_config.h>
>>   #include <linux/virtio_ring.h>
>>   #include <linux/atomic.h>
>> -#include <linux/pagemap.h>
>> -#include <linux/mmu_notifier.h>
>> -#include <asm/cacheflush.h>
>>   
>>   struct vhost_work;
>>   typedef void (*vhost_work_fn_t)(struct vhost_work *work);
>> @@ -83,24 +80,6 @@ enum vhost_uaddr_type {
>>   	VHOST_NUM_ADDRS = 3,
>>   };
>>   
>> -struct vhost_map {
>> -	int npages;
>> -	void *addr;
>> -	struct page **pages;
>> -};
>> -
>> -struct vhost_uaddr {
>> -	unsigned long uaddr;
>> -	size_t size;
>> -	bool write;
>> -};
>> -
>> -#if defined(CONFIG_MMU_NOTIFIER) && ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE == 0
>> -#define VHOST_ARCH_CAN_ACCEL_UACCESS 0
>> -#else
>> -#define VHOST_ARCH_CAN_ACCEL_UACCESS 0
>> -#endif
>> -
>>   /* The virtqueue structure describes a queue attached to a device. */
>>   struct vhost_virtqueue {
>>   	struct vhost_dev *dev;
>> @@ -111,22 +90,7 @@ struct vhost_virtqueue {
>>   	struct vring_desc __user *desc;
>>   	struct vring_avail __user *avail;
>>   	struct vring_used __user *used;
>> -
>> -#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> -	/* Read by memory accessors, modified by meta data
>> -	 * prefetching, MMU notifier and vring ioctl().
>> -	 * Synchonrized through mmu_lock (writers) and RCU (writers
>> -	 * and readers).
>> -	 */
>> -	struct vhost_map __rcu *maps[VHOST_NUM_ADDRS];
>> -	/* Read by MMU notifier, modified by vring ioctl(),
>> -	 * synchronized through MMU notifier
>> -	 * registering/unregistering.
>> -	 */
>> -	struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS];
>> -#endif
>>   	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
>> -
>>   	struct file *kick;
>>   	struct eventfd_ctx *call_ctx;
>>   	struct eventfd_ctx *error_ctx;
>> @@ -181,8 +145,6 @@ struct vhost_virtqueue {
>>   	bool user_be;
>>   #endif
>>   	u32 busyloop_timeout;
>> -	spinlock_t mmu_lock;
>> -	int invalidate_count;
>>   };
>>   
>>   struct vhost_msg_node {
>> @@ -196,9 +158,6 @@ struct vhost_msg_node {
>>   
>>   struct vhost_dev {
>>   	struct mm_struct *mm;
>> -#ifdef CONFIG_MMU_NOTIFIER
>> -	struct mmu_notifier mmu_notifier;
>> -#endif
>>   	struct mutex mutex;
>>   	struct vhost_virtqueue **vqs;
>>   	int nvqs;
>> -- 
>> 2.19.1

^ permalink raw reply

* Re: [PATCH v1 net-next 00/15] tc-taprio offload for SJA1105 DSA
From: Richard Cochran @ 2019-09-09  7:04 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Andrew Lunn, David Miller, f.fainelli, vivien.didelot,
	vinicius.gomes, vedang.patel, weifeng.voon, jiri, m-karicheri2,
	Jose.Abreu, ilias.apalodimas, jhs, xiyou.wangcong,
	kurt.kanzenbach, netdev
In-Reply-To: <CA+h21hqLF1gE+aDH9xQPadCuo6ih=xWY73JZvg7c58C1tC+0Jg@mail.gmail.com>

On Sun, Sep 08, 2019 at 12:07:27PM +0100, Vladimir Oltean wrote:
> I think Richard has been there when the taprio, etf qdiscs, SO_TXTIME
> were first defined and developed:
> https://patchwork.ozlabs.org/cover/808504/
> I expect he is capable of delivering a competent review of the entire
> series, possibly way more competent than my patch set itself.

I am really not familiar with the taprio/qdisc stuff.  Sorry.
 
> Additionally, the 802.1AS PTP profile even calls for switches and
> end-stations to use timestamping counters that are free-running, and
> scale&rate-correct those in software - due to a perceived "double
> feedback loop", or "changing the ruler while measuring with it". Now
> I'm no expert at all, but it would be interesting if we went on with
> the discussion in the direction of what Linux is currently
> understanding by a "free-running" PTP counter. On one hand there's the
> timecounter/cyclecounter in the kernel which makes for a
> software-corrected PHC, and on the other there's the free_running
> option in linuxptp which makes for a "nowhere-corrected" PHC that is
> only being used in the E2E_TC and P2P_TC profiles. But user space
> otherwise has no insight into the PHC implementation from the kernel,
> and "free_running" from ptp4l can't really be used to implement the
> synchronization mechanism required by 802.1AS.

That just isn't true.  We have already done this for end stations.

> To me, the most striking aspect is that this particular recommendation
> from 802.1AS is at direct odds with 802.1Qbv (time-based egress) /
> 802.1Qci (time-based ingress policing) which clearly require a PTP
> counter in the NIC that ticks to the wall clock, and not to a random
> free-running time since boot up. I simply can't seem to reconcile the
> two.

Well, yeah.  The various PTP standards and profiles dream up whatever
they want.  The HW we get dictates what is actually possible.

> But let's leave 802.1AS aside for a second - that's not what the patch
> set is about, but rather a bit of background on why there are 2 PTP
> clocks in this switch, and why I'm switching from one to the other.
> Richard didn't really warm up to the phc2sys-to-itself idea in the
> past, and opted for simplicity: just use the hardware-corrected
> PTPCLKVAL for everything, which is exactly what I'm doing as of now.

If you really want to make an 802.1-AS bridge, then

1. You can leave the clock free running, and

2. you don't need to synchronize the Linux system clock at all.

Thanks,
Richard

^ permalink raw reply

* Re: [PATCH v1 net-next 00/15] tc-taprio offload for SJA1105 DSA
From: Richard Cochran @ 2019-09-09  6:52 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Vladimir Oltean, David Miller, f.fainelli, vivien.didelot,
	vinicius.gomes, vedang.patel, weifeng.voon, jiri, m-karicheri2,
	Jose.Abreu, ilias.apalodimas, jhs, xiyou.wangcong,
	kurt.kanzenbach, netdev
In-Reply-To: <20190908204224.GA2730@lunn.ch>

On Sun, Sep 08, 2019 at 10:42:24PM +0200, Andrew Lunn wrote:
> So if you are struggling to get something reviewed, make it more
> appealing for the reviewer. Salami tactics.

+1

^ permalink raw reply

* Re: general protection fault in qdisc_put
From: Dmitry Vyukov @ 2019-09-09  6:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: syzbot, Akinobu Mita, Andrew Morton, David Miller,
	Jamal Hadi Salim, Jiří Pírko,
	Linux List Kernel Mailing, Michal Hocko, Netdev, syzkaller-bugs,
	Cong Wang
In-Reply-To: <CAHk-=wgZneAegyitz7f+JLjB6=28ewtvT7M4xy_a-wqsTjOX_w@mail.gmail.com>

On Sun, Sep 8, 2019 at 6:19 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sat, Sep 7, 2019 at 11:08 PM syzbot
> <syzbot+d5870a903591faaca4ae@syzkaller.appspotmail.com> wrote:
> >
> > The bug was bisected to:
> >
> > commit e41d58185f1444368873d4d7422f7664a68be61d
> > Author: Dmitry Vyukov <dvyukov@google.com>
> > Date:   Wed Jul 12 21:34:35 2017 +0000
> >
> >      fault-inject: support systematic fault injection
>
> That commit does seem a bit questionable, but not the cause of this
> problem (just the trigger).
>
> I think the questionable part is that the new code doesn't honor the
> task filtering, and will fail even for protected tasks. Dmitry?

That commit added a new fault injection mode with a new API that is
used by syzkaller to inject faults. Before that commit the fault
inject is not working for syzkaller at all. I think this bisection
result simply means "the GPF is related to an earlier failure".

> > kasan: GPF could be caused by NULL-ptr deref or user memory access
> > general protection fault: 0000 [#1] PREEMPT SMP KASAN
> > CPU: 1 PID: 9699 Comm: syz-executor169 Not tainted 5.3.0-rc7+ #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > RIP: 0010:qdisc_put+0x25/0x90 net/sched/sch_generic.c:983
>
> Yes, looks like 'qdisc' is NULL.
>
> This is the
>
>         qdisc_put(q->qdisc);
>
> in sfb_destroy(), called from qdisc_create().
>
> I think what is happening is this (in qdisc_create()):
>
>         if (ops->init) {
>                 err = ops->init(sch, tca[TCA_OPTIONS], extack);
>                 if (err != 0)
>                         goto err_out5;
>         }
>         ...
> err_out5:
>         /* ops->init() failed, we call ->destroy() like qdisc_create_dflt() */
>         if (ops->destroy)
>                 ops->destroy(sch);
>
> and "ops->init" is sfb_init(), which will not initialize q->qdisc if
> tcf_block_get() fails.
>
> I see two solutions:
>
>  (a) move the
>
>         q->qdisc = &noop_qdisc;
>
>      up earlier in sfb_init(), so that qdisc is always initialized
> after sfb_init(), even on failure.
>
>  (b) just make qdisc_put(NULL) just silently work as a no-op.
>
>  (c) change all the semantics to not call ->destroy if ->init failed.
>
> Honestly, (a) seems very fragile - do all the other init routines do
> this? And (c) sounds like a big change, and very fragile too.
>
> So I'd suggest that qdisc_put() be made to just ignore a NULL pointer
> (and maybe an error pointer too?).
>
> But I'll leave it to the maintainers to sort out the proper fix.
> Maybe people prefer (a)?
>
>                    Linus

^ permalink raw reply

* Re: [PATCH 2/2] vhost: re-introducing metadata acceleration through kernel virtual address
From: Michael S. Tsirkin @ 2019-09-09  4:45 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, virtualization, netdev, linux-kernel, jgg, aarcange, jglisse,
	linux-mm, James Bottomley, Christoph Hellwig, David Miller,
	linux-arm-kernel, linux-parisc
In-Reply-To: <1cb5aa8d-6213-5fce-5a77-fcada572c882@redhat.com>

On Mon, Sep 09, 2019 at 10:18:57AM +0800, Jason Wang wrote:
> 
> On 2019/9/8 下午7:05, Michael S. Tsirkin wrote:
> > On Thu, Sep 05, 2019 at 08:27:36PM +0800, Jason Wang wrote:
> > > This is a rework on the commit 7f466032dc9e ("vhost: access vq
> > > metadata through kernel virtual address").
> > > 
> > > It was noticed that the copy_to/from_user() friends that was used to
> > > access virtqueue metdata tends to be very expensive for dataplane
> > > implementation like vhost since it involves lots of software checks,
> > > speculation barriers,
> > So if we drop speculation barrier,
> > there's a problem here in access will now be speculated.
> > This effectively disables the defence in depth effect of
> > b3bbfb3fb5d25776b8e3f361d2eedaabb0b496cd
> >      x86: Introduce __uaccess_begin_nospec() and uaccess_try_nospec
> > 
> > 
> > So now we need to sprinkle array_index_nospec or barrier_nospec over the
> > code whenever we use an index we got from userspace.
> > See below for some examples.
> > 
> > 
> > > hardware feature toggling (e.g SMAP). The
> > > extra cost will be more obvious when transferring small packets since
> > > the time spent on metadata accessing become more significant.
> > > 
> > > This patch tries to eliminate those overheads by accessing them
> > > through direct mapping of those pages. Invalidation callbacks is
> > > implemented for co-operation with general VM management (swap, KSM,
> > > THP or NUMA balancing). We will try to get the direct mapping of vq
> > > metadata before each round of packet processing if it doesn't
> > > exist. If we fail, we will simplely fallback to copy_to/from_user()
> > > friends.
> > > 
> > > This invalidation, direct mapping access and set are synchronized
> > > through spinlock. This takes a step back from the original commit
> > > 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> > > address") which tries to RCU which is suspicious and hard to be
> > > reviewed. This won't perform as well as RCU because of the atomic,
> > > this could be addressed by the future optimization.
> > > 
> > > This method might does not work for high mem page which requires
> > > temporary mapping so we just fallback to normal
> > > copy_to/from_user() and may not for arch that has virtual tagged cache
> > > since extra cache flushing is needed to eliminate the alias. This will
> > > result complex logic and bad performance. For those archs, this patch
> > > simply go for copy_to/from_user() friends. This is done by ruling out
> > > kernel mapping codes through ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.
> > > 
> > > Note that this is only done when device IOTLB is not enabled. We
> > > could use similar method to optimize IOTLB in the future.
> > > 
> > > Tests shows at most about 22% improvement on TX PPS when using
> > > virtio-user + vhost_net + xdp1 + TAP on 4.0GHz Kaby Lake.
> > > 
> > >          SMAP on | SMAP off
> > > Before: 4.9Mpps | 6.9Mpps
> > > After:  6.0Mpps | 7.5Mpps
> > > 
> > > On a elder CPU Sandy Bridge without SMAP support. TX PPS doesn't see
> > > any difference.
> > Why is not Kaby Lake with SMAP off the same as Sandy Bridge?
> 
> 
> I don't know, I guess it was because the atomic is l
> 
> 
> > 
> > 
> > > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > > Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
> > > Cc: Christoph Hellwig <hch@infradead.org>
> > > Cc: David Miller <davem@davemloft.net>
> > > Cc: Jerome Glisse <jglisse@redhat.com>
> > > Cc: Jason Gunthorpe <jgg@mellanox.com>
> > > Cc: linux-mm@kvack.org
> > > Cc: linux-arm-kernel@lists.infradead.org
> > > Cc: linux-parisc@vger.kernel.org
> > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > ---
> > >   drivers/vhost/vhost.c | 551 +++++++++++++++++++++++++++++++++++++++++-
> > >   drivers/vhost/vhost.h |  41 ++++
> > >   2 files changed, 589 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > index 791562e03fe0..f98155f28f02 100644
> > > --- a/drivers/vhost/vhost.c
> > > +++ b/drivers/vhost/vhost.c
> > > @@ -298,6 +298,182 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
> > >   		__vhost_vq_meta_reset(d->vqs[i]);
> > >   }
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +static void vhost_map_unprefetch(struct vhost_map *map)
> > > +{
> > > +	kfree(map->pages);
> > > +	kfree(map);
> > > +}
> > > +
> > > +static void vhost_set_map_dirty(struct vhost_virtqueue *vq,
> > > +				struct vhost_map *map, int index)
> > > +{
> > > +	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
> > > +	int i;
> > > +
> > > +	if (uaddr->write) {
> > > +		for (i = 0; i < map->npages; i++)
> > > +			set_page_dirty(map->pages[i]);
> > > +	}
> > > +}
> > > +
> > > +static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
> > > +{
> > > +	struct vhost_map *map[VHOST_NUM_ADDRS];
> > > +	int i;
> > > +
> > > +	spin_lock(&vq->mmu_lock);
> > > +	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> > > +		map[i] = vq->maps[i];
> > > +		if (map[i]) {
> > > +			vhost_set_map_dirty(vq, map[i], i);
> > > +			vq->maps[i] = NULL;
> > > +		}
> > > +	}
> > > +	spin_unlock(&vq->mmu_lock);
> > > +
> > > +	/* No need for synchronization since we are serialized with
> > > +	 * memory accessors (e.g vq mutex held).
> > > +	 */
> > > +
> > > +	for (i = 0; i < VHOST_NUM_ADDRS; i++)
> > > +		if (map[i])
> > > +			vhost_map_unprefetch(map[i]);
> > > +
> > > +}
> > > +
> > > +static void vhost_reset_vq_maps(struct vhost_virtqueue *vq)
> > > +{
> > > +	int i;
> > > +
> > > +	vhost_uninit_vq_maps(vq);
> > > +	for (i = 0; i < VHOST_NUM_ADDRS; i++)
> > > +		vq->uaddrs[i].size = 0;
> > > +}
> > > +
> > > +static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
> > > +				     unsigned long start,
> > > +				     unsigned long end)
> > > +{
> > > +	if (unlikely(!uaddr->size))
> > > +		return false;
> > > +
> > > +	return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
> > > +}
> > > +
> > > +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
> > > +{
> > > +	spin_lock(&vq->mmu_lock);
> > > +}
> > > +
> > > +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
> > > +{
> > > +	spin_unlock(&vq->mmu_lock);
> > > +}
> > > +
> > > +static int vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
> > > +				     int index,
> > > +				     unsigned long start,
> > > +				     unsigned long end,
> > > +				     bool blockable)
> > > +{
> > > +	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
> > > +	struct vhost_map *map;
> > > +
> > > +	if (!vhost_map_range_overlap(uaddr, start, end))
> > > +		return 0;
> > > +	else if (!blockable)
> > > +		return -EAGAIN;
> > > +
> > > +	spin_lock(&vq->mmu_lock);
> > > +	++vq->invalidate_count;
> > > +
> > > +	map = vq->maps[index];
> > > +	if (map)
> > > +		vq->maps[index] = NULL;
> > > +	spin_unlock(&vq->mmu_lock);
> > > +
> > > +	if (map) {
> > > +		vhost_set_map_dirty(vq, map, index);
> > > +		vhost_map_unprefetch(map);
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static void vhost_invalidate_vq_end(struct vhost_virtqueue *vq,
> > > +				    int index,
> > > +				    unsigned long start,
> > > +				    unsigned long end)
> > > +{
> > > +	if (!vhost_map_range_overlap(&vq->uaddrs[index], start, end))
> > > +		return;
> > > +
> > > +	spin_lock(&vq->mmu_lock);
> > > +	--vq->invalidate_count;
> > > +	spin_unlock(&vq->mmu_lock);
> > > +}
> > > +
> > > +static int vhost_invalidate_range_start(struct mmu_notifier *mn,
> > > +					const struct mmu_notifier_range *range)
> > > +{
> > > +	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
> > > +					     mmu_notifier);
> > > +	bool blockable = mmu_notifier_range_blockable(range);
> > > +	int i, j, ret;
> > > +
> > > +	for (i = 0; i < dev->nvqs; i++) {
> > > +		struct vhost_virtqueue *vq = dev->vqs[i];
> > > +
> > > +		for (j = 0; j < VHOST_NUM_ADDRS; j++) {
> > > +			ret = vhost_invalidate_vq_start(vq, j,
> > > +							range->start,
> > > +							range->end, blockable);
> > > +			if (ret)
> > > +				return ret;
> > > +		}
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static void vhost_invalidate_range_end(struct mmu_notifier *mn,
> > > +				       const struct mmu_notifier_range *range)
> > > +{
> > > +	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
> > > +					     mmu_notifier);
> > > +	int i, j;
> > > +
> > > +	for (i = 0; i < dev->nvqs; i++) {
> > > +		struct vhost_virtqueue *vq = dev->vqs[i];
> > > +
> > > +		for (j = 0; j < VHOST_NUM_ADDRS; j++)
> > > +			vhost_invalidate_vq_end(vq, j,
> > > +						range->start,
> > > +						range->end);
> > > +	}
> > > +}
> > > +
> > > +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
> > > +	.invalidate_range_start = vhost_invalidate_range_start,
> > > +	.invalidate_range_end = vhost_invalidate_range_end,
> > > +};
> > > +
> > > +static void vhost_init_maps(struct vhost_dev *dev)
> > > +{
> > > +	struct vhost_virtqueue *vq;
> > > +	int i, j;
> > > +
> > > +	dev->mmu_notifier.ops = &vhost_mmu_notifier_ops;
> > > +
> > > +	for (i = 0; i < dev->nvqs; ++i) {
> > > +		vq = dev->vqs[i];
> > > +		for (j = 0; j < VHOST_NUM_ADDRS; j++)
> > > +			vq->maps[j] = NULL;
> > > +	}
> > > +}
> > > +#endif
> > > +
> > >   static void vhost_vq_reset(struct vhost_dev *dev,
> > >   			   struct vhost_virtqueue *vq)
> > >   {
> > > @@ -326,7 +502,11 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> > >   	vq->busyloop_timeout = 0;
> > >   	vq->umem = NULL;
> > >   	vq->iotlb = NULL;
> > > +	vq->invalidate_count = 0;
> > >   	__vhost_vq_meta_reset(vq);
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	vhost_reset_vq_maps(vq);
> > > +#endif
> > >   }
> > >   static int vhost_worker(void *data)
> > > @@ -471,12 +651,15 @@ void vhost_dev_init(struct vhost_dev *dev,
> > >   	dev->iov_limit = iov_limit;
> > >   	dev->weight = weight;
> > >   	dev->byte_weight = byte_weight;
> > > +	dev->has_notifier = false;
> > >   	init_llist_head(&dev->work_list);
> > >   	init_waitqueue_head(&dev->wait);
> > >   	INIT_LIST_HEAD(&dev->read_list);
> > >   	INIT_LIST_HEAD(&dev->pending_list);
> > >   	spin_lock_init(&dev->iotlb_lock);
> > > -
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	vhost_init_maps(dev);
> > > +#endif
> > >   	for (i = 0; i < dev->nvqs; ++i) {
> > >   		vq = dev->vqs[i];
> > > @@ -485,6 +668,7 @@ void vhost_dev_init(struct vhost_dev *dev,
> > >   		vq->heads = NULL;
> > >   		vq->dev = dev;
> > >   		mutex_init(&vq->mutex);
> > > +		spin_lock_init(&vq->mmu_lock);
> > >   		vhost_vq_reset(dev, vq);
> > >   		if (vq->handle_kick)
> > >   			vhost_poll_init(&vq->poll, vq->handle_kick,
> > > @@ -564,7 +748,19 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
> > >   	if (err)
> > >   		goto err_cgroup;
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	err = mmu_notifier_register(&dev->mmu_notifier, dev->mm);
> > > +	if (err)
> > > +		goto err_mmu_notifier;
> > > +#endif
> > > +	dev->has_notifier = true;
> > > +
> > >   	return 0;
> > > +
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +err_mmu_notifier:
> > > +	vhost_dev_free_iovecs(dev);
> > > +#endif
> > >   err_cgroup:
> > >   	kthread_stop(worker);
> > >   	dev->worker = NULL;
> > > @@ -655,6 +851,107 @@ static void vhost_clear_msg(struct vhost_dev *dev)
> > >   	spin_unlock(&dev->iotlb_lock);
> > >   }
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +static void vhost_setup_uaddr(struct vhost_virtqueue *vq,
> > > +			      int index, unsigned long uaddr,
> > > +			      size_t size, bool write)
> > > +{
> > > +	struct vhost_uaddr *addr = &vq->uaddrs[index];
> > > +
> > > +	addr->uaddr = uaddr;
> > > +	addr->size = size;
> > > +	addr->write = write;
> > > +}
> > > +
> > > +static void vhost_setup_vq_uaddr(struct vhost_virtqueue *vq)
> > > +{
> > > +	vhost_setup_uaddr(vq, VHOST_ADDR_DESC,
> > > +			  (unsigned long)vq->desc,
> > > +			  vhost_get_desc_size(vq, vq->num),
> > > +			  false);
> > > +	vhost_setup_uaddr(vq, VHOST_ADDR_AVAIL,
> > > +			  (unsigned long)vq->avail,
> > > +			  vhost_get_avail_size(vq, vq->num),
> > > +			  false);
> > > +	vhost_setup_uaddr(vq, VHOST_ADDR_USED,
> > > +			  (unsigned long)vq->used,
> > > +			  vhost_get_used_size(vq, vq->num),
> > > +			  true);
> > > +}
> > > +
> > > +static int vhost_map_prefetch(struct vhost_virtqueue *vq,
> > > +			       int index)
> > > +{
> > > +	struct vhost_map *map;
> > > +	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
> > > +	struct page **pages;
> > > +	int npages = DIV_ROUND_UP(uaddr->size, PAGE_SIZE);
> > > +	int npinned;
> > > +	void *vaddr, *v;
> > > +	int err;
> > > +	int i;
> > > +
> > > +	spin_lock(&vq->mmu_lock);
> > > +
> > > +	err = -EFAULT;
> > > +	if (vq->invalidate_count)
> > > +		goto err;
> > > +
> > > +	err = -ENOMEM;
> > > +	map = kmalloc(sizeof(*map), GFP_ATOMIC);
> > > +	if (!map)
> > > +		goto err;
> > > +
> > > +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_ATOMIC);
> > > +	if (!pages)
> > > +		goto err_pages;
> > > +
> > > +	err = EFAULT;
> > > +	npinned = __get_user_pages_fast(uaddr->uaddr, npages,
> > > +					uaddr->write, pages);
> > > +	if (npinned > 0)
> > > +		release_pages(pages, npinned);
> > > +	if (npinned != npages)
> > > +		goto err_gup;
> > > +
> > > +	for (i = 0; i < npinned; i++)
> > > +		if (PageHighMem(pages[i]))
> > > +			goto err_gup;
> > > +
> > > +	vaddr = v = page_address(pages[0]);
> > > +
> > > +	/* For simplicity, fallback to userspace address if VA is not
> > > +	 * contigious.
> > > +	 */
> > > +	for (i = 1; i < npinned; i++) {
> > > +		v += PAGE_SIZE;
> > > +		if (v != page_address(pages[i]))
> > > +			goto err_gup;
> > > +	}
> > > +
> > > +	map->addr = vaddr + (uaddr->uaddr & (PAGE_SIZE - 1));
> > > +	map->npages = npages;
> > > +	map->pages = pages;
> > > +
> > > +	vq->maps[index] = map;
> > > +	/* No need for a synchronize_rcu(). This function should be
> > > +	 * called by dev->worker so we are serialized with all
> > > +	 * readers.
> > > +	 */
> > > +	spin_unlock(&vq->mmu_lock);
> > > +
> > > +	return 0;
> > > +
> > > +err_gup:
> > > +	kfree(pages);
> > > +err_pages:
> > > +	kfree(map);
> > > +err:
> > > +	spin_unlock(&vq->mmu_lock);
> > > +	return err;
> > > +}
> > > +#endif
> > > +
> > >   void vhost_dev_cleanup(struct vhost_dev *dev)
> > >   {
> > >   	int i;
> > > @@ -684,8 +981,20 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
> > >   		kthread_stop(dev->worker);
> > >   		dev->worker = NULL;
> > >   	}
> > > -	if (dev->mm)
> > > +	if (dev->mm) {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +		if (dev->has_notifier) {
> > > +			mmu_notifier_unregister(&dev->mmu_notifier,
> > > +						dev->mm);
> > > +			dev->has_notifier = false;
> > > +		}
> > > +#endif
> > >   		mmput(dev->mm);
> > > +	}
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	for (i = 0; i < dev->nvqs; i++)
> > > +		vhost_uninit_vq_maps(dev->vqs[i]);
> > > +#endif
> > >   	dev->mm = NULL;
> > >   }
> > >   EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
> > > @@ -914,6 +1223,26 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
> > >   static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
> > >   {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_used *used;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_USED];
> > > +		if (likely(map)) {
> > > +			used = map->addr;
> > > +			*((__virtio16 *)&used->ring[vq->num]) =
> > > +				cpu_to_vhost16(vq, vq->avail_idx);
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >   	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
> > >   			      vhost_avail_event(vq));
> > >   }
> > > @@ -922,6 +1251,27 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
> > >   				 struct vring_used_elem *head, int idx,
> > >   				 int count)
> > >   {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_used *used;
> > > +	size_t size;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_USED];
> > > +		if (likely(map)) {
> > > +			used = map->addr;
> > > +			size = count * sizeof(*head);
> > > +			memcpy(used->ring + idx, head, size);
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >   	return vhost_copy_to_user(vq, vq->used->ring + idx, head,
> > >   				  count * sizeof(*head));
> > >   }
> > > @@ -929,6 +1279,25 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
> > >   static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
> > >   {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_used *used;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_USED];
> > > +		if (likely(map)) {
> > > +			used = map->addr;
> > > +			used->flags = cpu_to_vhost16(vq, vq->used_flags);
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >   	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
> > >   			      &vq->used->flags);
> > >   }
> > > @@ -936,6 +1305,25 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
> > >   static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
> > >   {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_used *used;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_USED];
> > > +		if (likely(map)) {
> > > +			used = map->addr;
> > > +			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >   	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
> > >   			      &vq->used->idx);
> > >   }
> > > @@ -981,12 +1369,50 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
> > >   static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> > >   				      __virtio16 *idx)
> > >   {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_avail *avail;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_AVAIL];
> > > +		if (likely(map)) {
> > > +			avail = map->addr;
> > > +			*idx = avail->idx;
> > index can now be speculated.
> 
> [...]
> 
> 
> > +		vhost_vq_access_map_begin(vq);
> > +
> > +		map = vq->maps[VHOST_ADDR_AVAIL];
> > +		if (likely(map)) {
> > +			avail = map->addr;
> > +			*head = avail->ring[idx & (vq->num - 1)];
> > 
> > Since idx can be speculated, I guess we need array_index_nospec here?
> 
> 
> So we have
> 
> ACQUIRE(mmu_lock)
> 
> get idx
> 
> RELEASE(mmu_lock)
> 
> ACQUIRE(mmu_lock)
> 
> read array[idx]
> 
> RELEASE(mmu_lock)
> 
> Then I think idx can't be speculated consider we've passed RELEASE +
> ACQUIRE?

I don't think memory barriers have anything to do with speculation,
they are architectural.

> 
> > 
> > 
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >   	return vhost_get_avail(vq, *head,
> > >   			       &vq->avail->ring[idx & (vq->num - 1)]);
> > >   }
> > > @@ -994,24 +1420,98 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> > >   static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> > >   					__virtio16 *flags)
> > >   {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_avail *avail;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_AVAIL];
> > > +		if (likely(map)) {
> > > +			avail = map->addr;
> > > +			*flags = avail->flags;
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >   	return vhost_get_avail(vq, *flags, &vq->avail->flags);
> > >   }
> > >   static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
> > >   				       __virtio16 *event)
> > >   {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_avail *avail;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +		map = vq->maps[VHOST_ADDR_AVAIL];
> > > +		if (likely(map)) {
> > > +			avail = map->addr;
> > > +			*event = (__virtio16)avail->ring[vq->num];
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >   	return vhost_get_avail(vq, *event, vhost_used_event(vq));
> > >   }
> > >   static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
> > >   				     __virtio16 *idx)
> > >   {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_used *used;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_USED];
> > > +		if (likely(map)) {
> > > +			used = map->addr;
> > > +			*idx = used->idx;
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >   	return vhost_get_used(vq, *idx, &vq->used->idx);
> > >   }
> > 
> > This seems to be used during init. Why do we bother
> > accelerating this?
> 
> 
> Ok, I can remove this part in next version.
> 
> 
> > 
> > 
> > >   static inline int vhost_get_desc(struct vhost_virtqueue *vq,
> > >   				 struct vring_desc *desc, int idx)
> > >   {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_desc *d;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_DESC];
> > > +		if (likely(map)) {
> > > +			d = map->addr;
> > > +			*desc = *(d + idx);
> > 
> > Since idx can be speculated, I guess we need array_index_nospec here?
> 
> 
> This is similar to the above avail idx case.
> 
> 
> > 
> > 
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >   	return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
> > >   }
> > I also wonder about the userspace address we get eventualy.
> > It would seem that we need to prevent that from speculating -
> > and that seems like a good idea even if this patch isn't
> > applied. As you are playing with micro-benchmarks, maybe
> > you could the below patch?
> 
> 
> Let me test it.
> 
> Thanks
> 
> 
> > It's unfortunately untested.
> > Thanks a lot in advance!
> > 
> > ===>
> > vhost: block speculation of translated descriptors
> > 
> > iovec addresses coming from vhost are assumed to be
> > pre-validated, but in fact can be speculated to a value
> > out of range.
> > 
> > Userspace address are later validated with array_index_nospec so we can
> > be sure kernel info does not leak through these addresses, but vhost
> > must also not leak userspace info outside the allowed memory table to
> > guests.
> > 
> > Following the defence in depth principle, make sure
> > the address is not validated out of node range.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > ---
> > 
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 5dc174ac8cac..863e25011ef6 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -2072,7 +2076,9 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
> >   		size = node->size - addr + node->start;
> >   		_iov->iov_len = min((u64)len - s, size);
> >   		_iov->iov_base = (void __user *)(unsigned long)
> > -			(node->userspace_addr + addr - node->start);
> > +			(node->userspace_addr +
> > +			 array_index_nospec(addr - node->start,
> > +					    node->size));
> >   		s += size;
> >   		addr += size;
> >   		++ret;

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox