* Re: [PATCH v2 2/2] af_key: Use DIV_ROUND_UP() instead of open-coded equivalent
From: Kevin Easton @ 2018-04-10 11:38 UTC (permalink / raw)
To: Steffen Klassert; +Cc: Herbert Xu, David S. Miller, netdev, linux-kernel
In-Reply-To: <20180409103442.oiib3bi7rms5sgg3@gauss3.secunet.de>
On Mon, Apr 09, 2018 at 12:34:42PM +0200, Steffen Klassert wrote:
> On Sat, Apr 07, 2018 at 11:40:47AM -0400, Kevin Easton wrote:
> > Several places use (x + 7) / 8 to convert from a number of bits to a number
> > of bytes. Replace those with DIV_ROUND_UP(x, 8) instead, for consistency
> > with other parts of the same file.
> >
> > Signed-off-by: Kevin Easton <kevin@guarana.org>
>
> Please resubmit this one to ipsec-next after the
> merge window. Thanks!
Will do!
- Kevin
^ permalink raw reply
* Re: [PATCH] ieee802154: mcr20a: Fix memory leak in mcr20a_probe
From: Stefan Schmidt @ 2018-04-10 10:58 UTC (permalink / raw)
To: Gustavo A. R. Silva, Xue Liu, Alexander Aring
Cc: linux-wpan, netdev, linux-kernel
In-Reply-To: <20180405162006.GA2199@embeddedor.com>
Hello.
On 04/05/2018 06:20 PM, Gustavo A. R. Silva wrote:
> Free allocated memory for pdata before return.
>
> Addresses-Coverity-ID: 1466096 ("Resource leak")
> Fixes: 8c6ad9cc5157 ("ieee802154: Add NXP MCR20A IEEE 802.15.4 transceiver driver")
> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
> ---
> drivers/net/ieee802154/mcr20a.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ieee802154/mcr20a.c b/drivers/net/ieee802154/mcr20a.c
> index 55a22c7..944470d 100644
> --- a/drivers/net/ieee802154/mcr20a.c
> +++ b/drivers/net/ieee802154/mcr20a.c
> @@ -1267,7 +1267,7 @@ mcr20a_probe(struct spi_device *spi)
> ret = mcr20a_get_platform_data(spi, pdata);
> if (ret < 0) {
> dev_crit(&spi->dev, "mcr20a_get_platform_data failed.\n");
> - return ret;
> + goto free_pdata;
> }
>
> /* init reset gpio */
> @@ -1275,7 +1275,7 @@ mcr20a_probe(struct spi_device *spi)
> ret = devm_gpio_request_one(&spi->dev, pdata->rst_gpio,
> GPIOF_OUT_INIT_HIGH, "reset");
> if (ret)
> - return ret;
> + goto free_pdata;
> }
>
> /* reset mcr20a */
> @@ -1291,7 +1291,8 @@ mcr20a_probe(struct spi_device *spi)
> hw = ieee802154_alloc_hw(sizeof(*lp), &mcr20a_hw_ops);
> if (!hw) {
> dev_crit(&spi->dev, "ieee802154_alloc_hw failed\n");
> - return -ENOMEM;
> + ret = -ENOMEM;
> + goto free_pdata;
> }
>
> /* init mcr20a local data */
> @@ -1366,6 +1367,8 @@ mcr20a_probe(struct spi_device *spi)
>
> free_dev:
> ieee802154_free_hw(lp->hw);
> +free_pdata:
> + kfree(pdata);
>
> return ret;
> }
This patch has been applied to the wpan tree and will be
part of the next pull request to net. Thanks!
regards
Stefan Schmidt
^ permalink raw reply
* [PATCH net] ip_gre: clear feature flags when incompatible o_flags are set
From: Sabrina Dubroca @ 2018-04-10 10:57 UTC (permalink / raw)
To: netdev; +Cc: Sabrina Dubroca, Lorenzo Bianconi, Xin Long, William Tu
Commit dd9d598c6657 ("ip_gre: add the support for i/o_flags update via
netlink") added the ability to change o_flags, but missed that the
GSO/LLTX features are disabled by default, and only enabled some gre
features are unused. Thus we also need to disable the GSO/LLTX features
on the device when the TUNNEL_SEQ or TUNNEL_CSUM flags are set.
These two examples should result in the same features being set:
ip link add gre_none type gre local 192.168.0.10 remote 192.168.0.20 ttl 255 key 0
ip link set gre_none type gre seq
ip link add gre_seq type gre local 192.168.0.10 remote 192.168.0.20 ttl 255 key 1 seq
Fixes: dd9d598c6657 ("ip_gre: add the support for i/o_flags update via netlink")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
net/ipv4/ip_gre.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index a8772a978224..9c169bb2444d 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -781,8 +781,14 @@ static void ipgre_link_update(struct net_device *dev, bool set_mtu)
tunnel->encap.type == TUNNEL_ENCAP_NONE) {
dev->features |= NETIF_F_GSO_SOFTWARE;
dev->hw_features |= NETIF_F_GSO_SOFTWARE;
+ } else {
+ dev->features &= ~NETIF_F_GSO_SOFTWARE;
+ dev->hw_features &= ~NETIF_F_GSO_SOFTWARE;
}
dev->features |= NETIF_F_LLTX;
+ } else {
+ dev->hw_features &= ~NETIF_F_GSO_SOFTWARE;
+ dev->features &= ~(NETIF_F_LLTX | NETIF_F_GSO_SOFTWARE);
}
}
--
2.16.2
^ permalink raw reply related
* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Jiri Pirko @ 2018-04-10 10:55 UTC (permalink / raw)
To: Samudrala, Sridhar
Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
loseweigh
In-Reply-To: <16b2e531-7bfa-7f25-2702-f3f8069663ee@intel.com>
Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
>On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>> Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>> > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> > > Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>> [...]
>>
>> > > > +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>> > > > + struct net_device *child_netdev)
>> > > > +{
>> > > > + struct virtnet_bypass_info *vbi;
>> > > > + bool backup;
>> > > > +
>> > > > + vbi = netdev_priv(bypass_netdev);
>> > > > + backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>> > > > + if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > > > + rtnl_dereference(vbi->active_netdev)) {
>> > > > + netdev_info(bypass_netdev,
>> > > > + "%s attempting to join bypass dev when %s already present\n",
>> > > > + child_netdev->name, backup ? "backup" : "active");
>> > > Bypass module should check if there is already some other netdev
>> > > enslaved and refuse right there.
>> > This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>> > as its bypass_netdev is same as the backup_netdev.
>> > Will add a flag while registering with the bypass module to indicate if the driver is doing
>> > a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>> > for 3 netdev scenario.
>> Just let me undestand it clearly. What I expect the difference would be
>> between 2netdev and3 netdev model is this:
>> 2netdev:
>> bypass_master
>> /
>> /
>> VF_slave
>>
>> 3netdev:
>> bypass_master
>> / \
>> / \
>> VF_slave backup_slave
>>
>> Is that correct? If not, how does it look like?
>>
>>
>Looks correct.
>VF_slave and backup_slave are the original netdevs and are present in both the models.
>In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
>marked as the 2 slaves of this new netdev.
You say it looks correct and in another sentence you provide completely
different description. Could you please look again?
[...]
^ permalink raw reply
* Re: [PATCH] ieee802154: mcr20a: Fix memory leak in mcr20a_probe
From: Stefan Schmidt @ 2018-04-10 10:26 UTC (permalink / raw)
To: Xue Liu, Gustavo A. R. Silva
Cc: Alexander Aring, linux-wpan - ML, netdev, linux-kernel
In-Reply-To: <CAJuUDwtUUv_nMEwijRydWmseL1aYOAoAfEdLVnoHWPWKdVgS-A@mail.gmail.com>
Hello.
On 04/10/2018 11:54 AM, Xue Liu wrote:
> Hallo,
>
> Thanks for the fix. It looks good.
>
> ACK-by: Xue Liu <liuxuenetmail@gmail.com>
Thanks for the ACK Xue. The correct format would be
Acked-by: Xue Liu <liuxuenetmail@gmail.com>
That way patchwork also picks the ACK up and adds it to the patch before I apply it from there.
No worries, I did this manually for you this time. For the next time please use the correct format to have this tracked.
regards
Stefan Schmidt
^ permalink raw reply
* Re: [RFC bpf-next] bpf: document eBPF helpers and add a script to generate man page
From: Quentin Monnet @ 2018-04-10 10:21 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Daniel Borkmann, ast, netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180410014751.mqwqdyujvybir6g5@ast-mbp.dhcp.thefacebook.com>
2018-04-09 18:47 UTC-0700 ~ Alexei Starovoitov
<alexei.starovoitov@gmail.com>
> On Mon, Apr 09, 2018 at 02:25:26PM +0100, Quentin Monnet wrote:
>>
>> Anyway, I am fine with keeping just signatures, descriptions and return
>> values for now. I will submit a new version with only those items.
>
> Thank you.
>
> Could you also split it into few patches?
> include/uapi/linux/bpf.h | 2237 ++++++++++++++++++++++++++++++++++++--------
> scripts/bpf_helpers_doc.py | 568 +++++++++++
> 2 files changed, 2429 insertions(+), 376 deletions(-)
>
> replying back and forth on a single patch of such size will be tedious
> for others to follow.
> May be document ~10 helpers at a time ? Total of ~7 patches and extra
> patch for .py ?
>
Sure, I'll do that. And I'll try to group helpers in a patch by author,
it should also help for reviewing the descriptions.
Quentin
^ permalink raw reply
* Re: XDP_TX for virtio_net not working in recent kernel?
From: Daniel Borkmann @ 2018-04-10 10:17 UTC (permalink / raw)
To: Kimitoshi Takahashi; +Cc: xdp-newbies, jasowang, netdev
In-Reply-To: <47428621-2f24-3c8e-5d34-1d788a86a455@nii.ac.jp>
[ +jason +netdev ]
On 04/10/2018 12:49 AM, Kimitoshi Takahashi wrote:
> Hello,
>
> I'm totally newbie.
>
> I want to try some examples under the samples/bpf of the kernel tree using the virtio_net driver in kvm, in order to get familiar with the XDP. However, it seems to me that for kernels newer than linux-4.15, the xdp2 and xdp_tx_tunnel are not working in the case of virtio_net.
>
> Can somebody help me out to make them working?
>
>
> 1) For linux-4.11.12, linux-4.12.14, and linux-4.14.31, xdp2 and xdp_tx_tunnel are working.
>
> [xdp2]
> 10.0.0.27# ./xdp2 2 (or ./xdp -N 2)
>
> 10.0.0.254# hping3 10.0.0.27 -p 5000 -2
>
> 10.0.0.254# tcpdump -nei kbr0 udp port 5000
> 06:18:37.677070 2e:c9:83:fb:5b:f3 > 52:54:00:11:00:1b, ethertype IPv4 (0x0800), length 42: 10.0.0.254.2982 > 10.0.0.27.5000: UDP, length 0
> 06:18:37.677265 52:54:00:11:00:1b > 2e:c9:83:fb:5b:f3, ethertype IPv4 (0x0800), length 42: 10.0.0.254.2982 > 10.0.0.27.5000: UDP, length 0
>
> I can see the UDP packets with MAC addresses swapped.
>
> [xdp_tx_tunnel]
>
> 10.0.0.254# telnet 10.1.2.1 80
>
> 10.0.0.27# ./xdp_tx_iptunnel -i 2 -a 10.1.2.1 -p 80 -s 10.0.0.27 -d 10.0.0.24 -m 52:54:00:11:00:18
>
> 10.0.0.24# tcpdump -nei any proto 4
> 22:08:20.510354 In 52:54:00:11:00:1b ethertype IPv4 (0x0800), length 96: 10.0.0.27 > 10.0.0.24: 10.0.0.254.60826 > 10.1.2.1.80: Flags [S], seq 3224377720, win 29200, options [mss 1460,sackOK,TS val 217801953 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
>
> I can see the encapsulated SYN packet that is forwarded by the 10.0.0.27.
>
> 2) For linux-4.15, linux-4.15.5, linux-4.15.13 and linux-4.16.1, xdp2 and xdp_tx_tunnel are NOT working.
>
> [xdp2]
> 10.0.0.27# ./xdp2 -N 2
>
> 10.0.0.254# hping3 10.0.0.27 -p 5000 -2
>
> 10.0.0.254# tcpdump -nei br0 udp port 5000
> 06:45:34.810046 2e:c9:83:fb:5b:f3 > 52:54:00:11:00:1b, ethertype IPv4 (0x0800), length 42: 10.0.0.254.2346 > 10.0.0.27.5000: UDP, length 0
> 06:45:35.810176 2e:c9:83:fb:5b:f3 > 52:54:00:11:00:1b, ethertype IPv4 (0x0800), length 42: 10.0.0.254.2347 > 10.0.0.27.5000: UDP, length 0
>
> Only the outgoing packets can be seen.
>
> [xdp_tx_tunnel]
>
> 10.0.0.254# telnet 10.1.2.1 80
>
> 10.0.0.27# ./xdp_tx_iptunnel -i 2 -a 10.1.2.1 -p 80 -s 10.0.0.27 -d 10.0.0.24 -m 52:54:00:11:00:18
>
> 10.0.0.24# tcpdump -nei any proto 4
>
> I can NOT see the encapsulated SYN packet that is forwarded by the 10.0.0.27.
>
>
> --
> Sincerely,
> Takahashi Kimitoshi
^ permalink raw reply
* Re: [iovisor-dev] Best userspace programming API for XDP features query to kernel?
From: Daniel Borkmann @ 2018-04-10 10:13 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Daniel Borkmann
Cc: Eric Leblond, Victor Julien, Peter Manev, oisf-devel,
Jakub Kicinski, Alexei Starovoitov, iovisor-dev@lists.iovisor.org,
Saeed Mahameed, netdev, davem, Jiri Benc
In-Reply-To: <20180409132635.7d5e1bda@redhat.com>
On 04/09/2018 01:26 PM, Jesper Dangaard Brouer wrote:
> On Fri, 6 Apr 2018 12:36:18 +0200
> Daniel Borkmann <daniel@iogearbox.net> wrote:
[...]
>> [...]
>> The underlying problem is effectively the decoupling of program
>> verification that doesn't have/know the context of where it is being
>> attached to in this case.
>
> Yes, exactly, that the underlying problem. (plus the first XDP prog
> gets verified and driver attached, and subsequent added bpf tail calls,
> cannot be "rejected" (at "driver attachment time") as its too late).
>
>> Thinking out loud for a sec on couple of other options aside
>> from feature bits, what about i) providing the target ifindex to the
>> verifier for XDP programs, such that at verification time you have the
>> full context similar to nfp offload case today,
>
> I do like that we could provide the target ifindex earlier. But
> userspace still need some way to express that it e.g. need the
> XDP_REDIRECT features (as verifier cannot reliability detect the action
> return codes used, as discussed before, due to bpf tail calls and maps
> used as return values).
(See below on the detection of it.)
>> or ii) populating some
>> XDP specific auxillary data to the BPF program at verification time such
>> that the driver can check at program attach time whether the requested
>> features are possible and if not it will reject and respond with netlink
>> extack message to the user (as we do in various XDP attach cases already
>> through XDP_SETUP_PROG{,_HW}).
>
> I like proposal ii) better. But how do I specify/input that I need
> e.g. the XDP_REDIRECT feature, such that is it avail to "the BPF
> program at verification time"?
>
> My proposal iii), what about at XDP attachment time, create a new
> netlink attribute that describe XDP action codes the program
> needs/wants. If the info is provided, the ndo_bpf call check and
> reject, and respond with netlink extack message.
> If I want to query for avail action codes, then I can use the same
> netlink attribute format, and kernel will return it "populated" with
> the info.
>
> It is very useful that the program gets rejected at attachment time (to
> avoid loading an XDP prog that silently drops packets). BUT I also
> want a query option/functionality (reuse netlink attr format).
>
> Specifically for Suricata the same "bypass" features can be implemented
> at different levels (XDP, XDP-generic or clsbpf), and the XDP program
> contains multiple features. Thus, depending on what NIC driver
> supports, we want to load different XDP and/or clsbpf/TC BPF-programs.
> Thus, still support the same user requested feature/functionality, even
> if XDP_REDIRECT is not avail, just slightly slower.
Makes sense to have such fallbacks.
>> This would, for example, avoid the need for feature bits, and do actual
>> rejection of the program while retaining flexibility (and avoiding to
>> expose bits that over time hopefully will deprecate anyway due to all
>> XDP aware drivers implementing them). For both cases i) and ii), it
>> would mean we make the verifier a bit smarter with regards to keeping
>> track of driver related (XDP) requirements. Program return code checking
>> is already present in the verifier's check_return_code() and we could
>> extend it for XDP as well, for example. Seems cleaner and more extensible
>> than feature bits, imho.
>
> I thought the verifier's check_return_code() couldn't see/verify if the
> XDP return action code is provided as a map lookup result(?). How is
> that handled?
For the latter, I would just add a BPF_F_STRICT_CONST_VERDICT load flag
which e.g. enforces a constant return code in all prog types. It also needs
to check for helpers like bpf_xdp_redirect() and track R0 from there that
it either contains XDP_REDIRECT or XDP_ABORTED.
For the bpf_prog_aux, we need a prog dependent void *private_data area
pointer in bpf_prog_aux that verifier populates; e.g. we could migrate
already some of the prog type specific fields into that like kprobe_override,
cb_access, dst_needed that are non-generic anyway. For XDP, verifier would
in your case record all seen return codes into private_data. When the flag
BPF_F_STRICT_CONST_VERDICT is not used and we noticed there were cases
where the verdict was not a verifiable constant, then we e.g. mark all
XDP return codes as seen. Potentially the same is needed for tail calls.
We can add a ndo_bpf query to drivers that return their supported XDP return
codes and compare them in e.g. dev_change_xdp_fd() out of generic code and
reject with extack.
For tail calls, the only way that comes to mind right now where you could
lift that requirement with having to mark all return codes as seen is that
you'd need to pass the ifindex as in offload case at load time, such that
you the program becomes tied to the device. Then you also need to record
dev pointer in the private_data such that you can add a new callback to
bpf_prog_ops for prog dependent compare in bpf_prog_array_compatible() to
make sure both would have the same target owner dev where the return code
check was previously asserted.
If you want to expose that internal ndo_bpf query which specifically returns
the set of supported XDP return codes I wouldn't mind, that way it's not
some sort of generic feature bit query, but a specific query for the set of
return codes the driver supports, thus keeping this very limited and avoiding
mixing this with other future feature bits that could turn this into a big
mess; I'm not sure right now though what would be the best uapi to query
that info from.
Cheers,
Daniel
^ permalink raw reply
* Re: [PATCH] ieee802154: mcr20a: Fix memory leak in mcr20a_probe
From: Xue Liu @ 2018-04-10 9:54 UTC (permalink / raw)
To: Gustavo A. R. Silva
Cc: Alexander Aring, Stefan Schmidt, linux-wpan - ML, netdev,
linux-kernel
In-Reply-To: <20180405162006.GA2199@embeddedor.com>
Hallo,
Thanks for the fix. It looks good.
ACK-by: Xue Liu <liuxuenetmail@gmail.com>
On 5 April 2018 at 18:20, Gustavo A. R. Silva <gustavo@embeddedor.com> wrote:
> Free allocated memory for pdata before return.
>
> Addresses-Coverity-ID: 1466096 ("Resource leak")
> Fixes: 8c6ad9cc5157 ("ieee802154: Add NXP MCR20A IEEE 802.15.4 transceiver driver")
> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
> ---
> drivers/net/ieee802154/mcr20a.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ieee802154/mcr20a.c b/drivers/net/ieee802154/mcr20a.c
> index 55a22c7..944470d 100644
> --- a/drivers/net/ieee802154/mcr20a.c
> +++ b/drivers/net/ieee802154/mcr20a.c
> @@ -1267,7 +1267,7 @@ mcr20a_probe(struct spi_device *spi)
> ret = mcr20a_get_platform_data(spi, pdata);
> if (ret < 0) {
> dev_crit(&spi->dev, "mcr20a_get_platform_data failed.\n");
> - return ret;
> + goto free_pdata;
> }
>
> /* init reset gpio */
> @@ -1275,7 +1275,7 @@ mcr20a_probe(struct spi_device *spi)
> ret = devm_gpio_request_one(&spi->dev, pdata->rst_gpio,
> GPIOF_OUT_INIT_HIGH, "reset");
> if (ret)
> - return ret;
> + goto free_pdata;
> }
>
> /* reset mcr20a */
> @@ -1291,7 +1291,8 @@ mcr20a_probe(struct spi_device *spi)
> hw = ieee802154_alloc_hw(sizeof(*lp), &mcr20a_hw_ops);
> if (!hw) {
> dev_crit(&spi->dev, "ieee802154_alloc_hw failed\n");
> - return -ENOMEM;
> + ret = -ENOMEM;
> + goto free_pdata;
> }
>
> /* init mcr20a local data */
> @@ -1366,6 +1367,8 @@ mcr20a_probe(struct spi_device *spi)
>
> free_dev:
> ieee802154_free_hw(lp->hw);
> +free_pdata:
> + kfree(pdata);
>
> return ret;
> }
> --
> 2.7.4
>
^ permalink raw reply
* Re: [PATCH] slip: Check if rstate is initialized before uncompressing
From: Guillaume Nault @ 2018-04-10 9:48 UTC (permalink / raw)
To: tejaswit; +Cc: David Miller, netdev
In-Reply-To: <ed7bd53914f26c2225c6c00d16bffb35@codeaurora.org>
On Tue, Apr 10, 2018 at 11:28:10AM +0530, tejaswit@codeaurora.org wrote:
> On 2018-04-09 20:34, David Miller wrote:
> > From: Tejaswi Tanikella <tejaswit@codeaurora.org>
> > Date: Mon, 9 Apr 2018 14:23:49 +0530
> >
> > > @@ -673,6 +677,7 @@ struct slcompress *
> > > if (cs->cs_tcp.doff > 5)
> > > memcpy(cs->cs_tcpopt, icp + ihl*4 + sizeof(struct tcphdr),
> > > (cs->cs_tcp.doff - 5) * 4);
> > > cs->cs_hsize = ihl*2 + cs->cs_tcp.doff*2;
> > > + cs->initialized = 1;
> > > /* Put headers back on packet
> > ...
> > > struct cstate {
> > > byte_t cs_this; /* connection id number (xmit) */
> > > + byte_t initialized; /* non-zero if initialized */
> >
> > Please use 'bool' and true/false for 'initialized'.
>
> Made the changes.
Hi Tejaswi,
Please send the new version of your patch as fresh new submission, with
proper subject prefix. In this case, it should be [PATCH net v2]. 'net'
because this is a bugfix and it should therefore target the 'net' tree.
'v2' because that's the second version of this series.
Also it'd be good if you could add a proper 'Fixes' tag in order to
help with stable backports. If the bug has always been there, just say
so.
I have no expertise on slhc, but overall, the patch content looks fine.
Guillaume
^ permalink raw reply
* RE: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Liang, Cunming @ 2018-04-10 9:23 UTC (permalink / raw)
To: Paolo Bonzini, Bie, Tiwei, Jason Wang
Cc: mst@redhat.com, alex.williamson@redhat.com, ddutile@redhat.com,
Duyck, Alexander H, virtio-dev@lists.oasis-open.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
virtualization@lists.linux-foundation.org, netdev@vger.kernel.org,
Daly, Dan, Wang, Zhihong, Tan, Jianfeng, Wang, Xiao W
In-Reply-To: <7ee31a12-a370-fc43-82a6-2235f598e970@redhat.com>
> -----Original Message-----
> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
> Sent: Tuesday, April 10, 2018 3:52 PM
> To: Bie, Tiwei <tiwei.bie@intel.com>; Jason Wang <jasowang@redhat.com>
> Cc: mst@redhat.com; alex.williamson@redhat.com; ddutile@redhat.com;
> Duyck, Alexander H <alexander.h.duyck@intel.com>; virtio-dev@lists.oasis-
> open.org; linux-kernel@vger.kernel.org; kvm@vger.kernel.org;
> virtualization@lists.linux-foundation.org; netdev@vger.kernel.org; Daly, Dan
> <dan.daly@intel.com>; Liang, Cunming <cunming.liang@intel.com>; Wang,
> Zhihong <zhihong.wang@intel.com>; Tan, Jianfeng <jianfeng.tan@intel.com>;
> Wang, Xiao W <xiao.w.wang@intel.com>
> Subject: Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware
> vhost backend
>
> On 10/04/2018 06:57, Tiwei Bie wrote:
> >> So you just move the abstraction layer from qemu to kernel, and you
> >> still need different drivers in kernel for different device
> >> interfaces of accelerators. This looks even more complex than leaving
> >> it in qemu. As you said, another idea is to implement userspace vhost
> >> backend for accelerators which seems easier and could co-work with
> >> other parts of qemu without inventing new type of messages.
> >
> > I'm not quite sure. Do you think it's acceptable to add various vendor
> > specific hardware drivers in QEMU?
>
> I think so. We have vendor-specific quirks, and at some point there was an
> idea of using quirks to implement (vendor-specific) live migration support for
> assigned devices.
Vendor-specific quirks of accessing VGA is a small portion. Other major portions are still handled by guest driver.
While in this case, when saying various vendor specific drivers in QEMU, it says QEMU takes over and provides the entire user space device drivers. Some parts are even not relevant to vhost, they're basic device function enabling. Moreover, it could be different kinds of devices(network/block/...) under vhost. No matter # of vendors or # of types, total LOC is not small.
The idea is to avoid introducing these extra complexity out of QEMU, keeping vhost adapter simple. As vhost protocol is de factor standard, it leverages kernel device driver to provide the diversity. Changing once in QEMU, then it supports multi-vendor devices whose drivers naturally providing kernel driver there.
If QEMU is going to build a user space driver framework there, we're open mind on that, even leveraging DPDK as the underlay library. Looking forward to more others' comments from community.
Steve
>
> Paolo
^ permalink raw reply
* [PATCH linux] net: fix deadlock while clearing neighbor proxy table
From: Wolfgang Bumiller @ 2018-04-10 9:15 UTC (permalink / raw)
To: netdev; +Cc: David S. Miller, Hideaki YOSHIFUJI
When coming from ndisc_netdev_event() in net/ipv6/ndisc.c,
neigh_ifdown() is called with &nd_tbl, locking this while
clearing the proxy neighbor entries when eg. deleting an
interface. Calling the table's pndisc_destructor() with the
lock still held, however, can cause a deadlock: When a
multicast listener is available an IGMP packet of type
ICMPV6_MGM_REDUCTION may be sent out. When reaching
ip6_finish_output2(), if no neighbor entry for the target
address is found, __neigh_create() is called with &nd_tbl,
which it'll want to lock.
Move the elements into their own list, then unlock the table
and perform the destruction.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199289
Fixes: 6fd6ce2056de ("ipv6: Do not depend on rt->n in ip6_finish_output2().")
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
---
I do feel bad about moving the unlock call. Perhaps returning the
freelist and dealing with it in neigh_ifdown() would be better?
net/core/neighbour.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 7b7a14abba28..601df647588c 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -292,7 +292,6 @@ int neigh_ifdown(struct neigh_table *tbl, struct net_device *dev)
write_lock_bh(&tbl->lock);
neigh_flush_dev(tbl, dev);
pneigh_ifdown(tbl, dev);
- write_unlock_bh(&tbl->lock);
del_timer_sync(&tbl->proxy_timer);
pneigh_queue_purge(&tbl->proxy_queue);
@@ -683,7 +682,7 @@ int pneigh_delete(struct neigh_table *tbl, struct net *net, const void *pkey,
static int pneigh_ifdown(struct neigh_table *tbl, struct net_device *dev)
{
- struct pneigh_entry *n, **np;
+ struct pneigh_entry *n, **np, *freelist = NULL;
u32 h;
for (h = 0; h <= PNEIGH_HASHMASK; h++) {
@@ -691,16 +690,23 @@ static int pneigh_ifdown(struct neigh_table *tbl, struct net_device *dev)
while ((n = *np) != NULL) {
if (!dev || n->dev == dev) {
*np = n->next;
- if (tbl->pdestructor)
- tbl->pdestructor(n);
- if (n->dev)
- dev_put(n->dev);
- kfree(n);
+ n->next = freelist;
+ freelist = n;
continue;
}
np = &n->next;
}
}
+ write_unlock_bh(&tbl->lock);
+ while ((n = freelist)) {
+ freelist = n->next;
+ n->next = NULL;
+ if (tbl->pdestructor)
+ tbl->pdestructor(n);
+ if (n->dev)
+ dev_put(n->dev);
+ kfree(n);
+ }
return -ENOENT;
}
--
2.11.0
^ permalink raw reply related
* Patch "x86/asm: Don't use RBP as a temporary register in csum_partial_copy_generic()" has been added to the 4.4-stable tree
From: gregkh @ 2018-04-10 9:07 UTC (permalink / raw)
To: jpoimboe, alexander.levin, andreyknvl, davem, dvyukov, edumazet,
gregkh, kcc, marcelo.leitner, mingo, netdev, nhorman, peterz,
syzkaller, tglx, torvalds, vyasevich, xiyou.wangcong
Cc: stable, stable-commits
This is a note to let you know that I've just added the patch titled
x86/asm: Don't use RBP as a temporary register in csum_partial_copy_generic()
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary
The filename of the patch is:
x86-asm-don-t-use-rbp-as-a-temporary-register-in-csum_partial_copy_generic.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@vger.kernel.org> know about it.
>From foo@baz Tue Apr 10 10:31:53 CEST 2018
From: Josh Poimboeuf <jpoimboe@redhat.com>
Date: Thu, 4 May 2017 09:51:40 -0500
Subject: x86/asm: Don't use RBP as a temporary register in csum_partial_copy_generic()
From: Josh Poimboeuf <jpoimboe@redhat.com>
[ Upstream commit 42fc6c6cb1662ba2fa727dd01c9473c63be4e3b6 ]
Andrey Konovalov reported the following warning while fuzzing the kernel
with syzkaller:
WARNING: kernel stack regs at ffff8800686869f8 in a.out:4933 has bad 'bp' value c3fc855a10167ec0
The unwinder dump revealed that RBP had a bad value when an interrupt
occurred in csum_partial_copy_generic().
That function saves RBP on the stack and then overwrites it, using it as
a scratch register. That's problematic because it breaks stack traces
if an interrupt occurs in the middle of the function.
Replace the usage of RBP with another callee-saved register (R15) so
stack traces are no longer affected.
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: linux-sctp@vger.kernel.org
Cc: netdev <netdev@vger.kernel.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Link: http://lkml.kernel.org/r/4b03a961efda5ec9bfe46b7b9c9ad72d1efad343.1493909486.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
arch/x86/lib/csum-copy_64.S | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
--- a/arch/x86/lib/csum-copy_64.S
+++ b/arch/x86/lib/csum-copy_64.S
@@ -55,7 +55,7 @@ ENTRY(csum_partial_copy_generic)
movq %r12, 3*8(%rsp)
movq %r14, 4*8(%rsp)
movq %r13, 5*8(%rsp)
- movq %rbp, 6*8(%rsp)
+ movq %r15, 6*8(%rsp)
movq %r8, (%rsp)
movq %r9, 1*8(%rsp)
@@ -74,7 +74,7 @@ ENTRY(csum_partial_copy_generic)
/* main loop. clear in 64 byte blocks */
/* r9: zero, r8: temp2, rbx: temp1, rax: sum, rcx: saved length */
/* r11: temp3, rdx: temp4, r12 loopcnt */
- /* r10: temp5, rbp: temp6, r14 temp7, r13 temp8 */
+ /* r10: temp5, r15: temp6, r14 temp7, r13 temp8 */
.p2align 4
.Lloop:
source
@@ -89,7 +89,7 @@ ENTRY(csum_partial_copy_generic)
source
movq 32(%rdi), %r10
source
- movq 40(%rdi), %rbp
+ movq 40(%rdi), %r15
source
movq 48(%rdi), %r14
source
@@ -103,7 +103,7 @@ ENTRY(csum_partial_copy_generic)
adcq %r11, %rax
adcq %rdx, %rax
adcq %r10, %rax
- adcq %rbp, %rax
+ adcq %r15, %rax
adcq %r14, %rax
adcq %r13, %rax
@@ -121,7 +121,7 @@ ENTRY(csum_partial_copy_generic)
dest
movq %r10, 32(%rsi)
dest
- movq %rbp, 40(%rsi)
+ movq %r15, 40(%rsi)
dest
movq %r14, 48(%rsi)
dest
@@ -203,7 +203,7 @@ ENTRY(csum_partial_copy_generic)
movq 3*8(%rsp), %r12
movq 4*8(%rsp), %r14
movq 5*8(%rsp), %r13
- movq 6*8(%rsp), %rbp
+ movq 6*8(%rsp), %r15
addq $7*8, %rsp
ret
Patches currently in stable-queue which might be from jpoimboe@redhat.com are
queue-4.4/x86-asm-don-t-use-rbp-as-a-temporary-register-in-csum_partial_copy_generic.patch
^ permalink raw reply
* Re: [PATCH bpf v3] bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog
From: Daniel Borkmann @ 2018-04-10 8:54 UTC (permalink / raw)
To: Yonghong Song, ast, netdev; +Cc: kernel-team
In-Reply-To: <20180410072139.323589-1-yhs@fb.com>
On 04/10/2018 09:21 AM, Yonghong Song wrote:
> syzbot reported a possible deadlock in perf_event_detach_bpf_prog.
> The error details:
> ======================================================
> WARNING: possible circular locking dependency detected
> 4.16.0-rc7+ #3 Not tainted
> ------------------------------------------------------
> syz-executor7/24531 is trying to acquire lock:
> (bpf_event_mutex){+.+.}, at: [<000000008a849b07>] perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
>
> but task is already holding lock:
> (&mm->mmap_sem){++++}, at: [<0000000038768f87>] vm_mmap_pgoff+0x198/0x280 mm/util.c:353
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (&mm->mmap_sem){++++}:
> __might_fault+0x13a/0x1d0 mm/memory.c:4571
> _copy_to_user+0x2c/0xc0 lib/usercopy.c:25
> copy_to_user include/linux/uaccess.h:155 [inline]
> bpf_prog_array_copy_info+0xf2/0x1c0 kernel/bpf/core.c:1694
> perf_event_query_prog_array+0x1c7/0x2c0 kernel/trace/bpf_trace.c:891
> _perf_ioctl kernel/events/core.c:4750 [inline]
> perf_ioctl+0x3e1/0x1480 kernel/events/core.c:4770
> vfs_ioctl fs/ioctl.c:46 [inline]
> do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
> SYSC_ioctl fs/ioctl.c:701 [inline]
> SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
> do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
> entry_SYSCALL_64_after_hwframe+0x42/0xb7
>
> -> #0 (bpf_event_mutex){+.+.}:
> lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
> __mutex_lock_common kernel/locking/mutex.c:756 [inline]
> __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
> mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
> perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
> perf_event_free_bpf_prog kernel/events/core.c:8147 [inline]
> _free_event+0xbdb/0x10f0 kernel/events/core.c:4116
> put_event+0x24/0x30 kernel/events/core.c:4204
> perf_mmap_close+0x60d/0x1010 kernel/events/core.c:5172
> remove_vma+0xb4/0x1b0 mm/mmap.c:172
> remove_vma_list mm/mmap.c:2490 [inline]
> do_munmap+0x82a/0xdf0 mm/mmap.c:2731
> mmap_region+0x59e/0x15a0 mm/mmap.c:1646
> do_mmap+0x6c0/0xe00 mm/mmap.c:1483
> do_mmap_pgoff include/linux/mm.h:2223 [inline]
> vm_mmap_pgoff+0x1de/0x280 mm/util.c:355
> SYSC_mmap_pgoff mm/mmap.c:1533 [inline]
> SyS_mmap_pgoff+0x462/0x5f0 mm/mmap.c:1491
> SYSC_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
> SyS_mmap+0x16/0x20 arch/x86/kernel/sys_x86_64.c:91
> do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
> entry_SYSCALL_64_after_hwframe+0x42/0xb7
>
> other info that might help us debug this:
>
> Possible unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(&mm->mmap_sem);
> lock(bpf_event_mutex);
> lock(&mm->mmap_sem);
> lock(bpf_event_mutex);
>
> *** DEADLOCK ***
> ======================================================
>
> The bug is introduced by Commit f371b304f12e ("bpf/tracing: allow
> user space to query prog array on the same tp") where copy_to_user,
> which requires mm->mmap_sem, is called inside bpf_event_mutex lock.
> At the same time, during perf_event file descriptor close,
> mm->mmap_sem is held first and then subsequent
> perf_event_detach_bpf_prog needs bpf_event_mutex lock.
> Such a senario caused a deadlock.
>
> As suggested by Danial, moving copy_to_user out of the
Nit: typo :)
> bpf_event_mutex lock should fix the problem.
>
> Fixes: f371b304f12e ("bpf/tracing: allow user space to query prog array on the same tp")
> Reported-by: syzbot+dc5ca0e4c9bfafaf2bae@syzkaller.appspotmail.com
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
> include/linux/bpf.h | 4 ++--
> kernel/bpf/core.c | 45 +++++++++++++++++++++++++++++----------------
> kernel/trace/bpf_trace.c | 19 +++++++++++++++----
> 3 files changed, 46 insertions(+), 22 deletions(-)
>
> Changelog:
> v2 -> v3:
> . Remove the redundant label in function perf_event_query_prog_array.
> v1 -> v2:
> . Use the common core function for prog_id copying for two
> different prog_array copy routines, suggested by Alexei.
>
[...]
> static void bpf_prog_free_deferred(struct work_struct *work)
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index d88e96d..f505d43 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -977,6 +977,7 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
> {
> struct perf_event_query_bpf __user *uquery = info;
> struct perf_event_query_bpf query = {};
> + u32 *ids, prog_cnt, ids_len;
> int ret;
>
> if (!capable(CAP_SYS_ADMIN))
> @@ -985,16 +986,26 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
> return -EINVAL;
> if (copy_from_user(&query, uquery, sizeof(query)))
> return -EFAULT;
> - if (query.ids_len > BPF_TRACE_MAX_PROGS)
> +
> + ids_len = query.ids_len;
> + if (ids_len > BPF_TRACE_MAX_PROGS)
> return -E2BIG;
> + ids = kcalloc(ids_len, sizeof(u32), GFP_USER | __GFP_NOWARN);
> + if (!ids)
> + return -ENOMEM;
Fix looks good to me, but could you still add a comment stating that we don't
need to check for ZERO_SIZE_PTR above since we handle this gracefully in the
bpf_prog_array_copy_info() plus it's also required for the case where the user
only wants to check for the uquery->prog_cnt, but nothing else.
> mutex_lock(&bpf_event_mutex);
> ret = bpf_prog_array_copy_info(event->tp_event->prog_array,
> - uquery->ids,
> - query.ids_len,
> - &uquery->prog_cnt);
> + ids,
> + ids_len,
> + &prog_cnt);
> mutex_unlock(&bpf_event_mutex);
>
> + if (copy_to_user(&uquery->prog_cnt, &prog_cnt, sizeof(prog_cnt)) ||
> + copy_to_user(uquery->ids, ids, ids_len * sizeof(u32)))
> + ret = -EFAULT;
> +
> + kfree(ids);
> return ret;
> }
Thanks,
Daniel
^ permalink raw reply
* Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Paolo Bonzini @ 2018-04-10 7:51 UTC (permalink / raw)
To: Tiwei Bie, Jason Wang
Cc: alexander.h.duyck, virtio-dev, kvm, mst, netdev, linux-kernel,
virtualization, xiao.w.wang, ddutile, jianfeng.tan, zhihong.wang
In-Reply-To: <20180410045723.rftsb7l4l3ip2ioi@debian>
On 10/04/2018 06:57, Tiwei Bie wrote:
>> So you just move the abstraction layer from qemu to kernel, and you still
>> need different drivers in kernel for different device interfaces of
>> accelerators. This looks even more complex than leaving it in qemu. As you
>> said, another idea is to implement userspace vhost backend for accelerators
>> which seems easier and could co-work with other parts of qemu without
>> inventing new type of messages.
>
> I'm not quite sure. Do you think it's acceptable to
> add various vendor specific hardware drivers in QEMU?
I think so. We have vendor-specific quirks, and at some point there was
an idea of using quirks to implement (vendor-specific) live migration
support for assigned devices.
Paolo
^ permalink raw reply
* Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Jason Wang @ 2018-04-10 7:25 UTC (permalink / raw)
To: Tiwei Bie
Cc: mst, alex.williamson, ddutile, alexander.h.duyck, virtio-dev,
linux-kernel, kvm, virtualization, netdev, dan.daly,
cunming.liang, zhihong.wang, jianfeng.tan, xiao.w.wang
In-Reply-To: <20180410045723.rftsb7l4l3ip2ioi@debian>
On 2018年04月10日 12:57, Tiwei Bie wrote:
> On Tue, Apr 10, 2018 at 10:52:52AM +0800, Jason Wang wrote:
>> On 2018年04月02日 23:23, Tiwei Bie wrote:
>>> This patch introduces a mdev (mediated device) based hardware
>>> vhost backend. This backend is an abstraction of the various
>>> hardware vhost accelerators (potentially any device that uses
>>> virtio ring can be used as a vhost accelerator). Some generic
>>> mdev parent ops are provided for accelerator drivers to support
>>> generating mdev instances.
>>>
>>> What's this
>>> ===========
>>>
>>> The idea is that we can setup a virtio ring compatible device
>>> with the messages available at the vhost-backend. Originally,
>>> these messages are used to implement a software vhost backend,
>>> but now we will use these messages to setup a virtio ring
>>> compatible hardware device. Then the hardware device will be
>>> able to work with the guest virtio driver in the VM just like
>>> what the software backend does. That is to say, we can implement
>>> a hardware based vhost backend in QEMU, and any virtio ring
>>> compatible devices potentially can be used with this backend.
>>> (We also call it vDPA -- vhost Data Path Acceleration).
>>>
>>> One problem is that, different virtio ring compatible devices
>>> may have different device interfaces. That is to say, we will
>>> need different drivers in QEMU. It could be troublesome. And
>>> that's what this patch trying to fix. The idea behind this
>>> patch is very simple: mdev is a standard way to emulate device
>>> in kernel.
>> So you just move the abstraction layer from qemu to kernel, and you still
>> need different drivers in kernel for different device interfaces of
>> accelerators. This looks even more complex than leaving it in qemu. As you
>> said, another idea is to implement userspace vhost backend for accelerators
>> which seems easier and could co-work with other parts of qemu without
>> inventing new type of messages.
> I'm not quite sure. Do you think it's acceptable to
> add various vendor specific hardware drivers in QEMU?
>
I don't object but we need to figure out the advantages of doing it in
qemu too.
Thanks
^ permalink raw reply
* [PATCH bpf v3] bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog
From: Yonghong Song @ 2018-04-10 7:21 UTC (permalink / raw)
To: ast, daniel, netdev; +Cc: kernel-team
syzbot reported a possible deadlock in perf_event_detach_bpf_prog.
The error details:
======================================================
WARNING: possible circular locking dependency detected
4.16.0-rc7+ #3 Not tainted
------------------------------------------------------
syz-executor7/24531 is trying to acquire lock:
(bpf_event_mutex){+.+.}, at: [<000000008a849b07>] perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
but task is already holding lock:
(&mm->mmap_sem){++++}, at: [<0000000038768f87>] vm_mmap_pgoff+0x198/0x280 mm/util.c:353
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&mm->mmap_sem){++++}:
__might_fault+0x13a/0x1d0 mm/memory.c:4571
_copy_to_user+0x2c/0xc0 lib/usercopy.c:25
copy_to_user include/linux/uaccess.h:155 [inline]
bpf_prog_array_copy_info+0xf2/0x1c0 kernel/bpf/core.c:1694
perf_event_query_prog_array+0x1c7/0x2c0 kernel/trace/bpf_trace.c:891
_perf_ioctl kernel/events/core.c:4750 [inline]
perf_ioctl+0x3e1/0x1480 kernel/events/core.c:4770
vfs_ioctl fs/ioctl.c:46 [inline]
do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
SYSC_ioctl fs/ioctl.c:701 [inline]
SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x42/0xb7
-> #0 (bpf_event_mutex){+.+.}:
lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
__mutex_lock_common kernel/locking/mutex.c:756 [inline]
__mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
perf_event_free_bpf_prog kernel/events/core.c:8147 [inline]
_free_event+0xbdb/0x10f0 kernel/events/core.c:4116
put_event+0x24/0x30 kernel/events/core.c:4204
perf_mmap_close+0x60d/0x1010 kernel/events/core.c:5172
remove_vma+0xb4/0x1b0 mm/mmap.c:172
remove_vma_list mm/mmap.c:2490 [inline]
do_munmap+0x82a/0xdf0 mm/mmap.c:2731
mmap_region+0x59e/0x15a0 mm/mmap.c:1646
do_mmap+0x6c0/0xe00 mm/mmap.c:1483
do_mmap_pgoff include/linux/mm.h:2223 [inline]
vm_mmap_pgoff+0x1de/0x280 mm/util.c:355
SYSC_mmap_pgoff mm/mmap.c:1533 [inline]
SyS_mmap_pgoff+0x462/0x5f0 mm/mmap.c:1491
SYSC_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
SyS_mmap+0x16/0x20 arch/x86/kernel/sys_x86_64.c:91
do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x42/0xb7
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&mm->mmap_sem);
lock(bpf_event_mutex);
lock(&mm->mmap_sem);
lock(bpf_event_mutex);
*** DEADLOCK ***
======================================================
The bug is introduced by Commit f371b304f12e ("bpf/tracing: allow
user space to query prog array on the same tp") where copy_to_user,
which requires mm->mmap_sem, is called inside bpf_event_mutex lock.
At the same time, during perf_event file descriptor close,
mm->mmap_sem is held first and then subsequent
perf_event_detach_bpf_prog needs bpf_event_mutex lock.
Such a senario caused a deadlock.
As suggested by Danial, moving copy_to_user out of the
bpf_event_mutex lock should fix the problem.
Fixes: f371b304f12e ("bpf/tracing: allow user space to query prog array on the same tp")
Reported-by: syzbot+dc5ca0e4c9bfafaf2bae@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song <yhs@fb.com>
---
include/linux/bpf.h | 4 ++--
kernel/bpf/core.c | 45 +++++++++++++++++++++++++++++----------------
kernel/trace/bpf_trace.c | 19 +++++++++++++++----
3 files changed, 46 insertions(+), 22 deletions(-)
Changelog:
v2 -> v3:
. Remove the redundant label in function perf_event_query_prog_array.
v1 -> v2:
. Use the common core function for prog_id copying for two
different prog_array copy routines, suggested by Alexei.
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 95a7abd..486e65e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -339,8 +339,8 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs,
void bpf_prog_array_delete_safe(struct bpf_prog_array __rcu *progs,
struct bpf_prog *old_prog);
int bpf_prog_array_copy_info(struct bpf_prog_array __rcu *array,
- __u32 __user *prog_ids, u32 request_cnt,
- __u32 __user *prog_cnt);
+ u32 *prog_ids, u32 request_cnt,
+ u32 *prog_cnt);
int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array,
struct bpf_prog *exclude_prog,
struct bpf_prog *include_prog,
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index d315b39..ba03ec3 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1572,13 +1572,32 @@ int bpf_prog_array_length(struct bpf_prog_array __rcu *progs)
return cnt;
}
+static bool bpf_prog_array_copy_core(struct bpf_prog **prog,
+ u32 *prog_ids,
+ u32 request_cnt)
+{
+ int i = 0;
+
+ for (; *prog; prog++) {
+ if (*prog == &dummy_bpf_prog.prog)
+ continue;
+ prog_ids[i] = (*prog)->aux->id;
+ if (++i == request_cnt) {
+ prog++;
+ break;
+ }
+ }
+
+ return !!(*prog);
+}
+
int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs,
__u32 __user *prog_ids, u32 cnt)
{
struct bpf_prog **prog;
unsigned long err = 0;
- u32 i = 0, *ids;
bool nospc;
+ u32 *ids;
/* users of this function are doing:
* cnt = bpf_prog_array_length();
@@ -1595,16 +1614,7 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs,
return -ENOMEM;
rcu_read_lock();
prog = rcu_dereference(progs)->progs;
- for (; *prog; prog++) {
- if (*prog == &dummy_bpf_prog.prog)
- continue;
- ids[i] = (*prog)->aux->id;
- if (++i == cnt) {
- prog++;
- break;
- }
- }
- nospc = !!(*prog);
+ nospc = bpf_prog_array_copy_core(prog, ids, cnt);
rcu_read_unlock();
err = copy_to_user(prog_ids, ids, cnt * sizeof(u32));
kfree(ids);
@@ -1683,22 +1693,25 @@ int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array,
}
int bpf_prog_array_copy_info(struct bpf_prog_array __rcu *array,
- __u32 __user *prog_ids, u32 request_cnt,
- __u32 __user *prog_cnt)
+ u32 *prog_ids, u32 request_cnt,
+ u32 *prog_cnt)
{
+ struct bpf_prog **prog;
u32 cnt = 0;
if (array)
cnt = bpf_prog_array_length(array);
- if (copy_to_user(prog_cnt, &cnt, sizeof(cnt)))
- return -EFAULT;
+ *prog_cnt = cnt;
/* return early if user requested only program count or nothing to copy */
if (!request_cnt || !cnt)
return 0;
- return bpf_prog_array_copy_to_user(array, prog_ids, request_cnt);
+ /* this function is called under trace/bpf_trace.c: bpf_event_mutex */
+ prog = rcu_dereference_check(array, 1)->progs;
+ return bpf_prog_array_copy_core(prog, prog_ids, request_cnt) ? -ENOSPC
+ : 0;
}
static void bpf_prog_free_deferred(struct work_struct *work)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d88e96d..f505d43 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -977,6 +977,7 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
{
struct perf_event_query_bpf __user *uquery = info;
struct perf_event_query_bpf query = {};
+ u32 *ids, prog_cnt, ids_len;
int ret;
if (!capable(CAP_SYS_ADMIN))
@@ -985,16 +986,26 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
return -EINVAL;
if (copy_from_user(&query, uquery, sizeof(query)))
return -EFAULT;
- if (query.ids_len > BPF_TRACE_MAX_PROGS)
+
+ ids_len = query.ids_len;
+ if (ids_len > BPF_TRACE_MAX_PROGS)
return -E2BIG;
+ ids = kcalloc(ids_len, sizeof(u32), GFP_USER | __GFP_NOWARN);
+ if (!ids)
+ return -ENOMEM;
mutex_lock(&bpf_event_mutex);
ret = bpf_prog_array_copy_info(event->tp_event->prog_array,
- uquery->ids,
- query.ids_len,
- &uquery->prog_cnt);
+ ids,
+ ids_len,
+ &prog_cnt);
mutex_unlock(&bpf_event_mutex);
+ if (copy_to_user(&uquery->prog_cnt, &prog_cnt, sizeof(prog_cnt)) ||
+ copy_to_user(uquery->ids, ids, ids_len * sizeof(u32)))
+ ret = -EFAULT;
+
+ kfree(ids);
return ret;
}
--
2.9.5
^ permalink raw reply related
* [PATCH bpf v2] bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog
From: Yonghong Song @ 2018-04-10 7:13 UTC (permalink / raw)
To: ast, daniel, netdev; +Cc: kernel-team
syzbot reported a possible deadlock in perf_event_detach_bpf_prog.
The error details:
======================================================
WARNING: possible circular locking dependency detected
4.16.0-rc7+ #3 Not tainted
------------------------------------------------------
syz-executor7/24531 is trying to acquire lock:
(bpf_event_mutex){+.+.}, at: [<000000008a849b07>] perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
but task is already holding lock:
(&mm->mmap_sem){++++}, at: [<0000000038768f87>] vm_mmap_pgoff+0x198/0x280 mm/util.c:353
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&mm->mmap_sem){++++}:
__might_fault+0x13a/0x1d0 mm/memory.c:4571
_copy_to_user+0x2c/0xc0 lib/usercopy.c:25
copy_to_user include/linux/uaccess.h:155 [inline]
bpf_prog_array_copy_info+0xf2/0x1c0 kernel/bpf/core.c:1694
perf_event_query_prog_array+0x1c7/0x2c0 kernel/trace/bpf_trace.c:891
_perf_ioctl kernel/events/core.c:4750 [inline]
perf_ioctl+0x3e1/0x1480 kernel/events/core.c:4770
vfs_ioctl fs/ioctl.c:46 [inline]
do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
SYSC_ioctl fs/ioctl.c:701 [inline]
SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x42/0xb7
-> #0 (bpf_event_mutex){+.+.}:
lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
__mutex_lock_common kernel/locking/mutex.c:756 [inline]
__mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
perf_event_free_bpf_prog kernel/events/core.c:8147 [inline]
_free_event+0xbdb/0x10f0 kernel/events/core.c:4116
put_event+0x24/0x30 kernel/events/core.c:4204
perf_mmap_close+0x60d/0x1010 kernel/events/core.c:5172
remove_vma+0xb4/0x1b0 mm/mmap.c:172
remove_vma_list mm/mmap.c:2490 [inline]
do_munmap+0x82a/0xdf0 mm/mmap.c:2731
mmap_region+0x59e/0x15a0 mm/mmap.c:1646
do_mmap+0x6c0/0xe00 mm/mmap.c:1483
do_mmap_pgoff include/linux/mm.h:2223 [inline]
vm_mmap_pgoff+0x1de/0x280 mm/util.c:355
SYSC_mmap_pgoff mm/mmap.c:1533 [inline]
SyS_mmap_pgoff+0x462/0x5f0 mm/mmap.c:1491
SYSC_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
SyS_mmap+0x16/0x20 arch/x86/kernel/sys_x86_64.c:91
do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x42/0xb7
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&mm->mmap_sem);
lock(bpf_event_mutex);
lock(&mm->mmap_sem);
lock(bpf_event_mutex);
*** DEADLOCK ***
======================================================
The bug is introduced by Commit f371b304f12e ("bpf/tracing: allow
user space to query prog array on the same tp") where copy_to_user,
which requires mm->mmap_sem, is called inside bpf_event_mutex lock.
At the same time, during perf_event file descriptor close,
mm->mmap_sem is held first and then subsequent
perf_event_detach_bpf_prog needs bpf_event_mutex lock.
Such a senario caused a deadlock.
As suggested by Danial, moving copy_to_user out of the
bpf_event_mutex lock should fix the problem.
Fixes: f371b304f12e ("bpf/tracing: allow user space to query prog array on the same tp")
Reported-by: syzbot+dc5ca0e4c9bfafaf2bae@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song <yhs@fb.com>
---
include/linux/bpf.h | 4 ++--
kernel/bpf/core.c | 45 +++++++++++++++++++++++++++++----------------
kernel/trace/bpf_trace.c | 22 ++++++++++++++++++----
3 files changed, 49 insertions(+), 22 deletions(-)
Changelog:
v1 -> v2:
. Use the common core function for prog_id copying for two
different prog_array copy routines, suggested by Alexei.
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 95a7abd..486e65e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -339,8 +339,8 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs,
void bpf_prog_array_delete_safe(struct bpf_prog_array __rcu *progs,
struct bpf_prog *old_prog);
int bpf_prog_array_copy_info(struct bpf_prog_array __rcu *array,
- __u32 __user *prog_ids, u32 request_cnt,
- __u32 __user *prog_cnt);
+ u32 *prog_ids, u32 request_cnt,
+ u32 *prog_cnt);
int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array,
struct bpf_prog *exclude_prog,
struct bpf_prog *include_prog,
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index d315b39..ba03ec3 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1572,13 +1572,32 @@ int bpf_prog_array_length(struct bpf_prog_array __rcu *progs)
return cnt;
}
+static bool bpf_prog_array_copy_core(struct bpf_prog **prog,
+ u32 *prog_ids,
+ u32 request_cnt)
+{
+ int i = 0;
+
+ for (; *prog; prog++) {
+ if (*prog == &dummy_bpf_prog.prog)
+ continue;
+ prog_ids[i] = (*prog)->aux->id;
+ if (++i == request_cnt) {
+ prog++;
+ break;
+ }
+ }
+
+ return !!(*prog);
+}
+
int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs,
__u32 __user *prog_ids, u32 cnt)
{
struct bpf_prog **prog;
unsigned long err = 0;
- u32 i = 0, *ids;
bool nospc;
+ u32 *ids;
/* users of this function are doing:
* cnt = bpf_prog_array_length();
@@ -1595,16 +1614,7 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs,
return -ENOMEM;
rcu_read_lock();
prog = rcu_dereference(progs)->progs;
- for (; *prog; prog++) {
- if (*prog == &dummy_bpf_prog.prog)
- continue;
- ids[i] = (*prog)->aux->id;
- if (++i == cnt) {
- prog++;
- break;
- }
- }
- nospc = !!(*prog);
+ nospc = bpf_prog_array_copy_core(prog, ids, cnt);
rcu_read_unlock();
err = copy_to_user(prog_ids, ids, cnt * sizeof(u32));
kfree(ids);
@@ -1683,22 +1693,25 @@ int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array,
}
int bpf_prog_array_copy_info(struct bpf_prog_array __rcu *array,
- __u32 __user *prog_ids, u32 request_cnt,
- __u32 __user *prog_cnt)
+ u32 *prog_ids, u32 request_cnt,
+ u32 *prog_cnt)
{
+ struct bpf_prog **prog;
u32 cnt = 0;
if (array)
cnt = bpf_prog_array_length(array);
- if (copy_to_user(prog_cnt, &cnt, sizeof(cnt)))
- return -EFAULT;
+ *prog_cnt = cnt;
/* return early if user requested only program count or nothing to copy */
if (!request_cnt || !cnt)
return 0;
- return bpf_prog_array_copy_to_user(array, prog_ids, request_cnt);
+ /* this function is called under trace/bpf_trace.c: bpf_event_mutex */
+ prog = rcu_dereference_check(array, 1)->progs;
+ return bpf_prog_array_copy_core(prog, prog_ids, request_cnt) ? -ENOSPC
+ : 0;
}
static void bpf_prog_free_deferred(struct work_struct *work)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d88e96d..3c2fc26 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -977,6 +977,7 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
{
struct perf_event_query_bpf __user *uquery = info;
struct perf_event_query_bpf query = {};
+ u32 *ids, prog_cnt, ids_len;
int ret;
if (!capable(CAP_SYS_ADMIN))
@@ -985,16 +986,29 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
return -EINVAL;
if (copy_from_user(&query, uquery, sizeof(query)))
return -EFAULT;
- if (query.ids_len > BPF_TRACE_MAX_PROGS)
+
+ ids_len = query.ids_len;
+ if (ids_len > BPF_TRACE_MAX_PROGS)
return -E2BIG;
+ ids = kcalloc(ids_len, sizeof(u32), GFP_USER | __GFP_NOWARN);
+ if (!ids)
+ return -ENOMEM;
mutex_lock(&bpf_event_mutex);
ret = bpf_prog_array_copy_info(event->tp_event->prog_array,
- uquery->ids,
- query.ids_len,
- &uquery->prog_cnt);
+ ids,
+ ids_len,
+ &prog_cnt);
mutex_unlock(&bpf_event_mutex);
+ if (copy_to_user(&uquery->prog_cnt, &prog_cnt, sizeof(prog_cnt)) ||
+ copy_to_user(uquery->ids, ids, ids_len * sizeof(u32))) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+out:
+ kfree(ids);
return ret;
}
--
2.9.5
^ permalink raw reply related
* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Siwei Liu @ 2018-04-10 6:48 UTC (permalink / raw)
To: David Miller
Cc: David Ahern, Jiri Pirko, si-wei liu, Michael S. Tsirkin,
Stephen Hemminger, Alexander Duyck, Brandeburg, Jesse,
Jakub Kicinski, Jason Wang, Samudrala, Sridhar, Netdev,
virtualization, virtio-dev
In-Reply-To: <20180408.123207.2294740686493951200.davem@davemloft.net>
On Sun, Apr 8, 2018 at 9:32 AM, David Miller <davem@davemloft.net> wrote:
> From: Siwei Liu <loseweigh@gmail.com>
> Date: Fri, 6 Apr 2018 19:32:05 -0700
>
>> And I assume everyone here understands the use case for live
>> migration (in the context of providing cloud service) is very
>> different, and we have to hide the netdevs. If not, I'm more than
>> happy to clarify.
>
> I think you still need to clarify.
OK. The short answer is cloud users really want *transparent* live migration.
By being transparent it means they don't and shouldn't care about the
existence and the occurence of live migration, but they do if
userspace toolstack and libraries have to be updated or modified,
which means potential dependency brokeness of their applications. They
don't like any change to the userspace envinroment (existing apps
lift-and-shift, no recompilation, no re-packaging, no re-certification
needed), while no one barely cares about ABI or API compatibility in
the kernel level, as long as their applications don't break.
I agree the current bypass solution for SR-IOV live migration requires
guest cooperation. Though it doesn't mean guest *userspace*
cooperation. As a matter of fact, techinically it shouldn't invovle
userspace at all to get SR-IOV migration working. It's the kernel that
does the real work. If I understand the goal of this in-kernel
approach correctly, it was meant to save userspace from modification
or corresponding toolstack support, as those additional 2 interfaces
is more a side product of this approach, rather than being neccessary
for users to be aware of. All what the user needs to deal with is one
single interface, and that's what they care about. It's more a trouble
than help when they see 2 extra interfaces are present. Management
tools in the old distros don't recoginze them and try to bring up
those extra interfaces for its own. Various odd warnings start to spew
out, and there's a lot of caveats for the users to get around...
On the other hand, if we "teach" those cloud users to update the
userspace toolstack just for trading a feature they don't need, no one
is likely going to embrace the change. As such there's just no real
value of adopting this in-kernel bypass facility for any cloud service
provider. It does not look more appealing than just configure generic
bonding using its own set of daemons or scripts. But again, cloud
users don't welcome that facility. And basically it would get to
nearly the same set of problems if leaving userspace alone.
IMHO we're not hiding the devices, think it the way we're adding a
feature transparent to user. Those auto-managed slaves are ones users
don't care about much. And user is still able to see and configure the
lower netdevs if they really desires to do so. But generally the
target user for this feature won't need to know that. Why they care
how many interfaces a VM virtually has rather than how many interfaces
are actually _useable_ to them??
Thanks,
-Siwei
>
> netdevs are netdevs. If they have special attributes, mark them as
> such and the tools base their actions upon that.
>
> "Hiding", or changing classes, doesn't make any sense to me still.
^ permalink raw reply
* Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
From: Björn Töpel @ 2018-04-10 6:47 UTC (permalink / raw)
To: William Tu
Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
Willem de Bruijn, Daniel Borkmann,
Linux Kernel Network Developers, Björn Töpel,
michael.lundkvist, Brandeburg, Jesse, Anjali Singhai Jain,
Zhang, Qi Z, ravineet.singh
In-Reply-To: <CALDO+SbsotUZ2oyuBGHDsdfr+=ffCUDQnFXxbgN3iggrW6xakw@mail.gmail.com>
2018-04-09 23:51 GMT+02:00 William Tu <u9012063@gmail.com>:
> On Tue, Mar 27, 2018 at 9:59 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> This RFC introduces a new address family called AF_XDP that is
>> optimized for high performance packet processing and, in upcoming
>> patch sets, zero-copy semantics. In this v2 version, we have removed
>> all zero-copy related code in order to make it smaller, simpler and
>> hopefully more review friendly. This RFC only supports copy-mode for
>> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
>> using the XDP_DRV path. Zero-copy support requires XDP and driver
>> changes that Jesper Dangaard Brouer is working on. Some of his work is
>> already on the mailing list for review. We will publish our zero-copy
>> support for RX and TX on top of his patch sets at a later point in
>> time.
>>
>> An AF_XDP socket (XSK) is created with the normal socket()
>> syscall. Associated with each XSK are two queues: the RX queue and the
>> TX queue. A socket can receive packets on the RX queue and it can send
>> packets on the TX queue. These queues are registered and sized with
>> the setsockopts XDP_RX_QUEUE and XDP_TX_QUEUE, respectively. It is
>> mandatory to have at least one of these queues for each socket. In
>> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
>> packet buffers. An RX or TX descriptor points to a data buffer in a
>> memory area called a UMEM. RX and TX can share the same UMEM so that a
>> packet does not have to be copied between RX and TX. Moreover, if a
>> packet needs to be kept for a while due to a possible retransmit, the
>> descriptor that points to that packet can be changed to point to
>> another and reused right away. This again avoids copying data.
>>
>> This new dedicated packet buffer area is called a UMEM. It consists of
>> a number of equally size frames and each frame has a unique frame
>> id. A descriptor in one of the queues references a frame by
>> referencing its frame id. The user space allocates memory for this
>> UMEM using whatever means it feels is most appropriate (malloc, mmap,
>> huge pages, etc). This memory area is then registered with the kernel
>> using the new setsockopt XDP_UMEM_REG. The UMEM also has two queues:
>> the FILL queue and the COMPLETION queue. The fill queue is used by the
>> application to send down frame ids for the kernel to fill in with RX
>> packet data. References to these frames will then appear in the RX
>> queue of the XSK once they have been received. The completion queue,
>> on the other hand, contains frame ids that the kernel has transmitted
>> completely and can now be used again by user space, for either TX or
>> RX. Thus, the frame ids appearing in the completion queue are ids that
>> were previously transmitted using the TX queue. In summary, the RX and
>> FILL queues are used for the RX path and the TX and COMPLETION queues
>> are used for the TX path.
>>
> Can we register a UMEM to multiple device's queue?
>
No, one UMEM, one netdev queue in this RFC. That being said, there's
nothing stopping a user from creating an additional UMEM, say UMEM',
pointing to the same memory as UMEM, but bound to another
netdev/queue. Note that the user space application has to make sure
that the buffer handling is sane (user/kernel frame ownership).
We used to allow to share UMEM between unrelated sockets, but after
the introduction of the UMEM queues (fill/completion) that's no the
case any more. For the zero-copy scenario, having to manage multiple
DMA mappings per UMEM was a bit of a mess, so we went for the simpler
(current) solution with one UMEM per netdev/queue.
> So far the l2fwd sample code is sending/receiving from the same
> queue. I'm thinking about forwarding packets from one device to another.
> Now I'm copying packets from one device's RX desc to another device's TX
> completion queue. But this introduces one extra copy.
>
So you've setup two identical UMEMs? Then you can just forward the
incoming Rx descriptor to the other netdev's Tx queue. Note, that you
only need to copy the descriptor, not the actual frame data.
> One way I can do is to call bpf_redirect helper function, but sometimes
> I still need to process the packet in userspace.
>
> I like this work!
> Thanks a lot.
Happy to hear that, and thanks a bunch for trying it out. Keep that
feedback coming!
Björn
> William
^ permalink raw reply
* Re: [PATCH v2 0/2] vhost: fix vhost_vq_access_ok() log check
From: Jason Wang @ 2018-04-10 6:40 UTC (permalink / raw)
To: Stefan Hajnoczi, virtualization
Cc: syzkaller-bugs, mst, Linus Torvalds, kvm, linux-kernel, netdev
In-Reply-To: <20180410052630.11270-1-stefanha@redhat.com>
On 2018年04月10日 13:26, Stefan Hajnoczi wrote:
> v2:
> * Rewrote the conditional to make the vq access check clearer [Linus]
> * Added Patch 2 to make the return type consistent and harder to misuse [Linus]
>
> The first patch fixes the vhost virtqueue access check which was recently
> broken. The second patch replaces the int return type with bool to prevent
> future bugs.
>
> Stefan Hajnoczi (2):
> vhost: fix vhost_vq_access_ok() log check
> vhost: return bool from *_access_ok() functions
>
> drivers/vhost/vhost.h | 4 +--
> drivers/vhost/vhost.c | 70 ++++++++++++++++++++++++++-------------------------
> 2 files changed, 38 insertions(+), 36 deletions(-)
>
Acked-by: Jason Wang <jasowang@redhat.com>
Thanks!
^ permalink raw reply
* Re: [PATCH] slip: Check if rstate is initialized before uncompressing
From: tejaswit @ 2018-04-10 5:58 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20180409.110438.458180702294361660.davem@davemloft.net>
[-- Attachment #1: Type: text/plain, Size: 627 bytes --]
On 2018-04-09 20:34, David Miller wrote:
> From: Tejaswi Tanikella <tejaswit@codeaurora.org>
> Date: Mon, 9 Apr 2018 14:23:49 +0530
>
>> @@ -673,6 +677,7 @@ struct slcompress *
>> if (cs->cs_tcp.doff > 5)
>> memcpy(cs->cs_tcpopt, icp + ihl*4 + sizeof(struct tcphdr),
>> (cs->cs_tcp.doff - 5) * 4);
>> cs->cs_hsize = ihl*2 + cs->cs_tcp.doff*2;
>> + cs->initialized = 1;
>> /* Put headers back on packet
> ...
>> struct cstate {
>> byte_t cs_this; /* connection id number (xmit) */
>> + byte_t initialized; /* non-zero if initialized */
>
> Please use 'bool' and true/false for 'initialized'.
Made the changes.
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-slip-Check-if-rstate-is-initialized-before-uncompres.patch --]
[-- Type: text/x-diff; name=0001-slip-Check-if-rstate-is-initialized-before-uncompres.patch, Size: 3032 bytes --]
From b04441041034a5e044705d3c5a2cc338dd4bfc3a Mon Sep 17 00:00:00 2001
From: Tejaswi Tanikella <tejaswit@codeaurora.org>
Date: Thu, 29 Mar 2018 14:46:41 +0530
Subject: [PATCH] slip: Check if rstate is initialized before uncompressing
On receiving a packet the state index points to the rstate which must be
used to fill up IP and TCP headers. But if the state index points to a
rstate which is unitialized, i.e. filled with zeros, it gets stuck in an
infinite loop inside ip_fast_csum trying to compute the ip checsum of a
header with zero length.
89.666953: <2> [<ffffff9dd3e94d38>] slhc_uncompress+0x464/0x468
89.666965: <2> [<ffffff9dd3e87d88>] ppp_receive_nonmp_frame+0x3b4/0x65c
89.666978: <2> [<ffffff9dd3e89dd4>] ppp_receive_frame+0x64/0x7e0
89.666991: <2> [<ffffff9dd3e8a708>] ppp_input+0x104/0x198
89.667005: <2> [<ffffff9dd3e93868>] pppopns_recv_core+0x238/0x370
89.667027: <2> [<ffffff9dd4428fc8>] __sk_receive_skb+0xdc/0x250
89.667040: <2> [<ffffff9dd3e939e4>] pppopns_recv+0x44/0x60
89.667053: <2> [<ffffff9dd4426848>] __sock_queue_rcv_skb+0x16c/0x24c
89.667065: <2> [<ffffff9dd4426954>] sock_queue_rcv_skb+0x2c/0x38
89.667085: <2> [<ffffff9dd44f7358>] raw_rcv+0x124/0x154
89.667098: <2> [<ffffff9dd44f7568>] raw_local_deliver+0x1e0/0x22c
89.667117: <2> [<ffffff9dd44c8ba0>] ip_local_deliver_finish+0x70/0x24c
89.667131: <2> [<ffffff9dd44c92f4>] ip_local_deliver+0x100/0x10c
./scripts/faddr2line vmlinux slhc_uncompress+0x464/0x468 output:
ip_fast_csum at arch/arm64/include/asm/checksum.h:40
(inlined by) slhc_uncompress at drivers/net/slip/slhc.c:615
Adding a variable to indicate if the current rstate is initialized. If
such a packet arrives, move to toss state.
Signed-off-by: Tejaswi Tanikella <tejaswit@codeaurora.org>
---
drivers/net/slip/slhc.c | 5 +++++
include/net/slhc_vj.h | 1 +
2 files changed, 6 insertions(+)
diff --git a/drivers/net/slip/slhc.c b/drivers/net/slip/slhc.c
index 5782733..f4e93f5 100644
--- a/drivers/net/slip/slhc.c
+++ b/drivers/net/slip/slhc.c
@@ -509,6 +509,10 @@ struct slcompress *
if(x < 0 || x > comp->rslot_limit)
goto bad;
+ /* Check if the cstate is initialized */
+ if (!comp->rstate[x].initialized)
+ goto bad;
+
comp->flags &=~ SLF_TOSS;
comp->recv_current = x;
} else {
@@ -673,6 +677,7 @@ struct slcompress *
if (cs->cs_tcp.doff > 5)
memcpy(cs->cs_tcpopt, icp + ihl*4 + sizeof(struct tcphdr), (cs->cs_tcp.doff - 5) * 4);
cs->cs_hsize = ihl*2 + cs->cs_tcp.doff*2;
+ cs->initialized = true;
/* Put headers back on packet
* Neither header checksum is recalculated
*/
diff --git a/include/net/slhc_vj.h b/include/net/slhc_vj.h
index 8716d59..8fcf890 100644
--- a/include/net/slhc_vj.h
+++ b/include/net/slhc_vj.h
@@ -127,6 +127,7 @@
*/
struct cstate {
byte_t cs_this; /* connection id number (xmit) */
+ bool initialized; /* true if initialized */
struct cstate *next; /* next in ring (xmit) */
struct iphdr cs_ip; /* ip/tcp hdr from most recent packet */
struct tcphdr cs_tcp;
--
1.9.1
^ permalink raw reply related
* [PATCH v2 2/2] vhost: return bool from *_access_ok() functions
From: Stefan Hajnoczi @ 2018-04-10 5:26 UTC (permalink / raw)
To: virtualization
Cc: kvm, mst, netdev, syzkaller-bugs, linux-kernel, Stefan Hajnoczi,
Linus Torvalds
In-Reply-To: <20180410052630.11270-1-stefanha@redhat.com>
Currently vhost *_access_ok() functions return int. This is error-prone
because there are two popular conventions:
1. 0 means failure, 1 means success
2. -errno means failure, 0 means success
Although vhost mostly uses #1, it does not do so consistently.
umem_access_ok() uses #2.
This patch changes the return type from int to bool so that false means
failure and true means success. This eliminates a potential source of
errors.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
drivers/vhost/vhost.h | 4 ++--
drivers/vhost/vhost.c | 66 +++++++++++++++++++++++++--------------------------
2 files changed, 35 insertions(+), 35 deletions(-)
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index ac4b6056f19a..6e00fa57af09 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -178,8 +178,8 @@ void vhost_dev_cleanup(struct vhost_dev *);
void vhost_dev_stop(struct vhost_dev *);
long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user *argp);
long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp);
-int vhost_vq_access_ok(struct vhost_virtqueue *vq);
-int vhost_log_access_ok(struct vhost_dev *);
+bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
+bool vhost_log_access_ok(struct vhost_dev *);
int vhost_get_vq_desc(struct vhost_virtqueue *,
struct iovec iov[], unsigned int iov_count,
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 93fd0c75b0d8..b6a082ef33dd 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -641,14 +641,14 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
}
EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
-static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
+static bool log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
{
u64 a = addr / VHOST_PAGE_SIZE / 8;
/* Make sure 64 bit math will not overflow. */
if (a > ULONG_MAX - (unsigned long)log_base ||
a + (unsigned long)log_base > ULONG_MAX)
- return 0;
+ return false;
return access_ok(VERIFY_WRITE, log_base + a,
(sz + VHOST_PAGE_SIZE * 8 - 1) / VHOST_PAGE_SIZE / 8);
@@ -661,30 +661,30 @@ static bool vhost_overflow(u64 uaddr, u64 size)
}
/* Caller should have vq mutex and device mutex. */
-static int vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
- int log_all)
+static bool vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
+ int log_all)
{
struct vhost_umem_node *node;
if (!umem)
- return 0;
+ return false;
list_for_each_entry(node, &umem->umem_list, link) {
unsigned long a = node->userspace_addr;
if (vhost_overflow(node->userspace_addr, node->size))
- return 0;
+ return false;
if (!access_ok(VERIFY_WRITE, (void __user *)a,
node->size))
- return 0;
+ return false;
else if (log_all && !log_access_ok(log_base,
node->start,
node->size))
- return 0;
+ return false;
}
- return 1;
+ return true;
}
static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
@@ -701,13 +701,13 @@ static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
/* Can we switch to this memory table? */
/* Caller should have device mutex but not vq mutex */
-static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
- int log_all)
+static bool memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
+ int log_all)
{
int i;
for (i = 0; i < d->nvqs; ++i) {
- int ok;
+ bool ok;
bool log;
mutex_lock(&d->vqs[i]->mutex);
@@ -717,12 +717,12 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
ok = vq_memory_access_ok(d->vqs[i]->log_base,
umem, log);
else
- ok = 1;
+ ok = true;
mutex_unlock(&d->vqs[i]->mutex);
if (!ok)
- return 0;
+ return false;
}
- return 1;
+ return true;
}
static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
@@ -959,21 +959,21 @@ static void vhost_iotlb_notify_vq(struct vhost_dev *d,
spin_unlock(&d->iotlb_lock);
}
-static int umem_access_ok(u64 uaddr, u64 size, int access)
+static bool umem_access_ok(u64 uaddr, u64 size, int access)
{
unsigned long a = uaddr;
/* Make sure 64 bit math will not overflow. */
if (vhost_overflow(uaddr, size))
- return -EFAULT;
+ return false;
if ((access & VHOST_ACCESS_RO) &&
!access_ok(VERIFY_READ, (void __user *)a, size))
- return -EFAULT;
+ return false;
if ((access & VHOST_ACCESS_WO) &&
!access_ok(VERIFY_WRITE, (void __user *)a, size))
- return -EFAULT;
- return 0;
+ return false;
+ return true;
}
static int vhost_process_iotlb_msg(struct vhost_dev *dev,
@@ -988,7 +988,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
ret = -EFAULT;
break;
}
- if (umem_access_ok(msg->uaddr, msg->size, msg->perm)) {
+ if (!umem_access_ok(msg->uaddr, msg->size, msg->perm)) {
ret = -EFAULT;
break;
}
@@ -1135,10 +1135,10 @@ static int vhost_iotlb_miss(struct vhost_virtqueue *vq, u64 iova, int access)
return 0;
}
-static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
- struct vring_desc __user *desc,
- struct vring_avail __user *avail,
- struct vring_used __user *used)
+static bool vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
+ struct vring_desc __user *desc,
+ struct vring_avail __user *avail,
+ struct vring_used __user *used)
{
size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
@@ -1161,8 +1161,8 @@ static void vhost_vq_meta_update(struct vhost_virtqueue *vq,
vq->meta_iotlb[type] = node;
}
-static int iotlb_access_ok(struct vhost_virtqueue *vq,
- int access, u64 addr, u64 len, int type)
+static bool iotlb_access_ok(struct vhost_virtqueue *vq,
+ int access, u64 addr, u64 len, int type)
{
const struct vhost_umem_node *node;
struct vhost_umem *umem = vq->iotlb;
@@ -1220,7 +1220,7 @@ EXPORT_SYMBOL_GPL(vq_iotlb_prefetch);
/* Can we log writes? */
/* Caller should have device mutex but not vq mutex */
-int vhost_log_access_ok(struct vhost_dev *dev)
+bool vhost_log_access_ok(struct vhost_dev *dev)
{
return memory_access_ok(dev, dev->umem, 1);
}
@@ -1228,8 +1228,8 @@ EXPORT_SYMBOL_GPL(vhost_log_access_ok);
/* Verify access for write logging. */
/* Caller should have vq mutex and device mutex */
-static int vq_log_access_ok(struct vhost_virtqueue *vq,
- void __user *log_base)
+static bool vq_log_access_ok(struct vhost_virtqueue *vq,
+ void __user *log_base)
{
size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
@@ -1242,14 +1242,14 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
/* Can we start vq? */
/* Caller should have vq mutex and device mutex */
-int vhost_vq_access_ok(struct vhost_virtqueue *vq)
+bool vhost_vq_access_ok(struct vhost_virtqueue *vq)
{
if (!vq_log_access_ok(vq, vq->log_base))
- return 0;
+ return false;
/* Access validation occurs at prefetch time with IOTLB */
if (vq->iotlb)
- return 1;
+ return true;
return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
}
--
2.14.3
^ permalink raw reply related
* [PATCH v2 1/2] vhost: fix vhost_vq_access_ok() log check
From: Stefan Hajnoczi @ 2018-04-10 5:26 UTC (permalink / raw)
To: virtualization
Cc: syzkaller-bugs, mst, Linus Torvalds, kvm, jasowang, linux-kernel,
netdev, Stefan Hajnoczi
In-Reply-To: <20180410052630.11270-1-stefanha@redhat.com>
Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
when IOTLB is enabled") introduced a regression. The logic was
originally:
if (vq->iotlb)
return 1;
return A && B;
After the patch the short-circuit logic for A was inverted:
if (A || vq->iotlb)
return A;
return B;
This patch fixes the regression by rewriting the checks in the obvious
way, no longer returning A when vq->iotlb is non-NULL (which is hard to
understand).
Reported-by: syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
drivers/vhost/vhost.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5320039671b7..93fd0c75b0d8 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1244,10 +1244,12 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
/* Caller should have vq mutex and device mutex */
int vhost_vq_access_ok(struct vhost_virtqueue *vq)
{
- int ret = vq_log_access_ok(vq, vq->log_base);
+ if (!vq_log_access_ok(vq, vq->log_base))
+ return 0;
- if (ret || vq->iotlb)
- return ret;
+ /* Access validation occurs at prefetch time with IOTLB */
+ if (vq->iotlb)
+ return 1;
return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
}
--
2.14.3
^ permalink raw reply related
* [PATCH v2 0/2] vhost: fix vhost_vq_access_ok() log check
From: Stefan Hajnoczi @ 2018-04-10 5:26 UTC (permalink / raw)
To: virtualization
Cc: syzkaller-bugs, mst, Linus Torvalds, kvm, jasowang, linux-kernel,
netdev, Stefan Hajnoczi
v2:
* Rewrote the conditional to make the vq access check clearer [Linus]
* Added Patch 2 to make the return type consistent and harder to misuse [Linus]
The first patch fixes the vhost virtqueue access check which was recently
broken. The second patch replaces the int return type with bool to prevent
future bugs.
Stefan Hajnoczi (2):
vhost: fix vhost_vq_access_ok() log check
vhost: return bool from *_access_ok() functions
drivers/vhost/vhost.h | 4 +--
drivers/vhost/vhost.c | 70 ++++++++++++++++++++++++++-------------------------
2 files changed, 38 insertions(+), 36 deletions(-)
--
2.14.3
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox