Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net] ip_gre: clear feature flags when incompatible o_flags are set
From: Xin Long @ 2018-04-10 13:10 UTC (permalink / raw)
  To: Sabrina Dubroca; +Cc: network dev, Lorenzo Bianconi, William Tu
In-Reply-To: <a989fb36846d45587455e648ed4e8ff8707bcee2.1523357388.git.sd@queasysnail.net>

On Tue, Apr 10, 2018 at 6:57 PM, Sabrina Dubroca <sd@queasysnail.net> wrote:
> Commit dd9d598c6657 ("ip_gre: add the support for i/o_flags update via
> netlink") added the ability to change o_flags, but missed that the
> GSO/LLTX features are disabled by default, and only enabled some gre
> features are unused. Thus we also need to disable the GSO/LLTX features
> on the device when the TUNNEL_SEQ or TUNNEL_CSUM flags are set.
>
> These two examples should result in the same features being set:
>
>     ip link add gre_none type gre local 192.168.0.10 remote 192.168.0.20 ttl 255 key 0
>
>     ip link set gre_none type gre seq
>     ip link add gre_seq type gre local 192.168.0.10 remote 192.168.0.20 ttl 255 key 1 seq
>
> Fixes: dd9d598c6657 ("ip_gre: add the support for i/o_flags update via netlink")
> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
> ---
>  net/ipv4/ip_gre.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
> index a8772a978224..9c169bb2444d 100644
> --- a/net/ipv4/ip_gre.c
> +++ b/net/ipv4/ip_gre.c
> @@ -781,8 +781,14 @@ static void ipgre_link_update(struct net_device *dev, bool set_mtu)
>                     tunnel->encap.type == TUNNEL_ENCAP_NONE) {
>                         dev->features |= NETIF_F_GSO_SOFTWARE;
>                         dev->hw_features |= NETIF_F_GSO_SOFTWARE;
> +               } else {
> +                       dev->features &= ~NETIF_F_GSO_SOFTWARE;
> +                       dev->hw_features &= ~NETIF_F_GSO_SOFTWARE;
>                 }
>                 dev->features |= NETIF_F_LLTX;
> +       } else {
> +               dev->hw_features &= ~NETIF_F_GSO_SOFTWARE;
> +               dev->features &= ~(NETIF_F_LLTX | NETIF_F_GSO_SOFTWARE);
>         }
>  }
>
> --
> 2.16.2
>
Reviewed-by: Xin Long <lucien.xin@gmail.com>

^ permalink raw reply

* Re: [RFC PATCH v3 bpf-next 4/5] bpf/verifier: use call graph to efficiently check max stack depth
From: Jiong Wang @ 2018-04-10 12:54 UTC (permalink / raw)
  To: Edward Cree, Alexei Starovoitov, Daniel Borkmann; +Cc: netdev
In-Reply-To: <0d936412-a7fd-e861-f16b-91d970fd8896@solarflare.com>

On 06/04/2018 18:14, Edward Cree wrote:
<snip>
> Since that algorithm selects
> + * nodes in the order of the sorted output, we can do our processing in the loop
> + * that does the tsort, rather than storing the sorted list and then having a
> + * second loop to iterate over it and compute the total_stack_depth values.
>   */

IMO, seperate the sort from depth calculation is better than combining
them. The tsort of call graph could potentially be used in other places.

^ permalink raw reply

* Re: [RFC PATCH v3 bpf-next 0/5] bpf/verifier: subprog/func_call simplifications
From: Jiong Wang @ 2018-04-10 12:54 UTC (permalink / raw)
  To: Edward Cree, Alexei Starovoitov, Daniel Borkmann; +Cc: netdev
In-Reply-To: <b445b3c5-04f8-7425-2a3c-9bc54a04add2@solarflare.com>

On 06/04/2018 18:11, Edward Cree wrote:
> By storing subprog boundaries as a subprogno mark on each insn, rather than
>  a start (and implicit end) for each subprog, the code handling subprogs can
>  in various places be made simpler, more explicit, and more efficient (i.e.
>  better asymptotic complexity).

IMHO, the subprog number might be much smaller than insn number, so just
keeping sorted subprog information in env->subprog_info and do bsearch
looks reasonable to me.

While the new added "subprogno" field is only used round find_subprog, so
I feel it is a little bit heavy to allocate a delicated field that wouldn't
be used broadly for each instruction.

> This also lays the ground for possible
>  future work in which an extension of the check_subprogs() code to cover
>  basic blocks could replace check_cfg(), which currently largely duplicates
>  the insn-walking part.

I was thinking to remove the insn-walk duplication in the other way by
center them in the DFS insn-walk in check_cfg, as we can't remove that one
before we migrate to other loop detection method. This approach is
questionable though.

> Some other changes were also included to support this:
> * Per-subprog info is stored in env->subprog_info, an array of structs,
>   rather than several arrays with a common index.
> * Subprog numbers and corresponding subprog_info index are now 0-based;
>   subprog 0 is the main()-function of the program, starting at insn 0.
> * Call graph is now stored in the new bpf_subprog_info struct; used here
>   for check_max_stack_depth() but may have other uses too.

Ack above changes which center all subprog related info into
env->subprog_info and the new addition of callee field.

^ permalink raw reply

* Re: aquantia - BUG: unable to handle kernel NULL pointer dereference at 0000000000000058 on reboot
From: Igor Russkikh @ 2018-04-10 12:53 UTC (permalink / raw)
  To: Yanko Kaneti; +Cc: netdev
In-Reply-To: <5a5882e07351e92a5d57ea6d1b91fcab55e466bf.camel@declera.com>


On 10.04.2018 15:42, Yanko Kaneti wrote:
> Hello, 
> 
> Since 90869ddfefeb net: aquantia: Implement pci shutdown callback
> I get the below oops on reboot.  Without the callback everything works
> as expected.
> 

Thanks, we also recently found out that.

Could you please try the below patch?


diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
index c96a921..32f6d2e 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
@@ -951,9 +951,11 @@ void aq_nic_shutdown(struct aq_nic_s *self)
 
 	netif_device_detach(self->ndev);
 
-	err = aq_nic_stop(self);
-	if (err < 0)
-		goto err_exit;
+	if (netif_running(self->ndev)) {
+		err = aq_nic_stop(self);
+		if (err < 0)
+			goto err_exit;
+	}
 	aq_nic_deinit(self);
 
 err_exit:
-- 
2.7.4

^ permalink raw reply related

* aquantia - BUG: unable to handle kernel NULL pointer dereference at 0000000000000058  on reboot
From: Yanko Kaneti @ 2018-04-10 12:42 UTC (permalink / raw)
  To: Igor Russkikh; +Cc: netdev

Hello, 

Since 90869ddfefeb net: aquantia: Implement pci shutdown callback
I get the below oops on reboot.  Without the callback everything works
as expected.

05:00.0 Ethernet controller [0200]: Aquantia Corp. AQC107 NBase-T/IEEE 802.3bz Ethernet Controller [AQtion] [1d6a:d107] (rev 02)
Fedora rawhide config - atlantic is a module.

....
[ 6278.083643] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
[ 6278.091536] PGD 0 P4D 0 
[ 6278.094113] Oops: 0000 [#1] SMP NOPTI
[ 6278.097802] Modules linked in: nfsv3 nfs_acl nfs lockd grace fscache pppoe pppox ppp_synctty ppp_async ppp_generic slhc fuse xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc devlink ebtable_filter ebtables ip6table_filter ip6_tables btrfs xor zstd_compress raid6_pq zstd_decompress xxhash sunrpc vfat fat edac_mce_amd snd_usb_audio kvm_amd ccp kvm raid1 snd_usbmidi_lib irqbypass mxm_wmi joydev ftdi_sio crct10dif_pclmul snd_rawmidi crc32_pclmul ghash_clmulni_intel snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device fam15h_power sp5100_tco k10temp snd_pcm i2c_piix4 snd_timer snd sound
 core wmi shpchp
[ 6278.169417]  acpi_cpufreq binfmt_misc hid_logitech_hidpp hid_logitech_dj amdkfd amd_iommu_v2 amdgpu bnx2x chash i2c_algo_bit gpu_sched drm_kms_helper ttm nvme r8169 drm mii nvme_core mdio atlantic libcrc32c crc32c_intel i2c_dev [last unloaded: hwmon_vid]
[ 6278.192170] CPU: 6 PID: 1 Comm: systemd-shutdow Not tainted 4.17.0-0.rc0.git5.2.fc29.x86_64 #1
[ 6278.200859] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./990FXA-UD3 R5, BIOS F2 04/01/2015
[ 6278.211208] RIP: 0010:aq_vec_stop+0x27/0x80 [atlantic]
[ 6278.216399] RSP: 0018:ffffb82983133d30 EFLAGS: 00010246
[ 6278.221704] RAX: 0000000000000000 RBX: ffff95dbd43c1800 RCX: 0000000000000000
[ 6278.228905] RDX: 00000000000001ff RSI: ffff95dbd43c1958 RDI: 0000000000000000
[ 6278.236117] RBP: ffff95dbd43c1958 R08: 0000000000000064 R09: 0000000000000000
[ 6278.243326] R10: 0000000000000001 R11: 0000000000000025 R12: 0000000000000000
[ 6278.250537] R13: 0000000000000006 R14: ffff95dbda58c000 R15: 0000000000000000
[ 6278.257749] FS:  00007f05685d3940(0000) GS:ffff95dbfed80000(0000) knlGS:0000000000000000
[ 6278.265913] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6278.271727] CR2: 0000000000000058 CR3: 0000000819bc0000 CR4: 00000000000406e0
[ 6278.278940] Call Trace:
[ 6278.281420]  aq_nic_stop+0xfc/0x140 [atlantic]
[ 6278.285914]  aq_nic_shutdown+0x2c/0x40 [atlantic]
[ 6278.290671]  aq_pci_shutdown+0x15/0x40 [atlantic]
[ 6278.295446]  pci_device_shutdown+0x34/0x60
[ 6278.299614]  device_shutdown+0x14c/0x1f0
[ 6278.303609]  kernel_restart+0xe/0x30
[ 6278.307213]  SYSC_reboot+0x1cf/0x210
[ 6278.310821]  ? __alloc_fd+0x3d/0x140
[ 6278.314467]  ? do_writev+0x5c/0xf0
[ 6278.317925]  do_syscall_64+0x74/0x180
[ 6278.321619]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 6278.326883] Code: 00 00 00 00 0f 1f 44 00 00 41 54 55 53 8b 47 18 48 89 fb 85 c0 74 4f 48 8d af 58 01 00 00 45 31 e4 48 8b 03 48 89 ee 48 8b 7b 08 <48> 8b 40 58 e8 10 89 92 e7 48 8b 03 48 8d 75 60 48 8b 7b 08 48 
[ 6278.345930] RIP: aq_vec_stop+0x27/0x80 [atlantic] RSP: ffffb82983133d30
[ 6278.352607] CR2: 0000000000000058
[ 6278.356010] ---[ end trace c32f052b515f1b02 ]---
[ 6278.363883] watchdog: watchdog0: watchdog did not stop!
[ 6278.369133] systemd-shutdow: 12 output lines suppressed due to ratelimiting
[ 6278.376180] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
[ 6278.376180] 
[ 6278.385421] Kernel Offset: 0x26000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 6278.396279] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
[ 6278.396279]  ]---
.....

- Yanko

^ permalink raw reply

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
From: Thomas Gleixner @ 2018-04-10 12:37 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	anna-maria, henrik, John Stultz, levi.pearson, edumazet, willemb,
	mlichvar
In-Reply-To: <0369f48c-b48e-ce27-1988-8bc0ec65bf13@intel.com>

Jesus,

On Mon, 9 Apr 2018, Jesus Sanchez-Palencia wrote:
> On 03/28/2018 12:48 AM, Thomas Gleixner wrote:
> > Yes, you have the root qdisc, which is in charge of the overall scheduling
> > plan, how complex or not it is defined does not matter. It exposes traffic
> > classes which have properties defined by the configuration.
> 
> Perfect. Let's see if we can agree on an overall plan, then. Hopefully I'm not
> missing anything.
> 
> For the above we'll develop a new qdisc, designed along the 'taprio' ideas, thus
> a Qbv style scheduler, to be used as root qdisc. It can run the schedule inside
> the kernel or just offload it to the NIC if supported. Similarly to the other
> multiqueue qdiscs, it will expose the HW Tx queues.
> 
> What is new here from the ideas we shared last year is that this new root qdisc
> will be responsible for calling the attached qdiscs' dequeue functions during
> their timeslices, making it the only entity capable of enqueueing packets into
> the NIC.

Correct. Aside of that it's the entity which is in charge of the overall
scheduling.

> This is the "global scheduler", but we still need the txtime aware
> qdisc. For that, we'll modify tbs to accommodate the feedback from this
> thread. More below.

> > The qdiscs which are attached to those traffic classes can be anything
> > including:
> >
> >  - Simple feed through (Applications are time contraints aware and set the
> >    exact schedule). qdisc has admission control.
> 
> This will be provided by the tbs qdisc. It will still provide a txtime sorted
> list and hw offload, but now there will be a per-socket option that tells the
> qdisc if the per-packet timestamp is the txtime (i.e. explicit mode, as you've
> called it) or a deadline. The drop_if_late flag will be removed.
> 
> When in explicit mode, packets from that socket are dequeued from the qdisc
> during its time slice if their [(txtime - delta) < now].
> 
> >
> >  - Deadline aware qdisc to handle e.g. A/V streams. Applications are aware
> >    of time constraints and provide the packet deadline. qdisc has admission
> >    control. This can be a simple first comes, first served scheduler or
> >    something like EDF which allows optimized utilization. The qdisc sets
> >    the TX time depending on the deadline and feeds into the root.
> 
> This will be provided by tbs if the socket which is transmitting packets is
> configured for deadline mode.

You don't want the socket to decide that. The qdisc into which a socket
feeds defines the mode and the qdisc rejects requests with the wrong mode.

Making a qdisc doing both and let the user decide what he wants it to be is
not really going to fly. Especially if you have different users which want
a different mode. It's clearly distinct functionality.

Please stop trying to develop swiss army knifes with integrated coffee
machines.

> For the deadline -> txtime conversion, what I have in mind is: when dequeue is
> called tbs will just change the skbuff's timestamp from the deadline to 'now'
> (i.e. as soon as possible) and dequeue the packet. Would that be enough or
> should we use the delta parameter of the qdisc on this case add make [txtime =
> now + delta]? The only benefit of doing so would be to provide a configurable
> 'fudge' factor.

Well, that really depends on how your deadline scheduler works.

> Another question for this mode (but perhaps that applies to both modes) is, what
> if the qdisc misses the deadline for *any* reason? I'm assuming it should drop
> the packet during dequeue.

There the question is how user space is notified about that issue. The
application which queued the packet on time does rightfully assume that
it's going to be on the wire on time.

This is a violation of the overall scheduling plan, so you need to have
a sane design to handle that.

> Putting it all together, we end up with:
> 
> 1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look like:
> $ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 150000 offload sorting

Why CLOCK_REALTIME? The only interesting time in a TSN network is
CLOCK_TAI, really.

> 2) a new cmsg-interface for setting a per-packet timestamp that will be used
> either as a txtime or as deadline by tbs (and further the NIC driver for the
> offlaod case): SCM_TXTIME.
> 
> 3) a new socket option: SO_TXTIME. It will be used to enable the feature for a
> socket, and will have as parameters a clockid and a txtime mode (deadline or
> explicit), that defines the semantics of the timestamp set on packets using
> SCM_TXTIME.
> 
> 4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .

Can you remind me why we would need that?

> 5) a new schedule-aware qdisc, 'tas' or 'taprio', to be used per port. Its cli
> will look like what was proposed for taprio (base time being an absolute timestamp).
> 
> If we all agree with the above, we will start by closing on 1-4 asap and will
> focus on 5 next.
> 
> How does that sound?

Backwards to be honest.

You should start with the NIC facing qdisc because that's the key part of
all this and the design might have implications on how the qdiscs which
feed into it need to be designed.

Thanks,

	tglx

^ permalink raw reply

* [PATCH] selftests: bpf: update .gitignore with missing generated files
From: Anders Roxell @ 2018-04-10 12:24 UTC (permalink / raw)
  To: daniel, ast, shuah; +Cc: netdev, linux-kernel, linux-kselftest, Anders Roxell

Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
---
 tools/testing/selftests/bpf/.gitignore | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 9cf83f895d98..5e1ab2f0eb79 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -12,3 +12,6 @@ test_tcpbpf_user
 test_verifier_log
 feature
 test_libbpf_open
+test_sock
+test_sock_addr
+urandom_read
-- 
2.16.3

^ permalink raw reply related

* [PATCH] lan78xx: Don't reset the interface on open
From: Phil Elwell @ 2018-04-10 12:18 UTC (permalink / raw)
  To: Woojung Huh, Microchip Linux Driver Support, Alexander Graf,
	Thomas Bogendoerfer, netdev, linux-usb, linux-kernel
  Cc: Phil Elwell

Commit 92571a1aae40 ("lan78xx: Connect phy early") moves the PHY
initialisation into lan78xx_probe, but lan78xx_open subsequently calls
lan78xx_reset. As well as forcing a second round of link negotiation,
this reset frequently prevents the phy interrupt from being generated
(even though the link is up), rendering the interface unusable.

Fix this issue by removing the lan78xx_reset call from lan78xx_open.

Fixes: 92571a1aae40 ("lan78xx: Connect phy early")
Signed-off-by: Phil Elwell <phil@raspberrypi.org>
---
 drivers/net/usb/lan78xx.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index aff105f..108f04a 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -2514,10 +2514,6 @@ static int lan78xx_open(struct net_device *net)
 	if (ret < 0)
 		goto out;

-	ret = lan78xx_reset(dev);
-	if (ret < 0)
-		goto done;
-
 	phy_start(net->phydev);

 	netif_dbg(dev, ifup, dev->net, "phy initialised successfully");
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH v2 2/2] af_key: Use DIV_ROUND_UP() instead of open-coded equivalent
From: Kevin Easton @ 2018-04-10 11:38 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: Herbert Xu, David S. Miller, netdev, linux-kernel
In-Reply-To: <20180409103442.oiib3bi7rms5sgg3@gauss3.secunet.de>

On Mon, Apr 09, 2018 at 12:34:42PM +0200, Steffen Klassert wrote:
> On Sat, Apr 07, 2018 at 11:40:47AM -0400, Kevin Easton wrote:
> > Several places use (x + 7) / 8 to convert from a number of bits to a number
> > of bytes.  Replace those with DIV_ROUND_UP(x, 8) instead, for consistency
> > with other parts of the same file.
> > 
> > Signed-off-by: Kevin Easton <kevin@guarana.org>
> 
> Please resubmit this one to ipsec-next after the
> merge window. Thanks!

Will do!

    - Kevin

^ permalink raw reply

* Re: [PATCH] ieee802154: mcr20a: Fix memory leak in mcr20a_probe
From: Stefan Schmidt @ 2018-04-10 10:58 UTC (permalink / raw)
  To: Gustavo A. R. Silva, Xue Liu, Alexander Aring
  Cc: linux-wpan, netdev, linux-kernel
In-Reply-To: <20180405162006.GA2199@embeddedor.com>

Hello.


On 04/05/2018 06:20 PM, Gustavo A. R. Silva wrote:
> Free allocated memory for pdata before return.
>
> Addresses-Coverity-ID: 1466096 ("Resource leak")
> Fixes: 8c6ad9cc5157 ("ieee802154: Add NXP MCR20A IEEE 802.15.4 transceiver driver")
> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
> ---
>  drivers/net/ieee802154/mcr20a.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ieee802154/mcr20a.c b/drivers/net/ieee802154/mcr20a.c
> index 55a22c7..944470d 100644
> --- a/drivers/net/ieee802154/mcr20a.c
> +++ b/drivers/net/ieee802154/mcr20a.c
> @@ -1267,7 +1267,7 @@ mcr20a_probe(struct spi_device *spi)
>  	ret = mcr20a_get_platform_data(spi, pdata);
>  	if (ret < 0) {
>  		dev_crit(&spi->dev, "mcr20a_get_platform_data failed.\n");
> -		return ret;
> +		goto free_pdata;
>  	}
>  
>  	/* init reset gpio */
> @@ -1275,7 +1275,7 @@ mcr20a_probe(struct spi_device *spi)
>  		ret = devm_gpio_request_one(&spi->dev, pdata->rst_gpio,
>  					    GPIOF_OUT_INIT_HIGH, "reset");
>  		if (ret)
> -			return ret;
> +			goto free_pdata;
>  	}
>  
>  	/* reset mcr20a */
> @@ -1291,7 +1291,8 @@ mcr20a_probe(struct spi_device *spi)
>  	hw = ieee802154_alloc_hw(sizeof(*lp), &mcr20a_hw_ops);
>  	if (!hw) {
>  		dev_crit(&spi->dev, "ieee802154_alloc_hw failed\n");
> -		return -ENOMEM;
> +		ret = -ENOMEM;
> +		goto free_pdata;
>  	}
>  
>  	/* init mcr20a local data */
> @@ -1366,6 +1367,8 @@ mcr20a_probe(struct spi_device *spi)
>  
>  free_dev:
>  	ieee802154_free_hw(lp->hw);
> +free_pdata:
> +	kfree(pdata);
>  
>  	return ret;
>  }

This patch has been applied to the wpan tree and will be
part of the next pull request to net. Thanks!

regards
Stefan Schmidt

^ permalink raw reply

* [PATCH net] ip_gre: clear feature flags when incompatible o_flags are set
From: Sabrina Dubroca @ 2018-04-10 10:57 UTC (permalink / raw)
  To: netdev; +Cc: Sabrina Dubroca, Lorenzo Bianconi, Xin Long, William Tu

Commit dd9d598c6657 ("ip_gre: add the support for i/o_flags update via
netlink") added the ability to change o_flags, but missed that the
GSO/LLTX features are disabled by default, and only enabled some gre
features are unused. Thus we also need to disable the GSO/LLTX features
on the device when the TUNNEL_SEQ or TUNNEL_CSUM flags are set.

These two examples should result in the same features being set:

    ip link add gre_none type gre local 192.168.0.10 remote 192.168.0.20 ttl 255 key 0

    ip link set gre_none type gre seq
    ip link add gre_seq type gre local 192.168.0.10 remote 192.168.0.20 ttl 255 key 1 seq

Fixes: dd9d598c6657 ("ip_gre: add the support for i/o_flags update via netlink")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
 net/ipv4/ip_gre.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index a8772a978224..9c169bb2444d 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -781,8 +781,14 @@ static void ipgre_link_update(struct net_device *dev, bool set_mtu)
 		    tunnel->encap.type == TUNNEL_ENCAP_NONE) {
 			dev->features |= NETIF_F_GSO_SOFTWARE;
 			dev->hw_features |= NETIF_F_GSO_SOFTWARE;
+		} else {
+			dev->features &= ~NETIF_F_GSO_SOFTWARE;
+			dev->hw_features &= ~NETIF_F_GSO_SOFTWARE;
 		}
 		dev->features |= NETIF_F_LLTX;
+	} else {
+		dev->hw_features &= ~NETIF_F_GSO_SOFTWARE;
+		dev->features &= ~(NETIF_F_LLTX | NETIF_F_GSO_SOFTWARE);
 	}
 }
 
-- 
2.16.2

^ permalink raw reply related

* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Jiri Pirko @ 2018-04-10 10:55 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh
In-Reply-To: <16b2e531-7bfa-7f25-2702-f3f8069663ee@intel.com>

Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudrala@intel.com wrote:
>On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>> Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>> > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> > > Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
>> [...]
>> 
>> > > > +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>> > > > +				     struct net_device *child_netdev)
>> > > > +{
>> > > > +	struct virtnet_bypass_info *vbi;
>> > > > +	bool backup;
>> > > > +
>> > > > +	vbi = netdev_priv(bypass_netdev);
>> > > > +	backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>> > > > +	if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > > > +			rtnl_dereference(vbi->active_netdev)) {
>> > > > +		netdev_info(bypass_netdev,
>> > > > +			    "%s attempting to join bypass dev when %s already present\n",
>> > > > +			    child_netdev->name, backup ? "backup" : "active");
>> > > Bypass module should check if there is already some other netdev
>> > > enslaved and refuse right there.
>> > This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>> > as its bypass_netdev is same as the backup_netdev.
>> > Will add a flag while registering with the bypass module to indicate if the driver is doing
>> > a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>> > for 3 netdev scenario.
>> Just let me undestand it clearly. What I expect the difference would be
>> between 2netdev and3 netdev model is this:
>> 2netdev:
>>    bypass_master
>>       /
>>      /
>> VF_slave
>> 
>> 3netdev:
>>    bypass_master
>>       /     \
>>      /       \
>> VF_slave   backup_slave
>> 
>> Is that correct? If not, how does it look like?
>> 
>> 
>Looks correct.
>VF_slave and backup_slave are the original netdevs and are present in both the models.
>In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
>marked as the 2 slaves of this new netdev.

You say it looks correct and in another sentence you provide completely
different description. Could you please look again?

[...]

^ permalink raw reply

* Re: [PATCH] ieee802154: mcr20a: Fix memory leak in mcr20a_probe
From: Stefan Schmidt @ 2018-04-10 10:26 UTC (permalink / raw)
  To: Xue Liu, Gustavo A. R. Silva
  Cc: Alexander Aring, linux-wpan - ML, netdev, linux-kernel
In-Reply-To: <CAJuUDwtUUv_nMEwijRydWmseL1aYOAoAfEdLVnoHWPWKdVgS-A@mail.gmail.com>

Hello.

On 04/10/2018 11:54 AM, Xue Liu wrote:
> Hallo,
>
> Thanks for the fix. It looks good.
>
> ACK-by: Xue Liu <liuxuenetmail@gmail.com>

Thanks for the ACK Xue. The correct format would be

Acked-by: Xue Liu <liuxuenetmail@gmail.com>

That way patchwork also picks the ACK up and adds it to the patch before I apply it from there.

No worries, I did this manually for you this time. For the next time please use the correct format to have this tracked.

regards
Stefan Schmidt

^ permalink raw reply

* Re: [RFC bpf-next] bpf: document eBPF helpers and add a script to generate man page
From: Quentin Monnet @ 2018-04-10 10:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, ast, netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180410014751.mqwqdyujvybir6g5@ast-mbp.dhcp.thefacebook.com>

2018-04-09 18:47 UTC-0700 ~ Alexei Starovoitov
<alexei.starovoitov@gmail.com>
> On Mon, Apr 09, 2018 at 02:25:26PM +0100, Quentin Monnet wrote:
>>
>> Anyway, I am fine with keeping just signatures, descriptions and return
>> values for now. I will submit a new version with only those items.
> 
> Thank you.
> 
> Could you also split it into few patches?
>  include/uapi/linux/bpf.h   | 2237 ++++++++++++++++++++++++++++++++++++--------
>  scripts/bpf_helpers_doc.py |  568 +++++++++++
>  2 files changed, 2429 insertions(+), 376 deletions(-)
> 
> replying back and forth on a single patch of such size will be tedious
> for others to follow.
> May be document ~10 helpers at a time ? Total of ~7 patches and extra
> patch for .py ?
> 

Sure, I'll do that. And I'll try to group helpers in a patch by author,
it should also help for reviewing the descriptions.

Quentin

^ permalink raw reply

* Re: XDP_TX for virtio_net not working in recent kernel?
From: Daniel Borkmann @ 2018-04-10 10:17 UTC (permalink / raw)
  To: Kimitoshi Takahashi; +Cc: xdp-newbies, jasowang, netdev
In-Reply-To: <47428621-2f24-3c8e-5d34-1d788a86a455@nii.ac.jp>

[ +jason +netdev ]

On 04/10/2018 12:49 AM, Kimitoshi Takahashi wrote:
> Hello,
> 
> I'm totally newbie.
> 
> I want to try some examples under the samples/bpf of the kernel tree using the virtio_net driver in kvm, in order to get familiar with the XDP. However, it seems to me that for kernels newer than linux-4.15, the xdp2 and xdp_tx_tunnel are not working in the case of virtio_net.
> 
> Can somebody help me out to make them working?
> 
> 
> 1) For linux-4.11.12, linux-4.12.14, and linux-4.14.31, xdp2 and xdp_tx_tunnel are working.
> 
> [xdp2]
> 10.0.0.27# ./xdp2 2 (or ./xdp -N 2)
> 
> 10.0.0.254# hping3 10.0.0.27 -p 5000 -2
> 
> 10.0.0.254# tcpdump -nei kbr0 udp port 5000
> 06:18:37.677070 2e:c9:83:fb:5b:f3 > 52:54:00:11:00:1b, ethertype IPv4 (0x0800), length 42: 10.0.0.254.2982 > 10.0.0.27.5000: UDP, length 0
> 06:18:37.677265 52:54:00:11:00:1b > 2e:c9:83:fb:5b:f3, ethertype IPv4 (0x0800), length 42: 10.0.0.254.2982 > 10.0.0.27.5000: UDP, length 0
> 
> I can see the UDP packets with MAC addresses swapped.
> 
> [xdp_tx_tunnel]
> 
> 10.0.0.254# telnet 10.1.2.1 80
> 
> 10.0.0.27# ./xdp_tx_iptunnel -i 2 -a 10.1.2.1 -p 80 -s 10.0.0.27 -d 10.0.0.24 -m 52:54:00:11:00:18
> 
> 10.0.0.24# tcpdump -nei any proto 4
> 22:08:20.510354  In 52:54:00:11:00:1b ethertype IPv4 (0x0800), length 96: 10.0.0.27 > 10.0.0.24: 10.0.0.254.60826 > 10.1.2.1.80: Flags [S], seq 3224377720, win 29200, options [mss 1460,sackOK,TS val 217801953 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
> 
> I can see the encapsulated SYN packet that is forwarded by the 10.0.0.27.
> 
> 2) For linux-4.15, linux-4.15.5, linux-4.15.13 and linux-4.16.1, xdp2 and xdp_tx_tunnel are NOT working.
> 
> [xdp2]
> 10.0.0.27# ./xdp2 -N 2
> 
> 10.0.0.254# hping3 10.0.0.27 -p 5000 -2
> 
> 10.0.0.254# tcpdump -nei br0 udp port 5000
> 06:45:34.810046 2e:c9:83:fb:5b:f3 > 52:54:00:11:00:1b, ethertype IPv4 (0x0800), length 42: 10.0.0.254.2346 > 10.0.0.27.5000: UDP, length 0
> 06:45:35.810176 2e:c9:83:fb:5b:f3 > 52:54:00:11:00:1b, ethertype IPv4 (0x0800), length 42: 10.0.0.254.2347 > 10.0.0.27.5000: UDP, length 0
> 
> Only the outgoing packets can be seen.
> 
> [xdp_tx_tunnel]
> 
> 10.0.0.254# telnet 10.1.2.1 80
> 
> 10.0.0.27# ./xdp_tx_iptunnel -i 2 -a 10.1.2.1 -p 80 -s 10.0.0.27 -d 10.0.0.24 -m 52:54:00:11:00:18
> 
> 10.0.0.24# tcpdump -nei any proto 4
> 
> I can NOT see the encapsulated SYN packet that is forwarded by the 10.0.0.27.
> 
> 
> -- 
> Sincerely,
> Takahashi Kimitoshi

^ permalink raw reply

* Re: [iovisor-dev] Best userspace programming API for XDP features query to kernel?
From: Daniel Borkmann @ 2018-04-10 10:13 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Daniel Borkmann
  Cc: Eric Leblond, Victor Julien, Peter Manev, oisf-devel,
	Jakub Kicinski, Alexei Starovoitov, iovisor-dev@lists.iovisor.org,
	Saeed Mahameed, netdev, davem, Jiri Benc
In-Reply-To: <20180409132635.7d5e1bda@redhat.com>

On 04/09/2018 01:26 PM, Jesper Dangaard Brouer wrote:
> On Fri, 6 Apr 2018 12:36:18 +0200
> Daniel Borkmann <daniel@iogearbox.net> wrote:
[...]
>> [...]
>> The underlying problem is effectively the decoupling of program
>> verification that doesn't have/know the context of where it is being
>> attached to in this case. 
> 
> Yes, exactly, that the underlying problem. (plus the first XDP prog
> gets verified and driver attached, and subsequent added bpf tail calls,
> cannot be "rejected" (at "driver attachment time") as its too late).
> 
>> Thinking out loud for a sec on couple of other options aside
>> from feature bits, what about i) providing the target ifindex to the
>> verifier for XDP programs, such that at verification time you have the
>> full context similar to nfp offload case today, 
> 
> I do like that we could provide the target ifindex earlier.  But
> userspace still need some way to express that it e.g. need the
> XDP_REDIRECT features (as verifier cannot reliability detect the action
> return codes used, as discussed before, due to bpf tail calls and maps
> used as return values).

(See below on the detection of it.)

>> or ii) populating some
>> XDP specific auxillary data to the BPF program at verification time such
>> that the driver can check at program attach time whether the requested
>> features are possible and if not it will reject and respond with netlink
>> extack message to the user (as we do in various XDP attach cases already
>> through XDP_SETUP_PROG{,_HW}).
> 
> I like proposal ii) better.  But how do I specify/input that I need
> e.g. the XDP_REDIRECT feature, such that is it avail to "the BPF
> program at verification time"?
> 
> My proposal iii), what about at XDP attachment time, create a new
> netlink attribute that describe XDP action codes the program
> needs/wants. If the info is provided, the ndo_bpf call check and
> reject, and respond with netlink extack message.
>   If I want to query for avail action codes, then I can use the same
> netlink attribute format, and kernel will return it "populated" with
> the info.
> 
> It is very useful that the program gets rejected at attachment time (to
> avoid loading an XDP prog that silently drops packets). BUT I also
> want a query option/functionality (reuse netlink attr format).
> 
> Specifically for Suricata the same "bypass" features can be implemented
> at different levels (XDP, XDP-generic or clsbpf), and the XDP program
> contains multiple features.  Thus, depending on what NIC driver
> supports, we want to load different XDP and/or clsbpf/TC BPF-programs.
> Thus, still support the same user requested feature/functionality, even
> if XDP_REDIRECT is not avail, just slightly slower.

Makes sense to have such fallbacks.

>> This would, for example, avoid the need for feature bits, and do actual
>> rejection of the program while retaining flexibility (and avoiding to
>> expose bits that over time hopefully will deprecate anyway due to all
>> XDP aware drivers implementing them). For both cases i) and ii), it
>> would mean we make the verifier a bit smarter with regards to keeping
>> track of driver related (XDP) requirements. Program return code checking
>> is already present in the verifier's check_return_code() and we could
>> extend it for XDP as well, for example. Seems cleaner and more extensible
>> than feature bits, imho.
> 
> I thought the verifier's check_return_code() couldn't see/verify if the
> XDP return action code is provided as a map lookup result(?). How is
> that handled?

For the latter, I would just add a BPF_F_STRICT_CONST_VERDICT load flag
which e.g. enforces a constant return code in all prog types. It also needs
to check for helpers like bpf_xdp_redirect() and track R0 from there that
it either contains XDP_REDIRECT or XDP_ABORTED.

For the bpf_prog_aux, we need a prog dependent void *private_data area
pointer in bpf_prog_aux that verifier populates; e.g. we could migrate
already some of the prog type specific fields into that like kprobe_override,
cb_access, dst_needed that are non-generic anyway. For XDP, verifier would
in your case record all seen return codes into private_data. When the flag
BPF_F_STRICT_CONST_VERDICT is not used and we noticed there were cases
where the verdict was not a verifiable constant, then we e.g. mark all
XDP return codes as seen. Potentially the same is needed for tail calls.

We can add a ndo_bpf query to drivers that return their supported XDP return
codes and compare them in e.g. dev_change_xdp_fd() out of generic code and
reject with extack.

For tail calls, the only way that comes to mind right now where you could
lift that requirement with having to mark all return codes as seen is that
you'd need to pass the ifindex as in offload case at load time, such that
you the program becomes tied to the device. Then you also need to record
dev pointer in the private_data such that you can add a new callback to
bpf_prog_ops for prog dependent compare in bpf_prog_array_compatible() to
make sure both would have the same target owner dev where the return code
check was previously asserted.

If you want to expose that internal ndo_bpf query which specifically returns
the set of supported XDP return codes I wouldn't mind, that way it's not
some sort of generic feature bit query, but a specific query for the set of
return codes the driver supports, thus keeping this very limited and avoiding
mixing this with other future feature bits that could turn this into a big
mess; I'm not sure right now though what would be the best uapi to query
that info from.

Cheers,
Daniel

^ permalink raw reply

* Re: [PATCH] ieee802154: mcr20a: Fix memory leak in mcr20a_probe
From: Xue Liu @ 2018-04-10  9:54 UTC (permalink / raw)
  To: Gustavo A. R. Silva
  Cc: Alexander Aring, Stefan Schmidt, linux-wpan - ML, netdev,
	linux-kernel
In-Reply-To: <20180405162006.GA2199@embeddedor.com>

Hallo,

Thanks for the fix. It looks good.

ACK-by: Xue Liu <liuxuenetmail@gmail.com>

On 5 April 2018 at 18:20, Gustavo A. R. Silva <gustavo@embeddedor.com> wrote:
> Free allocated memory for pdata before return.
>
> Addresses-Coverity-ID: 1466096 ("Resource leak")
> Fixes: 8c6ad9cc5157 ("ieee802154: Add NXP MCR20A IEEE 802.15.4 transceiver driver")
> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
> ---
>  drivers/net/ieee802154/mcr20a.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ieee802154/mcr20a.c b/drivers/net/ieee802154/mcr20a.c
> index 55a22c7..944470d 100644
> --- a/drivers/net/ieee802154/mcr20a.c
> +++ b/drivers/net/ieee802154/mcr20a.c
> @@ -1267,7 +1267,7 @@ mcr20a_probe(struct spi_device *spi)
>         ret = mcr20a_get_platform_data(spi, pdata);
>         if (ret < 0) {
>                 dev_crit(&spi->dev, "mcr20a_get_platform_data failed.\n");
> -               return ret;
> +               goto free_pdata;
>         }
>
>         /* init reset gpio */
> @@ -1275,7 +1275,7 @@ mcr20a_probe(struct spi_device *spi)
>                 ret = devm_gpio_request_one(&spi->dev, pdata->rst_gpio,
>                                             GPIOF_OUT_INIT_HIGH, "reset");
>                 if (ret)
> -                       return ret;
> +                       goto free_pdata;
>         }
>
>         /* reset mcr20a */
> @@ -1291,7 +1291,8 @@ mcr20a_probe(struct spi_device *spi)
>         hw = ieee802154_alloc_hw(sizeof(*lp), &mcr20a_hw_ops);
>         if (!hw) {
>                 dev_crit(&spi->dev, "ieee802154_alloc_hw failed\n");
> -               return -ENOMEM;
> +               ret = -ENOMEM;
> +               goto free_pdata;
>         }
>
>         /* init mcr20a local data */
> @@ -1366,6 +1367,8 @@ mcr20a_probe(struct spi_device *spi)
>
>  free_dev:
>         ieee802154_free_hw(lp->hw);
> +free_pdata:
> +       kfree(pdata);
>
>         return ret;
>  }
> --
> 2.7.4
>

^ permalink raw reply

* Re: [PATCH] slip: Check if rstate is initialized before uncompressing
From: Guillaume Nault @ 2018-04-10  9:48 UTC (permalink / raw)
  To: tejaswit; +Cc: David Miller, netdev
In-Reply-To: <ed7bd53914f26c2225c6c00d16bffb35@codeaurora.org>

On Tue, Apr 10, 2018 at 11:28:10AM +0530, tejaswit@codeaurora.org wrote:
> On 2018-04-09 20:34, David Miller wrote:
> > From: Tejaswi Tanikella <tejaswit@codeaurora.org>
> > Date: Mon, 9 Apr 2018 14:23:49 +0530
> > 
> > > @@ -673,6 +677,7 @@ struct slcompress *
> > >  	if (cs->cs_tcp.doff > 5)
> > >  	  memcpy(cs->cs_tcpopt, icp + ihl*4 + sizeof(struct tcphdr),
> > > (cs->cs_tcp.doff - 5) * 4);
> > >  	cs->cs_hsize = ihl*2 + cs->cs_tcp.doff*2;
> > > +	cs->initialized = 1;
> > >  	/* Put headers back on packet
> >  ...
> > >  struct cstate {
> > >  	byte_t	cs_this;	/* connection id number (xmit) */
> > > +	byte_t	initialized;	/* non-zero if initialized */
> > 
> > Please use 'bool' and true/false for 'initialized'.
> 
> Made the changes.

Hi Tejaswi,

Please send the new version of your patch as fresh new submission, with
proper subject prefix. In this case, it should be [PATCH net v2]. 'net'
because this is a bugfix and it should therefore target the 'net' tree.
'v2' because that's the second version of this series.

Also it'd be good if you could add a proper 'Fixes' tag in order to
help with stable backports. If the bug has always been there, just say
so.

I have no expertise on slhc, but overall, the patch content looks fine.

Guillaume

^ permalink raw reply

* RE: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Liang, Cunming @ 2018-04-10  9:23 UTC (permalink / raw)
  To: Paolo Bonzini, Bie, Tiwei, Jason Wang
  Cc: mst@redhat.com, alex.williamson@redhat.com, ddutile@redhat.com,
	Duyck, Alexander H, virtio-dev@lists.oasis-open.org,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	virtualization@lists.linux-foundation.org, netdev@vger.kernel.org,
	Daly, Dan, Wang, Zhihong, Tan, Jianfeng, Wang, Xiao W
In-Reply-To: <7ee31a12-a370-fc43-82a6-2235f598e970@redhat.com>

> -----Original Message-----
> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
> Sent: Tuesday, April 10, 2018 3:52 PM
> To: Bie, Tiwei <tiwei.bie@intel.com>; Jason Wang <jasowang@redhat.com>
> Cc: mst@redhat.com; alex.williamson@redhat.com; ddutile@redhat.com;
> Duyck, Alexander H <alexander.h.duyck@intel.com>; virtio-dev@lists.oasis-
> open.org; linux-kernel@vger.kernel.org; kvm@vger.kernel.org;
> virtualization@lists.linux-foundation.org; netdev@vger.kernel.org; Daly, Dan
> <dan.daly@intel.com>; Liang, Cunming <cunming.liang@intel.com>; Wang,
> Zhihong <zhihong.wang@intel.com>; Tan, Jianfeng <jianfeng.tan@intel.com>;
> Wang, Xiao W <xiao.w.wang@intel.com>
> Subject: Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware
> vhost backend
> 
> On 10/04/2018 06:57, Tiwei Bie wrote:
> >> So you just move the abstraction layer from qemu to kernel, and you
> >> still need different drivers in kernel for different device
> >> interfaces of accelerators. This looks even more complex than leaving
> >> it in qemu. As you said, another idea is to implement userspace vhost
> >> backend for accelerators which seems easier and could co-work with
> >> other parts of qemu without inventing new type of messages.
> >
> > I'm not quite sure. Do you think it's acceptable to add various vendor
> > specific hardware drivers in QEMU?
> 
> I think so.  We have vendor-specific quirks, and at some point there was an
> idea of using quirks to implement (vendor-specific) live migration support for
> assigned devices.

Vendor-specific quirks of accessing VGA is a small portion. Other major portions are still handled by guest driver.

While in this case, when saying various vendor specific drivers in QEMU, it says QEMU takes over and provides the entire user space device drivers. Some parts are even not relevant to vhost, they're basic device function enabling. Moreover, it could be different kinds of devices(network/block/...) under vhost. No matter # of vendors or # of types, total LOC is not small.

The idea is to avoid introducing these extra complexity out of QEMU, keeping vhost adapter simple. As vhost protocol is de factor standard, it leverages kernel device driver to provide the diversity. Changing once in QEMU, then it supports multi-vendor devices whose drivers naturally providing kernel driver there.

If QEMU is going to build a user space driver framework there, we're open mind on that, even leveraging DPDK as the underlay library. Looking forward to more others' comments from community.

Steve

> 
> Paolo

^ permalink raw reply

* [PATCH linux] net: fix deadlock while clearing neighbor proxy table
From: Wolfgang Bumiller @ 2018-04-10  9:15 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Hideaki YOSHIFUJI

When coming from ndisc_netdev_event() in net/ipv6/ndisc.c,
neigh_ifdown() is called with &nd_tbl, locking this while
clearing the proxy neighbor entries when eg. deleting an
interface. Calling the table's pndisc_destructor() with the
lock still held, however, can cause a deadlock: When a
multicast listener is available an IGMP packet of type
ICMPV6_MGM_REDUCTION may be sent out. When reaching
ip6_finish_output2(), if no neighbor entry for the target
address is found, __neigh_create() is called with &nd_tbl,
which it'll want to lock.

Move the elements into their own list, then unlock the table
and perform the destruction.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199289
Fixes: 6fd6ce2056de ("ipv6: Do not depend on rt->n in ip6_finish_output2().")
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
---
I do feel bad about moving the unlock call. Perhaps returning the
freelist and dealing with it in neigh_ifdown() would be better?

 net/core/neighbour.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 7b7a14abba28..601df647588c 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -292,7 +292,6 @@ int neigh_ifdown(struct neigh_table *tbl, struct net_device *dev)
 	write_lock_bh(&tbl->lock);
 	neigh_flush_dev(tbl, dev);
 	pneigh_ifdown(tbl, dev);
-	write_unlock_bh(&tbl->lock);

 	del_timer_sync(&tbl->proxy_timer);
 	pneigh_queue_purge(&tbl->proxy_queue);
@@ -683,7 +682,7 @@ int pneigh_delete(struct neigh_table *tbl, struct net *net, const void *pkey,

 static int pneigh_ifdown(struct neigh_table *tbl, struct net_device *dev)
 {
-	struct pneigh_entry *n, **np;
+	struct pneigh_entry *n, **np, *freelist = NULL;
 	u32 h;

 	for (h = 0; h <= PNEIGH_HASHMASK; h++) {
@@ -691,16 +690,23 @@ static int pneigh_ifdown(struct neigh_table *tbl, struct net_device *dev)
 		while ((n = *np) != NULL) {
 			if (!dev || n->dev == dev) {
 				*np = n->next;
-				if (tbl->pdestructor)
-					tbl->pdestructor(n);
-				if (n->dev)
-					dev_put(n->dev);
-				kfree(n);
+				n->next = freelist;
+				freelist = n;
 				continue;
 			}
 			np = &n->next;
 		}
 	}
+	write_unlock_bh(&tbl->lock);
+	while ((n = freelist)) {
+		freelist = n->next;
+		n->next = NULL;
+		if (tbl->pdestructor)
+			tbl->pdestructor(n);
+		if (n->dev)
+			dev_put(n->dev);
+		kfree(n);
+	}
 	return -ENOENT;
 }

-- 
2.11.0

^ permalink raw reply related

* Patch "x86/asm: Don't use RBP as a temporary register in csum_partial_copy_generic()" has been added to the 4.4-stable tree
From: gregkh @ 2018-04-10  9:07 UTC (permalink / raw)
  To: jpoimboe, alexander.levin, andreyknvl, davem, dvyukov, edumazet,
	gregkh, kcc, marcelo.leitner, mingo, netdev, nhorman, peterz,
	syzkaller, tglx, torvalds, vyasevich, xiyou.wangcong
  Cc: stable, stable-commits


This is a note to let you know that I've just added the patch titled

    x86/asm: Don't use RBP as a temporary register in csum_partial_copy_generic()

to the 4.4-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     x86-asm-don-t-use-rbp-as-a-temporary-register-in-csum_partial_copy_generic.patch
and it can be found in the queue-4.4 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@vger.kernel.org> know about it.


>From foo@baz Tue Apr 10 10:31:53 CEST 2018
From: Josh Poimboeuf <jpoimboe@redhat.com>
Date: Thu, 4 May 2017 09:51:40 -0500
Subject: x86/asm: Don't use RBP as a temporary register in csum_partial_copy_generic()

From: Josh Poimboeuf <jpoimboe@redhat.com>


[ Upstream commit 42fc6c6cb1662ba2fa727dd01c9473c63be4e3b6 ]

Andrey Konovalov reported the following warning while fuzzing the kernel
with syzkaller:

  WARNING: kernel stack regs at ffff8800686869f8 in a.out:4933 has bad 'bp' value c3fc855a10167ec0

The unwinder dump revealed that RBP had a bad value when an interrupt
occurred in csum_partial_copy_generic().

That function saves RBP on the stack and then overwrites it, using it as
a scratch register.  That's problematic because it breaks stack traces
if an interrupt occurs in the middle of the function.

Replace the usage of RBP with another callee-saved register (R15) so
stack traces are no longer affected.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: linux-sctp@vger.kernel.org
Cc: netdev <netdev@vger.kernel.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Link: http://lkml.kernel.org/r/4b03a961efda5ec9bfe46b7b9c9ad72d1efad343.1493909486.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 arch/x86/lib/csum-copy_64.S |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/arch/x86/lib/csum-copy_64.S
+++ b/arch/x86/lib/csum-copy_64.S
@@ -55,7 +55,7 @@ ENTRY(csum_partial_copy_generic)
 	movq  %r12, 3*8(%rsp)
 	movq  %r14, 4*8(%rsp)
 	movq  %r13, 5*8(%rsp)
-	movq  %rbp, 6*8(%rsp)
+	movq  %r15, 6*8(%rsp)
 
 	movq  %r8, (%rsp)
 	movq  %r9, 1*8(%rsp)
@@ -74,7 +74,7 @@ ENTRY(csum_partial_copy_generic)
 	/* main loop. clear in 64 byte blocks */
 	/* r9: zero, r8: temp2, rbx: temp1, rax: sum, rcx: saved length */
 	/* r11:	temp3, rdx: temp4, r12 loopcnt */
-	/* r10:	temp5, rbp: temp6, r14 temp7, r13 temp8 */
+	/* r10:	temp5, r15: temp6, r14 temp7, r13 temp8 */
 	.p2align 4
 .Lloop:
 	source
@@ -89,7 +89,7 @@ ENTRY(csum_partial_copy_generic)
 	source
 	movq  32(%rdi), %r10
 	source
-	movq  40(%rdi), %rbp
+	movq  40(%rdi), %r15
 	source
 	movq  48(%rdi), %r14
 	source
@@ -103,7 +103,7 @@ ENTRY(csum_partial_copy_generic)
 	adcq  %r11, %rax
 	adcq  %rdx, %rax
 	adcq  %r10, %rax
-	adcq  %rbp, %rax
+	adcq  %r15, %rax
 	adcq  %r14, %rax
 	adcq  %r13, %rax
 
@@ -121,7 +121,7 @@ ENTRY(csum_partial_copy_generic)
 	dest
 	movq %r10, 32(%rsi)
 	dest
-	movq %rbp, 40(%rsi)
+	movq %r15, 40(%rsi)
 	dest
 	movq %r14, 48(%rsi)
 	dest
@@ -203,7 +203,7 @@ ENTRY(csum_partial_copy_generic)
 	movq 3*8(%rsp), %r12
 	movq 4*8(%rsp), %r14
 	movq 5*8(%rsp), %r13
-	movq 6*8(%rsp), %rbp
+	movq 6*8(%rsp), %r15
 	addq $7*8, %rsp
 	ret
 


Patches currently in stable-queue which might be from jpoimboe@redhat.com are

queue-4.4/x86-asm-don-t-use-rbp-as-a-temporary-register-in-csum_partial_copy_generic.patch

^ permalink raw reply

* Re: [PATCH bpf v3] bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog
From: Daniel Borkmann @ 2018-04-10  8:54 UTC (permalink / raw)
  To: Yonghong Song, ast, netdev; +Cc: kernel-team
In-Reply-To: <20180410072139.323589-1-yhs@fb.com>

On 04/10/2018 09:21 AM, Yonghong Song wrote:
> syzbot reported a possible deadlock in perf_event_detach_bpf_prog.
> The error details:
>   ======================================================
>   WARNING: possible circular locking dependency detected
>   4.16.0-rc7+ #3 Not tainted
>   ------------------------------------------------------
>   syz-executor7/24531 is trying to acquire lock:
>    (bpf_event_mutex){+.+.}, at: [<000000008a849b07>] perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
> 
>   but task is already holding lock:
>    (&mm->mmap_sem){++++}, at: [<0000000038768f87>] vm_mmap_pgoff+0x198/0x280 mm/util.c:353
> 
>   which lock already depends on the new lock.
> 
>   the existing dependency chain (in reverse order) is:
> 
>   -> #1 (&mm->mmap_sem){++++}:
>        __might_fault+0x13a/0x1d0 mm/memory.c:4571
>        _copy_to_user+0x2c/0xc0 lib/usercopy.c:25
>        copy_to_user include/linux/uaccess.h:155 [inline]
>        bpf_prog_array_copy_info+0xf2/0x1c0 kernel/bpf/core.c:1694
>        perf_event_query_prog_array+0x1c7/0x2c0 kernel/trace/bpf_trace.c:891
>        _perf_ioctl kernel/events/core.c:4750 [inline]
>        perf_ioctl+0x3e1/0x1480 kernel/events/core.c:4770
>        vfs_ioctl fs/ioctl.c:46 [inline]
>        do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
>        SYSC_ioctl fs/ioctl.c:701 [inline]
>        SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
>        do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>        entry_SYSCALL_64_after_hwframe+0x42/0xb7
> 
>   -> #0 (bpf_event_mutex){+.+.}:
>        lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
>        __mutex_lock_common kernel/locking/mutex.c:756 [inline]
>        __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
>        mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
>        perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
>        perf_event_free_bpf_prog kernel/events/core.c:8147 [inline]
>        _free_event+0xbdb/0x10f0 kernel/events/core.c:4116
>        put_event+0x24/0x30 kernel/events/core.c:4204
>        perf_mmap_close+0x60d/0x1010 kernel/events/core.c:5172
>        remove_vma+0xb4/0x1b0 mm/mmap.c:172
>        remove_vma_list mm/mmap.c:2490 [inline]
>        do_munmap+0x82a/0xdf0 mm/mmap.c:2731
>        mmap_region+0x59e/0x15a0 mm/mmap.c:1646
>        do_mmap+0x6c0/0xe00 mm/mmap.c:1483
>        do_mmap_pgoff include/linux/mm.h:2223 [inline]
>        vm_mmap_pgoff+0x1de/0x280 mm/util.c:355
>        SYSC_mmap_pgoff mm/mmap.c:1533 [inline]
>        SyS_mmap_pgoff+0x462/0x5f0 mm/mmap.c:1491
>        SYSC_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
>        SyS_mmap+0x16/0x20 arch/x86/kernel/sys_x86_64.c:91
>        do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>        entry_SYSCALL_64_after_hwframe+0x42/0xb7
> 
>   other info that might help us debug this:
> 
>    Possible unsafe locking scenario:
> 
>          CPU0                    CPU1
>          ----                    ----
>     lock(&mm->mmap_sem);
>                                  lock(bpf_event_mutex);
>                                  lock(&mm->mmap_sem);
>     lock(bpf_event_mutex);
> 
>    *** DEADLOCK ***
>   ======================================================
> 
> The bug is introduced by Commit f371b304f12e ("bpf/tracing: allow
> user space to query prog array on the same tp") where copy_to_user,
> which requires mm->mmap_sem, is called inside bpf_event_mutex lock.
> At the same time, during perf_event file descriptor close,
> mm->mmap_sem is held first and then subsequent
> perf_event_detach_bpf_prog needs bpf_event_mutex lock.
> Such a senario caused a deadlock.
> 
> As suggested by Danial, moving copy_to_user out of the

Nit: typo :)

> bpf_event_mutex lock should fix the problem.
> 
> Fixes: f371b304f12e ("bpf/tracing: allow user space to query prog array on the same tp")
> Reported-by: syzbot+dc5ca0e4c9bfafaf2bae@syzkaller.appspotmail.com
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/linux/bpf.h      |  4 ++--
>  kernel/bpf/core.c        | 45 +++++++++++++++++++++++++++++----------------
>  kernel/trace/bpf_trace.c | 19 +++++++++++++++----
>  3 files changed, 46 insertions(+), 22 deletions(-)
> 
> Changelog:
>   v2 -> v3:
>     . Remove the redundant label in function perf_event_query_prog_array.
>   v1 -> v2:
>     . Use the common core function for prog_id copying for two
>       different prog_array copy routines, suggested by Alexei.
> 
[...]
>  static void bpf_prog_free_deferred(struct work_struct *work)
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index d88e96d..f505d43 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -977,6 +977,7 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
>  {
>  	struct perf_event_query_bpf __user *uquery = info;
>  	struct perf_event_query_bpf query = {};
> +	u32 *ids, prog_cnt, ids_len;
>  	int ret;
>  
>  	if (!capable(CAP_SYS_ADMIN))
> @@ -985,16 +986,26 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
>  		return -EINVAL;
>  	if (copy_from_user(&query, uquery, sizeof(query)))
>  		return -EFAULT;
> -	if (query.ids_len > BPF_TRACE_MAX_PROGS)
> +
> +	ids_len = query.ids_len;
> +	if (ids_len > BPF_TRACE_MAX_PROGS)
>  		return -E2BIG;
> +	ids = kcalloc(ids_len, sizeof(u32), GFP_USER | __GFP_NOWARN);
> +	if (!ids)
> +		return -ENOMEM;

Fix looks good to me, but could you still add a comment stating that we don't
need to check for ZERO_SIZE_PTR above since we handle this gracefully in the
bpf_prog_array_copy_info() plus it's also required for the case where the user
only wants to check for the uquery->prog_cnt, but nothing else.

>  	mutex_lock(&bpf_event_mutex);
>  	ret = bpf_prog_array_copy_info(event->tp_event->prog_array,
> -				       uquery->ids,
> -				       query.ids_len,
> -				       &uquery->prog_cnt);
> +				       ids,
> +				       ids_len,
> +				       &prog_cnt);
>  	mutex_unlock(&bpf_event_mutex);
>  
> +	if (copy_to_user(&uquery->prog_cnt, &prog_cnt, sizeof(prog_cnt)) ||
> +	    copy_to_user(uquery->ids, ids, ids_len * sizeof(u32)))
> +		ret = -EFAULT;
> +
> +	kfree(ids);
>  	return ret;
>  }

Thanks,
Daniel

^ permalink raw reply

* Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Paolo Bonzini @ 2018-04-10  7:51 UTC (permalink / raw)
  To: Tiwei Bie, Jason Wang
  Cc: alexander.h.duyck, virtio-dev, kvm, mst, netdev, linux-kernel,
	virtualization, xiao.w.wang, ddutile, jianfeng.tan, zhihong.wang
In-Reply-To: <20180410045723.rftsb7l4l3ip2ioi@debian>

On 10/04/2018 06:57, Tiwei Bie wrote:
>> So you just move the abstraction layer from qemu to kernel, and you still
>> need different drivers in kernel for different device interfaces of
>> accelerators. This looks even more complex than leaving it in qemu. As you
>> said, another idea is to implement userspace vhost backend for accelerators
>> which seems easier and could co-work with other parts of qemu without
>> inventing new type of messages.
> 
> I'm not quite sure. Do you think it's acceptable to
> add various vendor specific hardware drivers in QEMU?

I think so.  We have vendor-specific quirks, and at some point there was
an idea of using quirks to implement (vendor-specific) live migration
support for assigned devices.

Paolo

^ permalink raw reply

* Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Jason Wang @ 2018-04-10  7:25 UTC (permalink / raw)
  To: Tiwei Bie
  Cc: mst, alex.williamson, ddutile, alexander.h.duyck, virtio-dev,
	linux-kernel, kvm, virtualization, netdev, dan.daly,
	cunming.liang, zhihong.wang, jianfeng.tan, xiao.w.wang
In-Reply-To: <20180410045723.rftsb7l4l3ip2ioi@debian>



On 2018年04月10日 12:57, Tiwei Bie wrote:
> On Tue, Apr 10, 2018 at 10:52:52AM +0800, Jason Wang wrote:
>> On 2018年04月02日 23:23, Tiwei Bie wrote:
>>> This patch introduces a mdev (mediated device) based hardware
>>> vhost backend. This backend is an abstraction of the various
>>> hardware vhost accelerators (potentially any device that uses
>>> virtio ring can be used as a vhost accelerator). Some generic
>>> mdev parent ops are provided for accelerator drivers to support
>>> generating mdev instances.
>>>
>>> What's this
>>> ===========
>>>
>>> The idea is that we can setup a virtio ring compatible device
>>> with the messages available at the vhost-backend. Originally,
>>> these messages are used to implement a software vhost backend,
>>> but now we will use these messages to setup a virtio ring
>>> compatible hardware device. Then the hardware device will be
>>> able to work with the guest virtio driver in the VM just like
>>> what the software backend does. That is to say, we can implement
>>> a hardware based vhost backend in QEMU, and any virtio ring
>>> compatible devices potentially can be used with this backend.
>>> (We also call it vDPA -- vhost Data Path Acceleration).
>>>
>>> One problem is that, different virtio ring compatible devices
>>> may have different device interfaces. That is to say, we will
>>> need different drivers in QEMU. It could be troublesome. And
>>> that's what this patch trying to fix. The idea behind this
>>> patch is very simple: mdev is a standard way to emulate device
>>> in kernel.
>> So you just move the abstraction layer from qemu to kernel, and you still
>> need different drivers in kernel for different device interfaces of
>> accelerators. This looks even more complex than leaving it in qemu. As you
>> said, another idea is to implement userspace vhost backend for accelerators
>> which seems easier and could co-work with other parts of qemu without
>> inventing new type of messages.
> I'm not quite sure. Do you think it's acceptable to
> add various vendor specific hardware drivers in QEMU?
>

I don't object but we need to figure out the advantages of doing it in 
qemu too.

Thanks

^ permalink raw reply

* [PATCH bpf v3] bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog
From: Yonghong Song @ 2018-04-10  7:21 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team

syzbot reported a possible deadlock in perf_event_detach_bpf_prog.
The error details:
  ======================================================
  WARNING: possible circular locking dependency detected
  4.16.0-rc7+ #3 Not tainted
  ------------------------------------------------------
  syz-executor7/24531 is trying to acquire lock:
   (bpf_event_mutex){+.+.}, at: [<000000008a849b07>] perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854

  but task is already holding lock:
   (&mm->mmap_sem){++++}, at: [<0000000038768f87>] vm_mmap_pgoff+0x198/0x280 mm/util.c:353

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (&mm->mmap_sem){++++}:
       __might_fault+0x13a/0x1d0 mm/memory.c:4571
       _copy_to_user+0x2c/0xc0 lib/usercopy.c:25
       copy_to_user include/linux/uaccess.h:155 [inline]
       bpf_prog_array_copy_info+0xf2/0x1c0 kernel/bpf/core.c:1694
       perf_event_query_prog_array+0x1c7/0x2c0 kernel/trace/bpf_trace.c:891
       _perf_ioctl kernel/events/core.c:4750 [inline]
       perf_ioctl+0x3e1/0x1480 kernel/events/core.c:4770
       vfs_ioctl fs/ioctl.c:46 [inline]
       do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
       SYSC_ioctl fs/ioctl.c:701 [inline]
       SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
       do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x42/0xb7

  -> #0 (bpf_event_mutex){+.+.}:
       lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
       __mutex_lock_common kernel/locking/mutex.c:756 [inline]
       __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
       mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
       perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
       perf_event_free_bpf_prog kernel/events/core.c:8147 [inline]
       _free_event+0xbdb/0x10f0 kernel/events/core.c:4116
       put_event+0x24/0x30 kernel/events/core.c:4204
       perf_mmap_close+0x60d/0x1010 kernel/events/core.c:5172
       remove_vma+0xb4/0x1b0 mm/mmap.c:172
       remove_vma_list mm/mmap.c:2490 [inline]
       do_munmap+0x82a/0xdf0 mm/mmap.c:2731
       mmap_region+0x59e/0x15a0 mm/mmap.c:1646
       do_mmap+0x6c0/0xe00 mm/mmap.c:1483
       do_mmap_pgoff include/linux/mm.h:2223 [inline]
       vm_mmap_pgoff+0x1de/0x280 mm/util.c:355
       SYSC_mmap_pgoff mm/mmap.c:1533 [inline]
       SyS_mmap_pgoff+0x462/0x5f0 mm/mmap.c:1491
       SYSC_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
       SyS_mmap+0x16/0x20 arch/x86/kernel/sys_x86_64.c:91
       do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x42/0xb7

  other info that might help us debug this:

   Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(&mm->mmap_sem);
                                 lock(bpf_event_mutex);
                                 lock(&mm->mmap_sem);
    lock(bpf_event_mutex);

   *** DEADLOCK ***
  ======================================================

The bug is introduced by Commit f371b304f12e ("bpf/tracing: allow
user space to query prog array on the same tp") where copy_to_user,
which requires mm->mmap_sem, is called inside bpf_event_mutex lock.
At the same time, during perf_event file descriptor close,
mm->mmap_sem is held first and then subsequent
perf_event_detach_bpf_prog needs bpf_event_mutex lock.
Such a senario caused a deadlock.

As suggested by Danial, moving copy_to_user out of the
bpf_event_mutex lock should fix the problem.

Fixes: f371b304f12e ("bpf/tracing: allow user space to query prog array on the same tp")
Reported-by: syzbot+dc5ca0e4c9bfafaf2bae@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf.h      |  4 ++--
 kernel/bpf/core.c        | 45 +++++++++++++++++++++++++++++----------------
 kernel/trace/bpf_trace.c | 19 +++++++++++++++----
 3 files changed, 46 insertions(+), 22 deletions(-)

Changelog:
  v2 -> v3:
    . Remove the redundant label in function perf_event_query_prog_array.
  v1 -> v2:
    . Use the common core function for prog_id copying for two
      different prog_array copy routines, suggested by Alexei.

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 95a7abd..486e65e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -339,8 +339,8 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs,
 void bpf_prog_array_delete_safe(struct bpf_prog_array __rcu *progs,
 				struct bpf_prog *old_prog);
 int bpf_prog_array_copy_info(struct bpf_prog_array __rcu *array,
-			     __u32 __user *prog_ids, u32 request_cnt,
-			     __u32 __user *prog_cnt);
+			     u32 *prog_ids, u32 request_cnt,
+			     u32 *prog_cnt);
 int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array,
 			struct bpf_prog *exclude_prog,
 			struct bpf_prog *include_prog,
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index d315b39..ba03ec3 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1572,13 +1572,32 @@ int bpf_prog_array_length(struct bpf_prog_array __rcu *progs)
 	return cnt;
 }
 
+static bool bpf_prog_array_copy_core(struct bpf_prog **prog,
+				     u32 *prog_ids,
+				     u32 request_cnt)
+{
+	int i = 0;
+
+	for (; *prog; prog++) {
+		if (*prog == &dummy_bpf_prog.prog)
+			continue;
+		prog_ids[i] = (*prog)->aux->id;
+		if (++i == request_cnt) {
+			prog++;
+			break;
+		}
+	}
+
+	return !!(*prog);
+}
+
 int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs,
 				__u32 __user *prog_ids, u32 cnt)
 {
 	struct bpf_prog **prog;
 	unsigned long err = 0;
-	u32 i = 0, *ids;
 	bool nospc;
+	u32 *ids;
 
 	/* users of this function are doing:
 	 * cnt = bpf_prog_array_length();
@@ -1595,16 +1614,7 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs,
 		return -ENOMEM;
 	rcu_read_lock();
 	prog = rcu_dereference(progs)->progs;
-	for (; *prog; prog++) {
-		if (*prog == &dummy_bpf_prog.prog)
-			continue;
-		ids[i] = (*prog)->aux->id;
-		if (++i == cnt) {
-			prog++;
-			break;
-		}
-	}
-	nospc = !!(*prog);
+	nospc = bpf_prog_array_copy_core(prog, ids, cnt);
 	rcu_read_unlock();
 	err = copy_to_user(prog_ids, ids, cnt * sizeof(u32));
 	kfree(ids);
@@ -1683,22 +1693,25 @@ int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array,
 }
 
 int bpf_prog_array_copy_info(struct bpf_prog_array __rcu *array,
-			     __u32 __user *prog_ids, u32 request_cnt,
-			     __u32 __user *prog_cnt)
+			     u32 *prog_ids, u32 request_cnt,
+			     u32 *prog_cnt)
 {
+	struct bpf_prog **prog;
 	u32 cnt = 0;
 
 	if (array)
 		cnt = bpf_prog_array_length(array);
 
-	if (copy_to_user(prog_cnt, &cnt, sizeof(cnt)))
-		return -EFAULT;
+	*prog_cnt = cnt;
 
 	/* return early if user requested only program count or nothing to copy */
 	if (!request_cnt || !cnt)
 		return 0;
 
-	return bpf_prog_array_copy_to_user(array, prog_ids, request_cnt);
+	/* this function is called under trace/bpf_trace.c: bpf_event_mutex */
+	prog = rcu_dereference_check(array, 1)->progs;
+	return bpf_prog_array_copy_core(prog, prog_ids, request_cnt) ? -ENOSPC
+								     : 0;
 }
 
 static void bpf_prog_free_deferred(struct work_struct *work)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d88e96d..f505d43 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -977,6 +977,7 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
 {
 	struct perf_event_query_bpf __user *uquery = info;
 	struct perf_event_query_bpf query = {};
+	u32 *ids, prog_cnt, ids_len;
 	int ret;
 
 	if (!capable(CAP_SYS_ADMIN))
@@ -985,16 +986,26 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
 		return -EINVAL;
 	if (copy_from_user(&query, uquery, sizeof(query)))
 		return -EFAULT;
-	if (query.ids_len > BPF_TRACE_MAX_PROGS)
+
+	ids_len = query.ids_len;
+	if (ids_len > BPF_TRACE_MAX_PROGS)
 		return -E2BIG;
+	ids = kcalloc(ids_len, sizeof(u32), GFP_USER | __GFP_NOWARN);
+	if (!ids)
+		return -ENOMEM;
 
 	mutex_lock(&bpf_event_mutex);
 	ret = bpf_prog_array_copy_info(event->tp_event->prog_array,
-				       uquery->ids,
-				       query.ids_len,
-				       &uquery->prog_cnt);
+				       ids,
+				       ids_len,
+				       &prog_cnt);
 	mutex_unlock(&bpf_event_mutex);
 
+	if (copy_to_user(&uquery->prog_cnt, &prog_cnt, sizeof(prog_cnt)) ||
+	    copy_to_user(uquery->ids, ids, ids_len * sizeof(u32)))
+		ret = -EFAULT;
+
+	kfree(ids);
 	return ret;
 }
 
-- 
2.9.5

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox