Netdev List


* RE: Proposal: r8152 firmware patching framework
From: Hayes Wang @ 2019-09-02  6:31 UTC (permalink / raw)
  To: Amber Chen, Prashant Malani
  Cc: David Miller, netdev@vger.kernel.org, Bambi Yeh, Ryankao, Jackc,
	Albertk, marcochen@google.com, nic_swsd, Grant Grundler
In-Reply-To: <755AFD2B-D66F-40FF-ADCD-5077ECC569FE@realtek.com>

Prashant Malani <pmalani@chromium.org> 
> >
> > (Adding a few more Realtek folks)
> >
> > Friendly ping. Any thoughts / feedback, Realtek folks (and others) ?
> >
> >> On Thu, Aug 29, 2019 at 11:40 AM Prashant Malani <pmalani@chromium.org> wrote:
> >>
> >> Hi,
> >>
> >> The r8152 driver source code distributed by Realtek (on
> >> www.realtek.com) contains firmware patches. This involves binary
> >> byte-arrays being written byte/word-wise to the hardware memory.
> >> Example: grundler@chromium.org (cc-ed) has an experimental patch which
> >> includes the firmware patching code which was distributed with the
> >> Realtek source:
> >>
> >> https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1417953
> >>
> >> It would be nice to have a way to incorporate these firmware fixes
> >> into the upstream code. Since having indecipherable byte-arrays is not
> >> possible upstream, I propose the following:
> >> - We use the assistance of Realtek to come up with a format which the
> >> firmware patch files can follow (this can be documented in the
> >> comments).
> >>       - A real simple format could look like this:
> >>               + <section1><size_in_bytes><address1><data1><address2><data2>...<addressN><dataN><section2>...
> >>               + The driver would be able to understand how to parse
> >>                 each section (e.g. is each data entry a byte or a word?)
> >>
> >> - We use request_firmware() to load the firmware, parse it and write
> >> the data to the relevant registers.
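
As a rough illustration of the format proposed above, a parser for such a
blob might look like the sketch below. Everything about the layout is an
assumption made for the example (little-endian 16-bit section id, 16-bit
size_in_bytes, then 16-bit address/data pairs); it is not Realtek's actual
firmware format, and struct fw_entry and fw_patch_parse() are hypothetical
names. It is plain user-space C so that it compiles stand-alone.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout (an assumption, not Realtek's format), little-endian:
 *   u16 section_id | u16 size_in_bytes | (u16 address, u16 data) * N
 * repeated once per section.
 */
struct fw_entry {
	uint16_t section;
	uint16_t addr;
	uint16_t data;
};

static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Decode every (address, data) pair of every section into out[].
 * Returns the number of entries decoded, or -1 if the blob is truncated,
 * a section size is not a multiple of one pair, or out[] is too small.
 */
static int fw_patch_parse(const uint8_t *blob, size_t len,
			  struct fw_entry *out, size_t max_out)
{
	size_t off = 0;
	size_t n = 0;

	while (off < len) {
		uint16_t section, size;
		size_t end;

		if (len - off < 4)		/* section header incomplete */
			return -1;
		section = get_le16(blob + off);
		size = get_le16(blob + off + 2);
		off += 4;

		if (size % 4 != 0 || size > len - off)
			return -1;

		end = off + size;
		for (; off < end; off += 4) {
			if (n == max_out)
				return -1;
			out[n].section = section;
			out[n].addr = get_le16(blob + off);
			out[n].data = get_le16(blob + off + 2);
			n++;
		}
	}
	return (int)n;
}
```

In a driver, the blob would presumably come from request_firmware(), and
each decoded entry would be written to the hardware only after the whole
blob has been validated, so that a malformed file is rejected before any
register is touched.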

I plan to first finish the patches that I am about to submit; after that
I can focus on this. However, I don't expect to start on it soon. It needs
a lot of preparation, which will take me considerable time.

Best Regards,
Hayes




* [PATCH bpf-next] arm64: bpf: optimize modulo operation
From: jerinj @ 2019-09-02  6:14 UTC (permalink / raw)
  To: netdev, Daniel Borkmann, Alexei Starovoitov, Zi Shen Lim,
	Catalin Marinas, Will Deacon, Martin KaFai Lau, Song Liu,
	Yonghong Song, open list:BPF JIT for ARM64,
	moderated list:ARM64 PORT (AARCH64 ARCHITECTURE), open list
  Cc: Jerin Jacob

From: Jerin Jacob <jerinj@marvell.com>

Optimize instruction generation for the modulo operation by
using a single MSUB instruction instead of a MUL followed
by a SUB.

Signed-off-by: Jerin Jacob <jerinj@marvell.com>
---
 arch/arm64/net/bpf_jit.h      | 3 +++
 arch/arm64/net/bpf_jit_comp.c | 6 ++----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/net/bpf_jit.h b/arch/arm64/net/bpf_jit.h
index cb7ab50b7657..eb73f9f72c46 100644
--- a/arch/arm64/net/bpf_jit.h
+++ b/arch/arm64/net/bpf_jit.h
@@ -171,6 +171,9 @@
 /* Rd = Ra + Rn * Rm */
 #define A64_MADD(sf, Rd, Ra, Rn, Rm) aarch64_insn_gen_data3(Rd, Ra, Rn, Rm, \
 	A64_VARIANT(sf), AARCH64_INSN_DATA3_MADD)
+/* Rd = Ra - Rn * Rm */
+#define A64_MSUB(sf, Rd, Ra, Rn, Rm) aarch64_insn_gen_data3(Rd, Ra, Rn, Rm, \
+	A64_VARIANT(sf), AARCH64_INSN_DATA3_MSUB)
 /* Rd = Rn * Rm */
 #define A64_MUL(sf, Rd, Rn, Rm) A64_MADD(sf, Rd, A64_ZR, Rn, Rm)
 
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index f5b437f8a22b..cdc79de0c794 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -409,8 +409,7 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 			break;
 		case BPF_MOD:
 			emit(A64_UDIV(is64, tmp, dst, src), ctx);
-			emit(A64_MUL(is64, tmp, tmp, src), ctx);
-			emit(A64_SUB(is64, dst, dst, tmp), ctx);
+			emit(A64_MSUB(is64, dst, dst, tmp, src), ctx);
 			break;
 		}
 		break;
@@ -516,8 +515,7 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 	case BPF_ALU64 | BPF_MOD | BPF_K:
 		emit_a64_mov_i(is64, tmp2, imm, ctx);
 		emit(A64_UDIV(is64, tmp, dst, tmp2), ctx);
-		emit(A64_MUL(is64, tmp, tmp, tmp2), ctx);
-		emit(A64_SUB(is64, dst, dst, tmp), ctx);
+		emit(A64_MSUB(is64, dst, dst, tmp, tmp2), ctx);
 		break;
 	case BPF_ALU | BPF_LSH | BPF_K:
 	case BPF_ALU64 | BPF_LSH | BPF_K:
-- 
2.23.0
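
For readers who want to check the equivalence, the two instruction
sequences can be modelled in plain C. This is only a sketch of the 64-bit
case: a64_udiv() and a64_msub() are hypothetical helper names that mimic
the instructions (including the AArch64 rule that UDIV by zero yields 0),
not anything taken from the JIT itself.

```c
#include <stdint.h>

/* Model of A64 UDIV: unsigned divide; dividing by zero yields 0. */
static uint64_t a64_udiv(uint64_t rn, uint64_t rm)
{
	return rm ? rn / rm : 0;
}

/* Model of A64 MSUB: Rd = Ra - Rn * Rm, with wrapping arithmetic. */
static uint64_t a64_msub(uint64_t ra, uint64_t rn, uint64_t rm)
{
	return ra - rn * rm;
}

/* Old sequence: UDIV, then MUL, then SUB. */
static uint64_t bpf_mod_old(uint64_t dst, uint64_t src)
{
	uint64_t tmp = a64_udiv(dst, src);

	tmp = tmp * src;	/* MUL tmp, tmp, src */
	return dst - tmp;	/* SUB dst, dst, tmp */
}

/* New sequence: UDIV, then a single MSUB. */
static uint64_t bpf_mod_new(uint64_t dst, uint64_t src)
{
	uint64_t tmp = a64_udiv(dst, src);

	return a64_msub(dst, tmp, src);	/* dst - tmp * src */
}
```

Both functions compute dst - (dst / src) * src, so they agree for every
input; the patch only changes how many instructions are emitted.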



* Re: [PATCH net-next 3/3] net: phy: realtek: add support for the 2.5Gbps PHY in RTL8125
From: Heiner Kallweit @ 2019-09-02  6:07 UTC (permalink / raw)
  To: Florian Fainelli, Andrew Lunn; +Cc: David Miller, netdev@vger.kernel.org
In-Reply-To: <fafc1c05-d7ac-f108-74f9-207617773968@gmail.com>

On 02.09.2019 04:07, Florian Fainelli wrote:
> 
> 
> On 8/8/2019 1:24 PM, Heiner Kallweit wrote:
>> On 08.08.2019 22:20, Andrew Lunn wrote:
>>>> I have a contact in Realtek who provided the information about
>>>> the vendor-specific registers used in the patch. I also asked for
>>>> a method to auto-detect 2.5Gbps support but have no feedback so far.
>>>> What may contribute to the problem is that also the integrated 1Gbps
>>>> PHY's (all with the same PHY ID) differ significantly from each other,
>>>> depending on the network chip version.
>>>
>>> Hi Heiner
>>>
>>> Some of the PHYs embedded in Marvell switches have an OUI, but no
>>> product ID. We work around this brokenness by trapping the reads to
>>> the ID registers in the MDIO bus controller driver and inserting the
>>> switch product ID. The Marvell PHY driver then recognises these IDs
>>> and does the right thing.
>>>
>>> Maybe you can do something similar here?
>>>
>> Yes, this would be an idea. Let me check.
> 
> Since this is an integrated PHY you could have the MAC driver pass a
> specific phydev->dev_flags bit that indicates that this is RTL8125, since
> I am assuming that PCI IDs for those different chipsets do have to be
> allocated, right?
> 
Hi Florian,

thanks for the feedback. In the meantime Realtek provided a method to
identify NBase-T-capable PHYs, and the respective match_phy_device
callback implementations were added in
5181b473d64e ("net: phy: realtek: add NBase-T PHY auto-detection").

Heiner
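
The ID-trapping workaround Andrew describes earlier in the thread can be
sketched roughly as follows. This is a stand-alone model, not the Marvell
or Realtek driver code: mdio_read_fixup() is a hypothetical helper, and
the only facts relied on are the standard MII register number of PHYSID2
(0x03) and the position of its model-number field (bits 9:4).

```c
#include <stdint.h>

#define MII_PHYSID2		0x03	/* PHY identifier register 2 */
#define PHYSID2_MODEL_MASK	0x03f0	/* model number, bits 9:4 */

/*
 * Hypothetical fixup applied by the bus driver to a value it has just
 * read: if the PHY reports an OUI but a zero model number, insert a
 * model number supplied by the MAC/switch driver.
 */
static uint16_t mdio_read_fixup(int reg, uint16_t raw, uint16_t model)
{
	if (reg == MII_PHYSID2 && !(raw & PHYSID2_MODEL_MASK))
		raw |= (uint16_t)((model << 4) & PHYSID2_MODEL_MASK);

	return raw;
}
```

The PHY driver can then match on the patched ID as if the hardware had
reported it. The match_phy_device approach mentioned above avoids the need
for this kind of trap.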


* Re: [PATCH v3] tun: fix use-after-free when register netdev failed
From: Jason Wang @ 2019-09-02  5:32 UTC (permalink / raw)
  To: Yang Yingliang
  Cc: David Miller, netdev, eric dumazet, xiyou wangcong, weiyongjun1
In-Reply-To: <5D5FB3B6.5080800@huawei.com>


On 2019/8/23 5:36 PM, Yang Yingliang wrote:
>
>
> On 2019/8/23 11:05, Jason Wang wrote:
>> ----- Original Message -----
>>>
>>> On 2019/8/22 14:07, Yang Yingliang wrote:
>>>>
>>>> On 2019/8/22 10:13, Jason Wang wrote:
>>>>> On 2019/8/20 10:28 AM, Jason Wang wrote:
>>>>>> On 2019/8/20 9:25 AM, David Miller wrote:
>>>>>>> From: Yang Yingliang <yangyingliang@huawei.com>
>>>>>>> Date: Mon, 19 Aug 2019 21:31:19 +0800
>>>>>>>
>>>>>>>> Call tun_attach() after register_netdevice() to make sure tfile->tun
>>>>>>>> is not published until the netdevice is registered. So the read/write
>>>>>>>> thread can not use the tun pointer that may be freed by free_netdev().
>>>>>>>> (The tun and dev pointer are allocated by alloc_netdev_mqs(), they can
>>>>>>>> be freed by netdev_freemem().)
>>>>>>> register_netdevice() must always be the last operation in the order of
>>>>>>> network device setup.
>>>>>>>
>>>>>>> At the point register_netdevice() is called, the device is visible globally
>>>>>>> and therefore all of its software state must be fully initialized and
>>>>>>> ready for use.
>>>>>>>
>>>>>>> You're going to have to find another solution to these problems.
>>>>>>
>>>>>> The device is loosely coupled with sockets/queues. Each side is
>>>>>> allowed to go away without caring about the other side. So in this
>>>>>> case, there's a small window in which the network stack thinks the
>>>>>> device has one queue but it actually has none; the code can then
>>>>>> safely drop them. Maybe it's ok here with some comments?
>>>>>>
>>>>>> Or if not, we can try to hold the device before tun_attach and drop
>>>>>> it after register_netdevice().
>>>>>
>>>>> Hi Yang:
>>>>>
>>>>> I think maybe we can try to hold refcnt instead of playing real num
>>>>> queues here. Do you want to post a V4?
>>>> I think the refcnt can't prevent freeing the memory in this case.
>>>> When register_netdevice() fails, free_netdev() will be called directly;
>>>> dev->pcpu_refcnt and dev are freed without checking the refcnt of dev.
>>> How about using patch v1, which uses a flag to check whether the device
>>> registered successfully?
>>>
>> As I said, it lacks sufficient locks or barriers. To be clear, I meant
>> something like (compile-test only):
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index db16d7a13e00..e52678f9f049 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -2828,6 +2828,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>                                (ifr->ifr_flags & TUN_FEATURES);
>>                    INIT_LIST_HEAD(&tun->disabled);
>> +               dev_hold(dev);
>>                  err = tun_attach(tun, file, false, ifr->ifr_flags & IFF_NAPI,
>>                                   ifr->ifr_flags & IFF_NAPI_FRAGS);
>>                  if (err < 0)
>> @@ -2836,6 +2837,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>                  err = register_netdevice(tun->dev);
>>                  if (err < 0)
>>                          goto err_detach;
>> +               dev_put(dev);
>>          }
>>            netif_carrier_on(tun->dev);
>> @@ -2852,11 +2854,13 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>          return 0;
>>     err_detach:
>> +       dev_put(dev);
>>          tun_detach_all(dev);
>>          /* register_netdevice() already called tun_free_netdev() */
>>          goto err_free_dev;
>>     err_free_flow:
>> +       dev_put(dev);
>>          tun_flow_uninit(tun);
>>          security_tun_dev_free_security(tun->security);
>>   err_free_stat:
>>
>> What's your thought?
>
> The dev pointer is freed without checking the refcount in free_netdev(),
> which is called by the err_free_dev path, so I don't understand how the
> refcount protects this pointer.
>

The refcount is guaranteed to be zero there, isn't it?

Thanks


> Thanks,
> Yang
>
>>
>> Thanks
>>
>> .
>>
>
>


* Re: [bpf-next, v2] samples: bpf: add max_pckt_size option at xdp_adjust_tail
From: Daniel T. Lee @ 2019-09-02  4:43 UTC (permalink / raw)
  To: Song Liu; +Cc: Daniel Borkmann, Alexei Starovoitov, Networking
In-Reply-To: <CAEKGpzhGkLGswP3G9BzY1YErVOuNQRRBD2y=4g7u7dfh1by3aA@mail.gmail.com>

On Fri, Aug 30, 2019 at 3:23 AM Daniel T. Lee <danieltimlee@gmail.com> wrote:
>
> On Fri, Aug 30, 2019 at 5:42 AM Song Liu <liu.song.a23@gmail.com> wrote:
> >
> > On Mon, Aug 26, 2019 at 9:52 AM Daniel T. Lee <danieltimlee@gmail.com> wrote:
> > >
> > > Currently, at xdp_adjust_tail_kern.c, MAX_PCKT_SIZE is limited
> > > to 600. To make this size flexible, a new map 'pcktsz' is added.
> > >
> > > By updating new packet size to this map from the userland,
> > > xdp_adjust_tail_kern.o will use this value as a new max_pckt_size.
> > >
> > > If no '-P <MAX_PCKT_SIZE>' option is used, the size of maximum packet
> > > will be 600 as a default.
> >
> > Please also cc bpf@vger.kernel.org for bpf patches.
> >
>
> I'll make sure to have it included next time.
>
> > >
> > > Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
> >
> > Acked-by: Song Liu <songliubraving@fb.com>
> >
> > With a nit below.
> >
> > [...]
> >
> > > diff --git a/samples/bpf/xdp_adjust_tail_user.c b/samples/bpf/xdp_adjust_tail_user.c
> > > index a3596b617c4c..29ade7caf841 100644
> > > --- a/samples/bpf/xdp_adjust_tail_user.c
> > > +++ b/samples/bpf/xdp_adjust_tail_user.c
> > > @@ -72,6 +72,7 @@ static void usage(const char *cmd)
> > >         printf("Usage: %s [...]\n", cmd);
> > >         printf("    -i <ifname|ifindex> Interface\n");
> > >         printf("    -T <stop-after-X-seconds> Default: 0 (forever)\n");
> > > +       printf("    -P <MAX_PCKT_SIZE> Default: 600\n");
> >
> > nit: printf("    -P <MAX_PCKT_SIZE> Default: %u\n", MAX_PCKT_SIZE);
>
> With all due respect, I'm afraid that the MAX_PCKT_SIZE constant is only
> defined in '_kern.c'.
> Are you saying that it should be defined in '_user.c' as well?
>
> Thanks for the review!

Ping?


* [PATCH v3 0/5] Introduce variable length mdev alias
From: Parav Pandit @ 2019-09-02  4:24 UTC (permalink / raw)
  To: alex.williamson, jiri, kwankhede, cohuck, davem
  Cc: kvm, linux-kernel, netdev, Parav Pandit
In-Reply-To: <20190826204119.54386-1-parav@mellanox.com>

To have consistent naming for the netdevice of an mdev, and for the
devlink port [1] of an mdev (whose name is formed from the phys_port_name
of the devlink port), the UUID cannot be used directly because it is too
long.

A UUID is 36 characters long in string format and 128 bits in binary.
Neither format fits within the 15-character limit of a netdev name.

It is desirable to keep mdev device naming consistent and derived from the
UUID, so that widely used user space frameworks such as ovs [2] can use an
mdev representor in the same way as PCIe SR-IOV VF and PF representors.

Hence,
(a) An mdev alias is created, derived using sha1 from the mdev name.
(b) The vendor driver specifies how long an alias should be for the child
mdevs created for a given parent.
(c) Mdev aliases are unique at system level.
(d) The alias is optional; it is created only when the parent requests it.
This ensures that non-networking mdev parents can function without alias
creation overhead.

This design is discussed at [3].

An example systemd/udev extension will have,

1. netdev name created using mdev alias available in sysfs.

mdev UUID=83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
mdev 12 character alias=cd5b146a80a5

netdev name of this mdev = enmcd5b146a80a5
Here en = Ethernet link
m = mediated device

2. devlink port phys_port_name created using mdev alias.
devlink phys_port_name=pcd5b146a80a5
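
The naming scheme in the two examples above amounts to prefixing the alias
and checking the result against the netdev name limit. A minimal
user-space sketch, assuming a hypothetical helper mdev_build_name() and
applying the 15-character netdev limit to both names purely for
illustration:

```c
#include <stdio.h>
#include <string.h>

#define NETDEV_NAME_MAX 15	/* visible characters in a netdev name */

/*
 * Hypothetical helper: write "<prefix><alias>" into out.
 * Returns 0 on success, or -1 if the name would exceed the netdev
 * limit or would not fit into the output buffer.
 */
static int mdev_build_name(char *out, size_t out_len,
			   const char *prefix, const char *alias)
{
	size_t need = strlen(prefix) + strlen(alias);

	if (need > NETDEV_NAME_MAX || need + 1 > out_len)
		return -1;

	snprintf(out, out_len, "%s%s", prefix, alias);
	return 0;
}
```

With the prefix "enm" and a 12-character alias the name is exactly 15
characters long, which is why the example requests a 12-character alias;
passing the 36-character UUID string instead makes the helper fail.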

This patchset enables mdev core to maintain unique alias for a mdev.

Patch-1 Introduces mdev alias using sha1.
Patch-2 Ensures that mdev alias is unique in a system.
Patch-3 Exposes mdev alias in the sysfs hierarchy and updates the Documentation.
Patch-4 Introduces mdev_alias() API.
Patch-5 Extends mtty driver to optionally provide alias generation.
This also makes it possible to test UUID-based sha1 collisions and to
trigger error handling for duplicate sha1 results.

[1] http://man7.org/linux/man-pages/man8/devlink-port.8.html
[2] https://docs.openstack.org/os-vif/latest/user/plugins/ovs.html
[3] https://patchwork.kernel.org/cover/11084231/

---
Changelog:
v2->v3:
 - Addressed comment from Yunsheng Lin
 - Changed strcmp() ==0 to !strcmp()
 - Addressed comment from Cornelia Huck
 - Merged sysfs Documentation patch with sysfs patch
 - Added more description for alias return value
v1->v2:
 - Corrected a typo from 'and' to 'an'
 - Addressed comments from Alex Williamson
 - Kept mdev_device naturally aligned
 - Added error checking for crypto_*() calls
 - Moved alias NULL check at beginning
 - Added mdev_alias() API
 - Updated mtty driver to show example mdev_alias() usage
 - Changed return type of generate_alias() from int to char*
v0->v1:
 - Addressed comments from Alex Williamson, Cornelia Huck and Mark Bloch
 - Moved alias length check outside of the parent lock
 - Moved alias and digest allocation from kvzalloc to kzalloc
 - &alias[0] changed to alias
 - alias_length check is nested under get_alias_length callback check
 - Changed comments to start with an empty line
 - Added comment where alias memory ownership is handed over to mdev device
 - Fixed cleanup of hash if mdev_bus_register() fails
 - Updated documentation for new sysfs alias file
 - Improved commit logs to make description more clear
 - Fixed inclusion of alias for NULL check
 - Added ratelimited debug print for sha1 hash collision error

Parav Pandit (5):
  mdev: Introduce sha1 based mdev alias
  mdev: Make mdev alias unique among all mdevs
  mdev: Expose mdev alias in sysfs tree
  mdev: Introduce an API mdev_alias
  mtty: Optionally support mtty alias

 .../driver-api/vfio-mediated-device.rst       |   9 ++
 drivers/vfio/mdev/mdev_core.c                 | 142 +++++++++++++++++-
 drivers/vfio/mdev/mdev_private.h              |   5 +-
 drivers/vfio/mdev/mdev_sysfs.c                |  26 +++-
 include/linux/mdev.h                          |   5 +
 samples/vfio-mdev/mtty.c                      |  13 ++
 6 files changed, 190 insertions(+), 10 deletions(-)

-- 
2.19.2


* [PATCH v3 1/5] mdev: Introduce sha1 based mdev alias
From: Parav Pandit @ 2019-09-02  4:24 UTC (permalink / raw)
  To: alex.williamson, jiri, kwankhede, cohuck, davem
  Cc: kvm, linux-kernel, netdev, Parav Pandit
In-Reply-To: <20190902042436.23294-1-parav@mellanox.com>

Some vendor drivers want an identifier for an mdev device that is
shorter than the UUID, due to length restrictions in the consumers of
that identifier.

Add a callback that allows a vendor driver to request an alias of a
specified length to be generated for an mdev device. If generated,
that alias is checked for collisions.

It is an optional attribute.
mdev alias is generated using sha1 from the mdev name.

Signed-off-by: Parav Pandit <parav@mellanox.com>

---
Changelog:
v1->v2:
 - Kept mdev_device naturally aligned
 - Added error checking for crypto_*() calls
 - Corrected a typo from 'and' to 'an'
 - Changed return type of generate_alias() from int to char*
v0->v1:
 - Moved alias length check outside of the parent lock
 - Moved alias and digest allocation from kvzalloc to kzalloc
 - &alias[0] changed to alias
 - alias_length check is nested under get_alias_length callback check
 - Changed comments to start with an empty line
 - Fixed cleanup of hash if mdev_bus_register() fails
 - Added comment where alias memory ownership is handed over to mdev device
 - Updated commit log to indicate motivation for this feature
---
 drivers/vfio/mdev/mdev_core.c    | 123 ++++++++++++++++++++++++++++++-
 drivers/vfio/mdev/mdev_private.h |   5 +-
 drivers/vfio/mdev/mdev_sysfs.c   |  13 ++--
 include/linux/mdev.h             |   4 +
 4 files changed, 135 insertions(+), 10 deletions(-)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index b558d4cfd082..3bdff0469607 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -10,9 +10,11 @@
 #include <linux/module.h>
 #include <linux/device.h>
 #include <linux/slab.h>
+#include <linux/mm.h>
 #include <linux/uuid.h>
 #include <linux/sysfs.h>
 #include <linux/mdev.h>
+#include <crypto/hash.h>
 
 #include "mdev_private.h"
 
@@ -27,6 +29,8 @@ static struct class_compat *mdev_bus_compat_class;
 static LIST_HEAD(mdev_list);
 static DEFINE_MUTEX(mdev_list_lock);
 
+static struct crypto_shash *alias_hash;
+
 struct device *mdev_parent_dev(struct mdev_device *mdev)
 {
 	return mdev->parent->dev;
@@ -150,6 +154,16 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
 	if (!ops || !ops->create || !ops->remove || !ops->supported_type_groups)
 		return -EINVAL;
 
+	if (ops->get_alias_length) {
+		unsigned int digest_size;
+		unsigned int aligned_len;
+
+		aligned_len = roundup(ops->get_alias_length(), 2);
+		digest_size = crypto_shash_digestsize(alias_hash);
+		if (aligned_len / 2 > digest_size)
+			return -EINVAL;
+	}
+
 	dev = get_device(dev);
 	if (!dev)
 		return -EINVAL;
@@ -259,6 +273,7 @@ static void mdev_device_free(struct mdev_device *mdev)
 	mutex_unlock(&mdev_list_lock);
 
 	dev_dbg(&mdev->dev, "MDEV: destroying\n");
+	kfree(mdev->alias);
 	kfree(mdev);
 }
 
@@ -269,18 +284,101 @@ static void mdev_device_release(struct device *dev)
 	mdev_device_free(mdev);
 }
 
-int mdev_device_create(struct kobject *kobj,
-		       struct device *dev, const guid_t *uuid)
+static const char *
+generate_alias(const char *uuid, unsigned int max_alias_len)
+{
+	struct shash_desc *hash_desc;
+	unsigned int digest_size;
+	unsigned char *digest;
+	unsigned int alias_len;
+	char *alias;
+	int ret;
+
+	/*
+	 * Align to multiple of 2 as bin2hex will generate
+	 * even number of bytes.
+	 */
+	alias_len = roundup(max_alias_len, 2);
+	alias = kzalloc(alias_len + 1, GFP_KERNEL);
+	if (!alias)
+		return ERR_PTR(-ENOMEM);
+
+	/* Allocate and init descriptor */
+	hash_desc = kvzalloc(sizeof(*hash_desc) +
+			     crypto_shash_descsize(alias_hash),
+			     GFP_KERNEL);
+	if (!hash_desc) {
+		ret = -ENOMEM;
+		goto desc_err;
+	}
+
+	hash_desc->tfm = alias_hash;
+
+	digest_size = crypto_shash_digestsize(alias_hash);
+
+	digest = kzalloc(digest_size, GFP_KERNEL);
+	if (!digest) {
+		ret = -ENOMEM;
+		goto digest_err;
+	}
+	ret = crypto_shash_init(hash_desc);
+	if (ret)
+		goto hash_err;
+
+	ret = crypto_shash_update(hash_desc, uuid, UUID_STRING_LEN);
+	if (ret)
+		goto hash_err;
+
+	ret = crypto_shash_final(hash_desc, digest);
+	if (ret)
+		goto hash_err;
+
+	bin2hex(alias, digest, min_t(unsigned int, digest_size, alias_len / 2));
+	/*
+	 * When alias length is odd, zero out an additional last byte
+	 * that bin2hex has copied.
+	 */
+	if (max_alias_len % 2)
+		alias[max_alias_len] = 0;
+
+	kfree(digest);
+	kvfree(hash_desc);
+	return alias;
+
+hash_err:
+	kfree(digest);
+digest_err:
+	kvfree(hash_desc);
+desc_err:
+	kfree(alias);
+	return ERR_PTR(ret);
+}
+
+int mdev_device_create(struct kobject *kobj, struct device *dev,
+		       const char *uuid_str, const guid_t *uuid)
 {
 	int ret;
 	struct mdev_device *mdev, *tmp;
 	struct mdev_parent *parent;
 	struct mdev_type *type = to_mdev_type(kobj);
+	const char *alias = NULL;
 
 	parent = mdev_get_parent(type->parent);
 	if (!parent)
 		return -EINVAL;
 
+	if (parent->ops->get_alias_length) {
+		unsigned int alias_len;
+
+		alias_len = parent->ops->get_alias_length();
+		if (alias_len) {
+			alias = generate_alias(uuid_str, alias_len);
+			if (IS_ERR(alias)) {
+				ret = PTR_ERR(alias);
+				goto alias_fail;
+			}
+		}
+	}
 	mutex_lock(&mdev_list_lock);
 
 	/* Check for duplicate */
@@ -300,6 +398,12 @@ int mdev_device_create(struct kobject *kobj,
 	}
 
 	guid_copy(&mdev->uuid, uuid);
+	mdev->alias = alias;
+	/*
+	 * At this point alias memory is owned by the mdev.
+	 * Mark it NULL, so that only mdev can free it.
+	 */
+	alias = NULL;
 	list_add(&mdev->next, &mdev_list);
 	mutex_unlock(&mdev_list_lock);
 
@@ -346,6 +450,8 @@ int mdev_device_create(struct kobject *kobj,
 	up_read(&parent->unreg_sem);
 	put_device(&mdev->dev);
 mdev_fail:
+	kfree(alias);
+alias_fail:
 	mdev_put_parent(parent);
 	return ret;
 }
@@ -406,7 +512,17 @@ EXPORT_SYMBOL(mdev_get_iommu_device);
 
 static int __init mdev_init(void)
 {
-	return mdev_bus_register();
+	int ret;
+
+	alias_hash = crypto_alloc_shash("sha1", 0, 0);
+	if (IS_ERR(alias_hash))
+		return PTR_ERR(alias_hash);
+
+	ret = mdev_bus_register();
+	if (ret)
+		crypto_free_shash(alias_hash);
+
+	return ret;
 }
 
 static void __exit mdev_exit(void)
@@ -415,6 +531,7 @@ static void __exit mdev_exit(void)
 		class_compat_unregister(mdev_bus_compat_class);
 
 	mdev_bus_unregister();
+	crypto_free_shash(alias_hash);
 }
 
 module_init(mdev_init)
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
index 7d922950caaf..078fdaf7836e 100644
--- a/drivers/vfio/mdev/mdev_private.h
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -32,6 +32,7 @@ struct mdev_device {
 	struct list_head next;
 	struct kobject *type_kobj;
 	struct device *iommu_device;
+	const char *alias;
 	bool active;
 };
 
@@ -57,8 +58,8 @@ void parent_remove_sysfs_files(struct mdev_parent *parent);
 int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type);
 void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
 
-int  mdev_device_create(struct kobject *kobj,
-			struct device *dev, const guid_t *uuid);
+int mdev_device_create(struct kobject *kobj, struct device *dev,
+		       const char *uuid_str, const guid_t *uuid);
 int  mdev_device_remove(struct device *dev);
 
 #endif /* MDEV_PRIVATE_H */
diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
index 7570c7602ab4..43afe0e80b76 100644
--- a/drivers/vfio/mdev/mdev_sysfs.c
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -63,15 +63,18 @@ static ssize_t create_store(struct kobject *kobj, struct device *dev,
 		return -ENOMEM;
 
 	ret = guid_parse(str, &uuid);
-	kfree(str);
 	if (ret)
-		return ret;
+		goto err;
 
-	ret = mdev_device_create(kobj, dev, &uuid);
+	ret = mdev_device_create(kobj, dev, str, &uuid);
 	if (ret)
-		return ret;
+		goto err;
 
-	return count;
+	ret = count;
+
+err:
+	kfree(str);
+	return ret;
 }
 
 MDEV_TYPE_ATTR_WO(create);
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
index 0ce30ca78db0..f036fe9854ee 100644
--- a/include/linux/mdev.h
+++ b/include/linux/mdev.h
@@ -72,6 +72,9 @@ struct device *mdev_get_iommu_device(struct device *dev);
  * @mmap:		mmap callback
  *			@mdev: mediated device structure
  *			@vma: vma structure
+ * @get_alias_length:	Generate alias for the mdevs of this parent based on the
+ *			mdev device name when it returns non zero alias length.
+ *			It is optional.
  * Parent device that support mediated device should be registered with mdev
  * module with mdev_parent_ops structure.
  **/
@@ -92,6 +95,7 @@ struct mdev_parent_ops {
 	long	(*ioctl)(struct mdev_device *mdev, unsigned int cmd,
 			 unsigned long arg);
 	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
+	unsigned int (*get_alias_length)(void);
 };
 
 /* interface for exporting mdev supported type attributes */
-- 
2.19.2
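
The length handling in generate_alias() (round up to an even length,
hex-encode, then cut an odd-length alias by one character) can be tried
out in user space with the sketch below. It is a model under stated
assumptions, not the kernel code: alias_from_digest() is a hypothetical
name, the digest is passed in by the caller instead of being computed with
sha1, and the hex loop stands in for bin2hex().

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Build an alias of max_alias_len hex characters from the leading bytes
 * of a digest. Returns a malloc'ed string the caller must free, or NULL
 * if the digest is too short for the requested length or allocation fails.
 */
static char *alias_from_digest(const unsigned char *digest, size_t digest_size,
			       unsigned int max_alias_len)
{
	static const char hex[] = "0123456789abcdef";
	/* Same effect as roundup(max_alias_len, 2). */
	size_t alias_len = (size_t)max_alias_len + (max_alias_len % 2);
	size_t nbytes = alias_len / 2;
	char *alias;
	size_t i;

	/* Mirrors the length check done when the parent registers. */
	if (nbytes > digest_size)
		return NULL;

	alias = calloc(alias_len + 1, 1);
	if (!alias)
		return NULL;

	for (i = 0; i < nbytes; i++) {
		alias[2 * i] = hex[digest[i] >> 4];
		alias[2 * i + 1] = hex[digest[i] & 0x0f];
	}

	/* For an odd length, drop the extra character that was encoded. */
	alias[max_alias_len] = '\0';
	return alias;
}
```

A 20-byte sha1 digest therefore supports aliases of up to 40 characters;
longer requests are rejected up front rather than silently truncated.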



* [PATCH v3 5/5] mtty: Optionally support mtty alias
From: Parav Pandit @ 2019-09-02  4:24 UTC (permalink / raw)
  To: alex.williamson, jiri, kwankhede, cohuck, davem
  Cc: kvm, linux-kernel, netdev, Parav Pandit
In-Reply-To: <20190902042436.23294-1-parav@mellanox.com>

Provide a module parameter that sets the alias length, so that an mdev
alias is optionally generated.

Example to request an mdev alias:
$ modprobe mtty alias_length=12

Make use of the mdev_alias() API when the alias_length module parameter is set.

Signed-off-by: Parav Pandit <parav@mellanox.com>
---
Changelog:
v1->v2:
 - Added mdev_alias() usage sample
---
 samples/vfio-mdev/mtty.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/samples/vfio-mdev/mtty.c b/samples/vfio-mdev/mtty.c
index 92e770a06ea2..075d65440bc0 100644
--- a/samples/vfio-mdev/mtty.c
+++ b/samples/vfio-mdev/mtty.c
@@ -150,6 +150,10 @@ static const struct file_operations vd_fops = {
 	.owner          = THIS_MODULE,
 };
 
+static unsigned int mtty_alias_length;
+module_param_named(alias_length, mtty_alias_length, uint, 0444);
+MODULE_PARM_DESC(alias_length, "mdev alias length; default=0");
+
 /* function prototypes */
 
 static int mtty_trigger_interrupt(const guid_t *uuid);
@@ -770,6 +774,9 @@ static int mtty_create(struct kobject *kobj, struct mdev_device *mdev)
 	list_add(&mdev_state->next, &mdev_devices_list);
 	mutex_unlock(&mdev_list_lock);
 
+	if (mtty_alias_length)
+		dev_dbg(mdev_dev(mdev), "alias is %s\n", mdev_alias(mdev));
+
 	return 0;
 }
 
@@ -1410,6 +1417,11 @@ static struct attribute_group *mdev_type_groups[] = {
 	NULL,
 };
 
+static unsigned int mtty_get_alias_length(void)
+{
+	return mtty_alias_length;
+}
+
 static const struct mdev_parent_ops mdev_fops = {
 	.owner                  = THIS_MODULE,
 	.dev_attr_groups        = mtty_dev_groups,
@@ -1422,6 +1434,7 @@ static const struct mdev_parent_ops mdev_fops = {
 	.read                   = mtty_read,
 	.write                  = mtty_write,
 	.ioctl		        = mtty_ioctl,
+	.get_alias_length	= mtty_get_alias_length
 };
 
 static void mtty_device_release(struct device *dev)
-- 
2.19.2



* [PATCH v3 4/5] mdev: Introduce an API mdev_alias
From: Parav Pandit @ 2019-09-02  4:24 UTC (permalink / raw)
  To: alex.williamson, jiri, kwankhede, cohuck, davem
  Cc: kvm, linux-kernel, netdev, Parav Pandit
In-Reply-To: <20190902042436.23294-1-parav@mellanox.com>

Introduce an API mdev_alias() to provide access to the optionally
generated alias.

Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 drivers/vfio/mdev/mdev_core.c | 12 ++++++++++++
 include/linux/mdev.h          |  1 +
 2 files changed, 13 insertions(+)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index c8cd40366783..9eec556fbdd4 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -517,6 +517,18 @@ struct device *mdev_get_iommu_device(struct device *dev)
 }
 EXPORT_SYMBOL(mdev_get_iommu_device);
 
+/**
+ * mdev_alias: Return alias string of a mdev device
+ * @mdev:	Pointer to the mdev device
+ * mdev_alias() returns alias string of a mdev device if alias is present,
+ * returns NULL otherwise.
+ */
+const char *mdev_alias(struct mdev_device *mdev)
+{
+	return mdev->alias;
+}
+EXPORT_SYMBOL(mdev_alias);
+
 static int __init mdev_init(void)
 {
 	int ret;
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
index f036fe9854ee..6da82213bc4e 100644
--- a/include/linux/mdev.h
+++ b/include/linux/mdev.h
@@ -148,5 +148,6 @@ void mdev_unregister_driver(struct mdev_driver *drv);
 struct device *mdev_parent_dev(struct mdev_device *mdev);
 struct device *mdev_dev(struct mdev_device *mdev);
 struct mdev_device *mdev_from_dev(struct device *dev);
+const char *mdev_alias(struct mdev_device *mdev);
 
 #endif /* MDEV_H */
-- 
2.19.2



* [PATCH v3 2/5] mdev: Make mdev alias unique among all mdevs
From: Parav Pandit @ 2019-09-02  4:24 UTC (permalink / raw)
  To: alex.williamson, jiri, kwankhede, cohuck, davem
  Cc: kvm, linux-kernel, netdev, Parav Pandit
In-Reply-To: <20190902042436.23294-1-parav@mellanox.com>

Mdev alias should be unique among all the mdevs, so that when such alias
is used by the mdev users to derive other objects, there is no
collision in a given system.

Signed-off-by: Parav Pandit <parav@mellanox.com>

---
Changelog:
v2->v3:
 - Changed strcmp() ==0 to !strcmp()
v1->v2:
 - Moved alias NULL check at beginning
v0->v1:
 - Fixed inclusion of alias for NULL check
 - Added ratelimited debug print for sha1 hash collision error
---
 drivers/vfio/mdev/mdev_core.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 3bdff0469607..c8cd40366783 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -388,6 +388,13 @@ int mdev_device_create(struct kobject *kobj, struct device *dev,
 			ret = -EEXIST;
 			goto mdev_fail;
 		}
+		if (alias && tmp->alias && !strcmp(alias, tmp->alias)) {
+			mutex_unlock(&mdev_list_lock);
+			ret = -EEXIST;
+			dev_dbg_ratelimited(dev, "Hash collision in alias creation for UUID %pUl\n",
+					    uuid);
+			goto mdev_fail;
+		}
 	}
 
 	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
-- 
2.19.2
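
The uniqueness check added by this patch is a linear scan over the
existing mdevs. A stand-alone sketch of the same comparison, using a
hypothetical singly linked struct mdev_entry in place of the kernel list
and leaving out the locking:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an entry on the mdev list. */
struct mdev_entry {
	const char *alias;		/* NULL when no alias was generated */
	struct mdev_entry *next;
};

/*
 * Returns 1 if alias is already used by an entry on the list, 0 otherwise.
 * Entries without an alias never collide, and a NULL alias is never
 * treated as a duplicate.
 */
static int alias_in_use(const struct mdev_entry *head, const char *alias)
{
	const struct mdev_entry *e;

	if (!alias)
		return 0;

	for (e = head; e; e = e->next)
		if (e->alias && !strcmp(alias, e->alias))
			return 1;

	return 0;
}
```

Because the alias is a truncated hash, two different UUIDs can map to the
same alias; the patch reports that case as -EEXIST instead of creating a
second device with a duplicate alias.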



* [PATCH v3 3/5] mdev: Expose mdev alias in sysfs tree
From: Parav Pandit @ 2019-09-02  4:24 UTC (permalink / raw)
  To: alex.williamson, jiri, kwankhede, cohuck, davem
  Cc: kvm, linux-kernel, netdev, Parav Pandit
In-Reply-To: <20190902042436.23294-1-parav@mellanox.com>

Expose the optional alias for an mdev device as a sysfs attribute.
This way, userspace tools such as udev may make use of the alias, for
example to create a netdevice name for the mdev.

Update the documentation for the optional read-only sysfs attribute.

Signed-off-by: Parav Pandit <parav@mellanox.com>

---
Changelog:
v2->v3:
 - Merged sysfs documentation patch with sysfs addition
 - Added more description for alias return value
v0->v1:
 - Addressed comments from Cornelia Huck
 - Updated commit description
---
 Documentation/driver-api/vfio-mediated-device.rst |  9 +++++++++
 drivers/vfio/mdev/mdev_sysfs.c                    | 13 +++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/Documentation/driver-api/vfio-mediated-device.rst b/Documentation/driver-api/vfio-mediated-device.rst
index 25eb7d5b834b..0b7d2bf843b6 100644
--- a/Documentation/driver-api/vfio-mediated-device.rst
+++ b/Documentation/driver-api/vfio-mediated-device.rst
@@ -270,6 +270,7 @@ Directories and Files Under the sysfs for Each mdev Device
          |--- remove
          |--- mdev_type {link to its type}
          |--- vendor-specific-attributes [optional]
+         |--- alias
 
 * remove (write only)
 
@@ -281,6 +282,14 @@ Example::
 
 	# echo 1 > /sys/bus/mdev/devices/$mdev_UUID/remove
 
+* alias (read only, optional)
+Whenever a parent requests generation of an alias, each mdev device of that
+parent is assigned a unique alias by the mdev core.
+This file shows the alias of the mdev device.
+
+Reading the file returns either a valid alias, when one is assigned, or the
+error code -EOPNOTSUPP when aliases are unsupported.
+
 Mediated device Hot plug
 ------------------------
 
diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
index 43afe0e80b76..59f4e3cc5233 100644
--- a/drivers/vfio/mdev/mdev_sysfs.c
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -246,7 +246,20 @@ static ssize_t remove_store(struct device *dev, struct device_attribute *attr,
 
 static DEVICE_ATTR_WO(remove);
 
+static ssize_t alias_show(struct device *device,
+			  struct device_attribute *attr, char *buf)
+{
+	struct mdev_device *dev = mdev_from_dev(device);
+
+	if (!dev->alias)
+		return -EOPNOTSUPP;
+
+	return sprintf(buf, "%s\n", dev->alias);
+}
+static DEVICE_ATTR_RO(alias);
+
 static const struct attribute *mdev_device_attrs[] = {
+	&dev_attr_alias.attr,
 	&dev_attr_remove.attr,
 	NULL,
 };
-- 
2.19.2



* Re: [RFC v3] vhost: introduce mdev based hardware vhost backend
From: Jason Wang @ 2019-09-02  4:15 UTC (permalink / raw)
  To: Tiwei Bie, mst, alex.williamson, maxime.coquelin
  Cc: linux-kernel, kvm, virtualization, netdev, dan.daly,
	cunming.liang, zhihong.wang, lingshan.zhu
In-Reply-To: <20190828053712.26106-1-tiwei.bie@intel.com>


On 2019/8/28 1:37 PM, Tiwei Bie wrote:
> Details about this can be found here:
>
> https://lwn.net/Articles/750770/
>
> What's new in this version
> ==========================
>
> There are three choices based on the discussion [1] in RFC v2:
>
>> #1. We expose a VFIO device, so we can reuse the VFIO container/group
>>      based DMA API and potentially reuse a lot of VFIO code in QEMU.
>>
>>      But in this case, we have two choices for the VFIO device interface
>>      (i.e. the interface on top of VFIO device fd):
>>
>>      A) we may invent a new vhost protocol (as demonstrated by the code
>>         in this RFC) on VFIO device fd to make it work in VFIO's way,
>>         i.e. regions and irqs.
>>
>>      B) Or as you proposed, instead of inventing a new vhost protocol,
>>         we can reuse most existing vhost ioctls on the VFIO device fd
>>         directly. There should be no conflicts between the VFIO ioctls
>>         (type is 0x3B) and VHOST ioctls (type is 0xAF) currently.
>>
>> #2. Instead of exposing a VFIO device, we may expose a VHOST device.
>>      And we will introduce a new mdev driver vhost-mdev to do this.
>>      It would be natural to reuse the existing kernel vhost interface
>>      (ioctls) on it as much as possible. But we will need to invent
>>      some APIs for DMA programming (reusing VHOST_SET_MEM_TABLE is a
>>      choice, but it's too heavy and doesn't support vIOMMU by itself).
> This version is more like a quick PoC to try Jason's proposal on
> reusing vhost ioctls. And the second way (#1/B) in above three
> choices was chosen in this version to demonstrate the idea quickly.
>
> Now the userspace API looks like this:
>
> - VFIO's container/group based IOMMU API is used to do the
>    DMA programming.
>
> - Vhost's existing ioctls are used to setup the device.
>
> And the device will report device_api as "vfio-vhost".
>
> Note that, there are dirty hacks in this version. If we decide to
> go this way, some refactoring in vhost.c/vhost.h may be needed.
>
> PS. The direct mapping of the notify registers isn't implemented
>      in this version.
>
> [1] https://lkml.org/lkml/2019/7/9/101


Thanks for the patch, see comments inline.


>
> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> ---
>   drivers/vhost/Kconfig      |   9 +
>   drivers/vhost/Makefile     |   3 +
>   drivers/vhost/mdev.c       | 382 +++++++++++++++++++++++++++++++++++++
>   include/linux/vhost_mdev.h |  58 ++++++
>   include/uapi/linux/vfio.h  |   2 +
>   include/uapi/linux/vhost.h |   8 +
>   6 files changed, 462 insertions(+)
>   create mode 100644 drivers/vhost/mdev.c
>   create mode 100644 include/linux/vhost_mdev.h
>
> diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
> index 3d03ccbd1adc..2ba54fcf43b7 100644
> --- a/drivers/vhost/Kconfig
> +++ b/drivers/vhost/Kconfig
> @@ -34,6 +34,15 @@ config VHOST_VSOCK
>   	To compile this driver as a module, choose M here: the module will be called
>   	vhost_vsock.
>   
> +config VHOST_MDEV
> +	tristate "Hardware vhost accelerator abstraction"
> +	depends on EVENTFD && VFIO && VFIO_MDEV
> +	select VHOST
> +	default n
> +	---help---
> +	Say Y here to enable the vhost_mdev module
> +	for use with hardware vhost accelerators
> +
>   config VHOST
>   	tristate
>   	---help---
> diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
> index 6c6df24f770c..ad9c0f8c6d8c 100644
> --- a/drivers/vhost/Makefile
> +++ b/drivers/vhost/Makefile
> @@ -10,4 +10,7 @@ vhost_vsock-y := vsock.o
>   
>   obj-$(CONFIG_VHOST_RING) += vringh.o
>   
> +obj-$(CONFIG_VHOST_MDEV) += vhost_mdev.o
> +vhost_mdev-y := mdev.o
> +
>   obj-$(CONFIG_VHOST)	+= vhost.o
> diff --git a/drivers/vhost/mdev.c b/drivers/vhost/mdev.c
> new file mode 100644
> index 000000000000..6bef1d9ae2e6
> --- /dev/null
> +++ b/drivers/vhost/mdev.c
> @@ -0,0 +1,382 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2018-2019 Intel Corporation.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/vfio.h>
> +#include <linux/vhost.h>
> +#include <linux/mdev.h>
> +#include <linux/vhost_mdev.h>
> +
> +#include "vhost.h"
> +
> +struct vhost_mdev {
> +	struct vhost_dev dev;
> +	bool opened;
> +	int nvqs;
> +	u64 state;
> +	u64 acked_features;
> +	u64 features;
> +	const struct vhost_mdev_device_ops *ops;
> +	struct mdev_device *mdev;
> +	void *private;
> +	struct vhost_virtqueue vqs[];
> +};
> +
> +static void handle_vq_kick(struct vhost_work *work)
> +{
> +	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
> +						  poll.work);
> +	struct vhost_mdev *vdpa = container_of(vq->dev, struct vhost_mdev, dev);
> +
> +	vdpa->ops->notify(vdpa, vq - vdpa->vqs);
> +}
> +
> +static int vhost_set_state(struct vhost_mdev *vdpa, u64 __user *statep)
> +{
> +	u64 state;
> +
> +	if (copy_from_user(&state, statep, sizeof(state)))
> +		return -EFAULT;
> +
> +	if (state >= VHOST_MDEV_S_MAX)
> +		return -EINVAL;
> +
> +	if (vdpa->state == state)
> +		return 0;
> +
> +	mutex_lock(&vdpa->dev.mutex);
> +
> +	vdpa->state = state;
> +
> +	switch (vdpa->state) {
> +	case VHOST_MDEV_S_RUNNING:
> +		vdpa->ops->start(vdpa);
> +		break;
> +	case VHOST_MDEV_S_STOPPED:
> +		vdpa->ops->stop(vdpa);
> +		break;
> +	}
> +
> +	mutex_unlock(&vdpa->dev.mutex);
> +
> +	return 0;
> +}
> +
> +static int vhost_set_features(struct vhost_mdev *vdpa, u64 __user *featurep)
> +{
> +	u64 features;
> +
> +	if (copy_from_user(&features, featurep, sizeof(features)))
> +		return -EFAULT;
> +
> +	if (features & ~vdpa->features)
> +		return -EINVAL;
> +
> +	vdpa->acked_features = features;
> +	vdpa->ops->features_changed(vdpa);
> +	return 0;
> +}
> +
> +static int vhost_get_features(struct vhost_mdev *vdpa, u64 __user *featurep)
> +{
> +	if (copy_to_user(featurep, &vdpa->features, sizeof(vdpa->features)))
> +		return -EFAULT;
> +	return 0;
> +}
> +
> +static int vhost_get_vring_base(struct vhost_mdev *vdpa, void __user *argp)
> +{
> +	struct vhost_virtqueue *vq;
> +	u32 idx;
> +	int r;
> +
> +	r = get_user(idx, (u32 __user *)argp);
> +	if (r < 0)
> +		return r;
> +
> +	vq = &vdpa->vqs[idx];
> +	vq->last_avail_idx = vdpa->ops->get_vring_base(vdpa, idx);
> +
> +	return vhost_vring_ioctl(&vdpa->dev, VHOST_GET_VRING_BASE, argp);
> +}
> +
> +/*
> + * Helpers for backend to register mdev.
> + */
> +
> +struct vhost_mdev *vhost_mdev_alloc(struct mdev_device *mdev, void *private,
> +				    int nvqs)
> +{
> +	struct vhost_mdev *vdpa;
> +	struct vhost_dev *dev;
> +	struct vhost_virtqueue **vqs;
> +	size_t size;
> +	int i;
> +
> +	size = sizeof(struct vhost_mdev) + nvqs * sizeof(struct vhost_virtqueue);
> +
> +	vdpa = kzalloc(size, GFP_KERNEL);
> +	if (!vdpa)
> +		return NULL;
> +
> +	vdpa->nvqs = nvqs;
> +
> +	vqs = kmalloc_array(nvqs, sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs) {
> +		kfree(vdpa);
> +		return NULL;
> +	}
> +
> +	dev = &vdpa->dev;
> +	for (i = 0; i < nvqs; i++) {
> +		vqs[i] = &vdpa->vqs[i];
> +		vqs[i]->handle_kick = handle_vq_kick;
> +	}
> +	vhost_dev_init(dev, vqs, nvqs, 0, 0, 0);
> +
> +	vdpa->private = private;
> +	vdpa->mdev = mdev;
> +
> +	mdev_set_drvdata(mdev, vdpa);
> +
> +	return vdpa;
> +}
> +EXPORT_SYMBOL(vhost_mdev_alloc);
> +
> +void vhost_mdev_free(struct vhost_mdev *vdpa)
> +{
> +	struct mdev_device *mdev;
> +
> +	mdev = vdpa->mdev;
> +	mdev_set_drvdata(mdev, NULL);
> +
> +	vhost_dev_stop(&vdpa->dev);
> +	vhost_dev_cleanup(&vdpa->dev);
> +	kfree(vdpa->dev.vqs);
> +	kfree(vdpa);
> +}
> +EXPORT_SYMBOL(vhost_mdev_free);
> +
> +ssize_t vhost_mdev_read(struct mdev_device *mdev, char __user *buf,
> +		  size_t count, loff_t *ppos)
> +{
> +	return -EINVAL;
> +}
> +EXPORT_SYMBOL(vhost_mdev_read);
> +
> +
> +ssize_t vhost_mdev_write(struct mdev_device *mdev, const char __user *buf,
> +		   size_t count, loff_t *ppos)
> +{
> +	return -EINVAL;
> +}
> +EXPORT_SYMBOL(vhost_mdev_write);
> +
> +int vhost_mdev_mmap(struct mdev_device *mdev, struct vm_area_struct *vma)
> +{
> +	// TODO
> +	return -EINVAL;
> +}
> +EXPORT_SYMBOL(vhost_mdev_mmap);
> +
> +long vhost_mdev_ioctl(struct mdev_device *mdev, unsigned int cmd,
> +		      unsigned long arg)
> +{
> +	void __user *argp = (void __user *)arg;
> +	struct vhost_mdev *vdpa;
> +	unsigned long minsz;
> +	int ret = 0;
> +
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	vdpa = mdev_get_drvdata(mdev);
> +	if (!vdpa)
> +		return -ENODEV;
> +
> +	switch (cmd) {
> +	case VFIO_DEVICE_GET_INFO:
> +	{
> +		struct vfio_device_info info;
> +
> +		minsz = offsetofend(struct vfio_device_info, num_irqs);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz)) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +
> +		if (info.argsz < minsz) {
> +			ret = -EINVAL;
> +			break;
> +		}
> +
> +		info.flags = VFIO_DEVICE_FLAGS_VHOST;
> +		info.num_regions = 0;
> +		info.num_irqs = 0;
> +
> +		if (copy_to_user((void __user *)arg, &info, minsz)) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +
> +		break;
> +	}
> +	case VFIO_DEVICE_GET_REGION_INFO:
> +	case VFIO_DEVICE_GET_IRQ_INFO:
> +	case VFIO_DEVICE_SET_IRQS:
> +	case VFIO_DEVICE_RESET:
> +		ret = -EINVAL;
> +		break;
> +
> +	case VHOST_MDEV_SET_STATE:
> +		ret = vhost_set_state(vdpa, argp);
> +		break;


So this is used to start or stop the device. This means that if userspace
wants to drive a network device, the API is not 100% compatible. Is there
any blocker for this? E.g. for SET_BACKEND, we can pass an fd and then
identify the type of backend.

Another question: how can the user know the type of a device?


> +	case VHOST_GET_FEATURES:
> +		ret = vhost_get_features(vdpa, argp);
> +		break;
> +	case VHOST_SET_FEATURES:
> +		ret = vhost_set_features(vdpa, argp);
> +		break;
> +	case VHOST_GET_VRING_BASE:
> +		ret = vhost_get_vring_base(vdpa, argp);
> +		break;
> +	default:
> +		ret = vhost_dev_ioctl(&vdpa->dev, cmd, argp);
> +		if (ret == -ENOIOCTLCMD)
> +			ret = vhost_vring_ioctl(&vdpa->dev, cmd, argp);
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(vhost_mdev_ioctl);
> +
> +int vhost_mdev_open(struct mdev_device *mdev)
> +{
> +	struct vhost_mdev *vdpa;
> +	int ret = 0;
> +
> +	vdpa = mdev_get_drvdata(mdev);
> +	if (!vdpa)
> +		return -ENODEV;
> +
> +	mutex_lock(&vdpa->dev.mutex);
> +
> +	if (vdpa->opened)
> +		ret = -EBUSY;
> +	else
> +		vdpa->opened = true;
> +
> +	mutex_unlock(&vdpa->dev.mutex);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(vhost_mdev_open);
> +
> +void vhost_mdev_close(struct mdev_device *mdev)
> +{
> +	struct vhost_mdev *vdpa;
> +
> +	vdpa = mdev_get_drvdata(mdev);
> +
> +	mutex_lock(&vdpa->dev.mutex);
> +
> +	vhost_dev_stop(&vdpa->dev);
> +	vhost_dev_cleanup(&vdpa->dev);
> +
> +	vdpa->opened = false;
> +	mutex_unlock(&vdpa->dev.mutex);
> +}
> +EXPORT_SYMBOL(vhost_mdev_close);
> +
> +/*
> + * Helpers for backend to set/get information.
> + */
> +
> +int vhost_mdev_set_device_ops(struct vhost_mdev *vdpa,
> +			      const struct vhost_mdev_device_ops *ops)
> +{
> +	vdpa->ops = ops;
> +	return 0;
> +}
> +EXPORT_SYMBOL(vhost_mdev_set_device_ops);
> +
> +int vhost_mdev_set_features(struct vhost_mdev *vdpa, u64 features)
> +{
> +	vdpa->features = features;
> +	return 0;
> +}
> +EXPORT_SYMBOL(vhost_mdev_set_features);
> +
> +struct eventfd_ctx *
> +vhost_mdev_get_call_ctx(struct vhost_mdev *vdpa, int queue_id)
> +{
> +	return vdpa->vqs[queue_id].call_ctx;
> +}
> +EXPORT_SYMBOL(vhost_mdev_get_call_ctx);
> +
> +int vhost_mdev_get_acked_features(struct vhost_mdev *vdpa, u64 *features)
> +{
> +	*features = vdpa->acked_features;
> +	return 0;
> +}
> +EXPORT_SYMBOL(vhost_mdev_get_acked_features);
> +
> +int vhost_mdev_get_vring_num(struct vhost_mdev *vdpa, int queue_id, u16 *num)
> +{
> +	*num = vdpa->vqs[queue_id].num;
> +	return 0;
> +}
> +EXPORT_SYMBOL(vhost_mdev_get_vring_num);
> +
> +int vhost_mdev_get_vring_base(struct vhost_mdev *vdpa, int queue_id, u16 *base)
> +{
> +	*base = vdpa->vqs[queue_id].last_avail_idx;
> +	return 0;
> +}
> +EXPORT_SYMBOL(vhost_mdev_get_vring_base);
> +
> +int vhost_mdev_get_vring_addr(struct vhost_mdev *vdpa, int queue_id,
> +			      struct vhost_vring_addr *addr)
> +{
> +	struct vhost_virtqueue *vq = &vdpa->vqs[queue_id];
> +
> +	/*
> +	 * XXX: we need userspace to pass guest physical address or
> +	 *      IOVA directly.
> +	 */
> +	addr->flags = vq->log_used ? (0x1 << VHOST_VRING_F_LOG) : 0;
> +	addr->desc_user_addr = (__u64)vq->desc;
> +	addr->avail_user_addr = (__u64)vq->avail;
> +	addr->used_user_addr = (__u64)vq->used;
> +	addr->log_guest_addr = (__u64)vq->log_addr;
> +	return 0;
> +}
> +EXPORT_SYMBOL(vhost_mdev_get_vring_addr);
> +
> +int vhost_mdev_get_log_base(struct vhost_mdev *vdpa, int queue_id,
> +			    void **log_base, u64 *log_size)
> +{
> +	// TODO
> +	return 0;
> +}
> +EXPORT_SYMBOL(vhost_mdev_get_log_base);
> +
> +struct mdev_device *vhost_mdev_get_mdev(struct vhost_mdev *vdpa)
> +{
> +	return vdpa->mdev;
> +}
> +EXPORT_SYMBOL(vhost_mdev_get_mdev);
> +
> +void *vhost_mdev_get_private(struct vhost_mdev *vdpa)
> +{
> +	return vdpa->private;
> +}
> +EXPORT_SYMBOL(vhost_mdev_get_private);
> +
> +MODULE_VERSION("0.0.0");
> +MODULE_LICENSE("GPL v2");
> +MODULE_DESCRIPTION("Hardware vhost accelerator abstraction");
> diff --git a/include/linux/vhost_mdev.h b/include/linux/vhost_mdev.h
> new file mode 100644
> index 000000000000..070787ce6b36
> --- /dev/null
> +++ b/include/linux/vhost_mdev.h
> @@ -0,0 +1,58 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2018-2019 Intel Corporation.
> + */
> +
> +#ifndef _VHOST_MDEV_H
> +#define _VHOST_MDEV_H
> +
> +struct mdev_device;
> +struct vhost_mdev;
> +
> +typedef int (*vhost_mdev_start_device_t)(struct vhost_mdev *vdpa);
> +typedef int (*vhost_mdev_stop_device_t)(struct vhost_mdev *vdpa);
> +typedef int (*vhost_mdev_set_features_t)(struct vhost_mdev *vdpa);
> +typedef void (*vhost_mdev_notify_device_t)(struct vhost_mdev *vdpa, int queue_id);
> +typedef u64 (*vhost_mdev_get_notify_addr_t)(struct vhost_mdev *vdpa, int queue_id);
> +typedef u16 (*vhost_mdev_get_vring_base_t)(struct vhost_mdev *vdpa, int queue_id);
> +typedef void (*vhost_mdev_features_changed_t)(struct vhost_mdev *vdpa);
> +
> +struct vhost_mdev_device_ops {
> +	vhost_mdev_start_device_t	start;
> +	vhost_mdev_stop_device_t	stop;
> +	vhost_mdev_notify_device_t	notify;
> +	vhost_mdev_get_notify_addr_t	get_notify_addr;
> +	vhost_mdev_get_vring_base_t	get_vring_base;
> +	vhost_mdev_features_changed_t	features_changed;
> +};


Suppose we want to implement a network device: who is going to
implement the device configuration space? I believe it's not good to
invent another set of APIs for doing this. So I believe we want something
like read_config/write_config here.
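
As a rough illustration of what such read_config/write_config callbacks
might look like, here is a userspace sketch. Every name in it (demo_dev,
demo_config_ops, the 16-byte config array) is hypothetical and not taken
from the RFC; it only shows an ops table with bounds-checked access to a
device configuration space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct demo_dev;

/* Hypothetical config-space callbacks; not part of the RFC. */
struct demo_config_ops {
	int (*read_config)(struct demo_dev *dev, size_t off,
			   void *buf, size_t len);
	int (*write_config)(struct demo_dev *dev, size_t off,
			    const void *buf, size_t len);
};

struct demo_dev {
	const struct demo_config_ops *ops;
	uint8_t config[16];	/* stand-in for the device config space */
};

/* Reject accesses that would run past the end of the config space. */
static int demo_range_ok(size_t off, size_t len, size_t size)
{
	return off <= size && len <= size - off;
}

static int demo_read_config(struct demo_dev *dev, size_t off,
			    void *buf, size_t len)
{
	assert(dev && buf);
	if (!demo_range_ok(off, len, sizeof(dev->config)))
		return -1;
	memcpy(buf, dev->config + off, len);
	return 0;
}

static int demo_write_config(struct demo_dev *dev, size_t off,
			     const void *buf, size_t len)
{
	assert(dev && buf);
	if (!demo_range_ok(off, len, sizeof(dev->config)))
		return -1;
	memcpy(dev->config + off, buf, len);
	return 0;
}

static const struct demo_config_ops demo_ops = {
	.read_config = demo_read_config,
	.write_config = demo_write_config,
};
```

A driver on top of this would, for instance, read a MAC address with
dev->ops->read_config(dev, 0, mac, 6) rather than through a
device-specific ioctl.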

Then I came up with an idea:

1) introduce a new mdev bus transport, and a new mdev driver virtio_mdev
2) vDPA (either software or hardware) can register as a virtio mdev
device
3) then we can use the kernel virtio driver to drive the vDPA device and
utilize the kernel networking/storage stack
4) a userspace driver like vhost-mdev could be built on top of the
mdev transport

With a full new transport for virtio, the advantages are obvious:

1) A generic solution for both kernel and userspace drivers, with support
for configuration space access
2) For kernel drivers, the existing kernel networking/storage stack could
be reused, and so could fast-path implementations (e.g. XDP, io_uring, etc.).
3) For userspace drivers, the function of the virtio transport is a superset
of vhost, so any API could easily be built on top (e.g. vhost ioctls).

What do you think?

Thanks


> +
> +struct vhost_mdev *vhost_mdev_alloc(struct mdev_device *mdev,
> +		void *private, int nvqs);
> +void vhost_mdev_free(struct vhost_mdev *vdpa);
> +
> +ssize_t vhost_mdev_read(struct mdev_device *mdev, char __user *buf,
> +		size_t count, loff_t *ppos);
> +ssize_t vhost_mdev_write(struct mdev_device *mdev, const char __user *buf,
> +		size_t count, loff_t *ppos);
> +long vhost_mdev_ioctl(struct mdev_device *mdev, unsigned int cmd,
> +		unsigned long arg);
> +int vhost_mdev_mmap(struct mdev_device *mdev, struct vm_area_struct *vma);
> +int vhost_mdev_open(struct mdev_device *mdev);
> +void vhost_mdev_close(struct mdev_device *mdev);
> +
> +int vhost_mdev_set_device_ops(struct vhost_mdev *vdpa,
> +		const struct vhost_mdev_device_ops *ops);
> +int vhost_mdev_set_features(struct vhost_mdev *vdpa, u64 features);
> +struct eventfd_ctx *vhost_mdev_get_call_ctx(struct vhost_mdev *vdpa,
> +		int queue_id);
> +int vhost_mdev_get_acked_features(struct vhost_mdev *vdpa, u64 *features);
> +int vhost_mdev_get_vring_num(struct vhost_mdev *vdpa, int queue_id, u16 *num);
> +int vhost_mdev_get_vring_base(struct vhost_mdev *vdpa, int queue_id, u16 *base);
> +int vhost_mdev_get_vring_addr(struct vhost_mdev *vdpa, int queue_id,
> +		struct vhost_vring_addr *addr);
> +int vhost_mdev_get_log_base(struct vhost_mdev *vdpa, int queue_id,
> +		void **log_base, u64 *log_size);
> +struct mdev_device *vhost_mdev_get_mdev(struct vhost_mdev *vdpa);
> +void *vhost_mdev_get_private(struct vhost_mdev *vdpa);
> +
> +#endif /* _VHOST_MDEV_H */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 8f10748dac79..0300d6831cc5 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -201,6 +201,7 @@ struct vfio_device_info {
>   #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
>   #define VFIO_DEVICE_FLAGS_CCW	(1 << 4)	/* vfio-ccw device */
>   #define VFIO_DEVICE_FLAGS_AP	(1 << 5)	/* vfio-ap device */
> +#define VFIO_DEVICE_FLAGS_VHOST	(1 << 6)	/* vfio-vhost device */
>   	__u32	num_regions;	/* Max region index + 1 */
>   	__u32	num_irqs;	/* Max IRQ index + 1 */
>   };
> @@ -217,6 +218,7 @@ struct vfio_device_info {
>   #define VFIO_DEVICE_API_AMBA_STRING		"vfio-amba"
>   #define VFIO_DEVICE_API_CCW_STRING		"vfio-ccw"
>   #define VFIO_DEVICE_API_AP_STRING		"vfio-ap"
> +#define VFIO_DEVICE_API_VHOST_STRING		"vfio-vhost"
>   
>   /**
>    * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> index 40d028eed645..5afbc2f08fa3 100644
> --- a/include/uapi/linux/vhost.h
> +++ b/include/uapi/linux/vhost.h
> @@ -116,4 +116,12 @@
>   #define VHOST_VSOCK_SET_GUEST_CID	_IOW(VHOST_VIRTIO, 0x60, __u64)
>   #define VHOST_VSOCK_SET_RUNNING		_IOW(VHOST_VIRTIO, 0x61, int)
>   
> +/* VHOST_MDEV specific defines */
> +
> +#define VHOST_MDEV_SET_STATE	_IOW(VHOST_VIRTIO, 0x70, __u64)
> +
> +#define VHOST_MDEV_S_STOPPED	0
> +#define VHOST_MDEV_S_RUNNING	1
> +#define VHOST_MDEV_S_MAX	2
> +
>   #endif


* Re: KASAN: use-after-free Write in __xfrm_policy_unlink (2)
From: Dmitry Vyukov @ 2019-09-02  3:31 UTC (permalink / raw)
  To: syzbot
  Cc: David Miller, Herbert Xu, LKML, netdev, Steffen Klassert,
	syzkaller-bugs
In-Reply-To: <000000000000cd5fdf0588fed11c@google.com>

On Thu, May 16, 2019 at 3:35 AM syzbot
<syzbot+0025447b4cb6f208558f@syzkaller.appspotmail.com> wrote:
>
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    3b0f31f2 genetlink: make policy common to family
> git tree:       net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=12a319df200000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=f05902bca21d8935
> dashboard link: https://syzkaller.appspot.com/bug?extid=0025447b4cb6f208558f
> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+0025447b4cb6f208558f@syzkaller.appspotmail.com

This looks like what has been fixed by:

#syz fix:
xfrm: policy: Fix out-of-bound array accesses in __xfrm_policy_unlink


> ==================================================================
> BUG: KASAN: use-after-free in __write_once_size
> include/linux/compiler.h:220 [inline]
> BUG: KASAN: use-after-free in __hlist_del include/linux/list.h:713 [inline]
> BUG: KASAN: use-after-free in hlist_del_rcu include/linux/rculist.h:455
> [inline]
> BUG: KASAN: use-after-free in __xfrm_policy_unlink+0x4b1/0x5c0
> net/xfrm/xfrm_policy.c:2212
> Write of size 8 at addr ffff8880a55a9e80 by task kworker/u4:6/7431
>
> CPU: 1 PID: 7431 Comm: kworker/u4:6 Not tainted 5.0.0+ #106
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Workqueue: netns cleanup_net
> Call Trace:
>   __dump_stack lib/dump_stack.c:77 [inline]
>   dump_stack+0x172/0x1f0 lib/dump_stack.c:113
>   print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
>   kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
>   __asan_report_store8_noabort+0x17/0x20 mm/kasan/generic_report.c:137
>   __write_once_size include/linux/compiler.h:220 [inline]
>   __hlist_del include/linux/list.h:713 [inline]
>   hlist_del_rcu include/linux/rculist.h:455 [inline]
>   __xfrm_policy_unlink+0x4b1/0x5c0 net/xfrm/xfrm_policy.c:2212
>   xfrm_policy_flush+0x331/0x460 net/xfrm/xfrm_policy.c:1789
>   xfrm_policy_fini+0x49/0x3a0 net/xfrm/xfrm_policy.c:3871
>   xfrm_net_exit+0x1d/0x70 net/xfrm/xfrm_policy.c:3933
>   ops_exit_list.isra.0+0xb0/0x160 net/core/net_namespace.c:153
>   cleanup_net+0x3fb/0x960 net/core/net_namespace.c:551
>   process_one_work+0x98e/0x1790 kernel/workqueue.c:2269
>   worker_thread+0x98/0xe40 kernel/workqueue.c:2415
>   kthread+0x357/0x430 kernel/kthread.c:253
>   ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
>
> Allocated by task 7242:
>   save_stack+0x45/0xd0 mm/kasan/common.c:75
>   set_track mm/kasan/common.c:87 [inline]
>   __kasan_kmalloc mm/kasan/common.c:497 [inline]
>   __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:470
>   kasan_kmalloc+0x9/0x10 mm/kasan/common.c:511
>   __do_kmalloc mm/slab.c:3726 [inline]
>   __kmalloc+0x15c/0x740 mm/slab.c:3735
>   kmalloc include/linux/slab.h:550 [inline]
>   kzalloc include/linux/slab.h:740 [inline]
>   ext4_htree_store_dirent+0x8a/0x650 fs/ext4/dir.c:450
>   htree_dirblock_to_tree+0x4fe/0x910 fs/ext4/namei.c:1021
>   ext4_htree_fill_tree+0x252/0xa50 fs/ext4/namei.c:1098
>   ext4_dx_readdir fs/ext4/dir.c:574 [inline]
>   ext4_readdir+0x1999/0x3490 fs/ext4/dir.c:121
>   iterate_dir+0x489/0x5f0 fs/readdir.c:51
>   __do_sys_getdents fs/readdir.c:231 [inline]
>   __se_sys_getdents fs/readdir.c:212 [inline]
>   __x64_sys_getdents+0x1dd/0x370 fs/readdir.c:212
>   do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> Freed by task 7242:
>   save_stack+0x45/0xd0 mm/kasan/common.c:75
>   set_track mm/kasan/common.c:87 [inline]
>   __kasan_slab_free+0x102/0x150 mm/kasan/common.c:459
>   kasan_slab_free+0xe/0x10 mm/kasan/common.c:467
>   __cache_free mm/slab.c:3498 [inline]
>   kfree+0xcf/0x230 mm/slab.c:3821
>   free_rb_tree_fname+0x87/0xe0 fs/ext4/dir.c:402
>   ext4_htree_free_dir_info fs/ext4/dir.c:424 [inline]
>   ext4_release_dir+0x46/0x70 fs/ext4/dir.c:622
>   __fput+0x2e5/0x8d0 fs/file_table.c:278
>   ____fput+0x16/0x20 fs/file_table.c:309
>   task_work_run+0x14a/0x1c0 kernel/task_work.c:113
>   tracehook_notify_resume include/linux/tracehook.h:188 [inline]
>   exit_to_usermode_loop+0x273/0x2c0 arch/x86/entry/common.c:166
>   prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
>   syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
>   do_syscall_64+0x52d/0x610 arch/x86/entry/common.c:293
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> The buggy address belongs to the object at ffff8880a55a9e80
>   which belongs to the cache kmalloc-64 of size 64
> The buggy address is located 0 bytes inside of
>   64-byte region [ffff8880a55a9e80, ffff8880a55a9ec0)
> The buggy address belongs to the page:
> page:ffffea0002956a40 count:1 mapcount:0 mapping:ffff88812c3f0340 index:0x0
> flags: 0x1fffc0000000200(slab)
> raw: 01fffc0000000200 ffffea0002a0d748 ffffea00018af1c8 ffff88812c3f0340
> raw: 0000000000000000 ffff8880a55a9000 0000000100000020 0000000000000000
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
>   ffff8880a55a9d80: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc
>   ffff8880a55a9e00: 00 00 00 00 04 fc fc fc fc fc fc fc fc fc fc fc
> > ffff8880a55a9e80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>                     ^
>   ffff8880a55a9f00: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
>   ffff8880a55a9f80: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
> ==================================================================
>
>
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
>
> --
> You received this message because you are subscribed to the Google Groups "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller-bugs+unsubscribe@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/syzkaller-bugs/000000000000cd5fdf0588fed11c%40google.com.
> For more options, visit https://groups.google.com/d/optout.


* Re: kernel panic: stack is corrupted in lock_release (2)
From: Dmitry Vyukov @ 2019-09-02  3:24 UTC (permalink / raw)
  To: syzbot; +Cc: LKML, netdev, syzkaller-bugs, bpf
In-Reply-To: <00000000000088cdb2059186312f@google.com>

On Sun, Sep 1, 2019 at 4:27 PM syzbot
<syzbot+97deee97cf14574b96d0@syzkaller.appspotmail.com> wrote:
>
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    dd7078f0 enetc: Add missing call to 'pci_free_irq_vectors(..
> git tree:       net
> console output: https://syzkaller.appspot.com/x/log.txt?x=115fe0fa600000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=2a6a2b9826fdadf9
> dashboard link: https://syzkaller.appspot.com/bug?extid=97deee97cf14574b96d0
> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=11f7c2fe600000

Stack corruption + bpf maps in the repro rings some bells. +bpf mailing list.

> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+97deee97cf14574b96d0@syzkaller.appspotmail.com
>
> Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in:
> lock_release+0x866/0x960 kernel/locking/lockdep.c:4435
> CPU: 0 PID: 9965 Comm: syz-executor.0 Not tainted 5.3.0-rc6+ #182
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
>
>
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> syzbot can test patches for this bug, for details see:
> https://goo.gl/tpsmEJ#testing-patches
>
> --
> You received this message because you are subscribed to the Google Groups "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller-bugs+unsubscribe@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/syzkaller-bugs/00000000000088cdb2059186312f%40google.com.


* Re: kernel panic: stack is corrupted in __lock_acquire (4)
From: Dmitry Vyukov @ 2019-09-02  3:23 UTC (permalink / raw)
  To: syzbot, bpf; +Cc: LKML, netdev, syzkaller-bugs
In-Reply-To: <0000000000000ec274059185a63e@google.com>

On Sun, Sep 1, 2019 at 3:48 PM syzbot
<syzbot+83979935eb6304f8cd46@syzkaller.appspotmail.com> wrote:
>
> syzbot has found a reproducer for the following crash on:
>
> HEAD commit:    38320f69 Merge branch 'Minor-cleanup-in-devlink'
> git tree:       net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=13d74356600000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=1bbf70b6300045af
> dashboard link: https://syzkaller.appspot.com/bug?extid=83979935eb6304f8cd46
> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1008b232600000

Stack corruption + bpf maps in the repro rings some bells. +bpf mailing list.

> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+83979935eb6304f8cd46@syzkaller.appspotmail.com
>
> Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in:
> __lock_acquire+0x36fa/0x4c30 kernel/locking/lockdep.c:3907
> CPU: 0 PID: 8662 Comm: syz-executor.4 Not tainted 5.3.0-rc6+ #153
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
>
> --
> You received this message because you are subscribed to the Google Groups "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller-bugs+unsubscribe@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/syzkaller-bugs/0000000000000ec274059185a63e%40google.com.


* RE: [PATCH net-next] r8152: fix accessing skb after napi_gro_receive
From: Hayes Wang @ 2019-09-02  3:11 UTC (permalink / raw)
  To: Eric Dumazet, netdev@vger.kernel.org
  Cc: nic_swsd, linux-kernel@vger.kernel.org
In-Reply-To: <b39bc8a1-54c7-42d4-00ed-d48aa1bac734@gmail.com>

Eric Dumazet [mailto:eric.dumazet@gmail.com]
> Sent: Friday, August 30, 2019 12:32 AM
> To: Hayes Wang; netdev@vger.kernel.org
> Cc: nic_swsd; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH net-next] r8152: fix accessing skb after napi_gro_receive
> 
> On 8/19/19 5:15 AM, Hayes Wang wrote:
> > Fix accessing skb after napi_gro_receive which is caused by
> > commit 47922fcde536 ("r8152: support skb_add_rx_frag").
> >
> > Fixes: 47922fcde536 ("r8152: support skb_add_rx_frag")
> > Signed-off-by: Hayes Wang <hayeswang@realtek.com>
> > ---
> 
> It is customary to add a tag to credit the reporter...
> 
> Something like :
> 
> Reported-by: ....
> 
> Thanks.

Sorry, that was my mistake.
I will remember to add the tag next time.

Best Regards,
Hayes



^ permalink raw reply

* [PATCH net-next] net/ncsi: support unaligned payload size in NC-SI cmd handler
From: Ben Wei @ 2019-09-02  2:46 UTC (permalink / raw)
  To: Ben Wei, David Miller, sam@mendozajonas.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	openbmc@lists.ozlabs.org
  Cc: Ben Wei

Update the NC-SI command handler (both standard and OEM) to account for
payload padding when allocating the skb, in case the payload size is
not 32-bit aligned.

The checksum field follows the payload field. If the payload padding is
not taken into account, the checksum can be truncated, leading to
dropped packets.

Signed-off-by: Ben Wei <benwei@fb.com>
---
 net/ncsi/ncsi-cmd.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/net/ncsi/ncsi-cmd.c b/net/ncsi/ncsi-cmd.c
index 0187e65176c0..42636ed3cf3a 100644
--- a/net/ncsi/ncsi-cmd.c
+++ b/net/ncsi/ncsi-cmd.c
@@ -213,17 +213,22 @@ static int ncsi_cmd_handler_oem(struct sk_buff *skb,
 {
 	struct ncsi_cmd_oem_pkt *cmd;
 	unsigned int len;
+	/* NC-SI spec requires payload to be padded with 0
+	 * to 32-bit boundary before the checksum field.
+	 * Ensure the padding bytes are accounted for in
+	 * skb allocation
+	 */
+	unsigned short payload = ALIGN(nca->payload, 4);
 
 	len = sizeof(struct ncsi_cmd_pkt_hdr) + 4;
-	if (nca->payload < 26)
+	if (payload < 26)
 		len += 26;
 	else
-		len += nca->payload;
+		len += payload;
 
 	cmd = skb_put_zero(skb, len);
 	memcpy(&cmd->mfr_id, nca->data, nca->payload);
 	ncsi_cmd_build_header(&cmd->cmd.common, nca);
-
 	return 0;
 }
 
@@ -272,6 +277,7 @@ static struct ncsi_request *ncsi_alloc_command(struct ncsi_cmd_arg *nca)
 	struct net_device *dev = nd->dev;
 	int hlen = LL_RESERVED_SPACE(dev);
 	int tlen = dev->needed_tailroom;
+	int payload;
 	int len = hlen + tlen;
 	struct sk_buff *skb;
 	struct ncsi_request *nr;
@@ -281,14 +287,17 @@ static struct ncsi_request *ncsi_alloc_command(struct ncsi_cmd_arg *nca)
 		return NULL;
 
 	/* NCSI command packet has 16-bytes header, payload, 4 bytes checksum.
+	 * Payload needs padding so that the checksum field following payload is
+	 * aligned to 32-bit boundary.
 	 * The packet needs padding if its payload is less than 26 bytes to
 	 * meet 64 bytes minimal ethernet frame length.
 	 */
 	len += sizeof(struct ncsi_cmd_pkt_hdr) + 4;
-	if (nca->payload < 26)
+	payload = ALIGN(nca->payload, 4);
+	if (payload < 26)
 		len += 26;
 	else
-		len += nca->payload;
+		len += payload;
 
 	/* Allocate skb */
 	skb = alloc_skb(len, GFP_ATOMIC);
-- 
2.17.1
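
The length calculation in the patch above can be tried out in user space.
The following is only a sketch: align_up() stands in for the kernel's
ALIGN() macro, ncsi_cmd_len() is a hypothetical helper that does not exist
in net/ncsi, and the 16/4/26 byte sizes are taken from the comment in
ncsi_alloc_command().

```c
#include <assert.h>
#include <stddef.h>

/* Sizes from the patch comment: 16-byte header, 4-byte checksum, and a
 * 26-byte minimum payload so the frame reaches 64 bytes.
 */
#define NCSI_HDR_LEN      16u
#define NCSI_CHECKSUM_LEN 4u
#define NCSI_MIN_PAYLOAD  26u

/* User-space stand-in for the kernel's ALIGN(); 'a' must be a power of two. */
static size_t align_up(size_t x, size_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/* Hypothetical helper: bytes needed for header, padded payload and checksum. */
static size_t ncsi_cmd_len(size_t payload)
{
	size_t padded = align_up(payload, 4);

	if (padded < NCSI_MIN_PAYLOAD)
		padded = NCSI_MIN_PAYLOAD;

	return NCSI_HDR_LEN + padded + NCSI_CHECKSUM_LEN;
}
```

With a 27-byte payload the unpadded calculation would reserve 47 bytes,
one byte short of where the checksum ends; with padding the helper
reserves 48.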


^ permalink raw reply related

* Re: [PATCH] net-ipv6: fix excessive RTF_ADDRCONF flag on ::1/128 local route (and others)
From: Lorenzo Colitti @ 2019-09-02  2:12 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: Maciej Żenczykowski, David S . Miller, Linux NetDev,
	David Ahern
In-Reply-To: <CAHo-Ooy_g-7eZvBSbKR2eaQW3_Bk+fik5YaYAgN60GjmAU=ADA@mail.gmail.com>

On Mon, Sep 2, 2019 at 2:55 AM Maciej Żenczykowski
<zenczykowski@gmail.com> wrote:
> It's not immediately clear to me what is the better approach as I'm
> not immediately certain what RTF_ADDRCONF truly means.
> However the in kernel header file comment does explicitly mention this
> being used to flag routes derived from RA's, and very clearly ::1/128
> is not RA generated, so I *think* the correct fix is to return to the
> old way the kernel used to do things and not flag with ADDRCONF...

AIUI, "addrconf" has always meant stateless address autoconfiguration
as per RFC 4862, i.e., addresses autoconfigured when getting an RA, or
autoconfigured based on adding the link-local prefix. Looking at 5.1
(the most recent release before c7a1ce397ada which you're fixing here)
confirms this interpretation, because RTF_ADDRCONF is only used by:

- addrconf_prefix_rcv: receiving a PIO from an RA
- rt6_add_route_info: receiving an RIO from an RA
- rt6_add_dflt_router, rt6_get_dflt_router: receiving the default
router from an RA
- __rt6_purge_dflt_routers: removing all routes received from RAs,
when enabling forwarding (i.e., switching from being a host to being a
router)

So, if I'm reading c7a1ce397ada right, I would say it's incorrect.
That patch changes things so that RTF_ADDRCONF is set for pretty much
all routes created by adding IPv6 addresses. That includes not only
IPv6 addresses created by RAs, which has always been the case, but
also IPv6 addresses created manually from userspace, or the loopback
address, and even multicast and anycast addresses created by
IPV6_JOIN_GROUP and IPV6_JOIN_ANYCAST. That's userspace-visible
breakage and should be reverted.

Not sure if this patch is the right fix, though, because it breaks
things in the opposite direction: even routes created by an IPv6
address added by receiving an RA will no longer have RTF_ADDRCONF.
Perhaps add something like this as well?

 struct fib6_info *addrconf_f6i_alloc(struct net *net, struct inet6_dev *idev,
-                                     const struct in6_addr *addr, bool anycast,
+                                     const struct in6_addr *addr, u8 flags,
                                      gfp_t gfp_flags);

flags would be RTF_ANYCAST iff the code previously called with true,
and RTF_ADDRCONF if called by a function that is adding an IPv6
address coming from an RA.
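
As a rough illustration of that mapping, here is a user-space sketch.
addr_route_flags() and its two boolean parameters are hypothetical and
only model the rule described above; the RTF_* values are the ones from
include/uapi/linux/ipv6_route.h. Those values are wider than 8 bits, so
the sketch uses a 32-bit type; that is a choice made here and not part of
the proposal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in include/uapi/linux/ipv6_route.h. */
#define RTF_ADDRCONF 0x00040000u
#define RTF_ANYCAST  0x00100000u

/* Both flags lie above bit 7, so an 8-bit argument could not carry them. */
static_assert(RTF_ADDRCONF > UINT8_MAX, "RTF_ADDRCONF does not fit in 8 bits");
static_assert(RTF_ANYCAST > UINT8_MAX, "RTF_ANYCAST does not fit in 8 bits");

/* Hypothetical helper: build the 'flags' argument for one caller.
 * anycast: the caller previously passed 'true' for the bool parameter.
 * from_ra: the address being added came from a router advertisement.
 */
static uint32_t addr_route_flags(bool anycast, bool from_ra)
{
	uint32_t flags = 0;

	if (anycast)
		flags |= RTF_ANYCAST;
	if (from_ra)
		flags |= RTF_ADDRCONF;

	return flags;
}
```

Under this rule a manually added address or ::1/128 gets neither flag,
while an RA-derived address gets RTF_ADDRCONF only.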

^ permalink raw reply

* Re: [PATCH net-next 3/3] net: phy: realtek: add support for the 2.5Gbps PHY in RTL8125
From: Florian Fainelli @ 2019-09-02  2:07 UTC (permalink / raw)
  To: Heiner Kallweit, Andrew Lunn; +Cc: David Miller, netdev@vger.kernel.org
In-Reply-To: <94cc3fe3-98ed-d8d2-2444-84bf3eae0c5e@gmail.com>



On 8/8/2019 1:24 PM, Heiner Kallweit wrote:
> On 08.08.2019 22:20, Andrew Lunn wrote:
>>> I have a contact in Realtek who provided the information about
>>> the vendor-specific registers used in the patch. I also asked for
>>> a method to auto-detect 2.5Gbps support but have no feedback so far.
>>> What may contribute to the problem is that also the integrated 1Gbps
>>> PHY's (all with the same PHY ID) differ significantly from each other,
>>> depending on the network chip version.
>>
>> Hi Heiner
>>
>> Some of the PHYs embedded in Marvell switches have an OUI, but no
>> product ID. We work around this brokenness by trapping the reads to
>> the ID registers in the MDIO bus controller driver and inserting the
>> switch product ID. The Marvell PHY driver then recognises these IDs
>> and does the right thing.
>>
>> Maybe you can do something similar here?
>>
> Yes, this would be an idea. Let me check.

Since this is an integrated PHY you could have the MAC driver pass a
specific phydev->dev_flags bit that indicates that this is RTL8125, since
I am assuming that PCI IDs for those different chipsets do have to be
allocated, right?
-- 
Florian
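
A user-space sketch of the flag-passing idea, for illustration only.
struct phy_device_stub models just the dev_flags field of the kernel's
struct phy_device, and PHY_FLAG_RTL8125, mac_mark_rtl8125() and
phy_supports_2500() are hypothetical names that do not exist in r8169 or
the Realtek PHY driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's struct phy_device; only dev_flags is modelled. */
struct phy_device_stub {
	uint32_t dev_flags;
};

/* Hypothetical flag bit agreed on between the MAC and PHY drivers. */
#define PHY_FLAG_RTL8125 (1u << 0)

/* MAC driver side: it knows the chip from its PCI ID and tells the PHY. */
static void mac_mark_rtl8125(struct phy_device_stub *phydev, bool is_rtl8125)
{
	assert(phydev);
	if (is_rtl8125)
		phydev->dev_flags |= PHY_FLAG_RTL8125;
}

/* PHY driver side: decide whether to advertise 2.5Gbps. */
static bool phy_supports_2500(const struct phy_device_stub *phydev)
{
	assert(phydev);
	return (phydev->dev_flags & PHY_FLAG_RTL8125) != 0;
}
```

The PHY ID stays the same for all chip versions in this scheme; only the
flag set by the MAC driver distinguishes them.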

^ permalink raw reply

* linux-next: manual merge of the afs tree with the net tree
From: Stephen Rothwell @ 2019-09-02  0:31 UTC (permalink / raw)
  To: David Howells, David Miller, Networking
  Cc: Linux Next Mailing List, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 878 bytes --]

Hi all,

Today's linux-next merge of the afs tree got conflicts in:

  include/trace/events/rxrpc.h
  net/rxrpc/ar-internal.h
  net/rxrpc/call_object.c
  net/rxrpc/conn_client.c
  net/rxrpc/input.c
  net/rxrpc/recvmsg.c
  net/rxrpc/skbuff.c

between various commits from the net tree and similar commits from the
afs tree.

I fixed it up (I just dropped the afs tree for today) and can carry the
fix as necessary. This is now fixed as far as linux-next is concerned,
but any non trivial conflicts should be mentioned to your upstream
maintainer when your tree is submitted for merging.  You may also want
to consider cooperating with the maintainer of the conflicting tree to
minimise any particularly complex conflicts.

It looks like the afs tree has older versions of some commits in the
net tree ... plus some more.

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* kernel panic: stack is corrupted in lock_release (2)
From: syzbot @ 2019-09-01 23:27 UTC (permalink / raw)
  To: linux-kernel, netdev, syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    dd7078f0 enetc: Add missing call to 'pci_free_irq_vectors(..
git tree:       net
console output: https://syzkaller.appspot.com/x/log.txt?x=115fe0fa600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=2a6a2b9826fdadf9
dashboard link: https://syzkaller.appspot.com/bug?extid=97deee97cf14574b96d0
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=11f7c2fe600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+97deee97cf14574b96d0@syzkaller.appspotmail.com

Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in:  
lock_release+0x866/0x960 kernel/locking/lockdep.c:4435
CPU: 0 PID: 9965 Comm: syz-executor.0 Not tainted 5.3.0-rc6+ #182
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
Kernel Offset: disabled
Rebooting in 86400 seconds..

---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* Re: kernel panic: stack is corrupted in __lock_acquire (4)
From: syzbot @ 2019-09-01 22:48 UTC (permalink / raw)
  To: linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <0000000000009b3b80058af452ae@google.com>

syzbot has found a reproducer for the following crash on:

HEAD commit:    38320f69 Merge branch 'Minor-cleanup-in-devlink'
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=13d74356600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=1bbf70b6300045af
dashboard link: https://syzkaller.appspot.com/bug?extid=83979935eb6304f8cd46
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1008b232600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+83979935eb6304f8cd46@syzkaller.appspotmail.com

Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in:  
__lock_acquire+0x36fa/0x4c30 kernel/locking/lockdep.c:3907
CPU: 0 PID: 8662 Comm: syz-executor.4 Not tainted 5.3.0-rc6+ #153
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
Kernel Offset: disabled
Rebooting in 86400 seconds..


^ permalink raw reply

* Re: [PATCH 0/4 net-next] flow_offload: update mangle action representation
From: Jakub Kicinski @ 2019-09-01 20:47 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, davem, netdev, vishal, saeedm, jiri
In-Reply-To: <20190831142217.bvxx3vc6wpsmnxpe@salvia>

On Sat, 31 Aug 2019 16:22:17 +0200, Pablo Neira Ayuso wrote:
> On Fri, Aug 30, 2019 at 03:33:51PM -0700, Jakub Kicinski wrote:
> > On Fri, 30 Aug 2019 11:07:10 +0200, Pablo Neira Ayuso wrote:  
> > > > > * The front-end coalesces consecutive pedit actions into one single
> > > > >   word, so drivers can mangle IPv6 and ethernet address fields in one
> > > > >   single go.    
> > > > 
> > > > You still only coalesce up to 16 bytes, no?    
> > > 
> > > You only have to raise FLOW_ACTION_MANGLE_MAXLEN coming in this patch
> > > if you need more. I don't know of any packet field larger than 16
> > > bytes. If there is a use-case for this, it should be easy to raise that
> > > definition.  
> > 
> > Please see the definitions of:
> > 
> > struct nfp_fl_set_eth
> > struct nfp_fl_set_ip4_addrs
> > struct nfp_fl_set_ip4_ttl_tos
> > struct nfp_fl_set_ipv6_tc_hl_fl
> > struct nfp_fl_set_ipv6_addr
> > struct nfp_fl_set_tport
> > 
> > These are the programming primitives for header rewrites in the NFP.
> > Since each of those contains more than just one field, we'll have to
> > keep all the field coalescing logic in the driver, even if you coalesce
> > whole fields (i.e. IPv6 addresses).  
> 
> nfp has been updated in this patch series to deal with the new mangle
> representation.

It has been updated to handle the trivial coalescing.

> > Perhaps it's not a serious blocker for the series, but it'd be nice if
> > rewrite action grouping was handled in the core. Since you're already
> > poking at that code..  
> 
> Rewrite action grouping is already handled from the core front-end in
> this patch series.

If you did what I'm asking the functions nfp_fl_check_mangle_start()
and nfp_fl_check_mangle_end() would no longer exist. They were not
really needed before your "common flow API" changes.

Your reply makes only a limited amount of sense to me. Please read the
code and what I wrote; if you think I'm asking for too much, just say
so and I'd accept that.

^ permalink raw reply

* [GIT] Networking
From: David Miller @ 2019-09-01 20:45 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


1) Fix some length checks during OGM processing in batman-adv, from
   Sven Eckelmann.

2) Fix regression that caused netfilter conntrack sysctls to not be per-netns
   any more.  From Florian Westphal.

3) Use after free in netpoll, from Feng Sun.

4) Guard destruction of pfifo_fast per-cpu qdisc stats with
   qdisc_is_percpu_stats(), from Davide Caratti.  Similar bug
   is fixed in pfifo_fast_enqueue().

5) Fix memory leak in mld_del_delrec(), from Eric Dumazet.

6) Handle neigh events on internal ports correctly in nfp, from John
   Hurley.

7) Clear SKB timestamp in NF flow table code so that it does not
   confuse fq scheduler.  From Florian Westphal.

8) taprio destroy can crash if it is invoked in a failure path of
   taprio_init(), because the list head isn't setup properly yet
   and the list del is unconditional.  Perform the list add earlier
   to address this.  From Vladimir Oltean.

9) Make sure to reapply vlan filters on device up, in aquantia driver.
   From Dmitry Bogdanov.

10) sgiseeq driver releases DMA memory using free_page() instead of
    dma_free_attrs().  From Christophe JAILLET.

Please pull, thanks a lot!

The following changes since commit 9e8312f5e160ade069e131d54ab8652cf0e86e1a:

  Merge tag 'nfs-for-5.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs (2019-08-27 13:22:57 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git 

for you to fetch changes up to e1e54ec7fb55501c33b117c111cb0a045b8eded2:

  net: seeq: Fix the function used to release some memory in an error handling path (2019-09-01 12:10:11 -0700)

----------------------------------------------------------------
Chen-Yu Tsai (1):
      net: stmmac: dwmac-rk: Don't fail if phy regulator is absent

Christophe JAILLET (2):
      enetc: Add missing call to 'pci_free_irq_vectors()' in probe and remove functions
      net: seeq: Fix the function used to release some memory in an error handling path

Cong Wang (1):
      net_sched: fix a NULL pointer deref in ipt action

David Howells (8):
      rxrpc: Improve jumbo packet counting
      rxrpc: Use info in skbuff instead of reparsing a jumbo packet
      rxrpc: Pass the input handler's data skb reference to the Rx ring
      rxrpc: Abstract out rxtx ring cleanup
      rxrpc: Add a private skb flag to indicate transmission-phase skbs
      rxrpc: Use the tx-phase skb flag to simplify tracing
      rxrpc: Use skb_unshare() rather than skb_cow_data()
      rxrpc: Fix lack of conn cleanup when local endpoint is cleaned up [ver #2]

David S. Miller (11):
      Merge branch 'macb-Update-ethernet-compatible-string-for-SiFive-FU540'
      Merge branch 'r8152-fix-side-effect'
      Merge branch 'nfp-flower-fix-bugs-in-merge-tunnel-encap-code'
      Merge tag 'mac80211-for-davem-2019-08-29' of git://git.kernel.org/.../jberg/mac80211
      Merge tag 'rxrpc-fixes-20190827' of git://git.kernel.org/.../dhowells/linux-fs
      Merge git://git.kernel.org/.../bpf/bpf
      Merge git://git.kernel.org/.../pablo/nf
      Merge tag 'batadv-net-for-davem-20190830' of git://git.open-mesh.org/linux-merge
      Merge branch 'Fix-issues-in-tc-taprio-and-tc-cbs'
      Merge branch 'net-aquantia-fixes-on-vlan-filters-and-other-conditions'
      Merge branch 'net-dsa-microchip-add-KSZ8563-support'

Davide Caratti (3):
      net/sched: pfifo_fast: fix wrong dereference when qdisc is reset
      net/sched: pfifo_fast: fix wrong dereference in pfifo_fast_enqueue
      tc-testing: don't hardcode 'ip' in nsPlugin.py

Denis Kenzior (2):
      mac80211: Don't memset RXCB prior to PAE intercept
      mac80211: Correctly set noencrypt for PAE frames

Dmitry Bogdanov (4):
      net: aquantia: fix removal of vlan 0
      net: aquantia: fix limit of vlan filters
      net: aquantia: reapply vlan filters on up
      net: aquantia: fix out of memory condition on rx side

Eric Dumazet (2):
      tcp: remove empty skb from write queue in error cases
      mld: fix memory leak in mld_del_delrec()

Feng Sun (1):
      net: fix skb use after free in netpoll

Florian Westphal (2):
      netfilter: conntrack: make sysctls per-namespace again
      netfilter: nf_flow_table: clear skb tstamp before xmit

George McCollister (1):
      net: dsa: microchip: fill regmap_config name

Greg Rose (1):
      openvswitch: Properly set L4 keys on "later" IP fragments

Hayes Wang (2):
      Revert "r8152: napi hangup fix after disconnect"
      r8152: remove calling netif_napi_del

Igor Russkikh (1):
      net: aquantia: linkstate irq should be oneshot

Jiong Wang (1):
      nfp: bpf: fix latency bug when updating stack index register

John Hurley (2):
      nfp: flower: prevent ingress block binds on internal ports
      nfp: flower: handle neighbour events on internal ports

Justin Pettit (1):
      openvswitch: Clear the L4 portion of the key for "later" fragments.

Ka-Cheong Poon (1):
      net/rds: Fix info leak in rds6_inc_info_copy()

Luca Coelho (1):
      iwlwifi: pcie: handle switching killer Qu B0 NICs to C0

Marco Hartmann (1):
      Add genphy_c45_config_aneg() function to phy-c45.c

Naveen N. Rao (1):
      bpf: handle 32-bit zext during constant blinding

Razvan Stefanescu (2):
      dt-bindings: net: dsa: document additional Microchip KSZ8563 switch
      net: dsa: microchip: add KSZ8563 compatibility string

Ryan M. Collins (1):
      net: bcmgenet: use ethtool_op_get_ts_info()

Sven Eckelmann (2):
      batman-adv: Only read OGM tvlv_len after buffer len check
      batman-adv: Only read OGM2 tvlv_len after buffer len check

Takashi Iwai (1):
      sky2: Disable MSI on yet another ASUS boards (P6Xxxx)

Thomas Falcon (1):
      ibmvnic: Do not process reset during or after device removal

Thomas Jarosch (1):
      netfilter: nf_conntrack_ftp: Fix debug output

Todd Seidelmann (1):
      netfilter: xt_physdev: Fix spurious error message in physdev_mt_check

Vlad Buslov (1):
      net: sched: act_sample: fix psample group handling on overwrite

Vladimir Oltean (4):
      net: dsa: tag_8021q: Future-proof the reserved fields in the custom VID
      taprio: Fix kernel panic in taprio_destroy
      taprio: Set default link speed to 10 Mbps in taprio_set_picos_per_byte
      net/sched: cbs: Set default link speed to 10 Mbps in cbs_set_port_rate

Willem de Bruijn (1):
      tcp: inherit timestamp on mtu probe

Yash Shah (2):
      macb: bindings doc: update sifive fu540-c000 binding
      macb: Update compatibility string for SiFive FU540-C000

YueHaibing (1):
      amd-xgbe: Fix error path in xgbe_mod_init()

wenxu (1):
      netfilter: nft_meta_bridge: Fix get NFT_META_BRI_IIFVPROTO in network byteorder

 Documentation/devicetree/bindings/net/dsa/ksz.txt         |   1 +
 Documentation/devicetree/bindings/net/macb.txt            |   4 +-
 drivers/net/dsa/microchip/ksz9477_spi.c                   |   1 +
 drivers/net/dsa/microchip/ksz_common.h                    |   1 +
 drivers/net/ethernet/amd/xgbe/xgbe-main.c                 |  10 ++-
 drivers/net/ethernet/aquantia/atlantic/aq_filters.c       |   5 +-
 drivers/net/ethernet/aquantia/atlantic/aq_main.c          |   4 ++
 drivers/net/ethernet/aquantia/atlantic/aq_nic.c           |   2 +-
 drivers/net/ethernet/aquantia/atlantic/aq_vec.c           |   3 +-
 drivers/net/ethernet/broadcom/genet/bcmgenet.c            |   1 +
 drivers/net/ethernet/cadence/macb_main.c                  |   2 +-
 drivers/net/ethernet/freescale/enetc/enetc_ptp.c          |   5 +-
 drivers/net/ethernet/ibm/ibmvnic.c                        |   6 +-
 drivers/net/ethernet/marvell/sky2.c                       |   7 +++
 drivers/net/ethernet/netronome/nfp/bpf/jit.c              |  17 +++--
 drivers/net/ethernet/netronome/nfp/flower/offload.c       |   7 ++-
 drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c   |   8 +--
 drivers/net/ethernet/seeq/sgiseeq.c                       |   7 ++-
 drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c            |   6 +-
 drivers/net/phy/phy-c45.c                                 |  26 ++++++++
 drivers/net/phy/phy.c                                     |   2 +-
 drivers/net/usb/r8152.c                                   |   5 +-
 drivers/net/wireless/intel/iwlwifi/cfg/22000.c            |  24 ++++++++
 drivers/net/wireless/intel/iwlwifi/iwl-config.h           |   2 +
 drivers/net/wireless/intel/iwlwifi/pcie/drv.c             |   4 ++
 drivers/net/wireless/intel/iwlwifi/pcie/trans.c           |   7 +--
 include/linux/phy.h                                       |   1 +
 include/net/act_api.h                                     |   4 +-
 include/net/psample.h                                     |   1 +
 include/trace/events/rxrpc.h                              |  59 +++++++++---------
 kernel/bpf/core.c                                         |   8 ++-
 net/batman-adv/bat_iv_ogm.c                               |  20 +++---
 net/batman-adv/bat_v_ogm.c                                |  18 ++++--
 net/bridge/netfilter/nft_meta_bridge.c                    |   2 +-
 net/core/netpoll.c                                        |   6 +-
 net/dsa/tag_8021q.c                                       |   2 +
 net/ipv4/tcp.c                                            |  30 ++++++---
 net/ipv4/tcp_output.c                                     |   3 +-
 net/ipv6/mcast.c                                          |   5 +-
 net/mac80211/rx.c                                         |   6 +-
 net/netfilter/nf_conntrack_ftp.c                          |   2 +-
 net/netfilter/nf_conntrack_standalone.c                   |   5 ++
 net/netfilter/nf_flow_table_ip.c                          |   3 +-
 net/netfilter/xt_physdev.c                                |   6 +-
 net/openvswitch/conntrack.c                               |   5 ++
 net/openvswitch/flow.c                                    | 160 +++++++++++++++++++++++++++--------------------
 net/openvswitch/flow.h                                    |   1 +
 net/psample/psample.c                                     |   2 +-
 net/rds/recv.c                                            |   5 +-
 net/rxrpc/af_rxrpc.c                                      |   3 -
 net/rxrpc/ar-internal.h                                   |  17 +++--
 net/rxrpc/call_event.c                                    |   8 +--
 net/rxrpc/call_object.c                                   |  33 +++++-----
 net/rxrpc/conn_client.c                                   |  44 +++++++++++++
 net/rxrpc/conn_event.c                                    |   6 +-
 net/rxrpc/conn_object.c                                   |   2 +-
 net/rxrpc/input.c                                         | 304 +++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------------
 net/rxrpc/local_event.c                                   |   4 +-
 net/rxrpc/local_object.c                                  |   5 +-
 net/rxrpc/output.c                                        |   6 +-
 net/rxrpc/peer_event.c                                    |  10 +--
 net/rxrpc/protocol.h                                      |   9 +++
 net/rxrpc/recvmsg.c                                       |  47 ++++++++------
 net/rxrpc/rxkad.c                                         |  32 +++-------
 net/rxrpc/sendmsg.c                                       |  13 ++--
 net/rxrpc/skbuff.c                                        |  40 ++++++++----
 net/sched/act_bpf.c                                       |   2 +-
 net/sched/act_connmark.c                                  |   2 +-
 net/sched/act_csum.c                                      |   2 +-
 net/sched/act_ct.c                                        |   2 +-
 net/sched/act_ctinfo.c                                    |   2 +-
 net/sched/act_gact.c                                      |   2 +-
 net/sched/act_ife.c                                       |   2 +-
 net/sched/act_ipt.c                                       |  11 ++--
 net/sched/act_mirred.c                                    |   2 +-
 net/sched/act_mpls.c                                      |   2 +-
 net/sched/act_nat.c                                       |   2 +-
 net/sched/act_pedit.c                                     |   2 +-
 net/sched/act_police.c                                    |   2 +-
 net/sched/act_sample.c                                    |   8 ++-
 net/sched/act_simple.c                                    |   2 +-
 net/sched/act_skbedit.c                                   |   2 +-
 net/sched/act_skbmod.c                                    |   2 +-
 net/sched/act_tunnel_key.c                                |   2 +-
 net/sched/act_vlan.c                                      |   2 +-
 net/sched/sch_cbs.c                                       |  19 +++---
 net/sched/sch_generic.c                                   |  19 ++++--
 net/sched/sch_taprio.c                                    |  31 +++++-----
 tools/testing/selftests/tc-testing/plugin-lib/nsPlugin.py |  22 +++----
 89 files changed, 761 insertions(+), 487 deletions(-)

^ permalink raw reply

* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied
From: Willem de Bruijn @ 2019-09-01 20:05 UTC (permalink / raw)
  To: Shmulik Ladkani
  Cc: Daniel Borkmann, Eric Dumazet, netdev, Alexander Duyck,
	Alexei Starovoitov, Yonghong Song, Steffen Klassert,
	Shmulik Ladkani, eyal
In-Reply-To: <20190829152241.73734206@pixies>

On Thu, Aug 29, 2019 at 8:22 AM Shmulik Ladkani
<shmulik.ladkani@gmail.com> wrote:
>
> On Tue, 27 Aug 2019 14:10:35 +0200
> Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> > Given first point above wrt hitting rarely, it would be good to first get a
> > better understanding for writing a reproducer. Back then Yonghong added one
> > to the BPF kernel test suite [0], so it would be desirable to extend it for
> > the case you're hitting. Given NAT64 use-case is needed and used by multiple
> > parties, we should try to (fully) fix it generically.
> >
>
> Thanks Daniel.
>
> Managed to write a reproducer which mimics the skb we see on production,
> that hits the exact same BUG_ON.
>
> Submitted as a separate RFC PATCH to bpf-next.

Thanks for the reproducer.

One quick fix is to disable sg and thus revert to copying in this
case. Not ideal, but better than a kernel splat:

@@ -3714,6 +3714,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
        sg = !!(features & NETIF_F_SG);
        csum = !!can_checksum_protocol(features, proto);

+       if (list_skb && skb_headlen(list_skb) && !list_skb->head_frag)
+               sg = false;
+

It could perhaps be refined to keep sg enabled in the special case where
skb_headlen(list_skb) == len and nskb is aligned to the start of list_skb.
The effect on GSO_BY_FRAGS also needs looking into.
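
To make the condition easier to reason about, here is a user-space model
of the check. It is only a sketch: struct skb_stub carries just the
fields the test reads, headlen() mirrors what skb_headlen() computes
(len - data_len), and sg_allowed() is a hypothetical name for the
decision made in the hunk above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct sk_buff; only the fields the check reads. */
struct skb_stub {
	unsigned int len;      /* total bytes in the skb */
	unsigned int data_len; /* bytes held in paged fragments */
	bool head_frag;        /* linear head comes from a page fragment */
};

/* Bytes in the linear area, as skb_headlen() computes them. */
static unsigned int headlen(const struct skb_stub *skb)
{
	assert(skb->data_len <= skb->len);
	return skb->len - skb->data_len;
}

/* Hypothetical helper: may scatter-gather stay enabled for this segment?
 * A frag_list skb with linear data in a kmalloc'ed head cannot be
 * referenced as a page fragment, so fall back to copying.
 */
static bool sg_allowed(bool sg, const struct skb_stub *list_skb)
{
	if (list_skb && headlen(list_skb) && !list_skb->head_frag)
		return false;

	return sg;
}
```

Only the combination "linear data present" and "head not a page
fragment" forces the copy path; the other cases keep the device's
NETIF_F_SG setting.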

I also looked into trying to convert a kmalloc'ed skb->head into a
headfrag. But even if possible, that conversion is non-trivial and
likely to introduce bugs of its own.

@@ -3849,8 +3885,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
                                if (!skb_headlen(list_skb)) {
                                        BUG_ON(!nfrags);
                                } else {
-                                       BUG_ON(!list_skb->head_frag);
-
+                                       BUG_ON(!list_skb->head_frag &&
+                                              !skb_to_headfrag(list_skb, GFP_ATOMIC));

^ permalink raw reply
