* Re: [RFC PATCH net-next] net: phy: force phy suspend when calling phy_stop
From: shenjian (K) @ 2019-08-27 11:36 UTC (permalink / raw)
To: Sergei Shtylyov, andrew, f.fainelli, hkallweit1, davem
Cc: netdev, forest.zhouchang, linuxarm
In-Reply-To: <cc52cde5-b114-3bf8-4c4b-fe81c04080ee@cogentembedded.com>
在 2019/8/27 18:11, Sergei Shtylyov 写道:
> On 27.08.2019 5:47, Jian Shen wrote:
>
>> Some ethernet drivers may call phy_start() and phy_stop() from
>> ndo_open and ndo_close() respectively.
>
> ndo_open() for consistency.
>
>> When network cable is unconnected, and operate like below:
>> step 1: ifconfig ethX up -> ndo_open -> phy_start ->start
>> autoneg, and phy is no link.
>> step 2: ifconfig ethX down -> ndo_close -> phy_stop -> just stop
>> phy state machine.
>> step 3: plugin the network cable, and autoneg complete, then
>> LED for link status will be on.
>> step 4: ethtool ethX --> see the result of "Link detected" is no.
>>
>> This patch forces phy suspend even phydev->link is off.
>>
>> Signed-off-by: Jian Shen <shenjian15@huawei.com>
> [...]
>
> MBR, Sergei
>
>
Thanks, will fix it.
^ permalink raw reply
* Re: [PATCH 3/4] mdev: Expose mdev alias in sysfs tree
From: Cornelia Huck @ 2019-08-27 11:34 UTC (permalink / raw)
To: Parav Pandit
Cc: alex.williamson@redhat.com, Jiri Pirko, kwankhede@nvidia.com,
davem@davemloft.net, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <AM0PR05MB4866BDA002F2C6566492244ED1A00@AM0PR05MB4866.eurprd05.prod.outlook.com>
On Tue, 27 Aug 2019 11:07:37 +0000
Parav Pandit <parav@mellanox.com> wrote:
> > -----Original Message-----
> > From: Cornelia Huck <cohuck@redhat.com>
> > Sent: Tuesday, August 27, 2019 4:17 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: alex.williamson@redhat.com; Jiri Pirko <jiri@mellanox.com>;
> > kwankhede@nvidia.com; davem@davemloft.net; kvm@vger.kernel.org; linux-
> > kernel@vger.kernel.org; netdev@vger.kernel.org
> > Subject: Re: [PATCH 3/4] mdev: Expose mdev alias in sysfs tree
> >
> > On Mon, 26 Aug 2019 15:41:18 -0500
> > Parav Pandit <parav@mellanox.com> wrote:
> > > +static ssize_t alias_show(struct device *device,
> > > + struct device_attribute *attr, char *buf) {
> > > + struct mdev_device *dev = mdev_from_dev(device);
> > > +
> > > + if (!dev->alias)
> > > + return -EOPNOTSUPP;
> >
> > I'm wondering how to make this consumable by userspace in the easiest way.
> > - As you do now (userspace gets an error when trying to read)?
> > - Returning an empty value (nothing to see here, move along)?
> No. This is confusing, to return empty value, because it says that there is an alias but it is some weird empty string.
> If there is alias, it shows exactly what it is.
> If no alias it returns an error code = unsupported -> inline with other widely used subsystem.
>
> > - Or not creating the attribute at all? That would match what userspace
> > sees on older kernels, so it needs to be able to deal with that
> New sysfs files can appear. Tool cannot say that I was not expecting this file here.
> User space is supposed to work with the file they are off interest.
> Mdev interface has option to specify vendor specific files, though in usual manner it's not recommended.
> So there is no old user space, new kernel issue here.
I'm not talking about old userspace/new kernel, but new userspace/old
kernel. Code that wants to consume this attribute needs to be able to
cope with its absence anyway.
>
> > anyway.
> >
> > > +
> > > + return sprintf(buf, "%s\n", dev->alias); } static
> > > +DEVICE_ATTR_RO(alias);
> > > +
> > > static const struct attribute *mdev_device_attrs[] = {
> > > + &dev_attr_alias.attr,
> > > &dev_attr_remove.attr,
> > > NULL,
> > > };
>
^ permalink raw reply
* RE: [PATCH 1/4] mdev: Introduce sha1 based mdev alias
From: Parav Pandit @ 2019-08-27 11:33 UTC (permalink / raw)
To: Cornelia Huck
Cc: alex.williamson@redhat.com, Jiri Pirko, kwankhede@nvidia.com,
davem@davemloft.net, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20190827132404.483a74ad.cohuck@redhat.com>
> -----Original Message-----
> From: Cornelia Huck <cohuck@redhat.com>
> Sent: Tuesday, August 27, 2019 4:54 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: alex.williamson@redhat.com; Jiri Pirko <jiri@mellanox.com>;
> kwankhede@nvidia.com; davem@davemloft.net; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; netdev@vger.kernel.org
> Subject: Re: [PATCH 1/4] mdev: Introduce sha1 based mdev alias
>
> On Tue, 27 Aug 2019 11:12:23 +0000
> Parav Pandit <parav@mellanox.com> wrote:
>
> > > -----Original Message-----
> > > From: Cornelia Huck <cohuck@redhat.com>
> > > Sent: Tuesday, August 27, 2019 3:54 PM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: alex.williamson@redhat.com; Jiri Pirko <jiri@mellanox.com>;
> > > kwankhede@nvidia.com; davem@davemloft.net; kvm@vger.kernel.org;
> > > linux- kernel@vger.kernel.org; netdev@vger.kernel.org
> > > Subject: Re: [PATCH 1/4] mdev: Introduce sha1 based mdev alias
> > >
> > > On Mon, 26 Aug 2019 15:41:16 -0500
> > > Parav Pandit <parav@mellanox.com> wrote:
> > >
> > > > Whenever a parent requests to generate mdev alias, generate a mdev
> > > > alias.
> > > > It is an optional attribute that parent can request to generate
> > > > for each of its child mdev.
> > > > mdev alias is generated using sha1 from the mdev name.
> > >
> > > Maybe add some motivation here as well?
> > >
> > > "Some vendor drivers want an identifier for an mdev device that is
> > > shorter than the uuid, due to length restrictions in the consumers of that
> identifier.
> > >
> > > Add a callback that allows a vendor driver to request an alias of a
> > > specified length to be generated (via sha1) for an mdev device. If
> > > generated, that alias is checked for collisions."
> > >
> > I did described the motivation in the cover letter with example and this
> design discussion thread.
>
> Yes, but adding it to the patch description makes it available in the git history.
>
Ok.
> > I will include above summary in v1.
> >
> > > What about:
> > >
> > > * @get_alias_length: optional callback to specify length of the alias to
> create
> > > * Returns unsigned integer: length of the alias to be created,
> > > * 0 to not create an alias
> > >
> > Ack.
> >
> > > I also think it might be beneficial to add a device parameter here
> > > now (rather than later); that seems to be something that makes sense.
> > >
> > Without showing the use, it shouldn't be added.
>
> It just feels like an omission: Why should the vendor driver only be able to
> return one value here, without knowing which device it is for?
> If a driver supports different devices, it may have different requirements for
> them.
>
Sure. Lets first have this requirement to add it.
I am against adding this length field itself without an actual vendor use case, which is adding some complexity in code today.
But it was ok to have length field instead of bool.
Lets not further add "no-requirement futuristic knobs" which hasn't shown its need yet.
When a vendor driver needs it, there is nothing prevents such addition.
> >
> > > > * Parent device that support mediated device should be
> > > > registered with
> > > mdev
> > > > * module with mdev_parent_ops structure.
> > > > **/
> > > > @@ -92,6 +95,7 @@ struct mdev_parent_ops {
> > > > long (*ioctl)(struct mdev_device *mdev, unsigned int cmd,
> > > > unsigned long arg);
> > > > int (*mmap)(struct mdev_device *mdev, struct vm_area_struct
> > > *vma);
> > > > + unsigned int (*get_alias_length)(void);
> > > > };
> > > >
> > > > /* interface for exporting mdev supported type attributes */
> >
^ permalink raw reply
* [PATCH] arcnet: capmode: remove redundant assignment to pointer pkt
From: Colin King @ 2019-08-27 11:29 UTC (permalink / raw)
To: Michael Grzeschik, David S . Miller, netdev; +Cc: kernel-janitors, linux-kernel
From: Colin Ian King <colin.king@canonical.com>
Pointer pkt is being initialized with a value that is never read
and pkg is being re-assigned a little later on. The assignment is
redundant and hence can be removed.
Addresses-Coverity: ("Ununsed value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
drivers/net/arcnet/capmode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/arcnet/capmode.c b/drivers/net/arcnet/capmode.c
index b780be6f41ff..c09b567845e1 100644
--- a/drivers/net/arcnet/capmode.c
+++ b/drivers/net/arcnet/capmode.c
@@ -44,7 +44,7 @@ static void rx(struct net_device *dev, int bufnum,
{
struct arcnet_local *lp = netdev_priv(dev);
struct sk_buff *skb;
- struct archdr *pkt = pkthdr;
+ struct archdr *pkt;
char *pktbuf, *pkthdrbuf;
int ofs;
--
2.20.1
^ permalink raw reply related
* Re: [PATCH 2/4] mdev: Make mdev alias unique among all mdevs
From: Cornelia Huck @ 2019-08-27 11:29 UTC (permalink / raw)
To: Parav Pandit
Cc: alex.williamson@redhat.com, Jiri Pirko, kwankhede@nvidia.com,
davem@davemloft.net, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <AM0PR05MB486621458EC71973378CD5A0D1A00@AM0PR05MB4866.eurprd05.prod.outlook.com>
On Tue, 27 Aug 2019 11:08:59 +0000
Parav Pandit <parav@mellanox.com> wrote:
> > -----Original Message-----
> > From: Cornelia Huck <cohuck@redhat.com>
> > Sent: Tuesday, August 27, 2019 3:59 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: alex.williamson@redhat.com; Jiri Pirko <jiri@mellanox.com>;
> > kwankhede@nvidia.com; davem@davemloft.net; kvm@vger.kernel.org; linux-
> > kernel@vger.kernel.org; netdev@vger.kernel.org
> > Subject: Re: [PATCH 2/4] mdev: Make mdev alias unique among all mdevs
> >
> > On Mon, 26 Aug 2019 15:41:17 -0500
> > Parav Pandit <parav@mellanox.com> wrote:
> >
> > > Mdev alias should be unique among all the mdevs, so that when such
> > > alias is used by the mdev users to derive other objects, there is no
> > > collision in a given system.
> > >
> > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > ---
> > > drivers/vfio/mdev/mdev_core.c | 5 +++++
> > > 1 file changed, 5 insertions(+)
> > >
> > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > b/drivers/vfio/mdev/mdev_core.c index e825ff38b037..6eb37f0c6369
> > > 100644
> > > --- a/drivers/vfio/mdev/mdev_core.c
> > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > @@ -375,6 +375,11 @@ int mdev_device_create(struct kobject *kobj, struct
> > device *dev,
> > > ret = -EEXIST;
> > > goto mdev_fail;
> > > }
> > > + if (tmp->alias && strcmp(tmp->alias, alias) == 0) {
> >
> > Any way we can relay to the caller that the uuid was fine, but that we had a
> > hash collision? Duplicate uuids are much more obvious than a collision here.
> >
> How do you want to relay this rare event?
> Netlink interface has way to return the error message back, but sysfs is limited due to its error code based interface.
I don't know, that's why I asked :)
The problem is that "uuid already used" and "hash collision" are
indistinguishable. While "use a different uuid" will probably work in
both cases, "increase alias length" might be a good alternative in some
cases.
But if there is no good way to relay the problem, we can live with it.
>
> > > + mutex_unlock(&mdev_list_lock);
> > > + ret = -EEXIST;
> > > + goto mdev_fail;
> > > + }
> > > }
> > >
> > > mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
>
^ permalink raw reply
* Re: [PATCH 1/4] mdev: Introduce sha1 based mdev alias
From: Cornelia Huck @ 2019-08-27 11:24 UTC (permalink / raw)
To: Parav Pandit
Cc: alex.williamson@redhat.com, Jiri Pirko, kwankhede@nvidia.com,
davem@davemloft.net, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <AM0PR05MB4866B68C9E60E42359BE1F4DD1A00@AM0PR05MB4866.eurprd05.prod.outlook.com>
On Tue, 27 Aug 2019 11:12:23 +0000
Parav Pandit <parav@mellanox.com> wrote:
> > -----Original Message-----
> > From: Cornelia Huck <cohuck@redhat.com>
> > Sent: Tuesday, August 27, 2019 3:54 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: alex.williamson@redhat.com; Jiri Pirko <jiri@mellanox.com>;
> > kwankhede@nvidia.com; davem@davemloft.net; kvm@vger.kernel.org; linux-
> > kernel@vger.kernel.org; netdev@vger.kernel.org
> > Subject: Re: [PATCH 1/4] mdev: Introduce sha1 based mdev alias
> >
> > On Mon, 26 Aug 2019 15:41:16 -0500
> > Parav Pandit <parav@mellanox.com> wrote:
> >
> > > Whenever a parent requests to generate mdev alias, generate a mdev
> > > alias.
> > > It is an optional attribute that parent can request to generate for
> > > each of its child mdev.
> > > mdev alias is generated using sha1 from the mdev name.
> >
> > Maybe add some motivation here as well?
> >
> > "Some vendor drivers want an identifier for an mdev device that is shorter than
> > the uuid, due to length restrictions in the consumers of that identifier.
> >
> > Add a callback that allows a vendor driver to request an alias of a specified
> > length to be generated (via sha1) for an mdev device. If generated, that alias is
> > checked for collisions."
> >
> I did described the motivation in the cover letter with example and this design discussion thread.
Yes, but adding it to the patch description makes it available in the
git history.
> I will include above summary in v1.
>
> > What about:
> >
> > * @get_alias_length: optional callback to specify length of the alias to create
> > * Returns unsigned integer: length of the alias to be created,
> > * 0 to not create an alias
> >
> Ack.
>
> > I also think it might be beneficial to add a device parameter here now (rather
> > than later); that seems to be something that makes sense.
> >
> Without showing the use, it shouldn't be added.
It just feels like an omission: Why should the vendor driver only be
able to return one value here, without knowing which device it is for?
If a driver supports different devices, it may have different
requirements for them.
>
> > > * Parent device that support mediated device should be registered with
> > mdev
> > > * module with mdev_parent_ops structure.
> > > **/
> > > @@ -92,6 +95,7 @@ struct mdev_parent_ops {
> > > long (*ioctl)(struct mdev_device *mdev, unsigned int cmd,
> > > unsigned long arg);
> > > int (*mmap)(struct mdev_device *mdev, struct vm_area_struct
> > *vma);
> > > + unsigned int (*get_alias_length)(void);
> > > };
> > >
> > > /* interface for exporting mdev supported type attributes */
>
^ permalink raw reply
* Re: [REGRESSION] netfilter: conntrack: Unable to change conntrack accounting of a net namespace via 'nf_conntrack_acct' sysfs
From: Florian Westphal @ 2019-08-27 11:18 UTC (permalink / raw)
To: Shmulik Ladkani; +Cc: Florian Westphal, Pablo Neira Ayuso, netdev, shmulik
In-Reply-To: <20190827135754.7d460ef8@pixies>
Shmulik Ladkani <shmulik.ladkani@gmail.com> wrote:
> -static int nf_conntrack_acct_init_sysctl(struct net *net)
> -{
> - struct ctl_table *table;
> -
> - table = kmemdup(acct_sysctl_table, sizeof(acct_sysctl_table),
> - GFP_KERNEL);
> - if (!table)
> - goto out;
> -
> - table[0].data = &net->ct.sysctl_acct;
> -
>
> (where 'nf_conntrack_acct_init_sysctl()' was originally called by
> 'nf_conntrack_acct_pernet_init()').
>
> However POST d912dec12428, the per-net netfilter sysctl table simply
> inherits from global 'nf_ct_sysctl_table[]', which has
>
> + .data = &init_net.ct.sysctl_acct,
>
> effectivly making any 'net.netfilter.nf_conntrack_acct' sysctl change
> affect the 'init_net' and not relevant net namespace.
>
> Also, looks like "nf_conntrack_helper", "nf_conntrack_events",
> "nf_conntrack_timestamp" where also harmed in a similar way, see:
>
> d912dec12428 ("netfilter: conntrack: merge acct and helper sysctl table with main one")
> cb2833ed0044 ("netfilter: conntrack: merge ecache and timestamp sysctl tables with main one")
Thanks for reporting this bug, I will submit a patch soon.
^ permalink raw reply
* RE: [PATCH 1/4] mdev: Introduce sha1 based mdev alias
From: Parav Pandit @ 2019-08-27 11:16 UTC (permalink / raw)
To: Cornelia Huck
Cc: alex.williamson@redhat.com, Jiri Pirko, kwankhede@nvidia.com,
davem@davemloft.net, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20190827122428.37442fe1.cohuck@redhat.com>
> -----Original Message-----
> From: Cornelia Huck <cohuck@redhat.com>
> Sent: Tuesday, August 27, 2019 3:54 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: alex.williamson@redhat.com; Jiri Pirko <jiri@mellanox.com>;
> kwankhede@nvidia.com; davem@davemloft.net; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; netdev@vger.kernel.org
> Subject: Re: [PATCH 1/4] mdev: Introduce sha1 based mdev alias
>
> On Mon, 26 Aug 2019 15:41:16 -0500
> Parav Pandit <parav@mellanox.com> wrote:
> >
> > static int __init mdev_init(void)
> > {
> > + alias_hash = crypto_alloc_shash("sha1", 0, 0);
> > + if (!alias_hash)
> > + return -ENOMEM;
> > +
> > return mdev_bus_register();
>
> Don't you need to call crypto_free_shash() if mdev_bus_register() fails?
>
Missed to answer this in previous reply.
Yes, took care of it in v1.
Mark Bloch also pointed it to me.
^ permalink raw reply
* RE: [PATCH 1/4] mdev: Introduce sha1 based mdev alias
From: Parav Pandit @ 2019-08-27 11:12 UTC (permalink / raw)
To: Cornelia Huck
Cc: alex.williamson@redhat.com, Jiri Pirko, kwankhede@nvidia.com,
davem@davemloft.net, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20190827122428.37442fe1.cohuck@redhat.com>
> -----Original Message-----
> From: Cornelia Huck <cohuck@redhat.com>
> Sent: Tuesday, August 27, 2019 3:54 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: alex.williamson@redhat.com; Jiri Pirko <jiri@mellanox.com>;
> kwankhede@nvidia.com; davem@davemloft.net; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; netdev@vger.kernel.org
> Subject: Re: [PATCH 1/4] mdev: Introduce sha1 based mdev alias
>
> On Mon, 26 Aug 2019 15:41:16 -0500
> Parav Pandit <parav@mellanox.com> wrote:
>
> > Whenever a parent requests to generate mdev alias, generate a mdev
> > alias.
> > It is an optional attribute that parent can request to generate for
> > each of its child mdev.
> > mdev alias is generated using sha1 from the mdev name.
>
> Maybe add some motivation here as well?
>
> "Some vendor drivers want an identifier for an mdev device that is shorter than
> the uuid, due to length restrictions in the consumers of that identifier.
>
> Add a callback that allows a vendor driver to request an alias of a specified
> length to be generated (via sha1) for an mdev device. If generated, that alias is
> checked for collisions."
>
I did described the motivation in the cover letter with example and this design discussion thread.
I will include above summary in v1.
> What about:
>
> * @get_alias_length: optional callback to specify length of the alias to create
> * Returns unsigned integer: length of the alias to be created,
> * 0 to not create an alias
>
Ack.
> I also think it might be beneficial to add a device parameter here now (rather
> than later); that seems to be something that makes sense.
>
Without showing the use, it shouldn't be added.
> > * Parent device that support mediated device should be registered with
> mdev
> > * module with mdev_parent_ops structure.
> > **/
> > @@ -92,6 +95,7 @@ struct mdev_parent_ops {
> > long (*ioctl)(struct mdev_device *mdev, unsigned int cmd,
> > unsigned long arg);
> > int (*mmap)(struct mdev_device *mdev, struct vm_area_struct
> *vma);
> > + unsigned int (*get_alias_length)(void);
> > };
> >
> > /* interface for exporting mdev supported type attributes */
^ permalink raw reply
* RE: [PATCH 2/4] mdev: Make mdev alias unique among all mdevs
From: Parav Pandit @ 2019-08-27 11:08 UTC (permalink / raw)
To: Cornelia Huck
Cc: alex.williamson@redhat.com, Jiri Pirko, kwankhede@nvidia.com,
davem@davemloft.net, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20190827122928.752e763b.cohuck@redhat.com>
> -----Original Message-----
> From: Cornelia Huck <cohuck@redhat.com>
> Sent: Tuesday, August 27, 2019 3:59 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: alex.williamson@redhat.com; Jiri Pirko <jiri@mellanox.com>;
> kwankhede@nvidia.com; davem@davemloft.net; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; netdev@vger.kernel.org
> Subject: Re: [PATCH 2/4] mdev: Make mdev alias unique among all mdevs
>
> On Mon, 26 Aug 2019 15:41:17 -0500
> Parav Pandit <parav@mellanox.com> wrote:
>
> > Mdev alias should be unique among all the mdevs, so that when such
> > alias is used by the mdev users to derive other objects, there is no
> > collision in a given system.
> >
> > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > ---
> > drivers/vfio/mdev/mdev_core.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/vfio/mdev/mdev_core.c
> > b/drivers/vfio/mdev/mdev_core.c index e825ff38b037..6eb37f0c6369
> > 100644
> > --- a/drivers/vfio/mdev/mdev_core.c
> > +++ b/drivers/vfio/mdev/mdev_core.c
> > @@ -375,6 +375,11 @@ int mdev_device_create(struct kobject *kobj, struct
> device *dev,
> > ret = -EEXIST;
> > goto mdev_fail;
> > }
> > + if (tmp->alias && strcmp(tmp->alias, alias) == 0) {
>
> Any way we can relay to the caller that the uuid was fine, but that we had a
> hash collision? Duplicate uuids are much more obvious than a collision here.
>
How do you want to relay this rare event?
Netlink interface has way to return the error message back, but sysfs is limited due to its error code based interface.
> > + mutex_unlock(&mdev_list_lock);
> > + ret = -EEXIST;
> > + goto mdev_fail;
> > + }
> > }
> >
> > mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
^ permalink raw reply
* RE: [PATCH 3/4] mdev: Expose mdev alias in sysfs tree
From: Parav Pandit @ 2019-08-27 11:07 UTC (permalink / raw)
To: Cornelia Huck
Cc: alex.williamson@redhat.com, Jiri Pirko, kwankhede@nvidia.com,
davem@davemloft.net, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20190827124706.7e726794.cohuck@redhat.com>
> -----Original Message-----
> From: Cornelia Huck <cohuck@redhat.com>
> Sent: Tuesday, August 27, 2019 4:17 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: alex.williamson@redhat.com; Jiri Pirko <jiri@mellanox.com>;
> kwankhede@nvidia.com; davem@davemloft.net; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; netdev@vger.kernel.org
> Subject: Re: [PATCH 3/4] mdev: Expose mdev alias in sysfs tree
>
> On Mon, 26 Aug 2019 15:41:18 -0500
> Parav Pandit <parav@mellanox.com> wrote:
>
> > Expose mdev alias as string in a sysfs tree so that such attribute can
> > be used to generate netdevice name by systemd/udev or can be used to
> > match other kernel objects based on the alias of the mdev.
>
> What about
>
> "Expose the optional alias for an mdev device as a sysfs attribute.
> This way, userspace tools such as udev may make use of the alias, for example
> to create a netdevice name for the mdev."
>
Ok. I will change it.
> >
> > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > ---
> > drivers/vfio/mdev/mdev_sysfs.c | 13 +++++++++++++
>
> I think the documentation should be updated as well.
>
> > 1 file changed, 13 insertions(+)
> >
> > diff --git a/drivers/vfio/mdev/mdev_sysfs.c
> > b/drivers/vfio/mdev/mdev_sysfs.c index 43afe0e80b76..59f4e3cc5233
> > 100644
> > --- a/drivers/vfio/mdev/mdev_sysfs.c
> > +++ b/drivers/vfio/mdev/mdev_sysfs.c
> > @@ -246,7 +246,20 @@ static ssize_t remove_store(struct device *dev,
> > struct device_attribute *attr,
> >
> > static DEVICE_ATTR_WO(remove);
> >
> > +static ssize_t alias_show(struct device *device,
> > + struct device_attribute *attr, char *buf) {
> > + struct mdev_device *dev = mdev_from_dev(device);
> > +
> > + if (!dev->alias)
> > + return -EOPNOTSUPP;
>
> I'm wondering how to make this consumable by userspace in the easiest way.
> - As you do now (userspace gets an error when trying to read)?
> - Returning an empty value (nothing to see here, move along)?
No. This is confusing, to return empty value, because it says that there is an alias but it is some weird empty string.
If there is alias, it shows exactly what it is.
If no alias it returns an error code = unsupported -> inline with other widely used subsystem.
> - Or not creating the attribute at all? That would match what userspace
> sees on older kernels, so it needs to be able to deal with that
New sysfs files can appear. Tool cannot say that I was not expecting this file here.
User space is supposed to work with the file they are off interest.
Mdev interface has option to specify vendor specific files, though in usual manner it's not recommended.
So there is no old user space, new kernel issue here.
> anyway.
>
> > +
> > + return sprintf(buf, "%s\n", dev->alias); } static
> > +DEVICE_ATTR_RO(alias);
> > +
> > static const struct attribute *mdev_device_attrs[] = {
> > + &dev_attr_alias.attr,
> > &dev_attr_remove.attr,
> > NULL,
> > };
^ permalink raw reply
* SFP+ EEPROM readouts fail on X722 (ethtool -m: Invalid argument)
From: Jakub Jankowski @ 2019-08-27 11:03 UTC (permalink / raw)
To: e1000-devel; +Cc: shasta, mhemsley, netdev
Hi,
We can't get SFP+ EEPROM readouts for X722 to work at all:
# ethtool -m eth10
Cannot get module EEPROM information: Invalid argument
# dmesg | tail -n 1
[ 445.971974] i40e 0000:3d:00.3 eth10: Module EEPROM memory read not supported. Please update the NVM image.
# lspci | grep 3d:00.3
3d:00.3 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 09)
We're running 4.19.65 kernel at the moment, testing using the newest out-of-tree Intel module
# modinfo -F version i40e
2.9.21
We also tried:
- 4.19.65 with in-tree i40e (2.3.2-k)
- stock Arch Linux (kernel 5.2.5, driver 2.8.20-k)
and the results are the same, as shown above.
# ethtool -i eth10
driver: i40e
version: 2.9.21
firmware-version: 3.31 0x80000d31 1.1767.0
expansion-rom-version:
bus-info: 0000:3d:00.3
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
# dmidecode -s baseboard-manufacturer
Intel Corporation
# dmidecode -s baseboard-product-name
S2600WFT
# dmidecode -s baseboard-version
H48104-853
# lspci -vvv
(...)
3d:00.3 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 09)
DeviceName: Intel PCH Integrated 10 Gigabit Ethernet Controller
Subsystem: Intel Corporation Ethernet Connection X722 for 10GbE SFP+
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 112
NUMA node: 0
Region 0: Memory at ab000000 (64-bit, prefetchable) [size=16M]
Region 3: Memory at b0000000 (64-bit, prefetchable) [size=32K]
Expansion ROM at <ignored> [disabled]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
Vector table: BAR=3 offset=00000000
PBA: BAR=3 offset=00001000
Capabilities: [a0] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
AtomicOpsCtl: ReqEn-
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [e0] Vital Product Data
Product Name: Example VPD
Read-only fields:
[V0] Vendor specific:
[RV] Reserved: checksum good, 0 byte(s) reserved
End
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
IOVSta: Migration-
Initial VFs: 32, Total VFs: 32, Number of VFs: 0, Function Dependency Link: 03
VF offset: 109, stride: 1, Device ID: 37cd
Supported Page Size: 00000553, System Page Size: 00000001
Region 0: Memory at 00000000af000000 (64-bit, prefetchable)
Region 3: Memory at 00000000b0020000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [1a0 v1] Transaction Processing Hints
Device specific mode supported
No steering table available
Capabilities: [1b0 v1] Access Control Services
ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
Kernel driver in use: i40e
Kernel modules: i40e
Same kernel+i40e, same SFP+ module - but on Intel X710, works like a treat:
# lspci | grep X7
81:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
81:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
# ethtool -m eth8
Identifier : 0x03 (SFP)
Extended identifier : 0x04 (GBIC/SFP defined by 2-wire interface ID)
Connector : 0x07 (LC)
Transceiver codes : 0x10 0x00 0x00 0x01 0x00 0x00 0x00 0x00 0x00
Transceiver type : 10G Ethernet: 10G Base-SR
Transceiver type : Ethernet: 1000BASE-SX
Encoding : 0x06 (64B/66B)
BR, Nominal : 10300MBd
(...)
# ethtool -i eth8
driver: i40e
version: 2.9.21
firmware-version: 6.01 0x800035cf 1.1876.0
expansion-rom-version:
bus-info: 0000:81:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
#
Is this a known problem?
Best regards,
Jakub
^ permalink raw reply
* Re: [PATCH net v3] ixgbe: fix double clean of tx descriptors with xdp
From: Ilya Maximets @ 2019-08-27 11:00 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: netdev, linux-kernel, bpf, David S. Miller, Björn Töpel,
Magnus Karlsson, Jakub Kicinski, Alexei Starovoitov,
Daniel Borkmann, Jeff Kirsher, intel-wired-lan, Eelco Chaudron,
William Tu, Alexander Duyck
In-Reply-To: <20190826154042.00004bfc@gmail.com>
On 26.08.2019 16:40, Maciej Fijalkowski wrote:
> On Thu, 22 Aug 2019 20:12:37 +0300
> Ilya Maximets <i.maximets@samsung.com> wrote:
>
>> Tx code doesn't clear the descriptors' status after cleaning.
>> So, if the budget is larger than number of used elems in a ring, some
>> descriptors will be accounted twice and xsk_umem_complete_tx will move
>> prod_tail far beyond the prod_head breaking the completion queue ring.
>>
>> Fix that by limiting the number of descriptors to clean by the number
>> of used descriptors in the tx ring.
>>
>> 'ixgbe_clean_xdp_tx_irq()' function refactored to look more like
>> 'ixgbe_xsk_clean_tx_ring()' since we're allowed to directly use
>> 'next_to_clean' and 'next_to_use' indexes.
>>
>> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
>> ---
>>
>> Version 3:
>> * Reverted some refactoring made for v2.
>> * Eliminated 'budget' for tx clean.
>> * prefetch returned.
>>
>> Version 2:
>> * 'ixgbe_clean_xdp_tx_irq()' refactored to look more like
>> 'ixgbe_xsk_clean_tx_ring()'.
>>
>> drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 29 ++++++++------------
>> 1 file changed, 11 insertions(+), 18 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
>> index 6b609553329f..a3b6d8c89127 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
>> @@ -633,19 +633,17 @@ static void ixgbe_clean_xdp_tx_buffer(struct ixgbe_ring *tx_ring,
>> bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
>> struct ixgbe_ring *tx_ring, int napi_budget)
>
> While you're at it, can you please as well remove the 'napi_budget' argument?
> It wasn't used at all even before your patch.
As you mentioned, this is not related to current patch and this patch doesn't
touch these particular lines of code. So, I think it's better to make a
separate patch for this if you think it's needed.
>
> I'm jumping late in, but I was really wondering and hesitated with taking
> part in discussion since the v1 of this patch - can you elaborate why simply
> clearing the DD bit wasn't sufficient?
Clearing the DD bit will end up driver and hardware writing to close memory
locations at the same time leading to cache trashing and poor performance.
Anyway additional write is unnecessary, because we know exactly which descriptors
we need to check.
Best regards, Ilya Maximets.
>
> Maciej
>
>> {
>> + u16 ntc = tx_ring->next_to_clean, ntu = tx_ring->next_to_use;
>> unsigned int total_packets = 0, total_bytes = 0;
>> - u32 i = tx_ring->next_to_clean, xsk_frames = 0;
>> - unsigned int budget = q_vector->tx.work_limit;
>> struct xdp_umem *umem = tx_ring->xsk_umem;
>> union ixgbe_adv_tx_desc *tx_desc;
>> struct ixgbe_tx_buffer *tx_bi;
>> - bool xmit_done;
>> + u32 xsk_frames = 0;
>>
>> - tx_bi = &tx_ring->tx_buffer_info[i];
>> - tx_desc = IXGBE_TX_DESC(tx_ring, i);
>> - i -= tx_ring->count;
>> + tx_bi = &tx_ring->tx_buffer_info[ntc];
>> + tx_desc = IXGBE_TX_DESC(tx_ring, ntc);
>>
>> - do {
>> + while (ntc != ntu) {
>> if (!(tx_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
>> break;
>>
>> @@ -661,22 +659,18 @@ bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
>>
>> tx_bi++;
>> tx_desc++;
>> - i++;
>> - if (unlikely(!i)) {
>> - i -= tx_ring->count;
>> + ntc++;
>> + if (unlikely(ntc == tx_ring->count)) {
>> + ntc = 0;
>> tx_bi = tx_ring->tx_buffer_info;
>> tx_desc = IXGBE_TX_DESC(tx_ring, 0);
>> }
>>
>> /* issue prefetch for next Tx descriptor */
>> prefetch(tx_desc);
>> + }
>>
>> - /* update budget accounting */
>> - budget--;
>> - } while (likely(budget));
>> -
>> - i += tx_ring->count;
>> - tx_ring->next_to_clean = i;
>> + tx_ring->next_to_clean = ntc;
>>
>> u64_stats_update_begin(&tx_ring->syncp);
>> tx_ring->stats.bytes += total_bytes;
>> @@ -688,8 +682,7 @@ bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
>> if (xsk_frames)
>> xsk_umem_complete_tx(umem, xsk_frames);
>>
>> - xmit_done = ixgbe_xmit_zc(tx_ring, q_vector->tx.work_limit);
>> - return budget > 0 && xmit_done;
>> + return ixgbe_xmit_zc(tx_ring, q_vector->tx.work_limit);
>> }
>>
>> int ixgbe_xsk_async_xmit(struct net_device *dev, u32 qid)
>
>
>
^ permalink raw reply
* [REGRESSION] netfilter: conntrack: Unable to change conntrack accounting of a net namespace via 'nf_conntrack_acct' sysfs
From: Shmulik Ladkani @ 2019-08-27 10:57 UTC (permalink / raw)
To: Florian Westphal, Pablo Neira Ayuso; +Cc: netdev, shmulik
Hi,
Prior d912dec12428 ("netfilter: conntrack: merge acct and helper sysctl table with main one")
one was able to enable extended accounting within a (non-init)
net-namespace by setting: 'net.netfilter.nf_conntrack_acct=1'
However since d912dec12428, doing so results in changing init_net's
sysctl_acct field, instead of the relevant net's sysctl_acct.
Seen in original code, PRE d912dec12428, which creates a reference to
each net's _OWN_ ct.sysctl_acct within a separate acct_sysctl_table,
snip:
-static int nf_conntrack_acct_init_sysctl(struct net *net)
-{
- struct ctl_table *table;
-
- table = kmemdup(acct_sysctl_table, sizeof(acct_sysctl_table),
- GFP_KERNEL);
- if (!table)
- goto out;
-
- table[0].data = &net->ct.sysctl_acct;
-
(where 'nf_conntrack_acct_init_sysctl()' was originally called by
'nf_conntrack_acct_pernet_init()').
However POST d912dec12428, the per-net netfilter sysctl table simply
inherits from global 'nf_ct_sysctl_table[]', which has
+ .data = &init_net.ct.sysctl_acct,
effectivly making any 'net.netfilter.nf_conntrack_acct' sysctl change
affect the 'init_net' and not relevant net namespace.
Also, looks like "nf_conntrack_helper", "nf_conntrack_events",
"nf_conntrack_timestamp" where also harmed in a similar way, see:
d912dec12428 ("netfilter: conntrack: merge acct and helper sysctl table with main one")
cb2833ed0044 ("netfilter: conntrack: merge ecache and timestamp sysctl tables with main one")
Florian, would it be possible for you to revert these on -net ?
Many thanks,
Shmulik
^ permalink raw reply
* Re: [PATCH net v2] net/sched: pfifo_fast: fix wrong dereference when qdisc is reset
From: Stefano Brivio @ 2019-08-27 10:51 UTC (permalink / raw)
To: Davide Caratti
Cc: Cong Wang, Jamal Hadi Salim, Jiri Pirko, David S. Miller, netdev,
Paolo Abeni, Li Shuang
In-Reply-To: <783231162b9d32faaf5df34ad8ad437b0031bd31.1566901438.git.dcaratti@redhat.com>
On Tue, 27 Aug 2019 12:29:09 +0200
Davide Caratti <dcaratti@redhat.com> wrote:
> Now that 'TCQ_F_CPUSTATS' bit can be cleared, depending on the value of
> 'TCQ_F_NOLOCK' bit in the parent qdisc, we need to be sure that per-cpu
> counters are present when 'reset()' is called for pfifo_fast qdiscs.
> Otherwise, the following script:
>
> # tc q a dev lo handle 1: root htb default 100
> # tc c a dev lo parent 1: classid 1:100 htb \
> > rate 95Mbit ceil 100Mbit burst 64k
> [...]
> # tc f a dev lo parent 1: protocol arp basic classid 1:100
> [...]
> # tc q a dev lo parent 1:100 handle 100: pfifo_fast
> [...]
> # tc q d dev lo root
>
> can generate the following splat:
>
> Unable to handle kernel paging request at virtual address dfff2c01bd148000
> Mem abort info:
> ESR = 0x96000004
> Exception class = DABT (current EL), IL = 32 bits
> SET = 0, FnV = 0
> EA = 0, S1PTW = 0
> Data abort info:
> ISV = 0, ISS = 0x00000004
> CM = 0, WnR = 0
> [dfff2c01bd148000] address between user and kernel address ranges
> Internal error: Oops: 96000004 [#1] SMP
> [...]
> pstate: 80000005 (Nzcv daif -PAN -UAO)
> pc : pfifo_fast_reset+0x280/0x4d8
> lr : pfifo_fast_reset+0x21c/0x4d8
> sp : ffff800d09676fa0
> x29: ffff800d09676fa0 x28: ffff200012ee22e4
> x27: dfff200000000000 x26: 0000000000000000
> x25: ffff800ca0799958 x24: ffff1001940f332b
> x23: 0000000000000007 x22: ffff200012ee1ab8
> x21: 0000600de8a40000 x20: 0000000000000000
> x19: ffff800ca0799900 x18: 0000000000000000
> x17: 0000000000000002 x16: 0000000000000000
> x15: 0000000000000000 x14: 0000000000000000
> x13: 0000000000000000 x12: ffff1001b922e6e2
> x11: 1ffff001b922e6e1 x10: 0000000000000000
> x9 : 1ffff001b922e6e1 x8 : dfff200000000000
> x7 : 0000000000000000 x6 : 0000000000000000
> x5 : 1fffe400025dc45c x4 : 1fffe400025dc357
> x3 : 00000c01bd148000 x2 : 0000600de8a40000
> x1 : 0000000000000007 x0 : 0000600de8a40004
> Call trace:
> pfifo_fast_reset+0x280/0x4d8
> qdisc_reset+0x6c/0x370
> htb_reset+0x150/0x3b8 [sch_htb]
> qdisc_reset+0x6c/0x370
> dev_deactivate_queue.constprop.5+0xe0/0x1a8
> dev_deactivate_many+0xd8/0x908
> dev_deactivate+0xe4/0x190
> qdisc_graft+0x88c/0xbd0
> tc_get_qdisc+0x418/0x8a8
> rtnetlink_rcv_msg+0x3a8/0xa78
> netlink_rcv_skb+0x18c/0x328
> rtnetlink_rcv+0x28/0x38
> netlink_unicast+0x3c4/0x538
> netlink_sendmsg+0x538/0x9a0
> sock_sendmsg+0xac/0xf8
> ___sys_sendmsg+0x53c/0x658
> __sys_sendmsg+0xc8/0x140
> __arm64_sys_sendmsg+0x74/0xa8
> el0_svc_handler+0x164/0x468
> el0_svc+0x10/0x14
> Code: 910012a0 92400801 d343fc03 11000c21 (38fb6863)
>
> Fix this by testing the value of 'TCQ_F_CPUSTATS' bit in 'qdisc->flags',
> before dereferencing 'qdisc->cpu_qstats'.
>
> Changes since v1:
> - coding style improvements, thanks to Stefano Brivio
>
> Fixes: 8a53e616de29 ("net: sched: when clearing NOLOCK, clear TCQ_F_CPUSTATS, too")
> CC: Paolo Abeni <pabeni@redhat.com>
> Reported-by: Li Shuang <shuali@redhat.com>
> Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
--
Stefano
^ permalink raw reply
* Re: [PATCH 3/4] mdev: Expose mdev alias in sysfs tree
From: Cornelia Huck @ 2019-08-27 10:47 UTC (permalink / raw)
To: Parav Pandit
Cc: alex.williamson, jiri, kwankhede, davem, kvm, linux-kernel,
netdev
In-Reply-To: <20190826204119.54386-4-parav@mellanox.com>
On Mon, 26 Aug 2019 15:41:18 -0500
Parav Pandit <parav@mellanox.com> wrote:
> Expose mdev alias as string in a sysfs tree so that such attribute can
> be used to generate netdevice name by systemd/udev or can be used to
> match other kernel objects based on the alias of the mdev.
What about
"Expose the optional alias for an mdev device as a sysfs attribute.
This way, userspace tools such as udev may make use of the alias, for
example to create a netdevice name for the mdev."
>
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
> drivers/vfio/mdev/mdev_sysfs.c | 13 +++++++++++++
I think the documentation should be updated as well.
> 1 file changed, 13 insertions(+)
>
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> index 43afe0e80b76..59f4e3cc5233 100644
> --- a/drivers/vfio/mdev/mdev_sysfs.c
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -246,7 +246,20 @@ static ssize_t remove_store(struct device *dev, struct device_attribute *attr,
>
> static DEVICE_ATTR_WO(remove);
>
> +static ssize_t alias_show(struct device *device,
> + struct device_attribute *attr, char *buf)
> +{
> + struct mdev_device *dev = mdev_from_dev(device);
> +
> + if (!dev->alias)
> + return -EOPNOTSUPP;
I'm wondering how to make this consumable by userspace in the easiest way.
- As you do now (userspace gets an error when trying to read)?
- Returning an empty value (nothing to see here, move along)?
- Or not creating the attribute at all? That would match what userspace
sees on older kernels, so it needs to be able to deal with that
anyway.
> +
> + return sprintf(buf, "%s\n", dev->alias);
> +}
> +static DEVICE_ATTR_RO(alias);
> +
> static const struct attribute *mdev_device_attrs[] = {
> + &dev_attr_alias.attr,
> &dev_attr_remove.attr,
> NULL,
> };
^ permalink raw reply
* [PATCH bpf-next v6 11/12] samples/bpf: use hugepages in xdpsock app
From: Kevin Laatz @ 2019-08-27 2:25 UTC (permalink / raw)
To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
jonathan.lemon, saeedm, maximmi, stephen
Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190827022531.15060-1-kevin.laatz@intel.com>
This patch modifies xdpsock to use mmap instead of posix_memalign. With
this change, we can use hugepages when running the application in unaligned
chunks mode. Using hugepages makes it more likely that we have physically
contiguous memory, which supports the unaligned chunk mode better.
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
---
samples/bpf/xdpsock_user.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index dc3d50f8ed86..102eace22956 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -70,6 +70,7 @@ static int opt_interval = 1;
static u32 opt_xdp_bind_flags = XDP_USE_NEED_WAKEUP;
static u32 opt_umem_flags;
static int opt_unaligned_chunks;
+static int opt_mmap_flags;
static u32 opt_xdp_bind_flags;
static int opt_xsk_frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE;
static int opt_timeout = 1000;
@@ -440,6 +441,7 @@ static void parse_command_line(int argc, char **argv)
case 'u':
opt_umem_flags |= XDP_UMEM_UNALIGNED_CHUNK_FLAG;
opt_unaligned_chunks = 1;
+ opt_mmap_flags = MAP_HUGETLB;
break;
case 'F':
opt_xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
@@ -742,11 +744,14 @@ int main(int argc, char **argv)
exit(EXIT_FAILURE);
}
- ret = posix_memalign(&bufs, getpagesize(), /* PAGE_SIZE aligned */
- NUM_FRAMES * opt_xsk_frame_size);
- if (ret)
- exit_with_error(ret);
-
+ /* Reserve memory for the umem. Use hugepages if unaligned chunk mode */
+ bufs = mmap(NULL, NUM_FRAMES * opt_xsk_frame_size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | opt_mmap_flags, -1, 0);
+ if (bufs == MAP_FAILED) {
+ printf("ERROR: mmap failed\n");
+ exit(EXIT_FAILURE);
+ }
/* Create sockets... */
umem = xsk_configure_umem(bufs, NUM_FRAMES * opt_xsk_frame_size);
xsks[num_socks++] = xsk_configure_socket(umem);
--
2.17.1
^ permalink raw reply related
* [PATCH bpf-next v6 12/12] doc/af_xdp: include unaligned chunk case
From: Kevin Laatz @ 2019-08-27 2:25 UTC (permalink / raw)
To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
jonathan.lemon, saeedm, maximmi, stephen
Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190827022531.15060-1-kevin.laatz@intel.com>
The addition of unaligned chunks mode, the documentation needs to be
updated to indicate that the incoming addr to the fill ring will only be
masked if the user application is run in the aligned chunk mode. This patch
also adds a line to explicitly indicate that the incoming addr will not be
masked if running the user application in the unaligned chunk mode.
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
---
Documentation/networking/af_xdp.rst | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index eeedc2e826aa..83f7ae5fc045 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -153,10 +153,12 @@ an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has
Frames passed to the kernel are used for the ingress path (RX rings).
-The user application produces UMEM addrs to this ring. Note that the
-kernel will mask the incoming addr. E.g. for a chunk size of 2k, the
-log2(2048) LSB of the addr will be masked off, meaning that 2048, 2050
-and 3000 refers to the same chunk.
+The user application produces UMEM addrs to this ring. Note that, if
+running the application with aligned chunk mode, the kernel will mask
+the incoming addr. E.g. for a chunk size of 2k, the log2(2048) LSB of
+the addr will be masked off, meaning that 2048, 2050 and 3000 refers
+to the same chunk. If the user application is run in the unaligned
+chunks mode, then the incoming addr will be left untouched.
UMEM Completion Ring
--
2.17.1
^ permalink raw reply related
* [PATCH bpf-next v6 10/12] samples/bpf: add buffer recycling for unaligned chunks to xdpsock
From: Kevin Laatz @ 2019-08-27 2:25 UTC (permalink / raw)
To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
jonathan.lemon, saeedm, maximmi, stephen
Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190827022531.15060-1-kevin.laatz@intel.com>
This patch adds buffer recycling support for unaligned buffers. Since we
don't mask the addr to 2k at umem_reg in unaligned mode, we need to make
sure we give back the correct (original) addr to the fill queue. We achieve
this using the new descriptor format and associated masks. The new format
uses the upper 16-bits for the offset and the lower 48-bits for the addr.
Since we have a field for the offset, we no longer need to modify the
actual address. As such, all we have to do to get back the original address
is mask for the lower 48 bits (i.e. strip the offset and we get the address
on it's own).
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
v2:
- Removed unused defines
- Fix buffer recycling for unaligned case
- Remove --buf-size (--frame-size merged before this)
- Modifications to use the new descriptor format for buffer recycling
v5:
- Use new accessors for addr/offset instead of macros.
---
samples/bpf/xdpsock_user.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index 7312eab4d201..dc3d50f8ed86 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -484,6 +484,7 @@ static void kick_tx(struct xsk_socket_info *xsk)
static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk,
struct pollfd *fds)
{
+ struct xsk_umem_info *umem = xsk->umem;
u32 idx_cq = 0, idx_fq = 0;
unsigned int rcvd;
size_t ndescs;
@@ -498,24 +499,23 @@ static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk,
xsk->outstanding_tx;
/* re-add completed Tx buffers */
- rcvd = xsk_ring_cons__peek(&xsk->umem->cq, ndescs, &idx_cq);
+ rcvd = xsk_ring_cons__peek(&umem->cq, ndescs, &idx_cq);
if (rcvd > 0) {
unsigned int i;
int ret;
- ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
+ ret = xsk_ring_prod__reserve(&umem->fq, rcvd, &idx_fq);
while (ret != rcvd) {
if (ret < 0)
exit_with_error(-ret);
- if (xsk_ring_prod__needs_wakeup(&xsk->umem->fq))
+ if (xsk_ring_prod__needs_wakeup(&umem->fq))
ret = poll(fds, num_socks, opt_timeout);
- ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd,
- &idx_fq);
+ ret = xsk_ring_prod__reserve(&umem->fq, rcvd, &idx_fq);
}
+
for (i = 0; i < rcvd; i++)
- *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) =
- *xsk_ring_cons__comp_addr(&xsk->umem->cq,
- idx_cq++);
+ *xsk_ring_prod__fill_addr(&umem->fq, idx_fq++) =
+ *xsk_ring_cons__comp_addr(&umem->cq, idx_cq++);
xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
xsk_ring_cons__release(&xsk->umem->cq, rcvd);
@@ -568,10 +568,13 @@ static void rx_drop(struct xsk_socket_info *xsk, struct pollfd *fds)
for (i = 0; i < rcvd; i++) {
u64 addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
u32 len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++)->len;
+ u64 orig = xsk_umem__extract_addr(addr);
+
+ addr = xsk_umem__add_offset_to_addr(addr);
char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
hex_dump(pkt, len, addr);
- *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = addr;
+ *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = orig;
}
xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
@@ -680,12 +683,15 @@ static void l2fwd(struct xsk_socket_info *xsk, struct pollfd *fds)
for (i = 0; i < rcvd; i++) {
u64 addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
u32 len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++)->len;
+ u64 orig = xsk_umem__extract_addr(addr);
+
+ addr = xsk_umem__add_offset_to_addr(addr);
char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
swap_mac_addresses(pkt);
hex_dump(pkt, len, addr);
- xsk_ring_prod__tx_desc(&xsk->tx, idx_tx)->addr = addr;
+ xsk_ring_prod__tx_desc(&xsk->tx, idx_tx)->addr = orig;
xsk_ring_prod__tx_desc(&xsk->tx, idx_tx++)->len = len;
}
--
2.17.1
^ permalink raw reply related
* [PATCH bpf-next v6 09/12] samples/bpf: add unaligned chunks mode support to xdpsock
From: Kevin Laatz @ 2019-08-27 2:25 UTC (permalink / raw)
To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
jonathan.lemon, saeedm, maximmi, stephen
Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190827022531.15060-1-kevin.laatz@intel.com>
This patch adds support for the unaligned chunks mode. The addition of the
unaligned chunks option will allow users to run the application with more
relaxed chunk placement in the XDP umem.
Unaligned chunks mode can be used with the '-u' or '--unaligned' command
line options.
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
---
v4:
- updated help text for -f
- use new chunk flag define
---
samples/bpf/xdpsock_user.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)
diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index da84c760c094..7312eab4d201 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -68,6 +68,9 @@ static int opt_queue;
static int opt_poll;
static int opt_interval = 1;
static u32 opt_xdp_bind_flags = XDP_USE_NEED_WAKEUP;
+static u32 opt_umem_flags;
+static int opt_unaligned_chunks;
+static u32 opt_xdp_bind_flags;
static int opt_xsk_frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE;
static int opt_timeout = 1000;
static bool opt_need_wakeup = true;
@@ -284,7 +287,9 @@ static struct xsk_umem_info *xsk_configure_umem(void *buffer, u64 size)
.comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
.frame_size = opt_xsk_frame_size,
.frame_headroom = XSK_UMEM__DEFAULT_FRAME_HEADROOM,
+ .flags = opt_umem_flags
};
+
int ret;
umem = calloc(1, sizeof(*umem));
@@ -293,6 +298,7 @@ static struct xsk_umem_info *xsk_configure_umem(void *buffer, u64 size)
ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
&cfg);
+
if (ret)
exit_with_error(-ret);
@@ -355,6 +361,7 @@ static struct option long_options[] = {
{"copy", no_argument, 0, 'c'},
{"frame-size", required_argument, 0, 'f'},
{"no-need-wakeup", no_argument, 0, 'm'},
+ {"unaligned", no_argument, 0, 'u'},
{0, 0, 0, 0}
};
@@ -376,6 +383,8 @@ static void usage(const char *prog)
" -c, --copy Force copy mode.\n"
" -f, --frame-size=n Set the frame size (must be a power of two, default is %d).\n"
" -m, --no-need-wakeup Turn off use of driver need wakeup flag.\n"
+ " -f, --frame-size=n Set the frame size (must be a power of two in aligned mode, default is %d).\n"
+ " -u, --unaligned Enable unaligned chunk placement\n"
"\n";
fprintf(stderr, str, prog, XSK_UMEM__DEFAULT_FRAME_SIZE);
exit(EXIT_FAILURE);
@@ -388,8 +397,7 @@ static void parse_command_line(int argc, char **argv)
opterr = 0;
for (;;) {
-
- c = getopt_long(argc, argv, "Frtli:q:psSNn:czf:m",
+ c = getopt_long(argc, argv, "Frtli:q:psSNn:czf:mu",
long_options, &option_index);
if (c == -1)
break;
@@ -429,6 +437,10 @@ static void parse_command_line(int argc, char **argv)
case 'c':
opt_xdp_bind_flags |= XDP_COPY;
break;
+ case 'u':
+ opt_umem_flags |= XDP_UMEM_UNALIGNED_CHUNK_FLAG;
+ opt_unaligned_chunks = 1;
+ break;
case 'F':
opt_xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
break;
@@ -438,6 +450,7 @@ static void parse_command_line(int argc, char **argv)
opt_need_wakeup = false;
opt_xdp_bind_flags &= ~XDP_USE_NEED_WAKEUP;
break;
+
default:
usage(basename(argv[0]));
}
@@ -450,7 +463,8 @@ static void parse_command_line(int argc, char **argv)
usage(basename(argv[0]));
}
- if (opt_xsk_frame_size & (opt_xsk_frame_size - 1)) {
+ if ((opt_xsk_frame_size & (opt_xsk_frame_size - 1)) &&
+ !opt_unaligned_chunks) {
fprintf(stderr, "--frame-size=%d is not a power of two\n",
opt_xsk_frame_size);
usage(basename(argv[0]));
--
2.17.1
^ permalink raw reply related
* [PATCH bpf-next v6 08/12] libbpf: add flags to umem config
From: Kevin Laatz @ 2019-08-27 2:25 UTC (permalink / raw)
To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
jonathan.lemon, saeedm, maximmi, stephen
Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190827022531.15060-1-kevin.laatz@intel.com>
This patch adds a 'flags' field to the umem_config and umem_reg structs.
This will allow for more options to be added for configuring umems.
The first use for the flags field is to add a flag for unaligned chunks
mode. These flags can either be user-provided or filled with a default.
Since we change the size of the xsk_umem_config struct, we need to version
the ABI. This patch includes the ABI versioning for xsk_umem__create. The
Makefile was also updated to handle multiple function versions in
check-abi.
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
---
v2:
- Removed the headroom check from this patch. It has moved to the
previous patch.
v4:
- Modified chunk flag define.
v5:
- Added ABI versioning for xsk_umem__create().
- Updated Makefile to handle multiple function versions.
- Removed bitfields from xdp_umem_reg
- Fix conflicts after 'bpf-af-xdp-wakeup' was merged
---
tools/include/uapi/linux/if_xdp.h | 9 +++++++++
tools/lib/bpf/Makefile | 5 ++++-
tools/lib/bpf/libbpf.map | 1 +
tools/lib/bpf/xsk.c | 33 ++++++++++++++++++++++++++++---
tools/lib/bpf/xsk.h | 27 +++++++++++++++++++++++++
5 files changed, 71 insertions(+), 4 deletions(-)
diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index 62b80d57b72a..be328c59389d 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -26,6 +26,9 @@
*/
#define XDP_USE_NEED_WAKEUP (1 << 3)
+/* Flags for xsk_umem_config flags */
+#define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
+
struct sockaddr_xdp {
__u16 sxdp_family;
__u16 sxdp_flags;
@@ -66,6 +69,7 @@ struct xdp_umem_reg {
__u64 len; /* Length of packet data area */
__u32 chunk_size;
__u32 headroom;
+ __u32 flags;
};
struct xdp_statistics {
@@ -87,6 +91,11 @@ struct xdp_options {
#define XDP_UMEM_PGOFF_FILL_RING 0x100000000ULL
#define XDP_UMEM_PGOFF_COMPLETION_RING 0x180000000ULL
+/* Masks for unaligned chunks mode */
+#define XSK_UNALIGNED_BUF_OFFSET_SHIFT 48
+#define XSK_UNALIGNED_BUF_ADDR_MASK \
+ ((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
+
/* Rx/Tx descriptor */
struct xdp_desc {
__u64 addr;
diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
index 613acb93b144..c6f94cffe06e 100644
--- a/tools/lib/bpf/Makefile
+++ b/tools/lib/bpf/Makefile
@@ -134,7 +134,9 @@ LIB_FILE := $(addprefix $(OUTPUT),$(LIB_FILE))
PC_FILE := $(addprefix $(OUTPUT),$(PC_FILE))
GLOBAL_SYM_COUNT = $(shell readelf -s --wide $(BPF_IN) | \
- awk '/GLOBAL/ && /DEFAULT/ && !/UND/ {s++} END{print s}')
+ cut -d "@" -f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' | \
+ awk '/GLOBAL/ && /DEFAULT/ && !/UND/ {print $$8}' | \
+ sort -u | wc -l)
VERSIONED_SYM_COUNT = $(shell readelf -s --wide $(OUTPUT)libbpf.so | \
grep -Eo '[^ ]+@LIBBPF_' | cut -d@ -f1 | sort -u | wc -l)
@@ -201,6 +203,7 @@ check_abi: $(OUTPUT)libbpf.so
"Please make sure all LIBBPF_API symbols are" \
"versioned in $(VERSION_SCRIPT)." >&2; \
readelf -s --wide $(OUTPUT)libbpf-in.o | \
+ cut -d "@" -f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' | \
awk '/GLOBAL/ && /DEFAULT/ && !/UND/ {print $$8}'| \
sort -u > $(OUTPUT)libbpf_global_syms.tmp; \
readelf -s --wide $(OUTPUT)libbpf.so | \
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 664ce8e7a60e..d04c7cb623ed 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -183,6 +183,7 @@ LIBBPF_0.0.4 {
perf_buffer__new;
perf_buffer__new_raw;
perf_buffer__poll;
+ xsk_umem__create;
} LIBBPF_0.0.3;
LIBBPF_0.0.5 {
diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
index 12ad78510147..842c4fd55859 100644
--- a/tools/lib/bpf/xsk.c
+++ b/tools/lib/bpf/xsk.c
@@ -99,6 +99,7 @@ static void xsk_set_umem_config(struct xsk_umem_config *cfg,
cfg->comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
cfg->frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE;
cfg->frame_headroom = XSK_UMEM__DEFAULT_FRAME_HEADROOM;
+ cfg->flags = XSK_UMEM__DEFAULT_FLAGS;
return;
}
@@ -106,6 +107,7 @@ static void xsk_set_umem_config(struct xsk_umem_config *cfg,
cfg->comp_size = usr_cfg->comp_size;
cfg->frame_size = usr_cfg->frame_size;
cfg->frame_headroom = usr_cfg->frame_headroom;
+ cfg->flags = usr_cfg->flags;
}
static int xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
@@ -132,9 +134,10 @@ static int xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
return 0;
}
-int xsk_umem__create(struct xsk_umem **umem_ptr, void *umem_area, __u64 size,
- struct xsk_ring_prod *fill, struct xsk_ring_cons *comp,
- const struct xsk_umem_config *usr_config)
+int xsk_umem__create_v0_0_4(struct xsk_umem **umem_ptr, void *umem_area,
+ __u64 size, struct xsk_ring_prod *fill,
+ struct xsk_ring_cons *comp,
+ const struct xsk_umem_config *usr_config)
{
struct xdp_mmap_offsets off;
struct xdp_umem_reg mr;
@@ -165,6 +168,7 @@ int xsk_umem__create(struct xsk_umem **umem_ptr, void *umem_area, __u64 size,
mr.len = size;
mr.chunk_size = umem->config.frame_size;
mr.headroom = umem->config.frame_headroom;
+ mr.flags = umem->config.flags;
err = setsockopt(umem->fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));
if (err) {
@@ -238,6 +242,29 @@ int xsk_umem__create(struct xsk_umem **umem_ptr, void *umem_area, __u64 size,
return err;
}
+struct xsk_umem_config_v1 {
+ __u32 fill_size;
+ __u32 comp_size;
+ __u32 frame_size;
+ __u32 frame_headroom;
+};
+
+int xsk_umem__create_v0_0_2(struct xsk_umem **umem_ptr, void *umem_area,
+ __u64 size, struct xsk_ring_prod *fill,
+ struct xsk_ring_cons *comp,
+ const struct xsk_umem_config *usr_config)
+{
+ struct xsk_umem_config config;
+
+ memcpy(&config, usr_config, sizeof(struct xsk_umem_config_v1));
+ config.flags = 0;
+
+ return xsk_umem__create_v0_0_4(umem_ptr, umem_area, size, fill, comp,
+ &config);
+}
+asm(".symver xsk_umem__create_v0_0_2, xsk_umem__create@LIBBPF_0.0.2");
+asm(".symver xsk_umem__create_v0_0_4, xsk_umem__create@@LIBBPF_0.0.4");
+
static int xsk_load_xdp_prog(struct xsk_socket *xsk)
{
static const int log_buf_size = 16 * 1024;
diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
index aa1d6122b7db..584f6820a639 100644
--- a/tools/lib/bpf/xsk.h
+++ b/tools/lib/bpf/xsk.h
@@ -168,6 +168,21 @@ static inline void *xsk_umem__get_data(void *umem_area, __u64 addr)
return &((char *)umem_area)[addr];
}
+static inline __u64 xsk_umem__extract_addr(__u64 addr)
+{
+ return addr & XSK_UNALIGNED_BUF_ADDR_MASK;
+}
+
+static inline __u64 xsk_umem__extract_offset(__u64 addr)
+{
+ return addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
+}
+
+static inline __u64 xsk_umem__add_offset_to_addr(__u64 addr)
+{
+ return xsk_umem__extract_addr(addr) + xsk_umem__extract_offset(addr);
+}
+
LIBBPF_API int xsk_umem__fd(const struct xsk_umem *umem);
LIBBPF_API int xsk_socket__fd(const struct xsk_socket *xsk);
@@ -176,12 +191,14 @@ LIBBPF_API int xsk_socket__fd(const struct xsk_socket *xsk);
#define XSK_UMEM__DEFAULT_FRAME_SHIFT 12 /* 4096 bytes */
#define XSK_UMEM__DEFAULT_FRAME_SIZE (1 << XSK_UMEM__DEFAULT_FRAME_SHIFT)
#define XSK_UMEM__DEFAULT_FRAME_HEADROOM 0
+#define XSK_UMEM__DEFAULT_FLAGS 0
struct xsk_umem_config {
__u32 fill_size;
__u32 comp_size;
__u32 frame_size;
__u32 frame_headroom;
+ __u32 flags;
};
/* Flags for the libbpf_flags field. */
@@ -201,6 +218,16 @@ LIBBPF_API int xsk_umem__create(struct xsk_umem **umem,
struct xsk_ring_prod *fill,
struct xsk_ring_cons *comp,
const struct xsk_umem_config *config);
+LIBBPF_API int xsk_umem__create_v0_0_2(struct xsk_umem **umem,
+ void *umem_area, __u64 size,
+ struct xsk_ring_prod *fill,
+ struct xsk_ring_cons *comp,
+ const struct xsk_umem_config *config);
+LIBBPF_API int xsk_umem__create_v0_0_4(struct xsk_umem **umem,
+ void *umem_area, __u64 size,
+ struct xsk_ring_prod *fill,
+ struct xsk_ring_cons *comp,
+ const struct xsk_umem_config *config);
LIBBPF_API int xsk_socket__create(struct xsk_socket **xsk,
const char *ifname, __u32 queue_id,
struct xsk_umem *umem,
--
2.17.1
^ permalink raw reply related
* [PATCH bpf-next v6 07/12] net/mlx5e: Allow XSK frames smaller than a page
From: Kevin Laatz @ 2019-08-27 2:25 UTC (permalink / raw)
To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
jonathan.lemon, saeedm, maximmi, stephen
Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan
In-Reply-To: <20190827022531.15060-1-kevin.laatz@intel.com>
From: Maxim Mikityanskiy <maximmi@mellanox.com>
Relax the requirements to the XSK frame size to allow it to be smaller
than a page and even not a power of two. The current implementation can
work in this mode, both with Striding RQ and without it.
The code that checks `mtu + headroom <= XSK frame size` is modified
accordingly. Any frame size between 2048 and PAGE_SIZE is accepted.
Functions that worked with pages only now work with XSK frames, even if
their size is different from PAGE_SIZE.
With XSK queues, regardless of the frame size, Striding RQ uses the
stride size of PAGE_SIZE, and UMR MTTs are posted using starting
addresses of frames, but PAGE_SIZE as page size. MTU guarantees that no
packet data will overlap with other frames. UMR MTT size is made equal
to the stride size of the RQ, because UMEM frames may come in random
order, and we need to handle them one by one. PAGE_SIZE is just a power
of two that is bigger than any allowed XSK frame size, and also it
doesn't require making additional changes to the code.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
---
.../ethernet/mellanox/mlx5/core/en/params.c | 23 +++++++++++++++----
.../ethernet/mellanox/mlx5/core/en/params.h | 2 ++
.../ethernet/mellanox/mlx5/core/en/xsk/rx.c | 2 +-
.../mellanox/mlx5/core/en/xsk/setup.c | 15 ++++++++----
4 files changed, 32 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
index 79301d116667..eb2e1f2138e4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
@@ -25,18 +25,33 @@ u16 mlx5e_get_linear_rq_headroom(struct mlx5e_params *params,
return headroom;
}
-u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params,
- struct mlx5e_xsk_param *xsk)
+u32 mlx5e_rx_get_min_frag_sz(struct mlx5e_params *params,
+ struct mlx5e_xsk_param *xsk)
{
u32 hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu);
u16 linear_rq_headroom = mlx5e_get_linear_rq_headroom(params, xsk);
- u32 frag_sz = linear_rq_headroom + hw_mtu;
+
+ return linear_rq_headroom + hw_mtu;
+}
+
+u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params,
+ struct mlx5e_xsk_param *xsk)
+{
+ u32 frag_sz = mlx5e_rx_get_min_frag_sz(params, xsk);
/* AF_XDP doesn't build SKBs in place. */
if (!xsk)
frag_sz = MLX5_SKB_FRAG_SZ(frag_sz);
- /* XDP in mlx5e doesn't support multiple packets per page. */
+ /* XDP in mlx5e doesn't support multiple packets per page. AF_XDP is a
+ * special case. It can run with frames smaller than a page, as it
+ * doesn't allocate pages dynamically. However, here we pretend that
+ * fragments are page-sized: it allows to treat XSK frames like pages
+ * by redirecting alloc and free operations to XSK rings and by using
+ * the fact there are no multiple packets per "page" (which is a frame).
+ * The latter is important, because frames may come in a random order,
+ * and we will have trouble assemblying a real page of multiple frames.
+ */
if (mlx5e_rx_is_xdp(params, xsk))
frag_sz = max_t(u32, frag_sz, PAGE_SIZE);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.h b/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
index 3a615d663d84..989d8f429438 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
@@ -76,6 +76,8 @@ static inline bool mlx5e_qid_validate(const struct mlx5e_profile *profile,
u16 mlx5e_get_linear_rq_headroom(struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk);
+u32 mlx5e_rx_get_min_frag_sz(struct mlx5e_params *params,
+ struct mlx5e_xsk_param *xsk);
u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk);
u8 mlx5e_mpwqe_log_pkts_per_wqe(struct mlx5e_params *params,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
index 7c49a66d28c9..475b6bd5d29b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
@@ -105,7 +105,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
/* head_offset is not used in this function, because di->xsk.data and
* di->addr point directly to the necessary place. Furthermore, in the
- * current implementation, one page = one packet = one frame, so
+ * current implementation, UMR pages are mapped to XSK frames, so
* head_offset should always be 0.
*/
WARN_ON_ONCE(head_offset);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c
index f701e4f3c076..d549f770cb4f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c
@@ -4,18 +4,23 @@
#include "setup.h"
#include "en/params.h"
+/* It matches XDP_UMEM_MIN_CHUNK_SIZE, but as this constant is private and may
+ * change unexpectedly, and mlx5e has a minimum valid stride size for striding
+ * RQ, keep this check in the driver.
+ */
+#define MLX5E_MIN_XSK_CHUNK_SIZE 2048
+
bool mlx5e_validate_xsk_param(struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk,
struct mlx5_core_dev *mdev)
{
- /* AF_XDP doesn't support frames larger than PAGE_SIZE, and the current
- * mlx5e XDP implementation doesn't support multiple packets per page.
- */
- if (xsk->chunk_size != PAGE_SIZE)
+ /* AF_XDP doesn't support frames larger than PAGE_SIZE. */
+ if (xsk->chunk_size > PAGE_SIZE ||
+ xsk->chunk_size < MLX5E_MIN_XSK_CHUNK_SIZE)
return false;
/* Current MTU and XSK headroom don't allow packets to fit the frames. */
- if (mlx5e_rx_get_linear_frag_sz(params, xsk) > xsk->chunk_size)
+ if (mlx5e_rx_get_min_frag_sz(params, xsk) > xsk->chunk_size)
return false;
/* frag_sz is different for regular and XSK RQs, so ensure that linear
--
2.17.1
^ permalink raw reply related
* [PATCH bpf-next v6 06/12] mlx5e: modify driver for handling offsets
From: Kevin Laatz @ 2019-08-27 2:25 UTC (permalink / raw)
To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
jonathan.lemon, saeedm, maximmi, stephen
Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190827022531.15060-1-kevin.laatz@intel.com>
With the addition of the unaligned chunks option, we need to make sure we
handle the offsets accordingly based on the mode we are currently running
in. This patch modifies the driver to appropriately mask the address for
each case.
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
---
v3:
- Use new helper function to handle offset
v4:
- fixed headroom addition to handle. Using xsk_umem_adjust_headroom()
now.
v5:
- Fixed typo: handle_offset -> adjust_offset
---
drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c | 8 ++++++--
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c | 3 ++-
2 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 1ed5c33e022f..f049e0ac308a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -122,6 +122,7 @@ bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
void *va, u16 *rx_headroom, u32 *len, bool xsk)
{
struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
+ struct xdp_umem *umem = rq->umem;
struct xdp_buff xdp;
u32 act;
int err;
@@ -138,8 +139,11 @@ bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
xdp.rxq = &rq->xdp_rxq;
act = bpf_prog_run_xdp(prog, &xdp);
- if (xsk)
- xdp.handle += xdp.data - xdp.data_hard_start;
+ if (xsk) {
+ u64 off = xdp.data - xdp.data_hard_start;
+
+ xdp.handle = xsk_umem_adjust_offset(umem, xdp.handle, off);
+ }
switch (act) {
case XDP_PASS:
*rx_headroom = xdp.data - xdp.data_hard_start;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
index 6a55573ec8f2..7c49a66d28c9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
@@ -24,7 +24,8 @@ int mlx5e_xsk_page_alloc_umem(struct mlx5e_rq *rq,
if (!xsk_umem_peek_addr_rq(umem, &handle))
return -ENOMEM;
- dma_info->xsk.handle = handle + rq->buff.umem_headroom;
+ dma_info->xsk.handle = xsk_umem_adjust_offset(umem, handle,
+ rq->buff.umem_headroom);
dma_info->xsk.data = xdp_umem_get_data(umem, dma_info->xsk.handle);
/* No need to add headroom to the DMA address. In striding RQ case, we
--
2.17.1
^ permalink raw reply related
* [PATCH bpf-next v6 05/12] ixgbe: modify driver for handling offsets
From: Kevin Laatz @ 2019-08-27 2:25 UTC (permalink / raw)
To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
jonathan.lemon, saeedm, maximmi, stephen
Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190827022531.15060-1-kevin.laatz@intel.com>
With the addition of the unaligned chunks option, we need to make sure we
handle the offsets accordingly based on the mode we are currently running
in. This patch modifies the driver to appropriately mask the address for
each case.
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
---
v3:
- Use new helper function to handle offset
---
drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index e1f743470ae9..17061c799f72 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -143,7 +143,9 @@ static int ixgbe_run_xdp_zc(struct ixgbe_adapter *adapter,
struct ixgbe_ring *rx_ring,
struct xdp_buff *xdp)
{
+ struct xdp_umem *umem = rx_ring->xsk_umem;
int err, result = IXGBE_XDP_PASS;
+ u64 offset = umem->headroom;
struct bpf_prog *xdp_prog;
struct xdp_frame *xdpf;
u32 act;
@@ -151,7 +153,10 @@ static int ixgbe_run_xdp_zc(struct ixgbe_adapter *adapter,
rcu_read_lock();
xdp_prog = READ_ONCE(rx_ring->xdp_prog);
act = bpf_prog_run_xdp(xdp_prog, xdp);
- xdp->handle += xdp->data - xdp->data_hard_start;
+ offset += xdp->data - xdp->data_hard_start;
+
+ xdp->handle = xsk_umem_adjust_offset(umem, xdp->handle, offset);
+
switch (act) {
case XDP_PASS:
break;
@@ -243,7 +248,7 @@ void ixgbe_zca_free(struct zero_copy_allocator *alloc, unsigned long handle)
bi->addr = xdp_umem_get_data(rx_ring->xsk_umem, handle);
bi->addr += hr;
- bi->handle = (u64)handle + rx_ring->xsk_umem->headroom;
+ bi->handle = (u64)handle;
}
static bool ixgbe_alloc_buffer_zc(struct ixgbe_ring *rx_ring,
@@ -269,7 +274,7 @@ static bool ixgbe_alloc_buffer_zc(struct ixgbe_ring *rx_ring,
bi->addr = xdp_umem_get_data(umem, handle);
bi->addr += hr;
- bi->handle = handle + umem->headroom;
+ bi->handle = handle;
xsk_umem_discard_addr(umem);
return true;
@@ -296,7 +301,7 @@ static bool ixgbe_alloc_buffer_slow_zc(struct ixgbe_ring *rx_ring,
bi->addr = xdp_umem_get_data(umem, handle);
bi->addr += hr;
- bi->handle = handle + umem->headroom;
+ bi->handle = handle;
xsk_umem_discard_addr_rq(umem);
return true;
--
2.17.1
^ permalink raw reply related
* [PATCH bpf-next v6 04/12] i40e: modify driver for handling offsets
From: Kevin Laatz @ 2019-08-27 2:25 UTC (permalink / raw)
To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
jonathan.lemon, saeedm, maximmi, stephen
Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190827022531.15060-1-kevin.laatz@intel.com>
With the addition of the unaligned chunks option, we need to make sure we
handle the offsets accordingly based on the mode we are currently running
in. This patch modifies the driver to appropriately mask the address for
each case.
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
---
v3:
- Use new helper function for handling the offset
---
drivers/net/ethernet/intel/i40e/i40e_xsk.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 2d6e82f72f48..eaca6162a6e6 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -190,7 +190,9 @@ int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
**/
static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, struct xdp_buff *xdp)
{
+ struct xdp_umem *umem = rx_ring->xsk_umem;
int err, result = I40E_XDP_PASS;
+ u64 offset = umem->headroom;
struct i40e_ring *xdp_ring;
struct bpf_prog *xdp_prog;
u32 act;
@@ -201,7 +203,10 @@ static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, struct xdp_buff *xdp)
*/
xdp_prog = READ_ONCE(rx_ring->xdp_prog);
act = bpf_prog_run_xdp(xdp_prog, xdp);
- xdp->handle += xdp->data - xdp->data_hard_start;
+ offset += xdp->data - xdp->data_hard_start;
+
+ xdp->handle = xsk_umem_adjust_offset(umem, xdp->handle, offset);
+
switch (act) {
case XDP_PASS:
break;
@@ -262,7 +267,7 @@ static bool i40e_alloc_buffer_zc(struct i40e_ring *rx_ring,
bi->addr = xdp_umem_get_data(umem, handle);
bi->addr += hr;
- bi->handle = handle + umem->headroom;
+ bi->handle = handle;
xsk_umem_discard_addr(umem);
return true;
@@ -299,7 +304,7 @@ static bool i40e_alloc_buffer_slow_zc(struct i40e_ring *rx_ring,
bi->addr = xdp_umem_get_data(umem, handle);
bi->addr += hr;
- bi->handle = handle + umem->headroom;
+ bi->handle = handle;
xsk_umem_discard_addr_rq(umem);
return true;
@@ -464,7 +469,7 @@ void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle)
bi->addr = xdp_umem_get_data(rx_ring->xsk_umem, handle);
bi->addr += hr;
- bi->handle = (u64)handle + rx_ring->xsk_umem->headroom;
+ bi->handle = (u64)handle;
}
/**
--
2.17.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox