Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] net: phy: allow scanning busses with missing phys
From: Florian Fainelli @ 2018-04-25  1:09 UTC (permalink / raw)
  To: Alexandre Belloni, Andrew Lunn
  Cc: David S . Miller, Allan Nielsen, Thomas Petazzoni, netdev,
	linux-kernel
In-Reply-To: <20180424160904.32457-1-alexandre.belloni@bootlin.com>

On 04/24/2018 09:09 AM, Alexandre Belloni wrote:
> Some MDIO busses will error out when trying to read a phy address with no
> phy present at that address. In that case, probing the bus will fail
> because __mdiobus_register() is scanning the bus for all possible phys
> addresses.
> 
> In case MII_PHYSID1 returns -EIO or -ENODEV, consider there is no phy at
> this address and set the phy ID to 0xffffffff which is then properly
> handled in get_phy_device().
> 
> Suggested-by: Andrew Lunn <andrew@lunn.ch>
> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian

^ permalink raw reply

* Re: [PATCH] net: phy: allow scanning busses with missing phys
From: Florian Fainelli @ 2018-04-25  1:08 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Alexandre Belloni, David S . Miller, Allan Nielsen,
	Thomas Petazzoni, netdev, linux-kernel
In-Reply-To: <20180424180153.GD2360@lunn.ch>

On 04/24/2018 11:01 AM, Andrew Lunn wrote:
> On Tue, Apr 24, 2018 at 09:37:09AM -0700, Florian Fainelli wrote:
>>
>>
>> On 04/24/2018 09:09 AM, Alexandre Belloni wrote:
>>> Some MDIO busses will error out when trying to read a phy address with no
>>> phy present at that address. In that case, probing the bus will fail
>>> because __mdiobus_register() is scanning the bus for all possible phys
>>> addresses.
>>>
>>> In case MII_PHYSID1 returns -EIO or -ENODEV, consider there is no phy at
>>> this address and set the phy ID to 0xffffffff which is then properly
>>> handled in get_phy_device().
>>
>> Humm, why not have your MDIO bus implementation do the scanning itself
>> in a reset() callback, which happens before probing the bus, and based
>> on the results, set phy_mask accordingly such that only PHYs present are
>> populated?
> 
> Hi Florian
> 
> Seems a bit odd have the driver do this, when the core could.

I could see arguments for doing it in either location, if this is done
in the driver, the driver is responsible for knowing what addresses will
respond, which does not seem to big of a stretch. If done by core, we
lift the driver from taking care of that. Either way is fine, really.

> 
>> My only concern with your change is that we are having a special
>> treatment for EIO and ENODEV, so we must make sure MDIO bus drivers are
>> all conforming to that.
>  
> I don't see how it could be a problem. It used to be any error was a
> real error, and would stop the bus from loading. It now means there is
> nothing there. The only possible side effect is an mdio bus driver
> might remain loaded without any devices if all reads return
> ENODEV/EIO, were as before it would probably never load. 

Right, though that does not seem to be a big problem.

> 
> 	    Andrew
> 


-- 
Florian

^ permalink raw reply

* Re: [RFC PATCH ghak32 V2 01/13] audit: add container id
From: Richard Guy Briggs @ 2018-04-25  0:40 UTC (permalink / raw)
  To: Paul Moore
  Cc: cgroups, containers, linux-api, Linux-Audit Mailing List,
	linux-fsdevel, LKML, netdev, ebiederm, luto, jlayton, carlos,
	dhowells, viro, simo, Eric Paris, serge
In-Reply-To: <CAHC9VhSZd7V9avx6K5g6CQ7mkj1T8ti7Nqq=OoWVwPznkesD1w@mail.gmail.com>

On 2018-04-24 15:01, Paul Moore wrote:
> On Mon, Apr 23, 2018 at 10:02 PM, Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2018-04-23 19:15, Paul Moore wrote:
> >> On Sat, Apr 21, 2018 at 10:34 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
> >> > On 2018-04-18 19:47, Paul Moore wrote:
> >> >> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
> >> >> > Implement the proc fs write to set the audit container ID of a process,
> >> >> > emitting an AUDIT_CONTAINER record to document the event.
> >> >> >
> >> >> > This is a write from the container orchestrator task to a proc entry of
> >> >> > the form /proc/PID/containerid where PID is the process ID of the newly
> >> >> > created task that is to become the first task in a container, or an
> >> >> > additional task added to a container.
> >> >> >
> >> >> > The write expects up to a u64 value (unset: 18446744073709551615).
> >> >> >
> >> >> > This will produce a record such as this:
> >> >> > type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0
> >> >> >
> >> >> > The "op" field indicates an initial set.  The "pid" to "ses" fields are
> >> >> > the orchestrator while the "opid" field is the object's PID, the process
> >> >> > being "contained".  Old and new container ID values are given in the
> >> >> > "contid" fields, while res indicates its success.
> >> >> >
> >> >> > It is not permitted to self-set, unset or re-set the container ID.  A
> >> >> > child inherits its parent's container ID, but then can be set only once
> >> >> > after.
> >> >> >
> >> >> > See: https://github.com/linux-audit/audit-kernel/issues/32
> >> >> >
> >> >> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> >> >> > ---
> >> >> >  fs/proc/base.c             | 37 ++++++++++++++++++++
> >> >> >  include/linux/audit.h      | 16 +++++++++
> >> >> >  include/linux/init_task.h  |  4 ++-
> >> >> >  include/linux/sched.h      |  1 +
> >> >> >  include/uapi/linux/audit.h |  2 ++
> >> >> >  kernel/auditsc.c           | 84 ++++++++++++++++++++++++++++++++++++++++++++++
> >> >> >  6 files changed, 143 insertions(+), 1 deletion(-)
> 
> ...
> 
> >> >> >  /* audit_rule_data supports filter rules with both integer and string
> >> >> >   * fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
> >> >> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> >> >> > index 4e0a4ac..29c8482 100644
> >> >> > --- a/kernel/auditsc.c
> >> >> > +++ b/kernel/auditsc.c
> >> >> > @@ -2073,6 +2073,90 @@ int audit_set_loginuid(kuid_t loginuid)
> >> >> >         return rc;
> >> >> >  }
> >> >> >
> >> >> > +static int audit_set_containerid_perm(struct task_struct *task, u64 containerid)
> >> >> > +{
> >> >> > +       struct task_struct *parent;
> >> >> > +       u64 pcontainerid, ccontainerid;
> >> >> > +
> >> >> > +       /* Don't allow to set our own containerid */
> >> >> > +       if (current == task)
> >> >> > +               return -EPERM;
> >> >>
> >> >> Why not?  Is there some obvious security concern that I missing?
> >> >
> >> > We then lose the distinction in the AUDIT_CONTAINER record between the
> >> > initiating PID and the target PID.  This was outlined in the proposal.
> >>
> >> I just went back and reread the v3 proposal and I still don't see a
> >> good explanation of this.  Why is this bad?  What's the security
> >> concern?
> >
> > I don't remember, specifically.  Maybe this has been addressed by the
> > check for children/threads or identical parent container ID.  So, I'm
> > reluctantly willing to remove that check for now.
> 
> Okay.  For the record, if someone can explain to me why this
> restriction saves us from some terrible situation I'm all for leaving
> it.  I'm just opposed to restrictions without solid reasoning behind
> them.
> 
> >> > Having said that, I'm still not sure we have protected sufficiently from
> >> > a child turning around and setting it's parent's as yet unset or
> >> > inherited audit container ID.
> >>
> >> Yes, I believe we only want to let a task set the audit container for
> >> it's children (or itself/threads if we decide to allow that, see
> >> above).  There *has* to be a function to check to see if a task if a
> >> child of a given task ... right? ... although this is likely to be a
> >> pointer traversal and locking nightmare ... hmmm.
> >
> > Isn't that just (struct task_struct)parent == (struct
> > task_struct)child->parent (or ->real_parent)?
> >
> > And now that I say that, it is covered by the following patch's child
> > check, so as long as we keep that, we should be fine.
> 
> I was thinking of checking not just current's immediate children, but
> any of it's descendants as I believe that is what we want to limit,
> yes?  I just worry that it isn't really practical to perform that
> check.

The child check I'm talking about prevents setting a task's audit
container ID if it *has* any children or threads, so if it has children
it is automatically disqualified and its grandchildren are irrelevant.

> >> >> I ask because I suppose it might be possible for some container
> >> >> runtime to do a fork, setup some of the environment and them exec the
> >> >> container (before you answer the obvious "namespaces!" please remember
> >> >> we're not trying to define containers).
> >> >
> >> > I don't think namespaces have any bearing on this concern since none are
> >> > required.
> >> >
> >> >> > +       /* Don't allow the containerid to be unset */
> >> >> > +       if (!cid_valid(containerid))
> >> >> > +               return -EINVAL;
> >> >> > +       /* if we don't have caps, reject */
> >> >> > +       if (!capable(CAP_AUDIT_CONTROL))
> >> >> > +               return -EPERM;
> >> >> > +       /* if containerid is unset, allow */
> >> >> > +       if (!audit_containerid_set(task))
> >> >> > +               return 0;
> >> >> > +       /* it is already set, and not inherited from the parent, reject */
> >> >> > +       ccontainerid = audit_get_containerid(task);
> >> >> > +       rcu_read_lock();
> >> >> > +       parent = rcu_dereference(task->real_parent);
> >> >> > +       rcu_read_unlock();
> >> >> > +       task_lock(parent);
> >> >> > +       pcontainerid = audit_get_containerid(parent);
> >> >> > +       task_unlock(parent);
> >> >> > +       if (ccontainerid != pcontainerid)
> >> >> > +               return -EPERM;
> >> >> > +       return 0;
> 
> I'm looking at the parent checks again and I wonder if the logic above
> is what we really want.  Maybe it is, but I'm not sure.
> 
> Things I'm wondering about:
> 
> * "ccontainerid" and "containerid" are too close in name, I kept
> confusing myself when looking at this code.  Please change one.  Bonus
> points if it is shorter.

Would c_containerid and p_containerid be ok?  child_cid and parent_cid?
I'd really like it to have the same root as the parameter handed in so
teh code is easier to follow.  It would be nice to have that across
caller to local, but that's challenging.

I've been tempted to use contid or even cid everywhere instead of
containerid.  Perhaps the longer name doesn't bother me because I
like its uniqueness and I learned touch-typing in grade 9 and I like
100+ character wide terminals?  ;-)

> * What if the orchestrator wants to move the task to a new container?
> Right now it looks like you can only do that once, then then the
> task's audit container ID will no longer be the same as real_parent

A task's audit container ID can be unset or inherited, and then set
only once.  After that, if you want it moved to a new container you
can't and your only option is to spawn another peer to that task or a
child of it and set that new task's audit container ID.

Currently, the method of detecting if its audit container ID has been
set (rather than inherited) was to check its parent's audit container
ID.  The only reason to change this might be if the audit container ID
were not inheritable, but then we lose the accountability of a task
spawning another process and being able to leave its child's audit
container ID unset and unaccountable to any existing container.  I think
the relationship to the parent is crucial, and if something wants to
change audit container ID it can, by spawning childrent and leaving a
trail of container IDs in its parent processes.  (So what if a parent
dies?)

> ... or does the orchestrator change that?  *Can* the orchestrator
> change real_parent (I suspect the answer is "no")?

I don't think the orchestrator is able to change real_parent.  I've
forgotten why there is a ->parent and ->real_parent and how they can
change.  One is for the wait signal.  I don't remember the purpose of
the other.

If the parent dies before the child, the child will be re-parented on
its grandparent if the parent doesn't hang around zombified, if I
understand correctly.  If anything, a parent dying would likely further
restrict the ability to set a task's audit container ID because a parent
with an identical ID could vanish.

> * I think the key is the relationship between current and task, not
> between task and task->real_parent.  I believe what we really care
> about is that task is a descendant of current.  We might also want to
> allow current to change the audit container ID if it holds
> CAP_AUDIT_CONTROL, regardless of it's relationship with task.

Currently, a process with CAP_AUDIT_CONTROL can set the audit container
ID of any task that hasn't got children or threads, isn't itself, and
its audit container ID is inherited or unset.  This was to try to
prevent games with parents and children scratching each other's backs.

I would feel more comfortable if only descendants were settable, so
adding that restriction sounds like a good idea to me other than the
tree-climbing excercise and overhead involved.

> >> >> > +static void audit_log_set_containerid(struct task_struct *task, u64 oldcontainerid,
> >> >> > +                                     u64 containerid, int rc)
> >> >> > +{
> >> >> > +       struct audit_buffer *ab;
> >> >> > +       uid_t uid;
> >> >> > +       struct tty_struct *tty;
> >> >> > +
> >> >> > +       if (!audit_enabled)
> >> >> > +               return;
> >> >> > +
> >> >> > +       ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_CONTAINER);
> >> >> > +       if (!ab)
> >> >> > +               return;
> >> >> > +
> >> >> > +       uid = from_kuid(&init_user_ns, task_uid(current));
> >> >> > +       tty = audit_get_tty(current);
> >> >> > +
> >> >> > +       audit_log_format(ab, "op=set pid=%d uid=%u", task_tgid_nr(current), uid);
> >> >> > +       audit_log_task_context(ab);
> >> >> > +       audit_log_format(ab, " auid=%u tty=%s ses=%u opid=%d old-contid=%llu contid=%llu res=%d",
> >> >> > +                        from_kuid(&init_user_ns, audit_get_loginuid(current)),
> >> >> > +                        tty ? tty_name(tty) : "(none)", audit_get_sessionid(current),
> >> >> > +                        task_tgid_nr(task), oldcontainerid, containerid, !rc);
> >> >> > +
> >> >> > +       audit_put_tty(tty);
> >> >> > +       audit_log_end(ab);
> >> >> > +}
> >> >> > +
> >> >> > +/**
> >> >> > + * audit_set_containerid - set current task's audit_context containerid
> >> >> > + * @containerid: containerid value
> >> >> > + *
> >> >> > + * Returns 0 on success, -EPERM on permission failure.
> >> >> > + *
> >> >> > + * Called (set) from fs/proc/base.c::proc_containerid_write().
> >> >> > + */
> >> >> > +int audit_set_containerid(struct task_struct *task, u64 containerid)
> >> >> > +{
> >> >> > +       u64 oldcontainerid;
> >> >> > +       int rc;
> >> >> > +
> >> >> > +       oldcontainerid = audit_get_containerid(task);
> >> >> > +
> >> >> > +       rc = audit_set_containerid_perm(task, containerid);
> >> >> > +       if (!rc) {
> >> >> > +               task_lock(task);
> >> >> > +               task->containerid = containerid;
> >> >> > +               task_unlock(task);
> >> >> > +       }
> >> >> > +
> >> >> > +       audit_log_set_containerid(task, oldcontainerid, containerid, rc);
> >> >> > +       return rc;
> >> >>
> >> >> Why are audit_set_containerid_perm() and audit_log_containerid()
> >> >> separate functions?
> >> >
> >> > (I assume you mean audit_log_set_containerid()?)
> >>
> >> Yep.  My fingers got tired typing in that function name and decided a
> >> shortcut was necessary.
> >>
> >> > It seemed clearer that all the permission checking was in one function
> >> > and its return code could be used to report the outcome when logging the
> >> > (attempted) action.  This is the same structure as audit_set_loginuid()
> >> > and it made sense.
> >>
> >> When possible I really like it when the permission checks are in the
> >> same function as the code which does the work; it's less likely to get
> >> abused that way (you have to willfully bypass the access checks).  The
> >> exceptions might be if you wanted to reuse the access control code, or
> >> insert a modular access mechanism (e.g. LSMs).
> >
> > I don't follow how it could be abused.  The return code from the perm
> > check gates setting the value and is used in the success field in the
> > log.
> 
> If the permission checks are in the same function body as the code
> which does the work you have to either split the function, or rewrite
> it, if you want to bypass the permission checks.  It may be more of a
> style issue than an actual safety issue, but the comments about
> single-use functions in the same scope is the tie breaker.

Perhaps I'm just being quite dense, but I just don't follow what the
problem is and how you suggest fixing it.  A bunch of gotos to a label
such as "out:" to log the refused action?  That seems messy and
unstructured.

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

^ permalink raw reply

* Re: [PATCH net-next v2 0/7] net: Extend availability of PHY statistics
From: Florian Fainelli @ 2018-04-25  0:35 UTC (permalink / raw)
  To: netdev; +Cc: andrew, vivien.didelot, cphealy, davem, nikita.yoush
In-Reply-To: <20180425002748.27330-1-f.fainelli@gmail.com>

On 04/24/2018 05:27 PM, Florian Fainelli wrote:
> Hi all,
> 
> This patch series adds support for retrieving PHY statistics with DSA switches
> when the CPU port uses a PHY to PHY connection (as opposed to MAC to MAC).
> To get there a number of things are done:
> 
> - first we move the code dealing with PHY statistics outside of net/core/ethtool.c
>   and create helper functions since the same code will be reused
> - then we allow network device drivers to provide an ethtool_get_phy_stats callback
>   when the standard PHY library helpers are not suitable
> - we update the DSA functions dealing with ethtool operations to get passed a
>   stringset instead of assuming ETH_SS_STATS like they currently do
> - then we provide a set of standard helpers within DSA as a framework and add
>   the plumbing to allow retrieving the PHY statistics of the CPU port(s)
> - finally plug support for retrieving such PHY statistics with the b53 driver
> 
> Changes in v2:
> 
> - got actual testing when the DSA master network device has a PHY that
>   already provides statistics (thanks Nikita!)
> 
> - fixed the kbuild error reported when CONFIG_PHYLIB=n
> 
> - removed the checking of ops which is redundant and not needed

David, sorry looks like there will be a v3, the last patch introduces a
possible problem with bcm_sf2 which uses b53_common as a library, will
respin later.
-- 
Florian

^ permalink raw reply

* [PATCH net-next v2 7/7] net: dsa: b53: Add support for reading PHY statistics
From: Florian Fainelli @ 2018-04-25  0:27 UTC (permalink / raw)
  To: netdev
  Cc: Florian Fainelli, andrew, vivien.didelot, cphealy, davem,
	nikita.yoush
In-Reply-To: <20180425002748.27330-1-f.fainelli@gmail.com>

Allow the b53 driver to return PHY statistics when the CPU port used is
different than 5, 7 or 8, because those are typically PHY-less on most
devices. This is useful for debugging link problems between the switch
and an external host when using a non standard CPU port number (e.g: 4).

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/dsa/b53/b53_common.c | 76 ++++++++++++++++++++++++++++++++++------
 drivers/net/dsa/b53/b53_priv.h   |  1 +
 drivers/net/dsa/bcm_sf2.c        |  1 +
 3 files changed, 68 insertions(+), 10 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 726b2d8c6fe9..7faea467aca8 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -806,20 +806,39 @@ static unsigned int b53_get_mib_size(struct b53_device *dev)
 		return B53_MIBS_SIZE;
 }
 
+static struct phy_device *b53_get_phy_device(struct dsa_switch *ds, int port)
+{
+	/* These ports typically do not have built-in PHYs */
+	switch (port) {
+	case B53_CPU_PORT_25:
+	case 7:
+	case B53_CPU_PORT:
+		return NULL;
+	}
+
+	return mdiobus_get_phy(ds->slave_mii_bus, port);
+}
+
 void b53_get_strings(struct dsa_switch *ds, int port, u32 stringset,
 		     uint8_t *data)
 {
 	struct b53_device *dev = ds->priv;
 	const struct b53_mib_desc *mibs = b53_get_mib(dev);
 	unsigned int mib_size = b53_get_mib_size(dev);
+	struct phy_device *phydev;
 	unsigned int i;
 
-	if (stringset != ETH_SS_STATS)
-		return;
+	if (stringset == ETH_SS_STATS) {
+		for (i = 0; i < mib_size; i++)
+			strlcpy(data + i * ETH_GSTRING_LEN,
+				mibs[i].name, ETH_GSTRING_LEN);
+	} else if (stringset == ETH_SS_PHY_STATS) {
+		phydev = b53_get_phy_device(ds, port);
+		if (!phydev)
+			return;
 
-	for (i = 0; i < mib_size; i++)
-		strlcpy(data + i * ETH_GSTRING_LEN,
-			mibs[i].name, ETH_GSTRING_LEN);
+		phy_ethtool_get_strings(phydev, data);
+	}
 }
 EXPORT_SYMBOL(b53_get_strings);
 
@@ -856,14 +875,34 @@ void b53_get_ethtool_stats(struct dsa_switch *ds, int port, uint64_t *data)
 }
 EXPORT_SYMBOL(b53_get_ethtool_stats);
 
+void b53_get_ethtool_phy_stats(struct dsa_switch *ds, int port, uint64_t *data)
+{
+	struct phy_device *phydev;
+
+	phydev = b53_get_phy_device(ds, port);
+	if (!phydev)
+		return;
+
+	phy_ethtool_get_stats(phydev, NULL, data);
+}
+EXPORT_SYMBOL(b53_get_ethtool_phy_stats);
+
 int b53_get_sset_count(struct dsa_switch *ds, int port, int sset)
 {
 	struct b53_device *dev = ds->priv;
+	struct phy_device *phydev;
 
-	if (sset != ETH_SS_STATS)
-		return 0;
+	if (sset == ETH_SS_STATS) {
+		return b53_get_mib_size(dev);
+	} else if (sset == ETH_SS_PHY_STATS) {
+		phydev = b53_get_phy_device(ds, port);
+		if (!phydev)
+			return 0;
 
-	return b53_get_mib_size(dev);
+		return phy_ethtool_get_sset_count(phydev);
+	}
+
+	return 0;
 }
 EXPORT_SYMBOL(b53_get_sset_count);
 
@@ -1484,7 +1523,7 @@ void b53_br_fast_age(struct dsa_switch *ds, int port)
 }
 EXPORT_SYMBOL(b53_br_fast_age);
 
-static bool b53_can_enable_brcm_tags(struct dsa_switch *ds, int port)
+static bool b53_possible_cpu_port(struct dsa_switch *ds, int port)
 {
 	/* Broadcom switches will accept enabling Broadcom tags on the
 	 * following ports: 5, 7 and 8, any other port is not supported
@@ -1496,10 +1535,19 @@ static bool b53_can_enable_brcm_tags(struct dsa_switch *ds, int port)
 		return true;
 	}
 
-	dev_warn(ds->dev, "Port %d is not Broadcom tag capable\n", port);
 	return false;
 }
 
+static bool b53_can_enable_brcm_tags(struct dsa_switch *ds, int port)
+{
+	bool ret = b53_possible_cpu_port(ds, port);
+
+	if (!ret)
+		dev_warn(ds->dev, "Port %d is not Broadcom tag capable\n",
+			 port);
+	return ret;
+}
+
 enum dsa_tag_protocol b53_get_tag_protocol(struct dsa_switch *ds, int port)
 {
 	struct b53_device *dev = ds->priv;
@@ -1657,6 +1705,7 @@ static const struct dsa_switch_ops b53_switch_ops = {
 	.get_strings		= b53_get_strings,
 	.get_ethtool_stats	= b53_get_ethtool_stats,
 	.get_sset_count		= b53_get_sset_count,
+	.get_ethtool_phy_stats	= b53_get_ethtool_phy_stats,
 	.phy_read		= b53_phy_read16,
 	.phy_write		= b53_phy_write16,
 	.adjust_link		= b53_adjust_link,
@@ -1961,6 +2010,13 @@ static int b53_switch_init(struct b53_device *dev)
 	dev->num_ports = dev->cpu_port + 1;
 	dev->enabled_ports |= BIT(dev->cpu_port);
 
+	/* Include non standard CPU port built-in PHYs to be probed */
+	for (i = 0; i < dev->num_ports; i++) {
+		if (!(dev->ds->phys_mii_mask & BIT(i)) &&
+		    !b53_possible_cpu_port(dev->ds, i))
+			dev->ds->phys_mii_mask |= BIT(i);
+	}
+
 	dev->ports = devm_kzalloc(dev->dev,
 				  sizeof(struct b53_port) * dev->num_ports,
 				  GFP_KERNEL);
diff --git a/drivers/net/dsa/b53/b53_priv.h b/drivers/net/dsa/b53/b53_priv.h
index b933d5cb5c2d..cc284a514de9 100644
--- a/drivers/net/dsa/b53/b53_priv.h
+++ b/drivers/net/dsa/b53/b53_priv.h
@@ -290,6 +290,7 @@ void b53_get_strings(struct dsa_switch *ds, int port, u32 stringset,
 		     uint8_t *data);
 void b53_get_ethtool_stats(struct dsa_switch *ds, int port, uint64_t *data);
 int b53_get_sset_count(struct dsa_switch *ds, int port, int sset);
+void b53_get_ethtool_phy_stats(struct dsa_switch *ds, int port, uint64_t *data);
 int b53_br_join(struct dsa_switch *ds, int port, struct net_device *bridge);
 void b53_br_leave(struct dsa_switch *ds, int port, struct net_device *bridge);
 void b53_br_set_stp_state(struct dsa_switch *ds, int port, u8 state);
diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index 0378eded31f2..97236cfcbae4 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -859,6 +859,7 @@ static const struct dsa_switch_ops bcm_sf2_ops = {
 	.get_strings		= b53_get_strings,
 	.get_ethtool_stats	= b53_get_ethtool_stats,
 	.get_sset_count		= b53_get_sset_count,
+	.get_ethtool_phy_stats	= b53_get_ethtool_phy_stats,
 	.get_phy_flags		= bcm_sf2_sw_get_phy_flags,
 	.adjust_link		= bcm_sf2_sw_adjust_link,
 	.fixed_link_update	= bcm_sf2_sw_fixed_link_update,
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next v2 6/7] net: dsa: Allow providing PHY statistics from CPU port
From: Florian Fainelli @ 2018-04-25  0:27 UTC (permalink / raw)
  To: netdev
  Cc: Florian Fainelli, andrew, vivien.didelot, cphealy, davem,
	nikita.yoush
In-Reply-To: <20180425002748.27330-1-f.fainelli@gmail.com>

Implement the same type of ethtool diversion that we have for
ETH_SS_STATS and make it work with ETH_SS_PHY_STATS. This allows
providing PHY level statistics for CPU ports that are directly
connecting to a PHY device.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 include/net/dsa.h |  7 +++++++
 net/dsa/master.c  | 47 ++++++++++++++++++++++++++++++++++++++++-----
 net/dsa/port.c    | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 106 insertions(+), 5 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 0bc0aad1b02e..462e9741b210 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -361,6 +361,8 @@ struct dsa_switch_ops {
 	void	(*get_ethtool_stats)(struct dsa_switch *ds,
 				     int port, uint64_t *data);
 	int	(*get_sset_count)(struct dsa_switch *ds, int port, int sset);
+	void	(*get_ethtool_phy_stats)(struct dsa_switch *ds,
+					 int port, uint64_t *data);
 
 	/*
 	 * ethtool Wake-on-LAN
@@ -589,4 +591,9 @@ static inline int call_dsa_notifiers(unsigned long val, struct net_device *dev,
 #define BRCM_TAG_GET_PORT(v)		((v) >> 8)
 #define BRCM_TAG_GET_QUEUE(v)		((v) & 0xff)
 
+
+int dsa_port_get_phy_strings(struct dsa_port *dp, uint8_t *data);
+int dsa_port_get_ethtool_phy_stats(struct dsa_port *dp, uint64_t *data);
+int dsa_port_get_phy_sset_count(struct dsa_port *dp);
+
 #endif
diff --git a/net/dsa/master.c b/net/dsa/master.c
index 8d27687fd0ca..c90ee3227dea 100644
--- a/net/dsa/master.c
+++ b/net/dsa/master.c
@@ -31,6 +31,32 @@ static void dsa_master_get_ethtool_stats(struct net_device *dev,
 		ds->ops->get_ethtool_stats(ds, port, data + count);
 }
 
+static void dsa_master_get_ethtool_phy_stats(struct net_device *dev,
+					     struct ethtool_stats *stats,
+					     uint64_t *data)
+{
+	struct dsa_port *cpu_dp = dev->dsa_ptr;
+	const struct ethtool_ops *ops = cpu_dp->orig_ethtool_ops;
+	struct dsa_switch *ds = cpu_dp->ds;
+	int port = cpu_dp->index;
+	int count = 0;
+
+	if (dev->phydev && !ops->get_ethtool_phy_stats) {
+		count = phy_ethtool_get_sset_count(dev->phydev);
+		if (count >= 0)
+			phy_ethtool_get_stats(dev->phydev, stats, data);
+	} else if (ops->get_sset_count && ops->get_ethtool_phy_stats) {
+		count = ops->get_sset_count(dev, ETH_SS_PHY_STATS);
+		ops->get_ethtool_phy_stats(dev, stats, data);
+	}
+
+	if (count < 0)
+		count = 0;
+
+	if (ds->ops->get_ethtool_phy_stats)
+		ds->ops->get_ethtool_phy_stats(ds, port, data + count);
+}
+
 static int dsa_master_get_sset_count(struct net_device *dev, int sset)
 {
 	struct dsa_port *cpu_dp = dev->dsa_ptr;
@@ -38,11 +64,14 @@ static int dsa_master_get_sset_count(struct net_device *dev, int sset)
 	struct dsa_switch *ds = cpu_dp->ds;
 	int count = 0;
 
-	if (ops->get_sset_count) {
+	if (sset == ETH_SS_PHY_STATS && dev->phydev &&
+	    !ops->get_ethtool_phy_stats)
+		count = phy_ethtool_get_sset_count(dev->phydev);
+	else if (ops->get_sset_count)
 		count = ops->get_sset_count(dev, sset);
-		if (count < 0)
-			count = 0;
-	}
+
+	if (count < 0)
+		count = 0;
 
 	if (ds->ops->get_sset_count)
 		count += ds->ops->get_sset_count(ds, cpu_dp->index, sset);
@@ -67,7 +96,14 @@ static void dsa_master_get_strings(struct net_device *dev, uint32_t stringset,
 	/* We do not want to be NULL-terminated, since this is a prefix */
 	pfx[sizeof(pfx) - 1] = '_';
 
-	if (ops->get_sset_count && ops->get_strings) {
+	if (stringset == ETH_SS_PHY_STATS && dev->phydev &&
+	    !ops->get_ethtool_phy_stats) {
+		mcount = phy_ethtool_get_sset_count(dev->phydev);
+		if (mcount < 0)
+			mcount = 0;
+		else
+			phy_ethtool_get_strings(dev->phydev, data);
+	} else if (ops->get_sset_count && ops->get_strings) {
 		mcount = ops->get_sset_count(dev, stringset);
 		if (mcount < 0)
 			mcount = 0;
@@ -107,6 +143,7 @@ static int dsa_master_ethtool_setup(struct net_device *dev)
 	ops->get_sset_count = dsa_master_get_sset_count;
 	ops->get_ethtool_stats = dsa_master_get_ethtool_stats;
 	ops->get_strings = dsa_master_get_strings;
+	ops->get_ethtool_phy_stats = dsa_master_get_ethtool_phy_stats;
 
 	dev->ethtool_ops = ops;
 
diff --git a/net/dsa/port.c b/net/dsa/port.c
index 5e2a88720a9a..2413beb995be 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -383,3 +383,60 @@ void dsa_port_link_unregister_of(struct dsa_port *dp)
 	else
 		dsa_port_setup_phy_of(dp, false);
 }
+
+int dsa_port_get_phy_strings(struct dsa_port *dp, uint8_t *data)
+{
+	struct phy_device *phydev;
+	int ret = -EOPNOTSUPP;
+
+	if (of_phy_is_fixed_link(dp->dn))
+		return ret;
+
+	phydev = dsa_port_get_phy_device(dp);
+	if (IS_ERR_OR_NULL(phydev))
+		return ret;
+
+	ret = phy_ethtool_get_strings(phydev, data);
+	put_device(&phydev->mdio.dev);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(dsa_port_get_phy_strings);
+
+int dsa_port_get_ethtool_phy_stats(struct dsa_port *dp, uint64_t *data)
+{
+	struct phy_device *phydev;
+	int ret = -EOPNOTSUPP;
+
+	if (of_phy_is_fixed_link(dp->dn))
+		return ret;
+
+	phydev = dsa_port_get_phy_device(dp);
+	if (IS_ERR_OR_NULL(phydev))
+		return ret;
+
+	ret = phy_ethtool_get_stats(phydev, NULL, data);
+	put_device(&phydev->mdio.dev);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(dsa_port_get_ethtool_phy_stats);
+
+int dsa_port_get_phy_sset_count(struct dsa_port *dp)
+{
+	struct phy_device *phydev;
+	int ret = -EOPNOTSUPP;
+
+	if (of_phy_is_fixed_link(dp->dn))
+		return ret;
+
+	phydev = dsa_port_get_phy_device(dp);
+	if (IS_ERR_OR_NULL(phydev))
+		return ret;
+
+	ret = phy_ethtool_get_sset_count(phydev);
+	put_device(&phydev->mdio.dev);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(dsa_port_get_phy_sset_count);
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next v2 5/7] net: dsa: Add helper function to obtain PHY device of a given port
From: Florian Fainelli @ 2018-04-25  0:27 UTC (permalink / raw)
  To: netdev
  Cc: Florian Fainelli, andrew, vivien.didelot, cphealy, davem,
	nikita.yoush
In-Reply-To: <20180425002748.27330-1-f.fainelli@gmail.com>

In preparation for having more call sites attempting to obtain a
reference against a PHY device corresponding to a particular port,
introduce a helper function for that purpose.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 net/dsa/port.c | 33 ++++++++++++++++++++++-----------
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/net/dsa/port.c b/net/dsa/port.c
index 7acc1169d75e..5e2a88720a9a 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -273,25 +273,38 @@ int dsa_port_vlan_del(struct dsa_port *dp,
 	return 0;
 }
 
-static int dsa_port_setup_phy_of(struct dsa_port *dp, bool enable)
+static struct phy_device *dsa_port_get_phy_device(struct dsa_port *dp)
 {
-	struct device_node *port_dn = dp->dn;
 	struct device_node *phy_dn;
-	struct dsa_switch *ds = dp->ds;
 	struct phy_device *phydev;
-	int port = dp->index;
-	int err = 0;
 
-	phy_dn = of_parse_phandle(port_dn, "phy-handle", 0);
+	phy_dn = of_parse_phandle(dp->dn, "phy-handle", 0);
 	if (!phy_dn)
-		return 0;
+		return NULL;
 
 	phydev = of_phy_find_device(phy_dn);
 	if (!phydev) {
-		err = -EPROBE_DEFER;
-		goto err_put_of;
+		of_node_put(phy_dn);
+		return ERR_PTR(-EPROBE_DEFER);
 	}
 
+	return phydev;
+}
+
+static int dsa_port_setup_phy_of(struct dsa_port *dp, bool enable)
+{
+	struct dsa_switch *ds = dp->ds;
+	struct phy_device *phydev;
+	int port = dp->index;
+	int err = 0;
+
+	phydev = dsa_port_get_phy_device(dp);
+	if (!phydev)
+		return 0;
+
+	if (IS_ERR(phydev))
+		return PTR_ERR(phydev);
+
 	if (enable) {
 		err = genphy_config_init(phydev);
 		if (err < 0)
@@ -317,8 +330,6 @@ static int dsa_port_setup_phy_of(struct dsa_port *dp, bool enable)
 
 err_put_dev:
 	put_device(&phydev->mdio.dev);
-err_put_of:
-	of_node_put(phy_dn);
 	return err;
 }
 
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next v2 4/7] net: dsa: Pass stringset to ethtool operations
From: Florian Fainelli @ 2018-04-25  0:27 UTC (permalink / raw)
  To: netdev
  Cc: Florian Fainelli, andrew, vivien.didelot, cphealy, davem,
	nikita.yoush
In-Reply-To: <20180425002748.27330-1-f.fainelli@gmail.com>

Up until now we largely assumed that we were interested in ETH_SS_STATS
type of strings for all ethtool operations, this is about to change with
the introduction of additional string sets, e.g: ETH_SS_PHY_STATS.
Update all functions to take an appropriate stringset argument and act
on it when it is different than ETH_SS_STATS for now.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/dsa/b53/b53_common.c       | 11 +++++++++--
 drivers/net/dsa/b53/b53_priv.h         |  5 +++--
 drivers/net/dsa/dsa_loop.c             | 11 +++++++++--
 drivers/net/dsa/lan9303-core.c         | 11 +++++++++--
 drivers/net/dsa/microchip/ksz_common.c | 11 +++++++++--
 drivers/net/dsa/mt7530.c               | 11 +++++++++--
 drivers/net/dsa/mv88e6xxx/chip.c       | 10 ++++++++--
 drivers/net/dsa/qca8k.c                | 10 ++++++++--
 include/net/dsa.h                      |  5 +++--
 net/dsa/master.c                       | 21 +++++++++++++--------
 net/dsa/slave.c                        |  5 +++--
 11 files changed, 83 insertions(+), 28 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 78616787f2a3..726b2d8c6fe9 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -806,13 +806,17 @@ static unsigned int b53_get_mib_size(struct b53_device *dev)
 		return B53_MIBS_SIZE;
 }
 
-void b53_get_strings(struct dsa_switch *ds, int port, uint8_t *data)
+void b53_get_strings(struct dsa_switch *ds, int port, u32 stringset,
+		     uint8_t *data)
 {
 	struct b53_device *dev = ds->priv;
 	const struct b53_mib_desc *mibs = b53_get_mib(dev);
 	unsigned int mib_size = b53_get_mib_size(dev);
 	unsigned int i;
 
+	if (stringset != ETH_SS_STATS)
+		return;
+
 	for (i = 0; i < mib_size; i++)
 		strlcpy(data + i * ETH_GSTRING_LEN,
 			mibs[i].name, ETH_GSTRING_LEN);
@@ -852,10 +856,13 @@ void b53_get_ethtool_stats(struct dsa_switch *ds, int port, uint64_t *data)
 }
 EXPORT_SYMBOL(b53_get_ethtool_stats);
 
-int b53_get_sset_count(struct dsa_switch *ds, int port)
+int b53_get_sset_count(struct dsa_switch *ds, int port, int sset)
 {
 	struct b53_device *dev = ds->priv;
 
+	if (sset != ETH_SS_STATS)
+		return 0;
+
 	return b53_get_mib_size(dev);
 }
 EXPORT_SYMBOL(b53_get_sset_count);
diff --git a/drivers/net/dsa/b53/b53_priv.h b/drivers/net/dsa/b53/b53_priv.h
index 1187ebd79287..b933d5cb5c2d 100644
--- a/drivers/net/dsa/b53/b53_priv.h
+++ b/drivers/net/dsa/b53/b53_priv.h
@@ -286,9 +286,10 @@ static inline int b53_switch_get_reset_gpio(struct b53_device *dev)
 /* Exported functions towards other drivers */
 void b53_imp_vlan_setup(struct dsa_switch *ds, int cpu_port);
 int b53_configure_vlan(struct dsa_switch *ds);
-void b53_get_strings(struct dsa_switch *ds, int port, uint8_t *data);
+void b53_get_strings(struct dsa_switch *ds, int port, u32 stringset,
+		     uint8_t *data);
 void b53_get_ethtool_stats(struct dsa_switch *ds, int port, uint64_t *data);
-int b53_get_sset_count(struct dsa_switch *ds, int port);
+int b53_get_sset_count(struct dsa_switch *ds, int port, int sset);
 int b53_br_join(struct dsa_switch *ds, int port, struct net_device *bridge);
 void b53_br_leave(struct dsa_switch *ds, int port, struct net_device *bridge);
 void b53_br_set_stp_state(struct dsa_switch *ds, int port, u8 state);
diff --git a/drivers/net/dsa/dsa_loop.c b/drivers/net/dsa/dsa_loop.c
index f77be9f85cb3..9354cc08d3fd 100644
--- a/drivers/net/dsa/dsa_loop.c
+++ b/drivers/net/dsa/dsa_loop.c
@@ -86,16 +86,23 @@ static int dsa_loop_setup(struct dsa_switch *ds)
 	return 0;
 }
 
-static int dsa_loop_get_sset_count(struct dsa_switch *ds, int port)
+static int dsa_loop_get_sset_count(struct dsa_switch *ds, int port, int sset)
 {
+	if (sset != ETH_SS_STATS)
+		return 0;
+
 	return __DSA_LOOP_CNT_MAX;
 }
 
-static void dsa_loop_get_strings(struct dsa_switch *ds, int port, uint8_t *data)
+static void dsa_loop_get_strings(struct dsa_switch *ds, int port,
+				 u32 stringset, uint8_t *data)
 {
 	struct dsa_loop_priv *ps = ds->priv;
 	unsigned int i;
 
+	if (stringset != ETH_SS_STATS)
+		return;
+
 	for (i = 0; i < __DSA_LOOP_CNT_MAX; i++)
 		memcpy(data + i * ETH_GSTRING_LEN,
 		       ps->ports[port].mib[i].name, ETH_GSTRING_LEN);
diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c
index fefa454f3e56..b4f6e1a67dd9 100644
--- a/drivers/net/dsa/lan9303-core.c
+++ b/drivers/net/dsa/lan9303-core.c
@@ -977,10 +977,14 @@ static const struct lan9303_mib_desc lan9303_mib[] = {
 	{ .offset = LAN9303_MAC_TX_LATECOL_0, .name = "TxLateCol", },
 };
 
-static void lan9303_get_strings(struct dsa_switch *ds, int port, uint8_t *data)
+static void lan9303_get_strings(struct dsa_switch *ds, int port,
+				u32 stringset, uint8_t *data)
 {
 	unsigned int u;
 
+	if (stringset != ETH_SS_STATS)
+		return;
+
 	for (u = 0; u < ARRAY_SIZE(lan9303_mib); u++) {
 		strncpy(data + u * ETH_GSTRING_LEN, lan9303_mib[u].name,
 			ETH_GSTRING_LEN);
@@ -1007,8 +1011,11 @@ static void lan9303_get_ethtool_stats(struct dsa_switch *ds, int port,
 	}
 }
 
-static int lan9303_get_sset_count(struct dsa_switch *ds, int port)
+static int lan9303_get_sset_count(struct dsa_switch *ds, int port, int sset)
 {
+	if (sset != ETH_SS_STATS)
+		return 0;
+
 	return ARRAY_SIZE(lan9303_mib);
 }
 
diff --git a/drivers/net/dsa/microchip/ksz_common.c b/drivers/net/dsa/microchip/ksz_common.c
index bcb3e6c734f2..7210c49b7922 100644
--- a/drivers/net/dsa/microchip/ksz_common.c
+++ b/drivers/net/dsa/microchip/ksz_common.c
@@ -439,15 +439,22 @@ static void ksz_disable_port(struct dsa_switch *ds, int port,
 	ksz_port_cfg(dev, port, REG_PORT_CTRL_0, PORT_MAC_LOOPBACK, true);
 }
 
-static int ksz_sset_count(struct dsa_switch *ds, int port)
+static int ksz_sset_count(struct dsa_switch *ds, int port, int sset)
 {
+	if (sset != ETH_SS_STATS)
+		return 0;
+
 	return TOTAL_SWITCH_COUNTER_NUM;
 }
 
-static void ksz_get_strings(struct dsa_switch *ds, int port, uint8_t *buf)
+static void ksz_get_strings(struct dsa_switch *ds, int port,
+			    u32 stringset, uint8_t *buf)
 {
 	int i;
 
+	if (stringset != ETH_SS_STATS)
+		return;
+
 	for (i = 0; i < TOTAL_SWITCH_COUNTER_NUM; i++) {
 		memcpy(buf + i * ETH_GSTRING_LEN, mib_names[i].string,
 		       ETH_GSTRING_LEN);
diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c
index 80a4dbc3a499..62e486652e62 100644
--- a/drivers/net/dsa/mt7530.c
+++ b/drivers/net/dsa/mt7530.c
@@ -573,10 +573,14 @@ static int mt7530_phy_write(struct dsa_switch *ds, int port, int regnum,
 }
 
 static void
-mt7530_get_strings(struct dsa_switch *ds, int port, uint8_t *data)
+mt7530_get_strings(struct dsa_switch *ds, int port, u32 stringset,
+		   uint8_t *data)
 {
 	int i;
 
+	if (stringset != ETH_SS_STATS)
+		return;
+
 	for (i = 0; i < ARRAY_SIZE(mt7530_mib); i++)
 		strncpy(data + i * ETH_GSTRING_LEN, mt7530_mib[i].name,
 			ETH_GSTRING_LEN);
@@ -604,8 +608,11 @@ mt7530_get_ethtool_stats(struct dsa_switch *ds, int port,
 }
 
 static int
-mt7530_get_sset_count(struct dsa_switch *ds, int port)
+mt7530_get_sset_count(struct dsa_switch *ds, int port, int sset)
 {
+	if (sset != ETH_SS_STATS)
+		return 0;
+
 	return ARRAY_SIZE(mt7530_mib);
 }
 
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 3d2091099f7f..8f92ccc0dd54 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -742,11 +742,14 @@ static void mv88e6xxx_atu_vtu_get_strings(uint8_t *data)
 }
 
 static void mv88e6xxx_get_strings(struct dsa_switch *ds, int port,
-				  uint8_t *data)
+				  u32 stringset, uint8_t *data)
 {
 	struct mv88e6xxx_chip *chip = ds->priv;
 	int count = 0;
 
+	if (stringset != ETH_SS_STATS)
+		return;
+
 	mutex_lock(&chip->reg_lock);
 
 	if (chip->info->ops->stats_get_strings)
@@ -789,12 +792,15 @@ static int mv88e6320_stats_get_sset_count(struct mv88e6xxx_chip *chip)
 					      STATS_TYPE_BANK1);
 }
 
-static int mv88e6xxx_get_sset_count(struct dsa_switch *ds, int port)
+static int mv88e6xxx_get_sset_count(struct dsa_switch *ds, int port, int sset)
 {
 	struct mv88e6xxx_chip *chip = ds->priv;
 	int serdes_count = 0;
 	int count = 0;
 
+	if (sset != ETH_SS_STATS)
+		return 0;
+
 	mutex_lock(&chip->reg_lock);
 	if (chip->info->ops->stats_get_sset_count)
 		count = chip->info->ops->stats_get_sset_count(chip);
diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index 600d5ad1fbde..757b6d90ea36 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -600,10 +600,13 @@ qca8k_phy_write(struct dsa_switch *ds, int phy, int regnum, u16 val)
 }
 
 static void
-qca8k_get_strings(struct dsa_switch *ds, int port, uint8_t *data)
+qca8k_get_strings(struct dsa_switch *ds, int port, u32 stringset, uint8_t *data)
 {
 	int i;
 
+	if (stringset != ETH_SS_STATS)
+		return;
+
 	for (i = 0; i < ARRAY_SIZE(ar8327_mib); i++)
 		strncpy(data + i * ETH_GSTRING_LEN, ar8327_mib[i].name,
 			ETH_GSTRING_LEN);
@@ -631,8 +634,11 @@ qca8k_get_ethtool_stats(struct dsa_switch *ds, int port,
 }
 
 static int
-qca8k_get_sset_count(struct dsa_switch *ds, int port)
+qca8k_get_sset_count(struct dsa_switch *ds, int port, int sset)
 {
+	if (sset != ETH_SS_STATS)
+		return 0;
+
 	return ARRAY_SIZE(ar8327_mib);
 }
 
diff --git a/include/net/dsa.h b/include/net/dsa.h
index 60fb4ec8ba61..0bc0aad1b02e 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -356,10 +356,11 @@ struct dsa_switch_ops {
 	/*
 	 * ethtool hardware statistics.
 	 */
-	void	(*get_strings)(struct dsa_switch *ds, int port, uint8_t *data);
+	void	(*get_strings)(struct dsa_switch *ds, int port,
+			       u32 stringset, uint8_t *data);
 	void	(*get_ethtool_stats)(struct dsa_switch *ds,
 				     int port, uint64_t *data);
-	int	(*get_sset_count)(struct dsa_switch *ds, int port);
+	int	(*get_sset_count)(struct dsa_switch *ds, int port, int sset);
 
 	/*
 	 * ethtool Wake-on-LAN
diff --git a/net/dsa/master.c b/net/dsa/master.c
index 9ec16b39ed15..8d27687fd0ca 100644
--- a/net/dsa/master.c
+++ b/net/dsa/master.c
@@ -38,11 +38,14 @@ static int dsa_master_get_sset_count(struct net_device *dev, int sset)
 	struct dsa_switch *ds = cpu_dp->ds;
 	int count = 0;
 
-	if (ops->get_sset_count)
-		count += ops->get_sset_count(dev, sset);
+	if (ops->get_sset_count) {
+		count = ops->get_sset_count(dev, sset);
+		if (count < 0)
+			count = 0;
+	}
 
-	if (sset == ETH_SS_STATS && ds->ops->get_sset_count)
-		count += ds->ops->get_sset_count(ds, cpu_dp->index);
+	if (ds->ops->get_sset_count)
+		count += ds->ops->get_sset_count(ds, cpu_dp->index, sset);
 
 	return count;
 }
@@ -65,18 +68,20 @@ static void dsa_master_get_strings(struct net_device *dev, uint32_t stringset,
 	pfx[sizeof(pfx) - 1] = '_';
 
 	if (ops->get_sset_count && ops->get_strings) {
-		mcount = ops->get_sset_count(dev, ETH_SS_STATS);
+		mcount = ops->get_sset_count(dev, stringset);
+		if (mcount < 0)
+			mcount = 0;
 		ops->get_strings(dev, stringset, data);
 	}
 
-	if (stringset == ETH_SS_STATS && ds->ops->get_strings) {
+	if (ds->ops->get_strings) {
 		ndata = data + mcount * len;
 		/* This function copies ETH_GSTRINGS_LEN bytes, we will mangle
 		 * the output after to prepend our CPU port prefix we
 		 * constructed earlier
 		 */
-		ds->ops->get_strings(ds, port, ndata);
-		count = ds->ops->get_sset_count(ds, port);
+		ds->ops->get_strings(ds, port, stringset, ndata);
+		count = ds->ops->get_sset_count(ds, port, stringset);
 		for (i = 0; i < count; i++) {
 			memmove(ndata + (i * len + sizeof(pfx)),
 				ndata + i * len, len - sizeof(pfx));
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 18561af7a8f1..f3fb3a0880b1 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -560,7 +560,8 @@ static void dsa_slave_get_strings(struct net_device *dev,
 		strncpy(data + 2 * len, "rx_packets", len);
 		strncpy(data + 3 * len, "rx_bytes", len);
 		if (ds->ops->get_strings)
-			ds->ops->get_strings(ds, dp->index, data + 4 * len);
+			ds->ops->get_strings(ds, dp->index, stringset,
+					     data + 4 * len);
 	}
 }
 
@@ -605,7 +606,7 @@ static int dsa_slave_get_sset_count(struct net_device *dev, int sset)
 
 		count = 4;
 		if (ds->ops->get_sset_count)
-			count += ds->ops->get_sset_count(ds, dp->index);
+			count += ds->ops->get_sset_count(ds, dp->index, sset);
 
 		return count;
 	}
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next v2 3/7] net: dsa: Do not check for ethtool_ops validity
From: Florian Fainelli @ 2018-04-25  0:27 UTC (permalink / raw)
  To: netdev
  Cc: Florian Fainelli, andrew, vivien.didelot, cphealy, davem,
	nikita.yoush
In-Reply-To: <20180425002748.27330-1-f.fainelli@gmail.com>

This is completely redundant with what netdev_set_default_ethtool_ops()
does, we are always guaranteed to have a valid dev->ethtool_ops pointer,
however, within that structure, not all function calls may be populated,
so we still have to check them individually.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 net/dsa/master.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/dsa/master.c b/net/dsa/master.c
index 90e6df0351eb..9ec16b39ed15 100644
--- a/net/dsa/master.c
+++ b/net/dsa/master.c
@@ -22,7 +22,7 @@ static void dsa_master_get_ethtool_stats(struct net_device *dev,
 	int port = cpu_dp->index;
 	int count = 0;
 
-	if (ops && ops->get_sset_count && ops->get_ethtool_stats) {
+	if (ops->get_sset_count && ops->get_ethtool_stats) {
 		count = ops->get_sset_count(dev, ETH_SS_STATS);
 		ops->get_ethtool_stats(dev, stats, data);
 	}
@@ -38,7 +38,7 @@ static int dsa_master_get_sset_count(struct net_device *dev, int sset)
 	struct dsa_switch *ds = cpu_dp->ds;
 	int count = 0;
 
-	if (ops && ops->get_sset_count)
+	if (ops->get_sset_count)
 		count += ops->get_sset_count(dev, sset);
 
 	if (sset == ETH_SS_STATS && ds->ops->get_sset_count)
@@ -64,7 +64,7 @@ static void dsa_master_get_strings(struct net_device *dev, uint32_t stringset,
 	/* We do not want to be NULL-terminated, since this is a prefix */
 	pfx[sizeof(pfx) - 1] = '_';
 
-	if (ops && ops->get_sset_count && ops->get_strings) {
+	if (ops->get_sset_count && ops->get_strings) {
 		mcount = ops->get_sset_count(dev, ETH_SS_STATS);
 		ops->get_strings(dev, stringset, data);
 	}
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next v2 2/7] net: Allow network devices to have PHY statistics
From: Florian Fainelli @ 2018-04-25  0:27 UTC (permalink / raw)
  To: netdev
  Cc: Florian Fainelli, andrew, vivien.didelot, cphealy, davem,
	nikita.yoush
In-Reply-To: <20180425002748.27330-1-f.fainelli@gmail.com>

Add a new callback: get_ethtool_phy_stats() which allows network device
drivers not making use of the PHY library to return PHY statistics.
Update ethtool_get_phy_stats(), __ethtool_get_sset_count() and
__ethtool_get_strings() accordingly to interogate the network device
about ETH_SS_PHY_STATS.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 include/linux/ethtool.h |  5 +++++
 net/core/ethtool.c      | 39 +++++++++++++++++++++------------------
 2 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index ebe41811ed34..9b19556f0156 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -310,6 +310,9 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32,
  *	fields should be ignored (use %__ETHTOOL_LINK_MODE_MASK_NBITS
  *	instead of the latter), any change to them will be overwritten
  *	by kernel. Returns a negative error code or zero.
+ * @get_ethtool_phy_stats: Return extended statistics about the PHY device.
+ *	This is only useful if the device maintains PHY statistics and
+ *	cannot use the standard PHY library helpers.
  *
  * All operations are optional (i.e. the function pointer may be set
  * to %NULL) and callers must take this into account.  Callers must
@@ -405,5 +408,7 @@ struct ethtool_ops {
 				      struct ethtool_fecparam *);
 	int	(*set_fecparam)(struct net_device *,
 				      struct ethtool_fecparam *);
+	void	(*get_ethtool_phy_stats)(struct net_device *,
+					 struct ethtool_stats *, u64 *);
 };
 #endif /* _LINUX_ETHTOOL_H */
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index f0d42e093c4a..4b8992ccf904 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -226,12 +226,9 @@ static int __ethtool_get_sset_count(struct net_device *dev, int sset)
 	if (sset == ETH_SS_PHY_TUNABLES)
 		return ARRAY_SIZE(phy_tunable_strings);
 
-	if (sset == ETH_SS_PHY_STATS) {
-		if (dev->phydev)
-			return phy_ethtool_get_sset_count(dev->phydev);
-		else
-			return -EOPNOTSUPP;
-	}
+	if (sset == ETH_SS_PHY_STATS && dev->phydev &&
+	    !ops->get_ethtool_phy_stats)
+		return phy_ethtool_get_sset_count(dev->phydev);
 
 	if (ops->get_sset_count && ops->get_strings)
 		return ops->get_sset_count(dev, sset);
@@ -254,12 +251,10 @@ static void __ethtool_get_strings(struct net_device *dev,
 		memcpy(data, tunable_strings, sizeof(tunable_strings));
 	else if (stringset == ETH_SS_PHY_TUNABLES)
 		memcpy(data, phy_tunable_strings, sizeof(phy_tunable_strings));
-	else if (stringset == ETH_SS_PHY_STATS) {
-		if (dev->phydev)
-			phy_ethtool_get_strings(dev->phydev, data);
-		else
-			return;
-	} else
+	else if (stringset == ETH_SS_PHY_STATS && dev->phydev &&
+		 !ops->get_ethtool_phy_stats)
+		phy_ethtool_get_strings(dev->phydev, data);
+	else
 		/* ops->get_strings is valid because checked earlier */
 		ops->get_strings(dev, stringset, data);
 }
@@ -1971,15 +1966,19 @@ static int ethtool_get_stats(struct net_device *dev, void __user *useraddr)
 
 static int ethtool_get_phy_stats(struct net_device *dev, void __user *useraddr)
 {
-	struct ethtool_stats stats;
+	const struct ethtool_ops *ops = dev->ethtool_ops;
 	struct phy_device *phydev = dev->phydev;
+	struct ethtool_stats stats;
 	u64 *data;
 	int ret, n_stats;
 
-	if (!phydev)
+	if (!phydev && (!ops->get_ethtool_phy_stats || !ops->get_sset_count))
 		return -EOPNOTSUPP;
 
-	n_stats = phy_ethtool_get_sset_count(dev->phydev);
+	if (dev->phydev && !ops->get_ethtool_phy_stats)
+		n_stats = phy_ethtool_get_sset_count(dev->phydev);
+	else
+		n_stats = ops->get_sset_count(dev, ETH_SS_PHY_STATS);
 	if (n_stats < 0)
 		return n_stats;
 	if (n_stats > S32_MAX / sizeof(u64))
@@ -1994,9 +1993,13 @@ static int ethtool_get_phy_stats(struct net_device *dev, void __user *useraddr)
 	if (n_stats && !data)
 		return -ENOMEM;
 
-	ret = phy_ethtool_get_stats(dev->phydev, &stats, data);
-	if (ret < 0)
-		return ret;
+	if (dev->phydev && !ops->get_ethtool_phy_stats) {
+		ret = phy_ethtool_get_stats(dev->phydev, &stats, data);
+		if (ret < 0)
+			return ret;
+	} else {
+		ops->get_ethtool_phy_stats(dev, &stats, data);
+	}
 
 	ret = -EFAULT;
 	if (copy_to_user(useraddr, &stats, sizeof(stats)))
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next v2 1/7] net: Move PHY statistics code into PHY library helpers
From: Florian Fainelli @ 2018-04-25  0:27 UTC (permalink / raw)
  To: netdev
  Cc: Florian Fainelli, andrew, vivien.didelot, cphealy, davem,
	nikita.yoush
In-Reply-To: <20180425002748.27330-1-f.fainelli@gmail.com>

In order to make it possible for network device drivers that do not
necessarily have a phy_device attached, but still report PHY statistics,
have a preliminary refactoring consisting in creating helper functions
that encapsulate the PHY device driver knowledge within PHYLIB.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/phy/phy.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/phy.h   | 20 ++++++++++++++++++++
 net/core/ethtool.c    | 38 ++++++++------------------------------
 3 files changed, 76 insertions(+), 30 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 05c1e8ef15e6..a98ed12c0009 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -1277,3 +1277,51 @@ int phy_ethtool_nway_reset(struct net_device *ndev)
 	return phy_restart_aneg(phydev);
 }
 EXPORT_SYMBOL(phy_ethtool_nway_reset);
+
+int phy_ethtool_get_strings(struct phy_device *phydev, u8 *data)
+{
+	if (!phydev->drv)
+		return -EIO;
+
+	mutex_lock(&phydev->lock);
+	phydev->drv->get_strings(phydev, data);
+	mutex_unlock(&phydev->lock);
+
+	return 0;
+}
+EXPORT_SYMBOL(phy_ethtool_get_strings);
+
+int phy_ethtool_get_sset_count(struct phy_device *phydev)
+{
+	int ret;
+
+	if (!phydev->drv)
+		return -EIO;
+
+	if (phydev->drv->get_sset_count &&
+	    phydev->drv->get_strings &&
+	    phydev->drv->get_stats) {
+		mutex_lock(&phydev->lock);
+		ret = phydev->drv->get_sset_count(phydev);
+		mutex_unlock(&phydev->lock);
+
+		return ret;
+	}
+
+	return -EOPNOTSUPP;
+}
+EXPORT_SYMBOL(phy_ethtool_get_sset_count);
+
+int phy_ethtool_get_stats(struct phy_device *phydev,
+			  struct ethtool_stats *stats, u64 *data)
+{
+	if (!phydev->drv)
+		return -EIO;
+
+	mutex_lock(&phydev->lock);
+	phydev->drv->get_stats(phydev, stats, data);
+	mutex_unlock(&phydev->lock);
+
+	return 0;
+}
+EXPORT_SYMBOL(phy_ethtool_get_stats);
diff --git a/include/linux/phy.h b/include/linux/phy.h
index f0b5870a6d40..6ca81395c545 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -1066,6 +1066,26 @@ int phy_ethtool_nway_reset(struct net_device *ndev);
 #if IS_ENABLED(CONFIG_PHYLIB)
 int __init mdio_bus_init(void);
 void mdio_bus_exit(void);
+int phy_ethtool_get_strings(struct phy_device *phydev, u8 *data);
+int phy_ethtool_get_sset_count(struct phy_device *phydev);
+int phy_ethtool_get_stats(struct phy_device *phydev,
+			  struct ethtool_stats *stats, u64 *data);
+#else
+int phy_ethtool_get_strings(struct phy_device *phydev, u8 *data)
+{
+	return -EOPNOTSUPP;
+}
+
+int phy_ethtool_get_sset_count(struct phy_device *phydev)
+{
+	return -EOPNOTSUPP;
+}
+
+int phy_ethtool_get_stats(struct phy_device *phydev,
+			  struct ethtool_stats *stats, u64 *data)
+{
+	return -EOPNOTSUPP;
+}
 #endif
 
 extern struct bus_type mdio_bus_type;
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 03416e6dd5d7..f0d42e093c4a 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -210,23 +210,6 @@ static int ethtool_set_features(struct net_device *dev, void __user *useraddr)
 	return ret;
 }
 
-static int phy_get_sset_count(struct phy_device *phydev)
-{
-	int ret;
-
-	if (phydev->drv->get_sset_count &&
-	    phydev->drv->get_strings &&
-	    phydev->drv->get_stats) {
-		mutex_lock(&phydev->lock);
-		ret = phydev->drv->get_sset_count(phydev);
-		mutex_unlock(&phydev->lock);
-
-		return ret;
-	}
-
-	return -EOPNOTSUPP;
-}
-
 static int __ethtool_get_sset_count(struct net_device *dev, int sset)
 {
 	const struct ethtool_ops *ops = dev->ethtool_ops;
@@ -245,7 +228,7 @@ static int __ethtool_get_sset_count(struct net_device *dev, int sset)
 
 	if (sset == ETH_SS_PHY_STATS) {
 		if (dev->phydev)
-			return phy_get_sset_count(dev->phydev);
+			return phy_ethtool_get_sset_count(dev->phydev);
 		else
 			return -EOPNOTSUPP;
 	}
@@ -272,15 +255,10 @@ static void __ethtool_get_strings(struct net_device *dev,
 	else if (stringset == ETH_SS_PHY_TUNABLES)
 		memcpy(data, phy_tunable_strings, sizeof(phy_tunable_strings));
 	else if (stringset == ETH_SS_PHY_STATS) {
-		struct phy_device *phydev = dev->phydev;
-
-		if (phydev) {
-			mutex_lock(&phydev->lock);
-			phydev->drv->get_strings(phydev, data);
-			mutex_unlock(&phydev->lock);
-		} else {
+		if (dev->phydev)
+			phy_ethtool_get_strings(dev->phydev, data);
+		else
 			return;
-		}
 	} else
 		/* ops->get_strings is valid because checked earlier */
 		ops->get_strings(dev, stringset, data);
@@ -2001,7 +1979,7 @@ static int ethtool_get_phy_stats(struct net_device *dev, void __user *useraddr)
 	if (!phydev)
 		return -EOPNOTSUPP;
 
-	n_stats = phy_get_sset_count(phydev);
+	n_stats = phy_ethtool_get_sset_count(dev->phydev);
 	if (n_stats < 0)
 		return n_stats;
 	if (n_stats > S32_MAX / sizeof(u64))
@@ -2016,9 +1994,9 @@ static int ethtool_get_phy_stats(struct net_device *dev, void __user *useraddr)
 	if (n_stats && !data)
 		return -ENOMEM;
 
-	mutex_lock(&phydev->lock);
-	phydev->drv->get_stats(phydev, &stats, data);
-	mutex_unlock(&phydev->lock);
+	ret = phy_ethtool_get_stats(dev->phydev, &stats, data);
+	if (ret < 0)
+		return ret;
 
 	ret = -EFAULT;
 	if (copy_to_user(useraddr, &stats, sizeof(stats)))
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next v2 0/7] net: Extend availability of PHY statistics
From: Florian Fainelli @ 2018-04-25  0:27 UTC (permalink / raw)
  To: netdev
  Cc: Florian Fainelli, andrew, vivien.didelot, cphealy, davem,
	nikita.yoush


Hi all,

This patch series adds support for retrieving PHY statistics with DSA switches
when the CPU port uses a PHY to PHY connection (as opposed to MAC to MAC).
To get there a number of things are done:

- first we move the code dealing with PHY statistics outside of net/core/ethtool.c
  and create helper functions since the same code will be reused
- then we allow network device drivers to provide an ethtool_get_phy_stats callback
  when the standard PHY library helpers are not suitable
- we update the DSA functions dealing with ethtool operations to get passed a
  stringset instead of assuming ETH_SS_STATS like they currently do
- then we provide a set of standard helpers within DSA as a framework and add
  the plumbing to allow retrieving the PHY statistics of the CPU port(s)
- finally plug support for retrieving such PHY statistics with the b53 driver

Changes in v2:

- got actual testing when the DSA master network device has a PHY that
  already provides statistics (thanks Nikita!)

- fixed the kbuild error reported when CONFIG_PHYLIB=n

- removed the checking of ops which is redundant and not needed

Florian Fainelli (7):
  net: Move PHY statistics code into PHY library helpers
  net: Allow network devices to have PHY statistics
  net: dsa: Do not check for ethtool_ops validity
  net: dsa: Pass stringset to ethtool operations
  net: dsa: Add helper function to obtain PHY device of a given port
  net: dsa: Allow providing PHY statistics from CPU port
  net: dsa: b53: Add support for reading PHY statistics

 drivers/net/dsa/b53/b53_common.c       | 79 ++++++++++++++++++++++++++---
 drivers/net/dsa/b53/b53_priv.h         |  6 ++-
 drivers/net/dsa/bcm_sf2.c              |  1 +
 drivers/net/dsa/dsa_loop.c             | 11 ++++-
 drivers/net/dsa/lan9303-core.c         | 11 ++++-
 drivers/net/dsa/microchip/ksz_common.c | 11 ++++-
 drivers/net/dsa/mt7530.c               | 11 ++++-
 drivers/net/dsa/mv88e6xxx/chip.c       | 10 +++-
 drivers/net/dsa/qca8k.c                | 10 +++-
 drivers/net/phy/phy.c                  | 48 ++++++++++++++++++
 include/linux/ethtool.h                |  5 ++
 include/linux/phy.h                    | 20 ++++++++
 include/net/dsa.h                      | 12 ++++-
 net/core/ethtool.c                     | 61 ++++++++---------------
 net/dsa/master.c                       | 62 +++++++++++++++++++----
 net/dsa/port.c                         | 90 +++++++++++++++++++++++++++++-----
 net/dsa/slave.c                        |  5 +-
 17 files changed, 366 insertions(+), 87 deletions(-)

-- 
2.14.1

^ permalink raw reply

* Re: [bpf-next PATCH] bpf: reduce runtime of test_sockmap tests
From: Daniel Borkmann @ 2018-04-25  0:15 UTC (permalink / raw)
  To: John Fastabend, ast; +Cc: netdev
In-Reply-To: <20180424232818.21788.42271.stgit@john-Precision-Tower-5810>

On 04/25/2018 01:28 AM, John Fastabend wrote:
> When test_sockmap was running outside of selftests and was not being
> run by build bots it was reasonable to spend significant amount of
> time running various tests. The number of tests is high because many
> different I/O iterators are run.
> 
> However, now that test_sockmap is part of selftests rather than
> iterate through all I/O sides only test a minimal set of min/max
> values along with a few "normal" I/O ops. Also remove the long
> running tests. They can be run from other test frameworks on a regular
> cadence.
> 
> This significanly reduces runtime of test_sockmap.
> 
> Before:
> 
> $ time sudo ./test_sockmap  > /dev/null
> 
> real    4m47.521s
> user    0m0.370s
> sys     0m3.131s
> 
> After:
> 
> $ time sudo ./test_sockmap  > /dev/null
> 
> real    0m0.514s
> user    0m0.104s
> sys     0m0.430s
> 
> The CLI is still available for users that want to test the long
> running tests that do the larger send/recv tests.
> 
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Applied to bpf-next, thanks John!

^ permalink raw reply

* Re: [PATCH v2] bpf, x86_32: add eBPF JIT compiler for ia32
From: Daniel Borkmann @ 2018-04-25  0:11 UTC (permalink / raw)
  To: Wang YanQing, ast, illusionist.neo, tglx, mingo, hpa, davem, x86,
	netdev, linux-kernel
In-Reply-To: <20180419155420.GA5468@udknight>

On 04/19/2018 05:54 PM, Wang YanQing wrote:
> The JIT compiler emits ia32 bit instructions. Currently, It supports
> eBPF only. Classic BPF is supported because of the conversion by BPF core.
> 
> Almost all instructions from eBPF ISA supported except the following:
> BPF_ALU64 | BPF_DIV | BPF_K
> BPF_ALU64 | BPF_DIV | BPF_X
> BPF_ALU64 | BPF_MOD | BPF_K
> BPF_ALU64 | BPF_MOD | BPF_X
> BPF_STX | BPF_XADD | BPF_W
> BPF_STX | BPF_XADD | BPF_DW
> 
> It doesn't support BPF_JMP|BPF_CALL with BPF_PSEUDO_CALL too.
> 
> ia32 has few general purpose registers, EAX|EDX|ECX|EBX|ESI|EDI,
> and in these six registers, we can't treat all of them as real
> general purpose registers in jit:
> MUL instructions need EAX:EDX, shift instructions need ECX.
> 
> So I decide to use stack to emulate all eBPF 64 registers, this will
> simplify the implementation a lot, because we don't need to face the
> flexible memory address modes on ia32, for example, we don't need to
> write below code pattern for one BPF_ADD instruction:
> 
>     if (src is a register && dst is a register)
>     {
>        //one instruction encoding for ADD instruction
>     } else if (only src is a register)
>     {
>        //another different instruction encoding for ADD instruction
>     } else if (only dst is a register)
>     {
>        //another different instruction encoding for ADD instruction
>     } else
>     {
>        //src and dst are all on stack.
>        //move src or dst to temporary registers
>     }
> 
> If the above example if-else-else-else isn't so painful, try to think
> it for BPF_ALU64|BPF_*SHIFT* instruction which we need to use many
> native instructions to emulate.
> 
> Tested on my PC (Intel(R) Core(TM) i5-5200U CPU) and virtualbox.

Just out of plain curiosity, do you have a specific use case on the
JIT for x86_32? Are you targeting this to be used for e.g. Atom or
the like (sorry, don't really have a good overview where x86_32 is
still in use these days)?

> Testing results on i5-5200U:
> 
> 1) test_bpf: Summary: 349 PASSED, 0 FAILED, [319/341 JIT'ed]
> 2) test_progs: Summary: 81 PASSED, 2 FAILED.
>    test_progs report "libbpf: incorrect bpf_call opcode" for
>    test_l4lb_noinline and test_xdp_noinline, because there is
>    no llvm-6.0 on my machine, and current implementation doesn't
>    support BPF_PSEUDO_CALL, so I think we can ignore the two failed
>    testcases.
> 3) test_lpm: OK
> 4) test_lru_map: OK
> 5) test_verifier: Summary: 823 PASSED, 5 FAILED
>    test_verifier report "invalid bpf_context access off=68 size=1/2/4/8"
>    for all the 5 FAILED testcases with/without jit, we need to fix the
>    failed testcases themself instead of this jit.

Can you elaborate further on these? Looks like this definitely needs
fixing on 32 bit. Would be great to get a better understanding of the
underlying bug(s) and properly fix them.

> Above tests are all done with following flags settings discretely:
> 1:bpf_jit_enable=1 and bpf_jit_harden=0
> 2:bpf_jit_enable=1 and bpf_jit_harden=2
> 
> Below are some numbers for this jit implementation:
> Note:
>   I run test_progs in kselftest 100 times continuously for every testcase,
>   the numbers are in format: total/times=avg.
>   The numbers that test_bpf reports show almost the same relation.
> 
> a:jit_enable=0 and jit_harden=0            b:jit_enable=1 and jit_harden=0
>   test_pkt_access:PASS:ipv4:15622/100=156  test_pkt_access:PASS:ipv4:10057/100=100
>   test_pkt_access:PASS:ipv6:9130/100=91    test_pkt_access:PASS:ipv6:5055/100=50
>   test_xdp:PASS:ipv4:240198/100=2401       test_xdp:PASS:ipv4:145945/100=1459
>   test_xdp:PASS:ipv6:137326/100=1373       test_xdp:PASS:ipv6:67337/100=673
>   test_l4lb:PASS:ipv4:61100/100=611        test_l4lb:PASS:ipv4:38137/100=381
>   test_l4lb:PASS:ipv6:101000/100=1010      test_l4lb:PASS:ipv6:57779/100=577
> 
> c:jit_enable=1 and jit_harden=2
>   test_pkt_access:PASS:ipv4:12650/100=126
>   test_pkt_access:PASS:ipv6:7074/100=70
>   test_xdp:PASS:ipv4:147211/100=1472
>   test_xdp:PASS:ipv6:85783/100=857
>   test_l4lb:PASS:ipv4:53222/100=532
>   test_l4lb:PASS:ipv6:76322/100=763
> 
> Yes, the numbers are pretty when turn off jit_harden, if we want to speedup
> jit_harden, then we need to move BPF_REG_AX to *real* register instead of stack
> emulation, but when we do it, we need to face all the pain I describe above. We
> can do it in next step.
> 
> See Documentation/networking/filter.txt for more information.
> 
> Signed-off-by: Wang YanQing <udknight@gmail.com>
[...]
> +		/* call */
> +		case BPF_JMP | BPF_CALL:
> +		{
> +			const u8 *r1 = bpf2ia32[BPF_REG_1];
> +			const u8 *r2 = bpf2ia32[BPF_REG_2];
> +			const u8 *r3 = bpf2ia32[BPF_REG_3];
> +			const u8 *r4 = bpf2ia32[BPF_REG_4];
> +			const u8 *r5 = bpf2ia32[BPF_REG_5];
> +
> +			if (insn->src_reg == BPF_PSEUDO_CALL)
> +				goto notyet;

Here you bail out on BPF_PSEUDO_CALL, but in below bpf_int_jit_compile() you
implement half of e.g. 1c2a088a6626 ("bpf: x64: add JIT support for multi-function
programs"), which is rather confusing to leave it in this half complete state;
either remove altogether or implement everything.

> +			func = (u8 *) __bpf_call_base + imm32;
> +			jmp_offset = func - (image + addrs[i]);
> +
> +			if (!imm32 || !is_simm32(jmp_offset)) {
> +				pr_err("unsupported bpf func %d addr %p image %p\n",
> +				       imm32, func, image);
> +				return -EINVAL;
> +			}
> +
> +			/* mov eax,dword ptr [ebp+off] */
> +			EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp2[0]),
> +			      STACK_VAR(r1[0]));
> +			/* mov edx,dword ptr [ebp+off] */
> +			EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp2[1]),
> +			      STACK_VAR(r1[1]));
> +
> +			emit_push_r64(r5, &prog);
> +			emit_push_r64(r4, &prog);
> +			emit_push_r64(r3, &prog);
> +			emit_push_r64(r2, &prog);
> +
> +			EMIT1_off32(0xE8, jmp_offset + 9);
> +
> +			/* mov dword ptr [ebp+off],eax */
> +			EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp2[0]),
> +			      STACK_VAR(r0[0]));
> +			/* mov dword ptr [ebp+off],edx */
> +			EMIT3(0x89, add_2reg(0x40, IA32_EBP, tmp2[1]),
> +			      STACK_VAR(r0[1]));
> +
> +			/* add esp,32 */
> +			EMIT3(0x83, add_1reg(0xC0, IA32_ESP), 32);
> +			break;
> +		}
> +		case BPF_JMP | BPF_TAIL_CALL:
> +			emit_bpf_tail_call(&prog);
> +			break;
> +
> +		/* cond jump */
> +		case BPF_JMP | BPF_JEQ | BPF_X:
> +		case BPF_JMP | BPF_JNE | BPF_X:
> +		case BPF_JMP | BPF_JGT | BPF_X:
> +		case BPF_JMP | BPF_JLT | BPF_X:
> +		case BPF_JMP | BPF_JGE | BPF_X:
> +		case BPF_JMP | BPF_JLE | BPF_X:
> +		case BPF_JMP | BPF_JSGT | BPF_X:
> +		case BPF_JMP | BPF_JSLE | BPF_X:
> +		case BPF_JMP | BPF_JSLT | BPF_X:
> +		case BPF_JMP | BPF_JSGE | BPF_X:
> +			/* mov esi,dword ptr [ebp+off] */
> +			EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]),
> +			      STACK_VAR(src_hi));
> +			/* cmp dword ptr [ebp+off], esi */
> +			EMIT3(0x39, add_2reg(0x40, IA32_EBP, tmp[0]),
> +			      STACK_VAR(dst_hi));
> +
> +			EMIT2(IA32_JNE, 6);
> +			/* mov esi,dword ptr [ebp+off] */
> +			EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]),
> +			      STACK_VAR(src_lo));
> +			/* cmp dword ptr [ebp+off], esi */
> +			EMIT3(0x39, add_2reg(0x40, IA32_EBP, tmp[0]),
> +			      STACK_VAR(dst_lo));
> +			goto emit_cond_jmp;
> +
> +		case BPF_JMP | BPF_JSET | BPF_X:
> +			/* mov esi,dword ptr [ebp+off] */
> +			EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[0]),
> +			      STACK_VAR(dst_lo));
> +			/* and esi,dword ptr [ebp+off]*/
> +			EMIT3(0x23, add_2reg(0x40, IA32_EBP, tmp[0]),
> +			      STACK_VAR(src_lo));
> +
> +			/* mov edi,dword ptr [ebp+off] */
> +			EMIT3(0x8B, add_2reg(0x40, IA32_EBP, tmp[1]),
> +			      STACK_VAR(dst_hi));
> +			/* and edi,dword ptr [ebp+off] */
> +			EMIT3(0x23, add_2reg(0x40, IA32_EBP, tmp[1]),
> +			      STACK_VAR(src_hi));
> +			/* or esi,edi */
> +			EMIT2(0x09, add_2reg(0xC0, tmp[0], tmp[1]));
> +			goto emit_cond_jmp;
> +
> +		case BPF_JMP | BPF_JSET | BPF_K: {
> +			u32 hi;
> +
> +			hi = imm32 & (1<<31) ? (u32)~0 : 0;
> +			/* mov esi,imm32 */
> +			EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), imm32);
> +			/* and esi,dword ptr [ebp+off]*/
> +			EMIT3(0x23, add_2reg(0x40, IA32_EBP, tmp[0]),
> +			      STACK_VAR(dst_lo));
> +
> +			/* mov esi,imm32 */
> +			EMIT2_off32(0xC7, add_1reg(0xC0, tmp[1]), hi);
> +			/* and esi,dword ptr [ebp+off] */
> +			EMIT3(0x23, add_2reg(0x40, IA32_EBP, tmp[1]),
> +			      STACK_VAR(dst_hi));
> +			/* or esi,edi */
> +			EMIT2(0x09, add_2reg(0xC0, tmp[0], tmp[1]));
> +			goto emit_cond_jmp;
> +		}
> +		case BPF_JMP | BPF_JEQ | BPF_K:
> +		case BPF_JMP | BPF_JNE | BPF_K:
> +		case BPF_JMP | BPF_JGT | BPF_K:
> +		case BPF_JMP | BPF_JLT | BPF_K:
> +		case BPF_JMP | BPF_JGE | BPF_K:
> +		case BPF_JMP | BPF_JLE | BPF_K:
> +		case BPF_JMP | BPF_JSGT | BPF_K:
> +		case BPF_JMP | BPF_JSLE | BPF_K:
> +		case BPF_JMP | BPF_JSLT | BPF_K:
> +		case BPF_JMP | BPF_JSGE | BPF_K: {
> +			u32 hi;
> +
> +			hi = imm32 & (1<<31) ? (u32)~0 : 0;
> +			/* mov esi,imm32 */
> +			EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), hi);
> +			/* cmp dword ptr [ebp+off],esi */
> +			EMIT3(0x39, add_2reg(0x40, IA32_EBP, tmp[0]),
> +			      STACK_VAR(dst_hi));
> +
> +			EMIT2(IA32_JNE, 6);
> +			/* mov esi,imm32 */
> +			EMIT2_off32(0xC7, add_1reg(0xC0, tmp[0]), imm32);
> +			/* cmp dword ptr [ebp+off],esi */
> +			EMIT3(0x39, add_2reg(0x40, IA32_EBP, tmp[0]),
> +			      STACK_VAR(dst_lo));
> +
> +emit_cond_jmp:		/* convert BPF opcode to x86 */
> +			switch (BPF_OP(code)) {
> +			case BPF_JEQ:
> +				jmp_cond = IA32_JE;
> +				break;
> +			case BPF_JSET:
> +			case BPF_JNE:
> +				jmp_cond = IA32_JNE;
> +				break;
> +			case BPF_JGT:
> +				/* GT is unsigned '>', JA in x86 */
> +				jmp_cond = IA32_JA;
> +				break;
> +			case BPF_JLT:
> +				/* LT is unsigned '<', JB in x86 */
> +				jmp_cond = IA32_JB;
> +				break;
> +			case BPF_JGE:
> +				/* GE is unsigned '>=', JAE in x86 */
> +				jmp_cond = IA32_JAE;
> +				break;
> +			case BPF_JLE:
> +				/* LE is unsigned '<=', JBE in x86 */
> +				jmp_cond = IA32_JBE;
> +				break;
> +			case BPF_JSGT:
> +				/* signed '>', GT in x86 */
> +				jmp_cond = IA32_JG;
> +				break;
> +			case BPF_JSLT:
> +				/* signed '<', LT in x86 */
> +				jmp_cond = IA32_JL;
> +				break;
> +			case BPF_JSGE:
> +				/* signed '>=', GE in x86 */
> +				jmp_cond = IA32_JGE;
> +				break;
> +			case BPF_JSLE:
> +				/* signed '<=', LE in x86 */
> +				jmp_cond = IA32_JLE;
> +				break;
> +			default: /* to silence gcc warning */
> +				return -EFAULT;
> +			}
> +			jmp_offset = addrs[i + insn->off] - addrs[i];
> +			if (is_imm8(jmp_offset)) {
> +				EMIT2(jmp_cond, jmp_offset);
> +			} else if (is_simm32(jmp_offset)) {
> +				EMIT2_off32(0x0F, jmp_cond + 0x10, jmp_offset);
> +			} else {
> +				pr_err("cond_jmp gen bug %llx\n", jmp_offset);
> +				return -EFAULT;
> +			}
> +
> +			break;
> +		}
> +		case BPF_JMP | BPF_JA:
> +			jmp_offset = addrs[i + insn->off] - addrs[i];
> +			if (!jmp_offset)
> +				/* optimize out nop jumps */
> +				break;
> +emit_jmp:
> +			if (is_imm8(jmp_offset)) {
> +				EMIT2(0xEB, jmp_offset);
> +			} else if (is_simm32(jmp_offset)) {
> +				EMIT1_off32(0xE9, jmp_offset);
> +			} else {
> +				pr_err("jmp gen bug %llx\n", jmp_offset);
> +				return -EFAULT;
> +			}
> +			break;
> +
> +		case BPF_LD | BPF_IND | BPF_W:
> +			func = sk_load_word;
> +			goto common_load;
> +		case BPF_LD | BPF_ABS | BPF_W:
> +			func = CHOOSE_LOAD_FUNC(imm32, sk_load_word);
> +common_load:
> +			jmp_offset = func - (image + addrs[i]);
> +			if (!func || !is_simm32(jmp_offset)) {
> +				pr_err("unsupported bpf func %d addr %p image %p\n",
> +				       imm32, func, image);
> +				return -EINVAL;
> +			}
> +			if (BPF_MODE(code) == BPF_ABS) {
> +				/* mov %edx, imm32 */
> +				EMIT1_off32(0xBA, imm32);
> +			} else {
> +				/* mov edx,dword ptr [ebp+off] */
> +				EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX),
> +				      STACK_VAR(src_lo));
> +				if (imm32) {
> +					if (is_imm8(imm32))
> +						/* add %edx, imm8 */
> +						EMIT3(0x83, 0xC2, imm32);
> +					else
> +						/* add %edx, imm32 */
> +						EMIT2_off32(0x81, 0xC2, imm32);
> +				}
> +			}
> +			emit_load_skb_data_hlen(&prog);

The only place where you call the emit_load_skb_data_hlen() is here, so it
effectively doesn't cache anything as with the other JITs. I would suggest
to keep the complexity of the BPF_ABS/IND rather very low, remove the
arch/x86/net/bpf_jit32.S handling altogether and just follow the model the
way arm64 implemented this in the arch/arm64/net/bpf_jit_comp.c JIT. For
eBPF these instructions are legacy and have been source of many subtle
bugs in the various BPF JITs in the past, plus nobody complained about the
current interpreter speed for LD_ABS/IND on x86_32 anyway so far. ;) We're
also very close to have a rewrite of LD_ABS/IND out of native eBPF with
similar performance to be able to finally remove them from the core.

> +			EMIT1_off32(0xE8, jmp_offset + 10); /* call */
> +
> +			/* mov dword ptr [ebp+off],eax */
> +			EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EAX),
> +			      STACK_VAR(r0[0]));
> +			EMIT3(0xC7, add_1reg(0x40, IA32_EBP), STACK_VAR(r0[1]));
> +			EMIT(0x0, 4);
> +			break;
> +
> +		case BPF_LD | BPF_IND | BPF_H:
> +			func = sk_load_half;
> +			goto common_load;
> +		case BPF_LD | BPF_ABS | BPF_H:
> +			func = CHOOSE_LOAD_FUNC(imm32, sk_load_half);
> +			goto common_load;
> +		case BPF_LD | BPF_IND | BPF_B:
> +			func = sk_load_byte;
> +			goto common_load;
> +		case BPF_LD | BPF_ABS | BPF_B:
> +			func = CHOOSE_LOAD_FUNC(imm32, sk_load_byte);
> +			goto common_load;
> +		/* STX XADD: lock *(u32 *)(dst + off) += src */
> +		case BPF_STX | BPF_XADD | BPF_W:
> +		/* STX XADD: lock *(u64 *)(dst + off) += src */
> +		case BPF_STX | BPF_XADD | BPF_DW:
> +			goto notyet;
> +		case BPF_JMP | BPF_EXIT:
> +			if (seen_exit) {
> +				jmp_offset = ctx->cleanup_addr - addrs[i];
> +				goto emit_jmp;
> +			}
> +			seen_exit = true;
> +			/* update cleanup_addr */
> +			ctx->cleanup_addr = proglen;
> +			emit_epilogue(&prog, bpf_prog->aux->stack_depth);
> +			break;
> +notyet:
> +			pr_info_once("*** NOT YET: opcode %02x ***\n", code);
> +			return -EFAULT;
> +		default:
> +			/* This error will be seen if new instruction was added
> +			 * to interpreter, but not to JIT
[...]

Thanks,
Daniel

^ permalink raw reply

* [PATCH v3] ath9k: dfs: Remove VLA usage
From: Kees Cook @ 2018-04-24 23:57 UTC (permalink / raw)
  To: Kalle Valo
  Cc: Andreas Christoforou, Rosen Penev, Eric Dumazet, Joe Perches,
	linux-wireless, netdev, QCA ath9k Development, kernel-hardening,
	linux-kernel

In the quest to remove all stack VLA usage from the kernel[1], this
redefines FFT_NUM_SAMPLES as a #define instead of const int, which still
triggers gcc's VLA checking pass.

[1] https://lkml.org/lkml/2018/3/7/621

Co-developed-by: Andreas Christoforou <andreaschristofo@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
---
v3: replace FFT_NUM_SAMPLES as a #define (Joe)
---
 drivers/net/wireless/ath/ath9k/dfs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/ath/ath9k/dfs.c b/drivers/net/wireless/ath/ath9k/dfs.c
index 6fee9a464cce..e6e56a925121 100644
--- a/drivers/net/wireless/ath/ath9k/dfs.c
+++ b/drivers/net/wireless/ath/ath9k/dfs.c
@@ -40,8 +40,8 @@ static const int BIN_DELTA_MIN		= 1;
 static const int BIN_DELTA_MAX		= 10;
 
 /* we need at least 3 deltas / 4 samples for a reliable chirp detection */
-#define NUM_DIFFS 3
-static const int FFT_NUM_SAMPLES	= (NUM_DIFFS + 1);
+#define NUM_DIFFS	3
+#define FFT_NUM_SAMPLES	(NUM_DIFFS + 1)
 
 /* Threshold for difference of delta peaks */
 static const int MAX_DIFF		= 2;
-- 
2.7.4


-- 
Kees Cook
Pixel Security

^ permalink raw reply related

* [PATCH v3 ipsec-next] xfrm: remove VLA usage in __xfrm6_sort()
From: Kees Cook @ 2018-04-24 23:46 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: Andreas Christoforou, kernel-hardening, Steffen Klassert,
	Herbert Xu, David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	netdev, linux-kernel

In the quest to remove all stack VLA usage removed from the kernel[1],
just use XFRM_MAX_DEPTH as already done for the "class" array. In one
case, it'll do this loop up to 5, the other caller up to 6.

[1] https://lkml.org/lkml/2018/3/7/621

Co-developed-by: Andreas Christoforou <andreaschristofo@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
---
v3:
- adjust Subject and commit log (Steffen)
- use "= { }" instead of memset() (Stefano)
- reorder variables (Stefano)
v2:
- use XFRM_MAX_DEPTH for "count" array (Steffen and Mathias).
---
 net/ipv6/xfrm6_state.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/xfrm6_state.c b/net/ipv6/xfrm6_state.c
index 16f434791763..eeb44b64ae7f 100644
--- a/net/ipv6/xfrm6_state.c
+++ b/net/ipv6/xfrm6_state.c
@@ -60,9 +60,9 @@ xfrm6_init_temprop(struct xfrm_state *x, const struct xfrm_tmpl *tmpl,
 static int
 __xfrm6_sort(void **dst, void **src, int n, int (*cmp)(void *p), int maxclass)
 {
-	int i;
+	int count[XFRM_MAX_DEPTH] = { };
 	int class[XFRM_MAX_DEPTH];
-	int count[maxclass];
+	int i;
 
 	memset(count, 0, sizeof(count));
 
-- 
2.7.4


-- 
Kees Cook
Pixel Security

^ permalink raw reply related

* [bpf-next PATCH] bpf: reduce runtime of test_sockmap tests
From: John Fastabend @ 2018-04-24 23:28 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev

When test_sockmap was running outside of selftests and was not being
run by build bots it was reasonable to spend significant amount of
time running various tests. The number of tests is high because many
different I/O iterators are run.

However, now that test_sockmap is part of selftests rather than
iterate through all I/O sides only test a minimal set of min/max
values along with a few "normal" I/O ops. Also remove the long
running tests. They can be run from other test frameworks on a regular
cadence.

This significanly reduces runtime of test_sockmap.

Before:

$ time sudo ./test_sockmap  > /dev/null

real    4m47.521s
user    0m0.370s
sys     0m3.131s

After:

$ time sudo ./test_sockmap  > /dev/null

real    0m0.514s
user    0m0.104s
sys     0m0.430s

The CLI is still available for users that want to test the long
running tests that do the larger send/recv tests.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 tools/testing/selftests/bpf/test_sockmap.c |   33 ++++++++++++++--------------
 1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c
index 6d63a1c..29c022d 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -344,8 +344,8 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
 		if (err < 0)
 			perror("recv start time: ");
 		while (s->bytes_recvd < total_bytes) {
-			timeout.tv_sec = 1;
-			timeout.tv_usec = 0;
+			timeout.tv_sec = 0;
+			timeout.tv_usec = 10;
 
 			/* FD sets */
 			FD_ZERO(&w);
@@ -903,12 +903,10 @@ static int test_exec(int cgrp, struct sockmap_options *opt)
 {
 	int err = __test_exec(cgrp, SENDMSG, opt);
 
-	sched_yield();
 	if (err)
 		goto out;
 
 	err = __test_exec(cgrp, SENDPAGE, opt);
-	sched_yield();
 out:
 	return err;
 }
@@ -928,19 +926,18 @@ static int test_loop(int cgrp)
 	opt.iov_length = 0;
 	opt.rate = 0;
 
-	for (r = 1; r < 100; r += 33) {
-		for (i = 1; i < 100; i += 33) {
-			for (l = 1; l < 100; l += 33) {
-				opt.rate = r;
-				opt.iov_count = i;
-				opt.iov_length = l;
-				err = test_exec(cgrp, &opt);
-				if (err)
-					goto out;
-			}
+	r = 1;
+	for (i = 1; i < 100; i += 33) {
+		for (l = 1; l < 100; l += 33) {
+			opt.rate = r;
+			opt.iov_count = i;
+			opt.iov_length = l;
+			err = test_exec(cgrp, &opt);
+			if (err)
+				goto out;
 		}
 	}
-
+	sched_yield();
 out:
 	return err;
 }
@@ -1031,6 +1028,7 @@ static int test_send(struct sockmap_options *opt, int cgrp)
 	if (err)
 		goto out;
 out:
+	sched_yield();
 	return err;
 }
 
@@ -1168,7 +1166,7 @@ static int test_start_end(int cgrp)
 	opt.iov_length = 100;
 	txmsg_cork = 1600;
 
-	for (i = 99; i <= 1600; i += 100) {
+	for (i = 99; i <= 1600; i += 500) {
 		txmsg_start = 0;
 		txmsg_end = i;
 		err = test_exec(cgrp, &opt);
@@ -1177,7 +1175,7 @@ static int test_start_end(int cgrp)
 	}
 
 	/* Test start/end with cork but pull data in middle */
-	for (i = 199; i <= 1600; i += 100) {
+	for (i = 199; i <= 1600; i += 500) {
 		txmsg_start = 100;
 		txmsg_end = i;
 		err = test_exec(cgrp, &opt);
@@ -1221,6 +1219,7 @@ static int test_start_end(int cgrp)
 out:
 	txmsg_start = 0;
 	txmsg_end = 0;
+	sched_yield();
 	return err;
 }
 

^ permalink raw reply related

* Re: [PATCH] [PATCH bpf-next] samples/bpf/bpf_load.c: remove redundant ret, assignment in bpf_load_program()
From: Daniel Borkmann @ 2018-04-24 23:15 UTC (permalink / raw)
  To: shhuiw, ast, netdev
In-Reply-To: <a953464a-385a-38f3-df84-d9dac58b99b4@foxmail.com>

On 04/24/2018 10:18 AM, shhuiw wrote:
> 
> 2 redundant ret assignments removded:
> * 'ret = 1' before the logic 'if (data_maps)', and if any errors jump to
>   label 'done'. No 'ret = 1' needed before the error jump.
> * After the '/* load programs */' part, if everything goes well, then
>   the BPF code will be loaded and 'ret' set to 0 by load_and_attach().
>   If something goes wrong, 'ret' set to none-O, the redundant 'ret = 0'
>   after the for clause will make the error skipped.
>   For example, if some BPF code cannot provide supported program types
>   in ELF SEC("unknown"), the for clause will not call load_and_attach()
>   to load the BPF code. 1 should be returned to callees instead of 0.
> 
> Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>

Your patch is corrupted, please use something like git-send-email(1).

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH net-next 1/2 v2] netns: restrict uevents
From: Eric W. Biederman @ 2018-04-24 23:00 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christian Brauner, David Miller, netdev, linux-kernel, avagin,
	ktkhai, serge, gregkh
In-Reply-To: <CAPP7u0WH9w26Y9ai-EQTTeq7Rz_=7u2-=t4nhHmjwh-UAiyqeQ@mail.gmail.com>

Christian Brauner <christian.brauner@canonical.com> writes:

> On Wed, Apr 25, 2018, 00:41 Eric W. Biederman <ebiederm@xmission.com> wrote:
>
>  Bah. This code is obviously correct and probably wrong.
>
>  How do we deliver uevents for network devices that are outside of the
>  initial user namespace? The kernel still needs to deliver those.
>
>  The logic to figure out which network namespace a device needs to be
>  delivered to is is present in kobj_bcast_filter. That logic will almost
>  certainly need to be turned inside out. Sign not as easy as I would
>  have hoped.
>
> My first patch that we discussed put additional filtering logic into kobj_bcast_filter for that very reason. But I can move that logic
> out and come up with a new patch.

I may have mis-understood.

I heard and am still hearing additional filtering to reduce the places
the packet is delievered.

I am saying something needs to change to increase the number of places
the packet is delivered.

For the special class of devices that kobj_bcast_filter would apply to
those need to be delivered to netowrk namespaces  that are no longer on
uevent_sock_list.

So the code fundamentally needs to split into two paths.  Ordinary
devices that use uevent_sock_list.  Network devices that are just
delivered in their own network namespace.

netlink_broadcast_filtered gets to go away completely.
The logic of figuring out the network namespace though becomes trickier.

Now it may make sense to have all of that as an additional patch on top
of this one or perhaps a precursor patch that addresses the problem.  We
will unfortunately drop those uevents today because their uids are not
valid.  But they are not delivered anywhere else so to allow them to be
received we need to fix them.

Eric

>
>  Christian Brauner <christian.brauner@ubuntu.com> writes:
>  > commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>  >
>  > enabled sending hotplug events into all network namespaces back in 2010.
>  > Over time the set of uevents that get sent into all network namespaces has
>  > shrunk a little. We have now reached the point where hotplug events for all
>  > devices that carry a namespace tag are filtered according to that
>  > namespace. Specifically, they are filtered whenever the namespace tag of
>  > the kobject does not match the namespace tag of the netlink socket. One
>  > example are network devices. Uevents for network devices only show up in
>  > the network namespaces these devices are moved to or created in.
>  >
>  > However, any uevent for a kobject that does not have a namespace tag
>  > associated with it will not be filtered and we will broadcast it into all
>  > network namespaces. This behavior stopped making sense when user namespaces
>  > were introduced.
>  >
>  > This patch restricts uevents to the initial user namespace for a couple of
>  > reasons that have been extensively discusses on the mailing list [1].
>  > - Thundering herd:
>  > Broadcasting uevents into all network namespaces introduces significant
>  > overhead.
>  > All processes that listen to uevents running in non-initial user
>  > namespaces will end up responding to uevents that will be meaningless to
>  > them. Mainly, because non-initial user namespaces cannot easily manage
>  > devices unless they have a privileged host-process helping them out. This
>  > means that there will be a thundering herd of activity when there
>  > shouldn't be any.
>  > - Uevents from non-root users are already filtered in userspace:
>  > Uevents are filtered by userspace in a user namespace because the
>  > received uid != 0. Instead the uid associated with the event will be
>  > 65534 == "nobody" because the global root uid is not mapped.
>  > This means we can safely and without introducing regressions modify the
>  > kernel to not send uevents into all network namespaces whose owning user
>  > namespace is not the initial user namespace because we know that
>  > userspace will ignore the message because of the uid anyway. I have
>  > a) verified that is is true for every udev implementation out there b)
>  > that this behavior has been present in all udev implementations from the
>  > very beginning.
>  > - Removing needless overhead/Increasing performance:
>  > Currently, the uevent socket for each network namespace is added to the
>  > global variable uevent_sock_list. The list itself needs to be protected
>  > by a mutex. So everytime a uevent is generated the mutex is taken on the
>  > list. The mutex is held *from the creation of the uevent (memory
>  > allocation, string creation etc. until all uevent sockets have been
>  > handled*. This is aggravated by the fact that for each uevent socket that
>  > has listeners the mc_list must be walked as well which means we're
>  > talking O(n^2) here. Given that a standard Linux workload usually has
>  > quite a lot of network namespaces and - in the face of containers - a lot
>  > of user namespaces this quickly becomes a performance problem (see
>  > "Thundering herd" above). By just recording uevent sockets of network
>  > namespaces that are owned by the initial user namespace we significantly
>  > increase performance in this codepath.
>  > - Injecting uevents:
>  > There's a valid argument that containers might be interested in receiving
>  > device events especially if they are delegated to them by a privileged
>  > userspace process. One prime example are SR-IOV enabled devices that are
>  > explicitly designed to be handed of to other users such as VMs or
>  > containers.
>  > This use-case can now be correctly handled since
>  > commit 692ec06d7c92 ("netns: send uevent messages"). This commit
>  > introduced the ability to send uevents from userspace. As such we can let
>  > a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
>  > the network namespace of the netlink socket) userspace process make a
>  > decision what uevents should be sent. This removes the need to blindly
>  > broadcast uevents into all user namespaces and provides a performant and
>  > safe solution to this problem.
>  > - Filtering logic:
>  > This patch filters by *owning user namespace of the network namespace a
>  > given task resides in* and not by user namespace of the task per se. This
>  > means if the user namespace of a given task is unshared but the network
>  > namespace is kept and is owned by the initial user namespace a listener
>  > that is opening the uevent socket in that network namespace can still
>  > listen to uevents.
>  >
>  > [1]: https://lkml.org/lkml/2018/4/4/739
>  > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
>  > ---
>  > Changelog v1->v2:
>  > * patch unchanged
>  > Changelog v0->v1:
>  > * patch unchanged
>  > ---
>  > lib/kobject_uevent.c | 18 ++++++++++++------
>  > 1 file changed, 12 insertions(+), 6 deletions(-)
>  >
>  > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>  > index 15ea216a67ce..f5f5038787ac 100644
>  > --- a/lib/kobject_uevent.c
>  > +++ b/lib/kobject_uevent.c
>  > @@ -703,9 +703,13 @@ static int uevent_net_init(struct net *net)
>  > 
>  > net->uevent_sock = ue_sk;
>  > 
>  > - mutex_lock(&uevent_sock_mutex);
>  > - list_add_tail(&ue_sk->list, &uevent_sock_list);
>  > - mutex_unlock(&uevent_sock_mutex);
>  > + /* Restrict uevents to initial user namespace. */
>  > + if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
>  > + mutex_lock(&uevent_sock_mutex);
>  > + list_add_tail(&ue_sk->list, &uevent_sock_list);
>  > + mutex_unlock(&uevent_sock_mutex);
>  > + }
>  > +
>  > return 0;
>  > }
>  > 
>  > @@ -713,9 +717,11 @@ static void uevent_net_exit(struct net *net)
>  > {
>  > struct uevent_sock *ue_sk = net->uevent_sock;
>  > 
>  > - mutex_lock(&uevent_sock_mutex);
>  > - list_del(&ue_sk->list);
>  > - mutex_unlock(&uevent_sock_mutex);
>  > + if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
>  > + mutex_lock(&uevent_sock_mutex);
>  > + list_del(&ue_sk->list);
>  > + mutex_unlock(&uevent_sock_mutex);
>  > + }
>  > 
>  > netlink_kernel_release(ue_sk->sk);
>  > kfree(ue_sk);

^ permalink raw reply

* Re: [PATCH net-next 1/2 v2] netns: restrict uevents
From: Christian Brauner @ 2018-04-24 22:54 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: davem, netdev, linux-kernel, avagin, ktkhai, serge, gregkh
In-Reply-To: <87po2oz0s8.fsf@xmission.com>

On Tue, Apr 24, 2018 at 05:40:07PM -0500, Eric W. Biederman wrote:
> 
> Bah.  This code is obviously correct and probably wrong.
> 
> How do we deliver uevents for network devices that are outside of the
> initial user namespace?  The kernel still needs to deliver those.
> 
> The logic to figure out which network namespace a device needs to be
> delivered to is is present in kobj_bcast_filter.  That logic will almost
> certainly need to be turned inside out.  Sign not as easy as I would
> have hoped.

That's why my initial patch [1] added additional filtering logic to
kobj_bcast_filter(). But since we care about performance improvements as
well I can come up with a patch that moves this logic out of
kobj_bcast_filter().

Christian
[1]: https://www.spinics.net/lists/netdev/msg494487.html

> 
> Eric
> 
> Christian Brauner <christian.brauner@ubuntu.com> writes:
> > commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> >
> > enabled sending hotplug events into all network namespaces back in 2010.
> > Over time the set of uevents that get sent into all network namespaces has
> > shrunk a little. We have now reached the point where hotplug events for all
> > devices that carry a namespace tag are filtered according to that
> > namespace. Specifically, they are filtered whenever the namespace tag of
> > the kobject does not match the namespace tag of the netlink socket. One
> > example are network devices. Uevents for network devices only show up in
> > the network namespaces these devices are moved to or created in.
> >
> > However, any uevent for a kobject that does not have a namespace tag
> > associated with it will not be filtered and we will broadcast it into all
> > network namespaces. This behavior stopped making sense when user namespaces
> > were introduced.
> >
> > This patch restricts uevents to the initial user namespace for a couple of
> > reasons that have been extensively discusses on the mailing list [1].
> > - Thundering herd:
> >   Broadcasting uevents into all network namespaces introduces significant
> >   overhead.
> >   All processes that listen to uevents running in non-initial user
> >   namespaces will end up responding to uevents that will be meaningless to
> >   them. Mainly, because non-initial user namespaces cannot easily manage
> >   devices unless they have a privileged host-process helping them out. This
> >   means that there will be a thundering herd of activity when there
> >   shouldn't be any.
> > - Uevents from non-root users are already filtered in userspace:
> >   Uevents are filtered by userspace in a user namespace because the
> >   received uid != 0. Instead the uid associated with the event will be
> >   65534 == "nobody" because the global root uid is not mapped.
> >   This means we can safely and without introducing regressions modify the
> >   kernel to not send uevents into all network namespaces whose owning user
> >   namespace is not the initial user namespace because we know that
> >   userspace will ignore the message because of the uid anyway. I have
> >   a) verified that is is true for every udev implementation out there b)
> >   that this behavior has been present in all udev implementations from the
> >   very beginning.
> > - Removing needless overhead/Increasing performance:
> >   Currently, the uevent socket for each network namespace is added to the
> >   global variable uevent_sock_list. The list itself needs to be protected
> >   by a mutex. So everytime a uevent is generated the mutex is taken on the
> >   list. The mutex is held *from the creation of the uevent (memory
> >   allocation, string creation etc. until all uevent sockets have been
> >   handled*. This is aggravated by the fact that for each uevent socket that
> >   has listeners the mc_list must be walked as well which means we're
> >   talking O(n^2) here. Given that a standard Linux workload usually has
> >   quite a lot of network namespaces and - in the face of containers - a lot
> >   of user namespaces this quickly becomes a performance problem (see
> >   "Thundering herd" above). By just recording uevent sockets of network
> >   namespaces that are owned by the initial user namespace we significantly
> >   increase performance in this codepath.
> > - Injecting uevents:
> >   There's a valid argument that containers might be interested in receiving
> >   device events especially if they are delegated to them by a privileged
> >   userspace process. One prime example are SR-IOV enabled devices that are
> >   explicitly designed to be handed of to other users such as VMs or
> >   containers.
> >   This use-case can now be correctly handled since
> >   commit 692ec06d7c92 ("netns: send uevent messages"). This commit
> >   introduced the ability to send uevents from userspace. As such we can let
> >   a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
> >   the network namespace of the netlink socket) userspace process make a
> >   decision what uevents should be sent. This removes the need to blindly
> >   broadcast uevents into all user namespaces and provides a performant and
> >   safe solution to this problem.
> > - Filtering logic:
> >   This patch filters by *owning user namespace of the network namespace a
> >   given task resides in* and not by user namespace of the task per se. This
> >   means if the user namespace of a given task is unshared but the network
> >   namespace is kept and is owned by the initial user namespace a listener
> >   that is opening the uevent socket in that network namespace can still
> >   listen to uevents.
> >
> > [1]: https://lkml.org/lkml/2018/4/4/739
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> > ---
> > Changelog v1->v2:
> > * patch unchanged
> > Changelog v0->v1:
> > * patch unchanged
> > ---
> >  lib/kobject_uevent.c | 18 ++++++++++++------
> >  1 file changed, 12 insertions(+), 6 deletions(-)
> >
> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> > index 15ea216a67ce..f5f5038787ac 100644
> > --- a/lib/kobject_uevent.c
> > +++ b/lib/kobject_uevent.c
> > @@ -703,9 +703,13 @@ static int uevent_net_init(struct net *net)
> >  
> >  	net->uevent_sock = ue_sk;
> >  
> > -	mutex_lock(&uevent_sock_mutex);
> > -	list_add_tail(&ue_sk->list, &uevent_sock_list);
> > -	mutex_unlock(&uevent_sock_mutex);
> > +	/* Restrict uevents to initial user namespace. */
> > +	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> > +		mutex_lock(&uevent_sock_mutex);
> > +		list_add_tail(&ue_sk->list, &uevent_sock_list);
> > +		mutex_unlock(&uevent_sock_mutex);
> > +	}
> > +
> >  	return 0;
> >  }
> >  
> > @@ -713,9 +717,11 @@ static void uevent_net_exit(struct net *net)
> >  {
> >  	struct uevent_sock *ue_sk = net->uevent_sock;
> >  
> > -	mutex_lock(&uevent_sock_mutex);
> > -	list_del(&ue_sk->list);
> > -	mutex_unlock(&uevent_sock_mutex);
> > +	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> > +		mutex_lock(&uevent_sock_mutex);
> > +		list_del(&ue_sk->list);
> > +		mutex_unlock(&uevent_sock_mutex);
> > +	}
> >  
> >  	netlink_kernel_release(ue_sk->sk);
> >  	kfree(ue_sk);

^ permalink raw reply

* Re: [PATCH net-next 2/2 v2] netns: isolate seqnums to use per-netns locks
From: Eric W. Biederman @ 2018-04-24 22:52 UTC (permalink / raw)
  To: Christian Brauner
  Cc: davem, netdev, linux-kernel, avagin, ktkhai, serge, gregkh
In-Reply-To: <20180424222049.GA21073@gmail.com>

Christian Brauner <christian.brauner@canonical.com> writes:

> On Tue, Apr 24, 2018 at 04:52:20PM -0500, Eric W. Biederman wrote:
>> Christian Brauner <christian.brauner@ubuntu.com> writes:
>> 
>> > Now that it's possible to have a different set of uevents in different
>> > network namespaces, per-network namespace uevent sequence numbers are
>> > introduced. This increases performance as locking is now restricted to the
>> > network namespace affected by the uevent rather than locking everything.
>> > Testing revealed significant performance improvements. For details see
>> > "Testing" below.
>> 
>> Maybe.  Your locking is wrong, and a few other things are wrong.  see
>> below.
>
> Thanks for the review! Happy to rework this until it's in a mergeable shape.
>
>> 
>> > Since commit 692ec06 ("netns: send uevent messages") network namespaces not
>> > owned by the intial user namespace can be sent uevents from a sufficiently
>> > privileged userspace process.
>> > In order to send a uevent into a network namespace not owned by the initial
>> > user namespace we currently still need to take the *global mutex* that
>> > locks the uevent socket list even though the list *only contains network
>> > namespaces owned by the initial user namespace*. This needs to be done
>> > because the uevent counter is a global variable. Taking the global lock is
>> > performance sensitive since a user on the host can spawn a pool of n
>> > process that each create their own new user and network namespaces and then
>> > go on to inject uevents in parallel into the network namespace of all of
>> > these processes. This can have a significant performance impact for the
>> > host's udevd since it means that there can be a lot of delay between a
>> > device being added and the corresponding uevent being sent out and
>> > available for processing by udevd. It also means that each network
>> > namespace not owned by the initial user namespace which userspace has sent
>> > a uevent to will need to wait until the lock becomes available.
>> >
>> > Implementation:
>> > This patch gives each network namespace its own uevent sequence number.
>> > Each network namespace not owned by the initial user namespace receives its
>> > own mutex. The struct uevent_sock is opaque to callers outside of kobject.c
>> > so the mutex *can* and *is* only ever accessed in lib/kobject.c. In this
>> > file it is clearly documented which lock has to be taken. All network
>> > namespaces owned by the initial user namespace will still share the same
>> > lock since they are all served sequentially via the uevent socket list.
>> > This decouples the locking and ensures that the host retrieves uevents as
>> > fast as possible even if there are a lot of uevents injected into network
>> > namespaces not owned by the initial user namespace.  In addition, each
>> > network namespace not owned by the initial user namespace does not have to
>> > wait on any other network namespace not sharing the same user namespace.
>> >
>> > Testing:
>> > Two 4.17-rc1 test kernels were compiled. One with per netns uevent seqnums
>> > with decoupled locking and one without. To ensure that testing made sense
>> > both kernels carried the patch to remove network namespaces not owned by
>> > the initial user namespace from the uevent socket list.
>> > Three tests were constructed. All of them showed significant performance
>> > improvements with per-netns uevent sequence numbers and decoupled locking.
>> >
>> >  # Testcase 1:
>> >    Only Injecting Uevents into network namespaces not owned by the initial
>> >    user namespace.
>> >    - created 1000 new user namespace + network namespace pairs
>> >    - opened a uevent listener in each of those namespace pairs
>> >    - injected uevents into each of those network namespaces 10,000 times
>> >      meaning 10,000,000 (10 million) uevents were injected. (The high
>> >      number of uevent injections should get rid of a lot of jitter.)
>> >      The injection was done by fork()ing 1000 uevent injectors in a simple
>> >      for-loop to ensure that uevents were injected in parallel.
>> >    - mean transaction time was calculated:
>> >      - *without* uevent sequence number namespacing: 67 μs
>> >      - *with* uevent sequence number namespacing:    55 μs
>> >      - makes a difference of:                        12 μs
>> >    - a t-test was performed on the two data vectors which revealed
>> >      shows significant performance improvements:
>> >      Welch Two Sample t-test
>> >      data:  x1 and y1
>> >      t = 405.16, df = 18883000, p-value < 2.2e-16
>> >      alternative hypothesis: true difference in means is not equal to 0
>> >      95 percent confidence interval:
>> >      12.14949 12.26761
>> >      sample estimates:
>> >      mean of x mean of y
>> >      68.48594  56.27739
>> >
>> >  # Testcase 2:
>> >    Injecting Uevents into network namespaces not owned by the initial user
>> >    namespace and network namespaces owned by the initial user namespace.
>> >    - created 500 new user namespace + network namespace pairs
>> >    - created 500 new network namespace pairs
>> >    - opened a uevent listener in each of those namespace pairs
>> >    - injected uevents into each of those network namespaces 10,000 times
>> >      meaning 10,000,000 (10 million) uevents were injected. (The high
>> >      number of uevent injections should get rid of a lot of jitter.)
>> >      The injection was done by fork()ing 1000 uevent injectors in a simple
>> >      for-loop to ensure that uevents were injected in parallel.
>> >    - mean transaction time was calculated:
>> >      - *without* uevent sequence number namespacing: 572 μs
>> >      - *with* uevent sequence number namespacing:    514 μs
>> >      - makes a difference of:                         58 μs
>> >    - a t-test was performed on the two data vectors which revealed
>> >      shows significant performance improvements:
>> >      Welch Two Sample t-test
>> >      data:  x2 and y2
>> >      t = 38.685, df = 19682000, p-value < 2.2e-16
>> >      alternative hypothesis: true difference in means is not equal to 0
>> >      95 percent confidence interval:
>> >      55.10630 60.98815
>> >      sample estimates:
>> >      mean of x mean of y
>> >      572.9684  514.9211
>> >
>> >  # Testcase 3:
>> >    Created 500 new user namespace + network namespace pairs *without uevent
>> >    listeners*
>> >    - created 500 new network namespace pairs *without uevent listeners*
>> >    - injected uevents into each of those network namespaces 10,000 times
>> >      meaning 10,000,000 (10 million) uevents were injected. (The high number
>> >      of uevent injections should get rid of a lot of jitter.)
>> >      The injection was done by fork()ing 1000 uevent injectors in a simple
>> >      for-loop to ensure that uevents were injected in parallel.
>> >     - mean transaction time was calculated:
>> >       - *without* uevent sequence number namespacing: 206 μs
>> >       - *with* uevent sequence number namespacing:    163 μs
>> >       - makes a difference of:                         43 μs
>> >     - a t-test was performed on the two data vectors which revealed
>> >       shows significant performance improvements:
>> >       Welch Two Sample t-test
>> >       data:  x3 and y3
>> >       t = 58.37, df = 17711000, p-value < 2.2e-16
>> >       alternative hypothesis: true difference in means is not equal to 0
>> >       95 percent confidence interval:
>> >       41.77860 44.68178
>> >       sample estimates:
>> >       mean of x mean of y
>> >       207.2632  164.0330
>> >
>> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
>> > ---
>> > Changelog v1->v2:
>> > * non-functional change: fix indendation for C directives in
>> >   kernel/ksysfs.c
>> > Changelog v0->v1:
>> > * add detailed test results to the commit message
>> > * account for kernels compiled without CONFIG_NET
>> > ---
>> >  include/linux/kobject.h     |   2 +
>> >  include/net/net_namespace.h |   3 ++
>> >  kernel/ksysfs.c             |  11 +++-
>> >  lib/kobject_uevent.c        | 104 +++++++++++++++++++++++++++++-------
>> >  net/core/net_namespace.c    |  14 +++++
>> >  5 files changed, 114 insertions(+), 20 deletions(-)
>> >
>> > diff --git a/include/linux/kobject.h b/include/linux/kobject.h
>> > index 7f6f93c3df9c..4e608968907f 100644
>> > --- a/include/linux/kobject.h
>> > +++ b/include/linux/kobject.h
>> > @@ -36,8 +36,10 @@
>> >  extern char uevent_helper[];
>> >  #endif
>> >  
>> > +#ifndef CONFIG_NET
>> >  /* counter to tag the uevent, read only except for the kobject core */
>> >  extern u64 uevent_seqnum;
>> > +#endif
>> 
>> That smells like an implementation bug somewhere.
>
> Sorry, I'm not following. I'm I'm not mistaken there won't be any struct
> net when CONFIG_NET=n. This has been reported by kbuild robot with alpha
> and CONFIG_NET=n.

The Kbuild robot isn't wrong.  I am just saying this likely comes up
because the code is not clean.

If uevent_seqnum is tied to a network namespace than it doesn't make
sense to exist without netork namespaces.

At the very least this feels like an untested special case.  Untested
special cases tend to break when people are not paying attention which
can be dangerous.

>> >  /*
>> >   * The actions here must match the index to the string array
>> > diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
>> > index 47e35cce3b64..e4e171b1ba69 100644
>> > --- a/include/net/net_namespace.h
>> > +++ b/include/net/net_namespace.h
>> > @@ -85,6 +85,8 @@ struct net {
>> >  	struct sock		*genl_sock;
>> >  
>> >  	struct uevent_sock	*uevent_sock;		/* uevent socket */
>> > +	/* counter to tag the uevent, read only except for the kobject core */
>> > +	u64                     uevent_seqnum;
>> >  
>> >  	struct list_head 	dev_base_head;
>> >  	struct hlist_head 	*dev_name_head;
>> > @@ -189,6 +191,7 @@ extern struct list_head net_namespace_list;
>> >  
>> >  struct net *get_net_ns_by_pid(pid_t pid);
>> >  struct net *get_net_ns_by_fd(int fd);
>> > +u64 get_ns_uevent_seqnum_by_vpid(void);
>> >  
>> >  #ifdef CONFIG_SYSCTL
>> >  void ipx_register_sysctl(void);
>> > diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
>> > index 46ba853656f6..38b70b90a21f 100644
>> > --- a/kernel/ksysfs.c
>> > +++ b/kernel/ksysfs.c
>> > @@ -19,6 +19,7 @@
>> >  #include <linux/sched.h>
>> >  #include <linux/capability.h>
>> >  #include <linux/compiler.h>
>> > +#include <net/net_namespace.h>
>> >  
>> >  #include <linux/rcupdate.h>	/* rcu_expedited and rcu_normal */
>> >  
>> > @@ -33,7 +34,15 @@ static struct kobj_attribute _name##_attr = \
>> >  static ssize_t uevent_seqnum_show(struct kobject *kobj,
>> >  				  struct kobj_attribute *attr, char *buf)
>> >  {
>> > -	return sprintf(buf, "%llu\n", (unsigned long long)uevent_seqnum);
>> > +	u64 seqnum;
>> > +
>> > +#ifdef CONFIG_NET
>> > +	seqnum = get_ns_uevent_seqnum_by_vpid();
>> > +#else
>> > +	seqnum = uevent_seqnum;
>> > +#endif
>> 
>> This can be simplified to be just:
>> 	seqnum = current->nsproxy->net_ns->uevent_seqnum;
>
> Does that work even if CONFIG_NET=n?

No.  It is a NULL pointer dereference but the net_ns member continues
to exist.

>> Except that is not correct either.  As every instance of sysfs has a
>> network namespace associated with it, and you are not fetching that
>> network namespace.
>
> I'm not yet familiar with all aspects of sysfs so thanks for pointing
> that out. Then I'll try to come up with a way to fetch the network
> namespace associated with sysfs. Unless you already know exactly how to
> do this and can point it out.
> This would also lets us drop get_ns_uevent_seqnum_by_vpid().

Everywhere else the distinction has been at a directory level.  With one
the set of directory entries being different depending on which network
namespace sysfs was mounted in.  For /sys/kernel/seqnum that looks like
fresh work.

Plus I expect there are some embedded systems without networking that
may still need uevents via /sbin/hotplug. Which makes this doubly
tricky.

I honestly don't know the answer to this one.

>> Typically this would call for making this file per network namespace
>> so you would have this information available.  Sigh.  I don't know if
>> there is an easy way to do that for this file.
>> 
>> > +
>> > +	return sprintf(buf, "%llu\n", (unsigned long long)seqnum);
>> >  }
>> >  KERNEL_ATTR_RO(uevent_seqnum);
>> >  
>> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>> > index f5f5038787ac..5da20def556d 100644
>> > --- a/lib/kobject_uevent.c
>> > +++ b/lib/kobject_uevent.c
>> > @@ -29,21 +29,42 @@
>> >  #include <net/net_namespace.h>
>> >  
>> >  
>> > +#ifndef CONFIG_NET
>> >  u64 uevent_seqnum;
>> > +#endif
>> > +
>> >  #ifdef CONFIG_UEVENT_HELPER
>> >  char uevent_helper[UEVENT_HELPER_PATH_LEN] = CONFIG_UEVENT_HELPER_PATH;
>> >  #endif
>> >  
>> > +/*
>> > + * Size a buffer needs to be in order to hold the largest possible sequence
>> > + * number stored in a u64 including \0 byte: 2^64 - 1 = 21 chars.
>> > + */
>> > +#define SEQNUM_BUFSIZE (sizeof("SEQNUM=") + 21)
>> >  struct uevent_sock {
>> >  	struct list_head list;
>> >  	struct sock *sk;
>> > +	/*
>> > +	 * This mutex protects uevent sockets and the uevent counter of
>> > +	 * network namespaces *not* owned by init_user_ns.
>> > +	 * For network namespaces owned by init_user_ns this lock is *not*
>> > +	 * valid instead the global uevent_sock_mutex must be used!
>> > +	 */
>> > +	struct mutex sk_mutex;
>> >  };
>> >  
>> >  #ifdef CONFIG_NET
>> >  static LIST_HEAD(uevent_sock_list);
>> >  #endif
>> >  
>> > -/* This lock protects uevent_seqnum and uevent_sock_list */
>> > +/*
>> > + * This mutex protects uevent sockets and the uevent counter of network
>> > + * namespaces owned by init_user_ns.
>> > + * For network namespaces not owned by init_user_ns this lock is *not*
>> > + * valid instead the network namespace specific sk_mutex in struct
>> > + * uevent_sock must be used!
>> > + */
>> >  static DEFINE_MUTEX(uevent_sock_mutex);
>> >  
>> >  /* the strings here must match the enum in include/linux/kobject.h */
>> > @@ -253,6 +274,22 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>> >  
>> >  	return 0;
>> >  }
>> > +
>> > +static bool can_hold_seqnum(const struct kobj_uevent_env *env, size_t len)
>> > +{
>> > +	if (env->envp_idx >= ARRAY_SIZE(env->envp)) {
>> > +		WARN(1, KERN_ERR "Failed to append sequence number. "
>> > +		     "Too many uevent variables\n");
>> > +		return false;
>> > +	}
>> > +
>> > +	if ((env->buflen + len) > UEVENT_BUFFER_SIZE) {
>> > +		WARN(1, KERN_ERR "Insufficient space to append sequence number\n");
>> > +		return false;
>> > +	}
>> > +
>> > +	return true;
>> > +}
>> >  #endif
>> >  
>> >  #ifdef CONFIG_UEVENT_HELPER
>> > @@ -308,18 +345,22 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
>> >  
>> >  	/* send netlink message */
>> >  	list_for_each_entry(ue_sk, &uevent_sock_list, list) {
>> > +		/* bump sequence number */
>> > +		u64 seqnum = ++sock_net(ue_sk->sk)->uevent_seqnum;
>> >  		struct sock *uevent_sock = ue_sk->sk;
>> > +		char buf[SEQNUM_BUFSIZE];
>> >  
>> >  		if (!netlink_has_listeners(uevent_sock, 1))
>> >  			continue;
>> >  
>> >  		if (!skb) {
>> > -			/* allocate message with the maximum possible size */
>> > +			/* calculate header length */
>> >  			size_t len = strlen(action_string) + strlen(devpath) + 2;
>> >  			char *scratch;
>> >  
>> > +			/* allocate message with the maximum possible size */
>> >  			retval = -ENOMEM;
>> > -			skb = alloc_skb(len + env->buflen, GFP_KERNEL);
>> > +			skb = alloc_skb(len + env->buflen + SEQNUM_BUFSIZE, GFP_KERNEL);
>> >  			if (!skb)
>> >  				continue;
>> >  
>> > @@ -327,11 +368,24 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
>> >  			scratch = skb_put(skb, len);
>> >  			sprintf(scratch, "%s@%s", action_string, devpath);
>> >  
>> > +			/* add env */
>> >  			skb_put_data(skb, env->buf, env->buflen);
>> >  
>> >  			NETLINK_CB(skb).dst_group = 1;
>> >  		}
>> >  
>> > +		/* prepare netns seqnum */
>> > +		retval = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu", seqnum);
>> > +		if (retval < 0 || retval >= SEQNUM_BUFSIZE)
>> > +			continue;
>> > +		retval++;
>> > +
>> > +		if (!can_hold_seqnum(env, retval))
>> > +			continue;
>> 
>> You have allocated enough space in the skb why does can_hold_seqnum make
>> sense?
>
> Because it doesn't check whether the socket buffer can hold the sequence
> number but checks whether the uevent buffer size in "env" can hold it.
> uevents are only delivered if the env buffer is large enough to hold all
> of the info including the sequence number. That's independent of the
> socket buffer.

The practical question is why does that get called every time through
the loop?

>> 
>> Do you need to back seqnum out of the env later for this to work twice
>> in a row?
>
> I guess I can just override it. It just felt cleaner to trim it.

You trim it from the socket, but what about the environment?

>
>> 
>> > +
>> > +		/* append netns seqnum */
>> > +		skb_put_data(skb, buf, retval);
>> > +
>> >  		retval = netlink_broadcast_filtered(uevent_sock, skb_get(skb),
>> >  						    0, 1, GFP_KERNEL,
>> >  						    kobj_bcast_filter,
>> > @@ -339,8 +393,13 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
>> >  		/* ENOBUFS should be handled in userspace */
>> >  		if (retval == -ENOBUFS || retval == -ESRCH)
>> >  			retval = 0;
>> > +
>> > +		/* remove netns seqnum */
>> > +		skb_trim(skb, env->buflen);
>> 
>> Have you checked to see if the seqnum actually makes it to userspace.
>
> Yes, I did. I also wonder why it wouldn't make it. Any specific reason
> why you suspect this?

trim?  I am not really certain what that does and if when you deliever
to multiple network namespaces if it might not cause the wrong seqnum
to be delivered.  You are using the same initial buffer.

I haven't stared at the socket buffer code enough to remember off hand
what the semantics are.  Perhaps whatever netlink_broadcast does
is sufficient to prevent issues.

>
>> >  	}
>> >  	consume_skb(skb);
>> > +#else
>> > +	uevent_seqnum++;
>> >  #endif
>> >  	return retval;
>> >  }
>> > @@ -510,14 +569,7 @@ int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
>> >  	}
>> >  
>> >  	mutex_lock(&uevent_sock_mutex);
>> > -	/* we will send an event, so request a new sequence number */
>> > -	retval = add_uevent_var(env, "SEQNUM=%llu", (unsigned long long)++uevent_seqnum);
>> > -	if (retval) {
>> > -		mutex_unlock(&uevent_sock_mutex);
>> > -		goto exit;
>> > -	}
>> > -	retval = kobject_uevent_net_broadcast(kobj, env, action_string,
>> > -					      devpath);
>> > +	retval = kobject_uevent_net_broadcast(kobj, env, action_string, devpath);
>> >  	mutex_unlock(&uevent_sock_mutex);
>> 
>> How does all of this work with events for network devices that are not
>> in the initial network namespace.  This looks to me like this code fails
>> to take the sk_mutex.
>
> But in this list only non-initial network namespaces that are owned by
> the initial user namespace are recorded and for these uevent_sock_mutex
> has to be taken. Am I missing something?

The bug I missed in your first patch.  We still have to deliver uevents
to secondary network namespaces for their network devices.
At which point you need your new sk_mutex.


>> >  #ifdef CONFIG_UEVENT_HELPER
>> > @@ -605,17 +657,18 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
>> >  EXPORT_SYMBOL_GPL(add_uevent_var);
>> >  
>> >  #if defined(CONFIG_NET)
>> > -static int uevent_net_broadcast(struct sock *usk, struct sk_buff *skb,
>> > +static int uevent_net_broadcast(struct uevent_sock *ue_sk, struct sk_buff *skb,
>> >  				struct netlink_ext_ack *extack)
>> >  {
>> > -	/* u64 to chars: 2^64 - 1 = 21 chars */
>> > -	char buf[sizeof("SEQNUM=") + 21];
>> > +	struct sock *usk = ue_sk->sk;
>> > +	char buf[SEQNUM_BUFSIZE];
>> >  	struct sk_buff *skbc;
>> >  	int ret;
>> >  
>> >  	/* bump and prepare sequence number */
>> > -	ret = snprintf(buf, sizeof(buf), "SEQNUM=%llu", ++uevent_seqnum);
>> > -	if (ret < 0 || (size_t)ret >= sizeof(buf))
>> > +	ret = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu",
>> > +		       ++sock_net(ue_sk->sk)->uevent_seqnum);
>> > +	if (ret < 0 || ret >= SEQNUM_BUFSIZE)
>> >  		return -ENOMEM;
>> >  	ret++;
>> >  
>> > @@ -668,9 +721,15 @@ static int uevent_net_rcv_skb(struct sk_buff *skb, struct nlmsghdr *nlh,
>> >  		return -EPERM;
>> >  	}
>> >  
>> > -	mutex_lock(&uevent_sock_mutex);
>> > -	ret = uevent_net_broadcast(net->uevent_sock->sk, skb, extack);
>> > -	mutex_unlock(&uevent_sock_mutex);
>> > +	if (net->user_ns == &init_user_ns)
>> > +		mutex_lock(&uevent_sock_mutex);
>> > +	else
>> > +		mutex_lock(&net->uevent_sock->sk_mutex);
>> > +	ret = uevent_net_broadcast(net->uevent_sock, skb, extack);
>> > +	if (net->user_ns == &init_user_ns)
>> > +		mutex_unlock(&uevent_sock_mutex);
>> > +	else
>> > +		mutex_unlock(&net->uevent_sock->sk_mutex);
>> >  
>> >  	return ret;
>> >  }
>> > @@ -708,6 +767,13 @@ static int uevent_net_init(struct net *net)
>> >  		mutex_lock(&uevent_sock_mutex);
>> >  		list_add_tail(&ue_sk->list, &uevent_sock_list);
>> >  		mutex_unlock(&uevent_sock_mutex);
>> > +	} else {
>> > +		/*
>> > +		 * Uevent sockets and counters for network namespaces
>> > +		 * not owned by the initial user namespace have their
>> > +		 * own mutex.
>> > +		 */
>> > +		mutex_init(&ue_sk->sk_mutex);
>> >  	}
>> >  
>> >  	return 0;
>> > diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
>> > index a11e03f920d3..8894638f5150 100644
>> > --- a/net/core/net_namespace.c
>> > +++ b/net/core/net_namespace.c
>> > @@ -618,6 +618,20 @@ struct net *get_net_ns_by_pid(pid_t pid)
>> >  }
>> >  EXPORT_SYMBOL_GPL(get_net_ns_by_pid);
>> >  
>> > +u64 get_ns_uevent_seqnum_by_vpid(void)
>> > +{
>> > +	pid_t cur_pid;
>> > +	struct net *net;
>> > +
>> > +	cur_pid = task_pid_vnr(current);
>> > +	net = get_net_ns_by_pid(cur_pid);
>> > +	if (IS_ERR(net))
>> > +		return 0;
>> > +
>> > +	return net->uevent_seqnum;
>> > +}
>> > +EXPORT_SYMBOL_GPL(get_ns_uevent_seqnum_by_vpid);
>> 
>> I just have to say this function is completely crazy.
>> You go from the tsk to the pid back to the tsk.
>> And you leak a struct net pointer.
>> 
>> It is much simpler and less racy to say:
>> 
>> 	current->nsproxy->net_ns->uevent_seqnum;
>> 
>> That you are accessing current->nsproxy means nsproxy can't
>> change.  The rcu_read_lock etc that get_net_ns_by_pid does
>> is there for accessing non-current tasks.
>> 
>> 
>> 
>> >  static __net_init int net_ns_net_init(struct net *net)
>> >  {
>> >  #ifdef CONFIG_NET_NS

^ permalink raw reply

* Re: [PATCH net-next 1/2 v2] netns: restrict uevents
From: Eric W. Biederman @ 2018-04-24 22:40 UTC (permalink / raw)
  To: Christian Brauner
  Cc: davem, netdev, linux-kernel, avagin, ktkhai, serge, gregkh
In-Reply-To: <20180424204335.12904-2-christian.brauner@ubuntu.com>


Bah.  This code is obviously correct and probably wrong.

How do we deliver uevents for network devices that are outside of the
initial user namespace?  The kernel still needs to deliver those.

The logic to figure out which network namespace a device needs to be
delivered to is is present in kobj_bcast_filter.  That logic will almost
certainly need to be turned inside out.  Sign not as easy as I would
have hoped.

Eric

Christian Brauner <christian.brauner@ubuntu.com> writes:
> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>
> enabled sending hotplug events into all network namespaces back in 2010.
> Over time the set of uevents that get sent into all network namespaces has
> shrunk a little. We have now reached the point where hotplug events for all
> devices that carry a namespace tag are filtered according to that
> namespace. Specifically, they are filtered whenever the namespace tag of
> the kobject does not match the namespace tag of the netlink socket. One
> example are network devices. Uevents for network devices only show up in
> the network namespaces these devices are moved to or created in.
>
> However, any uevent for a kobject that does not have a namespace tag
> associated with it will not be filtered and we will broadcast it into all
> network namespaces. This behavior stopped making sense when user namespaces
> were introduced.
>
> This patch restricts uevents to the initial user namespace for a couple of
> reasons that have been extensively discusses on the mailing list [1].
> - Thundering herd:
>   Broadcasting uevents into all network namespaces introduces significant
>   overhead.
>   All processes that listen to uevents running in non-initial user
>   namespaces will end up responding to uevents that will be meaningless to
>   them. Mainly, because non-initial user namespaces cannot easily manage
>   devices unless they have a privileged host-process helping them out. This
>   means that there will be a thundering herd of activity when there
>   shouldn't be any.
> - Uevents from non-root users are already filtered in userspace:
>   Uevents are filtered by userspace in a user namespace because the
>   received uid != 0. Instead the uid associated with the event will be
>   65534 == "nobody" because the global root uid is not mapped.
>   This means we can safely and without introducing regressions modify the
>   kernel to not send uevents into all network namespaces whose owning user
>   namespace is not the initial user namespace because we know that
>   userspace will ignore the message because of the uid anyway. I have
>   a) verified that is is true for every udev implementation out there b)
>   that this behavior has been present in all udev implementations from the
>   very beginning.
> - Removing needless overhead/Increasing performance:
>   Currently, the uevent socket for each network namespace is added to the
>   global variable uevent_sock_list. The list itself needs to be protected
>   by a mutex. So everytime a uevent is generated the mutex is taken on the
>   list. The mutex is held *from the creation of the uevent (memory
>   allocation, string creation etc. until all uevent sockets have been
>   handled*. This is aggravated by the fact that for each uevent socket that
>   has listeners the mc_list must be walked as well which means we're
>   talking O(n^2) here. Given that a standard Linux workload usually has
>   quite a lot of network namespaces and - in the face of containers - a lot
>   of user namespaces this quickly becomes a performance problem (see
>   "Thundering herd" above). By just recording uevent sockets of network
>   namespaces that are owned by the initial user namespace we significantly
>   increase performance in this codepath.
> - Injecting uevents:
>   There's a valid argument that containers might be interested in receiving
>   device events especially if they are delegated to them by a privileged
>   userspace process. One prime example are SR-IOV enabled devices that are
>   explicitly designed to be handed of to other users such as VMs or
>   containers.
>   This use-case can now be correctly handled since
>   commit 692ec06d7c92 ("netns: send uevent messages"). This commit
>   introduced the ability to send uevents from userspace. As such we can let
>   a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
>   the network namespace of the netlink socket) userspace process make a
>   decision what uevents should be sent. This removes the need to blindly
>   broadcast uevents into all user namespaces and provides a performant and
>   safe solution to this problem.
> - Filtering logic:
>   This patch filters by *owning user namespace of the network namespace a
>   given task resides in* and not by user namespace of the task per se. This
>   means if the user namespace of a given task is unshared but the network
>   namespace is kept and is owned by the initial user namespace a listener
>   that is opening the uevent socket in that network namespace can still
>   listen to uevents.
>
> [1]: https://lkml.org/lkml/2018/4/4/739
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---
> Changelog v1->v2:
> * patch unchanged
> Changelog v0->v1:
> * patch unchanged
> ---
>  lib/kobject_uevent.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> index 15ea216a67ce..f5f5038787ac 100644
> --- a/lib/kobject_uevent.c
> +++ b/lib/kobject_uevent.c
> @@ -703,9 +703,13 @@ static int uevent_net_init(struct net *net)
>  
>  	net->uevent_sock = ue_sk;
>  
> -	mutex_lock(&uevent_sock_mutex);
> -	list_add_tail(&ue_sk->list, &uevent_sock_list);
> -	mutex_unlock(&uevent_sock_mutex);
> +	/* Restrict uevents to initial user namespace. */
> +	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> +		mutex_lock(&uevent_sock_mutex);
> +		list_add_tail(&ue_sk->list, &uevent_sock_list);
> +		mutex_unlock(&uevent_sock_mutex);
> +	}
> +
>  	return 0;
>  }
>  
> @@ -713,9 +717,11 @@ static void uevent_net_exit(struct net *net)
>  {
>  	struct uevent_sock *ue_sk = net->uevent_sock;
>  
> -	mutex_lock(&uevent_sock_mutex);
> -	list_del(&ue_sk->list);
> -	mutex_unlock(&uevent_sock_mutex);
> +	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> +		mutex_lock(&uevent_sock_mutex);
> +		list_del(&ue_sk->list);
> +		mutex_unlock(&uevent_sock_mutex);
> +	}
>  
>  	netlink_kernel_release(ue_sk->sk);
>  	kfree(ue_sk);

^ permalink raw reply

* Re: [bpf-next PATCH 0/4] selftests for BPF sockmap use cases
From: Daniel Borkmann @ 2018-04-24 22:22 UTC (permalink / raw)
  To: John Fastabend, ast; +Cc: netdev
In-Reply-To: <20180423212506.20332.6762.stgit@john-Precision-Tower-5810>

On 04/23/2018 11:30 PM, John Fastabend wrote:
> This series moves ./samples/sockmap into BPF selftests. There are a
> few good reasons to do this. First, by pushing this into selftests
> the tests will be run automatically. Second, sockmap was not really
> a sample of anything anymore, but rather a large set of tests.
> 
> Note: There are three recent fixes outstanding against bpf branch
> that can be detected occosionally by the automated tests here.
> 
> https://patchwork.ozlabs.org/patch/903138/
> https://patchwork.ozlabs.org/patch/903139/
> https://patchwork.ozlabs.org/patch/903140/
> 
> ---
> 
> John Fastabend (4):
>       bpf: sockmap, code sockmap_test in C
>       bpf: sockmap, add a set of tests to run by default
>       bpf: sockmap, add selftests
>       bpf: sockmap, remove samples program

Applied to bpf-next, thanks John!

^ permalink raw reply

* Re: [pci PATCH v8 0/4] Add support for unmanaged SR-IOV
From: Alexander Duyck @ 2018-04-24 22:22 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Alexander Duyck, Bjorn Helgaas, linux-pci, virtio-dev, kvm,
	Netdev, Daly, Dan, LKML, linux-nvme, Keith Busch, netanel,
	Don Dutile, Maximilian Heyne, Wang, Liang-min, Rustad, Mark D,
	David Woodhouse, Christoph Hellwig, dwmw
In-Reply-To: <20180424215150.GB72698@bhelgaas-glaptop.roam.corp.google.com>

On Tue, Apr 24, 2018 at 2:51 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> On Sat, Apr 21, 2018 at 05:22:27PM -0700, Alexander Duyck wrote:
>> On Sat, Apr 21, 2018 at 1:34 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
>
>> > For example, I'm not sure what you mean by "devices where the PF is
>> > not capable of managing VF resources."
>> >
>> > It *sounds* like you're saying the hardware works differently on some
>> > devices, but I don't think that's what you mean.  I think you're
>> > saying something about which drivers are used for the PF and the VF.
>>
>> That is sort of what I am saying.
>>
>> So for example with ixgbe there is functionality which is controlled
>> in the MMIO space of the PF that affects the functionality of the VFs
>> that are generated on the device. The PF has to rearrange the
>> resources such as queues and interrupts on the device before it can
>> enable SR-IOV, and it could alter those later to limit what the VF is
>> capable of doing.
>>
>> The model I am dealing with via this patch set has a PF that is not
>> much different than the VFs other than the fact that it has some
>> extended configuration space bits in place for SR-IOV, ARI, ACS, and
>> whatever other bits are needed in order to support spawning isolated
>> VFs.
>
> OK, thanks for the explanation, I think I understand what's going on
> now, correct me if I'm mistaken.  I added a hint about "PF" for Randy,
> too.
>
> These are on pci/virtualization for v4.18.

I reviewed them and all of the changes to patches 1 & 2 both below,
and in the tree look good to me.

Thanks for taking care of all this.

- Alex

> commit 8effc395c209
> Author: Alexander Duyck <alexander.h.duyck@intel.com>
> Date:   Sat Apr 21 15:23:09 2018 -0500
>
>     PCI/IOV: Add pci_sriov_configure_simple()
>
>     SR-IOV (Single Root I/O Virtualization) is an optional PCIe capability (see
>     PCIe r4.0, sec 9).  A PCIe Function with the SR-IOV capability is referred
>     to as a PF (Physical Function).  If SR-IOV is enabled on the PF, several
>     VFs (Virtual Functions) may be created.  The VFs can be individually
>     assigned to virtual machines, which allows them to share a single hardware
>     device while being isolated from each other.
>
>     Some SR-IOV devices have resources such as queues and interrupts that must
>     be set up in the PF before enabling the VFs, so they require a PF driver to
>     do that.
>
>     Other SR-IOV devices don't require any PF setup before enabling VFs.  Add a
>     pci_sriov_configure_simple() interface so PF drivers for such devices can
>     use it without repeating the VF-enabling code.
>
>     Tested-by: Mark Rustad <mark.d.rustad@intel.com>
>     Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>     [bhelgaas: changelog, comment]
>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>     Reviewed-by: Greg Rose <gvrose8192@gmail.com>
>     Reviewed-by: Christoph Hellwig <hch@lst.de>:wq
>
> commit a8ccf8a66663
> Author: Alexander Duyck <alexander.h.duyck@intel.com>
> Date:   Tue Apr 24 16:47:16 2018 -0500
>
>     PCI/IOV: Add pci-pf-stub driver for PFs that only enable VFs
>
>     Some SR-IOV PF devices provide no functionality other than acting as a
>     means of enabling VFs.  For these devices, we want to enable the VFs and
>     assign them to guest virtual machines, but there's no need to have a driver
>     for the PF itself.
>
>     Add a new pci-pf-stub driver to claim those PF devices and provide the
>     generic VF enable functionality.  An administrator can use the sysfs
>     "sriov_numvfs" file to enable VFs, then assign them to guests.
>
>     For now I only have one example ID provided by Amazon in terms of devices
>     that require this functionality.  The general idea is that in the future we
>     will see other devices added as vendors come up with devices where the PF
>     is more or less just a lightweight shim used to allocate VFs.
>
>     Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>     [bhelgaas: changelog]
>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>     Reviewed-by: Greg Rose <gvrose8192@gmail.com>
>     Reviewed-by: Christoph Hellwig <hch@lst.de>
>
> commit 115ddc491922
> Author: Alexander Duyck <alexander.h.duyck@intel.com>
> Date:   Tue Apr 24 16:47:22 2018 -0500
>
>     net: ena: Use pci_sriov_configure_simple() to enable VFs
>
>     Instead of implementing our own version of a SR-IOV configuration stub in
>     the ena driver, use the existing pci_sriov_configure_simple() function.
>
>     Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>     Reviewed-by: Greg Rose <gvrose8192@gmail.com>
>     Reviewed-by: Christoph Hellwig <hch@lst.de>
>
> commit 74d986abc20b
> Author: Alexander Duyck <alexander.h.duyck@intel.com>
> Date:   Tue Apr 24 16:47:27 2018 -0500
>
>     nvme-pci: Use pci_sriov_configure_simple() to enable VFs
>
>     Instead of implementing our own version of a SR-IOV configuration stub in
>     the nvme driver, use the existing pci_sriov_configure_simple() function.
>
>     Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>     Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH net-next 2/2 v2] netns: isolate seqnums to use per-netns locks
From: Christian Brauner @ 2018-04-24 22:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: davem, netdev, linux-kernel, avagin, ktkhai, serge, gregkh
In-Reply-To: <87k1sw46i3.fsf@xmission.com>

On Tue, Apr 24, 2018 at 04:52:20PM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@ubuntu.com> writes:
> 
> > Now that it's possible to have a different set of uevents in different
> > network namespaces, per-network namespace uevent sequence numbers are
> > introduced. This increases performance as locking is now restricted to the
> > network namespace affected by the uevent rather than locking everything.
> > Testing revealed significant performance improvements. For details see
> > "Testing" below.
> 
> Maybe.  Your locking is wrong, and a few other things are wrong.  see
> below.

Thanks for the review! Happy to rework this until it's in a mergeable shape.

> 
> > Since commit 692ec06 ("netns: send uevent messages") network namespaces not
> > owned by the intial user namespace can be sent uevents from a sufficiently
> > privileged userspace process.
> > In order to send a uevent into a network namespace not owned by the initial
> > user namespace we currently still need to take the *global mutex* that
> > locks the uevent socket list even though the list *only contains network
> > namespaces owned by the initial user namespace*. This needs to be done
> > because the uevent counter is a global variable. Taking the global lock is
> > performance sensitive since a user on the host can spawn a pool of n
> > process that each create their own new user and network namespaces and then
> > go on to inject uevents in parallel into the network namespace of all of
> > these processes. This can have a significant performance impact for the
> > host's udevd since it means that there can be a lot of delay between a
> > device being added and the corresponding uevent being sent out and
> > available for processing by udevd. It also means that each network
> > namespace not owned by the initial user namespace which userspace has sent
> > a uevent to will need to wait until the lock becomes available.
> >
> > Implementation:
> > This patch gives each network namespace its own uevent sequence number.
> > Each network namespace not owned by the initial user namespace receives its
> > own mutex. The struct uevent_sock is opaque to callers outside of kobject.c
> > so the mutex *can* and *is* only ever accessed in lib/kobject.c. In this
> > file it is clearly documented which lock has to be taken. All network
> > namespaces owned by the initial user namespace will still share the same
> > lock since they are all served sequentially via the uevent socket list.
> > This decouples the locking and ensures that the host retrieves uevents as
> > fast as possible even if there are a lot of uevents injected into network
> > namespaces not owned by the initial user namespace.  In addition, each
> > network namespace not owned by the initial user namespace does not have to
> > wait on any other network namespace not sharing the same user namespace.
> >
> > Testing:
> > Two 4.17-rc1 test kernels were compiled. One with per netns uevent seqnums
> > with decoupled locking and one without. To ensure that testing made sense
> > both kernels carried the patch to remove network namespaces not owned by
> > the initial user namespace from the uevent socket list.
> > Three tests were constructed. All of them showed significant performance
> > improvements with per-netns uevent sequence numbers and decoupled locking.
> >
> >  # Testcase 1:
> >    Only Injecting Uevents into network namespaces not owned by the initial
> >    user namespace.
> >    - created 1000 new user namespace + network namespace pairs
> >    - opened a uevent listener in each of those namespace pairs
> >    - injected uevents into each of those network namespaces 10,000 times
> >      meaning 10,000,000 (10 million) uevents were injected. (The high
> >      number of uevent injections should get rid of a lot of jitter.)
> >      The injection was done by fork()ing 1000 uevent injectors in a simple
> >      for-loop to ensure that uevents were injected in parallel.
> >    - mean transaction time was calculated:
> >      - *without* uevent sequence number namespacing: 67 μs
> >      - *with* uevent sequence number namespacing:    55 μs
> >      - makes a difference of:                        12 μs
> >    - a t-test was performed on the two data vectors which revealed
> >      shows significant performance improvements:
> >      Welch Two Sample t-test
> >      data:  x1 and y1
> >      t = 405.16, df = 18883000, p-value < 2.2e-16
> >      alternative hypothesis: true difference in means is not equal to 0
> >      95 percent confidence interval:
> >      12.14949 12.26761
> >      sample estimates:
> >      mean of x mean of y
> >      68.48594  56.27739
> >
> >  # Testcase 2:
> >    Injecting Uevents into network namespaces not owned by the initial user
> >    namespace and network namespaces owned by the initial user namespace.
> >    - created 500 new user namespace + network namespace pairs
> >    - created 500 new network namespace pairs
> >    - opened a uevent listener in each of those namespace pairs
> >    - injected uevents into each of those network namespaces 10,000 times
> >      meaning 10,000,000 (10 million) uevents were injected. (The high
> >      number of uevent injections should get rid of a lot of jitter.)
> >      The injection was done by fork()ing 1000 uevent injectors in a simple
> >      for-loop to ensure that uevents were injected in parallel.
> >    - mean transaction time was calculated:
> >      - *without* uevent sequence number namespacing: 572 μs
> >      - *with* uevent sequence number namespacing:    514 μs
> >      - makes a difference of:                         58 μs
> >    - a t-test was performed on the two data vectors which revealed
> >      shows significant performance improvements:
> >      Welch Two Sample t-test
> >      data:  x2 and y2
> >      t = 38.685, df = 19682000, p-value < 2.2e-16
> >      alternative hypothesis: true difference in means is not equal to 0
> >      95 percent confidence interval:
> >      55.10630 60.98815
> >      sample estimates:
> >      mean of x mean of y
> >      572.9684  514.9211
> >
> >  # Testcase 3:
> >    Created 500 new user namespace + network namespace pairs *without uevent
> >    listeners*
> >    - created 500 new network namespace pairs *without uevent listeners*
> >    - injected uevents into each of those network namespaces 10,000 times
> >      meaning 10,000,000 (10 million) uevents were injected. (The high number
> >      of uevent injections should get rid of a lot of jitter.)
> >      The injection was done by fork()ing 1000 uevent injectors in a simple
> >      for-loop to ensure that uevents were injected in parallel.
> >     - mean transaction time was calculated:
> >       - *without* uevent sequence number namespacing: 206 μs
> >       - *with* uevent sequence number namespacing:    163 μs
> >       - makes a difference of:                         43 μs
> >     - a t-test was performed on the two data vectors which revealed
> >       shows significant performance improvements:
> >       Welch Two Sample t-test
> >       data:  x3 and y3
> >       t = 58.37, df = 17711000, p-value < 2.2e-16
> >       alternative hypothesis: true difference in means is not equal to 0
> >       95 percent confidence interval:
> >       41.77860 44.68178
> >       sample estimates:
> >       mean of x mean of y
> >       207.2632  164.0330
> >
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> > ---
> > Changelog v1->v2:
> > * non-functional change: fix indendation for C directives in
> >   kernel/ksysfs.c
> > Changelog v0->v1:
> > * add detailed test results to the commit message
> > * account for kernels compiled without CONFIG_NET
> > ---
> >  include/linux/kobject.h     |   2 +
> >  include/net/net_namespace.h |   3 ++
> >  kernel/ksysfs.c             |  11 +++-
> >  lib/kobject_uevent.c        | 104 +++++++++++++++++++++++++++++-------
> >  net/core/net_namespace.c    |  14 +++++
> >  5 files changed, 114 insertions(+), 20 deletions(-)
> >
> > diff --git a/include/linux/kobject.h b/include/linux/kobject.h
> > index 7f6f93c3df9c..4e608968907f 100644
> > --- a/include/linux/kobject.h
> > +++ b/include/linux/kobject.h
> > @@ -36,8 +36,10 @@
> >  extern char uevent_helper[];
> >  #endif
> >  
> > +#ifndef CONFIG_NET
> >  /* counter to tag the uevent, read only except for the kobject core */
> >  extern u64 uevent_seqnum;
> > +#endif
> 
> That smells like an implementation bug somewhere.

Sorry, I'm not following. I'm I'm not mistaken there won't be any struct
net when CONFIG_NET=n. This has been reported by kbuild robot with alpha
and CONFIG_NET=n.

> 
> >  /*
> >   * The actions here must match the index to the string array
> > diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> > index 47e35cce3b64..e4e171b1ba69 100644
> > --- a/include/net/net_namespace.h
> > +++ b/include/net/net_namespace.h
> > @@ -85,6 +85,8 @@ struct net {
> >  	struct sock		*genl_sock;
> >  
> >  	struct uevent_sock	*uevent_sock;		/* uevent socket */
> > +	/* counter to tag the uevent, read only except for the kobject core */
> > +	u64                     uevent_seqnum;
> >  
> >  	struct list_head 	dev_base_head;
> >  	struct hlist_head 	*dev_name_head;
> > @@ -189,6 +191,7 @@ extern struct list_head net_namespace_list;
> >  
> >  struct net *get_net_ns_by_pid(pid_t pid);
> >  struct net *get_net_ns_by_fd(int fd);
> > +u64 get_ns_uevent_seqnum_by_vpid(void);
> >  
> >  #ifdef CONFIG_SYSCTL
> >  void ipx_register_sysctl(void);
> > diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
> > index 46ba853656f6..38b70b90a21f 100644
> > --- a/kernel/ksysfs.c
> > +++ b/kernel/ksysfs.c
> > @@ -19,6 +19,7 @@
> >  #include <linux/sched.h>
> >  #include <linux/capability.h>
> >  #include <linux/compiler.h>
> > +#include <net/net_namespace.h>
> >  
> >  #include <linux/rcupdate.h>	/* rcu_expedited and rcu_normal */
> >  
> > @@ -33,7 +34,15 @@ static struct kobj_attribute _name##_attr = \
> >  static ssize_t uevent_seqnum_show(struct kobject *kobj,
> >  				  struct kobj_attribute *attr, char *buf)
> >  {
> > -	return sprintf(buf, "%llu\n", (unsigned long long)uevent_seqnum);
> > +	u64 seqnum;
> > +
> > +#ifdef CONFIG_NET
> > +	seqnum = get_ns_uevent_seqnum_by_vpid();
> > +#else
> > +	seqnum = uevent_seqnum;
> > +#endif
> 
> This can be simplified to be just:
> 	seqnum = current->nsproxy->net_ns->uevent_seqnum;

Does that work even if CONFIG_NET=n?

> 
> Except that is not correct either.  As every instance of sysfs has a
> network namespace associated with it, and you are not fetching that
> network namespace.

I'm not yet familiar with all aspects of sysfs so thanks for pointing
that out. Then I'll try to come up with a way to fetch the network
namespace associated with sysfs. Unless you already know exactly how to
do this and can point it out.
This would also lets us drop get_ns_uevent_seqnum_by_vpid().

> 
> Typically this would call for making this file per network namespace
> so you would have this information available.  Sigh.  I don't know if
> there is an easy way to do that for this file.
> 
> > +
> > +	return sprintf(buf, "%llu\n", (unsigned long long)seqnum);
> >  }
> >  KERNEL_ATTR_RO(uevent_seqnum);
> >  
> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> > index f5f5038787ac..5da20def556d 100644
> > --- a/lib/kobject_uevent.c
> > +++ b/lib/kobject_uevent.c
> > @@ -29,21 +29,42 @@
> >  #include <net/net_namespace.h>
> >  
> >  
> > +#ifndef CONFIG_NET
> >  u64 uevent_seqnum;
> > +#endif
> > +
> >  #ifdef CONFIG_UEVENT_HELPER
> >  char uevent_helper[UEVENT_HELPER_PATH_LEN] = CONFIG_UEVENT_HELPER_PATH;
> >  #endif
> >  
> > +/*
> > + * Size a buffer needs to be in order to hold the largest possible sequence
> > + * number stored in a u64 including \0 byte: 2^64 - 1 = 21 chars.
> > + */
> > +#define SEQNUM_BUFSIZE (sizeof("SEQNUM=") + 21)
> >  struct uevent_sock {
> >  	struct list_head list;
> >  	struct sock *sk;
> > +	/*
> > +	 * This mutex protects uevent sockets and the uevent counter of
> > +	 * network namespaces *not* owned by init_user_ns.
> > +	 * For network namespaces owned by init_user_ns this lock is *not*
> > +	 * valid instead the global uevent_sock_mutex must be used!
> > +	 */
> > +	struct mutex sk_mutex;
> >  };
> >  
> >  #ifdef CONFIG_NET
> >  static LIST_HEAD(uevent_sock_list);
> >  #endif
> >  
> > -/* This lock protects uevent_seqnum and uevent_sock_list */
> > +/*
> > + * This mutex protects uevent sockets and the uevent counter of network
> > + * namespaces owned by init_user_ns.
> > + * For network namespaces not owned by init_user_ns this lock is *not*
> > + * valid instead the network namespace specific sk_mutex in struct
> > + * uevent_sock must be used!
> > + */
> >  static DEFINE_MUTEX(uevent_sock_mutex);
> >  
> >  /* the strings here must match the enum in include/linux/kobject.h */
> > @@ -253,6 +274,22 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >  
> >  	return 0;
> >  }
> > +
> > +static bool can_hold_seqnum(const struct kobj_uevent_env *env, size_t len)
> > +{
> > +	if (env->envp_idx >= ARRAY_SIZE(env->envp)) {
> > +		WARN(1, KERN_ERR "Failed to append sequence number. "
> > +		     "Too many uevent variables\n");
> > +		return false;
> > +	}
> > +
> > +	if ((env->buflen + len) > UEVENT_BUFFER_SIZE) {
> > +		WARN(1, KERN_ERR "Insufficient space to append sequence number\n");
> > +		return false;
> > +	}
> > +
> > +	return true;
> > +}
> >  #endif
> >  
> >  #ifdef CONFIG_UEVENT_HELPER
> > @@ -308,18 +345,22 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
> >  
> >  	/* send netlink message */
> >  	list_for_each_entry(ue_sk, &uevent_sock_list, list) {
> > +		/* bump sequence number */
> > +		u64 seqnum = ++sock_net(ue_sk->sk)->uevent_seqnum;
> >  		struct sock *uevent_sock = ue_sk->sk;
> > +		char buf[SEQNUM_BUFSIZE];
> >  
> >  		if (!netlink_has_listeners(uevent_sock, 1))
> >  			continue;
> >  
> >  		if (!skb) {
> > -			/* allocate message with the maximum possible size */
> > +			/* calculate header length */
> >  			size_t len = strlen(action_string) + strlen(devpath) + 2;
> >  			char *scratch;
> >  
> > +			/* allocate message with the maximum possible size */
> >  			retval = -ENOMEM;
> > -			skb = alloc_skb(len + env->buflen, GFP_KERNEL);
> > +			skb = alloc_skb(len + env->buflen + SEQNUM_BUFSIZE, GFP_KERNEL);
> >  			if (!skb)
> >  				continue;
> >  
> > @@ -327,11 +368,24 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
> >  			scratch = skb_put(skb, len);
> >  			sprintf(scratch, "%s@%s", action_string, devpath);
> >  
> > +			/* add env */
> >  			skb_put_data(skb, env->buf, env->buflen);
> >  
> >  			NETLINK_CB(skb).dst_group = 1;
> >  		}
> >  
> > +		/* prepare netns seqnum */
> > +		retval = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu", seqnum);
> > +		if (retval < 0 || retval >= SEQNUM_BUFSIZE)
> > +			continue;
> > +		retval++;
> > +
> > +		if (!can_hold_seqnum(env, retval))
> > +			continue;
> 
> You have allocated enough space in the skb why does can_hold_seqnum make
> sense?

Because it doesn't check whether the socket buffer can hold the sequence
number but checks whether the uevent buffer size in "env" can hold it.
uevents are only delivered if the env buffer is large enough to hold all
of the info including the sequence number. That's independent of the
socket buffer.

> 
> Do you need to back seqnum out of the env later for this to work twice
> in a row?

I guess I can just override it. It just felt cleaner to trim it.

> 
> > +
> > +		/* append netns seqnum */
> > +		skb_put_data(skb, buf, retval);
> > +
> >  		retval = netlink_broadcast_filtered(uevent_sock, skb_get(skb),
> >  						    0, 1, GFP_KERNEL,
> >  						    kobj_bcast_filter,
> > @@ -339,8 +393,13 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
> >  		/* ENOBUFS should be handled in userspace */
> >  		if (retval == -ENOBUFS || retval == -ESRCH)
> >  			retval = 0;
> > +
> > +		/* remove netns seqnum */
> > +		skb_trim(skb, env->buflen);
> 
> Have you checked to see if the seqnum actually makes it to userspace.

Yes, I did. I also wonder why it wouldn't make it. Any specific reason
why you suspect this?

> >  	}
> >  	consume_skb(skb);
> > +#else
> > +	uevent_seqnum++;
> >  #endif
> >  	return retval;
> >  }
> > @@ -510,14 +569,7 @@ int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
> >  	}
> >  
> >  	mutex_lock(&uevent_sock_mutex);
> > -	/* we will send an event, so request a new sequence number */
> > -	retval = add_uevent_var(env, "SEQNUM=%llu", (unsigned long long)++uevent_seqnum);
> > -	if (retval) {
> > -		mutex_unlock(&uevent_sock_mutex);
> > -		goto exit;
> > -	}
> > -	retval = kobject_uevent_net_broadcast(kobj, env, action_string,
> > -					      devpath);
> > +	retval = kobject_uevent_net_broadcast(kobj, env, action_string, devpath);
> >  	mutex_unlock(&uevent_sock_mutex);
> 
> How does all of this work with events for network devices that are not
> in the initial network namespace.  This looks to me like this code fails
> to take the sk_mutex.

But in this list only non-initial network namespaces that are owned by
the initial user namespace are recorded and for these uevent_sock_mutex
has to be taken. Am I missing something?

> 
> >  #ifdef CONFIG_UEVENT_HELPER
> > @@ -605,17 +657,18 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
> >  EXPORT_SYMBOL_GPL(add_uevent_var);
> >  
> >  #if defined(CONFIG_NET)
> > -static int uevent_net_broadcast(struct sock *usk, struct sk_buff *skb,
> > +static int uevent_net_broadcast(struct uevent_sock *ue_sk, struct sk_buff *skb,
> >  				struct netlink_ext_ack *extack)
> >  {
> > -	/* u64 to chars: 2^64 - 1 = 21 chars */
> > -	char buf[sizeof("SEQNUM=") + 21];
> > +	struct sock *usk = ue_sk->sk;
> > +	char buf[SEQNUM_BUFSIZE];
> >  	struct sk_buff *skbc;
> >  	int ret;
> >  
> >  	/* bump and prepare sequence number */
> > -	ret = snprintf(buf, sizeof(buf), "SEQNUM=%llu", ++uevent_seqnum);
> > -	if (ret < 0 || (size_t)ret >= sizeof(buf))
> > +	ret = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu",
> > +		       ++sock_net(ue_sk->sk)->uevent_seqnum);
> > +	if (ret < 0 || ret >= SEQNUM_BUFSIZE)
> >  		return -ENOMEM;
> >  	ret++;
> >  
> > @@ -668,9 +721,15 @@ static int uevent_net_rcv_skb(struct sk_buff *skb, struct nlmsghdr *nlh,
> >  		return -EPERM;
> >  	}
> >  
> > -	mutex_lock(&uevent_sock_mutex);
> > -	ret = uevent_net_broadcast(net->uevent_sock->sk, skb, extack);
> > -	mutex_unlock(&uevent_sock_mutex);
> > +	if (net->user_ns == &init_user_ns)
> > +		mutex_lock(&uevent_sock_mutex);
> > +	else
> > +		mutex_lock(&net->uevent_sock->sk_mutex);
> > +	ret = uevent_net_broadcast(net->uevent_sock, skb, extack);
> > +	if (net->user_ns == &init_user_ns)
> > +		mutex_unlock(&uevent_sock_mutex);
> > +	else
> > +		mutex_unlock(&net->uevent_sock->sk_mutex);
> >  
> >  	return ret;
> >  }
> > @@ -708,6 +767,13 @@ static int uevent_net_init(struct net *net)
> >  		mutex_lock(&uevent_sock_mutex);
> >  		list_add_tail(&ue_sk->list, &uevent_sock_list);
> >  		mutex_unlock(&uevent_sock_mutex);
> > +	} else {
> > +		/*
> > +		 * Uevent sockets and counters for network namespaces
> > +		 * not owned by the initial user namespace have their
> > +		 * own mutex.
> > +		 */
> > +		mutex_init(&ue_sk->sk_mutex);
> >  	}
> >  
> >  	return 0;
> > diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> > index a11e03f920d3..8894638f5150 100644
> > --- a/net/core/net_namespace.c
> > +++ b/net/core/net_namespace.c
> > @@ -618,6 +618,20 @@ struct net *get_net_ns_by_pid(pid_t pid)
> >  }
> >  EXPORT_SYMBOL_GPL(get_net_ns_by_pid);
> >  
> > +u64 get_ns_uevent_seqnum_by_vpid(void)
> > +{
> > +	pid_t cur_pid;
> > +	struct net *net;
> > +
> > +	cur_pid = task_pid_vnr(current);
> > +	net = get_net_ns_by_pid(cur_pid);
> > +	if (IS_ERR(net))
> > +		return 0;
> > +
> > +	return net->uevent_seqnum;
> > +}
> > +EXPORT_SYMBOL_GPL(get_ns_uevent_seqnum_by_vpid);
> 
> I just have to say this function is completely crazy.
> You go from the tsk to the pid back to the tsk.
> And you leak a struct net pointer.
> 
> It is much simpler and less racy to say:
> 
> 	current->nsproxy->net_ns->uevent_seqnum;
> 
> That you are accessing current->nsproxy means nsproxy can't
> change.  The rcu_read_lock etc that get_net_ns_by_pid does
> is there for accessing non-current tasks.
> 
> 
> 
> >  static __net_init int net_ns_net_init(struct net *net)
> >  {
> >  #ifdef CONFIG_NET_NS

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox