Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net 3/5] net: dsa: sja1105: Really fix panic on unregistering PTP clock
From: Vladimir Oltean @ 2019-08-04 22:38 UTC (permalink / raw)
  To: f.fainelli, vivien.didelot, andrew, davem; +Cc: netdev, Vladimir Oltean
In-Reply-To: <20190804223848.31676-1-olteanv@gmail.com>

The IS_ERR_OR_NULL(priv->clock) check inside
sja1105_ptp_clock_unregister() is preventing cancel_delayed_work_sync
from actually being run.

Additionally, sja1105_ptp_clock_unregister() does not actually get run,
when placed in sja1105_remove(). The DSA switch gets torn down, but the
sja1105 module does not get unregistered. So sja1105_ptp_clock_unregister
needs to be moved to sja1105_teardown, to be symmetrical with
sja1105_ptp_clock_register which is called from the DSA sja1105_setup.

It is strange to fix a "fixes" patch, but the probe failure can only be
seen when the attached PHY does not respond to MDIO (issue which I can't
pinpoint the reason to) and it goes away after I power-cycle the board.
This time the patch was validated on a failing board, and the kernel
panic from the fixed commit's message can no longer be seen.

Fixes: 29dd908d355f ("net: dsa: sja1105: Cancel PTP delayed work on unregister")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
---
 drivers/net/dsa/sja1105/sja1105_main.c | 4 ++--
 drivers/net/dsa/sja1105/sja1105_ptp.c  | 7 +++----
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/net/dsa/sja1105/sja1105_main.c b/drivers/net/dsa/sja1105/sja1105_main.c
index fd036bf0a819..13c22f51d476 100644
--- a/drivers/net/dsa/sja1105/sja1105_main.c
+++ b/drivers/net/dsa/sja1105/sja1105_main.c
@@ -1725,6 +1725,8 @@ static void sja1105_teardown(struct dsa_switch *ds)
 
 	cancel_work_sync(&priv->tagger_data.rxtstamp_work);
 	skb_queue_purge(&priv->tagger_data.skb_rxtstamp_queue);
+	sja1105_ptp_clock_unregister(priv);
+	sja1105_static_config_free(&priv->static_config);
 }
 
 static int sja1105_mgmt_xmit(struct dsa_switch *ds, int port, int slot,
@@ -2182,9 +2184,7 @@ static int sja1105_remove(struct spi_device *spi)
 {
 	struct sja1105_private *priv = spi_get_drvdata(spi);
 
-	sja1105_ptp_clock_unregister(priv);
 	dsa_unregister_switch(priv->ds);
-	sja1105_static_config_free(&priv->static_config);
 	return 0;
 }
 
diff --git a/drivers/net/dsa/sja1105/sja1105_ptp.c b/drivers/net/dsa/sja1105/sja1105_ptp.c
index d19cfdf681af..d8e8dd59f3d1 100644
--- a/drivers/net/dsa/sja1105/sja1105_ptp.c
+++ b/drivers/net/dsa/sja1105/sja1105_ptp.c
@@ -369,16 +369,15 @@ int sja1105_ptp_clock_register(struct sja1105_private *priv)
 		.mult = SJA1105_CC_MULT,
 	};
 	mutex_init(&priv->ptp_lock);
-	INIT_DELAYED_WORK(&priv->refresh_work, sja1105_ptp_overflow_check);
-
-	schedule_delayed_work(&priv->refresh_work, SJA1105_REFRESH_INTERVAL);
-
 	priv->ptp_caps = sja1105_ptp_caps;
 
 	priv->clock = ptp_clock_register(&priv->ptp_caps, ds->dev);
 	if (IS_ERR_OR_NULL(priv->clock))
 		return PTR_ERR(priv->clock);
 
+	INIT_DELAYED_WORK(&priv->refresh_work, sja1105_ptp_overflow_check);
+	schedule_delayed_work(&priv->refresh_work, SJA1105_REFRESH_INTERVAL);
+
 	return sja1105_ptp_reset(priv);
 }
 
-- 
2.17.1


^ permalink raw reply related

* [PATCH net 4/5] net: dsa: sja1105: Fix memory leak on meta state machine normal path
From: Vladimir Oltean @ 2019-08-04 22:38 UTC (permalink / raw)
  To: f.fainelli, vivien.didelot, andrew, davem; +Cc: netdev, Vladimir Oltean
In-Reply-To: <20190804223848.31676-1-olteanv@gmail.com>

After a meta frame is received, it is associated with the cached
sp->data->stampable_skb from the DSA tagger private structure.

Cached means its refcount is incremented with skb_get() in order for
dsa_switch_rcv() to not free it when the tagger .rcv returns NULL.

The mistake is that skb_unref() is not the correct function to use. It
will correctly decrement the refcount (which will go back to zero) but
the skb memory will not be freed.  That is the job of kfree_skb(), which
also calls skb_unref().

But it turns out that freeing the cached stampable_skb is in fact not
necessary.  It is still a perfectly valid skb, and now it is even
annotated with the partial RX timestamp.  So remove the skb_copy()
altogether and simply pass the stampable_skb with a refcount of 1
(incremented by us, decremented by dsa_switch_rcv) up the stack.

Fixes: f3097be21bf1 ("net: dsa: sja1105: Add a state machine for RX timestamping")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
---
 net/dsa/tag_sja1105.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/net/dsa/tag_sja1105.c b/net/dsa/tag_sja1105.c
index 26363d72d25b..8fa8dda8a15b 100644
--- a/net/dsa/tag_sja1105.c
+++ b/net/dsa/tag_sja1105.c
@@ -211,17 +211,8 @@ static struct sk_buff
 		 * for further processing up the network stack.
 		 */
 		kfree_skb(skb);
-
-		skb = skb_copy(stampable_skb, GFP_ATOMIC);
-		if (!skb) {
-			dev_err_ratelimited(dp->ds->dev,
-					    "Failed to copy stampable skb\n");
-			spin_unlock(&sp->data->meta_lock);
-			return NULL;
-		}
+		skb = stampable_skb;
 		sja1105_transfer_meta(skb, meta);
-		/* The cached copy will be freed now */
-		skb_unref(stampable_skb);

 		spin_unlock(&sp->data->meta_lock);
 	}
-- 
2.17.1

^ permalink raw reply related

* [PATCH net 2/5] net: dsa: sja1105: Use the LOCKEDS bit for SJA1105 E/T as well
From: Vladimir Oltean @ 2019-08-04 22:38 UTC (permalink / raw)
  To: f.fainelli, vivien.didelot, andrew, davem; +Cc: netdev, Vladimir Oltean
In-Reply-To: <20190804223848.31676-1-olteanv@gmail.com>

It looks like the FDB dump taken from first-generation switches also
contains information on whether entries are static or not. So use that
instead of searching through the driver's tables.

Fixes: d763778224ea ("net: dsa: sja1105: Implement is_static for FDB entries on E/T")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
---
 drivers/net/dsa/sja1105/sja1105_dynamic_config.c | 14 +++++++++++++-
 drivers/net/dsa/sja1105/sja1105_main.c           | 15 ---------------
 2 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/drivers/net/dsa/sja1105/sja1105_dynamic_config.c b/drivers/net/dsa/sja1105/sja1105_dynamic_config.c
index 6bfb1696a6f2..9988c9d18567 100644
--- a/drivers/net/dsa/sja1105/sja1105_dynamic_config.c
+++ b/drivers/net/dsa/sja1105/sja1105_dynamic_config.c
@@ -277,6 +277,18 @@ sja1105et_l2_lookup_cmd_packing(void *buf, struct sja1105_dyn_cmd *cmd,
 			SJA1105ET_SIZE_L2_LOOKUP_ENTRY, op);
 }
 
+static size_t sja1105et_dyn_l2_lookup_entry_packing(void *buf, void *entry_ptr,
+						    enum packing_op op)
+{
+	struct sja1105_l2_lookup_entry *entry = entry_ptr;
+	u8 *cmd = buf + SJA1105ET_SIZE_L2_LOOKUP_ENTRY;
+	const int size = SJA1105_SIZE_DYN_CMD;
+
+	sja1105_packing(cmd, &entry->lockeds, 28, 28, size, op);
+
+	return sja1105et_l2_lookup_entry_packing(buf, entry_ptr, op);
+}
+
 static void
 sja1105et_mgmt_route_cmd_packing(void *buf, struct sja1105_dyn_cmd *cmd,
 				 enum packing_op op)
@@ -477,7 +489,7 @@ sja1105et_general_params_entry_packing(void *buf, void *entry_ptr,
 /* SJA1105E/T: First generation */
 struct sja1105_dynamic_table_ops sja1105et_dyn_ops[BLK_IDX_MAX_DYN] = {
 	[BLK_IDX_L2_LOOKUP] = {
-		.entry_packing = sja1105et_l2_lookup_entry_packing,
+		.entry_packing = sja1105et_dyn_l2_lookup_entry_packing,
 		.cmd_packing = sja1105et_l2_lookup_cmd_packing,
 		.access = (OP_READ | OP_WRITE | OP_DEL),
 		.max_entry_count = SJA1105_MAX_L2_LOOKUP_COUNT,
diff --git a/drivers/net/dsa/sja1105/sja1105_main.c b/drivers/net/dsa/sja1105/sja1105_main.c
index dc6ab834f0cc..fd036bf0a819 100644
--- a/drivers/net/dsa/sja1105/sja1105_main.c
+++ b/drivers/net/dsa/sja1105/sja1105_main.c
@@ -1251,21 +1251,6 @@ static int sja1105_fdb_dump(struct dsa_switch *ds, int port,
 			continue;
 		u64_to_ether_addr(l2_lookup.macaddr, macaddr);
 
-		/* On SJA1105 E/T, the switch doesn't implement the LOCKEDS
-		 * bit, so it doesn't tell us whether a FDB entry is static
-		 * or not.
-		 * But, of course, we can find out - we're the ones who added
-		 * it in the first place.
-		 */
-		if (priv->info->device_id == SJA1105E_DEVICE_ID ||
-		    priv->info->device_id == SJA1105T_DEVICE_ID) {
-			int match;
-
-			match = sja1105_find_static_fdb_entry(priv, port,
-							      &l2_lookup);
-			l2_lookup.lockeds = (match >= 0);
-		}
-
 		/* We need to hide the dsa_8021q VLANs from the user. */
 		if (!dsa_port_is_vlan_filtering(&ds->ports[port]))
 			l2_lookup.vlanid = 0;
-- 
2.17.1


^ permalink raw reply related

* [PATCH net 0/5] Fixes for SJA1105 DSA: FDBs, Learning and PTP
From: Vladimir Oltean @ 2019-08-04 22:38 UTC (permalink / raw)
  To: f.fainelli, vivien.didelot, andrew, davem; +Cc: netdev, Vladimir Oltean

This is an assortment of functional fixes for the sja1105 switch driver
targeted for the "net" tree (although they apply on net-next just as
well).

Patch 1/5 ("net: dsa: sja1105: Fix broken learning with vlan_filtering
disabled") repairs a breakage introduced in the early development stages
of the driver: support for traffic from the CPU has broken "normal"
frame forwarding (based on DMAC) - there is connectivity through the
switch only because all frames are flooded.
I debated whether this patch qualifies as a fix, since it puts the
switch into a mode it has never operated in before (aka SVL). But
"normal" forwarding did use to work before the "Traffic support for
SJA1105 DSA driver" patchset, and arguably this patch should have been
part of that.
Also, it would be strange for this feature to be broken in the 5.2 LTS.

Patch 2/5 ("net: dsa: sja1105: Use the LOCKEDS bit for SJA1105 E/T as
well") is a simplification of a previous FDB-related patch that is
currently in the 5.3 rc's.

Patches 3/5 - 5/5 fix various crashes found while running linuxptp over the
switch ports for extended periods of time, or in conjunction with other
error conditions. The fixed-up commits were all introduced in 5.2.

Vladimir Oltean (5):
  net: dsa: sja1105: Fix broken learning with vlan_filtering disabled
  net: dsa: sja1105: Use the LOCKEDS bit for SJA1105 E/T as well
  net: dsa: sja1105: Really fix panic on unregistering PTP clock
  net: dsa: sja1105: Fix memory leak on meta state machine normal path
  net: dsa: sja1105: Fix memory leak on meta state machine error path

 .../net/dsa/sja1105/sja1105_dynamic_config.c  |  14 +-
 drivers/net/dsa/sja1105/sja1105_main.c        | 140 +++++++-----------
 drivers/net/dsa/sja1105/sja1105_ptp.c         |   7 +-
 net/dsa/tag_sja1105.c                         |  12 +-
 4 files changed, 75 insertions(+), 98 deletions(-)

-- 
2.17.1

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-04 22:16 UTC (permalink / raw)
  To: Song Liu
  Cc: Andy Lutomirski, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List
In-Reply-To: <5A2FCD7E-7F54-41E5-BFAE-BB9494E74F2D@fb.com>

On Fri, Aug 2, 2019 at 12:22 AM Song Liu <songliubraving@fb.com> wrote:
>
> Hi Andy,
>
 >> I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make
> >> existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.
> >
> > It's been done before -- it's not that hard.  IMO the main tricky bit
> > would be try be somewhat careful about defining exactly what
> > CAP_BPF_ADMIN does.
>
> Agreed. I think defining CAP_BPF_ADMIN could be a good topic for the
> Plumbers conference.
>
> OTOH, I don't think we have to wait for CAP_BPF_ADMIN to allow daemons
> like systemd to do sys_bpf() without root.

I don't understand the use case here.  Are you talking about systemd
--user?  As far as I know, a user is expected to be able to fully
control their systemd --user process, so giving it unrestricted bpf
access is very close to giving it superuser access, and this doesn't
sound like a good idea.  I think that, if systemd --user needs bpf(),
it either needs real unprivileged bpf() or it needs a privileged
helper (SUID or a daemon) to intermediate this access.

>
> >
> >>> I don't see why you need to invent a whole new mechanism for this.
> >>> The entire cgroup ecosystem outside bpf() does just fine using the
> >>> write permission on files in cgroupfs to control access.  Why can't
> >>> bpf() do the same thing?
> >>
> >> It is easier to use write permission for BPF_PROG_ATTACH. But it is
> >> not easy to do the same for other bpf commands: BPF_PROG_LOAD and
> >> BPF_MAP_*. A lot of these commands don't have target concept. Maybe
> >> we should have target concept for all these commands. But that is a
> >> much bigger project. OTOH, "all or nothing" model allows all these
> >> commands at once.
> >
> > For BPF_PROG_LOAD, I admit I've never understood why permission is
> > required at all.  I think that CAP_SYS_ADMIN or similar should be
> > needed to get is_priv in the verifier, but I think that should mainly
> > be useful for tracing, and that requires lots of privilege anyway.
> > BPF_MAP_* is probably the trickiest part.  One solution would be some
> > kind of bpffs, but I'm sure other solutions are possible.
>
> Improving permission management of cgroup_bpf is another good topic to
> discuss. However, it is also an overkill for current use case.
>

I looked at the code some more, and I don't think this is so hard
after all.  As I understand it, all of the map..by_id stuff is, to
some extent, deprecated in favor of persistent maps.  As I see it, the
map..by_id calls should require privilege forever, although I can
imagine ways to scope that privilege to a namespace if the maps
themselves were to be scoped to a namespace.

Instead, unprivileged tools would use the persistent map interface
roughly like this:

$ bpftool map create /sys/fs/bpf/my_dir/filename type hash key 8 value
8 entries 64 name mapname

This would require that the caller have either CAP_DAC_OVERRIDE or
that the caller have permission to create files in /sys/fs/bpf/my_dir
(using the same rules as for any filesystem), and the resulting map
would end up owned by the creating user and have mode 0600 (or maybe
0666, or maybe a new bpf_attr parameter) modified by umask.  Then all
the various capable() checks that are currently involved in accessing
a persistent map would instead check FMODE_READ or FMODE_WRITE on the
map file as appropriate.

Half of this stuff already works.  I just set my system up like this:

$ ls -l /sys/fs/bpf
total 0
drwxr-xr-x. 3 luto luto 0 Aug  4 15:10 luto

$ mkdir /sys/fs/bpf/luto/test

$ ls -l /sys/fs/bpf/luto
total 0
drwxrwxr-x. 2 luto luto 0 Aug  4 15:10 test

I bet that making the bpf() syscalls work appropriately in this
context without privilege would only be a couple of hours of work.
The hard work, creating bpffs and making it function, is already done
:)

P.S. The docs for bpftool create are less than fantastic.  The
complete lack of any error message at all when the syscall returns
-EACCES is also not fantastic.

^ permalink raw reply

* RE: Slowness forming TIPC cluster with explicit node addresses
From: Jon Maloy @ 2019-08-04 21:53 UTC (permalink / raw)
  To: Chris Packham, tipc-discussion@lists.sourceforge.net
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <1564722689.4914.27.camel@alliedtelesis.co.nz>



> -----Original Message-----
> From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org> On
> Behalf Of Chris Packham
> Sent: 2-Aug-19 01:11
> To: Jon Maloy <jon.maloy@ericsson.com>; tipc-
> discussion@lists.sourceforge.net
> Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: Slowness forming TIPC cluster with explicit node addresses
> 
> On Mon, 2019-07-29 at 09:04 +1200, Chris Packham wrote:
> > On Fri, 2019-07-26 at 13:31 +0000, Jon Maloy wrote:
> > >
> > >
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: netdev-owner@vger.kernel.org <netdev-
> owner@vger.kernel.org>
> > > > On Behalf Of Chris Packham
> > > > Sent: 25-Jul-19 19:37
> > > > To: tipc-discussion@lists.sourceforge.net
> > > > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> > > > Subject: Slowness forming TIPC cluster with explicit node
> > > > addresses
> > > >
> > > > Hi,
> > > >
> > > > I'm having problems forming a TIPC cluster between 2 nodes.
> > > >
> > > > This is the basic steps I'm going through on each node.
> > > >
> > > > modprobe tipc
> > > > ip link set eth2 up
> > > > tipc node set addr 1.1.5 # or 1.1.6 tipc bearer enable media eth
> > > > dev eth0
> > > eth2, I assume...
> > >
> > Yes sorry I keep switching between between Ethernet ports for testing
> > so I hand edited the email.
> >
> > >
> > > >
> > > >
> > > >
> > > > Then to confirm if the cluster is formed I use tipc link list
> > > >
> > > > [root@node-5 ~]# tipc link list
> > > > broadcast-link: up
> > > > ...
> > > >
> > > > Looking at tcpdump the two nodes are sending packets
> > > >
> > > > 22:30:05.782320 TIPC v2.0 1.1.5 > 0.0.0, headerlength 60 bytes,
> > > > MessageSize
> > > > 76 bytes, Neighbor Detection Protocol internal, messageType Link
> > > > request
> > > > 22:30:05.863555 TIPC v2.0 1.1.6 > 0.0.0, headerlength 60 bytes,
> > > > MessageSize
> > > > 76 bytes, Neighbor Detection Protocol internal, messageType Link
> > > > request
> > > >
> > > > Eventually (after a few minutes) the link does come up
> > > >
> > > > [root@node-6 ~]# tipc link list
> > > > broadcast-link: up
> > > > 1001006:eth2-1001005:eth2: up
> > > >
> > > > [root@node-5 ~]# tipc link list
> > > > broadcast-link: up
> > > > 1001005:eth2-1001006:eth2: up
> > > >
> > > > When I remove the "tipc node set addr" things seem to kick into
> > > > life straight away
> > > >
> > > > [root@node-5 ~]# tipc link list
> > > > broadcast-link: up
> > > > 0050b61bd2aa:eth2-0050b61e6dfa:eth2: up
> > > >
> > > > So there appears to be some difference in behaviour between having
> > > > an explicit node address and using the default. Unfortunately our
> > > > application relies on setting the node addresses.
> > > I do this many times a day, without any problems. If there would be
> > > any time difference, I would expect the 'auto configurable' version
> > > to be slower, because it involves a DAD step.
> > > Are you sure you don't have any other nodes running in your system?
> > >
> > > ///jon
> > >
> > Nope the two nodes are connected back to back. Does the number of
> > Ethernet interfaces make a difference? As you can see I've got 3 on
> > each node. One is completely disconnected, one is for booting over
> > TFTP
> >  (only used by U-boot) and the other is the USB Ethernet I'm using for
> > testing.
> >
> 
> So I can still reproduce this on nodes that only have one network interface and
> are the only things connected.
> 
> I did find one thing that helps
> 
> diff --git a/net/tipc/discover.c b/net/tipc/discover.c index
> c138d68e8a69..49921dad404a 100644
> --- a/net/tipc/discover.c
> +++ b/net/tipc/discover.c
> @@ -358,10 +358,10 @@ int tipc_disc_create(struct net *net, struct
> tipc_bearer *b,
>         tipc_disc_init_msg(net, d->skb, DSC_REQ_MSG, b);
> 
>         /* Do we need an address trial period first ? */
> -       if (!tipc_own_addr(net)) {
> +//     if (!tipc_own_addr(net)) {
>                 tn->addr_trial_end = jiffies + msecs_to_jiffies(1000);
>                 msg_set_type(buf_msg(d->skb), DSC_TRIAL_MSG);
> -       }
> +//     }
>         memcpy(&d->dest, dest, sizeof(*dest));
>         d->net = net;
>         d->bearer_id = b->identity;
> 
> I think because with pre-configured addresses the duplicate address detection
> is skipped the shorter init phase is skipped. Would is make sense to
> unconditionally do the trial step? Or is there some better way to get things to
> transition with pre-assigned addresses.

I am on vacation until the end of next-week, so I can't give you any good analysis right now.
To do the trial step doesn’t make much sense to me, -it would only delay the setup unnecessarily (but with only 1 second).
Can you check the initial value of addr_trial_end when there a pre-configured address?

///jon


^ permalink raw reply

* [PATCH v6 0/3] mm/gup: add make_dirty arg to put_user_pages_dirty_lock()
From: john.hubbard @ 2019-08-04 21:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Björn Töpel, Boaz Harrosh,
	Christoph Hellwig, Daniel Vetter, Dan Williams, Dave Chinner,
	David Airlie, David S . Miller, Ilya Dryomov, Jan Kara,
	Jason Gunthorpe, Jens Axboe, Jérôme Glisse,
	Johannes Thumshirn, Magnus Karlsson, Matthew Wilcox,
	Miklos Szeredi, Ming Lei, Sage Weil, Santosh Shilimkar, Yan Zheng,
	netdev, dri-devel, linux-mm, linux-rdma, bpf, LKML, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Changes since v5:

* Patch #1: Fixed a bug that I introduced in v4:
  drivers/infiniband/sw/siw/siw_mem.c needs to refer to
  umem->page_chunk[i].plist, rather than umem->page_chunk[i].

Changes since v4:

* Christophe Hellwig's review applied: deleted siw_free_plist() and
  __qib_release_user_pages(), now that put_user_pages_dirty_lock() does
  what those routines were doing.

* Applied Bjorn's ACK for net/xdp, and Christophe's Reviewed-by for patch
  #1.

Changes since v3:

* Fixed an unused variable warning in siw_mem.c

Changes since v2:

* Critical bug fix: remove a stray "break;" from the new routine.

Changes since v1:

* Instead of providing __put_user_pages(), add an argument to
  put_user_pages_dirty_lock(), and delete put_user_pages_dirty().
  This is based on the following points:

    1. Lots of call sites become simpler if a bool is passed
    into put_user_page*(), instead of making the call site
    choose which put_user_page*() variant to call.

    2. Christoph Hellwig's observation that set_page_dirty_lock()
    is usually correct, and set_page_dirty() is usually a
    bug, or at least questionable, within a put_user_page*()
    calling chain.

* Added the Infiniband driver back to the patch series, because it is
  a caller of put_user_pages_dirty_lock().

Unchanged parts from the v1 cover letter (except for the diffstat):

Notes about the remaining patches to come:

There are about 50+ patches in my tree [2], and I'll be sending out the
remaining ones in a few more groups:

    * The block/bio related changes (Jerome mostly wrote those, but I've
      had to move stuff around extensively, and add a little code)

    * mm/ changes

    * other subsystem patches

    * an RFC that shows the current state of the tracking patch set. That
      can only be applied after all call sites are converted, but it's
      good to get an early look at it.

This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
("mm: introduce put_user_page*(), placeholder versions").

John Hubbard (3):
  mm/gup: add make_dirty arg to put_user_pages_dirty_lock()
  drivers/gpu/drm/via: convert put_page() to put_user_page*()
  net/xdp: convert put_page() to put_user_page*()

 drivers/gpu/drm/via/via_dmablit.c          |  10 +-
 drivers/infiniband/core/umem.c             |   5 +-
 drivers/infiniband/hw/hfi1/user_pages.c    |   5 +-
 drivers/infiniband/hw/qib/qib_user_pages.c |  13 +--
 drivers/infiniband/hw/usnic/usnic_uiom.c   |   5 +-
 drivers/infiniband/sw/siw/siw_mem.c        |  19 +---
 include/linux/mm.h                         |   5 +-
 mm/gup.c                                   | 115 +++++++++------------
 net/xdp/xdp_umem.c                         |   9 +-
 9 files changed, 64 insertions(+), 122 deletions(-)

-- 
2.22.0

^ permalink raw reply

* [PATCH v6 1/3] mm/gup: add make_dirty arg to put_user_pages_dirty_lock()
From: john.hubbard @ 2019-08-04 21:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Björn Töpel, Boaz Harrosh,
	Christoph Hellwig, Daniel Vetter, Dan Williams, Dave Chinner,
	David Airlie, David S . Miller, Ilya Dryomov, Jan Kara,
	Jason Gunthorpe, Jens Axboe, Jérôme Glisse,
	Johannes Thumshirn, Magnus Karlsson, Matthew Wilcox,
	Miklos Szeredi, Ming Lei, Sage Weil, Santosh Shilimkar, Yan Zheng,
	netdev, dri-devel, linux-mm, linux-rdma, bpf, LKML, John Hubbard,
	Ira Weiny
In-Reply-To: <20190804214042.4564-1-jhubbard@nvidia.com>

From: John Hubbard <jhubbard@nvidia.com>

Provide a more capable variation of put_user_pages_dirty_lock(),
and delete put_user_pages_dirty(). This is based on the
following:

1. Lots of call sites become simpler if a bool is passed
into put_user_page*(), instead of making the call site
choose which put_user_page*() variant to call.

2. Christoph Hellwig's observation that set_page_dirty_lock()
is usually correct, and set_page_dirty() is usually a
bug, or at least questionable, within a put_user_page*()
calling chain.

This leads to the following API choices:

    * put_user_pages_dirty_lock(page, npages, make_dirty)

    * There is no put_user_pages_dirty(). You have to
      hand code that, in the rare case that it's
      required.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/infiniband/core/umem.c             |   5 +-
 drivers/infiniband/hw/hfi1/user_pages.c    |   5 +-
 drivers/infiniband/hw/qib/qib_user_pages.c |  13 +--
 drivers/infiniband/hw/usnic/usnic_uiom.c   |   5 +-
 drivers/infiniband/sw/siw/siw_mem.c        |  19 +---
 include/linux/mm.h                         |   5 +-
 mm/gup.c                                   | 115 +++++++++------------
 7 files changed, 61 insertions(+), 106 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 08da840ed7ee..965cf9dea71a 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -54,10 +54,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
 
 	for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
 		page = sg_page_iter_page(&sg_iter);
-		if (umem->writable && dirty)
-			put_user_pages_dirty_lock(&page, 1);
-		else
-			put_user_page(page);
+		put_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
 	}
 
 	sg_free_table(&umem->sg_head);
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index b89a9b9aef7a..469acb961fbd 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -118,10 +118,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np
 void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 			     size_t npages, bool dirty)
 {
-	if (dirty)
-		put_user_pages_dirty_lock(p, npages);
-	else
-		put_user_pages(p, npages);
+	put_user_pages_dirty_lock(p, npages, dirty);
 
 	if (mm) { /* during close after signal, mm can be NULL */
 		atomic64_sub(npages, &mm->pinned_vm);
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index bfbfbb7e0ff4..26c1fb8d45cc 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -37,15 +37,6 @@
 
 #include "qib.h"
 
-static void __qib_release_user_pages(struct page **p, size_t num_pages,
-				     int dirty)
-{
-	if (dirty)
-		put_user_pages_dirty_lock(p, num_pages);
-	else
-		put_user_pages(p, num_pages);
-}
-
 /**
  * qib_map_page - a safety wrapper around pci_map_page()
  *
@@ -124,7 +115,7 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,
 
 	return 0;
 bail_release:
-	__qib_release_user_pages(p, got, 0);
+	put_user_pages_dirty_lock(p, got, false);
 bail:
 	atomic64_sub(num_pages, &current->mm->pinned_vm);
 	return ret;
@@ -132,7 +123,7 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,
 
 void qib_release_user_pages(struct page **p, size_t num_pages)
 {
-	__qib_release_user_pages(p, num_pages, 1);
+	put_user_pages_dirty_lock(p, num_pages, true);
 
 	/* during close after signal, mm can be NULL */
 	if (current->mm)
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index 0b0237d41613..62e6ffa9ad78 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -75,10 +75,7 @@ static void usnic_uiom_put_pages(struct list_head *chunk_list, int dirty)
 		for_each_sg(chunk->page_list, sg, chunk->nents, i) {
 			page = sg_page(sg);
 			pa = sg_phys(sg);
-			if (dirty)
-				put_user_pages_dirty_lock(&page, 1);
-			else
-				put_user_page(page);
+			put_user_pages_dirty_lock(&page, 1, dirty);
 			usnic_dbg("pa: %pa\n", &pa);
 		}
 		kfree(chunk);
diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c
index 67171c82b0c4..1e197753bf2f 100644
--- a/drivers/infiniband/sw/siw/siw_mem.c
+++ b/drivers/infiniband/sw/siw/siw_mem.c
@@ -60,20 +60,6 @@ struct siw_mem *siw_mem_id2obj(struct siw_device *sdev, int stag_index)
 	return NULL;
 }
 
-static void siw_free_plist(struct siw_page_chunk *chunk, int num_pages,
-			   bool dirty)
-{
-	struct page **p = chunk->plist;
-
-	while (num_pages--) {
-		if (!PageDirty(*p) && dirty)
-			put_user_pages_dirty_lock(p, 1);
-		else
-			put_user_page(*p);
-		p++;
-	}
-}
-
 void siw_umem_release(struct siw_umem *umem, bool dirty)
 {
 	struct mm_struct *mm_s = umem->owning_mm;
@@ -82,8 +68,9 @@ void siw_umem_release(struct siw_umem *umem, bool dirty)
 	for (i = 0; num_pages; i++) {
 		int to_free = min_t(int, PAGES_PER_CHUNK, num_pages);
 
-		siw_free_plist(&umem->page_chunk[i], to_free,
-			       umem->writable && dirty);
+		put_user_pages_dirty_lock(umem->page_chunk[i].plist,
+					  to_free,
+					  umem->writable && dirty);
 		kfree(umem->page_chunk[i].plist);
 		num_pages -= to_free;
 	}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0334ca97c584..9759b6a24420 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1057,8 +1057,9 @@ static inline void put_user_page(struct page *page)
 	put_page(page);
 }
 
-void put_user_pages_dirty(struct page **pages, unsigned long npages);
-void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
+			       bool make_dirty);
+
 void put_user_pages(struct page **pages, unsigned long npages);
 
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
diff --git a/mm/gup.c b/mm/gup.c
index 98f13ab37bac..7fefd7ab02c4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,85 +29,70 @@ struct follow_page_context {
 	unsigned int page_mask;
 };
 
-typedef int (*set_dirty_func_t)(struct page *page);
-
-static void __put_user_pages_dirty(struct page **pages,
-				   unsigned long npages,
-				   set_dirty_func_t sdf)
-{
-	unsigned long index;
-
-	for (index = 0; index < npages; index++) {
-		struct page *page = compound_head(pages[index]);
-
-		/*
-		 * Checking PageDirty at this point may race with
-		 * clear_page_dirty_for_io(), but that's OK. Two key cases:
-		 *
-		 * 1) This code sees the page as already dirty, so it skips
-		 * the call to sdf(). That could happen because
-		 * clear_page_dirty_for_io() called page_mkclean(),
-		 * followed by set_page_dirty(). However, now the page is
-		 * going to get written back, which meets the original
-		 * intention of setting it dirty, so all is well:
-		 * clear_page_dirty_for_io() goes on to call
-		 * TestClearPageDirty(), and write the page back.
-		 *
-		 * 2) This code sees the page as clean, so it calls sdf().
-		 * The page stays dirty, despite being written back, so it
-		 * gets written back again in the next writeback cycle.
-		 * This is harmless.
-		 */
-		if (!PageDirty(page))
-			sdf(page);
-
-		put_user_page(page);
-	}
-}
-
 /**
- * put_user_pages_dirty() - release and dirty an array of gup-pinned pages
- * @pages:  array of pages to be marked dirty and released.
+ * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
+ * @pages:  array of pages to be maybe marked dirty, and definitely released.
  * @npages: number of pages in the @pages array.
+ * @make_dirty: whether to mark the pages dirty
  *
  * "gup-pinned page" refers to a page that has had one of the get_user_pages()
  * variants called on that page.
  *
  * For each page in the @pages array, make that page (or its head page, if a
- * compound page) dirty, if it was previously listed as clean. Then, release
- * the page using put_user_page().
+ * compound page) dirty, if @make_dirty is true, and if the page was previously
+ * listed as clean. In any case, releases all pages using put_user_page(),
+ * possibly via put_user_pages(), for the non-dirty case.
  *
  * Please see the put_user_page() documentation for details.
  *
- * set_page_dirty(), which does not lock the page, is used here.
- * Therefore, it is the caller's responsibility to ensure that this is
- * safe. If not, then put_user_pages_dirty_lock() should be called instead.
+ * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
+ * required, then the caller should a) verify that this is really correct,
+ * because _lock() is usually required, and b) hand code it:
+ * set_page_dirty_lock(), put_user_page().
  *
  */
-void put_user_pages_dirty(struct page **pages, unsigned long npages)
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
+			       bool make_dirty)
 {
-	__put_user_pages_dirty(pages, npages, set_page_dirty);
-}
-EXPORT_SYMBOL(put_user_pages_dirty);
+	unsigned long index;
 
-/**
- * put_user_pages_dirty_lock() - release and dirty an array of gup-pinned pages
- * @pages:  array of pages to be marked dirty and released.
- * @npages: number of pages in the @pages array.
- *
- * For each page in the @pages array, make that page (or its head page, if a
- * compound page) dirty, if it was previously listed as clean. Then, release
- * the page using put_user_page().
- *
- * Please see the put_user_page() documentation for details.
- *
- * This is just like put_user_pages_dirty(), except that it invokes
- * set_page_dirty_lock(), instead of set_page_dirty().
- *
- */
-void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
-{
-	__put_user_pages_dirty(pages, npages, set_page_dirty_lock);
+	/*
+	 * TODO: this can be optimized for huge pages: if a series of pages is
+	 * physically contiguous and part of the same compound page, then a
+	 * single operation to the head page should suffice.
+	 */
+
+	if (!make_dirty) {
+		put_user_pages(pages, npages);
+		return;
+	}
+
+	for (index = 0; index < npages; index++) {
+		struct page *page = compound_head(pages[index]);
+		/*
+		 * Checking PageDirty at this point may race with
+		 * clear_page_dirty_for_io(), but that's OK. Two key
+		 * cases:
+		 *
+		 * 1) This code sees the page as already dirty, so it
+		 * skips the call to set_page_dirty(). That could happen
+		 * because clear_page_dirty_for_io() called
+		 * page_mkclean(), followed by set_page_dirty().
+		 * However, now the page is going to get written back,
+		 * which meets the original intention of setting it
+		 * dirty, so all is well: clear_page_dirty_for_io() goes
+		 * on to call TestClearPageDirty(), and write the page
+		 * back.
+		 *
+		 * 2) This code sees the page as clean, so it calls
+		 * set_page_dirty(). The page stays dirty, despite being
+		 * written back, so it gets written back again in the
+		 * next writeback cycle. This is harmless.
+		 */
+		if (!PageDirty(page))
+			set_page_dirty_lock(page);
+		put_user_page(page);
+	}
 }
 EXPORT_SYMBOL(put_user_pages_dirty_lock);
 
-- 
2.22.0


^ permalink raw reply related

* [PATCH v6 3/3] net/xdp: convert put_page() to put_user_page*()
From: john.hubbard @ 2019-08-04 21:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Björn Töpel, Boaz Harrosh,
	Christoph Hellwig, Daniel Vetter, Dan Williams, Dave Chinner,
	David Airlie, David S . Miller, Ilya Dryomov, Jan Kara,
	Jason Gunthorpe, Jens Axboe, Jérôme Glisse,
	Johannes Thumshirn, Magnus Karlsson, Matthew Wilcox,
	Miklos Szeredi, Ming Lei, Sage Weil, Santosh Shilimkar, Yan Zheng,
	netdev, dri-devel, linux-mm, linux-rdma, bpf, LKML, John Hubbard
In-Reply-To: <20190804214042.4564-1-jhubbard@nvidia.com>

From: John Hubbard <jhubbard@nvidia.com>

For pages that were retained via get_user_pages*(), release those pages
via the new put_user_page*() routines, instead of via put_page() or
release_pages().

This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
("mm: introduce put_user_page*(), placeholder versions").

Acked-by: Björn Töpel <bjorn.topel@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 net/xdp/xdp_umem.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 83de74ca729a..17c4b3d3dc34 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -166,14 +166,7 @@ void xdp_umem_clear_dev(struct xdp_umem *umem)
 
 static void xdp_umem_unpin_pages(struct xdp_umem *umem)
 {
-	unsigned int i;
-
-	for (i = 0; i < umem->npgs; i++) {
-		struct page *page = umem->pgs[i];
-
-		set_page_dirty_lock(page);
-		put_page(page);
-	}
+	put_user_pages_dirty_lock(umem->pgs, umem->npgs, true);
 
 	kfree(umem->pgs);
 	umem->pgs = NULL;
-- 
2.22.0


^ permalink raw reply related

* [PATCH v6 2/3] drivers/gpu/drm/via: convert put_page() to put_user_page*()
From: john.hubbard @ 2019-08-04 21:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Björn Töpel, Boaz Harrosh,
	Christoph Hellwig, Daniel Vetter, Dan Williams, Dave Chinner,
	David Airlie, David S . Miller, Ilya Dryomov, Jan Kara,
	Jason Gunthorpe, Jens Axboe, Jérôme Glisse,
	Johannes Thumshirn, Magnus Karlsson, Matthew Wilcox,
	Miklos Szeredi, Ming Lei, Sage Weil, Santosh Shilimkar, Yan Zheng,
	netdev, dri-devel, linux-mm, linux-rdma, bpf, LKML, John Hubbard
In-Reply-To: <20190804214042.4564-1-jhubbard@nvidia.com>

From: John Hubbard <jhubbard@nvidia.com>

For pages that were retained via get_user_pages*(), release those pages
via the new put_user_page*() routines, instead of via put_page() or
release_pages().

This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
("mm: introduce put_user_page*(), placeholder versions").

Also reverse the order of a comparison, in order to placate
checkpatch.pl.

Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/gpu/drm/via/via_dmablit.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/via/via_dmablit.c b/drivers/gpu/drm/via/via_dmablit.c
index 062067438f1d..b5b5bf0ba65e 100644
--- a/drivers/gpu/drm/via/via_dmablit.c
+++ b/drivers/gpu/drm/via/via_dmablit.c
@@ -171,7 +171,6 @@ via_map_blit_for_device(struct pci_dev *pdev,
 static void
 via_free_sg_info(struct pci_dev *pdev, drm_via_sg_info_t *vsg)
 {
-	struct page *page;
 	int i;
 
 	switch (vsg->state) {
@@ -186,13 +185,8 @@ via_free_sg_info(struct pci_dev *pdev, drm_via_sg_info_t *vsg)
 		kfree(vsg->desc_pages);
 		/* fall through */
 	case dr_via_pages_locked:
-		for (i = 0; i < vsg->num_pages; ++i) {
-			if (NULL != (page = vsg->pages[i])) {
-				if (!PageReserved(page) && (DMA_FROM_DEVICE == vsg->direction))
-					SetPageDirty(page);
-				put_page(page);
-			}
-		}
+		put_user_pages_dirty_lock(vsg->pages, vsg->num_pages,
+					  (vsg->direction == DMA_FROM_DEVICE));
 		/* fall through */
 	case dr_via_pages_alloc:
 		vfree(vsg->pages);
-- 
2.22.0


^ permalink raw reply related

* Re: [PATCH 1/1] bpf: introduce new helper udp_flow_src_port
From: Farid Zakaria @ 2019-08-04 20:43 UTC (permalink / raw)
  To: Y Song; +Cc: Alexei Starovoitov, Daniel Borkmann, netdev, bpf
In-Reply-To: <CACCo2jmcYAfY8zHJiT7NCb-Ct7Wguk9XHRc8QmZa7V3eJy0WTg@mail.gmail.com>

* re-sending as I've sent previously as HTML ... sorry *

First off, thank you for taking the time to review this patch.

It's not clear to me the backport you'd like for libbpf, I have been following
the documentation outlined in
https://www.kernel.org/doc/html/latest/bpf/index.html
I will hold off on changes to this patch as you've asked for it going forward.

This patch is inspired by a MPLSoUDP (https://tools.ietf.org/html/rfc7510)]
kernel module that was ported to eBPF -- our implementation is slightly
modified for our custom use case.

The Linux kernel provides a single abstraction for the src port for
UDP tunneling
via udp_flow_src_port. If it's improved eBPF filters would benefit if
the call is the same.

Exposing this function to eBPF programs would maintain feature parity
with other kernel
tunneling implementations.

Farid Zakaria

Farid Zakaria



On Sun, Aug 4, 2019 at 1:41 PM Farid Zakaria <farid.m.zakaria@gmail.com> wrote:
>
> First off, thank you for taking the time to review this patch.
>
> It's not clear to me the backport you'd like for libbpf, I have been following
> the documentation outlined in https://www.kernel.org/doc/html/latest/bpf/index.html
> I will hold off on changes to this patch as you've asked for it going forward.
>
> This patch is inspired by a MPLSoUDP (https://tools.ietf.org/html/rfc7510)]
> kernel module that was ported to eBPF -- our implementation is slightly
> modified for our custom use case.
>
> The Linux kernel provides a single abstraction for the src port for UDP tunneling
> via udp_flow_src_port. If it's improved eBPF filters would benefit if the call is the same.
>
> Exposing this function to eBPF programs would maintain feature parity with other kernel
> tunneling implementations.
>
> Cheers,
> Farid Zakaria
>
>
>
> On Sat, Aug 3, 2019 at 11:52 PM Y Song <ys114321@gmail.com> wrote:
>>
>> On Sat, Aug 3, 2019 at 8:29 PM Farid Zakaria <farid.m.zakaria@gmail.com> wrote:
>> >
>> > Foo over UDP uses UDP encapsulation to add additional entropy
>> > into the packets so that they get beter distribution across EMCP
>> > routes.
>> >
>> > Expose udp_flow_src_port as a bpf helper so that tunnel filters
>> > can benefit from the helper.
>> >
>> > Signed-off-by: Farid Zakaria <farid.m.zakaria@gmail.com>
>> > ---
>> >  include/uapi/linux/bpf.h                      | 21 +++++++--
>> >  net/core/filter.c                             | 20 ++++++++
>> >  tools/include/uapi/linux/bpf.h                | 21 +++++++--
>> >  tools/testing/selftests/bpf/bpf_helpers.h     |  2 +
>> >  .../bpf/prog_tests/udp_flow_src_port.c        | 28 +++++++++++
>> >  .../bpf/progs/test_udp_flow_src_port_kern.c   | 47 +++++++++++++++++++
>> >  6 files changed, 131 insertions(+), 8 deletions(-)
>> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/udp_flow_src_port.c
>> >  create mode 100644 tools/testing/selftests/bpf/progs/test_udp_flow_src_port_kern.c
>>
>> First, for each review, backport and sync with libbpf repo, in the future,
>> could you break the patch to two patches?
>>    1. kernel changes (net/core/filter.c, include/uapi/linux/bpf.h)
>>    2. tools/include/uapi/linux/bpf.h
>>    3. tools/testing/ changes
>>
>> Second, could you explain why existing __sk_buff->hash not enough?
>> there are corner cases where if __sk_buff->hash is 0 and the kernel did some
>> additional hashing, but maybe you can approximate in bpf program?
>> For case, min >= max, I suppose you can get min/max port values
>> from the user space for a particular net device and then calculate
>> the hash in the bpf program?
>> What I want to know if how much accuracy you will lose if you just
>> use __sk_buff->hash and do approximation in bpf program.
>>
>> >
>> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> > index 4393bd4b2419..90e814153dec 100644
>> > --- a/include/uapi/linux/bpf.h
>> > +++ b/include/uapi/linux/bpf.h
>> > @@ -2545,9 +2545,21 @@ union bpf_attr {
>> >   *             *th* points to the start of the TCP header, while *th_len*
>> >   *             contains **sizeof**\ (**struct tcphdr**).
>> >   *
>> > - *     Return
>> > - *             0 if *iph* and *th* are a valid SYN cookie ACK, or a negative
>> > - *             error otherwise.
>> > + *  Return
>> > + *      0 if *iph* and *th* are a valid SYN cookie ACK, or a negative
>> > + *      error otherwise.
>> > + *
>> > + * int bpf_udp_flow_src_port(struct sk_buff *skb, int min, int max, int use_eth)
>> > + *  Description
>> > + *      It's common to implement tunnelling inside a UDP protocol to provide
>> > + *      additional randomness to the packet. The destination port of the UDP
>> > + *      header indicates the inner packet type whereas the source port is used
>> > + *      for additional entropy.
>> > + *
>> > + *  Return
>> > + *      An obfuscated hash of the packet that falls within the
>> > + *      min & max port range.
>> > + *      If min >= max, the default port range is used
>> >   *
>> >   * int bpf_sysctl_get_name(struct bpf_sysctl *ctx, char *buf, size_t buf_len, u64 flags)
>> >   *     Description
>> > @@ -2853,7 +2865,8 @@ union bpf_attr {
>> >         FN(sk_storage_get),             \
>> >         FN(sk_storage_delete),          \
>> >         FN(send_signal),                \
>> > -       FN(tcp_gen_syncookie),
>> > +       FN(tcp_gen_syncookie),  \
>> > +       FN(udp_flow_src_port),
>> >
>> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>> >   * function eBPF program intends to call
>> > diff --git a/net/core/filter.c b/net/core/filter.c
>> > index 5a2707918629..fdf0ebb8c2c8 100644
>> > --- a/net/core/filter.c
>> > +++ b/net/core/filter.c
>> > @@ -2341,6 +2341,24 @@ static const struct bpf_func_proto bpf_msg_pull_data_proto = {
>> >         .arg4_type      = ARG_ANYTHING,
>> >  };
>> >
>> > +BPF_CALL_4(bpf_udp_flow_src_port, struct sk_buff *, skb, int, min,
>> > +          int, max, int, use_eth)
>> > +{
>> > +       struct net *net = dev_net(skb->dev);
>> > +
>> > +       return udp_flow_src_port(net, skb, min, max, use_eth);
>> > +}
>> > +
>> [...]

^ permalink raw reply

* Re: [PATCH net] net: dsa: mv88e6xxx: drop adjust_link to enabled phylink
From: Vladimir Oltean @ 2019-08-04 19:37 UTC (permalink / raw)
  To: Hubert Feurstein
  Cc: netdev, lkml, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	David S. Miller, Ioana Ciornei
In-Reply-To: <20190731154239.19270-1-h.feurstein@gmail.com>

On Wed, 31 Jul 2019 at 18:43, Hubert Feurstein <h.feurstein@gmail.com> wrote:
>
> We have to drop the adjust_link callback in order to finally migrate to
> phylink.
>
> Otherwise we get the following warning during startup:
>   "mv88e6xxx 2188000.ethernet-1:10: Using legacy PHYLIB callbacks. Please
>    migrate to PHYLINK!"
>
> The warning is generated in the function dsa_port_link_register_of in
> dsa/port.c:
>
>   int dsa_port_link_register_of(struct dsa_port *dp)
>   {
>         struct dsa_switch *ds = dp->ds;
>
>         if (!ds->ops->adjust_link)
>                 return dsa_port_phylink_register(dp);
>
>         dev_warn(ds->dev,
>                  "Using legacy PHYLIB callbacks. Please migrate to PHYLINK!\n");
>         [...]
>   }
>
> Signed-off-by: Hubert Feurstein <h.feurstein@gmail.com>
> ---

Reviewed-by: Vladimir Oltean <olteanv@gmail.com>

>  drivers/net/dsa/mv88e6xxx/chip.c | 26 --------------------------
>  1 file changed, 26 deletions(-)
>
> diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
> index 366f70bfe055..37e8babd035f 100644
> --- a/drivers/net/dsa/mv88e6xxx/chip.c
> +++ b/drivers/net/dsa/mv88e6xxx/chip.c
> @@ -27,7 +27,6 @@
>  #include <linux/platform_data/mv88e6xxx.h>
>  #include <linux/netdevice.h>
>  #include <linux/gpio/consumer.h>
> -#include <linux/phy.h>
>  #include <linux/phylink.h>
>  #include <net/dsa.h>
>
> @@ -482,30 +481,6 @@ static int mv88e6xxx_phy_is_internal(struct dsa_switch *ds, int port)
>         return port < chip->info->num_internal_phys;
>  }
>
> -/* We expect the switch to perform auto negotiation if there is a real
> - * phy. However, in the case of a fixed link phy, we force the port
> - * settings from the fixed link settings.
> - */
> -static void mv88e6xxx_adjust_link(struct dsa_switch *ds, int port,
> -                                 struct phy_device *phydev)
> -{
> -       struct mv88e6xxx_chip *chip = ds->priv;
> -       int err;
> -
> -       if (!phy_is_pseudo_fixed_link(phydev) &&
> -           mv88e6xxx_phy_is_internal(ds, port))
> -               return;
> -
> -       mv88e6xxx_reg_lock(chip);
> -       err = mv88e6xxx_port_setup_mac(chip, port, phydev->link, phydev->speed,
> -                                      phydev->duplex, phydev->pause,
> -                                      phydev->interface);
> -       mv88e6xxx_reg_unlock(chip);
> -
> -       if (err && err != -EOPNOTSUPP)
> -               dev_err(ds->dev, "p%d: failed to configure MAC\n", port);
> -}
> -
>  static void mv88e6065_phylink_validate(struct mv88e6xxx_chip *chip, int port,
>                                        unsigned long *mask,
>                                        struct phylink_link_state *state)
> @@ -4755,7 +4730,6 @@ static int mv88e6xxx_port_egress_floods(struct dsa_switch *ds, int port,
>  static const struct dsa_switch_ops mv88e6xxx_switch_ops = {
>         .get_tag_protocol       = mv88e6xxx_get_tag_protocol,
>         .setup                  = mv88e6xxx_setup,
> -       .adjust_link            = mv88e6xxx_adjust_link,
>         .phylink_validate       = mv88e6xxx_validate,
>         .phylink_mac_link_state = mv88e6xxx_link_state,
>         .phylink_mac_config     = mv88e6xxx_mac_config,
> --
> 2.22.0
>

^ permalink raw reply

* Re: [PATCH] net: sctp: Rename fallthrough label to unhandled
From: Neil Horman @ 2019-08-04 19:26 UTC (permalink / raw)
  To: David Miller
  Cc: joe, vyasevich, marcelo.leitner, linux-sctp, netdev, linux-kernel
In-Reply-To: <20190802.161932.1776993765494484851.davem@davemloft.net>

On Fri, Aug 02, 2019 at 04:19:32PM -0700, David Miller wrote:
> From: Joe Perches <joe@perches.com>
> Date: Fri, 02 Aug 2019 10:47:34 -0700
> 
> > On Wed, 2019-07-31 at 08:16 -0400, Neil Horman wrote:
> >> On Wed, Jul 31, 2019 at 04:32:43AM -0700, Joe Perches wrote:
> >> > On Wed, 2019-07-31 at 07:19 -0400, Neil Horman wrote:
> >> > > On Tue, Jul 30, 2019 at 10:04:37PM -0700, Joe Perches wrote:
> >> > > > fallthrough may become a pseudo reserved keyword so this only use of
> >> > > > fallthrough is better renamed to allow it.
> > 
> > Can you or any other maintainer apply this patch
> > or ack it so David Miller can apply it?
> 
> I, like others, don't like the lack of __ in the keyword.  It's kind of
> rediculous the problems it creates to pollute the global namespace like
> that and yes also inconsistent with other shorthands for builtins.
> 
FWIW, I acked the sctp patch, because the use of the word fallthrough as a
label, isn't that important to me, unhendled is just as good, so I'm ok with
that change.

But, as I stated in the other thread, I agree, making a macro out of fallthrough
without clearly naming it using a macro convention like __ is not something I'm
ok with
Neil


^ permalink raw reply

* Re: [PATCH net-next v3] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Vladimir Oltean @ 2019-08-04 19:22 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: Andrew Lunn, Tao Ren, Florian Fainelli, David S . Miller,
	Arun Parameswaran, Justin Chen, netdev, lkml,
	openbmc@lists.ozlabs.org
In-Reply-To: <f8de2514-081a-0e6e-fbe2-bcafcd459646@gmail.com>

On Sun, 4 Aug 2019 at 19:07, Heiner Kallweit <hkallweit1@gmail.com> wrote:
>
> On 04.08.2019 17:59, Vladimir Oltean wrote:
> > On Sun, 4 Aug 2019 at 17:52, Andrew Lunn <andrew@lunn.ch> wrote:
> >>
> >>>> The patchset looks better now. But is it ok, I wonder, to keep
> >>>> PHY_BCM_FLAGS_MODE_1000BX in phydev->dev_flags, considering that
> >>>> phy_attach_direct is overwriting it?
> >>>
> >>
> >>> I checked ftgmac100 driver (used on my machine) and it calls
> >>> phy_connect_direct which passes phydev->dev_flags when calling
> >>> phy_attach_direct: that explains why the flag is not cleared in my
> >>> case.
> >>
> >> Yes, that is the way it is intended to be used. The MAC driver can
> >> pass flags to the PHY. It is a fragile API, since the MAC needs to
> >> know what PHY is being used, since the flags are driver specific.
> >>
> >> One option would be to modify the assignment in phy_attach_direct() to
> >> OR in the flags passed to it with flags which are already in
> >> phydev->dev_flags.
> >>
> >>         Andrew
> >
> > Even if that were the case (patching phy_attach_direct to apply a
> > logical-or to dev_flags), it sounds fishy to me that the genphy code
> > is unable to determine that this PHY is running in 1000Base-X mode.
> >
> > In my opinion it all boils down to this warning:
> >
> > "PHY advertising (0,00000200,000062c0) more modes than genphy
> > supports, some modes not advertised".
> >
> The genphy code deals with Clause 22 + Gigabit BaseT only.
> Question is whether you want aneg at all in 1000Base-X mode and
> what you want the config_aneg callback to do.
> There may be some inspiration in the Marvel PHY drivers.
>

AN for 1000Base-X still gives you duplex and pause frame settings. I
thought the base page format for exchanging that info is standardized
in clause 37.
Does genphy cover only copper media by design, or is it desirable to
augment genphy_read_status?

> > You see, the 0x200 in the above advertising mask corresponds exactly
> > to this definition from ethtool.h:
> >     ETHTOOL_LINK_MODE_1000baseX_Full_BIT    = 41,
> >
> > But it gets truncated and hence lost.
> >
> > Regards,
> > -Vladimir
> >
> Heiner

^ permalink raw reply

* Re: [PATCH net-next v3] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Andrew Lunn @ 2019-08-04 16:22 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: Vladimir Oltean, Tao Ren, Florian Fainelli, David S . Miller,
	Arun Parameswaran, Justin Chen, netdev, lkml,
	openbmc@lists.ozlabs.org
In-Reply-To: <f8de2514-081a-0e6e-fbe2-bcafcd459646@gmail.com>

> > Even if that were the case (patching phy_attach_direct to apply a
> > logical-or to dev_flags), it sounds fishy to me that the genphy code
> > is unable to determine that this PHY is running in 1000Base-X mode.
> > 
> > In my opinion it all boils down to this warning:
> > 
> > "PHY advertising (0,00000200,000062c0) more modes than genphy
> > supports, some modes not advertised".
> > 
> The genphy code deals with Clause 22 + Gigabit BaseT only.
> Question is whether you want aneg at all in 1000Base-X mode and
> what you want the config_aneg callback to do.
> There may be some inspiration in the Marvel PHY drivers.

As far as i know, you cannot actually advertise 1000Base-X. So we
probably should not be setting the bit in advertise, only having it in
supported?

	Andrew

^ permalink raw reply

* Re: [PATCH net-next v3] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Heiner Kallweit @ 2019-08-04 16:06 UTC (permalink / raw)
  To: Vladimir Oltean, Andrew Lunn
  Cc: Tao Ren, Florian Fainelli, David S . Miller, Arun Parameswaran,
	Justin Chen, netdev, lkml, openbmc@lists.ozlabs.org
In-Reply-To: <CA+h21hrUDaSxKpsy9TuWqwgaxKYaoXHyhgS=xSoAcPwxXzvrHg@mail.gmail.com>

On 04.08.2019 17:59, Vladimir Oltean wrote:
> On Sun, 4 Aug 2019 at 17:52, Andrew Lunn <andrew@lunn.ch> wrote:
>>
>>>> The patchset looks better now. But is it ok, I wonder, to keep
>>>> PHY_BCM_FLAGS_MODE_1000BX in phydev->dev_flags, considering that
>>>> phy_attach_direct is overwriting it?
>>>
>>
>>> I checked ftgmac100 driver (used on my machine) and it calls
>>> phy_connect_direct which passes phydev->dev_flags when calling
>>> phy_attach_direct: that explains why the flag is not cleared in my
>>> case.
>>
>> Yes, that is the way it is intended to be used. The MAC driver can
>> pass flags to the PHY. It is a fragile API, since the MAC needs to
>> know what PHY is being used, since the flags are driver specific.
>>
>> One option would be to modify the assignment in phy_attach_direct() to
>> OR in the flags passed to it with flags which are already in
>> phydev->dev_flags.
>>
>>         Andrew
> 
> Even if that were the case (patching phy_attach_direct to apply a
> logical-or to dev_flags), it sounds fishy to me that the genphy code
> is unable to determine that this PHY is running in 1000Base-X mode.
> 
> In my opinion it all boils down to this warning:
> 
> "PHY advertising (0,00000200,000062c0) more modes than genphy
> supports, some modes not advertised".
> 
The genphy code deals with Clause 22 + Gigabit BaseT only.
Question is whether you want aneg at all in 1000Base-X mode and
what you want the config_aneg callback to do.
There may be some inspiration in the Marvel PHY drivers.

> You see, the 0x200 in the above advertising mask corresponds exactly
> to this definition from ethtool.h:
>     ETHTOOL_LINK_MODE_1000baseX_Full_BIT    = 41,
> 
> But it gets truncated and hence lost.
> 
> Regards,
> -Vladimir
> 
Heiner

^ permalink raw reply

* Re: [PATCH net-next v3] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Vladimir Oltean @ 2019-08-04 15:59 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Tao Ren, Florian Fainelli, Heiner Kallweit, David S . Miller,
	Arun Parameswaran, Justin Chen, netdev, lkml,
	openbmc@lists.ozlabs.org
In-Reply-To: <20190804145152.GA6800@lunn.ch>

On Sun, 4 Aug 2019 at 17:52, Andrew Lunn <andrew@lunn.ch> wrote:
>
> > > The patchset looks better now. But is it ok, I wonder, to keep
> > > PHY_BCM_FLAGS_MODE_1000BX in phydev->dev_flags, considering that
> > > phy_attach_direct is overwriting it?
> >
>
> > I checked ftgmac100 driver (used on my machine) and it calls
> > phy_connect_direct which passes phydev->dev_flags when calling
> > phy_attach_direct: that explains why the flag is not cleared in my
> > case.
>
> Yes, that is the way it is intended to be used. The MAC driver can
> pass flags to the PHY. It is a fragile API, since the MAC needs to
> know what PHY is being used, since the flags are driver specific.
>
> One option would be to modify the assignment in phy_attach_direct() to
> OR in the flags passed to it with flags which are already in
> phydev->dev_flags.
>
>         Andrew

Even if that were the case (patching phy_attach_direct to apply a
logical-or to dev_flags), it sounds fishy to me that the genphy code
is unable to determine that this PHY is running in 1000Base-X mode.

In my opinion it all boils down to this warning:

"PHY advertising (0,00000200,000062c0) more modes than genphy
supports, some modes not advertised".

You see, the 0x200 in the above advertising mask corresponds exactly
to this definition from ethtool.h:
    ETHTOOL_LINK_MODE_1000baseX_Full_BIT    = 41,

But it gets truncated and hence lost.

Regards,
-Vladimir

^ permalink raw reply

* [PATCH] Documentation: virt: Fix broken reference to virt tree's index
From: Sheriff Esseson @ 2019-08-04 15:46 UTC (permalink / raw)
  To: skhan
  Cc: linux-kernel-mentees, Jonathan Corbet, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song,
	open list:DOCUMENTATION, open list, open list:RISC-V ARCHITECTURE,
	open list:BPF (Safe dynamic programs and tools),
	open list:BPF (Safe dynamic programs and tools)

Fix broken reference to virt/index.rst.

Sequel to: 2f5947dfcaec ("Documentation: move Documentation/virtual to
Documentation/virt")

Reported-by: Sphinx

Signed-off-by: Sheriff Esseson <sheriffesseson@gmail.com>
---
 Documentation/index.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/index.rst b/Documentation/index.rst
index 2df5a3da563c..5205430305d5 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -115,7 +115,7 @@ needed).
    target/index
    timers/index
    watchdog/index
-   virtual/index
+   virt/index
    input/index
    hwmon/index
    gpu/index
-- 
2.17.1


^ permalink raw reply related

* [PATCH] net: dsa: qca8k: Add of_node_put() in qca8k_setup_mdio_bus()
From: Nishka Dasgupta @ 2019-08-04 15:30 UTC (permalink / raw)
  To: andrew, vivien.didelot, f.fainelli, davem, netdev; +Cc: Nishka Dasgupta

Each iteration of for_each_available_child_of_node() puts the previous
node, but in the case of a return from the middle of the loop, there
is no put, thus causing a memory leak. Hence add an of_node_put() before
the return.
Additionally, the local variable ports in the function 
qca8k_setup_mdio_bus() takes the return value of of_get_child_by_name(),
which gets a node but does not put it. If the function returns without
putting ports, it may cause a memory leak. Hence put ports before the
mid-loop return statement, and also outside the loop after its last usage
in this function.
Issues found with Coccinelle.

Signed-off-by: Nishka Dasgupta <nishkadg.linux@gmail.com>
---
 drivers/net/dsa/qca8k.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index 232e8cc96f6d..87753e4423aa 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -583,8 +583,11 @@ qca8k_setup_mdio_bus(struct qca8k_priv *priv)

 	for_each_available_child_of_node(ports, port) {
 		err = of_property_read_u32(port, "reg", &reg);
-		if (err)
+		if (err) {
+			of_node_put(port);
+			of_node_put(ports);
 			return err;
+		}

 		if (!dsa_is_user_port(priv->ds, reg))
 			continue;
@@ -595,6 +598,7 @@ qca8k_setup_mdio_bus(struct qca8k_priv *priv)
 			internal_mdio_mask |= BIT(reg);
 	}

+	of_node_put(ports);
 	if (!external_mdio_mask && !internal_mdio_mask) {
 		dev_err(priv->dev, "no PHYs are defined.\n");
 		return -EINVAL;
-- 
2.19.1

^ permalink raw reply related

* [PATCH net-next 10/10] nfp: flower: encode mac indexes with pre-tunnel rule check
From: John Hurley @ 2019-08-04 15:10 UTC (permalink / raw)
  To: netdev; +Cc: davem, simon.horman, jakub.kicinski, oss-drivers, John Hurley

When a tunnel packet arrives on the NFP card, its destination MAC is
looked up and MAC index returned for it. This index can help verify the
tunnel by, for example, ensuring that the packet arrived on the expected
port. If the packet is destined for a known MAC that is not connected to a
given physical port then the mac index can have a global value (e.g. when
a series of bonded ports shared the same MAC).

If the packet is to be detunneled at a bridge device or internal port like
an Open vSwitch VLAN port, then it should first match a 'pre-tunnel' rule
to direct it to that internal port.

Use the MAC index to indicate if a packet should match a pre-tunnel rule
before decap is allowed. Do this by tracking the number of internal ports
associated with a MAC address and, if the number if >0, set a bit in the
mac_index to forward the packet to the pre-tunnel table before continuing
with decap.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 .../ethernet/netronome/nfp/flower/tunnel_conf.c    | 71 +++++++++++++++++-----
 1 file changed, 55 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
index a61e7f2..def8c19 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
@@ -17,6 +17,7 @@
 
 #define NFP_TUN_PRE_TUN_RULE_LIMIT	32
 #define NFP_TUN_PRE_TUN_RULE_DEL	0x1
+#define NFP_TUN_PRE_TUN_IDX_BIT		0x8
 
 /**
  * struct nfp_tun_pre_run_rule - rule matched before decap
@@ -141,11 +142,12 @@ enum nfp_flower_mac_offload_cmd {
 
 /**
  * struct nfp_tun_offloaded_mac - hashtable entry for an offloaded MAC
- * @ht_node:	Hashtable entry
- * @addr:	Offloaded MAC address
- * @index:	Offloaded index for given MAC address
- * @ref_count:	Number of devs using this MAC address
- * @repr_list:	List of reprs sharing this MAC address
+ * @ht_node:		Hashtable entry
+ * @addr:		Offloaded MAC address
+ * @index:		Offloaded index for given MAC address
+ * @ref_count:		Number of devs using this MAC address
+ * @repr_list:		List of reprs sharing this MAC address
+ * @bridge_count:	Number of bridge/internal devs with MAC
  */
 struct nfp_tun_offloaded_mac {
 	struct rhash_head ht_node;
@@ -153,6 +155,7 @@ struct nfp_tun_offloaded_mac {
 	u16 index;
 	int ref_count;
 	struct list_head repr_list;
+	int bridge_count;
 };
 
 static const struct rhashtable_params offloaded_macs_params = {
@@ -573,6 +576,8 @@ nfp_tunnel_offloaded_macs_inc_ref_and_link(struct nfp_tun_offloaded_mac *entry,
 			list_del(&repr_priv->mac_list);
 
 		list_add_tail(&repr_priv->mac_list, &entry->repr_list);
+	} else if (nfp_flower_is_supported_bridge(netdev)) {
+		entry->bridge_count++;
 	}
 
 	entry->ref_count++;
@@ -589,20 +594,35 @@ nfp_tunnel_add_shared_mac(struct nfp_app *app, struct net_device *netdev,
 
 	entry = nfp_tunnel_lookup_offloaded_macs(app, netdev->dev_addr);
 	if (entry && nfp_tunnel_is_mac_idx_global(entry->index)) {
-		nfp_tunnel_offloaded_macs_inc_ref_and_link(entry, netdev, mod);
-		return 0;
+		if (entry->bridge_count ||
+		    !nfp_flower_is_supported_bridge(netdev)) {
+			nfp_tunnel_offloaded_macs_inc_ref_and_link(entry,
+								   netdev, mod);
+			return 0;
+		}
+
+		/* MAC is global but matches need to go to pre_tun table. */
+		nfp_mac_idx = entry->index | NFP_TUN_PRE_TUN_IDX_BIT;
 	}
 
-	/* Assign a global index if non-repr or MAC address is now shared. */
-	if (entry || !port) {
-		ida_idx = ida_simple_get(&priv->tun.mac_off_ids, 0,
-					 NFP_MAX_MAC_INDEX, GFP_KERNEL);
-		if (ida_idx < 0)
-			return ida_idx;
+	if (!nfp_mac_idx) {
+		/* Assign a global index if non-repr or MAC is now shared. */
+		if (entry || !port) {
+			ida_idx = ida_simple_get(&priv->tun.mac_off_ids, 0,
+						 NFP_MAX_MAC_INDEX, GFP_KERNEL);
+			if (ida_idx < 0)
+				return ida_idx;
 
-		nfp_mac_idx = nfp_tunnel_get_global_mac_idx_from_ida(ida_idx);
-	} else {
-		nfp_mac_idx = nfp_tunnel_get_mac_idx_from_phy_port_id(port);
+			nfp_mac_idx =
+				nfp_tunnel_get_global_mac_idx_from_ida(ida_idx);
+
+			if (nfp_flower_is_supported_bridge(netdev))
+				nfp_mac_idx |= NFP_TUN_PRE_TUN_IDX_BIT;
+
+		} else {
+			nfp_mac_idx =
+				nfp_tunnel_get_mac_idx_from_phy_port_id(port);
+		}
 	}
 
 	if (!entry) {
@@ -671,6 +691,25 @@ nfp_tunnel_del_shared_mac(struct nfp_app *app, struct net_device *netdev,
 		list_del(&repr_priv->mac_list);
 	}
 
+	if (nfp_flower_is_supported_bridge(netdev)) {
+		entry->bridge_count--;
+
+		if (!entry->bridge_count && entry->ref_count) {
+			u16 nfp_mac_idx;
+
+			nfp_mac_idx = entry->index & ~NFP_TUN_PRE_TUN_IDX_BIT;
+			if (__nfp_tunnel_offload_mac(app, mac, nfp_mac_idx,
+						     false)) {
+				nfp_flower_cmsg_warn(app, "MAC offload index revert failed on %s.\n",
+						     netdev_name(netdev));
+				return 0;
+			}
+
+			entry->index = nfp_mac_idx;
+			return 0;
+		}
+	}
+
 	/* If MAC is now used by 1 repr set the offloaded MAC index to port. */
 	if (entry->ref_count == 1 && list_is_singular(&entry->repr_list)) {
 		u16 nfp_mac_idx;
-- 
2.7.4


^ permalink raw reply related

* [PATCH net-next 03/10] net: tc_act: add helpers to detect ingress mirred actions
From: John Hurley @ 2019-08-04 15:09 UTC (permalink / raw)
  To: netdev; +Cc: davem, simon.horman, jakub.kicinski, oss-drivers, John Hurley
In-Reply-To: <1564931351-1036-1-git-send-email-john.hurley@netronome.com>

TC mirred actions can send to egress or ingress on a given netdev. Helpers
exist to detect actions that are mirred to egress. Extend the header file
to include helpers to detect ingress mirred actions.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 include/net/tc_act/tc_mirred.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/include/net/tc_act/tc_mirred.h b/include/net/tc_act/tc_mirred.h
index c757585..1cace4c 100644
--- a/include/net/tc_act/tc_mirred.h
+++ b/include/net/tc_act/tc_mirred.h
@@ -32,6 +32,24 @@ static inline bool is_tcf_mirred_egress_mirror(const struct tc_action *a)
 	return false;
 }
 
+static inline bool is_tcf_mirred_ingress_redirect(const struct tc_action *a)
+{
+#ifdef CONFIG_NET_CLS_ACT
+	if (a->ops && a->ops->id == TCA_ID_MIRRED)
+		return to_mirred(a)->tcfm_eaction == TCA_INGRESS_REDIR;
+#endif
+	return false;
+}
+
+static inline bool is_tcf_mirred_ingress_mirror(const struct tc_action *a)
+{
+#ifdef CONFIG_NET_CLS_ACT
+	if (a->ops && a->ops->id == TCA_ID_MIRRED)
+		return to_mirred(a)->tcfm_eaction == TCA_INGRESS_MIRROR;
+#endif
+	return false;
+}
+
 static inline struct net_device *tcf_mirred_dev(const struct tc_action *a)
 {
 	return rtnl_dereference(to_mirred(a)->tcfm_dev);
-- 
2.7.4


^ permalink raw reply related

* [PATCH net-next 09/10] nfp: flower: remove offloaded MACs when reprs are applied to OvS bridges
From: John Hurley @ 2019-08-04 15:09 UTC (permalink / raw)
  To: netdev; +Cc: davem, simon.horman, jakub.kicinski, oss-drivers, John Hurley
In-Reply-To: <1564931351-1036-1-git-send-email-john.hurley@netronome.com>

MAC addresses along with an identifying index are offloaded to firmware to
allow tunnel decapsulation. If a tunnel packet arrives with a matching
destination MAC address and a verified index, it can continue on the
decapsulation process. This replicates the MAC verifications carried out
in the kernel network stack.

When a netdev is added to a bridge (e.g. OvS) then packets arriving on
that dev are directed through the bridge datapath instead of passing
through the network stack. Therefore, tunnelled packets matching the MAC
of that dev will not be decapped here.

Replicate this behaviour on firmware by removing offloaded MAC addresses
when a MAC representer is added to an OvS bridge. This can prevent any
false positive tunnel decaps.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/flower/main.h   |  7 ++++
 .../ethernet/netronome/nfp/flower/tunnel_conf.c    | 42 ++++++++++++++++++++++
 2 files changed, 49 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.h b/drivers/net/ethernet/netronome/nfp/flower/main.h
index 5d302d7..31d9459 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.h
@@ -221,6 +221,7 @@ struct nfp_fl_qos {
  * @block_shared:	Flag indicating if offload applies to shared blocks
  * @mac_list:		List entry of reprs that share the same offloaded MAC
  * @qos_table:		Stored info on filters implementing qos
+ * @on_bridge:		Indicates if the repr is attached to a bridge
  */
 struct nfp_flower_repr_priv {
 	struct nfp_repr *nfp_repr;
@@ -230,6 +231,7 @@ struct nfp_flower_repr_priv {
 	bool block_shared;
 	struct list_head mac_list;
 	struct nfp_fl_qos qos_table;
+	bool on_bridge;
 };
 
 /**
@@ -341,6 +343,11 @@ static inline bool nfp_flower_is_merge_flow(struct nfp_fl_payload *flow_pay)
 	return flow_pay->tc_flower_cookie == (unsigned long)flow_pay;
 }
 
+static inline bool nfp_flower_is_supported_bridge(struct net_device *netdev)
+{
+	return netif_is_ovs_master(netdev);
+}
+
 int nfp_flower_metadata_init(struct nfp_app *app, u64 host_ctx_count,
 			     unsigned int host_ctx_split);
 void nfp_flower_metadata_cleanup(struct nfp_app *app);
diff --git a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
index b9dbfb7..a61e7f2 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
@@ -730,6 +730,9 @@ nfp_tunnel_offload_mac(struct nfp_app *app, struct net_device *netdev,
 			return 0;
 
 		repr_priv = repr->app_priv;
+		if (repr_priv->on_bridge)
+			return 0;
+
 		mac_offloaded = &repr_priv->mac_offloaded;
 		off_mac = &repr_priv->offloaded_mac_addr[0];
 		port = nfp_repr_get_port_id(netdev);
@@ -845,6 +848,45 @@ int nfp_tunnel_mac_event_handler(struct nfp_app *app,
 		if (err)
 			nfp_flower_cmsg_warn(app, "Failed to offload MAC change on %s.\n",
 					     netdev_name(netdev));
+	} else if (event == NETDEV_CHANGEUPPER) {
+		/* If a repr is attached to a bridge then tunnel packets
+		 * entering the physical port are directed through the bridge
+		 * datapath and cannot be directly detunneled. Therefore,
+		 * associated offloaded MACs and indexes should not be used
+		 * by fw for detunneling.
+		 */
+		struct netdev_notifier_changeupper_info *info = ptr;
+		struct net_device *upper = info->upper_dev;
+		struct nfp_flower_repr_priv *repr_priv;
+		struct nfp_repr *repr;
+
+		if (!nfp_netdev_is_nfp_repr(netdev) ||
+		    !nfp_flower_is_supported_bridge(upper))
+			return NOTIFY_OK;
+
+		repr = netdev_priv(netdev);
+		if (repr->app != app)
+			return NOTIFY_OK;
+
+		repr_priv = repr->app_priv;
+
+		if (info->linking) {
+			if (nfp_tunnel_offload_mac(app, netdev,
+						   NFP_TUNNEL_MAC_OFFLOAD_DEL))
+				nfp_flower_cmsg_warn(app, "Failed to delete offloaded MAC on %s.\n",
+						     netdev_name(netdev));
+			repr_priv->on_bridge = true;
+		} else {
+			repr_priv->on_bridge = false;
+
+			if (!(netdev->flags & IFF_UP))
+				return NOTIFY_OK;
+
+			if (nfp_tunnel_offload_mac(app, netdev,
+						   NFP_TUNNEL_MAC_OFFLOAD_ADD))
+				nfp_flower_cmsg_warn(app, "Failed to offload MAC on %s.\n",
+						     netdev_name(netdev));
+		}
 	}
 	return NOTIFY_OK;
 }
-- 
2.7.4


^ permalink raw reply related

* [PATCH net-next 06/10] nfp: flower: detect potential pre-tunnel rules
From: John Hurley @ 2019-08-04 15:09 UTC (permalink / raw)
  To: netdev; +Cc: davem, simon.horman, jakub.kicinski, oss-drivers, John Hurley
In-Reply-To: <1564931351-1036-1-git-send-email-john.hurley@netronome.com>

Pre-tunnel rules are used when the tunnel end-point is on an 'internal
port'. These rules are used to direct the tunnelled packets (based on outer
header fields) to the internal port where they can be detunnelled. The
rule must send the packet to ingress the internal port at the TC layer.

Currently FW does not support an action to send to ingress so cannot
offload such rules. However, in preparation for populating the pre-tunnel
table to represent such rules, check for rules that send to the ingress of
an internal port and mark them as such. Further validation of such rules
is left to subsequent patches.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/flower/action.c | 40 ++++++++++++++++++----
 drivers/net/ethernet/netronome/nfp/flower/main.h   |  4 +++
 .../net/ethernet/netronome/nfp/flower/offload.c    | 25 ++++++++++++++
 3 files changed, 62 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/action.c b/drivers/net/ethernet/netronome/nfp/flower/action.c
index ff2f419..1b019fd 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/action.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/action.c
@@ -173,7 +173,7 @@ nfp_fl_output(struct nfp_app *app, struct nfp_fl_output *output,
 	      struct nfp_fl_payload *nfp_flow,
 	      bool last, struct net_device *in_dev,
 	      enum nfp_flower_tun_type tun_type, int *tun_out_cnt,
-	      struct netlink_ext_ack *extack)
+	      bool pkt_host, struct netlink_ext_ack *extack)
 {
 	size_t act_size = sizeof(struct nfp_fl_output);
 	struct nfp_flower_priv *priv = app->priv;
@@ -218,6 +218,20 @@ nfp_fl_output(struct nfp_app *app, struct nfp_fl_output *output,
 			return gid;
 		}
 		output->port = cpu_to_be32(NFP_FL_LAG_OUT | gid);
+	} else if (nfp_flower_internal_port_can_offload(app, out_dev)) {
+		if (!(priv->flower_ext_feats & NFP_FL_FEATS_PRE_TUN_RULES)) {
+			NL_SET_ERR_MSG_MOD(extack, "unsupported offload: pre-tunnel rules not supported in loaded firmware");
+			return -EOPNOTSUPP;
+		}
+
+		if (nfp_flow->pre_tun_rule.dev || !pkt_host) {
+			NL_SET_ERR_MSG_MOD(extack, "unsupported offload: pre-tunnel rules require single egress dev and ptype HOST action");
+			return -EOPNOTSUPP;
+		}
+
+		nfp_flow->pre_tun_rule.dev = out_dev;
+
+		return 0;
 	} else {
 		/* Set action output parameters. */
 		output->flags = cpu_to_be16(tmp_flags);
@@ -885,7 +899,7 @@ nfp_flower_output_action(struct nfp_app *app,
 			 struct nfp_fl_payload *nfp_fl, int *a_len,
 			 struct net_device *netdev, bool last,
 			 enum nfp_flower_tun_type *tun_type, int *tun_out_cnt,
-			 int *out_cnt, u32 *csum_updated,
+			 int *out_cnt, u32 *csum_updated, bool pkt_host,
 			 struct netlink_ext_ack *extack)
 {
 	struct nfp_flower_priv *priv = app->priv;
@@ -907,7 +921,7 @@ nfp_flower_output_action(struct nfp_app *app,
 
 	output = (struct nfp_fl_output *)&nfp_fl->action_data[*a_len];
 	err = nfp_fl_output(app, output, act, nfp_fl, last, netdev, *tun_type,
-			    tun_out_cnt, extack);
+			    tun_out_cnt, pkt_host, extack);
 	if (err)
 		return err;
 
@@ -939,7 +953,7 @@ nfp_flower_loop_action(struct nfp_app *app, const struct flow_action_entry *act,
 		       struct net_device *netdev,
 		       enum nfp_flower_tun_type *tun_type, int *tun_out_cnt,
 		       int *out_cnt, u32 *csum_updated,
-		       struct nfp_flower_pedit_acts *set_act,
+		       struct nfp_flower_pedit_acts *set_act, bool *pkt_host,
 		       struct netlink_ext_ack *extack, int act_idx)
 {
 	struct nfp_fl_set_ipv4_tun *set_tun;
@@ -955,17 +969,21 @@ nfp_flower_loop_action(struct nfp_app *app, const struct flow_action_entry *act,
 	case FLOW_ACTION_DROP:
 		nfp_fl->meta.shortcut = cpu_to_be32(NFP_FL_SC_ACT_DROP);
 		break;
+	case FLOW_ACTION_REDIRECT_INGRESS:
 	case FLOW_ACTION_REDIRECT:
 		err = nfp_flower_output_action(app, act, nfp_fl, a_len, netdev,
 					       true, tun_type, tun_out_cnt,
-					       out_cnt, csum_updated, extack);
+					       out_cnt, csum_updated, *pkt_host,
+					       extack);
 		if (err)
 			return err;
 		break;
+	case FLOW_ACTION_MIRRED_INGRESS:
 	case FLOW_ACTION_MIRRED:
 		err = nfp_flower_output_action(app, act, nfp_fl, a_len, netdev,
 					       false, tun_type, tun_out_cnt,
-					       out_cnt, csum_updated, extack);
+					       out_cnt, csum_updated, *pkt_host,
+					       extack);
 		if (err)
 			return err;
 		break;
@@ -1095,6 +1113,13 @@ nfp_flower_loop_action(struct nfp_app *app, const struct flow_action_entry *act,
 		nfp_fl_set_mpls(set_m, act);
 		*a_len += sizeof(struct nfp_fl_set_mpls);
 		break;
+	case FLOW_ACTION_PTYPE:
+		/* TC ptype skbedit sets PACKET_HOST for ingress redirect. */
+		if (act->ptype != PACKET_HOST)
+			return -EOPNOTSUPP;
+
+		*pkt_host = true;
+		break;
 	default:
 		/* Currently we do not handle any other actions. */
 		NL_SET_ERR_MSG_MOD(extack, "unsupported offload: unsupported action in action list");
@@ -1150,6 +1175,7 @@ int nfp_flower_compile_action(struct nfp_app *app,
 	struct nfp_flower_pedit_acts set_act;
 	enum nfp_flower_tun_type tun_type;
 	struct flow_action_entry *act;
+	bool pkt_host = false;
 	u32 csum_updated = 0;
 
 	memset(nfp_flow->action_data, 0, NFP_FL_MAX_A_SIZ);
@@ -1166,7 +1192,7 @@ int nfp_flower_compile_action(struct nfp_app *app,
 		err = nfp_flower_loop_action(app, act, flow, nfp_flow, &act_len,
 					     netdev, &tun_type, &tun_out_cnt,
 					     &out_cnt, &csum_updated,
-					     &set_act, extack, i);
+					     &set_act, &pkt_host, extack, i);
 		if (err)
 			return err;
 		act_cnt++;
diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.h b/drivers/net/ethernet/netronome/nfp/flower/main.h
index af9441d..6e9de4e 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.h
@@ -42,6 +42,7 @@ struct nfp_app;
 #define NFP_FL_FEATS_VLAN_PCP		BIT(3)
 #define NFP_FL_FEATS_VF_RLIM		BIT(4)
 #define NFP_FL_FEATS_FLOW_MOD		BIT(5)
+#define NFP_FL_FEATS_PRE_TUN_RULES	BIT(6)
 #define NFP_FL_FEATS_FLOW_MERGE		BIT(30)
 #define NFP_FL_FEATS_LAG		BIT(31)
 
@@ -280,6 +281,9 @@ struct nfp_fl_payload {
 	char *action_data;
 	struct list_head linked_flows;
 	bool in_hw;
+	struct {
+		struct net_device *dev;
+	} pre_tun_rule;
 };
 
 struct nfp_fl_payload_link {
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 607426c..fba802a 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -489,6 +489,7 @@ nfp_flower_allocate_new(struct nfp_fl_key_ls *key_layer)
 	flow_pay->meta.flags = 0;
 	INIT_LIST_HEAD(&flow_pay->linked_flows);
 	flow_pay->in_hw = false;
+	flow_pay->pre_tun_rule.dev = NULL;
 
 	return flow_pay;
 
@@ -997,6 +998,24 @@ int nfp_flower_merge_offloaded_flows(struct nfp_app *app,
 }
 
 /**
+ * nfp_flower_validate_pre_tun_rule()
+ * @app:	Pointer to the APP handle
+ * @flow:	Pointer to NFP flow representation of rule
+ * @extack:	Netlink extended ACK report
+ *
+ * Verifies the flow as a pre-tunnel rule.
+ *
+ * Return: negative value on error, 0 if verified.
+ */
+static int
+nfp_flower_validate_pre_tun_rule(struct nfp_app *app,
+				 struct nfp_fl_payload *flow,
+				 struct netlink_ext_ack *extack)
+{
+	return -EOPNOTSUPP;
+}
+
+/**
  * nfp_flower_add_offload() - Adds a new flow to hardware.
  * @app:	Pointer to the APP handle
  * @netdev:	netdev structure.
@@ -1046,6 +1065,12 @@ nfp_flower_add_offload(struct nfp_app *app, struct net_device *netdev,
 	if (err)
 		goto err_destroy_flow;
 
+	if (flow_pay->pre_tun_rule.dev) {
+		err = nfp_flower_validate_pre_tun_rule(app, flow_pay, extack);
+		if (err)
+			goto err_destroy_flow;
+	}
+
 	err = nfp_compile_flow_metadata(app, flow, flow_pay, netdev, extack);
 	if (err)
 		goto err_destroy_flow;
-- 
2.7.4


^ permalink raw reply related

* [PATCH net-next 07/10] nfp: flower: verify pre-tunnel rules
From: John Hurley @ 2019-08-04 15:09 UTC (permalink / raw)
  To: netdev; +Cc: davem, simon.horman, jakub.kicinski, oss-drivers, John Hurley
In-Reply-To: <1564931351-1036-1-git-send-email-john.hurley@netronome.com>

Pre-tunnel rules must direct packets to an internal port based on L2
information. Rules that egress to an internal port are already indicated
by a non-NULL device in its nfp_fl_payload struct. Verfiy the rest of the
match fields indicate that the rule is a pre-tunnel rule. This requires a
full match on the destination MAC address, an option VLAN field, and no
specific matches on other lower layer fields (with the exception of L4
proto and flags).

If a rule is identified as a pre-tunnel rule then mark it for offload to
the pre-tunnel table. Similarly, remove it from the pre-tunnel table on
rule deletion. The actual offloading of these commands is left to a
following patch.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/flower/main.h   |   5 +
 .../net/ethernet/netronome/nfp/flower/offload.c    | 103 ++++++++++++++++++++-
 .../ethernet/netronome/nfp/flower/tunnel_conf.c    |  12 +++
 3 files changed, 115 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.h b/drivers/net/ethernet/netronome/nfp/flower/main.h
index 6e9de4e..c3aa415 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.h
@@ -283,6 +283,7 @@ struct nfp_fl_payload {
 	bool in_hw;
 	struct {
 		struct net_device *dev;
+		__be16 vlan_tci;
 	} pre_tun_rule;
 };
 
@@ -419,4 +420,8 @@ void
 nfp_flower_non_repr_priv_put(struct nfp_app *app, struct net_device *netdev);
 u32 nfp_flower_get_port_id_from_netdev(struct nfp_app *app,
 				       struct net_device *netdev);
+int nfp_flower_xmit_pre_tun_flow(struct nfp_app *app,
+				 struct nfp_fl_payload *flow);
+int nfp_flower_xmit_pre_tun_del_flow(struct nfp_app *app,
+				     struct nfp_fl_payload *flow);
 #endif
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index fba802a..ff8a9f1 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -61,6 +61,11 @@
 	 NFP_FLOWER_LAYER_IPV4 | \
 	 NFP_FLOWER_LAYER_IPV6)
 
+#define NFP_FLOWER_PRE_TUN_RULE_FIELDS \
+	(NFP_FLOWER_LAYER_PORT | \
+	 NFP_FLOWER_LAYER_MAC | \
+	 NFP_FLOWER_LAYER_IPV4)
+
 struct nfp_flower_merge_check {
 	union {
 		struct {
@@ -1012,7 +1017,89 @@ nfp_flower_validate_pre_tun_rule(struct nfp_app *app,
 				 struct nfp_fl_payload *flow,
 				 struct netlink_ext_ack *extack)
 {
-	return -EOPNOTSUPP;
+	struct nfp_flower_meta_tci *meta_tci;
+	struct nfp_flower_mac_mpls *mac;
+	struct nfp_fl_act_head *act;
+	u8 *mask = flow->mask_data;
+	bool vlan = false;
+	int act_offset;
+	u8 key_layer;
+
+	meta_tci = (struct nfp_flower_meta_tci *)flow->unmasked_data;
+	if (meta_tci->tci & cpu_to_be16(NFP_FLOWER_MASK_VLAN_PRESENT)) {
+		u16 vlan_tci = be16_to_cpu(meta_tci->tci);
+
+		vlan_tci &= ~NFP_FLOWER_MASK_VLAN_PRESENT;
+		flow->pre_tun_rule.vlan_tci = cpu_to_be16(vlan_tci);
+		vlan = true;
+	} else {
+		flow->pre_tun_rule.vlan_tci = cpu_to_be16(0xffff);
+	}
+
+	key_layer = meta_tci->nfp_flow_key_layer;
+	if (key_layer & ~NFP_FLOWER_PRE_TUN_RULE_FIELDS) {
+		NL_SET_ERR_MSG_MOD(extack, "unsupported pre-tunnel rule: too many match fields");
+		return -EOPNOTSUPP;
+	}
+
+	if (!(key_layer & NFP_FLOWER_LAYER_MAC)) {
+		NL_SET_ERR_MSG_MOD(extack, "unsupported pre-tunnel rule: MAC fields match required");
+		return -EOPNOTSUPP;
+	}
+
+	/* Skip fields known to exist. */
+	mask += sizeof(struct nfp_flower_meta_tci);
+	mask += sizeof(struct nfp_flower_in_port);
+
+	/* Ensure destination MAC address is fully matched. */
+	mac = (struct nfp_flower_mac_mpls *)mask;
+	if (!is_broadcast_ether_addr(&mac->mac_dst[0])) {
+		NL_SET_ERR_MSG_MOD(extack, "unsupported pre-tunnel rule: dest MAC field must not be masked");
+		return -EOPNOTSUPP;
+	}
+
+	if (key_layer & NFP_FLOWER_LAYER_IPV4) {
+		int ip_flags = offsetof(struct nfp_flower_ipv4, ip_ext.flags);
+		int ip_proto = offsetof(struct nfp_flower_ipv4, ip_ext.proto);
+		int i;
+
+		mask += sizeof(struct nfp_flower_mac_mpls);
+
+		/* Ensure proto and flags are the only IP layer fields. */
+		for (i = 0; i < sizeof(struct nfp_flower_ipv4); i++)
+			if (mask[i] && i != ip_flags && i != ip_proto) {
+				NL_SET_ERR_MSG_MOD(extack, "unsupported pre-tunnel rule: only flags and proto can be matched in ip header");
+				return -EOPNOTSUPP;
+			}
+	}
+
+	/* Action must be a single egress or pop_vlan and egress. */
+	act_offset = 0;
+	act = (struct nfp_fl_act_head *)&flow->action_data[act_offset];
+	if (vlan) {
+		if (act->jump_id != NFP_FL_ACTION_OPCODE_POP_VLAN) {
+			NL_SET_ERR_MSG_MOD(extack, "unsupported pre-tunnel rule: match on VLAN must have VLAN pop as first action");
+			return -EOPNOTSUPP;
+		}
+
+		act_offset += act->len_lw << NFP_FL_LW_SIZ;
+		act = (struct nfp_fl_act_head *)&flow->action_data[act_offset];
+	}
+
+	if (act->jump_id != NFP_FL_ACTION_OPCODE_OUTPUT) {
+		NL_SET_ERR_MSG_MOD(extack, "unsupported pre-tunnel rule: non egress action detected where egress was expected");
+		return -EOPNOTSUPP;
+	}
+
+	act_offset += act->len_lw << NFP_FL_LW_SIZ;
+
+	/* Ensure there are no more actions after egress. */
+	if (act_offset != flow->meta.act_len) {
+		NL_SET_ERR_MSG_MOD(extack, "unsupported pre-tunnel rule: egress is not the last action");
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
 }
 
 /**
@@ -1083,8 +1170,11 @@ nfp_flower_add_offload(struct nfp_app *app, struct net_device *netdev,
 		goto err_release_metadata;
 	}
 
-	err = nfp_flower_xmit_flow(app, flow_pay,
-				   NFP_FLOWER_CMSG_TYPE_FLOW_ADD);
+	if (flow_pay->pre_tun_rule.dev)
+		err = nfp_flower_xmit_pre_tun_flow(app, flow_pay);
+	else
+		err = nfp_flower_xmit_flow(app, flow_pay,
+					   NFP_FLOWER_CMSG_TYPE_FLOW_ADD);
 	if (err)
 		goto err_remove_rhash;
 
@@ -1226,8 +1316,11 @@ nfp_flower_del_offload(struct nfp_app *app, struct net_device *netdev,
 		goto err_free_merge_flow;
 	}
 
-	err = nfp_flower_xmit_flow(app, nfp_flow,
-				   NFP_FLOWER_CMSG_TYPE_FLOW_DEL);
+	if (nfp_flow->pre_tun_rule.dev)
+		err = nfp_flower_xmit_pre_tun_del_flow(app, nfp_flow);
+	else
+		err = nfp_flower_xmit_flow(app, nfp_flow,
+					   NFP_FLOWER_CMSG_TYPE_FLOW_DEL);
 	/* Fall through on error. */
 
 err_free_merge_flow:
diff --git a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
index a7a80f4..ef59481 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
@@ -832,6 +832,18 @@ int nfp_tunnel_mac_event_handler(struct nfp_app *app,
 	return NOTIFY_OK;
 }
 
+int nfp_flower_xmit_pre_tun_flow(struct nfp_app *app,
+				 struct nfp_fl_payload *flow)
+{
+	return -EOPNOTSUPP;
+}
+
+int nfp_flower_xmit_pre_tun_del_flow(struct nfp_app *app,
+				     struct nfp_fl_payload *flow)
+{
+	return -EOPNOTSUPP;
+}
+
 int nfp_tunnel_config_start(struct nfp_app *app)
 {
 	struct nfp_flower_priv *priv = app->priv;
-- 
2.7.4


^ permalink raw reply related

* [PATCH net-next 08/10] nfp: flower: offload pre-tunnel rules
From: John Hurley @ 2019-08-04 15:09 UTC (permalink / raw)
  To: netdev; +Cc: davem, simon.horman, jakub.kicinski, oss-drivers, John Hurley
In-Reply-To: <1564931351-1036-1-git-send-email-john.hurley@netronome.com>

Pre-tunnel rules are TC flower and OvS rules that forward a packet to the
tunnel end point where it can then pass through the network stack and be
decapsulated. These are required if the tunnel end point is, say, an OvS
internal port.

Currently, firmware determines that a packet is in a tunnel and decaps it
if it has a known destination IP and MAC address. However, this bypasses
the flower pre-tunnel rule and so does not update the stats. Further to
this it ignores VLANs that may exist outside of the tunnel header.

Offload pre-tunnel rules to the NFP. This embeds the pre-tunnel rule into
the tunnel decap process based on (firmware) mac index and VLAN. This
means that decap can be carried out correctly with VLANs and that stats
can be updated for all kernel rules correctly.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/flower/cmsg.h   |  1 +
 drivers/net/ethernet/netronome/nfp/flower/main.c   |  1 +
 drivers/net/ethernet/netronome/nfp/flower/main.h   |  3 +
 .../ethernet/netronome/nfp/flower/tunnel_conf.c    | 79 +++++++++++++++++++++-
 4 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
index caf3029..7eb2ec8 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
@@ -484,6 +484,7 @@ enum nfp_flower_cmsg_type_port {
 	NFP_FLOWER_CMSG_TYPE_QOS_MOD =		18,
 	NFP_FLOWER_CMSG_TYPE_QOS_DEL =		19,
 	NFP_FLOWER_CMSG_TYPE_QOS_STATS =	20,
+	NFP_FLOWER_CMSG_TYPE_PRE_TUN_RULE =	21,
 	NFP_FLOWER_CMSG_TYPE_MAX =		32,
 };
 
diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.c b/drivers/net/ethernet/netronome/nfp/flower/main.c
index eb84613..7a20447 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.c
@@ -781,6 +781,7 @@ static int nfp_flower_init(struct nfp_app *app)
 
 	INIT_LIST_HEAD(&app_priv->indr_block_cb_priv);
 	INIT_LIST_HEAD(&app_priv->non_repr_priv);
+	app_priv->pre_tun_rule_cnt = 0;
 
 	return 0;
 
diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.h b/drivers/net/ethernet/netronome/nfp/flower/main.h
index c3aa415..5d302d7 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.h
@@ -163,6 +163,7 @@ struct nfp_fl_internal_ports {
  * @qos_stats_work:	Workqueue for qos stats processing
  * @qos_rate_limiters:	Current active qos rate limiters
  * @qos_stats_lock:	Lock on qos stats updates
+ * @pre_tun_rule_cnt:	Number of pre-tunnel rules offloaded
  */
 struct nfp_flower_priv {
 	struct nfp_app *app;
@@ -194,6 +195,7 @@ struct nfp_flower_priv {
 	struct delayed_work qos_stats_work;
 	unsigned int qos_rate_limiters;
 	spinlock_t qos_stats_lock; /* Protect the qos stats */
+	int pre_tun_rule_cnt;
 };
 
 /**
@@ -284,6 +286,7 @@ struct nfp_fl_payload {
 	struct {
 		struct net_device *dev;
 		__be16 vlan_tci;
+		__be16 port_idx;
 	} pre_tun_rule;
 };
 
diff --git a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
index ef59481..b9dbfb7 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
@@ -15,6 +15,23 @@
 
 #define NFP_FL_MAX_ROUTES               32
 
+#define NFP_TUN_PRE_TUN_RULE_LIMIT	32
+#define NFP_TUN_PRE_TUN_RULE_DEL	0x1
+
+/**
+ * struct nfp_tun_pre_run_rule - rule matched before decap
+ * @flags:		options for the rule offset
+ * @port_idx:		index of destination MAC address for the rule
+ * @vlan_tci:		VLAN info associated with MAC
+ * @host_ctx_id:	stats context of rule to update
+ */
+struct nfp_tun_pre_tun_rule {
+	__be32 flags;
+	__be16 port_idx;
+	__be16 vlan_tci;
+	__be32 host_ctx_id;
+};
+
 /**
  * struct nfp_tun_active_tuns - periodic message of active tunnels
  * @seq:		sequence number of the message
@@ -835,13 +852,71 @@ int nfp_tunnel_mac_event_handler(struct nfp_app *app,
 int nfp_flower_xmit_pre_tun_flow(struct nfp_app *app,
 				 struct nfp_fl_payload *flow)
 {
-	return -EOPNOTSUPP;
+	struct nfp_flower_priv *app_priv = app->priv;
+	struct nfp_tun_offloaded_mac *mac_entry;
+	struct nfp_tun_pre_tun_rule payload;
+	struct net_device *internal_dev;
+	int err;
+
+	if (app_priv->pre_tun_rule_cnt == NFP_TUN_PRE_TUN_RULE_LIMIT)
+		return -ENOSPC;
+
+	memset(&payload, 0, sizeof(struct nfp_tun_pre_tun_rule));
+
+	internal_dev = flow->pre_tun_rule.dev;
+	payload.vlan_tci = flow->pre_tun_rule.vlan_tci;
+	payload.host_ctx_id = flow->meta.host_ctx_id;
+
+	/* Lookup MAC index for the pre-tunnel rule egress device.
+	 * Note that because the device is always an internal port, it will
+	 * have a constant global index so does not need to be tracked.
+	 */
+	mac_entry = nfp_tunnel_lookup_offloaded_macs(app,
+						     internal_dev->dev_addr);
+	if (!mac_entry)
+		return -ENOENT;
+
+	payload.port_idx = cpu_to_be16(mac_entry->index);
+
+	/* Copy mac id and vlan to flow - dev may not exist at delete time. */
+	flow->pre_tun_rule.vlan_tci = payload.vlan_tci;
+	flow->pre_tun_rule.port_idx = payload.port_idx;
+
+	err = nfp_flower_xmit_tun_conf(app, NFP_FLOWER_CMSG_TYPE_PRE_TUN_RULE,
+				       sizeof(struct nfp_tun_pre_tun_rule),
+				       (unsigned char *)&payload, GFP_KERNEL);
+	if (err)
+		return err;
+
+	app_priv->pre_tun_rule_cnt++;
+
+	return 0;
 }
 
 int nfp_flower_xmit_pre_tun_del_flow(struct nfp_app *app,
 				     struct nfp_fl_payload *flow)
 {
-	return -EOPNOTSUPP;
+	struct nfp_flower_priv *app_priv = app->priv;
+	struct nfp_tun_pre_tun_rule payload;
+	u32 tmp_flags = 0;
+	int err;
+
+	memset(&payload, 0, sizeof(struct nfp_tun_pre_tun_rule));
+
+	tmp_flags |= NFP_TUN_PRE_TUN_RULE_DEL;
+	payload.flags = cpu_to_be32(tmp_flags);
+	payload.vlan_tci = flow->pre_tun_rule.vlan_tci;
+	payload.port_idx = flow->pre_tun_rule.port_idx;
+
+	err = nfp_flower_xmit_tun_conf(app, NFP_FLOWER_CMSG_TYPE_PRE_TUN_RULE,
+				       sizeof(struct nfp_tun_pre_tun_rule),
+				       (unsigned char *)&payload, GFP_KERNEL);
+	if (err)
+		return err;
+
+	app_priv->pre_tun_rule_cnt--;
+
+	return 0;
 }
 
 int nfp_tunnel_config_start(struct nfp_app *app)
-- 
2.7.4


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox