Netdev List
 help / color / mirror / Atom feed
* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
From: Eric Dumazet @ 2016-09-12 22:46 UTC (permalink / raw)
  To: John Fastabend
  Cc: bblanco, alexei.starovoitov, jeffrey.t.kirsher, brouer, davem,
	xiyou.wangcong, intel-wired-lan, u9012063, netdev
In-Reply-To: <20160912221351.5610.29043.stgit@john-Precision-Tower-5810>

On Mon, 2016-09-12 at 15:13 -0700, John Fastabend wrote:
> From: Alexei Starovoitov <ast@fb.com>

> +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
> +				 u32 len,
> +				 struct net_device *netdev,
> +				 struct e1000_adapter *adapter)
> +{
> +	struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> +	struct e1000_hw *hw = &adapter->hw;
> +	struct e1000_tx_ring *tx_ring;
> +
> +	if (len > E1000_MAX_DATA_PER_TXD)
> +		return;
> +
> +	/* e1000 only support a single txq at the moment so the queue is being
> +	 * shared with stack. To support this requires locking to ensure the
> +	 * stack and XDP are not running at the same time. Devices with
> +	 * multiple queues should allocate a separate queue space.
> +	 */
> +	HARD_TX_LOCK(netdev, txq, smp_processor_id());
> +
> +	tx_ring = adapter->tx_ring;
> +
> +	if (E1000_DESC_UNUSED(tx_ring) < 2) {
> +		HARD_TX_UNLOCK(netdev, txq);
> +		return;
> +	}
> +
> +	if (netif_xmit_frozen_or_stopped(txq))
> +		return;
> +
> +	e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
> +	netdev_sent_queue(netdev, len);
> +	e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
> +
> +	writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
> +	mmiowb();
> +
> +	HARD_TX_UNLOCK(netdev, txq);
> +}


e1000_tx_map() is full of workarounds.

Have a look at last_tx_tso for example.

               /* Workaround for Controller erratum --
                 * descriptor for non-tso packet in a linear SKB that follows a
                 * tso gets written back prematurely before the data is fully
                 * DMA'd to the controller
                 */
                if (!skb->data_len && tx_ring->last_tx_tso &&
                    !skb_is_gso(skb)) {
                        tx_ring->last_tx_tso = false;
                        size -= 4;
                }

Look, this XDP_TX thing is hard to properly implement and test on
various NIC revisions.

Without proper queue management, high prio packets in qdisc wont be sent
if NIC is under RX -> XDP_TX flood.

Sounds a horrible feature to me.

^ permalink raw reply

* Re: [PATCH 3/3] net-next: dsa: add new driver for qca8xxx family
From: Andrew Lunn @ 2016-09-12 22:52 UTC (permalink / raw)
  To: John Crispin
  Cc: David S. Miller, Florian Fainelli, netdev, linux-kernel,
	qsdk-review
In-Reply-To: <1473669337-21221-4-git-send-email-john@phrozen.org>

Hi John

> +#include <linux/module.h>
> +#include <linux/phy.h>
> +#include <linux/netdevice.h>
> +#include <net/dsa.h>
> +#include <net/switchdev.h>
> +#include <linux/phy.h>

One linux/phy.h is enough.

> +	/* Setup connection between CPU ports & PHYs */
> +	for (i = 0; i < DSA_MAX_PORTS; i++) {
> +		/* CPU port gets connected to all PHYs in the switch */
> +		if (dsa_is_cpu_port(ds, i)) {
> +			qca8k_rmw(priv, QCA8K_PORT_LOOKUP_CTRL(priv->cpu_port),
> +				  QCA8K_PORT_LOOKUP_MEMBER,
> +				  ds->enabled_port_mask << 1);
> +		}
> +
> +		/* Invividual PHYs gets connected to CPU port only */
> +		if (ds->enabled_port_mask & BIT(i)) {
> +			int port = qca8k_phy_to_port(i);
> +			int shift = 16 * (port % 2);
> +

snip

> +static void
> +qca8k_get_ethtool_stats(struct dsa_switch *ds, int phy,
> +			uint64_t *data)
> +{
> +	struct qca8k_priv *priv = qca8k_to_priv(ds);
> +	const struct qca8k_mib_desc *mib;
> +	u32 reg, i, port;
> +	u64 hi;
> +
> +	port = qca8k_phy_to_port(phy);

snip

> +static inline int qca8k_phy_to_port(int phy)
> +{
> +	if (phy < 5)
> +		return phy + 1;
> +
> +	return -1;
> +}

What keep throwing me while reading this code is the use of PHY/phy.

What is meant by a phy in this driver?

DSA is all about switches. Switches are a switch fabric and a number
of ports. The ports contain Ethernet MACs, and optionally contain
embedded PHYs. If there are not embedded PHYs, there are often
external PHYs, and sometimes we just have MACs connected back to back.

Generally, DSA drivers don't need to do much with the MAC. Maybe set
the speed and duplex, and maybe signal delays. They also don't need to
do much with the PHY, phylib and a phy driver does all the work. DSA
is all about the switch fabric.

Yet i see phy scattered in a few places in this driver, and it seems
like they have nothing to do with the PHY.

Please can you change the terminology here? It might help if you can
remove qca8k_phy_to_port(). Why do you need to add 1? Could this be
eliminated via a better choice of reg in the device tree?

Thanks
	Andrew	   

^ permalink raw reply

* Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
From: Jamal Hadi Salim @ 2016-09-12 23:02 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev, xiyou.wangcong, daniel, john.r.fastabend
In-Reply-To: <1473719192.18970.84.camel@edumazet-glaptop3.roam.corp.google.com>

On 16-09-12 06:26 PM, Eric Dumazet wrote:
> On Mon, 2016-09-12 at 18:14 -0400, Jamal Hadi Salim wrote:
>
>> I noticed some very weird issues when I took that out.
>> Running sufficiently large amount of traffic (ping -f is sufficient)
>> I saw that when i did a dump it took anywhere between 6-15 seconds.
>> With the read_lock in place response was immediate.
>> I can go back and run things to verify - but it was very odd.
>
> This was on uni processor ?
>

It was a VM.

> Looks like typical starvation caused by aggressive softirq.
>

Well, then it is strange that in one case a tc dump of the rule
was immediate and in the other case it was consistent for 5-15
seconds.

> Anyway, I suspect your kernel build has rcu_read_lock() and
> rcu_read_unlock() as NOP ;)
>

Which doesnt give me a good feel if i tested this well ;->

I would like to try again with those two kernels just to
make sure i was not imagining this.

cheers,
jamal

^ permalink raw reply

* [PATCH v3 net 1/1] net sched actions: fix GETing actions
From: Jamal Hadi Salim @ 2016-09-12 23:07 UTC (permalink / raw)
  To: davem; +Cc: netdev, xiyou.wangcong, Jamal Hadi Salim

From: Jamal Hadi Salim <jhs@mojatatu.com>

With the batch changes that translated transient actions into
a temporary list lost in the translation was the fact that
tcf_action_destroy() will eventually delete the action from
the permanent location if the refcount is zero.

Example of what broke:
...add a gact action to drop
sudo $TC actions add action drop index 10
...now retrieve it, looks good
sudo $TC actions get action gact index 10
...retrieve it again and find it is gone!
sudo $TC actions get action gact index 10

Fixes:
commit 22dc13c837c3 ("net_sched: convert tcf_exts from list to pointer array"),
commit 824a7e8863b3 ("net_sched: remove an unnecessary list_del()")
commit f07fed82ad79 ("net_sched: remove the leftover cleanup_a()")

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 net/sched/act_api.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index d09d068..50720b1 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -592,6 +592,16 @@ err_out:
 	return ERR_PTR(err);
 }
 
+static void cleanup_a(struct list_head *actions, int ovr)
+{
+	struct tc_action *a;
+
+	list_for_each_entry(a, actions, list) {
+		if (ovr)
+			a->tcfa_refcnt -= 1;
+	}
+}
+
 int tcf_action_init(struct net *net, struct nlattr *nla,
 				  struct nlattr *est, char *name, int ovr,
 				  int bind, struct list_head *actions)
@@ -612,8 +622,15 @@ int tcf_action_init(struct net *net, struct nlattr *nla,
 			goto err;
 		}
 		act->order = i;
+		if (ovr)
+			act->tcfa_refcnt += 1;
 		list_add_tail(&act->list, actions);
 	}
+
+	/* Remove the temp refcnt which was necessary to protect against
+	 * destroying an existing action which was being replaced
+	 */
+	cleanup_a(actions, ovr);
 	return 0;
 
 err:
@@ -883,6 +900,8 @@ tca_action_gd(struct net *net, struct nlattr *nla, struct nlmsghdr *n,
 			goto err;
 		}
 		act->order = i;
+		if (event == RTM_GETACTION)
+			act->tcfa_refcnt += 1;
 		list_add_tail(&act->list, &actions);
 	}
 
-- 
1.9.1

^ permalink raw reply related

* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
From: Alexei Starovoitov @ 2016-09-12 23:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: John Fastabend, bblanco, jeffrey.t.kirsher, brouer, davem,
	xiyou.wangcong, intel-wired-lan, u9012063, netdev
In-Reply-To: <1473720399.18970.91.camel@edumazet-glaptop3.roam.corp.google.com>

On Mon, Sep 12, 2016 at 03:46:39PM -0700, Eric Dumazet wrote:
> On Mon, 2016-09-12 at 15:13 -0700, John Fastabend wrote:
> > From: Alexei Starovoitov <ast@fb.com>
> 
> > +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
> > +				 u32 len,
> > +				 struct net_device *netdev,
> > +				 struct e1000_adapter *adapter)
> > +{
> > +	struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> > +	struct e1000_hw *hw = &adapter->hw;
> > +	struct e1000_tx_ring *tx_ring;
> > +
> > +	if (len > E1000_MAX_DATA_PER_TXD)
> > +		return;
> > +
> > +	/* e1000 only support a single txq at the moment so the queue is being
> > +	 * shared with stack. To support this requires locking to ensure the
> > +	 * stack and XDP are not running at the same time. Devices with
> > +	 * multiple queues should allocate a separate queue space.
> > +	 */
> > +	HARD_TX_LOCK(netdev, txq, smp_processor_id());
> > +
> > +	tx_ring = adapter->tx_ring;
> > +
> > +	if (E1000_DESC_UNUSED(tx_ring) < 2) {
> > +		HARD_TX_UNLOCK(netdev, txq);
> > +		return;
> > +	}
> > +
> > +	if (netif_xmit_frozen_or_stopped(txq))
> > +		return;
> > +
> > +	e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
> > +	netdev_sent_queue(netdev, len);
> > +	e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
> > +
> > +	writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
> > +	mmiowb();
> > +
> > +	HARD_TX_UNLOCK(netdev, txq);
> > +}
> 
> 
> e1000_tx_map() is full of workarounds.
> 
> Have a look at last_tx_tso for example.
> 
>                /* Workaround for Controller erratum --
>                  * descriptor for non-tso packet in a linear SKB that follows a
>                  * tso gets written back prematurely before the data is fully
>                  * DMA'd to the controller
>                  */
>                 if (!skb->data_len && tx_ring->last_tx_tso &&
>                     !skb_is_gso(skb)) {
>                         tx_ring->last_tx_tso = false;
>                         size -= 4;
>                 }
> 
> Look, this XDP_TX thing is hard to properly implement and test on
> various NIC revisions.

my reading of these e1k workarounds are for old versions of the nic that
don't even exist anymore. I believe none of them apply to qemu e1k.

> Without proper queue management, high prio packets in qdisc wont be sent
> if NIC is under RX -> XDP_TX flood.

yep. there are various ways to shoot yourself in the foot with xdp.
The simplest program that drops all the packets will make the box unpingable.

> Sounds a horrible feature to me.

yep. not pretty by any means.
This whole thing is only to test xdp programs under qemu which
realistically emulates only e1k.
I simply don't see any other options to test xdp under kvm.

^ permalink raw reply

* Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
From: Eric Dumazet @ 2016-09-12 23:10 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: davem, netdev, xiyou.wangcong, daniel, john.r.fastabend
In-Reply-To: <3a1da0eb-b253-ef9c-8679-0290e670238c@mojatatu.com>

On Mon, 2016-09-12 at 19:02 -0400, Jamal Hadi Salim wrote:
> On 16-09-12 06:26 PM, Eric Dumazet wrote:
> > On Mon, 2016-09-12 at 18:14 -0400, Jamal Hadi Salim wrote:
> >
> >> I noticed some very weird issues when I took that out.
> >> Running sufficiently large amount of traffic (ping -f is sufficient)
> >> I saw that when i did a dump it took anywhere between 6-15 seconds.
> >> With the read_lock in place response was immediate.
> >> I can go back and run things to verify - but it was very odd.
> >
> > This was on uni processor ?
> >
> 
> It was a VM.
> 
> > Looks like typical starvation caused by aggressive softirq.
> >
> 
> Well, then it is strange that in one case a tc dump of the rule
> was immediate and in the other case it was consistent for 5-15
> seconds.
> 

This needs investigation ;)

One possible loop under high stress would be possible in
__gnet_stats_copy_basic(), since we might restart the loop if we are
really really unlucky, but this would have nothing with your patches.



diff --git a/net/core/gen_stats.c b/net/core/gen_stats.c
index 508e051304fb62627e61b5065b2325edd1b84f2e..dc9dd8ae7d5405f76c775278dac7689655b21041 100644
--- a/net/core/gen_stats.c
+++ b/net/core/gen_stats.c
@@ -142,10 +142,14 @@ __gnet_stats_copy_basic(const seqcount_t *running,
 		return;
 	}
 	do {
-		if (running)
+		if (running) {
+			local_bh_disable();
 			seq = read_seqcount_begin(running);
+		}
 		bstats->bytes = b->bytes;
 		bstats->packets = b->packets;
+		if (running)
+			local_bh_enable();
 	} while (running && read_seqcount_retry(running, seq));
 }
 EXPORT_SYMBOL(__gnet_stats_copy_basic);

^ permalink raw reply related

* Re: [PATCH v4 00/16] Add Paravirtual RDMA Driver
From: Jason Gunthorpe @ 2016-09-12 23:16 UTC (permalink / raw)
  To: Adit Ranadive
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, pv-drivers,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Jorgen S. Hansen, Aditya Sarwade, George Zhang, Bryan Tan
In-Reply-To: <BLUPR0501MB836ED564F2AE7AB9221AC25C5FF0-84Rf5TRaNBMVDhIuTCx1aJLWcSx1hRipwIZJ9u9yWa8oOQlpcoRfSA@public.gmane.org>

On Mon, Sep 12, 2016 at 10:43:00PM +0000, Adit Ranadive wrote:
> On Mon, Sep 12, 2016 at 11:03:39 -0700, Jason Gunthorpe wrote:
> > On Sun, Sep 11, 2016 at 09:49:10PM -0700, Adit Ranadive wrote:
> > > [2] Libpvrdma User-level library - 
> > > http://git.openfabrics.org/?p=~aditr/libpvrdma.git;a=summary
> > 
> > You will probably find that rdma-plumbing will be the best way to get
> > your userspace component into the distributors.

> Sorry I haven't paying attention to that discussion. Do you know how soon 
> distros will pick up the rdma-plumbing stuff?

Hopefully things will be clearer after the call on Tuesday.

I'm hoping we can begin the process sooner than later.

There are 4 new drivers being submitted and at least 2-3 more kernel
drivers that are not yet in Debian/SuSE/etc.

By far the simplest way to push those 6 new user space drivers into
the distros will be via rdma-plumbing, I'm optimistic that the vendors
will work together on this.

> It seems like it should be fairly straightforward to include the libpvrdma git
> within the rdma-plumbing git as well. Let me know what the process is since
> I may have to discuss it internally.

For now, you can prepare your own clone and create a providers/pvrdma
directory similar to the others. It should be very
straightforward. Let me know if you have any problems.

You should ensure clean compilation on FC24 (for instance using the
provided docker tooling), and review the commit stream in
rdma-plumbing to make sure you pick up all the systemic fixes that
have already been done to the other providers.

If Doug accepts your provider upstream before we finalize
rdma-plumbing then I will add your git to the giant merge, however
please make my life easier and ensure the code does not have any of
the systemic problems I alreadly fixed in other providers :(

After finalization you'll have to post your user space provider to the
mailing list and Doug will pick it up.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
From: Eric Dumazet @ 2016-09-12 23:20 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: davem, netdev, xiyou.wangcong, daniel, john.r.fastabend
In-Reply-To: <1473721837.18970.95.camel@edumazet-glaptop3.roam.corp.google.com>

On Mon, 2016-09-12 at 16:10 -0700, Eric Dumazet wrote:

> 
> diff --git a/net/core/gen_stats.c b/net/core/gen_stats.c
> index 508e051304fb62627e61b5065b2325edd1b84f2e..dc9dd8ae7d5405f76c775278dac7689655b21041 100644
> --- a/net/core/gen_stats.c
> +++ b/net/core/gen_stats.c
> @@ -142,10 +142,14 @@ __gnet_stats_copy_basic(const seqcount_t *running,
>  		return;
>  	}
>  	do {
> -		if (running)
> +		if (running) {
> +			local_bh_disable();
>  			seq = read_seqcount_begin(running);
> +		}
>  		bstats->bytes = b->bytes;
>  		bstats->packets = b->packets;
> +		if (running)
> +			local_bh_enable();
>  	} while (running && read_seqcount_retry(running, seq));
>  }

Ah well, forget this patch, re-enabling bh right before
read_seqcount_retry() is not going to help.

^ permalink raw reply

* Re: [v11, 5/8] soc: fsl: add GUTS driver for QorIQ platforms
From: Scott Wood @ 2016-09-12 23:25 UTC (permalink / raw)
  To: Y.B. Lu, linux-mmc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	ulf.hansson-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org,
	Arnd Bergmann
  Cc: Mark Rutland, devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Russell King, Bhupesh Sharma,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Santosh Shilimkar,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Jochen Friedrich, X.B. Xie,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	Rob Herring, linux-i2c-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Claudiu Manoil, Kumar Gala, Leo Li,
	linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ@public.gmane.org,
	linux-clk-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
In-Reply-To: <HE1PR04MB08892D8354E38EC32F549E98F8FF0-6LN7OEpIatX1kPMWxTxe+c9NdZoXdze2vxpqHgZTriW3zl9H0oFU5g@public.gmane.org>

On Mon, 2016-09-12 at 06:39 +0000, Y.B. Lu wrote:
> Hi Scott,
> 
> Thanks for your review :)
> See my comment inline.
> 
> > 
> > -----Original Message-----
> > From: Scott Wood [mailto:oss@buserror.net]
> > Sent: Friday, September 09, 2016 11:47 AM
> > To: Y.B. Lu; linux-mmc@vger.kernel.org; ulf.hansson@linaro.org; Arnd
> > Bergmann
> > Cc: linuxppc-dev@lists.ozlabs.org; devicetree@vger.kernel.org; linux-arm-
> > kernel@lists.infradead.org; linux-kernel@vger.kernel.org; linux-
> > clk@vger.kernel.org; linux-i2c@vger.kernel.org; iommu@lists.linux-
> > foundation.org; netdev@vger.kernel.org; Mark Rutland; Rob Herring;
> > Russell King; Jochen Friedrich; Joerg Roedel; Claudiu Manoil; Bhupesh
> > Sharma; Qiang Zhao; Kumar Gala; Santosh Shilimkar; Leo Li; X.B. Xie
> > Subject: Re: [v11, 5/8] soc: fsl: add GUTS driver for QorIQ platforms
> > 
> > On Tue, 2016-09-06 at 16:28 +0800, Yangbo Lu wrote:
> > > 
> > > The global utilities block controls power management, I/O device
> > > enabling, power-onreset(POR) configuration monitoring, alternate
> > > function selection for multiplexed signals,and clock control.
> > > 
> > > This patch adds a driver to manage and access global utilities block.
> > > Initially only reading SVR and registering soc device are supported.
> > > Other guts accesses, such as reading RCW, should eventually be moved
> > > into this driver as well.
> > > 
> > > Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
> > > Signed-off-by: Scott Wood <oss@buserror.net>
> > Don't put my signoff on patches that I didn't put it on
> > myself.  Definitely don't put mine *after* yours on patches that were
> > last modified by you.
> > 
> > If you want to mention that the soc_id encoding was my suggestion, then
> > do so explicitly.
> > 
> [Lu Yangbo-B47093] I found your 'signoff' on this patch at below link.
> http://patchwork.ozlabs.org/patch/649211/
> 
> So, let me just change the order in next version ?
> Signed-off-by: Scott Wood <oss@buserror.net>
> Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>

No.  This isn't my patch so my signoff shouldn't be on it.

> [Lu Yangbo-B47093] It's a good idea to move die into .family I think.
> In my opinion, it's better to keep svr and name in soc_id just like your
> suggestion above.
> > 
> > 	{
> > 		.soc_id = "svr:0x85490010,name:T1023E,",
> > 		.family = "QorIQ T1024",
> > 	}
> The user probably don’t like to learn the svr value. What they want is just
> to match the soc they use.
> It's convenient to use name+rev for them to match a soc.

What the user should want 99% of the time is to match the die (plus revision),
not the soc.

> Regarding shrinking the table, I think it's hard to use svr+mask. Because I
> find many platforms use different masks.
> We couldn’t know the mask according svr value.

The mask would be part of the table:

{
	{
		.die = "T1024",
		.svr = 0x85400000,
		.mask = 0xfff00000,
	},
	{
		.die = "T1040",
		.svr = 0x85200000,
		.mask = 0xfff00000,
	},
	{
		.die = "LS1088A",
		.svr = 0x87030000,
		.mask = 0xffff0000,
	},
	...
}

There's a small risk that we get the mask wrong and a different die is created
that matches an existing table, but it doesn't seem too likely, and can easily
be fixed with a kernel update if it happens.

BTW, aren't ls2080a and ls2085a the same die?  And is there no non-E version
of LS2080A/LS2040A?

> > > +	do {
> > > +		if (!matches->soc_id)
> > > +			return NULL;
> > > +		if (glob_match(svr_match, matches->soc_id))
> > > +			break;
> > > +	} while (matches++);
> > Are you expecting "matches++" to ever evaluate as false?
> [Lu Yangbo-B47093] Yes, this is used to match the soc we use in qoriq_soc
> array until getting true. 
> We need to get the name and die information defined in array.

I'm not asking whether the glob_match will ever return true.  I'm saying that
"matches++" will never become NULL.

> > > +	/* Register soc device */
> > > +	soc_dev_attr = kzalloc(sizeof(*soc_dev_attr), GFP_KERNEL);
> > > +	if (!soc_dev_attr) {
> > > +		ret = -ENOMEM;
> > > +		goto out_unmap;
> > > +	}
> > Couldn't this be statically allocated?
> [Lu Yangbo-B47093] Do you mean we define this struct statically ?
> 
> static struct soc_device_attribute soc_dev_attr;

Yes.

> > > +
> > > +	soc_dev = soc_device_register(soc_dev_attr);
> > > +	if (IS_ERR(soc_dev)) {
> > > +		ret = -ENODEV;
> > Why are you changing the error code?
> [Lu Yangbo-B47093] What error code should we use ? :)

ret = PTR_ERR(soc_dev);

+	}
> > > +	return 0;
> > > +out:
> > > +	kfree(soc_dev_attr->machine);
> > > +	kfree(soc_dev_attr->family);
> > > +	kfree(soc_dev_attr->soc_id);
> > > +	kfree(soc_dev_attr->revision);
> > > +	kfree(soc_dev_attr);
> > > +out_unmap:
> > > +	iounmap(guts->regs);
> > > +out_free:
> > > +	kfree(guts);
> > devm
> [Lu Yangbo-B47093] What's the devm meaning here :)

If you allocate these with devm_kzalloc(), devm_kasprintf(), devm_kstrdup(),
etc. then they will be freed automatically when the device is unbound.

>  
> > 
> > 
> > > 
> > > +static int fsl_guts_remove(struct platform_device *dev) {
> > > +	kfree(soc_dev_attr->machine);
> > > +	kfree(soc_dev_attr->family);
> > > +	kfree(soc_dev_attr->soc_id);
> > > +	kfree(soc_dev_attr->revision);
> > > +	kfree(soc_dev_attr);
> > > +	soc_device_unregister(soc_dev);
> > > +	iounmap(guts->regs);
> > > +	kfree(guts);
> > > +	return 0;
> > > +}
> > Don't free the memory before you unregister the device that uses it (moot
> > if you use devm).
> [Lu Yangbo-B47093] The soc.c driver mentions that.
> Ensure soc_dev->attr is freed prior to calling soc_device_unregister.

That comment is wrong.  Freeing the memory first creates a race condition that
could result in accessing freed memory, if something accesses the soc device
in parallel with unbinding.

-Scott

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply

* Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
From: Jamal Hadi Salim @ 2016-09-12 23:26 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev, xiyou.wangcong, daniel, john.r.fastabend
In-Reply-To: <3a1da0eb-b253-ef9c-8679-0290e670238c@mojatatu.com>

On 16-09-12 07:02 PM, Jamal Hadi Salim wrote:

>> Looks like typical starvation caused by aggressive softirq.
>>
>
> Well, then it is strange that in one case a tc dump of the rule
> was immediate and in the other case it was consistent for 5-15
> seconds.
>

I may be going nuts because i cant reproduce that issue anymore.
VM is configured for SMP 4 vcpus.

I will remove that check and submit a new version.

cheers,
jamal

^ permalink raw reply

* [PATCH v5 net-next 1/1] net_sched: Introduce skbmod action
From: Jamal Hadi Salim @ 2016-09-12 23:30 UTC (permalink / raw)
  To: davem
  Cc: netdev, daniel, xiyou.wangcong, eric.dumazet, john.r.fastabend,
	Jamal Hadi Salim

From: Jamal Hadi Salim <jhs@mojatatu.com>

This action is intended to be an upgrade from a usability perspective
from pedit (as well as operational debugability).
Compare this:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action pedit munge offset -14 u8 set 0x02 \
munge offset -13 u8 set 0x15 \
munge offset -12 u8 set 0x15 \
munge offset -11 u8 set 0x15 \
munge offset -10 u16 set 0x1515 \
pipe

to:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbmod dmac 02:15:15:15:15:15

Also try to do a MAC address swap with pedit or worse
try to debug a policy with destination mac, source mac and
etherype. Then make few rules out of those and you'll get my point.

In the future common use cases on pedit can be migrated to this action
(as an example different fields in ip v4/6, transports like tcp/udp/sctp
etc). For this first cut, this allows modifying basic ethernet header.

The most important ethernet use case at the moment is when redirecting or
mirroring packets to a remote machine. The dst mac address needs a re-write
so that it doesnt get dropped or confuse an interconnecting (learning) switch
or dropped by a target machine (which looks at the dst mac). And at times
when flipping back the packet a swap of the MAC addresses is needed.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 include/net/tc_act/tc_skbmod.h        |  30 ++++
 include/uapi/linux/tc_act/tc_skbmod.h |  39 +++++
 net/sched/Kconfig                     |  11 ++
 net/sched/Makefile                    |   1 +
 net/sched/act_skbmod.c                | 291 ++++++++++++++++++++++++++++++++++
 5 files changed, 372 insertions(+)
 create mode 100644 include/net/tc_act/tc_skbmod.h
 create mode 100644 include/uapi/linux/tc_act/tc_skbmod.h
 create mode 100644 net/sched/act_skbmod.c

diff --git a/include/net/tc_act/tc_skbmod.h b/include/net/tc_act/tc_skbmod.h
new file mode 100644
index 0000000..644a211
--- /dev/null
+++ b/include/net/tc_act/tc_skbmod.h
@@ -0,0 +1,30 @@
+/*
+ * Copyright (c) 2016, Jamal Hadi Salim
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+*/
+
+#ifndef __NET_TC_SKBMOD_H
+#define __NET_TC_SKBMOD_H
+
+#include <net/act_api.h>
+#include <linux/tc_act/tc_skbmod.h>
+
+struct tcf_skbmod_params {
+	struct rcu_head	rcu;
+	u64	flags; /*up to 64 types of operations; extend if needed */
+	u8	eth_dst[ETH_ALEN];
+	u16	eth_type;
+	u8	eth_src[ETH_ALEN];
+};
+
+struct tcf_skbmod {
+	struct tc_action	common;
+	struct tcf_skbmod_params __rcu *skbmod_p;
+};
+#define to_skbmod(a) ((struct tcf_skbmod *)a)
+
+#endif /* __NET_TC_SKBMOD_H */
diff --git a/include/uapi/linux/tc_act/tc_skbmod.h b/include/uapi/linux/tc_act/tc_skbmod.h
new file mode 100644
index 0000000..10fc07d
--- /dev/null
+++ b/include/uapi/linux/tc_act/tc_skbmod.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright (c) 2016, Jamal Hadi Salim
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+*/
+
+#ifndef __LINUX_TC_SKBMOD_H
+#define __LINUX_TC_SKBMOD_H
+
+#include <linux/pkt_cls.h>
+
+#define TCA_ACT_SKBMOD 15
+
+#define SKBMOD_F_DMAC	0x1
+#define SKBMOD_F_SMAC	0x2
+#define SKBMOD_F_ETYPE	0x4
+#define SKBMOD_F_SWAPMAC 0x8
+
+struct tc_skbmod {
+	tc_gen;
+	__u64 flags;
+};
+
+enum {
+	TCA_SKBMOD_UNSPEC,
+	TCA_SKBMOD_TM,
+	TCA_SKBMOD_PARMS,
+	TCA_SKBMOD_DMAC,
+	TCA_SKBMOD_SMAC,
+	TCA_SKBMOD_ETYPE,
+	TCA_SKBMOD_PAD,
+	__TCA_SKBMOD_MAX
+};
+#define TCA_SKBMOD_MAX (__TCA_SKBMOD_MAX - 1)
+
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 72e3426..7795d5a 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -749,6 +749,17 @@ config NET_ACT_CONNMARK
 	  To compile this code as a module, choose M here: the
 	  module will be called act_connmark.
 
+config NET_ACT_SKBMOD
+        tristate "skb data modification action"
+        depends on NET_CLS_ACT
+        ---help---
+         Say Y here to allow modification of skb data
+
+         If unsure, say N.
+
+         To compile this code as a module, choose M here: the
+         module will be called act_skbmod.
+
 config NET_ACT_IFE
         tristate "Inter-FE action based on IETF ForCES InterFE LFB"
         depends on NET_CLS_ACT
diff --git a/net/sched/Makefile b/net/sched/Makefile
index b9d046b..148ae0d 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -19,6 +19,7 @@ obj-$(CONFIG_NET_ACT_CSUM)	+= act_csum.o
 obj-$(CONFIG_NET_ACT_VLAN)	+= act_vlan.o
 obj-$(CONFIG_NET_ACT_BPF)	+= act_bpf.o
 obj-$(CONFIG_NET_ACT_CONNMARK)	+= act_connmark.o
+obj-$(CONFIG_NET_ACT_SKBMOD)	+= act_skbmod.o
 obj-$(CONFIG_NET_ACT_IFE)	+= act_ife.o
 obj-$(CONFIG_NET_IFE_SKBMARK)	+= act_meta_mark.o
 obj-$(CONFIG_NET_IFE_SKBPRIO)	+= act_meta_skbprio.o
diff --git a/net/sched/act_skbmod.c b/net/sched/act_skbmod.c
new file mode 100644
index 0000000..2b3f2af
--- /dev/null
+++ b/net/sched/act_skbmod.c
@@ -0,0 +1,291 @@
+/*
+ * net/sched/act_skbmod.c  skb data modifier
+ *
+ * Copyright (c) 2016 Jamal Hadi Salim <jhs@mojatatu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+*/
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/skbuff.h>
+#include <linux/rtnetlink.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+
+#include <linux/tc_act/tc_skbmod.h>
+#include <net/tc_act/tc_skbmod.h>
+
+#define SKBMOD_TAB_MASK     15
+
+static int skbmod_net_id;
+static struct tc_action_ops act_skbmod_ops;
+
+#define MAX_EDIT_LEN ETH_HLEN
+static int tcf_skbmod_run(struct sk_buff *skb, const struct tc_action *a,
+			  struct tcf_result *res)
+{
+	struct tcf_skbmod *d = to_skbmod(a);
+	int action;
+	struct tcf_skbmod_params *p;
+	u64 flags;
+	int err;
+
+	tcf_lastuse_update(&d->tcf_tm);
+	bstats_cpu_update(this_cpu_ptr(d->common.cpu_bstats), skb);
+
+	/* XXX: if you are going to edit more fields beyond ethernet header
+	 * (example when you add IP header replacement or vlan swap)
+	 * then MAX_EDIT_LEN needs to change appropriately
+	*/
+	err = skb_ensure_writable(skb, ETH_HLEN);
+	if (unlikely(err)) { /* best policy is to drop on the floor */
+		qstats_overlimit_inc(this_cpu_ptr(d->common.cpu_qstats));
+		return TC_ACT_SHOT;
+	}
+
+	rcu_read_lock();
+	action = READ_ONCE(d->tcf_action);
+	if (unlikely(action == TC_ACT_SHOT)) {
+		qstats_overlimit_inc(this_cpu_ptr(d->common.cpu_qstats));
+		rcu_read_unlock();
+		return action;
+	}
+
+	p = rcu_dereference(d->skbmod_p);
+	flags = p->flags;
+	if (flags & SKBMOD_F_DMAC)
+		ether_addr_copy(eth_hdr(skb)->h_dest, p->eth_dst);
+	if (flags & SKBMOD_F_SMAC)
+		ether_addr_copy(eth_hdr(skb)->h_source, p->eth_src);
+	if (flags & SKBMOD_F_ETYPE)
+		eth_hdr(skb)->h_proto = p->eth_type;
+	rcu_read_unlock();
+
+	if (flags & SKBMOD_F_SWAPMAC) {
+		u16 tmpaddr[ETH_ALEN / 2]; /* ether_addr_copy() requirement */
+		/*XXX: I am sure we can come up with more efficient swapping*/
+		ether_addr_copy((u8 *)tmpaddr, eth_hdr(skb)->h_dest);
+		ether_addr_copy(eth_hdr(skb)->h_dest, eth_hdr(skb)->h_source);
+		ether_addr_copy(eth_hdr(skb)->h_source, (u8 *)tmpaddr);
+	}
+
+	return action;
+}
+
+static const struct nla_policy skbmod_policy[TCA_SKBMOD_MAX + 1] = {
+	[TCA_SKBMOD_PARMS]		= { .len = sizeof(struct tc_skbmod) },
+	[TCA_SKBMOD_DMAC]		= { .len = ETH_ALEN },
+	[TCA_SKBMOD_SMAC]		= { .len = ETH_ALEN },
+	[TCA_SKBMOD_ETYPE]		= { .type = NLA_U16 },
+};
+
+static int tcf_skbmod_init(struct net *net, struct nlattr *nla,
+			   struct nlattr *est, struct tc_action **a,
+			   int ovr, int bind)
+{
+	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
+	struct nlattr *tb[TCA_SKBMOD_MAX + 1];
+	struct tcf_skbmod_params *p, *p_old;
+	struct tc_skbmod *parm;
+	struct tcf_skbmod *d;
+	bool exists = false;
+	u8 *daddr = NULL;
+	u8 *saddr = NULL;
+	u16 eth_type = 0;
+	u32 lflags = 0;
+	int ret = 0, err;
+
+	if (!nla)
+		return -EINVAL;
+
+	err = nla_parse_nested(tb, TCA_SKBMOD_MAX, nla, skbmod_policy);
+	if (err < 0)
+		return err;
+
+	if (!tb[TCA_SKBMOD_PARMS])
+		return -EINVAL;
+
+	if (tb[TCA_SKBMOD_DMAC]) {
+		daddr = nla_data(tb[TCA_SKBMOD_DMAC]);
+		lflags |= SKBMOD_F_DMAC;
+	}
+
+	if (tb[TCA_SKBMOD_SMAC]) {
+		saddr = nla_data(tb[TCA_SKBMOD_SMAC]);
+		lflags |= SKBMOD_F_SMAC;
+	}
+
+	if (tb[TCA_SKBMOD_ETYPE]) {
+		eth_type = nla_get_u16(tb[TCA_SKBMOD_ETYPE]);
+		lflags |= SKBMOD_F_ETYPE;
+	}
+
+	parm = nla_data(tb[TCA_SKBMOD_PARMS]);
+	if (parm->flags & SKBMOD_F_SWAPMAC)
+		lflags = SKBMOD_F_SWAPMAC;
+
+	exists = tcf_hash_check(tn, parm->index, a, bind);
+	if (exists && bind)
+		return 0;
+
+	if (!lflags)
+		return -EINVAL;
+
+	if (!exists) {
+		ret = tcf_hash_create(tn, parm->index, est, a,
+				      &act_skbmod_ops, bind, true);
+		if (ret)
+			return ret;
+
+		ret = ACT_P_CREATED;
+	} else {
+		tcf_hash_release(*a, bind);
+		if (!ovr)
+			return -EEXIST;
+	}
+
+	d = to_skbmod(*a);
+
+	ASSERT_RTNL();
+	p = kzalloc(sizeof(struct tcf_skbmod_params), GFP_KERNEL);
+	if (unlikely(!p)) {
+		if (ovr)
+			tcf_hash_release(*a, bind);
+		return -ENOMEM;
+	}
+
+	p->flags = lflags;
+	d->tcf_action = parm->action;
+
+	p_old = rtnl_dereference(d->skbmod_p);
+
+	if (ovr)
+		spin_lock_bh(&d->tcf_lock);
+
+	if (lflags & SKBMOD_F_DMAC)
+		ether_addr_copy(p->eth_dst, daddr);
+	if (lflags & SKBMOD_F_SMAC)
+		ether_addr_copy(p->eth_src, saddr);
+	if (lflags & SKBMOD_F_ETYPE)
+		p->eth_type = htons(eth_type);
+
+	rcu_assign_pointer(d->skbmod_p, p);
+	if (ovr)
+		spin_unlock_bh(&d->tcf_lock);
+
+	if (p_old)
+		kfree_rcu(p_old, rcu);
+
+	if (ret == ACT_P_CREATED)
+		tcf_hash_insert(tn, *a);
+	return ret;
+}
+
+static int tcf_skbmod_dump(struct sk_buff *skb, struct tc_action *a,
+			   int bind, int ref)
+{
+	struct tcf_skbmod *d = to_skbmod(a);
+	unsigned char *b = skb_tail_pointer(skb);
+	struct tcf_skbmod_params  *p = rtnl_dereference(d->skbmod_p);
+	struct tc_skbmod opt = {
+		.index   = d->tcf_index,
+		.refcnt  = d->tcf_refcnt - ref,
+		.bindcnt = d->tcf_bindcnt - bind,
+		.action  = d->tcf_action,
+	};
+	struct tcf_t t;
+
+	opt.flags  = p->flags;
+	if (nla_put(skb, TCA_SKBMOD_PARMS, sizeof(opt), &opt))
+		goto nla_put_failure;
+	if ((p->flags & SKBMOD_F_DMAC) &&
+	    nla_put(skb, TCA_SKBMOD_DMAC, ETH_ALEN, p->eth_dst))
+		goto nla_put_failure;
+	if ((p->flags & SKBMOD_F_SMAC) &&
+	    nla_put(skb, TCA_SKBMOD_SMAC, ETH_ALEN, p->eth_src))
+		goto nla_put_failure;
+	if ((p->flags & SKBMOD_F_ETYPE) &&
+	    nla_put_u16(skb, TCA_SKBMOD_ETYPE, ntohs(p->eth_type)))
+		goto nla_put_failure;
+
+	tcf_tm_dump(&t, &d->tcf_tm);
+	if (nla_put_64bit(skb, TCA_SKBMOD_TM, sizeof(t), &t, TCA_SKBMOD_PAD))
+		goto nla_put_failure;
+
+	return skb->len;
+nla_put_failure:
+	rcu_read_unlock();
+	nlmsg_trim(skb, b);
+	return -1;
+}
+
+static int tcf_skbmod_walker(struct net *net, struct sk_buff *skb,
+			     struct netlink_callback *cb, int type,
+			     const struct tc_action_ops *ops)
+{
+	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
+
+	return tcf_generic_walker(tn, skb, cb, type, ops);
+}
+
+static int tcf_skbmod_search(struct net *net, struct tc_action **a, u32 index)
+{
+	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
+
+	return tcf_hash_search(tn, a, index);
+}
+
+static struct tc_action_ops act_skbmod_ops = {
+	.kind		=	"skbmod",
+	.type		=	TCA_ACT_SKBMOD,
+	.owner		=	THIS_MODULE,
+	.act		=	tcf_skbmod_run,
+	.dump		=	tcf_skbmod_dump,
+	.init		=	tcf_skbmod_init,
+	.walk		=	tcf_skbmod_walker,
+	.lookup		=	tcf_skbmod_search,
+	.size		=	sizeof(struct tcf_skbmod),
+};
+
+static __net_init int skbmod_init_net(struct net *net)
+{
+	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
+
+	return tc_action_net_init(tn, &act_skbmod_ops, SKBMOD_TAB_MASK);
+}
+
+static void __net_exit skbmod_exit_net(struct net *net)
+{
+	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
+
+	tc_action_net_exit(tn);
+}
+
+static struct pernet_operations skbmod_net_ops = {
+	.init = skbmod_init_net,
+	.exit = skbmod_exit_net,
+	.id   = &skbmod_net_id,
+	.size = sizeof(struct tc_action_net),
+};
+
+MODULE_AUTHOR("Jamal Hadi Salim, <jhs@mojatatu.com>");
+MODULE_DESCRIPTION("SKB data mod-ing");
+MODULE_LICENSE("GPL");
+
+static int __init skbmod_init_module(void)
+{
+	return tcf_register_action(&act_skbmod_ops, &skbmod_net_ops);
+}
+
+static void __exit skbmod_cleanup_module(void)
+{
+	tcf_unregister_action(&act_skbmod_ops, &skbmod_net_ops);
+}
+
+module_init(skbmod_init_module);
+module_exit(skbmod_cleanup_module);
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
From: Jamal Hadi Salim @ 2016-09-12 23:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev, xiyou.wangcong, daniel, john.r.fastabend
In-Reply-To: <1473722404.18970.97.camel@edumazet-glaptop3.roam.corp.google.com>

On 16-09-12 07:20 PM, Eric Dumazet wrote:
> On Mon, 2016-09-12 at 16:10 -0700, Eric Dumazet wrote:
>
>>
>> diff --git a/net/core/gen_stats.c b/net/core/gen_stats.c
>> index 508e051304fb62627e61b5065b2325edd1b84f2e..dc9dd8ae7d5405f76c775278dac7689655b21041 100644
>> --- a/net/core/gen_stats.c
>> +++ b/net/core/gen_stats.c
>> @@ -142,10 +142,14 @@ __gnet_stats_copy_basic(const seqcount_t *running,
>>  		return;
>>  	}
>>  	do {
>> -		if (running)
>> +		if (running) {
>> +			local_bh_disable();
>>  			seq = read_seqcount_begin(running);
>> +		}
>>  		bstats->bytes = b->bytes;
>>  		bstats->packets = b->packets;
>> +		if (running)
>> +			local_bh_enable();
>>  	} while (running && read_seqcount_retry(running, seq));
>>  }
>
> Ah well, forget this patch, re-enabling bh right before
> read_seqcount_retry() is not going to help.
>

I have to say I have seen some odd issues once in a while reading
generic action stats.
I had a program that opened a netlink socket into the kernel.
Every X seconds it does a dump of all the actions to read the
stats.
There is a very reproducible behavior that the stats
are not in sync with the kernel. Given generic stats is lockless
I thought maybe rcu or per-cpu stats was the issue. I havent had time
to look closely.
The solution is instead of keeping the socket open all the time;
I open, read stats, close (repeat every x seconds).

If there is something you want me to try - I could do sometimes
this week. Your patch above may be useful!

cheers,
jamal

^ permalink raw reply

* Re: [PATCH v5 net-next 1/1] net_sched: Introduce skbmod action
From: Eric Dumazet @ 2016-09-12 23:40 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: davem, netdev, daniel, xiyou.wangcong, john.r.fastabend
In-Reply-To: <1473723029-16533-1-git-send-email-jhs@emojatatu.com>

On Mon, 2016-09-12 at 19:30 -0400, Jamal Hadi Salim wrote:
> From: Jamal Hadi Salim <jhs@mojatatu.com>

> +
> +static struct tc_action_ops act_skbmod_ops = {
> +	.kind		=	"skbmod",
> +	.type		=	TCA_ACT_SKBMOD,
> +	.owner		=	THIS_MODULE,
> +	.act		=	tcf_skbmod_run,
> +	.dump		=	tcf_skbmod_dump,
> +	.init		=	tcf_skbmod_init,
> +	.walk		=	tcf_skbmod_walker,
> +	.lookup		=	tcf_skbmod_search,
> +	.size		=	sizeof(struct tcf_skbmod),
> +};

Looks like you missed a .cleanup() handler ?

Otherwise, d->skbmod_p wont be freed .

^ permalink raw reply

* Re: [net-next PATCH v3 3/3] e1000: bundle xdp xmit routines
From: Tom Herbert @ 2016-09-12 23:45 UTC (permalink / raw)
  To: John Fastabend
  Cc: Brenden Blanco, Alexei Starovoitov, Jeff Kirsher,
	Jesper Dangaard Brouer, David S. Miller, Cong Wang,
	intel-wired-lan, William Tu, Linux Kernel Network Developers
In-Reply-To: <20160912221417.5610.37355.stgit@john-Precision-Tower-5810>

On Mon, Sep 12, 2016 at 3:14 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> e1000 supports a single TX queue so it is being shared with the stack
> when XDP runs XDP_TX action. This requires taking the xmit lock to
> ensure we don't corrupt the tx ring. To avoid taking and dropping the
> lock per packet this patch adds a bundling implementation to submit
> a bundle of packets to the xmit routine.
>
> I tested this patch running e1000 in a VM using KVM over a tap
> device using pktgen to generate traffic along with 'ping -f -l 100'.
>
John,

It still looks to me like this will break the normal TX path. Can you
test these patches on real e1000 to get performance numbers for both
the drop and forwarding case. Also can you run this against an
antagonist workload over the normal stack to see if one or the other
path can be starved?

Thanks,
Tom

> Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
>  drivers/net/ethernet/intel/e1000/e1000.h      |   10 +++
>  drivers/net/ethernet/intel/e1000/e1000_main.c |   80 +++++++++++++++++++------
>  2 files changed, 70 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
> index 5cf8a0a..877b377 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000.h
> +++ b/drivers/net/ethernet/intel/e1000/e1000.h
> @@ -133,6 +133,8 @@ struct e1000_adapter;
>  #define E1000_TX_QUEUE_WAKE    16
>  /* How many Rx Buffers do we bundle into one write to the hardware ? */
>  #define E1000_RX_BUFFER_WRITE  16 /* Must be power of 2 */
> +/* How many XDP XMIT buffers to bundle into one xmit transaction */
> +#define E1000_XDP_XMIT_BUNDLE_MAX E1000_RX_BUFFER_WRITE
>
>  #define AUTO_ALL_MODES         0
>  #define E1000_EEPROM_82544_APM 0x0004
> @@ -168,6 +170,11 @@ struct e1000_rx_buffer {
>         dma_addr_t dma;
>  };
>
> +struct e1000_rx_buffer_bundle {
> +       struct e1000_rx_buffer *buffer;
> +       u32 length;
> +};
> +
>  struct e1000_tx_ring {
>         /* pointer to the descriptor ring memory */
>         void *desc;
> @@ -206,6 +213,9 @@ struct e1000_rx_ring {
>         struct e1000_rx_buffer *buffer_info;
>         struct sk_buff *rx_skb_top;
>
> +       /* array of XDP buffer information structs */
> +       struct e1000_rx_buffer_bundle *xdp_buffer;
> +
>         /* cpu for rx queue */
>         int cpu;
>
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
> index 232b927..31489d4 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -848,6 +848,15 @@ static int e1000_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
>         struct e1000_adapter *adapter = netdev_priv(netdev);
>         struct bpf_prog *old_prog;
>
> +       if (!adapter->rx_ring[0].xdp_buffer) {
> +               int size = sizeof(struct e1000_rx_buffer_bundle) *
> +                               E1000_XDP_XMIT_BUNDLE_MAX;
> +
> +               adapter->rx_ring[0].xdp_buffer = vzalloc(size);
> +               if (!adapter->rx_ring[0].xdp_buffer)
> +                       return -ENOMEM;
> +       }
> +
>         old_prog = xchg(&adapter->prog, prog);
>         if (old_prog) {
>                 synchronize_net();
> @@ -1319,6 +1328,9 @@ static void e1000_remove(struct pci_dev *pdev)
>         if (adapter->prog)
>                 bpf_prog_put(adapter->prog);
>
> +       if (adapter->rx_ring[0].xdp_buffer)
> +               vfree(adapter->rx_ring[0].xdp_buffer);
> +
>         unregister_netdev(netdev);
>
>         e1000_phy_hw_reset(hw);
> @@ -3372,29 +3384,17 @@ static void e1000_tx_map_rxpage(struct e1000_tx_ring *tx_ring,
>
>  static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
>                                  u32 len,
> +                                struct e1000_adapter *adapter,
>                                  struct net_device *netdev,
> -                                struct e1000_adapter *adapter)
> +                                struct e1000_tx_ring *tx_ring)
>  {
> -       struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> -       struct e1000_hw *hw = &adapter->hw;
> -       struct e1000_tx_ring *tx_ring;
> +       const struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
>
>         if (len > E1000_MAX_DATA_PER_TXD)
>                 return;
>
> -       /* e1000 only support a single txq at the moment so the queue is being
> -        * shared with stack. To support this requires locking to ensure the
> -        * stack and XDP are not running at the same time. Devices with
> -        * multiple queues should allocate a separate queue space.
> -        */
> -       HARD_TX_LOCK(netdev, txq, smp_processor_id());
> -
> -       tx_ring = adapter->tx_ring;
> -
> -       if (E1000_DESC_UNUSED(tx_ring) < 2) {
> -               HARD_TX_UNLOCK(netdev, txq);
> +       if (E1000_DESC_UNUSED(tx_ring) < 2)
>                 return;
> -       }
>
>         if (netif_xmit_frozen_or_stopped(txq))
>                 return;
> @@ -3402,7 +3402,36 @@ static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
>         e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
>         netdev_sent_queue(netdev, len);
>         e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
> +}
>
> +static void e1000_xdp_xmit_bundle(struct e1000_rx_buffer_bundle *buffer_info,
> +                                 struct net_device *netdev,
> +                                 struct e1000_adapter *adapter)
> +{
> +       struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> +       struct e1000_tx_ring *tx_ring = adapter->tx_ring;
> +       struct e1000_hw *hw = &adapter->hw;
> +       int i = 0;
> +
> +       /* e1000 only support a single txq at the moment so the queue is being
> +        * shared with stack. To support this requires locking to ensure the
> +        * stack and XDP are not running at the same time. Devices with
> +        * multiple queues should allocate a separate queue space.
> +        *
> +        * To amortize the locking cost e1000 bundles the xmits and send up to
> +        * E1000_XDP_XMIT_BUNDLE_MAX.
> +        */
> +       HARD_TX_LOCK(netdev, txq, smp_processor_id());
> +
> +       for (; i < E1000_XDP_XMIT_BUNDLE_MAX && buffer_info[i].buffer; i++) {
> +               e1000_xmit_raw_frame(buffer_info[i].buffer,
> +                                    buffer_info[i].length,
> +                                    adapter, netdev, tx_ring);
> +               buffer_info[i].buffer = NULL;
> +               buffer_info[i].length = 0;
> +       }
> +
> +       /* kick hardware to send bundle and return control back to the stack */
>         writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
>         mmiowb();
>
> @@ -4284,9 +4313,10 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
>         struct bpf_prog *prog;
>         u32 length;
>         unsigned int i;
> -       int cleaned_count = 0;
> +       int cleaned_count = 0, xdp_xmit = 0;
>         bool cleaned = false;
>         unsigned int total_rx_bytes = 0, total_rx_packets = 0;
> +       struct e1000_rx_buffer_bundle *xdp_bundle = rx_ring->xdp_buffer;
>
>         rcu_read_lock(); /* rcu lock needed here to protect xdp programs */
>         prog = READ_ONCE(adapter->prog);
> @@ -4341,12 +4371,13 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
>                         case XDP_PASS:
>                                 break;
>                         case XDP_TX:
> +                               xdp_bundle[xdp_xmit].buffer = buffer_info;
> +                               xdp_bundle[xdp_xmit].length = length;
>                                 dma_sync_single_for_device(&pdev->dev,
>                                                            dma,
>                                                            length,
>                                                            DMA_TO_DEVICE);
> -                               e1000_xmit_raw_frame(buffer_info, length,
> -                                                    netdev, adapter);
> +                               xdp_xmit++;
>                         case XDP_DROP:
>                         default:
>                                 /* re-use mapped page. keep buffer_info->dma
> @@ -4488,8 +4519,14 @@ next_desc:
>
>                 /* return some buffers to hardware, one at a time is too slow */
>                 if (unlikely(cleaned_count >= E1000_RX_BUFFER_WRITE)) {
> +                       if (xdp_xmit)
> +                               e1000_xdp_xmit_bundle(xdp_bundle,
> +                                                     netdev,
> +                                                     adapter);
> +
>                         adapter->alloc_rx_buf(adapter, rx_ring, cleaned_count);
>                         cleaned_count = 0;
> +                       xdp_xmit = 0;
>                 }
>
>                 /* use prefetched values */
> @@ -4500,8 +4537,11 @@ next_desc:
>         rx_ring->next_to_clean = i;
>
>         cleaned_count = E1000_DESC_UNUSED(rx_ring);
> -       if (cleaned_count)
> +       if (cleaned_count) {
> +               if (xdp_xmit)
> +                       e1000_xdp_xmit_bundle(xdp_bundle, netdev, adapter);
>                 adapter->alloc_rx_buf(adapter, rx_ring, cleaned_count);
> +       }
>
>         adapter->total_rx_packets += total_rx_packets;
>         adapter->total_rx_bytes += total_rx_bytes;
>

^ permalink raw reply

* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
From: Eric Dumazet @ 2016-09-12 23:46 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: John Fastabend, bblanco, jeffrey.t.kirsher, brouer, davem,
	xiyou.wangcong, intel-wired-lan, u9012063, netdev
In-Reply-To: <20160912230745.GA24171@ast-mbp.thefacebook.com>

On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:

> yep. there are various ways to shoot yourself in the foot with xdp.
> The simplest program that drops all the packets will make the box unpingable.

Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
scooter on 101 highway ;)

This XDP_TX thing was one of the XDP marketing stuff, but there is
absolutely no documentation on it, warning users about possible
limitations/outcomes.

BTW, I am not sure mlx4 implementation even works, vs BQL :

mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
but tx completion will call netdev_tx_completed_queue() -> crash

Do we have one test to validate that a XDP_TX implementation is actually
correct ?

^ permalink raw reply

* Re: [PATCH v5 net-next 1/1] net_sched: Introduce skbmod action
From: Jamal Hadi Salim @ 2016-09-12 23:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev, daniel, xiyou.wangcong, john.r.fastabend
In-Reply-To: <1473723645.18970.109.camel@edumazet-glaptop3.roam.corp.google.com>

On 16-09-12 07:40 PM, Eric Dumazet wrote:
> On Mon, 2016-09-12 at 19:30 -0400, Jamal Hadi Salim wrote:
>> From: Jamal Hadi Salim <jhs@mojatatu.com>
>
>> +
>> +static struct tc_action_ops act_skbmod_ops = {
>> +	.kind		=	"skbmod",
>> +	.type		=	TCA_ACT_SKBMOD,
>> +	.owner		=	THIS_MODULE,
>> +	.act		=	tcf_skbmod_run,
>> +	.dump		=	tcf_skbmod_dump,
>> +	.init		=	tcf_skbmod_init,
>> +	.walk		=	tcf_skbmod_walker,
>> +	.lookup		=	tcf_skbmod_search,
>> +	.size		=	sizeof(struct tcf_skbmod),
>> +};
>
> Looks like you missed a .cleanup() handler ?
>
> Otherwise, d->skbmod_p wont be freed .
>
>
>

Yes, I did ;->
Another version coming. Is there a good way to catch
something like this (ala userspace valgrind)?

cheers,
jamal

^ permalink raw reply

* Re: [PATCH v5 net-next 1/1] net_sched: Introduce skbmod action
From: Eric Dumazet @ 2016-09-13  0:00 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: davem, netdev, daniel, xiyou.wangcong, john.r.fastabend
In-Reply-To: <99274ae5-4f40-fc08-3d35-9b2fe7f792b5@mojatatu.com>

On Mon, 2016-09-12 at 19:56 -0400, Jamal Hadi Salim wrote:

> 
> Yes, I did ;->
> Another version coming. Is there a good way to catch
> something like this (ala userspace valgrind)?

You might try CONFIG_DEBUG_KMEMLEAK=y,
or simply run your script deleting/inserting the action one million
time ;)

^ permalink raw reply

* Re: [PATCH v5 net-next 1/1] net_sched: Introduce skbmod action
From: Florian Westphal @ 2016-09-13  0:01 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: davem, netdev, daniel, xiyou.wangcong, eric.dumazet,
	john.r.fastabend
In-Reply-To: <1473723029-16533-1-git-send-email-jhs@emojatatu.com>

Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> +#define MAX_EDIT_LEN ETH_HLEN

This define is only referenced in a comment and seems unused otherwise.

^ permalink raw reply

* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
From: Tom Herbert @ 2016-09-13  0:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, John Fastabend, Brenden Blanco, Jeff Kirsher,
	Jesper Dangaard Brouer, David S. Miller, Cong Wang,
	intel-wired-lan, William Tu, Linux Kernel Network Developers
In-Reply-To: <1473723968.18970.111.camel@edumazet-glaptop3.roam.corp.google.com>

On Mon, Sep 12, 2016 at 4:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:
>
>> yep. there are various ways to shoot yourself in the foot with xdp.
>> The simplest program that drops all the packets will make the box unpingable.
>
> Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
> scooter on 101 highway ;)
>
> This XDP_TX thing was one of the XDP marketing stuff, but there is
> absolutely no documentation on it, warning users about possible
> limitations/outcomes.
>
> BTW, I am not sure mlx4 implementation even works, vs BQL :
>
> mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
> but tx completion will call netdev_tx_completed_queue() -> crash
>
> Do we have one test to validate that a XDP_TX implementation is actually
> correct ?
>
Obviously not for e1000 :-(. We really need some real test and
performance results and analysis on the interaction between the stack
data path and XDP data path. The fact that these changes are being
passed of as something only needed for KCM is irrelevant, e1000 is a
well deployed a NIC and there's no restriction that I see that would
prevent any users from enabling this feature on real devices. So these
patches need to be tested and justified. Eric's skepticism in all this
really doesn't surprise me...

Tom

>
>

^ permalink raw reply

* [PATCH v5 net-next 1/1] net_sched: Introduce skbmod action
From: Jamal Hadi Salim @ 2016-09-13  0:06 UTC (permalink / raw)
  To: davem
  Cc: netdev, daniel, xiyou.wangcong, eric.dumazet, john.r.fastabend,
	Jamal Hadi Salim

From: Jamal Hadi Salim <jhs@mojatatu.com>

This action is intended to be an upgrade from a usability perspective
from pedit (as well as operational debugability).
Compare this:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action pedit munge offset -14 u8 set 0x02 \
munge offset -13 u8 set 0x15 \
munge offset -12 u8 set 0x15 \
munge offset -11 u8 set 0x15 \
munge offset -10 u16 set 0x1515 \
pipe

to:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbmod dmac 02:15:15:15:15:15

Also try to do a MAC address swap with pedit or worse
try to debug a policy with destination mac, source mac and
etherype. Then make few rules out of those and you'll get my point.

In the future common use cases on pedit can be migrated to this action
(as an example different fields in ip v4/6, transports like tcp/udp/sctp
etc). For this first cut, this allows modifying basic ethernet header.

The most important ethernet use case at the moment is when redirecting or
mirroring packets to a remote machine. The dst mac address needs a re-write
so that it doesnt get dropped or confuse an interconnecting (learning) switch
or dropped by a target machine (which looks at the dst mac). And at times
when flipping back the packet a swap of the MAC addresses is needed.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 include/net/tc_act/tc_skbmod.h        |  30 ++++
 include/uapi/linux/tc_act/tc_skbmod.h |  39 +++++
 net/sched/Kconfig                     |  11 ++
 net/sched/Makefile                    |   1 +
 net/sched/act_skbmod.c                | 301 ++++++++++++++++++++++++++++++++++
 5 files changed, 382 insertions(+)
 create mode 100644 include/net/tc_act/tc_skbmod.h
 create mode 100644 include/uapi/linux/tc_act/tc_skbmod.h
 create mode 100644 net/sched/act_skbmod.c

diff --git a/include/net/tc_act/tc_skbmod.h b/include/net/tc_act/tc_skbmod.h
new file mode 100644
index 0000000..644a211
--- /dev/null
+++ b/include/net/tc_act/tc_skbmod.h
@@ -0,0 +1,30 @@
+/*
+ * Copyright (c) 2016, Jamal Hadi Salim
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+*/
+
+#ifndef __NET_TC_SKBMOD_H
+#define __NET_TC_SKBMOD_H
+
+#include <net/act_api.h>
+#include <linux/tc_act/tc_skbmod.h>
+
+struct tcf_skbmod_params {
+	struct rcu_head	rcu;
+	u64	flags; /*up to 64 types of operations; extend if needed */
+	u8	eth_dst[ETH_ALEN];
+	u16	eth_type;
+	u8	eth_src[ETH_ALEN];
+};
+
+struct tcf_skbmod {
+	struct tc_action	common;
+	struct tcf_skbmod_params __rcu *skbmod_p;
+};
+#define to_skbmod(a) ((struct tcf_skbmod *)a)
+
+#endif /* __NET_TC_SKBMOD_H */
diff --git a/include/uapi/linux/tc_act/tc_skbmod.h b/include/uapi/linux/tc_act/tc_skbmod.h
new file mode 100644
index 0000000..10fc07d
--- /dev/null
+++ b/include/uapi/linux/tc_act/tc_skbmod.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright (c) 2016, Jamal Hadi Salim
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+*/
+
+#ifndef __LINUX_TC_SKBMOD_H
+#define __LINUX_TC_SKBMOD_H
+
+#include <linux/pkt_cls.h>
+
+#define TCA_ACT_SKBMOD 15
+
+#define SKBMOD_F_DMAC	0x1
+#define SKBMOD_F_SMAC	0x2
+#define SKBMOD_F_ETYPE	0x4
+#define SKBMOD_F_SWAPMAC 0x8
+
+struct tc_skbmod {
+	tc_gen;
+	__u64 flags;
+};
+
+enum {
+	TCA_SKBMOD_UNSPEC,
+	TCA_SKBMOD_TM,
+	TCA_SKBMOD_PARMS,
+	TCA_SKBMOD_DMAC,
+	TCA_SKBMOD_SMAC,
+	TCA_SKBMOD_ETYPE,
+	TCA_SKBMOD_PAD,
+	__TCA_SKBMOD_MAX
+};
+#define TCA_SKBMOD_MAX (__TCA_SKBMOD_MAX - 1)
+
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 72e3426..7795d5a 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -749,6 +749,17 @@ config NET_ACT_CONNMARK
 	  To compile this code as a module, choose M here: the
 	  module will be called act_connmark.
 
+config NET_ACT_SKBMOD
+        tristate "skb data modification action"
+        depends on NET_CLS_ACT
+        ---help---
+         Say Y here to allow modification of skb data
+
+         If unsure, say N.
+
+         To compile this code as a module, choose M here: the
+         module will be called act_skbmod.
+
 config NET_ACT_IFE
         tristate "Inter-FE action based on IETF ForCES InterFE LFB"
         depends on NET_CLS_ACT
diff --git a/net/sched/Makefile b/net/sched/Makefile
index b9d046b..148ae0d 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -19,6 +19,7 @@ obj-$(CONFIG_NET_ACT_CSUM)	+= act_csum.o
 obj-$(CONFIG_NET_ACT_VLAN)	+= act_vlan.o
 obj-$(CONFIG_NET_ACT_BPF)	+= act_bpf.o
 obj-$(CONFIG_NET_ACT_CONNMARK)	+= act_connmark.o
+obj-$(CONFIG_NET_ACT_SKBMOD)	+= act_skbmod.o
 obj-$(CONFIG_NET_ACT_IFE)	+= act_ife.o
 obj-$(CONFIG_NET_IFE_SKBMARK)	+= act_meta_mark.o
 obj-$(CONFIG_NET_IFE_SKBPRIO)	+= act_meta_skbprio.o
diff --git a/net/sched/act_skbmod.c b/net/sched/act_skbmod.c
new file mode 100644
index 0000000..cb9c4db
--- /dev/null
+++ b/net/sched/act_skbmod.c
@@ -0,0 +1,301 @@
+/*
+ * net/sched/act_skbmod.c  skb data modifier
+ *
+ * Copyright (c) 2016 Jamal Hadi Salim <jhs@mojatatu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+*/
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/skbuff.h>
+#include <linux/rtnetlink.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+
+#include <linux/tc_act/tc_skbmod.h>
+#include <net/tc_act/tc_skbmod.h>
+
+#define SKBMOD_TAB_MASK     15
+
+static int skbmod_net_id;
+static struct tc_action_ops act_skbmod_ops;
+
+#define MAX_EDIT_LEN ETH_HLEN
+static int tcf_skbmod_run(struct sk_buff *skb, const struct tc_action *a,
+			  struct tcf_result *res)
+{
+	struct tcf_skbmod *d = to_skbmod(a);
+	int action;
+	struct tcf_skbmod_params *p;
+	u64 flags;
+	int err;
+
+	tcf_lastuse_update(&d->tcf_tm);
+	bstats_cpu_update(this_cpu_ptr(d->common.cpu_bstats), skb);
+
+	/* XXX: if you are going to edit more fields beyond ethernet header
+	 * (example when you add IP header replacement or vlan swap)
+	 * then MAX_EDIT_LEN needs to change appropriately
+	*/
+	err = skb_ensure_writable(skb, ETH_HLEN);
+	if (unlikely(err)) { /* best policy is to drop on the floor */
+		qstats_overlimit_inc(this_cpu_ptr(d->common.cpu_qstats));
+		return TC_ACT_SHOT;
+	}
+
+	rcu_read_lock();
+	action = READ_ONCE(d->tcf_action);
+	if (unlikely(action == TC_ACT_SHOT)) {
+		qstats_overlimit_inc(this_cpu_ptr(d->common.cpu_qstats));
+		rcu_read_unlock();
+		return action;
+	}
+
+	p = rcu_dereference(d->skbmod_p);
+	flags = p->flags;
+	if (flags & SKBMOD_F_DMAC)
+		ether_addr_copy(eth_hdr(skb)->h_dest, p->eth_dst);
+	if (flags & SKBMOD_F_SMAC)
+		ether_addr_copy(eth_hdr(skb)->h_source, p->eth_src);
+	if (flags & SKBMOD_F_ETYPE)
+		eth_hdr(skb)->h_proto = p->eth_type;
+	rcu_read_unlock();
+
+	if (flags & SKBMOD_F_SWAPMAC) {
+		u16 tmpaddr[ETH_ALEN / 2]; /* ether_addr_copy() requirement */
+		/*XXX: I am sure we can come up with more efficient swapping*/
+		ether_addr_copy((u8 *)tmpaddr, eth_hdr(skb)->h_dest);
+		ether_addr_copy(eth_hdr(skb)->h_dest, eth_hdr(skb)->h_source);
+		ether_addr_copy(eth_hdr(skb)->h_source, (u8 *)tmpaddr);
+	}
+
+	return action;
+}
+
+static const struct nla_policy skbmod_policy[TCA_SKBMOD_MAX + 1] = {
+	[TCA_SKBMOD_PARMS]		= { .len = sizeof(struct tc_skbmod) },
+	[TCA_SKBMOD_DMAC]		= { .len = ETH_ALEN },
+	[TCA_SKBMOD_SMAC]		= { .len = ETH_ALEN },
+	[TCA_SKBMOD_ETYPE]		= { .type = NLA_U16 },
+};
+
+static int tcf_skbmod_init(struct net *net, struct nlattr *nla,
+			   struct nlattr *est, struct tc_action **a,
+			   int ovr, int bind)
+{
+	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
+	struct nlattr *tb[TCA_SKBMOD_MAX + 1];
+	struct tcf_skbmod_params *p, *p_old;
+	struct tc_skbmod *parm;
+	struct tcf_skbmod *d;
+	bool exists = false;
+	u8 *daddr = NULL;
+	u8 *saddr = NULL;
+	u16 eth_type = 0;
+	u32 lflags = 0;
+	int ret = 0, err;
+
+	if (!nla)
+		return -EINVAL;
+
+	err = nla_parse_nested(tb, TCA_SKBMOD_MAX, nla, skbmod_policy);
+	if (err < 0)
+		return err;
+
+	if (!tb[TCA_SKBMOD_PARMS])
+		return -EINVAL;
+
+	if (tb[TCA_SKBMOD_DMAC]) {
+		daddr = nla_data(tb[TCA_SKBMOD_DMAC]);
+		lflags |= SKBMOD_F_DMAC;
+	}
+
+	if (tb[TCA_SKBMOD_SMAC]) {
+		saddr = nla_data(tb[TCA_SKBMOD_SMAC]);
+		lflags |= SKBMOD_F_SMAC;
+	}
+
+	if (tb[TCA_SKBMOD_ETYPE]) {
+		eth_type = nla_get_u16(tb[TCA_SKBMOD_ETYPE]);
+		lflags |= SKBMOD_F_ETYPE;
+	}
+
+	parm = nla_data(tb[TCA_SKBMOD_PARMS]);
+	if (parm->flags & SKBMOD_F_SWAPMAC)
+		lflags = SKBMOD_F_SWAPMAC;
+
+	exists = tcf_hash_check(tn, parm->index, a, bind);
+	if (exists && bind)
+		return 0;
+
+	if (!lflags)
+		return -EINVAL;
+
+	if (!exists) {
+		ret = tcf_hash_create(tn, parm->index, est, a,
+				      &act_skbmod_ops, bind, true);
+		if (ret)
+			return ret;
+
+		ret = ACT_P_CREATED;
+	} else {
+		tcf_hash_release(*a, bind);
+		if (!ovr)
+			return -EEXIST;
+	}
+
+	d = to_skbmod(*a);
+
+	ASSERT_RTNL();
+	p = kzalloc(sizeof(struct tcf_skbmod_params), GFP_KERNEL);
+	if (unlikely(!p)) {
+		if (ovr)
+			tcf_hash_release(*a, bind);
+		return -ENOMEM;
+	}
+
+	p->flags = lflags;
+	d->tcf_action = parm->action;
+
+	p_old = rtnl_dereference(d->skbmod_p);
+
+	if (ovr)
+		spin_lock_bh(&d->tcf_lock);
+
+	if (lflags & SKBMOD_F_DMAC)
+		ether_addr_copy(p->eth_dst, daddr);
+	if (lflags & SKBMOD_F_SMAC)
+		ether_addr_copy(p->eth_src, saddr);
+	if (lflags & SKBMOD_F_ETYPE)
+		p->eth_type = htons(eth_type);
+
+	rcu_assign_pointer(d->skbmod_p, p);
+	if (ovr)
+		spin_unlock_bh(&d->tcf_lock);
+
+	if (p_old)
+		kfree_rcu(p_old, rcu);
+
+	if (ret == ACT_P_CREATED)
+		tcf_hash_insert(tn, *a);
+	return ret;
+}
+
+static void tcf_skbmod_cleanup(struct tc_action *a, int bind)
+{
+	struct tcf_skbmod *d = to_skbmod(a);
+	struct tcf_skbmod_params  *p;
+
+	p = rcu_dereference_protected(d->skbmod_p, 1);
+	kfree_rcu(p, rcu);
+}
+
+static int tcf_skbmod_dump(struct sk_buff *skb, struct tc_action *a,
+			   int bind, int ref)
+{
+	struct tcf_skbmod *d = to_skbmod(a);
+	unsigned char *b = skb_tail_pointer(skb);
+	struct tcf_skbmod_params  *p = rtnl_dereference(d->skbmod_p);
+	struct tc_skbmod opt = {
+		.index   = d->tcf_index,
+		.refcnt  = d->tcf_refcnt - ref,
+		.bindcnt = d->tcf_bindcnt - bind,
+		.action  = d->tcf_action,
+	};
+	struct tcf_t t;
+
+	opt.flags  = p->flags;
+	if (nla_put(skb, TCA_SKBMOD_PARMS, sizeof(opt), &opt))
+		goto nla_put_failure;
+	if ((p->flags & SKBMOD_F_DMAC) &&
+	    nla_put(skb, TCA_SKBMOD_DMAC, ETH_ALEN, p->eth_dst))
+		goto nla_put_failure;
+	if ((p->flags & SKBMOD_F_SMAC) &&
+	    nla_put(skb, TCA_SKBMOD_SMAC, ETH_ALEN, p->eth_src))
+		goto nla_put_failure;
+	if ((p->flags & SKBMOD_F_ETYPE) &&
+	    nla_put_u16(skb, TCA_SKBMOD_ETYPE, ntohs(p->eth_type)))
+		goto nla_put_failure;
+
+	tcf_tm_dump(&t, &d->tcf_tm);
+	if (nla_put_64bit(skb, TCA_SKBMOD_TM, sizeof(t), &t, TCA_SKBMOD_PAD))
+		goto nla_put_failure;
+
+	return skb->len;
+nla_put_failure:
+	rcu_read_unlock();
+	nlmsg_trim(skb, b);
+	return -1;
+}
+
+static int tcf_skbmod_walker(struct net *net, struct sk_buff *skb,
+			     struct netlink_callback *cb, int type,
+			     const struct tc_action_ops *ops)
+{
+	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
+
+	return tcf_generic_walker(tn, skb, cb, type, ops);
+}
+
+static int tcf_skbmod_search(struct net *net, struct tc_action **a, u32 index)
+{
+	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
+
+	return tcf_hash_search(tn, a, index);
+}
+
+static struct tc_action_ops act_skbmod_ops = {
+	.kind		=	"skbmod",
+	.type		=	TCA_ACT_SKBMOD,
+	.owner		=	THIS_MODULE,
+	.act		=	tcf_skbmod_run,
+	.dump		=	tcf_skbmod_dump,
+	.init		=	tcf_skbmod_init,
+	.cleanup	=	tcf_skbmod_cleanup,
+	.walk		=	tcf_skbmod_walker,
+	.lookup		=	tcf_skbmod_search,
+	.size		=	sizeof(struct tcf_skbmod),
+};
+
+static __net_init int skbmod_init_net(struct net *net)
+{
+	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
+
+	return tc_action_net_init(tn, &act_skbmod_ops, SKBMOD_TAB_MASK);
+}
+
+static void __net_exit skbmod_exit_net(struct net *net)
+{
+	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
+
+	tc_action_net_exit(tn);
+}
+
+static struct pernet_operations skbmod_net_ops = {
+	.init = skbmod_init_net,
+	.exit = skbmod_exit_net,
+	.id   = &skbmod_net_id,
+	.size = sizeof(struct tc_action_net),
+};
+
+MODULE_AUTHOR("Jamal Hadi Salim, <jhs@mojatatu.com>");
+MODULE_DESCRIPTION("SKB data mod-ing");
+MODULE_LICENSE("GPL");
+
+static int __init skbmod_init_module(void)
+{
+	return tcf_register_action(&act_skbmod_ops, &skbmod_net_ops);
+}
+
+static void __exit skbmod_cleanup_module(void)
+{
+	tcf_unregister_action(&act_skbmod_ops, &skbmod_net_ops);
+}
+
+module_init(skbmod_init_module);
+module_exit(skbmod_cleanup_module);
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH v5 net-next 1/1] net_sched: Introduce skbmod action
From: Jamal Hadi Salim @ 2016-09-13  0:10 UTC (permalink / raw)
  To: davem
  Cc: netdev, daniel, xiyou.wangcong, eric.dumazet, john.r.fastabend,
	Florian Westphal
In-Reply-To: <1473725180-26987-1-git-send-email-jhs@emojatatu.com>

Ok, screwup. I used version 5 again ;-< I will resend and incorporate
Florian's comment.

cheers,
jamal

^ permalink raw reply

* linux-next: manual merge of the netfilter-next tree with the net tree
From: Stephen Rothwell @ 2016-09-13  0:12 UTC (permalink / raw)
  To: Pablo Neira Ayuso, NetFilter, David Miller, Networking
  Cc: linux-next, linux-kernel, Liping Zhang

Hi all,

Today's linux-next merge of the netfilter-next tree got a conflict in:

  net/netfilter/nf_tables_netdev.c

between commit:

  c73c24849011 ("netfilter: nf_tables_netdev: remove redundant ip_hdr assignment")

from the net tree and commit:

  ddc8b6027ad0 ("netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()")

from the netfilter-next tree.

I fixed it up (I used the latter version of this file and applied the
patch below) and can carry the fix as necessary. This is now fixed as
far as linux-next is concerned, but any non trivial conflicts should be
mentioned to your upstream maintainer when your tree is submitted for
merging.  You may also want to consider cooperating with the maintainer
of the conflicting tree to minimise any particularly complex conflicts.

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Tue, 13 Sep 2016 10:08:58 +1000
Subject: [PATCH] netfilter: merge fixup for "nf_tables_netdev: remove
 redundant ip_hdr assignment"

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 include/net/netfilter/nf_tables_ipv4.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/net/netfilter/nf_tables_ipv4.h b/include/net/netfilter/nf_tables_ipv4.h
index 968f00b82fb5..25e33aee91e7 100644
--- a/include/net/netfilter/nf_tables_ipv4.h
+++ b/include/net/netfilter/nf_tables_ipv4.h
@@ -33,7 +33,6 @@ __nft_set_pktinfo_ipv4_validate(struct nft_pktinfo *pkt,
 	if (!iph)
 		return -1;
 
-	iph = ip_hdr(skb);
 	if (iph->ihl < 5 || iph->version != 4)
 		return -1;
 
-- 
2.8.1

-- 
Cheers,
Stephen Rothwell

^ permalink raw reply related

* [PATCH v6 net-next 1/1] net_sched: Introduce skbmod action
From: Jamal Hadi Salim @ 2016-09-13  0:13 UTC (permalink / raw)
  To: davem
  Cc: netdev, daniel, xiyou.wangcong, eric.dumazet, john.r.fastabend,
	Jamal Hadi Salim

From: Jamal Hadi Salim <jhs@mojatatu.com>

This action is intended to be an upgrade from a usability perspective
from pedit (as well as operational debugability).
Compare this:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action pedit munge offset -14 u8 set 0x02 \
munge offset -13 u8 set 0x15 \
munge offset -12 u8 set 0x15 \
munge offset -11 u8 set 0x15 \
munge offset -10 u16 set 0x1515 \
pipe

to:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbmod dmac 02:15:15:15:15:15

Also try to do a MAC address swap with pedit or worse
try to debug a policy with destination mac, source mac and
etherype. Then make few rules out of those and you'll get my point.

In the future common use cases on pedit can be migrated to this action
(as an example different fields in ip v4/6, transports like tcp/udp/sctp
etc). For this first cut, this allows modifying basic ethernet header.

The most important ethernet use case at the moment is when redirecting or
mirroring packets to a remote machine. The dst mac address needs a re-write
so that it doesnt get dropped or confuse an interconnecting (learning) switch
or dropped by a target machine (which looks at the dst mac). And at times
when flipping back the packet a swap of the MAC addresses is needed.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 include/net/tc_act/tc_skbmod.h        |  30 ++++
 include/uapi/linux/tc_act/tc_skbmod.h |  39 +++++
 net/sched/Kconfig                     |  11 ++
 net/sched/Makefile                    |   1 +
 net/sched/act_skbmod.c                | 301 ++++++++++++++++++++++++++++++++++
 5 files changed, 382 insertions(+)
 create mode 100644 include/net/tc_act/tc_skbmod.h
 create mode 100644 include/uapi/linux/tc_act/tc_skbmod.h
 create mode 100644 net/sched/act_skbmod.c

diff --git a/include/net/tc_act/tc_skbmod.h b/include/net/tc_act/tc_skbmod.h
new file mode 100644
index 0000000..644a211
--- /dev/null
+++ b/include/net/tc_act/tc_skbmod.h
@@ -0,0 +1,30 @@
+/*
+ * Copyright (c) 2016, Jamal Hadi Salim
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+*/
+
+#ifndef __NET_TC_SKBMOD_H
+#define __NET_TC_SKBMOD_H
+
+#include <net/act_api.h>
+#include <linux/tc_act/tc_skbmod.h>
+
+struct tcf_skbmod_params {
+	struct rcu_head	rcu;
+	u64	flags; /*up to 64 types of operations; extend if needed */
+	u8	eth_dst[ETH_ALEN];
+	u16	eth_type;
+	u8	eth_src[ETH_ALEN];
+};
+
+struct tcf_skbmod {
+	struct tc_action	common;
+	struct tcf_skbmod_params __rcu *skbmod_p;
+};
+#define to_skbmod(a) ((struct tcf_skbmod *)a)
+
+#endif /* __NET_TC_SKBMOD_H */
diff --git a/include/uapi/linux/tc_act/tc_skbmod.h b/include/uapi/linux/tc_act/tc_skbmod.h
new file mode 100644
index 0000000..10fc07d
--- /dev/null
+++ b/include/uapi/linux/tc_act/tc_skbmod.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright (c) 2016, Jamal Hadi Salim
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+*/
+
+#ifndef __LINUX_TC_SKBMOD_H
+#define __LINUX_TC_SKBMOD_H
+
+#include <linux/pkt_cls.h>
+
+#define TCA_ACT_SKBMOD 15
+
+#define SKBMOD_F_DMAC	0x1
+#define SKBMOD_F_SMAC	0x2
+#define SKBMOD_F_ETYPE	0x4
+#define SKBMOD_F_SWAPMAC 0x8
+
+struct tc_skbmod {
+	tc_gen;
+	__u64 flags;
+};
+
+enum {
+	TCA_SKBMOD_UNSPEC,
+	TCA_SKBMOD_TM,
+	TCA_SKBMOD_PARMS,
+	TCA_SKBMOD_DMAC,
+	TCA_SKBMOD_SMAC,
+	TCA_SKBMOD_ETYPE,
+	TCA_SKBMOD_PAD,
+	__TCA_SKBMOD_MAX
+};
+#define TCA_SKBMOD_MAX (__TCA_SKBMOD_MAX - 1)
+
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 72e3426..7795d5a 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -749,6 +749,17 @@ config NET_ACT_CONNMARK
 	  To compile this code as a module, choose M here: the
 	  module will be called act_connmark.
 
+config NET_ACT_SKBMOD
+        tristate "skb data modification action"
+        depends on NET_CLS_ACT
+        ---help---
+         Say Y here to allow modification of skb data
+
+         If unsure, say N.
+
+         To compile this code as a module, choose M here: the
+         module will be called act_skbmod.
+
 config NET_ACT_IFE
         tristate "Inter-FE action based on IETF ForCES InterFE LFB"
         depends on NET_CLS_ACT
diff --git a/net/sched/Makefile b/net/sched/Makefile
index b9d046b..148ae0d 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -19,6 +19,7 @@ obj-$(CONFIG_NET_ACT_CSUM)	+= act_csum.o
 obj-$(CONFIG_NET_ACT_VLAN)	+= act_vlan.o
 obj-$(CONFIG_NET_ACT_BPF)	+= act_bpf.o
 obj-$(CONFIG_NET_ACT_CONNMARK)	+= act_connmark.o
+obj-$(CONFIG_NET_ACT_SKBMOD)	+= act_skbmod.o
 obj-$(CONFIG_NET_ACT_IFE)	+= act_ife.o
 obj-$(CONFIG_NET_IFE_SKBMARK)	+= act_meta_mark.o
 obj-$(CONFIG_NET_IFE_SKBPRIO)	+= act_meta_skbprio.o
diff --git a/net/sched/act_skbmod.c b/net/sched/act_skbmod.c
new file mode 100644
index 0000000..e7d9638
--- /dev/null
+++ b/net/sched/act_skbmod.c
@@ -0,0 +1,301 @@
+/*
+ * net/sched/act_skbmod.c  skb data modifier
+ *
+ * Copyright (c) 2016 Jamal Hadi Salim <jhs@mojatatu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+*/
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/skbuff.h>
+#include <linux/rtnetlink.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+
+#include <linux/tc_act/tc_skbmod.h>
+#include <net/tc_act/tc_skbmod.h>
+
+#define SKBMOD_TAB_MASK     15
+
+static int skbmod_net_id;
+static struct tc_action_ops act_skbmod_ops;
+
+#define MAX_EDIT_LEN ETH_HLEN
+static int tcf_skbmod_run(struct sk_buff *skb, const struct tc_action *a,
+			  struct tcf_result *res)
+{
+	struct tcf_skbmod *d = to_skbmod(a);
+	int action;
+	struct tcf_skbmod_params *p;
+	u64 flags;
+	int err;
+
+	tcf_lastuse_update(&d->tcf_tm);
+	bstats_cpu_update(this_cpu_ptr(d->common.cpu_bstats), skb);
+
+	/* XXX: if you are going to edit more fields beyond ethernet header
+	 * (example when you add IP header replacement or vlan swap)
+	 * then MAX_EDIT_LEN needs to change appropriately
+	*/
+	err = skb_ensure_writable(skb, MAX_EDIT_LEN);
+	if (unlikely(err)) { /* best policy is to drop on the floor */
+		qstats_overlimit_inc(this_cpu_ptr(d->common.cpu_qstats));
+		return TC_ACT_SHOT;
+	}
+
+	rcu_read_lock();
+	action = READ_ONCE(d->tcf_action);
+	if (unlikely(action == TC_ACT_SHOT)) {
+		qstats_overlimit_inc(this_cpu_ptr(d->common.cpu_qstats));
+		rcu_read_unlock();
+		return action;
+	}
+
+	p = rcu_dereference(d->skbmod_p);
+	flags = p->flags;
+	if (flags & SKBMOD_F_DMAC)
+		ether_addr_copy(eth_hdr(skb)->h_dest, p->eth_dst);
+	if (flags & SKBMOD_F_SMAC)
+		ether_addr_copy(eth_hdr(skb)->h_source, p->eth_src);
+	if (flags & SKBMOD_F_ETYPE)
+		eth_hdr(skb)->h_proto = p->eth_type;
+	rcu_read_unlock();
+
+	if (flags & SKBMOD_F_SWAPMAC) {
+		u16 tmpaddr[ETH_ALEN / 2]; /* ether_addr_copy() requirement */
+		/*XXX: I am sure we can come up with more efficient swapping*/
+		ether_addr_copy((u8 *)tmpaddr, eth_hdr(skb)->h_dest);
+		ether_addr_copy(eth_hdr(skb)->h_dest, eth_hdr(skb)->h_source);
+		ether_addr_copy(eth_hdr(skb)->h_source, (u8 *)tmpaddr);
+	}
+
+	return action;
+}
+
+static const struct nla_policy skbmod_policy[TCA_SKBMOD_MAX + 1] = {
+	[TCA_SKBMOD_PARMS]		= { .len = sizeof(struct tc_skbmod) },
+	[TCA_SKBMOD_DMAC]		= { .len = ETH_ALEN },
+	[TCA_SKBMOD_SMAC]		= { .len = ETH_ALEN },
+	[TCA_SKBMOD_ETYPE]		= { .type = NLA_U16 },
+};
+
+static int tcf_skbmod_init(struct net *net, struct nlattr *nla,
+			   struct nlattr *est, struct tc_action **a,
+			   int ovr, int bind)
+{
+	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
+	struct nlattr *tb[TCA_SKBMOD_MAX + 1];
+	struct tcf_skbmod_params *p, *p_old;
+	struct tc_skbmod *parm;
+	struct tcf_skbmod *d;
+	bool exists = false;
+	u8 *daddr = NULL;
+	u8 *saddr = NULL;
+	u16 eth_type = 0;
+	u32 lflags = 0;
+	int ret = 0, err;
+
+	if (!nla)
+		return -EINVAL;
+
+	err = nla_parse_nested(tb, TCA_SKBMOD_MAX, nla, skbmod_policy);
+	if (err < 0)
+		return err;
+
+	if (!tb[TCA_SKBMOD_PARMS])
+		return -EINVAL;
+
+	if (tb[TCA_SKBMOD_DMAC]) {
+		daddr = nla_data(tb[TCA_SKBMOD_DMAC]);
+		lflags |= SKBMOD_F_DMAC;
+	}
+
+	if (tb[TCA_SKBMOD_SMAC]) {
+		saddr = nla_data(tb[TCA_SKBMOD_SMAC]);
+		lflags |= SKBMOD_F_SMAC;
+	}
+
+	if (tb[TCA_SKBMOD_ETYPE]) {
+		eth_type = nla_get_u16(tb[TCA_SKBMOD_ETYPE]);
+		lflags |= SKBMOD_F_ETYPE;
+	}
+
+	parm = nla_data(tb[TCA_SKBMOD_PARMS]);
+	if (parm->flags & SKBMOD_F_SWAPMAC)
+		lflags = SKBMOD_F_SWAPMAC;
+
+	exists = tcf_hash_check(tn, parm->index, a, bind);
+	if (exists && bind)
+		return 0;
+
+	if (!lflags)
+		return -EINVAL;
+
+	if (!exists) {
+		ret = tcf_hash_create(tn, parm->index, est, a,
+				      &act_skbmod_ops, bind, true);
+		if (ret)
+			return ret;
+
+		ret = ACT_P_CREATED;
+	} else {
+		tcf_hash_release(*a, bind);
+		if (!ovr)
+			return -EEXIST;
+	}
+
+	d = to_skbmod(*a);
+
+	ASSERT_RTNL();
+	p = kzalloc(sizeof(struct tcf_skbmod_params), GFP_KERNEL);
+	if (unlikely(!p)) {
+		if (ovr)
+			tcf_hash_release(*a, bind);
+		return -ENOMEM;
+	}
+
+	p->flags = lflags;
+	d->tcf_action = parm->action;
+
+	p_old = rtnl_dereference(d->skbmod_p);
+
+	if (ovr)
+		spin_lock_bh(&d->tcf_lock);
+
+	if (lflags & SKBMOD_F_DMAC)
+		ether_addr_copy(p->eth_dst, daddr);
+	if (lflags & SKBMOD_F_SMAC)
+		ether_addr_copy(p->eth_src, saddr);
+	if (lflags & SKBMOD_F_ETYPE)
+		p->eth_type = htons(eth_type);
+
+	rcu_assign_pointer(d->skbmod_p, p);
+	if (ovr)
+		spin_unlock_bh(&d->tcf_lock);
+
+	if (p_old)
+		kfree_rcu(p_old, rcu);
+
+	if (ret == ACT_P_CREATED)
+		tcf_hash_insert(tn, *a);
+	return ret;
+}
+
+static void tcf_skbmod_cleanup(struct tc_action *a, int bind)
+{
+	struct tcf_skbmod *d = to_skbmod(a);
+	struct tcf_skbmod_params  *p;
+
+	p = rcu_dereference_protected(d->skbmod_p, 1);
+	kfree_rcu(p, rcu);
+}
+
+static int tcf_skbmod_dump(struct sk_buff *skb, struct tc_action *a,
+			   int bind, int ref)
+{
+	struct tcf_skbmod *d = to_skbmod(a);
+	unsigned char *b = skb_tail_pointer(skb);
+	struct tcf_skbmod_params  *p = rtnl_dereference(d->skbmod_p);
+	struct tc_skbmod opt = {
+		.index   = d->tcf_index,
+		.refcnt  = d->tcf_refcnt - ref,
+		.bindcnt = d->tcf_bindcnt - bind,
+		.action  = d->tcf_action,
+	};
+	struct tcf_t t;
+
+	opt.flags  = p->flags;
+	if (nla_put(skb, TCA_SKBMOD_PARMS, sizeof(opt), &opt))
+		goto nla_put_failure;
+	if ((p->flags & SKBMOD_F_DMAC) &&
+	    nla_put(skb, TCA_SKBMOD_DMAC, ETH_ALEN, p->eth_dst))
+		goto nla_put_failure;
+	if ((p->flags & SKBMOD_F_SMAC) &&
+	    nla_put(skb, TCA_SKBMOD_SMAC, ETH_ALEN, p->eth_src))
+		goto nla_put_failure;
+	if ((p->flags & SKBMOD_F_ETYPE) &&
+	    nla_put_u16(skb, TCA_SKBMOD_ETYPE, ntohs(p->eth_type)))
+		goto nla_put_failure;
+
+	tcf_tm_dump(&t, &d->tcf_tm);
+	if (nla_put_64bit(skb, TCA_SKBMOD_TM, sizeof(t), &t, TCA_SKBMOD_PAD))
+		goto nla_put_failure;
+
+	return skb->len;
+nla_put_failure:
+	rcu_read_unlock();
+	nlmsg_trim(skb, b);
+	return -1;
+}
+
+static int tcf_skbmod_walker(struct net *net, struct sk_buff *skb,
+			     struct netlink_callback *cb, int type,
+			     const struct tc_action_ops *ops)
+{
+	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
+
+	return tcf_generic_walker(tn, skb, cb, type, ops);
+}
+
+static int tcf_skbmod_search(struct net *net, struct tc_action **a, u32 index)
+{
+	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
+
+	return tcf_hash_search(tn, a, index);
+}
+
+static struct tc_action_ops act_skbmod_ops = {
+	.kind		=	"skbmod",
+	.type		=	TCA_ACT_SKBMOD,
+	.owner		=	THIS_MODULE,
+	.act		=	tcf_skbmod_run,
+	.dump		=	tcf_skbmod_dump,
+	.init		=	tcf_skbmod_init,
+	.cleanup	=	tcf_skbmod_cleanup,
+	.walk		=	tcf_skbmod_walker,
+	.lookup		=	tcf_skbmod_search,
+	.size		=	sizeof(struct tcf_skbmod),
+};
+
+static __net_init int skbmod_init_net(struct net *net)
+{
+	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
+
+	return tc_action_net_init(tn, &act_skbmod_ops, SKBMOD_TAB_MASK);
+}
+
+static void __net_exit skbmod_exit_net(struct net *net)
+{
+	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
+
+	tc_action_net_exit(tn);
+}
+
+static struct pernet_operations skbmod_net_ops = {
+	.init = skbmod_init_net,
+	.exit = skbmod_exit_net,
+	.id   = &skbmod_net_id,
+	.size = sizeof(struct tc_action_net),
+};
+
+MODULE_AUTHOR("Jamal Hadi Salim, <jhs@mojatatu.com>");
+MODULE_DESCRIPTION("SKB data mod-ing");
+MODULE_LICENSE("GPL");
+
+static int __init skbmod_init_module(void)
+{
+	return tcf_register_action(&act_skbmod_ops, &skbmod_net_ops);
+}
+
+static void __exit skbmod_cleanup_module(void)
+{
+	tcf_unregister_action(&act_skbmod_ops, &skbmod_net_ops);
+}
+
+module_init(skbmod_init_module);
+module_exit(skbmod_cleanup_module);
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH 3/3] net-next: dsa: add new driver for qca8xxx family
From: Andrew Lunn @ 2016-09-13  0:40 UTC (permalink / raw)
  To: John Crispin
  Cc: David S. Miller, Florian Fainelli, netdev, linux-kernel,
	qsdk-review
In-Reply-To: <1473669337-21221-4-git-send-email-john@phrozen.org>

Hi John

> +static u16
> +qca8k_phy_mmd_read(struct qca8k_priv *priv, int phy_addr, u16 addr, u16 reg)
> +{
> +	u16 data;
> +
> +	mutex_lock(&priv->bus->mdio_lock);
> +
> +	priv->bus->write(priv->bus, phy_addr, MII_ATH_MMD_ADDR, addr);
> +	priv->bus->write(priv->bus, phy_addr, MII_ATH_MMD_DATA, reg);
> +	priv->bus->write(priv->bus, phy_addr, MII_ATH_MMD_ADDR, addr | 0x4000);
> +	data = priv->bus->read(priv->bus, phy_addr, MII_ATH_MMD_DATA);
> +
> +	mutex_unlock(&priv->bus->mdio_lock);
> +
> +	return data;
> +}

snip

> +static int
> +qca8k_get_eee(struct dsa_switch *ds, int port,
> +	      struct ethtool_eee *e)
> +{
> +	struct qca8k_priv *priv = qca8k_to_priv(ds);
> +	struct ethtool_eee *p = &priv->port_sts[qca8k_phy_to_port(port)].eee;
> +	u32 lp, adv, supported;
> +	u16 val;
> +
> +	/* The switch has no way to tell the result of the AN so we need to
> +	 * read the result directly from the PHYs MMD registers
> +	 */
> +	val = qca8k_phy_mmd_read(priv, port, MDIO_MMD_PCS, MDIO_PCS_EEE_ABLE);
> +	supported = mmd_eee_cap_to_ethtool_sup_t(val);
> +
> +	val = qca8k_phy_mmd_read(priv, port, MDIO_MMD_AN, MDIO_AN_EEE_ADV);
> +	adv = mmd_eee_adv_to_ethtool_adv_t(val);
> +
> +	val = qca8k_phy_mmd_read(priv, port, MDIO_MMD_AN, MDIO_AN_EEE_LPABLE);
> +	lp = mmd_eee_adv_to_ethtool_adv_t(val);
> +
> +	e->eee_enabled = p->eee_enabled;
> +	e->eee_active = !!(supported & adv & lp);
> +
> +	return 0;
> +}

Couldn't you just call phy_ethtool_get_eee(phydev)? Then you don't
need qca8k_phy_mmd_read()?

     Andrew

^ permalink raw reply

* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
From: Alexei Starovoitov @ 2016-09-13  1:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: John Fastabend, bblanco, jeffrey.t.kirsher, brouer, davem,
	xiyou.wangcong, intel-wired-lan, u9012063, netdev
In-Reply-To: <1473723968.18970.111.camel@edumazet-glaptop3.roam.corp.google.com>

On Mon, Sep 12, 2016 at 04:46:08PM -0700, Eric Dumazet wrote:
> On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:
> 
> > yep. there are various ways to shoot yourself in the foot with xdp.
> > The simplest program that drops all the packets will make the box unpingable.
> 
> Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
> scooter on 101 highway ;)
> 
> This XDP_TX thing was one of the XDP marketing stuff, but there is
> absolutely no documentation on it, warning users about possible
> limitations/outcomes.

that's fair critique. there is no documentation for xdp in general.

> BTW, I am not sure mlx4 implementation even works, vs BQL :

it doesn't and it shouldn't. xdp and stack use different tx queues.

> mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
> but tx completion will call netdev_tx_completed_queue() -> crash

not quite. netdev_tx_completed_queue() is not called for xdp queues.

> Do we have one test to validate that a XDP_TX implementation is actually
> correct ?

it's correct in the scope that it was defined for.
There is no objective to share the same tx queue with the stack in xdp.
The queues must be absolutely separate otherwise performance will tank.

This e1k patch is really special, because the e1k has one tx queue,
but this is for debugging of the programs, so it's ok to cut corners.
The e1k xdp support doesn't need to interact nicely with stack.
It's impossible in the first place. xdp is the layer before the stack.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox