Netdev List
 help / color / mirror / Atom feed
* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
From: Eric Dumazet @ 2016-09-12 22:46 UTC (permalink / raw)
  To: John Fastabend
  Cc: bblanco, alexei.starovoitov, jeffrey.t.kirsher, brouer, davem,
	xiyou.wangcong, intel-wired-lan, u9012063, netdev
In-Reply-To: <20160912221351.5610.29043.stgit@john-Precision-Tower-5810>

On Mon, 2016-09-12 at 15:13 -0700, John Fastabend wrote:
> From: Alexei Starovoitov <ast@fb.com>

> +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
> +				 u32 len,
> +				 struct net_device *netdev,
> +				 struct e1000_adapter *adapter)
> +{
> +	struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> +	struct e1000_hw *hw = &adapter->hw;
> +	struct e1000_tx_ring *tx_ring;
> +
> +	if (len > E1000_MAX_DATA_PER_TXD)
> +		return;
> +
> +	/* e1000 only support a single txq at the moment so the queue is being
> +	 * shared with stack. To support this requires locking to ensure the
> +	 * stack and XDP are not running at the same time. Devices with
> +	 * multiple queues should allocate a separate queue space.
> +	 */
> +	HARD_TX_LOCK(netdev, txq, smp_processor_id());
> +
> +	tx_ring = adapter->tx_ring;
> +
> +	if (E1000_DESC_UNUSED(tx_ring) < 2) {
> +		HARD_TX_UNLOCK(netdev, txq);
> +		return;
> +	}
> +
> +	if (netif_xmit_frozen_or_stopped(txq))
> +		return;
> +
> +	e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
> +	netdev_sent_queue(netdev, len);
> +	e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
> +
> +	writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
> +	mmiowb();
> +
> +	HARD_TX_UNLOCK(netdev, txq);
> +}


e1000_tx_map() is full of workarounds.

Have a look at last_tx_tso for example.

               /* Workaround for Controller erratum --
                 * descriptor for non-tso packet in a linear SKB that follows a
                 * tso gets written back prematurely before the data is fully
                 * DMA'd to the controller
                 */
                if (!skb->data_len && tx_ring->last_tx_tso &&
                    !skb_is_gso(skb)) {
                        tx_ring->last_tx_tso = false;
                        size -= 4;
                }

Look, this XDP_TX thing is hard to properly implement and test on
various NIC revisions.

Without proper queue management, high prio packets in qdisc wont be sent
if NIC is under RX -> XDP_TX flood.

Sounds a horrible feature to me.

^ permalink raw reply

* Re: [PATCH v4 00/16] Add Paravirtual RDMA Driver
From: Adit Ranadive @ 2016-09-12 22:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dledford@redhat.com, linux-rdma@vger.kernel.org, pv-drivers,
	netdev@vger.kernel.org, linux-pci@vger.kernel.org,
	Jorgen S. Hansen, Aditya Sarwade, George Zhang, Bryan Tan
In-Reply-To: <20160912180251.GH5843@obsidianresearch.com>

On Mon, Sep 12, 2016 at 11:03:39 -0700, Jason Gunthorpe wrote:
> On Sun, Sep 11, 2016 at 09:49:10PM -0700, Adit Ranadive wrote:
> > [2] Libpvrdma User-level library - 
> > http://git.openfabrics.org/?p=~aditr/libpvrdma.git;a=summary
> 
> You will probably find that rdma-plumbing will be the best way to get
> your userspace component into the distributors.

Hi Jason,

Sorry I haven't paying attention to that discussion. Do you know how soon 
distros will pick up the rdma-plumbing stuff? 
It seems like it should be fairly straightforward to include the libpvrdma git 
within the rdma-plumbing git as well. Let me know what the process is since
I may have to discuss it internally.

Thanks!
Adit

^ permalink raw reply

* Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
From: Eric Dumazet @ 2016-09-12 22:26 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: davem, netdev, xiyou.wangcong, daniel, john.r.fastabend
In-Reply-To: <77bca669-9113-3977-ad3c-478091422ff8@mojatatu.com>

On Mon, 2016-09-12 at 18:14 -0400, Jamal Hadi Salim wrote:

> I noticed some very weird issues when I took that out.
> Running sufficiently large amount of traffic (ping -f is sufficient)
> I saw that when i did a dump it took anywhere between 6-15 seconds.
> With the read_lock in place response was immediate.
> I can go back and run things to verify - but it was very odd.

This was on uni processor ?

Looks like typical starvation caused by aggressive softirq.

Anyway, I suspect your kernel build has rcu_read_lock() and
rcu_read_unlock() as NOP ;)

^ permalink raw reply

* Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
From: Jamal Hadi Salim @ 2016-09-12 22:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev, xiyou.wangcong, daniel, john.r.fastabend
In-Reply-To: <1473717691.18970.82.camel@edumazet-glaptop3.roam.corp.google.com>

On 16-09-12 06:01 PM, Eric Dumazet wrote:
> On Mon, 2016-09-12 at 16:46 -0400, Jamal Hadi Salim wrote:
>
>> +
>> +static int tcf_skbmod_dump(struct sk_buff *skb, struct tc_action *a,
>> +			   int bind, int ref)
>> +{
>> +	struct tcf_skbmod *d = to_skbmod(a);
>> +	unsigned char *b = skb_tail_pointer(skb);
>> +	struct tcf_skbmod_params  *p = rtnl_dereference(d->skbmod_p);
>> +	struct tc_skbmod opt = {
>> +		.index   = d->tcf_index,
>> +		.refcnt  = d->tcf_refcnt - ref,
>> +		.bindcnt = d->tcf_bindcnt - bind,
>> +		.action  = d->tcf_action,
>> +	};
>> +	struct tcf_t t;
>> +
>> +	rcu_read_lock();
>
> You do not need rcu read lock protection here, RTNL is enough.

I noticed some very weird issues when I took that out.
Running sufficiently large amount of traffic (ping -f is sufficient)
I saw that when i did a dump it took anywhere between 6-15 seconds.
With the read_lock in place response was immediate.
I can go back and run things to verify - but it was very odd.

cheers,
jamal

^ permalink raw reply

* [net-next PATCH v3 3/3] e1000: bundle xdp xmit routines
From: John Fastabend @ 2016-09-12 22:14 UTC (permalink / raw)
  To: bblanco, john.fastabend, alexei.starovoitov, jeffrey.t.kirsher,
	brouer, davem
  Cc: xiyou.wangcong, intel-wired-lan, u9012063, netdev
In-Reply-To: <20160912220312.5610.77528.stgit@john-Precision-Tower-5810>

e1000 supports a single TX queue so it is being shared with the stack
when XDP runs XDP_TX action. This requires taking the xmit lock to
ensure we don't corrupt the tx ring. To avoid taking and dropping the
lock per packet this patch adds a bundling implementation to submit
a bundle of packets to the xmit routine.

I tested this patch running e1000 in a VM using KVM over a tap
device using pktgen to generate traffic along with 'ping -f -l 100'.

Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/ethernet/intel/e1000/e1000.h      |   10 +++
 drivers/net/ethernet/intel/e1000/e1000_main.c |   80 +++++++++++++++++++------
 2 files changed, 70 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
index 5cf8a0a..877b377 100644
--- a/drivers/net/ethernet/intel/e1000/e1000.h
+++ b/drivers/net/ethernet/intel/e1000/e1000.h
@@ -133,6 +133,8 @@ struct e1000_adapter;
 #define E1000_TX_QUEUE_WAKE	16
 /* How many Rx Buffers do we bundle into one write to the hardware ? */
 #define E1000_RX_BUFFER_WRITE	16 /* Must be power of 2 */
+/* How many XDP XMIT buffers to bundle into one xmit transaction */
+#define E1000_XDP_XMIT_BUNDLE_MAX E1000_RX_BUFFER_WRITE
 
 #define AUTO_ALL_MODES		0
 #define E1000_EEPROM_82544_APM	0x0004
@@ -168,6 +170,11 @@ struct e1000_rx_buffer {
 	dma_addr_t dma;
 };
 
+struct e1000_rx_buffer_bundle {
+	struct e1000_rx_buffer *buffer;
+	u32 length;
+};
+
 struct e1000_tx_ring {
 	/* pointer to the descriptor ring memory */
 	void *desc;
@@ -206,6 +213,9 @@ struct e1000_rx_ring {
 	struct e1000_rx_buffer *buffer_info;
 	struct sk_buff *rx_skb_top;
 
+	/* array of XDP buffer information structs */
+	struct e1000_rx_buffer_bundle *xdp_buffer;
+
 	/* cpu for rx queue */
 	int cpu;
 
diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 232b927..31489d4 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -848,6 +848,15 @@ static int e1000_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
 	struct e1000_adapter *adapter = netdev_priv(netdev);
 	struct bpf_prog *old_prog;
 
+	if (!adapter->rx_ring[0].xdp_buffer) {
+		int size = sizeof(struct e1000_rx_buffer_bundle) *
+				E1000_XDP_XMIT_BUNDLE_MAX;
+
+		adapter->rx_ring[0].xdp_buffer = vzalloc(size);
+		if (!adapter->rx_ring[0].xdp_buffer)
+			return -ENOMEM;
+	}
+
 	old_prog = xchg(&adapter->prog, prog);
 	if (old_prog) {
 		synchronize_net();
@@ -1319,6 +1328,9 @@ static void e1000_remove(struct pci_dev *pdev)
 	if (adapter->prog)
 		bpf_prog_put(adapter->prog);
 
+	if (adapter->rx_ring[0].xdp_buffer)
+		vfree(adapter->rx_ring[0].xdp_buffer);
+
 	unregister_netdev(netdev);
 
 	e1000_phy_hw_reset(hw);
@@ -3372,29 +3384,17 @@ static void e1000_tx_map_rxpage(struct e1000_tx_ring *tx_ring,
 
 static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
 				 u32 len,
+				 struct e1000_adapter *adapter,
 				 struct net_device *netdev,
-				 struct e1000_adapter *adapter)
+				 struct e1000_tx_ring *tx_ring)
 {
-	struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
-	struct e1000_hw *hw = &adapter->hw;
-	struct e1000_tx_ring *tx_ring;
+	const struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
 
 	if (len > E1000_MAX_DATA_PER_TXD)
 		return;
 
-	/* e1000 only support a single txq at the moment so the queue is being
-	 * shared with stack. To support this requires locking to ensure the
-	 * stack and XDP are not running at the same time. Devices with
-	 * multiple queues should allocate a separate queue space.
-	 */
-	HARD_TX_LOCK(netdev, txq, smp_processor_id());
-
-	tx_ring = adapter->tx_ring;
-
-	if (E1000_DESC_UNUSED(tx_ring) < 2) {
-		HARD_TX_UNLOCK(netdev, txq);
+	if (E1000_DESC_UNUSED(tx_ring) < 2)
 		return;
-	}
 
 	if (netif_xmit_frozen_or_stopped(txq))
 		return;
@@ -3402,7 +3402,36 @@ static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
 	e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
 	netdev_sent_queue(netdev, len);
 	e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
+}
 
+static void e1000_xdp_xmit_bundle(struct e1000_rx_buffer_bundle *buffer_info,
+				  struct net_device *netdev,
+				  struct e1000_adapter *adapter)
+{
+	struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
+	struct e1000_tx_ring *tx_ring = adapter->tx_ring;
+	struct e1000_hw *hw = &adapter->hw;
+	int i = 0;
+
+	/* e1000 only support a single txq at the moment so the queue is being
+	 * shared with stack. To support this requires locking to ensure the
+	 * stack and XDP are not running at the same time. Devices with
+	 * multiple queues should allocate a separate queue space.
+	 *
+	 * To amortize the locking cost e1000 bundles the xmits and send up to
+	 * E1000_XDP_XMIT_BUNDLE_MAX.
+	 */
+	HARD_TX_LOCK(netdev, txq, smp_processor_id());
+
+	for (; i < E1000_XDP_XMIT_BUNDLE_MAX && buffer_info[i].buffer; i++) {
+		e1000_xmit_raw_frame(buffer_info[i].buffer,
+				     buffer_info[i].length,
+				     adapter, netdev, tx_ring);
+		buffer_info[i].buffer = NULL;
+		buffer_info[i].length = 0;
+	}
+
+	/* kick hardware to send bundle and return control back to the stack */
 	writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
 	mmiowb();
 
@@ -4284,9 +4313,10 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
 	struct bpf_prog *prog;
 	u32 length;
 	unsigned int i;
-	int cleaned_count = 0;
+	int cleaned_count = 0, xdp_xmit = 0;
 	bool cleaned = false;
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
+	struct e1000_rx_buffer_bundle *xdp_bundle = rx_ring->xdp_buffer;
 
 	rcu_read_lock(); /* rcu lock needed here to protect xdp programs */
 	prog = READ_ONCE(adapter->prog);
@@ -4341,12 +4371,13 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
 			case XDP_PASS:
 				break;
 			case XDP_TX:
+				xdp_bundle[xdp_xmit].buffer = buffer_info;
+				xdp_bundle[xdp_xmit].length = length;
 				dma_sync_single_for_device(&pdev->dev,
 							   dma,
 							   length,
 							   DMA_TO_DEVICE);
-				e1000_xmit_raw_frame(buffer_info, length,
-						     netdev, adapter);
+				xdp_xmit++;
 			case XDP_DROP:
 			default:
 				/* re-use mapped page. keep buffer_info->dma
@@ -4488,8 +4519,14 @@ next_desc:
 
 		/* return some buffers to hardware, one at a time is too slow */
 		if (unlikely(cleaned_count >= E1000_RX_BUFFER_WRITE)) {
+			if (xdp_xmit)
+				e1000_xdp_xmit_bundle(xdp_bundle,
+						      netdev,
+						      adapter);
+
 			adapter->alloc_rx_buf(adapter, rx_ring, cleaned_count);
 			cleaned_count = 0;
+			xdp_xmit = 0;
 		}
 
 		/* use prefetched values */
@@ -4500,8 +4537,11 @@ next_desc:
 	rx_ring->next_to_clean = i;
 
 	cleaned_count = E1000_DESC_UNUSED(rx_ring);
-	if (cleaned_count)
+	if (cleaned_count) {
+		if (xdp_xmit)
+			e1000_xdp_xmit_bundle(xdp_bundle, netdev, adapter);
 		adapter->alloc_rx_buf(adapter, rx_ring, cleaned_count);
+	}
 
 	adapter->total_rx_packets += total_rx_packets;
 	adapter->total_rx_bytes += total_rx_bytes;

^ permalink raw reply related

* [net-next PATCH v3 2/3] e1000: add initial XDP support
From: John Fastabend @ 2016-09-12 22:13 UTC (permalink / raw)
  To: bblanco, john.fastabend, alexei.starovoitov, jeffrey.t.kirsher,
	brouer, davem
  Cc: xiyou.wangcong, intel-wired-lan, u9012063, netdev
In-Reply-To: <20160912220312.5610.77528.stgit@john-Precision-Tower-5810>

From: Alexei Starovoitov <ast@fb.com>

This patch adds initial support for XDP on e1000 driver. Note e1000
driver does not support page recycling in general which could be
added as a further improvement. However XDP_DROP case will recycle.
XDP_TX and XDP_PASS do not support recycling.

e1000 only supports a single tx queue at this time so the queue
is shared between xdp program and Linux stack. It is possible for
an XDP program to starve the stack in this model.

The XDP program will drop packets on XDP_TX errors. This can occur
when the tx descriptors are exhausted. This behavior is the same
for both shared queue models like e1000 and dedicated tx queue
models used in multiqueue devices. However if both the stack and
XDP are transmitting packets it is perhaps more likely to occur in
the shared queue model. Further refinement to the XDP model may be
possible in the future.

I tested this patch running e1000 in a VM using KVM over a tap
device.

CC: William Tu <u9012063@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/ethernet/intel/e1000/e1000.h      |    2 
 drivers/net/ethernet/intel/e1000/e1000_main.c |  176 +++++++++++++++++++++++++
 2 files changed, 175 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
index d7bdea7..5cf8a0a 100644
--- a/drivers/net/ethernet/intel/e1000/e1000.h
+++ b/drivers/net/ethernet/intel/e1000/e1000.h
@@ -150,6 +150,7 @@ struct e1000_adapter;
  */
 struct e1000_tx_buffer {
 	struct sk_buff *skb;
+	struct page *page;
 	dma_addr_t dma;
 	unsigned long time_stamp;
 	u16 length;
@@ -279,6 +280,7 @@ struct e1000_adapter {
 			     struct e1000_rx_ring *rx_ring,
 			     int cleaned_count);
 	struct e1000_rx_ring *rx_ring;      /* One per active queue */
+	struct bpf_prog *prog;
 	struct napi_struct napi;
 
 	int num_tx_queues;
diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 62a7f8d..232b927 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -32,6 +32,7 @@
 #include <linux/prefetch.h>
 #include <linux/bitops.h>
 #include <linux/if_vlan.h>
+#include <linux/bpf.h>
 
 char e1000_driver_name[] = "e1000";
 static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver";
@@ -842,6 +843,44 @@ static int e1000_set_features(struct net_device *netdev,
 	return 0;
 }
 
+static int e1000_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
+{
+	struct e1000_adapter *adapter = netdev_priv(netdev);
+	struct bpf_prog *old_prog;
+
+	old_prog = xchg(&adapter->prog, prog);
+	if (old_prog) {
+		synchronize_net();
+		bpf_prog_put(old_prog);
+	}
+
+	if (netif_running(netdev))
+		e1000_reinit_locked(adapter);
+	else
+		e1000_reset(adapter);
+	return 0;
+}
+
+static bool e1000_xdp_attached(struct net_device *dev)
+{
+	struct e1000_adapter *priv = netdev_priv(dev);
+
+	return !!priv->prog;
+}
+
+static int e1000_xdp(struct net_device *dev, struct netdev_xdp *xdp)
+{
+	switch (xdp->command) {
+	case XDP_SETUP_PROG:
+		return e1000_xdp_set(dev, xdp->prog);
+	case XDP_QUERY_PROG:
+		xdp->prog_attached = e1000_xdp_attached(dev);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static const struct net_device_ops e1000_netdev_ops = {
 	.ndo_open		= e1000_open,
 	.ndo_stop		= e1000_close,
@@ -860,6 +899,7 @@ static const struct net_device_ops e1000_netdev_ops = {
 #endif
 	.ndo_fix_features	= e1000_fix_features,
 	.ndo_set_features	= e1000_set_features,
+	.ndo_xdp		= e1000_xdp,
 };
 
 /**
@@ -1276,6 +1316,9 @@ static void e1000_remove(struct pci_dev *pdev)
 	e1000_down_and_stop(adapter);
 	e1000_release_manageability(adapter);
 
+	if (adapter->prog)
+		bpf_prog_put(adapter->prog);
+
 	unregister_netdev(netdev);
 
 	e1000_phy_hw_reset(hw);
@@ -1859,7 +1902,7 @@ static void e1000_configure_rx(struct e1000_adapter *adapter)
 	struct e1000_hw *hw = &adapter->hw;
 	u32 rdlen, rctl, rxcsum;
 
-	if (adapter->netdev->mtu > ETH_DATA_LEN) {
+	if (adapter->netdev->mtu > ETH_DATA_LEN || adapter->prog) {
 		rdlen = adapter->rx_ring[0].count *
 			sizeof(struct e1000_rx_desc);
 		adapter->clean_rx = e1000_clean_jumbo_rx_irq;
@@ -1973,6 +2016,11 @@ e1000_unmap_and_free_tx_resource(struct e1000_adapter *adapter,
 		dev_kfree_skb_any(buffer_info->skb);
 		buffer_info->skb = NULL;
 	}
+	if (buffer_info->page) {
+		put_page(buffer_info->page);
+		buffer_info->page = NULL;
+	}
+
 	buffer_info->time_stamp = 0;
 	/* buffer_info must be completely set up in the transmit path */
 }
@@ -3298,6 +3346,69 @@ static netdev_tx_t e1000_xmit_frame(struct sk_buff *skb,
 	return NETDEV_TX_OK;
 }
 
+static void e1000_tx_map_rxpage(struct e1000_tx_ring *tx_ring,
+				struct e1000_rx_buffer *rx_buffer_info,
+				unsigned int len)
+{
+	struct e1000_tx_buffer *buffer_info;
+	unsigned int i = tx_ring->next_to_use;
+
+	buffer_info = &tx_ring->buffer_info[i];
+
+	buffer_info->length = len;
+	buffer_info->time_stamp = jiffies;
+	buffer_info->mapped_as_page = false;
+	buffer_info->dma = rx_buffer_info->dma;
+	buffer_info->next_to_watch = i;
+	buffer_info->page = rx_buffer_info->rxbuf.page;
+
+	tx_ring->buffer_info[i].skb = NULL;
+	tx_ring->buffer_info[i].segs = 1;
+	tx_ring->buffer_info[i].bytecount = len;
+	tx_ring->buffer_info[i].next_to_watch = i;
+
+	rx_buffer_info->rxbuf.page = NULL;
+}
+
+static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
+				 u32 len,
+				 struct net_device *netdev,
+				 struct e1000_adapter *adapter)
+{
+	struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
+	struct e1000_hw *hw = &adapter->hw;
+	struct e1000_tx_ring *tx_ring;
+
+	if (len > E1000_MAX_DATA_PER_TXD)
+		return;
+
+	/* e1000 only support a single txq at the moment so the queue is being
+	 * shared with stack. To support this requires locking to ensure the
+	 * stack and XDP are not running at the same time. Devices with
+	 * multiple queues should allocate a separate queue space.
+	 */
+	HARD_TX_LOCK(netdev, txq, smp_processor_id());
+
+	tx_ring = adapter->tx_ring;
+
+	if (E1000_DESC_UNUSED(tx_ring) < 2) {
+		HARD_TX_UNLOCK(netdev, txq);
+		return;
+	}
+
+	if (netif_xmit_frozen_or_stopped(txq))
+		return;
+
+	e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
+	netdev_sent_queue(netdev, len);
+	e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
+
+	writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
+	mmiowb();
+
+	HARD_TX_UNLOCK(netdev, txq);
+}
+
 #define NUM_REGS 38 /* 1 based count */
 static void e1000_regdump(struct e1000_adapter *adapter)
 {
@@ -4139,6 +4250,19 @@ static struct sk_buff *e1000_alloc_rx_skb(struct e1000_adapter *adapter,
 	return skb;
 }
 
+static inline int e1000_call_bpf(struct bpf_prog *prog, void *data,
+				 unsigned int length)
+{
+	struct xdp_buff xdp;
+	int ret;
+
+	xdp.data = data;
+	xdp.data_end = data + length;
+	ret = BPF_PROG_RUN(prog, (void *)&xdp);
+
+	return ret;
+}
+
 /**
  * e1000_clean_jumbo_rx_irq - Send received data up the network stack; legacy
  * @adapter: board private structure
@@ -4157,12 +4281,15 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
 	struct pci_dev *pdev = adapter->pdev;
 	struct e1000_rx_desc *rx_desc, *next_rxd;
 	struct e1000_rx_buffer *buffer_info, *next_buffer;
+	struct bpf_prog *prog;
 	u32 length;
 	unsigned int i;
 	int cleaned_count = 0;
 	bool cleaned = false;
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
 
+	rcu_read_lock(); /* rcu lock needed here to protect xdp programs */
+	prog = READ_ONCE(adapter->prog);
 	i = rx_ring->next_to_clean;
 	rx_desc = E1000_RX_DESC(*rx_ring, i);
 	buffer_info = &rx_ring->buffer_info[i];
@@ -4188,12 +4315,54 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
 
 		cleaned = true;
 		cleaned_count++;
+		length = le16_to_cpu(rx_desc->length);
+
+		if (prog) {
+			struct page *p = buffer_info->rxbuf.page;
+			dma_addr_t dma = buffer_info->dma;
+			int act;
+
+			if (unlikely(!(status & E1000_RXD_STAT_EOP))) {
+				/* attached bpf disallows larger than page
+				 * packets, so this is hw error or corruption
+				 */
+				pr_info_once("%s buggy !eop\n", netdev->name);
+				break;
+			}
+			if (unlikely(rx_ring->rx_skb_top)) {
+				pr_info_once("%s ring resizing bug\n",
+					     netdev->name);
+				break;
+			}
+			dma_sync_single_for_cpu(&pdev->dev, dma,
+						length, DMA_FROM_DEVICE);
+			act = e1000_call_bpf(prog, page_address(p), length);
+			switch (act) {
+			case XDP_PASS:
+				break;
+			case XDP_TX:
+				dma_sync_single_for_device(&pdev->dev,
+							   dma,
+							   length,
+							   DMA_TO_DEVICE);
+				e1000_xmit_raw_frame(buffer_info, length,
+						     netdev, adapter);
+			case XDP_DROP:
+			default:
+				/* re-use mapped page. keep buffer_info->dma
+				 * as-is, so that e1000_alloc_jumbo_rx_buffers
+				 * only needs to put it back into rx ring
+				 */
+				total_rx_bytes += length;
+				total_rx_packets++;
+				goto next_desc;
+			}
+		}
+
 		dma_unmap_page(&pdev->dev, buffer_info->dma,
 			       adapter->rx_buffer_len, DMA_FROM_DEVICE);
 		buffer_info->dma = 0;
 
-		length = le16_to_cpu(rx_desc->length);
-
 		/* errors is only valid for DD + EOP descriptors */
 		if (unlikely((status & E1000_RXD_STAT_EOP) &&
 		    (rx_desc->errors & E1000_RXD_ERR_FRAME_ERR_MASK))) {
@@ -4327,6 +4496,7 @@ next_desc:
 		rx_desc = next_rxd;
 		buffer_info = next_buffer;
 	}
+	rcu_read_unlock();
 	rx_ring->next_to_clean = i;
 
 	cleaned_count = E1000_DESC_UNUSED(rx_ring);

^ permalink raw reply related

* Re: [PATCH v4 16/16] MAINTAINERS: Update for PVRDMA driver
From: Adit Ranadive @ 2016-09-12 22:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dledford@redhat.com, linux-rdma@vger.kernel.org, pv-drivers,
	netdev@vger.kernel.org, linux-pci@vger.kernel.org,
	Jorgen S. Hansen, Aditya Sarwade, George Zhang, Bryan Tan
In-Reply-To: <20160912175222.GG5843@obsidianresearch.com>

On Mon, Sep 12, 2016 at 10:52:26 -0700, Jason Gunthorpe wrote: 
> On Sun, Sep 11, 2016 at 09:49:26PM -0700, Adit Ranadive wrote:
> > Add maintainer info for the PVRDMA driver.
> 
> You can probably squash the last three patches.

Thanks for taking a look.
Since Doug mentioned that he would squash my commits anyways while applying I
decided to keep them separate. Let me know if you still want me to squash them.

> .. and fix the __u32 stuff throughout the entire driver please.

Will do.

Thanks,
Adit

^ permalink raw reply

* [net-next PATCH v3 1/3] e1000: track BQL bytes regardless of skb or not
From: John Fastabend @ 2016-09-12 22:13 UTC (permalink / raw)
  To: bblanco, john.fastabend, alexei.starovoitov, jeffrey.t.kirsher,
	brouer, davem
  Cc: xiyou.wangcong, intel-wired-lan, u9012063, netdev
In-Reply-To: <20160912220312.5610.77528.stgit@john-Precision-Tower-5810>

The BQL API does not reference the sk_buff nor does the driver need to
reference the sk_buff to calculate the length of a transmitted frame.
This patch removes an sk_buff reference from the xmit irq path and
also allows packets sent from XDP to use BQL.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/ethernet/intel/e1000/e1000_main.c |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index f42129d..62a7f8d 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -3882,11 +3882,8 @@ static bool e1000_clean_tx_irq(struct e1000_adapter *adapter,
 			if (cleaned) {
 				total_tx_packets += buffer_info->segs;
 				total_tx_bytes += buffer_info->bytecount;
-				if (buffer_info->skb) {
-					bytes_compl += buffer_info->skb->len;
-					pkts_compl++;
-				}
-
+				bytes_compl += buffer_info->length;
+				pkts_compl++;
 			}
 			e1000_unmap_and_free_tx_resource(adapter, buffer_info);
 			tx_desc->upper.data = 0;

^ permalink raw reply related

* [net-next PATCH v3 0/3] e1000 XDP implementation
From: John Fastabend @ 2016-09-12 22:13 UTC (permalink / raw)
  To: bblanco, john.fastabend, alexei.starovoitov, jeffrey.t.kirsher,
	brouer, davem
  Cc: xiyou.wangcong, intel-wired-lan, u9012063, netdev

This patch implements XDP on e1000 a few comments below:

The XDP_TX path does not violate BQL in this series so we have to increment
and decrement the bql counters correctly. When a TX queue can have both stack
traffic and XDP traffic I believe user configured BQL should be correct. If
users do not want BQL they can always disable it or raise the limits.


I left the xdp program attached to the adapter structure because in the e1000
case the program is global because we only run on a single queue. Pushing the
program into the rx_ring just creates a bunch of ring[0] references in the code
and adds little in my opinion. This is really just a style issue though. Notice
I did push the bundle logic onto the queue but I view the bundling and rx
ring buffers to be so closely linked it was worth it and only caused a couple
extra ring[0] lines.


XDP_TX will drop packets if any errors occur while attempting to push onto the
TX descriptor ring. This seems to me at least to be the most sensible thing to
do and keeps e1000 inline with existing mlx XDP drivers. This can always be
extended if some consensus is made later.

Thanks for all the comments in v{1|2}. As always any comments/feedback
appreciated.

Also initial patch was based on a patch I took from Alexei so I left his
signed-off there even though I mangled the code a fair amount since then and
did not give him any chance to review off-list.

---

Alexei Starovoitov (1):
      e1000: add initial XDP support

John Fastabend (2):
      e1000: track BQL bytes regardless of skb or not
      e1000: bundle xdp xmit routines


 drivers/net/ethernet/intel/e1000/e1000.h      |   12 +
 drivers/net/ethernet/intel/e1000/e1000_main.c |  227 ++++++++++++++++++++++++-
 2 files changed, 229 insertions(+), 10 deletions(-)

^ permalink raw reply

* Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
From: Jamal Hadi Salim @ 2016-09-12 22:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev, xiyou.wangcong, daniel, john.r.fastabend
In-Reply-To: <1473717514.18970.80.camel@edumazet-glaptop3.roam.corp.google.com>

On 16-09-12 05:58 PM, Eric Dumazet wrote:
> On Mon, 2016-09-12 at 16:46 -0400, Jamal Hadi Salim wrote:
>> From: Jamal Hadi Salim <jhs@mojatatu.com>
>
>> +
>> +	/* XXX: if you are going to edit more fields beyond ethernet header
>> +	 * (example when you add IP header replacement or vlan swap)
>> +	 * then MAX_EDIT_LEN needs to change appropriately
>> +	*/
>> +	err = skb_ensure_writable(skb, ETH_HLEN);
>> +	if (unlikely(err)) /* best policy is to drop on the floor */
>> +		action = TC_ACT_SHOT;
>> +
>> +	rcu_read_lock();
>> +	action = READ_ONCE(d->tcf_action);
>
> You are overwriting @action, while you might have put TC_ACT_SHOT in it
> 3 lines above.
>
> Maybe you meant :
> 	if (err)
> 		return TC_ACT_SHOT;
>

Thanks for catching that (leftover from when i used a lock).
Will resubmit.

cheers,
jamal

^ permalink raw reply

* Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
From: Eric Dumazet @ 2016-09-12 22:01 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: davem, netdev, xiyou.wangcong, daniel, john.r.fastabend
In-Reply-To: <1473713206-2450-1-git-send-email-jhs@emojatatu.com>

On Mon, 2016-09-12 at 16:46 -0400, Jamal Hadi Salim wrote:

> +
> +static int tcf_skbmod_dump(struct sk_buff *skb, struct tc_action *a,
> +			   int bind, int ref)
> +{
> +	struct tcf_skbmod *d = to_skbmod(a);
> +	unsigned char *b = skb_tail_pointer(skb);
> +	struct tcf_skbmod_params  *p = rtnl_dereference(d->skbmod_p);
> +	struct tc_skbmod opt = {
> +		.index   = d->tcf_index,
> +		.refcnt  = d->tcf_refcnt - ref,
> +		.bindcnt = d->tcf_bindcnt - bind,
> +		.action  = d->tcf_action,
> +	};
> +	struct tcf_t t;
> +
> +	rcu_read_lock();

You do not need rcu read lock protection here, RTNL is enough.

> +
> +	opt.flags  = p->flags;
> +	if (nla_put(skb, TCA_SKBMOD_PARMS, sizeof(opt), &opt))
> +		goto nla_put_failure;
> +	if ((p->flags & SKBMOD_F_DMAC) &&
> +	    nla_put(skb, TCA_SKBMOD_DMAC, ETH_ALEN, p->eth_dst))
> +		goto nla_put_failure;
> +	if ((p->flags & SKBMOD_F_SMAC) &&
> +	    nla_put(skb, TCA_SKBMOD_SMAC, ETH_ALEN, p->eth_src))
> +		goto nla_put_failure;
> +	if ((p->flags & SKBMOD_F_ETYPE) &&
> +	    nla_put_u16(skb, TCA_SKBMOD_ETYPE, ntohs(p->eth_type)))
> +		goto nla_put_failure;
> +
> +	tcf_tm_dump(&t, &d->tcf_tm);
> +	if (nla_put_64bit(skb, TCA_SKBMOD_TM, sizeof(t), &t, TCA_SKBMOD_PAD))
> +		goto nla_put_failure;
> +
> +	rcu_read_unlock();
> +
> +	return skb->len;
> +nla_put_failure:
> +	rcu_read_unlock();
> +	nlmsg_trim(skb, b);
> +	return -1;
> +}
> +

^ permalink raw reply

* Re: [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
From: Eric Dumazet @ 2016-09-12 21:58 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: davem, netdev, xiyou.wangcong, daniel, john.r.fastabend
In-Reply-To: <1473713206-2450-1-git-send-email-jhs@emojatatu.com>

On Mon, 2016-09-12 at 16:46 -0400, Jamal Hadi Salim wrote:
> From: Jamal Hadi Salim <jhs@mojatatu.com>

> +
> +	/* XXX: if you are going to edit more fields beyond ethernet header
> +	 * (example when you add IP header replacement or vlan swap)
> +	 * then MAX_EDIT_LEN needs to change appropriately
> +	*/
> +	err = skb_ensure_writable(skb, ETH_HLEN);
> +	if (unlikely(err)) /* best policy is to drop on the floor */
> +		action = TC_ACT_SHOT;
> +
> +	rcu_read_lock();
> +	action = READ_ONCE(d->tcf_action);

You are overwriting @action, while you might have put TC_ACT_SHOT in it
3 lines above.

Maybe you meant :
	if (err)
		return TC_ACT_SHOT;

^ permalink raw reply

* Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
From: Tom Herbert via iovisor-dev @ 2016-09-12 21:45 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Linux Netdev List, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet
In-Reply-To: <20160912121530.4b4f0ad7-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Mon, Sep 12, 2016 at 3:15 AM, Jesper Dangaard Brouer
<brouer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Fri, 9 Sep 2016 18:03:09 +0300
> Saeed Mahameed <saeedm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
>
>> On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-dev
>> <iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy@public.gmane.org> wrote:
>> > On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:
>> >>
>> >> I'm sorry but I have a problem with this patch!
>> >> Looking at this patch, I want to bring up a fundamental architectural
>> >> concern with the development direction of XDP transmit.
>> >>
>> >>
>> >> What you are trying to implement, with delaying the doorbell, is
>> >> basically TX bulking for TX_XDP.
>> >>
>> >>  Why not implement a TX bulking interface directly instead?!?
>> >>
>> >> Yes, the tailptr/doorbell is the most costly operation, but why not
>> >> also take advantage of the benefits of bulking for other parts of the
>> >> code? (benefit is smaller, by every cycles counts in this area)
>> >>
>> >> This hole XDP exercise is about avoiding having a transaction cost per
>> >> packet, that reads "bulking" or "bundling" of packets, where possible.
>> >>
>> >>  Lets do bundling/bulking from the start!
>>
>> Jesper, what we did here is also bulking, instead of bulking in a
>> temporary list in the driver we list the packets in the HW and once
>> done we transmit all at once via the xdp_doorbell indication.
>>
>> I agree with you that we can take advantage and improve the icache by
>> bulking first in software and then queue all at once in the hw then
>> ring one doorbell.
>>
>> but I also agree with Alexei that this will introduce an extra
>> pointer/list handling in the driver and we need to do the comparison
>> between both approaches before we decide which is better.
>
> I welcome implementing both approaches and benchmarking them against
> each-other, I'll gladly dedicate time for this!
>
Yes, please implement this so we can have something clear to evaluate
and compare. There is far to much spewing of "expert opinions"
happening here :-(

> I'm reacting so loudly, because this is a mental model switch, that
> need to be applied to the full drivers RX path. Also for normal stack
> delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated,
> there is between 10%-25% perf gain here.
>
> The key point is stop seeing the lowest driver RX as something that
> works on a per packet basis.  It might be easier to view this as a kind
> of vector processing.  This is about having a vector of packets, where
> we apply some action/operation.
>
> This is about using the CPU more efficiently, getting it to do more
> instructions per cycle.  The next level of optimization (for >= Sandy
> Bridge CPUs) is to make these vector processing stages small enough to fit
> into the CPUs decoded-I-cache section.
>
>
> It might also be important to mention, that for netstack delivery, I
> don't imagine bulking 64 packets.  Instead, I imagine doing 8-16
> packets.  Why, because the NIC-HW runs independently and have the
> opportunity to deliver more frames in the RX ring queue, while the
> stack "slow" processed packets.  You can view this as "bulking" from
> the RX ring queue, with a "look-back" before exiting the NAPI poll loop.
>
>
>> this must be marked as future work and not have this from the start.
>
> We both know that statement is BS, and the other approach will never be
> implemented once this patch is accepted upstream.
>
>
>> > mlx4 already does bulking and this proposed mlx5 set of patches
>> > does bulking as well.
>
> I'm reacting exactly because mlx4 is also doing "bulking" in the wrong
> way IMHO.  And now mlx5 is building on the same principle. That is why
> I'm yelling STOP.
>
>
>> >> The reason behind the xmit_more API is that we could not change the
>> >> API of all the drivers.  And we found that calling an explicit NDO
>> >> flush came at a cost (only approx 7 ns IIRC), but it still a cost that
>> >> would hit the common single packet use-case.
>> >>
>> >> It should be really easy to build a bundle of packets that need XDP_TX
>> >> action, especially given you only have a single destination "port".
>> >> And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.
>
>
> [1] http://lists.openwall.net/netdev/2016/04/19/89
> [2] http://lists.openwall.net/netdev/2016/01/15/51
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH] [RFC] proc connector: add namespace events
From: Evgeniy Polyakov @ 2016-09-12 21:39 UTC (permalink / raw)
  To: Alban Crequy, Alban Crequy
  Cc: Tejun Heo, Aditya Kali, Serge Hallyn, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Iago Lopez Galeiras
In-Reply-To: <1473349086-31260-1-git-send-email-alban@kinvolk.io>

Hi everyone

08.09.2016, 18:39, "Alban Crequy" <alban.crequy@gmail.com>:
> The act of a process creating or joining a namespace via clone(),
> unshare() or setns() is a useful signal for monitoring applications.

> + if (old_ns->mnt_ns != new_ns->mnt_ns)
> + proc_ns_connector(tsk, CLONE_NEWNS, PROC_NM_REASON_CLONE, old_mntns_inum, new_mntns_inum);
> +
> + if (old_ns->uts_ns != new_ns->uts_ns)
> + proc_ns_connector(tsk, CLONE_NEWUTS, PROC_NM_REASON_CLONE, old_ns->uts_ns->ns.inum, new_ns->uts_ns->ns.inum);
> +
> + if (old_ns->ipc_ns != new_ns->ipc_ns)
> + proc_ns_connector(tsk, CLONE_NEWIPC, PROC_NM_REASON_CLONE, old_ns->ipc_ns->ns.inum, new_ns->ipc_ns->ns.inum);
> +
> + if (old_ns->net_ns != new_ns->net_ns)
> + proc_ns_connector(tsk, CLONE_NEWNET, PROC_NM_REASON_CLONE, old_ns->net_ns->ns.inum, new_ns->net_ns->ns.inum);
> +
> + if (old_ns->cgroup_ns != new_ns->cgroup_ns)
> + proc_ns_connector(tsk, CLONE_NEWCGROUP, PROC_NM_REASON_CLONE, old_ns->cgroup_ns->ns.inum, new_ns->cgroup_ns->ns.inum);
> +
> + if (old_ns->pid_ns_for_children != new_ns->pid_ns_for_children)
> + proc_ns_connector(tsk, CLONE_NEWPID, PROC_NM_REASON_CLONE, old_ns->pid_ns_for_children->ns.inum, new_ns->pid_ns_for_children->ns.inum);
> + }
> +

Patch looks good to me from technical/connector point of view, but these even multiplication is a bit weird imho.

I'm not against it, but did you consider sending just 2 serialized ns structures via single message, and client
would check all ns bits himself?

^ permalink raw reply

* [PATCH net-next 0/2] Misc cls_bpf/act_bpf improvements
From: Daniel Borkmann @ 2016-09-12 21:38 UTC (permalink / raw)
  To: davem; +Cc: alexei.starovoitov, netdev, Daniel Borkmann

Two minor improvements to {cls,act}_bpf. For details please see
individual patches.

Thanks!

Daniel Borkmann (2):
  bpf: drop unnecessary test in cls_bpf_classify and tcf_bpf
  bpf: use skb_at_tc_ingress helper in tcf_bpf

 net/sched/act_bpf.c | 5 +----
 net/sched/cls_bpf.c | 3 ---
 2 files changed, 1 insertion(+), 7 deletions(-)

-- 
1.9.3

^ permalink raw reply

* [PATCH net-next 1/2] bpf: drop unnecessary test in cls_bpf_classify and tcf_bpf
From: Daniel Borkmann @ 2016-09-12 21:38 UTC (permalink / raw)
  To: davem; +Cc: alexei.starovoitov, netdev, Daniel Borkmann
In-Reply-To: <cover.1473715365.git.daniel@iogearbox.net>

The skb_mac_header_was_set() test in cls_bpf's and act_bpf's fast-path is
actually unnecessary and can be removed altogether. This was added by
commit a166151cbe33 ("bpf: fix bpf helpers to use skb->mac_header relative
offsets"), which was later on improved by 3431205e0397 ("bpf: make programs
see skb->data == L2 for ingress and egress"). We're always guaranteed to
have valid mac header at the time we invoke cls_bpf_classify() or tcf_bpf().

Reason is that since 6d1ccff62780 ("net: reset mac header in dev_start_xmit()")
we do skb_reset_mac_header() in __dev_queue_xmit() before we could call
into sch_handle_egress() or any subsequent enqueue. sch_handle_ingress()
always sees a valid mac header as well (things like skb_reset_mac_len()
would badly fail otherwise). Thus, drop the unnecessary test in classifier
and action case.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 net/sched/act_bpf.c | 3 ---
 net/sched/cls_bpf.c | 3 ---
 2 files changed, 6 deletions(-)

diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index bfa8707..78400de 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -44,9 +44,6 @@ static int tcf_bpf(struct sk_buff *skb, const struct tc_action *act,
 	int action, filter_res;
 	bool at_ingress = G_TC_AT(skb->tc_verd) & AT_INGRESS;
 
-	if (unlikely(!skb_mac_header_was_set(skb)))
-		return TC_ACT_UNSPEC;
-
 	tcf_lastuse_update(&prog->tcf_tm);
 	bstats_cpu_update(this_cpu_ptr(prog->common.cpu_bstats), skb);
 
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 4742f41..1d92d4d 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -83,9 +83,6 @@ static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 	struct cls_bpf_prog *prog;
 	int ret = -1;
 
-	if (unlikely(!skb_mac_header_was_set(skb)))
-		return -1;
-
 	/* Needed here for accessing maps. */
 	rcu_read_lock();
 	list_for_each_entry_rcu(prog, &head->plist, link) {
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next 2/2] bpf: use skb_at_tc_ingress helper in tcf_bpf
From: Daniel Borkmann @ 2016-09-12 21:38 UTC (permalink / raw)
  To: davem; +Cc: alexei.starovoitov, netdev, Daniel Borkmann
In-Reply-To: <cover.1473715365.git.daniel@iogearbox.net>

We have a small skb_at_tc_ingress() helper for testing for ingress, so
make use of it. cls_bpf already uses it and so should act_bpf.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 net/sched/act_bpf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 78400de..1d39600 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -39,10 +39,10 @@ static struct tc_action_ops act_bpf_ops;
 static int tcf_bpf(struct sk_buff *skb, const struct tc_action *act,
 		   struct tcf_result *res)
 {
+	bool at_ingress = skb_at_tc_ingress(skb);
 	struct tcf_bpf *prog = to_bpf(act);
 	struct bpf_prog *filter;
 	int action, filter_res;
-	bool at_ingress = G_TC_AT(skb->tc_verd) & AT_INGRESS;
 
 	tcf_lastuse_update(&prog->tcf_tm);
 	bstats_cpu_update(this_cpu_ptr(prog->common.cpu_bstats), skb);
-- 
1.9.3

^ permalink raw reply related

* Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems
From: Martin Blumenstingl @ 2016-09-12 21:29 UTC (permalink / raw)
  To: Matt Corallo; +Cc: netdev, linux-amlogic, Giuseppe Cavallaro, Alexandre Torgue
In-Reply-To: <216F2694-1C1D-44DA-AC15-57ED15C24BBE@bluematt.me>

Hi Matt,

On Sun, Sep 11, 2016 at 10:57 PM, Matt Corallo <linux@bluematt.me> wrote:
> The general advice is to force it into 100M mode then it seems to work fine.
> There is some issue with the driver seemingly not fully supporting 1G (at
> least on Odroid C2) which needs to be worked out.
thanks for the hint - by "forcing 100M mode" you mean "ethtool –s eth0
speed 100 duplex full"?
Do you know more about this issue (or do you just know the
workaround)? It'd be interesting to know which "driver" does not
support Gbit speeds (as there are three drivers involved: realtek PHY
driver, stmmac and Meson stmmac glue driver


Regards,
Martin

^ permalink raw reply

* Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems
From: Martin Blumenstingl @ 2016-09-12 21:26 UTC (permalink / raw)
  To: Alexandre Torgue; +Cc: netdev, linux-amlogic, Giuseppe Cavallaro, Johnson Leung
In-Reply-To: <ff6f3501-fbd5-3e79-dcc1-549406a5144b@st.com>

Hi Alexandre,

On Mon, Sep 12, 2016 at 6:37 PM, Alexandre Torgue
<alexandre.torgue@st.com> wrote:
> Which Synopsys IP version do you use ?
found this in a dmesg log:
[    1.504784] stmmac - user ID: 0x11, Synopsys ID: 0x37
[    1.509785]  Ring mode enabled
[    1.512796]  DMA HW capability register supported
[    1.517286]  Normal descriptors
[    1.520565]  RX Checksum Offload Engine supported
[    1.525219]  COE Type 2
[    1.527638]  TX Checksum insertion supported
[    1.531862]  Wake-Up On Lan supported
[    1.535483]  Enable RX Mitigation via HW Watchdog Timer
[    1.543851] libphy: stmmac: probed
[    1.544025] eth0: PHY ID 001cc916 at 0 IRQ POLL (stmmac-0:00) active
[    1.550321] eth0: PHY ID 001cc916 at 7 IRQ POLL (stmmac-0:07)

>> Gbit ethernet on my device is provided by a Realtek RTL8211F RGMII PHY.
>> Similar issues were reported in #linux-amlogic by a user with an
>> Odroid C2 board (= similar hardware).
>>
>> The symptoms are:
>> Receiving data is plenty fast (I can max out my internet connection
>> easily, and with iperf3 I get ~900Mbit/s).
>> Transmitting data from the device is unfortunately very slow, traffic
>> sometimes even stalls completely.
>>
>> I have attached the iperf results and the output of
>> /sys/kernel/debug/stmmaceth/eth0/descriptors_status.
>> Below you can find the ifconfig, netstat and stmmac dma_cap info
>> (*after* I ran all tests).
>>
>> The "involved parties" are:
>> - Meson GXBB specific network configuration registers (I have have
>> double-checked them with the reference drivers: everything seems fine
>> here)
>> - stmmac: it seems that nobody else has reported these kind of issues
>> so far, however I'd still like to hear where I should enable some
>> debugging bits to rule out any stmmac bug
>
>
> On my side, I just tested on the same "kind" of system:
> -SYNOPSYS GMAC 3.7
> -RTL8211EG as PHY
>
> With I perf, I reach:
>         -RX: 932 Mbps
>         -TX: 820Mbps
>
> Can you check ethtool -S eth0 (most precisely "MMC"counter and errors) ?
> Which kernel version do you use ?
I am using a 4.8.0-rc4 kernel, based on Kevin's "integration" branch: [0]
Unfortunately I don't have access to my device in the next few days,
but I'll keep you updated once I have the ethtool output.


Thanks for your time
Regards,
Martin


[0] https://git.kernel.org/cgit/linux/kernel/git/khilman/linux-amlogic.git/log/?h=v4.8/integ

^ permalink raw reply

* Re: [PATCH 00/26] constify local structures
From: Julia Lawall @ 2016-09-12 21:11 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Felipe Balbi, Julia Lawall, linux-renesas-soc, joe,
	kernel-janitors, Sergei Shtylyov, linux-pm, platform-driver-x86,
	linux-media, linux-can, Tatyana Nikolova, Shiraz Saleem,
	Mustafa Ismail, Chien Tin Tung, linux-rdma, netdev, devel,
	alsa-devel, linux-kernel, linux-fbdev, linux-wireless,
	Jason Gunthorpe, t
In-Reply-To: <20160912201450.GA8889@intel.com>



On Mon, 12 Sep 2016, Jarkko Sakkinen wrote:

> On Mon, Sep 12, 2016 at 04:43:58PM +0300, Felipe Balbi wrote:
> >
> > Hi,
> >
> > Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> writes:
> > > On Mon, Sep 12, 2016 at 10:54:07AM +0200, Julia Lawall wrote:
> > >>
> > >>
> > >> On Sun, 11 Sep 2016, Jarkko Sakkinen wrote:
> > >>
> > >> > On Sun, Sep 11, 2016 at 03:05:42PM +0200, Julia Lawall wrote:
> > >> > > Constify local structures.
> > >> > >
> > >> > > The semantic patch that makes this change is as follows:
> > >> > > (http://coccinelle.lip6.fr/)
> > >> >
> > >> > Just my two cents but:
> > >> >
> > >> > 1. You *can* use a static analysis too to find bugs or other issues.
> > >> > 2. However, you should manually do the commits and proper commit
> > >> >    messages to subsystems based on your findings. And I generally think
> > >> >    that if one contributes code one should also at least smoke test changes
> > >> >    somehow.
> > >> >
> > >> > I don't know if I'm alone with my opinion. I just think that one should
> > >> > also do the analysis part and not blindly create and submit patches.
> > >>
> > >> All of the patches are compile tested.  And the individual patches are
> > >
> > > Compile-testing is not testing. If you are not able to test a commit,
> > > you should explain why.
> >
> > Dude, Julia has been doing semantic patching for years already and
> > nobody has raised any concerns so far. There's already an expectation
> > that Coccinelle *works* and Julia's sematic patches are sound.
> >
> > Besides, adding 'const' is something that causes virtually no functional
> > changes to the point that build-testing is really all you need. Any
> > problems caused by adding 'const' to a definition will be seen by build
> > errors or warnings.
> >
> > Really, just stop with the pointless discussion and go read a bit about
> > Coccinelle and what semantic patches are giving you. The work done by
> > Julia and her peers are INRIA have measurable benefits.
> >
> > You're really making a thunderstorm in a glass of water.
>
> Hmm... I've been using coccinelle in cyclic basis for some time now.
> My comment was oversized but I didn't mean it to be impolite or attack
> of any kind for that matter.

No problem :)  Thanks for the feedback.

julia

^ permalink raw reply

* Re: [PATCH v2 net 1/1] net sched actions: fix GETing actions
From: Jamal Hadi Salim @ 2016-09-12 20:51 UTC (permalink / raw)
  To: Cong Wang; +Cc: David Miller, Linux Kernel Network Developers
In-Reply-To: <CAM_iQpX3j_ejd1MjnaYkq0OL8B_gsN+SEoFk-8Srp+nWxKHzQg@mail.gmail.com>

On 16-09-12 01:18 PM, Cong Wang wrote:
> On Mon, Sep 12, 2016 at 3:21 AM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:


>> +               if (ovr)
>> +                       a->tcfa_refcnt-=1;
>
> How about tcfa_bindcnt?

We dont touch it when mucking around only with actions (as is in
this case).


>> +       }
>> +}
>> +
>>  int tcf_action_init(struct net *net, struct nlattr *nla,
>>                                   struct nlattr *est, char *name, int ovr,
>>                                   int bind, struct list_head *actions)
>> @@ -612,8 +622,15 @@ int tcf_action_init(struct net *net, struct nlattr *nla,
>>                         goto err;
>>                 }
>>                 act->order = i;
>> +               if (ovr)
>
> Need to check this boolean? It looks like we need this for !ovr case too?
>

They are fine if they didnt exist.

I will fix the coding style issues and submit.

cheers,
jamal

^ permalink raw reply

* Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-09-12 20:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tom Herbert, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20160912195626.GA18146-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>

On Mon, 12 Sep 2016 12:56:28 -0700
Alexei Starovoitov <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> On Mon, Sep 12, 2016 at 01:30:25PM +0200, Jesper Dangaard Brouer wrote:
> > On Thu, 8 Sep 2016 23:30:50 -0700
> > Alexei Starovoitov <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >   
> > > On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:  
> > [...]  
> > > > Imagine you have packets intermixed towards the stack and XDP_TX. 
> > > > Every time you call the stack code, then you flush your icache.  When
> > > > returning to the driver code, you will have to reload all the icache
> > > > associated with the XDP_TX, this is a costly operation.    
> > >   
> > [...]  
> > > To make further progress in this discussion can we talk about
> > > the use case you have in mind instead? Then solution will
> > > be much clear, I hope.  
> > 
> > The DDoS use-case _is_ affected by this "hidden" bulking design.
> > 
> > Lets say, I want to implement a DDoS facility. Instead of just
> > dropping the malicious packets, I want to see the bad packets.  I
> > implement this by rewriting the destination-MAC to be my monitor
> > machine and then XDP_TX the packet.  
> 
> not following the use case. you want to implement a DDoS generator?
> Or just forward all bad packets from affected host to another host
> in the same rack? so two servers will be spammed with traffic and
> even more load on the tor? I really don't see how this is useful
> for anything but stress testing.

As I wrote below, the purpose of the monitor machine is to diagnose
false positives.  If you worry about the added load I would either,
forward out another interface (which is not supported yet) or simply do
sampling of packets being forwarded to the monitor host.

> > In the DDoS use-case, you have loaded your XDP/eBPF program, and 100%
> > of the traffic is delivered to the stack. (See note 1)  
> 
> hmm. DoS prevention use case is when 99% of the traffic is dropped.

As I wrote below, until the DDoS attack starts, all packets are
delivered to the stack.

> > Once the DDoS attack starts, then the traffic pattern changes, and XDP
> > should (hopefully only) catch the malicious traffic (monitor machine can
> > help diagnose false positive).  Now, due to interleaving the DDoS
> > traffic with the clean traffic, then efficiency of XDP_TX is reduced due to
> > more icache misses...
> > 
> > 
> > 
> > Note(1): Notice I have already demonstrated that loading a XDP/eBPF
> > program with 100% delivery to the stack, actually slows down the
> > normal stack.  This is due to hitting a bottleneck in the page
> > allocator.  I'm working removing that bottleneck with page_pool, and
> > that solution is orthogonal to this problem.  
> 
> sure. no one arguing against improving page allocator.
> 
> >  It is actually an excellent argument, for why you would want to run a
> > DDoS XDP filter only on a restricted number of RX queues.  
> 
> no. it's the opposite. If the host is under DoS there is no way
> the host can tell in advance which rx queue will be seeing bad
> packets.

Sorry, this note was not related to the DoS use-case.  You
misunderstood it.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [PATCH v4 net-next 1/1] net_sched: Introduce skbmod action
From: Jamal Hadi Salim @ 2016-09-12 20:46 UTC (permalink / raw)
  To: davem
  Cc: netdev, xiyou.wangcong, daniel, eric.dumazet, john.r.fastabend,
	Jamal Hadi Salim

From: Jamal Hadi Salim <jhs@mojatatu.com>

This action is intended to be an upgrade from a usability perspective
from pedit (as well as operational debugability).
Compare this:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action pedit munge offset -14 u8 set 0x02 \
munge offset -13 u8 set 0x15 \
munge offset -12 u8 set 0x15 \
munge offset -11 u8 set 0x15 \
munge offset -10 u16 set 0x1515 \
pipe

To: 

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbmod dmac 02:15:15:15:15:15

Also try to do a MAC address swap with pedit or worse
try to debug a policy with destination mac, source mac and
etherype. Then make few rules out of those and you'll get my point.

In the future common use cases on pedit can be migrated to this action
(as an example different fields in ip v4/6, transports like tcp/udp/sctp
etc). For this first cut, this allows modifying basic ethernet header.

The most important ethernet use case at the moment is when redirecting or
mirroring packets to a remote machine. The dst mac address needs a re-write
so that it doesn't get dropped or confuse an interconnecting (learning) switch
or dropped by a target machine (which looks at the dst mac). And at times
when flipping back the packet a swap of the MAC addresses is needed.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 include/net/tc_act/tc_skbmod.h        |  30 ++++
 include/uapi/linux/tc_act/tc_skbmod.h |  39 +++++
 net/sched/Kconfig                     |  11 ++
 net/sched/Makefile                    |   1 +
 net/sched/act_skbmod.c                | 293 ++++++++++++++++++++++++++++++++++
 5 files changed, 374 insertions(+)
 create mode 100644 include/net/tc_act/tc_skbmod.h
 create mode 100644 include/uapi/linux/tc_act/tc_skbmod.h
 create mode 100644 net/sched/act_skbmod.c

diff --git a/include/net/tc_act/tc_skbmod.h b/include/net/tc_act/tc_skbmod.h
new file mode 100644
index 0000000..644a211
--- /dev/null
+++ b/include/net/tc_act/tc_skbmod.h
@@ -0,0 +1,30 @@
+/*
+ * Copyright (c) 2016, Jamal Hadi Salim
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+*/
+
+#ifndef __NET_TC_SKBMOD_H
+#define __NET_TC_SKBMOD_H
+
+#include <net/act_api.h>
+#include <linux/tc_act/tc_skbmod.h>
+
+struct tcf_skbmod_params {
+	struct rcu_head	rcu;
+	u64	flags; /*up to 64 types of operations; extend if needed */
+	u8	eth_dst[ETH_ALEN];
+	u16	eth_type;
+	u8	eth_src[ETH_ALEN];
+};
+
+struct tcf_skbmod {
+	struct tc_action	common;
+	struct tcf_skbmod_params __rcu *skbmod_p;
+};
+#define to_skbmod(a) ((struct tcf_skbmod *)a)
+
+#endif /* __NET_TC_SKBMOD_H */
diff --git a/include/uapi/linux/tc_act/tc_skbmod.h b/include/uapi/linux/tc_act/tc_skbmod.h
new file mode 100644
index 0000000..10fc07d
--- /dev/null
+++ b/include/uapi/linux/tc_act/tc_skbmod.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright (c) 2016, Jamal Hadi Salim
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+*/
+
+#ifndef __LINUX_TC_SKBMOD_H
+#define __LINUX_TC_SKBMOD_H
+
+#include <linux/pkt_cls.h>
+
+#define TCA_ACT_SKBMOD 15
+
+#define SKBMOD_F_DMAC	0x1
+#define SKBMOD_F_SMAC	0x2
+#define SKBMOD_F_ETYPE	0x4
+#define SKBMOD_F_SWAPMAC 0x8
+
+struct tc_skbmod {
+	tc_gen;
+	__u64 flags;
+};
+
+enum {
+	TCA_SKBMOD_UNSPEC,
+	TCA_SKBMOD_TM,
+	TCA_SKBMOD_PARMS,
+	TCA_SKBMOD_DMAC,
+	TCA_SKBMOD_SMAC,
+	TCA_SKBMOD_ETYPE,
+	TCA_SKBMOD_PAD,
+	__TCA_SKBMOD_MAX
+};
+#define TCA_SKBMOD_MAX (__TCA_SKBMOD_MAX - 1)
+
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 72e3426..7795d5a 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -749,6 +749,17 @@ config NET_ACT_CONNMARK
 	  To compile this code as a module, choose M here: the
 	  module will be called act_connmark.
 
+config NET_ACT_SKBMOD
+        tristate "skb data modification action"
+        depends on NET_CLS_ACT
+        ---help---
+         Say Y here to allow modification of skb data
+
+         If unsure, say N.
+
+         To compile this code as a module, choose M here: the
+         module will be called act_skbmod.
+
 config NET_ACT_IFE
         tristate "Inter-FE action based on IETF ForCES InterFE LFB"
         depends on NET_CLS_ACT
diff --git a/net/sched/Makefile b/net/sched/Makefile
index b9d046b..148ae0d 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -19,6 +19,7 @@ obj-$(CONFIG_NET_ACT_CSUM)	+= act_csum.o
 obj-$(CONFIG_NET_ACT_VLAN)	+= act_vlan.o
 obj-$(CONFIG_NET_ACT_BPF)	+= act_bpf.o
 obj-$(CONFIG_NET_ACT_CONNMARK)	+= act_connmark.o
+obj-$(CONFIG_NET_ACT_SKBMOD)	+= act_skbmod.o
 obj-$(CONFIG_NET_ACT_IFE)	+= act_ife.o
 obj-$(CONFIG_NET_IFE_SKBMARK)	+= act_meta_mark.o
 obj-$(CONFIG_NET_IFE_SKBPRIO)	+= act_meta_skbprio.o
diff --git a/net/sched/act_skbmod.c b/net/sched/act_skbmod.c
new file mode 100644
index 0000000..40a2dfd
--- /dev/null
+++ b/net/sched/act_skbmod.c
@@ -0,0 +1,293 @@
+/*
+ * net/sched/act_skbmod.c  skb data modifier
+ *
+ * Copyright (c) 2016 Jamal Hadi Salim <jhs@mojatatu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+*/
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/skbuff.h>
+#include <linux/rtnetlink.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+
+#include <linux/tc_act/tc_skbmod.h>
+#include <net/tc_act/tc_skbmod.h>
+
+#define SKBMOD_TAB_MASK     15
+
+static int skbmod_net_id;
+static struct tc_action_ops act_skbmod_ops;
+
+#define MAX_EDIT_LEN ETH_HLEN
+static int tcf_skbmod_run(struct sk_buff *skb, const struct tc_action *a,
+			  struct tcf_result *res)
+{
+	struct tcf_skbmod *d = to_skbmod(a);
+	int action;
+	struct tcf_skbmod_params *p;
+	u64 flags;
+	int err;
+
+	tcf_lastuse_update(&d->tcf_tm);
+	bstats_cpu_update(this_cpu_ptr(d->common.cpu_bstats), skb);
+
+	/* XXX: if you are going to edit more fields beyond ethernet header
+	 * (example when you add IP header replacement or vlan swap)
+	 * then MAX_EDIT_LEN needs to change appropriately
+	*/
+	err = skb_ensure_writable(skb, ETH_HLEN);
+	if (unlikely(err)) /* best policy is to drop on the floor */
+		action = TC_ACT_SHOT;
+
+	rcu_read_lock();
+	action = READ_ONCE(d->tcf_action);
+	if (unlikely(action == TC_ACT_SHOT)) {
+		qstats_overlimit_inc(this_cpu_ptr(d->common.cpu_qstats));
+		rcu_read_unlock();
+		return action;
+	}
+
+	p = rcu_dereference(d->skbmod_p);
+	flags = p->flags;
+	if (flags & SKBMOD_F_DMAC)
+		ether_addr_copy(eth_hdr(skb)->h_dest, p->eth_dst);
+	if (flags & SKBMOD_F_SMAC)
+		ether_addr_copy(eth_hdr(skb)->h_source, p->eth_src);
+	if (flags & SKBMOD_F_ETYPE)
+		eth_hdr(skb)->h_proto = p->eth_type;
+	rcu_read_unlock();
+
+	if (flags & SKBMOD_F_SWAPMAC) {
+		u16 tmpaddr[ETH_ALEN / 2]; /* ether_addr_copy() requirement */
+		/*XXX: I am sure we can come up with more efficient swapping*/
+		ether_addr_copy((u8 *)tmpaddr, eth_hdr(skb)->h_dest);
+		ether_addr_copy(eth_hdr(skb)->h_dest, eth_hdr(skb)->h_source);
+		ether_addr_copy(eth_hdr(skb)->h_source, (u8 *)tmpaddr);
+	}
+
+	return action;
+}
+
+static const struct nla_policy skbmod_policy[TCA_SKBMOD_MAX + 1] = {
+	[TCA_SKBMOD_PARMS]		= { .len = sizeof(struct tc_skbmod) },
+	[TCA_SKBMOD_DMAC]		= { .len = ETH_ALEN },
+	[TCA_SKBMOD_SMAC]		= { .len = ETH_ALEN },
+	[TCA_SKBMOD_ETYPE]		= { .type = NLA_U16 },
+};
+
+static int tcf_skbmod_init(struct net *net, struct nlattr *nla,
+			   struct nlattr *est, struct tc_action **a,
+			   int ovr, int bind)
+{
+	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
+	struct nlattr *tb[TCA_SKBMOD_MAX + 1];
+	struct tcf_skbmod_params *p, *p_old;
+	struct tc_skbmod *parm;
+	struct tcf_skbmod *d;
+	bool exists = false;
+	u8 *daddr = NULL;
+	u8 *saddr = NULL;
+	u16 eth_type = 0;
+	u32 lflags = 0;
+	int ret = 0, err;
+
+	if (!nla)
+		return -EINVAL;
+
+	err = nla_parse_nested(tb, TCA_SKBMOD_MAX, nla, skbmod_policy);
+	if (err < 0)
+		return err;
+
+	if (!tb[TCA_SKBMOD_PARMS])
+		return -EINVAL;
+
+	if (tb[TCA_SKBMOD_DMAC]) {
+		daddr = nla_data(tb[TCA_SKBMOD_DMAC]);
+		lflags |= SKBMOD_F_DMAC;
+	}
+
+	if (tb[TCA_SKBMOD_SMAC]) {
+		saddr = nla_data(tb[TCA_SKBMOD_SMAC]);
+		lflags |= SKBMOD_F_SMAC;
+	}
+
+	if (tb[TCA_SKBMOD_ETYPE]) {
+		eth_type = nla_get_u16(tb[TCA_SKBMOD_ETYPE]);
+		lflags |= SKBMOD_F_ETYPE;
+	}
+
+	parm = nla_data(tb[TCA_SKBMOD_PARMS]);
+	if (parm->flags & SKBMOD_F_SWAPMAC)
+		lflags = SKBMOD_F_SWAPMAC;
+
+	exists = tcf_hash_check(tn, parm->index, a, bind);
+	if (exists && bind)
+		return 0;
+
+	if (!lflags)
+		return -EINVAL;
+
+	if (!exists) {
+		ret = tcf_hash_create(tn, parm->index, est, a,
+				      &act_skbmod_ops, bind, true);
+		if (ret)
+			return ret;
+
+		ret = ACT_P_CREATED;
+	} else {
+		tcf_hash_release(*a, bind);
+		if (!ovr)
+			return -EEXIST;
+	}
+
+	d = to_skbmod(*a);
+
+	ASSERT_RTNL();
+	p = kzalloc(sizeof(struct tcf_skbmod_params), GFP_KERNEL);
+	if (unlikely(!p)) {
+		if (ovr)
+			tcf_hash_release(*a, bind);
+		return -ENOMEM;
+	}
+
+	p->flags = lflags;
+	d->tcf_action = parm->action;
+
+	p_old = rtnl_dereference(d->skbmod_p);
+
+	if (ovr)
+		spin_lock_bh(&d->tcf_lock);
+
+	if (lflags & SKBMOD_F_DMAC)
+		ether_addr_copy(p->eth_dst, daddr);
+	if (lflags & SKBMOD_F_SMAC)
+		ether_addr_copy(p->eth_src, saddr);
+	if (lflags & SKBMOD_F_ETYPE)
+		p->eth_type = htons(eth_type);
+
+	rcu_assign_pointer(d->skbmod_p, p);
+	if (ovr)
+		spin_unlock_bh(&d->tcf_lock);
+
+	if (p_old)
+		kfree_rcu(p_old, rcu);
+
+	if (ret == ACT_P_CREATED)
+		tcf_hash_insert(tn, *a);
+	return ret;
+}
+
+static int tcf_skbmod_dump(struct sk_buff *skb, struct tc_action *a,
+			   int bind, int ref)
+{
+	struct tcf_skbmod *d = to_skbmod(a);
+	unsigned char *b = skb_tail_pointer(skb);
+	struct tcf_skbmod_params  *p = rtnl_dereference(d->skbmod_p);
+	struct tc_skbmod opt = {
+		.index   = d->tcf_index,
+		.refcnt  = d->tcf_refcnt - ref,
+		.bindcnt = d->tcf_bindcnt - bind,
+		.action  = d->tcf_action,
+	};
+	struct tcf_t t;
+
+	rcu_read_lock();
+
+	opt.flags  = p->flags;
+	if (nla_put(skb, TCA_SKBMOD_PARMS, sizeof(opt), &opt))
+		goto nla_put_failure;
+	if ((p->flags & SKBMOD_F_DMAC) &&
+	    nla_put(skb, TCA_SKBMOD_DMAC, ETH_ALEN, p->eth_dst))
+		goto nla_put_failure;
+	if ((p->flags & SKBMOD_F_SMAC) &&
+	    nla_put(skb, TCA_SKBMOD_SMAC, ETH_ALEN, p->eth_src))
+		goto nla_put_failure;
+	if ((p->flags & SKBMOD_F_ETYPE) &&
+	    nla_put_u16(skb, TCA_SKBMOD_ETYPE, ntohs(p->eth_type)))
+		goto nla_put_failure;
+
+	tcf_tm_dump(&t, &d->tcf_tm);
+	if (nla_put_64bit(skb, TCA_SKBMOD_TM, sizeof(t), &t, TCA_SKBMOD_PAD))
+		goto nla_put_failure;
+
+	rcu_read_unlock();
+
+	return skb->len;
+nla_put_failure:
+	rcu_read_unlock();
+	nlmsg_trim(skb, b);
+	return -1;
+}
+
+static int tcf_skbmod_walker(struct net *net, struct sk_buff *skb,
+			     struct netlink_callback *cb, int type,
+			     const struct tc_action_ops *ops)
+{
+	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
+
+	return tcf_generic_walker(tn, skb, cb, type, ops);
+}
+
+static int tcf_skbmod_search(struct net *net, struct tc_action **a, u32 index)
+{
+	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
+
+	return tcf_hash_search(tn, a, index);
+}
+
+static struct tc_action_ops act_skbmod_ops = {
+	.kind		=	"skbmod",
+	.type		=	TCA_ACT_SKBMOD,
+	.owner		=	THIS_MODULE,
+	.act		=	tcf_skbmod_run,
+	.dump		=	tcf_skbmod_dump,
+	.init		=	tcf_skbmod_init,
+	.walk		=	tcf_skbmod_walker,
+	.lookup		=	tcf_skbmod_search,
+	.size		=	sizeof(struct tcf_skbmod),
+};
+
+static __net_init int skbmod_init_net(struct net *net)
+{
+	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
+
+	return tc_action_net_init(tn, &act_skbmod_ops, SKBMOD_TAB_MASK);
+}
+
+static void __net_exit skbmod_exit_net(struct net *net)
+{
+	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
+
+	tc_action_net_exit(tn);
+}
+
+static struct pernet_operations skbmod_net_ops = {
+	.init = skbmod_init_net,
+	.exit = skbmod_exit_net,
+	.id   = &skbmod_net_id,
+	.size = sizeof(struct tc_action_net),
+};
+
+MODULE_AUTHOR("Jamal Hadi Salim, <jhs@mojatatu.com>");
+MODULE_DESCRIPTION("SKB data mod-ing");
+MODULE_LICENSE("GPL");
+
+static int __init skbmod_init_module(void)
+{
+	return tcf_register_action(&act_skbmod_ops, &skbmod_net_ops);
+}
+
+static void __exit skbmod_cleanup_module(void)
+{
+	tcf_unregister_action(&act_skbmod_ops, &skbmod_net_ops);
+}
+
+module_init(skbmod_init_module);
+module_exit(skbmod_cleanup_module);
-- 
1.9.1

^ permalink raw reply related

* Re: [RFC PATCH 9/9] ethernet: sun8i-emac: add pm_runtime support
From: Maxime Ripard @ 2016-09-12 20:44 UTC (permalink / raw)
  To: Corentin Labbe
  Cc: robh+dt, mark.rutland, wens, linux, davem, netdev, devicetree,
	linux-arm-kernel, linux-kernel
In-Reply-To: <1473425117-18645-10-git-send-email-clabbe.montjoie@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4675 bytes --]

Hi,

On Fri, Sep 09, 2016 at 02:45:17PM +0200, Corentin Labbe wrote:
> This patch add pm_runtime support to sun8i-emac.
> For the moment, only basic support is added, (the device is marked as
> used when net/open)
> 
> Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
> ---
>  drivers/net/ethernet/allwinner/sun8i-emac.c | 62 ++++++++++++++++++++++++++++-
>  1 file changed, 60 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/allwinner/sun8i-emac.c b/drivers/net/ethernet/allwinner/sun8i-emac.c
> index 1c4bc80..cce886e 100644
> --- a/drivers/net/ethernet/allwinner/sun8i-emac.c
> +++ b/drivers/net/ethernet/allwinner/sun8i-emac.c
> @@ -9,7 +9,6 @@
>   * - MAC filtering
>   * - Jumbo frame
>   * - features rx-all (NETIF_F_RXALL_BIT)
> - * - PM runtime
>   */
>  #include <linux/bitops.h>
>  #include <linux/clk.h>
> @@ -27,6 +26,7 @@
>  #include <linux/pinctrl/consumer.h>
>  #include <linux/pinctrl/pinctrl.h>
>  #include <linux/platform_device.h>
> +#include <linux/pm_runtime.h>
>  #include <linux/reset.h>
>  #include <linux/scatterlist.h>
>  #include <linux/skbuff.h>
> @@ -1301,11 +1301,18 @@ static int sun8i_emac_open(struct net_device *ndev)
>  	int err;
>  	u32 v;
>  
> +	err = pm_runtime_get_sync(priv->dev);
> +	if (err) {
> +		pm_runtime_put_noidle(priv->dev);
> +		dev_err(priv->dev, "pm_runtime error: %d\n", err);
> +		return err;
> +	}
> +
>  	err = request_irq(priv->irq, sun8i_emac_dma_interrupt, 0,
>  			  dev_name(priv->dev), ndev);
>  	if (err) {
>  		dev_err(priv->dev, "Cannot request IRQ: %d\n", err);
> -		return err;
> +		goto err_runtime;
>  	}
>  
>  	/* Set interface mode (and configure internal PHY on H3) */
> @@ -1395,6 +1402,8 @@ err_syscon:
>  	sun8i_emac_unset_syscon(ndev);
>  err_irq:
>  	free_irq(priv->irq, ndev);
> +err_runtime:
> +	pm_runtime_put(priv->dev);
>  	return err;
>  }
>  
> @@ -1483,6 +1492,8 @@ static int sun8i_emac_stop(struct net_device *ndev)
>  	dma_free_coherent(priv->dev, priv->nbdesc_tx * sizeof(struct dma_desc),
>  			  priv->dd_tx, priv->dd_tx_phy);
>  
> +	pm_runtime_put(priv->dev);
> +
>  	return 0;
>  }
>  
> @@ -2210,6 +2221,8 @@ static int sun8i_emac_probe(struct platform_device *pdev)
>  		goto probe_err;
>  	}
>  
> +	pm_runtime_enable(priv->dev);
> +
>  	return 0;
>  
>  probe_err:
> @@ -2221,6 +2234,8 @@ static int sun8i_emac_remove(struct platform_device *pdev)
>  {
>  	struct net_device *ndev = platform_get_drvdata(pdev);
>  
> +	pm_runtime_disable(&pdev->dev);
> +
>  	unregister_netdev(ndev);
>  	platform_set_drvdata(pdev, NULL);
>  	free_netdev(ndev);
> @@ -2228,6 +2243,47 @@ static int sun8i_emac_remove(struct platform_device *pdev)
>  	return 0;
>  }
>  
> +static int __maybe_unused sun8i_emac_suspend(struct platform_device *pdev, pm_message_t state)
> +{
> +	struct net_device *ndev = platform_get_drvdata(pdev);
> +	struct sun8i_emac_priv *priv = netdev_priv(ndev);
> +
> +	napi_disable(&priv->napi);
> +
> +	if (netif_running(ndev))
> +		netif_device_detach(ndev);
> +
> +	sun8i_emac_stop_tx(ndev);
> +	sun8i_emac_stop_rx(ndev);
> +
> +	sun8i_emac_rx_clean(ndev);
> +	sun8i_emac_tx_clean(ndev);
> +
> +	phy_stop(ndev->phydev);
> +
> +	return 0;
> +}
> +
> +static int __maybe_unused sun8i_emac_resume(struct platform_device *pdev)
> +{
> +	struct net_device *ndev = platform_get_drvdata(pdev);
> +	struct sun8i_emac_priv *priv = netdev_priv(ndev);
> +
> +	phy_start(ndev->phydev);
> +
> +	sun8i_emac_start_tx(ndev);
> +	sun8i_emac_start_rx(ndev);
> +
> +	if (netif_running(ndev))
> +		netif_device_attach(ndev);
> +
> +	netif_start_queue(ndev);
> +
> +	napi_enable(&priv->napi);
> +
> +	return 0;
> +}

The main idea behind the runtime PM hooks is that they bring the
device to a working state and shuts it down when it's not needed
anymore.

However, they shouldn't be called when the device is still in used, so
all the mangling with NAPI, the phy and so on is irrelevant here, but
the clocks, resets, for example, are.

>  static const struct of_device_id sun8i_emac_of_match_table[] = {
>  	{ .compatible = "allwinner,sun8i-a83t-emac",
>  	  .data = &emac_variant_a83t },
> @@ -2246,6 +2302,8 @@ static struct platform_driver sun8i_emac_driver = {
>  		.name           = "sun8i-emac",
>  		.of_match_table	= sun8i_emac_of_match_table,
>  	},
> +	.suspend	= sun8i_emac_suspend,
> +	.resume		= sun8i_emac_resume,

These are not the runtime PM hooks. How did you test that?

Maxime

-- 
Maxime Ripard, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [RFC V3 PATCH 18/26] net/netpolicy: set tx queues according to policy
From: Tom Herbert @ 2016-09-12 20:23 UTC (permalink / raw)
  To: Liang, Kan
  Cc: David S. Miller, LKML, Linux Kernel Network Developers,
	Jeff Kirsher, Ingo Molnar, peterz, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, akpm, Kees Cook, viro,
	gorcunov, John Stultz, Alexander Duyck, Ben Hutchings,
	David Decotigny, Florian Westphal, Alexander Duyck,
	Daniel Borkmann, r
In-Reply-To: <1473692159-4017-19-git-send-email-kan.liang@intel.com>

On Mon, Sep 12, 2016 at 7:55 AM,  <kan.liang@intel.com> wrote:
> From: Kan Liang <kan.liang@intel.com>
>
> When the device tries to transmit a packet, netdev_pick_tx is called to
> find the available tx queues. If the net policy is applied, it picks up
> the assigned tx queue from net policy subsystem, and redirect the
> traffic to the assigned queue.
>
> Signed-off-by: Kan Liang <kan.liang@intel.com>
> ---
>  include/net/sock.h |  9 +++++++++
>  net/core/dev.c     | 20 ++++++++++++++++++--
>  2 files changed, 27 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index e1e9e3d..ca97f35 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2280,4 +2280,13 @@ extern int sysctl_optmem_max;
>  extern __u32 sysctl_wmem_default;
>  extern __u32 sysctl_rmem_default;
>
> +/* Return netpolicy instance information from socket. */
> +static inline struct netpolicy_instance *netpolicy_find_instance(struct sock *sk)
> +{
> +#ifdef CONFIG_NETPOLICY
> +       if (is_net_policy_valid(sk->sk_netpolicy.policy))
> +               return &sk->sk_netpolicy;
> +#endif
> +       return NULL;
> +}
>  #endif /* _SOCK_H */
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 34b5322..b9a8044 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3266,6 +3266,7 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
>                                     struct sk_buff *skb,
>                                     void *accel_priv)
>  {
> +       struct sock *sk = skb->sk;
>         int queue_index = 0;
>
>  #ifdef CONFIG_XPS
> @@ -3280,8 +3281,23 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
>                 if (ops->ndo_select_queue)
>                         queue_index = ops->ndo_select_queue(dev, skb, accel_priv,
>                                                             __netdev_pick_tx);
> -               else
> -                       queue_index = __netdev_pick_tx(dev, skb);
> +               else {
> +#ifdef CONFIG_NETPOLICY
> +                       struct netpolicy_instance *instance;
> +
> +                       queue_index = -1;
> +                       if (dev->netpolicy && sk) {
> +                               instance = netpolicy_find_instance(sk);
> +                               if (instance) {
> +                                       if (!instance->dev)
> +                                               instance->dev = dev;
> +                                       queue_index = netpolicy_pick_queue(instance, false);
> +                               }
> +                       }
> +                       if (queue_index < 0)
> +#endif

I doubt this produces the intended effect. Several drivers use
ndo_select_queue (such as mlx4) where there might do something special
for a few packets but end up called the default handler which
__netdev_pick_tx for most packets. So in such cases the netpolicy path
would be routinely bypassed. Maybe this code should be in
__netdev_pick_tx.

Tom

> +                               queue_index = __netdev_pick_tx(dev, skb);
> +               }
>
>                 if (!accel_priv)
>                         queue_index = netdev_cap_txqueue(dev, queue_index);
> --
> 2.5.5
>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox