netdev.vger.kernel.org archive mirror
* [PATCH 00/13] Assorted mvneta fixes and improvements
@ 2014-01-16  7:20 Willy Tarreau
  2014-01-16  7:20 ` [PATCH 01/13] net: mvneta: increase the 64-bit rx/tx stats out of the hot path Willy Tarreau
                   ` (14 more replies)
  0 siblings, 15 replies; 24+ messages in thread
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem
  Cc: netdev, Willy Tarreau, Thomas Petazzoni, Gregory CLEMENT,
	Arnaud Ebalard, Eric Dumazet, Ben Hutchings

Hi,

this series provides some fixes for a number of issues encountered with
the mvneta driver, then adds some improvements. Patches 1-5 are fixes
and would be needed in 3.13 and likely -stable. The next ones are
performance improvements and cleanups:

  - driver lockup when reading stats while sending traffic from multiple
    CPUs: this obviously only happens on SMP and is the result of missing
    locking in the driver. The problem has been present since the
    introduction of the driver in 3.8. The first patch performs some
    changes that are needed by the second one, which actually fixes the
    issue by using per-cpu counters (see the sketch after this list). It
    could make sense to backport this to the relevant stable versions.

  - mvneta_tx_timeout calls various functions to reset the NIC, and these
    functions sleep, which is not allowed there, resulting in a panic.
    Better to completely disable this Tx timeout handler for now, since
    the driver only recovers correctly with it removed. The problem was
    encountered while developing some new features; it is uncertain
    whether it can be reproduced with regular usage, so maybe a backport
    to stable is not needed.

  - replace the Tx timer with a real Tx IRQ. As first reported by Arnaud
    Ebalard and explained by Eric Dumazet, there is no way this driver
    can work correctly if it uses a timer to recycle the Tx descriptors.
    If too many packets are sent at once, the driver quickly ends up with
    no descriptors (which happens twice as easily with GSO) and has to
    wait 10 ms for the timer to recycle its descriptors before it can
    send again. Eric has worked around this in the core GSO code, but the
    limitation remains very visible when routing traffic or sending UDP
    packets. Using Tx IRQs allows Tx descriptors to be recycled as soon
    as they are sent. The coalesce value is still configurable using
    ethtool. This fix raises the UDP send bitrate from 134 Mbps to
    987 Mbps (i.e., line rate). It is made of two patches, one to add the
    relevant bits from the original Marvell driver, and another one to
    implement the change. I don't know whether it should be backported to
    stable, as the bug only causes poor performance.

  - Patches 6..8 are essentially cleanups, code deduplication and minor
    optimizations to avoid re-fetching a value we already have (the
    status word).

  - Patch 9 changes the Rx descriptor prefetch from the current one to
    the next one. In benchmarks, this results in about a 1% general
    performance increase on HTTP traffic, probably because prefetching
    the current descriptor does not leave enough time between the start
    of the prefetch and its use.

  - Patch 10 implements support for build_skb() on the Rx path. The
    driver now preallocates frags instead of skbs and builds an skb just
    before delivering it. This results in a 2% performance increase on
    HTTP traffic, and up to 5% on the small-packet Rx rate.

  - Patch 11 implements rx_copybreak for small packets (up to 256 bytes).
    It avoids a dma_map_single()/dma_unmap_single() pair and increases
    the Rx rate by 16.4%, from 486 kpps to 573 kpps. Further improvements
    up to 711 kpps are possible depending on how the DMA is used.

  - Patches 12 and 13 are extra cleanups made possible by some of the
    simplifications above.
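
For readers unfamiliar with it, the generic per-CPU statistics idiom
that the second patch applies looks roughly like this (a simplified
sketch with illustrative names, not the driver code itself):

  #include <linux/percpu.h>
  #include <linux/u64_stats_sync.h>

  struct pcpu_stats {
          struct u64_stats_sync syncp;
          u64 packets;
          u64 bytes;
  };

  /* writer: hot path, only ever touches this CPU's counters */
  static void stats_add(struct pcpu_stats __percpu *stats, unsigned int len)
  {
          struct pcpu_stats *s = this_cpu_ptr(stats);

          u64_stats_update_begin(&s->syncp);
          s->packets++;
          s->bytes += len;
          u64_stats_update_end(&s->syncp);
  }

  /* reader: sums all CPUs; each retry loop only spins while that CPU's
   * writer is mid-update, so it cannot freeze the way a single shared
   * syncp with two concurrent writers can
   */
  static void stats_read(struct pcpu_stats __percpu *stats,
                         u64 *packets, u64 *bytes)
  {
          int cpu;

          *packets = *bytes = 0;
          for_each_possible_cpu(cpu) {
                  struct pcpu_stats *s = per_cpu_ptr(stats, cpu);
                  unsigned int start;
                  u64 p, b;

                  do {
                          start = u64_stats_fetch_begin_bh(&s->syncp);
                          p = s->packets;
                          b = s->bytes;
                  } while (u64_stats_fetch_retry_bh(&s->syncp, start));

                  *packets += p;
                  *bytes += b;
          }
  }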

Thanks,
Willy

---
Arnaud Ebalard (2):
  net: mvneta: mvneta_tx_done_gbe() cleanups
  net: mvneta: make mvneta_txq_done() return void

Willy Tarreau (11):
  net: mvneta: increase the 64-bit rx/tx stats out of the hot path
  net: mvneta: use per_cpu stats to fix an SMP lock up
  net: mvneta: do not schedule in mvneta_tx_timeout
  net: mvneta: add missing bit descriptions for interrupt masks and
    causes
  net: mvneta: replace Tx timer with a real interrupt
  net: mvneta: remove tests for impossible cases in the tx_done path
  net: mvneta: factor rx refilling code
  net: mvneta: simplify access to the rx descriptor status
  net: mvneta: prefetch next rx descriptor instead of current one
  net: mvneta: convert to build_skb()
  net: mvneta: implement rx_copybreak

 drivers/net/ethernet/marvell/mvneta.c | 387 +++++++++++++++++++---------------
 1 file changed, 215 insertions(+), 172 deletions(-)

--
1.7.12.2.21.g234cd45.dirty


* [PATCH 01/13] net: mvneta: increase the 64-bit rx/tx stats out of the hot path
  2014-01-16  7:20 [PATCH 00/13] Assorted mvneta fixes and improvements Willy Tarreau
@ 2014-01-16  7:20 ` Willy Tarreau
  2014-01-16  7:20 ` [PATCH 02/13] net: mvneta: use per_cpu stats to fix an SMP lock up Willy Tarreau
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, Willy Tarreau, Thomas Petazzoni, Gregory CLEMENT

Better to count packets and bytes in local 32-bit variables on the
stack, then accumulate them into the 64-bit stats once at the end. This
saves two memory writes and two memory barriers per packet. The incoming
packet rate increased by 4.7% on the OpenBlocks AX3 thanks to this.

Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index d5f0d72..baa85af 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -1391,6 +1391,8 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 {
 	struct net_device *dev = pp->dev;
 	int rx_done, rx_filled;
+	u32 rcvd_pkts = 0;
+	u32 rcvd_bytes = 0;
 
 	/* Get number of received packets */
 	rx_done = mvneta_rxq_busy_desc_num_get(pp, rxq);
@@ -1428,10 +1430,8 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 
 		rx_bytes = rx_desc->data_size -
 			(ETH_FCS_LEN + MVNETA_MH_SIZE);
-		u64_stats_update_begin(&pp->rx_stats.syncp);
-		pp->rx_stats.packets++;
-		pp->rx_stats.bytes += rx_bytes;
-		u64_stats_update_end(&pp->rx_stats.syncp);
+		rcvd_pkts++;
+		rcvd_bytes += rx_bytes;
 
 		/* Linux processing */
 		skb_reserve(skb, MVNETA_MH_SIZE);
@@ -1452,6 +1452,13 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 		}
 	}
 
+	if (rcvd_pkts) {
+		u64_stats_update_begin(&pp->rx_stats.syncp);
+		pp->rx_stats.packets += rcvd_pkts;
+		pp->rx_stats.bytes   += rcvd_bytes;
+		u64_stats_update_end(&pp->rx_stats.syncp);
+	}
+
 	/* Update rxq management counters */
 	mvneta_rxq_desc_num_update(pp, rxq, rx_done, rx_filled);
 
-- 
1.7.12.2.21.g234cd45.dirty


* [PATCH 02/13] net: mvneta: use per_cpu stats to fix an SMP lock up
  2014-01-16  7:20 [PATCH 00/13] Assorted mvneta fixes and improvements Willy Tarreau
  2014-01-16  7:20 ` [PATCH 01/13] net: mvneta: increase the 64-bit rx/tx stats out of the hot path Willy Tarreau
@ 2014-01-16  7:20 ` Willy Tarreau
  2014-01-16  7:20 ` [PATCH 03/13] net: mvneta: do not schedule in mvneta_tx_timeout Willy Tarreau
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, Willy Tarreau, Thomas Petazzoni, Gregory CLEMENT

Stats writers are mvneta_rx() and mvneta_tx(). They don't lock anything
when they update the stats, and as a result, it randomly happens that
the stats freeze on SMP if two updates happen during stats retrieval.
This is very easily reproducible by starting two HTTP servers, binding
each of them to a different CPU, then consulting /proc/net/dev in a loop
during transfers: the interface stats lock up almost immediately. This
issue also randomly happens upon link state changes during transfers,
because the stats are collected in that situation, but it takes more
attempts to reproduce it.

The comments in netdevice.h suggest using per_cpu stats instead to get
rid of this issue.

This patch implements this. It merges both rx_stats and tx_stats into
a single "stats" member with a single syncp. Both mvneta_rx() and
mvneta_tx() now only update a single CPU's counters.

In turn, mvneta_get_stats64() does the summing by iterating over all CPUs
to get their respective stats.

With this change, stats are still correct and no lockup is encountered
anymore.

Note that this bug has been present since the first import of the mvneta
driver. It might make sense to backport it to some stable trees. If so,
it depends on "d33dc73 net: mvneta: increase the 64-bit rx/tx stats
out of the hot path".

Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 84 +++++++++++++++++++++++------------
 1 file changed, 55 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index baa85af..40d3e8b 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -221,10 +221,12 @@
 
 #define MVNETA_RX_BUF_SIZE(pkt_size)   ((pkt_size) + NET_SKB_PAD)
 
-struct mvneta_stats {
+struct mvneta_pcpu_stats {
 	struct	u64_stats_sync syncp;
-	u64	packets;
-	u64	bytes;
+	u64	rx_packets;
+	u64	rx_bytes;
+	u64	tx_packets;
+	u64	tx_bytes;
 };
 
 struct mvneta_port {
@@ -250,8 +252,7 @@ struct mvneta_port {
 	u8 mcast_count[256];
 	u16 tx_ring_size;
 	u16 rx_ring_size;
-	struct mvneta_stats tx_stats;
-	struct mvneta_stats rx_stats;
+	struct mvneta_pcpu_stats *stats;
 
 	struct mii_bus *mii_bus;
 	struct phy_device *phy_dev;
@@ -461,21 +462,29 @@ struct rtnl_link_stats64 *mvneta_get_stats64(struct net_device *dev,
 {
 	struct mvneta_port *pp = netdev_priv(dev);
 	unsigned int start;
+	int cpu;
 
-	memset(stats, 0, sizeof(struct rtnl_link_stats64));
-
-	do {
-		start = u64_stats_fetch_begin_bh(&pp->rx_stats.syncp);
-		stats->rx_packets = pp->rx_stats.packets;
-		stats->rx_bytes	= pp->rx_stats.bytes;
-	} while (u64_stats_fetch_retry_bh(&pp->rx_stats.syncp, start));
+	for_each_possible_cpu(cpu) {
+		struct mvneta_pcpu_stats *cpu_stats;
+		u64 rx_packets;
+		u64 rx_bytes;
+		u64 tx_packets;
+		u64 tx_bytes;
 
+		cpu_stats = per_cpu_ptr(pp->stats, cpu);
+		do {
+			start = u64_stats_fetch_begin_bh(&cpu_stats->syncp);
+			rx_packets = cpu_stats->rx_packets;
+			rx_bytes   = cpu_stats->rx_bytes;
+			tx_packets = cpu_stats->tx_packets;
+			tx_bytes   = cpu_stats->tx_bytes;
+		} while (u64_stats_fetch_retry_bh(&cpu_stats->syncp, start));
 
-	do {
-		start = u64_stats_fetch_begin_bh(&pp->tx_stats.syncp);
-		stats->tx_packets = pp->tx_stats.packets;
-		stats->tx_bytes	= pp->tx_stats.bytes;
-	} while (u64_stats_fetch_retry_bh(&pp->tx_stats.syncp, start));
+		stats->rx_packets += rx_packets;
+		stats->rx_bytes   += rx_bytes;
+		stats->tx_packets += tx_packets;
+		stats->tx_bytes   += tx_bytes;
+	}
 
 	stats->rx_errors	= dev->stats.rx_errors;
 	stats->rx_dropped	= dev->stats.rx_dropped;
@@ -1453,10 +1462,12 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 	}
 
 	if (rcvd_pkts) {
-		u64_stats_update_begin(&pp->rx_stats.syncp);
-		pp->rx_stats.packets += rcvd_pkts;
-		pp->rx_stats.bytes   += rcvd_bytes;
-		u64_stats_update_end(&pp->rx_stats.syncp);
+		struct mvneta_pcpu_stats *stats = this_cpu_ptr(pp->stats);
+
+		u64_stats_update_begin(&stats->syncp);
+		stats->rx_packets += rcvd_pkts;
+		stats->rx_bytes   += rcvd_bytes;
+		u64_stats_update_end(&stats->syncp);
 	}
 
 	/* Update rxq management counters */
@@ -1589,11 +1600,12 @@ static int mvneta_tx(struct sk_buff *skb, struct net_device *dev)
 
 out:
 	if (frags > 0) {
-		u64_stats_update_begin(&pp->tx_stats.syncp);
-		pp->tx_stats.packets++;
-		pp->tx_stats.bytes += skb->len;
-		u64_stats_update_end(&pp->tx_stats.syncp);
+		struct mvneta_pcpu_stats *stats = this_cpu_ptr(pp->stats);
 
+		u64_stats_update_begin(&stats->syncp);
+		stats->tx_packets++;
+		stats->tx_bytes  += skb->len;
+		u64_stats_update_end(&stats->syncp);
 	} else {
 		dev->stats.tx_dropped++;
 		dev_kfree_skb_any(skb);
@@ -2758,6 +2770,7 @@ static int mvneta_probe(struct platform_device *pdev)
 	const char *mac_from;
 	int phy_mode;
 	int err;
+	int cpu;
 
 	/* Our multiqueue support is not complete, so for now, only
 	 * allow the usage of the first RX queue
@@ -2799,9 +2812,6 @@ static int mvneta_probe(struct platform_device *pdev)
 
 	pp = netdev_priv(dev);
 
-	u64_stats_init(&pp->tx_stats.syncp);
-	u64_stats_init(&pp->rx_stats.syncp);
-
 	pp->weight = MVNETA_RX_POLL_WEIGHT;
 	pp->phy_node = phy_node;
 	pp->phy_interface = phy_mode;
@@ -2820,6 +2830,19 @@ static int mvneta_probe(struct platform_device *pdev)
 		goto err_clk;
 	}
 
+	/* Alloc per-cpu stats */
+	pp->stats = alloc_percpu(struct mvneta_pcpu_stats);
+	if (!pp->stats) {
+		err = -ENOMEM;
+		goto err_unmap;
+	}
+
+	for_each_possible_cpu(cpu) {
+		struct mvneta_pcpu_stats *stats;
+		stats = per_cpu_ptr(pp->stats, cpu);
+		u64_stats_init(&stats->syncp);
+	}
+
 	dt_mac_addr = of_get_mac_address(dn);
 	if (dt_mac_addr) {
 		mac_from = "device tree";
@@ -2849,7 +2872,7 @@ static int mvneta_probe(struct platform_device *pdev)
 	err = mvneta_init(pp, phy_addr);
 	if (err < 0) {
 		dev_err(&pdev->dev, "can't init eth hal\n");
-		goto err_unmap;
+		goto err_free_stats;
 	}
 	mvneta_port_power_up(pp, phy_mode);
 
@@ -2879,6 +2902,8 @@ static int mvneta_probe(struct platform_device *pdev)
 
 err_deinit:
 	mvneta_deinit(pp);
+err_free_stats:
+	free_percpu(pp->stats);
 err_unmap:
 	iounmap(pp->base);
 err_clk:
@@ -2899,6 +2924,7 @@ static int mvneta_remove(struct platform_device *pdev)
 	unregister_netdev(dev);
 	mvneta_deinit(pp);
 	clk_disable_unprepare(pp->clk);
+	free_percpu(pp->stats);
 	iounmap(pp->base);
 	irq_dispose_mapping(dev->irq);
 	free_netdev(dev);
-- 
1.7.12.2.21.g234cd45.dirty


* [PATCH 03/13] net: mvneta: do not schedule in mvneta_tx_timeout
  2014-01-16  7:20 [PATCH 00/13] Assorted mvneta fixes and improvements Willy Tarreau
  2014-01-16  7:20 ` [PATCH 01/13] net: mvneta: increase the 64-bit rx/tx stats out of the hot path Willy Tarreau
  2014-01-16  7:20 ` [PATCH 02/13] net: mvneta: use per_cpu stats to fix an SMP lock up Willy Tarreau
@ 2014-01-16  7:20 ` Willy Tarreau
  2014-01-16  7:20 ` [PATCH 04/13] net: mvneta: add missing bit descriptions for interrupt masks and causes Willy Tarreau
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem
  Cc: netdev, Willy Tarreau, Thomas Petazzoni, Gregory CLEMENT,
	Ben Hutchings

If a queue timeout is reported, we can oops because of calls that
schedule while the caller is atomic, as shown below:

  mvneta d0070000.ethernet eth0: tx timeout
  BUG: scheduling while atomic: bash/1528/0x00000100
  Modules linked in: slhttp_ethdiv(C) [last unloaded: slhttp_ethdiv]
  CPU: 2 PID: 1528 Comm: bash Tainted: G        WC   3.13.0-rc4-mvebu-nf #180
  [<c0011bd9>] (unwind_backtrace+0x1/0x98) from [<c000f1ab>] (show_stack+0xb/0xc)
  [<c000f1ab>] (show_stack+0xb/0xc) from [<c02ad323>] (dump_stack+0x4f/0x64)
  [<c02ad323>] (dump_stack+0x4f/0x64) from [<c02abe67>] (__schedule_bug+0x37/0x4c)
  [<c02abe67>] (__schedule_bug+0x37/0x4c) from [<c02ae261>] (__schedule+0x325/0x3ec)
  [<c02ae261>] (__schedule+0x325/0x3ec) from [<c02adb97>] (schedule_timeout+0xb7/0x118)
  [<c02adb97>] (schedule_timeout+0xb7/0x118) from [<c0020a67>] (msleep+0xf/0x14)
  [<c0020a67>] (msleep+0xf/0x14) from [<c01dcbe5>] (mvneta_stop_dev+0x21/0x194)
  [<c01dcbe5>] (mvneta_stop_dev+0x21/0x194) from [<c01dcfe9>] (mvneta_tx_timeout+0x19/0x24)
  [<c01dcfe9>] (mvneta_tx_timeout+0x19/0x24) from [<c024afc7>] (dev_watchdog+0x18b/0x1c4)
  [<c024afc7>] (dev_watchdog+0x18b/0x1c4) from [<c0020b53>] (call_timer_fn.isra.27+0x17/0x5c)
  [<c0020b53>] (call_timer_fn.isra.27+0x17/0x5c) from [<c0020cad>] (run_timer_softirq+0x115/0x170)
  [<c0020cad>] (run_timer_softirq+0x115/0x170) from [<c001ccb9>] (__do_softirq+0xbd/0x1a8)
  [<c001ccb9>] (__do_softirq+0xbd/0x1a8) from [<c001cfad>] (irq_exit+0x61/0x98)
  [<c001cfad>] (irq_exit+0x61/0x98) from [<c000d4bf>] (handle_IRQ+0x27/0x60)
  [<c000d4bf>] (handle_IRQ+0x27/0x60) from [<c000843b>] (armada_370_xp_handle_irq+0x33/0xc8)
  [<c000843b>] (armada_370_xp_handle_irq+0x33/0xc8) from [<c000fba9>] (__irq_usr+0x49/0x60)

Ben Hutchings attempted to propose a better fix consisting in deferring
the reset to a work item, but while it fixed this panic, it caused other
random freezes and panics, proving that the reset sequence in the driver
is unreliable and that additional fixes should be investigated.

When sending multiple streams over a link limited to 100 Mbps, Tx
timeouts happen from time to time, and the driver recovers correctly
only when the function is disabled.

Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 40d3e8b..84220c1 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2244,16 +2244,6 @@ static void mvneta_stop_dev(struct mvneta_port *pp)
 	mvneta_rx_reset(pp);
 }
 
-/* tx timeout callback - display a message and stop/start the network device */
-static void mvneta_tx_timeout(struct net_device *dev)
-{
-	struct mvneta_port *pp = netdev_priv(dev);
-
-	netdev_info(dev, "tx timeout\n");
-	mvneta_stop_dev(pp);
-	mvneta_start_dev(pp);
-}
-
 /* Return positive if MTU is valid */
 static int mvneta_check_mtu_valid(struct net_device *dev, int mtu)
 {
@@ -2634,7 +2624,6 @@ static const struct net_device_ops mvneta_netdev_ops = {
 	.ndo_set_rx_mode     = mvneta_set_rx_mode,
 	.ndo_set_mac_address = mvneta_set_mac_addr,
 	.ndo_change_mtu      = mvneta_change_mtu,
-	.ndo_tx_timeout      = mvneta_tx_timeout,
 	.ndo_get_stats64     = mvneta_get_stats64,
 	.ndo_do_ioctl        = mvneta_ioctl,
 };
-- 
1.7.12.2.21.g234cd45.dirty

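For reference, the usual way to fix a sleeping tx_timeout handler is to
defer the reset to process context through a work item. The sketch below
only shows the general idiom with illustrative foo_* names; it is not
Ben Hutchings' actual attempt, which is not part of this series:

  #include <linux/netdevice.h>
  #include <linux/rtnetlink.h>
  #include <linux/workqueue.h>

  struct foo_port {
          struct net_device *dev;
          struct work_struct reset_work;  /* INIT_WORK()ed at probe time */
  };

  static void foo_reset_work(struct work_struct *work)
  {
          struct foo_port *pp = container_of(work, struct foo_port,
                                             reset_work);

          /* process context: the stop/start sequence may sleep here */
          rtnl_lock();
          if (netif_running(pp->dev)) {
                  dev_close(pp->dev);
                  dev_open(pp->dev);
          }
          rtnl_unlock();
  }

  /* called by the stack in atomic (softirq) context: must not sleep */
  static void foo_tx_timeout(struct net_device *dev)
  {
          struct foo_port *pp = netdev_priv(dev);

          netdev_warn(dev, "tx timeout, scheduling reset\n");
          schedule_work(&pp->reset_work);
  }

As the commit message notes, this approach alone proved insufficient
here because the reset sequence itself is unreliable.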

* [PATCH 04/13] net: mvneta: add missing bit descriptions for interrupt masks and causes
  2014-01-16  7:20 [PATCH 00/13] Assorted mvneta fixes and improvements Willy Tarreau
                   ` (2 preceding siblings ...)
  2014-01-16  7:20 ` [PATCH 03/13] net: mvneta: do not schedule in mvneta_tx_timeout Willy Tarreau
@ 2014-01-16  7:20 ` Willy Tarreau
  2014-01-16  7:20 ` [PATCH 05/13] net: mvneta: replace Tx timer with a real interrupt Willy Tarreau
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, Willy Tarreau, Thomas Petazzoni, Gregory CLEMENT

Marvell has not published the chip's datasheet yet, so it's very hard
to find the relevant bits to manipulate in order to change the IRQ
behaviour. Fortunately, these bits are described in the proprietary LSP
patch set, which is publicly available here:

    http://www.plugcomputer.org/downloads/mirabox/

So let's put them back in the driver in order to reduce the burden of
current and future maintenance.

Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 44 +++++++++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 84220c1..defda6f 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -101,16 +101,56 @@
 #define      MVNETA_CPU_RXQ_ACCESS_ALL_MASK      0x000000ff
 #define      MVNETA_CPU_TXQ_ACCESS_ALL_MASK      0x0000ff00
 #define MVNETA_RXQ_TIME_COAL_REG(q)              (0x2580 + ((q) << 2))
+
+/* Exception Interrupt Port/Queue Cause register */
+
 #define MVNETA_INTR_NEW_CAUSE                    0x25a0
-#define      MVNETA_RX_INTR_MASK(nr_rxqs)        (((1 << nr_rxqs) - 1) << 8)
 #define MVNETA_INTR_NEW_MASK                     0x25a4
+
+/* bits  0..7  = TXQ SENT, one bit per queue.
+ * bits  8..15 = RXQ OCCUP, one bit per queue.
+ * bits 16..23 = RXQ FREE, one bit per queue.
+ * bit  29 = OLD_REG_SUM, see old reg ?
+ * bit  30 = TX_ERR_SUM, one bit for 4 ports
+ * bit  31 = MISC_SUM,   one bit for 4 ports
+ */
+#define      MVNETA_TX_INTR_MASK(nr_txqs)        (((1 << nr_txqs) - 1) << 0)
+#define      MVNETA_TX_INTR_MASK_ALL             (0xff << 0)
+#define      MVNETA_RX_INTR_MASK(nr_rxqs)        (((1 << nr_rxqs) - 1) << 8)
+#define      MVNETA_RX_INTR_MASK_ALL             (0xff << 8)
+
 #define MVNETA_INTR_OLD_CAUSE                    0x25a8
 #define MVNETA_INTR_OLD_MASK                     0x25ac
+
+/* Data Path Port/Queue Cause Register */
 #define MVNETA_INTR_MISC_CAUSE                   0x25b0
 #define MVNETA_INTR_MISC_MASK                    0x25b4
+
+#define      MVNETA_CAUSE_PHY_STATUS_CHANGE      BIT(0)
+#define      MVNETA_CAUSE_LINK_CHANGE            BIT(1)
+#define      MVNETA_CAUSE_PTP                    BIT(4)
+
+#define      MVNETA_CAUSE_INTERNAL_ADDR_ERR      BIT(7)
+#define      MVNETA_CAUSE_RX_OVERRUN             BIT(8)
+#define      MVNETA_CAUSE_RX_CRC_ERROR           BIT(9)
+#define      MVNETA_CAUSE_RX_LARGE_PKT           BIT(10)
+#define      MVNETA_CAUSE_TX_UNDERUN             BIT(11)
+#define      MVNETA_CAUSE_PRBS_ERR               BIT(12)
+#define      MVNETA_CAUSE_PSC_SYNC_CHANGE        BIT(13)
+#define      MVNETA_CAUSE_SERDES_SYNC_ERR        BIT(14)
+
+#define      MVNETA_CAUSE_BMU_ALLOC_ERR_SHIFT    16
+#define      MVNETA_CAUSE_BMU_ALLOC_ERR_ALL_MASK   (0xF << MVNETA_CAUSE_BMU_ALLOC_ERR_SHIFT)
+#define      MVNETA_CAUSE_BMU_ALLOC_ERR_MASK(pool) (1 << (MVNETA_CAUSE_BMU_ALLOC_ERR_SHIFT + (pool)))
+
+#define      MVNETA_CAUSE_TXQ_ERROR_SHIFT        24
+#define      MVNETA_CAUSE_TXQ_ERROR_ALL_MASK     (0xFF << MVNETA_CAUSE_TXQ_ERROR_SHIFT)
+#define      MVNETA_CAUSE_TXQ_ERROR_MASK(q)      (1 << (MVNETA_CAUSE_TXQ_ERROR_SHIFT + (q)))
+
 #define MVNETA_INTR_ENABLE                       0x25b8
 #define      MVNETA_TXQ_INTR_ENABLE_ALL_MASK     0x0000ff00
-#define      MVNETA_RXQ_INTR_ENABLE_ALL_MASK     0xff000000
+#define      MVNETA_RXQ_INTR_ENABLE_ALL_MASK     0xff000000  // note: neta says it's 0x000000FF
+
 #define MVNETA_RXQ_CMD                           0x2680
 #define      MVNETA_RXQ_DISABLE_SHIFT            8
 #define      MVNETA_RXQ_ENABLE_MASK              0x000000ff
-- 
1.7.12.2.21.g234cd45.dirty

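To illustrate how such cause bits are typically consumed, here is a
hypothetical handler (a sketch only, not code from this series;
mvreg_read()/mvreg_write() are the driver's register accessors, and the
exact acknowledgement semantics depend on the chip):

  static void foo_handle_misc_irq(struct mvneta_port *pp)
  {
          u32 cause = mvreg_read(pp, MVNETA_INTR_MISC_CAUSE);

          if (cause & (MVNETA_CAUSE_PHY_STATUS_CHANGE |
                       MVNETA_CAUSE_LINK_CHANGE))
                  netdev_info(pp->dev, "link event\n");

          if (cause & MVNETA_CAUSE_TX_UNDERUN)
                  pp->dev->stats.tx_errors++;

          /* acknowledge everything we have handled */
          mvreg_write(pp, MVNETA_INTR_MISC_CAUSE, 0);
  }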

* [PATCH 05/13] net: mvneta: replace Tx timer with a real interrupt
  2014-01-16  7:20 [PATCH 00/13] Assorted mvneta fixes and improvements Willy Tarreau
                   ` (3 preceding siblings ...)
  2014-01-16  7:20 ` [PATCH 04/13] net: mvneta: add missing bit descriptions for interrupt masks and causes Willy Tarreau
@ 2014-01-16  7:20 ` Willy Tarreau
  2014-01-16  7:20 ` [PATCH 06/13] net: mvneta: remove tests for impossible cases in the tx_done path Willy Tarreau
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem
  Cc: netdev, Willy Tarreau, Thomas Petazzoni, Gregory CLEMENT,
	Arnaud Ebalard, Eric Dumazet

Right now the mvneta driver doesn't handle Tx IRQs, and relies on two
mechanisms to flush Tx descriptors: a flush at the end of mvneta_tx()
and a timer. If a burst of packets is emitted faster than the device
can send them, then the queue is stopped until the next wake-up of the
timer 10 ms later. This causes jerky output traffic with bursts and
pauses, making it difficult to reach line rate with very few streams.

A test on UDP traffic shows that it's not possible to go beyond
134 Mbps / 12 kpps of outgoing traffic with 1500-byte IP packets.
Routed traffic tends to observe pauses as well if the traffic is
bursty, making it even burstier after the wake-up.

It seems that this feature was inherited from the original driver but
nothing there mentions any reason for not using the interrupt instead,
which the chip supports.

Thus, this patch enables Tx interrupts and removes the timer. It does
both at once because it's not really possible to make the two
mechanisms coexist, so a split patch doesn't make sense.

First tests performed on a Mirabox (Armada 370) show that less CPU
seems to be used when sending traffic. One reason might be that we now
call mvneta_tx_done_gbe() with a mask indicating which queues have been
done instead of looping over all of them.

The same UDP test above now happily reaches 987 Mbps / 87.7 kpps.
Single-stream TCP traffic can now more easily reach line rate. HTTP
transfers of 1 MB objects over a single connection went from 730 to
840 Mbps. It is even possible to go significantly higher (>900 Mbps)
by tweaking tcp_tso_win_divisor.

Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Cc: Arnaud Ebalard <arno@natisbad.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 71 ++++++-----------------------------
 1 file changed, 12 insertions(+), 59 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index defda6f..df75a23 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -216,9 +216,6 @@
 #define MVNETA_RX_COAL_PKTS		32
 #define MVNETA_RX_COAL_USEC		100
 
-/* Timer */
-#define MVNETA_TX_DONE_TIMER_PERIOD	10
-
 /* Napi polling weight */
 #define MVNETA_RX_POLL_WEIGHT		64
 
@@ -274,16 +271,11 @@ struct mvneta_port {
 	void __iomem *base;
 	struct mvneta_rx_queue *rxqs;
 	struct mvneta_tx_queue *txqs;
-	struct timer_list tx_done_timer;
 	struct net_device *dev;
 
 	u32 cause_rx_tx;
 	struct napi_struct napi;
 
-	/* Flags */
-	unsigned long flags;
-#define MVNETA_F_TX_DONE_TIMER_BIT  0
-
 	/* Napi weight */
 	int weight;
 
@@ -1149,17 +1141,6 @@ static void mvneta_tx_done_pkts_coal_set(struct mvneta_port *pp,
 	txq->done_pkts_coal = value;
 }
 
-/* Trigger tx done timer in MVNETA_TX_DONE_TIMER_PERIOD msecs */
-static void mvneta_add_tx_done_timer(struct mvneta_port *pp)
-{
-	if (test_and_set_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags) == 0) {
-		pp->tx_done_timer.expires = jiffies +
-			msecs_to_jiffies(MVNETA_TX_DONE_TIMER_PERIOD);
-		add_timer(&pp->tx_done_timer);
-	}
-}
-
-
 /* Handle rx descriptor fill by setting buf_cookie and buf_phys_addr */
 static void mvneta_rx_desc_fill(struct mvneta_rx_desc *rx_desc,
 				u32 phys_addr, u32 cookie)
@@ -1651,15 +1632,6 @@ out:
 		dev_kfree_skb_any(skb);
 	}
 
-	if (txq->count >= MVNETA_TXDONE_COAL_PKTS)
-		mvneta_txq_done(pp, txq);
-
-	/* If after calling mvneta_txq_done, count equals
-	 * frags, we need to set the timer
-	 */
-	if (txq->count == frags && frags > 0)
-		mvneta_add_tx_done_timer(pp);
-
 	return NETDEV_TX_OK;
 }
 
@@ -1935,14 +1907,22 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
 
 	/* Read cause register */
 	cause_rx_tx = mvreg_read(pp, MVNETA_INTR_NEW_CAUSE) &
-		MVNETA_RX_INTR_MASK(rxq_number);
+		(MVNETA_RX_INTR_MASK(rxq_number) | MVNETA_TX_INTR_MASK(txq_number));
+
+	/* Release Tx descriptors */
+	if (cause_rx_tx & MVNETA_TX_INTR_MASK_ALL) {
+		int tx_todo = 0;
+
+		mvneta_tx_done_gbe(pp, (cause_rx_tx & MVNETA_TX_INTR_MASK_ALL), &tx_todo);
+		cause_rx_tx &= ~MVNETA_TX_INTR_MASK_ALL;
+	}
 
 	/* For the case where the last mvneta_poll did not process all
 	 * RX packets
 	 */
 	cause_rx_tx |= pp->cause_rx_tx;
 	if (rxq_number > 1) {
-		while ((cause_rx_tx != 0) && (budget > 0)) {
+		while ((cause_rx_tx & MVNETA_RX_INTR_MASK_ALL) && (budget > 0)) {
 			int count;
 			struct mvneta_rx_queue *rxq;
 			/* get rx queue number from cause_rx_tx */
@@ -1974,7 +1954,7 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
 		napi_complete(napi);
 		local_irq_save(flags);
 		mvreg_write(pp, MVNETA_INTR_NEW_MASK,
-			    MVNETA_RX_INTR_MASK(rxq_number));
+			    MVNETA_RX_INTR_MASK(rxq_number) | MVNETA_TX_INTR_MASK(txq_number));
 		local_irq_restore(flags);
 	}
 
@@ -1982,26 +1962,6 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
 	return rx_done;
 }
 
-/* tx done timer callback */
-static void mvneta_tx_done_timer_callback(unsigned long data)
-{
-	struct net_device *dev = (struct net_device *)data;
-	struct mvneta_port *pp = netdev_priv(dev);
-	int tx_done = 0, tx_todo = 0;
-
-	if (!netif_running(dev))
-		return ;
-
-	clear_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags);
-
-	tx_done = mvneta_tx_done_gbe(pp,
-				     (((1 << txq_number) - 1) &
-				      MVNETA_CAUSE_TXQ_SENT_DESC_ALL_MASK),
-				     &tx_todo);
-	if (tx_todo > 0)
-		mvneta_add_tx_done_timer(pp);
-}
-
 /* Handle rxq fill: allocates rxq skbs; called when initializing a port */
 static int mvneta_rxq_fill(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 			   int num)
@@ -2251,7 +2211,7 @@ static void mvneta_start_dev(struct mvneta_port *pp)
 
 	/* Unmask interrupts */
 	mvreg_write(pp, MVNETA_INTR_NEW_MASK,
-		    MVNETA_RX_INTR_MASK(rxq_number));
+		    MVNETA_RX_INTR_MASK(rxq_number) | MVNETA_TX_INTR_MASK(txq_number));
 
 	phy_start(pp->phy_dev);
 	netif_tx_start_all_queues(pp->dev);
@@ -2527,8 +2487,6 @@ static int mvneta_stop(struct net_device *dev)
 	free_irq(dev->irq, pp);
 	mvneta_cleanup_rxqs(pp);
 	mvneta_cleanup_txqs(pp);
-	del_timer(&pp->tx_done_timer);
-	clear_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags);
 
 	return 0;
 }
@@ -2887,11 +2845,6 @@ static int mvneta_probe(struct platform_device *pdev)
 		}
 	}
 
-	pp->tx_done_timer.data = (unsigned long)dev;
-	pp->tx_done_timer.function = mvneta_tx_done_timer_callback;
-	init_timer(&pp->tx_done_timer);
-	clear_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags);
-
 	pp->tx_ring_size = MVNETA_MAX_TXD;
 	pp->rx_ring_size = MVNETA_MAX_RXD;
 
-- 
1.7.12.2.21.g234cd45.dirty

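Reduced to its skeleton, the resulting poll flow looks like this (a
sketch with placeholder foo_* names; the real mvneta_poll() in the diff
above carries extra state for multi-queue Rx and IRQ masking):

  static int foo_poll(struct napi_struct *napi, int budget)
  {
          struct foo_port *pp = container_of(napi, struct foo_port, napi);
          int rx_done = 0;
          u32 cause = foo_read_cause(pp);

          /* recycle sent Tx descriptors first: this can wake a stopped
           * Tx queue without waiting for any timer
           */
          if (cause & FOO_TX_BITS)
                  foo_tx_done(pp, cause & FOO_TX_BITS);

          if (cause & FOO_RX_BITS)
                  rx_done = foo_rx(pp, budget);

          if (rx_done < budget) {
                  /* all done: re-enable both Rx and Tx interrupts */
                  napi_complete(napi);
                  foo_unmask_rx_tx_irqs(pp);
          }

          return rx_done;
  }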

* [PATCH 06/13] net: mvneta: remove tests for impossible cases in the tx_done path
  2014-01-16  7:20 [PATCH 00/13] Assorted mvneta fixes and improvements Willy Tarreau
                   ` (4 preceding siblings ...)
  2014-01-16  7:20 ` [PATCH 05/13] net: mvneta: replace Tx timer with a real interrupt Willy Tarreau
@ 2014-01-16  7:20 ` Willy Tarreau
  2014-01-16  7:20 ` [PATCH 07/13] net: mvneta: factor rx refilling code Willy Tarreau
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, Willy Tarreau, Thomas Petazzoni, Gregory CLEMENT

Currently, mvneta_tx_done_gbe() calls mvneta_tx_done_policy() with
a non-null cause to retrieve the pointer to the next queue to process.
The queue number and the pointer it returns are always well defined
within a known, limited set, so the extra tests on them are useless.
This code path is fast, although not critical. Removing the 3 tests
here, which the compiler could not optimize away (verified), is always
desirable.

Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index df75a23..2fb9559 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -1276,13 +1276,16 @@ static void mvneta_rx_csum(struct mvneta_port *pp,
 	skb->ip_summed = CHECKSUM_NONE;
 }
 
-/* Return tx queue pointer (find last set bit) according to causeTxDone reg */
+/* Return tx queue pointer (find last set bit) according to <cause> returned
+ * form tx_done reg. <cause> must not be null. The return value is always a
+ * valid queue for matching the first one found in <cause>.
+ */
 static struct mvneta_tx_queue *mvneta_tx_done_policy(struct mvneta_port *pp,
 						     u32 cause)
 {
 	int queue = fls(cause) - 1;
 
-	return (queue < 0 || queue >= txq_number) ? NULL : &pp->txqs[queue];
+	return &pp->txqs[queue];
 }
 
 /* Free tx queue skbuffs */
@@ -1651,7 +1654,9 @@ static void mvneta_txq_done_force(struct mvneta_port *pp,
 	txq->txq_get_index = 0;
 }
 
-/* handle tx done - called from tx done timer callback */
+/* Handle tx done - called in softirq context. The <cause_tx_done> argument
+ * must be a valid cause according to MVNETA_TXQ_INTR_MASK_ALL.
+ */
 static u32 mvneta_tx_done_gbe(struct mvneta_port *pp, u32 cause_tx_done,
 			      int *tx_todo)
 {
@@ -1660,10 +1665,8 @@ static u32 mvneta_tx_done_gbe(struct mvneta_port *pp, u32 cause_tx_done,
 	struct netdev_queue *nq;
 
 	*tx_todo = 0;
-	while (cause_tx_done != 0) {
+	while (cause_tx_done) {
 		txq = mvneta_tx_done_policy(pp, cause_tx_done);
-		if (!txq)
-			break;
 
 		nq = netdev_get_tx_queue(pp->dev, txq->id);
 		__netif_tx_lock(nq, smp_processor_id());
-- 
1.7.12.2.21.g234cd45.dirty

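Why the removed tests were impossible is worth spelling out: the caller
masks the cause register with the per-queue bits (0..7) before calling,
and the loop only runs while the cause is non-zero, so fls(cause) - 1 is
always a valid queue index. A worked example of the loop's logic:

  /* cause_tx_done = 0x05: queues 0 and 2 pending.
   *   iteration 1: fls(0x05) - 1 = 2 -> process txq 2, clear bit 2
   *   iteration 2: fls(0x01) - 1 = 0 -> process txq 0, clear bit 0
   * cause is now 0 and the loop ends; no out-of-range index can occur.
   */
  while (cause_tx_done) {
          int queue = fls(cause_tx_done) - 1;

          /* ... process pp->txqs[queue] ... */
          cause_tx_done &= ~(1 << queue);
  }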

* [PATCH 07/13] net: mvneta: factor rx refilling code
  2014-01-16  7:20 [PATCH 00/13] Assorted mvneta fixes and improvements Willy Tarreau
                   ` (5 preceding siblings ...)
  2014-01-16  7:20 ` [PATCH 06/13] net: mvneta: remove tests for impossible cases in the tx_done path Willy Tarreau
@ 2014-01-16  7:20 ` Willy Tarreau
  2014-01-16  7:20 ` [PATCH 08/13] net: mvneta: simplify access to the rx descriptor status Willy Tarreau
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, Willy Tarreau, Thomas Petazzoni, Gregory CLEMENT

Make mvneta_rxq_fill() use mvneta_rx_refill() instead of duplicating
its code.

Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 23 +++--------------------
 1 file changed, 3 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 2fb9559..eccafd1 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -1969,32 +1969,15 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
 static int mvneta_rxq_fill(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 			   int num)
 {
-	struct net_device *dev = pp->dev;
 	int i;
 
 	for (i = 0; i < num; i++) {
-		struct sk_buff *skb;
-		struct mvneta_rx_desc *rx_desc;
-		unsigned long phys_addr;
-
-		skb = dev_alloc_skb(pp->pkt_size);
-		if (!skb) {
-			netdev_err(dev, "%s:rxq %d, %d of %d buffs  filled\n",
+		memset(rxq->descs + i, 0, sizeof(struct mvneta_rx_desc));
+		if (mvneta_rx_refill(pp, rxq->descs + i) != 0) {
+			netdev_err(pp->dev, "%s:rxq %d, %d of %d buffs  filled\n",
 				__func__, rxq->id, i, num);
 			break;
 		}
-
-		rx_desc = rxq->descs + i;
-		memset(rx_desc, 0, sizeof(struct mvneta_rx_desc));
-		phys_addr = dma_map_single(dev->dev.parent, skb->head,
-					   MVNETA_RX_BUF_SIZE(pp->pkt_size),
-					   DMA_FROM_DEVICE);
-		if (unlikely(dma_mapping_error(dev->dev.parent, phys_addr))) {
-			dev_kfree_skb(skb);
-			break;
-		}
-
-		mvneta_rx_desc_fill(rx_desc, phys_addr, (u32)skb);
 	}
 
 	/* Add this number of RX descriptors as non occupied (ready to
-- 
1.7.12.2.21.g234cd45.dirty


* [PATCH 08/13] net: mvneta: simplify access to the rx descriptor status
  2014-01-16  7:20 [PATCH 00/13] Assorted mvneta fixes and improvements Willy Tarreau
                   ` (6 preceding siblings ...)
  2014-01-16  7:20 ` [PATCH 07/13] net: mvneta: factor rx refilling code Willy Tarreau
@ 2014-01-16  7:20 ` Willy Tarreau
  2014-01-16  7:20 ` [PATCH 09/13] net: mvneta: prefetch next rx descriptor instead of current one Willy Tarreau
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, Willy Tarreau, Thomas Petazzoni, Gregory CLEMENT

In several places, we already know the value of the rx status, but we
call functions which dereference the pointer again to get it and don't
need the descriptor for anything else. Simplify this by passing them
the status word itself instead of the rx descriptor pointer.

Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index eccafd1..aa3a4f7 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -528,14 +528,14 @@ struct rtnl_link_stats64 *mvneta_get_stats64(struct net_device *dev,
 
 /* Rx descriptors helper methods */
 
-/* Checks whether the given RX descriptor is both the first and the
- * last descriptor for the RX packet. Each RX packet is currently
+/* Checks whether the RX descriptor having this status is both the first
+ * and the last descriptor for the RX packet. Each RX packet is currently
  * received through a single RX descriptor, so not having each RX
  * descriptor with its first and last bits set is an error
  */
-static int mvneta_rxq_desc_is_first_last(struct mvneta_rx_desc *desc)
+static int mvneta_rxq_desc_is_first_last(u32 status)
 {
-	return (desc->status & MVNETA_RXD_FIRST_LAST_DESC) ==
+	return (status & MVNETA_RXD_FIRST_LAST_DESC) ==
 		MVNETA_RXD_FIRST_LAST_DESC;
 }
 
@@ -1234,10 +1234,10 @@ static void mvneta_rx_error(struct mvneta_port *pp,
 {
 	u32 status = rx_desc->status;
 
-	if (!mvneta_rxq_desc_is_first_last(rx_desc)) {
+	if (!mvneta_rxq_desc_is_first_last(status)) {
 		netdev_err(pp->dev,
 			   "bad rx status %08x (buffer oversize), size=%d\n",
-			   rx_desc->status, rx_desc->data_size);
+			   status, rx_desc->data_size);
 		return;
 	}
 
@@ -1261,13 +1261,12 @@ static void mvneta_rx_error(struct mvneta_port *pp,
 	}
 }
 
-/* Handle RX checksum offload */
-static void mvneta_rx_csum(struct mvneta_port *pp,
-			   struct mvneta_rx_desc *rx_desc,
+/* Handle RX checksum offload based on the descriptor's status */
+static void mvneta_rx_csum(struct mvneta_port *pp, u32 status,
 			   struct sk_buff *skb)
 {
-	if ((rx_desc->status & MVNETA_RXD_L3_IP4) &&
-	    (rx_desc->status & MVNETA_RXD_L4_CSUM_OK)) {
+	if ((status & MVNETA_RXD_L3_IP4) &&
+	    (status & MVNETA_RXD_L4_CSUM_OK)) {
 		skb->csum = 0;
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
 		return;
@@ -1449,7 +1448,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 		rx_status = rx_desc->status;
 		skb = (struct sk_buff *)rx_desc->buf_cookie;
 
-		if (!mvneta_rxq_desc_is_first_last(rx_desc) ||
+		if (!mvneta_rxq_desc_is_first_last(rx_status) ||
 		    (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
 			dev->stats.rx_errors++;
 			mvneta_rx_error(pp, rx_desc);
@@ -1472,7 +1471,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 
 		skb->protocol = eth_type_trans(skb, dev);
 
-		mvneta_rx_csum(pp, rx_desc, skb);
+		mvneta_rx_csum(pp, rx_status, skb);
 
 		napi_gro_receive(&pp->napi, skb);
 
-- 
1.7.12.2.21.g234cd45.dirty


* [PATCH 09/13] net: mvneta: prefetch next rx descriptor instead of current one
  2014-01-16  7:20 [PATCH 00/13] Assorted mvneta fixes and improvements Willy Tarreau
                   ` (7 preceding siblings ...)
  2014-01-16  7:20 ` [PATCH 08/13] net: mvneta: simplify access to the rx descriptor status Willy Tarreau
@ 2014-01-16  7:20 ` Willy Tarreau
  2014-01-16  7:20 ` [PATCH 10/13] net: mvneta: convert to build_skb() Willy Tarreau
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, Willy Tarreau, Thomas Petazzoni, Gregory CLEMENT

Currently, the mvneta driver tries to prefetch the current Rx
descriptor during read. Tests have shown that prefetching the
next one instead increases general performance by about 1% on
HTTP traffic.

Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index aa3a4f7..c7b37e0 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -611,6 +611,7 @@ mvneta_rxq_next_desc_get(struct mvneta_rx_queue *rxq)
 	int rx_desc = rxq->next_desc_to_proc;
 
 	rxq->next_desc_to_proc = MVNETA_QUEUE_NEXT_DESC(rxq, rx_desc);
+	prefetch(rxq->descs + rxq->next_desc_to_proc);
 	return rxq->descs + rx_desc;
 }
 
@@ -1442,7 +1443,6 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 		u32 rx_status;
 		int rx_bytes, err;
 
-		prefetch(rx_desc);
 		rx_done++;
 		rx_filled++;
 		rx_status = rx_desc->status;
-- 
1.7.12.2.21.g234cd45.dirty

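The idea in generic form (a sketch with illustrative foo_* names):
issue the prefetch for descriptor i+1 while returning descriptor i, so
the memory load has a whole iteration of packet processing to complete
before its data is needed:

  #include <linux/prefetch.h>

  static struct foo_rx_desc *foo_next_desc_get(struct foo_rx_queue *rxq)
  {
          int i = rxq->next_desc_to_proc;

          rxq->next_desc_to_proc = (i + 1 == rxq->size) ? 0 : i + 1;
          /* start pulling in the *next* descriptor now... */
          prefetch(rxq->descs + rxq->next_desc_to_proc);
          /* ...while the caller works on the current one */
          return rxq->descs + i;
  }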

* [PATCH 10/13] net: mvneta: convert to build_skb()
  2014-01-16  7:20 [PATCH 00/13] Assorted mvneta fixes and improvements Willy Tarreau
                   ` (8 preceding siblings ...)
  2014-01-16  7:20 ` [PATCH 09/13] net: mvneta: prefetch next rx descriptor instead of current one Willy Tarreau
@ 2014-01-16  7:20 ` Willy Tarreau
  2014-01-16  7:20 ` [PATCH 11/13] net: mvneta: implement rx_copybreak Willy Tarreau
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, Willy Tarreau, Thomas Petazzoni, Gregory CLEMENT

Make use of build_skb() to allocate frags on the RX path. When the frag
size does not exceed a page, we can use netdev_alloc_frag(), and we fall
back to kmalloc() for larger sizes. The frag size is stored in the
mvneta_port struct. The alloc/free functions check the frag size to
decide which alloc/free method to use. MTU changes are safe because the
MTU change function stops the device and clears the queues before
applying the change.

With this patch, I observed a reproducible 2% performance improvement on
HTTP-based benchmarks, and 5% on the small-packet RX rate.

Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 49 +++++++++++++++++++++++++----------
 1 file changed, 35 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index c7b37e0..726a8d2 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -268,6 +268,7 @@ struct mvneta_pcpu_stats {
 
 struct mvneta_port {
 	int pkt_size;
+	unsigned int frag_size;
 	void __iomem *base;
 	struct mvneta_rx_queue *rxqs;
 	struct mvneta_tx_queue *txqs;
@@ -1332,28 +1333,43 @@ static int mvneta_txq_done(struct mvneta_port *pp,
 	return tx_done;
 }
 
+static void *mvneta_frag_alloc(const struct mvneta_port *pp)
+{
+	if (likely(pp->frag_size <= PAGE_SIZE))
+		return netdev_alloc_frag(pp->frag_size);
+	else
+		return kmalloc(pp->frag_size, GFP_ATOMIC);
+}
+
+static void mvneta_frag_free(const struct mvneta_port *pp, void *data)
+{
+	if (likely(pp->frag_size <= PAGE_SIZE))
+		put_page(virt_to_head_page(data));
+	else
+		kfree(data);
+}
+
 /* Refill processing */
 static int mvneta_rx_refill(struct mvneta_port *pp,
 			    struct mvneta_rx_desc *rx_desc)
 
 {
 	dma_addr_t phys_addr;
-	struct sk_buff *skb;
+	void *data;
 
-	skb = netdev_alloc_skb(pp->dev, pp->pkt_size);
-	if (!skb)
+	data = mvneta_frag_alloc(pp);
+	if (!data)
 		return -ENOMEM;
 
-	phys_addr = dma_map_single(pp->dev->dev.parent, skb->head,
+	phys_addr = dma_map_single(pp->dev->dev.parent, data,
 				   MVNETA_RX_BUF_SIZE(pp->pkt_size),
 				   DMA_FROM_DEVICE);
 	if (unlikely(dma_mapping_error(pp->dev->dev.parent, phys_addr))) {
-		dev_kfree_skb(skb);
+		mvneta_frag_free(pp, data);
 		return -ENOMEM;
 	}
 
-	mvneta_rx_desc_fill(rx_desc, phys_addr, (u32)skb);
-
+	mvneta_rx_desc_fill(rx_desc, phys_addr, (u32)data);
 	return 0;
 }
 
@@ -1407,9 +1423,9 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
 	rx_done = mvneta_rxq_busy_desc_num_get(pp, rxq);
 	for (i = 0; i < rxq->size; i++) {
 		struct mvneta_rx_desc *rx_desc = rxq->descs + i;
-		struct sk_buff *skb = (struct sk_buff *)rx_desc->buf_cookie;
+		void *data = (void *)rx_desc->buf_cookie;
 
-		dev_kfree_skb_any(skb);
+		mvneta_frag_free(pp, data);
 		dma_unmap_single(pp->dev->dev.parent, rx_desc->buf_phys_addr,
 				 MVNETA_RX_BUF_SIZE(pp->pkt_size), DMA_FROM_DEVICE);
 	}
@@ -1440,20 +1456,21 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 	while (rx_done < rx_todo) {
 		struct mvneta_rx_desc *rx_desc = mvneta_rxq_next_desc_get(rxq);
 		struct sk_buff *skb;
+		unsigned char *data;
 		u32 rx_status;
 		int rx_bytes, err;
 
 		rx_done++;
 		rx_filled++;
 		rx_status = rx_desc->status;
-		skb = (struct sk_buff *)rx_desc->buf_cookie;
+		data = (unsigned char *)rx_desc->buf_cookie;
 
 		if (!mvneta_rxq_desc_is_first_last(rx_status) ||
-		    (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
+		    (rx_status & MVNETA_RXD_ERR_SUMMARY) ||
+		    !(skb = build_skb(data, pp->frag_size > PAGE_SIZE ? 0 : pp->frag_size))) {
 			dev->stats.rx_errors++;
 			mvneta_rx_error(pp, rx_desc);
-			mvneta_rx_desc_fill(rx_desc, rx_desc->buf_phys_addr,
-					    (u32)skb);
+			/* leave the descriptor untouched */
 			continue;
 		}
 
@@ -1466,7 +1483,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 		rcvd_bytes += rx_bytes;
 
 		/* Linux processing */
-		skb_reserve(skb, MVNETA_MH_SIZE);
+		skb_reserve(skb, MVNETA_MH_SIZE + NET_SKB_PAD);
 		skb_put(skb, rx_bytes);
 
 		skb->protocol = eth_type_trans(skb, dev);
@@ -2276,6 +2293,8 @@ static int mvneta_change_mtu(struct net_device *dev, int mtu)
 	mvneta_cleanup_rxqs(pp);
 
 	pp->pkt_size = MVNETA_RX_PKT_SIZE(pp->dev->mtu);
+	pp->frag_size = SKB_DATA_ALIGN(MVNETA_RX_BUF_SIZE(pp->pkt_size)) +
+	                SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
 	ret = mvneta_setup_rxqs(pp);
 	if (ret) {
@@ -2423,6 +2442,8 @@ static int mvneta_open(struct net_device *dev)
 	mvneta_mac_addr_set(pp, dev->dev_addr, rxq_def);
 
 	pp->pkt_size = MVNETA_RX_PKT_SIZE(pp->dev->mtu);
+	pp->frag_size = SKB_DATA_ALIGN(MVNETA_RX_BUF_SIZE(pp->pkt_size)) +
+	                SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
 	ret = mvneta_setup_rxqs(pp);
 	if (ret)
-- 
1.7.12.2.21.g234cd45.dirty

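As a rough worked example of the sizing for a standard 1500-byte MTU
(approximate figures; they assume the usual cache-line rounding in
MVNETA_RX_PKT_SIZE() and NET_SKB_PAD = 64, neither of which is shown in
this series):

  /* pkt_size  ~ 1500 + 14 (Ethernet) + 2 (Marvell header) + 4 (FCS),
   *             rounded up to a cache line            -> ~1536 bytes
   * buf_size  = MVNETA_RX_BUF_SIZE(pkt_size)
   *           = pkt_size + NET_SKB_PAD                -> ~1600 bytes
   * frag_size = SKB_DATA_ALIGN(buf_size)
   *           + SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
   *                                                   -> ~1920 bytes
   *
   * This stays well below PAGE_SIZE (4096), so the netdev_alloc_frag()
   * fast path is taken; only jumbo MTUs spill over into kmalloc().
   */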

* [PATCH 11/13] net: mvneta: implement rx_copybreak
  2014-01-16  7:20 [PATCH 00/13] Assorted mvneta fixes and improvements Willy Tarreau
                   ` (9 preceding siblings ...)
  2014-01-16  7:20 ` [PATCH 10/13] net: mvneta: convert to build_skb() Willy Tarreau
@ 2014-01-16  7:20 ` Willy Tarreau
  2014-01-16  9:14   ` David Laight
  2014-01-16  7:20 ` [PATCH 12/13] net: mvneta: mvneta_tx_done_gbe() cleanups Willy Tarreau
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 24+ messages in thread
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, Willy Tarreau, Thomas Petazzoni, Gregory CLEMENT

Calling dma_map_single()/dma_unmap_single() is quite expensive compared
to copying a small packet. So let's copy short frames and keep the
buffers mapped. We set the limit to 256 bytes, which seems to give good
results both on the XP-GP board and on the AX3/4.

The Rx small-packet rate increased by 16.4% with this, from 486 kpps to
573 kpps. It is worth noting that even the call to
dma_sync_single_range_for_cpu() is expensive (300 ns), although less so
than dma_unmap_single(). Without it, the packet rate rises to 711 kpps
(+24%). Thus, on systems where coherency from device to CPU is
guaranteed by a snoop control unit, this patch should provide even more
gains, and rx_copybreak could probably be increased.

Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 44 ++++++++++++++++++++++++++++++-----
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 726a8d2..f5fc7a2 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -444,6 +444,8 @@ static int txq_number = 8;
 
 static int rxq_def;
 
+static int rx_copybreak __read_mostly = 256;
+
 #define MVNETA_DRIVER_NAME "mvneta"
 #define MVNETA_DRIVER_VERSION "1.0"
 
@@ -1463,22 +1465,51 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 		rx_done++;
 		rx_filled++;
 		rx_status = rx_desc->status;
+		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
 		data = (unsigned char *)rx_desc->buf_cookie;
 
 		if (!mvneta_rxq_desc_is_first_last(rx_status) ||
-		    (rx_status & MVNETA_RXD_ERR_SUMMARY) ||
-		    !(skb = build_skb(data, pp->frag_size > PAGE_SIZE ? 0 : pp->frag_size))) {
+		    (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
+		err_drop_frame:
 			dev->stats.rx_errors++;
 			mvneta_rx_error(pp, rx_desc);
 			/* leave the descriptor untouched */
 			continue;
 		}
 
-		dma_unmap_single(pp->dev->dev.parent, rx_desc->buf_phys_addr,
+		if (rx_bytes <= rx_copybreak) {
+			/* better copy a small frame and not unmap the DMA region */
+			skb = netdev_alloc_skb_ip_align(dev, rx_bytes);
+			if (unlikely(!skb))
+				goto err_drop_frame;
+
+			dma_sync_single_range_for_cpu(dev->dev.parent,
+			                              rx_desc->buf_phys_addr,
+			                              MVNETA_MH_SIZE + NET_SKB_PAD,
+			                              rx_bytes,
+			                              DMA_FROM_DEVICE);
+			memcpy(skb_put(skb, rx_bytes),
+			       data + MVNETA_MH_SIZE + NET_SKB_PAD,
+			       rx_bytes);
+
+			skb->protocol = eth_type_trans(skb, dev);
+			mvneta_rx_csum(pp, rx_status, skb);
+			napi_gro_receive(&pp->napi, skb);
+
+			rcvd_pkts++;
+			rcvd_bytes += rx_bytes;
+
+			/* leave the descriptor and buffer untouched */
+			continue;
+		}
+
+		skb = build_skb(data, pp->frag_size > PAGE_SIZE ? 0 : pp->frag_size);
+		if (!skb)
+			goto err_drop_frame;
+
+		dma_unmap_single(dev->dev.parent, rx_desc->buf_phys_addr,
 				 MVNETA_RX_BUF_SIZE(pp->pkt_size), DMA_FROM_DEVICE);
 
-		rx_bytes = rx_desc->data_size -
-			(ETH_FCS_LEN + MVNETA_MH_SIZE);
 		rcvd_pkts++;
 		rcvd_bytes += rx_bytes;
 
@@ -1495,7 +1526,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 		/* Refill processing */
 		err = mvneta_rx_refill(pp, rx_desc);
 		if (err) {
-			netdev_err(pp->dev, "Linux processing - Can't refill\n");
+			netdev_err(dev, "Linux processing - Can't refill\n");
 			rxq->missed++;
 			rx_filled--;
 		}
@@ -2945,3 +2976,4 @@ module_param(rxq_number, int, S_IRUGO);
 module_param(txq_number, int, S_IRUGO);
 
 module_param(rxq_def, int, S_IRUGO);
+module_param(rx_copybreak, int, S_IRUGO | S_IWUSR);
-- 
1.7.12.2.21.g234cd45.dirty


* [PATCH 12/13] net: mvneta: mvneta_tx_done_gbe() cleanups
  2014-01-16  7:20 [PATCH 00/13] Assorted mvneta fixes and improvements Willy Tarreau
                   ` (10 preceding siblings ...)
  2014-01-16  7:20 ` [PATCH 11/13] net: mvneta: implement rx_copybreak Willy Tarreau
@ 2014-01-16  7:20 ` Willy Tarreau
  2014-01-16  7:20 ` [PATCH 13/13] net: mvneta: make mvneta_txq_done() return void Willy Tarreau
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, Arnaud Ebalard

From: Arnaud Ebalard <arno@natisbad.org>

mvneta_tx_done_gbe()'s return value and third parameter are no longer
used. This patch changes the function prototype and removes a useless
variable where the function is called.

Reviewed-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Arnaud Ebalard <arno@natisbad.org>
---
 drivers/net/ethernet/marvell/mvneta.c | 17 ++++-------------
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index f5fc7a2..8c51501 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -1704,30 +1704,23 @@ static void mvneta_txq_done_force(struct mvneta_port *pp,
 /* Handle tx done - called in softirq context. The <cause_tx_done> argument
  * must be a valid cause according to MVNETA_TXQ_INTR_MASK_ALL.
  */
-static u32 mvneta_tx_done_gbe(struct mvneta_port *pp, u32 cause_tx_done,
-			      int *tx_todo)
+static void mvneta_tx_done_gbe(struct mvneta_port *pp, u32 cause_tx_done)
 {
 	struct mvneta_tx_queue *txq;
-	u32 tx_done = 0;
 	struct netdev_queue *nq;
 
-	*tx_todo = 0;
 	while (cause_tx_done) {
 		txq = mvneta_tx_done_policy(pp, cause_tx_done);
 
 		nq = netdev_get_tx_queue(pp->dev, txq->id);
 		__netif_tx_lock(nq, smp_processor_id());
 
-		if (txq->count) {
-			tx_done += mvneta_txq_done(pp, txq);
-			*tx_todo += txq->count;
-		}
+		if (txq->count)
+			mvneta_txq_done(pp, txq);
 
 		__netif_tx_unlock(nq);
 		cause_tx_done &= ~((1 << txq->id));
 	}
-
-	return tx_done;
 }
 
 /* Compute crc8 of the specified address, using a unique algorithm ,
@@ -1961,9 +1954,7 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
 
 	/* Release Tx descriptors */
 	if (cause_rx_tx & MVNETA_TX_INTR_MASK_ALL) {
-		int tx_todo = 0;
-
-		mvneta_tx_done_gbe(pp, (cause_rx_tx & MVNETA_TX_INTR_MASK_ALL), &tx_todo);
+		mvneta_tx_done_gbe(pp, (cause_rx_tx & MVNETA_TX_INTR_MASK_ALL));
 		cause_rx_tx &= ~MVNETA_TX_INTR_MASK_ALL;
 	}
 
-- 
1.7.12.2.21.g234cd45.dirty


* [PATCH 13/13] net: mvneta: make mvneta_txq_done() return void
  2014-01-16  7:20 [PATCH 00/13] Assorted mvneta fixes and improvements Willy Tarreau
                   ` (11 preceding siblings ...)
  2014-01-16  7:20 ` [PATCH 12/13] net: mvneta: mvneta_tx_done_gbe() cleanups Willy Tarreau
@ 2014-01-16  7:20 ` Willy Tarreau
  2014-01-16  7:22 ` [PATCH net-next 00/13] Assorted mvneta fixes and improvements Willy Tarreau
  2014-01-16 23:21 ` [PATCH " David Miller
  14 siblings, 0 replies; 24+ messages in thread
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, Arnaud Ebalard

From: Arnaud Ebalard <arno@natisbad.org>

The return value of mvneta_txq_done() is not used by its caller,
mvneta_tx_done_gbe(). This patch makes the function return void.

Reviewed-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Arnaud Ebalard <arno@natisbad.org>
---
 drivers/net/ethernet/marvell/mvneta.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 8c51501..f418f4f 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -1314,15 +1314,16 @@ static void mvneta_txq_bufs_free(struct mvneta_port *pp,
 }
 
 /* Handle end of transmission */
-static int mvneta_txq_done(struct mvneta_port *pp,
+static void mvneta_txq_done(struct mvneta_port *pp,
 			   struct mvneta_tx_queue *txq)
 {
 	struct netdev_queue *nq = netdev_get_tx_queue(pp->dev, txq->id);
 	int tx_done;
 
 	tx_done = mvneta_txq_sent_desc_proc(pp, txq);
-	if (tx_done == 0)
-		return tx_done;
+	if (!tx_done)
+		return;
+
 	mvneta_txq_bufs_free(pp, txq, tx_done);
 
 	txq->count -= tx_done;
@@ -1331,8 +1332,6 @@ static int mvneta_txq_done(struct mvneta_port *pp,
 		if (txq->size - txq->count >= MAX_SKB_FRAGS + 1)
 			netif_tx_wake_queue(nq);
 	}
-
-	return tx_done;
 }
 
 static void *mvneta_frag_alloc(const struct mvneta_port *pp)
-- 
1.7.12.2.21.g234cd45.dirty

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 00/13] Assorted mvneta fixes and improvements
  2014-01-16  7:20 [PATCH 00/13] Assorted mvneta fixes and improvements Willy Tarreau
                   ` (12 preceding siblings ...)
  2014-01-16  7:20 ` [PATCH 13/13] net: mvneta: make mvneta_txq_done() return void Willy Tarreau
@ 2014-01-16  7:22 ` Willy Tarreau
  2014-01-16 23:21 ` [PATCH " David Miller
  14 siblings, 0 replies; 24+ messages in thread
From: Willy Tarreau @ 2014-01-16  7:22 UTC (permalink / raw)
  To: davem
  Cc: netdev, Thomas Petazzoni, Gregory CLEMENT, Arnaud Ebalard,
	Eric Dumazet, Ben Hutchings

On Thu, Jan 16, 2014 at 08:20:06AM +0100, Willy Tarreau wrote:
> Hi,
> 
> this series provides some fixes for a number of issues met with the
> mvneta driver, then adds some improvements. Patches 1-5 are fixes
> and would be needed in 3.13 and likely -stable. The next ones are
> performance improvements and cleanups :

I just wanted to make it clear that the whole series is targeted at net-next.

Thanks,
Willy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH 11/13] net: mvneta: implement rx_copybreak
  2014-01-16  7:20 ` [PATCH 11/13] net: mvneta: implement rx_copybreak Willy Tarreau
@ 2014-01-16  9:14   ` David Laight
  2014-01-16  9:36     ` Willy Tarreau
  0 siblings, 1 reply; 24+ messages in thread
From: David Laight @ 2014-01-16  9:14 UTC (permalink / raw)
  To: 'Willy Tarreau', davem@davemloft.net
  Cc: netdev@vger.kernel.org, Thomas Petazzoni, Gregory CLEMENT

From: Willy Tarreau
> calling dma_map_single()/dma_unmap_single() is quite expensive compared
> to copying a small packet. So let's copy short frames and keep the buffers
> mapped. We set the limit to 256 bytes which seems to give good results both
> on the XP-GP board and on the AX3/4.

Have you tried something similar for transmits?
Copying short tx into a preallocated memory area that is dma_mapped
at driver initialisation?
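
For illustration, a rough sketch of the idea -- copy_area, copy_dma,
TX_COPY_SLOT and TX_COPYBREAK are hypothetical names, not actual
mvneta code:

    /* at init time: one coherent area, one slot per tx descriptor */
    txq->copy_area = dma_alloc_coherent(dev, txq->size * TX_COPY_SLOT,
                                        &txq->copy_dma, GFP_KERNEL);

    /* in the xmit path, for short frames: copy instead of mapping */
    if (skb->len <= TX_COPYBREAK) {
        int i = txq->txq_put_index;

        memcpy(txq->copy_area + i * TX_COPY_SLOT, skb->data, skb->len);
        tx_desc->buf_phys_addr = txq->copy_dma + i * TX_COPY_SLOT;
        /* no dma_map_single()/dma_unmap_single() for this frame */
    }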

	David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 11/13] net: mvneta: implement rx_copybreak
  2014-01-16  9:14   ` David Laight
@ 2014-01-16  9:36     ` Willy Tarreau
  2014-01-16 19:49       ` David Miller
  0 siblings, 1 reply; 24+ messages in thread
From: Willy Tarreau @ 2014-01-16  9:36 UTC (permalink / raw)
  To: David Laight
  Cc: davem@davemloft.net, netdev@vger.kernel.org, Thomas Petazzoni,
	Gregory CLEMENT

On Thu, Jan 16, 2014 at 09:14:00AM +0000, David Laight wrote:
> From: Willy Tarreau
> > calling dma_map_single()/dma_unmap_single() is quite expensive compared
> > to copying a small packet. So let's copy short frames and keep the buffers
> > mapped. We set the limit to 256 bytes which seems to give good results both
> > on the XP-GP board and on the AX3/4.
> 
> Have you tried something similar for transmits?
> Copying short tx into a preallocated memory area that is dma_mapped
> at driver initialisation?

Not yet, for lack of time, but I intend to do so. Both the NIC and the driver
are capable of running at line rate (1.488 Mpps) when filling descriptors
directly, so many improvements are still possible.

Regards,
Willy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 11/13] net: mvneta: implement rx_copybreak
  2014-01-16  9:36     ` Willy Tarreau
@ 2014-01-16 19:49       ` David Miller
  2014-01-16 20:07         ` Willy Tarreau
  2014-01-17  9:32         ` David Laight
  0 siblings, 2 replies; 24+ messages in thread
From: David Miller @ 2014-01-16 19:49 UTC (permalink / raw)
  To: w; +Cc: David.Laight, netdev, thomas.petazzoni, gregory.clement

From: Willy Tarreau <w@1wt.eu>
Date: Thu, 16 Jan 2014 10:36:40 +0100

> On Thu, Jan 16, 2014 at 09:14:00AM +0000, David Laight wrote:
>> From: Willy Tarreau
>> > calling dma_map_single()/dma_unmap_single() is quite expensive compared
>> > to copying a small packet. So let's copy short frames and keep the buffers
>> > mapped. We set the limit to 256 bytes which seems to give good results both
>> > on the XP-GP board and on the AX3/4.
>> 
>> Have you tried something similar for transmits?
>> Copying short tx into a preallocated memory area that is dma_mapped
>> at driver initialisation?
> 
> Not yet, for lack of time, but I intend to do so. Both the NIC and the driver
> are capable of running at line rate (1.488 Mpps) when filling descriptors
> directly, so many improvements are still possible.

I don't know if this is relevant for your hardware, but there is a trick
with IOMMUs that I was using, at least at one point, on sparc64.

I never flushed the IOMMU mappings explicitly on a dma_unmap call.

Instead I always allocated by moving the allocation cursor to higher
addresses in the IOMMU range, looking for free space.

Then, when I hit the end of the range and had to wrap the allocation
cursor back to the beginning, I flushed the entire IOMMU.

So for 99% of IOMMU mapping creation and teardown calls, I didn't even
touch the hardware.
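
In rough C, the allocation policy looks something like this (an
illustrative sketch -- struct iommu, iommu_flush_all() and the cursor
field are made up for the example, this is not the sparc64 code):

    /* allocate npages of IOMMU space; flush lazily, only on wrap */
    static unsigned long iommu_range_alloc(struct iommu *iommu,
                                           unsigned long npages)
    {
        unsigned long entry;

        if (iommu->cursor + npages > iommu->num_entries) {
            /* wrapped: one full flush instead of one per unmap */
            iommu_flush_all(iommu);
            iommu->cursor = 0;
        }
        /* a real allocator would also skip entries still in use */
        entry = iommu->cursor;
        iommu->cursor += npages;
        return entry;
    }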

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 11/13] net: mvneta: implement rx_copybreak
  2014-01-16 19:49       ` David Miller
@ 2014-01-16 20:07         ` Willy Tarreau
  2014-01-16 20:11           ` David Miller
  2014-01-17  9:28           ` David Laight
  2014-01-17  9:32         ` David Laight
  1 sibling, 2 replies; 24+ messages in thread
From: Willy Tarreau @ 2014-01-16 20:07 UTC (permalink / raw)
  To: David Miller; +Cc: David.Laight, netdev, thomas.petazzoni, gregory.clement

Hi David,

On Thu, Jan 16, 2014 at 11:49:05AM -0800, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 16 Jan 2014 10:36:40 +0100
> 
> > On Thu, Jan 16, 2014 at 09:14:00AM +0000, David Laight wrote:
> >> From: Willy Tarreau
> >> > calling dma_map_single()/dma_unmap_single() is quite expensive compared
> >> > to copying a small packet. So let's copy short frames and keep the buffers
> >> > mapped. We set the limit to 256 bytes which seems to give good results both
> >> > on the XP-GP board and on the AX3/4.
> >> 
> >> Have you tried something similar for transmits?
> >> Copying short tx into a preallocated memory area that is dma_mapped
> >> at driver initialisation?
> > 
> > Not yet, for lack of time, but I intend to do so. Both the NIC and the driver
> > are capable of running at line rate (1.488 Mpps) when filling descriptors
> > directly, so many improvements are still possible.
> 
> I don't know if this is relevant for your hardware, but there is a trick
> with IOMMUs that I was using, at least at one point, on sparc64.
> 
> I never flushed the IOMMU mappings explicitly on a dma_unmap call.
> 
> Instead I always allocated by moving the allocation cursor to higher
> addresses in the IOMMU range, looking for free space.
> 
> Then, when I hit the end of the range and had to wrap the allocation
> cursor back to the beginning, I flushed the entire IOMMU.
> 
> So for 99% of IOMMU mapping creation and teardown calls, I didn't even
> touch the hardware.

Ah, that's an interesting trick! We don't have an IOMMU on this platform,
so the mapping call itself is cheap. However, it relies on an I/O barrier
to wait for cache snoop completion. From what I read, that barrier is not
needed when syncing towards the device, but when syncing towards the CPU
on the Rx path, we're calling it for every packet, which is
counterproductive: I'd like to issue it only once when entering the rx
loop and avoid any further call during the loop. I measured that I could
save 300 ns per packet by doing this (by slightly modifying the DMA API
to add a new dma_io_barrier function).

I think it could be interesting (at least for this platform; I don't know
whether other platforms use cache-coherent DMA where only the end of
snooping needs to be verified) to have two distinct sets of DMA calls,
something like this:

    rx_loop(napi, budget)
    {
         dma_wait_for_snoop_completion(dev);
         ...
         for_each_rx_packet(pkt) {
             dma_sync_single_range_for_cpu_unless_completed(dev, pkt->addr);
             /* use packet */
         }
    }

The dma_wait_for_snoop_completion() call would issue this I/O barrier on
platforms which provide one, and dma_sync_single_range_for_cpu_unless_completed()
would perform the equivalent sync only when the barrier is not provided.
However, I don't like this approach because it requires adding many new
entries to struct dma_ops, whereas we could avoid that by adding just the
first entry and changing all drivers at once to issue that call before the
loop. But I'm having a hard time believing it's still possible to change
all drivers at once, especially with out-of-tree drivers... So such a
change would probably require adding a number of new calls to the DMA API.
I really don't know what the best solution would be, in fact :-/
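
To make the intent concrete, here is how the pair could look -- a pure
sketch assuming a new, hypothetical io_barrier member in struct
dma_map_ops (the struct dma_ops mentioned above); none of these
functions exists in the current DMA API:

    /* issue the platform's global snoop barrier once, if it has one */
    void dma_wait_for_snoop_completion(struct device *dev)
    {
        const struct dma_map_ops *ops = get_dma_ops(dev);

        if (ops->io_barrier)
            ops->io_barrier(dev);
    }

    /* per-buffer sync, skipped when one barrier already covered it */
    void dma_sync_single_range_for_cpu_unless_completed(struct device *dev,
            dma_addr_t addr, unsigned long offset, size_t size,
            enum dma_data_direction dir)
    {
        if (!get_dma_ops(dev)->io_barrier)
            dma_sync_single_range_for_cpu(dev, addr, offset, size, dir);
    }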

Thanks,
Willy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 11/13] net: mvneta: implement rx_copybreak
  2014-01-16 20:07         ` Willy Tarreau
@ 2014-01-16 20:11           ` David Miller
  2014-01-17  9:28           ` David Laight
  1 sibling, 0 replies; 24+ messages in thread
From: David Miller @ 2014-01-16 20:11 UTC (permalink / raw)
  To: w; +Cc: David.Laight, netdev, thomas.petazzoni, gregory.clement

From: Willy Tarreau <w@1wt.eu>
Date: Thu, 16 Jan 2014 21:07:57 +0100

> So such a change would probably require to add a number
> of new calls to the DMA API. I really don't know what's the best solution
> would be in fact :-/

Right, thanks for explaining your situation.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/13] Assorted mvneta fixes and improvements
  2014-01-16  7:20 [PATCH 00/13] Assorted mvneta fixes and improvements Willy Tarreau
                   ` (13 preceding siblings ...)
  2014-01-16  7:22 ` [PATCH net-next 00/13] Assorted mvneta fixes and improvements Willy Tarreau
@ 2014-01-16 23:21 ` David Miller
  14 siblings, 0 replies; 24+ messages in thread
From: David Miller @ 2014-01-16 23:21 UTC (permalink / raw)
  To: w; +Cc: netdev, thomas.petazzoni, gregory.clement, arno, eric.dumazet,
	ben

From: Willy Tarreau <w@1wt.eu>
Date: Thu, 16 Jan 2014 08:20:06 +0100

> this series provides some fixes for a number of issues met with the
> mvneta driver, then adds some improvements. Patches 1-5 are fixes
> and would be needed in 3.13 and likely -stable. The next ones are
> performance improvements and cleanups :

Series applied, thanks.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH 11/13] net: mvneta: implement rx_copybreak
  2014-01-16 20:07         ` Willy Tarreau
  2014-01-16 20:11           ` David Miller
@ 2014-01-17  9:28           ` David Laight
  2014-01-17  9:48             ` Willy Tarreau
  1 sibling, 1 reply; 24+ messages in thread
From: David Laight @ 2014-01-17  9:28 UTC (permalink / raw)
  To: 'Willy Tarreau', David Miller
  Cc: netdev@vger.kernel.org, thomas.petazzoni@free-electrons.com,
	gregory.clement@free-electrons.com

From: Willy Tarreau 
..
> Ah, that's an interesting trick! We don't have an IOMMU on this platform,
> so the mapping call itself is cheap. However, it relies on an I/O barrier
> to wait for cache snoop completion. From what I read, that barrier is not
> needed when syncing towards the device, but when syncing towards the CPU
> on the Rx path, we're calling it for every packet, which is
> counterproductive: I'd like to issue it only once when entering the rx
> loop and avoid any further call during the loop. I measured that I could
> save 300 ns per packet by doing this (by slightly modifying the DMA API
> to add a new dma_io_barrier function).
> 
> I think it could be interesting (at least for this platform; I don't know
> whether other platforms use cache-coherent DMA where only the end of
> snooping needs to be verified) to have two distinct sets of DMA calls,
> something like this:
> 
>     rx_loop(napi, budget)
>     {
>          dma_wait_for_snoop_completion(dev);
>          ...
>          for_each_rx_packet(pkt) {
>              dma_sync_single_range_for_cpu_unless_completed(dev, pkt->addr);
>              /* use packet */
>          }
>     }

You'd need to scan the rx ring for completed entries before waiting
for the snoop to complete - and then only process those entries.
(Or read a block of rx ring entries...)

Otherwise you might try to process an rx that finished after you
waited for the snoop to complete.

	David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH 11/13] net: mvneta: implement rx_copybreak
  2014-01-16 19:49       ` David Miller
  2014-01-16 20:07         ` Willy Tarreau
@ 2014-01-17  9:32         ` David Laight
  1 sibling, 0 replies; 24+ messages in thread
From: David Laight @ 2014-01-17  9:32 UTC (permalink / raw)
  To: 'David Miller', w@1wt.eu
  Cc: netdev@vger.kernel.org, thomas.petazzoni@free-electrons.com,
	gregory.clement@free-electrons.com

From: David Miller
> I don't know if this is relevant for your hardware, but there is a trick
> with IOMMUs that I was using, at least at one point, on sparc64.
> 
> I never flushed the IOMMU mappings explicitly on a dma_unmap call.
> 
> Instead I always allocated by moving the allocation cursor to higher
> addresses in the IOMMU range, looking for free space.
> 
> Then, when I hit the end of the range and had to wrap the allocation
> cursor back to the beginning, I flushed the entire IOMMU.
> 
> So for 99% of IOMMU mapping creation and teardown calls, I didn't even
> touch the hardware.

Another option that ought to work is splitting the allocation of
the I/O address space from the mapping of the addresses to physical
memory. So for rx you'd leave the rx ring pointers constant and
change the IOMMU to refer to the correct buffer.
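
Something along these lines -- iommu_remap_page(), iova_base and
RX_BUF_SIZE are hypothetical names, just to illustrate the idea:

    /* on rx refill: ring slot i keeps the same bus address forever,
     * only the IOMMU translation behind it is updated */
    dma_addr_t bus = rxq->iova_base + i * RX_BUF_SIZE;

    iommu_remap_page(iommu, bus, page_to_phys(new_page));
    rx_desc[i].buf_phys_addr = bus;  /* descriptor never rewritten */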

	David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 11/13] net: mvneta: implement rx_copybreak
  2014-01-17  9:28           ` David Laight
@ 2014-01-17  9:48             ` Willy Tarreau
  0 siblings, 0 replies; 24+ messages in thread
From: Willy Tarreau @ 2014-01-17  9:48 UTC (permalink / raw)
  To: David Laight
  Cc: David Miller, netdev@vger.kernel.org,
	thomas.petazzoni@free-electrons.com,
	gregory.clement@free-electrons.com

On Fri, Jan 17, 2014 at 09:28:13AM +0000, David Laight wrote:
> From: Willy Tarreau 
> ..
> > Ah, that's an interesting trick! We don't have an IOMMU on this platform,
> > so the mapping call itself is cheap. However, it relies on an I/O barrier
> > to wait for cache snoop completion. From what I read, that barrier is not
> > needed when syncing towards the device, but when syncing towards the CPU
> > on the Rx path, we're calling it for every packet, which is
> > counterproductive: I'd like to issue it only once when entering the rx
> > loop and avoid any further call during the loop. I measured that I could
> > save 300 ns per packet by doing this (by slightly modifying the DMA API
> > to add a new dma_io_barrier function).
> > 
> > I think it could be interesting (at least for this platform; I don't know
> > whether other platforms use cache-coherent DMA where only the end of
> > snooping needs to be verified) to have two distinct sets of DMA calls,
> > something like this:
> > 
> >     rx_loop(napi, budget)
> >     {
> >          dma_wait_for_snoop_completion(dev);
> >          ...
> >          for_each_rx_packet(pkt) {
> >              dma_sync_single_range_for_cpu_unless_completed(dev, pkt->addr);
> >              /* use packet */
> >          }
> >     }
> 
> You'd need to scan the rx ring for completed entries before waiting
> for the snoop to complete - and then only process those entries.
> (Or read a block of rx ring entries...)
> 
> Otherwise you might try to process an rx that finished after you
> waited for the snoop to complete.

That's already the case: the number of descriptors to process is computed
first thing when entering the function. That's why it's already safe.
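
For reference, that part of the driver's rx path looks roughly like
this (simplified; the dma_wait_for_snoop_completion() call is the
hypothetical one discussed above):

    /* number of descriptors the NIC has already filled, read once */
    rx_done = mvneta_rxq_busy_desc_num_get(pp, rxq);
    if (rx_todo > rx_done)
        rx_todo = rx_done;

    dma_wait_for_snoop_completion(dev);  /* one barrier for the batch */

    while (rx_done < rx_todo) {
        rx_done++;
        /* ... process a descriptor known complete before the barrier ... */
    }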

Willy

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2014-01-17  9:48 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-16  7:20 [PATCH 00/13] Assorted mvneta fixes and improvements Willy Tarreau
2014-01-16  7:20 ` [PATCH 01/13] net: mvneta: increase the 64-bit rx/tx stats out of the hot path Willy Tarreau
2014-01-16  7:20 ` [PATCH 02/13] net: mvneta: use per_cpu stats to fix an SMP lock up Willy Tarreau
2014-01-16  7:20 ` [PATCH 03/13] net: mvneta: do not schedule in mvneta_tx_timeout Willy Tarreau
2014-01-16  7:20 ` [PATCH 04/13] net: mvneta: add missing bit descriptions for interrupt masks and causes Willy Tarreau
2014-01-16  7:20 ` [PATCH 05/13] net: mvneta: replace Tx timer with a real interrupt Willy Tarreau
2014-01-16  7:20 ` [PATCH 06/13] net: mvneta: remove tests for impossible cases in the tx_done path Willy Tarreau
2014-01-16  7:20 ` [PATCH 07/13] net: mvneta: factor rx refilling code Willy Tarreau
2014-01-16  7:20 ` [PATCH 08/13] net: mvneta: simplify access to the rx descriptor status Willy Tarreau
2014-01-16  7:20 ` [PATCH 09/13] net: mvneta: prefetch next rx descriptor instead of current one Willy Tarreau
2014-01-16  7:20 ` [PATCH 10/13] net: mvneta: convert to build_skb() Willy Tarreau
2014-01-16  7:20 ` [PATCH 11/13] net: mvneta: implement rx_copybreak Willy Tarreau
2014-01-16  9:14   ` David Laight
2014-01-16  9:36     ` Willy Tarreau
2014-01-16 19:49       ` David Miller
2014-01-16 20:07         ` Willy Tarreau
2014-01-16 20:11           ` David Miller
2014-01-17  9:28           ` David Laight
2014-01-17  9:48             ` Willy Tarreau
2014-01-17  9:32         ` David Laight
2014-01-16  7:20 ` [PATCH 12/13] net: mvneta: mvneta_tx_done_gbe() cleanups Willy Tarreau
2014-01-16  7:20 ` [PATCH 13/13] net: mvneta: make mvneta_txq_done() return void Willy Tarreau
2014-01-16  7:22 ` [PATCH net-next 00/13] Assorted mvneta fixes and improvements Willy Tarreau
2014-01-16 23:21 ` [PATCH " David Miller
