Netdev List


* Re: [PATCH 0/3] virtio/vringh: kill off ACCESS_ONCE()
From: Christian Borntraeger @ 2016-11-25 16:49 UTC (permalink / raw)
  To: Peter Zijlstra, Mark Rutland
  Cc: Davidlohr Bueso, KVM list, dbueso, netdev, Boqun Feng,
	Michael S. Tsirkin, LKML, virtualization, Paul McKenney,
	Linus Torvalds, Dmitry Vyukov
In-Reply-To: <20161125161709.GQ3092@twins.programming.kicks-ass.net>

On 11/25/2016 05:17 PM, Peter Zijlstra wrote:
> On Fri, Nov 25, 2016 at 04:10:04PM +0000, Mark Rutland wrote:
>> On Fri, Nov 25, 2016 at 04:21:39PM +0100, Dmitry Vyukov wrote:
> 
>>> What are use cases for such primitive that won't be OK with "read once
>>> _and_ atomically"?
>>
>> I have none to hand.
> 
> Whatever triggers the __builtin_memcpy() paths, and even the size==8
> paths on 32bit.
> 
> You could put a WARN in there to easily find them.

There were several cases that I found while writing the *ONCE stuff.
For example, there are some 32bit ppc variants with 64bit PTEs. Same for
others (I think sparc). And the mm/ code is perfectly fine with these
PTE accesses NOT being done atomically.


> 
> The advantage of introducing the SINGLE_{LOAD,STORE}() helpers is that
> they compile-time validate that the size is 'right' and can runtime check
> alignment constraints.
> 
> IE, they are strictly stronger than {READ,WRITE}_ONCE().
> 


* Re: [PATCH 0/3] virtio/vringh: kill off ACCESS_ONCE()
From: Mark Rutland @ 2016-11-25 16:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dmitry Vyukov, Boqun Feng, Christian Borntraeger,
	Michael S. Tsirkin, LKML, Davidlohr Bueso, dbueso, jasowang,
	KVM list, netdev, Paul McKenney, virtualization, Linus Torvalds
In-Reply-To: <20161125161709.GQ3092@twins.programming.kicks-ass.net>

On Fri, Nov 25, 2016 at 05:17:09PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 25, 2016 at 04:10:04PM +0000, Mark Rutland wrote:
> > On Fri, Nov 25, 2016 at 04:21:39PM +0100, Dmitry Vyukov wrote:
> 
> > > What are use cases for such primitive that won't be OK with "read once
> > > _and_ atomically"?
> > 
> > I have none to hand.
> 
> Whatever triggers the __builtin_memcpy() paths, and even the size==8
> paths on 32bit.

Lockref, per:

http://lkml.iu.edu/hypermail/linux/kernel/1503.3/02294.html

In that specific case, a torn value just means we'll retry until we get
a non torn value, due to the cmpxchg. For that case, all we need is the
value to be reloaded per invocation of READ_ONCE().
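
A minimal sketch of that retry pattern in user-space C11, not the kernel's lockref code. The names `ref_sketch`, `ref_sketch_get_from()` and `ref_sketch_get()` are hypothetical, and the lock/count layout is an assumption made for illustration. The point is only that a bad initial snapshot is harmless, because a failed compare-exchange hands back the value that is really in memory:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a lockref: the lock word lives in the high
 * 32 bits and the reference count in the low 32 bits of one 64-bit word. */
struct ref_sketch {
	_Atomic uint64_t lock_count;
};

/* Try to take a reference starting from 'snapshot', which may be stale
 * or torn. Returns false if the lock half looks held (real code would
 * then fall back to taking the lock). */
static bool ref_sketch_get_from(struct ref_sketch *r, uint64_t snapshot)
{
	uint64_t old = snapshot;

	for (;;) {
		if (old >> 32)
			return false;
		/* On failure the compare-exchange overwrites 'old' with the
		 * value really in memory, so a bad snapshot only costs one
		 * extra trip around the loop. */
		if (atomic_compare_exchange_weak(&r->lock_count, &old, old + 1))
			return true;
	}
}

/* Normal entry point: the snapshot just has to be re-read on every call. */
static bool ref_sketch_get(struct ref_sketch *r)
{
	return ref_sketch_get_from(r,
		atomic_load_explicit(&r->lock_count, memory_order_relaxed));
}
```

Passing a deliberately wrong snapshot to `ref_sketch_get_from()` still ends with the count incremented exactly once, which is the property relied on here.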

This guy seems to have the full story:

http://lkml.iu.edu/hypermail/linux/kernel/1503.3/02389.html
http://lkml.iu.edu/hypermail/linux/kernel/1503.3/02558.html

Thanks,
Mark.


* Aw: Re: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: Lino Sanfilippo @ 2016-11-25 16:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Laight, David Miller, netdev, Tariq Toukan
In-Reply-To: <1480090590.8455.549.camel@edumazet-glaptop3.roam.corp.google.com>

Hi,

> 
> The READ_ONCE() calls document the fact that no lock is taken to fetch
> the stats, while other CPUs might be changing them.
> 
> I had no answer yet from https://patchwork.ozlabs.org/patch/698449/
> 
> So I thought it was not needed to explain this in the changelog, given
> that it apparently is one of the few things that can block someone from
> understanding one of my changes :/
> 
> Apparently nobody really understands the purpose of READ_ONCE(); it is
> really a pity we have to explain this over and over.
> 

Even at the risk of showing once more a lack of understanding of READ_ONCE():
Doesn't a READ_ONCE() have to be paired with some kind of WRITE_ONCE()?
Furthermore, there are quite a few network drivers that ensure visibility of
the descriptor queue indices between the xmit and xmit completion functions
by means of smp barriers. Could all these drivers theoretically be adjusted
to use READ_ONCE() and WRITE_ONCE() for the indices instead?

Regards,
Lino


* Re: [PATCH 0/3] virtio/vringh: kill off ACCESS_ONCE()
From: Peter Zijlstra @ 2016-11-25 16:17 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Davidlohr Bueso, KVM list, Michael S. Tsirkin, netdev, Boqun Feng,
	dbueso, LKML, virtualization, Paul McKenney, Linus Torvalds,
	Dmitry Vyukov
In-Reply-To: <20161125161004.GA30181@leverpostej>

On Fri, Nov 25, 2016 at 04:10:04PM +0000, Mark Rutland wrote:
> On Fri, Nov 25, 2016 at 04:21:39PM +0100, Dmitry Vyukov wrote:

> > What are use cases for such primitive that won't be OK with "read once
> > _and_ atomically"?
> 
> I have none to hand.

Whatever triggers the __builtin_memcpy() paths, and even the size==8
paths on 32bit.

You could put a WARN in there to easily find them.

The advantage of introducing the SINGLE_{LOAD,STORE}() helpers is that
they compile-time validate that the size is 'right' and can runtime check
alignment constraints.

IE, they are strictly stronger than {READ,WRITE}_ONCE().
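
A rough sketch of what such helpers could look like, assuming GNU C (`__typeof__` and statement expressions, as kernel code uses). These macros are hypothetical and not an existing kernel API; `SINGLE_SIZE_OK()` is a helper name invented here. The size check fails the build for anything that is not 1, 2, 4 or sizeof(long) bytes, so an 8-byte access on a 32bit build is rejected, and alignment is checked at run time:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers sketched from the description above. Whether the
 * volatile access really becomes one instruction is up to the compiler
 * and architecture; the macros only rule out the obviously wrong cases. */
#define SINGLE_SIZE_OK(p)						\
	(sizeof(*(p)) == 1 || sizeof(*(p)) == 2 ||			\
	 sizeof(*(p)) == 4 || sizeof(*(p)) == sizeof(long))

#define SINGLE_LOAD(p)							\
({									\
	_Static_assert(SINGLE_SIZE_OK(p), "not a single-access size");	\
	assert((uintptr_t)(p) % sizeof(*(p)) == 0);			\
	__typeof__(*(p)) single_val_ =					\
		*(const volatile __typeof__(*(p)) *)(p);		\
	single_val_;							\
})

#define SINGLE_STORE(p, v)						\
do {									\
	_Static_assert(SINGLE_SIZE_OK(p), "not a single-access size");	\
	assert((uintptr_t)(p) % sizeof(*(p)) == 0);			\
	*(volatile __typeof__(*(p)) *)(p) = (v);			\
} while (0)
```

Something like `SINGLE_LOAD(&some_struct)` on a 12-byte struct would then fail to compile instead of silently going down a memcpy-style path.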


* Re: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: Eric Dumazet @ 2016-11-25 16:16 UTC (permalink / raw)
  To: David Laight; +Cc: David Miller, netdev, Tariq Toukan
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DB0228849@AcuExch.aculab.com>

On Fri, 2016-11-25 at 16:03 +0000, David Laight wrote:
> From: Of Eric Dumazet
> > Sent: 25 November 2016 15:46
> > mlx4 stats are chaotic because a deferred work queue is responsible
> > for updating them every 250 ms.
> > 
> > Even sampling stats every one second with "sar -n DEV 1" gives
> > variations like the following :
> ...
> > This patch allows rx/tx bytes/packets counters to be folded at the
> > time we need stats.
> > 
> > We now can fetch stats every 1 ms if we want to check NIC behavior
> > on a small time window. It is also easier to detect anomalies.
> ...
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Tariq Toukan <tariqt@mellanox.com>
> ...
> >  	for (i = 0; i < priv->rx_ring_num; i++) {
> > -		stats->rx_packets += priv->rx_ring[i]->packets;
> > -		stats->rx_bytes += priv->rx_ring[i]->bytes;
> > -		sw_rx_dropped += priv->rx_ring[i]->dropped;
> > -		priv->port_stats.rx_chksum_good += priv->rx_ring[i]->csum_ok;
> > -		priv->port_stats.rx_chksum_none += priv->rx_ring[i]->csum_none;
> > -		priv->port_stats.rx_chksum_complete += priv->rx_ring[i]->csum_complete;
> > -		priv->xdp_stats.rx_xdp_drop    += priv->rx_ring[i]->xdp_drop;
> > -		priv->xdp_stats.rx_xdp_tx      += priv->rx_ring[i]->xdp_tx;
> > -		priv->xdp_stats.rx_xdp_tx_full += priv->rx_ring[i]->xdp_tx_full;
> > +		const struct mlx4_en_rx_ring *ring = priv->rx_ring[i];
> > +
> > +		sw_rx_dropped			+= READ_ONCE(ring->dropped);
> > +		priv->port_stats.rx_chksum_good += READ_ONCE(ring->csum_ok);
> > +		priv->port_stats.rx_chksum_none += READ_ONCE(ring->csum_none);
> > +		priv->port_stats.rx_chksum_complete += READ_ONCE(ring->csum_complete);
> > +		priv->xdp_stats.rx_xdp_drop	+= READ_ONCE(ring->xdp_drop);
> > +		priv->xdp_stats.rx_xdp_tx	+= READ_ONCE(ring->xdp_tx);
> > +		priv->xdp_stats.rx_xdp_tx_full	+= READ_ONCE(ring->xdp_tx_full);
> 
> This chunk (and the one after) seem to be adding READ_ONCE() and don't
> seem to be related to the commit message.

The READ_ONCE() calls document the fact that no lock is taken to fetch
the stats, while other CPUs might be changing them.
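
As a rough user-space illustration of that pattern (not the driver code): the reader below sums per-ring counters without taking any lock, and each counter is loaded exactly once through a volatile cast. `READ_ONCE_SKETCH()`, `ring_sketch`, `totals_sketch` and `fold_rings()` are hypothetical names, and the macro is a simplified stand-in for the kernel's READ_ONCE() that assumes GNU C `__typeof__`:

```c
#include <stddef.h>

/* Simplified stand-in for the kernel macro: one load, never re-fetched
 * or merged with another access by the compiler. */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical per-ring counters, updated by the datapath elsewhere. */
struct ring_sketch {
	unsigned long packets;
	unsigned long bytes;
};

struct totals_sketch {
	unsigned long packets;
	unsigned long bytes;
};

/* Fold all rings into one total at the time the stats are requested. */
static struct totals_sketch fold_rings(struct ring_sketch *const *rings,
				       size_t n)
{
	struct totals_sketch t = { 0, 0 };

	for (size_t i = 0; i < n; i++) {
		const struct ring_sketch *ring = rings[i];

		t.packets += READ_ONCE_SKETCH(ring->packets);
		t.bytes   += READ_ONCE_SKETCH(ring->bytes);
	}
	return t;
}
```

The totals can still be slightly out of step with each other, since packets and bytes are read at different instants; the annotation only marks the lockless reads and stops the compiler from reloading a counter.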

I had no answer yet from https://patchwork.ozlabs.org/patch/698449/

So I thought it was not needed to explain this in the changelog, given
that it apparently is one of the few things that can block someone from
understanding one of my changes :/

Apparently nobody really understands the purpose of READ_ONCE(); it is
really a pity we have to explain this over and over.


* Re: [PATCH] net: dsa: fix unbalanced dsa_switch_tree reference counting
From: Andrew Lunn @ 2016-11-25 16:14 UTC (permalink / raw)
  To: Nikita Yushchenko
  Cc: Vivien Didelot, Florian Fainelli, David S. Miller, John Crispin,
	Wei Yongjun, netdev, linux-kernel, Chris Healy
In-Reply-To: <1480090133-30412-1-git-send-email-nikita.yoush@cogentembedded.com>

On Fri, Nov 25, 2016 at 07:08:53PM +0300, Nikita Yushchenko wrote:
> _dsa_register_switch() gets a dsa_switch_tree object either via
> dsa_get_dst() or via dsa_add_dst(). The former path does not increase the
> kref of the returned object (so the caller does not own a reference),
> while the latter path creates a new object (so the caller owns
> a reference).
> 
> The rest of _dsa_register_switch() assumes that it owns a reference, and
> calls dsa_put_dst().
> 
> This causes a use-after-free if the first switch in the tree initialized
> successfully but the second failed to initialize. In particular, the freed
> dsa_switch_tree object is left referenced by the switch that was initialized,
> and later access to sysfs attributes of that switch causes an OOPS.
> 
> To fix this, add a kref_get() call to dsa_get_dst().
> 
> Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>

Hi Nikita

What tree is this against? It should be net. Please make the subject
[patch net] ... so it is clear what tree this is for.

And it should have a fixes-tag

Fixes: 83c0afaec7b7 ("net: dsa: Add new binding implementation")

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

> ---
>  net/dsa/dsa2.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
> index f8a7d9aab437..5fff951a0a49 100644
> --- a/net/dsa/dsa2.c
> +++ b/net/dsa/dsa2.c
> @@ -28,8 +28,10 @@ static struct dsa_switch_tree *dsa_get_dst(u32 tree)
>  	struct dsa_switch_tree *dst;
>  
>  	list_for_each_entry(dst, &dsa_switch_trees, list)
> -		if (dst->tree == tree)
> +		if (dst->tree == tree) {
> +			kref_get(&dst->refcount);
>  			return dst;
> +		}
>  	return NULL;
>  }
>  
> -- 
> 2.1.4
> 


* Re: [patch v2 net-next] sfc: remove unneeded variable
From: Edward Cree @ 2016-11-25 16:12 UTC (permalink / raw)
  To: Dan Carpenter, Solarflare linux maintainers
  Cc: Bert Kenward, netdev, kernel-janitors
In-Reply-To: <20161125104304.GA5938@mwanda>

On 25/11/16 10:43, Dan Carpenter wrote:
> We don't use ->heap_buf after commit 46d1efd852cc ("sfc: remove Software
> TSO") so let's remove the last traces.
>
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Edward Cree <ecree@solarflare.com>
> diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
> index f97f828..18aceb2 100644
> --- a/drivers/net/ethernet/sfc/net_driver.h
> +++ b/drivers/net/ethernet/sfc/net_driver.h
> @@ -139,8 +139,6 @@ struct efx_special_buffer {
>   * struct efx_tx_buffer - buffer state for a TX descriptor
>   * @skb: When @flags & %EFX_TX_BUF_SKB, the associated socket buffer to be
>   *	freed when descriptor completes
> - * @heap_buf: When @flags & %EFX_TX_BUF_HEAP, the associated heap buffer to be
> - *	freed when descriptor completes.
>   * @option: When @flags & %EFX_TX_BUF_OPTION, a NIC-specific option descriptor.
>   * @dma_addr: DMA address of the fragment.
>   * @flags: Flags for allocation and DMA mapping type
> @@ -151,10 +149,7 @@ struct efx_special_buffer {
>   * Only valid if @unmap_len != 0.
>   */
>  struct efx_tx_buffer {
> -	union {
> -		const struct sk_buff *skb;
> -		void *heap_buf;
> -	};
> +	const struct sk_buff *skb;
>  	union {
>  		efx_qword_t option;
>  		dma_addr_t dma_addr;
> @@ -166,7 +161,6 @@ struct efx_tx_buffer {
>  };
>  #define EFX_TX_BUF_CONT		1	/* not last descriptor of packet */
>  #define EFX_TX_BUF_SKB		2	/* buffer is last part of skb */
> -#define EFX_TX_BUF_HEAP		4	/* buffer was allocated with kmalloc() */
>  #define EFX_TX_BUF_MAP_SINGLE	8	/* buffer was mapped with dma_map_single() */
>  #define EFX_TX_BUF_OPTION	0x10	/* empty buffer for option descriptor */
>  
> diff --git a/drivers/net/ethernet/sfc/tx.c b/drivers/net/ethernet/sfc/tx.c
> index 1aa728c..bb07034 100644
> --- a/drivers/net/ethernet/sfc/tx.c
> +++ b/drivers/net/ethernet/sfc/tx.c
> @@ -84,8 +84,6 @@ static void efx_dequeue_buffer(struct efx_tx_queue *tx_queue,
>  		netif_vdbg(tx_queue->efx, tx_done, tx_queue->efx->net_dev,
>  			   "TX queue %d transmission id %x complete\n",
>  			   tx_queue->queue, tx_queue->read_count);
> -	} else if (buffer->flags & EFX_TX_BUF_HEAP) {
> -		kfree(buffer->heap_buf);
>  	}
>  
>  	buffer->len = 0;


* Re: [PATCH 0/3] virtio/vringh: kill off ACCESS_ONCE()
From: Mark Rutland @ 2016-11-25 16:10 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Boqun Feng, Peter Zijlstra, Christian Borntraeger,
	Michael S. Tsirkin, LKML, Davidlohr Bueso, dbueso, jasowang,
	KVM list, netdev, Paul McKenney, virtualization, Linus Torvalds
In-Reply-To: <CACT4Y+ZpzFhmSqOG+dG7QHSNObWatLOjPjNK2BznnRLeRQpF8A@mail.gmail.com>

On Fri, Nov 25, 2016 at 04:21:39PM +0100, Dmitry Vyukov wrote:
> 
> READ/WRITE_ONCE imply atomicity. Even if their names don't spell it (a
> function name doesn't have to spell all of its guarantees). Most of
> the uses of READ/WRITE_ONCE will be broken if they are not atomic.

In practice, this is certainly the assumption made by many/most users of
the *_ONCE() accessors.

Looking again, Linus does seem to agree that word-sized accesses should
result in single instructions (and be single-copy atomic) [1], so in
contrast to [2], that's clearly *part* of the point of the *_ONCE()
accessors...

> "Read once but not necessarily atomically" is a very subtle primitive
> which is very easy to misuse.

I agree. Unfortunately, Linus does not appear to [2].

> What are use cases for such primitive that won't be OK with "read once
> _and_ atomically"?

I have none to hand.

Thanks,
Mark.

[1] http://lkml.iu.edu/hypermail/linux/kernel/1503.3/02674.html
[2] http://lkml.iu.edu/hypermail/linux/kernel/1503.3/02670.html


* [PATCH] net: dsa: fix unbalanced dsa_switch_tree reference counting
From: Nikita Yushchenko @ 2016-11-25 16:08 UTC (permalink / raw)
  To: Andrew Lunn, Vivien Didelot, Florian Fainelli, David S. Miller,
	John Crispin, Wei Yongjun, netdev, linux-kernel
  Cc: Chris Healy, Nikita Yushchenko

_dsa_register_switch() gets a dsa_switch_tree object either via
dsa_get_dst() or via dsa_add_dst(). The former path does not increase the
kref of the returned object (so the caller does not own a reference),
while the latter path creates a new object (so the caller owns
a reference).

The rest of _dsa_register_switch() assumes that it owns a reference, and
calls dsa_put_dst().

This causes a use-after-free if the first switch in the tree initialized
successfully but the second failed to initialize. In particular, the freed
dsa_switch_tree object is left referenced by the switch that was initialized,
and later access to sysfs attributes of that switch causes an OOPS.

To fix this, add a kref_get() call to dsa_get_dst().
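
The ownership rule can be illustrated with a small single-threaded user-space sketch. It does not use the kernel's kref API; `tree_sketch`, `tree_get()`, `tree_add()` and `tree_put()` are hypothetical names, and the plain int counter is an assumption that only holds without concurrency. Both ways of obtaining a tree give the caller a reference, so a put on the error path of a second switch cannot free a tree the first switch still uses:

```c
#include <stdlib.h>

struct tree_sketch {
	unsigned int id;
	int refcount;		/* plain int: single-threaded sketch only */
	struct tree_sketch *next;
};

static struct tree_sketch *trees;	/* head of the list of live trees */

/* Lookup that hands the caller its own reference. */
static struct tree_sketch *tree_get(unsigned int id)
{
	for (struct tree_sketch *t = trees; t; t = t->next)
		if (t->id == id) {
			t->refcount++;
			return t;
		}
	return NULL;
}

/* Creation path: the new object starts with one reference for the caller. */
static struct tree_sketch *tree_add(unsigned int id)
{
	struct tree_sketch *t = calloc(1, sizeof(*t));

	if (!t)
		return NULL;
	t->id = id;
	t->refcount = 1;
	t->next = trees;
	trees = t;
	return t;
}

/* Drop one reference; unlink and free the tree when the last one goes. */
static void tree_put(struct tree_sketch *t)
{
	if (--t->refcount > 0)
		return;
	for (struct tree_sketch **pp = &trees; *pp; pp = &(*pp)->next)
		if (*pp == t) {
			*pp = t->next;
			break;
		}
	free(t);
}
```

Without the increment in `tree_get()`, the put issued after a failed second registration would drop the first switch's reference and free the tree underneath it.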

Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
---
 net/dsa/dsa2.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index f8a7d9aab437..5fff951a0a49 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -28,8 +28,10 @@ static struct dsa_switch_tree *dsa_get_dst(u32 tree)
 	struct dsa_switch_tree *dst;

 	list_for_each_entry(dst, &dsa_switch_trees, list)
-		if (dst->tree == tree)
+		if (dst->tree == tree) {
+			kref_get(&dst->refcount);
 			return dst;
+		}
 	return NULL;
 }

-- 
2.1.4


* RE: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: David Laight @ 2016-11-25 16:03 UTC (permalink / raw)
  To: 'Eric Dumazet', David Miller; +Cc: netdev, Tariq Toukan
In-Reply-To: <1480088780.8455.543.camel@edumazet-glaptop3.roam.corp.google.com>

From: Of Eric Dumazet
> Sent: 25 November 2016 15:46
> mlx4 stats are chaotic because a deferred work queue is responsible
> for updating them every 250 ms.
> 
> Even sampling stats every one second with "sar -n DEV 1" gives
> variations like the following :
...
> This patch allows rx/tx bytes/packets counters to be folded at the
> time we need stats.
> 
> We now can fetch stats every 1 ms if we want to check NIC behavior
> on a small time window. It is also easier to detect anomalies.
...
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
...
>  	for (i = 0; i < priv->rx_ring_num; i++) {
> -		stats->rx_packets += priv->rx_ring[i]->packets;
> -		stats->rx_bytes += priv->rx_ring[i]->bytes;
> -		sw_rx_dropped += priv->rx_ring[i]->dropped;
> -		priv->port_stats.rx_chksum_good += priv->rx_ring[i]->csum_ok;
> -		priv->port_stats.rx_chksum_none += priv->rx_ring[i]->csum_none;
> -		priv->port_stats.rx_chksum_complete += priv->rx_ring[i]->csum_complete;
> -		priv->xdp_stats.rx_xdp_drop    += priv->rx_ring[i]->xdp_drop;
> -		priv->xdp_stats.rx_xdp_tx      += priv->rx_ring[i]->xdp_tx;
> -		priv->xdp_stats.rx_xdp_tx_full += priv->rx_ring[i]->xdp_tx_full;
> +		const struct mlx4_en_rx_ring *ring = priv->rx_ring[i];
> +
> +		sw_rx_dropped			+= READ_ONCE(ring->dropped);
> +		priv->port_stats.rx_chksum_good += READ_ONCE(ring->csum_ok);
> +		priv->port_stats.rx_chksum_none += READ_ONCE(ring->csum_none);
> +		priv->port_stats.rx_chksum_complete += READ_ONCE(ring->csum_complete);
> +		priv->xdp_stats.rx_xdp_drop	+= READ_ONCE(ring->xdp_drop);
> +		priv->xdp_stats.rx_xdp_tx	+= READ_ONCE(ring->xdp_tx);
> +		priv->xdp_stats.rx_xdp_tx_full	+= READ_ONCE(ring->xdp_tx_full);

This chunk (and the one after) seem to be adding READ_ONCE() and don't
seem to be related to the commit message.

	David



* [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: Eric Dumazet @ 2016-11-25 15:46 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Tariq Toukan

From: Eric Dumazet <edumazet@google.com>

mlx4 stats are chaotic because a deferred work queue is responsible
for updating them every 250 ms.

Even sampling stats every one second with "sar -n DEV 1" gives
variations like the following :

lpaa23:~# sar -n DEV 1 10 | grep eth0 | cut -c1-65
07:39:22         eth0 146877.00 3265554.00   9467.15 4828168.50  
07:39:23         eth0 146587.00 3260329.00   9448.15 4820445.98  
07:39:24         eth0 146894.00 3259989.00   9468.55 4819943.26  
07:39:25         eth0 110368.00 2454497.00   7113.95 3629012.17  <<>>
07:39:26         eth0 146563.00 3257502.00   9447.25 4816266.23  
07:39:27         eth0 145678.00 3258292.00   9389.79 4817414.39  
07:39:28         eth0 145268.00 3253171.00   9363.85 4809852.46  
07:39:29         eth0 146439.00 3262185.00   9438.97 4823172.48  
07:39:30         eth0 146758.00 3264175.00   9459.94 4826124.13  
07:39:31         eth0 146843.00 3256903.00   9465.44 4815381.97  
Average:         eth0 142827.50 3179259.70   9206.30 4700578.16  

This patch allows rx/tx bytes/packets counters to be folded at the
time we need stats.

We now can fetch stats every 1 ms if we want to check NIC behavior
on a small time window. It is also easier to detect anomalies.

lpaa23:~# sar -n DEV 1 10 | grep eth0 | cut -c1-65
07:42:50         eth0 142915.00 3177696.00   9212.06 4698270.42  
07:42:51         eth0 143741.00 3200232.00   9265.15 4731593.02  
07:42:52         eth0 142781.00 3171600.00   9202.92 4689260.16  
07:42:53         eth0 143835.00 3192932.00   9271.80 4720761.39  
07:42:54         eth0 141922.00 3165174.00   9147.64 4679759.21  
07:42:55         eth0 142993.00 3207038.00   9216.78 4741653.05  
07:42:56         eth0 141394.06 3154335.64   9113.85 4663731.73  
07:42:57         eth0 141850.00 3161202.00   9144.48 4673866.07  
07:42:58         eth0 143439.00 3180736.00   9246.05 4702755.35  
07:42:59         eth0 143501.00 3210992.00   9249.99 4747501.84  
Average:         eth0 142835.66 3182165.93   9206.98 4704874.08  

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |    2 
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  |    1 
 drivers/net/ethernet/mellanox/mlx4/en_port.c    |   77 +++++++++-----
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |    1 
 4 files changed, 58 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index 487a58f9c192896852fef271b6cce9bde132deb7..d9c9f86a30df953fa555934c5406057dcaf28960 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -367,6 +367,8 @@ static void mlx4_en_get_ethtool_stats(struct net_device *dev,
 
 	spin_lock_bh(&priv->stats_lock);
 
+	mlx4_en_fold_software_stats(dev);
+
 	for (i = 0; i < NUM_MAIN_STATS; i++, bitmap_iterator_inc(&it))
 		if (bitmap_iterator_test(&it))
 			data[index++] = ((unsigned long *)&dev->stats)[i];
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 9018bb1b2e12142e048281a9d28ddf95e0023a61..d28d841db23ce885d2011877a156bacf23f65afe 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1321,6 +1321,7 @@ mlx4_en_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 
 	spin_lock_bh(&priv->stats_lock);
+	mlx4_en_fold_software_stats(dev);
 	netdev_stats_to_stats64(stats, &dev->stats);
 	spin_unlock_bh(&priv->stats_lock);
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_port.c b/drivers/net/ethernet/mellanox/mlx4/en_port.c
index 1eb4c1e10bad1dad26049876acf107a2073a6ab1..c6c4f1238923e09eced547454b86c68720292859 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_port.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_port.c
@@ -147,6 +147,39 @@ static unsigned long en_stats_adder(__be64 *start, __be64 *next, int num)
 	return ret;
 }
 
+void mlx4_en_fold_software_stats(struct net_device *dev)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct mlx4_en_dev *mdev = priv->mdev;
+	unsigned long packets, bytes;
+	int i;
+
+	if (mlx4_is_master(mdev->dev))
+		return;
+
+	packets = 0;
+	bytes = 0;
+	for (i = 0; i < priv->rx_ring_num; i++) {
+		const struct mlx4_en_rx_ring *ring = priv->rx_ring[i];
+
+		packets += READ_ONCE(ring->packets);
+		bytes   += READ_ONCE(ring->bytes);
+	}
+	dev->stats.rx_packets = packets;
+	dev->stats.rx_bytes = bytes;
+
+	packets = 0;
+	bytes = 0;
+	for (i = 0; i < priv->tx_ring_num[TX]; i++) {
+		const struct mlx4_en_tx_ring *ring = priv->tx_ring[TX][i];
+
+		packets += READ_ONCE(ring->packets);
+		bytes   += READ_ONCE(ring->bytes);
+	}
+	dev->stats.tx_packets = packets;
+	dev->stats.tx_bytes = bytes;
+}
+
 int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset)
 {
 	struct mlx4_counter tmp_counter_stats;
@@ -159,6 +192,7 @@ int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset)
 	u64 in_mod = reset << 8 | port;
 	int err;
 	int i, counter_index;
+	unsigned long sw_tx_dropped = 0;
 	unsigned long sw_rx_dropped = 0;
 
 	mailbox = mlx4_alloc_cmd_mailbox(mdev->dev);
@@ -174,8 +208,8 @@ int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset)
 
 	spin_lock_bh(&priv->stats_lock);
 
-	stats->rx_packets = 0;
-	stats->rx_bytes = 0;
+	mlx4_en_fold_software_stats(dev);
+
 	priv->port_stats.rx_chksum_good = 0;
 	priv->port_stats.rx_chksum_none = 0;
 	priv->port_stats.rx_chksum_complete = 0;
@@ -183,19 +217,16 @@ int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset)
 	priv->xdp_stats.rx_xdp_tx      = 0;
 	priv->xdp_stats.rx_xdp_tx_full = 0;
 	for (i = 0; i < priv->rx_ring_num; i++) {
-		stats->rx_packets += priv->rx_ring[i]->packets;
-		stats->rx_bytes += priv->rx_ring[i]->bytes;
-		sw_rx_dropped += priv->rx_ring[i]->dropped;
-		priv->port_stats.rx_chksum_good += priv->rx_ring[i]->csum_ok;
-		priv->port_stats.rx_chksum_none += priv->rx_ring[i]->csum_none;
-		priv->port_stats.rx_chksum_complete += priv->rx_ring[i]->csum_complete;
-		priv->xdp_stats.rx_xdp_drop    += priv->rx_ring[i]->xdp_drop;
-		priv->xdp_stats.rx_xdp_tx      += priv->rx_ring[i]->xdp_tx;
-		priv->xdp_stats.rx_xdp_tx_full += priv->rx_ring[i]->xdp_tx_full;
+		const struct mlx4_en_rx_ring *ring = priv->rx_ring[i];
+
+		sw_rx_dropped			+= READ_ONCE(ring->dropped);
+		priv->port_stats.rx_chksum_good += READ_ONCE(ring->csum_ok);
+		priv->port_stats.rx_chksum_none += READ_ONCE(ring->csum_none);
+		priv->port_stats.rx_chksum_complete += READ_ONCE(ring->csum_complete);
+		priv->xdp_stats.rx_xdp_drop	+= READ_ONCE(ring->xdp_drop);
+		priv->xdp_stats.rx_xdp_tx	+= READ_ONCE(ring->xdp_tx);
+		priv->xdp_stats.rx_xdp_tx_full	+= READ_ONCE(ring->xdp_tx_full);
 	}
-	stats->tx_packets = 0;
-	stats->tx_bytes = 0;
-	stats->tx_dropped = 0;
 	priv->port_stats.tx_chksum_offload = 0;
 	priv->port_stats.queue_stopped = 0;
 	priv->port_stats.wake_queue = 0;
@@ -205,15 +236,14 @@ int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset)
 	for (i = 0; i < priv->tx_ring_num[TX]; i++) {
 		const struct mlx4_en_tx_ring *ring = priv->tx_ring[TX][i];
 
-		stats->tx_packets += ring->packets;
-		stats->tx_bytes += ring->bytes;
-		stats->tx_dropped += ring->tx_dropped;
-		priv->port_stats.tx_chksum_offload += ring->tx_csum;
-		priv->port_stats.queue_stopped     += ring->queue_stopped;
-		priv->port_stats.wake_queue        += ring->wake_queue;
-		priv->port_stats.tso_packets       += ring->tso_packets;
-		priv->port_stats.xmit_more         += ring->xmit_more;
+		sw_tx_dropped			   += READ_ONCE(ring->tx_dropped);
+		priv->port_stats.tx_chksum_offload += READ_ONCE(ring->tx_csum);
+		priv->port_stats.queue_stopped     += READ_ONCE(ring->queue_stopped);
+		priv->port_stats.wake_queue        += READ_ONCE(ring->wake_queue);
+		priv->port_stats.tso_packets       += READ_ONCE(ring->tso_packets);
+		priv->port_stats.xmit_more         += READ_ONCE(ring->xmit_more);
 	}
+
 	if (mlx4_is_master(mdev->dev)) {
 		stats->rx_packets = en_stats_adder(&mlx4_en_stats->RTOT_prio_0,
 						   &mlx4_en_stats->RTOT_prio_1,
@@ -251,7 +281,8 @@ int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset)
 	stats->rx_length_errors = be32_to_cpu(mlx4_en_stats->RdropLength);
 	stats->rx_crc_errors = be32_to_cpu(mlx4_en_stats->RCRC);
 	stats->rx_fifo_errors = be32_to_cpu(mlx4_en_stats->RdropOvflw);
-	stats->tx_dropped += be32_to_cpu(mlx4_en_stats->TDROP);
+	stats->tx_dropped = be32_to_cpu(mlx4_en_stats->TDROP) +
+			    sw_tx_dropped;
 
 	/* RX stats */
 	priv->pkstats.rx_multicast_packets = stats->multicast;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 574bcbb1b38fc4758511d8f7bd17a87b0a507a73..20a936428f4a44c8ca0a7161855da310f9166b50 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -755,6 +755,7 @@ void mlx4_en_rx_irq(struct mlx4_cq *mcq);
 int mlx4_SET_MCAST_FLTR(struct mlx4_dev *dev, u8 port, u64 mac, u64 clear, u8 mode);
 int mlx4_SET_VLAN_FLTR(struct mlx4_dev *dev, struct mlx4_en_priv *priv);
 
+void mlx4_en_fold_software_stats(struct net_device *dev);
 int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset);
 int mlx4_en_QUERY_PORT(struct mlx4_en_dev *mdev, u8 port);
 


* Re: [PATCH] net: stmmac: enable tx queue 0 for gmac4 IPs synthesized with multiple TX queues
From: Alexandre Torgue @ 2016-11-25 15:40 UTC (permalink / raw)
  To: Niklas Cassel, Giuseppe Cavallaro; +Cc: netdev, linux-kernel
In-Reply-To: <c97ee9f8-c656-20a9-c41a-e66041540013@axis.com>

Hi Niklas

On 11/25/2016 01:14 PM, Niklas Cassel wrote:
> On 11/25/2016 01:10 PM, Niklas Cassel wrote:
>> On 11/24/2016 07:11 PM, Alexandre Torgue wrote:
>>> Hi Niklas,
>> Hello Alexandre
>>
>>> On 11/24/2016 03:36 PM, Niklas Cassel wrote:
>>>> From: Niklas Cassel <niklas.cassel@axis.com>
>>>>
>>>> The dwmac4 IP can be synthesized with 1-8 tx queues.
>>>> On an IP synthesized with DWC_EQOS_NUM_TXQ > 1, all txqueues are disabled
>>>> by default. For these IPs, the bitfield TXQEN is R/W.
>>>>
>>>> Always enable tx queue 0. The write will have no effect on IPs synthesized
>>>> with DWC_EQOS_NUM_TXQ == 1.
>>>>
>>>> The driver still does not utilize more than one tx queue in the IP.
>>>>
>>>> Signed-off-by: Niklas Cassel <niklas.cassel@axis.com>
>>>> ---
>>>>  drivers/net/ethernet/stmicro/stmmac/dwmac4.h     |  3 +++
>>>>  drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c | 12 +++++++++++-
>>>>  2 files changed, 14 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
>>>> index 6f4f5ce25114..3e8d4fefa5e0 100644
>>>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
>>>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
>>>> @@ -155,8 +155,11 @@ enum power_event {
>>>>  #define MTL_CHAN_RX_DEBUG(x)        (MTL_CHANX_BASE_ADDR(x) + 0x38)
>>>>
>>>>  #define MTL_OP_MODE_RSF            BIT(5)
>>>> +#define MTL_OP_MODE_TXQEN        BIT(3)
>>>>  #define MTL_OP_MODE_TSF            BIT(1)
>>>>
>>>> +#define MTL_OP_MODE_TQS_MASK        GENMASK(24, 16)
>>>> +
>>>>  #define MTL_OP_MODE_TTC_MASK        0x70
>>>>  #define MTL_OP_MODE_TTC_SHIFT        4
>>>>
>>>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
>>>> index 116151cd6a95..577316de6ba8 100644
>>>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
>>>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
>>>> @@ -213,7 +213,17 @@ static void dwmac4_dma_chan_op_mode(void __iomem *ioaddr, int txmode,
>>>>          else
>>>>              mtl_tx_op |= MTL_OP_MODE_TTC_512;
>>>>      }
>>>> -
>>>> +    /* For an IP with DWC_EQOS_NUM_TXQ == 1, the fields TXQEN and TQS are RO
>>>> +     * with reset values: TXQEN on, TQS == DWC_EQOS_TXFIFO_SIZE.
>>>> +     * For an IP with DWC_EQOS_NUM_TXQ > 1, the fields TXQEN and TQS are R/W
>>>> +     * with reset values: TXQEN off, TQS 256 bytes.
>>>> +     *
>>>> +     * Write the bits in both cases, since it will have no effect when RO.
>>>> +     * For DWC_EQOS_NUM_TXQ > 1, the top bits in MTL_OP_MODE_TQS_MASK might
>>>> +     * be RO, however, writing the whole TQS field will result in a value
>>>> +     * equal to DWC_EQOS_TXFIFO_SIZE, just like for DWC_EQOS_NUM_TXQ == 1.
>>>> +     */
>>>> +    mtl_tx_op |= MTL_OP_MODE_TXQEN | MTL_OP_MODE_TQS_MASK;
>>> Your patch sounds good. Just one question:
>>>
>>> In the Synopsys databook I use, I see that MTL_OP_MODE_TXQEN for channel 2 can take several values, "disabled / enabled / enabled in AV mode":
>>>
>>> Transmit Queue Enable
>>> This field is used to enable/disable the transmit queue 1. 00 R/W
>>> ■ 2'b00 - Not enabled
>>> ■ 2'b01 - Enable in AV mode (Reserved when Enable Audio Video
>>> Bridging is not selected while configuring the core)
>>> ■ 2'b10 - Enabled
>>> ■ 2'b11 - Reserved
>>>
>>> Do you plan to manage av mode in a future patch ?
>> We are not planning on using the AV mode.
>> We will probably not use TXQ1 at all.
>>
>> I noticed that the MAC_HW_Feature2 Register actually has a TXQCNT field.
>> It is currently saved in priv->dma_cap.number_tx_channel.
>> If you prefer, I could do a patch V2 where we only set the bits if
>> priv->dma_cap.number_tx_channel > 1
>
> Oh, sorry, that was number of tx _channels_,
> not number of tx _queues_.
>
> However, we could add a number_tx_queue to struct dma_features,
> if you would prefer that.

I agree your patch is good. It will work even if we use several tx
channels. We can look at AV mode in the future.

regards
alex

>
>>
>> However, I don't think the current patch is too bad either, since the bits
>> are RO when number_tx_channel == 1.
>>
>>
>>> Regards
>>> Alex
>>>
>>>>      writel(mtl_tx_op, ioaddr +  MTL_CHAN_TX_OP_MODE(channel));
>>>>
>>>>      mtl_rx_op = readl(ioaddr + MTL_CHAN_RX_OP_MODE(channel));
>>>>
>


* [PATCH net-next 5/5] udp: add recvmmsg implementation
From: Paolo Abeni @ 2016-11-25 15:39 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	Hannes Frederic Sowa, Sabrina Dubroca
In-Reply-To: <cover.1480086321.git.pabeni@redhat.com>

skbs are extracted from the receive queue in bursts, and a single
sk_rmem_alloc/forward allocated memory update is performed for
each burst.
MSG_PEEK and MSG_ERRQUEUE are not supported to keep the implementation
as simple as possible.
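
As a rough illustration of the accounting idea above (not the kernel code),
a burst can be released with one counter update instead of one per packet.
The sketch below is a user-space model: struct pkt, struct rx_acct and
release_burst() are hypothetical names standing in for the skb truesize and
the socket's sk_rmem_alloc counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an skb: only the charged size matters here. */
struct pkt {
	unsigned int truesize;	/* memory charged to the socket for this packet */
};

/* Hypothetical stand-in for the socket receive memory accounting. */
struct rx_acct {
	long rmem_alloc;	/* bytes currently charged to the receive queue */
	unsigned int updates;	/* how many times the counter was touched */
};

/* Release a whole burst with a single counter update; returns bytes released. */
static unsigned int release_burst(struct rx_acct *acct,
				  const struct pkt *burst, size_t n)
{
	unsigned int total = 0;
	size_t i;

	assert(acct && (burst || n == 0));

	/* Sum the sizes first ... */
	for (i = 0; i < n; i++)
		total += burst[i].truesize;

	/* ... then touch the shared counter exactly once for the burst. */
	acct->rmem_alloc -= total;
	acct->updates++;
	return total;
}
```

With a per-packet release the counter would be touched n times per burst;
in this model it is touched once, whatever the burst length.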

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/net/udp.h   |   7 +++
 net/ipv4/udp.c      | 121 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/udp_impl.h |   3 ++
 net/ipv4/udplite.c  |   1 +
 net/ipv6/udp.c      |  16 +++++++
 net/ipv6/udp_impl.h |   3 ++
 net/ipv6/udplite.c  |   1 +
 7 files changed, 152 insertions(+)

diff --git a/include/net/udp.h b/include/net/udp.h
index 1661791..2bd63c9 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -308,6 +308,13 @@ struct sock *__udp6_lib_lookup(struct net *net,
 struct sock *udp6_lib_lookup_skb(struct sk_buff *skb,
 				 __be16 sport, __be16 dport);
 
+int __udp_recvmmsg(struct sock *sk, struct mmsghdr __user *ummsg,
+		   unsigned int *vlen, unsigned int flags,
+		   struct timespec *timeout, const struct timespec64 *end_time,
+		   int (*udp_process_msg)(struct sock *sk, struct sk_buff *skb,
+					  struct msghdr *msg,
+					  unsigned int flags));
+
 /*
  * 	SNMP statistics for UDP and UDP-Lite
  */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index d99429d..44f1326 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1467,6 +1467,126 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
 	return err;
 }
 
+static void udp_skb_bulk_destructor(struct sock *sk, int totalsize)
+{
+	udp_rmem_release(sk, totalsize, 1);
+}
+
+int __udp_recvmmsg(struct sock *sk, struct mmsghdr __user *mmsg,
+		   unsigned int *nr, unsigned int flags,
+		   struct timespec *timeout, const struct timespec64 *end_time,
+		   int (*process_msg)(struct sock *sk, struct sk_buff *skb,
+				      struct msghdr *msg,
+				      unsigned int flags))
+{
+	long timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
+	int datagrams = 0, err = 0, ret = 0, vlen = *nr;
+	struct sk_buff *skb, *next, *last;
+
+	if (flags & (MSG_PEEK | MSG_ERRQUEUE))
+		return -EOPNOTSUPP;
+
+again:
+	for (;;) {
+		skb = __skb_try_recv_datagram_batch(sk, flags, vlen - datagrams,
+						    udp_skb_bulk_destructor,
+						    &err);
+		if (skb)
+			break;
+
+		if ((err != -EAGAIN) || !timeo || (flags & MSG_DONTWAIT))
+			goto out;
+
+		/* no packets, and we are supposed to wait for the next */
+		if (timeout) {
+			long expires;
+
+			if (sock_recvmmsg_timeout(timeout, *end_time))
+				goto out;
+			expires = timeout->tv_sec * HZ +
+				  (timeout->tv_nsec >> 20);
+			if (expires + 1 < timeo)
+				timeo = expires + 1;
+		}
+
+		/* the queue was empty when tried to dequeue */
+		last = (struct sk_buff *)&sk->sk_receive_queue;
+		if (__skb_wait_for_more_packets(sk, &err, &timeo, last))
+			goto out;
+	}
+
+	for (; skb; skb = next) {
+		struct recvmmsg_ctx ctx;
+		int len;
+
+		next = skb->next;
+		err = recvmmsg_ctx_from_user(sk, mmsg, flags, datagrams, &ctx);
+		if (err < 0) {
+			kfree_skb(skb);
+			goto free_ctx;
+		}
+
+		/* process skb's until we find a valid one */
+		for (;;) {
+			len = process_msg(sk, skb, &ctx.msg_sys, flags);
+			if (len >= 0)
+				break;
+
+			/* only non csum errors are propagated to the caller */
+			if (len != -EINVAL) {
+				err = len;
+				goto free_ctx;
+			}
+
+			if (!next)
+				goto free_ctx;
+			skb = next;
+			next = skb->next;
+		}
+
+		err = recvmmsg_ctx_to_user(&mmsg, len, flags, &ctx);
+		if (err < 0)
+			goto free_ctx;
+
+		/* now we're sure the skb is fully processed, we can count it */
+		datagrams++;
+
+free_ctx:
+		recvmmsg_ctx_free(&ctx);
+		if (err < 0)
+			ret = err;
+	}
+
+	/* only handle waitforone after processing a full batch. */
+	if (datagrams && (flags & MSG_WAITFORONE))
+		flags |= MSG_DONTWAIT;
+
+	if (!ret && (datagrams < vlen)) {
+		cond_resched();
+		goto again;
+	}
+
+out:
+	*nr = datagrams;
+	return ret < 0 ? ret : -EAGAIN;
+}
+EXPORT_SYMBOL_GPL(__udp_recvmmsg);
+
+static int udp_process_msg(struct sock *sk, struct sk_buff *skb,
+			   struct msghdr *msg, unsigned int flags)
+{
+	return udp_process_skb(sk, skb, msg, msg_data_left(msg), flags,
+			       &msg->msg_namelen, 0, 0, skb->peeked);
+}
+
+int udp_recvmmsg(struct sock *sk, struct mmsghdr __user *ummsg,
+		 unsigned int *nr, unsigned int flags, struct timespec *timeout,
+		 const struct timespec64 *end_time)
+{
+	return __udp_recvmmsg(sk, ummsg, nr, flags, timeout, end_time,
+			      udp_process_msg);
+}
+
 int __udp_disconnect(struct sock *sk, int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
@@ -2329,6 +2449,7 @@ struct proto udp_prot = {
 	.getsockopt	   = udp_getsockopt,
 	.sendmsg	   = udp_sendmsg,
 	.recvmsg	   = udp_recvmsg,
+	.recvmmsg	   = udp_recvmmsg,
 	.sendpage	   = udp_sendpage,
 	.release_cb	   = ip4_datagram_release_cb,
 	.hash		   = udp_lib_hash,
diff --git a/net/ipv4/udp_impl.h b/net/ipv4/udp_impl.h
index 7e0fe4b..f11d608 100644
--- a/net/ipv4/udp_impl.h
+++ b/net/ipv4/udp_impl.h
@@ -23,6 +23,9 @@ int compat_udp_getsockopt(struct sock *sk, int level, int optname,
 #endif
 int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
 		int flags, int *addr_len);
+int udp_recvmmsg(struct sock *sk, struct mmsghdr __user *ummsg,
+		 unsigned int *nr, unsigned int flags, struct timespec *timeout,
+		 const struct timespec64 *end_time);
 int udp_sendpage(struct sock *sk, struct page *page, int offset, size_t size,
 		 int flags);
 int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
diff --git a/net/ipv4/udplite.c b/net/ipv4/udplite.c
index 59f10fe..a0e7fe9 100644
--- a/net/ipv4/udplite.c
+++ b/net/ipv4/udplite.c
@@ -49,6 +49,7 @@ struct proto 	udplite_prot = {
 	.getsockopt	   = udp_getsockopt,
 	.sendmsg	   = udp_sendmsg,
 	.recvmsg	   = udp_recvmsg,
+	.recvmmsg	   = udp_recvmmsg,
 	.sendpage	   = udp_sendpage,
 	.hash		   = udp_lib_hash,
 	.unhash		   = udp_lib_unhash,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 3218c64..2c034be 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -479,6 +479,21 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 
 }
 
+static int udp6_process_msg(struct sock *sk, struct sk_buff *skb,
+			    struct msghdr *msg, unsigned int flags)
+{
+	return udp6_process_skb(sk, skb, msg, msg_data_left(msg), flags,
+				&msg->msg_namelen, 0, 0, skb->peeked);
+}
+
+int udpv6_recvmmsg(struct sock *sk, struct mmsghdr __user *ummsg,
+		   unsigned int *nr, unsigned int flags,
+		   struct timespec *timeout, const struct timespec64 *end_time)
+{
+	return __udp_recvmmsg(sk, ummsg, nr, flags, timeout, end_time,
+			      udp6_process_msg);
+}
+
 void __udp6_lib_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 		    u8 type, u8 code, int offset, __be32 info,
 		    struct udp_table *udptable)
@@ -1443,6 +1458,7 @@ struct proto udpv6_prot = {
 	.getsockopt	   = udpv6_getsockopt,
 	.sendmsg	   = udpv6_sendmsg,
 	.recvmsg	   = udpv6_recvmsg,
+	.recvmmsg	   = udpv6_recvmmsg,
 	.release_cb	   = ip6_datagram_release_cb,
 	.hash		   = udp_lib_hash,
 	.unhash		   = udp_lib_unhash,
diff --git a/net/ipv6/udp_impl.h b/net/ipv6/udp_impl.h
index f6eb1ab..fe566db 100644
--- a/net/ipv6/udp_impl.h
+++ b/net/ipv6/udp_impl.h
@@ -26,6 +26,9 @@ int compat_udpv6_getsockopt(struct sock *sk, int level, int optname,
 int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len);
 int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
 		  int flags, int *addr_len);
+int udpv6_recvmmsg(struct sock *sk, struct mmsghdr __user *ummsg,
+		   unsigned int *nr, unsigned int flags,
+		   struct timespec *timeout, const struct timespec64 *end_time);
 int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
 void udpv6_destroy_sock(struct sock *sk);
 
diff --git a/net/ipv6/udplite.c b/net/ipv6/udplite.c
index 2784cc3..23d80ac 100644
--- a/net/ipv6/udplite.c
+++ b/net/ipv6/udplite.c
@@ -45,6 +45,7 @@ struct proto udplitev6_prot = {
 	.getsockopt	   = udpv6_getsockopt,
 	.sendmsg	   = udpv6_sendmsg,
 	.recvmsg	   = udpv6_recvmsg,
+	.recvmmsg	   = udpv6_recvmmsg,
 	.hash		   = udp_lib_hash,
 	.unhash		   = udp_lib_unhash,
 	.get_port	   = udp_v6_get_port,
-- 
1.8.3.1

* [PATCH net-next 4/5] net/socket: add helpers for recvmmsg
From: Paolo Abeni @ 2016-11-25 15:39 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	Hannes Frederic Sowa, Sabrina Dubroca
In-Reply-To: <cover.1480086321.git.pabeni@redhat.com>

__skb_try_recv_datagram_batch dequeues multiple skb's from the
socket's receive queue, and runs the bulk_destructor callback under
the receive queue lock.

recvmmsg_ctx_from_user retrieves msghdr information from userspace,
and sets up the kernelspace context for processing one datagram.

recvmmsg_ctx_to_user copies to userspace the results of processing one
datagram.
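
The batch dequeue described above can be sketched in user space as follows.
This is only a model under assumptions: struct node, struct queue and
dequeue_batch() are hypothetical names, there is no locking, and the list is
a circular doubly-linked list with a sentinel head, similar in shape to
sk_buff_head. The first 'batch' nodes are unspliced in one step and handed
back as a NULL-terminated chain.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *next, *prev;
	int id;
};

struct queue {
	struct node head;	/* sentinel: head.next is first, head.prev is last */
	unsigned int qlen;
};

static void queue_init(struct queue *q)
{
	q->head.next = q->head.prev = &q->head;
	q->qlen = 0;
}

static void queue_tail(struct queue *q, struct node *n)
{
	n->prev = q->head.prev;
	n->next = &q->head;
	q->head.prev->next = n;
	q->head.prev = n;
	q->qlen++;
}

/* Detach up to 'batch' nodes from the front; returns a NULL-terminated chain. */
static struct node *dequeue_batch(struct queue *q, unsigned int batch,
				  unsigned int *taken)
{
	struct node *first, *last, *n;
	unsigned int count = 0;

	assert(q && taken);

	*taken = 0;
	first = q->head.next;
	last = &q->head;
	if (batch == 0 || first == &q->head)
		return NULL;

	/* Walk from the front until the batch is full or the queue ends. */
	for (n = first; n != &q->head; n = n->next) {
		last = n;
		if (++count == batch)
			break;
	}

	/* Unsplice [first, last]: relink the neighbours, then cut the chain. */
	first->prev->next = last->next;
	last->next->prev = first->prev;
	first->prev = NULL;
	last->next = NULL;
	q->qlen -= count;
	*taken = count;
	return first;
}
```

Because the whole range is removed with four pointer writes, the cost of
touching the queue does not grow with the batch size; only the walk does.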

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/skbuff.h | 20 ++++++++++++++++
 include/net/sock.h     | 19 +++++++++++++++
 net/core/datagram.c    | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++
 net/socket.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 164 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9c535fb..5672045 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1598,6 +1598,20 @@ static inline void __skb_insert(struct sk_buff *newsk,
 	list->qlen++;
 }
 
+static inline void __skb_queue_unsplice(struct sk_buff *first,
+					struct sk_buff *last,
+					unsigned int n,
+					struct sk_buff_head *queue)
+{
+	struct sk_buff *next = last->next, *prev = first->prev;
+
+	queue->qlen -= n;
+	last->next = NULL;
+	first->prev = NULL;
+	next->prev = prev;
+	prev->next = next;
+}
+
 static inline void __skb_queue_splice(const struct sk_buff_head *list,
 				      struct sk_buff *prev,
 				      struct sk_buff *next)
@@ -3032,6 +3046,12 @@ static inline void skb_frag_list_init(struct sk_buff *skb)
 
 int __skb_wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
 				const struct sk_buff *skb);
+struct sk_buff *__skb_try_recv_datagram_batch(struct sock *sk,
+					      unsigned int flags,
+					      unsigned int batch,
+					      void (*bulk_destructor)(
+						     struct sock *sk, int size),
+					      int *err);
 struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned flags,
 					void (*destructor)(struct sock *sk,
 							   struct sk_buff *skb),
diff --git a/include/net/sock.h b/include/net/sock.h
index 11126f4..3daf63a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1534,6 +1534,25 @@ int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
 int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
 		   struct sockcm_cookie *sockc);
 
+struct recvmmsg_ctx {
+	struct iovec		iovstack[UIO_FASTIOV];
+	struct msghdr		msg_sys;
+	struct sockaddr __user	*uaddr;
+	struct sockaddr_storage	addr;
+	unsigned long		cmsg_ptr;
+	struct iovec		*iov;
+};
+
+int recvmmsg_ctx_from_user(struct sock *sk, struct mmsghdr __user *mmsg,
+			   unsigned int flags, int nosec,
+			   struct recvmmsg_ctx *ctx);
+int recvmmsg_ctx_to_user(struct mmsghdr __user **mmsg, int len,
+			 unsigned int flags, struct recvmmsg_ctx *ctx);
+static inline void recvmmsg_ctx_free(struct recvmmsg_ctx *ctx)
+{
+	kfree(ctx->iov);
+}
+
 static inline bool sock_recvmmsg_timeout(struct timespec *timeout,
 					 struct timespec64 end_time)
 {
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 49816af..90d1aa2 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -301,6 +301,71 @@ struct sk_buff *skb_recv_datagram(struct sock *sk, unsigned int flags,
 }
 EXPORT_SYMBOL(skb_recv_datagram);
 
+/**
+ *	__skb_try_recv_datagram_batch - Receive a batch of datagram skbuff
+ *	@sk: socket
+ *	@flags: MSG_ flags
+ *	@batch: maximum batch length
+ *	@bulk_destructor: invoked under the receive lock on successful dequeue
+ *	@err: error code returned
+ *	@last: set to last peeked message to inform the wait function
+ *	       what to look for when peeking
+ *
+ * like __skb_try_recv_datagram, but dequeue a full batch up to the specified
+ * max length. Returned skbs are linked and the list is NULL terminated.
+ * Peeking is not supported.
+ */
+struct sk_buff *__skb_try_recv_datagram_batch(struct sock *sk,
+					      unsigned int flags,
+					      unsigned int batch,
+					      void (*bulk_destructor)(
+						     struct sock *sk, int size),
+					      int *err)
+{
+	unsigned int datagrams = 0, totalsize = 0;
+	struct sk_buff *skb, *last, *first;
+	struct sk_buff_head *queue;
+
+	*err = sock_error(sk);
+	if (*err)
+		return NULL;
+
+	queue = &sk->sk_receive_queue;
+	spin_lock_bh(&queue->lock);
+	for (;;) {
+		if (!skb_queue_empty(queue))
+			break;
+
+		spin_unlock_bh(&queue->lock);
+
+		if (!sk_can_busy_loop(sk) ||
+		    !sk_busy_loop(sk, flags & MSG_DONTWAIT))
+			goto no_packets;
+
+		spin_lock_bh(&queue->lock);
+	}
+
+	last = (struct sk_buff *)queue;
+	first = (struct sk_buff *)queue->next;
+	skb_queue_walk(queue, skb) {
+		last = skb;
+		totalsize += skb->truesize;
+		if (++datagrams == batch)
+			break;
+	}
+	__skb_queue_unsplice(first, last, datagrams, queue);
+
+	if (bulk_destructor)
+		bulk_destructor(sk, totalsize);
+	spin_unlock_bh(&queue->lock);
+	return first;
+
+no_packets:
+	*err = -EAGAIN;
+	return NULL;
+}
+EXPORT_SYMBOL(__skb_try_recv_datagram_batch);
+
 void skb_free_datagram(struct sock *sk, struct sk_buff *skb)
 {
 	consume_skb(skb);
diff --git a/net/socket.c b/net/socket.c
index 49e6cd6..ceb627b 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2220,6 +2220,66 @@ long __sys_recvmsg(int fd, struct user_msghdr __user *msg, unsigned flags)
  *     Linux recvmmsg interface
  */
 
+int recvmmsg_ctx_from_user(struct sock *sk, struct mmsghdr __user *mmsg,
+			   unsigned int flags, int nosec,
+			   struct recvmmsg_ctx *ctx)
+{
+	struct user_msghdr __user *msg = (struct user_msghdr __user *)mmsg;
+	struct compat_msghdr __user *msg_compat;
+	ssize_t err;
+
+	ctx->iov = ctx->iovstack;
+	msg_compat = (struct compat_msghdr __user *)mmsg;
+	err = copy_msghdr_from_user_gen(&ctx->msg_sys, flags, msg_compat, msg,
+					&ctx->uaddr, &ctx->iov, &ctx->addr);
+	if (err < 0) {
+		ctx->iov = NULL;
+		return err;
+	}
+
+	ctx->cmsg_ptr = (unsigned long)ctx->msg_sys.msg_control;
+	ctx->msg_sys.msg_flags = flags & MSG_CMSG_MASK;
+
+	/* We assume all kernel code knows the size of sockaddr_storage */
+	ctx->msg_sys.msg_namelen = 0;
+
+	if (nosec)
+		return 0;
+
+	return security_socket_recvmsg(sk->sk_socket, &ctx->msg_sys,
+				      msg_data_left(&ctx->msg_sys), flags);
+}
+
+int recvmmsg_ctx_to_user(struct mmsghdr __user **mmsg_ptr, int len,
+			 unsigned int flags, struct recvmmsg_ctx *ctx)
+{
+	struct compat_mmsghdr __user *mmsg_compat;
+	struct mmsghdr __user *mmsg = *mmsg_ptr;
+	int err;
+
+	mmsg_compat = (struct compat_mmsghdr __user *)mmsg;
+	err = copy_msghdr_to_user_gen(&ctx->msg_sys, flags,
+				      &mmsg_compat->msg_hdr, &mmsg->msg_hdr,
+				      ctx->uaddr, &ctx->addr, ctx->cmsg_ptr);
+	if (err)
+		return err;
+
+	if (MSG_CMSG_COMPAT & flags) {
+		err = __put_user(len, &mmsg_compat->msg_len);
+		if (err < 0)
+			return err;
+
+		*mmsg_ptr = (struct mmsghdr __user *)(mmsg_compat + 1);
+	} else {
+		err = put_user(len, &mmsg->msg_len);
+		if (err < 0)
+			return err;
+
+		*mmsg_ptr = mmsg + 1;
+	}
+	return err;
+}
+
 static int __proto_recvmmsg(struct socket *sock, struct mmsghdr __user *ummsg,
 			    unsigned int *vlen, unsigned int flags,
 			    struct timespec *timeout,
-- 
1.8.3.1

* [PATCH net-next 2/5] net/socket: add per protocol mmsg support
From: Paolo Abeni @ 2016-11-25 15:39 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	Hannes Frederic Sowa, Sabrina Dubroca
In-Reply-To: <cover.1480086321.git.pabeni@redhat.com>

proto->recvmmsg allows leveraging per protocol optimization,
amortizing the protocol/locking overhead on multiple packets.

This commit introduces the protocol level callbacks and changes
the generic implementation to use them.

We explicitly pass down to the protocol level both 'timeout'
and 'end_time' to avoid duplicating there the call to
poll_select_set_timeout().
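
The dispatch pattern can be sketched in user space roughly as below. This is
a simplified model, not the kernel code: struct proto_ops_model,
sys_recvmmsg_model() and the demo_* callbacks are hypothetical, the "socket"
is just a counter of queued datagrams, and timeouts are left out. The point
is only that a missing or unsupported batch hook yields -EOPNOTSUPP, which
sends the caller down the generic one-datagram-at-a-time path.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified protocol ops table. */
struct proto_ops_model {
	int (*recvmsg)(void *sk);			/* one datagram, 0 or -errno */
	int (*recvmmsg)(void *sk, unsigned int *nr);	/* optional batch hook */
};

/* Demo callbacks: 'sk' points to an int counting the queued datagrams. */
static int demo_recvmsg(void *sk)
{
	int *queued = sk;

	if (*queued == 0)
		return -EAGAIN;
	(*queued)--;
	return 0;
}

static int demo_recvmmsg(void *sk, unsigned int *nr)
{
	int *queued = sk;
	unsigned int want = *nr, got = 0;

	while (got < want && *queued > 0) {
		(*queued)--;
		got++;
	}
	*nr = got;		/* report how many datagrams were received */
	return got ? 0 : -EAGAIN;
}

static int proto_recvmmsg(const struct proto_ops_model *ops, void *sk,
			  unsigned int *nr)
{
	if (!ops->recvmmsg)
		return -EOPNOTSUPP;
	return ops->recvmmsg(sk, nr);
}

/* Returns the number of datagrams received, or -errno if none were. */
static int sys_recvmmsg_model(const struct proto_ops_model *ops, void *sk,
			      unsigned int vlen)
{
	unsigned int nr = vlen, datagrams = 0;
	int err;

	assert(ops && ops->recvmsg);

	/* Try the per-protocol batch hook first. */
	err = proto_recvmmsg(ops, sk, &nr);
	if (err != -EOPNOTSUPP)
		return nr ? (int)nr : err;

	/* Generic fallback: one recvmsg call per datagram. */
	err = 0;
	while (datagrams < vlen) {
		err = ops->recvmsg(sk);
		if (err < 0)
			break;
		datagrams++;
	}
	return datagrams ? (int)datagrams : err;
}
```

A protocol that leaves the batch hook unset keeps its current behaviour,
so the hook can be adopted one protocol at a time.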

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/net.h       |  5 +++++
 include/net/inet_common.h |  3 +++
 include/net/sock.h        |  6 ++++++
 net/ipv4/af_inet.c        | 16 ++++++++++++++++
 net/ipv6/af_inet6.c       |  1 +
 net/socket.c              | 26 +++++++++++++++++++++++++-
 6 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index cd0c8bd..95f1bc5 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -182,6 +182,11 @@ struct proto_ops {
 	 */
 	int		(*recvmsg)   (struct socket *sock, struct msghdr *m,
 				      size_t total_len, int flags);
+	int		(*recvmmsg)  (struct socket *sock,
+				      struct mmsghdr __user *user_mmsg,
+				      unsigned int *vlen, unsigned int flags,
+				      struct timespec *timeout,
+				      const struct timespec64 *end_time);
 	int		(*mmap)	     (struct file *file, struct socket *sock,
 				      struct vm_area_struct * vma);
 	ssize_t		(*sendpage)  (struct socket *sock, struct page *page,
diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 5d68342..c7d5875 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -26,6 +26,9 @@ ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
 		      size_t size, int flags);
 int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 		 int flags);
+int inet_recvmmsg(struct socket *sock, struct mmsghdr __user *user_mmsg,
+		  unsigned int *vlen, unsigned int flags,
+		  struct timespec *timeout, const struct timespec64 *end_time);
 int inet_shutdown(struct socket *sock, int how);
 int inet_listen(struct socket *sock, int backlog);
 void inet_sock_destruct(struct sock *sk);
diff --git a/include/net/sock.h b/include/net/sock.h
index c92dc19..11126f4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1010,6 +1010,12 @@ struct proto {
 	int			(*recvmsg)(struct sock *sk, struct msghdr *msg,
 					   size_t len, int noblock, int flags,
 					   int *addr_len);
+	int			(*recvmmsg)(struct sock *sk,
+					    struct mmsghdr __user *user_mmsg,
+					    unsigned int *nr,
+					    unsigned int flags,
+					    struct timespec *timeout,
+					    const struct timespec64 *end_time);
 	int			(*sendpage)(struct sock *sk, struct page *page,
 					int offset, size_t size, int flags);
 	int			(*bind)(struct sock *sk,
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 5ddf5cd..747558d 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -770,6 +770,21 @@ int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 }
 EXPORT_SYMBOL(inet_recvmsg);
 
+int inet_recvmmsg(struct socket *sock, struct mmsghdr __user *ummsg,
+		  unsigned int *vlen, unsigned int flags,
+		  struct timespec *timeout, const struct timespec64 *end_time)
+{
+	struct sock *sk = sock->sk;
+
+	if (!sk->sk_prot->recvmmsg)
+		return -EOPNOTSUPP;
+
+	sock_rps_record_flow(sk);
+
+	return sk->sk_prot->recvmmsg(sk, ummsg, vlen, flags, timeout, end_time);
+}
+EXPORT_SYMBOL(inet_recvmmsg);
+
 int inet_shutdown(struct socket *sock, int how)
 {
 	struct sock *sk = sock->sk;
@@ -942,6 +957,7 @@ static int inet_compat_ioctl(struct socket *sock, unsigned int cmd, unsigned lon
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
 	.recvmsg	   = inet_recvmsg,
+	.recvmmsg	   = inet_recvmmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = inet_sendpage,
 	.set_peek_off	   = sk_set_peek_off,
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index d424f3a..f7ce6a2 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -571,6 +571,7 @@ int inet6_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
 	.getsockopt	   = sock_common_getsockopt,	/* ok		*/
 	.sendmsg	   = inet_sendmsg,		/* ok		*/
 	.recvmsg	   = inet_recvmsg,		/* ok		*/
+	.recvmmsg	   = inet_recvmmsg,		/* ok		*/
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = sock_no_sendpage,
 	.set_peek_off	   = sk_set_peek_off,
diff --git a/net/socket.c b/net/socket.c
index 9b5f360..49e6cd6 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2140,6 +2140,8 @@ static int copy_msghdr_to_user_gen(struct msghdr *msg_sys, int flags,
 				  &msg->msg_controllen);
 }
 
+#define MSG_CMSG_MASK (MSG_CMSG_CLOEXEC | MSG_CMSG_COMPAT)
+
 static int ___sys_recvmsg(struct socket *sock, struct user_msghdr __user *msg,
 			 struct msghdr *msg_sys, unsigned int flags, int nosec)
 {
@@ -2163,7 +2165,7 @@ static int ___sys_recvmsg(struct socket *sock, struct user_msghdr __user *msg,
 		return err;
 
 	cmsg_ptr = (unsigned long)msg_sys->msg_control;
-	msg_sys->msg_flags = flags & (MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT);
+	msg_sys->msg_flags = flags & MSG_CMSG_MASK;
 
 	/* We assume all kernel code knows the size of sockaddr_storage */
 	msg_sys->msg_namelen = 0;
@@ -2218,6 +2220,21 @@ long __sys_recvmsg(int fd, struct user_msghdr __user *msg, unsigned flags)
  *     Linux recvmmsg interface
  */
 
+static int __proto_recvmmsg(struct socket *sock, struct mmsghdr __user *ummsg,
+			    unsigned int *vlen, unsigned int flags,
+			    struct timespec *timeout,
+			    const struct timespec64 *end_time)
+{
+	if (!sock->ops->recvmmsg)
+		return -EOPNOTSUPP;
+
+	if (sock->file->f_flags & O_NONBLOCK)
+		flags |= MSG_DONTWAIT;
+
+	/* defer LSM check and mmsg parsing to the proto operation */
+	return sock->ops->recvmmsg(sock, ummsg, vlen, flags, timeout, end_time);
+}
+
 int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 		   unsigned int flags, struct timespec *timeout)
 {
@@ -2243,6 +2260,12 @@ int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 	if (err)
 		goto out_put;
 
+	err = __proto_recvmmsg(sock, mmsg, &vlen, flags, timeout, &end_time);
+	if (err != -EOPNOTSUPP) {
+		datagrams = vlen;
+		goto chk_error;
+	}
+
 	entry = mmsg;
 	compat_entry = (struct compat_mmsghdr __user *)mmsg;
 
@@ -2287,6 +2310,7 @@ int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 		cond_resched();
 	}
 
+chk_error:
 	if (err == 0)
 		goto out_put;
 
-- 
1.8.3.1

* [PATCH net-next 3/5] net/udp: factor out main skb processing routine
From: Paolo Abeni @ 2016-11-25 15:39 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	Hannes Frederic Sowa, Sabrina Dubroca
In-Reply-To: <cover.1480086321.git.pabeni@redhat.com>

we will reuse it later for the recvmmsg implementation

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv4/udp.c | 67 ++++++++++++++++++++++++++++++++---------------------
 net/ipv6/udp.c | 73 ++++++++++++++++++++++++++++++++++++----------------------
 2 files changed, 86 insertions(+), 54 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index b3b6bc5..d99429d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1350,34 +1350,18 @@ int udp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 }
 EXPORT_SYMBOL(udp_ioctl);
 
-/*
- * 	This should be easy, if there is something there we
- * 	return it, otherwise we block.
- */
-
-int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
-		int flags, int *addr_len)
+static int udp_process_skb(struct sock *sk, struct sk_buff *skb,
+			   struct msghdr *msg, size_t len, int flags,
+			   int *addr_len, int off, int peeking, int peeked)
 {
-	struct inet_sock *inet = inet_sk(sk);
 	DECLARE_SOCKADDR(struct sockaddr_in *, sin, msg->msg_name);
-	struct sk_buff *skb;
-	unsigned int ulen, copied;
-	int peeked, peeking, off;
-	int err;
+	struct inet_sock *inet = inet_sk(sk);
 	int is_udplite = IS_UDPLITE(sk);
 	bool checksum_valid = false;
+	int ulen = skb->len;
+	int copied = len;
+	int err;
 
-	if (flags & MSG_ERRQUEUE)
-		return ip_recv_error(sk, msg, len, addr_len);
-
-try_again:
-	peeking = off = sk_peek_offset(sk, flags);
-	skb = __skb_recv_udp(sk, flags, noblock, &peeked, &off, &err);
-	if (!skb)
-		return err;
-
-	ulen = skb->len;
-	copied = len;
 	if (copied > ulen - off)
 		copied = ulen - off;
 	else if (copied < ulen)
@@ -1446,10 +1430,41 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
 	}
 	kfree_skb(skb);
 
-	/* starting over for a new packet, but check if we need to yield */
-	cond_resched();
 	msg->msg_flags &= ~MSG_TRUNC;
-	goto try_again;
+	return -EINVAL;
+}
+
+/*
+ *	This should be easy, if there is something there we
+ *	return it, otherwise we block.
+ */
+int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
+		int flags, int *addr_len)
+{
+	struct sk_buff *skb;
+	int peeked, peeking, off;
+	int err;
+
+	if (flags & MSG_ERRQUEUE)
+		return ip_recv_error(sk, msg, len, addr_len);
+
+try_again:
+	peeking = off = sk_peek_offset(sk, flags);
+	skb = __skb_recv_udp(sk, flags, noblock, &peeked, &off, &err);
+	if (!skb)
+		return err;
+
+	err = udp_process_skb(sk, skb, msg, len, flags, addr_len, off, peeking,
+			      peeked);
+	if (err < 0) {
+		if (err != -EINVAL)
+			return err;
+
+		/* restarting for a new packet, but check if we need to yield */
+		cond_resched();
+		goto try_again;
+	}
+	return err;
 }
 
 int __udp_disconnect(struct sock *sk, int flags)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index ba25ec2..3218c64 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -318,38 +318,19 @@ struct sock *udp6_lib_lookup(struct net *net, const struct in6_addr *saddr, __be
 EXPORT_SYMBOL_GPL(udp6_lib_lookup);
 #endif
 
-/*
- *	This should be easy, if there is something there we
- *	return it, otherwise we block.
- */
-
-int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
-		  int noblock, int flags, int *addr_len)
+static int udp6_process_skb(struct sock *sk, struct sk_buff *skb,
+			    struct msghdr *msg, size_t len, int flags,
+			    int *addr_len, int off, int peeking, int peeked)
 {
 	struct ipv6_pinfo *np = inet6_sk(sk);
 	struct inet_sock *inet = inet_sk(sk);
-	struct sk_buff *skb;
-	unsigned int ulen, copied;
-	int peeked, peeking, off;
-	int err;
 	int is_udplite = IS_UDPLITE(sk);
 	bool checksum_valid = false;
+	int ulen = skb->len;
+	int copied = len;
 	int is_udp4;
+	int err;
 
-	if (flags & MSG_ERRQUEUE)
-		return ipv6_recv_error(sk, msg, len, addr_len);
-
-	if (np->rxpmtu && np->rxopt.bits.rxpmtu)
-		return ipv6_recv_rxpmtu(sk, msg, len, addr_len);
-
-try_again:
-	peeking = off = sk_peek_offset(sk, flags);
-	skb = __skb_recv_udp(sk, flags, noblock, &peeked, &off, &err);
-	if (!skb)
-		return err;
-
-	ulen = skb->len;
-	copied = len;
 	if (copied > ulen - off)
 		copied = ulen - off;
 	else if (copied < ulen)
@@ -456,10 +437,46 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	}
 	kfree_skb(skb);
 
-	/* starting over for a new packet, but check if we need to yield */
-	cond_resched();
 	msg->msg_flags &= ~MSG_TRUNC;
-	goto try_again;
+	return -EINVAL;
+}
+
+/*
+ *	This should be easy, if there is something there we
+ *	return it, otherwise we block.
+ */
+int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
+		  int noblock, int flags, int *addr_len)
+{
+	struct ipv6_pinfo *np = inet6_sk(sk);
+	int peeked, peeking, off;
+	struct sk_buff *skb;
+	int err;
+
+	if (flags & MSG_ERRQUEUE)
+		return ipv6_recv_error(sk, msg, len, addr_len);
+
+	if (np->rxpmtu && np->rxopt.bits.rxpmtu)
+		return ipv6_recv_rxpmtu(sk, msg, len, addr_len);
+
+try_again:
+	peeking = off = sk_peek_offset(sk, flags);
+	skb = __skb_recv_udp(sk, flags, noblock, &peeked, &off, &err);
+	if (!skb)
+		return err;
+
+	err = udp6_process_skb(sk, skb, msg, len, flags, addr_len, off, peeking,
+			       peeked);
+	if (err < 0) {
+		if (err != -EINVAL)
+			return err;
+
+		/* restarting for a new packet, but check if we need to yield */
+		cond_resched();
+		goto try_again;
+	}
+	return err;
+
 }
 
 void __udp6_lib_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
-- 
1.8.3.1

* [PATCH net-next 1/5] net/socket: factor out msghdr manipulation helpers
From: Paolo Abeni @ 2016-11-25 15:39 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	Hannes Frederic Sowa, Sabrina Dubroca
In-Reply-To: <cover.1480086321.git.pabeni@redhat.com>

so that they can be used later for the recvmmsg refactor

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/net/sock.h | 18 ++++++++++
 net/socket.c       | 97 +++++++++++++++++++++++++++++-------------------------
 2 files changed, 70 insertions(+), 45 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 442cbb1..c92dc19 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1528,6 +1528,24 @@ int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
 int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
 		   struct sockcm_cookie *sockc);
 
+static inline bool sock_recvmmsg_timeout(struct timespec *timeout,
+					 struct timespec64 end_time)
+{
+	struct timespec64 timeout64;
+
+	if (!timeout)
+		return false;
+
+	ktime_get_ts64(&timeout64);
+	*timeout = timespec64_to_timespec(timespec64_sub(end_time, timeout64));
+	if (timeout->tv_sec < 0) {
+		timeout->tv_sec = timeout->tv_nsec = 0;
+		return true;
+	}
+
+	return timeout->tv_nsec == 0 && timeout->tv_sec == 0;
+}
+
 /*
  * Functions to fill in entries in struct proto_ops when a protocol
  * does not implement a particular function.
diff --git a/net/socket.c b/net/socket.c
index e2584c5..9b5f360 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1903,6 +1903,21 @@ static int copy_msghdr_from_user(struct msghdr *kmsg,
 			    UIO_FASTIOV, iov, &kmsg->msg_iter);
 }
 
+static int copy_msghdr_from_user_gen(struct msghdr *msg_sys, unsigned int flags,
+				     struct compat_msghdr __user *msg_compat,
+				     struct user_msghdr __user *msg,
+				     struct sockaddr __user **uaddr,
+				     struct iovec **iov,
+				     struct sockaddr_storage *addr)
+{
+	msg_sys->msg_name = addr;
+
+	if (MSG_CMSG_COMPAT & flags)
+		return get_compat_msghdr(msg_sys, msg_compat, uaddr, iov);
+	else
+		return copy_msghdr_from_user(msg_sys, msg, uaddr, iov);
+}
+
 static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg,
 			 struct msghdr *msg_sys, unsigned int flags,
 			 struct used_address *used_address,
@@ -1919,12 +1934,8 @@ static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg,
 	int ctl_len;
 	ssize_t err;
 
-	msg_sys->msg_name = &address;
-
-	if (MSG_CMSG_COMPAT & flags)
-		err = get_compat_msghdr(msg_sys, msg_compat, NULL, &iov);
-	else
-		err = copy_msghdr_from_user(msg_sys, msg, NULL, &iov);
+	err = copy_msghdr_from_user_gen(msg_sys, flags, msg_compat, msg, NULL,
+					&iov, &address);
 	if (err < 0)
 		return err;
 
@@ -2101,6 +2112,34 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 	return __sys_sendmmsg(fd, mmsg, vlen, flags);
 }
 
+static int copy_msghdr_to_user_gen(struct msghdr *msg_sys, int flags,
+				   struct compat_msghdr __user *msg_compat,
+				   struct user_msghdr __user *msg,
+				   struct sockaddr __user *uaddr,
+				   struct sockaddr_storage *addr,
+				   unsigned long cmsgptr)
+{
+	int __user *uaddr_len = COMPAT_NAMELEN(msg);
+	int err;
+
+	if (uaddr) {
+		err = move_addr_to_user(addr, msg_sys->msg_namelen, uaddr,
+					uaddr_len);
+		if (err < 0)
+			return err;
+	}
+	err = __put_user((msg_sys->msg_flags & ~MSG_CMSG_COMPAT),
+			 COMPAT_FLAGS(msg));
+	if (err)
+		return err;
+	if (MSG_CMSG_COMPAT & flags)
+		return __put_user((unsigned long)msg_sys->msg_control -
+				  cmsgptr, &msg_compat->msg_controllen);
+	else
+		return __put_user((unsigned long)msg_sys->msg_control - cmsgptr,
+				  &msg->msg_controllen);
+}
+
 static int ___sys_recvmsg(struct socket *sock, struct user_msghdr __user *msg,
 			 struct msghdr *msg_sys, unsigned int flags, int nosec)
 {
@@ -2117,14 +2156,9 @@ static int ___sys_recvmsg(struct socket *sock, struct user_msghdr __user *msg,
 
 	/* user mode address pointers */
 	struct sockaddr __user *uaddr;
-	int __user *uaddr_len = COMPAT_NAMELEN(msg);
 
-	msg_sys->msg_name = &addr;
-
-	if (MSG_CMSG_COMPAT & flags)
-		err = get_compat_msghdr(msg_sys, msg_compat, &uaddr, &iov);
-	else
-		err = copy_msghdr_from_user(msg_sys, msg, &uaddr, &iov);
+	err = copy_msghdr_from_user_gen(msg_sys, flags, msg_compat, msg, &uaddr,
+					&iov, &addr);
 	if (err < 0)
 		return err;
 
@@ -2140,24 +2174,8 @@ static int ___sys_recvmsg(struct socket *sock, struct user_msghdr __user *msg,
 	if (err < 0)
 		goto out_freeiov;
 	len = err;
-
-	if (uaddr != NULL) {
-		err = move_addr_to_user(&addr,
-					msg_sys->msg_namelen, uaddr,
-					uaddr_len);
-		if (err < 0)
-			goto out_freeiov;
-	}
-	err = __put_user((msg_sys->msg_flags & ~MSG_CMSG_COMPAT),
-			 COMPAT_FLAGS(msg));
-	if (err)
-		goto out_freeiov;
-	if (MSG_CMSG_COMPAT & flags)
-		err = __put_user((unsigned long)msg_sys->msg_control - cmsg_ptr,
-				 &msg_compat->msg_controllen);
-	else
-		err = __put_user((unsigned long)msg_sys->msg_control - cmsg_ptr,
-				 &msg->msg_controllen);
+	err = copy_msghdr_to_user_gen(msg_sys, flags, msg_compat, msg, uaddr,
+				      &addr, cmsg_ptr);
 	if (err)
 		goto out_freeiov;
 	err = len;
@@ -2209,7 +2227,6 @@ int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 	struct compat_mmsghdr __user *compat_entry;
 	struct msghdr msg_sys;
 	struct timespec64 end_time;
-	struct timespec64 timeout64;
 
 	if (timeout &&
 	    poll_select_set_timeout(&end_time, timeout->tv_sec,
@@ -2260,19 +2277,9 @@ int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 		if (flags & MSG_WAITFORONE)
 			flags |= MSG_DONTWAIT;
 
-		if (timeout) {
-			ktime_get_ts64(&timeout64);
-			*timeout = timespec64_to_timespec(
-					timespec64_sub(end_time, timeout64));
-			if (timeout->tv_sec < 0) {
-				timeout->tv_sec = timeout->tv_nsec = 0;
-				break;
-			}
-
-			/* Timeout, return less than vlen datagrams */
-			if (timeout->tv_nsec == 0 && timeout->tv_sec == 0)
-				break;
-		}
+		/* Timeout, return less than vlen datagrams */
+		if (sock_recvmmsg_timeout(timeout, end_time))
+			break;
 
 		/* Out of band data, return right away */
 		if (msg_sys.msg_flags & MSG_OOB)
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 0/5] net: add protocol level recvmmsg support
From: Paolo Abeni @ 2016-11-25 15:39 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	Hannes Frederic Sowa, Sabrina Dubroca

The goal of recvmmsg() is to amortize the syscall overhead over a possibly
long batch of messages, but for most networking protocols, e.g. udp, the
syscall overhead is negligible compared to the protocol-specific operations
like dequeuing.

Moreover, the current recvmmsg() implementation has a long-standing bug with
the timeout argument that can be solved only with protocol level support for
recvmmsg().

This patch series aims to solve both issues, introducing support for
the recvmmsg implementation at the protocol level, adding some generic helpers
for such operation, and finally implementing recvmmsg() support for
udp[v6]/udplite[v6]. Such support does not cover MSG_PEEK and MSG_ERRQUEUE,
as a trade-off between benefit and implementation complexity.

The udp version of recvmmsg() tries to bulk-dequeue skbs from the receive queue:
each burst acquires the lock once to extract as many skbs from the receive
queue as possible, up to the number needed to reach the specified maximum.
rmem_alloc and fwd memory are touched once per burst.
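
As a rough illustration of the burst idea described above (not the code from
this series), the following userspace sketch takes a lock once per burst and
pulls out as many queued items as are available, up to a caller-supplied
maximum. Every name here (rx_queue_sketch, rxq_dequeue_burst, the counter that
stands in for a real lock) is hypothetical and exists only for the example.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 64

/* Hypothetical stand-in for a receive queue; ints stand in for skbs. */
struct rx_queue_sketch {
	int items[QUEUE_CAP];
	size_t head;                      /* index of the oldest item */
	size_t len;                       /* number of queued items */
	unsigned long lock_acquisitions;  /* counts simulated lock round trips */
};

/* A real implementation would take a spinlock; here we only count. */
static void rxq_lock(struct rx_queue_sketch *q) { q->lock_acquisitions++; }
static void rxq_unlock(struct rx_queue_sketch *q) { (void)q; }

static int rxq_push(struct rx_queue_sketch *q, int item)
{
	if (q->len == QUEUE_CAP)
		return -1;
	q->items[(q->head + q->len) % QUEUE_CAP] = item;
	q->len++;
	return 0;
}

/* Dequeue up to max items under a single lock acquisition. */
static size_t rxq_dequeue_burst(struct rx_queue_sketch *q, int *out, size_t max)
{
	size_t n = 0;

	assert(q != NULL && out != NULL);
	rxq_lock(q);
	while (n < max && q->len > 0) {
		out[n++] = q->items[q->head];
		q->head = (q->head + 1) % QUEUE_CAP;
		q->len--;
	}
	rxq_unlock(q);
	return n;
}
```

Dequeuing three items this way costs one lock round trip instead of three,
which is the amortization the burst is meant to provide.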

When the protocol-level recvmmsg() is not available or it does not support the
specified flags, the code falls back to the current generic implementation.

This series introduces some behavior changes for the recvmmsg() syscall (only
for udp):
- the timeout argument now works as expected
- recvmmsg() no longer stops when getting the first error; instead
  it keeps processing the current burst and then handles the error code as
  in the generic implementation.
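
For reference, the per-datagram deadline check that the generic recvmmsg()
loop performs (the block that patch 1 replaces with a call to
sock_recvmmsg_timeout()) can be sketched in userspace C roughly as below. This
is only an approximation written for illustration: struct ts_sketch and
deadline_reached() are hypothetical names, and the real helper's signature and
body may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a seconds/nanoseconds timestamp. */
struct ts_sketch {
	long long sec;
	long nsec;
};

/*
 * Store the time left until *end in *remaining and report whether the
 * deadline has been reached. A negative difference is clamped to zero.
 */
static bool deadline_reached(const struct ts_sketch *end,
			     const struct ts_sketch *now,
			     struct ts_sketch *remaining)
{
	assert(end != NULL && now != NULL && remaining != NULL);

	remaining->sec = end->sec - now->sec;
	remaining->nsec = end->nsec - now->nsec;
	if (remaining->nsec < 0) {
		/* borrow one second so nsec stays in [0, 1e9) */
		remaining->sec--;
		remaining->nsec += 1000000000L;
	}
	if (remaining->sec < 0) {
		remaining->sec = 0;
		remaining->nsec = 0;
		return true;
	}
	return remaining->sec == 0 && remaining->nsec == 0;
}
```

A check of this shape only runs after a datagram has been received, so it
cannot by itself interrupt a receive that is already blocked.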

The measured performance delta is as follows:

		before		after
		(Kpps)		(Kpps)

udp flood[1]	570		1800(+215%)
max tput[2]	1850		3500(+89%)
single queue[3]	1850		1630(-11%)

[1] line rate flood using multiple 64 bytes packets and multiple flows
[2] like [1], but using the minimum number of flows to saturate the user space
 sink, that is 1 flow for the old kernel and 3 for the patched one.
 the tput increases since the contention on the rx lock is low.
[3] like [1] but using a single flow with both old and new kernel. All the
 packets land on the same rx queue and there is a single ksoftirqd instance
 running

The regression in the single queue scenario is actually due to the improved
performance of the recvmmsg() syscall: the user space process is now
significantly faster than the ksoftirqd process so that the latter often needs
to wake up the user space process.

Since ksoftirqd is the bottleneck in such a scenario, overall this causes a
tput reduction. In a real use case, where the udp sink is performing some
actual processing of the received data, such regression is unlikely to really
have an effect.

Joint work with Sabrina Dubroca <sd@queasysnail.net>.

Paolo Abeni (5):
  net/socket: factor out msghdr manipulation helpers
  net/socket: add per protocol mmesg support
  net/udp: factor out main skb processing routine
  net/socket: add helpers for recvmmsg
  udp: add recvmmsg implementation

 include/linux/net.h       |   5 ++
 include/linux/skbuff.h    |  20 +++++
 include/net/inet_common.h |   3 +
 include/net/sock.h        |  43 +++++++++++
 include/net/udp.h         |   7 ++
 net/core/datagram.c       |  65 ++++++++++++++++
 net/ipv4/af_inet.c        |  16 ++++
 net/ipv4/udp.c            | 188 +++++++++++++++++++++++++++++++++++++++-------
 net/ipv4/udp_impl.h       |   3 +
 net/ipv4/udplite.c        |   1 +
 net/ipv6/af_inet6.c       |   1 +
 net/ipv6/udp.c            |  89 +++++++++++++++-------
 net/ipv6/udp_impl.h       |   3 +
 net/ipv6/udplite.c        |   1 +
 net/socket.c              | 183 ++++++++++++++++++++++++++++++++------------
 15 files changed, 528 insertions(+), 100 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* [PATCH net 1/1] tipc: fix link statistics counter errors
From: Jon Maloy @ 2016-11-25 15:35 UTC (permalink / raw)
  To: davem; +Cc: Jon Maloy, netdev, tipc-discussion

In commit e4bf4f76962b ("tipc: simplify packet sequence number
handling") we changed the internal representation of the packet
sequence number counters from u32 to u16, reflecting what is really
sent over the wire.

Since then some link statistics counters have been displaying incorrect
values, partially because the counters meant to be used as sequence
number snapshots are now used as direct counters, stored as u32, and
partially because some counter updates are just missing in the code.

In this commit we correct this in two ways. First, we base the
displayed packet sent/received values on direct counters instead
of, as previously, a calculated difference between the current
sequence number and a snapshot. Second, we add the missing updates
of the counters.
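
To illustrate why a snapshot difference stops working once the sequence number
is only 16 bits wide, here is a small standalone sketch. It is not the TIPC
code: struct link_sketch and both helpers are hypothetical, and the snapshot
formula is an assumption about the general shape of the old calculation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified link state. */
struct link_sketch {
	uint16_t snd_nxt;    /* wire sequence number, wraps at 65536 */
	uint32_t snapshot;   /* sequence number saved when stats were reset */
	uint32_t sent_pkts;  /* direct counter, bumped once per packet */
};

/* Simulate sending n packets. */
static void link_sketch_send(struct link_sketch *l, uint32_t n)
{
	assert(l != NULL);
	for (uint32_t i = 0; i < n; i++) {
		l->snd_nxt++;    /* wraps modulo 2^16 */
		l->sent_pkts++;  /* keeps counting in 32 bits */
	}
}

/* Old-style estimate: current sequence number minus the snapshot. */
static uint32_t sent_by_snapshot(const struct link_sketch *l)
{
	assert(l != NULL);
	return (uint32_t)l->snd_nxt - l->snapshot;
}
```

After 70000 packets the snapshot-based estimate has lost one full wrap of the
16-bit sequence number, while the direct counter still holds the true total.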

This change is compatible with the current netlink API, and requires
no changes to the user space tools.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
---
 net/tipc/link.c | 35 +++++++++++++++++++----------------
 1 file changed, 19 insertions(+), 16 deletions(-)

diff --git a/net/tipc/link.c b/net/tipc/link.c
index ecc12411..bda89bf 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -47,8 +47,8 @@
 #include <linux/pkt_sched.h>
 
 struct tipc_stats {
-	u32 sent_info;		/* used in counting # sent packets */
-	u32 recv_info;		/* used in counting # recv'd packets */
+	u32 sent_pkts;
+	u32 recv_pkts;
 	u32 sent_states;
 	u32 recv_states;
 	u32 sent_probes;
@@ -857,7 +857,6 @@ void tipc_link_reset(struct tipc_link *l)
 	l->acked = 0;
 	l->silent_intv_cnt = 0;
 	l->rst_cnt = 0;
-	l->stats.recv_info = 0;
 	l->stale_count = 0;
 	l->bc_peer_is_up = false;
 	memset(&l->mon_state, 0, sizeof(l->mon_state));
@@ -888,6 +887,7 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list,
 	struct sk_buff_head *transmq = &l->transmq;
 	struct sk_buff_head *backlogq = &l->backlogq;
 	struct sk_buff *skb, *_skb, *bskb;
+	int pkt_cnt = skb_queue_len(list);
 
 	/* Match msg importance against this and all higher backlog limits: */
 	if (!skb_queue_empty(backlogq)) {
@@ -901,6 +901,11 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list,
 		return -EMSGSIZE;
 	}
 
+	if (pkt_cnt > 1) {
+		l->stats.sent_fragmented++;
+		l->stats.sent_fragments += pkt_cnt;
+	}
+
 	/* Prepare each packet for sending, and add to relevant queue: */
 	while (skb_queue_len(list)) {
 		skb = skb_peek(list);
@@ -920,6 +925,7 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list,
 			__skb_queue_tail(xmitq, _skb);
 			TIPC_SKB_CB(skb)->ackers = l->ackers;
 			l->rcv_unacked = 0;
+			l->stats.sent_pkts++;
 			seqno++;
 			continue;
 		}
@@ -968,6 +974,7 @@ void tipc_link_advance_backlog(struct tipc_link *l, struct sk_buff_head *xmitq)
 		msg_set_ack(hdr, ack);
 		msg_set_bcast_ack(hdr, bc_ack);
 		l->rcv_unacked = 0;
+		l->stats.sent_pkts++;
 		seqno++;
 	}
 	l->snd_nxt = seqno;
@@ -1260,7 +1267,7 @@ int tipc_link_rcv(struct tipc_link *l, struct sk_buff *skb,
 
 		/* Deliver packet */
 		l->rcv_nxt++;
-		l->stats.recv_info++;
+		l->stats.recv_pkts++;
 		if (!tipc_data_input(l, skb, l->inputq))
 			rc |= tipc_link_input(l, skb, l->inputq);
 		if (unlikely(++l->rcv_unacked >= TIPC_MIN_LINK_WIN))
@@ -1800,10 +1807,6 @@ void tipc_link_set_queue_limits(struct tipc_link *l, u32 win)
 void tipc_link_reset_stats(struct tipc_link *l)
 {
 	memset(&l->stats, 0, sizeof(l->stats));
-	if (!link_is_bc_sndlink(l)) {
-		l->stats.sent_info = l->snd_nxt;
-		l->stats.recv_info = l->rcv_nxt;
-	}
 }
 
 static void link_print(struct tipc_link *l, const char *str)
@@ -1867,12 +1870,12 @@ static int __tipc_nl_add_stats(struct sk_buff *skb, struct tipc_stats *s)
 	};
 
 	struct nla_map map[] = {
-		{TIPC_NLA_STATS_RX_INFO, s->recv_info},
+		{TIPC_NLA_STATS_RX_INFO, 0},
 		{TIPC_NLA_STATS_RX_FRAGMENTS, s->recv_fragments},
 		{TIPC_NLA_STATS_RX_FRAGMENTED, s->recv_fragmented},
 		{TIPC_NLA_STATS_RX_BUNDLES, s->recv_bundles},
 		{TIPC_NLA_STATS_RX_BUNDLED, s->recv_bundled},
-		{TIPC_NLA_STATS_TX_INFO, s->sent_info},
+		{TIPC_NLA_STATS_TX_INFO, 0},
 		{TIPC_NLA_STATS_TX_FRAGMENTS, s->sent_fragments},
 		{TIPC_NLA_STATS_TX_FRAGMENTED, s->sent_fragmented},
 		{TIPC_NLA_STATS_TX_BUNDLES, s->sent_bundles},
@@ -1947,9 +1950,9 @@ int __tipc_nl_add_link(struct net *net, struct tipc_nl_msg *msg,
 		goto attr_msg_full;
 	if (nla_put_u32(msg->skb, TIPC_NLA_LINK_MTU, link->mtu))
 		goto attr_msg_full;
-	if (nla_put_u32(msg->skb, TIPC_NLA_LINK_RX, link->rcv_nxt))
+	if (nla_put_u32(msg->skb, TIPC_NLA_LINK_RX, link->stats.recv_pkts))
 		goto attr_msg_full;
-	if (nla_put_u32(msg->skb, TIPC_NLA_LINK_TX, link->snd_nxt))
+	if (nla_put_u32(msg->skb, TIPC_NLA_LINK_TX, link->stats.sent_pkts))
 		goto attr_msg_full;
 
 	if (tipc_link_is_up(link))
@@ -2004,12 +2007,12 @@ static int __tipc_nl_add_bc_link_stat(struct sk_buff *skb,
 	};
 
 	struct nla_map map[] = {
-		{TIPC_NLA_STATS_RX_INFO, stats->recv_info},
+		{TIPC_NLA_STATS_RX_INFO, stats->recv_pkts},
 		{TIPC_NLA_STATS_RX_FRAGMENTS, stats->recv_fragments},
 		{TIPC_NLA_STATS_RX_FRAGMENTED, stats->recv_fragmented},
 		{TIPC_NLA_STATS_RX_BUNDLES, stats->recv_bundles},
 		{TIPC_NLA_STATS_RX_BUNDLED, stats->recv_bundled},
-		{TIPC_NLA_STATS_TX_INFO, stats->sent_info},
+		{TIPC_NLA_STATS_TX_INFO, stats->sent_pkts},
 		{TIPC_NLA_STATS_TX_FRAGMENTS, stats->sent_fragments},
 		{TIPC_NLA_STATS_TX_FRAGMENTED, stats->sent_fragmented},
 		{TIPC_NLA_STATS_TX_BUNDLES, stats->sent_bundles},
@@ -2076,9 +2079,9 @@ int tipc_nl_add_bc_link(struct net *net, struct tipc_nl_msg *msg)
 		goto attr_msg_full;
 	if (nla_put_string(msg->skb, TIPC_NLA_LINK_NAME, bcl->name))
 		goto attr_msg_full;
-	if (nla_put_u32(msg->skb, TIPC_NLA_LINK_RX, bcl->rcv_nxt))
+	if (nla_put_u32(msg->skb, TIPC_NLA_LINK_RX, 0))
 		goto attr_msg_full;
-	if (nla_put_u32(msg->skb, TIPC_NLA_LINK_TX, bcl->snd_nxt))
+	if (nla_put_u32(msg->skb, TIPC_NLA_LINK_TX, 0))
 		goto attr_msg_full;
 
 	prop = nla_nest_start(msg->skb, TIPC_NLA_LINK_PROP);
-- 
2.7.4


^ permalink raw reply related

* [PATCH net-next 3/5] net: mvneta: Only disable mvneta_bm for 64-bits
From: Gregory CLEMENT @ 2016-11-25 15:30 UTC (permalink / raw)
  To: David S. Miller, linux-kernel, netdev
  Cc: Jisheng Zhang, Arnd Bergmann, Jason Cooper, Andrew Lunn,
	Sebastian Hesselbarth, Gregory CLEMENT, Thomas Petazzoni,
	linux-arm-kernel, Nadav Haklai, Marcin Wojtas, Dmitri Epshtein,
	Yelena Krivosheev
In-Reply-To: <cover.2b146800967005632cd02d4da77397e6e2fdf51f.1480087510.git-series.gregory.clement@free-electrons.com>

Only the mvneta_bm support is not 64-bit compatible.
The mvneta code itself can run on 64-bit architectures.

Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
---
 drivers/net/ethernet/marvell/Kconfig | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/Kconfig b/drivers/net/ethernet/marvell/Kconfig
index 66fd9dbb2ca7..2ccea9dd9248 100644
--- a/drivers/net/ethernet/marvell/Kconfig
+++ b/drivers/net/ethernet/marvell/Kconfig
@@ -44,6 +44,7 @@ config MVMDIO
 config MVNETA_BM_ENABLE
 	tristate "Marvell Armada 38x/XP network interface BM support"
 	depends on MVNETA
+	depends on !64BIT
 	---help---
 	  This driver supports auxiliary block of the network
 	  interface units in the Marvell ARMADA XP and ARMADA 38x SoC
@@ -58,7 +59,6 @@ config MVNETA
 	tristate "Marvell Armada 370/38x/XP network interface support"
 	depends on PLAT_ORION || COMPILE_TEST
 	depends on HAS_DMA
-	depends on !64BIT
 	select MVMDIO
 	select FIXED_PHY
 	---help---
@@ -71,6 +71,7 @@ config MVNETA
 
 config MVNETA_BM
 	tristate
+	depends on !64BIT
 	default y if MVNETA=y && MVNETA_BM_ENABLE!=n
 	default MVNETA_BM_ENABLE
 	select HWBM
-- 
git-series 0.8.10

^ permalink raw reply related

* [PATCH net-next 4/5] net: mvneta: Add network support for Armada 3700 SoC
From: Gregory CLEMENT @ 2016-11-25 15:30 UTC (permalink / raw)
  To: David S. Miller, linux-kernel, netdev
  Cc: Jisheng Zhang, Arnd Bergmann, Jason Cooper, Andrew Lunn,
	Sebastian Hesselbarth, Gregory CLEMENT, Thomas Petazzoni,
	linux-arm-kernel, Nadav Haklai, Marcin Wojtas, Dmitri Epshtein,
	Yelena Krivosheev
In-Reply-To: <cover.2b146800967005632cd02d4da77397e6e2fdf51f.1480087510.git-series.gregory.clement@free-electrons.com>

From: Marcin Wojtas <mw@semihalf.com>

Armada 3700 is a new ARMv8 SoC from Marvell using the same network controller
as the older Armada 370/38x/XP. There are, however, some differences that
need to be taken into account when adding support for it:

* open default MBUS window to 4GB of DRAM - Armada 3700 SoC's Mbus
  configuration for network controller has to be done on two levels:
  global and per-port. The first one is inherited from the
  bootloader. The latter can be opened in a default way, leaving
  arbitration to the bus controller.  Hence a filled mbus_dram_target_info
  structure is not needed.

* make per-CPU operation optional - Recent patches adding RSS and XPS
  support for Armada 38x/XP enabled per-CPU operation of the controller
  by default. Contrary to older SoCs, the Armada 3700 SoC's network
  controller is not capable of per-CPU processing due to the connectivity
  of its interrupt lines.  This patch restores non-per-CPU operation, which
  is now optional and depends on the neta_armada3700 flag value in the
  mvneta_port structure. In order not to complicate the code, a separate
  interrupt subroutine is implemented.
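
The window register arithmetic behind the first item can be sketched as
follows. The two helpers are hypothetical and only mirror the expressions
visible in mvneta_conf_mbus_windows(); the input values used to exercise them
are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Base register: upper 16 address bits, attribute in bits 15:8, target id. */
static uint32_t win_base_reg(uint32_t base, uint8_t attr, uint8_t target)
{
	return (base & 0xffff0000u) | ((uint32_t)attr << 8) | target;
}

/* Size register: (size - 1) with the low 16 bits masked off. */
static uint32_t win_size_reg(uint64_t size)
{
	assert(size != 0);
	return (uint32_t)((size - 1) & 0xffff0000u);
}
```

With a 4 GiB window the size formula yields 0xffff0000, which is the literal
the Armada 3700 path writes to MVNETA_WIN_SIZE(0) when no
mbus_dram_target_info is available.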

For now, on the Armada 3700, RSS is disabled as the current
implementation depends on per-CPU interrupts.

[gregory.clement@free-electrons.com: extract from a larger patch, replace
some ifdef and port to net-next for v4.10]

Signed-off-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
---
 Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt |   7 +-
 drivers/net/ethernet/marvell/Kconfig                              |   7 +-
 drivers/net/ethernet/marvell/mvneta.c                             | 287 +++++++++++++++++++++++++++++++++++++++++++++++++++---------------------
 3 files changed, 214 insertions(+), 87 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt b/Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt
index 73be8970815e..7aa840c8768d 100644
--- a/Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt
+++ b/Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt
@@ -1,7 +1,10 @@
-* Marvell Armada 370 / Armada XP Ethernet Controller (NETA)
+* Marvell Armada 370 / Armada XP / Armada 3700 Ethernet Controller (NETA)
 
 Required properties:
-- compatible: "marvell,armada-370-neta" or "marvell,armada-xp-neta".
+- compatible: could be one of the followings
+	"marvell,armada-370-neta"
+	"marvell,armada-xp-neta"
+	"marvell,armada-3700-neta"
 - reg: address and length of the register set for the device.
 - interrupts: interrupt for the device
 - phy: See ethernet.txt file in the same directory.
diff --git a/drivers/net/ethernet/marvell/Kconfig b/drivers/net/ethernet/marvell/Kconfig
index 2ccea9dd9248..3b8f11fe5e13 100644
--- a/drivers/net/ethernet/marvell/Kconfig
+++ b/drivers/net/ethernet/marvell/Kconfig
@@ -56,14 +56,15 @@ config MVNETA_BM_ENABLE
 	  buffer management.
 
 config MVNETA
-	tristate "Marvell Armada 370/38x/XP network interface support"
-	depends on PLAT_ORION || COMPILE_TEST
+	tristate "Marvell Armada 370/38x/XP/37xx network interface support"
+	depends on ARCH_MVEBU || COMPILE_TEST
 	depends on HAS_DMA
 	select MVMDIO
 	select FIXED_PHY
 	---help---
 	  This driver supports the network interface units in the
-	  Marvell ARMADA XP, ARMADA 370 and ARMADA 38x SoC family.
+	  Marvell ARMADA XP, ARMADA 370, ARMADA 38x and
+	  ARMADA 37xx SoC family.
 
 	  Note that this driver is distinct from the mv643xx_eth
 	  driver, which should be used for the older Marvell SoCs
diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index ad3872e07a93..77cef5a9de7b 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -397,6 +397,9 @@ struct mvneta_port {
 	spinlock_t lock;
 	bool is_stopped;
 
+	u32 cause_rx_tx;
+	struct napi_struct napi;
+
 	/* Core clock */
 	struct clk *clk;
 	/* AXI clock */
@@ -422,6 +425,9 @@ struct mvneta_port {
 	u64 ethtool_stats[ARRAY_SIZE(mvneta_statistics)];
 
 	u32 indir[MVNETA_RSS_LU_TABLE_SIZE];
+
+	/* Flags for special SoC configurations */
+	bool neta_armada3700;
 	u16 rx_offset_correction;
 };
 
@@ -965,14 +971,9 @@ static int mvneta_mbus_io_win_set(struct mvneta_port *pp, u32 base, u32 wsize,
 	return 0;
 }
 
-/* Assign and initialize pools for port. In case of fail
- * buffer manager will remain disabled for current port.
- */
-static int mvneta_bm_port_init(struct platform_device *pdev,
-			       struct mvneta_port *pp)
+static  int mvneta_bm_port_mbus_init(struct mvneta_port *pp)
 {
-	struct device_node *dn = pdev->dev.of_node;
-	u32 long_pool_id, short_pool_id, wsize;
+	u32 wsize;
 	u8 target, attr;
 	int err;
 
@@ -991,6 +992,25 @@ static int mvneta_bm_port_init(struct platform_device *pdev,
 		netdev_info(pp->dev, "fail to configure mbus window to BM\n");
 		return err;
 	}
+	return 0;
+}
+
+/* Assign and initialize pools for port. In case of fail
+ * buffer manager will remain disabled for current port.
+ */
+static int mvneta_bm_port_init(struct platform_device *pdev,
+			       struct mvneta_port *pp)
+{
+	struct device_node *dn = pdev->dev.of_node;
+	u32 long_pool_id, short_pool_id;
+
+	if (!pp->neta_armada3700) {
+		int ret;
+
+		ret = mvneta_bm_port_mbus_init(pp);
+		if (ret)
+			return ret;
+	}
 
 	if (of_property_read_u32(dn, "bm,pool-long", &long_pool_id)) {
 		netdev_info(pp->dev, "missing long pool id\n");
@@ -1359,22 +1379,27 @@ static void mvneta_defaults_set(struct mvneta_port *pp)
 	for_each_present_cpu(cpu) {
 		int rxq_map = 0, txq_map = 0;
 		int rxq, txq;
+		if (!pp->neta_armada3700) {
+			for (rxq = 0; rxq < rxq_number; rxq++)
+				if ((rxq % max_cpu) == cpu)
+					rxq_map |= MVNETA_CPU_RXQ_ACCESS(rxq);
+
+			for (txq = 0; txq < txq_number; txq++)
+				if ((txq % max_cpu) == cpu)
+					txq_map |= MVNETA_CPU_TXQ_ACCESS(txq);
+
+			/* With only one TX queue we configure a special case
+			 * which will allow to get all the irq on a single
+			 * CPU
+			 */
+			if (txq_number == 1)
+				txq_map = (cpu == pp->rxq_def) ?
+					MVNETA_CPU_TXQ_ACCESS(1) : 0;
 
-		for (rxq = 0; rxq < rxq_number; rxq++)
-			if ((rxq % max_cpu) == cpu)
-				rxq_map |= MVNETA_CPU_RXQ_ACCESS(rxq);
-
-		for (txq = 0; txq < txq_number; txq++)
-			if ((txq % max_cpu) == cpu)
-				txq_map |= MVNETA_CPU_TXQ_ACCESS(txq);
-
-		/* With only one TX queue we configure a special case
-		 * which will allow to get all the irq on a single
-		 * CPU
-		 */
-		if (txq_number == 1)
-			txq_map = (cpu == pp->rxq_def) ?
-				MVNETA_CPU_TXQ_ACCESS(1) : 0;
+		} else {
+			txq_map = MVNETA_CPU_TXQ_ACCESS_ALL_MASK;
+			rxq_map = MVNETA_CPU_RXQ_ACCESS_ALL_MASK;
+		}
 
 		mvreg_write(pp, MVNETA_CPU_MAP(cpu), rxq_map | txq_map);
 	}
@@ -2632,6 +2657,17 @@ static void mvneta_set_rx_mode(struct net_device *dev)
 /* Interrupt handling - the callback for request_irq() */
 static irqreturn_t mvneta_isr(int irq, void *dev_id)
 {
+	struct mvneta_port *pp = (struct mvneta_port *)dev_id;
+
+	mvreg_write(pp, MVNETA_INTR_NEW_MASK, 0);
+	napi_schedule(&pp->napi);
+
+	return IRQ_HANDLED;
+}
+
+/* Interrupt handling - the callback for request_percpu_irq() */
+static irqreturn_t mvneta_percpu_isr(int irq, void *dev_id)
+{
 	struct mvneta_pcpu_port *port = (struct mvneta_pcpu_port *)dev_id;
 
 	disable_percpu_irq(port->pp->dev->irq);
@@ -2679,7 +2715,7 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
 	struct mvneta_pcpu_port *port = this_cpu_ptr(pp->ports);
 
 	if (!netif_running(pp->dev)) {
-		napi_complete(&port->napi);
+		napi_complete(napi);
 		return rx_done;
 	}
 
@@ -2708,7 +2744,8 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
 	 */
 	rx_queue = fls(((cause_rx_tx >> 8) & 0xff));
 
-	cause_rx_tx |= port->cause_rx_tx;
+	cause_rx_tx |= pp->neta_armada3700 ? pp->cause_rx_tx :
+		port->cause_rx_tx;
 
 	if (rx_queue) {
 		rx_queue = rx_queue - 1;
@@ -2722,11 +2759,27 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
 
 	if (budget > 0) {
 		cause_rx_tx = 0;
-		napi_complete(&port->napi);
-		enable_percpu_irq(pp->dev->irq, 0);
+		napi_complete(napi);
+
+		if (pp->neta_armada3700) {
+			unsigned long flags;
+
+			local_irq_save(flags);
+			mvreg_write(pp, MVNETA_INTR_NEW_MASK,
+				    MVNETA_RX_INTR_MASK(rxq_number) |
+				    MVNETA_TX_INTR_MASK(txq_number) |
+				    MVNETA_MISCINTR_INTR_MASK);
+			local_irq_restore(flags);
+		} else {
+			enable_percpu_irq(pp->dev->irq, 0);
+		}
 	}
 
-	port->cause_rx_tx = cause_rx_tx;
+	if (pp->neta_armada3700)
+		pp->cause_rx_tx = cause_rx_tx;
+	else
+		port->cause_rx_tx = cause_rx_tx;
+
 	return rx_done;
 }
 
@@ -3054,11 +3107,16 @@ static void mvneta_start_dev(struct mvneta_port *pp)
 	/* start the Rx/Tx activity */
 	mvneta_port_enable(pp);
 
-	/* Enable polling on the port */
-	for_each_online_cpu(cpu) {
-		struct mvneta_pcpu_port *port = per_cpu_ptr(pp->ports, cpu);
+	if (!pp->neta_armada3700) {
+		/* Enable polling on the port */
+		for_each_online_cpu(cpu) {
+			struct mvneta_pcpu_port *port =
+				per_cpu_ptr(pp->ports, cpu);
 
-		napi_enable(&port->napi);
+			napi_enable(&port->napi);
+		}
+	} else {
+		napi_enable(&pp->napi);
 	}
 
 	/* Unmask interrupts. It has to be done from each CPU */
@@ -3080,10 +3138,15 @@ static void mvneta_stop_dev(struct mvneta_port *pp)
 
 	phy_stop(ndev->phydev);
 
-	for_each_online_cpu(cpu) {
-		struct mvneta_pcpu_port *port = per_cpu_ptr(pp->ports, cpu);
+	if (!pp->neta_armada3700) {
+		for_each_online_cpu(cpu) {
+			struct mvneta_pcpu_port *port =
+				per_cpu_ptr(pp->ports, cpu);
 
-		napi_disable(&port->napi);
+			napi_disable(&port->napi);
+		}
+	} else {
+		napi_disable(&pp->napi);
 	}
 
 	netif_carrier_off(pp->dev);
@@ -3493,31 +3556,37 @@ static int mvneta_open(struct net_device *dev)
 		goto err_cleanup_rxqs;
 
 	/* Connect to port interrupt line */
-	ret = request_percpu_irq(pp->dev->irq, mvneta_isr,
-				 MVNETA_DRIVER_NAME, pp->ports);
+	if (pp->neta_armada3700)
+		ret = request_irq(pp->dev->irq, mvneta_isr, 0,
+				  dev->name, pp);
+	else
+		ret = request_percpu_irq(pp->dev->irq, mvneta_percpu_isr,
+					 dev->name, pp->ports);
 	if (ret) {
 		netdev_err(pp->dev, "cannot request irq %d\n", pp->dev->irq);
 		goto err_cleanup_txqs;
 	}
 
-	/* Enable per-CPU interrupt on all the CPU to handle our RX
-	 * queue interrupts
-	 */
-	on_each_cpu(mvneta_percpu_enable, pp, true);
+	if (!pp->neta_armada3700) {
+		/* Enable per-CPU interrupt on all the CPU to handle our RX
+		 * queue interrupts
+		 */
+		on_each_cpu(mvneta_percpu_enable, pp, true);
 
-	pp->is_stopped = false;
-	/* Register a CPU notifier to handle the case where our CPU
-	 * might be taken offline.
-	 */
-	ret = cpuhp_state_add_instance_nocalls(online_hpstate,
-					       &pp->node_online);
-	if (ret)
-		goto err_free_irq;
+		pp->is_stopped = false;
+		/* Register a CPU notifier to handle the case where our CPU
+		 * might be taken offline.
+		 */
+		ret = cpuhp_state_add_instance_nocalls(online_hpstate,
+						       &pp->node_online);
+		if (ret)
+			goto err_free_irq;
 
-	ret = cpuhp_state_add_instance_nocalls(CPUHP_NET_MVNETA_DEAD,
-					       &pp->node_dead);
-	if (ret)
-		goto err_free_online_hp;
+		ret = cpuhp_state_add_instance_nocalls(CPUHP_NET_MVNETA_DEAD,
+						       &pp->node_dead);
+		if (ret)
+			goto err_free_online_hp;
+	}
 
 	/* In default link is down */
 	netif_carrier_off(pp->dev);
@@ -3533,13 +3602,20 @@ static int mvneta_open(struct net_device *dev)
 	return 0;
 
 err_free_dead_hp:
-	cpuhp_state_remove_instance_nocalls(CPUHP_NET_MVNETA_DEAD,
-					    &pp->node_dead);
+	if (!pp->neta_armada3700)
+		cpuhp_state_remove_instance_nocalls(CPUHP_NET_MVNETA_DEAD,
+						    &pp->node_dead);
 err_free_online_hp:
-	cpuhp_state_remove_instance_nocalls(online_hpstate, &pp->node_online);
+	if (!pp->neta_armada3700)
+		cpuhp_state_remove_instance_nocalls(online_hpstate,
+						    &pp->node_online);
 err_free_irq:
-	on_each_cpu(mvneta_percpu_disable, pp, true);
-	free_percpu_irq(pp->dev->irq, pp->ports);
+	if (pp->neta_armada3700) {
+		free_irq(pp->dev->irq, pp);
+	} else {
+		on_each_cpu(mvneta_percpu_disable, pp, true);
+		free_percpu_irq(pp->dev->irq, pp->ports);
+	}
 err_cleanup_txqs:
 	mvneta_cleanup_txqs(pp);
 err_cleanup_rxqs:
@@ -3552,23 +3628,30 @@ static int mvneta_stop(struct net_device *dev)
 {
 	struct mvneta_port *pp = netdev_priv(dev);
 
-	/* Inform that we are stopping so we don't want to setup the
-	 * driver for new CPUs in the notifiers. The code of the
-	 * notifier for CPU online is protected by the same spinlock,
-	 * so when we get the lock, the notifer work is done.
-	 */
-	spin_lock(&pp->lock);
-	pp->is_stopped = true;
-	spin_unlock(&pp->lock);
+	if (!pp->neta_armada3700) {
+		/* Inform that we are stopping so we don't want to setup the
+		 * driver for new CPUs in the notifiers. The code of the
+		 * notifier for CPU online is protected by the same spinlock,
+		 * so when we get the lock, the notifer work is done.
+		 */
+		spin_lock(&pp->lock);
+		pp->is_stopped = true;
+		spin_unlock(&pp->lock);
 
-	mvneta_stop_dev(pp);
-	mvneta_mdio_remove(pp);
+		mvneta_stop_dev(pp);
+		mvneta_mdio_remove(pp);
 
 	cpuhp_state_remove_instance_nocalls(online_hpstate, &pp->node_online);
 	cpuhp_state_remove_instance_nocalls(CPUHP_NET_MVNETA_DEAD,
 					    &pp->node_dead);
-	on_each_cpu(mvneta_percpu_disable, pp, true);
-	free_percpu_irq(dev->irq, pp->ports);
+		on_each_cpu(mvneta_percpu_disable, pp, true);
+		free_percpu_irq(dev->irq, pp->ports);
+	} else {
+		mvneta_stop_dev(pp);
+		mvneta_mdio_remove(pp);
+		free_irq(dev->irq, pp);
+	}
+
 	mvneta_cleanup_rxqs(pp);
 	mvneta_cleanup_txqs(pp);
 
@@ -3847,6 +3930,11 @@ static int mvneta_ethtool_set_rxfh(struct net_device *dev, const u32 *indir,
 				   const u8 *key, const u8 hfunc)
 {
 	struct mvneta_port *pp = netdev_priv(dev);
+
+	/* Current code for Armada 3700 doesn't support RSS features yet */
+	if (pp->neta_armada3700)
+		return -EOPNOTSUPP;
+
 	/* We require at least one supported parameter to be changed
 	 * and no change in any of the unsupported parameters
 	 */
@@ -3867,6 +3955,10 @@ static int mvneta_ethtool_get_rxfh(struct net_device *dev, u32 *indir, u8 *key,
 {
 	struct mvneta_port *pp = netdev_priv(dev);
 
+	/* Current code for Armada 3700 doesn't support RSS features yet */
+	if (pp->neta_armada3700)
+		return -EOPNOTSUPP;
+
 	if (hfunc)
 		*hfunc = ETH_RSS_HASH_TOP;
 
@@ -3969,16 +4061,29 @@ static void mvneta_conf_mbus_windows(struct mvneta_port *pp,
 	win_enable = 0x3f;
 	win_protect = 0;
 
-	for (i = 0; i < dram->num_cs; i++) {
-		const struct mbus_dram_window *cs = dram->cs + i;
-		mvreg_write(pp, MVNETA_WIN_BASE(i), (cs->base & 0xffff0000) |
-			    (cs->mbus_attr << 8) | dram->mbus_dram_target_id);
+	if (dram) {
+		for (i = 0; i < dram->num_cs; i++) {
+			const struct mbus_dram_window *cs = dram->cs + i;
+
+			mvreg_write(pp, MVNETA_WIN_BASE(i),
+				    (cs->base & 0xffff0000) |
+				    (cs->mbus_attr << 8) |
+				    dram->mbus_dram_target_id);
 
-		mvreg_write(pp, MVNETA_WIN_SIZE(i),
-			    (cs->size - 1) & 0xffff0000);
+			mvreg_write(pp, MVNETA_WIN_SIZE(i),
+				    (cs->size - 1) & 0xffff0000);
 
-		win_enable &= ~(1 << i);
-		win_protect |= 3 << (2 * i);
+			win_enable &= ~(1 << i);
+			win_protect |= 3 << (2 * i);
+		}
+	} else {
+		/* For Armada3700 open default 4GB Mbus window, leaving
+		 * arbitration of target/attribute to a different layer
+		 * of configuration.
+		 */
+		mvreg_write(pp, MVNETA_WIN_SIZE(0), 0xffff0000);
+		win_enable &= ~BIT(0);
+		win_protect = 3;
 	}
 
 	mvreg_write(pp, MVNETA_BASE_ADDR_ENABLE, win_enable);
@@ -4108,6 +4213,10 @@ static int mvneta_probe(struct platform_device *pdev)
 
 	pp->indir[0] = rxq_def;
 
+	/* Get special SoC configurations */
+	if (of_device_is_compatible(dn, "marvell,armada-3700-neta"))
+		pp->neta_armada3700 = true;
+
 	pp->clk = devm_clk_get(&pdev->dev, "core");
 	if (IS_ERR(pp->clk))
 		pp->clk = devm_clk_get(&pdev->dev, NULL);
@@ -4175,7 +4284,11 @@ static int mvneta_probe(struct platform_device *pdev)
 	pp->tx_csum_limit = tx_csum_limit;
 
 	dram_target_info = mv_mbus_dram_info();
-	if (dram_target_info)
+	/* Armada3700 requires setting default configuration of Mbus
+	 * windows, however without using filled mbus_dram_target_info
+	 * structure.
+	 */
+	if (dram_target_info || pp->neta_armada3700)
 		mvneta_conf_mbus_windows(pp, dram_target_info);
 
 	pp->tx_ring_size = MVNETA_MAX_TXD;
@@ -4208,11 +4321,20 @@ static int mvneta_probe(struct platform_device *pdev)
 		goto err_netdev;
 	}
 
-	for_each_present_cpu(cpu) {
-		struct mvneta_pcpu_port *port = per_cpu_ptr(pp->ports, cpu);
+	/* Armada3700 network controller does not support per-cpu
+	 * operation, so only single NAPI should be initialized.
+	 */
+	if (pp->neta_armada3700) {
+		netif_napi_add(dev, &pp->napi, mvneta_poll, NAPI_POLL_WEIGHT);
+	} else {
+		for_each_present_cpu(cpu) {
+			struct mvneta_pcpu_port *port =
+				per_cpu_ptr(pp->ports, cpu);
 
-		netif_napi_add(dev, &port->napi, mvneta_poll, NAPI_POLL_WEIGHT);
-		port->pp = pp;
+			netif_napi_add(dev, &port->napi, mvneta_poll,
+				       NAPI_POLL_WEIGHT);
+			port->pp = pp;
+		}
 	}
 
 	dev->features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO;
@@ -4297,6 +4419,7 @@ static int mvneta_remove(struct platform_device *pdev)
 static const struct of_device_id mvneta_match[] = {
 	{ .compatible = "marvell,armada-370-neta" },
 	{ .compatible = "marvell,armada-xp-neta" },
+	{ .compatible = "marvell,armada-3700-neta" },
 	{ }
 };
 MODULE_DEVICE_TABLE(of, mvneta_match);
-- 
git-series 0.8.10

^ permalink raw reply related

* [PATCH net-next 0/5] Support Armada 37xx SoC (ARMv8 64-bits) in mvneta driver
From: Gregory CLEMENT @ 2016-11-25 15:30 UTC (permalink / raw)
  To: David S. Miller, linux-kernel, netdev
  Cc: Jisheng Zhang, Arnd Bergmann, Jason Cooper, Andrew Lunn,
	Sebastian Hesselbarth, Gregory CLEMENT, Thomas Petazzoni,
	linux-arm-kernel, Nadav Haklai, Marcin Wojtas, Dmitri Epshtein,
	Yelena Krivosheev

Hi,

The Armada 37xx is a new ARMv8 SoC from Marvell using the same network
controller as the older Armada 370/38x/XP SoCs. This series adapts the
driver so that it can be used on this new SoC. The main changes
are:

- 64-bit support: the first patches allow the driver to be used on a
  64-bit architecture.

- MBUS support: the MBUS configuration on Armada 37xx differs from
  that of the older SoCs.

- per-CPU interrupts: Armada 37xx does not support per-CPU interrupts
  for the NETA IP, so the non-per-CPU behavior was added back.

The first item is addressed by patches 1 to 3.
The last two items are addressed by patch 4.
Patch 5 adds the DT support.
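
As an illustration of the MBUS item, here is a minimal, self-contained
sketch of the window register arithmetic that the mvneta_conf_mbus_windows()
hunk of patch 4 performs. It is not driver code: struct dram_cs and the
win_*() helpers are hypothetical names invented for this example, and the
initial enable mask is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one DRAM chip-select entry. */
struct dram_cs {
	uint32_t base;
	uint32_t size;
	uint8_t mbus_attr;
};

/* Window base register: 64 KiB aligned base, attribute, target id. */
static uint32_t win_base_reg(const struct dram_cs *cs, uint8_t target_id)
{
	return (cs->base & 0xffff0000u) |
	       ((uint32_t)cs->mbus_attr << 8) |
	       target_id;
}

/* Window size register: (size - 1), truncated to 64 KiB granularity.
 * A 4 GiB size wraps to 0 in 32 bits and therefore yields 0xffff0000,
 * the value the patch writes for the default Armada 3700 window.
 */
static uint32_t win_size_reg(uint32_t size)
{
	return (size - 1u) & 0xffff0000u;
}

/* Open window i: clear its disable bit, set its two protection bits. */
static void win_open(uint32_t *enable, uint32_t *protect, unsigned int i)
{
	assert(i < 16);	/* two protection bits per window in a 32-bit word */
	*enable &= ~(1u << i);
	*protect |= 3u << (2 * i);
}
```

With a DRAM description, one window is opened per chip select. On
Armada 37xx only window 0 is opened, with the 0xffff0000 size value and a
protection value of 3, and the base register is left untouched.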

Besides Armada 37xx, the series has been tested on Armada XP and
Armada 38x (with Hardware Buffer Management and with Software Buffer
Management).

Thanks,

Gregory

Gregory CLEMENT (3):
  net: mvneta: Use cacheable memory to store the rx buffer virtual address
  net: mvneta: Only disable mvneta_bm for 64-bits
  ARM64: dts: marvell: Add network support for Armada 3700

Marcin Wojtas (2):
  net: mvneta: Convert to be 64 bits compatible
  net: mvneta: Add network support for Armada 3700 SoC

 Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt |   7 +-
 arch/arm64/boot/dts/marvell/armada-3720-db.dts                    |  23 ++++-
 arch/arm64/boot/dts/marvell/armada-37xx.dtsi                      |  23 ++++-
 drivers/net/ethernet/marvell/Kconfig                              |  10 +-
 drivers/net/ethernet/marvell/mvneta.c                             | 400 ++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------
 5 files changed, 362 insertions(+), 101 deletions(-)

base-commit: 436accebb53021ef7c63535f60bda410aa87c136
-- 
git-series 0.8.10

^ permalink raw reply

* [PATCH net-next 1/5] net: mvneta: Use cacheable memory to store the rx buffer virtual address
From: Gregory CLEMENT @ 2016-11-25 15:30 UTC (permalink / raw)
  To: David S. Miller, linux-kernel, netdev
  Cc: Jisheng Zhang, Arnd Bergmann, Jason Cooper, Andrew Lunn,
	Sebastian Hesselbarth, Gregory CLEMENT, Thomas Petazzoni,
	linux-arm-kernel, Nadav Haklai, Marcin Wojtas, Dmitri Epshtein,
	Yelena Krivosheev
In-Reply-To: <cover.2b146800967005632cd02d4da77397e6e2fdf51f.1480087510.git-series.gregory.clement@free-electrons.com>

Until now the virtual address of the received buffer was stored in the
cookie field of the rx descriptor. However, this field is only 32 bits
wide, which prevents the driver from being used on a 64-bit architecture.

With this patch the virtual address is stored in an array that is not
shared with the hardware (so the DMA API is no longer needed for it). As
a result, accesses to it can be cached, unlike accesses to the rx
descriptor members.

The change is made in the swbm path only, because the hwbm still uses the
cookie field; this also means that the hwbm is currently not usable on
64-bit.
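
The indexing scheme can be sketched outside the kernel as follows. This is
a simplified illustration, not the driver code: struct rx_desc, struct
rx_queue and the two helpers are hypothetical stand-ins that keep only the
fields needed to show how a descriptor pointer maps to a slot in the
CPU-only array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced descriptor: only the fields relevant here. */
struct rx_desc {
	uint32_t buf_phys_addr;	/* read by the hardware */
	uint32_t buf_cookie;	/* 32 bits: cannot hold a 64-bit pointer */
};

/* Hypothetical, reduced queue. */
struct rx_queue {
	size_t size;		/* number of descriptors in the ring */
	struct rx_desc *descs;	/* descriptor ring */
	void **buf_virt_addr;	/* CPU-only array, one slot per descriptor */
};

/* Record the buffer: the DMA address goes into the descriptor, the
 * virtual address into the side array at the descriptor's ring index.
 */
static void rx_desc_fill(struct rx_queue *rxq, struct rx_desc *desc,
			 uint32_t phys_addr, void *virt_addr)
{
	ptrdiff_t i = desc - rxq->descs;

	assert(i >= 0 && (size_t)i < rxq->size);
	desc->buf_phys_addr = phys_addr;
	rxq->buf_virt_addr[i] = virt_addr;
}

/* Look the virtual address back up from a descriptor pointer. */
static void *rx_desc_virt(const struct rx_queue *rxq,
			  const struct rx_desc *desc)
{
	ptrdiff_t i = desc - rxq->descs;

	assert(i >= 0 && (size_t)i < rxq->size);
	return rxq->buf_virt_addr[i];
}
```

Because the index is derived from pointer arithmetic on the ring, no extra
field has to be added to the descriptor layout seen by the hardware.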

Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
---
 drivers/net/ethernet/marvell/mvneta.c | 96 ++++++++++++++++++++++++----
 1 file changed, 84 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 87274d4ab102..b6849f88cab7 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -561,6 +561,9 @@ struct mvneta_rx_queue {
 	u32 pkts_coal;
 	u32 time_coal;
 
+	/* Virtual address of the RX buffer */
+	void  **buf_virt_addr;
+
 	/* Virtual address of the RX DMA descriptors array */
 	struct mvneta_rx_desc *descs;
 
@@ -1573,10 +1576,14 @@ static void mvneta_tx_done_pkts_coal_set(struct mvneta_port *pp,
 
 /* Handle rx descriptor fill by setting buf_cookie and buf_phys_addr */
 static void mvneta_rx_desc_fill(struct mvneta_rx_desc *rx_desc,
-				u32 phys_addr, u32 cookie)
+				u32 phys_addr, void *virt_addr,
+				struct mvneta_rx_queue *rxq)
 {
-	rx_desc->buf_cookie = cookie;
+	int i;
+
 	rx_desc->buf_phys_addr = phys_addr;
+	i = rx_desc - rxq->descs;
+	rxq->buf_virt_addr[i] = virt_addr;
 }
 
 /* Decrement sent descriptors counter */
@@ -1781,7 +1788,8 @@ EXPORT_SYMBOL_GPL(mvneta_frag_free);
 
 /* Refill processing for SW buffer management */
 static int mvneta_rx_refill(struct mvneta_port *pp,
-			    struct mvneta_rx_desc *rx_desc)
+			    struct mvneta_rx_desc *rx_desc,
+			    struct mvneta_rx_queue *rxq)
 
 {
 	dma_addr_t phys_addr;
@@ -1799,7 +1807,7 @@ static int mvneta_rx_refill(struct mvneta_port *pp,
 		return -ENOMEM;
 	}
 
-	mvneta_rx_desc_fill(rx_desc, phys_addr, (u32)data);
+	mvneta_rx_desc_fill(rx_desc, phys_addr, data, rxq);
 	return 0;
 }
 
@@ -1861,7 +1869,12 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
 
 	for (i = 0; i < rxq->size; i++) {
 		struct mvneta_rx_desc *rx_desc = rxq->descs + i;
-		void *data = (void *)rx_desc->buf_cookie;
+		void *data;
+
+		if (!pp->bm_priv)
+			data = rxq->buf_virt_addr[i];
+		else
+			data = (void *)(uintptr_t)rx_desc->buf_cookie;
 
 		dma_unmap_single(pp->dev->dev.parent, rx_desc->buf_phys_addr,
 				 MVNETA_RX_BUF_SIZE(pp->pkt_size), DMA_FROM_DEVICE);
@@ -1894,12 +1907,13 @@ static int mvneta_rx_swbm(struct mvneta_port *pp, int rx_todo,
 		unsigned char *data;
 		dma_addr_t phys_addr;
 		u32 rx_status, frag_size;
-		int rx_bytes, err;
+		int rx_bytes, err, index;
 
 		rx_done++;
 		rx_status = rx_desc->status;
 		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
-		data = (unsigned char *)rx_desc->buf_cookie;
+		index = rx_desc - rxq->descs;
+		data = (unsigned char *)rxq->buf_virt_addr[index];
 		phys_addr = rx_desc->buf_phys_addr;
 
 		if (!mvneta_rxq_desc_is_first_last(rx_status) ||
@@ -1938,7 +1952,7 @@ static int mvneta_rx_swbm(struct mvneta_port *pp, int rx_todo,
 		}
 
 		/* Refill processing */
-		err = mvneta_rx_refill(pp, rx_desc);
+		err = mvneta_rx_refill(pp, rx_desc, rxq);
 		if (err) {
 			netdev_err(dev, "Linux processing - Can't refill\n");
 			rxq->missed++;
@@ -2020,7 +2034,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
 		rx_done++;
 		rx_status = rx_desc->status;
 		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
-		data = (unsigned char *)rx_desc->buf_cookie;
+		data = (u8 *)(uintptr_t)rx_desc->buf_cookie;
 		phys_addr = rx_desc->buf_phys_addr;
 		pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc);
 		bm_pool = &pp->bm_priv->bm_pools[pool_id];
@@ -2708,6 +2722,57 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
 	return rx_done;
 }
 
+/* Refill processing for HW buffer management */
+static int mvneta_rx_hwbm_refill(struct mvneta_port *pp,
+				 struct mvneta_rx_desc *rx_desc)
+
+{
+	dma_addr_t phys_addr;
+	void *data;
+
+	data = mvneta_frag_alloc(pp->frag_size);
+	if (!data)
+		return -ENOMEM;
+
+	phys_addr = dma_map_single(pp->dev->dev.parent, data,
+				   MVNETA_RX_BUF_SIZE(pp->pkt_size),
+				   DMA_FROM_DEVICE);
+	if (unlikely(dma_mapping_error(pp->dev->dev.parent, phys_addr))) {
+		mvneta_frag_free(pp->frag_size, data);
+		return -ENOMEM;
+	}
+
+	phys_addr += pp->rx_offset_correction;
+	rx_desc->buf_phys_addr = phys_addr;
+	rx_desc->buf_cookie = (uintptr_t)data;
+
+	return 0;
+}
+
+/* Handle rxq fill: allocates rxq skbs; called when initializing a port */
+static int mvneta_rxq_bm_fill(struct mvneta_port *pp,
+			      struct mvneta_rx_queue *rxq,
+			      int num)
+{
+	int i;
+
+	for (i = 0; i < num; i++) {
+		memset(rxq->descs + i, 0, sizeof(struct mvneta_rx_desc));
+		if (mvneta_rx_hwbm_refill(pp, rxq->descs + i) != 0) {
+			netdev_err(pp->dev, "%s:rxq %d, %d of %d buffs  filled\n",
+				   __func__, rxq->id, i, num);
+			break;
+		}
+	}
+
+	/* Add this number of RX descriptors as non occupied (ready to
+	 * get packets)
+	 */
+	mvneta_rxq_non_occup_desc_add(pp, rxq, i);
+
+	return i;
+}
+
 /* Handle rxq fill: allocates rxq skbs; called when initializing a port */
 static int mvneta_rxq_fill(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 			   int num)
@@ -2716,7 +2781,7 @@ static int mvneta_rxq_fill(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 
 	for (i = 0; i < num; i++) {
 		memset(rxq->descs + i, 0, sizeof(struct mvneta_rx_desc));
-		if (mvneta_rx_refill(pp, rxq->descs + i) != 0) {
+		if (mvneta_rx_refill(pp, rxq->descs + i, rxq) != 0) {
 			netdev_err(pp->dev, "%s:rxq %d, %d of %d buffs  filled\n",
 				__func__, rxq->id, i, num);
 			break;
@@ -2784,14 +2849,21 @@ static int mvneta_rxq_init(struct mvneta_port *pp,
 		mvneta_rxq_buf_size_set(pp, rxq,
 					MVNETA_RX_BUF_SIZE(pp->pkt_size));
 		mvneta_rxq_bm_disable(pp, rxq);
+
+		rxq->buf_virt_addr = devm_kmalloc(pp->dev->dev.parent,
+						  rxq->size * sizeof(void *),
+						  GFP_KERNEL);
+		if (!rxq->buf_virt_addr)
+			return -ENOMEM;
+
+		mvneta_rxq_fill(pp, rxq, rxq->size);
 	} else {
 		mvneta_rxq_bm_enable(pp, rxq);
 		mvneta_rxq_long_pool_set(pp, rxq);
 		mvneta_rxq_short_pool_set(pp, rxq);
+		mvneta_rxq_bm_fill(pp, rxq, rxq->size);
 	}
 
-	mvneta_rxq_fill(pp, rxq, rxq->size);
-
 	return 0;
 }
 
-- 
git-series 0.8.10

^ permalink raw reply related

* [PATCH net-next 5/5] ARM64: dts: marvell: Add network support for Armada 3700
From: Gregory CLEMENT @ 2016-11-25 15:30 UTC (permalink / raw)
  To: David S. Miller, linux-kernel, netdev
  Cc: Jisheng Zhang, Andrew Lunn, Jason Cooper, Arnd Bergmann,
	Dmitri Epshtein, Nadav Haklai, Yelena Krivosheev, Gregory CLEMENT,
	Marcin Wojtas, Thomas Petazzoni, linux-arm-kernel,
	Sebastian Hesselbarth
In-Reply-To: <cover.2b146800967005632cd02d4da77397e6e2fdf51f.1480087510.git-series.gregory.clement@free-electrons.com>

Add neta nodes for network support to the device trees of both the SoC
and the board.

Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
---
 arch/arm64/boot/dts/marvell/armada-3720-db.dts | 23 +++++++++++++++++++-
 arch/arm64/boot/dts/marvell/armada-37xx.dtsi   | 23 +++++++++++++++++++-
 2 files changed, 46 insertions(+), 0 deletions(-)

diff --git a/arch/arm64/boot/dts/marvell/armada-3720-db.dts b/arch/arm64/boot/dts/marvell/armada-3720-db.dts
index 1372e9a6aaa4..c8b82e4145de 100644
--- a/arch/arm64/boot/dts/marvell/armada-3720-db.dts
+++ b/arch/arm64/boot/dts/marvell/armada-3720-db.dts
@@ -81,3 +81,26 @@
 &pcie0 {
 	status = "okay";
 };
+
+&mdio {
+	status = "okay";
+	phy0: ethernet-phy@0 {
+		reg = <0>;
+	};
+
+	phy1: ethernet-phy@1 {
+		reg = <1>;
+	};
+};
+
+&eth0 {
+	phy-mode = "rgmii-id";
+	phy = <&phy0>;
+	status = "okay";
+};
+
+&eth1 {
+	phy-mode = "rgmii-id";
+	phy = <&phy1>;
+	status = "okay";
+};
diff --git a/arch/arm64/boot/dts/marvell/armada-37xx.dtsi b/arch/arm64/boot/dts/marvell/armada-37xx.dtsi
index e9bd58793464..3b8eb45bdc76 100644
--- a/arch/arm64/boot/dts/marvell/armada-37xx.dtsi
+++ b/arch/arm64/boot/dts/marvell/armada-37xx.dtsi
@@ -140,6 +140,29 @@
 				};
 			};
 
+			eth0: ethernet@30000 {
+				   compatible = "marvell,armada-3700-neta";
+				   reg = <0x30000 0x4000>;
+				   interrupts = <GIC_SPI 42 IRQ_TYPE_LEVEL_HIGH>;
+				   clocks = <&sb_periph_clk 8>;
+				   status = "disabled";
+			};
+
+			mdio: mdio@32004 {
+				#address-cells = <1>;
+				#size-cells = <0>;
+				compatible = "marvell,orion-mdio";
+				reg = <0x32004 0x4>;
+			};
+
+			eth1: ethernet@40000 {
+				compatible = "marvell,armada-3700-neta";
+				reg = <0x40000 0x4000>;
+				interrupts = <GIC_SPI 45 IRQ_TYPE_LEVEL_HIGH>;
+				clocks = <&sb_periph_clk 7>;
+				status = "disabled";
+			};
+
 			usb3: usb@58000 {
 				compatible = "marvell,armada3700-xhci",
 				"generic-xhci";
-- 
git-series 0.8.10

^ permalink raw reply related

* [PATCH net-next 2/5] net: mvneta: Convert to be 64 bits compatible
From: Gregory CLEMENT @ 2016-11-25 15:30 UTC (permalink / raw)
  To: David S. Miller, linux-kernel, netdev
  Cc: Jisheng Zhang, Arnd Bergmann, Jason Cooper, Andrew Lunn,
	Sebastian Hesselbarth, Gregory CLEMENT, Thomas Petazzoni,
	linux-arm-kernel, Nadav Haklai, Marcin Wojtas, Dmitri Epshtein,
	Yelena Krivosheev
In-Reply-To: <cover.2b146800967005632cd02d4da77397e6e2fdf51f.1480087510.git-series.gregory.clement@free-electrons.com>

From: Marcin Wojtas <mw@semihalf.com>

Prepare the mvneta driver to be usable on 64-bit platforms such as the
Armada 3700.

[gregory.clement@free-electrons.com]: this patch was extracted from a larger
one to ease review and maintenance.
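
The RX offset handling introduced by this patch can be sketched as plain
arithmetic. This is an illustration under assumptions, not the driver
code: struct rx_offsets and both helpers are hypothetical, RX_PKT_OFFSET_MAX
mirrors the 64-byte MVNETA_RX_PKT_OFFSET_CORRECTION constant, and the
padding values passed in are examples rather than values read from a real
kernel configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors MVNETA_RX_PKT_OFFSET_CORRECTION from the patch. */
#define RX_PKT_OFFSET_MAX 64u

/* Hypothetical container for the two values derived from the padding. */
struct rx_offsets {
	unsigned int correction;	/* bytes added to the DMA address */
	unsigned int hw_offset;		/* offset programmed into the RXQ */
};

/* Split the required headroom (skb_pad) into the part the hardware
 * applies itself and the part folded into the buffer address.
 */
static struct rx_offsets rx_offsets_for(unsigned int skb_pad)
{
	struct rx_offsets o;

	o.correction = skb_pad > RX_PKT_OFFSET_MAX ?
		       skb_pad - RX_PKT_OFFSET_MAX : 0;
	o.hw_offset = skb_pad - o.correction;

	/* The hardware part never exceeds the limit and the two parts
	 * always add up to the full headroom.
	 */
	assert(o.hw_offset <= RX_PKT_OFFSET_MAX);
	assert(o.correction + o.hw_offset == skb_pad);
	return o;
}

/* Address handed to the hardware for a buffer mapped at 'mapped'. */
static uint64_t rx_dma_addr(uint64_t mapped, const struct rx_offsets *o)
{
	return mapped + o->correction;
}
```

For a padding of 128 bytes this gives a correction of 64 and a hardware
offset of 64, and for a padding of 64 bytes a correction of 0. In both
cases the packet data starts at the mapped address plus the full padding.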

Signed-off-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
---
 drivers/net/ethernet/marvell/mvneta.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index b6849f88cab7..ad3872e07a93 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -296,6 +296,12 @@
 /* descriptor aligned size */
 #define MVNETA_DESC_ALIGNED_SIZE	32
 
+/* Number of bytes to be taken into account by HW when putting incoming data
+ * to the buffers. It is needed in case NET_SKB_PAD exceeds maximum packet
+ * offset supported in MVNETA_RXQ_CONFIG_REG(q) registers.
+ */
+#define MVNETA_RX_PKT_OFFSET_CORRECTION		64
+
 #define MVNETA_RX_PKT_SIZE(mtu) \
 	ALIGN((mtu) + MVNETA_MH_SIZE + MVNETA_VLAN_TAG_LEN + \
 	      ETH_HLEN + ETH_FCS_LEN,			     \
@@ -416,6 +422,7 @@ struct mvneta_port {
 	u64 ethtool_stats[ARRAY_SIZE(mvneta_statistics)];
 
 	u32 indir[MVNETA_RSS_LU_TABLE_SIZE];
+	u16 rx_offset_correction;
 };
 
 /* The mvneta_tx_desc and mvneta_rx_desc structures describe the
@@ -1807,6 +1814,7 @@ static int mvneta_rx_refill(struct mvneta_port *pp,
 		return -ENOMEM;
 	}
 
+	phys_addr += pp->rx_offset_correction;
 	mvneta_rx_desc_fill(rx_desc, phys_addr, data, rxq);
 	return 0;
 }
@@ -2838,7 +2846,7 @@ static int mvneta_rxq_init(struct mvneta_port *pp,
 	mvreg_write(pp, MVNETA_RXQ_SIZE_REG(rxq->id), rxq->size);
 
 	/* Set Offset */
-	mvneta_rxq_offset_set(pp, rxq, NET_SKB_PAD);
+	mvneta_rxq_offset_set(pp, rxq, NET_SKB_PAD - pp->rx_offset_correction);
 
 	/* Set coalescing pkts and time */
 	mvneta_rx_pkts_coal_set(pp, rxq, rxq->pkts_coal);
@@ -4091,6 +4099,13 @@ static int mvneta_probe(struct platform_device *pdev)
 
 	pp->rxq_def = rxq_def;
 
+	/* Set RX packet offset correction for platforms, whose
+	 * NET_SKB_PAD, exceeds 64B. It should be 64B for 64-bit
+	 * platforms and 0B for 32-bit ones.
+	 */
+	pp->rx_offset_correction =
+		max(0, NET_SKB_PAD - MVNETA_RX_PKT_OFFSET_CORRECTION);
+
 	pp->indir[0] = rxq_def;
 
 	pp->clk = devm_clk_get(&pdev->dev, "core");
-- 
git-series 0.8.10

^ permalink raw reply related
