Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH] Fixed to checkpatch.pl errors to vlan_dev.c
From: Ozgur Karatas @ 2016-12-05 15:34 UTC (permalink / raw)
  To: kaber, David Miller; +Cc: netdev, linux-kernel

Hello all,

I will solve a checkpatch errors.

Signed-off-by: Ozgur Karatas <ozgur@member.fsf.org>

---
 net/8021q/vlan_dev.c                               |   2 +-
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index fbfacd5..2edb495 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -738,7 +738,7 @@ static int vlan_dev_netpoll_setup(struct net_device *dev, struct netpoll_info *n

 static void vlan_dev_netpoll_cleanup(struct net_device *dev)
 {
-       struct vlan_dev_priv *vlan= vlan_dev_priv(dev);
+       struct vlan_dev_priv *vlan = vlan_dev_priv(dev);
        struct netpoll *netpoll = vlan->netpoll;

        if (!netpoll)

^ permalink raw reply related

* Re: BROKEN Re: [PATCH] netlink: 2-clause nla_ok()
From: Johannes Berg @ 2016-12-05 15:09 UTC (permalink / raw)
  To: Alexey Dobriyan, David Miller; +Cc: netdev
In-Reply-To: <CACVxJT9Qc=vNexx4ooQVFo=SrGs4-oHeZrnL18vbiS+o0F_N1A@mail.gmail.com>

> Maybe someone could vouch that other checks prevent
> this kind of situation from happening but not me.

No, now that you spell it out (and I see the patch) - this is
absolutely needed because nla_for_each_attr() [1] can be called on
arbitrary data coming from userspace in a message, e.g. by way
of nla_for_each_nested(). Even if it's not malformed, nla_ok() is the
only abort condition for that loop, so it would read at least one
nla_len after the real buffer without that condition.

johannes

[1] which seems to be the only significant user thereof

^ permalink raw reply

* Re: [net PATCH 2/2] ipv4: Drop suffix update from resize code
From: Robert Shearman @ 2016-12-05 15:05 UTC (permalink / raw)
  To: Alexander Duyck, netdev; +Cc: davem
In-Reply-To: <20161201122757.15499.51778.stgit@ahduyck-blue-test.jf.intel.com>

On 01/12/16 12:27, Alexander Duyck wrote:
> It has been reported that update_suffix can be expensive when it is called
> on a large node in which most of the suffix lengths are the same.  The time
> required to add 200K entries had increased from around 3 seconds to almost
> 49 seconds.
>
> In order to address this we need to move the code for updating the suffix
> out of resize and instead just have it handled in the cases where we are
> pushing a node that increases the suffix length, or will decrease the
> suffix length.
>
> Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
> Reported-by: Robert Shearman <rshearma@brocade.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>

$ time sudo ip route restore < ~/allroutes
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists

real	0m4.026s
user	0m0.156s
sys	0m2.500s
$ time sudo ip route flush via 110.110.110.2

real	0m5.423s
user	0m0.064s
sys	0m1.096s

This reduces the time taken to both add and delete the 200k routes back 
down to levels comparable before commit 5405afd1a306. The changes look 
good to me too. Nicely done Alex!

Reviewed-by: Robert Shearman <rshearma@brocade.com>
Tested-by: Robert Shearman <rshearma@brocade.com>

Thanks,
Rob

^ permalink raw reply

* Re: [net PATCH 1/2] ipv4: Drop leaf from suffix pull/push functions
From: Robert Shearman @ 2016-12-05 15:05 UTC (permalink / raw)
  To: Alexander Duyck, netdev; +Cc: davem
In-Reply-To: <20161201122752.15499.8184.stgit@ahduyck-blue-test.jf.intel.com>

On 01/12/16 12:27, Alexander Duyck wrote:
> It wasn't necessary to pass a leaf in when doing the suffix updates so just
> drop it.  Instead just pass the suffix and work with that.
>
> Since we dropped the leaf there is no need to include that in the name so
> the names are updated to node_push_suffix and node_pull_suffix.
>
> Finally I noticed that the logic for pulling the suffix length back
> actually had some issues.  Specifically it would stop prematurely if there
> was a longer suffix, but it was not as long as the original suffix.  I
> updated the code to address that in node_pull_suffix.
>
> Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
> Suggested-by: Robert Shearman <rshearma@brocade.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>

This addresses an issue in the current code whereby when a fib alias is 
removed from a leaf when there are other aliases remaining then the 
suffix length may not be updated in the conditions described above. This 
is because fib_remove_alias doesn't call trie_rebalance (or resize) in 
this case, as there's no need since no nodes have been added/removed.

This can be reproduced with this structure of the fib trie and by adding 
and then deleting 10.37.96.0/19:

         +-- 10.37.0.0/16 4 1 8 suffix/19
            |-- 10.37.0.0
               /24 universe UNICAST
            |-- 10.37.32.0
               /24 universe UNICAST
            |-- 10.37.64.0
               /20 universe UNICAST
            |-- 10.37.95.0
               /24 universe UNICAST
            +-- 10.37.96.0/20 2 1 2 suffix/21
               +-- 10.37.96.0/22 2 1 2 suffix/24
                  +-- 10.37.96.0/24 2 1 2 suffix/24
                     |-- 10.37.96.0
                        /32 link BROADCAST
                        /24 link UNICAST
                     +-- 10.37.96.192/26 2 0 2 suffix/28
                        |-- 10.37.96.204
                           /32 host LOCAL
                        |-- 10.37.96.255
                           /32 link BROADCAST
                  |-- 10.37.98.0
                     /25 universe UNICAST
               |-- 10.37.104.0
                  /21 universe UNICAST
            +-- 10.37.112.0/23 2 0 2 suffix/24
               |-- 10.37.112.0
                  /24 universe UNICAST
               |-- 10.37.113.0
                  /24 universe UNICAST
            |-- 10.37.223.0
               /24 universe UNICAST
            |-- 10.37.224.0
               /24 universe UNICAST

I've verified that with the fix included here that the suffix length on 
the 10.37.0.0/16 node now gets updated correctly to be /20.

Reviewed-by: Robert Shearman <rshearma@brocade.com>
Tested-by: Robert Shearman <rshearma@brocade.com>

Thanks,
Rob

^ permalink raw reply

* BROKEN Re: [PATCH] netlink: 2-clause nla_ok()
From: Alexey Dobriyan @ 2016-12-05 14:56 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

David, please do

    git revert 4f7df337fe79bba1e4c2d525525d63b5ba186bbd

I'm an idiot.

All rationale in the commit would be correct if reading "nla_len"
didn't require memory access. But it does.

    return rem >= (int)sizeof(*nla) &&
               nla->nla_len >= sizeof(*nla) &&
               nla->nla_len <= remaining;

Those logical ands ensure that memory access is not done
if "rem" is small enough to even fetch ->nla_len.

Maybe someone could vouch that other checks prevent
this kind of situation from happening but not me.
How very embarrassing.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>

^ permalink raw reply

* Re: [PATCH 2/6] net: ethernet: ti: cpts: add support for ext rftclk selection
From: Rob Herring @ 2016-12-05 14:51 UTC (permalink / raw)
  To: Grygorii Strashko
  Cc: David S. Miller, netdev, Mugunthan V N, Richard Cochran,
	Sekhar Nori, linux-kernel, linux-omap, devicetree,
	Murali Karicheri, Wingman Kwok
In-Reply-To: <20161128230428.6872-3-grygorii.strashko@ti.com>

On Mon, Nov 28, 2016 at 05:04:24PM -0600, Grygorii Strashko wrote:
> Some CPTS instances, which can be found on KeyStone 2 1/10G Ethernet
> Switch Subsystems, can control an external multiplexer that selects
> one of up to 32 clocks for time sync reference (RFTCLK). This feature
> can be configured through CPTS_RFTCLK_SEL register (offset: x08).
> 
> Hence, introduce optional DT cpts_rftclk_sel poperty wich, if present,
> will specify CPTS reference clock. The cpts_rftclk_sel should be
> omitted in DT if HW doesn't support this feature. The external fixed
> rate clocks can be defined in board files as "fixed-clock".
> 
> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
> ---
>  Documentation/devicetree/bindings/net/keystone-netcp.txt |  2 ++

Please group binding changes into a single patch.

>  drivers/net/ethernet/ti/cpts.c                           | 12 ++++++++++++
>  drivers/net/ethernet/ti/cpts.h                           |  8 +++++++-
>  3 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/devicetree/bindings/net/keystone-netcp.txt b/Documentation/devicetree/bindings/net/keystone-netcp.txt
> index c37b54e..ec4a241 100644
> --- a/Documentation/devicetree/bindings/net/keystone-netcp.txt
> +++ b/Documentation/devicetree/bindings/net/keystone-netcp.txt
> @@ -114,6 +114,8 @@ Optional properties:
>  				driver to them if needed.
>  
>   Properties related to cpts configurations.
> +	- cpts-rftclk-sel: selects one of up to 32 clocks for time sync
> +		reference.  Default = 0.

Vendor prefix.

>  	- cpts_clock_mult/cpts_clock_shift:
>  		used for converting time counter cycles to ns as in
>  

^ permalink raw reply

* Re: [PATCH 1/6] net: ethernet: ti: netcp: add support of cpts
From: Rob Herring @ 2016-12-05 14:49 UTC (permalink / raw)
  To: Grygorii Strashko
  Cc: David S. Miller, netdev, Mugunthan V N, Richard Cochran,
	Sekhar Nori, linux-kernel, linux-omap, devicetree,
	Murali Karicheri, Wingman Kwok
In-Reply-To: <20161128230428.6872-2-grygorii.strashko@ti.com>

On Mon, Nov 28, 2016 at 05:04:23PM -0600, Grygorii Strashko wrote:
> From: WingMan Kwok <w-kwok2@ti.com>
> 
> This patch adds support of the cpts device found in the
> gbe and 10gbe ethernet switches on the keystone 2 SoCs
> (66AK2E/L/Hx, 66AK2Gx).
> 
> Signed-off-by: WingMan Kwok <w-kwok2@ti.com>
> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
> ---
>  .../devicetree/bindings/net/keystone-netcp.txt     |   9 +
>  drivers/net/ethernet/ti/Kconfig                    |   7 +-
>  drivers/net/ethernet/ti/netcp.h                    |   2 +-
>  drivers/net/ethernet/ti/netcp_core.c               |  18 +-
>  drivers/net/ethernet/ti/netcp_ethss.c              | 437 ++++++++++++++++++++-
>  5 files changed, 459 insertions(+), 14 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/net/keystone-netcp.txt b/Documentation/devicetree/bindings/net/keystone-netcp.txt
> index 04ba1dc..c37b54e 100644
> --- a/Documentation/devicetree/bindings/net/keystone-netcp.txt
> +++ b/Documentation/devicetree/bindings/net/keystone-netcp.txt
> @@ -113,6 +113,15 @@ Optional properties:
>  				will only initialize these ports and attach PHY
>  				driver to them if needed.
>  
> + Properties related to cpts configurations.
> +	- cpts_clock_mult/cpts_clock_shift:

Needs vendor prefix. Don't use '_'.

> +		used for converting time counter cycles to ns as in
> +
> +		ns = (cycles * clock_mult) >> _shift
> +
> +		Defaults: clock_mult, clock_shift = calculated from
> +		CPTS refclk

What does this mean?

> +
>  NetCP interface properties: Interface specification for NetCP sub-modules.
>  Required properties:
>  - rx-channel:	the navigator packet dma channel name for rx.

^ permalink raw reply

* Re: [PATCH 3/3] netns: fix net_generic() "id - 1" bloat
From: Alexey Dobriyan @ 2016-12-05 14:48 UTC (permalink / raw)
  To: David Laight
  Cc: davem@davemloft.net, netdev@vger.kernel.org, xemul@openvz.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DB0234CCB@AcuExch.aculab.com>

On Mon, Dec 5, 2016 at 3:49 PM, David Laight <David.Laight@aculab.com> wrote:
> From: Alexey Dobriyan
>> Sent: 02 December 2016 01:22
>> net_generic() function is both a) inline and b) used ~600 times.
>>
>> It has the following code inside
>>
>>               ...
>>       ptr = ng->ptr[id - 1];
>>               ...
>>
>> "id" is never compile time constant so compiler is forced to subtract 1.
>> And those decrements or LEA [r32 - 1] instructions add up.
>>
>> We also start id'ing from 1 to catch bugs where pernet sybsystem id
>> is not initialized and 0. This is quite pointless idea (nothing will
>> work or immediate interference with first registered subsystem) in
>> general but it hints what needs to be done for code size reduction.
>>
>> Namely, overlaying allocation of pointer array and fixed part of
>> structure in the beginning and using usual base-0 addressing.
>>
>> Ids are just cookies, their exact values do not matter, so lets start
>> with 3 on x86_64.
> ...
>>  struct net_generic {
>> -     struct {
>> -             unsigned int len;
>> -             struct rcu_head rcu;
>> -     } s;
>> -
>> -     void *ptr[0];
>> +     union {
>> +             struct {
>> +                     unsigned int len;
>> +                     struct rcu_head rcu;
>> +             } s;
>> +
>> +             void *ptr[0];
>> +     };
>>  };
>
> That union is an accident waiting to happen.

I kind of disagree. Module authors should not be given matches,
but it is hard to screw up if net_generic() is all you're given.

> What might work is to offset the Ids by
> (offsetof(struct net_generic, ptr)/sizeof (void *)) instead of by 1.
> The subtract from the offset will then counter the structure offset
> - which is what you are trying to achieve.

If you suggest this layout

struct net_generic {
        struct {
        } s;
        void *ptr[0];
};

then is it not optimal because offset of "ptr" needs to be somewhere in code
either in some LEA or imm8 of the final MOV which is 1 byte more bloaty.

Here is test program

struct ng1 {
        union {
                struct {
                        unsigned int len;
                } s;
                void *ptr[0];
        };
};
struct ng2 {
        struct {
                unsigned int len;
        } s;
        void *ptr[0];
};
struct net {
        int x;
        struct ng1 *gen1;
        struct ng2 *gen2;
};
void *ng1(const struct net *net, unsigned int id)
{
        return net->gen1->ptr[id];
}
void *ng2(const struct net *net, unsigned int id)
{
        return net->gen2->ptr[id];
}


0000000000000000 <ng1>:
   0:   48 8b 47 08             mov    rax,QWORD PTR [rdi+0x8]
   4:   89 f6                   mov    esi,esi
   6:   48 8b 04 f0             mov    rax,QWORD PTR [rax+rsi*8]
   a:   c3                      ret


0000000000000010 <ng2>:
  10:   48 8b 47 10             mov    rax,QWORD PTR [rdi+0x10]
  14:   89 f6                   mov    esi,esi
  16:   48 8b 44 f0 [[[08]]]          mov    rax,QWORD PTR [rax+rsi*8+0x8]
  1b:   c3                      ret

I can send more type checking if you can tolerate "_" identifier

    struct {
        unsigned int _;
    } pernet_ops_id_t;

^ permalink raw reply

* Re: [RFC] udp: some improvements on RX path.
From: Eric Dumazet @ 2016-12-05 14:28 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: netdev
In-Reply-To: <1480944138.4694.37.camel@redhat.com>

On Mon, 2016-12-05 at 14:22 +0100, Paolo Abeni wrote:
> Hi Eric,
> 
> On Sun, 2016-12-04 at 18:43 -0800, Eric Dumazet wrote:
> > We currently access 3 cache lines from an skb in receive queue while
> > holding receive queue lock :
> > 
> > First cache line (contains ->next / prev pointers )
> > 2nd cache line (skb->peeked)
> > 3rd cache line (skb->truesize)
> > 
> > I believe we could get rid of skb->peeked completely.
> > 
> > I will cook a patch, but basically the idea is that the last owner of a
> > skb (right before skb->users becomes 0) can have the 'ownership' and
> > thus increase stats.
> 
> Agreed.
> 
> > The 3rd cache line miss is easily avoided by the following patch.
> 
> I run some performance tests on top of your patch "net: reorganize
> struct sock for better data locality", and I see an additional ~7%
> improvement on top of that, in the udp flood scenario. 
> 
> In my tests, the topmost perf offenders for the u/s process are now:
> 
>    9.98%  udp_sink  [kernel.kallsyms]  [k] udp_rmem_release
>    8.76%  udp_sink  [kernel.kallsyms]  [k] inet_recvmsg
>    6.71%  udp_sink  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>    5.40%  udp_sink  [kernel.kallsyms]  [k] __skb_try_recv_datagram
>    5.19%  udp_sink  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
> 
> udp_rmem_release() spends most of its time doing:
> 
>     atomic_sub(size, &sk->sk_rmem_alloc);
> 
> I see a cacheline miss while accessing sk_rmem_alloc; most probably due
> to sk_rmem_alloc being updated by the writer outside the rx queue lock.
> 
> Moving such update inside the lock show remove this cache miss but will
> increase the pressure on the rx lock. What do you think ?

I want to accumulate the sk_rmem_alloc  deficit in another variable in
the same cache line than the secondary list, updated from process
context at udp recvmsg() time.

And only transfer the accumulated deficit in one go when the second
queue is emptied, or if the accumulated deficite is > (rcvbuf/2)

If done right, we should interfere between the softirq flooder(s) and
the process thread only once per batch.

> 
> inet_recvmsg() is there because with "net: reorganize struct sock for
> better data locality" we get a cache miss while accessing skc_rxhash in
> sock_rps_record_flow(); touching sk_drops is dirtying that cacheline -
> sorry for not noticing this before. Do you have CONFIG_RPS disabled ?
> 
> > But I also want to work on the idea I gave few days back, having a
> > separate queue and use splice to transfer the 'softirq queue' into
> > a calm queue in a different cache line.
> > 
> > I expect a 50 % performance increase under load, maybe 1.5 Mpps.
> 
> It should work nicely under contention, but won't that increase the
> overhead for the uncontended/single flow scenario ? the user space
> reader needs to acquire 2 lock when splicing the 'softirq queue'. On my
> system ksoftirqd and the u/s process work at similar speeds, so splicing
> will happen quite often. 

Well, the splice would happen only if you have more than one message in
the softirq queue. So no real overhead for uncontended flow scenario.


This reminds me of the busylock I added in __dev_xmit_skb(), which
basically is acquired only when we detect a possible contention on qdisc
lock.

Thanks.

^ permalink raw reply

* Re: [PATCH net-next v4 2/4] dt-bindings: net: add EEE capability constants
From: Rob Herring @ 2016-12-05 14:39 UTC (permalink / raw)
  To: Jerome Brunet
  Cc: netdev, devicetree, Florian Fainelli, Carlo Caione, Kevin Hilman,
	Giuseppe Cavallaro, Alexandre TORGUE, Martin Blumenstingl,
	Andre Roth, Andrew Lunn, Neil Armstrong, linux-amlogic,
	linux-arm-kernel, linux-kernel, Julia Lawall, Yegor Yefremov,
	Andreas Färber
In-Reply-To: <1480348229-25672-3-git-send-email-jbrunet@baylibre.com>

On Mon, Nov 28, 2016 at 04:50:26PM +0100, Jerome Brunet wrote:
> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
> Tested-by: Yegor Yefremov <yegorslists@googlemail.com>
> Tested-by: Andreas Färber <afaerber@suse.de>
> Tested-by: Neil Armstrong <narmstrong@baylibre.com>
> ---
>  include/dt-bindings/net/mdio.h | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
>  create mode 100644 include/dt-bindings/net/mdio.h

Seems changes are wanted on this, but patches 2 and 3 should be 
combined. The header is part of the binding doc.

Rob

^ permalink raw reply

* Re: [PATCH net-next v6 6/6] samples/bpf: add userspace example for prohibiting sockets
From: David Ahern @ 2016-12-05 14:31 UTC (permalink / raw)
  To: David Laight, Alexei Starovoitov
  Cc: netdev@vger.kernel.org, daniel@zonque.org, ast@fb.com,
	daniel@iogearbox.net, maheshb@google.com, tgraf@suug.ch
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DB0234BBD@AcuExch.aculab.com>

On 12/5/16 3:51 AM, David Laight wrote:
> You can at least improve the comment.

I fixed the header files for v7.

^ permalink raw reply

* Designing a safe RX-zero-copy Memory Model for Networking
From: Jesper Dangaard Brouer @ 2016-12-05 14:31 UTC (permalink / raw)
  To: netdev@vger.kernel.org
  Cc: brouer, linux-mm, John Fastabend, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg

Hi all,

This is my design for how to safely handle RX zero-copy in the network
stack, by using page_pool[1] and modifying NIC drivers.  Safely means
not leaking kernel info in pages mapped to userspace and resilience
so a malicious userspace app cannot crash the kernel.

It is only a design, and thus the purpose is for you to find any holes
in this design ;-)  Below text is also available as html see[2].

[1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/design.html
[2] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html

===========================
Memory Model for Networking
===========================

This design describes how the page_pool change the memory model for
networking in the NIC (Network Interface Card) drivers.

.. Note:: The catch for driver developers is that, once an application
          request zero-copy RX, then the driver must use a specific
          SKB allocation mode and might have to reconfigure the
          RX-ring.

Design target
=============

Allow the NIC to function as a normal Linux NIC and be shared in a
safe manor, between the kernel network stack and an accelerated
userspace application using RX zero-copy delivery.

Target is to provide the basis for building RX zero-copy solutions in
a memory safe manor.  An efficient communication channel for userspace
delivery is out of scope for this document, but OOM considerations are
discussed below (`Userspace delivery and OOM`_).

Background
==========

The SKB or ``struct sk_buff`` is the fundamental meta-data structure
for network packets in the Linux Kernel network stack.  It is a fairly
complex object and can be constructed in several ways.

From a memory perspective there are two ways depending on
RX-buffer/page state:

1) Writable packet page
2) Read-only packet page

To take full potential of the page_pool, the drivers must actually
support handling both options depending on the configuration state of
the page_pool.

Writable packet page
--------------------

When the RX packet page is writable, the SKB setup is fairly straight
forward.  The SKB->data (and skb->head) can point directly to the page
data, adjusting the offset according to drivers headroom (for adding
headers) and setting the length according to the DMA descriptor info.

The page/data need to be writable, because the network stack need to
adjust headers (like TimeToLive and checksum) or even add or remove
headers for encapsulation purposes.

A subtle catch, which also requires a writable page, is that the SKB
also have an accompanying "shared info" data-structure ``struct
skb_shared_info``.  This "skb_shared_info" is written into the
skb->data memory area at the end (skb->end) of the (header) data.  The
skb_shared_info contains semi-sensitive information, like kernel
memory pointers to other pages (which might be pointers to more packet
data).  This would be bad from a zero-copy point of view to leak this
kind of information.

Read-only packet page
---------------------

When the RX packet page is read-only, the construction of the SKB is
significantly more complicated and even involves one more memory
allocation.

1) Allocate a new separate writable memory area, and point skb->data
   here.  This is needed due to (above described) skb_shared_info.

2) Memcpy packet headers into this (skb->data) area.

3) Clear part of skb_shared_info struct in writable-area.

4) Setup pointer to packet-data in the page (in skb_shared_info->frags)
   and adjust the page_offset to be past the headers just copied.

It is useful (later) that the network stack have this notion that part
of the packet and a page can be read-only.  This implies that the
kernel will not "pollute" this memory with any sensitive information.
This is good from a zero-copy point of view, but bad from a
performance perspective.

NIC RX Zero-Copy
================

Doing NIC RX zero-copy involves mapping RX pages into userspace.  This
involves costly mapping and unmapping operations in the address space
of the userspace process.  Plus for doing this safely, the page memory
need to be cleared before using it, to avoid leaking kernel
information to userspace, also a costly operation.  The page_pool base
"class" of optimization is moving these kind of operations out of the
fastpath, by recycling and lifetime control.

Once a NIC RX-queue's page_pool have been configured for zero-copy
into userspace, then can packets still be allowed to travel the normal
stack?

Yes, this should be possible, because the driver can use the
SKB-read-only mode, which avoids polluting the page data with
kernel-side sensitive data.  This implies, when a driver RX-queue
switch page_pool to RX-zero-copy mode it MUST also switch to
SKB-read-only mode (for normal stack delivery for this RXq).

XDP can be used for controlling which pages that gets RX zero-copied
to userspace.  The page is still writable for the XDP program, but
read-only for normal stack delivery.

Kernel safety
-------------

For the paranoid, how do we protect the kernel from a malicious
userspace program.  Sure there will be a communication interface
between kernel and userspace, that synchronize ownership of pages.
But a userspace program can violate this interface, given pages are
kept VMA mapped, the program can in principle access all the memory
pages in the given page_pool.  This opens up for a malicious (or
defect) program modifying memory pages concurrently with the kernel
and DMA engine using them.

An easy way to get around userspace modifying page data contents is
simply to map pages read-only into userspace.

.. Note:: The first implementation target is read-only zero-copy RX
          page to userspace and require driver to use SKB-read-only
          mode.

Advanced: Allowing userspace write access?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

What if userspace need write access? Flipping the page permissions per
transfer will likely kill performance (as this likely affects the
TLB-cache).

I will argue that giving userspace write access is still possible,
without risking a kernel crash.  This is related to the SKB-read-only
mode that copies the packet headers (in to another memory area,
inaccessible to userspace).  The attack angle is to modify packet
headers after they passed some kernel network stack validation step
(as once headers are copied they are out of "reach").

Situation classes where memory page can be modified concurrently:

1) When DMA engine owns the page.  Not a problem, as DMA engine will
   simply overwrite data.

2) Just after DMA engine finish writing.  Not a problem, the packet
   will go through netstack validation and be rejected.

3) While XDP reads data. This can lead to XDP/eBPF program goes into a
   wrong code branch, but the eBPF virtual machine should not be able
   to crash the kernel. The worst outcome is a wrong or invalid XDP
   return code.

4) Before SKB with read-only page is constructed. Not a problem, the
   packet will go through netstack validation and be rejected.

5) After SKB with read-only page has been constructed.  Remember the
   packet headers were copied into a separate memory area, and the
   page data is pointed to with an offset passed the copied headers.
   Thus, userspace cannot modify the headers used for netstack
   validation.  It can only modify packet data contents, which is less
   critical as it cannot crash the kernel, and eventually this will be
   caught by packet checksum validation.

6) After netstack delivered packet to another userspace process. Not a
   problem, as it cannot crash the kernel.  It might corrupt
   packet-data being read by another userspace process, which one
   argument for requiring elevated privileges to get write access
   (like NET_CAP_ADMIN).

Userspace delivery and OOM
--------------------------

These RX pages are likely mapped to userspace via mmap(), so-far so
good.  It is key to performance to get an efficient way of signaling
between kernel and userspace, e.g what page are ready for consumption,
and when userspace are done with the page.

It is outside the scope of page_pool to provide such a queuing
structure, but the page_pool can offer some means of protecting the
system resource usage.  It is a classical problem that resources
(e.g. the page) must be returned in a timely manor, else the system,
in this case, will run out of memory.  Any system/design with
unbounded memory allocation can lead to Out-Of-Memory (OOM)
situations.

Communication between kernel and userspace is likely going to be some
kind of queue.  Given transferring packets individually will have too
much scheduling overhead.  A queue can implicitly function as a
bulking interface, and offers a natural way to split the workload
across CPU cores.

This essentially boils down-to a two queue system, with the RX-ring
queue and the userspace delivery queue.

Two bad situations exists for the userspace queue:

1) Userspace is not consuming objects fast-enough. This should simply
   result in packets getting dropped when enqueueing to a full
   userspace queue (as queue *must* implement some limit). Open
   question is; should this be reported or communicated to userspace.

2) Userspace is consuming objects fast, but not returning them in a
   timely manor.  This is a bad situation, because it threatens the
   system stability as it can lead to OOM.

The page_pool should somehow protect the system in case 2.  The
page_pool can detect the situation as it is able to track the number
of outstanding pages, due to the recycle feedback loop.  Thus, the
page_pool can have some configurable limit of allowed outstanding
pages, which can protect the system against OOM.

Note, the `Fbufs paper`_ propose to solve case 2 by allowing these
pages to be "pageable", i.e. swap-able, but that is not an option for
the page_pool as these pages are DMA mapped.

.. _`Fbufs paper`:
   http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.9688

Effect of blocking allocation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The effect of page_pool, in case 2, that denies more allocations
essentially result-in the RX-ring queue cannot be refilled and HW
starts dropping packets due to "out-of-buffers".  For NICs with
several HW RX-queues, this can be limited to a subset of queues (and
admin can control which RX queue with HW filters).

The question is if the page_pool can do something smarter in this
case, to signal the consumers of these pages, before the maximum limit
is hit (of allowed outstanding packets).  The MM-subsystem already
have a concept of emergency PFMEMALLOC reserves and associate
page-flags (e.g. page_is_pfmemalloc).  And the network stack already
handle and react to this.  Could the same PFMEMALLOC system be used
for marking pages when limit is close?

This requires further analysis. One can imagine; this could be used at
RX by XDP to mitigate the situation by dropping less-important frames.
Given XDP choose which pages are being send to userspace it might have
appropriate knowledge of what it relevant to drop(?).

.. Note:: An alternative idea is using a data-structure that blocks
          userspace from getting new pages before returning some.
          (out of scope for the page_pool)

--
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Above document is taken at GitHub commit 47fa7c844f48fab8b
 https://github.com/netoptimizer/prototype-kernel/commit/47fa7c844f48fab8b

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v2 net-next 7/8] net: reorganize struct sock for better data locality
From: Eric Dumazet @ 2016-12-05 14:30 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: Eric Dumazet, David S . Miller, netdev, Yuchung Cheng
In-Reply-To: <1480941384.4694.13.camel@redhat.com>

On Mon, 2016-12-05 at 13:36 +0100, Paolo Abeni wrote:
> Hi Eric,

> I cherry-picked this patch only for some UDP benchmark. Under flood with
> many concurrent flows, I see this 20% improvement and a relevant
> decrease in system load.
> 
> Nice work, thanks Eric!
> 
> Tested-by: Paolo Abeni <pabeni@redhat.com>

Excellent, thanks a lot for testing !

^ permalink raw reply

* Re: [PATCH v2 net-next 1/2] flow dissector: ICMP support
From: Jiri Pirko @ 2016-12-05 14:27 UTC (permalink / raw)
  To: Simon Horman
  Cc: Tom Herbert, David Miller, Linux Kernel Network Developers,
	Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, Jamal Hadi Salim,
	Jiri Pirko
In-Reply-To: <20161205142208.GA16425@penelope.horms.nl>

Mon, Dec 05, 2016 at 03:23:16PM CET, simon.horman@netronome.com wrote:
>On Sat, Dec 03, 2016 at 10:52:32AM -0800, Tom Herbert wrote:
>> On Sat, Dec 3, 2016 at 2:49 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> > Fri, Dec 02, 2016 at 09:31:41PM CET, simon.horman@netronome.com wrote:
>> >>Allow dissection of ICMP(V6) type and code. This re-uses transport layer
>> >>port dissection code as although ICMP is not a transport protocol and their
>> >>type and code are not ports this allows sharing of both code and storage.
>> >>
>> >>Signed-off-by: Simon Horman <simon.horman@netronome.com>
>
>...
>
>> > Digging into this a bit more. I think it would be much nice not to mix
>> > up l4 ports and icmp stuff.
>> >
>> > How about to have FLOW_DISSECTOR_KEY_ICMP
>> > and
>> > struct flow_dissector_key_icmp {
>> >         u8 type;
>> >         u8 code;
>> > };
>> >
>> > The you can make this structure and struct flow_dissector_key_ports into
>> > an union in struct flow_keys.
>> >
>> > Looks much cleaner to me.
>
>Hi Jiri,
>
>I wondered about that too. The main reason that I didn't do this the first
>time around is that I wasn't sure what to do to differentiate between ports
>and icmp in fl_init_dissector() given that using a union would could to
>mask bits being set in both if they are set in either.
>
>I've taken a stab at addressing that below along with implementing your
>suggestion. If the treatment in fl_init_dissector() is acceptable
>perhaps something similar should be done for FLOW_DISSECTOR_KEY_IPV[46]_ADDRS

Looks fine.

>too?
>
>> I agree, this patch adds to many conditionals into the fast path for
>> ICMP handling. Neither is there much point in using type and code as
>> input to the packet hash.
>
>Hi Tom,
>
>I'm not sure that I follow this. A packet may be ICMP or not regardless of
>if FLOW_DISSECTOR_KEY_PORT (and now FLOW_DISSECTOR_KEY_ICMP) port is set or
>not. I'd appreciate some guidance here.
>
>
>First-pass at implementing Jiri's suggestion follows:
>
>diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
>index 5540dfa18872..475cd121496f 100644
>--- a/include/net/flow_dissector.h
>+++ b/include/net/flow_dissector.h
>@@ -91,14 +91,10 @@ struct flow_dissector_key_addrs {
> 
> /**
>  * flow_dissector_key_ports:
>- *	@ports: port numbers of Transport header or
>- *		type and code of ICMP header
>+ *	@ports: port numbers of Transport header
>  *		ports: source (high) and destination (low) port numbers
>  *		src: source port number
>  *		dst: destination port number
>- *		icmp: ICMP type (high) and code (low)
>- *		type: ICMP type
>- *		type: ICMP code
>  */
> struct flow_dissector_key_ports {
> 	union {
>@@ -107,6 +103,18 @@ struct flow_dissector_key_ports {
> 			__be16 src;
> 			__be16 dst;
> 		};
>+	};
>+};
>+
>+/**
>+ * flow_dissector_key_icmp:
>+ *	@ports: type and code of ICMP header
>+ *		icmp: ICMP type (high) and code (low)
>+ *		type: ICMP type
>+ *		code: ICMP code
>+ */
>+struct flow_dissector_key_icmp {
>+	union {
> 		__be16 icmp;
> 		struct {
> 			u8 type;
>@@ -133,6 +141,7 @@ enum flow_dissector_key_id {
> 	FLOW_DISSECTOR_KEY_IPV4_ADDRS, /* struct flow_dissector_key_ipv4_addrs */
> 	FLOW_DISSECTOR_KEY_IPV6_ADDRS, /* struct flow_dissector_key_ipv6_addrs */
> 	FLOW_DISSECTOR_KEY_PORTS, /* struct flow_dissector_key_ports */
>+	FLOW_DISSECTOR_KEY_ICMP, /* struct flow_dissector_key_icmp */
> 	FLOW_DISSECTOR_KEY_ETH_ADDRS, /* struct flow_dissector_key_eth_addrs */
> 	FLOW_DISSECTOR_KEY_TIPC_ADDRS, /* struct flow_dissector_key_tipc_addrs */
> 	FLOW_DISSECTOR_KEY_VLAN, /* struct flow_dissector_key_flow_vlan */
>@@ -171,7 +180,10 @@ struct flow_keys {
> 	struct flow_dissector_key_tags tags;
> 	struct flow_dissector_key_vlan vlan;
> 	struct flow_dissector_key_keyid keyid;
>-	struct flow_dissector_key_ports ports;
>+	union {
>+		struct flow_dissector_key_ports ports;
>+		struct flow_dissector_key_icmp icmp;
>+	};
> 	struct flow_dissector_key_addrs addrs;
> };
> 
>diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
>index 0584b4bb4390..33e5fa2b3e87 100644
>--- a/net/core/flow_dissector.c
>+++ b/net/core/flow_dissector.c
>@@ -139,6 +139,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> 	struct flow_dissector_key_basic *key_basic;
> 	struct flow_dissector_key_addrs *key_addrs;
> 	struct flow_dissector_key_ports *key_ports;
>+	struct flow_dissector_key_icmp *key_icmp;
> 	struct flow_dissector_key_tags *key_tags;
> 	struct flow_dissector_key_vlan *key_vlan;
> 	struct flow_dissector_key_keyid *key_keyid;
>@@ -559,18 +560,25 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> 		break;
> 	}
> 
>-	if (dissector_uses_key(flow_dissector,
>-			       FLOW_DISSECTOR_KEY_PORTS)) {
>-		key_ports = skb_flow_dissector_target(flow_dissector,
>-						      FLOW_DISSECTOR_KEY_PORTS,
>-						      target_container);
>-		if (flow_protos_are_icmp_any(proto, ip_proto))
>-			key_ports->icmp = skb_flow_get_be16(skb, nhoff, data,
>-							    hlen);
>-		else
>+	if (flow_protos_are_icmp_any(proto, ip_proto)) {
>+		if (dissector_uses_key(flow_dissector,
>+				       FLOW_DISSECTOR_KEY_ICMP)) {
>+			key_icmp = skb_flow_dissector_target(flow_dissector,
>+							     FLOW_DISSECTOR_KEY_ICMP,
>+							     target_container);
>+			key_icmp->icmp = skb_flow_get_be16(skb, nhoff, data,
>+							   hlen);
>+		}
>+	} else {
>+		if (dissector_uses_key(flow_dissector,
>+				       FLOW_DISSECTOR_KEY_PORTS)) {
>+			key_ports = skb_flow_dissector_target(flow_dissector,
>+							      FLOW_DISSECTOR_KEY_PORTS,
>+							      target_container);
> 			key_ports->ports = __skb_flow_get_ports(skb, nhoff,
> 								ip_proto, data,
> 								hlen);
>+		}
> 	}
> 
> out_good:
>@@ -975,6 +983,10 @@ static const struct flow_dissector_key flow_keys_dissector_keys[] = {
> 		.offset = offsetof(struct flow_keys, ports),
> 	},
> 	{
>+		.key_id = FLOW_DISSECTOR_KEY_ICMP,
>+		.offset = offsetof(struct flow_keys, icmp),
>+	},
>+	{
> 		.key_id = FLOW_DISSECTOR_KEY_VLAN,
> 		.offset = offsetof(struct flow_keys, vlan),
> 	},
>diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
>index c96b2a349779..f4ba69bd362f 100644
>--- a/net/sched/cls_flower.c
>+++ b/net/sched/cls_flower.c
>@@ -38,7 +38,10 @@ struct fl_flow_key {
> 		struct flow_dissector_key_ipv4_addrs ipv4;
> 		struct flow_dissector_key_ipv6_addrs ipv6;
> 	};
>-	struct flow_dissector_key_ports tp;
>+	union {
>+		struct flow_dissector_key_ports tp;
>+		struct flow_dissector_key_icmp icmp;
>+	};
> 	struct flow_dissector_key_keyid enc_key_id;
> 	union {
> 		struct flow_dissector_key_ipv4_addrs enc_ipv4;
>@@ -511,19 +514,23 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
> 			       &mask->tp.dst, TCA_FLOWER_KEY_SCTP_DST_MASK,
> 			       sizeof(key->tp.dst));
> 	} else if (flow_basic_key_is_icmpv4(&key->basic)) {
>-		fl_set_key_val(tb, &key->tp.type, TCA_FLOWER_KEY_ICMPV4_TYPE,
>-			       &mask->tp.type, TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
>-			       sizeof(key->tp.type));
>-		fl_set_key_val(tb, &key->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
>-			       &mask->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
>-			       sizeof(key->tp.code));
>+		fl_set_key_val(tb, &key->icmp.type, TCA_FLOWER_KEY_ICMPV4_TYPE,
>+			       &mask->icmp.type,
>+			       TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
>+			       sizeof(key->icmp.type));
>+		fl_set_key_val(tb, &key->icmp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
>+			       &mask->icmp.code,
>+			       TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
>+			       sizeof(key->icmp.code));
> 	} else if (flow_basic_key_is_icmpv6(&key->basic)) {
>-		fl_set_key_val(tb, &key->tp.type, TCA_FLOWER_KEY_ICMPV6_TYPE,
>-			       &mask->tp.type, TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
>-			       sizeof(key->tp.type));
>-		fl_set_key_val(tb, &key->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
>-			       &mask->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
>-			       sizeof(key->tp.code));
>+		fl_set_key_val(tb, &key->icmp.type, TCA_FLOWER_KEY_ICMPV6_TYPE,
>+			       &mask->icmp.type,
>+			       TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
>+			       sizeof(key->icmp.type));
>+		fl_set_key_val(tb, &key->icmp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
>+			       &mask->icmp.code,
>+			       TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
>+			       sizeof(key->icmp.code));
> 	}
> 
> 	if (tb[TCA_FLOWER_KEY_ENC_IPV4_SRC] ||
>@@ -609,15 +616,16 @@ static int fl_init_hashtable(struct cls_fl_head *head,
> 		keys[cnt].key_id = id;						\
> 		keys[cnt].offset = FL_KEY_MEMBER_OFFSET(member);		\
> 		cnt++;								\
>-	} while(0);
>+	} while(0)
> 
> #define FL_KEY_SET_IF_MASKED(mask, keys, cnt, id, member)			\
> 	do {									\
> 		if (FL_KEY_IS_MASKED(mask, member))				\
> 			FL_KEY_SET(keys, cnt, id, member);			\
>-	} while(0);
>+	} while(0)
> 
> static void fl_init_dissector(struct cls_fl_head *head,
>+			      struct cls_fl_filter *f,
> 			      struct fl_flow_mask *mask)
> {
> 	struct flow_dissector_key keys[FLOW_DISSECTOR_KEY_MAX];
>@@ -631,8 +639,12 @@ static void fl_init_dissector(struct cls_fl_head *head,
> 			     FLOW_DISSECTOR_KEY_IPV4_ADDRS, ipv4);
> 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
> 			     FLOW_DISSECTOR_KEY_IPV6_ADDRS, ipv6);
>-	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
>-			     FLOW_DISSECTOR_KEY_PORTS, tp);
>+	if (flow_basic_key_is_icmpv4(&f->key.basic))
>+		FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
>+				     FLOW_DISSECTOR_KEY_ICMP, icmp);
>+	else
>+		FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
>+				     FLOW_DISSECTOR_KEY_PORTS, tp);
> 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
> 			     FLOW_DISSECTOR_KEY_VLAN, vlan);
> 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
>@@ -652,6 +664,7 @@ static void fl_init_dissector(struct cls_fl_head *head,
> }
> 
> static int fl_check_assign_mask(struct cls_fl_head *head,
>+				struct cls_fl_filter *f,
> 				struct fl_flow_mask *mask)
> {
> 	int err;
>@@ -672,7 +685,7 @@ static int fl_check_assign_mask(struct cls_fl_head *head,
> 	memcpy(&head->mask, mask, sizeof(head->mask));
> 	head->mask_assigned = true;
> 
>-	fl_init_dissector(head, mask);
>+	fl_init_dissector(head, f, mask);
> 
> 	return 0;
> }
>@@ -785,7 +798,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
> 	if (err)
> 		goto errout;
> 
>-	err = fl_check_assign_mask(head, &mask);
>+	err = fl_check_assign_mask(head, fnew, &mask);
> 	if (err)
> 		goto errout;
> 
>@@ -1000,24 +1013,24 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
> 				  sizeof(key->tp.dst))))
> 		goto nla_put_failure;
> 	else if (flow_basic_key_is_icmpv4(&key->basic) &&
>-		 (fl_dump_key_val(skb, &key->tp.type,
>-				  TCA_FLOWER_KEY_ICMPV4_TYPE, &mask->tp.type,
>+		 (fl_dump_key_val(skb, &key->icmp.type,
>+				  TCA_FLOWER_KEY_ICMPV4_TYPE, &mask->icmp.type,
> 				  TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
>-				  sizeof(key->tp.type)) ||
>-		  fl_dump_key_val(skb, &key->tp.code,
>-				  TCA_FLOWER_KEY_ICMPV4_CODE, &mask->tp.code,
>+				  sizeof(key->icmp.type)) ||
>+		  fl_dump_key_val(skb, &key->icmp.code,
>+				  TCA_FLOWER_KEY_ICMPV4_CODE, &mask->icmp.code,
> 				  TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
>-				  sizeof(key->tp.code))))
>+				  sizeof(key->icmp.code))))
> 		goto nla_put_failure;
> 	else if (flow_basic_key_is_icmpv6(&key->basic) &&
>-		 (fl_dump_key_val(skb, &key->tp.type,
>-				  TCA_FLOWER_KEY_ICMPV6_TYPE, &mask->tp.type,
>+		 (fl_dump_key_val(skb, &key->icmp.type,
>+				  TCA_FLOWER_KEY_ICMPV6_TYPE, &mask->icmp.type,
> 				  TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
>-				  sizeof(key->tp.type)) ||
>-		  fl_dump_key_val(skb, &key->tp.code,
>-				  TCA_FLOWER_KEY_ICMPV6_CODE, &mask->tp.code,
>+				  sizeof(key->icmp.type)) ||
>+		  fl_dump_key_val(skb, &key->icmp.code,
>+				  TCA_FLOWER_KEY_ICMPV6_CODE, &mask->icmp.code,
> 				  TCA_FLOWER_KEY_ICMPV6_CODE_MASK,
>-				  sizeof(key->tp.code))))
>+				  sizeof(key->icmp.code))))
> 		goto nla_put_failure;
> 
> 	if (key->enc_control.addr_type == FLOW_DISSECTOR_KEY_IPV4_ADDRS &&

^ permalink raw reply

* [PATCH net-next] net: dsa: mv88e6xxx: Use EDSA on mv88e6097
From: Stefan Eichenberger @ 2016-12-05 13:12 UTC (permalink / raw)
  To: andrew, vivien.didelot; +Cc: netdev, Stefan Eichenberger

Use DSA_TAG_PROTO_EDSA as tag_protocol for the mv88e6097. The 
initialisation was missing before.

Signed-off-by: Stefan Eichenberger <stefan.eichenberger@netmodule.com>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index ca453f3..7a6c587 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -3861,6 +3861,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
 		.global1_addr = 0x1b,
 		.age_time_coeff = 15000,
 		.g1_irqs = 8,
+		.tag_protocol = DSA_TAG_PROTO_EDSA,
 		.flags = MV88E6XXX_FLAGS_FAMILY_6097,
 		.ops = &mv88e6097_ops,
 	},
-- 
2.9.3

^ permalink raw reply related

* Re: [PATCH v2 net-next 1/2] flow dissector: ICMP support
From: Simon Horman @ 2016-12-05 14:23 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jiri Pirko, David Miller, Linux Kernel Network Developers,
	Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, Jamal Hadi Salim,
	Jiri Pirko
In-Reply-To: <CALx6S36EorReKwSs=qpS6uxCrL=_CQHf2BfV+Y=Kw-Dzfc241A@mail.gmail.com>

On Sat, Dec 03, 2016 at 10:52:32AM -0800, Tom Herbert wrote:
> On Sat, Dec 3, 2016 at 2:49 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> > Fri, Dec 02, 2016 at 09:31:41PM CET, simon.horman@netronome.com wrote:
> >>Allow dissection of ICMP(V6) type and code. This re-uses transport layer
> >>port dissection code as although ICMP is not a transport protocol and their
> >>type and code are not ports this allows sharing of both code and storage.
> >>
> >>Signed-off-by: Simon Horman <simon.horman@netronome.com>

...

> > Digging into this a bit more. I think it would be much nice not to mix
> > up l4 ports and icmp stuff.
> >
> > How about to have FLOW_DISSECTOR_KEY_ICMP
> > and
> > struct flow_dissector_key_icmp {
> >         u8 type;
> >         u8 code;
> > };
> >
> > The you can make this structure and struct flow_dissector_key_ports into
> > an union in struct flow_keys.
> >
> > Looks much cleaner to me.

Hi Jiri,

I wondered about that too. The main reason that I didn't do this the first
time around is that I wasn't sure what to do to differentiate between ports
and icmp in fl_init_dissector() given that using a union would could to
mask bits being set in both if they are set in either.

I've taken a stab at addressing that below along with implementing your
suggestion. If the treatment in fl_init_dissector() is acceptable
perhaps something similar should be done for FLOW_DISSECTOR_KEY_IPV[46]_ADDRS
too?

> I agree, this patch adds to many conditionals into the fast path for
> ICMP handling. Neither is there much point in using type and code as
> input to the packet hash.

Hi Tom,

I'm not sure that I follow this. A packet may be ICMP or not regardless of
if FLOW_DISSECTOR_KEY_PORT (and now FLOW_DISSECTOR_KEY_ICMP) port is set or
not. I'd appreciate some guidance here.


First-pass at implementing Jiri's suggestion follows:

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 5540dfa18872..475cd121496f 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -91,14 +91,10 @@ struct flow_dissector_key_addrs {
 
 /**
  * flow_dissector_key_ports:
- *	@ports: port numbers of Transport header or
- *		type and code of ICMP header
+ *	@ports: port numbers of Transport header
  *		ports: source (high) and destination (low) port numbers
  *		src: source port number
  *		dst: destination port number
- *		icmp: ICMP type (high) and code (low)
- *		type: ICMP type
- *		type: ICMP code
  */
 struct flow_dissector_key_ports {
 	union {
@@ -107,6 +103,18 @@ struct flow_dissector_key_ports {
 			__be16 src;
 			__be16 dst;
 		};
+	};
+};
+
+/**
+ * flow_dissector_key_icmp:
+ *	@ports: type and code of ICMP header
+ *		icmp: ICMP type (high) and code (low)
+ *		type: ICMP type
+ *		code: ICMP code
+ */
+struct flow_dissector_key_icmp {
+	union {
 		__be16 icmp;
 		struct {
 			u8 type;
@@ -133,6 +141,7 @@ enum flow_dissector_key_id {
 	FLOW_DISSECTOR_KEY_IPV4_ADDRS, /* struct flow_dissector_key_ipv4_addrs */
 	FLOW_DISSECTOR_KEY_IPV6_ADDRS, /* struct flow_dissector_key_ipv6_addrs */
 	FLOW_DISSECTOR_KEY_PORTS, /* struct flow_dissector_key_ports */
+	FLOW_DISSECTOR_KEY_ICMP, /* struct flow_dissector_key_icmp */
 	FLOW_DISSECTOR_KEY_ETH_ADDRS, /* struct flow_dissector_key_eth_addrs */
 	FLOW_DISSECTOR_KEY_TIPC_ADDRS, /* struct flow_dissector_key_tipc_addrs */
 	FLOW_DISSECTOR_KEY_VLAN, /* struct flow_dissector_key_flow_vlan */
@@ -171,7 +180,10 @@ struct flow_keys {
 	struct flow_dissector_key_tags tags;
 	struct flow_dissector_key_vlan vlan;
 	struct flow_dissector_key_keyid keyid;
-	struct flow_dissector_key_ports ports;
+	union {
+		struct flow_dissector_key_ports ports;
+		struct flow_dissector_key_icmp icmp;
+	};
 	struct flow_dissector_key_addrs addrs;
 };
 
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 0584b4bb4390..33e5fa2b3e87 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -139,6 +139,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 	struct flow_dissector_key_basic *key_basic;
 	struct flow_dissector_key_addrs *key_addrs;
 	struct flow_dissector_key_ports *key_ports;
+	struct flow_dissector_key_icmp *key_icmp;
 	struct flow_dissector_key_tags *key_tags;
 	struct flow_dissector_key_vlan *key_vlan;
 	struct flow_dissector_key_keyid *key_keyid;
@@ -559,18 +560,25 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 		break;
 	}
 
-	if (dissector_uses_key(flow_dissector,
-			       FLOW_DISSECTOR_KEY_PORTS)) {
-		key_ports = skb_flow_dissector_target(flow_dissector,
-						      FLOW_DISSECTOR_KEY_PORTS,
-						      target_container);
-		if (flow_protos_are_icmp_any(proto, ip_proto))
-			key_ports->icmp = skb_flow_get_be16(skb, nhoff, data,
-							    hlen);
-		else
+	if (flow_protos_are_icmp_any(proto, ip_proto)) {
+		if (dissector_uses_key(flow_dissector,
+				       FLOW_DISSECTOR_KEY_ICMP)) {
+			key_icmp = skb_flow_dissector_target(flow_dissector,
+							     FLOW_DISSECTOR_KEY_ICMP,
+							     target_container);
+			key_icmp->icmp = skb_flow_get_be16(skb, nhoff, data,
+							   hlen);
+		}
+	} else {
+		if (dissector_uses_key(flow_dissector,
+				       FLOW_DISSECTOR_KEY_PORTS)) {
+			key_ports = skb_flow_dissector_target(flow_dissector,
+							      FLOW_DISSECTOR_KEY_PORTS,
+							      target_container);
 			key_ports->ports = __skb_flow_get_ports(skb, nhoff,
 								ip_proto, data,
 								hlen);
+		}
 	}
 
 out_good:
@@ -975,6 +983,10 @@ static const struct flow_dissector_key flow_keys_dissector_keys[] = {
 		.offset = offsetof(struct flow_keys, ports),
 	},
 	{
+		.key_id = FLOW_DISSECTOR_KEY_ICMP,
+		.offset = offsetof(struct flow_keys, icmp),
+	},
+	{
 		.key_id = FLOW_DISSECTOR_KEY_VLAN,
 		.offset = offsetof(struct flow_keys, vlan),
 	},
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index c96b2a349779..f4ba69bd362f 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -38,7 +38,10 @@ struct fl_flow_key {
 		struct flow_dissector_key_ipv4_addrs ipv4;
 		struct flow_dissector_key_ipv6_addrs ipv6;
 	};
-	struct flow_dissector_key_ports tp;
+	union {
+		struct flow_dissector_key_ports tp;
+		struct flow_dissector_key_icmp icmp;
+	};
 	struct flow_dissector_key_keyid enc_key_id;
 	union {
 		struct flow_dissector_key_ipv4_addrs enc_ipv4;
@@ -511,19 +514,23 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
 			       &mask->tp.dst, TCA_FLOWER_KEY_SCTP_DST_MASK,
 			       sizeof(key->tp.dst));
 	} else if (flow_basic_key_is_icmpv4(&key->basic)) {
-		fl_set_key_val(tb, &key->tp.type, TCA_FLOWER_KEY_ICMPV4_TYPE,
-			       &mask->tp.type, TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
-			       sizeof(key->tp.type));
-		fl_set_key_val(tb, &key->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
-			       &mask->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
-			       sizeof(key->tp.code));
+		fl_set_key_val(tb, &key->icmp.type, TCA_FLOWER_KEY_ICMPV4_TYPE,
+			       &mask->icmp.type,
+			       TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
+			       sizeof(key->icmp.type));
+		fl_set_key_val(tb, &key->icmp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
+			       &mask->icmp.code,
+			       TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
+			       sizeof(key->icmp.code));
 	} else if (flow_basic_key_is_icmpv6(&key->basic)) {
-		fl_set_key_val(tb, &key->tp.type, TCA_FLOWER_KEY_ICMPV6_TYPE,
-			       &mask->tp.type, TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
-			       sizeof(key->tp.type));
-		fl_set_key_val(tb, &key->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
-			       &mask->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
-			       sizeof(key->tp.code));
+		fl_set_key_val(tb, &key->icmp.type, TCA_FLOWER_KEY_ICMPV6_TYPE,
+			       &mask->icmp.type,
+			       TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
+			       sizeof(key->icmp.type));
+		fl_set_key_val(tb, &key->icmp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
+			       &mask->icmp.code,
+			       TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
+			       sizeof(key->icmp.code));
 	}
 
 	if (tb[TCA_FLOWER_KEY_ENC_IPV4_SRC] ||
@@ -609,15 +616,16 @@ static int fl_init_hashtable(struct cls_fl_head *head,
 		keys[cnt].key_id = id;						\
 		keys[cnt].offset = FL_KEY_MEMBER_OFFSET(member);		\
 		cnt++;								\
-	} while(0);
+	} while(0)
 
 #define FL_KEY_SET_IF_MASKED(mask, keys, cnt, id, member)			\
 	do {									\
 		if (FL_KEY_IS_MASKED(mask, member))				\
 			FL_KEY_SET(keys, cnt, id, member);			\
-	} while(0);
+	} while(0)
 
 static void fl_init_dissector(struct cls_fl_head *head,
+			      struct cls_fl_filter *f,
 			      struct fl_flow_mask *mask)
 {
 	struct flow_dissector_key keys[FLOW_DISSECTOR_KEY_MAX];
@@ -631,8 +639,12 @@ static void fl_init_dissector(struct cls_fl_head *head,
 			     FLOW_DISSECTOR_KEY_IPV4_ADDRS, ipv4);
 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
 			     FLOW_DISSECTOR_KEY_IPV6_ADDRS, ipv6);
-	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
-			     FLOW_DISSECTOR_KEY_PORTS, tp);
+	if (flow_basic_key_is_icmpv4(&f->key.basic))
+		FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+				     FLOW_DISSECTOR_KEY_ICMP, icmp);
+	else
+		FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+				     FLOW_DISSECTOR_KEY_PORTS, tp);
 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
 			     FLOW_DISSECTOR_KEY_VLAN, vlan);
 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
@@ -652,6 +664,7 @@ static void fl_init_dissector(struct cls_fl_head *head,
 }
 
 static int fl_check_assign_mask(struct cls_fl_head *head,
+				struct cls_fl_filter *f,
 				struct fl_flow_mask *mask)
 {
 	int err;
@@ -672,7 +685,7 @@ static int fl_check_assign_mask(struct cls_fl_head *head,
 	memcpy(&head->mask, mask, sizeof(head->mask));
 	head->mask_assigned = true;
 
-	fl_init_dissector(head, mask);
+	fl_init_dissector(head, f, mask);
 
 	return 0;
 }
@@ -785,7 +798,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 	if (err)
 		goto errout;
 
-	err = fl_check_assign_mask(head, &mask);
+	err = fl_check_assign_mask(head, fnew, &mask);
 	if (err)
 		goto errout;
 
@@ -1000,24 +1013,24 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 				  sizeof(key->tp.dst))))
 		goto nla_put_failure;
 	else if (flow_basic_key_is_icmpv4(&key->basic) &&
-		 (fl_dump_key_val(skb, &key->tp.type,
-				  TCA_FLOWER_KEY_ICMPV4_TYPE, &mask->tp.type,
+		 (fl_dump_key_val(skb, &key->icmp.type,
+				  TCA_FLOWER_KEY_ICMPV4_TYPE, &mask->icmp.type,
 				  TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
-				  sizeof(key->tp.type)) ||
-		  fl_dump_key_val(skb, &key->tp.code,
-				  TCA_FLOWER_KEY_ICMPV4_CODE, &mask->tp.code,
+				  sizeof(key->icmp.type)) ||
+		  fl_dump_key_val(skb, &key->icmp.code,
+				  TCA_FLOWER_KEY_ICMPV4_CODE, &mask->icmp.code,
 				  TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
-				  sizeof(key->tp.code))))
+				  sizeof(key->icmp.code))))
 		goto nla_put_failure;
 	else if (flow_basic_key_is_icmpv6(&key->basic) &&
-		 (fl_dump_key_val(skb, &key->tp.type,
-				  TCA_FLOWER_KEY_ICMPV6_TYPE, &mask->tp.type,
+		 (fl_dump_key_val(skb, &key->icmp.type,
+				  TCA_FLOWER_KEY_ICMPV6_TYPE, &mask->icmp.type,
 				  TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
-				  sizeof(key->tp.type)) ||
-		  fl_dump_key_val(skb, &key->tp.code,
-				  TCA_FLOWER_KEY_ICMPV6_CODE, &mask->tp.code,
+				  sizeof(key->icmp.type)) ||
+		  fl_dump_key_val(skb, &key->icmp.code,
+				  TCA_FLOWER_KEY_ICMPV6_CODE, &mask->icmp.code,
 				  TCA_FLOWER_KEY_ICMPV6_CODE_MASK,
-				  sizeof(key->tp.code))))
+				  sizeof(key->icmp.code))))
 		goto nla_put_failure;
 
 	if (key->enc_control.addr_type == FLOW_DISSECTOR_KEY_IPV4_ADDRS &&

^ permalink raw reply related

* Re: [PATCH net-next] net: dsa: mv88e6xxx: Use EDSA on mv88e6097
From: Andrew Lunn @ 2016-12-05 14:12 UTC (permalink / raw)
  To: Stefan Eichenberger; +Cc: vivien.didelot, netdev, Stefan Eichenberger
In-Reply-To: <20161205131242.19370-1-stefan.eichenberger@netmodule.com>

On Mon, Dec 05, 2016 at 02:12:42PM +0100, Stefan Eichenberger wrote:
> Use DSA_TAG_PROTO_EDSA as tag_protocol for the mv88e6097. The 
> initialisation was missing before.
> 
> Signed-off-by: Stefan Eichenberger <stefan.eichenberger@netmodule.com>

Upps, sorry. My error.

This is net-next only.

Fixes: a1f482aa8c33 ("net: dsa: mv88e6xxx: Move the tagging protocol into info")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [patch net v2] net: fec: fix compile with CONFIG_M5272
From: Nikita Yushchenko @ 2016-12-05 14:03 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, David S. Miller, Fugang Duan, Troy Kisky, Andrew Lunn,
	Eric Nelson, Philippe Reynes, Johannes Berg, netdev, Chris Healy,
	Fabio Estevam, linux-kernel
In-Reply-To: <60acf0f4-2f1f-2b2c-06ed-b1d8cb03525d@cogentembedded.com>

diff --git a/drivers/net/ethernet/freescale/fec_main.c
b/drivers/net/ethernet/freescale/fec_main.c
index 741cf4a57bfc..12aef1b15356 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -3301,7 +3301,7 @@ fec_probe(struct platform_device *pdev)

 	/* Init network device */
 	ndev = alloc_etherdev_mqs(sizeof(struct fec_enet_private) +
-				  FEC_STAT_SIZE, num_tx_qs, num_rx_qs);
+				  FEC_STATS_SIZE, num_tx_qs, num_rx_qs);
 	if (!ndev)
 		return -ENOMEM;


> Aieee   I was typing too fast today, sorry...
> 
> send separate "fix for the fix", or re-send patch without that silly typo?
> 
> Nikita
> 
>> Hi Nikita,
>>
>> [auto build test ERROR on net/master]
>>
>> url:    https://github.com/0day-ci/linux/commits/Nikita-Yushchenko/net-fec-fix-compile-with-CONFIG_M5272/20161205-181735
>> config: arm-multi_v5_defconfig (attached as .config)
>> compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
>> reproduce:
>>         wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
>>         chmod +x ~/bin/make.cross
>>         # save the attached .config to linux build tree
>>         make.cross ARCH=arm 
>>
>> All errors (new ones prefixed by >>):
>>
>>    drivers/net/ethernet/freescale/fec_main.c: In function 'fec_probe':
>>>> drivers/net/ethernet/freescale/fec_main.c:3304:7: error: 'FEC_STAT_SIZE' undeclared (first use in this function)
>>           FEC_STAT_SIZE, num_tx_qs, num_rx_qs);
>>           ^~~~~~~~~~~~~
>>    drivers/net/ethernet/freescale/fec_main.c:3304:7: note: each undeclared identifier is reported only once for each function it appears in
>>
>> vim +/FEC_STAT_SIZE +3304 drivers/net/ethernet/freescale/fec_main.c
>>
>>   3298		int num_rx_qs;
>>   3299	
>>   3300		fec_enet_get_queue_num(pdev, &num_tx_qs, &num_rx_qs);
>>   3301	
>>   3302		/* Init network device */
>>   3303		ndev = alloc_etherdev_mqs(sizeof(struct fec_enet_private) +
>>> 3304					  FEC_STAT_SIZE, num_tx_qs, num_rx_qs);
>>   3305		if (!ndev)
>>   3306			return -ENOMEM;
>>   3307	
>>
>> ---
>> 0-DAY kernel test infrastructure                Open Source Technology Center
>> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
>>

^ permalink raw reply related

* Re: [patch net v2] net: fec: fix compile with CONFIG_M5272
From: Nikita Yushchenko @ 2016-12-05 13:55 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, David S. Miller, Fugang Duan, Troy Kisky, Andrew Lunn,
	Eric Nelson, Philippe Reynes, Johannes Berg, netdev, Chris Healy,
	Fabio Estevam, linux-kernel
In-Reply-To: <201612052104.HSl9jlj9%fengguang.wu@intel.com>

Aieee   I was typing too fast today, sorry...

send separate "fix for the fix", or re-send patch without that silly typo?

Nikita

> Hi Nikita,
> 
> [auto build test ERROR on net/master]
> 
> url:    https://github.com/0day-ci/linux/commits/Nikita-Yushchenko/net-fec-fix-compile-with-CONFIG_M5272/20161205-181735
> config: arm-multi_v5_defconfig (attached as .config)
> compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
>         wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         make.cross ARCH=arm 
> 
> All errors (new ones prefixed by >>):
> 
>    drivers/net/ethernet/freescale/fec_main.c: In function 'fec_probe':
>>> drivers/net/ethernet/freescale/fec_main.c:3304:7: error: 'FEC_STAT_SIZE' undeclared (first use in this function)
>           FEC_STAT_SIZE, num_tx_qs, num_rx_qs);
>           ^~~~~~~~~~~~~
>    drivers/net/ethernet/freescale/fec_main.c:3304:7: note: each undeclared identifier is reported only once for each function it appears in
> 
> vim +/FEC_STAT_SIZE +3304 drivers/net/ethernet/freescale/fec_main.c
> 
>   3298		int num_rx_qs;
>   3299	
>   3300		fec_enet_get_queue_num(pdev, &num_tx_qs, &num_rx_qs);
>   3301	
>   3302		/* Init network device */
>   3303		ndev = alloc_etherdev_mqs(sizeof(struct fec_enet_private) +
>> 3304					  FEC_STAT_SIZE, num_tx_qs, num_rx_qs);
>   3305		if (!ndev)
>   3306			return -ENOMEM;
>   3307	
> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
> 

^ permalink raw reply

* Re: [PATCH 2/2] net: rfkill: Add rfkill-any LED trigger
From: Johannes Berg @ 2016-12-05 13:54 UTC (permalink / raw)
  To: Michał Kępień, kbuild test robot
  Cc: kbuild-all-JC7UmRfGjtg, David S . Miller,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161201200804.GA3872-d0zqvqtRkg7MisJl3QRgO5JNpPtI3pG4@public.gmane.org>


> Thanks, these are obviously all valid concerns.  Sorry for being
> sloppy
> with the ifdefs.  If I get positive feedback on the proposed feature
> itself, all these issues (and the warning pointed out in the other
> message) will be resolved in v2.

Looks fine, please do that.

johannes

^ permalink raw reply

* [PATCH v2 4/4] ARM: dts: hix5hd2: add gmac generic compatible and clock names
From: Dongpo Li @ 2016-12-05 13:28 UTC (permalink / raw)
  To: robh+dt, mark.rutland, mturquette, sboyd, linux, zhangfei.gao,
	yisen.zhuang, salil.mehta, davem, arnd, andrew
  Cc: xuejiancheng, benjamin.chenhao, caizhiyong, netdev, devicetree,
	linux-kernel, Dongpo Li
In-Reply-To: <1480944481-118803-1-git-send-email-lidongpo@hisilicon.com>

Add gmac generic compatible and clock names.

Signed-off-by: Dongpo Li <lidongpo@hisilicon.com>
---
 arch/arm/boot/dts/hisi-x5hd2.dtsi | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/arm/boot/dts/hisi-x5hd2.dtsi b/arch/arm/boot/dts/hisi-x5hd2.dtsi
index fdcc23d..0da76c5 100644
--- a/arch/arm/boot/dts/hisi-x5hd2.dtsi
+++ b/arch/arm/boot/dts/hisi-x5hd2.dtsi
@@ -436,18 +436,20 @@
 		};
 
 		gmac0: ethernet@1840000 {
-			compatible = "hisilicon,hix5hd2-gmac";
+			compatible = "hisilicon,hix5hd2-gemac", "hisilicon,hisi-gemac-v1";
 			reg = <0x1840000 0x1000>,<0x184300c 0x4>;
 			interrupts = <0 71 4>;
 			clocks = <&clock HIX5HD2_MAC0_CLK>;
+			clock-names = "mac_core";
 			status = "disabled";
 		};
 
 		gmac1: ethernet@1841000 {
-			compatible = "hisilicon,hix5hd2-gmac";
+			compatible = "hisilicon,hix5hd2-gemac", "hisilicon,hisi-gemac-v1";
 			reg = <0x1841000 0x1000>,<0x1843010 0x4>;
 			interrupts = <0 72 4>;
 			clocks = <&clock HIX5HD2_MAC1_CLK>;
+			clock-names = "mac_core";
 			status = "disabled";
 		};
 
-- 
2.8.2

^ permalink raw reply related

* [PATCH v2 1/4] net: hix5hd2_gmac: add generic compatible string
From: Dongpo Li @ 2016-12-05 13:27 UTC (permalink / raw)
  To: robh+dt, mark.rutland, mturquette, sboyd, linux, zhangfei.gao,
	yisen.zhuang, salil.mehta, davem, arnd, andrew
  Cc: xuejiancheng, benjamin.chenhao, caizhiyong, netdev, devicetree,
	linux-kernel, Dongpo Li
In-Reply-To: <1480944481-118803-1-git-send-email-lidongpo@hisilicon.com>

The "hix5hd2" is SoC name, add the generic ethernet driver name.
The "hisi-gemac-v1" is the basic version and "hisi-gemac-v2" adds
the SG/TXCSUM/TSO/UFO features.

Signed-off-by: Dongpo Li <lidongpo@hisilicon.com>
---
 .../devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt    |  9 +++++++--
 drivers/net/ethernet/hisilicon/hix5hd2_gmac.c             | 15 +++++++++++----
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt b/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt
index 75d398b..75920f0 100644
--- a/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt
+++ b/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt
@@ -1,7 +1,12 @@
 Hisilicon hix5hd2 gmac controller
 
 Required properties:
-- compatible: should be "hisilicon,hix5hd2-gmac".
+- compatible: should contain one of the following SoC strings:
+	* "hisilicon,hix5hd2-gemac"
+	* "hisilicon,hi3798cv200-gemac"
+	and one of the following version string:
+	* "hisilicon,hisi-gemac-v1"
+	* "hisilicon,hisi-gemac-v2"
 - reg: specifies base physical address(s) and size of the device registers.
   The first region is the MAC register base and size.
   The second region is external interface control register.
@@ -20,7 +25,7 @@ Required properties:
 
 Example:
 	gmac0: ethernet@f9840000 {
-		compatible = "hisilicon,hix5hd2-gmac";
+		compatible = "hisilicon,hix5hd2-gemac", "hisilicon,hisi-gemac-v1";
 		reg = <0xf9840000 0x1000>,<0xf984300c 0x4>;
 		interrupts = <0 71 4>;
 		#address-cells = <1>;
diff --git a/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c b/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
index e69a6be..27cb2e6 100644
--- a/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
+++ b/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
@@ -189,6 +189,10 @@
 #define dma_cnt(n)			((n) >> 5)
 #define dma_byte(n)			((n) << 5)
 
+#define HW_CAP_TSO			BIT(0)
+#define GEMAC_V1			0
+#define GEMAC_V2			(GEMAC_V1 | HW_CAP_TSO)
+
 struct hix5hd2_desc {
 	__le32 buff_addr;
 	__le32 cmd;
@@ -1021,7 +1025,10 @@ static int hix5hd2_dev_remove(struct platform_device *pdev)
 }
 
 static const struct of_device_id hix5hd2_of_match[] = {
-	{.compatible = "hisilicon,hix5hd2-gmac",},
+	{ .compatible = "hisilicon,hisi-gemac-v1", .data = (void *)GEMAC_V1 },
+	{ .compatible = "hisilicon,hisi-gemac-v2", .data = (void *)GEMAC_V2 },
+	{ .compatible = "hisilicon,hix5hd2-gemac", .data = (void *)GEMAC_V1 },
+	{ .compatible = "hisilicon,hi3798cv200-gemac", .data = (void *)GEMAC_V2 },
 	{},
 };
 
@@ -1029,7 +1036,7 @@ MODULE_DEVICE_TABLE(of, hix5hd2_of_match);
 
 static struct platform_driver hix5hd2_dev_driver = {
 	.driver = {
-		.name = "hix5hd2-gmac",
+		.name = "hisi-gemac",
 		.of_match_table = hix5hd2_of_match,
 	},
 	.probe = hix5hd2_dev_probe,
@@ -1038,6 +1045,6 @@ static struct platform_driver hix5hd2_dev_driver = {
 
 module_platform_driver(hix5hd2_dev_driver);
 
-MODULE_DESCRIPTION("HISILICON HIX5HD2 Ethernet driver");
+MODULE_DESCRIPTION("HISILICON Gigabit Ethernet MAC driver");
 MODULE_LICENSE("GPL v2");
-MODULE_ALIAS("platform:hix5hd2-gmac");
+MODULE_ALIAS("platform:hisi-gemac");
-- 
2.8.2

^ permalink raw reply related

* [PATCH v2 2/4] net: hix5hd2_gmac: add tx scatter-gather feature
From: Dongpo Li @ 2016-12-05 13:27 UTC (permalink / raw)
  To: robh+dt, mark.rutland, mturquette, sboyd, linux, zhangfei.gao,
	yisen.zhuang, salil.mehta, davem, arnd, andrew
  Cc: xuejiancheng, benjamin.chenhao, caizhiyong, netdev, devicetree,
	linux-kernel, Dongpo Li
In-Reply-To: <1480944481-118803-1-git-send-email-lidongpo@hisilicon.com>

"hisi-gemac-v2" adds the SG/TXCSUM/TSO/UFO features.
This patch only adds the SG(scatter-gather) driver for transmitting,
the drivers of other features will be submitted later.

Signed-off-by: Dongpo Li <lidongpo@hisilicon.com>
---
 drivers/net/ethernet/hisilicon/hix5hd2_gmac.c | 198 ++++++++++++++++++++++++--
 1 file changed, 187 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c b/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
index 27cb2e6..18af55b 100644
--- a/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
+++ b/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
@@ -11,6 +11,7 @@
 #include <linux/interrupt.h>
 #include <linux/etherdevice.h>
 #include <linux/platform_device.h>
+#include <linux/of_device.h>
 #include <linux/of_net.h>
 #include <linux/of_mdio.h>
 #include <linux/clk.h>
@@ -183,6 +184,8 @@
 #define DESC_DATA_LEN_OFF		16
 #define DESC_BUFF_LEN_OFF		0
 #define DESC_DATA_MASK			0x7ff
+#define DESC_SG				BIT(30)
+#define DESC_FRAGS_NUM_OFF		11
 
 /* DMA descriptor ring helpers */
 #define dma_ring_incr(n, s)		(((n) + 1) & ((s) - 1))
@@ -192,6 +195,7 @@
 #define HW_CAP_TSO			BIT(0)
 #define GEMAC_V1			0
 #define GEMAC_V2			(GEMAC_V1 | HW_CAP_TSO)
+#define HAS_CAP_TSO(hw_cap)		((hw_cap) & HW_CAP_TSO)
 
 struct hix5hd2_desc {
 	__le32 buff_addr;
@@ -205,6 +209,27 @@ struct hix5hd2_desc_sw {
 	unsigned int	size;
 };
 
+struct hix5hd2_sg_desc_ring {
+	struct sg_desc *desc;
+	dma_addr_t phys_addr;
+};
+
+struct frags_info {
+	__le32 addr;
+	__le32 size;
+};
+
+/* hardware supported max skb frags num */
+#define SG_MAX_SKB_FRAGS	17
+struct sg_desc {
+	__le32 total_len;
+	__le32 resvd0;
+	__le32 linear_addr;
+	__le32 linear_len;
+	/* reserve one more frags for memory alignment */
+	struct frags_info frags[SG_MAX_SKB_FRAGS + 1];
+};
+
 #define QUEUE_NUMS	4
 struct hix5hd2_priv {
 	struct hix5hd2_desc_sw pool[QUEUE_NUMS];
@@ -212,6 +237,7 @@ struct hix5hd2_priv {
 #define rx_bq		pool[1]
 #define tx_bq		pool[2]
 #define tx_rq		pool[3]
+	struct hix5hd2_sg_desc_ring tx_ring;
 
 	void __iomem *base;
 	void __iomem *ctrl_base;
@@ -225,6 +251,7 @@ struct hix5hd2_priv {
 	struct device_node *phy_node;
 	phy_interface_t	phy_mode;
 
+	unsigned long hw_cap;
 	unsigned int speed;
 	unsigned int duplex;
 
@@ -515,6 +542,27 @@ static int hix5hd2_rx(struct net_device *dev, int limit)
 	return num;
 }
 
+static void hix5hd2_clean_sg_desc(struct hix5hd2_priv *priv,
+				  struct sk_buff *skb, u32 pos)
+{
+	struct sg_desc *desc;
+	dma_addr_t addr;
+	u32 len;
+	int i;
+
+	desc = priv->tx_ring.desc + pos;
+
+	addr = le32_to_cpu(desc->linear_addr);
+	len = le32_to_cpu(desc->linear_len);
+	dma_unmap_single(priv->dev, addr, len, DMA_TO_DEVICE);
+
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		addr = le32_to_cpu(desc->frags[i].addr);
+		len = le32_to_cpu(desc->frags[i].size);
+		dma_unmap_page(priv->dev, addr, len, DMA_TO_DEVICE);
+	}
+}
+
 static void hix5hd2_xmit_reclaim(struct net_device *dev)
 {
 	struct sk_buff *skb;
@@ -542,8 +590,15 @@ static void hix5hd2_xmit_reclaim(struct net_device *dev)
 		pkts_compl++;
 		bytes_compl += skb->len;
 		desc = priv->tx_rq.desc + pos;
-		addr = le32_to_cpu(desc->buff_addr);
-		dma_unmap_single(priv->dev, addr, skb->len, DMA_TO_DEVICE);
+
+		if (skb_shinfo(skb)->nr_frags) {
+			hix5hd2_clean_sg_desc(priv, skb, pos);
+		} else {
+			addr = le32_to_cpu(desc->buff_addr);
+			dma_unmap_single(priv->dev, addr, skb->len,
+					 DMA_TO_DEVICE);
+		}
+
 		priv->tx_skb[pos] = NULL;
 		dev_consume_skb_any(skb);
 		pos = dma_ring_incr(pos, TX_DESC_NUM);
@@ -604,12 +659,66 @@ static irqreturn_t hix5hd2_interrupt(int irq, void *dev_id)
 	return IRQ_HANDLED;
 }
 
+static u32 hix5hd2_get_desc_cmd(struct sk_buff *skb, unsigned long hw_cap)
+{
+	u32 cmd = 0;
+
+	if (HAS_CAP_TSO(hw_cap)) {
+		if (skb_shinfo(skb)->nr_frags)
+			cmd |= DESC_SG;
+		cmd |= skb_shinfo(skb)->nr_frags << DESC_FRAGS_NUM_OFF;
+	} else {
+		cmd |= DESC_FL_FULL |
+			((skb->len & DESC_DATA_MASK) << DESC_BUFF_LEN_OFF);
+	}
+
+	cmd |= (skb->len & DESC_DATA_MASK) << DESC_DATA_LEN_OFF;
+	cmd |= DESC_VLD_BUSY;
+
+	return cmd;
+}
+
+static int hix5hd2_fill_sg_desc(struct hix5hd2_priv *priv,
+				struct sk_buff *skb, u32 pos)
+{
+	struct sg_desc *desc;
+	dma_addr_t addr;
+	int ret;
+	int i;
+
+	desc = priv->tx_ring.desc + pos;
+
+	desc->total_len = cpu_to_le32(skb->len);
+	addr = dma_map_single(priv->dev, skb->data, skb_headlen(skb),
+			      DMA_TO_DEVICE);
+	if (unlikely(dma_mapping_error(priv->dev, addr)))
+		return -EINVAL;
+	desc->linear_addr = cpu_to_le32(addr);
+	desc->linear_len = cpu_to_le32(skb_headlen(skb));
+
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+		int len = frag->size;
+
+		addr = skb_frag_dma_map(priv->dev, frag, 0, len, DMA_TO_DEVICE);
+		ret = dma_mapping_error(priv->dev, addr);
+		if (unlikely(ret))
+			return -EINVAL;
+		desc->frags[i].addr = cpu_to_le32(addr);
+		desc->frags[i].size = cpu_to_le32(len);
+	}
+
+	return 0;
+}
+
 static int hix5hd2_net_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct hix5hd2_priv *priv = netdev_priv(dev);
 	struct hix5hd2_desc *desc;
 	dma_addr_t addr;
 	u32 pos;
+	u32 cmd;
+	int ret;
 
 	/* software write pointer */
 	pos = dma_cnt(readl_relaxed(priv->base + TX_BQ_WR_ADDR));
@@ -620,18 +729,31 @@ static int hix5hd2_net_xmit(struct sk_buff *skb, struct net_device *dev)
 		return NETDEV_TX_BUSY;
 	}
 
-	addr = dma_map_single(priv->dev, skb->data, skb->len, DMA_TO_DEVICE);
-	if (dma_mapping_error(priv->dev, addr)) {
-		dev_kfree_skb_any(skb);
-		return NETDEV_TX_OK;
-	}
-
 	desc = priv->tx_bq.desc + pos;
+
+	cmd = hix5hd2_get_desc_cmd(skb, priv->hw_cap);
+	desc->cmd = cpu_to_le32(cmd);
+
+	if (skb_shinfo(skb)->nr_frags) {
+		ret = hix5hd2_fill_sg_desc(priv, skb, pos);
+		if (unlikely(ret)) {
+			dev_kfree_skb_any(skb);
+			dev->stats.tx_dropped++;
+			return NETDEV_TX_OK;
+		}
+		addr = priv->tx_ring.phys_addr + pos * sizeof(struct sg_desc);
+	} else {
+		addr = dma_map_single(priv->dev, skb->data, skb->len,
+				      DMA_TO_DEVICE);
+		if (unlikely(dma_mapping_error(priv->dev, addr))) {
+			dev_kfree_skb_any(skb);
+			dev->stats.tx_dropped++;
+			return NETDEV_TX_OK;
+		}
+	}
 	desc->buff_addr = cpu_to_le32(addr);
+
 	priv->tx_skb[pos] = skb;
-	desc->cmd = cpu_to_le32(DESC_VLD_BUSY | DESC_FL_FULL |
-				(skb->len & DESC_DATA_MASK) << DESC_DATA_LEN_OFF |
-				(skb->len & DESC_DATA_MASK) << DESC_BUFF_LEN_OFF);
 
 	/* ensure desc updated */
 	wmb();
@@ -866,10 +988,40 @@ static int hix5hd2_init_hw_desc_queue(struct hix5hd2_priv *priv)
 	return -ENOMEM;
 }
 
+static int hix5hd2_init_sg_desc_queue(struct hix5hd2_priv *priv)
+{
+	struct sg_desc *desc;
+	dma_addr_t phys_addr;
+
+	desc = (struct sg_desc *)dma_alloc_coherent(priv->dev,
+				TX_DESC_NUM * sizeof(struct sg_desc),
+				&phys_addr, GFP_KERNEL);
+	if (!desc)
+		return -ENOMEM;
+
+	priv->tx_ring.desc = desc;
+	priv->tx_ring.phys_addr = phys_addr;
+
+	return 0;
+}
+
+static void hix5hd2_destroy_sg_desc_queue(struct hix5hd2_priv *priv)
+{
+	if (priv->tx_ring.desc) {
+		dma_free_coherent(priv->dev,
+				  TX_DESC_NUM * sizeof(struct sg_desc),
+				  priv->tx_ring.desc, priv->tx_ring.phys_addr);
+		priv->tx_ring.desc = NULL;
+	}
+}
+
+static const struct of_device_id hix5hd2_of_match[];
+
 static int hix5hd2_dev_probe(struct platform_device *pdev)
 {
 	struct device *dev = &pdev->dev;
 	struct device_node *node = dev->of_node;
+	const struct of_device_id *of_id = NULL;
 	struct net_device *ndev;
 	struct hix5hd2_priv *priv;
 	struct resource *res;
@@ -887,6 +1039,13 @@ static int hix5hd2_dev_probe(struct platform_device *pdev)
 	priv->dev = dev;
 	priv->netdev = ndev;
 
+	of_id = of_match_device(hix5hd2_of_match, dev);
+	if (!of_id) {
+		ret = -EINVAL;
+		goto out_free_netdev;
+	}
+	priv->hw_cap = (unsigned long)of_id->data;
+
 	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
 	priv->base = devm_ioremap_resource(dev, res);
 	if (IS_ERR(priv->base)) {
@@ -976,11 +1135,24 @@ static int hix5hd2_dev_probe(struct platform_device *pdev)
 	ndev->ethtool_ops = &hix5hd2_ethtools_ops;
 	SET_NETDEV_DEV(ndev, dev);
 
+	if (HAS_CAP_TSO(priv->hw_cap))
+		ndev->hw_features |= NETIF_F_SG;
+
+	ndev->features |= ndev->hw_features | NETIF_F_HIGHDMA;
+	ndev->vlan_features |= ndev->features;
+
 	ret = hix5hd2_init_hw_desc_queue(priv);
 	if (ret)
 		goto out_phy_node;
 
 	netif_napi_add(ndev, &priv->napi, hix5hd2_poll, NAPI_POLL_WEIGHT);
+
+	if (HAS_CAP_TSO(priv->hw_cap)) {
+		ret = hix5hd2_init_sg_desc_queue(priv);
+		if (ret)
+			goto out_destroy_queue;
+	}
+
 	ret = register_netdev(priv->netdev);
 	if (ret) {
 		netdev_err(ndev, "register_netdev failed!");
@@ -992,6 +1164,8 @@ static int hix5hd2_dev_probe(struct platform_device *pdev)
 	return ret;
 
 out_destroy_queue:
+	if (HAS_CAP_TSO(priv->hw_cap))
+		hix5hd2_destroy_sg_desc_queue(priv);
 	netif_napi_del(&priv->napi);
 	hix5hd2_destroy_hw_desc_queue(priv);
 out_phy_node:
@@ -1016,6 +1190,8 @@ static int hix5hd2_dev_remove(struct platform_device *pdev)
 	mdiobus_unregister(priv->bus);
 	mdiobus_free(priv->bus);
 
+	if (HAS_CAP_TSO(priv->hw_cap))
+		hix5hd2_destroy_sg_desc_queue(priv);
 	hix5hd2_destroy_hw_desc_queue(priv);
 	of_node_put(priv->phy_node);
 	cancel_work_sync(&priv->tx_timeout_task);
-- 
2.8.2

^ permalink raw reply related

* Re: [RFC] udp: some improvements on RX path.
From: Paolo Abeni @ 2016-12-05 13:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1480905784.18162.509.camel@edumazet-glaptop3.roam.corp.google.com>

Hi Eric,

On Sun, 2016-12-04 at 18:43 -0800, Eric Dumazet wrote:
> We currently access 3 cache lines from an skb in receive queue while
> holding receive queue lock :
> 
> First cache line (contains ->next / prev pointers )
> 2nd cache line (skb->peeked)
> 3rd cache line (skb->truesize)
> 
> I believe we could get rid of skb->peeked completely.
> 
> I will cook a patch, but basically the idea is that the last owner of a
> skb (right before skb->users becomes 0) can have the 'ownership' and
> thus increase stats.

Agreed.

> The 3rd cache line miss is easily avoided by the following patch.

I run some performance tests on top of your patch "net: reorganize
struct sock for better data locality", and I see an additional ~7%
improvement on top of that, in the udp flood scenario. 

In my tests, the topmost perf offenders for the u/s process are now:

   9.98%  udp_sink  [kernel.kallsyms]  [k] udp_rmem_release
   8.76%  udp_sink  [kernel.kallsyms]  [k] inet_recvmsg
   6.71%  udp_sink  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
   5.40%  udp_sink  [kernel.kallsyms]  [k] __skb_try_recv_datagram
   5.19%  udp_sink  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string

udp_rmem_release() spends most of its time doing:

    atomic_sub(size, &sk->sk_rmem_alloc);

I see a cacheline miss while accessing sk_rmem_alloc; most probably due
to sk_rmem_alloc being updated by the writer outside the rx queue lock.

Moving such update inside the lock show remove this cache miss but will
increase the pressure on the rx lock. What do you think ?

inet_recvmsg() is there because with "net: reorganize struct sock for
better data locality" we get a cache miss while accessing skc_rxhash in
sock_rps_record_flow(); touching sk_drops is dirtying that cacheline -
sorry for not noticing this before. Do you have CONFIG_RPS disabled ?

> But I also want to work on the idea I gave few days back, having a
> separate queue and use splice to transfer the 'softirq queue' into
> a calm queue in a different cache line.
> 
> I expect a 50 % performance increase under load, maybe 1.5 Mpps.

It should work nicely under contention, but won't that increase the
overhead for the uncontended/single flow scenario ? the user space
reader needs to acquire 2 lock when splicing the 'softirq queue'. On my
system ksoftirqd and the u/s process work at similar speeds, so splicing
will happen quite often. 

Paolo

^ permalink raw reply

* [PATCH v2 3/4] net: hix5hd2_gmac: add reset control and clock signals
From: Dongpo Li @ 2016-12-05 13:28 UTC (permalink / raw)
  To: robh+dt, mark.rutland, mturquette, sboyd, linux, zhangfei.gao,
	yisen.zhuang, salil.mehta, davem, arnd, andrew
  Cc: xuejiancheng, benjamin.chenhao, caizhiyong, netdev, devicetree,
	linux-kernel, Dongpo Li
In-Reply-To: <1480944481-118803-1-git-send-email-lidongpo@hisilicon.com>

Add three reset control signals, "mac_core_rst", "mac_ifc_rst" and
"phy_rst".
The following diagram explained how the reset signals work.

                        SoC
|-----------------------------------------------------
|                               ------                |
|                               | cpu |               |
|                               ------                |
|                                  |                  |
|                              ------------ AMBA bus  |
|                         GMAC     |                  |
|                            ----------------------   |
| ------------- mac_core_rst | --------------      |  |
| |clock and   |-------------->|   mac core  |     |  |
| |reset       |             | --------------      |  |
| |generator   |----         |       |             |  |
| -------------     |        | ----------------    |  |
|          |        ---------->| mac interface |   |  |
|          |     mac_ifc_rst | ----------------    |  |
|          |                 |       |             |  |
|          |                 | ------------------  |  |
|          |phy_rst          | | RGMII interface | |  |
|          |                 | ------------------  |  |
|          |                 ----------------------   |
|----------|------------------------------------------|
           |                          |
           |                      ----------
           |--------------------- |PHY chip |
                                  ----------

The "mac_core_rst" represents "mac core reset signal", it resets
the mac core including packet processing unit, descriptor processing unit,
tx engine, rx engine, control unit.
The "mac_ifc_rst" represents "mac interface reset signal", it resets
the mac interface. The mac interface unit connects mac core and
data interface like MII/RMII/RGMII. After we set a new value of
interface mode, we must reset mac interface to reload the new mode value.
The "mac_core_rst" and "mac_ifc_rst" are both optional to be
backward compatible with the hix5hd2 SoC.
The "phy_rst" represents "phy reset signal", it does a hardware reset
on the PHY chip. This reset signal is optional if the PHY can work well
without the hardware reset.

Add one more clock signal, the existing is MAC core clock,
and the new one is MAC interface clock.
The MAC interface clock is optional to be backward compatible with
the hix5hd2 SoC.

Signed-off-by: Dongpo Li <lidongpo@hisilicon.com>
---
 .../bindings/net/hisilicon-hix5hd2-gmac.txt        |  20 ++-
 drivers/net/ethernet/hisilicon/hix5hd2_gmac.c      | 139 +++++++++++++++++++--
 2 files changed, 144 insertions(+), 15 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt b/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt
index 75920f0..063c02d 100644
--- a/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt
+++ b/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt
@@ -17,6 +17,16 @@ Required properties:
 - phy-handle: see ethernet.txt [1].
 - mac-address: see ethernet.txt [1].
 - clocks: clock phandle and specifier pair.
+- clock-names: contain the clock name "mac_core"(required) and "mac_ifc"(optional).
+- resets: should contain the phandle to the MAC core reset signal(optional),
+	the MAC interface reset signal(optional)
+	and the PHY reset signal(optional).
+- reset-names: contain the reset signal name "mac_core"(optional),
+	"mac_ifc"(optional) and "phy"(optional).
+- hisilicon,phy-reset-delays-us: triplet of delays if PHY reset signal given.
+	The 1st cell is reset pre-delay in micro seconds.
+	The 2nd cell is reset pulse in micro seconds.
+	The 3rd cell is reset post-delay in micro seconds.
 
 - PHY subnode: inherits from phy binding [2]
 
@@ -25,15 +35,19 @@ Required properties:
 
 Example:
 	gmac0: ethernet@f9840000 {
-		compatible = "hisilicon,hix5hd2-gemac", "hisilicon,hisi-gemac-v1";
+		compatible = "hisilicon,hi3798cv200-gemac", "hisilicon,hisi-gemac-v2";
 		reg = <0xf9840000 0x1000>,<0xf984300c 0x4>;
 		interrupts = <0 71 4>;
 		#address-cells = <1>;
 		#size-cells = <0>;
-		phy-mode = "mii";
+		phy-mode = "rgmii";
 		phy-handle = <&phy2>;
 		mac-address = [00 00 00 00 00 00];
-		clocks = <&clock HIX5HD2_MAC0_CLK>;
+		clocks = <&crg HISTB_ETH0_MAC_CLK>, <&crg HISTB_ETH0_MACIF_CLK>;
+		clock-names = "mac_core", "mac_ifc";
+		resets = <&crg 0xcc 8>, <&crg 0xcc 10>, <&crg 0xcc 12>;
+		reset-names = "mac_core", "mac_ifc", "phy";
+		hisilicon,phy-reset-delays-us = <10000 10000 30000>;
 
 		phy2: ethernet-phy@2 {
 			reg = <2>;
diff --git a/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c b/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
index 18af55b..ee7e9ce 100644
--- a/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
+++ b/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
@@ -14,6 +14,7 @@
 #include <linux/of_device.h>
 #include <linux/of_net.h>
 #include <linux/of_mdio.h>
+#include <linux/reset.h>
 #include <linux/clk.h>
 #include <linux/circ_buf.h>
 
@@ -197,6 +198,15 @@
 #define GEMAC_V2			(GEMAC_V1 | HW_CAP_TSO)
 #define HAS_CAP_TSO(hw_cap)		((hw_cap) & HW_CAP_TSO)
 
+#define PHY_RESET_DELAYS_PROPERTY	"hisilicon,phy-reset-delays-us"
+
+enum phy_reset_delays {
+	PRE_DELAY,
+	PULSE,
+	POST_DELAY,
+	DELAYS_NUM,
+};
+
 struct hix5hd2_desc {
 	__le32 buff_addr;
 	__le32 cmd;
@@ -255,12 +265,26 @@ struct hix5hd2_priv {
 	unsigned int speed;
 	unsigned int duplex;
 
-	struct clk *clk;
+	struct clk *mac_core_clk;
+	struct clk *mac_ifc_clk;
+	struct reset_control *mac_core_rst;
+	struct reset_control *mac_ifc_rst;
+	struct reset_control *phy_rst;
+	u32 phy_reset_delays[DELAYS_NUM];
 	struct mii_bus *bus;
 	struct napi_struct napi;
 	struct work_struct tx_timeout_task;
 };
 
+static inline void hix5hd2_mac_interface_reset(struct hix5hd2_priv *priv)
+{
+	if (!priv->mac_ifc_rst)
+		return;
+
+	reset_control_assert(priv->mac_ifc_rst);
+	reset_control_deassert(priv->mac_ifc_rst);
+}
+
 static void hix5hd2_config_port(struct net_device *dev, u32 speed, u32 duplex)
 {
 	struct hix5hd2_priv *priv = netdev_priv(dev);
@@ -293,6 +317,7 @@ static void hix5hd2_config_port(struct net_device *dev, u32 speed, u32 duplex)
 	if (duplex)
 		val |= GMAC_FULL_DUPLEX;
 	writel_relaxed(val, priv->ctrl_base);
+	hix5hd2_mac_interface_reset(priv);
 
 	writel_relaxed(BIT_MODE_CHANGE_EN, priv->base + MODE_CHANGE_EN);
 	if (speed == SPEED_1000)
@@ -807,16 +832,26 @@ static int hix5hd2_net_open(struct net_device *dev)
 	struct phy_device *phy;
 	int ret;
 
-	ret = clk_prepare_enable(priv->clk);
+	ret = clk_prepare_enable(priv->mac_core_clk);
+	if (ret < 0) {
+		netdev_err(dev, "failed to enable mac core clk %d\n", ret);
+		return ret;
+	}
+
+	ret = clk_prepare_enable(priv->mac_ifc_clk);
 	if (ret < 0) {
-		netdev_err(dev, "failed to enable clk %d\n", ret);
+		clk_disable_unprepare(priv->mac_core_clk);
+		netdev_err(dev, "failed to enable mac ifc clk %d\n", ret);
 		return ret;
 	}
 
 	phy = of_phy_connect(dev, priv->phy_node,
 			     &hix5hd2_adjust_link, 0, priv->phy_mode);
-	if (!phy)
+	if (!phy) {
+		clk_disable_unprepare(priv->mac_ifc_clk);
+		clk_disable_unprepare(priv->mac_core_clk);
 		return -ENODEV;
+	}
 
 	phy_start(phy);
 	hix5hd2_hw_init(priv);
@@ -847,7 +882,8 @@ static int hix5hd2_net_close(struct net_device *dev)
 		phy_disconnect(dev->phydev);
 	}
 
-	clk_disable_unprepare(priv->clk);
+	clk_disable_unprepare(priv->mac_ifc_clk);
+	clk_disable_unprepare(priv->mac_core_clk);
 
 	return 0;
 }
@@ -1015,6 +1051,48 @@ static void hix5hd2_destroy_sg_desc_queue(struct hix5hd2_priv *priv)
 	}
 }
 
+static inline void hix5hd2_mac_core_reset(struct hix5hd2_priv *priv)
+{
+	if (!priv->mac_core_rst)
+		return;
+
+	reset_control_assert(priv->mac_core_rst);
+	reset_control_deassert(priv->mac_core_rst);
+}
+
+static void hix5hd2_sleep_us(u32 time_us)
+{
+	u32 time_ms;
+
+	if (!time_us)
+		return;
+
+	time_ms = DIV_ROUND_UP(time_us, 1000);
+	if (time_ms < 20)
+		usleep_range(time_us, time_us + 500);
+	else
+		msleep(time_ms);
+}
+
+static void hix5hd2_phy_reset(struct hix5hd2_priv *priv)
+{
+	/* To make sure PHY hardware reset success,
+	 * we must keep PHY in deassert state first and
+	 * then complete the hardware reset operation
+	 */
+	reset_control_deassert(priv->phy_rst);
+	hix5hd2_sleep_us(priv->phy_reset_delays[PRE_DELAY]);
+
+	reset_control_assert(priv->phy_rst);
+	/* delay some time to ensure reset ok,
+	 * this depends on PHY hardware feature
+	 */
+	hix5hd2_sleep_us(priv->phy_reset_delays[PULSE]);
+	reset_control_deassert(priv->phy_rst);
+	/* delay some time to ensure later MDIO access */
+	hix5hd2_sleep_us(priv->phy_reset_delays[POST_DELAY]);
+}
+
 static const struct of_device_id hix5hd2_of_match[];
 
 static int hix5hd2_dev_probe(struct platform_device *pdev)
@@ -1060,23 +1138,55 @@ static int hix5hd2_dev_probe(struct platform_device *pdev)
 		goto out_free_netdev;
 	}
 
-	priv->clk = devm_clk_get(&pdev->dev, NULL);
-	if (IS_ERR(priv->clk)) {
-		netdev_err(ndev, "failed to get clk\n");
+	priv->mac_core_clk = devm_clk_get(&pdev->dev, "mac_core");
+	if (IS_ERR(priv->mac_core_clk)) {
+		netdev_err(ndev, "failed to get mac core clk\n");
 		ret = -ENODEV;
 		goto out_free_netdev;
 	}
 
-	ret = clk_prepare_enable(priv->clk);
+	ret = clk_prepare_enable(priv->mac_core_clk);
 	if (ret < 0) {
-		netdev_err(ndev, "failed to enable clk %d\n", ret);
+		netdev_err(ndev, "failed to enable mac core clk %d\n", ret);
 		goto out_free_netdev;
 	}
 
+	priv->mac_ifc_clk = devm_clk_get(&pdev->dev, "mac_ifc");
+	if (IS_ERR(priv->mac_ifc_clk))
+		priv->mac_ifc_clk = NULL;
+
+	ret = clk_prepare_enable(priv->mac_ifc_clk);
+	if (ret < 0) {
+		netdev_err(ndev, "failed to enable mac ifc clk %d\n", ret);
+		goto out_disable_mac_core_clk;
+	}
+
+	priv->mac_core_rst = devm_reset_control_get(dev, "mac_core");
+	if (IS_ERR(priv->mac_core_rst))
+		priv->mac_core_rst = NULL;
+	hix5hd2_mac_core_reset(priv);
+
+	priv->mac_ifc_rst = devm_reset_control_get(dev, "mac_ifc");
+	if (IS_ERR(priv->mac_ifc_rst))
+		priv->mac_ifc_rst = NULL;
+
+	priv->phy_rst = devm_reset_control_get(dev, "phy");
+	if (IS_ERR(priv->phy_rst)) {
+		priv->phy_rst = NULL;
+	} else {
+		ret = of_property_read_u32_array(node,
+						 PHY_RESET_DELAYS_PROPERTY,
+						 priv->phy_reset_delays,
+						 DELAYS_NUM);
+		if (ret)
+			goto out_disable_clk;
+		hix5hd2_phy_reset(priv);
+	}
+
 	bus = mdiobus_alloc();
 	if (bus == NULL) {
 		ret = -ENOMEM;
-		goto out_free_netdev;
+		goto out_disable_clk;
 	}
 
 	bus->priv = priv;
@@ -1159,7 +1269,8 @@ static int hix5hd2_dev_probe(struct platform_device *pdev)
 		goto out_destroy_queue;
 	}
 
-	clk_disable_unprepare(priv->clk);
+	clk_disable_unprepare(priv->mac_ifc_clk);
+	clk_disable_unprepare(priv->mac_core_clk);
 
 	return ret;
 
@@ -1174,6 +1285,10 @@ static int hix5hd2_dev_probe(struct platform_device *pdev)
 	mdiobus_unregister(bus);
 err_free_mdio:
 	mdiobus_free(bus);
+out_disable_clk:
+	clk_disable_unprepare(priv->mac_ifc_clk);
+out_disable_mac_core_clk:
+	clk_disable_unprepare(priv->mac_core_clk);
 out_free_netdev:
 	free_netdev(ndev);
 
-- 
2.8.2

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox