* Re: [PATCH net-next 2/2] net:dsa: check for EPROBE_DEFER from dsa_dst_parse()
From: Andrew Lunn @ 2017-01-04 2:19 UTC (permalink / raw)
To: Volodymyr Bendiuga
Cc: vivien.didelot, f.fainelli, davem, netdev, volodymyr.bendiuga
In-Reply-To: <1481129766-10235-1-git-send-email-volodymyr.bendiuga@westermo.se>
On Wed, Dec 07, 2016 at 05:56:06PM +0100, Volodymyr Bendiuga wrote:
> Since there can be multiple dsa switches stacked together but
> not all of devicetree nodes available at the time of calling
> dsa_dst_parse(), EPROBE_DEFER can be returned by it. When this
> happens, only the last dsa switch has to be deleted by
> dsa_dst_del_ds(), but not the whole list, because next time linux
> comes back to this function it will try to add only the last dsa
> switch which returned EPROBE_DEFER.
>
> Signed-off-by: Volodymyr Bendiuga <volodymyr.bendiuga@westermo.se>
> ---
> net/dsa/dsa2.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
> index 7924c92..0a5ddaa 100644
> --- a/net/dsa/dsa2.c
> +++ b/net/dsa/dsa2.c
> @@ -673,8 +673,14 @@ static int _dsa_register_switch(struct dsa_switch *ds, struct device_node *np)
> }
>
> err = dsa_dst_parse(dst);
> - if (err)
> + if (err){
> + if (-EPROBE_DEFER == err) {
Hi Volodymyr
Please can you turn this around, err == -EPROBE_DEFER, to make it
consistent with all the other network code.
With that change
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Andrew
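
Editorial illustration, not part of the patch under review: a minimal
userspace sketch of the behaviour the commit message describes, where a
deferred probe drops only the switch that was just added. The names
tree_model and handle_parse_result() are hypothetical, and tearing down
the whole list on any other error is an assumption, since the quoted
hunk is cut off before that branch. EPROBE_DEFER is defined locally
because userspace <errno.h> does not provide it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Kernel-internal errno value; not exported to userspace headers. */
#ifndef EPROBE_DEFER
#define EPROBE_DEFER 517
#endif

/* Hypothetical stand-in for the switch tree: only counts attached switches. */
struct tree_model {
	int nr_switches;
};

/*
 * Decide what to tear down after the parse step returned err.
 * Deferred probe: drop only the most recently added switch.
 * Any other error: drop everything (assumption, see lead-in).
 */
int handle_parse_result(struct tree_model *t, int err)
{
	assert(t != NULL);

	if (!err)
		return 0;

	if (err == -EPROBE_DEFER) {
		if (t->nr_switches > 0)
			t->nr_switches--;
		return err;
	}

	t->nr_switches = 0;
	return err;
}
```

The comparison is written as err == -EPROBE_DEFER, the operand order
requested in the review above.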
^ permalink raw reply
* [PATCH net-next 0/6] Prepare BPF for VLAN_TAG_PRESENT cleanup
From: Michał Mirosław @ 2017-01-04 1:18 UTC (permalink / raw)
To: netdev
These patches prepare BPF and its JITs for removal of VLAN_TAG_PRESENT.
The set depends on "Preparation for VLAN_TAG_PRESENT cleanup" patchset.
The series is supposed to be bisect-friendly and that requires temporary
insertion of #define VLAN_TAG_PRESENT in BPF code to be able to split
JIT changes per architecture.
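
Editorial illustration, not part of the series: a userspace sketch of
what "splitting the present bit from the TCI" means. It assumes the
existing layout keeps the "tag present" flag in bit 0x1000 of the
16-bit TCI value, and models the target layout with a separate boolean.
The names pkt_model, old_tag_present(), old_tag_get() and
split_from_old() are hypothetical and do not appear in the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed old layout: the "present" flag shares bit 0x1000 of the TCI. */
#define OLD_VLAN_TAG_PRESENT 0x1000u

/* Hypothetical new layout: the flag has a field of its own. */
struct pkt_model {
	uint16_t vlan_tci;
	bool vlan_present;
};

/* Old-style test: the flag is read out of the TCI value itself. */
bool old_tag_present(uint16_t old_tci)
{
	return (old_tci & OLD_VLAN_TAG_PRESENT) != 0;
}

/* Old-style accessor: the flag bit has to be masked off on every read. */
uint16_t old_tag_get(uint16_t old_tci)
{
	return (uint16_t)(old_tci & ~OLD_VLAN_TAG_PRESENT);
}

/* Convert an old combined value into the split representation. */
struct pkt_model split_from_old(uint16_t old_tci)
{
	struct pkt_model p;

	p.vlan_present = old_tag_present(old_tci);
	p.vlan_tci = old_tag_get(old_tci);
	/* After the split the flag no longer travels inside the TCI. */
	assert(!(p.vlan_tci & OLD_VLAN_TAG_PRESENT));
	return p;
}
```

A JIT that loads the two pieces separately only needs to know where
each one lives, which is why the change can be made one architecture
at a time.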
Michał Mirosław (6):
net/skbuff: add macros for VLAN_PRESENT bit
net/bpf_jit: ARM: split VLAN_PRESENT bit handling from VLAN_TCI
net/bpf_jit: MIPS: split VLAN_PRESENT bit handling from VLAN_TCI
net/bpf_jit: PPC: split VLAN_PRESENT bit handling from VLAN_TCI
net/bpf_jit: SPARC: split VLAN_PRESENT bit handling from VLAN_TCI
net/bpf: split VLAN_PRESENT bit handling from VLAN_TCI
arch/arm/net/bpf_jit_32.c | 16 ++++++++++------
arch/mips/net/bpf_jit.c | 18 ++++++++++--------
arch/powerpc/net/bpf_jit_comp.c | 17 +++++++++--------
arch/sparc/net/bpf_jit_comp.c | 18 ++++++++++--------
include/linux/skbuff.h | 6 ++++++
net/core/filter.c | 19 +++++++++----------
6 files changed, 54 insertions(+), 40 deletions(-)
--
2.11.0
^ permalink raw reply
* Re: [PATCH net-next] ibmvnic: fix accelerated VLAN handling
From: kbuild test robot @ 2017-01-04 2:36 UTC (permalink / raw)
To: Michał Mirosław; +Cc: kbuild-all, netdev, Thomas Falcon, John Allen
In-Reply-To: <8e3c0fc229bbbc549e2529e3c174b7ef477b181c.1483487887.git.mirq-linux@rere.qmqm.pl>
[-- Attachment #1: Type: text/plain, Size: 2034 bytes --]
Hi Michał,
[auto build test ERROR on net-next/master]
url: https://github.com/0day-ci/linux/commits/Micha-Miros-aw/ibmvnic-fix-accelerated-VLAN-handling/20170104-095210
config: powerpc-allmodconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=powerpc
All errors (new ones prefixed by >>):
drivers/net/ethernet/ibm/ibmvnic.c: In function 'ibmvnic_xmit':
>> drivers/net/ethernet/ibm/ibmvnic.c:768:40: error: implicit declaration of function 'skb_vlan_tag_present' [-Werror=implicit-function-declaration]
if (adapter->vlan_header_insertion && skb_vlan_tag_present(skb)) {
^~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/ibm/ibmvnic.c: In function 'ibmvnic_poll':
>> drivers/net/ethernet/ibm/ibmvnic.c:968:4: error: implicit declaration of function '__vlan_hwaccel_put_tag' [-Werror=implicit-function-declaration]
__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), be16_to_cpu(next->rx_comp.vlan_tci));
^~~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
vim +/skb_vlan_tag_present +768 drivers/net/ethernet/ibm/ibmvnic.c
762 tx_crq.v1.flags1 = IBMVNIC_TX_COMP_NEEDED;
763 tx_crq.v1.correlator = cpu_to_be32(index);
764 tx_crq.v1.dma_reg = cpu_to_be16(tx_pool->long_term_buff.map_id);
765 tx_crq.v1.sge_len = cpu_to_be32(skb->len);
766 tx_crq.v1.ioba = cpu_to_be64(data_dma_addr);
767
> 768 if (adapter->vlan_header_insertion && skb_vlan_tag_present(skb)) {
769 tx_crq.v1.flags2 |= IBMVNIC_TX_VLAN_INSERT;
770 tx_crq.v1.vlan_id = cpu_to_be16(skb->vlan_tci);
771 }
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 51750 bytes --]
^ permalink raw reply
* Re: [PATCH net-next 1/2] net:dsa: fix dsa_dst_del_ds()
From: Andrew Lunn @ 2017-01-04 2:45 UTC (permalink / raw)
To: Volodymyr Bendiuga
Cc: vivien.didelot, f.fainelli, davem, netdev, volodymyr.bendiuga
In-Reply-To: <1481129585-9084-1-git-send-email-volodymyr.bendiuga@westermo.se>
On Wed, Dec 07, 2016 at 05:53:05PM +0100, Volodymyr Bendiuga wrote:
> When dsa_dst_del_ds() is called, do not free the whole list,
> instead, only decrement refcount for the switch tree. The list
> will be deleted in dsa_put_dst() if refcount is 0. Nothing
> really needs to be freed for dsa switch, therefore dsa_free_ds()
> is empty. kref_put() will print warning if dsa_free_ds() is not
> passed as a parameter to it.
This does not look correct. I would expect there to be some symmetry.
The dst gets allocated in _dsa_register_switch(), so it should be
freed somewhere in or under _dsa_unregister_switch(). As you say, it
can be freed from dsa_free_dst(), but that is not called from
_dsa_unregister_switch().
dsa_dst_add_ds() and dsa_dst_del_ds() currently look symmetric. Add
increments the ref count for the tree, del decrements it. When it
reaches zero, the tree is freed.
dsa_dst_del_ds() is called from _dsa_unregister_switch(), which gives
us the symmetry with _dsa_register_switch().
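
To make the symmetry concrete, here is a userspace model of it (an
illustration only, not the kernel code): adding a switch takes a
reference on the tree, deleting one drops it, and the tree is released
when the count reaches zero. tree_model, tree_add_switch() and
tree_del_switch() are hypothetical names, and a flag stands in for the
actual free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a switch tree with a plain reference count. */
struct tree_model {
	int refcount;
	bool freed;	/* set when the last reference is dropped */
};

/* Mirror of "add": one reference per switch attached to the tree. */
void tree_add_switch(struct tree_model *t)
{
	assert(t != NULL && !t->freed);
	t->refcount++;
}

/* Mirror of "del": drop the reference, release the tree at zero. */
void tree_del_switch(struct tree_model *t)
{
	assert(t != NULL && t->refcount > 0);
	if (--t->refcount == 0)
		t->freed = true;
}
```

With two switches registered, the first delete leaves the tree alive
and the second one releases it, so every add is paired with exactly
one del.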
What problem are you actually seeing? A double free? A use after free?
Thanks
Andrew
^ permalink raw reply
* Re: [PATCH net-next V2 3/3] tun: rx batching
From: Jason Wang @ 2017-01-04 3:03 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: netdev, virtualization, linux-kernel, kvm, mst
In-Reply-To: <20170103133303.GC14707@stefanha-x1.localdomain>
On 2017年01月03日 21:33, Stefan Hajnoczi wrote:
> On Wed, Dec 28, 2016 at 04:09:31PM +0800, Jason Wang wrote:
>> +static int tun_rx_batched(struct tun_file *tfile, struct sk_buff *skb,
>> + int more)
>> +{
>> + struct sk_buff_head *queue = &tfile->sk.sk_write_queue;
>> + struct sk_buff_head process_queue;
>> + int qlen;
>> + bool rcv = false;
>> +
>> + spin_lock(&queue->lock);
> Should this be spin_lock_bh()? Below and in tun_get_user() there are
> explicit local_bh_disable() calls so I guess BHs can interrupt us here
> and this would deadlock.
sk_write_queue is accessed only in this function, which runs in
process context, so there is no need for spin_lock_bh() here.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH net 9/9] virtio-net: XDP support for small buffers
From: Jason Wang @ 2017-01-04 3:05 UTC (permalink / raw)
To: John Fastabend, mst, virtualization, netdev, linux-kernel
Cc: john.r.fastabend
In-Reply-To: <586BD408.9030009@gmail.com>
On 2017年01月04日 00:40, John Fastabend wrote:
> On 17-01-02 10:16 PM, Jason Wang wrote:
>>
>> On 2017年01月03日 06:43, John Fastabend wrote:
>>> On 16-12-23 06:37 AM, Jason Wang wrote:
>>>> Commit f600b6905015 ("virtio_net: Add XDP support") leaves the case of
>>>> small receive buffer untouched. This will confuse the user who want to
>>>> set XDP but use small buffers. Other than forbid XDP in small buffer
>>>> mode, let's make it work. XDP then can only work at skb->data since
>>>> virtio-net create skbs during refill, this is sub optimal which could
>>>> be optimized in the future.
>>>>
>>>> Cc: John Fastabend <john.r.fastabend@intel.com>
>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>> ---
>>>> drivers/net/virtio_net.c | 112 ++++++++++++++++++++++++++++++++++++-----------
>>>> 1 file changed, 87 insertions(+), 25 deletions(-)
>>>>
>>> Hi Jason,
>>>
>>> I was doing some more testing on this what do you think about doing this
>>> so that free_unused_bufs() handles the buffer free with dev_kfree_skb()
>>> instead of put_page in small receive mode. Seems more correct to me.
>>>
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 783e842..27ff76c 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -1898,6 +1898,10 @@ static void free_receive_page_frags(struct virtnet_info
>>> *vi)
>>>
>>> static bool is_xdp_queue(struct virtnet_info *vi, int q)
>>> {
>>> + /* For small receive mode always use kfree_skb variants */
>>> + if (!vi->mergeable_rx_bufs)
>>> + return false;
>>> +
>>> if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs))
>>> return false;
>>> else if (q < vi->curr_queue_pairs)
>>>
>>>
>>> patch is untested just spotted doing code review.
>>>
>>> Thanks,
>>> John
>> We probably need a better name for this function.
>>
>> Acked-by: Jason Wang <jasowang@redhat.com>
>>
> How about is_xdp_raw_buffer_queue()?
>
> I'll submit a proper patch today.
Sounds good, thanks.
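
For reference, a userspace sketch of the check under the proposed name
(illustration only, not the submitted patch). vi_model is a
hypothetical stand-in for the driver state, and returning true for
queues in the XDP range is an assumption, because the quoted hunk ends
before that return statement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of the driver state used by the check. */
struct vi_model {
	bool mergeable_rx_bufs;
	int curr_queue_pairs;
	int xdp_queue_pairs;
};

/*
 * True only for queues that hold raw XDP buffers. In small receive
 * mode the buffers are skbs, so the answer is always false there.
 */
bool is_xdp_raw_buffer_queue(const struct vi_model *vi, int q)
{
	assert(vi != NULL);

	/* For small receive mode always use the kfree_skb variants. */
	if (!vi->mergeable_rx_bufs)
		return false;

	/* Queues below the XDP range are ordinary queues. */
	if (q < vi->curr_queue_pairs - vi->xdp_queue_pairs)
		return false;

	/* Assumed: the remaining active queues are the XDP ones. */
	return q < vi->curr_queue_pairs;
}
```

With four queue pairs of which two are XDP queues, queues 2 and 3
report true in mergeable mode and every queue reports false in small
receive mode.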
^ permalink raw reply
* Re: [net PATCH] net: virtio: cap mtu when XDP programs are running
From: Jason Wang @ 2017-01-04 3:16 UTC (permalink / raw)
To: John Fastabend, mst; +Cc: john.r.fastabend, netdev, alexei.starovoitov, daniel
In-Reply-To: <586BD5D5.6020100@gmail.com>
On 2017年01月04日 00:48, John Fastabend wrote:
> On 17-01-02 10:14 PM, Jason Wang wrote:
>>
>> On 2017年01月03日 06:30, John Fastabend wrote:
>>> XDP programs can not consume multiple pages so we cap the MTU to
>>> avoid this case. Virtio-net however only checks the MTU at XDP
>>> program load and does not block MTU changes after the program
>>> has loaded.
>>>
>>> This patch sets/clears the max_mtu value at XDP load/unload time.
>>>
>>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>>> ---
>>> drivers/net/virtio_net.c | 9 ++++++---
>>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 5deeda6..783e842 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -1699,6 +1699,9 @@ static void virtnet_init_settings(struct net_device *dev)
>>> .set_settings = virtnet_set_settings,
>>> };
>>> +#define MIN_MTU ETH_MIN_MTU
>>> +#define MAX_MTU ETH_MAX_MTU
>>> +
>>> static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
>>> {
>>> unsigned long int max_sz = PAGE_SIZE - sizeof(struct padded_vnet_hdr);
>>> @@ -1748,6 +1751,9 @@ static int virtnet_xdp_set(struct net_device *dev,
>>> struct bpf_prog *prog)
>>> virtnet_set_queues(vi, curr_qp);
>>> return PTR_ERR(prog);
>>> }
>>> + dev->max_mtu = max_sz;
>>> + } else {
>>> + dev->max_mtu = ETH_MAX_MTU;
>> Or use ETH_DATA_LEN here consider we only allocate a size of GOOD_PACKET_LEN for
>> each small buffer?
>>
>> Thanks
> OK so this logic is a bit too simple. When it resets the max_mtu I guess it
> needs to read the mtu via
>
> virtio_cread16(vdev, ...)
>
> or we may break the negotiated mtu.
Yes, this is a problem (even with ETH_MAX_MTU). We may need a way to
notify the device about the MTU in this case, which virtio does not
support at present.
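
One way to picture the concern, as a userspace sketch only: the limit
that was in force before the XDP program was attached is remembered
and restored on detach, instead of being overwritten with a constant.
This is an assumed alternative for illustration, not what the patch
does and not something this thread has settled on; dev_model,
xdp_attach() and xdp_detach() are hypothetical names, and the header
size is passed in rather than taken from the real structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the fields involved. */
struct dev_model {
	unsigned int max_mtu;
	unsigned int saved_max_mtu;	/* limit before XDP was attached */
	bool xdp_loaded;
};

/* Cap the MTU so that one packet always fits in a single page. */
void xdp_attach(struct dev_model *dev, size_t page_size, size_t hdr_size)
{
	assert(dev != NULL && page_size > hdr_size);

	/* Save the old limit only once, on the first attach. */
	if (!dev->xdp_loaded) {
		dev->saved_max_mtu = dev->max_mtu;
		dev->xdp_loaded = true;
	}
	dev->max_mtu = (unsigned int)(page_size - hdr_size);
}

/* Restore whatever limit applied before, rather than a fixed maximum. */
void xdp_detach(struct dev_model *dev)
{
	assert(dev != NULL);

	if (dev->xdp_loaded) {
		dev->max_mtu = dev->saved_max_mtu;
		dev->xdp_loaded = false;
	}
}
```

This only covers the driver-side bookkeeping. It does not address
telling the device about the new MTU.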
>
> As for capping it at GOOD_PACKET_LEN this has the nice benefit of avoiding any
> underestimates in EWMA predictions because it appears min estimates are capped
> at GOOD_PACKET_LEN via get_mergeable_buf_len().
There seems to be a misunderstanding here: I meant using
GOOD_PACKET_LEN only for small buffers (which do not use EWMA).
Thanks
>
> Thanks,
> John
>
^ permalink raw reply
* Re: [RFC PATCH] virtio_net: XDP support for adjust_head
From: Jason Wang @ 2017-01-04 3:21 UTC (permalink / raw)
To: John Fastabend, mst; +Cc: john.r.fastabend, netdev, alexei.starovoitov, daniel
In-Reply-To: <586BD734.7020105@gmail.com>
On 2017年01月04日 00:54, John Fastabend wrote:
>>> + /* Changing the headroom in buffers is a disruptive operation because
>>> + * existing buffers must be flushed and reallocated. This will happen
>>> + * when a xdp program is initially added or xdp is disabled by removing
>>> + * the xdp program.
>>> + */
>> We probably need to reset the device here, but maybe Michael has more ideas. And if
>> we do this, another interesting thing to do is to disable EWMA and always use a
>> single page for each packet, this could almost eliminate linearizing.
> Well with normal MTU 1500 size we should not hit the linearizing case right?
My reply may not have been clear: by 1500 I mean the small buffer case only.
Thanks
> The
> question is should we cap the MTU at GOOD_PACKET_LEN vs the current cap of
> (PAGE_SIZE - overhead).
>
^ permalink raw reply
* Re: [RFC PATCH] virtio_net: XDP support for adjust_head
From: Jason Wang @ 2017-01-04 3:22 UTC (permalink / raw)
To: John Fastabend, mst; +Cc: john.r.fastabend, netdev, alexei.starovoitov, daniel
In-Reply-To: <586BD7F3.60109@gmail.com>
On 2017年01月04日 00:57, John Fastabend wrote:
>>>> + /* Changing the headroom in buffers is a disruptive operation because
>>>> + * existing buffers must be flushed and reallocated. This will happen
>>>> + * when a xdp program is initially added or xdp is disabled by removing
>>>> + * the xdp program.
>>>> + */
>>> We probably need to reset the device here, but maybe Michael has more ideas. And if
>>> we do this, another interesting thing to do is to disable EWMA and always use a
>>> single page for each packet, this could almost eliminate linearizing.
>> Well with normal MTU 1500 size we should not hit the linearizing case right? The
>> question is should we cap the MTU at GOOD_PACKET_LEN vs the current cap of
>> (PAGE_SIZE - overhead).
> Sorry responding to my own post with a bit more detail. I don't really like
> going to a page for each packet because we end up with double the pages in use
> for the "normal" 1500 MTU case. We could make the xdp allocation scheme smarter
> and allocate a page per packet when MTU is greater than 2k instead of using the
> EWMA but I would push those types of things at net-next and live with the
> linearizing behavior for now or capping the MTU.
>
Yes, agree.
Thanks
^ permalink raw reply
* Re: [PATCH net-next 0/3] Preparation for VLAN_TAG_PRESENT cleanup
From: David Miller @ 2017-01-04 3:23 UTC (permalink / raw)
To: mirq-linux; +Cc: netdev
In-Reply-To: <cover.1483487429.git.mirq-linux@rere.qmqm.pl>
By submitting these in sections, but all at once, you are subverting
my requirement to submit only small, self-contained patch series.
Please do not do this.
The whole point is to not have a lot of patches in flight for one
thing for people to review at one time. Contributors can more easily
review small, digestible pieces.
Start over, and only submit small numbers of patches at one time. Do
not submit new patches until the first series has been processed.
Thank you.
^ permalink raw reply
* [PATCH net-next] net: phy: add extension of phy-mode for XLGMII
From: Jie Deng @ 2017-01-04 5:04 UTC (permalink / raw)
To: f.fainelli, davem; +Cc: netdev, linux-kernel, Jie Deng
The Synopsys DWC_xlgmac core provides a multiplexed 40-Gigabit
Media-Independent Interface (XLGMII, an IEEE 802.3 Clause 81
compliant reconciliation sub-layer) for communication with
the 100/50/40/25-Gigabit PHY and 10-Gigabit Media-Independent
Interface (XGMII, an IEEE 802.3 Clause 46 compliant reconciliation
sub-layer) for communication with the 10-Gigabit PHY.
Currently, there are only interface mode definitions for "xgmii".
This patch adds the definitions for the PHY layer to recognize
"xlgmii" as a valid PHY interface.
Signed-off-by: Jie Deng <jiedeng@synopsys.com>
---
include/linux/phy.h | 3 +++
1 file changed, 3 insertions(+)
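
Note for readers, not part of the commit: a userspace sketch of why
both the enum entry and the string case are needed. A name lookup that
walks the enum and compares against the mode strings can only find
"xlgmii" if both halves exist. The reduced enum and the functions
phy_mode_name() and phy_mode_from_name() are hypothetical stand-ins,
not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reduced, hypothetical copy of the mode list for illustration. */
typedef enum {
	MODE_QSGMII,
	MODE_TRGMII,
	MODE_XLGMII,
	MODE_MAX,
} phy_mode_model_t;

/* Enum to string, in the style of the switch statement in the patch. */
const char *phy_mode_name(phy_mode_model_t mode)
{
	switch (mode) {
	case MODE_QSGMII:
		return "qsgmii";
	case MODE_TRGMII:
		return "trgmii";
	case MODE_XLGMII:
		return "xlgmii";
	default:
		return "unknown";
	}
}

/* String to enum: walk every mode and compare names; -1 if not found. */
int phy_mode_from_name(const char *name)
{
	assert(name != NULL);

	for (int i = 0; i < MODE_MAX; i++)
		if (!strcmp(name, phy_mode_name((phy_mode_model_t)i)))
			return i;
	return -1;
}
```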
diff --git a/include/linux/phy.h b/include/linux/phy.h
index f7d95f6..7b6bfb3 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -82,6 +82,7 @@
PHY_INTERFACE_MODE_MOCA,
PHY_INTERFACE_MODE_QSGMII,
PHY_INTERFACE_MODE_TRGMII,
+ PHY_INTERFACE_MODE_XLGMII,
PHY_INTERFACE_MODE_MAX,
} phy_interface_t;
@@ -142,6 +143,8 @@ static inline const char *phy_modes(phy_interface_t interface)
return "qsgmii";
case PHY_INTERFACE_MODE_TRGMII:
return "trgmii";
+ case PHY_INTERFACE_MODE_XLGMII:
+ return "xlgmii";
default:
return "unknown";
}
--
1.9.1
^ permalink raw reply related
* Re: [PATCH net-next v2] net/sched: cls_matchall: Fix error path
From: Cong Wang @ 2017-01-04 6:33 UTC (permalink / raw)
To: Yotam Gigi
Cc: Jamal Hadi Salim, David Miller, eladr, Jiri Pirko,
Linux Kernel Network Developers
In-Reply-To: <1483464024-25745-1-git-send-email-yotamg@mellanox.com>
On Tue, Jan 3, 2017 at 9:20 AM, Yotam Gigi <yotamg@mellanox.com> wrote:
> Fix several error paths in matchall:
> - Release reference to actions in case the hardware fails offloading
> (relevant to skip_sw only)
> - Fix error path in case tcf_exts initialization/validation fail
>
> Fixes: bf3994d2ed31 ("net/sched: introduce Match-all classifier")
> Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Looks good,
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
^ permalink raw reply
* Re: ping: What's the purpose of rate limit?
From: Cong Wang @ 2017-01-04 6:45 UTC (permalink / raw)
To: Guan Xin; +Cc: Linux Kernel Network Developers
In-Reply-To: <CANeMGR4=vycxBU+LcrL2JXHt2FTYBkU2aLxfY5XRzHdws-aYag@mail.gmail.com>
Hi,
On Sat, Dec 31, 2016 at 4:04 AM, Guan Xin <guanx.bac@gmail.com> wrote:
> Hello,
>
> Excuse me, I searched but didn't find an answer --
>
> What's the purpose of setting a limit to the "-i" and "-l" parameters
> of ping for non-root users?
>
> It seems that this is only intended to prevent accidental misuse
> because these restrictions can be easily worked around by starting
> multiple instances or wrapping the program in a loop, e.g.,
> while true; do ping -l3 -c3 192.168.0.1; done
>
I think the purpose of "-i" is not rate limiting, but either heartbeating
or periodically monitoring latency.
Thanks.
^ permalink raw reply
* [RFC 2/4] net-next: dsa: Refactor DT probing of a switch port
From: John Crispin @ 2017-01-04 7:38 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Florian Fainelli, Vivien Didelot; +Cc: netdev
In-Reply-To: <1483515484-21793-1-git-send-email-john@phrozen.org>
From: Andrew Lunn <andrew@lunn.ch>
Move the DT probing of a switch port into a function of its own, since
it is about to get more complex. Add better error handling as well.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
---
net/dsa/dsa.c | 138 ++++++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 102 insertions(+), 36 deletions(-)
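
Note for readers, not part of the commit: a userspace sketch of the
classification step at the heart of the new per-port function. A port
labelled "cpu" is a CPU port, one labelled "dsa" is an inter-switch
link, and anything else is treated as a user port. The enum port_kind
and classify_port() are hypothetical names. The patch itself uses
three booleans.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical port categories used for illustration. */
enum port_kind {
	PORT_CPU,
	PORT_DSA,
	PORT_USER,
};

/* Classify a port by its "label" property. */
enum port_kind classify_port(const char *label)
{
	assert(label != NULL);

	if (!strcmp(label, "cpu"))
		return PORT_CPU;
	if (!strcmp(label, "dsa"))
		return PORT_DSA;

	/* Every other label ("lan1", "wan", ...) names a user port. */
	return PORT_USER;
}
```

Each category then gets its own probe helper, which is what keeps the
caller's loop short.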
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index 7899919..0e0621c 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -326,14 +326,10 @@ static int dsa_switch_setup_one(struct dsa_switch *ds, struct device *parent)
continue;
if (!strcmp(name, "cpu")) {
- if (dst->cpu_switch != -1) {
- netdev_err(dst->master_netdev,
- "multiple cpu ports?!\n");
- ret = -EINVAL;
- goto out;
+ if (dst->cpu_switch == -1) {
+ dst->cpu_switch = index;
+ dst->cpu_port = i;
}
- dst->cpu_switch = index;
- dst->cpu_port = i;
ds->cpu_port_mask |= 1 << i;
} else if (!strcmp(name, "dsa")) {
ds->dsa_port_mask |= 1 << i;
@@ -709,11 +705,15 @@ static void dsa_of_free_platform_data(struct dsa_platform_data *pd)
{
int i;
int port_index;
+ struct dsa_chip_data *cd;
for (i = 0; i < pd->nr_chips; i++) {
+ cd = &pd->chip[i];
port_index = 0;
while (port_index < DSA_MAX_PORTS) {
- kfree(pd->chip[i].port_names[port_index]);
+ kfree(cd->port_names[port_index]);
+ if (cd->port_ethernet[port_index])
+ dev_put(cd->port_ethernet[port_index]);
port_index++;
}
@@ -724,6 +724,94 @@ static void dsa_of_free_platform_data(struct dsa_platform_data *pd)
kfree(pd->chip);
}
+static int dsa_of_probe_cpu_port(struct dsa_chip_data *cd,
+ struct device_node *port,
+ int port_index)
+{
+ struct net_device *ethernet_dev;
+ struct device_node *ethernet;
+
+ ethernet = of_parse_phandle(port, "ethernet", 0);
+ if (ethernet) {
+ ethernet_dev = of_find_net_device_by_node(ethernet);
+ if (!ethernet_dev)
+ return -EPROBE_DEFER;
+
+ dev_hold(ethernet_dev);
+ cd->port_ethernet[port_index] = ethernet_dev;
+ }
+
+ return 0;
+}
+
+static int dsa_of_probe_user_port(struct dsa_chip_data *cd,
+ struct device_node *port,
+ int port_index)
+{
+ struct device_node *cpu_port;
+ const unsigned int *cpu_port_reg;
+ int cpu_port_index;
+
+ cpu_port = of_parse_phandle(port, "cpu", 0);
+ if (cpu_port) {
+ cpu_port_reg = of_get_property(cpu_port, "reg", NULL);
+ if (!cpu_port_reg)
+ return -EINVAL;
+ cpu_port_index = be32_to_cpup(cpu_port_reg);
+ cd->port_cpu[port_index] = cpu_port_index;
+ }
+
+ return 0;
+}
+
+static int dsa_of_probe_port(struct dsa_platform_data *pd,
+ struct dsa_chip_data *cd,
+ int chip_index,
+ struct device_node *port)
+{
+ bool is_cpu_port = false, is_dsa_port = false;
+ bool is_user_port = false;
+ const unsigned int *port_reg;
+ const char *port_name;
+ int port_index, ret = 0;
+
+ port_reg = of_get_property(port, "reg", NULL);
+ if (!port_reg)
+ return -EINVAL;
+
+ port_index = be32_to_cpup(port_reg);
+
+ port_name = of_get_property(port, "label", NULL);
+ if (!port_name)
+ return -EINVAL;
+
+ if (!strcmp(port_name, "cpu"))
+ is_cpu_port = true;
+ if (!strcmp(port_name, "dsa"))
+ is_dsa_port = true;
+ if (!is_cpu_port && !is_dsa_port)
+ is_user_port = true;
+
+ cd->port_dn[port_index] = port;
+
+ cd->port_names[port_index] = kstrdup(port_name,
+ GFP_KERNEL);
+ if (!cd->port_names[port_index])
+ return -ENOMEM;
+
+ if (is_dsa_port)
+ ret = dsa_of_probe_links(pd, cd, chip_index,
+ port_index, port, port_name);
+ if (is_cpu_port)
+ ret = dsa_of_probe_cpu_port(cd, port, port_index);
+ if (is_user_port)
+ ret = dsa_of_probe_user_port(cd, port, port_index);
+ if (ret)
+ return ret;
+
+ return port_index;
+}
+
static int dsa_of_probe(struct device *dev)
{
struct device_node *np = dev->of_node;
@@ -732,9 +820,8 @@ static int dsa_of_probe(struct device *dev)
struct net_device *ethernet_dev;
struct dsa_platform_data *pd;
struct dsa_chip_data *cd;
- const char *port_name;
- int chip_index, port_index;
- const unsigned int *sw_addr, *port_reg;
+ int chip_index;
+ const unsigned int *sw_addr;
u32 eeprom_len;
int ret;
@@ -821,32 +908,11 @@ static int dsa_of_probe(struct device *dev)
}
for_each_available_child_of_node(child, port) {
- port_reg = of_get_property(port, "reg", NULL);
- if (!port_reg)
- continue;
-
- port_index = be32_to_cpup(port_reg);
- if (port_index >= DSA_MAX_PORTS)
- break;
-
- port_name = of_get_property(port, "label", NULL);
- if (!port_name)
- continue;
-
- cd->port_dn[port_index] = port;
-
- cd->port_names[port_index] = kstrdup(port_name,
- GFP_KERNEL);
- if (!cd->port_names[port_index]) {
- ret = -ENOMEM;
+ ret = dsa_of_probe_port(pd, cd, chip_index, port);
+ if (ret < 0)
goto out_free_chip;
- }
-
- ret = dsa_of_probe_links(pd, cd, chip_index,
- port_index, port, port_name);
- if (ret)
- goto out_free_chip;
-
+ if (ret == DSA_MAX_PORTS)
+ break;
}
}
--
1.7.10.4
^ permalink raw reply related
* [RFC 4/4] net-next: dsa: qca8k: add support for multiple cpu ports
From: John Crispin @ 2017-01-04 7:38 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Florian Fainelli, Vivien Didelot
Cc: netdev, John Crispin
In-Reply-To: <1483515484-21793-1-git-send-email-john@phrozen.org>
With the subsystem now supporting multiple cpu ports, we need to make some
changes to the driver as it currently has the cpu port hardcoded as port0.
The patch moves the setup logic for the cpu port into one loop which
iterates over all cpu ports and sets them up. Additionally the bridge
join/leave logic needs a small fix to work with having a cpu port other
than 0.
Signed-off-by: John Crispin <john@phrozen.org>
---
drivers/net/dsa/qca8k.c | 135 +++++++++++++++++++++++++++--------------------
drivers/net/dsa/qca8k.h | 2 -
2 files changed, 78 insertions(+), 59 deletions(-)
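
Note for readers, not part of the commit: a userspace sketch of how
the flooding mask is assembled. The same port bitmap is placed into
four destination-port fields of one register, and the result is either
set or cleared. The shift values below (24, 16, 8, 0) and the 8-bit
field width are assumptions for illustration and should be checked
against qca8k.h; flooding_mask() and apply_flooding() are hypothetical
names that operate on a plain integer instead of a hardware register.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field positions; verify against the driver header. */
#define FW_CTRL1_IGMP_DP_S	24
#define FW_CTRL1_BC_DP_S	16
#define FW_CTRL1_MC_DP_S	8
#define FW_CTRL1_UC_DP_S	0

/* Replicate the port bitmap into all four destination-port fields. */
uint32_t flooding_mask(uint32_t port_mask)
{
	/* Assumed: each field is 8 bits wide. */
	assert(port_mask <= 0xffu);

	return (port_mask << FW_CTRL1_IGMP_DP_S) |
	       (port_mask << FW_CTRL1_BC_DP_S) |
	       (port_mask << FW_CTRL1_MC_DP_S) |
	       (port_mask << FW_CTRL1_UC_DP_S);
}

/* Return the new register value after enabling or disabling flooding. */
uint32_t apply_flooding(uint32_t reg, uint32_t port_mask, int enable)
{
	uint32_t mask = flooding_mask(port_mask);

	return enable ? (reg | mask) : (reg & ~mask);
}
```

Because the bits are ORed in per port, enabling flooding for a second
CPU port leaves the bits of the first one untouched, unlike a plain
register write.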
diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index b3df70d..1693388 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -486,11 +486,25 @@
qca8k_reg_clear(priv, QCA8K_REG_PORT_STATUS(port), mask);
}
+static void
+qca8k_setup_flooding(struct qca8k_priv *priv, int port_mask, int enable)
+{
+ u32 mask = (port_mask << QCA8K_GLOBAL_FW_CTRL1_IGMP_DP_S) |
+ (port_mask << QCA8K_GLOBAL_FW_CTRL1_BC_DP_S) |
+ (port_mask << QCA8K_GLOBAL_FW_CTRL1_MC_DP_S) |
+ (port_mask << QCA8K_GLOBAL_FW_CTRL1_UC_DP_S);
+
+ if (enable)
+ qca8k_reg_set(priv, QCA8K_REG_GLOBAL_FW_CTRL1, mask);
+ else
+ qca8k_reg_clear(priv, QCA8K_REG_GLOBAL_FW_CTRL1, mask);
+}
+
static int
qca8k_setup(struct dsa_switch *ds)
{
struct qca8k_priv *priv = (struct qca8k_priv *)ds->priv;
- int ret, i, phy_mode = -1;
+ int ret, i;
/* Make sure that port 0 is the cpu port */
if (!dsa_is_cpu_port(ds, 0)) {
@@ -506,29 +520,49 @@
if (IS_ERR(priv->regmap))
pr_warn("regmap initialization failed");
- /* Initialize CPU port pad mode (xMII type, delays...) */
- phy_mode = of_get_phy_mode(ds->ports[ds->dst->cpu_port].dn);
- if (phy_mode < 0) {
- pr_err("Can't find phy-mode for master device\n");
- return phy_mode;
- }
- ret = qca8k_set_pad_ctrl(priv, QCA8K_CPU_PORT, phy_mode);
- if (ret < 0)
- return ret;
-
- /* Enable CPU Port */
+ /* Tell the switch that port0 is a cpu port */
qca8k_reg_set(priv, QCA8K_REG_GLOBAL_FW_CTRL0,
QCA8K_GLOBAL_FW_CTRL0_CPU_PORT_EN);
- qca8k_port_set_status(priv, QCA8K_CPU_PORT, 1);
- priv->port_sts[QCA8K_CPU_PORT].enabled = 1;
/* Enable MIB counters */
qca8k_mib_init(priv);
- /* Enable QCA header mode on the cpu port */
- qca8k_write(priv, QCA8K_REG_PORT_HDR_CTRL(QCA8K_CPU_PORT),
- QCA8K_PORT_HDR_CTRL_ALL << QCA8K_PORT_HDR_CTRL_TX_S |
- QCA8K_PORT_HDR_CTRL_ALL << QCA8K_PORT_HDR_CTRL_RX_S);
+ /* Setup the cpu ports */
+ for (i = 0; i < DSA_MAX_PORTS; i++) {
+ struct net_device *netdev;
+ int phy_mode = -1;
+
+ if (!dsa_is_cpu_port(ds, i))
+ continue;
+
+ netdev = ds->dst->pd->chip->port_ethernet[i];
+ if (!netdev) {
+ pr_err("Can't find netdev for port%d\n", i);
+ return -ENODEV;
+ }
+
+ /* Initialize CPU port pad mode (xMII type, delays...) */
+ phy_mode = of_get_phy_mode(netdev->dev.parent->of_node);
+ if (phy_mode < 0) {
+ pr_err("Can't find phy-mode for port:%d\n", i);
+ return phy_mode;
+ }
+ ret = qca8k_set_pad_ctrl(priv, i, phy_mode);
+ if (ret < 0)
+ return ret;
+
+ /* Enable QCA header mode on the cpu port */
+ qca8k_write(priv,
+ QCA8K_REG_PORT_HDR_CTRL(i),
+ QCA8K_PORT_HDR_CTRL_ALL << QCA8K_PORT_HDR_CTRL_TX_S |
+ QCA8K_PORT_HDR_CTRL_ALL << QCA8K_PORT_HDR_CTRL_RX_S);
+
+ qca8k_port_set_status(priv, i, 1);
+ priv->port_sts[i].enabled = 1;
+
+ /* Forward all unknown frames to CPU port for Linux processing */
+ qca8k_setup_flooding(priv, BIT(i), 1);
+ }
/* Disable forwarding by default on all ports */
for (i = 0; i < QCA8K_NUM_PORTS; i++)
@@ -540,43 +574,30 @@
if (ds->enabled_port_mask & BIT(i))
qca8k_port_set_status(priv, i, 0);
- /* Forward all unknown frames to CPU port for Linux processing */
- qca8k_write(priv, QCA8K_REG_GLOBAL_FW_CTRL1,
- BIT(0) << QCA8K_GLOBAL_FW_CTRL1_IGMP_DP_S |
- BIT(0) << QCA8K_GLOBAL_FW_CTRL1_BC_DP_S |
- BIT(0) << QCA8K_GLOBAL_FW_CTRL1_MC_DP_S |
- BIT(0) << QCA8K_GLOBAL_FW_CTRL1_UC_DP_S);
-
- /* Setup connection between CPU port & user ports */
+ /* Setup user ports and connections to CPU ports */
for (i = 0; i < DSA_MAX_PORTS; i++) {
- /* CPU port gets connected to all user ports of the switch */
- if (dsa_is_cpu_port(ds, i)) {
- qca8k_rmw(priv, QCA8K_PORT_LOOKUP_CTRL(QCA8K_CPU_PORT),
- QCA8K_PORT_LOOKUP_MEMBER,
- ds->enabled_port_mask);
- }
+ int shift = 16 * (i % 2);
+ int cpu_port;
- /* Invividual user ports get connected to CPU port only */
- if (ds->enabled_port_mask & BIT(i)) {
- int shift = 16 * (i % 2);
-
- qca8k_rmw(priv, QCA8K_PORT_LOOKUP_CTRL(i),
- QCA8K_PORT_LOOKUP_MEMBER,
- BIT(QCA8K_CPU_PORT));
-
- /* Enable ARP Auto-learning by default */
- qca8k_reg_set(priv, QCA8K_PORT_LOOKUP_CTRL(i),
- QCA8K_PORT_LOOKUP_LEARN);
-
- /* For port based vlans to work we need to set the
- * default egress vid
- */
- qca8k_rmw(priv, QCA8K_EGRESS_VLAN(i),
- 0xffff << shift, 1 << shift);
- qca8k_write(priv, QCA8K_REG_PORT_VLAN_CTRL0(i),
- QCA8K_PORT_VLAN_CVID(1) |
- QCA8K_PORT_VLAN_SVID(1));
- }
+ if (!(ds->enabled_port_mask & BIT(i)))
+ continue;
+
+ cpu_port = dsa_port_upstream_port(ds, i);
+ qca8k_reg_set(priv, QCA8K_PORT_LOOKUP_CTRL(i), BIT(cpu_port));
+ qca8k_reg_set(priv, QCA8K_PORT_LOOKUP_CTRL(cpu_port), BIT(i));
+
+ /* Enable ARP Auto-learning by default */
+ qca8k_reg_set(priv, QCA8K_PORT_LOOKUP_CTRL(i),
+ QCA8K_PORT_LOOKUP_LEARN);
+
+ /* For port based vlans to work we need to set the
+ * default egress vid
+ */
+ qca8k_rmw(priv, QCA8K_EGRESS_VLAN(i),
+ 0xffff << shift, 1 << shift);
+ qca8k_write(priv, QCA8K_REG_PORT_VLAN_CTRL0(i),
+ QCA8K_PORT_VLAN_CVID(1) |
+ QCA8K_PORT_VLAN_SVID(1));
}
/* Flush the FDB table */
@@ -750,7 +771,7 @@
struct net_device *bridge)
{
struct qca8k_priv *priv = (struct qca8k_priv *)ds->priv;
- int port_mask = BIT(QCA8K_CPU_PORT);
+ int port_mask = 0;
int i;
priv->port_sts[port].bridge_dev = bridge;
@@ -768,8 +789,7 @@
port_mask |= BIT(i);
}
/* Add all other ports to this ports portvlan mask */
- qca8k_rmw(priv, QCA8K_PORT_LOOKUP_CTRL(port),
- QCA8K_PORT_LOOKUP_MEMBER, port_mask);
+ qca8k_reg_set(priv, QCA8K_PORT_LOOKUP_CTRL(port), port_mask);
return 0;
}
@@ -796,7 +816,8 @@
* this port
*/
qca8k_rmw(priv, QCA8K_PORT_LOOKUP_CTRL(port),
- QCA8K_PORT_LOOKUP_MEMBER, BIT(QCA8K_CPU_PORT));
+ QCA8K_PORT_LOOKUP_MEMBER,
+ BIT(dsa_port_upstream_port(ds, i)));
}
static int
diff --git a/drivers/net/dsa/qca8k.h b/drivers/net/dsa/qca8k.h
index 2014647..aca6abb 100644
--- a/drivers/net/dsa/qca8k.h
+++ b/drivers/net/dsa/qca8k.h
@@ -26,8 +26,6 @@
#define QCA8K_NUM_FDB_RECORDS 2048
-#define QCA8K_CPU_PORT 0
-
/* Global control registers */
#define QCA8K_REG_MASK_CTRL 0x000
#define QCA8K_MASK_CTRL_ID_M 0xff
--
1.7.10.4
^ permalink raw reply related
* [RFC 3/4] net-next: dsa: Add support for multiple cpu ports.
From: John Crispin @ 2017-01-04 7:38 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Florian Fainelli, Vivien Didelot
Cc: netdev, John Crispin
In-Reply-To: <1483515484-21793-1-git-send-email-john@phrozen.org>
From: Andrew Lunn <andrew@lunn.ch>
Some boards have two CPU interfaces connected to the switch, e.g. WiFi
access points, with 1 port labeled WAN, 4 ports labeled lan1-lan4, and
two ports connected to the SoC.
This patch extends DSA to allow both CPU ports to be used. The "cpu"
node in the DSA tree can now have a phandle to the host interface it
connects to. Each user port can have a phandle to a cpu port which
should be used for traffic between the port and the CPU. Thus simple
load sharing over the two CPU ports can be achieved.
Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
---
include/net/dsa.h | 21 ++++++++++++++++++++-
net/dsa/dsa2.c | 36 ++++++++++++++++++++++++++++++------
net/dsa/dsa_priv.h | 5 +++++
net/dsa/slave.c | 27 ++++++++++++++++-----------
4 files changed, 71 insertions(+), 18 deletions(-)
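
Note for readers, not part of the commit: a userspace sketch of the
per-port upstream selection described above. A user port that names a
CPU port uses it, and every other port falls back to the switch
default. The sketch deliberately differs from the patch in one
respect: it marks "no CPU port assigned" with -1 rather than 0, so
that CPU port 0 can also be selected explicitly. switch_model,
MODEL_MAX_PORTS, NO_CPU_PORT and port_upstream_port() are hypothetical
names.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_PORTS	12	/* arbitrary size for the model */
#define NO_CPU_PORT	(-1)	/* sentinel: port has no dedicated CPU port */

/* Hypothetical model of the per-switch data involved. */
struct switch_model {
	int port_cpu[MODEL_MAX_PORTS];	/* per-port CPU port or NO_CPU_PORT */
	int default_upstream;		/* switch-wide default */
};

/* Pick the upstream port for a given port, with fallback to the default. */
int port_upstream_port(const struct switch_model *sw, int port)
{
	assert(sw != NULL);
	assert(port >= 0 && port < MODEL_MAX_PORTS);

	if (sw->port_cpu[port] != NO_CPU_PORT)
		return sw->port_cpu[port];

	return sw->default_upstream;
}
```

Assigning half of the user ports to each CPU port in this table gives
the simple load sharing mentioned in the description.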
diff --git a/include/net/dsa.h b/include/net/dsa.h
index b122196..f68180b 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -60,6 +60,8 @@ struct dsa_chip_data {
*/
char *port_names[DSA_MAX_PORTS];
struct device_node *port_dn[DSA_MAX_PORTS];
+ struct net_device *port_ethernet[DSA_MAX_PORTS];
+ int port_cpu[DSA_MAX_PORTS];
/*
* An array of which element [a] indicates which port on this
@@ -204,7 +206,7 @@ struct dsa_switch {
static inline bool dsa_is_cpu_port(struct dsa_switch *ds, int p)
{
- return !!(ds->index == ds->dst->cpu_switch && p == ds->dst->cpu_port);
+ return !!(ds->cpu_port_mask & (1 << p));
}
static inline bool dsa_is_dsa_port(struct dsa_switch *ds, int p)
@@ -217,6 +219,11 @@ static inline bool dsa_is_port_initialized(struct dsa_switch *ds, int p)
return ds->enabled_port_mask & (1 << p) && ds->ports[p].netdev;
}
+static inline bool dsa_is_upstream_port(struct dsa_switch *ds, int p)
+{
+ return dsa_is_cpu_port(ds, p) || dsa_is_dsa_port(ds, p);
+}
+
static inline u8 dsa_upstream_port(struct dsa_switch *ds)
{
struct dsa_switch_tree *dst = ds->dst;
@@ -233,6 +240,18 @@ static inline u8 dsa_upstream_port(struct dsa_switch *ds)
return ds->rtable[dst->cpu_switch];
}
+static inline u8 dsa_port_upstream_port(struct dsa_switch *ds, int port)
+{
+ /*
+ * If this port has a specific upstream cpu port, use it,
+ * otherwise use the switch default.
+ */
+ if (ds->cd->port_cpu[port])
+ return ds->cd->port_cpu[port];
+ else
+ return dsa_upstream_port(ds);
+}
+
struct switchdev_trans;
struct switchdev_obj;
struct switchdev_obj_port_fdb;
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 5fff951..1763cd4 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -258,7 +258,7 @@ static void dsa_cpu_port_unapply(struct device_node *port, u32 index,
{
dsa_cpu_dsa_destroy(port);
ds->cpu_port_mask &= ~BIT(index);
-
+ dev_put(ds->cd->port_ethernet[index]);
}
static int dsa_user_port_apply(struct device_node *port, u32 index,
@@ -475,6 +475,28 @@ static int dsa_cpu_parse(struct device_node *port, u32 index,
dst->rcv = dst->tag_ops->rcv;
+ dev_hold(ethernet_dev);
+ ds->cd->port_ethernet[index] = ethernet_dev;
+
+ return 0;
+}
+
+static int dsa_user_parse(struct device_node *port, u32 index,
+ struct dsa_switch *ds)
+{
+ struct device_node *cpu_port;
+ const unsigned int *cpu_port_reg;
+ int cpu_port_index;
+
+ cpu_port = of_parse_phandle(port, "cpu", 0);
+ if (cpu_port) {
+ cpu_port_reg = of_get_property(cpu_port, "reg", NULL);
+ if (!cpu_port_reg)
+ return -EINVAL;
+ cpu_port_index = be32_to_cpup(cpu_port_reg);
+ ds->cd->port_cpu[index] = cpu_port_index;
+ }
+
return 0;
}
@@ -482,18 +504,20 @@ static int dsa_ds_parse(struct dsa_switch_tree *dst, struct dsa_switch *ds)
{
struct device_node *port;
u32 index;
- int err;
+ int err = 0;
for (index = 0; index < DSA_MAX_PORTS; index++) {
port = ds->ports[index].dn;
if (!port)
continue;
- if (dsa_port_is_cpu(port)) {
+ if (dsa_port_is_cpu(port))
err = dsa_cpu_parse(port, index, dst, ds);
- if (err)
- return err;
- }
+ else if (!dsa_port_is_dsa(port))
+ err = dsa_user_parse(port, index, ds);
+
+ if (err)
+ return err;
}
pr_info("DSA: switch %d %d parsed\n", dst->tree, ds->index);
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 6cfd738..7e1e62c 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -24,6 +24,11 @@ struct dsa_device_ops {
struct dsa_slave_priv {
struct sk_buff * (*xmit)(struct sk_buff *skb,
struct net_device *dev);
+ /*
+ * Which host device do we use to send packets to the switch
+ * for this port.
+ */
+ struct net_device *master;
/*
* Which switch this port is a part of, and the port index
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index ffd91969..260d4a9 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -61,7 +61,7 @@ static int dsa_slave_get_iflink(const struct net_device *dev)
{
struct dsa_slave_priv *p = netdev_priv(dev);
- return p->parent->dst->master_netdev->ifindex;
+ return p->master->ifindex;
}
static inline bool dsa_port_is_bridged(struct dsa_slave_priv *p)
@@ -96,7 +96,7 @@ static void dsa_port_set_stp_state(struct dsa_switch *ds, int port, u8 state)
static int dsa_slave_open(struct net_device *dev)
{
struct dsa_slave_priv *p = netdev_priv(dev);
- struct net_device *master = p->parent->dst->master_netdev;
+ struct net_device *master = p->master;
struct dsa_switch *ds = p->parent;
u8 stp_state = dsa_port_is_bridged(p) ?
BR_STATE_BLOCKING : BR_STATE_FORWARDING;
@@ -151,7 +151,7 @@ static int dsa_slave_open(struct net_device *dev)
static int dsa_slave_close(struct net_device *dev)
{
struct dsa_slave_priv *p = netdev_priv(dev);
- struct net_device *master = p->parent->dst->master_netdev;
+ struct net_device *master = p->master;
struct dsa_switch *ds = p->parent;
if (p->phy)
@@ -178,7 +178,7 @@ static int dsa_slave_close(struct net_device *dev)
static void dsa_slave_change_rx_flags(struct net_device *dev, int change)
{
struct dsa_slave_priv *p = netdev_priv(dev);
- struct net_device *master = p->parent->dst->master_netdev;
+ struct net_device *master = p->master;
if (change & IFF_ALLMULTI)
dev_set_allmulti(master, dev->flags & IFF_ALLMULTI ? 1 : -1);
@@ -189,7 +189,7 @@ static void dsa_slave_change_rx_flags(struct net_device *dev, int change)
static void dsa_slave_set_rx_mode(struct net_device *dev)
{
struct dsa_slave_priv *p = netdev_priv(dev);
- struct net_device *master = p->parent->dst->master_netdev;
+ struct net_device *master = p->master;
dev_mc_sync(master, dev);
dev_uc_sync(master, dev);
@@ -198,7 +198,7 @@ static void dsa_slave_set_rx_mode(struct net_device *dev)
static int dsa_slave_set_mac_address(struct net_device *dev, void *a)
{
struct dsa_slave_priv *p = netdev_priv(dev);
- struct net_device *master = p->parent->dst->master_netdev;
+ struct net_device *master = p->master;
struct sockaddr *addr = a;
int err;
@@ -633,7 +633,7 @@ static netdev_tx_t dsa_slave_xmit(struct sk_buff *skb, struct net_device *dev)
/* Queue the SKB for transmission on the parent interface, but
* do not modify its EtherType
*/
- nskb->dev = p->parent->dst->master_netdev;
+ nskb->dev = p->master;
dev_queue_xmit(nskb);
return NETDEV_TX_OK;
@@ -947,7 +947,7 @@ static int dsa_slave_netpoll_setup(struct net_device *dev,
{
struct dsa_slave_priv *p = netdev_priv(dev);
struct dsa_switch *ds = p->parent;
- struct net_device *master = ds->dst->master_netdev;
+ struct net_device *master = p->master;
struct netpoll *netpoll;
int err = 0;
@@ -1247,12 +1247,16 @@ int dsa_slave_create(struct dsa_switch *ds, struct device *parent,
struct net_device *master;
struct net_device *slave_dev;
struct dsa_slave_priv *p;
+ int port_cpu = ds->cd->port_cpu[port];
int ret;
- master = ds->dst->master_netdev;
- if (ds->master_netdev)
+ if (port_cpu && ds->cd->port_ethernet[port_cpu])
+ master = ds->cd->port_ethernet[port_cpu];
+ else if (ds->master_netdev)
master = ds->master_netdev;
-
+ else
+ master = ds->dst->master_netdev;
+ master->dsa_ptr = (void *)ds->dst;
slave_dev = alloc_netdev(sizeof(struct dsa_slave_priv), name,
NET_NAME_UNKNOWN, ether_setup);
if (slave_dev == NULL)
@@ -1279,6 +1283,7 @@ int dsa_slave_create(struct dsa_switch *ds, struct device *parent,
p->parent = ds;
p->port = port;
p->xmit = dst->tag_ops->xmit;
+ p->master = master;
p->old_pause = -1;
p->old_link = -1;
--
1.7.10.4
^ permalink raw reply related
* [RFC 0/4] net-next: dsa: add support for multiple cpu ports
From: John Crispin @ 2017-01-04 7:38 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Florian Fainelli, Vivien Didelot
Cc: netdev, John Crispin
This series is based on work from Andrew. I have rebased his patches to
work on the latest kernel. The main problem is probably the fact that the
cpu port to user port mapping happens inside the devicetree.
Andrew Lunn (3):
Documentation: devicetree: add multiple cpu port DSA binding
net-next: dsa: Refactor DT probing of a switch port
net-next: dsa: Add support for multiple cpu ports.
John Crispin (1):
net-next: dsa: qca8k: add support for multiple cpu ports
Documentation/devicetree/bindings/net/dsa/dsa.txt | 67 +++++++++-
drivers/net/dsa/qca8k.c | 135 +++++++++++---------
drivers/net/dsa/qca8k.h | 2 -
include/net/dsa.h | 21 +++-
net/dsa/dsa.c | 138 +++++++++++++++------
net/dsa/dsa2.c | 36 +++++-
net/dsa/dsa_priv.h | 5 +
net/dsa/slave.c | 27 ++--
8 files changed, 317 insertions(+), 114 deletions(-)
--
1.7.10.4
^ permalink raw reply
* [RFC 1/4] Documentation: devicetree: add multiple cpu port DSA binding
From: John Crispin @ 2017-01-04 7:38 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Florian Fainelli, Vivien Didelot
Cc: netdev, Rob Herring, devicetree
In-Reply-To: <1483515484-21793-1-git-send-email-john@phrozen.org>
From: Andrew Lunn <andrew@lunn.ch>
Extend the DSA binding documentation, adding the new properties required
when there is more than one CPU port attached to the switch.
Cc: Rob Herring <robh+dt@kernel.org>
Cc: devicetree@vger.kernel.org
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
---
Documentation/devicetree/bindings/net/dsa/dsa.txt | 67 ++++++++++++++++++++-
1 file changed, 66 insertions(+), 1 deletion(-)
diff --git a/Documentation/devicetree/bindings/net/dsa/dsa.txt b/Documentation/devicetree/bindings/net/dsa/dsa.txt
index a4a570f..fc901cf 100644
--- a/Documentation/devicetree/bindings/net/dsa/dsa.txt
+++ b/Documentation/devicetree/bindings/net/dsa/dsa.txt
@@ -337,13 +337,25 @@ Optional property:
This mii-bus will be used in preference to the
global dsa,mii-bus defined above, for this switch.
+- ethernet : Optional for "cpu" ports. A phandle to an ethernet
+ device which will be used by this CPU port for
+ passing packets to/from the host. If not present,
+ the port will use the "dsa,ethernet" property
+ defined above.
+
+- cpu : Optional for non "cpu"/"dsa" ports. A phandle to a
+ "cpu" port, which will be used for passing packets
+ from this port to the host. If not present, the first
+ "cpu" port will be used.
+
+
Optional subnodes:
- fixed-link : Fixed-link subnode describing a link to a non-MDIO
managed entity. See
Documentation/devicetree/bindings/net/fixed-link.txt
for details.
-Example:
+Examples:
dsa@0 {
compatible = "marvell,dsa";
@@ -416,3 +428,56 @@ Example:
};
};
};
+
+ dsa@1 {
+ compatible = "marvell,dsa";
+ #address-cells = <2>;
+ #size-cells = <0>;
+
+ dsa,ethernet = <&eth0port>;
+ dsa,mii-bus = <&mdio>;
+
+ switch@0 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ reg = <0 0>; /* MDIO address 0, switch 0 in tree */
+
+ port@0 {
+ reg = <0>;
+ label = "lan4";
+ };
+
+ port@1 {
+ reg = <1>;
+ label = "lan3";
+ cpu = <&cpu1>;
+ };
+
+ port@2 {
+ reg = <2>;
+ label = "lan2";
+ };
+
+ port@3 {
+ reg = <3>;
+ label = "lan1";
+ cpu = <&cpu1>;
+ };
+
+ port@4 {
+ reg = <4>;
+ label = "wan";
+ };
+
+ port@5 {
+ reg = <5>;
+ label = "cpu";
+ };
+
+ cpu1: port@6 {
+ reg = <6>;
+ label = "cpu";
+ ethernet = <&eth1port>;
+ };
+ };
+ };
--
1.7.10.4
^ permalink raw reply related
* [PATCH RFC ipsec-next 5/5] esp: Add a software GRO codepath
From: Steffen Klassert @ 2017-01-04 8:23 UTC (permalink / raw)
To: David Miller, netdev; +Cc: Steffen Klassert, Sowmini Varadhan, Ilan Tayari
In-Reply-To: <1483518230-6777-1-git-send-email-steffen.klassert@secunet.com>
This patch adds GRO callbacks for ESP on ipv4 and ipv6.
In case the GRO layer detects an ESP packet, the
esp{4,6}_gro_receive() function calls the xfrm input layer
which decapsulates the packet and reinjects it into
layer 2 by calling netif_rx().
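
As a rough sketch of the calling convention this patch relies on: the GRO handlers pass an encap_type of -2 into the xfrm input path, -1 keeps its existing meaning of async resumption, and any value below -1 is treated as a GRO call. The enum and function names below are hypothetical and only model that three-way split in userspace.

```c
/* Hypothetical labels for the three ways the input path can be entered. */
enum input_kind {
	INPUT_NORMAL,		/* encap_type >= 0: regular receive path */
	INPUT_ASYNC_RESUME,	/* encap_type == -1: resume after async crypto */
	INPUT_GRO,		/* encap_type < -1: called from a GRO handler */
};

/* Classify an encap_type value the way the patched input path does. */
static enum input_kind classify_encap_type(int encap_type)
{
	if (encap_type == -1)
		return INPUT_ASYNC_RESUME;
	if (encap_type < -1)
		return INPUT_GRO;
	return INPUT_NORMAL;
}
```

Callers that cannot handle the GRO case, such as the vti input handler touched below, can test for `encap_type < -1` and bail out with -EOPNOTSUPP.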
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
net/ipv4/Kconfig | 6 +++
net/ipv4/Makefile | 1 +
net/ipv4/esp4.c | 5 +++
net/ipv4/esp4_offload.c | 94 +++++++++++++++++++++++++++++++++++++++
net/ipv4/ip_vti.c | 4 ++
net/ipv4/xfrm4_input.c | 6 +++
net/ipv4/xfrm4_mode_transport.c | 3 +-
net/ipv6/Kconfig | 6 +++
net/ipv6/Makefile | 1 +
net/ipv6/esp6.c | 5 +++
net/ipv6/esp6_offload.c | 98 +++++++++++++++++++++++++++++++++++++++++
net/ipv6/xfrm6_input.c | 5 +++
net/xfrm/xfrm_input.c | 27 +++++++++---
13 files changed, 253 insertions(+), 8 deletions(-)
create mode 100644 net/ipv4/esp4_offload.c
create mode 100644 net/ipv6/esp6_offload.c
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 6e7baaf..27df99f 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -360,6 +360,12 @@ config INET_ESP
If unsure, say Y.
+config INET_ESP_OFFLOAD
+ tristate "IP: ESP transformation offload"
+ depends on INET_ESP
+ select XFRM_OFFLOAD
+ default n
+
config INET_IPCOMP
tristate "IP: IPComp transformation"
select INET_XFRM_TUNNEL
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 48af58a..c6d4238 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -29,6 +29,7 @@ obj-$(CONFIG_NET_IPVTI) += ip_vti.o
obj-$(CONFIG_SYN_COOKIES) += syncookies.o
obj-$(CONFIG_INET_AH) += ah4.o
obj-$(CONFIG_INET_ESP) += esp4.o
+obj-$(CONFIG_INET_ESP_OFFLOAD) += esp4_offload.o
obj-$(CONFIG_INET_IPCOMP) += ipcomp.o
obj-$(CONFIG_INET_XFRM_TUNNEL) += xfrm4_tunnel.o
obj-$(CONFIG_INET_XFRM_MODE_BEET) += xfrm4_mode_beet.o
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index b1e2444..4563aeb 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -629,6 +629,11 @@ static int esp_input(struct xfrm_state *x, struct sk_buff *skb)
nfrags = 1;
goto skip_cow;
+ } else if (skb_xfrm_gro(skb)) {
+ nfrags = skb_shinfo(skb)->nr_frags;
+ nfrags++;
+
+ goto skip_cow;
} else if (!skb_has_frag_list(skb)) {
nfrags = skb_shinfo(skb)->nr_frags;
nfrags++;
diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c
new file mode 100644
index 0000000..7277d15
--- /dev/null
+++ b/net/ipv4/esp4_offload.c
@@ -0,0 +1,94 @@
+/*
+ * IPV4 GSO/GRO offload support
+ * Linux INET implementation
+ *
+ * Copyright (C) 2016 secunet Security Networks AG
+ * Author: Steffen Klassert <steffen.klassert@secunet.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * ESP GRO support
+ */
+
+#include <linux/skbuff.h>
+#include <linux/init.h>
+#include <net/protocol.h>
+#include <crypto/aead.h>
+#include <crypto/authenc.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <net/ip.h>
+#include <net/xfrm.h>
+#include <net/esp.h>
+#include <linux/scatterlist.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <net/udp.h>
+
+static struct sk_buff **esp4_gro_receive(struct sk_buff **head,
+ struct sk_buff *skb)
+{
+ int err;
+ if (NAPI_GRO_CB(skb)->flush)
+ goto out;
+
+ skb_pull(skb, skb_gro_offset(skb));
+ skb->xfrm_gro = 1;
+
+ err = xfrm4_rcv_encap(skb, IPPROTO_ESP, 0, -2);
+ if (err == -EOPNOTSUPP) {
+ skb_push(skb, skb_gro_offset(skb));
+ NAPI_GRO_CB(skb)->same_flow = 0;
+ NAPI_GRO_CB(skb)->flush = 1;
+ skb->xfrm_gro = 0;
+ goto out;
+ }
+
+ return ERR_PTR(-EINPROGRESS);
+out:
+ return NULL;
+}
+
+static int esp4_gro_complete(struct sk_buff *skb, int nhoff)
+{
+ struct xfrm_state *x = xfrm_input_state(skb);
+ struct crypto_aead *aead = x->data;
+ struct ip_esp_hdr *esph = (struct ip_esp_hdr *)(skb->data + nhoff);
+ struct packet_offload *ptype;
+ int err = -ENOENT;
+ __be16 type = skb->protocol;
+
+ rcu_read_lock();
+ ptype = gro_find_complete_by_type(type);
+ if (ptype != NULL)
+ err = ptype->callbacks.gro_complete(skb, nhoff + sizeof(*esph) + crypto_aead_ivsize(aead));
+
+ rcu_read_unlock();
+
+ return err;
+}
+
+static const struct net_offload esp4_offload = {
+ .callbacks = {
+ .gro_receive = esp4_gro_receive,
+ .gro_complete = esp4_gro_complete,
+ },
+};
+
+static int __init esp4_offload_init(void)
+{
+ return inet_add_offload(&esp4_offload, IPPROTO_ESP);
+}
+
+static void __exit esp4_offload_exit(void)
+{
+ inet_del_offload(&esp4_offload, IPPROTO_ESP);
+}
+
+module_init(esp4_offload_init);
+module_exit(esp4_offload_exit);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Steffen Klassert <steffen.klassert@secunet.com>");
diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index 8b14f14..f275feb 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -60,6 +60,10 @@ static int vti_input(struct sk_buff *skb, int nexthdr, __be32 spi,
tunnel = ip_tunnel_lookup(itn, skb->dev->ifindex, TUNNEL_NO_KEY,
iph->saddr, iph->daddr, 0);
if (tunnel) {
+ /* encap_type < -1 indicates a GRO call, we don't support this. */
+ if (encap_type < -1)
+ return -EOPNOTSUPP;
+
if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
goto drop;
diff --git a/net/ipv4/xfrm4_input.c b/net/ipv4/xfrm4_input.c
index 62e1e72..6fe1a68 100644
--- a/net/ipv4/xfrm4_input.c
+++ b/net/ipv4/xfrm4_input.c
@@ -53,6 +53,12 @@ int xfrm4_transport_finish(struct sk_buff *skb, int async)
iph->tot_len = htons(skb->len);
ip_send_check(iph);
+
+ if (skb_xfrm_gro(skb)) {
+ skb_mac_header_rebuild(skb);
+ return 0;
+ }
+
NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
dev_net(skb->dev), NULL, skb, skb->dev, NULL,
xfrm4_rcv_encap_finish);
diff --git a/net/ipv4/xfrm4_mode_transport.c b/net/ipv4/xfrm4_mode_transport.c
index fd840c7..7b447f3 100644
--- a/net/ipv4/xfrm4_mode_transport.c
+++ b/net/ipv4/xfrm4_mode_transport.c
@@ -50,7 +50,8 @@ static int xfrm4_transport_input(struct xfrm_state *x, struct sk_buff *skb)
skb->network_header = skb->transport_header;
}
ip_hdr(skb)->tot_len = htons(skb->len + ihl);
- skb_reset_transport_header(skb);
+ if (!skb_xfrm_gro(skb))
+ skb_reset_transport_header(skb);
return 0;
}
diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index ec1267e..8b0713b 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -75,6 +75,12 @@ config INET6_ESP
If unsure, say Y.
+config INET6_ESP_OFFLOAD
+ tristate "IPv6: ESP transformation offload"
+ depends on INET6_ESP
+ select XFRM_OFFLOAD
+ default n
+
config INET6_IPCOMP
tristate "IPv6: IPComp transformation"
select INET6_XFRM_TUNNEL
diff --git a/net/ipv6/Makefile b/net/ipv6/Makefile
index a9e9fec..217e9ff 100644
--- a/net/ipv6/Makefile
+++ b/net/ipv6/Makefile
@@ -30,6 +30,7 @@ ipv6-objs += $(ipv6-y)
obj-$(CONFIG_INET6_AH) += ah6.o
obj-$(CONFIG_INET6_ESP) += esp6.o
+obj-$(CONFIG_INET6_ESP_OFFLOAD) += esp6_offload.o
obj-$(CONFIG_INET6_IPCOMP) += ipcomp6.o
obj-$(CONFIG_INET6_XFRM_TUNNEL) += xfrm6_tunnel.o
obj-$(CONFIG_INET6_TUNNEL) += tunnel6.o
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index ff54faa..1cca0f1 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -571,6 +571,11 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff *skb)
nfrags = 1;
goto skip_cow;
+ } else if (skb_xfrm_gro(skb)) {
+ nfrags = skb_shinfo(skb)->nr_frags;
+ nfrags++;
+
+ goto skip_cow;
} else if (!skb_has_frag_list(skb)) {
nfrags = skb_shinfo(skb)->nr_frags;
nfrags++;
diff --git a/net/ipv6/esp6_offload.c b/net/ipv6/esp6_offload.c
new file mode 100644
index 0000000..e0006cf
--- /dev/null
+++ b/net/ipv6/esp6_offload.c
@@ -0,0 +1,98 @@
+/*
+ * IPV6 GSO/GRO offload support
+ * Linux INET implementation
+ *
+ * Copyright (C) 2016 secunet Security Networks AG
+ * Author: Steffen Klassert <steffen.klassert@secunet.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * ESP GRO support
+ */
+
+#include <linux/skbuff.h>
+#include <linux/init.h>
+#include <net/protocol.h>
+#include <crypto/aead.h>
+#include <crypto/authenc.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <net/ip.h>
+#include <net/xfrm.h>
+#include <net/esp.h>
+#include <linux/scatterlist.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <net/ip6_route.h>
+#include <net/ipv6.h>
+#include <linux/icmpv6.h>
+
+static struct sk_buff **esp6_gro_receive(struct sk_buff **head,
+ struct sk_buff *skb)
+{
+ int err;
+ if (NAPI_GRO_CB(skb)->flush)
+ goto out;
+
+ skb_pull(skb, skb_gro_offset(skb));
+ skb->xfrm_gro = 1;
+
+ XFRM_SPI_SKB_CB(skb)->family = AF_INET6;
+ XFRM_SPI_SKB_CB(skb)->daddroff = offsetof(struct ipv6hdr, daddr);
+ err = xfrm_input(skb, IPPROTO_ESP, 0, -2);
+ if (err == -EOPNOTSUPP) {
+ skb_push(skb, skb_gro_offset(skb));
+ NAPI_GRO_CB(skb)->same_flow = 0;
+ NAPI_GRO_CB(skb)->flush = 1;
+ skb->xfrm_gro = 0;
+ goto out;
+ }
+
+ return ERR_PTR(-EINPROGRESS);
+out:
+ return NULL;
+}
+
+static int esp6_gro_complete(struct sk_buff *skb, int nhoff)
+{
+ struct xfrm_state *x = xfrm_input_state(skb);
+ struct crypto_aead *aead = x->data;
+ struct ip_esp_hdr *esph = (struct ip_esp_hdr *)(skb->data + nhoff);
+ struct packet_offload *ptype;
+ int err = -ENOENT;
+ __be16 type = skb->protocol;
+
+ rcu_read_lock();
+ ptype = gro_find_complete_by_type(type);
+ if (ptype != NULL)
+ err = ptype->callbacks.gro_complete(skb, nhoff + sizeof(*esph) + crypto_aead_ivsize(aead));
+
+ rcu_read_unlock();
+
+ return err;
+}
+
+static const struct net_offload esp6_offload = {
+ .callbacks = {
+ .gro_receive = esp6_gro_receive,
+ .gro_complete = esp6_gro_complete,
+ },
+};
+
+static int __init esp6_offload_init(void)
+{
+ return inet6_add_offload(&esp6_offload, IPPROTO_ESP);
+}
+
+static void __exit esp6_offload_exit(void)
+{
+ inet6_del_offload(&esp6_offload, IPPROTO_ESP);
+}
+
+module_init(esp6_offload_init);
+module_exit(esp6_offload_exit);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Steffen Klassert <steffen.klassert@secunet.com>");
diff --git a/net/ipv6/xfrm6_input.c b/net/ipv6/xfrm6_input.c
index b578956..8ed16c8 100644
--- a/net/ipv6/xfrm6_input.c
+++ b/net/ipv6/xfrm6_input.c
@@ -44,6 +44,11 @@ int xfrm6_transport_finish(struct sk_buff *skb, int async)
ipv6_hdr(skb)->payload_len = htons(skb->len);
__skb_push(skb, skb->data - skb_network_header(skb));
+ if (skb_xfrm_gro(skb)) {
+ skb_mac_header_rebuild(skb);
+ return -1;
+ }
+
NF_HOOK(NFPROTO_IPV6, NF_INET_PRE_ROUTING,
dev_net(skb->dev), NULL, skb, skb->dev, NULL,
ip6_rcv_finish);
diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 6e3f025..893263e 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -193,13 +193,17 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
int decaps = 0;
int async = 0;
- /* A negative encap_type indicates async resumption. */
if (encap_type < 0) {
- async = 1;
- x = xfrm_input_state(skb);
- seq = XFRM_SKB_CB(skb)->seq.input.low;
- family = x->outer_mode->afinfo->family;
- goto resume;
+ /* An encap_type of -1 indicates async resumption. */
+ if (encap_type == -1) {
+ async = 1;
+ x = xfrm_input_state(skb);
+ seq = XFRM_SKB_CB(skb)->seq.input.low;
+ family = x->outer_mode->afinfo->family;
+ goto resume;
+ }
+ /* encap_type < -1 indicates a GRO call. */
+ encap_type = 0;
}
daddr = (xfrm_address_t *)(skb_network_header(skb) +
@@ -374,7 +378,16 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
netif_rx(skb);
return 0;
} else {
- return x->inner_mode->afinfo->transport_finish(skb, async);
+ int xfrm_gro = skb_xfrm_gro(skb);
+
+ err = x->inner_mode->afinfo->transport_finish(skb, async);
+ if (xfrm_gro) {
+ skb_dst_drop(skb);
+ netif_rx(skb);
+ return 0;
+ }
+
+ return err;
}
drop_unlock:
--
1.9.1
^ permalink raw reply related
* [PATCH RFC ipsec-next 4/5] net: Prepare for IPsec GRO
From: Steffen Klassert @ 2017-01-04 8:23 UTC (permalink / raw)
To: David Miller, netdev; +Cc: Steffen Klassert, Sowmini Varadhan, Ilan Tayari
In-Reply-To: <1483518230-6777-1-git-send-email-steffen.klassert@secunet.com>
This patch prepares the generic codepath for IPsec GRO.
We introduce a new GRO_CONSUMED notifier to reflect that
IPsec can return asynchronously. On IPsec GRO we grab the
packet and reinject it back to layer 2 after IPsec
processing. We also use one xfrm_gro bit on the sk_buff
that will be set from IPsec to notify about GRO. If this
bit is set, we call napi_gro_receive for the backlog device
instead of __netif_receive_skb.
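
A minimal userspace sketch of the signalling idea, assuming the gro_receive callback reports "packet taken, processing continues elsewhere" by returning an error-encoded pointer carrying -EINPROGRESS. `err_ptr()`, `ptr_err()` and the `gro_outcome` enum are hypothetical stand-ins for the kernel's ERR_PTR()/PTR_ERR() helpers and the new GRO_CONSUMED result.

```c
#include <errno.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/PTR_ERR() helpers. */
static void *err_ptr(intptr_t err)
{
	return (void *)err;
}

static intptr_t ptr_err(const void *p)
{
	return (intptr_t)p;
}

/* Hypothetical outcome labels; OUTCOME_CONSUMED models GRO_CONSUMED. */
enum gro_outcome { OUTCOME_CONTINUE, OUTCOME_CONSUMED };

/* Decide what to do with the value a gro_receive callback returned. */
static enum gro_outcome classify_gro_return(const void *pp)
{
	/* -EINPROGRESS: the callback took the packet and will reinject it. */
	if (ptr_err(pp) == -EINPROGRESS)
		return OUTCOME_CONSUMED;
	/* NULL or a real list pointer: carry on with normal GRO handling. */
	return OUTCOME_CONTINUE;
}
```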
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
include/linux/netdevice.h | 1 +
include/linux/skbuff.h | 14 +++++++++++++-
net/core/dev.c | 17 ++++++++++++++++-
net/xfrm/Kconfig | 4 ++++
4 files changed, 34 insertions(+), 2 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ecd78b3..89bad76 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -352,6 +352,7 @@ enum gro_result {
GRO_HELD,
GRO_NORMAL,
GRO_DROP,
+ GRO_CONSUMED,
};
typedef enum gro_result gro_result_t;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b53c0cf..a78cd90 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -749,7 +749,10 @@ struct sk_buff {
#ifdef CONFIG_NET_SWITCHDEV
__u8 offload_fwd_mark:1;
#endif
- /* 2, 4 or 5 bit hole */
+#ifdef CONFIG_XFRM_OFFLOAD
+ __u8 xfrm_gro:1;
+#endif
+ /* 1 to 5 bits hole */
#ifdef CONFIG_NET_SCHED
__u16 tc_index; /* traffic control index */
@@ -3698,6 +3701,15 @@ static inline struct sec_path *skb_sec_path(struct sk_buff *skb)
#endif
}
+static inline bool skb_xfrm_gro(struct sk_buff *skb)
+{
+#ifdef CONFIG_XFRM_OFFLOAD
+ return skb->xfrm_gro;
+#else
+ return false;
+#endif
+}
+
/* Keeps track of mac header offset relative to skb->head.
* It is useful for TSO of Tunneling protocol. e.g. GRE.
* For non-tunnel skb it points to skb_mac_header() and for
diff --git a/net/core/dev.c b/net/core/dev.c
index 56818f7..ecbaaf7f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4525,6 +4525,11 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
}
rcu_read_unlock();
+ if (PTR_ERR(pp) == -EINPROGRESS) {
+ ret = GRO_CONSUMED;
+ goto ok;
+ }
+
if (&ptype->list == head)
goto normal;
@@ -4623,6 +4628,9 @@ static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
case GRO_MERGED_FREE:
if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) {
skb_dst_drop(skb);
+#ifdef CONFIG_XFRM_OFFLOAD
+ secpath_put(skb->sp);
+#endif
kmem_cache_free(skbuff_head_cache, skb);
} else {
__kfree_skb(skb);
@@ -4631,6 +4639,7 @@ static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
case GRO_HELD:
case GRO_MERGED:
+ case GRO_CONSUMED:
break;
}
@@ -4701,6 +4710,7 @@ static gro_result_t napi_frags_finish(struct napi_struct *napi,
break;
case GRO_MERGED:
+ case GRO_CONSUMED:
break;
}
@@ -4843,7 +4853,12 @@ static int process_backlog(struct napi_struct *napi, int quota)
while ((skb = __skb_dequeue(&sd->process_queue))) {
rcu_read_lock();
- __netif_receive_skb(skb);
+
+ if (skb_xfrm_gro(skb))
+ napi_gro_receive(napi, skb);
+ else
+ __netif_receive_skb(skb);
+
rcu_read_unlock();
input_queue_head_incr(sd);
if (++work >= quota)
diff --git a/net/xfrm/Kconfig b/net/xfrm/Kconfig
index bda1a13..442ac61 100644
--- a/net/xfrm/Kconfig
+++ b/net/xfrm/Kconfig
@@ -5,6 +5,10 @@ config XFRM
bool
depends on NET
+config XFRM_OFFLOAD
+ bool
+ depends on XFRM
+
config XFRM_ALGO
tristate
select XFRM
--
1.9.1
^ permalink raw reply related
* [PATCH RFC ipsec-next] IPsec offload, part one
From: Steffen Klassert @ 2017-01-04 8:23 UTC (permalink / raw)
To: David Miller, netdev; +Cc: Steffen Klassert, Sowmini Varadhan, Ilan Tayari
This is the first part of the IPsec offload work we
talked about at the IPsec workshop at the last netdev
conference. I plan to apply this to ipsec-next
after this round of review.
Patches 1 and 2 try to avoid skb linearization in
the ESP layer.
Patch 3 introduces a helper to set up the ESP trailer.
Patch 4 prepares the generic network code for
IPsec GRO. The main reason we need this is
that we need to reinject the decrypted inner
packet back into the GRO layer.
Patch 5 introduces GRO handlers for ESP. GRO
can be enabled with an IPsec offload config option.
This config option will also be used for the
upcoming hardware offload.
David, patch 4 touches generic networking code.
Is it ok to integrate such a generic preparation
patch into an IPsec pull request, or do you
prefer to get it as a separate patch?
^ permalink raw reply
* [PATCH RFC ipsec-next 3/5] esp: Introduce a helper to setup the trailer
From: Steffen Klassert @ 2017-01-04 8:23 UTC (permalink / raw)
To: David Miller, netdev; +Cc: Steffen Klassert, Sowmini Varadhan, Ilan Tayari
In-Reply-To: <1483518230-6777-1-git-send-email-steffen.klassert@secunet.com>
We need to set up the trailer in two different cases,
so add a helper to avoid code duplication.
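
To make the trailer layout concrete, here is a userspace sketch that mirrors the helper's logic with plain C types. The name `fill_trailer()` is hypothetical, and the sketch assumes plen is at least 2 so that the pad-length and next-header bytes fit.

```c
#include <stdint.h>
#include <string.h>

/*
 * Write an ESP-style trailer at @tail:
 *   tfclen bytes of zero (traffic flow confidentiality padding),
 *   plen - 2 padding bytes with values 1, 2, 3, ...,
 *   one pad-length byte (plen - 2),
 *   one next-header byte (proto).
 * Assumes plen >= 2 and that @tail has room for tfclen + plen bytes.
 */
static void fill_trailer(uint8_t *tail, int tfclen, int plen, uint8_t proto)
{
	int i;

	if (tfclen) {
		memset(tail, 0, tfclen);
		tail += tfclen;
	}
	for (i = 0; i < plen - 2; i++)
		tail[i] = (uint8_t)(i + 1);
	tail[plen - 2] = (uint8_t)(plen - 2);
	tail[plen - 1] = proto;
}
```

For example, tfclen = 2, plen = 6 and proto = 6 yields the bytes 00 00 01 02 03 04 04 06.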
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
net/ipv4/esp4.c | 44 +++++++++++++++++++-------------------------
net/ipv6/esp6.c | 44 +++++++++++++++++++-------------------------
2 files changed, 38 insertions(+), 50 deletions(-)
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 9e8d971..b1e2444 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -182,6 +182,22 @@ static void esp_output_done_esn(struct crypto_async_request *base, int err)
esp_output_done(base, err);
}
+static void esp_output_fill_trailer(u8 *tail, int tfclen, int plen, __u8 proto)
+{
+ /* Fill padding... */
+ if (tfclen) {
+ memset(tail, 0, tfclen);
+ tail += tfclen;
+ }
+ do {
+ int i;
+ for (i = 0; i < plen - 2; i++)
+ tail[i] = i + 1;
+ } while (0);
+ tail[plen - 2] = plen - 2;
+ tail[plen - 1] = proto;
+}
+
static int esp_output(struct xfrm_state *x, struct sk_buff *skb)
{
struct esp_output_extra *extra;
@@ -304,18 +320,7 @@ static int esp_output(struct xfrm_state *x, struct sk_buff *skb)
tail = vaddr + pfrag->offset;
- /* Fill padding... */
- if (tfclen) {
- memset(tail, 0, tfclen);
- tail += tfclen;
- }
- do {
- int i;
- for (i = 0; i < plen - 2; i++)
- tail[i] = i + 1;
- } while (0);
- tail[plen - 2] = plen - 2;
- tail[plen - 1] = proto;
+ esp_output_fill_trailer(tail, tfclen, plen, proto);
kunmap_atomic(vaddr);
@@ -395,20 +400,9 @@ static int esp_output(struct xfrm_state *x, struct sk_buff *skb)
esph = ip_esp_hdr(skb);
skip_cow:
- /* Fill padding... */
- if (tfclen) {
- memset(tail, 0, tfclen);
- tail += tfclen;
- }
- do {
- int i;
- for (i = 0; i < plen - 2; i++)
- tail[i] = i + 1;
- } while (0);
- tail[plen - 2] = plen - 2;
- tail[plen - 1] = proto;
- pskb_put(skb, trailer, clen - skb->len + alen);
+ esp_output_fill_trailer(tail, tfclen, plen, proto);
+ pskb_put(skb, trailer, clen - skb->len + alen);
skb_push(skb, -skb_network_offset(skb));
esph->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.low);
esph->spi = x->id.spi;
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index a428ac6..ff54faa 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -198,6 +198,22 @@ static void esp_output_done_esn(struct crypto_async_request *base, int err)
esp_output_done(base, err);
}
+static void esp_output_fill_trailer(u8 *tail, int tfclen, int plen, __u8 proto)
+{
+ /* Fill padding... */
+ if (tfclen) {
+ memset(tail, 0, tfclen);
+ tail += tfclen;
+ }
+ do {
+ int i;
+ for (i = 0; i < plen - 2; i++)
+ tail[i] = i + 1;
+ } while (0);
+ tail[plen - 2] = plen - 2;
+ tail[plen - 1] = proto;
+}
+
static int esp6_output(struct xfrm_state *x, struct sk_buff *skb)
{
int err;
@@ -284,18 +300,7 @@ static int esp6_output(struct xfrm_state *x, struct sk_buff *skb)
tail = vaddr + pfrag->offset;
- /* Fill padding... */
- if (tfclen) {
- memset(tail, 0, tfclen);
- tail += tfclen;
- }
- do {
- int i;
- for (i = 0; i < plen - 2; i++)
- tail[i] = i + 1;
- } while (0);
- tail[plen - 2] = plen - 2;
- tail[plen - 1] = proto;
+ esp_output_fill_trailer(tail, tfclen, plen, proto);
kunmap_atomic(vaddr);
@@ -375,20 +380,9 @@ static int esp6_output(struct xfrm_state *x, struct sk_buff *skb)
esph = ip_esp_hdr(skb);
skip_cow:
- /* Fill padding... */
- if (tfclen) {
- memset(tail, 0, tfclen);
- tail += tfclen;
- }
- do {
- int i;
- for (i = 0; i < plen - 2; i++)
- tail[i] = i + 1;
- } while (0);
- tail[plen - 2] = plen - 2;
- tail[plen - 1] = proto;
- pskb_put(skb, trailer, clen - skb->len + alen);
+ esp_output_fill_trailer(tail, tfclen, plen, proto);
+ pskb_put(skb, trailer, clen - skb->len + alen);
skb_push(skb, -skb_network_offset(skb));
esph->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.low);
--
1.9.1
^ permalink raw reply related
* [PATCH RFC ipsec-next 1/5] esp4: Avoid skb_cow_data whenever possible
From: Steffen Klassert @ 2017-01-04 8:23 UTC (permalink / raw)
To: David Miller, netdev; +Cc: Steffen Klassert, Sowmini Varadhan, Ilan Tayari
In-Reply-To: <1483518230-6777-1-git-send-email-steffen.klassert@secunet.com>
This patch tries to avoid skb_cow_data on esp4.
On the encrypt side we add the IPsec tailbits
to the linear part of the buffer if there is
space on it. If there is no space on the linear
part, we add a page fragment with the tailbits to
the buffer and use separate src and dst scatterlists.
On the decrypt side, we leave the buffer as it is
if it is not cloned.
With this, we can avoid a linearization of the buffer
in most of the cases.
Joint work with:
Sowmini Varadhan <sowmini.varadhan@oracle.com>
Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
include/net/xfrm.h | 2 +
net/ipv4/esp4.c | 338 +++++++++++++++++++++++++++++++++++++++++------------
2 files changed, 266 insertions(+), 74 deletions(-)
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 31947b9..ce53cfe 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -213,6 +213,8 @@ struct xfrm_state {
/* Last used time */
unsigned long lastused;
+ struct page_frag xfrag;
+
/* Reference to data common to all the instances of this
* transformer. */
const struct xfrm_type *type;
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 20fb25e..9e8d971 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -18,6 +18,8 @@
#include <net/protocol.h>
#include <net/udp.h>
+#include <linux/highmem.h>
+
struct esp_skb_cb {
struct xfrm_skb_cb xfrm;
void *tmp;
@@ -92,11 +94,40 @@ static inline struct scatterlist *esp_req_sg(struct crypto_aead *aead,
__alignof__(struct scatterlist));
}
+static void esp_ssg_unref(struct xfrm_state *x, void *tmp)
+{
+ struct esp_output_extra *extra = esp_tmp_extra(tmp);
+ struct crypto_aead *aead = x->data;
+ int extralen = 0;
+ u8 *iv;
+ struct aead_request *req;
+ struct scatterlist *sg;
+
+ if (x->props.flags & XFRM_STATE_ESN)
+ extralen += sizeof(*extra);
+
+ extra = esp_tmp_extra(tmp);
+ iv = esp_tmp_iv(aead, tmp, extralen);
+ req = esp_tmp_req(aead, iv);
+
+ /* Unref skb_frag_pages in the src scatterlist if necessary.
+ * Skip the first sg which comes from skb->data.
+ */
+ if (req->src != req->dst)
+ for (sg = sg_next(req->src); sg; sg = sg_next(sg))
+ put_page(sg_page(sg));
+}
+
static void esp_output_done(struct crypto_async_request *base, int err)
{
struct sk_buff *skb = base->data;
+ void *tmp;
+ struct dst_entry *dst = skb_dst(skb);
+ struct xfrm_state *x = dst->xfrm;
- kfree(ESP_SKB_CB(skb)->tmp);
+ tmp = ESP_SKB_CB(skb)->tmp;
+ esp_ssg_unref(x, tmp);
+ kfree(tmp);
xfrm_output_resume(skb, err);
}
@@ -120,6 +151,29 @@ static void esp_output_restore_header(struct sk_buff *skb)
sizeof(__be32));
}
+static struct ip_esp_hdr *esp_output_set_extra(struct sk_buff *skb,
+ struct ip_esp_hdr *esph,
+ struct esp_output_extra *extra)
+{
+ struct xfrm_state *x = skb_dst(skb)->xfrm;
+
+ /* For ESN we move the header forward by 4 bytes to
+ * accomodate the high bits. We will move it back after
+ * encryption.
+ */
+ if ((x->props.flags & XFRM_STATE_ESN)) {
+ extra->esphoff = (unsigned char *)esph -
+ skb_transport_header(skb);
+ esph = (struct ip_esp_hdr *)((unsigned char *)esph - 4);
+ extra->seqhi = esph->spi;
+ esph->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.hi);
+ }
+
+ esph->spi = x->id.spi;
+
+ return esph;
+}
+
static void esp_output_done_esn(struct crypto_async_request *base, int err)
{
struct sk_buff *skb = base->data;
@@ -130,16 +184,18 @@ static void esp_output_done_esn(struct crypto_async_request *base, int err)
static int esp_output(struct xfrm_state *x, struct sk_buff *skb)
{
- int err;
struct esp_output_extra *extra;
+ int err = -ENOMEM;
struct ip_esp_hdr *esph;
struct crypto_aead *aead;
struct aead_request *req;
- struct scatterlist *sg;
+ struct scatterlist *sg, *dsg;
struct sk_buff *trailer;
+ struct page *page;
void *tmp;
u8 *iv;
u8 *tail;
+ u8 *vaddr;
int blksize;
int clen;
int alen;
@@ -149,7 +205,9 @@ static int esp_output(struct xfrm_state *x, struct sk_buff *skb)
int nfrags;
int assoclen;
int extralen;
+ int tailen;
__be64 seqno;
+ __u8 proto = *skb_mac_header(skb);
/* skb is pure payload to encrypt */
@@ -169,12 +227,7 @@ static int esp_output(struct xfrm_state *x, struct sk_buff *skb)
blksize = ALIGN(crypto_aead_blocksize(aead), 4);
clen = ALIGN(skb->len + 2 + tfclen, blksize);
plen = clen - skb->len - tfclen;
-
- err = skb_cow_data(skb, tfclen + plen + alen, &trailer);
- if (err < 0)
- goto error;
- nfrags = err;
-
+ tailen = tfclen + plen + alen;
assoclen = sizeof(*esph);
extralen = 0;
@@ -183,35 +236,8 @@ static int esp_output(struct xfrm_state *x, struct sk_buff *skb)
assoclen += sizeof(__be32);
}
- tmp = esp_alloc_tmp(aead, nfrags, extralen);
- if (!tmp) {
- err = -ENOMEM;
- goto error;
- }
-
- extra = esp_tmp_extra(tmp);
- iv = esp_tmp_iv(aead, tmp, extralen);
- req = esp_tmp_req(aead, iv);
- sg = esp_req_sg(aead, req);
-
- /* Fill padding... */
- tail = skb_tail_pointer(trailer);
- if (tfclen) {
- memset(tail, 0, tfclen);
- tail += tfclen;
- }
- do {
- int i;
- for (i = 0; i < plen - 2; i++)
- tail[i] = i + 1;
- } while (0);
- tail[plen - 2] = plen - 2;
- tail[plen - 1] = *skb_mac_header(skb);
- pskb_put(skb, trailer, clen - skb->len + alen);
-
- skb_push(skb, -skb_network_offset(skb));
- esph = ip_esp_hdr(skb);
*skb_mac_header(skb) = IPPROTO_ESP;
+ esph = ip_esp_hdr(skb);
/* this is non-NULL only with UDP Encapsulation */
if (x->encap) {
@@ -230,7 +256,8 @@ static int esp_output(struct xfrm_state *x, struct sk_buff *skb)
uh = (struct udphdr *)esph;
uh->source = sport;
uh->dest = dport;
- uh->len = htons(skb->len - skb_transport_offset(skb));
+ uh->len = htons(skb->len + tailen
+ - skb_transport_offset(skb));
uh->check = 0;
switch (encap_type) {
@@ -248,31 +275,170 @@ static int esp_output(struct xfrm_state *x, struct sk_buff *skb)
*skb_mac_header(skb) = IPPROTO_UDP;
}
- esph->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.low);
+ if (!skb_cloned(skb)) {
+ if (tailen <= skb_availroom(skb)) {
+ nfrags = 1;
+ trailer = skb;
+ tail = skb_tail_pointer(trailer);
- aead_request_set_callback(req, 0, esp_output_done, skb);
+ goto skip_cow;
+ } else if ((skb_shinfo(skb)->nr_frags < MAX_SKB_FRAGS)
+ && !skb_has_frag_list(skb)) {
+ int allocsize;
+ struct sock *sk = skb->sk;
+ struct page_frag *pfrag = &x->xfrag;
- /* For ESN we move the header forward by 4 bytes to
- * accomodate the high bits. We will move it back after
- * encryption.
- */
- if ((x->props.flags & XFRM_STATE_ESN)) {
- extra->esphoff = (unsigned char *)esph -
- skb_transport_header(skb);
- esph = (struct ip_esp_hdr *)((unsigned char *)esph - 4);
- extra->seqhi = esph->spi;
- esph->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.hi);
- aead_request_set_callback(req, 0, esp_output_done_esn, skb);
+ allocsize = ALIGN(tailen, L1_CACHE_BYTES);
+
+ spin_lock_bh(&x->lock);
+
+ if (unlikely(!skb_page_frag_refill(allocsize, pfrag, GFP_ATOMIC))) {
+ spin_unlock_bh(&x->lock);
+ goto cow;
+ }
+
+ page = pfrag->page;
+ get_page(page);
+
+ vaddr = kmap_atomic(page);
+
+ tail = vaddr + pfrag->offset;
+
+ /* Fill padding... */
+ if (tfclen) {
+ memset(tail, 0, tfclen);
+ tail += tfclen;
+ }
+ do {
+ int i;
+ for (i = 0; i < plen - 2; i++)
+ tail[i] = i + 1;
+ } while (0);
+ tail[plen - 2] = plen - 2;
+ tail[plen - 1] = proto;
+
+ kunmap_atomic(vaddr);
+
+ nfrags = skb_shinfo(skb)->nr_frags;
+
+ __skb_fill_page_desc(skb, nfrags, page, pfrag->offset,
+ tailen);
+ skb_shinfo(skb)->nr_frags = ++nfrags;
+
+ pfrag->offset = pfrag->offset + allocsize;
+ nfrags++;
+
+ skb->len += tailen;
+ skb->data_len += tailen;
+ skb->truesize += tailen;
+ if (sk)
+ atomic_add(tailen, &sk->sk_wmem_alloc);
+
+ skb_push(skb, -skb_network_offset(skb));
+
+ esph->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.low);
+ esph->spi = x->id.spi;
+
+ tmp = esp_alloc_tmp(aead, nfrags + 2, extralen);
+ if (!tmp) {
+ spin_unlock_bh(&x->lock);
+ err = -ENOMEM;
+ goto error;
+ }
+
+ extra = esp_tmp_extra(tmp);
+ iv = esp_tmp_iv(aead, tmp, extralen);
+ req = esp_tmp_req(aead, iv);
+ sg = esp_req_sg(aead, req);
+ dsg = &sg[nfrags];
+
+ esph = esp_output_set_extra(skb, esph, extra);
+
+ sg_init_table(sg, nfrags);
+ skb_to_sgvec(skb, sg,
+ (unsigned char *)esph - skb->data,
+ assoclen + ivlen + clen + alen);
+
+ allocsize = ALIGN(skb->data_len, L1_CACHE_BYTES);
+
+ if (unlikely(!skb_page_frag_refill(allocsize, pfrag, GFP_ATOMIC))) {
+ spin_unlock_bh(&x->lock);
+ err = -ENOMEM;
+ goto error;
+ }
+
+ skb_shinfo(skb)->nr_frags = 1;
+
+ page = pfrag->page;
+ get_page(page);
+ /* replace page frags in skb with new page */
+ __skb_fill_page_desc(skb, 0, page, pfrag->offset, skb->data_len);
+ pfrag->offset = pfrag->offset + allocsize;
+
+ sg_init_table(dsg, skb_shinfo(skb)->nr_frags + 1);
+ skb_to_sgvec(skb, dsg,
+ (unsigned char *)esph - skb->data,
+ assoclen + ivlen + clen + alen);
+
+ spin_unlock_bh(&x->lock);
+
+ goto skip_cow2;
+ }
}
+cow:
+ err = skb_cow_data(skb, tailen, &trailer);
+ if (err < 0)
+ goto error;
+ nfrags = err;
+ tail = skb_tail_pointer(trailer);
+ esph = ip_esp_hdr(skb);
+
+skip_cow:
+ /* Fill padding... */
+ if (tfclen) {
+ memset(tail, 0, tfclen);
+ tail += tfclen;
+ }
+ do {
+ int i;
+ for (i = 0; i < plen - 2; i++)
+ tail[i] = i + 1;
+ } while (0);
+ tail[plen - 2] = plen - 2;
+ tail[plen - 1] = proto;
+ pskb_put(skb, trailer, clen - skb->len + alen);
+
+ skb_push(skb, -skb_network_offset(skb));
+ esph->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.low);
esph->spi = x->id.spi;
+ tmp = esp_alloc_tmp(aead, nfrags, extralen);
+ if (!tmp) {
+ err = -ENOMEM;
+ goto error;
+ }
+
+ extra = esp_tmp_extra(tmp);
+ iv = esp_tmp_iv(aead, tmp, extralen);
+ req = esp_tmp_req(aead, iv);
+ sg = esp_req_sg(aead, req);
+ dsg = sg;
+
+ esph = esp_output_set_extra(skb, esph, extra);
+
sg_init_table(sg, nfrags);
skb_to_sgvec(skb, sg,
(unsigned char *)esph - skb->data,
assoclen + ivlen + clen + alen);
- aead_request_set_crypt(req, sg, sg, ivlen + clen, iv);
+skip_cow2:
+ if ((x->props.flags & XFRM_STATE_ESN))
+ aead_request_set_callback(req, 0, esp_output_done_esn, skb);
+ else
+ aead_request_set_callback(req, 0, esp_output_done, skb);
+
+ aead_request_set_crypt(req, sg, dsg, ivlen + clen, iv);
aead_request_set_ad(req, assoclen);
seqno = cpu_to_be64(XFRM_SKB_CB(skb)->seq.output.low +
@@ -298,6 +464,8 @@ static int esp_output(struct xfrm_state *x, struct sk_buff *skb)
esp_output_restore_header(skb);
}
+ if (sg != dsg)
+ esp_ssg_unref(x, tmp);
kfree(tmp);
error:
@@ -401,6 +569,23 @@ static void esp_input_restore_header(struct sk_buff *skb)
__skb_pull(skb, 4);
}
+static void esp_input_set_header(struct sk_buff *skb, __be32 *seqhi)
+{
+ struct xfrm_state *x = xfrm_input_state(skb);
+ struct ip_esp_hdr *esph = (struct ip_esp_hdr *)skb->data;
+
+ /* For ESN we move the header forward by 4 bytes to
+ * accomodate the high bits. We will move it back after
+ * decryption.
+ */
+ if ((x->props.flags & XFRM_STATE_ESN)) {
+ esph = (void *)skb_push(skb, 4);
+ *seqhi = esph->spi;
+ esph->spi = esph->seq_no;
+ esph->seq_no = XFRM_SKB_CB(skb)->seq.input.hi;
+ }
+}
+
static void esp_input_done_esn(struct crypto_async_request *base, int err)
{
struct sk_buff *skb = base->data;
@@ -437,12 +622,6 @@ static int esp_input(struct xfrm_state *x, struct sk_buff *skb)
if (elen <= 0)
goto out;
- err = skb_cow_data(skb, 0, &trailer);
- if (err < 0)
- goto out;
-
- nfrags = err;
-
assoclen = sizeof(*esph);
seqhilen = 0;
@@ -451,6 +630,26 @@ static int esp_input(struct xfrm_state *x, struct sk_buff *skb)
assoclen += seqhilen;
}
+ if (!skb_cloned(skb)) {
+ if (!skb_is_nonlinear(skb)) {
+ nfrags = 1;
+
+ goto skip_cow;
+ } else if (!skb_has_frag_list(skb)) {
+ nfrags = skb_shinfo(skb)->nr_frags;
+ nfrags++;
+
+ goto skip_cow;
+ }
+ }
+
+ err = skb_cow_data(skb, 0, &trailer);
+ if (err < 0)
+ goto out;
+
+ nfrags = err;
+
+skip_cow:
err = -ENOMEM;
tmp = esp_alloc_tmp(aead, nfrags, seqhilen);
if (!tmp)
@@ -462,26 +661,17 @@ static int esp_input(struct xfrm_state *x, struct sk_buff *skb)
req = esp_tmp_req(aead, iv);
sg = esp_req_sg(aead, req);
- skb->ip_summed = CHECKSUM_NONE;
+ esp_input_set_header(skb, seqhi);
- esph = (struct ip_esp_hdr *)skb->data;
+ sg_init_table(sg, nfrags);
+ skb_to_sgvec(skb, sg, 0, skb->len);
- aead_request_set_callback(req, 0, esp_input_done, skb);
+ skb->ip_summed = CHECKSUM_NONE;
- /* For ESN we move the header forward by 4 bytes to
- * accomodate the high bits. We will move it back after
- * decryption.
- */
- if ((x->props.flags & XFRM_STATE_ESN)) {
- esph = (void *)skb_push(skb, 4);
- *seqhi = esph->spi;
- esph->spi = esph->seq_no;
- esph->seq_no = XFRM_SKB_CB(skb)->seq.input.hi;
+ if ((x->props.flags & XFRM_STATE_ESN))
aead_request_set_callback(req, 0, esp_input_done_esn, skb);
- }
-
- sg_init_table(sg, nfrags);
- skb_to_sgvec(skb, sg, 0, skb->len);
+ else
+ aead_request_set_callback(req, 0, esp_input_done, skb);
aead_request_set_crypt(req, sg, sg, elen + ivlen, iv);
aead_request_set_ad(req, assoclen);
--
1.9.1
^ permalink raw reply related
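The output-side decision described in the patch above can be summarized in a small stand-alone sketch. This is only a model of the branch structure, not kernel code: struct buf_info, enum esp_out_path, esp_choose_out_path() and DEMO_MAX_FRAGS are hypothetical names, and the fields stand in for skb_cloned(), skb_has_frag_list(), skb_availroom() and skb_shinfo(skb)->nr_frags.

```c
#include <stdbool.h>

#define DEMO_MAX_FRAGS 17 /* illustrative stand-in for MAX_SKB_FRAGS */

/* Hypothetical summary of the skb properties the patch inspects. */
struct buf_info {
	bool cloned;        /* models skb_cloned() */
	bool has_frag_list; /* models skb_has_frag_list() */
	int availroom;      /* models skb_availroom() */
	int nr_frags;       /* models skb_shinfo(skb)->nr_frags */
};

enum esp_out_path {
	ESP_OUT_LINEAR_TAIL, /* trailer fits in the linear tailroom */
	ESP_OUT_PAGE_FRAG,   /* trailer goes into a new page fragment */
	ESP_OUT_COW,         /* fall back to skb_cow_data() */
};

static enum esp_out_path esp_choose_out_path(const struct buf_info *b,
					     int tailen)
{
	if (!b->cloned) {
		if (tailen <= b->availroom)
			return ESP_OUT_LINEAR_TAIL;
		if (b->nr_frags < DEMO_MAX_FRAGS && !b->has_frag_list)
			return ESP_OUT_PAGE_FRAG;
	}
	return ESP_OUT_COW;
}
```

In the patch itself the page-fragment path can still fall back to the cow label when the first skb_page_frag_refill() call fails; the sketch leaves that allocation failure out.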
* [PATCH RFC ipsec-next 2/5] esp6: Avoid skb_cow_data whenever possible
From: Steffen Klassert @ 2017-01-04 8:23 UTC (permalink / raw)
To: David Miller, netdev; +Cc: Steffen Klassert, Sowmini Varadhan, Ilan Tayari
In-Reply-To: <1483518230-6777-1-git-send-email-steffen.klassert@secunet.com>
This patch tries to avoid skb_cow_data on esp6.
On the encrypt side we add the IPsec tailbits
to the linear part of the buffer if there is
space on it. If there is no space on the linear
part, we add a page fragment with the tailbits to
the buffer and use separate src and dst scatterlists.
On the decrypt side, we leave the buffer as it is
if it is not cloned.
With this, we can avoid a linearization of the buffer
in most of the cases.
Joint work with:
Sowmini Varadhan <sowmini.varadhan@oracle.com>
Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
net/ipv6/esp6.c | 302 +++++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 246 insertions(+), 56 deletions(-)
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index cbcdd5d..a428ac6 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -44,6 +44,8 @@
#include <net/protocol.h>
#include <linux/icmpv6.h>
+#include <linux/highmem.h>
+
struct esp_skb_cb {
struct xfrm_skb_cb xfrm;
void *tmp;
@@ -114,11 +116,40 @@ static inline struct scatterlist *esp_req_sg(struct crypto_aead *aead,
__alignof__(struct scatterlist));
}
+static void esp_ssg_unref(struct xfrm_state *x, void *tmp)
+{
+ __be32 *seqhi;
+ struct crypto_aead *aead = x->data;
+ int seqhilen = 0;
+ u8 *iv;
+ struct aead_request *req;
+ struct scatterlist *sg;
+
+ if (x->props.flags & XFRM_STATE_ESN)
+ seqhilen += sizeof(__be32);
+
+ seqhi = esp_tmp_seqhi(tmp);
+ iv = esp_tmp_iv(aead, tmp, seqhilen);
+ req = esp_tmp_req(aead, iv);
+
+ /* Unref skb_frag_pages in the src scatterlist if necessary.
+ * Skip the first sg which comes from skb->data.
+ */
+ if (req->src != req->dst)
+ for (sg = sg_next(req->src); sg; sg = sg_next(sg))
+ put_page(sg_page(sg));
+}
+
static void esp_output_done(struct crypto_async_request *base, int err)
{
struct sk_buff *skb = base->data;
+ void *tmp;
+ struct dst_entry *dst = skb_dst(skb);
+ struct xfrm_state *x = dst->xfrm;
- kfree(ESP_SKB_CB(skb)->tmp);
+ tmp = ESP_SKB_CB(skb)->tmp;
+ esp_ssg_unref(x, tmp);
+ kfree(tmp);
xfrm_output_resume(skb, err);
}
@@ -138,6 +169,27 @@ static void esp_output_restore_header(struct sk_buff *skb)
esp_restore_header(skb, skb_transport_offset(skb) - sizeof(__be32));
}
+static struct ip_esp_hdr *esp_output_set_esn(struct sk_buff *skb,
+ struct ip_esp_hdr *esph,
+ __be32 *seqhi)
+{
+ struct xfrm_state *x = skb_dst(skb)->xfrm;
+
+ /* For ESN we move the header forward by 4 bytes to
+ * accomodate the high bits. We will move it back after
+ * encryption.
+ */
+ if ((x->props.flags & XFRM_STATE_ESN)) {
+ esph = (void *)(skb_transport_header(skb) - sizeof(__be32));
+ *seqhi = esph->spi;
+ esph->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.hi);
+ }
+
+ esph->spi = x->id.spi;
+
+ return esph;
+}
+
static void esp_output_done_esn(struct crypto_async_request *base, int err)
{
struct sk_buff *skb = base->data;
@@ -152,8 +204,9 @@ static int esp6_output(struct xfrm_state *x, struct sk_buff *skb)
struct ip_esp_hdr *esph;
struct crypto_aead *aead;
struct aead_request *req;
- struct scatterlist *sg;
+ struct scatterlist *sg, *dsg;
struct sk_buff *trailer;
+ struct page *page;
void *tmp;
int blksize;
int clen;
@@ -164,10 +217,13 @@ static int esp6_output(struct xfrm_state *x, struct sk_buff *skb)
int nfrags;
int assoclen;
int seqhilen;
+ int tailen;
u8 *iv;
u8 *tail;
+ u8 *vaddr;
__be32 *seqhi;
__be64 seqno;
+ __u8 proto = *skb_mac_header(skb);
/* skb is pure payload to encrypt */
aead = x->data;
@@ -186,11 +242,7 @@ static int esp6_output(struct xfrm_state *x, struct sk_buff *skb)
blksize = ALIGN(crypto_aead_blocksize(aead), 4);
clen = ALIGN(skb->len + 2 + tfclen, blksize);
plen = clen - skb->len - tfclen;
-
- err = skb_cow_data(skb, tfclen + plen + alen, &trailer);
- if (err < 0)
- goto error;
- nfrags = err;
+ tailen = tfclen + plen + alen;
assoclen = sizeof(*esph);
seqhilen = 0;
@@ -200,19 +252,130 @@ static int esp6_output(struct xfrm_state *x, struct sk_buff *skb)
assoclen += seqhilen;
}
- tmp = esp_alloc_tmp(aead, nfrags, seqhilen);
- if (!tmp) {
- err = -ENOMEM;
- goto error;
+ *skb_mac_header(skb) = IPPROTO_ESP;
+ esph = ip_esp_hdr(skb);
+
+ if (!skb_cloned(skb)) {
+ if (tailen <= skb_availroom(skb)) {
+ nfrags = 1;
+ trailer = skb;
+ tail = skb_tail_pointer(trailer);
+
+ goto skip_cow;
+ } else if ((skb_shinfo(skb)->nr_frags < MAX_SKB_FRAGS)
+ && !skb_has_frag_list(skb)) {
+ int allocsize;
+ struct sock *sk = skb->sk;
+ struct page_frag *pfrag = &x->xfrag;
+
+ allocsize = ALIGN(tailen, L1_CACHE_BYTES);
+
+ spin_lock_bh(&x->lock);
+
+ if (unlikely(!skb_page_frag_refill(allocsize, pfrag, GFP_ATOMIC))) {
+ spin_unlock_bh(&x->lock);
+ goto cow;
+ }
+
+ page = pfrag->page;
+ get_page(page);
+
+ vaddr = kmap_atomic(page);
+
+ tail = vaddr + pfrag->offset;
+
+ /* Fill padding... */
+ if (tfclen) {
+ memset(tail, 0, tfclen);
+ tail += tfclen;
+ }
+ do {
+ int i;
+ for (i = 0; i < plen - 2; i++)
+ tail[i] = i + 1;
+ } while (0);
+ tail[plen - 2] = plen - 2;
+ tail[plen - 1] = proto;
+
+ kunmap_atomic(vaddr);
+
+ nfrags = skb_shinfo(skb)->nr_frags;
+
+ __skb_fill_page_desc(skb, nfrags, page, pfrag->offset,
+ tailen);
+ skb_shinfo(skb)->nr_frags = ++nfrags;
+
+ pfrag->offset = pfrag->offset + allocsize;
+ nfrags++;
+
+ skb->len += tailen;
+ skb->data_len += tailen;
+ skb->truesize += tailen;
+ if (sk)
+ atomic_add(tailen, &sk->sk_wmem_alloc);
+
+ skb_push(skb, -skb_network_offset(skb));
+
+ esph->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.low);
+ esph->spi = x->id.spi;
+
+ tmp = esp_alloc_tmp(aead, nfrags + 2, seqhilen);
+ if (!tmp) {
+ spin_unlock_bh(&x->lock);
+ err = -ENOMEM;
+ goto error;
+ }
+ seqhi = esp_tmp_seqhi(tmp);
+ iv = esp_tmp_iv(aead, tmp, seqhilen);
+ req = esp_tmp_req(aead, iv);
+ sg = esp_req_sg(aead, req);
+ dsg = &sg[nfrags];
+
+ esph = esp_output_set_esn(skb, esph, seqhi);
+
+ sg_init_table(sg, nfrags);
+ skb_to_sgvec(skb, sg,
+ (unsigned char *)esph - skb->data,
+ assoclen + ivlen + clen + alen);
+
+ allocsize = ALIGN(skb->data_len, L1_CACHE_BYTES);
+
+ if (unlikely(!skb_page_frag_refill(allocsize, pfrag, GFP_ATOMIC))) {
+ spin_unlock_bh(&x->lock);
+ err = -ENOMEM;
+ goto error;
+ }
+
+ skb_shinfo(skb)->nr_frags = 1;
+
+ page = pfrag->page;
+ get_page(page);
+ /* replace page frags in skb with new page */
+ __skb_fill_page_desc(skb, 0, page, pfrag->offset, skb->data_len);
+ pfrag->offset = pfrag->offset + allocsize;
+
+ sg_init_table(dsg, skb_shinfo(skb)->nr_frags + 1);
+ skb_to_sgvec(skb, dsg,
+ (unsigned char *)esph - skb->data,
+ assoclen + ivlen + clen + alen);
+
+ spin_unlock_bh(&x->lock);
+
+ goto skip_cow2;
+ }
}
- seqhi = esp_tmp_seqhi(tmp);
- iv = esp_tmp_iv(aead, tmp, seqhilen);
- req = esp_tmp_req(aead, iv);
- sg = esp_req_sg(aead, req);
+cow:
+ err = skb_cow_data(skb, tailen, &trailer);
+ if (err < 0)
+ goto error;
+ nfrags = err;
- /* Fill padding... */
tail = skb_tail_pointer(trailer);
+ esph = ip_esp_hdr(skb);
+
+skip_cow:
+ /* Fill padding... */
if (tfclen) {
memset(tail, 0, tfclen);
tail += tfclen;
@@ -223,36 +386,40 @@ static int esp6_output(struct xfrm_state *x, struct sk_buff *skb)
tail[i] = i + 1;
} while (0);
tail[plen - 2] = plen - 2;
- tail[plen - 1] = *skb_mac_header(skb);
+ tail[plen - 1] = proto;
pskb_put(skb, trailer, clen - skb->len + alen);
skb_push(skb, -skb_network_offset(skb));
- esph = ip_esp_hdr(skb);
- *skb_mac_header(skb) = IPPROTO_ESP;
esph->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.low);
+ esph->spi = x->id.spi;
- aead_request_set_callback(req, 0, esp_output_done, skb);
-
- /* For ESN we move the header forward by 4 bytes to
- * accomodate the high bits. We will move it back after
- * encryption.
- */
- if ((x->props.flags & XFRM_STATE_ESN)) {
- esph = (void *)(skb_transport_header(skb) - sizeof(__be32));
- *seqhi = esph->spi;
- esph->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.hi);
- aead_request_set_callback(req, 0, esp_output_done_esn, skb);
+ tmp = esp_alloc_tmp(aead, nfrags, seqhilen);
+ if (!tmp) {
+ err = -ENOMEM;
+ goto error;
}
- esph->spi = x->id.spi;
+ seqhi = esp_tmp_seqhi(tmp);
+ iv = esp_tmp_iv(aead, tmp, seqhilen);
+ req = esp_tmp_req(aead, iv);
+ sg = esp_req_sg(aead, req);
+ dsg = sg;
+
+ esph = esp_output_set_esn(skb, esph, seqhi);
sg_init_table(sg, nfrags);
skb_to_sgvec(skb, sg,
(unsigned char *)esph - skb->data,
assoclen + ivlen + clen + alen);
- aead_request_set_crypt(req, sg, sg, ivlen + clen, iv);
+skip_cow2:
+ if ((x->props.flags & XFRM_STATE_ESN))
+ aead_request_set_callback(req, 0, esp_output_done_esn, skb);
+ else
+ aead_request_set_callback(req, 0, esp_output_done, skb);
+
+ aead_request_set_crypt(req, sg, dsg, ivlen + clen, iv);
aead_request_set_ad(req, assoclen);
seqno = cpu_to_be64(XFRM_SKB_CB(skb)->seq.output.low +
@@ -278,6 +445,8 @@ static int esp6_output(struct xfrm_state *x, struct sk_buff *skb)
esp_output_restore_header(skb);
}
+ if (sg != dsg)
+ esp_ssg_unref(x, tmp);
kfree(tmp);
error:
@@ -343,6 +512,23 @@ static void esp_input_restore_header(struct sk_buff *skb)
__skb_pull(skb, 4);
}
+static void esp_input_set_header(struct sk_buff *skb, __be32 *seqhi)
+{
+ struct xfrm_state *x = xfrm_input_state(skb);
+ struct ip_esp_hdr *esph = (struct ip_esp_hdr *)skb->data;
+
+ /* For ESN we move the header forward by 4 bytes to
+ * accomodate the high bits. We will move it back after
+ * decryption.
+ */
+ if ((x->props.flags & XFRM_STATE_ESN)) {
+ esph = (void *)skb_push(skb, 4);
+ *seqhi = esph->spi;
+ esph->spi = esph->seq_no;
+ esph->seq_no = XFRM_SKB_CB(skb)->seq.input.hi;
+ }
+}
+
static void esp_input_done_esn(struct crypto_async_request *base, int err)
{
struct sk_buff *skb = base->data;
@@ -378,14 +564,6 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff *skb)
goto out;
}
- nfrags = skb_cow_data(skb, 0, &trailer);
- if (nfrags < 0) {
- ret = -EINVAL;
- goto out;
- }
-
- ret = -ENOMEM;
-
assoclen = sizeof(*esph);
seqhilen = 0;
@@ -394,6 +572,27 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff *skb)
assoclen += seqhilen;
}
+ if (!skb_cloned(skb)) {
+ if (!skb_is_nonlinear(skb)) {
+ nfrags = 1;
+
+ goto skip_cow;
+ } else if (!skb_has_frag_list(skb)) {
+ nfrags = skb_shinfo(skb)->nr_frags;
+ nfrags++;
+
+ goto skip_cow;
+ }
+ }
+
+ nfrags = skb_cow_data(skb, 0, &trailer);
+ if (nfrags < 0) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+skip_cow:
+ ret = -ENOMEM;
tmp = esp_alloc_tmp(aead, nfrags, seqhilen);
if (!tmp)
goto out;
@@ -404,26 +603,17 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff *skb)
req = esp_tmp_req(aead, iv);
sg = esp_req_sg(aead, req);
- skb->ip_summed = CHECKSUM_NONE;
+ esp_input_set_header(skb, seqhi);
- esph = (struct ip_esp_hdr *)skb->data;
+ sg_init_table(sg, nfrags);
+ skb_to_sgvec(skb, sg, 0, skb->len);
- aead_request_set_callback(req, 0, esp_input_done, skb);
+ skb->ip_summed = CHECKSUM_NONE;
- /* For ESN we move the header forward by 4 bytes to
- * accomodate the high bits. We will move it back after
- * decryption.
- */
- if ((x->props.flags & XFRM_STATE_ESN)) {
- esph = (void *)skb_push(skb, 4);
- *seqhi = esph->spi;
- esph->spi = esph->seq_no;
- esph->seq_no = XFRM_SKB_CB(skb)->seq.input.hi;
+ if ((x->props.flags & XFRM_STATE_ESN))
aead_request_set_callback(req, 0, esp_input_done_esn, skb);
- }
-
- sg_init_table(sg, nfrags);
- skb_to_sgvec(skb, sg, 0, skb->len);
+ else
+ aead_request_set_callback(req, 0, esp_input_done, skb);
aead_request_set_crypt(req, sg, sg, elen + ivlen, iv);
aead_request_set_ad(req, assoclen);
--
1.9.1
^ permalink raw reply related
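On the decrypt side both patches use the same rule to decide whether skb_cow_data() can be skipped and how many scatterlist entries to allocate. A hedged stand-alone model of that rule follows; struct rx_buf_info and esp_input_nfrags() are hypothetical names introduced for illustration, and a return value of -1 simply means "call skb_cow_data() and use its result".

```c
#include <stdbool.h>

/* Hypothetical summary of the properties checked on the input path. */
struct rx_buf_info {
	bool cloned;        /* models skb_cloned() */
	bool nonlinear;     /* models skb_is_nonlinear() */
	bool has_frag_list; /* models skb_has_frag_list() */
	int nr_frags;       /* models skb_shinfo(skb)->nr_frags */
};

/* Returns the number of scatterlist entries needed when the buffer can
 * be used in place, or -1 when the caller has to fall back to the
 * copy-on-write path.
 */
static int esp_input_nfrags(const struct rx_buf_info *b)
{
	if (!b->cloned) {
		if (!b->nonlinear)
			return 1;               /* linear data only */
		if (!b->has_frag_list)
			return b->nr_frags + 1; /* linear part + page frags */
	}
	return -1;
}
```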
* RE: [PATCH] ethtool: add one ethtool option to set relax ordering mode
From: maowenan @ 2017-01-04 9:02 UTC (permalink / raw)
To: maowenan, Alexander Duyck, netdev@vger.kernel.org,
jeffrey.t.kirsher@intel.com, Stephen Hemminger
Cc: weiyongjun (A), Dingtianhong, Wangzhou (B)
In-Reply-To: <CAKgT0Ud0=fQpADLHPnZBLrG8xLGHUB20TZt1mi+GBNYJQngDCA@mail.gmail.com>
> -----Original Message-----
> From: maowenan
> Sent: Monday, December 26, 2016 4:33 PM
> To: maowenan; 'Alexander Duyck'
> Cc: 'Jeff Kirsher'; 'Stephen Hemminger'; 'netdev@vger.kernel.org'; weiyongjun
> (A); Dingtianhong; Wangzhou (B)
> Subject: RE: [PATCH] ethtool: add one ethtool option to set relax ordering mode
>
>
>
> > -----Original Message-----
> > From: maowenan
> > Sent: Saturday, December 24, 2016 4:30 PM
> > To: 'Alexander Duyck'
> > Cc: Jeff Kirsher; Stephen Hemminger; netdev@vger.kernel.org;
> > weiyongjun (A); Dingtianhong; Wangzhou (B)
> > Subject: RE: [PATCH] ethtool: add one ethtool option to set relax
> > ordering mode
> >
> >
> >
> > > -----Original Message-----
> > > From: Alexander Duyck [mailto:alexander.duyck@gmail.com]
> > > Sent: Friday, December 23, 2016 11:43 PM
> > > To: maowenan
> > > Cc: Jeff Kirsher; Stephen Hemminger; netdev@vger.kernel.org;
> > > weiyongjun (A); Dingtianhong
> > > Subject: Re: [PATCH] ethtool: add one ethtool option to set relax
> > > ordering mode
> > >
> > > On Thu, Dec 22, 2016 at 10:14 PM, maowenan <maowenan@huawei.com>
> > > wrote:
> > > >
> > > >
> > > >> -----Original Message-----
> > > >> From: Jeff Kirsher [mailto:jeffrey.t.kirsher@intel.com]
> > > >> Sent: Friday, December 23, 2016 9:07 AM
> > > >> To: maowenan; Alexander Duyck
> > > >> Cc: Stephen Hemminger; netdev@vger.kernel.org; weiyongjun (A);
> > > >> Dingtianhong
> > > >> Subject: Re: [PATCH] ethtool: add one ethtool option to set relax
> > > >> ordering mode
> > > >>
> > > >> On Fri, 2016-12-23 at 00:40 +0000, maowenan wrote:
> > > >> > > -----Original Message-----
> > > >> > > From: Alexander Duyck [mailto:alexander.duyck@gmail.com]
> > > >> > > Sent: Thursday, December 22, 2016 11:54 PM
> > > >> > > To: maowenan
> > > >> > > Cc: Stephen Hemminger; netdev@vger.kernel.org;
> > > >> > > jeffrey.t.kirsher@intel.com; weiyongjun (A); Dingtianhong
> > > >> > > Subject: Re: [PATCH] ethtool: add one ethtool option to set
> > > >> > > relax ordering mode
> > > >> > >
> > > >> > > On Wed, Dec 21, 2016 at 5:39 PM, maowenan <maowenan@huawei.com>
> > > >> > > wrote:
> > > >> > > >
> > > >> > > >
> > > >> > > > > -----Original Message-----
> > > >> > > > > From: Stephen Hemminger
> > > >> > > > > [mailto:stephen@networkplumber.org]
> > > >> > > > > Sent: Thursday, December 22, 2016 9:28 AM
> > > >> > > > > To: maowenan
> > > >> > > > > Cc: netdev@vger.kernel.org; jeffrey.t.kirsher@intel.com
> > > >> > > > > Subject: Re: [PATCH] ethtool: add one ethtool option to
> > > >> > > > > set relax ordering mode
> > > >> > > > >
> > > >> > > > > On Thu, 8 Dec 2016 14:51:38 +0800 Mao Wenan
> > > >> > > > > <maowenan@huawei.com> wrote:
> > > >> > > > >
> > > >> > > > > > This patch provides one way to set/unset IXGBE NIC TX
> > > >> > > > > > and RX relax ordering mode, which can be set by ethtool.
> > > >> > > > > > Relax ordering is one mode of 82599 NIC, to enable this
> > > >> > > > > > mode can enhance the performance for some cpu architecure.
> > > >> > > > >
> > > >> > > > > Then it should be done by CPU architecture specific
> > > >> > > > > quirks (preferably in PCI
> > > >> > > > > layer) so that all users get the option without having to
> > > >> > > > > do manual intervention.
> > > >> > > > >
> > > >> > > > > > example:
> > > >> > > > > > ethtool -s enp1s0f0 relaxorder off ethtool -s enp1s0f0
> > > >> > > > > > relaxorder on
> > > >> > > > >
> > > >> > > > > Doing it via ethtool is a developer API (for testing) not
> > > >> > > > > something that makes sense in production.
> > > >> > > >
> > > >> > > >
> > > >> > > > This feature is not mandatory for all users, acturally
> > > >> > > > relax ordering default configuration of 82599 is 'disable',
> > > >> > > > So this patch gives one way to enable relax ordering to be
> > > >> > > > selected in some performance condition.
> > > >> > >
> > > >> > > That isn't quite true. The default for Sparc systems is to
> > > >> > > have it enabled.
> > > >> > >
> > > >> > > Really this is something that is platform specific. I agree
> > > >> > > with Stephen that it would work better if this was handled as
> > > >> > > a series of platform specific quirks handled at something
> > > >> > > like the PCI layer rather than be a switch the user can toggle on and off.
> > > >> > >
> > > >> > > With that being said there are changes being made that should
> > > >> > > help to improve the situation. Specifically I am looking at
> > > >> > > adding support for the DMA_ATTR_WEAK_ORDERING which may also
> > > >> > > allow us to identify cases where you might be able to specify
> > > >> > > the DMA behavior via the DMA mapping instead of having to
> > > >> > > make the final decision in the device itself.
> > > >> > >
> > > >> > > - Alex
> > > >> >
> > > >> > Yes, Sparc is a special case. From the NIC driver point of
> > > >> > view, It is no need for some ARCHs to do particular operation
> > > >> > and compiling branch, ethtool is a flexible method for user to
> > > >> > make decision whether
> > > >> > on|off this feature.
> > > >> > I think Jeff as maintainer of 82599 has some comments about this.
> > > >>
> > > >> My original comment/objection was that you attempted to do this
> > > >> change as a module parameter to the ixgbe driver, where I
> > > >> directed you to use ethtool so that other drivers could benefit
> > > >> from the ability to enable/disable relaxed ordering. As far as
> > > >> how it gets implemented in ethtool or PCI layer, makes little
> > > >> difference to me, I only had issues with the driver specific
> > > >> module parameter implementation, which is not acceptable.
> > > >
> > > >
> > > > Thank you Jeff and Alex.
> > > > And then I have gone through mail thread about "i40e: enable PCIe
> > > > relax ordering for SPARC", It only works for SPARC, any other ARCH
> > > > who wants to enable DMA_ATTR_WEAK_ORDERING feature, should define
> > > > the new macro, recompile the driver module.
> > > >
> > > > Because of the above reasons, we implement in ethtool to give the
> > > > final user a convenient way to on|off special feature, no need
> > > > define new macro, easy to extend the new features, and also good
> > > > benefit for other driver as Jeff referred.
> > > >
> > >
> > > I think the point is we shouldn't base the decision on user input.
> > > The fact is the PCIe device control register should have a bit that
> > > indicates if the device is allowed to enable relaxed ordering or not.
> > > If we can guarantee that the bit is set in all the cases where it
> > > should be set, and cleared in all the cases where it should not then
> > > we could use something like that to determine if the device is
> > > supposed to enable relaxed ordering instead of trying to make the
> > > decision ourselves.
> > >
> > > - Alex
> >
> > ok. We are focusing on the register.
> > And yes, to enable relax ordering for 82599 should be set by one or
> > more bits of Rx/TX DCA Control Register, these bits should be set in
> > many cpu architectures, such as arm64, sparc, and so on, and should be
> > cleared in other ARCHs.
> > By the way, how do you enable SPARC macro, how and where to define
> > this compiling macro when user one to enable relax ordering under SPARC
> > system?
> > #ifndef CONFIG_SPARC
> >
> >
>
>
> Hi, Alex,
> Have you already sent out the patches about DMA_ATTR_WEAK_ORDERING?
> We want to get you how to enable DMA_ATTR_WEAK_ORDERING by PCIe layer,
> and we can refer to that.
I have verified that DMA_ATTR_WEAK_ORDERING is not usable on our system (arm64 with 82599).
We need to enable relaxed ordering in the 82599 DCA control registers to improve performance.
Since Stephen Hemminger does not recommend using ethtool to set the relaxed ordering feature,
@Jeff, do you agree with using an erratum config option to enable RO mode on the 82599?
The code would look like this:
In Kconfig:
+config HI_ERRATUM_xxxx
In ixgbe_82599.c
#if !defined(CONFIG_SPARC) && !defined(CONFIG_HI_ERRATUM_xxxx)
/* Disable relaxed ordering */
for (i = 0; ((i < hw->mac.max_tx_queues) &&
(i < IXGBE_DCA_MAX_QUEUES_82598)); i++) {
regval = IXGBE_READ_REG(hw, IXGBE_DCA_TXCTRL(i));
regval &= ~IXGBE_DCA_TXCTRL_DESC_WRO_EN;
IXGBE_WRITE_REG(hw, IXGBE_DCA_TXCTRL(i), regval);
}
for (i = 0; ((i < hw->mac.max_rx_queues) &&
(i < IXGBE_DCA_MAX_QUEUES_82598)); i++) {
regval = IXGBE_READ_REG(hw, IXGBE_DCA_RXCTRL(i));
regval &= ~(IXGBE_DCA_RXCTRL_DATA_WRO_EN |
IXGBE_DCA_RXCTRL_HEAD_WRO_EN);
IXGBE_WRITE_REG(hw, IXGBE_DCA_RXCTRL(i), regval);
}
#endif
^ permalink raw reply
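To make the intent of the snippet above easier to check, here is a small user-space sketch of the two pieces involved: the condition under which relaxed ordering gets disabled, and the read-modify-write that clears the enable bits. Everything prefixed with demo_ or DEMO_ is hypothetical; the bit masks are illustrative placeholders rather than the authoritative ixgbe definitions, and plain arrays stand in for the per-queue DCA control registers. The sketch assumes the goal is to leave relaxed ordering enabled on SPARC and on platforms selected by the erratum option, which means both conditions have to be absent before the bits are cleared.

```c
#include <stdint.h>

/* Illustrative bit masks; the real definitions live in the ixgbe headers. */
#define DEMO_TXCTRL_DESC_WRO_EN (1u << 11)
#define DEMO_RXCTRL_DATA_WRO_EN (1u << 13)
#define DEMO_RXCTRL_HEAD_WRO_EN (1u << 15)

/* Nonzero when the relaxed-ordering bits should be cleared: only if the
 * platform is neither SPARC nor covered by the erratum option.
 */
static int demo_should_disable_ro(int is_sparc, int has_erratum)
{
	return !is_sparc && !has_erratum;
}

/* Read-modify-write over arrays that stand in for the queue registers. */
static void demo_disable_ro(uint32_t *txctrl, int ntx,
			    uint32_t *rxctrl, int nrx)
{
	int i;

	for (i = 0; i < ntx; i++)
		txctrl[i] &= ~DEMO_TXCTRL_DESC_WRO_EN;
	for (i = 0; i < nrx; i++)
		rxctrl[i] &= ~(DEMO_RXCTRL_DATA_WRO_EN |
			       DEMO_RXCTRL_HEAD_WRO_EN);
}
```

Combining the two tests with a logical OR instead would clear the bits whenever either option is missing, which would also disable relaxed ordering on SPARC builds that do not set the erratum option.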