* Re: DSA vs. SWTICHDEV ?
From: Joakim Tjernlund @ 2016-11-30 14:30 UTC (permalink / raw)
To: andrew@lunn.ch; +Cc: netdev@vger.kernel.org
In-Reply-To: <20161130135257.GC18716@lunn.ch>
On Wed, 2016-11-30 at 14:52 +0100, Andrew Lunn wrote:
> On Wed, Nov 30, 2016 at 08:50:34AM +0000, Joakim Tjernlund wrote:
> > I am trying to wrap my head around these two "devices" and have a hard time telling them apart.
> > We are looking att adding a faily large switch(over PCIe) to our board and from what I can tell
> > switchdev is the new way to do it but DSA is still there. Is it possible to just list
> > how they differ?
>
> Hi Joakim
Hi Andrew, thanks for answering
>
> If the interface you use to send frames from the host to the switch is
> PCIe, you probably want to use switchdev directly.
OK, we will have a few ethernet I/F's connected too but I these should be used
as normal interfaces just interfacing a switch.
>
> DSA devices all use a host Ethernet interface to send frames to the
> switch. DSA sits under switchdev, and effectively provides a lot of
> the common stuff needed for implementing switch drivers of this
> sort. It creates the slave interfaces, links the MAC to the PHY, has
> one uniform device tree binding which all DSA switches have, deals
> with encapsulation/decapsulating frames sent over the master device,
> etc.
And switchdev can do all this over PCIe instead? Can you have a switch tree in switchdev too?
We probably need to push a custom tag for final step (after the switch) and I think that
should be doable with a custom VLAN driver over one/several of switchdev's virtual
interfaces?
Joakim
^ permalink raw reply
* Re: [net PATCH 0/2] Don't use lco_csum to compute IPv4 checksum
From: David Miller @ 2016-11-30 14:47 UTC (permalink / raw)
To: alexander.h.duyck; +Cc: netdev, intel-wired-lan, sfr, jeffrey.t.kirsher
In-Reply-To: <20161128153927.15466.99193.stgit@ahduyck-blue-test.jf.intel.com>
From: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Mon, 28 Nov 2016 10:42:18 -0500
> When I implemented the GSO partial support in the Intel drivers I was using
> lco_csum to compute the checksum that we needed to plug into the IPv4
> checksum field in order to cancel out the data that was not a part of the
> IPv4 header. However this didn't take into account that the transport
> offset might be pointing to the inner transport header.
>
> Instead of using lco_csum I have just coded around it so that we can use
> the outer IP header plus the IP header length to determine where we need to
> start our checksum and then just call csum_partial ourselves.
>
> This should fix the SIT issue reported on igb interfaces as well as simliar
> issues that would pop up on other Intel NICs.
Jeff, are you going to send me a pull request with this stuff or would
you be OK with my applying these directly to 'net'?
Thanks.
^ permalink raw reply
* DSA vs envelope frames
From: Nikita Yushchenko @ 2016-11-30 14:58 UTC (permalink / raw)
To: Toshiaki Makita, Andy Duan, David S. Miller, Troy Kisky,
Andrew Lunn, Eric Nelson, Philippe Reynes, Johannes Berg,
netdev@vger.kernel.org
Cc: Chris Healy, Fabio Estevam, linux-kernel@vger.kernel.org,
Vivien Didelot, lorian Fainelli
In-Reply-To: <f632a0e8-746b-2db5-09af-3a5f6fc78e13@lab.ntt.co.jp>
>> (1) When DSA is in use, frames processed by FEC chip contain DSA tag and
>> thus can be larger than hardcoded limit of 1522. This issue is not
>> FEC-specific, any driver that hardcodes maximum frame size to 1522 (many
>> do) will have this issue if used with DSA.
>
> BTW I'm trying to introduce envelope frames to solve this kind of problems.
> http://marc.info/?t=147496691500005&r=1&w=2
> http://marc.info/?t=147496691500003&r=1&w=2
> http://marc.info/?t=147496691500002&r=1&w=2
> http://marc.info/?t=147496691500004&r=1&w=2
> http://marc.info/?t=147496691500001&r=1&w=2
>
> It needs jumbo frame support of NICs though.
Thanks for pointing to this.
Indeed frame with DSA tag conceptually is an envelope frame.
ndev->env_hdr_len introduced by your patches, actually is explicitly
handled difference between (MTU + 18) and frame that HW should allow.
If this is known, hardware can be configured to work with DSA. At least
FEC hardware that can send and receive "slightly larger" frames after
simple register configuration.
Furthermore, since DSA configuration is known statically (it comes from
device tree), ndo_set_env_hdr_len method could be automatically called
at init, making setup working by default if driver supports that. And if
not, perhaps can automatically lower MTU.
Looks like a solution :)
What's current status of this work?
What is not really clear - what if several tagging protocols are used
together. AFAIU, things may be more complex that simple appending of
tags, e.g. EDSA tag can carry VLAN id inside.
Nikita
^ permalink raw reply
* Re: DSA vs envelope frames
From: Andrew Lunn @ 2016-11-30 15:10 UTC (permalink / raw)
To: Nikita Yushchenko
Cc: Toshiaki Makita, Andy Duan, David S. Miller, Troy Kisky,
Eric Nelson, Philippe Reynes, Johannes Berg,
netdev@vger.kernel.org, Chris Healy, Fabio Estevam,
linux-kernel@vger.kernel.org, Vivien Didelot, florian Fainelli
In-Reply-To: <d8a6df03-d9bd-eca5-0b9f-73406efe6509@cogentembedded.com>
> What is not really clear - what if several tagging protocols are used
> together. AFAIU, things may be more complex that simple appending of
> tags, e.g. EDSA tag can carry VLAN id inside.
Hi Nikita
At least for all current tagging protocols, the size of the tag is
constant. And you cannot run different tagging protocols on the same
master interface at the same time.
However, i think Florian tried something funky with the SF2 and B53
driver. He has a b53 hanging off a sf2. So i think he used nested
tagging protocols!
Andrew
^ permalink raw reply
* Re: Receive offloads, small RCVBUF and zero TCP window
From: Alex Sidorenko @ 2016-11-30 15:10 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Marcelo Ricardo Leitner
In-Reply-To: <15427752.RvQkb5CQdb@zbook>
On Monday, November 28, 2016 4:14:04 PM EST Alex Sidorenko wrote:
> On Monday, November 28, 2016 3:54:59 PM EST David Miller wrote:
> > From: Alex Sidorenko <alexandre.sidorenko@hpe.com>
> > Date: Mon, 28 Nov 2016 15:49:26 -0500
> >
> > > Now the question is whether is is OK to have icsk->icsk_ack.rcv_mss
> > > larger than MTU.
> >
> > It absolutely is not OK.
> >
> > If VMWare wants to receive large frames for batching purposes it must
> > use GRO or similar to achieve that, not just send vanilla frames into
> > the stack which are larger than the device MTU.
> >
>
> As VMWare's vmxnet3 driver is open-sourced and part of generic kernel, do you think the problem is in that driver or elsewhere? I looked at vmxnet3 sources and see that it uses LRO/GRO subroutines. Unfortunately, I don't understand its logic enough to see whether they are doing anything incorrectly.
I think this has been already fixed in recent versions of vmxnet3 driver (but not in RHEL6). VMWare/ESX can pass us aggregated large SKBs indeed (> MTU) if LRO is enabled, but the driver takes care of that in vmxnet3_rq_rx_complete():
} else if (segCnt != 0 || skb->len > mtu) {
u32 hlen;
hlen = vmxnet3_get_hdr_len(adapter, skb,
(union Vmxnet3_GenericDesc *)rcd);
if (hlen == 0)
goto not_lro;
skb_shinfo(skb)->gso_type =
rcd->v4 ? SKB_GSO_TCPV4 : SKB_GSO_TCPV6;
if (segCnt != 0) {
skb_shinfo(skb)->gso_segs = segCnt;
skb_shinfo(skb)->gso_size =
DIV_ROUND_UP(skb->len -
hlen, segCnt);
} else {
skb_shinfo(skb)->gso_size = mtu - hlen;
}
}
So if packets have been aggregated,
u8 segCnt; /* Number of aggregated packets */
we compute gso_size by dividing large skb->len by the number.
I still like Marcelo's idea of printing a warning when icsk->icsk_ack.rcv_mss looks unreasonable, should really help with detecting buggy drivers.
^ permalink raw reply
* Re: [PATCH] cpsw: ethtool: add support for nway reset
From: David Miller @ 2016-11-30 15:13 UTC (permalink / raw)
To: yegorslists; +Cc: netdev, linux-omap, grygorii.strashko, mugunthanvnm
In-Reply-To: <CAGm1_ktxyxVxYdF0Aq-Nv2rVGjEccPsGRZSXHBOE9egybiV1Lg@mail.gmail.com>
From: Yegor Yefremov <yegorslists@googlemail.com>
Date: Wed, 30 Nov 2016 10:31:30 +0100
> Hi David,
>
> On Wed, Nov 30, 2016 at 1:41 AM, David Miller <davem@davemloft.net> wrote:
>> From: yegorslists@googlemail.com
>> Date: Mon, 28 Nov 2016 10:47:52 +0100
>>
>>> From: Yegor Yefremov <yegorslists@googlemail.com>
>>>
>>> This patch adds support for ethtool's '-r' command. Restarting
>>> N-WAY negotiation can be useful to activate newly changed EEE
>>> settings etc.
>>>
>>> Signed-off-by: Yegor Yefremov <yegorslists@googlemail.com>
>>
>> This doesn't apply cleanly to net-next.
>
> My previous patch [1] doesn't show up in net-next. This could explain,
> why nway patch doesn't apply.
> Should I resend them both as series?
>
> [1] http://marc.info/?l=linux-omap&m=148036822211869&w=2
My bad, I sorted this out and applied the nway-reset patch too.
Sorry about that.
^ permalink raw reply
* Re: [PATCH v2 net-next] sfc: separate out SFC4000 ("Falcon") support into new sfc-falcon driver
From: David Miller @ 2016-11-30 15:18 UTC (permalink / raw)
To: ecree; +Cc: linux-net-drivers, bkenward, netdev
In-Reply-To: <624b8ce9-12dd-1b89-ae9f-8a092751bf12@solarflare.com>
Applied, but yes you should really rethink the Kconfig stuff.
I would strongly suggest you simply make them independent Kconfig
options with no attachment to eachother. No depends, no implies,
nothing like that.
^ permalink raw reply
* Re: [PATCH net-next v3 0/2] net: phy: broadcom: Support for PHY counters
From: David Miller @ 2016-11-30 15:22 UTC (permalink / raw)
To: f.fainelli
Cc: netdev, andrew, bcm-kernel-feedback-list, allan.nielsen,
raju.lakkaraju
In-Reply-To: <20161129175718.20213-1-f.fainelli@gmail.com>
From: Florian Fainelli <f.fainelli@gmail.com>
Date: Tue, 29 Nov 2016 09:57:16 -0800
> This patch series adds support for reading the Broadcom PHYs internal counters.
>
> Changes in v3:
>
> - fixed the allocation of priv->stats in bcm7xxx
>
> Changes in v2:
>
> - fixed modular build reported by kbuild
>
> - constify array of stats
Series applied, thanks Florian.
^ permalink raw reply
* Re: DSA vs. SWTICHDEV ?
From: Andrew Lunn @ 2016-11-30 15:25 UTC (permalink / raw)
To: Joakim Tjernlund; +Cc: netdev@vger.kernel.org
In-Reply-To: <1480516241.3563.142.camel@infinera.com>
On Wed, Nov 30, 2016 at 02:30:43PM +0000, Joakim Tjernlund wrote:
> On Wed, 2016-11-30 at 14:52 +0100, Andrew Lunn wrote:
> > On Wed, Nov 30, 2016 at 08:50:34AM +0000, Joakim Tjernlund wrote:
> > > I am trying to wrap my head around these two "devices" and have a hard time telling them apart.
> > > We are looking att adding a faily large switch(over PCIe) to our board and from what I can tell
> > > switchdev is the new way to do it but DSA is still there. Is it possible to just list
> > > how they differ?
> >
> > Hi Joakim
>
> Hi Andrew, thanks for answering
>
> >
> > If the interface you use to send frames from the host to the switch is
> > PCIe, you probably want to use switchdev directly.
>
> OK, we will have a few ethernet I/F's connected too but I these should be used
> as normal interfaces just interfacing a switch.
That does not make much sense.
Maybe time to backtrack a bit. The Linux concept for switch/router
chips is that they are just hardware accelerators for what Linux can
already do in software. Each port of the switch is just a normal Linux
interface. ip link show will list each port. ip addr add can be used
to add an IP address to the interface. You want to switch frames
between two ports: Create a linux bridge and put the interfaces into
it. Via switchdev you get a call into the hardware to accelerate
this. If the hardware cannot accelerate it, it is done in software as
normal. Want to combine two ports into a trunk: Add a team interface
and make the port interfaces slaves of the team interface. Via
switchdev, you ask the hardware to accelerate this. If it cannot, it
is done in software.
So back your connecting a few host interfaces to the switch. This is
logically putting a cable between two interfaces on the same host. You
are making a loopback. Why do that? Sure it is possible, but it is an
odd architecture.
> And switchdev can do all this over PCIe instead? Can you have a
> switch tree in switchdev too?
Mellonex says so, but i don't think they have actually implemented it.
Andrew
^ permalink raw reply
* Re: [PATCH net-next] bpf, xdp: allow to pass flags to dev_change_xdp_fd
From: David Miller @ 2016-11-30 15:27 UTC (permalink / raw)
To: daniel; +Cc: alexei.starovoitov, netdev
In-Reply-To: <cd0871f11c476f8d7fee1cb356940a13cf7d4807.1480370617.git.daniel@iogearbox.net>
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Mon, 28 Nov 2016 23:16:54 +0100
> Add an IFLA_XDP_FLAGS attribute that can be passed for setting up
> XDP along with IFLA_XDP_FD, which eventually allows user space to
> implement typical add/replace/delete logic for programs. Right now,
> calling into dev_change_xdp_fd() will always replace previous programs.
>
> When passed XDP_FLAGS_UPDATE_IF_NOEXIST, we can handle this more
> graceful when requested by returning -EBUSY in case we try to
> attach a new program, but we find that another one is already
> attached. This will be used by upcoming front-end for iproute2 as
> well.
>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
Applied, thanks Daniel.
^ permalink raw reply
* RE: Aw: Re: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: David Laight @ 2016-11-30 15:28 UTC (permalink / raw)
To: 'Eric Dumazet', Lino Sanfilippo
Cc: David Miller, netdev, Tariq Toukan
In-Reply-To: <1480378759.18162.105.camel@edumazet-glaptop3.roam.corp.google.com>
From: Eric Dumazet
> Sent: 29 November 2016 00:19
> On Mon, 2016-11-28 at 23:02 +0100, Lino Sanfilippo wrote:
> > Hi Eric,
> >
> > On 25.11.2016 20:19, Eric Dumazet wrote:
> > > On Fri, 2016-11-25 at 17:30 +0100, Lino Sanfilippo wrote:
> > >> Hi,
> > >>
> > >>
> > >> >
> > >> > The READ_ONCE() are documenting the fact that no lock is taken to fetch
> > >> > the stats, while another cpus might being changing them.
> > >> >
> > >> > I had no answer yet from https://patchwork.ozlabs.org/patch/698449/
> > >> >
> > >> > So I thought it was not needed to explain this in the changelog, given
> > >> > that it apparently is one of the few things that can block someone to
> > >> > understand one of my changes :/
> > >> >
> > >> > Apparently nobody really understands READ_ONCE() purpose, it is really a
> > >> > pity we have to explain this over and over.
> > >> >
> > >>
> > >> Even at the risk of showing once more a lack of understanding for
> > >> READ_ONCE():
> > >> Does not a READ_ONCE() have to e paired with some kind of
> > >> WRITE_ONCE()?
> > >
> > > You are right.
> > >
> > > Although in this case, the producers are using a lock, and do
> > >
> > > ring->packets++;
> > >
> > > We hopefully have compilers/cpus that do not put intermediate garbage in
> > > ring->packets while doing the increment.
> > >
> > > One problem with :
> > >
> > > WRITE_ONCE(ring->packets, ring->packets + 1);
> > >
> > > is that gcc no longer uses an INC instruction.
> >
> > I see. So we would have to do something like
> >
> > tmp = ring->packets;
> > tmp++;
> > WRITE_ONCE(ring->packets, tmp);
>
>
> Well, gcc will generate a code with more instructions than a mere
>
> "inc offset(%register)"
Are you sure??
Last I looked gcc seemed to convert 'foo++' to 'foo = foo + 1' before
generating any code.
It might then optimise that back to a memory increment, but that would
also happen if you'd coded the latter form.
> Which is kind of unfortunate, given it is the fast path.
>
> Better add a comment, like :
>
> /* We should use WRITE_ONCE() to pair with the READ_ONCE() found in xxxx()
> * But gcc would generate non optimal code.
> */
Actually while READ_ONCE() is generally useful - to get a snapshot of a changing value.
WRITE_ONCE() isn't a pairing - the compiler is highly unlikely to write a
location twice.
You might want an annotation to ensure is doesn't assume it can read the value
back (write through volatile pointer). But that has nothing to do with how readers behave.
David
^ permalink raw reply
* Re: [PATCH] net: brocade: bna: use new api ethtool_{get|set}_link_ksettings
From: David Miller @ 2016-11-30 15:29 UTC (permalink / raw)
To: tremyfr
Cc: rasesh.mody, sudarsana.kalluru, Dept-GELinuxNICDev, netdev,
linux-kernel
In-Reply-To: <1480373539-3257-1-git-send-email-tremyfr@gmail.com>
From: Philippe Reynes <tremyfr@gmail.com>
Date: Mon, 28 Nov 2016 23:52:19 +0100
> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
>
> Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Applied.
^ permalink raw reply
* Re: [net-next 1/1] samples: bpf: Refactor test_cgrp2_attach -- use getopt, and add mode
From: David Miller @ 2016-11-30 15:29 UTC (permalink / raw)
To: sargun; +Cc: netdev, daniel, ast
In-Reply-To: <20161128225240.GA8761@ircssh.c.rugged-nimbus-611.internal>
From: Sargun Dhillon <sargun@sargun.me>
Date: Mon, 28 Nov 2016 14:52:42 -0800
> This patch modifies test_cgrp2_attach to use getopt so we can use standard
> command line parsing.
>
> It also adds an option to run the program in detach only mode. This does
> not attach a new filter at the cgroup, but only runs the detach command.
>
> Lastly, it changes the attach code to not detach and then attach. It relies
> on the 'hotswap' behaviour of CGroup BPF programs to be able to change
> in-place. If detach-then-attach behaviour needs to be tested, the example
> can be run in detach only mode prior to attachment.
>
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Applied, thanks.
^ permalink raw reply
* [PATCH net-next 0/8] Offloading tc rules using underline Hardware device
From: Hadar Hen Zion @ 2016-11-30 14:41 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Saeed Mahameed, Jiri Pirko, Amir Vadai, Or Gerlitz,
Roi Dayan, Hadar Hen Zion
This series adds flower classifier support in offloading tc rules when the
Software ingress device is different from the Hardware ingress device,
such as when dealing with IP tunnels
The first two patches are a small fixes to flower, checking the skip_hw flag
wasn't set before calling the Hardware offloading functions which will try to
offload the rule.
The next two patches are infrastructure patches, a preparation for the fourth
patch which is adding support in flower to offload rules when the ingress
device is not a Hardware device and therefore can't offload.
In this case ndo_setup_tc is called with the mirred (egress) device.
The last three patchs are adding mlx5e support to offload rules using the new
"egress_device" flag.
Thanks,
Hadar
Hadar Hen Zion (8):
net/sched: Add separate check for skip_hw flag
net/sched: cls_flower: Try to offload only if skip_hw flag isn't set
net/sched: cls_flower: Provide a filter to replace/destroy hardware
filter functions
net/sched: act_mirred: Add new tc_action_ops get_dev()
net/sched: cls_flower: Add offload support using egress Hardware
device
net/mlx5e: Bring back representor's ndos that were accidentally
removed
net/mlx5e: Save the represntor netdevice as part of the representor
net/mlx5e: Support adding ingress tc rule when egress device flag is
set
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 25 +++++--
drivers/net/ethernet/mellanox/mlx5/core/eswitch.h | 3 +-
.../ethernet/mellanox/mlx5/core/eswitch_offloads.c | 12 ++-
include/linux/netdevice.h | 1 +
include/net/act_api.h | 2 +
include/net/pkt_cls.h | 21 +++++-
net/sched/act_mirred.c | 12 +++
net/sched/cls_api.c | 22 ++++++
net/sched/cls_flower.c | 87 ++++++++++++----------
10 files changed, 133 insertions(+), 54 deletions(-)
--
1.8.3.1
^ permalink raw reply
* [PATCH net-next 1/8] net/sched: Add separate check for skip_hw flag
From: Hadar Hen Zion @ 2016-11-30 14:41 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Saeed Mahameed, Jiri Pirko, Amir Vadai, Or Gerlitz,
Roi Dayan, Hadar Hen Zion
In-Reply-To: <1480516895-29545-1-git-send-email-hadarh@mellanox.com>
Creating a difference between two possible cases:
1. Not offloading tc rule since the user sets 'skip_hw' flag.
2. Not offloading tc rule since the device doesn't support offloading.
This patch doesn't add any new functionality.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
---
include/net/pkt_cls.h | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 767b03a..45ad9aa 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -425,16 +425,14 @@ struct tc_cls_u32_offload {
};
};
-static inline bool tc_should_offload(const struct net_device *dev,
- const struct tcf_proto *tp, u32 flags)
+static inline bool tc_can_offload(const struct net_device *dev,
+ const struct tcf_proto *tp)
{
const struct Qdisc *sch = tp->q;
const struct Qdisc_class_ops *cops = sch->ops->cl_ops;
if (!(dev->features & NETIF_F_HW_TC))
return false;
- if (flags & TCA_CLS_FLAGS_SKIP_HW)
- return false;
if (!dev->netdev_ops->ndo_setup_tc)
return false;
if (cops && cops->tcf_cl_offload)
@@ -443,6 +441,19 @@ static inline bool tc_should_offload(const struct net_device *dev,
return true;
}
+static inline bool tc_skip_hw(u32 flags)
+{
+ return (flags & TCA_CLS_FLAGS_SKIP_HW) ? true : false;
+}
+
+static inline bool tc_should_offload(const struct net_device *dev,
+ const struct tcf_proto *tp, u32 flags)
+{
+ if (tc_skip_hw(flags))
+ return false;
+ return tc_can_offload(dev, tp);
+}
+
static inline bool tc_skip_sw(u32 flags)
{
return (flags & TCA_CLS_FLAGS_SKIP_SW) ? true : false;
--
1.8.3.1
^ permalink raw reply related
* [PATCH net-next 3/8] net/sched: cls_flower: Provide a filter to replace/destroy hardware filter functions
From: Hadar Hen Zion @ 2016-11-30 14:41 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Saeed Mahameed, Jiri Pirko, Amir Vadai, Or Gerlitz,
Roi Dayan, Hadar Hen Zion
In-Reply-To: <1480516895-29545-1-git-send-email-hadarh@mellanox.com>
Instead of providing many arguments to fl_hw_{replace/destroy}_filter
functions, just provide cls_fl_filter struct that includes all the relevant
args.
This patches doesn't add any new functionality.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
net/sched/cls_flower.c | 27 +++++++++++----------------
1 file changed, 11 insertions(+), 16 deletions(-)
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 5e70f65..13b349f 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -201,7 +201,7 @@ static void fl_destroy_filter(struct rcu_head *head)
kfree(f);
}
-static void fl_hw_destroy_filter(struct tcf_proto *tp, unsigned long cookie)
+static void fl_hw_destroy_filter(struct tcf_proto *tp, struct cls_fl_filter *f)
{
struct net_device *dev = tp->q->dev_queue->dev;
struct tc_cls_flower_offload offload = {0};
@@ -211,7 +211,7 @@ static void fl_hw_destroy_filter(struct tcf_proto *tp, unsigned long cookie)
return;
offload.command = TC_CLSFLOWER_DESTROY;
- offload.cookie = cookie;
+ offload.cookie = (unsigned long)f;
tc.type = TC_SETUP_CLSFLOWER;
tc.cls_flower = &offload;
@@ -222,9 +222,7 @@ static void fl_hw_destroy_filter(struct tcf_proto *tp, unsigned long cookie)
static int fl_hw_replace_filter(struct tcf_proto *tp,
struct flow_dissector *dissector,
struct fl_flow_key *mask,
- struct fl_flow_key *key,
- struct tcf_exts *actions,
- unsigned long cookie, u32 flags)
+ struct cls_fl_filter *f)
{
struct net_device *dev = tp->q->dev_queue->dev;
struct tc_cls_flower_offload offload = {0};
@@ -232,14 +230,14 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
int err;
if (!tc_can_offload(dev, tp))
- return tc_skip_sw(flags) ? -EINVAL : 0;
+ return tc_skip_sw(f->flags) ? -EINVAL : 0;
offload.command = TC_CLSFLOWER_REPLACE;
- offload.cookie = cookie;
+ offload.cookie = (unsigned long)f;
offload.dissector = dissector;
offload.mask = mask;
- offload.key = key;
- offload.exts = actions;
+ offload.key = &f->key;
+ offload.exts = &f->exts;
tc.type = TC_SETUP_CLSFLOWER;
tc.cls_flower = &offload;
@@ -247,7 +245,7 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
err = dev->netdev_ops->ndo_setup_tc(dev, tp->q->handle, tp->protocol,
&tc);
- if (tc_skip_sw(flags))
+ if (tc_skip_sw(f->flags))
return err;
return 0;
@@ -276,7 +274,7 @@ static void __fl_delete(struct tcf_proto *tp, struct cls_fl_filter *f)
{
list_del_rcu(&f->list);
if (!tc_skip_hw(f->flags))
- fl_hw_destroy_filter(tp, (unsigned long)f);
+ fl_hw_destroy_filter(tp, f);
tcf_unbind_filter(tp, &f->res);
call_rcu(&f->rcu, fl_destroy_filter);
}
@@ -748,10 +746,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
err = fl_hw_replace_filter(tp,
&head->dissector,
&mask.key,
- &fnew->key,
- &fnew->exts,
- (unsigned long)fnew,
- fnew->flags);
+ fnew);
if (err)
goto errout;
}
@@ -760,7 +755,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
rhashtable_remove_fast(&head->ht, &fold->ht_node,
head->ht_params);
if (!tc_skip_hw(fold->flags))
- fl_hw_destroy_filter(tp, (unsigned long)fold);
+ fl_hw_destroy_filter(tp, fold);
}
*arg = (unsigned long) fnew;
--
1.8.3.1
^ permalink raw reply related
* [PATCH net-next 4/8] net/sched: act_mirred: Add new tc_action_ops get_dev()
From: Hadar Hen Zion @ 2016-11-30 14:41 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Saeed Mahameed, Jiri Pirko, Amir Vadai, Or Gerlitz,
Roi Dayan, Hadar Hen Zion
In-Reply-To: <1480516895-29545-1-git-send-email-hadarh@mellanox.com>
Adding support to a new tc_action_ops.
get_dev is a general option which allows to get the underline
device when trying to offload a tc rule.
In case of mirred action the returned device is the mirred (egress)
device.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
---
include/net/act_api.h | 2 ++
net/sched/act_mirred.c | 12 ++++++++++++
2 files changed, 14 insertions(+)
diff --git a/include/net/act_api.h b/include/net/act_api.h
index d8eae87..9dddf77 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -119,6 +119,8 @@ struct tc_action_ops {
int (*walk)(struct net *, struct sk_buff *,
struct netlink_callback *, int, const struct tc_action_ops *);
void (*stats_update)(struct tc_action *, u64, u32, u64);
+ int (*get_dev)(const struct tc_action *a, struct net *net,
+ struct net_device **mirred_dev);
};
struct tc_action_net {
diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 1af7baa..bb09ba3 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -315,6 +315,17 @@ static int mirred_device_event(struct notifier_block *unused,
.notifier_call = mirred_device_event,
};
+static int tcf_mirred_device(const struct tc_action *a, struct net *net,
+ struct net_device **mirred_dev)
+{
+ int ifindex = tcf_mirred_ifindex(a);
+
+ *mirred_dev = __dev_get_by_index(net, ifindex);
+ if (!mirred_dev)
+ return -EINVAL;
+ return 0;
+}
+
static struct tc_action_ops act_mirred_ops = {
.kind = "mirred",
.type = TCA_ACT_MIRRED,
@@ -327,6 +338,7 @@ static int mirred_device_event(struct notifier_block *unused,
.walk = tcf_mirred_walker,
.lookup = tcf_mirred_search,
.size = sizeof(struct tcf_mirred),
+ .get_dev = tcf_mirred_device,
};
static __net_init int mirred_init_net(struct net *net)
--
1.8.3.1
^ permalink raw reply related
* [PATCH net-next 2/8] net/sched: cls_flower: Try to offload only if skip_hw flag isn't set
From: Hadar Hen Zion @ 2016-11-30 14:41 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Saeed Mahameed, Jiri Pirko, Amir Vadai, Or Gerlitz,
Roi Dayan, Hadar Hen Zion
In-Reply-To: <1480516895-29545-1-git-send-email-hadarh@mellanox.com>
Check skip_hw flag isn't set before calling
fl_hw_{replace/destroy}_filter and fl_hw_update_stats functions.
Replace the call to tc_should_offload with tc_can_offload.
tc_can_offload only checks if the device supports offloading, the check for
skip_hw flag is done earlier in the flow.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
net/sched/cls_flower.c | 35 ++++++++++++++++++++---------------
1 file changed, 20 insertions(+), 15 deletions(-)
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index e8dd09a..5e70f65 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -207,7 +207,7 @@ static void fl_hw_destroy_filter(struct tcf_proto *tp, unsigned long cookie)
struct tc_cls_flower_offload offload = {0};
struct tc_to_netdev tc;
- if (!tc_should_offload(dev, tp, 0))
+ if (!tc_can_offload(dev, tp))
return;
offload.command = TC_CLSFLOWER_DESTROY;
@@ -231,7 +231,7 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
struct tc_to_netdev tc;
int err;
- if (!tc_should_offload(dev, tp, flags))
+ if (!tc_can_offload(dev, tp))
return tc_skip_sw(flags) ? -EINVAL : 0;
offload.command = TC_CLSFLOWER_REPLACE;
@@ -259,7 +259,7 @@ static void fl_hw_update_stats(struct tcf_proto *tp, struct cls_fl_filter *f)
struct tc_cls_flower_offload offload = {0};
struct tc_to_netdev tc;
- if (!tc_should_offload(dev, tp, 0))
+ if (!tc_can_offload(dev, tp))
return;
offload.command = TC_CLSFLOWER_STATS;
@@ -275,7 +275,8 @@ static void fl_hw_update_stats(struct tcf_proto *tp, struct cls_fl_filter *f)
static void __fl_delete(struct tcf_proto *tp, struct cls_fl_filter *f)
{
list_del_rcu(&f->list);
- fl_hw_destroy_filter(tp, (unsigned long)f);
+ if (!tc_skip_hw(f->flags))
+ fl_hw_destroy_filter(tp, (unsigned long)f);
tcf_unbind_filter(tp, &f->res);
call_rcu(&f->rcu, fl_destroy_filter);
}
@@ -743,20 +744,23 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
goto errout;
}
- err = fl_hw_replace_filter(tp,
- &head->dissector,
- &mask.key,
- &fnew->key,
- &fnew->exts,
- (unsigned long)fnew,
- fnew->flags);
- if (err)
- goto errout;
+ if (!tc_skip_hw(fnew->flags)) {
+ err = fl_hw_replace_filter(tp,
+ &head->dissector,
+ &mask.key,
+ &fnew->key,
+ &fnew->exts,
+ (unsigned long)fnew,
+ fnew->flags);
+ if (err)
+ goto errout;
+ }
if (fold) {
rhashtable_remove_fast(&head->ht, &fold->ht_node,
head->ht_params);
- fl_hw_destroy_filter(tp, (unsigned long)fold);
+ if (!tc_skip_hw(fold->flags))
+ fl_hw_destroy_filter(tp, (unsigned long)fold);
}
*arg = (unsigned long) fnew;
@@ -879,7 +883,8 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
goto nla_put_failure;
}
- fl_hw_update_stats(tp, f);
+ if (!tc_skip_hw(f->flags))
+ fl_hw_update_stats(tp, f);
if (fl_dump_key_val(skb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST,
mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK,
--
1.8.3.1
^ permalink raw reply related
* [PATCH net-next 6/8] net/mlx5e: Bring back representor's ndos that were accidentally removed
From: Hadar Hen Zion @ 2016-11-30 14:41 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Saeed Mahameed, Jiri Pirko, Amir Vadai, Or Gerlitz,
Roi Dayan, Hadar Hen Zion
In-Reply-To: <1480516895-29545-1-git-send-email-hadarh@mellanox.com>
The VF Representor udp tunnel ndo entries were removed by mistake,
return them.
Fixes: 370bad0f9a52 ('net/mlx5e: Support HW (offloaded) and SW counters for SRIOV switchdev mode')
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 5e33f6b..9b1e351 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -384,6 +384,8 @@ int mlx5e_get_offload_stats(int attr_id, const struct net_device *dev,
.ndo_get_phys_port_name = mlx5e_rep_get_phys_port_name,
.ndo_setup_tc = mlx5e_rep_ndo_setup_tc,
.ndo_get_stats64 = mlx5e_rep_get_stats,
+ .ndo_udp_tunnel_add = mlx5e_add_vxlan_port,
+ .ndo_udp_tunnel_del = mlx5e_del_vxlan_port,
.ndo_has_offload_stats = mlx5e_has_offload_stats,
.ndo_get_offload_stats = mlx5e_get_offload_stats,
};
--
1.8.3.1
^ permalink raw reply related
* [PATCH net-next 8/8] net/mlx5e: Support adding ingress tc rule when egress device flag is set
From: Hadar Hen Zion @ 2016-11-30 14:41 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Saeed Mahameed, Jiri Pirko, Amir Vadai, Or Gerlitz,
Roi Dayan, Hadar Hen Zion
In-Reply-To: <1480516895-29545-1-git-send-email-hadarh@mellanox.com>
When ndo_setup_tc is called with an egress_dev flag set, it means that
the ndo call was executed on the mirred action (egress) device and not
on the ingress device.
In order to support this kind of ndo_setup_tc call, and insert the
correct decap rule to the hardware, the uplink device on the same eswitch
should be found.
Currently, we use this resolution between the mirred device and the
uplink on the same eswitch to offload vxlan shared device decap rules.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 0868677..8503788 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -289,6 +289,14 @@ static int mlx5e_rep_ndo_setup_tc(struct net_device *dev, u32 handle,
if (TC_H_MAJ(handle) != TC_H_MAJ(TC_H_INGRESS))
return -EOPNOTSUPP;
+ if (tc->egress_dev) {
+ struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
+ struct net_device *uplink_dev = mlx5_eswitch_get_uplink_netdev(esw);
+
+ return uplink_dev->netdev_ops->ndo_setup_tc(uplink_dev, handle,
+ proto, tc);
+ }
+
switch (tc->type) {
case TC_SETUP_CLSFLOWER:
switch (tc->cls_flower->command) {
--
1.8.3.1
^ permalink raw reply related
* [PATCH net-next 5/8] net/sched: cls_flower: Add offload support using egress Hardware device
From: Hadar Hen Zion @ 2016-11-30 14:41 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Saeed Mahameed, Jiri Pirko, Amir Vadai, Or Gerlitz,
Roi Dayan, Hadar Hen Zion
In-Reply-To: <1480516895-29545-1-git-send-email-hadarh@mellanox.com>
In order to support hardware offloading when the device given by the tc
rule is different from the Hardware underline device, extract the mirred
(egress) device from the tc action when a filter is added, using the new
tc_action_ops, get_dev().
Flower caches the information about the mirred device and use it for
calling ndo_setup_tc in filter change, update stats and delete.
Calling ndo_setup_tc of the mirred (egress) device instead of the
ingress device will allow a resolution between the software ingress
device and the underline hardware device.
The resolution will take place inside the offloading driver using
'egress_device' flag added to tc_to_netdev struct which is provided to
the offloading driver.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
include/linux/netdevice.h | 1 +
include/net/pkt_cls.h | 2 ++
net/sched/cls_api.c | 22 ++++++++++++++++++++++
net/sched/cls_flower.c | 41 ++++++++++++++++++++++++-----------------
4 files changed, 49 insertions(+), 17 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4ffcd87..00c351c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -802,6 +802,7 @@ struct tc_to_netdev {
struct tc_cls_matchall_offload *cls_mall;
struct tc_cls_bpf_offload *cls_bpf;
};
+ bool egress_dev;
};
/* These structures hold the attributes of xdp state that are being passed
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 45ad9aa..f0a0514 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -171,6 +171,8 @@ void tcf_exts_change(struct tcf_proto *tp, struct tcf_exts *dst,
struct tcf_exts *src);
int tcf_exts_dump(struct sk_buff *skb, struct tcf_exts *exts);
int tcf_exts_dump_stats(struct sk_buff *skb, struct tcf_exts *exts);
+int tcf_exts_get_dev(struct net_device *dev, struct tcf_exts *exts,
+ struct net_device **hw_dev);
/**
* struct tcf_pkt_info - packet information
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index b05d4a2..e7aeab9 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -682,6 +682,28 @@ int tcf_exts_dump_stats(struct sk_buff *skb, struct tcf_exts *exts)
}
EXPORT_SYMBOL(tcf_exts_dump_stats);
+int tcf_exts_get_dev(struct net_device *dev, struct tcf_exts *exts,
+ struct net_device **hw_dev)
+{
+ const struct tc_action *a;
+ LIST_HEAD(actions);
+
+ if (tc_no_actions(exts))
+ return -EINVAL;
+
+ tcf_exts_to_list(exts, &actions);
+ list_for_each_entry(a, &actions, list) {
+ if (a->ops->get_dev) {
+ a->ops->get_dev(a, dev_net(dev), hw_dev);
+ break;
+ }
+ }
+ if (*hw_dev)
+ return 0;
+ return -EOPNOTSUPP;
+}
+EXPORT_SYMBOL(tcf_exts_get_dev);
+
static int __init tc_filter_init(void)
{
rtnl_register(PF_UNSPEC, RTM_NEWTFILTER, tc_ctl_tfilter, NULL, NULL);
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 13b349f..1cacfa5 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -78,6 +78,8 @@ struct cls_fl_filter {
u32 handle;
u32 flags;
struct rcu_head rcu;
+ struct tc_to_netdev tc;
+ struct net_device *hw_dev;
};
static unsigned short int fl_mask_range(const struct fl_flow_mask *mask)
@@ -203,9 +205,9 @@ static void fl_destroy_filter(struct rcu_head *head)
static void fl_hw_destroy_filter(struct tcf_proto *tp, struct cls_fl_filter *f)
{
- struct net_device *dev = tp->q->dev_queue->dev;
struct tc_cls_flower_offload offload = {0};
- struct tc_to_netdev tc;
+ struct net_device *dev = f->hw_dev;
+ struct tc_to_netdev *tc = &f->tc;
if (!tc_can_offload(dev, tp))
return;
@@ -213,10 +215,10 @@ static void fl_hw_destroy_filter(struct tcf_proto *tp, struct cls_fl_filter *f)
offload.command = TC_CLSFLOWER_DESTROY;
offload.cookie = (unsigned long)f;
- tc.type = TC_SETUP_CLSFLOWER;
- tc.cls_flower = &offload;
+ tc->type = TC_SETUP_CLSFLOWER;
+ tc->cls_flower = &offload;
- dev->netdev_ops->ndo_setup_tc(dev, tp->q->handle, tp->protocol, &tc);
+ dev->netdev_ops->ndo_setup_tc(dev, tp->q->handle, tp->protocol, tc);
}
static int fl_hw_replace_filter(struct tcf_proto *tp,
@@ -226,11 +228,17 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
{
struct net_device *dev = tp->q->dev_queue->dev;
struct tc_cls_flower_offload offload = {0};
- struct tc_to_netdev tc;
+ struct tc_to_netdev *tc = &f->tc;
int err;
- if (!tc_can_offload(dev, tp))
- return tc_skip_sw(f->flags) ? -EINVAL : 0;
+ if (!tc_can_offload(dev, tp)) {
+ if (tcf_exts_get_dev(dev, &f->exts, &f->hw_dev))
+ return tc_skip_sw(f->flags) ? -EINVAL : 0;
+ dev = f->hw_dev;
+ tc->egress_dev = true;
+ } else {
+ f->hw_dev = dev;
+ }
offload.command = TC_CLSFLOWER_REPLACE;
offload.cookie = (unsigned long)f;
@@ -239,23 +247,22 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
offload.key = &f->key;
offload.exts = &f->exts;
- tc.type = TC_SETUP_CLSFLOWER;
- tc.cls_flower = &offload;
+ tc->type = TC_SETUP_CLSFLOWER;
+ tc->cls_flower = &offload;
err = dev->netdev_ops->ndo_setup_tc(dev, tp->q->handle, tp->protocol,
- &tc);
+ tc);
if (tc_skip_sw(f->flags))
return err;
-
return 0;
}
static void fl_hw_update_stats(struct tcf_proto *tp, struct cls_fl_filter *f)
{
- struct net_device *dev = tp->q->dev_queue->dev;
struct tc_cls_flower_offload offload = {0};
- struct tc_to_netdev tc;
+ struct net_device *dev = f->hw_dev;
+ struct tc_to_netdev *tc = &f->tc;
if (!tc_can_offload(dev, tp))
return;
@@ -264,10 +271,10 @@ static void fl_hw_update_stats(struct tcf_proto *tp, struct cls_fl_filter *f)
offload.cookie = (unsigned long)f;
offload.exts = &f->exts;
- tc.type = TC_SETUP_CLSFLOWER;
- tc.cls_flower = &offload;
+ tc->type = TC_SETUP_CLSFLOWER;
+ tc->cls_flower = &offload;
- dev->netdev_ops->ndo_setup_tc(dev, tp->q->handle, tp->protocol, &tc);
+ dev->netdev_ops->ndo_setup_tc(dev, tp->q->handle, tp->protocol, tc);
}
static void __fl_delete(struct tcf_proto *tp, struct cls_fl_filter *f)
--
1.8.3.1
^ permalink raw reply related
* [PATCH net-next 7/8] net/mlx5e: Save the represntor netdevice as part of the representor
From: Hadar Hen Zion @ 2016-11-30 14:41 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Saeed Mahameed, Jiri Pirko, Amir Vadai, Or Gerlitz,
Roi Dayan, Hadar Hen Zion
In-Reply-To: <1480516895-29545-1-git-send-email-hadarh@mellanox.com>
Replace the representor private data to a net_device pointer holding the
representor netdevice, instead of void pointer holding mlx5e_priv.
It will be used by a new eswitch service function, returning the uplink representor
netdevice.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 15 ++++++++-------
drivers/net/ethernet/mellanox/mlx5/core/eswitch.h | 3 ++-
.../net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 12 +++++++++++-
4 files changed, 22 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 6b492ca..37c0d84 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3796,7 +3796,7 @@ static void mlx5e_nic_enable(struct mlx5e_priv *priv)
rep.load = mlx5e_nic_rep_load;
rep.unload = mlx5e_nic_rep_unload;
rep.vport = FDB_UPLINK_VPORT;
- rep.priv_data = priv;
+ rep.netdev = netdev;
mlx5_eswitch_register_vport_rep(esw, 0, &rep);
}
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 9b1e351..0868677 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -208,7 +208,8 @@ int mlx5e_add_sqs_fwd_rules(struct mlx5e_priv *priv)
int mlx5e_nic_rep_load(struct mlx5_eswitch *esw, struct mlx5_eswitch_rep *rep)
{
- struct mlx5e_priv *priv = rep->priv_data;
+ struct net_device *netdev = rep->netdev;
+ struct mlx5e_priv *priv = netdev_priv(netdev);
if (test_bit(MLX5E_STATE_OPENED, &priv->state))
return mlx5e_add_sqs_fwd_rules(priv);
@@ -226,7 +227,8 @@ void mlx5e_remove_sqs_fwd_rules(struct mlx5e_priv *priv)
void mlx5e_nic_rep_unload(struct mlx5_eswitch *esw,
struct mlx5_eswitch_rep *rep)
{
- struct mlx5e_priv *priv = rep->priv_data;
+ struct net_device *netdev = rep->netdev;
+ struct mlx5e_priv *priv = netdev_priv(netdev);
if (test_bit(MLX5E_STATE_OPENED, &priv->state))
mlx5e_remove_sqs_fwd_rules(priv);
@@ -555,7 +557,7 @@ int mlx5e_vport_rep_load(struct mlx5_eswitch *esw,
return -EINVAL;
}
- rep->priv_data = netdev_priv(netdev);
+ rep->netdev = netdev;
err = mlx5e_attach_netdev(esw->dev, netdev);
if (err) {
@@ -577,7 +579,7 @@ int mlx5e_vport_rep_load(struct mlx5_eswitch *esw,
mlx5e_detach_netdev(esw->dev, netdev);
err_destroy_netdev:
- mlx5e_destroy_netdev(esw->dev, rep->priv_data);
+ mlx5e_destroy_netdev(esw->dev, netdev_priv(netdev));
return err;
@@ -586,10 +588,9 @@ int mlx5e_vport_rep_load(struct mlx5_eswitch *esw,
void mlx5e_vport_rep_unload(struct mlx5_eswitch *esw,
struct mlx5_eswitch_rep *rep)
{
- struct mlx5e_priv *priv = rep->priv_data;
- struct net_device *netdev = priv->netdev;
+ struct net_device *netdev = rep->netdev;
unregister_netdev(netdev);
mlx5e_detach_netdev(esw->dev, netdev);
- mlx5e_destroy_netdev(esw->dev, priv);
+ mlx5e_destroy_netdev(esw->dev, netdev_priv(netdev));
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index cf1aa56..8661dd3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -186,7 +186,7 @@ struct mlx5_eswitch_rep {
struct mlx5_eswitch_rep *rep);
u16 vport;
u8 hw_id[ETH_ALEN];
- void *priv_data;
+ struct net_device *netdev;
struct mlx5_flow_handle *vport_rx_rule;
struct list_head vport_sqs_list;
@@ -318,6 +318,7 @@ void mlx5_eswitch_register_vport_rep(struct mlx5_eswitch *esw,
struct mlx5_eswitch_rep *rep);
void mlx5_eswitch_unregister_vport_rep(struct mlx5_eswitch *esw,
int vport_index);
+struct net_device *mlx5_eswitch_get_uplink_netdev(struct mlx5_eswitch *esw);
int mlx5_eswitch_add_vlan_action(struct mlx5_eswitch *esw,
struct mlx5_esw_flow_attr *attr);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 5c01550..466e161 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -970,7 +970,7 @@ void mlx5_eswitch_register_vport_rep(struct mlx5_eswitch *esw,
rep->load = __rep->load;
rep->unload = __rep->unload;
rep->vport = __rep->vport;
- rep->priv_data = __rep->priv_data;
+ rep->netdev = __rep->netdev;
ether_addr_copy(rep->hw_id, __rep->hw_id);
INIT_LIST_HEAD(&rep->vport_sqs_list);
@@ -990,3 +990,13 @@ void mlx5_eswitch_unregister_vport_rep(struct mlx5_eswitch *esw,
rep->valid = false;
}
+
+struct net_device *mlx5_eswitch_get_uplink_netdev(struct mlx5_eswitch *esw)
+{
+#define UPLINK_REP_INDEX 0
+ struct mlx5_esw_offload *offloads = &esw->offloads;
+ struct mlx5_eswitch_rep *rep;
+
+ rep = &offloads->vport_reps[UPLINK_REP_INDEX];
+ return rep->netdev;
+}
--
1.8.3.1
^ permalink raw reply related
* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Hannes Frederic Sowa @ 2016-11-30 15:37 UTC (permalink / raw)
To: Jiri Pirko, netdev
Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
alexander.h.duyck, kaber
In-Reply-To: <1480500546-2544-12-git-send-email-jiri@resnulli.us>
On 30.11.2016 11:09, Jiri Pirko wrote:
> From: Ido Schimmel <idosch@mellanox.com>
>
> Make sure the device has a complete view of the FIB tables by invoking
> their dump during module init.
>
> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> ---
> .../net/ethernet/mellanox/mlxsw/spectrum_router.c | 23 ++++++++++++++++++++++
> 1 file changed, 23 insertions(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> index 14bed1d..d176047 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> @@ -2027,8 +2027,23 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
> return NOTIFY_DONE;
> }
>
> +static void mlxsw_sp_router_fib_dump_flush(struct notifier_block *nb)
> +{
> + struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
> +
> + /* Flush pending FIB notifications and then flush the device's
> + * table before requesting another dump. Do that with RTNL held,
> + * as FIB notification block is already registered.
> + */
> + mlxsw_core_flush_owq();
> + rtnl_lock();
> + mlxsw_sp_router_fib_flush(mlxsw_sp);
> + rtnl_unlock();
> +}
> +
> int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
> {
> + fib_dump_cb_t *cb = mlxsw_sp_router_fib_dump_flush;
> int err;
>
> INIT_LIST_HEAD(&mlxsw_sp->router.nexthop_neighs_list);
> @@ -2048,8 +2063,16 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
>
> mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event;
> register_fib_notifier(&mlxsw_sp->fib_nb);
Sorry to pick in here again:
There is a race here. You need to protect the registration of the fib
notifier as well by the sequence counter. Updates here are not ordered
in relation to this code below.
I think just move the register notification into the fib_notifier_dump
function, rename it to fib_notifier_init and use it here:
> + if (!fib_notifier_dump(&mlxsw_sp->fib_nb, &init_net, cb)) {
> + err = -EBUSY;
> + goto err_fib_notifier_dump;
> + }
> +
> return 0;
Thanks,
Hannes
^ permalink raw reply
* Re: [WIP] net+mlx4: auto doorbell
From: Eric Dumazet @ 2016-11-30 15:44 UTC (permalink / raw)
To: Saeed Mahameed
Cc: Jesper Dangaard Brouer, Rick Jones, Linux Netdev List,
Saeed Mahameed, Tariq Toukan
In-Reply-To: <CALzJLG_G3G-_jFFD0xiAOfR-k69rhb_2ia4LHZroE-oEFxApnQ@mail.gmail.com>
On Wed, 2016-11-30 at 15:50 +0200, Saeed Mahameed wrote:
> On Tue, Nov 29, 2016 at 8:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Mon, 2016-11-21 at 10:10 -0800, Eric Dumazet wrote:
> >
> >
> >> Not sure it this has been tried before, but the doorbell avoidance could
> >> be done by the driver itself, because it knows a TX completion will come
> >> shortly (well... if softirqs are not delayed too much !)
> >>
> >> Doorbell would be forced only if :
> >>
> >> ( "skb->xmit_more is not set" AND "TX engine is not 'started yet'" )
> >> OR
> >> ( too many [1] packets were put in TX ring buffer, no point deferring
> >> more)
> >>
> >> Start the pump, but once it is started, let the doorbells being done by
> >> TX completion.
> >>
> >> ndo_start_xmit and TX completion handler would have to maintain a shared
> >> state describing if packets were ready but doorbell deferred.
> >>
> >>
> >> Note that TX completion means "if at least one packet was drained",
> >> otherwise busy polling, constantly calling napi->poll() would force a
> >> doorbell too soon for devices sharing a NAPI for both RX and TX.
> >>
> >> But then, maybe busy poll would like to force a doorbell...
> >>
> >> I could try these ideas on mlx4 shortly.
> >>
> >>
> >> [1] limit could be derived from active "ethtool -c" params, eg tx-frames
> >
> > I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
> > not used.
>
> Hi Eric, Nice Idea indeed and we need something like this,
> today we almost don't exploit the TX bulking at all.
>
> But please see below, i am not sure different contexts should share
> the doorbell ringing, it is really risky.
>
> > drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2
> > drivers/net/ethernet/mellanox/mlx4/en_tx.c | 90 +++++++++++------
> > drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 4
> > include/linux/netdevice.h | 1
> > net/core/net-sysfs.c | 18 +++
> > 5 files changed, 83 insertions(+), 32 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > index 6562f78b07f4..fbea83218fc0 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > @@ -1089,7 +1089,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >
> > if (polled) {
> > if (doorbell_pending)
> > - mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
> > + mlx4_en_xmit_doorbell(dev, priv->tx_ring[TX_XDP][cq->ring]);
> >
> > mlx4_cq_set_ci(&cq->mcq);
> > wmb(); /* ensure HW sees CQ consumer before we post new buffers */
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > index 4b597dca5c52..affebb435679 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > @@ -67,7 +67,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
> > ring->size = size;
> > ring->size_mask = size - 1;
> > ring->sp_stride = stride;
> > - ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
> > + ring->full_size = ring->size - HEADROOM - 2*MAX_DESC_TXBBS;
> >
> > tmp = size * sizeof(struct mlx4_en_tx_info);
> > ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
> > @@ -193,6 +193,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
> > ring->sp_cqn = cq;
> > ring->prod = 0;
> > ring->cons = 0xffffffff;
> > + ring->ncons = 0;
> > ring->last_nr_txbb = 1;
> > memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
> > memset(ring->buf, 0, ring->buf_size);
> > @@ -227,9 +228,9 @@ void mlx4_en_deactivate_tx_ring(struct mlx4_en_priv *priv,
> > MLX4_QP_STATE_RST, NULL, 0, 0, &ring->sp_qp);
> > }
> >
> > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
> > {
> > - return ring->prod - ring->cons > ring->full_size;
> > + return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
> > }
> >
> > static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
> > @@ -374,6 +375,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
> >
> > /* Skip last polled descriptor */
> > ring->cons += ring->last_nr_txbb;
> > + ring->ncons += ring->last_nr_txbb;
> > en_dbg(DRV, priv, "Freeing Tx buf - cons:0x%x prod:0x%x\n",
> > ring->cons, ring->prod);
> >
> > @@ -389,6 +391,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
> > !!(ring->cons & ring->size), 0,
> > 0 /* Non-NAPI caller */);
> > ring->cons += ring->last_nr_txbb;
> > + ring->ncons += ring->last_nr_txbb;
> > cnt++;
> > }
> >
> > @@ -401,6 +404,38 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
> > return cnt;
> > }
> >
> > +void mlx4_en_xmit_doorbell(const struct net_device *dev,
> > + struct mlx4_en_tx_ring *ring)
> > +{
> > +
> > + if (dev->doorbell_opt & 1) {
> > + u32 oval = READ_ONCE(ring->prod_bell);
> > + u32 nval = READ_ONCE(ring->prod);
> > +
> > + if (oval == nval)
> > + return;
> > +
> > + /* I can not tell yet if a cmpxchg() is needed or not */
> > + if (dev->doorbell_opt & 2)
> > + WRITE_ONCE(ring->prod_bell, nval);
> > + else
> > + if (cmpxchg(&ring->prod_bell, oval, nval) != oval)
> > + return;
> > + }
> > + /* Since there is no iowrite*_native() that writes the
> > + * value as is, without byteswapping - using the one
> > + * the doesn't do byteswapping in the relevant arch
> > + * endianness.
> > + */
> > +#if defined(__LITTLE_ENDIAN)
> > + iowrite32(
> > +#else
> > + iowrite32be(
> > +#endif
> > + ring->doorbell_qpn,
> > + ring->bf.uar->map + MLX4_SEND_DOORBELL);
> > +}
> > +
> > static bool mlx4_en_process_tx_cq(struct net_device *dev,
> > struct mlx4_en_cq *cq, int napi_budget)
> > {
> > @@ -496,8 +531,13 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
> > wmb();
> >
> > /* we want to dirty this cache line once */
> > - ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
> > - ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
> > + WRITE_ONCE(ring->last_nr_txbb, last_nr_txbb);
> > + ring_cons += txbbs_skipped;
> > + WRITE_ONCE(ring->cons, ring_cons);
> > + WRITE_ONCE(ring->ncons, ring_cons + last_nr_txbb);
> > +
> > + if (dev->doorbell_opt)
> > + mlx4_en_xmit_doorbell(dev, ring);
> >
> > if (ring->free_tx_desc == mlx4_en_recycle_tx_desc)
> > return done < budget;
> > @@ -725,29 +765,14 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
> > __iowrite64_copy(dst, src, bytecnt / 8);
> > }
> >
> > -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
> > -{
> > - wmb();
>
> you missed/removed this "wmb()" in the new function, why ? where did
> you compensate for it ?
I removed it because I had a cmpxchg() there if the barrier was needed.
My patch is a WIP, where you can set the bit 2 to ask to replace the
cmpxchg() by a simple write, only for performance testing/comparisons.
>
> > - /* Since there is no iowrite*_native() that writes the
> > - * value as is, without byteswapping - using the one
> > - * the doesn't do byteswapping in the relevant arch
> > - * endianness.
> > - */
> > -#if defined(__LITTLE_ENDIAN)
> > - iowrite32(
> > -#else
> > - iowrite32be(
> > -#endif
> > - ring->doorbell_qpn,
> > - ring->bf.uar->map + MLX4_SEND_DOORBELL);
> > -}
> >
> > static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
> > struct mlx4_en_tx_desc *tx_desc,
> > union mlx4_wqe_qpn_vlan qpn_vlan,
> > int desc_size, int bf_index,
> > __be32 op_own, bool bf_ok,
> > - bool send_doorbell)
> > + bool send_doorbell,
> > + const struct net_device *dev, int nr_txbb)
> > {
> > tx_desc->ctrl.qpn_vlan = qpn_vlan;
> >
> > @@ -761,6 +786,7 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
> >
> > wmb();
> >
> > + ring->prod += nr_txbb;
> > mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
> > desc_size);
> >
> > @@ -773,8 +799,9 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
> > */
> > dma_wmb();
> > tx_desc->ctrl.owner_opcode = op_own;
> > + ring->prod += nr_txbb;
> > if (send_doorbell)
> > - mlx4_en_xmit_doorbell(ring);
> > + mlx4_en_xmit_doorbell(dev, ring);
> > else
> > ring->xmit_more++;
> > }
> > @@ -1017,8 +1044,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> > op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP);
> > }
> >
> > - ring->prod += nr_txbb;
> > -
> > /* If we used a bounce buffer then copy descriptor back into place */
> > if (unlikely(bounce))
> > tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
> > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> > }
> > send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
> >
> > + /* Doorbell avoidance : We can omit doorbell if we know a TX completion
> > + * will happen shortly.
> > + */
> > + if (send_doorbell &&
> > + dev->doorbell_opt &&
> > + (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
>
> Aelexi already expressed his worries about synchronization, and i
> think here (in this exact line) sits the problem,
> what about if exactly at this point the TX completion handler just
> finished and rang the last doorbell,
> you didn't write the new TX descriptor yet (mlx4_en_tx_write_desc), so
> the last doorbell from the CQ handler missed it.
> even if you wrote the TX descriptor before the doorbell decision here,
> you will need a memory barrier to make sure the HW sees
> the new packet, which was typically done before ringing the doorbell.
My patch is a WIP, meaning it is not completed ;)
Surely we can find a non racy way to handle this.
> All in all, this is risky business :), the right way to go is to
> force the upper layer to use xmit-more and delay doorbells/use bulking
> but from the same context
> (xmit routine). For example see Achiad's suggestion (attached in
> Jesper's response), he used stop queue to force the stack to queue up
> packets (TX bulking)
> which would set xmit-more and will use the next completion to release
> the "stopped" ring TXQ rather than hit the doorbell on behalf of it.
Well, you depend on having a higher level queue like a qdisc.
Some users do not use a qdisc.
If you stop the queue, they no longer can send anything -> drops.
T
^ permalink raw reply
* Re: Aw: Re: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: Eric Dumazet @ 2016-11-30 15:47 UTC (permalink / raw)
To: David Laight; +Cc: Lino Sanfilippo, David Miller, netdev, Tariq Toukan
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DB0230455@AcuExch.aculab.com>
On Wed, 2016-11-30 at 15:28 +0000, David Laight wrote:
> Are you sure??
Yes I am
> Last I looked gcc seemed to convert 'foo++' to 'foo = foo + 1' before
> generating any code.
Your gcc might need a refresh then.
> It might then optimise that back to a memory increment, but that would
> also happen if you'd coded the latter form.
>
> > Which is kind of unfortunate, given it is the fast path.
> >
> > Better add a comment, like :
> >
> > /* We should use WRITE_ONCE() to pair with the READ_ONCE() found in xxxx()
> > * But gcc would generate non optimal code.
> > */
>
> Actually while READ_ONCE() is generally useful - to get a snapshot of a changing value.
>
> WRITE_ONCE() isn't a pairing - the compiler is highly unlikely to write a
> location twice.
WOW. I can not believe what you just said.
We had numerous bugs because compiler was writing on the location
temporary computations. Just take a look at git history to find some
gems.
> You might want an annotation to ensure is doesn't assume it can read the value
> back (write through volatile pointer). But that has nothing to do with how readers behave.
Completely wrong.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox