Netdev List
 help / color / mirror / Atom feed
* Re: [Patch net-next] mlx5: use RCU lock in mlx5_eq_cq_get()
From: Cong Wang @ 2019-02-06 22:51 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: netdev@vger.kernel.org, Tariq Toukan
In-Reply-To: <90bd3088771ab2be083eac9772d488bc3ea8763b.camel@mellanox.com>

On Wed, Feb 6, 2019 at 9:40 AM Saeed Mahameed <saeedm@mellanox.com> wrote:
>
> On Wed, 2019-02-06 at 09:35 -0800, Saeed Mahameed wrote:
> > On Tue, 2019-02-05 at 16:35 -0800, Cong Wang wrote:
> > > mlx5_eq_cq_get() is called in IRQ handler, the spinlock inside
> > > gets a lot of contentions when we test some heavy workload
> > > with 60 RX queues and 80 CPU's, and it is clearly shown in the
> > > flame graph.
> > >
> > > In fact, radix_tree_lookup() is perfectly fine with RCU read lock,
> > > we don't have to take a spinlock on this hot path. It is pretty
> > > much
> > > similar to commit 291c566a2891
> > > ("net/mlx4_core: Fix racy CQ (Completion Queue) free"). Slow paths
> > > are still serialized with the spinlock, and with synchronize_irq()
> > > it should be safe to just move the fast path to RCU read lock.
> > >
> > > This patch itself reduces the latency by about 50% with our
> > > workload.
> > >
> > > Cc: Saeed Mahameed <saeedm@mellanox.com>
> > > Cc: Tariq Toukan <tariqt@mellanox.com>
> > > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> >
> > Acked-by: Saeed Mahameed <saeedm@mellanox.com>
> >
>
> Actually, the commit message needs some rework, since there is no
> contention upstream, Cong can you take care of this and post a V2 ?

I can't verify if upstream has contention or not, but yeah, I can at
least mention the commit 02d92f7903647119e125b24f547 in
changelog.

Thanks.

^ permalink raw reply

* [Patch net-next v2] mlx5: use RCU lock in mlx5_eq_cq_get()
From: Cong Wang @ 2019-02-06 23:00 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, Saeed Mahameed, Tariq Toukan

mlx5_eq_cq_get() is called in IRQ handler, the spinlock inside
gets a lot of contentions when we test some heavy workload
with 60 RX queues and 80 CPU's, and it is clearly shown in the
flame graph.

In fact, radix_tree_lookup() is perfectly fine with RCU read lock,
we don't have to take a spinlock on this hot path. This is pretty
much similar to commit 291c566a2891
("net/mlx4_core: Fix racy CQ (Completion Queue) free"). Slow paths
are still serialized with the spinlock, and with synchronize_irq()
it should be safe to just move the fast path to RCU read lock.

This patch itself reduces the latency by about 50% for our memcached
workload on a 4.14 kernel we test. In upstream, as pointed out by Saeed,
this spinlock gets some rework in commit 02d92f790364
("net/mlx5: CQ Database per EQ"), so the difference could be smaller.

Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index ee04aab65a9f..7092457705a2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -114,11 +114,11 @@ static struct mlx5_core_cq *mlx5_eq_cq_get(struct mlx5_eq *eq, u32 cqn)
 	struct mlx5_cq_table *table = &eq->cq_table;
 	struct mlx5_core_cq *cq = NULL;
 
-	spin_lock(&table->lock);
+	rcu_read_lock();
 	cq = radix_tree_lookup(&table->tree, cqn);
 	if (likely(cq))
 		mlx5_cq_hold(cq);
-	spin_unlock(&table->lock);
+	rcu_read_unlock();
 
 	return cq;
 }
@@ -371,9 +371,9 @@ int mlx5_eq_add_cq(struct mlx5_eq *eq, struct mlx5_core_cq *cq)
 	struct mlx5_cq_table *table = &eq->cq_table;
 	int err;
 
-	spin_lock_irq(&table->lock);
+	spin_lock(&table->lock);
 	err = radix_tree_insert(&table->tree, cq->cqn, cq);
-	spin_unlock_irq(&table->lock);
+	spin_unlock(&table->lock);
 
 	return err;
 }
@@ -383,9 +383,9 @@ int mlx5_eq_del_cq(struct mlx5_eq *eq, struct mlx5_core_cq *cq)
 	struct mlx5_cq_table *table = &eq->cq_table;
 	struct mlx5_core_cq *tmp;
 
-	spin_lock_irq(&table->lock);
+	spin_lock(&table->lock);
 	tmp = radix_tree_delete(&table->tree, cq->cqn);
-	spin_unlock_irq(&table->lock);
+	spin_unlock(&table->lock);
 
 	if (!tmp) {
 		mlx5_core_warn(eq->dev, "cq 0x%x not found in eq 0x%x tree\n", eq->eqn, cq->cqn);
-- 
2.20.1


^ permalink raw reply related

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
From: Ian Kumlien @ 2019-02-06 23:00 UTC (permalink / raw)
  To: Cong Wang; +Cc: David Miller, Saeed Mahameed, Linux Kernel Network Developers
In-Reply-To: <CAM_iQpWMdMG_wtvHa9NQHQnQ4AuEEH2FcsJ2nWptSqFD_-6-Ww@mail.gmail.com>

On Wed, Feb 6, 2019 at 11:49 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Wed, Feb 6, 2019 at 2:41 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> >
> > On Wed, Feb 6, 2019 at 11:38 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > On Wed, Feb 6, 2019 at 2:15 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > Could we please schedule this for 4.19 and 4.20 - it's kinda breaking things
> > >
> > > It doesn't break anything, packets are _not_ dropped, only that the
> > > warning itself is noisy.
> >
> > Not my experience, to me it slows the machine down and looses packets,
> > I don't however know
> > if this is the only culprit
>
> The packet process could be slow down because of printing
> out this kernel warning. Packet should be still delivered to upper
> stack, at least I didn't see any packet drops because of this.

I have several machines pushing the same errors currently, while on this
one I was logged in on the serial console and not over ssh like the others.

On the other machines, typing is slow, looses characters and drops the
connection

But, again, I don't know if this is the only culprit, it sure does
fill dmesg though =)
(which suddenly takes minutes to show over a 100gig connection)

> > You can actually see it on ping where it start out with 0.0xyx and
> > ends up at ~10ms
>
> I don't understand how it could affect ICMP, it is purely TCP
> from my point of view, even the stack trace from you says so. ;)

It changes directly after the first hw checksum failure, I don't know why =/

^ permalink raw reply

* Waiting for vrf to become free on rmmod of bridge...
From: Ben Greear @ 2019-02-06 23:20 UTC (permalink / raw)
  To: netdev

Hello,

I just saw this warning on a system running a hacked 4.20.2+ kernel.  Any known bugs
of this nature in this (upstream) kernel?  The command that is blocked is:
'rmmod bridge llc'

[17069.299135] unregister_netdevice: waiting for _vrf13 to become free. Usage count = 1
[17079.306438] unregister_netdevice: waiting for _vrf13 to become free. Usage count = 1
[17089.314656] unregister_netdevice: waiting for _vrf13 to become free. Usage count = 1
[17099.322870] unregister_netdevice: waiting for _vrf13 to become free. Usage count = 1

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply

* Re: [Patch net-next v2] mlx5: use RCU lock in mlx5_eq_cq_get()
From: Eric Dumazet @ 2019-02-06 23:36 UTC (permalink / raw)
  To: Cong Wang, netdev; +Cc: Saeed Mahameed, Tariq Toukan
In-Reply-To: <20190206230019.1303-1-xiyou.wangcong@gmail.com>



On 02/06/2019 03:00 PM, Cong Wang wrote:
> mlx5_eq_cq_get() is called in IRQ handler, the spinlock inside
> gets a lot of contentions when we test some heavy workload
> with 60 RX queues and 80 CPU's, and it is clearly shown in the
> flame graph.
> 
> In fact, radix_tree_lookup() is perfectly fine with RCU read lock,
> we don't have to take a spinlock on this hot path. This is pretty
> much similar to commit 291c566a2891
> ("net/mlx4_core: Fix racy CQ (Completion Queue) free"). Slow paths
> are still serialized with the spinlock, and with synchronize_irq()
> it should be safe to just move the fast path to RCU read lock.
> 
> This patch itself reduces the latency by about 50% for our memcached
> workload on a 4.14 kernel we test. In upstream, as pointed out by Saeed,
> this spinlock gets some rework in commit 02d92f790364
> ("net/mlx5: CQ Database per EQ"), so the difference could be smaller.
> 
> Cc: Saeed Mahameed <saeedm@mellanox.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> Acked-by: Saeed Mahameed <saeedm@mellanox.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/eq.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> index ee04aab65a9f..7092457705a2 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> @@ -114,11 +114,11 @@ static struct mlx5_core_cq *mlx5_eq_cq_get(struct mlx5_eq *eq, u32 cqn)
>  	struct mlx5_cq_table *table = &eq->cq_table;
>  	struct mlx5_core_cq *cq = NULL;
>  
> -	spin_lock(&table->lock);
> +	rcu_read_lock();
>  	cq = radix_tree_lookup(&table->tree, cqn);
>  	if (likely(cq))
>  		mlx5_cq_hold(cq);

I suspect that you need a variant that makes sure refcount is not zero.

( Typical RCU rules apply )

if (cq && !refcount_inc_not_zero(&cq->refcount))
	cq = NULL;


See commit 6fa19f5637a6c22bc0999596bcc83bdcac8a4fa6 rds: fix refcount bug in rds_sock_addref
for a similar issue I fixed recently.




^ permalink raw reply

* Re: [PATCH v2] arm64: dts: lx2160aqds: Add mdio mux nodes
From: Li Yang @ 2019-02-06 23:39 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Pankaj Bansal, Shawn Guo, Florian Fainelli,
	netdev@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Rob Herring
In-Reply-To: <20190206214432.GB32483@lunn.ch>

On Wed, Feb 6, 2019 at 3:46 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > > > >  &i2c0 {
> > > > >         status = "okay";
> > > > >
> > > > > +       fpga@66 {
> > > > > +               compatible = "fsl,lx2160aqds-fpga", "fsl,fpga-qixis-i2c";
> > > > > +               reg = <0x66>;
> > > > > +               #address-cells = <1>;
> > > > > +               #size-cells = <0>;
> > > > > +
> > > > > +               mdio-mux-1@54 {
> > > >
> > > > Still no compatible string defined for the node.  Probably should be
> > > > "mdio-mux- mmioreg", "mdio-mux"
> > >
> > > it is not a specific device. MDIO mux is meant to be controlled by some
> > > registers of parent device (FPGA).
> > > Therefore, IMO this should not be a device and there should not be any
> > > "compatible" property for it.
>
> > If it is not a device why we are defining a device node for it?  It
> > is probably not a physical device per se, but it can be considered a
> > virtual device provided by FPGA.
>
> It is a physical device. But it happens to be embedded inside another
> device. And that embedded is not performed as a bus with devices on
> it, so the device tree concepts don't fit directly.

Whether or not it is populated as a bus(which probably should as the
FPGA does contain many different functions and these functions like
the mdio-mux we are discussing about could have separate drivers), the
node should have a new binding documentation similar to the
mdio_mux_mmioreg binding or even covers the mmioreg too.  And the best
way to match the node with the binding is through compatible strings
IMO.  This is why I'm asking the node to have a compatible string.

>
> > This also bring up another question that why this device cannot
> > reuse the existing drivers/net/phy/mdio-mux-mmioreg.c driver?
>
> Because it is on an i2c bus, not an mmio bus.

Oops, I missed that.

>
> > If we think regmap is a better solution, shall we replace the
> > mmioreg driver with the regmap driver?
>
> regmap can be used with mmio. But for a single MMIO register it is a
> huge framework. So it makes sense to keep mdio-mux-mmioreg simple.
>
> If however the device is already using regmap, adding one more
> register is very little overhead. And it might be possible to use this
> new mux with an mmio regmap, or an spi regmap, etc. So we seem to be
> covering the best of both worlds.

Ya.  It would be ideal if the new driver can cover the legacy
mdio-mux-mmioreg case too.

Regards,
Leo

^ permalink raw reply

* Re: [Patch net-next v2] mlx5: use RCU lock in mlx5_eq_cq_get()
From: Eric Dumazet @ 2019-02-06 23:55 UTC (permalink / raw)
  To: Eric Dumazet, Cong Wang, netdev; +Cc: Saeed Mahameed, Tariq Toukan
In-Reply-To: <52ed9806-310d-bfeb-9610-0daefc1e66fa@gmail.com>



On 02/06/2019 03:36 PM, Eric Dumazet wrote:
> 

> I suspect that you need a variant that makes sure refcount is not zero.
> 
> ( Typical RCU rules apply )
> 
> if (cq && !refcount_inc_not_zero(&cq->refcount))
> 	cq = NULL;
> 
> 
> See commit 6fa19f5637a6c22bc0999596bcc83bdcac8a4fa6 rds: fix refcount bug in rds_sock_addref
> for a similar issue I fixed recently.
> 
 
By the way, we also could avoid two atomics on the cq->refcount , by using rcu_read_lock()/unlock()
in the two callers .



^ permalink raw reply

* [PATCH mac80211-next] virt_wifi: Remove REGULATORY_WIPHY_SELF_MANAGED
From: Cody Schuffelen @ 2019-02-06 23:54 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Kalle Valo, David S . Miller, linux-kernel, linux-wireless,
	netdev, kernel-team, Cody Schuffelen, Alistair Strachan,
	Greg Hartman

REGULATORY_WIPHY_SELF_MANAGED as set here breaks NL80211_CMD_GET_REG,
because it expects the wiphy to do regulatory management. Since
virt_wifi does not do regulatory management, this triggers a WARN_ON in
NL80211_CMD_GET_REG and fails the netlink command.

Removing REGULATORY_WIPHY_SELF_MANAGED fixes the problem and the virtual
wireless network continues to work.

Signed-off-by: Cody Schuffelen <schuffelen@google.com>
Acked-by: Alistair Strachan <astrachan@google.com>
Acked-by: Greg Hartman <ghartman@google.com>
---
 drivers/net/wireless/virt_wifi.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/wireless/virt_wifi.c b/drivers/net/wireless/virt_wifi.c
index 71044c6cfd8c..606999f102eb 100644
--- a/drivers/net/wireless/virt_wifi.c
+++ b/drivers/net/wireless/virt_wifi.c
@@ -360,7 +360,6 @@ static struct wiphy *virt_wifi_make_wiphy(void)
 	wiphy->bands[NL80211_BAND_5GHZ] = &band_5ghz;
 	wiphy->bands[NL80211_BAND_60GHZ] = NULL;
 
-	wiphy->regulatory_flags = REGULATORY_WIPHY_SELF_MANAGED;
 	wiphy->interface_modes = BIT(NL80211_IFTYPE_STATION);
 
 	priv = wiphy_priv(wiphy);
-- 
2.20.1.611.gfbb209baf1-goog


^ permalink raw reply related

* Re: [Patch net-next v2] mlx5: use RCU lock in mlx5_eq_cq_get()
From: Cong Wang @ 2019-02-07  0:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan
In-Reply-To: <52ed9806-310d-bfeb-9610-0daefc1e66fa@gmail.com>

On Wed, Feb 6, 2019 at 3:36 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 02/06/2019 03:00 PM, Cong Wang wrote:
> > mlx5_eq_cq_get() is called in IRQ handler, the spinlock inside
> > gets a lot of contentions when we test some heavy workload
> > with 60 RX queues and 80 CPU's, and it is clearly shown in the
> > flame graph.
> >
> > In fact, radix_tree_lookup() is perfectly fine with RCU read lock,
> > we don't have to take a spinlock on this hot path. This is pretty
> > much similar to commit 291c566a2891
> > ("net/mlx4_core: Fix racy CQ (Completion Queue) free"). Slow paths
> > are still serialized with the spinlock, and with synchronize_irq()
> > it should be safe to just move the fast path to RCU read lock.
> >
> > This patch itself reduces the latency by about 50% for our memcached
> > workload on a 4.14 kernel we test. In upstream, as pointed out by Saeed,
> > this spinlock gets some rework in commit 02d92f790364
> > ("net/mlx5: CQ Database per EQ"), so the difference could be smaller.
> >
> > Cc: Saeed Mahameed <saeedm@mellanox.com>
> > Cc: Tariq Toukan <tariqt@mellanox.com>
> > Acked-by: Saeed Mahameed <saeedm@mellanox.com>
> > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/eq.c | 12 ++++++------
> >  1 file changed, 6 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > index ee04aab65a9f..7092457705a2 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > @@ -114,11 +114,11 @@ static struct mlx5_core_cq *mlx5_eq_cq_get(struct mlx5_eq *eq, u32 cqn)
> >       struct mlx5_cq_table *table = &eq->cq_table;
> >       struct mlx5_core_cq *cq = NULL;
> >
> > -     spin_lock(&table->lock);
> > +     rcu_read_lock();
> >       cq = radix_tree_lookup(&table->tree, cqn);
> >       if (likely(cq))
> >               mlx5_cq_hold(cq);
>
> I suspect that you need a variant that makes sure refcount is not zero.
>
> ( Typical RCU rules apply )
>
> if (cq && !refcount_inc_not_zero(&cq->refcount))
>         cq = NULL;
>
>
> See commit 6fa19f5637a6c22bc0999596bcc83bdcac8a4fa6 rds: fix refcount bug in rds_sock_addref
> for a similar issue I fixed recently.

synchronize_irq() is called before mlx5_cq_put(), so I don't
see why readers could get 0 refcnt.

For the rds you mentioned, it doesn't wait for readers, this
is why it needs to check against 0 and why it is different from
this one.

Thanks.

^ permalink raw reply

* Re: [Patch net-next v2] mlx5: use RCU lock in mlx5_eq_cq_get()
From: Eric Dumazet @ 2019-02-07  0:28 UTC (permalink / raw)
  To: Cong Wang, Eric Dumazet
  Cc: Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan
In-Reply-To: <CAM_iQpUy0y_NqT82htx_D-3G-wpo4mfguvxk2SPt4d2+KjXetA@mail.gmail.com>



On 02/06/2019 04:04 PM, Cong Wang wrote:

> synchronize_irq() is called before mlx5_cq_put(), so I don't
> see why readers could get 0 refcnt.

Then the more reasons to get rid of the refcount increment/decrement completely ...

Technically, even the rcu_read_lock() and rcu_read_unlock() are not needed,
since synchronize_irq() is enough.

> 
> For the rds you mentioned, it doesn't wait for readers, this
> is why it needs to check against 0 and why it is different from
> this one.
> 
> Thanks.
> 

^ permalink raw reply

* [PATCH bpf-next v7 0/6] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap
From: Peter Oskolkov @ 2019-02-07  0:37 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov

This patchset implements BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
BPF helper. It enables BPF programs (specifically, BPF_PROG_TYPE_LWT_IN
and BPF_PROG_TYPE_LWT_XMIT prog types) to add IP encapsulation headers
to packets (e.g. IP/GRE, GUE, IPIP).

This is useful when thousands of different short-lived flows should be
encapped, each with different and dynamically determined destination.
Although lwtunnels can be used in some of these scenarios, the ability
to dynamically generate encap headers adds more flexibility, e.g.
when routing depends on the state of the host (reflected in global bpf
maps).

V2 changes: Added flowi-based route lookup, IPv6 encapping, and
   encapping on ingress.

V3 changes: incorporated David Ahern's suggestions:
   - added l3mdev check/oif (patch 2)
   - sync bpf.h from include/uapi into tools/include/uapi
   - selftest tweaks

V4 changes: moved route lookup/dst change from bpf_push_ip_encap
   to when BPF_LWT_REROUTE is handled, as suggested by David Ahern.

V5 changes: added a check in lwt_xmit that skb->protocol stays the
   same if the skb is to be passed back to the stack (ret == BPF_OK).
   Again, suggested by David Ahern.

V6 changes: abandoned.

V7 changes: added handling of GSO packets (patch 3 in the patchset added).

Peter Oskolkov (6):
  bpf: add plumbing for BPF_LWT_ENCAP_IP in bpf_lwt_push_encap
  bpf: implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
  bpf: handle GSO in bpf_lwt_push_encap
  bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c
  bpf: sync <kdir>/include/.../bpf.h with tools/include/.../bpf.h
  selftests: bpf: add test_lwt_ip_encap selftest

 include/net/lwtunnel.h                        |   3 +
 include/uapi/linux/bpf.h                      |  26 +-
 net/core/filter.c                             |  47 ++-
 net/core/lwt_bpf.c                            | 246 ++++++++++++++
 tools/include/uapi/linux/bpf.h                |  26 +-
 tools/testing/selftests/bpf/Makefile          |   6 +-
 .../testing/selftests/bpf/test_lwt_ip_encap.c |  85 +++++
 .../selftests/bpf/test_lwt_ip_encap.sh        | 311 ++++++++++++++++++
 8 files changed, 739 insertions(+), 11 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lwt_ip_encap.c
 create mode 100755 tools/testing/selftests/bpf/test_lwt_ip_encap.sh

-- 
2.20.1.611.gfbb209baf1-goog


^ permalink raw reply

* [PATCH bpf-next v7 1/6] bpf: add plumbing for BPF_LWT_ENCAP_IP in bpf_lwt_push_encap
From: Peter Oskolkov @ 2019-02-07  0:37 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190207003720.51096-1-posk@google.com>

This patch adds all needed plumbing in preparation to allowing
bpf programs to do IP encapping via bpf_lwt_push_encap. Actual
implementation is added in the next patch in the patchset.

Of note:
- bpf_lwt_push_encap can now be called from BPF_PROG_TYPE_LWT_XMIT
  prog types in addition to BPF_PROG_TYPE_LWT_IN;
- if the skb being encapped has GSO set, encapsulation is limited
  to IPIP/IP+GRE/IP+GUE (both IPv4 and IPv6);
- as route lookups are different for ingress vs egress, the single
  external bpf_lwt_push_encap BPF helper is routed internally to
  either bpf_lwt_in_push_encap or bpf_lwt_xmit_push_encap BPF_CALLs,
  depending on prog type.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/uapi/linux/bpf.h | 26 +++++++++++++++++++++--
 net/core/filter.c        | 46 +++++++++++++++++++++++++++++++++++-----
 2 files changed, 65 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1777fa0c61e4..138089ff24cf 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2016,6 +2016,19 @@ union bpf_attr {
  *			Only works if *skb* contains an IPv6 packet. Insert a
  *			Segment Routing Header (**struct ipv6_sr_hdr**) inside
  *			the IPv6 header.
+ *		**BPF_LWT_ENCAP_IP**
+ *			IP encapsulation (GRE/GUE/IPIP/etc). The outer header
+ *			must be IPv4 or IPv6, followed by zero or more
+ *			additional headers, up to LWT_BPF_MAX_HEADROOM total
+ *			bytes in all prepended headers. PLease note that
+ *			if skb_is_gso(skb) is true, no more than two headers
+ *			can be prepended, and the inner header, if present,
+ *			should be either GRE or UDP/GUE.
+ *
+ *		BPF_LWT_ENCAP_SEG6*** types can be called by bpf programs of
+ *		type BPF_PROG_TYPE_LWT_IN; BPF_LWT_ENCAP_IP type can be called
+ *		by bpf programs of types BPF_PROG_TYPE_LWT_IN and
+ *		BPF_PROG_TYPE_LWT_XMIT.
  *
  * 		A call to this helper is susceptible to change the underlaying
  * 		packet buffer. Therefore, at load time, all checks on pointers
@@ -2498,7 +2511,8 @@ enum bpf_hdr_start_off {
 /* Encapsulation type for BPF_FUNC_lwt_push_encap helper. */
 enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_SEG6,
-	BPF_LWT_ENCAP_SEG6_INLINE
+	BPF_LWT_ENCAP_SEG6_INLINE,
+	BPF_LWT_ENCAP_IP,
 };
 
 #define __bpf_md_ptr(type, name)	\
@@ -2586,7 +2600,15 @@ enum bpf_ret_code {
 	BPF_DROP = 2,
 	/* 3-6 reserved */
 	BPF_REDIRECT = 7,
-	/* >127 are reserved for prog type specific return codes */
+	/* >127 are reserved for prog type specific return codes.
+	 *
+	 * BPF_LWT_REROUTE: used by BPF_PROG_TYPE_LWT_IN and
+	 *    BPF_PROG_TYPE_LWT_XMIT to indicate that skb had been
+	 *    changed and should be routed based on its new L3 header.
+	 *    (This is an L3 redirect, as opposed to L2 redirect
+	 *    represented by BPF_REDIRECT above).
+	 */
+	BPF_LWT_REROUTE = 128,
 };
 
 struct bpf_sock {
diff --git a/net/core/filter.c b/net/core/filter.c
index 3a49f68eda10..8884120fe458 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4801,7 +4801,13 @@ static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len
 }
 #endif /* CONFIG_IPV6_SEG6_BPF */
 
-BPF_CALL_4(bpf_lwt_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
+static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len,
+			     bool ingress)
+{
+	return -EINVAL;  /* Implemented in the next patch. */
+}
+
+BPF_CALL_4(bpf_lwt_in_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
 	   u32, len)
 {
 	switch (type) {
@@ -4809,14 +4815,41 @@ BPF_CALL_4(bpf_lwt_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
 	case BPF_LWT_ENCAP_SEG6:
 	case BPF_LWT_ENCAP_SEG6_INLINE:
 		return bpf_push_seg6_encap(skb, type, hdr, len);
+#endif
+#if IS_ENABLED(CONFIG_LWTUNNEL_BPF)
+	case BPF_LWT_ENCAP_IP:
+		return bpf_push_ip_encap(skb, hdr, len, true /* ingress */);
 #endif
 	default:
 		return -EINVAL;
 	}
 }
 
-static const struct bpf_func_proto bpf_lwt_push_encap_proto = {
-	.func		= bpf_lwt_push_encap,
+BPF_CALL_4(bpf_lwt_xmit_push_encap, struct sk_buff *, skb, u32, type,
+	   void *, hdr, u32, len)
+{
+	switch (type) {
+#if IS_ENABLED(CONFIG_LWTUNNEL_BPF)
+	case BPF_LWT_ENCAP_IP:
+		return bpf_push_ip_encap(skb, hdr, len, false /* egress */);
+#endif
+	default:
+		return -EINVAL;
+	}
+}
+
+static const struct bpf_func_proto bpf_lwt_in_push_encap_proto = {
+	.func		= bpf_lwt_in_push_encap,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_MEM,
+	.arg4_type	= ARG_CONST_SIZE
+};
+
+static const struct bpf_func_proto bpf_lwt_xmit_push_encap_proto = {
+	.func		= bpf_lwt_xmit_push_encap,
 	.gpl_only	= false,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
@@ -5282,7 +5315,8 @@ bool bpf_helper_changes_pkt_data(void *func)
 	    func == bpf_lwt_seg6_adjust_srh ||
 	    func == bpf_lwt_seg6_action ||
 #endif
-	    func == bpf_lwt_push_encap)
+	    func == bpf_lwt_in_push_encap ||
+	    func == bpf_lwt_xmit_push_encap)
 		return true;
 
 	return false;
@@ -5670,7 +5704,7 @@ lwt_in_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_lwt_push_encap:
-		return &bpf_lwt_push_encap_proto;
+		return &bpf_lwt_in_push_encap_proto;
 	default:
 		return lwt_out_func_proto(func_id, prog);
 	}
@@ -5706,6 +5740,8 @@ lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_l4_csum_replace_proto;
 	case BPF_FUNC_set_hash_invalid:
 		return &bpf_set_hash_invalid_proto;
+	case BPF_FUNC_lwt_push_encap:
+		return &bpf_lwt_xmit_push_encap_proto;
 	default:
 		return lwt_out_func_proto(func_id, prog);
 	}
-- 
2.20.1.611.gfbb209baf1-goog


^ permalink raw reply related

* [PATCH bpf-next v7 2/6] bpf: implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
From: Peter Oskolkov @ 2019-02-07  0:37 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190207003720.51096-1-posk@google.com>

This patch implements BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
BPF helper. It enables BPF programs (specifically, BPF_PROG_TYPE_LWT_IN
and BPF_PROG_TYPE_LWT_XMIT prog types) to add IP encapsulation headers
to packets (e.g. IP/GRE, GUE, IPIP).

This is useful when thousands of different short-lived flows should be
encapped, each with different and dynamically determined destination.
Although lwtunnels can be used in some of these scenarios, the ability
to dynamically generate encap headers adds more flexibility, e.g.
when routing depends on the state of the host (reflected in global bpf
maps).

v7 changes:
 - added a call skb_clear_hash();
 - removed calls to skb_set_transport_header();
 - refuse to encap GSO-enabled packets.

Note: the next patch in the patchset with deal with GSO-enabled packets,
which are currently rejected at encapping attempt.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/net/lwtunnel.h |  3 ++
 net/core/filter.c      |  3 +-
 net/core/lwt_bpf.c     | 65 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index 33fd9ba7e0e5..f0973eca8036 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -126,6 +126,8 @@ int lwtunnel_cmp_encap(struct lwtunnel_state *a, struct lwtunnel_state *b);
 int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb);
 int lwtunnel_input(struct sk_buff *skb);
 int lwtunnel_xmit(struct sk_buff *skb);
+int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len,
+			  bool ingress);
 
 static inline void lwtunnel_set_redirect(struct dst_entry *dst)
 {
@@ -138,6 +140,7 @@ static inline void lwtunnel_set_redirect(struct dst_entry *dst)
 		dst->input = lwtunnel_input;
 	}
 }
+
 #else
 
 static inline void lwtstate_free(struct lwtunnel_state *lws)
diff --git a/net/core/filter.c b/net/core/filter.c
index 8884120fe458..7b7e7c9125e2 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -73,6 +73,7 @@
 #include <linux/seg6_local.h>
 #include <net/seg6.h>
 #include <net/seg6_local.h>
+#include <net/lwtunnel.h>
 
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
@@ -4804,7 +4805,7 @@ static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len
 static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len,
 			     bool ingress)
 {
-	return -EINVAL;  /* Implemented in the next patch. */
+	return bpf_lwt_push_ip_encap(skb, hdr, len, ingress);
 }
 
 BPF_CALL_4(bpf_lwt_in_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index a648568c5e8f..786b96148937 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -390,6 +390,71 @@ static const struct lwtunnel_encap_ops bpf_encap_ops = {
 	.owner		= THIS_MODULE,
 };
 
+static int handle_gso_encap(struct sk_buff *skb, bool ipv4, int encap_len)
+{
+	/* Handling of GSO-enabled packets is added in the next patch. */
+	if (unlikely(skb_is_gso(skb)))
+		return -EINVAL;
+
+	return 0;
+}
+
+int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len, bool ingress)
+{
+	struct iphdr *iph;
+	bool ipv4;
+	int err;
+
+	if (unlikely(len < sizeof(struct iphdr) || len > LWT_BPF_MAX_HEADROOM))
+		return -EINVAL;
+
+	/* validate protocol and length */
+	iph = (struct iphdr *)hdr;
+	if (iph->version == 4) {
+		ipv4 = true;
+		if (unlikely(len < iph->ihl * 4))
+			return -EINVAL;
+	} else if (iph->version == 6) {
+		ipv4 = false;
+		if (unlikely(len < sizeof(struct ipv6hdr)))
+			return -EINVAL;
+	} else {
+		return -EINVAL;
+	}
+
+	if (ingress)
+		err = skb_cow_head(skb, len + skb->mac_len);
+	else
+		err = skb_cow_head(skb,
+				   len + LL_RESERVED_SPACE(skb_dst(skb)->dev));
+	if (unlikely(err))
+		return err;
+
+	/* push the encap headers and fix pointers */
+	skb_reset_inner_headers(skb);
+	skb->encapsulation = 1;
+	skb_push(skb, len);
+	if (ingress)
+		skb_postpush_rcsum(skb, iph, len);
+	skb_reset_network_header(skb);
+	memcpy(skb_network_header(skb), hdr, len);
+	bpf_compute_data_pointers(skb);
+	skb_clear_hash(skb);
+
+	if (ipv4) {
+		skb->protocol = htons(ETH_P_IP);
+		iph = ip_hdr(skb);
+
+		if (!iph->check)
+			iph->check = ip_fast_csum((unsigned char *)iph,
+						  iph->ihl);
+	} else {
+		skb->protocol = htons(ETH_P_IPV6);
+	}
+
+	return handle_gso_encap(skb, ipv4, len);
+}
+
 static int __init bpf_lwt_init(void)
 {
 	return lwtunnel_encap_add_ops(&bpf_encap_ops, LWTUNNEL_ENCAP_BPF);
-- 
2.20.1.611.gfbb209baf1-goog


^ permalink raw reply related

* [PATCH bpf-next v7 3/6] bpf: handle GSO in bpf_lwt_push_encap
From: Peter Oskolkov @ 2019-02-07  0:37 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190207003720.51096-1-posk@google.com>

This patch adds handling of GSO packets in bpf_lwt_push_ip_encap()
(called from bpf_lwt_push_encap):

* IPIP, GRE, and UDP encapsulation types are deduced by looking
  into iphdr->protocol or ipv6hdr->next_header;
* an error is returned if the same GSO encap type is set on the skb;
* SCTP GSO packets are not supported (as bpf_skb_proto_4_to_6
  and similar do);
* UDP_L4 GSO packets are also not supported (although they are
  not blocked in bpf_skb_proto_4_to_6 and similar), as
  skb_decrease_gso_size() will break it;
* SKB_GSO_DODGY bit is set.

Note: it may be possible to support SCTP and UDP_L4 gso packets;
      but as these cases seem to be not well handled by other
      tunneling/encapping code paths, the solution should
      be generic enough to apply to all tunneling/encapping code.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 net/core/lwt_bpf.c | 62 +++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 59 insertions(+), 3 deletions(-)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index 786b96148937..4ff60757bf23 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -16,6 +16,7 @@
 #include <linux/types.h>
 #include <linux/bpf.h>
 #include <net/lwtunnel.h>
+#include <net/gre.h>
 
 struct bpf_lwt_prog {
 	struct bpf_prog *prog;
@@ -390,15 +391,70 @@ static const struct lwtunnel_encap_ops bpf_encap_ops = {
 	.owner		= THIS_MODULE,
 };
 
-static int handle_gso_encap(struct sk_buff *skb, bool ipv4, int encap_len)
+static int handle_gso_type(struct sk_buff *skb, unsigned int gso_type,
+			   int encap_len)
 {
-	/* Handling of GSO-enabled packets is added in the next patch. */
-	if (unlikely(skb_is_gso(skb)))
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+	/* Refuse to double-encap with the same type. */
+	if (shinfo->gso_type & gso_type)
 		return -EINVAL;
 
+	gso_type |= SKB_GSO_DODGY;
+	shinfo->gso_type |= gso_type;
+	skb_decrease_gso_size(shinfo, encap_len);
+	shinfo->gso_segs = 0;
 	return 0;
 }
 
+static int handle_gso_encap(struct sk_buff *skb, bool ipv4, int encap_len)
+{
+	void *next_hdr;
+	__u8 protocol;
+
+	if (!skb_is_gso(skb))
+		return 0;
+
+	/* SCTP and UDP_L4 gso need more nuanced handling than what
+	 * handle_gso_type() does above: skb_decrease_gso_size() is not enough.
+	 */
+	if (unlikely(skb_shinfo(skb)->gso_type &
+		     (SKB_GSO_SCTP | SKB_GSO_UDP_L4)))
+		return -ENOTSUPP;
+
+	if (ipv4) {
+		protocol = ip_hdr(skb)->protocol;
+		next_hdr = skb_network_header(skb) + sizeof(struct iphdr);
+	} else {
+		protocol = ipv6_hdr(skb)->nexthdr;
+		next_hdr = skb_network_header(skb) + sizeof(struct ipv6hdr);
+	}
+
+	switch (protocol) {
+	case IPPROTO_GRE:
+		if (((struct gre_base_hdr *)next_hdr)->flags & GRE_CSUM)
+			return handle_gso_type(skb, SKB_GSO_GRE_CSUM,
+					       encap_len);
+		return handle_gso_type(skb, SKB_GSO_GRE, encap_len);
+
+	case IPPROTO_UDP:
+		if (((struct udphdr *)next_hdr)->check)
+			return handle_gso_type(skb, SKB_GSO_UDP_TUNNEL_CSUM,
+					       encap_len);
+		return handle_gso_type(skb, SKB_GSO_UDP_TUNNEL, encap_len);
+
+	case IPPROTO_IP:
+	case IPPROTO_IPV6:
+		if (ipv4)
+			return handle_gso_type(skb, SKB_GSO_IPXIP4, encap_len);
+		else
+			return handle_gso_type(skb, SKB_GSO_IPXIP6, encap_len);
+
+	default:
+		return -EPROTONOSUPPORT;
+	}
+}
+
 int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len, bool ingress)
 {
 	struct iphdr *iph;
-- 
2.20.1.611.gfbb209baf1-goog


^ permalink raw reply related

* [PATCH bpf-next v7 4/6] bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c
From: Peter Oskolkov @ 2019-02-07  0:37 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190207003720.51096-1-posk@google.com>

This patch builds on top of the previous patch in the patchset,
which added BPF_LWT_ENCAP_IP mode to bpf_lwt_push_encap. As the
encapping can result in the skb needing to go via a different
interface/route/dst, bpf programs can indicate this by returning
BPF_LWT_REROUTE, which triggers a new route lookup for the skb.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 net/core/lwt_bpf.c | 125 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 125 insertions(+)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index 4ff60757bf23..faeeb2ed3f1f 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -17,6 +17,7 @@
 #include <linux/bpf.h>
 #include <net/lwtunnel.h>
 #include <net/gre.h>
+#include <net/ip6_route.h>
 
 struct bpf_lwt_prog {
 	struct bpf_prog *prog;
@@ -56,6 +57,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 
 	switch (ret) {
 	case BPF_OK:
+	case BPF_LWT_REROUTE:
 		break;
 
 	case BPF_REDIRECT:
@@ -88,6 +90,32 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 	return ret;
 }
 
+static int bpf_lwt_input_reroute(struct sk_buff *skb)
+{
+	int err = -EINVAL;
+
+	if (skb->protocol == htons(ETH_P_IP)) {
+		struct iphdr *iph = ip_hdr(skb);
+
+		err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
+					   iph->tos, skb_dst(skb)->dev);
+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
+		ip6_route_input(skb);
+		err = skb_dst(skb)->error;
+	} else {
+		pr_warn_once("BPF_LWT_REROUTE input: unsupported proto %d\n",
+			     skb->protocol);
+	}
+
+	if (err)
+		goto err;
+	return dst_input(skb);
+
+err:
+	kfree_skb(skb);
+	return err;
+}
+
 static int bpf_input(struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
@@ -99,6 +127,8 @@ static int bpf_input(struct sk_buff *skb)
 		ret = run_lwt_bpf(skb, &bpf->in, dst, NO_REDIRECT);
 		if (ret < 0)
 			return ret;
+		if (ret == BPF_LWT_REROUTE)
+			return bpf_lwt_input_reroute(skb);
 	}
 
 	if (unlikely(!dst->lwtstate->orig_input)) {
@@ -148,6 +178,90 @@ static int xmit_check_hhlen(struct sk_buff *skb)
 	return 0;
 }
 
+static int bpf_lwt_xmit_reroute(struct sk_buff *skb)
+{
+	struct net_device *l3mdev = l3mdev_master_dev_rcu(skb_dst(skb)->dev);
+	int oif = l3mdev ? l3mdev->ifindex : 0;
+	struct dst_entry *dst = NULL;
+	struct sock *sk;
+	struct net *net;
+	bool ipv4;
+	int err;
+
+	if (skb->protocol == htons(ETH_P_IP)) {
+		ipv4 = true;
+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
+		ipv4 = false;
+	} else {
+		pr_warn_once("BPF_LWT_REROUTE xmit: unsupported proto %d\n",
+			     skb->protocol);
+		return -EINVAL;
+	}
+
+	sk = sk_to_full_sk(skb->sk);
+	if (sk) {
+		if (sk->sk_bound_dev_if)
+			oif = sk->sk_bound_dev_if;
+		net = sock_net(sk);
+	} else {
+		net = dev_net(skb_dst(skb)->dev);
+	}
+
+	if (ipv4) {
+		struct iphdr *iph = ip_hdr(skb);
+		struct flowi4 fl4 = {0};
+		struct rtable *rt;
+
+		fl4.flowi4_oif = oif;
+		fl4.flowi4_mark = skb->mark;
+		fl4.flowi4_uid = sock_net_uid(net, sk);
+		fl4.flowi4_tos = RT_TOS(iph->tos);
+		fl4.flowi4_flags = FLOWI_FLAG_ANYSRC;
+		fl4.flowi4_proto = iph->protocol;
+		fl4.daddr = iph->daddr;
+		fl4.saddr = iph->saddr;
+
+		rt = ip_route_output_key(net, &fl4);
+		if (IS_ERR(rt) || rt->dst.error)
+			return -EINVAL;
+		dst = &rt->dst;
+	} else {
+		struct ipv6hdr *iph6 = ipv6_hdr(skb);
+		struct flowi6 fl6 = {0};
+
+		fl6.flowi6_oif = oif;
+		fl6.flowi6_mark = skb->mark;
+		fl6.flowi6_uid = sock_net_uid(net, sk);
+		fl6.flowlabel = ip6_flowinfo(iph6);
+		fl6.flowi6_proto = iph6->nexthdr;
+		fl6.daddr = iph6->daddr;
+		fl6.saddr = iph6->saddr;
+
+		dst = ip6_route_output(net, skb->sk, &fl6);
+		if (IS_ERR(dst) || dst->error)
+			return -EINVAL;
+	}
+
+	/* Although skb header was reserved in bpf_lwt_push_ip_encap(), it
+	 * was done for the previous dst, so we are doing it here again, in
+	 * case the new dst needs much more space. The call below is a noop
+	 * if there is enough header space in skb.
+	 */
+	err = skb_cow_head(skb, LL_RESERVED_SPACE(dst->dev));
+	if (unlikely(err))
+		return err;
+
+	skb_dst_drop(skb);
+	skb_dst_set(skb, dst);
+
+	err = dst_output(dev_net(skb_dst(skb)->dev), skb->sk, skb);
+	if (unlikely(err))
+		return err;
+
+	/* ip[6]_finish_output2 understand LWTUNNEL_XMIT_DONE */
+	return LWTUNNEL_XMIT_DONE;
+}
+
 static int bpf_xmit(struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
@@ -155,11 +269,20 @@ static int bpf_xmit(struct sk_buff *skb)
 
 	bpf = bpf_lwt_lwtunnel(dst->lwtstate);
 	if (bpf->xmit.prog) {
+		__be16 proto = skb->protocol;
 		int ret;
 
 		ret = run_lwt_bpf(skb, &bpf->xmit, dst, CAN_REDIRECT);
 		switch (ret) {
 		case BPF_OK:
+			/* If the header changed, e.g. via bpf_lwt_push_encap,
+			 * BPF_LWT_REROUTE below should have been used if the
+			 * protocol was also changed.
+			 */
+			if (skb->protocol != proto) {
+				kfree_skb(skb);
+				return -EINVAL;
+			}
 			/* If the header was expanded, headroom might be too
 			 * small for L2 header to come, expand as needed.
 			 */
@@ -170,6 +293,8 @@ static int bpf_xmit(struct sk_buff *skb)
 			return LWTUNNEL_XMIT_CONTINUE;
 		case BPF_REDIRECT:
 			return LWTUNNEL_XMIT_DONE;
+		case BPF_LWT_REROUTE:
+			return bpf_lwt_xmit_reroute(skb);
 		default:
 			return ret;
 		}
-- 
2.20.1.611.gfbb209baf1-goog


^ permalink raw reply related

* [PATCH bpf-next v7 5/6] bpf: sync <kdir>/include/.../bpf.h with tools/include/.../bpf.h
From: Peter Oskolkov @ 2019-02-07  0:37 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190207003720.51096-1-posk@google.com>

This patch copies changes in bpf.h done by a previous patch
in this patchset from the kernel uapi include dir into tools
uapi include dir.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 tools/include/uapi/linux/bpf.h | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 1777fa0c61e4..138089ff24cf 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2016,6 +2016,19 @@ union bpf_attr {
  *			Only works if *skb* contains an IPv6 packet. Insert a
  *			Segment Routing Header (**struct ipv6_sr_hdr**) inside
  *			the IPv6 header.
+ *		**BPF_LWT_ENCAP_IP**
+ *			IP encapsulation (GRE/GUE/IPIP/etc). The outer header
+ *			must be IPv4 or IPv6, followed by zero or more
+ *			additional headers, up to LWT_BPF_MAX_HEADROOM total
+ *			bytes in all prepended headers. PLease note that
+ *			if skb_is_gso(skb) is true, no more than two headers
+ *			can be prepended, and the inner header, if present,
+ *			should be either GRE or UDP/GUE.
+ *
+ *		BPF_LWT_ENCAP_SEG6*** types can be called by bpf programs of
+ *		type BPF_PROG_TYPE_LWT_IN; BPF_LWT_ENCAP_IP type can be called
+ *		by bpf programs of types BPF_PROG_TYPE_LWT_IN and
+ *		BPF_PROG_TYPE_LWT_XMIT.
  *
  * 		A call to this helper is susceptible to change the underlaying
  * 		packet buffer. Therefore, at load time, all checks on pointers
@@ -2498,7 +2511,8 @@ enum bpf_hdr_start_off {
 /* Encapsulation type for BPF_FUNC_lwt_push_encap helper. */
 enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_SEG6,
-	BPF_LWT_ENCAP_SEG6_INLINE
+	BPF_LWT_ENCAP_SEG6_INLINE,
+	BPF_LWT_ENCAP_IP,
 };
 
 #define __bpf_md_ptr(type, name)	\
@@ -2586,7 +2600,15 @@ enum bpf_ret_code {
 	BPF_DROP = 2,
 	/* 3-6 reserved */
 	BPF_REDIRECT = 7,
-	/* >127 are reserved for prog type specific return codes */
+	/* >127 are reserved for prog type specific return codes.
+	 *
+	 * BPF_LWT_REROUTE: used by BPF_PROG_TYPE_LWT_IN and
+	 *    BPF_PROG_TYPE_LWT_XMIT to indicate that skb had been
+	 *    changed and should be routed based on its new L3 header.
+	 *    (This is an L3 redirect, as opposed to L2 redirect
+	 *    represented by BPF_REDIRECT above).
+	 */
+	BPF_LWT_REROUTE = 128,
 };
 
 struct bpf_sock {
-- 
2.20.1.611.gfbb209baf1-goog


^ permalink raw reply related

* [PATCH bpf-next v7 6/6] selftests: bpf: add test_lwt_ip_encap selftest
From: Peter Oskolkov @ 2019-02-07  0:37 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190207003720.51096-1-posk@google.com>

This patch adds a bpf self-test to cover BPF_LWT_ENCAP_IP mode
in bpf_lwt_push_encap.

Covered:
- encapping in LWT_IN and LWT_XMIT
- IPv4 and IPv6

A follow-up patch will add VRF-enabled tests.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 tools/testing/selftests/bpf/Makefile          |   6 +-
 .../testing/selftests/bpf/test_lwt_ip_encap.c |  85 +++++
 .../selftests/bpf/test_lwt_ip_encap.sh        | 311 ++++++++++++++++++
 3 files changed, 400 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lwt_ip_encap.c
 create mode 100755 tools/testing/selftests/bpf/test_lwt_ip_encap.sh

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 383d2ff13fc7..d56c74727b6c 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -35,7 +35,8 @@ BPF_OBJ_FILES = \
 	sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
 	get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
 	test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o test_xdp_vlan.o \
-	xdp_dummy.o test_map_in_map.o test_spin_lock.o test_map_lock.o
+	xdp_dummy.o test_map_in_map.o test_spin_lock.o test_map_lock.o \
+	test_lwt_ip_encap.o
 
 # Objects are built with default compilation flags and with sub-register
 # code-gen enabled.
@@ -73,7 +74,8 @@ TEST_PROGS := test_kmod.sh \
 	test_lirc_mode2.sh \
 	test_skb_cgroup_id.sh \
 	test_flow_dissector.sh \
-	test_xdp_vlan.sh
+	test_xdp_vlan.sh \
+	test_lwt_ip_encap.sh
 
 TEST_PROGS_EXTENDED := with_addr.sh \
 	with_tunnels.sh \
diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.c b/tools/testing/selftests/bpf/test_lwt_ip_encap.c
new file mode 100644
index 000000000000..c957d6dfe6d7
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.c
@@ -0,0 +1,85 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stddef.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+struct grehdr {
+	__be16 flags;
+	__be16 protocol;
+};
+
+SEC("encap_gre")
+int bpf_lwt_encap_gre(struct __sk_buff *skb)
+{
+	struct encap_hdr {
+		struct iphdr iph;
+		struct grehdr greh;
+	} hdr;
+	int err;
+
+	memset(&hdr, 0, sizeof(struct encap_hdr));
+
+	hdr.iph.ihl = 5;
+	hdr.iph.version = 4;
+	hdr.iph.ttl = 0x40;
+	hdr.iph.protocol = 47;  /* IPPROTO_GRE */
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+	hdr.iph.saddr = 0x640110ac;  /* 172.16.1.100 */
+	hdr.iph.daddr = 0x641010ac;  /* 172.16.16.100 */
+#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+	hdr.iph.saddr = 0xac100164;  /* 172.16.1.100 */
+	hdr.iph.daddr = 0xac101064;  /* 172.16.16.100 */
+#else
+#error "Fix your compiler's __BYTE_ORDER__?!"
+#endif
+	hdr.iph.tot_len = bpf_htons(skb->len + sizeof(struct encap_hdr));
+
+	hdr.greh.protocol = skb->protocol;
+
+	err = bpf_lwt_push_encap(skb, BPF_LWT_ENCAP_IP, &hdr,
+				 sizeof(struct encap_hdr));
+	if (err)
+		return BPF_DROP;
+
+	return BPF_LWT_REROUTE;
+}
+
+SEC("encap_gre6")
+int bpf_lwt_encap_gre6(struct __sk_buff *skb)
+{
+	struct encap_hdr {
+		struct ipv6hdr ip6hdr;
+		struct grehdr greh;
+	} hdr;
+	int err;
+
+	memset(&hdr, 0, sizeof(struct encap_hdr));
+
+	hdr.ip6hdr.version = 6;
+	hdr.ip6hdr.payload_len = bpf_htons(skb->len + sizeof(struct grehdr));
+	hdr.ip6hdr.nexthdr = 47;  /* IPPROTO_GRE */
+	hdr.ip6hdr.hop_limit = 0x40;
+	/* fb01::1 */
+	hdr.ip6hdr.saddr.s6_addr[0] = 0xfb;
+	hdr.ip6hdr.saddr.s6_addr[1] = 1;
+	hdr.ip6hdr.saddr.s6_addr[15] = 1;
+	/* fb10::1 */
+	hdr.ip6hdr.daddr.s6_addr[0] = 0xfb;
+	hdr.ip6hdr.daddr.s6_addr[1] = 0x10;
+	hdr.ip6hdr.daddr.s6_addr[15] = 1;
+
+	hdr.greh.protocol = skb->protocol;
+
+	err = bpf_lwt_push_encap(skb, BPF_LWT_ENCAP_IP, &hdr,
+				 sizeof(struct encap_hdr));
+	if (err)
+		return BPF_DROP;
+
+	return BPF_LWT_REROUTE;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.sh b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
new file mode 100755
index 000000000000..4ca714e23ab0
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
@@ -0,0 +1,311 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Setup/topology:
+#
+#    NS1             NS2             NS3
+#   veth1 <---> veth2   veth3 <---> veth4 (the top route)
+#   veth5 <---> veth6   veth7 <---> veth8 (the bottom route)
+#
+#   each vethN gets IPv[4|6]_N address
+#
+#   IPv*_SRC = IPv*_1
+#   IPv*_DST = IPv*_4
+#
+#   all tests test pings from IPv*_SRC to IPv*_DST
+#
+#   by default, routes are configured to allow packets to go
+#   IP*_1 <=> IP*_2 <=> IP*_3 <=> IP*_4 (the top route)
+#
+#   a GRE device is installed in NS3 with IPv*_GRE, and
+#   NS1/NS2 are configured to route packets to IPv*_GRE via IP*_8
+#   (the bottom route)
+#
+# Tests:
+#
+#   1. routes NS2->IPv*_DST are brought down, so the only way a ping
+#      from IP*_SRC to IP*_DST can work is via IPv*_GRE
+#
+#   2a. in an egress test, a bpf LWT_XMIT program is installed on veth1
+#       that encaps the packets with an IP/GRE header to route to IPv*_GRE
+#
+#       ping: SRC->[encap at veth1:egress]->GRE:decap->DST
+#       ping replies go DST->SRC directly
+#
+#   2b. in an ingress test, a bpf LWT_IN program is installed on veth2
+#       that encaps the packets with an IP/GRE header to route to IPv*_GRE
+#
+#       ping: SRC->[encap at veth2:ingress]->GRE:decap->DST
+#       ping replies go DST->SRC directly
+
+set -e  # exit on error
+
+if [[ $EUID -ne 0 ]]; then
+	echo "This script must be run as root"
+	echo "FAIL"
+	exit 1
+fi
+
+readonly NS1="ns1-$(mktemp -u XXXXXX)"
+readonly NS2="ns2-$(mktemp -u XXXXXX)"
+readonly NS3="ns3-$(mktemp -u XXXXXX)"
+
+readonly IPv4_1="172.16.1.100"
+readonly IPv4_2="172.16.2.100"
+readonly IPv4_3="172.16.3.100"
+readonly IPv4_4="172.16.4.100"
+readonly IPv4_5="172.16.5.100"
+readonly IPv4_6="172.16.6.100"
+readonly IPv4_7="172.16.7.100"
+readonly IPv4_8="172.16.8.100"
+readonly IPv4_GRE="172.16.16.100"
+
+readonly IPv4_SRC=$IPv4_1
+readonly IPv4_DST=$IPv4_4
+
+readonly IPv6_1="fb01::1"
+readonly IPv6_2="fb02::1"
+readonly IPv6_3="fb03::1"
+readonly IPv6_4="fb04::1"
+readonly IPv6_5="fb05::1"
+readonly IPv6_6="fb06::1"
+readonly IPv6_7="fb07::1"
+readonly IPv6_8="fb08::1"
+readonly IPv6_GRE="fb10::1"
+
+readonly IPv6_SRC=$IPv6_1
+readonly IPv6_DST=$IPv6_4
+
+setup() {
+set -e  # exit on error
+	# create devices and namespaces
+	ip netns add "${NS1}"
+	ip netns add "${NS2}"
+	ip netns add "${NS3}"
+
+	ip link add veth1 type veth peer name veth2
+	ip link add veth3 type veth peer name veth4
+	ip link add veth5 type veth peer name veth6
+	ip link add veth7 type veth peer name veth8
+
+	ip netns exec ${NS2} sysctl -wq net.ipv4.ip_forward=1
+	ip netns exec ${NS2} sysctl -wq net.ipv6.conf.all.forwarding=1
+
+	ip link set veth1 netns ${NS1}
+	ip link set veth2 netns ${NS2}
+	ip link set veth3 netns ${NS2}
+	ip link set veth4 netns ${NS3}
+	ip link set veth5 netns ${NS1}
+	ip link set veth6 netns ${NS2}
+	ip link set veth7 netns ${NS2}
+	ip link set veth8 netns ${NS3}
+
+	# configure addesses: the top route (1-2-3-4)
+	ip -netns ${NS1}    addr add ${IPv4_1}/24  dev veth1
+	ip -netns ${NS2}    addr add ${IPv4_2}/24  dev veth2
+	ip -netns ${NS2}    addr add ${IPv4_3}/24  dev veth3
+	ip -netns ${NS3}    addr add ${IPv4_4}/24  dev veth4
+	ip -netns ${NS1} -6 addr add ${IPv6_1}/128 nodad dev veth1
+	ip -netns ${NS2} -6 addr add ${IPv6_2}/128 nodad dev veth2
+	ip -netns ${NS2} -6 addr add ${IPv6_3}/128 nodad dev veth3
+	ip -netns ${NS3} -6 addr add ${IPv6_4}/128 nodad dev veth4
+
+	# configure addresses: the bottom route (5-6-7-8)
+	ip -netns ${NS1}    addr add ${IPv4_5}/24  dev veth5
+	ip -netns ${NS2}    addr add ${IPv4_6}/24  dev veth6
+	ip -netns ${NS2}    addr add ${IPv4_7}/24  dev veth7
+	ip -netns ${NS3}    addr add ${IPv4_8}/24  dev veth8
+	ip -netns ${NS1} -6 addr add ${IPv6_5}/128 nodad dev veth5
+	ip -netns ${NS2} -6 addr add ${IPv6_6}/128 nodad dev veth6
+	ip -netns ${NS2} -6 addr add ${IPv6_7}/128 nodad dev veth7
+	ip -netns ${NS3} -6 addr add ${IPv6_8}/128 nodad dev veth8
+
+
+	ip -netns ${NS1} link set dev veth1 up
+	ip -netns ${NS2} link set dev veth2 up
+	ip -netns ${NS2} link set dev veth3 up
+	ip -netns ${NS3} link set dev veth4 up
+	ip -netns ${NS1} link set dev veth5 up
+	ip -netns ${NS2} link set dev veth6 up
+	ip -netns ${NS2} link set dev veth7 up
+	ip -netns ${NS3} link set dev veth8 up
+
+	# configure routes: IP*_SRC -> veth1/IP*_2 (= top route) default;
+	# the bottom route to specific bottom addresses
+
+	# NS1
+	# top route
+	ip -netns ${NS1}    route add ${IPv4_2}/32  dev veth1
+	ip -netns ${NS1}    route add default dev veth1 via ${IPv4_2}  # go top by default
+	ip -netns ${NS1} -6 route add ${IPv6_2}/128 dev veth1
+	ip -netns ${NS1} -6 route add default dev veth1 via ${IPv6_2}  # go top by default
+	# bottom route
+	ip -netns ${NS1}    route add ${IPv4_6}/32  dev veth5
+	ip -netns ${NS1}    route add ${IPv4_7}/32  dev veth5 via ${IPv4_6}
+	ip -netns ${NS1}    route add ${IPv4_8}/32  dev veth5 via ${IPv4_6}
+	ip -netns ${NS1} -6 route add ${IPv6_6}/128 dev veth5
+	ip -netns ${NS1} -6 route add ${IPv6_7}/128 dev veth5 via ${IPv6_6}
+	ip -netns ${NS1} -6 route add ${IPv6_8}/128 dev veth5 via ${IPv6_6}
+
+	# NS2
+	# top route
+	ip -netns ${NS2}    route add ${IPv4_1}/32  dev veth2
+	ip -netns ${NS2}    route add ${IPv4_4}/32  dev veth3
+	ip -netns ${NS2} -6 route add ${IPv6_1}/128 dev veth2
+	ip -netns ${NS2} -6 route add ${IPv6_4}/128 dev veth3
+	# bottom route
+	ip -netns ${NS2}    route add ${IPv4_5}/32  dev veth6
+	ip -netns ${NS2}    route add ${IPv4_8}/32  dev veth7
+	ip -netns ${NS2} -6 route add ${IPv6_5}/128 dev veth6
+	ip -netns ${NS2} -6 route add ${IPv6_8}/128 dev veth7
+
+	# NS3
+	# top route
+	ip -netns ${NS3}    route add ${IPv4_3}/32  dev veth4
+	ip -netns ${NS3}    route add ${IPv4_1}/32  dev veth4 via ${IPv4_3}
+	ip -netns ${NS3}    route add ${IPv4_2}/32  dev veth4 via ${IPv4_3}
+	ip -netns ${NS3} -6 route add ${IPv6_3}/128 dev veth4
+	ip -netns ${NS3} -6 route add ${IPv6_1}/128 dev veth4 via ${IPv6_3}
+	ip -netns ${NS3} -6 route add ${IPv6_2}/128 dev veth4 via ${IPv6_3}
+	# bottom route
+	ip -netns ${NS3}    route add ${IPv4_7}/32  dev veth8
+	ip -netns ${NS3}    route add ${IPv4_5}/32  dev veth8 via ${IPv4_7}
+	ip -netns ${NS3}    route add ${IPv4_6}/32  dev veth8 via ${IPv4_7}
+	ip -netns ${NS3} -6 route add ${IPv6_7}/128 dev veth8
+	ip -netns ${NS3} -6 route add ${IPv6_5}/128 dev veth8 via ${IPv6_7}
+	ip -netns ${NS3} -6 route add ${IPv6_6}/128 dev veth8 via ${IPv6_7}
+
+	# configure IPv4 GRE device in NS3, and a route to it via the "bottom" route
+	ip -netns ${NS3} tunnel add gre_dev mode gre remote ${IPv4_1} local ${IPv4_GRE} ttl 255
+	ip -netns ${NS3} link set gre_dev up
+	ip -netns ${NS3} addr add ${IPv4_GRE} dev gre_dev
+	ip -netns ${NS1} route add ${IPv4_GRE}/32 dev veth5 via ${IPv4_6}
+	ip -netns ${NS2} route add ${IPv4_GRE}/32 dev veth7 via ${IPv4_8}
+
+
+	# configure IPv6 GRE device in NS3, and a route to it via the "bottom" route
+	ip -netns ${NS3} -6 tunnel add name gre6_dev mode ip6gre remote ${IPv6_1} local ${IPv6_GRE} ttl 255
+	ip -netns ${NS3} link set gre6_dev up
+	ip -netns ${NS3} -6 addr add ${IPv6_GRE} nodad dev gre6_dev
+	ip -netns ${NS1} -6 route add ${IPv6_GRE}/128 dev veth5 via ${IPv6_6}
+	ip -netns ${NS2} -6 route add ${IPv6_GRE}/128 dev veth7 via ${IPv6_8}
+
+	# rp_filter gets confused by what these tests are doing, so disable it
+	ip netns exec ${NS1} sysctl -wq net.ipv4.conf.all.rp_filter=0
+	ip netns exec ${NS2} sysctl -wq net.ipv4.conf.all.rp_filter=0
+	ip netns exec ${NS3} sysctl -wq net.ipv4.conf.all.rp_filter=0
+}
+
+cleanup() {
+	ip netns del ${NS1} 2> /dev/null
+	ip netns del ${NS2} 2> /dev/null
+	ip netns del ${NS3} 2> /dev/null
+}
+
+trap cleanup EXIT
+
+test_ping() {
+	local readonly PROTO=$1
+	local readonly EXPECTED=$2
+	local RET=0
+
+	set +e
+	if [ "${PROTO}" == "IPv4" ] ; then
+		ip netns exec ${NS1} ping  -c 1 -W 1 -I ${IPv4_SRC} ${IPv4_DST} 2>&1 > /dev/null
+		RET=$?
+	elif [ "${PROTO}" == "IPv6" ] ; then
+		ip netns exec ${NS1} ping6 -c 1 -W 6 -I ${IPv6_SRC} ${IPv6_DST} 2>&1 > /dev/null
+		RET=$?
+	else
+		echo "test_ping: unknown PROTO: ${PROTO}"
+		exit 1
+	fi
+	set -e
+
+	if [ "0" != "${RET}" ]; then
+		RET=1
+	fi
+
+	if [ "${EXPECTED}" != "${RET}" ] ; then
+		echo "FAIL: test_ping: ${RET}"
+		exit 1
+	fi
+}
+
+test_egress() {
+	local readonly ENCAP=$1
+	echo "starting egress ${ENCAP} encap test"
+	setup
+
+	# need to wait a bit for IPv6 to autoconf, otherwise
+	# ping6 sometimes fails with "unable to bind to address"
+
+	# by default, pings work
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	# remove NS2->DST routes, ping fails
+	ip -netns ${NS2}    route del ${IPv4_DST}/32  dev veth3
+	ip -netns ${NS2} -6 route del ${IPv6_DST}/128 dev veth3
+	test_ping IPv4 1
+	test_ping IPv6 1
+
+	# install replacement routes (LWT/eBPF), pings succeed
+	if [ "${ENCAP}" == "IPv4" ] ; then
+		ip -netns ${NS1} route add ${IPv4_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre dev veth1
+		ip -netns ${NS1} -6 route add ${IPv6_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre dev veth1
+	elif [ "${ENCAP}" == "IPv6" ] ; then
+		ip -netns ${NS1} route add ${IPv4_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre6 dev veth1
+		ip -netns ${NS1} -6 route add ${IPv6_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre6 dev veth1
+	else
+		echo "FAIL: unknown encap ${ENCAP}"
+	fi
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	cleanup
+	echo "PASS"
+}
+
+test_ingress() {
+	local readonly ENCAP=$1
+	echo "starting ingress ${ENCAP} encap test"
+	setup
+
+	# need to wait a bit for IPv6 to autoconf, otherwise
+	# ping6 sometimes fails with "unable to bind to address"
+
+	# by default, pings work
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	# remove NS2->DST routes, pings fail
+	ip -netns ${NS2}    route del ${IPv4_DST}/32  dev veth3
+	ip -netns ${NS2} -6 route del ${IPv6_DST}/128 dev veth3
+	test_ping IPv4 1
+	test_ping IPv6 1
+
+	# install replacement routes (LWT/eBPF), pings succeed
+	if [ "${ENCAP}" == "IPv4" ] ; then
+		ip -netns ${NS2} route add ${IPv4_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre dev veth2
+		ip -netns ${NS2} -6 route add ${IPv6_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre dev veth2
+	elif [ "${ENCAP}" == "IPv6" ] ; then
+		ip -netns ${NS2} route add ${IPv4_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre6 dev veth2
+		ip -netns ${NS2} -6 route add ${IPv6_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre6 dev veth2
+	else
+		echo "FAIL: unknown encap ${ENCAP}"
+	fi
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	cleanup
+	echo "PASS"
+}
+
+test_egress IPv4
+test_egress IPv6
+
+test_ingress IPv4
+test_ingress IPv6
+
+echo "all tests passed"
-- 
2.20.1.611.gfbb209baf1-goog


^ permalink raw reply related

* pull-request: bpf-next 2019-02-07
From: Daniel Borkmann @ 2019-02-07  0:42 UTC (permalink / raw)
  To: davem; +Cc: daniel, ast, netdev, bpf

Hi David,

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add a riscv64 JIT for BPF, from Björn.

2) Implement BTF deduplication algorithm for libbpf which takes BTF type
   information containing duplicate per-compilation unit information and
   reduces it to an equivalent set of BTF types with no duplication and
   without loss of information, from Andrii.

3) Offloaded and native BPF XDP programs can coexist today, enable also
   offloaded and generic ones as well, from Jakub.

4) Expose various BTF related helper functions in libbpf as API which
   are in particular helpful for JITed programs, from Yonghong.

5) Fix the recently added JMP32 code emission in s390x JIT, from Heiko.

6) Fix BPF kselftests' tcp_{server,client}.py to be able to run inside
   a network namespace, also add a fix for libbpf to get libbpf_print()
   working, from Stanislav.

7) Fixes for bpftool documentation, from Prashant.

8) Type cleanup in BPF kselftests' test_maps.c to silence a gcc8 warning,
   from Breno.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Thanks a lot!

----------------------------------------------------------------

The following changes since commit cc7335786f7278d66bdcf96d3d411edfcb01be51:

  socket: fix for Add SO_TIMESTAMP[NS]_NEW (2019-02-03 20:36:11 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 

for you to fetch changes up to dd9cef43c222df7c0d76d34451808e789952379d:

  bpf: test_maps: fix possible out of bound access warning (2019-02-06 15:48:43 +0100)

----------------------------------------------------------------
Alexei Starovoitov (2):
      Merge branch 'change-libbpf-print-api'
      Merge branch 'libbpf-btf_ext'

Andrii Nakryiko (3):
      btf: extract BTF type size calculation
      btf: add BTF types deduplication algorithm
      selftests/btf: add initial BTF dedup tests

Björn Töpel (4):
      bpf, riscv: add BPF JIT for RV64G
      MAINTAINERS: add RISC-V BPF JIT maintainer
      bpf, doc: add RISC-V JIT to BPF documentation
      selftests/bpf: add "any alignment" annotation for some tests

Breno Leitao (1):
      bpf: test_maps: fix possible out of bound access warning

Daniel Borkmann (3):
      Merge branch 'bpf-btf-dedup'
      Merge branch 'bpf-riscv-jit'
      Merge branch 'bpf-xdp-hw-plus-generic'

Heiko Carstens (1):
      s390: bpf: fix JMP32 code-gen

Jakub Kicinski (5):
      selftests/bpf: fix the expected messages
      net: xdp: allow generic and driver XDP on one interface
      selftests/bpf: print traceback when test fails
      selftests/bpf: add test for mixing generic and offload XDP
      selftests/bpf: test reading the offloaded program

Prashant Bhole (1):
      tools: bpftool: doc, fix incorrect text

Stanislav Fomichev (2):
      selftests/bpf: use localhost in tcp_{server,client}.py
      libbpf: fix libbpf_print

Yonghong Song (8):
      tools/bpf: move libbpf pr_* debug print functions to headers
      tools/bpf: print out btf log at LIBBPF_WARN level
      tools/bpf: simplify libbpf API function libbpf_set_print()
      tools/bpf: expose functions btf_ext__* as API functions
      tools/bpf: implement libbpf btf__get_map_kv_tids() API function
      tools/bpf: fix a selftest test_btf failure
      tools/bpf: add const qualifier to btf__get_map_kv_tids() map_name parameter
      tools/bpf: silence a libbpf unnecessary warning

 Documentation/networking/filter.txt                |   16 +-
 Documentation/sysctl/net.txt                       |    1 +
 MAINTAINERS                                        |    6 +
 arch/riscv/Kconfig                                 |    1 +
 arch/riscv/Makefile                                |    2 +-
 arch/riscv/net/Makefile                            |    1 +
 arch/riscv/net/bpf_jit_comp.c                      | 1602 +++++++++++++++
 arch/s390/net/bpf_jit_comp.c                       |    6 +-
 net/core/dev.c                                     |   10 +-
 tools/bpf/bpftool/Documentation/bpftool-cgroup.rst |    4 +-
 .../bpf/bpftool/Documentation/bpftool-feature.rst  |    4 +-
 tools/bpf/bpftool/Documentation/bpftool-prog.rst   |    2 +-
 tools/lib/bpf/btf.c                                | 2032 ++++++++++++++++++--
 tools/lib/bpf/btf.h                                |   43 +-
 tools/lib/bpf/libbpf.c                             |  125 +-
 tools/lib/bpf/libbpf.h                             |   19 +-
 tools/lib/bpf/libbpf.map                           |   10 +
 tools/lib/bpf/libbpf_util.h                        |   30 +
 tools/lib/bpf/test_libbpf.cpp                      |    4 +-
 tools/perf/util/bpf-loader.c                       |   26 +-
 tools/testing/selftests/bpf/tcp_client.py          |    3 +-
 tools/testing/selftests/bpf/tcp_server.py          |    5 +-
 tools/testing/selftests/bpf/test_btf.c             |  553 +++++-
 tools/testing/selftests/bpf/test_libbpf_open.c     |   30 +-
 tools/testing/selftests/bpf/test_maps.c            |   27 +-
 tools/testing/selftests/bpf/test_offload.py        |  135 +-
 tools/testing/selftests/bpf/test_progs.c           |   14 +-
 tools/testing/selftests/bpf/verifier/ctx_sk_msg.c  |    1 +
 tools/testing/selftests/bpf/verifier/ctx_skb.c     |    1 +
 tools/testing/selftests/bpf/verifier/jmp32.c       |   22 +
 tools/testing/selftests/bpf/verifier/jset.c        |    2 +
 tools/testing/selftests/bpf/verifier/spill_fill.c  |    1 +
 tools/testing/selftests/bpf/verifier/spin_lock.c   |    2 +
 .../selftests/bpf/verifier/value_ptr_arith.c       |    4 +
 34 files changed, 4353 insertions(+), 391 deletions(-)
 create mode 100644 arch/riscv/net/Makefile
 create mode 100644 arch/riscv/net/bpf_jit_comp.c
 create mode 100644 tools/lib/bpf/libbpf_util.h

^ permalink raw reply

* Re: [PATCH v1] net: dsa: qca8k: implement DT-based ports <-> phy translation
From: Christian Lamparter @ 2019-02-07  0:43 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: Andrew Lunn, netdev, Vivien Didelot
In-Reply-To: <14f73250-4255-6f4e-336a-9bf289539b75@gmail.com>

On Wednesday, February 6, 2019 11:29:18 PM CET Florian Fainelli wrote:
> On 2/6/19 1:57 PM, Christian Lamparter wrote:
> > On Tuesday, February 5, 2019 11:29:36 PM CET Florian Fainelli wrote:
> >> On 2/5/19 2:12 PM, Christian Lamparter wrote:
> >>> On Tuesday, February 5, 2019 10:29:34 PM CET Andrew Lunn wrote:
> >>>>> For now, I added the DT binding update to the patch as well.
> >>>>> But if this is indeed the way to go, it'll get a separate patch.
> >>>>
> >>>> Hi Christian 
> >>>>
> >>>> You need to be careful with the DT binding. You need to keep backwards
> >>>> compatible with it. An old DT blob needs to keep working. I don't
> >>>> think this is true with this change.
> >>>
> >>> Do you mean because of the 
> >>>
> >>> -               switch0@0 {
> >>> +               switch@10 {
> >>>                         compatible = "qca,qca8337";
> >>>                         #address-cells = <1>;
> >>>                         #size-cells = <0>;
> >>>  
> >>> -                       reg = <0>;
> >>> +                       reg = <0x10>;
> >>>
> >>> change?
> >>>
> >>> or because I removed the phy-handles?>
> >>> The reg = <0x10>; will be necessary regardless. Because this
> >>> is really a bug in the existing binding example and if it is
> >>> copied it will prevent the qca8k driver from loading. 
> >>> This is due to a resource conflict, because there will be 
> >>> already a "phy_port1: phy@0" registered at reg = <0>;
> >>> So this never worked would have worked.
> >>
> >> That part is fine, it is the removal of the phy-handle properties that
> >> is possibly a problem, but in hindsight, I do not believe it will be a
> >> compatibility issue. Lack of "phy-handle" property within the core DSA
> >> layer means: utilize the switch's internal MDIO bus (ds->slave_mii_bus)
> >> instance, which you are not removing, you are just changing how the PHYs
> >> map to port numbers.
> >>
> > Ok, thanks. 
> > 
> > I think I'm almost ready for v2. I have fully addressed the compatibility
> > issue by forking off the qca8k_switch_ops depending on whenever a phy-handle
> > property on one of the ports was found or not. If there was no phy-handle the
> > driver adds the slave-bus accessors to the ops which tells DSA to allocate
> > the slave bus and allows the phys can be enumerated. If the phy-handles are
> > found the driver will not have the accessors and DSA will not setup a
> > redundant/fake bus and this prevents the second/double/duplicated discovery
> > and enumeration of the same PHYs again.
> 
> The logic you have sounds a little too broad since it stops as soon as
> one port is found with a 'phy-handle' property and assumes that the
> parent MDIO bus from which qca8k itself is a child device, is the MDIO
> bus to be used. There are possibly 3 cases:
> 
> 1) All ports using internal/build-in PHYs. In that case, you can either
> not specify a 'phy-handle' property and DSA assumes that they are part
> of the switch's internal MDIO bus. You can also specify a 'phy-handle'
> property that references the internal MDIO bus, although then we also
> expect qca8k to register its internal MDIO bus (ala mv88e6xxx)
> 
> 2) Some ports using internal PHYs, some using external PHYs. Similar
> situation again, ports may, or may not specify a 'phy-handle' property,
> so without a 'phy-handle' property that means the port connects to an
> internal PHY, with a 'phy-handle' it could connect to either internal
> PHY or external PHY
> 
> 3) All ports using external PHYs, in that case, we must have a
> 'phy-handle' for each port to specify where and how they connect to
> their external PHYs.

Oh, sadly the mixed configuration you have envisioned will not work really.
The QCA8K_MDIO_MASTER_EN Bit,which grants access to PHYs through the
MDIO_MASTER register also _disconnects_ the external MDC passthrough to
the internal PHYs. So you get garbage like:

[   17.036963] Generic PHY 37000000.mdio-mii:01: Master/Slave resolution failed, maybe conflicting manual settings?
[   17.116927] Generic PHY 37000000.mdio-mii:02: Master/Slave resolution failed, maybe conflicting manual settings?
[   17.196894] Generic PHY 37000000.mdio-mii:03: Master/Slave resolution failed, maybe conflicting manual settings?
(the PHY reads/write get seemingly stuck/do nothing/strange things).

To pull this partially (so it works for the kernel) off, would require
at least a custom phy driver on top of the dsa switch which would
syncronize the phy register access between the external and internal source.

(I think this would still leave access from userspace in a broken though?!
unless the mdiobus between the qca8k and the SoC can be syncronized as well)

> With respect to your patch, what I would do is register QCA8k's internal
> MDIO bus as a proper mdio bus and use ds->slave_mii_bus as a storage for
> that bus, such that tell the DSA layer: look, here is the internal MDIO
> bus, would you ever find a port that needs to use a PHY in there.
> 
> Then you can still scan each enabled port device, and for each of them,
> populate ds->phys_mii_mask, thus telling DSA exacly which ports are
> using an internal PHY because that would be the ports that do not have a
> 'phy-handle' property. Ports that have a 'phy-handle' property.
> 
> Hope this helps and is clear, if not, I can try to cook a patch for you
> to try, though I don't have you hardware.
Yes, I think I understood, I also tested it (see the implementation below).
But as said it's not that easy.

> Tangential, since you are working on qca8k, it would be great to give
> this driver some TLC and make sure that:
> 
> - bridge w/ and w/o VLAN filtering enabled works
> - multicast snooping works etc.
the driver/switch has much bigger problems :(. For example, my existing
configuration broke (I see RX and TX, but fails to get address by dhcp) due to:

    net: dsa: qca8k: disable delay for RGMII mode
    
    In RGMII mode we should not have any delay in port MAC, so disable
    the delay.
    
    Signed-off-by: Vinod Koul <vkoul@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

It would have been great if a proper .phylink_mac_config() was implemented
at that time. But oh well.

Cheers
Christian

---

commit 9db6f01c864494ebf1308e2f001ece92b5e1ae57
Author: Christian Lamparter <chunkeey@gmail.com>
Date:   Fri Feb 1 22:54:32 2019 +0100

    net: dsa: qca8k: extend slave-bus implementations
    
    This patch implements accessors for the QCA8337 MDIO access
    through the MDIO_MASTER register, which makes it possible to
    access the PHYs on slave-bus through the switch. In cases
    where the switch ports are already mapped via external
    "phy-phandles", the internal mdio-bus is disabled in order to
    prevent a duplicated discovery and enumeration of the same
    PHYs.
    
    Signed-off-by: Christian Lamparter <chunkeey@gmail.com>
    ---
    
    Changes from v2:
     - Make it compatible with existing configurations
    
    Changes from v1:
     - drop DT port <-> phy mapping
     - added register definitions for the MDIO control register
     - implemented new slave-mdio bus accessors
     - DT-binding: fix switch's PSEUDO_PHY address. It's 0x10 not 0.

diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index a4b6cda38016..8d7ee4449d13 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -469,6 +469,131 @@ qca8k_port_set_status(struct qca8k_priv *priv, int port, int enable)
 		qca8k_reg_clear(priv, QCA8K_REG_PORT_STATUS(port), mask);
 }
 
+static int
+qca8k_port_to_phy(int port)
+{
+	if (port < 1 || port > QCA8K_MDIO_MASTER_MAX_PORTS)
+		return -EINVAL;
+
+	return port - 1;
+}
+
+static int
+qca8k_mdio_write(struct qca8k_priv *priv, int port, int regnum, u16 data)
+{
+	u32 val, phy;
+	int ret;
+
+	phy = qca8k_port_to_phy(port);
+	if (phy < 0 || (regnum < 0 || regnum >= QCA8K_MDIO_MASTER_MAX_REG))
+		return -EINVAL;
+
+	val = QCA8K_MDIO_MASTER_BUSY | QCA8K_MDIO_MASTER_EN |
+	      QCA8K_MDIO_MASTER_WRITE | QCA8K_MDIO_MASTER_PHY_ADDR(phy) |
+	      QCA8K_MDIO_MASTER_REG_ADDR(regnum) |
+	      QCA8K_MDIO_MASTER_DATA(data);
+
+	qca8k_write(priv, QCA8K_MDIO_MASTER_CTRL, val);
+
+	return qca8k_busy_wait(priv, QCA8K_MDIO_MASTER_CTRL,
+		QCA8K_MDIO_MASTER_BUSY);
+}
+
+static int
+qca8k_mdio_read(struct qca8k_priv *priv, int port, int regnum)
+{
+	u32 val, phy;
+
+	phy = qca8k_port_to_phy(port);
+	if (phy < 0 || (regnum < 0 || regnum >= QCA8K_MDIO_MASTER_MAX_REG))
+		return -EINVAL;
+
+	val = QCA8K_MDIO_MASTER_BUSY | QCA8K_MDIO_MASTER_EN |
+	      QCA8K_MDIO_MASTER_READ | QCA8K_MDIO_MASTER_PHY_ADDR(phy) |
+	      QCA8K_MDIO_MASTER_REG_ADDR(regnum);
+
+	qca8k_write(priv, QCA8K_MDIO_MASTER_CTRL, val);
+
+	if (qca8k_busy_wait(priv, QCA8K_MDIO_MASTER_CTRL,
+				  QCA8K_MDIO_MASTER_BUSY)) {
+		return -ETIMEDOUT;
+	}
+
+	val = (qca8k_read(priv, QCA8K_MDIO_MASTER_CTRL) &
+		QCA8K_MDIO_MASTER_DATA_MASK);
+
+	return val;
+}
+
+static int
+qca8k_slave_phy_read(struct mii_bus *bus, int addr, int reg)
+{
+	struct qca8k_priv *priv = bus->priv;
+
+	if (priv->ds->phys_mii_mask & BIT(addr)) {
+		int ret = qca8k_mdio_read(priv, addr, reg);
+
+		if (ret >= 0)
+			return ret;
+	}
+
+	return 0xffff;
+}
+
+static int
+qca8k_slave_phy_write(struct mii_bus *bus, int addr, int reg, u16 val)
+{
+	struct qca8k_priv *priv = bus->priv;
+
+	if (priv->ds->phys_mii_mask & BIT(addr))
+		qca8k_mdio_write(priv, addr, reg, val);
+
+	return 0;
+}
+
+static int
+qca8k_setup_mdio_bus(struct qca8k_priv *priv)
+{
+	struct device_node *ports, *port;
+	struct mii_bus *bus;
+	u32 reg, mask = 0;
+	int err;
+
+	bus = devm_mdiobus_alloc_size(priv->dev, sizeof(priv));
+	if (!bus)
+		return -ENOMEM;
+	bus->priv = (void *)priv;
+
+	bus->name = priv->dev->of_node->full_name;
+	snprintf(bus->id, MII_BUS_ID_SIZE, "%pOF", priv->dev->of_node);
+
+	bus->read = qca8k_slave_phy_read;
+	bus->write = qca8k_slave_phy_write;
+	bus->parent = priv->dev;
+
+	ports = of_get_child_by_name(priv->dev->of_node, "ports");
+	if (!ports) {
+		dev_err(priv->dev, "no ports child node found.\n");
+		return -EINVAL;
+	}
+
+	for_each_available_child_of_node(ports, port) {
+		if (of_property_read_bool(port, "phy-handle"))
+			continue;
+
+		err = of_property_read_u32(port, "reg", &reg);
+		if (err)
+			return err;
+
+		if (dsa_is_user_port(priv->ds, reg))
+			mask |= BIT(reg);
+	}
+	bus->phy_mask = ~mask;
+	priv->ds->slave_mii_bus = bus;
+
+	return mdiobus_register(bus);
+}
+
 static int
 qca8k_setup(struct dsa_switch *ds)
 {
@@ -490,6 +615,10 @@ qca8k_setup(struct dsa_switch *ds)
 	if (IS_ERR(priv->regmap))
 		pr_warn("regmap initialization failed");
 
+	ret = qca8k_setup_mdio_bus(priv);
+	if (ret)
+		return ret;
+
 	/* Initialize CPU port pad mode (xMII type, delays...) */
 	phy_mode = of_get_phy_mode(ds->ports[QCA8K_CPU_PORT].dn);
 	if (phy_mode < 0) {
@@ -613,19 +742,27 @@ qca8k_adjust_link(struct dsa_switch *ds, int port, struct phy_device *phy)
 }
 
 static int
-qca8k_phy_read(struct dsa_switch *ds, int phy, int regnum)
+qca8k_phy_write(struct dsa_switch *ds, int port, int regnum, u16 data)
 {
-	struct qca8k_priv *priv = (struct qca8k_priv *)ds->priv;
+	struct qca8k_priv *priv = ds->priv;
+	int ret = -EIO;
 
-	return mdiobus_read(priv->bus, phy, regnum);
+	if (ds->slave_mii_bus->phy_mask & BIT(port))
+		ret = qca8k_mdio_write(priv, port, regnum, data);
+
+	return ret;
 }
 
 static int
-qca8k_phy_write(struct dsa_switch *ds, int phy, int regnum, u16 val)
+qca8k_phy_read(struct dsa_switch *ds, int port, int regnum)
 {
-	struct qca8k_priv *priv = (struct qca8k_priv *)ds->priv;
+	struct qca8k_priv *priv = ds->priv;
+	int ret = -EIO;
 
-	return mdiobus_write(priv->bus, phy, regnum, val);
+	if (ds->slave_mii_bus->phy_mask & BIT(port))
+		ret = qca8k_mdio_read(priv, port, regnum);
+
+	return ret;
 }
 
 static void
diff --git a/drivers/net/dsa/qca8k.h b/drivers/net/dsa/qca8k.h
index 613fe5c50236..09a1d76b8037 100644
--- a/drivers/net/dsa/qca8k.h
+++ b/drivers/net/dsa/qca8k.h
@@ -48,6 +48,18 @@
 #define   QCA8K_MIB_FLUSH				BIT(24)
 #define   QCA8K_MIB_CPU_KEEP				BIT(20)
 #define   QCA8K_MIB_BUSY				BIT(17)
+#define QCA8K_MDIO_MASTER_CTRL				0x3c
+#define   QCA8K_MDIO_MASTER_BUSY			BIT(31)
+#define   QCA8K_MDIO_MASTER_EN				BIT(30)
+#define   QCA8K_MDIO_MASTER_READ			BIT(27)
+#define   QCA8K_MDIO_MASTER_WRITE			0
+#define   QCA8K_MDIO_MASTER_SUP_PRE			BIT(26)
+#define   QCA8K_MDIO_MASTER_PHY_ADDR(x)			((x) << 21)
+#define   QCA8K_MDIO_MASTER_REG_ADDR(x)			((x) << 16)
+#define   QCA8K_MDIO_MASTER_DATA(x)			(x)
+#define   QCA8K_MDIO_MASTER_DATA_MASK			GENMASK(15, 0)
+#define   QCA8K_MDIO_MASTER_MAX_PORTS			5
+#define   QCA8K_MDIO_MASTER_MAX_REG			32
 #define QCA8K_GOL_MAC_ADDR0				0x60
 #define QCA8K_GOL_MAC_ADDR1				0x64
 #define QCA8K_REG_PORT_STATUS(_i)			(0x07c + (_i) * 4)





^ permalink raw reply related

* Re: [Patch net-next v2] mlx5: use RCU lock in mlx5_eq_cq_get()
From: Saeed Mahameed @ 2019-02-07  0:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Cong Wang, Linux Kernel Network Developers, Saeed Mahameed,
	Tariq Toukan
In-Reply-To: <c35f67d1-0190-cb64-74d9-54889eb1aef7@gmail.com>

On Wed, Feb 6, 2019 at 4:28 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 02/06/2019 04:04 PM, Cong Wang wrote:
>
> > synchronize_irq() is called before mlx5_cq_put(), so I don't
> > see why readers could get 0 refcnt.
>
> Then the more reasons to get rid of the refcount increment/decrement completely ...
>
> Technically, even the rcu_read_lock() and rcu_read_unlock() are not needed,
> since synchronize_irq() is enough.
>

I already suggested this, quoting myself from my first reply to this patch V0:
"another way to do it is not to do any refcounting in the irq handler
and fence cq removal via synchronize_irq(eq->irqn) on mlx5_eq_del_cq."

I already have a patch I was just waiting for Cong to push V2.

^ permalink raw reply

* Re: [Patch net-next v2] mlx5: use RCU lock in mlx5_eq_cq_get()
From: Cong Wang @ 2019-02-07  0:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan
In-Reply-To: <c35f67d1-0190-cb64-74d9-54889eb1aef7@gmail.com>

On Wed, Feb 6, 2019 at 4:28 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 02/06/2019 04:04 PM, Cong Wang wrote:
>
> > synchronize_irq() is called before mlx5_cq_put(), so I don't
> > see why readers could get 0 refcnt.
>
> Then the more reasons to get rid of the refcount increment/decrement completely ...
>
> Technically, even the rcu_read_lock() and rcu_read_unlock() are not needed,
> since synchronize_irq() is enough.

Excellent point.

For the refcnt, I am afraid we still have to hold refcnt for the tasklet,
mlx5_cq_tasklet_cb. But yeah, should be safe to remove from IRQ
path.

^ permalink raw reply

* Re: [PATCH net 0/6] qed*: Bug fixes.
From: David Miller @ 2019-02-07  0:53 UTC (permalink / raw)
  To: manishc; +Cc: netdev, aelior, mkalderon
In-Reply-To: <20190206224347.17054-1-manishc@marvell.com>

From: Manish Chopra <manishc@marvell.com>
Date: Wed, 6 Feb 2019 14:43:41 -0800

> This series contains general qed/qede fixes.
> Please consider applying this to "net"

Series applied, thanks Manish.

^ permalink raw reply

* linux-next: manual merge of the net-next tree with the net tree
From: Stephen Rothwell @ 2019-02-07  0:54 UTC (permalink / raw)
  To: David Miller, Networking
  Cc: Linux Next Mailing List, Linux Kernel Mailing List, Guy Shattah,
	Saeed Mahameed, Pablo Neira Ayuso

[-- Attachment #1: Type: text/plain, Size: 4365 bytes --]

Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  drivers/net/ethernet/mellanox/mlx5/core/en_tc.c

between commit:

  1651925d403e ("net/mlx5e: Use the inner headers to determine tc/pedit offload limitation on decap flows")

from the net tree and commit:

  738678817573 ("drivers: net: use flow action infrastructure")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 1c3c9fa26b55,83522c926d7c..000000000000
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@@ -1309,15 -1309,12 +1309,12 @@@ static int parse_tunnel_attr(struct mlx
  				       outer_headers);
  	void *headers_v = MLX5_ADDR_OF(fte_match_param, spec->match_value,
  				       outer_headers);
- 
- 	struct flow_dissector_key_control *enc_control =
- 		skb_flow_dissector_target(f->dissector,
- 					  FLOW_DISSECTOR_KEY_ENC_CONTROL,
- 					  f->key);
- 	int err = 0;
+ 	struct flow_rule *rule = tc_cls_flower_offload_flow_rule(f);
+ 	struct flow_match_control enc_control;
+ 	int err;
  
  	err = mlx5e_tc_tun_parse(filter_dev, priv, spec, f,
 -				 headers_c, headers_v);
 +				 headers_c, headers_v, match_level);
  	if (err) {
  		NL_SET_ERR_MSG_MOD(extack,
  				   "failed to parse tunnel attributes");
@@@ -1465,19 -1455,17 +1455,17 @@@ static int __parse_cls_flower(struct ml
  		return -EOPNOTSUPP;
  	}
  
- 	if ((dissector_uses_key(f->dissector,
- 				FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS) ||
- 	     dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_ENC_KEYID) ||
- 	     dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_ENC_PORTS)) &&
- 	    dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_ENC_CONTROL)) {
- 		struct flow_dissector_key_control *key =
- 			skb_flow_dissector_target(f->dissector,
- 						  FLOW_DISSECTOR_KEY_ENC_CONTROL,
- 						  f->key);
- 		switch (key->addr_type) {
+ 	if ((flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS) ||
+ 	     flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ENC_KEYID) ||
+ 	     flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ENC_PORTS)) &&
+ 	    flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ENC_CONTROL)) {
+ 		struct flow_match_control match;
+ 
+ 		flow_rule_match_enc_control(rule, &match);
+ 		switch (match.key->addr_type) {
  		case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
  		case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
 -			if (parse_tunnel_attr(priv, spec, f, filter_dev))
 +			if (parse_tunnel_attr(priv, spec, f, filter_dev, tunnel_match_level))
  				return -EOPNOTSUPP;
  			break;
  		default:
@@@ -2180,22 -2129,17 +2131,22 @@@ static bool csum_offload_supported(stru
  }
  
  static bool modify_header_match_supported(struct mlx5_flow_spec *spec,
- 					  struct tcf_exts *exts,
+ 					  struct flow_action *flow_action,
 +					  u32 actions,
  					  struct netlink_ext_ack *extack)
  {
- 	const struct tc_action *a;
+ 	const struct flow_action_entry *act;
  	bool modify_ip_header;
  	u8 htype, ip_proto;
  	void *headers_v;
  	u16 ethertype;
- 	int nkeys, i;
+ 	int i;
  
 -	headers_v = MLX5_ADDR_OF(fte_match_param, spec->match_value, outer_headers);
 +	if (actions & MLX5_FLOW_CONTEXT_ACTION_DECAP)
 +		headers_v = MLX5_ADDR_OF(fte_match_param, spec->match_value, inner_headers);
 +	else
 +		headers_v = MLX5_ADDR_OF(fte_match_param, spec->match_value, outer_headers);
 +
  	ethertype = MLX5_GET(fte_match_set_lyr_2_4, headers_v, ethertype);
  
  	/* for non-IP we only re-write MACs, so we're okay */
@@@ -2251,8 -2191,9 +2198,9 @@@ static bool actions_match_supported(str
  		return false;
  
  	if (actions & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR)
- 		return modify_header_match_supported(&parse_attr->spec, exts,
+ 		return modify_header_match_supported(&parse_attr->spec,
+ 						     flow_action,
 -						     extack);
 +						     actions, extack);
  
  	return true;
  }

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 484 bytes --]

^ permalink raw reply

* Re: [PATCH net 1/2] geneve: should not call rt6_lookup() when ipv6 was disabled
From: kbuild test robot @ 2019-02-07  0:55 UTC (permalink / raw)
  To: Hangbin Liu; +Cc: kbuild-all, netdev, Stefano Brivio, David Miller, Hangbin Liu
In-Reply-To: <20190206125111.5286-2-liuhangbin@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4504 bytes --]

Hi Hangbin,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net/master]

url:    https://github.com/0day-ci/linux/commits/Hangbin-Liu/fix-two-kernel-panics-when-disabled-IPv6-on-boot-up/20190207-071954
config: m68k-sun3_defconfig (attached as .config)
compiler: m68k-linux-gnu-gcc (Debian 8.2.0-11) 8.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=8.2.0 make.cross ARCH=m68k 

All warnings (new ones prefixed by >>):

   drivers/net/geneve.c: In function 'geneve_link_config':
>> drivers/net/geneve.c:1519:3: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
      struct rt6_info *rt = rt6_lookup(geneve->net,
      ^~~~~~

vim +1519 drivers/net/geneve.c

abe492b4f5 Tom Herbert    2015-12-10  1490  
c40e89fd35 Alexey Kodanev 2018-04-19  1491  static void geneve_link_config(struct net_device *dev,
c40e89fd35 Alexey Kodanev 2018-04-19  1492  			       struct ip_tunnel_info *info, struct nlattr *tb[])
c40e89fd35 Alexey Kodanev 2018-04-19  1493  {
c40e89fd35 Alexey Kodanev 2018-04-19  1494  	struct geneve_dev *geneve = netdev_priv(dev);
c40e89fd35 Alexey Kodanev 2018-04-19  1495  	int ldev_mtu = 0;
c40e89fd35 Alexey Kodanev 2018-04-19  1496  
c40e89fd35 Alexey Kodanev 2018-04-19  1497  	if (tb[IFLA_MTU]) {
c40e89fd35 Alexey Kodanev 2018-04-19  1498  		geneve_change_mtu(dev, nla_get_u32(tb[IFLA_MTU]));
c40e89fd35 Alexey Kodanev 2018-04-19  1499  		return;
c40e89fd35 Alexey Kodanev 2018-04-19  1500  	}
c40e89fd35 Alexey Kodanev 2018-04-19  1501  
c40e89fd35 Alexey Kodanev 2018-04-19  1502  	switch (ip_tunnel_info_af(info)) {
c40e89fd35 Alexey Kodanev 2018-04-19  1503  	case AF_INET: {
c40e89fd35 Alexey Kodanev 2018-04-19  1504  		struct flowi4 fl4 = { .daddr = info->key.u.ipv4.dst };
c40e89fd35 Alexey Kodanev 2018-04-19  1505  		struct rtable *rt = ip_route_output_key(geneve->net, &fl4);
c40e89fd35 Alexey Kodanev 2018-04-19  1506  
c40e89fd35 Alexey Kodanev 2018-04-19  1507  		if (!IS_ERR(rt) && rt->dst.dev) {
c40e89fd35 Alexey Kodanev 2018-04-19  1508  			ldev_mtu = rt->dst.dev->mtu - GENEVE_IPV4_HLEN;
c40e89fd35 Alexey Kodanev 2018-04-19  1509  			ip_rt_put(rt);
c40e89fd35 Alexey Kodanev 2018-04-19  1510  		}
c40e89fd35 Alexey Kodanev 2018-04-19  1511  		break;
c40e89fd35 Alexey Kodanev 2018-04-19  1512  	}
c40e89fd35 Alexey Kodanev 2018-04-19  1513  #if IS_ENABLED(CONFIG_IPV6)
c40e89fd35 Alexey Kodanev 2018-04-19  1514  	case AF_INET6: {
55e942d018 Hangbin Liu    2019-02-06  1515  		struct inet6_dev *idev = in6_dev_get(dev);
55e942d018 Hangbin Liu    2019-02-06  1516  		if (!idev)
55e942d018 Hangbin Liu    2019-02-06  1517  			break;
55e942d018 Hangbin Liu    2019-02-06  1518  
c40e89fd35 Alexey Kodanev 2018-04-19 @1519  		struct rt6_info *rt = rt6_lookup(geneve->net,
c40e89fd35 Alexey Kodanev 2018-04-19  1520  						 &info->key.u.ipv6.dst, NULL, 0,
c40e89fd35 Alexey Kodanev 2018-04-19  1521  						 NULL, 0);
c40e89fd35 Alexey Kodanev 2018-04-19  1522  
c40e89fd35 Alexey Kodanev 2018-04-19  1523  		if (rt && rt->dst.dev)
c40e89fd35 Alexey Kodanev 2018-04-19  1524  			ldev_mtu = rt->dst.dev->mtu - GENEVE_IPV6_HLEN;
c40e89fd35 Alexey Kodanev 2018-04-19  1525  		ip6_rt_put(rt);
55e942d018 Hangbin Liu    2019-02-06  1526  
55e942d018 Hangbin Liu    2019-02-06  1527  		in6_dev_put(idev);
c40e89fd35 Alexey Kodanev 2018-04-19  1528  		break;
c40e89fd35 Alexey Kodanev 2018-04-19  1529  	}
c40e89fd35 Alexey Kodanev 2018-04-19  1530  #endif
c40e89fd35 Alexey Kodanev 2018-04-19  1531  	}
c40e89fd35 Alexey Kodanev 2018-04-19  1532  
c40e89fd35 Alexey Kodanev 2018-04-19  1533  	if (ldev_mtu <= 0)
c40e89fd35 Alexey Kodanev 2018-04-19  1534  		return;
c40e89fd35 Alexey Kodanev 2018-04-19  1535  
c40e89fd35 Alexey Kodanev 2018-04-19  1536  	geneve_change_mtu(dev, ldev_mtu - info->options_len);
c40e89fd35 Alexey Kodanev 2018-04-19  1537  }
c40e89fd35 Alexey Kodanev 2018-04-19  1538  

:::::: The code at line 1519 was first introduced by commit
:::::: c40e89fd358e94a55d6c1475afbea17b5580f601 geneve: configure MTU based on a lower device

:::::: TO: Alexey Kodanev <alexey.kodanev@oracle.com>
:::::: CC: David S. Miller <davem@davemloft.net>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 12117 bytes --]

^ permalink raw reply

* Re: [Patch net-next v2] mlx5: use RCU lock in mlx5_eq_cq_get()
From: Saeed Mahameed @ 2019-02-07  0:56 UTC (permalink / raw)
  To: Cong Wang
  Cc: Eric Dumazet, Linux Kernel Network Developers, Saeed Mahameed,
	Tariq Toukan
In-Reply-To: <CAM_iQpVF-71uQtybZLqLDWZQS=JxjD_23fAjxyfy7oLBzw-BoA@mail.gmail.com>

On Wed, Feb 6, 2019 at 4:53 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Wed, Feb 6, 2019 at 4:28 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> >
> >
> > On 02/06/2019 04:04 PM, Cong Wang wrote:
> >
> > > synchronize_irq() is called before mlx5_cq_put(), so I don't
> > > see why readers could get 0 refcnt.
> >
> > Then the more reasons to get rid of the refcount increment/decrement completely ...
> >
> > Technically, even the rcu_read_lock() and rcu_read_unlock() are not needed,
> > since synchronize_irq() is enough.
>
> Excellent point.
>
> For the refcnt, I am afraid we still have to hold refcnt for the tasklet,
> mlx5_cq_tasklet_cb. But yeah, should be safe to remove from IRQ
> path.

the tasklet path is for rdma CQs only, netdev cqs handling will be refcnt free.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox