Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next] nfp: flower: cmsg: use struct_size() helper
From: Jakub Kicinski @ 2019-02-08 15:51 UTC (permalink / raw)
  To: Gustavo A. R. Silva; +Cc: David S. Miller, oss-drivers, netdev, linux-kernel
In-Reply-To: <20190208034725.GA12043@embeddedor>

On Thu, 7 Feb 2019 21:47:25 -0600, Gustavo A. R. Silva wrote:
> One of the more common cases of allocation size calculations is finding
> the size of a structure that has a zero-sized array at the end, along
> with memory for some number of elements for that array. For example:
> 
> struct foo {
>     int stuff;
>     void *entry[];
> };
> 
> size = sizeof(struct foo) + count * sizeof(void *);
> instance = alloc(size, GFP_KERNEL);
> 
> Instead of leaving these open-coded and prone to type mistakes, we can
> now use the new struct_size() helper:
> 
> instance = alloc(struct_size(instance, entry, count), GFP_KERNEL);
> 
> Notice that, in this case, variable size is not necessary, hence
> it is removed.
> 
> This code was detected with the help of Coccinelle.
> 
> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>

^ permalink raw reply

* Re: [PATCH net] virtio_net: Account for tx bytes and packets on sending xdp_frames
From: Jesper Dangaard Brouer @ 2019-02-08 16:02 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: dsahern@gmail.com, thoiland@redhat.com, hawk@kernel.org,
	virtualization@lists.linux-foundation.org, borkmann@iogearbox.net,
	Tariq Toukan, john.fastabend@gmail.com, mst@redhat.com,
	jakub.kicinski@netronome.com, netdev@vger.kernel.org,
	jasowang@redhat.com, davem@davemloft.net,
	makita.toshiaki@lab.ntt.co.jp, brouer
In-Reply-To: <140ecbe1e25f54f90d859cc696c4119aa96bc6eb.camel@mellanox.com>

On Wed, 6 Feb 2019 00:06:33 +0000 Saeed Mahameed <saeedm@mellanox.com> wrote:

> 2) Driver should keep track of XDP decisions statistics, report them in
> ethtool and in the new API suggested by David. track even (XDP_PASS) ?
> 
> Maybe instead of having all drivers track the statistics on their own,
> we should move the responsibility to upper layer.
> 
> Idea: since we already have rxq_info structure per XDP ring (no false
> sharing) and available per xdp_buff we can do:
> 
> +++ b/include/linux/filter.h
> @@ -651,7 +651,9 @@ static __always_inline u32 bpf_prog_run_xdp(const
> struct bpf_prog *prog,
>          * already takes rcu_read_lock() when fetching the program, so
>          * it's not necessary here anymore.
>          */
> -       return BPF_PROG_RUN(prog, xdp);
> +       u32 ret = BPF_PROG_RUN(prog, xdp);
> +       xdp->xdp_rxq_info.stats[ret]++
> +       return ret;
>  }
> 
> still we need a way (API) to report the rxq_info to whoever needs to
> read current XDP stats 

I'm capturing this as tasks under the XDP-project github page:
 https://github.com/xdp-project/xdp-project/pull/13/files

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [EXT] Re: [PATCH net-next 0/2] qed*: SmartAN query support
From: Jakub Kicinski @ 2019-02-08 16:04 UTC (permalink / raw)
  To: Sudarsana Reddy Kalluru
  Cc: davem@davemloft.net, netdev@vger.kernel.org, Ariel Elior,
	Michal Kalderon, Michal Kubecek
In-Reply-To: <MWHPR18MB12642183A28EA8D162CAC9D3D3690@MWHPR18MB1264.namprd18.prod.outlook.com>

On Fri, 8 Feb 2019 11:32:14 +0000, Sudarsana Reddy Kalluru wrote:
> >On Thu, 7 Feb 2019 06:20:10 -0800, Sudarsana Reddy Kalluru wrote:  
> >> SmartAN feature detects the peer/cable capabilities and establishes
> >> the link in the best possible configuration.  
> >
> >It sounds familiar, I need to check with FW team, but I think we may be doing
> >a similar thing, and adding a common API rather than ethtool flag would be
> >preferable.
> >
> >Could you please share a little bit more detail?  What are the configurations
> >this would choose between?  
> 
> Jakub,
>   Following doc provides detailed information on this feature. We simply need a flag to display whether the feature is enabled in the hardware or not, hence adding it to "ethtool --show-priv-flags".
> https://www.cavium.com/Dell/Documents/Converged/TB_Establishing_Adaptive_Links_with_SmartAN_Dell.pdf

Thanks!  Yes, that's exactly the same thing.  In a nutshell trying
different serdes and FEC speeds.  I guess I'll just put adding an API
for this down as a TODO, and add it once we have ethtool netlink.  And
keep IOCTL ethtool frozen until then.

So I think the priv flag is fine for now.

^ permalink raw reply

* Re: [PATCH v2 net-next] net: fixed-phy: Add fixed_phy_register_with_gpiod() API
From: Moritz Fischer @ 2019-02-08 16:18 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Andrew Lunn, Florian Fainelli, hkallweit1,
	Linux Kernel Mailing List
In-Reply-To: <20190207.181530.1190821370151974734.davem@davemloft.net>

Hi David,

On Thu, Feb 7, 2019 at 6:15 PM David Miller <davem@davemloft.net> wrote:
>
> From: Moritz Fischer <mdf@kernel.org>
> Date: Thu,  7 Feb 2019 12:14:55 -0800
>
> > Add fixed_phy_register_with_gpiod() API. It lets users create a
> > fixed_phy instance that uses a GPIO descriptor which was obtained
> > externally e.g. through platform data.
> > This enables platform devices (non-DT based) to use GPIOs for link
> > status.
> >
> > Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
> > Signed-off-by: Moritz Fischer <mdf@kernel.org>
>
> This doesn't apply cleanly to net-next, please respin.

I think 5468e82f7034f ("net: phy: fixed-phy: Drop GPIO from
fixed_phy_add()") might've been
missing when you tried. I just tried with net-next and next, and it
applied fine.

It also seems to be in net next as 71bd106d25676 ("net: fixed-phy: Add
fixed_phy_register_with_gpiod() API"),
do you still want me to resend?

Sorry for the confusion,

Moritz

^ permalink raw reply

* [PATCH net-next] devlink: Add WARN_ON to catch errors of not cleaning devlink objects
From: Parav Pandit @ 2019-02-08 16:22 UTC (permalink / raw)
  To: netdev, davem; +Cc: parav

Add WARN_ON to make sure that all sub objects of a devlink device are
cleanedup before freeing the devlink device.
This helps to catch any driver bugs.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 net/core/devlink.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/core/devlink.c b/net/core/devlink.c
index cd0d393..5e2ef5a 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -4229,6 +4229,13 @@ void devlink_unregister(struct devlink *devlink)
  */
 void devlink_free(struct devlink *devlink)
 {
+	WARN_ON(!list_empty(&devlink->port_list));
+	WARN_ON(!list_empty(&devlink->sb_list));
+	WARN_ON(!list_empty(&devlink->dpipe_table_list));
+	WARN_ON(!list_empty(&devlink->resource_list));
+	WARN_ON(!list_empty(&devlink->param_list));
+	WARN_ON(!list_empty(&devlink->region_list));
+
 	kfree(devlink);
 }
 EXPORT_SYMBOL_GPL(devlink_free);
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH net-next 1/2] mlxsw: spectrum_router: Offload blackhole routes
From: David Ahern @ 2019-02-08 16:24 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev@vger.kernel.org, davem@davemloft.net, Jiri Pirko,
	Alexander Petrovskiy, mlxsw
In-Reply-To: <20190208073408.GA26396@splinter>

On 2/7/19 11:34 PM, Ido Schimmel wrote:
> 
> Yes. This patch configures the route itself to drop packets, but we can
> instead configure it as a remote route and configure the adjacency entry
> to drop packets.
> 
> If you later want to change X routes using this blackhole nexthop to a
> different one, then create the new one and tell the hardware to do the
> switch in a single operation. It will basically grep over all configured
> routes and do:
> 
> s/blackhole_adjacency_index/new_adjacency_index/
> s/black_ecmp_size/new_ecmp_size/
> 
> See RALEU in drivers/net/ethernet/mellanox/mlxsw/reg.h

Thanks for the reference.

> 
> I assume that user can't put blackhole and normal nexthops in the same
> group?
> 

I allow a nexthop group to reference a nexthop that is a blackhole, but
the group can only contain the one entry. That allows multipath routes
to toggle between a blackhole and a real spec.

^ permalink raw reply

* [PATCH net-next] sfc: add bundle partition definitions to mtd
From: Bert Kenward @ 2019-02-08 16:25 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-net-drivers, pfox

From: Paul Fox <pfox@solarflare.com>

Signed-off-by: Paul Fox <pfox@solarflare.com>
Signed-off-by: Bert Kenward <bkenward@solarflare.com>
---
 drivers/net/ethernet/sfc/ef10.c      | 2 ++
 drivers/net/ethernet/sfc/mcdi_pcol.h | 8 ++++++++
 2 files changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index 6062e99fa69b..bc92d73047c6 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -6046,6 +6046,8 @@ static const struct efx_ef10_nvram_type_info efx_ef10_nvram_types[] = {
 	{ NVRAM_PARTITION_TYPE_DYNCONFIG_DEFAULTS, 0,    0, "sfc_dynamic_cfg_dflt" },
 	{ NVRAM_PARTITION_TYPE_ROMCONFIG_DEFAULTS, 0,    0, "sfc_exp_rom_cfg_dflt" },
 	{ NVRAM_PARTITION_TYPE_STATUS,		   0,    0, "sfc_status" },
+	{ NVRAM_PARTITION_TYPE_BUNDLE,		   0,    0, "sfc_bundle" },
+	{ NVRAM_PARTITION_TYPE_BUNDLE_METADATA,	   0,    0, "sfc_bundle_metadata" },
 };
 #define EF10_NVRAM_PARTITION_COUNT	ARRAY_SIZE(efx_ef10_nvram_types)
 
diff --git a/drivers/net/ethernet/sfc/mcdi_pcol.h b/drivers/net/ethernet/sfc/mcdi_pcol.h
index 3839eec783ea..20a5523bf9f3 100644
--- a/drivers/net/ethernet/sfc/mcdi_pcol.h
+++ b/drivers/net/ethernet/sfc/mcdi_pcol.h
@@ -6784,6 +6784,14 @@
  * subset of the information stored in this partition.
  */
 #define          NVRAM_PARTITION_TYPE_FRU_INFORMATION 0x1d00
+/* enum: Bundle image partition */
+#define          NVRAM_PARTITION_TYPE_BUNDLE 0x1e00
+/* enum: Bundle metadata partition that holds additional information related to
+ * a bundle update in TLV format
+ */
+#define          NVRAM_PARTITION_TYPE_BUNDLE_METADATA 0x1e01
+/* enum: Bundle update non-volatile log output partition */
+#define          NVRAM_PARTITION_TYPE_BUNDLE_LOG 0x1e02
 /* enum: Start of reserved value range (firmware may use for any purpose) */
 #define          NVRAM_PARTITION_TYPE_RESERVED_VALUES_MIN 0xff00
 /* enum: End of reserved value range (firmware may use for any purpose) */
-- 
2.20.1


^ permalink raw reply related

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
From: Ian Kumlien @ 2019-02-08 16:29 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Cong Wang, David Miller, Saeed Mahameed,
	Linux Kernel Network Developers
In-Reply-To: <CAA85sZvNrHYBr+HGdBgV8tRuhK4vA23gBCXszmw5MPfEiW+7fg@mail.gmail.com>

On Thu, Feb 7, 2019 at 11:01 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> On Thu, Feb 7, 2019 at 7:43 PM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > On Thu, Feb 7, 2019 at 2:17 AM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > On Thu, Feb 7, 2019 at 2:01 AM Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > > > On Wed, Feb 6, 2019 at 3:00 PM Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > > > It changes directly after the first hw checksum failure, I don't know why =/
> > > >
> > > > weird, Maybe a real check-summing issue/corruption on the PCI ?!
> > >
> > > Actually, it seems to have been introduced in 4.20.6 - 4.20.5 works just fine

> > Great, the difference is only 120 patches.
> > that is bisect-able, it will only take 5 iterations to find the
> > offending commit.
>
> I just wish it wasn't a server that takes, what feels like 5 minutes to boot...
>
> All of these seas of sensors 2d and 3d... =P
>
> But, yep, that's the plan

Huh, spent most of the day with two bisects and none of them yielded
any results....

Looks like I'll have to start investigating the elrepo kernel-ml build =(

^ permalink raw reply

* Re: [PATCH net-next] ethtool: Remove unnecessary null check in ethtool_rx_flow_rule_create
From: Pablo Neira Ayuso @ 2019-02-08 16:38 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: David S. Miller, netdev, linux-kernel, Jiri Pirko,
	Nick Desaulniers
In-Reply-To: <20190208044652.32166-1-natechancellor@gmail.com>

On Thu, Feb 07, 2019 at 09:46:53PM -0700, Nathan Chancellor wrote:
> net/core/ethtool.c:3023:19: warning: address of array
> 'ext_m_spec->h_dest' will always evaluate to 'true'
> [-Wpointer-bool-conversion]
>                 if (ext_m_spec->h_dest) {
>                 ~~  ~~~~~~~~~~~~^~~~~~
> 
> h_dest is an array, it can't be null so remove this check.
> 
> Fixes: eca4205f9ec3 ("ethtool: add ethtool_rx_flow_spec to flow_rule structure translator")
> Link: https://github.com/ClangBuiltLinux/linux/issues/353
> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>

Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>

Thanks!

^ permalink raw reply

* [PATCH bpf-next v8 0/6] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap
From: Peter Oskolkov @ 2019-02-08 16:38 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov

This patchset implements BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
BPF helper. It enables BPF programs (specifically, BPF_PROG_TYPE_LWT_IN
and BPF_PROG_TYPE_LWT_XMIT prog types) to add IP encapsulation headers
to packets (e.g. IP/GRE, GUE, IPIP).

This is useful when thousands of different short-lived flows should be
encapped, each with different and dynamically determined destination.
Although lwtunnels can be used in some of these scenarios, the ability
to dynamically generate encap headers adds more flexibility, e.g.
when routing depends on the state of the host (reflected in global bpf
maps).

V2 changes: added flowi-based route lookup, IPv6 encapping, and
   encapping on ingress.

V3 changes: incorporated David Ahern's suggestions:
   - added l3mdev check/oif (patch 2)
   - sync bpf.h from include/uapi into tools/include/uapi
   - selftest tweaks

V4 changes: moved route lookup/dst change from bpf_push_ip_encap
   to when BPF_LWT_REROUTE is handled, as suggested by David Ahern.

V5 changes: added a check in lwt_xmit that skb->protocol stays the
   same if the skb is to be passed back to the stack (ret == BPF_OK).
   Again, suggested by David Ahern.

V6 changes: abandoned.

V7 changes: added handling of GSO packets (patch 3 in the patchset added),
   as suggested by BPF maintainers.

V8 changes:
   - fixed build errors when LWT or IPV6 are not enabled;
   - whitelisted TCP GSO instead of blacklisting SCTP and UDP GSO, as
     suggested by Willem de Bruijn;
   - added validation that pushed length cover needed headers when GRE/UDP
     encap is detected, as suggested by Willem de Bruijn;
   - a couple of minor/stylistic tweaks/fixed typos.

Peter Oskolkov (6):
  bpf: add plumbing for BPF_LWT_ENCAP_IP in bpf_lwt_push_encap
  bpf: implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
  bpf: handle GSO in bpf_lwt_push_encap
  bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c
  bpf: sync <kdir>/include/.../bpf.h with tools/include/.../bpf.h
  selftests: bpf: add test_lwt_ip_encap selftest

 include/net/lwtunnel.h                        |   2 +
 include/uapi/linux/bpf.h                      |  26 +-
 net/core/filter.c                             |  49 ++-
 net/core/lwt_bpf.c                            | 262 +++++++++++++++
 tools/include/uapi/linux/bpf.h                |  26 +-
 tools/testing/selftests/bpf/Makefile          |   6 +-
 .../testing/selftests/bpf/test_lwt_ip_encap.c |  85 +++++
 .../selftests/bpf/test_lwt_ip_encap.sh        | 311 ++++++++++++++++++
 8 files changed, 756 insertions(+), 11 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lwt_ip_encap.c
 create mode 100755 tools/testing/selftests/bpf/test_lwt_ip_encap.sh

-- 
2.20.1.791.gb4d0f1c61a-goog

^ permalink raw reply

* [PATCH bpf-next v8 1/6] bpf: add plumbing for BPF_LWT_ENCAP_IP in bpf_lwt_push_encap
From: Peter Oskolkov @ 2019-02-08 16:38 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190208163849.151626-1-posk@google.com>

This patch adds all needed plumbing in preparation to allowing
bpf programs to do IP encapping via bpf_lwt_push_encap. Actual
implementation is added in the next patch in the patchset.

Of note:
- bpf_lwt_push_encap can now be called from BPF_PROG_TYPE_LWT_XMIT
  prog types in addition to BPF_PROG_TYPE_LWT_IN;
- if the skb being encapped has GSO set, encapsulation is limited
  to IPIP/IP+GRE/IP+GUE (both IPv4 and IPv6);
- as route lookups are different for ingress vs egress, the single
  external bpf_lwt_push_encap BPF helper is routed internally to
  either bpf_lwt_in_push_encap or bpf_lwt_xmit_push_encap BPF_CALLs,
  depending on prog type.

v8 changes: fixed a typo.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/uapi/linux/bpf.h | 26 ++++++++++++++++++++--
 net/core/filter.c        | 48 +++++++++++++++++++++++++++++++++++-----
 2 files changed, 67 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1777fa0c61e4..f94683797552 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2016,6 +2016,19 @@ union bpf_attr {
  *			Only works if *skb* contains an IPv6 packet. Insert a
  *			Segment Routing Header (**struct ipv6_sr_hdr**) inside
  *			the IPv6 header.
+ *		**BPF_LWT_ENCAP_IP**
+ *			IP encapsulation (GRE/GUE/IPIP/etc). The outer header
+ *			must be IPv4 or IPv6, followed by zero or more
+ *			additional headers, up to LWT_BPF_MAX_HEADROOM total
+ *			bytes in all prepended headers. Please note that
+ *			if skb_is_gso(skb) is true, no more than two headers
+ *			can be prepended, and the inner header, if present,
+ *			should be either GRE or UDP/GUE.
+ *
+ *		BPF_LWT_ENCAP_SEG6*** types can be called by bpf programs of
+ *		type BPF_PROG_TYPE_LWT_IN; BPF_LWT_ENCAP_IP type can be called
+ *		by bpf programs of types BPF_PROG_TYPE_LWT_IN and
+ *		BPF_PROG_TYPE_LWT_XMIT.
  *
  * 		A call to this helper is susceptible to change the underlaying
  * 		packet buffer. Therefore, at load time, all checks on pointers
@@ -2498,7 +2511,8 @@ enum bpf_hdr_start_off {
 /* Encapsulation type for BPF_FUNC_lwt_push_encap helper. */
 enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_SEG6,
-	BPF_LWT_ENCAP_SEG6_INLINE
+	BPF_LWT_ENCAP_SEG6_INLINE,
+	BPF_LWT_ENCAP_IP,
 };
 
 #define __bpf_md_ptr(type, name)	\
@@ -2586,7 +2600,15 @@ enum bpf_ret_code {
 	BPF_DROP = 2,
 	/* 3-6 reserved */
 	BPF_REDIRECT = 7,
-	/* >127 are reserved for prog type specific return codes */
+	/* >127 are reserved for prog type specific return codes.
+	 *
+	 * BPF_LWT_REROUTE: used by BPF_PROG_TYPE_LWT_IN and
+	 *    BPF_PROG_TYPE_LWT_XMIT to indicate that skb had been
+	 *    changed and should be routed based on its new L3 header.
+	 *    (This is an L3 redirect, as opposed to L2 redirect
+	 *    represented by BPF_REDIRECT above).
+	 */
+	BPF_LWT_REROUTE = 128,
 };
 
 struct bpf_sock {
diff --git a/net/core/filter.c b/net/core/filter.c
index 3a49f68eda10..5ca7a689ea64 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4801,7 +4801,15 @@ static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len
 }
 #endif /* CONFIG_IPV6_SEG6_BPF */
 
-BPF_CALL_4(bpf_lwt_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
+#if IS_ENABLED(CONFIG_LWTUNNEL_BPF)
+static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len,
+			     bool ingress)
+{
+	return -EINVAL;  /* Implemented in the next patch. */
+}
+#endif
+
+BPF_CALL_4(bpf_lwt_in_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
 	   u32, len)
 {
 	switch (type) {
@@ -4809,14 +4817,41 @@ BPF_CALL_4(bpf_lwt_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
 	case BPF_LWT_ENCAP_SEG6:
 	case BPF_LWT_ENCAP_SEG6_INLINE:
 		return bpf_push_seg6_encap(skb, type, hdr, len);
+#endif
+#if IS_ENABLED(CONFIG_LWTUNNEL_BPF)
+	case BPF_LWT_ENCAP_IP:
+		return bpf_push_ip_encap(skb, hdr, len, true /* ingress */);
+#endif
+	default:
+		return -EINVAL;
+	}
+}
+
+BPF_CALL_4(bpf_lwt_xmit_push_encap, struct sk_buff *, skb, u32, type,
+	   void *, hdr, u32, len)
+{
+	switch (type) {
+#if IS_ENABLED(CONFIG_LWTUNNEL_BPF)
+	case BPF_LWT_ENCAP_IP:
+		return bpf_push_ip_encap(skb, hdr, len, false /* egress */);
 #endif
 	default:
 		return -EINVAL;
 	}
 }
 
-static const struct bpf_func_proto bpf_lwt_push_encap_proto = {
-	.func		= bpf_lwt_push_encap,
+static const struct bpf_func_proto bpf_lwt_in_push_encap_proto = {
+	.func		= bpf_lwt_in_push_encap,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_MEM,
+	.arg4_type	= ARG_CONST_SIZE
+};
+
+static const struct bpf_func_proto bpf_lwt_xmit_push_encap_proto = {
+	.func		= bpf_lwt_xmit_push_encap,
 	.gpl_only	= false,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
@@ -5282,7 +5317,8 @@ bool bpf_helper_changes_pkt_data(void *func)
 	    func == bpf_lwt_seg6_adjust_srh ||
 	    func == bpf_lwt_seg6_action ||
 #endif
-	    func == bpf_lwt_push_encap)
+	    func == bpf_lwt_in_push_encap ||
+	    func == bpf_lwt_xmit_push_encap)
 		return true;
 
 	return false;
@@ -5670,7 +5706,7 @@ lwt_in_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_lwt_push_encap:
-		return &bpf_lwt_push_encap_proto;
+		return &bpf_lwt_in_push_encap_proto;
 	default:
 		return lwt_out_func_proto(func_id, prog);
 	}
@@ -5706,6 +5742,8 @@ lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_l4_csum_replace_proto;
 	case BPF_FUNC_set_hash_invalid:
 		return &bpf_set_hash_invalid_proto;
+	case BPF_FUNC_lwt_push_encap:
+		return &bpf_lwt_xmit_push_encap_proto;
 	default:
 		return lwt_out_func_proto(func_id, prog);
 	}
-- 
2.20.1.791.gb4d0f1c61a-goog


^ permalink raw reply related

* [PATCH bpf-next v8 2/6] bpf: implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
From: Peter Oskolkov @ 2019-02-08 16:38 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190208163849.151626-1-posk@google.com>

Implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap BPF helper.
It enables BPF programs (specifically, BPF_PROG_TYPE_LWT_IN and
BPF_PROG_TYPE_LWT_XMIT prog types) to add IP encapsulation headers
to packets (e.g. IP/GRE, GUE, IPIP).

This is useful when thousands of different short-lived flows should be
encapped, each with different and dynamically determined destination.
Although lwtunnels can be used in some of these scenarios, the ability
to dynamically generate encap headers adds more flexibility, e.g.
when routing depends on the state of the host (reflected in global bpf
maps).

v7 changes:
 - added a call skb_clear_hash();
 - removed calls to skb_set_transport_header();
 - refuse to encap GSO-enabled packets.

v8 changes:
 - fix build errors when LWT is not enabled.

Note: the next patch in the patchset with deal with GSO-enabled packets,
which are currently rejected at encapping attempt.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/net/lwtunnel.h |  2 ++
 net/core/filter.c      |  3 +-
 net/core/lwt_bpf.c     | 65 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index 33fd9ba7e0e5..671113bcb2cc 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -126,6 +126,8 @@ int lwtunnel_cmp_encap(struct lwtunnel_state *a, struct lwtunnel_state *b);
 int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb);
 int lwtunnel_input(struct sk_buff *skb);
 int lwtunnel_xmit(struct sk_buff *skb);
+int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len,
+			  bool ingress);
 
 static inline void lwtunnel_set_redirect(struct dst_entry *dst)
 {
diff --git a/net/core/filter.c b/net/core/filter.c
index 5ca7a689ea64..017e60e65004 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -73,6 +73,7 @@
 #include <linux/seg6_local.h>
 #include <net/seg6.h>
 #include <net/seg6_local.h>
+#include <net/lwtunnel.h>
 
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
@@ -4805,7 +4806,7 @@ static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len
 static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len,
 			     bool ingress)
 {
-	return -EINVAL;  /* Implemented in the next patch. */
+	return bpf_lwt_push_ip_encap(skb, hdr, len, ingress);
 }
 #endif
 
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index a648568c5e8f..e5a9850d9f48 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -390,6 +390,71 @@ static const struct lwtunnel_encap_ops bpf_encap_ops = {
 	.owner		= THIS_MODULE,
 };
 
+static int handle_gso_encap(struct sk_buff *skb, bool ipv4, int encap_len)
+{
+	/* Handling of GSO-enabled packets is added in the next patch. */
+	return -EOPNOTSUPP;
+}
+
+int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len, bool ingress)
+{
+	struct iphdr *iph;
+	bool ipv4;
+	int err;
+
+	if (unlikely(len < sizeof(struct iphdr) || len > LWT_BPF_MAX_HEADROOM))
+		return -EINVAL;
+
+	/* validate protocol and length */
+	iph = (struct iphdr *)hdr;
+	if (iph->version == 4) {
+		ipv4 = true;
+		if (unlikely(len < iph->ihl * 4))
+			return -EINVAL;
+	} else if (iph->version == 6) {
+		ipv4 = false;
+		if (unlikely(len < sizeof(struct ipv6hdr)))
+			return -EINVAL;
+	} else {
+		return -EINVAL;
+	}
+
+	if (ingress)
+		err = skb_cow_head(skb, len + skb->mac_len);
+	else
+		err = skb_cow_head(skb,
+				   len + LL_RESERVED_SPACE(skb_dst(skb)->dev));
+	if (unlikely(err))
+		return err;
+
+	/* push the encap headers and fix pointers */
+	skb_reset_inner_headers(skb);
+	skb->encapsulation = 1;
+	skb_push(skb, len);
+	if (ingress)
+		skb_postpush_rcsum(skb, iph, len);
+	skb_reset_network_header(skb);
+	memcpy(skb_network_header(skb), hdr, len);
+	bpf_compute_data_pointers(skb);
+	skb_clear_hash(skb);
+
+	if (ipv4) {
+		skb->protocol = htons(ETH_P_IP);
+		iph = ip_hdr(skb);
+
+		if (!iph->check)
+			iph->check = ip_fast_csum((unsigned char *)iph,
+						  iph->ihl);
+	} else {
+		skb->protocol = htons(ETH_P_IPV6);
+	}
+
+	if (skb_is_gso(skb))
+		return handle_gso_encap(skb, ipv4, len);
+
+	return 0;
+}
+
 static int __init bpf_lwt_init(void)
 {
 	return lwtunnel_encap_add_ops(&bpf_encap_ops, LWTUNNEL_ENCAP_BPF);
-- 
2.20.1.791.gb4d0f1c61a-goog


^ permalink raw reply related

* [PATCH bpf-next v8 3/6] bpf: handle GSO in bpf_lwt_push_encap
From: Peter Oskolkov @ 2019-02-08 16:38 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190208163849.151626-1-posk@google.com>

This patch adds handling of GSO packets in bpf_lwt_push_ip_encap()
(called from bpf_lwt_push_encap):

* IPIP, GRE, and UDP encapsulation types are deduced by looking
  into iphdr->protocol or ipv6hdr->next_header;
* SCTP GSO packets are not supported (as bpf_skb_proto_4_to_6
  and similar do);
* UDP_L4 GSO packets are also not supported (although they are
  not blocked in bpf_skb_proto_4_to_6 and similar), as
  skb_decrease_gso_size() will break it;
* SKB_GSO_DODGY bit is set.

Note: it may be possible to support SCTP and UDP_L4 gso packets;
      but as these cases seem to be not well handled by other
      tunneling/encapping code paths, the solution should
      be generic enough to apply to all tunneling/encapping code.

v8 changes:
   - make sure that if GRE or UDP encap is detected, there is
     enough of pushed bytes to cover both IP[v6] + GRE|UDP headers;
   - do not reject double-encapped packets;
   - whitelist TCP GSO packets rather than block SCTP GSO and
     UDP GSO.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 net/core/lwt_bpf.c | 67 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 65 insertions(+), 2 deletions(-)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index e5a9850d9f48..079871fc020f 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -16,6 +16,7 @@
 #include <linux/types.h>
 #include <linux/bpf.h>
 #include <net/lwtunnel.h>
+#include <net/gre.h>
 
 struct bpf_lwt_prog {
 	struct bpf_prog *prog;
@@ -390,10 +391,72 @@ static const struct lwtunnel_encap_ops bpf_encap_ops = {
 	.owner		= THIS_MODULE,
 };
 
+static int handle_gso_type(struct sk_buff *skb, unsigned int gso_type,
+			   int encap_len)
+{
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+	gso_type |= SKB_GSO_DODGY;
+	shinfo->gso_type |= gso_type;
+	skb_decrease_gso_size(shinfo, encap_len);
+	shinfo->gso_segs = 0;
+	return 0;
+}
+
 static int handle_gso_encap(struct sk_buff *skb, bool ipv4, int encap_len)
 {
-	/* Handling of GSO-enabled packets is added in the next patch. */
-	return -EOPNOTSUPP;
+	int next_hdr_offset;
+	void *next_hdr;
+	__u8 protocol;
+
+	/* SCTP and UDP_L4 gso need more nuanced handling than what
+	 * handle_gso_type() does above: skb_decrease_gso_size() is not enough.
+	 * So at the moment only TCP GSO packets are let through.
+	 */
+	if (!(skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)))
+		return -ENOTSUPP;
+
+	if (ipv4) {
+		protocol = ip_hdr(skb)->protocol;
+		next_hdr_offset = sizeof(struct iphdr);
+		next_hdr = skb_network_header(skb) + next_hdr_offset;
+	} else {
+		protocol = ipv6_hdr(skb)->nexthdr;
+		next_hdr_offset = sizeof(struct ipv6hdr);
+		next_hdr = skb_network_header(skb) + next_hdr_offset;
+	}
+
+	switch (protocol) {
+	case IPPROTO_GRE:
+		next_hdr_offset += sizeof(struct gre_base_hdr);
+		if (next_hdr_offset > encap_len)
+			return -EINVAL;
+
+		if (((struct gre_base_hdr *)next_hdr)->flags & GRE_CSUM)
+			return handle_gso_type(skb, SKB_GSO_GRE_CSUM,
+					       encap_len);
+		return handle_gso_type(skb, SKB_GSO_GRE, encap_len);
+
+	case IPPROTO_UDP:
+		next_hdr_offset += sizeof(struct udphdr);
+		if (next_hdr_offset > encap_len)
+			return -EINVAL;
+
+		if (((struct udphdr *)next_hdr)->check)
+			return handle_gso_type(skb, SKB_GSO_UDP_TUNNEL_CSUM,
+					       encap_len);
+		return handle_gso_type(skb, SKB_GSO_UDP_TUNNEL, encap_len);
+
+	case IPPROTO_IP:
+	case IPPROTO_IPV6:
+		if (ipv4)
+			return handle_gso_type(skb, SKB_GSO_IPXIP4, encap_len);
+		else
+			return handle_gso_type(skb, SKB_GSO_IPXIP6, encap_len);
+
+	default:
+		return -EPROTONOSUPPORT;
+	}
 }
 
 int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len, bool ingress)
-- 
2.20.1.791.gb4d0f1c61a-goog


^ permalink raw reply related

* [PATCH bpf-next v8 4/6] bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c
From: Peter Oskolkov @ 2019-02-08 16:38 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190208163849.151626-1-posk@google.com>

This patch builds on top of the previous patch in the patchset,
which added BPF_LWT_ENCAP_IP mode to bpf_lwt_push_encap. As the
encapping can result in the skb needing to go via a different
interface/route/dst, bpf programs can indicate this by returning
BPF_LWT_REROUTE, which triggers a new route lookup for the skb.

v8 changes: fix kbuild errors when LWTUNNEL_BPF is builtin, but
   IPV6 is a module: as LWTUNNEL_BPF can only be either Y or N,
   call IPV6 routing functions only if they are built-in.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 net/core/lwt_bpf.c | 134 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 134 insertions(+)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index 079871fc020f..ebe294de36e1 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -17,6 +17,7 @@
 #include <linux/bpf.h>
 #include <net/lwtunnel.h>
 #include <net/gre.h>
+#include <net/ip6_route.h>
 
 struct bpf_lwt_prog {
 	struct bpf_prog *prog;
@@ -56,6 +57,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 
 	switch (ret) {
 	case BPF_OK:
+	case BPF_LWT_REROUTE:
 		break;
 
 	case BPF_REDIRECT:
@@ -88,6 +90,36 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 	return ret;
 }
 
+static int bpf_lwt_input_reroute(struct sk_buff *skb)
+{
+	int err = -EINVAL;
+
+	if (skb->protocol == htons(ETH_P_IP)) {
+		struct iphdr *iph = ip_hdr(skb);
+
+		err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
+					   iph->tos, skb_dst(skb)->dev);
+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
+#if IS_BUILTIN(CONFIG_IPV6)
+		ip6_route_input(skb);
+		err = skb_dst(skb)->error;
+#else
+		pr_warn_once("BPF_LWT_REROUTE input: IPV6 not built-in\n");
+#endif
+	} else {
+		pr_warn_once("BPF_LWT_REROUTE input: unsupported proto %d\n",
+			     skb->protocol);
+	}
+
+	if (err)
+		goto err;
+	return dst_input(skb);
+
+err:
+	kfree_skb(skb);
+	return err;
+}
+
 static int bpf_input(struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
@@ -99,6 +131,8 @@ static int bpf_input(struct sk_buff *skb)
 		ret = run_lwt_bpf(skb, &bpf->in, dst, NO_REDIRECT);
 		if (ret < 0)
 			return ret;
+		if (ret == BPF_LWT_REROUTE)
+			return bpf_lwt_input_reroute(skb);
 	}
 
 	if (unlikely(!dst->lwtstate->orig_input)) {
@@ -148,6 +182,95 @@ static int xmit_check_hhlen(struct sk_buff *skb)
 	return 0;
 }
 
+static int bpf_lwt_xmit_reroute(struct sk_buff *skb)
+{
+	struct net_device *l3mdev = l3mdev_master_dev_rcu(skb_dst(skb)->dev);
+	int oif = l3mdev ? l3mdev->ifindex : 0;
+	struct dst_entry *dst = NULL;
+	struct sock *sk;
+	struct net *net;
+	bool ipv4;
+	int err;
+
+	if (skb->protocol == htons(ETH_P_IP)) {
+		ipv4 = true;
+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
+		ipv4 = false;
+	} else {
+		pr_warn_once("BPF_LWT_REROUTE xmit: unsupported proto %d\n",
+			     skb->protocol);
+		return -EINVAL;
+	}
+
+	sk = sk_to_full_sk(skb->sk);
+	if (sk) {
+		if (sk->sk_bound_dev_if)
+			oif = sk->sk_bound_dev_if;
+		net = sock_net(sk);
+	} else {
+		net = dev_net(skb_dst(skb)->dev);
+	}
+
+	if (ipv4) {
+		struct iphdr *iph = ip_hdr(skb);
+		struct flowi4 fl4 = {0};
+		struct rtable *rt;
+
+		fl4.flowi4_oif = oif;
+		fl4.flowi4_mark = skb->mark;
+		fl4.flowi4_uid = sock_net_uid(net, sk);
+		fl4.flowi4_tos = RT_TOS(iph->tos);
+		fl4.flowi4_flags = FLOWI_FLAG_ANYSRC;
+		fl4.flowi4_proto = iph->protocol;
+		fl4.daddr = iph->daddr;
+		fl4.saddr = iph->saddr;
+
+		rt = ip_route_output_key(net, &fl4);
+		if (IS_ERR(rt) || rt->dst.error)
+			return -EINVAL;
+		dst = &rt->dst;
+	} else {
+#if IS_BUILTIN(CONFIG_IPV6)
+		struct ipv6hdr *iph6 = ipv6_hdr(skb);
+		struct flowi6 fl6 = {0};
+
+		fl6.flowi6_oif = oif;
+		fl6.flowi6_mark = skb->mark;
+		fl6.flowi6_uid = sock_net_uid(net, sk);
+		fl6.flowlabel = ip6_flowinfo(iph6);
+		fl6.flowi6_proto = iph6->nexthdr;
+		fl6.daddr = iph6->daddr;
+		fl6.saddr = iph6->saddr;
+
+		dst = ip6_route_output(net, skb->sk, &fl6);
+		if (IS_ERR(dst) || dst->error)
+			return -EINVAL;
+#else
+		pr_warn_once("BPF_LWT_REROUTE xmit: IPV6 not built-in\n");
+		return -EINVAL;
+#endif
+	}
+
+	/* Although skb header was reserved in bpf_lwt_push_ip_encap(), it
+	 * was done for the previous dst, so we are doing it here again, in
+	 * case the new dst needs much more space. The call below is a noop
+	 * if there is enough header space in skb.
+	 */
+	err = skb_cow_head(skb, LL_RESERVED_SPACE(dst->dev));
+	if (unlikely(err))
+		return err;
+
+	skb_dst_drop(skb);
+	skb_dst_set(skb, dst);
+
+	err = dst_output(dev_net(skb_dst(skb)->dev), skb->sk, skb);
+	if (unlikely(err))
+		return err;
+
+	/* ip[6]_finish_output2 understand LWTUNNEL_XMIT_DONE */
+	return LWTUNNEL_XMIT_DONE;
+}
+
 static int bpf_xmit(struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
@@ -155,11 +278,20 @@ static int bpf_xmit(struct sk_buff *skb)
 
 	bpf = bpf_lwt_lwtunnel(dst->lwtstate);
 	if (bpf->xmit.prog) {
+		__be16 proto = skb->protocol;
 		int ret;
 
 		ret = run_lwt_bpf(skb, &bpf->xmit, dst, CAN_REDIRECT);
 		switch (ret) {
 		case BPF_OK:
+			/* If the header changed, e.g. via bpf_lwt_push_encap,
+			 * BPF_LWT_REROUTE below should have been used if the
+			 * protocol was also changed.
+			 */
+			if (skb->protocol != proto) {
+				kfree_skb(skb);
+				return -EINVAL;
+			}
 			/* If the header was expanded, headroom might be too
 			 * small for L2 header to come, expand as needed.
 			 */
@@ -170,6 +302,8 @@ static int bpf_xmit(struct sk_buff *skb)
 			return LWTUNNEL_XMIT_CONTINUE;
 		case BPF_REDIRECT:
 			return LWTUNNEL_XMIT_DONE;
+		case BPF_LWT_REROUTE:
+			return bpf_lwt_xmit_reroute(skb);
 		default:
 			return ret;
 		}
-- 
2.20.1.791.gb4d0f1c61a-goog


^ permalink raw reply related

* [PATCH bpf-next v8 5/6] bpf: sync <kdir>/include/.../bpf.h with tools/include/.../bpf.h
From: Peter Oskolkov @ 2019-02-08 16:38 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190208163849.151626-1-posk@google.com>

This patch copies changes in bpf.h done by a previous patch
in this patchset from the kernel uapi include dir into tools
uapi include dir.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 tools/include/uapi/linux/bpf.h | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 1777fa0c61e4..f94683797552 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2016,6 +2016,19 @@ union bpf_attr {
  *			Only works if *skb* contains an IPv6 packet. Insert a
  *			Segment Routing Header (**struct ipv6_sr_hdr**) inside
  *			the IPv6 header.
+ *		**BPF_LWT_ENCAP_IP**
+ *			IP encapsulation (GRE/GUE/IPIP/etc). The outer header
+ *			must be IPv4 or IPv6, followed by zero or more
+ *			additional headers, up to LWT_BPF_MAX_HEADROOM total
+ *			bytes in all prepended headers. Please note that
+ *			if skb_is_gso(skb) is true, no more than two headers
+ *			can be prepended, and the inner header, if present,
+ *			should be either GRE or UDP/GUE.
+ *
+ *		BPF_LWT_ENCAP_SEG6*** types can be called by bpf programs of
+ *		type BPF_PROG_TYPE_LWT_IN; BPF_LWT_ENCAP_IP type can be called
+ *		by bpf programs of types BPF_PROG_TYPE_LWT_IN and
+ *		BPF_PROG_TYPE_LWT_XMIT.
  *
  * 		A call to this helper is susceptible to change the underlaying
  * 		packet buffer. Therefore, at load time, all checks on pointers
@@ -2498,7 +2511,8 @@ enum bpf_hdr_start_off {
 /* Encapsulation type for BPF_FUNC_lwt_push_encap helper. */
 enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_SEG6,
-	BPF_LWT_ENCAP_SEG6_INLINE
+	BPF_LWT_ENCAP_SEG6_INLINE,
+	BPF_LWT_ENCAP_IP,
 };
 
 #define __bpf_md_ptr(type, name)	\
@@ -2586,7 +2600,15 @@ enum bpf_ret_code {
 	BPF_DROP = 2,
 	/* 3-6 reserved */
 	BPF_REDIRECT = 7,
-	/* >127 are reserved for prog type specific return codes */
+	/* >127 are reserved for prog type specific return codes.
+	 *
+	 * BPF_LWT_REROUTE: used by BPF_PROG_TYPE_LWT_IN and
+	 *    BPF_PROG_TYPE_LWT_XMIT to indicate that skb had been
+	 *    changed and should be routed based on its new L3 header.
+	 *    (This is an L3 redirect, as opposed to L2 redirect
+	 *    represented by BPF_REDIRECT above).
+	 */
+	BPF_LWT_REROUTE = 128,
 };
 
 struct bpf_sock {
-- 
2.20.1.791.gb4d0f1c61a-goog


^ permalink raw reply related

* KASAN: wild-memory-access Write in fib6_purge_rt
From: syzbot @ 2019-02-08 16:39 UTC (permalink / raw)
  To: davem, kuznet, linux-kernel, netdev, syzkaller-bugs, yoshfuji

Hello,

syzbot found the following crash on:

HEAD commit:    cc7335786f72 socket: fix for Add SO_TIMESTAMP[NS]_NEW
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=16539260c00000
kernel config:  https://syzkaller.appspot.com/x/.config?x=33ad02b9305759c3
dashboard link: https://syzkaller.appspot.com/bug?extid=3dbea54db3674c0d57d6
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+3dbea54db3674c0d57d6@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: wild-memory-access in atomic_dec_and_test  
include/asm-generic/atomic-instrumented.h:259 [inline]
BUG: KASAN: wild-memory-access in fib6_info_release  
include/net/ip6_fib.h:294 [inline]
BUG: KASAN: wild-memory-access in fib6_info_release  
include/net/ip6_fib.h:292 [inline]
BUG: KASAN: wild-memory-access in fib6_drop_pcpu_from  
net/ipv6/ip6_fib.c:927 [inline]
BUG: KASAN: wild-memory-access in fib6_purge_rt+0x4f6/0x670  
net/ipv6/ip6_fib.c:960
Write of size 4 at addr ff88809b5430992b by task syz-executor5/21222

CPU: 1 PID: 21222 Comm: syz-executor5 Not tainted 5.0.0-rc4+ #45
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  kasan_report.cold+0x5/0x40 mm/kasan/report.c:321
  check_memory_region_inline mm/kasan/generic.c:185 [inline]
  check_memory_region+0x123/0x190 mm/kasan/generic.c:191
  kasan_check_write+0x14/0x20 mm/kasan/common.c:106
  atomic_dec_and_test include/asm-generic/atomic-instrumented.h:259 [inline]
  fib6_info_release include/net/ip6_fib.h:294 [inline]
  fib6_info_release include/net/ip6_fib.h:292 [inline]
  fib6_drop_pcpu_from net/ipv6/ip6_fib.c:927 [inline]
  fib6_purge_rt+0x4f6/0x670 net/ipv6/ip6_fib.c:960
  fib6_del_route net/ipv6/ip6_fib.c:1813 [inline]
  fib6_del+0xac2/0x10a0 net/ipv6/ip6_fib.c:1844
  fib6_clean_node+0x3a8/0x590 net/ipv6/ip6_fib.c:2006
  fib6_walk_continue+0x4b3/0x8e0 net/ipv6/ip6_fib.c:1928
  fib6_walk+0x9d/0x100 net/ipv6/ip6_fib.c:1976
  fib6_clean_tree+0xe0/0x120 net/ipv6/ip6_fib.c:2055
  __fib6_clean_all+0x118/0x2a0 net/ipv6/ip6_fib.c:2071
  fib6_clean_all+0x2b/0x40 net/ipv6/ip6_fib.c:2082
  rt6_sync_down_dev+0x134/0x150 net/ipv6/route.c:4041
  rt6_disable_ip+0x27/0x5f0 net/ipv6/route.c:4046
  addrconf_ifdown+0xa2/0x1220 net/ipv6/addrconf.c:3704
  addrconf_notify+0x5e1/0x2280 net/ipv6/addrconf.c:3629
  notifier_call_chain+0xc7/0x240 kernel/notifier.c:93
  __raw_notifier_call_chain kernel/notifier.c:394 [inline]
  raw_notifier_call_chain+0x2e/0x40 kernel/notifier.c:401
  call_netdevice_notifiers_info+0x3f/0x90 net/core/dev.c:1739
  call_netdevice_notifiers_extack net/core/dev.c:1751 [inline]
  call_netdevice_notifiers net/core/dev.c:1765 [inline]
  __dev_notify_flags+0x1e9/0x2c0 net/core/dev.c:7607
  dev_change_flags+0x10d/0x170 net/core/dev.c:7643
  devinet_ioctl+0x1396/0x1c60 net/ipv4/devinet.c:1104
  inet_ioctl+0x1fc/0x370 net/ipv4/af_inet.c:954
  sock_do_ioctl+0xe2/0x320 net/socket.c:972
  sock_ioctl+0x331/0x620 net/socket.c:1096
  vfs_ioctl fs/ioctl.c:46 [inline]
  file_ioctl fs/ioctl.c:509 [inline]
  do_vfs_ioctl+0xd6e/0x1390 fs/ioctl.c:696
  ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
  __do_sys_ioctl fs/ioctl.c:720 [inline]
  __se_sys_ioctl fs/ioctl.c:718 [inline]
  __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
  do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457e39
Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7  
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff  
ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fa4949bfc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457e39
RDX: 0000000020000040 RSI: 0000000000008914 RDI: 0000000000000004
RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fa4949c06d4
R13: 00000000004c34ff R14: 00000000004d61e8 R15: 00000000ffffffff
==================================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with  
syzbot.

^ permalink raw reply

* [PATCH bpf-next v8 6/6] selftests: bpf: add test_lwt_ip_encap selftest
From: Peter Oskolkov @ 2019-02-08 16:38 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190208163849.151626-1-posk@google.com>

This patch adds a bpf self-test to cover BPF_LWT_ENCAP_IP mode
in bpf_lwt_push_encap.

Covered:
- encapping in LWT_IN and LWT_XMIT
- IPv4 and IPv6

A follow-up patch will add GSO and VRF-enabled tests.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 tools/testing/selftests/bpf/Makefile          |   6 +-
 .../testing/selftests/bpf/test_lwt_ip_encap.c |  85 +++++
 .../selftests/bpf/test_lwt_ip_encap.sh        | 311 ++++++++++++++++++
 3 files changed, 400 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lwt_ip_encap.c
 create mode 100755 tools/testing/selftests/bpf/test_lwt_ip_encap.sh

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 383d2ff13fc7..d56c74727b6c 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -35,7 +35,8 @@ BPF_OBJ_FILES = \
 	sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
 	get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
 	test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o test_xdp_vlan.o \
-	xdp_dummy.o test_map_in_map.o test_spin_lock.o test_map_lock.o
+	xdp_dummy.o test_map_in_map.o test_spin_lock.o test_map_lock.o \
+	test_lwt_ip_encap.o
 
 # Objects are built with default compilation flags and with sub-register
 # code-gen enabled.
@@ -73,7 +74,8 @@ TEST_PROGS := test_kmod.sh \
 	test_lirc_mode2.sh \
 	test_skb_cgroup_id.sh \
 	test_flow_dissector.sh \
-	test_xdp_vlan.sh
+	test_xdp_vlan.sh \
+	test_lwt_ip_encap.sh
 
 TEST_PROGS_EXTENDED := with_addr.sh \
 	with_tunnels.sh \
diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.c b/tools/testing/selftests/bpf/test_lwt_ip_encap.c
new file mode 100644
index 000000000000..c957d6dfe6d7
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.c
@@ -0,0 +1,85 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stddef.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+struct grehdr {
+	__be16 flags;
+	__be16 protocol;
+};
+
+SEC("encap_gre")
+int bpf_lwt_encap_gre(struct __sk_buff *skb)
+{
+	struct encap_hdr {
+		struct iphdr iph;
+		struct grehdr greh;
+	} hdr;
+	int err;
+
+	memset(&hdr, 0, sizeof(struct encap_hdr));
+
+	hdr.iph.ihl = 5;
+	hdr.iph.version = 4;
+	hdr.iph.ttl = 0x40;
+	hdr.iph.protocol = 47;  /* IPPROTO_GRE */
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+	hdr.iph.saddr = 0x640110ac;  /* 172.16.1.100 */
+	hdr.iph.daddr = 0x641010ac;  /* 172.16.16.100 */
+#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+	hdr.iph.saddr = 0xac100164;  /* 172.16.1.100 */
+	hdr.iph.daddr = 0xac101064;  /* 172.16.16.100 */
+#else
+#error "Fix your compiler's __BYTE_ORDER__?!"
+#endif
+	hdr.iph.tot_len = bpf_htons(skb->len + sizeof(struct encap_hdr));
+
+	hdr.greh.protocol = skb->protocol;
+
+	err = bpf_lwt_push_encap(skb, BPF_LWT_ENCAP_IP, &hdr,
+				 sizeof(struct encap_hdr));
+	if (err)
+		return BPF_DROP;
+
+	return BPF_LWT_REROUTE;
+}
+
+SEC("encap_gre6")
+int bpf_lwt_encap_gre6(struct __sk_buff *skb)
+{
+	struct encap_hdr {
+		struct ipv6hdr ip6hdr;
+		struct grehdr greh;
+	} hdr;
+	int err;
+
+	memset(&hdr, 0, sizeof(struct encap_hdr));
+
+	hdr.ip6hdr.version = 6;
+	hdr.ip6hdr.payload_len = bpf_htons(skb->len + sizeof(struct grehdr));
+	hdr.ip6hdr.nexthdr = 47;  /* IPPROTO_GRE */
+	hdr.ip6hdr.hop_limit = 0x40;
+	/* fb01::1 */
+	hdr.ip6hdr.saddr.s6_addr[0] = 0xfb;
+	hdr.ip6hdr.saddr.s6_addr[1] = 1;
+	hdr.ip6hdr.saddr.s6_addr[15] = 1;
+	/* fb10::1 */
+	hdr.ip6hdr.daddr.s6_addr[0] = 0xfb;
+	hdr.ip6hdr.daddr.s6_addr[1] = 0x10;
+	hdr.ip6hdr.daddr.s6_addr[15] = 1;
+
+	hdr.greh.protocol = skb->protocol;
+
+	err = bpf_lwt_push_encap(skb, BPF_LWT_ENCAP_IP, &hdr,
+				 sizeof(struct encap_hdr));
+	if (err)
+		return BPF_DROP;
+
+	return BPF_LWT_REROUTE;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.sh b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
new file mode 100755
index 000000000000..4ca714e23ab0
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
@@ -0,0 +1,311 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Setup/topology:
+#
+#    NS1             NS2             NS3
+#   veth1 <---> veth2   veth3 <---> veth4 (the top route)
+#   veth5 <---> veth6   veth7 <---> veth8 (the bottom route)
+#
+#   each vethN gets IPv[4|6]_N address
+#
+#   IPv*_SRC = IPv*_1
+#   IPv*_DST = IPv*_4
+#
+#   all tests test pings from IPv*_SRC to IPv*_DST
+#
+#   by default, routes are configured to allow packets to go
+#   IP*_1 <=> IP*_2 <=> IP*_3 <=> IP*_4 (the top route)
+#
+#   a GRE device is installed in NS3 with IPv*_GRE, and
+#   NS1/NS2 are configured to route packets to IPv*_GRE via IP*_8
+#   (the bottom route)
+#
+# Tests:
+#
+#   1. routes NS2->IPv*_DST are brought down, so the only way a ping
+#      from IP*_SRC to IP*_DST can work is via IPv*_GRE
+#
+#   2a. in an egress test, a bpf LWT_XMIT program is installed on veth1
+#       that encaps the packets with an IP/GRE header to route to IPv*_GRE
+#
+#       ping: SRC->[encap at veth1:egress]->GRE:decap->DST
+#       ping replies go DST->SRC directly
+#
+#   2b. in an ingress test, a bpf LWT_IN program is installed on veth2
+#       that encaps the packets with an IP/GRE header to route to IPv*_GRE
+#
+#       ping: SRC->[encap at veth2:ingress]->GRE:decap->DST
+#       ping replies go DST->SRC directly
+
+set -e  # exit on error
+
+if [[ $EUID -ne 0 ]]; then
+	echo "This script must be run as root"
+	echo "FAIL"
+	exit 1
+fi
+
+readonly NS1="ns1-$(mktemp -u XXXXXX)"
+readonly NS2="ns2-$(mktemp -u XXXXXX)"
+readonly NS3="ns3-$(mktemp -u XXXXXX)"
+
+readonly IPv4_1="172.16.1.100"
+readonly IPv4_2="172.16.2.100"
+readonly IPv4_3="172.16.3.100"
+readonly IPv4_4="172.16.4.100"
+readonly IPv4_5="172.16.5.100"
+readonly IPv4_6="172.16.6.100"
+readonly IPv4_7="172.16.7.100"
+readonly IPv4_8="172.16.8.100"
+readonly IPv4_GRE="172.16.16.100"
+
+readonly IPv4_SRC=$IPv4_1
+readonly IPv4_DST=$IPv4_4
+
+readonly IPv6_1="fb01::1"
+readonly IPv6_2="fb02::1"
+readonly IPv6_3="fb03::1"
+readonly IPv6_4="fb04::1"
+readonly IPv6_5="fb05::1"
+readonly IPv6_6="fb06::1"
+readonly IPv6_7="fb07::1"
+readonly IPv6_8="fb08::1"
+readonly IPv6_GRE="fb10::1"
+
+readonly IPv6_SRC=$IPv6_1
+readonly IPv6_DST=$IPv6_4
+
+setup() {
+set -e  # exit on error
+	# create devices and namespaces
+	ip netns add "${NS1}"
+	ip netns add "${NS2}"
+	ip netns add "${NS3}"
+
+	ip link add veth1 type veth peer name veth2
+	ip link add veth3 type veth peer name veth4
+	ip link add veth5 type veth peer name veth6
+	ip link add veth7 type veth peer name veth8
+
+	ip netns exec ${NS2} sysctl -wq net.ipv4.ip_forward=1
+	ip netns exec ${NS2} sysctl -wq net.ipv6.conf.all.forwarding=1
+
+	ip link set veth1 netns ${NS1}
+	ip link set veth2 netns ${NS2}
+	ip link set veth3 netns ${NS2}
+	ip link set veth4 netns ${NS3}
+	ip link set veth5 netns ${NS1}
+	ip link set veth6 netns ${NS2}
+	ip link set veth7 netns ${NS2}
+	ip link set veth8 netns ${NS3}
+
+	# configure addesses: the top route (1-2-3-4)
+	ip -netns ${NS1}    addr add ${IPv4_1}/24  dev veth1
+	ip -netns ${NS2}    addr add ${IPv4_2}/24  dev veth2
+	ip -netns ${NS2}    addr add ${IPv4_3}/24  dev veth3
+	ip -netns ${NS3}    addr add ${IPv4_4}/24  dev veth4
+	ip -netns ${NS1} -6 addr add ${IPv6_1}/128 nodad dev veth1
+	ip -netns ${NS2} -6 addr add ${IPv6_2}/128 nodad dev veth2
+	ip -netns ${NS2} -6 addr add ${IPv6_3}/128 nodad dev veth3
+	ip -netns ${NS3} -6 addr add ${IPv6_4}/128 nodad dev veth4
+
+	# configure addresses: the bottom route (5-6-7-8)
+	ip -netns ${NS1}    addr add ${IPv4_5}/24  dev veth5
+	ip -netns ${NS2}    addr add ${IPv4_6}/24  dev veth6
+	ip -netns ${NS2}    addr add ${IPv4_7}/24  dev veth7
+	ip -netns ${NS3}    addr add ${IPv4_8}/24  dev veth8
+	ip -netns ${NS1} -6 addr add ${IPv6_5}/128 nodad dev veth5
+	ip -netns ${NS2} -6 addr add ${IPv6_6}/128 nodad dev veth6
+	ip -netns ${NS2} -6 addr add ${IPv6_7}/128 nodad dev veth7
+	ip -netns ${NS3} -6 addr add ${IPv6_8}/128 nodad dev veth8
+
+
+	ip -netns ${NS1} link set dev veth1 up
+	ip -netns ${NS2} link set dev veth2 up
+	ip -netns ${NS2} link set dev veth3 up
+	ip -netns ${NS3} link set dev veth4 up
+	ip -netns ${NS1} link set dev veth5 up
+	ip -netns ${NS2} link set dev veth6 up
+	ip -netns ${NS2} link set dev veth7 up
+	ip -netns ${NS3} link set dev veth8 up
+
+	# configure routes: IP*_SRC -> veth1/IP*_2 (= top route) default;
+	# the bottom route to specific bottom addresses
+
+	# NS1
+	# top route
+	ip -netns ${NS1}    route add ${IPv4_2}/32  dev veth1
+	ip -netns ${NS1}    route add default dev veth1 via ${IPv4_2}  # go top by default
+	ip -netns ${NS1} -6 route add ${IPv6_2}/128 dev veth1
+	ip -netns ${NS1} -6 route add default dev veth1 via ${IPv6_2}  # go top by default
+	# bottom route
+	ip -netns ${NS1}    route add ${IPv4_6}/32  dev veth5
+	ip -netns ${NS1}    route add ${IPv4_7}/32  dev veth5 via ${IPv4_6}
+	ip -netns ${NS1}    route add ${IPv4_8}/32  dev veth5 via ${IPv4_6}
+	ip -netns ${NS1} -6 route add ${IPv6_6}/128 dev veth5
+	ip -netns ${NS1} -6 route add ${IPv6_7}/128 dev veth5 via ${IPv6_6}
+	ip -netns ${NS1} -6 route add ${IPv6_8}/128 dev veth5 via ${IPv6_6}
+
+	# NS2
+	# top route
+	ip -netns ${NS2}    route add ${IPv4_1}/32  dev veth2
+	ip -netns ${NS2}    route add ${IPv4_4}/32  dev veth3
+	ip -netns ${NS2} -6 route add ${IPv6_1}/128 dev veth2
+	ip -netns ${NS2} -6 route add ${IPv6_4}/128 dev veth3
+	# bottom route
+	ip -netns ${NS2}    route add ${IPv4_5}/32  dev veth6
+	ip -netns ${NS2}    route add ${IPv4_8}/32  dev veth7
+	ip -netns ${NS2} -6 route add ${IPv6_5}/128 dev veth6
+	ip -netns ${NS2} -6 route add ${IPv6_8}/128 dev veth7
+
+	# NS3
+	# top route
+	ip -netns ${NS3}    route add ${IPv4_3}/32  dev veth4
+	ip -netns ${NS3}    route add ${IPv4_1}/32  dev veth4 via ${IPv4_3}
+	ip -netns ${NS3}    route add ${IPv4_2}/32  dev veth4 via ${IPv4_3}
+	ip -netns ${NS3} -6 route add ${IPv6_3}/128 dev veth4
+	ip -netns ${NS3} -6 route add ${IPv6_1}/128 dev veth4 via ${IPv6_3}
+	ip -netns ${NS3} -6 route add ${IPv6_2}/128 dev veth4 via ${IPv6_3}
+	# bottom route
+	ip -netns ${NS3}    route add ${IPv4_7}/32  dev veth8
+	ip -netns ${NS3}    route add ${IPv4_5}/32  dev veth8 via ${IPv4_7}
+	ip -netns ${NS3}    route add ${IPv4_6}/32  dev veth8 via ${IPv4_7}
+	ip -netns ${NS3} -6 route add ${IPv6_7}/128 dev veth8
+	ip -netns ${NS3} -6 route add ${IPv6_5}/128 dev veth8 via ${IPv6_7}
+	ip -netns ${NS3} -6 route add ${IPv6_6}/128 dev veth8 via ${IPv6_7}
+
+	# configure IPv4 GRE device in NS3, and a route to it via the "bottom" route
+	ip -netns ${NS3} tunnel add gre_dev mode gre remote ${IPv4_1} local ${IPv4_GRE} ttl 255
+	ip -netns ${NS3} link set gre_dev up
+	ip -netns ${NS3} addr add ${IPv4_GRE} dev gre_dev
+	ip -netns ${NS1} route add ${IPv4_GRE}/32 dev veth5 via ${IPv4_6}
+	ip -netns ${NS2} route add ${IPv4_GRE}/32 dev veth7 via ${IPv4_8}
+
+
+	# configure IPv6 GRE device in NS3, and a route to it via the "bottom" route
+	ip -netns ${NS3} -6 tunnel add name gre6_dev mode ip6gre remote ${IPv6_1} local ${IPv6_GRE} ttl 255
+	ip -netns ${NS3} link set gre6_dev up
+	ip -netns ${NS3} -6 addr add ${IPv6_GRE} nodad dev gre6_dev
+	ip -netns ${NS1} -6 route add ${IPv6_GRE}/128 dev veth5 via ${IPv6_6}
+	ip -netns ${NS2} -6 route add ${IPv6_GRE}/128 dev veth7 via ${IPv6_8}
+
+	# rp_filter gets confused by what these tests are doing, so disable it
+	ip netns exec ${NS1} sysctl -wq net.ipv4.conf.all.rp_filter=0
+	ip netns exec ${NS2} sysctl -wq net.ipv4.conf.all.rp_filter=0
+	ip netns exec ${NS3} sysctl -wq net.ipv4.conf.all.rp_filter=0
+}
+
+cleanup() {
+	ip netns del ${NS1} 2> /dev/null
+	ip netns del ${NS2} 2> /dev/null
+	ip netns del ${NS3} 2> /dev/null
+}
+
+trap cleanup EXIT
+
+test_ping() {
+	local readonly PROTO=$1
+	local readonly EXPECTED=$2
+	local RET=0
+
+	set +e
+	if [ "${PROTO}" == "IPv4" ] ; then
+		ip netns exec ${NS1} ping  -c 1 -W 1 -I ${IPv4_SRC} ${IPv4_DST} 2>&1 > /dev/null
+		RET=$?
+	elif [ "${PROTO}" == "IPv6" ] ; then
+		ip netns exec ${NS1} ping6 -c 1 -W 6 -I ${IPv6_SRC} ${IPv6_DST} 2>&1 > /dev/null
+		RET=$?
+	else
+		echo "test_ping: unknown PROTO: ${PROTO}"
+		exit 1
+	fi
+	set -e
+
+	if [ "0" != "${RET}" ]; then
+		RET=1
+	fi
+
+	if [ "${EXPECTED}" != "${RET}" ] ; then
+		echo "FAIL: test_ping: ${RET}"
+		exit 1
+	fi
+}
+
+test_egress() {
+	local readonly ENCAP=$1
+	echo "starting egress ${ENCAP} encap test"
+	setup
+
+	# need to wait a bit for IPv6 to autoconf, otherwise
+	# ping6 sometimes fails with "unable to bind to address"
+
+	# by default, pings work
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	# remove NS2->DST routes, ping fails
+	ip -netns ${NS2}    route del ${IPv4_DST}/32  dev veth3
+	ip -netns ${NS2} -6 route del ${IPv6_DST}/128 dev veth3
+	test_ping IPv4 1
+	test_ping IPv6 1
+
+	# install replacement routes (LWT/eBPF), pings succeed
+	if [ "${ENCAP}" == "IPv4" ] ; then
+		ip -netns ${NS1} route add ${IPv4_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre dev veth1
+		ip -netns ${NS1} -6 route add ${IPv6_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre dev veth1
+	elif [ "${ENCAP}" == "IPv6" ] ; then
+		ip -netns ${NS1} route add ${IPv4_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre6 dev veth1
+		ip -netns ${NS1} -6 route add ${IPv6_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre6 dev veth1
+	else
+		echo "FAIL: unknown encap ${ENCAP}"
+	fi
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	cleanup
+	echo "PASS"
+}
+
+test_ingress() {
+	local readonly ENCAP=$1
+	echo "starting ingress ${ENCAP} encap test"
+	setup
+
+	# need to wait a bit for IPv6 to autoconf, otherwise
+	# ping6 sometimes fails with "unable to bind to address"
+
+	# by default, pings work
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	# remove NS2->DST routes, pings fail
+	ip -netns ${NS2}    route del ${IPv4_DST}/32  dev veth3
+	ip -netns ${NS2} -6 route del ${IPv6_DST}/128 dev veth3
+	test_ping IPv4 1
+	test_ping IPv6 1
+
+	# install replacement routes (LWT/eBPF), pings succeed
+	if [ "${ENCAP}" == "IPv4" ] ; then
+		ip -netns ${NS2} route add ${IPv4_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre dev veth2
+		ip -netns ${NS2} -6 route add ${IPv6_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre dev veth2
+	elif [ "${ENCAP}" == "IPv6" ] ; then
+		ip -netns ${NS2} route add ${IPv4_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre6 dev veth2
+		ip -netns ${NS2} -6 route add ${IPv6_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre6 dev veth2
+	else
+		echo "FAIL: unknown encap ${ENCAP}"
+	fi
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	cleanup
+	echo "PASS"
+}
+
+test_egress IPv4
+test_egress IPv6
+
+test_ingress IPv4
+test_ingress IPv6
+
+echo "all tests passed"
-- 
2.20.1.791.gb4d0f1c61a-goog


^ permalink raw reply related

* [PATCH net-next] nfp: flower: remove unused index from nfp_fl_pedit()
From: Pablo Neira Ayuso @ 2019-02-08 16:41 UTC (permalink / raw)
  To: netdev; +Cc: davem, jakub.kicinski, oss-drivers

Static checker warning complains on uninitialized variable:

        drivers/net/ethernet/netronome/nfp/flower/action.c:618 nfp_fl_pedit()
        error: uninitialized symbol 'idx'.

Which is actually never used from the functions that take it as
parameter. Remove it.

Fixes: 738678817573 ("drivers: net: use flow action infrastructure")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 drivers/net/ethernet/netronome/nfp/flower/action.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/action.c b/drivers/net/ethernet/netronome/nfp/flower/action.c
index 583e97c99e68..eeda4ed98333 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/action.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/action.c
@@ -345,7 +345,7 @@ static void nfp_fl_set_helper32(u32 value, u32 mask, u8 *p_exact, u8 *p_mask)
 }
 
 static int
-nfp_fl_set_eth(const struct flow_action_entry *act, int idx, u32 off,
+nfp_fl_set_eth(const struct flow_action_entry *act, u32 off,
 	       struct nfp_fl_set_eth *set_eth)
 {
 	u32 exact, mask;
@@ -376,7 +376,7 @@ struct ipv4_ttl_word {
 };
 
 static int
-nfp_fl_set_ip4(const struct flow_action_entry *act, int idx, u32 off,
+nfp_fl_set_ip4(const struct flow_action_entry *act, u32 off,
 	       struct nfp_fl_set_ip4_addrs *set_ip_addr,
 	       struct nfp_fl_set_ip4_ttl_tos *set_ip_ttl_tos)
 {
@@ -505,7 +505,7 @@ nfp_fl_set_ip6_hop_limit_flow_label(u32 off, __be32 exact, __be32 mask,
 }
 
 static int
-nfp_fl_set_ip6(const struct flow_action_entry *act, int idx, u32 off,
+nfp_fl_set_ip6(const struct flow_action_entry *act, u32 off,
 	       struct nfp_fl_set_ipv6_addr *ip_dst,
 	       struct nfp_fl_set_ipv6_addr *ip_src,
 	       struct nfp_fl_set_ipv6_tc_hl_fl *ip_hl_fl)
@@ -541,7 +541,7 @@ nfp_fl_set_ip6(const struct flow_action_entry *act, int idx, u32 off,
 }
 
 static int
-nfp_fl_set_tport(const struct flow_action_entry *act, int idx, u32 off,
+nfp_fl_set_tport(const struct flow_action_entry *act, u32 off,
 		 struct nfp_fl_set_tport *set_tport, int opcode)
 {
 	u32 exact, mask;
@@ -598,8 +598,8 @@ nfp_fl_pedit(const struct flow_action_entry *act,
 	struct nfp_fl_set_eth set_eth;
 	size_t act_size = 0;
 	u8 ip_proto = 0;
-	int idx, err;
 	u32 offset;
+	int err;
 
 	memset(&set_ip6_tc_hl_fl, 0, sizeof(set_ip6_tc_hl_fl));
 	memset(&set_ip_ttl_tos, 0, sizeof(set_ip_ttl_tos));
@@ -614,22 +614,22 @@ nfp_fl_pedit(const struct flow_action_entry *act,
 
 	switch (htype) {
 	case TCA_PEDIT_KEY_EX_HDR_TYPE_ETH:
-		err = nfp_fl_set_eth(act, idx, offset, &set_eth);
+		err = nfp_fl_set_eth(act, offset, &set_eth);
 		break;
 	case TCA_PEDIT_KEY_EX_HDR_TYPE_IP4:
-		err = nfp_fl_set_ip4(act, idx, offset, &set_ip_addr,
+		err = nfp_fl_set_ip4(act, offset, &set_ip_addr,
 				     &set_ip_ttl_tos);
 		break;
 	case TCA_PEDIT_KEY_EX_HDR_TYPE_IP6:
-		err = nfp_fl_set_ip6(act, idx, offset, &set_ip6_dst,
+		err = nfp_fl_set_ip6(act, offset, &set_ip6_dst,
 				     &set_ip6_src, &set_ip6_tc_hl_fl);
 		break;
 	case TCA_PEDIT_KEY_EX_HDR_TYPE_TCP:
-		err = nfp_fl_set_tport(act, idx, offset, &set_tport,
+		err = nfp_fl_set_tport(act, offset, &set_tport,
 				       NFP_FL_ACTION_OPCODE_SET_TCP);
 		break;
 	case TCA_PEDIT_KEY_EX_HDR_TYPE_UDP:
-		err = nfp_fl_set_tport(act, idx, offset, &set_tport,
+		err = nfp_fl_set_tport(act, offset, &set_tport,
 				       NFP_FL_ACTION_OPCODE_SET_UDP);
 		break;
 	default:
-- 
2.11.0


^ permalink raw reply related

* Re: [PATCH net-next] nfp: flower: remove unused index from nfp_fl_pedit()
From: Jakub Kicinski @ 2019-02-08 16:47 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netdev, davem, oss-drivers
In-Reply-To: <20190208164113.8935-1-pablo@netfilter.org>

On Fri,  8 Feb 2019 17:41:13 +0100, Pablo Neira Ayuso wrote:
> Static checker warning complains on uninitialized variable:
> 
>         drivers/net/ethernet/netronome/nfp/flower/action.c:618 nfp_fl_pedit()
>         error: uninitialized symbol 'idx'.
> 
> Which is actually never used from the functions that take it as
> parameter. Remove it.
> 
> Fixes: 738678817573 ("drivers: net: use flow action infrastructure")

Hardly a fix.  It's completely unused.

> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Ah, I was hoping you wouldn't notice :)  Now the backport of this code
got almost as excruciatingly painful as the match part :)  But I guess
nothing we can do about it:

Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>

^ permalink raw reply

* Re: Resource management for ndo_xdp_xmit (Was: [PATCH net] virtio_net: Account for tx bytes and packets on sending xdp_frames)
From: Toke Høiland-Jørgensen @ 2019-02-08 16:55 UTC (permalink / raw)
  To: Saeed Mahameed, brouer@redhat.com
  Cc: hawk@kernel.org, virtualization@lists.linux-foundation.org,
	borkmann@iogearbox.net, Tariq Toukan, john.fastabend@gmail.com,
	mst@redhat.com, jakub.kicinski@netronome.com, dsahern@gmail.com,
	netdev@vger.kernel.org, jasowang@redhat.com, davem@davemloft.net,
	makita.toshiaki@lab.ntt.co.jp
In-Reply-To: <9e5e6882566ac67276209b35ec112a824b256bff.camel@mellanox.com>

Saeed Mahameed <saeedm@mellanox.com> writes:

> But:
> 2) this won't totally solve our problem, since sometimes the driver can
> decide to recreate (change of configuration) hw resources on the fly
> while redirect/devmap is already happening, so we need some kind of a
> dev_map_notification or a flag with rcu synch, for when the driver want
> to make the xdp redirect resources unavailable.

Good point, I'll make a note of this. Do you have a pointer to where the
mlx5 driver does this kind of change currently?

-Toke

^ permalink raw reply

* Re: [PATCH net-next 1/2] mlxsw: spectrum_router: Offload blackhole routes
From: Ido Schimmel @ 2019-02-08 17:05 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev@vger.kernel.org, davem@davemloft.net, Jiri Pirko,
	Alexander Petrovskiy, mlxsw
In-Reply-To: <bfc7715f-a762-0ad7-9cb2-3c248eefad9e@gmail.com>

On Fri, Feb 08, 2019 at 08:24:49AM -0800, David Ahern wrote:
> On 2/7/19 11:34 PM, Ido Schimmel wrote:
> > I assume that user can't put blackhole and normal nexthops in the same
> > group?
> > 
> 
> I allow a nexthop group to reference a nexthop that is a blackhole, but
> the group can only contain the one entry. That allows multipath routes
> to toggle between a blackhole and a real spec.

Good, thanks. I was afraid users will be able to configure a nexthop
group that will randomly drop flows :)

^ permalink raw reply

* Re: [PATCH v3 bpf-next 3/4] btf: expose API to work with raw btf_ext data
From: Yonghong Song @ 2019-02-08 17:12 UTC (permalink / raw)
  To: Andrii Nakryiko, alexei.starovoitov@gmail.com,
	andrii.nakryiko@gmail.com, Song Liu, Alexei Starovoitov,
	Martin Lau, netdev@vger.kernel.org, Kernel Team,
	daniel@iogearbox.net
In-Reply-To: <20190208025555.4027769-4-andriin@fb.com>



On 2/7/19 6:55 PM, Andrii Nakryiko wrote:
> This patch changes struct btf_ext to retain original data in sequential
> block of memory, which makes it possible to expose
> btf_ext__get_raw_data() interface, that's similar to
> btf__get_raw_data(), allowing users of libbpf to get access to raw
> representation of .BTF.ext section.
> 
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---
>   tools/lib/bpf/btf.c      | 85 +++++++++++++++++++++-------------------
>   tools/lib/bpf/btf.h      |  2 +
>   tools/lib/bpf/libbpf.map |  1 +
>   3 files changed, 47 insertions(+), 41 deletions(-)
> 
> diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
> index 8730f6c3be9e..c87cc3d71b9f 100644
> --- a/tools/lib/bpf/btf.c
> +++ b/tools/lib/bpf/btf.c
> @@ -41,9 +41,8 @@ struct btf {
>   
>   struct btf_ext_info {
>   	/*
> -	 * info points to a deep copy of the individual info section
> -	 * (e.g. func_info and line_info) from the .BTF.ext.
> -	 * It does not include the __u32 rec_size.
> +	 * info points to the individual info section (e.g. func_info and
> +	 * line_info) from the .BTF.ext. It does not include the __u32 rec_size.
>   	 */
>   	void *info;
>   	__u32 rec_size;
> @@ -51,8 +50,13 @@ struct btf_ext_info {
>   };
>   
>   struct btf_ext {
> +	union {
> +		struct btf_ext_header *hdr;
> +		void *data;
> +	};
>   	struct btf_ext_info func_info;
>   	struct btf_ext_info line_info;
> +	__u32 data_size;
>   };
>   
>   struct btf_ext_info_sec {
> @@ -603,19 +607,13 @@ struct btf_ext_sec_copy_param {
>   };
>   
>   static int btf_ext_copy_info(struct btf_ext *btf_ext,
> -			     __u8 *data, __u32 data_size,
>   			     struct btf_ext_sec_copy_param *ext_sec)

Overall looks good. Since we do not really "copy" info any more,
rather we try to "setup" info based on btf_ext. Maybe changing
all function and structure names with "_copy_" to "_setup_"?

>   {
> -	const struct btf_ext_header *hdr = (struct btf_ext_header *)data;
>   	const struct btf_ext_info_sec *sinfo;
>   	struct btf_ext_info *ext_info;
>   	__u32 info_left, record_size;
>   	/* The start of the info sec (including the __u32 record_size). */
> -	const void *info;
> -
> -	/* data and data_size do not include btf_ext_header from now on */
> -	data = data + hdr->hdr_len;
> -	data_size -= hdr->hdr_len;
> +	void *info;
[...]

^ permalink raw reply

* Re: [PATCH v3 bpf-next 4/4] tools/bpf: remove btf__get_strings superseded() by raw data API
From: Yonghong Song @ 2019-02-08 17:13 UTC (permalink / raw)
  To: Andrii Nakryiko, alexei.starovoitov@gmail.com,
	andrii.nakryiko@gmail.com, Song Liu, Alexei Starovoitov,
	Martin Lau, netdev@vger.kernel.org, Kernel Team,
	daniel@iogearbox.net
In-Reply-To: <20190208025555.4027769-5-andriin@fb.com>



On 2/7/19 6:55 PM, Andrii Nakryiko wrote:
> Now that we have btf__get_raw_data() it's trivial for tests to iterate
> over all strings for testing purposes, which eliminates the need for
> btf__get_strings() API.
> 
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>

Acked-by: Yonghong Song <yhs@fb.com>

^ permalink raw reply

* Re: [PATCH v3 bpf-next 3/4] btf: expose API to work with raw btf_ext data
From: Andrii Nakryiko @ 2019-02-08 17:15 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, alexei.starovoitov@gmail.com, Song Liu,
	Alexei Starovoitov, Martin Lau, netdev@vger.kernel.org,
	Kernel Team, daniel@iogearbox.net
In-Reply-To: <e4c11465-31a9-5dd3-7408-b0d5665bc391@fb.com>

On Fri, Feb 8, 2019 at 9:13 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 2/7/19 6:55 PM, Andrii Nakryiko wrote:
> > This patch changes struct btf_ext to retain original data in sequential
> > block of memory, which makes it possible to expose
> > btf_ext__get_raw_data() interface, that's similar to
> > btf__get_raw_data(), allowing users of libbpf to get access to raw
> > representation of .BTF.ext section.
> >
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > ---
> >   tools/lib/bpf/btf.c      | 85 +++++++++++++++++++++-------------------
> >   tools/lib/bpf/btf.h      |  2 +
> >   tools/lib/bpf/libbpf.map |  1 +
> >   3 files changed, 47 insertions(+), 41 deletions(-)
> >
> > diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
> > index 8730f6c3be9e..c87cc3d71b9f 100644
> > --- a/tools/lib/bpf/btf.c
> > +++ b/tools/lib/bpf/btf.c
> > @@ -41,9 +41,8 @@ struct btf {
> >
> >   struct btf_ext_info {
> >       /*
> > -      * info points to a deep copy of the individual info section
> > -      * (e.g. func_info and line_info) from the .BTF.ext.
> > -      * It does not include the __u32 rec_size.
> > +      * info points to the individual info section (e.g. func_info and
> > +      * line_info) from the .BTF.ext. It does not include the __u32 rec_size.
> >        */
> >       void *info;
> >       __u32 rec_size;
> > @@ -51,8 +50,13 @@ struct btf_ext_info {
> >   };
> >
> >   struct btf_ext {
> > +     union {
> > +             struct btf_ext_header *hdr;
> > +             void *data;
> > +     };
> >       struct btf_ext_info func_info;
> >       struct btf_ext_info line_info;
> > +     __u32 data_size;
> >   };
> >
> >   struct btf_ext_info_sec {
> > @@ -603,19 +607,13 @@ struct btf_ext_sec_copy_param {
> >   };
> >
> >   static int btf_ext_copy_info(struct btf_ext *btf_ext,
> > -                          __u8 *data, __u32 data_size,
> >                            struct btf_ext_sec_copy_param *ext_sec)
>
> Overall looks good. Since we do not really "copy" info any more,
> rather we try to "setup" info based on btf_ext. Maybe changing
> all function and structure names with "_copy_" to "_setup_"?

Makes sense, will update.

>
> >   {
> > -     const struct btf_ext_header *hdr = (struct btf_ext_header *)data;
> >       const struct btf_ext_info_sec *sinfo;
> >       struct btf_ext_info *ext_info;
> >       __u32 info_left, record_size;
> >       /* The start of the info sec (including the __u32 record_size). */
> > -     const void *info;
> > -
> > -     /* data and data_size do not include btf_ext_header from now on */
> > -     data = data + hdr->hdr_len;
> > -     data_size -= hdr->hdr_len;
> > +     void *info;
> [...]

^ permalink raw reply

* Re: [PATCH net-next] devlink: Add WARN_ON to catch errors of not cleaning devlink objects
From: David Ahern @ 2019-02-08 17:30 UTC (permalink / raw)
  To: Parav Pandit, netdev, davem
In-Reply-To: <1549642921-11196-1-git-send-email-parav@mellanox.com>

On 2/8/19 8:22 AM, Parav Pandit wrote:
> Add WARN_ON to make sure that all sub objects of a devlink device are
> cleanedup before freeing the devlink device.
> This helps to catch any driver bugs.
> 
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> Acked-by: Jiri Pirko <jiri@mellanox.com>
> ---
>  net/core/devlink.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/net/core/devlink.c b/net/core/devlink.c
> index cd0d393..5e2ef5a 100644
> --- a/net/core/devlink.c
> +++ b/net/core/devlink.c
> @@ -4229,6 +4229,13 @@ void devlink_unregister(struct devlink *devlink)
>   */
>  void devlink_free(struct devlink *devlink)
>  {
> +	WARN_ON(!list_empty(&devlink->port_list));
> +	WARN_ON(!list_empty(&devlink->sb_list));
> +	WARN_ON(!list_empty(&devlink->dpipe_table_list));
> +	WARN_ON(!list_empty(&devlink->resource_list));
> +	WARN_ON(!list_empty(&devlink->param_list));
> +	WARN_ON(!list_empty(&devlink->region_list));
> +
>  	kfree(devlink);
>  }
>  EXPORT_SYMBOL_GPL(devlink_free);
> 

reporter_list was just added which brings up the maintenance question:
If you are going to do this you might want a comment in
include/net/devlink.h to remind folks to update this function as relevant.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox