* Re: [PATCH net-next 1/2] net: phy: sfp: make the i2c-bus property really optional
From: Antoine Tenart @ 2018-05-17 12:56 UTC (permalink / raw)
To: Andrew Lunn
Cc: Antoine Tenart, davem, linux, netdev, linux-kernel,
thomas.petazzoni, maxime.chevallier, gregory.clement,
miquel.raynal, nadavh, stefanc, ymarkman, mw
In-Reply-To: <20180517124127.GG8547@lunn.ch>
Hi Andrew,
On Thu, May 17, 2018 at 02:41:28PM +0200, Andrew Lunn wrote:
> On Thu, May 17, 2018 at 10:29:06AM +0200, Antoine Tenart wrote:
> > The SFF,SFP documentation is clear about making all the DT properties,
> > with the exception of the compatible, optional. In practice this is not
> > the case and without an i2c-bus property provided the SFP code will
> > throw NULL pointer exceptions.
> >
> > This patch is an attempt to fix this.
>
> How usable is an SFF/SFP module without access to the i2c EEPROM? I
> guess this comes down to link speed. Can it be manually configured?
>
> I'm just wondering if we want to make this mandatory? Fail the probe
> if it is not listed?
Yes, the other option would be to fail when probing a cage missing the
i2c description. I'd say a passive module can work without the i2c
EEPROM accessible as it does not need to be configured. I don't know
what would happen with active ones.
So the question is, do we want to enable partially working SFP cages
(ie. probably working with only a subset of SFP modules)?
Thanks!
Antoine
--
Antoine Ténart, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com
^ permalink raw reply
* [PATCH bpf-next] bpf: change eBPF helper doc parsing script to allow for smaller indent
From: Quentin Monnet @ 2018-05-17 12:43 UTC (permalink / raw)
To: daniel, ast; +Cc: netdev, oss-drivers, quentin.monnet
Documentation for eBPF helpers can be parsed from bpf.h and eventually
turned into a man page. Commit 6f96674dbd8c ("bpf: relax constraints on
formatting for eBPF helper documentation") changed the script used to
parse it, in order to allow for different indent style and to ease the
work for writing documentation for future helpers.
The script currently considers that the first tab can be replaced by 6
to 8 spaces. But the documentation for bpf_fib_lookup() uses a mix of
tabs (for the "Description" part) and of spaces ("Return" part), and
only has 5 space long indent for the latter.
We probably do not want to change the values accepted by the script each
time a new helper gets a new indent style. However, it is worth noting
that with those 5 spaces, the "Description" and "Return" part *look*
aligned in the generated patch and in `git show`, so it is likely other
helper authors will use the same length. Therefore, allow for helper
documentation to use 5 spaces only for the first indent level.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
---
scripts/bpf_helpers_doc.py | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
index 8f59897fbda1..5010a4d5bfba 100755
--- a/scripts/bpf_helpers_doc.py
+++ b/scripts/bpf_helpers_doc.py
@@ -95,7 +95,7 @@ class HeaderParser(object):
return capture.group(1)
def parse_desc(self):
- p = re.compile(' \* ?(?:\t| {6,8})Description$')
+ p = re.compile(' \* ?(?:\t| {5,8})Description$')
capture = p.match(self.line)
if not capture:
# Helper can have empty description and we might be parsing another
@@ -109,7 +109,7 @@ class HeaderParser(object):
if self.line == ' *\n':
desc += '\n'
else:
- p = re.compile(' \* ?(?:\t| {6,8})(?:\t| {8})(.*)')
+ p = re.compile(' \* ?(?:\t| {5,8})(?:\t| {8})(.*)')
capture = p.match(self.line)
if capture:
desc += capture.group(1) + '\n'
@@ -118,7 +118,7 @@ class HeaderParser(object):
return desc
def parse_ret(self):
- p = re.compile(' \* ?(?:\t| {6,8})Return$')
+ p = re.compile(' \* ?(?:\t| {5,8})Return$')
capture = p.match(self.line)
if not capture:
# Helper can have empty retval and we might be parsing another
@@ -132,7 +132,7 @@ class HeaderParser(object):
if self.line == ' *\n':
ret += '\n'
else:
- p = re.compile(' \* ?(?:\t| {6,8})(?:\t| {8})(.*)')
+ p = re.compile(' \* ?(?:\t| {5,8})(?:\t| {8})(.*)')
capture = p.match(self.line)
if capture:
ret += capture.group(1) + '\n'
--
2.14.1
^ permalink raw reply related
* Re: [patch net-next] nfp: flower: set sysfs link to device for representors
From: Jiri Pirko @ 2018-05-17 12:42 UTC (permalink / raw)
To: Or Gerlitz
Cc: Linux Netdev List, David Miller, Jakub Kicinski, Simon Horman,
Dirk van der Merwe, John Hurley, Pieter Jansen van Vuuren,
oss-drivers
In-Reply-To: <CAJ3xEMgw0Ar4_Zvm4cirjf7XgaPf8Ufos+vEjRmMq5OM2yg_aw@mail.gmail.com>
Thu, May 17, 2018 at 02:25:14PM CEST, gerlitz.or@gmail.com wrote:
>On Thu, May 17, 2018 at 1:05 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>> From: Jiri Pirko <jiri@mellanox.com>
>>
>> Do this so the sysfs has "device" link correctly set.
>
>please no
>
>This is likely to create bunch of issues with respect to how libvirt
>deals with the representors.
Once netdev is there because of some probed pci device, this link should
be set. That is one of the basics.
What "bunch of issues" are you talking about? Please be specific.
>
>We were discussing it off list between nfp and mlnx driver people. We
>need to put
>the open stack folks and kernel developers into the same thread. I can
>make a post
>on that next week
^ permalink raw reply
* Re: [PATCH net-next 1/2] net: phy: sfp: make the i2c-bus property really optional
From: Andrew Lunn @ 2018-05-17 12:41 UTC (permalink / raw)
To: Antoine Tenart
Cc: davem, linux, netdev, linux-kernel, thomas.petazzoni,
maxime.chevallier, gregory.clement, miquel.raynal, nadavh,
stefanc, ymarkman, mw
In-Reply-To: <20180517082907.14420-2-antoine.tenart@bootlin.com>
On Thu, May 17, 2018 at 10:29:06AM +0200, Antoine Tenart wrote:
> The SFF,SFP documentation is clear about making all the DT properties,
> with the exception of the compatible, optional. In practice this is not
> the case and without an i2c-bus property provided the SFP code will
> throw NULL pointer exceptions.
>
> This patch is an attempt to fix this.
Hi Antoine, Russell
How usable is an SFF/SFP module without access to the i2c EEPROM? I
guess this comes down to link speed. Can it be manually configured?
I'm just wondering if we want to make this mandatory? Fail the probe
if it is not listed?
Andrew
^ permalink raw reply
* Re: [PATCH V2 8/8] dt-bindings: stm32: add compatible for syscon
From: Rob Herring @ 2018-05-17 12:37 UTC (permalink / raw)
To: Christophe ROULLIER
Cc: mark.rutland@arm.com, mcoquelin.stm32@gmail.com, Alexandre TORGUE,
Peppe CAVALLARO, devicetree@vger.kernel.org, andrew@lunn.ch,
linux-arm-kernel@lists.infradead.org, netdev@vger.kernel.org
In-Reply-To: <d2b4cd1845b949a7bf9c42a17a5358a2@SFHDAG5NODE3.st.com>
On Tue, May 15, 2018 at 11:19 AM, Christophe ROULLIER
<christophe.roullier@st.com> wrote:
> Hi Rob,
Please don't top post to lists.
>
> I do not understand, so let me explain our status:
>
> We have syscfg IP Harware in our SOC.
Add a compatible string that uniquely identifies what the block is. So
something like "st,stm32f746-syscfg".
> But we do not have SoC specific driver to manage syscfg, we are using a generic driver "syscon".
That does not matter. We're talking about the binding. Design
decisions in the OS should not define the binding. It doesn't matter
that the OS currently doesn't use the compatible string.
> So can you tell me what you wish to describe this part in our SOC bindings ?
>
> Thanks for your help.
>
> Christophe.
>
> -----Original Message-----
> From: Rob Herring [mailto:robh@kernel.org]
> Sent: lundi 7 mai 2018 18:36
> To: Christophe ROULLIER <christophe.roullier@st.com>
> Cc: mark.rutland@arm.com; mcoquelin.stm32@gmail.com; Alexandre TORGUE <alexandre.torgue@st.com>; Peppe CAVALLARO <peppe.cavallaro@st.com>; devicetree@vger.kernel.org; andrew@lunn.ch; linux-arm-kernel@lists.infradead.org; netdev@vger.kernel.org
> Subject: Re: [PATCH V2 8/8] dt-bindings: stm32: add compatible for syscon
>
> On Wed, May 02, 2018 at 04:18:43PM +0200, Christophe Roullier wrote:
>> This patch describes syscon DT bindings.
>>
>> Signed-off-by: Christophe Roullier <christophe.roullier@st.com>
>> ---
>> Documentation/devicetree/bindings/arm/stm32.txt | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/Documentation/devicetree/bindings/arm/stm32.txt
>> b/Documentation/devicetree/bindings/arm/stm32.txt
>> index 6808ed9..06e3834 100644
>> --- a/Documentation/devicetree/bindings/arm/stm32.txt
>> +++ b/Documentation/devicetree/bindings/arm/stm32.txt
>> @@ -8,3 +8,7 @@ using one of the following compatible strings:
>> st,stm32f746
>> st,stm32h743
>> st,stm32mp157
>> +
>> +Required nodes:
>> +- syscon: the soc bus node must have a system controller node
>> +pointing to the
>> + global control registers, with the compatible string "syscon";
>
> You misunderstood my prior comment. 'syscon' alone is not valid. You need SoC specific compatible string for it and 'stm32' is not SoC specific. IOW, the compatible property for a syscon should imply every single register field in the block.
>
> Rob
^ permalink raw reply
* [PATCH] net/ncsi: prevent a couple array underflows
From: Dan Carpenter @ 2018-05-17 12:33 UTC (permalink / raw)
To: David S. Miller, Samuel Mendoza-Jonas; +Cc: netdev, Gavin Shan, kernel-janitors
We recently refactored this code and introduced a static checker
warning. Smatch complains that if cmd->index is zero then we would
underflow the arrays. That's obviously true.
The question is whether we prevent cmd->index from being zero at a
different level. I've looked at the code and I don't immediately see
a check for that.
Fixes: 062b3e1b6d4f ("net/ncsi: Refactor MAC, VLAN filters")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
diff --git a/net/ncsi/ncsi-rsp.c b/net/ncsi/ncsi-rsp.c
index ce9497966ebe..a6b7c7d5c829 100644
--- a/net/ncsi/ncsi-rsp.c
+++ b/net/ncsi/ncsi-rsp.c
@@ -347,7 +347,7 @@ static int ncsi_rsp_handler_svf(struct ncsi_request *nr)
cmd = (struct ncsi_cmd_svf_pkt *)skb_network_header(nr->cmd);
ncf = &nc->vlan_filter;
- if (cmd->index > ncf->n_vids)
+ if (cmd->index == 0 || cmd->index > ncf->n_vids)
return -ERANGE;
/* Add or remove the VLAN filter. Remember HW indexes from 1 */
@@ -445,7 +445,8 @@ static int ncsi_rsp_handler_sma(struct ncsi_request *nr)
ncf = &nc->mac_filter;
bitmap = &ncf->bitmap;
- if (cmd->index > ncf->n_uc + ncf->n_mc + ncf->n_mixed)
+ if (cmd->index == 0 ||
+ cmd->index > ncf->n_uc + ncf->n_mc + ncf->n_mixed)
return -ERANGE;
index = (cmd->index - 1) * ETH_ALEN;
^ permalink raw reply related
* [PATCH bpf-next v6 6/6] selftests/bpf: test for seg6local End.BPF action
From: Mathieu Xhonneux @ 2018-05-17 14:28 UTC (permalink / raw)
To: netdev; +Cc: daniel, dlebrun, alexei.starovoitov
In-Reply-To: <cover.1526565671.git.m.xhonneux@gmail.com>
Add a new test for the seg6local End.BPF action. The following helpers
are also tested:
- bpf_lwt_push_encap within the LWT BPF IN hook
- bpf_lwt_seg6_action
- bpf_lwt_seg6_adjust_srh
- bpf_lwt_seg6_store_bytes
A chain of End.BPF actions is built. The SRH is injected through a LWT
BPF IN hook before the chain. Each End.BPF action validates the previous
one, otherwise the packet is dropped.
The test succeeds if the last node in the chain receives the packet and
the UDP datagram contained can be retrieved from userspace.
Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
---
tools/include/uapi/linux/bpf.h | 97 ++++-
tools/testing/selftests/bpf/Makefile | 6 +-
tools/testing/selftests/bpf/bpf_helpers.h | 12 +
tools/testing/selftests/bpf/test_lwt_seg6local.c | 438 ++++++++++++++++++++++
tools/testing/selftests/bpf/test_lwt_seg6local.sh | 140 +++++++
5 files changed, 690 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/bpf/test_lwt_seg6local.c
create mode 100755 tools/testing/selftests/bpf/test_lwt_seg6local.sh
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d94d333a8225..e8efb12d0a7d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -141,6 +141,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SK_MSG,
BPF_PROG_TYPE_RAW_TRACEPOINT,
BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
+ BPF_PROG_TYPE_LWT_SEG6LOCAL,
};
enum bpf_attach_type {
@@ -1902,6 +1903,90 @@ union bpf_attr {
* egress otherwise). This is the only flag supported for now.
* Return
* **SK_PASS** on success, or **SK_DROP** on error.
+ *
+ * int bpf_lwt_push_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
+ * Description
+ * Encapsulate the packet associated to *skb* within a Layer 3
+ * protocol header. This header is provided in the buffer at
+ * address *hdr*, with *len* its size in bytes. *type* indicates
+ * the protocol of the header and can be one of:
+ *
+ * **BPF_LWT_ENCAP_SEG6**
+ * IPv6 encapsulation with Segment Routing Header
+ * (**struct ipv6_sr_hdr**). *hdr* only contains the SRH,
+ * the IPv6 header is computed by the kernel.
+ * **BPF_LWT_ENCAP_SEG6_INLINE**
+ * Only works if *skb* contains an IPv6 packet. Insert a
+ * Segment Routing Header (**struct ipv6_sr_hdr**) inside
+ * the IPv6 header.
+ *
+ * A call to this helper is susceptible to change the underlaying
+ * packet buffer. Therefore, at load time, all checks on pointers
+ * previously done by the verifier are invalidated and must be
+ * performed again, if the helper is used in combination with
+ * direct packet access.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_lwt_seg6_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len)
+ * Description
+ * Store *len* bytes from address *from* into the packet
+ * associated to *skb*, at *offset*. Only the flags, tag and TLVs
+ * inside the outermost IPv6 Segment Routing Header can be
+ * modified through this helper.
+ *
+ * A call to this helper is susceptible to change the underlaying
+ * packet buffer. Therefore, at load time, all checks on pointers
+ * previously done by the verifier are invalidated and must be
+ * performed again, if the helper is used in combination with
+ * direct packet access.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_lwt_seg6_adjust_srh(struct sk_buff *skb, u32 offset, s32 delta)
+ * Description
+ * Adjust the size allocated to TLVs in the outermost IPv6
+ * Segment Routing Header contained in the packet associated to
+ * *skb*, at position *offset* by *delta* bytes. Only offsets
+ * after the segments are accepted. *delta* can be as well
+ * positive (growing) as negative (shrinking).
+ *
+ * A call to this helper is susceptible to change the underlaying
+ * packet buffer. Therefore, at load time, all checks on pointers
+ * previously done by the verifier are invalidated and must be
+ * performed again, if the helper is used in combination with
+ * direct packet access.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_lwt_seg6_action(struct sk_buff *skb, u32 action, void *param, u32 param_len)
+ * Description
+ * Apply an IPv6 Segment Routing action of type *action* to the
+ * packet associated to *skb*. Each action takes a parameter
+ * contained at address *param*, and of length *param_len* bytes.
+ * *action* can be one of:
+ *
+ * **SEG6_LOCAL_ACTION_END_X**
+ * End.X action: Endpoint with Layer-3 cross-connect.
+ * Type of *param*: **struct in6_addr**.
+ * **SEG6_LOCAL_ACTION_END_T**
+ * End.T action: Endpoint with specific IPv6 table lookup.
+ * Type of *param*: **int**.
+ * **SEG6_LOCAL_ACTION_END_B6**
+ * End.B6 action: Endpoint bound to an SRv6 policy.
+ * Type of param: **struct ipv6_sr_hdr**.
+ * **SEG6_LOCAL_ACTION_END_B6_ENCAP**
+ * End.B6.Encap action: Endpoint bound to an SRv6
+ * encapsulation policy.
+ * Type of param: **struct ipv6_sr_hdr**.
+ *
+ * A call to this helper is susceptible to change the underlaying
+ * packet buffer. Therefore, at load time, all checks on pointers
+ * previously done by the verifier are invalidated and must be
+ * performed again, if the helper is used in combination with
+ * direct packet access.
+ * Return
+ * 0 on success, or a negative error in case of failure.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -1976,7 +2061,11 @@ union bpf_attr {
FN(fib_lookup), \
FN(sock_hash_update), \
FN(msg_redirect_hash), \
- FN(sk_redirect_hash),
+ FN(sk_redirect_hash), \
+ FN(lwt_push_encap), \
+ FN(lwt_seg6_store_bytes), \
+ FN(lwt_seg6_adjust_srh), \
+ FN(lwt_seg6_action),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
@@ -2043,6 +2132,12 @@ enum bpf_hdr_start_off {
BPF_HDR_START_NET,
};
+/* Encapsulation type for BPF_FUNC_lwt_push_encap helper. */
+enum bpf_lwt_encap_mode {
+ BPF_LWT_ENCAP_SEG6,
+ BPF_LWT_ENCAP_SEG6_INLINE
+};
+
/* user accessible mirror of in-kernel sk_buff.
* new fields can only be added to the end of this structure
*/
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 1eb0fa2aba92..b6222b3f8fab 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -33,7 +33,8 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o \
test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o \
- test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o
+ test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
+ test_lwt_seg6local.o
# Order correspond to 'make run_tests' order
TEST_PROGS := test_kmod.sh \
@@ -42,7 +43,8 @@ TEST_PROGS := test_kmod.sh \
test_xdp_meta.sh \
test_offload.py \
test_sock_addr.sh \
- test_tunnel.sh
+ test_tunnel.sh \
+ test_lwt_seg6local.sh
# Compile but not part of 'make run_tests'
TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 8f143dfb3700..334d3e8c5e89 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -114,6 +114,18 @@ static int (*bpf_get_stack)(void *ctx, void *buf, int size, int flags) =
static int (*bpf_fib_lookup)(void *ctx, struct bpf_fib_lookup *params,
int plen, __u32 flags) =
(void *) BPF_FUNC_fib_lookup;
+static int (*bpf_lwt_push_encap)(void *ctx, unsigned int type, void *hdr,
+ unsigned int len) =
+ (void *) BPF_FUNC_lwt_push_encap;
+static int (*bpf_lwt_seg6_store_bytes)(void *ctx, unsigned int offset,
+ void *from, unsigned int len) =
+ (void *) BPF_FUNC_lwt_seg6_store_bytes;
+static int (*bpf_lwt_seg6_action)(void *ctx, unsigned int action, void *param,
+ unsigned int param_len) =
+ (void *) BPF_FUNC_lwt_seg6_action;
+static int (*bpf_lwt_seg6_adjust_srh)(void *ctx, unsigned int offset,
+ unsigned int len) =
+ (void *) BPF_FUNC_lwt_seg6_adjust_srh;
/* llvm builtin functions that eBPF C program may use to
* emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/test_lwt_seg6local.c b/tools/testing/selftests/bpf/test_lwt_seg6local.c
new file mode 100644
index 000000000000..d752bc1fe81c
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_seg6local.c
@@ -0,0 +1,438 @@
+#include <stddef.h>
+#include <inttypes.h>
+#include <errno.h>
+#include <linux/seg6_local.h>
+#include <linux/bpf.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+#define bpf_printk(fmt, ...) \
+({ \
+ char ____fmt[] = fmt; \
+ bpf_trace_printk(____fmt, sizeof(____fmt), \
+ ##__VA_ARGS__); \
+})
+
+/* Packet parsing state machine helpers. */
+#define cursor_advance(_cursor, _len) \
+ ({ void *_tmp = _cursor; _cursor += _len; _tmp; })
+
+#define SR6_FLAG_ALERT (1 << 4)
+
+#define htonll(x) ((bpf_htonl(1)) == 1 ? (x) : ((uint64_t)bpf_htonl((x) & \
+ 0xFFFFFFFF) << 32) | bpf_htonl((x) >> 32))
+#define ntohll(x) ((bpf_ntohl(1)) == 1 ? (x) : ((uint64_t)bpf_ntohl((x) & \
+ 0xFFFFFFFF) << 32) | bpf_ntohl((x) >> 32))
+#define BPF_PACKET_HEADER __attribute__((packed))
+
+struct ip6_t {
+ unsigned int ver:4;
+ unsigned int priority:8;
+ unsigned int flow_label:20;
+ unsigned short payload_len;
+ unsigned char next_header;
+ unsigned char hop_limit;
+ unsigned long long src_hi;
+ unsigned long long src_lo;
+ unsigned long long dst_hi;
+ unsigned long long dst_lo;
+} BPF_PACKET_HEADER;
+
+struct ip6_addr_t {
+ unsigned long long hi;
+ unsigned long long lo;
+} BPF_PACKET_HEADER;
+
+struct ip6_srh_t {
+ unsigned char nexthdr;
+ unsigned char hdrlen;
+ unsigned char type;
+ unsigned char segments_left;
+ unsigned char first_segment;
+ unsigned char flags;
+ unsigned short tag;
+
+ struct ip6_addr_t segments[0];
+} BPF_PACKET_HEADER;
+
+struct sr6_tlv_t {
+ unsigned char type;
+ unsigned char len;
+ unsigned char value[0];
+} BPF_PACKET_HEADER;
+
+__attribute__((always_inline)) struct ip6_srh_t *get_srh(struct __sk_buff *skb)
+{
+ void *cursor, *data_end;
+ struct ip6_srh_t *srh;
+ struct ip6_t *ip;
+ uint8_t *ipver;
+
+ data_end = (void *)(long)skb->data_end;
+ cursor = (void *)(long)skb->data;
+ ipver = (uint8_t *)cursor;
+
+ if ((void *)ipver + sizeof(*ipver) > data_end)
+ return NULL;
+
+ if ((*ipver >> 4) != 6)
+ return NULL;
+
+ ip = cursor_advance(cursor, sizeof(*ip));
+ if ((void *)ip + sizeof(*ip) > data_end)
+ return NULL;
+
+ if (ip->next_header != 43)
+ return NULL;
+
+ srh = cursor_advance(cursor, sizeof(*srh));
+ if ((void *)srh + sizeof(*srh) > data_end)
+ return NULL;
+
+ if (srh->type != 4)
+ return NULL;
+
+ return srh;
+}
+
+__attribute__((always_inline))
+int update_tlv_pad(struct __sk_buff *skb, uint32_t new_pad,
+ uint32_t old_pad, uint32_t pad_off)
+{
+ int err;
+
+ if (new_pad != old_pad) {
+ err = bpf_lwt_seg6_adjust_srh(skb, pad_off,
+ (int) new_pad - (int) old_pad);
+ if (err)
+ return err;
+ }
+
+ if (new_pad > 0) {
+ char pad_tlv_buf[16] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0};
+ struct sr6_tlv_t *pad_tlv = (struct sr6_tlv_t *) pad_tlv_buf;
+
+ pad_tlv->type = SR6_TLV_PADDING;
+ pad_tlv->len = new_pad - 2;
+
+ err = bpf_lwt_seg6_store_bytes(skb, pad_off,
+ (void *)pad_tlv_buf, new_pad);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+
+__attribute__((always_inline))
+int is_valid_tlv_boundary(struct __sk_buff *skb, struct ip6_srh_t *srh,
+ uint32_t *tlv_off, uint32_t *pad_size,
+ uint32_t *pad_off)
+{
+ uint32_t srh_off, cur_off;
+ int offset_valid = 0;
+ int err;
+
+ srh_off = (char *)srh - (char *)(long)skb->data;
+ // cur_off = end of segments, start of possible TLVs
+ cur_off = srh_off + sizeof(*srh) +
+ sizeof(struct ip6_addr_t) * (srh->first_segment + 1);
+
+ *pad_off = 0;
+
+ // we can only go as far as ~10 TLVs due to the BPF max stack size
+ #pragma clang loop unroll(full)
+ for (int i = 0; i < 10; i++) {
+ struct sr6_tlv_t tlv;
+
+ if (cur_off == *tlv_off)
+ offset_valid = 1;
+
+ if (cur_off >= srh_off + ((srh->hdrlen + 1) << 3))
+ break;
+
+ err = bpf_skb_load_bytes(skb, cur_off, &tlv, sizeof(tlv));
+ if (err)
+ return err;
+
+ if (tlv.type == SR6_TLV_PADDING) {
+ *pad_size = tlv.len + sizeof(tlv);
+ *pad_off = cur_off;
+
+ if (*tlv_off == srh_off) {
+ *tlv_off = cur_off;
+ offset_valid = 1;
+ }
+ break;
+
+ } else if (tlv.type == SR6_TLV_HMAC) {
+ break;
+ }
+
+ cur_off += sizeof(tlv) + tlv.len;
+ } // we reached the padding or HMAC TLVs, or the end of the SRH
+
+ if (*pad_off == 0)
+ *pad_off = cur_off;
+
+ if (*tlv_off == -1)
+ *tlv_off = cur_off;
+ else if (!offset_valid)
+ return -EINVAL;
+
+ return 0;
+}
+
+__attribute__((always_inline))
+int add_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh, uint32_t tlv_off,
+ struct sr6_tlv_t *itlv, uint8_t tlv_size)
+{
+ uint32_t srh_off = (char *)srh - (char *)(long)skb->data;
+ uint8_t len_remaining, new_pad;
+ uint32_t pad_off = 0;
+ uint32_t pad_size = 0;
+ uint32_t partial_srh_len;
+ int err;
+
+ if (tlv_off != -1)
+ tlv_off += srh_off;
+
+ if (itlv->type == SR6_TLV_PADDING || itlv->type == SR6_TLV_HMAC)
+ return -EINVAL;
+
+ err = is_valid_tlv_boundary(skb, srh, &tlv_off, &pad_size, &pad_off);
+ if (err)
+ return err;
+
+ err = bpf_lwt_seg6_adjust_srh(skb, tlv_off, sizeof(*itlv) + itlv->len);
+ if (err)
+ return err;
+
+ err = bpf_lwt_seg6_store_bytes(skb, tlv_off, (void *)itlv, tlv_size);
+ if (err)
+ return err;
+
+ // the following can't be moved inside update_tlv_pad because the
+ // bpf verifier has some issues with it
+ pad_off += sizeof(*itlv) + itlv->len;
+ partial_srh_len = pad_off - srh_off;
+ len_remaining = partial_srh_len % 8;
+ new_pad = 8 - len_remaining;
+
+ if (new_pad == 1) // cannot pad for 1 byte only
+ new_pad = 9;
+ else if (new_pad == 8)
+ new_pad = 0;
+
+ return update_tlv_pad(skb, new_pad, pad_size, pad_off);
+}
+
+__attribute__((always_inline))
+int delete_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh,
+ uint32_t tlv_off)
+{
+ uint32_t srh_off = (char *)srh - (char *)(long)skb->data;
+ uint8_t len_remaining, new_pad;
+ uint32_t partial_srh_len;
+ uint32_t pad_off = 0;
+ uint32_t pad_size = 0;
+ struct sr6_tlv_t tlv;
+ int err;
+
+ tlv_off += srh_off;
+
+ err = is_valid_tlv_boundary(skb, srh, &tlv_off, &pad_size, &pad_off);
+ if (err)
+ return err;
+
+ err = bpf_skb_load_bytes(skb, tlv_off, &tlv, sizeof(tlv));
+ if (err)
+ return err;
+
+ err = bpf_lwt_seg6_adjust_srh(skb, tlv_off, -(sizeof(tlv) + tlv.len));
+ if (err)
+ return err;
+
+ pad_off -= sizeof(tlv) + tlv.len;
+ partial_srh_len = pad_off - srh_off;
+ len_remaining = partial_srh_len % 8;
+ new_pad = 8 - len_remaining;
+ if (new_pad == 1) // cannot pad for 1 byte only
+ new_pad = 9;
+ else if (new_pad == 8)
+ new_pad = 0;
+
+ return update_tlv_pad(skb, new_pad, pad_size, pad_off);
+}
+
+__attribute__((always_inline))
+int has_egr_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh)
+{
+ int tlv_offset = sizeof(struct ip6_t) + sizeof(struct ip6_srh_t) +
+ ((srh->first_segment + 1) << 4);
+ struct sr6_tlv_t tlv;
+
+ if (bpf_skb_load_bytes(skb, tlv_offset, &tlv, sizeof(struct sr6_tlv_t)))
+ return 0;
+
+ if (tlv.type == SR6_TLV_EGRESS && tlv.len == 18) {
+ struct ip6_addr_t egr_addr;
+
+ if (bpf_skb_load_bytes(skb, tlv_offset + 4, &egr_addr, 16))
+ return 0;
+
+ // check if egress TLV value is correct
+ if (ntohll(egr_addr.hi) == 0xfd00000000000000 &&
+ ntohll(egr_addr.lo) == 0x4)
+ return 1;
+ }
+
+ return 0;
+}
+
+// This function will push a SRH with segments fd00::1, fd00::2, fd00::3,
+// fd00::4
+SEC("encap_srh")
+int __encap_srh(struct __sk_buff *skb)
+{
+ bpf_printk("got pkt\n");
+ unsigned long long hi = 0xfd00000000000000;
+ struct ip6_addr_t *seg;
+ struct ip6_srh_t *srh;
+ char srh_buf[72]; // room for 4 segments
+ int err;
+
+ srh = (struct ip6_srh_t *)srh_buf;
+ srh->nexthdr = 0;
+ srh->hdrlen = 8;
+ srh->type = 4;
+ srh->segments_left = 3;
+ srh->first_segment = 3;
+ srh->flags = 0;
+ srh->tag = 0;
+
+ seg = (struct ip6_addr_t *)((char *)srh + sizeof(*srh));
+
+ #pragma clang loop unroll(full)
+ for (unsigned long long lo = 0; lo < 4; lo++) {
+ seg->lo = htonll(4 - lo);
+ seg->hi = htonll(hi);
+ seg = (struct ip6_addr_t *)((char *)seg + sizeof(*seg));
+ }
+
+ err = bpf_lwt_push_encap(skb, 0, (void *)srh, sizeof(srh_buf));
+ if (err)
+ return BPF_DROP;
+
+ return BPF_REDIRECT;
+}
+
+// Add an Egress TLV fc00::4, add the flag A,
+// and apply End.X action to fc42::1
+SEC("add_egr_x")
+int __add_egr_x(struct __sk_buff *skb)
+{
+ unsigned long long hi = 0xfc42000000000000;
+ unsigned long long lo = 0x1;
+ struct ip6_srh_t *srh = get_srh(skb);
+ uint8_t new_flags = SR6_FLAG_ALERT;
+ struct ip6_addr_t addr;
+ int err, offset;
+
+ if (srh == NULL)
+ return BPF_DROP;
+
+ uint8_t tlv[20] = {2, 18, 0, 0, 0xfd, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
+ 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x4};
+
+ err = add_tlv(skb, srh, (srh->hdrlen+1) << 3,
+ (struct sr6_tlv_t *)&tlv, 20);
+ if (err)
+ return BPF_DROP;
+
+ offset = sizeof(struct ip6_t) + offsetof(struct ip6_srh_t, flags);
+ err = bpf_lwt_seg6_store_bytes(skb, offset,
+ (void *)&new_flags, sizeof(new_flags));
+ if (err)
+ return BPF_DROP;
+
+ addr.lo = htonll(lo);
+ addr.hi = htonll(hi);
+ err = bpf_lwt_seg6_action(skb, SEG6_LOCAL_ACTION_END_X,
+ (void *)&addr, sizeof(addr));
+ if (err)
+ return BPF_DROP;
+ return BPF_REDIRECT;
+}
+
+// Pop the Egress TLV, reset the flags, change the tag 2442 and finally do a
+// simple End action
+SEC("pop_egr")
+int __pop_egr(struct __sk_buff *skb)
+{
+ struct ip6_srh_t *srh = get_srh(skb);
+ uint16_t new_tag = bpf_htons(2442);
+ uint8_t new_flags = 0;
+ int err, offset;
+
+ if (srh == NULL)
+ return BPF_DROP;
+
+ if (srh->flags != SR6_FLAG_ALERT)
+ return BPF_DROP;
+
+ if (srh->hdrlen != 11) // 4 segments + Egress TLV + Padding TLV
+ return BPF_DROP;
+
+ if (!has_egr_tlv(skb, srh))
+ return BPF_DROP;
+
+ err = delete_tlv(skb, srh, 8 + (srh->first_segment + 1) * 16);
+ if (err)
+ return BPF_DROP;
+
+ offset = sizeof(struct ip6_t) + offsetof(struct ip6_srh_t, flags);
+ if (bpf_lwt_seg6_store_bytes(skb, offset, (void *)&new_flags,
+ sizeof(new_flags)))
+ return BPF_DROP;
+
+ offset = sizeof(struct ip6_t) + offsetof(struct ip6_srh_t, tag);
+ if (bpf_lwt_seg6_store_bytes(skb, offset, (void *)&new_tag,
+ sizeof(new_tag)))
+ return BPF_DROP;
+
+ return BPF_OK;
+}
+
+// Inspect if the Egress TLV and flag have been removed, if the tag is correct,
+// then apply a End.T action to reach the last segment
+SEC("inspect_t")
+int __inspect_t(struct __sk_buff *skb)
+{
+ struct ip6_srh_t *srh = get_srh(skb);
+ int table = 117;
+ int err;
+
+ if (srh == NULL)
+ return BPF_DROP;
+
+ if (srh->flags != 0)
+ return BPF_DROP;
+
+ if (srh->tag != bpf_htons(2442))
+ return BPF_DROP;
+
+ if (srh->hdrlen != 8) // 4 segments
+ return BPF_DROP;
+
+ err = bpf_lwt_seg6_action(skb, SEG6_LOCAL_ACTION_END_T,
+ (void *)&table, sizeof(table));
+
+ if (err)
+ return BPF_DROP;
+
+ return BPF_REDIRECT;
+}
+
+char __license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_lwt_seg6local.sh b/tools/testing/selftests/bpf/test_lwt_seg6local.sh
new file mode 100755
index 000000000000..1c77994b5e71
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_seg6local.sh
@@ -0,0 +1,140 @@
+#!/bin/bash
+# Connects 6 network namespaces through veths.
+# Each NS may have different IPv6 global scope addresses :
+# NS1 ---- NS2 ---- NS3 ---- NS4 ---- NS5 ---- NS6
+# fb00::1 fd00::1 fd00::2 fd00::3 fb00::6
+# fc42::1 fd00::4
+#
+# All IPv6 packets going to fb00::/16 through NS2 will be encapsulated in a
+# IPv6 header with a Segment Routing Header, with segments :
+# fd00::1 -> fd00::2 -> fd00::3 -> fd00::4
+#
+# 3 fd00::/16 IPv6 addresses are binded to seg6local End.BPF actions :
+# - fd00::1 : add a TLV, change the flags and apply a End.X action to fc42::1
+# - fd00::2 : remove the TLV, change the flags, add a tag
+# - fd00::3 : apply an End.T action to fd00::4, through routing table 117
+#
+# fd00::4 is a simple Segment Routing node decapsulating the inner IPv6 packet.
+# Each End.BPF action will validate the operations applied on the SRH by the
+# previous BPF program in the chain, otherwise the packet is dropped.
+#
+# An UDP datagram is sent from fb00::1 to fb00::6. The test succeeds if this
+# datagram can be read on NS6 when binding to fb00::6.
+
+TMP_FILE="/tmp/selftest_lwt_seg6local.txt"
+
+cleanup()
+{
+ if [ "$?" = "0" ]; then
+ echo "selftests: test_lwt_seg6local [PASS]";
+ else
+ echo "selftests: test_lwt_seg6local [FAILED]";
+ fi
+
+ set +e
+ ip netns del ns1 2> /dev/null
+ ip netns del ns2 2> /dev/null
+ ip netns del ns3 2> /dev/null
+ ip netns del ns4 2> /dev/null
+ ip netns del ns5 2> /dev/null
+ ip netns del ns6 2> /dev/null
+ rm -f $TMP_FILE
+}
+
+set -e
+
+ip netns add ns1
+ip netns add ns2
+ip netns add ns3
+ip netns add ns4
+ip netns add ns5
+ip netns add ns6
+
+trap cleanup 0 2 3 6 9
+
+ip link add veth1 type veth peer name veth2
+ip link add veth3 type veth peer name veth4
+ip link add veth5 type veth peer name veth6
+ip link add veth7 type veth peer name veth8
+ip link add veth9 type veth peer name veth10
+
+ip link set veth1 netns ns1
+ip link set veth2 netns ns2
+ip link set veth3 netns ns2
+ip link set veth4 netns ns3
+ip link set veth5 netns ns3
+ip link set veth6 netns ns4
+ip link set veth7 netns ns4
+ip link set veth8 netns ns5
+ip link set veth9 netns ns5
+ip link set veth10 netns ns6
+
+ip netns exec ns1 ip link set dev veth1 up
+ip netns exec ns2 ip link set dev veth2 up
+ip netns exec ns2 ip link set dev veth3 up
+ip netns exec ns3 ip link set dev veth4 up
+ip netns exec ns3 ip link set dev veth5 up
+ip netns exec ns4 ip link set dev veth6 up
+ip netns exec ns4 ip link set dev veth7 up
+ip netns exec ns5 ip link set dev veth8 up
+ip netns exec ns5 ip link set dev veth9 up
+ip netns exec ns6 ip link set dev veth10 up
+ip netns exec ns6 ip link set dev lo up
+
+# All link scope addresses and routes required between veths
+ip netns exec ns1 ip -6 addr add fb00::12/16 dev veth1 scope link
+ip netns exec ns1 ip -6 route add fb00::21 dev veth1 scope link
+ip netns exec ns2 ip -6 addr add fb00::21/16 dev veth2 scope link
+ip netns exec ns2 ip -6 addr add fb00::34/16 dev veth3 scope link
+ip netns exec ns2 ip -6 route add fb00::43 dev veth3 scope link
+ip netns exec ns3 ip -6 route add fb00::65 dev veth5 scope link
+ip netns exec ns3 ip -6 addr add fb00::43/16 dev veth4 scope link
+ip netns exec ns3 ip -6 addr add fb00::56/16 dev veth5 scope link
+ip netns exec ns4 ip -6 addr add fb00::65/16 dev veth6 scope link
+ip netns exec ns4 ip -6 addr add fb00::78/16 dev veth7 scope link
+ip netns exec ns4 ip -6 route add fb00::87 dev veth7 scope link
+ip netns exec ns5 ip -6 addr add fb00::87/16 dev veth8 scope link
+ip netns exec ns5 ip -6 addr add fb00::910/16 dev veth9 scope link
+ip netns exec ns5 ip -6 route add fb00::109 dev veth9 scope link
+ip netns exec ns5 ip -6 route add fb00::109 table 117 dev veth9 scope link
+ip netns exec ns6 ip -6 addr add fb00::109/16 dev veth10 scope link
+
+ip netns exec ns1 ip -6 addr add fb00::1/16 dev lo
+ip netns exec ns1 ip -6 route add fb00::6 dev veth1 via fb00::21
+
+ip netns exec ns2 ip -6 route add fb00::6 encap bpf in obj test_lwt_seg6local.o sec encap_srh dev veth2
+ip netns exec ns2 ip -6 route add fd00::1 dev veth3 via fb00::43 scope link
+
+ip netns exec ns3 ip -6 route add fc42::1 dev veth5 via fb00::65
+ip netns exec ns3 ip -6 route add fd00::1 encap seg6local action End.BPF obj test_lwt_seg6local.o sec add_egr_x dev veth4
+
+ip netns exec ns4 ip -6 route add fd00::2 encap seg6local action End.BPF obj test_lwt_seg6local.o sec pop_egr dev veth6
+ip netns exec ns4 ip -6 addr add fc42::1 dev lo
+ip netns exec ns4 ip -6 route add fd00::3 dev veth7 via fb00::87
+
+ip netns exec ns5 ip -6 route add fd00::4 table 117 dev veth9 via fb00::109
+ip netns exec ns5 ip -6 route add fd00::3 encap seg6local action End.BPF obj test_lwt_seg6local.o sec inspect_t dev veth8
+
+ip netns exec ns6 ip -6 addr add fb00::6/16 dev lo
+ip netns exec ns6 ip -6 addr add fd00::4/16 dev lo
+
+ip netns exec ns1 sysctl net.ipv6.conf.all.forwarding=1 > /dev/null
+ip netns exec ns2 sysctl net.ipv6.conf.all.forwarding=1 > /dev/null
+ip netns exec ns3 sysctl net.ipv6.conf.all.forwarding=1 > /dev/null
+ip netns exec ns4 sysctl net.ipv6.conf.all.forwarding=1 > /dev/null
+ip netns exec ns5 sysctl net.ipv6.conf.all.forwarding=1 > /dev/null
+
+ip netns exec ns6 sysctl net.ipv6.conf.all.seg6_enabled=1 > /dev/null
+ip netns exec ns6 sysctl net.ipv6.conf.lo.seg6_enabled=1 > /dev/null
+ip netns exec ns6 sysctl net.ipv6.conf.veth10.seg6_enabled=1 > /dev/null
+
+ip netns exec ns6 nc -l -6 -u -d 7330 > $TMP_FILE &
+ip netns exec ns1 bash -c "echo 'foobar' | nc -w0 -6 -u -p 2121 -s fb00::1 fb00::6 7330"
+sleep 5 # wait enough time to ensure the UDP datagram arrived to the last segment
+kill -INT $!
+
+if [[ $(< $TMP_FILE) != "foobar" ]]; then
+ exit 1
+fi
+
+exit 0
--
2.16.1
^ permalink raw reply related
* [PATCH bpf-next v6 5/6] ipv6: sr: Add seg6local action End.BPF
From: Mathieu Xhonneux @ 2018-05-17 14:28 UTC (permalink / raw)
To: netdev; +Cc: daniel, dlebrun, alexei.starovoitov
In-Reply-To: <cover.1526565671.git.m.xhonneux@gmail.com>
This patch adds the End.BPF action to the LWT seg6local infrastructure.
This action works like any other seg6local End action, meaning that an IPv6
header with SRH is needed, whose DA has to be equal to the SID of the
action. It will also advance the SRH to the next segment, the BPF program
does not have to take care of this.
Since the BPF program may not be a source of instability in the kernel, it
is important to ensure that the integrity of the packet is maintained
before yielding it back to the IPv6 layer. The hook hence keeps track if
the SRH has been altered through the helpers, and re-validates its
content if needed with seg6_validate_srh. The state kept for validation is
stored in a per-CPU buffer. The BPF program is not allowed to directly
write into the packet, and only some fields of the SRH can be altered
through the helper bpf_lwt_seg6_store_bytes.
Performances profiling has shown that the SRH re-validation does not induce
a significant overhead. If the altered SRH is deemed as invalid, the packet
is dropped.
This validation is also done before executing any action through
bpf_lwt_seg6_action, and will not be performed again if the SRH is not
modified after calling the action.
The BPF program may return 3 types of return codes:
- BPF_OK: the End.BPF action will look up the next destination through
seg6_lookup_nexthop.
- BPF_REDIRECT: if an action has been executed through the
bpf_lwt_seg6_action helper, the BPF program should return this
value, as the skb's destination is already set and the default
lookup should not be performed.
- BPF_DROP : the packet will be dropped.
Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
---
include/linux/bpf_types.h | 1 +
include/uapi/linux/bpf.h | 1 +
include/uapi/linux/seg6_local.h | 3 +
kernel/bpf/verifier.c | 1 +
net/core/filter.c | 25 +++++++
net/ipv6/seg6_local.c | 158 +++++++++++++++++++++++++++++++++++++++-
tools/lib/bpf/libbpf.c | 1 +
7 files changed, 187 insertions(+), 3 deletions(-)
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index aa5c8b878474..b161e506dcfc 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -12,6 +12,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK_ADDR, cg_sock_addr)
BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_in)
BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_out)
BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
+BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_SEG6LOCAL, lwt_seg6local)
BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 37f098ca822b..e8efb12d0a7d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -141,6 +141,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SK_MSG,
BPF_PROG_TYPE_RAW_TRACEPOINT,
BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
+ BPF_PROG_TYPE_LWT_SEG6LOCAL,
};
enum bpf_attach_type {
diff --git a/include/uapi/linux/seg6_local.h b/include/uapi/linux/seg6_local.h
index ef2d8c3e76c1..aadcc11fb918 100644
--- a/include/uapi/linux/seg6_local.h
+++ b/include/uapi/linux/seg6_local.h
@@ -25,6 +25,7 @@ enum {
SEG6_LOCAL_NH6,
SEG6_LOCAL_IIF,
SEG6_LOCAL_OIF,
+ SEG6_LOCAL_BPF,
__SEG6_LOCAL_MAX,
};
#define SEG6_LOCAL_MAX (__SEG6_LOCAL_MAX - 1)
@@ -59,6 +60,8 @@ enum {
SEG6_LOCAL_ACTION_END_AS = 13,
/* forward to SR-unaware VNF with masquerading */
SEG6_LOCAL_ACTION_END_AM = 14,
+ /* custom BPF action */
+ SEG6_LOCAL_ACTION_END_BPF = 15,
__SEG6_LOCAL_ACTION_MAX,
};
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a9e4b1372da6..390142d62ba1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1262,6 +1262,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
switch (env->prog->type) {
case BPF_PROG_TYPE_LWT_IN:
case BPF_PROG_TYPE_LWT_OUT:
+ case BPF_PROG_TYPE_LWT_SEG6LOCAL:
/* dst_input() and dst_output() can't write for now */
if (t == BPF_WRITE)
return false;
diff --git a/net/core/filter.c b/net/core/filter.c
index 39641ea567b4..8cf0065107a3 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4893,6 +4893,21 @@ lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
}
}
+static const struct bpf_func_proto *
+lwt_seg6local_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ switch (func_id) {
+ case BPF_FUNC_lwt_seg6_store_bytes:
+ return &bpf_lwt_seg6_store_bytes_proto;
+ case BPF_FUNC_lwt_seg6_action:
+ return &bpf_lwt_seg6_action_proto;
+ case BPF_FUNC_lwt_seg6_adjust_srh:
+ return &bpf_lwt_seg6_adjust_srh_proto;
+ default:
+ return lwt_out_func_proto(func_id, prog);
+ }
+}
+
static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type,
const struct bpf_prog *prog,
struct bpf_insn_access_aux *info)
@@ -6493,6 +6508,16 @@ const struct bpf_prog_ops lwt_xmit_prog_ops = {
.test_run = bpf_prog_test_run_skb,
};
+const struct bpf_verifier_ops lwt_seg6local_verifier_ops = {
+ .get_func_proto = lwt_seg6local_func_proto,
+ .is_valid_access = lwt_is_valid_access,
+ .convert_ctx_access = bpf_convert_ctx_access,
+};
+
+const struct bpf_prog_ops lwt_seg6local_prog_ops = {
+ .test_run = bpf_prog_test_run_skb,
+};
+
const struct bpf_verifier_ops cg_sock_verifier_ops = {
.get_func_proto = sock_filter_func_proto,
.is_valid_access = sock_filter_is_valid_access,
diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index ae68c1ef8fb0..2ac887da63e2 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -1,8 +1,9 @@
/*
* SR-IPv6 implementation
*
- * Author:
+ * Authors:
* David Lebrun <david.lebrun@uclouvain.be>
+ * eBPF support: Mathieu Xhonneux <m.xhonneux@gmail.com>
*
*
* This program is free software; you can redistribute it and/or
@@ -32,6 +33,7 @@
#endif
#include <net/seg6_local.h>
#include <linux/etherdevice.h>
+#include <linux/bpf.h>
struct seg6_local_lwt;
@@ -42,6 +44,11 @@ struct seg6_action_desc {
int static_headroom;
};
+struct bpf_lwt_prog {
+ struct bpf_prog *prog;
+ char *name;
+};
+
struct seg6_local_lwt {
int action;
struct ipv6_sr_hdr *srh;
@@ -50,6 +57,7 @@ struct seg6_local_lwt {
struct in6_addr nh6;
int iif;
int oif;
+ struct bpf_lwt_prog bpf;
int headroom;
struct seg6_action_desc *desc;
@@ -451,6 +459,69 @@ static int input_action_end_b6_encap(struct sk_buff *skb,
DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states);
+static int input_action_end_bpf(struct sk_buff *skb,
+ struct seg6_local_lwt *slwt)
+{
+ struct seg6_bpf_srh_state *srh_state =
+ this_cpu_ptr(&seg6_bpf_srh_states);
+ struct seg6_bpf_srh_state local_srh_state;
+ struct ipv6_sr_hdr *srh;
+ int srhoff = 0;
+ int ret;
+
+ srh = get_and_validate_srh(skb);
+ if (!srh)
+ goto drop;
+ advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
+
+ /* preempt_disable is needed to protect the per-CPU buffer srh_state,
+ * which is also accessed by the bpf_lwt_seg6_* helpers
+ */
+ preempt_disable();
+ srh_state->hdrlen = srh->hdrlen << 3;
+ srh_state->valid = 1;
+
+ rcu_read_lock();
+ bpf_compute_data_pointers(skb);
+ ret = bpf_prog_run_save_cb(slwt->bpf.prog, skb);
+ rcu_read_unlock();
+
+ local_srh_state = *srh_state;
+ preempt_enable();
+
+ switch (ret) {
+ case BPF_OK:
+ case BPF_REDIRECT:
+ break;
+ case BPF_DROP:
+ goto drop;
+ default:
+ pr_warn_once("bpf-seg6local: Illegal return value %u\n", ret);
+ goto drop;
+ }
+
+ if (unlikely((local_srh_state.hdrlen & 7) != 0))
+ goto drop;
+
+ if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
+ goto drop;
+ srh = (struct ipv6_sr_hdr *)(skb->data + srhoff);
+ srh->hdrlen = (u8)(local_srh_state.hdrlen >> 3);
+
+ if (!local_srh_state.valid &&
+ unlikely(!seg6_validate_srh(srh, (srh->hdrlen + 1) << 3)))
+ goto drop;
+
+ if (ret != BPF_REDIRECT)
+ seg6_lookup_nexthop(skb, NULL, 0);
+
+ return dst_input(skb);
+
+drop:
+ kfree_skb(skb);
+ return -EINVAL;
+}
+
static struct seg6_action_desc seg6_action_table[] = {
{
.action = SEG6_LOCAL_ACTION_END,
@@ -497,7 +568,13 @@ static struct seg6_action_desc seg6_action_table[] = {
.attrs = (1 << SEG6_LOCAL_SRH),
.input = input_action_end_b6_encap,
.static_headroom = sizeof(struct ipv6hdr),
- }
+ },
+ {
+ .action = SEG6_LOCAL_ACTION_END_BPF,
+ .attrs = (1 << SEG6_LOCAL_BPF),
+ .input = input_action_end_bpf,
+ },
+
};
static struct seg6_action_desc *__get_action_desc(int action)
@@ -542,6 +619,7 @@ static const struct nla_policy seg6_local_policy[SEG6_LOCAL_MAX + 1] = {
.len = sizeof(struct in6_addr) },
[SEG6_LOCAL_IIF] = { .type = NLA_U32 },
[SEG6_LOCAL_OIF] = { .type = NLA_U32 },
+ [SEG6_LOCAL_BPF] = { .type = NLA_NESTED },
};
static int parse_nla_srh(struct nlattr **attrs, struct seg6_local_lwt *slwt)
@@ -719,6 +797,71 @@ static int cmp_nla_oif(struct seg6_local_lwt *a, struct seg6_local_lwt *b)
return 0;
}
+#define MAX_PROG_NAME 256
+static const struct nla_policy bpf_prog_policy[LWT_BPF_PROG_MAX + 1] = {
+ [LWT_BPF_PROG_FD] = { .type = NLA_U32, },
+ [LWT_BPF_PROG_NAME] = { .type = NLA_NUL_STRING,
+ .len = MAX_PROG_NAME },
+};
+
+static int parse_nla_bpf(struct nlattr **attrs, struct seg6_local_lwt *slwt)
+{
+ struct nlattr *tb[LWT_BPF_PROG_MAX + 1];
+ struct bpf_prog *p;
+ int ret;
+ u32 fd;
+
+ ret = nla_parse_nested(tb, LWT_BPF_PROG_MAX, attrs[SEG6_LOCAL_BPF],
+ bpf_prog_policy, NULL);
+ if (ret < 0)
+ return ret;
+
+ if (!tb[LWT_BPF_PROG_FD] || !tb[LWT_BPF_PROG_NAME])
+ return -EINVAL;
+
+ slwt->bpf.name = nla_memdup(tb[LWT_BPF_PROG_NAME], GFP_KERNEL);
+ if (!slwt->bpf.name)
+ return -ENOMEM;
+
+ fd = nla_get_u32(tb[LWT_BPF_PROG_FD]);
+ p = bpf_prog_get_type(fd, BPF_PROG_TYPE_LWT_SEG6LOCAL);
+ if (IS_ERR(p))
+ return PTR_ERR(p);
+
+ slwt->bpf.prog = p;
+
+ return 0;
+}
+
+static int put_nla_bpf(struct sk_buff *skb, struct seg6_local_lwt *slwt)
+{
+ struct nlattr *nest;
+
+ if (!slwt->bpf.prog)
+ return 0;
+
+ nest = nla_nest_start(skb, SEG6_LOCAL_BPF);
+ if (!nest)
+ return -EMSGSIZE;
+
+ if (slwt->bpf.name &&
+ nla_put_string(skb, LWT_BPF_PROG_NAME, slwt->bpf.name))
+ return -EMSGSIZE;
+
+ return nla_nest_end(skb, nest);
+}
+
+static int cmp_nla_bpf(struct seg6_local_lwt *a, struct seg6_local_lwt *b)
+{
+ if (!a->bpf.name && !b->bpf.name)
+ return 0;
+
+ if (!a->bpf.name || !b->bpf.name)
+ return 1;
+
+ return strcmp(a->bpf.name, b->bpf.name);
+}
+
struct seg6_action_param {
int (*parse)(struct nlattr **attrs, struct seg6_local_lwt *slwt);
int (*put)(struct sk_buff *skb, struct seg6_local_lwt *slwt);
@@ -749,6 +892,11 @@ static struct seg6_action_param seg6_action_params[SEG6_LOCAL_MAX + 1] = {
[SEG6_LOCAL_OIF] = { .parse = parse_nla_oif,
.put = put_nla_oif,
.cmp = cmp_nla_oif },
+
+ [SEG6_LOCAL_BPF] = { .parse = parse_nla_bpf,
+ .put = put_nla_bpf,
+ .cmp = cmp_nla_bpf },
+
};
static int parse_nla_action(struct nlattr **attrs, struct seg6_local_lwt *slwt)
@@ -797,7 +945,6 @@ static int seg6_local_build_state(struct nlattr *nla, unsigned int family,
err = nla_parse_nested(tb, SEG6_LOCAL_MAX, nla, seg6_local_policy,
extack);
-
if (err < 0)
return err;
@@ -886,6 +1033,11 @@ static int seg6_local_get_encap_size(struct lwtunnel_state *lwt)
if (attrs & (1 << SEG6_LOCAL_OIF))
nlsize += nla_total_size(4);
+ if (attrs & (1 << SEG6_LOCAL_BPF))
+ nlsize += nla_total_size(sizeof(struct nlattr)) +
+ nla_total_size(MAX_PROG_NAME) +
+ nla_total_size(4);
+
return nlsize;
}
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 3dbe217bf23e..a29fed1dfce2 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1456,6 +1456,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
case BPF_PROG_TYPE_LWT_IN:
case BPF_PROG_TYPE_LWT_OUT:
case BPF_PROG_TYPE_LWT_XMIT:
+ case BPF_PROG_TYPE_LWT_SEG6LOCAL:
case BPF_PROG_TYPE_SOCK_OPS:
case BPF_PROG_TYPE_SK_SKB:
case BPF_PROG_TYPE_CGROUP_DEVICE:
--
2.16.1
^ permalink raw reply related
* [PATCH bpf-next v6 4/6] bpf: Split lwt inout verifier structures
From: Mathieu Xhonneux @ 2018-05-17 14:28 UTC (permalink / raw)
To: netdev; +Cc: daniel, dlebrun, alexei.starovoitov
In-Reply-To: <cover.1526565671.git.m.xhonneux@gmail.com>
The new bpf_lwt_push_encap helper should only be accessible within the
LWT BPF IN hook, and not the OUT one, as this may lead to a skb under
panic.
At the moment, both LWT BPF IN and OUT share the same list of helpers,
whose calls are authorized by the verifier. This patch separates the
verifier ops for the IN and OUT hooks, and allows the IN hook to call the
bpf_lwt_push_encap helper.
This patch is also the occasion to put all lwt_*_func_proto functions
together for clarity. At the moment, socks_op_func_proto is in the middle
of lwt_inout_func_proto and lwt_xmit_func_proto.
Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
---
include/linux/bpf_types.h | 4 +--
net/core/filter.c | 83 +++++++++++++++++++++++++++++------------------
2 files changed, 54 insertions(+), 33 deletions(-)
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index b67f8793de0d..aa5c8b878474 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -9,8 +9,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp)
BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SKB, cg_skb)
BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock)
BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK_ADDR, cg_sock_addr)
-BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_inout)
-BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout)
+BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_in)
+BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_out)
BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
diff --git a/net/core/filter.c b/net/core/filter.c
index 2f43d0a6ac5d..39641ea567b4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4755,33 +4755,6 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
}
}
-static const struct bpf_func_proto *
-lwt_inout_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
-{
- switch (func_id) {
- case BPF_FUNC_skb_load_bytes:
- return &bpf_skb_load_bytes_proto;
- case BPF_FUNC_skb_pull_data:
- return &bpf_skb_pull_data_proto;
- case BPF_FUNC_csum_diff:
- return &bpf_csum_diff_proto;
- case BPF_FUNC_get_cgroup_classid:
- return &bpf_get_cgroup_classid_proto;
- case BPF_FUNC_get_route_realm:
- return &bpf_get_route_realm_proto;
- case BPF_FUNC_get_hash_recalc:
- return &bpf_get_hash_recalc_proto;
- case BPF_FUNC_perf_event_output:
- return &bpf_skb_event_output_proto;
- case BPF_FUNC_get_smp_processor_id:
- return &bpf_get_smp_processor_id_proto;
- case BPF_FUNC_skb_under_cgroup:
- return &bpf_skb_under_cgroup_proto;
- default:
- return bpf_base_func_proto(func_id);
- }
-}
-
static const struct bpf_func_proto *
sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
@@ -4847,6 +4820,44 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
}
}
+static const struct bpf_func_proto *
+lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ switch (func_id) {
+ case BPF_FUNC_skb_load_bytes:
+ return &bpf_skb_load_bytes_proto;
+ case BPF_FUNC_skb_pull_data:
+ return &bpf_skb_pull_data_proto;
+ case BPF_FUNC_csum_diff:
+ return &bpf_csum_diff_proto;
+ case BPF_FUNC_get_cgroup_classid:
+ return &bpf_get_cgroup_classid_proto;
+ case BPF_FUNC_get_route_realm:
+ return &bpf_get_route_realm_proto;
+ case BPF_FUNC_get_hash_recalc:
+ return &bpf_get_hash_recalc_proto;
+ case BPF_FUNC_perf_event_output:
+ return &bpf_skb_event_output_proto;
+ case BPF_FUNC_get_smp_processor_id:
+ return &bpf_get_smp_processor_id_proto;
+ case BPF_FUNC_skb_under_cgroup:
+ return &bpf_skb_under_cgroup_proto;
+ default:
+ return bpf_base_func_proto(func_id);
+ }
+}
+
+static const struct bpf_func_proto *
+lwt_in_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ switch (func_id) {
+ case BPF_FUNC_lwt_push_encap:
+ return &bpf_lwt_push_encap_proto;
+ default:
+ return lwt_out_func_proto(func_id, prog);
+ }
+}
+
static const struct bpf_func_proto *
lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
@@ -4878,7 +4889,7 @@ lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
case BPF_FUNC_set_hash_invalid:
return &bpf_set_hash_invalid_proto;
default:
- return lwt_inout_func_proto(func_id, prog);
+ return lwt_out_func_proto(func_id, prog);
}
}
@@ -6451,13 +6462,23 @@ const struct bpf_prog_ops cg_skb_prog_ops = {
.test_run = bpf_prog_test_run_skb,
};
-const struct bpf_verifier_ops lwt_inout_verifier_ops = {
- .get_func_proto = lwt_inout_func_proto,
+const struct bpf_verifier_ops lwt_in_verifier_ops = {
+ .get_func_proto = lwt_in_func_proto,
+ .is_valid_access = lwt_is_valid_access,
+ .convert_ctx_access = bpf_convert_ctx_access,
+};
+
+const struct bpf_prog_ops lwt_in_prog_ops = {
+ .test_run = bpf_prog_test_run_skb,
+};
+
+const struct bpf_verifier_ops lwt_out_verifier_ops = {
+ .get_func_proto = lwt_out_func_proto,
.is_valid_access = lwt_is_valid_access,
.convert_ctx_access = bpf_convert_ctx_access,
};
-const struct bpf_prog_ops lwt_inout_prog_ops = {
+const struct bpf_prog_ops lwt_out_prog_ops = {
.test_run = bpf_prog_test_run_skb,
};
--
2.16.1
^ permalink raw reply related
* [PATCH bpf-next v6 3/6] bpf: Add IPv6 Segment Routing helpers
From: Mathieu Xhonneux @ 2018-05-17 14:28 UTC (permalink / raw)
To: netdev; +Cc: daniel, dlebrun, alexei.starovoitov
In-Reply-To: <cover.1526565671.git.m.xhonneux@gmail.com>
The BPF seg6local hook should be powerful enough to enable users to
implement most of the use-cases one could think of. After some thinking,
we figured out that the following actions should be possible on a SRv6
packet, requiring 3 specific helpers :
- bpf_lwt_seg6_store_bytes: Modify non-sensitive fields of the SRH
- bpf_lwt_seg6_adjust_srh: Allow to grow or shrink a SRH
(to add/delete TLVs)
- bpf_lwt_seg6_action: Apply some SRv6 network programming actions
(specifically End.X, End.T, End.B6 and
End.B6.Encap)
The specifications of these helpers are provided in the patch (see
include/uapi/linux/bpf.h).
The non-sensitive fields of the SRH are the following : flags, tag and
TLVs. The other fields can not be modified, to maintain the SRH
integrity. Flags, tag and TLVs can easily be modified as their validity
can be checked afterwards via seg6_validate_srh. It is not allowed to
modify the segments directly. If one wants to add segments on the path,
he should stack a new SRH using the End.B6 action via
bpf_lwt_seg6_action.
Growing, shrinking or editing TLVs via the helpers will flag the SRH as
invalid, and it will have to be re-validated before re-entering the IPv6
layer. This flag is stored in a per-CPU buffer, along with the current
header length in bytes.
Storing the SRH len in bytes in the control block is mandatory when using
bpf_lwt_seg6_adjust_srh. The Header Ext. Length field contains the SRH
len rounded to 8 bytes (a padding TLV can be inserted to ensure the 8-bytes
boundary). When adding/deleting TLVs within the BPF program, the SRH may
temporary be in an invalid state where its length cannot be rounded to 8
bytes without remainder, hence the need to store the length in bytes
separately. The caller of the BPF program can then ensure that the SRH's
final length is valid using this value. Again, a final SRH modified by a
BPF program which doesn’t respect the 8-bytes boundary will be discarded
as it will be considered as invalid.
Finally, a fourth helper is provided, bpf_lwt_push_encap, which is
available from the LWT BPF IN hook, but not from the seg6local BPF one.
This helper allows to encapsulate a Segment Routing Header (either with
a new outer IPv6 header, or by inlining it directly in the existing IPv6
header) into a non-SRv6 packet. This helper is required if we want to
offer the possibility to dynamically encapsulate a SRH for non-SRv6 packet,
as the BPF seg6local hook only works on traffic already containing a SRH.
This is the BPF equivalent of the seg6 LWT infrastructure, which achieves
the same purpose but with a static SRH per route.
These helpers require CONFIG_IPV6=y (and not =m).
Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
---
include/net/seg6_local.h | 8 ++
include/uapi/linux/bpf.h | 96 +++++++++++++++-
net/core/filter.c | 285 +++++++++++++++++++++++++++++++++++++++++++----
net/ipv6/Kconfig | 5 +
net/ipv6/seg6_local.c | 2 +
5 files changed, 372 insertions(+), 24 deletions(-)
diff --git a/include/net/seg6_local.h b/include/net/seg6_local.h
index 57498b23085d..661fd5b4d3e0 100644
--- a/include/net/seg6_local.h
+++ b/include/net/seg6_local.h
@@ -15,10 +15,18 @@
#ifndef _NET_SEG6_LOCAL_H
#define _NET_SEG6_LOCAL_H
+#include <linux/percpu.h>
#include <linux/net.h>
#include <linux/ipv6.h>
extern int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
u32 tbl_id);
+struct seg6_bpf_srh_state {
+ bool valid;
+ u16 hdrlen;
+};
+
+DECLARE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states);
+
#endif
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d94d333a8225..37f098ca822b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1902,6 +1902,90 @@ union bpf_attr {
* egress otherwise). This is the only flag supported for now.
* Return
* **SK_PASS** on success, or **SK_DROP** on error.
+ *
+ * int bpf_lwt_push_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
+ * Description
+ * Encapsulate the packet associated to *skb* within a Layer 3
+ * protocol header. This header is provided in the buffer at
+ * address *hdr*, with *len* its size in bytes. *type* indicates
+ * the protocol of the header and can be one of:
+ *
+ * **BPF_LWT_ENCAP_SEG6**
+ * IPv6 encapsulation with Segment Routing Header
+ * (**struct ipv6_sr_hdr**). *hdr* only contains the SRH,
+ * the IPv6 header is computed by the kernel.
+ * **BPF_LWT_ENCAP_SEG6_INLINE**
+ * Only works if *skb* contains an IPv6 packet. Insert a
+ * Segment Routing Header (**struct ipv6_sr_hdr**) inside
+ * the IPv6 header.
+ *
+ * A call to this helper is susceptible to change the underlaying
+ * packet buffer. Therefore, at load time, all checks on pointers
+ * previously done by the verifier are invalidated and must be
+ * performed again, if the helper is used in combination with
+ * direct packet access.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_lwt_seg6_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len)
+ * Description
+ * Store *len* bytes from address *from* into the packet
+ * associated to *skb*, at *offset*. Only the flags, tag and TLVs
+ * inside the outermost IPv6 Segment Routing Header can be
+ * modified through this helper.
+ *
+ * A call to this helper is susceptible to change the underlaying
+ * packet buffer. Therefore, at load time, all checks on pointers
+ * previously done by the verifier are invalidated and must be
+ * performed again, if the helper is used in combination with
+ * direct packet access.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_lwt_seg6_adjust_srh(struct sk_buff *skb, u32 offset, s32 delta)
+ * Description
+ * Adjust the size allocated to TLVs in the outermost IPv6
+ * Segment Routing Header contained in the packet associated to
+ * *skb*, at position *offset* by *delta* bytes. Only offsets
+ * after the segments are accepted. *delta* can be as well
+ * positive (growing) as negative (shrinking).
+ *
+ * A call to this helper is susceptible to change the underlaying
+ * packet buffer. Therefore, at load time, all checks on pointers
+ * previously done by the verifier are invalidated and must be
+ * performed again, if the helper is used in combination with
+ * direct packet access.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_lwt_seg6_action(struct sk_buff *skb, u32 action, void *param, u32 param_len)
+ * Description
+ * Apply an IPv6 Segment Routing action of type *action* to the
+ * packet associated to *skb*. Each action takes a parameter
+ * contained at address *param*, and of length *param_len* bytes.
+ * *action* can be one of:
+ *
+ * **SEG6_LOCAL_ACTION_END_X**
+ * End.X action: Endpoint with Layer-3 cross-connect.
+ * Type of *param*: **struct in6_addr**.
+ * **SEG6_LOCAL_ACTION_END_T**
+ * End.T action: Endpoint with specific IPv6 table lookup.
+ * Type of *param*: **int**.
+ * **SEG6_LOCAL_ACTION_END_B6**
+ * End.B6 action: Endpoint bound to an SRv6 policy.
+ * Type of param: **struct ipv6_sr_hdr**.
+ * **SEG6_LOCAL_ACTION_END_B6_ENCAP**
+ * End.B6.Encap action: Endpoint bound to an SRv6
+ * encapsulation policy.
+ * Type of param: **struct ipv6_sr_hdr**.
+ *
+ * A call to this helper is susceptible to change the underlaying
+ * packet buffer. Therefore, at load time, all checks on pointers
+ * previously done by the verifier are invalidated and must be
+ * performed again, if the helper is used in combination with
+ * direct packet access.
+ * Return
+ * 0 on success, or a negative error in case of failure.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -1976,7 +2060,11 @@ union bpf_attr {
FN(fib_lookup), \
FN(sock_hash_update), \
FN(msg_redirect_hash), \
- FN(sk_redirect_hash),
+ FN(sk_redirect_hash), \
+ FN(lwt_push_encap), \
+ FN(lwt_seg6_store_bytes), \
+ FN(lwt_seg6_adjust_srh), \
+ FN(lwt_seg6_action),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
@@ -2043,6 +2131,12 @@ enum bpf_hdr_start_off {
BPF_HDR_START_NET,
};
+/* Encapsulation type for BPF_FUNC_lwt_push_encap helper. */
+enum bpf_lwt_encap_mode {
+ BPF_LWT_ENCAP_SEG6,
+ BPF_LWT_ENCAP_SEG6_INLINE
+};
+
/* user accessible mirror of in-kernel sk_buff.
* new fields can only be added to the end of this structure
*/
diff --git a/net/core/filter.c b/net/core/filter.c
index 6d0d1560bd70..2f43d0a6ac5d 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -64,6 +64,10 @@
#include <net/ip_fib.h>
#include <net/flow.h>
#include <net/arp.h>
+#include <net/ipv6.h>
+#include <linux/seg6_local.h>
+#include <net/seg6.h>
+#include <net/seg6_local.h>
/**
* sk_filter_trim_cap - run a packet through a socket filter
@@ -3363,28 +3367,6 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
.arg3_type = ARG_ANYTHING,
};
-bool bpf_helper_changes_pkt_data(void *func)
-{
- if (func == bpf_skb_vlan_push ||
- func == bpf_skb_vlan_pop ||
- func == bpf_skb_store_bytes ||
- func == bpf_skb_change_proto ||
- func == bpf_skb_change_head ||
- func == bpf_skb_change_tail ||
- func == bpf_skb_adjust_room ||
- func == bpf_skb_pull_data ||
- func == bpf_clone_redirect ||
- func == bpf_l3_csum_replace ||
- func == bpf_l4_csum_replace ||
- func == bpf_xdp_adjust_head ||
- func == bpf_xdp_adjust_meta ||
- func == bpf_msg_pull_data ||
- func == bpf_xdp_adjust_tail)
- return true;
-
- return false;
-}
-
static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
unsigned long off, unsigned long len)
{
@@ -4332,6 +4314,264 @@ static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
.arg4_type = ARG_ANYTHING,
};
+#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
+static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
+{
+ int err;
+ struct ipv6_sr_hdr *srh = (struct ipv6_sr_hdr *)hdr;
+
+ if (!seg6_validate_srh(srh, len))
+ return -EINVAL;
+
+ switch (type) {
+ case BPF_LWT_ENCAP_SEG6_INLINE:
+ if (skb->protocol != htons(ETH_P_IPV6))
+ return -EBADMSG;
+
+ err = seg6_do_srh_inline(skb, srh);
+ break;
+ case BPF_LWT_ENCAP_SEG6:
+ skb_reset_inner_headers(skb);
+ skb->encapsulation = 1;
+ err = seg6_do_srh_encap(skb, srh, IPPROTO_IPV6);
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ bpf_compute_data_pointers(skb);
+ if (err)
+ return err;
+
+ ipv6_hdr(skb)->payload_len = htons(skb->len - sizeof(struct ipv6hdr));
+ skb_set_transport_header(skb, sizeof(struct ipv6hdr));
+
+ return seg6_lookup_nexthop(skb, NULL, 0);
+}
+#endif /* CONFIG_IPV6_SEG6_BPF */
+
+BPF_CALL_4(bpf_lwt_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
+ u32, len)
+{
+ switch (type) {
+#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
+ case BPF_LWT_ENCAP_SEG6:
+ case BPF_LWT_ENCAP_SEG6_INLINE:
+ return bpf_push_seg6_encap(skb, type, hdr, len);
+#endif
+ default:
+ return -EINVAL;
+ }
+}
+
+static const struct bpf_func_proto bpf_lwt_push_encap_proto = {
+ .func = bpf_lwt_push_encap,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_ANYTHING,
+ .arg3_type = ARG_PTR_TO_MEM,
+ .arg4_type = ARG_CONST_SIZE
+};
+
+BPF_CALL_4(bpf_lwt_seg6_store_bytes, struct sk_buff *, skb, u32, offset,
+ const void *, from, u32, len)
+{
+#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
+ struct seg6_bpf_srh_state *srh_state =
+ this_cpu_ptr(&seg6_bpf_srh_states);
+ void *srh_tlvs, *srh_end, *ptr;
+ struct ipv6_sr_hdr *srh;
+ int srhoff = 0;
+
+ if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
+ return -EINVAL;
+
+ srh = (struct ipv6_sr_hdr *)(skb->data + srhoff);
+ srh_tlvs = (void *)((char *)srh + ((srh->first_segment + 1) << 4));
+ srh_end = (void *)((char *)srh + sizeof(*srh) + srh_state->hdrlen);
+
+ ptr = skb->data + offset;
+ if (ptr >= srh_tlvs && ptr + len <= srh_end)
+ srh_state->valid = 0;
+ else if (ptr < (void *)&srh->flags ||
+ ptr + len > (void *)&srh->segments)
+ return -EFAULT;
+
+ if (unlikely(bpf_try_make_writable(skb, offset + len)))
+ return -EFAULT;
+
+ memcpy(skb->data + offset, from, len);
+ return 0;
+#else /* CONFIG_IPV6_SEG6_BPF */
+ return -EOPNOTSUPP;
+#endif
+}
+
+static const struct bpf_func_proto bpf_lwt_seg6_store_bytes_proto = {
+ .func = bpf_lwt_seg6_store_bytes,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_ANYTHING,
+ .arg3_type = ARG_PTR_TO_MEM,
+ .arg4_type = ARG_CONST_SIZE
+};
+
+BPF_CALL_4(bpf_lwt_seg6_action, struct sk_buff *, skb,
+ u32, action, void *, param, u32, param_len)
+{
+#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
+ struct seg6_bpf_srh_state *srh_state =
+ this_cpu_ptr(&seg6_bpf_srh_states);
+ struct ipv6_sr_hdr *srh;
+ int srhoff = 0;
+ int err;
+
+ if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
+ return -EINVAL;
+ srh = (struct ipv6_sr_hdr *)(skb->data + srhoff);
+
+ if (!srh_state->valid) {
+ if (unlikely((srh_state->hdrlen & 7) != 0))
+ return -EBADMSG;
+
+ srh->hdrlen = (u8)(srh_state->hdrlen >> 3);
+ if (unlikely(!seg6_validate_srh(srh, (srh->hdrlen + 1) << 3)))
+ return -EBADMSG;
+
+ srh_state->valid = 1;
+ }
+
+ switch (action) {
+ case SEG6_LOCAL_ACTION_END_X:
+ if (param_len != sizeof(struct in6_addr))
+ return -EINVAL;
+ return seg6_lookup_nexthop(skb, (struct in6_addr *)param, 0);
+ case SEG6_LOCAL_ACTION_END_T:
+ if (param_len != sizeof(int))
+ return -EINVAL;
+ return seg6_lookup_nexthop(skb, NULL, *(int *)param);
+ case SEG6_LOCAL_ACTION_END_B6:
+ err = bpf_push_seg6_encap(skb, BPF_LWT_ENCAP_SEG6_INLINE,
+ param, param_len);
+ if (!err)
+ srh_state->hdrlen =
+ ((struct ipv6_sr_hdr *)param)->hdrlen << 3;
+ return err;
+ case SEG6_LOCAL_ACTION_END_B6_ENCAP:
+ err = bpf_push_seg6_encap(skb, BPF_LWT_ENCAP_SEG6,
+ param, param_len);
+ if (!err)
+ srh_state->hdrlen =
+ ((struct ipv6_sr_hdr *)param)->hdrlen << 3;
+ return err;
+ default:
+ return -EINVAL;
+ }
+#else /* CONFIG_IPV6_SEG6_BPF */
+ return -EOPNOTSUPP;
+#endif
+}
+
+static const struct bpf_func_proto bpf_lwt_seg6_action_proto = {
+ .func = bpf_lwt_seg6_action,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_ANYTHING,
+ .arg3_type = ARG_PTR_TO_MEM,
+ .arg4_type = ARG_CONST_SIZE
+};
+
+BPF_CALL_3(bpf_lwt_seg6_adjust_srh, struct sk_buff *, skb, u32, offset,
+ s32, len)
+{
+#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
+ struct seg6_bpf_srh_state *srh_state =
+ this_cpu_ptr(&seg6_bpf_srh_states);
+ void *srh_end, *srh_tlvs, *ptr;
+ struct ipv6_sr_hdr *srh;
+ struct ipv6hdr *hdr;
+ int srhoff = 0;
+ int ret;
+
+ if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
+ return -EINVAL;
+ srh = (struct ipv6_sr_hdr *)(skb->data + srhoff);
+
+ srh_tlvs = (void *)((unsigned char *)srh + sizeof(*srh) +
+ ((srh->first_segment + 1) << 4));
+ srh_end = (void *)((unsigned char *)srh + sizeof(*srh) +
+ srh_state->hdrlen);
+ ptr = skb->data + offset;
+
+ if (unlikely(ptr < srh_tlvs || ptr > srh_end))
+ return -EFAULT;
+ if (unlikely(len < 0 && (void *)((char *)ptr - len) > srh_end))
+ return -EFAULT;
+
+ if (len > 0) {
+ ret = skb_cow_head(skb, len);
+ if (unlikely(ret < 0))
+ return ret;
+
+ ret = bpf_skb_net_hdr_push(skb, offset, len);
+ } else {
+ ret = bpf_skb_net_hdr_pop(skb, offset, -1 * len);
+ }
+
+ bpf_compute_data_pointers(skb);
+ if (unlikely(ret < 0))
+ return ret;
+
+ hdr = (struct ipv6hdr *)skb->data;
+ hdr->payload_len = htons(skb->len - sizeof(struct ipv6hdr));
+
+ srh_state->hdrlen += len;
+ srh_state->valid = 0;
+ return 0;
+#else /* CONFIG_IPV6_SEG6_BPF */
+ return -EOPNOTSUPP;
+#endif
+}
+
+static const struct bpf_func_proto bpf_lwt_seg6_adjust_srh_proto = {
+ .func = bpf_lwt_seg6_adjust_srh,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_ANYTHING,
+ .arg3_type = ARG_ANYTHING,
+};
+
+bool bpf_helper_changes_pkt_data(void *func)
+{
+ if (func == bpf_skb_vlan_push ||
+ func == bpf_skb_vlan_pop ||
+ func == bpf_skb_store_bytes ||
+ func == bpf_skb_change_proto ||
+ func == bpf_skb_change_head ||
+ func == bpf_skb_change_tail ||
+ func == bpf_skb_adjust_room ||
+ func == bpf_skb_pull_data ||
+ func == bpf_clone_redirect ||
+ func == bpf_l3_csum_replace ||
+ func == bpf_l4_csum_replace ||
+ func == bpf_xdp_adjust_head ||
+ func == bpf_xdp_adjust_meta ||
+ func == bpf_msg_pull_data ||
+ func == bpf_xdp_adjust_tail ||
+ func == bpf_lwt_push_encap ||
+ func == bpf_lwt_seg6_store_bytes ||
+ func == bpf_lwt_seg6_adjust_srh ||
+ func == bpf_lwt_seg6_action
+ )
+ return true;
+
+ return false;
+}
+
static const struct bpf_func_proto *
bpf_base_func_proto(enum bpf_func_id func_id)
{
@@ -4746,7 +4986,6 @@ static bool lwt_is_valid_access(int off, int size,
return bpf_skb_is_valid_access(off, size, type, prog, info);
}
-
/* Attach type specific accesses */
static bool __sock_filter_check_attach_type(int off,
enum bpf_access_type access_type,
diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index 11e4e80cf7e9..0eff75525da1 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -329,4 +329,9 @@ config IPV6_SEG6_HMAC
If unsure, say N.
+config IPV6_SEG6_BPF
+ def_bool y
+ depends on IPV6_SEG6_LWTUNNEL
+ depends on IPV6 = y
+
endif # IPV6
diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index e9b23fb924ad..ae68c1ef8fb0 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -449,6 +449,8 @@ static int input_action_end_b6_encap(struct sk_buff *skb,
return err;
}
+DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states);
+
static struct seg6_action_desc seg6_action_table[] = {
{
.action = SEG6_LOCAL_ACTION_END,
--
2.16.1
^ permalink raw reply related
* [PATCH bpf-next v6 2/6] ipv6: sr: export function lookup_nexthop
From: Mathieu Xhonneux @ 2018-05-17 14:28 UTC (permalink / raw)
To: netdev; +Cc: daniel, dlebrun, alexei.starovoitov
In-Reply-To: <cover.1526565671.git.m.xhonneux@gmail.com>
The function lookup_nexthop is essential to implement most of the seg6local
actions. As we want to provide a BPF helper allowing to apply some of these
actions on the packet being processed, the helper should be able to call
this function, hence the need to make it public.
Moreover, if one argument is incorrect or if the next hop can not be found,
an error should be returned by the BPF helper so the BPF program can adapt
its processing of the packet (return an error, properly force the drop,
...). This patch hence makes this function return dst->error to indicate a
possible error.
Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
---
include/net/seg6.h | 3 ++-
include/net/seg6_local.h | 24 ++++++++++++++++++++++++
net/ipv6/seg6_local.c | 20 +++++++++++---------
3 files changed, 37 insertions(+), 10 deletions(-)
create mode 100644 include/net/seg6_local.h
diff --git a/include/net/seg6.h b/include/net/seg6.h
index 70b4cfac52d7..e029e301faa5 100644
--- a/include/net/seg6.h
+++ b/include/net/seg6.h
@@ -67,5 +67,6 @@ extern bool seg6_validate_srh(struct ipv6_sr_hdr *srh, int len);
extern int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh,
int proto);
extern int seg6_do_srh_inline(struct sk_buff *skb, struct ipv6_sr_hdr *osrh);
-
+extern int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
+ u32 tbl_id);
#endif
diff --git a/include/net/seg6_local.h b/include/net/seg6_local.h
new file mode 100644
index 000000000000..57498b23085d
--- /dev/null
+++ b/include/net/seg6_local.h
@@ -0,0 +1,24 @@
+/*
+ * SR-IPv6 implementation
+ *
+ * Authors:
+ * David Lebrun <david.lebrun@uclouvain.be>
+ * eBPF support: Mathieu Xhonneux <m.xhonneux@gmail.com>
+ *
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _NET_SEG6_LOCAL_H
+#define _NET_SEG6_LOCAL_H
+
+#include <linux/net.h>
+#include <linux/ipv6.h>
+
+extern int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
+ u32 tbl_id);
+
+#endif
diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index 45722327375a..e9b23fb924ad 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -30,6 +30,7 @@
#ifdef CONFIG_IPV6_SEG6_HMAC
#include <net/seg6_hmac.h>
#endif
+#include <net/seg6_local.h>
#include <linux/etherdevice.h>
struct seg6_local_lwt;
@@ -140,8 +141,8 @@ static void advance_nextseg(struct ipv6_sr_hdr *srh, struct in6_addr *daddr)
*daddr = *addr;
}
-static void lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
- u32 tbl_id)
+int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
+ u32 tbl_id)
{
struct net *net = dev_net(skb->dev);
struct ipv6hdr *hdr = ipv6_hdr(skb);
@@ -187,6 +188,7 @@ static void lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
skb_dst_drop(skb);
skb_dst_set(skb, dst);
+ return dst->error;
}
/* regular endpoint function */
@@ -200,7 +202,7 @@ static int input_action_end(struct sk_buff *skb, struct seg6_local_lwt *slwt)
advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
- lookup_nexthop(skb, NULL, 0);
+ seg6_lookup_nexthop(skb, NULL, 0);
return dst_input(skb);
@@ -220,7 +222,7 @@ static int input_action_end_x(struct sk_buff *skb, struct seg6_local_lwt *slwt)
advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
- lookup_nexthop(skb, &slwt->nh6, 0);
+ seg6_lookup_nexthop(skb, &slwt->nh6, 0);
return dst_input(skb);
@@ -239,7 +241,7 @@ static int input_action_end_t(struct sk_buff *skb, struct seg6_local_lwt *slwt)
advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
- lookup_nexthop(skb, NULL, slwt->table);
+ seg6_lookup_nexthop(skb, NULL, slwt->table);
return dst_input(skb);
@@ -331,7 +333,7 @@ static int input_action_end_dx6(struct sk_buff *skb,
if (!ipv6_addr_any(&slwt->nh6))
nhaddr = &slwt->nh6;
- lookup_nexthop(skb, nhaddr, 0);
+ seg6_lookup_nexthop(skb, nhaddr, 0);
return dst_input(skb);
drop:
@@ -380,7 +382,7 @@ static int input_action_end_dt6(struct sk_buff *skb,
if (!pskb_may_pull(skb, sizeof(struct ipv6hdr)))
goto drop;
- lookup_nexthop(skb, NULL, slwt->table);
+ seg6_lookup_nexthop(skb, NULL, slwt->table);
return dst_input(skb);
@@ -406,7 +408,7 @@ static int input_action_end_b6(struct sk_buff *skb, struct seg6_local_lwt *slwt)
ipv6_hdr(skb)->payload_len = htons(skb->len - sizeof(struct ipv6hdr));
skb_set_transport_header(skb, sizeof(struct ipv6hdr));
- lookup_nexthop(skb, NULL, 0);
+ seg6_lookup_nexthop(skb, NULL, 0);
return dst_input(skb);
@@ -438,7 +440,7 @@ static int input_action_end_b6_encap(struct sk_buff *skb,
ipv6_hdr(skb)->payload_len = htons(skb->len - sizeof(struct ipv6hdr));
skb_set_transport_header(skb, sizeof(struct ipv6hdr));
- lookup_nexthop(skb, NULL, 0);
+ seg6_lookup_nexthop(skb, NULL, 0);
return dst_input(skb);
--
2.16.1
^ permalink raw reply related
* [PATCH bpf-next v6 1/6] ipv6: sr: make seg6.h includable without IPv6
From: Mathieu Xhonneux @ 2018-05-17 14:28 UTC (permalink / raw)
To: netdev; +Cc: daniel, dlebrun, alexei.starovoitov
In-Reply-To: <cover.1526565671.git.m.xhonneux@gmail.com>
include/net/seg6.h cannot be included in a source file if CONFIG_IPV6 is
not enabled:
include/net/seg6.h: In function 'seg6_pernet':
>> include/net/seg6.h:52:14: error: 'struct net' has no member named
'ipv6'; did you mean 'ipv4'?
return net->ipv6.seg6_data;
^~~~
ipv4
This commit makes seg6_pernet return NULL if IPv6 is not compiled, hence
allowing seg6.h to be included regardless of the configuration.
Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
---
include/net/seg6.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/include/net/seg6.h b/include/net/seg6.h
index 099bad59dc90..70b4cfac52d7 100644
--- a/include/net/seg6.h
+++ b/include/net/seg6.h
@@ -49,7 +49,11 @@ struct seg6_pernet_data {
static inline struct seg6_pernet_data *seg6_pernet(struct net *net)
{
+#if IS_ENABLED(CONFIG_IPV6)
return net->ipv6.seg6_data;
+#else
+ return NULL;
+#endif
}
extern int seg6_init(void);
--
2.16.1
^ permalink raw reply related
* [PATCH bpf-next v6 0/6] ipv6: sr: introduce seg6local End.BPF action
From: Mathieu Xhonneux @ 2018-05-17 14:28 UTC (permalink / raw)
To: netdev; +Cc: daniel, dlebrun, alexei.starovoitov
As of Linux 4.14, it is possible to define advanced local processing for
IPv6 packets with a Segment Routing Header through the seg6local LWT
infrastructure. This LWT implements the network programming principles
defined in the IETF “SRv6 Network Programming” draft.
The implemented operations are generic, and it would be very interesting to
be able to implement user-specific seg6local actions, without having to
modify the kernel directly. To do so, this patchset adds an End.BPF action
to seg6local, powered by some specific Segment Routing-related helpers,
which provide SR functionalities that can be applied on the packet. This
BPF hook would then allow to implement specific actions at native kernel
speed such as OAM features, advanced SR SDN policies, SRv6 actions like
Segment Routing Header (SRH) encapsulation depending on the content of
the packet, etc.
This patchset is divided in 6 patches, whose main features are :
- A new seg6local action End.BPF with the corresponding new BPF program
type BPF_PROG_TYPE_LWT_SEG6LOCAL. Such attached BPF program can be
passed to the LWT seg6local through netlink, the same way as the LWT
BPF hook operates.
- 3 new BPF helpers for the seg6local BPF hook, allowing to edit/grow/
shrink a SRH and apply on a packet some of the generic SRv6 actions.
- 1 new BPF helper for the LWT BPF IN hook, allowing to add a SRH through
encapsulation (via IPv6 encapsulation or inlining if the packet contains
already an IPv6 header).
As this patchset adds a new LWT BPF hook, I took into account the result of
the discussions when the LWT BPF infrastructure got merged. Hence, the
seg6local BPF hook doesn’t allow write access to skb->data directly, only
the SRH can be modified through specific helpers, which ensures that the
integrity of the packet is maintained.
More details are available in the related patches messages.
The performances of this BPF hook have been assessed with the BPF JIT
enabled on a Intel Xeon X3440 processors with 4 cores and 8 threads
clocked at 2.53 GHz. No throughput losses are noted with the seg6local
BPF hook when the BPF program does nothing (440kpps). Adding a 8-bytes
TLV (1 call each to bpf_lwt_seg6_adjust_srh and bpf_lwt_seg6_store_bytes)
drops the throughput to 410kpps, and inlining a SRH via
bpf_lwt_seg6_action drops the throughput to 420kpps.
All throughputs are stable.
-------
v2: move the SRH integrity state from skb->cb to a per-cpu buffer
v3: - document helpers in man-page style
- fix kbuild bugs
- un-break BPF LWT out hook
- bpf_push_seg6_encap is now static
- preempt_enable is now called when the packet is dropped in
input_action_end_bpf
v4: fix kbuild bugs when CONFIG_IPV6=m
v5: fix kbuild sparse warnings when CONFIG_IPV6=m
v6: fix skb pointers-related bugs in helpers
Thanks.
Mathieu Xhonneux (6):
ipv6: sr: make seg6.h includable without IPv6
ipv6: sr: export function lookup_nexthop
bpf: Add IPv6 Segment Routing helpers
bpf: Split lwt inout verifier structures
ipv6: sr: Add seg6local action End.BPF
selftests/bpf: test for seg6local End.BPF action
include/linux/bpf_types.h | 5 +-
include/net/seg6.h | 7 +-
include/net/seg6_local.h | 32 ++
include/uapi/linux/bpf.h | 97 ++++-
include/uapi/linux/seg6_local.h | 3 +
kernel/bpf/verifier.c | 1 +
net/core/filter.c | 393 ++++++++++++++++---
net/ipv6/Kconfig | 5 +
net/ipv6/seg6_local.c | 180 ++++++++-
tools/include/uapi/linux/bpf.h | 97 ++++-
tools/lib/bpf/libbpf.c | 1 +
tools/testing/selftests/bpf/Makefile | 6 +-
tools/testing/selftests/bpf/bpf_helpers.h | 12 +
tools/testing/selftests/bpf/test_lwt_seg6local.c | 438 ++++++++++++++++++++++
tools/testing/selftests/bpf/test_lwt_seg6local.sh | 140 +++++++
15 files changed, 1344 insertions(+), 73 deletions(-)
create mode 100644 include/net/seg6_local.h
create mode 100644 tools/testing/selftests/bpf/test_lwt_seg6local.c
create mode 100755 tools/testing/selftests/bpf/test_lwt_seg6local.sh
--
2.16.1
^ permalink raw reply
* Re: [patch net-next] nfp: flower: set sysfs link to device for representors
From: Or Gerlitz @ 2018-05-17 12:25 UTC (permalink / raw)
To: Jiri Pirko
Cc: Linux Netdev List, David Miller, Jakub Kicinski, Simon Horman,
Dirk van der Merwe, John Hurley, Pieter Jansen van Vuuren,
oss-drivers
In-Reply-To: <20180517100520.23971-1-jiri@resnulli.us>
On Thu, May 17, 2018 at 1:05 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> From: Jiri Pirko <jiri@mellanox.com>
>
> Do this so the sysfs has "device" link correctly set.
please no
This is likely to create bunch of issues with respect to how libvirt
deals with the representors.
We were discussing it off list between nfp and mlnx driver people. We
need to put
the open stack folks and kernel developers into the same thread. I can
make a post
on that next week
^ permalink raw reply
* Re: [PATCH net-next v3 00/10] net: mvpp2: phylink conversion
From: Gregory CLEMENT @ 2018-05-17 12:23 UTC (permalink / raw)
To: Antoine Tenart
Cc: davem, kishon, linux, andrew, jason, sebastian.hesselbarth,
netdev, linux-kernel, thomas.petazzoni, maxime.chevallier,
miquel.raynal, nadavh, stefanc, ymarkman, mw, linux-arm-kernel
In-Reply-To: <20180517082939.14598-1-antoine.tenart@bootlin.com>
Hi Antoine,
On jeu., mai 17 2018, Antoine Tenart <antoine.tenart@bootlin.com> wrote:
> Hi Dave, Russell,
>
> This series convert the Marvell PPv2 driver to phylink (models the MAC
> to PHY link).
>
> One important point is the PPv2 driver supports two probe modes: device
> tree and ACPI. This series only brings phylink support for the device
> tree mode, as the ACPI one will need further work. Still, the driver
> should be working as before when using ACPI. This split should be
> temporary, and was discussed with Marcin (in Cc.) who added ACPI support
> to the driver.
>
> Also as the SFP cages on both DB boards can be considered as non-wired.
> We thus chose not to describe those SFP cages and we use fixed-link.
>
> The rest of the series uses phylink to add support for 1000BaseX and
> 2500BaseX modes in the PPv2 driver. To do this, two patches are needed
> in the common PHY framework (patches 3 and 4). The last 4 patches modify
> the device tree to use the new PPv2 functionalities.
>
> The series has been tested for the device tree mode on the 7040-db,
> 8040-db and 8040-mcbin boards, to ensure all the interface where working
> as expected.
>
> @Dave: patches 7 to 10 should go through the mvebu tree (Gregory in
> Cc.) to avoid any conflict with the other mvebu dt patches taken during
> this cycle.
Patches 7 to 10 have been applied on mvebu/dt64.
Thanks,
Gregory
>
> The series is based on today's net-next.
>
> Thanks!
> Antoine
>
> Since v2:
> - Removed the SFP description from the DB boards, as their SFP cages
> are wired properly. We now use fixed-link.
> - Because of this rework, split the series in two, so that the SFP
> part is reviewed separately.
> - Small fixes in the phylink patch.
> - Rebased on the latest net-next branch.
>
> Since v1:
> - Chose a different approach to the SFP changes, as the previous ones
> weren't valid and reworked both BD boards device trees.
> - Misc fixes.
> - Added Kishon's acked-by on one patch.
> - Rebaed on latest net-next branch.
>
> Antoine Tenart (9):
> net: mvpp2: align the ethtool ops definition
> net: mvpp2: phylink support
> phy: add 2.5G SGMII mode to the phy_mode enum
> phy: cp110-comphy: 2.5G SGMII mode
> net: mvpp2: 1000baseX support
> net: mvpp2: 2500baseX support
> arm64: dts: marvell: mcbin: enable the fourth network interface
> arm64: dts: marvell: 8040-db: describe the 10G interfaces as
> fixed-link
> arm64: dts: marvell: 7040-db: describe the 10G interface as fixed-link
>
> Russell King (1):
> arm64: dts: marvell: mcbin: add 10G SFP support
>
> .../arm64/boot/dts/marvell/armada-7040-db.dts | 5 +
> .../arm64/boot/dts/marvell/armada-8040-db.dts | 10 +
> .../boot/dts/marvell/armada-8040-mcbin.dts | 70 ++
> drivers/net/ethernet/marvell/Kconfig | 1 +
> drivers/net/ethernet/marvell/mvpp2.c | 931 +++++++++++-------
> drivers/phy/marvell/phy-mvebu-cp110-comphy.c | 17 +-
> include/linux/phy/phy.h | 1 +
> 7 files changed, 680 insertions(+), 355 deletions(-)
>
> --
> 2.17.0
>
--
Gregory Clement, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
http://bootlin.com
^ permalink raw reply
* Re: [PATCH net-next] net/smc: init conn.tx_work & conn.send_lock sooner
From: Eric Dumazet @ 2018-05-17 12:20 UTC (permalink / raw)
To: ubraun; +Cc: David Miller, netdev, Eric Dumazet, linux-s390
In-Reply-To: <63bf1ebb-4a6f-43b8-1ee3-ced650e1dfdf@linux.ibm.com>
On Thu, May 17, 2018 at 5:13 AM Ursula Braun <ubraun@linux.ibm.com> wrote:
> This problem should no longer show up with yesterday's net-next commit
> 569bc6436568 ("net/smc: no tx work trigger for fallback sockets").
It definitely triggers on latest net-next, which includes 569bc6436568
Thanks.
^ permalink raw reply
* Re: [PATCH net-next] net/smc: init conn.tx_work & conn.send_lock sooner
From: Ursula Braun @ 2018-05-17 12:13 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller; +Cc: netdev, Eric Dumazet, linux-s390
In-Reply-To: <20180517105421.182397-1-edumazet@google.com>
On 05/17/2018 12:54 PM, Eric Dumazet wrote:
> syzkaller found that following program crashes the host :
>
> {
> int fd = socket(AF_SMC, SOCK_STREAM, 0);
> int val = 1;
>
> listen(fd, 0);
> shutdown(fd, SHUT_RDWR);
> setsockopt(fd, 6, TCP_NODELAY, &val, 4);
> }
>
> Simply initialize conn.tx_work & conn.send_lock at socket creation,
> rather than deeper in the stack.
>
> ODEBUG: assert_init not available (active state 0) object type: timer_list hint: (null)
> WARNING: CPU: 1 PID: 13988 at lib/debugobjects.c:329 debug_print_object+0x16a/0x210 lib/debugobjects.c:326
> Kernel panic - not syncing: panic_on_warn set ...
>
> CPU: 1 PID: 13988 Comm: syz-executor0 Not tainted 4.17.0-rc4+ #46
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Call Trace:
> __dump_stack lib/dump_stack.c:77 [inline]
> dump_stack+0x1b9/0x294 lib/dump_stack.c:113
> panic+0x22f/0x4de kernel/panic.c:184
> __warn.cold.8+0x163/0x1b3 kernel/panic.c:536
> report_bug+0x252/0x2d0 lib/bug.c:186
> fixup_bug arch/x86/kernel/traps.c:178 [inline]
> do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
> do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
> invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
> RIP: 0010:debug_print_object+0x16a/0x210 lib/debugobjects.c:326
> RSP: 0018:ffff880197a37880 EFLAGS: 00010086
> RAX: 0000000000000061 RBX: 0000000000000005 RCX: ffffc90001ed0000
> RDX: 0000000000004aaf RSI: ffffffff8160f6f1 RDI: 0000000000000001
> RBP: ffff880197a378c0 R08: ffff8801aa7a0080 R09: ffffed003b5e3eb2
> R10: ffffed003b5e3eb2 R11: ffff8801daf1f597 R12: 0000000000000001
> R13: ffffffff88d96980 R14: ffffffff87fa19a0 R15: ffffffff81666ec0
> debug_object_assert_init+0x309/0x500 lib/debugobjects.c:692
> debug_timer_assert_init kernel/time/timer.c:724 [inline]
> debug_assert_init kernel/time/timer.c:776 [inline]
> del_timer+0x74/0x140 kernel/time/timer.c:1198
> try_to_grab_pending+0x439/0x9a0 kernel/workqueue.c:1223
> mod_delayed_work_on+0x91/0x250 kernel/workqueue.c:1592
> mod_delayed_work include/linux/workqueue.h:541 [inline]
> smc_setsockopt+0x387/0x6d0 net/smc/af_smc.c:1367
> __sys_setsockopt+0x1bd/0x390 net/socket.c:1903
> __do_sys_setsockopt net/socket.c:1914 [inline]
> __se_sys_setsockopt net/socket.c:1911 [inline]
> __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1911
> do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
This problem should no longer show up with yesterday's net-next commit
569bc6436568 ("net/smc: no tx work trigger for fallback sockets").
> Fixes: 01d2f7e2cdd3 ("net/smc: sockopts TCP_NODELAY and TCP_CORK")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Ursula Braun <ubraun@linux.ibm.com>
> Cc: linux-s390@vger.kernel.org
> Reported-by: syzbot <syzkaller@googlegroups.com>
> ---
> net/smc/af_smc.c | 2 ++
> net/smc/smc_tx.c | 4 +---
> net/smc/smc_tx.h | 1 +
> 3 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index d15762b057c0d8d4167feca9a4c41f0408604c37..6ad4f6c771c3fa63d3ae714dbe65879910963a21 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -193,8 +193,10 @@ static struct sock *smc_sock_alloc(struct net *net, struct socket *sock,
> sk->sk_protocol = protocol;
> smc = smc_sk(sk);
> INIT_WORK(&smc->tcp_listen_work, smc_tcp_listen_work);
> + INIT_DELAYED_WORK(&smc->conn.tx_work, smc_tx_work);
> INIT_LIST_HEAD(&smc->accept_q);
> spin_lock_init(&smc->accept_q_lock);
> + spin_lock_init(&smc->conn.send_lock);
> sk->sk_prot->hash(sk);
> sk_refcnt_debug_inc(sk);
>
> diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
> index 58dfe0bd9d6075b5d3db97c659999b364b97da73..08a7de98bb031b5d2256e3238236108749e7ae39 100644
> --- a/net/smc/smc_tx.c
> +++ b/net/smc/smc_tx.c
> @@ -450,7 +450,7 @@ int smc_tx_sndbuf_nonempty(struct smc_connection *conn)
> /* Wakeup sndbuf consumers from process context
> * since there is more data to transmit
> */
> -static void smc_tx_work(struct work_struct *work)
> +void smc_tx_work(struct work_struct *work)
> {
> struct smc_connection *conn = container_of(to_delayed_work(work),
> struct smc_connection,
> @@ -512,6 +512,4 @@ void smc_tx_consumer_update(struct smc_connection *conn)
> void smc_tx_init(struct smc_sock *smc)
> {
> smc->sk.sk_write_space = smc_tx_write_space;
> - INIT_DELAYED_WORK(&smc->conn.tx_work, smc_tx_work);
> - spin_lock_init(&smc->conn.send_lock);
> }
> diff --git a/net/smc/smc_tx.h b/net/smc/smc_tx.h
> index 78255964fa4dc1c69f96548e035e74a167999a62..8f64b12bf03c1b52c599b1239784a404efe65dae 100644
> --- a/net/smc/smc_tx.h
> +++ b/net/smc/smc_tx.h
> @@ -27,6 +27,7 @@ static inline int smc_tx_prepared_sends(struct smc_connection *conn)
> return smc_curs_diff(conn->sndbuf_size, &sent, &prep);
> }
>
> +void smc_tx_work(struct work_struct *work);
> void smc_tx_init(struct smc_sock *smc);
> int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len);
> int smc_tx_sndbuf_nonempty(struct smc_connection *conn);
>
^ permalink raw reply
* [PATCH net-next 4/4] tcp: add TCPAckCompressed SNMP counter
From: Eric Dumazet @ 2018-05-17 12:12 UTC (permalink / raw)
To: David S . Miller
Cc: netdev, Toke Høiland-Jørgensen, Neal Cardwell,
Yuchung Cheng, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20180517121213.43559-1-edumazet@google.com>
This counter tracks number of ACK packets that the host has not sent,
thanks to ACK compression.
Sample output :
$ nstat -n;sleep 1;nstat|egrep "IpInReceives|IpOutRequests|TcpInSegs|TcpOutSegs|TcpExtTCPAckCompressed"
IpInReceives 123250 0.0
IpOutRequests 3684 0.0
TcpInSegs 123251 0.0
TcpOutSegs 3684 0.0
TcpExtTCPAckCompressed 119252 0.0
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/uapi/linux/snmp.h | 1 +
net/ipv4/proc.c | 1 +
net/ipv4/tcp_output.c | 2 ++
3 files changed, 4 insertions(+)
diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index d02e859301ff499dd72a1c0e1b56bed10a9397a6..750d89120335eb489f698191edb6c5110969fa8c 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -278,6 +278,7 @@ enum
LINUX_MIB_TCPMTUPSUCCESS, /* TCPMTUPSuccess */
LINUX_MIB_TCPDELIVERED, /* TCPDelivered */
LINUX_MIB_TCPDELIVEREDCE, /* TCPDeliveredCE */
+ LINUX_MIB_TCPACKCOMPRESSED, /* TCPAckCompressed */
__LINUX_MIB_MAX
};
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 261b71d0ccc5c17c6032bf67eb8f842006766e64..6c1ff89a60fa0a3485dcc71fafc799e798d5dc11 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -298,6 +298,7 @@ static const struct snmp_mib snmp4_net_list[] = {
SNMP_MIB_ITEM("TCPMTUPSuccess", LINUX_MIB_TCPMTUPSUCCESS),
SNMP_MIB_ITEM("TCPDelivered", LINUX_MIB_TCPDELIVERED),
SNMP_MIB_ITEM("TCPDeliveredCE", LINUX_MIB_TCPDELIVEREDCE),
+ SNMP_MIB_ITEM("TCPAckCompressed", LINUX_MIB_TCPACKCOMPRESSED),
SNMP_MIB_SENTINEL
};
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7ee98aad82b758674ca7f3e90bd3fc165e8fcd45..437bb7ceba7fd388abac1c12f2920b02be77bad9 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -165,6 +165,8 @@ static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts)
struct tcp_sock *tp = tcp_sk(sk);
if (unlikely(tp->compressed_ack)) {
+ NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPACKCOMPRESSED,
+ tp->compressed_ack);
tp->compressed_ack = 0;
if (hrtimer_try_to_cancel(&tp->compressed_ack_timer) == 1)
__sock_put(sk);
--
2.17.0.441.gb46fe60e1d-goog
^ permalink raw reply related
* [PATCH net-next 3/4] tcp: add SACK compression
From: Eric Dumazet @ 2018-05-17 12:12 UTC (permalink / raw)
To: David S . Miller
Cc: netdev, Toke Høiland-Jørgensen, Neal Cardwell,
Yuchung Cheng, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20180517121213.43559-1-edumazet@google.com>
When TCP receives an out-of-order packet, it immediately sends
a SACK packet, generating network load but also forcing the
receiver to send 1-MSS pathological packets, increasing its
RTX queue length/depth, and thus processing time.
Wifi networks suffer from this aggressive behavior, but generally
speaking, all these SACK packets add fuel to the fire when networks
are under congestion.
This patch adds a high resolution timer and tp->compressed_ack counter.
Instead of sending a SACK, we program this timer with a small delay,
based on SRTT and capped to 2.5 ms : delay = min ( 5 % of SRTT, 2.5 ms)
If subsequent SACKs need to be sent while the timer has not yet expired,
we simply increment tp->compressed_ack
When timer expires, a SACK is sent with the latest information.
Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent
if the sack blocks need to be shuffled, even if the timer has not
expired.
A new SNMP counter is added in the following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/linux/tcp.h | 2 ++
include/net/tcp.h | 3 +++
net/ipv4/tcp.c | 1 +
net/ipv4/tcp_input.c | 31 +++++++++++++++++++++++++------
net/ipv4/tcp_output.c | 7 +++++++
net/ipv4/tcp_timer.c | 25 +++++++++++++++++++++++++
6 files changed, 63 insertions(+), 6 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 807776928cb8610fe97121fbc3c600b08d5d2991..72705eaf4b84060a45bf04d5170f389a18010eac 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -218,6 +218,7 @@ struct tcp_sock {
reord:1; /* reordering detected */
} rack;
u16 advmss; /* Advertised MSS */
+ u8 compressed_ack;
u32 chrono_start; /* Start time in jiffies of a TCP chrono */
u32 chrono_stat[3]; /* Time in jiffies for chrono_stat stats */
u8 chrono_type:2, /* current chronograph type */
@@ -297,6 +298,7 @@ struct tcp_sock {
u32 sacked_out; /* SACK'd packets */
struct hrtimer pacing_timer;
+ struct hrtimer compressed_ack_timer;
/* from STCP, retrans queue hinting */
struct sk_buff* lost_skb_hint;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6ffc8bd894876ad23407f5ec4994350139af85e7..c8c65ae62955eb12a9a6489fa8e008fd89f89f16 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -560,6 +560,9 @@ static inline void tcp_clear_xmit_timers(struct sock *sk)
if (hrtimer_try_to_cancel(&tcp_sk(sk)->pacing_timer) == 1)
__sock_put(sk);
+ if (hrtimer_try_to_cancel(&tcp_sk(sk)->compressed_ack_timer) == 1)
+ __sock_put(sk);
+
inet_csk_clear_xmit_timers(sk);
}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 62b776f9003798eaf06992a4eb0914d17646aa61..0a2ea0bbf867271db05aedd7d48b193677664321 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2595,6 +2595,7 @@ int tcp_disconnect(struct sock *sk, int flags)
dst_release(sk->sk_rx_dst);
sk->sk_rx_dst = NULL;
tcp_saved_syn_free(tp);
+ tp->compressed_ack = 0;
/* Clean up fastopen related fields */
tcp_free_fastopen_req(tp);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 99fcab7e6570c8b8758ea4b15cdd26df29fb4fd6..58feea67b6bb147fa9e75b8b514a9a41576b512b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4242,6 +4242,8 @@ static void tcp_sack_new_ofo_skb(struct sock *sk, u32 seq, u32 end_seq)
* If the sack array is full, forget about the last one.
*/
if (this_sack >= TCP_NUM_SACKS) {
+ if (tp->compressed_ack)
+ tcp_send_ack(sk);
this_sack--;
tp->rx_opt.num_sacks--;
sp--;
@@ -5074,6 +5076,7 @@ static inline void tcp_data_snd_check(struct sock *sk)
static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
{
struct tcp_sock *tp = tcp_sk(sk);
+ unsigned long delay;
/* More than one full frame received... */
if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss &&
@@ -5085,15 +5088,31 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
(tp->rcv_nxt - tp->copied_seq < sk->sk_rcvlowat ||
__tcp_select_window(sk) >= tp->rcv_wnd)) ||
/* We ACK each frame or... */
- tcp_in_quickack_mode(sk) ||
- /* We have out of order data. */
- (ofo_possible && !RB_EMPTY_ROOT(&tp->out_of_order_queue))) {
- /* Then ack it now */
+ tcp_in_quickack_mode(sk)) {
+send_now:
tcp_send_ack(sk);
- } else {
- /* Else, send delayed ack. */
+ return;
+ }
+
+ if (!ofo_possible || RB_EMPTY_ROOT(&tp->out_of_order_queue)) {
tcp_send_delayed_ack(sk);
+ return;
}
+
+ if (!tcp_is_sack(tp) || tp->compressed_ack >= 127)
+ goto send_now;
+ tp->compressed_ack++;
+
+ if (hrtimer_is_queued(&tp->compressed_ack_timer))
+ return;
+
+ /* compress ack timer : 5 % of srtt, but no more than 2.5 ms */
+
+ delay = min_t(unsigned long, 2500 * NSEC_PER_USEC,
+ tp->rcv_rtt_est.rtt_us * (NSEC_PER_USEC >> 3)/20);
+ sock_hold(sk);
+ hrtimer_start(&tp->compressed_ack_timer, ns_to_ktime(delay),
+ HRTIMER_MODE_REL_PINNED_SOFT);
}
static inline void tcp_ack_snd_check(struct sock *sk)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0d8f950a9006598c70dbf51e281a3fe32dfaa234..7ee98aad82b758674ca7f3e90bd3fc165e8fcd45 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -162,6 +162,13 @@ static void tcp_event_data_sent(struct tcp_sock *tp,
/* Account for an ACK we sent. */
static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts)
{
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ if (unlikely(tp->compressed_ack)) {
+ tp->compressed_ack = 0;
+ if (hrtimer_try_to_cancel(&tp->compressed_ack_timer) == 1)
+ __sock_put(sk);
+ }
tcp_dec_quickack_mode(sk, pkts);
inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
}
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 92bdf64fffae3a5be291ca419eb21276b4c8cbae..3b3611729928f77934e0298bb248e55c7a7c5def 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -708,6 +708,27 @@ static void tcp_keepalive_timer (struct timer_list *t)
sock_put(sk);
}
+static enum hrtimer_restart tcp_compressed_ack_kick(struct hrtimer *timer)
+{
+ struct tcp_sock *tp = container_of(timer, struct tcp_sock, compressed_ack_timer);
+ struct sock *sk = (struct sock *)tp;
+
+ bh_lock_sock(sk);
+ if (!sock_owned_by_user(sk)) {
+ if (tp->compressed_ack)
+ tcp_send_ack(sk);
+ } else {
+ if (!test_and_set_bit(TCP_DELACK_TIMER_DEFERRED,
+ &sk->sk_tsq_flags))
+ sock_hold(sk);
+ }
+ bh_unlock_sock(sk);
+
+ sock_put(sk);
+
+ return HRTIMER_NORESTART;
+}
+
void tcp_init_xmit_timers(struct sock *sk)
{
inet_csk_init_xmit_timers(sk, &tcp_write_timer, &tcp_delack_timer,
@@ -715,4 +736,8 @@ void tcp_init_xmit_timers(struct sock *sk)
hrtimer_init(&tcp_sk(sk)->pacing_timer, CLOCK_MONOTONIC,
HRTIMER_MODE_ABS_PINNED_SOFT);
tcp_sk(sk)->pacing_timer.function = tcp_pace_kick;
+
+ hrtimer_init(&tcp_sk(sk)->compressed_ack_timer, CLOCK_MONOTONIC,
+ HRTIMER_MODE_REL_PINNED_SOFT);
+ tcp_sk(sk)->compressed_ack_timer.function = tcp_compressed_ack_kick;
}
--
2.17.0.441.gb46fe60e1d-goog
^ permalink raw reply related
* [PATCH net-next 2/4] tcp: do not force quickack when receiving out-of-order packets
From: Eric Dumazet @ 2018-05-17 12:12 UTC (permalink / raw)
To: David S . Miller
Cc: netdev, Toke Høiland-Jørgensen, Neal Cardwell,
Yuchung Cheng, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20180517121213.43559-1-edumazet@google.com>
As explained in commit 9f9843a751d0 ("tcp: properly handle stretch
acks in slow start"), TCP stacks have to consider how many packets
are acknowledged in one single ACK, because of GRO, but also
because of ACK compression or losses.
We plan to add SACK compression in the following patch, we
must therefore not call tcp_enter_quickack_mode()
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/ipv4/tcp_input.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b188e0d75edd9e5e1c9f0355818caa932fef7416..99fcab7e6570c8b8758ea4b15cdd26df29fb4fd6 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4708,8 +4708,6 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
if (!before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt + tcp_receive_window(tp)))
goto out_of_window;
- tcp_enter_quickack_mode(sk);
-
if (before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
/* Partial packet, seq < rcv_next < end_seq */
SOCK_DEBUG(sk, "partial packet: rcv_next %X seq %X - %X\n",
--
2.17.0.441.gb46fe60e1d-goog
^ permalink raw reply related
* [PATCH net-next 1/4] tcp: use __sock_put() instead of sock_put() in tcp_clear_xmit_timers()
From: Eric Dumazet @ 2018-05-17 12:12 UTC (permalink / raw)
To: David S . Miller
Cc: netdev, Toke Høiland-Jørgensen, Neal Cardwell,
Yuchung Cheng, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20180517121213.43559-1-edumazet@google.com>
Socket can not disappear under us.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/tcp.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a08eab58ef7001b3e141e3722fd8a3875e5c5d7d..6ffc8bd894876ad23407f5ec4994350139af85e7 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -558,7 +558,7 @@ void tcp_init_xmit_timers(struct sock *);
static inline void tcp_clear_xmit_timers(struct sock *sk)
{
if (hrtimer_try_to_cancel(&tcp_sk(sk)->pacing_timer) == 1)
- sock_put(sk);
+ __sock_put(sk);
inet_csk_clear_xmit_timers(sk);
}
--
2.17.0.441.gb46fe60e1d-goog
^ permalink raw reply related
* [PATCH net-next 0/4] tcp: implement SACK compression
From: Eric Dumazet @ 2018-05-17 12:12 UTC (permalink / raw)
To: David S . Miller
Cc: netdev, Toke Høiland-Jørgensen, Neal Cardwell,
Yuchung Cheng, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
When TCP receives an out-of-order packet, it immediately sends
a SACK packet, generating network load but also forcing the
receiver to send 1-MSS pathological packets, increasing its
RTX queue length/depth, and thus processing time.
Wifi networks suffer from this aggressive behavior, but generally
speaking, all these SACK packets add fuel to the fire when networks
are under congestion.
This patch series adds SACK compression, but the infrastructure
could be leveraged to also compress ACK in the future.
Eric Dumazet (4):
tcp: use __sock_put() instead of sock_put() in tcp_clear_xmit_timers()
tcp: do not force quickack when receiving out-of-order packets
tcp: add SACK compression
tcp: add TCPAckCompressed SNMP counter
include/linux/tcp.h | 2 ++
include/net/tcp.h | 5 ++++-
include/uapi/linux/snmp.h | 1 +
net/ipv4/proc.c | 1 +
net/ipv4/tcp.c | 1 +
net/ipv4/tcp_input.c | 33 +++++++++++++++++++++++++--------
net/ipv4/tcp_output.c | 9 +++++++++
net/ipv4/tcp_timer.c | 25 +++++++++++++++++++++++++
8 files changed, 68 insertions(+), 9 deletions(-)
--
2.17.0.441.gb46fe60e1d-goog
^ permalink raw reply
* Re: [PATCH v3 1/2] media: rc: introduce BPF_PROG_RAWIR_EVENT
From: Quentin Monnet @ 2018-05-17 12:10 UTC (permalink / raw)
To: Sean Young, linux-media, linux-kernel, Alexei Starovoitov,
Mauro Carvalho Chehab, Daniel Borkmann, netdev, Matthias Reichl,
Devin Heitmueller, Y Song
In-Reply-To: <92785c791057185fa5691f78cecfa4beae7fc336.1526504511.git.sean@mess.org>
2018-05-16 22:04 UTC+0100 ~ Sean Young <sean@mess.org>
> Add support for BPF_PROG_RAWIR_EVENT. This type of BPF program can call
> rc_keydown() to reported decoded IR scancodes, or rc_repeat() to report
> that the last key should be repeated.
>
> The bpf program can be attached to using the bpf(BPF_PROG_ATTACH) syscall;
> the target_fd must be the /dev/lircN device.
>
> Signed-off-by: Sean Young <sean@mess.org>
> ---
> drivers/media/rc/Kconfig | 13 ++
> drivers/media/rc/Makefile | 1 +
> drivers/media/rc/bpf-rawir-event.c | 363 +++++++++++++++++++++++++++++
> drivers/media/rc/lirc_dev.c | 24 ++
> drivers/media/rc/rc-core-priv.h | 24 ++
> drivers/media/rc/rc-ir-raw.c | 14 +-
> include/linux/bpf_rcdev.h | 30 +++
> include/linux/bpf_types.h | 3 +
> include/uapi/linux/bpf.h | 55 ++++-
> kernel/bpf/syscall.c | 7 +
> 10 files changed, 531 insertions(+), 3 deletions(-)
> create mode 100644 drivers/media/rc/bpf-rawir-event.c
> create mode 100644 include/linux/bpf_rcdev.h
>
[...]
Hi Sean,
Please find below some nitpicks on the documentation for the two helpers.
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index d94d333a8225..243e141e8a5b 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
[...]
> @@ -1902,6 +1904,35 @@ union bpf_attr {
> * egress otherwise). This is the only flag supported for now.
> * Return
> * **SK_PASS** on success, or **SK_DROP** on error.
> + *
> + * int bpf_rc_keydown(void *ctx, u32 protocol, u32 scancode, u32 toggle)
> + * Description
> + * Report decoded scancode with toggle value. For use in
> + * BPF_PROG_TYPE_RAWIR_EVENT, to report a successfully
Could you please use bold RST markup for constants and function names?
Typically for BPF_PROG_TYPE_RAWIR_EVENT here and the enum below.
> + * decoded scancode. This is will generate a keydown event,
s/This is will/This will/?
> + * and a keyup event once the scancode is no longer repeated.
> + *
> + * *ctx* pointer to bpf_rawir_event, *protocol* is decoded
> + * protocol (see RC_PROTO_* enum).
This documentation is intended to be compiled as a man page. Could you
please use a complete sentence here?
Also, this could do with additional markup as well: **struct
bpf_rawir_event**.
> + *
> + * Some protocols include a toggle bit, in case the button
> + * was released and pressed again between consecutive scancodes,
> + * copy this bit into *toggle* if it exists, else set to 0.
> + *
> + * Return
The "Return" lines here and in the second helper use space indent
instead as tabs (as all other lines do). Would you mind fixing it for
consistency?
> + * Always return 0 (for now)
Other helpers use just "0" in that case, but I do not really mind.
Out of curiosity, do you have anything specific in mind for changing the
return value here in the future?
> + *
> + * int bpf_rc_repeat(void *ctx)
> + * Description
> + * Repeat the last decoded scancode; some IR protocols like
> + * NEC have a special IR message for repeat last button,
s/repeat/repeating/?
> + * in case user is holding a button down; the scancode is
> + * not repeated.
> + *
> + * *ctx* pointer to bpf_rawir_event.
Please use a complete sentence here as well, if you do not mind.
> + *
> + * Return
> + * Always return 0 (for now)
> */
Thanks,
Quentin
^ permalink raw reply
* Re: [RFC v4 3/5] virtio_ring: add packed ring support
From: Jason Wang @ 2018-05-17 12:01 UTC (permalink / raw)
To: Tiwei Bie; +Cc: mst, virtualization, linux-kernel, netdev, wexu, jfreimann
In-Reply-To: <20180516143332.GA1957@debian>
On 2018年05月16日 22:33, Tiwei Bie wrote:
> On Wed, May 16, 2018 at 10:05:44PM +0800, Jason Wang wrote:
>> On 2018年05月16日 21:45, Tiwei Bie wrote:
>>> On Wed, May 16, 2018 at 08:51:43PM +0800, Jason Wang wrote:
>>>> On 2018年05月16日 20:39, Tiwei Bie wrote:
>>>>> On Wed, May 16, 2018 at 07:50:16PM +0800, Jason Wang wrote:
>>>>>> On 2018年05月16日 16:37, Tiwei Bie wrote:
> [...]
>>>>>>> +static void detach_buf_packed(struct vring_virtqueue *vq, unsigned int head,
>>>>>>> + unsigned int id, void **ctx)
>>>>>>> +{
>>>>>>> + struct vring_packed_desc *desc;
>>>>>>> + unsigned int i, j;
>>>>>>> +
>>>>>>> + /* Clear data ptr. */
>>>>>>> + vq->desc_state[id].data = NULL;
>>>>>>> +
>>>>>>> + i = head;
>>>>>>> +
>>>>>>> + for (j = 0; j < vq->desc_state[id].num; j++) {
>>>>>>> + desc = &vq->vring_packed.desc[i];
>>>>>>> + vring_unmap_one_packed(vq, desc);
>>>>>> As mentioned in previous discussion, this probably won't work for the case
>>>>>> of out of order completion since it depends on the information in the
>>>>>> descriptor ring. We probably need to extend ctx to record such information.
>>>>> Above code doesn't depend on the information in the descriptor
>>>>> ring. The vq->desc_state[] is the extended ctx.
>>>>>
>>>>> Best regards,
>>>>> Tiwei Bie
>>>> Yes, but desc is a pointer to descriptor ring I think so
>>>> vring_unmap_one_packed() still depends on the content of descriptor ring?
>>>>
>>> I got your point now. I think it makes sense to reserve
>>> the bits of the addr field. Driver shouldn't try to get
>>> addrs from the descriptors when cleanup the descriptors
>>> no matter whether we support out-of-order or not.
>> Maybe I was wrong, but I remember spec mentioned something like this.
> You're right. Spec mentioned this. I was just repeating
> the spec to emphasize that it does make sense. :)
>
>>> But combining it with the out-of-order support, it will
>>> mean that the driver still needs to maintain a desc/ctx
>>> list that is very similar to the desc ring in the split
>>> ring. I'm not quite sure whether it's something we want.
>>> If it is true, I'll do it. So do you think we also want
>>> to maintain such a desc/ctx list for packed ring?
>> To make it work for OOO backends I think we need something like this
>> (hardware NIC drivers are usually have something like this).
> Which hardware NIC drivers have this?
It's quite common I think, e.g driver track e.g dma addr and page frag
somewhere. e.g the ring->rx_info in mlx4 driver.
Thanks
>
>> Not for the patch, but it looks like having a OUT_OF_ORDER feature bit is
>> much more simpler to be started with.
> +1
>
> Best regards,
> Tiwei Bie
^ permalink raw reply
* Re: [PATCH net-next v12 3/7] sch_cake: Add optional ACK filter
From: Eric Dumazet @ 2018-05-17 11:56 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Eric Dumazet, netdev; +Cc: cake
In-Reply-To: <87in7my196.fsf@toke.dk>
On 05/17/2018 04:23 AM, Toke Høiland-Jørgensen wrote:
>
> We don't do full parsing of SACKs, no; we were trying to keep things
> simple... We do detect the presence of SACK options, though, and the
> presence of SACK options on an ACK will make previous ACKs be considered
> redundant.
>
But they are not redundant in some cases, particularly when reorders happen in the network.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox