* Re: [PATCH] bpf: handle 32-bit zext during constant blinding
From: Jiong Wang @ 2019-08-21 21:51 UTC (permalink / raw)
To: Naveen N. Rao
Cc: Alexei Starovoitov, Daniel Borkmann, Jiong Wang, Michael Ellerman,
bpf, linuxppc-dev, netdev, linux-kernel
In-Reply-To: <20190821192358.31922-1-naveen.n.rao@linux.vnet.ibm.com>
Naveen N. Rao writes:
> Since BPF constant blinding is performed after the verifier pass, the
> ALU32 instructions inserted for doubleword immediate loads don't have a
> corresponding zext instruction. This is causing a kernel oops on powerpc
> and can be reproduced by running 'test_cgroup_storage' with
> bpf_jit_harden=2.
>
> Fix this by emitting BPF_ZEXT during constant blinding if
> prog->aux->verifier_zext is set.
>
> Fixes: a4b1d3c1ddf6cb ("bpf: verifier: insert zero extension according to analysis result")
> Reported-by: Michael Ellerman <mpe@ellerman.id.au>
> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Thanks for the fix.
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Just two other comments during review in case I am wrong on somewhere.
- Use verifier_zext instead of bpf_jit_needs_zext() seems better, even
though the latter could avoid extending function argument.
Because JIT back-ends look at verifier_zext, true means zext inserted
by verifier so JITs won't do the code-gen.
Use verifier_zext is sort of keeping JIT blinding the same behaviour
has verifier even though blinding doesn't belong to verifier, but for
such insn patching, it could be seen as a extension of verifier,
therefore use verifier_zext seems better than bpf_jit_needs_zext() to
me.
- JIT blinding is also escaping the HI32 randomization which happens
inside verifier, otherwise x86-64 regression should have caught this issue.
Regards,
Jiong
> ---
> Changes since RFC:
> - Removed changes to ALU32 and JMP32 ops since those don't alter program
> execution, and the verifier would have already accounted for them.
>
>
> kernel/bpf/core.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 8191a7db2777..66088a9e9b9e 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -890,7 +890,8 @@ int bpf_jit_get_func_addr(const struct bpf_prog *prog,
>
> static int bpf_jit_blind_insn(const struct bpf_insn *from,
> const struct bpf_insn *aux,
> - struct bpf_insn *to_buff)
> + struct bpf_insn *to_buff,
> + bool emit_zext)
> {
> struct bpf_insn *to = to_buff;
> u32 imm_rnd = get_random_int();
> @@ -1005,6 +1006,8 @@ static int bpf_jit_blind_insn(const struct bpf_insn *from,
> case 0: /* Part 2 of BPF_LD | BPF_IMM | BPF_DW. */
> *to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ aux[0].imm);
> *to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
> + if (emit_zext)
> + *to++ = BPF_ZEXT_REG(BPF_REG_AX);
> *to++ = BPF_ALU64_REG(BPF_OR, aux[0].dst_reg, BPF_REG_AX);
> break;
>
> @@ -1088,7 +1091,8 @@ struct bpf_prog *bpf_jit_blind_constants(struct bpf_prog *prog)
> insn[1].code == 0)
> memcpy(aux, insn, sizeof(aux));
>
> - rewritten = bpf_jit_blind_insn(insn, aux, insn_buff);
> + rewritten = bpf_jit_blind_insn(insn, aux, insn_buff,
> + clone->aux->verifier_zext);
> if (!rewritten)
> continue;
^ permalink raw reply
* [PATCH ipsec-next 6/7] esp4: split esp_output_udp_encap and introduce esp_output_encap
From: Sabrina Dubroca @ 2019-08-21 21:46 UTC (permalink / raw)
To: netdev; +Cc: Herbert Xu, Steffen Klassert, Sabrina Dubroca
In-Reply-To: <cover.1566395202.git.sd@queasysnail.net>
Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
net/ipv4/esp4.c | 57 ++++++++++++++++++++++++++++++++-----------------
1 file changed, 37 insertions(+), 20 deletions(-)
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index c5d826642229..033c61d27148 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -225,45 +225,62 @@ static void esp_output_fill_trailer(u8 *tail, int tfclen, int plen, __u8 proto)
tail[plen - 1] = proto;
}
-static int esp_output_udp_encap(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *esp)
+static struct ip_esp_hdr *esp_output_udp_encap(struct sk_buff *skb,
+ int encap_type,
+ struct esp_info *esp,
+ __be16 sport,
+ __be16 dport)
{
- int encap_type;
struct udphdr *uh;
__be32 *udpdata32;
- __be16 sport, dport;
- struct xfrm_encap_tmpl *encap = x->encap;
- struct ip_esp_hdr *esph = esp->esph;
unsigned int len;
- spin_lock_bh(&x->lock);
- sport = encap->encap_sport;
- dport = encap->encap_dport;
- encap_type = encap->encap_type;
- spin_unlock_bh(&x->lock);
-
len = skb->len + esp->tailen - skb_transport_offset(skb);
if (len + sizeof(struct iphdr) >= IP_MAX_MTU)
- return -EMSGSIZE;
+ return ERR_PTR(-EMSGSIZE);
- uh = (struct udphdr *)esph;
+ uh = (struct udphdr *)esp->esph;
uh->source = sport;
uh->dest = dport;
uh->len = htons(len);
uh->check = 0;
+ *skb_mac_header(skb) = IPPROTO_UDP;
+
+ if (encap_type == UDP_ENCAP_ESPINUDP_NON_IKE) {
+ udpdata32 = (__be32 *)(uh + 1);
+ udpdata32[0] = udpdata32[1] = 0;
+ return (struct ip_esp_hdr *)(udpdata32 + 2);
+ }
+
+ return (struct ip_esp_hdr *)(uh + 1);
+}
+
+static int esp_output_encap(struct xfrm_state *x, struct sk_buff *skb,
+ struct esp_info *esp)
+{
+ struct xfrm_encap_tmpl *encap = x->encap;
+ struct ip_esp_hdr *esph;
+ __be16 sport, dport;
+ int encap_type;
+
+ spin_lock_bh(&x->lock);
+ sport = encap->encap_sport;
+ dport = encap->encap_dport;
+ encap_type = encap->encap_type;
+ spin_unlock_bh(&x->lock);
+
switch (encap_type) {
default:
case UDP_ENCAP_ESPINUDP:
- esph = (struct ip_esp_hdr *)(uh + 1);
- break;
case UDP_ENCAP_ESPINUDP_NON_IKE:
- udpdata32 = (__be32 *)(uh + 1);
- udpdata32[0] = udpdata32[1] = 0;
- esph = (struct ip_esp_hdr *)(udpdata32 + 2);
+ esph = esp_output_udp_encap(skb, encap_type, esp, sport, dport);
break;
}
- *skb_mac_header(skb) = IPPROTO_UDP;
+ if (IS_ERR(esph))
+ return PTR_ERR(esph);
+
esp->esph = esph;
return 0;
@@ -281,7 +298,7 @@ int esp_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
/* this is non-NULL only with UDP Encapsulation */
if (x->encap) {
- int err = esp_output_udp_encap(x, skb, esp);
+ int err = esp_output_encap(x, skb, esp);
if (err < 0)
return err;
--
2.22.0
^ permalink raw reply related
* [PATCH ipsec-next 7/7] xfrm: add espintcp (RFC 8229)
From: Sabrina Dubroca @ 2019-08-21 21:46 UTC (permalink / raw)
To: netdev; +Cc: Herbert Xu, Steffen Klassert, Sabrina Dubroca
In-Reply-To: <cover.1566395202.git.sd@queasysnail.net>
TCP encapsulation of IKE and IPsec messages (RFC 8229) is implemented
as a TCP ULP, overriding in particular the sendmsg and recvmsg
operations. A Stream Parser is used to extract messages out of the TCP
stream using the first 2 bytes as length marker. Received IKE messages
are put on "ike_queue", waiting to be dequeued by the custom recvmsg
implementation. Received ESP messages are sent to XFRM, like with UDP
encapsulation.
Some of this code is taken from the original submission by Herbert
Xu. Currently, only IPv4 is supported, like for UDP encapsulation.
Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
include/net/espintcp.h | 38 +++
include/net/xfrm.h | 1 +
include/uapi/linux/udp.h | 1 +
net/ipv4/esp4.c | 189 ++++++++++++++-
net/xfrm/Kconfig | 9 +
net/xfrm/Makefile | 1 +
net/xfrm/espintcp.c | 505 +++++++++++++++++++++++++++++++++++++++
net/xfrm/xfrm_policy.c | 7 +
net/xfrm/xfrm_state.c | 3 +
9 files changed, 751 insertions(+), 3 deletions(-)
create mode 100644 include/net/espintcp.h
create mode 100644 net/xfrm/espintcp.c
diff --git a/include/net/espintcp.h b/include/net/espintcp.h
new file mode 100644
index 000000000000..02fc28c82d30
--- /dev/null
+++ b/include/net/espintcp.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _NET_ESPINTCP_H
+#define _NET_ESPINTCP_H
+
+#include <net/strparser.h>
+#include <linux/skmsg.h>
+
+void __init espintcp_init(void);
+
+int espintcp_push_skb(struct sock *sk, struct sk_buff *skb);
+int espintcp_queue_out(struct sock *sk, struct sk_buff *skb);
+bool tcp_is_ulp_esp(struct sock *sk);
+
+struct espintcp_msg {
+ struct sk_buff *skb;
+ struct sk_msg skmsg;
+ int offset;
+ int len;
+};
+
+struct espintcp_ctx {
+ struct strparser strp;
+ struct sk_buff_head ike_queue;
+ struct sk_buff_head out_queue;
+ struct espintcp_msg partial;
+ void (*saved_data_ready)(struct sock *sk);
+ void (*saved_write_space)(struct sock *sk);
+ struct work_struct work;
+ bool tx_running;
+};
+
+static inline struct espintcp_ctx *espintcp_getctx(const struct sock *sk)
+{
+ struct inet_connection_sock *icsk = inet_csk(sk);
+
+ return icsk->icsk_ulp_data;
+}
+#endif
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index afedc9210c4b..3dd3c199ecfa 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -193,6 +193,7 @@ struct xfrm_state {
/* Data for encapsulator */
struct xfrm_encap_tmpl *encap;
+ struct sock __rcu *encap_sk;
/* Data for care-of address */
xfrm_address_t *coaddr;
diff --git a/include/uapi/linux/udp.h b/include/uapi/linux/udp.h
index 30baccb6c9c4..4828794efcf8 100644
--- a/include/uapi/linux/udp.h
+++ b/include/uapi/linux/udp.h
@@ -42,5 +42,6 @@ struct udphdr {
#define UDP_ENCAP_GTP0 4 /* GSM TS 09.60 */
#define UDP_ENCAP_GTP1U 5 /* 3GPP TS 29.060 */
#define UDP_ENCAP_RXRPC 6
+#define TCP_ENCAP_ESPINTCP 7 /* Yikes, this is really xfrm encap types. */
#endif /* _UAPI_LINUX_UDP_H */
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 033c61d27148..d101aca546c3 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -18,6 +18,8 @@
#include <net/icmp.h>
#include <net/protocol.h>
#include <net/udp.h>
+#include <net/tcp.h>
+#include <net/espintcp.h>
#include <linux/highmem.h>
@@ -117,6 +119,129 @@ static void esp_ssg_unref(struct xfrm_state *x, void *tmp)
put_page(sg_page(sg));
}
+#ifdef CONFIG_XFRM_ESPINTCP
+struct esp_tcp_sk {
+ struct sock *sk;
+ struct rcu_head rcu;
+};
+
+static void esp_free_tcp_sk(struct rcu_head *head)
+{
+ struct esp_tcp_sk *esk = container_of(head, struct esp_tcp_sk, rcu);
+
+ sock_put(esk->sk);
+ kfree(esk);
+}
+
+static struct sock *esp_find_tcp_sk(struct xfrm_state *x)
+{
+ struct xfrm_encap_tmpl *encap = x->encap;
+ struct esp_tcp_sk *esk;
+ __be16 sport, dport;
+ struct sock *nsk;
+ struct sock *sk;
+
+ sk = rcu_dereference(x->encap_sk);
+ if (sk && sk->sk_state == TCP_ESTABLISHED)
+ return sk;
+
+ spin_lock_bh(&x->lock);
+ sport = encap->encap_sport;
+ dport = encap->encap_dport;
+ nsk = rcu_dereference_protected(x->encap_sk,
+ lockdep_is_held(&x->lock));
+ if (sk && sk == nsk) {
+ esk = kmalloc(sizeof(*esk), GFP_ATOMIC);
+ if (!esk) {
+ spin_unlock_bh(&x->lock);
+ return ERR_PTR(-ENOMEM);
+ }
+ RCU_INIT_POINTER(x->encap_sk, NULL);
+ esk->sk = sk;
+ call_rcu(&esk->rcu, esp_free_tcp_sk);
+ }
+ spin_unlock_bh(&x->lock);
+
+ sk = inet_lookup_established(xs_net(x), &tcp_hashinfo, x->id.daddr.a4,
+ dport, x->props.saddr.a4, sport, 0);
+ if (!sk)
+ return ERR_PTR(-ENOENT);
+
+ if (!tcp_is_ulp_esp(sk)) {
+ sock_put(sk);
+ return ERR_PTR(-EINVAL);
+ }
+
+ spin_lock_bh(&x->lock);
+ nsk = rcu_dereference_protected(x->encap_sk,
+ lockdep_is_held(&x->lock));
+ if (encap->encap_sport != sport ||
+ encap->encap_dport != dport) {
+ sock_put(sk);
+ sk = nsk ?: ERR_PTR(-EREMCHG);
+ } else if (sk == nsk) {
+ sock_put(sk);
+ } else {
+ rcu_assign_pointer(x->encap_sk, sk);
+ }
+ spin_unlock_bh(&x->lock);
+
+ return sk;
+}
+
+static int esp_output_tcp_finish(struct xfrm_state *x, struct sk_buff *skb)
+{
+ struct sock *sk;
+ int err;
+
+ rcu_read_lock();
+
+ sk = esp_find_tcp_sk(x);
+ err = PTR_ERR(sk);
+ if (IS_ERR(sk))
+ goto out;
+
+ bh_lock_sock(sk);
+ if (sock_owned_by_user(sk)) {
+ err = espintcp_queue_out(sk, skb);
+ if (err < 0)
+ goto unlock_sock;
+ } else {
+ err = espintcp_push_skb(sk, skb);
+ }
+
+unlock_sock:
+ bh_unlock_sock(sk);
+out:
+ rcu_read_unlock();
+ return err;
+}
+
+static int esp_output_tcp_encap_cb(struct net *net, struct sock *sk,
+ struct sk_buff *skb)
+{
+ struct dst_entry *dst = skb_dst(skb);
+ struct xfrm_state *x = dst->xfrm;
+
+ return esp_output_tcp_finish(x, skb);
+}
+
+static int esp_output_tail_tcp(struct xfrm_state *x, struct sk_buff *skb)
+{
+ int err;
+
+ local_bh_disable();
+ err = xfrm_trans_queue_net(xs_net(x), skb, esp_output_tcp_encap_cb);
+ local_bh_enable();
+
+ /* EINPROGRESS just happens to do the right thing. It
+ * actually means that the skb has been consumed and
+ * isn't coming back.
+ */
+ return err ?: -EINPROGRESS;
+}
+#endif
+
static void esp_output_done(struct crypto_async_request *base, int err)
{
struct sk_buff *skb = base->data;
@@ -147,7 +272,13 @@ static void esp_output_done(struct crypto_async_request *base, int err)
secpath_reset(skb);
xfrm_dev_resume(skb);
} else {
- xfrm_output_resume(skb, err);
+#ifdef CONFIG_XFRM_ESPINTCP
+ if (!err &&
+ x->encap && x->encap->encap_type == TCP_ENCAP_ESPINTCP)
+ esp_output_tail_tcp(x, skb);
+ else
+#endif
+ xfrm_output_resume(skb, err);
}
}
@@ -236,7 +367,7 @@ static struct ip_esp_hdr *esp_output_udp_encap(struct sk_buff *skb,
unsigned int len;
len = skb->len + esp->tailen - skb_transport_offset(skb);
- if (len + sizeof(struct iphdr) >= IP_MAX_MTU)
+ if (len + sizeof(struct iphdr) > IP_MAX_MTU)
return ERR_PTR(-EMSGSIZE);
uh = (struct udphdr *)esp->esph;
@@ -256,6 +387,23 @@ static struct ip_esp_hdr *esp_output_udp_encap(struct sk_buff *skb,
return (struct ip_esp_hdr *)(uh + 1);
}
+#ifdef CONFIG_XFRM_ESPINTCP
+static struct ip_esp_hdr *esp_output_tcp_encap(struct sk_buff *skb,
+ struct esp_info *esp)
+{
+ __be16 *lenp = (void *)esp->esph;
+ unsigned int len;
+
+ len = skb->len + esp->tailen - skb_transport_offset(skb);
+ if (len > IP_MAX_MTU)
+ return ERR_PTR(-EMSGSIZE);
+
+ *lenp = htons(len);
+
+ return (struct ip_esp_hdr *)(lenp + 1);
+}
+#endif
+
static int esp_output_encap(struct xfrm_state *x, struct sk_buff *skb,
struct esp_info *esp)
{
@@ -276,6 +424,22 @@ static int esp_output_encap(struct xfrm_state *x, struct sk_buff *skb,
case UDP_ENCAP_ESPINUDP_NON_IKE:
esph = esp_output_udp_encap(skb, encap_type, esp, sport, dport);
break;
+#ifdef CONFIG_XFRM_ESPINTCP
+ case TCP_ENCAP_ESPINTCP: {
+ struct sock *sk;
+
+ rcu_read_lock();
+ sk = esp_find_tcp_sk(x);
+ if (IS_ERR(sk)) {
+ rcu_read_unlock();
+ return PTR_ERR(sk);
+ }
+
+ esph = esp_output_tcp_encap(skb, esp);
+ rcu_read_unlock();
+ break;
+ }
+#endif
}
if (IS_ERR(esph))
@@ -296,7 +460,7 @@ int esp_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
struct sk_buff *trailer;
int tailen = esp->tailen;
- /* this is non-NULL only with UDP Encapsulation */
+ /* this is non-NULL only with TCP/UDP Encapsulation */
if (x->encap) {
int err = esp_output_encap(x, skb, esp);
@@ -491,6 +655,11 @@ int esp_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
if (sg != dsg)
esp_ssg_unref(x, tmp);
+#ifdef CONFIG_XFRM_ESPINTCP
+ if (!err && x->encap && x->encap->encap_type == TCP_ENCAP_ESPINTCP)
+ err = esp_output_tail_tcp(x, skb);
+#endif
+
error_free:
kfree(tmp);
error:
@@ -617,10 +786,16 @@ int esp_input_done2(struct sk_buff *skb, int err)
if (x->encap) {
struct xfrm_encap_tmpl *encap = x->encap;
+ struct tcphdr *th = (void *)(skb_network_header(skb) + ihl);
struct udphdr *uh = (void *)(skb_network_header(skb) + ihl);
__be16 source;
switch (x->encap->encap_type) {
+#ifdef CONFIG_XFRM_ESPINTCP
+ case TCP_ENCAP_ESPINTCP:
+ source = th->source;
+ break;
+#endif
case UDP_ENCAP_ESPINUDP:
case UDP_ENCAP_ESPINUDP_NON_IKE:
source = uh->source;
@@ -1017,6 +1192,14 @@ static int esp_init_state(struct xfrm_state *x)
case UDP_ENCAP_ESPINUDP_NON_IKE:
x->props.header_len += sizeof(struct udphdr) + 2 * sizeof(u32);
break;
+#ifdef CONFIG_XFRM_ESPINTCP
+ case TCP_ENCAP_ESPINTCP:
+ /* only the length field, TCP encap is done by
+ * the socket
+ */
+ x->props.header_len += 2;
+ break;
+#endif
}
}
diff --git a/net/xfrm/Kconfig b/net/xfrm/Kconfig
index c967fc3c38c8..ccc012b3ea10 100644
--- a/net/xfrm/Kconfig
+++ b/net/xfrm/Kconfig
@@ -71,6 +71,15 @@ config XFRM_IPCOMP
select CRYPTO
select CRYPTO_DEFLATE
+config XFRM_ESPINTCP
+ bool "ESP in TCP encapsulation (RFC 8229)"
+ depends on XFRM && INET_ESP
+ select STREAM_PARSER
+ help
+ Support for RFC 8229 encapsulation of ESP and IKE over TCP sockets.
+
+ If unsure, say N.
+
config NET_KEY
tristate "PF_KEY sockets"
select XFRM_ALGO
diff --git a/net/xfrm/Makefile b/net/xfrm/Makefile
index fbc4552d17b8..2d4bb4b9f75e 100644
--- a/net/xfrm/Makefile
+++ b/net/xfrm/Makefile
@@ -11,3 +11,4 @@ obj-$(CONFIG_XFRM_ALGO) += xfrm_algo.o
obj-$(CONFIG_XFRM_USER) += xfrm_user.o
obj-$(CONFIG_XFRM_IPCOMP) += xfrm_ipcomp.o
obj-$(CONFIG_XFRM_INTERFACE) += xfrm_interface.o
+obj-$(CONFIG_XFRM_ESPINTCP) += espintcp.o
diff --git a/net/xfrm/espintcp.c b/net/xfrm/espintcp.c
new file mode 100644
index 000000000000..1d561a00c4b0
--- /dev/null
+++ b/net/xfrm/espintcp.c
@@ -0,0 +1,505 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <net/tcp.h>
+#include <net/strparser.h>
+#include <net/xfrm.h>
+#include <net/esp.h>
+#include <net/espintcp.h>
+#include <linux/skmsg.h>
+#include <net/inet_common.h>
+
+static void handle_nonesp(struct espintcp_ctx *ctx, struct sk_buff *skb,
+ struct sock *sk)
+{
+ if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf ||
+ !sk_rmem_schedule(sk, skb, skb->truesize)) {
+ kfree_skb(skb);
+ return;
+ }
+
+ skb_set_owner_r(skb, sk);
+
+ memset(skb->cb, 0, sizeof(skb->cb));
+ skb_queue_tail(&ctx->ike_queue, skb);
+ ctx->saved_data_ready(sk);
+}
+
+static void handle_esp(struct sk_buff *skb, struct sock *sk)
+{
+ skb_reset_transport_header(skb);
+ memset(skb->cb, 0, sizeof(skb->cb));
+
+ rcu_read_lock();
+ skb->dev = dev_get_by_index_rcu(sock_net(sk), skb->skb_iif);
+ local_bh_disable();
+ xfrm4_rcv_encap(skb, IPPROTO_ESP, 0, TCP_ENCAP_ESPINTCP);
+ local_bh_enable();
+ rcu_read_unlock();
+}
+
+static void espintcp_rcv(struct strparser *strp, struct sk_buff *skb)
+{
+ struct espintcp_ctx *ctx = container_of(strp, struct espintcp_ctx,
+ strp);
+ struct strp_msg *rxm = strp_msg(skb);
+ u32 nonesp_marker;
+ int err;
+
+ err = skb_copy_bits(skb, rxm->offset + 2, &nonesp_marker,
+ sizeof(nonesp_marker));
+ if (err < 0) {
+ kfree_skb(skb);
+ return;
+ }
+
+ /* remove header, leave non-ESP marker/SPI */
+ if (!__pskb_pull(skb, rxm->offset + 2)) {
+ kfree_skb(skb);
+ return;
+ }
+
+ if (pskb_trim(skb, rxm->full_len - 2) != 0) {
+ kfree_skb(skb);
+ return;
+ }
+
+ if (nonesp_marker == 0)
+ handle_nonesp(ctx, skb, strp->sk);
+ else
+ handle_esp(skb, strp->sk);
+}
+
+static int espintcp_parse(struct strparser *strp, struct sk_buff *skb)
+{
+ struct strp_msg *rxm = strp_msg(skb);
+ __be16 blen;
+ u16 len;
+ int err;
+
+ if (skb->len < rxm->offset + 2)
+ return 0;
+
+ err = skb_copy_bits(skb, rxm->offset, &blen, sizeof(blen));
+ if (err < 0)
+ return err;
+
+ len = be16_to_cpu(blen);
+ if (len < 6)
+ return -EINVAL;
+
+ return len;
+}
+
+static int espintcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
+ int nonblock, int flags, int *addr_len)
+{
+ struct espintcp_ctx *ctx = espintcp_getctx(sk);
+ struct sk_buff *skb;
+ int err = 0;
+ int copied;
+ int off = 0;
+
+ flags |= nonblock ? MSG_DONTWAIT : 0;
+
+ skb = __skb_recv_datagram(sk, &ctx->ike_queue, flags, NULL, &off, &err);
+ if (!skb)
+ return err;
+
+ copied = len;
+ if (copied > skb->len)
+ copied = skb->len;
+ else if (copied < skb->len)
+ msg->msg_flags |= MSG_TRUNC;
+
+ err = skb_copy_datagram_msg(skb, 0, msg, copied);
+ if (unlikely(err)) {
+ kfree_skb(skb);
+ return err;
+ }
+
+ if (flags & MSG_TRUNC)
+ copied = skb->len;
+ kfree_skb(skb);
+ return copied;
+}
+
+int espintcp_queue_out(struct sock *sk, struct sk_buff *skb)
+{
+ struct espintcp_ctx *ctx = espintcp_getctx(sk);
+
+ if (skb_queue_len(&ctx->out_queue) >= netdev_max_backlog)
+ return -ENOBUFS;
+
+ __skb_queue_tail(&ctx->out_queue, skb);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(espintcp_queue_out);
+
+/* espintcp length field is 2B and length includes the length field's size */
+#define MAX_ESPINTCP_MSG (((1 << 16) - 1) - 2)
+
+static int espintcp_sendskb_locked(struct sock *sk, struct espintcp_msg *emsg,
+ int flags)
+{
+ do {
+ int ret;
+
+ ret = skb_send_sock_locked(sk, emsg->skb,
+ emsg->offset, emsg->len);
+ if (ret < 0)
+ return ret;
+
+ emsg->len -= ret;
+ emsg->offset += ret;
+ } while (emsg->len > 0);
+
+ kfree_skb(emsg->skb);
+ memset(emsg, 0, sizeof(*emsg));
+
+ return 0;
+}
+
+static int espintcp_sendskmsg_locked(struct sock *sk,
+ struct espintcp_msg *emsg, int flags)
+{
+ struct sk_msg *skmsg = &emsg->skmsg;
+ struct scatterlist *sg;
+ int done = 0;
+ int ret;
+
+ flags |= MSG_SENDPAGE_NOTLAST;
+ sg = &skmsg->sg.data[skmsg->sg.start];
+ do {
+ size_t size = sg->length - emsg->offset;
+ int offset = sg->offset + emsg->offset;
+ struct page *p;
+
+ emsg->offset = 0;
+
+ if (sg_is_last(sg))
+ flags &= ~MSG_SENDPAGE_NOTLAST;
+
+ p = sg_page(sg);
+retry:
+ ret = do_tcp_sendpages(sk, p, offset, size, flags);
+ if (ret < 0) {
+ emsg->offset = offset - sg->offset;
+ skmsg->sg.start += done;
+ return ret;
+ }
+
+ if (ret != size) {
+ offset += ret;
+ size -= ret;
+ goto retry;
+ }
+
+ done++;
+ put_page(p);
+ sk_mem_uncharge(sk, sg->length);
+ sg = sg_next(sg);
+ } while (sg);
+
+ memset(emsg, 0, sizeof(*emsg));
+
+ return 0;
+}
+
+static int espintcp_push_msgs(struct sock *sk)
+{
+ struct espintcp_ctx *ctx = espintcp_getctx(sk);
+ struct espintcp_msg *emsg = &ctx->partial;
+ int err;
+
+ if (!emsg->len)
+ return 0;
+
+ if (ctx->tx_running)
+ return -EAGAIN;
+ ctx->tx_running = 1;
+
+ if (emsg->skb)
+ err = espintcp_sendskb_locked(sk, emsg, 0);
+ else
+ err = espintcp_sendskmsg_locked(sk, emsg, 0);
+ if (err == -EAGAIN) {
+ ctx->tx_running = 0;
+ return 0;
+ }
+ if (!err)
+ memset(emsg, 0, sizeof(*emsg));
+
+ ctx->tx_running = 0;
+
+ return err;
+}
+
+int espintcp_push_skb(struct sock *sk, struct sk_buff *skb)
+{
+ struct espintcp_ctx *ctx = espintcp_getctx(sk);
+ struct espintcp_msg *emsg = &ctx->partial;
+ unsigned int len;
+ int offset;
+
+ if (sk->sk_state != TCP_ESTABLISHED) {
+ kfree_skb(skb);
+ return -ECONNRESET;
+ }
+
+ offset = skb_transport_offset(skb);
+ len = skb->len - offset;
+
+ espintcp_push_msgs(sk);
+
+ if (emsg->len) {
+ kfree_skb(skb);
+ return -ENOBUFS;
+ }
+
+ skb_set_owner_w(skb, sk);
+
+ emsg->offset = offset;
+ emsg->len = len;
+ emsg->skb = skb;
+
+ espintcp_push_msgs(sk);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(espintcp_push_skb);
+
+static int espintcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
+{
+ long timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
+ struct espintcp_ctx *ctx = espintcp_getctx(sk);
+ struct espintcp_msg *emsg = &ctx->partial;
+ struct iov_iter pfx_iter;
+ struct kvec pfx_iov = {};
+ size_t msglen = size + 2;
+ char buf[2] = {0};
+ int err, end;
+
+ if (msg->msg_flags)
+ return -EOPNOTSUPP;
+
+ if (size > MAX_ESPINTCP_MSG)
+ return -EMSGSIZE;
+
+ if (msg->msg_controllen)
+ return -EOPNOTSUPP;
+
+ lock_sock(sk);
+
+ err = espintcp_push_msgs(sk);
+ if (err < 0) {
+ err = -ENOBUFS;
+ goto unlock;
+ }
+
+ sk_msg_init(&emsg->skmsg);
+ while (1) {
+ /* only -ENOMEM is possible since we don't coalesce */
+ err = sk_msg_alloc(sk, &emsg->skmsg, msglen, 0);
+ if (!err)
+ break;
+
+ err = sk_stream_wait_memory(sk, &timeo);
+ if (err)
+ goto fail;
+ }
+
+ *((__be16 *)buf) = cpu_to_be16(msglen);
+ pfx_iov.iov_base = buf;
+ pfx_iov.iov_len = sizeof(buf);
+ iov_iter_kvec(&pfx_iter, WRITE, &pfx_iov, 1, pfx_iov.iov_len);
+
+ err = sk_msg_memcopy_from_iter(sk, &pfx_iter, &emsg->skmsg,
+ pfx_iov.iov_len);
+ if (err < 0)
+ goto fail;
+
+ err = sk_msg_memcopy_from_iter(sk, &msg->msg_iter, &emsg->skmsg, size);
+ if (err < 0)
+ goto fail;
+
+ end = emsg->skmsg.sg.end;
+ emsg->len = size;
+ sk_msg_iter_var_prev(end);
+ sg_mark_end(sk_msg_elem(&emsg->skmsg, end));
+
+ tcp_rate_check_app_limited(sk);
+
+ err = espintcp_push_msgs(sk);
+ /* this message could be partially sent, keep it */
+ if (err < 0)
+ goto unlock;
+ release_sock(sk);
+
+ return size;
+
+fail:
+ sk_msg_free(sk, &emsg->skmsg);
+ memset(emsg, 0, sizeof(*emsg));
+unlock:
+ release_sock(sk);
+ return err;
+}
+
+static struct proto espintcp_prot __ro_after_init;
+static struct proto_ops espintcp_ops __ro_after_init;
+
+static void espintcp_data_ready(struct sock *sk)
+{
+ struct espintcp_ctx *ctx = espintcp_getctx(sk);
+
+ strp_data_ready(&ctx->strp);
+}
+
+static void espintcp_tx_work(struct work_struct *work)
+{
+ struct espintcp_ctx *ctx = container_of(work,
+ struct espintcp_ctx, work);
+ struct sock *sk = ctx->strp.sk;
+
+ lock_sock(sk);
+ if (!ctx->tx_running)
+ espintcp_push_msgs(sk);
+ release_sock(sk);
+}
+
+static void espintcp_write_space(struct sock *sk)
+{
+ struct espintcp_ctx *ctx = espintcp_getctx(sk);
+
+ schedule_work(&ctx->work);
+ ctx->saved_write_space(sk);
+}
+
+static void espintcp_destruct(struct sock *sk)
+{
+ struct espintcp_ctx *ctx = espintcp_getctx(sk);
+
+ kfree(ctx);
+}
+
+bool tcp_is_ulp_esp(struct sock *sk)
+{
+ return sk->sk_prot == &espintcp_prot;
+}
+EXPORT_SYMBOL_GPL(tcp_is_ulp_esp);
+
+static int espintcp_init_sk(struct sock *sk)
+{
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct strp_callbacks cb = {
+ .rcv_msg = espintcp_rcv,
+ .parse_msg = espintcp_parse,
+ };
+ struct espintcp_ctx *ctx;
+ int err;
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ err = strp_init(&ctx->strp, sk, &cb);
+ if (err)
+ goto free;
+
+ __sk_dst_reset(sk);
+
+ strp_check_rcv(&ctx->strp);
+ skb_queue_head_init(&ctx->ike_queue);
+ skb_queue_head_init(&ctx->out_queue);
+ sk->sk_prot = &espintcp_prot;
+ sk->sk_socket->ops = &espintcp_ops;
+ ctx->saved_data_ready = sk->sk_data_ready;
+ ctx->saved_write_space = sk->sk_write_space;
+ sk->sk_data_ready = espintcp_data_ready;
+ sk->sk_write_space = espintcp_write_space;
+ sk->sk_destruct = espintcp_destruct;
+ icsk->icsk_ulp_data = ctx;
+ INIT_WORK(&ctx->work, espintcp_tx_work);
+
+ /* avoid using task_frag */
+ sk->sk_allocation = GFP_ATOMIC;
+
+ return 0;
+
+free:
+ kfree(ctx);
+ return err;
+}
+
+static void espintcp_release(struct sock *sk)
+{
+ struct espintcp_ctx *ctx = espintcp_getctx(sk);
+ struct sk_buff_head queue;
+ struct sk_buff *skb;
+
+ __skb_queue_head_init(&queue);
+ skb_queue_splice_init(&ctx->out_queue, &queue);
+
+ while ((skb = __skb_dequeue(&queue)))
+ espintcp_push_skb(sk, skb);
+
+ tcp_release_cb(sk);
+}
+
+static void espintcp_close(struct sock *sk, long timeout)
+{
+ struct espintcp_ctx *ctx = espintcp_getctx(sk);
+ struct espintcp_msg *emsg = &ctx->partial;
+
+ strp_stop(&ctx->strp);
+
+ sk->sk_prot = &tcp_prot;
+ barrier();
+
+ cancel_work_sync(&ctx->work);
+ strp_done(&ctx->strp);
+
+ skb_queue_purge(&ctx->out_queue);
+ skb_queue_purge(&ctx->ike_queue);
+
+ if (emsg->len) {
+ if (emsg->skb)
+ kfree_skb(emsg->skb);
+ else
+ sk_msg_free(sk, &emsg->skmsg);
+ }
+
+ tcp_close(sk, timeout);
+}
+
+static __poll_t espintcp_poll(struct file *file, struct socket *sock,
+ poll_table *wait)
+{
+ __poll_t mask = datagram_poll(file, sock, wait);
+ struct sock *sk = sock->sk;
+ struct espintcp_ctx *ctx = espintcp_getctx(sk);
+
+ if (!skb_queue_empty(&ctx->ike_queue))
+ mask |= EPOLLIN | EPOLLRDNORM;
+
+ return mask;
+}
+
+static struct tcp_ulp_ops espintcp_ulp __read_mostly = {
+ .name = "espintcp",
+ .owner = THIS_MODULE,
+ .init = espintcp_init_sk,
+};
+
+void __init espintcp_init(void)
+{
+ memcpy(&espintcp_prot, &tcp_prot, sizeof(tcp_prot));
+ memcpy(&espintcp_ops, &inet_stream_ops, sizeof(inet_stream_ops));
+ espintcp_prot.sendmsg = espintcp_sendmsg;
+ espintcp_prot.recvmsg = espintcp_recvmsg;
+ espintcp_prot.close = espintcp_close;
+ espintcp_prot.release_cb = espintcp_release;
+ espintcp_ops.poll = espintcp_poll;
+
+ tcp_register_ulp(&espintcp_ulp);
+}
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 1070dfece76b..893072f2e651 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -39,6 +39,9 @@
#ifdef CONFIG_XFRM_STATISTICS
#include <net/snmp.h>
#endif
+#ifdef CONFIG_XFRM_ESPINTCP
+#include <net/espintcp.h>
+#endif
#include "xfrm_hash.h"
@@ -4156,6 +4159,10 @@ void __init xfrm_init(void)
seqcount_init(&xfrm_policy_hash_generation);
xfrm_input_init();
+#ifdef CONFIG_XFRM_ESPINTCP
+ espintcp_init();
+#endif
+
RCU_INIT_POINTER(xfrm_if_cb, NULL);
synchronize_rcu();
}
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index c6f3c4a1bd99..acef2d54f869 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -668,6 +668,9 @@ int __xfrm_state_delete(struct xfrm_state *x)
net->xfrm.state_num--;
spin_unlock(&net->xfrm.xfrm_state_lock);
+ if (x->encap_sk)
+ sock_put(rcu_dereference_raw(x->encap_sk));
+
xfrm_dev_state_delete(x);
/* All xfrm_state objects are created by xfrm_state_alloc.
--
2.22.0
^ permalink raw reply related
* [PATCH ipsec-next 5/7] esp4: prepare esp_input_done2 for non-UDP encapsulation
From: Sabrina Dubroca @ 2019-08-21 21:46 UTC (permalink / raw)
To: netdev; +Cc: Herbert Xu, Steffen Klassert, Sabrina Dubroca
In-Reply-To: <cover.1566395202.git.sd@queasysnail.net>
For espintcp encapsulation, we will need to get the source port from the
TCP header instead of UDP. Introduce a variable to hold the port.
Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
net/ipv4/esp4.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 5c967764041f..c5d826642229 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -601,6 +601,18 @@ int esp_input_done2(struct sk_buff *skb, int err)
if (x->encap) {
struct xfrm_encap_tmpl *encap = x->encap;
struct udphdr *uh = (void *)(skb_network_header(skb) + ihl);
+ __be16 source;
+
+ switch (x->encap->encap_type) {
+ case UDP_ENCAP_ESPINUDP:
+ case UDP_ENCAP_ESPINUDP_NON_IKE:
+ source = uh->source;
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ err = -EINVAL;
+ goto out;
+ }
/*
* 1) if the NAT-T peer's IP or port changed then
@@ -609,11 +621,11 @@ int esp_input_done2(struct sk_buff *skb, int err)
* SRC ports.
*/
if (iph->saddr != x->props.saddr.a4 ||
- uh->source != encap->encap_sport) {
+ source != encap->encap_sport) {
xfrm_address_t ipaddr;
ipaddr.a4 = iph->saddr;
- km_new_mapping(x, &ipaddr, uh->source);
+ km_new_mapping(x, &ipaddr, source);
/* XXX: perhaps add an extra
* policy check here, to see
--
2.22.0
^ permalink raw reply related
* [PATCH ipsec-next 4/7] xfrm: add route lookup to xfrm4_rcv_encap
From: Sabrina Dubroca @ 2019-08-21 21:46 UTC (permalink / raw)
To: netdev; +Cc: Herbert Xu, Steffen Klassert, Sabrina Dubroca
In-Reply-To: <cover.1566395202.git.sd@queasysnail.net>
At this point, with TCP encapsulation, the dst may be gone, but
xfrm_input needs one.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
net/ipv4/xfrm4_protocol.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/net/ipv4/xfrm4_protocol.c b/net/ipv4/xfrm4_protocol.c
index bcab48944c15..1665e1a05ec5 100644
--- a/net/ipv4/xfrm4_protocol.c
+++ b/net/ipv4/xfrm4_protocol.c
@@ -76,6 +76,14 @@ int xfrm4_rcv_encap(struct sk_buff *skb, int nexthdr, __be32 spi,
if (!head)
goto out;
+ if (!skb_dst(skb)) {
+ const struct iphdr *iph = ip_hdr(skb);
+
+ if (ip_route_input_noref(skb, iph->daddr, iph->saddr,
+ iph->tos, skb->dev))
+ goto drop;
+ }
+
for_each_protocol_rcu(*head, handler)
if ((ret = handler->input_handler(skb, nexthdr, spi, encap_type)) != -EINVAL)
return ret;
@@ -83,6 +91,7 @@ int xfrm4_rcv_encap(struct sk_buff *skb, int nexthdr, __be32 spi,
out:
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
+drop:
kfree_skb(skb);
return 0;
}
--
2.22.0
^ permalink raw reply related
* [PATCH ipsec-next 3/7] xfrm: introduce xfrm_trans_queue_net
From: Sabrina Dubroca @ 2019-08-21 21:46 UTC (permalink / raw)
To: netdev; +Cc: Herbert Xu, Steffen Klassert, Sabrina Dubroca
In-Reply-To: <cover.1566395202.git.sd@queasysnail.net>
This will be used by TCP encapsulation to write packets to the encap
socket without holding the user socket's lock. Without this reinjection,
we're already holding the lock of the user socket, and then try to lock
the encap socket as well when we enqueue the encrypted packet.
While at it, add a BUILD_BUG_ON like we usually do for skb->cb, since
it's missing for struct xfrm_trans_cb.
Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
include/net/xfrm.h | 3 +++
net/xfrm/xfrm_input.c | 21 +++++++++++++++++----
2 files changed, 20 insertions(+), 4 deletions(-)
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index b22db30c3d88..afedc9210c4b 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1549,6 +1549,9 @@ int __xfrm_init_state(struct xfrm_state *x, bool init_replay, bool offload);
int xfrm_init_state(struct xfrm_state *x);
int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type);
int xfrm_input_resume(struct sk_buff *skb, int nexthdr);
+int xfrm_trans_queue_net(struct net *net, struct sk_buff *skb,
+ int (*finish)(struct net *, struct sock *,
+ struct sk_buff *));
int xfrm_trans_queue(struct sk_buff *skb,
int (*finish)(struct net *, struct sock *,
struct sk_buff *));
diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 6088bc2dc11e..eb0f0e64c71c 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -36,6 +36,7 @@ struct xfrm_trans_cb {
#endif
} header;
int (*finish)(struct net *net, struct sock *sk, struct sk_buff *skb);
+ struct net *net;
};
#define XFRM_TRANS_SKB_CB(__skb) ((struct xfrm_trans_cb *)&((__skb)->cb[0]))
@@ -763,12 +764,13 @@ static void xfrm_trans_reinject(unsigned long data)
skb_queue_splice_init(&trans->queue, &queue);
while ((skb = __skb_dequeue(&queue)))
- XFRM_TRANS_SKB_CB(skb)->finish(dev_net(skb->dev), NULL, skb);
+ XFRM_TRANS_SKB_CB(skb)->finish(XFRM_TRANS_SKB_CB(skb)->net,
+ NULL, skb);
}
-int xfrm_trans_queue(struct sk_buff *skb,
- int (*finish)(struct net *, struct sock *,
- struct sk_buff *))
+int xfrm_trans_queue_net(struct net *net, struct sk_buff *skb,
+ int (*finish)(struct net *, struct sock *,
+ struct sk_buff *))
{
struct xfrm_trans_tasklet *trans;
@@ -777,11 +779,22 @@ int xfrm_trans_queue(struct sk_buff *skb,
if (skb_queue_len(&trans->queue) >= netdev_max_backlog)
return -ENOBUFS;
+ BUILD_BUG_ON(sizeof(struct xfrm_trans_cb) > sizeof(skb->cb));
+
XFRM_TRANS_SKB_CB(skb)->finish = finish;
+ XFRM_TRANS_SKB_CB(skb)->net = net;
__skb_queue_tail(&trans->queue, skb);
tasklet_schedule(&trans->tasklet);
return 0;
}
+EXPORT_SYMBOL(xfrm_trans_queue_net);
+
+int xfrm_trans_queue(struct sk_buff *skb,
+ int (*finish)(struct net *, struct sock *,
+ struct sk_buff *))
+{
+ return xfrm_trans_queue_net(dev_net(skb->dev), skb, finish);
+}
EXPORT_SYMBOL(xfrm_trans_queue);
void __init xfrm_input_init(void)
--
2.22.0
^ permalink raw reply related
* [PATCH ipsec-next 2/7] skbuff: Avoid sleeping in skb_send_sock_locked
From: Sabrina Dubroca @ 2019-08-21 21:46 UTC (permalink / raw)
To: netdev; +Cc: Herbert Xu, Steffen Klassert, Sabrina Dubroca
In-Reply-To: <cover.1566395202.git.sd@queasysnail.net>
From: Herbert Xu <herbert@gondor.apana.org.au>
For a function that needs to be called with the socket spinlock
held, sleeping would seem to be a bad idea. This function does
in fact avoid sleeping when calling kernel_sendpage_locked on the
page part of the skb. However, it doesn't do that when sending
the linear part. Resulting in sleeping when the socket send buffer
is full.
This patch fixes it by setting the MSG_DONTWAIT flag when calling
kernel_sendmsg_locked.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
net/core/skbuff.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b50a5e3ac4e4..f863c7ef417c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2367,6 +2367,7 @@ int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
kv.iov_base = skb->data + offset;
kv.iov_len = slen;
memset(&msg, 0, sizeof(msg));
+ msg.msg_flags = MSG_DONTWAIT;
ret = kernel_sendmsg_locked(sk, &msg, &kv, 1, slen);
if (ret <= 0)
--
2.22.0
^ permalink raw reply related
* [PATCH ipsec-next 1/7] net: add queue argument to __skb_wait_for_more_packets and __skb_{,try_}recv_datagram
From: Sabrina Dubroca @ 2019-08-21 21:46 UTC (permalink / raw)
To: netdev; +Cc: Herbert Xu, Steffen Klassert, Sabrina Dubroca
In-Reply-To: <cover.1566395202.git.sd@queasysnail.net>
This will be used by ESP over TCP to handle the queue of IKE messages.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
include/linux/skbuff.h | 11 ++++++++---
net/core/datagram.c | 26 ++++++++++++++++----------
net/ipv4/udp.c | 3 ++-
net/unix/af_unix.c | 7 ++++---
4 files changed, 30 insertions(+), 17 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 98ff5ac98caa..149c542115a6 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3377,7 +3377,8 @@ static inline void skb_frag_list_init(struct sk_buff *skb)
for (iter = skb_shinfo(skb)->frag_list; iter; iter = iter->next)
-int __skb_wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
+int __skb_wait_for_more_packets(struct sock *sk, struct sk_buff_head *queue,
+ int *err, long *timeo_p,
const struct sk_buff *skb);
struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
struct sk_buff_head *queue,
@@ -3386,12 +3387,16 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
struct sk_buff *skb),
int *off, int *err,
struct sk_buff **last);
-struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned flags,
+struct sk_buff *__skb_try_recv_datagram(struct sock *sk,
+ struct sk_buff_head *queue,
+ unsigned int flags,
void (*destructor)(struct sock *sk,
struct sk_buff *skb),
int *off, int *err,
struct sk_buff **last);
-struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned flags,
+struct sk_buff *__skb_recv_datagram(struct sock *sk,
+ struct sk_buff_head *sk_queue,
+ unsigned int flags,
void (*destructor)(struct sock *sk,
struct sk_buff *skb),
int *off, int *err);
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 45a162ef5e02..5fe681e1f4ae 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -84,7 +84,8 @@ static int receiver_wake_function(wait_queue_entry_t *wait, unsigned int mode, i
/*
* Wait for the last received packet to be different from skb
*/
-int __skb_wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
+int __skb_wait_for_more_packets(struct sock *sk, struct sk_buff_head *queue,
+ int *err, long *timeo_p,
const struct sk_buff *skb)
{
int error;
@@ -97,7 +98,7 @@ int __skb_wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
if (error)
goto out_err;
- if (sk->sk_receive_queue.prev != skb)
+ if (queue->prev != skb)
goto out;
/* Socket shut down? */
@@ -241,13 +242,14 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
* quite explicitly by POSIX 1003.1g, don't change them without having
* the standard around please.
*/
-struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
+struct sk_buff *__skb_try_recv_datagram(struct sock *sk,
+ struct sk_buff_head *queue,
+ unsigned int flags,
void (*destructor)(struct sock *sk,
struct sk_buff *skb),
int *off, int *err,
struct sk_buff **last)
{
- struct sk_buff_head *queue = &sk->sk_receive_queue;
struct sk_buff *skb;
unsigned long cpu_flags;
/*
@@ -278,7 +280,7 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
break;
sk_busy_loop(sk, flags & MSG_DONTWAIT);
- } while (sk->sk_receive_queue.prev != *last);
+ } while (queue->prev != *last);
error = -EAGAIN;
@@ -288,7 +290,9 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
}
EXPORT_SYMBOL(__skb_try_recv_datagram);
-struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags,
+struct sk_buff *__skb_recv_datagram(struct sock *sk,
+ struct sk_buff_head *sk_queue,
+ unsigned int flags,
void (*destructor)(struct sock *sk,
struct sk_buff *skb),
int *off, int *err)
@@ -299,15 +303,16 @@ struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags,
timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
do {
- skb = __skb_try_recv_datagram(sk, flags, destructor, off, err,
- &last);
+ skb = __skb_try_recv_datagram(sk, sk_queue, flags, destructor,
+ off, err, &last);
if (skb)
return skb;
if (*err != -EAGAIN)
break;
} while (timeo &&
- !__skb_wait_for_more_packets(sk, err, &timeo, last));
+ !__skb_wait_for_more_packets(sk, sk_queue, err,
+ &timeo, last));
return NULL;
}
@@ -318,7 +323,8 @@ struct sk_buff *skb_recv_datagram(struct sock *sk, unsigned int flags,
{
int off = 0;
- return __skb_recv_datagram(sk, flags | (noblock ? MSG_DONTWAIT : 0),
+ return __skb_recv_datagram(sk, &sk->sk_receive_queue,
+ flags | (noblock ? MSG_DONTWAIT : 0),
NULL, &off, err);
}
EXPORT_SYMBOL(skb_recv_datagram);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8fb250ed53d4..40067fc4c82b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1690,7 +1690,8 @@ struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags,
/* sk_queue is empty, reader_queue may contain peeked packets */
} while (timeo &&
- !__skb_wait_for_more_packets(sk, &error, &timeo,
+ !__skb_wait_for_more_packets(sk, &sk->sk_receive_queue,
+ &error, &timeo,
(struct sk_buff *)sk_queue));
*err = error;
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index e68d7454f2e3..91c1ffd82ff9 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2053,8 +2053,8 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
mutex_lock(&u->iolock);
skip = sk_peek_offset(sk, flags);
- skb = __skb_try_recv_datagram(sk, flags, NULL, &skip, &err,
- &last);
+ skb = __skb_try_recv_datagram(sk, &sk->sk_receive_queue, flags,
+ NULL, &skip, &err, &last);
if (skb)
break;
@@ -2063,7 +2063,8 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
if (err != -EAGAIN)
break;
} while (timeo &&
- !__skb_wait_for_more_packets(sk, &err, &timeo, last));
+ !__skb_wait_for_more_packets(sk, &sk->sk_receive_queue,
+ &err, &timeo, last));
if (!skb) { /* implies iolock unlocked */
unix_state_lock(sk);
--
2.22.0
^ permalink raw reply related
* [PATCH ipsec-next 0/7] ipsec: add TCP encapsulation support (RFC 8229)
From: Sabrina Dubroca @ 2019-08-21 21:46 UTC (permalink / raw)
To: netdev; +Cc: Herbert Xu, Steffen Klassert, Sabrina Dubroca
This patchset introduces support for TCP encapsulation of IKE and ESP
messages, as defined by RFC 8229 [0]. It is an evolution of what
Herbert Xu proposed in January 2018 [1] that addresses the main
criticism against it, by not interfering with the TCP implementation
at all. The networking stack now has infrastructure for this: TCP ULPs
and Stream Parsers.
The first patches are preparation and refactoring, and the final patch
adds the feature.
The main omission in this submission is IPv6 support. ESP
encapsulation over UDP with IPv6 is currently not supported in the
kernel either, as UDP encapsulation is aimed at NAT traversal, and NAT
is not frequently used with IPv6.
Some of the code is taken directly, or slightly modified, from Herbert
Xu's original submission [1]. The ULP and strparser pieces are
new. This work was presented and discussed at the IPsec workshop and
netdev 0x13 conference [2] in Prague, last March.
An equivalent of patch #1 (skbuff: Avoid sleeping in
skb_send_sock_locked) is already present in other trees (but not
ipsec-next) as commit bd95e678e0f6 ("bpf: sockmap, fix use after free
from sleep in psock backlog workqueue"), I'm only including it here so
that this patchset works correctly on top of ipsec-next/master.
No changes in the patchset since the RFC.
[0] https://tools.ietf.org/html/rfc8229
[1] https://patchwork.ozlabs.org/patch/859107/
[2] https://netdevconf.org/0x13/session.html?talk-ipsec-encap
Herbert Xu (1):
skbuff: Avoid sleeping in skb_send_sock_locked
Sabrina Dubroca (6):
net: add queue argument to __skb_wait_for_more_packets and
__skb_{,try_}recv_datagram
xfrm: introduce xfrm_trans_queue_net
xfrm: add route lookup to xfrm4_rcv_encap
esp4: prepare esp_input_done2 for non-UDP encapsulation
esp4: split esp_output_udp_encap and introduce esp_output_encap
xfrm: add espintcp (RFC 8229)
include/linux/skbuff.h | 11 +-
include/net/espintcp.h | 38 +++
include/net/xfrm.h | 4 +
include/uapi/linux/udp.h | 1 +
net/core/datagram.c | 26 +-
net/core/skbuff.c | 1 +
net/ipv4/esp4.c | 262 ++++++++++++++++++--
net/ipv4/udp.c | 3 +-
net/ipv4/xfrm4_protocol.c | 9 +
net/unix/af_unix.c | 7 +-
net/xfrm/Kconfig | 9 +
net/xfrm/Makefile | 1 +
net/xfrm/espintcp.c | 505 ++++++++++++++++++++++++++++++++++++++
net/xfrm/xfrm_input.c | 21 +-
net/xfrm/xfrm_policy.c | 7 +
net/xfrm/xfrm_state.c | 3 +
16 files changed, 862 insertions(+), 46 deletions(-)
create mode 100644 include/net/espintcp.h
create mode 100644 net/xfrm/espintcp.c
--
2.22.0
^ permalink raw reply
* Re: [PATCH net] ixgbe: fix double clean of tx descriptors with xdp
From: William Tu @ 2019-08-21 21:38 UTC (permalink / raw)
To: Alexander Duyck
Cc: Ilya Maximets, Björn Töpel, Netdev, LKML, bpf,
David S. Miller, Magnus Karlsson, Jakub Kicinski,
Alexei Starovoitov, Daniel Borkmann, Jeff Kirsher,
intel-wired-lan, Eelco Chaudron
In-Reply-To: <CAKgT0UcCKiM1Ys=vWxctprN7fzWcBCk-PCuKB-8=RThM=CqLSQ@mail.gmail.com>
On Wed, Aug 21, 2019 at 9:57 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Wed, Aug 21, 2019 at 9:22 AM Ilya Maximets <i.maximets@samsung.com> wrote:
> >
> > On 21.08.2019 4:17, Alexander Duyck wrote:
> > > On Tue, Aug 20, 2019 at 8:58 AM Ilya Maximets <i.maximets@samsung.com> wrote:
> > >>
> > >> On 20.08.2019 18:35, Alexander Duyck wrote:
> > >>> On Tue, Aug 20, 2019 at 8:18 AM Ilya Maximets <i.maximets@samsung.com> wrote:
> > >>>>
> > >>>> Tx code doesn't clear the descriptor status after cleaning.
> > >>>> So, if the budget is larger than number of used elems in a ring, some
> > >>>> descriptors will be accounted twice and xsk_umem_complete_tx will move
> > >>>> prod_tail far beyond the prod_head breaking the comletion queue ring.
> > >>>>
> > >>>> Fix that by limiting the number of descriptors to clean by the number
> > >>>> of used descriptors in the tx ring.
> > >>>>
> > >>>> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
> > >>>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
> > >>>
> > >>> I'm not sure this is the best way to go. My preference would be to
> > >>> have something in the ring that would prevent us from racing which I
> > >>> don't think this really addresses. I am pretty sure this code is safe
> > >>> on x86 but I would be worried about weak ordered systems such as
> > >>> PowerPC.
> > >>>
> > >>> It might make sense to look at adding the eop_desc logic like we have
> > >>> in the regular path with a proper barrier before we write it and after
> > >>> we read it. So for example we could hold of on writing the bytecount
> > >>> value until the end of an iteration and call smp_wmb before we write
> > >>> it. Then on the cleanup we could read it and if it is non-zero we take
> > >>> an smp_rmb before proceeding further to process the Tx descriptor and
> > >>> clearing the value. Otherwise this code is going to just keep popping
> > >>> up with issues.
> > >>
> > >> But, unlike regular case, xdp zero-copy xmit and clean for particular
> > >> tx ring always happens in the same NAPI context and even on the same
> > >> CPU core.
> > >>
> > >> I saw the 'eop_desc' manipulations in regular case and yes, we could
> > >> use 'next_to_watch' field just as a flag of descriptor existence,
> > >> but it seems unnecessarily complicated. Am I missing something?
> > >>
> > >
> > > So is it always in the same NAPI context?. I forgot, I was thinking
> > > that somehow the socket could possibly make use of XDP for transmit.
> >
> > AF_XDP socket only triggers tx interrupt on ndo_xsk_async_xmit() which
> > is used in zero-copy mode. Real xmit happens inside
> > ixgbe_poll()
> > -> ixgbe_clean_xdp_tx_irq()
> > -> ixgbe_xmit_zc()
> >
> > This should be not possible to bound another XDP socket to the same netdev
> > queue.
> >
> > It also possible to xmit frames in xdp_ring while performing XDP_TX/REDIRECT
> > actions. REDIRECT could happen from different netdev with different NAPI
> > context, but this operation is bound to specific CPU core and each core has
> > its own xdp_ring.
> >
> > However, I'm not an expert here.
> > Björn, maybe you could comment on this?
> >
> > >
> > > As far as the logic to use I would be good with just using a value you
> > > are already setting such as the bytecount value. All that would need
> > > to happen is to guarantee that the value is cleared in the Tx path. So
> > > if you clear the bytecount in ixgbe_clean_xdp_tx_irq you could
> > > theoretically just use that as well to flag that a descriptor has been
> > > populated and is ready to be cleaned. Assuming the logic about this
> > > all being in the same NAPI context anyway you wouldn't need to mess
> > > with the barrier stuff I mentioned before.
> >
> > Checking the number of used descs, i.e. next_to_use - next_to_clean,
> > makes iteration in this function logically equal to the iteration inside
> > 'ixgbe_xsk_clean_tx_ring()'. Do you think we need to change the later
> > function too to follow same 'bytecount' approach? I don't like having
> > two different ways to determine number of used descriptors in the same file.
> >
> > Best regards, Ilya Maximets.
>
> As far as ixgbe_clean_xdp_tx_irq() vs ixgbe_xsk_clean_tx_ring(), I
> would say that if you got rid of budget and framed things more like
> how ixgbe_xsk_clean_tx_ring was framed with the ntc != ntu being
> obvious I would prefer to see us go that route.
>
> Really there is no need for budget in ixgbe_clean_xdp_tx_irq() if you
> are going to be working with a static ntu value since you will only
> ever process one iteration through the ring anyway. It might make more
> sense if you just went through and got rid of budget and i, and
> instead used ntc and ntu like what was done in
> ixgbe_xsk_clean_tx_ring().
>
> Thanks.
>
> - Alex
Not familiar with the driver details.
I tested this patch and the issue mentioned in OVS mailing list.
https://www.mail-archive.com/ovs-dev@openvswitch.org/msg35362.html
and indeed the problem goes away. But I saw a huge performance drop,
my AF_XDP tx performance drops from >9Mpps to <5Mpps.
Tested using kernel 5.3.0-rc3+
03:00.0 Ethernet controller: Intel Corporation Ethernet Controller
10-Gigabit X540-AT2 (rev 01)
Subsystem: Intel Corporation Ethernet 10G 2P X540-t Adapter
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
Regards,
William
^ permalink raw reply
* Re: [PATCH 24/38] cls_u32: Convert tc_u_common->handle_idr to XArray
From: Jakub Kicinski @ 2019-08-21 21:38 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: netdev
In-Reply-To: <20190821212542.GB21442@bombadil.infradead.org>
On Wed, 21 Aug 2019 14:25:42 -0700, Matthew Wilcox wrote:
> On Wed, Aug 21, 2019 at 02:13:08PM -0700, Jakub Kicinski wrote:
> > On Tue, 20 Aug 2019 15:32:45 -0700, Matthew Wilcox wrote:
> > > @@ -305,8 +306,12 @@ static void *u32_get(struct tcf_proto *tp, u32 handle)
> > > /* Protected by rtnl lock */
> > > static u32 gen_new_htid(struct tc_u_common *tp_c, struct tc_u_hnode *ptr)
> > > {
> > > - int id = idr_alloc_cyclic(&tp_c->handle_idr, ptr, 1, 0x7FF, GFP_KERNEL);
> > > - if (id < 0)
> > > + int err;
> > > + u32 id;
> > > +
> > > + err = xa_alloc_cyclic(&tp_c->ht_xa, &id, ptr, XA_LIMIT(0, 0x7ff),
> > > + &tp_c->ht_next, GFP_KERNEL);
> >
> > nit: indentation seems off here and a couple of other places.
>
> what indentation rule does the networking stack use? i just leave the
> cursor where my editor puts it, which seems to be two tabs.
Oh, match opening bracket..
err = xa_alloc_cyclic(&tp_c->ht_xa, &id, ptr, XA_LIMIT(0, 0x7ff),
&tp_c->ht_next, GFP_KERNEL);
^ permalink raw reply
* Re: [net-next 00/15][pull request] 40GbE Intel Wired LAN Driver Updates 2019-08-21
From: Jakub Kicinski @ 2019-08-21 21:31 UTC (permalink / raw)
To: Jeff Kirsher; +Cc: davem, netdev, nhorman, sassmann
In-Reply-To: <20190821201623.5506-1-jeffrey.t.kirsher@intel.com>
On Wed, 21 Aug 2019 13:16:08 -0700, Jeff Kirsher wrote:
> This series contains updates to i40e driver only.
Patch 12 should really be squashed into 13, 7 and 9 could also be
combined. But not a big deal, I guess.
^ permalink raw reply
* Re: [PATCH 24/38] cls_u32: Convert tc_u_common->handle_idr to XArray
From: Matthew Wilcox @ 2019-08-21 21:25 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: netdev
In-Reply-To: <20190821141308.54313c30@cakuba.netronome.com>
On Wed, Aug 21, 2019 at 02:13:08PM -0700, Jakub Kicinski wrote:
> On Tue, 20 Aug 2019 15:32:45 -0700, Matthew Wilcox wrote:
> > @@ -305,8 +306,12 @@ static void *u32_get(struct tcf_proto *tp, u32 handle)
> > /* Protected by rtnl lock */
> > static u32 gen_new_htid(struct tc_u_common *tp_c, struct tc_u_hnode *ptr)
> > {
> > - int id = idr_alloc_cyclic(&tp_c->handle_idr, ptr, 1, 0x7FF, GFP_KERNEL);
> > - if (id < 0)
> > + int err;
> > + u32 id;
> > +
> > + err = xa_alloc_cyclic(&tp_c->ht_xa, &id, ptr, XA_LIMIT(0, 0x7ff),
> > + &tp_c->ht_next, GFP_KERNEL);
>
> nit: indentation seems off here and a couple of other places.
what indentation rule does the networking stack use? i just leave the
cursor where my editor puts it, which seems to be two tabs.
^ permalink raw reply
* [PATCH] net/ncsi: Fix the payload copying for the request coming from Netlink
From: Justin.Lee1 @ 2019-08-21 21:24 UTC (permalink / raw)
To: netdev, openbmc, linux-kernel, sam, davem
The request coming from Netlink should use the OEM generic handler.
The standard command handler expects payload in bytes/words/dwords
but the actual payload is stored in data if the request is coming from Netlink.
Signed-off-by: Justin Lee <justin.lee1@dell.com>
---
net/ncsi/ncsi-cmd.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/net/ncsi/ncsi-cmd.c b/net/ncsi/ncsi-cmd.c
index eab4346..0187e65 100644
--- a/net/ncsi/ncsi-cmd.c
+++ b/net/ncsi/ncsi-cmd.c
@@ -309,14 +309,21 @@ static struct ncsi_request *ncsi_alloc_command(struct ncsi_cmd_arg *nca)
int ncsi_xmit_cmd(struct ncsi_cmd_arg *nca)
{
+ struct ncsi_cmd_handler *nch = NULL;
struct ncsi_request *nr;
+ unsigned char type;
struct ethhdr *eh;
- struct ncsi_cmd_handler *nch = NULL;
int i, ret;
+ /* Use OEM generic handler for Netlink request */
+ if (nca->req_flags == NCSI_REQ_FLAG_NETLINK_DRIVEN)
+ type = NCSI_PKT_CMD_OEM;
+ else
+ type = nca->type;
+
/* Search for the handler */
for (i = 0; i < ARRAY_SIZE(ncsi_cmd_handlers); i++) {
- if (ncsi_cmd_handlers[i].type == nca->type) {
+ if (ncsi_cmd_handlers[i].type == type) {
if (ncsi_cmd_handlers[i].handler)
nch = &ncsi_cmd_handlers[i];
else
--
2.9.3
^ permalink raw reply related
* Re: [PATCH 24/38] cls_u32: Convert tc_u_common->handle_idr to XArray
From: Jakub Kicinski @ 2019-08-21 21:13 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: netdev
In-Reply-To: <20190820223259.22348-25-willy@infradead.org>
On Tue, 20 Aug 2019 15:32:45 -0700, Matthew Wilcox wrote:
> @@ -305,8 +306,12 @@ static void *u32_get(struct tcf_proto *tp, u32 handle)
> /* Protected by rtnl lock */
> static u32 gen_new_htid(struct tc_u_common *tp_c, struct tc_u_hnode *ptr)
> {
> - int id = idr_alloc_cyclic(&tp_c->handle_idr, ptr, 1, 0x7FF, GFP_KERNEL);
> - if (id < 0)
> + int err;
> + u32 id;
> +
> + err = xa_alloc_cyclic(&tp_c->ht_xa, &id, ptr, XA_LIMIT(0, 0x7ff),
> + &tp_c->ht_next, GFP_KERNEL);
nit: indentation seems off here and a couple of other places.
^ permalink raw reply
* Re: libbpf distro packaging
From: Jiri Olsa @ 2019-08-21 21:09 UTC (permalink / raw)
To: Julia Kartseva
Cc: Andrii Nakryiko, labbott@redhat.com, acme@kernel.org,
debian-kernel@lists.debian.org, netdev@vger.kernel.org,
Andrii Nakryiko, Andrey Ignatov, Alexei Starovoitov,
Yonghong Song, jolsa@kernel.org
In-Reply-To: <A770810D-591E-4292-AEFA-563724B6D6CB@fb.com>
On Tue, Aug 20, 2019 at 10:27:23PM +0000, Julia Kartseva wrote:
>
>
> On 8/19/19, 11:08 AM, "Julia Kartseva" <hex@fb.com> wrote:
>
> On 8/13/19, 11:24 AM, "Andrii Nakryiko" <andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Aug 13, 2019 at 5:26 AM Jiri Olsa <jolsa@redhat.com> wrote:
> >
> > On Mon, Aug 12, 2019 at 07:04:12PM +0000, Julia Kartseva wrote:
> > > I would like to bring up libbpf publishing discussion started at [1].
> > > The present state of things is that libbpf is built from kernel tree, e.g. [2]
> > > For Debian and [3] for Fedora whereas the better way would be having a
> > > package built from github mirror. The advantages of the latter:
> > > - Consistent, ABI matching versioning across distros
> > > - The mirror has integration tests
> > > - No need in kernel tree to build a package
> > > - Changes can be merged directly to github w/o waiting them to be merged
> > > through bpf-next -> net-next -> main
> > > There is a PR introducing a libbpf.spec which can be used as a starting point: [4]
> > > Any comments regarding the spec itself can be posted there.
> > > In the future it may be used as a source of truth.
> > > Please consider switching libbpf packaging to the github mirror instead
> > > of the kernel tree.
> > > Thanks
> > >
> > > [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.iovisor.org_g_iovisor-2Ddev_message_1521&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=zUrDY_Sp_5PqcGtRQPNeDA&m=prYVDiu3-aH1o2PWH4ZcP7lEQRCQAcTwcWPrJrtaroQ&s=dYAc2jLhFg0wtCZ_ms2HF5bWANoHzA3UMug5TNCeBtE&e=
> > > [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__packages.debian.org_sid_libbpf4.19&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=zUrDY_Sp_5PqcGtRQPNeDA&m=prYVDiu3-aH1o2PWH4ZcP7lEQRCQAcTwcWPrJrtaroQ&s=lq1MpF-bt6y6ZEtFc57eT-BO_wMBx8uUBACJooWbUYk&e=
> > > [3] https://urldefense.proofpoint.com/v2/url?u=http-3A__rpmfind.net_linux_RPM_fedora_devel_rawhide_x86-5F64_l_libbpf-2D5.3.0-2D0.rc2.git0.1.fc31.x86-5F64.html&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=zUrDY_Sp_5PqcGtRQPNeDA&m=prYVDiu3-aH1o2PWH4ZcP7lEQRCQAcTwcWPrJrtaroQ&s=NoolYHL57G2KhzE768iWdy6v5LD2GfJQyqPmtjy196E&e=
> > > [4] https://github.com/libbpf/libbpf/pull/64
> >
> > hi,
> > Fedora has libbpf as kernel-tools subpackage, so I think
> > we'd need to create new package and deprecate the current
> >
> > but I like the ABI stability by using github .. how's actually
> > the sync (in both directions) with kernel sources going on?
>
> Sync is always in one direction, from kernel sources into Github repo.
> Right now it's triggered by a human (usually me), but we are using a
> script that automates entire process (see
> https://github.com/libbpf/libbpf/blob/master/scripts/sync-kernel.sh).
> It cherry-pick relevant commits from kernel, transforms them to match
> Github's file layout and re-applies those changes to Github repo.
>
> There is never a sync from Github back to kernel, but Github repo
> contains some extra stuff that's not in kernel. E.g., the script I
> mentioned, plus Github's Makefile is different, because it can't rely
> on kernel's kbuild setup.
>
> Hi Jiri,
> I'm curious if you have any comments regarding sync procedure described
> By Andrii. Or if there is anything else you'd like us to address so Fedora
> can be switched to libbpf built from the github mirror?
hi,
yea, I think it's ok.. just need to check the implications
for rhel packaging and I'll let you know
jirka
^ permalink raw reply
* [PATCH bpf] bpf: fix precision tracking in presence of bpf2bpf calls
From: Alexei Starovoitov @ 2019-08-21 21:07 UTC (permalink / raw)
To: davem; +Cc: daniel, netdev, bpf, kernel-team
While adding extra tests for precision tracking and extra infra
to adjust verifier heuristics the existing test
"calls: cross frame pruning - liveness propagation" started to fail.
The root cause is the same as described in verifer.c comment:
* Also if parent's curframe > frame where backtracking started,
* the verifier need to mark registers in both frames, otherwise callees
* may incorrectly prune callers. This is similar to
* commit 7640ead93924 ("bpf: verifier: make sure callees don't prune with caller differences")
* For now backtracking falls back into conservative marking.
Turned out though that returning -ENOTSUPP from backtrack_insn() and
doing mark_all_scalars_precise() in the current parentage chain is not enough.
Depending on how is_state_visited() heuristic is creating parentage chain
it's possible that callee will incorrectly prune caller.
Fix the issue by setting precise=true earlier and more aggressively.
Before this fix the precision tracking _within_ functions that don't do
bpf2bpf calls would still work. Whereas now precision tracking is completely
disabled when bpf2bpf calls are present anywhere in the program.
No difference in cilium tests (they don't have bpf2bpf calls).
No difference in test_progs though some of them have bpf2bpf calls,
but precision tracking wasn't effective there.
Fixes: b5dc0163d8fd ("bpf: precise scalar_value tracking")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
Separate set of tests and infra for them will go into bpf-next.
---
kernel/bpf/verifier.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c84d83f86141..b5c14c9d7b98 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -985,9 +985,6 @@ static void __mark_reg_unbounded(struct bpf_reg_state *reg)
reg->smax_value = S64_MAX;
reg->umin_value = 0;
reg->umax_value = U64_MAX;
-
- /* constant backtracking is enabled for root only for now */
- reg->precise = capable(CAP_SYS_ADMIN) ? false : true;
}
/* Mark a register as having a completely unknown (scalar) value. */
@@ -1014,7 +1011,11 @@ static void mark_reg_unknown(struct bpf_verifier_env *env,
__mark_reg_not_init(regs + regno);
return;
}
- __mark_reg_unknown(regs + regno);
+ regs += regno;
+ __mark_reg_unknown(regs);
+ /* constant backtracking is enabled for root without bpf2bpf calls */
+ regs->precise = env->subprog_cnt > 1 || !env->allow_ptr_leaks ?
+ true : false;
}
static void __mark_reg_not_init(struct bpf_reg_state *reg)
--
2.20.0
^ permalink raw reply related
* Re: [RFC bpf-next 0/5] Convert iproute2 to use libbpf (WIP)
From: Toke Høiland-Jørgensen @ 2019-08-21 21:07 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Stephen Hemminger, Daniel Borkmann, Alexei Starovoitov,
Martin KaFai Lau, Song Liu, Yonghong Song, David Miller,
Jesper Dangaard Brouer, Networking, bpf
In-Reply-To: <CAEf4BzZxb7qZabw6aDVaTqnhr3AGtwEo+DbuBR9U9tJr+qVuyg@mail.gmail.com>
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> On Tue, Aug 20, 2019 at 4:47 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> iproute2 uses its own bpf loader to load eBPF programs, which has
>> evolved separately from libbpf. Since we are now standardising on
>> libbpf, this becomes a problem as iproute2 is slowly accumulating
>> feature incompatibilities with libbpf-based loaders. In particular,
>> iproute2 has its own (expanded) version of the map definition struct,
>> which makes it difficult to write programs that can be loaded with both
>> custom loaders and iproute2.
>>
>> This series seeks to address this by converting iproute2 to using libbpf
>> for all its bpf needs. This version is an early proof-of-concept RFC, to
>> get some feedback on whether people think this is the right direction.
>>
>> What this series does is the following:
>>
>> - Updates the libbpf map definition struct to match that of iproute2
>> (patch 1).
>
>
> Hi Toke,
>
> Thanks for taking a stab at unifying libbpf and iproute2 loaders. I'm
> totally in support of making iproute2 use libbpf to load/initialize
> BPF programs. But I'm against adding iproute2-specific fields to
> libbpf's bpf_map_def definitions to support this.
>
> I've proposed the plan of extending libbpf's supported features so
> that it can be used to load iproute2-style BPF programs earlier,
> please see discussions in [0] and [1].
Yeah, I've seen that discussion, and agree that longer term this is
probably a better way to do map-in-map definitions.
However, I view your proposal as complementary to this series: we'll
probably also want the BTF-based definition to work with iproute2, and
that means iproute2 needs to be ported to libbpf. But iproute2 needs to
be backwards compatible with the format it supports now, and, well, this
series is the simplest way to achieve that IMO :)
> I think instead of emulating iproute2 way of matching everything based
> on user-specified internal IDs, which doesn't provide good user
> experience and is quite easy to get wrong, we should support same
> scenarios with better declarative syntax and in a less error-prone
> way. I believe we can do that by relying on BTF more heavily (again,
> please check some of my proposals in [0], [1], and discussion with
> Daniel in those threads). It will feel more natural and be more
> straightforward to follow. It would be great if you can lend a hand in
> implementing pieces of that plan!
>
> I'm currently on vacation, so my availability is very sparse, but I'd
> be happy to discuss this further, if need be.
Happy to collaborate on your proposal when you're back from vacation;
but as I said above, I believe this is a complementary longer-term
thing...
-Toke
^ permalink raw reply
* Re: [PATCH net-next v2 0/3] net: dsa: mt7530: Convert to PHYLINK and add support for port 5
From: Andrew Lunn @ 2019-08-21 21:05 UTC (permalink / raw)
To: René van Dorst
Cc: Sean Wang, Vivien Didelot, Florian Fainelli, David S . Miller,
Matthias Brugger, netdev, linux-arm-kernel, linux-mediatek,
John Crispin, linux-mips, Frank Wunderlich
In-Reply-To: <20190821144547.15113-1-opensource@vdorst.com>
On Wed, Aug 21, 2019 at 04:45:44PM +0200, René van Dorst wrote:
> 1. net: dsa: mt7530: Convert to PHYLINK API
> This patch converts mt7530 to PHYLINK API.
> 2. dt-bindings: net: dsa: mt7530: Add support for port 5
> 3. net: dsa: mt7530: Add support for port 5
> These 2 patches adding support for port 5 of the switch.
>
> v1->v2:
> * Mostly phylink improvements after review.
Hi René
You are addressing comments mostly from Russell King. It would of been
good to Cc: him on the patchset.
Andrew
^ permalink raw reply
* Re: [RFC bpf-next 0/5] Convert iproute2 to use libbpf (WIP)
From: Toke Høiland-Jørgensen @ 2019-08-21 21:00 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Stephen Hemminger, Daniel Borkmann, Alexei Starovoitov,
Martin KaFai Lau, Song Liu, Yonghong Song, David Miller,
Jesper Dangaard Brouer, netdev, bpf
In-Reply-To: <20190821192611.xmciiiqjpkujjup7@ast-mbp.dhcp.thefacebook.com>
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> On Tue, Aug 20, 2019 at 01:47:01PM +0200, Toke Høiland-Jørgensen wrote:
>> iproute2 uses its own bpf loader to load eBPF programs, which has
>> evolved separately from libbpf. Since we are now standardising on
>> libbpf, this becomes a problem as iproute2 is slowly accumulating
>> feature incompatibilities with libbpf-based loaders. In particular,
>> iproute2 has its own (expanded) version of the map definition struct,
>> which makes it difficult to write programs that can be loaded with both
>> custom loaders and iproute2.
>>
>> This series seeks to address this by converting iproute2 to using libbpf
>> for all its bpf needs. This version is an early proof-of-concept RFC, to
>> get some feedback on whether people think this is the right direction.
>>
>> What this series does is the following:
>>
>> - Updates the libbpf map definition struct to match that of iproute2
>> (patch 1).
>> - Adds functionality to libbpf to support automatic pinning of maps when
>> loading an eBPF program, while re-using pinned maps if they already
>> exist (patches 2-3).
>> - Modifies iproute2 to make it possible to compile it against libbpf
>> without affecting any existing functionality (patch 4).
>> - Changes the iproute2 eBPF loader to use libbpf for loading XDP
>> programs (patch 5).
>>
>>
>> As this is an early PoC, there are still a few missing pieces before
>> this can be merged. Including (but probably not limited to):
>>
>> - Consolidate the map definition struct in the bpf_helpers.h file in the
>> kernel tree. This contains a different, and incompatible, update to
>> the struct. Since the iproute2 version has actually been released for
>> use outside the kernel tree (and thus is subject to API stability
>> constraints), I think it makes the most sense to keep that, and port
>> the selftests to use it.
>
> It sounds like you're implying that existing libbpf format is not
> uapi.
No, that's not what I meant... See below.
> It is and we cannot break it.
> If patch 1 means breakage for existing pre-compiled .o that won't load
> with new libbpf then we cannot use this method.
> Recompiling .o with new libbpf definition of bpf_map_def isn't an option.
> libbpf has to be smart before/after and recognize both old and iproute2 format.
The libbpf.h definition of struct bpf_map_def is compatible with the one
used in iproute2. In libbpf.h, the struct only contains five fields
(type, key_size, value_size, max_entries and flags), and iproute2 adds
another 4 (id, pinning, inner_id and inner_idx; these are the ones in
patch 1 in this series).
The issue I was alluding to above is that the bpf_helpers.h file in the
kernel selftests directory *also* extends the bpf_map_def struct, and
adds two *different* fields (inner_map_idx and numa_mode). The former is
used to implement the same map-in-map definition functionality that
iproute2 has, but with different semantics. The latter is additional to
that, and I'm planning to add that to this series.
Since bpf_helpers.h is *not* part of libbpf (yet), this will make it
possible to keep API (and ABI) compatibility with both iproute2 and
libbpf. As in, old .o files will still load with libbpf after this
series, they just won't be able to use the new automatic pinning
feature.
-Toke
^ permalink raw reply
* Re: [PATCH net] net: cpsw: fix NULL pointer exception in the probe error path
From: David Miller @ 2019-08-21 21:00 UTC (permalink / raw)
To: antoine.tenart
Cc: grygorii.strashko, netdev, linux-omap, linux-kernel,
maxime.chevallier
In-Reply-To: <20190821144123.22248-1-antoine.tenart@bootlin.com>
From: Antoine Tenart <antoine.tenart@bootlin.com>
Date: Wed, 21 Aug 2019 16:41:23 +0200
> In certain cases when the probe function fails the error path calls
> cpsw_remove_dt() before calling platform_set_drvdata(). This is an
> issue as cpsw_remove_dt() uses platform_get_drvdata() to retrieve the
> cpsw_common data and leds to a NULL pointer exception. This patches
> fixes it by calling platform_set_drvdata() earlier in the probe.
>
> Fixes: 83a8471ba255 ("net: ethernet: ti: cpsw: refactor probe to group common hw initialization")
> Reported-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
> Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Applied and queued up for -stable, thanks.
^ permalink raw reply
* Re: [PATCH net-next] net: stmmac: dwc-qos: use devm_platform_ioremap_resource() to simplify code
From: David Miller @ 2019-08-21 20:53 UTC (permalink / raw)
To: yuehaibing
Cc: peppe.cavallaro, alexandre.torgue, joabreu, khilman,
mcoquelin.stm32, linux-kernel, netdev, linux-arm-kernel,
linux-amlogic, linux-stm32
In-Reply-To: <20190821135701.46780-1-yuehaibing@huawei.com>
From: YueHaibing <yuehaibing@huawei.com>
Date: Wed, 21 Aug 2019 21:57:01 +0800
> Use devm_platform_ioremap_resource() to simplify the code a bit.
> This is detected by coccinelle.
>
> Reported-by: Hulk Robot <hulkci@huawei.com>
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next] net: stmmac: dwmac-anarion: use devm_platform_ioremap_resource() to simplify code
From: David Miller @ 2019-08-21 20:53 UTC (permalink / raw)
To: yuehaibing
Cc: peppe.cavallaro, alexandre.torgue, joabreu, khilman,
mcoquelin.stm32, linux-kernel, netdev, linux-arm-kernel,
linux-amlogic, linux-stm32
In-Reply-To: <20190821135550.55200-1-yuehaibing@huawei.com>
From: YueHaibing <yuehaibing@huawei.com>
Date: Wed, 21 Aug 2019 21:55:50 +0800
> Use devm_platform_ioremap_resource() to simplify the code a bit.
> This is detected by coccinelle.
>
> Reported-by: Hulk Robot <hulkci@huawei.com>
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next] net: stmmac: dwmac-meson: use devm_platform_ioremap_resource() to simplify code
From: David Miller @ 2019-08-21 20:53 UTC (permalink / raw)
To: yuehaibing
Cc: peppe.cavallaro, alexandre.torgue, joabreu, khilman,
mcoquelin.stm32, linux-kernel, netdev, linux-arm-kernel,
linux-amlogic, linux-stm32
In-Reply-To: <20190821135406.26200-1-yuehaibing@huawei.com>
From: YueHaibing <yuehaibing@huawei.com>
Date: Wed, 21 Aug 2019 21:54:06 +0800
> Use devm_platform_ioremap_resource() to simplify the code a bit.
> This is detected by coccinelle.
>
> Reported-by: Hulk Robot <hulkci@huawei.com>
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next] net: stmmac: dwmac-meson8b: use devm_platform_ioremap_resource() to simplify code
From: David Miller @ 2019-08-21 20:53 UTC (permalink / raw)
To: yuehaibing
Cc: peppe.cavallaro, alexandre.torgue, joabreu, khilman,
mcoquelin.stm32, linux-kernel, netdev, linux-arm-kernel,
linux-amlogic, linux-stm32
In-Reply-To: <20190821135130.68636-1-yuehaibing@huawei.com>
From: YueHaibing <yuehaibing@huawei.com>
Date: Wed, 21 Aug 2019 21:51:30 +0800
> Use devm_platform_ioremap_resource() to simplify the code a bit.
> This is detected by coccinelle.
>
> Reported-by: Hulk Robot <hulkci@huawei.com>
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Applied.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox