* [RFC v3 net-next 0/5] eBPF and struct scatterlist
From: Tushar Dave @ 2018-08-17 23:08 UTC
To: john.fastabend, ast, daniel, davem, sowmini.varadhan,
santosh.shilimkar, jakub.kicinski, quentin.monnet, jiong.wang,
sandipan, kafai, rdna, yhs, netdev
This is v3 of the RFC sent earlier (https://patchwork.ozlabs.org/cover/931785/).
v2->v3:
- Per the review feedback received, this patchset reuses as much code as
  possible from sockmap/sk_msg, e.g. it uses the existing struct
  sk_msg_buff, struct sk_msg_md, sk_msg_is_valid_access, and part of the
  code from sk_msg_convert_ctx_access.
- The bpf helper bpf_msg_pull_data() is used to access packet data. Some
  issues found with bpf_msg_pull_data() are fixed in patch 3.
- Feedback was given that an unprivileged user can attach a new
  BPF_PROG_TYPE_SOCKET_SG_FILTER program to a non-RDS socket (e.g. a
  normal TCP/UDP socket) through the SO_ATTACH_BPF sockopt, where the
  input context is an skb instead of an sg list, and that this can cause
  issues. However, I found that an unprivileged user can attach any kind
  of eBPF program to a socket using SO_ATTACH_BPF, not only socksg. If
  an eBPF program is faulty, the kernel BPF verifier takes care of it:
  it invalidates any access to kernel data and does not let the eBPF
  program run.
- socksg programs now return an action code (e.g. SOCKSG_PASS).
Background:
The motivation for this work is to allow eBPF-based firewalling for
kernel modules that do not always get their packets as an sk_buff from
their downlink drivers. One such instance of this use case is RDS, which
can run either over IB (the driver RDMAs a scatterlist to the RDS
module) or over TCP (TCP passes an sk_buff to the RDS module).
This patchset uses the existing socket filter infrastructure and
extends it with a new eBPF program type that deals with struct
scatterlist. The existing bpf helper bpf_msg_pull_data() is used to
inspect packet data that is in the form of a struct scatterlist. For
RDS, the integrated approach treats the scatterlist as the common
denominator and allows the application to write a filter for processing
a scatterlist.
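As an illustration of the programming model, a minimal socksg filter
could look like the sketch below (a sketch only, not part of the
patches; the real sample is in patch 5, and SOCKSG_PASS/SOCKSG_DROP and
the "socksg" section name are introduced by this series; license
section and loader glue elided):

	SEC("socksg")
	int socksg_drop_zero(struct sk_msg_md *msg)
	{
		unsigned char *d;

		/* linearize bytes [0, 1) of the sg list into msg->data */
		if (bpf_msg_pull_data(msg, 0, 1, 0))
			return SOCKSG_PASS;

		/* bounds check required by the verifier */
		if (msg->data + 1 > msg->data_end)
			return SOCKSG_PASS;

		d = (unsigned char *)msg->data;

		/* return the action code; here, flag messages whose
		 * first byte is zero for dropping
		 */
		return d[0] == 0 ? SOCKSG_DROP : SOCKSG_PASS;
	}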
Details:
Patch 1 adds the new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER,
which uses the existing socket filter infrastructure for bpf program
attach and load. An eBPF program of type BPF_PROG_TYPE_SOCKET_SG_FILTER
deals with struct scatterlist as its bpf context, in contrast to
BPF_PROG_TYPE_SOCKET_FILTER, which deals with struct sk_buff. This new
eBPF program type allows a socket filter to run on packet data that is
in the form of a struct scatterlist.
Patch 2 adds sg_filter_run(), which runs a BPF_PROG_TYPE_SOCKET_SG_FILTER
program.
Patch 3 fixes bpf_msg_pull_data() for bugs that were found while
experimenting with different packet sizes.
Patch 4 allows rds_recv_incoming() to invoke the socket filter program
that deals with struct scatterlist.
Patch 5 adds a socket filter eBPF sample program that uses patches 1 to
4. The sample program opens an RDS socket, attaches an eBPF program
(socksg, i.e. BPF_PROG_TYPE_SOCKET_SG_FILTER) to the RDS socket, and
uses the bpf_msg_pull_data() helper to inspect RDS packet data. As a
test, the current sample program only prints the first few bytes of
packet data.
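Attaching uses the existing SO_ATTACH_BPF path; the userspace side is
roughly the following (a sketch based on the sample in patch 5, error
handling elided; prog_fd[] is filled in by the samples/bpf loader,
load_bpf_file()):

	int sock = socket(PF_RDS, SOCK_SEQPACKET, 0);

	/* attach the loaded socksg program to the RDS socket */
	setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
		   sizeof(prog_fd[0]));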
Testing:
To confirm data accuracy and results, RDS packets of various sizes have
been tested with the socksg program, along with various start and end
values for bpf_msg_pull_data(). All such tests show accurate results.
Thanks.
-Tushar
Tushar Dave (5):
eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
ebpf: Add sg_filter_run()
ebpf: fix bpf_msg_pull_data
rds: invoke socket sg filter attached to rds socket
ebpf: Add sample ebpf program for SOCKET_SG_FILTER
include/linux/bpf_types.h | 1 +
include/linux/filter.h | 8 +
include/uapi/linux/bpf.h | 7 +
kernel/bpf/syscall.c | 1 +
kernel/bpf/verifier.c | 1 +
net/core/filter.c | 140 +++++++++++++----
net/rds/ib.c | 1 +
net/rds/ib.h | 1 +
net/rds/ib_recv.c | 12 ++
net/rds/rds.h | 2 +
net/rds/recv.c | 17 +++
net/rds/tcp.c | 2 +
net/rds/tcp.h | 2 +
net/rds/tcp_recv.c | 38 +++++
samples/bpf/Makefile | 3 +
samples/bpf/bpf_load.c | 11 +-
samples/bpf/rds_filter_kern.c | 42 +++++
samples/bpf/rds_filter_user.c | 339 +++++++++++++++++++++++++++++++++++++++++
tools/bpf/bpftool/prog.c | 1 +
tools/include/uapi/linux/bpf.h | 7 +
tools/lib/bpf/libbpf.c | 3 +
tools/lib/bpf/libbpf.h | 2 +
22 files changed, 607 insertions(+), 34 deletions(-)
create mode 100644 samples/bpf/rds_filter_kern.c
create mode 100644 samples/bpf/rds_filter_user.c
--
1.8.3.1
* [RFC v3 net-next 1/5] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
From: Tushar Dave @ 2018-08-17 23:08 UTC
To: john.fastabend, ast, daniel, davem, sowmini.varadhan,
santosh.shilimkar, jakub.kicinski, quentin.monnet, jiong.wang,
sandipan, kafai, rdna, yhs, netdev
Add the new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER, which uses
the existing socket filter infrastructure for bpf program attach and
load. A SOCKET_SG_FILTER eBPF program receives a struct scatterlist as
its bpf context, in contrast to SOCKET_FILTER, which deals with a
struct sk_buff. This is useful for kernel entities that don't have an
skb to represent packet data but want to run an eBPF socket filter on
packet data that is in the form of a struct scatterlist, e.g. IB/RDMA.
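With the bpf_load.c and libbpf changes below, the program type is
selected by placing the program in a "socksg" ELF section. For libbpf
users, the type can also be set explicitly through the new helpers; a
sketch (prog is a struct bpf_program * obtained from a bpf_object
before loading):

	bpf_program__set_socket_sg_filter(prog);
	/* after this, bpf_program__is_socket_sg_filter(prog) is true */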
Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
include/linux/bpf_types.h | 1 +
include/uapi/linux/bpf.h | 1 +
kernel/bpf/syscall.c | 1 +
kernel/bpf/verifier.c | 1 +
net/core/filter.c | 55 ++++++++++++++++++++++++++++++++++++++++--
samples/bpf/bpf_load.c | 11 ++++++---
tools/bpf/bpftool/prog.c | 1 +
tools/include/uapi/linux/bpf.h | 1 +
tools/lib/bpf/libbpf.c | 3 +++
tools/lib/bpf/libbpf.h | 2 ++
10 files changed, 72 insertions(+), 5 deletions(-)
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index cd26c09..7dc1503 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -16,6 +16,7 @@
BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SOCKET_SG_FILTER, socksg_filter)
#endif
#ifdef CONFIG_BPF_EVENTS
BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 66917a4..6ec1e32 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_LWT_SEG6LOCAL,
BPF_PROG_TYPE_LIRC_MODE2,
BPF_PROG_TYPE_SK_REUSEPORT,
+ BPF_PROG_TYPE_SOCKET_SG_FILTER,
};
enum bpf_attach_type {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8339d81..160cdb2 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1362,6 +1362,7 @@ static int bpf_prog_load(union bpf_attr *attr)
if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
type != BPF_PROG_TYPE_CGROUP_SKB &&
+ type != BPF_PROG_TYPE_SOCKET_SG_FILTER &&
!capable(CAP_SYS_ADMIN))
return -EPERM;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ca90679..5abc788 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1321,6 +1321,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
case BPF_PROG_TYPE_LWT_XMIT:
case BPF_PROG_TYPE_SK_SKB:
case BPF_PROG_TYPE_SK_MSG:
+ case BPF_PROG_TYPE_SOCKET_SG_FILTER:
if (meta)
return meta->pkt_access;
diff --git a/net/core/filter.c b/net/core/filter.c
index fd423ce..cec3807 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1140,7 +1140,8 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
static void __bpf_prog_release(struct bpf_prog *prog)
{
- if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER) {
+ if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER ||
+ prog->type == BPF_PROG_TYPE_SOCKET_SG_FILTER) {
bpf_prog_put(prog);
} else {
bpf_release_orig_filter(prog);
@@ -1539,10 +1540,16 @@ int sk_reuseport_attach_filter(struct sock_fprog *fprog, struct sock *sk)
static struct bpf_prog *__get_bpf(u32 ufd, struct sock *sk)
{
+ struct bpf_prog *prog;
+
if (sock_flag(sk, SOCK_FILTER_LOCKED))
return ERR_PTR(-EPERM);
- return bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
+ prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
+ if (IS_ERR(prog))
+ prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_SG_FILTER);
+
+ return prog;
}
int sk_attach_bpf(u32 ufd, struct sock *sk)
@@ -4920,6 +4927,17 @@ bool bpf_helper_changes_pkt_data(void *func)
}
static const struct bpf_func_proto *
+socksg_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ switch (func_id) {
+ case BPF_FUNC_msg_pull_data:
+ return &bpf_msg_pull_data_proto;
+ default:
+ return bpf_base_func_proto(func_id);
+ }
+}
+
+static const struct bpf_func_proto *
tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
switch (func_id) {
@@ -6738,6 +6756,30 @@ static u32 sk_skb_convert_ctx_access(enum bpf_access_type type,
return insn - insn_buf;
}
+static u32 socksg_filter_convert_ctx_access(enum bpf_access_type type,
+ const struct bpf_insn *si,
+ struct bpf_insn *insn_buf,
+ struct bpf_prog *prog,
+ u32 *target_size)
+{
+ struct bpf_insn *insn = insn_buf;
+
+ switch (si->off) {
+ case offsetof(struct sk_msg_md, data):
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_buff, data),
+ si->dst_reg, si->src_reg,
+ offsetof(struct sk_msg_buff, data));
+ break;
+ case offsetof(struct sk_msg_md, data_end):
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_buff, data_end),
+ si->dst_reg, si->src_reg,
+ offsetof(struct sk_msg_buff, data_end));
+ break;
+ }
+
+ return insn - insn_buf;
+}
+
static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
const struct bpf_insn *si,
struct bpf_insn *insn_buf,
@@ -6876,6 +6918,15 @@ static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
.test_run = bpf_prog_test_run_skb,
};
+const struct bpf_verifier_ops socksg_filter_verifier_ops = {
+ .get_func_proto = socksg_filter_func_proto,
+ .is_valid_access = sk_msg_is_valid_access,
+ .convert_ctx_access = socksg_filter_convert_ctx_access,
+};
+
+const struct bpf_prog_ops socksg_filter_prog_ops = {
+};
+
const struct bpf_verifier_ops tc_cls_act_verifier_ops = {
.get_func_proto = tc_cls_act_func_proto,
.is_valid_access = tc_cls_act_is_valid_access,
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 904e775..3b1697d 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -69,6 +69,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
bool is_sockops = strncmp(event, "sockops", 7) == 0;
bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0;
bool is_sk_msg = strncmp(event, "sk_msg", 6) == 0;
+ bool is_socksg = strncmp(event, "socksg", 6) == 0;
+
size_t insns_cnt = size / sizeof(struct bpf_insn);
enum bpf_prog_type prog_type;
char buf[256];
@@ -102,6 +104,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
prog_type = BPF_PROG_TYPE_SK_SKB;
} else if (is_sk_msg) {
prog_type = BPF_PROG_TYPE_SK_MSG;
+ } else if (is_socksg) {
+ prog_type = BPF_PROG_TYPE_SOCKET_SG_FILTER;
} else {
printf("Unknown event '%s'\n", event);
return -1;
@@ -122,8 +126,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk)
return 0;
- if (is_socket || is_sockops || is_sk_skb || is_sk_msg) {
- if (is_socket)
+ if (is_socket || is_sockops || is_sk_skb || is_sk_msg || is_socksg) {
+ if (is_socket || is_socksg)
event += 6;
else
event += 7;
@@ -627,7 +631,8 @@ static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map)
memcmp(shname, "cgroup/", 7) == 0 ||
memcmp(shname, "sockops", 7) == 0 ||
memcmp(shname, "sk_skb", 6) == 0 ||
- memcmp(shname, "sk_msg", 6) == 0) {
+ memcmp(shname, "sk_msg", 6) == 0 ||
+ memcmp(shname, "socksg", 6) == 0) {
ret = load_and_attach(shname, data->d_buf,
data->d_size);
if (ret != 0)
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index dce960d..9c57c4e 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -74,6 +74,7 @@
[BPF_PROG_TYPE_RAW_TRACEPOINT] = "raw_tracepoint",
[BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
[BPF_PROG_TYPE_LIRC_MODE2] = "lirc_mode2",
+ [BPF_PROG_TYPE_SOCKET_SG_FILTER] = "socket_sg_filter",
};
static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 66917a4..6ec1e32 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_LWT_SEG6LOCAL,
BPF_PROG_TYPE_LIRC_MODE2,
BPF_PROG_TYPE_SK_REUSEPORT,
+ BPF_PROG_TYPE_SOCKET_SG_FILTER,
};
enum bpf_attach_type {
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2abd0f1..a7ac51c 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1502,6 +1502,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
case BPF_PROG_TYPE_LIRC_MODE2:
case BPF_PROG_TYPE_SK_REUSEPORT:
+ case BPF_PROG_TYPE_SOCKET_SG_FILTER:
return false;
case BPF_PROG_TYPE_UNSPEC:
case BPF_PROG_TYPE_KPROBE:
@@ -2077,6 +2078,7 @@ static bool bpf_program__is_type(struct bpf_program *prog,
BPF_PROG_TYPE_FNS(raw_tracepoint, BPF_PROG_TYPE_RAW_TRACEPOINT);
BPF_PROG_TYPE_FNS(xdp, BPF_PROG_TYPE_XDP);
BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT);
+BPF_PROG_TYPE_FNS(socket_sg_filter, BPF_PROG_TYPE_SOCKET_SG_FILTER);
void bpf_program__set_expected_attach_type(struct bpf_program *prog,
enum bpf_attach_type type)
@@ -2129,6 +2131,7 @@ void bpf_program__set_expected_attach_type(struct bpf_program *prog,
BPF_SA_PROG_SEC("cgroup/sendmsg6", BPF_CGROUP_UDP6_SENDMSG),
BPF_S_PROG_SEC("cgroup/post_bind4", BPF_CGROUP_INET4_POST_BIND),
BPF_S_PROG_SEC("cgroup/post_bind6", BPF_CGROUP_INET6_POST_BIND),
+ BPF_PROG_SEC("socksg", BPF_PROG_TYPE_SOCKET_SG_FILTER),
};
#undef BPF_PROG_SEC
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 96c55fa..7527ea4 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -208,6 +208,7 @@ int bpf_program__set_prep(struct bpf_program *prog, int nr_instance,
void bpf_program__set_type(struct bpf_program *prog, enum bpf_prog_type type);
void bpf_program__set_expected_attach_type(struct bpf_program *prog,
enum bpf_attach_type type);
+int bpf_program__set_socket_sg_filter(struct bpf_program *prog);
bool bpf_program__is_socket_filter(struct bpf_program *prog);
bool bpf_program__is_tracepoint(struct bpf_program *prog);
@@ -217,6 +218,7 @@ void bpf_program__set_expected_attach_type(struct bpf_program *prog,
bool bpf_program__is_sched_act(struct bpf_program *prog);
bool bpf_program__is_xdp(struct bpf_program *prog);
bool bpf_program__is_perf_event(struct bpf_program *prog);
+bool bpf_program__is_socket_sg_filter(struct bpf_program *prog);
/*
* No need for __attribute__((packed)), all members of 'bpf_map_def'
--
1.8.3.1
* [RFC v3 net-next 2/5] ebpf: Add sg_filter_run()
From: Tushar Dave @ 2018-08-17 23:08 UTC
To: john.fastabend, ast, daniel, davem, sowmini.varadhan,
santosh.shilimkar, jakub.kicinski, quentin.monnet, jiong.wang,
sandipan, kafai, rdna, yhs, netdev
When sg_filter_run() is invoked, it runs the attached eBPF prog of type
BPF_PROG_TYPE_SOCKET_SG_FILTER, which deals with struct scatterlist.
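A caller that has its packet data in a scatterlist invokes it roughly
as follows (a sketch; patch 4 adds the real RDS call site, and acting
on the returned action code is left to the caller):

	int ret = sg_filter_run(sk, sg);

	switch (ret) {
	case __SOCKSG_PASS:
		/* deliver the message as usual */
		break;
	case __SOCKSG_DROP:
		/* free the message without delivering it */
		break;
	}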
Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
include/linux/filter.h | 8 ++++++++
include/uapi/linux/bpf.h | 6 ++++++
net/core/filter.c | 24 ++++++++++++++++++++++++
tools/include/uapi/linux/bpf.h | 6 ++++++
4 files changed, 44 insertions(+)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 5d565c5..9f1f7c1 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1112,4 +1112,12 @@ struct bpf_sock_ops_kern {
*/
};
+enum __socksg_action {
+ __SOCKSG_DROP = 0,
+ __SOCKSG_PASS,
+ __SOCKSG_REDIRECT,
+};
+
+int sg_filter_run(struct sock *sk, struct scatterlist *sg);
+
#endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6ec1e32..d1d0ceb 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2428,6 +2428,12 @@ enum sk_action {
SK_PASS,
};
+enum socksg_action {
+ SOCKSG_DROP = 0,
+ SOCKSG_PASS,
+ SOCKSG_REDIRECT,
+};
+
/* user accessible metadata for SK_MSG packet hook, new fields must
* be added to the end of this structure
*/
diff --git a/net/core/filter.c b/net/core/filter.c
index cec3807..e427c8e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -121,6 +121,30 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
}
EXPORT_SYMBOL(sk_filter_trim_cap);
+int sg_filter_run(struct sock *sk, struct scatterlist *sg)
+{
+ struct sk_filter *filter;
+ int result;
+
+ rcu_read_lock();
+ filter = rcu_dereference(sk->sk_filter);
+ if (filter) {
+ struct sk_msg_buff mb = {0};
+
+ memcpy(mb.sg_data, sg, sizeof(*sg) * MAX_SKB_FRAGS);
+ mb.sg_start = 0;
+ mb.sg_end = sg_nents(sg) - 1;
+ mb.data = sg_virt(sg);
+ mb.data_end = mb.data + sg->length;
+ mb.sg_copy[mb.sg_end] = true;
+ result = BPF_PROG_RUN(filter->prog, &mb);
+ }
+ rcu_read_unlock();
+
+ return result;
+}
+EXPORT_SYMBOL(sg_filter_run);
+
BPF_CALL_1(bpf_skb_get_pay_offset, struct sk_buff *, skb)
{
return skb_get_poff(skb);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 6ec1e32..d1d0ceb 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2428,6 +2428,12 @@ enum sk_action {
SK_PASS,
};
+enum socksg_action {
+ SOCKSG_DROP = 0,
+ SOCKSG_PASS,
+ SOCKSG_REDIRECT,
+};
+
/* user accessible metadata for SK_MSG packet hook, new fields must
* be added to the end of this structure
*/
--
1.8.3.1
* [RFC v3 net-next 3/5] ebpf: fix bpf_msg_pull_data
From: Tushar Dave @ 2018-08-17 23:08 UTC
To: john.fastabend, ast, daniel, davem, sowmini.varadhan,
santosh.shilimkar, jakub.kicinski, quentin.monnet, jiong.wang,
sandipan, kafai, rdna, yhs, netdev
Like sockmap (sk_msg), socksg also deals with struct scatterlist;
therefore socksg programs can use the existing bpf helper
bpf_msg_pull_data to access packet data contained in a struct
scatterlist. While doing some preliminary testing, a couple of issues
were found with bpf_msg_pull_data; they are fixed in this patch.

Also, there cannot be more than MAX_SKB_FRAGS entries in sg_data;
therefore, any checks for sg entries beyond MAX_SKB_FRAGS in
bpf_msg_pull_data() are removed.
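As a concrete example of the offset accounting (hypothetical sg
lengths): with three sg entries of lengths 100, 200 and 300, a call
with start=150 and end=250 skips entry 0 (offset becomes 100), finds
first_sg=1, sees that [start, end) lies entirely within sg[1], and sets
msg->data = sg_virt(&sg[1]) + 150 - 100 without linearizing. With
end=350 instead, the range spans sg[1] and sg[2], so both are copied
into a single linear page.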
Besides that, I also ran into issues when put_page() is invoked, e.g.:
[ 450.568723] BUG: Bad page state in process swapper/10 pfn:2021540
[ 450.575632] page:ffffea0080855000 count:0 mapcount:0
mapping:ffff88103d006840 index:0xffff882021540000 compound_mapcount: 0
[ 450.588069] flags: 0x6fffff80008100(slab|head)
[ 450.593033] raw: 006fffff80008100 dead000000000100 dead000000000200
ffff88103d006840
[ 450.601683] raw: ffff882021540000 0000000080080007 00000000ffffffff
0000000000000000
[ 450.610337] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
[ 450.617530] bad because of flags: 0x100(slab)
To avoid the above issue, put_page() is temporarily disabled in this
patch. I am working on alternatives so that a page allocated via slab
(in this case) can be freed without any issue.
Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
net/core/filter.c | 61 +++++++++++++++++++++++++++++--------------------------
1 file changed, 32 insertions(+), 29 deletions(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index e427c8e..cc52baa 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2316,7 +2316,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
BPF_CALL_4(bpf_msg_pull_data,
struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags)
{
- unsigned int len = 0, offset = 0, copy = 0;
+ unsigned int len = 0, offset = 0, copy = 0, off = 0;
struct scatterlist *sg = msg->sg_data;
int first_sg, last_sg, i, shift;
unsigned char *p, *to, *from;
@@ -2330,22 +2330,28 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
i = msg->sg_start;
do {
len = sg[i].length;
- offset += len;
if (start < offset + len)
break;
+ offset += len;
i++;
- if (i == MAX_SKB_FRAGS)
- i = 0;
- } while (i != msg->sg_end);
+ } while (i <= msg->sg_end);
+ /* return error if start is out of range */
if (unlikely(start >= offset + len))
return -EINVAL;
- if (!msg->sg_copy[i] && bytes <= len)
- goto out;
+ /* return error if i is last entry in sglist and end is out of range */
+ if (msg->sg_copy[i] && end > offset + len)
+ return -EINVAL;
first_sg = i;
+ /* if i is not last entry in sg list and end (i.e start + bytes) is
+ * within this sg[i] then goto out and calculate data and data_end
+ */
+ if (!msg->sg_copy[i] && end <= offset + len)
+ goto out;
+
/* At this point we need to linearize multiple scatterlist
* elements or a single shared page. Either way we need to
* copy into a linear buffer exclusively owned by BPF. Then
@@ -2359,11 +2365,14 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
do {
copy += sg[i].length;
i++;
- if (i == MAX_SKB_FRAGS)
- i = 0;
- if (bytes < copy)
+ if (end < copy)
break;
- } while (i != msg->sg_end);
+ } while (i <= msg->sg_end);
+
+ /* return error if i is last entry in sglist and end is out of range */
+ if (i > msg->sg_end && end > offset + copy)
+ return -EINVAL;
+
last_sg = i;
if (unlikely(copy < end - start))
@@ -2373,23 +2382,25 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
if (unlikely(!page))
return -ENOMEM;
p = page_address(page);
- offset = 0;
i = first_sg;
do {
from = sg_virt(&sg[i]);
len = sg[i].length;
- to = p + offset;
+ to = p + off;
memcpy(to, from, len);
- offset += len;
+ off += len;
sg[i].length = 0;
- put_page(sg_page(&sg[i]));
+ /* if original page is allocated via slab then put_page
+ * causes error BUG: Bad page state in process. So temporarily
+ * disabled put_page.
+ * Todo: fix it
+ */
+ //put_page(sg_page(&sg[i]));
i++;
- if (i == MAX_SKB_FRAGS)
- i = 0;
- } while (i != last_sg);
+ } while (i < last_sg);
sg[first_sg].length = copy;
sg_set_page(&sg[first_sg], page, copy, 0);
@@ -2406,12 +2417,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
do {
int move_from;
- if (i + shift >= MAX_SKB_FRAGS)
- move_from = i + shift - MAX_SKB_FRAGS;
- else
- move_from = i + shift;
-
- if (move_from == msg->sg_end)
+ move_from = i + shift;
+ if (move_from > msg->sg_end)
break;
sg[i] = sg[move_from];
@@ -2420,14 +2427,10 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
sg[move_from].offset = 0;
i++;
- if (i == MAX_SKB_FRAGS)
- i = 0;
} while (1);
msg->sg_end -= shift;
- if (msg->sg_end < 0)
- msg->sg_end += MAX_SKB_FRAGS;
out:
- msg->data = sg_virt(&sg[i]) + start - offset;
+ msg->data = sg_virt(&sg[first_sg]) + start - offset;
msg->data_end = msg->data + bytes;
return 0;
--
1.8.3.1
* [RFC v3 net-next 4/5] rds: invoke socket sg filter attached to rds socket
From: Tushar Dave @ 2018-08-17 23:08 UTC
To: john.fastabend, ast, daniel, davem, sowmini.varadhan,
santosh.shilimkar, jakub.kicinski, quentin.monnet, jiong.wang,
sandipan, kafai, rdna, yhs, netdev
The RDS module sits on top of TCP (rds_tcp) and IB (rds_rdma), so
messages arrive in the form of an skb (over TCP) or a scatterlist (over
IB/RDMA). However, because a socket filter deals only with skbs (e.g.
struct sk_buff as the bpf context), we can only use socket filters for
rds_tcp and not for rds_rdma.

Considering one filtering solution for RDS, the common denominator
between sk_buff and scatterlist is the scatterlist. Therefore, this
patch converts the skb to an sgvec and invokes sg_filter_run() for
rds_tcp, and simply invokes sg_filter_run() for IB/rds_rdma.
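The mechanism is a pair of optional per-transport hooks with the
following shape (see the rds_transport changes below):

	int  (*inc_to_sg_get)(struct rds_incoming *inc,
			      struct scatterlist **sg);
	void (*inc_to_sg_put)(struct scatterlist **sg);

rds_tcp implements both, since it allocates a temporary sgvec from the
skb queue and must free it afterwards; rds_ib implements only
inc_to_sg_get, since it hands back a pointer to the existing receive
fragment's scatterlist and nothing needs to be freed.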
Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
net/rds/ib.c | 1 +
net/rds/ib.h | 1 +
net/rds/ib_recv.c | 12 ++++++++++++
net/rds/rds.h | 2 ++
net/rds/recv.c | 17 +++++++++++++++++
net/rds/tcp.c | 2 ++
net/rds/tcp.h | 2 ++
net/rds/tcp_recv.c | 38 ++++++++++++++++++++++++++++++++++++++
8 files changed, 75 insertions(+)
diff --git a/net/rds/ib.c b/net/rds/ib.c
index 89c6333..6ba1f75 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -532,6 +532,7 @@ struct rds_transport rds_ib_transport = {
.conn_path_shutdown = rds_ib_conn_path_shutdown,
.inc_copy_to_user = rds_ib_inc_copy_to_user,
.inc_free = rds_ib_inc_free,
+ .inc_to_sg_get = rds_ib_inc_to_sg_get,
.cm_initiate_connect = rds_ib_cm_initiate_connect,
.cm_handle_connect = rds_ib_cm_handle_connect,
.cm_connect_complete = rds_ib_cm_connect_complete,
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 73427ff..0a12b41 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -404,6 +404,7 @@ int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev,
void rds_ib_recv_free_caches(struct rds_ib_connection *ic);
void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp);
void rds_ib_inc_free(struct rds_incoming *inc);
+int rds_ib_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg);
int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to);
void rds_ib_recv_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc,
struct rds_ib_ack_state *state);
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index d300186..2f76a91 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -219,6 +219,18 @@ void rds_ib_inc_free(struct rds_incoming *inc)
rds_ib_recv_cache_put(&ibinc->ii_cache_entry, &ic->i_cache_incs);
}
+int rds_ib_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg)
+{
+ struct rds_ib_incoming *ibinc;
+ struct rds_page_frag *frag;
+
+ ibinc = container_of(inc, struct rds_ib_incoming, ii_inc);
+ frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item);
+ *sg = &frag->f_sg;
+
+ return 0;
+}
+
static void rds_ib_recv_clear_one(struct rds_ib_connection *ic,
struct rds_ib_recv_work *recv)
{
diff --git a/net/rds/rds.h b/net/rds/rds.h
index c4dcf65..abcd5ce 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -542,6 +542,8 @@ struct rds_transport {
int (*recv_path)(struct rds_conn_path *cp);
int (*inc_copy_to_user)(struct rds_incoming *inc, struct iov_iter *to);
void (*inc_free)(struct rds_incoming *inc);
+ int (*inc_to_sg_get)(struct rds_incoming *inc, struct scatterlist **sg);
+ void (*inc_to_sg_put)(struct scatterlist **sg);
int (*cm_handle_connect)(struct rdma_cm_id *cm_id,
struct rdma_cm_event *event, bool isv6);
diff --git a/net/rds/recv.c b/net/rds/recv.c
index 504cd6b..261904c 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -292,6 +292,8 @@ void rds_recv_incoming(struct rds_connection *conn, struct in6_addr *saddr,
struct sock *sk;
unsigned long flags;
struct rds_conn_path *cp;
+ struct sk_filter *filter;
+ int result = __SOCKSG_PASS;
inc->i_conn = conn;
inc->i_rx_jiffies = jiffies;
@@ -376,6 +378,21 @@ void rds_recv_incoming(struct rds_connection *conn, struct in6_addr *saddr,
/* We can be racing with rds_release() which marks the socket dead. */
sk = rds_rs_to_sk(rs);
+ rcu_read_lock();
+ filter = rcu_dereference(sk->sk_filter);
+ if (filter) {
+ if (conn->c_trans->inc_to_sg_get) {
+ struct scatterlist *sg;
+
+ if (conn->c_trans->inc_to_sg_get(inc, &sg) == 0) {
+ result = sg_filter_run(sk, sg);
+ if (conn->c_trans->inc_to_sg_put)
+ conn->c_trans->inc_to_sg_put(&sg);
+ }
+ }
+ }
+ rcu_read_unlock();
+
/* serialize with rds_release -> sock_orphan */
write_lock_irqsave(&rs->rs_recv_lock, flags);
if (!sock_flag(sk, SOCK_DEAD)) {
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 2c7b7c3..35454c7 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -465,6 +465,8 @@ struct rds_transport rds_tcp_transport = {
.conn_path_shutdown = rds_tcp_conn_path_shutdown,
.inc_copy_to_user = rds_tcp_inc_copy_to_user,
.inc_free = rds_tcp_inc_free,
+ .inc_to_sg_get = rds_tcp_inc_to_sg_get,
+ .inc_to_sg_put = rds_tcp_inc_to_sg_put,
.stats_info_copy = rds_tcp_stats_info_copy,
.exit = rds_tcp_exit,
.t_owner = THIS_MODULE,
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index 3c69361..b2cc910 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -82,6 +82,8 @@ void rds_tcp_restore_callbacks(struct socket *sock,
int rds_tcp_recv_path(struct rds_conn_path *cp);
void rds_tcp_inc_free(struct rds_incoming *inc);
int rds_tcp_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to);
+int rds_tcp_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg);
+void rds_tcp_inc_to_sg_put(struct scatterlist **sg);
/* tcp_send.c */
void rds_tcp_xmit_path_prepare(struct rds_conn_path *cp);
diff --git a/net/rds/tcp_recv.c b/net/rds/tcp_recv.c
index 42c5ff1..b45e69b 100644
--- a/net/rds/tcp_recv.c
+++ b/net/rds/tcp_recv.c
@@ -56,6 +56,44 @@ void rds_tcp_inc_free(struct rds_incoming *inc)
kmem_cache_free(rds_tcp_incoming_slab, tinc);
}
+#define MAX_SG MAX_SKB_FRAGS
+int rds_tcp_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg)
+{
+ struct scatterlist *sg_list;
+ struct rds_tcp_incoming *tinc;
+ struct sk_buff *skb;
+ int num_sg = 0;
+
+ tinc = container_of(inc, struct rds_tcp_incoming, ti_inc);
+
+ /* For now we are assuming that the max sg elements we need is MAX_SG.
+ * To determine actual number of sg elements we need to traverse the
+ * skb queue e.g.
+ *
+ * skb_queue_walk(&tinc->ti_skb_list, skb) {
+ * num_sg += skb_shinfo(skb)->nr_frags + 1;
+ * }
+ */
+ sg_list = kzalloc(sizeof(*sg_list) * MAX_SG, GFP_KERNEL);
+ if (!sg_list)
+ return -ENOMEM;
+
+ sg_init_table(sg_list, MAX_SG);
+ skb_queue_walk(&tinc->ti_skb_list, skb) {
+ num_sg += skb_to_sgvec_nomark(skb, &sg_list[num_sg], 0,
+ skb->len);
+ }
+ sg_mark_end(&sg_list[num_sg - 1]);
+ *sg = sg_list;
+
+ return 0;
+}
+
+void rds_tcp_inc_to_sg_put(struct scatterlist **sg)
+{
+ kfree(*sg);
+}
+
/*
* this is pretty lame, but, whatever.
*/
--
1.8.3.1
* [RFC v3 net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
From: Tushar Dave @ 2018-08-17 23:08 UTC
To: john.fastabend, ast, daniel, davem, sowmini.varadhan,
santosh.shilimkar, jakub.kicinski, quentin.monnet, jiong.wang,
sandipan, kafai, rdna, yhs, netdev
Add a sample program that shows how a socksg program is used and
attached as a socket filter. The kernel-side sample program deals with
the struct scatterlist that is passed as the bpf context.

When run in server mode, the sample RDS program opens a PF_RDS socket
and attaches an eBPF program to it, which then uses the
bpf_msg_pull_data helper to inspect packet data contained in the struct
scatterlist and returns an appropriate action code back to the kernel.

To ease testing, RDS client functionality is also added so that users
can generate RDS packets.
Server:
[root@lab71 bpf]# ./rds_filter -s 192.168.3.71 -t tcp
running server in a loop
transport tcp
server bound to address: 192.168.3.71 port 4000
server listening on 192.168.3.71
Client:
[root@lab70 bpf]# ./rds_filter -s 192.168.3.71 -c 192.168.3.70 -t tcp
transport tcp
client bound to address: 192.168.3.70 port 25278
client sending 8192 byte message from 192.168.3.70 to 192.168.3.71 on
port 25278
payload contains:30 31 32 33 34 35 36 37 38 39 ...
Server output:
192.168.3.71 received a packet from 192.168.3.71 of len 8192 cmsg len 0,
on port 25278
payload contains:30 31 32 33 34 35 36 37 38 39 ...
server listening on 192.168.3.71
[root@lab71 tushar]# cat /sys/kernel/debug/tracing/trace_pipe
<idle>-0 [038] ..s. 146.947362: 0: 30 31 32
<idle>-0 [038] ..s. 146.947364: 0: 33 34 35
Similarly, specifying '-t ib' will run this over an IB link.
Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
samples/bpf/Makefile | 3 +
samples/bpf/rds_filter_kern.c | 42 ++++++
samples/bpf/rds_filter_user.c | 339 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 384 insertions(+)
create mode 100644 samples/bpf/rds_filter_kern.c
create mode 100644 samples/bpf/rds_filter_user.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 36f9f41..dbb30d0 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -53,6 +53,7 @@ hostprogs-y += xdpsock
hostprogs-y += xdp_fwd
hostprogs-y += task_fd_query
hostprogs-y += xdp_sample_pkts
+hostprogs-y += rds_filter
# Libbpf dependencies
LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -109,6 +110,7 @@ xdpsock-objs := xdpsock_user.o
xdp_fwd-objs := xdp_fwd_user.o
task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
+rds_filter-objs := bpf_load.o rds_filter_user.o
# Tell kbuild to always build the programs
always := $(hostprogs-y)
@@ -166,6 +168,7 @@ always += xdpsock_kern.o
always += xdp_fwd_kern.o
always += task_fd_query_kern.o
always += xdp_sample_pkts_kern.o
+always += rds_filter_kern.o
KBUILD_HOSTCFLAGS += -I$(objtree)/usr/include
KBUILD_HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/rds_filter_kern.c b/samples/bpf/rds_filter_kern.c
new file mode 100644
index 0000000..633e687
--- /dev/null
+++ b/samples/bpf/rds_filter_kern.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/filter.h>
+#include <linux/ptrace.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include <linux/rds.h>
+#include "bpf_helpers.h"
+
+#define bpf_printk(fmt, ...) \
+({ \
+ char ____fmt[] = fmt; \
+ bpf_trace_printk(____fmt, sizeof(____fmt), \
+ ##__VA_ARGS__); \
+})
+
+SEC("socksg")
+int main_prog(struct sk_msg_md *msg)
+{
+ int start, end, err;
+ unsigned char *d;
+
+ start = 0;
+ end = 6;
+
+ err = bpf_msg_pull_data(msg, start, end, 0);
+ if (err) {
+ bpf_printk("socksg: pull_data err %i\n", err);
+ return SOCKSG_PASS;
+ }
+
+ if (msg->data + 6 > msg->data_end)
+ return SOCKSG_PASS;
+
+ d = (unsigned char *)msg->data;
+ bpf_printk("%x %x %x\n", d[0], d[1], d[2]);
+ bpf_printk("%x %x %x\n", d[3], d[4], d[5]);
+
+ return SOCKSG_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/rds_filter_user.c b/samples/bpf/rds_filter_user.c
new file mode 100644
index 0000000..1186345
--- /dev/null
+++ b/samples/bpf/rds_filter_user.c
@@ -0,0 +1,339 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <arpa/inet.h>
+#include <assert.h>
+#include "bpf_load.h"
+#include <getopt.h>
+#include <errno.h>
+#include <netinet/in.h>
+#include <limits.h>
+#include <linux/sockios.h>
+#include <linux/rds.h>
+#include <linux/errqueue.h>
+#include <linux/bpf.h>
+#include <strings.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <string.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+
+#define TESTPORT 4000
+#define BUFSIZE 8192
+
+int transport = -1;
+
+static int str2trans(const char *trans)
+{
+ if (strcmp(trans, "tcp") == 0)
+ return RDS_TRANS_TCP;
+ if (strcmp(trans, "ib") == 0)
+ return RDS_TRANS_IB;
+ return (RDS_TRANS_NONE);
+}
+
+static const char *trans2str(int trans)
+{
+ switch (trans) {
+ case RDS_TRANS_TCP:
+ return ("tcp");
+ case RDS_TRANS_IB:
+ return ("ib");
+ case RDS_TRANS_NONE:
+ return ("none");
+ default:
+ return ("unknown");
+ }
+}
+
+static int gettransport(int sock)
+{
+ int err;
+ char val;
+ socklen_t len = sizeof(int);
+
+ err = getsockopt(sock, SOL_RDS, SO_RDS_TRANSPORT,
+ (char *)&val, &len);
+ if (err < 0) {
+ fprintf(stderr, "%s: getsockopt %s\n",
+ __func__, strerror(errno));
+ return err;
+ }
+ return (int)val;
+}
+
+static int settransport(int sock, int transport)
+{
+ int err;
+
+ err = setsockopt(sock, SOL_RDS, SO_RDS_TRANSPORT,
+ (char *)&transport, sizeof(transport));
+ if (err < 0) {
+ fprintf(stderr, "could not set transport %s, %s\n",
+ trans2str(transport), strerror(errno));
+ }
+ return err;
+}
+
+static void print_sock_local_info(int fd, char *str, struct sockaddr_in *ret)
+{
+ socklen_t sin_size = sizeof(struct sockaddr_in);
+ struct sockaddr_in sin;
+ int err;
+
+ err = getsockname(fd, (struct sockaddr *)&sin, &sin_size);
+ if (err < 0) {
+ fprintf(stderr, "%s getsockname %s\n",
+ __func__, strerror(errno));
+ return;
+ }
+ printf("%s address: %s port %d\n",
+ (str ? str : ""), inet_ntoa(sin.sin_addr), ntohs(sin.sin_port));
+
+ if (ret != NULL)
+ *ret = sin;
+}
+
+static void print_payload(char *buf)
+{
+ int i;
+
+ printf("payload contains:");
+ for (i = 0; i < 10; i++)
+ printf("%x ", buf[i]);
+ printf("...\n");
+}
+
+static void server(char *address, in_port_t port)
+{
+ struct sockaddr_in sin, din;
+ struct msghdr msg;
+ struct iovec *iov;
+ int rc, sock;
+ char *buf;
+
+ buf = calloc(BUFSIZE, sizeof(char));
+ if (!buf) {
+ fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+ return;
+ }
+
+ sock = socket(PF_RDS, SOCK_SEQPACKET, 0);
+ if (sock < 0) {
+ fprintf(stderr, "%s: socket %s\n", __func__, strerror(errno));
+ goto out;
+ }
+ if (settransport(sock, transport) < 0)
+ goto out;
+
+ printf("transport %s\n", trans2str(gettransport(sock)));
+
+ memset(&sin, 0, sizeof(sin));
+ sin.sin_family = AF_INET;
+ sin.sin_addr.s_addr = inet_addr(address);
+ sin.sin_port = htons(port);
+
+ rc = bind(sock, (struct sockaddr *)&sin, sizeof(sin));
+ if (rc < 0) {
+ fprintf(stderr, "%s: bind %s\n", __func__, strerror(errno));
+ goto out;
+ }
+
+ /* attach bpf prog */
+ assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
+ sizeof(prog_fd[0])) == 0);
+
+ print_sock_local_info(sock, "server bound to", NULL);
+
+ iov = calloc(1, sizeof(struct iovec));
+ if (!iov) {
+ fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+ goto out;
+ }
+
+ while (1) {
+ memset(buf, 0, BUFSIZE);
+ iov[0].iov_base = buf;
+ iov[0].iov_len = BUFSIZE;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.msg_name = &din;
+ msg.msg_namelen = sizeof(din);
+ msg.msg_iov = iov;
+ msg.msg_iovlen = 1;
+
+ printf("server listening on %s\n", inet_ntoa(sin.sin_addr));
+
+ rc = recvmsg(sock, &msg, 0);
+ if (rc < 0) {
+ fprintf(stderr, "%s: recvmsg %s\n",
+ __func__, strerror(errno));
+ break;
+ }
+
+ printf("%s received a packet from %s of len %d cmsg len %d, on port %d\n",
+ inet_ntoa(sin.sin_addr),
+ inet_ntoa(din.sin_addr),
+ (uint32_t) iov[0].iov_len,
+ (uint32_t) msg.msg_controllen,
+ ntohs(din.sin_port));
+
+ print_payload(buf);
+ }
+ free(iov);
+out:
+ free(buf);
+}
+
+static void create_message(char *buf)
+{
+ unsigned int i;
+
+ for (i = 0; i < BUFSIZE; i++) {
+ buf[i] = i + 0x30;
+ }
+}
+
+static int build_rds_packet(struct msghdr *msg, char *buf)
+{
+ struct iovec *iov;
+
+ iov = calloc(1, sizeof(struct iovec));
+ if (!iov) {
+ fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+ return -1;
+ }
+
+ msg->msg_iov = iov;
+ msg->msg_iovlen = 1;
+
+ iov[0].iov_base = buf;
+ iov[0].iov_len = BUFSIZE * sizeof(char);
+
+ return 0;
+}
+
+static void client(char *localaddr, char *remoteaddr, in_port_t server_port)
+{
+ struct sockaddr_in sin, din;
+ struct msghdr msg;
+ int rc, sock;
+ char *buf;
+
+ buf = calloc(BUFSIZE, sizeof(char));
+ if (!buf) {
+ fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+ return;
+ }
+
+ create_message(buf);
+
+ sock = socket(PF_RDS, SOCK_SEQPACKET, 0);
+ if (sock < 0) {
+ fprintf(stderr, "%s: socket %s\n", __func__, strerror(errno));
+ goto out;
+ }
+
+ if (settransport(sock, transport) < 0)
+ goto out;
+
+ printf("transport %s\n", trans2str(gettransport(sock)));
+
+ memset(&sin, 0, sizeof(sin));
+ sin.sin_family = AF_INET;
+ sin.sin_addr.s_addr = inet_addr(localaddr);
+ sin.sin_port = 0;
+
+ rc = bind(sock, (struct sockaddr *)&sin, sizeof(sin));
+ if (rc < 0) {
+ fprintf(stderr, "%s: bind %s\n", __func__, strerror(errno));
+ goto out;
+ }
+ print_sock_local_info(sock, "client bound to", &sin);
+
+ memset(&msg, 0, sizeof(msg));
+ msg.msg_name = &din;
+ msg.msg_namelen = sizeof(din);
+
+ memset(&din, 0, sizeof(din));
+ din.sin_family = AF_INET;
+ din.sin_addr.s_addr = inet_addr(remoteaddr);
+ din.sin_port = htons(server_port);
+
+ rc = build_rds_packet(&msg, buf);
+ if (rc < 0)
+ goto out;
+
+ printf("client sending %d byte message from %s to %s on port %d\n",
+ (uint32_t) msg.msg_iov->iov_len, localaddr,
+ remoteaddr, ntohs(sin.sin_port));
+
+ rc = sendmsg(sock, &msg, 0);
+ if (rc < 0)
+ fprintf(stderr, "%s: sendmsg %s\n", __func__, strerror(errno));
+
+ print_payload(buf);
+
+ if (msg.msg_control)
+ free(msg.msg_control);
+ if (msg.msg_iov)
+ free(msg.msg_iov);
+out:
+ free(buf);
+
+ return;
+}
+
+static void usage(char *progname)
+{
+ fprintf(stderr, "Usage %s [-s srvaddr] [-c clientaddr] [-t transport]"
+ "\n", progname);
+}
+
+int main(int argc, char **argv)
+{
+ in_port_t server_port = TESTPORT;
+ char *serveraddr = NULL;
+ char *clientaddr = NULL;
+ char filename[256];
+ int opt;
+
+ while ((opt = getopt(argc, argv, "s:c:t:")) != -1) {
+ switch (opt) {
+ case 's':
+ serveraddr = optarg;
+ break;
+ case 'c':
+ clientaddr = optarg;
+ break;
+ case 't':
+ transport = str2trans(optarg);
+ if (transport == RDS_TRANS_NONE) {
+ fprintf(stderr,
+ "unknown transport %s\n", optarg);
+ usage(argv[0]);
+ return (-1);
+ }
+ break;
+ default:
+ usage(argv[0]);
+ return 1;
+ }
+ }
+
+ snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+ if (load_bpf_file(filename)) {
+ fprintf(stderr, "Error: load_bpf_file %s", bpf_log_buf);
+ return 1;
+ }
+
+ if (serveraddr && !clientaddr) {
+ printf("running server in a loop\n");
+ server(serveraddr, server_port);
+ } else if (serveraddr && clientaddr) {
+ client(clientaddr, serveraddr, server_port);
+ }
+
+ return 0;
+}
--
1.8.3.1
* Re: [RFC v3 net-next 4/5] rds: invoke socket sg filter attached to rds socket
From: Santosh Shilimkar @ 2018-08-20 16:58 UTC
To: Tushar Dave, daniel, davem, sowmini.varadhan, jakub.kicinski,
quentin.monnet, jiong.wang, sandipan, kafai, rdna
Cc: john.fastabend, ast, yhs, netdev
On 8/17/2018 4:08 PM, Tushar Dave wrote:
> The RDS module sits on top of TCP (rds_tcp) and IB (rds_rdma), so
> messages arrive in the form of an skb (over TCP) or a scatterlist
> (over IB/RDMA). However, because a socket filter deals only with skbs
> (e.g. struct sk_buff as the bpf context), we can only use socket
> filters for rds_tcp and not for rds_rdma.
>
> Considering one filtering solution for RDS, the common denominator
> between sk_buff and scatterlist is the scatterlist. Therefore, this
> patch converts the skb to an sgvec and invokes sg_filter_run() for
> rds_tcp, and simply invokes sg_filter_run() for IB/rds_rdma.
>
> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
> Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
Looks good Tushar !!
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
* Re: [RFC v3 net-next 3/5] ebpf: fix bpf_msg_pull_data
From: John Fastabend @ 2018-08-25 1:02 UTC
To: Tushar Dave, ast, daniel, davem, sowmini.varadhan,
santosh.shilimkar, jakub.kicinski, quentin.monnet, jiong.wang,
sandipan, kafai, rdna, yhs, netdev
On 08/17/2018 04:08 PM, Tushar Dave wrote:
> Like sockmap (sk_msg), socksg also deals with struct scatterlist;
> therefore socksg programs can use the existing bpf helper
> bpf_msg_pull_data to access packet data contained in a struct
> scatterlist. While doing some preliminary testing, a couple of issues
> were found with bpf_msg_pull_data; they are fixed in this patch.
>
> Also, there cannot be more than MAX_SKB_FRAGS entries in sg_data;
> therefore, any checks for sg entries beyond MAX_SKB_FRAGS in
> bpf_msg_pull_data() are removed.
In sockmap the scatterlist is used as a ring so the MAX_SKB_FRAGS
check is needed to keep searching through the ring when sg_start
is non-zero.
>
> Besides that, I also ran into issues when put_page() is invoked,
> e.g.:
> [ 450.568723] BUG: Bad page state in process swapper/10 pfn:2021540
> [ 450.575632] page:ffffea0080855000 count:0 mapcount:0
> mapping:ffff88103d006840 index:0xffff882021540000 compound_mapcount: 0
> [ 450.588069] flags: 0x6fffff80008100(slab|head)
> [ 450.593033] raw: 006fffff80008100 dead000000000100 dead000000000200
> ffff88103d006840
> [ 450.601683] raw: ffff882021540000 0000000080080007 00000000ffffffff
> 0000000000000000
> [ 450.610337] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> [ 450.617530] bad because of flags: 0x100(slab)
>
> To avoid the above issue, put_page() is temporarily disabled in this
> patch. I am working on alternatives so that a page allocated via slab
> (in this case) can be freed without any issue.
>
> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
> net/core/filter.c | 61 +++++++++++++++++++++++++++++--------------------------
> 1 file changed, 32 insertions(+), 29 deletions(-)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index e427c8e..cc52baa 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2316,7 +2316,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
> BPF_CALL_4(bpf_msg_pull_data,
> struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags)
> {
> - unsigned int len = 0, offset = 0, copy = 0;
> + unsigned int len = 0, offset = 0, copy = 0, off = 0;
> struct scatterlist *sg = msg->sg_data;
> int first_sg, last_sg, i, shift;
> unsigned char *p, *to, *from;
> @@ -2330,22 +2330,28 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
> i = msg->sg_start;
> do {
> len = sg[i].length;
> - offset += len;
> if (start < offset + len)
> break;
> + offset += len;
This looks like a generic fix unrelated to this series.
Can you send that as a bugfix?
> i++;
> - if (i == MAX_SKB_FRAGS)
> - i = 0;
> - } while (i != msg->sg_end);
> + } while (i <= msg->sg_end);
>
As noted above the MAX_SKB_FRAGS check is needed because
sg_start can be non-zero and sg_end < sg_start. In these
cases we need to search the entries at the start of the
array (being used as a ring).
> + /* return error if start is out of range */
> if (unlikely(start >= offset + len))
> return -EINVAL;
>
> - if (!msg->sg_copy[i] && bytes <= len)
> - goto out;
> + /* return error if i is last entry in sglist and end is out of range */
> + if (msg->sg_copy[i] && end > offset + len)
> + return -EINVAL;
>
> first_sg = i;
>
> + /* if i is not last entry in sg list and end (i.e start + bytes) is
> + * within this sg[i] then goto out and calculate data and data_end
> + */
> + if (!msg->sg_copy[i] && end <= offset + len)
> + goto out;
> +
> /* At this point we need to linearize multiple scatterlist
> * elements or a single shared page. Either way we need to
> * copy into a linear buffer exclusively owned by BPF. Then
> @@ -2359,11 +2365,14 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
> do {
> copy += sg[i].length;
> i++;
> - if (i == MAX_SKB_FRAGS)
> - i = 0;
same as above, need to keep.
> - if (bytes < copy)
> + if (end < copy)
> break;
> - } while (i != msg->sg_end);
> + } while (i <= msg->sg_end);
> +
> + /* return error if i is last entry in sglist and end is out of range */
> + if (i > msg->sg_end && end > offset + copy)
> + return -EINVAL;
> +
> last_sg = i;
>
> if (unlikely(copy < end - start))
> @@ -2373,23 +2382,25 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
> if (unlikely(!page))
> return -ENOMEM;
> p = page_address(page);
> - offset = 0;
>
> i = first_sg;
> do {
> from = sg_virt(&sg[i]);
> len = sg[i].length;
> - to = p + offset;
> + to = p + off;
Not really sure if the change from offset->off is needed. Looks
like it just makes a bigger diff.
>
> memcpy(to, from, len);
> - offset += len;
> + off += len;
> sg[i].length = 0;
> - put_page(sg_page(&sg[i]));
> + /* if original page is allocated via slab then put_page
> + * causes error BUG: Bad page state in process. So temporarily
> + * disabled put_page.
> + * Todo: fix it
> + */
> + //put_page(sg_page(&sg[i]));
>
> i++;
> - if (i == MAX_SKB_FRAGS)
> - i = 0;
> - } while (i != last_sg);
> + } while (i < last_sg);
>
> sg[first_sg].length = copy;
> sg_set_page(&sg[first_sg], page, copy, 0);
> @@ -2406,12 +2417,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
> do {
> int move_from;
>
> - if (i + shift >= MAX_SKB_FRAGS)
> - move_from = i + shift - MAX_SKB_FRAGS;
> - else
> - move_from = i + shift;
> -
Need to keep same as above.
> - if (move_from == msg->sg_end)
> + move_from = i + shift;
> + if (move_from > msg->sg_end)
> break;
>
> sg[i] = sg[move_from];
> @@ -2420,14 +2427,10 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
> sg[move_from].offset = 0;
>
> i++;
> - if (i == MAX_SKB_FRAGS)
> - i = 0;
> } while (1);
> msg->sg_end -= shift;
> - if (msg->sg_end < 0)
> - msg->sg_end += MAX_SKB_FRAGS;
> out:
> - msg->data = sg_virt(&sg[i]) + start - offset;
> + msg->data = sg_virt(&sg[first_sg]) + start - offset;
> msg->data_end = msg->data + bytes;
>
> return 0;
>
Thanks,
John
* Re: [RFC v3 net-next 3/5] ebpf: fix bpf_msg_pull_data
From: Tushar Dave @ 2018-08-27 4:45 UTC
To: John Fastabend, ast, daniel, davem, sowmini.varadhan,
santosh.shilimkar, jakub.kicinski, quentin.monnet, jiong.wang,
sandipan, kafai, rdna, yhs, netdev
On 08/24/2018 06:02 PM, John Fastabend wrote:
> On 08/17/2018 04:08 PM, Tushar Dave wrote:
>> Like sockmap (sk_msg), socksg also deals with struct scatterlist;
>> therefore socksg programs can use the existing bpf helper
>> bpf_msg_pull_data to access packet data contained in a struct
>> scatterlist. While doing some preliminary testing, a couple of issues
>> were found with bpf_msg_pull_data; they are fixed in this patch.
>>
>> Also, there cannot be more than MAX_SKB_FRAGS entries in sg_data;
>> therefore, any checks for sg entries beyond MAX_SKB_FRAGS in
>> bpf_msg_pull_data() are removed.
>
> In sockmap the scatterlist is used as a ring so the MAX_SKB_FRAGS
> check is needed to keep searching through the ring when sg_start
> is non-zero.
Okay.
>
>>
>> Besides that, I also ran into issues when put_page() is invoked,
>> e.g.:
>> [ 450.568723] BUG: Bad page state in process swapper/10 pfn:2021540
>> [ 450.575632] page:ffffea0080855000 count:0 mapcount:0
>> mapping:ffff88103d006840 index:0xffff882021540000 compound_mapcount: 0
>> [ 450.588069] flags: 0x6fffff80008100(slab|head)
>> [ 450.593033] raw: 006fffff80008100 dead000000000100 dead000000000200
>> ffff88103d006840
>> [ 450.601683] raw: ffff882021540000 0000000080080007 00000000ffffffff
>> 0000000000000000
>> [ 450.610337] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
>> [ 450.617530] bad because of flags: 0x100(slab)
>>
>> To avoid the above issue, put_page() is temporarily disabled in this
>> patch. I am working on alternatives so that a page allocated via slab
>> (in this case) can be freed without any issue.
>>
>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> ---
>> net/core/filter.c | 61 +++++++++++++++++++++++++++++--------------------------
>> 1 file changed, 32 insertions(+), 29 deletions(-)
>>
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index e427c8e..cc52baa 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -2316,7 +2316,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>> BPF_CALL_4(bpf_msg_pull_data,
>> struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags)
>> {
>> - unsigned int len = 0, offset = 0, copy = 0;
>> + unsigned int len = 0, offset = 0, copy = 0, off = 0;
>> struct scatterlist *sg = msg->sg_data;
>> int first_sg, last_sg, i, shift;
>> unsigned char *p, *to, *from;
>> @@ -2330,22 +2330,28 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>> i = msg->sg_start;
>> do {
>> len = sg[i].length;
>> - offset += len;
>> if (start < offset + len)
>> break;
>> + offset += len;
>
> This looks like a generic fix unrelated to this series.
> Can you send that as a bugfix?
Okay.
>
>> i++;
>> - if (i == MAX_SKB_FRAGS)
>> - i = 0;
>> - } while (i != msg->sg_end);
>> + } while (i <= msg->sg_end);
>>
>
> As noted above the MAX_SKB_FRAGS check is needed because
> sg_start can be non-zero and sg_end < sg_start. In these
> cases we need to search the entries at the start of the
> array (being used as a ring).
Yup!
>
>> + /* return error if start is out of range */
>> if (unlikely(start >= offset + len))
>> return -EINVAL;
>>
>> - if (!msg->sg_copy[i] && bytes <= len)
>> - goto out;
>> + /* return error if i is last entry in sglist and end is out of range */
>> + if (msg->sg_copy[i] && end > offset + len)
>> + return -EINVAL;
>>
>> first_sg = i;
>>
>> + /* if i is not last entry in sg list and end (i.e start + bytes) is
>> + * within this sg[i] then goto out and calculate data and data_end
>> + */
>> + if (!msg->sg_copy[i] && end <= offset + len)
>> + goto out;
>> +
>> /* At this point we need to linearize multiple scatterlist
>> * elements or a single shared page. Either way we need to
>> * copy into a linear buffer exclusively owned by BPF. Then
>> @@ -2359,11 +2365,14 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>> do {
>> copy += sg[i].length;
>> i++;
>> - if (i == MAX_SKB_FRAGS)
>> - i = 0;
>
> same as above, need to keep.
Yup!
>
>> - if (bytes < copy)
>> + if (end < copy)
>> break;
>> - } while (i != msg->sg_end);
>> + } while (i <= msg->sg_end);
>> +
>> + /* return error if i is last entry in sglist and end is out of range */
>> + if (i > msg->sg_end && end > offset + copy)
>> + return -EINVAL;
>> +
>> last_sg = i;
>>
>> if (unlikely(copy < end - start))
>> @@ -2373,23 +2382,25 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>> if (unlikely(!page))
>> return -ENOMEM;
>> p = page_address(page);
>> - offset = 0;
>>
>> i = first_sg;
>> do {
>> from = sg_virt(&sg[i]);
>> len = sg[i].length;
>> - to = p + offset;
>> + to = p + off;
>
> Not really sure if the change from offset->off is needed. Looks
> like it just makes a bigger diff.
We need both offset and off because they are used for different
calculations:

'offset' is used to calculate 'msg->data', i.e.
msg->data = sg_virt(&sg[first_sg]) + start - offset.

'off', on the other hand, is used when we linearize the sg entries.
>
>>
>> memcpy(to, from, len);
>> - offset += len;
>> + off += len;
>> sg[i].length = 0;
>> - put_page(sg_page(&sg[i]));
>> + /* if original page is allocated via slab then put_page
>> + * causes error BUG: Bad page state in process. So temporarily
>> + * disabled put_page.
>> + * Todo: fix it
>> + */
>> + //put_page(sg_page(&sg[i]));
As I said in the commit message, put_page() causes the error "BUG: Bad
page state in process ..." when used for RDS.

Any clue? Have you seen something like this with sockmap?
>>
>> i++;
>> - if (i == MAX_SKB_FRAGS)
>> - i = 0;
>> - } while (i != last_sg);
>> + } while (i < last_sg);
>>
>> sg[first_sg].length = copy;
>> sg_set_page(&sg[first_sg], page, copy, 0);
>> @@ -2406,12 +2417,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>> do {
>> int move_from;
>>
>> - if (i + shift >= MAX_SKB_FRAGS)
>> - move_from = i + shift - MAX_SKB_FRAGS;
>> - else
>> - move_from = i + shift;
>> -
>
> Need to keep same as above.
yup!
>
>> - if (move_from == msg->sg_end)
>> + move_from = i + shift;
>> + if (move_from > msg->sg_end)
>> break;
>>
>> sg[i] = sg[move_from];
>> @@ -2420,14 +2427,10 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>> sg[move_from].offset = 0;
>>
>> i++;
>> - if (i == MAX_SKB_FRAGS)
>> - i = 0;
>> } while (1);
>> msg->sg_end -= shift;
>> - if (msg->sg_end < 0)
>> - msg->sg_end += MAX_SKB_FRAGS;
>> out:
>> - msg->data = sg_virt(&sg[i]) + start - offset;
>> + msg->data = sg_virt(&sg[first_sg]) + start - offset;
>> msg->data_end = msg->data + bytes;
>>
>> return 0;
>>
>
> Thanks,
> John
>
Thanks.
-Tushar