* [PATCH bpf-next v2 02/11] bpf: make generic xdp compatible w/ bpf_xdp_adjust_tail
From: Nikita V. Shirokov @ 2018-04-18 4:29 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, David S. Miller
Cc: netdev, Nikita V. Shirokov
In-Reply-To: <20180418042951.17183-1-tehnerd@tehnerd.com>
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for generic XDP we need to reflect this packet's length change by
adjusting skb's tail pointer
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
net/core/dev.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 969462ebb296..11c789231a03 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3996,9 +3996,9 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
struct bpf_prog *xdp_prog)
{
struct netdev_rx_queue *rxqueue;
+ void *orig_data, *orig_data_end;
u32 metalen, act = XDP_DROP;
struct xdp_buff xdp;
- void *orig_data;
int hlen, off;
u32 mac_len;
@@ -4037,6 +4037,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
xdp.data_meta = xdp.data;
xdp.data_end = xdp.data + hlen;
xdp.data_hard_start = skb->data - skb_headroom(skb);
+ orig_data_end = xdp.data_end;
orig_data = xdp.data;
rxqueue = netif_get_rxqueue(skb);
@@ -4051,6 +4052,13 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
__skb_push(skb, -off);
skb->mac_header += off;
+ /* check if bpf_xdp_adjust_tail was used. it can only "shrink"
+ * pckt.
+ */
+ off = orig_data_end - xdp.data_end;
+ if (off != 0)
+ skb_set_tail_pointer(skb, xdp.data_end - xdp.data);
+
switch (act) {
case XDP_REDIRECT:
case XDP_TX:
--
2.15.1
^ permalink raw reply related
* [PATCH bpf-next v2 03/11] bpf: make mlx4 compatible w/ bpf_xdp_adjust_tail
From: Nikita V. Shirokov @ 2018-04-18 4:29 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Tariq Toukan
Cc: netdev, Nikita V. Shirokov
In-Reply-To: <20180418042951.17183-1-tehnerd@tehnerd.com>
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for mlx4 driver we will just calculate packet's length unconditionally
(the same way as it's already being done in mlx5)
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 5c613c6663da..efc55feddc5c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -775,8 +775,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
act = bpf_prog_run_xdp(xdp_prog, &xdp);
+ length = xdp.data_end - xdp.data;
if (xdp.data != orig_data) {
- length = xdp.data_end - xdp.data;
frags[0].page_offset = xdp.data -
xdp.data_hard_start;
va = xdp.data;
--
2.15.1
^ permalink raw reply related
* [PATCH bpf-next v2 04/11] bpf: make bnxt compatible w/ bpf_xdp_adjust_tail
From: Nikita V. Shirokov @ 2018-04-18 4:29 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Michael Chan
Cc: netdev, Nikita V. Shirokov
In-Reply-To: <20180418042951.17183-1-tehnerd@tehnerd.com>
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for bnxt driver we will just calculate packet's length unconditionally
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index 1389ab5e05df..1f0e872d0667 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -113,10 +113,10 @@ bool bnxt_rx_xdp(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, u16 cons,
if (tx_avail != bp->tx_ring_size)
*event &= ~BNXT_RX_EVENT;
+ *len = xdp.data_end - xdp.data;
if (orig_data != xdp.data) {
offset = xdp.data - xdp.data_hard_start;
*data_ptr = xdp.data_hard_start + offset;
- *len = xdp.data_end - xdp.data;
}
switch (act) {
case XDP_PASS:
--
2.15.1
^ permalink raw reply related
* [PATCH bpf-next v2 05/11] bpf: make cavium thunder compatible w/ bpf_xdp_adjust_tail
From: Nikita V. Shirokov @ 2018-04-18 4:29 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Robert Richter,
Sunil Goutham
Cc: netdev, Nikita V. Shirokov
In-Reply-To: <20180418042951.17183-1-tehnerd@tehnerd.com>
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for cavium's thunder driver we will just calculate packet's length
unconditionally
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
drivers/net/ethernet/cavium/thunder/nicvf_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 707db3304396..7135db45927e 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -538,9 +538,9 @@ static inline bool nicvf_xdp_rx(struct nicvf *nic, struct bpf_prog *prog,
action = bpf_prog_run_xdp(prog, &xdp);
rcu_read_unlock();
+ len = xdp.data_end - xdp.data;
/* Check if XDP program has changed headers */
if (orig_data != xdp.data) {
- len = xdp.data_end - xdp.data;
offset = orig_data - xdp.data;
dma_addr -= offset;
}
--
2.15.1
^ permalink raw reply related
* [PATCH bpf-next v2 06/11] bpf: make netronome nfp compatible w/ bpf_xdp_adjust_tail
From: Nikita V. Shirokov @ 2018-04-18 4:29 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski
Cc: netdev, Nikita V. Shirokov
In-Reply-To: <20180418042951.17183-1-tehnerd@tehnerd.com>
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for nfp driver we will just calculate packet's length unconditionally
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 1eb6549f2a54..d9111c077699 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -1722,7 +1722,7 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget)
act = bpf_prog_run_xdp(xdp_prog, &xdp);
- pkt_len -= xdp.data - orig_data;
+ pkt_len = xdp.data_end - xdp.data;
pkt_off += xdp.data - orig_data;
switch (act) {
--
2.15.1
^ permalink raw reply related
* [PATCH bpf-next v2 07/11] bpf: make tun compatible w/ bpf_xdp_adjust_tail
From: Nikita V. Shirokov @ 2018-04-18 4:29 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Jason Wang, David S. Miller,
Michael S. Tsirkin
Cc: netdev, Nikita V. Shirokov
In-Reply-To: <20180418042951.17183-1-tehnerd@tehnerd.com>
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for tun driver we need to adjust XDP_PASS handling by recalculating
length of the packet if it was passed to the TCP/IP stack
(in case if after xdp's prog run data_end pointer was adjusted)
Reviewed-by: Jason Wang <jasowang@redhat.com>
---
drivers/net/tun.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 1e58be152d5c..901351a6ed21 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1696,6 +1696,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
return NULL;
case XDP_PASS:
delta = orig_data - xdp.data;
+ len = xdp.data_end - xdp.data;
break;
default:
bpf_warn_invalid_xdp_action(act);
@@ -1716,7 +1717,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
}
skb_reserve(skb, pad - delta);
- skb_put(skb, len + delta);
+ skb_put(skb, len);
get_page(alloc_frag->page);
alloc_frag->offset += buflen;
--
2.15.1
^ permalink raw reply related
* [PATCH bpf-next v2 08/11] bpf: make virtio compatible w/ bpf_xdp_adjust_tail
From: Nikita V. Shirokov @ 2018-04-18 4:29 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann,
Michael S. Tsirkin mst @ redhat . com , Jason Wang
Cc: netdev, Nikita V. Shirokov
In-Reply-To: <20180418042951.17183-1-tehnerd@tehnerd.com>
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for virtio driver we need to adjust XDP_PASS handling by recalculating
length of the packet if it was passed to the TCP/IP stack
Reviewed-by: Jason Wang <jasowang@redhat.com>
---
drivers/net/virtio_net.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 01694e26f03e..779a4f798522 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -606,6 +606,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
case XDP_PASS:
/* Recalculate length in case bpf program changed it */
delta = orig_data - xdp.data;
+ len = xdp.data_end - xdp.data;
break;
case XDP_TX:
xdpf = convert_to_xdp_frame(&xdp);
@@ -642,7 +643,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
goto err;
}
skb_reserve(skb, headroom - delta);
- skb_put(skb, len + delta);
+ skb_put(skb, len);
if (!delta) {
buf += header_offset;
memcpy(skb_vnet_hdr(skb), buf, vi->hdr_len);
@@ -757,6 +758,10 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
offset = xdp.data -
page_address(xdp_page) - vi->hdr_len;
+ /* recalculate len if xdp.data or xdp.data_end were
+ * adjusted
+ */
+ len = xdp.data_end - xdp.data;
/* We can only create skb based on xdp_page. */
if (unlikely(xdp_page != page)) {
rcu_read_unlock();
--
2.15.1
^ permalink raw reply related
* [PATCH bpf-next v2 09/11] bpf: making bpf_prog_test run aware of possible data_end ptr change
From: Nikita V. Shirokov @ 2018-04-18 4:29 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann; +Cc: netdev, Nikita V. Shirokov
In-Reply-To: <20180418042951.17183-1-tehnerd@tehnerd.com>
after introduction of bpf_xdp_adjust_tail helper packet length
could be changed not only if xdp->data pointer has been changed
but xdp->data_end as well. making bpf_prog_test_run aware of this
possibility
---
net/bpf/test_run.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 2ced48662c1f..68c3578343b4 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -170,7 +170,8 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
xdp.rxq = &rxqueue->xdp_rxq;
retval = bpf_test_run(prog, &xdp, repeat, &duration);
- if (xdp.data != data + XDP_PACKET_HEADROOM + NET_IP_ALIGN)
+ if (xdp.data != data + XDP_PACKET_HEADROOM + NET_IP_ALIGN ||
+ xdp.data_end != xdp.data + size)
size = xdp.data_end - xdp.data;
ret = bpf_test_finish(kattr, uattr, xdp.data, size, retval, duration);
kfree(data);
--
2.15.1
^ permalink raw reply related
* [PATCH bpf-next v2 10/11] bpf: adding tests for bpf_xdp_adjust_tail
From: Nikita V. Shirokov @ 2018-04-18 4:29 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann; +Cc: netdev, Nikita V. Shirokov
In-Reply-To: <20180418042951.17183-1-tehnerd@tehnerd.com>
adding selftests for bpf_xdp_adjust_tail helper. in this synthetic test
we are testing that 1) if data_end < data helper will return EINVAL
2) for normal use case packet's length would be reduced.
---
tools/include/uapi/linux/bpf.h | 10 +++++++-
tools/testing/selftests/bpf/Makefile | 2 +-
tools/testing/selftests/bpf/bpf_helpers.h | 3 +++
tools/testing/selftests/bpf/test_adjust_tail.c | 30 ++++++++++++++++++++++++
tools/testing/selftests/bpf/test_progs.c | 32 ++++++++++++++++++++++++++
5 files changed, 75 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/bpf/test_adjust_tail.c
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 9d07465023a2..56bf493ba7ed 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -755,6 +755,13 @@ union bpf_attr {
* @addr: pointer to struct sockaddr to bind socket to
* @addr_len: length of sockaddr structure
* Return: 0 on success or negative error code
+ *
+ * int bpf_xdp_adjust_tail(xdp_md, delta)
+ * Adjust the xdp_md.data_end by delta. Only shrinking of packet's
+ * size is supported.
+ * @xdp_md: pointer to xdp_md
+ * @delta: A negative integer to be added to xdp_md.data_end
+ * Return: 0 on success or negative on error
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -821,7 +828,8 @@ union bpf_attr {
FN(msg_apply_bytes), \
FN(msg_cork_bytes), \
FN(msg_pull_data), \
- FN(bind),
+ FN(bind), \
+ FN(xdp_adjust_tail),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 0a315ddabbf4..3e819dc70bee 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -31,7 +31,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
- sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o
+ sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o
# Order correspond to 'make run_tests' order
TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index d8223d99f96d..50c607014b22 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -96,6 +96,9 @@ static int (*bpf_msg_pull_data)(void *ctx, int start, int end, int flags) =
(void *) BPF_FUNC_msg_pull_data;
static int (*bpf_bind)(void *ctx, void *addr, int addr_len) =
(void *) BPF_FUNC_bind;
+static int (*bpf_xdp_adjust_tail)(void *ctx, int offset) =
+ (void *) BPF_FUNC_xdp_adjust_tail;
+
/* llvm builtin functions that eBPF C program may use to
* emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/test_adjust_tail.c b/tools/testing/selftests/bpf/test_adjust_tail.c
new file mode 100644
index 000000000000..4cd5e860c903
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_adjust_tail.c
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * Copyright (c) 2018 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/bpf.h>
+#include <linux/if_ether.h>
+#include "bpf_helpers.h"
+
+int _version SEC("version") = 1;
+
+SEC("xdp_adjust_tail")
+int _xdp_adjust_tail(struct xdp_md *xdp)
+{
+ void *data_end = (void *)(long)xdp->data_end;
+ void *data = (void *)(long)xdp->data;
+ int offset = 0;
+
+ if (data_end - data == 54)
+ offset = 256;
+ else
+ offset = 20;
+ if (bpf_xdp_adjust_tail(xdp, 0 - offset))
+ return XDP_DROP;
+ return XDP_TX;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index faadbe233966..eedda98d7bb1 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -166,6 +166,37 @@ static void test_xdp(void)
bpf_object__close(obj);
}
+static void test_xdp_adjust_tail(void)
+{
+ const char *file = "./test_adjust_tail.o";
+ struct bpf_object *obj;
+ char buf[128];
+ __u32 duration, retval, size;
+ int err, prog_fd;
+
+ err = bpf_prog_load(file, BPF_PROG_TYPE_XDP, &obj, &prog_fd);
+ if (err) {
+ error_cnt++;
+ return;
+ }
+
+ err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
+ buf, &size, &retval, &duration);
+
+ CHECK(err || errno || retval != XDP_DROP,
+ "ipv4", "err %d errno %d retval %d size %d\n",
+ err, errno, retval, size);
+
+ err = bpf_prog_test_run(prog_fd, 1, &pkt_v6, sizeof(pkt_v6),
+ buf, &size, &retval, &duration);
+ CHECK(err || errno || retval != XDP_TX || size != 54,
+ "ipv6", "err %d errno %d retval %d size %d\n",
+ err, errno, retval, size);
+ bpf_object__close(obj);
+}
+
+
+
#define MAGIC_VAL 0x1234
#define NUM_ITER 100000
#define VIP_NUM 5
@@ -1177,6 +1208,7 @@ int main(void)
{
test_pkt_access();
test_xdp();
+ test_xdp_adjust_tail();
test_l4lb_all();
test_xdp_noinline();
test_tcp_estats();
--
2.15.1
^ permalink raw reply related
* [PATCH bpf-next v2 11/11] bpf: add bpf_xdp_adjust_tail sample prog
From: Nikita V. Shirokov @ 2018-04-18 4:29 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann; +Cc: netdev, Nikita V. Shirokov
In-Reply-To: <20180418042951.17183-1-tehnerd@tehnerd.com>
adding bpf's sample program which is using bpf_xdp_adjust_tail helper
by generating ICMPv4 "packet to big" message if ingress packet's size is
bigger then 600 bytes
---
samples/bpf/Makefile | 4 +
samples/bpf/xdp_adjust_tail_kern.c | 152 ++++++++++++++++++++++++++++++
samples/bpf/xdp_adjust_tail_user.c | 142 ++++++++++++++++++++++++++++
tools/testing/selftests/bpf/bpf_helpers.h | 2 +
4 files changed, 300 insertions(+)
create mode 100644 samples/bpf/xdp_adjust_tail_kern.c
create mode 100644 samples/bpf/xdp_adjust_tail_user.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 4d6a6edd4bf6..aa8c392e2e52 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -44,6 +44,7 @@ hostprogs-y += xdp_monitor
hostprogs-y += xdp_rxq_info
hostprogs-y += syscall_tp
hostprogs-y += cpustat
+hostprogs-y += xdp_adjust_tail
# Libbpf dependencies
LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
@@ -95,6 +96,7 @@ xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
+xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o
# Tell kbuild to always build the programs
always := $(hostprogs-y)
@@ -148,6 +150,7 @@ always += xdp_rxq_info_kern.o
always += xdp2skb_meta_kern.o
always += syscall_tp_kern.o
always += cpustat_kern.o
+always += xdp_adjust_tail_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -193,6 +196,7 @@ HOSTLOADLIBES_xdp_monitor += -lelf
HOSTLOADLIBES_xdp_rxq_info += -lelf
HOSTLOADLIBES_syscall_tp += -lelf
HOSTLOADLIBES_cpustat += -lelf
+HOSTLOADLIBES_xdp_adjust_tail += -lelf
# Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
# make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/xdp_adjust_tail_kern.c b/samples/bpf/xdp_adjust_tail_kern.c
new file mode 100644
index 000000000000..411fdb21f8bc
--- /dev/null
+++ b/samples/bpf/xdp_adjust_tail_kern.c
@@ -0,0 +1,152 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * Copyright (c) 2018 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program shows how to use bpf_xdp_adjust_tail() by
+ * generating ICMPv4 "packet to big" (unreachable/ df bit set frag needed
+ * to be more preice in case of v4)" where receiving packets bigger then
+ * 600 bytes.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/if_vlan.h>
+#include <linux/ip.h>
+#include <linux/icmp.h>
+#include "bpf_helpers.h"
+
+#define DEFAULT_TTL 64
+#define MAX_PCKT_SIZE 600
+#define ICMP_TOOBIG_SIZE 98
+#define ICMP_TOOBIG_PAYLOAD_SIZE 92
+
+struct bpf_map_def SEC("maps") icmpcnt = {
+ .type = BPF_MAP_TYPE_ARRAY,
+ .key_size = sizeof(__u32),
+ .value_size = sizeof(__u64),
+ .max_entries = 1,
+};
+
+static __always_inline void count_icmp(void)
+{
+ u64 key = 0;
+ u64 *icmp_count;
+
+ icmp_count = bpf_map_lookup_elem(&icmpcnt, &key);
+ if (icmp_count)
+ *icmp_count += 1;
+}
+
+static __always_inline void swap_mac(void *data, struct ethhdr *orig_eth)
+{
+ struct ethhdr *eth;
+
+ eth = data;
+ memcpy(eth->h_source, orig_eth->h_dest, ETH_ALEN);
+ memcpy(eth->h_dest, orig_eth->h_source, ETH_ALEN);
+ eth->h_proto = orig_eth->h_proto;
+}
+
+static __always_inline __u16 csum_fold_helper(__u32 csum)
+{
+ return ~((csum & 0xffff) + (csum >> 16));
+}
+
+static __always_inline void ipv4_csum(void *data_start, int data_size,
+ __u32 *csum)
+{
+ *csum = bpf_csum_diff(0, 0, data_start, data_size, *csum);
+ *csum = csum_fold_helper(*csum);
+}
+
+static __always_inline int send_icmp4_too_big(struct xdp_md *xdp)
+{
+ int headroom = (int)sizeof(struct iphdr) + (int)sizeof(struct icmphdr);
+
+ if (bpf_xdp_adjust_head(xdp, 0 - headroom))
+ return XDP_DROP;
+ void *data = (void *)(long)xdp->data;
+ void *data_end = (void *)(long)xdp->data_end;
+
+ if (data + (ICMP_TOOBIG_SIZE + headroom) > data_end)
+ return XDP_DROP;
+
+ struct iphdr *iph, *orig_iph;
+ struct icmphdr *icmp_hdr;
+ struct ethhdr *orig_eth;
+ __u32 csum = 0;
+ __u64 off = 0;
+
+ orig_eth = data + headroom;
+ swap_mac(data, orig_eth);
+ off += sizeof(struct ethhdr);
+ iph = data + off;
+ off += sizeof(struct iphdr);
+ icmp_hdr = data + off;
+ off += sizeof(struct icmphdr);
+ orig_iph = data + off;
+ icmp_hdr->type = ICMP_DEST_UNREACH;
+ icmp_hdr->code = ICMP_FRAG_NEEDED;
+ icmp_hdr->un.frag.mtu = htons(MAX_PCKT_SIZE-sizeof(struct ethhdr));
+ icmp_hdr->checksum = 0;
+ ipv4_csum(icmp_hdr, ICMP_TOOBIG_PAYLOAD_SIZE, &csum);
+ icmp_hdr->checksum = csum;
+ iph->ttl = DEFAULT_TTL;
+ iph->daddr = orig_iph->saddr;
+ iph->saddr = orig_iph->daddr;
+ iph->version = 4;
+ iph->ihl = 5;
+ iph->protocol = IPPROTO_ICMP;
+ iph->tos = 0;
+ iph->tot_len = htons(
+ ICMP_TOOBIG_SIZE + headroom - sizeof(struct ethhdr));
+ iph->check = 0;
+ csum = 0;
+ ipv4_csum(iph, sizeof(struct iphdr), &csum);
+ iph->check = csum;
+ count_icmp();
+ return XDP_TX;
+}
+
+
+static __always_inline int handle_ipv4(struct xdp_md *xdp)
+{
+ void *data_end = (void *)(long)xdp->data_end;
+ void *data = (void *)(long)xdp->data;
+ int pckt_size = data_end - data;
+ int offset;
+
+ if (pckt_size > MAX_PCKT_SIZE) {
+ offset = pckt_size - ICMP_TOOBIG_SIZE;
+ if (bpf_xdp_adjust_tail(xdp, 0 - offset))
+ return XDP_PASS;
+ return send_icmp4_too_big(xdp);
+ }
+ return XDP_PASS;
+}
+
+SEC("xdp_icmp")
+int _xdp_icmp(struct xdp_md *xdp)
+{
+ void *data_end = (void *)(long)xdp->data_end;
+ void *data = (void *)(long)xdp->data;
+ struct ethhdr *eth = data;
+ __u16 h_proto;
+
+ if (eth + 1 > data_end)
+ return XDP_DROP;
+
+ h_proto = eth->h_proto;
+
+ if (h_proto == htons(ETH_P_IP))
+ return handle_ipv4(xdp);
+ else
+ return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_adjust_tail_user.c b/samples/bpf/xdp_adjust_tail_user.c
new file mode 100644
index 000000000000..f621a541b574
--- /dev/null
+++ b/samples/bpf/xdp_adjust_tail_user.c
@@ -0,0 +1,142 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * Copyright (c) 2018 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <assert.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/resource.h>
+#include <arpa/inet.h>
+#include <netinet/ether.h>
+#include <unistd.h>
+#include <time.h>
+#include "bpf_load.h"
+#include "libbpf.h"
+#include "bpf_util.h"
+
+#define STATS_INTERVAL_S 2U
+
+static int ifindex = -1;
+static __u32 xdp_flags;
+
+static void int_exit(int sig)
+{
+ if (ifindex > -1)
+ bpf_set_link_xdp_fd(ifindex, -1, xdp_flags);
+ exit(0);
+}
+
+/* simple "icmp packet too big sent" counter
+ */
+static void poll_stats(unsigned int kill_after_s)
+{
+ time_t started_at = time(NULL);
+ __u64 value = 0;
+ int key = 0;
+
+
+ while (!kill_after_s || time(NULL) - started_at <= kill_after_s) {
+ sleep(STATS_INTERVAL_S);
+
+ assert(bpf_map_lookup_elem(map_fd[0], &key, &value) == 0);
+
+ printf("icmp \"packet too big\" sent: %10llu pkts\n", value);
+ }
+}
+
+static void usage(const char *cmd)
+{
+ printf("Start a XDP prog which send ICMP \"packet too big\" \n"
+ "messages if ingress packet is bigger then MAX_SIZE bytes\n");
+ printf("Usage: %s [...]\n", cmd);
+ printf(" -i <ifindex> Interface Index\n");
+ printf(" -T <stop-after-X-seconds> Default: 0 (forever)\n");
+ printf(" -S use skb-mode\n");
+ printf(" -N enforce native mode\n");
+ printf(" -h Display this help\n");
+}
+
+int main(int argc, char **argv)
+{
+ unsigned char opt_flags[256] = {};
+ unsigned int kill_after_s = 0;
+ const char *optstr = "i:T:SNh";
+ struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+ char filename[256];
+ int opt;
+ int i;
+
+
+ for (i = 0; i < strlen(optstr); i++)
+ if (optstr[i] != 'h' && 'a' <= optstr[i] && optstr[i] <= 'z')
+ opt_flags[(unsigned char)optstr[i]] = 1;
+
+ while ((opt = getopt(argc, argv, optstr)) != -1) {
+
+ switch (opt) {
+ case 'i':
+ ifindex = atoi(optarg);
+ break;
+ case 'T':
+ kill_after_s = atoi(optarg);
+ break;
+ case 'S':
+ xdp_flags |= XDP_FLAGS_SKB_MODE;
+ break;
+ case 'N':
+ xdp_flags |= XDP_FLAGS_DRV_MODE;
+ break;
+ default:
+ usage(argv[0]);
+ return 1;
+ }
+ opt_flags[opt] = 0;
+ }
+
+ for (i = 0; i < strlen(optstr); i++) {
+ if (opt_flags[(unsigned int)optstr[i]]) {
+ fprintf(stderr, "Missing argument -%c\n", optstr[i]);
+ usage(argv[0]);
+ return 1;
+ }
+ }
+
+ if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+ perror("setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)");
+ return 1;
+ }
+
+ snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+ if (load_bpf_file(filename)) {
+ printf("%s", bpf_log_buf);
+ return 1;
+ }
+
+ if (!prog_fd[0]) {
+ printf("load_bpf_file: %s\n", strerror(errno));
+ return 1;
+ }
+
+ signal(SIGINT, int_exit);
+ signal(SIGTERM, int_exit);
+
+ if (bpf_set_link_xdp_fd(ifindex, prog_fd[0], xdp_flags) < 0) {
+ printf("link set xdp fd failed\n");
+ return 1;
+ }
+
+ poll_stats(kill_after_s);
+
+ bpf_set_link_xdp_fd(ifindex, -1, xdp_flags);
+
+ return 0;
+}
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 50c607014b22..9271576bdc8f 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -132,6 +132,8 @@ static int (*bpf_l3_csum_replace)(void *ctx, int off, int from, int to, int flag
(void *) BPF_FUNC_l3_csum_replace;
static int (*bpf_l4_csum_replace)(void *ctx, int off, int from, int to, int flags) =
(void *) BPF_FUNC_l4_csum_replace;
+static int (*bpf_csum_diff)(void *from, int from_size, void *to, int to_size, int seed) =
+ (void *) BPF_FUNC_csum_diff;
static int (*bpf_skb_under_cgroup)(void *ctx, void *map, int index) =
(void *) BPF_FUNC_skb_under_cgroup;
static int (*bpf_skb_change_head)(void *, int len, int flags) =
--
2.15.1
^ permalink raw reply related
* Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
From: Dave Young @ 2018-04-18 6:15 UTC (permalink / raw)
To: Rahul Lakkireddy
Cc: netdev, kexec, linux-fsdevel, linux-kernel, indranil, nirranjan,
stephen, ganeshgr, ebiederm, akpm, torvalds, davem, viro
In-Reply-To: <cover.1523950321.git.rahul.lakkireddy@chelsio.com>
Hi Rahul,
On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
> On production servers running variety of workloads over time, kernel
> panic can happen sporadically after days or even months. It is
> important to collect as much debug logs as possible to root cause
> and fix the problem, that may not be easy to reproduce. Snapshot of
> underlying hardware/firmware state (like register dump, firmware
> logs, adapter memory, etc.), at the time of kernel panic will be very
> helpful while debugging the culprit device driver.
>
> This series of patches add new generic framework that enable device
> drivers to collect device specific snapshot of the hardware/firmware
> state of the underlying device in the crash recovery kernel. In crash
> recovery kernel, the collected logs are added as elf notes to
> /proc/vmcore, which is copied by user space scripts for post-analysis.
>
> The sequence of actions done by device drivers to append their device
> specific hardware/firmware logs to /proc/vmcore are as follows:
>
> 1. During probe (before hardware is initialized), device drivers
> register to the vmcore module (via vmcore_add_device_dump()), with
> callback function, along with buffer size and log name needed for
> firmware/hardware log collection.
I assumed the elf notes info should be prepared while kexec_[file_]load
phase. But I did not read the old comment, not sure if it has been discussed
or not.
If do this in 2nd kernel a question is driver can be loaded later than vmcore init.
How to guarantee the function works if vmcore reading happens before
the driver is loaded?
Also it is possible that kdump initramfs does not contains the driver
module.
Am I missing something?
>
> 2. vmcore module allocates the buffer with requested size. It adds
> an elf note and invokes the device driver's registered callback
> function.
>
> 3. Device driver collects all hardware/firmware logs into the buffer
> and returns control back to vmcore module.
>
> The device specific hardware/firmware logs can be seen as elf notes:
>
> # readelf -n /proc/vmcore
>
> Displaying notes found at file offset 0x00001000 with length 0x04003288:
> Owner Data size Description
> VMCOREDD_cxgb4_0000:02:00.4 0x02000fd8 Unknown note type: (0x00000700)
> VMCOREDD_cxgb4_0000:04:00.4 0x02000fd8 Unknown note type: (0x00000700)
> CORE 0x00000150 NT_PRSTATUS (prstatus structure)
> CORE 0x00000150 NT_PRSTATUS (prstatus structure)
> CORE 0x00000150 NT_PRSTATUS (prstatus structure)
> CORE 0x00000150 NT_PRSTATUS (prstatus structure)
> CORE 0x00000150 NT_PRSTATUS (prstatus structure)
> CORE 0x00000150 NT_PRSTATUS (prstatus structure)
> CORE 0x00000150 NT_PRSTATUS (prstatus structure)
> CORE 0x00000150 NT_PRSTATUS (prstatus structure)
> VMCOREINFO 0x0000074f Unknown note type: (0x00000000)
>
> Patch 1 adds API to vmcore module to allow drivers to register callback
> to collect the device specific hardware/firmware logs. The logs will
> be added to /proc/vmcore as elf notes.
>
> Patch 2 updates read and mmap logic to append device specific hardware/
> firmware logs as elf notes.
>
> Patch 3 shows a cxgb4 driver example using the API to collect
> hardware/firmware logs in crash recovery kernel, before hardware is
> initialized.
>
> Thanks,
> Rahul
>
> RFC v1: https://lkml.org/lkml/2018/3/2/542
> RFC v2: https://lkml.org/lkml/2018/3/16/326
>
> ---
> v4:
> - Made __vmcore_add_device_dump() static.
> - Moved compile check to define vmcore_add_device_dump() to
> crash_dump.h to fix compilation when vmcore.c is not compiled in.
> - Convert ---help--- to help in Kconfig as indicated by checkpatch.
> - Rebased to tip.
>
> v3:
> - Dropped sysfs crashdd module.
> - Exported dumps as elf notes. Suggested by Eric Biederman
> <ebiederm@xmission.com>. Added as patch 2 in this version.
> - Added CONFIG_PROC_VMCORE_DEVICE_DUMP to allow configuring device
> dump support.
> - Moved logic related to adding dumps from crashdd to vmcore module.
> - Rename all crashdd* to vmcoredd*.
> - Updated comments.
>
> v2:
> - Added ABI Documentation for crashdd.
> - Directly use octal permission instead of macro.
>
> Changes since rfc v2:
> - Moved exporting crashdd from procfs to sysfs. Suggested by
> Stephen Hemminger <stephen@networkplumber.org>
> - Moved code from fs/proc/crashdd.c to fs/crashdd/ directory.
> - Replaced all proc API with sysfs API and updated comments.
> - Calling driver callback before creating the binary file under
> crashdd sysfs.
> - Changed binary dump file permission from S_IRUSR to S_IRUGO.
> - Changed module name from CRASH_DRIVER_DUMP to CRASH_DEVICE_DUMP.
>
> rfc v2:
> - Collecting logs in 2nd kernel instead of during kernel panic.
> Suggested by Eric Biederman <ebiederm@xmission.com>.
> - Added new crashdd module that exports /proc/crashdd/ containing
> driver's registered hardware/firmware logs in patch 1.
> - Replaced the API to allow drivers to register their hardware/firmware
> log collect routine in crash recovery kernel in patch 1.
> - Updated patch 2 to use the new API in patch 1.
>
> Rahul Lakkireddy (3):
> vmcore: add API to collect hardware dump in second kernel
> vmcore: append device dumps to vmcore as elf notes
> cxgb4: collect hardware dump in second kernel
>
> drivers/net/ethernet/chelsio/cxgb4/cxgb4.h | 4 +
> drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c | 25 ++
> drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h | 3 +
> drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 10 +
> fs/proc/Kconfig | 10 +
> fs/proc/vmcore.c | 399 ++++++++++++++++++++++-
> include/linux/crash_core.h | 4 +
> include/linux/crash_dump.h | 17 +
> include/linux/kcore.h | 6 +
> include/uapi/linux/elf.h | 1 +
> 10 files changed, 466 insertions(+), 13 deletions(-)
>
> --
> 2.14.1
>
>
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
Thanks
Dave
^ permalink raw reply
* [PATCH net-next 0/4] tracking TCP data delivery and ECN stats
From: Yuchung Cheng @ 2018-04-18 6:18 UTC (permalink / raw)
To: davem; +Cc: netdev, edumazet, ncardwell, soheil, Yuchung Cheng
This patch series improve tracking the data delivery status
1. minor improvement on SYN data
2. accounting bytes delivered with CE marks
3. exporting the delivery stats to applications
s.t. users can get better sense of TCP performance at per host,
per connection, and even per application message level.
Yuchung Cheng (4):
tcp: better delivery accounting for SYN-ACK and SYN-data
tcp: new helper to calculate newly delivered
tcp: track total bytes delivered with ECN CE marks
tcp: export packets delivery info
include/linux/tcp.h | 1 +
include/uapi/linux/snmp.h | 2 ++
include/uapi/linux/tcp.h | 5 +++++
net/ipv4/proc.c | 2 ++
net/ipv4/tcp.c | 8 +++++++-
net/ipv4/tcp_input.c | 33 ++++++++++++++++++++++++++++-----
6 files changed, 45 insertions(+), 6 deletions(-)
--
2.17.0.484.g0c8726318c-goog
^ permalink raw reply
* [PATCH net-next 1/4] tcp: better delivery accounting for SYN-ACK and SYN-data
From: Yuchung Cheng @ 2018-04-18 6:18 UTC (permalink / raw)
To: davem; +Cc: netdev, edumazet, ncardwell, soheil, Yuchung Cheng
In-Reply-To: <20180418061849.220459-1-ycheng@google.com>
the tcp_sock:delivered has inconsistent accounting for SYN and FIN.
1. it counts pure FIN
2. it counts pure SYN
3. it counts SYN-data twice
4. it does not count SYN-ACK
For congestion control perspective it does not matter much as C.C. only
cares about the difference not the aboslute value. But the next patch
would export this field to user-space so it's better to report the absolute
value w/o these caveats.
This patch counts SYN, SYN-ACK, or SYN-data delivery once always in
the "delivered" field.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
net/ipv4/tcp_input.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f93687f97d80..2499248d4a67 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5567,9 +5567,12 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
return true;
}
tp->syn_data_acked = tp->syn_data;
- if (tp->syn_data_acked)
- NET_INC_STATS(sock_net(sk),
- LINUX_MIB_TCPFASTOPENACTIVE);
+ if (tp->syn_data_acked) {
+ NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPFASTOPENACTIVE);
+ /* SYN-data is counted as two separate packets in tcp_ack() */
+ if (tp->delivered > 1)
+ --tp->delivered;
+ }
tcp_fastopen_add_skb(sk, synack);
@@ -5901,6 +5904,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
}
switch (sk->sk_state) {
case TCP_SYN_RECV:
+ tp->delivered++; /* SYN-ACK delivery isn't tracked in tcp_ack */
if (!tp->srtt_us)
tcp_synack_rtt_meas(sk, req);
--
2.17.0.484.g0c8726318c-goog
^ permalink raw reply related
* [PATCH net-next 2/4] tcp: new helper to calculate newly delivered
From: Yuchung Cheng @ 2018-04-18 6:18 UTC (permalink / raw)
To: davem; +Cc: netdev, edumazet, ncardwell, soheil, Yuchung Cheng
In-Reply-To: <20180418061849.220459-1-ycheng@google.com>
Add new helper tcp_newly_delivered() to prepare the ECN accounting change.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
net/ipv4/tcp_input.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2499248d4a67..01cce28f90ca 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3496,6 +3496,16 @@ static void tcp_xmit_recovery(struct sock *sk, int rexmit)
tcp_xmit_retransmit_queue(sk);
}
+/* Returns the number of packets newly acked or sacked by the current ACK */
+static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered, int flag)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 delivered;
+
+ delivered = tp->delivered - prior_delivered;
+ return delivered;
+}
+
/* This routine deals with incoming acks, but not outgoing ones. */
static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
{
@@ -3619,7 +3629,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP))
sk_dst_confirm(sk);
- delivered = tp->delivered - delivered; /* freshly ACKed or SACKed */
+ delivered = tcp_newly_delivered(sk, delivered, flag);
lost = tp->lost - lost; /* freshly marked lost */
rs.is_ack_delayed = !!(flag & FLAG_ACK_MAYBE_DELAYED);
tcp_rate_gen(sk, delivered, lost, is_sack_reneg, sack_state.rate);
@@ -3629,9 +3639,11 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
no_queue:
/* If data was DSACKed, see if we can undo a cwnd reduction. */
- if (flag & FLAG_DSACKING_ACK)
+ if (flag & FLAG_DSACKING_ACK) {
tcp_fastretrans_alert(sk, prior_snd_una, is_dupack, &flag,
&rexmit);
+ tcp_newly_delivered(sk, delivered, flag);
+ }
/* If this ack opens up a zero window, clear backoff. It was
* being used to time the probes, and is probably far higher than
* it needs to be for normal retransmission.
@@ -3655,6 +3667,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
&sack_state);
tcp_fastretrans_alert(sk, prior_snd_una, is_dupack, &flag,
&rexmit);
+ tcp_newly_delivered(sk, delivered, flag);
tcp_xmit_recovery(sk, rexmit);
}
--
2.17.0.484.g0c8726318c-goog
^ permalink raw reply related
* [PATCH net-next 3/4] tcp: track total bytes delivered with ECN CE marks
From: Yuchung Cheng @ 2018-04-18 6:18 UTC (permalink / raw)
To: davem; +Cc: netdev, edumazet, ncardwell, soheil, Yuchung Cheng
In-Reply-To: <20180418061849.220459-1-ycheng@google.com>
Introduce a new delivered_ce stat in tcp socket to estimate
number of packets being marked with CE bits. The estimation is
done via ACKs with ECE bit. Depending on the actual receiver
behavior, the estimation could have biases.
Since the TCP sender can't really see the CE bit in the data path,
so the sender is technically counting packets marked delivered with
the "ECE / ECN-Echo" flag set.
With RFC3168 ECN, because the ECE bit is sticky, this count can
drastically overestimate the nummber of CE-marked data packets
With DCTCP-style ECN this should be reasonably precise unless there
is loss in the ACK path, in which case it's not precise.
With AccECN proposal this can be made still more precise, even in
the case some degree of ACK loss.
However this is sender's best estimate of CE information.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
include/linux/tcp.h | 1 +
net/ipv4/tcp.c | 1 +
net/ipv4/tcp_input.c | 2 ++
3 files changed, 4 insertions(+)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 8f4c54986f97..20585d5c4e1c 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -281,6 +281,7 @@ struct tcp_sock {
* receiver in Recovery. */
u32 prr_out; /* Total number of pkts sent during Recovery. */
u32 delivered; /* Total data packets delivered incl. rexmits */
+ u32 delivered_ce; /* Like the above but only ECE marked packets */
u32 lost; /* Total data packets lost incl. rexmits */
u32 app_limited; /* limited until "delivered" reaches this val */
u64 first_tx_mstamp; /* start of window send phase */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 438fbca96cd3..5a5ce6da4792 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2559,6 +2559,7 @@ int tcp_disconnect(struct sock *sk, int flags)
tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
tp->snd_cwnd_cnt = 0;
tp->window_clamp = 0;
+ tp->delivered_ce = 0;
tcp_set_ca_state(sk, TCP_CA_Open);
tp->is_sack_reneg = 0;
tcp_clear_retrans(tp);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 01cce28f90ca..b3bff9c20606 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3503,6 +3503,8 @@ static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered, int flag)
u32 delivered;
delivered = tp->delivered - prior_delivered;
+ if (flag & FLAG_ECE)
+ tp->delivered_ce += delivered;
return delivered;
}
--
2.17.0.484.g0c8726318c-goog
^ permalink raw reply related
* [PATCH net-next 4/4] tcp: export packets delivery info
From: Yuchung Cheng @ 2018-04-18 6:18 UTC (permalink / raw)
To: davem; +Cc: netdev, edumazet, ncardwell, soheil, Yuchung Cheng
In-Reply-To: <20180418061849.220459-1-ycheng@google.com>
Export data delivered and delivered with CE marks to
1) SNMP TCPDelivered and TCPDeliveredCE
2) getsockopt(TCP_INFO)
3) Timestamping API SOF_TIMESTAMPING_OPT_STATS
Note that for SCM_TSTAMP_ACK, the delivery info in
SOF_TIMESTAMPING_OPT_STATS is reported before the info
was fully updated on the ACK.
These stats help application monitor TCP delivery and ECN status
on per host, per connection, even per message level.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
include/uapi/linux/snmp.h | 2 ++
include/uapi/linux/tcp.h | 5 +++++
net/ipv4/proc.c | 2 ++
net/ipv4/tcp.c | 7 ++++++-
net/ipv4/tcp_input.c | 6 +++++-
5 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 33a70ece462f..d02e859301ff 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -276,6 +276,8 @@ enum
LINUX_MIB_TCPKEEPALIVE, /* TCPKeepAlive */
LINUX_MIB_TCPMTUPFAIL, /* TCPMTUPFail */
LINUX_MIB_TCPMTUPSUCCESS, /* TCPMTUPSuccess */
+ LINUX_MIB_TCPDELIVERED, /* TCPDelivered */
+ LINUX_MIB_TCPDELIVEREDCE, /* TCPDeliveredCE */
__LINUX_MIB_MAX
};
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 560374c978f9..379b08700a54 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -224,6 +224,9 @@ struct tcp_info {
__u64 tcpi_busy_time; /* Time (usec) busy sending data */
__u64 tcpi_rwnd_limited; /* Time (usec) limited by receive window */
__u64 tcpi_sndbuf_limited; /* Time (usec) limited by send buffer */
+
+ __u32 tcpi_delivered;
+ __u32 tcpi_delivered_ce;
};
/* netlink attributes types for SCM_TIMESTAMPING_OPT_STATS */
@@ -244,6 +247,8 @@ enum {
TCP_NLA_SNDQ_SIZE, /* Data (bytes) pending in send queue */
TCP_NLA_CA_STATE, /* ca_state of socket */
TCP_NLA_SND_SSTHRESH, /* Slow start size threshold */
+ TCP_NLA_DELIVERED, /* Data pkts delivered incl. out-of-order */
+ TCP_NLA_DELIVERED_CE, /* Like above but only ones w/ CE marks */
};
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index a058de677e94..261b71d0ccc5 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -296,6 +296,8 @@ static const struct snmp_mib snmp4_net_list[] = {
SNMP_MIB_ITEM("TCPKeepAlive", LINUX_MIB_TCPKEEPALIVE),
SNMP_MIB_ITEM("TCPMTUPFail", LINUX_MIB_TCPMTUPFAIL),
SNMP_MIB_ITEM("TCPMTUPSuccess", LINUX_MIB_TCPMTUPSUCCESS),
+ SNMP_MIB_ITEM("TCPDelivered", LINUX_MIB_TCPDELIVERED),
+ SNMP_MIB_ITEM("TCPDeliveredCE", LINUX_MIB_TCPDELIVEREDCE),
SNMP_MIB_SENTINEL
};
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5a5ce6da4792..4022073b0aee 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3167,6 +3167,8 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
rate64 = tcp_compute_delivery_rate(tp);
if (rate64)
info->tcpi_delivery_rate = rate64;
+ info->tcpi_delivered = tp->delivered;
+ info->tcpi_delivered_ce = tp->delivered_ce;
unlock_sock_fast(sk, slow);
}
EXPORT_SYMBOL_GPL(tcp_get_info);
@@ -3180,7 +3182,7 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk)
u32 rate;
stats = alloc_skb(7 * nla_total_size_64bit(sizeof(u64)) +
- 5 * nla_total_size(sizeof(u32)) +
+ 7 * nla_total_size(sizeof(u32)) +
3 * nla_total_size(sizeof(u8)), GFP_ATOMIC);
if (!stats)
return NULL;
@@ -3211,9 +3213,12 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk)
nla_put_u8(stats, TCP_NLA_RECUR_RETRANS, inet_csk(sk)->icsk_retransmits);
nla_put_u8(stats, TCP_NLA_DELIVERY_RATE_APP_LMT, !!tp->rate_app_limited);
nla_put_u32(stats, TCP_NLA_SND_SSTHRESH, tp->snd_ssthresh);
+ nla_put_u32(stats, TCP_NLA_DELIVERED, tp->delivered);
+ nla_put_u32(stats, TCP_NLA_DELIVERED_CE, tp->delivered_ce);
nla_put_u32(stats, TCP_NLA_SNDQ_SIZE, tp->write_seq - tp->snd_una);
nla_put_u8(stats, TCP_NLA_CA_STATE, inet_csk(sk)->icsk_ca_state);
+
return stats;
}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b3bff9c20606..0396fb919b5d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3499,12 +3499,16 @@ static void tcp_xmit_recovery(struct sock *sk, int rexmit)
/* Returns the number of packets newly acked or sacked by the current ACK */
static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered, int flag)
{
+ const struct net *net = sock_net(sk);
struct tcp_sock *tp = tcp_sk(sk);
u32 delivered;
delivered = tp->delivered - prior_delivered;
- if (flag & FLAG_ECE)
+ NET_ADD_STATS(net, LINUX_MIB_TCPDELIVERED, delivered);
+ if (flag & FLAG_ECE) {
tp->delivered_ce += delivered;
+ NET_ADD_STATS(net, LINUX_MIB_TCPDELIVEREDCE, delivered);
+ }
return delivered;
}
--
2.17.0.484.g0c8726318c-goog
^ permalink raw reply related
* [PATCH 0/6] Assorted rhashtable improvements. RESEND
From: NeilBrown @ 2018-04-18 6:47 UTC (permalink / raw)
To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
[[ I mistyped linux-kernel the first time I sent these, so
resending. Please reply to this set. Sorry - neilb ]]
Some of these have been posted before and a couple
received an Ack from Herbert, but haven't appeared in any git tree
yet.
Another (the first) has been sent but received no ack.
I've added the second patch, which removes more incorrect
documentation, and added the last two patches.
One further improves rhashtable_walk stability.
The last added rhashtable_walk_prev(), as discussed with Herbert,
which should be useful for seq_files.
(Separately I've posted a patch to Al Viro to make seq_file even
easier to use with rhashtables, but this series does not depend
on that patch).
I don't see these patches as particularly urgent, though the third is a
bugfix that currently prevents me from allowing one rhashtable in
lustre to auto-shrink.
I previously suggested it might be good for some of these patches to
go upstream through 'staging' with the lustre patches. I no longer
think that is necessary. It is probably best for them to go upstream
through net or net-next.
Thanks,
NeilBrown
---
NeilBrown (6):
rhashtable: remove outdated comments about grow_decision etc
rhashtable: remove incorrect comment on r{hl,hash}table_walk_enter()
rhashtable: reset iter when rhashtable_walk_start sees new table
rhashtable: improve rhashtable_walk stability when stop/start used.
rhashtable: further improve stability of rhashtable_walk
rhashtable: add rhashtable_walk_prev()
include/linux/rhashtable.h | 49 +++++++++---------
lib/rhashtable.c | 121 +++++++++++++++++++++++++++++++++++++-------
2 files changed, 126 insertions(+), 44 deletions(-)
--
Signature
^ permalink raw reply
* [PATCH 1/6] rhashtable: remove outdated comments about grow_decision etc
From: NeilBrown @ 2018-04-18 6:47 UTC (permalink / raw)
To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152403346237.16895.8767189357062722046.stgit2@noble>
grow_decision and shink_decision no longer exist, so remove
the remaining references to them.
Signed-off-by: NeilBrown <neilb@suse.com>
---
include/linux/rhashtable.h | 33 ++++++++++++++-------------------
1 file changed, 14 insertions(+), 19 deletions(-)
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 1f8ad121eb43..87d443a5b11d 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -836,9 +836,8 @@ static inline void *__rhashtable_insert_fast(
*
* It is safe to call this function from atomic context.
*
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
*/
static inline int rhashtable_insert_fast(
struct rhashtable *ht, struct rhash_head *obj,
@@ -866,9 +865,8 @@ static inline int rhashtable_insert_fast(
*
* It is safe to call this function from atomic context.
*
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
*/
static inline int rhltable_insert_key(
struct rhltable *hlt, const void *key, struct rhlist_head *list,
@@ -890,9 +888,8 @@ static inline int rhltable_insert_key(
*
* It is safe to call this function from atomic context.
*
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
*/
static inline int rhltable_insert(
struct rhltable *hlt, struct rhlist_head *list,
@@ -922,9 +919,8 @@ static inline int rhltable_insert(
*
* It is safe to call this function from atomic context.
*
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
*/
static inline int rhashtable_lookup_insert_fast(
struct rhashtable *ht, struct rhash_head *obj,
@@ -981,9 +977,8 @@ static inline void *rhashtable_lookup_get_insert_fast(
*
* Lookups may occur in parallel with hashtable mutations and resizing.
*
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
*
* Returns zero on success.
*/
@@ -1134,8 +1129,8 @@ static inline int __rhashtable_remove_fast(
* walk the bucket chain upon removal. The removal operation is thus
* considerable slow if the hash table is not correctly sized.
*
- * Will automatically shrink the table via rhashtable_expand() if the
- * shrink_decision function specified at rhashtable_init() returns true.
+ * Will automatically shrink the table if permitted when residency drops
+ * below 30%.
*
* Returns zero on success, -ENOENT if the entry could not be found.
*/
@@ -1156,8 +1151,8 @@ static inline int rhashtable_remove_fast(
* walk the bucket chain upon removal. The removal operation is thus
* considerable slow if the hash table is not correctly sized.
*
- * Will automatically shrink the table via rhashtable_expand() if the
- * shrink_decision function specified at rhashtable_init() returns true.
+ * Will automatically shrink the table if permitted when residency drops
+ * below 30%
*
* Returns zero on success, -ENOENT if the entry could not be found.
*/
^ permalink raw reply related
* [PATCH 2/6] rhashtable: remove incorrect comment on r{hl, hash}table_walk_enter()
From: NeilBrown @ 2018-04-18 6:47 UTC (permalink / raw)
To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152403346237.16895.8767189357062722046.stgit2@noble>
Neither rhashtable_walk_enter() or rhltable_walk_enter() sleep, so
remove the comments which suggest that they do.
Signed-off-by: NeilBrown <neilb@suse.com>
---
include/linux/rhashtable.h | 3 ---
lib/rhashtable.c | 3 ---
2 files changed, 6 deletions(-)
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 87d443a5b11d..b01d88e196c2 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -1268,9 +1268,6 @@ static inline int rhashtable_walk_init(struct rhashtable *ht,
* For a completely stable walk you should construct your own data
* structure outside the hash table.
*
- * This function may sleep so you must not call it from interrupt
- * context or with spin locks held.
- *
* You must call rhashtable_walk_exit after this function returns.
*/
static inline void rhltable_walk_enter(struct rhltable *hlt,
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 2b2b79974b61..19db8e563c40 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -668,9 +668,6 @@ EXPORT_SYMBOL_GPL(rhashtable_insert_slow);
* For a completely stable walk you should construct your own data
* structure outside the hash table.
*
- * This function may sleep so you must not call it from interrupt
- * context or with spin locks held.
- *
* You must call rhashtable_walk_exit after this function returns.
*/
void rhashtable_walk_enter(struct rhashtable *ht, struct rhashtable_iter *iter)
^ permalink raw reply related
* [PATCH 3/6] rhashtable: reset iter when rhashtable_walk_start sees new table
From: NeilBrown @ 2018-04-18 6:47 UTC (permalink / raw)
To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152403346237.16895.8767189357062722046.stgit2@noble>
The documentation claims that when rhashtable_walk_start_check()
detects a resize event, it will rewind back to the beginning
of the table. This is not true. We need to set ->slot and
->skip to be zero for it to be true.
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NeilBrown <neilb@suse.com>
---
lib/rhashtable.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 19db8e563c40..28e1be9f681b 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -733,6 +733,8 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter)
if (!iter->walker.tbl && !iter->end_of_table) {
iter->walker.tbl = rht_dereference_rcu(ht->tbl, ht);
+ iter->slot = 0;
+ iter->skip = 0;
return -EAGAIN;
}
^ permalink raw reply related
* [PATCH 5/6] rhashtable: further improve stability of rhashtable_walk
From: NeilBrown @ 2018-04-18 6:47 UTC (permalink / raw)
To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152403346237.16895.8767189357062722046.stgit2@noble>
If the sequence:
obj = rhashtable_walk_next(iter);
rhashtable_walk_stop(iter);
rhashtable_remove_fast(ht, &obj->head, params);
rhashtable_walk_start(iter);
races with another thread inserting or removing
an object on the same hash chain, a subsequent
rhashtable_walk_next() is not guaranteed to get the "next"
object. It is possible that an object could be
repeated, or missed.
This can be made more reliable by keeping the objects in a hash chain
sorted by memory address. A subsequent rhashtable_walk_next()
call can reliably find the correct position in the list, and thus
find the 'next' object.
It is not possible (certainly not so easy) to achieve this with an
rhltable as keeping the hash chain in order is not so easy. When the
first object with a given key is removed, it is replaced in the chain
with the next object with the same key, and the address of that
object may not be correctly ordered.
No current user of rhltable_walk_enter() calls
rhashtable_walk_start() more than once, so no current code
could benefit from a more reliable walk of rhltables.
This patch only attempts to improve walks for rhashtables.
- a new object is always inserted after the last object with a
smaller address, or at the start
- when rhashtable_walk_start() is called, it records that 'p' is not
'safe', meaning that it cannot be dereferenced. The revalidation
that was previously done here is moved to rhashtable_walk_next()
- when rhashtable_walk_next() is called while p is not NULL and not
safe, it walks the chain looking for the first object with an
address greater than p and returns that. If there is none, it moves
to the next hash chain.
Signed-off-by: NeilBrown <neilb@suse.com>
---
include/linux/rhashtable.h | 12 ++++++--
lib/rhashtable.c | 66 +++++++++++++++++++++++++++-----------------
2 files changed, 50 insertions(+), 28 deletions(-)
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index b01d88e196c2..5ce6201f246e 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -207,6 +207,7 @@ struct rhashtable_iter {
struct rhashtable_walker walker;
unsigned int slot;
unsigned int skip;
+ bool p_is_unsafe;
bool end_of_table;
};
@@ -730,6 +731,7 @@ static inline void *__rhashtable_insert_fast(
.ht = ht,
.key = key,
};
+ struct rhash_head __rcu **inspos;
struct rhash_head __rcu **pprev;
struct bucket_table *tbl;
struct rhash_head *head;
@@ -757,6 +759,7 @@ static inline void *__rhashtable_insert_fast(
data = ERR_PTR(-ENOMEM);
if (!pprev)
goto out;
+ inspos = pprev;
rht_for_each_continue(head, *pprev, tbl, hash) {
struct rhlist_head *plist;
@@ -768,6 +771,8 @@ static inline void *__rhashtable_insert_fast(
params.obj_cmpfn(&arg, rht_obj(ht, head)) :
rhashtable_compare(&arg, rht_obj(ht, head)))) {
pprev = &head->next;
+ if (head < obj)
+ inspos = &head->next;
continue;
}
@@ -798,7 +803,7 @@ static inline void *__rhashtable_insert_fast(
if (unlikely(rht_grow_above_100(ht, tbl)))
goto slow_path;
- head = rht_dereference_bucket(*pprev, tbl, hash);
+ head = rht_dereference_bucket(*inspos, tbl, hash);
RCU_INIT_POINTER(obj->next, head);
if (rhlist) {
@@ -808,7 +813,7 @@ static inline void *__rhashtable_insert_fast(
RCU_INIT_POINTER(list->next, NULL);
}
- rcu_assign_pointer(*pprev, obj);
+ rcu_assign_pointer(*inspos, obj);
atomic_inc(&ht->nelems);
if (rht_grow_above_75(ht, tbl))
@@ -1263,7 +1268,8 @@ static inline int rhashtable_walk_init(struct rhashtable *ht,
* Note that if you restart a walk after rhashtable_walk_stop you
* may see the same object twice. Also, you may miss objects if
* there are removals in between rhashtable_walk_stop and the next
- * call to rhashtable_walk_start.
+ * call to rhashtable_walk_start. Note that this is different to
+ * rhashtable_walk_enter() which never leads to missing objects.
*
* For a completely stable walk you should construct your own data
* structure outside the hash table.
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 16cde54a553b..be7eb57d9398 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -566,6 +566,10 @@ static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
return ERR_PTR(-ENOMEM);
head = rht_dereference_bucket(*pprev, tbl, hash);
+ while (!rht_is_a_nulls(head) && head < obj) {
+ pprev = &head->next;
+ head = rht_dereference_bucket(*pprev, tbl, hash);
+ }
RCU_INIT_POINTER(obj->next, head);
if (ht->rhlist) {
@@ -660,10 +664,10 @@ EXPORT_SYMBOL_GPL(rhashtable_insert_slow);
*
* This function prepares a hash table walk.
*
- * Note that if you restart a walk after rhashtable_walk_stop you
- * may see the same object twice. Also, you may miss objects if
- * there are removals in between rhashtable_walk_stop and the next
- * call to rhashtable_walk_start.
+ * A walk is guaranteed to return every object that was in
+ * the table before this call, and is still in the table when
+ * rhashtable_walk_next() returns NULL. Duplicates can be
+ * seen, but only if there is a rehash event during the walk.
*
* For a completely stable walk you should construct your own data
* structure outside the hash table.
@@ -743,19 +747,10 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter)
if (iter->p && !rhlist) {
/*
- * We need to validate that 'p' is still in the table, and
- * if so, update 'skip'
+ * 'p' will be revalidated when rhashtable_walk_next()
+ * is called.
*/
- struct rhash_head *p;
- int skip = 0;
- rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
- skip++;
- if (p == iter->p) {
- iter->skip = skip;
- goto found;
- }
- }
- iter->p = NULL;
+ iter->p_is_unsafe = true;
} else if (iter->p && rhlist) {
/* Need to validate that 'list' is still in the table, and
* if so, update 'skip' and 'p'.
@@ -872,15 +867,35 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
bool rhlist = ht->rhlist;
if (p) {
- if (!rhlist || !(list = rcu_dereference(list->next))) {
- p = rcu_dereference(p->next);
- list = container_of(p, struct rhlist_head, rhead);
- }
- if (!rht_is_a_nulls(p)) {
- iter->skip++;
- iter->p = p;
- iter->list = list;
- return rht_obj(ht, rhlist ? &list->rhead : p);
+ if (!rhlist && iter->p_is_unsafe) {
+ /*
+ * First time next() was called after start().
+ * Need to find location of 'p' in the list.
+ */
+ struct rhash_head *p;
+
+ iter->skip = 0;
+ rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
+ iter->skip++;
+ if (p <= iter->p)
+ continue;
+
+ /* p is the next object after iter->p */
+ iter->p = p;
+ iter->p_is_unsafe = false;
+ return rht_obj(ht, p);
+ }
+ } else {
+ if (!rhlist || !(list = rcu_dereference(list->next))) {
+ p = rcu_dereference(p->next);
+ list = container_of(p, struct rhlist_head, rhead);
+ }
+ if (!rht_is_a_nulls(p)) {
+ iter->skip++;
+ iter->p = p;
+ iter->list = list;
+ return rht_obj(ht, rhlist ? &list->rhead : p);
+ }
}
/* At the end of this slot, switch to next one and then find
@@ -890,6 +905,7 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
iter->slot++;
}
+ iter->p_is_unsafe = false;
return __rhashtable_walk_find_next(iter);
}
EXPORT_SYMBOL_GPL(rhashtable_walk_next);
^ permalink raw reply related
* [PATCH 6/6] rhashtable: add rhashtable_walk_prev()
From: NeilBrown @ 2018-04-18 6:47 UTC (permalink / raw)
To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152403346237.16895.8767189357062722046.stgit2@noble>
rhashtable_walk_prev() returns the object returned by
the previous rhashtable_walk_next(), providing it is still in the
table (or was during this grace period).
This works even if rhashtable_walk_stop() and rhashtable_talk_start()
have been called since the last rhashtable_walk_next().
If there have been no calls to rhashtable_walk_next(), or if the
object is gone from the table, then NULL is returned.
This can usefully be used in a seq_file ->start() function.
If the pos is the same as was returned by the last ->next() call,
then rhashtable_walk_prev() can be used to re-establish the
current location in the table. If it returns NULL, then
rhashtable_walk_next() should be used.
Signed-off-by: NeilBrown <neilb@suse.com>
---
include/linux/rhashtable.h | 1 +
lib/rhashtable.c | 30 ++++++++++++++++++++++++++++++
2 files changed, 31 insertions(+)
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 5ce6201f246e..b1ad2b6a3f3f 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -397,6 +397,7 @@ static inline void rhashtable_walk_start(struct rhashtable_iter *iter)
void *rhashtable_walk_next(struct rhashtable_iter *iter);
void *rhashtable_walk_peek(struct rhashtable_iter *iter);
+void *rhashtable_walk_prev(struct rhashtable_iter *iter);
void rhashtable_walk_stop(struct rhashtable_iter *iter) __releases(RCU);
void rhashtable_free_and_destroy(struct rhashtable *ht,
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index be7eb57d9398..d2f941146ea3 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -910,6 +910,36 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
}
EXPORT_SYMBOL_GPL(rhashtable_walk_next);
+/**
+ * rhashtable_walk_prev - Return the previously returned object, if available
+ * @iter: Hash table iterator
+ *
+ * If rhashtable_walk_next() has previously been called and the object
+ * it returned is still in the hash table, that object is returned again,
+ * otherwise %NULL is returned.
+ *
+ * If the recent rhashtable_walk_next() call was since the most recent
+ * rhashtable_walk_start() call then the returned object may not, strictly
+ * speaking, still be in the table. It will be safe to dereference.
+ *
+ * Note that the iterator is not changed and in particular it does not
+ * step backwards.
+ */
+void *rhashtable_walk_prev(struct rhashtable_iter *iter)
+{
+ struct rhashtable *ht = iter->ht;
+ struct rhash_head *p = iter->p;
+
+ if (!p)
+ return NULL;
+ if (!iter->p_is_unsafe || ht->rhlist)
+ return p;
+ rht_for_each_rcu(p, iter->walker.tbl, iter->slot)
+ if (p == iter->p)
+ return p;
+ return NULL;
+}
+
/**
* rhashtable_walk_peek - Return the next object but don't advance the iterator
* @iter: Hash table iterator
^ permalink raw reply related
* [PATCH 0/6] Assorted rhashtable improvements.
From: NeilBrown @ 2018-04-18 6:47 UTC (permalink / raw)
To: Thomas Graf, Herbert Xu; +Cc: netdev, inux-kernel
Some of these have been posted before and a couple
received an Ack from Herbert, but haven't appeared in any git tree
yet.
Another (the first) has been sent but received no ack.
I've added the second patch, which removes more incorrect
documentation, and added the last two patches.
One further improves rhashtable_walk stability.
The last added rhashtable_walk_prev(), as discussed with Herbert,
which should be useful for seq_files.
(Separately I've posted a patch to Al Viro to make seq_file even
easier to use with rhashtables, but this series does not depend
on that patch).
I don't see these patches as particularly urgent, though the third is a
bugfix that currently prevents me from allowing one rhashtable in
lustre to auto-shrink.
I previously suggested it might be good for some of these patches to
go upstream through 'staging' with the lustre patches. I no longer
think that is necessary. It is probably best for them to go upstream
through net or net-next.
Thanks,
NeilBrown
---
NeilBrown (6):
rhashtable: remove outdated comments about grow_decision etc
rhashtable: remove incorrect comment on r{hl,hash}table_walk_enter()
rhashtable: reset iter when rhashtable_walk_start sees new table
rhashtable: improve rhashtable_walk stability when stop/start used.
rhashtable: further improve stability of rhashtable_walk
rhashtable: add rhashtable_walk_prev()
include/linux/rhashtable.h | 49 +++++++++---------
lib/rhashtable.c | 121 +++++++++++++++++++++++++++++++++++++-------
2 files changed, 126 insertions(+), 44 deletions(-)
^ permalink raw reply
* [PATCH 1/6] rhashtable: remove outdated comments about grow_decision etc
From: NeilBrown @ 2018-04-18 6:47 UTC (permalink / raw)
To: Thomas Graf, Herbert Xu; +Cc: netdev, inux-kernel
In-Reply-To: <152403346237.16895.8767189357062722046.stgit@noble>
grow_decision and shink_decision no longer exist, so remove
the remaining references to them.
Signed-off-by: NeilBrown <neilb@suse.com>
---
include/linux/rhashtable.h | 33 ++++++++++++++-------------------
1 file changed, 14 insertions(+), 19 deletions(-)
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 1f8ad121eb43..87d443a5b11d 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -836,9 +836,8 @@ static inline void *__rhashtable_insert_fast(
*
* It is safe to call this function from atomic context.
*
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
*/
static inline int rhashtable_insert_fast(
struct rhashtable *ht, struct rhash_head *obj,
@@ -866,9 +865,8 @@ static inline int rhashtable_insert_fast(
*
* It is safe to call this function from atomic context.
*
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
*/
static inline int rhltable_insert_key(
struct rhltable *hlt, const void *key, struct rhlist_head *list,
@@ -890,9 +888,8 @@ static inline int rhltable_insert_key(
*
* It is safe to call this function from atomic context.
*
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
*/
static inline int rhltable_insert(
struct rhltable *hlt, struct rhlist_head *list,
@@ -922,9 +919,8 @@ static inline int rhltable_insert(
*
* It is safe to call this function from atomic context.
*
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
*/
static inline int rhashtable_lookup_insert_fast(
struct rhashtable *ht, struct rhash_head *obj,
@@ -981,9 +977,8 @@ static inline void *rhashtable_lookup_get_insert_fast(
*
* Lookups may occur in parallel with hashtable mutations and resizing.
*
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
*
* Returns zero on success.
*/
@@ -1134,8 +1129,8 @@ static inline int __rhashtable_remove_fast(
* walk the bucket chain upon removal. The removal operation is thus
* considerable slow if the hash table is not correctly sized.
*
- * Will automatically shrink the table via rhashtable_expand() if the
- * shrink_decision function specified at rhashtable_init() returns true.
+ * Will automatically shrink the table if permitted when residency drops
+ * below 30%.
*
* Returns zero on success, -ENOENT if the entry could not be found.
*/
@@ -1156,8 +1151,8 @@ static inline int rhashtable_remove_fast(
* walk the bucket chain upon removal. The removal operation is thus
* considerable slow if the hash table is not correctly sized.
*
- * Will automatically shrink the table via rhashtable_expand() if the
- * shrink_decision function specified at rhashtable_init() returns true.
+ * Will automatically shrink the table if permitted when residency drops
+ * below 30%
*
* Returns zero on success, -ENOENT if the entry could not be found.
*/
^ permalink raw reply related
* [PATCH 2/6] rhashtable: remove incorrect comment on r{hl, hash}table_walk_enter()
From: NeilBrown @ 2018-04-18 6:47 UTC (permalink / raw)
To: Thomas Graf, Herbert Xu; +Cc: netdev, inux-kernel
In-Reply-To: <152403346237.16895.8767189357062722046.stgit@noble>
Neither rhashtable_walk_enter() or rhltable_walk_enter() sleep, so
remove the comments which suggest that they do.
Signed-off-by: NeilBrown <neilb@suse.com>
---
include/linux/rhashtable.h | 3 ---
lib/rhashtable.c | 3 ---
2 files changed, 6 deletions(-)
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 87d443a5b11d..b01d88e196c2 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -1268,9 +1268,6 @@ static inline int rhashtable_walk_init(struct rhashtable *ht,
* For a completely stable walk you should construct your own data
* structure outside the hash table.
*
- * This function may sleep so you must not call it from interrupt
- * context or with spin locks held.
- *
* You must call rhashtable_walk_exit after this function returns.
*/
static inline void rhltable_walk_enter(struct rhltable *hlt,
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 2b2b79974b61..19db8e563c40 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -668,9 +668,6 @@ EXPORT_SYMBOL_GPL(rhashtable_insert_slow);
* For a completely stable walk you should construct your own data
* structure outside the hash table.
*
- * This function may sleep so you must not call it from interrupt
- * context or with spin locks held.
- *
* You must call rhashtable_walk_exit after this function returns.
*/
void rhashtable_walk_enter(struct rhashtable *ht, struct rhashtable_iter *iter)
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox