* [PATCH v2 09/10] samples/bpf: use hugepages in xdpsock app
From: Kevin Laatz @ 2019-07-16 3:06 UTC (permalink / raw)
To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
jonathan.lemon
Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190716030637.5634-1-kevin.laatz@intel.com>
This patch modifies xdpsock to use mmap instead of posix_memalign. With
this change, we can use hugepages when running the application in unaligned
chunks mode. Using hugepages makes it more likely that we have physically
contiguous memory, which supports the unaligned chunk mode better.
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
---
samples/bpf/xdpsock_user.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index 8f220afd549a..958a27193582 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -69,6 +69,7 @@ static int opt_poll;
static int opt_interval = 1;
static u32 opt_umem_flags;
static int opt_unaligned_chunks;
+static int opt_mmap_flags;
static u32 opt_xdp_bind_flags;
static int opt_xsk_frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE;
static __u32 prog_id;
@@ -434,6 +435,7 @@ static void parse_command_line(int argc, char **argv)
case 'u':
opt_umem_flags |= XDP_UMEM_UNALIGNED_CHUNKS;
opt_unaligned_chunks = 1;
+ opt_mmap_flags = MAP_HUGETLB;
break;
case 'F':
opt_xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
@@ -696,11 +698,14 @@ int main(int argc, char **argv)
exit(EXIT_FAILURE);
}
- ret = posix_memalign(&bufs, getpagesize(), /* PAGE_SIZE aligned */
- NUM_FRAMES * opt_xsk_frame_size);
- if (ret)
- exit_with_error(ret);
-
+ /* Reserve memory for the umem. Use hugepages if unaligned chunk mode */
+ bufs = mmap(NULL, NUM_FRAMES * opt_xsk_frame_size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | opt_mmap_flags, -1, 0);
+ if (bufs == MAP_FAILED) {
+ printf("ERROR: mmap failed\n");
+ exit(EXIT_FAILURE);
+ }
/* Create sockets... */
umem = xsk_configure_umem(bufs, NUM_FRAMES * opt_xsk_frame_size);
xsks[num_socks++] = xsk_configure_socket(umem);
--
2.17.1
^ permalink raw reply related
* [PATCH v2 10/10] doc/af_xdp: include unaligned chunk case
From: Kevin Laatz @ 2019-07-16 3:06 UTC (permalink / raw)
To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
jonathan.lemon
Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190716030637.5634-1-kevin.laatz@intel.com>
The addition of unaligned chunks mode, the documentation needs to be
updated to indicate that the incoming addr to the fill ring will only be
masked if the user application is run in the aligned chunk mode. This patch
also adds a line to explicitly indicate that the incoming addr will not be
masked if running the user application in the unaligned chunk mode.
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
---
Documentation/networking/af_xdp.rst | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index eeedc2e826aa..83f7ae5fc045 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -153,10 +153,12 @@ an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has
Frames passed to the kernel are used for the ingress path (RX rings).
-The user application produces UMEM addrs to this ring. Note that the
-kernel will mask the incoming addr. E.g. for a chunk size of 2k, the
-log2(2048) LSB of the addr will be masked off, meaning that 2048, 2050
-and 3000 refers to the same chunk.
+The user application produces UMEM addrs to this ring. Note that, if
+running the application with aligned chunk mode, the kernel will mask
+the incoming addr. E.g. for a chunk size of 2k, the log2(2048) LSB of
+the addr will be masked off, meaning that 2048, 2050 and 3000 refers
+to the same chunk. If the user application is run in the unaligned
+chunks mode, then the incoming addr will be left untouched.
UMEM Completion Ring
--
2.17.1
^ permalink raw reply related
* [PATCH v2 08/10] samples/bpf: add buffer recycling for unaligned chunks to xdpsock
From: Kevin Laatz @ 2019-07-16 3:06 UTC (permalink / raw)
To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
jonathan.lemon
Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190716030637.5634-1-kevin.laatz@intel.com>
This patch adds buffer recycling support for unaligned buffers. Since we
don't mask the addr to 2k at umem_reg in unaligned mode, we need to make
sure we give back the correct (original) addr to the fill queue. We achieve
this using the new descriptor format and associated masks. The new format
uses the upper 16-bits for the offset and the lower 48-bits for the addr.
Since we have a field for the offset, we no longer need to modify the
actual address. As such, all we have to do to get back the original address
is mask for the lower 48 bits (i.e. strip the offset and we get the address
on it's own).
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
v2:
- Removed unused defines
- Fix buffer recycling for unaligned case
- Remove --buf-size (--frame-size merged before this)
- Modifications to use the new descriptor format for buffer recycling
---
samples/bpf/xdpsock_user.c | 29 +++++++++++++++++++----------
1 file changed, 19 insertions(+), 10 deletions(-)
diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index 26ba1a1fd582..8f220afd549a 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -474,6 +474,7 @@ static void kick_tx(struct xsk_socket_info *xsk)
static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk)
{
+ struct xsk_umem_info *umem = xsk->umem;
u32 idx_cq = 0, idx_fq = 0;
unsigned int rcvd;
size_t ndescs;
@@ -486,22 +487,24 @@ static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk)
xsk->outstanding_tx;
/* re-add completed Tx buffers */
- rcvd = xsk_ring_cons__peek(&xsk->umem->cq, ndescs, &idx_cq);
+ rcvd = xsk_ring_cons__peek(&umem->cq, ndescs, &idx_cq);
if (rcvd > 0) {
unsigned int i;
int ret;
- ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
+ ret = xsk_ring_prod__reserve(&umem->fq, rcvd, &idx_fq);
while (ret != rcvd) {
if (ret < 0)
exit_with_error(-ret);
- ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd,
- &idx_fq);
+ ret = xsk_ring_prod__reserve(&umem->fq, rcvd, &idx_fq);
+ }
+
+ for (i = 0; i < rcvd; i++) {
+ u64 comp_addr =
+ *xsk_ring_cons__comp_addr(&umem->cq, idx_cq++);
+ *xsk_ring_prod__fill_addr(&umem->fq, idx_fq++) =
+ comp_addr & XSK_UNALIGNED_BUF_ADDR_MASK;
}
- for (i = 0; i < rcvd; i++)
- *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) =
- *xsk_ring_cons__comp_addr(&xsk->umem->cq,
- idx_cq++);
xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
xsk_ring_cons__release(&xsk->umem->cq, rcvd);
@@ -548,7 +551,11 @@ static void rx_drop(struct xsk_socket_info *xsk)
for (i = 0; i < rcvd; i++) {
u64 addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
u32 len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++)->len;
- char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
+ u64 offset = addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
+
+ addr &= XSK_UNALIGNED_BUF_ADDR_MASK;
+ char *pkt = xsk_umem__get_data(xsk->umem->buffer,
+ addr + offset);
hex_dump(pkt, len, addr);
*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = addr;
@@ -654,7 +661,9 @@ static void l2fwd(struct xsk_socket_info *xsk)
idx_rx)->addr;
u32 len = xsk_ring_cons__rx_desc(&xsk->rx,
idx_rx++)->len;
- char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
+ u64 offset = addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
+ char *pkt = xsk_umem__get_data(xsk->umem->buffer,
+ (addr & XSK_UNALIGNED_BUF_ADDR_MASK) + offset);
swap_mac_addresses(pkt);
--
2.17.1
^ permalink raw reply related
* [PATCH v2 07/10] samples/bpf: add unaligned chunks mode support to xdpsock
From: Kevin Laatz @ 2019-07-16 3:06 UTC (permalink / raw)
To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
jonathan.lemon
Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190716030637.5634-1-kevin.laatz@intel.com>
This patch adds support for the unaligned chunks mode. The addition of the
unaligned chunks option will allow users to run the application with more
relaxed chunk placement in the XDP umem.
Unaligned chunks mode can be used with the '-u' or '--unaligned' command
line options.
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
---
samples/bpf/xdpsock_user.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index 93eaaf7239b2..26ba1a1fd582 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -67,6 +67,8 @@ static int opt_ifindex;
static int opt_queue;
static int opt_poll;
static int opt_interval = 1;
+static u32 opt_umem_flags;
+static int opt_unaligned_chunks;
static u32 opt_xdp_bind_flags;
static int opt_xsk_frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE;
static __u32 prog_id;
@@ -282,7 +284,9 @@ static struct xsk_umem_info *xsk_configure_umem(void *buffer, u64 size)
.comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
.frame_size = opt_xsk_frame_size,
.frame_headroom = XSK_UMEM__DEFAULT_FRAME_HEADROOM,
+ .flags = opt_umem_flags
};
+
int ret;
umem = calloc(1, sizeof(*umem));
@@ -291,6 +295,7 @@ static struct xsk_umem_info *xsk_configure_umem(void *buffer, u64 size)
ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
&cfg);
+
if (ret)
exit_with_error(-ret);
@@ -352,6 +357,7 @@ static struct option long_options[] = {
{"zero-copy", no_argument, 0, 'z'},
{"copy", no_argument, 0, 'c'},
{"frame-size", required_argument, 0, 'f'},
+ {"unaligned", no_argument, 0, 'u'},
{0, 0, 0, 0}
};
@@ -372,6 +378,7 @@ static void usage(const char *prog)
" -z, --zero-copy Force zero-copy mode.\n"
" -c, --copy Force copy mode.\n"
" -f, --frame-size=n Set the frame size (must be a power of two, default is %d).\n"
+ " -u, --unaligned Enable unaligned chunk placement\n"
"\n";
fprintf(stderr, str, prog, XSK_UMEM__DEFAULT_FRAME_SIZE);
exit(EXIT_FAILURE);
@@ -384,7 +391,7 @@ static void parse_command_line(int argc, char **argv)
opterr = 0;
for (;;) {
- c = getopt_long(argc, argv, "Frtli:q:psSNn:czf:", long_options,
+ c = getopt_long(argc, argv, "Frtli:q:psSNn:czf:u", long_options,
&option_index);
if (c == -1)
break;
@@ -424,12 +431,17 @@ static void parse_command_line(int argc, char **argv)
case 'c':
opt_xdp_bind_flags |= XDP_COPY;
break;
+ case 'u':
+ opt_umem_flags |= XDP_UMEM_UNALIGNED_CHUNKS;
+ opt_unaligned_chunks = 1;
+ break;
case 'F':
opt_xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
break;
case 'f':
opt_xsk_frame_size = atoi(optarg);
break;
+
default:
usage(basename(argv[0]));
}
@@ -442,7 +454,8 @@ static void parse_command_line(int argc, char **argv)
usage(basename(argv[0]));
}
- if (opt_xsk_frame_size & (opt_xsk_frame_size - 1)) {
+ if ((opt_xsk_frame_size & (opt_xsk_frame_size - 1)) &&
+ !opt_unaligned_chunks) {
fprintf(stderr, "--frame-size=%d is not a power of two\n",
opt_xsk_frame_size);
usage(basename(argv[0]));
--
2.17.1
^ permalink raw reply related
* [PATCH v2 05/10] ixgbe: modify driver for handling offsets
From: Kevin Laatz @ 2019-07-16 3:06 UTC (permalink / raw)
To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
jonathan.lemon
Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190716030637.5634-1-kevin.laatz@intel.com>
With the addition of the unaligned chunks option, we need to make sure we
handle the offsets accordingly based on the mode we are currently running
in. This patch modifies the driver to appropriately mask the address for
each case.
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
---
drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 26 ++++++++++++++++----
1 file changed, 21 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index bc86057628c8..ac1669b18d13 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -143,7 +143,9 @@ static int ixgbe_run_xdp_zc(struct ixgbe_adapter *adapter,
struct ixgbe_ring *rx_ring,
struct xdp_buff *xdp)
{
+ struct xdp_umem *umem = rx_ring->xsk_umem;
int err, result = IXGBE_XDP_PASS;
+ u64 offset = umem->headroom;
struct bpf_prog *xdp_prog;
struct xdp_frame *xdpf;
u32 act;
@@ -151,7 +153,13 @@ static int ixgbe_run_xdp_zc(struct ixgbe_adapter *adapter,
rcu_read_lock();
xdp_prog = READ_ONCE(rx_ring->xdp_prog);
act = bpf_prog_run_xdp(xdp_prog, xdp);
- xdp->handle += xdp->data - xdp->data_hard_start;
+ offset += xdp->data - xdp->data_hard_start;
+
+ if (umem->flags & XDP_UMEM_UNALIGNED_CHUNKS)
+ xdp->handle |= (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
+ else
+ xdp->handle += offset;
+
switch (act) {
case XDP_PASS:
break;
@@ -235,7 +243,10 @@ void ixgbe_zca_free(struct zero_copy_allocator *alloc, unsigned long handle)
nta++;
rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
- handle &= mask;
+ if (rx_ring->xsk_umem->flags & XDP_UMEM_UNALIGNED_CHUNKS)
+ handle &= XSK_UNALIGNED_BUF_ADDR_MASK;
+ else
+ handle &= mask;
bi->dma = xdp_umem_get_dma(rx_ring->xsk_umem, handle);
bi->dma += hr;
@@ -269,7 +280,7 @@ static bool ixgbe_alloc_buffer_zc(struct ixgbe_ring *rx_ring,
bi->addr = xdp_umem_get_data(umem, handle);
bi->addr += hr;
- bi->handle = handle + umem->headroom;
+ bi->handle = handle;
xsk_umem_discard_addr(umem);
return true;
@@ -296,7 +307,7 @@ static bool ixgbe_alloc_buffer_slow_zc(struct ixgbe_ring *rx_ring,
bi->addr = xdp_umem_get_data(umem, handle);
bi->addr += hr;
- bi->handle = handle + umem->headroom;
+ bi->handle = handle;
xsk_umem_discard_addr_rq(umem);
return true;
@@ -565,6 +576,7 @@ static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget)
struct ixgbe_tx_buffer *tx_bi;
bool work_done = true;
struct xdp_desc desc;
+ u64 addr, offset;
dma_addr_t dma;
u32 cmd_type;
@@ -578,7 +590,11 @@ static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget)
if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &desc))
break;
- dma = xdp_umem_get_dma(xdp_ring->xsk_umem, desc.addr);
+ /* for unaligned chunks need to take offset from upper bits */
+ offset = (desc.addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT);
+ addr = (desc.addr & XSK_UNALIGNED_BUF_ADDR_MASK);
+
+ dma = xdp_umem_get_dma(xdp_ring->xsk_umem, addr + offset);
dma_sync_single_for_device(xdp_ring->dev, dma, desc.len,
DMA_BIDIRECTIONAL);
--
2.17.1
^ permalink raw reply related
* [PATCH v2 04/10] i40e: modify driver for handling offsets
From: Kevin Laatz @ 2019-07-16 3:06 UTC (permalink / raw)
To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
jonathan.lemon
Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190716030637.5634-1-kevin.laatz@intel.com>
With the addition of the unaligned chunks option, we need to make sure we
handle the offsets accordingly based on the mode we are currently running
in. This patch modifies the driver to appropriately mask the address for
each case.
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_xsk.c | 26 +++++++++++++++++-----
1 file changed, 21 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index dfa096db2244..b8316e9ba159 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -190,7 +190,9 @@ int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
**/
static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, struct xdp_buff *xdp)
{
+ struct xdp_umem *umem = rx_ring->xsk_umem;
int err, result = I40E_XDP_PASS;
+ u64 offset = umem->headroom;
struct i40e_ring *xdp_ring;
struct bpf_prog *xdp_prog;
u32 act;
@@ -201,7 +203,13 @@ static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, struct xdp_buff *xdp)
*/
xdp_prog = READ_ONCE(rx_ring->xdp_prog);
act = bpf_prog_run_xdp(xdp_prog, xdp);
- xdp->handle += xdp->data - xdp->data_hard_start;
+ offset += xdp->data - xdp->data_hard_start;
+
+ if (umem->flags & XDP_UMEM_UNALIGNED_CHUNKS)
+ xdp->handle |= (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
+ else
+ xdp->handle += offset;
+
switch (act) {
case XDP_PASS:
break;
@@ -262,7 +270,7 @@ static bool i40e_alloc_buffer_zc(struct i40e_ring *rx_ring,
bi->addr = xdp_umem_get_data(umem, handle);
bi->addr += hr;
- bi->handle = handle + umem->headroom;
+ bi->handle = handle;
xsk_umem_discard_addr(umem);
return true;
@@ -299,7 +307,7 @@ static bool i40e_alloc_buffer_slow_zc(struct i40e_ring *rx_ring,
bi->addr = xdp_umem_get_data(umem, handle);
bi->addr += hr;
- bi->handle = handle + umem->headroom;
+ bi->handle = handle;
xsk_umem_discard_addr_rq(umem);
return true;
@@ -456,7 +464,10 @@ void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle)
nta++;
rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
- handle &= mask;
+ if (rx_ring->xsk_umem->flags & XDP_UMEM_UNALIGNED_CHUNKS)
+ handle &= XSK_UNALIGNED_BUF_ADDR_MASK;
+ else
+ handle &= mask;
bi->dma = xdp_umem_get_dma(rx_ring->xsk_umem, handle);
bi->dma += hr;
@@ -635,6 +646,7 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
struct i40e_tx_buffer *tx_bi;
bool work_done = true;
struct xdp_desc desc;
+ u64 addr, offset;
dma_addr_t dma;
while (budget-- > 0) {
@@ -647,7 +659,11 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &desc))
break;
- dma = xdp_umem_get_dma(xdp_ring->xsk_umem, desc.addr);
+ /* for unaligned chunks need to take offset from upper bits */
+ offset = (desc.addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT);
+ addr = (desc.addr & XSK_UNALIGNED_BUF_ADDR_MASK);
+
+ dma = xdp_umem_get_dma(xdp_ring->xsk_umem, addr + offset);
dma_sync_single_for_device(xdp_ring->dev, dma, desc.len,
DMA_BIDIRECTIONAL);
--
2.17.1
^ permalink raw reply related
* [PATCH v2 03/10] xsk: add support to allow unaligned chunk placement
From: Kevin Laatz @ 2019-07-16 3:06 UTC (permalink / raw)
To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
jonathan.lemon
Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190716030637.5634-1-kevin.laatz@intel.com>
Currently, addresses are chunk size aligned. This means, we are very
restricted in terms of where we can place chunk within the umem. For
example, if we have a chunk size of 2k, then our chunks can only be placed
at 0,2k,4k,6k,8k... and so on (ie. every 2k starting from 0).
This patch introduces the ability to use unaligned chunks. With these
changes, we are no longer bound to having to place chunks at a 2k (or
whatever your chunk size is) interval. Since we are no longer dealing with
aligned chunks, they can now cross page boundaries. Checks for page
contiguity have been added in order to keep track of which pages are
followed by a physically contiguous page.
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
v2:
- Add checks for the flags coming from userspace
- Fix how we get chunk_size in xsk_diag.c
- Add defines for masking the new descriptor format
- Modified the rx functions to use new descriptor format
- Modified the tx functions to use new descriptor format
---
include/net/xdp_sock.h | 2 +
include/uapi/linux/if_xdp.h | 9 ++++
net/xdp/xdp_umem.c | 17 ++++---
net/xdp/xsk.c | 89 ++++++++++++++++++++++++++++++-------
net/xdp/xsk_diag.c | 2 +-
net/xdp/xsk_queue.h | 70 +++++++++++++++++++++++++----
6 files changed, 159 insertions(+), 30 deletions(-)
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 69796d264f06..f7ab8ff33f06 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -19,6 +19,7 @@ struct xsk_queue;
struct xdp_umem_page {
void *addr;
dma_addr_t dma;
+ bool next_pg_contig;
};
struct xdp_umem_fq_reuse {
@@ -48,6 +49,7 @@ struct xdp_umem {
bool zc;
spinlock_t xsk_list_lock;
struct list_head xsk_list;
+ u32 flags;
};
struct xdp_sock {
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index faaa5ca2a117..f8dc68fcdf78 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -17,6 +17,9 @@
#define XDP_COPY (1 << 1) /* Force copy-mode */
#define XDP_ZEROCOPY (1 << 2) /* Force zero-copy mode */
+/* Flags for xsk_umem_config flags */
+#define XDP_UMEM_UNALIGNED_CHUNKS (1 << 0)
+
struct sockaddr_xdp {
__u16 sxdp_family;
__u16 sxdp_flags;
@@ -53,6 +56,7 @@ struct xdp_umem_reg {
__u64 len; /* Length of packet data area */
__u32 chunk_size;
__u32 headroom;
+ __u32 flags;
};
struct xdp_statistics {
@@ -74,6 +78,11 @@ struct xdp_options {
#define XDP_UMEM_PGOFF_FILL_RING 0x100000000ULL
#define XDP_UMEM_PGOFF_COMPLETION_RING 0x180000000ULL
+/* Masks for unaligned chunks mode */
+#define XSK_UNALIGNED_BUF_OFFSET_SHIFT 48
+#define XSK_UNALIGNED_BUF_ADDR_MASK \
+ ((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
+
/* Rx/Tx descriptor */
struct xdp_desc {
__u64 addr;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 20c91f02d3d8..6130735bdd3d 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -303,6 +303,7 @@ static int xdp_umem_account_pages(struct xdp_umem *umem)
static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
{
+ bool unaligned_chunks = mr->flags & XDP_UMEM_UNALIGNED_CHUNKS;
u32 chunk_size = mr->chunk_size, headroom = mr->headroom;
unsigned int chunks, chunks_per_page;
u64 addr = mr->addr, size = mr->len;
@@ -318,7 +319,10 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
return -EINVAL;
}
- if (!is_power_of_2(chunk_size))
+ if (mr->flags & ~(XDP_UMEM_UNALIGNED_CHUNKS))
+ return -EINVAL;
+
+ if (!unaligned_chunks && !is_power_of_2(chunk_size))
return -EINVAL;
if (!PAGE_ALIGNED(addr)) {
@@ -335,9 +339,11 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
if (chunks == 0)
return -EINVAL;
- chunks_per_page = PAGE_SIZE / chunk_size;
- if (chunks < chunks_per_page || chunks % chunks_per_page)
- return -EINVAL;
+ if (!unaligned_chunks) {
+ chunks_per_page = PAGE_SIZE / chunk_size;
+ if (chunks < chunks_per_page || chunks % chunks_per_page)
+ return -EINVAL;
+ }
headroom = ALIGN(headroom, 64);
@@ -346,13 +352,14 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
return -EINVAL;
umem->address = (unsigned long)addr;
- umem->chunk_mask = ~((u64)chunk_size - 1);
+ umem->chunk_mask = unaligned_chunks ? U64_MAX : ~((u64)chunk_size - 1);
umem->size = size;
umem->headroom = headroom;
umem->chunk_size_nohr = chunk_size - headroom;
umem->npgs = size / PAGE_SIZE;
umem->pgs = NULL;
umem->user = NULL;
+ umem->flags = mr->flags;
INIT_LIST_HEAD(&umem->xsk_list);
spin_lock_init(&umem->xsk_list_lock);
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index d4d6f10aa936..78089825821a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -45,7 +45,7 @@ EXPORT_SYMBOL(xsk_umem_has_addrs);
u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
{
- return xskq_peek_addr(umem->fq, addr);
+ return xskq_peek_addr(umem->fq, addr, umem);
}
EXPORT_SYMBOL(xsk_umem_peek_addr);
@@ -55,21 +55,42 @@ void xsk_umem_discard_addr(struct xdp_umem *umem)
}
EXPORT_SYMBOL(xsk_umem_discard_addr);
+/* If a buffer crosses a page boundary, we need to do 2 memcpy's, one for
+ * each page. This is only required in copy mode.
+ */
+static void __xsk_rcv_memcpy(struct xdp_umem *umem, u64 addr, void *from_buf,
+ u32 len, u32 metalen)
+{
+ void *to_buf = xdp_umem_get_data(umem, addr);
+
+ if (xskq_crosses_non_contig_pg(umem, addr, len + metalen)) {
+ void *next_pg_addr = umem->pages[(addr >> PAGE_SHIFT) + 1].addr;
+ u64 page_start = addr & (PAGE_SIZE - 1);
+ u64 first_len = PAGE_SIZE - (addr - page_start);
+
+ memcpy(to_buf, from_buf, first_len + metalen);
+ memcpy(next_pg_addr, from_buf + first_len, len - first_len);
+
+ return;
+ }
+
+ memcpy(to_buf, from_buf, len + metalen);
+}
+
static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
{
- void *to_buf, *from_buf;
+ u64 offset = xs->umem->headroom;
+ void *from_buf;
u32 metalen;
u64 addr;
int err;
- if (!xskq_peek_addr(xs->umem->fq, &addr) ||
+ if (!xskq_peek_addr(xs->umem->fq, &addr, xs->umem) ||
len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
xs->rx_dropped++;
return -ENOSPC;
}
- addr += xs->umem->headroom;
-
if (unlikely(xdp_data_meta_unsupported(xdp))) {
from_buf = xdp->data;
metalen = 0;
@@ -78,9 +99,13 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
metalen = xdp->data - xdp->data_meta;
}
- to_buf = xdp_umem_get_data(xs->umem, addr);
- memcpy(to_buf, from_buf, len + metalen);
- addr += metalen;
+ __xsk_rcv_memcpy(xs->umem, addr + offset, from_buf, len, metalen);
+
+ offset += metalen;
+ if (xs->umem->flags & XDP_UMEM_UNALIGNED_CHUNKS)
+ addr |= offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT;
+ else
+ addr += offset;
err = xskq_produce_batch_desc(xs->rx, addr, len);
if (!err) {
xskq_discard_addr(xs->umem->fq);
@@ -127,6 +152,7 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
u32 len = xdp->data_end - xdp->data;
void *buffer;
u64 addr;
+ u64 offset = xs->umem->headroom;
int err;
spin_lock_bh(&xs->rx_lock);
@@ -136,17 +162,20 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
goto out_unlock;
}
- if (!xskq_peek_addr(xs->umem->fq, &addr) ||
+ if (!xskq_peek_addr(xs->umem->fq, &addr, xs->umem) ||
len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
err = -ENOSPC;
goto out_drop;
}
- addr += xs->umem->headroom;
-
- buffer = xdp_umem_get_data(xs->umem, addr);
+ buffer = xdp_umem_get_data(xs->umem, addr + offset);
memcpy(buffer, xdp->data_meta, len + metalen);
- addr += metalen;
+ offset += metalen;
+
+ if (xs->umem->flags & XDP_UMEM_UNALIGNED_CHUNKS)
+ addr |= offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT;
+ else
+ addr += offset;
err = xskq_produce_batch_desc(xs->rx, addr, len);
if (err)
goto out_drop;
@@ -190,7 +219,7 @@ bool xsk_umem_consume_tx(struct xdp_umem *umem, struct xdp_desc *desc)
rcu_read_lock();
list_for_each_entry_rcu(xs, &umem->xsk_list, list) {
- if (!xskq_peek_desc(xs->tx, desc))
+ if (!xskq_peek_desc(xs->tx, desc, umem))
continue;
if (xskq_produce_addr_lazy(umem->cq, desc->addr))
@@ -240,7 +269,7 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
mutex_lock(&xs->mutex);
- while (xskq_peek_desc(xs->tx, &desc)) {
+ while (xskq_peek_desc(xs->tx, &desc, xs->umem)) {
char *buffer;
u64 addr;
u32 len;
@@ -265,6 +294,10 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
skb_put(skb, len);
addr = desc.addr;
+ if (xs->umem->flags & XDP_UMEM_UNALIGNED_CHUNKS)
+ addr = (addr & XSK_UNALIGNED_BUF_ADDR_MASK) |
+ (addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT);
+
buffer = xdp_umem_get_data(xs->umem, addr);
err = skb_store_bits(skb, 0, buffer, len);
if (unlikely(err)) {
@@ -275,7 +308,7 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
skb->dev = xs->dev;
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
- skb_shinfo(skb)->destructor_arg = (void *)(long)addr;
+ skb_shinfo(skb)->destructor_arg = (void *)(long)desc.addr;
skb->destructor = xsk_destruct_skb;
err = dev_direct_xmit(skb, xs->queue_id);
@@ -415,6 +448,28 @@ static struct socket *xsk_lookup_xsk_from_fd(int fd)
return sock;
}
+/* Check if umem pages are contiguous.
+ * If zero-copy mode, use the DMA address to do the page contiguity check
+ * For all other modes we use addr (kernel virtual address)
+ */
+static void xsk_check_page_contiguity(struct xdp_umem *umem, u32 flags)
+{
+ int i;
+
+ if (flags & XDP_ZEROCOPY) {
+ for (i = 0; i < umem->npgs - 1; i++)
+ umem->pages[i].next_pg_contig =
+ (umem->pages[i].dma + PAGE_SIZE ==
+ umem->pages[i + 1].dma);
+ return;
+ }
+
+ for (i = 0; i < umem->npgs - 1; i++)
+ umem->pages[i].next_pg_contig =
+ (umem->pages[i].addr + PAGE_SIZE ==
+ umem->pages[i + 1].addr);
+}
+
static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
{
struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr;
@@ -502,6 +557,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
err = xdp_umem_assign_dev(xs->umem, dev, qid, flags);
if (err)
goto out_unlock;
+
+ xsk_check_page_contiguity(xs->umem, flags);
}
xs->dev = dev;
diff --git a/net/xdp/xsk_diag.c b/net/xdp/xsk_diag.c
index d5e06c8e0cbf..9986a759fe06 100644
--- a/net/xdp/xsk_diag.c
+++ b/net/xdp/xsk_diag.c
@@ -56,7 +56,7 @@ static int xsk_diag_put_umem(const struct xdp_sock *xs, struct sk_buff *nlskb)
du.id = umem->id;
du.size = umem->size;
du.num_pages = umem->npgs;
- du.chunk_size = (__u32)(~umem->chunk_mask + 1);
+ du.chunk_size = umem->chunk_size_nohr + umem->headroom;
du.headroom = umem->headroom;
du.ifindex = umem->dev ? umem->dev->ifindex : 0;
du.queue_id = umem->queue_id;
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 909c5168ed0f..04afc9de86d9 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -133,6 +133,16 @@ static inline bool xskq_has_addrs(struct xsk_queue *q, u32 cnt)
/* UMEM queue */
+static inline bool xskq_crosses_non_contig_pg(struct xdp_umem *umem, u64 addr,
+ u64 length)
+{
+ bool cross_pg = (addr & (PAGE_SIZE - 1)) + length > PAGE_SIZE;
+ bool next_pg_contig =
+ umem->pages[(addr >> PAGE_SHIFT) + 1].next_pg_contig;
+
+ return cross_pg && !next_pg_contig;
+}
+
static inline bool xskq_is_valid_addr(struct xsk_queue *q, u64 addr)
{
if (addr >= q->size) {
@@ -143,23 +153,52 @@ static inline bool xskq_is_valid_addr(struct xsk_queue *q, u64 addr)
return true;
}
-static inline u64 *xskq_validate_addr(struct xsk_queue *q, u64 *addr)
+static inline bool xskq_is_valid_addr_unaligned(struct xsk_queue *q, u64 addr,
+ u64 length,
+ struct xdp_umem *umem)
+{
+ addr += addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
+ addr &= XSK_UNALIGNED_BUF_ADDR_MASK;
+ if (addr >= q->size ||
+ xskq_crosses_non_contig_pg(umem, addr, length)) {
+ q->invalid_descs++;
+ return false;
+ }
+
+ return true;
+}
+
+static inline u64 *xskq_validate_addr(struct xsk_queue *q, u64 *addr,
+ struct xdp_umem *umem)
{
while (q->cons_tail != q->cons_head) {
struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
unsigned int idx = q->cons_tail & q->ring_mask;
*addr = READ_ONCE(ring->desc[idx]) & q->chunk_mask;
+ if (*addr & (~XSK_UNALIGNED_BUF_ADDR_MASK))
+ goto out;
+
+ if (umem->flags & XDP_UMEM_UNALIGNED_CHUNKS) {
+ if (xskq_is_valid_addr_unaligned(q, *addr,
+ umem->chunk_size_nohr,
+ umem))
+ return addr;
+ goto out;
+ }
+
if (xskq_is_valid_addr(q, *addr))
return addr;
+out:
q->cons_tail++;
}
return NULL;
}
-static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr)
+static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr,
+ struct xdp_umem *umem)
{
if (q->cons_tail == q->cons_head) {
smp_mb(); /* D, matches A */
@@ -170,7 +209,7 @@ static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr)
smp_rmb();
}
- return xskq_validate_addr(q, addr);
+ return xskq_validate_addr(q, addr, umem);
}
static inline void xskq_discard_addr(struct xsk_queue *q)
@@ -229,8 +268,21 @@ static inline int xskq_reserve_addr(struct xsk_queue *q)
/* Rx/Tx queue */
-static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct xdp_desc *d)
+static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct xdp_desc *d,
+ struct xdp_umem *umem)
{
+ if (umem->flags & XDP_UMEM_UNALIGNED_CHUNKS) {
+ if (!xskq_is_valid_addr_unaligned(q, d->addr, d->len, umem))
+ return false;
+
+ if (d->len > umem->chunk_size_nohr || d->options) {
+ q->invalid_descs++;
+ return false;
+ }
+
+ return true;
+ }
+
if (!xskq_is_valid_addr(q, d->addr))
return false;
@@ -244,14 +296,15 @@ static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct xdp_desc *d)
}
static inline struct xdp_desc *xskq_validate_desc(struct xsk_queue *q,
- struct xdp_desc *desc)
+ struct xdp_desc *desc,
+ struct xdp_umem *umem)
{
while (q->cons_tail != q->cons_head) {
struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
unsigned int idx = q->cons_tail & q->ring_mask;
*desc = READ_ONCE(ring->desc[idx]);
- if (xskq_is_valid_desc(q, desc))
+ if (xskq_is_valid_desc(q, desc, umem))
return desc;
q->cons_tail++;
@@ -261,7 +314,8 @@ static inline struct xdp_desc *xskq_validate_desc(struct xsk_queue *q,
}
static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
- struct xdp_desc *desc)
+ struct xdp_desc *desc,
+ struct xdp_umem *umem)
{
if (q->cons_tail == q->cons_head) {
smp_mb(); /* D, matches A */
@@ -272,7 +326,7 @@ static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
smp_rmb(); /* C, matches B */
}
- return xskq_validate_desc(q, desc);
+ return xskq_validate_desc(q, desc, umem);
}
static inline void xskq_discard_desc(struct xsk_queue *q)
--
2.17.1
^ permalink raw reply related
* [PATCH v2 02/10] ixgbe: simplify Rx buffer recycle
From: Kevin Laatz @ 2019-07-16 3:06 UTC (permalink / raw)
To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
jonathan.lemon
Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190716030637.5634-1-kevin.laatz@intel.com>
Currently, the dma, addr and handle are modified when we reuse Rx buffers
in zero-copy mode. However, this is not required as the inputs to the
function are copies, not the original values themselves. As we use the
copies within the function, we can use the original 'obi' values
directly without having to mask and add the headroom.
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
---
drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 13 +++----------
1 file changed, 3 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index 6b609553329f..bc86057628c8 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -201,8 +201,6 @@ ixgbe_rx_buffer *ixgbe_get_rx_buffer_zc(struct ixgbe_ring *rx_ring,
static void ixgbe_reuse_rx_buffer_zc(struct ixgbe_ring *rx_ring,
struct ixgbe_rx_buffer *obi)
{
- unsigned long mask = (unsigned long)rx_ring->xsk_umem->chunk_mask;
- u64 hr = rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
u16 nta = rx_ring->next_to_alloc;
struct ixgbe_rx_buffer *nbi;
@@ -212,14 +210,9 @@ static void ixgbe_reuse_rx_buffer_zc(struct ixgbe_ring *rx_ring,
rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
/* transfer page from old buffer to new buffer */
- nbi->dma = obi->dma & mask;
- nbi->dma += hr;
-
- nbi->addr = (void *)((unsigned long)obi->addr & mask);
- nbi->addr += hr;
-
- nbi->handle = obi->handle & mask;
- nbi->handle += rx_ring->xsk_umem->headroom;
+ nbi->dma = obi->dma;
+ nbi->addr = obi->addr;
+ nbi->handle = obi->handle;
obi->addr = NULL;
obi->skb = NULL;
--
2.17.1
^ permalink raw reply related
* [PATCH v2 01/10] i40e: simplify Rx buffer recycle
From: Kevin Laatz @ 2019-07-16 3:06 UTC (permalink / raw)
To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
jonathan.lemon
Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190716030637.5634-1-kevin.laatz@intel.com>
Currently, the dma, addr and handle are modified when we reuse Rx buffers
in zero-copy mode. However, this is not required as the inputs to the
function are copies, not the original values themselves. As we use the
copies within the function, we can use the original 'old_bi' values
directly without having to mask and add the headroom.
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_xsk.c | 13 +++----------
1 file changed, 3 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 32bad014d76c..dfa096db2244 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -420,8 +420,6 @@ static void i40e_reuse_rx_buffer_zc(struct i40e_ring *rx_ring,
struct i40e_rx_buffer *old_bi)
{
struct i40e_rx_buffer *new_bi = &rx_ring->rx_bi[rx_ring->next_to_alloc];
- unsigned long mask = (unsigned long)rx_ring->xsk_umem->chunk_mask;
- u64 hr = rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
u16 nta = rx_ring->next_to_alloc;
/* update, and store next to alloc */
@@ -429,14 +427,9 @@ static void i40e_reuse_rx_buffer_zc(struct i40e_ring *rx_ring,
rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
/* transfer page from old buffer to new buffer */
- new_bi->dma = old_bi->dma & mask;
- new_bi->dma += hr;
-
- new_bi->addr = (void *)((unsigned long)old_bi->addr & mask);
- new_bi->addr += hr;
-
- new_bi->handle = old_bi->handle & mask;
- new_bi->handle += rx_ring->xsk_umem->headroom;
+ new_bi->dma = old_bi->dma;
+ new_bi->addr = old_bi->addr;
+ new_bi->handle = old_bi->handle;
old_bi->addr = NULL;
}
--
2.17.1
^ permalink raw reply related
* [PATCH v2 00/10] XDP unaligned chunk placement support
From: Kevin Laatz @ 2019-07-16 3:06 UTC (permalink / raw)
To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
jonathan.lemon
Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190620090958.2135-1-kevin.laatz@intel.com>
This patch set adds the ability to use unaligned chunks in the XDP umem.
Currently, all chunk addresses passed to the umem are masked to be chunk
size aligned (default is 2k, max is PAGE_SIZE). This limits where we can
place chunks within the umem as well as limiting the packet sizes that are
supported.
The changes in this patch set removes these restrictions, allowing XDP to
be more flexible in where it can place a chunk within a umem. By relaxing
where the chunks can be placed, it allows us to use an arbitrary buffer
size and place that wherever we have a free address in the umem. These
changes add the ability to support arbitrary frame sizes up to 4k
(PAGE_SIZE) and make it easy to integrate with other existing frameworks
that have their own memory management systems, such as DPDK.
Since we are now dealing with arbitrary frame sizes, we need also need to
update how we pass around addresses. Currently, the addresses can simply be
masked to 2k to get back to the original address. This becomes less trivial
when using frame sizes that are not a 'power of 2' size. This patch set
modifies the Rx/Tx descriptor format to use the upper 16-bits of the addr
field for an offset value, leaving the lower 48-bits for the address (this
leaves us with 256 Terabytes, which should be enough!). We only need to use
the upper 16-bits to store the offset when running in unaligned mode.
Rather than adding the offset (headroom etc) to the address, we will store
it in the upper 16-bits of the address field. This way, we can easily add
the offset to the address where we need it, using some bit manipulation and
addition, and we can also easily get the original address wherever we need
it (for example in i40e_zca_free) by simply masking to get the lower
48-bits of the address field.
The numbers below were recorded with the following set up:
- Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
- Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02)
- Driver: i40e
- Application: xdpsock with l2fwd (single interface)
These are solely for comparing performance with and without the patches.
The largest drop was ~1% (in zero-copy mode).
+-------------------------+------------+-----------------+-------------+
| Buffer size: 2048 | SKB mode | Zero-copy | Copy |
+-------------------------+------------+-----------------+-------------+
| Aligned (baseline) | 1.7 Mpps | 15.3 Mpps | 2.08 Mpps |
+-------------------------+------------+-----------------+-------------+
| Aligned (with patches) | 1.7 Mpps | 15.1 Mpps | 2.08 Mpps |
+-------------------------+------------+-----------------+-------------+
| Unaligned | 1.7 Mpps | 14.5 Mpps | 2.08 Mpps |
+-------------------------+------------+-----------------+-------------+
NOTE: We are currently working on the changes required in the Mellanox
driver. We will include these in the v3.
Structure of the patchset:
Patch 1:
- Remove unnecessary masking and headroom addition during zero-copy Rx
buffer recycling in i40e. This change is required in order for the
buffer recycling to work in the unaligned chunk mode.
Patch 2:
- Remove unnecessary masking and headroom addition during
zero-copy Rx buffer recycling in ixgbe. This change is required in
order for the buffer recycling to work in the unaligned chunk mode.
Patch 3:
- Add infrastructure for unaligned chunks. Since we are dealing with
unaligned chunks that could potentially cross a physical page boundary,
we add checks to keep track of that information. We can later use this
information to correctly handle buffers that are placed at an address
where they cross a page boundary. This patch also modifies the
existing Rx and Tx functions to use the new descriptor format. To
handle addresses correctly, we need to mask appropriately based on
whether we are in aligned or unaligned mode.
Patch 4:
- This patch updates the i40e driver to make use of the new descriptor
format. The new format is particularly useful here since we can now
retrieve the original address in places like i40e_zca_free with ease.
This saves us doing various calculations to get the original address
back.
Patch 5:
- This patch updates the ixgbe driver to make use of the new descriptor
format. The new format is particularly useful here since we can now
retrieve the original address in places like ixgbe_zca_free with ease.
This saves us doing various calculations to get the original address
back.
Patch 6:
- Add flags for umem configuration to libbpf
Patch 7:
- Modify xdpsock application to add a command line option for
unaligned chunks
Patch 8:
- Since we can now run the application in unaligned chunk mode, we need
to make sure we recycle the buffers appropriately.
Patch 9:
- Adds hugepage support to the xdpsock application
Patch 10:
- Documentation update to include the unaligned chunk scenario. We need
to explicitly state that the incoming addresses are only masked in the
aligned chunk mode and not the unaligned chunk mode.
---
v2:
- fixed checkpatch issues
- fixed Rx buffer recycling for unaligned chunks in xdpsock
- removed unused defines
- fixed how chunk_size is calculated in xsk_diag.c
- added some performance numbers to cover letter
- modified descriptor format to make it easier to retrieve original
address
- removed patch adding off_t off to the zero copy allocator. This is no
longer needed with the new descriptor format.
Kevin Laatz (10):
i40e: simplify Rx buffer recycle
ixgbe: simplify Rx buffer recycle
xsk: add support to allow unaligned chunk placement
i40e: modify driver for handling offsets
ixgbe: modify driver for handling offsets
libbpf: add flags to umem config
samples/bpf: add unaligned chunks mode support to xdpsock
samples/bpf: add buffer recycling for unaligned chunks to xdpsock
samples/bpf: use hugepages in xdpsock app
doc/af_xdp: include unaligned chunk case
Documentation/networking/af_xdp.rst | 10 ++-
drivers/net/ethernet/intel/i40e/i40e_xsk.c | 39 +++++----
drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 39 +++++----
include/net/xdp_sock.h | 2 +
include/uapi/linux/if_xdp.h | 9 ++
net/xdp/xdp_umem.c | 17 ++--
net/xdp/xsk.c | 89 ++++++++++++++++----
net/xdp/xsk_diag.c | 2 +-
net/xdp/xsk_queue.h | 70 +++++++++++++--
samples/bpf/xdpsock_user.c | 61 ++++++++++----
tools/include/uapi/linux/if_xdp.h | 4 +
tools/lib/bpf/xsk.c | 3 +
tools/lib/bpf/xsk.h | 2 +
13 files changed, 266 insertions(+), 81 deletions(-)
--
2.17.1
^ permalink raw reply
* Re: [PATCH v5] net: netfilter: Fix rpfilter dropping vrf packets by mistake
From: Pablo Neira Ayuso @ 2019-07-16 11:16 UTC (permalink / raw)
To: Miaohe Lin
Cc: kadlec, fw, davem, kuznet, yoshfuji, netfilter-devel, coreteam,
netdev, linux-kernel, mingfangsen
In-Reply-To: <1562039976-203880-1-git-send-email-linmiaohe@huawei.com>
On Tue, Jul 02, 2019 at 03:59:36AM +0000, Miaohe Lin wrote:
> When firewalld is enabled with ipv4/ipv6 rpfilter, vrf
> ipv4/ipv6 packets will be dropped. Vrf device will pass
> through netfilter hook twice. One with enslaved device
> and another one with l3 master device. So in device may
> dismatch witch out device because out device is always
> enslaved device.So failed with the check of the rpfilter
> and drop the packets by mistake.
Applied to nf.git, thanks.
^ permalink raw reply
* [PATCH 2/2] arm: Add support for function error injection
From: Leo Yan @ 2019-07-16 11:13 UTC (permalink / raw)
To: Russell King, Oleg Nesterov, Catalin Marinas, Will Deacon,
Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
Yonghong Song, Arnd Bergmann, linux-arm-kernel, linux-kernel,
netdev, bpf, Masami Hiramatsu, Justin He
Cc: Leo Yan
In-Reply-To: <20190716111301.1855-1-leo.yan@linaro.org>
This patch implement regs_set_return_value() and
override_function_with_return() to support function error injection
for arm.
In the exception flow, we can update pt_regs::ARM_pc with
pt_regs::ARM_lr so that can override the probed function return.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
arch/arm/Kconfig | 1 +
arch/arm/include/asm/error-injection.h | 13 +++++++++++++
arch/arm/include/asm/ptrace.h | 5 +++++
arch/arm/lib/Makefile | 2 ++
arch/arm/lib/error-inject.c | 19 +++++++++++++++++++
5 files changed, 40 insertions(+)
create mode 100644 arch/arm/include/asm/error-injection.h
create mode 100644 arch/arm/lib/error-inject.c
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 8869742a85df..f7932a5e29ea 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -74,6 +74,7 @@ config ARM
select HAVE_EFFICIENT_UNALIGNED_ACCESS if (CPU_V6 || CPU_V6K || CPU_V7) && MMU
select HAVE_EXIT_THREAD
select HAVE_FTRACE_MCOUNT_RECORD if !XIP_KERNEL
+ select HAVE_FUNCTION_ERROR_INJECTION if !THUMB2_KERNEL
select HAVE_FUNCTION_GRAPH_TRACER if !THUMB2_KERNEL && !CC_IS_CLANG
select HAVE_FUNCTION_TRACER if !XIP_KERNEL
select HAVE_GCC_PLUGINS
diff --git a/arch/arm/include/asm/error-injection.h b/arch/arm/include/asm/error-injection.h
new file mode 100644
index 000000000000..da057e8ed224
--- /dev/null
+++ b/arch/arm/include/asm/error-injection.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+
+#ifndef __ASM_ERROR_INJECTION_H_
+#define __ASM_ERROR_INJECTION_H_
+
+#include <linux/compiler.h>
+#include <linux/linkage.h>
+#include <asm/ptrace.h>
+#include <asm-generic/error-injection.h>
+
+void override_function_with_return(struct pt_regs *regs);
+
+#endif /* __ASM_ERROR_INJECTION_H_ */
diff --git a/arch/arm/include/asm/ptrace.h b/arch/arm/include/asm/ptrace.h
index 91d6b7856be4..3b41f37b361a 100644
--- a/arch/arm/include/asm/ptrace.h
+++ b/arch/arm/include/asm/ptrace.h
@@ -89,6 +89,11 @@ static inline long regs_return_value(struct pt_regs *regs)
return regs->ARM_r0;
}
+static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc)
+{
+ regs->ARM_r0 = rc;
+}
+
#define instruction_pointer(regs) (regs)->ARM_pc
#ifdef CONFIG_THUMB2_KERNEL
diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
index 0bff0176db2c..d3d7430ecd76 100644
--- a/arch/arm/lib/Makefile
+++ b/arch/arm/lib/Makefile
@@ -43,3 +43,5 @@ ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
CFLAGS_xor-neon.o += $(NEON_FLAGS)
obj-$(CONFIG_XOR_BLOCKS) += xor-neon.o
endif
+
+obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
diff --git a/arch/arm/lib/error-inject.c b/arch/arm/lib/error-inject.c
new file mode 100644
index 000000000000..96319d017114
--- /dev/null
+++ b/arch/arm/lib/error-inject.c
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/error-injection.h>
+#include <linux/kprobes.h>
+
+void override_function_with_return(struct pt_regs *regs)
+{
+ /*
+ * 'regs' represents the state on entry of a predefined function in
+ * the kernel/module and which is captured on a kprobe.
+ *
+ * 'regs->ARM_lr' contains the the link register for the probed
+ * function and assign it to 'regs->ARM_pc', so when kprobe returns
+ * back from exception it will override the end of probed function
+ * and drirectly return to the predefined function's caller.
+ */
+ regs->ARM_pc = regs->ARM_lr;
+}
+NOKPROBE_SYMBOL(override_function_with_return);
--
2.17.1
^ permalink raw reply related
* [PATCH 1/2] arm64: Add support for function error injection
From: Leo Yan @ 2019-07-16 11:13 UTC (permalink / raw)
To: Russell King, Oleg Nesterov, Catalin Marinas, Will Deacon,
Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
Yonghong Song, Arnd Bergmann, linux-arm-kernel, linux-kernel,
netdev, bpf, Masami Hiramatsu, Justin He
Cc: Leo Yan
In-Reply-To: <20190716111301.1855-1-leo.yan@linaro.org>
This patch implement regs_set_return_value() and
override_function_with_return() to support function error injection
for arm64.
In the exception flow, arm64's general register x30 contains the value
for the link register; so we can just update pt_regs::pc with it rather
than redirecting execution to a dummy function that returns.
This patch is heavily inspired by the commit 7cd01b08d35f ("powerpc:
Add support for function error injection").
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/error-injection.h | 13 +++++++++++++
arch/arm64/include/asm/ptrace.h | 5 +++++
arch/arm64/lib/Makefile | 2 ++
arch/arm64/lib/error-inject.c | 19 +++++++++++++++++++
5 files changed, 40 insertions(+)
create mode 100644 arch/arm64/include/asm/error-injection.h
create mode 100644 arch/arm64/lib/error-inject.c
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 697ea0510729..a6d9e622977d 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -142,6 +142,7 @@ config ARM64
select HAVE_EFFICIENT_UNALIGNED_ACCESS
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_TRACER
+ select HAVE_FUNCTION_ERROR_INJECTION
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_GCC_PLUGINS
select HAVE_HW_BREAKPOINT if PERF_EVENTS
diff --git a/arch/arm64/include/asm/error-injection.h b/arch/arm64/include/asm/error-injection.h
new file mode 100644
index 000000000000..da057e8ed224
--- /dev/null
+++ b/arch/arm64/include/asm/error-injection.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+
+#ifndef __ASM_ERROR_INJECTION_H_
+#define __ASM_ERROR_INJECTION_H_
+
+#include <linux/compiler.h>
+#include <linux/linkage.h>
+#include <asm/ptrace.h>
+#include <asm-generic/error-injection.h>
+
+void override_function_with_return(struct pt_regs *regs);
+
+#endif /* __ASM_ERROR_INJECTION_H_ */
diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
index dad858b6adc6..3aafbbe218a2 100644
--- a/arch/arm64/include/asm/ptrace.h
+++ b/arch/arm64/include/asm/ptrace.h
@@ -294,6 +294,11 @@ static inline unsigned long regs_return_value(struct pt_regs *regs)
return regs->regs[0];
}
+static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc)
+{
+ regs->regs[0] = rc;
+}
+
/**
* regs_get_kernel_argument() - get Nth function argument in kernel
* @regs: pt_regs of that context
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index 33c2a4abda04..f182ccb0438e 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -33,3 +33,5 @@ UBSAN_SANITIZE_atomic_ll_sc.o := n
lib-$(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE) += uaccess_flushcache.o
obj-$(CONFIG_CRC32) += crc32.o
+
+obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
diff --git a/arch/arm64/lib/error-inject.c b/arch/arm64/lib/error-inject.c
new file mode 100644
index 000000000000..35661c2de4b0
--- /dev/null
+++ b/arch/arm64/lib/error-inject.c
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/error-injection.h>
+#include <linux/kprobes.h>
+
+void override_function_with_return(struct pt_regs *regs)
+{
+ /*
+ * 'regs' represents the state on entry of a predefined function in
+ * the kernel/module and which is captured on a kprobe.
+ *
+ * 'regs->regs[30]' contains the the link register for the probed
+ * function and assign it to 'regs->pc', so when kprobe returns
+ * back from exception it will override the end of probed function
+ * and drirectly return to the predefined function's caller.
+ */
+ regs->pc = regs->regs[30];
+}
+NOKPROBE_SYMBOL(override_function_with_return);
--
2.17.1
^ permalink raw reply related
* [PATCH 0/2] arm/arm64: Add support for function error injection
From: Leo Yan @ 2019-07-16 11:12 UTC (permalink / raw)
To: Russell King, Oleg Nesterov, Catalin Marinas, Will Deacon,
Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
Yonghong Song, Arnd Bergmann, linux-arm-kernel, linux-kernel,
netdev, bpf, Masami Hiramatsu, Justin He
Cc: Leo Yan
This small patch set is to add support for function error injection;
this can be used to eanble more advanced debugging feature, e.g.
CONFIG_BPF_KPROBE_OVERRIDE.
I only tested the first patch on arm64 platform Juno-r2 with below
steps; the second patch is for arm arch, but I absent the platform
for the testing so only pass compilation.
- Enable kernel configuration:
CONFIG_BPF_KPROBE_OVERRIDE
CONFIG_BTRFS_FS
CONFIG_BPF_EVENTS=y
CONFIG_KPROBES=y
CONFIG_KPROBE_EVENTS=y
CONFIG_BPF_KPROBE_OVERRIDE=y
- Build samples/bpf on Juno-r2 board with Debian rootFS:
# cd $kernel
# make headers_install
# make samples/bpf/ LLC=llc-7 CLANG=clang-7
- Run the sample tracex7:
# ./tracex7 /dev/sdb1
[ 1975.211781] BTRFS error (device (efault)): open_ctree failed
mount: /mnt/linux-kernel/linux-cs-dev/samples/bpf/tmpmnt: mount(2) system call failed: Cannot allocate memory.
Leo Yan (2):
arm64: Add support for function error injection
arm: Add support for function error injection
arch/arm/Kconfig | 1 +
arch/arm/include/asm/error-injection.h | 13 +++++++++++++
arch/arm/include/asm/ptrace.h | 5 +++++
arch/arm/lib/Makefile | 2 ++
arch/arm/lib/error-inject.c | 19 +++++++++++++++++++
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/error-injection.h | 13 +++++++++++++
arch/arm64/include/asm/ptrace.h | 5 +++++
arch/arm64/lib/Makefile | 2 ++
arch/arm64/lib/error-inject.c | 19 +++++++++++++++++++
10 files changed, 80 insertions(+)
create mode 100644 arch/arm/include/asm/error-injection.h
create mode 100644 arch/arm/lib/error-inject.c
create mode 100644 arch/arm64/include/asm/error-injection.h
create mode 100644 arch/arm64/lib/error-inject.c
--
2.17.1
^ permalink raw reply
* Re: [PATCH] net: sctp: fix warning "NULL check before some freeing functions is not needed"
From: Neil Horman @ 2019-07-16 11:09 UTC (permalink / raw)
To: Hariprasad Kelam
Cc: Vlad Yasevich, Marcelo Ricardo Leitner, David S. Miller,
linux-sctp, netdev, linux-kernel
In-Reply-To: <20190716022002.GA19592@hari-Inspiron-1545>
On Tue, Jul 16, 2019 at 07:50:02AM +0530, Hariprasad Kelam wrote:
> This patch removes NULL checks before calling kfree.
>
> fixes below issues reported by coccicheck
> net/sctp/sm_make_chunk.c:2586:3-8: WARNING: NULL check before some
> freeing functions is not needed.
> net/sctp/sm_make_chunk.c:2652:3-8: WARNING: NULL check before some
> freeing functions is not needed.
> net/sctp/sm_make_chunk.c:2667:3-8: WARNING: NULL check before some
> freeing functions is not needed.
> net/sctp/sm_make_chunk.c:2684:3-8: WARNING: NULL check before some
> freeing functions is not needed.
>
> Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
> ---
> net/sctp/sm_make_chunk.c | 12 ++++--------
> 1 file changed, 4 insertions(+), 8 deletions(-)
>
> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> index ed39396..36bd8a6e 100644
> --- a/net/sctp/sm_make_chunk.c
> +++ b/net/sctp/sm_make_chunk.c
> @@ -2582,8 +2582,7 @@ static int sctp_process_param(struct sctp_association *asoc,
> case SCTP_PARAM_STATE_COOKIE:
> asoc->peer.cookie_len =
> ntohs(param.p->length) - sizeof(struct sctp_paramhdr);
> - if (asoc->peer.cookie)
> - kfree(asoc->peer.cookie);
> + kfree(asoc->peer.cookie);
> asoc->peer.cookie = kmemdup(param.cookie->body, asoc->peer.cookie_len, gfp);
> if (!asoc->peer.cookie)
> retval = 0;
> @@ -2648,8 +2647,7 @@ static int sctp_process_param(struct sctp_association *asoc,
> goto fall_through;
>
> /* Save peer's random parameter */
> - if (asoc->peer.peer_random)
> - kfree(asoc->peer.peer_random);
> + kfree(asoc->peer.peer_random);
> asoc->peer.peer_random = kmemdup(param.p,
> ntohs(param.p->length), gfp);
> if (!asoc->peer.peer_random) {
> @@ -2663,8 +2661,7 @@ static int sctp_process_param(struct sctp_association *asoc,
> goto fall_through;
>
> /* Save peer's HMAC list */
> - if (asoc->peer.peer_hmacs)
> - kfree(asoc->peer.peer_hmacs);
> + kfree(asoc->peer.peer_hmacs);
> asoc->peer.peer_hmacs = kmemdup(param.p,
> ntohs(param.p->length), gfp);
> if (!asoc->peer.peer_hmacs) {
> @@ -2680,8 +2677,7 @@ static int sctp_process_param(struct sctp_association *asoc,
> if (!ep->auth_enable)
> goto fall_through;
>
> - if (asoc->peer.peer_chunks)
> - kfree(asoc->peer.peer_chunks);
> + kfree(asoc->peer.peer_chunks);
> asoc->peer.peer_chunks = kmemdup(param.p,
> ntohs(param.p->length), gfp);
> if (!asoc->peer.peer_chunks)
> --
> 2.7.4
>
>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
^ permalink raw reply
* [PATCH bpf v2] selftests/bpf: skip nmi test when perf hw events are disabled
From: Ilya Leoshkevich @ 2019-07-16 10:56 UTC (permalink / raw)
To: bpf, netdev
Cc: gor, heiko.carstens, andrii.nakryiko, ys114321, Ilya Leoshkevich
Some setups (e.g. virtual machines) might run with hardware perf events
disabled. If this is the case, skip the test_send_signal_nmi test.
Add a separate test involving a software perf event. This allows testing
the perf event path regardless of hardware perf event support.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
---
v1->v2: Skip the test instead of using a software event.
Add a separate test with a software event.
.../selftests/bpf/prog_tests/send_signal.c | 33 ++++++++++++++++++-
1 file changed, 32 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/send_signal.c b/tools/testing/selftests/bpf/prog_tests/send_signal.c
index 67cea1686305..54218ee3c004 100644
--- a/tools/testing/selftests/bpf/prog_tests/send_signal.c
+++ b/tools/testing/selftests/bpf/prog_tests/send_signal.c
@@ -173,6 +173,18 @@ static int test_send_signal_tracepoint(void)
return test_send_signal_common(&attr, BPF_PROG_TYPE_TRACEPOINT, "tracepoint");
}
+static int test_send_signal_perf(void)
+{
+ struct perf_event_attr attr = {
+ .sample_period = 1,
+ .type = PERF_TYPE_SOFTWARE,
+ .config = PERF_COUNT_SW_CPU_CLOCK,
+ };
+
+ return test_send_signal_common(&attr, BPF_PROG_TYPE_PERF_EVENT,
+ "perf_sw_event");
+}
+
static int test_send_signal_nmi(void)
{
struct perf_event_attr attr = {
@@ -181,8 +193,26 @@ static int test_send_signal_nmi(void)
.type = PERF_TYPE_HARDWARE,
.config = PERF_COUNT_HW_CPU_CYCLES,
};
+ int pmu_fd;
+
+ /* Some setups (e.g. virtual machines) might run with hardware
+ * perf events disabled. If this is the case, skip this test.
+ */
+ pmu_fd = syscall(__NR_perf_event_open, &attr, 0 /* pid */,
+ -1 /* cpu */, -1 /* group_fd */, 0 /* flags */);
+ if (pmu_fd == -1) {
+ if (errno == ENOENT) {
+ printf("%s:SKIP:no PERF_COUNT_HW_CPU_CYCLES\n",
+ __func__);
+ return 0;
+ }
+ /* Let the test fail with a more informative message */
+ } else {
+ close(pmu_fd);
+ }
- return test_send_signal_common(&attr, BPF_PROG_TYPE_PERF_EVENT, "perf_event");
+ return test_send_signal_common(&attr, BPF_PROG_TYPE_PERF_EVENT,
+ "perf_hw_event");
}
void test_send_signal(void)
@@ -190,6 +220,7 @@ void test_send_signal(void)
int ret = 0;
ret |= test_send_signal_tracepoint();
+ ret |= test_send_signal_perf();
ret |= test_send_signal_nmi();
if (!ret)
printf("test_send_signal:OK\n");
--
2.21.0
^ permalink raw reply related
* [PATCH bpf v3] selftests/bpf: fix "alu with different scalars 1" on s390
From: Ilya Leoshkevich @ 2019-07-16 10:53 UTC (permalink / raw)
To: bpf, netdev
Cc: gor, heiko.carstens, alexei.starovoitov, daniel, ys114321,
Ilya Leoshkevich
BPF_LDX_MEM is used to load the least significant byte of the retrieved
test_val.index, however, on big-endian machines it ends up retrieving
the most significant byte.
Change the test to load the whole int in order to make it
endianness-independent.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
---
v1->v2:
- use __BYTE_ORDER instead of __BYTE_ORDER__.
v2->v3:
- load the whole int instead of byte.
tools/testing/selftests/bpf/verifier/value_ptr_arith.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/verifier/value_ptr_arith.c b/tools/testing/selftests/bpf/verifier/value_ptr_arith.c
index c3de1a2c9dc5..a53d99cebd9f 100644
--- a/tools/testing/selftests/bpf/verifier/value_ptr_arith.c
+++ b/tools/testing/selftests/bpf/verifier/value_ptr_arith.c
@@ -183,7 +183,7 @@
BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
BPF_EXIT_INSN(),
- BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
+ BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 0),
BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 3),
BPF_MOV64_IMM(BPF_REG_2, 0),
BPF_MOV64_IMM(BPF_REG_3, 0x100000),
--
2.21.0
^ permalink raw reply related
* Re: [RFC] virtio-net: share receive_*() and add_recvbuf_*() with virtio-vsock
From: Stefano Garzarella @ 2019-07-16 10:22 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Jason Wang, Stefan Hajnoczi, virtualization, netdev
In-Reply-To: <20190716055837-mutt-send-email-mst@kernel.org>
On Tue, Jul 16, 2019 at 06:01:33AM -0400, Michael S. Tsirkin wrote:
> On Tue, Jul 16, 2019 at 11:40:24AM +0200, Stefano Garzarella wrote:
> > On Mon, Jul 15, 2019 at 01:50:28PM -0400, Michael S. Tsirkin wrote:
> > > On Mon, Jul 15, 2019 at 09:44:16AM +0200, Stefano Garzarella wrote:
> > > > On Fri, Jul 12, 2019 at 06:14:39PM +0800, Jason Wang wrote:
> >
> > [...]
> >
> > > > >
> > > > >
> > > > > I think it's just a branch, for ethernet, go for networking stack. otherwise
> > > > > go for vsock core?
> > > > >
> > > >
> > > > Yes, that should work.
> > > >
> > > > So, I should refactor the functions that can be called also from the vsock
> > > > core, in order to remove "struct net_device *dev" parameter.
> > > > Maybe creating some wrappers for the network stack.
> > > >
> > > > Otherwise I should create a fake net_device for vsock_core.
> > > >
> > > > What do you suggest?
> > >
> > > Neither.
> > >
> > > I think what Jason was saying all along is this:
> > >
> > > virtio net doesn't actually lose packets, at least most
> > > of the time. And it actually most of the time
> > > passes all packets to host. So it's possible to use a virtio net
> > > device (possibly with a feature flag that says "does not lose packets,
> > > all packets go to host") and build vsock on top.
> >
> > Yes, I got it after the latest Jason's reply.
> >
> > >
> > > and all of this is nice, but don't expect anything easy,
> > > or any quick results.
> >
> > I expected this... :-(
> >
> > >
> > > Also, in a sense it's a missed opportunity: we could cut out a lot
> > > of fat and see just how fast can a protocol that is completely
> > > new and separate from networking stack go.
> >
> > In this case, if we will try to do a PoC, what do you think is better?
> > 1. new AF_VSOCK + network-stack + virtio-net modified
> > Maybe it is allow us to reuse a lot of stuff already written,
> > but we will go through the network stack
> >
> > 2. new AF_VSOCK + glue + virtio-net modified
> > Intermediate approach, similar to Jason's proposal
> >
> > 3, new AF_VSOCK + new virtio-vsock
> > Can be the thinnest, but we have to rewrite many things, with the risk
> > of making the same mistakes as the current implementation.
> >
>
> 1 or 3 imho. I wouldn't expect a lot from 2. I slightly favor 3 and
> Jason 1. So take your pick :)
>
Yes, I agree :)
Maybe "Jason 1" could be the short term (and an opportunity to study better the
code and sources of overhead) and "new AF_VSOCK + new virtio-vsock" the long
term goal with the multi-transport support in mind.
Thank you so much for your guidance and useful advice,
Stefano
^ permalink raw reply
* Re: [PATCH v3 0/3] kernel/notifier.c: avoid duplicate registration
From: Vasily Averin @ 2019-07-16 10:20 UTC (permalink / raw)
To: Xiaoming Ni, gregkh@linuxfoundation.org
Cc: adobriyan@gmail.com, akpm@linux-foundation.org,
anna.schumaker@netapp.com, arjan@linux.intel.com,
bfields@fieldses.org, chuck.lever@oracle.com, davem@davemloft.net,
jlayton@kernel.org, luto@kernel.org, mingo@kernel.org,
Nadia.Derbey@bull.net, paulmck@linux.vnet.ibm.com,
semen.protsenko@linaro.org, stable@kernel.org,
stern@rowland.harvard.edu, tglx@linutronix.de,
torvalds@linux-foundation.org, trond.myklebust@hammerspace.com,
viresh.kumar@linaro.org, Huangjianhui (Alex), Dailei,
linux-kernel@vger.kernel.org, linux-nfs@vger.kernel.org,
netdev@vger.kernel.org
In-Reply-To: <3bbc16ba-953c-a6b6-c5f3-4deaeaa25d10@huawei.com>
On 7/16/19 5:00 AM, Xiaoming Ni wrote:
> On 2019/7/15 13:38, Vasily Averin wrote:
>> On 7/14/19 5:45 AM, Xiaoming Ni wrote:
>>> On 2019/7/12 22:07, gregkh@linuxfoundation.org wrote:
>>>> On Fri, Jul 12, 2019 at 09:11:57PM +0800, Xiaoming Ni wrote:
>>>>> On 2019/7/11 21:57, Vasily Averin wrote:
>>>>>> On 7/11/19 4:55 AM, Nixiaoming wrote:
>>>>>>> On Wed, July 10, 2019 1:49 PM Vasily Averin wrote:
>>>>>>>> On 7/10/19 6:09 AM, Xiaoming Ni wrote:
>>>>>>>>> Registering the same notifier to a hook repeatedly can cause the hook
>>>>>>>>> list to form a ring or lose other members of the list.
>>>>>>>>
>>>>>>>> I think is not enough to _prevent_ 2nd register attempt,
>>>>>>>> it's enough to detect just attempt and generate warning to mark host in bad state.
>>>>>>>>
>>>>>>>
>>>>>>> Duplicate registration is prevented in my patch, not just "mark host in bad state"
>>>>>>>
>>>>>>> Duplicate registration is checked and exited in notifier_chain_cond_register()
>>>>>>>
>>>>>>> Duplicate registration was checked in notifier_chain_register() but only
>>>>>>> the alarm was triggered without exiting. added by commit 831246570d34692e
>>>>>>> ("kernel/notifier.c: double register detection")
>>>>>>>
>>>>>>> My patch is like a combination of 831246570d34692e and notifier_chain_cond_register(),
>>>>>>> which triggers an alarm and exits when a duplicate registration is detected.
>>>>>>>
>>>>>>>> Unexpected 2nd register of the same hook most likely will lead to 2nd unregister,
>>>>>>>> and it can lead to host crash in any time:
>>>>>>>> you can unregister notifier on first attempt it can be too early, it can be still in use.
>>>>>>>> on the other hand you can never call 2nd unregister at all.
>>>>>>>
>>>>>>> Since the member was not added to the linked list at the time of the second registration,
>>>>>>> no linked list ring was formed.
>>>>>>> The member is released on the first unregistration and -ENOENT on the second unregistration.
>>>>>>> After patching, the fault has been alleviated
>>>>>>
>>>>>> You are wrong here.
>>>>>> 2nd notifier's registration is a pure bug, this should never happen.
>>>>>> If you know the way to reproduce this situation -- you need to fix it.
>>>>>>
>>>>>> 2nd registration can happen in 2 cases:
>>>>>> 1) missed rollback, when someone forget to call unregister after successfull registration,
>>>>>> and then tried to call register again. It can lead to crash for example when according module will be unloaded.
>>>>>> 2) some subsystem is registered twice, for example from different namespaces.
>>>>>> in this case unregister called during sybsystem cleanup in first namespace will incorrectly remove notifier used
>>>>>> in second namespace, it also can lead to unexpacted behaviour.
>>>>>>
>>>>> So in these two cases, is it more reasonable to trigger BUG() directly when checking for duplicate registration ?
>>>>> But why does current notifier_chain_register() just trigger WARN() without exiting ?
>>>>> notifier_chain_cond_register() direct exit without triggering WARN() ?
>>>>
>>>> It should recover from this, if it can be detected. The main point is
>>>> that not all apis have to be this "robust" when used within the kernel
>>>> as we do allow for the callers to know what they are doing :)
>>>>
>>> In the notifier_chain_register(), the condition ( (*nl) == n) is the same registration of the same hook.
>>> We can intercept this situation and avoid forming a linked list ring to make the API more rob
>>
>> Once again -- yes, you CAN prevent list corruption, but you CANNOT recover the host and return it back to safe state.
>> If double register event was detected -- it means you have bug in kernel.
>>
>> Yes, you can add BUG here and crash the host immediately, but I prefer to use warning in such situations.
>>
>>>> If this does not cause any additional problems or slow downs, it's
>>>> probably fine to add.
>>>>
>>> Notifier_chain_register() is not a system hotspot function.
>>> At the same time, there is already a WARN_ONCE judgment. There is no new judgment in the new patch.
>>> It only changes the processing under the condition of (*nl) == n, which will not cause performance problems.
>>> At the same time, avoiding the formation of a link ring can make the system more robust.
>>
>> I disagree,
>> yes, node will have correct list, but anyway node will work wrong and can crash the host in any time.
>
> Sorry, my description is not accurate.
>
> My patch feature does not prevent users from repeatedly registering hooks.
> But avoiding the chain ring caused by the user repeatedly registering the hook
>
> There are no modules for duplicate registration hooks in the current system.
> But considering that not all modules are in the kernel source tree,
> In order to improve the robustness of the kernel API, we should avoid the linked list ring caused by repeated registration.
> Or in order to improve the efficiency of problem location, when the duplicate registration is checked, the system crashes directly.
Detect of duplicate registration means an unrecoverable error,
from this point of view it makes sense to replace WARN_ONCE by BUG_ON.
> On the other hand, the difference between notifier_chain_register() and notifier_chain_cond_register() for duplicate registrations is confusing:
> Blocking the formation of the linked list ring in notifier_chain_cond_register()
> There is no interception of the linked list ring in notifier_chain_register(), just an alarm.
> Give me the illusion: Isn't notifier_chain_register() allowed to create a linked list ring?
I'm not sure that I understood your question correctly but will try to answer.
As far as I see all callers of notifier_chain_cond_register checks return value, expect possible failure and handle it somehow.
On the other hand callers of notifier_chain_register() in many cases do not check return value and always expect success.
The goal of original WARN_ONCE -- to detect possible misuse of notifiers and it seems for me it correctly handles this task.
^ permalink raw reply
* Re: [RFC] virtio-net: share receive_*() and add_recvbuf_*() with virtio-vsock
From: Michael S. Tsirkin @ 2019-07-16 10:01 UTC (permalink / raw)
To: Stefano Garzarella; +Cc: Jason Wang, Stefan Hajnoczi, virtualization, netdev
In-Reply-To: <20190716094024.ob43g5lxga5uwb7z@steredhat>
On Tue, Jul 16, 2019 at 11:40:24AM +0200, Stefano Garzarella wrote:
> On Mon, Jul 15, 2019 at 01:50:28PM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jul 15, 2019 at 09:44:16AM +0200, Stefano Garzarella wrote:
> > > On Fri, Jul 12, 2019 at 06:14:39PM +0800, Jason Wang wrote:
>
> [...]
>
> > > >
> > > >
> > > > I think it's just a branch, for ethernet, go for networking stack. otherwise
> > > > go for vsock core?
> > > >
> > >
> > > Yes, that should work.
> > >
> > > So, I should refactor the functions that can be called also from the vsock
> > > core, in order to remove "struct net_device *dev" parameter.
> > > Maybe creating some wrappers for the network stack.
> > >
> > > Otherwise I should create a fake net_device for vsock_core.
> > >
> > > What do you suggest?
> >
> > Neither.
> >
> > I think what Jason was saying all along is this:
> >
> > virtio net doesn't actually lose packets, at least most
> > of the time. And it actually most of the time
> > passes all packets to host. So it's possible to use a virtio net
> > device (possibly with a feature flag that says "does not lose packets,
> > all packets go to host") and build vsock on top.
>
> Yes, I got it after the latest Jason's reply.
>
> >
> > and all of this is nice, but don't expect anything easy,
> > or any quick results.
>
> I expected this... :-(
>
> >
> > Also, in a sense it's a missed opportunity: we could cut out a lot
> > of fat and see just how fast can a protocol that is completely
> > new and separate from networking stack go.
>
> In this case, if we will try to do a PoC, what do you think is better?
> 1. new AF_VSOCK + network-stack + virtio-net modified
> Maybe it is allow us to reuse a lot of stuff already written,
> but we will go through the network stack
>
> 2. new AF_VSOCK + glue + virtio-net modified
> Intermediate approach, similar to Jason's proposal
>
> 3, new AF_VSOCK + new virtio-vsock
> Can be the thinnest, but we have to rewrite many things, with the risk
> of making the same mistakes as the current implementation.
>
1 or 3 imho. I wouldn't expect a lot from 2. I slightly favor 3 and
Jason 1. So take your pick :)
> > Instead vsock implementation carries so much baggage from both
> > networking stack - such as softirq processing - and itself such as
> > workqueues, global state and crude locking - to the point where
> > it's actually slower than TCP.
>
> I agree, and I'm finding new issues while I'm trying to support nested
> VMs, allowing multiple vsock transports (virtio-vsock and vhost-vsock in
> the KVM case) at runtime.
>
> >
>
> [...]
>
> > > >
> > > > I suggest to do this step by step:
> > > >
> > > > 1) use virtio-net but keep some protocol logic
> > > >
> > > > 2) separate protocol logic and merge it to exist Linux networking stack
> > >
> > > Make sense, thanks for the suggestions, I'll try to do these steps!
> > >
> > > Thanks,
> > > Stefano
> >
> >
> > An alternative is look at sources of overhead in vsock and get rid of
> > them, or rewrite it from scratch focusing on performance.
>
> I started looking at virtio-vsock and vhost-vsock trying to do very
> simple changes [1] to increase the performance. I should send a v4 of that
> series as a very short term, then I'd like to have a deeper look to understand
> if it is better to try to optimize or rewrite it from scratch.
>
>
> Thanks,
> Stefano
>
> [1] https://patchwork.kernel.org/cover/10970145/
^ permalink raw reply
* Re: [RFC] virtio-net: share receive_*() and add_recvbuf_*() with virtio-vsock
From: Stefano Garzarella @ 2019-07-16 9:40 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Jason Wang, Stefan Hajnoczi, virtualization, netdev
In-Reply-To: <20190715134115-mutt-send-email-mst@kernel.org>
On Mon, Jul 15, 2019 at 01:50:28PM -0400, Michael S. Tsirkin wrote:
> On Mon, Jul 15, 2019 at 09:44:16AM +0200, Stefano Garzarella wrote:
> > On Fri, Jul 12, 2019 at 06:14:39PM +0800, Jason Wang wrote:
[...]
> > >
> > >
> > > I think it's just a branch, for ethernet, go for networking stack. otherwise
> > > go for vsock core?
> > >
> >
> > Yes, that should work.
> >
> > So, I should refactor the functions that can be called also from the vsock
> > core, in order to remove "struct net_device *dev" parameter.
> > Maybe creating some wrappers for the network stack.
> >
> > Otherwise I should create a fake net_device for vsock_core.
> >
> > What do you suggest?
>
> Neither.
>
> I think what Jason was saying all along is this:
>
> virtio net doesn't actually lose packets, at least most
> of the time. And it actually most of the time
> passes all packets to host. So it's possible to use a virtio net
> device (possibly with a feature flag that says "does not lose packets,
> all packets go to host") and build vsock on top.
Yes, I got it after the latest Jason's reply.
>
> and all of this is nice, but don't expect anything easy,
> or any quick results.
I expected this... :-(
>
> Also, in a sense it's a missed opportunity: we could cut out a lot
> of fat and see just how fast can a protocol that is completely
> new and separate from networking stack go.
In this case, if we will try to do a PoC, what do you think is better?
1. new AF_VSOCK + network-stack + virtio-net modified
Maybe it is allow us to reuse a lot of stuff already written,
but we will go through the network stack
2. new AF_VSOCK + glue + virtio-net modified
Intermediate approach, similar to Jason's proposal
3, new AF_VSOCK + new virtio-vsock
Can be the thinnest, but we have to rewrite many things, with the risk
of making the same mistakes as the current implementation.
> Instead vsock implementation carries so much baggage from both
> networking stack - such as softirq processing - and itself such as
> workqueues, global state and crude locking - to the point where
> it's actually slower than TCP.
I agree, and I'm finding new issues while I'm trying to support nested
VMs, allowing multiple vsock transports (virtio-vsock and vhost-vsock in
the KVM case) at runtime.
>
[...]
> > >
> > > I suggest to do this step by step:
> > >
> > > 1) use virtio-net but keep some protocol logic
> > >
> > > 2) separate protocol logic and merge it to exist Linux networking stack
> >
> > Make sense, thanks for the suggestions, I'll try to do these steps!
> >
> > Thanks,
> > Stefano
>
>
> An alternative is look at sources of overhead in vsock and get rid of
> them, or rewrite it from scratch focusing on performance.
I started looking at virtio-vsock and vhost-vsock trying to do very
simple changes [1] to increase the performance. I should send a v4 of that
series as a very short term, then I'd like to have a deeper look to understand
if it is better to try to optimize or rewrite it from scratch.
Thanks,
Stefano
[1] https://patchwork.kernel.org/cover/10970145/
^ permalink raw reply
* Re: [RFC bpf-next 0/8] bpf: accelerate insn patching speed
From: Jiong Wang @ 2019-07-16 8:50 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Jiong Wang, Alexei Starovoitov, Daniel Borkmann, Edward Cree,
Naveen N. Rao, Andrii Nakryiko, Jakub Kicinski, bpf, Networking,
oss-drivers, Yonghong Song
In-Reply-To: <CAEf4BzYDAVUgajz4=dRTu5xQDddp5pi2s=T1BdFmRLZjOwGypQ@mail.gmail.com>
Andrii Nakryiko writes:
> On Mon, Jul 15, 2019 at 2:21 AM Jiong Wang <jiong.wang@netronome.com> wrote:
>>
>>
>> Andrii Nakryiko writes:
>>
>> > On Thu, Jul 11, 2019 at 4:22 AM Jiong Wang <jiong.wang@netronome.com> wrote:
>> >>
>> >>
>> >> Andrii Nakryiko writes:
>> >>
>> >> > On Thu, Jul 4, 2019 at 2:31 PM Jiong Wang <jiong.wang@netronome.com> wrote:
>> >> >>
>> >> >> This is an RFC based on latest bpf-next about acclerating insn patching
>> >> >> speed, it is now near the shape of final PATCH set, and we could see the
>> >> >> changes migrating to list patching would brings, so send out for
>> >> >> comments. Most of the info are in cover letter. I splitted the code in a
>> >> >> way to show API migration more easily.
>> >> >
>> >> >
>> >> > Hey Jiong,
>> >> >
>> >> >
>> >> > Sorry, took me a while to get to this and learn more about instruction
>> >> > patching. Overall this looks good and I think is a good direction.
>> >> > I'll post high-level feedback here, and some more
>> >> > implementation-specific ones in corresponding patches.
>> >>
>> >> Great, thanks very much for the feedbacks. Most of your feedbacks are
>> >> hitting those pain points I exactly had ran into. For some of them, I
>> >> thought similar solutions like yours, but failed due to various
>> >> reasons. Let's go through them again, I could have missed some important
>> >> things.
>> >>
>> >> Please see my replies below.
>> >
>> > Thanks for thoughtful reply :)
>> >
>> >>
>> >> >>
>> >> >> Test Results
>> >> >> ===
>> >> >> - Full pass on test_verifier/test_prog/test_prog_32 under all three
>> >> >> modes (interpreter, JIT, JIT with blinding).
>> >> >>
>> >> >> - Benchmarking shows 10 ~ 15x faster on medium sized prog, and reduce
>> >> >> patching time from 5100s (nearly one and a half hour) to less than
>> >> >> 0.5s for 1M insn patching.
>> >> >>
>> >> >> Known Issues
>> >> >> ===
>> >> >> - The following warning is triggered when running scale test which
>> >> >> contains 1M insns and patching:
>> >> >> warning of mm/page_alloc.c:4639 __alloc_pages_nodemask+0x29e/0x330
>> >> >>
>> >> >> This is caused by existing code, it can be reproduced on bpf-next
>> >> >> master with jit blinding enabled, then run scale unit test, it will
>> >> >> shown up after half an hour. After this set, patching is very fast, so
>> >> >> it shows up quickly.
>> >> >>
>> >> >> - No line info adjustment support when doing insn delete, subprog adj
>> >> >> is with bug when doing insn delete as well. Generally, removal of insns
>> >> >> could possibly cause remove of entire line or subprog, therefore
>> >> >> entries of prog->aux->linfo or env->subprog needs to be deleted. I
>> >> >> don't have good idea and clean code for integrating this into the
>> >> >> linearization code at the moment, will do more experimenting,
>> >> >> appreciate ideas and suggestions on this.
>> >> >
>> >> > Is there any specific problem to detect which line info to delete? Or
>> >> > what am I missing besides careful implementation?
>> >>
>> >> Mostly line info and subprog info are range info which covers a range of
>> >> insns. Deleting insns could causing you adjusting the range or removing one
>> >> range entirely. subprog info could be fully recalcuated during
>> >> linearization while line info I need some careful implementation and I
>> >> failed to have clean code for this during linearization also as said no
>> >> unit tests to help me understand whether the code is correct or not.
>> >>
>> >
>> > Ok, that's good that it's just about clean implementation. Try to
>> > implement it as clearly as possible. Then post it here, and if it can
>> > be improved someone (me?) will try to help to clean it up further.
>> >
>> > Not a big expert on line info, so can't comment on that,
>> > unfortunately. Maybe Yonghong can chime in (cc'ed)
>> >
>> >
>> >> I will described this latter, spent too much time writing the following
>> >> reply. Might worth an separate discussion thread.
>> >>
>> >> >>
>> >> >> Insn delete doesn't happen on normal programs, for example Cilium
>> >> >> benchmarks, and happens rarely on test_progs, so the test coverage is
>> >> >> not good. That's also why this RFC have a full pass on selftest with
>> >> >> this known issue.
>> >> >
>> >> > I hope you'll add test for deletion (and w/ corresponding line info)
>> >> > in final patch set :)
>> >>
>> >> Will try. Need to spend some time on BTF format.
>> >> >
>> >> >>
>> >> >> - Could further use mem pool to accelerate the speed, changes are trivial
>> >> >> on top of this RFC, and could be 2x extra faster. Not included in this
>> >> >> RFC as reducing the algo complexity from quadratic to linear of insn
>> >> >> number is the first step.
>> >> >
>> >> > Honestly, I think that would add more complexity than necessary, and I
>> >> > think we can further speed up performance without that, see below.
>> >> >
>> >> >>
>> >> >> Background
>> >> >> ===
>> >> >> This RFC aims to accelerate BPF insn patching speed, patching means expand
>> >> >> one bpf insn at any offset inside bpf prog into a set of new insns, or
>> >> >> remove insns.
>> >> >>
>> >> >> At the moment, insn patching is quadratic of insn number, this is due to
>> >> >> branch targets of jump insns needs to be adjusted, and the algo used is:
>> >> >>
>> >> >> for insn inside prog
>> >> >> patch insn + regeneate bpf prog
>> >> >> for insn inside new prog
>> >> >> adjust jump target
>> >> >>
>> >> >> This is causing significant time spending when a bpf prog requires large
>> >> >> amount of patching on different insns. Benchmarking shows it could take
>> >> >> more than half minutes to finish patching when patching number is more
>> >> >> than 50K, and the time spent could be more than one hour when patching
>> >> >> number is around 1M.
>> >> >>
>> >> >> 15000 : 3s
>> >> >> 45000 : 29s
>> >> >> 95000 : 125s
>> >> >> 195000 : 712s
>> >> >> 1000000 : 5100s
>> >> >>
>> >> >> This RFC introduces new patching infrastructure. Before doing insn
>> >> >> patching, insns in bpf prog are turned into a singly linked list, insert
>> >> >> new insns just insert new list node, delete insns just set delete flag.
>> >> >> And finally, the list is linearized back into array, and branch target
>> >> >> adjustment is done for all jump insns during linearization. This algo
>> >> >> brings the time complexity from quadratic to linear of insn number.
>> >> >>
>> >> >> Benchmarking shows the new patching infrastructure could be 10 ~ 15x faster
>> >> >> on medium sized prog, and for a 1M patching it reduce the time from 5100s
>> >> >> to less than 0.5s.
>> >> >>
>> >> >> Patching API
>> >> >> ===
>> >> >> Insn patching could happen on two layers inside BPF. One is "core layer"
>> >> >> where only BPF insns are patched. The other is "verification layer" where
>> >> >> insns have corresponding aux info as well high level subprog info, so
>> >> >> insn patching means aux info needs to be patched as well, and subprog info
>> >> >> needs to be adjusted. BPF prog also has debug info associated, so line info
>> >> >> should always be updated after insn patching.
>> >> >>
>> >> >> So, list creation, destroy, insert, delete is the same for both layer,
>> >> >> but lineration is different. "verification layer" patching require extra
>> >> >> work. Therefore the patch APIs are:
>> >> >>
>> >> >> list creation: bpf_create_list_insn
>> >> >> list patch: bpf_patch_list_insn
>> >> >> list pre-patch: bpf_prepatch_list_insn
>> >> >
>> >> > I think pre-patch name is very confusing, until I read full
>> >> > description I couldn't understand what it's supposed to be used for.
>> >> > Speaking of bpf_patch_list_insn, patch is also generic enough to leave
>> >> > me wondering whether instruction buffer is inserted after instruction,
>> >> > or instruction is replaced with a bunch of instructions.
>> >> >
>> >> > So how about two more specific names:
>> >> > bpf_patch_list_insn -> bpf_list_insn_replace (meaning replace given
>> >> > instruction with a list of patch instructions)
>> >> > bpf_prepatch_list_insn -> bpf_list_insn_prepend (well, I think this
>> >> > one is pretty clear).
>> >>
>> >> My sense on English word is not great, will switch to above which indeed
>> >> reads more clear.
>> >>
>> >> >> list lineration (core layer): prog = bpf_linearize_list_insn(prog, list)
>> >> >> list lineration (veri layer): env = verifier_linearize_list_insn(env, list)
>> >> >
>> >> > These two functions are both quite involved, as well as share a lot of
>> >> > common code. I'd rather have one linearize instruction, that takes env
>> >> > as an optional parameter. If env is specified (which is the case for
>> >> > all cases except for constant blinding pass), then adjust aux_data and
>> >> > subprogs along the way.
>> >>
>> >> Two version of lineration and how to unify them was a painpoint to me. I
>> >> thought to factor out some of the common code out, but it actually doesn't
>> >> count much, the final size counting + insnsi resize parts are the same,
>> >> then things start to diverge since the "Copy over insn" loop.
>> >>
>> >> verifier layer needs to copy and initialize aux data etc. And jump
>> >> relocation is different. At core layer, the use case is JIT blinding which
>> >> could expand an jump_imm insn into a and/or/jump_reg sequence, and the
>> >
>> > Sorry, I didn't get what "could expand an jump_imm insn into a
>> > and/or/jump_reg sequence", maybe you can clarify if I'm missing
>> > something.
>> >
>> > But from your cover letter description, core layer has no jumps at
>> > all, while verifier has jumps inside patch buffer. So, if you support
>> > jumps inside of patch buffer, it will automatically work for core
>> > layer. Or what am I missing?
>>
>> I meant in core layer (JIT blinding), there is the following patching:
>>
>> input:
>> insn 0 insn 0
>> insn 1 insn 1
>> jmp_imm >> mov_imm \
>> insn 2 xor_imm insn seq expanded from jmp_imm
>> insn 3 jmp_reg /
>> insn 2
>> insn 3
>>
>>
>> jmp_imm is the insn that will be patched, and the actually transformation
>> is to expand it into mov_imm/xor_imm/jmp_reg sequence. "jmp_reg", sitting
>> at the end of the patch buffer, must jump to the same destination as the
>> original jmp_imm, so "jmp_reg" is an insn inside patch buffer but should
>> be relocated, and the jump destination is outside of patch buffer.
>
>
> Ok, great, thanks for explaining, yeah it's definitely something that
> we should be able to support. BUT. It got me thinking a bit more and I
> think I have simpler and more elegant solution now, again, supporting
> both core-layer and verifier-layer operations.
>
> struct bpf_patchable_insn {
> struct bpf_patchable_insn *next;
> struct bpf_insn insn;
> int orig_idx; /* original non-patched index */
> int new_idx; /* new index, will be filled only during linearization */
> };
>
> struct bpf_patcher {
> /* dummy head node of a chain of patchable instructions */
> struct bpf_patchable_insn insn_head;
> /* dynamic array of size(original instruction count)
> * this is a map from original instruction index to a first
> * patchable instruction that replaced that instruction (or
> * just original instruction as bpf_patchable_insn).
> */
> int *orig_idx_to_patchable_insn;
> int cnt;
> };
>
> Few points, but it should be pretty clear just from comments and definitions:
> 1. When you created bpf_patcher, you create patchabe_insn list, fill
> orig_idx_to_patchable_insn map to store proper pointers. This array is
> NEVER changed after that.
> 2. When replacing instruction, you re-use struct bpf_patchable_insn
> for first patched instruction, then append after that (not prepend to
> next instruction to not disrupt orig_idx -> patchable_insn mapping).
> 3. During linearizations, you first traverse the chain of instructions
> and trivially assing new_idxs.
> 4. No need for patchabe_insn->target anymore. All jumps use relative
> instruction offsets, right?
Yes, all jumps are pc-relative.
> So when you need to determine new
> instruction index during linearization, you just do (after you
> calculated new instruction indicies):
>
> func adjust_jmp(struct bpf_patcher* patcher, struct bpf_patchable_insn *insn) {
> int old_jmp_idx = insn->orig_idx + jmp_offset_of(insn->insn);
> int new_jmp_idx = patcher->orig_idx_to_patchable_insn[old_jmp_idx]->new_idx;
> adjust_jmp_offset(insn->insn, new_jmp_idx) - insn->orig_idx;
> }
Hmm, this algo is kinds of the same this RFC, just we have organized "new_index"
as "idx_map". And in this RFC, only new_idx of one original insn matters,
no space is allocated for patched insns. (As mentioned, JIT blinding
requires the last insn inside patch buffer relocated to original jump
offset, so there was a little special handling in the relocation loop in
core layer linearization code)
> The idea is that we want to support quick look-up by original
> instruction index. That's what orig_idx_to_patchable_insn provides. On
> the other hand, no existing instruction is ever referencing newly
> patched instruction by its new offset, so with careful implementation,
> you can transparently support all the cases, regardless if it's in
> core layer or verifier layer (so, e.g., verifier layer patched
> instructions now will be able to jump out of patched buffer, if
> necessary, neat, right?).
>
> It is cleaner than everything we've discussed so far. Unless I missed
> something critical (it's all quite convoluted, so I might have
> forgotten some parts already). Let me know what you think.
Let me digest a little bit and do some coding, then I will come back. Some
issues can only shown up during in-depth coding. I kind of feel handling
aux reference in verifier layer is the part that will still introduce some
un-clean code.
<snip>
>> If there is no dead insn elimination opt, then we could just adjust
>> offsets. When there is insn deleting, I feel the logic becomes more
>> complex. One subprog could be completely deleted or partially deleted, so
>> I feel just recalculate the whole subprog info as a side-product is
>> much simpler.
>
> What's the situation where entirety of subprog can be deleted?
Suppose you have conditional jmp_imm, true path calls one subprog, false
path calls the other. If insn walker later found it is also true, then the
subprog at false path won't be marked as "seen", so it is entirely deleted.
I actually thought it is in theory one subprog could be deleted entirely,
so if we support insn deletion inside verifier, then range info like
line_info/subprog_info needs to consider one range is deleted.
Thanks.
Regards,
Jiong
^ permalink raw reply
* Re: memory leak in new_inode_pseudo (2)
From: syzbot @ 2019-07-16 8:28 UTC (permalink / raw)
To: davem, linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <000000000000111cbe058dc7754d@google.com>
syzbot has found a reproducer for the following crash on:
HEAD commit: be8454af Merge tag 'drm-next-2019-07-16' of git://anongit...
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=13d5f750600000
kernel config: https://syzkaller.appspot.com/x/.config?x=d23a1a7bf85c5250
dashboard link: https://syzkaller.appspot.com/bug?extid=e682cca30bc101a4d9d9
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=155c5800600000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1738f800600000
IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+e682cca30bc101a4d9d9@syzkaller.appspotmail.com
executing program
executing program
executing program
executing program
BUG: memory leak
unreferenced object 0xffff888128ea0980 (size 768):
comm "syz-executor303", pid 7044, jiffies 4294943526 (age 13.490s)
hex dump (first 32 bytes):
01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<000000005ba542b8>] kmemleak_alloc_recursive
include/linux/kmemleak.h:43 [inline]
[<000000005ba542b8>] slab_post_alloc_hook mm/slab.h:522 [inline]
[<000000005ba542b8>] slab_alloc mm/slab.c:3319 [inline]
[<000000005ba542b8>] kmem_cache_alloc+0x13f/0x2c0 mm/slab.c:3483
[<000000006532a1e9>] sock_alloc_inode+0x1c/0xa0 net/socket.c:238
[<0000000014ddc967>] alloc_inode+0x2c/0xe0 fs/inode.c:227
[<0000000056541455>] new_inode_pseudo+0x18/0x70 fs/inode.c:916
[<000000003b5b5444>] sock_alloc+0x1c/0x90 net/socket.c:554
[<00000000e623b353>] __sock_create+0x8f/0x250 net/socket.c:1378
[<000000000e094708>] sock_create_kern+0x3b/0x50 net/socket.c:1483
[<000000009fe4f64f>] smc_create+0xae/0x160 net/smc/af_smc.c:1975
[<0000000056be84a7>] __sock_create+0x164/0x250 net/socket.c:1414
[<000000005915e5fe>] sock_create net/socket.c:1465 [inline]
[<000000005915e5fe>] __sys_socket+0x69/0x110 net/socket.c:1507
[<00000000afa837b2>] __do_sys_socket net/socket.c:1516 [inline]
[<00000000afa837b2>] __se_sys_socket net/socket.c:1514 [inline]
[<00000000afa837b2>] __x64_sys_socket+0x1e/0x30 net/socket.c:1514
[<00000000d0addad1>] do_syscall_64+0x76/0x1a0
arch/x86/entry/common.c:296
[<000000004e8e7c22>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
BUG: memory leak
unreferenced object 0xffff88811faeeab8 (size 56):
comm "syz-executor303", pid 7044, jiffies 4294943526 (age 13.490s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 0a ea 28 81 88 ff ff d0 ea ae 1f 81 88 ff ff ...(............
backtrace:
[<000000005ba542b8>] kmemleak_alloc_recursive
include/linux/kmemleak.h:43 [inline]
[<000000005ba542b8>] slab_post_alloc_hook mm/slab.h:522 [inline]
[<000000005ba542b8>] slab_alloc mm/slab.c:3319 [inline]
[<000000005ba542b8>] kmem_cache_alloc+0x13f/0x2c0 mm/slab.c:3483
[<000000008ca63096>] kmem_cache_zalloc include/linux/slab.h:738 [inline]
[<000000008ca63096>] lsm_inode_alloc security/security.c:522 [inline]
[<000000008ca63096>] security_inode_alloc+0x33/0xb0
security/security.c:875
[<00000000b335d930>] inode_init_always+0x108/0x200 fs/inode.c:169
[<0000000015dcffb3>] alloc_inode+0x49/0xe0 fs/inode.c:234
[<0000000056541455>] new_inode_pseudo+0x18/0x70 fs/inode.c:916
[<000000003b5b5444>] sock_alloc+0x1c/0x90 net/socket.c:554
[<00000000e623b353>] __sock_create+0x8f/0x250 net/socket.c:1378
[<000000000e094708>] sock_create_kern+0x3b/0x50 net/socket.c:1483
[<000000009fe4f64f>] smc_create+0xae/0x160 net/smc/af_smc.c:1975
[<0000000056be84a7>] __sock_create+0x164/0x250 net/socket.c:1414
[<000000005915e5fe>] sock_create net/socket.c:1465 [inline]
[<000000005915e5fe>] __sys_socket+0x69/0x110 net/socket.c:1507
[<00000000afa837b2>] __do_sys_socket net/socket.c:1516 [inline]
[<00000000afa837b2>] __se_sys_socket net/socket.c:1514 [inline]
[<00000000afa837b2>] __x64_sys_socket+0x1e/0x30 net/socket.c:1514
[<00000000d0addad1>] do_syscall_64+0x76/0x1a0
arch/x86/entry/common.c:296
[<000000004e8e7c22>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
^ permalink raw reply
* Re: [PATCH iproute2-rc 8/8] rdma: Document counter statistic
From: Gal Pressman @ 2019-07-16 8:19 UTC (permalink / raw)
To: Leon Romanovsky, Stephen Hemminger
Cc: Leon Romanovsky, netdev, David Ahern, Mark Zhang,
RDMA mailing list
In-Reply-To: <20190710072455.9125-9-leon@kernel.org>
On 10/07/2019 10:24, Leon Romanovsky wrote:
> +.SH "EXAMPLES"
> +.PP
> +rdma statistic show
> +.RS 4
> +Shows the state of the default counter of all RDMA devices on the system.
> +.RE
> +.PP
> +rdma statistic show link mlx5_2/1
> +.RS 4
> +Shows the state of the default counter of specified RDMA port
> +.RE
> +.PP
> +rdma statistic qp show
> +.RS 4
> +Shows the state of all qp counters of all RDMA devices on the system.
> +.RE
> +.PP
> +rdma statistic qp show link mlx5_2/1
> +.RS 4
> +Shows the state of all qp counters of specified RDMA port.
> +.RE
> +.PP
> +rdma statistic qp show link mlx5_2 pid 30489
> +.RS 4
> +Shows the state of all qp counters of specified RDMA port and belonging to pid 30489
> +.RE
> +.PP
> +rdma statistic qp mode
> +.RS 4
> +List current counter mode on all deivces
"deivces" -> "devices".
^ permalink raw reply
* [PATCH net] be2net: Signal that the device cannot transmit during reconfiguration
From: Benjamin Poirier @ 2019-07-16 8:16 UTC (permalink / raw)
To: David Miller
Cc: Sathya Perla, Ajit Khaparde, Sriharsha Basavapatna, Somnath Kotur,
Firo Yang, Saeed Mahameed, netdev
While changing the number of interrupt channels, be2net stops adapter
operation (including netif_tx_disable()) but it doesn't signal that it
cannot transmit. This may lead dev_watchdog() to falsely trigger during
that time.
Add the missing call to netif_carrier_off(), following the pattern used in
many other drivers. netif_carrier_on() is already taken care of in
be_open().
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
---
drivers/net/ethernet/emulex/benet/be_main.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index 82015c8a5ed7..b7a246b33599 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -4697,8 +4697,12 @@ int be_update_queues(struct be_adapter *adapter)
struct net_device *netdev = adapter->netdev;
int status;
- if (netif_running(netdev))
+ if (netif_running(netdev)) {
+ /* device cannot transmit now, avoid dev_watchdog timeouts */
+ netif_carrier_off(netdev);
+
be_close(netdev);
+ }
be_cancel_worker(adapter);
--
2.22.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox