* Re: [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap
From: Magnus Karlsson @ 2018-04-24 8:08 UTC (permalink / raw)
To: Willem de Bruijn
Cc: Michael S. Tsirkin, Björn Töpel, Karlsson, Magnus,
Alexander Duyck, Alexander Duyck, John Fastabend,
Alexei Starovoitov, Jesper Dangaard Brouer, Daniel Borkmann,
Network Development, michael.lundkvist, Brandeburg, Jesse,
Singhai, Anjali, Zhang, Qi Z
In-Reply-To: <CAF=yD-+m5+5sKvo2Z1YOOX+zFKNYLVFqjq6+b4wpP6dTX=cyEA@mail.gmail.com>
On Tue, Apr 24, 2018 at 1:59 AM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Mon, Apr 23, 2018 at 7:21 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Mon, Apr 23, 2018 at 03:56:07PM +0200, Björn Töpel wrote:
>>> From: Magnus Karlsson <magnus.karlsson@intel.com>
>>>
>>> Here, we add another setsockopt for registered user memory (umem)
>>> called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can
>>> ask the kernel to allocate a queue (ring buffer) and also mmap it
>>> (XDP_UMEM_PGOFF_FILL_QUEUE) into the process.
>>>
>>> The queue is used to explicitly pass ownership of umem frames from the
>>> user process to the kernel. These frames will in a later patch be
>>> filled in with Rx packet data by the kernel.
>>>
>>> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>>> ---
>>> include/uapi/linux/if_xdp.h | 15 +++++++++++
>>> net/xdp/Makefile | 2 +-
>>> net/xdp/xdp_umem.c | 5 ++++
>>> net/xdp/xdp_umem.h | 2 ++
>>> net/xdp/xsk.c | 62 ++++++++++++++++++++++++++++++++++++++++++++-
>>> net/xdp/xsk_queue.c | 58 ++++++++++++++++++++++++++++++++++++++++++
>>> net/xdp/xsk_queue.h | 38 +++++++++++++++++++++++++++
>>> 7 files changed, 180 insertions(+), 2 deletions(-)
>>> create mode 100644 net/xdp/xsk_queue.c
>>> create mode 100644 net/xdp/xsk_queue.h
>>>
>>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
>>> index 41252135a0fe..975661e1baca 100644
>>> --- a/include/uapi/linux/if_xdp.h
>>> +++ b/include/uapi/linux/if_xdp.h
>>> @@ -23,6 +23,7 @@
>>>
>>> /* XDP socket options */
>>> #define XDP_UMEM_REG 3
>>> +#define XDP_UMEM_FILL_RING 4
>>>
>>> struct xdp_umem_reg {
>>> __u64 addr; /* Start of packet data area */
>>> @@ -31,4 +32,18 @@ struct xdp_umem_reg {
>>> __u32 frame_headroom; /* Frame head room */
>>> };
>>>
>>> +/* Pgoff for mmaping the rings */
>>> +#define XDP_UMEM_PGOFF_FILL_RING 0x100000000
>>> +
>>> +struct xdp_ring {
>>> + __u32 producer __attribute__((aligned(64)));
>>> + __u32 consumer __attribute__((aligned(64)));
>>> +};
>>
>> Why 64? And do you still need these guys in uapi?
>
> I was just about to ask the same. You mean cacheline_aligned?
Yes, I would like to have these cache aligned. How can I accomplish
this in a uapi?
I put a note around this in the cover letter:
* How to deal with cache alignment for uapi when different
architectures can have different cache line sizes? We have just
aligned it to 64 bytes for now, which works for many popular
architectures, but not all. Please advise.
>
>>> +static int xsk_mmap(struct file *file, struct socket *sock,
>>> + struct vm_area_struct *vma)
>>> +{
>>> + unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
>>> + unsigned long size = vma->vm_end - vma->vm_start;
>>> + struct xdp_sock *xs = xdp_sk(sock->sk);
>>> + struct xsk_queue *q;
>>> + unsigned long pfn;
>>> + struct page *qpg;
>>> +
>>> + if (!xs->umem)
>>> + return -EINVAL;
>>> +
>>> + if (offset == XDP_UMEM_PGOFF_FILL_RING)
>>> + q = xs->umem->fq;
>>> + else
>>> + return -EINVAL;
>>> +
>>> + qpg = virt_to_head_page(q->ring);
>
> Is it assured that q is initialized with a call to setsockopt
> XDP_UMEM_FILL_RING before the call the mmap?
Unfortunately not, so this is a bug. Case in point for running
syzkaller below, definitely.
> In general, with such an extensive new API, it might be worthwhile to
> run syzkaller locally on a kernel with these patches. It is pretty
> easy to set up (https://github.com/google/syzkaller/blob/master/docs/linux/setup.md),
> though it also needs to be taught about any new APIs.
Good idea. Will set this up and have it torture the API.
Thanks: Magnus
^ permalink raw reply
* Re: [PATCH bpf-next 15/15] samples/bpf: sample application for AF_XDP sockets
From: Magnus Karlsson @ 2018-04-24 8:22 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Björn Töpel, Karlsson, Magnus, Alexander Duyck,
Alexander Duyck, John Fastabend, Alexei Starovoitov,
Jesper Dangaard Brouer, Willem de Bruijn, Daniel Borkmann,
Network Development, michael.lundkvist, Brandeburg, Jesse,
Singhai, Anjali, Zhang, Qi Z, Björn Töpel
In-Reply-To: <20180424022858-mutt-send-email-mst@kernel.org>
On Tue, Apr 24, 2018 at 1:31 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Apr 23, 2018 at 03:56:19PM +0200, Björn Töpel wrote:
>> From: Magnus Karlsson <magnus.karlsson@intel.com>
>>
>> This is a sample application for AF_XDP sockets. The application
>> supports three different modes of operation: rxdrop, txonly and l2fwd.
>>
>> To show-case a simple round-robin load-balancing between a set of
>> sockets in an xskmap, set the RR_LB compile time define option to 1 in
>> "xdpsock.h".
>>
>> Co-authored-by: Björn Töpel <bjorn.topel@intel.com>
>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>> ---
>> samples/bpf/Makefile | 4 +
>> samples/bpf/xdpsock.h | 11 +
>> samples/bpf/xdpsock_kern.c | 56 +++
>> samples/bpf/xdpsock_user.c | 947 +++++++++++++++++++++++++++++++++++++++++++++
>> 4 files changed, 1018 insertions(+)
>> create mode 100644 samples/bpf/xdpsock.h
>> create mode 100644 samples/bpf/xdpsock_kern.c
>> create mode 100644 samples/bpf/xdpsock_user.c
>>
>> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
>> index aa8c392e2e52..d0ddc1abf20d 100644
>> --- a/samples/bpf/Makefile
>> +++ b/samples/bpf/Makefile
>> @@ -45,6 +45,7 @@ hostprogs-y += xdp_rxq_info
>> hostprogs-y += syscall_tp
>> hostprogs-y += cpustat
>> hostprogs-y += xdp_adjust_tail
>> +hostprogs-y += xdpsock
>>
>> # Libbpf dependencies
>> LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
>> @@ -97,6 +98,7 @@ xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
>> syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
>> cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
>> xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o
>> +xdpsock-objs := bpf_load.o $(LIBBPF) xdpsock_user.o
>>
>> # Tell kbuild to always build the programs
>> always := $(hostprogs-y)
>> @@ -151,6 +153,7 @@ always += xdp2skb_meta_kern.o
>> always += syscall_tp_kern.o
>> always += cpustat_kern.o
>> always += xdp_adjust_tail_kern.o
>> +always += xdpsock_kern.o
>>
>> HOSTCFLAGS += -I$(objtree)/usr/include
>> HOSTCFLAGS += -I$(srctree)/tools/lib/
>> @@ -197,6 +200,7 @@ HOSTLOADLIBES_xdp_rxq_info += -lelf
>> HOSTLOADLIBES_syscall_tp += -lelf
>> HOSTLOADLIBES_cpustat += -lelf
>> HOSTLOADLIBES_xdp_adjust_tail += -lelf
>> +HOSTLOADLIBES_xdpsock += -lelf -pthread
>>
>> # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
>> # make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
>> diff --git a/samples/bpf/xdpsock.h b/samples/bpf/xdpsock.h
>> new file mode 100644
>> index 000000000000..533ab81adfa1
>> --- /dev/null
>> +++ b/samples/bpf/xdpsock.h
>> @@ -0,0 +1,11 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef XDPSOCK_H_
>> +#define XDPSOCK_H_
>> +
>> +/* Power-of-2 number of sockets */
>> +#define MAX_SOCKS 4
>> +
>> +/* Round-robin receive */
>> +#define RR_LB 0
>> +
>> +#endif /* XDPSOCK_H_ */
>> diff --git a/samples/bpf/xdpsock_kern.c b/samples/bpf/xdpsock_kern.c
>> new file mode 100644
>> index 000000000000..d8806c41362e
>> --- /dev/null
>> +++ b/samples/bpf/xdpsock_kern.c
>> @@ -0,0 +1,56 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +#define KBUILD_MODNAME "foo"
>> +#include <uapi/linux/bpf.h>
>> +#include "bpf_helpers.h"
>> +
>> +#include "xdpsock.h"
>> +
>> +struct bpf_map_def SEC("maps") qidconf_map = {
>> + .type = BPF_MAP_TYPE_ARRAY,
>> + .key_size = sizeof(int),
>> + .value_size = sizeof(int),
>> + .max_entries = 1,
>> +};
>> +
>> +struct bpf_map_def SEC("maps") xsks_map = {
>> + .type = BPF_MAP_TYPE_XSKMAP,
>> + .key_size = sizeof(int),
>> + .value_size = sizeof(int),
>> + .max_entries = 4,
>> +};
>> +
>> +struct bpf_map_def SEC("maps") rr_map = {
>> + .type = BPF_MAP_TYPE_PERCPU_ARRAY,
>> + .key_size = sizeof(int),
>> + .value_size = sizeof(unsigned int),
>> + .max_entries = 1,
>> +};
>> +
>> +SEC("xdp_sock")
>> +int xdp_sock_prog(struct xdp_md *ctx)
>> +{
>> + int *qidconf, key = 0, idx;
>> + unsigned int *rr;
>> +
>> + qidconf = bpf_map_lookup_elem(&qidconf_map, &key);
>> + if (!qidconf)
>> + return XDP_ABORTED;
>> +
>> + if (*qidconf != ctx->rx_queue_index)
>> + return XDP_PASS;
>> +
>> +#if RR_LB /* NB! RR_LB is configured in xdpsock.h */
>> + rr = bpf_map_lookup_elem(&rr_map, &key);
>> + if (!rr)
>> + return XDP_ABORTED;
>> +
>> + *rr = (*rr + 1) & (MAX_SOCKS - 1);
>> + idx = *rr;
>> +#else
>> + idx = 0;
>> +#endif
>> +
>> + return bpf_redirect_map(&xsks_map, idx, 0);
>> +}
>> +
>> +char _license[] SEC("license") = "GPL";
>> diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
>> new file mode 100644
>> index 000000000000..690bac1a0ab7
>> --- /dev/null
>> +++ b/samples/bpf/xdpsock_user.c
>> @@ -0,0 +1,947 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright(c) 2017 - 2018 Intel Corporation.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms and conditions of the GNU General Public License,
>> + * version 2, as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope it will be useful, but WITHOUT
>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
>> + * more details.
>> + */
>> +
>> +#include <assert.h>
>> +#include <errno.h>
>> +#include <getopt.h>
>> +#include <libgen.h>
>> +#include <linux/bpf.h>
>> +#include <linux/if_link.h>
>> +#include <linux/if_xdp.h>
>> +#include <linux/if_ether.h>
>> +#include <net/if.h>
>> +#include <signal.h>
>> +#include <stdbool.h>
>> +#include <stdio.h>
>> +#include <stdlib.h>
>> +#include <string.h>
>> +#include <net/ethernet.h>
>> +#include <sys/resource.h>
>> +#include <sys/socket.h>
>> +#include <sys/mman.h>
>> +#include <time.h>
>> +#include <unistd.h>
>> +#include <pthread.h>
>> +#include <locale.h>
>> +#include <sys/types.h>
>> +#include <poll.h>
>> +
>> +#include "bpf_load.h"
>> +#include "bpf_util.h"
>> +#include "libbpf.h"
>> +
>> +#include "xdpsock.h"
>> +
>> +#ifndef SOL_XDP
>> +#define SOL_XDP 283
>> +#endif
>> +
>> +#ifndef AF_XDP
>> +#define AF_XDP 44
>> +#endif
>> +
>> +#ifndef PF_XDP
>> +#define PF_XDP AF_XDP
>> +#endif
>> +
>> +#define NUM_FRAMES 131072
>> +#define FRAME_HEADROOM 0
>> +#define FRAME_SIZE 2048
>> +#define NUM_DESCS 1024
>> +#define BATCH_SIZE 16
>> +
>> +#define FQ_NUM_DESCS 1024
>> +#define CQ_NUM_DESCS 1024
>> +
>> +#define DEBUG_HEXDUMP 0
>> +
>> +typedef __u32 u32;
>> +
>> +static unsigned long prev_time;
>> +
>> +enum benchmark_type {
>> + BENCH_RXDROP = 0,
>> + BENCH_TXONLY = 1,
>> + BENCH_L2FWD = 2,
>> +};
>> +
>> +static enum benchmark_type opt_bench = BENCH_RXDROP;
>> +static u32 opt_xdp_flags;
>> +static const char *opt_if = "";
>> +static int opt_ifindex;
>> +static int opt_queue;
>> +static int opt_poll;
>> +static int opt_shared_packet_buffer;
>> +static int opt_interval = 1;
>> +
>> +struct xdp_umem_uqueue {
>> + u32 cached_prod;
>> + u32 cached_cons;
>> + u32 mask;
>> + u32 size;
>> + struct xdp_umem_ring *ring;
>> +};
>> +
>> +struct xdp_umem {
>> + char (*frames)[FRAME_SIZE];
>> + struct xdp_umem_uqueue fq;
>> + struct xdp_umem_uqueue cq;
>> + int fd;
>> +};
>> +
>> +struct xdp_uqueue {
>> + u32 cached_prod;
>> + u32 cached_cons;
>> + u32 mask;
>> + u32 size;
>> + struct xdp_rxtx_ring *ring;
>> +};
>> +
>> +struct xdpsock {
>> + struct xdp_uqueue rx;
>> + struct xdp_uqueue tx;
>> + int sfd;
>> + struct xdp_umem *umem;
>> + u32 outstanding_tx;
>> + unsigned long rx_npkts;
>> + unsigned long tx_npkts;
>> + unsigned long prev_rx_npkts;
>> + unsigned long prev_tx_npkts;
>> +};
>> +
>> +#define MAX_SOCKS 4
>> +static int num_socks;
>> +struct xdpsock *xsks[MAX_SOCKS];
>> +
>> +static unsigned long get_nsecs(void)
>> +{
>> + struct timespec ts;
>> +
>> + clock_gettime(CLOCK_MONOTONIC, &ts);
>> + return ts.tv_sec * 1000000000UL + ts.tv_nsec;
>> +}
>> +
>> +static void dump_stats(void);
>> +
>> +#define lassert(expr) \
>> + do { \
>> + if (!(expr)) { \
>> + fprintf(stderr, "%s:%s:%i: Assertion failed: " \
>> + #expr ": errno: %d/\"%s\"\n", \
>> + __FILE__, __func__, __LINE__, \
>> + errno, strerror(errno)); \
>> + dump_stats(); \
>> + exit(EXIT_FAILURE); \
>> + } \
>> + } while (0)
>> +
>> +#define barrier() __asm__ __volatile__("": : :"memory")
>> +#define u_smp_rmb() barrier()
>> +#define u_smp_wmb() barrier()
>> +#define likely(x) __builtin_expect(!!(x), 1)
>> +#define unlikely(x) __builtin_expect(!!(x), 0)
>> +
>> +static const char pkt_data[] =
>> + "\x3c\xfd\xfe\x9e\x7f\x71\xec\xb1\xd7\x98\x3a\xc0\x08\x00\x45\x00"
>> + "\x00\x2e\x00\x00\x00\x00\x40\x11\x88\x97\x05\x08\x07\x08\xc8\x14"
>> + "\x1e\x04\x10\x92\x10\x92\x00\x1a\x6d\xa3\x34\x33\x1f\x69\x40\x6b"
>> + "\x54\x59\xb6\x14\x2d\x11\x44\xbf\xaf\xd9\xbe\xaa";
>> +
>> +static inline u32 umem_nb_free(struct xdp_umem_uqueue *q, u32 nb)
>> +{
>> + u32 free_entries = q->size - (q->cached_prod - q->cached_cons);
>> +
>> + if (free_entries >= nb)
>> + return free_entries;
>> +
>> + /* Refresh the local tail pointer */
>> + q->cached_cons = q->ring->ptrs.consumer;
>> +
>> + return q->size - (q->cached_prod - q->cached_cons);
>> +}
>> +
>> +static inline u32 xq_nb_free(struct xdp_uqueue *q, u32 ndescs)
>> +{
>> + u32 free_entries = q->cached_cons - q->cached_prod;
>> +
>> + if (free_entries >= ndescs)
>> + return free_entries;
>> +
>> + /* Refresh the local tail pointer */
>> + q->cached_cons = q->ring->ptrs.consumer + q->size;
>> + return q->cached_cons - q->cached_prod;
>> +}
>> +
>> +static inline u32 umem_nb_avail(struct xdp_umem_uqueue *q, u32 nb)
>> +{
>> + u32 entries = q->cached_prod - q->cached_cons;
>> +
>> + if (entries == 0)
>> + q->cached_prod = q->ring->ptrs.producer;
>> +
>> + entries = q->cached_prod - q->cached_cons;
>> +
>> + return (entries > nb) ? nb : entries;
>> +}
>> +
>> +static inline u32 xq_nb_avail(struct xdp_uqueue *q, u32 ndescs)
>> +{
>> + u32 entries = q->cached_prod - q->cached_cons;
>> +
>> + if (entries == 0)
>> + q->cached_prod = q->ring->ptrs.producer;
>> +
>> + entries = q->cached_prod - q->cached_cons;
>> + return (entries > ndescs) ? ndescs : entries;
>> +}
>> +
>> +static inline int umem_fill_to_kernel_ex(struct xdp_umem_uqueue *fq,
>> + struct xdp_desc *d,
>> + size_t nb)
>> +{
>> + u32 i;
>> +
>> + if (umem_nb_free(fq, nb) < nb)
>> + return -ENOSPC;
>> +
>> + for (i = 0; i < nb; i++) {
>> + u32 idx = fq->cached_prod++ & fq->mask;
>> +
>> + fq->ring->desc[idx] = d[i].idx;
>> + }
>> +
>> + u_smp_wmb();
>> +
>> + fq->ring->ptrs.producer = fq->cached_prod;
>> +
>> + return 0;
>> +}
>> +
>> +static inline int umem_fill_to_kernel(struct xdp_umem_uqueue *fq, u32 *d,
>> + size_t nb)
>> +{
>> + u32 i;
>> +
>> + if (umem_nb_free(fq, nb) < nb)
>> + return -ENOSPC;
>> +
>> + for (i = 0; i < nb; i++) {
>> + u32 idx = fq->cached_prod++ & fq->mask;
>> +
>> + fq->ring->desc[idx] = d[i];
>> + }
>> +
>> + u_smp_wmb();
>> +
>> + fq->ring->ptrs.producer = fq->cached_prod;
>> +
>> + return 0;
>> +}
>> +
>> +static inline size_t umem_complete_from_kernel(struct xdp_umem_uqueue *cq,
>> + u32 *d, size_t nb)
>> +{
>> + u32 idx, i, entries = umem_nb_avail(cq, nb);
>> +
>> + u_smp_rmb();
>> +
>> + for (i = 0; i < entries; i++) {
>> + idx = cq->cached_cons++ & cq->mask;
>> + d[i] = cq->ring->desc[idx];
>> + }
>> +
>> + if (entries > 0) {
>> + u_smp_wmb();
>> +
>> + cq->ring->ptrs.consumer = cq->cached_cons;
>> + }
>> +
>> + return entries;
>> +}
>> +
>> +static inline void *xq_get_data(struct xdpsock *xsk, __u32 idx, __u32 off)
>> +{
>> + lassert(idx < NUM_FRAMES);
>> + return &xsk->umem->frames[idx][off];
>> +}
>> +
>> +static inline int xq_enq(struct xdp_uqueue *uq,
>> + const struct xdp_desc *descs,
>> + unsigned int ndescs)
>> +{
>> + struct xdp_rxtx_ring *r = uq->ring;
>> + unsigned int i;
>> +
>> + if (xq_nb_free(uq, ndescs) < ndescs)
>> + return -ENOSPC;
>> +
>> + for (i = 0; i < ndescs; i++) {
>> + u32 idx = uq->cached_prod++ & uq->mask;
>> +
>> + r->desc[idx].idx = descs[i].idx;
>> + r->desc[idx].len = descs[i].len;
>> + r->desc[idx].offset = descs[i].offset;
>> + }
>> +
>> + u_smp_wmb();
>> +
>> + r->ptrs.producer = uq->cached_prod;
>> + return 0;
>> +}
>> +
>> +static inline int xq_enq_tx_only(struct xdp_uqueue *uq,
>> + __u32 idx, unsigned int ndescs)
>> +{
>> + struct xdp_rxtx_ring *q = uq->ring;
>> + unsigned int i;
>> +
>> + if (xq_nb_free(uq, ndescs) < ndescs)
>> + return -ENOSPC;
>> +
>> + for (i = 0; i < ndescs; i++) {
>> + u32 idx = uq->cached_prod++ & uq->mask;
>> +
>> + q->desc[idx].idx = idx + i;
>> + q->desc[idx].len = sizeof(pkt_data) - 1;
>> + q->desc[idx].offset = 0;
>> + }
>> +
>> + u_smp_wmb();
>> +
>> + q->ptrs.producer = uq->cached_prod;
>> + return 0;
>> +}
>> +
>> +static inline int xq_deq(struct xdp_uqueue *uq,
>> + struct xdp_desc *descs,
>> + int ndescs)
>> +{
>> + struct xdp_rxtx_ring *r = uq->ring;
>> + unsigned int idx;
>> + int i, entries;
>> +
>> + entries = xq_nb_avail(uq, ndescs);
>> +
>> + u_smp_rmb();
>> +
>> + for (i = 0; i < entries; i++) {
>> + idx = uq->cached_cons++ & uq->mask;
>> + descs[i] = r->desc[idx];
>> + }
>> +
>> + if (entries > 0) {
>> + u_smp_wmb();
>> +
>> + r->ptrs.consumer = uq->cached_cons;
>> + }
>> +
>> + return entries;
>> +}
>
> Interesting, I was under the impression that you were
> planning to get rid of consumer/producer counters
> and validate the descriptors instead.
>
> That's the ptr_ring design.
>
> You can then drop all the code around synchronising
> counter caches, as well as smp_rmb barriers.
We evaluated the current producer/consumer ring vs a
version of the ptr_ring modified for our purposes in a previous
mail thread (https://patchwork.ozlabs.org/patch/891713/)
and came to the conclusion that adopting everything in ptr_ring
was not better. That is the reason while we have kept the prod/cons ring.
Note that we did adopt a number of things from your design, but
not the approach of validating a descriptor by checking for a zero
in a specific field. It did not provide a performance benefit for our
balanced test cases and performed worse in the contended
corner cases.
>
>> +
>> +static void swap_mac_addresses(void *data)
>> +{
>> + struct ether_header *eth = (struct ether_header *)data;
>> + struct ether_addr *src_addr = (struct ether_addr *)ð->ether_shost;
>> + struct ether_addr *dst_addr = (struct ether_addr *)ð->ether_dhost;
>> + struct ether_addr tmp;
>> +
>> + tmp = *src_addr;
>> + *src_addr = *dst_addr;
>> + *dst_addr = tmp;
>> +}
>> +
>> +#if DEBUG_HEXDUMP
>> +static void hex_dump(void *pkt, size_t length, const char *prefix)
>> +{
>> + int i = 0;
>> + const unsigned char *address = (unsigned char *)pkt;
>> + const unsigned char *line = address;
>> + size_t line_size = 32;
>> + unsigned char c;
>> +
>> + printf("length = %zu\n", length);
>> + printf("%s | ", prefix);
>> + while (length-- > 0) {
>> + printf("%02X ", *address++);
>> + if (!(++i % line_size) || (length == 0 && i % line_size)) {
>> + if (length == 0) {
>> + while (i++ % line_size)
>> + printf("__ ");
>> + }
>> + printf(" | "); /* right close */
>> + while (line < address) {
>> + c = *line++;
>> + printf("%c", (c < 33 || c == 255) ? 0x2E : c);
>> + }
>> + printf("\n");
>> + if (length > 0)
>> + printf("%s | ", prefix);
>> + }
>> + }
>> + printf("\n");
>> +}
>> +#endif
>> +
>> +static size_t gen_eth_frame(char *frame)
>> +{
>> + memcpy(frame, pkt_data, sizeof(pkt_data) - 1);
>> + return sizeof(pkt_data) - 1;
>> +}
>> +
>> +static struct xdp_umem *xdp_umem_configure(int sfd)
>> +{
>> + int fq_size = FQ_NUM_DESCS, cq_size = CQ_NUM_DESCS;
>> + struct xdp_umem_reg mr;
>> + struct xdp_umem *umem;
>> + void *bufs;
>> +
>> + umem = calloc(1, sizeof(*umem));
>> + lassert(umem);
>> +
>> + lassert(posix_memalign(&bufs, getpagesize(), /* PAGE_SIZE aligned */
>> + NUM_FRAMES * FRAME_SIZE) == 0);
>> +
>> + mr.addr = (__u64)bufs;
>> + mr.len = NUM_FRAMES * FRAME_SIZE;
>> + mr.frame_size = FRAME_SIZE;
>> + mr.frame_headroom = FRAME_HEADROOM;
>> +
>> + lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr)) == 0);
>> + lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_FILL_RING, &fq_size,
>> + sizeof(int)) == 0);
>> + lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &cq_size,
>> + sizeof(int)) == 0);
>> +
>> + umem->fq.ring = mmap(0, sizeof(struct xdp_umem_ring) +
>> + FQ_NUM_DESCS * sizeof(u32),
>> + PROT_READ | PROT_WRITE,
>> + MAP_SHARED | MAP_POPULATE, sfd,
>> + XDP_UMEM_PGOFF_FILL_RING);
>> + lassert(umem->fq.ring != MAP_FAILED);
>> +
>> + umem->fq.mask = FQ_NUM_DESCS - 1;
>> + umem->fq.size = FQ_NUM_DESCS;
>> +
>> + umem->cq.ring = mmap(0, sizeof(struct xdp_umem_ring) +
>> + CQ_NUM_DESCS * sizeof(u32),
>> + PROT_READ | PROT_WRITE,
>> + MAP_SHARED | MAP_POPULATE, sfd,
>> + XDP_UMEM_PGOFF_COMPLETION_RING);
>> + lassert(umem->cq.ring != MAP_FAILED);
>> +
>> + umem->cq.mask = CQ_NUM_DESCS - 1;
>> + umem->cq.size = CQ_NUM_DESCS;
>> +
>> + umem->frames = (char (*)[FRAME_SIZE])bufs;
>> + umem->fd = sfd;
>> +
>> + if (opt_bench == BENCH_TXONLY) {
>> + int i;
>> +
>> + for (i = 0; i < NUM_FRAMES; i++)
>> + (void)gen_eth_frame(&umem->frames[i][0]);
>> + }
>> +
>> + return umem;
>> +}
>> +
>> +static struct xdpsock *xsk_configure(struct xdp_umem *umem)
>> +{
>> + struct sockaddr_xdp sxdp = {};
>> + int sfd, ndescs = NUM_DESCS;
>> + struct xdpsock *xsk;
>> + bool shared = true;
>> + u32 i;
>> +
>> + sfd = socket(PF_XDP, SOCK_RAW, 0);
>> + lassert(sfd >= 0);
>> +
>> + xsk = calloc(1, sizeof(*xsk));
>> + lassert(xsk);
>> +
>> + xsk->sfd = sfd;
>> + xsk->outstanding_tx = 0;
>> +
>> + if (!umem) {
>> + shared = false;
>> + xsk->umem = xdp_umem_configure(sfd);
>> + } else {
>> + xsk->umem = umem;
>> + }
>> +
>> + lassert(setsockopt(sfd, SOL_XDP, XDP_RX_RING,
>> + &ndescs, sizeof(int)) == 0);
>> + lassert(setsockopt(sfd, SOL_XDP, XDP_TX_RING,
>> + &ndescs, sizeof(int)) == 0);
>> +
>> + /* Rx */
>> + xsk->rx.ring = mmap(NULL,
>> + sizeof(struct xdp_ring) +
>> + NUM_DESCS * sizeof(struct xdp_desc),
>> + PROT_READ | PROT_WRITE,
>> + MAP_SHARED | MAP_POPULATE, sfd,
>> + XDP_PGOFF_RX_RING);
>> + lassert(xsk->rx.ring != MAP_FAILED);
>> +
>> + if (!shared) {
>> + for (i = 0; i < NUM_DESCS / 2; i++)
>> + lassert(umem_fill_to_kernel(&xsk->umem->fq, &i, 1)
>> + == 0);
>> + }
>> +
>> + /* Tx */
>> + xsk->tx.ring = mmap(NULL,
>> + sizeof(struct xdp_ring) +
>> + NUM_DESCS * sizeof(struct xdp_desc),
>> + PROT_READ | PROT_WRITE,
>> + MAP_SHARED | MAP_POPULATE, sfd,
>> + XDP_PGOFF_TX_RING);
>> + lassert(xsk->tx.ring != MAP_FAILED);
>> +
>> + xsk->rx.mask = NUM_DESCS - 1;
>> + xsk->rx.size = NUM_DESCS;
>> +
>> + xsk->tx.mask = NUM_DESCS - 1;
>> + xsk->tx.size = NUM_DESCS;
>> +
>> + sxdp.sxdp_family = PF_XDP;
>> + sxdp.sxdp_ifindex = opt_ifindex;
>> + sxdp.sxdp_queue_id = opt_queue;
>> + if (shared) {
>> + sxdp.sxdp_flags = XDP_SHARED_UMEM;
>> + sxdp.sxdp_shared_umem_fd = umem->fd;
>> + }
>> +
>> + lassert(bind(sfd, (struct sockaddr *)&sxdp, sizeof(sxdp)) == 0);
>> +
>> + return xsk;
>> +}
>> +
>> +static void print_benchmark(bool running)
>> +{
>> + const char *bench_str = "INVALID";
>> +
>> + if (opt_bench == BENCH_RXDROP)
>> + bench_str = "rxdrop";
>> + else if (opt_bench == BENCH_TXONLY)
>> + bench_str = "txonly";
>> + else if (opt_bench == BENCH_L2FWD)
>> + bench_str = "l2fwd";
>> +
>> + printf("%s:%d %s ", opt_if, opt_queue, bench_str);
>> + if (opt_xdp_flags & XDP_FLAGS_SKB_MODE)
>> + printf("xdp-skb ");
>> + else if (opt_xdp_flags & XDP_FLAGS_DRV_MODE)
>> + printf("xdp-drv ");
>> + else
>> + printf(" ");
>> +
>> + if (opt_poll)
>> + printf("poll() ");
>> +
>> + if (running) {
>> + printf("running...");
>> + fflush(stdout);
>> + }
>> +}
>> +
>> +static void dump_stats(void)
>> +{
>> + unsigned long now = get_nsecs();
>> + long dt = now - prev_time;
>> + int i;
>> +
>> + prev_time = now;
>> +
>> + for (i = 0; i < num_socks; i++) {
>> + char *fmt = "%-15s %'-11.0f %'-11lu\n";
>> + double rx_pps, tx_pps;
>> +
>> + rx_pps = (xsks[i]->rx_npkts - xsks[i]->prev_rx_npkts) *
>> + 1000000000. / dt;
>> + tx_pps = (xsks[i]->tx_npkts - xsks[i]->prev_tx_npkts) *
>> + 1000000000. / dt;
>> +
>> + printf("\n sock%d@", i);
>> + print_benchmark(false);
>> + printf("\n");
>> +
>> + printf("%-15s %-11s %-11s %-11.2f\n", "", "pps", "pkts",
>> + dt / 1000000000.);
>> + printf(fmt, "rx", rx_pps, xsks[i]->rx_npkts);
>> + printf(fmt, "tx", tx_pps, xsks[i]->tx_npkts);
>> +
>> + xsks[i]->prev_rx_npkts = xsks[i]->rx_npkts;
>> + xsks[i]->prev_tx_npkts = xsks[i]->tx_npkts;
>> + }
>> +}
>> +
>> +static void *poller(void *arg)
>> +{
>> + (void)arg;
>> + for (;;) {
>> + sleep(opt_interval);
>> + dump_stats();
>> + }
>> +
>> + return NULL;
>> +}
>> +
>> +static void int_exit(int sig)
>> +{
>> + (void)sig;
>> + dump_stats();
>> + bpf_set_link_xdp_fd(opt_ifindex, -1, opt_xdp_flags);
>> + exit(EXIT_SUCCESS);
>> +}
>> +
>> +static struct option long_options[] = {
>> + {"rxdrop", no_argument, 0, 'r'},
>> + {"txonly", no_argument, 0, 't'},
>> + {"l2fwd", no_argument, 0, 'l'},
>> + {"interface", required_argument, 0, 'i'},
>> + {"queue", required_argument, 0, 'q'},
>> + {"poll", no_argument, 0, 'p'},
>> + {"shared-buffer", no_argument, 0, 's'},
>> + {"xdp-skb", no_argument, 0, 'S'},
>> + {"xdp-native", no_argument, 0, 'N'},
>> + {"interval", required_argument, 0, 'n'},
>> + {0, 0, 0, 0}
>> +};
>> +
>> +static void usage(const char *prog)
>> +{
>> + const char *str =
>> + " Usage: %s [OPTIONS]\n"
>> + " Options:\n"
>> + " -r, --rxdrop Discard all incoming packets (default)\n"
>> + " -t, --txonly Only send packets\n"
>> + " -l, --l2fwd MAC swap L2 forwarding\n"
>> + " -i, --interface=n Run on interface n\n"
>> + " -q, --queue=n Use queue n (default 0)\n"
>> + " -p, --poll Use poll syscall\n"
>> + " -s, --shared-buffer Use shared packet buffer\n"
>> + " -S, --xdp-skb=n Use XDP skb-mod\n"
>> + " -N, --xdp-native=n Enfore XDP native mode\n"
>> + " -n, --interval=n Specify statistics update interval (default 1 sec).\n"
>> + "\n";
>> + fprintf(stderr, str, prog);
>> + exit(EXIT_FAILURE);
>> +}
>> +
>> +static void parse_command_line(int argc, char **argv)
>> +{
>> + int option_index, c;
>> +
>> + opterr = 0;
>> +
>> + for (;;) {
>> + c = getopt_long(argc, argv, "rtli:q:psSNn:", long_options,
>> + &option_index);
>> + if (c == -1)
>> + break;
>> +
>> + switch (c) {
>> + case 'r':
>> + opt_bench = BENCH_RXDROP;
>> + break;
>> + case 't':
>> + opt_bench = BENCH_TXONLY;
>> + break;
>> + case 'l':
>> + opt_bench = BENCH_L2FWD;
>> + break;
>> + case 'i':
>> + opt_if = optarg;
>> + break;
>> + case 'q':
>> + opt_queue = atoi(optarg);
>> + break;
>> + case 's':
>> + opt_shared_packet_buffer = 1;
>> + break;
>> + case 'p':
>> + opt_poll = 1;
>> + break;
>> + case 'S':
>> + opt_xdp_flags |= XDP_FLAGS_SKB_MODE;
>> + break;
>> + case 'N':
>> + opt_xdp_flags |= XDP_FLAGS_DRV_MODE;
>> + break;
>> + case 'n':
>> + opt_interval = atoi(optarg);
>> + break;
>> + default:
>> + usage(basename(argv[0]));
>> + }
>> + }
>> +
>> + opt_ifindex = if_nametoindex(opt_if);
>> + if (!opt_ifindex) {
>> + fprintf(stderr, "ERROR: interface \"%s\" does not exist\n",
>> + opt_if);
>> + usage(basename(argv[0]));
>> + }
>> +}
>> +
>> +static void kick_tx(int fd)
>> +{
>> + int ret;
>> +
>> + ret = sendto(fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
>> + if (ret >= 0 || errno == ENOBUFS || errno == EAGAIN)
>> + return;
>> + lassert(0);
>> +}
>> +
>> +static inline void complete_tx_l2fwd(struct xdpsock *xsk)
>> +{
>> + u32 descs[BATCH_SIZE];
>> + unsigned int rcvd;
>> + size_t ndescs;
>> +
>> + if (!xsk->outstanding_tx)
>> + return;
>> +
>> + kick_tx(xsk->sfd);
>> + ndescs = (xsk->outstanding_tx > BATCH_SIZE) ? BATCH_SIZE :
>> + xsk->outstanding_tx;
>> +
>> + /* re-add completed Tx buffers */
>> + rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, ndescs);
>> + if (rcvd > 0) {
>> + umem_fill_to_kernel(&xsk->umem->fq, descs, rcvd);
>> + xsk->outstanding_tx -= rcvd;
>> + xsk->tx_npkts += rcvd;
>> + }
>> +}
>> +
>> +static inline void complete_tx_only(struct xdpsock *xsk)
>> +{
>> + u32 descs[BATCH_SIZE];
>> + unsigned int rcvd;
>> +
>> + if (!xsk->outstanding_tx)
>> + return;
>> +
>> + kick_tx(xsk->sfd);
>> +
>> + rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, BATCH_SIZE);
>> + if (rcvd > 0) {
>> + xsk->outstanding_tx -= rcvd;
>> + xsk->tx_npkts += rcvd;
>> + }
>> +}
>> +
>> +static void rx_drop(struct xdpsock *xsk)
>> +{
>> + struct xdp_desc descs[BATCH_SIZE];
>> + unsigned int rcvd, i;
>> +
>> + rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE);
>> + if (!rcvd)
>> + return;
>> +
>> + for (i = 0; i < rcvd; i++) {
>> + u32 idx = descs[i].idx;
>> +
>> + lassert(idx < NUM_FRAMES);
>> +#if DEBUG_HEXDUMP
>> + char *pkt;
>> + char buf[32];
>> +
>> + pkt = xq_get_data(xsk, idx, descs[i].offset);
>> + sprintf(buf, "idx=%d", idx);
>> + hex_dump(pkt, descs[i].len, buf);
>> +#endif
>> + }
>> +
>> + xsk->rx_npkts += rcvd;
>> +
>> + umem_fill_to_kernel_ex(&xsk->umem->fq, descs, rcvd);
>> +}
>> +
>> +static void rx_drop_all(void)
>> +{
>> + struct pollfd fds[MAX_SOCKS + 1];
>> + int i, ret, timeout, nfds = 1;
>> +
>> + memset(fds, 0, sizeof(fds));
>> +
>> + for (i = 0; i < num_socks; i++) {
>> + fds[i].fd = xsks[i]->sfd;
>> + fds[i].events = POLLIN;
>> + timeout = 1000; /* 1sn */
>> + }
>> +
>> + for (;;) {
>> + if (opt_poll) {
>> + ret = poll(fds, nfds, timeout);
>> + if (ret <= 0)
>> + continue;
>> + }
>> +
>> + for (i = 0; i < num_socks; i++)
>> + rx_drop(xsks[i]);
>> + }
>> +}
>> +
>> +static void tx_only(struct xdpsock *xsk)
>> +{
>> + int timeout, ret, nfds = 1;
>> + struct pollfd fds[nfds + 1];
>> + unsigned int idx = 0;
>> +
>> + memset(fds, 0, sizeof(fds));
>> + fds[0].fd = xsk->sfd;
>> + fds[0].events = POLLOUT;
>> + timeout = 1000; /* 1sn */
>> +
>> + for (;;) {
>> + if (opt_poll) {
>> + ret = poll(fds, nfds, timeout);
>> + if (ret <= 0)
>> + continue;
>> +
>> + if (fds[0].fd != xsk->sfd ||
>> + !(fds[0].revents & POLLOUT))
>> + continue;
>> + }
>> +
>> + if (xq_nb_free(&xsk->tx, BATCH_SIZE) >= BATCH_SIZE) {
>> + lassert(xq_enq_tx_only(&xsk->tx, idx, BATCH_SIZE) == 0);
>> +
>> + xsk->outstanding_tx += BATCH_SIZE;
>> + idx += BATCH_SIZE;
>> + idx %= NUM_FRAMES;
>> + }
>> +
>> + complete_tx_only(xsk);
>> + }
>> +}
>> +
>> +static void l2fwd(struct xdpsock *xsk)
>> +{
>> + for (;;) {
>> + struct xdp_desc descs[BATCH_SIZE];
>> + unsigned int rcvd, i;
>> + int ret;
>> +
>> + for (;;) {
>> + complete_tx_l2fwd(xsk);
>> +
>> + rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE);
>> + if (rcvd > 0)
>> + break;
>> + }
>> +
>> + for (i = 0; i < rcvd; i++) {
>> + char *pkt = xq_get_data(xsk, descs[i].idx,
>> + descs[i].offset);
>> +
>> + swap_mac_addresses(pkt);
>> +#if DEBUG_HEXDUMP
>> + char buf[32];
>> + u32 idx = descs[i].idx;
>> +
>> + sprintf(buf, "idx=%d", idx);
>> + hex_dump(pkt, descs[i].len, buf);
>> +#endif
>> + }
>> +
>> + xsk->rx_npkts += rcvd;
>> +
>> + ret = xq_enq(&xsk->tx, descs, rcvd);
>> + lassert(ret == 0);
>> + xsk->outstanding_tx += rcvd;
>> + }
>> +}
>> +
>> +int main(int argc, char **argv)
>> +{
>> + struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
>> + char xdp_filename[256];
>> + int i, ret, key = 0;
>> + pthread_t pt;
>> +
>> + parse_command_line(argc, argv);
>> +
>> + if (setrlimit(RLIMIT_MEMLOCK, &r)) {
>> + fprintf(stderr, "ERROR: setrlimit(RLIMIT_MEMLOCK) \"%s\"\n",
>> + strerror(errno));
>> + exit(EXIT_FAILURE);
>> + }
>> +
>> + snprintf(xdp_filename, sizeof(xdp_filename), "%s_kern.o", argv[0]);
>> +
>> + if (load_bpf_file(xdp_filename)) {
>> + fprintf(stderr, "ERROR: load_bpf_file %s\n", bpf_log_buf);
>> + exit(EXIT_FAILURE);
>> + }
>> +
>> + if (!prog_fd[0]) {
>> + fprintf(stderr, "ERROR: load_bpf_file: \"%s\"\n",
>> + strerror(errno));
>> + exit(EXIT_FAILURE);
>> + }
>> +
>> + if (bpf_set_link_xdp_fd(opt_ifindex, prog_fd[0], opt_xdp_flags) < 0) {
>> + fprintf(stderr, "ERROR: link set xdp fd failed\n");
>> + exit(EXIT_FAILURE);
>> + }
>> +
>> + ret = bpf_map_update_elem(map_fd[0], &key, &opt_queue, 0);
>> + if (ret) {
>> + fprintf(stderr, "ERROR: bpf_map_update_elem qidconf\n");
>> + exit(EXIT_FAILURE);
>> + }
>> +
>> + /* Create sockets... */
>> + xsks[num_socks++] = xsk_configure(NULL);
>> +
>> +#if RR_LB
>> + for (i = 0; i < MAX_SOCKS - 1; i++)
>> + xsks[num_socks++] = xsk_configure(xsks[0]->umem);
>> +#endif
>> +
>> + /* ...and insert them into the map. */
>> + for (i = 0; i < num_socks; i++) {
>> + key = i;
>> + ret = bpf_map_update_elem(map_fd[1], &key, &xsks[i]->sfd, 0);
>> + if (ret) {
>> + fprintf(stderr, "ERROR: bpf_map_update_elem %d\n", i);
>> + exit(EXIT_FAILURE);
>> + }
>> + }
>> +
>> + signal(SIGINT, int_exit);
>> + signal(SIGTERM, int_exit);
>> + signal(SIGABRT, int_exit);
>> +
>> + setlocale(LC_ALL, "");
>> +
>> + ret = pthread_create(&pt, NULL, poller, NULL);
>> + lassert(ret == 0);
>> +
>> + prev_time = get_nsecs();
>> +
>> + if (opt_bench == BENCH_RXDROP)
>> + rx_drop_all();
>> + else if (opt_bench == BENCH_TXONLY)
>> + tx_only(xsks[0]);
>> + else
>> + l2fwd(xsks[0]);
>> +
>> + return 0;
>> +}
>> --
>> 2.14.1
^ permalink raw reply
* Re: [PATCH net] l2tp: check sockaddr length in pppol2tp_connect()
From: Guillaume Nault @ 2018-04-24 8:23 UTC (permalink / raw)
To: David Miller; +Cc: netdev, jchapman
In-Reply-To: <20180423.211122.76873901039312656.davem@davemloft.net>
On Mon, Apr 23, 2018 at 09:11:22PM -0400, David Miller wrote:
> From: Guillaume Nault <g.nault@alphalink.fr>
> Date: Mon, 23 Apr 2018 16:15:14 +0200
>
> > Check sockaddr_len before dereferencing sp->sa_protocol, to ensure that
> > it actually points to valid data.
> >
> > Fixes: fd558d186df2 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
> > Reported-by: syzbot+a70ac890b23b1bf29f5c@syzkaller.appspotmail.com
> > Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
>
> Applied and queued up for -stable.
>
> I guess you can completely remove the "bad socket address" -EINVAL else
> clause later in the function as a cleanup in net-next.
>
Yes, will do. Thanks.
^ permalink raw reply
* [PATCH] vhost_net: use packet weight for rx handler, too
From: Paolo Abeni @ 2018-04-24 8:34 UTC (permalink / raw)
To: kvm; +Cc: haibinzhang, Michael S. Tsirkin, Jason Wang, virtualization,
netdev
Similar to commit a2ac99905f1e ("vhost-net: set packet weight of
tx polling to 2 * vq size"), we need a packet-based limit for
handler_rx, too - elsewhere, under rx flood with small packets,
tx can be delayed for a very long time, even without busypolling.
The pkt limit applied to handle_rx must be the same applied by
handle_tx, or we will get unfair scheduling between rx and tx.
Tying such limit to the queue length makes it less effective for
large queue length values and can introduce large process
scheduler latencies, so a constant valued is used - likewise
the existing bytes limit.
The selected limit has been validated with PVP[1] performance
test with different queue sizes:
queue size 256 512 1024
baseline 366 354 362
weight 128 715 723 670
weight 256 740 745 733
weight 512 600 460 583
weight 1024 423 427 418
A packet weight of 256 gives peek performances in under all the
tested scenarios.
No measurable regression in unidirectional performance tests has
been detected.
[1] https://developers.redhat.com/blog/2017/06/05/measuring-and-comparing-open-vswitch-performance/
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
drivers/vhost/net.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index bbf38befefb2..c4b49fca4871 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -46,8 +46,10 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
#define VHOST_NET_WEIGHT 0x80000
/* Max number of packets transferred before requeueing the job.
- * Using this limit prevents one virtqueue from starving rx. */
-#define VHOST_NET_PKT_WEIGHT(vq) ((vq)->num * 2)
+ * Using this limit prevents one virtqueue from starving others with small
+ * pkts.
+ */
+#define VHOST_NET_PKT_WEIGHT 256
/* MAX number of TX used buffers for outstanding zerocopy */
#define VHOST_MAX_PEND 128
@@ -587,7 +589,7 @@ static void handle_tx(struct vhost_net *net)
vhost_zerocopy_signal_used(net, vq);
vhost_net_tx_packet(net);
if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
- unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT(vq))) {
+ unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT)) {
vhost_poll_queue(&vq->poll);
break;
}
@@ -769,6 +771,7 @@ static void handle_rx(struct vhost_net *net)
struct socket *sock;
struct iov_iter fixup;
__virtio16 num_buffers;
+ int recv_pkts = 0;
mutex_lock_nested(&vq->mutex, 0);
sock = vq->private_data;
@@ -872,7 +875,8 @@ static void handle_rx(struct vhost_net *net)
if (unlikely(vq_log))
vhost_log_write(vq, vq_log, log, vhost_len);
total_len += vhost_len;
- if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
+ if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
+ unlikely(++recv_pkts >= VHOST_NET_PKT_WEIGHT)) {
vhost_poll_queue(&vq->poll);
goto out;
}
--
2.14.3
^ permalink raw reply related
* Re: [PATCH 1/1] IB/rxe: avoid double kfree_skb
From: Yanjun Zhu @ 2018-04-24 8:34 UTC (permalink / raw)
To: Doug Ledford, monis, jgg, linux-rdma; +Cc: netdev
In-Reply-To: <89cbf00d-40b1-d3cc-dd1c-3c4b6fd365d8@oracle.com>
Hi, all
rxe_send
ip_local_out
__ip_local_out
nf_hook_slow
In the above call process, nf_hook_slow drops and frees skb, then -EPERM
is returned when iptables rules(iptables -I OUTPUT -p udp --dport 4791
-j DROP) is set.
If skb->users is not changed in softroce, kfree_skb should not be called
in this module.
I will make further investigations about other error handler after
ip_local_out.
If I am wrong, please correct me.
Any reply is appreciated.
Zhu Yanjun
On 2018/4/20 13:46, Yanjun Zhu wrote:
>
>
> On 2018/4/20 10:19, Doug Ledford wrote:
>> On Thu, 2018-04-19 at 10:01 -0400, Zhu Yanjun wrote:
>>> When skb is dropped by iptables rules, the skb is freed at the same
>>> time
>>> -EPERM is returned. So in softroce, it is not necessary to free skb
>>> again.
>>> Or else, crash will occur.
>>>
>>> The steps to reproduce:
>>>
>>> server client
>>> --------- ---------
>>> |1.1.1.1|<----rxe-channel--->|1.1.1.2|
>>> --------- ---------
>>>
>>> On server: rping -s -a 1.1.1.1 -v -C 10000 -S 512
>>> On client: rping -c -a 1.1.1.1 -v -C 10000 -S 512
>>>
>>> The kernel configs CONFIG_DEBUG_KMEMLEAK and
>>> CONFIG_DEBUG_OBJECTS are enabled on both server and client.
>>>
>>> When rping runs, run the following command in server:
>>>
>>> iptables -I OUTPUT -p udp --dport 4791 -j DROP
>>>
>>> Without this patch, crash will occur.
>>>
>>> CC: Srinivas Eeda <srinivas.eeda@oracle.com>
>>> CC: Junxiao Bi <junxiao.bi@oracle.com>
>>> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
>>> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
>> I have no reason to doubt your analysis, but if there are a bunch of
>> error paths for net_xmit and they all return with your skb still being
>> valid and holding a reference, and then one oddball that returns with
>> your skb already gone, that just sounds like a mistake waiting to happen
>> (not to mention a bajillion special cases sprinkled everywhere to deal
>> with this apparent inconsistency).
>>
>> Can we get a netdev@ confirmation on this being the right solution?
> Yes. I agree with you.
> After iptables rule "iptables -I OUTPUT -p udp --dport 4791 -j DROP",
> the skb is freed in this function
>
> /* Returns 1 if okfn() needs to be executed by the caller,
> * -EPERM for NF_DROP, 0 otherwise. Caller must hold rcu_read_lock. */
> int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state,
> const struct nf_hook_entries *e, unsigned int s)
> {
> unsigned int verdict;
> int ret;
>
> for (; s < e->num_hook_entries; s++) {
> verdict = nf_hook_entry_hookfn(&e->hooks[s], skb, state);
> switch (verdict & NF_VERDICT_MASK) {
> case NF_ACCEPT:
> break;
> case NF_DROP:
> kfree_skb(skb); <----here, skb is freed
> ret = NF_DROP_GETERR(verdict);
> if (ret == 0)
> ret = -EPERM;
> return ret;
> case NF_QUEUE:
> ret = nf_queue(skb, state, e, s, verdict);
> if (ret == 1)
> continue;
> return ret;
> default:
> /* Implicit handling for NF_STOLEN, as well as
> any other
> * non conventional verdicts.
> */
> return 0;
> }
> }
>
> return 1;
> }
> EXPORT_SYMBOL(nf_hook_slow);
>
> If I am wrong, please correct me.
>
> And my test environment is still there, any solution can be verified
> in it.
>
> Zhu Yanjun
>>
>>> ---
>>> drivers/infiniband/sw/rxe/rxe_net.c | 3 +++
>>> drivers/infiniband/sw/rxe/rxe_req.c | 5 +++--
>>> drivers/infiniband/sw/rxe/rxe_resp.c | 9 ++++++---
>>> 3 files changed, 12 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c
>>> b/drivers/infiniband/sw/rxe/rxe_net.c
>>> index 9da6e37..2094434 100644
>>> --- a/drivers/infiniband/sw/rxe/rxe_net.c
>>> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
>>> @@ -511,6 +511,9 @@ int rxe_send(struct rxe_pkt_info *pkt, struct
>>> sk_buff *skb)
>>> if (unlikely(net_xmit_eval(err))) {
>>> pr_debug("error sending packet: %d\n", err);
>>> + /* -EPERM means the skb is dropped and freed. */
>>> + if (err == -EPERM)
>>> + return -EPERM;
>>> return -EAGAIN;
>>> }
>>> diff --git a/drivers/infiniband/sw/rxe/rxe_req.c
>>> b/drivers/infiniband/sw/rxe/rxe_req.c
>>> index 7bdaf71..9d2efec 100644
>>> --- a/drivers/infiniband/sw/rxe/rxe_req.c
>>> +++ b/drivers/infiniband/sw/rxe/rxe_req.c
>>> @@ -727,8 +727,9 @@ int rxe_requester(void *arg)
>>> rollback_state(wqe, qp, &rollback_wqe,
>>> rollback_psn);
>>> - if (ret == -EAGAIN) {
>>> - kfree_skb(skb);
>>> + if ((ret == -EAGAIN) || (ret == -EPERM)) {
>>> + if (ret == -EAGAIN)
>>> + kfree_skb(skb);
>>> rxe_run_task(&qp->req.task, 1);
>>> goto exit;
>>> }
>>> diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c
>>> b/drivers/infiniband/sw/rxe/rxe_resp.c
>>> index a65c996..6bdf9b2 100644
>>> --- a/drivers/infiniband/sw/rxe/rxe_resp.c
>>> +++ b/drivers/infiniband/sw/rxe/rxe_resp.c
>>> @@ -742,7 +742,8 @@ static enum resp_states read_reply(struct rxe_qp
>>> *qp,
>>> err = rxe_xmit_packet(rxe, qp, &ack_pkt, skb);
>>> if (err) {
>>> pr_err("Failed sending RDMA reply.\n");
>>> - kfree_skb(skb);
>>> + if (err != -EPERM)
>>> + kfree_skb(skb);
>>> return RESPST_ERR_RNR;
>>> }
>>> @@ -956,7 +957,8 @@ static int send_ack(struct rxe_qp *qp, struct
>>> rxe_pkt_info *pkt,
>>> err = rxe_xmit_packet(rxe, qp, &ack_pkt, skb);
>>> if (err) {
>>> pr_err_ratelimited("Failed sending ack\n");
>>> - kfree_skb(skb);
>>> + if (err != -EPERM)
>>> + kfree_skb(skb);
>>> }
>>> err1:
>>> @@ -1141,7 +1143,8 @@ static enum resp_states
>>> duplicate_request(struct rxe_qp *qp,
>>> if (rc) {
>>> pr_err("Failed resending result.
>>> This flow is not handled - skb ignored\n");
>>> rxe_drop_ref(qp);
>>> - kfree_skb(skb_copy);
>>> + if (rc != -EPERM)
>>> + kfree_skb(skb_copy);
>>> rc = RESPST_CLEANUP;
>>> goto out;
>>> }
>>> --
>>> 2.7.4
>>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
From: Magnus Karlsson @ 2018-04-24 8:44 UTC (permalink / raw)
To: Jason Wang
Cc: Björn Töpel, Karlsson, Magnus, Alexander Duyck,
Alexander Duyck, John Fastabend, Alexei Starovoitov,
Jesper Dangaard Brouer, Willem de Bruijn, Daniel Borkmann,
Michael S. Tsirkin, Network Development, Björn Töpel,
michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
Zhang, Qi Z
In-Reply-To: <3165e013-fab9-a0a2-2048-6d7aac0bd85e@redhat.com>
>> We have run some benchmarks on a dual socket system with two Broadwell
>> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
>> cores which gives a total of 28, but only two cores are used in these
>> experiments. One for TR/RX and one for the user space application. The
>> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
>> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
>> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
>> Intel I40E 40Gbit/s using the i40e driver.
>>
>> Below are the results in Mpps of the I40E NIC benchmark runs for 64
>> and 1500 byte packets, generated by commercial packet generator HW that is
>> generating packets at full 40 Gbit/s line rate.
>>
>> AF_XDP performance 64 byte packets. Results from RFC V2 in parenthesis.
>> Benchmark XDP_SKB XDP_DRV
>> rxdrop 2.9(3.0) 9.4(9.3)
>> txpush 2.5(2.2) NA*
>> l2fwd 1.9(1.7) 2.4(2.4) (TX using XDP_SKB in both cases)
>
>
> This number looks not very exciting. I can get ~3Mpps when using testpmd in
> a guest with xdp_redirect.sh on host between ixgbe and TAP/vhost. I believe
> we can even better performance without virt. It would be interesting to
> compare this performance with e.g testpmd + virito_user(vhost_kernel) + XDP.
Note that all the XDP_SKB numbers plus the TX part of XDP_DRV for l2fwd
uses SKBs and the generic XDP path in the kernel. I am not surprised those
numbers are lower than what you are seeing with XDP_DRV support.
(If that is what you are running? Unsure about your setup). The
9.4 Mpps for RX is what you get with the XDP_DRV support and copies
out to user space. Or is it this number you think is low? Zerocopy will be added
in later patch sets.
With that said, both XDP_SKB and XDP_DRV can be optimized. We
have not spent that much time on optimizations at this point.
>
>>
>> AF_XDP performance 1500 byte packets:
>> Benchmark XDP_SKB XDP_DRV
>> rxdrop 2.1(2.2) 3.3(3.1)
>> l2fwd 1.4(1.1) 1.8(1.7) (TX using XDP_SKB in both cases)
>>
>> * NA since we have no support for TX using the XDP_DRV infrastructure
>> in this RFC. This is for a future patch set since it involves
>> changes to the XDP NDOs. Some of this has been upstreamed by Jesper
>> Dangaard Brouer.
>>
>> XDP performance on our system as a base line:
>>
>> 64 byte packets:
>> XDP stats CPU pps issue-pps
>> XDP-RX CPU 16 32,921,521 0
>>
>> 1500 byte packets:
>> XDP stats CPU pps issue-pps
>> XDP-RX CPU 16 3,289,491 0
>>
>> Changes from RFC V2:
>>
>> * Optimizations and simplifications to the ring structures inspired by
>> ptr_ring.h
>> * Renamed XDP_[RX|TX]_QUEUE to XDP_[RX|TX]_RING in the uapi to be
>> consistent with AF_PACKET
>> * Support for only having an RX queue or a TX queue defined
>> * Some bug fixes and code cleanup
>>
>> The structure of the patch set is as follows:
>>
>> Patches 1-2: Basic socket and umem plumbing
>> Patches 3-10: RX support together with the new XSKMAP
>> Patches 11-14: TX support
>> Patch 15: Sample application
>>
>> We based this patch set on bpf-next commit fbcf93ebcaef ("bpf: btf:
>> Clean up btf.h in uapi")
>>
>> Questions:
>>
>> * How to deal with cache alignment for uapi when different
>> architectures can have different cache line sizes? We have just
>> aligned it to 64 bytes for now, which works for many popular
>> architectures, but not all. Please advise.
>>
>> To do:
>>
>> * Optimize performance
>>
>> * Kernel selftest
>>
>> Post-series plan:
>>
>> * Kernel load module support of AF_XDP would be nice. Unclear how to
>> achieve this though since our XDP code depends on net/core.
>>
>> * Support for AF_XDP sockets without an XPD program loaded. In this
>> case all the traffic on a queue should go up to the user space socket.
>
>
> I think we probably need this in the case of TUN XDP for virt guest too.
Yes.
Thanks: Magnus
> Thanks
>
>
>>
>> * Daniel Borkmann's suggestion for a "copy to XDP socket, and return
>> XDP_PASS" for a tcpdump-like functionality.
>>
>> * And of course getting to zero-copy support in small increments.
>>
>> Thanks: Björn and Magnus
>>
>> Björn Töpel (8):
>> net: initial AF_XDP skeleton
>> xsk: add user memory registration support sockopt
>> xsk: add Rx queue setup and mmap support
>> xdp: introduce xdp_return_buff API
>> xsk: add Rx receive functions and poll support
>> bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
>> xsk: wire up XDP_DRV side of AF_XDP
>> xsk: wire up XDP_SKB side of AF_XDP
>>
>> Magnus Karlsson (7):
>> xsk: add umem fill queue support and mmap
>> xsk: add support for bind for Rx
>> xsk: add umem completion queue support and mmap
>> xsk: add Tx queue setup and mmap support
>> xsk: support for Tx
>> xsk: statistics support
>> samples/bpf: sample application for AF_XDP sockets
>>
>> MAINTAINERS | 8 +
>> include/linux/bpf.h | 26 +
>> include/linux/bpf_types.h | 3 +
>> include/linux/filter.h | 2 +-
>> include/linux/socket.h | 5 +-
>> include/net/xdp.h | 1 +
>> include/net/xdp_sock.h | 46 ++
>> include/uapi/linux/bpf.h | 1 +
>> include/uapi/linux/if_xdp.h | 87 ++++
>> kernel/bpf/Makefile | 3 +
>> kernel/bpf/verifier.c | 8 +-
>> kernel/bpf/xskmap.c | 286 +++++++++++
>> net/Kconfig | 1 +
>> net/Makefile | 1 +
>> net/core/dev.c | 34 +-
>> net/core/filter.c | 40 +-
>> net/core/sock.c | 12 +-
>> net/core/xdp.c | 15 +-
>> net/xdp/Kconfig | 7 +
>> net/xdp/Makefile | 2 +
>> net/xdp/xdp_umem.c | 256 ++++++++++
>> net/xdp/xdp_umem.h | 65 +++
>> net/xdp/xdp_umem_props.h | 23 +
>> net/xdp/xsk.c | 704 +++++++++++++++++++++++++++
>> net/xdp/xsk_queue.c | 73 +++
>> net/xdp/xsk_queue.h | 245 ++++++++++
>> samples/bpf/Makefile | 4 +
>> samples/bpf/xdpsock.h | 11 +
>> samples/bpf/xdpsock_kern.c | 56 +++
>> samples/bpf/xdpsock_user.c | 947
>> ++++++++++++++++++++++++++++++++++++
>> security/selinux/hooks.c | 4 +-
>> security/selinux/include/classmap.h | 4 +-
>> 32 files changed, 2945 insertions(+), 35 deletions(-)
>> create mode 100644 include/net/xdp_sock.h
>> create mode 100644 include/uapi/linux/if_xdp.h
>> create mode 100644 kernel/bpf/xskmap.c
>> create mode 100644 net/xdp/Kconfig
>> create mode 100644 net/xdp/Makefile
>> create mode 100644 net/xdp/xdp_umem.c
>> create mode 100644 net/xdp/xdp_umem.h
>> create mode 100644 net/xdp/xdp_umem_props.h
>> create mode 100644 net/xdp/xsk.c
>> create mode 100644 net/xdp/xsk_queue.c
>> create mode 100644 net/xdp/xsk_queue.h
>> create mode 100644 samples/bpf/xdpsock.h
>> create mode 100644 samples/bpf/xdpsock_kern.c
>> create mode 100644 samples/bpf/xdpsock_user.c
>>
>
^ permalink raw reply
* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
From: Thomas Gleixner @ 2018-04-24 8:50 UTC (permalink / raw)
To: Jesus Sanchez-Palencia
Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
intel-wired-lan, anna-maria, henrik, john.stultz, levi.pearson,
edumazet, willemb, mlichvar
In-Reply-To: <768e8da5-502e-d36f-0f32-9324eaca4a1d@intel.com>
On Mon, 23 Apr 2018, Jesus Sanchez-Palencia wrote:
> On 03/21/2018 06:46 AM, Thomas Gleixner wrote:
> > On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
> >> +struct tbs_sched_data {
> >> + bool sorting;
> >> + int clockid;
> >> + int queue;
> >> + s32 delta; /* in ns */
> >> + ktime_t last; /* The txtime of the last skb sent to the netdevice. */
> >> + struct rb_root head;
> >
> > Hmm. You are reimplementing timerqueue open coded. Have you checked whether
> > you could reuse the timerqueue implementation?
> >
> > That requires to add a timerqueue node to struct skbuff
> >
> > @@ -671,7 +671,8 @@ struct sk_buff {
> > unsigned long dev_scratch;
> > };
> > };
> > - struct rb_node rbnode; /* used in netem & tcp stack */
> > + struct rb_node rbnode; /* used in netem & tcp stack */
> > + struct timerqueue_node tqnode;
> > };
> > struct sock *sk;
> >
> > Then you can use timerqueue_head in your scheduler data and all the open
> > coded rbtree handling goes away.
>
>
> I just noticed that doing the above increases the size of struct sk_buff by 8
> bytes - struct timerqueue_node is 32bytes long while struct rb_node is only
> 24bytes long.
>
> Given the feedback we got here before against touching struct sk_buff at all for
> non-generic use cases, I will keep the implementation of sch_tbs.c as is, thus
> keeping the open-coded version for now, ok?
The size of sk_buff is 216 and the size of sk_buff_fclones is 440
bytes. The sk_buff and sk_buff_fclones kmem_caches use objects sized 256
and 512 bytes because the kmem_caches are created with SLAB_HWCACHE_ALIGN.
So adding 8 bytes to spare duplicated code will not change the kmem_cache
object size and I really doubt that anyone will notice.
Thanks,
tglx
^ permalink raw reply
* Re: [PATCH 03/12] netfilter: ebtables: don't attempt to allocate 0-sized compat array
From: Sergei Shtylyov @ 2018-04-24 8:55 UTC (permalink / raw)
To: Pablo Neira Ayuso, netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20180423175714.9794-4-pablo@netfilter.org>
Hello!
On 4/23/2018 8:57 PM, Pablo Neira Ayuso wrote:
> From: Florian Westphal <fw@strlen.de>
>
> Dmitry reports 32bit ebtables on 64bit kernel got broken by
> a recent change that returns -EINVAL when ruleset has no entries.
>
> ebtables however only counts user-defined chains, so for the
> initial table nentries will be 0.
>
> Don't try to allocate the compat array in this case, as no user
As if, perhaps?
> defined rules exist no rule will need 64bit translation.
>
> Reported-by: Dmitry Vyukov <dvyukov@google.com>
> Fixes: 7d7d7e02111e9 ("netfilter: compat: reject huge allocation requests")
> Signed-off-by: Florian Westphal <fw@strlen.de>
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> ---
> net/bridge/netfilter/ebtables.c | 11 ++++++-----
> 1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
> index 032e0fe45940..28a4c3490359 100644
> --- a/net/bridge/netfilter/ebtables.c
> +++ b/net/bridge/netfilter/ebtables.c
> @@ -1825,13 +1825,14 @@ static int compat_table_info(const struct ebt_table_info *info,
> {
> unsigned int size = info->entries_size;
> const void *entries = info->entries;
> - int ret;
>
> newinfo->entries_size = size;
> -
> - ret = xt_compat_init_offsets(NFPROTO_BRIDGE, info->nentries);
> - if (ret)
> - return ret;
> + if (info->nentries) {
> + int ret = xt_compat_init_offsets(NFPROTO_BRIDGE,
> + info->nentries);
Need an empty line here...
> + if (ret)
> + return ret;
> + }
>
> return EBT_ENTRY_ITERATE(entries, size, compat_calc_entry, info,
> entries, newinfo);
MBR, Sergei
^ permalink raw reply
* Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
From: Jason Wang @ 2018-04-24 9:10 UTC (permalink / raw)
To: Magnus Karlsson
Cc: Björn Töpel, Karlsson, Magnus, Alexander Duyck,
Alexander Duyck, John Fastabend, Alexei Starovoitov,
Jesper Dangaard Brouer, Willem de Bruijn, Daniel Borkmann,
Michael S. Tsirkin, Network Development, Björn Töpel,
michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
Zhang, Qi Z
In-Reply-To: <CAJ8uoz3WEfowgwXXdG3LYbNmJ3Y1CW8nkc=7pvzLvNdfWSCAsA@mail.gmail.com>
On 2018年04月24日 16:44, Magnus Karlsson wrote:
>>> We have run some benchmarks on a dual socket system with two Broadwell
>>> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
>>> cores which gives a total of 28, but only two cores are used in these
>>> experiments. One for TR/RX and one for the user space application. The
>>> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
>>> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
>>> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
>>> Intel I40E 40Gbit/s using the i40e driver.
>>>
>>> Below are the results in Mpps of the I40E NIC benchmark runs for 64
>>> and 1500 byte packets, generated by commercial packet generator HW that is
>>> generating packets at full 40 Gbit/s line rate.
>>>
>>> AF_XDP performance 64 byte packets. Results from RFC V2 in parenthesis.
>>> Benchmark XDP_SKB XDP_DRV
>>> rxdrop 2.9(3.0) 9.4(9.3)
>>> txpush 2.5(2.2) NA*
>>> l2fwd 1.9(1.7) 2.4(2.4) (TX using XDP_SKB in both cases)
>> This number looks not very exciting. I can get ~3Mpps when using testpmd in
>> a guest with xdp_redirect.sh on host between ixgbe and TAP/vhost. I believe
>> we can even better performance without virt. It would be interesting to
>> compare this performance with e.g testpmd + virito_user(vhost_kernel) + XDP.
> Note that all the XDP_SKB numbers plus the TX part of XDP_DRV for l2fwd
> uses SKBs and the generic XDP path in the kernel. I am not surprised those
> numbers are lower than what you are seeing with XDP_DRV support.
> (If that is what you are running? Unsure about your setup).
Yes, I'm using haswell E5-2630 v3 @ 2.40GHz and ixgbe.
> The
> 9.4 Mpps for RX is what you get with the XDP_DRV support and copies
> out to user space. Or is it this number you think is low?
No rxdrop looks ok. I mean for l2fwd only.
> Zerocopy will be added
> in later patch sets.
>
> With that said, both XDP_SKB and XDP_DRV can be optimized. We
> have not spent that much time on optimizations at this point.
>
Yes, and it is interesting to compare the performance numbers between
AF_XDP and TAP XDP + vhost_net since their functions are almost equivalent.
Thanks
^ permalink raw reply
* Re: [PATCH] vhost_net: use packet weight for rx handler, too
From: Jason Wang @ 2018-04-24 9:11 UTC (permalink / raw)
To: Paolo Abeni, kvm; +Cc: netdev, virtualization, haibinzhang, Michael S. Tsirkin
In-Reply-To: <11f2a27cee0c660a611af381ac1b68d9526095e3.1524556673.git.pabeni@redhat.com>
On 2018年04月24日 16:34, Paolo Abeni wrote:
> Similar to commit a2ac99905f1e ("vhost-net: set packet weight of
> tx polling to 2 * vq size"), we need a packet-based limit for
> handler_rx, too - elsewhere, under rx flood with small packets,
> tx can be delayed for a very long time, even without busypolling.
>
> The pkt limit applied to handle_rx must be the same applied by
> handle_tx, or we will get unfair scheduling between rx and tx.
> Tying such limit to the queue length makes it less effective for
> large queue length values and can introduce large process
> scheduler latencies, so a constant valued is used - likewise
> the existing bytes limit.
>
> The selected limit has been validated with PVP[1] performance
> test with different queue sizes:
>
> queue size 256 512 1024
>
> baseline 366 354 362
> weight 128 715 723 670
> weight 256 740 745 733
> weight 512 600 460 583
> weight 1024 423 427 418
>
> A packet weight of 256 gives peek performances in under all the
> tested scenarios.
>
> No measurable regression in unidirectional performance tests has
> been detected.
>
> [1] https://developers.redhat.com/blog/2017/06/05/measuring-and-comparing-open-vswitch-performance/
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> drivers/vhost/net.c | 12 ++++++++----
> 1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index bbf38befefb2..c4b49fca4871 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -46,8 +46,10 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
> #define VHOST_NET_WEIGHT 0x80000
>
> /* Max number of packets transferred before requeueing the job.
> - * Using this limit prevents one virtqueue from starving rx. */
> -#define VHOST_NET_PKT_WEIGHT(vq) ((vq)->num * 2)
> + * Using this limit prevents one virtqueue from starving others with small
> + * pkts.
> + */
> +#define VHOST_NET_PKT_WEIGHT 256
>
> /* MAX number of TX used buffers for outstanding zerocopy */
> #define VHOST_MAX_PEND 128
> @@ -587,7 +589,7 @@ static void handle_tx(struct vhost_net *net)
> vhost_zerocopy_signal_used(net, vq);
> vhost_net_tx_packet(net);
> if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
> - unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT(vq))) {
> + unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT)) {
> vhost_poll_queue(&vq->poll);
> break;
> }
> @@ -769,6 +771,7 @@ static void handle_rx(struct vhost_net *net)
> struct socket *sock;
> struct iov_iter fixup;
> __virtio16 num_buffers;
> + int recv_pkts = 0;
>
> mutex_lock_nested(&vq->mutex, 0);
> sock = vq->private_data;
> @@ -872,7 +875,8 @@ static void handle_rx(struct vhost_net *net)
> if (unlikely(vq_log))
> vhost_log_write(vq, vq_log, log, vhost_len);
> total_len += vhost_len;
> - if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> + if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
> + unlikely(++recv_pkts >= VHOST_NET_PKT_WEIGHT)) {
> vhost_poll_queue(&vq->poll);
> goto out;
> }
The numbers looks impressive.
Acked-by: Jason Wang <jasowang@redhat.com>
Thanks!
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
From: Magnus Karlsson @ 2018-04-24 9:14 UTC (permalink / raw)
To: Jason Wang
Cc: Björn Töpel, Karlsson, Magnus, Alexander Duyck,
Alexander Duyck, John Fastabend, Alexei Starovoitov,
Jesper Dangaard Brouer, Willem de Bruijn, Daniel Borkmann,
Michael S. Tsirkin, Network Development, Björn Töpel,
michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
Zhang, Qi Z
In-Reply-To: <60f41443-3c8f-4570-6ebc-176444b7b9e9@redhat.com>
On Tue, Apr 24, 2018 at 11:10 AM, Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2018年04月24日 16:44, Magnus Karlsson wrote:
>>>>
>>>> We have run some benchmarks on a dual socket system with two Broadwell
>>>> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
>>>> cores which gives a total of 28, but only two cores are used in these
>>>> experiments. One for TR/RX and one for the user space application. The
>>>> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
>>>> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
>>>> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
>>>> Intel I40E 40Gbit/s using the i40e driver.
>>>>
>>>> Below are the results in Mpps of the I40E NIC benchmark runs for 64
>>>> and 1500 byte packets, generated by commercial packet generator HW that
>>>> is
>>>> generating packets at full 40 Gbit/s line rate.
>>>>
>>>> AF_XDP performance 64 byte packets. Results from RFC V2 in parenthesis.
>>>> Benchmark XDP_SKB XDP_DRV
>>>> rxdrop 2.9(3.0) 9.4(9.3)
>>>> txpush 2.5(2.2) NA*
>>>> l2fwd 1.9(1.7) 2.4(2.4) (TX using XDP_SKB in both cases)
>>>
>>> This number looks not very exciting. I can get ~3Mpps when using testpmd
>>> in
>>> a guest with xdp_redirect.sh on host between ixgbe and TAP/vhost. I
>>> believe
>>> we can even better performance without virt. It would be interesting to
>>> compare this performance with e.g testpmd + virito_user(vhost_kernel) +
>>> XDP.
>>
>> Note that all the XDP_SKB numbers plus the TX part of XDP_DRV for l2fwd
>> uses SKBs and the generic XDP path in the kernel. I am not surprised those
>> numbers are lower than what you are seeing with XDP_DRV support.
>> (If that is what you are running? Unsure about your setup).
>
>
> Yes, I'm using haswell E5-2630 v3 @ 2.40GHz and ixgbe.
>
>> The
>> 9.4 Mpps for RX is what you get with the XDP_DRV support and copies
>> out to user space. Or is it this number you think is low?
>
>
> No rxdrop looks ok. I mean for l2fwd only.
OK, sounds good. l2fwd will get much better once we add XDP_DRV support for TX.
Thanks: Magnus
>> Zerocopy will be added
>> in later patch sets.
>>
>> With that said, both XDP_SKB and XDP_DRV can be optimized. We
>> have not spent that much time on optimizations at this point.
>>
>
> Yes, and it is interesting to compare the performance numbers between AF_XDP
> and TAP XDP + vhost_net since their functions are almost equivalent.
>
> Thanks
^ permalink raw reply
* Re: [RFC] ethtool: Support for driver private ioctl's
From: Jose Abreu @ 2018-04-24 9:37 UTC (permalink / raw)
To: Florian Fainelli, Jose Abreu, David Miller, Jakub Jelinek,
Jeff Garzik, Tim Hockin, Eli Kupermann, Chris Leech,
Scott Feldman, Ben Hutchings
Cc: netdev, Joao Pinto
In-Reply-To: <80a8c4b7-2f2b-4b64-3065-4384b1b1e6e2@gmail.com>
Hi Florian,
On 07-04-2018 20:58, Florian Fainelli wrote:
>
> On 04/06/2018 06:51 AM, Jose Abreu wrote:
>> Hi Florian,
>>
>> On 05-04-2018 16:50, Florian Fainelli wrote:
>>> On 04/05/2018 03:47 AM, Jose Abreu wrote:
>>>> Hi All,
>>>>
>>>> I would like to know your opinion regarding adding support for
>>>> driver private ioctl's in ethtool.
>>>>
>>>> Background: Synopsys Ethernet IP's have a certain number of
>>>> features which can be reconfigured at runtime. Giving you two
>>>> examples: One of the most recent one is the safety features,
>>>> which can be enabled/disabled and forced at runtime. Another one
>>>> is a Flexible RX Parser which can route specific packets to
>>>> specific RX DMA channels. Given that these are features specific
>>>> to our IP's it would not be useful to add an uniform API for this
>>>> because the users would only be one or two drivers ...
>>> Parsing of packets and directing the matched packets to specific
>>> queues/channels can be done through ethtool rxnfc API, tc/cls_flower as
>>> well, so you should really check whether those APIs don't already allow
>>> you to do what you want.
>> Hmm, but in our case this is directly done by HW, we just have to
>> program a kind of a table which will route automatically the
>> packets. Does this API support this?
> I was sort of expecting you to look at the ethtool rxnfc API to see if
> it is suitable given your hardware, but if this is indeed a table
> programming, then yes, this is what it is designed for. You might want
> to consider using the newer, albeit more complex tc/cls_flower if that
> works for your use case.
>
I took a quick look at rxrnfc API and it doesn't seem to match
entirely my requirements.
The feature I want to introduce is called Flexible RX Parser and
will let me route specific packets to specific DMA channels
number. This is different from rxrnfc API because, and I far as I
understand, the API was designed to add rules to packet types
whilst in my case I can add a rule to *any* of the packet content
(within the first 256 bytes of packet, at max). So technically I
can route packets based on destination/ source mac address,
packet type, lenght, protocol, source/destination IP , ....
So, I guess cls_flower it will be ... I'm slowing looking at some
code and docs but it would be great if you could pin point me to
some HW that has similar behavior ?
Thanks and Best Regards,
Jose Miguel Abreu
^ permalink raw reply
* [net-next regression] kselftest failure in fib_nl_newrule()
From: Anders Roxell @ 2018-04-24 9:46 UTC (permalink / raw)
To: roopa, davem, dsa; +Cc: netdev, Linux Kernel Mailing List
Hi,
fib-onlink-tests.sh (from kselftest) found a regression between
next-20180424 [1] (worked with tag next-20180423 [2])
here is tree commits that look suspicious specially this patch (sha:
f9d4b0c1e969)
rewrites fib_nl_newrule().his patch (sha: f9d4b0c1e969) rewrites
fib_nl_newrule().
b16fb418b1bf ("net: fib_rules: add extack support")
f9d4b0c1e969 ("fib_rules: move common handling of newrule delrule msgs
into fib_nl2rule")
8a14e46f1402 ("net/ipv6: Fix missing rcu dereferences on from")
Cheers,
Anders
[1] https://lkft.validation.linaro.org/scheduler/job/195181#L3447
[2] https://lkft.validation.linaro.org/scheduler/job/193410#L3438
^ permalink raw reply
* Re: [net-next 2/2] ipv6: sr: Compute flowlabel of outer IPv6 header for seg6 encap mode
From: kbuild test robot @ 2018-04-24 10:10 UTC (permalink / raw)
To: Ahmed Abdelsalam
Cc: kbuild-all, davem, dav.lebrun, kuznet, yoshfuji, netdev,
linux-kernel, Ahmed Abdelsalam
In-Reply-To: <1524519420-1612-2-git-send-email-amsalam20@gmail.com>
Hi Ahmed,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on net-next/master]
url: https://github.com/0day-ci/linux/commits/Ahmed-Abdelsalam/ipv6-sr-add-a-per-namespace-sysctl-to-control-seg6-flowlabel/20180424-142142
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__
sparse warnings: (new ones prefixed by >>)
>> net/ipv6/seg6_iptunnel.c:95:8: sparse: symbol 'seg6_make_flowlabel' was not declared. Should it be static?
Please review and possibly fold the followup patch.
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
^ permalink raw reply
* [RFC PATCH] ipv6: sr: seg6_make_flowlabel() can be static
From: kbuild test robot @ 2018-04-24 10:10 UTC (permalink / raw)
To: Ahmed Abdelsalam
Cc: kbuild-all, davem, dav.lebrun, kuznet, yoshfuji, netdev,
linux-kernel, Ahmed Abdelsalam
In-Reply-To: <1524519420-1612-2-git-send-email-amsalam20@gmail.com>
Fixes: 484c5b41c685 ("ipv6: sr: Compute flowlabel of outer IPv6 header for seg6 encap mode")
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
seg6_iptunnel.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c
index bf76c59..d2ff788 100644
--- a/net/ipv6/seg6_iptunnel.c
+++ b/net/ipv6/seg6_iptunnel.c
@@ -92,8 +92,8 @@ static void set_tun_src(struct net *net, struct net_device *dev,
}
/* Compute flowlabel for outer IPv6 header */
-__be32 seg6_make_flowlabel(struct net *net, struct sk_buff *skb,
- struct ipv6hdr *inner_hdr)
+static __be32 seg6_make_flowlabel(struct net *net, struct sk_buff *skb,
+ struct ipv6hdr *inner_hdr)
{
int do_flowlabel = net->ipv6.sysctl.seg6_flowlabel;
__be32 flowlabel = 0;
^ permalink raw reply related
* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Mikulas Patocka @ 2018-04-24 11:04 UTC (permalink / raw)
To: David Rientjes
Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
Michal Hocko, linux-mm, edumazet, Andrew Morton, virtualization,
David Miller, Vlastimil Babka
In-Reply-To: <alpine.DEB.2.21.1804231945410.58980@chino.kir.corp.google.com>
On Mon, 23 Apr 2018, David Rientjes wrote:
> On Mon, 23 Apr 2018, Mikulas Patocka wrote:
>
> > The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> > kmalloc fails.
> >
> > Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> > uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> > found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> > code.
> >
> > These bugs are hard to reproduce because kvmalloc falls back to vmalloc
> > only if memory is fragmented.
> >
> > In order to detect these bugs reliably I submit this patch that changes
> > kvmalloc to fall back to vmalloc with 1/2 probability if CONFIG_DEBUG_SG
> > is turned on. CONFIG_DEBUG_SG is used, because it makes the DMA API layer
> > verify the addresses passed to it, and so the user will get a reliable
> > stacktrace.
>
> Why not just do it unconditionally? Sounds better than "50% of the time
> this will catch bugs".
Because kmalloc (with slub_debug) detects buffer overflows better than
vmalloc. vmalloc detects buffer overflows only at a page boundary. This is
intended for debugging kernels and debugging kernels should detect as many
bugs as possible.
Mikulas
^ permalink raw reply
* Re: [PATCH 1/1] Revert "rds: ib: add error handle"
From: Håkon Bugge @ 2018-04-24 11:10 UTC (permalink / raw)
To: Santosh Shilimkar
Cc: Zhu Yanjun, OFED mailing list, rds-devel, davem, netdev,
Dag Moxnes
In-Reply-To: <4e17b3f0-94b4-6038-419d-68657027f331@oracle.com>
> On 24 Apr 2018, at 05:27, santosh.shilimkar@oracle.com wrote:
>
> On 4/23/18 6:39 PM, Zhu Yanjun wrote:
>> This reverts commit 3b12f73a5c2977153f28a224392fd4729b50d1dc.
>> After long time discussion and investigations, it seems that there
>> is no mem leak. So this patch is reverted.
>> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
>> ---
> Well your fix was not for any leaks but just proper labels for
> graceful exits. I don't know which long time discussion
> you are referring but there is no need to revert this change
> unless you see any issue with your change.
As spotted by Dax Moxnes, the patch misses to call rds_ib_dev_put() when exiting normally, only on err exit.
Thxs, Håkon
>
> Regards,
> Santosh
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* RE: fix for bnx2x panic during ethtool reporting
From: Kalluru, Sudarsana @ 2018-04-24 11:23 UTC (permalink / raw)
To: Florian Fainelli, Sebastian Kuzminsky,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
Elior, Ariel, Dept-Eng Everest Linux L2
In-Reply-To: <7747689b-6774-ca84-e0d5-ffd9b0d59e1b@gmail.com>
-----Original Message-----
From: Florian Fainelli [mailto:f.fainelli@gmail.com]
Sent: 18 April 2018 05:12
To: Sebastian Kuzminsky <seb.kuzminsky@gmail.com>; linux-kernel@vger.kernel.org; netdev@vger.kernel.org; Elior, Ariel <Ariel.Elior@cavium.com>; Dept-Eng Everest Linux L2 <Dept-EngEverestLinuxL2@cavium.com>
Subject: Re: fix for bnx2x panic during ethtool reporting
+netdev, Ariel,
On 04/17/2018 10:21 AM, Sebastian Kuzminsky wrote:
> "ethtool -i" on a bnx2x interface causes kernel panic when the
> firmware version is longer than expected. The attached patch fixes
> the problem by simplifying the string handling in bnx2x_fill_fw_str().
> It applies cleanly to 4.14 and 4.17-rc1.
If you want to have a chance of getting your patch included, your should make sure you copy the driver maintainers and the network mailinglist, doing that.
--
Florian
Acked-by: Sudarsana Kalluru <Sudarsana.Kalluru@cavium.com>
^ permalink raw reply
* Re: [PATCH 1/1] Revert "rds: ib: add error handle"
From: Dag Moxnes @ 2018-04-24 11:25 UTC (permalink / raw)
To: Håkon Bugge, Santosh Shilimkar
Cc: Zhu Yanjun, OFED mailing list, rds-devel, davem, netdev
In-Reply-To: <66130D36-2987-416A-AF31-2BB9001E0036@oracle.com>
[-- Attachment #1: Type: text/plain, Size: 1154 bytes --]
I was going to suggest the following correction:
If all agree that this is the correct way of doing it, I can go ahead
and an post it.
Regards,
-Dag
On 04/24/2018 01:10 PM, Håkon Bugge wrote:
>
>> On 24 Apr 2018, at 05:27, santosh.shilimkar@oracle.com wrote:
>>
>> On 4/23/18 6:39 PM, Zhu Yanjun wrote:
>>> This reverts commit 3b12f73a5c2977153f28a224392fd4729b50d1dc.
>>> After long time discussion and investigations, it seems that there
>>> is no mem leak. So this patch is reverted.
>>> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
>>> ---
>> Well your fix was not for any leaks but just proper labels for
>> graceful exits. I don't know which long time discussion
>> you are referring but there is no need to revert this change
>> unless you see any issue with your change.
> As spotted by Dax Moxnes, the patch misses to call rds_ib_dev_put() when exiting normally, only on err exit.
>
>
> Thxs, Håkon
>
>> Regards,
>> Santosh
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
[-- Attachment #2: 0001-rds-ib-Fix-missing-call-to-rds_ib_dev_put-in-rds_ib_.patch --]
[-- Type: text/x-patch, Size: 1185 bytes --]
>From 9d8e7c568485b1b10ccdfa02305c6fa284f42d8f Mon Sep 17 00:00:00 2001
From: Dag Moxnes <dag.moxnes@oracle.com>
Date: Tue, 24 Apr 2018 13:14:48 +0200
Subject: [PATCH] rds: ib: Fix missing call to rds_ib_dev_put in rds_ib_setup_qp
The function rds_ib_setup_qp is calling rds_ib_get_client_data and
should correspondingly call rds_ib_dev_put. This call was lost in
the non-error path with the introduction of error handling done in
commit 3b12f73a5c2977153f28a224392fd4729b50d1dc.
Signed-off-by: Dag Moxnes <dag.moxnes@oracle.com>
---
net/rds/ib_cm.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index eea1d86..3193249 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -547,7 +547,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
rdsdebug("conn %p pd %p cq %p %p\n", conn, ic->i_pd,
ic->i_send_cq, ic->i_recv_cq);
- return ret;
+ goto out:
sends_out:
vfree(ic->i_sends);
@@ -572,6 +572,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
ic->i_send_cq = NULL;
rds_ibdev_out:
rds_ib_remove_conn(rds_ibdev, conn);
+out:
rds_ib_dev_put(rds_ibdev);
return ret;
--
1.7.1
^ permalink raw reply related
* [PATCH net-next V3 0/3] Introduce adaptive TX interrupt moderation to net DIM
From: Tal Gilboa @ 2018-04-24 10:36 UTC (permalink / raw)
To: David Miller
Cc: netdev, Tariq Toukan, Tal Gilboa, Saeed Mahameed,
Florian Fainelli, Andy Gospodarek
Net DIM is a library designed for dynamic interrupt moderation. It was
implemented and optimized with receive side interrupts in mind, since these
are usually the CPU expensive ones. This patch-set introduces adaptive transmit
interrupt moderation to net DIM, complete with a usage in the mlx5e driver.
Using adaptive TX behavior would reduce interrupt rate for multiple scenarios.
Furthermore, it is essential for increasing bandwidth on cases where payload
aggregation is required.
v3: Remove "inline" from functions in .c files (requested by DaveM). Revert
adding "enabled" field from struct net_dim and applied mlx5e structural
suggestions (suggested by SaeedM).
v2: Rebase over proper tree.
v1: Fix compilation issues due to missed function renaming.
Tal Gilboa (3):
net/dim: Rename *_get_profile() functions to *_get_rx_moderation()
net/dim: Support adaptive TX moderation
net/mlx5e: Enable adaptive-TX moderation
drivers/net/ethernet/broadcom/bcmsysport.c | 6 +-
drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c | 8 +--
drivers/net/ethernet/broadcom/genet/bcmgenet.c | 6 +-
drivers/net/ethernet/mellanox/mlx5/core/en.h | 4 ++
drivers/net/ethernet/mellanox/mlx5/core/en_dim.c | 28 ++++++--
.../net/ethernet/mellanox/mlx5/core/en_ethtool.c | 35 +++++++---
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 79 ++++++++++++++--------
drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c | 37 +++++++---
include/linux/net_dim.h | 69 ++++++++++++++-----
9 files changed, 190 insertions(+), 82 deletions(-)
--
1.8.3.1
^ permalink raw reply
* [PATCH net-next V3 1/3] net/dim: Rename *_get_profile() functions to *_get_rx_moderation()
From: Tal Gilboa @ 2018-04-24 10:36 UTC (permalink / raw)
To: David Miller
Cc: netdev, Tariq Toukan, Tal Gilboa, Saeed Mahameed,
Florian Fainelli, Andy Gospodarek
In-Reply-To: <1524566163-41563-1-git-send-email-talgi@mellanox.com>
Preparation for introducing adaptive TX to net DIM.
Signed-off-by: Tal Gilboa <talgi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
---
drivers/net/ethernet/broadcom/bcmsysport.c | 6 +++---
drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c | 8 ++++----
drivers/net/ethernet/broadcom/genet/bcmgenet.c | 6 +++---
drivers/net/ethernet/mellanox/mlx5/core/en_dim.c | 6 +++---
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 6 ++++--
include/linux/net_dim.h | 12 ++++++------
6 files changed, 23 insertions(+), 21 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c
index f9a3c1a..effc651 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -654,7 +654,7 @@ static int bcm_sysport_set_coalesce(struct net_device *dev,
pkts = priv->rx_max_coalesced_frames;
if (ec->use_adaptive_rx_coalesce && !priv->dim.use_dim) {
- moder = net_dim_get_def_profile(priv->dim.dim.mode);
+ moder = net_dim_get_def_rx_moderation(priv->dim.dim.mode);
usecs = moder.usec;
pkts = moder.pkts;
}
@@ -1064,7 +1064,7 @@ static void bcm_sysport_dim_work(struct work_struct *work)
struct bcm_sysport_priv *priv =
container_of(ndim, struct bcm_sysport_priv, dim);
struct net_dim_cq_moder cur_profile =
- net_dim_get_profile(dim->mode, dim->profile_ix);
+ net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
bcm_sysport_set_rx_coalesce(priv, cur_profile.usec, cur_profile.pkts);
dim->state = NET_DIM_START_MEASURE;
@@ -1437,7 +1437,7 @@ static void bcm_sysport_init_rx_coalesce(struct bcm_sysport_priv *priv)
/* If DIM was enabled, re-apply default parameters */
if (dim->use_dim) {
- moder = net_dim_get_def_profile(dim->dim.mode);
+ moder = net_dim_get_def_rx_moderation(dim->dim.mode);
usecs = moder.usec;
pkts = moder.pkts;
}
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c
index 408dd19..afa97c8 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c
@@ -21,11 +21,11 @@ void bnxt_dim_work(struct work_struct *work)
struct bnxt_napi *bnapi = container_of(cpr,
struct bnxt_napi,
cp_ring);
- struct net_dim_cq_moder cur_profile = net_dim_get_profile(dim->mode,
- dim->profile_ix);
+ struct net_dim_cq_moder cur_moder =
+ net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
- cpr->rx_ring_coal.coal_ticks = cur_profile.usec;
- cpr->rx_ring_coal.coal_bufs = cur_profile.pkts;
+ cpr->rx_ring_coal.coal_ticks = cur_moder.usec;
+ cpr->rx_ring_coal.coal_bufs = cur_moder.pkts;
bnxt_hwrm_set_ring_coal(bnapi->bp, bnapi);
dim->state = NET_DIM_START_MEASURE;
diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index 0445f2c..20c1681 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -652,7 +652,7 @@ static void bcmgenet_set_ring_rx_coalesce(struct bcmgenet_rx_ring *ring,
pkts = ring->rx_max_coalesced_frames;
if (ec->use_adaptive_rx_coalesce && !ring->dim.use_dim) {
- moder = net_dim_get_def_profile(ring->dim.dim.mode);
+ moder = net_dim_get_def_rx_moderation(ring->dim.dim.mode);
usecs = moder.usec;
pkts = moder.pkts;
}
@@ -1925,7 +1925,7 @@ static void bcmgenet_dim_work(struct work_struct *work)
struct bcmgenet_rx_ring *ring =
container_of(ndim, struct bcmgenet_rx_ring, dim);
struct net_dim_cq_moder cur_profile =
- net_dim_get_profile(dim->mode, dim->profile_ix);
+ net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
bcmgenet_set_rx_coalesce(ring, cur_profile.usec, cur_profile.pkts);
dim->state = NET_DIM_START_MEASURE;
@@ -2102,7 +2102,7 @@ static void bcmgenet_init_rx_coalesce(struct bcmgenet_rx_ring *ring)
/* If DIM was enabled, re-apply default parameters */
if (dim->use_dim) {
- moder = net_dim_get_def_profile(dim->dim.mode);
+ moder = net_dim_get_def_rx_moderation(dim->dim.mode);
usecs = moder.usec;
pkts = moder.pkts;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
index 602851a..1b286e1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
@@ -38,11 +38,11 @@ void mlx5e_rx_dim_work(struct work_struct *work)
struct net_dim *dim = container_of(work, struct net_dim,
work);
struct mlx5e_rq *rq = container_of(dim, struct mlx5e_rq, dim);
- struct net_dim_cq_moder cur_profile = net_dim_get_profile(dim->mode,
- dim->profile_ix);
+ struct net_dim_cq_moder cur_moder =
+ net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
mlx5_core_modify_cq_moderation(rq->mdev, &rq->cq.mcq,
- cur_profile.usec, cur_profile.pkts);
+ cur_moder.usec, cur_moder.pkts);
dim->state = NET_DIM_START_MEASURE;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index f100374..a69cc1c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4119,12 +4119,14 @@ void mlx5e_set_rx_cq_mode_params(struct mlx5e_params *params, u8 cq_period_mode)
switch (cq_period_mode) {
case MLX5_CQ_PERIOD_MODE_START_FROM_CQE:
params->rx_cq_moderation =
- net_dim_get_def_profile(NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE);
+ net_dim_get_def_rx_moderation(
+ NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE);
break;
case MLX5_CQ_PERIOD_MODE_START_FROM_EQE:
default:
params->rx_cq_moderation =
- net_dim_get_def_profile(NET_DIM_CQ_PERIOD_MODE_START_FROM_EQE);
+ net_dim_get_def_rx_moderation(
+ NET_DIM_CQ_PERIOD_MODE_START_FROM_EQE);
}
}
diff --git a/include/linux/net_dim.h b/include/linux/net_dim.h
index 29ed8fd..7ca3c4d 100644
--- a/include/linux/net_dim.h
+++ b/include/linux/net_dim.h
@@ -129,17 +129,17 @@ enum {
NET_DIM_CQE_PROFILES,
};
-static inline struct net_dim_cq_moder net_dim_get_profile(u8 cq_period_mode,
- int ix)
+static inline struct net_dim_cq_moder
+net_dim_get_rx_moderation(u8 cq_period_mode, int ix)
{
- struct net_dim_cq_moder cq_moder;
+ struct net_dim_cq_moder cq_moder = profile[cq_period_mode][ix];
- cq_moder = profile[cq_period_mode][ix];
cq_moder.cq_period_mode = cq_period_mode;
return cq_moder;
}
-static inline struct net_dim_cq_moder net_dim_get_def_profile(u8 rx_cq_period_mode)
+static inline struct net_dim_cq_moder
+net_dim_get_def_rx_moderation(u8 rx_cq_period_mode)
{
int default_profile_ix;
@@ -148,7 +148,7 @@ static inline struct net_dim_cq_moder net_dim_get_def_profile(u8 rx_cq_period_mo
else /* NET_DIM_CQ_PERIOD_MODE_START_FROM_EQE */
default_profile_ix = NET_DIM_DEF_PROFILE_EQE;
- return net_dim_get_profile(rx_cq_period_mode, default_profile_ix);
+ return net_dim_get_rx_moderation(rx_cq_period_mode, default_profile_ix);
}
static inline bool net_dim_on_top(struct net_dim *dim)
--
1.8.3.1
^ permalink raw reply related
* [PATCH net-next V3 2/3] net/dim: Support adaptive TX moderation
From: Tal Gilboa @ 2018-04-24 10:36 UTC (permalink / raw)
To: David Miller
Cc: netdev, Tariq Toukan, Tal Gilboa, Saeed Mahameed,
Florian Fainelli, Andy Gospodarek
In-Reply-To: <1524566163-41563-1-git-send-email-talgi@mellanox.com>
Interrupt moderation for TX traffic requires different profiles than RX
interrupt moderation. The main goal here is to reduce interrupt rate and
allow better payload aggregation by keeping SKBs in the TX queue a bit
longer. Ping-pong behavior would get a profile with a short timer, so
latency wouldn't increase for these scenarios. There might be a slight
degradation in bandwidth for single stream with large message sizes, since
net.ipv4.tcp_limit_output_bytes is limiting the allowed TX traffic, but
with many streams performance is always improved.
Signed-off-by: Tal Gilboa <talgi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
---
include/linux/net_dim.h | 63 +++++++++++++++++++++++++++++++++++++++----------
1 file changed, 50 insertions(+), 13 deletions(-)
diff --git a/include/linux/net_dim.h b/include/linux/net_dim.h
index 7ca3c4d..db99240 100644
--- a/include/linux/net_dim.h
+++ b/include/linux/net_dim.h
@@ -103,11 +103,12 @@ enum {
#define NET_DIM_PARAMS_NUM_PROFILES 5
/* Adaptive moderation profiles */
#define NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE 256
+#define NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE 128
#define NET_DIM_DEF_PROFILE_CQE 1
#define NET_DIM_DEF_PROFILE_EQE 1
/* All profiles sizes must be NET_PARAMS_DIM_NUM_PROFILES */
-#define NET_DIM_EQE_PROFILES { \
+#define NET_DIM_RX_EQE_PROFILES { \
{1, NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
{8, NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
{64, NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
@@ -115,7 +116,7 @@ enum {
{256, NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
}
-#define NET_DIM_CQE_PROFILES { \
+#define NET_DIM_RX_CQE_PROFILES { \
{2, 256}, \
{8, 128}, \
{16, 64}, \
@@ -123,32 +124,68 @@ enum {
{64, 64} \
}
+#define NET_DIM_TX_EQE_PROFILES { \
+ {1, NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE}, \
+ {8, NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE}, \
+ {32, NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE}, \
+ {64, NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE}, \
+ {128, NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE} \
+}
+
+#define NET_DIM_TX_CQE_PROFILES { \
+ {5, 128}, \
+ {8, 64}, \
+ {16, 32}, \
+ {32, 32}, \
+ {64, 32} \
+}
+
static const struct net_dim_cq_moder
-profile[NET_DIM_CQ_PERIOD_NUM_MODES][NET_DIM_PARAMS_NUM_PROFILES] = {
- NET_DIM_EQE_PROFILES,
- NET_DIM_CQE_PROFILES,
+rx_profile[NET_DIM_CQ_PERIOD_NUM_MODES][NET_DIM_PARAMS_NUM_PROFILES] = {
+ NET_DIM_RX_EQE_PROFILES,
+ NET_DIM_RX_CQE_PROFILES,
+};
+
+static const struct net_dim_cq_moder
+tx_profile[NET_DIM_CQ_PERIOD_NUM_MODES][NET_DIM_PARAMS_NUM_PROFILES] = {
+ NET_DIM_TX_EQE_PROFILES,
+ NET_DIM_TX_CQE_PROFILES,
};
static inline struct net_dim_cq_moder
net_dim_get_rx_moderation(u8 cq_period_mode, int ix)
{
- struct net_dim_cq_moder cq_moder = profile[cq_period_mode][ix];
+ struct net_dim_cq_moder cq_moder = rx_profile[cq_period_mode][ix];
cq_moder.cq_period_mode = cq_period_mode;
return cq_moder;
}
static inline struct net_dim_cq_moder
-net_dim_get_def_rx_moderation(u8 rx_cq_period_mode)
+net_dim_get_def_rx_moderation(u8 cq_period_mode)
+{
+ u8 profile_ix = cq_period_mode == NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE ?
+ NET_DIM_DEF_PROFILE_CQE : NET_DIM_DEF_PROFILE_EQE;
+
+ return net_dim_get_rx_moderation(cq_period_mode, profile_ix);
+}
+
+static inline struct net_dim_cq_moder
+net_dim_get_tx_moderation(u8 cq_period_mode, int ix)
{
- int default_profile_ix;
+ struct net_dim_cq_moder cq_moder = tx_profile[cq_period_mode][ix];
- if (rx_cq_period_mode == NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE)
- default_profile_ix = NET_DIM_DEF_PROFILE_CQE;
- else /* NET_DIM_CQ_PERIOD_MODE_START_FROM_EQE */
- default_profile_ix = NET_DIM_DEF_PROFILE_EQE;
+ cq_moder.cq_period_mode = cq_period_mode;
+ return cq_moder;
+}
+
+static inline struct net_dim_cq_moder
+net_dim_get_def_tx_moderation(u8 cq_period_mode)
+{
+ u8 profile_ix = cq_period_mode == NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE ?
+ NET_DIM_DEF_PROFILE_CQE : NET_DIM_DEF_PROFILE_EQE;
- return net_dim_get_rx_moderation(rx_cq_period_mode, default_profile_ix);
+ return net_dim_get_tx_moderation(cq_period_mode, profile_ix);
}
static inline bool net_dim_on_top(struct net_dim *dim)
--
1.8.3.1
^ permalink raw reply related
* [PATCH net-next V3 3/3] net/mlx5e: Enable adaptive-TX moderation
From: Tal Gilboa @ 2018-04-24 10:36 UTC (permalink / raw)
To: David Miller
Cc: netdev, Tariq Toukan, Tal Gilboa, Saeed Mahameed,
Florian Fainelli, Andy Gospodarek
In-Reply-To: <1524566163-41563-1-git-send-email-talgi@mellanox.com>
Add support for adaptive TX moderation. This greatly reduces TX interrupt
rate and increases bandwidth, mostly for TCP bandwidth over ARM
architecture (below). There is a slight single stream TCP with very large
message sizes degradation (x86). In this case if there's any moderation on
transmitted packets the bandwidth would reduce due to hitting TCP output limit.
Since this is a synthetic case, this is still worth doing.
Performance improvement (ConnectX-4Lx 40GbE, ARM)
TCP 64B bandwidth with 1-50 streams increased 6-35%.
TCP 64B bandwidth with 100-500 streams increased 20-70%.
Performance improvement (ConnectX-5 100GbE, x86)
Bandwidth: increased up to 40% (1024B with 10s of streams).
Interrupt rate: reduced up to 50% (1024B with 1000s of streams).
Performance degradation (ConnectX-5 100GbE, x86)
Bandwidth: up to 10% decrease single stream TCP (1MB message size from
51Gb/s to 47Gb/s).
Signed-off-by: Tal Gilboa <talgi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 4 ++
drivers/net/ethernet/mellanox/mlx5/core/en_dim.c | 24 +++++--
.../net/ethernet/mellanox/mlx5/core/en_ethtool.c | 35 +++++++---
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 81 +++++++++++++---------
drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c | 37 +++++++---
5 files changed, 125 insertions(+), 56 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 3317a4d..56c748c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -241,6 +241,7 @@ struct mlx5e_params {
bool vlan_strip_disable;
bool scatter_fcs_en;
bool rx_dim_enabled;
+ bool tx_dim_enabled;
u32 lro_timeout;
u32 pflags;
struct bpf_prog *xdp_prog;
@@ -330,6 +331,7 @@ enum {
MLX5E_SQ_STATE_ENABLED,
MLX5E_SQ_STATE_RECOVERING,
MLX5E_SQ_STATE_IPSEC,
+ MLX5E_SQ_STATE_AM,
};
struct mlx5e_sq_wqe_info {
@@ -342,6 +344,7 @@ struct mlx5e_txqsq {
/* dirtied @completion */
u16 cc;
u32 dma_fifo_cc;
+ struct net_dim dim; /* Adaptive Moderation */
/* dirtied @xmit */
u16 pc ____cacheline_aligned_in_smp;
@@ -1111,4 +1114,5 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
u16 max_channels, u16 mtu);
u8 mlx5e_params_calculate_tx_min_inline(struct mlx5_core_dev *mdev);
void mlx5e_rx_dim_work(struct work_struct *work);
+void mlx5e_tx_dim_work(struct work_struct *work);
#endif /* __MLX5_EN_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
index 1b286e1..d67adf7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
@@ -33,16 +33,30 @@
#include <linux/net_dim.h>
#include "en.h"
+static void
+mlx5e_complete_dim_work(struct net_dim *dim, struct net_dim_cq_moder moder,
+ struct mlx5_core_dev *mdev, struct mlx5_core_cq *mcq)
+{
+ mlx5_core_modify_cq_moderation(mdev, mcq, moder.usec, moder.pkts);
+ dim->state = NET_DIM_START_MEASURE;
+}
+
void mlx5e_rx_dim_work(struct work_struct *work)
{
- struct net_dim *dim = container_of(work, struct net_dim,
- work);
+ struct net_dim *dim = container_of(work, struct net_dim, work);
struct mlx5e_rq *rq = container_of(dim, struct mlx5e_rq, dim);
struct net_dim_cq_moder cur_moder =
net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
- mlx5_core_modify_cq_moderation(rq->mdev, &rq->cq.mcq,
- cur_moder.usec, cur_moder.pkts);
+ mlx5e_complete_dim_work(dim, cur_moder, rq->mdev, &rq->cq.mcq);
+}
- dim->state = NET_DIM_START_MEASURE;
+void mlx5e_tx_dim_work(struct work_struct *work)
+{
+ struct net_dim *dim = container_of(work, struct net_dim, work);
+ struct mlx5e_txqsq *sq = container_of(dim, struct mlx5e_txqsq, dim);
+ struct net_dim_cq_moder cur_moder =
+ net_dim_get_tx_moderation(dim->mode, dim->profile_ix);
+
+ mlx5e_complete_dim_work(dim, cur_moder, sq->cq.mdev, &sq->cq.mcq);
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 37fd024..2b786c4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -389,14 +389,20 @@ static int mlx5e_set_channels(struct net_device *dev,
int mlx5e_ethtool_get_coalesce(struct mlx5e_priv *priv,
struct ethtool_coalesce *coal)
{
+ struct net_dim_cq_moder *rx_moder, *tx_moder;
+
if (!MLX5_CAP_GEN(priv->mdev, cq_moderation))
return -EOPNOTSUPP;
- coal->rx_coalesce_usecs = priv->channels.params.rx_cq_moderation.usec;
- coal->rx_max_coalesced_frames = priv->channels.params.rx_cq_moderation.pkts;
- coal->tx_coalesce_usecs = priv->channels.params.tx_cq_moderation.usec;
- coal->tx_max_coalesced_frames = priv->channels.params.tx_cq_moderation.pkts;
- coal->use_adaptive_rx_coalesce = priv->channels.params.rx_dim_enabled;
+ rx_moder = &priv->channels.params.rx_cq_moderation;
+ coal->rx_coalesce_usecs = rx_moder->usec;
+ coal->rx_max_coalesced_frames = rx_moder->pkts;
+ coal->use_adaptive_rx_coalesce = priv->channels.params.rx_dim_enabled;
+
+ tx_moder = &priv->channels.params.tx_cq_moderation;
+ coal->tx_coalesce_usecs = tx_moder->usec;
+ coal->tx_max_coalesced_frames = tx_moder->pkts;
+ coal->use_adaptive_tx_coalesce = priv->channels.params.tx_dim_enabled;
return 0;
}
@@ -438,6 +444,7 @@ static int mlx5e_get_coalesce(struct net_device *netdev,
int mlx5e_ethtool_set_coalesce(struct mlx5e_priv *priv,
struct ethtool_coalesce *coal)
{
+ struct net_dim_cq_moder *rx_moder, *tx_moder;
struct mlx5_core_dev *mdev = priv->mdev;
struct mlx5e_channels new_channels = {};
int err = 0;
@@ -463,11 +470,15 @@ int mlx5e_ethtool_set_coalesce(struct mlx5e_priv *priv,
mutex_lock(&priv->state_lock);
new_channels.params = priv->channels.params;
- new_channels.params.tx_cq_moderation.usec = coal->tx_coalesce_usecs;
- new_channels.params.tx_cq_moderation.pkts = coal->tx_max_coalesced_frames;
- new_channels.params.rx_cq_moderation.usec = coal->rx_coalesce_usecs;
- new_channels.params.rx_cq_moderation.pkts = coal->rx_max_coalesced_frames;
- new_channels.params.rx_dim_enabled = !!coal->use_adaptive_rx_coalesce;
+ rx_moder = &new_channels.params.rx_cq_moderation;
+ rx_moder->usec = coal->rx_coalesce_usecs;
+ rx_moder->pkts = coal->rx_max_coalesced_frames;
+ new_channels.params.rx_dim_enabled = !!coal->use_adaptive_rx_coalesce;
+
+ tx_moder = &new_channels.params.tx_cq_moderation;
+ tx_moder->usec = coal->tx_coalesce_usecs;
+ tx_moder->pkts = coal->tx_max_coalesced_frames;
+ new_channels.params.tx_dim_enabled = !!coal->use_adaptive_tx_coalesce;
if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) {
priv->channels.params = new_channels.params;
@@ -475,7 +486,9 @@ int mlx5e_ethtool_set_coalesce(struct mlx5e_priv *priv,
}
/* we are opened */
- reset = !!coal->use_adaptive_rx_coalesce != priv->channels.params.rx_dim_enabled;
+ reset = (!!coal->use_adaptive_rx_coalesce != priv->channels.params.rx_dim_enabled) ||
+ (!!coal->use_adaptive_tx_coalesce != priv->channels.params.tx_dim_enabled);
+
if (!reset) {
mlx5e_set_priv_channels_coalesce(priv, coal);
priv->channels.params = new_channels.params;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index a69cc1c..f1fe490 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1025,6 +1025,9 @@ static int mlx5e_alloc_txqsq(struct mlx5e_channel *c,
if (err)
goto err_sq_wq_destroy;
+ INIT_WORK(&sq->dim.work, mlx5e_tx_dim_work);
+ sq->dim.mode = params->tx_cq_moderation.cq_period_mode;
+
sq->edge = (sq->wq.sz_m1 + 1) - MLX5_SEND_WQE_MAX_WQEBBS;
return 0;
@@ -1188,6 +1191,9 @@ static int mlx5e_open_txqsq(struct mlx5e_channel *c,
if (tx_rate)
mlx5e_set_sq_maxrate(c->netdev, sq, tx_rate);
+ if (params->tx_dim_enabled)
+ sq->state |= BIT(MLX5E_SQ_STATE_AM);
+
return 0;
err_free_txqsq:
@@ -4084,18 +4090,48 @@ static bool slow_pci_heuristic(struct mlx5_core_dev *mdev)
link_speed > MLX5E_SLOW_PCI_RATIO * pci_bw;
}
-void mlx5e_set_tx_cq_mode_params(struct mlx5e_params *params, u8 cq_period_mode)
+static struct net_dim_cq_moder mlx5e_get_def_tx_moderation(u8 cq_period_mode)
{
- params->tx_cq_moderation.cq_period_mode = cq_period_mode;
+ struct net_dim_cq_moder moder;
+
+ moder.cq_period_mode = cq_period_mode;
+ moder.pkts = MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_PKTS;
+ moder.usec = MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_USEC;
+ if (cq_period_mode == MLX5_CQ_PERIOD_MODE_START_FROM_CQE)
+ moder.usec = MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_USEC_FROM_CQE;
+
+ return moder;
+}
- params->tx_cq_moderation.pkts =
- MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_PKTS;
- params->tx_cq_moderation.usec =
- MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_USEC;
+static struct net_dim_cq_moder mlx5e_get_def_rx_moderation(u8 cq_period_mode)
+{
+ struct net_dim_cq_moder moder;
+ moder.cq_period_mode = cq_period_mode;
+ moder.pkts = MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_PKTS;
+ moder.usec = MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_USEC;
if (cq_period_mode == MLX5_CQ_PERIOD_MODE_START_FROM_CQE)
- params->tx_cq_moderation.usec =
- MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_USEC_FROM_CQE;
+ moder.usec = MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_USEC_FROM_CQE;
+
+ return moder;
+}
+
+static u8 mlx5_to_net_dim_cq_period_mode(u8 cq_period_mode)
+{
+ return cq_period_mode == MLX5_CQ_PERIOD_MODE_START_FROM_CQE ?
+ NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE :
+ NET_DIM_CQ_PERIOD_MODE_START_FROM_EQE;
+}
+
+void mlx5e_set_tx_cq_mode_params(struct mlx5e_params *params, u8 cq_period_mode)
+{
+ if (params->tx_dim_enabled) {
+ u8 dim_period_mode = mlx5_to_net_dim_cq_period_mode(cq_period_mode);
+
+ params->tx_cq_moderation = net_dim_get_def_tx_moderation(dim_period_mode);
+ } else {
+ params->tx_cq_moderation = mlx5e_get_def_tx_moderation(cq_period_mode);
+ }
MLX5E_SET_PFLAG(params, MLX5E_PFLAG_TX_CQE_BASED_MODER,
params->tx_cq_moderation.cq_period_mode ==
@@ -4104,30 +4140,12 @@ void mlx5e_set_tx_cq_mode_params(struct mlx5e_params *params, u8 cq_period_mode)
void mlx5e_set_rx_cq_mode_params(struct mlx5e_params *params, u8 cq_period_mode)
{
- params->rx_cq_moderation.cq_period_mode = cq_period_mode;
-
- params->rx_cq_moderation.pkts =
- MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_PKTS;
- params->rx_cq_moderation.usec =
- MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_USEC;
-
- if (cq_period_mode == MLX5_CQ_PERIOD_MODE_START_FROM_CQE)
- params->rx_cq_moderation.usec =
- MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_USEC_FROM_CQE;
-
if (params->rx_dim_enabled) {
- switch (cq_period_mode) {
- case MLX5_CQ_PERIOD_MODE_START_FROM_CQE:
- params->rx_cq_moderation =
- net_dim_get_def_rx_moderation(
- NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE);
- break;
- case MLX5_CQ_PERIOD_MODE_START_FROM_EQE:
- default:
- params->rx_cq_moderation =
- net_dim_get_def_rx_moderation(
- NET_DIM_CQ_PERIOD_MODE_START_FROM_EQE);
- }
+ u8 dim_period_mode = mlx5_to_net_dim_cq_period_mode(cq_period_mode);
+
+ params->rx_cq_moderation = net_dim_get_def_rx_moderation(dim_period_mode);
+ } else {
+ params->rx_cq_moderation = mlx5e_get_def_rx_moderation(cq_period_mode);
}
MLX5E_SET_PFLAG(params, MLX5E_PFLAG_RX_CQE_BASED_MODER,
@@ -4191,6 +4209,7 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
MLX5_CQ_PERIOD_MODE_START_FROM_CQE :
MLX5_CQ_PERIOD_MODE_START_FROM_EQE;
params->rx_dim_enabled = MLX5_CAP_GEN(mdev, cq_moderation);
+ params->tx_dim_enabled = MLX5_CAP_GEN(mdev, cq_moderation);
mlx5e_set_rx_cq_mode_params(params, rx_cq_period_mode);
mlx5e_set_tx_cq_mode_params(params, MLX5_CQ_PERIOD_MODE_START_FROM_EQE);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index f292bb3..900661d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -44,6 +44,30 @@ static inline bool mlx5e_channel_no_affinity_change(struct mlx5e_channel *c)
return cpumask_test_cpu(current_cpu, aff);
}
+static void mlx5e_handle_tx_dim(struct mlx5e_txqsq *sq)
+{
+ struct net_dim_sample dim_sample;
+
+ if (unlikely(!MLX5E_TEST_BIT(sq->state, MLX5E_SQ_STATE_AM)))
+ return;
+
+ net_dim_sample(sq->cq.event_ctr, sq->stats.packets, sq->stats.bytes,
+ &dim_sample);
+ net_dim(&sq->dim, dim_sample);
+}
+
+static void mlx5e_handle_rx_dim(struct mlx5e_rq *rq)
+{
+ struct net_dim_sample dim_sample;
+
+ if (unlikely(!MLX5E_TEST_BIT(rq->state, MLX5E_RQ_STATE_AM)))
+ return;
+
+ net_dim_sample(rq->cq.event_ctr, rq->stats.packets, rq->stats.bytes,
+ &dim_sample);
+ net_dim(&rq->dim, dim_sample);
+}
+
int mlx5e_napi_poll(struct napi_struct *napi, int budget)
{
struct mlx5e_channel *c = container_of(napi, struct mlx5e_channel,
@@ -75,18 +99,13 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
if (unlikely(!napi_complete_done(napi, work_done)))
return work_done;
- for (i = 0; i < c->num_tc; i++)
+ for (i = 0; i < c->num_tc; i++) {
+ mlx5e_handle_tx_dim(&c->sq[i]);
mlx5e_cq_arm(&c->sq[i].cq);
-
- if (MLX5E_TEST_BIT(c->rq.state, MLX5E_RQ_STATE_AM)) {
- struct net_dim_sample dim_sample;
- net_dim_sample(c->rq.cq.event_ctr,
- c->rq.stats.packets,
- c->rq.stats.bytes,
- &dim_sample);
- net_dim(&c->rq.dim, dim_sample);
}
+ mlx5e_handle_rx_dim(&c->rq);
+
mlx5e_cq_arm(&c->rq.cq);
mlx5e_cq_arm(&c->icosq.cq);
--
1.8.3.1
^ permalink raw reply related
* [PATCH] fsl/fman_port: remove redundant check on port->rev_info.major
From: Colin King @ 2018-04-24 11:39 UTC (permalink / raw)
To: Madalin Bucur, netdev; +Cc: kernel-janitors, linux-kernel
From: Colin Ian King <colin.king@canonical.com>
The check port->rev_info.major >= 6 is being performed twice, thus
the inner second check is always true and is redundant, hence it
can be removed. Detected by cppcheck.
drivers/net/ethernet/freescale/fman/fman_port.c:1394]: (warning)
Identical inner 'if' condition is always true.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
drivers/net/ethernet/freescale/fman/fman_port.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/freescale/fman/fman_port.c b/drivers/net/ethernet/freescale/fman/fman_port.c
index 6552d68ea6e1..ce6e24c74978 100644
--- a/drivers/net/ethernet/freescale/fman/fman_port.c
+++ b/drivers/net/ethernet/freescale/fman/fman_port.c
@@ -1391,12 +1391,10 @@ int fman_port_config(struct fman_port *port, struct fman_port_params *params)
/* FM_WRONG_RESET_VALUES_ERRATA_FMAN_A005127 Errata
* workaround
*/
- if (port->rev_info.major >= 6) {
- u32 reg;
+ u32 reg;
- reg = 0x00001013;
- iowrite32be(reg, &port->bmi_regs->tx.fmbm_tfp);
- }
+ reg = 0x00001013;
+ iowrite32be(reg, &port->bmi_regs->tx.fmbm_tfp);
}
return 0;
--
2.17.0
^ permalink raw reply related
* Re: [PATCH net-next] net: init sk_cookie for inet socket
From: Eric Dumazet @ 2018-04-24 11:41 UTC (permalink / raw)
To: Yafang Shao, Eric Dumazet; +Cc: David Miller, Alexei Starovoitov, netdev, LKML
In-Reply-To: <CALOAHbDpgOckQhW7_Ysi-F2HCoZW4RKSdUuvTquELJZsaMj6Lg@mail.gmail.com>
On 04/23/2018 09:39 PM, Yafang Shao wrote:
> On Tue, Apr 24, 2018 at 12:09 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>
>> On 04/23/2018 08:58 AM, David Miller wrote:
>>> From: Yafang Shao <laoar.shao@gmail.com>
>>> Date: Sun, 22 Apr 2018 21:50:04 +0800
>>>
>>>> With sk_cookie we can identify a socket, that is very helpful for
>>>> traceing and statistic, i.e. tcp tracepiont and ebpf.
>>>> So we'd better init it by default for inet socket.
>>>> When using it, we just need call atomic64_read(&sk->sk_cookie).
>>>>
>>>> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>>>
>>> Applied, thank you.
>>>
>>
>> This is adding yet another atomic_inc on a global cache line.
>>
>
> That's a trade-off.
>
>> Most applications do not need the cookie being ever set.
>>
>> The existing mechanism was fine. Set it on demand.
>
> There are some drawback in the existing mechanism.
> - we have to set the net->cookie_gen and then sk->sk_cookie when we
> want to get the sk_cookie, that's a little expensive as well.
Same cost.
> After that change, sock_gen_cookie() could be replaced by
> atomic64_read(&sk->sk_cookie) in most places.
Same cost than the helper.
>
> - If the application want to get the sk_cookie, it must set it first.
> What if the application don't have the permision to write?
> Furthermore, maybe it is a security concern ?
Maybe ? Please elaborate.
Your patch destroys SYNFLOOD behavior.
I have spent months of work solving the SYNFLOOD behavior, your patch crushes it.
I am not that happy.
Please revert this patch.
Thank you.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox