From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY, SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0D54FC43381 for ; Mon, 18 Feb 2019 08:59:46 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id B3A632175B for ; Mon, 18 Feb 2019 08:59:45 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="mOyxZFA3" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728823AbfBRI7o (ORCPT ); Mon, 18 Feb 2019 03:59:44 -0500 Received: from mail-ot1-f66.google.com ([209.85.210.66]:40151 "EHLO mail-ot1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727423AbfBRI7n (ORCPT ); Mon, 18 Feb 2019 03:59:43 -0500 Received: by mail-ot1-f66.google.com with SMTP id s5so27038627oth.7 for ; Mon, 18 Feb 2019 00:59:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=g9HgGtghcPXXmyxU34lbIHXb62T5nWuI55X1AN+10BA=; b=mOyxZFA358oCYeq4vmapmR8DttiC8X+AM0oZE06ZoRG449fmjaNAGc/klZHIl6A04Z V4qxEa5UcZ4HnnLFThHKZ5YVFsCtMzYJaTQv0Rid9F7ZR/WHH7nv6BpXrY1rJj73csfD B1zSIi22qVisp8I2QaDSt7/g0txyfVCoh6ntfuyBJB77nu6eVeVqc5JZKB+NkhN+p2VO m3zXTMTz4LBD+7QjfE3QF5FBjG9jTTUi0qEbnzFOsolD3q5YTWcyRJa17g85LJbkX8fT Se6YOd9lcrgECwEs5tIDIk3kBoGB3iBOQJaIHL4ssKj0X+H0AeerLiCcCpYrYUckQaEm bBgQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=g9HgGtghcPXXmyxU34lbIHXb62T5nWuI55X1AN+10BA=; b=Up7e4MWhxlDNkN5Zho+Pt3uN5aa3ZUnhWral69jkfvFhvmuTTyqMNySTEp6NpPWtUh FU3NY362u4g7wGhb3v7zEglC/yFqYgLpNsel2WmvEzQapxdFfJAaXtA/BPMtp12tpn46 ejro71d4s4WyIiutUKM8IiQlRPBn+nm2iRjDF0LvUL0ZPJCnpBgqUL6dl9ih2Jq59Zzk gP1/7UmmRrSrsk7opF0Zdx7LaHZ/KfCXD/SeZu9/5xXDgttw4izV2sMTGuQ6sa2jln4e oJ4tcTlqpmoeRy5LQMppn0ehAXZgXdyGlhd5gxPOX72XoOs0VNuGYALRCxbNfrblSuik ErQg== X-Gm-Message-State: AHQUAuYFkNa99WMas3UCGQFN2ztsvg8lKzj3DT2bOtUarBzleFWOqE5e ihmh2aeXWjlIP8eWINO0Ms3oH5h8ffHQj4GPowo= X-Google-Smtp-Source: AHgI3IaGyEhyH1/DGiM3Dv2l4kCdAP8pOgNAH1tpkAADOxQv61ne5OfJ6BII8eshm4HzUmKtW4z8UpIDSaNx3Fb6Pyw= X-Received: by 2002:a9d:77d0:: with SMTP id w16mr13658100otl.189.1550480382428; Mon, 18 Feb 2019 00:59:42 -0800 (PST) MIME-Version: 1.0 References: <1549631126-29067-1-git-send-email-magnus.karlsson@intel.com> <1549631126-29067-2-git-send-email-magnus.karlsson@intel.com> In-Reply-To: From: Magnus Karlsson Date: Mon, 18 Feb 2019 09:59:30 +0100 Message-ID: Subject: Re: [PATCH bpf-next v4 1/2] libbpf: add support for using AF_XDP sockets To: Daniel Borkmann Cc: Magnus Karlsson , =?UTF-8?B?QmrDtnJuIFTDtnBlbA==?= , ast@kernel.org, Network Development , Jakub Kicinski , =?UTF-8?B?QmrDtnJuIFTDtnBlbA==?= , "Zhang, Qi Z" , Jesper Dangaard Brouer , xiaolong.ye@intel.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org On Fri, Feb 15, 2019 at 6:09 PM Daniel Borkmann wrot= e: > > On 02/08/2019 02:05 PM, Magnus Karlsson wrote: > > This commit adds AF_XDP support to libbpf. The main reason for this is > > to facilitate writing applications that use AF_XDP by offering > > higher-level APIs that hide many of the details of the AF_XDP > > uapi. This is in the same vein as libbpf facilitates XDP adoption by > > offering easy-to-use higher level interfaces of XDP > > functionality. Hopefully this will facilitate adoption of AF_XDP, make > > applications using it simpler and smaller, and finally also make it > > possible for applications to benefit from optimizations in the AF_XDP > > user space access code. Previously, people just copied and pasted the > > code from the sample application into their application, which is not > > desirable. > > > > The interface is composed of two parts: > > > > * Low-level access interface to the four rings and the packet > > * High-level control plane interface for creating and setting > > up umems and af_xdp sockets as well as a simple XDP program. > > > > Tested-by: Bj=C3=B6rn T=C3=B6pel > > Signed-off-by: Magnus Karlsson > [...] > > diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c > > new file mode 100644 > > index 0000000..a982a76 > > --- /dev/null > > +++ b/tools/lib/bpf/xsk.c > > @@ -0,0 +1,742 @@ > > +// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) > > + > > +/* > > + * AF_XDP user-space access library. > > + * > > + * Copyright(c) 2018 - 2019 Intel Corporation. > > + * > > + * Author(s): Magnus Karlsson > > + */ > > + > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > + > > +#include "bpf.h" > > +#include "libbpf.h" > > +#include "libbpf_util.h" > > +#include "nlattr.h" > > +#include "xsk.h" > > + > > +#ifndef SOL_XDP > > + #define SOL_XDP 283 > > +#endif > > + > > +#ifndef AF_XDP > > + #define AF_XDP 44 > > +#endif > > + > > +#ifndef PF_XDP > > + #define PF_XDP AF_XDP > > +#endif > > + > > +struct xsk_umem { > > + struct xsk_ring_prod *fill; > > + struct xsk_ring_cons *comp; > > + char *umem_area; > > + struct xsk_umem_config config; > > + int fd; > > + int refcount; > > +}; > > + > > +struct xsk_socket { > > + struct xsk_ring_cons *rx; > > + struct xsk_ring_prod *tx; > > + __u64 outstanding_tx; > > + struct xsk_umem *umem; > > + struct xsk_socket_config config; > > + int fd; > > + int xsks_map; > > + int ifindex; > > + int prog_fd; > > + int qidconf_map_fd; > > + int xsks_map_fd; > > + __u32 queue_id; > > +}; > > + > > +struct xsk_nl_info { > > + bool xdp_prog_attached; > > + int ifindex; > > + int fd; > > +}; > > + > > +#define MAX_QUEUES 128 > > Why is this a fixed constant here, shouldn't this be dynamic due to being= NIC > specific anyway? It was only here for simplicity. If a NIC had more queues, it would require a recompile of the lib. Obviously, not desirable in a distro. What I could do is to read the max "combined" queues (pre-set maximum in the ethtool output) from the same interface as ethool uses and size the array after that. Or is there a simpler way? What to do if the NIC does not have a "combined", or is there no such NIC (seems the common HW ones set this)? > [...] > > +void *xsk_umem__get_data(struct xsk_umem *umem, __u64 addr) > > +{ > > + return &((char *)(umem->umem_area))[addr]; > > +} > > There's also a xsk_umem__get_data_raw() doing the same. Why having both, = resp. > when to choose which? ;) There is enough to have the xsk_umem__get_data_raw() function. xsk_umem__get_data() is just a convenience function for which the application does not have to store the beginning of the umem. But as the application always has to provide this anyway in the xsk_umem__create() function, it might as well store this pointer. I will delete xsk_umem__get_data() and rename xsk_umem__get_data_raw() to xsk_umem__get_data(). > > +int xsk_umem__fd(const struct xsk_umem *umem) > > +{ > > + return umem ? umem->fd : -EINVAL; > > +} > > + > > +int xsk_socket__fd(const struct xsk_socket *xsk) > > +{ > > + return xsk ? xsk->fd : -EINVAL; > > +} > > + > > +static bool xsk_page_aligned(void *buffer) > > +{ > > + unsigned long addr =3D (unsigned long)buffer; > > + > > + return !(addr & (getpagesize() - 1)); > > +} > > + > > +static void xsk_set_umem_config(struct xsk_umem_config *cfg, > > + const struct xsk_umem_config *usr_cfg) > > +{ > > + if (!usr_cfg) { > > + cfg->fill_size =3D XSK_RING_PROD__DEFAULT_NUM_DESCS; > > + cfg->comp_size =3D XSK_RING_CONS__DEFAULT_NUM_DESCS; > > + cfg->frame_size =3D XSK_UMEM__DEFAULT_FRAME_SIZE; > > + cfg->frame_headroom =3D XSK_UMEM__DEFAULT_FRAME_HEADROOM; > > + return; > > + } > > + > > + cfg->fill_size =3D usr_cfg->fill_size; > > + cfg->comp_size =3D usr_cfg->comp_size; > > + cfg->frame_size =3D usr_cfg->frame_size; > > + cfg->frame_headroom =3D usr_cfg->frame_headroom; > > Just optional nit, might be a bit nicer to have it in this form: > > cfg->fill_size =3D usr_cfg ? usr_cfg->fill_size : > XSK_RING_PROD__DEFAULT_NUM_DESCS; I actually think the current form is clearer when there are multiple lines. If there was only one line, I would agree with you. > > +} > > + > > +static void xsk_set_xdp_socket_config(struct xsk_socket_config *cfg, > > + const struct xsk_socket_config *usr= _cfg) > > +{ > > + if (!usr_cfg) { > > + cfg->rx_size =3D XSK_RING_CONS__DEFAULT_NUM_DESCS; > > + cfg->tx_size =3D XSK_RING_PROD__DEFAULT_NUM_DESCS; > > + cfg->libbpf_flags =3D 0; > > + cfg->xdp_flags =3D 0; > > + cfg->bind_flags =3D 0; > > + return; > > + } > > + > > + cfg->rx_size =3D usr_cfg->rx_size; > > + cfg->tx_size =3D usr_cfg->tx_size; > > + cfg->libbpf_flags =3D usr_cfg->libbpf_flags; > > + cfg->xdp_flags =3D usr_cfg->xdp_flags; > > + cfg->bind_flags =3D usr_cfg->bind_flags; > > (Ditto) > > > +} > > + > > +int xsk_umem__create(struct xsk_umem **umem_ptr, void *umem_area, __u6= 4 size, > > + struct xsk_ring_prod *fill, struct xsk_ring_cons *co= mp, > > + const struct xsk_umem_config *usr_config) > > +{ > > + struct xdp_mmap_offsets off; > > + struct xdp_umem_reg mr; > > + struct xsk_umem *umem; > > + socklen_t optlen; > > + void *map; > > + int err; > > + > > + if (!umem_area || !umem_ptr || !fill || !comp) > > + return -EFAULT; > > + if (!size && !xsk_page_aligned(umem_area)) > > + return -EINVAL; > > + > > + umem =3D calloc(1, sizeof(*umem)); > > + if (!umem) > > + return -ENOMEM; > > + > > + umem->fd =3D socket(AF_XDP, SOCK_RAW, 0); > > + if (umem->fd < 0) { > > + err =3D -errno; > > + goto out_umem_alloc; > > + } > > + > > + umem->umem_area =3D umem_area; > > + xsk_set_umem_config(&umem->config, usr_config); > > + > > + mr.addr =3D (uintptr_t)umem_area; > > + mr.len =3D size; > > + mr.chunk_size =3D umem->config.frame_size; > > + mr.headroom =3D umem->config.frame_headroom; > > + > > + err =3D setsockopt(umem->fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(m= r)); > > + if (err) { > > + err =3D -errno; > > + goto out_socket; > > + } > > + err =3D setsockopt(umem->fd, SOL_XDP, XDP_UMEM_FILL_RING, > > + &umem->config.fill_size, > > + sizeof(umem->config.fill_size)); > > + if (err) { > > + err =3D -errno; > > + goto out_socket; > > + } > > + err =3D setsockopt(umem->fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, > > + &umem->config.comp_size, > > + sizeof(umem->config.comp_size)); > > + if (err) { > > + err =3D -errno; > > + goto out_socket; > > + } > > + > > + optlen =3D sizeof(off); > > + err =3D getsockopt(umem->fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &op= tlen); > > + if (err) { > > + err =3D -errno; > > + goto out_socket; > > + } > > + > > + map =3D xsk_mmap(NULL, off.fr.desc + > > + umem->config.fill_size * sizeof(__u64), > > + PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, > > + umem->fd, XDP_UMEM_PGOFF_FILL_RING); > > + if (map =3D=3D MAP_FAILED) { > > + err =3D -errno; > > + goto out_socket; > > + } > > + > > + umem->fill =3D fill; > > + fill->mask =3D umem->config.fill_size - 1; > > + fill->size =3D umem->config.fill_size; > > + fill->producer =3D map + off.fr.producer; > > + fill->consumer =3D map + off.fr.consumer; > > + fill->ring =3D map + off.fr.desc; > > + fill->cached_cons =3D umem->config.fill_size; > > + > > + map =3D xsk_mmap(NULL, > > + off.cr.desc + umem->config.comp_size * sizeof(__u6= 4), > > + PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, > > + umem->fd, XDP_UMEM_PGOFF_COMPLETION_RING); > > + if (map =3D=3D MAP_FAILED) { > > + err =3D -errno; > > + goto out_mmap; > > + } > > + > > + umem->comp =3D comp; > > + comp->mask =3D umem->config.comp_size - 1; > > + comp->size =3D umem->config.comp_size; > > + comp->producer =3D map + off.cr.producer; > > + comp->consumer =3D map + off.cr.consumer; > > + comp->ring =3D map + off.cr.desc; > > + > > + *umem_ptr =3D umem; > > + return 0; > > + > > +out_mmap: > > + munmap(umem->fill, > > + off.fr.desc + umem->config.fill_size * sizeof(__u64)); > > +out_socket: > > + close(umem->fd); > > +out_umem_alloc: > > + free(umem); > > + return err; > > +} > > + > > +static int xsk_parse_nl(void *cookie, void *msg, struct nlattr **tb) > > +{ > > + struct nlattr *tb_parsed[IFLA_XDP_MAX + 1]; > > + struct xsk_nl_info *nl_info =3D cookie; > > + struct ifinfomsg *ifinfo =3D msg; > > + unsigned char mode; > > + int err; > > + > > + if (nl_info->ifindex && nl_info->ifindex !=3D ifinfo->ifi_index) > > + return 0; > > + > > + if (!tb[IFLA_XDP]) > > + return 0; > > + > > + err =3D libbpf_nla_parse_nested(tb_parsed, IFLA_XDP_MAX, tb[IFLA_= XDP], > > + NULL); > > + if (err) > > + return err; > > + > > + if (!tb_parsed[IFLA_XDP_ATTACHED] || !tb_parsed[IFLA_XDP_FD]) > > + return 0; > > + > > + mode =3D libbpf_nla_getattr_u8(tb_parsed[IFLA_XDP_ATTACHED]); > > + if (mode =3D=3D XDP_ATTACHED_NONE) > > + return 0; > > + > > + nl_info->xdp_prog_attached =3D true; > > + nl_info->fd =3D libbpf_nla_getattr_u32(tb_parsed[IFLA_XDP_FD]); > > Hm, I don't think this works if I read the intention of this helper corre= ctly. > > IFLA_XDP_FD is never set for retrieving the prog from the kernel. So the > above is a bug. > > We also have bpf_get_link_xdp_id(). This should probably just be reused i= n > this context here. If bpf_get_link_xdp_id() will fit my bill, I will happily use it. I will check it out and hopefully I can drop all this code. Thanks. > > + return 0; > > +} > > + > > +static bool xsk_xdp_prog_attached(struct xsk_socket *xsk) > > +{ > > + struct xsk_nl_info nl_info; > > + unsigned int nl_pid; > > + char err_buf[256]; > > + int sock, err; > > + > > + sock =3D libbpf_netlink_open(&nl_pid); > > + if (sock < 0) > > + return false; > > + > > + nl_info.xdp_prog_attached =3D false; > > + nl_info.ifindex =3D xsk->ifindex; > > + nl_info.fd =3D -1; > > + > > + err =3D libbpf_nl_get_link(sock, nl_pid, xsk_parse_nl, &nl_info); > > + if (err) { > > + libbpf_strerror(err, err_buf, sizeof(err_buf)); > > + pr_warning("Error:\n%s\n", err_buf); > > + close(sock); > > + return false; > > + } > > + > > + close(sock); > > + xsk->prog_fd =3D nl_info.fd; > > + return nl_info.xdp_prog_attached; > > +} > > (See bpf_get_link_xdp_id().) > > > + > > +static int xsk_load_xdp_prog(struct xsk_socket *xsk) > > +{ > > + char bpf_log_buf[BPF_LOG_BUF_SIZE]; > > + int err, prog_fd; > > + > > + /* This is the C-program: > > + * SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx) > > + * { > > + * int *qidconf, index =3D ctx->rx_queue_index; > [...] > > + > > +int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname= , > > + __u32 queue_id, struct xsk_umem *umem, > > + struct xsk_ring_cons *rx, struct xsk_ring_prod *tx= , > > + const struct xsk_socket_config *usr_config) > > +{ > > + struct sockaddr_xdp sxdp =3D {}; > > + struct xdp_mmap_offsets off; > > + struct xsk_socket *xsk; > > + socklen_t optlen; > > + void *map; > > + int err; > > + > > + if (!umem || !xsk_ptr || !rx || !tx) > > + return -EFAULT; > > + > > + if (umem->refcount) { > > + pr_warning("Error: shared umems not supported by libbpf.\= n"); > > + return -EBUSY; > > + } > > + > > + xsk =3D calloc(1, sizeof(*xsk)); > > + if (!xsk) > > + return -ENOMEM; > > + > > + if (umem->refcount++ > 0) { > > Should this refcount rather be atomic actually? Neither our config nor data plane interfaces are reentrant for performance reasons. Any concurrency has to be handled explicitly on the application level. This so that it only penalizes apps that really need this. Thanks for all your reviews: Magnus > > + xsk->fd =3D socket(AF_XDP, SOCK_RAW, 0); > > + if (xsk->fd < 0) { > > + err =3D -errno; > > + goto out_xsk_alloc; > > + } > > + } else { > > + xsk->fd =3D umem->fd; > > + } > > + > > + xsk->outstanding_tx =3D 0; > > + xsk->queue_id =3D queue_id; > > + xsk->umem =3D umem; > > + xsk->ifindex =3D if_nametoindex(ifname); > > + if (!xsk->ifindex) { > > + err =3D -errno; > > + goto out_socket; > > + } > > + > > + xsk_set_xdp_socket_config(&xsk->config, usr_config); > [...]