From: Toke Høiland-Jørgensen
To: Jakub Kicinski
Cc: David Miller, netdev@vger.kernel.org, Jesper Dangaard Brouer, Daniel Borkmann, Alexei Starovoitov
Subject: Re: [PATCH net-next v3 1/3] xdp: Refactor devmap code in preparation for subsequent additions
In-Reply-To: <20190301170831.6ae29baa@cakuba.netronome.com>
Date: Mon, 04 Mar 2019 13:47:47 +0100
Message-ID: <87zhqakdqk.fsf@toke.dk>

Jakub Kicinski writes:

> On Fri, 01 Mar 2019 15:12:30 +0100, Toke Høiland-Jørgensen wrote:
>> The subsequent commits introducing default maps and a hash-based ifindex
>> devmap require a bit of refactoring of the devmap code. Perform this first
>> so the subsequent commits become easier to read.
>>
>> Signed-off-by: Toke Høiland-Jørgensen
>
>> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
>> index 191b79948424..1037fc08c504 100644
>> --- a/kernel/bpf/devmap.c
>> +++ b/kernel/bpf/devmap.c
>> @@ -75,6 +75,7 @@ struct bpf_dtab {
>>  	struct bpf_dtab_netdev **netdev_map;
>>  	unsigned long __percpu *flush_needed;
>>  	struct list_head list;
>> +	struct rcu_head rcu;
>
> I think the RCU change may warrant a separate commit or at least a
> mention in the commit message ;)

Heh, fair point.

>>  };
>>
>>  static DEFINE_SPINLOCK(dev_map_lock);
>> @@ -85,23 +86,11 @@ static u64 dev_map_bitmap_size(const union bpf_attr *attr)
>>  	return BITS_TO_LONGS((u64) attr->max_entries) * sizeof(unsigned long);
>>  }
>>
>> -static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
>> +static int dev_map_init_map(struct bpf_dtab *dtab, union bpf_attr *attr,
>> +			    bool check_memlock)
>>  {
>> -	struct bpf_dtab *dtab;
>> -	int err = -EINVAL;
>>  	u64 cost;
>> -
>> -	if (!capable(CAP_NET_ADMIN))
>> -		return ERR_PTR(-EPERM);
>> -
>> -	/* check sanity of attributes */
>> -	if (attr->max_entries == 0 || attr->key_size != 4 ||
>> -	    attr->value_size != 4 || attr->map_flags & ~DEV_CREATE_FLAG_MASK)
>> -		return ERR_PTR(-EINVAL);
>
> perhaps consider moving these to a map_alloc_check callback?

The reason I kept them here is that these checks are only strictly needed
when the parameters come from userspace. In the other allocation paths,
the attrs are hard-coded to valid values, so the only thing having the
check would do is guard against future changes to one part of the file
being out of sync with the other part. I can certainly add the check, it
just seemed a bit excessive...
>> -	dtab = kzalloc(sizeof(*dtab), GFP_USER);
>> -	if (!dtab)
>> -		return ERR_PTR(-ENOMEM);
>> +	int err;
>>
>>  	bpf_map_init_from_attr(&dtab->map, attr);
>>
>> @@ -109,60 +98,72 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
>>  	cost = (u64) dtab->map.max_entries * sizeof(struct bpf_dtab_netdev *);
>>  	cost += dev_map_bitmap_size(attr) * num_possible_cpus();
>>  	if (cost >= U32_MAX - PAGE_SIZE)
>> -		goto free_dtab;
>> +		return -EINVAL;
>>
>>  	dtab->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
>>
>> -	/* if map size is larger than memlock limit, reject it early */
>> -	err = bpf_map_precharge_memlock(dtab->map.pages);
>> -	if (err)
>> -		goto free_dtab;
>> -
>> -	err = -ENOMEM;
>> +	if (check_memlock) {
>> +		/* if map size is larger than memlock limit, reject it early */
>> +		err = bpf_map_precharge_memlock(dtab->map.pages);
>> +		if (err)
>> +			return -EINVAL;
>> +	}
>>
>>  	/* A per cpu bitfield with a bit per possible net device */
>>  	dtab->flush_needed = __alloc_percpu_gfp(dev_map_bitmap_size(attr),
>>  						__alignof__(unsigned long),
>>  						GFP_KERNEL | __GFP_NOWARN);
>>  	if (!dtab->flush_needed)
>> -		goto free_dtab;
>> +		goto err_alloc;
>>
>>  	dtab->netdev_map = bpf_map_area_alloc(dtab->map.max_entries *
>>  					      sizeof(struct bpf_dtab_netdev *),
>>  					      dtab->map.numa_node);
>>  	if (!dtab->netdev_map)
>> -		goto free_dtab;
>> +		goto err_map;
>>
>> -	spin_lock(&dev_map_lock);
>> -	list_add_tail_rcu(&dtab->list, &dev_map_list);
>> -	spin_unlock(&dev_map_lock);
>> +	return 0;
>>
>> -	return &dtab->map;
>> -free_dtab:
>> + err_map:
>
> The label should name the first action. So it was correct, you're
> making it less good :)

Heh, yeah, guess you're right.

> Also why the space?

Because the mighty GNU ordained it (i.e., Emacs added those). Will fix.
>>  	free_percpu(dtab->flush_needed);
>> -	kfree(dtab);
>> -	return ERR_PTR(err);
>> + err_alloc:
>
> and no need for this one

Right, guess I can just rely on the NULL check in the free functions;
would make things simpler in patch 3 as well :)

>> +	return -ENOMEM;
>>  }
>>
>> -static void dev_map_free(struct bpf_map *map)
>> +static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
>>  {
>> -	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
>> -	int i, cpu;
>> +	struct bpf_dtab *dtab;
>> +	int err;
>>
>> -	/* At this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
>> -	 * so the programs (can be more than one that used this map) were
>> -	 * disconnected from events. Wait for outstanding critical sections in
>> -	 * these programs to complete. The rcu critical section only guarantees
>> -	 * no further reads against netdev_map. It does __not__ ensure pending
>> -	 * flush operations (if any) are complete.
>> -	 */
>> +	if (!capable(CAP_NET_ADMIN))
>> +		return ERR_PTR(-EPERM);
>> +
>> +	/* check sanity of attributes */
>> +	if (attr->max_entries == 0 || attr->key_size != 4 ||
>> +	    attr->value_size != 4 || attr->map_flags & ~DEV_CREATE_FLAG_MASK)
>> +		return ERR_PTR(-EINVAL);
>> +
>> +	dtab = kzalloc(sizeof(*dtab), GFP_USER);
>> +	if (!dtab)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	err = dev_map_init_map(dtab, attr, true);
>> +	if (err) {
>> +		kfree(dtab);
>> +		return ERR_PTR(err);
>> +	}
>>
>>  	spin_lock(&dev_map_lock);
>> -	list_del_rcu(&dtab->list);
>> +	list_add_tail_rcu(&dtab->list, &dev_map_list);
>>  	spin_unlock(&dev_map_lock);
>>
>> -	bpf_clear_redirect_map(map);
>> -	synchronize_rcu();
>> +	return &dtab->map;
>> +}
>> +
>> +static void __dev_map_free(struct rcu_head *rcu)
>> +{
>> +	struct bpf_dtab *dtab = container_of(rcu, struct bpf_dtab, rcu);
>> +	int i, cpu;
>>
>>  	/* To ensure all pending flush operations have completed wait for flush
>>  	 * bitmap to indicate all flush_needed bits to be zero on _all_ cpus.
>> @@ -192,6 +193,26 @@ static void dev_map_free(struct bpf_map *map)
>
> There is a cond_resched() here, I don't think you can call
> cond_resched() from an RCU callback.

Hmm, bugger. I thought I could just use call_rcu() as semantically
equivalent (but slightly slower) to just calling synchronize_rcu() inside
the function.

In an earlier version I had a namespace enter/exit notifier in devmap.c as
well, to react to new namespaces. And that notifier has a comment about
avoiding calls to synchronize_rcu(). But since this version doesn't
actually need that, maybe I can just keep using direct calls and
synchronize_rcu() and avoid the callback?

I'm a bit worried about adding both synchronize_rcu() and cond_resched()
as a possible side effect of every call to bpf_prog_put(), though; so
maybe it's better to move the cleanup somewhere it's actually safe to call
cond_resched(); what would that be, a workqueue?

-Toke