From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-4.0 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id C2020C43381 for ; Mon, 4 Mar 2019 19:05:40 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 98F2320663 for ; Mon, 4 Mar 2019 19:05:40 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726951AbfCDTFj convert rfc822-to-8bit (ORCPT ); Mon, 4 Mar 2019 14:05:39 -0500 Received: from mail-ed1-f65.google.com ([209.85.208.65]:39224 "EHLO mail-ed1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726073AbfCDTFj (ORCPT ); Mon, 4 Mar 2019 14:05:39 -0500 Received: by mail-ed1-f65.google.com with SMTP id p27so5133191edc.6 for ; Mon, 04 Mar 2019 11:05:37 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:in-reply-to:references:date :message-id:mime-version:content-transfer-encoding; bh=K20sMmOjDwYau1PjecCZ05AtM5LUFj8xkLR6GHDqJD0=; b=mzBLNwBjKskGCTtmzdE44uSXzkdV6fnl8LnQhpxZFO20OQGZuBtMNlck/fQWaOEvkV 73a/0q0i3svWV8GMOsTfD/EUR4tESjBsaU3uWHxV46Gi32weuqlJ7piv/pRfU3oFDRl4 m3OIEwKwz6MhZ6iK5WH9fKmjqis7HsPSsUm+P5KGRc3o0ZjBwxkKfyADVFFUxhyZjMq5 wp7UQasYcf6BkZFx6raWuty90o1d3iSkTf5SqXJPBTRxCR2f4/4oCjzlTc02vBFOv0eg PZxWBce3vFV9WSRay6T0IYqIyM8OfqRo9p1amUsZm0knYcSEbAYLYkrVCfCWPcrIkCSJ 8rRg== X-Gm-Message-State: APjAAAVHcGGNudiU05PdnxN4mb4Am21yM6fYwmPxNOApH393E5rfymy/ qw1pJaKUDwLalXDwZV1hisroJA== X-Google-Smtp-Source: APXvYqzjOxXee76mXzGo8zxeGvmfRcJRmoxvccV6Z9Bhy4rFtdgmKiex/SI41qkz5xpZtcFbTS0Eug== X-Received: by 2002:a50:f5ea:: with SMTP id x39mr16577247edm.154.1551726336879; Mon, 04 Mar 2019 11:05:36 -0800 (PST) Received: from alrua-x1.borgediget.toke.dk ([2a00:7660:6da:10::2]) by smtp.gmail.com with ESMTPSA id w24sm2335930edb.72.2019.03.04.11.05.35 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Mon, 04 Mar 2019 11:05:36 -0800 (PST) Received: by alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id 89D86182E8E; Mon, 4 Mar 2019 20:05:30 +0100 (CET) From: Toke =?utf-8?Q?H=C3=B8iland-J=C3=B8rgensen?= To: Jakub Kicinski Cc: David Miller , netdev@vger.kernel.org, Jesper Dangaard Brouer , Daniel Borkmann , Alexei Starovoitov Subject: Re: [PATCH net-next v3 2/3] xdp: Always use a devmap for XDP_REDIRECT to a device In-Reply-To: <20190304094452.375ba6e7@cakuba.netronome.com> References: <155144955030.28287.14029975169967438162.stgit@alrua-x1> <155144955045.28287.6187020206969282688.stgit@alrua-x1> <20190301180945.1152c6de@cakuba.netronome.com> <877edelukk.fsf@toke.dk> <20190304094452.375ba6e7@cakuba.netronome.com> X-Clacks-Overhead: GNU Terry Pratchett Date: Mon, 04 Mar 2019 20:05:30 +0100 Message-ID: <87fts231fp.fsf@toke.dk> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8BIT Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org Jakub Kicinski writes: > On Mon, 04 Mar 2019 12:58:51 +0100, Toke Høiland-Jørgensen wrote: >> >> diff --git a/include/net/netns/xdp.h b/include/net/netns/xdp.h >> >> index e5734261ba0a..4935dfe1cf43 100644 >> >> --- a/include/net/netns/xdp.h >> >> +++ b/include/net/netns/xdp.h >> >> @@ -4,10 +4,21 @@ >> >> >> >> #include >> >> #include >> >> +#include >> >> + >> >> +struct bpf_dtab; >> >> + >> >> +struct bpf_dtab_container { >> >> + struct bpf_dtab __rcu *dtab; >> >> + atomic_t refcnt; >> > >> > refcount_t ? I'm not sure what the rules are for non-performance >> > critical stuff are.. >> >> I based the use of atomic_t on what bpf/syscall.c is doing. The problem >> with using refcount_t is that it's a bit of a weird refcount because >> it's not strictly tied to the object lifetime (as the objects are only >> allocated if *both* the globals and the per-namespace refcnt are >0). >> >> >> +static void __dev_map_release_default_map(struct bpf_dtab_container *cont) >> >> +{ >> >> + struct bpf_dtab *dtab = NULL; >> >> + >> >> + lockdep_assert_held(&dev_map_lock); >> >> + >> >> + dtab = rcu_dereference(cont->dtab); >> >> + if (dtab) { >> > >> > How can it be NULL if it's refcounted? hm.. >> >> See above; it's not actually the object that is refcounted, it's a count >> of the number of active XDP programs in the namespace, which still needs >> to be kept so we can allocate the devmaps when a REDIRECT program is >> loaded. > > I think I got it now. I wonder if anything could be done to improve > the readability of the dual-refcnting :S Maybe use the "needed"/"used" > terms throughout? Yeah, good idea, I'll try something along those lines. >> >> + list_del_rcu(&dtab->list); >> >> + rcu_assign_pointer(cont->dtab, NULL); >> >> + bpf_clear_redirect_map(&dtab->map); >> >> + call_rcu(&dtab->rcu, __dev_map_free); >> >> + } >> >> +} >> >> + >> >> +void dev_map_put_default_map(struct net *net) >> >> +{ >> >> + if (atomic_dec_and_test(&net->xdp.default_map.refcnt)) { >> >> + spin_lock(&dev_map_lock); >> >> + __dev_map_release_default_map(&net->xdp.default_map); >> >> + spin_unlock(&dev_map_lock); >> >> + } >> >> +} >> >> + >> >> +static int __init_default_map(struct bpf_dtab_container *cont) >> >> +{ >> >> + struct net *net = bpf_default_map_to_net(cont); >> >> + struct bpf_dtab *dtab, *old_dtab; >> >> + int size = DEV_MAP_DEFAULT_SIZE; >> >> + struct net_device *netdev; >> >> + union bpf_attr attr = {}; >> >> + u32 idx; >> >> + int err; >> >> + >> >> + lockdep_assert_held(&dev_map_lock); >> >> + >> >> + if (!atomic_read(&global_redirect_use)) >> >> + return 0; >> >> + >> >> + for_each_netdev(net, netdev) >> >> + if (netdev->ifindex >= size) >> >> + size <<= 1; >> >> + >> >> + old_dtab = rcu_dereference(cont->dtab); >> >> + if (old_dtab && old_dtab->map.max_entries == size) >> >> + return 0; >> >> + >> >> + dtab = kzalloc(sizeof(*dtab), GFP_USER); >> > >> > You are calling this with a spin lock held :( >> >> Yeah, not just this, there are more allocations in dev_map_init_map(). >> Hmm, I guess I need to think about the concurrency bits some more; it is >> made a bit tricky by the refcounts being outside the objects. >> >> Basically what I'm trying to protect against is the case where the >> global refcount goes to zero and then immediately back to one (or vice >> versa), which results in a deallocation immediately followed by an >> allocation. I was worried about the allocation and deallocation racing >> with each other (because there's a window after the refcnt is checked >> before the (de)allocation begins). It's not quite clear to me whether >> RCU is enough to protect against this... Thoughts? > > Hm. I think you'll still need a lock (mutex?) on the alloc path, but > the free path should be fine as long as you load the map pointer before > looking at the refcnt (atomic op ensuring the barrier there). Yeah, for the per-namespace refcnt it's pretty straight forward, the trouble is the global count that needs to iterate over all namespaces; probably need to put that all behind a (non-spin)lock, right? >> > Could you please add tests for the replacement combinations? >> >> Sure. But, erm, where? selftests? :) > > Yes :) selftests/bpf/ Look around at the scripts, AFAIK we don't have > much in the way of standard infra for system interaction tests like > this :S Right. I'll shamelessly copy from the tests already in there and see if I can't get something decent working :) -Toke