From: Toke Høiland-Jørgensen
To: Jakub Kicinski
Cc: David Miller, netdev@vger.kernel.org, Jesper Dangaard Brouer, Daniel Borkmann, Alexei Starovoitov
Subject: Re: [PATCH net-next v3 1/3] xdp: Refactor devmap code in preparation for subsequent additions
In-Reply-To: <20190301170831.6ae29baa@cakuba.netronome.com>
Date: Mon, 04 Mar 2019 13:47:47 +0100
Message-ID: <87zhqakdqk.fsf@toke.dk>

Jakub Kicinski writes:

> On Fri, 01 Mar 2019 15:12:30 +0100, Toke Høiland-Jørgensen wrote:
>> The subsequent commits introducing default maps and a hash-based ifindex
>> devmap require a bit of refactoring of the devmap code. Perform this first
>> so the subsequent commits become easier to read.
>>
>> Signed-off-by: Toke Høiland-Jørgensen
>
>> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
>> index 191b79948424..1037fc08c504 100644
>> --- a/kernel/bpf/devmap.c
>> +++ b/kernel/bpf/devmap.c
>> @@ -75,6 +75,7 @@ struct bpf_dtab {
>>  	struct bpf_dtab_netdev **netdev_map;
>>  	unsigned long __percpu *flush_needed;
>>  	struct list_head list;
>> +	struct rcu_head rcu;
>
> I think the RCU change may warrant a separate commit or at least a
> mention in the commit message ;)

Heh, fair point.

>>  };
>>
>>  static DEFINE_SPINLOCK(dev_map_lock);
>> @@ -85,23 +86,11 @@ static u64 dev_map_bitmap_size(const union bpf_attr *attr)
>>  	return BITS_TO_LONGS((u64) attr->max_entries) * sizeof(unsigned long);
>>  }
>>
>> -static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
>> +static int dev_map_init_map(struct bpf_dtab *dtab, union bpf_attr *attr,
>> +			    bool check_memlock)
>>  {
>> -	struct bpf_dtab *dtab;
>> -	int err = -EINVAL;
>>  	u64 cost;
>> -
>> -	if (!capable(CAP_NET_ADMIN))
>> -		return ERR_PTR(-EPERM);
>> -
>> -	/* check sanity of attributes */
>> -	if (attr->max_entries == 0 || attr->key_size != 4 ||
>> -	    attr->value_size != 4 || attr->map_flags & ~DEV_CREATE_FLAG_MASK)
>> -		return ERR_PTR(-EINVAL);
>
> perhaps consider moving these to a map_alloc_check callback?

The reason I kept them here is that these checks are only strictly needed
when the parameters come from userspace. In the other allocation paths,
the attrs are hard-coded to valid values, so the only thing having the
check would do is guard against future changes to one part of the file
being out of sync with the other part. I can certainly add the check, it
just seemed a bit excessive...
>> -	dtab = kzalloc(sizeof(*dtab), GFP_USER);
>> -	if (!dtab)
>> -		return ERR_PTR(-ENOMEM);
>> +	int err;
>>
>>  	bpf_map_init_from_attr(&dtab->map, attr);
>>
>> @@ -109,60 +98,72 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
>>  	cost = (u64) dtab->map.max_entries * sizeof(struct bpf_dtab_netdev *);
>>  	cost += dev_map_bitmap_size(attr) * num_possible_cpus();
>>  	if (cost >= U32_MAX - PAGE_SIZE)
>> -		goto free_dtab;
>> +		return -EINVAL;
>>
>>  	dtab->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
>>
>> -	/* if map size is larger than memlock limit, reject it early */
>> -	err = bpf_map_precharge_memlock(dtab->map.pages);
>> -	if (err)
>> -		goto free_dtab;
>> -
>> -	err = -ENOMEM;
>> +	if (check_memlock) {
>> +		/* if map size is larger than memlock limit, reject it early */
>> +		err = bpf_map_precharge_memlock(dtab->map.pages);
>> +		if (err)
>> +			return -EINVAL;
>> +	}
>>
>>  	/* A per cpu bitfield with a bit per possible net device */
>>  	dtab->flush_needed = __alloc_percpu_gfp(dev_map_bitmap_size(attr),
>>  						__alignof__(unsigned long),
>>  						GFP_KERNEL | __GFP_NOWARN);
>>  	if (!dtab->flush_needed)
>> -		goto free_dtab;
>> +		goto err_alloc;
>>
>>  	dtab->netdev_map = bpf_map_area_alloc(dtab->map.max_entries *
>>  					      sizeof(struct bpf_dtab_netdev *),
>>  					      dtab->map.numa_node);
>>  	if (!dtab->netdev_map)
>> -		goto free_dtab;
>> +		goto err_map;
>>
>> -	spin_lock(&dev_map_lock);
>> -	list_add_tail_rcu(&dtab->list, &dev_map_list);
>> -	spin_unlock(&dev_map_lock);
>> +	return 0;
>>
>> -	return &dtab->map;
>> -free_dtab:
>> + err_map:
>
> The label should name the first action. So it was correct, you're
> making it less good :)

Heh, yeah, guess you're right.

> Also why the space?

Because the mighty GNU ordained it (i.e., Emacs added those). Will fix.
>>  	free_percpu(dtab->flush_needed);
>> -	kfree(dtab);
>> -	return ERR_PTR(err);
>> + err_alloc:
>
> and no need for this one

Right, guess I can just rely on the NULL check in the free functions;
would make things simpler in patch 3 as well :)

>> +	return -ENOMEM;
>>  }
>>
>> -static void dev_map_free(struct bpf_map *map)
>> +static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
>>  {
>> -	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
>> -	int i, cpu;
>> +	struct bpf_dtab *dtab;
>> +	int err;
>>
>> -	/* At this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
>> -	 * so the programs (can be more than one that used this map) were
>> -	 * disconnected from events. Wait for outstanding critical sections in
>> -	 * these programs to complete. The rcu critical section only guarantees
>> -	 * no further reads against netdev_map. It does __not__ ensure pending
>> -	 * flush operations (if any) are complete.
>> -	 */
>> +	if (!capable(CAP_NET_ADMIN))
>> +		return ERR_PTR(-EPERM);
>> +
>> +	/* check sanity of attributes */
>> +	if (attr->max_entries == 0 || attr->key_size != 4 ||
>> +	    attr->value_size != 4 || attr->map_flags & ~DEV_CREATE_FLAG_MASK)
>> +		return ERR_PTR(-EINVAL);
>> +
>> +	dtab = kzalloc(sizeof(*dtab), GFP_USER);
>> +	if (!dtab)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	err = dev_map_init_map(dtab, attr, true);
>> +	if (err) {
>> +		kfree(dtab);
>> +		return ERR_PTR(err);
>> +	}
>>
>>  	spin_lock(&dev_map_lock);
>> -	list_del_rcu(&dtab->list);
>> +	list_add_tail_rcu(&dtab->list, &dev_map_list);
>>  	spin_unlock(&dev_map_lock);
>>
>> -	bpf_clear_redirect_map(map);
>> -	synchronize_rcu();
>> +	return &dtab->map;
>> +}
>> +
>> +static void __dev_map_free(struct rcu_head *rcu)
>> +{
>> +	struct bpf_dtab *dtab = container_of(rcu, struct bpf_dtab, rcu);
>> +	int i, cpu;
>>
>>  	/* To ensure all pending flush operations have completed wait for flush
>>  	 * bitmap to indicate all flush_needed bits to be zero on _all_ cpus.
>> @@ -192,6 +193,26 @@ static void dev_map_free(struct bpf_map *map)
>
> There is a cond_resched() here, I don't think you can call
> cond_resched() from an RCU callback.

Hmm, bugger. I thought I could just use call_rcu() as semantically
equivalent (but slightly slower) to just calling synchronize_rcu() inside
the function.

In an earlier version I had a namespace enter/exit notifier in devmap.c as
well, to react to new namespaces. And that notifier has a comment about
avoiding calls to synchronize_rcu(). But since this version doesn't
actually need that, maybe I can just keep using direct calls and
synchronize_rcu() and avoid the callback?

I'm a bit worried about adding both synchronize_rcu() and cond_resched()
as a possible side effect of every call to bpf_prog_put(), though; so
maybe it's better to move the cleanup somewhere it's actually safe to call
cond_resched(); what would that be, a workqueue?

-Toke