From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-0.8 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E53DAC432C0 for ; Thu, 21 Nov 2019 19:55:01 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id B5EDE20659 for ; Thu, 21 Nov 2019 19:55:01 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="T1GkjIuA" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726714AbfKUTzB (ORCPT ); Thu, 21 Nov 2019 14:55:01 -0500 Received: from us-smtp-2.mimecast.com ([207.211.31.81]:45918 "EHLO us-smtp-delivery-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726293AbfKUTzB (ORCPT ); Thu, 21 Nov 2019 14:55:01 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1574366099; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=dDP9zZHWWVzhM5nlApK7sPPP8e0+XOdW9eP7EvaEtHM=; b=T1GkjIuA/i/7mfNxior4PTCrdOXrU9xsf5H24kNILwSYQIgqaRg3362WgvPzpOr1yJNUtq W6Lrz4+zwARMxb/UYzUsW01UN0tONIkZCfyeBozPNmX+PwpXY9wy0W/xmNUMY/ApAqP/e8 +1dzzlzJF0OI7g5Cl37qR/qm8zRP784= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-225-Y1mU7YDVP6CxmAmwU_Xv1A-1; Thu, 21 Nov 2019 14:54:56 -0500 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 204B11883545; Thu, 21 Nov 2019 19:54:54 +0000 (UTC) Received: from localhost (ovpn-112-24.ams2.redhat.com [10.36.112.24]) by smtp.corp.redhat.com (Postfix) with ESMTPS id AED0C26E43; Thu, 21 Nov 2019 19:54:50 +0000 (UTC) Date: Thu, 21 Nov 2019 20:54:42 +0100 From: Stefano Brivio To: Florian Westphal Cc: Pablo Neira Ayuso , netfilter-devel@vger.kernel.org, Kadlecsik =?UTF-8?B?SsOzenNlZg==?= , Eric Garver , Phil Sutter , Sabrina Dubroca , Jay Ligatti , Ori Rottenstreich , Kirill Kogan Subject: Re: [PATCH nf-next 3/8] nf_tables: Add set type for arbitrary concatenation of ranges Message-ID: <20191121205442.5eb3d113@redhat.com> In-Reply-To: <20191120150609.GB20235@breakpoint.cc> References: <6da551247fd90666b0eca00fb4467151389bf1dc.1574119038.git.sbrivio@redhat.com> <20191120150609.GB20235@breakpoint.cc> Organization: Red Hat MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 X-MC-Unique: Y1mU7YDVP6CxmAmwU_Xv1A-1 X-Mimecast-Spam-Score: 0 Content-Type: text/plain; charset=WINDOWS-1252 Content-Transfer-Encoding: quoted-printable Sender: netfilter-devel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netfilter-devel@vger.kernel.org On Wed, 20 Nov 2019 16:06:09 +0100 Florian Westphal wrote: > Stefano Brivio wrote: > > +static bool nft_pipapo_lookup(const struct net *net, const struct nft_= set *set, > > +=09=09=09 const u32 *key, const struct nft_set_ext **ext) > > +{ > > +=09struct nft_pipapo *priv =3D nft_set_priv(set); > > +=09unsigned long *res_map, *fill_map; > > +=09u8 genmask =3D nft_genmask_cur(net); > > +=09const u8 *rp =3D (const u8 *)key; > > +=09struct nft_pipapo_match *m; > > +=09struct nft_pipapo_field *f; > > +=09bool map_index; > > +=09int i; > > + > > +=09map_index =3D raw_cpu_read(nft_pipapo_scratch_index); =20 >=20 > I'm afraid this will need local_bh_disable to prevent reentry from > softirq processing. I'm afraid you're right and yes, not just this: from here to the point where we're done using scratch maps or their index. Adding in v2. Well, at least vectorised versions for (at least) x86, ARM and s390x won't have any overhead from it as they will already do that with kernel_fpu_begin()/kernel_neon_begin(). > > +=09rcu_read_lock(); =20 >=20 > All netfilter hooks run inside rcu read section, so this isn't needed. Dropping in v2. > > +static int pipapo_realloc_scratch(struct nft_pipapo_match *m, > > +=09=09=09=09 unsigned long bsize_max) > > +{ > > +=09int i; > > + > > +=09for_each_possible_cpu(i) { > > +=09=09unsigned long *scratch; > > + > > +=09=09scratch =3D kzalloc_node(bsize_max * sizeof(*scratch) * 2, > > +=09=09=09=09 GFP_KERNEL, cpu_to_node(i)); > > +=09=09if (!scratch) > > +=09=09=09return -ENOMEM; =20 >=20 > No need to handle partial failures on the other cpu / no rollback? > AFAICS ->destroy will handle it correctly, i.e. next insertion may > enter this again and allocate a same-sized chunk, so AFAICS its fine. There's no need because this is just called on insertion, so the new scratch maps will be bigger than the previous ones, and if only some allocations here succeed, that means some CPUs have a bigger allocated map, but the element is not inserted and the extra room is not used, because the caller won't update m->bsize_max. > But still, it looks odd -- perhaps add a comment that there is no need > to rollback earlier allocs. Sure, added. > > + > > +=09=09kfree(*per_cpu_ptr(m->scratch, i)); =20 >=20 > I was about to ask what would prevent nft_pipapo_lookup() from accessing > m->scratch. Its because "m" is the private clone. Perhaps add a > comment here to that effect. I renamed 'm' to 'clone' and updated kerneldoc header, I think it's even clearer than a comment that way. > > + * @net:=09Network namespace > > + * @set:=09nftables API set representation > > + * @elem:=09nftables API element representation containing key data > > + * @flags:=09If NFT_SET_ELEM_INTERVAL_END is passed, this is the end e= lement > > + * @ext2:=09Filled with pointer to &struct nft_set_ext in inserted ele= ment > > + * > > + * In this set implementation, this functions needs to be called twice= , with > > + * start and end element, to obtain a valid entry insertion. Calls to = this > > + * function are serialised, so we can store element and key data on th= e first > > + * call with start element, and use it on the second call once we get = the end > > + * element too. =20 >=20 > What guaranttess this? Well, the only guarantee that I'm expecting here is that the insert function is not called concurrently in the same namespace, and as far as I understand that comes from nf_tables_valid_genid(). However: > AFAICS userspace could send a single element, with either > NFT_SET_ELEM_INTERVAL_END, or only the start element. this is all possible, and: - for a single element with NFT_SET_ELEM_INTERVAL_END, we'll reuse the last 'start' element ever seen, or an all-zero key if no 'start' elements were seen at all - for a single 'start' element, no element is added If the user chooses to configure firewalling with syzbot, my assumption is that all we have to do is to avoid crashing or leaking anything. We could opt to be stricter indeed, by checking that a single netlink batch contains a corresponding number of start and end elements. This can't be done by the insert function though, we don't have enough context there. A possible solution might be to implement a ->validate() callback similar to what's done for chains -- or maybe export the context to insert functions so that we can relate stuff to portid/seq. Do you think it's worth it? In some sense, this should already be all consistent and safe. --=20 Stefano