From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.3 required=3.0 tests=DKIM_INVALID,DKIM_SIGNED, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY, SPF_PASS,URIBL_BLOCKED,USER_AGENT_MUTT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3A8F4C169C4 for ; Thu, 31 Jan 2019 10:48:09 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id E1697218AC for ; Thu, 31 Jan 2019 10:48:08 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b="Tv2DuAb2" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728067AbfAaKsH (ORCPT ); Thu, 31 Jan 2019 05:48:07 -0500 Received: from merlin.infradead.org ([205.233.59.134]:59970 "EHLO merlin.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725766AbfAaKsH (ORCPT ); Thu, 31 Jan 2019 05:48:07 -0500 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=merlin.20170209; h=In-Reply-To:Content-Type:MIME-Version: References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id: List-Help:List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive; bh=lvQ/IESBRKTO8KjxyyLAk6JtGwKGlOSLfyHLAc1sZEM=; b=Tv2DuAb2Wyz8uFPyzEa2GCjD/ LHS9u+pRvkM9MPprvn/K/WITPBQIqahlEcF/mOh/EpVi4gmrf735WQaExvARME3OPwHQ6yQbUk7jY D6/x/yL+reeyccEZpBdNSdShZ6P5Q8qAlVfBOKNVsuFJDXSpRsIas/RfrW1JAf5HX0ac9GmprNa5t irHP1kXSjeNZvH1sgNoumU4WdJOm2A0nUNZs3LvRIf7VYy0nK6TCfyVdDjcMjYAilWB8poMRTEShK /DsRA7sC3nq/G8At5EKpWG4bQ8Bx2vbfDS21Y00iHdpkyhRl16odrzHSe3T6V5yCsyPofudBpql2c fnTIzM1Kg==; Received: from j217100.upc-j.chello.nl ([24.132.217.100] helo=hirez.programming.kicks-ass.net) by merlin.infradead.org with esmtpsa (Exim 4.90_1 #2 (Red Hat Linux)) id 1gp9sx-0004Ny-Oy; Thu, 31 Jan 2019 10:47:56 +0000 Received: by hirez.programming.kicks-ass.net (Postfix, from userid 1000) id 5AC0B202E51E4; Thu, 31 Jan 2019 11:47:54 +0100 (CET) Date: Thu, 31 Jan 2019 11:47:54 +0100 From: Peter Zijlstra To: Alexei Starovoitov Cc: davem@davemloft.net, daniel@iogearbox.net, jannh@google.com, netdev@vger.kernel.org, kernel-team@fb.com Subject: Re: [PATCH v6 bpf-next 1/9] bpf: introduce bpf_spin_lock Message-ID: <20190131104754.GA31516@hirez.programming.kicks-ass.net> References: <20190131050546.3634009-1-ast@kernel.org> <20190131050546.3634009-2-ast@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20190131050546.3634009-2-ast@kernel.org> User-Agent: Mutt/1.10.1 (2018-07-13) Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org On Wed, Jan 30, 2019 at 09:05:38PM -0800, Alexei Starovoitov wrote: > Introduce 'struct bpf_spin_lock' and bpf_spin_lock/unlock() helpers to let > bpf program serialize access to other variables. > > Example: > struct hash_elem { > int cnt; > struct bpf_spin_lock lock; > }; > struct hash_elem * val = bpf_map_lookup_elem(&hash_map, &key); > if (val) { > bpf_spin_lock(&val->lock); > val->cnt++; > bpf_spin_unlock(&val->lock); > } > > Restrictions and safety checks: > - bpf_spin_lock is only allowed inside HASH and ARRAY maps. > - BTF description of the map is mandatory for safety analysis. > - bpf program can take one bpf_spin_lock at a time, since two or more can > cause dead locks. > - only one 'struct bpf_spin_lock' is allowed per map element. > It drastically simplifies implementation yet allows bpf program to use > any number of bpf_spin_locks. > - when bpf_spin_lock is taken the calls (either bpf2bpf or helpers) are not allowed. > - bpf program must bpf_spin_unlock() before return. > - bpf program can access 'struct bpf_spin_lock' only via > bpf_spin_lock()/bpf_spin_unlock() helpers. > - load/store into 'struct bpf_spin_lock lock;' field is not allowed. > - to use bpf_spin_lock() helper the BTF description of map value must be > a struct and have 'struct bpf_spin_lock anyname;' field at the top level. > Nested lock inside another struct is not allowed. > - syscall map_lookup doesn't copy bpf_spin_lock field to user space. > - syscall map_update and program map_update do not update bpf_spin_lock field. > - bpf_spin_lock cannot be on the stack or inside networking packet. > bpf_spin_lock can only be inside HASH or ARRAY map value. > - bpf_spin_lock is available to root only and to all program types. > - bpf_spin_lock is not allowed in inner maps of map-in-map. > - ld_abs is not allowed inside spin_lock-ed region. > - tracing progs and socket filter progs cannot use bpf_spin_lock due to > insufficient preemption checks > > Implementation details: > - cgroup-bpf class of programs can nest with xdp/tc programs. > Hence bpf_spin_lock is equivalent to spin_lock_irqsave. > Other solutions to avoid nested bpf_spin_lock are possible. > Like making sure that all networking progs run with softirq disabled. > spin_lock_irqsave is the simplest and doesn't add overhead to the > programs that don't use it. > - arch_spinlock_t is used when its implemented as queued_spin_lock > - archs can force their own arch_spinlock_t > - on architectures where queued_spin_lock is not available and > sizeof(arch_spinlock_t) != sizeof(__u32) trivial lock is used. > - presence of bpf_spin_lock inside map value could have been indicated via > extra flag during map_create, but specifying it via BTF is cleaner. > It provides introspection for map key/value and reduces user mistakes. > > Next steps: > - allow bpf_spin_lock in other map types (like cgroup local storage) > - introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper > to request kernel to grab bpf_spin_lock before rewriting the value. > That will serialize access to map elements. > > Signed-off-by: Alexei Starovoitov Acked-by: Peter Zijlstra (Intel) Small nit below. > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > index a74972b07e74..29a8a4e62b8a 100644 > --- a/kernel/bpf/helpers.c > +++ b/kernel/bpf/helpers.c > @@ -221,6 +221,86 @@ const struct bpf_func_proto bpf_get_current_comm_proto = { > .arg2_type = ARG_CONST_SIZE, > }; > > +#if defined(CONFIG_QUEUED_SPINLOCKS) || defined(CONFIG_BPF_ARCH_SPINLOCK) > + > +static inline void __bpf_spin_lock(struct bpf_spin_lock *lock) > +{ > + arch_spinlock_t *l = (void *)lock; > + union { > + __u32 val; > + arch_spinlock_t lock; > + } u = { .lock = __ARCH_SPIN_LOCK_UNLOCKED }; > + > + compiletime_assert(u.val == 0, "__ARCH_SPIN_LOCK_UNLOCKED not 0"); > + BUILD_BUG_ON(sizeof(*l) != sizeof(__u32)); > + BUILD_BUG_ON(sizeof(*lock) != sizeof(__u32)); > + arch_spin_lock(l); > +} > + > +static inline void __bpf_spin_unlock(struct bpf_spin_lock *lock) > +{ > + arch_spinlock_t *l = (void *)lock; > + > + arch_spin_unlock(l); > +} > + > +#else > + > +static inline void __bpf_spin_lock(struct bpf_spin_lock *lock) > +{ > + atomic_t *l = (void *)lock; > + > + BUILD_BUG_ON(sizeof(*l) != sizeof(*lock)); > + do { > + atomic_cond_read_relaxed(l, !VAL); > + } while (atomic_xchg(l, 1)); > +} > + > +static inline void __bpf_spin_unlock(struct bpf_spin_lock *lock) > +{ > + atomic_t *l = (void *)lock; > + > + atomic_set_release(l, 0); > +} > + > +#endif > + > +static DEFINE_PER_CPU(unsigned long, irqsave_flags); > + > +notrace BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock *, lock) > +{ > + unsigned long flags; > + > + local_irq_save(flags); > + __bpf_spin_lock(lock); > + this_cpu_write(irqsave_flags, flags); __this_cpu_write() > + return 0; > +} > + > +const struct bpf_func_proto bpf_spin_lock_proto = { > + .func = bpf_spin_lock, > + .gpl_only = false, > + .ret_type = RET_VOID, > + .arg1_type = ARG_PTR_TO_SPIN_LOCK, > +}; > + > +notrace BPF_CALL_1(bpf_spin_unlock, struct bpf_spin_lock *, lock) > +{ > + unsigned long flags; > + > + flags = this_cpu_read(irqsave_flags); __this_cpu_read() > + __bpf_spin_unlock(lock); > + local_irq_restore(flags); > + return 0; > +} > + > +const struct bpf_func_proto bpf_spin_unlock_proto = { > + .func = bpf_spin_unlock, > + .gpl_only = false, > + .ret_type = RET_VOID, > + .arg1_type = ARG_PTR_TO_SPIN_LOCK, > +}; > + > #ifdef CONFIG_CGROUPS > BPF_CALL_0(bpf_get_current_cgroup_id) > {