From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6CD16C46CD2 for ; Tue, 30 Jan 2024 08:55:44 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 041CA6B0075; Tue, 30 Jan 2024 03:55:44 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id F34696B0082; Tue, 30 Jan 2024 03:55:43 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id E24726B008C; Tue, 30 Jan 2024 03:55:43 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id D2FBC6B0075 for ; Tue, 30 Jan 2024 03:55:43 -0500 (EST) Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id A0A31A1B27 for ; Tue, 30 Jan 2024 08:55:43 +0000 (UTC) X-FDA: 81735369366.05.897292F Received: from sin.source.kernel.org (sin.source.kernel.org [145.40.73.55]) by imf11.hostedemail.com (Postfix) with ESMTP id 590494000E for ; Tue, 30 Jan 2024 08:55:40 +0000 (UTC) Authentication-Results: imf11.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=KiNieTQC; spf=pass (imf11.hostedemail.com: domain of rppt@kernel.org designates 145.40.73.55 as permitted sender) smtp.mailfrom=rppt@kernel.org; dmarc=pass (policy=none) header.from=kernel.org ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1706604942; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=KtA6kMbr7Sbr0nrBqLzlwUEFNxZkVrnMZ0qmMciR898=; b=MEZpPWqb7tSst7HwCTsNeMS6bfrx8puVx9f0VlJSNvkMVzuFPTIoxiWD2Qq5Gcz9maPSYa D4x5l/rfcc89BXbGzoA0Dm1zF2c4RjeDYcbWB5cdNATrgcWiRZrUO3U/N7dNJFo79guYSK /AhJRdsdA9QwO+AQ45+oovCfTkb5u78= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1706604942; a=rsa-sha256; cv=none; b=Wf7A1j/x7718vb63YsJOoxjkA4s3/2/SE+EERkpKpDtS8Uho7vBh4fCbhrZhUGhemMMegr jYMkHAzQl4gSpUAI0Z+eB17gUTK/6hI1+MuO8CckoM26jQ+JNyVF7MH3owDdfUzM6ddn0b TsUEdhBi2BoOo/WxPpSOSYdnErs/W8g= ARC-Authentication-Results: i=1; imf11.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=KiNieTQC; spf=pass (imf11.hostedemail.com: domain of rppt@kernel.org designates 145.40.73.55 as permitted sender) smtp.mailfrom=rppt@kernel.org; dmarc=pass (policy=none) header.from=kernel.org Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58]) by sin.source.kernel.org (Postfix) with ESMTP id 93833CE1106; Tue, 30 Jan 2024 08:55:36 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id E74A2C433C7; Tue, 30 Jan 2024 08:55:29 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1706604935; bh=DYJUz9kKcGLJZ98tsI6Vy8dWSmkL5CVjrSjHuwrhBFc=; h=Date:From:To:Subject:References:In-Reply-To:From; b=KiNieTQCpZmVleFrROUWQu0NDvzUVlu47fVYNoOPNVEfF0RaUAPUbFJ8ELRE9xEBl iRoWKAiUOKgCHz4ay+Sa350xqxXAh+L235WWsJfb6clIej45A7owi8nddPP4N72CoN ZAk9eZUy1TV75OZZc7SxSMGyasVnSj8SrxMDfqiRIK96IFxnoOFoA/pJWx9ST6tT27 KuI3j0y5Jez8mpoPe5DrdudQaGJ30YQAI+jOIaSCkTpivdD1hEPfm2cZtGSX1apj9N BCOtGf1fIXrl0fhc/KMELNpqB4Y7iN7kPqi4eiu4VfzNpx/drxCnUEZo0dONjuNzRB gXOA3svyctVVQ== Date: Tue, 30 Jan 2024 10:55:09 +0200 From: Mike Rapoport To: "Liam R. Howlett" , Lokesh Gidra , akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, selinux@vger.kernel.org, surenb@google.com, kernel-team@android.com, aarcange@redhat.com, peterx@redhat.com, david@redhat.com, axelrasmussen@google.com, bgeffon@google.com, willy@infradead.org, jannh@google.com, kaleshsingh@google.com, ngeoffray@google.com, timmurray@google.com Subject: Re: [PATCH v2 2/3] userfaultfd: protect mmap_changing with rw_sem in userfaulfd_ctx Message-ID: References: <20240129193512.123145-1-lokeshgidra@google.com> <20240129193512.123145-3-lokeshgidra@google.com> <20240129210014.troxejbr3mzorcvx@revolver> <20240130034627.4aupq27mksswisqg@revolver> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20240130034627.4aupq27mksswisqg@revolver> X-Rspamd-Queue-Id: 590494000E X-Rspam-User: X-Rspamd-Server: rspam11 X-Stat-Signature: kcdhiq88i8q3qk4qg3xi1gannd3znpty X-HE-Tag: 1706604940-685586 X-HE-Meta: U2FsdGVkX19yYdMCUbmlTdnp9V1PLbsOimwYIHspsBcEoVQps1sqQsnkXnAlI4UgQS2i4lFVfba2eX0vzYbEp9W/PMEl8o8uleKTjgDQw4ncr3j41BNXtrwsWyFqGcArfkHoZhl5xrtV/23rJXbxVbAWHz523INFVR2PD8TIkiZRSOqbX43eEAl3iyUPFijnr1gngtKYwmBrdP+fKPzWaLd3k/3ERJu6ii7v7iJ6AmwpW011SHPO0H5T3QxEj2FBCOIs5+PKFOdvoNzNRp3ef5sIG63gFKK9nnlEMqeN++q9+fQhDXPPExiUPLmEztnFFoKBay6zAuytIiW9CEiesLTvrir5sBlXeyfFmRMUam/04DUEti8W3q3wVjdP866Yvr1NOPyWH3P7OG/vc+FO7ZrYUzkRAy2y/d2BB+56SEKWcmtmzUXsfu59qxXwEyA0wT1leyEZPSiyGUzHUwjEDjpRSntPPjf1ZMtHU9PMC/bEntFmsZYUkcMs99bhlBisr15I0sOONGep6mcFmmQZeMZ75ZzORA0Pvbs6vMaULaEtLmTDmEXyFCsfay4ZbZYciViTdK8Fm9zccgnG2CIBW3mMhxe6wCjxZd+OssG4GOjEIxvQDEJwk/Sre66Ec1RbJC2IIjveKolCyhQ7m0qEgfOrOxVbh8Xa4f7YWMGfdOpekt+acD6cZyTj587SNe+g1WlcheHjE45BMIHbR9mHnrRAbUiYvZg+2Q71AjVBvJZoo9GSDMzBLsvSs0B5zY2b8znowqCNnr5PKECp161FINTfN5+1E7udFHcZa3lhUVoaXYjswsyURplck6dtlr6jYxPy0cJRu7ZEeHgDP20M6Qp/2feMpRErzx+YWZ95BSbiYfdIllf1n6Lwy5TDTZHR35ABueL4iHN2EzpUHHNZ6SWXYs0+VV54Twlmk8GV/C/GTrL/u3fCkSakHKlZfLVMaTIoZsQDm4b9kYymh2R Yh3byZIq VCDyxivuh6woYz1J3Wmjyeu7CAy3MzrsGfnkTpdpfbbUCOVfLMHMXIdt4I775r/0aS9UlrhtgxxQoLt2zRSASfS/DV2jPtYIJbRBR54wrlGNrG4kR8KoFAfMzxot4oKilym3ZJ5+6AVKTRF0VCK9kW9Eh6ZC32Z3Q5P051MLR4XBuw78jJHTKPCn/3nj3NrX86h9RRY4hnXJ7XDCyi5cEeL+9zePR1qQg1BvmGd5hk7Pqe0AhXhO2Y5qsFvxBc+oUwcHraNc7W1uJwSOVJQFnuAWdbfRshr18YyetLnD722yxfYRG5obmoRPD9w== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, Jan 29, 2024 at 10:46:27PM -0500, Liam R. Howlett wrote: > * Lokesh Gidra [240129 17:35]: > > On Mon, Jan 29, 2024 at 1:00 PM Liam R. Howlett wrote: > > > > > > * Lokesh Gidra [240129 14:35]: > > > > Increments and loads to mmap_changing are always in mmap_lock > > > > critical section. > > > > > > Read or write? > > > > > It's write-mode when incrementing (except in case of > > userfaultfd_remove() where it's done in read-mode) and loads are in > > mmap_lock (read-mode). I'll clarify this in the next version. > > > > > > > This ensures that if userspace requests event > > > > notification for non-cooperative operations (e.g. mremap), userfaultfd > > > > operations don't occur concurrently. > > > > > > > > This can be achieved by using a separate read-write semaphore in > > > > userfaultfd_ctx such that increments are done in write-mode and loads > > > > in read-mode, thereby eliminating the dependency on mmap_lock for this > > > > purpose. > > > > > > > > This is a preparatory step before we replace mmap_lock usage with > > > > per-vma locks in fill/move ioctls. > > > > > > > > Signed-off-by: Lokesh Gidra > > > > --- > > > > fs/userfaultfd.c | 40 ++++++++++++---------- > > > > include/linux/userfaultfd_k.h | 31 ++++++++++-------- > > > > mm/userfaultfd.c | 62 ++++++++++++++++++++--------------- > > > > 3 files changed, 75 insertions(+), 58 deletions(-) > > > > > > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > > > > index 58331b83d648..c00a021bcce4 100644 > > > > --- a/fs/userfaultfd.c > > > > +++ b/fs/userfaultfd.c > > > > @@ -685,12 +685,15 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs) > > > > ctx->flags = octx->flags; > > > > ctx->features = octx->features; > > > > ctx->released = false; > > > > + init_rwsem(&ctx->map_changing_lock); > > > > atomic_set(&ctx->mmap_changing, 0); > > > > ctx->mm = vma->vm_mm; > > > > mmgrab(ctx->mm); > > > > > > > > userfaultfd_ctx_get(octx); > > > > + down_write(&octx->map_changing_lock); > > > > atomic_inc(&octx->mmap_changing); > > > > + up_write(&octx->map_changing_lock); > > On init, I don't think taking the lock is strictly necessary - unless > there is a way to access it before this increment? Not that it would > cost much. It's fork, the lock is for the context of the parent process and there could be uffdio ops running in parallel on its VM. > > > You could use the first bit of the atomic_inc as indication of a write. > > > So if the mmap_changing is even, then there are no writers. If it > > > didn't change and it's even then you know no modification has happened > > > (or it overflowed and hit the same number which would be rare, but > > > maybe okay?). > > > > This is already achievable, right? If mmap_changing is >0 then we know > > there are writers. The problem is that we want writers (like mremap > > operations) to block as long as there is a userfaultfd operation (also > > reader of mmap_changing) going on. Please note that I'm inferring this > > from current implementation. > > > > AFAIU, mmap_changing isn't required for correctness, because all > > operations are happening under the right mode of mmap_lock. It's used > > to ensure that while a non-cooperative operations is happening, if the > > user has asked it to be notified, then no other userfaultfd operations > > should take place until the user gets the event notification. > > I think it is needed, mmap_changing is read before the mmap_lock is > taken, then compared after the mmap_lock is taken (both read mode) to > ensure nothing has changed. mmap_changing is required to ensure that no uffdio operation runs in parallel with operations that modify the memory map, like fork, mremap, munmap and some of madvise calls. And we do need the writers to block if there is an uffdio operation going on, so I think an rwsem is the right way to protect mmap_chaniging. > > > > @@ -783,7 +788,9 @@ bool userfaultfd_remove(struct vm_area_struct *vma, > > > > return true; > > > > > > > > userfaultfd_ctx_get(ctx); > > > > + down_write(&ctx->map_changing_lock); > > > > atomic_inc(&ctx->mmap_changing); > > > > + up_write(&ctx->map_changing_lock); > > > > mmap_read_unlock(mm); > > > > > > > > msg_init(&ewq.msg); > > If this happens in read mode, then why are you waiting for the readers > to leave? Can't you just increment the atomic? It's fine happening in > read mode today, so it should be fine with this new rwsem. It's been a while and the details are blurred now, but if I remember correctly, having this in read mode forced non-cooperative uffd monitor to be single threaded. If a monitor runs, say uffdio_copy, and in parallel a thread in the monitored process does MADV_DONTNEED, the latter will wait for userfaultfd_remove notification to be processed in the monitor and drop the VMA contents only afterwards. If a non-cooperative monitor would process notification in parallel with uffdio ops, MADV_DONTNEED could continue and race with uffdio_copy, so read mode wouldn't be enough. There was no much sense to make MADV_DONTNEED take mmap_lock in write mode just for this, but now taking the rwsem in write mode here sounds reasonable. > Thanks, > Liam > > ... -- Sincerely yours, Mike.