From: Peter Xu
To: Jann Horn
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, "Liam R. Howlett",
    Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] mm/memory: Document how we make a coherent memory snapshot
Date: Wed, 4 Jun 2025 16:10:04 -0400
References: <20250603-fork-tearing-v1-0-a7f64b7cfc96@google.com>
    <20250603-fork-tearing-v1-2-a7f64b7cfc96@google.com>

On Wed, Jun 04, 2025 at 08:11:08PM
+0200, Jann Horn wrote:
> On Wed, Jun 4, 2025 at 7:04 PM Peter Xu wrote:
> > On Tue, Jun 03, 2025 at 08:21:03PM +0200, Jann Horn wrote:
> > > It is not currently documented that the child of fork() should receive a
> > > coherent snapshot of the parent's memory, or how we get such a snapshot.
> > > Add a comment block to explain this.
> > >
> > > Signed-off-by: Jann Horn
> > > ---
> > >  kernel/fork.c | 34 ++++++++++++++++++++++++++++++++++
> > >  1 file changed, 34 insertions(+)
> > >
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 85afccfdf3b1..f78f5df596a9 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -604,6 +604,40 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
> > >  }
> > >
> > >  #ifdef CONFIG_MMU
> > > +/*
> > > + * Anonymous memory inherited by the child MM must, on success, contain a
> > > + * coherent snapshot of corresponding anonymous memory in the parent MM.
> >
> > Should we better define what is a coherent snapshot?  Or maybe avoid using
> > this term which seems to apply to the whole mm?
> >
> > I think it's at least not a snapshot of whole mm at a specific time,
> > because as long as there can be more than one concurrent writers (hence, it
> > needs to be at least 3 threads in the parent process, 1 in charge of fork),
> > this can happen:
> >
> >   parent writer 1      parent writer 2      parent fork thr
> >   ---------------      ---------------      ---------------
> >                                             wr-protect P1
> >   write P1                                                  <---- T1
> >   (trapped, didn't happen)
> >                        write PN                             <---- T2
> >                        (went through)
> >                                             ...
> >                                             wr-protect PN
> >
> > The result of above would be that child process will see a mixture of old
> > P1 (at timestamp T1) but updated P2 (timestamp T2).  I don't think it's
> > impossible that the userapp could try to serialize "write P1" and "write
> > PN" operations in a way that it would also get a surprise seeing in the
> > child PN updated but P1 didn't.
>
> If the write at T1 hits a page fault, then it doesn't actually happen
> at T1. The write instruction starts doing something at T1, but it does
> not fully retire, and the architectural register state does not
> change, and in particular the instruction pointer does not advance
> past this instruction; just like when speculative execution is aborted
> after a branch misprediction, except that the CPU raises an exception
> and we enter the page fault handler. The write actually happens when
> the instruction is executed a second time after page fault handling
> has completed after the mmap lock is dropped. (Unless something during
> page fault handling raises a signal, in which case the instruction
> might never architecturally execute.)

Fair enough.  So maybe it's more like a best-effort snapshot of the whole
mm, taken at some point during fork() but before the mmap write lock is
released.

Your comment did mention one exception on the kernel side; is it still
fairly easy to hit?  I'm thinking of a use case that loads some data from
an O_DIRECT fd and then sets a variable to show that it's loaded:

  bool data_read = false;

  read(...);
  data_read = true;

Then IIUC this can happen:

  parent thread 1                      parent fork thr
  ---------------                      ---------------
  read(...)
    using O_DIRECT on priv-anon buffers P1
    pin_user_pages
                                       fork() happens
                                         Sees P1 pinned
                                         P1 early COW
                                           (child sees no data loaded)
    memcpy()
  set data_read=1
    (data_read can be a global private var on P2)
                                         P2 wr-protected
                                           (child sees data_read=1)

Hence even if the child sees data_read=1, the buffer may be uninitialized
or only partly loaded, because the read is still racing with the kernel's
early COW.

I'm not sure whether I understood it correctly this time, as you discussed
in the comment.  If so, should we avoid putting too much emphasis on the
kernel providing a coherent mm snapshot, and at least emphasize the
best-effort part (both in the comment of patch 2 and in the patch subjects)?
After all, it doesn't seem straightforward for any userapp to tell when
that coherency will be violated.

From that POV, maybe it's better to still call this undefined behavior,
even if it recovers the old behavior for some existing use cases?

Thanks,

-- 
Peter Xu
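
For reference, the userspace side of the data_read pattern discussed above
could look roughly like the sketch below.  This is only an illustration
under stated assumptions, not code from the patch or from any real
application: struct loader, load_data(), fork_and_check(), enum child_view
and BUF_SIZE are all hypothetical names, and the file is assumed to be
filled with a single known byte so the child can tell a full load from a
partial one.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define BUF_SIZE 4096	/* hypothetical size; one page on many systems */

/* O_DIRECT transfers usually need sector-aligned lengths. */
static_assert(BUF_SIZE % 512 == 0, "BUF_SIZE should be sector-aligned");

/* What the child observed right after fork(). */
enum child_view {
	VIEW_NOT_LOADED = 0,	/* data_read is false */
	VIEW_LOADED = 1,	/* data_read is true and the buffer is filled */
	VIEW_TORN = 2,		/* data_read is true but the buffer is not filled */
	VIEW_ERROR = 3,		/* fork() or waitpid() failed */
};

struct loader {
	int fd;				/* for the scenario above: an O_DIRECT fd */
	unsigned char *buf;		/* private anonymous mapping ("P1") */
	volatile bool *data_read;	/* flag in a separate mapping ("P2") */
};

/* Map len bytes of zero-filled private anonymous memory, NULL on failure. */
static void *map_private(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/* Set up both mappings; returns 0 on success, -1 on failure. */
static int loader_init(struct loader *l, int fd)
{
	l->fd = fd;
	l->buf = map_private(BUF_SIZE);
	l->data_read = map_private(sizeof(bool));
	return (l->buf && l->data_read) ? 0 : -1;
}

/* "parent thread 1": read(...) into P1, then set data_read on P2. */
static int load_data(struct loader *l)
{
	ssize_t n = read(l->fd, l->buf, BUF_SIZE);

	if (n != BUF_SIZE)
		return -1;
	*l->data_read = true;
	return 0;
}

/* "parent fork thr": fork and report what the child sees in its copy. */
static enum child_view fork_and_check(const struct loader *l,
				      unsigned char expect)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return VIEW_ERROR;
	if (pid == 0) {
		/* Child: inspect its own view of P2 first, then P1. */
		if (!*l->data_read)
			_exit(VIEW_NOT_LOADED);
		for (size_t i = 0; i < BUF_SIZE; i++)
			if (l->buf[i] != expect)
				_exit(VIEW_TORN);
		_exit(VIEW_LOADED);
	}
	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return VIEW_ERROR;
	return (enum child_view)WEXITSTATUS(status);
}
```

Called sequentially from one thread, fork_and_check() can only report
VIEW_NOT_LOADED (before load_data()) or VIEW_LOADED (after it).  The case in
question, VIEW_TORN, would need load_data() running in another thread on an
fd opened with open(path, O_RDONLY | O_DIRECT) while fork_and_check() runs
concurrently; whether the child can really observe it is the question raised
above, and the sketch does not answer it.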