From: Kairui Song <ryncsn@gmail.com>
Date: Sat, 31 May 2025 14:36:11 +0800
Subject: Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
To: Lokesh Gidra
Cc: Barry Song <21cnbao@gmail.com>, linux-mm@kvack.org, Andrew Morton, Peter Xu,
 Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand,
 stable@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20250530201710.81365-1-ryncsn@gmail.com>

On Sat, May 31, 2025 at 2:10 PM Lokesh Gidra wrote:
>
> On Fri, May 30, 2025 at 9:42 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, May 31, 2025 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sat, May 31, 2025 at 8:17 AM Kairui Song wrote:
> > > >
> > > > From: Kairui Song
> > > >
> > > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > > > lookup, and tries to move the found folio to the faulting vma. Currently,
> > > > it relies on the PTE value check to ensure the moved folio still belongs
> > > > to the src swap entry, which turns out to be unreliable.
> > > >
> > > > While working on and reviewing the swap table series with Barry, the
> > > > following existing race was observed and reproduced [1]:
> > > >
> > > > (move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > > >  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> > > >
> > > > CPU1                               CPU2
> > > > userfaultfd_move
> > > >   move_pages_pte()
> > > >     entry = pte_to_swp_entry(orig_src_pte);
> > > >     // Here it got entry = S1
> > > >     ... <Somehow interrupted> ...
> > > >                                    <swap in src_pte, alloc and use folio A>
> > > >                                    // folio A is just a new allocated folio
> > > >                                    // and gets installed into src_pte
> > > >                                    <frees swap entry S1>
> > > >                                    // src_pte now points to folio A, S1
> > > >                                    // has swap count == 0, it can be freed
> > > >                                    // by folio_free_swap or the swap
> > > >                                    // allocator's reclaim.
> > > >                                    <try to swap out another folio B>
> > > >                                    // folio B is a folio in another VMA.
> > > >                                    <put folio B into swap cache using S1>
> > > >                                    // S1 is freed, folio B could use it
> > > >                                    // for swap out with no problem.
> > > >     ...
> > > >     folio = filemap_get_folio(S1)
> > > >     // Got folio B here !!!
> > > >     ... <Somehow interrupted again> ...
> > > >                                    <swap in folio B, free S1>
> > > >                                    // Now S1 is free to be used again.
> > > >                                    <swap out src_pte and folio A using S1>
> > > >                                    // Now src_pte is a swap entry pte
> > > >                                    // holding S1 again.
> > > >     folio_trylock(folio)
> > > >     move_swap_pte
> > > >       double_pt_lock
> > > >       is_pte_pages_stable
> > > >       // Check passed because src_pte == S1
> > > >       folio_move_anon_rmap(...)
> > > >       // Moved invalid folio B here !!!
> > > >
> > > > The race window is very short and requires multiple collisions of
> > > > multiple rare events, so it's very unlikely to happen, but with a
> > > > deliberately constructed reproducer and an increased time window, it
> > > > can be reproduced [1].
> > > >
> > > > It's also possible that folio (A) is swapped in, and swapped out again
> > > > after the filemap_get_folio lookup; in such a case folio (A) may stay
> > > > in the swap cache, so it needs to be moved too, and we should retry so
> > > > the kernel won't miss a folio move.
> > > >
> > > > Fix this by checking if the folio is the valid swap cache folio after
> > > > acquiring the folio lock, and checking the swap cache again after
> > > > acquiring the src_pte lock.
> > > >
> > > > The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
> > > > far we don't need to worry about it, since folios might only get exposed
> > > > to the swap cache in the swap-out path, and that is covered by this patch
> > > > too, by checking the swap cache again after acquiring the src_pte lock.
> > > >
> > > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > > > Signed-off-by: Kairui Song
> > > > ---
> > > >  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
> > > >  1 file changed, 26 insertions(+)
> > > >
> > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > index bc473ad21202..a1564d205dfb 100644
> > > > --- a/mm/userfaultfd.c
> > > > +++ b/mm/userfaultfd.c
> > > > @@ -15,6 +15,7 @@
> > > >  #include <linux/mmu_notifier.h>
> > > >  #include <linux/hugetlb.h>
> > > >  #include <linux/shmem_fs.h>
> > > > +#include <...>
> > > >  #include <asm/tlbflush.h>
> > > >  #include <asm/tlb.h>
> > > >  #include "internal.h"
> > > > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > >  			 spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > >  			 struct folio *src_folio)
> > > >  {
> > > > +	swp_entry_t entry;
> > > > +
> > > >  	double_pt_lock(dst_ptl, src_ptl);
> > > >
> > > >  	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > >  	if (src_folio) {
> > > >  		folio_move_anon_rmap(src_folio, dst_vma);
> > > >  		src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > +	} else {
> > > > +		/*
> > > > +		 * Check again after acquiring the src_pte lock. Or we might
> > > > +		 * miss a newly loaded swap cache folio.
> > > > +		 */
> > > > +		entry = pte_to_swp_entry(orig_src_pte);
> > > > +		src_folio = filemap_get_folio(swap_address_space(entry),
> > > > +					      swap_cache_index(entry));
> > > > +		if (!IS_ERR_OR_NULL(src_folio)) {
> > > > +			double_pt_unlock(dst_ptl, src_ptl);
> > > > +			folio_put(src_folio);
> > > > +			return -EAGAIN;
> > > > +		}
> > > >  	}
> > >
> > > step 1: src pte points to a swap entry without swapcache
> > > step 2: we call move_swap_pte()
> > > step 3: someone swaps in src_pte via swapin_readahead(), making src_pte's
> > >         swap entry have swapcache again - for non-sync/non-zRAM swap devices;
> > > step 4: move_swap_pte() gets the ptl, moves src_pte to dst_pte, and *clears*
> > >         src_pte;
> > > step 5: do_swap_page() for src_pte holds the ptl and finds the pte has been
> > >         cleared in step 4; pte_same() returns false;
> > > step 6: do_swap_page() won't map src_pte to the new swapcache got from step 3;
> > >         if the swapcache folio is dropped, it seems everything is fine.
> >
> > So the real issue is that do_swap_page() doesn't drop the new swapcache
> > even when pte_same() returns false? That means the dst_pte swap-in can
> > still hit the swap cache entry brought in by the src_pte's swap-in?
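For reference, the recheck in step 5 above is the pte_same() guard in
do_swap_page(). A simplified sketch of that swap-in path, not the exact
mainline code:

	/* After the folio has been read in and locked: */
	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
				       vmf->address, &vmf->ptl);
	if (unlikely(!vmf->pte ||
		     !pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
		/*
		 * Back off: the PTE changed under us. This path only
		 * unlocks and puts the folio; it does NOT remove it
		 * from the swap cache, so a later lookup of the same
		 * entry can still find it.
		 */
		goto out_nomap;
	}
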
> >
> > It seems also possible for the sync zRAM device:
> >
> > step 1: src pte points to a swap entry S without swapcache
> > step 2: we call move_swap_pte()
> > step 3: someone swaps in src_pte by the sync path, no swapcache; swap slot
> >         S is freed -- for zRAM;
> > step 4: someone swaps out src_pte, getting exactly the same swap slot S as
> >         in step 1, and adds the folio to the swapcache due to the swapout;
> > step 5: move_swap_pte() gets the ptl and finds the page tables stable, since
> >         the swap-out happens to have the same swap slot as step 1;
> > step 6: we clear src_pte and move src_pte to dst_pte, but miss moving the folio.
> >
> > Yep, we really need to re-check the pte for swapcache after holding the PTL.
>
> Any idea what is the overhead of filemap_get_folio()? In particular,
> when no folio exists for the given entry, how costly is it?
>
> Given how rare this is, unless filemap_get_folio() is expensive for the
> 'no folio' case, if there is no way to avoid calling it after holding the
> PTL, then we should do it only once, at that point. If a folio is
> returned, then, as in the pte_present() case, we attempt folio_trylock()
> with the PTL held; otherwise do the retry dance.

Yeah, I think filemap_get_folio is cheap: each swap cache address space is
at most 64M, so it just walks at most three xa_nodes and returns, without
involving any synchronization or writes. The swap cache lookup will get
even cheaper in the future, becoming just a check of one plain array
element.

I can try to fix this with the folio_trylock-inside-the-PTL approach; the
code may be cleaner that way.
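Untested sketch of what I mean, with both the lookup and the trylock done
under the PTL (reusing the names from the patch above; the rest of
move_swap_pte stays unchanged):

	double_pt_lock(dst_ptl, src_ptl);

	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
				 dst_pmd, dst_pmdval)) {
		double_pt_unlock(dst_ptl, src_ptl);
		return -EAGAIN;
	}

	/*
	 * Look up the swap cache only here, with the PTL held: the entry
	 * in src_pte can't be freed or reused under us, so a hit is
	 * guaranteed to be the folio backing src_pte.
	 */
	entry = pte_to_swp_entry(orig_src_pte);
	src_folio = filemap_get_folio(swap_address_space(entry),
				      swap_cache_index(entry));
	if (!IS_ERR_OR_NULL(src_folio)) {
		/* Can't sleep under the PTL, so trylock or retry the whole dance */
		if (!folio_trylock(src_folio)) {
			double_pt_unlock(dst_ptl, src_ptl);
			folio_put(src_folio);
			return -EAGAIN;
		}
		/* ... then move the folio just like the pte_present() case ... */
	}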