From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Barry Song <21cnbao@gmail.com>, Peter Xu,
    Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand,
    Lokesh Gidra, stable@vger.kernel.org, linux-kernel@vger.kernel.org,
    Kairui Song
Subject: [PATCH v3] mm: userfaultfd: fix race of userfaultfd_move and swap cache
Date: Tue, 3 Jun 2025 02:14:19 +0800
Message-ID: <20250602181419.20478-1-ryncsn@gmail.com>
X-Mailer: git-send-email 2.49.0
Reply-To: Kairui Song <ryncsn@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

From: Kairui Song

On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
lookup, and tries to move the found folio to the faulting vma.
Currently, it relies on checking the PTE value to ensure that the moved
folio still belongs to the src swap entry, which turns out to be
unreliable.

While working on and reviewing the swap table series with Barry, the
following existing race was observed and reproduced [1]:

(move_pages_pte is moving src_pte to dst_pte, where src_pte is a
swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)

CPU1                               CPU2
userfaultfd_move
  move_pages_pte()
    entry = pte_to_swp_entry(orig_src_pte);
    // Here it got entry = S1
    ... < Somehow interrupted> ...
                                   <swapin src_pte, alloc and use folio A>
                                   // folio A is a newly allocated folio
                                   // and gets installed into src_pte
                                   <frees swap entry S1>
                                   // src_pte now points to folio A, S1
                                   // has swap count == 0, it can be freed
                                   // by folio_free_swap or the swap
                                   // allocator's reclaim.
                                   <try to swap out another folio B>
                                   // folio B is a folio in another VMA.
                                   <put folio B to swap cache using S1>
                                   // S1 is freed, folio B can use it
                                   // for swap out with no problem.
                                   ...
    folio = filemap_get_folio(S1)
    // Got folio B here !!!
    ... < Somehow interrupted again> ...
                                   <swapin folio B and free S1>
                                   // Now S1 is free to be used again.
                                   <swapout src_pte & folio A using S1>
                                   // Now src_pte is a swap entry PTE
                                   // holding S1 again.
    folio_trylock(folio)
    move_swap_pte
      double_pt_lock
      is_pte_pages_stable
      // Check passed because src_pte == S1
      folio_move_anon_rmap(...)
      // Moved invalid folio B here !!!

The race window is very short and requires multiple rare events to
collide, so it's very unlikely to happen, but with a deliberately
constructed reproducer and an increased time window, it can be
reproduced [1].

It's also possible that folio (A) is swapped in, and swapped out again
after the filemap_get_folio lookup, in which case folio (A) may stay
in the swap cache and needs to be moved too. In this case the kernel
should also retry, so it won't miss a folio move.

Fix this by checking that the folio is still the valid swap cache folio
after acquiring the folio lock, and checking the swap cache again after
acquiring the src_pte lock.

The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
far we don't need to worry about that, since folios can only get
exposed to the swap cache in the swap-out path, and that is covered by
this patch too, by checking the swap cache again after acquiring the
src_pte lock.

Testing with a simple C program that allocates and moves several GB of
memory did not show any observable performance change.
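
For illustration, below is a minimal sketch of that kind of test
program (a reconstruction for reference, not the exact program used:
TEST_SIZE, the fill pattern, and the single-shot move are arbitrary
choices, and error handling is trimmed). It allocates an anonymous
buffer, faults it in, then moves the whole range to a registered
destination with UFFDIO_MOVE; it needs Linux 6.8+ headers and kernel:

/* uffd_move_test.c - exercise UFFDIO_MOVE over a large anon range */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>

#define TEST_SIZE	(1UL << 30)	/* 1 GB; scale up to several GB */

int main(void)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = { 0 };
	struct uffdio_move move = { 0 };
	char *src, *dst;
	int uffd;

	uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api)) {
		perror("userfaultfd");
		return 1;
	}

	src = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	dst = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED || dst == MAP_FAILED)
		return 1;

	/* UFFDIO_MOVE requires the destination range to be registered */
	reg.range.start = (unsigned long)dst;
	reg.range.len = TEST_SIZE;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg)) {
		perror("UFFDIO_REGISTER");
		return 1;
	}

	/* Fault in the source pages so there is something to move */
	memset(src, 0x5a, TEST_SIZE);

	/* Move the whole range; move.move reports bytes actually moved */
	move.dst = (unsigned long)dst;
	move.src = (unsigned long)src;
	move.len = TEST_SIZE;
	move.mode = 0;
	if (ioctl(uffd, UFFDIO_MOVE, &move) || move.move != TEST_SIZE) {
		perror("UFFDIO_MOVE");
		return 1;
	}

	return 0;
}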
Cc: <stable@vger.kernel.org>
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
Signed-off-by: Kairui Song
---
V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
Changes:
- Check swap_map instead of doing a filemap lookup after acquiring the
  PTE lock to minimize critical section overhead [ Barry Song,
  Lokesh Gidra ]

V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
Changes:
- Move the folio and swap check inside move_swap_pte to avoid skipping
  the check and potential overhead [ Lokesh Gidra ]
- Add a READ_ONCE for the swap_map read to ensure it reads an
  up-to-date value.

 mm/userfaultfd.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..5dc05346e360 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
 			 pte_t orig_dst_pte, pte_t orig_src_pte,
 			 pmd_t *dst_pmd, pmd_t dst_pmdval,
 			 spinlock_t *dst_ptl, spinlock_t *src_ptl,
-			 struct folio *src_folio)
+			 struct folio *src_folio,
+			 struct swap_info_struct *si, swp_entry_t entry)
 {
+	/*
+	 * Check if the folio still belongs to the target swap entry after
+	 * acquiring the lock. Folio can be freed in the swap cache while
+	 * not locked.
+	 */
+	if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
+				  entry.val != src_folio->swap.val))
+		return -EAGAIN;
+
 	double_pt_lock(dst_ptl, src_ptl);
 
 	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
@@ -1102,6 +1112,15 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
 	if (src_folio) {
 		folio_move_anon_rmap(src_folio, dst_vma);
 		src_folio->index = linear_page_index(dst_vma, dst_addr);
+	} else {
+		/*
+		 * Check if the swap entry is cached after acquiring the src_pte
+		 * lock. Or we might miss a newly loaded swap cache folio.
+		 */
+		if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			return -EAGAIN;
+		}
 	}
 
 	orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
@@ -1412,7 +1431,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 		}
 		err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
 				orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
-				dst_ptl, src_ptl, src_folio);
+				dst_ptl, src_ptl, src_folio, si, entry);
 	}
 
 out:
-- 
2.49.0