Date: Thu, 31 Aug 2023 15:52:49 +0100
From: Matthew Wilcox
To: Mirsad Todorovac
Cc: linux-kernel@vger.kernel.org, Andrew Morton, linux-mm@kvack.org,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	linux-nvme@lists.infradead.org
Subject: Re: BUG: KCSAN: data-race in folio_batch_move_lru / mpage_read_end_io

On Mon, Aug 28, 2023 at 11:14:23PM +0200, Mirsad Todorovac wrote:
> BUG: KCSAN: data-race in folio_batch_move_lru / mpage_read_end_io

This one's still niggling at me.  I've trimmed the timestamps and some
of the other irrelevant stuff out of this to make it easier to read.

> value changed: 0x0017ffffc0020001 -> 0x0017ffffc0020004

Notionally I understand this.  This is page->flags and the PG_locked bit
was set initially, but after a short delay PG_locked was cleared and
PG_uptodate was set.  That's _normal_.  For many, many pages, we set the
locked bit, initiate a read; the device does a DMA, sends an interrupt;
the interrupt handler sets the PG_uptodate bit and clears the PG_locked
bit to indicate the page is no longer under I/O.
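The flag transition described above can be checked by hand with a small userspace sketch. It is not kernel code. The bit positions (PG_locked at bit 0, PG_uptodate at bit 2) are an assumption inferred from the two values in the report and the description above, not copied from include/linux/page-flags.h, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed bit positions, inferred from the KCSAN "value changed" line.
 * They are illustrative and may not match every kernel version.
 */
enum {
	PG_LOCKED_BIT   = 0,
	PG_UPTODATE_BIT = 2,
};

/* Bits that differ between two snapshots of page->flags. */
uint64_t flags_changed(uint64_t before, uint64_t after)
{
	return before ^ after;
}

/* Non-zero if the given bit is set in a flags word. */
int flag_is_set(uint64_t flags, unsigned int bit)
{
	assert(bit < 64);
	return (flags & (UINT64_C(1) << bit)) != 0;
}

/*
 * True if the change is exactly "locked cleared, uptodate set" and
 * nothing else moved, which is what a normal read completion does.
 */
int is_read_completion(uint64_t before, uint64_t after)
{
	uint64_t expected = (UINT64_C(1) << PG_LOCKED_BIT) |
			    (UINT64_C(1) << PG_UPTODATE_BIT);

	return flags_changed(before, after) == expected &&
	       flag_is_set(before, PG_LOCKED_BIT) &&
	       !flag_is_set(after, PG_LOCKED_BIT) &&
	       flag_is_set(after, PG_UPTODATE_BIT);
}
```

With the two values from the report, flags_changed(0x0017ffffc0020001, 0x0017ffffc0020004) is 0x5, so only bits 0 and 2 moved and is_read_completion() returns true under the assumed layout.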
But what I don't understand is how we see this for _this_ page.

> write (marked) to 0xffffef9a44978bc0 of 8 bytes by interrupt on cpu 28:
> mpage_read_end_io (arch/x86/include/asm/bitops.h:55 include/asm-generic/bitops/instrumented-atomic.h:29 include/linux/page-flags.h:739 fs/mpage.c:55)
> bio_endio (block/bio.c:1617)
> blk_mq_end_request_batch (block/blk-mq.c:850 block/blk-mq.c:1088)
> nvme_pci_complete_batch (drivers/nvme/host/pci.c:986) nvme
> nvme_irq (drivers/nvme/host/pci.c:1086) nvme

This is the interrupt handler.  It's doing what it's supposed to; marking
the page uptodate and unlocking it.

> read to 0xffffef9a44978bc0 of 8 bytes by task 348 on cpu 12:
> folio_batch_move_lru (./include/linux/mm.h:1814 ./include/linux/mm.h:1824 ./include/linux/memcontrol.h:1636 ./include/linux/memcontrol.h:1659 mm/swap.c:216)
> folio_batch_add_and_move (mm/swap.c:235)
> folio_add_lru (./arch/x86/include/asm/preempt.h:95 mm/swap.c:518)
> folio_add_lru_vma (mm/swap.c:538)
> do_anonymous_page (mm/memory.c:4146)

This is the part I don't understand.
The path to calling folio_add_lru_vma() comes directly from
vma_alloc_zeroed_movable_folio():

	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
	if (!folio)
		goto oom;

	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
		goto oom_free_page;
	folio_throttle_swaprate(folio, GFP_KERNEL);

	__folio_mark_uptodate(folio);

	entry = mk_pte(&folio->page, vma->vm_page_prot);
	entry = pte_sw_mkyoung(entry);
	if (vma->vm_flags & VM_WRITE)
		entry = pte_mkwrite(pte_mkdirty(entry));

	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
			&vmf->ptl);
	if (!vmf->pte)
		goto release;
	if (vmf_pte_changed(vmf)) {
		update_mmu_tlb(vma, vmf->address, vmf->pte);
		goto release;
	}

	ret = check_stable_address_space(vma->vm_mm);
	if (ret)
		goto release;

	/* Deliver the page fault to userland, check inside PT lock */
	if (userfaultfd_missing(vma)) {
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		folio_put(folio);
		return handle_userfault(vmf, VM_UFFD_MISSING);
	}

	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
	folio_add_new_anon_rmap(folio, vma, vmf->address);
	folio_add_lru_vma(folio, vma);

(sorry that's a lot of lines).  But there's _nowhere_ there that sets
PG_locked.  It's a freshly allocated page; all page flags (that are
actually flags; ignore the stuff up at the top) should be clear.  We
even check that with PAGE_FLAGS_CHECK_AT_PREP.  Plus, it doesn't make
sense that we'd start I/O; the page is freshly allocated, full of zeroes;
there's no backing store to read the page from.

It really feels like this page was freed while it was still under I/O
and it's been reallocated to this victim process.

I'm going to try a few things and see if I can figure this out.
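The "all flags should be clear at allocation" argument can be illustrated with a toy userspace check. This is only a sketch: TOY_CHECK_AT_PREP is a hypothetical mask covering just the two flags discussed in this mail, at assumed bit positions 0 and 2. It is not the kernel's PAGE_FLAGS_CHECK_AT_PREP, which covers many more bits, and toy_prep_ok() is a made-up name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions for illustration only. */
#define TOY_PG_LOCKED   (UINT64_C(1) << 0)
#define TOY_PG_UPTODATE (UINT64_C(1) << 2)

/*
 * Hypothetical stand-in for a prep-time mask: flags that must be clear
 * when a page is handed out by the allocator.
 */
#define TOY_CHECK_AT_PREP (TOY_PG_LOCKED | TOY_PG_UPTODATE)

/*
 * A page passes the toy prep check only if none of the masked flags is
 * set.  Bits outside the mask (the "stuff up at the top") are ignored.
 */
bool toy_prep_ok(uint64_t flags)
{
	return (flags & TOY_CHECK_AT_PREP) == 0;
}
```

Under these assumptions, a flags word with bit 0 set, such as the 0x0017ffffc0020001 that KCSAN reported as the old value, would not pass such a check, which is why seeing PG_locked on a page that has just come from the allocator is surprising.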
> __handle_mm_fault (mm/memory.c:3662 mm/memory.c:4939 mm/memory.c:5079)
> handle_mm_fault (mm/memory.c:5233)
> do_user_addr_fault (arch/x86/mm/fault.c:1392)
> exc_page_fault (./arch/x86/include/asm/paravirt.h:695 arch/x86/mm/fault.c:1494 arch/x86/mm/fault.c:1542)
> asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:570)
> copyout (./arch/x86/include/asm/uaccess_64.h:112 ./arch/x86/include/asm/uaccess_64.h:133 lib/iov_iter.c:168)
> _copy_to_iter (lib/iov_iter.c:316 (discriminator 5))
> copy_page_to_iter (lib/iov_iter.c:483 lib/iov_iter.c:468)
> filemap_read (mm/filemap.c:2712)
> blkdev_read_iter (block/fops.c:620)
> vfs_read (./include/linux/fs.h:1871 fs/read_write.c:389 fs/read_write.c:470)
> ksys_read (fs/read_write.c:613)
> __x64_sys_read (fs/read_write.c:621)