From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 03FE8C433F5 for ; Thu, 30 Sep 2021 15:31:30 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 68C69615A7 for ; Thu, 30 Sep 2021 15:31:29 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 68C69615A7 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id C44BF9400B6; Thu, 30 Sep 2021 11:31:28 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id BF55F94003A; Thu, 30 Sep 2021 11:31:28 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id AE3439400B6; Thu, 30 Sep 2021 11:31:28 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0077.hostedemail.com [216.40.44.77]) by kanga.kvack.org (Postfix) with ESMTP id 9F70894003A for ; Thu, 30 Sep 2021 11:31:28 -0400 (EDT) Received: from smtpin37.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id 4A5C018238CD3 for ; Thu, 30 Sep 2021 15:31:28 +0000 (UTC) X-FDA: 78644629056.37.0D866BC Received: from casper.infradead.org (casper.infradead.org [90.155.50.34]) by imf27.hostedemail.com (Postfix) with ESMTP id A984770009D8 for ; Thu, 30 Sep 2021 15:31:27 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=In-Reply-To:Content-Type:MIME-Version: References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description; bh=S0BblVsdckwghc4txw7r+dPMT0D1gW3oVbzrVCA8EGA=; b=KnFOj0lp75XlU4ooc12G0EXxE9 gWRX4QNFmmTtdsLG1Y4GQeG7R3u/wggaZoaQ5CVmtm4dY8DjHuLVe8FDwDjPdFkWRw7oSp+84lUG3 TnekohIjzeda3UF0SKrqSV8u22ROkZNhO4NNIWNjU88fWGCY3tZ8CoPdTqpWEK+fqsyjNeWBizz7R UyJH7191eFnQiynjWLm5QTzeR2loZ0y2Q85ob+iTIXoBE3IvQ9SxxbZ/aVtBJtBq1IOib1/1G+ane SJMCFvNjCbW2S97rXyGbMLf4kV/QUarT1mnNnPUDNwxG5in5YgShOI5cPytdohEBl3F3YNGXy9yoc bld4yTRA==; Received: from willy by casper.infradead.org with local (Exim 4.94.2 #2 (Red Hat Linux)) id 1mVxzH-00D08C-Iu; Thu, 30 Sep 2021 15:29:15 +0000 Date: Thu, 30 Sep 2021 16:28:43 +0100 From: Matthew Wilcox To: Hugh Dickins Cc: Song Liu , Rongwei Wang , Andrew Morton , Linux MM , Linux Kernel Mailing List , William Kucharski Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache Message-ID: References: <68737431-01d2-e6e3-5131-7d7c731e49ae@linux.alibaba.com> <67906bf5-4de9-8433-3d70-cc8fc5cc2347@linux.alibaba.com> <3d264ed9-f8fd-60d4-7125-f9f745ebeb52@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3d264ed9-f8fd-60d4-7125-f9f745ebeb52@google.com> X-Rspamd-Queue-Id: A984770009D8 X-Stat-Signature: ho6daejzo8a43d1xfw3nuc3sikzr1zcu Authentication-Results: imf27.hostedemail.com; dkim=pass header.d=infradead.org header.s=casper.20170209 header.b=KnFOj0lp; spf=none (imf27.hostedemail.com: domain of willy@infradead.org has no SPF policy when checking 90.155.50.34) smtp.mailfrom=willy@infradead.org; dmarc=none X-Rspamd-Server: rspam06 X-HE-Tag: 1633015887-183821 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Wed, Sep 29, 2021 at 10:24:44PM -0700, Hugh Dickins wrote: > On Wed, 29 Sep 2021, Song Liu wrote: > > On Wed, Sep 29, 2021 at 6:54 PM Rongwei Wang > > wrote: > > > On 9/30/21 7:41 AM, Song Liu wrote: > > > > On Wed, Sep 29, 2021 at 10:56 AM Matthew Wilcox wrote: > > > >> > > > > [...] > > > >>> Now, I am able to crash the system on > > > >>> find_lock_entries () { > > > >>> ... > > > >>> VM_BUG_ON_PAGE(page->index != xas.xa_index, page); > > > >>> } > > > >>> I guess it is related. I will test more. > > > >> > > > >> That's a bogus VM_BUG_ON. I have a patch in my tree to delete it. > > > >> Andrew has it too, but for some reason, he hasn't sent it on to Linus. > > I think Andrew has held back from sending it on to Linus because I pointed > out that the example Matthew cited (shmem and swap cache) was wrong, and > could not explain it: we wanted to understand what syzbot had actually > hit before sending on. > > Would this bug be a good explanation for it? I don't think so? Even if that specific example is not what's happening, the general principle is that you can't verify page->index until you're holding the page lock. So the VM_BUG_ON is definitely in the wrong place. > In the meantime, independently, I was looking at fuse_try_move_page(), > which appears to do the old splice thievery that got disabled from splice > itself, stealing a page from one mapping to put into another. I suspect > that could result in a page->index != xas.xa_index too (outside page lock). I think there are other examples too where page->index gets changed ... they're not springing to mind right now, but I have some other intricate details occupying that bit of my brain at the moment. > > > I think the second possibility mentioned above will been found if you > > > enable CONFIG_DEBUG_VM: > > > > > > 1) multiple writers truncate the same page cache concurrently; > > > 2) collapse_file rolls back when writer truncates the page cache; > > > > > > The following log will be print after enable CONFIG_DEBUG_VM: > > > > > > [22216.789904] do_idle+0xb4/0x104 > > > [22216.789906] cpu_startup_entry+0x34/0x9c > > > [22216.790144] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS > > > 0.0.0 02/06/2015 > > > [22216.790553] secondary_start_kernel+0x104/0x180 > > > [22216.790778] Call trace: > > > [22216.791300] Code: d4210000 b0006161 910d4021 94013b45 (d4210000) > > > [22216.791662] dump_backtrace+0x0/0x1ec > > > [22216.791664] show_stack+0x24/0x30 > > > [22216.791956] ---[ end trace dc769a61c1af087b ]--- > > > [22216.792295] dump_stack+0xd0/0x128 > > > [22216.792299] bad_page+0xe4/0x110 > > > [22216.792579] Kernel panic - not syncing: Oops - BUG: Fatal exception > > > in interrupt > > > [22216.792937] check_free_page_bad+0x84/0x90 > > > [22216.792940] free_pcp_prepare+0x1fc/0x21c > > > [22216.793253] SMP: stopping secondary CPUs > > > [22216.793525] free_unref_page+0x2c/0xec > > > [22216.805537] __put_page+0x60/0x70 > > > [22216.805931] collapse_file+0xdc8/0x12f0 > > > [22216.806385] khugepaged_scan_file+0x2dc/0x37c > > > [22216.806900] khugepaged_scan_mm_slot+0x2e0/0x380 > > > [22216.807450] khugepaged_do_scan+0x2dc/0x2fc > > > [22216.807946] khugepaged+0x38/0x100 > > > [22216.808342] kthread+0x11c/0x120 > > > [22216.808735] Kernel Offset: disabled > > > [22216.809153] CPU features: 0x0040002,62208238 > > > [22216.809681] Memory Limit: none > > > [22216.813477] Starting crashdump kernel... > > > > > > So I think the race also exists between collapse_file and > > > truncate_pagecache. > > > > I do have CONFIG_DEBUG_VM, but I haven't hit this issue yet. > > Sorry, it's taken me a long time to get into this bug(s). > > I haven't tried to reproduce it, but I do think that Rongwei will > be proved right. > > In doing a lockless lookup of the page cache, we tend to imagine > that the THP head page will be encountered first, and the special > treatment for the head will do the right thing to skip the tails. > > But when racing against collapse_file()'s rewind, I agree with > Rongwei that it is possible for truncation (or page cache cleanup > in this instance) to collect a pagevec which starts with a PageTail > some way into the compound page. > > shmem_undo_range() (which shmem uses rather than truncate_pagecache()) > would not call truncate_inode_page() on a THP tail: if it encounters a > tail, it will try to split the THP, and not call truncate_inode_page() > if it cannot. Unless I'm inventing the memory, I now think that I did > encounter this race between truncate and collapse on huge shmem, and > had to re-craft my shmem_punch_compound() to handle it correctly. > > If it is just a matter of collapse_file() rewind, I suppose we could > reverse the direction in which that proceeds; but I'm not convinced > that would be enough, and don't think we need such a "big" change. > > Aside from the above page->index mischeck in find_lock_entries(), > I now think this bug needs nothing more than simply removing the > VM_BUG_ON_PAGE(PageTail(page), page) from truncate_inode_page(). I don't think that's right. This bug was also observed when calling truncate(), so there's clearly a situation where two concurrent calls to truncate_pagecache() leaves a THP in the cache. Unless there's a case where mapping->nr_thps gets corrupted, so the open() thinks there's no THPs in the page cache, when there's actually one or more? That might be a problem that we're solving by locking around the truncate_pagecache() call? > (You could say move its page->mapping check before its PageTail > check; but the PageTail check would then never catch anything.) > > Rongwei's patch went in the right direction, but why add N lines > if deleting one is good? And for a while I thought that idea of > using filemap_invalidate_lock() was very good; but now think > the page lock we already have is better, and solve both races. > > Hugh