Date: Fri, 13 Sep 2024 22:30:15 +0100
From: Matthew Wilcox
To: Linus Torvalds
Cc: Chris Mason, Jens Axboe, Christian Theune, linux-mm@kvack.org,
	"linux-xfs@vger.kernel.org", linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, Daniel Dao, Dave Chinner,
	regressions@lists.linux.dev, regressions@leemhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large
 folios since Dec 2021 (any kernel from 6.1 upwards)
References: <0fc8c3e7-e5d2-40db-8661-8c7199f84e43@kernel.dk>

On Fri, Sep 13, 2024 at 02:24:02PM -0700, Linus Torvalds wrote:
> On Fri, 13 Sept 2024 at 11:15, Matthew Wilcox wrote:
> >
> > Oh! I think split is the key. Let's say we have an order-6 (or
> > larger) folio.
> > And we call split_huge_page() (whatever it's called
> > in your kernel version). That calls xas_split_alloc() followed
> > by xas_split(). xas_split_alloc() puts entry in node->slots[0] and
> > initialises node->slots[1..XA_CHUNK_SIZE] to a sibling entry.
>
> Hmm. The splitting does seem to be not just indicated by the debug
> logs, but it ends up being a fairly complicated case. *The* most
> complicated case of adding a new folio by far, I'd say.
>
> And I wonder if it's even necessary?

Unfortunately, we need to handle things like "we are truncating a file
which has a folio which now extends many pages beyond the end of the
file", and so we have to split the folio that now crosses EOF.  Or we
could write it back and drop it, but that has its own problems.

Part of the "large block size" patch set sitting in Christian's tree
solves these problems for folios which can't be split down to order-0,
so there may be ways we can handle this better now, but if we don't
split we might end up wasting a lot of memory in file tails.

> It's possible that I'm entirely missing something, but at least the
> filemap_add_folio() case looks like it really would actually be
> happier with a "oh, that size conflicts with an existing entry, let's
> just allocate a smaller size then"

Pretty sure we already do that; it's mostly handled through the
readahead path, which checks for conflicting folios already in the
cache.
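
Very roughly, that fallback has the shape below.  This is a hand-wavy
sketch, not the actual mm/readahead.c code; try_add_folio_at_order()
is a made-up helper standing in for the real filemap_alloc_folio() /
filemap_add_folio() dance, and is assumed to return -EEXIST when the
cache already holds a conflicting entry at this index:

	/*
	 * Sketch only: try to add a folio at the requested order and,
	 * on a conflict with an existing entry, retry one order smaller
	 * until we bottom out at order-0.  Assumes kernel context
	 * (<linux/pagemap.h>); try_add_folio_at_order() is hypothetical.
	 */
	static int add_folio_with_fallback(struct address_space *mapping,
					   pgoff_t index, unsigned int order)
	{
		int err;

		do {
			err = try_add_folio_at_order(mapping, index, order);
			if (err != -EEXIST)
				return err;	/* success, or a hard failure */
		} while (order--);		/* conflict: shrink and retry */

		return err;	/* -EEXIST at order-0: folio already cached */
	}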
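
And going back to the split discussion quoted at the top, the node
layout after xas_split_alloc() looks roughly like this.  Illustration
only, not the real function; XA_CHUNK_SIZE and xa_mk_sibling() are the
real XArray names, everything else here is simplified:

	/*
	 * Sketch of the layout described above: slot 0 of the freshly
	 * allocated node holds the entry itself, and every other slot
	 * is a sibling entry pointing back at the canonical slot 0.
	 */
	static void fill_split_node(void *slots[XA_CHUNK_SIZE], void *entry)
	{
		unsigned int i;

		slots[0] = entry;
		for (i = 1; i < XA_CHUNK_SIZE; i++)
			slots[i] = xa_mk_sibling(0);
	}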