From: Ryan Roberts <ryan.roberts@arm.com>
Date: Tue, 12 Mar 2024 11:57:13 +0000
Subject: Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
To: Matthew Wilcox
Cc: Andrew Morton, linux-mm@kvack.org
On 11/03/2024 17:49, Matthew Wilcox wrote:
> On Mon, Mar 11, 2024 at 04:14:06PM +0000, Ryan Roberts wrote:
>>> With this patch, it took about 700 seconds in my xfstests run to find
>>> the one in shrink_folio_list(). It took 3260 seconds to catch the one
>>> in move_folios_to_lru().
>>
>> 54 hours?? That's before I even reported that we still had a bug! Either you're
>> anticipating my every move, or you have a lot of stuff running in parallel :)
>
> One hour is 3600 seconds ;-) So more like 53 minutes.
>
>>> That one is only probably OK. We usually manage to split a folio before
>>> we get to this point, so we should remove it from the deferred list then.
>>> You'd need to buy a lottery ticket if you managed to get hwpoison in a
>>> folio that was deferred split ...
>>
>> OK, but "probably"? Given hwpoison is surely not a hot path, why not just be
>> safe and call folio_undo_large_rmappable()?
>
> Yes, I've added it; just commenting on the likelihood of being able to
> hit it in testing, even with aggressive error injection.
>
>>>> But taking a step back, I wonder whether the charge() and uncharge() functions
>>>> should really be checking to see if the folio is on a deferred split list and, if
>>>> so, move the folio to the correct list?
>>>
>>> I don't think that's the right thing to do. All of these places which
>>> uncharge a folio are part of the freeing path, so we always want it
>>> removed, not moved to a different deferred list.
>>
>> Well, I'm just thinking about trying to be robust. Clearly you would prefer that
>> folio_undo_large_rmappable() has been called before uncharge(), so that uncharge()
>> notices that there is nothing on the deferred list (and doesn't take the lock).
>> But if it hasn't been, is it better to randomly crash (costing the best part of a
>> week to debug) or to move the folio to the right list?
>
> Neither ;-) The right option is to include the assertion that the
> deferred list is empty. That way we get to see the backtrace of whoever
> forgot to take the folio off the deferred list.
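For concreteness, an assertion along those lines could look roughly like the
sketch below. VM_WARN_ON_ONCE_FOLIO(), folio_test_large(), list_empty() and the
folio's _deferred_list field are existing kernel facilities, but the choice of
helper and its exact placement in the memcg uncharge path are assumptions made
here for illustration, not the change that actually went in:

        /*
         * Sketch only (not the actual patch): complain, with a backtrace,
         * if a large folio reaches memcg uncharge while it is still queued
         * on a deferred split list, i.e. someone forgot to call
         * folio_undo_large_rmappable() before freeing it.
         */
        VM_WARN_ON_ONCE_FOLIO(folio_test_large(folio) &&
                              !list_empty(&folio->_deferred_list), folio);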
>> Alternatively, can we refactor so that there aren't 9 separate uncharge() call
>> sites? Those sites are all trying to free the folio, so is there a way to
>> refactor that into a single place (I guess the argument for the current
>> arrangement is reducing the number of times that we have to iterate through the
>> batch?). Then we only have to get it right once.
>
> I have been wondering about a better way to do it. I've also been
> looking a bit askance at put_pages_list() which doesn't do memcg
> uncharging ...
>
>>>
>>> But what about mem_cgroup_move_account()? Looks like that's memcg v1
>>> only? Should still be fixed though ...
>>
>> Right.
>>
>> And what about the first bug you found with the local list corruption? I'm not
>> running with that fix so it's obviously not a problem here. But I still think it's
>> a bug that we should fix? list_for_each_entry_safe() isn't safe against
>> *concurrent* list modification, right?
>
> I've been thinking about that too. I decided that the local list is
> actually protected by the lock after all. It's a bit fiddly to prove,
> but:
>
> 1. We have a reference on every folio ahead of us on the list (not behind us,
>    but see below)
> 2. If split_folio() succeeds, it takes the lock that would protect the
>    list we are on.
> 3. If it doesn't, and folio_put() turns out to drop the last reference,
>    __folio_put_large -> destroy_large_folio -> folio_undo_large_rmappable
>    takes the lock that protects the list we would be on.
>
> So we can analyse this loop as:
>
>	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
>		if (random() & 1)
>			continue;
>		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>		list_del_init(&folio->_deferred_list);
>		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>	}
>
> We're guaranteed that 'next' is a valid folio because we hold a refcount
> on it. Anything left on the list between &list and next may have been
> removed from the list, but we don't look at those entries until after
> we take the split_queue_lock again to do the list_splice_tail().
>
> I'm too scared to write a loop like this, but I don't think it contains
> a bug.

OK, wow. Now that I'm looking at the implementation of
list_for_each_entry_safe() along with your reasoning, that is clear. But it's
certainly not obvious just from reading deferred_split_scan().
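For anyone else following along, this is roughly what that iterator expands to
(paraphrased from include/linux/list.h; the exact spelling varies between
kernel versions). The point is that only 'pos' (folio) and 'n' (next) are ever
dereferenced, and 'n' is sampled before the loop body runs, so the body, or
another CPU holding split_queue_lock, can unlink entries without breaking the
walk as long as 'n' itself stays pinned, which in deferred_split_scan() is
exactly what the per-folio refcount guarantees:

        /*
         * Paraphrase of list_for_each_entry_safe() from include/linux/list.h.
         * 'n' is read ahead of the body, so deleting 'pos' (or any entry
         * between 'pos' and 'n') does not affect how the walk advances;
         * the only requirement is that 'n' cannot be freed or reused.
         */
        #define list_for_each_entry_safe(pos, n, head, member)                 \
                for (pos = list_first_entry(head, typeof(*pos), member),       \
                        n = list_next_entry(pos, member);                      \
                     !list_entry_is_head(pos, head, member);                   \
                     pos = n, n = list_next_entry(n, member))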