From: Ryan Roberts <ryan.roberts@arm.com>
Date: Tue, 12 Mar 2024 11:57:13 +0000
Subject: Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed
To: Matthew Wilcox
Cc: Andrew Morton, linux-mm@kvack.org
On 11/03/2024 17:49, Matthew Wilcox wrote:
> On Mon, Mar 11, 2024 at 04:14:06PM +0000, Ryan Roberts wrote:
>>> With this patch, it took about 700 seconds in my xfstests run to find
>>> the one in shrink_folio_list(). It took 3260 seconds to catch the one
>>> in move_folios_to_lru().
>>
>> 54 hours?? That's before I even reported that we still had a bug! Either you're
>> anticipating my every move, or you have a lot of stuff running in parallel :)
>
> One hour is 3600 seconds ;-) So more like 53 minutes.
>
>>> That one is only probably OK. We usually manage to split a folio before
>>> we get to this point, so we should remove it from the deferred list then.
>>> You'd need to buy a lottery ticket if you managed to get hwpoison in a
>>> folio that was deferred split ...
>>
>> OK, but "probably"? Given hwpoison is surely not a hot path, why not just be
>> safe and call folio_undo_large_rmappable()?
>
> Yes, I've added it; just commenting on the likelihood of being able to
> hit it in testing, even with aggressive error injection.
>
>>>> But taking a step back, I wonder whether the charge() and uncharge() functions
>>>> should really be checking to see if the folio is on a deferred split list and, if
>>>> so, move the folio to the correct list?
>>>
>>> I don't think that's the right thing to do. All of these places which
>>> uncharge a folio are part of the freeing path, so we always want it
>>> removed, not moved to a different deferred list.
>>
>> Well, I'm just thinking about trying to be robust. Clearly you would prefer that
>> folio_undo_large_rmappable() has been called before uncharge(), so that uncharge()
>> notices that there is nothing on the deferred list (and doesn't take the lock).
>> But if it hasn't been, is it better to randomly crash (costing the best part of a
>> week to debug) or to move the folio to the right list?
>
> Neither ;-) The right option is to include the assertion that the
> deferred list is empty. That way we get to see the backtrace of whoever
> forgot to take the folio off the deferred list.
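For concreteness, an assertion along those lines could look roughly like the
sketch below. VM_WARN_ON_ONCE_FOLIO(), folio_test_large(), list_empty() and the
folio's _deferred_list field are existing kernel facilities, but the choice of
helper and its exact placement in the memcg uncharge path are assumptions made
here for illustration, not the change that actually went in:

        /*
         * Sketch only (not the actual patch): complain, with a backtrace,
         * if a large folio reaches memcg uncharge while it is still queued
         * on a deferred split list, i.e. someone forgot to call
         * folio_undo_large_rmappable() before freeing it.
         */
        VM_WARN_ON_ONCE_FOLIO(folio_test_large(folio) &&
                              !list_empty(&folio->_deferred_list), folio);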
>> Alternatively, can we refactor so that there aren't 9 separate uncharge() call
>> sites? Those sites are all trying to free the folio, so is there a way to
>> refactor that into a single place (I guess the argument for the current
>> arrangement is reducing the number of times that we have to iterate through the
>> batch?). Then we only have to get it right once.
>
> I have been wondering about a better way to do it. I've also been
> looking a bit askance at put_pages_list() which doesn't do memcg
> uncharging ...
>
>>>
>>> But what about mem_cgroup_move_account()? Looks like that's memcg v1
>>> only? Should still be fixed though ...
>>
>> Right.
>>
>> And what about the first bug you found with the local list corruption? I'm not
>> running with that fix so it's obviously not a problem here. But I still think it's
>> a bug that we should fix? list_for_each_entry_safe() isn't safe against
>> *concurrent* list modification, right?
>
> I've been thinking about that too. I decided that the local list is
> actually protected by the lock after all. It's a bit fiddly to prove,
> but:
>
> 1. We have a reference on every folio ahead of us on the list (not behind us,
>    but see below)
> 2. If split_folio() succeeds, it takes the lock that would protect the
>    list we are on.
> 3. If it doesn't, and folio_put() turns out to drop the last reference,
>    __folio_put_large -> destroy_large_folio -> folio_undo_large_rmappable
>    takes the lock that protects the list we would be on.
>
> So we can analyse this loop as:
>
>	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
>		if (random() & 1)
>			continue;
>		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>		list_del_init(&folio->_deferred_list);
>		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>	}
>
> We're guaranteed that 'next' is a valid folio because we hold a refcount
> on it. Anything left on the list between &list and next may have been
> removed from the list, but we don't look at those entries until after
> we take the split_queue_lock again to do the list_splice_tail().
>
> I'm too scared to write a loop like this, but I don't think it contains
> a bug.

OK, wow. Now that I'm looking at the implementation of
list_for_each_entry_safe() along with your reasoning, that is clear. But it's
certainly not obvious just from reading deferred_split_scan().
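For anyone else following along, this is roughly what that iterator expands to
(paraphrased from include/linux/list.h; the exact spelling varies between
kernel versions). The point is that only 'pos' (folio) and 'n' (next) are ever
dereferenced, and 'n' is sampled before the loop body runs, so the body, or
another CPU holding split_queue_lock, can unlink entries without breaking the
walk as long as 'n' itself stays pinned, which in deferred_split_scan() is
exactly what the per-folio refcount guarantees:

        /*
         * Paraphrase of list_for_each_entry_safe() from include/linux/list.h.
         * 'n' is read ahead of the body, so deleting 'pos' (or any entry
         * between 'pos' and 'n') does not affect how the walk advances;
         * the only requirement is that 'n' cannot be freed or reused.
         */
        #define list_for_each_entry_safe(pos, n, head, member)                 \
                for (pos = list_first_entry(head, typeof(*pos), member),       \
                        n = list_next_entry(pos, member);                      \
                     !list_entry_is_head(pos, head, member);                   \
                     pos = n, n = list_next_entry(n, member))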