From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 22 Mar 2023 14:25:27 +0000
Subject: Re: [RFC PATCH 0/6] variable-order, large folios for anonymous memory
From: Ryan Roberts
To: "Yin, Fengwei" , Andrew Morton , "Matthew Wilcox (Oracle)" , Yu Zhao
Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org
References: <20230317105802.2634004-1-ryan.roberts@arm.com>
 <7b63b922-9c19-b1ed-a6f6-72e1b25f38f6@intel.com>
In-Reply-To: <7b63b922-9c19-b1ed-a6f6-72e1b25f38f6@intel.com>
Content-Type: text/plain; charset="us-ascii"

On 22/03/2023 13:36, Yin, Fengwei wrote:
>
>
> On 3/22/2023 8:03 PM, Ryan Roberts wrote:
>> Hi Matthew,
>>
>> On 17/03/2023 10:57, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> [...]
>>>
>>> Bug(s)
>>> ======
>>>
>>> When I run this code without the last (workaround) patch, with DEBUG_VM
>>> et al, PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly
>>> these relate to invalid kernel addresses (which usually look like either
>>> NULL + a small offset, or mostly zeros with a few mid-order bits set + a
>>> small offset) or lockdep complaining about a bad unlock balance. Call
>>> stacks are often in madvise_free_pte_range(), but I've seen them in
>>> filesystem code too. (I can email example oopses out separately if anyone
>>> wants to review them).
>>> My hunch is that struct pages adjacent to the folio are being corrupted,
>>> but I don't have hard evidence.
>>>
>>> When adding the workaround patch, which prevents madvise_free_pte_range()
>>> from attempting to split a large folio, I never see any issues. Although
>>> I'm not putting the system under memory pressure, so I guess I might see
>>> the same types of problem crop up under swap, etc.
>>>
>>> I've reviewed most of the code within split_folio() and can't find any
>>> smoking gun, but I wonder if there are implicit assumptions about the
>>> large folio being PMD-sized that I'm now obviously breaking?
>>>
>>> The code in madvise_free_pte_range():
>>>
>>>             if (folio_test_large(folio)) {
>>>                     if (folio_mapcount(folio) != 1)
>>>                             goto out;
>>>                     folio_get(folio);
>>>                     if (!folio_trylock(folio)) {
>>>                             folio_put(folio);
>>>                             goto out;
>>>                     }
>>>                     pte_unmap_unlock(orig_pte, ptl);
>>>                     if (split_folio(folio)) {
>>>                             folio_unlock(folio);
>>>                             folio_put(folio);
>>>                             orig_pte = pte_offset_map_lock(mm, pmd,
>>>                                                            addr, &ptl);
>>>                             goto out;
>>>                     }
>>>                     ...
>>>             }
>>
>> I've noticed that it's folio_split() with a folio order of 1 that causes my
>> problems. And I also see that the page cache code always explicitly avoids
>> allocating order-1 folios:
>>
>> void page_cache_ra_order(struct readahead_control *ractl,
>>                 struct file_ra_state *ra, unsigned int new_order)
>> {
>>         ...
>>
>>         while (index <= limit) {
>>                 unsigned int order = new_order;
>>
>>                 /* Align with smaller pages if needed */
>>                 if (index & ((1UL << order) - 1)) {
>>                         order = __ffs(index);
>>                         if (order == 1)
>>                                 order = 0;
>>                 }
>>                 /* Don't allocate pages past EOF */
>>                 while (index + (1UL << order) - 1 > limit) {
>>                         if (--order == 1)
>>                                 order = 0;
>>                 }
>>                 err = ra_alloc_folio(ractl, index, mark, order, gfp);
>>                 if (err)
>>                         break;
>>                 index += 1UL << order;
>>         }
>>
>>         ...
>> }
>>
>> Matthew, what is the reason for this? I suspect it's guarding against the
>> same problem I'm seeing.
>>
>> If I explicitly prevent order-1 allocations for anon pages, I'm unable to
>> cause any oops/panic/etc. I'd just like to understand the root cause.
> Checked the struct folio definition. The _deferred_list is in the third page
> struct. My understanding is that to support folio split, the folio order
> must be >= 2.

Thanks. Yep, looks like we have found the root cause - thanks for your help!

I've updated calc_anonymous_folio_order() to only use a non-0 order if THP is
available, and in that case, to never allocate order-1. I think that both
fixes the problem and manages the dependency we have on THP:

static void calc_anonymous_folio_order(struct vm_fault *vmf,
                                       int *order_out,
                                       unsigned long *addr_out)
{
        /*
         * The aim here is to determine what size of folio we should allocate
         * for this fault. Factors include:
         * - Folio must be naturally aligned within VA space
         * - Folio must not breach boundaries of vma
         * - Folio must be fully contained inside one pmd entry
         * - Folio must not overlap any non-none ptes
         * - Order must not be higher than *order_out upon entry
         *
         * Additionally, we do not allow order-1 since this breaks
         * assumptions elsewhere in the mm; THP pages must be at least
         * order-2 (since they store state up to the 3rd struct page
         * subpage), and these pages must be THP in order to correctly use
         * pre-existing THP infrastructure such as folio_split().
         *
         * As a consequence of relying on the THP infrastructure, if the
         * system does not support THP, we always fall back to order-0.
         *
         * Note that the caller may or may not choose to lock the pte. If
         * unlocked, the calculation should be considered an estimate that
         * will need to be validated under the lock.
         */
        struct vm_area_struct *vma = vmf->vma;
        int nr;
        int order;
        unsigned long addr;
        pte_t *pte;
        pte_t *first_set = NULL;
        int ret;

        if (has_transparent_hugepage()) {
                order = min(*order_out, PMD_SHIFT - PAGE_SHIFT);

                for (; order > 1; order--) {
                        nr = 1 << order;
                        addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
                        pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);

                        /* Check vma bounds. */
                        if (addr < vma->vm_start ||
                            addr + nr * PAGE_SIZE > vma->vm_end)
                                continue;

                        /* Ptes covered by order already known to be none. */
                        if (pte + nr <= first_set)
                                break;

                        /* Already found set pte in range covered by order. */
                        if (pte <= first_set)
                                continue;

                        /* Need to check if all the ptes are none. */
                        ret = check_all_ptes_none(pte, nr);
                        if (ret == nr)
                                break;

                        first_set = pte + ret;
                }

                if (order == 1)
                        order = 0;
        } else {
                order = 0;
        }

        *order_out = order;
        *addr_out = order > 0 ? addr : vmf->address;
}

>
>
> Regards
> Yin, Fengwei
>
>>
>> Thanks,
>> Ryan
>>
>>
>>
>>>
>>> Will normally skip my large folios because they have a mapcount > 1, due
>>> to incrementing mapcount for each pte, unlike PMD mapped pages. But on
>>> occasion it will see a mapcount of 1 and proceed. So I guess this is
>>> racing against reclaim or CoW in this case?
>>>
>>> I also see it's doing a dance to take the folio lock and drop the ptl.
>>> Perhaps my large anon folio is not using the folio lock in the same way
>>> as a THP would and we are therefore not getting the expected
>>> serialization?
>>>
>>> I'd really appreciate any suggestions for how to progress here!
>>>

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel