Date: Tue, 10 Oct 2023 17:12:00 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Zi Yan
Cc: David Hildenbrand, Vlastimil Babka, Mike Kravetz, Andrew Morton,
 Mel Gorman, Miaohe Lin, Kefeng Wang, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
Subject: Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene
Message-ID: <20231010211200.GA129823@cmpxchg.org>
In-Reply-To: <329AB331-DDC1-4074-A85E-AB5CF866CE84@nvidia.com>
Hello!

On Mon, Oct 02, 2023 at 10:26:44PM -0400, Zi Yan wrote:
> On 27 Sep 2023, at 22:51, Zi Yan wrote:
> I attached my revised patch 2 and 3 (with all the suggestions above).

Thanks! It took me a bit to read through them. It's a tricky
codebase! Some comments below.

> From 1c8f99cff5f469ee89adc33e9c9499254cad13f2 Mon Sep 17 00:00:00 2001
> From: Zi Yan
> Date: Mon, 25 Sep 2023 16:27:14 -0400
> Subject: [PATCH v2 1/2] mm: set migratetype after free pages are moved between
>  free lists.
>
> This avoids changing migratetype after move_freepages() or
> move_freepages_block(), which is error prone. It also prepares for upcoming
> changes to fix move_freepages() not moving free pages partially in the
> range.
>
> Signed-off-by: Zi Yan

This is great and indeed makes the callsites much simpler. Thanks,
I'll fold this into the series.

> @@ -1597,9 +1615,29 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>  			  unsigned long end_pfn, int old_mt, int new_mt)
>  {
>  	struct page *page;
> -	unsigned long pfn;
> +	unsigned long pfn, pfn2;
>  	unsigned int order;
>  	int pages_moved = 0;
> +	unsigned long mt_changed_pfn = start_pfn - pageblock_nr_pages;
> +	unsigned long new_start_pfn = get_freepage_start_pfn(start_pfn);
> +
> +	/* split at start_pfn if it is in the middle of a free page */
> +	if (new_start_pfn != start_pfn && PageBuddy(pfn_to_page(new_start_pfn))) {
> +		struct page *new_page = pfn_to_page(new_start_pfn);
> +		int new_page_order = buddy_order(new_page);

get_freepage_start_pfn() returns start_pfn if it didn't find a large
buddy, so the buddy check shouldn't be necessary, right?

> +		if (new_start_pfn + (1 << new_page_order) > start_pfn) {

This *should* be implied according to the comments on
get_freepage_start_pfn(), but it currently isn't. Doing so would help
here, and seemingly also in alloc_contig_range().

How about this version of get_freepage_start_pfn()?

/*
 * Scan the range before this pfn for a buddy that straddles it
 */
static unsigned long find_straddling_buddy(unsigned long start_pfn)
{
	int order = 0;
	struct page *page;
	unsigned long pfn = start_pfn;

	while (!PageBuddy(page = pfn_to_page(pfn))) {
		/* Nothing found */
		if (++order > MAX_ORDER)
			return start_pfn;
		pfn &= ~0UL << order;
	}

	/*
	 * Found a preceding buddy, but does it straddle?
	 */
	if (pfn + (1 << buddy_order(page)) > start_pfn)
		return pfn;

	/* Nothing found */
	return start_pfn;
}
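As a quick sanity check of the masking walk, here is a userspace
sketch. PageBuddy() and buddy_order() are stubbed out with a single
simulated order-10 buddy at pfn 1024 (values made up for
illustration), so only the pfn arithmetic is real:

#include <stdio.h>
#include <stdbool.h>

#define MAX_ORDER	10
#define SIM_HEAD	1024UL	/* simulated buddy head, naturally aligned */
#define SIM_ORDER	10	/* covers pfns 1024-2047 */

/* stand-in for PageBuddy(pfn_to_page(pfn)) */
static bool sim_page_buddy(unsigned long pfn)
{
	return pfn == SIM_HEAD;
}

static unsigned long find_straddling_buddy(unsigned long start_pfn)
{
	int order = 0;
	unsigned long pfn = start_pfn;

	while (!sim_page_buddy(pfn)) {
		/* Nothing found */
		if (++order > MAX_ORDER)
			return start_pfn;
		/* align down to the next higher-order boundary */
		pfn &= ~0UL << order;
	}

	/* Found a preceding buddy, but does it straddle? */
	if (pfn + (1UL << SIM_ORDER) > start_pfn)
		return pfn;

	/* Nothing found */
	return start_pfn;
}

int main(void)
{
	printf("%lu\n", find_straddling_buddy(1536));	/* tail: prints 1024 */
	printf("%lu\n", find_straddling_buddy(1024));	/* head: prints 1024 */
	printf("%lu\n", find_straddling_buddy(512));	/* miss: prints 512 */
	return 0;
}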
> @@ -1614,10 +1652,43 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>
>  		order = buddy_order(page);
>  		move_to_free_list(page, zone, order, old_mt, new_mt);
> +		/*
> +		 * set page migratetype 1) only after we move all free pages in
> +		 * one pageblock and 2) for all pageblocks within the page.
> +		 *
> +		 * for 1), since move_to_free_list() checks page migratetype with
> +		 * old_mt and changing one page migratetype affects all pages
> +		 * within the same pageblock, if we are moving more than
> +		 * one free pages in the same pageblock, setting migratetype
> +		 * right after first move_to_free_list() triggers the warning
> +		 * in the following move_to_free_list().
> +		 *
> +		 * for 2), when a free page order is greater than pageblock_order,
> +		 * all pageblocks within the free page need to be changed after
> +		 * move_to_free_list().

I think this can be somewhat simplified.

There are two assumptions we can make. Buddies always consist of 2^n
pages. And buddies and pageblocks are naturally aligned. This means
that if this pageblock has the start of a buddy that straddles into
the next pageblock(s), it must be the first page in the block. That
in turn means we can move the handling before the loop.

If we split first, it also makes the loop a little simpler because we
know that any buddies that start inside this block cannot extend
beyond it (due to the alignment). The loop, as it was originally
written, can remain untouched.
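That alignment claim can be brute-forced in userspace, too. Assuming
pageblock_order = 9 and MAX_ORDER = 10 for illustration, this sketch
checks that a naturally aligned buddy which extends past a pageblock
boundary always starts at the first page of its block:

#include <assert.h>
#include <stdio.h>

#define PAGEBLOCK_ORDER	9	/* assumed for illustration */
#define PAGEBLOCK_PAGES	(1UL << PAGEBLOCK_ORDER)
#define MAX_ORDER	10
#define RANGE		(1UL << 20)	/* pfns to scan */

int main(void)
{
	for (int order = 0; order <= MAX_ORDER; order++) {
		/* buddy heads are multiples of their own size */
		for (unsigned long head = 0; head < RANGE; head += 1UL << order) {
			unsigned long end = head + (1UL << order);
			unsigned long block_end =
				(head / PAGEBLOCK_PAGES + 1) * PAGEBLOCK_PAGES;

			/* straddles into the next pageblock? */
			if (end > block_end)
				/* ...then it must start the block */
				assert(head % PAGEBLOCK_PAGES == 0);
		}
	}
	printf("straddling buddies always start their pageblock\n");
	return 0;
}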
> +		 */
> +		if (pfn + (1 << order) > pageblock_end_pfn(pfn)) {
> +			for (pfn2 = pfn;
> +			     pfn2 < min_t(unsigned long,
> +					  pfn + (1 << order),
> +					  end_pfn + 1);
> +			     pfn2 += pageblock_nr_pages) {
> +				set_pageblock_migratetype(pfn_to_page(pfn2),
> +							  new_mt);
> +				mt_changed_pfn = pfn2;

Hm, this seems to assume that start_pfn to end_pfn can be more than
one block. Why is that? This function is only used on single blocks.

> +			}
> +			/* split the free page if it goes beyond the specified range */
> +			if (pfn + (1 << order) > (end_pfn + 1))
> +				split_free_page(page, order, end_pfn + 1 - pfn);
> +		}
>  		pfn += 1 << order;
>  		pages_moved += 1 << order;
>  	}
> -	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
> +	/* set migratetype for the remaining pageblocks */
> +	for (pfn2 = mt_changed_pfn + pageblock_nr_pages;
> +	     pfn2 <= end_pfn;
> +	     pfn2 += pageblock_nr_pages)
> +		set_pageblock_migratetype(pfn_to_page(pfn2), new_mt);

If I rework the code on the above, I'm arriving at the following:

static int move_freepages(struct zone *zone, unsigned long start_pfn,
			  unsigned long end_pfn, int old_mt, int new_mt)
{
	struct page *start_page = pfn_to_page(start_pfn);
	int pages_moved = 0;
	unsigned long pfn;

	VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
	VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);

	/*
	 * A free page may be comprised of 2^n blocks, which means our
	 * block of interest could be head or tail in such a page.
	 *
	 * If we're a tail, update the type of our block, then split
	 * the page into pageblocks. The splitting will do the leg
	 * work of sorting the blocks into the right freelists.
	 *
	 * If we're a head, split the page into pageblocks first. This
	 * ensures the migratetypes still match up during the freelist
	 * removal. Then do the regular scan for buddies in the block
	 * of interest, which will handle the rest.
	 *
	 * In theory, we could try to preserve 2^1 and larger blocks
	 * that lie outside our range. In practice, MAX_ORDER is
	 * usually one or two pageblocks anyway, so don't bother.
	 *
	 * Note that this only applies to page isolation, which calls
	 * this on random blocks in the pfn range! When we move stuff
	 * from inside the page allocator, the pages are coming off
	 * the freelist (can't be tail) and multi-block pages are
	 * handled directly in the stealing code (can't be a head).
	 */

	/* We're a tail */
	pfn = find_straddling_buddy(start_pfn);
	if (pfn != start_pfn) {
		struct page *free_page = pfn_to_page(pfn);

		set_pageblock_migratetype(start_page, new_mt);
		split_free_page(free_page, buddy_order(free_page),
				pageblock_nr_pages);
		return pageblock_nr_pages;
	}

	/* We're a head */
	if (PageBuddy(start_page) && buddy_order(start_page) > pageblock_order)
		split_free_page(start_page, buddy_order(start_page),
				pageblock_nr_pages);

	/* Move buddies within the block */
	while (pfn <= end_pfn) {
		struct page *page = pfn_to_page(pfn);
		int order, nr_pages;

		if (!PageBuddy(page)) {
			pfn++;
			continue;
		}

		/* Make sure we are not inadvertently changing nodes */
		VM_BUG_ON_PAGE(page_to_nid(page) != zone_to_nid(zone), page);
		VM_BUG_ON_PAGE(page_zone(page) != zone, page);

		order = buddy_order(page);
		nr_pages = 1 << order;

		move_to_free_list(page, zone, order, old_mt, new_mt);

		pfn += nr_pages;
		pages_moved += nr_pages;
	}

	set_pageblock_migratetype(start_page, new_mt);

	return pages_moved;
}

Does this look reasonable to you?

Note that the page isolation specific stuff comes first. If this code
holds up, we should be able to move it to page-isolation.c and keep
it out of the regular allocator path.

Thanks!
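P.S. To double-check the buddy scan at the end, here's a toy
userspace walk of that loop with the freelist machinery stubbed out
and a made-up block layout (an order-8 buddy at pfn 0, order-4 at
256, order-7 at 384, the rest allocated). It shows pfn advancing by
1 << order across free buddies and by one page otherwise:

#include <stdio.h>

#define BLOCK_PAGES	512UL	/* one pageblock, order 9 assumed */

/* stand-in for PageBuddy()/buddy_order(): order of a free head, or -1 */
static int sim_order(unsigned long pfn)
{
	switch (pfn) {
	case 0:   return 8;	/* pfns 0-255 */
	case 256: return 4;	/* pfns 256-271 */
	case 384: return 7;	/* pfns 384-511 */
	default:  return -1;
	}
}

int main(void)
{
	unsigned long pfn = 0, pages_moved = 0;

	while (pfn < BLOCK_PAGES) {
		int order = sim_order(pfn);

		if (order < 0) {	/* !PageBuddy(): skip one page */
			pfn++;
			continue;
		}
		/* move_to_free_list(page, zone, order, old_mt, new_mt) */
		printf("moved order-%d buddy at pfn %lu\n", order, pfn);
		pfn += 1UL << order;
		pages_moved += 1UL << order;
	}
	printf("pages_moved: %lu of %lu\n", pages_moved, BLOCK_PAGES);
	return 0;
}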