Message-ID: <324bf85f-442a-4388-a8f4-55d60e57b914@linux.alibaba.com>
Date: Mon, 13 Jan 2025 16:11:35 +0800
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH V2] mm: compaction: skip memory compaction when there are not enough migratable pages
To: yangge1116@126.com, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, 21cnbao@gmail.com, david@redhat.com, hannes@cmpxchg.org, liuzixing@hygon.cn, Vlastimil Babka
References: <1736325440-30857-1-git-send-email-yangge1116@126.com>
From: Baolin Wang
In-Reply-To: <1736325440-30857-1-git-send-email-yangge1116@126.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

Cc Vlastimil.

On 2025/1/8 16:37, yangge1116@126.com wrote:
> From: yangge
>
> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
> of memory. I have configured 16GB of CMA memory on each NUMA node,
> and starting a 32GB virtual machine with device passthrough is
> extremely slow, taking almost an hour.
>
> During the start-up of the virtual machine, it will call
> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
> Long term GUP cannot allocate memory from CMA area, so a maximum of
> 16GB of non-CMA memory on a NUMA node can be used as virtual machine
> memory. There is 16GB of free CMA memory on a NUMA node, which is
> sufficient to pass the order-0 watermark check, causing the
> __compaction_suitable() function to consistently return true.
> However, if there aren't enough migratable pages available, performing
> memory compaction is meaningless. Besides checking whether
> the order-0 watermark is met, __compaction_suitable() also needs
> to determine whether there are sufficient migratable pages available
> for memory compaction.
>
> For costly allocations, because __compaction_suitable() always
> returns true, __alloc_pages_slowpath() can't exit at the appropriate
> place, resulting in excessively long virtual machine startup times.
> Call trace:
> __alloc_pages_slowpath
>     if (compact_result == COMPACT_SKIPPED ||
>         compact_result == COMPACT_DEFERRED)
>         goto nopage; // should exit __alloc_pages_slowpath() from here
>
> When the 16GB of non-CMA memory on a single node is exhausted, we will
> fall back to allocating memory on other nodes. In order to quickly
> fall back to remote nodes, we should skip memory compaction when
> migratable pages are insufficient. After this fix, it only takes a
> few tens of seconds to start a 32GB virtual machine with device
> passthrough functionality.
>
> Signed-off-by: yangge
> ---
>
> V2:
> - consider unevictable folios
>
>  mm/compaction.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 07bd227..1630abd 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2383,7 +2383,27 @@ static bool __compaction_suitable(struct zone *zone, int order,
>  				  int highest_zoneidx,
>  				  unsigned long wmark_target)
>  {
> +	struct pglist_data *pgdat = zone->zone_pgdat;
> +	unsigned long sum, nr_pinned;
>  	unsigned long watermark;
> +
> +	sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
> +		node_page_state(pgdat, NR_INACTIVE_ANON) +
> +		node_page_state(pgdat, NR_ACTIVE_FILE) +
> +		node_page_state(pgdat, NR_ACTIVE_ANON) +
> +		node_page_state(pgdat, NR_UNEVICTABLE);
> +
> +	nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
> +		node_page_state(pgdat, NR_FOLL_PIN_RELEASED);
> +
> +	/*
> +	 * Gup-pinned pages are non-migratable. After subtracting these pages,
> +	 * we need to check if the remaining pages are sufficient for memory
> +	 * compaction.
> +	 */
> +	if ((sum - nr_pinned) < (1 << order))
> +		return false;
> +

Looks reasonable to me, but let's see if other people have any comments.
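
For anyone who wants to play with the arithmetic outside the kernel, here is a rough userspace model of the check the patch adds. It is only a sketch: struct node_counters and enough_migratable_pages() are made-up names for illustration, the counters are plain fields rather than node_page_state() reads, and the guard against the pinned count exceeding the LRU sum is my own assumption, not something the patch does.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the per-node counters the patch reads. */
struct node_counters {
	unsigned long inactive_file;
	unsigned long inactive_anon;
	unsigned long active_file;
	unsigned long active_anon;
	unsigned long unevictable;
	unsigned long pin_acquired;	/* models NR_FOLL_PIN_ACQUIRED */
	unsigned long pin_released;	/* models NR_FOLL_PIN_RELEASED */
};

/* Return true if the node appears to hold at least 2^order unpinned LRU pages. */
static bool enough_migratable_pages(const struct node_counters *nc,
				    unsigned int order)
{
	unsigned long sum, nr_pinned;

	/* Keep the shift below well defined. */
	assert(order < sizeof(unsigned long) * CHAR_BIT);

	sum = nc->inactive_file + nc->inactive_anon +
	      nc->active_file + nc->active_anon + nc->unevictable;
	nr_pinned = nc->pin_acquired - nc->pin_released;

	/*
	 * Assumption of this sketch: if the pinned count is at least the LRU
	 * sum, report "nothing migratable" instead of letting the unsigned
	 * subtraction wrap around.
	 */
	if (nr_pinned >= sum)
		return false;

	return (sum - nr_pinned) >= (1UL << order);
}
```

With 400 LRU pages of which 40 are pinned, this model accepts an order-8 request (256 pages) and rejects an order-9 request (512 pages). Whether the pinned count can really exceed the LRU sum in the kernel is not something this sketch establishes.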