Date: Sun, 26 Jan 2025 22:55:48 -0800 (PST)
From: David Rientjes
To: Shivank Garg
Cc: akpm@linux-foundation.org, lsf-pc@lists.linux-foundation.org,
    linux-mm@kvack.org, ziy@nvidia.com, AneeshKumar.KizhakeVeetil@arm.com,
    baolin.wang@linux.alibaba.com, bharata@amd.com, david@redhat.com,
    gregory.price@memverge.com, honggyu.kim@sk.com, jane.chu@oracle.com,
    jhubbard@nvidia.com, jon.grimm@amd.com, k.shutemov@gmail.com,
    leesuyeon0506@gmail.com, leillc@google.com, liam.howlett@oracle.com,
    linux-kernel@vger.kernel.org, mel.gorman@gmail.com, Michael.Day@amd.com,
    Raghavendra.KodsaraThimmappa@amd.com, riel@surriel.com,
    santosh.shukla@amd.com, shy828301@gmail.com, sj@kernel.org,
    wangkefeng.wang@huawei.com, weixugc@google.com, willy@infradead.org,
    ying.huang@linux.alibaba.com
Subject: Re: [LSF/MM/BPF TOPIC] Enhancements to Page Migration with
 Multi-threading and Batch Offloading to DMA
Message-ID: <3b59ea3e-04db-ad38-97b1-20cff0f8f17c@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII

On Thu, 23 Jan 2025, Shivank Garg wrote:

> Hi all,
>
> Zi Yan and I would like to propose the topic: Enhancements to Page
> Migration with Multi-threading and Batch Offloading to DMA.
>

I think this would be a very useful topic to discuss, thanks for
proposing it.

> Page migration is a critical operation in NUMA systems that can incur
> significant overheads, affecting memory management performance across
> various workloads.
> For example, copying folios between DRAM NUMA nodes can take ~25% of
> the total migration cost for migrating 256MB of data.
>
> Modern systems are equipped with powerful DMA engines for bulk data
> copying, GPUs, and high CPU core counts. Leveraging these hardware
> capabilities becomes essential for systems where frequent page
> promotion and demotion occur - from large-scale tiered-memory systems
> with CXL nodes to CPU-GPU coherent systems with GPU memory exposed as
> NUMA nodes.
>

Indeed, there are multiple use cases for optimizations in this area.
With the ramp of memory tiered systems, I think there will be an even
greater reliance on memory migration going forward.

Do you have numbers to share on how offloading, even as a proof of
concept, moves the needle compared to traditional and sequential memory
migration?

> Existing page migration performs sequential page copying,
> underutilizing modern CPU architectures and high-bandwidth memory
> subsystems.
>
> We have proposed and posted RFCs to enhance page migration through
> three key techniques:
> 1. Batching migration operations for bulk copying data [1]
> 2. Multi-threaded folio copying [2]
> 3. DMA offloading to hardware accelerators [1]
>

Curious: does memory migration of pages that are actively undergoing
DMA with hardware assist fit into any of these?

> By employing batching and multi-threaded folio copying, we are able to
> achieve significant improvements in page migration throughput for
> large pages.
>
> Discussion points:
> 1. Performance:
>    a. Policy decision for DMA and CPU selection
>    b. Platform-specific scheduling of folio-copy worker threads for
>       better bandwidth utilization

Why platform specific? I *assume* this means a generic framework that
can optimize for scheduling based on the underlying hardware and not
specific implementations that can only be used on AMD, for example. Is
that the case?

>    c. Using non-temporal instructions for CPU-based memcpy
>    d. Upscaling/downscaling worker threads based on migration size,
>       CPU availability (system load), bandwidth saturation, etc.
> 2. Interface requirements with DMA hardware:
>    a. Standardizing APIs for DMA drivers and support for different DMA
>       drivers
>    b. Enhancing DMA drivers for bulk copying (e.g., SDXi Engine)
> 3. Resource accounting:
>    a. CPU cgroups accounting and fairness [3]
>    b. Who bears the migration cost? (migration cost attribution)
>
> References:
> [1] https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
> [2] https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
> [3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com
>
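For anyone following along who has not read the RFCs, the batched,
multi-threaded copy idea (techniques 1 and 2 above) can be sketched in
userspace with plain pthreads and memcpy. This is illustrative only and
not code from either RFC: PAGE_SIZE, copy_worker, and
parallel_copy_batch are made-up names, and the real kernel side has to
handle folio locking, mapping, and error recovery that are elided here.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096
#define MAX_WORKERS 64

struct copy_job {
	unsigned char *dst;
	const unsigned char *src;
	size_t npages;	/* contiguous pages this worker copies */
};

static void *copy_worker(void *arg)
{
	struct copy_job *job = arg;
	size_t i;

	/* One memcpy per page, standing in for per-folio copying. */
	for (i = 0; i < job->npages; i++)
		memcpy(job->dst + i * PAGE_SIZE,
		       job->src + i * PAGE_SIZE, PAGE_SIZE);
	return NULL;
}

/*
 * Split a batch of npages page copies across nthreads workers.
 * Returns 0 on success, -1 on bad arguments or thread failure.
 */
int parallel_copy_batch(unsigned char *dst, const unsigned char *src,
			size_t npages, int nthreads)
{
	pthread_t tids[MAX_WORKERS];
	struct copy_job jobs[MAX_WORKERS];
	size_t per, extra, off = 0;
	int t;

	if (nthreads < 1 || nthreads > MAX_WORKERS)
		return -1;
	per = npages / nthreads;
	extra = npages % nthreads;

	for (t = 0; t < nthreads; t++) {
		/* First `extra` workers take one leftover page each. */
		size_t n = per + ((size_t)t < extra ? 1 : 0);

		jobs[t].dst = dst + off * PAGE_SIZE;
		jobs[t].src = src + off * PAGE_SIZE;
		jobs[t].npages = n;
		off += n;
		if (pthread_create(&tids[t], NULL, copy_worker, &jobs[t]))
			return -1;
	}
	for (t = 0; t < nthreads; t++)
		pthread_join(tids[t], NULL);
	return 0;
}
```

Roughly, batching corresponds to gathering the src/dst pairs up front
instead of copying one folio at a time, and the worker fan-out is where
the upscaling/downscaling policy question in point 1.d would plug in.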
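On point 1.c, the appeal of non-temporal instructions is that a one-shot
bulk copy writes around the cache instead of evicting the workload's hot
lines. A hedged userspace illustration (nt_copy_page is a made-up name;
the SSE2 path only exists on x86, so it falls back to plain memcpy
elsewhere):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

#define PAGE_SIZE 4096

/*
 * Copy one page with cache-bypassing stores where the ISA supports it.
 * dst must be 16-byte aligned on the SSE2 path.
 */
void nt_copy_page(void *dst, const void *src)
{
#ifdef __SSE2__
	__m128i *d = dst;
	const __m128i *s = src;
	size_t i;

	/* 16-byte non-temporal stores bypass the cache hierarchy. */
	for (i = 0; i < PAGE_SIZE / sizeof(__m128i); i++)
		_mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
	_mm_sfence();	/* order NT stores before the page is used */
#else
	memcpy(dst, src, PAGE_SIZE);	/* portable fallback */
#endif
}
```

Whether NT stores actually win depends on copy size and on whether the
destination is read soon after (e.g., promoted pages that are
immediately hot), which seems like exactly the policy question worth
discussing at the session.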