From: "Huang, Ying" <ying.huang@linux.alibaba.com>
To: Shivank Garg
Subject: Re: [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload
In-Reply-To: <20250923174752.35701-1-shivankg@amd.com> (Shivank Garg's message of "Tue, 23 Sep 2025 17:47:35 +0000")
References: <20250923174752.35701-1-shivankg@amd.com>
Date: Wed, 24 Sep 2025 09:49:37 +0800
Message-ID: <87plbghb66.fsf@DESKTOP-5N7EMDA>

Hi, Shivank,

Thanks for working on this!

Shivank Garg writes:

> This is the third RFC of the patchset to enhance page migration by batching
> folio-copy operations and enabling acceleration via multi-threaded CPU or
> DMA offload.
>
> Single-threaded, folio-by-folio copying bottlenecks page migration in
> modern systems with deep memory hierarchies, especially for large folios
> where copy overhead dominates, leaving significant hardware potential
> untapped.
>
> By batching the copy phase, we create an opportunity for significant
> hardware acceleration. This series builds a framework for this acceleration
> and provides two initial offload driver implementations: one using multiple
> CPU threads (mtcopy) and another leveraging the DMAEngine subsystem (dcbm).
>
> This version incorporates significant feedback to improve correctness,
> robustness, and the efficiency of the DMA offload path.
>
> Changelog since V2:
>
> 1. DMA Engine Rewrite:
>    - Switched from per-folio dma_map_page() to batch dma_map_sgtable()
>    - Single completion interrupt per batch (reduced overhead)
>    - Order-of-magnitude improvement in setup time for large batches
> 2. Code cleanups and refactoring
> 3. Rebased on latest mainline (6.17-rc6+)
>
> MOTIVATION:
> -----------
>
> Current Migration Flow:
>
> [ move_pages(), Compaction, Tiering, etc. ]
>      |
>      v
> [ migrate_pages() ]        // Common entry point
>      |
>      v
> [ migrate_pages_batch() ]  // NR_MAX_BATCHED_MIGRATION (512) folios at a time
>      |
>      |--> [ migrate_folio_unmap() ]
>      |
>      |--> [ try_to_unmap_flush() ]   // Perform a single, batched TLB flush
>      |
>      |--> [ migrate_folios_move() ]  // Bottleneck: interleaved copy
>           - For each folio:
>             - Metadata prep: copy flags, mappings, etc.
>             - folio_copy()  <-- Single-threaded, serial data copy.
>             - Update PTEs & finalize for that single folio.
>
> Understanding overheads in page migration (move_pages() syscall):
>
>   Total move_pages() overheads = folio_copy() + other overheads
>
> 1. folio_copy() is the core copy operation that interests us.
> 2. The remaining operations are user/kernel transitions, page table walks,
>    locking, folio unmap, dst folio allocation, TLB flush, copying flags,
>    updating mappings and PTEs, etc., which contribute the remaining
>    overheads.
>
> Percentage of folio_copy() overheads in move_pages(N pages) syscall time,
> by number of pages migrated and folio size:
>
>                 4KB      2MB
>   1 page        <1%     ~66%
>   512 pages    ~35%     ~97%
>
> Based on Amdahl's Law, optimizing folio_copy() for large pages offers a
> substantial performance opportunity:
>
>   move_pages() syscall speedup = 1 / ((1 - F) + (F / S))
>
> where F is the fraction of time spent in folio_copy() and S is the speedup
> of folio_copy().
>
> For 4KB folios, the folio copy overhead is too small in single-page
> migrations to impact the overall speedup; even for 512 pages, the maximum
> theoretical speedup is limited to ~1.54x with infinite folio_copy()
> speedup.
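As a quick sanity check on the cover letter's figures, the Amdahl bound can be evaluated directly. This is an illustrative sketch only; the F values are the folio_copy() fractions from the overhead table above, and S = 7.5 is the measured DMA folio_copy() speedup quoted in the cover letter:

```python
def move_pages_speedup(F, S=float("inf")):
    """Amdahl's Law: speedup = 1 / ((1 - F) + F / S).

    F: fraction of move_pages() time spent in folio_copy().
    S: speedup of folio_copy() itself; S=inf gives the theoretical ceiling.
    """
    return 1.0 / ((1.0 - F) + F / S)

print(move_pages_speedup(0.35))       # 512 x 4KB pages, ceiling:  ~1.54x
print(move_pages_speedup(0.66))       # 1 x 2MB THP, ceiling:      ~2.9x
print(move_pages_speedup(0.97))       # 512 x 2MB THPs, ceiling:   ~33x
print(move_pages_speedup(0.97, 7.5))  # with measured S = 7.5:     ~6.3x
```

The numbers reproduce the ~1.54x, ~3x, ~33x, and ~6.3x figures in the cover letter.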
>
> For 2MB THPs, folio copy overheads are significant even in single-page
> migrations, with a theoretical speedup of ~3x with infinite folio_copy()
> speedup, and up to ~33x for 512 pages.
>
> A realistic value of S (speedup of folio_copy()) is 7.5x for DMA offload,
> based on my measurements for copying 512 2MB pages. This gives
> move_pages() a practical speedup of 6.3x for 512 2MB pages (also observed
> in the experiments below).
>
> DESIGN: A Pluggable Migrator Framework
> ---------------------------------------
>
> Introduce migrate_folios_batch_move():
>
> [ migrate_pages_batch() ]
>      |
>      |--> migrate_folio_unmap()
>      |
>      |--> try_to_unmap_flush()
>      |
>      +--> [ migrate_folios_batch_move() ]  // new batched design
>           |
>           |--> Metadata migration
>           |    - Metadata prep: copy flags, mappings, etc.
>           |    - Use MIGRATE_NO_COPY to skip the actual data copy.
>           |
>           |--> Batch copy folio data
>           |    - Migrator is configurable at runtime via sysfs.
>           |
>           |    static_call(_folios_copy)   // Pluggable migrators
>           |       /        |         \
>           |      v         v          v
>           | [ Default ] [ MT CPU copy ] [ DMA Offload ]
>           |
>           +--> Update PTEs to point to dst folios and complete migration.

I just jumped into the discussion, so this may have been discussed
before. Sorry if so.

Why not

  migrate_folios_unmap()
  try_to_unmap_flush()
  copy folios in parallel if possible
  migrate_folios_move(): with MIGRATE_NO_COPY

?

> User Control of Migrator:
>
> # echo 1 > /sys/kernel/dcbm/offloading
>     |
>     +--> Driver's sysfs handler
>           |
>           +--> calls start_offloading(&cpu_migrator)
>                 |
>                 +--> calls offc_update_migrator()
>                       |
>                       +--> static_call_update(_folios_copy, mig->migrate_offc)
>
> Later, during migration ...
>
> migrate_folios_batch_move()
>   |
>   +--> static_call(_folios_copy)  // Now dispatches to the selected migrator
>         |
>         +--> [ mtcopy | dcbm | kernel_default ]

[snip]

---
Best Regards,
Huang, Ying