Re: [PATCH v9 0/5] Migrate on fault for device pages

From: Matthew Brost <matthew.brost@intel.com>
To: "Mika Penttilä" <mpenttil@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>, <linux-mm@kvack.org>,
	<dri-devel@lists.freedesktop.org>,
	<intel-xe@lists.freedesktop.org>, <linux-kernel@vger.kernel.org>,
	David Hildenbrand <david@kernel.org>,
	"Jason Gunthorpe" <jgg@nvidia.com>,
	Leon Romanovsky <leonro@nvidia.com>,
	Balbir Singh <balbirs@nvidia.com>, Zi Yan <ziy@nvidia.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>, Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>
Subject: Re: [PATCH v9 0/5] Migrate on fault for device pages
Date: Tue, 5 May 2026 11:01:13 -0700
Message-ID: <afowaUffi0JrnADf@gsse-cloud1.jf.intel.com>
In-Reply-To: <ecd81b6b-62ef-4dab-a93e-e65598d64841@redhat.com>

On Tue, May 05, 2026 at 10:18:14AM +0300, Mika Penttilä wrote:
> 
> On 5/5/26 10:09, Alistair Popple wrote:
> 
> > Thanks for doing this work Mika. I've been meaning to take a look at this series
> > for a while. I'm currently at LSFMM but will try and take a look this week or
> > next as it sounds quite useful.
> >
> >  - Alistair
> 
> Thanks Alistair and no problem, appreciate your insights whenever you have time.
> 

It looks like this series is breaking Intel's CI [1]. RCU is warning
about a voluntary context switch inside a read-side critical section:

<4> [212.361418] ------------[ cut here ]------------
<4> [212.361431] Voluntary context switch within RCU read-side critical section!
<4> [212.361432] WARNING: kernel/rcu/tree_plugin.h:332 at rcu_note_context_switch+0x82/0x780, CPU#11: kworker/u65:5/2352
<4> [212.361440] Modules linked in: snd_hda_codec_intelhdmi snd_hda_codec_hdmi mei_lb mei_gsc_proxy mtd_intel_dg mei_gsc xe drm_gpuvm drm_gpusvm_helper drm_buddy gpu_sched drm_ttm_helper ttm drm_suballoc_helper drm_exec drm_display_helper cec rc_core drm_kunit_helpers i2c_algo_bit kunit overlay intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp hid_generic coretemp eeepc_wmi cmdlinepart asus_wmi binfmt_misc sparse_keymap spi_nor mei_hdcp mei_pxp mtd wmi_bmof kvm_intel kvm irqbypass aesni_intel gf128mul r8169 usbhid rapl hid intel_cstate realtek snd_hda_intel phy_package snd_intel_dspcfg intel_pmc_core snd_hda_codec idma64 nls_iso8859_1 pmt_telemetry snd_hda_core video snd_hwdep pmt_discovery snd_pcm i2c_i801 pinctrl_alderlake pmt_class snd_timer i2c_mux intel_pmc_ssram_telemetry acpi_tad acpi_pad mei_me snd i2c_smbus spi_intel_pci soundcore mei spi_intel wmi intel_vsec dm_multipath msr nvme_fabrics fuse efi_pstore nfnetlink autofs4
<4> [212.361711] CPU: 11 UID: 0 PID: 2352 Comm: kworker/u65:5 Tainted: G S   U              7.1.0-rc2-lgci-xe-xe-pw-165953v1-debug+ #1 PREEMPT(lazy) 
<4> [212.361715] Tainted: [S]=CPU_OUT_OF_SPEC, [U]=USER
<4> [212.361716] Hardware name: ASUS System Product Name/PRIME Z790-P WIFI, BIOS 0812 02/24/2023
<4> [212.361718] Workqueue: xe_page_fault_work_queue xe_pagefault_queue_work [xe]
<4> [212.361833] RIP: 0010:rcu_note_context_switch+0x82/0x780
<4> [212.361838] Code: 45 85 c0 74 0f 65 8b 05 24 84 ab 02 85 c0 0f 84 8d 01 00 00 45 84 ed 75 16 8b 83 bc 08 00 00 85 c0 7e 0c 48 8d 3d de ad 4d 02 <67> 48 0f b9 3a 8b 83 bc 08 00 00 85 c0 7e 0d 80 bb c0 08 00 00 00
<4> [212.361840] RSP: 0018:ffffc9000186f4a0 EFLAGS: 00010002
<4> [212.361843] RAX: 0000000000000001 RBX: ffff88810a3a8040 RCX: 0000000000000000
<4> [212.361845] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff839bcea0
<4> [212.361846] RBP: ffffc9000186f4e8 R08: 0000000000000001 R09: 0000000000000000
<4> [212.361848] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88885f1b6a00
<4> [212.361849] R13: 0000000000000000 R14: ffffffff83248312 R15: ffffc9000186f630
<4> [212.361851] FS:  0000000000000000(0000) GS:ffff8888db203000(0000) knlGS:0000000000000000
<4> [212.361853] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [212.361854] CR2: 00007fe433b2f088 CR3: 000000000344a000 CR4: 0000000000f52ef0
<4> [212.361856] PKRU: 55555554
<4> [212.361858] Call Trace:
<4> [212.361859]  <TASK>
<4> [212.361862]  ? lock_is_held_type+0xa3/0x130
<4> [212.361868]  __schedule+0x103/0x1f70
<4> [212.361870]  ? lock_acquire+0xc4/0x300
<4> [212.361874]  ? find_held_lock+0x31/0x90
<4> [212.361877]  ? schedule+0x10e/0x180
<4> [212.361880]  ? lock_release+0xd0/0x2b0
<4> [212.361885]  schedule+0x3a/0x180
<4> [212.361888]  io_schedule+0x4c/0x80
<4> [212.361890]  ? softleaf_entry_wait_on_locked+0x147/0x2b0
<4> [212.361894]  softleaf_entry_wait_on_locked+0x24f/0x2b0
<4> [212.361899]  ? __pfx_wake_page_function+0x10/0x10
<4> [212.361904]  migration_entry_wait+0xff/0x190
<4> [212.361909]  hmm_vma_handle_pte+0x440/0x790
<4> [212.361914]  hmm_vma_walk_pmd+0x5c8/0x1360
<4> [212.361918]  ? xe_pagefault_queue_work+0x1a9/0x520 [xe]
<4> [212.362015]  walk_pgd_range+0x57f/0xd70
<4> [212.362017]  ? lock_is_held_type+0xa3/0x130
<4> [212.362028]  __walk_page_range+0x8e/0x290
<4> [212.362034]  walk_page_range_mm_unsafe+0x19e/0x270
<4> [212.362036]  ? trace_hardirqs_on+0x22/0xf0
<4> [212.362043]  walk_page_range+0x2a/0x40
<4> [212.362045]  hmm_range_fault+0x94/0x190
<4> [212.362053]  drm_gpusvm_get_pages+0x269/0xa30 [drm_gpusvm_helper]
<4> [212.362067]  drm_gpusvm_range_get_pages+0x2e/0x50 [drm_gpusvm_helper]
<4> [212.362071]  __xe_svm_handle_pagefault+0x3e0/0xef0 [xe]
<4> [212.362181]  ? __lock_acquire+0x43e/0x2790
<4> [212.362188]  ? lock_is_held_type+0xa3/0x130
<4> [212.362193]  ? lock_is_held_type+0xa3/0x130
<4> [212.362197]  ? xe_vm_find_overlapping_vma+0x57/0x1e0 [xe]
<4> [212.362304]  xe_svm_handle_pagefault+0x3d/0xb0 [xe]
<4> [212.362412]  xe_pagefault_queue_work+0x1a9/0x520 [xe]
<4> [212.362509]  process_one_work+0x239/0x740
<4> [212.362518]  worker_thread+0x200/0x3f0
<4> [212.362521]  ? __pfx_worker_thread+0x10/0x10
<4> [212.362524]  kthread+0x10d/0x150
<4> [212.362527]  ? __pfx_kthread+0x10/0x10
<4> [212.362530]  ret_from_fork+0x3bd/0x470
<4> [212.362533]  ? __pfx_kthread+0x10/0x10
<4> [212.362536]  ret_from_fork_asm+0x1a/0x30
<4> [212.362546]  </TASK>
<4> [212.362547] irq event stamp: 2057044
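
To make the condition behind that warning concrete, here is a toy userspace model, not kernel code. It assumes the check fires when a task sleeps voluntarily while its read-side nesting depth is still non-zero, which is what the message text suggests. All names (task_model, model_rcu_read_lock(), model_context_switch()) are hypothetical and exist only for this sketch; it says nothing about where in the series the nesting is left elevated.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-task state the check looks at. */
struct task_model {
	int rcu_depth;	/* nesting of read-side critical sections */
	int warnings;	/* how many times the modelled check fired */
};

void model_rcu_read_lock(struct task_model *t)
{
	t->rcu_depth++;
}

void model_rcu_read_unlock(struct task_model *t)
{
	assert(t->rcu_depth > 0);	/* catch unbalanced unlocks in the model */
	t->rcu_depth--;
}

/*
 * Model of a context switch. "preempt" is false for a voluntary switch
 * (the task called into the scheduler itself, e.g. to wait for I/O).
 * Returns true if the modelled warning would fire.
 */
bool model_context_switch(struct task_model *t, bool preempt)
{
	bool bad = !preempt && t->rcu_depth > 0;

	if (bad)
		t->warnings++;
	return bad;
}
```

In this model, a voluntary switch between model_rcu_read_lock() and model_rcu_read_unlock() reports the problem, while the same switch after the unlock does not. That matches the shape of the trace: io_schedule() is reached from migration_entry_wait() while a read-side section is apparently still open.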

I’ll be out this Thursday for five weeks, but assuming you can sort this
part out, I’m fine with the series moving forward. I’ve looked at this
several times, and it seems sane enough to me.

On our list we also have the Sashiko setup [2], which I’ve found to be
incredibly helpful for series that do deep MM work. I’m not sure why
Sashiko is saying this series didn’t apply, since it applied cleanly to
our CI branches. If you can get Sashiko to run on it, that might be
helpful as well.

Matt

[1] https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-165953v1/shard-bmg-4/igt@xe_exec_system_allocator@process-many-stride-mmap-race-nomemset.html
[2] https://sashiko.dev/#/patchset/20260505051658.2219537-1-mpenttil%40redhat.com

> --Mika
> 
> >
> > On 2026-05-05 at 15:16 +1000, mpenttil@redhat.com wrote...
> >> From: Mika Penttilä <mpenttil@redhat.com>
> >>
> >> Currently, the way device page faulting and migration works
> >> is not optimal, if you want to do both fault handling and
> >> migration at once.
> >>
> > >> Being able to migrate non-present pages (or pages mapped with incorrect
> > >> permissions, e.g. COW) to the GPU requires doing either of the
> >> following sequences:
> >>
> >> 1. hmm_range_fault() - fault in non-present pages with correct permissions, etc.
> >> 2. migrate_vma_*() - migrate the pages
> >>
> >> Or:
> >>
> >> 1. migrate_vma_*() - migrate present pages
> >> 2. If non-present pages detected by migrate_vma_*():
> >>    a) call hmm_range_fault() to fault pages in
> >>    b) call migrate_vma_*() again to migrate now present pages
> >>
> >> The problem with the first sequence is that you always have to do two
> >> page walks even when most of the time the pages are present or zero page
> >> mappings so the common case takes a performance hit.
> >>
> >> The second sequence is better for the common case, but far worse if
> >> pages aren't present because now you have to walk the page tables three
> >> times (once to find the page is not present, once so hmm_range_fault()
> > >> can find a non-present page to fault in and once again to set up the
> > >> migration). It is also tricky to code correctly. One page table walk
> > >> can cost over 1000 CPU cycles on x86-64, which is a significant hit.
> >>
> >> We should be able to walk the page table once, faulting
> >> pages in as required and replacing them with migration entries if
> >> requested.
> >>
> > >> Add a new flag to the HMM APIs, HMM_PFN_REQ_MIGRATE,
> > >> which tells HMM to also prepare for migration during fault handling.
> > >> Also, for the migrate_vma_setup() call paths, a flag, MIGRATE_VMA_FAULT,
> > >> is added to request fault handling during migration.
> > >>
> > >> One extra benefit of migrating via the hmm_range_fault() path
> > >> is that migrate_vma.vma gets populated, so there is no need to
> > >> retrieve it separately.
> >>
> > >> Tested in an x86-64 VM with the HMM test device, passing the selftests.
> > >> For performance, the migrate throughput tests from the selftests
> > >> show similar numbers (within error margin) to an unmodified kernel.
> > >> Also tested, with no problems, rebased on the
> > >> "Remove device private pages from physical address space" series:
> > >> https://lore.kernel.org/linux-mm/20260130111050.53670-1-jniethe@nvidia.com/
> > >> plus a small adjustment patch.
> >>
> >> Changes v8-v9
> >>   - rebase on drm-tip
> > >>   - fixed UAF around migrate_vma_split_folio() usage
> >>   - added missing pmd unlock
> >>
> >> Changes v7-v8
> >>   - rebase on 7.0
> >>   - fixed subject in two patches
> >>   - enhanced commit messages
> >>   - squashed patch 6 into patch 4 to fix kernel test robot warning
> > >>   - re-added dropped Cc block from cover letter
> >>   - fixed white space
> >>
> >> Changes v6-v7
> >>   - rebase on 7.0.0-rc6
> >>   - added documentation and comments
> > >>   - denote a zero page to be migrated as HMM_PFN_MIGRATE alone
> > >>   - got rid of HMM_PFN_INOUT_FLAGS movement in patch 2
> > >>   - picked up Acked-by from David for patch 1
> >>   
> >> Changes v5-v6
> >>   - rebase on 7.0.0-rc4
> >>   - use range based TLB flushing while unmapping ptes
> >>   - gate migration behind HMM_PFN_REQ_MIGRATE for fault and
> >>     migrate paths
> >>   - always infer migration flags from migrate->flags only
> >>
> >> Changes v4-v5
> >>   - rebase on 6.19
> >>   - fixed David's email address
> >>   - fixed link issue without CONFIG_TRANSPARENT_HUGEPAGE
> >>   - refactored into smaller commits
> >>   - added more comments to code
> >>
> >> Changes v3-v4:
> >>   - rebase on 6.19-rc8
> >>   - fixed issues found by kernel test robot with random configs
> >>   - fixed typos
> >>
> >> Changes v2-v3:
> >>   - rebase on 6.19-rc7
> >>   - fixed issues found by kernel test robot
> >>   - fixed smatch issues reported by Dan Carpenter <dan.carpenter@linaro.org>
> >>   - fixes to lock handling (pmd/pte) on errors
> >>   - added assertions for pmd/pte lock states
> >>   - other issues discovered by Matthew, thanks!
> >>
> >> Changes v1-v2:
> >>   - rebase on 6.19-rc6
> >>   - fixed issues found by kernel test robot
> >>   - fixed locking (pmd/ptl) to cover handle_ and prepare_ regions
> >>     parts if migrating
> >>   - other issues discovered by Matthew, thanks!
> >>
> >> Changes RFC-v1:
> >>   - rebase on 6.19-rc5
> >>   - adjust for the device THP
> >>   - changes from feedback
> >>
> >> Revisions:
> >>   - RFC https://lore.kernel.org/linux-mm/20250814072045.3637192-1-mpenttil@redhat.com/
> >>   - v1: https://lore.kernel.org/all/20260114091923.3950465-1-mpenttil@redhat.com/
> >>   - v2: https://lore.kernel.org/all/20260119112502.645059-1-mpenttil@redhat.com/
> >>   - v3: https://lore.kernel.org/all/20260126111939.1332983-2-mpenttil@redhat.com/
> >>   - v4: https://lore.kernel.org/all/20260202112622.2104213-1-mpenttil@redhat.com/
> >>   - v5: https://lore.kernel.org/linux-mm/20260211081301.2940672-1-mpenttil@redhat.com/
> >>   - v6: https://lore.kernel.org/linux-mm/20260316062407.3354636-1-mpenttil@redhat.com/
> >>   - v7: https://lore.kernel.org/linux-mm/20260330115611.347988-1-mpenttil@redhat.com/
> >>   - v8: https://lore.kernel.org/linux-mm/20260414041226.1539439-1-mpenttil@redhat.com/
> >>
> >> Cc: David Hildenbrand <david@kernel.org>
> >> Cc: Jason Gunthorpe <jgg@nvidia.com>
> >> Cc: Leon Romanovsky <leonro@nvidia.com>
> >> Cc: Alistair Popple <apopple@nvidia.com>
> >> Cc: Balbir Singh <balbirs@nvidia.com>
> >> Cc: Zi Yan <ziy@nvidia.com>
> >> Cc: Matthew Brost <matthew.brost@intel.com>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> >> Cc: Vlastimil Babka <vbabka@suse.cz>
> >> Cc: Mike Rapoport <rppt@kernel.org>
> >> Cc: Suren Baghdasaryan <surenb@google.com>
> >> Cc: Michal Hocko <mhocko@suse.com>
> >>
> >> Mika Penttilä (5):
> >>   mm/Kconfig: changes for migrate on fault for device pages
> >>   mm: Add helper to convert HMM pfn to migrate pfn
> >>   mm/hmm: do the plumbing for HMM to participate in migration
> >>   mm: setup device page migration in HMM pagewalk
> >>   lib/test_hmm:: add a new testcase for the migrate on fault
> >>
> >>  include/linux/hmm.h                    |  19 +-
> >>  include/linux/migrate.h                |  26 +-
> >>  lib/test_hmm.c                         | 101 ++-
> >>  lib/test_hmm_uapi.h                    |  19 +-
> >>  mm/Kconfig                             |   2 +
> >>  mm/hmm.c                               | 835 +++++++++++++++++++++++--
> >>  mm/migrate_device.c                    | 583 +++--------------
> >>  tools/testing/selftests/mm/hmm-tests.c |  54 ++
> >>  8 files changed, 1066 insertions(+), 573 deletions(-)
> >>
> >> drm-tip
> >> base-commit: 94d56a898a2db27f841b17f6966a81ba502fe63c
> >> -- 
> >> 2.50.0
> >>
>
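
For reference, the page table walk counts described in the quoted cover letter can be tabulated with a small model. This is only a sketch in plain userspace C, not code from the series. The enum and function names are hypothetical, and the per-walk cost is an assumed round figure standing in for the cover letter's "over 1000 cycles".

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed illustrative cost; the cover letter only says "over 1000". */
#define ASSUMED_CYCLES_PER_WALK 1000u

/* Hypothetical labels for the three approaches discussed. */
enum walk_strategy {
	FAULT_THEN_MIGRATE,	/* hmm_range_fault(), then migrate_vma_*() */
	MIGRATE_THEN_FAULT,	/* migrate_vma_*(), fault + migrate again if needed */
	MIGRATE_ON_FAULT,	/* single combined walk proposed by the series */
};

/* Number of page table walks needed, per the cover letter's description. */
unsigned int page_table_walks(enum walk_strategy s, bool all_present)
{
	switch (s) {
	case FAULT_THEN_MIGRATE:
		return 2;			/* always fault walk + migrate walk */
	case MIGRATE_THEN_FAULT:
		return all_present ? 1 : 3;	/* detect, fault in, migrate */
	case MIGRATE_ON_FAULT:
		return 1;			/* fault and set up migration together */
	}
	assert(!"unknown strategy");
	return 0;
}

/* Rough cycle estimate under the assumed per-walk cost. */
unsigned int walk_cycles(enum walk_strategy s, bool all_present)
{
	return page_table_walks(s, all_present) * ASSUMED_CYCLES_PER_WALK;
}
```

With these assumptions, page_table_walks(MIGRATE_THEN_FAULT, false) is 3 and page_table_walks(MIGRATE_ON_FAULT, false) is 1, which is the saving the series is after for the non-present case.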
