From: Mika Penttilä <mpenttil@redhat.com>
Date: Fri, 1 Aug 2025 09:01:12 +0300
Subject: Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
To: Balbir Singh, Zi Yan, David Hildenbrand
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Karol Herbst,
 Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
 Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang, Ryan Roberts,
 Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu, Alistair Popple,
 Donet Tom, Matthew Brost, Francois Dugast, Ralph Campbell
In-Reply-To: <71c736e9-eb77-4e8e-bd6a-965a1bbcbaa8@nvidia.com>
References: <20250730092139.3890844-1-balbirs@nvidia.com>
 <20250730092139.3890844-3-balbirs@nvidia.com>
 <22D1AD52-F7DA-4184-85A7-0F14D2413591@nvidia.com>
 <9f836828-4f53-41a0-b5f7-bbcd2084086e@redhat.com>
 <884b9246-de7c-4536-821f-1bf35efe31c8@redhat.com>
 <6291D401-1A45-4203-B552-79FE26E151E4@nvidia.com>
 <8E2CE1DF-4C37-4690-B968-AEA180FF44A1@nvidia.com>
 <2308291f-3afc-44b4-bfc9-c6cf0cdd6295@redhat.com>
 <9FBDBFB9-8B27-459C-8047-055F90607D60@nvidia.com>
 <11ee9c5e-3e74-4858-bf8d-94daf1530314@redhat.com>
 <14aeaecc-c394-41bf-ae30-24537eb299d9@nvidia.com>
 <71c736e9-eb77-4e8e-bd6a-965a1bbcbaa8@nvidia.com>
On 8/1/25 07:44, Balbir Singh wrote:
> On 8/1/25 11:16, Mika Penttilä wrote:
>> Hi,
>>
>> On 8/1/25 03:49, Balbir Singh wrote:
>>
>>> On 7/31/25 21:26, Zi Yan wrote:
>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>
>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>
>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>>> entries.
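(For readers following along, a minimal sketch of what the quoted paragraph
means for a caller such as try_to_migrate_one(). PVMW_THP_DEVICE_PRIVATE and
pfn_pmd_entry_to_swap() come from the patch under review; the helper name and
surrounding code here are illustrative only, built on the existing
page_vma_mapped_walk() API:)

	static bool folio_has_device_private_pmd(struct folio *folio,
						 struct vm_area_struct *vma,
						 unsigned long address)
	{
		/*
		 * Without PVMW_THP_DEVICE_PRIVATE, the walk would skip
		 * the non-present device-private PMD entry entirely;
		 * with it, the walk returns true for such entries.
		 */
		DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address,
				      PVMW_THP_DEVICE_PRIVATE);

		while (page_vma_mapped_walk(&pvmw)) {
			if (!pvmw.pte) {
				/*
				 * PMD level: pvmw.pmd points at the huge
				 * device-private swap entry.
				 */
				page_vma_mapped_walk_done(&pvmw);
				return true;
			}
		}
		return false;
	}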
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>>> but need to go through a folio split; deferred split does not work if a
>>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code,
>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cc: Karol Herbst
>>>>>>>>>>>>>>>>> Cc: Lyude Paul
>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich
>>>>>>>>>>>>>>>>> Cc: David Airlie
>>>>>>>>>>>>>>>>> Cc: Simona Vetter
>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse"
>>>>>>>>>>>>>>>>> Cc: Shuah Khan
>>>>>>>>>>>>>>>>> Cc: David Hildenbrand
>>>>>>>>>>>>>>>>> Cc: Barry Song
>>>>>>>>>>>>>>>>> Cc: Baolin Wang
>>>>>>>>>>>>>>>>> Cc: Ryan Roberts
>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox
>>>>>>>>>>>>>>>>> Cc: Peter Xu
>>>>>>>>>>>>>>>>> Cc: Zi Yan
>>>>>>>>>>>>>>>>> Cc: Kefeng Wang
>>>>>>>>>>>>>>>>> Cc: Jane Chu
>>>>>>>>>>>>>>>>> Cc: Alistair Popple
>>>>>>>>>>>>>>>>> Cc: Donet Tom
>>>>>>>>>>>>>>>>> Cc: Mika Penttilä
>>>>>>>>>>>>>>>>> Cc: Matthew Brost
>>>>>>>>>>>>>>>>> Cc: Francois Dugast
>>>>>>>>>>>>>>>>> Cc: Ralph Campbell
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost
>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>  include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>>>>>  include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>>>>>  include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>>>>>  mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>  mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>>>>>  mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>>>>>  mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>>>>>  7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>> +	 *
>>>>>>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't
>>>>>>>>>>>>>>>> there be other references in addition to the caller?
>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped on
>>>>>>>>>>>>>>> the CPU side but mapped in the device. folio_ref_freeze() is not aware of
>>>>>>>>>>>>>>> the device side mapping.
>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>> process mirrors the CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>> Ah ok, this was about a device private page obviously here, nevermind..
>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths while the folio is
>>>>>>>>>>>> mapped to CPU page tables as a huge device page by one or more tasks?
>>>>>>>>>>> The folio only has migration entries pointing to it. From the CPU
>>>>>>>>>>> perspective, it is not mapped. The unmap_folio() used by __folio_split()
>>>>>>>>>>> unmaps a to-be-split folio by replacing existing page table entries with
>>>>>>>>>>> migration entries, and after that the folio is regarded as "unmapped".
>>>>>>>>>>>
>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>> split_device_private_folio() is called for a device private entry, not a
>>>>>>>>>> migration entry afaics.
>>>>>>>>> Yes, but from the CPU perspective, both device private entries and
>>>>>>>>> migration entries are invalid CPU page table entries, so the device private
>>>>>>>>> folio is "unmapped" at the CPU side.
>>>>>>>> Yes, both are "swap entries", but there's a difference: the device private
>>>>>>>> ones contribute to mapcount and refcount.
>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed
>>>>>>> to add code to skip CPU mapping handling code. Basically device private
>>>>>>> folios are CPU unmapped and device mapped.
>>>>>>>
>>>>>>> Here are my questions on device private folios:
>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from
>>>>>>> the CPU perspective? Can it be stored in a device private specific data
>>>>>>> structure?
>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I
>>>>>> think it would make common code more messy if not done that way, but it is
>>>>>> certainly possible. And not consuming pfns (address space) at all would
>>>>>> have benefits.
>>>>>>
>>>>>>> 2. When a device private folio is mapped on the device, can someone other
>>>>>>> than the device driver manipulate it, assuming core-mm just skips device
>>>>>>> private folios (barring the CPU access fault handling)?
>>>>>>>
>>>>>>> Where I am going is: can device private folios be treated as unmapped
>>>>>>> folios by the CPU, with only the device driver manipulating their mappings?
>>>>>>
>>>>>> Yes, not present for the CPU, but mm has bookkeeping on them. The private
>>>>>> page has no content someone could change while in the device; it's just a
>>>>>> pfn.
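(To make the mapcount/refcount distinction above concrete, a small sketch
using only existing swapops.h helpers; nothing in it comes from the patch
under review:)

	#include <linux/swapops.h>

	static void classify_nonpresent_pte(pte_t pte)
	{
		swp_entry_t entry = pte_to_swp_entry(pte);

		if (is_device_private_entry(entry)) {
			/*
			 * Device-private: installed via the rmap code, so it
			 * contributes to folio_mapcount() and the folio
			 * refcount like a present mapping would;
			 * pfn_swap_entry_to_page() yields the ZONE_DEVICE
			 * page.
			 */
		} else if (is_migration_entry(entry)) {
			/*
			 * Migration entry: a transient placeholder left by
			 * try_to_migrate(); it holds no mapcount, and the
			 * folio is considered unmapped while it exists.
			 */
		}
	}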
>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is
>>>>> a *page table mapping* tracked through the rmap -- even though they are not
>>>>> present page table entries.
>>>>>
>>>>> It would be better if they were present page table entries that are
>>>>> PROT_NONE, but it's tricky to mark them as being "special" device-private,
>>>>> device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>
>>>>> Maybe device-private could just be PROT_NONE, because we can identify the
>>>>> entry type based on the folio. device-exclusive is harder ...
>>>>>
>>>>> So consider device-private entries just like PROT_NONE present page table
>>>>> entries. Refcount and mapcount are adjusted accordingly by rmap functions.
>>>> Thanks for the clarification.
>>>>
>>>> So folio_mapcount() for device private folios should be treated the same as
>>>> for normal folios, even if the corresponding PTEs are not accessible from
>>>> CPUs. Then I wonder if the device private large folio split should go
>>>> through __folio_split(), the same as normal folios: unmap, freeze, split,
>>>> unfreeze, remap. Otherwise, how can we prevent rmap changes during the
>>>> split?
>>>>
>>> That is true in general; the special cases I mentioned are:
>>>
>>> 1. Split during migration (where the sizes on source/destination do not
>>> match), so we need to split in the middle of migration. The entries there
>>> are already unmapped, hence the special handling.
>>> 2. The partial unmap case, where we need to split in the context of the
>>> unmap due to the issues mentioned in the patch. I expanded the folio split
>>> code for device private into its own helper, which does not need to do the
>>> xas/mapped/lru folio handling. During partial unmap the original folio does
>>> get replaced by new anon rmap ptes (split_huge_pmd_locked).
>>>
>>> For (2), I spent some time examining the implications of not unmapping the
>>> folios prior to split; in the partial unmap path, once we split the PMD the
>>> folios diverge. I did not run into any particular race with the tests
>>> either.
>> 1) is totally fine. This was in v1 and led to Zi's split_unmapped_folio().
>>
>> 2) is a problem because the folio is mapped. split_huge_pmd() can be reached
>> also from paths other than unmap. It is vulnerable to races via rmap. And
>> for instance this does not look right without checking:
>>
>> folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>
> I can add checks to make sure that the call does succeed.
>
>> You mention 2) is needed because of some later problems in the fault path
>> after the pmd split. Would it be possible to split the folio at fault time
>> then?
> So after the partial unmap, the folio ends up in a little strange situation:
> the folio is large, but not mapped (since large_mapcount can be 0 after all
> the folio_rmap_remove_ptes). Calling folio_split() on a partially unmapped
> folio fails because folio_get_anon_vma() fails, due to the folio_mapped()
> failures related to folio_large_mapcount. There is also additional
> complexity with ref counts and mapping.

Is this after the deferred split -> map_unused_to_zeropage flow, which would
leave the page unmapped? Maybe disable that for device pages?

>
>
>> Also, didn't quite follow what kind of lock recursion you encountered doing
>> a proper split_folio() instead?
>>
>>
> Splitting during partial unmap causes recursive locking issues with anon_vma
> when invoked from the split_huge_pmd_locked() path. Deferred splits do not
> work for device private pages, due to the migration requirements for fault
> handling.
>
> Balbir Singh

--Mika
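(Closing out the freeze discussion above: a minimal sketch of the checked
variant Balbir agreed to add. split_device_private_folio() and
folio_expected_ref_count() are from the patch/recent mm; the function name
and the -EAGAIN convention here are assumptions for illustration only:)

	static int split_device_private_folio_checked(struct folio *folio)
	{
		/*
		 * A concurrent reference (e.g. a racing GUP or rmap walk)
		 * makes the freeze fail; bail out instead of splitting a
		 * folio whose refcount we do not exclusively own.
		 */
		if (!folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio)))
			return -EAGAIN;

		/*
		 * ... proceed with __split_unmapped_folio() as in the
		 * patch, then unfreeze the resulting folios ...
		 */

		return 0;
	}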