From: Alistair Popple <apopple@nvidia.com>
To: David Hildenbrand <david@redhat.com>
Cc: dan.j.williams@intel.com, vishal.l.verma@intel.com,
dave.jiang@intel.com, logang@deltatee.com, bhelgaas@google.com,
jack@suse.cz, jgg@ziepe.ca, catalin.marinas@arm.com,
will@kernel.org, mpe@ellerman.id.au, npiggin@gmail.com,
dave.hansen@linux.intel.com, ira.weiny@intel.com,
willy@infradead.org, djwong@kernel.org, tytso@mit.edu,
linmiaohe@huawei.com, peterx@redhat.com,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linuxppc-dev@lists.ozlabs.org, nvdimm@lists.linux.dev,
linux-cxl@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org, linux-ext4@vger.kernel.org,
linux-xfs@vger.kernel.org, jhubbard@nvidia.com, hch@lst.de,
david@fromorbit.com
Subject: Re: [PATCH 07/13] huge_memory: Allow mappings of PUD sized pages
Date: Tue, 02 Jul 2024 21:30:19 +1000 [thread overview]
Message-ID: <87wmm4f1y6.fsf@nvdebian.thelocal> (raw)
In-Reply-To: <0b549ff0-b0b6-4fc8-aa6f-0d76157575b3@redhat.com>
David Hildenbrand <david@redhat.com> writes:
> On 02.07.24 12:19, Alistair Popple wrote:
>> David Hildenbrand <david@redhat.com> writes:
>>
>>> On 27.06.24 02:54, Alistair Popple wrote:
>>>> Currently DAX folio/page reference counts are managed differently to
>>>> normal pages. To allow these to be managed the same as normal pages
>>>> introduce dax_insert_pfn_pud. This will map the entire PUD-sized folio
>>>> and take references as it would for a normally mapped page.
>>>> This is distinct from the current mechanism, vmf_insert_pfn_pud,
>>>> which
>>>> simply inserts a special devmap PUD entry into the page table without
>>>> holding a reference to the page for the mapping.
>>>
>>> Do we really have to involve mapcounts/rmap for daxfs pages at this
>>> point? Or is this only "to make it look more like other pages" ?
>> The aim of the series is make FS DAX and other ZONE_DEVICE pages
>> look
>> like other pages, at least with regards to the way they are refcounted.
>> At the moment they are not refcounted - instead their refcounts are
>> basically statically initialised to one and there are all these special
>> cases and functions requiring magic PTE bits (pXX_devmap) to do the
>> special DAX reference counting. This then adds some cruft to manage
>> pgmap references and to catch the 2->1 page refcount transition. All
>> this just goes away if we manage the page references the same as other
>> pages (and indeed we already manage DEVICE_PRIVATE and COHERENT pages
>> the same as normal pages).
>> So I think to make this work we at least need the mapcounts.
>>
>
> We only really need the mapcounts if we intend to do something like
> folio_mapcount() == folio_ref_count() to detect unexpected folio
> references, and if we have to have things like folio_mapped()
> working. For now that was not required, that's why I am asking.
Oh I see, thanks for pointing that out. In that case I agree, we don't
need the mapcounts. As you say we don't currently need to detect
unexpect references for FS DAX and this series doesn't seek to introduce
any new behviour/features.
> Background also being that in a distant future folios will be
> decoupled more from other compound pages, and only folios (or "struct
> anon_folio" / "struct file_folio") would even have mapcounts.
>
> For example, most stuff we map (and refcount!) via vm_insert_page()
> really must stop involving mapcounts. These won't be "ordinary"
> mapcount-tracked folios in the future, they are simply some refcounted
> pages some ordinary driver allocated.
Ok, so for FS DAX we should take a reference on the folio for the
mapping but not a mapcount?
> For FS-DAX, if we'll be using the same "struct file_folio" approach as
> for ordinary pageache memory, then this is the right thing to do here.
>
>
>>> I'm asking this because:
>>>
>>> (A) We don't support mixing PUD+PMD mappings yet. I have plans to change
>>> that in the future, but for now you can only map using a single PUD
>>> or by PTEs. I suspect that's good enoug for now for dax fs?
>> Yep, that's all we support.
>>
>>> (B) As long as we have subpage mapcounts, this prevents vmemmap
>>> optimizations [1]. Is that only used for device-dax for now and are
>>> there no plans to make use of that for fs-dax?
>> I don't have any plans to. This is purely focussed on refcounting
>> pages
>> "like normal" so we can get rid of all the DAX special casing.
>>
>>> (C) We managed without so far :)
>> Indeed, although Christoph has asked repeatedly ([1], [2] and likely
>> others) that this gets fixed and I finally got sick of it coming up
>> everytime I need to touch something with ZONE_DEVICE pages :)
>> Also it removes the need for people to understand the special DAX
>> page
>> recounting scheme and ends up removing a bunch of cruft as a bonus:
>> 59 files changed, 485 insertions(+), 869 deletions(-)
>
> I'm not challenging the refcounting scheme. I'm purely asking about
> mapcount handling, which is something related but different.
Got it, thanks. I hadn't quite picked up on the mapcount vs. refcount
distinction you were asking about.
>> And that's before I clean up all the pgmap reference handling. It
>> also
>> removes the pXX_trans_huge and pXX_leaf distinction. So we managed, but
>> things could be better IMHO.
>>
>
> Again, all nice things.
>
>>> Having that said, with folio->_large_mapcount things like
>>> folio_mapcount() are no longer terribly slow once we weould PTE-map a
>>> PUD-sized folio.
>>>
>>> Also, all ZONE_DEVICE pages should currently be marked PG_reserved,
>>> translating to "don't touch the memmap". I think we might want to
>>> tackle that first.
>
> Missed to add a pointer to [2].
>
>> Ok. I'm keen to get this series finished and I don't quite get the
>> connection here, what needs to change there?
>
> include/linux/page-flags.h
>
> "PG_reserved is set for special pages. The "struct page" of such a
> page should in general not be touched (e.g. set dirty) except by its
> owner. Pages marked as PG_reserved include:
>
> ...
>
> - Device memory (e.g. PMEM, DAX, HMM)
> "
>
> I think we already entered that domain with other ZONE_DEVICE pages
> being returned from vm_normal_folio(), unfortunately. But that really
> must be cleaned up for these pages to not look special anymore.
>
> Agreed that it likely is something that is not blocking this series.
Great. I'd like to see that cleaned up a little too or at least made
more understandable. The first time I looked at this it took a
suprising amount of time to figure out what constituted a "normal"
page so I'd be happy to help. This series does, in a small way, clean
that up by removing the pte_devmap special case.
WARNING: multiple messages have this Message-ID (diff)
From: Alistair Popple <apopple@nvidia.com>
To: David Hildenbrand <david@redhat.com>
Cc: linmiaohe@huawei.com, nvdimm@lists.linux.dev, jack@suse.cz,
linux-doc@vger.kernel.org, djwong@kernel.org,
dave.hansen@linux.intel.com, david@fromorbit.com,
peterx@redhat.com, linux-mm@kvack.org, will@kernel.org,
hch@lst.de, dave.jiang@intel.com, vishal.l.verma@intel.com,
willy@infradead.org, jgg@ziepe.ca, catalin.marinas@arm.com,
linux-ext4@vger.kernel.org, ira.weiny@intel.com,
jhubbard@nvidia.com, npiggin@gmail.com,
linux-cxl@vger.kernel.org, bhelgaas@google.com,
dan.j.williams@intel.com, linux-arm-kernel@lists.infradead.org,
tytso@mit.edu, logang@deltatee.com, linux-kernel@vger.kernel.org,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH 07/13] huge_memory: Allow mappings of PUD sized pages
Date: Tue, 02 Jul 2024 21:30:19 +1000 [thread overview]
Message-ID: <87wmm4f1y6.fsf@nvdebian.thelocal> (raw)
In-Reply-To: <0b549ff0-b0b6-4fc8-aa6f-0d76157575b3@redhat.com>
David Hildenbrand <david@redhat.com> writes:
> On 02.07.24 12:19, Alistair Popple wrote:
>> David Hildenbrand <david@redhat.com> writes:
>>
>>> On 27.06.24 02:54, Alistair Popple wrote:
>>>> Currently DAX folio/page reference counts are managed differently to
>>>> normal pages. To allow these to be managed the same as normal pages
>>>> introduce dax_insert_pfn_pud. This will map the entire PUD-sized folio
>>>> and take references as it would for a normally mapped page.
>>>> This is distinct from the current mechanism, vmf_insert_pfn_pud,
>>>> which
>>>> simply inserts a special devmap PUD entry into the page table without
>>>> holding a reference to the page for the mapping.
>>>
>>> Do we really have to involve mapcounts/rmap for daxfs pages at this
>>> point? Or is this only "to make it look more like other pages" ?
>> The aim of the series is make FS DAX and other ZONE_DEVICE pages
>> look
>> like other pages, at least with regards to the way they are refcounted.
>> At the moment they are not refcounted - instead their refcounts are
>> basically statically initialised to one and there are all these special
>> cases and functions requiring magic PTE bits (pXX_devmap) to do the
>> special DAX reference counting. This then adds some cruft to manage
>> pgmap references and to catch the 2->1 page refcount transition. All
>> this just goes away if we manage the page references the same as other
>> pages (and indeed we already manage DEVICE_PRIVATE and COHERENT pages
>> the same as normal pages).
>> So I think to make this work we at least need the mapcounts.
>>
>
> We only really need the mapcounts if we intend to do something like
> folio_mapcount() == folio_ref_count() to detect unexpected folio
> references, and if we have to have things like folio_mapped()
> working. For now that was not required, that's why I am asking.
Oh I see, thanks for pointing that out. In that case I agree, we don't
need the mapcounts. As you say we don't currently need to detect
unexpect references for FS DAX and this series doesn't seek to introduce
any new behviour/features.
> Background also being that in a distant future folios will be
> decoupled more from other compound pages, and only folios (or "struct
> anon_folio" / "struct file_folio") would even have mapcounts.
>
> For example, most stuff we map (and refcount!) via vm_insert_page()
> really must stop involving mapcounts. These won't be "ordinary"
> mapcount-tracked folios in the future, they are simply some refcounted
> pages some ordinary driver allocated.
Ok, so for FS DAX we should take a reference on the folio for the
mapping but not a mapcount?
> For FS-DAX, if we'll be using the same "struct file_folio" approach as
> for ordinary pageache memory, then this is the right thing to do here.
>
>
>>> I'm asking this because:
>>>
>>> (A) We don't support mixing PUD+PMD mappings yet. I have plans to change
>>> that in the future, but for now you can only map using a single PUD
>>> or by PTEs. I suspect that's good enoug for now for dax fs?
>> Yep, that's all we support.
>>
>>> (B) As long as we have subpage mapcounts, this prevents vmemmap
>>> optimizations [1]. Is that only used for device-dax for now and are
>>> there no plans to make use of that for fs-dax?
>> I don't have any plans to. This is purely focussed on refcounting
>> pages
>> "like normal" so we can get rid of all the DAX special casing.
>>
>>> (C) We managed without so far :)
>> Indeed, although Christoph has asked repeatedly ([1], [2] and likely
>> others) that this gets fixed and I finally got sick of it coming up
>> everytime I need to touch something with ZONE_DEVICE pages :)
>> Also it removes the need for people to understand the special DAX
>> page
>> recounting scheme and ends up removing a bunch of cruft as a bonus:
>> 59 files changed, 485 insertions(+), 869 deletions(-)
>
> I'm not challenging the refcounting scheme. I'm purely asking about
> mapcount handling, which is something related but different.
Got it, thanks. I hadn't quite picked up on the mapcount vs. refcount
distinction you were asking about.
>> And that's before I clean up all the pgmap reference handling. It
>> also
>> removes the pXX_trans_huge and pXX_leaf distinction. So we managed, but
>> things could be better IMHO.
>>
>
> Again, all nice things.
>
>>> Having that said, with folio->_large_mapcount things like
>>> folio_mapcount() are no longer terribly slow once we weould PTE-map a
>>> PUD-sized folio.
>>>
>>> Also, all ZONE_DEVICE pages should currently be marked PG_reserved,
>>> translating to "don't touch the memmap". I think we might want to
>>> tackle that first.
>
> Missed to add a pointer to [2].
>
>> Ok. I'm keen to get this series finished and I don't quite get the
>> connection here, what needs to change there?
>
> include/linux/page-flags.h
>
> "PG_reserved is set for special pages. The "struct page" of such a
> page should in general not be touched (e.g. set dirty) except by its
> owner. Pages marked as PG_reserved include:
>
> ...
>
> - Device memory (e.g. PMEM, DAX, HMM)
> "
>
> I think we already entered that domain with other ZONE_DEVICE pages
> being returned from vm_normal_folio(), unfortunately. But that really
> must be cleaned up for these pages to not look special anymore.
>
> Agreed that it likely is something that is not blocking this series.
Great. I'd like to see that cleaned up a little too or at least made
more understandable. The first time I looked at this it took a
suprising amount of time to figure out what constituted a "normal"
page so I'd be happy to help. This series does, in a small way, clean
that up by removing the pte_devmap special case.
next prev parent reply other threads:[~2024-07-02 11:50 UTC|newest]
Thread overview: 107+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-06-27 0:54 [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Alistair Popple
2024-06-27 0:54 ` Alistair Popple
2024-06-27 0:54 ` [PATCH 01/13] mm/gup.c: Remove redundant check for PCI P2PDMA page Alistair Popple
2024-06-27 0:54 ` Alistair Popple
2024-06-27 6:36 ` Dan Williams
2024-06-27 6:36 ` Dan Williams
2024-06-27 0:54 ` [PATCH 02/13] pci/p2pdma: Don't initialise page refcount to one Alistair Popple
2024-06-27 0:54 ` Alistair Popple
2024-06-27 5:30 ` Christoph Hellwig
2024-06-27 5:30 ` Christoph Hellwig
2024-06-29 21:28 ` Bjorn Helgaas
2024-06-29 21:28 ` Bjorn Helgaas
2024-06-27 0:54 ` [PATCH 03/13] fs/dax: Refactor wait for dax idle page Alistair Popple
2024-06-27 0:54 ` Alistair Popple
2024-06-27 5:31 ` Christoph Hellwig
2024-06-27 5:31 ` Christoph Hellwig
2024-06-27 0:54 ` [PATCH 04/13] fs/dax: Add dax_page_free callback Alistair Popple
2024-06-27 0:54 ` Alistair Popple
2024-06-27 5:33 ` Christoph Hellwig
2024-06-27 5:33 ` Christoph Hellwig
2024-06-27 23:48 ` Alistair Popple
2024-06-27 23:48 ` Alistair Popple
2024-06-27 0:54 ` [PATCH 05/13] mm: Allow compound zone device pages Alistair Popple
2024-06-27 0:54 ` Alistair Popple
2024-06-27 5:35 ` Christoph Hellwig
2024-06-27 5:35 ` Christoph Hellwig
2024-06-27 0:54 ` [PATCH 06/13] mm/memory: Add dax_insert_pfn Alistair Popple
2024-06-27 0:54 ` Alistair Popple
2024-06-27 5:22 ` Christoph Hellwig
2024-06-27 5:22 ` Christoph Hellwig
2024-06-27 11:33 ` Jan Kara
2024-06-27 11:33 ` Jan Kara
2024-09-06 6:21 ` Alistair Popple
2024-07-02 7:18 ` David Hildenbrand
2024-07-02 7:18 ` David Hildenbrand
2024-07-02 10:47 ` Alistair Popple
2024-07-02 10:47 ` Alistair Popple
2024-07-02 11:46 ` Christoph Hellwig
2024-07-02 11:46 ` Christoph Hellwig
2024-07-02 11:53 ` David Hildenbrand
2024-07-02 11:53 ` David Hildenbrand
2024-06-27 0:54 ` [PATCH 07/13] huge_memory: Allow mappings of PUD sized pages Alistair Popple
2024-06-27 0:54 ` Alistair Popple
2024-06-27 22:26 ` kernel test robot
2024-06-27 22:26 ` kernel test robot
2024-07-02 7:16 ` David Hildenbrand
2024-07-02 7:16 ` David Hildenbrand
2024-07-02 10:19 ` Alistair Popple
2024-07-02 10:19 ` Alistair Popple
2024-07-02 11:02 ` David Hildenbrand
2024-07-02 11:02 ` David Hildenbrand
2024-07-02 11:30 ` Alistair Popple [this message]
2024-07-02 11:30 ` Alistair Popple
2024-07-02 13:01 ` David Hildenbrand
2024-07-02 13:01 ` David Hildenbrand
2024-07-02 11:51 ` Christoph Hellwig
2024-07-02 11:51 ` Christoph Hellwig
2024-07-02 12:22 ` Eliot Moss
2024-06-27 0:54 ` [PATCH 08/13] huge_memory: Allow mappings of PMD " Alistair Popple
2024-06-27 0:54 ` Alistair Popple
2024-06-27 0:54 ` [PATCH 09/13] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages Alistair Popple
2024-06-27 0:54 ` Alistair Popple
2024-07-01 8:59 ` David Hildenbrand
2024-07-01 8:59 ` David Hildenbrand
2024-07-01 23:47 ` Alistair Popple
2024-07-01 23:47 ` Alistair Popple
2024-07-02 10:48 ` David Hildenbrand
2024-07-02 10:48 ` David Hildenbrand
2024-06-27 0:54 ` [PATCH 10/13] fs/dax: Properly refcount fs dax pages Alistair Popple
2024-06-27 0:54 ` Alistair Popple
2024-06-27 5:44 ` Christoph Hellwig
2024-06-27 5:44 ` Christoph Hellwig
2024-09-06 6:00 ` Alistair Popple
2024-06-27 0:54 ` [PATCH 11/13] huge_memory: Remove dead vmf_insert_pXd code Alistair Popple
2024-06-27 0:54 ` Alistair Popple
2024-07-05 14:24 ` Peter Xu
2024-07-05 14:24 ` Peter Xu
2024-07-09 4:07 ` Alistair Popple
2024-07-09 4:07 ` Alistair Popple
2024-07-09 15:56 ` Peter Xu
2024-07-09 15:56 ` Peter Xu
2024-07-12 2:40 ` Alistair Popple
2024-07-12 2:40 ` Alistair Popple
2024-07-12 15:52 ` Peter Xu
2024-07-12 15:52 ` Peter Xu
2024-06-27 0:54 ` [PATCH 12/13] mm: Remove pXX_devmap callers Alistair Popple
2024-06-27 0:54 ` Alistair Popple
2024-06-27 0:54 ` [PATCH 13/13] mm: Remove devmap related functions and page table bits Alistair Popple
2024-06-27 0:54 ` Alistair Popple
2024-06-27 23:04 ` kernel test robot
2024-06-27 23:04 ` kernel test robot
2024-06-28 2:12 ` kernel test robot
2024-06-28 2:12 ` kernel test robot
2024-07-08 11:35 ` Will Deacon
2024-07-08 11:35 ` Will Deacon
2024-06-27 6:58 ` [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Dan Williams
2024-06-27 6:58 ` Dan Williams
2024-06-27 7:15 ` Alistair Popple
2024-06-27 7:15 ` Alistair Popple
2024-06-27 20:24 ` Dan Williams
2024-06-27 20:24 ` Dan Williams
2024-06-28 0:06 ` Alistair Popple
2024-06-28 0:06 ` Alistair Popple
2024-07-01 4:24 ` Dave Chinner
2024-07-01 4:24 ` Dave Chinner
2024-07-01 8:33 ` Alistair Popple
2024-07-01 8:33 ` Alistair Popple
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87wmm4f1y6.fsf@nvdebian.thelocal \
--to=apopple@nvidia.com \
--cc=bhelgaas@google.com \
--cc=catalin.marinas@arm.com \
--cc=dan.j.williams@intel.com \
--cc=dave.hansen@linux.intel.com \
--cc=dave.jiang@intel.com \
--cc=david@fromorbit.com \
--cc=david@redhat.com \
--cc=djwong@kernel.org \
--cc=hch@lst.de \
--cc=ira.weiny@intel.com \
--cc=jack@suse.cz \
--cc=jgg@ziepe.ca \
--cc=jhubbard@nvidia.com \
--cc=linmiaohe@huawei.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-xfs@vger.kernel.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=logang@deltatee.com \
--cc=mpe@ellerman.id.au \
--cc=npiggin@gmail.com \
--cc=nvdimm@lists.linux.dev \
--cc=peterx@redhat.com \
--cc=tytso@mit.edu \
--cc=vishal.l.verma@intel.com \
--cc=will@kernel.org \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.