From: Jason Gunthorpe <jgg@nvidia.com>
To: Ralph Campbell <rcampbell@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>, <linux-mm@kvack.org>,
<nouveau@lists.freedesktop.org>,
<linux-kselftest@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
Jerome Glisse <jglisse@redhat.com>,
"John Hubbard" <jhubbard@nvidia.com>,
Alistair Popple <apopple@nvidia.com>,
Bharata B Rao <bharata@linux.ibm.com>, Zi Yan <ziy@nvidia.com>,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
Yang Shi <yang.shi@linux.alibaba.com>,
Ben Skeggs <bskeggs@redhat.com>, Shuah Khan <shuah@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH v3 3/6] mm: support THP migration to device private memory
Date: Fri, 20 Nov 2020 16:01:33 -0400 [thread overview]
Message-ID: <20201120200133.GH917484@nvidia.com> (raw)
In-Reply-To: <bbf1f0df-85f3-5887-050e-beb2aad750f2@nvidia.com>
On Wed, Nov 11, 2020 at 03:38:42PM -0800, Ralph Campbell wrote:
> MEMORY_DEVICE_GENERIC:
> Struct pages are created in dev_dax_probe() and represent non-volatile memory.
> The device can be mmap()'ed which calls dax_mmap() which sets
> vma->vm_flags | VM_HUGEPAGE.
> A CPU page fault will result in a PTE, PMD, or PUD sized page
> (but not compound) to be inserted by vmf_insert_mixed() which will call either
> insert_pfn() or insert_page().
> Neither insert_pfn() nor insert_page() increments the page reference
> count.
But why was this done? It seems very strange to put a pfn with a
struct page into a VMA and then deliberately not take the refcount for
the duration of that pfn being in the VMA?
What prevents memunmap_pages() from progressing while VMAs still point
at the memory?
> I think just leaving the page reference count at one is better than trying
> to use the mmu_interval_notifier or changing vmf_insert_mixed() and
> invalidations of pfn_t_devmap(pfn) to adjust the page reference count.
Why so? The entire point of getting struct page's for this stuff was
to be able to follow the struct page flow. I never did learn a reason
why there is devmap stuff all over the place in the page table code...
> MEMORY_DEVICE_FS_DAX:
> Struct pages are created in pmem_attach_disk() and virtio_fs_setup_dax() with
> an initial reference count of one.
> The problem I see is that there are 3 states that are important:
> a) memory is free and not allocated to any file (page_ref_count() == 0).
> b) memory is allocated to a file and in the page cache (page_ref_count() == 1).
> c) some gup() or I/O has a reference even after calling unmap_mapping_pages()
> (page_ref_count() > 1). ext4_break_layouts() basically waits until the
> page_ref_count() == 1 with put_page() calling wake_up_var(&page->_refcount)
> to wake up ext4_break_layouts().
> The current code doesn't seem to distinguish (a) and (b). If we want to use
> the 0->1 reference count to signal (c), then the page cache would have hold
> entries with a page_ref_count() == 0 which doesn't match the general page cache
> assumptions.
This explanation feels confusing. If *anything* has a reference on the
page it cannot be recycled. I would have guess the logic is to remove
it from the page cache then wait for a 0 reference??
Jason
WARNING: multiple messages have this Message-ID (diff)
From: Jason Gunthorpe <jgg-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
To: Ralph Campbell <rcampbell-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
Cc: Yang Shi
<yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>,
Zi Yan <ziy-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>,
nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
Alistair Popple <apopple-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Bharata B Rao <bharata-tEXmvtCZX7AybS5Ee8rs3A@public.gmane.org>,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
Ben Skeggs <bskeggs-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
linux-kselftest-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Andrew Morton
<akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>,
Shuah Khan <shuah-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>,
"Kirill A . Shutemov"
<kirill.shutemov-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Subject: Re: [PATCH v3 3/6] mm: support THP migration to device private memory
Date: Fri, 20 Nov 2020 16:01:33 -0400 [thread overview]
Message-ID: <20201120200133.GH917484@nvidia.com> (raw)
In-Reply-To: <bbf1f0df-85f3-5887-050e-beb2aad750f2-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
On Wed, Nov 11, 2020 at 03:38:42PM -0800, Ralph Campbell wrote:
> MEMORY_DEVICE_GENERIC:
> Struct pages are created in dev_dax_probe() and represent non-volatile memory.
> The device can be mmap()'ed which calls dax_mmap() which sets
> vma->vm_flags | VM_HUGEPAGE.
> A CPU page fault will result in a PTE, PMD, or PUD sized page
> (but not compound) to be inserted by vmf_insert_mixed() which will call either
> insert_pfn() or insert_page().
> Neither insert_pfn() nor insert_page() increments the page reference
> count.
But why was this done? It seems very strange to put a pfn with a
struct page into a VMA and then deliberately not take the refcount for
the duration of that pfn being in the VMA?
What prevents memunmap_pages() from progressing while VMAs still point
at the memory?
> I think just leaving the page reference count at one is better than trying
> to use the mmu_interval_notifier or changing vmf_insert_mixed() and
> invalidations of pfn_t_devmap(pfn) to adjust the page reference count.
Why so? The entire point of getting struct page's for this stuff was
to be able to follow the struct page flow. I never did learn a reason
why there is devmap stuff all over the place in the page table code...
> MEMORY_DEVICE_FS_DAX:
> Struct pages are created in pmem_attach_disk() and virtio_fs_setup_dax() with
> an initial reference count of one.
> The problem I see is that there are 3 states that are important:
> a) memory is free and not allocated to any file (page_ref_count() == 0).
> b) memory is allocated to a file and in the page cache (page_ref_count() == 1).
> c) some gup() or I/O has a reference even after calling unmap_mapping_pages()
> (page_ref_count() > 1). ext4_break_layouts() basically waits until the
> page_ref_count() == 1 with put_page() calling wake_up_var(&page->_refcount)
> to wake up ext4_break_layouts().
> The current code doesn't seem to distinguish (a) and (b). If we want to use
> the 0->1 reference count to signal (c), then the page cache would have hold
> entries with a page_ref_count() == 0 which doesn't match the general page cache
> assumptions.
This explanation feels confusing. If *anything* has a reference on the
page it cannot be recycled. I would have guess the logic is to remove
it from the page cache then wait for a 0 reference??
Jason
next prev parent reply other threads:[~2020-11-20 20:02 UTC|newest]
Thread overview: 44+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-11-06 0:51 [PATCH v3 0/6] mm/hmm/nouveau: add THP migration to migrate_vma_* Ralph Campbell
2020-11-06 0:51 ` Ralph Campbell
2020-11-06 0:51 ` [PATCH v3 1/6] mm/thp: add prep_transhuge_device_private_page() Ralph Campbell
2020-11-06 0:51 ` Ralph Campbell
2020-11-06 7:55 ` Christoph Hellwig
2020-11-06 20:56 ` Ralph Campbell
2020-11-06 20:56 ` Ralph Campbell
2020-11-06 12:14 ` Matthew Wilcox
2020-11-06 12:14 ` Matthew Wilcox
2020-11-06 20:34 ` Ralph Campbell
2020-11-06 20:34 ` Ralph Campbell
2020-11-06 0:51 ` [PATCH v3 2/6] mm/migrate: move migrate_vma_collect_skip() Ralph Campbell
2020-11-06 0:51 ` Ralph Campbell
2020-11-06 7:56 ` Christoph Hellwig
2020-11-06 7:57 ` Christoph Hellwig
2020-11-06 0:51 ` [PATCH v3 3/6] mm: support THP migration to device private memory Ralph Campbell
2020-11-06 0:51 ` Ralph Campbell
2020-11-06 8:03 ` Christoph Hellwig
2020-11-06 8:03 ` Christoph Hellwig
2020-11-06 21:26 ` Ralph Campbell
2020-11-06 21:26 ` Ralph Campbell
2020-11-09 9:14 ` Christoph Hellwig
2020-11-09 21:34 ` Ralph Campbell
2020-11-09 21:34 ` Ralph Campbell
2020-11-11 23:38 ` Ralph Campbell
2020-11-11 23:38 ` Ralph Campbell
2020-11-20 20:01 ` Jason Gunthorpe [this message]
2020-11-20 20:01 ` Jason Gunthorpe
2020-12-02 10:08 ` Christoph Hellwig
2020-12-02 10:08 ` Christoph Hellwig
2020-12-05 8:22 ` Roger Pau Monné
2020-12-05 8:22 ` Roger Pau Monné
2020-12-02 10:14 ` Christoph Hellwig
2020-12-02 10:14 ` Christoph Hellwig
2020-12-02 18:01 ` Logan Gunthorpe
2020-12-02 18:01 ` Logan Gunthorpe
2020-11-06 0:51 ` [PATCH v3 4/6] mm/thp: add THP allocation helper Ralph Campbell
2020-11-06 0:51 ` Ralph Campbell
2020-11-06 8:01 ` Christoph Hellwig
2020-11-06 21:09 ` Ralph Campbell
2020-11-06 0:51 ` [PATCH v3 5/6] mm/hmm/test: add self tests for THP migration Ralph Campbell
2020-11-06 0:51 ` Ralph Campbell
2020-11-06 0:51 ` [PATCH v3 6/6] nouveau: support THP migration to private memory Ralph Campbell
2020-11-06 0:51 ` Ralph Campbell
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20201120200133.GH917484@nvidia.com \
--to=jgg@nvidia.com \
--cc=akpm@linux-foundation.org \
--cc=apopple@nvidia.com \
--cc=bharata@linux.ibm.com \
--cc=bskeggs@redhat.com \
--cc=hch@lst.de \
--cc=jglisse@redhat.com \
--cc=jhubbard@nvidia.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=nouveau@lists.freedesktop.org \
--cc=rcampbell@nvidia.com \
--cc=shuah@kernel.org \
--cc=yang.shi@linux.alibaba.com \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.