* Re: [PATCH v3 3/6] mm: support THP migration to device private memory
[not found] ` <bbf1f0df-85f3-5887-050e-beb2aad750f2@nvidia.com>
@ 2020-12-02 10:14 ` Christoph Hellwig
2020-12-02 18:01 ` Logan Gunthorpe
0 siblings, 1 reply; 2+ messages in thread
From: Christoph Hellwig @ 2020-12-02 10:14 UTC (permalink / raw)
To: Ralph Campbell
Cc: Christoph Hellwig, linux-mm, nouveau, linux-kselftest,
linux-kernel, Jerome Glisse, John Hubbard, Alistair Popple,
Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
Yang Shi, Ben Skeggs, Shuah Khan, Andrew Morton, Logan Gunthorpe,
Dan Williams, linux-nvdimm, linux-fsdevel
[adding a few of the usual suspects]
On Wed, Nov 11, 2020 at 03:38:42PM -0800, Ralph Campbell wrote:
> There are 4 types of ZONE_DEVICE struct pages:
> MEMORY_DEVICE_PRIVATE, MEMORY_DEVICE_FS_DAX, MEMORY_DEVICE_GENERIC, and
> MEMORY_DEVICE_PCI_P2PDMA.
>
> Currently, memremap_pages() allocates struct pages for a physical address range
> with a page_ref_count(page) of one and increments the pgmap->ref per CPU
> reference count by the number of pages created since each ZONE_DEVICE struct
> page has a pointer to the pgmap.
>
> The struct pages are not freed until memunmap_pages() is called which
> calls put_page() which calls put_dev_pagemap() which releases a reference to
> pgmap->ref. memunmap_pages() blocks waiting for pgmap->ref reference count
> to be zero. As far as I can tell, the put_page() in memunmap_pages() has to
> be the *last* put_page() (see MEMORY_DEVICE_PCI_P2PDMA).
> My RFC [1] breaks this put_page() -> put_dev_pagemap() connection so that
> the struct page reference count can go to zero and back to non-zero without
> changing the pgmap->ref reference count.
>
> Q1: Is that safe? Is there some code that depends on put_page() dropping
> the pgmap->ref reference count as part of memunmap_pages()?
> My testing of [1] seems OK but I'm sure there are lots of cases I didn't test.
It should be safe, but the audit you've done is important to make sure
we do not miss anything important.
> MEMORY_DEVICE_PCI_P2PDMA:
> Struct pages are created in pci_p2pdma_add_resource() and represent device
> memory accessible by PCIe bar address space. Memory is allocated with
> pci_alloc_p2pmem() based on a byte length but the gen_pool_alloc_owner()
> call will allocate memory in a minimum of PAGE_SIZE units.
> Reference counting is +1 per *allocation* on the pgmap->ref reference count.
> Note that this is not +1 per page which is what put_page() expects. So
> currently, a get_page()/put_page() works OK because the page reference count
> only goes 1->2 and 2->1. If it went to zero, the pgmap->ref reference count
> would be incorrect if the allocation size was greater than one page.
>
> I see pci_alloc_p2pmem() is called by nvme_alloc_sq_cmds() and
> pci_p2pmem_alloc_sgl() to create a command queue and a struct scatterlist *.
> Looks like sg_page(sg) returns the ZONE_DEVICE struct page of the scatterlist.
> There are a huge number of places sg_page() is called so it is hard to tell
> whether or not get_page()/put_page() is ever called on MEMORY_DEVICE_PCI_P2PDMA
> pages.
Nothing should call get_page/put_page on them, as they are not treated
as refcountable memory. More importantly nothing is allowed to keep
a reference longer than the time of the I/O.
> pci_p2pmem_virt_to_bus() will return the physical address and I guess
> pfn_to_page(physaddr >> PAGE_SHIFT) could return the struct page.
>
> Since there is a clear allocation/free, pci_alloc_p2pmem() can probably be
> modified to increment/decrement the MEMORY_DEVICE_PCI_P2PDMA struct page
> reference count. Or maybe just leave it at one like it is now.
And yes, doing that is probably a sensible safe guard.
> MEMORY_DEVICE_FS_DAX:
> Struct pages are created in pmem_attach_disk() and virtio_fs_setup_dax() with
> an initial reference count of one.
> The problem I see is that there are 3 states that are important:
> a) memory is free and not allocated to any file (page_ref_count() == 0).
> b) memory is allocated to a file and in the page cache (page_ref_count() == 1).
> c) some gup() or I/O has a reference even after calling unmap_mapping_pages()
> (page_ref_count() > 1). ext4_break_layouts() basically waits until the
> page_ref_count() == 1 with put_page() calling wake_up_var(&page->_refcount)
> to wake up ext4_break_layouts().
> The current code doesn't seem to distinguish (a) and (b). If we want to use
> the 0->1 reference count to signal (c), then the page cache would have hold
> entries with a page_ref_count() == 0 which doesn't match the general page cache
I think the sensible model here is to grab a reference when it is
added to the page cache. That is exactly how normal system memory pages
work.
^ permalink raw reply [flat|nested] 2+ messages in thread
* Re: [PATCH v3 3/6] mm: support THP migration to device private memory
2020-12-02 10:14 ` [PATCH v3 3/6] mm: support THP migration to device private memory Christoph Hellwig
@ 2020-12-02 18:01 ` Logan Gunthorpe
0 siblings, 0 replies; 2+ messages in thread
From: Logan Gunthorpe @ 2020-12-02 18:01 UTC (permalink / raw)
To: Christoph Hellwig, Ralph Campbell
Cc: linux-mm, nouveau, linux-kselftest, linux-kernel, Jerome Glisse,
John Hubbard, Alistair Popple, Jason Gunthorpe, Bharata B Rao,
Zi Yan, Kirill A . Shutemov, Yang Shi, Ben Skeggs, Shuah Khan,
Andrew Morton, Dan Williams, linux-nvdimm, linux-fsdevel
On 2020-12-02 3:14 a.m., Christoph Hellwig wrote:>>
MEMORY_DEVICE_PCI_P2PDMA:
>> Struct pages are created in pci_p2pdma_add_resource() and represent device
>> memory accessible by PCIe bar address space. Memory is allocated with
>> pci_alloc_p2pmem() based on a byte length but the gen_pool_alloc_owner()
>> call will allocate memory in a minimum of PAGE_SIZE units.
>> Reference counting is +1 per *allocation* on the pgmap->ref reference count.
>> Note that this is not +1 per page which is what put_page() expects. So
>> currently, a get_page()/put_page() works OK because the page reference count
>> only goes 1->2 and 2->1. If it went to zero, the pgmap->ref reference count
>> would be incorrect if the allocation size was greater than one page.
>>
>> I see pci_alloc_p2pmem() is called by nvme_alloc_sq_cmds() and
>> pci_p2pmem_alloc_sgl() to create a command queue and a struct scatterlist *.
>> Looks like sg_page(sg) returns the ZONE_DEVICE struct page of the scatterlist.
>> There are a huge number of places sg_page() is called so it is hard to tell
>> whether or not get_page()/put_page() is ever called on MEMORY_DEVICE_PCI_P2PDMA
>> pages.
>
> Nothing should call get_page/put_page on them, as they are not treated
> as refcountable memory. More importantly nothing is allowed to keep
> a reference longer than the time of the I/O.
Yes, right now this is safe, as Christoph notes there are no places
where these should be got/put.
But eventually we'll need to change how pci_alloc_p2pmem() works to take
references on the actual pages and allow freeing individual pages,
similar to what you suggest. This is one of the issues Jason pointed out
in my last RFC to try to pass these pages through GUP.
Logan
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2020-12-02 18:02 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20201106005147.20113-1-rcampbell@nvidia.com>
[not found] ` <20201106005147.20113-4-rcampbell@nvidia.com>
[not found] ` <20201106080322.GE31341@lst.de>
[not found] ` <a7b8b90c-09b7-2009-0784-908b61f61ef2@nvidia.com>
[not found] ` <20201109091415.GC28918@lst.de>
[not found] ` <bbf1f0df-85f3-5887-050e-beb2aad750f2@nvidia.com>
2020-12-02 10:14 ` [PATCH v3 3/6] mm: support THP migration to device private memory Christoph Hellwig
2020-12-02 18:01 ` Logan Gunthorpe
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).