All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jason Gunthorpe <jgg@nvidia.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: akpm@linux-foundation.org, "Matthew Wilcox" <willy@infradead.org>,
	"Jan Kara" <jack@suse.cz>, "Darrick J. Wong" <djwong@kernel.org>,
	"Christoph Hellwig" <hch@lst.de>,
	"John Hubbard" <jhubbard@nvidia.com>,
	"Alistair Popple" <apopple@nvidia.com>,
	"Felix Kuehling" <Felix.Kuehling@amd.com>,
	"Alex Deucher" <alexander.deucher@amd.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Pan, Xinhui" <Xinhui.Pan@amd.com>,
	"David Airlie" <airlied@linux.ie>,
	"Daniel Vetter" <daniel@ffwll.ch>,
	"Ben Skeggs" <bskeggs@redhat.com>,
	"Karol Herbst" <kherbst@redhat.com>,
	"Lyude Paul" <lyude@redhat.com>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	linux-mm@kvack.org, dri-devel@lists.freedesktop.org,
	nvdimm@lists.linux.dev
Subject: Re: [PATCH] mm/memremap: Introduce pgmap_request_folio() using pgmap offsets
Date: Fri, 28 Oct 2022 15:33:26 -0300	[thread overview]
Message-ID: <Y1wgdp/uaoF70bmk@nvidia.com> (raw)
In-Reply-To: <166630293549.1017198.3833687373550679565.stgit@dwillia2-xfh.jf.intel.com>

On Thu, Oct 20, 2022 at 02:56:39PM -0700, Dan Williams wrote:
> A 'struct dev_pagemap' (pgmap) represents a collection of ZONE_DEVICE
> pages. The pgmap is a reference counted object that serves a similar
> role as a 'struct request_queue'. Live references are obtained for each
> in flight request / page, and once a page's reference count drops to
> zero the associated pin of the pgmap is dropped as well. While a page is
> idle nothing should be accessing it because that is effectively a
> use-after-free situation. Unfortunately, all current ZONE_DEVICE
> implementations deploy a layering violation to manage requests to
> activate pages owned by a pgmap. Specifically, they take steps like walk
> the pfns that were previously assigned at memremap_pages() time and use
> pfn_to_page() to recall metadata like page->pgmap, or make use of other
> data like page->zone_device_data.
> 
> The first step towards correcting that situation is to provide a
> API to get access to a pgmap page that does not require the caller to
> know the pfn, nor access any fields of an idle page. Ideally this API
> would be able to support dynamic page creation instead of the current
> status quo of pre-allocating and initializing pages.
> 
> On a prompt from Jason, introduce pgmap_request_folio() that operates on
> an offset into a pgmap. It replaces the shortlived
> pgmap_request_folios() that was continuing the layering violation of
> assuming pages are available to be consulted before asking the pgmap to
> make them available.
> 
> For now this only converts the callers to lookup the pgmap and generate
> the pgmap offset, but it does not do the deeper cleanup of teaching
> those call sites to generate those arguments without walking the page
> metadata. For next steps it appears the DEVICE_PRIVATE implementations
> could plumb the pgmap into the necessary callsites and switch to using
> gen_pool_alloc() to track which offsets of a pgmap are allocated. For
> DAX, dax_direct_access() could switch from returning pfns to returning
> the associated @pgmap and @pgmap_offset. Those changes are saved for
> follow-on work.

I like it, though it would be nice to see drivers converted away from
pfn_to_pgmap_offset()..

>  /**
> - * pgmap_request_folios - activate an contiguous span of folios in @pgmap
> - * @pgmap: host page map for the folio array
> - * @folio: start of the folio list, all subsequent folios have same folio_size()
> + * pgmap_request_folio - activate a folio of a given order in @pgmap
> + * @pgmap: host page map of the folio to activate
> + * @pgmap_offset: page-offset into the pgmap to request
> + * @order: expected folio_order() of the folio
>   *
>   * Caller is responsible for @pgmap remaining live for the duration of
> - * this call. Caller is also responsible for not racing requests for the
> - * same folios.
> + * this call. The order (size) of the folios in the pgmap are assumed
> + * stable before this call.
>   */

I would probably add some discussion here that this enables
refcounting on the folio and the pgmap_ops page free will be called
once the folio is no longer being used.

And explain that the pgmap user is responsible for tracking which
pgmap_offsets are requested and which have been returned by free. It
would be nice to say that this can only be called on free'd folios.

> -bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
> -			  int nr_folios)
> +struct folio *pgmap_request_folio(struct dev_pagemap *pgmap,
> +				  pgoff_t pgmap_offset, int order)

unsigned int order?

>  {
> -	struct folio *iter;
> -	int i;
> +	unsigned long pfn = pgmap_offset_to_pfn(pgmap, pgmap_offset);
> +	struct page *page = pfn_to_page(pfn);
> +	struct folio *folio;
> +	int v;
>  
> -	/*
> -	 * All of the WARNs below are for catching bugs in future
> -	 * development that changes the assumptions of:
> -	 * 1/ uniform folios in @pgmap
> -	 * 2/ @pgmap death does not race this routine.
> -	 */
> -	VM_WARN_ON_ONCE(!folio_span_valid(pgmap, folio, nr_folios));
> +	if (WARN_ON_ONCE(page->pgmap != pgmap))
> +		return NULL;

Checking that pgmap_offset is not bigger than pgmap length is also a
good assertion.. At that point if pgmap is not right then the struct
page has been corrupted.

>  
>  	if (WARN_ON_ONCE(percpu_ref_is_dying(&pgmap->ref)))
> -		return false;
> +		return NULL;
>  
> -	for (iter = folio_next(folio), i = 1; i < nr_folios;
> -	     iter = folio_next(folio), i++)
> -		if (WARN_ON_ONCE(folio_order(iter) != folio_order(folio)))
> -			return false;
> +	folio = page_folio(page);
> +	if (WARN_ON_ONCE(folio_order(folio) != order))
> +		return NULL;

Do you see a blocker to simply restructuring the pages into head/tail
here? If the refcounts are all zero it should be safe?

> +	v = folio_ref_inc_return(folio);
> +	if (v > 1)
> +		return folio;

IMHO, ideally, this should require the foilio to have a 0 refcount and
this should set it to 1.

> +	if (WARN_ON_ONCE(!percpu_ref_tryget(&pgmap->ref))) {

This should not be a warn on, there should be races where the dying
check could miss but the refcounts all reached zero anyhow.

Jason

WARNING: multiple messages have this Message-ID (diff)
From: Jason Gunthorpe <jgg@nvidia.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: nvdimm@lists.linux.dev, "Pan, Xinhui" <Xinhui.Pan@amd.com>,
	"Jan Kara" <jack@suse.cz>,
	dri-devel@lists.freedesktop.org,
	"Karol Herbst" <kherbst@redhat.com>,
	"David Airlie" <airlied@linux.ie>,
	"Darrick J. Wong" <djwong@kernel.org>,
	"Felix Kuehling" <Felix.Kuehling@amd.com>,
	"Alistair Popple" <apopple@nvidia.com>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Christian König" <christian.koenig@amd.com>,
	linux-mm@kvack.org, "Jérôme Glisse" <jglisse@redhat.com>,
	"Ben Skeggs" <bskeggs@redhat.com>,
	"John Hubbard" <jhubbard@nvidia.com>,
	"Alex Deucher" <alexander.deucher@amd.com>,
	akpm@linux-foundation.org, "Christoph Hellwig" <hch@lst.de>
Subject: Re: [PATCH] mm/memremap: Introduce pgmap_request_folio() using pgmap offsets
Date: Fri, 28 Oct 2022 15:33:26 -0300	[thread overview]
Message-ID: <Y1wgdp/uaoF70bmk@nvidia.com> (raw)
In-Reply-To: <166630293549.1017198.3833687373550679565.stgit@dwillia2-xfh.jf.intel.com>

On Thu, Oct 20, 2022 at 02:56:39PM -0700, Dan Williams wrote:
> A 'struct dev_pagemap' (pgmap) represents a collection of ZONE_DEVICE
> pages. The pgmap is a reference counted object that serves a similar
> role as a 'struct request_queue'. Live references are obtained for each
> in flight request / page, and once a page's reference count drops to
> zero the associated pin of the pgmap is dropped as well. While a page is
> idle nothing should be accessing it because that is effectively a
> use-after-free situation. Unfortunately, all current ZONE_DEVICE
> implementations deploy a layering violation to manage requests to
> activate pages owned by a pgmap. Specifically, they take steps like walk
> the pfns that were previously assigned at memremap_pages() time and use
> pfn_to_page() to recall metadata like page->pgmap, or make use of other
> data like page->zone_device_data.
> 
> The first step towards correcting that situation is to provide a
> API to get access to a pgmap page that does not require the caller to
> know the pfn, nor access any fields of an idle page. Ideally this API
> would be able to support dynamic page creation instead of the current
> status quo of pre-allocating and initializing pages.
> 
> On a prompt from Jason, introduce pgmap_request_folio() that operates on
> an offset into a pgmap. It replaces the shortlived
> pgmap_request_folios() that was continuing the layering violation of
> assuming pages are available to be consulted before asking the pgmap to
> make them available.
> 
> For now this only converts the callers to lookup the pgmap and generate
> the pgmap offset, but it does not do the deeper cleanup of teaching
> those call sites to generate those arguments without walking the page
> metadata. For next steps it appears the DEVICE_PRIVATE implementations
> could plumb the pgmap into the necessary callsites and switch to using
> gen_pool_alloc() to track which offsets of a pgmap are allocated. For
> DAX, dax_direct_access() could switch from returning pfns to returning
> the associated @pgmap and @pgmap_offset. Those changes are saved for
> follow-on work.

I like it, though it would be nice to see drivers converted away from
pfn_to_pgmap_offset()..

>  /**
> - * pgmap_request_folios - activate an contiguous span of folios in @pgmap
> - * @pgmap: host page map for the folio array
> - * @folio: start of the folio list, all subsequent folios have same folio_size()
> + * pgmap_request_folio - activate a folio of a given order in @pgmap
> + * @pgmap: host page map of the folio to activate
> + * @pgmap_offset: page-offset into the pgmap to request
> + * @order: expected folio_order() of the folio
>   *
>   * Caller is responsible for @pgmap remaining live for the duration of
> - * this call. Caller is also responsible for not racing requests for the
> - * same folios.
> + * this call. The order (size) of the folios in the pgmap are assumed
> + * stable before this call.
>   */

I would probably add some discussion here that this enables
refcounting on the folio and the pgmap_ops page free will be called
once the folio is no longer being used.

And explain that the pgmap user is responsible for tracking which
pgmap_offsets are requested and which have been returned by free. It
would be nice to say that this can only be called on free'd folios.

> -bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
> -			  int nr_folios)
> +struct folio *pgmap_request_folio(struct dev_pagemap *pgmap,
> +				  pgoff_t pgmap_offset, int order)

unsigned int order?

>  {
> -	struct folio *iter;
> -	int i;
> +	unsigned long pfn = pgmap_offset_to_pfn(pgmap, pgmap_offset);
> +	struct page *page = pfn_to_page(pfn);
> +	struct folio *folio;
> +	int v;
>  
> -	/*
> -	 * All of the WARNs below are for catching bugs in future
> -	 * development that changes the assumptions of:
> -	 * 1/ uniform folios in @pgmap
> -	 * 2/ @pgmap death does not race this routine.
> -	 */
> -	VM_WARN_ON_ONCE(!folio_span_valid(pgmap, folio, nr_folios));
> +	if (WARN_ON_ONCE(page->pgmap != pgmap))
> +		return NULL;

Checking that pgmap_offset is not bigger than pgmap length is also a
good assertion.. At that point if pgmap is not right then the struct
page has been corrupted.

>  
>  	if (WARN_ON_ONCE(percpu_ref_is_dying(&pgmap->ref)))
> -		return false;
> +		return NULL;
>  
> -	for (iter = folio_next(folio), i = 1; i < nr_folios;
> -	     iter = folio_next(folio), i++)
> -		if (WARN_ON_ONCE(folio_order(iter) != folio_order(folio)))
> -			return false;
> +	folio = page_folio(page);
> +	if (WARN_ON_ONCE(folio_order(folio) != order))
> +		return NULL;

Do you see a blocker to simply restructuring the pages into head/tail
here? If the refcounts are all zero it should be safe?

> +	v = folio_ref_inc_return(folio);
> +	if (v > 1)
> +		return folio;

IMHO, ideally, this should require the foilio to have a 0 refcount and
this should set it to 1.

> +	if (WARN_ON_ONCE(!percpu_ref_tryget(&pgmap->ref))) {

This should not be a warn on, there should be races where the dying
check could miss but the refcounts all reached zero anyhow.

Jason

  parent reply	other threads:[~2022-10-28 18:33 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-20 21:56 [PATCH] mm/memremap: Introduce pgmap_request_folio() using pgmap offsets Dan Williams
2022-10-20 21:56 ` Dan Williams
2022-10-20 22:32 ` Felix Kuehling
2022-10-20 22:32   ` Felix Kuehling
2022-10-20 23:17   ` Dan Williams
2022-10-20 23:17     ` Dan Williams
2022-10-21 18:31     ` Felix Kuehling
2022-10-21 18:31       ` Felix Kuehling
2022-10-24  1:44       ` Alistair Popple
2022-10-24  1:44         ` Alistair Popple
2022-10-21 20:36 ` Lyude Paul
2022-10-21 20:36   ` Lyude Paul
2022-10-28 18:33 ` Jason Gunthorpe [this message]
2022-10-28 18:33   ` Jason Gunthorpe
2022-12-01  0:22   ` Dan Williams
2022-12-01  0:22     ` Dan Williams
2022-12-01 23:12     ` Andrew Morton
2022-12-01 23:12       ` Andrew Morton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y1wgdp/uaoF70bmk@nvidia.com \
    --to=jgg@nvidia.com \
    --cc=Felix.Kuehling@amd.com \
    --cc=Xinhui.Pan@amd.com \
    --cc=airlied@linux.ie \
    --cc=akpm@linux-foundation.org \
    --cc=alexander.deucher@amd.com \
    --cc=apopple@nvidia.com \
    --cc=bskeggs@redhat.com \
    --cc=christian.koenig@amd.com \
    --cc=dan.j.williams@intel.com \
    --cc=daniel@ffwll.ch \
    --cc=djwong@kernel.org \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=jglisse@redhat.com \
    --cc=jhubbard@nvidia.com \
    --cc=kherbst@redhat.com \
    --cc=linux-mm@kvack.org \
    --cc=lyude@redhat.com \
    --cc=nvdimm@lists.linux.dev \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.