public inbox for linux-nvdimm@lists.01.org
* [PATCH v2 01/20] dax: Modify bdev_dax_pgoff() to handle NULL bdev
       [not found] <20200807195526.426056-1-vgoyal@redhat.com>
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-17 16:57   ` Jan Kara
  2020-08-07 19:55 ` [PATCH v2 02/20] dax: Create a range version of dax_layout_busy_page() Vivek Goyal
  1 sibling, 1 reply; 5+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs
  Cc: miklos, stefanha, dgilbert, Christoph Hellwig, linux-nvdimm

virtiofs does not have a block device, but it does have a dax device.
Modify bdev_dax_pgoff() to be able to handle that.

If there is no bdev, that means the dax offset is 0. (It can't be a
partition block device starting at an offset into the dax device.)
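
For illustration, a caller without a block device (hypothetical
virtiofs-style code, not part of this patch) would look roughly like:

	/*
	 * Sketch only: with no block device there is no partition
	 * offset, so bdev is NULL and 'sector' is relative to the
	 * start of the dax device itself.
	 */
	pgoff_t pgoff;
	int ret;

	ret = bdev_dax_pgoff(NULL, sector, PAGE_SIZE, &pgoff);
	if (ret)
		return ret;	/* misaligned offset or size */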

This is a little hackish. There have been discussions about getting rid
of dax partition support altogether.

https://lore.kernel.org/linux-fsdevel/20200107125159.GA15745@infradead.org/

IMHO, that path can easily break existing users. For example,
ioctl(BLKPG_ADD_PARTITION) will start breaking on block devices
supporting DAX. Also, I personally find it very useful to be able to
partition dax devices and still be able to use DAX.

Alternatively, I tried to store the offset into the dax device in the
iomap interface, but that got NACKed.

https://lore.kernel.org/linux-fsdevel/20200217133117.GB20444@infradead.org/

I can't think of a good way to solve this issue properly. So, to make
progress, this patch seems like the least bad option for now, and I
hope we can take it.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-nvdimm@lists.01.org
---
 drivers/dax/super.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 8e32345be0f7..c4bec437e88b 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -46,7 +46,8 @@ EXPORT_SYMBOL_GPL(dax_read_unlock);
 int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
 		pgoff_t *pgoff)
 {
-	phys_addr_t phys_off = (get_start_sect(bdev) + sector) * 512;
+	sector_t start_sect = bdev ? get_start_sect(bdev) : 0;
+	phys_addr_t phys_off = (start_sect + sector) * 512;
 
 	if (pgoff)
 		*pgoff = PHYS_PFN(phys_off);
-- 
2.25.4

* [PATCH v2 02/20] dax: Create a range version of dax_layout_busy_page()
       [not found] <20200807195526.426056-1-vgoyal@redhat.com>
  2020-08-07 19:55 ` [PATCH v2 01/20] dax: Modify bdev_dax_pgoff() to handle NULL bdev Vivek Goyal
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-17 16:53   ` Jan Kara
  1 sibling, 1 reply; 5+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs
  Cc: miklos, stefanha, dgilbert, linux-nvdimm

The virtiofs device has a range of memory which is mapped into file
inodes using dax. This memory is mapped by qemu on the host and maps
different sections of a real file on the host. The size of this memory
is limited (determined by the administrator), and depending on the
filesystem size, we will soon reach a situation where all the memory is
in use and we need to reclaim some.

As part of the reclaim process, we will need to make sure that there
are no active references to pages (taken by get_user_pages()) in the
memory range we are trying to reclaim. I am planning to use
dax_layout_busy_page() for this. But in its current form it is per
inode and scans through all the pages of the inode.

We want to reclaim only a portion of the memory (say a 2MB page). So we
want to make sure that only that 2MB range of pages does not have any
references (and we don't want to unmap all the pages of the inode).

Hence, create a range version of this function, named
dax_layout_busy_page_range(), which can be passed the range that
needs to be unmapped.
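
For illustration, a reclaim path might use it roughly like this (a
sketch with hypothetical names, not part of this patch; it assumes the
caller holds locks blocking establishment of new mappings):

	/*
	 * Hypothetical helper: returns true if any page backing the
	 * byte range [start, end] (end inclusive) of the inode is
	 * still pinned, e.g. by get_user_pages().
	 */
	static bool dax_range_busy(struct inode *inode, loff_t start,
				   loff_t end)
	{
		return dax_layout_busy_page_range(inode->i_mapping,
						  start, end) != NULL;
	}

	/* e.g. for one 2MB reclaim window starting at 'pos': */
	if (dax_range_busy(inode, pos, pos + SZ_2M - 1))
		; /* wait for page references to drop, then retry */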

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-nvdimm@lists.01.org
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/dax.c            | 66 ++++++++++++++++++++++++++++++++-------------
 include/linux/dax.h |  6 +++++
 2 files changed, 54 insertions(+), 18 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 11b16729b86f..0d51b0fbb489 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -558,27 +558,20 @@ static void *grab_mapping_entry(struct xa_state *xas,
 	return xa_mk_internal(VM_FAULT_FALLBACK);
 }
 
-/**
- * dax_layout_busy_page - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
- *
- * DAX requires ZONE_DEVICE mapped pages. These pages are never
- * 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
- * any page in the mapping is busy, i.e. for DMA, or other
- * get_user_pages() usages.
- *
- * It is expected that the filesystem is holding locks to block the
- * establishment of new mappings in this address_space. I.e. it expects
- * to be able to run unmap_mapping_range() and subsequently not race
- * mapping_mapped() becoming true.
+/*
+ * Partial pages are included. If end is LLONG_MAX, pages in the range from
+ * start to the end of the file are included.
  */
-struct page *dax_layout_busy_page(struct address_space *mapping)
+struct page *dax_layout_busy_page_range(struct address_space *mapping,
+					loff_t start, loff_t end)
 {
-	XA_STATE(xas, &mapping->i_pages, 0);
 	void *entry;
 	unsigned int scanned = 0;
 	struct page *page = NULL;
+	pgoff_t start_idx = start >> PAGE_SHIFT;
+	pgoff_t end_idx = end >> PAGE_SHIFT;
+	XA_STATE(xas, &mapping->i_pages, start_idx);
+	loff_t len, lstart = round_down(start, PAGE_SIZE);
 
 	/*
 	 * In the 'limited' case get_user_pages() for dax is disabled.
@@ -589,6 +582,22 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
 	if (!dax_mapping(mapping) || !mapping_mapped(mapping))
 		return NULL;
 
+	/* If end == LLONG_MAX, all pages from start till the end of file */
+	if (end == LLONG_MAX) {
+		end_idx = ULONG_MAX;
+		len = 0;
+	} else {
+		/* length is being calculated from lstart and not start.
+		 * This is due to behavior of unmap_mapping_range(). If
+		 * start is, say, 4094 and end is at 4096, then we want to
+		 * unmap two pages, idx 0 and 1. But unmap_mapping_range()
+		 * will unmap only page at idx 0. If we calculate len
+		 * from the rounded down start, this problem should not
+		 * happen.
+		 */
+		len = end - lstart + 1;
+	}
+
 	/*
 	 * If we race get_user_pages_fast() here either we'll see the
 	 * elevated page count in the iteration and wait, or
@@ -601,10 +610,10 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
 	 * guaranteed to either see new references or prevent new
 	 * references from being established.
 	 */
-	unmap_mapping_range(mapping, 0, 0, 0);
+	unmap_mapping_range(mapping, start, len, 0);
 
 	xas_lock_irq(&xas);
-	xas_for_each(&xas, entry, ULONG_MAX) {
+	xas_for_each(&xas, entry, end_idx) {
 		if (WARN_ON_ONCE(!xa_is_value(entry)))
 			continue;
 		if (unlikely(dax_is_locked(entry)))
@@ -625,6 +634,27 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
 	xas_unlock_irq(&xas);
 	return page;
 }
+EXPORT_SYMBOL_GPL(dax_layout_busy_page_range);
+
+/**
+ * dax_layout_busy_page - find first pinned page in @mapping
+ * @mapping: address space to scan for a page with ref count > 1
+ *
+ * DAX requires ZONE_DEVICE mapped pages. These pages are never
+ * 'onlined' to the page allocator so they are considered idle when
+ * page->count == 1. A filesystem uses this interface to determine if
+ * any page in the mapping is busy, i.e. for DMA, or other
+ * get_user_pages() usages.
+ *
+ * It is expected that the filesystem is holding locks to block the
+ * establishment of new mappings in this address_space. I.e. it expects
+ * to be able to run unmap_mapping_range() and subsequently not race
+ * mapping_mapped() becoming true.
+ */
+struct page *dax_layout_busy_page(struct address_space *mapping)
+{
+	return dax_layout_busy_page_range(mapping, 0, 0);
+}
 EXPORT_SYMBOL_GPL(dax_layout_busy_page);
 
 static int __dax_invalidate_entry(struct address_space *mapping,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 6904d4e0b2e0..9016929db4c6 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -141,6 +141,7 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 		struct dax_device *dax_dev, struct writeback_control *wbc);
 
 struct page *dax_layout_busy_page(struct address_space *mapping);
+struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
 dax_entry_t dax_lock_page(struct page *page);
 void dax_unlock_page(struct page *page, dax_entry_t cookie);
 #else
@@ -171,6 +172,11 @@ static inline struct page *dax_layout_busy_page(struct address_space *mapping)
 	return NULL;
 }
 
+static inline struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end)
+{
+	return NULL;
+}
+
 static inline int dax_writeback_mapping_range(struct address_space *mapping,
 		struct dax_device *dax_dev, struct writeback_control *wbc)
 {
-- 
2.25.4

* Re: [PATCH v2 02/20] dax: Create a range version of dax_layout_busy_page()
  2020-08-07 19:55 ` [PATCH v2 02/20] dax: Create a range version of dax_layout_busy_page() Vivek Goyal
@ 2020-08-17 16:53   ` Jan Kara
  2020-08-17 17:22     ` Vivek Goyal
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2020-08-17 16:53 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, virtio-fs, miklos, stefanha,
	dgilbert, linux-nvdimm

On Fri 07-08-20 15:55:08, Vivek Goyal wrote:
> The virtiofs device has a range of memory which is mapped into file
> inodes using dax. This memory is mapped by qemu on the host and maps
> different sections of a real file on the host. The size of this memory
> is limited (determined by the administrator), and depending on the
> filesystem size, we will soon reach a situation where all the memory is
> in use and we need to reclaim some.
> 
> As part of the reclaim process, we will need to make sure that there
> are no active references to pages (taken by get_user_pages()) in the
> memory range we are trying to reclaim. I am planning to use
> dax_layout_busy_page() for this. But in its current form it is per
> inode and scans through all the pages of the inode.
> 
> We want to reclaim only a portion of the memory (say a 2MB page). So we
> want to make sure that only that 2MB range of pages does not have any
> references (and we don't want to unmap all the pages of the inode).
> 
> Hence, create a range version of this function, named
> dax_layout_busy_page_range(), which can be passed the range that
> needs to be unmapped.
> 
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: linux-nvdimm@lists.01.org
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

The API looks OK. Some comments WRT the implementation below.

> diff --git a/fs/dax.c b/fs/dax.c
> index 11b16729b86f..0d51b0fbb489 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -558,27 +558,20 @@ static void *grab_mapping_entry(struct xa_state *xas,
>  	return xa_mk_internal(VM_FAULT_FALLBACK);
>  }
>  
> -/**
> - * dax_layout_busy_page - find first pinned page in @mapping
> - * @mapping: address space to scan for a page with ref count > 1
> - *
> - * DAX requires ZONE_DEVICE mapped pages. These pages are never
> - * 'onlined' to the page allocator so they are considered idle when
> - * page->count == 1. A filesystem uses this interface to determine if
> - * any page in the mapping is busy, i.e. for DMA, or other
> - * get_user_pages() usages.
> - *
> - * It is expected that the filesystem is holding locks to block the
> - * establishment of new mappings in this address_space. I.e. it expects
> - * to be able to run unmap_mapping_range() and subsequently not race
> - * mapping_mapped() becoming true.
> +/*
> + * Partial pages are included. If end is LLONG_MAX, pages in the range from
> + * start to the end of the file are included.
>   */

I think the big kerneldoc comment should stay with
dax_layout_busy_page_range() since dax_layout_busy_page() will be just a
trivial wrapper around it.

> -struct page *dax_layout_busy_page(struct address_space *mapping)
> +struct page *dax_layout_busy_page_range(struct address_space *mapping,
> +					loff_t start, loff_t end)
>  {
> -	XA_STATE(xas, &mapping->i_pages, 0);
>  	void *entry;
>  	unsigned int scanned = 0;
>  	struct page *page = NULL;
> +	pgoff_t start_idx = start >> PAGE_SHIFT;
> +	pgoff_t end_idx = end >> PAGE_SHIFT;
> +	XA_STATE(xas, &mapping->i_pages, start_idx);
> +	loff_t len, lstart = round_down(start, PAGE_SIZE);
>  
>  	/*
>  	 * In the 'limited' case get_user_pages() for dax is disabled.
> @@ -589,6 +582,22 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
>  	if (!dax_mapping(mapping) || !mapping_mapped(mapping))
>  		return NULL;
>  
> +	/* If end == LLONG_MAX, all pages from start till the end of file */
> +	if (end == LLONG_MAX) {
> +		end_idx = ULONG_MAX;
> +		len = 0;
> +	} else {
> +		/* length is being calculated from lstart and not start.
> +		 * This is due to behavior of unmap_mapping_range(). If
> +		 * start is, say, 4094 and end is at 4096, then we want to
> +		 * unmap two pages, idx 0 and 1. But unmap_mapping_range()
> +		 * will unmap only page at idx 0. If we calculate len
> +		 * from the rounded down start, this problem should not
> +		 * happen.
> +		 */
> +		len = end - lstart + 1;
> +	}

Maybe it would be more understandable to use
	unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1);
below and avoid all this rounding and special-casing.
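
For illustration, the simplified flow might look like this (a sketch,
assuming unmap_mapping_pages() keeps its (mapping, start, nr,
even_cows) signature):

	/* Page-index arithmetic rounds down, so partial pages at both
	 * ends of the byte range are covered. */
	pgoff_t start_idx = start >> PAGE_SHIFT;
	pgoff_t end_idx = (end == LLONG_MAX) ? ULONG_MAX : end >> PAGE_SHIFT;

	unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);

(unmap_mapping_pages() clamps a wrapped last index to ULONG_MAX, so the
end == LLONG_MAX case should still unmap through the end of the file.)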

> +
>  	/*
>  	 * If we race get_user_pages_fast() here either we'll see the
>  	 * elevated page count in the iteration and wait, or
> @@ -601,10 +610,10 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
>  	 * guaranteed to either see new references or prevent new
>  	 * references from being established.
>  	 */
> -	unmap_mapping_range(mapping, 0, 0, 0);
> +	unmap_mapping_range(mapping, start, len, 0);
>  
>  	xas_lock_irq(&xas);
> -	xas_for_each(&xas, entry, ULONG_MAX) {
> +	xas_for_each(&xas, entry, end_idx) {
>  		if (WARN_ON_ONCE(!xa_is_value(entry)))
>  			continue;
>  		if (unlikely(dax_is_locked(entry)))
> @@ -625,6 +634,27 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
>  	xas_unlock_irq(&xas);
>  	return page;
>  }
> +EXPORT_SYMBOL_GPL(dax_layout_busy_page_range);
> +
> +/**
> + * dax_layout_busy_page - find first pinned page in @mapping
> + * @mapping: address space to scan for a page with ref count > 1
> + *
> + * DAX requires ZONE_DEVICE mapped pages. These pages are never
> + * 'onlined' to the page allocator so they are considered idle when
> + * page->count == 1. A filesystem uses this interface to determine if
> + * any page in the mapping is busy, i.e. for DMA, or other
> + * get_user_pages() usages.
> + *
> + * It is expected that the filesystem is holding locks to block the
> + * establishment of new mappings in this address_space. I.e. it expects
> + * to be able to run unmap_mapping_range() and subsequently not race
> + * mapping_mapped() becoming true.
> + */
> +struct page *dax_layout_busy_page(struct address_space *mapping)
> +{
> +	return dax_layout_busy_page_range(mapping, 0, 0);

Should the 'end' rather be LLONG_MAX?

Otherwise the patch looks good to me.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH v2 01/20] dax: Modify bdev_dax_pgoff() to handle NULL bdev
  2020-08-07 19:55 ` [PATCH v2 01/20] dax: Modify bdev_dax_pgoff() to handle NULL bdev Vivek Goyal
@ 2020-08-17 16:57   ` Jan Kara
  0 siblings, 0 replies; 5+ messages in thread
From: Jan Kara @ 2020-08-17 16:57 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, virtio-fs, miklos, stefanha,
	dgilbert, Christoph Hellwig, linux-nvdimm

On Fri 07-08-20 15:55:07, Vivek Goyal wrote:
> virtiofs does not have a block device, but it does have a dax device.
> Modify bdev_dax_pgoff() to be able to handle that.
> 
> If there is no bdev, that means the dax offset is 0. (It can't be a
> partition block device starting at an offset into the dax device.)
> 
> This is a little hackish. There have been discussions about getting rid
> of dax partition support altogether.
> 
> https://lore.kernel.org/linux-fsdevel/20200107125159.GA15745@infradead.org/
> 
> IMHO, that path can easily break existing users. For example,
> ioctl(BLKPG_ADD_PARTITION) will start breaking on block devices
> supporting DAX. Also, I personally find it very useful to be able to
> partition dax devices and still be able to use DAX.
> 
> Alternatively, I tried to store the offset into the dax device in the
> iomap interface, but that got NACKed.
> 
> https://lore.kernel.org/linux-fsdevel/20200217133117.GB20444@infradead.org/
> 
> I can't think of a good way to solve this issue properly. So, to make
> progress, this patch seems like the least bad option for now, and I
> hope we can take it.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: linux-nvdimm@lists.01.org

This patch looks OK to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  drivers/dax/super.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 8e32345be0f7..c4bec437e88b 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -46,7 +46,8 @@ EXPORT_SYMBOL_GPL(dax_read_unlock);
>  int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
>  		pgoff_t *pgoff)
>  {
> -	phys_addr_t phys_off = (get_start_sect(bdev) + sector) * 512;
> +	sector_t start_sect = bdev ? get_start_sect(bdev) : 0;
> +	phys_addr_t phys_off = (start_sect + sector) * 512;
>  
>  	if (pgoff)
>  		*pgoff = PHYS_PFN(phys_off);
> -- 
> 2.25.4
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH v2 02/20] dax: Create a range version of dax_layout_busy_page()
  2020-08-17 16:53   ` Jan Kara
@ 2020-08-17 17:22     ` Vivek Goyal
  0 siblings, 0 replies; 5+ messages in thread
From: Vivek Goyal @ 2020-08-17 17:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-kernel, virtio-fs, miklos, stefanha,
	dgilbert, linux-nvdimm

On Mon, Aug 17, 2020 at 06:53:39PM +0200, Jan Kara wrote:
> On Fri 07-08-20 15:55:08, Vivek Goyal wrote:
> > The virtiofs device has a range of memory which is mapped into file
> > inodes using dax. This memory is mapped by qemu on the host and maps
> > different sections of a real file on the host. The size of this memory
> > is limited (determined by the administrator), and depending on the
> > filesystem size, we will soon reach a situation where all the memory is
> > in use and we need to reclaim some.
> > 
> > As part of the reclaim process, we will need to make sure that there
> > are no active references to pages (taken by get_user_pages()) in the
> > memory range we are trying to reclaim. I am planning to use
> > dax_layout_busy_page() for this. But in its current form it is per
> > inode and scans through all the pages of the inode.
> > 
> > We want to reclaim only a portion of the memory (say a 2MB page). So we
> > want to make sure that only that 2MB range of pages does not have any
> > references (and we don't want to unmap all the pages of the inode).
> > 
> > Hence, create a range version of this function, named
> > dax_layout_busy_page_range(), which can be passed the range that
> > needs to be unmapped.
> > 
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: linux-nvdimm@lists.01.org
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> 
> The API looks OK. Some comments WRT the implementation below.
> 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 11b16729b86f..0d51b0fbb489 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -558,27 +558,20 @@ static void *grab_mapping_entry(struct xa_state *xas,
> >  	return xa_mk_internal(VM_FAULT_FALLBACK);
> >  }
> >  
> > -/**
> > - * dax_layout_busy_page - find first pinned page in @mapping
> > - * @mapping: address space to scan for a page with ref count > 1
> > - *
> > - * DAX requires ZONE_DEVICE mapped pages. These pages are never
> > - * 'onlined' to the page allocator so they are considered idle when
> > - * page->count == 1. A filesystem uses this interface to determine if
> > - * any page in the mapping is busy, i.e. for DMA, or other
> > - * get_user_pages() usages.
> > - *
> > - * It is expected that the filesystem is holding locks to block the
> > - * establishment of new mappings in this address_space. I.e. it expects
> > - * to be able to run unmap_mapping_range() and subsequently not race
> > - * mapping_mapped() becoming true.
> > +/*
> > + * Partial pages are included. If end is LLONG_MAX, pages in the range from
> > + * start to the end of the file are included.
> >   */
> 
> I think the big kerneldoc comment should stay with
> dax_layout_busy_page_range() since dax_layout_busy_page() will be just a
> trivial wrapper around it.

Hi Jan,

Thanks for the review.

Will move the kerneldoc comment.


> 
> > -struct page *dax_layout_busy_page(struct address_space *mapping)
> > +struct page *dax_layout_busy_page_range(struct address_space *mapping,
> > +					loff_t start, loff_t end)
> >  {
> > -	XA_STATE(xas, &mapping->i_pages, 0);
> >  	void *entry;
> >  	unsigned int scanned = 0;
> >  	struct page *page = NULL;
> > +	pgoff_t start_idx = start >> PAGE_SHIFT;
> > +	pgoff_t end_idx = end >> PAGE_SHIFT;
> > +	XA_STATE(xas, &mapping->i_pages, start_idx);
> > +	loff_t len, lstart = round_down(start, PAGE_SIZE);
> >  
> >  	/*
> >  	 * In the 'limited' case get_user_pages() for dax is disabled.
> > @@ -589,6 +582,22 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
> >  	if (!dax_mapping(mapping) || !mapping_mapped(mapping))
> >  		return NULL;
> >  
> > +	/* If end == LLONG_MAX, all pages from start till the end of file */
> > +	if (end == LLONG_MAX) {
> > +		end_idx = ULONG_MAX;
> > +		len = 0;
> > +	} else {
> > +		/* length is being calculated from lstart and not start.
> > +		 * This is due to behavior of unmap_mapping_range(). If
> > +		 * start is, say, 4094 and end is at 4096, then we want to
> > +		 * unmap two pages, idx 0 and 1. But unmap_mapping_range()
> > +		 * will unmap only page at idx 0. If we calculate len
> > +		 * from the rounded down start, this problem should not
> > +		 * happen.
> > +		 */
> > +		len = end - lstart + 1;
> > +	}
> 
> Maybe it would be more understandable to use
> 	unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1);
> below and avoid all this rounding and special-casing.

Will do.

> 
> > +
> >  	/*
> >  	 * If we race get_user_pages_fast() here either we'll see the
> >  	 * elevated page count in the iteration and wait, or
> > @@ -601,10 +610,10 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
> >  	 * guaranteed to either see new references or prevent new
> >  	 * references from being established.
> >  	 */
> > -	unmap_mapping_range(mapping, 0, 0, 0);
> > +	unmap_mapping_range(mapping, start, len, 0);
> >  
> >  	xas_lock_irq(&xas);
> > -	xas_for_each(&xas, entry, ULONG_MAX) {
> > +	xas_for_each(&xas, entry, end_idx) {
> >  		if (WARN_ON_ONCE(!xa_is_value(entry)))
> >  			continue;
> >  		if (unlikely(dax_is_locked(entry)))
> > @@ -625,6 +634,27 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
> >  	xas_unlock_irq(&xas);
> >  	return page;
> >  }
> > +EXPORT_SYMBOL_GPL(dax_layout_busy_page_range);
> > +
> > +/**
> > + * dax_layout_busy_page - find first pinned page in @mapping
> > + * @mapping: address space to scan for a page with ref count > 1
> > + *
> > + * DAX requires ZONE_DEVICE mapped pages. These pages are never
> > + * 'onlined' to the page allocator so they are considered idle when
> > + * page->count == 1. A filesystem uses this interface to determine if
> > + * any page in the mapping is busy, i.e. for DMA, or other
> > + * get_user_pages() usages.
> > + *
> > + * It is expected that the filesystem is holding locks to block the
> > + * establishment of new mappings in this address_space. I.e. it expects
> > + * to be able to run unmap_mapping_range() and subsequently not race
> > + * mapping_mapped() becoming true.
> > + */
> > +struct page *dax_layout_busy_page(struct address_space *mapping)
> > +{
> > +	return dax_layout_busy_page_range(mapping, 0, 0);
> 
> Should the 'end' rather be LLONG_MAX?

My bad. I forgot to change this. The previous version of the patches
had the semantic that 'end == 0' signifies till the end of the file.
Yes, 'end' should be LLONG_MAX now. Will fix it.
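
I.e., the one-line wrapper would then become (a sketch of the intended
fix):

	return dax_layout_busy_page_range(mapping, 0, LLONG_MAX);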

Thanks
Vivek

> 
> Otherwise the patch looks good to me.
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> 

end of thread (newest: 2020-08-17 17:22 UTC)

Thread overview: 5+ messages
     [not found] <20200807195526.426056-1-vgoyal@redhat.com>
2020-08-07 19:55 ` [PATCH v2 01/20] dax: Modify bdev_dax_pgoff() to handle NULL bdev Vivek Goyal
2020-08-17 16:57   ` Jan Kara
2020-08-07 19:55 ` [PATCH v2 02/20] dax: Create a range version of dax_layout_busy_page() Vivek Goyal
2020-08-17 16:53   ` Jan Kara
2020-08-17 17:22     ` Vivek Goyal
