- * [PATCH v2 1/9] mm: Fix the return type of __do_page_cache_readahead
  2020-01-15  2:38 [RFC v2 0/9] Replacing the readpages a_op Matthew Wilcox
@ 2020-01-15  2:38 ` Matthew Wilcox
  2020-01-15  2:38 ` [PATCH v2 2/9] readahead: Ignore return value of ->readpages Matthew Wilcox
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2020-01-15  2:38 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, linux-mm
  Cc: Matthew Wilcox (Oracle), Jeff Layton, Christoph Hellwig,
	Chris Mason
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
ra_submit() which is a wrapper around __do_page_cache_readahead() already
returns an unsigned long, and the 'nr_to_read' parameter is an unsigned
long, so fix __do_page_cache_readahead() to return an unsigned long,
even though I'm pretty sure we're not going to readahead more than 2^32
pages ever.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/internal.h  | 2 +-
 mm/readahead.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 3cf20ab3ca01..41b93c4b3ab7 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -49,7 +49,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 			     unsigned long addr, unsigned long end,
 			     struct zap_details *details);
 
-extern unsigned int __do_page_cache_readahead(struct address_space *mapping,
+extern unsigned long __do_page_cache_readahead(struct address_space *mapping,
 		struct file *filp, pgoff_t offset, unsigned long nr_to_read,
 		unsigned long lookahead_size);
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 2fe72cd29b47..6bf73ef33b7e 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -152,7 +152,7 @@ static int read_pages(struct address_space *mapping, struct file *filp,
  *
  * Returns the number of pages requested, or the maximum amount of I/O allowed.
  */
-unsigned int __do_page_cache_readahead(struct address_space *mapping,
+unsigned long __do_page_cache_readahead(struct address_space *mapping,
 		struct file *filp, pgoff_t offset, unsigned long nr_to_read,
 		unsigned long lookahead_size)
 {
@@ -161,7 +161,7 @@ unsigned int __do_page_cache_readahead(struct address_space *mapping,
 	unsigned long end_index;	/* The last page we want to read */
 	LIST_HEAD(page_pool);
 	int page_idx;
-	unsigned int nr_pages = 0;
+	unsigned long nr_pages = 0;
 	loff_t isize = i_size_read(inode);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
 
-- 
2.24.1
^ permalink raw reply related	[flat|nested] 23+ messages in thread
- * [PATCH v2 2/9] readahead: Ignore return value of ->readpages
  2020-01-15  2:38 [RFC v2 0/9] Replacing the readpages a_op Matthew Wilcox
  2020-01-15  2:38 ` [PATCH v2 1/9] mm: Fix the return type of __do_page_cache_readahead Matthew Wilcox
@ 2020-01-15  2:38 ` Matthew Wilcox
  2020-01-15  2:38 ` [PATCH v2 3/9] XArray: Add xarray_for_each_range Matthew Wilcox
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2020-01-15  2:38 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, linux-mm
  Cc: Matthew Wilcox (Oracle), Jeff Layton, Christoph Hellwig,
	Chris Mason
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
We used to assign the return value to a variable, which we then ignored.
Remove the pretence of caring.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/readahead.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index 6bf73ef33b7e..fc77d13af556 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -113,17 +113,16 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
 
 EXPORT_SYMBOL(read_cache_pages);
 
-static int read_pages(struct address_space *mapping, struct file *filp,
+static void read_pages(struct address_space *mapping, struct file *filp,
 		struct list_head *pages, unsigned int nr_pages, gfp_t gfp)
 {
 	struct blk_plug plug;
 	unsigned page_idx;
-	int ret;
 
 	blk_start_plug(&plug);
 
 	if (mapping->a_ops->readpages) {
-		ret = mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
+		mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
 		/* Clean up the remaining pages */
 		put_pages_list(pages);
 		goto out;
@@ -136,12 +135,9 @@ static int read_pages(struct address_space *mapping, struct file *filp,
 			mapping->a_ops->readpage(filp, page);
 		put_page(page);
 	}
-	ret = 0;
 
 out:
 	blk_finish_plug(&plug);
-
-	return ret;
 }
 
 /*
-- 
2.24.1
^ permalink raw reply related	[flat|nested] 23+ messages in thread
- * [PATCH v2 3/9] XArray: Add xarray_for_each_range
  2020-01-15  2:38 [RFC v2 0/9] Replacing the readpages a_op Matthew Wilcox
  2020-01-15  2:38 ` [PATCH v2 1/9] mm: Fix the return type of __do_page_cache_readahead Matthew Wilcox
  2020-01-15  2:38 ` [PATCH v2 2/9] readahead: Ignore return value of ->readpages Matthew Wilcox
@ 2020-01-15  2:38 ` Matthew Wilcox
  2020-01-15  2:38 ` [PATCH v2 4/9] readahead: Put pages in cache earlier Matthew Wilcox
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2020-01-15  2:38 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, linux-mm
  Cc: Matthew Wilcox (Oracle), Jeff Layton, Christoph Hellwig,
	Chris Mason
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
This function supports iterating over a range of an array.  Also add
documentation links for xarray_for_each_start().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 Documentation/core-api/xarray.rst | 10 ++++++----
 include/linux/xarray.h            | 30 ++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+), 4 deletions(-)
diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst
index fcedc5349ace..95a732b5c848 100644
--- a/Documentation/core-api/xarray.rst
+++ b/Documentation/core-api/xarray.rst
@@ -94,10 +94,10 @@ calling xa_clear_mark().  You can ask whether any entry in the
 XArray has a particular mark set by calling xa_marked().
 
 You can copy entries out of the XArray into a plain array by calling
-xa_extract().  Or you can iterate over the present entries in
-the XArray by calling xa_for_each().  You may prefer to use
-xa_find() or xa_find_after() to move to the next present
-entry in the XArray.
+xa_extract().  Or you can iterate over the present entries in the XArray
+by calling xa_for_each(), xa_for_each_start() or xa_for_each_range().
+You may prefer to use xa_find() or xa_find_after() to move to the next
+present entry in the XArray.
 
 Calling xa_store_range() stores the same entry in a range
 of indices.  If you do this, some of the other operations will behave
@@ -180,6 +180,8 @@ No lock needed:
 Takes RCU read lock:
  * xa_load()
  * xa_for_each()
+ * xa_for_each_start()
+ * xa_for_each_range()
  * xa_find()
  * xa_find_after()
  * xa_extract()
diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 86eecbd98e84..a06cb555fe23 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -445,6 +445,36 @@ static inline bool xa_marked(const struct xarray *xa, xa_mark_t mark)
 	     entry;							\
 	     entry = xa_find_after(xa, &index, ULONG_MAX, XA_PRESENT))
 
+/**
+ * xa_for_each_range() - Iterate over a portion of an XArray.
+ * @xa: XArray.
+ * @index: Index of @entry.
+ * @entry: Entry retrieved from array.
+ * @start: First index to retrieve from array.
+ * @last: Last index to retrieve from array.
+ *
+ * During the iteration, @entry will have the value of the entry stored
+ * in @xa at @index.  You may modify @index during the iteration if you
+ * want to skip or reprocess indices.  It is safe to modify the array
+ * during the iteration.  At the end of the iteration, @entry will be set
+ * to NULL and @index will have a value less than or equal to max.
+ *
+ * xa_for_each_range() is O(n.log(n)) while xas_for_each() is O(n).  You have
+ * to handle your own locking with xas_for_each(), and if you have to unlock
+ * after each iteration, it will also end up being O(n.log(n)).
+ * xa_for_each_range() will spin if it hits a retry entry; if you intend to
+ * see retry entries, you should use the xas_for_each() iterator instead.
+ * The xas_for_each() iterator will expand into more inline code than
+ * xa_for_each_range().
+ *
+ * Context: Any context.  Takes and releases the RCU lock.
+ */
+#define xa_for_each_range(xa, index, entry, start, last)		\
+	for (index = start,						\
+	     entry = xa_find(xa, &index, last, XA_PRESENT);		\
+	     entry;							\
+	     entry = xa_find_after(xa, &index, last, XA_PRESENT))
+
 /**
  * xa_for_each() - Iterate over present entries in an XArray.
  * @xa: XArray.
-- 
2.24.1
^ permalink raw reply related	[flat|nested] 23+ messages in thread
- * [PATCH v2 4/9] readahead: Put pages in cache earlier
  2020-01-15  2:38 [RFC v2 0/9] Replacing the readpages a_op Matthew Wilcox
                   ` (2 preceding siblings ...)
  2020-01-15  2:38 ` [PATCH v2 3/9] XArray: Add xarray_for_each_range Matthew Wilcox
@ 2020-01-15  2:38 ` Matthew Wilcox
  2020-01-15  2:38 ` [PATCH v2 5/9] mm: Add readahead address space operation Matthew Wilcox
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2020-01-15  2:38 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, linux-mm
  Cc: Matthew Wilcox (Oracle), Jeff Layton, Christoph Hellwig,
	Chris Mason
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
At allocation time, put the pages in the cache unless we're using
->readpages.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/readahead.c | 51 +++++++++++++++++++++++++++++++-------------------
 1 file changed, 32 insertions(+), 19 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index fc77d13af556..5a6676640f20 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -114,10 +114,10 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
 EXPORT_SYMBOL(read_cache_pages);
 
 static void read_pages(struct address_space *mapping, struct file *filp,
-		struct list_head *pages, unsigned int nr_pages, gfp_t gfp)
+		struct list_head *pages, pgoff_t start,
+		unsigned int nr_pages)
 {
 	struct blk_plug plug;
-	unsigned page_idx;
 
 	blk_start_plug(&plug);
 
@@ -125,18 +125,17 @@ static void read_pages(struct address_space *mapping, struct file *filp,
 		mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
 		/* Clean up the remaining pages */
 		put_pages_list(pages);
-		goto out;
-	}
+	} else {
+		struct page *page;
+		unsigned long index;
 
-	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
-		struct page *page = lru_to_page(pages);
-		list_del(&page->lru);
-		if (!add_to_page_cache_lru(page, mapping, page->index, gfp))
+		xa_for_each_range(&mapping->i_pages, index, page, start,
+				start + nr_pages - 1) {
 			mapping->a_ops->readpage(filp, page);
-		put_page(page);
+			put_page(page);
+		}
 	}
 
-out:
 	blk_finish_plug(&plug);
 }
 
@@ -157,9 +156,11 @@ unsigned long __do_page_cache_readahead(struct address_space *mapping,
 	unsigned long end_index;	/* The last page we want to read */
 	LIST_HEAD(page_pool);
 	int page_idx;
+	pgoff_t page_offset;
 	unsigned long nr_pages = 0;
 	loff_t isize = i_size_read(inode);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
+	bool use_list = mapping->a_ops->readpages;
 
 	if (isize == 0)
 		goto out;
@@ -170,7 +171,7 @@ unsigned long __do_page_cache_readahead(struct address_space *mapping,
 	 * Preallocate as many pages as we will need.
 	 */
 	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
-		pgoff_t page_offset = offset + page_idx;
+		page_offset = offset + page_idx;
 
 		if (page_offset > end_index)
 			break;
@@ -178,13 +179,14 @@ unsigned long __do_page_cache_readahead(struct address_space *mapping,
 		page = xa_load(&mapping->i_pages, page_offset);
 		if (page && !xa_is_value(page)) {
 			/*
-			 * Page already present?  Kick off the current batch of
-			 * contiguous pages before continuing with the next
-			 * batch.
+			 * Page already present?  Kick off the current batch
+			 * of contiguous pages before continuing with the
+			 * next batch.
 			 */
 			if (nr_pages)
-				read_pages(mapping, filp, &page_pool, nr_pages,
-						gfp_mask);
+				read_pages(mapping, filp, &page_pool,
+						page_offset - nr_pages,
+						nr_pages);
 			nr_pages = 0;
 			continue;
 		}
@@ -192,8 +194,18 @@ unsigned long __do_page_cache_readahead(struct address_space *mapping,
 		page = __page_cache_alloc(gfp_mask);
 		if (!page)
 			break;
-		page->index = page_offset;
-		list_add(&page->lru, &page_pool);
+		if (use_list) {
+			page->index = page_offset;
+			list_add(&page->lru, &page_pool);
+		} else if (!add_to_page_cache_lru(page, mapping, page_offset,
+					gfp_mask)) {
+			if (nr_pages)
+				read_pages(mapping, filp, &page_pool,
+						page_offset - nr_pages,
+						nr_pages);
+			nr_pages = 0;
+			continue;
+		}
 		if (page_idx == nr_to_read - lookahead_size)
 			SetPageReadahead(page);
 		nr_pages++;
@@ -205,7 +217,8 @@ unsigned long __do_page_cache_readahead(struct address_space *mapping,
 	 * will then handle the error.
 	 */
 	if (nr_pages)
-		read_pages(mapping, filp, &page_pool, nr_pages, gfp_mask);
+		read_pages(mapping, filp, &page_pool, page_offset - nr_pages,
+				nr_pages);
 	BUG_ON(!list_empty(&page_pool));
 out:
 	return nr_pages;
-- 
2.24.1
^ permalink raw reply related	[flat|nested] 23+ messages in thread
- * [PATCH v2 5/9] mm: Add readahead address space operation
  2020-01-15  2:38 [RFC v2 0/9] Replacing the readpages a_op Matthew Wilcox
                   ` (3 preceding siblings ...)
  2020-01-15  2:38 ` [PATCH v2 4/9] readahead: Put pages in cache earlier Matthew Wilcox
@ 2020-01-15  2:38 ` Matthew Wilcox
  2020-01-15  2:38 ` [PATCH v2 6/9] iomap,xfs: Convert from readpages to readahead Matthew Wilcox
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2020-01-15  2:38 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, linux-mm
  Cc: Matthew Wilcox (Oracle), Jeff Layton, Christoph Hellwig,
	Chris Mason
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
This replaces ->readpages with a saner interface:
 - Return the number of pages not read instead of an ignored error code.
 - Pages are already in the page cache when ->readahead is called.
 - Implementation looks up the pages in the page cache instead of
   having them passed in a linked list.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 Documentation/filesystems/locking.rst |  7 ++++++-
 Documentation/filesystems/vfs.rst     | 11 +++++++++++
 include/linux/fs.h                    |  2 ++
 include/linux/pagemap.h               | 12 ++++++++++++
 mm/readahead.c                        | 13 ++++++++++++-
 5 files changed, 43 insertions(+), 2 deletions(-)
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index 5057e4d9dcd1..d8a5dde914b5 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -239,6 +239,8 @@ prototypes::
 	int (*readpage)(struct file *, struct page *);
 	int (*writepages)(struct address_space *, struct writeback_control *);
 	int (*set_page_dirty)(struct page *page);
+	unsigned (*readahead)(struct file *, struct address_space *,
+				 pgoff_t start, unsigned nr_pages);
 	int (*readpages)(struct file *filp, struct address_space *mapping,
 			struct list_head *pages, unsigned nr_pages);
 	int (*write_begin)(struct file *, struct address_space *mapping,
@@ -271,7 +273,8 @@ writepage:		yes, unlocks (see below)
 readpage:		yes, unlocks
 writepages:
 set_page_dirty		no
-readpages:
+readahead:              yes, unlocks
+readpages:              no
 write_begin:		locks the page		 exclusive
 write_end:		yes, unlocks		 exclusive
 bmap:
@@ -295,6 +298,8 @@ the request handler (/dev/loop).
 ->readpage() unlocks the page, either synchronously or via I/O
 completion.
 
+->readahead() unlocks the page like ->readpage().
+
 ->readpages() populates the pagecache with the passed pages and starts
 I/O against them.  They come unlocked upon I/O completion.
 
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 7d4d09dd5e6d..bb06fb7b120b 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -706,6 +706,8 @@ cache in your filesystem.  The following members are defined:
 		int (*readpage)(struct file *, struct page *);
 		int (*writepages)(struct address_space *, struct writeback_control *);
 		int (*set_page_dirty)(struct page *page);
+		unsigned (*readahead)(struct file *filp, struct address_space *mapping,
+				 pgoff_t start, unsigned nr_pages);
 		int (*readpages)(struct file *filp, struct address_space *mapping,
 				 struct list_head *pages, unsigned nr_pages);
 		int (*write_begin)(struct file *, struct address_space *mapping,
@@ -781,6 +783,15 @@ cache in your filesystem.  The following members are defined:
 	If defined, it should set the PageDirty flag, and the
 	PAGECACHE_TAG_DIRTY tag in the radix tree.
 
+``readahead``
+	called by the VM to read pages associated with the address_space
+	object.  The pages are consecutive in the page cache and are
+        locked.  The implementation should decrement the page refcount after
+        attempting I/O on each page.  Usually the page will be unlocked by
+        the I/O completion handler.  If the function does not attempt I/O on
+        some pages, return the number of pages which were not read so the
+        common code can unlock the pages for you.
+
 ``readpages``
 	called by the VM to read pages associated with the address_space
 	object.  This is essentially just a vector version of readpage.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 98e0349adb52..a10f3a72e5ac 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -375,6 +375,8 @@ struct address_space_operations {
 	 */
 	int (*readpages)(struct file *filp, struct address_space *mapping,
 			struct list_head *pages, unsigned nr_pages);
+	unsigned (*readahead)(struct file *, struct address_space *,
+			pgoff_t start, unsigned nr_pages);
 
 	int (*write_begin)(struct file *, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned flags,
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 37a4d9e32cd3..0821f584c43c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -630,6 +630,18 @@ static inline int add_to_page_cache(struct page *page,
 	return error;
 }
 
+/*
+ * Only call this from a ->readahead implementation.
+ */
+static inline
+struct page *readahead_page(struct address_space *mapping, loff_t pos)
+{
+	struct page *page = xa_load(&mapping->i_pages, pos / PAGE_SIZE);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+	return page;
+}
+
 static inline unsigned long dir_pages(struct inode *inode)
 {
 	return (unsigned long)(inode->i_size + PAGE_SIZE - 1) >>
diff --git a/mm/readahead.c b/mm/readahead.c
index 5a6676640f20..6d65dae6dad0 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -121,7 +121,18 @@ static void read_pages(struct address_space *mapping, struct file *filp,
 
 	blk_start_plug(&plug);
 
-	if (mapping->a_ops->readpages) {
+	if (mapping->a_ops->readahead) {
+		unsigned left = mapping->a_ops->readahead(filp, mapping,
+				start, nr_pages);
+
+		while (left) {
+			struct page *page = readahead_page(mapping,
+					start + nr_pages - left - 1);
+			unlock_page(page);
+			put_page(page);
+			left--;
+		}
+	} else if (mapping->a_ops->readpages) {
 		mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
 		/* Clean up the remaining pages */
 		put_pages_list(pages);
-- 
2.24.1
^ permalink raw reply related	[flat|nested] 23+ messages in thread
- * [PATCH v2 6/9] iomap,xfs: Convert from readpages to readahead
  2020-01-15  2:38 [RFC v2 0/9] Replacing the readpages a_op Matthew Wilcox
                   ` (4 preceding siblings ...)
  2020-01-15  2:38 ` [PATCH v2 5/9] mm: Add readahead address space operation Matthew Wilcox
@ 2020-01-15  2:38 ` Matthew Wilcox
  2020-01-15  7:16   ` Christoph Hellwig
  2020-01-15  2:38 ` [PATCH v2 7/9] cifs: " Matthew Wilcox
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2020-01-15  2:38 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, linux-mm
  Cc: Matthew Wilcox (Oracle), Jeff Layton, Christoph Hellwig,
	Chris Mason
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Use the new readahead operation in XFS and iomap.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 72 +++++++++---------------------------------
 fs/iomap/trace.h       |  2 +-
 fs/xfs/xfs_aops.c      | 10 +++---
 include/linux/iomap.h  |  2 +-
 4 files changed, 22 insertions(+), 64 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 828444e14d09..a835b99f281f 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -216,7 +216,6 @@ struct iomap_readpage_ctx {
 	bool			cur_page_in_bio;
 	bool			is_readahead;
 	struct bio		*bio;
-	struct list_head	*pages;
 };
 
 static void
@@ -367,36 +366,8 @@ iomap_readpage(struct page *page, const struct iomap_ops *ops)
 }
 EXPORT_SYMBOL_GPL(iomap_readpage);
 
-static struct page *
-iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
-		loff_t length, loff_t *done)
-{
-	while (!list_empty(pages)) {
-		struct page *page = lru_to_page(pages);
-
-		if (page_offset(page) >= (u64)pos + length)
-			break;
-
-		list_del(&page->lru);
-		if (!add_to_page_cache_lru(page, inode->i_mapping, page->index,
-				GFP_NOFS))
-			return page;
-
-		/*
-		 * If we already have a page in the page cache at index we are
-		 * done.  Upper layers don't care if it is uptodate after the
-		 * readpages call itself as every page gets checked again once
-		 * actually needed.
-		 */
-		*done += PAGE_SIZE;
-		put_page(page);
-	}
-
-	return NULL;
-}
-
 static loff_t
-iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
+iomap_readahead_actor(struct inode *inode, loff_t pos, loff_t length,
 		void *data, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iomap_readpage_ctx *ctx = data;
@@ -410,10 +381,8 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
 			ctx->cur_page = NULL;
 		}
 		if (!ctx->cur_page) {
-			ctx->cur_page = iomap_next_page(inode, ctx->pages,
-					pos, length, &done);
-			if (!ctx->cur_page)
-				break;
+			ctx->cur_page = readahead_page(inode->i_mapping,
+					pos / PAGE_SIZE);
 			ctx->cur_page_in_bio = false;
 		}
 		ret = iomap_readpage_actor(inode, pos + done, length - done,
@@ -423,48 +392,37 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
 	return done;
 }
 
-int
-iomap_readpages(struct address_space *mapping, struct list_head *pages,
+unsigned
+iomap_readahead(struct address_space *mapping, pgoff_t start,
 		unsigned nr_pages, const struct iomap_ops *ops)
 {
 	struct iomap_readpage_ctx ctx = {
-		.pages		= pages,
 		.is_readahead	= true,
 	};
-	loff_t pos = page_offset(list_entry(pages->prev, struct page, lru));
-	loff_t last = page_offset(list_entry(pages->next, struct page, lru));
-	loff_t length = last - pos + PAGE_SIZE, ret = 0;
+	loff_t pos = start * PAGE_SIZE;
+	loff_t length = nr_pages * PAGE_SIZE;
 
-	trace_iomap_readpages(mapping->host, nr_pages);
+	trace_iomap_readahead(mapping->host, nr_pages);
 
 	while (length > 0) {
-		ret = iomap_apply(mapping->host, pos, length, 0, ops,
-				&ctx, iomap_readpages_actor);
+		loff_t ret = iomap_apply(mapping->host, pos, length, 0, ops,
+				&ctx, iomap_readahead_actor);
 		if (ret <= 0) {
 			WARN_ON_ONCE(ret == 0);
-			goto done;
+			break;
 		}
 		pos += ret;
 		length -= ret;
 	}
-	ret = 0;
-done:
+
 	if (ctx.bio)
 		submit_bio(ctx.bio);
-	if (ctx.cur_page) {
-		if (!ctx.cur_page_in_bio)
-			unlock_page(ctx.cur_page);
+	if (ctx.cur_page && ctx.cur_page_in_bio)
 		put_page(ctx.cur_page);
-	}
 
-	/*
-	 * Check that we didn't lose a page due to the arcance calling
-	 * conventions..
-	 */
-	WARN_ON_ONCE(!ret && !list_empty(ctx.pages));
-	return ret;
+	return length / PAGE_SIZE;
 }
-EXPORT_SYMBOL_GPL(iomap_readpages);
+EXPORT_SYMBOL_GPL(iomap_readahead);
 
 /*
  * iomap_is_partially_uptodate checks whether blocks within a page are
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index 6dc227b8c47e..d6ba705f938a 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -39,7 +39,7 @@ DEFINE_EVENT(iomap_readpage_class, name,	\
 	TP_PROTO(struct inode *inode, int nr_pages), \
 	TP_ARGS(inode, nr_pages))
 DEFINE_READPAGE_EVENT(iomap_readpage);
-DEFINE_READPAGE_EVENT(iomap_readpages);
+DEFINE_READPAGE_EVENT(iomap_readahead);
 
 DECLARE_EVENT_CLASS(iomap_page_class,
 	TP_PROTO(struct inode *inode, struct page *page, unsigned long off,
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3a688eb5c5ae..4d9da34e759b 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -621,14 +621,14 @@ xfs_vm_readpage(
 	return iomap_readpage(page, &xfs_read_iomap_ops);
 }
 
-STATIC int
-xfs_vm_readpages(
+STATIC unsigned
+xfs_vm_readahead(
 	struct file		*unused,
 	struct address_space	*mapping,
-	struct list_head	*pages,
+	pgoff_t			start,
 	unsigned		nr_pages)
 {
-	return iomap_readpages(mapping, pages, nr_pages, &xfs_read_iomap_ops);
+	return iomap_readahead(mapping, start, nr_pages, &xfs_read_iomap_ops);
 }
 
 static int
@@ -644,7 +644,7 @@ xfs_iomap_swapfile_activate(
 
 const struct address_space_operations xfs_address_space_operations = {
 	.readpage		= xfs_vm_readpage,
-	.readpages		= xfs_vm_readpages,
+	.readahead		= xfs_vm_readahead,
 	.writepage		= xfs_vm_writepage,
 	.writepages		= xfs_vm_writepages,
 	.set_page_dirty		= iomap_set_page_dirty,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 8b09463dae0d..81c6067e9b61 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -155,7 +155,7 @@ loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
 		const struct iomap_ops *ops);
 int iomap_readpage(struct page *page, const struct iomap_ops *ops);
-int iomap_readpages(struct address_space *mapping, struct list_head *pages,
+unsigned iomap_readahead(struct address_space *, pgoff_t start,
 		unsigned nr_pages, const struct iomap_ops *ops);
 int iomap_set_page_dirty(struct page *page);
 int iomap_is_partially_uptodate(struct page *page, unsigned long from,
-- 
2.24.1
^ permalink raw reply related	[flat|nested] 23+ messages in thread
- * Re: [PATCH v2 6/9] iomap,xfs: Convert from readpages to readahead
  2020-01-15  2:38 ` [PATCH v2 6/9] iomap,xfs: Convert from readpages to readahead Matthew Wilcox
@ 2020-01-15  7:16   ` Christoph Hellwig
  2020-01-15  7:42     ` Matthew Wilcox
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2020-01-15  7:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-fsdevel, linux-mm, Jeff Layton,
	Christoph Hellwig, Chris Mason
On Tue, Jan 14, 2020 at 06:38:40PM -0800, Matthew Wilcox wrote:
>  static loff_t
> +iomap_readahead_actor(struct inode *inode, loff_t pos, loff_t length,
>  		void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct iomap_readpage_ctx *ctx = data;
> @@ -410,10 +381,8 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
>  			ctx->cur_page = NULL;
>  		}
>  		if (!ctx->cur_page) {
> -			ctx->cur_page = iomap_next_page(inode, ctx->pages,
> -					pos, length, &done);
> -			if (!ctx->cur_page)
> -				break;
> +			ctx->cur_page = readahead_page(inode->i_mapping,
> +					pos / PAGE_SIZE);
Don't we at least need a sanity check for a NULL cur_page here?
Also the readahead_page version in your previous patch seems to expect
a byte offset, so the division above would not be required. (and should
probably be replaced with a right shift anyway no matter where it ends
up)
> +unsigned
> +iomap_readahead(struct address_space *mapping, pgoff_t start,
>  		unsigned nr_pages, const struct iomap_ops *ops)
>  {
>  	struct iomap_readpage_ctx ctx = {
> -		.pages		= pages,
>  		.is_readahead	= true,
>  	};
> -	loff_t pos = page_offset(list_entry(pages->prev, struct page, lru));
> -	loff_t last = page_offset(list_entry(pages->next, struct page, lru));
> -	loff_t length = last - pos + PAGE_SIZE, ret = 0;
> +	loff_t pos = start * PAGE_SIZE;
> +	loff_t length = nr_pages * PAGE_SIZE;
Any good reason not to pass byte offsets for start and length?
> +	return length / PAGE_SIZE;
Same for the return value?
For the file systems that would usually be a more natural interface than
a page index and number of pages.
^ permalink raw reply	[flat|nested] 23+ messages in thread
- * Re: [PATCH v2 6/9] iomap,xfs: Convert from readpages to readahead
  2020-01-15  7:16   ` Christoph Hellwig
@ 2020-01-15  7:42     ` Matthew Wilcox
  2020-01-24 22:53       ` Matthew Wilcox
  0 siblings, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2020-01-15  7:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, linux-mm, Jeff Layton, Chris Mason
On Tue, Jan 14, 2020 at 11:16:28PM -0800, Christoph Hellwig wrote:
> On Tue, Jan 14, 2020 at 06:38:40PM -0800, Matthew Wilcox wrote:
> >  static loff_t
> > +iomap_readahead_actor(struct inode *inode, loff_t pos, loff_t length,
> >  		void *data, struct iomap *iomap, struct iomap *srcmap)
> >  {
> >  	struct iomap_readpage_ctx *ctx = data;
> > @@ -410,10 +381,8 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
> >  			ctx->cur_page = NULL;
> >  		}
> >  		if (!ctx->cur_page) {
> > -			ctx->cur_page = iomap_next_page(inode, ctx->pages,
> > -					pos, length, &done);
> > -			if (!ctx->cur_page)
> > -				break;
> > +			ctx->cur_page = readahead_page(inode->i_mapping,
> > +					pos / PAGE_SIZE);
> 
> Don't we at least need a sanity check for a NULL cur_page here?
I don't think so.  The caller has already put the locked page into the
page cache at that index.  If the page has gone away, that's a bug, and
I don't think BUG_ON is all that much better than a NULL pointer derefence.
Indeed, readahead_page() checks PageLocked, so it can't return NULL.
> Also the readahead_page version in your previous patch seems to expect
> a byte offset, so the division above would not be required.
Oops.  I had intended to make readahead_pages() look like this:
struct page *readahead_page(struct address_space *mapping, pgoff_t index)
{
        struct page *page = xa_load(&mapping->i_pages, index);
        VM_BUG_ON_PAGE(!PageLocked(page), page);
        return page;
}
If only our tools could warn about these kinds of mistakes.
> (and should
> probably be replaced with a right shift anyway no matter where it ends
> up)
If the compiler can't tell that x / 4096 and x >> 12 are precisely the same
and choose the more efficient of the two, we have big problems.
> > +unsigned
> > +iomap_readahead(struct address_space *mapping, pgoff_t start,
> >  		unsigned nr_pages, const struct iomap_ops *ops)
> >  {
> >  	struct iomap_readpage_ctx ctx = {
> > -		.pages		= pages,
> >  		.is_readahead	= true,
> >  	};
> > -	loff_t pos = page_offset(list_entry(pages->prev, struct page, lru));
> > -	loff_t last = page_offset(list_entry(pages->next, struct page, lru));
> > -	loff_t length = last - pos + PAGE_SIZE, ret = 0;
> > +	loff_t pos = start * PAGE_SIZE;
> > +	loff_t length = nr_pages * PAGE_SIZE;
> 
> Any good reason not to pass byte offsets for start and length?
> 
> > +	return length / PAGE_SIZE;
> 
> Same for the return value?
> 
> For the file systems that would usually be a more natural interface than
> a page index and number of pages.
That seems to depend on the filesystem.  iomap definitely would be happier
with loff_t, but cifs prefers pgoff_t.  I should probably survey a few
more filesystems and see if there's a strong lean in one direction or
the other.
^ permalink raw reply	[flat|nested] 23+ messages in thread
- * Re: [PATCH v2 6/9] iomap,xfs: Convert from readpages to readahead
  2020-01-15  7:42     ` Matthew Wilcox
@ 2020-01-24 22:53       ` Matthew Wilcox
  0 siblings, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2020-01-24 22:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, linux-mm, Jeff Layton, Chris Mason
On Tue, Jan 14, 2020 at 11:42:43PM -0800, Matthew Wilcox wrote:
> On Tue, Jan 14, 2020 at 11:16:28PM -0800, Christoph Hellwig wrote:
> > On Tue, Jan 14, 2020 at 06:38:40PM -0800, Matthew Wilcox wrote:
> > > +iomap_readahead(struct address_space *mapping, pgoff_t start,
> > >  		unsigned nr_pages, const struct iomap_ops *ops)
> > >  {
> > >  	struct iomap_readpage_ctx ctx = {
> > > -		.pages		= pages,
> > >  		.is_readahead	= true,
> > >  	};
> > > -	loff_t pos = page_offset(list_entry(pages->prev, struct page, lru));
> > > -	loff_t last = page_offset(list_entry(pages->next, struct page, lru));
> > > -	loff_t length = last - pos + PAGE_SIZE, ret = 0;
> > > +	loff_t pos = start * PAGE_SIZE;
> > > +	loff_t length = nr_pages * PAGE_SIZE;
> > 
> > Any good reason not to pass byte offsets for start and length?
> > 
> > > +	return length / PAGE_SIZE;
> > 
> > Same for the return value?
> > 
> > For the file systems that would usually be a more natural interface than
> > a page index and number of pages.
> 
> That seems to depend on the filesystem.  iomap definitely would be happier
> with loff_t, but cifs prefers pgoff_t.  I should probably survey a few
> more filesystems and see if there's a strong lean in one direction or
> the other.
I've converted all the filesystems now except for those that use fscache.
http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/readahead
iomap is the only one for which an loff_t makes sense as an argument.
fscache will also prefer page index & count once Dave's conversion series
lands:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/commit/?h=fscache-iter&id=ae317744dfb9732123e554467a9f6d93733e8a5b
I'll prep a serious conversion series for 5.6 soon (skipping cifs, but
converting all the non-fscache filesystems).
^ permalink raw reply	[flat|nested] 23+ messages in thread
 
 
 
- * [PATCH v2 7/9] cifs: Convert from readpages to readahead
  2020-01-15  2:38 [RFC v2 0/9] Replacing the readpages a_op Matthew Wilcox
                   ` (5 preceding siblings ...)
  2020-01-15  2:38 ` [PATCH v2 6/9] iomap,xfs: Convert from readpages to readahead Matthew Wilcox
@ 2020-01-15  2:38 ` Matthew Wilcox
  2020-01-15  2:38 ` [PATCH v2 8/9] mm: Remove add_to_page_cache_locked Matthew Wilcox
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2020-01-15  2:38 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, linux-mm
  Cc: Matthew Wilcox (Oracle), Jeff Layton, Christoph Hellwig,
	Chris Mason
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Use the new readahead operation in CIFS
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/cifs/file.c | 143 +++++++++----------------------------------------
 1 file changed, 25 insertions(+), 118 deletions(-)
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 043288b5c728..c9162380ad22 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -4280,70 +4280,11 @@ cifs_readpages_copy_into_pages(struct TCP_Server_Info *server,
 	return readpages_fill_pages(server, rdata, iter, iter->count);
 }
 
-static int
-readpages_get_pages(struct address_space *mapping, struct list_head *page_list,
-		    unsigned int rsize, struct list_head *tmplist,
-		    unsigned int *nr_pages, loff_t *offset, unsigned int *bytes)
-{
-	struct page *page, *tpage;
-	unsigned int expected_index;
-	int rc;
-	gfp_t gfp = readahead_gfp_mask(mapping);
-
-	INIT_LIST_HEAD(tmplist);
-
-	page = lru_to_page(page_list);
-
-	/*
-	 * Lock the page and put it in the cache. Since no one else
-	 * should have access to this page, we're safe to simply set
-	 * PG_locked without checking it first.
-	 */
-	__SetPageLocked(page);
-	rc = add_to_page_cache_locked(page, mapping,
-				      page->index, gfp);
-
-	/* give up if we can't stick it in the cache */
-	if (rc) {
-		__ClearPageLocked(page);
-		return rc;
-	}
-
-	/* move first page to the tmplist */
-	*offset = (loff_t)page->index << PAGE_SHIFT;
-	*bytes = PAGE_SIZE;
-	*nr_pages = 1;
-	list_move_tail(&page->lru, tmplist);
-
-	/* now try and add more pages onto the request */
-	expected_index = page->index + 1;
-	list_for_each_entry_safe_reverse(page, tpage, page_list, lru) {
-		/* discontinuity ? */
-		if (page->index != expected_index)
-			break;
-
-		/* would this page push the read over the rsize? */
-		if (*bytes + PAGE_SIZE > rsize)
-			break;
-
-		__SetPageLocked(page);
-		if (add_to_page_cache_locked(page, mapping, page->index, gfp)) {
-			__ClearPageLocked(page);
-			break;
-		}
-		list_move_tail(&page->lru, tmplist);
-		(*bytes) += PAGE_SIZE;
-		expected_index++;
-		(*nr_pages)++;
-	}
-	return rc;
-}
-
-static int cifs_readpages(struct file *file, struct address_space *mapping,
-	struct list_head *page_list, unsigned num_pages)
+static unsigned cifs_readahead(struct file *file,
+		struct address_space *mapping, pgoff_t start,
+		unsigned num_pages)
 {
 	int rc;
-	struct list_head tmplist;
 	struct cifsFileInfo *open_file = file->private_data;
 	struct cifs_sb_info *cifs_sb = CIFS_FILE_SB(file);
 	struct TCP_Server_Info *server;
@@ -4358,11 +4299,10 @@ static int cifs_readpages(struct file *file, struct address_space *mapping,
 	 * After this point, every page in the list might have PG_fscache set,
 	 * so we will need to clean that up off of every page we don't use.
 	 */
-	rc = cifs_readpages_from_fscache(mapping->host, mapping, page_list,
-					 &num_pages);
+	rc = -ENOBUFS;
 	if (rc == 0) {
 		free_xid(xid);
-		return rc;
+		return num_pages;
 	}
 
 	if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_RWPIDFORWARD)
@@ -4373,24 +4313,11 @@ static int cifs_readpages(struct file *file, struct address_space *mapping,
 	rc = 0;
 	server = tlink_tcon(open_file->tlink)->ses->server;
 
-	cifs_dbg(FYI, "%s: file=%p mapping=%p num_pages=%u\n",
-		 __func__, file, mapping, num_pages);
+	cifs_dbg(FYI, "%s: file=%p mapping=%p start=%lu num_pages=%u\n",
+		 __func__, file, mapping, start, num_pages);
 
-	/*
-	 * Start with the page at end of list and move it to private
-	 * list. Do the same with any following pages until we hit
-	 * the rsize limit, hit an index discontinuity, or run out of
-	 * pages. Issue the async read and then start the loop again
-	 * until the list is empty.
-	 *
-	 * Note that list order is important. The page_list is in
-	 * the order of declining indexes. When we put the pages in
-	 * the rdata->pages, then we want them in increasing order.
-	 */
-	while (!list_empty(page_list)) {
-		unsigned int i, nr_pages, bytes, rsize;
-		loff_t offset;
-		struct page *page, *tpage;
+	while (num_pages) {
+		unsigned int i, nr_pages, rsize;
 		struct cifs_readdata *rdata;
 		struct cifs_credits credits_on_stack;
 		struct cifs_credits *credits = &credits_on_stack;
@@ -4408,21 +4335,14 @@ static int cifs_readpages(struct file *file, struct address_space *mapping,
 		if (rc)
 			break;
 
+		nr_pages = min_t(unsigned, rsize / PAGE_SIZE, num_pages);
 		/*
 		 * Give up immediately if rsize is too small to read an entire
 		 * page. The VFS will fall back to readpage. We should never
-		 * reach this point however since we set ra_pages to 0 when the
-		 * rsize is smaller than a cache page.
+		 * reach this point however since we set ra_pages to 0 when
+		 * the rsize is smaller than a cache page.
 		 */
-		if (unlikely(rsize < PAGE_SIZE)) {
-			add_credits_and_wake_if(server, credits, 0);
-			free_xid(xid);
-			return 0;
-		}
-
-		rc = readpages_get_pages(mapping, page_list, rsize, &tmplist,
-					 &nr_pages, &offset, &bytes);
-		if (rc) {
+		if (unlikely(nr_pages < 1)) {
 			add_credits_and_wake_if(server, credits, 0);
 			break;
 		}
@@ -4430,21 +4350,15 @@ static int cifs_readpages(struct file *file, struct address_space *mapping,
 		rdata = cifs_readdata_alloc(nr_pages, cifs_readv_complete);
 		if (!rdata) {
 			/* best to give up if we're out of mem */
-			list_for_each_entry_safe(page, tpage, &tmplist, lru) {
-				list_del(&page->lru);
-				lru_cache_add_file(page);
-				unlock_page(page);
-				put_page(page);
-			}
-			rc = -ENOMEM;
 			add_credits_and_wake_if(server, credits, 0);
 			break;
 		}
 
 		rdata->cfile = cifsFileInfo_get(open_file);
 		rdata->mapping = mapping;
-		rdata->offset = offset;
-		rdata->bytes = bytes;
+		rdata->offset = start;
+		rdata->nr_pages = nr_pages;
+		rdata->bytes = nr_pages * PAGE_SIZE;
 		rdata->pid = pid;
 		rdata->pagesz = PAGE_SIZE;
 		rdata->tailsz = PAGE_SIZE;
@@ -4452,10 +4366,8 @@ static int cifs_readpages(struct file *file, struct address_space *mapping,
 		rdata->copy_into_pages = cifs_readpages_copy_into_pages;
 		rdata->credits = credits_on_stack;
 
-		list_for_each_entry_safe(page, tpage, &tmplist, lru) {
-			list_del(&page->lru);
-			rdata->pages[rdata->nr_pages++] = page;
-		}
+		for (i = 0; i < nr_pages; i++)
+			rdata->pages[i] = readahead_page(mapping, start++);
 
 		rc = adjust_credits(server, &rdata->credits, rdata->bytes);
 
@@ -4468,27 +4380,22 @@ static int cifs_readpages(struct file *file, struct address_space *mapping,
 
 		if (rc) {
 			add_credits_and_wake_if(server, &rdata->credits, 0);
-			for (i = 0; i < rdata->nr_pages; i++) {
-				page = rdata->pages[i];
-				lru_cache_add_file(page);
-				unlock_page(page);
-				put_page(page);
-			}
-			/* Fallback to the readpage in error/reconnect cases */
 			kref_put(&rdata->refcount, cifs_readdata_release);
 			break;
 		}
 
 		kref_put(&rdata->refcount, cifs_readdata_release);
+		num_pages -= nr_pages;
 	}
 
 	/* Any pages that have been shown to fscache but didn't get added to
 	 * the pagecache must be uncached before they get returned to the
 	 * allocator.
 	 */
-	cifs_fscache_readpages_cancel(mapping->host, page_list);
+//	cifs_fscache_readpages_cancel(mapping->host, page_list);
 	free_xid(xid);
-	return rc;
+
+	return num_pages;
 }
 
 /*
@@ -4806,7 +4713,7 @@ cifs_direct_io(struct kiocb *iocb, struct iov_iter *iter)
 
 const struct address_space_operations cifs_addr_ops = {
 	.readpage = cifs_readpage,
-	.readpages = cifs_readpages,
+	.readahead = cifs_readahead,
 	.writepage = cifs_writepage,
 	.writepages = cifs_writepages,
 	.write_begin = cifs_write_begin,
@@ -4819,9 +4726,9 @@ const struct address_space_operations cifs_addr_ops = {
 };
 
 /*
- * cifs_readpages requires the server to support a buffer large enough to
+ * cifs_readahead requires the server to support a buffer large enough to
  * contain the header plus one complete page of data.  Otherwise, we need
- * to leave cifs_readpages out of the address space operations.
+ * to leave cifs_readahead out of the address space operations.
  */
 const struct address_space_operations cifs_addr_ops_smallbuf = {
 	.readpage = cifs_readpage,
-- 
2.24.1
^ permalink raw reply related	[flat|nested] 23+ messages in thread
- * [PATCH v2 8/9] mm: Remove add_to_page_cache_locked
  2020-01-15  2:38 [RFC v2 0/9] Replacing the readpages a_op Matthew Wilcox
                   ` (6 preceding siblings ...)
  2020-01-15  2:38 ` [PATCH v2 7/9] cifs: " Matthew Wilcox
@ 2020-01-15  2:38 ` Matthew Wilcox
  2020-01-15  2:38 ` [PATCH v2 9/9] mm: Unify all add_to_page_cache variants Matthew Wilcox
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2020-01-15  2:38 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, linux-mm
  Cc: Matthew Wilcox (Oracle), Jeff Layton, Christoph Hellwig,
	Chris Mason
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
The only remaining caller is add_to_page_cache(), and the only
caller of that is hugetlbfs, so move add_to_page_cache() into
filemap.c and call __add_to_page_cache_locked() directly.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h | 20 ++------------------
 mm/filemap.c            | 23 ++++++++---------------
 2 files changed, 10 insertions(+), 33 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 0821f584c43c..75075065dd0b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -604,8 +604,8 @@ static inline int fault_in_pages_readable(const char __user *uaddr, int size)
 	return 0;
 }
 
-int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
-				pgoff_t index, gfp_t gfp_mask);
+int add_to_page_cache(struct page *page, struct address_space *mapping,
+				pgoff_t index, gfp_t gfp);
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
 extern void delete_from_page_cache(struct page *page);
@@ -614,22 +614,6 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
 void delete_from_page_cache_batch(struct address_space *mapping,
 				  struct pagevec *pvec);
 
-/*
- * Like add_to_page_cache_locked, but used to add newly allocated pages:
- * the page is new, so we can just run __SetPageLocked() against it.
- */
-static inline int add_to_page_cache(struct page *page,
-		struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask)
-{
-	int error;
-
-	__SetPageLocked(page);
-	error = add_to_page_cache_locked(page, mapping, offset, gfp_mask);
-	if (unlikely(error))
-		__ClearPageLocked(page);
-	return error;
-}
-
 /*
  * Only call this from a ->readahead implementation.
  */
diff --git a/mm/filemap.c b/mm/filemap.c
index bf6aa30be58d..fb87f5fa75e6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -913,25 +913,18 @@ static int __add_to_page_cache_locked(struct page *page,
 }
 ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO);
 
-/**
- * add_to_page_cache_locked - add a locked page to the pagecache
- * @page:	page to add
- * @mapping:	the page's address_space
- * @offset:	page index
- * @gfp_mask:	page allocation mode
- *
- * This function is used to add a page to the pagecache. It must be locked.
- * This function does not add the page to the LRU.  The caller must do that.
- *
- * Return: %0 on success, negative error code otherwise.
- */
-int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
+int add_to_page_cache(struct page *page, struct address_space *mapping,
 		pgoff_t offset, gfp_t gfp_mask)
 {
-	return __add_to_page_cache_locked(page, mapping, offset,
+	int err;
+
+	__SetPageLocked(page);
+	err = __add_to_page_cache_locked(page, mapping, offset,
 					  gfp_mask, NULL);
+	if (unlikely(err))
+		__ClearPageLocked(page);
+	return err;
 }
-EXPORT_SYMBOL(add_to_page_cache_locked);
 
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t offset, gfp_t gfp_mask)
-- 
2.24.1
^ permalink raw reply related	[flat|nested] 23+ messages in thread
- * [PATCH v2 9/9] mm: Unify all add_to_page_cache variants
  2020-01-15  2:38 [RFC v2 0/9] Replacing the readpages a_op Matthew Wilcox
                   ` (7 preceding siblings ...)
  2020-01-15  2:38 ` [PATCH v2 8/9] mm: Remove add_to_page_cache_locked Matthew Wilcox
@ 2020-01-15  2:38 ` Matthew Wilcox
  2020-01-15  7:20   ` Christoph Hellwig
  2020-01-18 23:13 ` [RFC v2 0/9] Replacing the readpages a_op Matthew Wilcox
  2020-01-21 11:36 ` Jan Kara
  10 siblings, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2020-01-15  2:38 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, linux-mm
  Cc: Matthew Wilcox (Oracle), Jeff Layton, Christoph Hellwig,
	Chris Mason
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
We already have various bits of add_to_page_cache() executed conditionally
on !PageHuge(page); add the add_to_page_cache_lru() pieces as some
more code which isn't executed for huge pages.  This lets us remove
the old add_to_page_cache() and rename __add_to_page_cache_locked() to
add_to_page_cache().  Include a compatibility define so we don't have
to change all 20+ callers of add_to_page_cache_lru().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h |  5 ++--
 mm/filemap.c            | 65 ++++++++++++-----------------------------
 2 files changed, 21 insertions(+), 49 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 75075065dd0b..637770fa283f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -606,14 +606,15 @@ static inline int fault_in_pages_readable(const char __user *uaddr, int size)
 
 int add_to_page_cache(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp);
-int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
-				pgoff_t index, gfp_t gfp_mask);
 extern void delete_from_page_cache(struct page *page);
 extern void __delete_from_page_cache(struct page *page, void *shadow);
 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
 void delete_from_page_cache_batch(struct address_space *mapping,
 				  struct pagevec *pvec);
 
+#define add_to_page_cache_lru(page, mapping, index, gfp) \
+	add_to_page_cache(page, mapping, index, gfp)
+
 /*
  * Only call this from a ->readahead implementation.
  */
diff --git a/mm/filemap.c b/mm/filemap.c
index fb87f5fa75e6..83f45f31a00a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -847,19 +847,18 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL_GPL(replace_page_cache_page);
 
-static int __add_to_page_cache_locked(struct page *page,
-				      struct address_space *mapping,
-				      pgoff_t offset, gfp_t gfp_mask,
-				      void **shadowp)
+int add_to_page_cache(struct page *page, struct address_space *mapping,
+		pgoff_t offset, gfp_t gfp_mask)
 {
 	XA_STATE(xas, &mapping->i_pages, offset);
 	int huge = PageHuge(page);
 	struct mem_cgroup *memcg;
 	int error;
-	void *old;
+	void *old, *shadow = NULL;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapBacked(page), page);
+	__SetPageLocked(page);
 	mapping_set_update(&xas, mapping);
 
 	if (!huge) {
@@ -884,8 +883,7 @@ static int __add_to_page_cache_locked(struct page *page,
 
 		if (xa_is_value(old)) {
 			mapping->nrexceptional--;
-			if (shadowp)
-				*shadowp = old;
+			shadow = old;
 		}
 		mapping->nrpages++;
 
@@ -899,45 +897,8 @@ static int __add_to_page_cache_locked(struct page *page,
 	if (xas_error(&xas))
 		goto error;
 
-	if (!huge)
+	if (!huge) {
 		mem_cgroup_commit_charge(page, memcg, false, false);
-	trace_mm_filemap_add_to_page_cache(page);
-	return 0;
-error:
-	page->mapping = NULL;
-	/* Leave page->index set: truncation relies upon it */
-	if (!huge)
-		mem_cgroup_cancel_charge(page, memcg, false);
-	put_page(page);
-	return xas_error(&xas);
-}
-ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO);
-
-int add_to_page_cache(struct page *page, struct address_space *mapping,
-		pgoff_t offset, gfp_t gfp_mask)
-{
-	int err;
-
-	__SetPageLocked(page);
-	err = __add_to_page_cache_locked(page, mapping, offset,
-					  gfp_mask, NULL);
-	if (unlikely(err))
-		__ClearPageLocked(page);
-	return err;
-}
-
-int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
-				pgoff_t offset, gfp_t gfp_mask)
-{
-	void *shadow = NULL;
-	int ret;
-
-	__SetPageLocked(page);
-	ret = __add_to_page_cache_locked(page, mapping, offset,
-					 gfp_mask, &shadow);
-	if (unlikely(ret))
-		__ClearPageLocked(page);
-	else {
 		/*
 		 * The page might have been evicted from cache only
 		 * recently, in which case it should be activated like
@@ -951,9 +912,19 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 			workingset_refault(page, shadow);
 		lru_cache_add(page);
 	}
-	return ret;
+	trace_mm_filemap_add_to_page_cache(page);
+	return 0;
+error:
+	page->mapping = NULL;
+	/* Leave page->index set: truncation relies upon it */
+	if (!huge)
+		mem_cgroup_cancel_charge(page, memcg, false);
+	put_page(page);
+	__ClearPageLocked(page);
+	return xas_error(&xas);
 }
-EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
+ALLOW_ERROR_INJECTION(add_to_page_cache, ERRNO);
+EXPORT_SYMBOL_GPL(add_to_page_cache);
 
 #ifdef CONFIG_NUMA
 struct page *__page_cache_alloc(gfp_t gfp)
-- 
2.24.1
^ permalink raw reply related	[flat|nested] 23+ messages in thread
- * Re: [PATCH v2 9/9] mm: Unify all add_to_page_cache variants
  2020-01-15  2:38 ` [PATCH v2 9/9] mm: Unify all add_to_page_cache variants Matthew Wilcox
@ 2020-01-15  7:20   ` Christoph Hellwig
  2020-01-15  7:44     ` Matthew Wilcox
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2020-01-15  7:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-fsdevel, linux-mm, Jeff Layton,
	Christoph Hellwig, Chris Mason
On Tue, Jan 14, 2020 at 06:38:43PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> We already have various bits of add_to_page_cache() executed conditionally
> on !PageHuge(page); add the add_to_page_cache_lru() pieces as some
> more code which isn't executed for huge pages.  This lets us remove
> the old add_to_page_cache() and rename __add_to_page_cache_locked() to
> add_to_page_cache().  Include a compatibility define so we don't have
> to change all 20+ callers of add_to_page_cache_lru().
I'd rather change them.  20ish isn't that much after all, and not
keeping pointless aliases around keeps the code easier to read.
^ permalink raw reply	[flat|nested] 23+ messages in thread 
- * Re: [PATCH v2 9/9] mm: Unify all add_to_page_cache variants
  2020-01-15  7:20   ` Christoph Hellwig
@ 2020-01-15  7:44     ` Matthew Wilcox
  0 siblings, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2020-01-15  7:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, linux-mm, Jeff Layton, Chris Mason
On Tue, Jan 14, 2020 at 11:20:04PM -0800, Christoph Hellwig wrote:
> On Tue, Jan 14, 2020 at 06:38:43PM -0800, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > We already have various bits of add_to_page_cache() executed conditionally
> > on !PageHuge(page); add the add_to_page_cache_lru() pieces as some
> > more code which isn't executed for huge pages.  This lets us remove
> > the old add_to_page_cache() and rename __add_to_page_cache_locked() to
> > add_to_page_cache().  Include a compatibility define so we don't have
> > to change all 20+ callers of add_to_page_cache_lru().
> 
> I'd rather change them.  20ish isn't that much after all, and not
> keeping pointless aliases around keeps the code easier to read.
Almost all of them are called in the ->readpages() function, so they'll
go away as filesystems are converted to ->readahead().  I'd rather not
introduce something that makes patches harder to reorder.
^ permalink raw reply	[flat|nested] 23+ messages in thread 
 
 
- * Re: [RFC v2 0/9] Replacing the readpages a_op
  2020-01-15  2:38 [RFC v2 0/9] Replacing the readpages a_op Matthew Wilcox
                   ` (8 preceding siblings ...)
  2020-01-15  2:38 ` [PATCH v2 9/9] mm: Unify all add_to_page_cache variants Matthew Wilcox
@ 2020-01-18 23:13 ` Matthew Wilcox
  2020-01-21 11:36 ` Jan Kara
  10 siblings, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2020-01-18 23:13 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, linux-mm
  Cc: Jeff Layton, Christoph Hellwig, Chris Mason
On Tue, Jan 14, 2020 at 06:38:34PM -0800, Matthew Wilcox wrote:
> This is an attempt to add a ->readahead op to replace ->readpages.  I've
> converted two users, iomap/xfs and cifs.  The cifs conversion is lacking
> fscache support, and that's just because I didn't want to do that work;
> I don't believe there's anything fundamental to it.  But I wanted to do
> iomap because it is The Infrastructure Of The Future and cifs because it
> is the sole remaining user of add_to_page_cache_locked(), which enables
> the last two patches in the series.  By the way, that gives CIFS access
> to the workingset shadow infrastructure, which it had to ignore before
> because it couldn't put pages onto the lru list at the right time.
> 
> v2: Chris asked me to show what this would look like if we just have
> the implementation look up the pages in the page cache, and I managed
> to figure out some things I'd done wrong last time.  It's even simpler
> than v1 (net 104 lines deleted).
I want to discuss whether to change the page refcount guarantees while we're
changing the API.  Currently,
page is allocated with a refcount of 1
page is locked, and inserted into page cache and refcount is bumped to 2
->readahead is called
callee is supposed to call put_page() after submitting I/O
I/O completion will unlock the page after I/O completes, leaving the refcount at 1.
So, what if we leave the refcount at 1 throughout the submission process,
saving ourselves two atomic ops per page?  We have to ensure that after
the page is submitted for I/O, the submission path no longer touches
the page.  So the process of converting a filesystem to ->readahead
becomes slightly more complex, but there's a bugger win as a result.
Opinions?
^ permalink raw reply	[flat|nested] 23+ messages in thread
- * Re: [RFC v2 0/9] Replacing the readpages a_op
  2020-01-15  2:38 [RFC v2 0/9] Replacing the readpages a_op Matthew Wilcox
                   ` (9 preceding siblings ...)
  2020-01-18 23:13 ` [RFC v2 0/9] Replacing the readpages a_op Matthew Wilcox
@ 2020-01-21 11:36 ` Jan Kara
  2020-01-21 21:48   ` Matthew Wilcox
  10 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2020-01-21 11:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-fsdevel, linux-mm, Jeff Layton,
	Christoph Hellwig, Chris Mason
Hello Matthew!
On Tue 14-01-20 18:38:34, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> This is an attempt to add a ->readahead op to replace ->readpages.  I've
> converted two users, iomap/xfs and cifs.  The cifs conversion is lacking
> fscache support, and that's just because I didn't want to do that work;
> I don't believe there's anything fundamental to it.  But I wanted to do
> iomap because it is The Infrastructure Of The Future and cifs because it
> is the sole remaining user of add_to_page_cache_locked(), which enables
> the last two patches in the series.  By the way, that gives CIFS access
> to the workingset shadow infrastructure, which it had to ignore before
> because it couldn't put pages onto the lru list at the right time.
> 
> v2: Chris asked me to show what this would look like if we just have
> the implementation look up the pages in the page cache, and I managed
> to figure out some things I'd done wrong last time.  It's even simpler
> than v1 (net 104 lines deleted).
I have an unfinished patch series laying around that pulls the ->readpage
/ ->readpages API in somewhat different direction so I'd like to discuss
whether it's possible to solve my problem using your API. The problem I
have is that currently some operations such as hole punching can race with
->readpage / ->readpages like:
CPU0						CPU1
fallocate(fd, FALLOC_FL_PUNCH_HOLE, off, len)
  filemap_write_and_wait_range()
  down_write(inode->i_rwsem);
  truncate_pagecache_range();
						readahead(fd, off, len)
						  creates pages in page cache
						  looks up block mapping
  removes blocks from inode and frees them
						  issues bio
						    - reads stale data -
						      potential security
						      issue
Now how I wanted to address this is that I'd change the API convention for
->readpage() so that we call it with the page unlocked and the function
would lock the page, check it's still OK, and do what it needs. And this
will allow ->readpage() and also ->readpages() to grab lock
(EXT4_I(inode)->i_mmap_sem in case of ext4) to synchronize with hole punching
while we are adding pages to page cache and mapping underlying blocks.
Now your API makes even ->readpages() (actually ->readahead) called with
pages locked so that makes this approach problematic because of lock
inversions. So I'd prefer if we could keep the situation that ->readpages /
->readahead gets called without any pages in page cache locked...
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 23+ messages in thread
- * Re: [RFC v2 0/9] Replacing the readpages a_op
  2020-01-21 11:36 ` Jan Kara
@ 2020-01-21 21:48   ` Matthew Wilcox
  2020-01-22  9:44     ` Jan Kara
  2020-01-22 23:47     ` Dave Chinner
  0 siblings, 2 replies; 23+ messages in thread
From: Matthew Wilcox @ 2020-01-21 21:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-xfs, linux-fsdevel, linux-mm, Jeff Layton,
	Christoph Hellwig, Chris Mason
On Tue, Jan 21, 2020 at 12:36:27PM +0100, Jan Kara wrote:
> > v2: Chris asked me to show what this would look like if we just have
> > the implementation look up the pages in the page cache, and I managed
> > to figure out some things I'd done wrong last time.  It's even simpler
> > than v1 (net 104 lines deleted).
> 
> I have an unfinished patch series laying around that pulls the ->readpage
> / ->readpages API in somewhat different direction so I'd like to discuss
> whether it's possible to solve my problem using your API. The problem I
> have is that currently some operations such as hole punching can race with
> ->readpage / ->readpages like:
> 
> CPU0						CPU1
> fallocate(fd, FALLOC_FL_PUNCH_HOLE, off, len)
>   filemap_write_and_wait_range()
>   down_write(inode->i_rwsem);
>   truncate_pagecache_range();
> 						readahead(fd, off, len)
> 						  creates pages in page cache
> 						  looks up block mapping
>   removes blocks from inode and frees them
> 						  issues bio
> 						    - reads stale data -
> 						      potential security
> 						      issue
> 
> Now how I wanted to address this is that I'd change the API convention for
> ->readpage() so that we call it with the page unlocked and the function
> would lock the page, check it's still OK, and do what it needs. And this
> will allow ->readpage() and also ->readpages() to grab lock
> (EXT4_I(inode)->i_mmap_sem in case of ext4) to synchronize with hole punching
> while we are adding pages to page cache and mapping underlying blocks.
> 
> Now your API makes even ->readpages() (actually ->readahead) called with
> pages locked so that makes this approach problematic because of lock
> inversions. So I'd prefer if we could keep the situation that ->readpages /
> ->readahead gets called without any pages in page cache locked...
I'm not a huge fan of that approach because it increases the number of
atomic ops (right now, we __SetPageLocked on the page before adding it
to i_pages).  Holepunch is a rather rare operation while readpage and
readpages/readahead are extremely common, so can we make holepunch take
a lock that will prevent new readpage(s) succeeding?
I have an idea to move the lock entries from DAX to being a generic page
cache concept.  That way, holepunch could insert lock entries into the
pagecache to cover the range being punched, and readpage(s) would either
skip lock entries or block on them.
Maybe there's a better approach though.
^ permalink raw reply	[flat|nested] 23+ messages in thread 
- * Re: [RFC v2 0/9] Replacing the readpages a_op
  2020-01-21 21:48   ` Matthew Wilcox
@ 2020-01-22  9:44     ` Jan Kara
  2020-01-23 10:31       ` Jan Kara
  2020-01-22 23:47     ` Dave Chinner
  1 sibling, 1 reply; 23+ messages in thread
From: Jan Kara @ 2020-01-22  9:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, linux-xfs, linux-fsdevel, linux-mm, Jeff Layton,
	Christoph Hellwig, Chris Mason
On Tue 21-01-20 13:48:45, Matthew Wilcox wrote:
> On Tue, Jan 21, 2020 at 12:36:27PM +0100, Jan Kara wrote:
> > > v2: Chris asked me to show what this would look like if we just have
> > > the implementation look up the pages in the page cache, and I managed
> > > to figure out some things I'd done wrong last time.  It's even simpler
> > > than v1 (net 104 lines deleted).
> > 
> > I have an unfinished patch series laying around that pulls the ->readpage
> > / ->readpages API in somewhat different direction so I'd like to discuss
> > whether it's possible to solve my problem using your API. The problem I
> > have is that currently some operations such as hole punching can race with
> > ->readpage / ->readpages like:
> > 
> > CPU0						CPU1
> > fallocate(fd, FALLOC_FL_PUNCH_HOLE, off, len)
> >   filemap_write_and_wait_range()
> >   down_write(inode->i_rwsem);
> >   truncate_pagecache_range();
> > 						readahead(fd, off, len)
> > 						  creates pages in page cache
> > 						  looks up block mapping
> >   removes blocks from inode and frees them
> > 						  issues bio
> > 						    - reads stale data -
> > 						      potential security
> > 						      issue
> > 
> > Now how I wanted to address this is that I'd change the API convention for
> > ->readpage() so that we call it with the page unlocked and the function
> > would lock the page, check it's still OK, and do what it needs. And this
> > will allow ->readpage() and also ->readpages() to grab lock
> > (EXT4_I(inode)->i_mmap_sem in case of ext4) to synchronize with hole punching
> > while we are adding pages to page cache and mapping underlying blocks.
> > 
> > Now your API makes even ->readpages() (actually ->readahead) called with
> > pages locked so that makes this approach problematic because of lock
> > inversions. So I'd prefer if we could keep the situation that ->readpages /
> > ->readahead gets called without any pages in page cache locked...
> 
> I'm not a huge fan of that approach because it increases the number of
> atomic ops (right now, we __SetPageLocked on the page before adding it
> to i_pages).
Yeah, good point. The per-page cost of locking may be noticeable.
> Holepunch is a rather rare operation while readpage and
> readpages/readahead are extremely common, so can we make holepunch take a
> lock that will prevent new readpage(s) succeeding?
I'm not opposed - in fact my solution would do exactly that (with
EXT4_I(inode)->i_mmap_sem), just the lock ordering wrt page lock and
mmap_sem is causing troubles and that's why I need the change in the
API for readpage and friends.
> I have an idea to move the lock entries from DAX to being a generic page
> cache concept.  That way, holepunch could insert lock entries into the
> pagecache to cover the range being punched, and readpage(s) would either
> skip lock entries or block on them.
Two notes on the entry locks:
The additional traffic in the xarray creating locked entries and then
removing them is going to cost as well. But if that's only for hole
punching, it would be bearable I guess.
This does not solve the problem with the lock ordering. There are quite
some constraints on this synchronization scheme. Generally we want to
prevent creation of pages in the page cache in a certain range. That means
we need to block read(2), readahead(2), madvise(2) MADV_WILLNEED, page
faults.  The page faults constrain us that the lock has to rank below
mmap_sem. On the other hand hole punching needs to hold the lock while
evicting pages so that mandates that the lock needs to rank above page
lock. Also note that read(2) can cause page faults (to copy data to
userspace) so to avoid lock inversion against mmap_sem, any protection must
not cover that part of the read path which basically leaves us with
->readpage()/->readpages() as the only place where we can grab the lock.
Which nicely covers also all the other places creating pages in the page
cache we need to block. Except that ->readpage has the unfortunate property
of being called under page lock. But maybe we could create a new hook
somewhere in the paths creating pages to acquire the lock early. But so far
I don't have an idea for something nice.
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 23+ messages in thread 
- * Re: [RFC v2 0/9] Replacing the readpages a_op
  2020-01-22  9:44     ` Jan Kara
@ 2020-01-23 10:31       ` Jan Kara
  0 siblings, 0 replies; 23+ messages in thread
From: Jan Kara @ 2020-01-23 10:31 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, linux-xfs, linux-fsdevel, linux-mm, Jeff Layton,
	Christoph Hellwig, Chris Mason
On Wed 22-01-20 10:44:14, Jan Kara wrote:
> On Tue 21-01-20 13:48:45, Matthew Wilcox wrote:
> > On Tue, Jan 21, 2020 at 12:36:27PM +0100, Jan Kara wrote:
> > > > v2: Chris asked me to show what this would look like if we just have
> > > > the implementation look up the pages in the page cache, and I managed
> > > > to figure out some things I'd done wrong last time.  It's even simpler
> > > > than v1 (net 104 lines deleted).
> > > 
> > > I have an unfinished patch series laying around that pulls the ->readpage
> > > / ->readpages API in somewhat different direction so I'd like to discuss
> > > whether it's possible to solve my problem using your API. The problem I
> > > have is that currently some operations such as hole punching can race with
> > > ->readpage / ->readpages like:
> > > 
> > > CPU0						CPU1
> > > fallocate(fd, FALLOC_FL_PUNCH_HOLE, off, len)
> > >   filemap_write_and_wait_range()
> > >   down_write(inode->i_rwsem);
> > >   truncate_pagecache_range();
> > > 						readahead(fd, off, len)
> > > 						  creates pages in page cache
> > > 						  looks up block mapping
> > >   removes blocks from inode and frees them
> > > 						  issues bio
> > > 						    - reads stale data -
> > > 						      potential security
> > > 						      issue
> > > 
> > > Now how I wanted to address this is that I'd change the API convention for
> > > ->readpage() so that we call it with the page unlocked and the function
> > > would lock the page, check it's still OK, and do what it needs. And this
> > > will allow ->readpage() and also ->readpages() to grab lock
> > > (EXT4_I(inode)->i_mmap_sem in case of ext4) to synchronize with hole punching
> > > while we are adding pages to page cache and mapping underlying blocks.
> > > 
> > > Now your API makes even ->readpages() (actually ->readahead) called with
> > > pages locked so that makes this approach problematic because of lock
> > > inversions. So I'd prefer if we could keep the situation that ->readpages /
> > > ->readahead gets called without any pages in page cache locked...
> > 
> > I'm not a huge fan of that approach because it increases the number of
> > atomic ops (right now, we __SetPageLocked on the page before adding it
> > to i_pages).
> 
> Yeah, good point. The per-page cost of locking may be noticeable.
Thinking about this a bit more, we should be using ->readpages() to fill
most of the data. And for ->readpages() there would be no additional
overhead. Just for ->readpage() which should be rarely needed. We just need
to come up with a good solution for filesystems that have ->readpage but
not ->readpages.
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 23+ messages in thread 
 
- * Re: [RFC v2 0/9] Replacing the readpages a_op
  2020-01-21 21:48   ` Matthew Wilcox
  2020-01-22  9:44     ` Jan Kara
@ 2020-01-22 23:47     ` Dave Chinner
  2020-01-23 10:21       ` Jan Kara
  1 sibling, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2020-01-22 23:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, linux-xfs, linux-fsdevel, linux-mm, Jeff Layton,
	Christoph Hellwig, Chris Mason
On Tue, Jan 21, 2020 at 01:48:45PM -0800, Matthew Wilcox wrote:
> On Tue, Jan 21, 2020 at 12:36:27PM +0100, Jan Kara wrote:
> > > v2: Chris asked me to show what this would look like if we just have
> > > the implementation look up the pages in the page cache, and I managed
> > > to figure out some things I'd done wrong last time.  It's even simpler
> > > than v1 (net 104 lines deleted).
> > 
> > I have an unfinished patch series laying around that pulls the ->readpage
> > / ->readpages API in somewhat different direction so I'd like to discuss
> > whether it's possible to solve my problem using your API. The problem I
> > have is that currently some operations such as hole punching can race with
> > ->readpage / ->readpages like:
> > 
> > CPU0						CPU1
> > fallocate(fd, FALLOC_FL_PUNCH_HOLE, off, len)
> >   filemap_write_and_wait_range()
> >   down_write(inode->i_rwsem);
> >   truncate_pagecache_range();
shouldn't fallocate be holding EXT4_I(inode)->i_mmap_sem before it
truncates the page cache? Otherwise it's not serialised against
page faults. Looks at code ... oh, it does hold the i_mmap_sem in
write mode, so....
> > 						readahead(fd, off, len)
> > 						  creates pages in page cache
> > 						  looks up block mapping
> >   removes blocks from inode and frees them
> > 						  issues bio
> > 						    - reads stale data -
> > 						      potential security
> > 						      issue
.... I'm not sure that this race condition should exist anymore
as readahead should not run until the filesystem drops it's inode
and mmap locks after the entire extent freeing operation is
complete...
> > Now how I wanted to address this is that I'd change the API convention for
> > ->readpage() so that we call it with the page unlocked and the function
> > would lock the page, check it's still OK, and do what it needs. And this
> > will allow ->readpage() and also ->readpages() to grab lock
> > (EXT4_I(inode)->i_mmap_sem in case of ext4) to synchronize with hole punching
> > while we are adding pages to page cache and mapping underlying blocks.
> > 
> > Now your API makes even ->readpages() (actually ->readahead) called with
> > pages locked so that makes this approach problematic because of lock
> > inversions. So I'd prefer if we could keep the situation that ->readpages /
> > ->readahead gets called without any pages in page cache locked...
> 
> I'm not a huge fan of that approach because it increases the number of
> atomic ops (right now, we __SetPageLocked on the page before adding it
> to i_pages).  Holepunch is a rather rare operation while readpage and
> readpages/readahead are extremely common, so can we make holepunch take
> a lock that will prevent new readpage(s) succeeding?
> 
> I have an idea to move the lock entries from DAX to being a generic page
> cache concept.  That way, holepunch could insert lock entries into the
> pagecache to cover the range being punched, and readpage(s) would either
> skip lock entries or block on them.
> 
> Maybe there's a better approach though.
Can we step back for a moment and look at how we already serialise
readahead against truncate/hole punch? While the readahead code
itself doesn't serialise against truncate, in all cases we should be
running through the filesystem at a higher layer and provides the
truncate/holepunch serialisation before we get to the readahead
code.
The read() syscall IO path:
  read()
    ->read_iter()
      filesystem takes truncate serialisation lock
      generic_file_read_iter()
        generic_file_buffered_read()
	  page_cache_sync_readahead()
	    ....
	  page_cache_async_readahead()
	    ....
      .....
      filesystem drops truncate serialisation lock
The page fault IO path:
page fault
  ->fault
    xfs_vm_filemap_fault
      filesystem takes mmap truncate serialisation lock
	filemap_fault
	  do_async_mmap_readahead
	    page_cache_async_readahead
	  ....
	  do_sync_mmap_readahead
	    page_cache_sync_readahead
	.....
      filesystem drops mmap truncate serialisation lock
Then there is fadvise(WILLNEED) which calls
force_page_cache_readahead() directly. We now have ->fadvise for
filesystems to add locking around this:
fadvise64
  vfs_fadvise
    ->fadvise
      xfs_file_fadvise
        filesystem takes truncate serialisation lock
	generic_fadvise()
        filesystem drops truncate serialisation lock
XFS and overlay both have ->fadvise methods to provide this
serialisation of readahead, but ext4 does not so users could trigger
the stated race condition through fadvise(WILLNEED) or readahead()
syscalls.
Hence, AFAICT, ext4 only needs to implement ->fadvise and the stated
readahead vs truncate_pagecache_range() race condition goes away.
Unless, of course, I've missed some other entry point into the
readahead code that does not vector through the filesystem first.
What have I missed?
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 23+ messages in thread
- * Re: [RFC v2 0/9] Replacing the readpages a_op
  2020-01-22 23:47     ` Dave Chinner
@ 2020-01-23 10:21       ` Jan Kara
  2020-01-23 22:29         ` Dave Chinner
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2020-01-23 10:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Jan Kara, linux-xfs, linux-fsdevel, linux-mm,
	Jeff Layton, Christoph Hellwig, Chris Mason
On Thu 23-01-20 10:47:40, Dave Chinner wrote:
> On Tue, Jan 21, 2020 at 01:48:45PM -0800, Matthew Wilcox wrote:
> > On Tue, Jan 21, 2020 at 12:36:27PM +0100, Jan Kara wrote:
> > > > v2: Chris asked me to show what this would look like if we just have
> > > > the implementation look up the pages in the page cache, and I managed
> > > > to figure out some things I'd done wrong last time.  It's even simpler
> > > > than v1 (net 104 lines deleted).
> > > 
> > > I have an unfinished patch series laying around that pulls the ->readpage
> > > / ->readpages API in somewhat different direction so I'd like to discuss
> > > whether it's possible to solve my problem using your API. The problem I
> > > have is that currently some operations such as hole punching can race with
> > > ->readpage / ->readpages like:
> > > 
> > > CPU0						CPU1
> > > fallocate(fd, FALLOC_FL_PUNCH_HOLE, off, len)
> > >   filemap_write_and_wait_range()
> > >   down_write(inode->i_rwsem);
> > >   truncate_pagecache_range();
> 
> shouldn't fallocate be holding EXT4_I(inode)->i_mmap_sem before it
> truncates the page cache? Otherwise it's not serialised against
> page faults. Looks at code ... oh, it does hold the i_mmap_sem in
> write mode, so....
Yes.
> > > 						readahead(fd, off, len)
> > > 						  creates pages in page cache
> > > 						  looks up block mapping
> > >   removes blocks from inode and frees them
> > > 						  issues bio
> > > 						    - reads stale data -
> > > 						      potential security
> > > 						      issue
> 
> .... I'm not sure that this race condition should exist anymore
> as readahead should not run until the filesystem drops it's inode
> and mmap locks after the entire extent freeing operation is
> complete...
Not for XFS but for all the other filesystems see below..
> > > Now how I wanted to address this is that I'd change the API convention for
> > > ->readpage() so that we call it with the page unlocked and the function
> > > would lock the page, check it's still OK, and do what it needs. And this
> > > will allow ->readpage() and also ->readpages() to grab lock
> > > (EXT4_I(inode)->i_mmap_sem in case of ext4) to synchronize with hole punching
> > > while we are adding pages to page cache and mapping underlying blocks.
> > > 
> > > Now your API makes even ->readpages() (actually ->readahead) called with
> > > pages locked so that makes this approach problematic because of lock
> > > inversions. So I'd prefer if we could keep the situation that ->readpages /
> > > ->readahead gets called without any pages in page cache locked...
> > 
> > I'm not a huge fan of that approach because it increases the number of
> > atomic ops (right now, we __SetPageLocked on the page before adding it
> > to i_pages).  Holepunch is a rather rare operation while readpage and
> > readpages/readahead are extremely common, so can we make holepunch take
> > a lock that will prevent new readpage(s) succeeding?
> > 
> > I have an idea to move the lock entries from DAX to being a generic page
> > cache concept.  That way, holepunch could insert lock entries into the
> > pagecache to cover the range being punched, and readpage(s) would either
> > skip lock entries or block on them.
> > 
> > Maybe there's a better approach though.
> 
> Can we step back for a moment and look at how we already serialise
> readahead against truncate/hole punch? While the readahead code
> itself doesn't serialise against truncate, in all cases we should be
> running through the filesystem at a higher layer and provides the
> truncate/holepunch serialisation before we get to the readahead
> code.
> 
> The read() syscall IO path:
> 
>   read()
>     ->read_iter()
>       filesystem takes truncate serialisation lock
>       generic_file_read_iter()
>         generic_file_buffered_read()
> 	  page_cache_sync_readahead()
> 	    ....
> 	  page_cache_async_readahead()
> 	    ....
>       .....
>       filesystem drops truncate serialisation lock
Yes, this is the scheme XFS uses. But ext4 and other filesystems use a
scheme where read is serialized against truncate only by page locks and
i_size checks. Which works for truncate but is not enough for hole
punching. And locking read(2) and readahead(2) in all these filesystem with
i_rwsem is going to cause heavy regressions with mixed read-write workloads
and unnecessarily so because we don't need to lock reads against writes,
just against truncate or hole punching.
So I wanted to use i_mmap_sem for the serialization of the read path against
truncate. But due to lock ordering with mmap_sem and because reads do take
page faults to copy data it is not straightforward - hence my messing with
->readpage(). Now that I'm thinking about it, there's also a possibility of
introducing yet another rwsem into the inode that would rank above
mmap_sem and be used to serialize ->read_iter and ->fadvise against
truncate. But having three rwsems in the inode for serialization seems a
bit too convoluted for my taste.
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 23+ messages in thread 
- * Re: [RFC v2 0/9] Replacing the readpages a_op
  2020-01-23 10:21       ` Jan Kara
@ 2020-01-23 22:29         ` Dave Chinner
  0 siblings, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2020-01-23 22:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, linux-xfs, linux-fsdevel, linux-mm, Jeff Layton,
	Christoph Hellwig, Chris Mason
On Thu, Jan 23, 2020 at 11:21:01AM +0100, Jan Kara wrote:
> On Thu 23-01-20 10:47:40, Dave Chinner wrote:
> > On Tue, Jan 21, 2020 at 01:48:45PM -0800, Matthew Wilcox wrote:
> > > On Tue, Jan 21, 2020 at 12:36:27PM +0100, Jan Kara wrote:
> > > > > v2: Chris asked me to show what this would look like if we just have
> > > > > the implementation look up the pages in the page cache, and I managed
> > > > > to figure out some things I'd done wrong last time.  It's even simpler
> > > > > than v1 (net 104 lines deleted).
> > > > 
> > > > I have an unfinished patch series laying around that pulls the ->readpage
> > > > / ->readpages API in somewhat different direction so I'd like to discuss
> > > > whether it's possible to solve my problem using your API. The problem I
> > > > have is that currently some operations such as hole punching can race with
> > > > ->readpage / ->readpages like:
> > > > 
> > > > CPU0						CPU1
> > > > fallocate(fd, FALLOC_FL_PUNCH_HOLE, off, len)
> > > >   filemap_write_and_wait_range()
> > > >   down_write(inode->i_rwsem);
> > > >   truncate_pagecache_range();
> > 
> > shouldn't fallocate be holding EXT4_I(inode)->i_mmap_sem before it
> > truncates the page cache? Otherwise it's not serialised against
> > page faults. Looks at code ... oh, it does hold the i_mmap_sem in
> > write mode, so....
> 
> Yes.
> 
> > > > 						readahead(fd, off, len)
> > > > 						  creates pages in page cache
> > > > 						  looks up block mapping
> > > >   removes blocks from inode and frees them
> > > > 						  issues bio
> > > > 						    - reads stale data -
> > > > 						      potential security
> > > > 						      issue
> > 
> > .... I'm not sure that this race condition should exist anymore
> > as readahead should not run until the filesystem drops it's inode
> > and mmap locks after the entire extent freeing operation is
> > complete...
> 
> Not for XFS but for all the other filesystems see below..
> 
> > > > Now how I wanted to address this is that I'd change the API convention for
> > > > ->readpage() so that we call it with the page unlocked and the function
> > > > would lock the page, check it's still OK, and do what it needs. And this
> > > > will allow ->readpage() and also ->readpages() to grab lock
> > > > (EXT4_I(inode)->i_mmap_sem in case of ext4) to synchronize with hole punching
> > > > while we are adding pages to page cache and mapping underlying blocks.
> > > > 
> > > > Now your API makes even ->readpages() (actually ->readahead) called with
> > > > pages locked so that makes this approach problematic because of lock
> > > > inversions. So I'd prefer if we could keep the situation that ->readpages /
> > > > ->readahead gets called without any pages in page cache locked...
> > > 
> > > I'm not a huge fan of that approach because it increases the number of
> > > atomic ops (right now, we __SetPageLocked on the page before adding it
> > > to i_pages).  Holepunch is a rather rare operation while readpage and
> > > readpages/readahead are extremely common, so can we make holepunch take
> > > a lock that will prevent new readpage(s) succeeding?
> > > 
> > > I have an idea to move the lock entries from DAX to being a generic page
> > > cache concept.  That way, holepunch could insert lock entries into the
> > > pagecache to cover the range being punched, and readpage(s) would either
> > > skip lock entries or block on them.
> > > 
> > > Maybe there's a better approach though.
> > 
> > Can we step back for a moment and look at how we already serialise
> > readahead against truncate/hole punch? While the readahead code
> > itself doesn't serialise against truncate, in all cases we should be
> > running through the filesystem at a higher layer and provides the
> > truncate/holepunch serialisation before we get to the readahead
> > code.
> > 
> > The read() syscall IO path:
> > 
> >   read()
> >     ->read_iter()
> >       filesystem takes truncate serialisation lock
> >       generic_file_read_iter()
> >         generic_file_buffered_read()
> > 	  page_cache_sync_readahead()
> > 	    ....
> > 	  page_cache_async_readahead()
> > 	    ....
> >       .....
> >       filesystem drops truncate serialisation lock
> 
> Yes, this is the scheme XFS uses. But ext4 and other filesystems use a
> scheme where read is serialized against truncate only by page locks and
> i_size checks.
Right, which I've characterised in the past as a "performance over
correctness" hack that has resulted in infrastructure that can only
protect against truncate and not general page invalidation
operations. It's also been the source of many, many truncate bugs
over the past 20 years and has never worked for hole punch, yet....
> Which works for truncate but is not enough for hole
> punching. And locking read(2) and readahead(2) in all these filesystem with
> i_rwsem is going to cause heavy regressions with mixed read-write workloads
> and unnecessarily so because we don't need to lock reads against writes,
> just against truncate or hole punching.
.... performance over correctness is still used as the justification
for keeping this ancient, troublesome architecture in place even
though it cannot support the requirements of modern filesystems.
Indeed, this problem had to be fixed with DAX, and the ext4 DAX read
path uses shared inode locks. And now the ext4 DIO path uses shared
inode locking, too. Only the ext4 buffered read path does not use
shared read locks to avoid this race condition.
I don't have a "maintain the existing mixed rw performance" solution
for you here, except to say that I'm working (slowly) towards fixing
this problem on XFS with range locking for IO instead of using the
i_rwsem. That fixes so many other concurrency issues, too, such as
allowing fallocate() and ftruncate() to run concurrently with
non-overlapping IO to the same file. IOWs, moving IO locking away
from the page cache will allow much greater concurrency and
performance for a much wider range of filesystem operations than
trying to screw around with individual page locks.
> So I wanted to use i_mmap_sem for the serialization of the read path against
> truncate. But due to lock ordering with mmap_sem and because reads do take
> page faults to copy data it is not straightforward - hence my messing with
> ->readpage().
Another manifestation of the age old "can't take the same lock in
the syscall IO path and the page fault IO path" problem. We've
previously looked hacking about readpages in XFS, too, but it was
too complex, didn't improve performance, and in general such changes
were heading in the opposite direction we have been moving the IO
path infrastructure towards with iomap.
i.e. improving buffered IO efficiency requires use to reduce the
amount of work we do per page, not increase it....
> Now that I'm thinking about it, there's also a possibility of
> introducing yet another rwsem into the inode that would rank above
> mmap_sem and be used to serialize ->read_iter and ->fadvise against
> truncate. But having three rwsems in the inode for serialization seems a
> bit too convoluted for my taste.
Yup, that's called "falling off the locking cliff". :/
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 23+ messages in thread