* [PATCH v5 0/3] block: add larger order folio instead of pages
From: Kundan Kumar @ 2024-06-19  2:34 UTC
To: axboe, hch, willy, kbusch
Cc: linux-block, joshi.k, mcgrof, anuj20.g, nj.shetty, c.gameti,
    gost.dev, Kundan Kumar

User space memory is mapped into the kernel as an array of pages. These
pages are iterated over and added to the BIO; in the process they are
also checked for contiguity and merged. When mTHP is enabled, the pages
generally belong to a larger order folio.

This patch series enables adding large folios to a bio. It fetches the
folio for each page in the page array. The page might start at an offset
within the folio, which can be a multiple of PAGE_SIZE, and subsequent
pages in the array might belong to the same folio. Using the folio size,
the folio offset, and the remaining I/O size, it determines the length
within the folio that can be added to the bio. It then checks whether
the pages are contiguous and belong to the same folio; if so, further
processing of those contiguous pages is skipped. This scheme reduces the
overhead of iterating through pages.

perf diff before and after this change:

Perf diff for write I/O with 128K block size:
     1.16%     -0.18%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages
     1.72%             [kernel.kallsyms]  [k] bvec_try_merge_page

Perf diff for read I/O with 128K block size:
     3.76%     -1.25%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages
     4.96%             [kernel.kallsyms]  [k] bvec_try_merge_page

Patch 1: Adds a folio-lized version of bio_add_hw_page().
Patch 2: Adds changes to add larger order folios to the BIO.
Patch 3: If a large folio gets added, the subsequent pages of the folio
         are released. This helps avoid calculations at I/O completion.

Changes since v4:
- folio-lize bio_add_hw_page() to bio_add_hw_folio()
- make bio_add_hw_page() a wrapper around bio_add_hw_folio()
- add the new functions bio_release_folio() and unpin_user_folio()
- add a helper function to check for contiguous pages of a folio
- change &folio->page to folio_page(folio, 0)
- reword comments

Changes since v3:
- Added a check that pages are contiguous and belong to the same folio;
  if not, avoid skipping the pages. (Suggested by Matthew Wilcox)

Changes since v2:
- Made separate patches
- Corrected code as per kernel coding style
- Removed the size_folio variable

Changes since v1:
- Changed the functions bio_iov_add_page() and
  bio_iov_add_zone_append_page() to accept a folio
- Removed a branch and calculate folio_offset and len the same way for
  both 0-order and larger folios
- Added a change in the NVMe driver to use nvme_setup_prp_simple() by
  ignoring multiples of PAGE_SIZE in the offset
- Added a change to unpin user pages which were added as folios, and
  stopped unpinning pages one by one from __bio_release_pages()
  (Suggested by Keith)

Kundan Kumar (3):
  block: Added folio-lized version of bio_add_hw_page()
  block: add folio awareness instead of looping through pages
  block: unpin user pages belonging to a folio

 block/bio.c        | 123 +++++++++++++++++++++++++++++++++------------
 block/blk.h        |  11 ++++
 include/linux/mm.h |   1 +
 mm/gup.c           |  13 +++++
 4 files changed, 116 insertions(+), 32 deletions(-)

-- 
2.25.1
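To make the scheme concrete, here is a condensed sketch of the hot loop
as patch 2 below reshapes it (__bio_iov_iter_get_pages()); error
handling and the zone-append path are omitted, and the helpers are the
ones the patch itself introduces:

	/*
	 * pages[] was filled by iov_iter_extract_pages(); offset is the
	 * byte offset into pages[0], size the number of bytes extracted.
	 */
	for (left = size, i = 0; left > 0; left -= len, i++) {
		struct folio *folio = page_folio(pages[i]);

		/* byte offset of this page inside its folio */
		folio_offset = (folio_page_idx(folio, pages[i]) << PAGE_SHIFT) +
			       offset;

		/* largest run that still fits inside this folio */
		len = min_t(size_t, folio_size(folio) - folio_offset, left);

		/* how many pages[] entries that run spans */
		num_pages = DIV_ROUND_UP(offset + len, PAGE_SIZE);
		if (num_pages > 1)	/* COW may have broken contiguity */
			len = get_contig_folio_len(&num_pages, pages, i,
						   folio, left, offset);

		/* one bvec for the whole run instead of one per page */
		bio_iov_add_folio(bio, folio, len, folio_offset);

		i += num_pages - 1;	/* skip the entries already covered */
		offset = 0;
	}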
* [PATCH v5 1/3] block: Added folio-lized version of bio_add_hw_page()
From: Kundan Kumar @ 2024-06-19  2:34 UTC
To: axboe, hch, willy, kbusch
Cc: linux-block, joshi.k, mcgrof, anuj20.g, nj.shetty, c.gameti,
    gost.dev, Kundan Kumar

Add a new bio_add_hw_folio() function. This is a prep patch.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
---
 block/bio.c | 38 +++++++++++++++++++++++++++++---------
 block/blk.h |  4 ++++
 2 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index e9e809a63c59..c8914febb16e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -964,7 +964,7 @@ bool bvec_try_merge_hw_page(struct request_queue *q, struct bio_vec *bv,
 }
 
 /**
- * bio_add_hw_page - attempt to add a page to a bio with hw constraints
+ * bio_add_hw_page - a wrapper around function bio_add_hw_folio
  * @q: the target queue
  * @bio: destination bio
  * @page: page to add
@@ -972,13 +972,35 @@ bool bvec_try_merge_hw_page(struct request_queue *q, struct bio_vec *bv,
  * @offset: vec entry offset
  * @max_sectors: maximum number of sectors that can be added
  * @same_page: return if the segment has been merged inside the same page
- *
- * Add a page to a bio while respecting the hardware max_sectors, max_segment
- * and gap limitations.
  */
 int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
 		unsigned int max_sectors, bool *same_page)
+{
+	struct folio *folio = page_folio(page);
+	size_t folio_offset = (folio_page_idx(folio, page) << PAGE_SHIFT) +
+			      offset;
+
+	return bio_add_hw_folio(q, bio, folio, len, folio_offset, max_sectors,
+				same_page);
+}
+
+/**
+ * bio_add_hw_folio - attempt to add a folio to a bio with hw constraints
+ * @q: the target queue
+ * @bio: destination bio
+ * @folio: folio to add
+ * @len: vec entry length
+ * @offset: vec entry offset in the folio
+ * @max_sectors: maximum number of sectors that can be added
+ * @same_page: return if the segment has been merged inside the same page
+ *
+ * Add a folio to a bio while respecting the hardware max_sectors, max_segment
+ * and gap limitations.
+ */
+int bio_add_hw_folio(struct request_queue *q, struct bio *bio,
+		struct folio *folio, unsigned int len, unsigned int offset,
+		unsigned int max_sectors, bool *same_page)
 {
 	unsigned int max_size = max_sectors << SECTOR_SHIFT;
 
@@ -992,8 +1014,8 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 	if (bio->bi_vcnt > 0) {
 		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
 
-		if (bvec_try_merge_hw_page(q, bv, page, len, offset,
-				same_page)) {
+		if (bvec_try_merge_hw_page(q, bv, folio_page(folio, 0), len,
+				offset, same_page)) {
 			bio->bi_iter.bi_size += len;
 			return len;
 		}
@@ -1010,9 +1032,7 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 		return 0;
 	}
 
-	bvec_set_page(&bio->bi_io_vec[bio->bi_vcnt], page, len, offset);
-	bio->bi_vcnt++;
-	bio->bi_iter.bi_size += len;
+	bio_add_folio_nofail(bio, folio, len, offset);
 	return len;
 }
 
diff --git a/block/blk.h b/block/blk.h
index 79e8d5d4fe0c..d0bec44a2ffb 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -524,6 +524,10 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
 		unsigned int max_sectors, bool *same_page);
 
+int bio_add_hw_folio(struct request_queue *q, struct bio *bio,
+		struct folio *folio, unsigned int len, unsigned int offset,
+		unsigned int max_sectors, bool *same_page);
+
 /*
  * Clean up a page appropriately, where the page may be pinned, may have a
  * ref taken on it or neither.
-- 
2.25.1
* Re: [PATCH v5 1/3] block: Added folio-lized version of bio_add_hw_page()
From: Matthew Wilcox @ 2024-06-20  3:47 UTC
To: Kundan Kumar
Cc: axboe, hch, kbusch, linux-block, joshi.k, mcgrof, anuj20.g,
    nj.shetty, c.gameti, gost.dev

On Wed, Jun 19, 2024 at 08:04:18AM +0530, Kundan Kumar wrote:
>  /**
> - * bio_add_hw_page - attempt to add a page to a bio with hw constraints
> + * bio_add_hw_page - a wrapper around function bio_add_hw_folio

No.  You haven't changed what this function does, merely how it is
implemented.  The reader of the API documentation doesn't care about
the implementation, they just need to know what it does.

>   * @q: the target queue
>   * @bio: destination bio
>   * @page: page to add
> @@ -972,13 +972,35 @@ bool bvec_try_merge_hw_page(struct request_queue *q, struct bio_vec *bv,
>   * @offset: vec entry offset
>   * @max_sectors: maximum number of sectors that can be added
>   * @same_page: return if the segment has been merged inside the same page
> - *
> - * Add a page to a bio while respecting the hardware max_sectors, max_segment
> - * and gap limitations.

Likewise.

> +/**
> + * bio_add_hw_folio - attempt to add a folio to a bio with hw constraints
> + * @q: the target queue
> + * @bio: destination bio
> + * @folio: folio to add
> + * @len: vec entry length
> + * @offset: vec entry offset in the folio
> + * @max_sectors: maximum number of sectors that can be added
> + * @same_page: return if the segment has been merged inside the same page

... page?

> + * Add a folio to a bio while respecting the hardware max_sectors, max_segment
> + * and gap limitations.
> + */
> +int bio_add_hw_folio(struct request_queue *q, struct bio *bio,
> +		struct folio *folio, unsigned int len, unsigned int offset,

size_t for both of these parameters.  We're dangerously close to
overflowing unsigned int (arm64 gets to 512MB folios, which is only
3 bits from 4GB).
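For reference, a sketch of what the prototype would presumably become
with the suggested widening applied (assuming the two parameters meant
are len and offset; max_sectors stays an unsigned int since it is
bounded by the queue limits):

	int bio_add_hw_folio(struct request_queue *q, struct bio *bio,
			struct folio *folio, size_t len, size_t offset,
			unsigned int max_sectors, bool *same_page);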
* [PATCH v5 2/3] block: add folio awareness instead of looping through pages
From: Kundan Kumar @ 2024-06-19  2:34 UTC
To: axboe, hch, willy, kbusch
Cc: linux-block, joshi.k, mcgrof, anuj20.g, nj.shetty, c.gameti,
    gost.dev, Kundan Kumar

Add a bigger size from a folio to the bio and skip merge processing
for its pages.

Fetch the offset of the page within its folio. Based on the folio size
and the folio offset, compute a larger length; this length may span
multiple contiguous pages if the folio is multi-order.

Using this length, calculate the number of pages that will be added to
the bio and advance the loop counter to skip those pages.

A helper function checks whether the pages are contiguous and belong to
the same folio; this is needed because a COW may occur and break the
contiguous mapping of the folio's pages.

This technique avoids the overhead of merging pages that belong to the
same large order folio.

Also folio-lize the functions bio_iov_add_page() and
bio_iov_add_zone_append_page().

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
---
 block/bio.c | 72 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 58 insertions(+), 14 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index c8914febb16e..3e75b5b0eb6e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1224,7 +1224,7 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
 	bio_set_flag(bio, BIO_CLONED);
 }
 
-static int bio_iov_add_page(struct bio *bio, struct page *page,
+static int bio_iov_add_folio(struct bio *bio, struct folio *folio,
 		unsigned int len, unsigned int offset)
 {
 	bool same_page = false;
@@ -1234,30 +1234,60 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
 
 	if (bio->bi_vcnt > 0 &&
 	    bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
-				page, len, offset, &same_page)) {
+				folio_page(folio, 0), len, offset,
+				&same_page)) {
 		bio->bi_iter.bi_size += len;
 		if (same_page)
-			bio_release_page(bio, page);
+			bio_release_page(bio, folio_page(folio, 0));
 		return 0;
 	}
-	__bio_add_page(bio, page, len, offset);
+	bio_add_folio_nofail(bio, folio, len, offset);
 	return 0;
 }
 
-static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
+static int bio_iov_add_zone_append_folio(struct bio *bio, struct folio *folio,
 		unsigned int len, unsigned int offset)
 {
 	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 	bool same_page = false;
 
-	if (bio_add_hw_page(q, bio, page, len, offset,
+	if (bio_add_hw_folio(q, bio, folio, len, offset,
 			queue_max_zone_append_sectors(q), &same_page) != len)
 		return -EINVAL;
 	if (same_page)
-		bio_release_page(bio, page);
+		bio_release_page(bio, folio_page(folio, 0));
 	return 0;
 }
 
+static unsigned int get_contig_folio_len(int *num_pages, struct page **pages,
+					 int i, struct folio *folio,
+					 ssize_t left, size_t offset)
+{
+	ssize_t bytes = left;
+	size_t contig_sz = min_t(size_t, PAGE_SIZE - offset, bytes);
+	unsigned int j;
+
+	/*
+	 * We might COW a single page in the middle of
+	 * a large folio, so we have to check that all
+	 * pages belong to the same folio.
+	 */
+	bytes -= contig_sz;
+	for (j = i + 1; j < i + *num_pages; j++) {
+		size_t next = min_t(size_t, PAGE_SIZE, bytes);
+
+		if (page_folio(pages[j]) != folio ||
+		    pages[j] != pages[j - 1] + 1) {
+			break;
+		}
+		contig_sz += next;
+		bytes -= next;
+	}
+	*num_pages = j - i;
+
+	return contig_sz;
+}
+
 #define PAGE_PTRS_PER_BVEC     (sizeof(struct bio_vec) / sizeof(struct page *))
 
 /**
@@ -1277,9 +1307,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
 	struct page **pages = (struct page **)bv;
-	ssize_t size, left;
-	unsigned len, i = 0;
-	size_t offset;
+	ssize_t size, left, len;
+	unsigned int i = 0, num_pages;
+	size_t offset, folio_offset;
 	int ret = 0;
 
 	/*
@@ -1321,15 +1351,29 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	for (left = size, i = 0; left > 0; left -= len, i++) {
 		struct page *page = pages[i];
+		struct folio *folio = page_folio(page);
+
+		folio_offset = (folio_page_idx(folio, page) << PAGE_SHIFT) +
+			       offset;
+
+		len = min_t(size_t, (folio_size(folio) - folio_offset), left);
+
+		num_pages = DIV_ROUND_UP(offset + len, PAGE_SIZE);
+
+		if (num_pages > 1)
+			len = get_contig_folio_len(&num_pages, pages, i,
+						   folio, left, offset);
 
-		len = min_t(size_t, PAGE_SIZE - offset, left);
 		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
-			ret = bio_iov_add_zone_append_page(bio, page, len,
-					offset);
+			ret = bio_iov_add_zone_append_folio(bio, folio, len,
+					folio_offset);
 			if (ret)
 				break;
 		} else
-			bio_iov_add_page(bio, page, len, offset);
+			bio_iov_add_folio(bio, folio, len, folio_offset);
+
+		/* Skip the pages which got added */
+		i = i + (num_pages - 1);
 
 		offset = 0;
 	}
-- 
2.25.1
* Re: [PATCH v5 2/3] block: add folio awareness instead of looping through pages
From: Hannes Reinecke @ 2024-06-19  7:47 UTC
To: Kundan Kumar, axboe, hch, willy, kbusch
Cc: linux-block, joshi.k, mcgrof, anuj20.g, nj.shetty, c.gameti, gost.dev

On 6/19/24 04:34, Kundan Kumar wrote:
> Add a bigger size from a folio to the bio and skip merge processing
> for its pages.
[ .. ]
> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
[ .. ]

Well. The issue here is that bvecs really only use the 'struct page'
entry as an address to the data; the page size itself is completely
immaterial. So from that perspective it doesn't really matter whether
we use 'struct page' or 'struct folio' to get to that address.
However, what matters is whether we _iterate_ over pages or folios.

The current workflow is to first allocate an array of pages, call one
of the _get_pages() variants, and then iterate over all pages.
What we should be doing is to add _get_folios() variants, working on
folio batches and not pre-allocated arrays. Then we could iterate over
all folios in the batch, and modify the 'XXX_get_pages()' variants to
extract pages from the folio batch. And then gradually move all
callers over to working on folio batches.

Tall order, but I fear this is the best way going forward.
Matthew? Christoph? Is that what you had in mind?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH v5 2/3] block: add folio awareness instead of looping through pages
From: Kundan Kumar @ 2024-06-20  4:48 UTC
To: Hannes Reinecke
Cc: axboe, hch, willy, kbusch, linux-block, joshi.k, mcgrof, anuj20.g,
    nj.shetty, c.gameti, gost.dev

On 19/06/24 09:47AM, Hannes Reinecke wrote:
[ .. ]
> The current workflow is to first allocate an array of pages, call one
> of the _get_pages() variants, and then iterate over all pages.
> What we should be doing is to add _get_folios() variants, working on
> folio batches and not pre-allocated arrays.

The XXX_get_pages() functions do a page table walk and fill the pages
array with the pages corresponding to a user space address. The
_get_folios() variants would have to return a folio_vec rather than a
folio array, as every folio entry needs a folio_offset and a len. If we
convert to the _get_folios() variants, the number of folios may be
smaller than the number of pages, but we will still need to allocate a
folio_vec array as a replacement for the pages array.

Am I missing something?

Down in the page table walk (gup_fast_pte_range), we fill the pages
array: addr -> pte -> page. This would have to be modified to fill a
folio_vec array. The page table walk also deals with huge pages, so it
looks like the huge page functions would have to be modified to fill
the folio_vec array as well. The gup slow path would also need a
modification to fill the folio_vec array.

--
Kundan
* Re: [PATCH v5 2/3] block: add folio awareness instead of looping through pages
From: Hannes Reinecke @ 2024-06-20  7:21 UTC
To: Kundan Kumar
Cc: axboe, hch, willy, kbusch, linux-block, joshi.k, mcgrof, anuj20.g,
    nj.shetty, c.gameti, gost.dev

On 6/20/24 06:48, Kundan Kumar wrote:
> On 19/06/24 09:47AM, Hannes Reinecke wrote:
[ .. ]
> The XXX_get_pages() functions do a page table walk and fill the pages
> array with the pages corresponding to a user space address. The
> _get_folios() variants would have to return a folio_vec rather than a
> folio array, as every folio entry needs a folio_offset and a len. If
> we convert to the _get_folios() variants, the number of folios may be
> smaller than the number of pages, but we will still need to allocate
> a folio_vec array as a replacement for the pages array.

Well, actually I was thinking of using 'struct folio_batch' instead of
an array. There is an entire set of helpers here (pagevec.h) for
precisely this purpose. But yes, we will end up with fewer folios than
pages (which was kinda the idea).

> Am I missing something?
>
> Down in the page table walk (gup_fast_pte_range), we fill the pages
> array: addr -> pte -> page. This would have to be modified to fill a
> folio_vec array. The page table walk also deals with huge pages, so
> it looks like the huge page functions would have to be modified to
> fill the folio_vec array as well. The gup slow path would also need a
> modification to fill the folio_vec array.

Yes. I did say it's a tall order. But it would have the advantage of
assembling large folios right at the start, so the lower levels (ie
those consuming the folio batch) would not need to painstakingly
re-assemble folios from a list of pages.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
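To illustrate the direction being discussed, a purely hypothetical
sketch of a folio_batch-based consumer. iov_iter_extract_folios() does
not exist at this point, and, as Kundan notes above, a real interface
would also have to carry a per-folio offset and length; everything
below assumes whole folios for simplicity:

	struct folio_batch fbatch;
	unsigned int i;

	folio_batch_init(&fbatch);

	/* hypothetical helper: pin user memory, collect whole folios */
	size = iov_iter_extract_folios(iter, &fbatch, max_len, &offset);

	for (i = 0; i < folio_batch_count(&fbatch); i++) {
		struct folio *folio = fbatch.folios[i];

		/* one bvec per folio, no per-page merging needed */
		bio_add_folio_nofail(bio, folio, folio_size(folio), 0);
	}
	/* pins would be dropped at I/O completion, not here */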
* [PATCH v5 3/3] block: unpin user pages belonging to a folio
From: Kundan Kumar @ 2024-06-19  2:34 UTC
To: axboe, hch, willy, kbusch
Cc: linux-block, joshi.k, mcgrof, anuj20.g, nj.shetty, c.gameti,
    gost.dev, Kundan Kumar

Unpin pages which belong to the same folio together. This enables us to
release folios on I/O completion rather than looping through pages.
Introduce a function, bio_release_folio(), that helps put refs by the
npages count.

Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
---
 block/bio.c        | 13 ++++---------
 block/blk.h        |  7 +++++++
 include/linux/mm.h |  1 +
 mm/gup.c           | 13 +++++++++++++
 4 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 3e75b5b0eb6e..68f6de0b0a08 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1186,20 +1186,12 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
 	struct folio_iter fi;
 
 	bio_for_each_folio_all(fi, bio) {
-		struct page *page;
-		size_t nr_pages;
-
 		if (mark_dirty) {
 			folio_lock(fi.folio);
 			folio_mark_dirty(fi.folio);
 			folio_unlock(fi.folio);
 		}
-		page = folio_page(fi.folio, fi.offset / PAGE_SIZE);
-		nr_pages = (fi.offset + fi.length - 1) / PAGE_SIZE -
-			   fi.offset / PAGE_SIZE + 1;
-		do {
-			bio_release_page(bio, page++);
-		} while (--nr_pages != 0);
+		bio_release_folio(bio, fi.folio, 1);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1372,6 +1364,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		} else
 			bio_iov_add_folio(bio, folio, len, folio_offset);
 
+		if (num_pages > 1)
+			bio_release_folio(bio, folio, num_pages - 1);
+
 		/* Skip the pages which got added */
 		i = i + (num_pages - 1);
 
diff --git a/block/blk.h b/block/blk.h
index d0bec44a2ffb..a657282c0e4a 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -538,6 +538,13 @@ static inline void bio_release_page(struct bio *bio, struct page *page)
 		unpin_user_page(page);
 }
 
+static inline void bio_release_folio(struct bio *bio, struct folio *folio,
+				     unsigned long npages)
+{
+	if (bio_flagged(bio, BIO_PAGE_PINNED))
+		unpin_user_folio(folio, npages);
+}
+
 struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id);
 
 int disk_scan_partitions(struct gendisk *disk, blk_mode_t mode);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9849dfda44d4..b902c6c39e2b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1618,6 +1618,7 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
 				      bool make_dirty);
 void unpin_user_pages(struct page **pages, unsigned long npages);
+void unpin_user_folio(struct folio *folio, unsigned long npages);
 
 static inline bool is_cow_mapping(vm_flags_t flags)
 {
diff --git a/mm/gup.c b/mm/gup.c
index ca0f5cedce9b..bc96efa43d1b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -488,6 +488,19 @@ void unpin_user_pages(struct page **pages, unsigned long npages)
 }
 EXPORT_SYMBOL(unpin_user_pages);
 
+/**
+ * unpin_user_folio() - release pages of a folio
+ * @folio:  pointer to folio to be released
+ * @npages: number of pages of same folio
+ *
+ * Release npages of the folio
+ */
+void unpin_user_folio(struct folio *folio, unsigned long npages)
+{
+	gup_put_folio(folio, npages, FOLL_PIN);
+}
+EXPORT_SYMBOL(unpin_user_folio);
+
 /*
  * Set the MMF_HAS_PINNED if not set yet; after set it'll be there for the mm's
  * lifecycle. Avoid setting the bit unless necessary, or it might cause write
-- 
2.25.1
* Re: [PATCH v5 3/3] block: unpin user pages belonging to a folio
From: Matthew Wilcox @ 2024-06-20  3:51 UTC
To: Kundan Kumar
Cc: axboe, hch, kbusch, linux-block, joshi.k, mcgrof, anuj20.g,
    nj.shetty, c.gameti, gost.dev

On Wed, Jun 19, 2024 at 08:04:20AM +0530, Kundan Kumar wrote:
>  		if (mark_dirty) {
>  			folio_lock(fi.folio);
>  			folio_mark_dirty(fi.folio);
>  			folio_unlock(fi.folio);
>  		}
> -		page = folio_page(fi.folio, fi.offset / PAGE_SIZE);
> -		nr_pages = (fi.offset + fi.length - 1) / PAGE_SIZE -
> -			   fi.offset / PAGE_SIZE + 1;
> -		do {
> -			bio_release_page(bio, page++);
> -		} while (--nr_pages != 0);
> +		bio_release_folio(bio, fi.folio, 1);

... no, call bio_release_folio(bio, fi.folio, nr_pages);

> @@ -1372,6 +1364,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  		} else
>  			bio_iov_add_folio(bio, folio, len, folio_offset);
>  
> +		if (num_pages > 1)
> +			bio_release_folio(bio, folio, num_pages - 1);
> +

... and drop this hunk.
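With both corrections folded in, the release path would presumably read
as follows (a sketch; nr_pages is the same per-folio page count the old
loop already computed):

	bio_for_each_folio_all(fi, bio) {
		/* pages of this folio covered by the bvec */
		size_t nr_pages = (fi.offset + fi.length - 1) / PAGE_SIZE -
				  fi.offset / PAGE_SIZE + 1;

		if (mark_dirty) {
			folio_lock(fi.folio);
			folio_mark_dirty(fi.folio);
			folio_unlock(fi.folio);
		}
		/* drop all pins on this folio in one call */
		bio_release_folio(bio, fi.folio, nr_pages);
	}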