* [PATCH v3 0/3] block: add larger order folio instead of pages
       [not found] <CGME20240507145232epcas5p481986099a82b1880758b7770cdeaf2d2@epcas5p4.samsung.com>
@ 2024-05-07 14:45 ` Kundan Kumar
  2024-05-07 14:45   ` [PATCH v3 1/3] nvme: adjust multiples of NVME_CTRL_PAGE_SIZE in offset Kundan Kumar
                     ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Kundan Kumar @ 2024-05-07 14:45 UTC (permalink / raw)
  To: axboe, hch, willy
  Cc: linux-block, joshi.k, mcgrof, anuj20.g, nj.shetty, c.gameti,
	gost.dev, Kundan Kumar

User-space memory is mapped into the kernel as an array of pages. These
pages are iterated over and added to the BIO; in the process they are
also checked for contiguity and merged. When mTHP is enabled, the pages
generally belong to a larger-order folio. Pages from a larger-order
folio need not be checked for contiguity and can be added directly as
one big chunk.

This patch series enables adding a large folio to a bio. It fetches the
folio for a page in the page array. The page might start at an offset
within the folio which could be a multiple of PAGE_SIZE. Subsequent
pages in the page array might belong to the same folio. Using the folio
size, folio offset and remaining size, determine the length within the
folio that shall be added to the bio; further processing is skipped for
those pages. This reduces the overhead of iterating through pages.

perf diff before and after this change:

Perf diff for write I/O with 128K block size:
    1.26%    -1.04%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages
    1.74%            [kernel.kallsyms]  [k] bvec_try_merge_page
Perf diff for read I/O with 128K block size:
    4.40%    -3.63%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages
    5.60%            [kernel.kallsyms]  [k] bvec_try_merge_page

Patch 1: Ignores the bigger offset and helps the driver keep using PRPs
for I/Os whose length fits in one or two PRPs.
Patch 2: Adds changes to add a larger-order folio to the BIO.
Patch 3: If a large folio gets added, the subsequent pages of the folio
are released. This helps avoid calculations at I/O completion.
Previous Postings
-----------------
v1: https://lore.kernel.org/all/20240419091721.1790-1-kundan.kumar@samsung.com/
v2: https://lore.kernel.org/linux-block/33717b97-8986-4d6e-aa10-47393b810ea2@suse.de/T/#m31564a9c6ae30e32e560dd8c76d26a3290a875aa

Changes since v2:
- Made separate patches
- Corrected code as per kernel coding style
- Removed size_folio variable

Changes since v1:
- Changed functions bio_iov_add_page() and
  bio_iov_add_zone_append_page() to accept a folio
- Removed the branch and calculate folio_offset and len in the same
  fashion for both 0-order and larger folios
- Added a change in the NVMe driver to use nvme_setup_prp_simple() by
  ignoring multiples of PAGE_SIZE in the offset
- Added a change to unpin_user_pages which were added as folios. Also
  stopped unpinning pages one by one from __bio_release_pages()
  (suggested by Keith)

Kundan Kumar (3):
  nvme: adjust multiples of NVME_CTRL_PAGE_SIZE in offset
  block: add folio awareness instead of looping through pages
  block: unpin user pages belonging to a folio

 block/bio.c             | 50 +++++++++++++++++++++++------------------
 drivers/nvme/host/pci.c |  3 ++-
 2 files changed, 30 insertions(+), 23 deletions(-)

-- 
2.25.1

^ permalink raw reply	[flat|nested] 9+ messages in thread
* [PATCH v3 1/3] nvme: adjust multiples of NVME_CTRL_PAGE_SIZE in offset
  2024-05-07 14:45 ` [PATCH v3 0/3] block: add larger order folio instead of pages Kundan Kumar
@ 2024-05-07 14:45   ` Kundan Kumar
  2024-05-23  8:51     ` Christoph Hellwig
  2024-05-07 14:45   ` [PATCH v3 2/3] block: add folio awareness instead of looping through pages Kundan Kumar
  2024-05-07 14:45   ` [PATCH v3 3/3] block: unpin user pages belonging to a folio Kundan Kumar
  2 siblings, 1 reply; 9+ messages in thread
From: Kundan Kumar @ 2024-05-07 14:45 UTC (permalink / raw)
  To: axboe, hch, willy
  Cc: linux-block, joshi.k, mcgrof, anuj20.g, nj.shetty, c.gameti,
	gost.dev, Kundan Kumar

The bio_vec start offset may be relatively large, particularly when a
large folio was added to the bio. A big offset defeats the
single-segment mapping optimization and ends up taking the more
expensive mempool_alloc path. Rather than using the absolute value,
mask bv_offset by NVME_CTRL_PAGE_SIZE while checking whether the
segment fits into one or two PRP entries.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
---
 drivers/nvme/host/pci.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8e0bb9692685..c6408871466f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -778,7 +778,8 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 		struct bio_vec bv = req_bvec(req);
 
 		if (!is_pci_p2pdma_page(bv.bv_page)) {
-			if (bv.bv_offset + bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)
+			if ((bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1)) +
+			    bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)
 				return nvme_setup_prp_simple(dev, req,
 							     &cmnd->rw, &bv);
 
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: [PATCH v3 1/3] nvme: adjust multiples of NVME_CTRL_PAGE_SIZE in offset
  2024-05-07 14:45   ` [PATCH v3 1/3] nvme: adjust multiples of NVME_CTRL_PAGE_SIZE in offset Kundan Kumar
@ 2024-05-23  8:51     ` Christoph Hellwig
  2024-05-23 10:07       ` Kundan Kumar
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2024-05-23 8:51 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: axboe, hch, willy, linux-block, joshi.k, mcgrof, anuj20.g,
	nj.shetty, c.gameti, gost.dev

This needs to go to the nvme list and all nvme maintainers. Please
just resend it standalone while you're at it, as it's a small and
very useful improvement on its own:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH v3 1/3] nvme: adjust multiples of NVME_CTRL_PAGE_SIZE in offset
  2024-05-23  8:51     ` Christoph Hellwig
@ 2024-05-23 10:07       ` Kundan Kumar
  0 siblings, 0 replies; 9+ messages in thread
From: Kundan Kumar @ 2024-05-23 10:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kundan Kumar, axboe, willy, linux-block, joshi.k, mcgrof,
	anuj20.g, nj.shetty, c.gameti, gost.dev

On Thu, May 23, 2024 at 2:21 PM Christoph Hellwig <hch@lst.de> wrote:
>
> This needs to go to the nvme list and all nvme maintainers. Please
> just resend it standalone while you're at it, as it's a small and
> very usful improvement on it's own:

Sure Christoph, I will send a standalone patch for this to the nvme
list.

>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>

^ permalink raw reply	[flat|nested] 9+ messages in thread
* [PATCH v3 2/3] block: add folio awareness instead of looping through pages
  2024-05-07 14:45 ` [PATCH v3 0/3] block: add larger order folio instead of pages Kundan Kumar
  2024-05-07 14:45   ` [PATCH v3 1/3] nvme: adjust multiples of NVME_CTRL_PAGE_SIZE in offset Kundan Kumar
@ 2024-05-07 14:45   ` Kundan Kumar
  2024-05-15 20:55     ` Matthew Wilcox
  2024-05-07 14:45   ` [PATCH v3 3/3] block: unpin user pages belonging to a folio Kundan Kumar
  2 siblings, 1 reply; 9+ messages in thread
From: Kundan Kumar @ 2024-05-07 14:45 UTC (permalink / raw)
  To: axboe, hch, willy
  Cc: linux-block, joshi.k, mcgrof, anuj20.g, nj.shetty, c.gameti,
	gost.dev, Kundan Kumar

Add a bigger size from the folio to the bio and skip per-page
processing.

Fetch the offset of the page within its folio. Depending on the folio
size and folio_offset, fetch a larger length. This length may span
multiple contiguous pages if the folio is multi-order. Using this
length, calculate the number of pages that will be added to the bio and
advance the loop counter to skip those pages. This avoids the overhead
of merging pages that belong to the same large-order folio.
Also convert the functions bio_iov_add_page() and
bio_iov_add_zone_append_page() to take a folio.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
---
 block/bio.c | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 38baedb39c6f..e2c31f1b0aa8 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1192,7 +1192,7 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
 	bio_set_flag(bio, BIO_CLONED);
 }
 
-static int bio_iov_add_page(struct bio *bio, struct page *page,
+static int bio_iov_add_folio(struct bio *bio, struct folio *folio,
 		unsigned int len, unsigned int offset)
 {
 	bool same_page = false;
@@ -1202,27 +1202,27 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
 
 	if (bio->bi_vcnt > 0 &&
 	    bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
-				page, len, offset, &same_page)) {
+				&folio->page, len, offset, &same_page)) {
 		bio->bi_iter.bi_size += len;
 		if (same_page)
-			bio_release_page(bio, page);
+			bio_release_page(bio, &folio->page);
 		return 0;
 	}
 
-	__bio_add_page(bio, page, len, offset);
+	bio_add_folio_nofail(bio, folio, len, offset);
 	return 0;
 }
 
-static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
+static int bio_iov_add_zone_append_folio(struct bio *bio, struct folio *folio,
 		unsigned int len, unsigned int offset)
 {
 	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 	bool same_page = false;
 
-	if (bio_add_hw_page(q, bio, page, len, offset,
+	if (bio_add_hw_page(q, bio, &folio->page, len, offset,
 			queue_max_zone_append_sectors(q), &same_page) != len)
 		return -EINVAL;
 	if (same_page)
-		bio_release_page(bio, page);
+		bio_release_page(bio, &folio->page);
 	return 0;
 }
 
@@ -1247,8 +1247,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	struct page **pages = (struct page **)bv;
 	ssize_t size, left;
 	unsigned len, i = 0;
-	size_t offset;
-	int ret = 0;
+	size_t offset, folio_offset;
+	int ret = 0, num_pages;
 
 	/*
 	 * Move page array up in the allocated memory for the bio vecs as far as
@@ -1289,15 +1289,26 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	for (left = size, i = 0; left > 0; left -= len, i++) {
 		struct page *page = pages[i];
+		struct folio *folio = page_folio(page);
+
+		/* Calculate the offset of page in folio */
+		folio_offset = (folio_page_idx(folio, page) << PAGE_SHIFT) +
+				offset;
+
+		len = min_t(size_t, (folio_size(folio) - folio_offset), left);
+
+		num_pages = DIV_ROUND_UP(offset + len, PAGE_SIZE);
 
-		len = min_t(size_t, PAGE_SIZE - offset, left);
 		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
-			ret = bio_iov_add_zone_append_page(bio, page, len,
-					offset);
+			ret = bio_iov_add_zone_append_folio(bio, folio, len,
+					folio_offset);
 			if (ret)
 				break;
 		} else
-			bio_iov_add_page(bio, page, len, offset);
+			bio_iov_add_folio(bio, folio, len, folio_offset);
+
+		/* Skip the pages which got added */
+		i = i + (num_pages - 1);
 
 		offset = 0;
 	}
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: [PATCH v3 2/3] block: add folio awareness instead of looping through pages
  2024-05-07 14:45   ` [PATCH v3 2/3] block: add folio awareness instead of looping through pages Kundan Kumar
@ 2024-05-15 20:55     ` Matthew Wilcox
  2024-05-24  9:22       ` Kundan Kumar
  0 siblings, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2024-05-15 20:55 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: axboe, hch, linux-block, joshi.k, mcgrof, anuj20.g, nj.shetty,
	c.gameti, gost.dev

On Tue, May 07, 2024 at 08:15:08PM +0530, Kundan Kumar wrote:
> Add a bigger size from folio to bio and skip processing for pages.
>
> Fetch the offset of page within a folio. Depending on the size of folio
> and folio_offset, fetch a larger length. This length may consist of
> multiple contiguous pages if folio is multiorder.

The problem is that it may not. Here's the scenario:

	int fd, fd2;
	fd = open(src, O_RDONLY);
	char *addr = mmap(NULL, 1024 * 1024, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_HUGETLB, fd, 0);
	int i, j;

	for (i = 0; i < 1024 * 1024; i++)
		j |= addr[i];

	addr[30000] = 17;
	fd2 = open(dest, O_RDWR | O_DIRECT);
	write(fd2, &addr[16384], 32768);

Assuming that the source file supports being cached in large folios,
the page array we get from GUP might contain:

f0p4 f0p5 f0p6 f1p0 f0p8 f0p9 ...

because we allocated 'f1' when we did COW due to the store to
addr[30000].

We can certainly reduce the cost of merge if we know two pages are part
of the same folio, but we still need to check that we actually got
pages which are part of the same folio.

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH v3 2/3] block: add folio awareness instead of looping through pages
  2024-05-15 20:55     ` Matthew Wilcox
@ 2024-05-24  9:22       ` Kundan Kumar
  2024-05-24 15:40         ` Matthew Wilcox
  0 siblings, 1 reply; 9+ messages in thread
From: Kundan Kumar @ 2024-05-24 9:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: axboe, hch, linux-block, joshi.k, mcgrof, anuj20.g, nj.shetty,
	c.gameti, gost.dev

[-- Attachment #1: Type: text/plain, Size: 1801 bytes --]

On 15/05/24 09:55PM, Matthew Wilcox wrote:
>On Tue, May 07, 2024 at 08:15:08PM +0530, Kundan Kumar wrote:
>> Add a bigger size from folio to bio and skip processing for pages.
>>
>> Fetch the offset of page within a folio. Depending on the size of folio
>> and folio_offset, fetch a larger length. This length may consist of
>> multiple contiguous pages if folio is multiorder.
>
>The problem is that it may not. Here's the scenario:
>
>int fd, fd2;
>fd = open(src, O_RDONLY);
>char *addr = mmap(NULL, 1024 * 1024, PROT_READ | PROT_WRITE,
>		MAP_PRIVATE | MAP_HUGETLB, fd, 0);

I also added the MAP_ANONYMOUS flag here, otherwise mmap fails.

>int i, j;
>
>for (i = 0; i < 1024 * 1024; i++)
>	j |= addr[i];
>
>addr[30000] = 17;
>fd2 = open(dest, O_RDWR | O_DIRECT);
>write(fd2, &addr[16384], 32768);
>
>Assuming that the source file supports being cached in large folios,
>the page array we get from GUP might contain:
>
>f0p4 f0p5 f0p6 f1p0 f0p8 f0p9 ...
>
>because we allocated 'f1' when we did COW due to the store to addr[30000].
>
>We can certainly reduce the cost of merge if we know two pages are part
>of the same folio, but we still need to check that we actually got
>pages which are part of the same folio.
>

Thanks, now I understand that COW may happen and will modify the PTEs
in between a contiguous folio mapping in the page table. I will include
a check that each page belongs to the same folio.

I tried executing the sample code. When it does a GUP due to the
O_DIRECT write, I see that the entire HUGE folio gets reused.
"addr[30000] = 17;" generates a COW, but hugetlb_wp() reuses the same
huge page, using the old_folio again. Am I missing something here? Do
you have any further suggestions for generating a scenario similar to
"f0p4 f0p5 f0p6 f1p0 f0p8 f0p9 ..." with a sample program.

--
Kundan

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH v3 2/3] block: add folio awareness instead of looping through pages
  2024-05-24  9:22       ` Kundan Kumar
@ 2024-05-24 15:40         ` Matthew Wilcox
  0 siblings, 0 replies; 9+ messages in thread
From: Matthew Wilcox @ 2024-05-24 15:40 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: axboe, hch, linux-block, joshi.k, mcgrof, anuj20.g, nj.shetty,
	c.gameti, gost.dev

On Fri, May 24, 2024 at 02:52:31PM +0530, Kundan Kumar wrote:
> On 15/05/24 09:55PM, Matthew Wilcox wrote:
> > On Tue, May 07, 2024 at 08:15:08PM +0530, Kundan Kumar wrote:
> > > Add a bigger size from folio to bio and skip processing for pages.
> > >
> > > Fetch the offset of page within a folio. Depending on the size of folio
> > > and folio_offset, fetch a larger length. This length may consist of
> > > multiple contiguous pages if folio is multiorder.
> >
> > The problem is that it may not. Here's the scenario:
> >
> > int fd, fd2;
> > fd = open(src, O_RDONLY);
> > char *addr = mmap(NULL, 1024 * 1024, PROT_READ | PROT_WRITE,
> > 		MAP_PRIVATE | MAP_HUGETLB, fd, 0);
>
> I also added MAP_ANONYMOUS flag here, otherwise mmap fails.

I didn't test this code, but MAP_ANONYMOUS is wrong. I'm trying to get
a file mapping here, not an anonymous mapping. The intent is to hit
the

		if (vm_flags & VM_HUGEPAGE) {

case in do_sync_mmap_readahead().

Ah, I see. ksys_mmap_pgoff:

	if (!(flags & MAP_ANONYMOUS)) {
...
		if (is_file_hugepages(file)) {
			len = ALIGN(len, huge_page_size(hstate_file(file)));
		} else if (unlikely(flags & MAP_HUGETLB)) {
			retval = -EINVAL;
			goto out_fput;
		}

Maybe we need something like this:

diff --git a/mm/mmap.c b/mm/mmap.c
index 83b4682ec85c..7c5066a8a3ac 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1307,6 +1307,8 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			flags_mask = LEGACY_MAP_MASK;
 			if (file->f_op->fop_flags & FOP_MMAP_SYNC)
 				flags_mask |= MAP_SYNC;
+			if (flags & MAP_HUGETLB)
+				vm_flags |= VM_HUGEPAGE;
 
 		switch (flags & MAP_TYPE) {
 		case MAP_SHARED:
@@ -1414,12 +1416,8 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
 		file = fget(fd);
 		if (!file)
 			return -EBADF;
-		if (is_file_hugepages(file)) {
+		if (is_file_hugepages(file))
 			len = ALIGN(len, huge_page_size(hstate_file(file)));
-		} else if (unlikely(flags & MAP_HUGETLB)) {
-			retval = -EINVAL;
-			goto out_fput;
-		}
 	} else if (flags & MAP_HUGETLB) {
 		struct hstate *hs;
 
@@ -1441,7 +1439,6 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
 	}
 
 	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
-out_fput:
 	if (file)
 		fput(file);
 	return retval;

(compile tested only)

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* [PATCH v3 3/3] block: unpin user pages belonging to a folio
  2024-05-07 14:45 ` [PATCH v3 0/3] block: add larger order folio instead of pages Kundan Kumar
  2024-05-07 14:45   ` [PATCH v3 1/3] nvme: adjust multiples of NVME_CTRL_PAGE_SIZE in offset Kundan Kumar
  2024-05-07 14:45   ` [PATCH v3 2/3] block: add folio awareness instead of looping through pages Kundan Kumar
@ 2024-05-07 14:45   ` Kundan Kumar
  2 siblings, 0 replies; 9+ messages in thread
From: Kundan Kumar @ 2024-05-07 14:45 UTC (permalink / raw)
  To: axboe, hch, willy
  Cc: linux-block, joshi.k, mcgrof, anuj20.g, nj.shetty, c.gameti,
	gost.dev, Kundan Kumar, Keith Busch

Unpin the pages that belong to the same folio at submission time. This
enables releasing folios on I/O completion rather than looping through
their pages.

Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
---
 block/bio.c | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index e2c31f1b0aa8..510327b8852e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1154,20 +1154,12 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
 	struct folio_iter fi;
 
 	bio_for_each_folio_all(fi, bio) {
-		struct page *page;
-		size_t nr_pages;
-
 		if (mark_dirty) {
 			folio_lock(fi.folio);
 			folio_mark_dirty(fi.folio);
 			folio_unlock(fi.folio);
 		}
-		page = folio_page(fi.folio, fi.offset / PAGE_SIZE);
-		nr_pages = (fi.offset + fi.length - 1) / PAGE_SIZE -
-			   fi.offset / PAGE_SIZE + 1;
-		do {
-			bio_release_page(bio, page++);
-		} while (--nr_pages != 0);
+		bio_release_page(bio, &fi.folio->page);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1307,6 +1299,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		} else
 			bio_iov_add_folio(bio, folio, len, folio_offset);
 
+		if (bio_flagged(bio, BIO_PAGE_PINNED) && num_pages > 1)
+			unpin_user_pages(pages + i, num_pages - 1);
+
 		/* Skip the pages which got added */
 		i = i + (num_pages - 1);
 
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread
end of thread, other threads: [~2024-05-24 15:40 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <CGME20240507145232epcas5p481986099a82b1880758b7770cdeaf2d2@epcas5p4.samsung.com>
2024-05-07 14:45 ` [PATCH v3 0/3] block: add larger order folio instead of pages Kundan Kumar
2024-05-07 14:45 ` [PATCH v3 1/3] nvme: adjust multiples of NVME_CTRL_PAGE_SIZE in offset Kundan Kumar
2024-05-23 8:51 ` Christoph Hellwig
2024-05-23 10:07 ` Kundan Kumar
2024-05-07 14:45 ` [PATCH v3 2/3] block: add folio awareness instead of looping through pages Kundan Kumar
2024-05-15 20:55 ` Matthew Wilcox
2024-05-24 9:22 ` Kundan Kumar
2024-05-24 15:40 ` Matthew Wilcox
2024-05-07 14:45 ` [PATCH v3 3/3] block: unpin user pages belonging to a folio Kundan Kumar