* [RFC 0/4] convert create_page_buffers to create_folio_buffers
       [not found] <CGME20230414110825eucas1p1ed4d16627889ef8542dfa31b1183063d@eucas1p1.samsung.com>
@ 2023-04-14 11:08 ` Pankaj Raghav
  2023-04-14 11:08 ` [RFC 1/4] fs/buffer: add set_bh_folio helper Pankaj Raghav
  ` (4 more replies)
  0 siblings, 5 replies; 27+ messages in thread
From: Pankaj Raghav @ 2023-04-14 11:08 UTC (permalink / raw)
  To: brauner, willy, viro, akpm
  Cc: linux-fsdevel, linux-kernel, mcgrof, gost.dev, hare, Pankaj Raghav

One of the first kernel panics we hit when we try to increase the
block size > 4k is inside create_page_buffers()[1]. Even though the
buffer.c functions do not support large folios (folios > PAGE_SIZE) at
the moment, these changes are required when we want to remove that
constraint.

Willy had already mentioned that he wanted to convert
create_page_buffers() to create_folio_buffers() but didn't get to it
yet, so I decided to take a shot at it.

No functional changes introduced.

Open issues:
- I don't like the fact that I had to introduce
  folio_create_empty_buffers(), as create_empty_buffers() is used in
  many parts of the kernel. Should I do a big-bang change as part of
  this series where we convert create_empty_buffers() to take a folio
  and change the callers to pass a folio instead of a page?

- I split the series into 4 commits for clarity. I could squash them
  into one patch if needed.

[1] https://lore.kernel.org/all/ZBnGc4WbBOlnRUgd@casper.infradead.org/

Pankaj Raghav (4):
  fs/buffer: add set_bh_folio helper
  buffer: add alloc_folio_buffers() helper
  fs/buffer: add folio_create_empty_buffers helper
  fs/buffer: convert create_page_buffers to create_folio_buffers

 fs/buffer.c                 | 131 +++++++++++++++++++++++++++++++++---
 include/linux/buffer_head.h |   4 ++
 2 files changed, 125 insertions(+), 10 deletions(-)

-- 
2.34.1

^ permalink raw reply	[flat|nested] 27+ messages in thread
* [RFC 1/4] fs/buffer: add set_bh_folio helper 2023-04-14 11:08 ` [RFC 0/4] convert create_page_buffers to create_folio_buffers Pankaj Raghav @ 2023-04-14 11:08 ` Pankaj Raghav 2023-04-14 11:08 ` [RFC 2/4] buffer: add alloc_folio_buffers() helper Pankaj Raghav ` (3 subsequent siblings) 4 siblings, 0 replies; 27+ messages in thread From: Pankaj Raghav @ 2023-04-14 11:08 UTC (permalink / raw) To: brauner, willy, viro, akpm Cc: linux-fsdevel, linux-kernel, mcgrof, gost.dev, hare, Pankaj Raghav The folio version of set_bh_page(). This is required to convert create_page_buffers() to create_folio_buffers() later in the series. Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> --- fs/buffer.c | 15 +++++++++++++++ include/linux/buffer_head.h | 2 ++ 2 files changed, 17 insertions(+) diff --git a/fs/buffer.c b/fs/buffer.c index b3eb905f87d6..44380ff3a31f 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -1484,6 +1484,21 @@ void set_bh_page(struct buffer_head *bh, } EXPORT_SYMBOL(set_bh_page); +void set_bh_folio(struct buffer_head *bh, struct folio *folio, + unsigned long offset) +{ + bh->b_folio = folio; + BUG_ON(offset >= folio_size(folio)); + if (folio_test_highmem(folio)) + /* + * This catches illegal uses and preserves the offset: + */ + bh->b_data = (char *)(0 + offset); + else + bh->b_data = folio_address(folio) + offset; +} +EXPORT_SYMBOL(set_bh_folio); + /* * Called when truncating a buffer on a page completely. 
*/ diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h index 8f14dca5fed7..d5a2ef9b4cdf 100644 --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -196,6 +196,8 @@ void mark_buffer_write_io_error(struct buffer_head *bh); void touch_buffer(struct buffer_head *bh); void set_bh_page(struct buffer_head *bh, struct page *page, unsigned long offset); +void set_bh_folio(struct buffer_head *bh, struct folio *folio, + unsigned long offset); bool try_to_free_buffers(struct folio *); struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size, bool retry); -- 2.34.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC 2/4] buffer: add alloc_folio_buffers() helper 2023-04-14 11:08 ` [RFC 0/4] convert create_page_buffers to create_folio_buffers Pankaj Raghav 2023-04-14 11:08 ` [RFC 1/4] fs/buffer: add set_bh_folio helper Pankaj Raghav @ 2023-04-14 11:08 ` Pankaj Raghav 2023-04-14 13:06 ` Matthew Wilcox 2023-04-14 11:08 ` [RFC 3/4] fs/buffer: add folio_create_empty_buffers helper Pankaj Raghav ` (2 subsequent siblings) 4 siblings, 1 reply; 27+ messages in thread From: Pankaj Raghav @ 2023-04-14 11:08 UTC (permalink / raw) To: brauner, willy, viro, akpm Cc: linux-fsdevel, linux-kernel, mcgrof, gost.dev, hare, Pankaj Raghav Folio version of alloc_page_buffers() helper. This is required to convert create_page_buffers() to create_folio_buffers() later in the series. It removes one call to compound_head() compared to alloc_page_buffers(). Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> --- fs/buffer.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 59 insertions(+) diff --git a/fs/buffer.c b/fs/buffer.c index 44380ff3a31f..0f9c2127543d 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -900,6 +900,65 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size, } EXPORT_SYMBOL_GPL(alloc_page_buffers); +/* + * Create the appropriate buffers when given a folio for data area and + * the size of each buffer.. Use the bh->b_this_page linked list to + * follow the buffers created. Return NULL if unable to create more + * buffers. + * + * The retry flag is used to differentiate async IO (paging, swapping) + * which may not fail from ordinary buffer allocations. 
+ */ +struct buffer_head *alloc_folio_buffers(struct folio *folio, unsigned long size, + bool retry) +{ + struct buffer_head *bh, *head; + gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT; + long offset; + struct mem_cgroup *memcg, *old_memcg; + + if (retry) + gfp |= __GFP_NOFAIL; + + /* The folio lock pins the memcg */ + memcg = folio_memcg(folio); + old_memcg = set_active_memcg(memcg); + + head = NULL; + offset = folio_size(folio); + while ((offset -= size) >= 0) { + bh = alloc_buffer_head(gfp); + if (!bh) + goto no_grow; + + bh->b_this_page = head; + bh->b_blocknr = -1; + head = bh; + + bh->b_size = size; + + /* Link the buffer to its folio */ + set_bh_folio(bh, folio, offset); + } +out: + set_active_memcg(old_memcg); + return head; +/* + * In case anything failed, we just free everything we got. + */ +no_grow: + if (head) { + do { + bh = head; + head = head->b_this_page; + free_buffer_head(bh); + } while (head); + } + + goto out; +} +EXPORT_SYMBOL_GPL(alloc_folio_buffers); + static inline void link_dev_buffers(struct page *page, struct buffer_head *head) { -- 2.34.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [RFC 2/4] buffer: add alloc_folio_buffers() helper 2023-04-14 11:08 ` [RFC 2/4] buffer: add alloc_folio_buffers() helper Pankaj Raghav @ 2023-04-14 13:06 ` Matthew Wilcox 2023-04-14 15:01 ` Pankaj Raghav 0 siblings, 1 reply; 27+ messages in thread From: Matthew Wilcox @ 2023-04-14 13:06 UTC (permalink / raw) To: Pankaj Raghav Cc: brauner, viro, akpm, linux-fsdevel, linux-kernel, mcgrof, gost.dev, hare On Fri, Apr 14, 2023 at 01:08:19PM +0200, Pankaj Raghav wrote: > Folio version of alloc_page_buffers() helper. This is required to convert > create_page_buffers() to create_folio_buffers() later in the series. > > It removes one call to compound_head() compared to alloc_page_buffers(). I would convert alloc_page_buffers() to folio_alloc_buffers() and add static struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size, bool retry) { return folio_alloc_buffers(page_folio(page), size, retry); } in buffer_head.h (there are only five callers, so this feels like a better tradeoff than creating a new function) > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> > --- > fs/buffer.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 59 insertions(+) > > diff --git a/fs/buffer.c b/fs/buffer.c > index 44380ff3a31f..0f9c2127543d 100644 > --- a/fs/buffer.c > +++ b/fs/buffer.c > @@ -900,6 +900,65 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size, > } > EXPORT_SYMBOL_GPL(alloc_page_buffers); > > +/* > + * Create the appropriate buffers when given a folio for data area and > + * the size of each buffer.. Use the bh->b_this_page linked list to > + * follow the buffers created. Return NULL if unable to create more > + * buffers. > + * > + * The retry flag is used to differentiate async IO (paging, swapping) > + * which may not fail from ordinary buffer allocations. 
> + */ > +struct buffer_head *alloc_folio_buffers(struct folio *folio, unsigned long size, > + bool retry) > +{ > + struct buffer_head *bh, *head; > + gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT; > + long offset; > + struct mem_cgroup *memcg, *old_memcg; > + > + if (retry) > + gfp |= __GFP_NOFAIL; > + > + /* The folio lock pins the memcg */ > + memcg = folio_memcg(folio); > + old_memcg = set_active_memcg(memcg); > + > + head = NULL; > + offset = folio_size(folio); > + while ((offset -= size) >= 0) { > + bh = alloc_buffer_head(gfp); > + if (!bh) > + goto no_grow; > + > + bh->b_this_page = head; > + bh->b_blocknr = -1; > + head = bh; > + > + bh->b_size = size; > + > + /* Link the buffer to its folio */ > + set_bh_folio(bh, folio, offset); > + } > +out: > + set_active_memcg(old_memcg); > + return head; > +/* > + * In case anything failed, we just free everything we got. > + */ > +no_grow: > + if (head) { > + do { > + bh = head; > + head = head->b_this_page; > + free_buffer_head(bh); > + } while (head); > + } > + > + goto out; > +} > +EXPORT_SYMBOL_GPL(alloc_folio_buffers); > + > static inline void > link_dev_buffers(struct page *page, struct buffer_head *head) > { > -- > 2.34.1 > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 2/4] buffer: add alloc_folio_buffers() helper 2023-04-14 13:06 ` Matthew Wilcox @ 2023-04-14 15:01 ` Pankaj Raghav 0 siblings, 0 replies; 27+ messages in thread From: Pankaj Raghav @ 2023-04-14 15:01 UTC (permalink / raw) To: Matthew Wilcox Cc: brauner, viro, akpm, linux-fsdevel, linux-kernel, mcgrof, gost.dev, hare On 2023-04-14 15:06, Matthew Wilcox wrote: > On Fri, Apr 14, 2023 at 01:08:19PM +0200, Pankaj Raghav wrote: >> Folio version of alloc_page_buffers() helper. This is required to convert >> create_page_buffers() to create_folio_buffers() later in the series. >> >> It removes one call to compound_head() compared to alloc_page_buffers(). > > I would convert alloc_page_buffers() to folio_alloc_buffers() and > add > > static struct buffer_head *alloc_page_buffers(struct page *page, > unsigned long size, bool retry) > { > return folio_alloc_buffers(page_folio(page), size, retry); > } > > in buffer_head.h > > (there are only five callers, so this feels like a better tradeoff > than creating a new function) > That is a good idea and follows the usual pattern for folio conversion. I will send a new version soon with your other comments as well. -- Pankaj ^ permalink raw reply [flat|nested] 27+ messages in thread
* [RFC 3/4] fs/buffer: add folio_create_empty_buffers helper 2023-04-14 11:08 ` [RFC 0/4] convert create_page_buffers to create_folio_buffers Pankaj Raghav 2023-04-14 11:08 ` [RFC 1/4] fs/buffer: add set_bh_folio helper Pankaj Raghav 2023-04-14 11:08 ` [RFC 2/4] buffer: add alloc_folio_buffers() helper Pankaj Raghav @ 2023-04-14 11:08 ` Pankaj Raghav 2023-04-14 13:16 ` Matthew Wilcox 2023-04-14 11:08 ` [RFC 4/4] fs/buffer: convert create_page_buffers to create_folio_buffers Pankaj Raghav 2023-04-14 13:47 ` [RFC 0/4] " Hannes Reinecke 4 siblings, 1 reply; 27+ messages in thread From: Pankaj Raghav @ 2023-04-14 11:08 UTC (permalink / raw) To: brauner, willy, viro, akpm Cc: linux-fsdevel, linux-kernel, mcgrof, gost.dev, hare, Pankaj Raghav Folio version of create_empty_buffers(). This is required to convert create_page_buffers() to create_folio_buffers() later in the series. It removes several calls to compound_head() as it works directly on folio compared to create_empty_buffers(). Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> --- fs/buffer.c | 34 ++++++++++++++++++++++++++++++++++ include/linux/buffer_head.h | 2 ++ 2 files changed, 36 insertions(+) diff --git a/fs/buffer.c b/fs/buffer.c index 0f9c2127543d..9e6a1a738fb5 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -1645,6 +1645,40 @@ void block_invalidate_folio(struct folio *folio, size_t offset, size_t length) } EXPORT_SYMBOL(block_invalidate_folio); +/* + * We attach and possibly dirty the buffers atomically wrt + * block_dirty_folio() via private_lock. try_to_free_buffers + * is already excluded via the folio lock. 
+ */ +void folio_create_empty_buffers(struct folio *folio, unsigned long blocksize, + unsigned long b_state) +{ + struct buffer_head *bh, *head, *tail; + + head = alloc_folio_buffers(folio, blocksize, true); + bh = head; + do { + bh->b_state |= b_state; + tail = bh; + bh = bh->b_this_page; + } while (bh); + tail->b_this_page = head; + + spin_lock(&folio->mapping->private_lock); + if (folio_test_uptodate(folio) || folio_test_dirty(folio)) { + bh = head; + do { + if (folio_test_dirty(folio)) + set_buffer_dirty(bh); + if (folio_test_uptodate(folio)) + set_buffer_uptodate(bh); + bh = bh->b_this_page; + } while (bh != head); + } + folio_attach_private(folio, head); + spin_unlock(&folio->mapping->private_lock); +} +EXPORT_SYMBOL(folio_create_empty_buffers); /* * We attach and possibly dirty the buffers atomically wrt diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h index d5a2ef9b4cdf..8afa91cbb8e2 100644 --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -203,6 +203,8 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size, bool retry); void create_empty_buffers(struct page *, unsigned long, unsigned long b_state); +void folio_create_empty_buffers(struct folio *, unsigned long, + unsigned long b_state); void end_buffer_read_sync(struct buffer_head *bh, int uptodate); void end_buffer_write_sync(struct buffer_head *bh, int uptodate); void end_buffer_async_write(struct buffer_head *bh, int uptodate); -- 2.34.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [RFC 3/4] fs/buffer: add folio_create_empty_buffers helper 2023-04-14 11:08 ` [RFC 3/4] fs/buffer: add folio_create_empty_buffers helper Pankaj Raghav @ 2023-04-14 13:16 ` Matthew Wilcox 0 siblings, 0 replies; 27+ messages in thread From: Matthew Wilcox @ 2023-04-14 13:16 UTC (permalink / raw) To: Pankaj Raghav Cc: brauner, viro, akpm, linux-fsdevel, linux-kernel, mcgrof, gost.dev, hare On Fri, Apr 14, 2023 at 01:08:20PM +0200, Pankaj Raghav wrote: > Folio version of create_empty_buffers(). This is required to convert > create_page_buffers() to create_folio_buffers() later in the series. > > It removes several calls to compound_head() as it works directly on folio > compared to create_empty_buffers(). Again, I would create a create_empty_buffers() wrapper, but this time I would put it in buffer.c. ^ permalink raw reply [flat|nested] 27+ messages in thread
* [RFC 4/4] fs/buffer: convert create_page_buffers to create_folio_buffers 2023-04-14 11:08 ` [RFC 0/4] convert create_page_buffers to create_folio_buffers Pankaj Raghav ` (2 preceding siblings ...) 2023-04-14 11:08 ` [RFC 3/4] fs/buffer: add folio_create_empty_buffers helper Pankaj Raghav @ 2023-04-14 11:08 ` Pankaj Raghav 2023-04-14 13:21 ` Matthew Wilcox 2023-04-14 13:47 ` [RFC 0/4] " Hannes Reinecke 4 siblings, 1 reply; 27+ messages in thread From: Pankaj Raghav @ 2023-04-14 11:08 UTC (permalink / raw) To: brauner, willy, viro, akpm Cc: linux-fsdevel, linux-kernel, mcgrof, gost.dev, hare, Pankaj Raghav fs/buffer do not support large folios as there are many assumptions on the folio size to be the host page size. This conversion is one step towards removing that assumption. Also this conversion will reduce calls to compound_head() if create_folio_buffers() calls folio_create_empty_buffers(). Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> --- fs/buffer.c | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index 9e6a1a738fb5..a83d9bf78ca5 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -1802,14 +1802,17 @@ static inline int block_size_bits(unsigned int blocksize) return ilog2(blocksize); } -static struct buffer_head *create_page_buffers(struct page *page, struct inode *inode, unsigned int b_state) +static struct buffer_head *create_folio_buffers(struct folio *folio, + struct inode *inode, + unsigned int b_state) { - BUG_ON(!PageLocked(page)); + BUG_ON(!folio_test_locked(folio)); - if (!page_has_buffers(page)) - create_empty_buffers(page, 1 << READ_ONCE(inode->i_blkbits), - b_state); - return page_buffers(page); + if (!folio_buffers(folio)) + folio_create_empty_buffers(folio, + 1 << READ_ONCE(inode->i_blkbits), + b_state); + return folio_buffers(folio); } /* @@ -1853,8 +1856,8 @@ int __block_write_full_page(struct inode *inode, struct page *page, int nr_underway = 0; blk_opf_t write_flags = 
wbc_to_write_flags(wbc); - head = create_page_buffers(page, inode, - (1 << BH_Dirty)|(1 << BH_Uptodate)); + head = create_folio_buffers(page_folio(page), inode, + (1 << BH_Dirty) | (1 << BH_Uptodate)); /* * Be very careful. We have no exclusion from block_dirty_folio @@ -2117,7 +2120,7 @@ int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len, BUG_ON(to > PAGE_SIZE); BUG_ON(from > to); - head = create_page_buffers(&folio->page, inode, 0); + head = create_folio_buffers(folio, inode, 0); blocksize = head->b_size; bbits = block_size_bits(blocksize); @@ -2403,7 +2406,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) VM_BUG_ON_FOLIO(folio_test_large(folio), folio); - head = create_page_buffers(&folio->page, inode, 0); + head = create_folio_buffers(folio, inode, 0); blocksize = head->b_size; bbits = block_size_bits(blocksize); -- 2.34.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [RFC 4/4] fs/buffer: convert create_page_buffers to create_folio_buffers 2023-04-14 11:08 ` [RFC 4/4] fs/buffer: convert create_page_buffers to create_folio_buffers Pankaj Raghav @ 2023-04-14 13:21 ` Matthew Wilcox 0 siblings, 0 replies; 27+ messages in thread From: Matthew Wilcox @ 2023-04-14 13:21 UTC (permalink / raw) To: Pankaj Raghav Cc: brauner, viro, akpm, linux-fsdevel, linux-kernel, mcgrof, gost.dev, hare On Fri, Apr 14, 2023 at 01:08:21PM +0200, Pankaj Raghav wrote: > fs/buffer do not support large folios as there are many assumptions on > the folio size to be the host page size. This conversion is one step > towards removing that assumption. Also this conversion will reduce calls > to compound_head() if create_folio_buffers() calls > folio_create_empty_buffers(). I'd call this folio_create_buffers(), but other than that, looks good. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 0/4] convert create_page_buffers to create_folio_buffers 2023-04-14 11:08 ` [RFC 0/4] convert create_page_buffers to create_folio_buffers Pankaj Raghav ` (3 preceding siblings ...) 2023-04-14 11:08 ` [RFC 4/4] fs/buffer: convert create_page_buffers to create_folio_buffers Pankaj Raghav @ 2023-04-14 13:47 ` Hannes Reinecke 2023-04-14 13:51 ` Matthew Wilcox ` (3 more replies) 4 siblings, 4 replies; 27+ messages in thread From: Hannes Reinecke @ 2023-04-14 13:47 UTC (permalink / raw) To: Pankaj Raghav, brauner, willy, viro, akpm Cc: linux-fsdevel, linux-kernel, mcgrof, gost.dev On 4/14/23 13:08, Pankaj Raghav wrote: > One of the first kernel panic we hit when we try to increase the > block size > 4k is inside create_page_buffers()[1]. Even though buffer.c > function do not support large folios (folios > PAGE_SIZE) at the moment, > these changes are required when we want to remove that constraint. > > Willy had already mentioned that he wanted to convert create_page_buffers to > create_folio_buffers but didn't get to it yet, so I decided to take a > shot. > > No functional changes introduced. > > OI: > - I don't like the fact that I had to introduce > folio_create_empty_buffers() as create_empty_buffers() is used in > many parts of the kernel. Should I do a big bang change as a part of > this series where we convert create_empty_buffers to take a folio and > change the callers to pass a folio instead of a page? > > - I split the series into 4 commits for clarity. I could squash them > into one patch if needed. 
> > [1] https://lore.kernel.org/all/ZBnGc4WbBOlnRUgd@casper.infradead.org/ > > Pankaj Raghav (4): > fs/buffer: add set_bh_folio helper > buffer: add alloc_folio_buffers() helper > fs/buffer: add folio_create_empty_buffers helper > fs/buffer: convert create_page_buffers to create_folio_buffers > > fs/buffer.c | 131 +++++++++++++++++++++++++++++++++--- > include/linux/buffer_head.h | 4 ++ > 2 files changed, 125 insertions(+), 10 deletions(-) > Funnily enough, I've been tinkering along the same lines, and ended up with pretty similar patches. I've had to use two additional patches to get my modified 'brd' driver off the ground with logical blocksize of 16k: - mm/filemap: allocate folios according to the blocksize (will be sending the patch separately) - Modify read_folio() to use the correct order: @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode)) limit = inode->i_sb->s_maxbytes; - VM_BUG_ON_FOLIO(folio_test_large(folio), folio); - head = create_folio_buffers(folio, inode, 0); blocksize = head->b_size; bbits = block_size_bits(blocksize); - iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); + if (WARN_ON(PAGE_SHIFT < bbits)) { + iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT); + } else { + iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); + } lblock = (limit+blocksize-1) >> bbits; bh = head; nr = 0; With that (and my modified brd driver) I've been able to set the logical blocksize to 16k for brd and have it happily loaded. Haven't tested the write path yet, though; there's surely quite some work to be done. BTW; I've got another patch replacing 'writepage' with 'write_folio' (and the corresponding argument update). Is that a direction you want to go? Cheers, Hannes ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 0/4] convert create_page_buffers to create_folio_buffers 2023-04-14 13:47 ` [RFC 0/4] " Hannes Reinecke @ 2023-04-14 13:51 ` Matthew Wilcox 2023-04-14 13:56 ` Hannes Reinecke 2023-04-14 15:00 ` Pankaj Raghav ` (2 subsequent siblings) 3 siblings, 1 reply; 27+ messages in thread From: Matthew Wilcox @ 2023-04-14 13:51 UTC (permalink / raw) To: Hannes Reinecke Cc: Pankaj Raghav, brauner, viro, akpm, linux-fsdevel, linux-kernel, mcgrof, gost.dev On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote: > BTW; I've got another patch replacing 'writepage' with 'write_folio' > (and the corresponding argument update). Is that a direction you want to go? No; ->writepage is being deleted. It's already gone from ext4 and xfs. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 0/4] convert create_page_buffers to create_folio_buffers 2023-04-14 13:51 ` Matthew Wilcox @ 2023-04-14 13:56 ` Hannes Reinecke 0 siblings, 0 replies; 27+ messages in thread From: Hannes Reinecke @ 2023-04-14 13:56 UTC (permalink / raw) To: Matthew Wilcox Cc: Pankaj Raghav, brauner, viro, akpm, linux-fsdevel, linux-kernel, mcgrof, gost.dev On 4/14/23 15:51, Matthew Wilcox wrote: > On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote: >> BTW; I've got another patch replacing 'writepage' with 'write_folio' >> (and the corresponding argument update). Is that a direction you want to go? > > No; ->writepage is being deleted. It's already gone from ext4 and xfs. Aw. And here's me having converted block/fops over to using iomap w/ iomap_writepage(). Tough. Oh well. Wasn't a great fit anyway as for a sb_bread() replacement we would need a sub-page access for iomap. Question is whether we really need that or shouldn't read PAGE_SIZE sectors always. Surely would make life easier. And would save us all the logic with bh_lru etc as we can rely on the page cache. But probably an item for the iomap discussion at LSF. Unless you got plans already ... Cheers, Hannes ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 0/4] convert create_page_buffers to create_folio_buffers 2023-04-14 13:47 ` [RFC 0/4] " Hannes Reinecke 2023-04-14 13:51 ` Matthew Wilcox @ 2023-04-14 15:00 ` Pankaj Raghav 2023-04-15 1:01 ` Luis Chamberlain 2023-04-17 2:27 ` Luis Chamberlain 3 siblings, 0 replies; 27+ messages in thread From: Pankaj Raghav @ 2023-04-14 15:00 UTC (permalink / raw) To: Hannes Reinecke, brauner, willy, viro, akpm Cc: linux-fsdevel, linux-kernel, mcgrof, gost.dev >> Pankaj Raghav (4): >> fs/buffer: add set_bh_folio helper >> buffer: add alloc_folio_buffers() helper >> fs/buffer: add folio_create_empty_buffers helper >> fs/buffer: convert create_page_buffers to create_folio_buffers >> >> fs/buffer.c | 131 +++++++++++++++++++++++++++++++++--- >> include/linux/buffer_head.h | 4 ++ >> 2 files changed, 125 insertions(+), 10 deletions(-) >> > Funnily enough, I've been tinkering along the same lines, and ended up with pretty similar patches. > I've had to use two additional patches to get my modified 'brd' driver off the ground with logical > blocksize of 16k: Good to know that we are working on a similar direction here. 
> - mm/filemap: allocate folios according to the blocksize > (will be sending the patch separately) > - Modify read_folio() to use the correct order: > > @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) > if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode)) > limit = inode->i_sb->s_maxbytes; > > - VM_BUG_ON_FOLIO(folio_test_large(folio), folio); > - > head = create_folio_buffers(folio, inode, 0); > blocksize = head->b_size; > bbits = block_size_bits(blocksize); > > - iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); > + if (WARN_ON(PAGE_SHIFT < bbits)) { > + iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT); > + } else { > + iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); > + } > lblock = (limit+blocksize-1) >> bbits; > bh = head; > nr = 0; > > > With that (and my modified brd driver) I've been able to set the logical blocksize to 16k for brd > and have it happily loaded. I will give your patches a try as well. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 0/4] convert create_page_buffers to create_folio_buffers 2023-04-14 13:47 ` [RFC 0/4] " Hannes Reinecke 2023-04-14 13:51 ` Matthew Wilcox 2023-04-14 15:00 ` Pankaj Raghav @ 2023-04-15 1:01 ` Luis Chamberlain 2023-04-15 2:31 ` Matthew Wilcox 2023-04-17 2:27 ` Luis Chamberlain 3 siblings, 1 reply; 27+ messages in thread From: Luis Chamberlain @ 2023-04-15 1:01 UTC (permalink / raw) To: Hannes Reinecke Cc: Pankaj Raghav, brauner, willy, viro, akpm, linux-fsdevel, linux-kernel, gost.dev On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote: > On 4/14/23 13:08, Pankaj Raghav wrote: > > One of the first kernel panic we hit when we try to increase the > > block size > 4k is inside create_page_buffers()[1]. Even though buffer.c > > function do not support large folios (folios > PAGE_SIZE) at the moment, > > these changes are required when we want to remove that constraint. > Funnily enough, I've been tinkering along the same lines, and ended up with > pretty similar patches. 
> I've had to use two additional patches to get my modified 'brd' driver off > the ground with logical blocksize of 16k: > - mm/filemap: allocate folios according to the blocksize > (will be sending the patch separately) > - Modify read_folio() to use the correct order: > > @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio, > get_block_t *get_block) > if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode)) > limit = inode->i_sb->s_maxbytes; > > - VM_BUG_ON_FOLIO(folio_test_large(folio), folio); > - > head = create_folio_buffers(folio, inode, 0); > blocksize = head->b_size; > bbits = block_size_bits(blocksize); > > - iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); > + if (WARN_ON(PAGE_SHIFT < bbits)) { > + iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT); > + } else { > + iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); > + } > lblock = (limit+blocksize-1) >> bbits; > bh = head; > nr = 0; At a quick glance I think both approaches (unless Hannes already did it) seem to just miss that pesky static *arr[MAX_BUF_PER_PAGE], and so I think we need to: a) dynamically allocate those now b) do a cursory review of the users of that and prepare them to grok buffer heads which are blocksize based rather than PAGE_SIZE based. So we just try to kill MAX_BUF_PER_PAGE. Without a) I think buffers after PAGE_SIZE won't get submit_bh() or lock for bs > PAGE_SIZE right now. Luis ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 0/4] convert create_page_buffers to create_folio_buffers 2023-04-15 1:01 ` Luis Chamberlain @ 2023-04-15 2:31 ` Matthew Wilcox 2023-04-15 3:24 ` Luis Chamberlain 0 siblings, 1 reply; 27+ messages in thread From: Matthew Wilcox @ 2023-04-15 2:31 UTC (permalink / raw) To: Luis Chamberlain Cc: Hannes Reinecke, Pankaj Raghav, brauner, viro, akpm, linux-fsdevel, linux-kernel, gost.dev On Fri, Apr 14, 2023 at 06:01:16PM -0700, Luis Chamberlain wrote: > a) dynamically allocate those now > b) do a cursory review of the users of that and prepare them > to grok buffer heads which are blocksize based rather than > PAGE_SIZE based. So we just try to kill MAX_BUF_PER_PAGE. > > Without a) I think buffers after PAGE_SIZE won't get submit_bh() or lock for > bs > PAGE_SIZE right now. Worse, we'll overflow the array and corrupt the stack. This one is a simple fix ... +++ b/fs/buffer.c @@ -2282,7 +2282,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) { struct inode *inode = folio->mapping->host; sector_t iblock, lblock; - struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE]; + struct buffer_head *bh, *head; unsigned int blocksize, bbits; int nr, i; int fully_mapped = 1; @@ -2335,7 +2335,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) if (buffer_uptodate(bh)) continue; } - arr[nr++] = bh; } while (i++, iblock++, (bh = bh->b_this_page) != head); if (fully_mapped) @@ -2353,24 +2352,27 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) } /* Stage two: lock the buffers */ - for (i = 0; i < nr; i++) { - bh = arr[i]; + bh = head; + do { lock_buffer(bh); mark_buffer_async_read(bh); - } + bh = bh->b_this_page; + } while (bh != head); /* * Stage 3: start the IO. Check for uptodateness * inside the buffer lock in case another process reading * the underlying blockdev brought it uptodate (the sct fix). 
*/ - for (i = 0; i < nr; i++) { - bh = arr[i]; + bh = head; + do { if (buffer_uptodate(bh)) end_buffer_async_read(bh, 1); else submit_bh(REQ_OP_READ, bh); - } + bh = bh->b_this_page; + } while (bh != head); + return 0; } EXPORT_SYMBOL(block_read_full_folio); ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 0/4] convert create_page_buffers to create_folio_buffers 2023-04-15 2:31 ` Matthew Wilcox @ 2023-04-15 3:24 ` Luis Chamberlain 2023-04-15 3:44 ` Matthew Wilcox 0 siblings, 1 reply; 27+ messages in thread From: Luis Chamberlain @ 2023-04-15 3:24 UTC (permalink / raw) To: Matthew Wilcox Cc: Hannes Reinecke, Pankaj Raghav, brauner, viro, akpm, linux-fsdevel, linux-kernel, gost.dev On Sat, Apr 15, 2023 at 03:31:54AM +0100, Matthew Wilcox wrote: > On Fri, Apr 14, 2023 at 06:01:16PM -0700, Luis Chamberlain wrote: > > a) dynamically allocate those now > > b) do a cursory review of the users of that and prepare them > > to grok buffer heads which are blocksize based rather than > > PAGE_SIZE based. So we just try to kill MAX_BUF_PER_PAGE. > > > > Without a) I think buffers after PAGE_SIZE won't get submit_bh() or lock for > > bs > PAGE_SIZE right now. > > Worse, we'll overflow the array and corrupt the stack. > > This one is a simple fix ... > > +++ b/fs/buffer.c > @@ -2282,7 +2282,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) > { > struct inode *inode = folio->mapping->host; > sector_t iblock, lblock; > - struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE]; > + struct buffer_head *bh, *head; > unsigned int blocksize, bbits; > int nr, i; > int fully_mapped = 1; > @@ -2335,7 +2335,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) > if (buffer_uptodate(bh)) > continue; > } > - arr[nr++] = bh; > } while (i++, iblock++, (bh = bh->b_this_page) != head); > > if (fully_mapped) > @@ -2353,24 +2352,27 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) > } > > /* Stage two: lock the buffers */ > - for (i = 0; i < nr; i++) { > - bh = arr[i]; > + bh = head; > + do { > lock_buffer(bh); > mark_buffer_async_read(bh); > - } > + bh = bh->b_this_page; > + } while (bh != head); > > /* > * Stage 3: start the IO. 
Check for uptodateness > * inside the buffer lock in case another process reading > * the underlying blockdev brought it uptodate (the sct fix). > */ > - for (i = 0; i < nr; i++) { > - bh = arr[i]; > + bh = head; > + do { > if (buffer_uptodate(bh)) > end_buffer_async_read(bh, 1); > else > submit_bh(REQ_OP_READ, bh); > - } > + bh = bh->b_this_page; > + } while (bh != head); > + > return 0; I thought of that but I saw that the loop that assigns the arr only pegs a bh if we don't "continue" for certain conditions, which made me believe that we only wanted to keep in the array the items which meet the initial loop's criteria. If that is not accurate then yes, the simplification is nice! Luis ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 0/4] convert create_page_buffers to create_folio_buffers 2023-04-15 3:24 ` Luis Chamberlain @ 2023-04-15 3:44 ` Matthew Wilcox 2023-04-15 13:14 ` Hannes Reinecke 0 siblings, 1 reply; 27+ messages in thread From: Matthew Wilcox @ 2023-04-15 3:44 UTC (permalink / raw) To: Luis Chamberlain Cc: Hannes Reinecke, Pankaj Raghav, brauner, viro, akpm, linux-fsdevel, linux-kernel, gost.dev On Fri, Apr 14, 2023 at 08:24:56PM -0700, Luis Chamberlain wrote: > I thought of that but I saw that the loop that assigns the arr only > pegs a bh if we don't "continue" for certain conditions, which made me > believe that we only wanted to keep on the array as non-null items which > meet the initial loop's criteria. If that is not accurate then yes, > the simplication is nice! Uh, right. A little bit more carefully this time ... how does this look? diff --git a/fs/buffer.c b/fs/buffer.c index 5e67e21b350a..dff671079b02 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2282,7 +2282,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) { struct inode *inode = folio->mapping->host; sector_t iblock, lblock; - struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE]; + struct buffer_head *bh, *head; unsigned int blocksize, bbits; int nr, i; int fully_mapped = 1; @@ -2335,7 +2335,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) if (buffer_uptodate(bh)) continue; } - arr[nr++] = bh; + nr++; } while (i++, iblock++, (bh = bh->b_this_page) != head); if (fully_mapped) @@ -2352,25 +2352,29 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) return 0; } - /* Stage two: lock the buffers */ - for (i = 0; i < nr; i++) { - bh = arr[i]; + /* + * Stage two: lock the buffers. Recheck the uptodate flag under + * the lock in case somebody else brought it uptodate first. 
+ */ + bh = head; + do { + if (buffer_uptodate(bh)) + continue; lock_buffer(bh); + if (buffer_uptodate(bh)) { + unlock_buffer(bh); + continue; + } mark_buffer_async_read(bh); - } + } while ((bh = bh->b_this_page) != head); - /* - * Stage 3: start the IO. Check for uptodateness - * inside the buffer lock in case another process reading - * the underlying blockdev brought it uptodate (the sct fix). - */ - for (i = 0; i < nr; i++) { - bh = arr[i]; - if (buffer_uptodate(bh)) - end_buffer_async_read(bh, 1); - else + /* Stage 3: start the IO */ + bh = head; + do { + if (buffer_async_read(bh)) submit_bh(REQ_OP_READ, bh); - } + } while ((bh = bh->b_this_page) != head); + return 0; } EXPORT_SYMBOL(block_read_full_folio); I do wonder how much it's worth doing this vs switching to non-BH methods. I appreciate that's a lot of work still. ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [RFC 0/4] convert create_page_buffers to create_folio_buffers 2023-04-15 3:44 ` Matthew Wilcox @ 2023-04-15 13:14 ` Hannes Reinecke 2023-04-15 17:09 ` Matthew Wilcox 0 siblings, 1 reply; 27+ messages in thread From: Hannes Reinecke @ 2023-04-15 13:14 UTC (permalink / raw) To: Matthew Wilcox, Luis Chamberlain Cc: Pankaj Raghav, brauner, viro, akpm, linux-fsdevel, linux-kernel, gost.dev On 4/15/23 05:44, Matthew Wilcox wrote: > On Fri, Apr 14, 2023 at 08:24:56PM -0700, Luis Chamberlain wrote: >> I thought of that but I saw that the loop that assigns the arr only >> pegs a bh if we don't "continue" for certain conditions, which made me >> believe that we only wanted to keep on the array as non-null items which >> meet the initial loop's criteria. If that is not accurate then yes, >> the simplication is nice! > > Uh, right. A little bit more carefully this time ... how does this > look? > > diff --git a/fs/buffer.c b/fs/buffer.c > index 5e67e21b350a..dff671079b02 100644 > --- a/fs/buffer.c > +++ b/fs/buffer.c > @@ -2282,7 +2282,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) > { > struct inode *inode = folio->mapping->host; > sector_t iblock, lblock; > - struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE]; > + struct buffer_head *bh, *head; > unsigned int blocksize, bbits; > int nr, i; > int fully_mapped = 1; > @@ -2335,7 +2335,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) > if (buffer_uptodate(bh)) > continue; > } > - arr[nr++] = bh; > + nr++; > } while (i++, iblock++, (bh = bh->b_this_page) != head); > > if (fully_mapped) > @@ -2352,25 +2352,29 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) > return 0; > } > > - /* Stage two: lock the buffers */ > - for (i = 0; i < nr; i++) { > - bh = arr[i]; > + /* > + * Stage two: lock the buffers. Recheck the uptodate flag under > + * the lock in case somebody else brought it uptodate first. 
> + */ > + bh = head; > + do { > + if (buffer_uptodate(bh)) > + continue; > lock_buffer(bh); > + if (buffer_uptodate(bh)) { > + unlock_buffer(bh); > + continue; > + } > mark_buffer_async_read(bh); > - } > + } while ((bh = bh->b_this_page) != head); > > - /* > - * Stage 3: start the IO. Check for uptodateness > - * inside the buffer lock in case another process reading > - * the underlying blockdev brought it uptodate (the sct fix). > - */ > - for (i = 0; i < nr; i++) { > - bh = arr[i]; > - if (buffer_uptodate(bh)) > - end_buffer_async_read(bh, 1); > - else > + /* Stage 3: start the IO */ > + bh = head; > + do { > + if (buffer_async_read(bh)) > submit_bh(REQ_OP_READ, bh); > - } > + } while ((bh = bh->b_this_page) != head); > + > return 0; > } > EXPORT_SYMBOL(block_read_full_folio); > > > I do wonder how much it's worth doing this vs switching to non-BH methods. > I appreciate that's a lot of work still. That's what I've been wondering, too. I would _vastly_ prefer to switch over to iomap; however, the blasted sb_bread() is getting in the way. Currently iomap only runs on entire pages / folios, but a lot of (older) filesystems insist on doing 512-byte I/O. While this seems logical (seeing that 512 bytes is the default and, in most cases, the only supported sector size), the question is whether _we_ from the Linux side need to do that. We _could_ upgrade to always do full page I/O; there's a good chance we'll be using the entire page anyway eventually. And with storage bandwidth getting larger and larger we might even get a performance boost there. And it would save us having to implement sub-page I/O for iomap. Hmm? Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Martje Boudien Moerman ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 0/4] convert create_page_buffers to create_folio_buffers 2023-04-15 13:14 ` Hannes Reinecke @ 2023-04-15 17:09 ` Matthew Wilcox 2023-04-16 1:28 ` Luis Chamberlain 0 siblings, 1 reply; 27+ messages in thread From: Matthew Wilcox @ 2023-04-15 17:09 UTC (permalink / raw) To: Hannes Reinecke Cc: Luis Chamberlain, Pankaj Raghav, brauner, viro, akpm, linux-fsdevel, linux-kernel, gost.dev On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote: > On 4/15/23 05:44, Matthew Wilcox wrote: > > I do wonder how much it's worth doing this vs switching to non-BH methods. > > I appreciate that's a lot of work still. > > That's what I've been wondering, too. > > I would _vastly_ prefer to switch over to iomap; however, the blasted > sb_bread() is getting in the way. Currently iomap only runs on entire > pages / folios, but a lot of (older) filesystems insist on doing 512 Hang on, no, iomap can issue sub-page reads. eg iomap_read_folio_sync() will read the parts of the folio which have not yet been read when called from __iomap_write_begin(). > byte I/O. While this seem logical (seeing that 512 bytes is the > default, and, in most cases, the only supported sector size) question > is whether _we_ from the linux side need to do that. > We _could_ upgrade to always do full page I/O; there's a good > chance we'll be using the entire page anyway eventually. > And with storage bandwidth getting larger and larger we might even > get a performance boost there. I think we need to look at this from the filesystem side. What do filesystems actually want to do? The first thing is they want to read the superblock. That's either going to be immediately freed ("Oh, this isn't a JFS filesystem after all") or it's going to hang around indefinitely. There's no particular need to keep it in any kind of cache (buffer or page). Except ... we want to probe a dozen different filesystems, and half of them keep their superblock at the same offset from the start of the block device. 
So we do want to keep it cached. That's arguing for using the page cache, at least to read it. Now, do we want userspace to be able to dd a new superblock into place and have the mounted filesystem see it? I suspect that confuses just about every filesystem out there. So I think the right answer is to read the page into the bdev's page cache and then copy it into a kmalloc'ed buffer which the filesystem is then responsible for freeing. It's also responsible for writing it back (so that's another API we need), and for a journalled filesystem, it needs to fit into the journalling scheme. Also, we may need to write back multiple copies of the superblock, possibly with slight modifications. There are a lot of considerations here, and I don't feel like I have enough of an appreciation of filesystem needs to come up with a decent API. I'd hope we can get a good discussion going at LSFMM. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 0/4] convert create_page_buffers to create_folio_buffers 2023-04-15 17:09 ` Matthew Wilcox @ 2023-04-16 1:28 ` Luis Chamberlain 2023-04-16 3:40 ` Matthew Wilcox 0 siblings, 1 reply; 27+ messages in thread From: Luis Chamberlain @ 2023-04-16 1:28 UTC (permalink / raw) To: Matthew Wilcox Cc: Hannes Reinecke, Pankaj Raghav, kbus >> Keith Busch, brauner, viro, akpm, linux-fsdevel, linux-kernel, gost.dev On Sat, Apr 15, 2023 at 06:09:12PM +0100, Matthew Wilcox wrote: > On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote: > > On 4/15/23 05:44, Matthew Wilcox wrote: > > We _could_ upgrade to always do full page I/O; there's a good > > chance we'll be using the entire page anyway eventually. *Iff* doing away with buffer head 512 granularity could help block sizes greater than page size where physical and logical block size > PAGE_SIZE, we would also be able to see it on 4kn drives (logical and physical block size == 4k). A projection could be made after. In so far as experimenting with this, if you already have some effort on IOMAP for bdev aops, one possibility for pure experimentation for now would be to peg a new set of aops in the path of __alloc_disk_node() --> bdev_alloc(), but that's a wee-bit too early for us to know if the device has (lbs = pbs) > 512. For NVMe for instance this would be nvme_alloc_ns() --> blk_mq_alloc_disk(). We put together and set the logical and physical block size on NVMe in nvme_update_ns_info() --> nvme_update_disk_info(), right before we call device_add_disk(). The only way to override the aops then would be right before device_add_disk(), or as part of a new device_add_disk_aops() or whatever.
Today by just adding the disk we move on to partition scanning immediately unless your block driver has a flag that says otherwise. The current crash we're evaluating with brd, and that we also hit with NVMe, is due to this part. device_add_disk() --> disk_scan_partitions() --> blkdev_get_whole() --> bdev_disk_changed() --> filemap_read_folio() --> filler() The filler is from aops. We don't even have a filesystem yet on these devices at this point. The entire partition core does this partition scanning. Refer to: disk_scan_partitions() --> block/partitions/core.c : bdev_disk_changed() And all of that stuff is also under a 512-byte atomic operation assumption; we could do better if we wanted to. > What do filesystems actually want to do? So you are suggesting that the early reads of the block device by the block cache and its use of the page cache should be aligned / perhaps redesigned to assist more clearly with what modern filesystems would actually want today? > The first thing is they want to read > the superblock. That's either going to be immediately freed ("Oh, > this isn't a JFS filesystem after all") or it's going to hang around > indefinitely. There's no particular need to keep it in any kind of > cache (buffer or page). And the bdev cache would not be able to know beforehand that's the case. > Except ... we want to probe a dozen different > filesystems, and half of them keep their superblock at the same offset > from the start of the block device. So we do want to keep it cached. > That's arguing for using the page cache, at least to read it. Do we currently share anything from the bdev cache with the fs for this? Let's say that first block device's blocksize in memory. > Now, do we want userspace to be able to dd a new superblock into place > and have the mounted filesystem see it? Not sure I follow this. dd a new super block? > I suspect that confuses just > about every filesystem out there.
So I think the right answer is to read > the page into the bdev's page cache and then copy it into a kmalloc'ed > buffer which the filesystem is then responsible for freeing. It's also > responsible for writing it back (so that's another API we need), and for > a journalled filesystem, it needs to fit into the journalling scheme. > Also, we may need to write back multiple copies of the superblock, > possibly with slight modifications. Are you considering these as extensions to the bdev cache? Luis ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 0/4] convert create_page_buffers to create_folio_buffers 2023-04-16 1:28 ` Luis Chamberlain @ 2023-04-16 3:40 ` Matthew Wilcox 2023-04-16 5:26 ` Luis Chamberlain 0 siblings, 1 reply; 27+ messages in thread From: Matthew Wilcox @ 2023-04-16 3:40 UTC (permalink / raw) To: Luis Chamberlain Cc: Hannes Reinecke, Pankaj Raghav, kbus >> Keith Busch, brauner, viro, akpm, linux-fsdevel, linux-kernel, gost.dev On Sat, Apr 15, 2023 at 06:28:11PM -0700, Luis Chamberlain wrote: > On Sat, Apr 15, 2023 at 06:09:12PM +0100, Matthew Wilcox wrote: > > On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote: > > > On 4/15/23 05:44, Matthew Wilcox wrote: > > > We _could_ upgrade to always do full page I/O; there's a good > > > chance we'll be using the entire page anyway eventually. > > *Iff* doing away with buffer head 512 granularity could help block sizes > greater than page size where physical and logical block size > PAGE_SIZE > we whould also be able to see it on 4kn drives (logical and physical > block size == 4k). A projection could be made after. > > In so far as experimenting with this, if you already have some > effort on IOMAP for bdev aops one possibility for pure experimentation > for now would be to peg a new set of aops could be set in the path > of __alloc_disk_node() --> bdev_alloc() but that's a wee-bit too early > for us to know if the device is has (lbs = pbs) > 512. For NVMe for > instance this would be nvme_alloc_ns() --> blk_mq_alloc_disk(). We > put together and set the logical and phyiscal block size on NVMe on > nvme_update_ns_info() --> nvme_update_disk_info(), right before we > call device_add_disk(). The only way to override the aops then would > be right before device_add_disk(), or as part of a new device_add_disk_aops() > or whatever. I think you're making this harder than you need to. 
For LBA size > PAGE_SIZE, there is always only going to be one BH per folio-that-is-LBA-size, so all the problems with more-than-8-BHs-per-4k-page don't actually exist. I don't think we should be overriding the aops, and if we narrow the scope of large folio support in blockdev to only supporting folio_size == LBA size, it becomes much more feasible. > > > And with storage bandwidth getting larger and larger we might even > > > get a performance boost there. > > > > I think we need to look at this from the filesystem side. > > Before that let's recap the bdev cache current issues. Ooh, yes, this is good! I was totally neglecting the partition scanning code. > Today by just adding the disk we move on to partition scanning > immediately unless your block driver has a flag that says otherwise. The > current crash we're evaluating with brd and that we also hit with NVMe > is due to this part. > > device_add_disk() --> > disk_scan_partitions() --> > blkdev_get_whole() --> > bdev_disk_changed() --> > filemap_read_folio() --> filler() > > The filler is from aops. Right, so you missed a step in that callchain, which is read_mapping_folio(). That ends up in do_read_cache_folio(), which contains the deadly: folio = filemap_alloc_folio(gfp, 0); so that needs to be changed to get the minimum folio order from the mapping, and then it should work. > > What do filesystems actually want to do? > > So you are suggesting that the early reads of the block device by the > block cache and its use of the page cache should be aligned / > perhaps redesigned to assist more clearly with what modern filesystems > would actually want today? Perhaps? I'm just saying the needs of the block device are not the only consideration here. I'd like an API that makes sense for the fs author. > > The first thing is they want to read > the superblock. That's either going to be immediately freed ("Oh, > this isn't a JFS filesystem after all") or it's going to hang around > indefinitely.
There's no particular need to keep it in any kind of > > cache (buffer or page). > > And the bdev cache would not be able to know before hand that's the case. Right, nobody knows until it's been read and examined. > > Except ... we want to probe a dozen different > > filesystems, and half of them keep their superblock at the same offset > > from the start of the block device. So we do want to keep it cached. > > That's arguing for using the page cache, at least to read it. > > Do we currently share anything from the bdev cache with the fs for this? > Let's say that first block device blocksize in memory. sb_bread() is used by most filesystems, and the buffer cache aliases into the page cache. > > Now, do we want userspace to be able to dd a new superblock into place > > and have the mounted filesystem see it? > > Not sure I follow this. dd a new super block? In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N', I can overwrite the superblock. Do we want filesystems to see that kind of vandalism, or do we want the mounted filesystem to have its own copy of the data and overwrite what userspace wrote the next time it updates the superblock? (the trick is that this may not be vandalism, it might be the sysadmin updating the uuid or running some fsck-ish program or trying to update the superblock to support fabulous-new-feature on next mount. does this change the answer?) > > I suspect that confuses just > > about every filesystem out there. So I think the right answer is to read > > the page into the bdev's page cache and then copy it into a kmalloc'ed > > buffer which the filesystem is then responsible for freeing. It's also > > responsible for writing it back (so that's another API we need), and for > > a journalled filesystem, it needs to fit into the journalling scheme. > > Also, we may need to write back multiple copies of the superblock, > > possibly with slight modifications. > > Are you considering these as extentions to the bdev cache? 
I'm trying to suggest some of the considerations that need to go into a replacement for sb_bread(). ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 0/4] convert create_page_buffers to create_folio_buffers 2023-04-16 3:40 ` Matthew Wilcox @ 2023-04-16 5:26 ` Luis Chamberlain 2023-04-16 14:07 ` Matthew Wilcox 2023-04-16 22:57 ` Dave Chinner 0 siblings, 2 replies; 27+ messages in thread From: Luis Chamberlain @ 2023-04-16 5:26 UTC (permalink / raw) To: Matthew Wilcox Cc: Hannes Reinecke, Pankaj Raghav, kbus >> Keith Busch, brauner, viro, akpm, linux-fsdevel, linux-kernel, gost.dev On Sun, Apr 16, 2023 at 04:40:06AM +0100, Matthew Wilcox wrote: > On Sat, Apr 15, 2023 at 06:28:11PM -0700, Luis Chamberlain wrote: > > On Sat, Apr 15, 2023 at 06:09:12PM +0100, Matthew Wilcox wrote: > > > On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote: > > > > On 4/15/23 05:44, Matthew Wilcox wrote: > > > > We _could_ upgrade to always do full page I/O; there's a good > > > > chance we'll be using the entire page anyway eventually. > > > > *Iff* doing away with buffer head 512 granularity could help block sizes > > greater than page size where physical and logical block size > PAGE_SIZE > > we whould also be able to see it on 4kn drives (logical and physical > > block size == 4k). A projection could be made after. > > > > In so far as experimenting with this, if you already have some > > effort on IOMAP for bdev aops one possibility for pure experimentation > > for now would be to peg a new set of aops could be set in the path > > of __alloc_disk_node() --> bdev_alloc() but that's a wee-bit too early > > for us to know if the device is has (lbs = pbs) > 512. For NVMe for > > instance this would be nvme_alloc_ns() --> blk_mq_alloc_disk(). We > > put together and set the logical and phyiscal block size on NVMe on > > nvme_update_ns_info() --> nvme_update_disk_info(), right before we > > call device_add_disk(). The only way to override the aops then would > > be right before device_add_disk(), or as part of a new device_add_disk_aops() > > or whatever. > > I think you're making this harder than you need to. 
> For LBA size > PAGE_SIZE, there is always only going to be > one BH per folio-that-is-LBA-size, so all the problems with > more-than-8-BHs-per-4k-page don't actually exist. Oh! Then yes, sorry! > I don't think we > should be overriding the aops, and if we narrow the scope of large folio > support in blockdev t only supporting folio_size == LBA size, it becomes > much more feasible. I'm trying to think of the possible use cases where folio_size != LBA size and I cannot immediately think of some. Yes there are cases where a filesystem may use a different block for say meta data than data, but that I believe is side issue, ie, read/writes for small metadata would have to be accepted. At least for NVMe we have metadata size as part of the LBA format, but from what I understand no Linux filesystem yet uses that. > > > > And with storage bandwidth getting larger and larger we might even > > > > get a performance boost there. > > > > > > I think we need to look at this from the filesystem side. > > > > Before that let's recap the bdev cache current issues. > > Ooh, yes, this is good! I was totally neglecting the partition > scanning code. > > > Today by just adding the disk we move on to partition scanning > > immediately unless your block driver has a flag that says otherwise. The > > current crash we're evaluating with brd and that we also hit with NVMe > > is due to this part. > > > > device_add_disk() --> > > disk_scan_partitions() --> > > blkdev_get_whole() --> > > bdev_disk_changed() --> > > filemap_read_folio() --> filler() > > > > The filler is from aops. > > Right, so you missed a step in that callchain, which is > read_mapping_folio(). 
That ends up in do_read_cache_folio(), which > contains the deadly: > > folio = filemap_alloc_folio(gfp, 0); Right, and before this we have: if (!filler) filler = mapping->a_ops->read_folio; The folio is order 0, and after filemap_alloc_folio() it's added to the page cache; then filemap_read_folio(file, filler, folio) uses the filler blkdev_read_folio() --> fs/buffer.c block_read_full_folio() Just for posterity: fs/buffer.c int block_read_full_folio(struct folio *folio, get_block_t *get_block) { ... struct buffer_head *bh, *head,...; ... head = create_page_buffers(&folio->page, inode, 0); ... } static struct buffer_head *create_page_buffers(struct page *page, struct inode *inode, unsigned int b_state) { BUG_ON(!PageLocked(page)); if (!page_has_buffers(page)) create_empty_buffers(page, 1 << READ_ONCE(inode->i_blkbits), b_state); return page_buffers(page); } void create_empty_buffers(struct page *page, unsigned long blocksize, unsigned long b_state) { struct buffer_head *bh, *head, *tail; head = alloc_page_buffers(page, blocksize, true); bh = head; do { bh->b_state |= b_state; -----> CRASH HERE head is NULL tail = bh; bh = bh->b_this_page; } while (bh); tail->b_this_page = head; } Why is it NULL? The blocksize passed to alloc_page_buffers() is larger than PAGE_SIZE, while offset in the loop below starts at PAGE_SIZE. PAGE_SIZE - blocksize gives a negative number, so the loop body is never entered and head is returned as NULL.
struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size, bool retry) { struct buffer_head *bh, *head; gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT; long offset; struct mem_cgroup *memcg, *old_memcg; if (retry) gfp |= __GFP_NOFAIL; /* The page lock pins the memcg */ memcg = page_memcg(page); old_memcg = set_active_memcg(memcg); head = NULL; offset = PAGE_SIZE; while ((offset -= size) >= 0) { bh = alloc_buffer_head(gfp); if (!bh) goto no_grow; bh->b_this_page = head; bh->b_blocknr = -1; head = bh; bh->b_size = size; /* Link the buffer to its page */ set_bh_page(bh, page, offset); } out: set_active_memcg(old_memcg); return head; ... } I see now what you say about the buffer head being of the block size bh->b_size = size above. > so that needs to be changed to get the minimum folio order from the > mapping, and then it should work. > > > > What do filesystems actually want to do? > > > > So you are suggesting that the early reads of the block device by the > > block cache and its use of the page cache cache should be aligned / > > perhaps redesigned to assist more clearly with what modern filesystems > > might actually would want today? > > Perhaps? I'm just saying the needs of the block device are not the > only consideration here. I'd like an API that makes sense for the fs > author. Makes sense! > > > The first thing is they want to read > > > the superblock. That's either going to be immediately freed ("Oh, > > > this isn't a JFS filesystem after all") or it's going to hang around > > > indefinitely. There's no particular need to keep it in any kind of > > > cache (buffer or page). > > > > And the bdev cache would not be able to know before hand that's the case. > > Right, nobody knows until it's been read and examined. > > > > Except ... we want to probe a dozen different > > > filesystems, and half of them keep their superblock at the same offset > > > from the start of the block device. So we do want to keep it cached. 
> > > That's arguing for using the page cache, at least to read it. > > > > Do we currently share anything from the bdev cache with the fs for this? > > Let's say that first block device blocksize in memory. > > sb_bread() is used by most filesystems, and the buffer cache aliases > into the page cache. I see thanks. I checked what xfs does and its xfs_readsb() uses its own xfs_buf_read_uncached(). It ends up calling xfs_buf_submit() and xfs_buf_ioapply_map() does it's own submit_bio(). So I'm curious why they did that. > > > Now, do we want userspace to be able to dd a new superblock into place > > > and have the mounted filesystem see it? > > > > Not sure I follow this. dd a new super block? > > In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N', > I can overwrite the superblock. Do we want filesystems to see that > kind of vandalism, or do we want the mounted filesystem to have its > own copy of the data and overwrite what userspace wrote the next time it > updates the superblock? Oh, what happens today? > (the trick is that this may not be vandalism, it might be the sysadmin > updating the uuid or running some fsck-ish program or trying to update > the superblock to support fabulous-new-feature on next mount. does this > change the answer?) > > > > I suspect that confuses just > > > about every filesystem out there. So I think the right answer is to read > > > the page into the bdev's page cache and then copy it into a kmalloc'ed > > > buffer which the filesystem is then responsible for freeing. It's also > > > responsible for writing it back (so that's another API we need), and for > > > a journalled filesystem, it needs to fit into the journalling scheme. > > > Also, we may need to write back multiple copies of the superblock, > > > possibly with slight modifications. > > > > Are you considering these as extentions to the bdev cache? > > I'm trying to suggest some of the considerations that need to go into > a replacement for sb_bread(). 
I see! That would also help EOL buffer heads for that purpose. Luis ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 0/4] convert create_page_buffers to create_folio_buffers 2023-04-16 5:26 ` Luis Chamberlain @ 2023-04-16 14:07 ` Matthew Wilcox 2023-04-17 15:40 ` Darrick J. Wong 2023-04-16 22:57 ` Dave Chinner 1 sibling, 1 reply; 27+ messages in thread From: Matthew Wilcox @ 2023-04-16 14:07 UTC (permalink / raw) To: Luis Chamberlain Cc: Hannes Reinecke, Pankaj Raghav, kbus >> Keith Busch, brauner, viro, akpm, linux-fsdevel, linux-kernel, gost.dev On Sat, Apr 15, 2023 at 10:26:42PM -0700, Luis Chamberlain wrote: > On Sun, Apr 16, 2023 at 04:40:06AM +0100, Matthew Wilcox wrote: > > I don't think we > > should be overriding the aops, and if we narrow the scope of large folio > > support in blockdev t only supporting folio_size == LBA size, it becomes > > much more feasible. > > I'm trying to think of the possible use cases where folio_size != LBA size > and I cannot immediately think of some. Yes there are cases where a > filesystem may use a different block for say meta data than data, but that > I believe is side issue, ie, read/writes for small metadata would have > to be accepted. At least for NVMe we have metadata size as part of the > LBA format, but from what I understand no Linux filesystem yet uses that. NVMe metadata is per-block metadata -- a CRC or similar. Filesystem metadata is things like directories, inode tables, free space bitmaps, etc. > struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size, > bool retry) > { [...] > head = NULL; > offset = PAGE_SIZE; > while ((offset -= size) >= 0) { > > I see now what you say about the buffer head being of the block size > bh->b_size = size above. Yes, just changing that to 'offset = page_size(page);' will do the trick. > > sb_bread() is used by most filesystems, and the buffer cache aliases > > into the page cache. > > I see thanks. I checked what xfs does and its xfs_readsb() uses its own > xfs_buf_read_uncached(). 
It ends up calling xfs_buf_submit() and > xfs_buf_ioapply_map() does it's own submit_bio(). So I'm curious why > they did that. IRIX didn't have an sb_bread() ;-) > > In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N', > > I can overwrite the superblock. Do we want filesystems to see that > > kind of vandalism, or do we want the mounted filesystem to have its > > own copy of the data and overwrite what userspace wrote the next time it > > updates the superblock? > > Oh, what happens today? Depends on the filesystem, I think? Not really sure, to be honest. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 0/4] convert create_page_buffers to create_folio_buffers
  2023-04-16 14:07               ` Matthew Wilcox
@ 2023-04-17 15:40                 ` Darrick J. Wong
  0 siblings, 0 replies; 27+ messages in thread
From: Darrick J. Wong @ 2023-04-17 15:40 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Hannes Reinecke, Pankaj Raghav, Keith Busch,
	brauner, viro, akpm, linux-fsdevel, linux-kernel, gost.dev

On Sun, Apr 16, 2023 at 03:07:33PM +0100, Matthew Wilcox wrote:
> On Sat, Apr 15, 2023 at 10:26:42PM -0700, Luis Chamberlain wrote:
> > On Sun, Apr 16, 2023 at 04:40:06AM +0100, Matthew Wilcox wrote:
> > > I don't think we should be overriding the aops, and if we narrow
> > > the scope of large folio support in blockdev to only supporting
> > > folio_size == LBA size, it becomes much more feasible.
> >
> > I'm trying to think of the possible use cases where folio_size != LBA size
> > and I cannot immediately think of some. Yes there are cases where a
> > filesystem may use a different block for say meta data than data, but that
> > I believe is a side issue, ie, read/writes for small metadata would have
> > to be accepted. At least for NVMe we have metadata size as part of the
> > LBA format, but from what I understand no Linux filesystem yet uses that.
>
> NVMe metadata is per-block metadata -- a CRC or similar. Filesystem
> metadata is things like directories, inode tables, free space bitmaps,
> etc.
>
> > struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
> > 		bool retry)
> > {
> [...]
> > 	head = NULL;
> > 	offset = PAGE_SIZE;
> > 	while ((offset -= size) >= 0) {
> >
> > I see now what you say about the buffer head being of the block size,
> > bh->b_size = size above.
>
> Yes, just changing that to 'offset = page_size(page);' will do the trick.
>
> > > sb_bread() is used by most filesystems, and the buffer cache aliases
> > > into the page cache.
> >
> > I see, thanks. I checked what xfs does and its xfs_readsb() uses its own
> > xfs_buf_read_uncached(). It ends up calling xfs_buf_submit() and
> > xfs_buf_ioapply_map() does its own submit_bio(). So I'm curious why
> > they did that.
>
> IRIX didn't have an sb_bread() ;-)
>
> > > In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N',
> > > I can overwrite the superblock. Do we want filesystems to see that
> > > kind of vandalism, or do we want the mounted filesystem to have its
> > > own copy of the data and overwrite what userspace wrote the next time it
> > > updates the superblock?
> >
> > Oh, what happens today?
>
> Depends on the filesystem, I think? Not really sure, to be honest.

The filesystem driver sees the vandalism, and can very well crash as a
result[1]. In that case it was corrupted journal contents being
replayed, but the same thing would happen if you wrote a malicious
userspace program to set the metadata_csum feature flag in the ondisk
superblock after mounting the fs.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=82201#c4

I've tried to prevent people from writing to mounted block devices in
the past, but did not succeed. If you try to prevent programs from
opening such devices with O_RDWR/O_WRONLY, you then break lvm tools,
which require that ability even though they don't actually write
anything to the block device. If you make the block device write_iter
function fail, then old e2fsprogs breaks and you get shouted at for
breaking userspace.

Hence I decided to let security researchers find these bugs and control
the design discussion via CVE. That's not correct and it's not smart,
but it preserves some of my sanity.

--D

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 0/4] convert create_page_buffers to create_folio_buffers
  2023-04-16  5:26             ` Luis Chamberlain
  2023-04-16 14:07               ` Matthew Wilcox
@ 2023-04-16 22:57               ` Dave Chinner
  1 sibling, 0 replies; 27+ messages in thread
From: Dave Chinner @ 2023-04-16 22:57 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Matthew Wilcox, Hannes Reinecke, Pankaj Raghav, Keith Busch,
	brauner, viro, akpm, linux-fsdevel, linux-kernel, gost.dev

On Sat, Apr 15, 2023 at 10:26:42PM -0700, Luis Chamberlain wrote:
> > > > Except ... we want to probe a dozen different
> > > > filesystems, and half of them keep their superblock at the same offset
> > > > from the start of the block device. So we do want to keep it cached.
> > > > That's arguing for using the page cache, at least to read it.
> > >
> > > Do we currently share anything from the bdev cache with the fs for this?
> > > Let's say that first block device blocksize in memory.
> >
> > sb_bread() is used by most filesystems, and the buffer cache aliases
> > into the page cache.
>
> I see, thanks. I checked what xfs does and its xfs_readsb() uses its own
> xfs_buf_read_uncached(). It ends up calling xfs_buf_submit() and
> xfs_buf_ioapply_map() does its own submit_bio(). So I'm curious why
> they did that.

XFS has its own metadata address space for caching - it does not use
the block device page cache at all. This is not new, it never has.

The xfs_buf buffer cache does not use the page cache, either. It does
its own thing, has its own indexing, locking, shrinkers, etc. IOWs, it
does not use the iomap infrastructure at all - iomap is used by XFS
exclusively for data IO.

As for why we use an uncached buffer for the superblock? That's largely
historic, because prior to 2007 every modification that did
allocation/free needed to lock and modify the superblock at transaction
commit. Hence it's always needed in memory and is on a critical fast
path, so it is always directly available to callers that need it
without having to do a cache lookup.

In 2007, lazy superblock counters got rid of the requirement to lock
the superblock buffer in every transaction commit, so the uncached
buffer optimisation hasn't really been needed for the past decade. But
if it ain't broke, don't try to fix it....

> > > > Now, do we want userspace to be able to dd a new superblock into place
> > > > and have the mounted filesystem see it?
> > >
> > > Not sure I follow this. dd a new super block?
> >
> > In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N',
> > I can overwrite the superblock. Do we want filesystems to see that
> > kind of vandalism, or do we want the mounted filesystem to have its
> > own copy of the data and overwrite what userspace wrote the next time it
> > updates the superblock?
>
> Oh, what happens today?

In XFS, it will completely ignore the fact that the superblock got
trashed like this. When the fs goes idle, or the sb is modified for
some other reason, it will relog the in-memory superblock and write it
back to disk, thereby fixing the corruption. i.e. while the filesystem
is mounted, the superblock is _write-only_...

> > (the trick is that this may not be vandalism, it might be the sysadmin
> > updating the uuid or running some fsck-ish program or trying to update
> > the superblock to support fabulous-new-feature on next mount. does this
> > change the answer?)

If you need to change anything in the superblock while the XFS fs is
mounted, then you have to use ioctls to modify the superblock contents
through the running transaction subsystem. Editing the block device
directly breaks the security model of filesystems that assume they have
exclusive access to the block device whilst the filesystem is
mounted....

-Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 0/4] convert create_page_buffers to create_folio_buffers
  2023-04-14 13:47   ` [RFC 0/4] " Hannes Reinecke
                       ` (2 preceding siblings ...)
  2023-04-15  1:01     ` Luis Chamberlain
@ 2023-04-17  2:27     ` Luis Chamberlain
  2023-04-17  6:04       ` Hannes Reinecke
  3 siblings, 1 reply; 27+ messages in thread
From: Luis Chamberlain @ 2023-04-17  2:27 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Pankaj Raghav, brauner, willy, viro, akpm, linux-fsdevel,
	linux-kernel, gost.dev

On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote:
> @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>  	if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
>  		limit = inode->i_sb->s_maxbytes;
>
> -	VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -
>  	head = create_folio_buffers(folio, inode, 0);
>  	blocksize = head->b_size;
>  	bbits = block_size_bits(blocksize);
>
> -	iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
> +	if (WARN_ON(PAGE_SHIFT < bbits)) {
> +		iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT);
> +	} else {
> +		iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
> +	}
>  	lblock = (limit+blocksize-1) >> bbits;
>  	bh = head;
>  	nr = 0;

BTW I see this pattern in:

fs/mpage.c: do_mpage_readpage()
fs/mpage.c: __mpage_writepage()

A helper might be in order.

  Luis

^ permalink raw reply	[flat|nested] 27+ messages in thread
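Such a helper would simply hide the shift-direction flip that the diff above open-codes. A userspace sketch of what it might look like follows; the name folio_index_to_block() and its signature are made up here for illustration (a kernel version would presumably take the folio itself and live in buffer_head.h or similar):

```c
#include <stdint.h>

#define PAGE_SHIFT 12	/* assume 4KiB pages, as on x86-64 */

typedef uint64_t sector_t;
typedef unsigned long pgoff_t;

/*
 * Hypothetical helper for the index-to-block conversion repeated in
 * block_read_full_folio(), do_mpage_readpage() and __mpage_writepage().
 * For block size <= page size (bbits <= PAGE_SHIFT) the page index is
 * scaled up to blocks; once the block size exceeds the page size the
 * shift direction reverses and several page indices map to one block.
 */
static inline sector_t folio_index_to_block(pgoff_t index, unsigned int bbits)
{
	if (bbits > PAGE_SHIFT)
		return (sector_t)index >> (bbits - PAGE_SHIFT);
	return (sector_t)index << (PAGE_SHIFT - bbits);
}
```

For example, with 512-byte blocks (bbits = 9) page index 3 maps to block 24, while with 16KiB blocks (bbits = 14) page indices 8..11 all map to block 2.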
* Re: [RFC 0/4] convert create_page_buffers to create_folio_buffers
  2023-04-17  2:27     ` Luis Chamberlain
@ 2023-04-17  6:04       ` Hannes Reinecke
  0 siblings, 0 replies; 27+ messages in thread
From: Hannes Reinecke @ 2023-04-17  6:04 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Pankaj Raghav, brauner, willy, viro, akpm, linux-fsdevel,
	linux-kernel, gost.dev

On 4/17/23 04:27, Luis Chamberlain wrote:
> On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote:
>> @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>>  	if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
>>  		limit = inode->i_sb->s_maxbytes;
>>
>> -	VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>> -
>>  	head = create_folio_buffers(folio, inode, 0);
>>  	blocksize = head->b_size;
>>  	bbits = block_size_bits(blocksize);
>>
>> -	iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
>> +	if (WARN_ON(PAGE_SHIFT < bbits)) {
>> +		iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT);
>> +	} else {
>> +		iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
>> +	}
>>  	lblock = (limit+blocksize-1) >> bbits;
>>  	bh = head;
>>  	nr = 0;
>
> BTW I see this pattern in:
>
> fs/mpage.c: do_mpage_readpage()
> fs/mpage.c: __mpage_writepage()
>
> A helper might be in order.
>
But only _iff_ we decide to stick with buffer_heads and upgrade the
buffer_head code to handle multi-page block sizes. The idea was to move
over to iomap, hence I'm not sure to what lengths we should go keeping
buffer_heads alive ...

Cheers,

Hannes
--
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew Myers,
Andrew McDonald, Martje Boudien Moerman

^ permalink raw reply	[flat|nested] 27+ messages in thread
end of thread, other threads: [~2023-04-17 15:40 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <CGME20230414110825eucas1p1ed4d16627889ef8542dfa31b1183063d@eucas1p1.samsung.com>
2023-04-14 11:08 ` [RFC 0/4] convert create_page_buffers to create_folio_buffers Pankaj Raghav
2023-04-14 11:08 ` [RFC 1/4] fs/buffer: add set_bh_folio helper Pankaj Raghav
2023-04-14 11:08 ` [RFC 2/4] buffer: add alloc_folio_buffers() helper Pankaj Raghav
2023-04-14 13:06 ` Matthew Wilcox
2023-04-14 15:01 ` Pankaj Raghav
2023-04-14 11:08 ` [RFC 3/4] fs/buffer: add folio_create_empty_buffers helper Pankaj Raghav
2023-04-14 13:16 ` Matthew Wilcox
2023-04-14 11:08 ` [RFC 4/4] fs/buffer: convert create_page_buffers to create_folio_buffers Pankaj Raghav
2023-04-14 13:21 ` Matthew Wilcox
2023-04-14 13:47 ` [RFC 0/4] " Hannes Reinecke
2023-04-14 13:51 ` Matthew Wilcox
2023-04-14 13:56 ` Hannes Reinecke
2023-04-14 15:00 ` Pankaj Raghav
2023-04-15 1:01 ` Luis Chamberlain
2023-04-15 2:31 ` Matthew Wilcox
2023-04-15 3:24 ` Luis Chamberlain
2023-04-15 3:44 ` Matthew Wilcox
2023-04-15 13:14 ` Hannes Reinecke
2023-04-15 17:09 ` Matthew Wilcox
2023-04-16 1:28 ` Luis Chamberlain
2023-04-16 3:40 ` Matthew Wilcox
2023-04-16 5:26 ` Luis Chamberlain
2023-04-16 14:07 ` Matthew Wilcox
2023-04-17 15:40 ` Darrick J. Wong
2023-04-16 22:57 ` Dave Chinner
2023-04-17 2:27 ` Luis Chamberlain
2023-04-17 6:04 ` Hannes Reinecke
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox