All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] Add mru cache for inode to zone allocation mapping
@ 2025-05-14 10:50 Hans Holmberg
  2025-05-14 10:50 ` [PATCH 1/2] xfs: free the item in xfs_mru_cache_insert on failure Hans Holmberg
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Hans Holmberg @ 2025-05-14 10:50 UTC (permalink / raw)
  To: linux-xfs@vger.kernel.org
  Cc: Carlos Maiolino, Dave Chinner, Darrick J . Wong, hch,
	linux-kernel@vger.kernel.org, Hans Holmberg

These patches cleans up the xfs mru code a bit and adds a cache for
keeping track of which zone an inode allocated data to last. Placing
file data in the same zone helps reduce garbage collection overhead,
and with this patch we add support per-file co-location for random
writes.

While I was initially concerned by adding overhead to the allocation
path, the cache actually reduces it as as we avoid going through the
zone allocation algorithm for every random write.

When I run a fio workload with 16 writers to different files in
parallel, bs=8k, iodepth=4, size=1G, I get these throughputs:

baseline	with_cache
774 MB/s	858 MB/s (+11%)

(averaged over three runs ech on a nullblk device)

I see similar, figures when benchmarking on a zns nvme drive (+17%).

No updates in the code since the RFC:
https://www.spinics.net/lists/linux-xfs/msg98889.html

Christoph Hellwig (1):
  xfs: free the item in xfs_mru_cache_insert on failure

Hans Holmberg (1):
  xfs: add inode to zone caching for data placement

 fs/xfs/xfs_filestream.c |  15 ++----
 fs/xfs/xfs_mount.h      |   1 +
 fs/xfs/xfs_mru_cache.c  |  15 ++++--
 fs/xfs/xfs_zone_alloc.c | 109 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 126 insertions(+), 14 deletions(-)

-- 
2.34.1

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/2] xfs: free the item in xfs_mru_cache_insert on failure
  2025-05-14 10:50 [PATCH 0/2] Add mru cache for inode to zone allocation mapping Hans Holmberg
@ 2025-05-14 10:50 ` Hans Holmberg
  2025-05-14 13:59   ` Carlos Maiolino
  2025-05-14 16:04   ` Darrick J. Wong
  2025-05-14 10:50 ` [PATCH 2/2] xfs: add inode to zone caching for data placement Hans Holmberg
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 12+ messages in thread
From: Hans Holmberg @ 2025-05-14 10:50 UTC (permalink / raw)
  To: linux-xfs@vger.kernel.org
  Cc: Carlos Maiolino, Dave Chinner, Darrick J . Wong, hch,
	linux-kernel@vger.kernel.org, Hans Holmberg

From: Christoph Hellwig <hch@lst.de>

Call the provided free_func when xfs_mru_cache_insert as that's what
the callers need to do anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
---
 fs/xfs/xfs_filestream.c | 15 ++++-----------
 fs/xfs/xfs_mru_cache.c  | 15 ++++++++++++---
 2 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/xfs_filestream.c b/fs/xfs/xfs_filestream.c
index a961aa420c48..044918fbae06 100644
--- a/fs/xfs/xfs_filestream.c
+++ b/fs/xfs/xfs_filestream.c
@@ -304,11 +304,9 @@ xfs_filestream_create_association(
 	 * for us, so all we need to do here is take another active reference to
 	 * the perag for the cached association.
 	 *
-	 * If we fail to store the association, we need to drop the fstrms
-	 * counter as well as drop the perag reference we take here for the
-	 * item. We do not need to return an error for this failure - as long as
-	 * we return a referenced AG, the allocation can still go ahead just
-	 * fine.
+	 * If we fail to store the association, we do not need to return an
+	 * error for this failure - as long as we return a referenced AG, the
+	 * allocation can still go ahead just fine.
 	 */
 	item = kmalloc(sizeof(*item), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
 	if (!item)
@@ -316,14 +314,9 @@ xfs_filestream_create_association(
 
 	atomic_inc(&pag_group(args->pag)->xg_active_ref);
 	item->pag = args->pag;
-	error = xfs_mru_cache_insert(mp->m_filestream, pino, &item->mru);
-	if (error)
-		goto out_free_item;
+	xfs_mru_cache_insert(mp->m_filestream, pino, &item->mru);
 	return 0;
 
-out_free_item:
-	xfs_perag_rele(item->pag);
-	kfree(item);
 out_put_fstrms:
 	atomic_dec(&args->pag->pagf_fstrms);
 	return 0;
diff --git a/fs/xfs/xfs_mru_cache.c b/fs/xfs/xfs_mru_cache.c
index d0f5b403bdbe..08443ceec329 100644
--- a/fs/xfs/xfs_mru_cache.c
+++ b/fs/xfs/xfs_mru_cache.c
@@ -414,6 +414,8 @@ xfs_mru_cache_destroy(
  * To insert an element, call xfs_mru_cache_insert() with the data store, the
  * element's key and the client data pointer.  This function returns 0 on
  * success or ENOMEM if memory for the data element couldn't be allocated.
+ *
+ * The passed in elem is freed through the per-cache free_func on failure.
  */
 int
 xfs_mru_cache_insert(
@@ -421,14 +423,15 @@ xfs_mru_cache_insert(
 	unsigned long		key,
 	struct xfs_mru_cache_elem *elem)
 {
-	int			error;
+	int			error = -EINVAL;
 
 	ASSERT(mru && mru->lists);
 	if (!mru || !mru->lists)
-		return -EINVAL;
+		goto out_free;
 
+	error = -ENOMEM;
 	if (radix_tree_preload(GFP_KERNEL))
-		return -ENOMEM;
+		goto out_free;
 
 	INIT_LIST_HEAD(&elem->list_node);
 	elem->key = key;
@@ -440,6 +443,12 @@ xfs_mru_cache_insert(
 		_xfs_mru_cache_list_insert(mru, elem);
 	spin_unlock(&mru->lock);
 
+	if (error)
+		goto out_free;
+	return 0;
+
+out_free:
+	mru->free_func(mru->data, elem);
 	return error;
 }
 
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/2] xfs: add inode to zone caching for data placement
  2025-05-14 10:50 [PATCH 0/2] Add mru cache for inode to zone allocation mapping Hans Holmberg
  2025-05-14 10:50 ` [PATCH 1/2] xfs: free the item in xfs_mru_cache_insert on failure Hans Holmberg
@ 2025-05-14 10:50 ` Hans Holmberg
  2025-05-14 13:00   ` hch
  2025-05-14 16:05   ` Darrick J. Wong
  2025-05-14 13:00 ` [PATCH 0/2] Add mru cache for inode to zone allocation mapping hch
  2025-05-14 17:30 ` Carlos Maiolino
  3 siblings, 2 replies; 12+ messages in thread
From: Hans Holmberg @ 2025-05-14 10:50 UTC (permalink / raw)
  To: linux-xfs@vger.kernel.org
  Cc: Carlos Maiolino, Dave Chinner, Darrick J . Wong, hch,
	linux-kernel@vger.kernel.org, Hans Holmberg

Placing data from the same file in the same zone is a great heuristic
for reducing write amplification and we do this already - but only
for sequential writes.

To support placing data in the same way for random writes, reuse the
xfs mru cache to map inodes to open zones on first write. If a mapping
is present, use the open zone for data placement for this file until
the zone is full.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
---
 fs/xfs/xfs_mount.h      |   1 +
 fs/xfs/xfs_zone_alloc.c | 109 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 110 insertions(+)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index e5192c12e7ac..f90c0a16766f 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -230,6 +230,7 @@ typedef struct xfs_mount {
 	bool			m_update_sb;	/* sb needs update in mount */
 	unsigned int		m_max_open_zones;
 	unsigned int		m_zonegc_low_space;
+	struct xfs_mru_cache	*m_zone_cache;  /* Inode to open zone cache */
 
 	/*
 	 * Bitsets of per-fs metadata that have been checked and/or are sick.
diff --git a/fs/xfs/xfs_zone_alloc.c b/fs/xfs/xfs_zone_alloc.c
index d509e49b2aaa..80add26c0111 100644
--- a/fs/xfs/xfs_zone_alloc.c
+++ b/fs/xfs/xfs_zone_alloc.c
@@ -24,6 +24,7 @@
 #include "xfs_zone_priv.h"
 #include "xfs_zones.h"
 #include "xfs_trace.h"
+#include "xfs_mru_cache.h"
 
 void
 xfs_open_zone_put(
@@ -796,6 +797,100 @@ xfs_submit_zoned_bio(
 	submit_bio(&ioend->io_bio);
 }
 
+/*
+ * Cache the last zone written to for an inode so that it is considered first
+ * for subsequent writes.
+ */
+struct xfs_zone_cache_item {
+	struct xfs_mru_cache_elem	mru;
+	struct xfs_open_zone		*oz;
+};
+
+static inline struct xfs_zone_cache_item *
+xfs_zone_cache_item(struct xfs_mru_cache_elem *mru)
+{
+	return container_of(mru, struct xfs_zone_cache_item, mru);
+}
+
+static void
+xfs_zone_cache_free_func(
+	void				*data,
+	struct xfs_mru_cache_elem	*mru)
+{
+	struct xfs_zone_cache_item	*item = xfs_zone_cache_item(mru);
+
+	xfs_open_zone_put(item->oz);
+	kfree(item);
+}
+
+/*
+ * Check if we have a cached last open zone available for the inode and
+ * if yes return a reference to it.
+ */
+static struct xfs_open_zone *
+xfs_cached_zone(
+	struct xfs_mount		*mp,
+	struct xfs_inode		*ip)
+{
+	struct xfs_mru_cache_elem	*mru;
+	struct xfs_open_zone		*oz;
+
+	mru = xfs_mru_cache_lookup(mp->m_zone_cache, ip->i_ino);
+	if (!mru)
+		return NULL;
+	oz = xfs_zone_cache_item(mru)->oz;
+	if (oz) {
+		/*
+		 * GC only steals open zones at mount time, so no GC zones
+		 * should end up in the cache.
+		 */
+		ASSERT(!oz->oz_is_gc);
+		ASSERT(atomic_read(&oz->oz_ref) > 0);
+		atomic_inc(&oz->oz_ref);
+	}
+	xfs_mru_cache_done(mp->m_zone_cache);
+	return oz;
+}
+
+/*
+ * Update the last used zone cache for a given inode.
+ *
+ * The caller must have a reference on the open zone.
+ */
+static void
+xfs_zone_cache_create_association(
+	struct xfs_inode		*ip,
+	struct xfs_open_zone		*oz)
+{
+	struct xfs_mount		*mp = ip->i_mount;
+	struct xfs_zone_cache_item	*item = NULL;
+	struct xfs_mru_cache_elem	*mru;
+
+	ASSERT(atomic_read(&oz->oz_ref) > 0);
+	atomic_inc(&oz->oz_ref);
+
+	mru = xfs_mru_cache_lookup(mp->m_zone_cache, ip->i_ino);
+	if (mru) {
+		/*
+		 * If we have an association already, update it to point to the
+		 * new zone.
+		 */
+		item = xfs_zone_cache_item(mru);
+		xfs_open_zone_put(item->oz);
+		item->oz = oz;
+		xfs_mru_cache_done(mp->m_zone_cache);
+		return;
+	}
+
+	item = kmalloc(sizeof(*item), GFP_KERNEL);
+	if (!item) {
+		xfs_open_zone_put(oz);
+		return;
+	}
+	item->oz = oz;
+	xfs_mru_cache_insert(mp->m_zone_cache, ip->i_ino, &item->mru);
+}
+
 void
 xfs_zone_alloc_and_submit(
 	struct iomap_ioend	*ioend,
@@ -819,11 +914,16 @@ xfs_zone_alloc_and_submit(
 	 */
 	if (!*oz && ioend->io_offset)
 		*oz = xfs_last_used_zone(ioend);
+	if (!*oz)
+		*oz = xfs_cached_zone(mp, ip);
+
 	if (!*oz) {
 select_zone:
 		*oz = xfs_select_zone(mp, write_hint, pack_tight);
 		if (!*oz)
 			goto out_error;
+
+		xfs_zone_cache_create_association(ip, *oz);
 	}
 
 	alloc_len = xfs_zone_alloc_blocks(*oz, XFS_B_TO_FSB(mp, ioend->io_size),
@@ -1211,6 +1311,14 @@ xfs_mount_zones(
 	error = xfs_zone_gc_mount(mp);
 	if (error)
 		goto out_free_zone_info;
+
+	/*
+	 * Set up a mru cache to track inode to open zone for data placement
+	 * purposes. The magic values for group count and life time is the
+	 * same as the defaults for file streams, which seems sane enough.
+	 */
+	xfs_mru_cache_create(&mp->m_zone_cache, mp,
+			5000, 10, xfs_zone_cache_free_func);
 	return 0;
 
 out_free_zone_info:
@@ -1224,4 +1332,5 @@ xfs_unmount_zones(
 {
 	xfs_zone_gc_unmount(mp);
 	xfs_free_zone_info(mp->m_zone_info);
+	xfs_mru_cache_destroy(mp->m_zone_cache);
 }
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] Add mru cache for inode to zone allocation mapping
  2025-05-14 10:50 [PATCH 0/2] Add mru cache for inode to zone allocation mapping Hans Holmberg
  2025-05-14 10:50 ` [PATCH 1/2] xfs: free the item in xfs_mru_cache_insert on failure Hans Holmberg
  2025-05-14 10:50 ` [PATCH 2/2] xfs: add inode to zone caching for data placement Hans Holmberg
@ 2025-05-14 13:00 ` hch
  2025-05-14 13:51   ` Carlos Maiolino
  2025-05-14 13:52   ` Carlos Maiolino
  2025-05-14 17:30 ` Carlos Maiolino
  3 siblings, 2 replies; 12+ messages in thread
From: hch @ 2025-05-14 13:00 UTC (permalink / raw)
  To: Hans Holmberg
  Cc: linux-xfs@vger.kernel.org, Carlos Maiolino, Dave Chinner,
	Darrick J . Wong, hch, linux-kernel@vger.kernel.org

On Wed, May 14, 2025 at 10:50:36AM +0000, Hans Holmberg wrote:
> While I was initially concerned by adding overhead to the allocation
> path, the cache actually reduces it as as we avoid going through the
> zone allocation algorithm for every random write.
> 
> When I run a fio workload with 16 writers to different files in
> parallel, bs=8k, iodepth=4, size=1G, I get these throughputs:
> 
> baseline	with_cache
> 774 MB/s	858 MB/s (+11%)
> 
> (averaged over three runs ech on a nullblk device)
> 
> I see similar, figures when benchmarking on a zns nvme drive (+17%).

Very nice!

These should probably go into the commit message for patch 2 so they
are recorded.  Carlos, is that something you can do when applying?


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] xfs: add inode to zone caching for data placement
  2025-05-14 10:50 ` [PATCH 2/2] xfs: add inode to zone caching for data placement Hans Holmberg
@ 2025-05-14 13:00   ` hch
  2025-05-14 16:05   ` Darrick J. Wong
  1 sibling, 0 replies; 12+ messages in thread
From: hch @ 2025-05-14 13:00 UTC (permalink / raw)
  To: Hans Holmberg
  Cc: linux-xfs@vger.kernel.org, Carlos Maiolino, Dave Chinner,
	Darrick J . Wong, hch, linux-kernel@vger.kernel.org

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] Add mru cache for inode to zone allocation mapping
  2025-05-14 13:00 ` [PATCH 0/2] Add mru cache for inode to zone allocation mapping hch
@ 2025-05-14 13:51   ` Carlos Maiolino
  2025-05-14 13:55     ` hch
  2025-05-14 13:52   ` Carlos Maiolino
  1 sibling, 1 reply; 12+ messages in thread
From: Carlos Maiolino @ 2025-05-14 13:51 UTC (permalink / raw)
  To: hch
  Cc: Hans Holmberg, linux-xfs@vger.kernel.org, Dave Chinner,
	Darrick J . Wong, linux-kernel@vger.kernel.org

On Wed, May 14, 2025 at 03:00:14PM +0200, hch wrote:
> On Wed, May 14, 2025 at 10:50:36AM +0000, Hans Holmberg wrote:
> > While I was initially concerned by adding overhead to the allocation
> > path, the cache actually reduces it as as we avoid going through the
> > zone allocation algorithm for every random write.
> >
> > When I run a fio workload with 16 writers to different files in
> > parallel, bs=8k, iodepth=4, size=1G, I get these throughputs:
> >
> > baseline	with_cache
> > 774 MB/s	858 MB/s (+11%)
> >
> > (averaged over three runs ech on a nullblk device)
> >
> > I see similar, figures when benchmarking on a zns nvme drive (+17%).
> 
> Very nice!
> 
> These should probably go into the commit message for patch 2 so they
> are recorded.  Carlos, is that something you can do when applying?
> 
Absolutely. Could you RwB patch 1? I just got your RwB on patch 2.

I'll add this to the tree today, I need to do another rebase anyway.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] Add mru cache for inode to zone allocation mapping
  2025-05-14 13:00 ` [PATCH 0/2] Add mru cache for inode to zone allocation mapping hch
  2025-05-14 13:51   ` Carlos Maiolino
@ 2025-05-14 13:52   ` Carlos Maiolino
  1 sibling, 0 replies; 12+ messages in thread
From: Carlos Maiolino @ 2025-05-14 13:52 UTC (permalink / raw)
  To: hch
  Cc: Hans Holmberg, linux-xfs@vger.kernel.org, Dave Chinner,
	Darrick J . Wong, linux-kernel@vger.kernel.org

On Wed, May 14, 2025 at 03:00:14PM +0200, hch wrote:
> On Wed, May 14, 2025 at 10:50:36AM +0000, Hans Holmberg wrote:
> > While I was initially concerned by adding overhead to the allocation
> > path, the cache actually reduces it as as we avoid going through the
> > zone allocation algorithm for every random write.
> >
> > When I run a fio workload with 16 writers to different files in
> > parallel, bs=8k, iodepth=4, size=1G, I get these throughputs:
> >
> > baseline	with_cache
> > 774 MB/s	858 MB/s (+11%)
> >
> > (averaged over three runs ech on a nullblk device)
> >
> > I see similar, figures when benchmarking on a zns nvme drive (+17%).
> 
> Very nice!
> 
> These should probably go into the commit message for patch 2 so they
> are recorded.  Carlos, is that something you can do when applying?
> 

Nvm, I just noticed you as the author. I'll review it

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] Add mru cache for inode to zone allocation mapping
  2025-05-14 13:51   ` Carlos Maiolino
@ 2025-05-14 13:55     ` hch
  0 siblings, 0 replies; 12+ messages in thread
From: hch @ 2025-05-14 13:55 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: hch, Hans Holmberg, linux-xfs@vger.kernel.org, Dave Chinner,
	Darrick J . Wong, linux-kernel@vger.kernel.org

On Wed, May 14, 2025 at 03:51:54PM +0200, Carlos Maiolino wrote:
> Absolutely. Could you RwB patch 1? I just got your RwB on patch 2.

I wrote it, so I can't review it.  But Hans has reviewed it as part
of passing it own, but that's designated by a signoff.

I'd be happy to wait for another review, though.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] xfs: free the item in xfs_mru_cache_insert on failure
  2025-05-14 10:50 ` [PATCH 1/2] xfs: free the item in xfs_mru_cache_insert on failure Hans Holmberg
@ 2025-05-14 13:59   ` Carlos Maiolino
  2025-05-14 16:04   ` Darrick J. Wong
  1 sibling, 0 replies; 12+ messages in thread
From: Carlos Maiolino @ 2025-05-14 13:59 UTC (permalink / raw)
  To: Hans Holmberg
  Cc: linux-xfs@vger.kernel.org, Dave Chinner, Darrick J . Wong, hch,
	linux-kernel@vger.kernel.org

On Wed, May 14, 2025 at 10:50:37AM +0000, Hans Holmberg wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> Call the provided free_func when xfs_mru_cache_insert as that's what
> the callers need to do anyway.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> ---
>  fs/xfs/xfs_filestream.c | 15 ++++-----------
>  fs/xfs/xfs_mru_cache.c  | 15 ++++++++++++---
>  2 files changed, 16 insertions(+), 14 deletions(-)
> 

Looks fine.

Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

> diff --git a/fs/xfs/xfs_filestream.c b/fs/xfs/xfs_filestream.c
> index a961aa420c48..044918fbae06 100644
> --- a/fs/xfs/xfs_filestream.c
> +++ b/fs/xfs/xfs_filestream.c
> @@ -304,11 +304,9 @@ xfs_filestream_create_association(
>  	 * for us, so all we need to do here is take another active reference to
>  	 * the perag for the cached association.
>  	 *
> -	 * If we fail to store the association, we need to drop the fstrms
> -	 * counter as well as drop the perag reference we take here for the
> -	 * item. We do not need to return an error for this failure - as long as
> -	 * we return a referenced AG, the allocation can still go ahead just
> -	 * fine.
> +	 * If we fail to store the association, we do not need to return an
> +	 * error for this failure - as long as we return a referenced AG, the
> +	 * allocation can still go ahead just fine.
>  	 */
>  	item = kmalloc(sizeof(*item), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
>  	if (!item)
> @@ -316,14 +314,9 @@ xfs_filestream_create_association(
> 
>  	atomic_inc(&pag_group(args->pag)->xg_active_ref);
>  	item->pag = args->pag;
> -	error = xfs_mru_cache_insert(mp->m_filestream, pino, &item->mru);
> -	if (error)
> -		goto out_free_item;
> +	xfs_mru_cache_insert(mp->m_filestream, pino, &item->mru);
>  	return 0;
> 
> -out_free_item:
> -	xfs_perag_rele(item->pag);
> -	kfree(item);
>  out_put_fstrms:
>  	atomic_dec(&args->pag->pagf_fstrms);
>  	return 0;
> diff --git a/fs/xfs/xfs_mru_cache.c b/fs/xfs/xfs_mru_cache.c
> index d0f5b403bdbe..08443ceec329 100644
> --- a/fs/xfs/xfs_mru_cache.c
> +++ b/fs/xfs/xfs_mru_cache.c
> @@ -414,6 +414,8 @@ xfs_mru_cache_destroy(
>   * To insert an element, call xfs_mru_cache_insert() with the data store, the
>   * element's key and the client data pointer.  This function returns 0 on
>   * success or ENOMEM if memory for the data element couldn't be allocated.
> + *
> + * The passed in elem is freed through the per-cache free_func on failure.
>   */
>  int
>  xfs_mru_cache_insert(
> @@ -421,14 +423,15 @@ xfs_mru_cache_insert(
>  	unsigned long		key,
>  	struct xfs_mru_cache_elem *elem)
>  {
> -	int			error;
> +	int			error = -EINVAL;
> 
>  	ASSERT(mru && mru->lists);
>  	if (!mru || !mru->lists)
> -		return -EINVAL;
> +		goto out_free;
> 
> +	error = -ENOMEM;
>  	if (radix_tree_preload(GFP_KERNEL))
> -		return -ENOMEM;
> +		goto out_free;
> 
>  	INIT_LIST_HEAD(&elem->list_node);
>  	elem->key = key;
> @@ -440,6 +443,12 @@ xfs_mru_cache_insert(
>  		_xfs_mru_cache_list_insert(mru, elem);
>  	spin_unlock(&mru->lock);
> 
> +	if (error)
> +		goto out_free;
> +	return 0;
> +
> +out_free:
> +	mru->free_func(mru->data, elem);
>  	return error;
>  }
> 
> --
> 2.34.1

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] xfs: free the item in xfs_mru_cache_insert on failure
  2025-05-14 10:50 ` [PATCH 1/2] xfs: free the item in xfs_mru_cache_insert on failure Hans Holmberg
  2025-05-14 13:59   ` Carlos Maiolino
@ 2025-05-14 16:04   ` Darrick J. Wong
  1 sibling, 0 replies; 12+ messages in thread
From: Darrick J. Wong @ 2025-05-14 16:04 UTC (permalink / raw)
  To: Hans Holmberg
  Cc: linux-xfs@vger.kernel.org, Carlos Maiolino, Dave Chinner, hch,
	linux-kernel@vger.kernel.org

On Wed, May 14, 2025 at 10:50:37AM +0000, Hans Holmberg wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> Call the provided free_func when xfs_mru_cache_insert as that's what
> the callers need to do anyway.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>

Looks ok,
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_filestream.c | 15 ++++-----------
>  fs/xfs/xfs_mru_cache.c  | 15 ++++++++++++---
>  2 files changed, 16 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/xfs_filestream.c b/fs/xfs/xfs_filestream.c
> index a961aa420c48..044918fbae06 100644
> --- a/fs/xfs/xfs_filestream.c
> +++ b/fs/xfs/xfs_filestream.c
> @@ -304,11 +304,9 @@ xfs_filestream_create_association(
>  	 * for us, so all we need to do here is take another active reference to
>  	 * the perag for the cached association.
>  	 *
> -	 * If we fail to store the association, we need to drop the fstrms
> -	 * counter as well as drop the perag reference we take here for the
> -	 * item. We do not need to return an error for this failure - as long as
> -	 * we return a referenced AG, the allocation can still go ahead just
> -	 * fine.
> +	 * If we fail to store the association, we do not need to return an
> +	 * error for this failure - as long as we return a referenced AG, the
> +	 * allocation can still go ahead just fine.
>  	 */
>  	item = kmalloc(sizeof(*item), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
>  	if (!item)
> @@ -316,14 +314,9 @@ xfs_filestream_create_association(
>  
>  	atomic_inc(&pag_group(args->pag)->xg_active_ref);
>  	item->pag = args->pag;
> -	error = xfs_mru_cache_insert(mp->m_filestream, pino, &item->mru);
> -	if (error)
> -		goto out_free_item;
> +	xfs_mru_cache_insert(mp->m_filestream, pino, &item->mru);
>  	return 0;
>  
> -out_free_item:
> -	xfs_perag_rele(item->pag);
> -	kfree(item);
>  out_put_fstrms:
>  	atomic_dec(&args->pag->pagf_fstrms);
>  	return 0;
> diff --git a/fs/xfs/xfs_mru_cache.c b/fs/xfs/xfs_mru_cache.c
> index d0f5b403bdbe..08443ceec329 100644
> --- a/fs/xfs/xfs_mru_cache.c
> +++ b/fs/xfs/xfs_mru_cache.c
> @@ -414,6 +414,8 @@ xfs_mru_cache_destroy(
>   * To insert an element, call xfs_mru_cache_insert() with the data store, the
>   * element's key and the client data pointer.  This function returns 0 on
>   * success or ENOMEM if memory for the data element couldn't be allocated.
> + *
> + * The passed in elem is freed through the per-cache free_func on failure.
>   */
>  int
>  xfs_mru_cache_insert(
> @@ -421,14 +423,15 @@ xfs_mru_cache_insert(
>  	unsigned long		key,
>  	struct xfs_mru_cache_elem *elem)
>  {
> -	int			error;
> +	int			error = -EINVAL;
>  
>  	ASSERT(mru && mru->lists);
>  	if (!mru || !mru->lists)
> -		return -EINVAL;
> +		goto out_free;
>  
> +	error = -ENOMEM;
>  	if (radix_tree_preload(GFP_KERNEL))
> -		return -ENOMEM;
> +		goto out_free;
>  
>  	INIT_LIST_HEAD(&elem->list_node);
>  	elem->key = key;
> @@ -440,6 +443,12 @@ xfs_mru_cache_insert(
>  		_xfs_mru_cache_list_insert(mru, elem);
>  	spin_unlock(&mru->lock);
>  
> +	if (error)
> +		goto out_free;
> +	return 0;
> +
> +out_free:
> +	mru->free_func(mru->data, elem);
>  	return error;
>  }
>  
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] xfs: add inode to zone caching for data placement
  2025-05-14 10:50 ` [PATCH 2/2] xfs: add inode to zone caching for data placement Hans Holmberg
  2025-05-14 13:00   ` hch
@ 2025-05-14 16:05   ` Darrick J. Wong
  1 sibling, 0 replies; 12+ messages in thread
From: Darrick J. Wong @ 2025-05-14 16:05 UTC (permalink / raw)
  To: Hans Holmberg
  Cc: linux-xfs@vger.kernel.org, Carlos Maiolino, Dave Chinner, hch,
	linux-kernel@vger.kernel.org

On Wed, May 14, 2025 at 10:50:37AM +0000, Hans Holmberg wrote:
> Placing data from the same file in the same zone is a great heuristic
> for reducing write amplification and we do this already - but only
> for sequential writes.
> 
> To support placing data in the same way for random writes, reuse the
> xfs mru cache to map inodes to open zones on first write. If a mapping
> is present, use the open zone for data placement for this file until
> the zone is full.
> 
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>

Odd, did my rvb from last time get dropped?  This doesn't look like a
huge change to me...

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_mount.h      |   1 +
>  fs/xfs/xfs_zone_alloc.c | 109 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 110 insertions(+)
> 
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index e5192c12e7ac..f90c0a16766f 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -230,6 +230,7 @@ typedef struct xfs_mount {
>  	bool			m_update_sb;	/* sb needs update in mount */
>  	unsigned int		m_max_open_zones;
>  	unsigned int		m_zonegc_low_space;
> +	struct xfs_mru_cache	*m_zone_cache;  /* Inode to open zone cache */
>  
>  	/*
>  	 * Bitsets of per-fs metadata that have been checked and/or are sick.
> diff --git a/fs/xfs/xfs_zone_alloc.c b/fs/xfs/xfs_zone_alloc.c
> index d509e49b2aaa..80add26c0111 100644
> --- a/fs/xfs/xfs_zone_alloc.c
> +++ b/fs/xfs/xfs_zone_alloc.c
> @@ -24,6 +24,7 @@
>  #include "xfs_zone_priv.h"
>  #include "xfs_zones.h"
>  #include "xfs_trace.h"
> +#include "xfs_mru_cache.h"
>  
>  void
>  xfs_open_zone_put(
> @@ -796,6 +797,100 @@ xfs_submit_zoned_bio(
>  	submit_bio(&ioend->io_bio);
>  }
>  
> +/*
> + * Cache the last zone written to for an inode so that it is considered first
> + * for subsequent writes.
> + */
> +struct xfs_zone_cache_item {
> +	struct xfs_mru_cache_elem	mru;
> +	struct xfs_open_zone		*oz;
> +};
> +
> +static inline struct xfs_zone_cache_item *
> +xfs_zone_cache_item(struct xfs_mru_cache_elem *mru)
> +{
> +	return container_of(mru, struct xfs_zone_cache_item, mru);
> +}
> +
> +static void
> +xfs_zone_cache_free_func(
> +	void				*data,
> +	struct xfs_mru_cache_elem	*mru)
> +{
> +	struct xfs_zone_cache_item	*item = xfs_zone_cache_item(mru);
> +
> +	xfs_open_zone_put(item->oz);
> +	kfree(item);
> +}
> +
> +/*
> + * Check if we have a cached last open zone available for the inode and
> + * if yes return a reference to it.
> + */
> +static struct xfs_open_zone *
> +xfs_cached_zone(
> +	struct xfs_mount		*mp,
> +	struct xfs_inode		*ip)
> +{
> +	struct xfs_mru_cache_elem	*mru;
> +	struct xfs_open_zone		*oz;
> +
> +	mru = xfs_mru_cache_lookup(mp->m_zone_cache, ip->i_ino);
> +	if (!mru)
> +		return NULL;
> +	oz = xfs_zone_cache_item(mru)->oz;
> +	if (oz) {
> +		/*
> +		 * GC only steals open zones at mount time, so no GC zones
> +		 * should end up in the cache.
> +		 */
> +		ASSERT(!oz->oz_is_gc);
> +		ASSERT(atomic_read(&oz->oz_ref) > 0);
> +		atomic_inc(&oz->oz_ref);
> +	}
> +	xfs_mru_cache_done(mp->m_zone_cache);
> +	return oz;
> +}
> +
> +/*
> + * Update the last used zone cache for a given inode.
> + *
> + * The caller must have a reference on the open zone.
> + */
> +static void
> +xfs_zone_cache_create_association(
> +	struct xfs_inode		*ip,
> +	struct xfs_open_zone		*oz)
> +{
> +	struct xfs_mount		*mp = ip->i_mount;
> +	struct xfs_zone_cache_item	*item = NULL;
> +	struct xfs_mru_cache_elem	*mru;
> +
> +	ASSERT(atomic_read(&oz->oz_ref) > 0);
> +	atomic_inc(&oz->oz_ref);
> +
> +	mru = xfs_mru_cache_lookup(mp->m_zone_cache, ip->i_ino);
> +	if (mru) {
> +		/*
> +		 * If we have an association already, update it to point to the
> +		 * new zone.
> +		 */
> +		item = xfs_zone_cache_item(mru);
> +		xfs_open_zone_put(item->oz);
> +		item->oz = oz;
> +		xfs_mru_cache_done(mp->m_zone_cache);
> +		return;
> +	}
> +
> +	item = kmalloc(sizeof(*item), GFP_KERNEL);
> +	if (!item) {
> +		xfs_open_zone_put(oz);
> +		return;
> +	}
> +	item->oz = oz;
> +	xfs_mru_cache_insert(mp->m_zone_cache, ip->i_ino, &item->mru);
> +}
> +
>  void
>  xfs_zone_alloc_and_submit(
>  	struct iomap_ioend	*ioend,
> @@ -819,11 +914,16 @@ xfs_zone_alloc_and_submit(
>  	 */
>  	if (!*oz && ioend->io_offset)
>  		*oz = xfs_last_used_zone(ioend);
> +	if (!*oz)
> +		*oz = xfs_cached_zone(mp, ip);
> +
>  	if (!*oz) {
>  select_zone:
>  		*oz = xfs_select_zone(mp, write_hint, pack_tight);
>  		if (!*oz)
>  			goto out_error;
> +
> +		xfs_zone_cache_create_association(ip, *oz);
>  	}
>  
>  	alloc_len = xfs_zone_alloc_blocks(*oz, XFS_B_TO_FSB(mp, ioend->io_size),
> @@ -1211,6 +1311,14 @@ xfs_mount_zones(
>  	error = xfs_zone_gc_mount(mp);
>  	if (error)
>  		goto out_free_zone_info;
> +
> +	/*
> +	 * Set up a mru cache to track inode to open zone for data placement
> +	 * purposes. The magic values for group count and life time is the
> +	 * same as the defaults for file streams, which seems sane enough.
> +	 */
> +	xfs_mru_cache_create(&mp->m_zone_cache, mp,
> +			5000, 10, xfs_zone_cache_free_func);
>  	return 0;
>  
>  out_free_zone_info:
> @@ -1224,4 +1332,5 @@ xfs_unmount_zones(
>  {
>  	xfs_zone_gc_unmount(mp);
>  	xfs_free_zone_info(mp->m_zone_info);
> +	xfs_mru_cache_destroy(mp->m_zone_cache);
>  }
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] Add mru cache for inode to zone allocation mapping
  2025-05-14 10:50 [PATCH 0/2] Add mru cache for inode to zone allocation mapping Hans Holmberg
                   ` (2 preceding siblings ...)
  2025-05-14 13:00 ` [PATCH 0/2] Add mru cache for inode to zone allocation mapping hch
@ 2025-05-14 17:30 ` Carlos Maiolino
  3 siblings, 0 replies; 12+ messages in thread
From: Carlos Maiolino @ 2025-05-14 17:30 UTC (permalink / raw)
  To: linux-xfs, Hans Holmberg
  Cc: Dave Chinner, Darrick J . Wong, hch, linux-kernel

On Wed, 14 May 2025 10:50:36 +0000, Hans Holmberg wrote:
> These patches cleans up the xfs mru code a bit and adds a cache for
> keeping track of which zone an inode allocated data to last. Placing
> file data in the same zone helps reduce garbage collection overhead,
> and with this patch we add support per-file co-location for random
> writes.
> 
> While I was initially concerned by adding overhead to the allocation
> path, the cache actually reduces it as as we avoid going through the
> zone allocation algorithm for every random write.
> 
> [...]

Applied to for-next, thanks!

[1/2] xfs: free the item in xfs_mru_cache_insert on failure
      commit: 70b95cb86513d7f6d084ddc8e961a1cab9022e14
[2/2] xfs: add inode to zone caching for data placement
      commit: f3e2e53823b98d55da46ea72d32ac18bd7709c33

Best regards,
-- 
Carlos Maiolino <cem@kernel.org>


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-05-14 17:30 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-14 10:50 [PATCH 0/2] Add mru cache for inode to zone allocation mapping Hans Holmberg
2025-05-14 10:50 ` [PATCH 1/2] xfs: free the item in xfs_mru_cache_insert on failure Hans Holmberg
2025-05-14 13:59   ` Carlos Maiolino
2025-05-14 16:04   ` Darrick J. Wong
2025-05-14 10:50 ` [PATCH 2/2] xfs: add inode to zone caching for data placement Hans Holmberg
2025-05-14 13:00   ` hch
2025-05-14 16:05   ` Darrick J. Wong
2025-05-14 13:00 ` [PATCH 0/2] Add mru cache for inode to zone allocation mapping hch
2025-05-14 13:51   ` Carlos Maiolino
2025-05-14 13:55     ` hch
2025-05-14 13:52   ` Carlos Maiolino
2025-05-14 17:30 ` Carlos Maiolino

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.