* [PATCH v2 0/5] write streams and xfs spatial isolation
@ 2026-03-09 5:29 ` Kanchan Joshi
2026-03-09 5:29 ` [PATCH v2 1/5] fs: add generic write-stream management ioctl Kanchan Joshi
` (5 more replies)
0 siblings, 6 replies; 22+ messages in thread
From: Kanchan Joshi @ 2026-03-09 5:29 UTC (permalink / raw)
To: brauner, hch, djwong, jack, cem, kbusch, axboe
Cc: linux-xfs, linux-fsdevel, gost.dev, Kanchan Joshi
This series introduces a generic interface for write stream management,
enabling applications to guide data placement on files that support it.
It also introduces spatial isolation and software write streams in xfs.
Write streams enable collaborative data placement, allowing the
abstraction provider to leverage application intent.
- application: sends grouping/isolation intent with a stream id.
- xfs: maps streams to AGs; allocates without interleaving; gains higher
concurrency and less lock contention.
- hardware: maps streams to the underlying allocation unit; reduces
device-internal write amplification, improves device life, and gives
predictable QoS.
### Changelog
since v1:
https://lore.kernel.org/linux-fsdevel/20260216052540.217920-1-joshi.k@samsung.com/
- switch from fcntl-based to ioctl-based interface (Christian)
- new patch (#4) that makes xfs allocator use the write streams for AG
selection
- new patch (#5) that introduces software write streams in xfs.
### Application interface
New vfs ioctl 'FS_IOC_WRITE_STREAM'.
The application communicates the intended operation using the 'op_flags'
field of the passed 'struct fs_write_stream'.
Valid flags are:
FS_WRITE_STREAM_OP_GET_MAX: Returns the number of available streams.
FS_WRITE_STREAM_OP_SET: Assign a specific stream value to the file.
FS_WRITE_STREAM_OP_GET: Query what stream value is set on the file.
### Comparison with Write Hints (RWH_WRITE_LIFE_*)
- Semantics: Write Hints describe 'data temperature' (e.g.,
short/long/extreme), implying a lifetime. Write Streams describe 'data
placement' (e.g., Bin 1/Bin 2), implying only separation.
- Scalability: Write Hints are limited to a small, fixed enum (6
values). Write streams are dynamic, provider-dependent values that can
scale much higher (kernel limit: up to 255 due to u8 field).
- Discovery: The existing write-hint interface is advisory and decoupled
from underlying capabilities; an application has no way to probe support
and cannot deterministically know which hints are valid. Write streams,
on the other hand, provide explicit discovery.
Note: within the kernel, the separation between the two constructs
(write-hint and write-stream) already started in 6.16.
### Interface example
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <string.h>
#include <errno.h>

/* Duplicate the kernel UAPI definitions */
struct fs_write_stream {
	uint32_t op_flags;
	uint32_t stream_id;
	uint32_t max_streams;
	uint32_t __reserved;
};

#define FS_WRITE_STREAM_OP_GET_MAX (1 << 0)
#define FS_WRITE_STREAM_OP_GET (1 << 1)
#define FS_WRITE_STREAM_OP_SET (1 << 2)
#define FS_IOC_WRITE_STREAM _IOWR('f', 43, struct fs_write_stream)

void print_usage(const char *progname) {
	fprintf(stderr, "Usage:\n");
	fprintf(stderr, " %s <file> max - Get max supported streams\n", progname);
	fprintf(stderr, " %s <file> get - Get current stream ID\n", progname);
	fprintf(stderr, " %s <file> set <id> - Set stream ID\n", progname);
	exit(EXIT_FAILURE);
}

int main(int argc, char *argv[]) {
	if (argc < 3)
		print_usage(argv[0]);

	const char *filepath = argv[1];
	const char *cmd = argv[2];

	int fd = open(filepath, O_RDWR);
	if (fd < 0) {
		perror("Error opening file");
		return EXIT_FAILURE;
	}

	struct fs_write_stream req;
	memset(&req, 0, sizeof(req));

	if (strcmp(cmd, "max") == 0) {
		req.op_flags = FS_WRITE_STREAM_OP_GET_MAX;
		if (ioctl(fd, FS_IOC_WRITE_STREAM, &req) < 0) {
			perror("ioctl(GET_MAX) failed");
			close(fd);
			return EXIT_FAILURE;
		}
		printf("Max streams supported: %u\n", req.max_streams);
	} else if (strcmp(cmd, "get") == 0) {
		req.op_flags = FS_WRITE_STREAM_OP_GET;
		if (ioctl(fd, FS_IOC_WRITE_STREAM, &req) < 0) {
			perror("ioctl(GET) failed");
			close(fd);
			return EXIT_FAILURE;
		}
		printf("Current stream ID: %u\n", req.stream_id);
	} else if (strcmp(cmd, "set") == 0) {
		if (argc != 4)
			print_usage(argv[0]);
		req.op_flags = FS_WRITE_STREAM_OP_SET;
		req.stream_id = atoi(argv[3]);
		if (ioctl(fd, FS_IOC_WRITE_STREAM, &req) < 0) {
			perror("ioctl(SET) failed");
			close(fd);
			return EXIT_FAILURE;
		}
		printf("Set stream ID to: %u\n", req.stream_id);
	} else {
		fprintf(stderr, "Unknown command: %s\n", cmd);
		close(fd);
		print_usage(argv[0]);
	}

	close(fd);
	return EXIT_SUCCESS;
}

Kanchan Joshi (5):
fs: add generic write-stream management ioctl
iomap: introduce and propagate write_stream
xfs: implement write-stream management support
xfs: steer allocation using write stream
xfs: introduce software write streams
fs/iomap/direct-io.c | 1 +
fs/iomap/ioend.c | 3 ++
fs/xfs/libxfs/xfs_bmap.c | 9 ++++
fs/xfs/xfs_icache.c | 1 +
fs/xfs/xfs_inode.c | 98 +++++++++++++++++++++++++++++++++++++++-
fs/xfs/xfs_inode.h | 7 +++
fs/xfs/xfs_ioctl.c | 34 ++++++++++++++
fs/xfs/xfs_iomap.c | 1 +
include/linux/iomap.h | 2 +
include/uapi/linux/fs.h | 12 +++++
10 files changed, 167 insertions(+), 1 deletion(-)
base-commit: 11439c4635edd669ae435eec308f4ab8a0804808
--
2.25.1
* [PATCH v2 1/5] fs: add generic write-stream management ioctl
2026-03-09 5:29 ` [PATCH v2 0/5] write streams and xfs spatial isolation Kanchan Joshi
@ 2026-03-09 5:29 ` Kanchan Joshi
2026-03-09 16:33 ` Darrick J. Wong
2026-03-09 5:29 ` [PATCH v2 2/5] iomap: introduce and propagate write_stream Kanchan Joshi
` (4 subsequent siblings)
5 siblings, 1 reply; 22+ messages in thread
From: Kanchan Joshi @ 2026-03-09 5:29 UTC (permalink / raw)
To: brauner, hch, djwong, jack, cem, kbusch, axboe
Cc: linux-xfs, linux-fsdevel, gost.dev, Kanchan Joshi
Wire up the userspace interface for write stream management via a new
vfs ioctl 'FS_IOC_WRITE_STREAM'.
The application communicates the intended operation using the 'op_flags'
field of the passed 'struct fs_write_stream'.
Valid flags are:
FS_WRITE_STREAM_OP_GET_MAX: Returns the number of available streams.
FS_WRITE_STREAM_OP_SET: Assign a specific stream value to the file.
FS_WRITE_STREAM_OP_GET: Query what stream value is set on the file.
Applications should first query the available streams using
FS_WRITE_STREAM_OP_GET_MAX.
If the returned value is N, valid stream values for the file are 0 to N.
Stream value 0 implies that no stream is set on the file.
Setting a value larger than the number of available streams is rejected.
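
The value rules above can be restated as a small user-space sketch. This
is only an illustration and not part of the patch: the helper names
ws_id_is_valid() and ws_pick_stream() are hypothetical, and the
round-robin mapping of application groups onto streams is one possible
policy an application might choose, not something the interface mandates.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helpers (not part of the patch) that restate the rules
 * from the commit message in user-space terms.
 */

/* Valid stream values are 0..max_streams inclusive; 0 means "no stream". */
static bool ws_id_is_valid(uint32_t stream_id, uint32_t max_streams)
{
	return stream_id <= max_streams;
}

/*
 * Map an application-defined group index (0, 1, 2, ...) onto a usable
 * stream value in 1..max_streams, wrapping around when there are more
 * groups than streams. Returns 0 (no stream) when none are available.
 */
static uint32_t ws_pick_stream(uint32_t group, uint32_t max_streams)
{
	if (max_streams == 0)
		return 0;
	return (group % max_streams) + 1;
}
```

An application would feed the max_streams value returned by
FS_WRITE_STREAM_OP_GET_MAX into these helpers before issuing
FS_WRITE_STREAM_OP_SET.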
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
include/uapi/linux/fs.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 70b2b661f42c..4d0805b52949 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -338,6 +338,18 @@ struct file_attr {
/* Get logical block metadata capability details */
#define FS_IOC_GETLBMD_CAP _IOWR(0x15, 2, struct logical_block_metadata_cap)
+struct fs_write_stream {
+ __u32 op_flags; /* IN: operation flags */
+ __u32 stream_id; /* IN/OUT: stream value to assign/query */
+ __u32 max_streams; /* OUT: max streams values supported */
+ __u32 rsvd;
+};
+
+#define FS_WRITE_STREAM_OP_GET_MAX (1 << 0)
+#define FS_WRITE_STREAM_OP_GET (1 << 1)
+#define FS_WRITE_STREAM_OP_SET (1 << 2)
+
+#define FS_IOC_WRITE_STREAM _IOWR('f', 43, struct fs_write_stream)
/*
* Inode flags (FS_IOC_GETFLAGS / FS_IOC_SETFLAGS)
*
--
2.25.1
* [PATCH v2 2/5] iomap: introduce and propagate write_stream
2026-03-09 5:29 ` [PATCH v2 0/5] write streams and xfs spatial isolation Kanchan Joshi
2026-03-09 5:29 ` [PATCH v2 1/5] fs: add generic write-stream management ioctl Kanchan Joshi
@ 2026-03-09 5:29 ` Kanchan Joshi
2026-03-09 16:34 ` Darrick J. Wong
2026-03-09 5:29 ` [PATCH v2 3/5] xfs: implement write-stream management support Kanchan Joshi
` (3 subsequent siblings)
5 siblings, 1 reply; 22+ messages in thread
From: Kanchan Joshi @ 2026-03-09 5:29 UTC (permalink / raw)
To: brauner, hch, djwong, jack, cem, kbusch, axboe
Cc: linux-xfs, linux-fsdevel, gost.dev, Kanchan Joshi
Add a new write_stream field to struct iomap. An existing padding hole is
used to place the new field.
Propagate write_stream from iomap to bio in both direct I/O and buffered
writeback paths.
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
fs/iomap/direct-io.c | 1 +
fs/iomap/ioend.c | 3 +++
include/linux/iomap.h | 2 ++
3 files changed, 6 insertions(+)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 95254aa1b654..086530c0471e 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -333,6 +333,7 @@ static ssize_t iomap_dio_bio_iter_one(struct iomap_iter *iter,
pos >> iter->inode->i_blkbits, GFP_KERNEL);
bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
bio->bi_write_hint = iter->inode->i_write_hint;
+ bio->bi_write_stream = iter->iomap.write_stream;
bio->bi_ioprio = dio->iocb->ki_ioprio;
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;
diff --git a/fs/iomap/ioend.c b/fs/iomap/ioend.c
index 4d1ef8a2cee9..6a9c8e0c7536 100644
--- a/fs/iomap/ioend.c
+++ b/fs/iomap/ioend.c
@@ -159,6 +159,7 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
GFP_NOFS, &iomap_ioend_bioset);
bio->bi_iter.bi_sector = iomap_sector(&wpc->iomap, pos);
bio->bi_write_hint = wpc->inode->i_write_hint;
+ bio->bi_write_stream = wpc->iomap.write_stream;
wbc_init_bio(wpc->wbc, bio);
wpc->nr_folios = 0;
return iomap_init_ioend(wpc->inode, bio, pos, ioend_flags);
@@ -179,6 +180,8 @@ static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos,
if (!(wpc->iomap.flags & IOMAP_F_ANON_WRITE) &&
iomap_sector(&wpc->iomap, pos) != bio_end_sector(&ioend->io_bio))
return false;
+ if (wpc->iomap.write_stream != ioend->io_bio.bi_write_stream)
+ return false;
/*
* Limit ioend bio chain lengths to minimise IO completion latency. This
* also prevents long tight loops ending page writeback on all the
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 99b7209dabd7..e087818d11d4 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -113,6 +113,8 @@ struct iomap {
u64 length; /* length of mapping, bytes */
u16 type; /* type of mapping */
u16 flags; /* flags for mapping */
+ /* 4-byte padding hole here */
+ u8 write_stream; /* write stream for I/O */
struct block_device *bdev; /* block device for I/O */
struct dax_device *dax_dev; /* dax_dev for dax operations */
void *inline_data;
--
2.25.1
* [PATCH v2 3/5] xfs: implement write-stream management support
2026-03-09 5:29 ` [PATCH v2 0/5] write streams and xfs spatial isolation Kanchan Joshi
2026-03-09 5:29 ` [PATCH v2 1/5] fs: add generic write-stream management ioctl Kanchan Joshi
2026-03-09 5:29 ` [PATCH v2 2/5] iomap: introduce and propagate write_stream Kanchan Joshi
@ 2026-03-09 5:29 ` Kanchan Joshi
2026-03-09 16:38 ` Darrick J. Wong
2026-03-09 5:29 ` [PATCH v2 4/5] xfs: steer allocation using write stream Kanchan Joshi
` (2 subsequent siblings)
5 siblings, 1 reply; 22+ messages in thread
From: Kanchan Joshi @ 2026-03-09 5:29 UTC (permalink / raw)
To: brauner, hch, djwong, jack, cem, kbusch, axboe
Cc: linux-xfs, linux-fsdevel, gost.dev, Kanchan Joshi
Implement support for FS_IOC_WRITE_STREAM ioctl.
For FS_WRITE_STREAM_OP_GET_MAX, available write streams are reported
based on the capability of the underlying block device.
For FS_WRITE_STREAM_OP_{SET/GET}, add a new i_write_stream field in xfs
inode. This value is propagated to the iomap during block mapping.
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
fs/xfs/xfs_icache.c | 1 +
fs/xfs/xfs_inode.c | 46 +++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_inode.h | 6 ++++++
fs/xfs/xfs_ioctl.c | 34 +++++++++++++++++++++++++++++++++
fs/xfs/xfs_iomap.c | 1 +
5 files changed, 88 insertions(+)
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index a7a09e7eec81..2ad8d02152f4 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -130,6 +130,7 @@ xfs_inode_alloc(
spin_lock_init(&ip->i_ioend_lock);
ip->i_next_unlinked = NULLAGINO;
ip->i_prev_unlinked = 0;
+ ip->i_write_stream = 0;
return ip;
}
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 50c0404f9064..9b88b2d1cf9a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -47,6 +47,52 @@
struct kmem_cache *xfs_inode_cache;
+int
+xfs_inode_max_write_streams(
+ struct xfs_inode *ip)
+{
+ struct xfs_mount *mp = ip->i_mount;
+ struct block_device *bdev;
+
+ if (XFS_IS_REALTIME_INODE(ip))
+ bdev = mp->m_rtdev_targp ? mp->m_rtdev_targp->bt_bdev : NULL;
+ else
+ bdev = mp->m_ddev_targp->bt_bdev;
+
+ if (!bdev)
+ return 0;
+
+ return bdev_max_write_streams(bdev);
+}
+
+uint8_t
+xfs_inode_get_write_stream(
+ struct xfs_inode *ip)
+{
+ uint8_t stream_id;
+
+ xfs_ilock(ip, XFS_ILOCK_SHARED);
+ stream_id = ip->i_write_stream;
+ xfs_iunlock(ip, XFS_ILOCK_SHARED);
+
+ return stream_id;
+}
+
+int
+xfs_inode_set_write_stream(
+ struct xfs_inode *ip,
+ uint8_t stream_id)
+{
+ if (stream_id > xfs_inode_max_write_streams(ip))
+ return -EINVAL;
+
+ xfs_ilock(ip, XFS_ILOCK_EXCL);
+ ip->i_write_stream = stream_id;
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+ return 0;
+}
+
/*
* These two are wrapper routines around the xfs_ilock() routine used to
* centralize some grungy code. They are used in places that wish to lock the
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index bd6d33557194..9f6cab729924 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -38,6 +38,9 @@ typedef struct xfs_inode {
struct xfs_ifork i_df; /* data fork */
struct xfs_ifork i_af; /* attribute fork */
+ /* Write stream information */
+ uint8_t i_write_stream;
+
/* Transaction and locking information. */
struct xfs_inode_log_item *i_itemp; /* logging information */
struct rw_semaphore i_lock; /* inode lock */
@@ -676,4 +679,7 @@ int xfs_icreate_dqalloc(const struct xfs_icreate_args *args,
struct xfs_dquot **udqpp, struct xfs_dquot **gdqpp,
struct xfs_dquot **pdqpp);
+int xfs_inode_max_write_streams(struct xfs_inode *ip);
+uint8_t xfs_inode_get_write_stream(struct xfs_inode *ip);
+int xfs_inode_set_write_stream(struct xfs_inode *ip, uint8_t stream_id);
#endif /* __XFS_INODE_H__ */
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index facffdc8dca8..091d6a8b5f57 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1160,6 +1160,38 @@ xfs_ioctl_fs_counts(
return 0;
}
+static int
+xfs_ioc_write_stream(
+ struct file *filp,
+ void __user *arg)
+{
+ struct inode *inode = file_inode(filp);
+ struct xfs_inode *ip = XFS_I(inode);
+ struct fs_write_stream ws = { };
+
+ if (copy_from_user(&ws, arg, sizeof(ws)))
+ return -EFAULT;
+
+ switch (ws.op_flags) {
+ case FS_WRITE_STREAM_OP_GET_MAX:
+ ws.max_streams = xfs_inode_max_write_streams(ip);
+ goto copy_out;
+ case FS_WRITE_STREAM_OP_GET:
+ ws.stream_id = xfs_inode_get_write_stream(ip);
+ goto copy_out;
+ case FS_WRITE_STREAM_OP_SET:
+ return xfs_inode_set_write_stream(ip, ws.stream_id);
+ default:
+ return -EINVAL;
+ }
+ return 0;
+
+copy_out:
+ if (copy_to_user(arg, &ws, sizeof(ws)))
+ return -EFAULT;
+ return 0;
+}
+
/*
* These long-unused ioctls were removed from the official ioctl API in 5.17,
* but retain these definitions so that we can log warnings about them.
@@ -1425,6 +1457,8 @@ xfs_file_ioctl(
return xfs_ioc_health_monitor(filp, arg);
case XFS_IOC_VERIFY_MEDIA:
return xfs_ioc_verify_media(filp, arg);
+ case FS_IOC_WRITE_STREAM:
+ return xfs_ioc_write_stream(filp, arg);
default:
return -ENOTTY;
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index be86d43044df..7988c9e16635 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -148,6 +148,7 @@ xfs_bmbt_to_iomap(
else
iomap->bdev = target->bt_bdev;
iomap->flags = iomap_flags;
+ iomap->write_stream = ip->i_write_stream;
/*
* If the inode is dirty for datasync purposes, let iomap know so it
--
2.25.1
* [PATCH v2 4/5] xfs: steer allocation using write stream
2026-03-09 5:29 ` [PATCH v2 0/5] write streams and xfs spatial isolation Kanchan Joshi
` (2 preceding siblings ...)
2026-03-09 5:29 ` [PATCH v2 3/5] xfs: implement write-stream management support Kanchan Joshi
@ 2026-03-09 5:29 ` Kanchan Joshi
2026-03-09 12:45 ` kernel test robot
` (2 more replies)
2026-03-09 5:29 ` [PATCH v2 5/5] xfs: introduce software write streams Kanchan Joshi
2026-03-09 15:40 ` [PATCH v2 0/5] write streams and xfs spatial isolation Christoph Hellwig
5 siblings, 3 replies; 22+ messages in thread
From: Kanchan Joshi @ 2026-03-09 5:29 UTC (permalink / raw)
To: brauner, hch, djwong, jack, cem, kbusch, axboe
Cc: linux-xfs, linux-fsdevel, gost.dev, Kanchan Joshi
When a write stream is set on the file, override the default
directory-locality heuristic with a new heuristic that maps
available AGs into streams.
Isolating distinct write streams into dedicated allocation groups helps
in reducing the block interleaving of concurrent writers. Keeping these
streams spatially separated reduces AGF lock contention and logical file
fragmentation.
If there are fewer AGs than write streams, the write streams are
distributed across the available AGs in round-robin fashion.
Otherwise, the available AGs are partitioned among the write streams.
Since each write stream then maps to a partition of multiple contiguous
AGs, the inode hash (inode number modulo the partition size) is used to
choose the specific AG within the stream partition. This can help with
intra-stream concurrency when multiple files are being written in a
single stream that has 2 or more AGs.
Example: 8 Allocation Groups, 4 Streams
Partition Size = 2 AGs per Stream

  Stream 1 (ID: 1)        Stream 2 (ID: 2)        Streams 3 & 4
+---------+---------+   +---------+---------+   +-------------
|   AG0   |   AG1   |   |   AG2   |   AG3   |   |  AG4...AG7
+---------+---------+   +---------+---------+   +-------------
     ^         ^             ^         ^
     |         |             |         |
     |  File B (ino: 101)    |  File D (ino: 201)
     |  101 % 2 = 1 -> AG 1  |  201 % 2 = 1 -> AG 3
     |                       |
File A (ino: 100)       File C (ino: 200)
100 % 2 = 0 -> AG 0     200 % 2 = 0 -> AG 2

If AGs cannot be evenly distributed among streams, the last stream will
absorb the remaining AGs.
Note that there are no hard boundaries; this only provides an explicit
routing hint to the xfs allocator so that it can group/isolate files in
the way the application has decided to group/isolate them. We still try
to preserve file contiguity, and the full space can be utilized even
with a single stream.
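
The mapping described above can be modelled in user space as follows.
This is a sketch that mirrors the arithmetic of
xfs_inode_write_stream_to_ag() from this patch; the function name
model_stream_to_ag(), the MODEL_NULLAGNUMBER constant and the explicit
stream_id == 0 check are assumptions made for the illustration (in the
patch the caller only invokes the helper when a stream is set).

```c
#include <stdint.h>

#define MODEL_NULLAGNUMBER ((uint32_t)-1)	/* stand-in for NULLAGNUMBER */

/*
 * User-space model of the stream-to-AG mapping. stream_id is 1-based and
 * is assumed to be <= max_streams (larger values are rejected at set time).
 */
static uint32_t model_stream_to_ag(uint32_t nr_ags, uint32_t max_streams,
				   uint32_t stream_id, uint64_t ino)
{
	uint32_t idx, start_ag, ags_per_stream;

	if (stream_id == 0 || max_streams == 0 || nr_ags == 0)
		return MODEL_NULLAGNUMBER;

	idx = stream_id - 1;			/* 0-based stream index */
	ags_per_stream = nr_ags / max_streams;

	if (ags_per_stream == 0) {
		/* fewer AGs than streams: round-robin over the AGs */
		start_ag = idx % nr_ags;
		ags_per_stream = 1;
	} else {
		/* AGs are partitioned into contiguous runs, one per stream */
		start_ag = idx * ags_per_stream;
		/* the last stream absorbs any leftover AGs */
		if (idx == max_streams - 1)
			ags_per_stream = nr_ags - start_ag;
	}

	/* pick an AG inside the partition from the inode number */
	return start_ag + (uint32_t)(ino % ags_per_stream);
}
```

With 8 AGs and 4 streams this reproduces the example in the diagram:
inodes 100 and 101 in stream 1 land in AG 0 and AG 1, and inodes 200 and
201 in stream 2 land in AG 2 and AG 3.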
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
fs/xfs/libxfs/xfs_bmap.c | 9 +++++++++
fs/xfs/xfs_inode.c | 33 +++++++++++++++++++++++++++++++++
fs/xfs/xfs_inode.h | 1 +
3 files changed, 43 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 7a4c8f1aa76c..facf56e8e01d 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3591,6 +3591,15 @@ xfs_bmap_btalloc_best_length(
int error;
ap->blkno = XFS_INO_TO_FSB(args->mp, ap->ip->i_ino);
+
+ /* override the default allocation heuristic if write stream is set */
+ if (ap->ip->i_write_stream && ap->datatype & XFS_ALLOC_USERDATA) {
+ xfs_agnumber_t stream_ag = xfs_inode_write_stream_to_ag(ap->ip);
+
+ if (stream_ag != NULLAGNUMBER)
+ ap->blkno = XFS_AGB_TO_FSB(args->mp, stream_ag, 0);
+ }
+
if (!xfs_bmap_adjacent(ap))
ap->eof = false;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 9b88b2d1cf9a..e93141d2cd8b 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -93,6 +93,39 @@ xfs_inode_set_write_stream(
return 0;
}
+xfs_agnumber_t
+xfs_inode_write_stream_to_ag(
+ struct xfs_inode *ip)
+{
+ struct xfs_mount *mp = ip->i_mount;
+ uint8_t stream_id = ip->i_write_stream;
+ uint32_t max_streams = xfs_inode_max_write_streams(ip);
+ uint32_t nr_ags;
+ xfs_agnumber_t start_ag, ags_per_stream;
+
+ if (XFS_IS_REALTIME_INODE(ip) || !max_streams)
+ return NULLAGNUMBER;
+
+ stream_id -= 1; /* for 0-based math, stream-ids are 1-based */
+
+ nr_ags = mp->m_sb.sb_agcount;
+ ags_per_stream = nr_ags / max_streams;
+
+ /* for the case when we have fewer AGs than streams */
+ if (ags_per_stream == 0) {
+ start_ag = stream_id % nr_ags;
+ ags_per_stream = 1;
+ } else {
+ /* otherwise AGs are partitioned into N streams */
+ start_ag = stream_id * ags_per_stream;
+ /* uneven distribution case: last stream may contain extra */
+ if (stream_id == max_streams-1)
+ ags_per_stream = nr_ags - start_ag;
+ }
+ /* intra-stream concurrency: hash inode to choose AG within partition */
+ return start_ag + (ip->i_ino % ags_per_stream);
+}
+
/*
* These two are wrapper routines around the xfs_ilock() routine used to
* centralize some grungy code. They are used in places that wish to lock the
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 9f6cab729924..9ab31ff6b5e1 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -682,4 +682,5 @@ int xfs_icreate_dqalloc(const struct xfs_icreate_args *args,
int xfs_inode_max_write_streams(struct xfs_inode *ip);
uint8_t xfs_inode_get_write_stream(struct xfs_inode *ip);
int xfs_inode_set_write_stream(struct xfs_inode *ip, uint8_t stream_id);
+xfs_agnumber_t xfs_inode_write_stream_to_ag(struct xfs_inode *ip);
#endif /* __XFS_INODE_H__ */
--
2.25.1
* [PATCH v2 5/5] xfs: introduce software write streams
2026-03-09 5:29 ` [PATCH v2 0/5] write streams and xfs spatial isolation Kanchan Joshi
` (3 preceding siblings ...)
2026-03-09 5:29 ` [PATCH v2 4/5] xfs: steer allocation using write stream Kanchan Joshi
@ 2026-03-09 5:29 ` Kanchan Joshi
2026-03-10 6:01 ` Dave Chinner
2026-03-09 15:40 ` [PATCH v2 0/5] write streams and xfs spatial isolation Christoph Hellwig
5 siblings, 1 reply; 22+ messages in thread
From: Kanchan Joshi @ 2026-03-09 5:29 UTC (permalink / raw)
To: brauner, hch, djwong, jack, cem, kbusch, axboe
Cc: linux-xfs, linux-fsdevel, gost.dev, Kanchan Joshi
Even when the underlying block device does not advertise write streams,
XFS can choose to do so, as write-stream based AG allocation can improve
concurrency and reduce interleaving of concurrent block allocations as
well as logical fragmentation.
Use a simple 3-tier (low/medium/high) AG count based heuristic to
publish streams. This enables logical spatial isolation for standard
storage, excluding rotational media and the rt volume.
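
For illustration, the tiering can be modelled in user space like this.
The sketch follows the thresholds and the cap of 32 used in the diff
below; the names model_sw_streams() and MODEL_MAX_SW_STREAMS are made up
for the example, and the hardware, rotational and realtime checks from
the patch are left out.

```c
#include <stdint.h>

#define MODEL_MAX_SW_STREAMS 32	/* mirrors XFS_MAX_WRITE_STREAMS in the patch */

/*
 * User-space model of the software-stream count heuristic:
 *   >= 32 AGs : one stream per 4 AGs
 *   >=  8 AGs : one stream per 2 AGs
 *   otherwise : one stream per AG
 * The result is capped at MODEL_MAX_SW_STREAMS.
 */
static uint32_t model_sw_streams(uint32_t nr_ags)
{
	uint32_t n;

	if (nr_ags >= 32)
		n = nr_ags / 4;
	else if (nr_ags >= 8)
		n = nr_ags / 2;
	else
		n = nr_ags;

	return n < MODEL_MAX_SW_STREAMS ? n : MODEL_MAX_SW_STREAMS;
}
```

Under this model the stream count is not monotonic in the AG count at
the tier boundaries: 7 AGs yield 7 streams while 8 AGs yield 4, and 31
AGs yield 15 while 32 AGs yield 8.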
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
fs/xfs/xfs_inode.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index e93141d2cd8b..6c26cf03a261 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -44,7 +44,7 @@
#include "xfs_xattr.h"
#include "xfs_inode_util.h"
#include "xfs_metafile.h"
-
+#define XFS_MAX_WRITE_STREAMS (32)
struct kmem_cache *xfs_inode_cache;
int
@@ -53,6 +53,8 @@ xfs_inode_max_write_streams(
{
struct xfs_mount *mp = ip->i_mount;
struct block_device *bdev;
+ int hw_streams, sw_streams;
+ xfs_agnumber_t nr_ags;
if (XFS_IS_REALTIME_INODE(ip))
bdev = mp->m_rtdev_targp ? mp->m_rtdev_targp->bt_bdev : NULL;
@@ -62,7 +64,22 @@ xfs_inode_max_write_streams(
if (!bdev)
return 0;
- return bdev_max_write_streams(bdev);
+ hw_streams = bdev_max_write_streams(bdev);
+ if (hw_streams > 0)
+ return hw_streams;
+ /* fallback to software-only write streams, excluding some cases */
+ if (bdev_rot(bdev) || XFS_IS_REALTIME_INODE(ip))
+ return 0;
+ nr_ags = mp->m_sb.sb_agcount;
+ /* heuristic: 3-tier (large/mid/small) split of AGs into streams */
+ if (nr_ags >= 32)
+ sw_streams = nr_ags / 4;
+ else if (nr_ags >= 8)
+ sw_streams = nr_ags / 2;
+ else
+ sw_streams = nr_ags;
+
+ return min_t(int, sw_streams, XFS_MAX_WRITE_STREAMS);
}
uint8_t
--
2.25.1
* Re: [PATCH v2 4/5] xfs: steer allocation using write stream
2026-03-09 5:29 ` [PATCH v2 4/5] xfs: steer allocation using write stream Kanchan Joshi
@ 2026-03-09 12:45 ` kernel test robot
2026-03-09 20:01 ` kernel test robot
2026-03-10 5:47 ` Dave Chinner
2 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2026-03-09 12:45 UTC (permalink / raw)
To: Kanchan Joshi, brauner, hch, djwong, jack, cem, kbusch, axboe
Cc: oe-kbuild-all, linux-xfs, linux-fsdevel, gost.dev, Kanchan Joshi
Hi Kanchan,
kernel test robot noticed the following build errors:
[auto build test ERROR on 11439c4635edd669ae435eec308f4ab8a0804808]
url: https://github.com/intel-lab-lkp/linux/commits/Kanchan-Joshi/fs-add-generic-write-stream-management-ioctl/20260309-133736
base: 11439c4635edd669ae435eec308f4ab8a0804808
patch link: https://lore.kernel.org/r/20260309052944.156054-5-joshi.k%40samsung.com
patch subject: [PATCH v2 4/5] xfs: steer allocation using write stream
config: i386-randconfig-011-20260309 (https://download.01.org/0day-ci/archive/20260309/202603092015.hrOdrSYV-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260309/202603092015.hrOdrSYV-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603092015.hrOdrSYV-lkp@intel.com/
All errors (new ones prefixed by >>):
ld: fs/xfs/xfs_inode.o: in function `xfs_inode_write_stream_to_ag':
>> fs/xfs/xfs_inode.c:126:(.text+0x9ed): undefined reference to `__umoddi3'
vim +126 fs/xfs/xfs_inode.c
95
96 xfs_agnumber_t
97 xfs_inode_write_stream_to_ag(
98 struct xfs_inode *ip)
99 {
100 struct xfs_mount *mp = ip->i_mount;
101 uint8_t stream_id = ip->i_write_stream;
102 uint32_t max_streams = xfs_inode_max_write_streams(ip);
103 uint32_t nr_ags;
104 xfs_agnumber_t start_ag, ags_per_stream;
105
106 if (XFS_IS_REALTIME_INODE(ip) || !max_streams)
107 return NULLAGNUMBER;
108
109 stream_id -= 1; /* for 0-based math, stream-ids are 1-based */
110
111 nr_ags = mp->m_sb.sb_agcount;
112 ags_per_stream = nr_ags / max_streams;
113
114 /* for the case when we have fewer AGs than streams */
115 if (ags_per_stream == 0) {
116 start_ag = stream_id % nr_ags;
117 ags_per_stream = 1;
118 } else {
119 /* otherwise AGs are partitioned into N streams */
120 start_ag = stream_id * ags_per_stream;
121 /* uneven distribution case: last stream may contain extra */
122 if (stream_id == max_streams-1)
123 ags_per_stream = nr_ags - start_ag;
124 }
125 /* intra-stream concurrency: hash inode to choose AG within partition */
> 126 return start_ag + (ip->i_ino % ags_per_stream);
127 }
128
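
Background on the error: ip->i_ino is a 64-bit value, and on 32-bit x86
the compiler lowers a 64-bit '%' into a call to the libgcc helper
__umoddi3, which the kernel does not link against. Kernel code usually
goes through the helpers in <linux/math64.h>, such as div_u64_rem(), for
a 64-bit dividend with a 32-bit divisor. The block below is a user-space
sketch of that shape only; model_div_u64_rem() and model_pick_ag() are
stand-in names, and whether the next revision of the patch takes this
route is not known from this thread.

```c
#include <stdint.h>

/*
 * User-space stand-in with the same shape as the kernel's div_u64_rem():
 * returns the quotient and stores the 32-bit remainder.
 */
static uint64_t model_div_u64_rem(uint64_t dividend, uint32_t divisor,
				  uint32_t *remainder)
{
	*remainder = (uint32_t)(dividend % divisor);
	return dividend / divisor;
}

/* Line 126 rewritten in terms of the helper instead of a bare '%'. */
static uint32_t model_pick_ag(uint32_t start_ag, uint64_t ino,
			      uint32_t ags_per_stream)
{
	uint32_t rem;

	model_div_u64_rem(ino, ags_per_stream, &rem);
	return start_ag + rem;
}
```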
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v2 0/5] write streams and xfs spatial isolation
2026-03-09 5:29 ` [PATCH v2 0/5] write streams and xfs spatial isolation Kanchan Joshi
` (4 preceding siblings ...)
2026-03-09 5:29 ` [PATCH v2 5/5] xfs: introduce software write streams Kanchan Joshi
@ 2026-03-09 15:40 ` Christoph Hellwig
2026-03-10 21:19 ` Kanchan Joshi
5 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2026-03-09 15:40 UTC (permalink / raw)
To: Kanchan Joshi
Cc: brauner, hch, djwong, jack, cem, kbusch, axboe, linux-xfs,
linux-fsdevel, gost.dev, Hans Holmberg
This is lacking numbers to justify all the changes.
From previous experiments the most important isolation is to put
the file system log and metadata into a separate stream each, which
would be the first step before exposing user knobs.
And once we look into application optimizations I think your best bet is
to resurrect the FDP/write streams support for zoned XFS that Hans and I
did and posted in reply to one of Keith's iterations of the write stream
patches. This will reuse all the intelligent placement decisions we've
put into that allocator. Once that is done we can look into exposing the
write streams already inherent in that to user space, but we really
should be doing all the ground work first. And maybe some of this can
apply to the conventional allocator, but given that it has no way to
track the placement unit sizes I'm a bit doubtful that the results will
look great.
* Re: [PATCH v2 1/5] fs: add generic write-stream management ioctl
2026-03-09 5:29 ` [PATCH v2 1/5] fs: add generic write-stream management ioctl Kanchan Joshi
@ 2026-03-09 16:33 ` Darrick J. Wong
2026-03-10 17:55 ` Kanchan Joshi
0 siblings, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2026-03-09 16:33 UTC (permalink / raw)
To: Kanchan Joshi
Cc: brauner, hch, jack, cem, kbusch, axboe, linux-xfs, linux-fsdevel,
gost.dev, linux-api
[cc linux-api because this is certainly an API definition]
On Mon, Mar 09, 2026 at 10:59:40AM +0530, Kanchan Joshi wrote:
> Wire up the userspace interface for write stream management via a new
> vfs ioctl 'FS_IOC_WRITE_STREAM'.
> Application communicates the intended operation using the 'op_flags'
> field of the passed 'struct fs_write_stream'.
> Valid flags are:
> FS_WRITE_STREAM_OP_GET_MAX: Returns the number of available streams.
> FS_WRITE_STREAM_OP_SET: Assign a specific stream value to the file.
> FS_WRITE_STREAM_OP_GET: Query what stream value is set on the file.
>
> Application should query the available streams by using
> FS_WRITE_STREAM_OP_GET_MAX first.
> If returned value is N, valid stream values for the file are 0 to N.
> Stream value 0 implies that no stream is set on the file.
> Setting a larger value than available streams is rejected.
>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> ---
> include/uapi/linux/fs.h | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 70b2b661f42c..4d0805b52949 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -338,6 +338,18 @@ struct file_attr {
> /* Get logical block metadata capability details */
> #define FS_IOC_GETLBMD_CAP _IOWR(0x15, 2, struct logical_block_metadata_cap)
>
> +struct fs_write_stream {
> + __u32 op_flags; /* IN: operation flags */
> + __u32 stream_id; /* IN/OUT: stream value to assign/query */
> + __u32 max_streams; /* OUT: max streams values supported */
> + __u32 rsvd;
> +};
This isn't a very cohesive interface -- GET_MAX probably only needs
op_flags and max_streams, right? And GET/SET only use op_flags and
stream_id, right?
> +#define FS_WRITE_STREAM_OP_GET_MAX (1 << 0)
> +#define FS_WRITE_STREAM_OP_GET (1 << 1)
> +#define FS_WRITE_STREAM_OP_SET (1 << 2)
> +
> +#define FS_IOC_WRITE_STREAM _IOWR('f', 43, struct fs_write_stream)
EXT4_IOC_CHECKPOINT already took 'f' / 43. I /think/ there's no problem
because its argument is a u32 and ioctl definitions incorporate the
lower bits of the argument size but you might want to be careful
anyway.
--D
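
The point about the argument size can be checked with a small user-space
sketch. It assumes that ext4 defines EXT4_IOC_CHECKPOINT as
_IOW('f', 43, __u32); the MODEL_* names and the struct copy are local
stand-ins for the example rather than the real UAPI headers.

```c
#include <stdint.h>
#include <sys/ioctl.h>

/* Local copy of the layout proposed in this patch. */
struct model_fs_write_stream {
	uint32_t op_flags;
	uint32_t stream_id;
	uint32_t max_streams;
	uint32_t rsvd;
};

#define MODEL_FS_IOC_WRITE_STREAM \
	_IOWR('f', 43, struct model_fs_write_stream)

/* Assumption: ext4 defines EXT4_IOC_CHECKPOINT as _IOW('f', 43, __u32). */
#define MODEL_EXT4_IOC_CHECKPOINT	_IOW('f', 43, uint32_t)

/*
 * The encoded command number includes the direction bits and the argument
 * size, so the same type/nr pair can still yield distinct numbers.
 * Returns non-zero if the two encoded numbers are identical.
 */
static int model_ioctl_numbers_collide(void)
{
	return (unsigned long)MODEL_FS_IOC_WRITE_STREAM ==
	       (unsigned long)MODEL_EXT4_IOC_CHECKPOINT;
}
```

Under that assumption the two commands differ both in direction (read and
write versus write only) and in size (16 bytes versus 4 bytes), so the
encoded numbers do not collide even though 'f' / 43 is shared.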
> /*
> * Inode flags (FS_IOC_GETFLAGS / FS_IOC_SETFLAGS)
> *
> --
> 2.25.1
>
>
* Re: [PATCH v2 2/5] iomap: introduce and propagate write_stream
2026-03-09 5:29 ` [PATCH v2 2/5] iomap: introduce and propagate write_stream Kanchan Joshi
@ 2026-03-09 16:34 ` Darrick J. Wong
2026-03-10 17:58 ` Kanchan Joshi
0 siblings, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2026-03-09 16:34 UTC (permalink / raw)
To: Kanchan Joshi
Cc: brauner, hch, jack, cem, kbusch, axboe, linux-xfs, linux-fsdevel,
gost.dev
On Mon, Mar 09, 2026 at 10:59:41AM +0530, Kanchan Joshi wrote:
> Add a new write_stream field to struct iomap. Existing hole is used to
> place the new field.
> Propagate write_stream from iomap to bio in both direct I/O and buffered
> writeback paths.
>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> ---
> fs/iomap/direct-io.c | 1 +
> fs/iomap/ioend.c | 3 +++
> include/linux/iomap.h | 2 ++
> 3 files changed, 6 insertions(+)
>
<snip>
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 99b7209dabd7..e087818d11d4 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -113,6 +113,8 @@ struct iomap {
> u64 length; /* length of mapping, bytes */
> u16 type; /* type of mapping */
> u16 flags; /* flags for mapping */
> + /* 4-byte padding hole here */
It's 3 bytes now, right? ;)
> + u8 write_stream; /* write stream for I/O */
> struct block_device *bdev; /* block device for I/O */
> struct dax_device *dax_dev; /* dax_dev for dax operations */
> void *inline_data;
> --
> 2.25.1
>
>
* Re: [PATCH v2 3/5] xfs: implement write-stream management support
2026-03-09 5:29 ` [PATCH v2 3/5] xfs: implement write-stream management support Kanchan Joshi
@ 2026-03-09 16:38 ` Darrick J. Wong
2026-03-10 18:07 ` Kanchan Joshi
0 siblings, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2026-03-09 16:38 UTC (permalink / raw)
To: Kanchan Joshi
Cc: brauner, hch, jack, cem, kbusch, axboe, linux-xfs, linux-fsdevel,
gost.dev
On Mon, Mar 09, 2026 at 10:59:42AM +0530, Kanchan Joshi wrote:
> Implement support for FS_IOC_WRITE_STREAM ioctl.
>
> For FS_WRITE_STREAM_OP_GET_MAX, available write streams are reported
> based on the capability of the underlying block device.
> For FS_WRITE_STREAM_OP_{SET/GET}, add a new i_write_stream field in xfs
> inode. This value is propagated to the iomap during block mapping.
>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> ---
> fs/xfs/xfs_icache.c | 1 +
> fs/xfs/xfs_inode.c | 46 +++++++++++++++++++++++++++++++++++++++++++++
> fs/xfs/xfs_inode.h | 6 ++++++
> fs/xfs/xfs_ioctl.c | 34 +++++++++++++++++++++++++++++++++
> fs/xfs/xfs_iomap.c | 1 +
> 5 files changed, 88 insertions(+)
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index a7a09e7eec81..2ad8d02152f4 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -130,6 +130,7 @@ xfs_inode_alloc(
> spin_lock_init(&ip->i_ioend_lock);
> ip->i_next_unlinked = NULLAGINO;
> ip->i_prev_unlinked = 0;
> + ip->i_write_stream = 0;
>
> return ip;
> }
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 50c0404f9064..9b88b2d1cf9a 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -47,6 +47,52 @@
>
> struct kmem_cache *xfs_inode_cache;
>
> +int
> +xfs_inode_max_write_streams(
> + struct xfs_inode *ip)
> +{
> + struct xfs_mount *mp = ip->i_mount;
> + struct block_device *bdev;
> +
> + if (XFS_IS_REALTIME_INODE(ip))
> + bdev = mp->m_rtdev_targp ? mp->m_rtdev_targp->bt_bdev : NULL;
Uhh if this is a realtime inode then there had better be an
m_rtdev_targp to dereference.
Also... this is xfs_inode_buftarg().
> + else
> + bdev = mp->m_ddev_targp->bt_bdev;
> +
> + if (!bdev)
> + return 0;
> +
> + return bdev_max_write_streams(bdev);
> +}
> +
> +uint8_t
> +xfs_inode_get_write_stream(
> + struct xfs_inode *ip)
> +{
> + uint8_t stream_id;
> +
> + xfs_ilock(ip, XFS_ILOCK_SHARED);
> + stream_id = ip->i_write_stream;
> + xfs_iunlock(ip, XFS_ILOCK_SHARED);
> +
> + return stream_id;
> +}
> +
> +int
> +xfs_inode_set_write_stream(
> + struct xfs_inode *ip,
> + uint8_t stream_id)
> +{
> + if (stream_id > xfs_inode_max_write_streams(ip))
Inodes can change devices, so this needs to go under the ILOCK.
> + return -EINVAL;
> +
> + xfs_ilock(ip, XFS_ILOCK_EXCL);
> + ip->i_write_stream = stream_id;
> + xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +
> + return 0;
> +}
> +
> /*
> * These two are wrapper routines around the xfs_ilock() routine used to
> * centralize some grungy code. They are used in places that wish to lock the
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index bd6d33557194..9f6cab729924 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -38,6 +38,9 @@ typedef struct xfs_inode {
> struct xfs_ifork i_df; /* data fork */
> struct xfs_ifork i_af; /* attribute fork */
>
> + /* Write stream information */
> + uint8_t i_write_stream;
I'm confused, bdev_max_write_streams returns an unsigned short, but this
is a uint8_t field. How are we supposed to deal with truncation issues?
--D
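The truncation concern can be illustrated with a small userspace sketch.
The helper name and the choice of error codes are hypothetical; the only
facts taken from the patch are that the ioctl carries a __u32 stream_id,
the inode field is a uint8_t, and bdev_max_write_streams() returns an
unsigned short. Checking the full-width value before narrowing avoids a
request such as 257 being silently stored as 1.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: validate a user-supplied 32-bit stream id against
 * the device limit before narrowing it to the 8-bit in-inode field.
 */
static int sketch_check_stream_id(uint32_t requested,
				  unsigned short max_streams,
				  uint8_t *out)
{
	if (requested > max_streams)
		return -EINVAL;		/* beyond what the device advertises */
	if (requested > UINT8_MAX)
		return -ERANGE;		/* would not fit the 8-bit field */

	*out = (uint8_t)requested;	/* narrowing is now known to be lossless */
	return 0;
}
```

Without the second check, a device that advertises more than 255 streams
would accept ids that cannot be represented in the inode field.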
> +
> /* Transaction and locking information. */
> struct xfs_inode_log_item *i_itemp; /* logging information */
> struct rw_semaphore i_lock; /* inode lock */
> @@ -676,4 +679,7 @@ int xfs_icreate_dqalloc(const struct xfs_icreate_args *args,
> struct xfs_dquot **udqpp, struct xfs_dquot **gdqpp,
> struct xfs_dquot **pdqpp);
>
> +int xfs_inode_max_write_streams(struct xfs_inode *ip);
> +uint8_t xfs_inode_get_write_stream(struct xfs_inode *ip);
> +int xfs_inode_set_write_stream(struct xfs_inode *ip, uint8_t stream_id);
> #endif /* __XFS_INODE_H__ */
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index facffdc8dca8..091d6a8b5f57 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1160,6 +1160,38 @@ xfs_ioctl_fs_counts(
> return 0;
> }
>
> +static int
> +xfs_ioc_write_stream(
> + struct file *filp,
> + void __user *arg)
> +{
> + struct inode *inode = file_inode(filp);
> + struct xfs_inode *ip = XFS_I(inode);
> + struct fs_write_stream ws = { };
> +
> + if (copy_from_user(&ws, arg, sizeof(ws)))
> + return -EFAULT;
> +
> + switch (ws.op_flags) {
> + case FS_WRITE_STREAM_OP_GET_MAX:
> + ws.max_streams = xfs_inode_max_write_streams(ip);
> + goto copy_out;
> + case FS_WRITE_STREAM_OP_GET:
> + ws.stream_id = xfs_inode_get_write_stream(ip);
> + goto copy_out;
> + case FS_WRITE_STREAM_OP_SET:
> + return xfs_inode_set_write_stream(ip, ws.stream_id);
> + default:
> + return -EINVAL;
> + }
> + return 0;
> +
> +copy_out:
> + if (copy_to_user(arg, &ws, sizeof(ws)))
> + return -EFAULT;
> + return 0;
> +}
> +
> /*
> * These long-unused ioctls were removed from the official ioctl API in 5.17,
> * but retain these definitions so that we can log warnings about them.
> @@ -1425,6 +1457,8 @@ xfs_file_ioctl(
> return xfs_ioc_health_monitor(filp, arg);
> case XFS_IOC_VERIFY_MEDIA:
> return xfs_ioc_verify_media(filp, arg);
> + case FS_IOC_WRITE_STREAM:
> + return xfs_ioc_write_stream(filp, arg);
>
> default:
> return -ENOTTY;
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index be86d43044df..7988c9e16635 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -148,6 +148,7 @@ xfs_bmbt_to_iomap(
> else
> iomap->bdev = target->bt_bdev;
> iomap->flags = iomap_flags;
> + iomap->write_stream = ip->i_write_stream;
>
> /*
> * If the inode is dirty for datasync purposes, let iomap know so it
> --
> 2.25.1
>
>
* Re: [PATCH v2 4/5] xfs: steer allocation using write stream
2026-03-09 5:29 ` [PATCH v2 4/5] xfs: steer allocation using write stream Kanchan Joshi
2026-03-09 12:45 ` kernel test robot
@ 2026-03-09 20:01 ` kernel test robot
2026-03-10 5:47 ` Dave Chinner
2 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2026-03-09 20:01 UTC (permalink / raw)
To: Kanchan Joshi, brauner, hch, djwong, jack, cem, kbusch, axboe
Cc: llvm, oe-kbuild-all, linux-xfs, linux-fsdevel, gost.dev,
Kanchan Joshi
Hi Kanchan,
kernel test robot noticed the following build errors:
[auto build test ERROR on 11439c4635edd669ae435eec308f4ab8a0804808]
url: https://github.com/intel-lab-lkp/linux/commits/Kanchan-Joshi/fs-add-generic-write-stream-management-ioctl/20260309-133736
base: 11439c4635edd669ae435eec308f4ab8a0804808
patch link: https://lore.kernel.org/r/20260309052944.156054-5-joshi.k%40samsung.com
patch subject: [PATCH v2 4/5] xfs: steer allocation using write stream
config: arm-randconfig-002-20260309 (https://download.01.org/0day-ci/archive/20260310/202603100305.6kIq1qtR-lkp@intel.com/config)
compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260310/202603100305.6kIq1qtR-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603100305.6kIq1qtR-lkp@intel.com/
All errors (new ones prefixed by >>):
>> ld.lld: error: undefined symbol: __aeabi_uldivmod
>>> referenced by xfs_inode.c:126 (fs/xfs/xfs_inode.c:126)
>>> fs/xfs/xfs_inode.o:(xfs_inode_write_stream_to_ag) in archive vmlinux.a
>>> did you mean: __aeabi_uidivmod
>>> defined in: vmlinux.a(arch/arm/lib/lib1funcs.o)
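For context: __aeabi_uldivmod is the ARM EABI runtime helper that
compilers emit for a plain '/' or '%' on 64-bit operands, and the kernel
does not link against it. The usual in-kernel remedy is a helper such as
div_u64_rem() from <linux/math64.h>. The code at xfs_inode.c:126 is not
shown in this mail, so the expression below is an assumption about what
triggers the error. The sketch is a userspace stand-in with the same
calling shape as div_u64_rem(); the sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace stand-in shaped like the kernel's div_u64_rem(): 64-bit
 * dividend, 32-bit divisor, remainder returned through a pointer.
 * In the kernel, the real helper avoids the libgcc call on 32-bit.
 */
static uint64_t sketch_div_u64_rem(uint64_t dividend, uint32_t divisor,
				   uint32_t *remainder)
{
	*remainder = (uint32_t)(dividend % divisor);
	return dividend / divisor;
}

/* Hypothetical: derive an AG offset within a partition from an inode number. */
static uint32_t sketch_ino_to_offset(uint64_t ino, uint32_t ags_per_stream)
{
	uint32_t rem;

	/* Only the remainder is needed; the quotient is discarded. */
	sketch_div_u64_rem(ino, ags_per_stream, &rem);
	return rem;
}
```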
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v2 4/5] xfs: steer allocation using write stream
2026-03-09 5:29 ` [PATCH v2 4/5] xfs: steer allocation using write stream Kanchan Joshi
2026-03-09 12:45 ` kernel test robot
2026-03-09 20:01 ` kernel test robot
@ 2026-03-10 5:47 ` Dave Chinner
2026-03-10 19:03 ` Kanchan Joshi
2 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2026-03-10 5:47 UTC (permalink / raw)
To: Kanchan Joshi
Cc: brauner, hch, djwong, jack, cem, kbusch, axboe, linux-xfs,
linux-fsdevel, gost.dev
On Mon, Mar 09, 2026 at 10:59:43AM +0530, Kanchan Joshi wrote:
> When write stream is set on the file, override the default
> directory-locality heuristic with a new heuristic that maps
> available AGs into streams.
>
> Isolating distinct write streams into dedicated allocation groups helps
> in reducing the block interleaving of concurrent writers. Keeping these
> streams spatially separated reduces AGF lock contention and logical file
> fragmentation.
a.k.a. the XFS filestreams allocator.
i.e. we already have an allocator that performs this exact locality
mapping. See xfs_inode_is_filestream() and the allocator code path
that goes this way:
xfs_bmap_btalloc()
xfs_bmap_btalloc_filestreams()
xfs_filestream_select_ag()
Please integrate this write stream mapping functionality into that
allocator rather than hacking a new, almost identical allocator
policy into XFS.
filestreams currently uses the parent inode number as the stream ID
and maps that to an AG. It should be relatively trivial to use the
ip->i_write_stream as the stream ID instead of the parent inode.
> If AGs are fewer than write streams, write streams are distributed into
> available AGs in round robin fashion.
> If not, available AGs are partitioned into write streams. Since each
> write stream maps to a partition of multiple contiguous AGs, the inode hash
> is used to choose the specific AG within the stream partition. This can
> help with intra-stream concurrency when multiple files are being written in
> a single stream that has 2 or more AGs.
>
> Example: 8 Allocation Groups, 4 Streams
> Partition Size = 2 AGs per Stream
>
> Stream 1 (ID: 1) Stream 2 (ID: 2) Streams 3 & 4
> +---------+---------+ +---------+---------+ +-------------
> | AG0 | AG1 | | AG2 | AG3 | | AG4...AG7
> +---------+---------+ +---------+---------+ +-------------
> ^ ^ ^ ^
> | | | |
> | File B (ino: 101) | File D (ino: 201)
> | 101 % 2 = 1 -> AG 1 | 201 % 2 = 1 -> AG 3
> | |
> File A (ino: 100) File C (ino: 200)
> 100 % 2 = 0 -> AG 0 200 % 2 = 0 -> AG 2
>
> If AGs can not be evenly distributed among streams, the last stream will
> absorb the remaining AGs.
Yeah, this should all be hidden behind xfs_filestream_select_ag()
when ip->i_write_stream is set....
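For reference, the static mapping described in the quoted commit message
can be sketched in userspace as follows. This models only the patch's
description (round robin when AGs are fewer than streams, otherwise
contiguous partitions with the last stream absorbing the remainder); it
is not the filestreams allocator and not the patch's actual code. The
function name is hypothetical, and it assumes stream ids run from 1 to
nr_streams, as the diagram suggests.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map (stream id, inode number) to an AG index.
 * Preconditions (assumed): agcount > 0, 1 <= stream_id <= nr_streams.
 */
static uint32_t sketch_stream_to_ag(uint32_t stream_id, uint64_t ino,
				    uint32_t agcount, uint32_t nr_streams)
{
	uint32_t idx = stream_id - 1;	/* zero-based stream index */
	uint32_t per, start, len;

	/* Fewer AGs than streams: spread streams round robin. */
	if (agcount < nr_streams)
		return idx % agcount;

	/* Otherwise each stream owns a contiguous run of AGs. */
	per = agcount / nr_streams;
	start = idx * per;
	/* The last stream absorbs any AGs left over by the division. */
	len = (idx == nr_streams - 1) ? agcount - start : per;

	/* The inode number picks one AG inside the stream's partition. */
	return start + (uint32_t)(ino % len);
}
```

With 8 AGs and 4 streams this reproduces the diagram: inode 100 in
stream 1 lands in AG 0, and inode 201 in stream 2 lands in AG 3.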
> Note that there are no hard boundaries; this only provides explicit
> routing hint to xfs allocator so that it can group/isolate files in the way
> application has decided to group/isolate. We still try to preserve file
> contiguity, and the full space can be utilized even with a single stream.
Yes, that's pretty much exactly what the filestreams allocator was
designed to do. It's a whole lot more dynamic than what you are
trying to do above and is not limited to fixed AGs for streams - as
soon as an AG is out of space, it will select the AG with the most
free space for the stream and keep that relationship until that AG
is out of space.
IOWs, filestreams does not limit a stream to a fixed number of AGs.
All it does is keep IO with the same stream ID in the same AG until
the AG is full and, as much as possible, prevents multiple streams
from using the same AG.
-Dave.
--
Dave Chinner
dgc@kernel.org
* Re: [PATCH v2 5/5] xfs: introduce software write streams
2026-03-09 5:29 ` [PATCH v2 5/5] xfs: introduce software write streams Kanchan Joshi
@ 2026-03-10 6:01 ` Dave Chinner
0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2026-03-10 6:01 UTC (permalink / raw)
To: Kanchan Joshi
Cc: brauner, hch, djwong, jack, cem, kbusch, axboe, linux-xfs,
linux-fsdevel, gost.dev
On Mon, Mar 09, 2026 at 10:59:44AM +0530, Kanchan Joshi wrote:
> Even when the underlying block device does not advertise write streams,
> XFS can choose to do so, as write-stream based AG allocation can improve the
> concurrency and reduce interleaving of concurrent block allocation as well
> as logical fragmentation.
XFS's default allocation policy already does this "stream
separation" implicitly on a per-directory basis. So you're going to
need to be more specific about the workload that can benefit from
software write streams being assigned to specific AGs...
> Use a simple 3-tier (low/medium/high) AG count based heuristic to
> publish streams. This enables logical spatial isolation for standard
> storage, excluding rotational media and rtvolume.
This is completely unnecessary if you use the filestreams allocator
for write streams to do workload separation. There is no special
hardware mapping of streams that we can use, so just an AG per
stream will get you exactly the same stream separation behaviour.
And, FWIW, filestreams works just fine on rotational devices. In
fact, that is what filestreams was designed for: optimising IO
to/from massive isochronous RAID arrays full of spinning disks that
stored uncompressed high definition video streams in file-per-frame
format. Keeping each video stream in a separate AG meant each frame
of a video was placed on sequential RAID stripes and each video
stream hit different disks in the RAID arrays. This meant video
stream reads and writes always did full stripe IO and the IO from
the different streams never interfered with each other.
So, yeah, get rid of the arbitrary "can't use software mappings on
rotational devices" because we know from experience that it is just
not true...
-Dave.
--
Dave Chinner
dgc@kernel.org
* Re: [PATCH v2 1/5] fs: add generic write-stream management ioctl
2026-03-09 16:33 ` Darrick J. Wong
@ 2026-03-10 17:55 ` Kanchan Joshi
2026-03-10 20:44 ` Darrick J. Wong
0 siblings, 1 reply; 22+ messages in thread
From: Kanchan Joshi @ 2026-03-10 17:55 UTC (permalink / raw)
To: Darrick J. Wong
Cc: brauner, hch, jack, cem, kbusch, axboe, linux-xfs, linux-fsdevel,
gost.dev, linux-api
On 3/9/2026 10:03 PM, Darrick J. Wong wrote:
>> +struct fs_write_stream {
>> + __u32 op_flags; /* IN: operation flags */
>> + __u32 stream_id; /* IN/OUT: stream value to assign/query */
>> + __u32 max_streams; /* OUT: max streams values supported */
>> + __u32 rsvd;
>> +};
> This isn't a very cohesive interface -- GET_MAX probably only needs
> op_flags and max_streams, right? And GET/SET only use op_flags and
> stream_id, right?
Yeah, right. That's the trade-off with a Swiss-army-knife ioctl that
uses op_flags to decide what it should do. Apart from keeping a single
ioctl, I was also thinking about extensibility (for anything new, we may
be able to add a new op_flags value using rsvd or a union). But if you
feel strongly about this, I can take the three-ioctl route?
>> +#define FS_WRITE_STREAM_OP_GET_MAX (1 << 0)
>> +#define FS_WRITE_STREAM_OP_GET (1 << 1)
>> +#define FS_WRITE_STREAM_OP_SET (1 << 2)
>> +
>> +#define FS_IOC_WRITE_STREAM _IOWR('f', 43, struct fs_write_stream)
> EXT4_IOC_CHECKPOINT already took 'f' / 43. I /think/ there's no problem
> because its argument is a u32 and ioctl definitions incorporate the
> lower bits of the argument size, but you might want to be careful
> anyway.
Indeed, thanks!
* Re: [PATCH v2 2/5] iomap: introduce and propagate write_stream
2026-03-09 16:34 ` Darrick J. Wong
@ 2026-03-10 17:58 ` Kanchan Joshi
0 siblings, 0 replies; 22+ messages in thread
From: Kanchan Joshi @ 2026-03-10 17:58 UTC (permalink / raw)
To: Darrick J. Wong
Cc: brauner, hch, jack, cem, kbusch, axboe, linux-xfs, linux-fsdevel,
gost.dev
On 3/9/2026 10:04 PM, Darrick J. Wong wrote:
>> u64 length; /* length of mapping, bytes */
>> u16 type; /* type of mapping */
>> u16 flags; /* flags for mapping */
>> + /* 4-byte padding hole here */
> It's 3 bytes now, right? 😉
:-) Yup. Fat finger/brain. I wrote the comment first, and added the
field afterwards.
* Re: [PATCH v2 3/5] xfs: implement write-stream management support
2026-03-09 16:38 ` Darrick J. Wong
@ 2026-03-10 18:07 ` Kanchan Joshi
0 siblings, 0 replies; 22+ messages in thread
From: Kanchan Joshi @ 2026-03-10 18:07 UTC (permalink / raw)
To: Darrick J. Wong
Cc: brauner, hch, jack, cem, kbusch, axboe, linux-xfs, linux-fsdevel,
gost.dev
On 3/9/2026 10:08 PM, Darrick J. Wong wrote:
>> +int
>> +xfs_inode_max_write_streams(
>> + struct xfs_inode *ip)
>> +{
>> + struct xfs_mount *mp = ip->i_mount;
>> + struct block_device *bdev;
>> +
>> + if (XFS_IS_REALTIME_INODE(ip))
>> + bdev = mp->m_rtdev_targp ? mp->m_rtdev_targp->bt_bdev : NULL;
> Uhh if this is a realtime inode then there had better be an
> m_rtdev_targp to dereference.
>
> Also... this is xfs_inode_buftarg().
Thanks, I will use that.
>> + else
>> + bdev = mp->m_ddev_targp->bt_bdev;
>> +
>> + if (!bdev)
>> + return 0;
>> +
>> + return bdev_max_write_streams(bdev);
>> +}
>> +
>> +uint8_t
>> +xfs_inode_get_write_stream(
>> + struct xfs_inode *ip)
>> +{
>> + uint8_t stream_id;
>> +
>> + xfs_ilock(ip, XFS_ILOCK_SHARED);
>> + stream_id = ip->i_write_stream;
>> + xfs_iunlock(ip, XFS_ILOCK_SHARED);
>> +
>> + return stream_id;
>> +}
>> +
>> +int
>> +xfs_inode_set_write_stream(
>> + struct xfs_inode *ip,
>> + uint8_t stream_id)
>> +{
>> + if (stream_id > xfs_inode_max_write_streams(ip))
> Inodes can change devices, so this needs to go under the ILOCK.
Indeed.
>> + return -EINVAL;
>> +
>> + xfs_ilock(ip, XFS_ILOCK_EXCL);
>> + ip->i_write_stream = stream_id;
>> + xfs_iunlock(ip, XFS_ILOCK_EXCL);
>> +
>> + return 0;
>> +}
>> +
>> /*
>> * These two are wrapper routines around the xfs_ilock() routine used to
>> * centralize some grungy code. They are used in places that wish to lock the
>> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
>> index bd6d33557194..9f6cab729924 100644
>> --- a/fs/xfs/xfs_inode.h
>> +++ b/fs/xfs/xfs_inode.h
>> @@ -38,6 +38,9 @@ typedef struct xfs_inode {
>> struct xfs_ifork i_df; /* data fork */
>> struct xfs_ifork i_af; /* attribute fork */
>>
>> + /* Write stream information */
>> + uint8_t i_write_stream;
> I'm confused, bdev_max_write_streams returns an unsigned short, but this
> is a uint8_t field. How are we supposed to deal with truncation issues?
Right, in my head I was thinking "bi_write_stream" which is u8. Will
change in v3.
* Re: [PATCH v2 4/5] xfs: steer allocation using write stream
2026-03-10 5:47 ` Dave Chinner
@ 2026-03-10 19:03 ` Kanchan Joshi
2026-03-10 22:44 ` Dave Chinner
0 siblings, 1 reply; 22+ messages in thread
From: Kanchan Joshi @ 2026-03-10 19:03 UTC (permalink / raw)
To: Dave Chinner
Cc: brauner, hch, djwong, jack, cem, kbusch, axboe, linux-xfs,
linux-fsdevel, gost.dev
On 3/10/2026 11:17 AM, Dave Chinner wrote:
>> When write stream is set on the file, override the default
>> directory-locality heuristic with a new heuristic that maps
>> available AGs into streams.
>>
>> Isolating distinct write streams into dedicated allocation groups helps
>> in reducing the block interleaving of concurrent writers. Keeping these
>> streams spatially separated reduces AGF lock contention and logical file
>> fragmentation.
> a.k.a. the XFS filestreams allocator.
Yes, but there is a difference between what I am doing and what is
present. Let me cover that at the end.
>
> i.e. we already have an allocator that performs this exact locality
> mapping. See xfs_inode_is_filestream() and the allocator code path
> that goes this way:
>
> xfs_bmap_btalloc()
> xfs_bmap_btalloc_filestreams()
> xfs_filestream_select_ag()
>
> Please integrate this write stream mapping functionality into that
> allocator rather than hacking a new, almost identical allocator
> policy into XFS.
>
> filestreams currently uses the parent inode number as the stream ID
> and maps that to an AG. It should be relatively trivial to use the
> ip->i_write_stream as the stream ID instead of the parent inode.
Yeah, should be possible. Will try that in V3.
>> If AGs are fewer than write streams, write streams are distributed into
>> available AGs in round robin fashion.
>> If not, available AGs are partitioned into write streams. Since each
>> write stream maps to a partition of multiple contiguous AGs, the inode hash
>> is used to choose the specific AG within the stream partition. This can
>> help with intra-stream concurrency when multiple files are being written in
>> a single stream that has 2 or more AGs.
>>
>> Example: 8 Allocation Groups, 4 Streams
>> Partition Size = 2 AGs per Stream
>>
>> Stream 1 (ID: 1) Stream 2 (ID: 2) Streams 3 & 4
>> +---------+---------+ +---------+---------+ +-------------
>> | AG0 | AG1 | | AG2 | AG3 | | AG4...AG7
>> +---------+---------+ +---------+---------+ +-------------
>> ^ ^ ^ ^
>> | | | |
>> | File B (ino: 101) | File D (ino: 201)
>> | 101 % 2 = 1 -> AG 1 | 201 % 2 = 1 -> AG 3
>> | |
>> File A (ino: 100) File C (ino: 200)
>> 100 % 2 = 0 -> AG 0 200 % 2 = 0 -> AG 2
>>
>> If AGs can not be evenly distributed among streams, the last stream will
>> absorb the remaining AGs.
> Yeah, this should all be hidden behind xfs_filestream_select_ag()
> when ip->i_write_stream is set....
Added in the TBD list.
>> Note that there are no hard boundaries; this only provides explicit
>> routing hint to xfs allocator so that it can group/isolate files in the way
>> application has decided to group/isolate. We still try to preserve file
>> contiguity, and the full space can be utilized even with a single stream.
> Yes, that's pretty much exactly what the filestreams allocator was
> designed to do. It's a whole lot more dynamic than what you are
> trying to do above and is not limited to fixed AGs for streams - as
> soon as an AG is out of space, it will select the AG with the most
> free space for the stream and keep that relationship until that AG
> is out of space.
>
> IOWs, filestreams does not limit a stream to a fixed number of AGs.
> All it does is keep IO with the same stream ID in the same AG until
> the AG is full and, as much as possible, prevents multiple streams
> from using the same AG.
filestream: 1 filestream == 1 AG at a time. That can cause AGF lock
contention on high-concurrency NVMe workloads, i.e., when multiple
threads are writing to different files in the same filestream.
What I am doing here with the new write stream has two aspects:
(a) inter-stream concurrency: multiple threads writing to different
files in different streams do not contend on the same AGF lock.
(b) intra-stream concurrency: multiple threads writing to different
files in a single stream also face 'reduced' contention, because each
stream is a collection of AGs and the load is spread across them (with
the inode hash). Therefore, each stream is partitioned into a group of AGs.
Also, with filestreams we can't do cross-directory grouping at
file-level granularity. Write streams are a more explicit model: the
application decides which files are to be spatially separated/grouped and
what kind of concurrency buckets should be chosen for its N threads/files.
* Re: [PATCH v2 1/5] fs: add generic write-stream management ioctl
2026-03-10 17:55 ` Kanchan Joshi
@ 2026-03-10 20:44 ` Darrick J. Wong
0 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2026-03-10 20:44 UTC (permalink / raw)
To: Kanchan Joshi
Cc: brauner, hch, jack, cem, kbusch, axboe, linux-xfs, linux-fsdevel,
gost.dev, linux-api
On Tue, Mar 10, 2026 at 11:25:25PM +0530, Kanchan Joshi wrote:
> On 3/9/2026 10:03 PM, Darrick J. Wong wrote:
> >> +struct fs_write_stream {
> >> + __u32 op_flags; /* IN: operation flags */
> >> + __u32 stream_id; /* IN/OUT: stream value to assign/query */
> >> + __u32 max_streams; /* OUT: max streams values supported */
> >> + __u32 rsvd;
> >> +};
> > This isn't a very cohesive interface -- GET_MAX probably only needs
> > op_flags and max_streams, right? And GET/SET only use op_flags and
> > stream_id, right?
>
> Yeah, right. That's the trade-off with a Swiss-army-knife ioctl that
> uses op_flags to decide what it should do. Apart from keeping a single
> ioctl, I was also thinking about extensibility (for anything new, we may
> be able to add a new op_flags value using rsvd or a union). But if you
> feel strongly about this, I can take the three-ioctl route?
struct fs_write_stream {
__u32 op_flags;
union {
__u32 stream_id;
__u32 max_ids;
};
__u64 reserved;
};
perhaps? You might want to look into whether or not we're allowed to
have anonymous unions in UAPI headers. We all ❤️ C11, right?
--D
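As a quick layout check of the union variant above, here is a userspace
sketch. The *_sketch struct names are hypothetical copies of the v2 struct
and of the suggestion, with __u32/__u64 replaced by their stdint
equivalents. It only shows that both layouts occupy 16 bytes and that the
two union members alias; whether anonymous unions are acceptable in UAPI
headers is left open, as in the mail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* v2 layout from the patch: four independent 32-bit fields. */
struct fs_write_stream_v2_sketch {
	uint32_t op_flags;
	uint32_t stream_id;
	uint32_t max_streams;
	uint32_t rsvd;
};

/* Suggested layout: one value slot shared by GET/SET and GET_MAX. */
struct fs_write_stream_union_sketch {
	uint32_t op_flags;
	union {
		uint32_t stream_id;	/* used by GET and SET */
		uint32_t max_ids;	/* used by GET_MAX */
	};
	uint64_t reserved;
};

/* Both variants are the same size, so the ioctl encoding size is unchanged. */
static_assert(sizeof(struct fs_write_stream_v2_sketch) == 16, "v2 size");
static_assert(sizeof(struct fs_write_stream_union_sketch) == 16, "union size");

/* The union members share one offset; reserved starts at byte 8, no padding. */
static_assert(offsetof(struct fs_write_stream_union_sketch, stream_id) ==
	      offsetof(struct fs_write_stream_union_sketch, max_ids),
	      "members alias");
static_assert(offsetof(struct fs_write_stream_union_sketch, reserved) == 8,
	      "no hidden padding before reserved");
```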
> >> +#define FS_WRITE_STREAM_OP_GET_MAX (1 << 0)
> >> +#define FS_WRITE_STREAM_OP_GET (1 << 1)
> >> +#define FS_WRITE_STREAM_OP_SET (1 << 2)
> >> +
> >> +#define FS_IOC_WRITE_STREAM _IOWR('f', 43, struct fs_write_stream)
> > EXT4_IOC_CHECKPOINT already took 'f' / 43. I /think/ there's no problem
> > because its argument is a u32 and ioctl definitions incorporate the
> > lower bits of the argument size, but you might want to be careful
> > anyway.
>
> Indeed, thanks!
>
* Re: [PATCH v2 0/5] write streams and xfs spatial isolation
2026-03-09 15:40 ` [PATCH v2 0/5] write streams and xfs spatial isolation Christoph Hellwig
@ 2026-03-10 21:19 ` Kanchan Joshi
0 siblings, 0 replies; 22+ messages in thread
From: Kanchan Joshi @ 2026-03-10 21:19 UTC (permalink / raw)
To: Christoph Hellwig
Cc: brauner, djwong, jack, cem, kbusch, axboe, linux-xfs,
linux-fsdevel, gost.dev, Hans Holmberg
On 3/9/2026 9:10 PM, Christoph Hellwig wrote:
> This is lacking numbers to justify all the changes.
>
> From previous experiments the most important isolation is to put
> the file system log and metadata into a separate stream each, which
> would be the first step before exposing user knobs.
>
> And once we look into application optimizations I think your best bet is
> to resurrect the FDP/write streams support for zoned XFS that Hans and I
> did and posted in reply to one of Keith's iterations of the write stream
> patches. This will reuse all the intelligent placement decisions we've
> put into that allocator. Once that is done we can look into exposing the
> write streams already inherent in that to user space, but we really
> should be doing all the ground work first. And maybe some of this can
> apply to the conventional allocator, but given that it has no way to
> track the placement unit sizes I'm a bit doubtful that the results will
> look great.
In the LSFMM thread on this topic, you advised that I should not tie a
high-level VFS/XFS feature exclusively to FDP as a single
implementation [1]. I took your advice and tried to communicate that
general value with this iteration.
- Parts of these patches rely on the block device's write streams. The
block device provider can be anything (LVM, RAID, etc.) that chooses to
expose streams to leverage the application's spatial isolation intent.
That goes back to what I added in the cover letter: write streams enable
collaborative data placement, allowing the abstraction provider to
leverage application intent.
- I added Patch 4 (AG steering) and Patch 5 (software streams)
specifically with this general-purpose intent in mind. For FDP-only
enablement, I would have needed only the first 3 patches.
Regarding the Zoned allocator approach - this was designed for a
fundamentally different device architecture. My concern with porting
FDP to it is that it negates the design consensus + progress we reached
since the RFC last year.
Temperature vs. Spatial Placement: The Zoned allocator leverages
temperature/lifetime hints. The deliberate design choice to move away
from a temperature-based API was made in the RFC itself, and I outlined
the reasoning in the cover letter ("Comparison with write hints"
section). Also, the discoverable interface makes it easy to add things
like log/metadata streams.
Hardware Constraints: The Zoned allocator is designed to manage strict
hardware constraints (tracking exact zone write pointers, host-side
garbage collection). Conventional and FDP drives offload these
constraints to the FTL anyway.
Lack of precise placement unit tracking: If any inefficiencies occur
due to that, they would apply exclusively to FDP-capable devices.
The AG steering scheme is intended for the broader class of conventional
block devices, which don't expose write pointers, placement units, etc.
[1] https://lore.kernel.org/linux-fsdevel/20260223135339.GA17313@lst.de/
* Re: [PATCH v2 4/5] xfs: steer allocation using write stream
2026-03-10 19:03 ` Kanchan Joshi
@ 2026-03-10 22:44 ` Dave Chinner
2026-03-11 9:59 ` Kanchan Joshi
0 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2026-03-10 22:44 UTC (permalink / raw)
To: Kanchan Joshi
Cc: brauner, hch, djwong, jack, cem, kbusch, axboe, linux-xfs,
linux-fsdevel, gost.dev
On Wed, Mar 11, 2026 at 12:33:42AM +0530, Kanchan Joshi wrote:
> On 3/10/2026 11:17 AM, Dave Chinner wrote:
> >> When write stream is set on the file, override the default
> >> directory-locality heuristic with a new heuristic that maps
> >> available AGs into streams.
> >>
> >> Isolating distinct write streams into dedicated allocation groups helps
> >> in reducing the block interleaving of concurrent writers. Keeping these
> >> streams spatially separated reduces AGF lock contention and logical file
> >> fragmentation.
> > a.k.a. the XFS filestreams allocator.
>
> Yes, but there is a difference between what I am doing and what is
> present. Let me cover that at the end.
> >
> > i.e. we already have an allocator that performs this exact locality
> > mapping. See xfs_inode_is_filestream() and the allocator code path
> > that goes this way:
> >
> > xfs_bmap_btalloc()
> > xfs_bmap_btalloc_filestreams()
> > xfs_filestream_select_ag()
> >
> > Please integrate this write stream mapping functionality into that
> > allocator rather than hacking a new, almost identical allocator
> > policy into XFS.
> >
> > filestreams currently uses the parent inode number as the stream ID
> > and maps that to an AG. It should be relatively trivial to use the
> > ip->i_write_stream as the stream ID instead of the parent inode.
>
> Yeah, should be possible. Will try that in V3.
>
> >> If AGs are fewer than write streams, write streams are distributed into
> >> available AGs in round robin fashion.
> >> If not, available AGs are partitioned into write streams. Since each
> >> write stream maps to a partition of multiple contiguous AGs, the inode hash
> >> is used to choose the specific AG within the stream partition. This can
> >> help with intra-stream concurrency when multiple files are being written in
> >> a single stream that has 2 or more AGs.
> >>
> >> Example: 8 Allocation Groups, 4 Streams
> >> Partition Size = 2 AGs per Stream
> >>
> >> Stream 1 (ID: 1) Stream 2 (ID: 2) Streams 3 & 4
> >> +---------+---------+ +---------+---------+ +-------------
> >> | AG0 | AG1 | | AG2 | AG3 | | AG4...AG7
> >> +---------+---------+ +---------+---------+ +-------------
> >> ^ ^ ^ ^
> >> | | | |
> >> | File B (ino: 101) | File D (ino: 201)
> >> | 101 % 2 = 1 -> AG 1 | 201 % 2 = 1 -> AG 3
> >> | |
> >> File A (ino: 100) File C (ino: 200)
> >> 100 % 2 = 0 -> AG 0 200 % 2 = 0 -> AG 2
> >>
> >> If AGs can not be evenly distributed among streams, the last stream will
> >> absorb the remaining AGs.
> > Yeah, this should all be hidden behind xfs_filestream_select_ag()
> > when ip->i_write_stream is set....
>
> Added to the TBD list.
>
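A rough sketch of the static mapping described in the quoted patch text
above, for reference. The function name and signature are hypothetical
and not taken from the patch; it assumes stream IDs run from 1 to
nr_streams and uses a plain modulo on the inode number in place of the
inode hash, as the 8 AG / 4 stream example does.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): map a 1-based write stream
 * ID and an inode number to an AG index.
 */
uint32_t sketch_stream_to_ag(uint32_t agcount, uint32_t nr_streams,
			     uint32_t stream_id, uint64_t ino)
{
	uint32_t idx, per_stream, start, len;

	assert(agcount > 0 && nr_streams > 0);
	assert(stream_id >= 1 && stream_id <= nr_streams);

	idx = stream_id - 1;

	/* Fewer AGs than streams: spread streams round robin over AGs. */
	if (agcount < nr_streams)
		return idx % agcount;

	/* Otherwise each stream owns a run of contiguous AGs ... */
	per_stream = agcount / nr_streams;
	start = idx * per_stream;
	len = per_stream;

	/* ... and the last stream absorbs any remaining AGs. */
	if (idx == nr_streams - 1)
		len = agcount - start;

	/* The inode number picks one AG inside the stream's partition. */
	return start + (uint32_t)(ino % len);
}
```

With agcount = 8 and nr_streams = 4 this reproduces the diagram: inodes
100 and 101 in stream 1 land in AG 0 and AG 1, inodes 200 and 201 in
stream 2 land in AG 2 and AG 3.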
> >> Note that there are no hard boundaries; this only provides an explicit
> >> routing hint to the xfs allocator so that it can group/isolate files the way
> >> the application has decided to group/isolate. We still try to preserve file
> >> contiguity, and the full space can be utilized even with a single stream.
> > Yes, that's pretty much exactly what the filestreams allocator was
> > designed to do. It's a whole lot more dynamic than what you are
> > trying to do above and is not limited to fixed AGs for streams - as
> > soon as an AG is out of space, it will select the AG with the most
> > free space for the stream and keep that relationship until that AG
> > is out of space.
> >
> > IOWs, filestreams does not limit a stream to a fixed number of AGs.
> > All it does is keep IO with the same stream ID in the same AG until
> > the AG is full and, as much as possible, prevents multiple streams
> > from using the same AG.
>
>
> filestream: 1 filestream == 1 AG, at a time. And that can cause AGF lock
> contention on high-concurrency NVMe workloads, i.e., when multiple
> threads are writing to different files in the same filestream.
Yes, I know. I understand what you are trying to do and why. What
I'm telling you - as an XFS allocator expert - is how to implement it
in a way that fits into the existing XFS allocator policy framework.
As I said, the existing filestreams allocator stream association is
not an exact match to what you are trying to do. It is, however, trivial
to modify the filestreams stream-to-ag association behaviour to
match what you are trying to do.
> What I am doing here with new write stream has two aspects:
>
> (a) inter stream concurrency: multiple threads writing to different
> files in different streams are not going to run into AGF lock.
Yup, default behaviour of the allocator - it defines a "write
stream" to be all the files in a given directory. Hence workloads
operating in different directories will target different AGs and not
contend unless unrelated directories land in the same AG.
Filestreams avoids that problem. It adds the constraint that a
workload in a directory will have a dynamic association with an AG,
instead of it being static based on the directory inode's number.
This allows the filesystem to dynamically separate workloads in
different directories to different AGs.
All you are trying to do is define the related data set by
i_write_stream, and then separate them into different AGs.
The first step in this process is to add this association to the
filestreams allocator, and have it trigger when ip->i_write_stream
is set.
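As a minimal sketch of that first step: the association key comes from
the write stream when one is set, and from the parent directory
otherwise. Every name below is hypothetical (struct sketch_inode is a
two-field stand-in, not the XFS inode), and it assumes a write stream
value of 0 means "not set".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the inode; only what the sketch needs. */
struct sketch_inode {
	uint64_t parent_ino;	/* inode number of the parent directory */
	uint32_t write_stream;	/* assumed: 0 means no write stream set */
};

/* Key used to look up the stream-to-AG association. */
struct sketch_stream_key {
	int	 by_write_stream;	/* 1: id is a write stream ID */
	uint64_t id;			/* write stream ID or parent inode */
};

struct sketch_stream_key sketch_assoc_key(const struct sketch_inode *ip)
{
	struct sketch_stream_key key;

	assert(ip != NULL);

	if (ip->write_stream) {
		/* Write stream set: group by it, wherever the file lives. */
		key.by_write_stream = 1;
		key.id = ip->write_stream;
	} else {
		/* Existing behaviour: group by the parent directory. */
		key.by_write_stream = 0;
		key.id = ip->parent_ino;
	}
	return key;
}
```

The tag in the key is there only so a write stream ID can never be
confused with a directory inode number of the same value.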
> (b) intra stream concurrency: multiple threads writing to different
> files in single stream will also face 'reduced' contention if each
> stream is a collection of AG and we are spreading the load (with inode
> hash). Therefore, each stream is partitioned into a group of AGs.
This is not a write stream specific allocator improvement.
This issue exists no matter how we define a write stream because we
currently only have a single AG association with a write stream.
The allocator currently addresses this with trylock based AG
iteration. i.e. it already spreads a write stream over multiple AGs
when the AG is contended during allocation. Hence there is no real
need for the generic allocator to define more than a single AG to a
write stream to avoid allocator contention.
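To illustrate the idea only (this is not the XFS code): a userspace
sketch of trylock based AG iteration. All names are hypothetical, and a
C11 atomic_flag stands in for the per-AG lock. The real allocator also
looks at free space and other state, which is left out here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NO_AG	UINT32_MAX

/* Hypothetical per-AG state; the flag stands in for the AGF lock. */
struct sketch_ag {
	atomic_flag busy;
};

bool sketch_ag_trylock(struct sketch_ag *ag)
{
	/* test_and_set returns the old value: false means we took it. */
	return !atomic_flag_test_and_set(&ag->busy);
}

void sketch_ag_unlock(struct sketch_ag *ag)
{
	atomic_flag_clear(&ag->busy);
}

/*
 * Start at the preferred AG and take the first AG whose lock is free,
 * wrapping around once. On success the AG is returned locked.
 */
uint32_t sketch_pick_ag(struct sketch_ag *ags, uint32_t agcount,
			uint32_t start)
{
	assert(agcount > 0 && start < agcount);

	for (uint32_t i = 0; i < agcount; i++) {
		uint32_t agno = (start + i) % agcount;

		if (sketch_ag_trylock(&ags[agno]))
			return agno;
	}
	return SKETCH_NO_AG;	/* every AG was contended */
}
```

A writer whose preferred AG is busy simply lands in the next free one,
which is how a single stream ends up spread over several AGs under
contention.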
However, there is good reason to enable this sort of functionality
as a generic behaviour because it would help prevent allocation
interleaving across files in the same data set. We have ways of
mitigating that (delalloc-based speculative prealloc, extent size
hints, etc), but having generic AG sets for each workload would help
address this.
Similarly, adding a generic AG set for a filestream association
(e.g. 2 or 4 consecutive AGs per association) would address this
issue for filestreams as well.
Then a common "AG set" definition and behaviour can be defined for
all the allocators (ie. consistent, predictable behaviour across the
filesystem). e.g. The AG within the set could be selected based on
the low bits of the inode number we are allocating for, hence
resulting in a file always trying to use the same AG, but files
within the data set are spread across all AGs in the target AG set.
Such behaviour would be encapsulated inside the target AG selection
for the allocator policies. i.e. xfs_filestream_select_ag() for
the filestreams allocator, and xfs_bmap_btalloc_select_lengths() for
the normal allocator.
And then the rest of the allocation code remains unchanged.
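A small sketch of the AG set selection described above. The struct and
function names are hypothetical; it assumes the set is a run of
consecutive AGs whose size is a power of two, so that the low bits of
the inode number can be used directly as the index.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of an "AG set": a run of consecutive AGs. */
struct sketch_ag_set {
	uint32_t first_agno;	/* first AG in the set */
	uint32_t nr_ags;	/* set size, assumed to be a power of two */
};

/* Pick the target AG within the set from the low bits of the inode. */
uint32_t sketch_ag_set_select(const struct sketch_ag_set *set, uint64_t ino)
{
	/* Non-zero and a power of two, so nr_ags - 1 is a valid bit mask. */
	assert(set->nr_ags > 0 && (set->nr_ags & (set->nr_ags - 1)) == 0);

	return set->first_agno + (uint32_t)(ino & (set->nr_ags - 1));
}
```

The same inode always resolves to the same AG, while inodes that differ
in their low bits spread across the set. Wrapping the set at the end of
the filesystem is not handled in this sketch.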
> Also, with filestream we can't do cross-directory grouping, of
> file-level granularity.
Which is why the filestream association for write streams needs to
be based on ip->i_write_stream, not the parent directory!
> Write stream is a more explicit model.
> Application decides what files are to be spatially separated/grouped and
No. The application cannot decide what is "spatially
separated/grouped". Even the filesystem cannot decide that because
it does not know how the underlying storage is physically managed.
LBA addresses do not define physical locations in storage anymore;
they are just a convenient abstraction that hides the physical
characteristics of the storage from the OS.
IOWs, Write streams do not define "spatial" locality. All they define is
how the data in certain IOs is related to other IOs. Hence all that
write stream IDs can do is provide information about data
relationships.
As such, I don't think filesystems really need to care that much
about write stream IDs that are passed down to the hardware.
However, if there are some things we can do that help scalability
and performance for write stream related IO, then I'm not opposed to
doing that. Especially if the filesystem already has all the
infrastructure in place to handle write stream associations in a
dynamic manner...
> what kind of concurrency buckets should be chosen for its N threads/files.
This is not something the application should be caring about.
If you have known IO concurrency requirements, then you should
create your XFS filesystem with enough AGs to handle that
concurrency requirement in the first place. Then the filesystem
itself should be able to make sane decisions about how to spread the
concurrency load across AGs without having to be micromanaged by
the application.
i.e. if you are trying to manage low level filesystem concurrency
workarounds in the application, you are doing it wrong...
-Dave.
--
Dave Chinner
dgc@kernel.org
* Re: [PATCH v2 4/5] xfs: steer allocation using write stream
2026-03-10 22:44 ` Dave Chinner
@ 2026-03-11 9:59 ` Kanchan Joshi
0 siblings, 0 replies; 22+ messages in thread
From: Kanchan Joshi @ 2026-03-11 9:59 UTC (permalink / raw)
To: Dave Chinner
Cc: brauner, hch, djwong, jack, cem, kbusch, axboe, linux-xfs,
linux-fsdevel, gost.dev
On 3/11/2026 4:14 AM, Dave Chinner wrote:
> Yes, I know. I understand what you are trying to do and why. What
> I'm telling you - as an XFS allocator expert - is how to implement it
> in a way that fits into the existing XFS allocator policy framework.
>
> As I said, the existing filestreams allocator stream association is
> not an exact match to what you are trying to do. It is, however, trivial
> to modify the filestreams stream-to-ag association behaviour to
> match what you are trying to do.
Yes, I am already inclined to explore that direction. Thanks for all the
inputs on how to properly fit this into the existing XFS allocator
policy framework.
end of thread, other threads:[~2026-03-11 10:00 UTC | newest]
Thread overview: 22+ messages
[not found] <CGME20260309053425epcas5p32886580a4fbe646ceee66f2864970e9f@epcas5p3.samsung.com>
2026-03-09 5:29 ` [PATCH v2 0/5] write streams and xfs spatial isolation Kanchan Joshi
2026-03-09 5:29 ` [PATCH v2 1/5] fs: add generic write-stream management ioctl Kanchan Joshi
2026-03-09 16:33 ` Darrick J. Wong
2026-03-10 17:55 ` Kanchan Joshi
2026-03-10 20:44 ` Darrick J. Wong
2026-03-09 5:29 ` [PATCH v2 2/5] iomap: introduce and propagate write_stream Kanchan Joshi
2026-03-09 16:34 ` Darrick J. Wong
2026-03-10 17:58 ` Kanchan Joshi
2026-03-09 5:29 ` [PATCH v2 3/5] xfs: implement write-stream management support Kanchan Joshi
2026-03-09 16:38 ` Darrick J. Wong
2026-03-10 18:07 ` Kanchan Joshi
2026-03-09 5:29 ` [PATCH v2 4/5] xfs: steer allocation using write stream Kanchan Joshi
2026-03-09 12:45 ` kernel test robot
2026-03-09 20:01 ` kernel test robot
2026-03-10 5:47 ` Dave Chinner
2026-03-10 19:03 ` Kanchan Joshi
2026-03-10 22:44 ` Dave Chinner
2026-03-11 9:59 ` Kanchan Joshi
2026-03-09 5:29 ` [PATCH v2 5/5] xfs: introduce software write streams Kanchan Joshi
2026-03-10 6:01 ` Dave Chinner
2026-03-09 15:40 ` [PATCH v2 0/5] write streams and xfs spatial isolation Christoph Hellwig
2026-03-10 21:19 ` Kanchan Joshi