* [PATCH 4/7] fuse: implement file attributes mask for statx
2025-07-17 23:23 [PATCHSET RFC v3 1/4] fuse: fixes and cleanups ahead of iomap support Darrick J. Wong
@ 2025-07-17 23:27 ` Darrick J. Wong
2025-08-18 15:11 ` Miklos Szeredi
0 siblings, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-07-17 23:27 UTC (permalink / raw)
To: djwong; +Cc: linux-fsdevel, neal, John, miklos, bernd, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Actually copy the attributes/attributes_mask from userspace.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/dir.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 45b4c3cc1396af..4d841869ba3d0a 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1285,6 +1285,8 @@ static int fuse_do_statx(struct mnt_idmap *idmap, struct inode *inode,
 		stat->result_mask = sx->mask & (STATX_BASIC_STATS | STATX_BTIME);
 		stat->btime.tv_sec = sx->btime.tv_sec;
 		stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, NSEC_PER_SEC - 1);
+		stat->attributes = sx->attributes;
+		stat->attributes_mask = sx->attributes_mask;
 		fuse_fillattr(idmap, inode, &attr, stat);
 		stat->result_mask |= STATX_TYPE;
 	}
^ permalink raw reply related [flat|nested] 210+ messages in thread
* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
2025-07-17 23:27 ` [PATCH 4/7] fuse: implement file attributes mask for statx Darrick J. Wong
@ 2025-08-18 15:11 ` Miklos Szeredi
2025-08-18 20:01 ` Darrick J. Wong
0 siblings, 1 reply; 210+ messages in thread
From: Miklos Szeredi @ 2025-08-18 15:11 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong
On Fri, 18 Jul 2025 at 01:27, Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Actually copy the attributes/attributes_mask from userspace.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
> fs/fuse/dir.c | 2 ++
> 1 file changed, 2 insertions(+)
>
>
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 45b4c3cc1396af..4d841869ba3d0a 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -1285,6 +1285,8 @@ static int fuse_do_statx(struct mnt_idmap *idmap, struct inode *inode,
> stat->result_mask = sx->mask & (STATX_BASIC_STATS | STATX_BTIME);
> stat->btime.tv_sec = sx->btime.tv_sec;
> stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, NSEC_PER_SEC - 1);
> + stat->attributes = sx->attributes;
> + stat->attributes_mask = sx->attributes_mask;
fuse_update_get_attr() has a cached and an uncached branch and these
fields are only getting set in the uncached case.
Thanks,
Miklos
* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
2025-08-18 15:11 ` Miklos Szeredi
@ 2025-08-18 20:01 ` Darrick J. Wong
2025-08-18 20:04 ` Darrick J. Wong
2025-08-19 15:01 ` Miklos Szeredi
0 siblings, 2 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-18 20:01 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong
On Mon, Aug 18, 2025 at 05:11:07PM +0200, Miklos Szeredi wrote:
> On Fri, 18 Jul 2025 at 01:27, Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Actually copy the attributes/attributes_mask from userspace.
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> > fs/fuse/dir.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> >
> > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > index 45b4c3cc1396af..4d841869ba3d0a 100644
> > --- a/fs/fuse/dir.c
> > +++ b/fs/fuse/dir.c
> > @@ -1285,6 +1285,8 @@ static int fuse_do_statx(struct mnt_idmap *idmap, struct inode *inode,
> > stat->result_mask = sx->mask & (STATX_BASIC_STATS | STATX_BTIME);
> > stat->btime.tv_sec = sx->btime.tv_sec;
> > stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, NSEC_PER_SEC - 1);
> > + stat->attributes = sx->attributes;
> > + stat->attributes_mask = sx->attributes_mask;
>
> fuse_update_get_attr() has a cached and an uncached branch and these
> fields are only getting set in the uncached case.
Hrmm, do you want to cache all the various statx attributes in struct
fuse_inode? Or would you rather that the kernel always call the fuse
server if any of the statx flags outside of (BASIC_STATS|BTIME) are set?
Right now the full version of kstat_from_fuse_statx contains:
	if (sx->mask & STATX_BTIME) {
		stat->btime.tv_sec = sx->btime.tv_sec;
		stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, NSEC_PER_SEC - 1);
	}

	if (sx->mask & STATX_DIOALIGN) {
		stat->dio_mem_align = sx->dio_mem_align;
		stat->dio_offset_align = sx->dio_offset_align;
	}

	if (sx->mask & STATX_SUBVOL)
		stat->subvol = sx->subvol;

	if (sx->mask & STATX_WRITE_ATOMIC) {
		stat->atomic_write_unit_min = sx->atomic_write_unit_min;
		stat->atomic_write_unit_max = sx->atomic_write_unit_max;
		stat->atomic_write_unit_max_opt = sx->atomic_write_unit_max_opt;
		stat->atomic_write_segments_max = sx->atomic_write_segments_max;
	}

	if (sx->mask & STATX_DIO_READ_ALIGN)
		stat->dio_read_offset_align = sx->dio_read_offset_align;
In theory only specialty programs are going to be interested in directio
or atomic writes, and only userspace nfs servers and backup programs are
going to care about subvolumes, so I don't know if it's really worth the
trouble to cache all that.
The dio/atomic fields are 7x u32, and the subvol id is u64. That's 40
bytes per inode, which is kind of a lot.
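For reference, the 40-byte figure can be checked with a small userspace sketch. The struct below is hypothetical (nothing like it exists in struct fuse_inode); it only groups the seven u32 dio/atomic fields and the u64 subvol id the way a per-inode cache might. The raw payload is 8 + 7 * 4 = 36 bytes, and on ABIs where a u64 is 8-byte aligned the tail padding rounds that up to 40.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-inode cache of the extra statx fields (illustration only). */
struct statx_extra_cache {
	uint64_t subvol;
	uint32_t dio_mem_align;
	uint32_t dio_offset_align;
	uint32_t dio_read_offset_align;
	uint32_t atomic_write_unit_min;
	uint32_t atomic_write_unit_max;
	uint32_t atomic_write_unit_max_opt;
	uint32_t atomic_write_segments_max;
};

/* Sum of the field sizes, ignoring padding: 8 + 7 * 4 = 36. */
static size_t statx_extra_payload_bytes(void)
{
	return sizeof(uint64_t) + 7 * sizeof(uint32_t);
}

/* What each inode would actually pay, including any tail padding. */
static size_t statx_extra_cache_bytes(void)
{
	return sizeof(struct statx_extra_cache);
}
```

The padded size depends on the target ABI, so the second helper reports whatever the compiler chose rather than hard-coding 40.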
--D
> Thanks,
> Miklos
>
* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
2025-08-18 20:01 ` Darrick J. Wong
@ 2025-08-18 20:04 ` Darrick J. Wong
2025-08-19 15:01 ` Miklos Szeredi
1 sibling, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-18 20:04 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong
On Mon, Aug 18, 2025 at 01:01:55PM -0700, Darrick J. Wong wrote:
> On Mon, Aug 18, 2025 at 05:11:07PM +0200, Miklos Szeredi wrote:
> > On Fri, 18 Jul 2025 at 01:27, Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > Actually copy the attributes/attributes_mask from userspace.
> > >
> > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > ---
> > > fs/fuse/dir.c | 2 ++
> > > 1 file changed, 2 insertions(+)
> > >
> > >
> > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > > index 45b4c3cc1396af..4d841869ba3d0a 100644
> > > --- a/fs/fuse/dir.c
> > > +++ b/fs/fuse/dir.c
> > > @@ -1285,6 +1285,8 @@ static int fuse_do_statx(struct mnt_idmap *idmap, struct inode *inode,
> > > stat->result_mask = sx->mask & (STATX_BASIC_STATS | STATX_BTIME);
> > > stat->btime.tv_sec = sx->btime.tv_sec;
> > > stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, NSEC_PER_SEC - 1);
> > > + stat->attributes = sx->attributes;
> > > + stat->attributes_mask = sx->attributes_mask;
> >
> > fuse_update_get_attr() has a cached and an uncached branch and these
> > fields are only getting set in the uncached case.
>
> Hrmm, do you want to cache all the various statx attributes in struct
> fuse_inode? Or would you rather that the kernel always call the fuse
> server if any of the statx flags outside of (BASIC_STATS|BTIME) are set?
I should have said explicitly that attributes/attributes_mask need to be
cached because there's no separate STATX_ request flag for the bitfield.
However, the *new* fields that have been added since BASIC_STATS are the
subject of my ramblings below.
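A minimal sketch of that point, in userspace C with made-up types: because callers cannot ask for attributes/attributes_mask through a STATX_ bit, the cached branch has to be able to return them too. The model_* names and the two cached_* fields are assumptions for illustration, not existing kernel code.

```c
#include <stdint.h>

/* Simplified stand-in for the two struct kstat fields in question. */
struct model_kstat {
	uint64_t attributes;
	uint64_t attributes_mask;
};

/* Simplified stand-in for struct fuse_inode with two hypothetical fields. */
struct model_inode {
	uint64_t cached_attributes;
	uint64_t cached_attributes_mask;
};

/* Uncached branch: a fresh statx reply arrived from the fuse server. */
static void model_fill_from_reply(struct model_inode *fi,
				  struct model_kstat *stat,
				  uint64_t attributes,
				  uint64_t attributes_mask)
{
	/* Remember the values so the cached branch can report them later. */
	fi->cached_attributes = attributes;
	fi->cached_attributes_mask = attributes_mask;
	stat->attributes = attributes;
	stat->attributes_mask = attributes_mask;
}

/* Cached branch: no upcall, so the values must come from the inode. */
static void model_fill_from_cache(const struct model_inode *fi,
				  struct model_kstat *stat)
{
	stat->attributes = fi->cached_attributes;
	stat->attributes_mask = fi->cached_attributes_mask;
}
```

That costs two u64 fields per inode, which is the price of making both branches agree.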
--D
> Right now the full version of kstat_from_fuse_statx contains:
>
> if (sx->mask & STATX_BTIME) {
> stat->btime.tv_sec = sx->btime.tv_sec;
> stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, NSEC_PER_SEC - 1);
> }
>
> if (sx->mask & STATX_DIOALIGN) {
> stat->dio_mem_align = sx->dio_mem_align;
> stat->dio_offset_align = sx->dio_offset_align;
> }
>
> if (sx->mask & STATX_SUBVOL)
> stat->subvol = sx->subvol;
>
> if (sx->mask & STATX_WRITE_ATOMIC) {
> stat->atomic_write_unit_min = sx->atomic_write_unit_min;
> stat->atomic_write_unit_max = sx->atomic_write_unit_max;
> stat->atomic_write_unit_max_opt = sx->atomic_write_unit_max_opt;
> stat->atomic_write_segments_max = sx->atomic_write_segments_max;
> }
>
> if (sx->mask & STATX_DIO_READ_ALIGN)
> stat->dio_read_offset_align = sx->dio_read_offset_align;
>
> In theory only specialty programs are going to be interested in directio
> or atomic writes, and only userspace nfs servers and backup programs are
> going to care about subvolumes, so I don't know if it's really worth the
> trouble to cache all that.
>
> The dio/atomic fields are 7x u32, and the subvol id is u64. That's 40
> bytes per inode, which is kind of a lot.
>
> --D
>
> > Thanks,
> > Miklos
> >
>
* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
2025-08-18 20:01 ` Darrick J. Wong
2025-08-18 20:04 ` Darrick J. Wong
@ 2025-08-19 15:01 ` Miklos Szeredi
2025-08-19 22:51 ` Darrick J. Wong
1 sibling, 1 reply; 210+ messages in thread
From: Miklos Szeredi @ 2025-08-19 15:01 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong
On Mon, 18 Aug 2025 at 22:01, Darrick J. Wong <djwong@kernel.org> wrote:
> In theory only specialty programs are going to be interested in directio
> or atomic writes, and only userspace nfs servers and backup programs are
> going to care about subvolumes, so I don't know if it's really worth the
> trouble to cache all that.
>
> The dio/atomic fields are 7x u32, and the subvol id is u64. That's 40
> bytes per inode, which is kind of a lot.
Agreed. This should also depend on the sync mode.
AT_STATX_DONT_SYNC: anything not cached should be cleared from the mask.
AT_STATX_FORCE_SYNC: cached values should be ignored and FUSE_STATX
request sent.
AT_STATX_SYNC_AS_STAT: ???
Thanks,
Miklos
* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
2025-08-19 15:01 ` Miklos Szeredi
@ 2025-08-19 22:51 ` Darrick J. Wong
2025-08-20 9:16 ` Miklos Szeredi
0 siblings, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-19 22:51 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong
On Tue, Aug 19, 2025 at 05:01:15PM +0200, Miklos Szeredi wrote:
> On Mon, 18 Aug 2025 at 22:01, Darrick J. Wong <djwong@kernel.org> wrote:
>
> > In theory only specialty programs are going to be interested in directio
> > or atomic writes, and only userspace nfs servers and backup programs are
> > going to care about subvolumes, so I don't know if it's really worth the
> > trouble to cache all that.
> >
> > The dio/atomic fields are 7x u32, and the subvol id is u64. That's 40
> > bytes per inode, which is kind of a lot.
>
> Agreed. This should also depend on the sync mode.
>
> AT_STATX_DONT_SYNC: anything not cached should be cleared from the mask.
>
> AT_STATX_FORCE_SYNC: cached values should be ignored and FUSE_STATX
> request sent.
IMO, if the caller asks for the weird statx attributes
(dioalign/subvol/write_atomic) then they probably prefer to wait to get
the attributes they asked for. I'd be willing to strip them out of the
request_mask if they affirm _DONT_SYNC though.
Something like this, maybe?
#define FUSE_UNCACHED_STATX_MASK	(STATX_DIOALIGN | \
					 STATX_SUBVOL | \
					 STATX_WRITE_ATOMIC)

and then in fuse_update_get_attr,

	if (!request_mask)
		sync = false;
	else if (request_mask & FUSE_UNCACHED_STATX_MASK) {
		if (flags & AT_STATX_DONT_SYNC) {
			request_mask &= ~FUSE_UNCACHED_STATX_MASK;
			sync = false;
		} else {
			sync = true;
		}
	} else if (flags & AT_STATX_FORCE_SYNC)
		sync = true;
	else if (flags & AT_STATX_DONT_SYNC)
		sync = false;
	else if (request_mask & inval_mask & ~cache_mask)
		sync = true;
	else
		sync = time_before64(fi->i_time, get_jiffies_64());
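The same decision can be exercised outside the kernel with a standalone model. This is a sketch, not the kernel function: the M_* constants are local copies of the uapi values, and the two booleans are assumptions that stand in for the inval_mask/cache_mask test and the i_time expiry test.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the uapi values, prefixed to avoid clashing with headers. */
#define M_AT_STATX_FORCE_SYNC	0x2000u
#define M_AT_STATX_DONT_SYNC	0x4000u
#define M_STATX_DIOALIGN	0x00002000u
#define M_STATX_SUBVOL		0x00008000u
#define M_STATX_WRITE_ATOMIC	0x00010000u
#define M_UNCACHED_STATX_MASK	(M_STATX_DIOALIGN | M_STATX_SUBVOL | \
				 M_STATX_WRITE_ATOMIC)

/*
 * Return true if the fuse server must be asked.  With DONT_SYNC the
 * uncached bits are stripped from *request_mask, because there is no
 * cached value to hand back for them.
 */
static bool model_need_sync(uint32_t *request_mask, unsigned int flags,
			    bool cache_stale, bool expired)
{
	if (!*request_mask)
		return false;
	if (*request_mask & M_UNCACHED_STATX_MASK) {
		if (flags & M_AT_STATX_DONT_SYNC) {
			*request_mask &= ~M_UNCACHED_STATX_MASK;
			return false;
		}
		return true;
	}
	if (flags & M_AT_STATX_FORCE_SYNC)
		return true;
	if (flags & M_AT_STATX_DONT_SYNC)
		return false;
	if (cache_stale)
		return true;
	return expired;
}
```

Note that asking for any uncached field without DONT_SYNC forces an upcall even when the basic attributes are still fresh.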
> AT_STATX_SYNC_AS_STAT: ???
I have no idea what that means. :)
Way back in 2017, dhowells implied that it synchronises the attributes
with the backing store in the same way that network filesystems do[1].
But the question is, does fuse count as a network fs?
I guess it does. But the discussion from 2016 also provided "this is
very filesystem specific" so I guess we can do whatever we want?? XFS
and ext4 ignore that value. The statx(2) manpage repeats that "whatever
stat does" language, but the stat(2) and stat(3) manpages don't say a
darned thing.
I was just gonna ignore it.
[1] https://lore.kernel.org/linux-fsdevel/147948603812.5122.5116851833739815967.stgit@warthog.procyon.org.uk/
--D
> Thanks,
> Miklos
>
* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
2025-08-19 22:51 ` Darrick J. Wong
@ 2025-08-20 9:16 ` Miklos Szeredi
2025-08-20 9:40 ` Miklos Szeredi
2025-08-20 15:09 ` Darrick J. Wong
0 siblings, 2 replies; 210+ messages in thread
From: Miklos Szeredi @ 2025-08-20 9:16 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong
On Wed, 20 Aug 2025 at 00:51, Darrick J. Wong <djwong@kernel.org> wrote:
> Something like this, maybe?
>
> #define FUSE_UNCACHED_STATX_MASK (STATX_DIOALIGN | \
> STATX_SUBVOL | \
> STATX_WRITE_ATOMIC)
>
> and then in fuse_update_get_attr,
>
> if (!request_mask)
> sync = false;
> else if (request_mask & FUSE_UNCACHED_STATX_MASK) {
> if (flags & AT_STATX_DONT_SYNC) {
> request_mask &= ~FUSE_UNCACHED_STATX_MASK;
> sync = false;
> } else {
> sync = true;
> }
> } else if (flags & AT_STATX_FORCE_SYNC)
> sync = true;
> else if (flags & AT_STATX_DONT_SYNC)
> sync = false;
> else if (request_mask & inval_mask & ~cache_mask)
> sync = true;
> else
> sync = time_before64(fi->i_time, get_jiffies_64());
Yes.
> Way back in 2017, dhowells implied that it synchronises the attributes
> with the backing store in the same way that network filesystems do[1].
> But the question is, does fuse count as a network fs?
>
> I guess it does. But the discussion from 2016 also provided "this is
> very filesystem specific" so I guess we can do whatever we want?? XFS
> and ext4 ignore that value. The statx(2) manpage repeats that "whatever
> stat does" language, but the stat(2) and stat(3) manpages don't say a
> darned thing.
Actually we can't ignore it, since it's the default (i.e. if neither
FORCE_SYNC nor DONT_SYNC is in effect, then that implies
SYNC_AS_STAT).
I guess the semantics you codified above make sense. In words:
"If neither forcing nor forbidding sync, then statx shall always
attempt to return attributes that are defined on that filesystem, but
may return stale values."
As an optimization of the above, the filesystem clearing the
request_mask for these uncached attributes means that that attribute
is not supported by the filesystem and that *can* be cheaply cached
(e.g. clearing fi->inval_mask).
Thanks,
Miklos
* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
2025-08-20 9:16 ` Miklos Szeredi
@ 2025-08-20 9:40 ` Miklos Szeredi
2025-08-20 15:16 ` Darrick J. Wong
2025-08-20 15:09 ` Darrick J. Wong
1 sibling, 1 reply; 210+ messages in thread
From: Miklos Szeredi @ 2025-08-20 9:40 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong
On Wed, 20 Aug 2025 at 11:16, Miklos Szeredi <miklos@szeredi.hu> wrote:
> As an optimization of the above, the filesystem clearing the
> request_mask for these uncached attributes means that that attribute
> is not supported by the filesystem and that *can* be cheaply cached
> (e.g. clearing fi->inval_mask).
Even better: add sx_supported to fuse_init_out, so that unsupported
ones don't generate unnecessary requests.
Thanks,
Miklos
* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
2025-08-20 9:16 ` Miklos Szeredi
2025-08-20 9:40 ` Miklos Szeredi
@ 2025-08-20 15:09 ` Darrick J. Wong
2025-08-20 15:23 ` Miklos Szeredi
1 sibling, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-20 15:09 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong
On Wed, Aug 20, 2025 at 11:16:42AM +0200, Miklos Szeredi wrote:
> On Wed, 20 Aug 2025 at 00:51, Darrick J. Wong <djwong@kernel.org> wrote:
>
> > Something like this, maybe?
> >
> > #define FUSE_UNCACHED_STATX_MASK (STATX_DIOALIGN | \
> > STATX_SUBVOL | \
> > STATX_WRITE_ATOMIC)
> >
> > and then in fuse_update_get_attr,
> >
> > if (!request_mask)
> > sync = false;
> > else if (request_mask & FUSE_UNCACHED_STATX_MASK) {
> > if (flags & AT_STATX_DONT_SYNC) {
> > request_mask &= ~FUSE_UNCACHED_STATX_MASK;
> > sync = false;
> > } else {
> > sync = true;
> > }
> > } else if (flags & AT_STATX_FORCE_SYNC)
> > sync = true;
> > else if (flags & AT_STATX_DONT_SYNC)
> > sync = false;
> > else if (request_mask & inval_mask & ~cache_mask)
> > sync = true;
> > else
> > sync = time_before64(fi->i_time, get_jiffies_64());
>
> Yes.
>
> > Way back in 2017, dhowells implied that it synchronises the attributes
> > with the backing store in the same way that network filesystems do[1].
> > But the question is, does fuse count as a network fs?
> >
> > I guess it does. But the discussion from 2016 also provided "this is
> > very filesystem specific" so I guess we can do whatever we want?? XFS
> > and ext4 ignore that value. The statx(2) manpage repeats that "whatever
> > stat does" language, but the stat(2) and stat(3) manpages don't say a
> > darned thing.
Ohhh, only now I noticed that it's one of those trickster flag symbols
like O_RDONLY that are #define'd to 0. That's why there's no
(flags & SYNC_AS_STAT) anywhere in the codebase.
> Actually we can't ignore it, since it's the default (i.e. if neither
> FORCE_SYNC nor DONT_SYNC is in effect, then that implies
> SYNC_AS_STAT).
>
> I guess the semantics you codified above make sense. In words:
>
> "If neither forcing nor forbidding sync, then statx shall always
> attempt to return attributes that are defined on that filesystem, but
> may return stale values."
Where is that written? I'd like to read the rest of it to clear my
head. :)
> As an optimization of the above, the filesystem clearing the
> request_mask for these uncached attributes means that that attribute
> is not supported by the filesystem and that *can* be cheaply cached
> (e.g. clearing fi->inval_mask).
Hrmm. I wouldn't want to set fi->inval_mask bits just because a
FUSE_STATX message ignored a mask bit one time -- imagine a filesystem
with tiered storage. A file might be on slow hdd storage which means no
fancy things like atomic writes, but later it might get promoted to
faster nvme which does support that.
Anyway, I'll send out RFC v4 today, which has the above update_get_attr
logic in it.
--D
> Thanks,
> Miklos
* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
2025-08-20 9:40 ` Miklos Szeredi
@ 2025-08-20 15:16 ` Darrick J. Wong
2025-08-20 15:31 ` Miklos Szeredi
0 siblings, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-20 15:16 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong
On Wed, Aug 20, 2025 at 11:40:50AM +0200, Miklos Szeredi wrote:
> On Wed, 20 Aug 2025 at 11:16, Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > As an optimization of the above, the filesystem clearing the
> > request_mask for these uncached attributes means that that attribute
> > is not supported by the filesystem and that *can* be cheaply cached
> > (e.g. clearing fi->inval_mask).
>
> Even better: add sx_supported to fuse_init_out, so that unsupported
> ones don't generate unnecessary requests.
That would work better -- if the fuse server knows it'll never respond
to STATX_SUBVOL then we could obliterate it from all the statx queries.
How does one add a new field to struct fuse_init_out without breaking
old libfuse / fuse servers which still have the old fuse_init_out?
AFAICT, fuse_send_init sets out_argvar, so fuse_copy_out_args will
handle a short reply from old libfuse. But a new libfuse running on an
old kernel can't send the kernel what it will think is an oversized
init reply, right?
So I think we end up having to declare a new flags bit for struct
fuse_init_in, and the kernel sets the bit unconditionally. libfuse
sends the larger fuse_init_out reply if the new flag bit is set, or the
old size if it isn't. Does that sound correct?
--D
* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
2025-08-20 15:09 ` Darrick J. Wong
@ 2025-08-20 15:23 ` Miklos Szeredi
2025-08-20 15:29 ` Darrick J. Wong
0 siblings, 1 reply; 210+ messages in thread
From: Miklos Szeredi @ 2025-08-20 15:23 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong
On Wed, 20 Aug 2025 at 17:09, Darrick J. Wong <djwong@kernel.org> wrote:
> > "If neither forcing nor forbidding sync, then statx shall always
> > attempt to return attributes that are defined on that filesystem, but
> > may return stale values."
>
> Where is that written? I'd like to read the rest of it to clear my
> head. :)
It's my summary of what you wrote as code.
Thanks,
Miklos
* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
2025-08-20 15:23 ` Miklos Szeredi
@ 2025-08-20 15:29 ` Darrick J. Wong
0 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-20 15:29 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong
On Wed, Aug 20, 2025 at 05:23:27PM +0200, Miklos Szeredi wrote:
> On Wed, 20 Aug 2025 at 17:09, Darrick J. Wong <djwong@kernel.org> wrote:
>
> > > "If neither forcing nor forbidding sync, then statx shall always
> > > attempt to return attributes that are defined on that filesystem, but
> > > may return stale values."
> >
> > Where is that written? I'd like to read the rest of it to clear my
> > head. :)
>
> It's my summary of what you wrote as code.
Ahhh, thanks.
/me hands himself another cup of coffee. :P
--D
> Thanks,
> Miklos
>
* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
2025-08-20 15:16 ` Darrick J. Wong
@ 2025-08-20 15:31 ` Miklos Szeredi
0 siblings, 0 replies; 210+ messages in thread
From: Miklos Szeredi @ 2025-08-20 15:31 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, neal, John, bernd, joannelkoong
On Wed, 20 Aug 2025 at 17:16, Darrick J. Wong <djwong@kernel.org> wrote:
> How does one add a new field to struct fuse_init_out without breaking
> old libfuse / fuse servers which still have the old fuse_init_out?
There's currently 22 bytes unused at the end, so it's easy unless you
want to add more.
Ideally there should also be a matching feature flag indicating that
a) kernel supports this feature b) field contains valid data.
> AFAICT, fuse_send_init sets out_argvar, so fuse_copy_out_args will
> handle a short reply from old libfuse. But a new libfuse running on an
> old kernel can't send the kernel what it will think is an oversized
> init reply, right?
>
> So I think we end up having to declare a new flags bit for struct
> fuse_init_in, and the kernel sets the bit unconditionally. libfuse
> sends the larger fuse_init_out reply if the new flag bit is set, or the
> old size if it isn't. Does that sound correct?
I think that's exactly what the previous size extension did
(FUSE_INIT_EXT flag).
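The negotiation discussed above could be modelled roughly as follows. Everything here is an assumption for illustration: the struct layouts are toy versions rather than the real fuse_init_out, and MODEL_INIT_SX_SUPPORTED is a made-up feature bit. The sketch only shows the two rules: the server sends the larger reply only if the kernel offered the bit, and the kernel trusts the new field only if the bit came back and the reply is long enough.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_INIT_SX_SUPPORTED	(1u << 0)	/* hypothetical feature bit */

/* Toy "old" reply layout (not the real struct fuse_init_out). */
struct model_init_out_old {
	uint32_t flags;
	uint32_t reserved[3];
};

/* Toy "new" reply layout with the extra field appended. */
struct model_init_out_new {
	uint32_t flags;
	uint32_t reserved[3];
	uint64_t sx_supported;
};

/* Server side: pick the reply size from the flags the kernel offered. */
static size_t model_init_reply_size(uint32_t kernel_flags)
{
	if (kernel_flags & MODEL_INIT_SX_SUPPORTED)
		return sizeof(struct model_init_out_new);
	return sizeof(struct model_init_out_old);
}

/* Kernel side: use sx_supported only if it is present and flagged valid. */
static uint64_t model_sx_supported(const struct model_init_out_new *out,
				   size_t reply_len, uint64_t fallback)
{
	if (reply_len < sizeof(*out) ||
	    !(out->flags & MODEL_INIT_SX_SUPPORTED))
		return fallback;
	return out->sx_supported;
}
```

If the new field fits into the currently unused tail bytes, the size check becomes unnecessary and only the feature-flag check remains.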
Thanks,
Miklos
* [RFC v4] fuse: use fs-iomap for better performance so we can containerize ext4
@ 2025-08-21 0:37 Darrick J. Wong
2025-08-21 0:47 ` [PATCHSET RFC v4 1/4] fuse: general bug fixes Darrick J. Wong
` (13 more replies)
0 siblings, 14 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:37 UTC (permalink / raw)
To: linux-fsdevel
Cc: Miklos Szeredi, Bernd Schubert, Joanne Koong, John Groves,
Josef Bacik, linux-ext4, Theodore Ts'o, Neal Gompa,
Amir Goldstein, Christian Brauner, Jeff Layton
Hi everyone,
Do not merge this, still!!
This is the fourth request for comments of a prototype to connect the
Linux fuse driver to fs-iomap for regular file IO operations to and from
files whose contents persist to locally attached storage devices.
Why would you want to do that? Most filesystem drivers are seriously
vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
over almost a decade of its existence. Faulty code can lead to total
kernel compromise, and I think there's a very strong incentive to move
all that parsing out to userspace where we can containerize the fuse
server process.
willy's folios conversion project (and to a certain degree RH's new
mount API) have also demonstrated that treewide changes to the core
mm/pagecache/fs code are very very difficult to pull off and take years
because you have to understand every filesystem's bespoke use of that
core code. Eeeugh.
The fuse command plumbing is very simple -- the ->iomap_begin,
->iomap_end, and iomap ->ioend calls within iomap are turned into
upcalls to the fuse server via a trio of new fuse commands. Pagecache
writeback is now a directio write. The fuse server is now able to
upsert mappings into the kernel for cached access (== zero upcalls for
rereads and pure overwrites!) and the iomap cache revalidation code
works.
With this RFC, I am able to show that it's possible to build a fuse
server for a real filesystem (ext4) that runs entirely in userspace yet
maintains most of its performance. At this stage I still get about 95%
of the kernel ext4 driver's streaming directio performance, and 110% of
its streaming buffered IO performance. Random buffered
IO is about 85% as fast as the kernel. Random direct IO is about 80% as
fast as the kernel; see the cover letter for the fuse2fs iomap changes
for more details. Unwritten extent conversions on random direct writes
are especially painful for fuse+iomap (~90% more overhead) due to upcall
overhead. And that's with (now dynamic) debugging turned on!
These items have been addressed since the third RFC:
1. fuse2fs has been forked into fuse4fs, which now talks to the low
level fuse interface. This avoids all the path walking that the
high level fuse library provides, which dramatically improves the
performance of fuse4fs. fstests runs in half the time now. Many
thanks to Amir Goldstein for giving me a rough draft of the
conversion!
2. I simplified the configuration protocols -- now there's a per-fs
bit to enable any iomap, and a per-inode bit to enable iomap on a
specific file. Registration of iomap devices now uses the backing
fd registration interface.
3. You can now specify the root nodeid for any fuse mount.
4. Atomic writes are working, at least for single fsblocks.
5. I've ported the cache implementation from xfsprogs to e2fsprogs
libsupport, so the inode and buffer caches can now dynamically grow
to support larger working sets. No more fixed-size caches!
6. Cleaned up the kernel/libfuse ABI quite a bit.
7. fstests passes 97% of the tests that run, when iomap is enabled!
Only 93% pass when iomap is disabled, and I think that's due to some
bugs in the ACL and mode handling code.
There are some major warts remaining:
a. I've a /much/ clearer picture of how one might containerize a
filesystem server, thanks to a lot of input from Christian Brauner
in response to v3. I think I have enough pieces to try setting up
a fd-passing interface into a systemd service ... but I haven't
actually written any of it yet.
b. fsdax isn't implemented. I think I'm going to work on this for
RFC v5 to see if we can simplify the file mapping handling in famfs.
If not, then everyone else gets fsdax for free.
c. ext4 doesn't support out of place writes so I don't know if that
actually works correctly.
d. I've not yet consolidated struct fuse_inode, so the iomap gunk still
eats rather a lot of space per inode.
e. fuse2fs doesn't support the ext4 journal. Urk.
f. There's a VERY large quantity of fuse2fs improvements that need to be
applied before we get to the fuse-iomap parts. I'm not sending these
(or the fstests changes) to keep the size of the patchbomb at
"unreasonably large". :P
I'll work on these in August/September, but for now here's an
unmergeable RFC to start some discussion.
--Darrick
* [PATCHSET RFC v4 1/4] fuse: general bug fixes
2025-08-21 0:37 [RFC v4] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
@ 2025-08-21 0:47 ` Darrick J. Wong
2025-08-21 0:50 ` [PATCH 1/7] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
` (6 more replies)
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (12 subsequent siblings)
13 siblings, 7 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:47 UTC (permalink / raw)
To: djwong, miklos; +Cc: stable, bernd, neal, John, linux-fsdevel, joannelkoong
Hi all,
Here's a collection of fixes that I *think* are bugs in fuse, along with
some scattered improvements.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-fixes
---
Commits in this patchset:
* fuse: fix livelock in synchronous file put from fuseblk workers
* fuse: flush pending fuse events before aborting the connection
* fuse: capture the unique id of fuse commands being sent
* fuse: implement file attributes mask for statx
* fuse: update file mode when updating acls
* fuse: propagate default and file acls on creation
* fuse: enable FUSE_SYNCFS for all servers
---
fs/fuse/fuse_i.h | 14 +++++++
fs/fuse/acl.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/dev.c | 44 +++++++++++++++++++++++
fs/fuse/dev_uring.c | 8 ++++
fs/fuse/dir.c | 96 +++++++++++++++++++++++++++++++++++++++------------
fs/fuse/file.c | 10 +++++
fs/fuse/inode.c | 5 +++
7 files changed, 245 insertions(+), 27 deletions(-)
* [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance
2025-08-21 0:37 [RFC v4] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
2025-08-21 0:47 ` [PATCHSET RFC v4 1/4] fuse: general bug fixes Darrick J. Wong
@ 2025-08-21 0:47 ` Darrick J. Wong
2025-08-21 0:52 ` [PATCH 01/23] fuse: move CREATE_TRACE_POINTS to a separate file Darrick J. Wong
` (22 more replies)
2025-08-21 0:47 ` [PATCHSET RFC v4 3/4] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
` (11 subsequent siblings)
13 siblings, 23 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:47 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
Hi all,
This series connects fuse (the userspace filesystem layer) to fs-iomap
to get fuse servers out of the business of handling file I/O themselves.
By keeping the IO path mostly within the kernel, we can dramatically
improve the speed of disk-based filesystems. This enables us to move
all the filesystem metadata parsing code out of the kernel and into
userspace, which means that we can containerize them for security
without losing a lot of performance.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap
---
Commits in this patchset:
* fuse: move CREATE_TRACE_POINTS to a separate file
* fuse: implement the basic iomap mechanisms
* fuse: make debugging configurable at runtime
* fuse: move the backing file idr and code into a new source file
* fuse: move the passthrough-specific code back to passthrough.c
* fuse: add an ioctl to add new iomap devices
* fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount
* fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
* fuse: implement direct IO with iomap
* fuse: implement buffered IO with iomap
* fuse: enable caching of timestamps
* fuse: implement large folios for iomap pagecache files
* fuse: use an unrestricted backing device with iomap pagecache io
* fuse: advertise support for iomap
* fuse: query filesystem geometry when using iomap
* fuse: implement fadvise for iomap files
* fuse: make the root nodeid dynamic
* fuse: allow setting of root nodeid
* fuse: invalidate ranges of block devices being used for iomap
* fuse: implement inline data file IO via iomap
* fuse: allow more statx fields
* fuse: support atomic writes with iomap
* fuse: enable iomap
---
fs/fuse/fuse_i.h | 249 +++++
fs/fuse/fuse_trace.h | 996 +++++++++++++++++++++
fs/fuse/iomap_priv.h | 52 +
include/uapi/linux/fuse.h | 195 ++++
fs/fuse/Kconfig | 45 +
fs/fuse/Makefile | 5
fs/fuse/backing.c | 237 +++++
fs/fuse/dev.c | 35 +
fs/fuse/dir.c | 117 ++
fs/fuse/file.c | 133 ++-
fs/fuse/file_iomap.c | 2183 +++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/inode.c | 92 +-
fs/fuse/passthrough.c | 199 +---
fs/fuse/readdir.c | 10
fs/fuse/trace.c | 15
15 files changed, 4316 insertions(+), 247 deletions(-)
create mode 100644 fs/fuse/iomap_priv.h
create mode 100644 fs/fuse/backing.c
create mode 100644 fs/fuse/file_iomap.c
create mode 100644 fs/fuse/trace.c
* [PATCHSET RFC v4 3/4] fuse: cache iomap mappings for even better file IO performance
2025-08-21 0:37 [RFC v4] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
2025-08-21 0:47 ` [PATCHSET RFC v4 1/4] fuse: general bug fixes Darrick J. Wong
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
@ 2025-08-21 0:47 ` Darrick J. Wong
2025-08-21 0:58 ` [PATCH 1/4] fuse: cache iomaps Darrick J. Wong
` (3 more replies)
2025-08-21 0:48 ` [PATCHSET RFC v4 4/4] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (10 subsequent siblings)
13 siblings, 4 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:47 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
Hi all,
This series improves the performance (and correctness for some
filesystems) by adding the ability to cache iomap mappings in the
kernel. For filesystems that can change mapping states during pagecache
writeback (e.g. unwritten extent conversion) this is absolutely
necessary to deal with races with writes to the pagecache because
writeback does not take i_rwsem. For everyone else, it simply
eliminates roundtrips to userspace.
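
To make the "eliminates roundtrips" point concrete, here is a minimal
sketch of what a lookup in such a cache could look like. Everything in
it is hypothetical: the demo_mapping structure and demo_cache_lookup()
are illustrative names and are not the data structures used by this
patchset. The sketch assumes the cached mappings are kept sorted by file
offset and do not overlap, so a binary search either finds the mapping
that covers an offset or reports a miss.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached mapping; NOT the structure used by the patchset. */
struct demo_mapping {
	uint64_t file_off;	/* start of the range within the file, in bytes */
	uint64_t len;		/* length of the range, in bytes */
	uint64_t dev_off;	/* where that range starts on the device */
};

/*
 * Find the cached mapping that covers file offset @pos.  @maps must be
 * sorted by file_off and must not contain overlapping ranges.  Returns
 * NULL on a miss, which is the point where a real implementation would
 * have to ask the fuse server for a mapping.
 */
const struct demo_mapping *
demo_cache_lookup(const struct demo_mapping *maps, size_t nr, uint64_t pos)
{
	size_t lo = 0, hi = nr;

	assert(maps != NULL || nr == 0);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const struct demo_mapping *m = &maps[mid];

		if (pos < m->file_off)
			hi = mid;		/* look in the lower half */
		else if (pos - m->file_off >= m->len)
			lo = mid + 1;		/* look in the upper half */
		else
			return m;		/* pos is inside this mapping */
	}
	return NULL;
}
```

A hit is answered entirely from the cache; only a NULL return would cost
a trip to userspace.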
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-cache
---
Commits in this patchset:
* fuse: cache iomaps
* fuse: use the iomap cache for iomap_begin
* fuse: invalidate iomap cache after file updates
* fuse: enable iomap cache management
---
fs/fuse/fuse_i.h | 51 +
fs/fuse/fuse_trace.h | 434 ++++++++++++
fs/fuse/iomap_priv.h | 149 ++++
include/uapi/linux/fuse.h | 33 +
fs/fuse/Makefile | 2
fs/fuse/dev.c | 44 +
fs/fuse/dir.c | 6
fs/fuse/file.c | 10
fs/fuse/file_iomap.c | 527 ++++++++++++++
fs/fuse/iomap_cache.c | 1693 +++++++++++++++++++++++++++++++++++++++++++++
10 files changed, 2934 insertions(+), 15 deletions(-)
create mode 100644 fs/fuse/iomap_cache.c
* [PATCHSET RFC v4 4/4] fuse: handle timestamps and ACLs correctly when iomap is enabled
2025-08-21 0:37 [RFC v4] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
` (2 preceding siblings ...)
2025-08-21 0:47 ` [PATCHSET RFC v4 3/4] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
@ 2025-08-21 0:48 ` Darrick J. Wong
2025-08-21 0:59 ` [PATCH 1/6] fuse: force a ctime update after a fileattr_set call when in iomap mode Darrick J. Wong
` (5 more replies)
2025-08-21 0:48 ` [PATCHSET RFC v4 1/4] libfuse: general bug fixes Darrick J. Wong
` (9 subsequent siblings)
13 siblings, 6 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:48 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
Hi all,
When iomap is enabled for a fuse file, we try to keep as much of the
file IO path in the kernel as we possibly can. That means no calling
out to the fuse server in the IO path when we can avoid it. However,
the existing FUSE architecture defers all file attributes to the fuse
server -- [cm]time updates, ACL metadata management, set[ug]id removal,
and permissions checking thereof, etc.
We'd really rather do all these attribute updates in the kernel, and
only push them to the fuse server when it's actually necessary (e.g.
fsync). Furthermore, the POSIX ACL code has the weird behavior that if
the access ACL can be represented entirely by i_mode bits, it will
change the mode and delete the ACL, which fuse servers generally don't
seem to implement.
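
The ACL behavior described above can be sketched in a few lines. This is
a simplified illustration, loosely modeled on the check that the
kernel's posix_acl_equiv_mode() helper performs as far as I understand
it; the demo_* types and function below are hypothetical and are not
kernel or libfuse interfaces. An access ACL that contains only the
owner, owning group, and other entries carries no more information than
the permission bits, so it can be folded into the mode and dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified ACL representation for illustration only. */
enum demo_acl_tag {
	DEMO_USER_OBJ,		/* file owner */
	DEMO_USER,		/* named user */
	DEMO_GROUP_OBJ,		/* owning group */
	DEMO_GROUP,		/* named group */
	DEMO_MASK,
	DEMO_OTHER,
};

struct demo_acl_entry {
	enum demo_acl_tag tag;
	unsigned int perm;	/* rwx bits, 0-7 */
};

/*
 * Returns 1 and stores the equivalent permission bits in *mode if the
 * ACL consists only of owner, owning group and other entries.  Returns
 * 0 and leaves *mode untouched if the ACL has named users, named groups
 * or a mask, in which case it has to be kept.
 */
int demo_acl_equiv_mode(const struct demo_acl_entry *acl, size_t nr,
			unsigned int *mode)
{
	unsigned int m = 0;
	size_t i;

	assert(mode != NULL);

	for (i = 0; i < nr; i++) {
		switch (acl[i].tag) {
		case DEMO_USER_OBJ:
			m |= (acl[i].perm & 7) << 6;
			break;
		case DEMO_GROUP_OBJ:
			m |= (acl[i].perm & 7) << 3;
			break;
		case DEMO_OTHER:
			m |= acl[i].perm & 7;
			break;
		default:
			return 0;	/* not expressible as mode bits */
		}
	}

	*mode = m;
	return 1;
}
```

When the function returns 1, the caller would update the permission bits
and remove the stored ACL; when it returns 0, the ACL stays.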
IOWs, we want consistent and correct (as defined by fstests) behavior
of file attributes in iomap mode. Let's make the kernel manage all that
and push the results to userspace as needed. This improves performance
even further, since it's sort of like writeback_cache mode but more
aggressive.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-attrs
---
Commits in this patchset:
* fuse: force a ctime update after a fileattr_set call when in iomap mode
* fuse: synchronize inode->i_flags after fileattr_[gs]et
* fuse: cache atime when in iomap mode
* fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems
* fuse: update ctime when updating acls on an iomap inode
* fuse: always cache ACLs when using iomap
---
fs/fuse/fuse_i.h | 1
fs/fuse/fuse_trace.h | 81 ++++++++++++++++++++++++++++++++++++++++
fs/fuse/acl.c | 24 ++++++++++--
fs/fuse/dir.c | 32 +++++++++++++---
fs/fuse/inode.c | 20 ++++++++--
fs/fuse/ioctl.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/readdir.c | 3 +
7 files changed, 249 insertions(+), 13 deletions(-)
* [PATCHSET RFC v4 1/4] libfuse: general bug fixes
2025-08-21 0:37 [RFC v4] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
` (3 preceding siblings ...)
2025-08-21 0:48 ` [PATCHSET RFC v4 4/4] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
@ 2025-08-21 0:48 ` Darrick J. Wong
2025-08-21 1:01 ` [PATCH 1/1] libfuse: don't put HAVE_STATX in a public header Darrick J. Wong
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (8 subsequent siblings)
13 siblings, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:48 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
Hi all,
Here's a collection of fixes that I *think* are bugs in libfuse.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-fixes
---
Commits in this patchset:
* libfuse: don't put HAVE_STATX in a public header
---
include/fuse.h | 2 --
include/fuse_lowlevel.h | 2 --
example/memfs_ll.cc | 2 +-
example/passthrough.c | 2 +-
example/passthrough_fh.c | 2 +-
example/passthrough_ll.c | 2 +-
6 files changed, 4 insertions(+), 8 deletions(-)
* [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance
2025-08-21 0:37 [RFC v4] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
` (4 preceding siblings ...)
2025-08-21 0:48 ` [PATCHSET RFC v4 1/4] libfuse: general bug fixes Darrick J. Wong
@ 2025-08-21 0:48 ` Darrick J. Wong
2025-08-21 1:01 ` [PATCH 01/21] libfuse: bump kernel and library ABI versions Darrick J. Wong
` (20 more replies)
2025-08-21 0:48 ` [PATCHSET RFC v4 3/4] libfuse: cache iomap mappings for even better file IO performance Darrick J. Wong
` (7 subsequent siblings)
13 siblings, 21 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:48 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
Hi all,
This series connects libfuse to the iomap-enabled fuse driver in Linux to get
fuse servers out of the business of handling file I/O themselves. By keeping
the IO path mostly within the kernel, we can dramatically improve the speed of
disk-based filesystems. This enables us to move all the filesystem metadata
parsing code out of the kernel and into userspace, which means that we can
containerize them for security without losing a lot of performance.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap
---
Commits in this patchset:
* libfuse: bump kernel and library ABI versions
* libfuse: add kernel gates for FUSE_IOMAP
* libfuse: add fuse commands for iomap_begin and end
* libfuse: add upper level iomap commands
* libfuse: add a lowlevel notification to add a new device to iomap
* libfuse: add upper-level iomap add device function
* libfuse: add iomap ioend low level handler
* libfuse: add upper level iomap ioend commands
* libfuse: add a reply function to send FUSE_ATTR_* to the kernel
* libfuse: connect high level fuse library to fuse_reply_attr_iflags
* libfuse: support direct I/O through iomap
* libfuse: support buffered I/O through iomap
* libfuse: don't allow hardlinking of iomap files in the upper level fuse library
* libfuse: allow discovery of the kernel's iomap capabilities
* libfuse: add lower level iomap_config implementation
* libfuse: add upper level iomap_config implementation
* libfuse: allow root_nodeid mount option
* libfuse: add low level code to invalidate iomap block device ranges
* libfuse: add upper-level API to invalidate parts of an iomap block device
* libfuse: add strictatime/lazytime mount options
* libfuse: add atomic write support
---
include/fuse.h | 86 ++++++++
include/fuse_common.h | 131 +++++++++++++
include/fuse_kernel.h | 113 +++++++++++
include/fuse_lowlevel.h | 238 +++++++++++++++++++++++
ChangeLog.rst | 12 +
lib/fuse.c | 484 ++++++++++++++++++++++++++++++++++++++++++-----
lib/fuse_lowlevel.c | 370 ++++++++++++++++++++++++++++++++++--
lib/fuse_versionscript | 20 ++
lib/meson.build | 2
lib/mount.c | 19 ++
meson.build | 2
11 files changed, 1400 insertions(+), 77 deletions(-)
* [PATCHSET RFC v4 3/4] libfuse: cache iomap mappings for even better file IO performance
2025-08-21 0:37 [RFC v4] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
` (5 preceding siblings ...)
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
@ 2025-08-21 0:48 ` Darrick J. Wong
2025-08-21 1:07 ` [PATCH 1/2] libfuse: enable iomap cache management for lowlevel fuse Darrick J. Wong
2025-08-21 1:07 ` [PATCH 2/2] libfuse: add upper-level iomap cache management Darrick J. Wong
2025-08-21 0:49 ` [PATCHSET RFC v4 4/4] libfuse: implement syncfs Darrick J. Wong
` (6 subsequent siblings)
13 siblings, 2 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:48 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
Hi all,
This series improves the performance (and correctness for some
filesystems) by adding the ability to cache iomap mappings in the
kernel. For filesystems that can change mapping states during pagecache
writeback (e.g. unwritten extent conversion) this is absolutely
necessary to deal with races with writes to the pagecache because
writeback does not take i_rwsem. For everyone else, it simply
eliminates roundtrips to userspace.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-cache
---
Commits in this patchset:
* libfuse: enable iomap cache management for lowlevel fuse
* libfuse: add upper-level iomap cache management
---
include/fuse.h | 31 ++++++++++++++++++++
include/fuse_common.h | 12 ++++++++
include/fuse_kernel.h | 26 +++++++++++++++++
include/fuse_lowlevel.h | 41 ++++++++++++++++++++++++++
lib/fuse.c | 30 +++++++++++++++++++
lib/fuse_lowlevel.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++
lib/fuse_versionscript | 4 +++
7 files changed, 217 insertions(+)
* [PATCHSET RFC v4 4/4] libfuse: implement syncfs
2025-08-21 0:37 [RFC v4] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
` (6 preceding siblings ...)
2025-08-21 0:48 ` [PATCHSET RFC v4 3/4] libfuse: cache iomap mappings for even better file IO performance Darrick J. Wong
@ 2025-08-21 0:49 ` Darrick J. Wong
2025-08-21 1:07 ` [PATCH 1/2] libfuse: wire up FUSE_SYNCFS to the low level library Darrick J. Wong
` (2 more replies)
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
` (5 subsequent siblings)
13 siblings, 3 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:49 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
Hi all,
Implement syncfs in libfuse so that iomap-compatible fuse servers can
receive syncfs commands.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-attrs
---
Commits in this patchset:
* libfuse: wire up FUSE_SYNCFS to the low level library
* libfuse: add syncfs support to the upper library
---
include/fuse.h | 5 +++++
include/fuse_lowlevel.h | 16 ++++++++++++++++
lib/fuse.c | 31 +++++++++++++++++++++++++++++++
lib/fuse_lowlevel.c | 19 +++++++++++++++++++
4 files changed, 71 insertions(+)
* [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server
2025-08-21 0:37 [RFC v4] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
` (7 preceding siblings ...)
2025-08-21 0:49 ` [PATCHSET RFC v4 4/4] libfuse: implement syncfs Darrick J. Wong
@ 2025-08-21 0:49 ` Darrick J. Wong
2025-08-21 1:08 ` [PATCH 01/20] fuse2fs: port fuse2fs to lowlevel libfuse API Darrick J. Wong
` (19 more replies)
2025-08-21 0:49 ` [PATCHSET RFC v4 2/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
` (4 subsequent siblings)
13 siblings, 20 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:49 UTC (permalink / raw)
To: tytso
Cc: amir73il, John, bernd, linux-fsdevel, linux-ext4, miklos,
amir73il, joannelkoong, neal
Hi all,
Whilst developing the fuse2fs+iomap prototype, I discovered a
fundamental design limitation of the upper-level libfuse API: hardlinks.
The upper level fuse library really wants to communicate with the fuse
server with file paths, instead of using inode numbers. This works
great for filesystems that don't have inodes, create files dynamically
at runtime, or lack stable inode numbers.
Unfortunately, the libfuse path abstraction assigns a unique nodeid to
every child file in the entire filesystem, without regard to hard links.
In other words, a hardlinked regular file may have one ondisk inode
number but multiple kernel inodes. For classic fuse2fs this isn't a
problem because all file access goes through the fuse server and the big
library lock protects us from corruption.
For fuse2fs + iomap this is a disaster because we rely on the kernel to
coordinate access to inodes. For hardlinked files, we *require* that
there only be one in-kernel inode for each ondisk inode.
The path based mechanism is also very inefficient for fuse2fs. Every
time a file is accessed, the upper level libfuse passes a new nodeid to
the kernel, and on every file access the kernel passes that same nodeid
back to libfuse. libfuse then walks its internal directory entry cache
to construct a path string for that nodeid and hands it to fuse2fs.
fuse2fs then walks the ondisk directory structure to find the ext2 inode
number. Every time.
Create a new fuse4fs server from fuse2fs that uses the lowlevel fuse
API. This affords us direct control over nodeids and eliminates the
path wrangling. Hardlinks can be supported when iomap is turned on,
and metadata-heavy workloads run twice as fast.
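
As a rough illustration of the "one in-kernel inode for each ondisk
inode" requirement, a server that controls nodeids itself can key them
on the ondisk inode number, so that every hardlink to a file resolves to
the same nodeid. The sketch below is hypothetical: the demo_* names and
the fixed-size table are made up for this example and are not how
fuse4fs is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_NODES	64

/* Hypothetical per-inode record kept by the server. */
struct demo_node {
	uint64_t ino;		/* ondisk inode number; 0 means unused slot */
	uint64_t nodeid;	/* identifier handed to the kernel */
	uint64_t nlookup;	/* how many lookups the kernel holds */
};

struct demo_node_table {
	struct demo_node nodes[DEMO_MAX_NODES];
	uint64_t next_nodeid;
};

void demo_table_init(struct demo_node_table *t)
{
	memset(t, 0, sizeof(*t));
	t->next_nodeid = 2;	/* assume nodeid 1 is taken by the root */
}

/*
 * Return the nodeid for @ino, allocating one on first lookup.  Looking
 * up the same inode again (for example via a second hardlink) returns
 * the same nodeid and only bumps the lookup count.  Returns 0 if the
 * table is full.
 */
uint64_t demo_lookup_nodeid(struct demo_node_table *t, uint64_t ino)
{
	struct demo_node *free_slot = NULL;
	size_t i;

	assert(ino != 0);

	for (i = 0; i < DEMO_MAX_NODES; i++) {
		struct demo_node *n = &t->nodes[i];

		if (n->ino == ino) {
			n->nlookup++;
			return n->nodeid;
		}
		if (n->ino == 0 && free_slot == NULL)
			free_slot = n;
	}

	if (free_slot == NULL)
		return 0;

	free_slot->ino = ino;
	free_slot->nodeid = t->next_nodeid++;
	free_slot->nlookup = 1;
	return free_slot->nodeid;
}
```

With a path-based scheme, two names for inode 12 would get two nodeids;
here both lookups return the same one.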
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
Comments and questions are, as always, welcome.
e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse4fs-fork
---
Commits in this patchset:
* fuse2fs: port fuse2fs to lowlevel libfuse API
* fuse4fs: drop fuse 2.x support code
* fuse4fs: namespace some helpers
* fuse4fs: convert to low level API
* libsupport: port the kernel list.h to libsupport
* libsupport: add a cache
* cache: disable debugging
* cache: use modern list iterator macros
* cache: embed struct cache in the owner
* cache: pass cache pointer to callbacks
* cache: pass a private data pointer through cache_walk
* cache: add a helper to grab a new refcount for a cache_node
* cache: return results of a cache flush
* cache: add a "get only if incore" flag to cache_node_get
* cache: support gradual expansion
* cache: implement automatic shrinking
* fuse4fs: add cache to track open files
* fuse4fs: use the orphaned inode list
* fuse4fs: implement FUSE_TMPFILE
* fuse4fs: create incore reverse orphan list
---
lib/ext2fs/jfs_compat.h | 2
lib/ext2fs/kernel-list.h | 111 -
lib/support/cache.h | 177 +
lib/support/list.h | 901 +++++++
lib/support/xbitops.h | 128 +
configure | 50
configure.ac | 31
debugfs/Makefile.in | 12
e2fsck/Makefile.in | 56
lib/config.h.in | 3
lib/e2p/Makefile.in | 4
lib/ext2fs/Makefile.in | 14
lib/support/Makefile.in | 8
lib/support/cache.c | 853 ++++++
misc/Makefile.in | 35
misc/fuse4fs.c | 6098 ++++++++++++++++++++++++++++++++++++++++++++++
misc/tune2fs.c | 4
17 files changed, 8319 insertions(+), 168 deletions(-)
delete mode 100644 lib/ext2fs/kernel-list.h
create mode 100644 lib/support/cache.h
create mode 100644 lib/support/list.h
create mode 100644 lib/support/xbitops.h
create mode 100644 lib/support/cache.c
create mode 100644 misc/fuse4fs.c
* [PATCHSET RFC v4 2/6] libext2fs: refactoring for fuse2fs iomap support
2025-08-21 0:37 [RFC v4] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
` (8 preceding siblings ...)
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
@ 2025-08-21 0:49 ` Darrick J. Wong
2025-08-21 1:13 ` [PATCH 01/10] libext2fs: make it possible to extract the fd from an IO manager Darrick J. Wong
` (9 more replies)
2025-08-21 0:49 ` [PATCHSET RFC v4 3/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
` (3 subsequent siblings)
13 siblings, 10 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:49 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
Hi all,
In preparation for connecting fuse, iomap, and fuse2fs for a much more
performant file IO path, make some changes to the Unix IO manager in
libext2fs so that we can have better IO. First we start by making
filesystem flushes a lot more efficient by eliding fsyncs when they're
not necessary, and allowing library clients to turn off the racy code
that writes the superblock byte by byte but exposes stale checksums.
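
The fsync elision idea can be sketched as a dirty flag on the IO
channel. This is an assumption-laden illustration, not the libext2fs
code: struct demo_channel and the demo_* functions are hypothetical
names, and the real Unix IO manager has a block cache and more state
than is shown here. Only pwrite(), fsync() and open() are real POSIX
calls.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical IO channel; NOT the libext2fs io_channel structure. */
struct demo_channel {
	int fd;
	bool dirty;		/* set by writes, cleared by a flush */
	unsigned int nr_fsyncs;	/* statistics only */
};

int demo_open(struct demo_channel *ch, const char *path)
{
	ch->fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	ch->dirty = false;
	ch->nr_fsyncs = 0;
	return ch->fd < 0 ? -1 : 0;
}

ssize_t demo_write(struct demo_channel *ch, const void *buf, size_t len,
		   off_t off)
{
	ssize_t ret = pwrite(ch->fd, buf, len, off);

	/* Remember that there is something for fsync to persist. */
	if (ret > 0)
		ch->dirty = true;
	return ret;
}

int demo_flush(struct demo_channel *ch)
{
	assert(ch->fd >= 0);

	/* Nothing was written since the last flush: skip the fsync. */
	if (!ch->dirty)
		return 0;

	if (fsync(ch->fd) < 0)
		return -1;

	ch->nr_fsyncs++;
	ch->dirty = false;
	return 0;
}
```

Repeated flushes with no intervening write cost nothing, which is the
saving this part of the series is after.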
XXX: The second part of this series adds IO tagging so that we could tag
IOs by inode number to distinguish file data blocks in cache from
everything else. This is temporary scaffolding whilst we're in the
middle of adding directio and later buffered writes. Once we can use the
pagecache for all file IO activity I think we could drop the back half
of this series.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
Comments and questions are, as always, welcome.
e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=libext2fs-iomap-prep
---
Commits in this patchset:
* libext2fs: make it possible to extract the fd from an IO manager
* libext2fs: always fsync the device when flushing the cache
* libext2fs: always fsync the device when closing the unix IO manager
* libext2fs: only fsync the unix fd if we wrote to the device
* libext2fs: invalidate cached blocks when freeing them
* libext2fs: only flush affected blocks in unix_write_byte
* libext2fs: allow unix_write_byte when the write would be aligned
* libext2fs: allow clients to ask to write full superblocks
* libext2fs: allow callers to disallow I/O to file data blocks
* libext2fs: add posix advisory locking to the unix IO manager
---
lib/ext2fs/ext2_io.h | 10 ++
lib/ext2fs/ext2fs.h | 4 +
debian/libext2fs2t64.symbols | 2
lib/ext2fs/alloc_stats.c | 6 +
lib/ext2fs/closefs.c | 7 ++
lib/ext2fs/fileio.c | 12 +++
lib/ext2fs/io_manager.c | 17 ++++
lib/ext2fs/unix_io.c | 180 ++++++++++++++++++++++++++++++++++++++++--
8 files changed, 228 insertions(+), 10 deletions(-)
* [PATCHSET RFC v4 3/6] fuse2fs: use fuse iomap data paths for better file I/O performance
2025-08-21 0:37 [RFC v4] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
` (9 preceding siblings ...)
2025-08-21 0:49 ` [PATCHSET RFC v4 2/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
@ 2025-08-21 0:49 ` Darrick J. Wong
2025-08-21 1:15 ` [PATCH 01/19] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
` (18 more replies)
2025-08-21 0:50 ` [PATCHSET RFC v4 4/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
` (2 subsequent siblings)
13 siblings, 19 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:49 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
Hi all,
Switch fuse2fs to use the new iomap file data IO paths instead of
pushing it very slowly through the /dev/fuse connection. For local
filesystems, all we have to do is respond to requests for file to device
mappings; the rest of the IO hot path stays within the kernel. This
means that we can get rid of all file data block processing within
fuse2fs.
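
To illustrate what "respond to requests for file to device mappings"
amounts to, here is a hedged sketch of a server-side translation from an
extent list to a single mapping reply. The demo_* types are hypothetical
and do not correspond to the actual fuse iomap wire format or to
libext2fs structures. The sketch assumes extents are sorted by logical
block and expressed in filesystem blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum demo_map_type { DEMO_MAP_HOLE, DEMO_MAP_MAPPED };

/* Hypothetical extent record, in units of filesystem blocks. */
struct demo_extent {
	uint64_t lblk;		/* first logical block in the file */
	uint64_t pblk;		/* first physical block on the device */
	uint64_t nblks;		/* number of blocks */
};

/* Hypothetical reply, in bytes. */
struct demo_reply {
	enum demo_map_type type;
	uint64_t file_off;
	uint64_t len;
	uint64_t dev_off;	/* only meaningful for DEMO_MAP_MAPPED */
};

/*
 * Describe the mapping at byte offset @pos for a request of @count
 * bytes.  A mapped reply runs to the end of the extent and may extend
 * past the request; this sketch does not clamp it.  A hole reply stops
 * at the next extent or at the block-aligned end of the request.
 */
void demo_map_range(const struct demo_extent *ext, size_t nr,
		    uint64_t blocksize, uint64_t pos, uint64_t count,
		    struct demo_reply *out)
{
	uint64_t blk, end_blk;
	size_t i;

	assert(blocksize > 0 && count > 0);

	blk = pos / blocksize;
	end_blk = (pos + count + blocksize - 1) / blocksize; /* exclusive */

	out->type = DEMO_MAP_HOLE;
	out->file_off = blk * blocksize;
	out->len = (end_blk - blk) * blocksize;
	out->dev_off = 0;

	for (i = 0; i < nr; i++) {
		const struct demo_extent *e = &ext[i];

		if (blk >= e->lblk + e->nblks)
			continue;	/* extent ends before the request */

		if (blk < e->lblk) {
			/* Hole in front of this extent. */
			if (e->lblk < end_blk)
				out->len = (e->lblk - blk) * blocksize;
			return;
		}

		/* blk falls inside this extent. */
		out->type = DEMO_MAP_MAPPED;
		out->dev_off = (e->pblk + (blk - e->lblk)) * blocksize;
		out->len = (e->lblk + e->nblks - blk) * blocksize;
		return;
	}
}
```

The server only ever returns descriptions like this one; under that
model the data itself never passes through the server.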
Because we're not pinning dirty pages through a potentially slow network
connection, we don't need the heavy BDI throttling for which most fuse
servers have become infamous. Yes, mapping lookups for writeback can
stall, but mappings are small as compared to data and this situation
exists for all kernel filesystems as well.
The performance of this new data path is quite stunning: on a warm
system, streaming reads and writes through the pagecache go from
60-90MB/s to 2-2.5GB/s. Direct IO reads and writes improve from the
same baseline to 2.5-8GB/s. FIEMAP and SEEK_DATA/SEEK_HOLE now work
too. The kernel ext4 driver can manage about 1.6GB/s for pagecache IO
and about 2.6-8.5GB/s for direct IO, which means that fuse2fs is about
as fast as the kernel for streaming file IO.
Random 4k buffered IO is not so good: plain fuse2fs pokes along at
25-50MB/s, whereas fuse2fs with iomap manages 90-1300MB/s. The kernel
can do 900-1300MB/s. Random directio is worse: plain fuse2fs does
20-30MB/s, fuse-iomap does about 30-35MB/s, and the kernel does
40-55MB/s. I suspect that metadata-heavy workloads do not perform well
on fuse2fs because libext2fs wasn't designed for that and it doesn't
even have a journal to absorb all the fsync writes. We also probably
need iomap caching really badly.
These performance numbers are slanted: my machine is 12 years old, and
fuse2fs is VERY poorly optimized for performance. It contains a single
Big Filesystem Lock which nukes multi-threaded scalability. There's no
inode cache nor is there a proper buffer cache, which means that fuse2fs
reads metadata in from disk and checksums it on EVERY ACCESS. Sad!
Despite these gaps, this RFC demonstrates that it's feasible to run the
metadata parsing parts of a filesystem in userspace while not
sacrificing much performance. We now have a vehicle to move the
filesystems out of the kernel, where they can be containerized so that
malicious filesystems can be contained, somewhat.
iomap mode also calls FUSE_DESTROY before unmounting the filesystem, so
for capable systems, fuse2fs doesn't need to run in fuseblk mode
anymore.
However, there are some major warts remaining:
1. The iomap cookie validation is not present, which can lead to subtle
races between pagecache zeroing and writeback on filesystems that
support unwritten and delalloc mappings.
2. Mappings ought to be cached in the kernel for more speed.
3. iomap doesn't support things like fscrypt or fsverity, and I haven't
yet figured out how inline data is supposed to work.
4. I would like to be able to turn on fuse+iomap on a per-inode basis,
which currently isn't possible because the kernel fuse driver will iget
inodes prior to calling FUSE_GETATTR to discover the properties of the
inode it just read.
5. ext4 doesn't support out of place writes so I don't know if that
actually works correctly.
6. iomap is an inode-based service, not a file-based service. This
means that we /must/ push ext2's inode numbers into the kernel via
FUSE_GETATTR so that it can report those same numbers back out through
the FUSE_IOMAP_* calls. However, the fuse kernel uses a separate nodeid
to index its incore inode, so we have to pass those too so that
notifications work properly.
I'll work on these in June, but for now here's an unmergeable RFC to
start some discussion.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
Comments and questions are, as always, welcome.
e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap
---
Commits in this patchset:
* fuse2fs: implement bare minimum iomap for file mapping reporting
* fuse2fs: add iomap= mount option
* fuse2fs: implement iomap configuration
* fuse2fs: register block devices for use with iomap
* fuse2fs: implement directio file reads
* fuse2fs: add extent dump function for debugging
* fuse2fs: implement direct write support
* fuse2fs: turn on iomap for pagecache IO
* fuse2fs: don't zero bytes in punch hole
* fuse2fs: don't do file data block IO when iomap is enabled
* fuse2fs: avoid fuseblk mode if fuse-iomap support is likely
* fuse2fs: enable file IO to inline data files
* fuse2fs: set iomap-related inode flags
* fuse2fs: add strictatime/lazytime mount options
* fuse2fs: configure block device block size
* fuse4fs: don't use inode number translation when possible
* fuse4fs: separate invalidation
* fuse2fs: implement statx
* fuse2fs: enable atomic writes
---
configure | 46 +
configure.ac | 31 +
lib/config.h.in | 3
misc/fuse2fs.c | 1777 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
misc/fuse4fs.c | 1741 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
5 files changed, 3556 insertions(+), 42 deletions(-)
* [PATCHSET RFC v4 4/6] fuse2fs: use fuse iomap data paths for better file I/O performance
2025-08-21 0:37 [RFC v4] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
` (10 preceding siblings ...)
2025-08-21 0:49 ` [PATCHSET RFC v4 3/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
@ 2025-08-21 0:50 ` Darrick J. Wong
2025-08-21 1:20 ` [PATCH 1/2] fuse2fs: enable caching of iomaps Darrick J. Wong
2025-08-21 1:21 ` [PATCH 2/2] fuse2fs: be smarter about caching iomaps Darrick J. Wong
2025-08-21 0:50 ` [PATCHSET RFC v4 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
2025-08-21 0:50 ` [PATCHSET RFC v4 6/6] fuse2fs: improve block and inode caching Darrick J. Wong
13 siblings, 2 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:50 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
Hi all,
This series improves the performance (and correctness for some
filesystems) by adding the ability to cache iomap mappings in the
kernel. For filesystems that can change mapping states during pagecache
writeback (e.g. unwritten extent conversion) this is absolutely
necessary to deal with races with writes to the pagecache because
writeback does not take i_rwsem. For everyone else, it simply
eliminates roundtrips to userspace.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
Comments and questions are, as always, welcome.
e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap-cache
---
Commits in this patchset:
* fuse2fs: enable caching of iomaps
* fuse2fs: be smarter about caching iomaps
---
misc/fuse2fs.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
misc/fuse4fs.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 91 insertions(+)
* [PATCHSET RFC v4 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled
2025-08-21 0:37 [RFC v4] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
` (11 preceding siblings ...)
2025-08-21 0:50 ` [PATCHSET RFC v4 4/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
@ 2025-08-21 0:50 ` Darrick J. Wong
2025-08-21 1:21 ` [PATCH 1/8] fuse2fs: skip permission checking on utimens " Darrick J. Wong
` (7 more replies)
2025-08-21 0:50 ` [PATCHSET RFC v4 6/6] fuse2fs: improve block and inode caching Darrick J. Wong
13 siblings, 8 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:50 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
Hi all,
When iomap is enabled for a fuse file, we try to keep as much of the
file IO path in the kernel as we possibly can. That means no calling
out to the fuse server in the IO path when we can avoid it. However,
the existing FUSE architecture defers all file attributes to the fuse
server -- [cm]time updates, ACL metadata management, set[ug]id removal,
and permissions checking thereof, etc.
We'd really rather do all these attribute updates in the kernel, and
only push them to the fuse server when it's actually necessary (e.g.
fsync). Furthermore, the POSIX ACL code has the weird behavior that if
the access ACL can be represented entirely by i_mode bits, it will
change the mode and delete the ACL, which fuse servers generally don't
seem to implement.
IOWs, we want consistent and correct (as defined by fstests) behavior
of file attributes in iomap mode. Let's make the kernel manage all that
and push the results to userspace as needed. This improves performance
even further, since it's sort of like writeback_cache mode but more
aggressive.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
Comments and questions are, as always, welcome.
e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap-attrs
---
Commits in this patchset:
* fuse2fs: skip permission checking on utimens when iomap is enabled
* fuse2fs: let the kernel tell us about acl/mode updates
* fuse2fs: better debugging for file mode updates
* fuse2fs: debug timestamp updates
* fuse2fs: use coarse timestamps for iomap mode
* fuse2fs: add tracing for retrieving timestamps
* fuse2fs: enable syncfs
* fuse2fs: skip the gdt write in op_destroy if syncfs is working
---
misc/fuse2fs.c | 237 ++++++++++++++++++++++++++++++++++++++++----------------
misc/fuse4fs.c | 193 ++++++++++++++++++++++++++++++++--------------
2 files changed, 304 insertions(+), 126 deletions(-)
^ permalink raw reply [flat|nested] 210+ messages in thread
* [PATCHSET RFC v4 6/6] fuse2fs: improve block and inode caching
2025-08-21 0:37 [RFC v4] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
` (12 preceding siblings ...)
2025-08-21 0:50 ` [PATCHSET RFC v4 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
@ 2025-08-21 0:50 ` Darrick J. Wong
2025-08-21 1:23 ` [PATCH 1/6] libsupport: add caching IO manager Darrick J. Wong
` (5 more replies)
13 siblings, 6 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:50 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
Hi all,
This final series ports the libext2fs inode cache to the new cache.c
hashtable code that was added for fuse4fs unlinked file support and
improves on the UNIX I/O manager's block cache by adding a new I/O
manager that does its own caching. Now we no longer have statically
sized buffer caching for the two fuse servers.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
Comments and questions are, as always, welcome.
e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-caching
---
Commits in this patchset:
* libsupport: add caching IO manager
* iocache: add the actual buffer cache
* iocache: bump buffer mru priority every 50 accesses
* fuse2fs: enable caching IO manager
* fuse2fs: increase inode cache size
* libext2fs: improve caching for inodes
---
lib/ext2fs/ext2fsP.h | 13 +
lib/support/cache.h | 1
lib/support/iocache.h | 17 +
debugfs/Makefile.in | 4
e2fsck/Makefile.in | 4
lib/ext2fs/Makefile.in | 4
lib/ext2fs/inode.c | 215 +++++++++++---
lib/ext2fs/io_manager.c | 3
lib/support/Makefile.in | 6
lib/support/cache.c | 16 +
lib/support/iocache.c | 740 +++++++++++++++++++++++++++++++++++++++++++++++
misc/Makefile.in | 7
misc/fuse2fs.c | 75 +----
misc/fuse4fs.c | 73 -----
resize/Makefile.in | 4
tests/progs/Makefile.in | 4
16 files changed, 990 insertions(+), 196 deletions(-)
create mode 100644 lib/support/iocache.h
create mode 100644 lib/support/iocache.c
^ permalink raw reply [flat|nested] 210+ messages in thread
* [PATCH 1/7] fuse: fix livelock in synchronous file put from fuseblk workers
2025-08-21 0:47 ` [PATCHSET RFC v4 1/4] fuse: general bug fixes Darrick J. Wong
@ 2025-08-21 0:50 ` Darrick J. Wong
2025-09-03 15:20 ` Miklos Szeredi
2025-08-21 0:51 ` [PATCH 2/7] fuse: flush pending fuse events before aborting the connection Darrick J. Wong
` (5 subsequent siblings)
6 siblings, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:50 UTC (permalink / raw)
To: djwong, miklos; +Cc: stable, bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
I observed a hang when running generic/323 against a fuseblk server.
This test opens a file, initiates a lot of AIO writes to that file
descriptor, and closes the file descriptor before the writes complete.
Unsurprisingly, the AIO exerciser threads are mostly stuck waiting for
responses from the fuseblk server:
# cat /proc/372265/task/372313/stack
[<0>] request_wait_answer+0x1fe/0x2a0 [fuse]
[<0>] __fuse_simple_request+0xd3/0x2b0 [fuse]
[<0>] fuse_do_getattr+0xfc/0x1f0 [fuse]
[<0>] fuse_file_read_iter+0xbe/0x1c0 [fuse]
[<0>] aio_read+0x130/0x1e0
[<0>] io_submit_one+0x542/0x860
[<0>] __x64_sys_io_submit+0x98/0x1a0
[<0>] do_syscall_64+0x37/0xf0
[<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53
But the /weird/ part is that the fuseblk server threads are waiting for
responses from the server itself:
# cat /proc/372210/task/372232/stack
[<0>] request_wait_answer+0x1fe/0x2a0 [fuse]
[<0>] __fuse_simple_request+0xd3/0x2b0 [fuse]
[<0>] fuse_file_put+0x9a/0xd0 [fuse]
[<0>] fuse_release+0x36/0x50 [fuse]
[<0>] __fput+0xec/0x2b0
[<0>] task_work_run+0x55/0x90
[<0>] syscall_exit_to_user_mode+0xe9/0x100
[<0>] do_syscall_64+0x43/0xf0
[<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53
The fuseblk server is fuse2fs so there's nothing all that exciting in
the server itself. So why is the fuse server calling fuse_file_put?
The commit message for the fstest sheds some light on that:
"By closing the file descriptor before calling io_destroy, you pretty
much guarantee that the last put on the ioctx will be done in interrupt
context (during I/O completion)."
Aha. AIO fgets a new struct file from the fd when it queues the ioctx.
The completion of the FUSE_WRITE command from userspace causes the fuse
server to call the AIO completion function. The completion puts the
struct file, queuing a delayed fput to the fuse server task. When the
fuse server task returns to userspace, it has to run the delayed fput,
which in the case of a fuseblk server, it does synchronously.
Sending the FUSE_RELEASE command synchronously from fuse server threads
is a bad idea because a client program can initiate enough simultaneous
AIOs such that all the fuse server threads end up in delayed_fput, and
now there aren't any threads left to handle the queued fuse commands.
Fix this by only using synchronous fputs for fuseblk servers if the
process doesn't have PF_LOCAL_THROTTLE. Hopefully the fuseblk server
had the good sense to call prctl(PR_SET_IO_FLUSHER) to mark itself as a
filesystem server.
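To make the new condition concrete, here is a small userspace model of the decision made in fuse_file_release(). It is only a sketch: MODEL_PF_LOCAL_THROTTLE, release_is_sync() and release_model_check() are hypothetical names invented for illustration, and the flag value is a stand-in rather than the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the kernel's PF_LOCAL_THROTTLE; the value is made up. */
#define MODEL_PF_LOCAL_THROTTLE 0x1u

/*
 * Hypothetical model of the second argument passed to fuse_file_put():
 * RELEASE is sent synchronously only for fuseblk mounts (fc->destroy)
 * and only when the current task is not marked LOCAL_THROTTLE.
 */
static bool release_is_sync(bool fc_destroy, unsigned int task_flags)
{
	return fc_destroy && (task_flags & MODEL_PF_LOCAL_THROTTLE) == 0;
}

/* Truth table for the three interesting cases. */
static void release_model_check(void)
{
	/* Ordinary client closing a file on a fuseblk mount: synchronous. */
	assert(release_is_sync(true, 0));
	/* Fuse server thread running a delayed fput: asynchronous. */
	assert(!release_is_sync(true, MODEL_PF_LOCAL_THROTTLE));
	/* Non-fuseblk mount: asynchronous, as before the patch. */
	assert(!release_is_sync(false, 0));
}
```

The middle case is the one that matters for generic/323: a server thread that ends up dropping the last reference to a client's file no longer blocks waiting for a reply that only the server's own threads could send.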
Cc: <stable@vger.kernel.org> # v2.6.38
Fixes: 5a18ec176c934c ("fuse: fix hang of single threaded fuseblk filesystem")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/file.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 5525a4520b0f89..0ba2b62e06679e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -356,8 +356,16 @@ void fuse_file_release(struct inode *inode, struct fuse_file *ff,
* Make the release synchronous if this is a fuseblk mount,
* synchronous RELEASE is allowed (and desirable) in this case
* because the server can be trusted not to screw up.
+ *
+ * If we're a LOCAL_THROTTLE thread, use the asynchronous put
+ * because the current thread might be a fuse server. This can
+ * happen if a process starts some aio and closes the fd before
+ * the aio completes. Since aio takes its own ref to the file,
+ * the IO completion has to drop the ref, which is how the fuse
+ * server can end up closing its own clients' files.
*/
- fuse_file_put(ff, ff->fm->fc->destroy);
+ fuse_file_put(ff, ff->fm->fc->destroy &&
+ (current->flags & PF_LOCAL_THROTTLE) == 0);
}
void fuse_release_common(struct file *file, bool isdir)
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
2025-08-21 0:47 ` [PATCHSET RFC v4 1/4] fuse: general bug fixes Darrick J. Wong
2025-08-21 0:50 ` [PATCH 1/7] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
@ 2025-08-21 0:51 ` Darrick J. Wong
2025-09-03 15:45 ` Miklos Szeredi
2025-08-21 0:51 ` [PATCH 3/7] fuse: capture the unique id of fuse commands being sent Darrick J. Wong
` (4 subsequent siblings)
6 siblings, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:51 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
generic/488 fails with fuse2fs in the following fashion:
generic/488 _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
(see /var/tmp/fstests/generic/488.full for details)
This test opens a large number of files, unlinks them (which really just
renames them to fuse hidden files), closes the program, unmounts the
filesystem, and runs fsck to check that there aren't any inconsistencies
in the filesystem.
Unfortunately, the 488.full file shows that there are a lot of hidden
files left over in the filesystem, with incorrect link counts. Tracing
fuse_request_* shows that there are a large number of FUSE_RELEASE
commands that are queued up on behalf of the unlinked files at the time
that fuse_conn_destroy calls fuse_abort_conn. Had the connection not
aborted, the fuse server would have responded to the RELEASE commands by
removing the hidden files; instead they stick around.
Create a function to push all the background requests to the queue and
then wait for the number of pending events to hit zero, and call this
before fuse_abort_conn. That way, all the pending events are processed
by the fuse server and we don't end up with a corrupt filesystem.
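The bounded wait described above can be modelled in userspace with a simulated clock. This is a sketch under stated assumptions, not the kernel code: struct sim_conn, sim_wait_one_tick() and sim_flush_and_wait() are hypothetical names, one "tick" stands in for one HZ-long wait_event_timeout() interval, and the server is assumed to finish a fixed number of requests per tick.

```c
#include <assert.h>
#include <stdbool.h>

struct sim_conn {
	bool connected;
	unsigned int num_waiting;	/* requests the server still owes us */
	unsigned long now;		/* simulated clock, in ticks */
	unsigned int done_per_tick;	/* requests the server finishes per tick */
};

/* Stand-in for one timed wait: time advances, the server makes progress. */
static void sim_wait_one_tick(struct sim_conn *c)
{
	unsigned int done = c->done_per_tick < c->num_waiting ?
			    c->done_per_tick : c->num_waiting;

	c->now++;
	c->num_waiting -= done;
}

/*
 * Wait until every pending request has completed, the connection drops,
 * or the deadline passes. Returns true if the queue drained.
 */
static bool sim_flush_and_wait(struct sim_conn *c, unsigned long timeout)
{
	unsigned long deadline = c->now + timeout;

	/* A zero timeout means "no deadline", so the server must progress. */
	assert(timeout != 0 || c->done_per_tick > 0);

	while (c->connected && c->num_waiting > 0) {
		if (timeout && c->now >= deadline)
			return false;
		sim_wait_one_tick(c);
	}
	return c->num_waiting == 0;
}
```

With ten pending requests and a server that completes three per tick, the model drains after four ticks; with a stalled server it gives up once the deadline is reached and leaves the requests pending, which is the point at which the real code would fall through to fuse_abort_conn.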
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 6 ++++++
fs/fuse/dev.c | 38 ++++++++++++++++++++++++++++++++++++++
fs/fuse/inode.c | 1 +
3 files changed, 45 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index ec248d13c8bfd9..2b5d56e3cb4eaf 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1249,6 +1249,12 @@ void fuse_request_end(struct fuse_req *req);
void fuse_abort_conn(struct fuse_conn *fc);
void fuse_wait_aborted(struct fuse_conn *fc);
+/**
+ * Flush all pending requests and wait for them. Takes an optional timeout
+ * in jiffies.
+ */
+void fuse_flush_requests_and_wait(struct fuse_conn *fc, unsigned long timeout);
+
/* Check if any requests timed out */
void fuse_check_timeout(struct work_struct *work);
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index e80cd8f2c049f9..6f2b277973ca7d 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -24,6 +24,7 @@
#include <linux/splice.h>
#include <linux/sched.h>
#include <linux/seq_file.h>
+#include <linux/nmi.h>
#define CREATE_TRACE_POINTS
#include "fuse_trace.h"
@@ -2385,6 +2386,43 @@ static void end_polls(struct fuse_conn *fc)
}
}
+/*
+ * Flush all pending requests and wait for them. Only call this function when
+ * it is no longer possible for other threads to add requests.
+ */
+void fuse_flush_requests_and_wait(struct fuse_conn *fc, unsigned long timeout)
+{
+ unsigned long deadline;
+
+ spin_lock(&fc->lock);
+ if (!fc->connected) {
+ spin_unlock(&fc->lock);
+ return;
+ }
+
+ /* Push all the background requests to the queue. */
+ spin_lock(&fc->bg_lock);
+ fc->blocked = 0;
+ fc->max_background = UINT_MAX;
+ flush_bg_queue(fc);
+ spin_unlock(&fc->bg_lock);
+ spin_unlock(&fc->lock);
+
+ /*
+ * Wait up to @timeout jiffies (forever if @timeout is zero) for all the
+ * events to complete or abort. Touch the watchdog once per second so
+ * that we don't trip the hangcheck timer while waiting for the fuse server.
+ */
+ deadline = jiffies + timeout;
+ smp_mb();
+ while (fc->connected &&
+ (!timeout || time_before(jiffies, deadline)) &&
+ wait_event_timeout(fc->blocked_waitq,
+ !fc->connected || atomic_read(&fc->num_waiting) == 0,
+ HZ) == 0)
+ touch_softlockup_watchdog();
+}
+
/*
* Abort all requests.
*
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index ecb869e895ab1d..b3b0c0f5598b4a 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -2045,6 +2045,7 @@ void fuse_conn_destroy(struct fuse_mount *fm)
{
struct fuse_conn *fc = fm->fc;
+ fuse_flush_requests_and_wait(fc, secs_to_jiffies(30));
if (fc->destroy)
fuse_send_destroy(fm);
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 3/7] fuse: capture the unique id of fuse commands being sent
2025-08-21 0:47 ` [PATCHSET RFC v4 1/4] fuse: general bug fixes Darrick J. Wong
2025-08-21 0:50 ` [PATCH 1/7] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
2025-08-21 0:51 ` [PATCH 2/7] fuse: flush pending fuse events before aborting the connection Darrick J. Wong
@ 2025-08-21 0:51 ` Darrick J. Wong
2025-08-22 0:15 ` Joanne Koong
2025-08-21 0:51 ` [PATCH 4/7] fuse: implement file attributes mask for statx Darrick J. Wong
` (3 subsequent siblings)
6 siblings, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:51 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
The fuse_request_{send,end} tracepoints capture the value of
req->in.h.unique in the trace output. It would be really nice if we
could use this to match a request to its response for debugging and
latency analysis, but the call to trace_fuse_request_send occurs before
the unique id has been set:
fuse_request_send: connection 8388608 req 0 opcode 1 (FUSE_LOOKUP) len 107
fuse_request_end: connection 8388608 req 6 len 16 error -2
Move the calls to trace_fuse_request_send so that they occur after the
unique id has been set, or right before we cancel a request that never
had one set.
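As an illustration of why the id matters, here is a minimal sketch of the kind of request/response matching a trace consumer might do. Everything in it is hypothetical (struct tracker, track_send(), track_end(), the fixed-size table); it only assumes that each trace line yields a unique id and a timestamp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INFLIGHT 64	/* arbitrary capacity for this sketch */

struct inflight {
	uint64_t unique;	/* req id from fuse_request_send */
	uint64_t sent_ns;	/* timestamp of the send event */
	int used;
};

struct tracker {
	struct inflight slots[MAX_INFLIGHT];
};

/* Record a fuse_request_send event. Returns -1 if the table is full. */
static int track_send(struct tracker *t, uint64_t unique, uint64_t now_ns)
{
	assert(t != NULL);
	for (size_t i = 0; i < MAX_INFLIGHT; i++) {
		if (!t->slots[i].used) {
			t->slots[i] = (struct inflight){ unique, now_ns, 1 };
			return 0;
		}
	}
	return -1;
}

/*
 * Match a fuse_request_end event to its send event and report the
 * latency. Returns -1 if no send with that id was recorded.
 */
static int track_end(struct tracker *t, uint64_t unique, uint64_t now_ns,
		     uint64_t *latency_ns)
{
	assert(t != NULL && latency_ns != NULL);
	for (size_t i = 0; i < MAX_INFLIGHT; i++) {
		struct inflight *s = &t->slots[i];

		if (s->used && s->unique == unique) {
			*latency_ns = now_ns - s->sent_ns;
			s->used = 0;
			return 0;
		}
	}
	return -1;
}
```

Feeding this the trace above (send logged as req 0, end logged as req 6) makes track_end() fail to find a match, whereas a send event that already carries id 6 pairs up and yields a latency.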
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/dev.c | 6 +++++-
fs/fuse/dev_uring.c | 8 +++++++-
2 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 6f2b277973ca7d..05d6e7779387a4 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -376,10 +376,15 @@ static void fuse_dev_queue_req(struct fuse_iqueue *fiq, struct fuse_req *req)
if (fiq->connected) {
if (req->in.h.opcode != FUSE_NOTIFY_REPLY)
req->in.h.unique = fuse_get_unique_locked(fiq);
+
+ /* tracepoint captures in.h.unique */
+ trace_fuse_request_send(req);
+
list_add_tail(&req->list, &fiq->pending);
fuse_dev_wake_and_unlock(fiq);
} else {
spin_unlock(&fiq->lock);
+ trace_fuse_request_send(req);
req->out.h.error = -ENOTCONN;
clear_bit(FR_PENDING, &req->flags);
fuse_request_end(req);
@@ -398,7 +403,6 @@ static void fuse_send_one(struct fuse_iqueue *fiq, struct fuse_req *req)
req->in.h.len = sizeof(struct fuse_in_header) +
fuse_len_args(req->args->in_numargs,
(struct fuse_arg *) req->args->in_args);
- trace_fuse_request_send(req);
fiq->ops->send_req(fiq, req);
}
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 249b210becb1cc..14f263d4419392 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -7,6 +7,7 @@
#include "fuse_i.h"
#include "dev_uring_i.h"
#include "fuse_dev_i.h"
+#include "fuse_trace.h"
#include <linux/fs.h>
#include <linux/io_uring/cmd.h>
@@ -1265,12 +1266,17 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
err = -EINVAL;
queue = fuse_uring_task_to_queue(ring);
- if (!queue)
+ if (!queue) {
+ trace_fuse_request_send(req);
goto err;
+ }
if (req->in.h.opcode != FUSE_NOTIFY_REPLY)
req->in.h.unique = fuse_get_unique(fiq);
+ /* tracepoint captures in.h.unique */
+ trace_fuse_request_send(req);
+
spin_lock(&queue->lock);
err = -ENOTCONN;
if (unlikely(queue->stopped))
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 4/7] fuse: implement file attributes mask for statx
2025-08-21 0:47 ` [PATCHSET RFC v4 1/4] fuse: general bug fixes Darrick J. Wong
` (2 preceding siblings ...)
2025-08-21 0:51 ` [PATCH 3/7] fuse: capture the unique id of fuse commands being sent Darrick J. Wong
@ 2025-08-21 0:51 ` Darrick J. Wong
2025-08-22 0:01 ` Joanne Koong
2025-08-29 6:24 ` Miklos Szeredi
2025-08-21 0:51 ` [PATCH 5/7] fuse: update file mode when updating acls Darrick J. Wong
` (2 subsequent siblings)
6 siblings, 2 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:51 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Actually copy the attributes/attributes_mask from userspace.
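For context, this is roughly how a userspace program would consume the two fields once they are passed through. The sketch uses the real statx(2) call and the stx_attributes / stx_attributes_mask fields; classify_attr(), query_attr() and enum attr_state are hypothetical helpers, and which bits a given fuse server actually reports is an assumption left to the server.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>	/* AT_FDCWD, AT_SYMLINK_NOFOLLOW */
#include <stdint.h>
#include <sys/stat.h>	/* statx(), struct statx, STATX_* */

enum attr_state { ATTR_UNSUPPORTED = -1, ATTR_CLEAR = 0, ATTR_SET = 1 };

/*
 * A bit in stx_attributes is only meaningful if the same bit is set in
 * stx_attributes_mask; otherwise the filesystem does not report it.
 */
static enum attr_state classify_attr(uint64_t attributes,
				     uint64_t attributes_mask, uint64_t bit)
{
	assert(bit != 0);
	if (!(attributes_mask & bit))
		return ATTR_UNSUPPORTED;
	return (attributes & bit) ? ATTR_SET : ATTR_CLEAR;
}

/* Query one attribute bit for a path. Returns -1 if statx() fails. */
static int query_attr(const char *path, uint64_t bit, enum attr_state *out)
{
	struct statx stx;

	if (statx(AT_FDCWD, path, AT_SYMLINK_NOFOLLOW, STATX_BASIC_STATS,
		  &stx) != 0)
		return -1;
	*out = classify_attr(stx.stx_attributes, stx.stx_attributes_mask, bit);
	return 0;
}
```

A caller could pass STATX_ATTR_IMMUTABLE as the bit. If the kernel drops the server's mask, every query comes back as unsupported, even when the server filled the fields in.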
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 4 ++++
fs/fuse/dir.c | 4 ++++
fs/fuse/inode.c | 3 +++
3 files changed, 11 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 2b5d56e3cb4eaf..bb1fdae0bbc906 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -140,6 +140,10 @@ struct fuse_inode {
/** Version of last attribute change */
u64 attr_version;
+ /** statx file attributes */
+ u64 statx_attributes;
+ u64 statx_attributes_mask;
+
union {
/* read/write io cache (regular file only) */
struct {
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 2d817d7cab2649..2e4d1131ab8cbe 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1278,6 +1278,8 @@ static int fuse_do_statx(struct mnt_idmap *idmap, struct inode *inode,
stat->result_mask = sx->mask & (STATX_BASIC_STATS | STATX_BTIME);
stat->btime.tv_sec = sx->btime.tv_sec;
stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, NSEC_PER_SEC - 1);
+ stat->attributes = sx->attributes;
+ stat->attributes_mask = sx->attributes_mask;
fuse_fillattr(idmap, inode, &attr, stat);
stat->result_mask |= STATX_TYPE;
}
@@ -1381,6 +1383,8 @@ static int fuse_update_get_attr(struct mnt_idmap *idmap, struct inode *inode,
stat->btime = fi->i_btime;
stat->result_mask |= STATX_BTIME;
}
+ stat->attributes = fi->statx_attributes;
+ stat->attributes_mask = fi->statx_attributes_mask;
}
return err;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index b3b0c0f5598b4a..463879830ecf34 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -287,6 +287,9 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
fi->i_btime.tv_sec = sx->btime.tv_sec;
fi->i_btime.tv_nsec = sx->btime.tv_nsec;
}
+
+ fi->statx_attributes = sx->attributes;
+ fi->statx_attributes_mask = sx->attributes_mask;
}
if (attr->blksize != 0)
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 5/7] fuse: update file mode when updating acls
2025-08-21 0:47 ` [PATCHSET RFC v4 1/4] fuse: general bug fixes Darrick J. Wong
` (3 preceding siblings ...)
2025-08-21 0:51 ` [PATCH 4/7] fuse: implement file attributes mask for statx Darrick J. Wong
@ 2025-08-21 0:51 ` Darrick J. Wong
2025-09-03 16:01 ` Miklos Szeredi
2025-08-21 0:52 ` [PATCH 6/7] fuse: propagate default and file acls on creation Darrick J. Wong
2025-08-21 0:52 ` [PATCH 7/7] fuse: enable FUSE_SYNCFS for all servers Darrick J. Wong
6 siblings, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:51 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
If someone sets ACLs on a file that can be expressed fully as Unix DAC
mode bits, most filesystems will then update the mode bits and drop the
ACL xattr to reduce inefficiency in the file access paths. Let's do
that too. Note that this means a setacl can succeed and leave no ACL
xattr behind, so we also need to tolerate ENODATA returns from
fuse_removexattr.
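The "can be expressed fully as mode bits" test is easy to show in isolation. Below is a simplified userspace model of that check; it is a sketch of the idea only, with hypothetical types and names (enum tag, struct acl_entry, acl_equiv_mode()), and it assumes the entries arrive in the usual order with the mask after the owning group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum tag {
	TAG_USER_OBJ,	/* file owner */
	TAG_USER,	/* named user */
	TAG_GROUP_OBJ,	/* owning group */
	TAG_GROUP,	/* named group */
	TAG_MASK,
	TAG_OTHER,
};

struct acl_entry {
	enum tag tag;
	unsigned int perm;	/* rwx as a value from 0 to 7 */
};

/*
 * Compute the permission bits implied by an access ACL. Returns true
 * if the ACL holds nothing beyond owner/group/other, in which case the
 * mode bits alone describe it and the xattr can be dropped.
 */
static bool acl_equiv_mode(const struct acl_entry *e, size_t n,
			   unsigned int *mode_bits)
{
	unsigned int mode = 0;
	bool equiv = true;

	for (size_t i = 0; i < n; i++) {
		assert(e[i].perm <= 7);
		switch (e[i].tag) {
		case TAG_USER_OBJ:
			mode |= e[i].perm << 6;
			break;
		case TAG_GROUP_OBJ:
			mode |= e[i].perm << 3;
			break;
		case TAG_OTHER:
			mode |= e[i].perm;
			break;
		case TAG_MASK:
			/* The mask takes over the group bits of the mode. */
			mode = (mode & ~070u) | (e[i].perm << 3);
			equiv = false;
			break;
		case TAG_USER:
		case TAG_GROUP:
			equiv = false;
			break;
		}
	}
	*mode_bits = mode;
	return equiv;
}
```

An ACL of owner rw-, group r--, other r-- is just mode 0644, so only the mode needs updating. Add a named user and a mask and the ACL has to be stored as an xattr, with the mask showing up in the group bits of the mode.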
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/acl.c | 30 +++++++++++++++++++++++++++++-
1 file changed, 29 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index 8f484b105f13ab..63df349dee1caf 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -98,6 +98,7 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
struct inode *inode = d_inode(dentry);
struct fuse_conn *fc = get_fuse_conn(inode);
const char *name;
+ umode_t mode = inode->i_mode;
int ret;
if (fuse_is_bad(inode))
@@ -113,6 +114,17 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
else
return -EINVAL;
+ /*
+ * If the ACL can be represented entirely with changes to the mode
+ * bits, then most filesystems will update the mode bits and delete
+ * the ACL xattr.
+ */
+ if (acl && type == ACL_TYPE_ACCESS && fc->posix_acl) {
+ ret = posix_acl_update_mode(idmap, inode, &mode, &acl);
+ if (ret)
+ return ret;
+ }
+
if (acl) {
unsigned int extra_flags = 0;
/*
@@ -143,7 +155,7 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
* through POSIX ACLs. Such daemons don't expect setgid bits to
* be stripped.
*/
- if (fc->posix_acl &&
+ if (fc->posix_acl && mode == inode->i_mode &&
!in_group_or_capable(idmap, inode,
i_gid_into_vfsgid(idmap, inode)))
extra_flags |= FUSE_SETXATTR_ACL_KILL_SGID;
@@ -152,6 +164,22 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
kfree(value);
} else {
ret = fuse_removexattr(inode, name);
+ /* If the acl didn't exist to start with that's fine. */
+ if (ret == -ENODATA)
+ ret = 0;
+ }
+
+ /* If we scheduled a mode update above, push that to userspace now. */
+ if (!ret) {
+ struct iattr attr = { };
+
+ if (mode != inode->i_mode) {
+ attr.ia_valid |= ATTR_MODE;
+ attr.ia_mode = mode;
+ }
+
+ if (attr.ia_valid)
+ ret = fuse_do_setattr(idmap, dentry, &attr, NULL);
}
if (fc->posix_acl) {
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 6/7] fuse: propagate default and file acls on creation
2025-08-21 0:47 ` [PATCHSET RFC v4 1/4] fuse: general bug fixes Darrick J. Wong
` (4 preceding siblings ...)
2025-08-21 0:51 ` [PATCH 5/7] fuse: update file mode when updating acls Darrick J. Wong
@ 2025-08-21 0:52 ` Darrick J. Wong
2025-09-03 16:15 ` Miklos Szeredi
2025-08-21 0:52 ` [PATCH 7/7] fuse: enable FUSE_SYNCFS for all servers Darrick J. Wong
6 siblings, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:52 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Propagate the default and file access ACLs to new children when creating
them, just like the other kernel filesystems.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 4 ++
fs/fuse/acl.c | 65 ++++++++++++++++++++++++++++++++++++++
fs/fuse/dir.c | 92 +++++++++++++++++++++++++++++++++++++++++-------------
3 files changed, 138 insertions(+), 23 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index bb1fdae0bbc906..b80505f5431e0b 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1452,6 +1452,10 @@ struct posix_acl *fuse_get_acl(struct mnt_idmap *idmap,
struct dentry *dentry, int type);
int fuse_set_acl(struct mnt_idmap *, struct dentry *dentry,
struct posix_acl *acl, int type);
+int fuse_acl_create(struct inode *dir, umode_t *mode,
+ struct posix_acl **default_acl, struct posix_acl **acl);
+int fuse_init_acls(struct inode *inode, const struct posix_acl *default_acl,
+ const struct posix_acl *acl);
/* readdir.c */
int fuse_readdir(struct file *file, struct dir_context *ctx);
diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index 63df349dee1caf..4f37390e3f3ce7 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -193,3 +193,68 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
return ret;
}
+
+int fuse_acl_create(struct inode *dir, umode_t *mode,
+ struct posix_acl **default_acl, struct posix_acl **acl)
+{
+ struct fuse_conn *fc = get_fuse_conn(dir);
+
+ if (fuse_is_bad(dir))
+ return -EIO;
+
+ if (IS_POSIXACL(dir))
+ return posix_acl_create(dir, mode, default_acl, acl);
+
+ if (!fc->dont_mask)
+ *mode &= ~current_umask();
+
+ *default_acl = NULL;
+ *acl = NULL;
+ return 0;
+}
+
+static int __fuse_set_acl(struct inode *inode, const char *name,
+ const struct posix_acl *acl)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ size_t size = posix_acl_xattr_size(acl->a_count);
+ void *value;
+ int ret;
+
+ if (size > PAGE_SIZE)
+ return -E2BIG;
+
+ value = kmalloc(size, GFP_KERNEL);
+ if (!value)
+ return -ENOMEM;
+
+ ret = posix_acl_to_xattr(fc->user_ns, acl, value, size);
+ if (ret < 0)
+ goto out_value;
+
+ ret = fuse_setxattr(inode, name, value, size, 0, 0);
+out_value:
+ kfree(value);
+ return ret;
+}
+
+int fuse_init_acls(struct inode *inode, const struct posix_acl *default_acl,
+ const struct posix_acl *acl)
+{
+ int ret;
+
+ if (default_acl) {
+ ret = __fuse_set_acl(inode, XATTR_NAME_POSIX_ACL_DEFAULT,
+ default_acl);
+ if (ret)
+ return ret;
+ }
+
+ if (acl) {
+ ret = __fuse_set_acl(inode, XATTR_NAME_POSIX_ACL_ACCESS, acl);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 2e4d1131ab8cbe..8e922dcadb8675 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -628,26 +628,28 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
struct fuse_entry_out outentry;
struct fuse_inode *fi;
struct fuse_file *ff;
+ struct posix_acl *default_acl = NULL, *acl = NULL;
int epoch, err;
bool trunc = flags & O_TRUNC;
/* Userspace expects S_IFREG in create mode */
BUG_ON((mode & S_IFMT) != S_IFREG);
+ err = fuse_acl_create(dir, &mode, &default_acl, &acl);
+ if (err)
+ return err;
+
epoch = atomic_read(&fm->fc->epoch);
forget = fuse_alloc_forget();
err = -ENOMEM;
if (!forget)
- goto out_err;
+ goto out_acl_release;
err = -ENOMEM;
ff = fuse_file_alloc(fm, true);
if (!ff)
goto out_put_forget_req;
- if (!fm->fc->dont_mask)
- mode &= ~current_umask();
-
flags &= ~O_NOCTTY;
memset(&inarg, 0, sizeof(inarg));
memset(&outentry, 0, sizeof(outentry));
@@ -699,12 +701,16 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
fuse_sync_release(NULL, ff, flags);
fuse_queue_forget(fm->fc, forget, outentry.nodeid, 1);
err = -ENOMEM;
- goto out_err;
+ goto out_acl_release;
}
kfree(forget);
d_instantiate(entry, inode);
entry->d_time = epoch;
fuse_change_entry_timeout(entry, &outentry);
+
+ err = fuse_init_acls(inode, default_acl, acl);
+ if (err)
+ goto out_acl_release;
fuse_dir_changed(dir);
err = generic_file_open(inode, file);
if (!err) {
@@ -726,7 +732,9 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
fuse_file_free(ff);
out_put_forget_req:
kfree(forget);
-out_err:
+out_acl_release:
+ posix_acl_release(default_acl);
+ posix_acl_release(acl);
return err;
}
@@ -785,7 +793,9 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
*/
static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_mount *fm,
struct fuse_args *args, struct inode *dir,
- struct dentry *entry, umode_t mode)
+ struct dentry *entry, umode_t mode,
+ struct posix_acl *default_acl,
+ struct posix_acl *acl)
{
struct fuse_entry_out outarg;
struct inode *inode;
@@ -793,14 +803,18 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun
struct fuse_forget_link *forget;
int epoch, err;
- if (fuse_is_bad(dir))
- return ERR_PTR(-EIO);
+ if (fuse_is_bad(dir)) {
+ err = -EIO;
+ goto out_acl_release;
+ }
epoch = atomic_read(&fm->fc->epoch);
forget = fuse_alloc_forget();
- if (!forget)
- return ERR_PTR(-ENOMEM);
+ if (!forget) {
+ err = -ENOMEM;
+ goto out_acl_release;
+ }
memset(&outarg, 0, sizeof(outarg));
args->nodeid = get_node_id(dir);
@@ -830,7 +844,8 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun
&outarg.attr, ATTR_TIMEOUT(&outarg), 0, 0);
if (!inode) {
fuse_queue_forget(fm->fc, forget, outarg.nodeid, 1);
- return ERR_PTR(-ENOMEM);
+ err = -ENOMEM;
+ goto out_acl_release;
}
kfree(forget);
@@ -846,19 +861,31 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun
entry->d_time = epoch;
fuse_change_entry_timeout(entry, &outarg);
}
+
+ err = fuse_init_acls(inode, default_acl, acl);
+ if (err)
+ goto out_acl_release;
fuse_dir_changed(dir);
+
+ posix_acl_release(default_acl);
+ posix_acl_release(acl);
return d;
out_put_forget_req:
if (err == -EEXIST)
fuse_invalidate_entry(entry);
kfree(forget);
+ out_acl_release:
+ posix_acl_release(default_acl);
+ posix_acl_release(acl);
return ERR_PTR(err);
}
static int create_new_nondir(struct mnt_idmap *idmap, struct fuse_mount *fm,
struct fuse_args *args, struct inode *dir,
- struct dentry *entry, umode_t mode)
+ struct dentry *entry, umode_t mode,
+ struct posix_acl *default_acl,
+ struct posix_acl *acl)
{
/*
* Note that when creating anything other than a directory we
@@ -869,7 +896,8 @@ static int create_new_nondir(struct mnt_idmap *idmap, struct fuse_mount *fm,
*/
WARN_ON_ONCE(S_ISDIR(mode));
- return PTR_ERR(create_new_entry(idmap, fm, args, dir, entry, mode));
+ return PTR_ERR(create_new_entry(idmap, fm, args, dir, entry, mode,
+ default_acl, acl));
}
static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir,
@@ -877,10 +905,13 @@ static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir,
{
struct fuse_mknod_in inarg;
struct fuse_mount *fm = get_fuse_mount(dir);
+ struct posix_acl *default_acl, *acl;
FUSE_ARGS(args);
+ int err;
- if (!fm->fc->dont_mask)
- mode &= ~current_umask();
+ err = fuse_acl_create(dir, &mode, &default_acl, &acl);
+ if (err)
+ return err;
memset(&inarg, 0, sizeof(inarg));
inarg.mode = mode;
@@ -892,7 +923,8 @@ static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir,
args.in_args[0].value = &inarg;
args.in_args[1].size = entry->d_name.len + 1;
args.in_args[1].value = entry->d_name.name;
- return create_new_nondir(idmap, fm, &args, dir, entry, mode);
+ return create_new_nondir(idmap, fm, &args, dir, entry, mode,
+ default_acl, acl);
}
static int fuse_create(struct mnt_idmap *idmap, struct inode *dir,
@@ -924,13 +956,17 @@ static struct dentry *fuse_mkdir(struct mnt_idmap *idmap, struct inode *dir,
{
struct fuse_mkdir_in inarg;
struct fuse_mount *fm = get_fuse_mount(dir);
+ struct posix_acl *default_acl, *acl;
FUSE_ARGS(args);
+ int err;
- if (!fm->fc->dont_mask)
- mode &= ~current_umask();
+ mode |= S_IFDIR; /* vfs doesn't set S_IFDIR for us */
+ err = fuse_acl_create(dir, &mode, &default_acl, &acl);
+ if (err)
+ return ERR_PTR(err);
memset(&inarg, 0, sizeof(inarg));
- inarg.mode = mode;
+ inarg.mode = mode & ~S_IFDIR;
inarg.umask = current_umask();
args.opcode = FUSE_MKDIR;
args.in_numargs = 2;
@@ -938,7 +974,8 @@ static struct dentry *fuse_mkdir(struct mnt_idmap *idmap, struct inode *dir,
args.in_args[0].value = &inarg;
args.in_args[1].size = entry->d_name.len + 1;
args.in_args[1].value = entry->d_name.name;
- return create_new_entry(idmap, fm, &args, dir, entry, S_IFDIR);
+ return create_new_entry(idmap, fm, &args, dir, entry, S_IFDIR,
+ default_acl, acl);
}
static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir,
@@ -946,7 +983,14 @@ static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir,
{
struct fuse_mount *fm = get_fuse_mount(dir);
unsigned len = strlen(link) + 1;
+ struct posix_acl *default_acl, *acl;
+ umode_t mode = S_IFLNK | 0777;
FUSE_ARGS(args);
+ int err;
+
+ err = fuse_acl_create(dir, &mode, &default_acl, &acl);
+ if (err)
+ return err;
args.opcode = FUSE_SYMLINK;
args.in_numargs = 3;
@@ -955,7 +999,8 @@ static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir,
args.in_args[1].value = entry->d_name.name;
args.in_args[2].size = len;
args.in_args[2].value = link;
- return create_new_nondir(idmap, fm, &args, dir, entry, S_IFLNK);
+ return create_new_nondir(idmap, fm, &args, dir, entry, S_IFLNK,
+ default_acl, acl);
}
void fuse_flush_time_update(struct inode *inode)
@@ -1155,7 +1200,8 @@ static int fuse_link(struct dentry *entry, struct inode *newdir,
args.in_args[0].value = &inarg;
args.in_args[1].size = newent->d_name.len + 1;
args.in_args[1].value = newent->d_name.name;
- err = create_new_nondir(&invalid_mnt_idmap, fm, &args, newdir, newent, inode->i_mode);
+ err = create_new_nondir(&invalid_mnt_idmap, fm, &args, newdir, newent,
+ inode->i_mode, NULL, NULL);
if (!err)
fuse_update_ctime_in_cache(inode);
else if (err == -EINTR)
* [PATCH 7/7] fuse: enable FUSE_SYNCFS for all servers
2025-08-21 0:47 ` [PATCHSET RFC v4 1/4] fuse: general bug fixes Darrick J. Wong
` (5 preceding siblings ...)
2025-08-21 0:52 ` [PATCH 6/7] fuse: propagate default and file acls on creation Darrick J. Wong
@ 2025-08-21 0:52 ` Darrick J. Wong
2025-08-21 22:18 ` Joanne Koong
6 siblings, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:52 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Turn on syncfs for all fuse servers so that the ones in the know can
flush cached intermediate data and logs to disk.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
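Reviewer note, not part of the patch: a minimal userspace sketch of how the effect could be exercised, assuming a fuse filesystem is mounted at the path passed in. syncfs(2) is the real glibc wrapper; flush_filesystem() is a hypothetical helper name made up for this note. With fc->sync_fs set, the call should reach the server as a FUSE_SYNCFS request; whether the server does anything useful with it is up to the server.

```c
#define _GNU_SOURCE		/* for syncfs() */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush the filesystem that contains @path.
 * Returns 0 on success, or -1 with errno set by open(2) or syncfs(2).
 */
static int flush_filesystem(const char *path)
{
	int fd, ret, saved_errno;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = syncfs(fd);

	/* close() may clobber errno, so preserve the syncfs result */
	saved_errno = errno;
	close(fd);
	errno = saved_errno;
	return ret;
}
```

Calling flush_filesystem("/mnt/fuse") (mount point assumed) and watching the server's request log for FUSE_SYNCFS is one way to confirm the flag took effect.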
fs/fuse/inode.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 463879830ecf34..b05510799f93e1 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1814,6 +1814,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
if (!sb_set_blocksize(sb, ctx->blksize))
goto err;
#endif
+ fc->sync_fs = 1;
} else {
sb->s_blocksize = PAGE_SIZE;
sb->s_blocksize_bits = PAGE_SHIFT;
* [PATCH 01/23] fuse: move CREATE_TRACE_POINTS to a separate file
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
@ 2025-08-21 0:52 ` Darrick J. Wong
2025-08-21 0:53 ` [PATCH 02/23] fuse: implement the basic iomap mechanisms Darrick J. Wong
` (21 subsequent siblings)
22 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:52 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Before we start adding new tracepoints for fuse+iomap, move the
tracepoint creation itself to a separate source file so that we don't
have to start pulling iomap dependencies into dev.c just for the iomap
structures.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/Makefile | 3 ++-
fs/fuse/dev.c | 1 -
fs/fuse/trace.c | 13 +++++++++++++
3 files changed, 15 insertions(+), 2 deletions(-)
create mode 100644 fs/fuse/trace.c
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 3f0f312a31c1cc..f3a273131a6cd1 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -10,7 +10,8 @@ obj-$(CONFIG_FUSE_FS) += fuse.o
obj-$(CONFIG_CUSE) += cuse.o
obj-$(CONFIG_VIRTIO_FS) += virtiofs.o
-fuse-y := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
+fuse-y := trace.o # put trace.o first so we see ftrace errors sooner
+fuse-y += dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
fuse-y += iomode.o
fuse-$(CONFIG_FUSE_DAX) += dax.o
fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 05d6e7779387a4..dbde17fff0cda9 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -26,7 +26,6 @@
#include <linux/seq_file.h>
#include <linux/nmi.h>
-#define CREATE_TRACE_POINTS
#include "fuse_trace.h"
MODULE_ALIAS_MISCDEV(FUSE_MINOR);
diff --git a/fs/fuse/trace.c b/fs/fuse/trace.c
new file mode 100644
index 00000000000000..93bd72efc98cd0
--- /dev/null
+++ b/fs/fuse/trace.c
@@ -0,0 +1,13 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2025 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "dev_uring_i.h"
+#include "fuse_i.h"
+#include "fuse_dev_i.h"
+
+#include <linux/pagemap.h>
+
+#define CREATE_TRACE_POINTS
+#include "fuse_trace.h"
* [PATCH 02/23] fuse: implement the basic iomap mechanisms
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
2025-08-21 0:52 ` [PATCH 01/23] fuse: move CREATE_TRACE_POINTS to a separate file Darrick J. Wong
@ 2025-08-21 0:53 ` Darrick J. Wong
2025-08-21 0:53 ` [PATCH 03/23] fuse: make debugging configurable at runtime Darrick J. Wong
` (20 subsequent siblings)
22 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:53 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Implement functions to enable upcalling of iomap_begin and iomap_end to
userspace fuse servers.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
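Reviewer note, not part of the patch: a standalone userspace sketch of the core range checks that the kernel side applies to each mapping returned by FUSE_IOMAP_BEGIN (nonzero length, block alignment, no overflow, and coverage of the requested position). The struct layout is copied from the uapi addition in this patch; demo_mapping_covers_pos() is a hypothetical name, and __builtin_add_overflow (GCC/Clang) stands in for the kernel's check_add_overflow(). The type, flag, and device checks are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Layout copied from struct fuse_iomap_io in this patch. */
struct fuse_iomap_io {
	uint64_t offset;	/* file offset of mapping, bytes */
	uint64_t length;	/* length of mapping, bytes */
	uint64_t addr;		/* disk offset of mapping, bytes */
	uint16_t type;		/* FUSE_IOMAP_TYPE_* */
	uint16_t flags;		/* FUSE_IOMAP_F_* */
	uint32_t dev;		/* device cookie */
};

/*
 * Hypothetical helper: return true if @map is a sane file range that
 * contains byte @pos. @blocksize must be a nonzero power of two.
 */
static bool demo_mapping_covers_pos(const struct fuse_iomap_io *map,
				    uint64_t pos, uint32_t blocksize)
{
	uint64_t end;

	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);

	/* No zero-length mappings */
	if (map->length == 0)
		return false;

	/* File range must be aligned to blocksize */
	if ((map->offset | map->length) & (blocksize - 1))
		return false;

	/* No overflows in the file range */
	if (__builtin_add_overflow(map->offset, map->length, &end))
		return false;

	/* Mapping must contain the first byte of the request */
	return map->offset <= pos && pos < end;
}
```

A server could run a check like this on its own replies before sending them, since a mapping that fails any of these tests is rejected with -EFSCORRUPTED on the kernel side.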
fs/fuse/fuse_i.h | 35 ++++
fs/fuse/fuse_trace.h | 295 ++++++++++++++++++++++++++++++
fs/fuse/iomap_priv.h | 42 ++++
include/uapi/linux/fuse.h | 92 +++++++++
fs/fuse/Kconfig | 25 +++
fs/fuse/Makefile | 1
fs/fuse/file_iomap.c | 444 +++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/inode.c | 5 +
8 files changed, 938 insertions(+), 1 deletion(-)
create mode 100644 fs/fuse/iomap_priv.h
create mode 100644 fs/fuse/file_iomap.c
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index b80505f5431e0b..b28054c254f866 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -899,6 +899,9 @@ struct fuse_conn {
/* Is link not implemented by fs? */
unsigned int no_link:1;
+ /* Use fs/iomap for FIEMAP and SEEK_{DATA,HOLE} file operations */
+ unsigned int iomap:1;
+
/* Use io_uring for communication */
unsigned int io_uring;
@@ -1015,6 +1018,11 @@ static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
return sb->s_fs_info;
}
+static inline const struct fuse_mount *get_fuse_mount_super_c(const struct super_block *sb)
+{
+ return sb->s_fs_info;
+}
+
static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
{
return get_fuse_mount_super(sb)->fc;
@@ -1025,16 +1033,31 @@ static inline struct fuse_mount *get_fuse_mount(struct inode *inode)
return get_fuse_mount_super(inode->i_sb);
}
+static inline const struct fuse_mount *get_fuse_mount_c(const struct inode *inode)
+{
+ return get_fuse_mount_super_c(inode->i_sb);
+}
+
static inline struct fuse_conn *get_fuse_conn(struct inode *inode)
{
return get_fuse_mount_super(inode->i_sb)->fc;
}
+static inline const struct fuse_conn *get_fuse_conn_c(const struct inode *inode)
+{
+ return get_fuse_mount_super_c(inode->i_sb)->fc;
+}
+
static inline struct fuse_inode *get_fuse_inode(struct inode *inode)
{
return container_of(inode, struct fuse_inode, inode);
}
+static inline const struct fuse_inode *get_fuse_inode_c(const struct inode *inode)
+{
+ return container_of(inode, struct fuse_inode, inode);
+}
+
static inline u64 get_node_id(struct inode *inode)
{
return get_fuse_inode(inode)->nodeid;
@@ -1584,4 +1607,16 @@ extern void fuse_sysctl_unregister(void);
#define fuse_sysctl_unregister() do { } while (0)
#endif /* CONFIG_SYSCTL */
+#if IS_ENABLED(CONFIG_FUSE_IOMAP)
+bool fuse_iomap_enabled(void);
+
+static inline bool fuse_has_iomap(const struct inode *inode)
+{
+ return get_fuse_conn_c(inode)->iomap;
+}
+#else
+# define fuse_iomap_enabled(...) (false)
+# define fuse_has_iomap(...) (false)
+#endif
+
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index bbe9ddd8c71696..2389072b734636 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -58,6 +58,8 @@
EM( FUSE_SYNCFS, "FUSE_SYNCFS") \
EM( FUSE_TMPFILE, "FUSE_TMPFILE") \
EM( FUSE_STATX, "FUSE_STATX") \
+ EM( FUSE_IOMAP_BEGIN, "FUSE_IOMAP_BEGIN") \
+ EM( FUSE_IOMAP_END, "FUSE_IOMAP_END") \
EMe(CUSE_INIT, "CUSE_INIT")
/*
@@ -77,6 +79,54 @@ OPCODES
#define EM(a, b) {a, b},
#define EMe(a, b) {a, b}
+/* tracepoint boilerplate so we don't have to keep doing this */
+#define FUSE_INODE_FIELDS \
+ __field(dev_t, connection) \
+ __field(uint64_t, ino) \
+ __field(uint64_t, nodeid) \
+ __field(loff_t, isize)
+
+#define FUSE_INODE_ASSIGN(inode, fi, fm) \
+ const struct fuse_inode *fi = get_fuse_inode_c(inode); \
+ const struct fuse_mount *fm = get_fuse_mount_c(inode); \
+\
+ __entry->connection = (fm)->fc->dev; \
+ __entry->ino = (fi)->orig_ino; \
+ __entry->nodeid = (fi)->nodeid; \
+ __entry->isize = i_size_read(inode)
+
+#define FUSE_INODE_FMT \
+ "connection %u ino %llu nodeid %llu isize 0x%llx"
+
+#define FUSE_INODE_PRINTK_ARGS \
+ __entry->connection, \
+ __entry->ino, \
+ __entry->nodeid, \
+ __entry->isize
+
+#define FUSE_FILE_RANGE_FIELDS(prefix) \
+ __field(loff_t, prefix##offset) \
+ __field(loff_t, prefix##length)
+
+#define FUSE_FILE_RANGE_FMT(prefix) \
+ " " prefix "pos 0x%llx length 0x%llx"
+
+#define FUSE_FILE_RANGE_PRINTK_ARGS(prefix) \
+ __entry->prefix##offset, \
+ __entry->prefix##length
+
+/* combinations of boilerplate to reduce typing further */
+#define FUSE_IO_RANGE_FIELDS(prefix) \
+ FUSE_INODE_FIELDS \
+ FUSE_FILE_RANGE_FIELDS(prefix)
+
+#define FUSE_IO_RANGE_FMT(prefix) \
+ FUSE_INODE_FMT FUSE_FILE_RANGE_FMT(prefix)
+
+#define FUSE_IO_RANGE_PRINTK_ARGS(prefix) \
+ FUSE_INODE_PRINTK_ARGS, \
+ FUSE_FILE_RANGE_PRINTK_ARGS(prefix)
+
TRACE_EVENT(fuse_request_send,
TP_PROTO(const struct fuse_req *req),
@@ -124,6 +174,251 @@ TRACE_EVENT(fuse_request_end,
__entry->unique, __entry->len, __entry->error)
);
+#if IS_ENABLED(CONFIG_FUSE_IOMAP)
+
+/* tracepoint boilerplate so we don't have to keep doing this */
+#define FUSE_IOMAP_OPFLAGS_FIELD \
+ __field(unsigned, opflags)
+
+#define FUSE_IOMAP_OPFLAGS_FMT \
+ " opflags (%s)"
+
+#define FUSE_IOMAP_OPFLAGS_PRINTK_ARG \
+ __print_flags(__entry->opflags, "|", FUSE_IOMAP_OP_STRINGS)
+
+#define FUSE_IOMAP_MAP_FIELDS(prefix) \
+ __field(uint64_t, prefix##offset) \
+ __field(uint64_t, prefix##length) \
+ __field(uint64_t, prefix##addr) \
+ __field(uint32_t, prefix##dev) \
+ __field(uint16_t, prefix##type) \
+ __field(uint16_t, prefix##flags)
+
+#define FUSE_IOMAP_MAP_FMT(prefix) \
+ " " prefix "offset 0x%llx length 0x%llx type %s dev %u addr 0x%llx mapflags (%s)"
+
+#define FUSE_IOMAP_MAP_PRINTK_ARGS(prefix) \
+ __entry->prefix##offset, \
+ __entry->prefix##length, \
+ __print_symbolic(__entry->prefix##type, FUSE_IOMAP_TYPE_STRINGS), \
+ __entry->prefix##dev, \
+ __entry->prefix##addr, \
+ __print_flags(__entry->prefix##flags, "|", FUSE_IOMAP_F_STRINGS)
+
+/* combinations of boilerplate to reduce typing further */
+#define FUSE_IOMAP_OP_FIELDS(prefix) \
+ FUSE_INODE_FIELDS \
+ FUSE_IOMAP_OPFLAGS_FIELD \
+ FUSE_FILE_RANGE_FIELDS(prefix)
+
+#define FUSE_IOMAP_OP_FMT(prefix) \
+ FUSE_INODE_FMT FUSE_IOMAP_OPFLAGS_FMT FUSE_FILE_RANGE_FMT(prefix)
+
+#define FUSE_IOMAP_OP_PRINTK_ARGS(prefix) \
+ FUSE_INODE_PRINTK_ARGS, \
+ FUSE_IOMAP_OPFLAGS_PRINTK_ARG, \
+ FUSE_FILE_RANGE_PRINTK_ARGS(prefix)
+
+/* string decoding */
+#define FUSE_IOMAP_F_STRINGS \
+ { FUSE_IOMAP_F_NEW, "new" }, \
+ { FUSE_IOMAP_F_DIRTY, "dirty" }, \
+ { FUSE_IOMAP_F_SHARED, "shared" }, \
+ { FUSE_IOMAP_F_MERGED, "merged" }, \
+ { FUSE_IOMAP_F_BOUNDARY, "boundary" }, \
+ { FUSE_IOMAP_F_ANON_WRITE, "anon_write" }, \
+ { FUSE_IOMAP_F_ATOMIC_BIO, "atomic" }, \
+ { FUSE_IOMAP_F_WANT_IOMAP_END, "iomap_end" }, \
+ { FUSE_IOMAP_F_SIZE_CHANGED, "append" }, \
+ { FUSE_IOMAP_F_STALE, "stale" }
+
+#define FUSE_IOMAP_OP_STRINGS \
+ { FUSE_IOMAP_OP_WRITE, "write" }, \
+ { FUSE_IOMAP_OP_ZERO, "zero" }, \
+ { FUSE_IOMAP_OP_REPORT, "report" }, \
+ { FUSE_IOMAP_OP_FAULT, "fault" }, \
+ { FUSE_IOMAP_OP_DIRECT, "direct" }, \
+ { FUSE_IOMAP_OP_NOWAIT, "nowait" }, \
+ { FUSE_IOMAP_OP_OVERWRITE_ONLY, "overwrite" }, \
+ { FUSE_IOMAP_OP_UNSHARE, "unshare" }, \
+ { FUSE_IOMAP_OP_DAX, "fsdax" }, \
+ { FUSE_IOMAP_OP_ATOMIC, "atomic" }, \
+ { FUSE_IOMAP_OP_DONTCACHE, "dontcache" }
+
+#define FUSE_IOMAP_TYPE_STRINGS \
+ { FUSE_IOMAP_TYPE_PURE_OVERWRITE, "overwrite" }, \
+ { FUSE_IOMAP_TYPE_HOLE, "hole" }, \
+ { FUSE_IOMAP_TYPE_DELALLOC, "delalloc" }, \
+ { FUSE_IOMAP_TYPE_MAPPED, "mapped" }, \
+ { FUSE_IOMAP_TYPE_UNWRITTEN, "unwritten" }, \
+ { FUSE_IOMAP_TYPE_INLINE, "inline" }
+
+DECLARE_EVENT_CLASS(fuse_iomap_check_class,
+ TP_PROTO(const char *func, int line, const char *condition),
+
+ TP_ARGS(func, line, condition),
+
+ TP_STRUCT__entry(
+ __string(func, func)
+ __field(int, line)
+ __string(condition, condition)
+ ),
+
+ TP_fast_assign(
+ __assign_str(func);
+ __assign_str(condition);
+ __entry->line = line;
+ ),
+
+ TP_printk("func %s line %d condition %s", __get_str(func),
+ __entry->line, __get_str(condition))
+);
+#define DEFINE_FUSE_IOMAP_CHECK_EVENT(name) \
+DEFINE_EVENT(fuse_iomap_check_class, name, \
+ TP_PROTO(const char *func, int line, const char *condition), \
+ TP_ARGS(func, line, condition))
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+DEFINE_FUSE_IOMAP_CHECK_EVENT(fuse_iomap_assert);
+#endif
+DEFINE_FUSE_IOMAP_CHECK_EVENT(fuse_iomap_bad_data);
+
+TRACE_EVENT(fuse_iomap_begin,
+ TP_PROTO(const struct inode *inode, loff_t pos, loff_t count,
+ unsigned opflags),
+
+ TP_ARGS(inode, pos, count, opflags),
+
+ TP_STRUCT__entry(
+ FUSE_IOMAP_OP_FIELDS()
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = pos;
+ __entry->length = count;
+ __entry->opflags = opflags;
+ ),
+
+ TP_printk(FUSE_IOMAP_OP_FMT(),
+ FUSE_IOMAP_OP_PRINTK_ARGS())
+);
+
+TRACE_EVENT(fuse_iomap_begin_error,
+ TP_PROTO(const struct inode *inode, loff_t pos, loff_t count,
+ unsigned opflags, int error),
+
+ TP_ARGS(inode, pos, count, opflags, error),
+
+ TP_STRUCT__entry(
+ FUSE_IOMAP_OP_FIELDS()
+ __field(int, error)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = pos;
+ __entry->length = count;
+ __entry->opflags = opflags;
+ __entry->error = error;
+ ),
+
+ TP_printk(FUSE_IOMAP_OP_FMT() " err %d",
+ FUSE_IOMAP_OP_PRINTK_ARGS(),
+ __entry->error)
+);
+
+DECLARE_EVENT_CLASS(fuse_iomap_mapping_class,
+ TP_PROTO(const struct inode *inode, const struct fuse_iomap_io *map),
+
+ TP_ARGS(inode, map),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ FUSE_IOMAP_MAP_FIELDS(map)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->mapoffset = map->offset;
+ __entry->maplength = map->length;
+ __entry->mapdev = map->dev;
+ __entry->mapaddr = map->addr;
+ __entry->maptype = map->type;
+ __entry->mapflags = map->flags;
+ ),
+
+ TP_printk(FUSE_INODE_FMT FUSE_IOMAP_MAP_FMT(),
+ FUSE_INODE_PRINTK_ARGS,
+ FUSE_IOMAP_MAP_PRINTK_ARGS(map))
+);
+#define DEFINE_FUSE_IOMAP_MAPPING_EVENT(name) \
+DEFINE_EVENT(fuse_iomap_mapping_class, name, \
+ TP_PROTO(const struct inode *inode, const struct fuse_iomap_io *map), \
+ TP_ARGS(inode, map))
+DEFINE_FUSE_IOMAP_MAPPING_EVENT(fuse_iomap_read_map);
+DEFINE_FUSE_IOMAP_MAPPING_EVENT(fuse_iomap_write_map);
+
+TRACE_EVENT(fuse_iomap_end,
+ TP_PROTO(const struct inode *inode,
+ const struct fuse_iomap_end_in *inarg),
+
+ TP_ARGS(inode, inarg),
+
+ TP_STRUCT__entry(
+ FUSE_IOMAP_OP_FIELDS()
+ __field(size_t, written)
+ FUSE_IOMAP_MAP_FIELDS(map)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->opflags = inarg->opflags;
+ __entry->written = inarg->written;
+ __entry->offset = inarg->pos;
+ __entry->length = inarg->count;
+
+ __entry->mapoffset = inarg->map.offset;
+ __entry->maplength = inarg->map.length;
+ __entry->mapdev = inarg->map.dev;
+ __entry->mapaddr = inarg->map.addr;
+ __entry->maptype = inarg->map.type;
+ __entry->mapflags = inarg->map.flags;
+ ),
+
+ TP_printk(FUSE_IOMAP_OP_FMT() " written %zd" FUSE_IOMAP_MAP_FMT(),
+ FUSE_IOMAP_OP_PRINTK_ARGS(),
+ __entry->written,
+ FUSE_IOMAP_MAP_PRINTK_ARGS(map))
+);
+
+TRACE_EVENT(fuse_iomap_end_error,
+ TP_PROTO(const struct inode *inode,
+ const struct fuse_iomap_end_in *inarg, int error),
+
+ TP_ARGS(inode, inarg, error),
+
+ TP_STRUCT__entry(
+ FUSE_IOMAP_OP_FIELDS()
+ __field(size_t, written)
+ __field(int, error)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = inarg->pos;
+ __entry->length = inarg->count;
+ __entry->opflags = inarg->opflags;
+ __entry->written = inarg->written;
+ __entry->error = error;
+ ),
+
+ TP_printk(FUSE_IOMAP_OP_FMT() " written %zd error %d",
+ FUSE_IOMAP_OP_PRINTK_ARGS(),
+ __entry->written,
+ __entry->error)
+);
+#endif /* CONFIG_FUSE_IOMAP */
+
#endif /* _TRACE_FUSE_H */
#undef TRACE_INCLUDE_PATH
diff --git a/fs/fuse/iomap_priv.h b/fs/fuse/iomap_priv.h
new file mode 100644
index 00000000000000..ca8544a95a4267
--- /dev/null
+++ b/fs/fuse/iomap_priv.h
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2025 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef _FS_FUSE_IOMAP_PRIV_H
+#define _FS_FUSE_IOMAP_PRIV_H
+
+#if IS_ENABLED(CONFIG_FUSE_IOMAP)
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+# define ASSERT(condition) do { \
+ int __cond = !!(condition); \
+ if (unlikely(!__cond)) \
+ trace_fuse_iomap_assert(__func__, __LINE__, #condition); \
+ WARN(!__cond, "Assertion failed: %s, func: %s, line: %d", #condition, __func__, __LINE__); \
+} while (0)
+# define BAD_DATA(condition) ({ \
+ int __cond = !!(condition); \
+ if (unlikely(__cond)) \
+ trace_fuse_iomap_bad_data(__func__, __LINE__, #condition); \
+ WARN(__cond, "Bad mapping: %s, func: %s, line: %d", #condition, __func__, __LINE__); \
+})
+#else
+# define ASSERT(condition)
+# define BAD_DATA(condition) ({ \
+ int __cond = !!(condition); \
+ if (unlikely(__cond)) \
+ trace_fuse_iomap_bad_data(__func__, __LINE__, #condition); \
+ unlikely(__cond); \
+})
+#endif /* CONFIG_FUSE_IOMAP_DEBUG */
+
+enum fuse_iomap_iodir {
+ READ_MAPPING,
+ WRITE_MAPPING,
+};
+
+#define EFSCORRUPTED EUCLEAN
+
+#endif /* CONFIG_FUSE_IOMAP */
+
+#endif /* _FS_FUSE_IOMAP_PRIV_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 122d6586e8d4da..3b9e337119d792 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -235,6 +235,10 @@
*
* 7.44
* - add FUSE_NOTIFY_INC_EPOCH
+ *
+ * 7.99
+ * - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
+ * SEEK_{DATA,HOLE}
*/
#ifndef _LINUX_FUSE_H
@@ -270,7 +274,7 @@
#define FUSE_KERNEL_VERSION 7
/** Minor version number of this interface */
-#define FUSE_KERNEL_MINOR_VERSION 44
+#define FUSE_KERNEL_MINOR_VERSION 99
/** The node ID of the root inode */
#define FUSE_ROOT_ID 1
@@ -443,6 +447,8 @@ struct fuse_file_lock {
* FUSE_OVER_IO_URING: Indicate that client supports io-uring
* FUSE_REQUEST_TIMEOUT: kernel supports timing out requests.
* init_out.request_timeout contains the timeout (in secs)
+ * FUSE_IOMAP: Client supports iomap for FIEMAP and SEEK_{DATA,HOLE} file
+ * operations.
*/
#define FUSE_ASYNC_READ (1 << 0)
#define FUSE_POSIX_LOCKS (1 << 1)
@@ -490,6 +496,7 @@ struct fuse_file_lock {
#define FUSE_ALLOW_IDMAP (1ULL << 40)
#define FUSE_OVER_IO_URING (1ULL << 41)
#define FUSE_REQUEST_TIMEOUT (1ULL << 42)
+#define FUSE_IOMAP (1ULL << 43)
/**
* CUSE INIT request/reply flags
@@ -658,6 +665,9 @@ enum fuse_opcode {
FUSE_TMPFILE = 51,
FUSE_STATX = 52,
+ FUSE_IOMAP_BEGIN = 4094,
+ FUSE_IOMAP_END = 4095,
+
/* CUSE specific operations */
CUSE_INIT = 4096,
@@ -1290,4 +1300,84 @@ struct fuse_uring_cmd_req {
uint8_t padding[6];
};
+/* mapping types; see corresponding IOMAP_TYPE_ */
+#define FUSE_IOMAP_TYPE_HOLE (0)
+#define FUSE_IOMAP_TYPE_DELALLOC (1)
+#define FUSE_IOMAP_TYPE_MAPPED (2)
+#define FUSE_IOMAP_TYPE_UNWRITTEN (3)
+#define FUSE_IOMAP_TYPE_INLINE (4)
+
+/* fuse-specific mapping type indicating that writes use the read mapping */
+#define FUSE_IOMAP_TYPE_PURE_OVERWRITE (255)
+
+#define FUSE_IOMAP_DEV_NULL (0U) /* null device cookie */
+
+/* mapping flags passed back from iomap_begin; see corresponding IOMAP_F_ */
+#define FUSE_IOMAP_F_NEW (1U << 0)
+#define FUSE_IOMAP_F_DIRTY (1U << 1)
+#define FUSE_IOMAP_F_SHARED (1U << 2)
+#define FUSE_IOMAP_F_MERGED (1U << 3)
+#define FUSE_IOMAP_F_BOUNDARY (1U << 4)
+#define FUSE_IOMAP_F_ANON_WRITE (1U << 5)
+#define FUSE_IOMAP_F_ATOMIC_BIO (1U << 6)
+
+/* fuse-specific mapping flag asking for ->iomap_end call */
+#define FUSE_IOMAP_F_WANT_IOMAP_END (1U << 7)
+
+/* mapping flags passed to iomap_end */
+#define FUSE_IOMAP_F_SIZE_CHANGED (1U << 8)
+#define FUSE_IOMAP_F_STALE (1U << 9)
+
+/* operation flags from iomap; see corresponding IOMAP_* */
+#define FUSE_IOMAP_OP_WRITE (1U << 0)
+#define FUSE_IOMAP_OP_ZERO (1U << 1)
+#define FUSE_IOMAP_OP_REPORT (1U << 2)
+#define FUSE_IOMAP_OP_FAULT (1U << 3)
+#define FUSE_IOMAP_OP_DIRECT (1U << 4)
+#define FUSE_IOMAP_OP_NOWAIT (1U << 5)
+#define FUSE_IOMAP_OP_OVERWRITE_ONLY (1U << 6)
+#define FUSE_IOMAP_OP_UNSHARE (1U << 7)
+#define FUSE_IOMAP_OP_DAX (1U << 8)
+#define FUSE_IOMAP_OP_ATOMIC (1U << 9)
+#define FUSE_IOMAP_OP_DONTCACHE (1U << 10)
+
+#define FUSE_IOMAP_NULL_ADDR (-1ULL) /* addr is not valid */
+
+struct fuse_iomap_io {
+ uint64_t offset; /* file offset of mapping, bytes */
+ uint64_t length; /* length of mapping, bytes */
+ uint64_t addr; /* disk offset of mapping, bytes */
+ uint16_t type; /* FUSE_IOMAP_TYPE_* */
+ uint16_t flags; /* FUSE_IOMAP_F_* */
+ uint32_t dev; /* device cookie */
+};
+
+struct fuse_iomap_begin_in {
+ uint32_t opflags; /* FUSE_IOMAP_OP_* */
+ uint32_t reserved; /* zero */
+ uint64_t attr_ino; /* matches fuse_attr:ino */
+ uint64_t pos; /* file position, in bytes */
+ uint64_t count; /* operation length, in bytes */
+};
+
+struct fuse_iomap_begin_out {
+ /* read file data from here */
+ struct fuse_iomap_io read;
+
+ /* write file data to here, if applicable */
+ struct fuse_iomap_io write;
+};
+
+struct fuse_iomap_end_in {
+ uint32_t opflags; /* FUSE_IOMAP_OP_* */
+ uint32_t reserved; /* zero */
+ uint64_t attr_ino; /* matches fuse_attr:ino */
+ uint64_t pos; /* file position, in bytes */
+ uint64_t count; /* operation length, in bytes */
+ int64_t written; /* bytes processed */
+
+ /* mapping that the kernel acted upon */
+ struct fuse_iomap_io map;
+};
+
#endif /* _LINUX_FUSE_H */
diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index a774166264de69..e0bcbd42431344 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -65,6 +65,31 @@ config FUSE_PASSTHROUGH
If you want to allow passthrough operations, answer Y.
+config FUSE_IOMAP
+ bool "FUSE file IO over iomap"
+ default y
+ depends on FUSE_FS
+ depends on BLOCK
+ select FS_IOMAP
+ help
+ For supported fuseblk servers, this allows the file IO path to run
+ through the kernel.
+
+config FUSE_IOMAP_BY_DEFAULT
+ bool "FUSE file I/O over iomap by default"
+ default n
+ depends on FUSE_IOMAP
+ help
+ Enable sending FUSE file I/O over iomap by default.
+
+config FUSE_IOMAP_DEBUG
+ bool "Debug FUSE file IO over iomap"
+ default n
+ depends on FUSE_IOMAP
+ help
+ Enable debugging assertions for the fuse iomap code paths and logging
+ of bad iomap file mapping data being sent to the kernel.
+
config FUSE_IO_URING
bool "FUSE communication over io-uring"
default y
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index f3a273131a6cd1..70709a7a3f9523 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -17,5 +17,6 @@ fuse-$(CONFIG_FUSE_DAX) += dax.o
fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
fuse-$(CONFIG_SYSCTL) += sysctl.o
fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
+fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o
virtiofs-y := virtio_fs.o
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
new file mode 100644
index 00000000000000..d11b1f810523fc
--- /dev/null
+++ b/fs/fuse/file_iomap.c
@@ -0,0 +1,444 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2025 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include <linux/iomap.h>
+#include "fuse_i.h"
+#include "fuse_trace.h"
+#include "iomap_priv.h"
+
+static bool __read_mostly enable_iomap =
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_BY_DEFAULT)
+ true;
+#else
+ false;
+#endif
+module_param(enable_iomap, bool, 0644);
+MODULE_PARM_DESC(enable_iomap, "Enable file I/O through iomap");
+
+bool fuse_iomap_enabled(void)
+{
+ /* Don't let anyone touch iomap until the end of the patchset. */
+ return false;
+
+ /*
+ * There are fears that a fuse+iomap server could somehow DoS the
+ * system by doing things like going out to lunch during a writeback
+ * related iomap request. Only allow iomap access if the fuse server
+ * has rawio capabilities since those processes can mess things up
+ * quite well even without our help.
+ */
+ return enable_iomap && has_capability_noaudit(current, CAP_SYS_RAWIO);
+}
+
+/* Convert IOMAP_* mapping types to FUSE_IOMAP_TYPE_* */
+#define XMAP(word) \
+ case IOMAP_##word: \
+ return FUSE_IOMAP_TYPE_##word
+static inline uint16_t fuse_iomap_type_to_server(uint16_t iomap_type)
+{
+ switch (iomap_type) {
+ XMAP(HOLE);
+ XMAP(DELALLOC);
+ XMAP(MAPPED);
+ XMAP(UNWRITTEN);
+ XMAP(INLINE);
+ default:
+ ASSERT(0);
+ }
+ return 0;
+}
+#undef XMAP
+
+/* Convert FUSE_IOMAP_TYPE_* to IOMAP_* mapping types */
+#define XMAP(word) \
+ case FUSE_IOMAP_TYPE_##word: \
+ return IOMAP_##word
+static inline uint16_t fuse_iomap_type_from_server(uint16_t fuse_type)
+{
+ switch (fuse_type) {
+ XMAP(HOLE);
+ XMAP(DELALLOC);
+ XMAP(MAPPED);
+ XMAP(UNWRITTEN);
+ XMAP(INLINE);
+ default:
+ ASSERT(0);
+ }
+ return 0;
+}
+#undef XMAP
+
+/* Validate FUSE_IOMAP_TYPE_* */
+static inline bool fuse_iomap_check_type(uint16_t fuse_type)
+{
+ switch (fuse_type) {
+ case FUSE_IOMAP_TYPE_HOLE:
+ case FUSE_IOMAP_TYPE_DELALLOC:
+ case FUSE_IOMAP_TYPE_MAPPED:
+ case FUSE_IOMAP_TYPE_UNWRITTEN:
+ case FUSE_IOMAP_TYPE_INLINE:
+ case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
+ return true;
+ }
+
+ return false;
+}
+
+#define FUSE_IOMAP_F_ALL (FUSE_IOMAP_F_NEW | \
+ FUSE_IOMAP_F_DIRTY | \
+ FUSE_IOMAP_F_SHARED | \
+ FUSE_IOMAP_F_MERGED | \
+ FUSE_IOMAP_F_BOUNDARY | \
+ FUSE_IOMAP_F_ANON_WRITE | \
+ FUSE_IOMAP_F_ATOMIC_BIO | \
+ FUSE_IOMAP_F_WANT_IOMAP_END)
+
+static inline bool fuse_iomap_check_flags(uint16_t flags)
+{
+ return (flags & ~FUSE_IOMAP_F_ALL) == 0;
+}
+
+/* Convert IOMAP_F_* mapping state flags to FUSE_IOMAP_F_* */
+#define XMAP(word) \
+ if (iomap_f_flags & IOMAP_F_##word) \
+ ret |= FUSE_IOMAP_F_##word
+#define YMAP(iword, oword) \
+ if (iomap_f_flags & IOMAP_F_##iword) \
+ ret |= FUSE_IOMAP_F_##oword
+static inline uint16_t fuse_iomap_flags_to_server(uint16_t iomap_f_flags)
+{
+ uint16_t ret = 0;
+
+ XMAP(NEW);
+ XMAP(DIRTY);
+ XMAP(SHARED);
+ XMAP(MERGED);
+ XMAP(BOUNDARY);
+ XMAP(ANON_WRITE);
+ XMAP(ATOMIC_BIO);
+ YMAP(PRIVATE, WANT_IOMAP_END);
+
+ XMAP(SIZE_CHANGED);
+ XMAP(STALE);
+
+ return ret;
+}
+#undef YMAP
+#undef XMAP
+
+/* Convert FUSE_IOMAP_F_* to IOMAP_F_* mapping state flags */
+#define XMAP(word) \
+ if (fuse_f_flags & FUSE_IOMAP_F_##word) \
+ ret |= IOMAP_F_##word
+#define YMAP(iword, oword) \
+ if (fuse_f_flags & FUSE_IOMAP_F_##iword) \
+ ret |= IOMAP_F_##oword
+static inline uint16_t fuse_iomap_flags_from_server(uint16_t fuse_f_flags)
+{
+ uint16_t ret = 0;
+
+ XMAP(NEW);
+ XMAP(DIRTY);
+ XMAP(SHARED);
+ XMAP(MERGED);
+ XMAP(BOUNDARY);
+ XMAP(ANON_WRITE);
+ XMAP(ATOMIC_BIO);
+ YMAP(WANT_IOMAP_END, PRIVATE);
+
+ return ret;
+}
+#undef YMAP
+#undef XMAP
+
+/* Convert IOMAP_* operation flags to FUSE_IOMAP_OP_* */
+#define XMAP(word) \
+ if (iomap_op_flags & IOMAP_##word) \
+ ret |= FUSE_IOMAP_OP_##word
+static inline uint32_t fuse_iomap_op_to_server(unsigned iomap_op_flags)
+{
+ uint32_t ret = 0;
+
+ XMAP(WRITE);
+ XMAP(ZERO);
+ XMAP(REPORT);
+ XMAP(FAULT);
+ XMAP(DIRECT);
+ XMAP(NOWAIT);
+ XMAP(OVERWRITE_ONLY);
+ XMAP(UNSHARE);
+ XMAP(DAX);
+ XMAP(ATOMIC);
+ XMAP(DONTCACHE);
+
+ return ret;
+}
+#undef XMAP
+
+/* Validate an iomap mapping. */
+static inline bool fuse_iomap_check_mapping(const struct inode *inode,
+ const struct fuse_iomap_io *map,
+ enum fuse_iomap_iodir iodir)
+{
+ const unsigned int blocksize = i_blocksize(inode);
+ uint64_t end;
+
+ /* Type and flags must be known */
+ if (BAD_DATA(!fuse_iomap_check_type(map->type)))
+ return false;
+ if (BAD_DATA(!fuse_iomap_check_flags(map->flags)))
+ return false;
+
+ /* No zero-length mappings */
+ if (BAD_DATA(map->length == 0))
+ return false;
+
+ /* File range must be aligned to blocksize */
+ if (BAD_DATA(!IS_ALIGNED(map->offset, blocksize)))
+ return false;
+ if (BAD_DATA(!IS_ALIGNED(map->length, blocksize)))
+ return false;
+
+ /* No overflows in the file range */
+ if (BAD_DATA(check_add_overflow(map->offset, map->length, &end)))
+ return false;
+
+ /* File range cannot start past maxbytes */
+ if (BAD_DATA(map->offset >= inode->i_sb->s_maxbytes))
+ return false;
+
+ switch (map->type) {
+ case FUSE_IOMAP_TYPE_MAPPED:
+ case FUSE_IOMAP_TYPE_UNWRITTEN:
+ /* Mappings backed by space must have a device/addr */
+ if (BAD_DATA(map->dev == FUSE_IOMAP_DEV_NULL))
+ return false;
+ if (BAD_DATA(map->addr == FUSE_IOMAP_NULL_ADDR))
+ return false;
+ break;
+ case FUSE_IOMAP_TYPE_DELALLOC:
+ case FUSE_IOMAP_TYPE_HOLE:
+ case FUSE_IOMAP_TYPE_INLINE:
+ /* Mappings not backed by space cannot have a device addr. */
+ if (BAD_DATA(map->dev != FUSE_IOMAP_DEV_NULL))
+ return false;
+ if (BAD_DATA(map->addr != FUSE_IOMAP_NULL_ADDR))
+ return false;
+ break;
+ case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
+ /* "Pure overwrite" only allowed for write mapping */
+ if (BAD_DATA(iodir != WRITE_MAPPING))
+ return false;
+ break;
+ default:
+ /* should have been caught already */
+ ASSERT(0);
+ return false;
+ }
+
+ /* XXX: we don't support devices yet */
+ if (BAD_DATA(map->dev != FUSE_IOMAP_DEV_NULL))
+ return false;
+
+ /* No overflows in the device range, if supplied */
+ if (map->addr != FUSE_IOMAP_NULL_ADDR &&
+ BAD_DATA(check_add_overflow(map->addr, map->length, &end)))
+ return false;
+
+ return true;
+}
+
+/* Convert a mapping from the server into something the kernel can use */
+static inline void fuse_iomap_from_server(struct inode *inode,
+ struct iomap *iomap,
+ const struct fuse_iomap_io *fmap)
+{
+ iomap->addr = fmap->addr;
+ iomap->offset = fmap->offset;
+ iomap->length = fmap->length;
+ iomap->type = fuse_iomap_type_from_server(fmap->type);
+ iomap->flags = fuse_iomap_flags_from_server(fmap->flags);
+ iomap->bdev = inode->i_sb->s_bdev; /* XXX */
+}
+
+/* Convert a mapping from the kernel into something the server can use */
+static inline void fuse_iomap_to_server(struct fuse_iomap_io *fmap,
+ const struct iomap *iomap)
+{
+ fmap->addr = iomap->addr;
+ fmap->offset = iomap->offset;
+ fmap->length = iomap->length;
+ fmap->type = fuse_iomap_type_to_server(iomap->type);
+ fmap->flags = fuse_iomap_flags_to_server(iomap->flags);
+ fmap->dev = FUSE_IOMAP_DEV_NULL; /* XXX */
+}
+
+/* Check the incoming _begin mappings to make sure they're not nonsense. */
+static inline int
+fuse_iomap_begin_validate(const struct inode *inode,
+ unsigned opflags, loff_t pos,
+ const struct fuse_iomap_begin_out *outarg)
+{
+ /* Make sure the mappings aren't garbage */
+ if (!fuse_iomap_check_mapping(inode, &outarg->read, READ_MAPPING))
+ return -EFSCORRUPTED;
+
+ if (!fuse_iomap_check_mapping(inode, &outarg->write, WRITE_MAPPING))
+ return -EFSCORRUPTED;
+
+ /*
+ * Must have returned a mapping for at least the first byte in the
+ * range. The main mapping check already validated that the length
+ * is nonzero and there is no overflow in computing end.
+ */
+ if (BAD_DATA(outarg->read.offset > pos))
+ return -EFSCORRUPTED;
+ if (BAD_DATA(outarg->write.offset > pos))
+ return -EFSCORRUPTED;
+
+ if (BAD_DATA(outarg->read.offset + outarg->read.length <= pos))
+ return -EFSCORRUPTED;
+ if (BAD_DATA(outarg->write.offset + outarg->write.length <= pos))
+ return -EFSCORRUPTED;
+
+ return 0;
+}
+
+static inline bool fuse_is_iomap_file_write(unsigned int opflags)
+{
+ return opflags & (IOMAP_WRITE | IOMAP_ZERO | IOMAP_UNSHARE);
+}
+
+static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
+ unsigned opflags, struct iomap *iomap,
+ struct iomap *srcmap)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_iomap_begin_in inarg = {
+ .attr_ino = fi->orig_ino,
+ .opflags = fuse_iomap_op_to_server(opflags),
+ .pos = pos,
+ .count = count,
+ };
+ struct fuse_iomap_begin_out outarg = { };
+ struct fuse_mount *fm = get_fuse_mount(inode);
+ FUSE_ARGS(args);
+ int err;
+
+ trace_fuse_iomap_begin(inode, pos, count, opflags);
+
+ args.opcode = FUSE_IOMAP_BEGIN;
+ args.nodeid = get_node_id(inode);
+ args.in_numargs = 1;
+ args.in_args[0].size = sizeof(inarg);
+ args.in_args[0].value = &inarg;
+ args.out_numargs = 1;
+ args.out_args[0].size = sizeof(outarg);
+ args.out_args[0].value = &outarg;
+ err = fuse_simple_request(fm, &args);
+ if (err) {
+ trace_fuse_iomap_begin_error(inode, pos, count, opflags, err);
+ return err;
+ }
+
+ trace_fuse_iomap_read_map(inode, &outarg.read);
+ trace_fuse_iomap_write_map(inode, &outarg.write);
+
+ err = fuse_iomap_begin_validate(inode, opflags, pos, &outarg);
+ if (err)
+ return err;
+
+ if (fuse_is_iomap_file_write(opflags) &&
+ outarg.write.type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) {
+ /*
+ * For an out of place write, we must supply the write mapping
+ * via @iomap, and the read mapping via @srcmap.
+ */
+ fuse_iomap_from_server(inode, iomap, &outarg.write);
+ fuse_iomap_from_server(inode, srcmap, &outarg.read);
+ } else {
+ /*
+ * For everything else (reads, reporting, and pure overwrites),
+ * we can return the sole mapping through @iomap and leave
+ * @srcmap unchanged from its default (HOLE).
+ */
+ fuse_iomap_from_server(inode, iomap, &outarg.read);
+ }
+
+ return 0;
+}
+
+/* Decide if we send FUSE_IOMAP_END to the fuse server */
+static bool fuse_should_send_iomap_end(const struct iomap *iomap,
+ unsigned int opflags, loff_t count,
+ ssize_t written)
+{
+ /* fuse server demanded an iomap_end call. */
+ if (iomap->flags & FUSE_IOMAP_F_WANT_IOMAP_END)
+ return true;
+
+ /* Reads and reporting should never affect the filesystem metadata */
+ if (!fuse_is_iomap_file_write(opflags))
+ return false;
+
+ /* Appending writes get an iomap_end call */
+ if (iomap->flags & IOMAP_F_SIZE_CHANGED)
+ return true;
+
+ /* Short writes get an iomap_end call to clean up delalloc */
+ return written < count;
+}
+
+static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count,
+ ssize_t written, unsigned opflags,
+ struct iomap *iomap)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_mount *fm = get_fuse_mount(inode);
+ int err = 0;
+
+ if (fuse_should_send_iomap_end(iomap, opflags, count, written)) {
+ struct fuse_iomap_end_in inarg = {
+ .opflags = fuse_iomap_op_to_server(opflags),
+ .attr_ino = fi->orig_ino,
+ .pos = pos,
+ .count = count,
+ .written = written,
+ };
+ FUSE_ARGS(args);
+
+ fuse_iomap_to_server(&inarg.map, iomap);
+
+ trace_fuse_iomap_end(inode, &inarg);
+
+ args.opcode = FUSE_IOMAP_END;
+ args.nodeid = get_node_id(inode);
+ args.in_numargs = 1;
+ args.in_args[0].size = sizeof(inarg);
+ args.in_args[0].value = &inarg;
+ err = fuse_simple_request(fm, &args);
+ switch (err) {
+ case -ENOSYS:
+ /*
+ * libfuse returns ENOSYS for servers that don't
+ * implement iomap_end
+ */
+ err = 0;
+ break;
+ case 0:
+ break;
+ default:
+ trace_fuse_iomap_end_error(inode, &inarg, err);
+ break;
+ }
+ }
+
+ return err;
+}
+
+const struct iomap_ops fuse_iomap_ops = {
+ .iomap_begin = fuse_iomap_begin,
+ .iomap_end = fuse_iomap_end,
+};
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index b05510799f93e1..82e074642e8e9b 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1446,6 +1446,9 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
if (flags & FUSE_REQUEST_TIMEOUT)
timeout = arg->request_timeout;
+
+ if ((flags & FUSE_IOMAP) && fuse_iomap_enabled())
+ fc->iomap = 1;
} else {
ra_pages = fc->max_read / PAGE_SIZE;
fc->no_lock = 1;
@@ -1514,6 +1517,8 @@ void fuse_send_init(struct fuse_mount *fm)
*/
if (fuse_uring_enabled())
flags |= FUSE_OVER_IO_URING;
+ if (fuse_iomap_enabled())
+ flags |= FUSE_IOMAP;
ia->in.flags = flags;
ia->in.flags2 = flags >> 32;
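
Note on the _begin validation in this patch: the BAD_DATA checks in
fuse_iomap_begin_validate() require that each mapping returned by the
server covers the byte at pos. Below is a minimal userspace sketch of that
range check, not the kernel code itself. The names demo_mapping,
demo_mapping_covers and DEMO_EFSCORRUPTED are hypothetical, and the sketch
folds the "nonzero length, no overflow" precondition into the same function,
whereas the patch relies on fuse_iomap_check_mapping() for that part.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Fallback for libcs that lack EUCLEAN; Linux defines it as 117. */
#ifndef EUCLEAN
#define EUCLEAN 117
#endif

/* Assumed errno for "the server sent garbage" in this sketch. */
#define DEMO_EFSCORRUPTED EUCLEAN

/* Hypothetical stand-in for the offset/length pair of a server mapping. */
struct demo_mapping {
	uint64_t offset;	/* file offset where the mapping starts */
	uint64_t length;	/* number of bytes covered */
};

/* Return 0 if @map covers the byte at @pos, else -DEMO_EFSCORRUPTED. */
static int demo_mapping_covers(const struct demo_mapping *map, uint64_t pos)
{
	assert(map != NULL);

	/* Reject empty mappings and offset + length wraparound first. */
	if (map->length == 0 || map->offset > UINT64_MAX - map->length)
		return -DEMO_EFSCORRUPTED;

	/* The mapping starts after the first requested byte. */
	if (map->offset > pos)
		return -DEMO_EFSCORRUPTED;

	/* The mapping ends at or before the first requested byte. */
	if (map->offset + map->length <= pos)
		return -DEMO_EFSCORRUPTED;

	return 0;
}
```

A mapping of 4096 bytes at offset 4096 therefore passes for pos 4096
through 8191 and fails for anything outside that window.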
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 03/23] fuse: make debugging configurable at runtime
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
2025-08-21 0:52 ` [PATCH 01/23] fuse: move CREATE_TRACE_POINTS to a separate file Darrick J. Wong
2025-08-21 0:53 ` [PATCH 02/23] fuse: implement the basic iomap mechanisms Darrick J. Wong
@ 2025-08-21 0:53 ` Darrick J. Wong
2025-08-21 0:53 ` [PATCH 04/23] fuse: move the backing file idr and code into a new source file Darrick J. Wong
` (19 subsequent siblings)
22 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:53 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Use static keys so that we can configure debugging assertions and dmesg
warnings at runtime. By default this is turned off so the cost is
merely scanning a nop sled. However, fuse server developers can turn
it on for their debugging systems.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 8 +++++
fs/fuse/iomap_priv.h | 16 ++++++++--
fs/fuse/Kconfig | 15 +++++++++
fs/fuse/file_iomap.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/inode.c | 7 ++++
5 files changed, 124 insertions(+), 3 deletions(-)
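
Note on the sysfs knob: the store handler for /sys/fs/fuse/iomap/debug
parses an integer and accepts only 0 or 1. The following is a rough
userspace approximation of that parsing, useful for reasoning about which
strings a server developer can write to the file. The function name
demo_parse_debug_knob is hypothetical, and strtol() stands in for
kstrtoint(), so corner cases may differ from the kernel's behaviour.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Accept an integer in any base strtol() understands, optionally followed
 * by one newline, and allow only the values 0 and 1.
 * Returns 0 and fills *val on success, or a negative errno.
 */
static int demo_parse_debug_knob(const char *buf, int *val)
{
	char *end;
	long v;

	assert(buf != NULL && val != NULL);

	/* Be stricter than strtol(): do not tolerate leading whitespace. */
	if (isspace((unsigned char)buf[0]))
		return -EINVAL;

	errno = 0;
	v = strtol(buf, &end, 0);
	if (end == buf)
		return -EINVAL;		/* no digits at all */
	if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
		return -ERANGE;		/* does not fit in an int */
	if (*end == '\n')
		end++;			/* tolerate "echo 1 > debug" */
	if (*end != '\0')
		return -EINVAL;		/* trailing junk */

	if (v < 0 || v > 1)
		return -EINVAL;		/* only 0 and 1 are meaningful */

	*val = (int)v;
	return 0;
}
```

With this model, "1\n" and "0" are accepted, while "2", "-1" and "on" are
rejected with -EINVAL.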
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index b28054c254f866..2cd9f4cdc6a7ef 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1607,6 +1607,14 @@ extern void fuse_sysctl_unregister(void);
#define fuse_sysctl_unregister() do { } while (0)
#endif /* CONFIG_SYSCTL */
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+int fuse_iomap_sysfs_init(struct kobject *kobj);
+void fuse_iomap_sysfs_cleanup(struct kobject *kobj);
+#else
+# define fuse_iomap_sysfs_init(...) (0)
+# define fuse_iomap_sysfs_cleanup(...) ((void)0)
+#endif
+
#if IS_ENABLED(CONFIG_FUSE_IOMAP)
bool fuse_iomap_enabled(void);
diff --git a/fs/fuse/iomap_priv.h b/fs/fuse/iomap_priv.h
index ca8544a95a4267..7002eb38f87fe1 100644
--- a/fs/fuse/iomap_priv.h
+++ b/fs/fuse/iomap_priv.h
@@ -6,19 +6,29 @@
#ifndef _FS_FUSE_IOMAP_PRIV_H
#define _FS_FUSE_IOMAP_PRIV_H
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG_DEFAULT)
+DECLARE_STATIC_KEY_TRUE(fuse_iomap_debug);
+#else
+DECLARE_STATIC_KEY_FALSE(fuse_iomap_debug);
+#endif
+
#if IS_ENABLED(CONFIG_FUSE_IOMAP)
#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
-# define ASSERT(condition) do { \
+# define ASSERT(condition) \
+while (static_branch_unlikely(&fuse_iomap_debug)) { \
int __cond = !!(condition); \
if (unlikely(!__cond)) \
trace_fuse_iomap_assert(__func__, __LINE__, #condition); \
WARN(!__cond, "Assertion failed: %s, func: %s, line: %d", #condition, __func__, __LINE__); \
-} while (0)
+ break; \
+}
# define BAD_DATA(condition) ({ \
int __cond = !!(condition); \
if (unlikely(__cond)) \
trace_fuse_iomap_bad_data(__func__, __LINE__, #condition); \
- WARN(__cond, "Bad mapping: %s, func: %s, line: %d", #condition, __func__, __LINE__); \
+ if (static_branch_unlikely(&fuse_iomap_debug)) \
+ WARN(__cond, "Bad mapping: %s, func: %s, line: %d", #condition, __func__, __LINE__); \
+ unlikely(__cond); \
})
#else
# define ASSERT(condition)
diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index e0bcbd42431344..6be74396ef5198 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -90,6 +90,21 @@ config FUSE_IOMAP_DEBUG
Enable debugging assertions for the fuse iomap code paths and logging
of bad iomap file mapping data being sent to the kernel.
+ Say N here if you don't want any debugging code compiled in at
+ all.
+
+config FUSE_IOMAP_DEBUG_DEFAULT
+ bool "Debug FUSE file IO over iomap at boot time"
+ default n
+ depends on FUSE_IOMAP_DEBUG
+ help
+ At boot time, enable debugging assertions for the fuse iomap code
+ paths and warnings about bad iomap file mapping data. This enables
+ fuse server authors to control debugging at runtime even on a
+ distribution kernel while avoiding most of the overhead on production
+ systems. The setting can be changed at runtime via
+ /sys/fs/fuse/iomap/debug.
+
config FUSE_IO_URING
bool "FUSE communication over io-uring"
default y
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index d11b1f810523fc..fad5457d669baf 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -8,6 +8,12 @@
#include "fuse_trace.h"
#include "iomap_priv.h"
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG_DEFAULT)
+DEFINE_STATIC_KEY_TRUE(fuse_iomap_debug);
+#else
+DEFINE_STATIC_KEY_FALSE(fuse_iomap_debug);
+#endif
+
static bool __read_mostly enable_iomap =
#if IS_ENABLED(CONFIG_FUSE_IOMAP_BY_DEFAULT)
true;
@@ -17,6 +23,81 @@ static bool __read_mostly enable_iomap =
module_param(enable_iomap, bool, 0644);
MODULE_PARM_DESC(enable_iomap, "Enable file I/O through iomap");
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+static struct kobject *iomap_kobj;
+
+static ssize_t fuse_iomap_debug_show(struct kobject *kobject,
+ struct kobj_attribute *a, char *buf)
+{
+ return sysfs_emit(buf, "%d\n", !!static_key_enabled(&fuse_iomap_debug));
+}
+
+static ssize_t fuse_iomap_debug_store(struct kobject *kobject,
+ struct kobj_attribute *a,
+ const char *buf, size_t count)
+{
+ int ret;
+ int val;
+
+ ret = kstrtoint(buf, 0, &val);
+ if (ret)
+ return ret;
+
+ if (val < 0 || val > 1)
+ return -EINVAL;
+
+ if (val)
+ static_branch_enable(&fuse_iomap_debug);
+ else
+ static_branch_disable(&fuse_iomap_debug);
+
+ return count;
+}
+
+#define __INIT_KOBJ_ATTR(_name, _mode, _show, _store) \
+{ \
+ .attr = { .name = __stringify(_name), .mode = _mode }, \
+ .show = _show, \
+ .store = _store, \
+}
+
+#define FUSE_ATTR_RW(_name, _show, _store) \
+ static struct kobj_attribute fuse_attr_##_name = \
+ __INIT_KOBJ_ATTR(_name, 0644, _show, _store)
+
+#define FUSE_ATTR_PTR(_name) \
+ (&fuse_attr_##_name.attr)
+
+FUSE_ATTR_RW(debug, fuse_iomap_debug_show, fuse_iomap_debug_store);
+
+static const struct attribute *fuse_iomap_attrs[] = {
+ FUSE_ATTR_PTR(debug),
+ NULL,
+};
+
+int fuse_iomap_sysfs_init(struct kobject *fuse_kobj)
+{
+ int error;
+
+ iomap_kobj = kobject_create_and_add("iomap", fuse_kobj);
+ if (!iomap_kobj)
+ return -ENOMEM;
+
+ error = sysfs_create_files(iomap_kobj, fuse_iomap_attrs);
+ if (error) {
+ kobject_put(iomap_kobj);
+ return error;
+ }
+
+ return 0;
+}
+
+void fuse_iomap_sysfs_cleanup(struct kobject *fuse_kobj)
+{
+ kobject_put(iomap_kobj);
+}
+#endif /* IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG) */
+
bool fuse_iomap_enabled(void)
{
/* Don't let anyone touch iomap until the end of the patchset. */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 82e074642e8e9b..9448a11c828fef 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -2217,8 +2217,14 @@ static int fuse_sysfs_init(void)
if (err)
goto out_fuse_unregister;
+ err = fuse_iomap_sysfs_init(fuse_kobj);
+ if (err)
+ goto out_fuse_connections;
+
return 0;
+ out_fuse_connections:
+ sysfs_remove_mount_point(fuse_kobj, "connections");
out_fuse_unregister:
kobject_put(fuse_kobj);
out_err:
@@ -2227,6 +2233,7 @@ static int fuse_sysfs_init(void)
static void fuse_sysfs_cleanup(void)
{
+ fuse_iomap_sysfs_cleanup(fuse_kobj);
sysfs_remove_mount_point(fuse_kobj, "connections");
kobject_put(fuse_kobj);
}
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 04/23] fuse: move the backing file idr and code into a new source file
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (2 preceding siblings ...)
2025-08-21 0:53 ` [PATCH 03/23] fuse: make debugging configurable at runtime Darrick J. Wong
@ 2025-08-21 0:53 ` Darrick J. Wong
2025-08-21 7:21 ` Amir Goldstein
2025-08-21 0:53 ` [PATCH 05/23] fuse: move the passthrough-specific code back to passthrough.c Darrick J. Wong
` (18 subsequent siblings)
22 siblings, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:53 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
iomap support for fuse is also going to want the ability to attach
backing files to a fuse filesystem. Move the fuse_backing code into a
separate file so that both can use it.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 47 ++++++++-----
fs/fuse/Makefile | 2 -
fs/fuse/backing.c | 174 +++++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/passthrough.c | 158 --------------------------------------------
4 files changed, 203 insertions(+), 178 deletions(-)
create mode 100644 fs/fuse/backing.c
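
Note on the reference counting being moved: fuse_backing_get() only hands
out a reference if the count has not already reached zero, and
fuse_backing_put() frees the object when the last reference goes away. A
minimal userspace sketch of that get/put pattern with C11 atomics follows.
The struct demo_backing_ref and both functions are hypothetical stand-ins;
the kernel uses refcount_t and kfree_rcu(), which this sketch does not
reproduce, and "freed" is a flag here rather than a real free() so the
state can be inspected.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct fuse_backing; only the count matters. */
struct demo_backing_ref {
	atomic_uint count;	/* 0 means the object is being torn down */
	bool freed;		/* set instead of calling free() */
};

/* Take a reference unless the count has already dropped to zero. */
static struct demo_backing_ref *demo_backing_get(struct demo_backing_ref *fb)
{
	unsigned int old;

	if (!fb)
		return NULL;

	old = atomic_load(&fb->count);
	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&fb->count, &old, old + 1))
			return fb;
	}
	return NULL;
}

/* Drop a reference; whoever drops the last one releases the object. */
static void demo_backing_put(struct demo_backing_ref *fb)
{
	if (!fb)
		return;

	assert(atomic_load(&fb->count) > 0);
	if (atomic_fetch_sub(&fb->count, 1) == 1)
		fb->freed = true;
}
```

The "not zero" test is what makes a lookup that races with the final put
safe: a lookup that finds a dying object gets NULL instead of resurrecting
it.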
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 2cd9f4cdc6a7ef..2be2cbdf060536 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1535,29 +1535,11 @@ struct fuse_file *fuse_file_open(struct fuse_mount *fm, u64 nodeid,
void fuse_file_release(struct inode *inode, struct fuse_file *ff,
unsigned int open_flags, fl_owner_t id, bool isdir);
-/* passthrough.c */
-static inline struct fuse_backing *fuse_inode_backing(struct fuse_inode *fi)
-{
-#ifdef CONFIG_FUSE_PASSTHROUGH
- return READ_ONCE(fi->fb);
-#else
- return NULL;
-#endif
-}
-
-static inline struct fuse_backing *fuse_inode_backing_set(struct fuse_inode *fi,
- struct fuse_backing *fb)
-{
-#ifdef CONFIG_FUSE_PASSTHROUGH
- return xchg(&fi->fb, fb);
-#else
- return NULL;
-#endif
-}
-
+/* backing.c */
#ifdef CONFIG_FUSE_PASSTHROUGH
struct fuse_backing *fuse_backing_get(struct fuse_backing *fb);
void fuse_backing_put(struct fuse_backing *fb);
+struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, int backing_id);
#else
static inline struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
@@ -1568,6 +1550,11 @@ static inline struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
static inline void fuse_backing_put(struct fuse_backing *fb)
{
}
+static inline struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc,
+ int backing_id)
+{
+ return NULL;
+}
#endif
void fuse_backing_files_init(struct fuse_conn *fc);
@@ -1575,6 +1562,26 @@ void fuse_backing_files_free(struct fuse_conn *fc);
int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map);
int fuse_backing_close(struct fuse_conn *fc, int backing_id);
+/* passthrough.c */
+static inline struct fuse_backing *fuse_inode_backing(struct fuse_inode *fi)
+{
+#ifdef CONFIG_FUSE_PASSTHROUGH
+ return READ_ONCE(fi->fb);
+#else
+ return NULL;
+#endif
+}
+
+static inline struct fuse_backing *fuse_inode_backing_set(struct fuse_inode *fi,
+ struct fuse_backing *fb)
+{
+#ifdef CONFIG_FUSE_PASSTHROUGH
+ return xchg(&fi->fb, fb);
+#else
+ return NULL;
+#endif
+}
+
struct fuse_backing *fuse_passthrough_open(struct file *file,
struct inode *inode,
int backing_id);
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 70709a7a3f9523..c79f786d0c90c3 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -14,7 +14,7 @@ fuse-y := trace.o # put trace.o first so we see ftrace errors sooner
fuse-y += dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
fuse-y += iomode.o
fuse-$(CONFIG_FUSE_DAX) += dax.o
-fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
+fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o backing.o
fuse-$(CONFIG_SYSCTL) += sysctl.o
fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o
diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
new file mode 100644
index 00000000000000..ddb23b7400fc72
--- /dev/null
+++ b/fs/fuse/backing.c
@@ -0,0 +1,174 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * FUSE passthrough to backing file.
+ *
+ * Copyright (c) 2023 CTERA Networks.
+ */
+
+#include "fuse_i.h"
+
+#include <linux/file.h>
+
+struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
+{
+ if (fb && refcount_inc_not_zero(&fb->count))
+ return fb;
+ return NULL;
+}
+
+static void fuse_backing_free(struct fuse_backing *fb)
+{
+ pr_debug("%s: fb=0x%p\n", __func__, fb);
+
+ if (fb->file)
+ fput(fb->file);
+ put_cred(fb->cred);
+ kfree_rcu(fb, rcu);
+}
+
+void fuse_backing_put(struct fuse_backing *fb)
+{
+ if (fb && refcount_dec_and_test(&fb->count))
+ fuse_backing_free(fb);
+}
+
+void fuse_backing_files_init(struct fuse_conn *fc)
+{
+ idr_init(&fc->backing_files_map);
+}
+
+static int fuse_backing_id_alloc(struct fuse_conn *fc, struct fuse_backing *fb)
+{
+ int id;
+
+ idr_preload(GFP_KERNEL);
+ spin_lock(&fc->lock);
+ /* FIXME: xarray might be space inefficient */
+ id = idr_alloc_cyclic(&fc->backing_files_map, fb, 1, 0, GFP_ATOMIC);
+ spin_unlock(&fc->lock);
+ idr_preload_end();
+
+ WARN_ON_ONCE(id == 0);
+ return id;
+}
+
+static struct fuse_backing *fuse_backing_id_remove(struct fuse_conn *fc,
+ int id)
+{
+ struct fuse_backing *fb;
+
+ spin_lock(&fc->lock);
+ fb = idr_remove(&fc->backing_files_map, id);
+ spin_unlock(&fc->lock);
+
+ return fb;
+}
+
+static int fuse_backing_id_free(int id, void *p, void *data)
+{
+ struct fuse_backing *fb = p;
+
+ WARN_ON_ONCE(refcount_read(&fb->count) != 1);
+ fuse_backing_free(fb);
+ return 0;
+}
+
+void fuse_backing_files_free(struct fuse_conn *fc)
+{
+ idr_for_each(&fc->backing_files_map, fuse_backing_id_free, NULL);
+ idr_destroy(&fc->backing_files_map);
+}
+
+int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
+{
+ struct file *file;
+ struct super_block *backing_sb;
+ struct fuse_backing *fb = NULL;
+ int res;
+
+ pr_debug("%s: fd=%d flags=0x%x\n", __func__, map->fd, map->flags);
+
+ /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
+ res = -EPERM;
+ if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
+ goto out;
+
+ res = -EINVAL;
+ if (map->flags || map->padding)
+ goto out;
+
+ file = fget_raw(map->fd);
+ res = -EBADF;
+ if (!file)
+ goto out;
+
+ backing_sb = file_inode(file)->i_sb;
+ res = -ELOOP;
+ if (backing_sb->s_stack_depth >= fc->max_stack_depth)
+ goto out_fput;
+
+ fb = kmalloc(sizeof(struct fuse_backing), GFP_KERNEL);
+ res = -ENOMEM;
+ if (!fb)
+ goto out_fput;
+
+ fb->file = file;
+ fb->cred = prepare_creds();
+ refcount_set(&fb->count, 1);
+
+ res = fuse_backing_id_alloc(fc, fb);
+ if (res < 0) {
+ fuse_backing_free(fb);
+ fb = NULL;
+ }
+
+out:
+ pr_debug("%s: fb=0x%p, ret=%i\n", __func__, fb, res);
+
+ return res;
+
+out_fput:
+ fput(file);
+ goto out;
+}
+
+int fuse_backing_close(struct fuse_conn *fc, int backing_id)
+{
+ struct fuse_backing *fb = NULL;
+ int err;
+
+ pr_debug("%s: backing_id=%d\n", __func__, backing_id);
+
+ /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
+ err = -EPERM;
+ if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
+ goto out;
+
+ err = -EINVAL;
+ if (backing_id <= 0)
+ goto out;
+
+ err = -ENOENT;
+ fb = fuse_backing_id_remove(fc, backing_id);
+ if (!fb)
+ goto out;
+
+ fuse_backing_put(fb);
+ err = 0;
+out:
+ pr_debug("%s: fb=0x%p, err=%i\n", __func__, fb, err);
+
+ return err;
+}
+
+struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, int backing_id)
+{
+ struct fuse_backing *fb;
+
+ rcu_read_lock();
+ fb = idr_find(&fc->backing_files_map, backing_id);
+ fb = fuse_backing_get(fb);
+ rcu_read_unlock();
+
+ return fb;
+}
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
index 607ef735ad4ab3..e0b8d885bc81f3 100644
--- a/fs/fuse/passthrough.c
+++ b/fs/fuse/passthrough.c
@@ -144,158 +144,6 @@ ssize_t fuse_passthrough_mmap(struct file *file, struct vm_area_struct *vma)
return backing_file_mmap(backing_file, vma, &ctx);
}
-struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
-{
- if (fb && refcount_inc_not_zero(&fb->count))
- return fb;
- return NULL;
-}
-
-static void fuse_backing_free(struct fuse_backing *fb)
-{
- pr_debug("%s: fb=0x%p\n", __func__, fb);
-
- if (fb->file)
- fput(fb->file);
- put_cred(fb->cred);
- kfree_rcu(fb, rcu);
-}
-
-void fuse_backing_put(struct fuse_backing *fb)
-{
- if (fb && refcount_dec_and_test(&fb->count))
- fuse_backing_free(fb);
-}
-
-void fuse_backing_files_init(struct fuse_conn *fc)
-{
- idr_init(&fc->backing_files_map);
-}
-
-static int fuse_backing_id_alloc(struct fuse_conn *fc, struct fuse_backing *fb)
-{
- int id;
-
- idr_preload(GFP_KERNEL);
- spin_lock(&fc->lock);
- /* FIXME: xarray might be space inefficient */
- id = idr_alloc_cyclic(&fc->backing_files_map, fb, 1, 0, GFP_ATOMIC);
- spin_unlock(&fc->lock);
- idr_preload_end();
-
- WARN_ON_ONCE(id == 0);
- return id;
-}
-
-static struct fuse_backing *fuse_backing_id_remove(struct fuse_conn *fc,
- int id)
-{
- struct fuse_backing *fb;
-
- spin_lock(&fc->lock);
- fb = idr_remove(&fc->backing_files_map, id);
- spin_unlock(&fc->lock);
-
- return fb;
-}
-
-static int fuse_backing_id_free(int id, void *p, void *data)
-{
- struct fuse_backing *fb = p;
-
- WARN_ON_ONCE(refcount_read(&fb->count) != 1);
- fuse_backing_free(fb);
- return 0;
-}
-
-void fuse_backing_files_free(struct fuse_conn *fc)
-{
- idr_for_each(&fc->backing_files_map, fuse_backing_id_free, NULL);
- idr_destroy(&fc->backing_files_map);
-}
-
-int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
-{
- struct file *file;
- struct super_block *backing_sb;
- struct fuse_backing *fb = NULL;
- int res;
-
- pr_debug("%s: fd=%d flags=0x%x\n", __func__, map->fd, map->flags);
-
- /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
- res = -EPERM;
- if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
- goto out;
-
- res = -EINVAL;
- if (map->flags || map->padding)
- goto out;
-
- file = fget_raw(map->fd);
- res = -EBADF;
- if (!file)
- goto out;
-
- backing_sb = file_inode(file)->i_sb;
- res = -ELOOP;
- if (backing_sb->s_stack_depth >= fc->max_stack_depth)
- goto out_fput;
-
- fb = kmalloc(sizeof(struct fuse_backing), GFP_KERNEL);
- res = -ENOMEM;
- if (!fb)
- goto out_fput;
-
- fb->file = file;
- fb->cred = prepare_creds();
- refcount_set(&fb->count, 1);
-
- res = fuse_backing_id_alloc(fc, fb);
- if (res < 0) {
- fuse_backing_free(fb);
- fb = NULL;
- }
-
-out:
- pr_debug("%s: fb=0x%p, ret=%i\n", __func__, fb, res);
-
- return res;
-
-out_fput:
- fput(file);
- goto out;
-}
-
-int fuse_backing_close(struct fuse_conn *fc, int backing_id)
-{
- struct fuse_backing *fb = NULL;
- int err;
-
- pr_debug("%s: backing_id=%d\n", __func__, backing_id);
-
- /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
- err = -EPERM;
- if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
- goto out;
-
- err = -EINVAL;
- if (backing_id <= 0)
- goto out;
-
- err = -ENOENT;
- fb = fuse_backing_id_remove(fc, backing_id);
- if (!fb)
- goto out;
-
- fuse_backing_put(fb);
- err = 0;
-out:
- pr_debug("%s: fb=0x%p, err=%i\n", __func__, fb, err);
-
- return err;
-}
-
/*
* Setup passthrough to a backing file.
*
@@ -315,12 +163,8 @@ struct fuse_backing *fuse_passthrough_open(struct file *file,
if (backing_id <= 0)
goto out;
- rcu_read_lock();
- fb = idr_find(&fc->backing_files_map, backing_id);
- fb = fuse_backing_get(fb);
- rcu_read_unlock();
-
err = -ENOENT;
+ fb = fuse_backing_lookup(fc, backing_id);
if (!fb)
goto out;
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 05/23] fuse: move the passthrough-specific code back to passthrough.c
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (3 preceding siblings ...)
2025-08-21 0:53 ` [PATCH 04/23] fuse: move the backing file idr and code into a new source file Darrick J. Wong
@ 2025-08-21 0:53 ` Darrick J. Wong
2025-08-21 9:05 ` Amir Goldstein
2025-08-21 0:54 ` [PATCH 06/23] fuse: add an ioctl to add new iomap devices Darrick J. Wong
` (17 subsequent siblings)
22 siblings, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:53 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
In preparation for iomap, move the passthrough-specific validation code
back to passthrough.c and create a new Kconfig item for conditional
compilation of backing.c. In the next patch, iomap will share the
backing structures.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 14 ++++++
fs/fuse/fuse_trace.h | 35 ++++++++++++++++
fs/fuse/Kconfig | 4 ++
fs/fuse/Makefile | 3 +
fs/fuse/backing.c | 106 +++++++++++++++++++++++++++++++++++++------------
fs/fuse/dev.c | 4 +-
fs/fuse/inode.c | 4 +-
fs/fuse/passthrough.c | 28 +++++++++++++
8 files changed, 165 insertions(+), 33 deletions(-)
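
Note on the _backing_open contract introduced here: each subsystem may
take a reference and return 0, return 0 without taking one, or return an
errno; the file is installed only if somebody took a reference, and EPERM
is the documented default otherwise. The sketch below models that contract
in userspace. It is an illustration of the contract as described in the
comment, not of the exact control flow in fuse_backing_open(). The hook
array is an assumption (this patch only calls the passthrough hook), and
every demo_* name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct fuse_backing. */
struct demo_fb {
	unsigned int count;	/* starts at 1: the reference held by the caller */
};

/* A subsystem hook: may take a reference; returns 0 or a negative errno. */
typedef int (*demo_open_hook)(struct demo_fb *fb);

/*
 * Offer @fb to each subsystem. Returns 0 if at least one of them took a
 * reference; otherwise the first error reported, or -EPERM if none was.
 */
static int demo_backing_offer(struct demo_fb *fb,
			      const demo_open_hook *hooks, size_t nr_hooks)
{
	int first_err = 0;
	size_t i;

	assert(fb != NULL && fb->count == 1);

	for (i = 0; i < nr_hooks; i++) {
		int err = hooks[i](fb);

		if (err && !first_err)
			first_err = err;
	}

	if (fb->count >= 2)
		return 0;		/* somebody wants the file */
	return first_err ? first_err : -EPERM;
}

/* Outcome 1: take a reference and return 0. */
static int demo_hook_takes_ref(struct demo_fb *fb)
{
	fb->count++;
	return 0;
}

/* Outcome 2: not interested, no reference taken, no error. */
static int demo_hook_not_interested(struct demo_fb *fb)
{
	(void)fb;
	return 0;
}

/* Outcome 3: refuse with an errno explaining why. */
static int demo_hook_refuses(struct demo_fb *fb)
{
	(void)fb;
	return -ELOOP;
}
```

In this model one willing subsystem is enough for success even if another
one refuses, which is the behaviour the comment in fuse_backing_open()
describes.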
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 2be2cbdf060536..1762517a1b99c8 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -958,7 +958,7 @@ struct fuse_conn {
/* New writepages go into this bucket */
struct fuse_sync_bucket __rcu *curr_bucket;
-#ifdef CONFIG_FUSE_PASSTHROUGH
+#ifdef CONFIG_FUSE_BACKING
/** IDR for backing files ids */
struct idr backing_files_map;
#endif
@@ -1536,7 +1536,7 @@ void fuse_file_release(struct inode *inode, struct fuse_file *ff,
unsigned int open_flags, fl_owner_t id, bool isdir);
/* backing.c */
-#ifdef CONFIG_FUSE_PASSTHROUGH
+#ifdef CONFIG_FUSE_BACKING
struct fuse_backing *fuse_backing_get(struct fuse_backing *fb);
void fuse_backing_put(struct fuse_backing *fb);
struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, int backing_id);
@@ -1596,6 +1596,16 @@ static inline struct file *fuse_file_passthrough(struct fuse_file *ff)
#endif
}
+#ifdef CONFIG_FUSE_PASSTHROUGH
+int fuse_passthrough_backing_open(struct fuse_conn *fc,
+ struct fuse_backing *fb);
+int fuse_passthrough_backing_close(struct fuse_conn *fc,
+ struct fuse_backing *fb);
+#else
+# define fuse_passthrough_backing_open(...) (-EOPNOTSUPP)
+# define fuse_passthrough_backing_close(...) (-EOPNOTSUPP)
+#endif
+
ssize_t fuse_passthrough_read_iter(struct kiocb *iocb, struct iov_iter *iter);
ssize_t fuse_passthrough_write_iter(struct kiocb *iocb, struct iov_iter *iter);
ssize_t fuse_passthrough_splice_read(struct file *in, loff_t *ppos,
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 2389072b734636..660d9b5206a175 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -174,6 +174,41 @@ TRACE_EVENT(fuse_request_end,
__entry->unique, __entry->len, __entry->error)
);
+#ifdef CONFIG_FUSE_BACKING
+TRACE_EVENT(fuse_backing_class,
+ TP_PROTO(const struct fuse_conn *fc, unsigned int idx,
+ const struct fuse_backing *fb),
+
+ TP_ARGS(fc, idx, fb),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(unsigned int, idx)
+ __field(unsigned long, ino)
+ ),
+
+ TP_fast_assign(
+ struct inode *inode = file_inode(fb->file);
+
+ __entry->connection = fc->dev;
+ __entry->idx = idx;
+ __entry->ino = inode->i_ino;
+ ),
+
+ TP_printk("connection %u idx %u ino 0x%lx",
+ __entry->connection,
+ __entry->idx,
+ __entry->ino)
+);
+#define DEFINE_FUSE_BACKING_EVENT(name) \
+DEFINE_EVENT(fuse_backing_class, name, \
+ TP_PROTO(const struct fuse_conn *fc, unsigned int idx, \
+ const struct fuse_backing *fb), \
+ TP_ARGS(fc, idx, fb))
+DEFINE_FUSE_BACKING_EVENT(fuse_backing_open);
+DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
+#endif
+
#if IS_ENABLED(CONFIG_FUSE_IOMAP)
/* tracepoint boilerplate so we don't have to keep doing this */
diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index 6be74396ef5198..ebb9a2d76b532e 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -59,12 +59,16 @@ config FUSE_PASSTHROUGH
default y
depends on FUSE_FS
select FS_STACK
+ select FUSE_BACKING
help
This allows bypassing FUSE server by mapping specific FUSE operations
to be performed directly on a backing file.
If you want to allow passthrough operations, answer Y.
+config FUSE_BACKING
+ bool
+
config FUSE_IOMAP
bool "FUSE file IO over iomap"
default y
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index c79f786d0c90c3..27be39317701d6 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -14,7 +14,8 @@ fuse-y := trace.o # put trace.o first so we see ftrace errors sooner
fuse-y += dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
fuse-y += iomode.o
fuse-$(CONFIG_FUSE_DAX) += dax.o
-fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o backing.o
+fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
+fuse-$(CONFIG_FUSE_BACKING) += backing.o
fuse-$(CONFIG_SYSCTL) += sysctl.o
fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o
diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
index ddb23b7400fc72..c128bed95a76b8 100644
--- a/fs/fuse/backing.c
+++ b/fs/fuse/backing.c
@@ -6,6 +6,7 @@
*/
#include "fuse_i.h"
+#include "fuse_trace.h"
#include <linux/file.h>
@@ -81,16 +82,14 @@ void fuse_backing_files_free(struct fuse_conn *fc)
int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
{
- struct file *file;
- struct super_block *backing_sb;
+ struct file *file = NULL;
struct fuse_backing *fb = NULL;
- int res;
+ int res, passthrough_res;
pr_debug("%s: fd=%d flags=0x%x\n", __func__, map->fd, map->flags);
- /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
res = -EPERM;
- if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
+ if (!fc->passthrough)
goto out;
res = -EINVAL;
@@ -102,46 +101,68 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
if (!file)
goto out;
- backing_sb = file_inode(file)->i_sb;
- res = -ELOOP;
- if (backing_sb->s_stack_depth >= fc->max_stack_depth)
- goto out_fput;
-
fb = kmalloc(sizeof(struct fuse_backing), GFP_KERNEL);
res = -ENOMEM;
if (!fb)
- goto out_fput;
+ goto out_file;
+ /* fb now owns file */
fb->file = file;
+ file = NULL;
fb->cred = prepare_creds();
refcount_set(&fb->count, 1);
+ /*
+ * Each _backing_open function should either:
+ *
+ * 1. Take a ref to fb if it wants the file and return 0.
+ * 2. Return 0 without taking a ref if the backing file isn't needed.
+ * 3. Return an errno explaining why it couldn't attach.
+ *
+ * If at least one subsystem bumps the reference count to open it,
+ * we'll install it into the index and return the index. If nobody
+ * opens the file, the error code will be passed up. EPERM is the
+ * default.
+ */
+ passthrough_res = fuse_passthrough_backing_open(fc, fb);
+
+ if (refcount_read(&fb->count) < 2) {
+ if (passthrough_res)
+ res = passthrough_res;
+ if (!res)
+ res = -EPERM;
+ goto out_fb;
+ }
+
res = fuse_backing_id_alloc(fc, fb);
- if (res < 0) {
- fuse_backing_free(fb);
- fb = NULL;
- }
+ if (res < 0)
+ goto out_fb;
+
+ trace_fuse_backing_open(fc, res, fb);
-out:
pr_debug("%s: fb=0x%p, ret=%i\n", __func__, fb, res);
-
+ fuse_backing_put(fb);
return res;
-out_fput:
- fput(file);
- goto out;
+out_fb:
+ fuse_backing_free(fb);
+out_file:
+ if (file)
+ fput(file);
+out:
+ pr_debug("%s: ret=%i\n", __func__, res);
+ return res;
}
int fuse_backing_close(struct fuse_conn *fc, int backing_id)
{
- struct fuse_backing *fb = NULL;
- int err;
+ struct fuse_backing *fb = NULL, *test_fb;
+ int err, passthrough_err;
pr_debug("%s: backing_id=%d\n", __func__, backing_id);
- /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
err = -EPERM;
- if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
+ if (!fc->passthrough)
goto out;
err = -EINVAL;
@@ -149,12 +170,45 @@ int fuse_backing_close(struct fuse_conn *fc, int backing_id)
goto out;
err = -ENOENT;
- fb = fuse_backing_id_remove(fc, backing_id);
+ fb = fuse_backing_lookup(fc, backing_id);
if (!fb)
goto out;
+ /*
+ * Each _backing_close function should either:
+ *
+ * 1. Release the ref that it took in _backing_open and return 0.
+ * 2. Don't release the ref if the backing file is busy, and return 0.
+ * 3. Return an errno explaining why it couldn't detach.
+ *
+ * If there are no more active references to the backing file, it will
+ * be closed and removed from the index. If there are still active
+ * references to the backing file other than the one we just took, the
+ * error code will be passed up. EBUSY is the default.
+ */
+ passthrough_err = fuse_passthrough_backing_close(fc, fb);
+
+ if (refcount_read(&fb->count) > 1) {
+ if (passthrough_err)
+ err = passthrough_err;
+ if (!err)
+ err = -EBUSY;
+ goto out_fb;
+ }
+
+ trace_fuse_backing_close(fc, backing_id, fb);
+
+ err = -ENOENT;
+ test_fb = fuse_backing_id_remove(fc, backing_id);
+ if (!test_fb)
+ goto out_fb;
+
+ WARN_ON(fb != test_fb);
+ pr_debug("%s: fb=0x%p, err=0\n", __func__, fb);
+ fuse_backing_put(fb);
+ return 0;
+out_fb:
fuse_backing_put(fb);
- err = 0;
out:
pr_debug("%s: fb=0x%p, err=%i\n", __func__, fb, err);
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index dbde17fff0cda9..31d9f006836ac1 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2623,7 +2623,7 @@ static long fuse_dev_ioctl_backing_open(struct file *file,
if (!fud)
return -EPERM;
- if (!IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
+ if (!IS_ENABLED(CONFIG_FUSE_BACKING))
return -EOPNOTSUPP;
if (copy_from_user(&map, argp, sizeof(map)))
@@ -2640,7 +2640,7 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
if (!fud)
return -EPERM;
- if (!IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
+ if (!IS_ENABLED(CONFIG_FUSE_BACKING))
return -EOPNOTSUPP;
if (get_user(backing_id, argp))
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 9448a11c828fef..1f3f91981410aa 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -993,7 +993,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
fc->name_max = FUSE_NAME_LOW_MAX;
fc->timeout.req_timeout = 0;
- if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
+ if (IS_ENABLED(CONFIG_FUSE_BACKING))
fuse_backing_files_init(fc);
INIT_LIST_HEAD(&fc->mounts);
@@ -1030,7 +1030,7 @@ void fuse_conn_put(struct fuse_conn *fc)
WARN_ON(atomic_read(&bucket->count) != 1);
kfree(bucket);
}
- if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
+ if (IS_ENABLED(CONFIG_FUSE_BACKING))
fuse_backing_files_free(fc);
call_rcu(&fc->rcu, delayed_release);
}
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
index e0b8d885bc81f3..dfc61cc4bd21af 100644
--- a/fs/fuse/passthrough.c
+++ b/fs/fuse/passthrough.c
@@ -197,3 +197,31 @@ void fuse_passthrough_release(struct fuse_file *ff, struct fuse_backing *fb)
put_cred(ff->cred);
ff->cred = NULL;
}
+
+int fuse_passthrough_backing_open(struct fuse_conn *fc,
+ struct fuse_backing *fb)
+{
+ struct super_block *backing_sb;
+
+ /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ backing_sb = file_inode(fb->file)->i_sb;
+ if (backing_sb->s_stack_depth >= fc->max_stack_depth)
+ return -ELOOP;
+
+ fuse_backing_get(fb);
+ return 0;
+}
+
+int fuse_passthrough_backing_close(struct fuse_conn *fc,
+ struct fuse_backing *fb)
+{
+ /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ fuse_backing_put(fb);
+ return 0;
+}
* [PATCH 06/23] fuse: add an ioctl to add new iomap devices
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (4 preceding siblings ...)
2025-08-21 0:53 ` [PATCH 05/23] fuse: move the passthrough-specific code back to passthrough.c Darrick J. Wong
@ 2025-08-21 0:54 ` Darrick J. Wong
2025-08-21 8:09 ` Amir Goldstein
2025-08-21 0:54 ` [PATCH 07/23] fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount Darrick J. Wong
` (16 subsequent siblings)
22 siblings, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:54 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Add an ioctl that allows fuse servers to register block devices for use
with iomap. This is (for now) separate from the backing file open/close
ioctl (despite using the same struct) to keep the codepaths separate.
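For server authors, one visible consequence of this patch is that
fuse_iomap_backing_open() returns -ENODEV for anything that is not a block
device. A minimal userspace sketch of a pre-check that a server might run
before handing an fd to the kernel follows. The helper names
fd_is_block_device() and fd_is_block_device_selfcheck() are hypothetical;
they are not part of this series or of libfuse.

```c
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: returns 1 if fd refers to a block device, 0 if it
 * refers to something else, or a negative errno if fstat() fails.
 */
static int fd_is_block_device(int fd)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return -errno;

	/* Same S_ISBLK test that the patch applies to the backing inode */
	return S_ISBLK(st.st_mode) ? 1 : 0;
}

/* Example use: an invalid fd is an error, not "not a block device" */
static void fd_is_block_device_selfcheck(void)
{
	assert(fd_is_block_device(-1) == -EBADF);
}
```

Keeping the error case distinct from the "not a block device" case lets
the server report a bad fd separately from a wrong file type.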
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 9 +++++
fs/fuse/fuse_trace.h | 49 ++++++++++++++++++++++++++-
fs/fuse/Kconfig | 1 +
fs/fuse/backing.c | 19 ++++++++---
fs/fuse/file_iomap.c | 88 ++++++++++++++++++++++++++++++++++++++++++++-----
fs/fuse/passthrough.c | 13 +++++++
fs/fuse/trace.c | 1 +
7 files changed, 163 insertions(+), 17 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 1762517a1b99c8..f4834a02d16c98 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -100,6 +100,10 @@ struct fuse_submount_lookup {
struct fuse_backing {
struct file *file;
struct cred *cred;
+ struct block_device *bdev;
+
+ unsigned int passthrough:1;
+ unsigned int iomap:1;
/** refcount */
refcount_t count;
@@ -1639,9 +1643,14 @@ static inline bool fuse_has_iomap(const struct inode *inode)
{
return get_fuse_conn_c(inode)->iomap;
}
+
+int fuse_iomap_backing_open(struct fuse_conn *fc, struct fuse_backing *fb);
+int fuse_iomap_backing_close(struct fuse_conn *fc, struct fuse_backing *fb);
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
+# define fuse_iomap_backing_open(...) (-EOPNOTSUPP)
+# define fuse_iomap_backing_close(...) (-EOPNOTSUPP)
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 660d9b5206a175..c3671a605a32f6 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -175,6 +175,13 @@ TRACE_EVENT(fuse_request_end,
);
#ifdef CONFIG_FUSE_BACKING
+#define FUSE_BACKING_PASSTHROUGH (1U << 0)
+#define FUSE_BACKING_IOMAP (1U << 1)
+
+#define FUSE_BACKING_FLAG_STRINGS \
+ { FUSE_BACKING_PASSTHROUGH, "pass" }, \
+ { FUSE_BACKING_IOMAP, "iomap" }
+
TRACE_EVENT(fuse_backing_class,
TP_PROTO(const struct fuse_conn *fc, unsigned int idx,
const struct fuse_backing *fb),
@@ -184,7 +191,9 @@ TRACE_EVENT(fuse_backing_class,
TP_STRUCT__entry(
__field(dev_t, connection)
__field(unsigned int, idx)
+ __field(unsigned int, flags)
__field(unsigned long, ino)
+ __field(dev_t, rdev)
),
TP_fast_assign(
@@ -193,12 +202,23 @@ TRACE_EVENT(fuse_backing_class,
__entry->connection = fc->dev;
__entry->idx = idx;
__entry->ino = inode->i_ino;
+ __entry->flags = 0;
+ if (fb->passthrough)
+ __entry->flags |= FUSE_BACKING_PASSTHROUGH;
+ if (fb->iomap) {
+ __entry->rdev = inode->i_rdev;
+ __entry->flags |= FUSE_BACKING_IOMAP;
+ } else {
+ __entry->rdev = 0;
+ }
),
- TP_printk("connection %u idx %u ino 0x%lx",
+ TP_printk("connection %u idx %u flags (%s) ino 0x%lx rdev %u:%u",
__entry->connection,
__entry->idx,
- __entry->ino)
+ __print_flags(__entry->flags, "|", FUSE_BACKING_FLAG_STRINGS),
+ __entry->ino,
+ MAJOR(__entry->rdev), MINOR(__entry->rdev))
);
#define DEFINE_FUSE_BACKING_EVENT(name) \
DEFINE_EVENT(fuse_backing_class, name, \
@@ -210,7 +230,6 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
#endif
#if IS_ENABLED(CONFIG_FUSE_IOMAP)
-
/* tracepoint boilerplate so we don't have to keep doing this */
#define FUSE_IOMAP_OPFLAGS_FIELD \
__field(unsigned, opflags)
@@ -452,6 +471,30 @@ TRACE_EVENT(fuse_iomap_end_error,
__entry->written,
__entry->error)
);
+
+TRACE_EVENT(fuse_iomap_dev_add,
+ TP_PROTO(const struct fuse_conn *fc,
+ const struct fuse_backing_map *map),
+
+ TP_ARGS(fc, map),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(int, fd)
+ __field(unsigned int, flags)
+ ),
+
+ TP_fast_assign(
+ __entry->connection = fc->dev;
+ __entry->fd = map->fd;
+ __entry->flags = map->flags;
+ ),
+
+ TP_printk("connection %u fd %d flags 0x%x",
+ __entry->connection,
+ __entry->fd,
+ __entry->flags)
+);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index ebb9a2d76b532e..1ab3d3604c07d0 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -75,6 +75,7 @@ config FUSE_IOMAP
depends on FUSE_FS
depends on BLOCK
select FS_IOMAP
+ select FUSE_BACKING
help
For supported fuseblk servers, this allows the file IO path to run
through the kernel.
diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
index c128bed95a76b8..c63990254649ca 100644
--- a/fs/fuse/backing.c
+++ b/fs/fuse/backing.c
@@ -67,16 +67,19 @@ static struct fuse_backing *fuse_backing_id_remove(struct fuse_conn *fc,
static int fuse_backing_id_free(int id, void *p, void *data)
{
+ struct fuse_conn *fc = data;
struct fuse_backing *fb = p;
WARN_ON_ONCE(refcount_read(&fb->count) != 1);
+
+ trace_fuse_backing_close(fc, id, fb);
fuse_backing_free(fb);
return 0;
}
void fuse_backing_files_free(struct fuse_conn *fc)
{
- idr_for_each(&fc->backing_files_map, fuse_backing_id_free, NULL);
+ idr_for_each(&fc->backing_files_map, fuse_backing_id_free, fc);
idr_destroy(&fc->backing_files_map);
}
@@ -84,12 +87,12 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
{
struct file *file = NULL;
struct fuse_backing *fb = NULL;
- int res, passthrough_res;
+ int res, passthrough_res, iomap_res;
pr_debug("%s: fd=%d flags=0x%x\n", __func__, map->fd, map->flags);
res = -EPERM;
- if (!fc->passthrough)
+ if (!fc->passthrough && !fc->iomap)
goto out;
res = -EINVAL;
@@ -125,10 +128,13 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
* default.
*/
passthrough_res = fuse_passthrough_backing_open(fc, fb);
+ iomap_res = fuse_iomap_backing_open(fc, fb);
if (refcount_read(&fb->count) < 2) {
if (passthrough_res)
res = passthrough_res;
+ if (!res && iomap_res)
+ res = iomap_res;
if (!res)
res = -EPERM;
goto out_fb;
@@ -157,12 +163,12 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
int fuse_backing_close(struct fuse_conn *fc, int backing_id)
{
struct fuse_backing *fb = NULL, *test_fb;
- int err, passthrough_err;
+ int err, passthrough_err, iomap_err;
pr_debug("%s: backing_id=%d\n", __func__, backing_id);
err = -EPERM;
- if (!fc->passthrough)
+ if (!fc->passthrough && !fc->iomap)
goto out;
err = -EINVAL;
@@ -187,10 +193,13 @@ int fuse_backing_close(struct fuse_conn *fc, int backing_id)
* error code will be passed up. EBUSY is the default.
*/
passthrough_err = fuse_passthrough_backing_close(fc, fb);
+ iomap_err = fuse_iomap_backing_close(fc, fb);
if (refcount_read(&fb->count) > 1) {
if (passthrough_err)
err = passthrough_err;
+ if (!err && iomap_err)
+ err = iomap_err;
if (!err)
err = -EBUSY;
goto out_fb;
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index fad5457d669baf..154c99399f48d2 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -319,10 +319,6 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
return false;
}
- /* XXX: we don't support devices yet */
- if (BAD_DATA(map->dev != FUSE_IOMAP_DEV_NULL))
- return false;
-
/* No overflows in the device range, if supplied */
if (map->addr != FUSE_IOMAP_NULL_ADDR &&
BAD_DATA(check_add_overflow(map->addr, map->length, &end)))
@@ -334,6 +330,7 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
/* Convert a mapping from the server into something the kernel can use */
static inline void fuse_iomap_from_server(struct inode *inode,
struct iomap *iomap,
+ const struct fuse_backing *fb,
const struct fuse_iomap_io *fmap)
{
iomap->addr = fmap->addr;
@@ -341,7 +338,9 @@ static inline void fuse_iomap_from_server(struct inode *inode,
iomap->length = fmap->length;
iomap->type = fuse_iomap_type_from_server(fmap->type);
iomap->flags = fuse_iomap_flags_from_server(fmap->flags);
- iomap->bdev = inode->i_sb->s_bdev; /* XXX */
+
+ iomap->bdev = fb ? fb->bdev : NULL;
+ iomap->dax_dev = NULL;
}
/* Convert a mapping from the server into something the kernel can use */
@@ -392,6 +391,32 @@ static inline bool fuse_is_iomap_file_write(unsigned int opflags)
return opflags & (IOMAP_WRITE | IOMAP_ZERO | IOMAP_UNSHARE);
}
+static inline struct fuse_backing *
+fuse_iomap_find_dev(struct fuse_conn *fc, const struct fuse_iomap_io *map)
+{
+ struct fuse_backing *ret = NULL;
+
+ if (map->dev != FUSE_IOMAP_DEV_NULL && map->dev < INT_MAX)
+ ret = fuse_backing_lookup(fc, map->dev);
+
+ switch (map->type) {
+ case FUSE_IOMAP_TYPE_MAPPED:
+ case FUSE_IOMAP_TYPE_UNWRITTEN:
+ /* Mappings backed by space must have a device/addr */
+ if (BAD_DATA(ret == NULL))
+ return ERR_PTR(-EFSCORRUPTED);
+ break;
+ }
+
+ /* Must be one of our open devices */
+ if (ret && BAD_DATA(ret->iomap == 0)) {
+ fuse_backing_put(ret);
+ return ERR_PTR(-EFSCORRUPTED);
+ }
+
+ return ret;
+}
+
static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
unsigned opflags, struct iomap *iomap,
struct iomap *srcmap)
@@ -405,6 +430,8 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
};
struct fuse_iomap_begin_out outarg = { };
struct fuse_mount *fm = get_fuse_mount(inode);
+ struct fuse_backing *read_dev = NULL;
+ struct fuse_backing *write_dev = NULL;
FUSE_ARGS(args);
int err;
@@ -431,24 +458,44 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
if (err)
return err;
+ read_dev = fuse_iomap_find_dev(fm->fc, &outarg.read);
+ if (IS_ERR(read_dev))
+ return PTR_ERR(read_dev);
+
if (fuse_is_iomap_file_write(opflags) &&
outarg.write.type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) {
+ /* open the write device */
+ write_dev = fuse_iomap_find_dev(fm->fc, &outarg.write);
+ if (IS_ERR(write_dev)) {
+ err = PTR_ERR(write_dev);
+ goto out_read_dev;
+ }
+
/*
* For an out of place write, we must supply the write mapping
* via @iomap, and the read mapping via @srcmap.
*/
- fuse_iomap_from_server(inode, iomap, &outarg.write);
- fuse_iomap_from_server(inode, srcmap, &outarg.read);
+ fuse_iomap_from_server(inode, iomap, write_dev, &outarg.write);
+ fuse_iomap_from_server(inode, srcmap, read_dev, &outarg.read);
} else {
/*
* For everything else (reads, reporting, and pure overwrites),
* we can return the sole mapping through @iomap and leave
* @srcmap unchanged from its default (HOLE).
*/
- fuse_iomap_from_server(inode, iomap, &outarg.read);
+ fuse_iomap_from_server(inode, iomap, read_dev, &outarg.read);
}
- return 0;
+ /*
+ * XXX: if we ever want to support closing devices, we need a way to
+ * track the fuse_backing refcount all the way through bio endios.
+ * For now we put the refcount here because you can't remove an iomap
+ * device until unmount time.
+ */
+ fuse_backing_put(write_dev);
+out_read_dev:
+ fuse_backing_put(read_dev);
+ return err;
}
/* Decide if we send FUSE_IOMAP_END to the fuse server */
@@ -523,3 +570,26 @@ const struct iomap_ops fuse_iomap_ops = {
.iomap_begin = fuse_iomap_begin,
.iomap_end = fuse_iomap_end,
};
+
+int fuse_iomap_backing_open(struct fuse_conn *fc, struct fuse_backing *fb)
+{
+ if (!fc->iomap)
+ return 0;
+
+ if (!S_ISBLK(file_inode(fb->file)->i_mode))
+ return -ENODEV;
+
+ fb->iomap = 1;
+ fb->bdev = I_BDEV(fb->file->f_mapping->host);
+ fuse_backing_get(fb);
+ return 0;
+}
+
+int fuse_iomap_backing_close(struct fuse_conn *fc, struct fuse_backing *fb)
+{
+ if (!fb->iomap)
+ return 0;
+
+ /* We only support closing iomap block devices at unmount */
+ return -EBUSY;
+}
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
index dfc61cc4bd21af..29de6de9f4b59b 100644
--- a/fs/fuse/passthrough.c
+++ b/fs/fuse/passthrough.c
@@ -168,6 +168,11 @@ struct fuse_backing *fuse_passthrough_open(struct file *file,
if (!fb)
goto out;
+ if (!fb->passthrough) {
+ fuse_backing_put(fb);
+ goto out;
+ }
+
/* Allocate backing file per fuse file to store fuse path */
backing_file = backing_file_open(&file->f_path, file->f_flags,
&fb->file->f_path, fb->cred);
@@ -203,6 +208,9 @@ int fuse_passthrough_backing_open(struct fuse_conn *fc,
{
struct super_block *backing_sb;
+ if (!fc->passthrough)
+ return 0;
+
/* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
@@ -211,6 +219,7 @@ int fuse_passthrough_backing_open(struct fuse_conn *fc,
if (backing_sb->s_stack_depth >= fc->max_stack_depth)
return -ELOOP;
+ fb->passthrough = 1;
fuse_backing_get(fb);
return 0;
}
@@ -218,10 +227,14 @@ int fuse_passthrough_backing_open(struct fuse_conn *fc,
int fuse_passthrough_backing_close(struct fuse_conn *fc,
struct fuse_backing *fb)
{
+ if (!fb->passthrough)
+ return 0;
+
/* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
+ fb->passthrough = 0;
fuse_backing_put(fb);
return 0;
}
diff --git a/fs/fuse/trace.c b/fs/fuse/trace.c
index 93bd72efc98cd0..3b54f639a5423e 100644
--- a/fs/fuse/trace.c
+++ b/fs/fuse/trace.c
@@ -6,6 +6,7 @@
#include "dev_uring_i.h"
#include "fuse_i.h"
#include "fuse_dev_i.h"
+#include "iomap_priv.h"
#include <linux/pagemap.h>
* [PATCH 07/23] fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (5 preceding siblings ...)
2025-08-21 0:54 ` [PATCH 06/23] fuse: add an ioctl to add new iomap devices Darrick J. Wong
@ 2025-08-21 0:54 ` Darrick J. Wong
2025-08-21 0:54 ` [PATCH 08/23] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE} Darrick J. Wong
` (15 subsequent siblings)
22 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:54 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
At unmount time, there are a few things that we need to ask the fuse
server to do.
First, we need to flush queued events to userspace to give the fuse
server a chance to process the events. This is how we make sure that
the server processes FUSE_RELEASE events before the connection goes
down.
Second, to ensure that all those metadata updates are persisted to disk
before we tell the fuse server to destroy itself, send FUSE_SYNCFS after
waiting for the queued events.
Finally, we need to send FUSE_DESTROY to the fuse server so that it
closes the filesystem and the device fds before unmount returns. That
way, a script that does something like "umount /dev/sda ; e2fsck -fn
/dev/sda" will not fail the e2fsck because the fd closure races with
e2fsck startup. FUSE_DESTROY must not be sent until FUSE_SYNCFS has completed.
This is a major behavior change and who knows what might break existing
code, so we hide it behind iomap mode.
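The ordering described above can be sketched as a small userspace model.
This is not kernel code: the step names, struct model_conn, and
model_unmount_steps() are hypothetical labels for the calls that
fuse_conn_destroy() and fuse_iomap_unmount() make in this patch, and the
model ignores the conn_init check inside fuse_send_destroy().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical step names labelling the calls made during teardown */
enum model_step {
	STEP_FLUSH_QUEUED,	/* fuse_flush_requests_and_wait(fc, 30s) */
	STEP_SYNCFS,		/* sync_filesystem(fm->sb), expected to issue FUSE_SYNCFS */
	STEP_WAIT_SYNCFS,	/* fuse_flush_requests_and_wait(fc, 60s) */
	STEP_DESTROY,		/* fuse_send_destroy(fm) */
	STEP_ABORT,		/* fuse_abort_conn(fc) */
};

#define MODEL_MAX_STEPS 5

struct model_conn {
	bool iomap;	/* stands in for fc->iomap */
	bool destroy;	/* stands in for fc->destroy */
};

/* Fill steps[] with the teardown order and return how many were written. */
static size_t model_unmount_steps(const struct model_conn *fc,
				  enum model_step steps[MODEL_MAX_STEPS])
{
	size_t n = 0;

	steps[n++] = STEP_FLUSH_QUEUED;
	if (fc->iomap) {
		/* iomap mode: syncfs, wait for it, then always send destroy */
		steps[n++] = STEP_SYNCFS;
		steps[n++] = STEP_WAIT_SYNCFS;
		steps[n++] = STEP_DESTROY;
	} else if (fc->destroy) {
		/* legacy behavior: destroy only if the connection asked for it */
		steps[n++] = STEP_DESTROY;
	}
	steps[n++] = STEP_ABORT;

	assert(n <= MODEL_MAX_STEPS);
	return n;
}
```

In this model an iomap connection always gets the full five-step sequence,
while a non-iomap connection keeps the shorter legacy sequence.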
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 7 +++++++
fs/fuse/file_iomap.c | 29 +++++++++++++++++++++++++++++
fs/fuse/inode.c | 9 +++++++--
3 files changed, 43 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f4834a02d16c98..6a155bdd389af6 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1364,6 +1364,9 @@ int fuse_init_fs_context_submount(struct fs_context *fsc);
*/
void fuse_conn_destroy(struct fuse_mount *fm);
+/* Send the FUSE_DESTROY command. */
+void fuse_send_destroy(struct fuse_mount *fm);
+
/* Drop the connection and free the fuse mount */
void fuse_mount_destroy(struct fuse_mount *fm);
@@ -1646,11 +1649,15 @@ static inline bool fuse_has_iomap(const struct inode *inode)
int fuse_iomap_backing_open(struct fuse_conn *fc, struct fuse_backing *fb);
int fuse_iomap_backing_close(struct fuse_conn *fc, struct fuse_backing *fb);
+void fuse_iomap_mount(struct fuse_mount *fm);
+void fuse_iomap_unmount(struct fuse_mount *fm);
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
# define fuse_iomap_backing_open(...) (-EOPNOTSUPP)
# define fuse_iomap_backing_close(...) (-EOPNOTSUPP)
+# define fuse_iomap_mount(...) ((void)0)
+# define fuse_iomap_unmount(...) ((void)0)
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 154c99399f48d2..6e0e222da3046c 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -593,3 +593,32 @@ int fuse_iomap_backing_close(struct fuse_conn *fc, struct fuse_backing *fb)
/* We only support closing iomap block devices at unmount */
return -EBUSY;
}
+
+void fuse_iomap_mount(struct fuse_mount *fm)
+{
+ struct fuse_conn *fc = fm->fc;
+
+ /*
+ * Enable syncfs for iomap fuse servers so that we can send a final
+ * flush at unmount time. This also means that we can support
+ * freeze/thaw properly.
+ */
+ fc->sync_fs = true;
+}
+
+void fuse_iomap_unmount(struct fuse_mount *fm)
+{
+ struct fuse_conn *fc = fm->fc;
+
+ /*
+ * Flush all pending commands, syncfs, flush that, and send a destroy
+ * command. This gives the fuse server a chance to process all the
+ * pending releases, write the last bits of metadata changes to disk,
+ * and close the iomap block devices before we return from the umount
+ * call. The caller already flushed previously pending requests, so we
+ * only need the flush to wait for syncfs.
+ */
+ sync_filesystem(fm->sb);
+ fuse_flush_requests_and_wait(fc, secs_to_jiffies(60));
+ fuse_send_destroy(fm);
+}
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 1f3f91981410aa..3274ee1c31b62b 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -621,7 +621,7 @@ static void fuse_umount_begin(struct super_block *sb)
retire_super(sb);
}
-static void fuse_send_destroy(struct fuse_mount *fm)
+void fuse_send_destroy(struct fuse_mount *fm)
{
if (fm->fc->conn_init) {
FUSE_ARGS(args);
@@ -1457,6 +1457,9 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
init_server_timeout(fc, timeout);
+ if (fc->iomap)
+ fuse_iomap_mount(fm);
+
fm->sb->s_bdi->ra_pages =
min(fm->sb->s_bdi->ra_pages, ra_pages);
fc->minor = arg->minor;
@@ -2055,7 +2058,9 @@ void fuse_conn_destroy(struct fuse_mount *fm)
struct fuse_conn *fc = fm->fc;
fuse_flush_requests_and_wait(fc, secs_to_jiffies(30));
- if (fc->destroy)
+ if (fc->iomap)
+ fuse_iomap_unmount(fm);
+ else if (fc->destroy)
fuse_send_destroy(fm);
fuse_abort_conn(fc);
* [PATCH 08/23] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (6 preceding siblings ...)
2025-08-21 0:54 ` [PATCH 07/23] fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount Darrick J. Wong
@ 2025-08-21 0:54 ` Darrick J. Wong
2025-08-21 0:54 ` [PATCH 09/23] fuse: implement direct IO with iomap Darrick J. Wong
` (14 subsequent siblings)
22 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:54 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Implement the basic file mapping reporting functions like FIEMAP, BMAP,
and SEEK_DATA/HOLE.
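From userspace, the SEEK_DATA/SEEK_HOLE half of this is exercised through
plain lseek(2). The sketch below walks the data segments of an open file
and sums their lengths. count_data_bytes() is a hypothetical helper
written for illustration and is not part of this series. It moves the file
offset, and it assumes that nobody modifies the file concurrently.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Linux values, in case the libc headers did not expose them */
#ifndef SEEK_DATA
# define SEEK_DATA 3
# define SEEK_HOLE 4
#endif

/*
 * Hypothetical helper: walk the data segments of fd with SEEK_DATA and
 * SEEK_HOLE and return the total number of bytes reported as data, or
 * -1 with errno set on failure.
 */
static off_t count_data_bytes(int fd)
{
	off_t total = 0;
	off_t pos = 0;

	for (;;) {
		off_t data = lseek(fd, pos, SEEK_DATA);
		off_t hole;

		if (data < 0) {
			/* ENXIO means there is no more data at or after pos */
			if (errno == ENXIO)
				return total;
			return -1;
		}

		/* There is always an implicit hole at end of file */
		hole = lseek(fd, data, SEEK_HOLE);
		if (hole < 0)
			return -1;

		/* Holds as long as the file is not changing underneath us */
		assert(hole > data);
		total += hole - data;
		pos = hole;
	}
}
```

On a filesystem without real hole reporting, the kernel treats the whole
file as data, so the helper would simply return the file size.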
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 8 ++++++
fs/fuse/fuse_trace.h | 46 ++++++++++++++++++++++++++++++++
fs/fuse/dir.c | 1 +
fs/fuse/file.c | 13 +++++++++
fs/fuse/file_iomap.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 139 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 6a155bdd389af6..e7dc8229bcc5e7 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1651,6 +1651,11 @@ int fuse_iomap_backing_open(struct fuse_conn *fc, struct fuse_backing *fb);
int fuse_iomap_backing_close(struct fuse_conn *fc, struct fuse_backing *fb);
void fuse_iomap_mount(struct fuse_mount *fm);
void fuse_iomap_unmount(struct fuse_mount *fm);
+
+int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+ u64 start, u64 length);
+loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence);
+sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block);
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
@@ -1658,6 +1663,9 @@ void fuse_iomap_unmount(struct fuse_mount *fm);
# define fuse_iomap_backing_close(...) (-EOPNOTSUPP)
# define fuse_iomap_mount(...) ((void)0)
# define fuse_iomap_unmount(...) ((void)0)
+# define fuse_iomap_fiemap NULL
+# define fuse_iomap_lseek(...) (-ENOSYS)
+# define fuse_iomap_bmap(...) (-ENOSYS)
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index c3671a605a32f6..d2a926124a5d54 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -495,6 +495,52 @@ TRACE_EVENT(fuse_iomap_dev_add,
__entry->fd,
__entry->flags)
);
+
+TRACE_EVENT(fuse_iomap_fiemap,
+ TP_PROTO(const struct inode *inode, u64 start, u64 count,
+ unsigned int flags),
+
+ TP_ARGS(inode, start, count, flags),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ __field(unsigned int, flags)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = start;
+ __entry->length = count;
+ __entry->flags = flags;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT("fiemap") " flags 0x%x",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ __entry->flags)
+);
+
+TRACE_EVENT(fuse_iomap_lseek,
+ TP_PROTO(const struct inode *inode, loff_t offset, int whence),
+
+ TP_ARGS(inode, offset, whence),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ __field(loff_t, offset)
+ __field(int, whence)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = offset;
+ __entry->whence = whence;
+ ),
+
+ TP_printk(FUSE_INODE_FMT " offset 0x%llx whence %d",
+ FUSE_INODE_PRINTK_ARGS,
+ __entry->offset,
+ __entry->whence)
+);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 8e922dcadb8675..4ea763699c1bae 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -2298,6 +2298,7 @@ static const struct inode_operations fuse_common_inode_operations = {
.set_acl = fuse_set_acl,
.fileattr_get = fuse_fileattr_get,
.fileattr_set = fuse_fileattr_set,
+ .fiemap = fuse_iomap_fiemap,
};
static const struct inode_operations fuse_symlink_inode_operations = {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 0ba2b62e06679e..54432cf0be82ba 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2516,6 +2516,12 @@ static sector_t fuse_bmap(struct address_space *mapping, sector_t block)
struct fuse_bmap_out outarg;
int err;
+ if (fuse_has_iomap(inode)) {
+ sector_t alt_sec = fuse_iomap_bmap(mapping, block);
+ if (alt_sec > 0)
+ return alt_sec;
+ }
+
if (!inode->i_sb->s_bdev || fm->fc->no_bmap)
return 0;
@@ -2551,6 +2557,13 @@ static loff_t fuse_lseek(struct file *file, loff_t offset, int whence)
struct fuse_lseek_out outarg;
int err;
+ if (fuse_has_iomap(inode)) {
+ loff_t alt_pos = fuse_iomap_lseek(file, offset, whence);
+
+ if (alt_pos != -ENOSYS)
+ return alt_pos;
+ }
+
if (fm->fc->no_lseek)
goto fallback;
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 6e0e222da3046c..691ca3a4ec95e5 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -4,6 +4,7 @@
* Author: Darrick J. Wong <djwong@kernel.org>
*/
#include <linux/iomap.h>
+#include <linux/fiemap.h>
#include "fuse_i.h"
#include "fuse_trace.h"
#include "iomap_priv.h"
@@ -622,3 +623,73 @@ void fuse_iomap_unmount(struct fuse_mount *fm)
fuse_flush_requests_and_wait(fc, secs_to_jiffies(60));
fuse_send_destroy(fm);
}
+
+int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+ u64 start, u64 count)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ int error;
+
+ /*
+ * We are called directly from the vfs so we need to check per-inode
+ * support here explicitly.
+ */
+ if (!fuse_has_iomap(inode))
+ return -EOPNOTSUPP;
+
+ if (fieinfo->fi_flags & FIEMAP_FLAG_XATTR)
+ return -EOPNOTSUPP;
+
+ if (fuse_is_bad(inode))
+ return -EIO;
+
+ if (!fuse_allow_current_process(fc))
+ return -EACCES;
+
+ trace_fuse_iomap_fiemap(inode, start, count, fieinfo->fi_flags);
+
+ inode_lock_shared(inode);
+ error = iomap_fiemap(inode, fieinfo, start, count,
+ &fuse_iomap_ops);
+ inode_unlock_shared(inode);
+
+ return error;
+}
+
+sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block)
+{
+ ASSERT(fuse_has_iomap(mapping->host));
+
+ return iomap_bmap(mapping, block, &fuse_iomap_ops);
+}
+
+loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence)
+{
+ struct inode *inode = file->f_mapping->host;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+
+ ASSERT(fuse_has_iomap(inode));
+
+ if (fuse_is_bad(inode))
+ return -EIO;
+
+ if (!fuse_allow_current_process(fc))
+ return -EACCES;
+
+ trace_fuse_iomap_lseek(inode, offset, whence);
+
+ switch (whence) {
+ case SEEK_HOLE:
+ offset = iomap_seek_hole(inode, offset, &fuse_iomap_ops);
+ break;
+ case SEEK_DATA:
+ offset = iomap_seek_data(inode, offset, &fuse_iomap_ops);
+ break;
+ default:
+ return -ENOSYS;
+ }
+
+ if (offset < 0)
+ return offset;
+ return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
+}
* [PATCH 09/23] fuse: implement direct IO with iomap
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (7 preceding siblings ...)
2025-08-21 0:54 ` [PATCH 08/23] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE} Darrick J. Wong
@ 2025-08-21 0:54 ` Darrick J. Wong
2025-08-21 0:55 ` [PATCH 10/23] fuse: implement buffered " Darrick J. Wong
` (13 subsequent siblings)
22 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:54 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Start implementing the fuse-iomap file I/O paths by adding direct I/O
support and all the signalling flags that come with it. Buffered I/O
is much more complicated, so we leave that to a subsequent patch.
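The dispatch test that this patch adds, fuse_want_iomap_directio(), needs
both a direct I/O request and an inode that has opted in to iomap. A
minimal userspace model of that predicate follows. Every name in it is a
hypothetical stand-in: MODEL_IOCB_DIRECT does not carry the kernel's real
IOCB_DIRECT value, and the structs only hold the fields the check needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for IOCB_DIRECT and the FUSE_I_IOMAP state bit */
#define MODEL_IOCB_DIRECT	(1U << 0)
#define MODEL_FUSE_I_IOMAP	0

struct model_inode {
	unsigned long state;	/* bit field, like the fuse inode state word */
};

struct model_kiocb {
	unsigned int ki_flags;
	const struct model_inode *inode;
};

/* Per-inode opt-in, analogous to fuse_inode_has_iomap() */
static bool model_inode_has_iomap(const struct model_inode *inode)
{
	return (inode->state & (1UL << MODEL_FUSE_I_IOMAP)) != 0;
}

/* True only when the request is direct I/O and the inode uses iomap */
static bool model_want_iomap_directio(const struct model_kiocb *iocb)
{
	assert(iocb->inode != NULL);
	return (iocb->ki_flags & MODEL_IOCB_DIRECT) &&
	       model_inode_has_iomap(iocb->inode);
}
```

Note that the check is per inode (the FUSE_I_IOMAP bit), not per
connection, so two files on the same mount can take different I/O paths.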
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 50 +++++++
fs/fuse/fuse_trace.h | 186 +++++++++++++++++++++++++
include/uapi/linux/fuse.h | 29 ++++
fs/fuse/dir.c | 7 +
fs/fuse/file.c | 17 ++
fs/fuse/file_iomap.c | 338 +++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/inode.c | 2
fs/fuse/trace.c | 1
8 files changed, 624 insertions(+), 6 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e7dc8229bcc5e7..1415db4ebf47b1 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -234,6 +234,8 @@ enum {
FUSE_I_BTIME,
/* Wants or already has page cache IO */
FUSE_I_CACHE_IO_MODE,
+ /* Use iomap for this inode */
+ FUSE_I_IOMAP,
};
struct fuse_conn;
@@ -624,6 +626,16 @@ struct fuse_sync_bucket {
struct rcu_head rcu;
};
+#ifdef CONFIG_FUSE_IOMAP
+struct fuse_iomap_conn {
+ /* fuse server doesn't implement iomap_end */
+ unsigned int no_end:1;
+
+ /* fuse server doesn't implement iomap_ioend */
+ unsigned int no_ioend:1;
+};
+#endif
+
/**
* A Fuse connection.
*
@@ -903,7 +915,10 @@ struct fuse_conn {
/* Is link not implemented by fs? */
unsigned int no_link:1;
- /* Use fs/iomap for FIEMAP and SEEK_{DATA,HOLE} file operations */
+ /*
+ * Use fs/iomap for FIEMAP and SEEK_{DATA,HOLE} file operations and
+ * direct I/O.
+ */
unsigned int iomap:1;
/* Use io_uring for communication */
@@ -967,6 +982,11 @@ struct fuse_conn {
struct idr backing_files_map;
#endif
+#ifdef CONFIG_FUSE_IOMAP
+ /** iomap information */
+ struct fuse_iomap_conn iomap_conn;
+#endif
+
#ifdef CONFIG_FUSE_IO_URING
/** uring connection information*/
struct fuse_ring *ring;
@@ -1656,6 +1676,27 @@ int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
u64 start, u64 length);
loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence);
sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block);
+
+void fuse_iomap_open(struct inode *inode, struct file *file);
+
+void fuse_iomap_init_inode(struct inode *inode, unsigned attr_flags);
+void fuse_iomap_evict_inode(struct inode *inode);
+
+static inline bool fuse_inode_has_iomap(const struct inode *inode)
+{
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+
+ return test_bit(FUSE_I_IOMAP, &fi->state);
+}
+
+static inline bool fuse_want_iomap_directio(const struct kiocb *iocb)
+{
+ return (iocb->ki_flags & IOCB_DIRECT) &&
+ fuse_inode_has_iomap(file_inode(iocb->ki_filp));
+}
+
+ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to);
+ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from);
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
@@ -1666,6 +1707,13 @@ sector_t fuse_iomap_bmap(struct address_space *mapping, sector_t block);
# define fuse_iomap_fiemap NULL
# define fuse_iomap_lseek(...) (-ENOSYS)
# define fuse_iomap_bmap(...) (-ENOSYS)
+# define fuse_iomap_open(...) ((void)0)
+# define fuse_iomap_init_inode(...) ((void)0)
+# define fuse_iomap_evict_inode(...) ((void)0)
+# define fuse_inode_has_iomap(...) (false)
+# define fuse_want_iomap_directio(...) (false)
+# define fuse_iomap_direct_read(...) (-ENOSYS)
+# define fuse_iomap_direct_write(...) (-ENOSYS)
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index d2a926124a5d54..12dd05877727ab 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -60,6 +60,7 @@
EM( FUSE_STATX, "FUSE_STATX") \
EM( FUSE_IOMAP_BEGIN, "FUSE_IOMAP_BEGIN") \
EM( FUSE_IOMAP_END, "FUSE_IOMAP_END") \
+ EM( FUSE_IOMAP_IOEND, "FUSE_IOMAP_IOEND") \
EMe(CUSE_INIT, "CUSE_INIT")
/*
@@ -307,6 +308,34 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
{ FUSE_IOMAP_TYPE_UNWRITTEN, "unwritten" }, \
{ FUSE_IOMAP_TYPE_INLINE, "inline" }
+#define FUSE_IOMAP_IOEND_STRINGS \
+ { FUSE_IOMAP_IOEND_SHARED, "shared" }, \
+ { FUSE_IOMAP_IOEND_UNWRITTEN, "unwritten" }, \
+ { FUSE_IOMAP_IOEND_BOUNDARY, "boundary" }, \
+ { FUSE_IOMAP_IOEND_DIRECT, "direct" }, \
+ { FUSE_IOMAP_IOEND_APPEND, "append" }
+
+#define IOMAP_DIOEND_STRINGS \
+ { IOMAP_DIO_UNWRITTEN, "unwritten" }, \
+ { IOMAP_DIO_COW, "cow" }
+
+TRACE_DEFINE_ENUM(FUSE_I_ADVISE_RDPLUS);
+TRACE_DEFINE_ENUM(FUSE_I_INIT_RDPLUS);
+TRACE_DEFINE_ENUM(FUSE_I_SIZE_UNSTABLE);
+TRACE_DEFINE_ENUM(FUSE_I_BAD);
+TRACE_DEFINE_ENUM(FUSE_I_BTIME);
+TRACE_DEFINE_ENUM(FUSE_I_CACHE_IO_MODE);
+TRACE_DEFINE_ENUM(FUSE_I_IOMAP);
+
+#define FUSE_IFLAG_STRINGS \
+ { 1 << FUSE_I_ADVISE_RDPLUS, "advise_rdplus" }, \
+ { 1 << FUSE_I_INIT_RDPLUS, "init_rdplus" }, \
+ { 1 << FUSE_I_SIZE_UNSTABLE, "size_unstable" }, \
+ { 1 << FUSE_I_BAD, "bad" }, \
+ { 1 << FUSE_I_BTIME, "btime" }, \
+ { 1 << FUSE_I_CACHE_IO_MODE, "cacheio" }, \
+ { 1 << FUSE_I_IOMAP, "iomap" }
+
DECLARE_EVENT_CLASS(fuse_iomap_check_class,
TP_PROTO(const char *func, int line, const char *condition),
@@ -472,6 +501,65 @@ TRACE_EVENT(fuse_iomap_end_error,
__entry->error)
);
+TRACE_EVENT(fuse_iomap_ioend,
+ TP_PROTO(const struct inode *inode,
+ const struct fuse_iomap_ioend_in *inarg),
+
+ TP_ARGS(inode, inarg),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ __field(unsigned, ioendflags)
+ __field(int, error)
+ __field(uint64_t, new_addr)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = inarg->pos;
+ __entry->length = inarg->written;
+ __entry->ioendflags = inarg->ioendflags;
+ __entry->error = inarg->error;
+ __entry->new_addr = inarg->new_addr;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() " ioendflags (%s) error %d new_addr 0x%llx",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ __print_flags(__entry->ioendflags, "|", FUSE_IOMAP_IOEND_STRINGS),
+ __entry->error,
+ __entry->new_addr)
+);
+
+TRACE_EVENT(fuse_iomap_ioend_error,
+ TP_PROTO(const struct inode *inode,
+ const struct fuse_iomap_ioend_in *inarg,
+ int error),
+
+ TP_ARGS(inode, inarg, error),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ __field(unsigned, ioendflags)
+ __field(int, error)
+ __field(uint64_t, new_addr)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = inarg->pos;
+ __entry->length = inarg->written;
+ __entry->ioendflags = inarg->ioendflags;
+ __entry->error = error;
+ __entry->new_addr = inarg->new_addr;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() " ioendflags (%s) error %d new_addr 0x%llx",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ __print_flags(__entry->ioendflags, "|", FUSE_IOMAP_IOEND_STRINGS),
+ __entry->error,
+ __entry->new_addr)
+);
+
TRACE_EVENT(fuse_iomap_dev_add,
TP_PROTO(const struct fuse_conn *fc,
const struct fuse_backing_map *map),
@@ -541,6 +629,104 @@ TRACE_EVENT(fuse_iomap_lseek,
__entry->offset,
__entry->whence)
);
+
+DECLARE_EVENT_CLASS(fuse_iomap_file_io_class,
+ TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter),
+ TP_ARGS(iocb, iter),
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ ),
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(file_inode(iocb->ki_filp), fi, fm);
+ __entry->offset = iocb->ki_pos;
+ __entry->length = iov_iter_count(iter);
+ ),
+ TP_printk(FUSE_IO_RANGE_FMT(),
+ FUSE_IO_RANGE_PRINTK_ARGS())
+)
+#define DEFINE_FUSE_IOMAP_FILE_IO_EVENT(name) \
+DEFINE_EVENT(fuse_iomap_file_io_class, name, \
+ TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter), \
+ TP_ARGS(iocb, iter))
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_read);
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_write);
+
+DECLARE_EVENT_CLASS(fuse_iomap_file_ioend_class,
+ TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter,
+ ssize_t ret),
+ TP_ARGS(iocb, iter, ret),
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ __field(ssize_t, ret)
+ ),
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(file_inode(iocb->ki_filp), fi, fm);
+ __entry->offset = iocb->ki_pos;
+ __entry->length = iov_iter_count(iter);
+ __entry->ret = ret;
+ ),
+ TP_printk(FUSE_IO_RANGE_FMT() " ret 0x%zx",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ __entry->ret)
+)
+#define DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(name) \
+DEFINE_EVENT(fuse_iomap_file_ioend_class, name, \
+ TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter, \
+ ssize_t ret), \
+ TP_ARGS(iocb, iter, ret))
+DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_read_end);
+DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_write_end);
+
+TRACE_EVENT(fuse_iomap_dio_write_end_io,
+ TP_PROTO(const struct inode *inode, loff_t pos, ssize_t written,
+ int error, unsigned flags),
+
+ TP_ARGS(inode, pos, written, error, flags),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ __field(unsigned, dioendflags)
+ __field(int, error)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = pos;
+ __entry->length = written;
+ __entry->dioendflags = flags;
+ __entry->error = error;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() " dioendflags (%s) error %d",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ __print_flags(__entry->dioendflags, "|", IOMAP_DIOEND_STRINGS),
+ __entry->error)
+);
+
+DECLARE_EVENT_CLASS(fuse_inode_state_class,
+ TP_PROTO(const struct inode *inode),
+ TP_ARGS(inode),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ __field(unsigned long, state)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->state = fi->state;
+ ),
+
+ TP_printk(FUSE_INODE_FMT " state (%s)",
+ FUSE_INODE_PRINTK_ARGS,
+ __print_flags(__entry->state, "|", FUSE_IFLAG_STRINGS))
+);
+#define DEFINE_FUSE_INODE_STATE_EVENT(name) \
+DEFINE_EVENT(fuse_inode_state_class, name, \
+ TP_PROTO(const struct inode *inode), \
+ TP_ARGS(inode))
+DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_init_inode);
+DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_evict_inode);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 3b9e337119d792..10882fa1452e49 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -238,7 +238,8 @@
*
* 7.99
* - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
- * SEEK_{DATA,HOLE}
+ * SEEK_{DATA,HOLE}, and direct I/O
+ * - add FUSE_ATTR_IOMAP to enable iomap for specific inodes
*/
#ifndef _LINUX_FUSE_H
@@ -448,7 +449,7 @@ struct fuse_file_lock {
* FUSE_REQUEST_TIMEOUT: kernel supports timing out requests.
* init_out.request_timeout contains the timeout (in secs)
* FUSE_IOMAP: Client supports iomap for FIEMAP and SEEK_{DATA,HOLE} file
- * operations.
+ * operations and direct I/O.
*/
#define FUSE_ASYNC_READ (1 << 0)
#define FUSE_POSIX_LOCKS (1 << 1)
@@ -580,9 +581,11 @@ struct fuse_file_lock {
*
* FUSE_ATTR_SUBMOUNT: Object is a submount root
* FUSE_ATTR_DAX: Enable DAX for this file in per inode DAX mode
+ * FUSE_ATTR_IOMAP: Use iomap for this inode
*/
#define FUSE_ATTR_SUBMOUNT (1 << 0)
#define FUSE_ATTR_DAX (1 << 1)
+#define FUSE_ATTR_IOMAP (1 << 2)
/**
* Open flags
@@ -665,6 +668,7 @@ enum fuse_opcode {
FUSE_TMPFILE = 51,
FUSE_STATX = 52,
+ FUSE_IOMAP_IOEND = 4093,
FUSE_IOMAP_BEGIN = 4094,
FUSE_IOMAP_END = 4095,
@@ -1380,4 +1384,25 @@ struct fuse_iomap_end_in {
struct fuse_iomap_io map;
};
+/* out of place write extent */
+#define FUSE_IOMAP_IOEND_SHARED (1U << 0)
+/* unwritten extent */
+#define FUSE_IOMAP_IOEND_UNWRITTEN (1U << 1)
+/* don't merge into previous ioend */
+#define FUSE_IOMAP_IOEND_BOUNDARY (1U << 2)
+/* is direct I/O */
+#define FUSE_IOMAP_IOEND_DIRECT (1U << 3)
+/* is append ioend */
+#define FUSE_IOMAP_IOEND_APPEND (1U << 4)
+
+struct fuse_iomap_ioend_in {
+ uint32_t ioendflags; /* FUSE_IOMAP_IOEND_* */
+ int32_t error; /* negative errno or 0 */
+ uint64_t attr_ino; /* matches fuse_attr:ino */
+ uint64_t pos; /* file position, in bytes */
+ uint64_t new_addr; /* disk offset of new mapping, in bytes */
+ uint32_t written; /* bytes processed */
+ uint32_t reserved1; /* zero */
+};
+
#endif /* _LINUX_FUSE_H */
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 4ea763699c1bae..04e1242014c9c9 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -712,6 +712,10 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
if (err)
goto out_acl_release;
fuse_dir_changed(dir);
+
+ if (fuse_has_iomap(inode))
+ fuse_iomap_open(inode, file);
+
err = generic_file_open(inode, file);
if (!err) {
file->private_data = ff;
@@ -1749,6 +1753,9 @@ static int fuse_dir_open(struct inode *inode, struct file *file)
if (fuse_is_bad(inode))
return -EIO;
+ if (fuse_has_iomap(inode))
+ fuse_iomap_open(inode, file);
+
err = generic_file_open(inode, file);
if (err)
return err;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 54432cf0be82ba..f01a9346d4f8bc 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -245,6 +245,9 @@ static int fuse_open(struct inode *inode, struct file *file)
if (fuse_is_bad(inode))
return -EIO;
+ if (fuse_has_iomap(inode))
+ fuse_iomap_open(inode, file);
+
err = generic_file_open(inode, file);
if (err)
return err;
@@ -1751,10 +1754,17 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
struct file *file = iocb->ki_filp;
struct fuse_file *ff = file->private_data;
struct inode *inode = file_inode(file);
+ ssize_t ret;
if (fuse_is_bad(inode))
return -EIO;
+ if (fuse_want_iomap_directio(iocb)) {
+ ret = fuse_iomap_direct_read(iocb, to);
+ if (ret != -ENOSYS)
+ return ret;
+ }
+
if (FUSE_IS_DAX(inode))
return fuse_dax_read_iter(iocb, to);
@@ -1776,6 +1786,12 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (fuse_is_bad(inode))
return -EIO;
+ if (fuse_want_iomap_directio(iocb)) {
+ ssize_t ret = fuse_iomap_direct_write(iocb, from);
+ if (ret != -ENOSYS)
+ return ret;
+ }
+
if (FUSE_IS_DAX(inode))
return fuse_dax_write_iter(iocb, from);
@@ -3139,4 +3155,5 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
if (IS_ENABLED(CONFIG_FUSE_DAX))
fuse_dax_inode_init(inode, flags);
+ fuse_iomap_init_inode(inode, flags);
}
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 691ca3a4ec95e5..0a4433e9fe14ea 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -500,10 +500,15 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
}
/* Decide if we send FUSE_IOMAP_END to the fuse server */
-static bool fuse_should_send_iomap_end(const struct iomap *iomap,
+static bool fuse_should_send_iomap_end(const struct fuse_mount *fm,
+ const struct iomap *iomap,
unsigned int opflags, loff_t count,
ssize_t written)
{
+ /* Not implemented on fuse server */
+ if (fm->fc->iomap_conn.no_end)
+ return false;
+
/* fuse server demanded an iomap_end call. */
if (iomap->flags & FUSE_IOMAP_F_WANT_IOMAP_END)
return true;
@@ -528,7 +533,7 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count,
struct fuse_mount *fm = get_fuse_mount(inode);
int err = 0;
- if (fuse_should_send_iomap_end(iomap, opflags, count, written)) {
+ if (fuse_should_send_iomap_end(fm, iomap, opflags, count, written)) {
struct fuse_iomap_end_in inarg = {
.opflags = fuse_iomap_op_to_server(opflags),
.attr_ino = fi->orig_ino,
@@ -554,6 +559,7 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count,
* libfuse returns ENOSYS for servers that don't
* implement iomap_end
*/
+ fm->fc->iomap_conn.no_end = 1;
err = 0;
break;
case 0:
@@ -567,11 +573,104 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count,
return err;
}
-const struct iomap_ops fuse_iomap_ops = {
+static const struct iomap_ops fuse_iomap_ops = {
.iomap_begin = fuse_iomap_begin,
.iomap_end = fuse_iomap_end,
};
+static inline bool
+fuse_should_send_iomap_ioend(const struct fuse_mount *fm,
+ const struct fuse_iomap_ioend_in *inarg)
+{
+ /* Not implemented on fuse server */
+ if (fm->fc->iomap_conn.no_ioend)
+ return false;
+
+ /* Always send an ioend for errors. */
+ if (inarg->error)
+ return true;
+
+ /* Send an ioend if we performed an IO involving metadata changes. */
+ return inarg->written > 0 &&
+ (inarg->ioendflags & (FUSE_IOMAP_IOEND_SHARED |
+ FUSE_IOMAP_IOEND_UNWRITTEN |
+ FUSE_IOMAP_IOEND_APPEND));
+}
+
+/*
+ * Fast and loose check if this write could update the on-disk inode size.
+ */
+static inline bool fuse_ioend_is_append(const struct fuse_inode *fi,
+ loff_t pos, size_t written)
+{
+ return pos + written > i_size_read(&fi->inode);
+}
+
+static int fuse_iomap_ioend(struct inode *inode, loff_t pos, size_t written,
+ int error, unsigned ioendflags, sector_t new_addr)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_mount *fm = get_fuse_mount(inode);
+ struct fuse_iomap_ioend_in inarg = {
+ .ioendflags = ioendflags,
+ .error = error,
+ .attr_ino = fi->orig_ino,
+ .pos = pos,
+ .written = written,
+ .new_addr = new_addr,
+ };
+
+ if (fuse_ioend_is_append(fi, pos, written))
+ inarg.ioendflags |= FUSE_IOMAP_IOEND_APPEND;
+
+ trace_fuse_iomap_ioend(inode, &inarg);
+
+ if (fuse_should_send_iomap_ioend(fm, &inarg)) {
+ FUSE_ARGS(args);
+ int err;
+
+ args.opcode = FUSE_IOMAP_IOEND;
+ args.nodeid = get_node_id(inode);
+ args.in_numargs = 1;
+ args.in_args[0].size = sizeof(inarg);
+ args.in_args[0].value = &inarg;
+ err = fuse_simple_request(fm, &args);
+ switch (err) {
+ case -ENOSYS:
+ /*
+ * fuse servers can return ENOSYS if ioend processing
+ * is never needed for this filesystem.
+ */
+ fm->fc->iomap_conn.no_ioend = 1;
+ err = 0;
+ break;
+ case 0:
+ break;
+ default:
+ trace_fuse_iomap_ioend_error(inode, &inarg, err);
+
+ /*
+ * If the write IO failed, return the failure code to
+ * the caller no matter what happens with the ioend.
+ * If the write IO succeeded but the ioend did not,
+ * pass the new error up to the caller.
+ */
+ if (!error)
+ error = err;
+ break;
+ }
+ }
+ if (error)
+ return error;
+
+ /*
+ * If there weren't any ioend errors, update the incore isize, which
+ * confusingly takes the new i_size as "pos".
+ */
+ fuse_write_update_attr(inode, pos + written, written);
+ return 0;
+}
+
int fuse_iomap_backing_open(struct fuse_conn *fc, struct fuse_backing *fb)
{
if (!fc->iomap)
@@ -605,6 +704,8 @@ void fuse_iomap_mount(struct fuse_mount *fm)
* freeze/thaw properly.
*/
fc->sync_fs = true;
+ fc->iomap_conn.no_end = 0;
+ fc->iomap_conn.no_ioend = 0;
}
void fuse_iomap_unmount(struct fuse_mount *fm)
@@ -693,3 +794,234 @@ loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence)
return offset;
return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
}
+
+void fuse_iomap_open(struct inode *inode, struct file *file)
+{
+ if (fuse_inode_has_iomap(inode))
+ file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
+}
+
+enum fuse_ilock_type {
+ SHARED,
+ EXCL,
+};
+
+static int fuse_iomap_ilock_iocb(const struct kiocb *iocb,
+ enum fuse_ilock_type type)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ switch (type) {
+ case SHARED:
+ return inode_trylock_shared(inode) ? 0 : -EAGAIN;
+ case EXCL:
+ return inode_trylock(inode) ? 0 : -EAGAIN;
+ default:
+ ASSERT(0);
+ return -EIO;
+ }
+ } else {
+ switch (type) {
+ case SHARED:
+ inode_lock_shared(inode);
+ break;
+ case EXCL:
+ inode_lock(inode);
+ break;
+ default:
+ ASSERT(0);
+ return -EIO;
+ }
+ }
+
+ return 0;
+}
+
+static inline void fuse_inode_set_iomap(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ ASSERT(fuse_has_iomap(inode));
+
+ set_bit(FUSE_I_IOMAP, &fi->state);
+}
+
+static inline void fuse_inode_clear_iomap(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ ASSERT(fuse_has_iomap(inode));
+
+ clear_bit(FUSE_I_IOMAP, &fi->state);
+}
+
+void fuse_iomap_init_inode(struct inode *inode, unsigned attr_flags)
+{
+ struct fuse_conn *conn = get_fuse_conn(inode);
+
+ if (conn->iomap && (attr_flags & FUSE_ATTR_IOMAP))
+ fuse_inode_set_iomap(inode);
+
+ trace_fuse_iomap_init_inode(inode);
+}
+
+void fuse_iomap_evict_inode(struct inode *inode)
+{
+ trace_fuse_iomap_evict_inode(inode);
+
+ if (fuse_inode_has_iomap(inode))
+ fuse_inode_clear_iomap(inode);
+}
+
+ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ ssize_t ret;
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ trace_fuse_iomap_direct_read(iocb, to);
+
+ if (!iov_iter_count(to))
+ return 0; /* skip atime */
+
+ file_accessed(iocb->ki_filp);
+
+ ret = fuse_iomap_ilock_iocb(iocb, SHARED);
+ if (ret)
+ return ret;
+ ret = iomap_dio_rw(iocb, to, &fuse_iomap_ops, NULL, 0, NULL, 0);
+ inode_unlock_shared(inode);
+
+ trace_fuse_iomap_direct_read_end(iocb, to, ret);
+ return ret;
+}
+
+static int fuse_iomap_dio_write_end_io(struct kiocb *iocb, ssize_t written,
+ int error, unsigned dioflags)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ unsigned int nofs_flag;
+ unsigned int ioendflags = FUSE_IOMAP_IOEND_DIRECT;
+ int ret;
+
+ if (fuse_is_bad(inode))
+ return -EIO;
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ trace_fuse_iomap_dio_write_end_io(inode, iocb->ki_pos, written, error,
+ dioflags);
+
+ if (dioflags & IOMAP_DIO_COW)
+ ioendflags |= FUSE_IOMAP_IOEND_SHARED;
+ if (dioflags & IOMAP_DIO_UNWRITTEN)
+ ioendflags |= FUSE_IOMAP_IOEND_UNWRITTEN;
+
+ /*
+ * We can allocate memory here while doing writeback on behalf of
+ * memory reclaim. To avoid memory allocation deadlocks set the
+ * task-wide nofs context for the following operations.
+ */
+ nofs_flag = memalloc_nofs_save();
+ ret = fuse_iomap_ioend(inode, iocb->ki_pos, written, error, ioendflags,
+ FUSE_IOMAP_NULL_ADDR);
+ memalloc_nofs_restore(nofs_flag);
+ return ret;
+}
+
+static const struct iomap_dio_ops fuse_iomap_dio_write_ops = {
+ .end_io = fuse_iomap_dio_write_end_io,
+};
+
+static int fuse_iomap_direct_write_sync(struct kiocb *iocb, loff_t start,
+ size_t count)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ loff_t end = start + count - 1;
+ int err;
+
+ /* Flush the file metadata, not the page cache. */
+ err = sync_inode_metadata(inode, 1);
+ if (err)
+ return err;
+
+ if (fc->no_fsync)
+ return 0;
+
+ err = fuse_fsync_common(iocb->ki_filp, start, end, iocb_is_dsync(iocb),
+ FUSE_FSYNC);
+ if (err == -ENOSYS) {
+ fc->no_fsync = 1;
+ err = 0;
+ }
+ return err;
+}
+
+ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ loff_t blockmask = i_blocksize(inode) - 1;
+ loff_t pos = iocb->ki_pos;
+ size_t count = iov_iter_count(from);
+ bool was_dsync = false;
+ ssize_t ret;
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ trace_fuse_iomap_direct_write(iocb, from);
+
+ if (!count)
+ return 0;
+
+ /*
+ * direct I/O must be aligned to the fsblock size or we fall back to
+ * the old paths
+ */
+ if ((iocb->ki_pos | count) & blockmask)
+ return -ENOTBLK;
+
+ /* fuse doesn't support S_SYNC, so complain if we see this. */
+ if (IS_SYNC(inode)) {
+ ASSERT(!IS_SYNC(inode));
+ return -EIO;
+ }
+
+ /*
+ * Strip off IOCB_DSYNC and run the fsync ourselves: we hold inode_lock,
+ * and with IOCB_DSYNC set iomap_dio_rw would call generic_write_sync,
+ * whose fuse_fsync would then try to take inode_lock again.
+ */
+ if (iocb_is_dsync(iocb)) {
+ was_dsync = true;
+ iocb->ki_flags &= ~IOCB_DSYNC;
+ }
+
+ ret = fuse_iomap_ilock_iocb(iocb, EXCL);
+ if (ret)
+ goto out_dsync;
+ ret = generic_write_checks(iocb, from);
+ if (ret <= 0)
+ goto out_unlock;
+
+ ret = iomap_dio_rw(iocb, from, &fuse_iomap_ops,
+ &fuse_iomap_dio_write_ops, 0, NULL, 0);
+ if (ret <= 0)
+ goto out_unlock;
+
+ if (was_dsync) {
+ /* Restore IOCB_DSYNC and sync; a sync error replaces the byte count */
+ iocb->ki_flags |= IOCB_DSYNC;
+ ret = fuse_iomap_direct_write_sync(iocb, pos, count) ?: ret;
+ }
+
+out_unlock:
+ inode_unlock(inode);
+out_dsync:
+ trace_fuse_iomap_direct_write_end(iocb, from, ret);
+ if (was_dsync)
+ iocb->ki_flags |= IOCB_DSYNC;
+ return ret;
+}
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 3274ee1c31b62b..3d54fabbd64b0c 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -197,6 +197,8 @@ static void fuse_evict_inode(struct inode *inode)
WARN_ON(!list_empty(&fi->write_files));
WARN_ON(!list_empty(&fi->queued_writes));
}
+
+ fuse_iomap_evict_inode(inode);
}
static int fuse_reconfigure(struct fs_context *fsc)
diff --git a/fs/fuse/trace.c b/fs/fuse/trace.c
index 3b54f639a5423e..9de407148c867d 100644
--- a/fs/fuse/trace.c
+++ b/fs/fuse/trace.c
@@ -9,6 +9,7 @@
#include "iomap_priv.h"
#include <linux/pagemap.h>
+#include <linux/iomap.h>
#define CREATE_TRACE_POINTS
#include "fuse_trace.h"
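
Not part of the patch: a minimal userspace sketch of the ioend decision
logic in fs/fuse/file_iomap.c above, for readers who want to poke at it
outside the kernel. The flag values are copied from the uapi hunk; every
model_* name, the requests_sent counter, and the stub servers are
hypothetical stand-ins, and the isize update and tracepoints are left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from the include/uapi/linux/fuse.h hunk. */
#define FUSE_IOMAP_IOEND_SHARED		(1U << 0)
#define FUSE_IOMAP_IOEND_UNWRITTEN	(1U << 1)
#define FUSE_IOMAP_IOEND_BOUNDARY	(1U << 2)
#define FUSE_IOMAP_IOEND_DIRECT		(1U << 3)
#define FUSE_IOMAP_IOEND_APPEND		(1U << 4)

/* Hypothetical stand-in for the no_ioend bit kept per connection. */
struct model_conn {
	bool no_ioend;
	unsigned int requests_sent;	/* illustration only */
};

/* Hypothetical subset of struct fuse_iomap_ioend_in. */
struct model_ioend {
	uint32_t ioendflags;
	int32_t error;
	uint64_t pos;
	uint32_t written;
};

/* Hypothetical stub servers standing in for fuse_simple_request(). */
typedef int (*model_server_fn)(const struct model_ioend *io);

static int model_server_ok(const struct model_ioend *io)
{
	(void)io;
	return 0;
}

static int model_server_enosys(const struct model_ioend *io)
{
	(void)io;
	return -ENOSYS;
}

static int model_server_eio(const struct model_ioend *io)
{
	(void)io;
	return -EIO;
}

/* Mirrors fuse_should_send_iomap_ioend(). */
static bool model_should_send_ioend(const struct model_conn *conn,
				    const struct model_ioend *io)
{
	if (conn->no_ioend)
		return false;	/* server said ENOSYS earlier */
	if (io->error)
		return true;	/* always report failed I/O */
	return io->written > 0 &&
	       (io->ioendflags & (FUSE_IOMAP_IOEND_SHARED |
				  FUSE_IOMAP_IOEND_UNWRITTEN |
				  FUSE_IOMAP_IOEND_APPEND));
}

/* Mirrors the error handling in fuse_iomap_ioend(), minus the isize update. */
static int model_ioend(struct model_conn *conn, struct model_ioend *io,
		       uint64_t isize, model_server_fn server)
{
	int error = io->error;

	/* Same "fast and loose" append check as fuse_ioend_is_append(). */
	if (io->pos + io->written > isize)
		io->ioendflags |= FUSE_IOMAP_IOEND_APPEND;

	if (model_should_send_ioend(conn, io)) {
		int err = server(io);

		conn->requests_sent++;
		if (err == -ENOSYS)
			conn->no_ioend = true;	/* never ask again */
		else if (err && !error)
			error = err;	/* the I/O error wins if both failed */
	}
	return error;
}
```

In this model a pure overwrite inside EOF sends nothing, an extending
write is flagged APPEND and sent, and the first ENOSYS reply suppresses
all later requests on that connection.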
* [PATCH 10/23] fuse: implement buffered IO with iomap
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (8 preceding siblings ...)
2025-08-21 0:54 ` [PATCH 09/23] fuse: implement direct IO with iomap Darrick J. Wong
@ 2025-08-21 0:55 ` Darrick J. Wong
2025-08-21 0:55 ` [PATCH 11/23] fuse: enable caching of timestamps Darrick J. Wong
` (12 subsequent siblings)
22 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:55 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Implement pagecache IO with iomap, complete with hooks into truncate and
fallocate so that the fuse server needn't implement disk block zeroing
of post-EOF and unaligned punch/zero regions.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 36 ++
fs/fuse/fuse_trace.h | 268 +++++++++++++++++
include/uapi/linux/fuse.h | 4
fs/fuse/dir.c | 23 +
fs/fuse/file.c | 84 ++++-
fs/fuse/file_iomap.c | 718 ++++++++++++++++++++++++++++++++++++++++++++-
fs/fuse/inode.c | 10 +
7 files changed, 1113 insertions(+), 30 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 1415db4ebf47b1..74fb5971f8fec7 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -169,6 +169,13 @@ struct fuse_inode {
/* waitq for direct-io completion */
wait_queue_head_t direct_io_waitq;
+
+#ifdef CONFIG_FUSE_IOMAP
+ /* pending io completions */
+ spinlock_t ioend_lock;
+ struct work_struct ioend_work;
+ struct list_head ioend_list;
+#endif
};
/* readdir cache (directory only) */
@@ -916,8 +923,8 @@ struct fuse_conn {
unsigned int no_link:1;
/*
- * Use fs/iomap for FIEMAP and SEEK_{DATA,HOLE} file operations and
- * direct I/O.
+ * Use fs/iomap for FIEMAP and SEEK_{DATA,HOLE} file operations,
+ * buffered I/O, and direct I/O.
*/
unsigned int iomap:1;
@@ -1659,6 +1666,8 @@ void fuse_iomap_sysfs_cleanup(struct kobject *kobj);
# define fuse_iomap_sysfs_cleanup(...) ((void)0)
#endif
+sector_t fuse_bmap(struct address_space *mapping, sector_t block);
+
#if IS_ENABLED(CONFIG_FUSE_IOMAP)
bool fuse_iomap_enabled(void);
@@ -1697,6 +1706,21 @@ static inline bool fuse_want_iomap_directio(const struct kiocb *iocb)
ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to);
ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from);
+
+static inline bool fuse_want_iomap_buffered_io(const struct kiocb *iocb)
+{
+ return fuse_inode_has_iomap(file_inode(iocb->ki_filp));
+}
+
+int fuse_iomap_mmap(struct file *file, struct vm_area_struct *vma);
+ssize_t fuse_iomap_buffered_read(struct kiocb *iocb, struct iov_iter *to);
+ssize_t fuse_iomap_buffered_write(struct kiocb *iocb, struct iov_iter *from);
+int fuse_iomap_setsize_start(struct inode *inode, loff_t newsize);
+void fuse_iomap_set_i_blkbits(struct inode *inode, u8 new_blkbits);
+int fuse_iomap_fallocate(struct file *file, int mode, loff_t offset,
+ loff_t length, loff_t new_size);
+int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos,
+ loff_t endpos);
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
@@ -1714,6 +1738,14 @@ ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from);
# define fuse_want_iomap_directio(...) (false)
# define fuse_iomap_direct_read(...) (-ENOSYS)
# define fuse_iomap_direct_write(...) (-ENOSYS)
+# define fuse_want_iomap_buffered_io(...) (false)
+# define fuse_iomap_mmap(...) (-ENOSYS)
+# define fuse_iomap_buffered_read(...) (-ENOSYS)
+# define fuse_iomap_buffered_write(...) (-ENOSYS)
+# define fuse_iomap_setsize_start(...) (-ENOSYS)
+# define fuse_iomap_set_i_blkbits(...) ((void)0)
+# define fuse_iomap_fallocate(...) (-ENOSYS)
+# define fuse_iomap_flush_unmap_range(...) (-ENOSYS)
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 12dd05877727ab..10537a38b54556 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -231,6 +231,9 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
#endif
#if IS_ENABLED(CONFIG_FUSE_IOMAP)
+struct iomap_writepage_ctx;
+struct iomap_ioend;
+
/* tracepoint boilerplate so we don't have to keep doing this */
#define FUSE_IOMAP_OPFLAGS_FIELD \
__field(unsigned, opflags)
@@ -336,6 +339,12 @@ TRACE_DEFINE_ENUM(FUSE_I_IOMAP);
{ 1 << FUSE_I_CACHE_IO_MODE, "cacheio" }, \
{ 1 << FUSE_I_IOMAP, "iomap" }
+#define IOMAP_IOEND_STRINGS \
+ { IOMAP_IOEND_SHARED, "shared" }, \
+ { IOMAP_IOEND_UNWRITTEN, "unwritten" }, \
+ { IOMAP_IOEND_BOUNDARY, "boundary" }, \
+ { IOMAP_IOEND_DIRECT, "direct" }
+
DECLARE_EVENT_CLASS(fuse_iomap_check_class,
TP_PROTO(const char *func, int line, const char *condition),
@@ -650,6 +659,9 @@ DEFINE_EVENT(fuse_iomap_file_io_class, name, \
TP_ARGS(iocb, iter))
DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_read);
DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_direct_write);
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_buffered_read);
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_buffered_write);
+DEFINE_FUSE_IOMAP_FILE_IO_EVENT(fuse_iomap_write_zero_eof);
DECLARE_EVENT_CLASS(fuse_iomap_file_ioend_class,
TP_PROTO(const struct kiocb *iocb, const struct iov_iter *iter,
@@ -676,6 +688,8 @@ DEFINE_EVENT(fuse_iomap_file_ioend_class, name, \
TP_ARGS(iocb, iter, ret))
DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_read_end);
DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_direct_write_end);
+DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_buffered_read_end);
+DEFINE_FUSE_IOMAP_FILE_IOEND_EVENT(fuse_iomap_buffered_write_end);
TRACE_EVENT(fuse_iomap_dio_write_end_io,
TP_PROTO(const struct inode *inode, loff_t pos, ssize_t written,
@@ -727,6 +741,260 @@ DEFINE_EVENT(fuse_inode_state_class, name, \
TP_ARGS(inode))
DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_init_inode);
DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_evict_inode);
+
+TRACE_EVENT(fuse_iomap_end_ioend,
+ TP_PROTO(const struct iomap_ioend *ioend),
+
+ TP_ARGS(ioend),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ __field(unsigned int, ioendflags)
+ __field(int, error)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(ioend->io_inode, fi, fm);
+ __entry->offset = ioend->io_offset;
+ __entry->length = ioend->io_size;
+ __entry->ioendflags = ioend->io_flags;
+ __entry->error = blk_status_to_errno(ioend->io_bio.bi_status);
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() " ioendflags (%s) error %d",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ __print_flags(__entry->ioendflags, "|", IOMAP_IOEND_STRINGS),
+ __entry->error)
+);
+
+TRACE_EVENT(fuse_iomap_writeback_range,
+ TP_PROTO(const struct inode *inode, u64 offset, unsigned int count,
+ u64 end_pos),
+
+ TP_ARGS(inode, offset, count, end_pos),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ __field(uint64_t, end_pos)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = offset;
+ __entry->length = count;
+ __entry->end_pos = end_pos;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() " end_pos 0x%llx",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ __entry->end_pos)
+);
+
+TRACE_EVENT(fuse_iomap_writeback_submit,
+ TP_PROTO(const struct iomap_writepage_ctx *wpc, int error),
+
+ TP_ARGS(wpc, error),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ __field(unsigned int, nr_folios)
+ __field(uint64_t, addr)
+ __field(int, error)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(wpc->inode, fi, fm);
+ __entry->nr_folios = wpc->nr_folios;
+ __entry->offset = wpc->iomap.offset;
+ __entry->length = wpc->iomap.length;
+ __entry->addr = wpc->iomap.addr << 9;
+ __entry->error = error;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() " addr 0x%llx nr_folios %u error %d",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ __entry->addr,
+ __entry->nr_folios,
+ __entry->error)
+);
+
+TRACE_EVENT(fuse_iomap_discard_folio,
+ TP_PROTO(const struct inode *inode, loff_t offset, size_t count),
+
+ TP_ARGS(inode, offset, count),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = offset;
+ __entry->length = count;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT(),
+ FUSE_IO_RANGE_PRINTK_ARGS())
+);
+
+TRACE_EVENT(fuse_iomap_writepages,
+ TP_PROTO(const struct inode *inode, const struct writeback_control *wbc),
+
+ TP_ARGS(inode, wbc),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ __field(long, nr_to_write)
+ __field(bool, sync_all)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = wbc->range_start;
+ __entry->length = wbc->range_end - wbc->range_start + 1;
+ __entry->nr_to_write = wbc->nr_to_write;
+ __entry->sync_all = wbc->sync_mode == WB_SYNC_ALL;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() " nr_to_write %ld sync_all? %d",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ __entry->nr_to_write,
+ __entry->sync_all)
+);
+
+TRACE_EVENT(fuse_iomap_read_folio,
+ TP_PROTO(const struct folio *folio),
+
+ TP_ARGS(folio),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(folio->mapping->host, fi, fm);
+ __entry->offset = folio_pos(folio);
+ __entry->length = folio_size(folio);
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT(),
+ FUSE_IO_RANGE_PRINTK_ARGS())
+);
+
+TRACE_EVENT(fuse_iomap_readahead,
+ TP_PROTO(const struct readahead_control *rac),
+
+ TP_ARGS(rac),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ ),
+
+ TP_fast_assign(
+ struct readahead_control *mutrac = (struct readahead_control *)rac;
+ FUSE_INODE_ASSIGN(file_inode(rac->file), fi, fm);
+ __entry->offset = readahead_pos(mutrac);
+ __entry->length = readahead_length(mutrac);
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT(),
+ FUSE_IO_RANGE_PRINTK_ARGS())
+);
+
+TRACE_EVENT(fuse_iomap_page_mkwrite,
+ TP_PROTO(const struct vm_fault *vmf),
+
+ TP_ARGS(vmf),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ ),
+
+ TP_fast_assign(
+ struct folio *folio = page_folio(vmf->page);
+ FUSE_INODE_ASSIGN(file_inode(vmf->vma->vm_file), fi, fm);
+ __entry->offset = folio_pos(folio);
+ __entry->length = folio_size(folio);
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT(),
+ FUSE_IO_RANGE_PRINTK_ARGS())
+);
+
+DECLARE_EVENT_CLASS(fuse_iomap_file_range_class,
+ TP_PROTO(const struct inode *inode, loff_t offset, loff_t length),
+
+ TP_ARGS(inode, offset, length),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = offset;
+ __entry->length = length;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT(),
+ FUSE_IO_RANGE_PRINTK_ARGS())
+)
+#define DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(name) \
+DEFINE_EVENT(fuse_iomap_file_range_class, name, \
+ TP_PROTO(const struct inode *inode, loff_t offset, loff_t length), \
+ TP_ARGS(inode, offset, length))
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_truncate_up);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_truncate_down);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_punch_range);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_setsize);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_flush_unmap_range);
+
+TRACE_EVENT(fuse_iomap_set_i_blkbits,
+ TP_PROTO(const struct inode *inode, u8 new_blkbits),
+ TP_ARGS(inode, new_blkbits),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ __field(u8, old_blkbits)
+ __field(u8, new_blkbits)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->old_blkbits = inode->i_blkbits;
+ __entry->new_blkbits = new_blkbits;
+ ),
+
+ TP_printk(FUSE_INODE_FMT " old_blkbits %u new_blkbits %u",
+ FUSE_INODE_PRINTK_ARGS,
+ __entry->old_blkbits,
+ __entry->new_blkbits)
+);
+
+TRACE_EVENT(fuse_iomap_fallocate,
+ TP_PROTO(const struct inode *inode, int mode, loff_t offset,
+ loff_t length, loff_t newsize),
+ TP_ARGS(inode, mode, offset, length, newsize),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ __field(loff_t, newsize)
+ __field(int, mode)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = offset;
+ __entry->length = length;
+ __entry->mode = mode;
+ __entry->newsize = newsize;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() " mode 0x%x newsize 0x%llx",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ __entry->mode,
+ __entry->newsize)
+);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 10882fa1452e49..12d15c186256f3 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -238,7 +238,7 @@
*
* 7.99
* - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
- * SEEK_{DATA,HOLE}, and direct I/O
+ * SEEK_{DATA,HOLE}, buffered I/O, and direct I/O
* - add FUSE_ATTR_IOMAP to enable iomap for specific inodes
*/
@@ -449,7 +449,7 @@ struct fuse_file_lock {
* FUSE_REQUEST_TIMEOUT: kernel supports timing out requests.
* init_out.request_timeout contains the timeout (in secs)
* FUSE_IOMAP: Client supports iomap for FIEMAP and SEEK_{DATA,HOLE} file
- * operations and direct I/O.
+ * operations, buffered I/O, and direct I/O.
*/
#define FUSE_ASYNC_READ (1 << 0)
#define FUSE_POSIX_LOCKS (1 << 1)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 04e1242014c9c9..6195ac3232ff22 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -2025,7 +2025,10 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
is_truncate = true;
}
- if (FUSE_IS_DAX(inode) && is_truncate) {
+ if (fuse_inode_has_iomap(inode) && is_truncate) {
+ filemap_invalidate_lock(mapping);
+ fault_blocked = true;
+ } else if (FUSE_IS_DAX(inode) && is_truncate) {
filemap_invalidate_lock(mapping);
fault_blocked = true;
err = fuse_dax_break_layouts(inode, 0, -1);
@@ -2040,6 +2043,18 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
WARN_ON(!(attr->ia_valid & ATTR_SIZE));
WARN_ON(attr->ia_size != 0);
if (fc->atomic_o_trunc) {
+ if (fuse_inode_has_iomap(inode)) {
+ /*
+ * fuse_open already set the size to zero and
+ * truncated the pagecache, and we've since
+ * cycled the inode locks. Another thread
+ * could have performed an appending write, so
+ * we don't want to touch the file further.
+ */
+ filemap_invalidate_unlock(mapping);
+ return 0;
+ }
+
/*
* No need to send request to userspace, since actual
* truncation has already been done by OPEN. But still
@@ -2070,6 +2085,12 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
if (trust_local_cmtime && attr->ia_size != inode->i_size)
attr->ia_valid |= ATTR_MTIME | ATTR_CTIME;
+
+ if (fuse_inode_has_iomap(inode)) {
+ err = fuse_iomap_setsize_start(inode, attr->ia_size);
+ if (err)
+ goto error;
+ }
}
memset(&inarg, 0, sizeof(inarg));
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index f01a9346d4f8bc..43f3e2d4eacb8e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -385,7 +385,7 @@ static int fuse_release(struct inode *inode, struct file *file)
* Dirty pages might remain despite write_inode_now() call from
* fuse_flush() due to writes racing with the close.
*/
- if (fc->writeback_cache)
+ if (fc->writeback_cache || fuse_inode_has_iomap(inode))
write_inode_now(inode, 1);
fuse_release_common(file, false);
@@ -1765,6 +1765,9 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
return ret;
}
+ if (fuse_want_iomap_buffered_io(iocb))
+ return fuse_iomap_buffered_read(iocb, to);
+
if (FUSE_IS_DAX(inode))
return fuse_dax_read_iter(iocb, to);
@@ -1788,10 +1791,29 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (fuse_want_iomap_directio(iocb)) {
ssize_t ret = fuse_iomap_direct_write(iocb, from);
- if (ret != -ENOSYS)
+ switch (ret) {
+ case -ENOTBLK:
+ /*
+ * If we're going to fall back to the iomap buffered
+ * write path only, then try the write again as a
+ * synchronous buffered write. Otherwise we let it
+ * drop through to the old ->direct_IO path.
+ */
+ if (fuse_want_iomap_buffered_io(iocb))
+ iocb->ki_flags |= IOCB_SYNC;
+ fallthrough;
+ case -ENOSYS:
+ /* no implementation, fall through */
+ break;
+ default:
+ /* errors, no progress, or even partial progress */
return ret;
+ }
}
+ if (fuse_want_iomap_buffered_io(iocb))
+ return fuse_iomap_buffered_write(iocb, from);
+
if (FUSE_IS_DAX(inode))
return fuse_dax_write_iter(iocb, from);
@@ -2325,6 +2347,9 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
struct inode *inode = file_inode(file);
int rc;
+ if (fuse_inode_has_iomap(inode))
+ return fuse_iomap_mmap(file, vma);
+
/* DAX mmap is superior to direct_io mmap */
if (FUSE_IS_DAX(inode))
return fuse_dax_mmap(file, vma);
@@ -2523,7 +2548,7 @@ static int fuse_file_flock(struct file *file, int cmd, struct file_lock *fl)
return err;
}
-static sector_t fuse_bmap(struct address_space *mapping, sector_t block)
+sector_t fuse_bmap(struct address_space *mapping, sector_t block)
{
struct inode *inode = mapping->host;
struct fuse_mount *fm = get_fuse_mount(inode);
@@ -2877,8 +2902,12 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
static int fuse_writeback_range(struct inode *inode, loff_t start, loff_t end)
{
- int err = filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX);
+ int err;
+ if (fuse_inode_has_iomap(inode))
+ return fuse_iomap_flush_unmap_range(inode, start, end);
+
+ err = filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX);
if (!err)
fuse_sync_writes(inode);
@@ -2899,6 +2928,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
.length = length,
.mode = mode
};
+ loff_t newsize = 0;
int err;
bool block_faults = FUSE_IS_DAX(inode) &&
(!(mode & FALLOC_FL_KEEP_SIZE) ||
@@ -2912,7 +2942,10 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
return -EOPNOTSUPP;
inode_lock(inode);
- if (block_faults) {
+ if (fuse_inode_has_iomap(inode)) {
+ filemap_invalidate_lock(inode->i_mapping);
+ block_faults = true;
+ } else if (block_faults) {
filemap_invalidate_lock(inode->i_mapping);
err = fuse_dax_break_layouts(inode, 0, -1);
if (err)
@@ -2927,11 +2960,23 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
goto out;
}
+ /*
+ * If we are using iomap for file IO, fallocate must wait for all AIO
+ * to complete before we continue as AIO can change the file size on
+ * completion without holding any locks we currently hold. We must do
+ * this first because AIO can update the in-memory inode size, and the
+ * operations that follow require the in-memory size to be fully
+ * up-to-date.
+ */
+ if (fuse_inode_has_iomap(inode))
+ inode_dio_wait(inode);
+
if (!(mode & FALLOC_FL_KEEP_SIZE) &&
offset + length > i_size_read(inode)) {
err = inode_newsize_ok(inode, offset + length);
if (err)
goto out;
+ newsize = offset + length;
}
err = file_modified(file);
@@ -2954,14 +2999,22 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
if (err)
goto out;
- /* we could have extended the file */
- if (!(mode & FALLOC_FL_KEEP_SIZE)) {
- if (fuse_write_update_attr(inode, offset + length, length))
- file_update_time(file);
- }
+ if (fuse_inode_has_iomap(inode)) {
+ err = fuse_iomap_fallocate(file, mode, offset, length,
+ newsize);
+ if (err)
+ goto out;
+ } else {
+ /* we could have extended the file */
+ if (!(mode & FALLOC_FL_KEEP_SIZE)) {
+ if (fuse_write_update_attr(inode, newsize, length))
+ file_update_time(file);
+ }
- if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE))
- truncate_pagecache_range(inode, offset, offset + length - 1);
+ if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE))
+ truncate_pagecache_range(inode, offset,
+ offset + length - 1);
+ }
fuse_invalidate_attr_mask(inode, FUSE_STATX_MODSIZE);
@@ -3047,6 +3100,10 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
if (err)
goto out;
+ /* See inode_dio_wait comment in fuse_file_fallocate */
+ if (fuse_inode_has_iomap(inode_out))
+ inode_dio_wait(inode_out);
+
if (is_unstable)
set_bit(FUSE_I_SIZE_UNSTABLE, &fi_out->state);
@@ -3066,7 +3123,8 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
if (err)
goto out;
- truncate_inode_pages_range(inode_out->i_mapping,
+ if (!fuse_inode_has_iomap(inode_out))
+ truncate_inode_pages_range(inode_out->i_mapping,
ALIGN_DOWN(pos_out, PAGE_SIZE),
ALIGN(pos_out + outarg.size, PAGE_SIZE) - 1);
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 0a4433e9fe14ea..ff9298de193a26 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -5,6 +5,8 @@
*/
#include <linux/iomap.h>
#include <linux/fiemap.h>
+#include <linux/pagemap.h>
+#include <linux/falloc.h>
#include "fuse_i.h"
#include "fuse_trace.h"
#include "iomap_priv.h"
@@ -838,20 +840,14 @@ static int fuse_iomap_ilock_iocb(const struct kiocb *iocb,
return 0;
}
-static inline void fuse_inode_set_iomap(struct inode *inode)
-{
- struct fuse_inode *fi = get_fuse_inode(inode);
-
- ASSERT(fuse_has_iomap(inode));
-
- set_bit(FUSE_I_IOMAP, &fi->state);
-}
+static inline void fuse_inode_set_iomap(struct inode *inode);
static inline void fuse_inode_clear_iomap(struct inode *inode)
{
struct fuse_inode *fi = get_fuse_inode(inode);
ASSERT(fuse_has_iomap(inode));
+ ASSERT(list_empty(&fi->ioend_list));
clear_bit(FUSE_I_IOMAP, &fi->state);
}
@@ -960,6 +956,112 @@ static int fuse_iomap_direct_write_sync(struct kiocb *iocb, loff_t start,
return err;
}
+static const struct iomap_write_ops fuse_iomap_write_ops = {
+};
+
+static int
+fuse_iomap_zero_range(
+ struct inode *inode,
+ loff_t pos,
+ loff_t len,
+ bool *did_zero)
+{
+ return iomap_zero_range(inode, pos, len, did_zero, &fuse_iomap_ops,
+ &fuse_iomap_write_ops, NULL);
+}
+
+/* Take care of zeroing post-EOF blocks when they might exist. */
+static ssize_t
+fuse_iomap_write_zero_eof(
+ struct kiocb *iocb,
+ struct iov_iter *from,
+ bool *drained_dio)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct address_space *mapping = iocb->ki_filp->f_mapping;
+ loff_t isize;
+ int error;
+
+ /*
+ * We need to serialise against EOF updates that occur in IO
+ * completions here. We want to make sure that nobody is changing the
+ * size while we do this check until we have placed an IO barrier (i.e.
+ * hold i_rwsem exclusively) that prevents new IO from being
+ * dispatched. The spinlock effectively forms a memory barrier once we
+ * have i_rwsem exclusively so we are guaranteed to see the latest EOF
+ * value and hence be able to correctly determine if we need to run
+ * zeroing.
+ */
+ spin_lock(&fi->lock);
+ isize = i_size_read(inode);
+ if (iocb->ki_pos <= isize) {
+ spin_unlock(&fi->lock);
+ return 0;
+ }
+ spin_unlock(&fi->lock);
+
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ return -EAGAIN;
+
+ if (!(*drained_dio)) {
+ /*
+ * We now have an IO submission barrier in place, but AIO can
+ * do EOF updates during IO completion and hence we now need to
+ * wait for all of them to drain. Non-AIO DIO will have
+ * drained before we are given the exclusive i_rwsem, and so
+ * for most cases this wait is a no-op.
+ */
+ inode_dio_wait(inode);
+ *drained_dio = true;
+ return 1;
+ }
+
+ trace_fuse_iomap_write_zero_eof(iocb, from);
+
+ filemap_invalidate_lock(mapping);
+ error = fuse_iomap_zero_range(inode, isize, iocb->ki_pos - isize, NULL);
+ filemap_invalidate_unlock(mapping);
+
+ return error;
+}
+
+static ssize_t
+fuse_iomap_write_checks(
+ struct kiocb *iocb,
+ struct iov_iter *from)
+{
+ struct inode *inode = iocb->ki_filp->f_mapping->host;
+ ssize_t error;
+ bool drained_dio = false;
+
+restart:
+ error = generic_write_checks(iocb, from);
+ if (error <= 0)
+ return error;
+
+ /*
+ * If the offset is beyond the size of the file, we need to zero all
+ * blocks that fall between the existing EOF and the start of this
+ * write.
+ *
+ * We can do an unlocked check for i_size here safely as I/O completion
+ * can only extend EOF. Truncate is locked out at this point, so the
+ * EOF cannot move backwards, only forwards. Hence we only need to take
+ * the slow path when we are at or beyond the current EOF.
+ */
+ if (fuse_inode_has_iomap(inode) &&
+ iocb->ki_pos > i_size_read(inode)) {
+ error = fuse_iomap_write_zero_eof(iocb, from, &drained_dio);
+ if (error == 1)
+ goto restart;
+ if (error)
+ return error;
+ }
+
+ return kiocb_modified(iocb);
+}
+
ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
{
struct inode *inode = file_inode(iocb->ki_filp);
@@ -1002,8 +1104,9 @@ ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
ret = fuse_iomap_ilock_iocb(iocb, EXCL);
if (ret)
goto out_dsync;
- ret = generic_write_checks(iocb, from);
- if (ret <= 0)
+
+ ret = fuse_iomap_write_checks(iocb, from);
+ if (ret)
goto out_unlock;
ret = iomap_dio_rw(iocb, from, &fuse_iomap_ops,
@@ -1025,3 +1128,598 @@ ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
iocb->ki_flags |= IOCB_DSYNC;
return ret;
}
+
+struct fuse_writepage_ctx {
+ struct iomap_writepage_ctx ctx;
+};
+
+static void fuse_iomap_end_ioend(struct iomap_ioend *ioend)
+{
+ struct inode *inode = ioend->io_inode;
+ unsigned int ioendflags = 0;
+ unsigned int nofs_flag;
+ int error = blk_status_to_errno(ioend->io_bio.bi_status);
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ if (fuse_is_bad(inode))
+ return;
+
+ trace_fuse_iomap_end_ioend(ioend);
+
+ if (ioend->io_flags & IOMAP_IOEND_SHARED)
+ ioendflags |= FUSE_IOMAP_IOEND_SHARED;
+ if (ioend->io_flags & IOMAP_IOEND_UNWRITTEN)
+ ioendflags |= FUSE_IOMAP_IOEND_UNWRITTEN;
+
+ /*
+ * We can allocate memory here while doing writeback on behalf of
+ * memory reclaim. To avoid memory allocation deadlocks set the
+ * task-wide nofs context for the following operations.
+ */
+ nofs_flag = memalloc_nofs_save();
+ fuse_iomap_ioend(inode, ioend->io_offset, ioend->io_size, error,
+ ioendflags, ioend->io_sector);
+ iomap_finish_ioends(ioend, error);
+ memalloc_nofs_restore(nofs_flag);
+}
+
+/*
+ * Finish all pending IO completions that require transactional modifications.
+ *
+ * We try to merge physically and logically contiguous ioends before completion to
+ * minimise the number of transactions we need to perform during IO completion.
+ * Both unwritten extent conversion and COW remapping need to iterate and modify
+ * one physical extent at a time, so we gain nothing by merging physically
+ * discontiguous extents here.
+ *
+ * The ioend chain length that we can be processing here is largely unbounded in
+ * length and we may have to perform significant amounts of work on each ioend
+ * to complete it. Hence we have to be careful about holding the CPU for too
+ * long in this loop.
+ */
+static void fuse_iomap_end_io(struct work_struct *work)
+{
+ struct fuse_inode *fi =
+ container_of(work, struct fuse_inode, ioend_work);
+ struct iomap_ioend *ioend;
+ struct list_head tmp;
+ unsigned long flags;
+
+ spin_lock_irqsave(&fi->ioend_lock, flags);
+ list_replace_init(&fi->ioend_list, &tmp);
+ spin_unlock_irqrestore(&fi->ioend_lock, flags);
+
+ iomap_sort_ioends(&tmp);
+ while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
+ io_list))) {
+ list_del_init(&ioend->io_list);
+ iomap_ioend_try_merge(ioend, &tmp);
+ fuse_iomap_end_ioend(ioend);
+ cond_resched();
+ }
+}
+
+static void fuse_iomap_end_bio(struct bio *bio)
+{
+ struct iomap_ioend *ioend = iomap_ioend_from_bio(bio);
+ struct inode *inode = ioend->io_inode;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ unsigned long flags;
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ spin_lock_irqsave(&fi->ioend_lock, flags);
+ if (list_empty(&fi->ioend_list))
+ WARN_ON_ONCE(!queue_work(system_unbound_wq, &fi->ioend_work));
+ list_add_tail(&ioend->io_list, &fi->ioend_list);
+ spin_unlock_irqrestore(&fi->ioend_lock, flags);
+}
+
+/*
+ * Fast revalidation of the cached writeback mapping. Return true if the current
+ * mapping is valid, false otherwise.
+ */
+static bool fuse_iomap_revalidate_writeback(struct iomap_writepage_ctx *wpc,
+ loff_t offset)
+{
+ if (offset < wpc->iomap.offset ||
+ offset >= wpc->iomap.offset + wpc->iomap.length)
+ return false;
+
+ /* XXX actually use revalidation cookie */
+ return true;
+}
+
+/*
+ * If the folio has delalloc blocks on it, the caller is asking us to punch them
+ * out. If we don't, we can leave a stale delalloc mapping covered by a clean
+ * page that needs to be dirtied again before the delalloc mapping can be
+ * converted. This stale delalloc mapping can trip up a later direct I/O read
+ * operation on the same region.
+ *
+ * We prevent this by truncating away the delalloc regions on the folio. Because
+ * they are delalloc, we can do this without needing a transaction. Indeed - if
+ * we get ENOSPC errors, we have to be able to do this truncation without a
+ * transaction as there is no space left for block reservation (typically why
+ * we see an ENOSPC in writeback).
+ */
+static void fuse_iomap_discard_folio(struct folio *folio, loff_t pos, int error)
+{
+ struct inode *inode = folio->mapping->host;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ loff_t end = folio_pos(folio) + folio_size(folio);
+
+ if (fuse_is_bad(inode))
+ return;
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ trace_fuse_iomap_discard_folio(inode, pos, folio_size(folio));
+
+ printk_ratelimited(KERN_ERR
+ "page discard on page %px, inode 0x%llx, pos %llu.",
+ folio, fi->orig_ino, pos);
+
+ /* Userspace may need to remove delayed allocations */
+ fuse_iomap_ioend(inode, pos, end - pos, error, 0, FUSE_IOMAP_NULL_ADDR);
+}
+
+static ssize_t fuse_iomap_writeback_range(struct iomap_writepage_ctx *wpc,
+ struct folio *folio, u64 offset,
+ unsigned int len, u64 end_pos)
+{
+ struct inode *inode = wpc->inode;
+ struct iomap write_iomap, dontcare;
+ ssize_t ret;
+
+ if (fuse_is_bad(inode)) {
+ ret = -EIO;
+ goto discard_folio;
+ }
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ trace_fuse_iomap_writeback_range(inode, offset, len, end_pos);
+
+ if (!fuse_iomap_revalidate_writeback(wpc, offset)) {
+ /* Pretend that this is a directio write */
+ ret = fuse_iomap_begin(inode, offset, len,
+ IOMAP_DIRECT | IOMAP_WRITE,
+ &write_iomap, &dontcare);
+ if (ret)
+ goto discard_folio;
+
+ /*
+ * Landed in a hole or beyond EOF? Send that to iomap, it'll
+ * skip writing back the file range.
+ */
+ if (write_iomap.offset > offset) {
+ write_iomap.length = write_iomap.offset - offset;
+ write_iomap.offset = offset;
+ write_iomap.type = IOMAP_HOLE;
+ }
+
+ memcpy(&wpc->iomap, &write_iomap, sizeof(struct iomap));
+ }
+
+ ret = iomap_add_to_ioend(wpc, folio, offset, end_pos, len);
+ if (ret < 0)
+ goto discard_folio;
+
+ return ret;
+discard_folio:
+ fuse_iomap_discard_folio(folio, offset, ret);
+ return ret;
+}
+
+static int fuse_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
+ int error)
+{
+ struct iomap_ioend *ioend = wpc->wb_ctx;
+
+ ASSERT(fuse_inode_has_iomap(ioend->io_inode));
+
+ trace_fuse_iomap_writeback_submit(wpc, error);
+
+ /* always call our ioend function, even if we cancel the bio */
+ ioend->io_bio.bi_end_io = fuse_iomap_end_bio;
+ return iomap_ioend_writeback_submit(wpc, error);
+}
+
+static const struct iomap_writeback_ops fuse_iomap_writeback_ops = {
+ .writeback_range = fuse_iomap_writeback_range,
+ .writeback_submit = fuse_iomap_writeback_submit,
+};
+
+static int fuse_iomap_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ struct fuse_writepage_ctx wpc = {
+ .ctx = {
+ .inode = mapping->host,
+ .wbc = wbc,
+ .ops = &fuse_iomap_writeback_ops,
+ },
+ };
+
+ ASSERT(fuse_inode_has_iomap(mapping->host));
+
+ trace_fuse_iomap_writepages(mapping->host, wbc);
+
+ return iomap_writepages(&wpc.ctx);
+}
+
+static int fuse_iomap_read_folio(struct file *file, struct folio *folio)
+{
+ ASSERT(fuse_inode_has_iomap(file_inode(file)));
+
+ trace_fuse_iomap_read_folio(folio);
+
+ return iomap_read_folio(folio, &fuse_iomap_ops);
+}
+
+static void fuse_iomap_readahead(struct readahead_control *rac)
+{
+ ASSERT(fuse_inode_has_iomap(file_inode(rac->file)));
+
+ trace_fuse_iomap_readahead(rac);
+
+ iomap_readahead(rac, &fuse_iomap_ops);
+}
+
+static const struct address_space_operations fuse_iomap_aops = {
+ .read_folio = fuse_iomap_read_folio,
+ .readahead = fuse_iomap_readahead,
+ .writepages = fuse_iomap_writepages,
+ .dirty_folio = iomap_dirty_folio,
+ .release_folio = iomap_release_folio,
+ .invalidate_folio = iomap_invalidate_folio,
+ .migrate_folio = filemap_migrate_folio,
+ .is_partially_uptodate = iomap_is_partially_uptodate,
+ .error_remove_folio = generic_error_remove_folio,
+
+ /* These aren't pagecache operations per se */
+ .bmap = fuse_bmap,
+};
+
+static inline void fuse_inode_set_iomap(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ ASSERT(fuse_has_iomap(inode));
+
+ inode->i_data.a_ops = &fuse_iomap_aops;
+
+ INIT_WORK(&fi->ioend_work, fuse_iomap_end_io);
+ INIT_LIST_HEAD(&fi->ioend_list);
+ spin_lock_init(&fi->ioend_lock);
+ set_bit(FUSE_I_IOMAP, &fi->state);
+}
+
+/*
+ * Locking for serialisation of IO during page faults. This results in a lock
+ * ordering of:
+ *
+ * mmap_lock (MM)
+ * sb_start_pagefault(vfs, freeze)
+ * invalidate_lock (vfs - truncate serialisation)
+ * page_lock (MM)
+ * i_lock (FUSE - extent map serialisation)
+ */
+static vm_fault_t fuse_iomap_page_mkwrite(struct vm_fault *vmf)
+{
+ struct inode *inode = file_inode(vmf->vma->vm_file);
+ struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+ vm_fault_t ret;
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ trace_fuse_iomap_page_mkwrite(vmf);
+
+ sb_start_pagefault(inode->i_sb);
+ file_update_time(vmf->vma->vm_file);
+
+ filemap_invalidate_lock_shared(mapping);
+ ret = iomap_page_mkwrite(vmf, &fuse_iomap_ops, NULL);
+ filemap_invalidate_unlock_shared(mapping);
+
+ sb_end_pagefault(inode->i_sb);
+ return ret;
+}
+
+static const struct vm_operations_struct fuse_iomap_vm_ops = {
+ .fault = filemap_fault,
+ .map_pages = filemap_map_pages,
+ .page_mkwrite = fuse_iomap_page_mkwrite,
+};
+
+int fuse_iomap_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ ASSERT(fuse_inode_has_iomap(file_inode(file)));
+
+ file_accessed(file);
+ vma->vm_ops = &fuse_iomap_vm_ops;
+ return 0;
+}
+
+ssize_t fuse_iomap_buffered_read(struct kiocb *iocb, struct iov_iter *to)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ ssize_t ret;
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ trace_fuse_iomap_buffered_read(iocb, to);
+
+ if (!iov_iter_count(to))
+ return 0; /* skip atime */
+
+ file_accessed(iocb->ki_filp);
+
+ ret = fuse_iomap_ilock_iocb(iocb, SHARED);
+ if (ret)
+ return ret;
+ ret = generic_file_read_iter(iocb, to);
+ inode_unlock_shared(inode);
+
+ trace_fuse_iomap_buffered_read_end(iocb, to, ret);
+ return ret;
+}
+
+ssize_t fuse_iomap_buffered_write(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ loff_t pos = iocb->ki_pos;
+ ssize_t ret;
+
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ trace_fuse_iomap_buffered_write(iocb, from);
+
+ if (!iov_iter_count(from))
+ return 0;
+
+ ret = fuse_iomap_ilock_iocb(iocb, EXCL);
+ if (ret)
+ return ret;
+
+ ret = fuse_iomap_write_checks(iocb, from);
+ if (ret)
+ goto out_unlock;
+
+ if (inode->i_size < pos + iov_iter_count(from))
+ set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
+
+ ret = iomap_file_buffered_write(iocb, from, &fuse_iomap_ops,
+ &fuse_iomap_write_ops, NULL);
+
+ if (ret > 0)
+ fuse_write_update_attr(inode, pos + ret, ret);
+ clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
+
+out_unlock:
+ inode_unlock(inode);
+
+ if (ret > 0) {
+ /* Handle various SYNC-type writes */
+ ret = generic_write_sync(iocb, ret);
+ }
+ trace_fuse_iomap_buffered_write_end(iocb, from, ret);
+ return ret;
+}
+
+static int
+fuse_iomap_truncate_page(
+ struct inode *inode,
+ loff_t pos,
+ bool *did_zero)
+{
+ return iomap_truncate_page(inode, pos, did_zero, &fuse_iomap_ops,
+ &fuse_iomap_write_ops, NULL);
+}
+/*
+ * Truncate pagecache for a file before sending the truncate request to
+ * userspace. Must have write permission and not be a directory.
+ *
+ * Caution: The caller of this function is responsible for calling
+ * setattr_prepare() or otherwise verifying the change is fine.
+ */
+int
+fuse_iomap_setsize_start(
+ struct inode *inode,
+ loff_t newsize)
+{
+ loff_t oldsize = i_size_read(inode);
+ int error;
+ bool did_zeroing = false;
+
+ rwsem_assert_held_write(&inode->i_rwsem);
+ rwsem_assert_held_write(&inode->i_mapping->invalidate_lock);
+ ASSERT(S_ISREG(inode->i_mode));
+
+ /*
+ * Wait for all direct I/O to complete.
+ */
+ inode_dio_wait(inode);
+
+ /*
+ * File data changes must be complete and flushed to disk before we
+ * call userspace to modify the inode.
+ *
+ * Start with zeroing any data beyond EOF that we may expose on file
+ * extension, or zeroing out the rest of the block on a downward
+ * truncate.
+ */
+ if (newsize > oldsize) {
+ trace_fuse_iomap_truncate_up(inode, oldsize, newsize - oldsize);
+
+ error = fuse_iomap_zero_range(inode, oldsize, newsize - oldsize,
+ &did_zeroing);
+ } else {
+ trace_fuse_iomap_truncate_down(inode, newsize,
+ oldsize - newsize);
+
+ error = fuse_iomap_truncate_page(inode, newsize, &did_zeroing);
+ }
+ if (error)
+ return error;
+
+ /*
+ * We've already locked out new page faults, so now we can safely
+ * remove pages from the page cache knowing they won't get refaulted
+ * until we drop the mapping invalidation lock after the extent
+ * manipulations are complete. The truncate_setsize() call also cleans
+ * folios spanning EOF on extending truncates and hence ensures
+ * sub-page block size filesystems are correctly handled, too.
+ *
+ * And we update in-core i_size and truncate page cache beyond newsize
+ * before writing back the whole file, so we're guaranteed not to write
+ * stale data past the new EOF on truncate down.
+ */
+ truncate_setsize(inode, newsize);
+
+ /*
+ * Flush the entire pagecache to ensure the fuse server logs the inode
+ * size change and all dirty data that might be associated with it.
+ * We don't know the ondisk inode size, so we only have this clumsy
+ * hammer.
+ */
+ return filemap_write_and_wait(inode->i_mapping);
+}
+
+/*
+ * Prepare for a file data block remapping operation by flushing and unmapping
+ * all pagecache for the entire range.
+ */
+int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos,
+ loff_t endpos)
+{
+ loff_t start, end;
+ unsigned int rounding;
+ int error;
+
+ /*
+ * Make sure we extend the flush out to extent alignment boundaries so
+ * any extent range overlapping the start/end of the modification we
+ * are about to do is clean and idle.
+ */
+ rounding = max_t(unsigned int, i_blocksize(inode), PAGE_SIZE);
+ start = round_down(pos, rounding);
+ end = round_up(endpos + 1, rounding) - 1;
+
+ trace_fuse_iomap_flush_unmap_range(inode, start, end + 1 - start);
+
+ error = filemap_write_and_wait_range(inode->i_mapping, start, end);
+ if (error)
+ return error;
+ truncate_pagecache_range(inode, start, end);
+ return 0;
+}
+
+static int fuse_iomap_punch_range(struct inode *inode, loff_t offset,
+ loff_t length)
+{
+ loff_t isize = i_size_read(inode);
+ int error;
+
+ trace_fuse_iomap_punch_range(inode, offset, length);
+
+ /*
+ * Now that we've unmapped all full blocks we'll have to zero out any
+ * partial block at the beginning and/or end. iomap_zero_range is
+ * smart enough to skip holes and unwritten extents, including those we
+ * just created, but we must take care not to zero beyond EOF, which
+ * would enlarge i_size.
+ */
+ if (offset >= isize)
+ return 0;
+ if (offset + length > isize)
+ length = isize - offset;
+ error = fuse_iomap_zero_range(inode, offset, length, NULL);
+ if (error)
+ return error;
+
+ /*
+ * If we zeroed right up to EOF and EOF straddles a page boundary we
+ * must make sure that the post-EOF area is also zeroed because the
+ * page could be mmap'd and iomap_zero_range doesn't do that for us.
+ * Writeback of the eof page will do this, albeit clumsily.
+ */
+ if (offset + length >= isize && offset_in_page(offset + length) > 0) {
+ error = filemap_write_and_wait_range(inode->i_mapping,
+ round_down(offset + length, PAGE_SIZE),
+ LLONG_MAX);
+ }
+
+ return error;
+}
+
+void fuse_iomap_set_i_blkbits(struct inode *inode, u8 new_blkbits)
+{
+ trace_fuse_iomap_set_i_blkbits(inode, new_blkbits);
+
+ if (inode->i_blkbits == new_blkbits)
+ return;
+
+ if (!S_ISREG(inode->i_mode))
+ goto set_it;
+
+ /*
+ * iomap attaches per-block state to each folio, so we cannot allow
+ * the file block size to change if there's anything in the page cache.
+ * In theory, fuse servers should never be doing this.
+ */
+ if (inode->i_mapping->nrpages > 0) {
+ WARN_ON(inode->i_blkbits != new_blkbits &&
+ inode->i_mapping->nrpages > 0);
+ return;
+ }
+
+set_it:
+ inode->i_blkbits = new_blkbits;
+}
+
+int
+fuse_iomap_fallocate(
+ struct file *file,
+ int mode,
+ loff_t offset,
+ loff_t length,
+ loff_t new_size)
+{
+ struct inode *inode = file_inode(file);
+ int error;
+
+ ASSERT(fuse_has_iomap(inode));
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ trace_fuse_iomap_fallocate(inode, mode, offset, length, new_size);
+
+ /*
+ * If we unmapped blocks from the file range, then we zero the
+ * pagecache for those regions and push them to disk rather than make
+ * the fuse server manually zero the disk blocks.
+ */
+ if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE)) {
+ error = fuse_iomap_punch_range(inode, offset, length);
+ if (error)
+ return error;
+ }
+
+ /*
+ * If this is an extending write, we need to zero the bytes beyond the
+ * new EOF and bounce the new size out to userspace.
+ */
+ if (new_size) {
+ error = fuse_iomap_setsize_start(inode, new_size);
+ if (error)
+ return error;
+
+ fuse_write_update_attr(inode, new_size, length);
+ }
+
+ file_update_time(file);
+ return 0;
+}
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 3d54fabbd64b0c..72ba71c609a248 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -231,6 +231,7 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
{
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_inode *fi = get_fuse_inode(inode);
+ u8 new_blkbits;
lockdep_assert_held(&fi->lock);
@@ -295,9 +296,14 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
}
if (attr->blksize != 0)
- inode->i_blkbits = ilog2(attr->blksize);
+ new_blkbits = ilog2(attr->blksize);
else
- inode->i_blkbits = inode->i_sb->s_blocksize_bits;
+ new_blkbits = inode->i_sb->s_blocksize_bits;
+
+ if (fuse_inode_has_iomap(inode))
+ fuse_iomap_set_i_blkbits(inode, new_blkbits);
+ else
+ inode->i_blkbits = new_blkbits;
/*
* Don't set the sticky bit in i_mode, unless we want the VFS
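A note on the range arithmetic in fuse_iomap_flush_unmap_range() from the
patch above: the flush window is widened to the larger of the file block size
and the page size, and both ends are inclusive byte offsets. What follows is a
minimal userspace sketch of that rounding only; it is not kernel code. The
names sketch_flush_range, sketch_round_down, sketch_round_up and
SKETCH_PAGE_SIZE are made up for illustration, and the 4096-byte page size is
an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's PAGE_SIZE; 4096 is an assumption. */
#define SKETCH_PAGE_SIZE 4096u

struct flush_range {
	int64_t start;	/* first byte to flush, inclusive */
	int64_t end;	/* last byte to flush, inclusive */
};

/* Round x down to a multiple of align; align must be a power of two. */
static int64_t sketch_round_down(int64_t x, int64_t align)
{
	assert(align > 0 && (align & (align - 1)) == 0);
	return x & ~(align - 1);
}

/* Round x up to a multiple of align; align must be a power of two. */
static int64_t sketch_round_up(int64_t x, int64_t align)
{
	assert(align > 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/*
 * Mirror the arithmetic in the patch: pick the coarser of block size and
 * page size, then widen [pos, endpos] outward to that granularity.
 */
static struct flush_range sketch_flush_range(int64_t pos, int64_t endpos,
					     unsigned int blocksize)
{
	int64_t rounding = blocksize > SKETCH_PAGE_SIZE ?
				blocksize : SKETCH_PAGE_SIZE;
	struct flush_range r;

	r.start = sketch_round_down(pos, rounding);
	r.end = sketch_round_up(endpos + 1, rounding) - 1;
	return r;
}
```

With 1024-byte blocks, a request for bytes 5000 through 9000 widens to 4096
through 12287. With 64k blocks the same request widens to 0 through 65535,
which is why large block sizes make the flush noticeably coarser.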
* [PATCH 11/23] fuse: enable caching of timestamps
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (9 preceding siblings ...)
2025-08-21 0:55 ` [PATCH 10/23] fuse: implement buffered " Darrick J. Wong
@ 2025-08-21 0:55 ` Darrick J. Wong
2025-08-21 0:55 ` [PATCH 12/23] fuse: implement large folios for iomap pagecache files Darrick J. Wong
` (11 subsequent siblings)
22 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:55 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Cache the timestamps in the kernel so that the kernel sends FUSE_SETATTR
calls to the fuse server after writes, because the iomap infrastructure
won't do that for us.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/dir.c | 3 ++-
fs/fuse/file.c | 20 ++++++++++++++------
fs/fuse/file_iomap.c | 6 ++++++
fs/fuse/inode.c | 13 +++++++------
4 files changed, 29 insertions(+), 13 deletions(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 6195ac3232ff22..07aa338208b5cc 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -2005,7 +2005,8 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
struct fuse_setattr_in inarg;
struct fuse_attr_out outarg;
bool is_truncate = false;
- bool is_wb = fc->writeback_cache && S_ISREG(inode->i_mode);
+ bool is_wb = S_ISREG(inode->i_mode) &&
+ (fuse_inode_has_iomap(inode) || fc->writeback_cache);
loff_t oldsize;
int err;
bool trust_local_cmtime = is_wb;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 43f3e2d4eacb8e..825b7ac9158d08 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -239,7 +239,8 @@ static int fuse_open(struct inode *inode, struct file *file)
struct fuse_file *ff;
int err;
bool is_truncate = (file->f_flags & O_TRUNC) && fc->atomic_o_trunc;
- bool is_wb_truncate = is_truncate && fc->writeback_cache;
+ bool is_wb_truncate = is_truncate && (fuse_inode_has_iomap(inode) ||
+ fc->writeback_cache);
bool dax_truncate = is_truncate && FUSE_IS_DAX(inode);
if (fuse_is_bad(inode))
@@ -459,7 +460,9 @@ static int fuse_flush(struct file *file, fl_owner_t id)
if (fuse_is_bad(inode))
return -EIO;
- if (ff->open_flags & FOPEN_NOFLUSH && !fm->fc->writeback_cache)
+ if ((ff->open_flags & FOPEN_NOFLUSH) &&
+ !fm->fc->writeback_cache &&
+ !fuse_inode_has_iomap(inode))
return 0;
err = write_inode_now(inode, 1);
@@ -495,7 +498,7 @@ static int fuse_flush(struct file *file, fl_owner_t id)
* In memory i_blocks is not maintained by fuse, if writeback cache is
* enabled, i_blocks from cached attr may not be accurate.
*/
- if (!err && fm->fc->writeback_cache)
+ if (!err && (fuse_inode_has_iomap(inode) || fm->fc->writeback_cache))
fuse_invalidate_attr_mask(inode, STATX_BLOCKS);
return err;
}
@@ -793,8 +796,10 @@ static void fuse_short_read(struct inode *inode, u64 attr_ver, size_t num_read,
* If writeback_cache is enabled, a short read means there's a hole in
* the file. Some data after the hole is in page cache, but has not
* reached the client fs yet. So the hole is not present there.
+ * If iomap is enabled, a short read means we hit EOF so there's
+ * nothing to adjust.
*/
- if (!fc->writeback_cache) {
+ if (!fc->writeback_cache && !fuse_inode_has_iomap(inode)) {
loff_t pos = folio_pos(ap->folios[0]) + num_read;
fuse_read_update_size(inode, pos, attr_ver);
}
@@ -1409,6 +1414,8 @@ static int fuse_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned int flags, struct iomap *iomap,
struct iomap *srcmap)
{
+ WARN_ON(fuse_inode_has_iomap(inode));
+
iomap->type = IOMAP_MAPPED;
iomap->length = length;
iomap->offset = offset;
@@ -1976,7 +1983,7 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
* Do this only if writeback_cache is not enabled. If writeback_cache
* is enabled, we trust local ctime/mtime.
*/
- if (!fc->writeback_cache)
+ if (!fc->writeback_cache && !fuse_inode_has_iomap(inode))
fuse_invalidate_attr_mask(inode, FUSE_STATX_MODIFY);
spin_lock(&fi->lock);
fi->writectr--;
@@ -3057,7 +3064,8 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
ssize_t err;
/* mark unstable when write-back is not used, and file_out gets
* extended */
- bool is_unstable = (!fc->writeback_cache) &&
+ bool is_unstable = (!fc->writeback_cache &&
+ !fuse_inode_has_iomap(inode_out)) &&
((pos_out + len) > inode_out->i_size);
if (fc->no_copy_file_range)
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index ff9298de193a26..6aa9269b504713 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1389,6 +1389,12 @@ static inline void fuse_inode_set_iomap(struct inode *inode)
ASSERT(fuse_has_iomap(inode));
+ /*
+ * Manage timestamps ourselves, don't make the fuse server do it. This
+ * is critical for mtime updates to work correctly with page_mkwrite.
+ */
+ inode->i_flags &= ~S_NOCMTIME;
+ inode->i_flags &= ~S_NOATIME;
inode->i_data.a_ops = &fuse_iomap_aops;
INIT_WORK(&fi->ioend_work, fuse_iomap_end_io);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 72ba71c609a248..b08b1961d03b3e 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -331,10 +331,11 @@ u32 fuse_get_cache_mask(struct inode *inode)
{
struct fuse_conn *fc = get_fuse_conn(inode);
- if (!fc->writeback_cache || !S_ISREG(inode->i_mode))
- return 0;
+ if (S_ISREG(inode->i_mode) &&
+ (fuse_inode_has_iomap(inode) || fc->writeback_cache))
+ return STATX_MTIME | STATX_CTIME | STATX_SIZE;
- return STATX_MTIME | STATX_CTIME | STATX_SIZE;
+ return 0;
}
static void fuse_change_attributes_i(struct inode *inode, struct fuse_attr *attr,
@@ -349,9 +350,9 @@ static void fuse_change_attributes_i(struct inode *inode, struct fuse_attr *attr
spin_lock(&fi->lock);
/*
- * In case of writeback_cache enabled, writes update mtime, ctime and
- * may update i_size. In these cases trust the cached value in the
- * inode.
+ * In case of writeback_cache or iomap enabled, writes update mtime,
+ * ctime and may update i_size. In these cases trust the cached value
+ * in the inode.
*/
cache_mask = fuse_get_cache_mask(inode);
if (cache_mask & STATX_SIZE)
^ permalink raw reply related [flat|nested] 210+ messages in thread
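The rule this patch installs in fuse_get_cache_mask() can be read as a small truth table: the kernel trusts its own mtime, ctime and size only for regular files, and only when either writeback_cache or iomap is active. A minimal userspace sketch of that decision follows. The sketch_* names and the struct are hypothetical stand-ins, not kernel API; the three bit values are the STATX_* values from the Linux uapi headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same values as STATX_MTIME, STATX_CTIME and STATX_SIZE in the uapi. */
#define SKETCH_STATX_MTIME 0x00000040u
#define SKETCH_STATX_CTIME 0x00000080u
#define SKETCH_STATX_SIZE  0x00000200u

/* Hypothetical stand-in for the state fuse_get_cache_mask() consults. */
struct sketch_inode {
	bool is_regular;	/* S_ISREG(inode->i_mode) */
	bool has_iomap;		/* fuse_inode_has_iomap(inode) */
	bool writeback_cache;	/* fc->writeback_cache */
};

/* Which attributes does the kernel treat as locally authoritative? */
static uint32_t sketch_cache_mask(const struct sketch_inode *ino)
{
	assert(ino != NULL);

	if (ino->is_regular && (ino->has_iomap || ino->writeback_cache))
		return SKETCH_STATX_MTIME | SKETCH_STATX_CTIME |
		       SKETCH_STATX_SIZE;
	return 0;
}
```

An iomap regular file therefore gets the same mask as a writeback_cache file, while directories and other non-regular inodes keep taking all three attributes from the server.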
* [PATCH 12/23] fuse: implement large folios for iomap pagecache files
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (10 preceding siblings ...)
2025-08-21 0:55 ` [PATCH 11/23] fuse: enable caching of timestamps Darrick J. Wong
@ 2025-08-21 0:55 ` Darrick J. Wong
2025-08-21 0:55 ` [PATCH 13/23] fuse: use an unrestricted backing device with iomap pagecache io Darrick J. Wong
` (10 subsequent siblings)
22 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:55 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Use large folios when we're using iomap.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/file_iomap.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 6aa9269b504713..92cc85b5b8a8b5 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1386,6 +1386,7 @@ static const struct address_space_operations fuse_iomap_aops = {
static inline void fuse_inode_set_iomap(struct inode *inode)
{
struct fuse_inode *fi = get_fuse_inode(inode);
+ unsigned int min_order = 0;
ASSERT(fuse_has_iomap(inode));
@@ -1400,6 +1401,11 @@ static inline void fuse_inode_set_iomap(struct inode *inode)
INIT_WORK(&fi->ioend_work, fuse_iomap_end_io);
INIT_LIST_HEAD(&fi->ioend_list);
spin_lock_init(&fi->ioend_lock);
+
+ if (inode->i_blkbits > PAGE_SHIFT)
+ min_order = inode->i_blkbits - PAGE_SHIFT;
+
+ mapping_set_folio_min_order(inode->i_mapping, min_order);
set_bit(FUSE_I_IOMAP, &fi->state);
}
^ permalink raw reply related [flat|nested] 210+ messages in thread
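The min_order arithmetic in this patch is just the difference between the block size shift and the page size shift, clamped at zero. A small worked sketch, assuming a 4 KiB page (PAGE_SHIFT of 12) for the examples; the sketch_* helpers are hypothetical and exist only to show the numbers:

```c
#include <assert.h>

/* Mirrors the min_order computation added to fuse_inode_set_iomap(). */
static unsigned int sketch_folio_min_order(unsigned int blkbits,
					   unsigned int page_shift)
{
	unsigned int min_order = 0;

	if (blkbits > page_shift)
		min_order = blkbits - page_shift;
	return min_order;
}

/* Size in bytes of a folio of the given order. */
static unsigned long sketch_folio_bytes(unsigned int order,
					unsigned int page_shift)
{
	assert(order + page_shift < sizeof(unsigned long) * 8);
	return 1UL << (order + page_shift);
}
```

With 64 KiB blocks (i_blkbits of 16) on 4 KiB pages the minimum order is 4, so every folio in the mapping covers at least one whole filesystem block. Block sizes at or below the page size leave the minimum order at 0.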
* [PATCH 13/23] fuse: use an unrestricted backing device with iomap pagecache io
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (11 preceding siblings ...)
2025-08-21 0:55 ` [PATCH 12/23] fuse: implement large folios for iomap pagecache files Darrick J. Wong
@ 2025-08-21 0:55 ` Darrick J. Wong
2025-08-21 0:56 ` [PATCH 14/23] fuse: advertise support for iomap Darrick J. Wong
` (9 subsequent siblings)
22 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:55 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
With iomap support turned on for the pagecache, the kernel issues
writeback directly to block devices and we no longer have to push all
those pages through the fuse device to userspace. Therefore, we don't
need the tight dirty limits (~1M) that are used for regular fuse. This
dramatically increases the performance of fuse's pagecache IO.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/file_iomap.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 92cc85b5b8a8b5..701df0d34067ee 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -699,6 +699,27 @@ int fuse_iomap_backing_close(struct fuse_conn *fc, struct fuse_backing *fb)
void fuse_iomap_mount(struct fuse_mount *fm)
{
struct fuse_conn *fc = fm->fc;
+ struct super_block *sb = fm->sb;
+ struct backing_dev_info *old_bdi = sb->s_bdi;
+ char *suffix = sb->s_bdev ? "-fuseblk" : "-fuse";
+ int res;
+
+ /*
+ * sb->s_bdi points to the initial private bdi. However, we want to
+ * redirect it to a new private bdi with default dirty and readahead
+ * settings because iomap writeback won't be pushing a ton of dirty
+ * data through the fuse device. If this fails we fall back to the
+ * initial fuse bdi.
+ */
+ sb->s_bdi = &noop_backing_dev_info;
+ res = super_setup_bdi_name(sb, "%u:%u%s.iomap", MAJOR(fc->dev),
+ MINOR(fc->dev), suffix);
+ if (res) {
+ sb->s_bdi = old_bdi;
+ } else {
+ bdi_unregister(old_bdi);
+ bdi_put(old_bdi);
+ }
/*
* Enable syncfs for iomap fuse servers so that we can send a final
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 14/23] fuse: advertise support for iomap
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (12 preceding siblings ...)
2025-08-21 0:55 ` [PATCH 13/23] fuse: use an unrestricted backing device with iomap pagecache io Darrick J. Wong
@ 2025-08-21 0:56 ` Darrick J. Wong
2025-08-21 0:56 ` [PATCH 15/23] fuse: query filesystem geometry when using iomap Darrick J. Wong
` (8 subsequent siblings)
22 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:56 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Advertise our new IO paths programmatically by creating an ioctl that
can return the capabilities of the kernel.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 4 ++++
include/uapi/linux/fuse.h | 9 +++++++++
fs/fuse/dev.c | 3 +++
fs/fuse/file_iomap.c | 13 +++++++++++++
4 files changed, 29 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 74fb5971f8fec7..5380a220741014 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1721,6 +1721,9 @@ int fuse_iomap_fallocate(struct file *file, int mode, loff_t offset,
loff_t length, loff_t new_size);
int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos,
loff_t endpos);
+
+int fuse_dev_ioctl_iomap_support(struct file *file,
+ struct fuse_iomap_support __user *argp);
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
@@ -1746,6 +1749,7 @@ int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos,
# define fuse_iomap_set_i_blkbits(...) ((void)0)
# define fuse_iomap_fallocate(...) (-ENOSYS)
# define fuse_iomap_flush_unmap_range(...) (-ENOSYS)
+# define fuse_dev_ioctl_iomap_support(...) (-EOPNOTSUPP)
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 12d15c186256f3..d0f71136837068 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1134,12 +1134,21 @@ struct fuse_backing_map {
uint64_t padding;
};
+/* basic file I/O functionality through iomap */
+#define FUSE_IOMAP_SUPPORT_FILEIO (1ULL << 0)
+struct fuse_iomap_support {
+ uint64_t flags;
+ uint64_t padding;
+};
+
/* Device ioctls: */
#define FUSE_DEV_IOC_MAGIC 229
#define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
#define FUSE_DEV_IOC_BACKING_OPEN _IOW(FUSE_DEV_IOC_MAGIC, 1, \
struct fuse_backing_map)
#define FUSE_DEV_IOC_BACKING_CLOSE _IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
+#define FUSE_DEV_IOC_IOMAP_SUPPORT _IOR(FUSE_DEV_IOC_MAGIC, 3, \
+ struct fuse_iomap_support)
struct fuse_lseek_in {
uint64_t fh;
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 31d9f006836ac1..d239946a46c463 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2664,6 +2664,9 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
case FUSE_DEV_IOC_BACKING_CLOSE:
return fuse_dev_ioctl_backing_close(file, argp);
+ case FUSE_DEV_IOC_IOMAP_SUPPORT:
+ return fuse_dev_ioctl_iomap_support(file, argp);
+
default:
return -ENOTTY;
}
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 701df0d34067ee..5a2910919ba209 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1756,3 +1756,16 @@ fuse_iomap_fallocate(
file_update_time(file);
return 0;
}
+
+int fuse_dev_ioctl_iomap_support(struct file *file,
+ struct fuse_iomap_support __user *argp)
+{
+ struct fuse_iomap_support ios = { };
+
+ if (fuse_iomap_enabled())
+ ios.flags = FUSE_IOMAP_SUPPORT_FILEIO;
+
+ if (copy_to_user(argp, &ios, sizeof(ios)))
+ return -EFAULT;
+ return 0;
+}
^ permalink raw reply related [flat|nested] 210+ messages in thread
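For illustration, a fuse server might probe the new ioctl roughly as follows. This is a sketch under assumptions: the struct layout, magic number, ioctl number and flag value are copied from the RFC patch above under sketch_* names (they are not in any released uapi header and could change), and the helper functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Copied from the RFC patch; not part of a released uapi header. */
#define SKETCH_FUSE_DEV_IOC_MAGIC		229
#define SKETCH_FUSE_IOMAP_SUPPORT_FILEIO	(1ULL << 0)

struct sketch_fuse_iomap_support {
	uint64_t flags;
	uint64_t padding;
};

#define SKETCH_FUSE_DEV_IOC_IOMAP_SUPPORT \
	_IOR(SKETCH_FUSE_DEV_IOC_MAGIC, 3, struct sketch_fuse_iomap_support)

/* Returns 0 and fills *ios on success, -1 with errno set on failure. */
static int sketch_query_iomap_support(int fuse_dev_fd,
				      struct sketch_fuse_iomap_support *ios)
{
	assert(ios != NULL);

	/* Start from zero so a failed call never reports stale flags. */
	memset(ios, 0, sizeof(*ios));
	return ioctl(fuse_dev_fd, SKETCH_FUSE_DEV_IOC_IOMAP_SUPPORT, ios);
}

/* True if the kernel advertised basic file I/O through iomap. */
static bool sketch_has_iomap_fileio(const struct sketch_fuse_iomap_support *ios)
{
	return (ios->flags & SKETCH_FUSE_IOMAP_SUPPORT_FILEIO) != 0;
}
```

A server would call sketch_query_iomap_support() on its open /dev/fuse descriptor and treat any failure as "no iomap": per the hunks above, a kernel without the ioctl falls through to -ENOTTY and a kernel built without iomap support returns -EOPNOTSUPP.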
* [PATCH 15/23] fuse: query filesystem geometry when using iomap
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (13 preceding siblings ...)
2025-08-21 0:56 ` [PATCH 14/23] fuse: advertise support for iomap Darrick J. Wong
@ 2025-08-21 0:56 ` Darrick J. Wong
2025-08-21 0:56 ` [PATCH 16/23] fuse: implement fadvise for iomap files Darrick J. Wong
` (7 subsequent siblings)
22 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:56 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Add a new upcall to the fuse server so that the kernel can request
filesystem geometry bits when iomap mode is in use.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 4 +
fs/fuse/fuse_trace.h | 48 ++++++++++++++++++
include/uapi/linux/fuse.h | 39 ++++++++++++++
fs/fuse/file_iomap.c | 122 +++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/inode.c | 26 +++++++---
5 files changed, 231 insertions(+), 8 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 5380a220741014..2572eab6100fe4 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -999,6 +999,9 @@ struct fuse_conn {
struct fuse_ring *ring;
#endif
+ /** How many subsystems still need initialization? */
+ atomic_t need_init;
+
/** Only used if the connection opts into request timeouts */
struct {
/* Worker for checking if any requests have timed out */
@@ -1366,6 +1369,7 @@ struct fuse_dev *fuse_dev_alloc(void);
void fuse_dev_install(struct fuse_dev *fud, struct fuse_conn *fc);
void fuse_dev_free(struct fuse_dev *fud);
void fuse_send_init(struct fuse_mount *fm);
+void fuse_finish_init(struct fuse_conn *fc, bool ok);
/**
* Fill in superblock and initialize fuse connection
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 10537a38b54556..d3a0bd066370f5 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -58,6 +58,7 @@
EM( FUSE_SYNCFS, "FUSE_SYNCFS") \
EM( FUSE_TMPFILE, "FUSE_TMPFILE") \
EM( FUSE_STATX, "FUSE_STATX") \
+ EM( FUSE_IOMAP_CONFIG, "FUSE_IOMAP_CONFIG") \
EM( FUSE_IOMAP_BEGIN, "FUSE_IOMAP_BEGIN") \
EM( FUSE_IOMAP_END, "FUSE_IOMAP_END") \
EM( FUSE_IOMAP_IOEND, "FUSE_IOMAP_IOEND") \
@@ -345,6 +346,14 @@ TRACE_DEFINE_ENUM(FUSE_I_IOMAP);
{ IOMAP_IOEND_BOUNDARY, "boundary" }, \
{ IOMAP_IOEND_DIRECT, "direct" }
+#define FUSE_IOMAP_CONFIG_STRINGS \
+ { FUSE_IOMAP_CONFIG_SID, "sid" }, \
+ { FUSE_IOMAP_CONFIG_UUID, "uuid" }, \
+ { FUSE_IOMAP_CONFIG_BLOCKSIZE, "blocksize" }, \
+ { FUSE_IOMAP_CONFIG_MAX_LINKS, "max_links" }, \
+ { FUSE_IOMAP_CONFIG_TIME, "time" }, \
+ { FUSE_IOMAP_CONFIG_MAXBYTES, "maxbytes" }
+
DECLARE_EVENT_CLASS(fuse_iomap_check_class,
TP_PROTO(const char *func, int line, const char *condition),
@@ -995,6 +1004,45 @@ TRACE_EVENT(fuse_iomap_fallocate,
__entry->mode,
__entry->newsize)
);
+
+TRACE_EVENT(fuse_iomap_config,
+ TP_PROTO(const struct fuse_mount *fm,
+ const struct fuse_iomap_config_out *outarg),
+ TP_ARGS(fm, outarg),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+
+ __field(uint32_t, flags)
+ __field(uint32_t, blocksize)
+ __field(uint32_t, max_links)
+ __field(uint32_t, time_gran)
+
+ __field(int64_t, time_min)
+ __field(int64_t, time_max)
+ __field(int64_t, maxbytes)
+ __field(uint8_t, uuid_len)
+ ),
+
+ TP_fast_assign(
+ __entry->connection = fm->fc->dev;
+ __entry->flags = outarg->flags;
+ __entry->blocksize = outarg->s_blocksize;
+ __entry->max_links = outarg->s_max_links;
+ __entry->time_gran = outarg->s_time_gran;
+ __entry->time_min = outarg->s_time_min;
+ __entry->time_max = outarg->s_time_max;
+ __entry->maxbytes = outarg->s_maxbytes;
+ __entry->uuid_len = outarg->s_uuid_len;
+ ),
+
+ TP_printk("connection %u flags (%s) blocksize 0x%x max_links %u time_gran %u time_min %lld time_max %lld maxbytes 0x%llx uuid_len %u",
+ __entry->connection,
+ __print_flags(__entry->flags, "|", FUSE_IOMAP_CONFIG_STRINGS),
+ __entry->blocksize, __entry->max_links, __entry->time_gran,
+ __entry->time_min, __entry->time_max, __entry->maxbytes,
+ __entry->uuid_len)
+);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index d0f71136837068..1a677e807c2846 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -240,6 +240,7 @@
* - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
* SEEK_{DATA,HOLE}, buffered I/O, and direct I/O
* - add FUSE_ATTR_IOMAP to enable iomap for specific inodes
+ * - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry
*/
#ifndef _LINUX_FUSE_H
@@ -668,6 +669,7 @@ enum fuse_opcode {
FUSE_TMPFILE = 51,
FUSE_STATX = 52,
+ FUSE_IOMAP_CONFIG = 4092,
FUSE_IOMAP_IOEND = 4093,
FUSE_IOMAP_BEGIN = 4094,
FUSE_IOMAP_END = 4095,
@@ -1414,4 +1416,41 @@ struct fuse_iomap_ioend_in {
uint32_t reserved1; /* zero */
};
+struct fuse_iomap_config_in {
+ uint64_t flags; /* supported FUSE_IOMAP_CONFIG_* flags */
+ int64_t maxbytes; /* maximum supported file size */
+ uint64_t padding[6]; /* zero */
+};
+
+/* Which fields are set in fuse_iomap_config_out? */
+#define FUSE_IOMAP_CONFIG_SID (1ULL << 0)
+#define FUSE_IOMAP_CONFIG_UUID (1ULL << 1)
+#define FUSE_IOMAP_CONFIG_BLOCKSIZE (1ULL << 2)
+#define FUSE_IOMAP_CONFIG_MAX_LINKS (1ULL << 3)
+#define FUSE_IOMAP_CONFIG_TIME (1ULL << 4)
+#define FUSE_IOMAP_CONFIG_MAXBYTES (1ULL << 5)
+
+struct fuse_iomap_config_out {
+ uint64_t flags; /* FUSE_IOMAP_CONFIG_* */
+
+ char s_id[32]; /* Informational name */
+ char s_uuid[16]; /* UUID */
+
+ uint8_t s_uuid_len; /* length of s_uuid */
+
+ uint8_t s_pad[3]; /* must be zeroes */
+
+ uint32_t s_blocksize; /* fs block size */
+ uint32_t s_max_links; /* max hard links */
+
+ /* Granularity of c/m/atime in ns (cannot be worse than a second) */
+ uint32_t s_time_gran;
+
+ /* Time limits for c/m/atime in seconds */
+ int64_t s_time_min;
+ int64_t s_time_max;
+
+ int64_t s_maxbytes; /* max file size */
+};
+
#endif /* _LINUX_FUSE_H */
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 5a2910919ba209..d2b918521b7395 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -696,14 +696,105 @@ int fuse_iomap_backing_close(struct fuse_conn *fc, struct fuse_backing *fb)
return -EBUSY;
}
-void fuse_iomap_mount(struct fuse_mount *fm)
+struct fuse_iomap_config_args {
+ struct fuse_args args;
+ struct fuse_iomap_config_in inarg;
+ struct fuse_iomap_config_out outarg;
+};
+
+#define FUSE_IOMAP_CONFIG_ALL (FUSE_IOMAP_CONFIG_SID | \
+ FUSE_IOMAP_CONFIG_UUID | \
+ FUSE_IOMAP_CONFIG_BLOCKSIZE | \
+ FUSE_IOMAP_CONFIG_MAX_LINKS | \
+ FUSE_IOMAP_CONFIG_TIME | \
+ FUSE_IOMAP_CONFIG_MAXBYTES)
+
+static int fuse_iomap_process_config(struct fuse_mount *fm, int error,
+ const struct fuse_iomap_config_out *outarg)
{
+ struct super_block *sb = fm->sb;
+
+ switch (error) {
+ case 0:
+ break;
+ case -ENOSYS:
+ return 0;
+ default:
+ return error;
+ }
+
+ trace_fuse_iomap_config(fm, outarg);
+
+ if (outarg->flags & ~FUSE_IOMAP_CONFIG_ALL)
+ return -EINVAL;
+
+ if (outarg->s_uuid_len > sizeof(outarg->s_uuid))
+ return -EINVAL;
+
+ if (memchr_inv(outarg->s_pad, 0, sizeof(outarg->s_pad)))
+ return -EINVAL;
+
+ if (outarg->flags & FUSE_IOMAP_CONFIG_BLOCKSIZE) {
+ if (sb->s_bdev) {
+#ifdef CONFIG_BLOCK
+ if (!sb_set_blocksize(sb, outarg->s_blocksize))
+ return -EINVAL;
+#else
+ /*
+ * XXX: how do we have a bdev filesystem without
+ * CONFIG_BLOCK???
+ */
+ return -EINVAL;
+#endif
+ } else {
+ sb->s_blocksize = outarg->s_blocksize;
+ sb->s_blocksize_bits = blksize_bits(outarg->s_blocksize);
+ }
+ }
+
+ if (outarg->flags & FUSE_IOMAP_CONFIG_SID)
+ memcpy(sb->s_id, outarg->s_id, sizeof(sb->s_id));
+
+ if (outarg->flags & FUSE_IOMAP_CONFIG_UUID) {
+ memcpy(&sb->s_uuid, outarg->s_uuid, outarg->s_uuid_len);
+ sb->s_uuid_len = outarg->s_uuid_len;
+ }
+
+ if (outarg->flags & FUSE_IOMAP_CONFIG_MAX_LINKS)
+ sb->s_max_links = outarg->s_max_links;
+
+ if (outarg->flags & FUSE_IOMAP_CONFIG_TIME) {
+ sb->s_time_gran = outarg->s_time_gran;
+ sb->s_time_min = outarg->s_time_min;
+ sb->s_time_max = outarg->s_time_max;
+ }
+
+ if (outarg->flags & FUSE_IOMAP_CONFIG_MAXBYTES)
+ sb->s_maxbytes = outarg->s_maxbytes;
+
+ return 0;
+}
+
+static void fuse_iomap_config_reply(struct fuse_mount *fm,
+ struct fuse_args *args, int error)
+{
+ struct fuse_iomap_config_args *ia =
+ container_of(args, struct fuse_iomap_config_args, args);
struct fuse_conn *fc = fm->fc;
struct super_block *sb = fm->sb;
struct backing_dev_info *old_bdi = sb->s_bdi;
char *suffix = sb->s_bdev ? "-fuseblk" : "-fuse";
+ bool ok = true;
int res;
+ res = fuse_iomap_process_config(fm, error, &ia->outarg);
+ if (res) {
+ printk(KERN_ERR "%s: could not configure iomap, err=%d\n",
+ sb->s_id, res);
+ ok = false;
+ goto done;
+ }
+
/*
* sb->s_bdi points to the initial private bdi. However, we want to
* redirect it to a new private bdi with default dirty and readahead
@@ -729,6 +820,35 @@ void fuse_iomap_mount(struct fuse_mount *fm)
fc->sync_fs = true;
fc->iomap_conn.no_end = 0;
fc->iomap_conn.no_ioend = 0;
+
+done:
+ kfree(ia);
+ fuse_finish_init(fc, ok);
+}
+
+void fuse_iomap_mount(struct fuse_mount *fm)
+{
+ struct fuse_iomap_config_args *ia;
+
+ ia = kzalloc(sizeof(*ia), GFP_KERNEL | __GFP_NOFAIL);
+ ia->inarg.maxbytes = MAX_LFS_FILESIZE;
+ ia->inarg.flags = FUSE_IOMAP_CONFIG_ALL;
+
+ ia->args.opcode = FUSE_IOMAP_CONFIG;
+ ia->args.nodeid = 0;
+ ia->args.in_numargs = 1;
+ ia->args.in_args[0].size = sizeof(ia->inarg);
+ ia->args.in_args[0].value = &ia->inarg;
+ ia->args.out_argvar = true;
+ ia->args.out_numargs = 1;
+ ia->args.out_args[0].size = sizeof(ia->outarg);
+ ia->args.out_args[0].value = &ia->outarg;
+ ia->args.force = true;
+ ia->args.nocreds = true;
+ ia->args.end = fuse_iomap_config_reply;
+
+ if (fuse_simple_background(fm, &ia->args, GFP_KERNEL) != 0)
+ fuse_iomap_config_reply(fm, &ia->args, -ENOTCONN);
}
void fuse_iomap_unmount(struct fuse_mount *fm)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index b08b1961d03b3e..abb2beef3cfe1f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1323,6 +1323,8 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
struct fuse_init_out *arg = &ia->out;
bool ok = true;
+ atomic_inc(&fc->need_init);
+
if (error || arg->major != FUSE_KERNEL_VERSION)
ok = false;
else {
@@ -1466,9 +1468,6 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
init_server_timeout(fc, timeout);
- if (fc->iomap)
- fuse_iomap_mount(fm);
-
fm->sb->s_bdi->ra_pages =
min(fm->sb->s_bdi->ra_pages, ra_pages);
fc->minor = arg->minor;
@@ -1478,13 +1477,26 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
}
kfree(ia);
- if (!ok) {
+ if (!ok)
fc->conn_init = 0;
+
+ if (ok && fc->iomap) {
+ atomic_inc(&fc->need_init);
+ fuse_iomap_mount(fm);
+ }
+
+ fuse_finish_init(fc, ok);
+}
+
+void fuse_finish_init(struct fuse_conn *fc, bool ok)
+{
+ if (!ok)
fc->conn_error = 1;
- }
- fuse_set_initialized(fc);
- wake_up_all(&fc->blocked_waitq);
+ if (atomic_dec_and_test(&fc->need_init)) {
+ fuse_set_initialized(fc);
+ wake_up_all(&fc->blocked_waitq);
+ }
}
void fuse_send_init(struct fuse_mount *fm)
^ permalink raw reply related [flat|nested] 210+ messages in thread
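The need_init counter in this patch works as a countdown: every step that must complete before the mount becomes usable bumps the counter, and whichever step finishes last marks the connection initialized and wakes the waiters. A userspace model of that pattern using C11 atomics is sketched below. The sketch_* names are hypothetical, and the initialized flag stands in for fuse_set_initialized() plus the wake_up_all() call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the fields fuse_finish_init() touches. */
struct sketch_conn {
	atomic_int need_init;	/* outstanding initialization steps */
	bool conn_error;	/* set if any step failed */
	bool initialized;	/* set once by the last step to finish */
};

/* Register one more step that must finish before the mount is usable. */
static void sketch_begin_step(struct sketch_conn *fc)
{
	atomic_fetch_add(&fc->need_init, 1);
}

/* Complete one step; only the final completion marks the conn ready. */
static void sketch_finish_init(struct sketch_conn *fc, bool ok)
{
	assert(atomic_load(&fc->need_init) > 0);

	if (!ok)
		fc->conn_error = true;

	/* atomic_fetch_sub returns the old value, so 1 means we hit zero. */
	if (atomic_fetch_sub(&fc->need_init, 1) == 1)
		fc->initialized = true;
}
```

In the patch, process_init_reply() takes one reference for itself and a second one before calling fuse_iomap_mount(), so the FUSE_INIT completion alone no longer unblocks waiters; that happens only after the FUSE_IOMAP_CONFIG reply has also been processed.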
* [PATCH 16/23] fuse: implement fadvise for iomap files
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (14 preceding siblings ...)
2025-08-21 0:56 ` [PATCH 15/23] fuse: query filesystem geometry when using iomap Darrick J. Wong
@ 2025-08-21 0:56 ` Darrick J. Wong
2025-08-21 0:56 ` [PATCH 17/23] fuse: make the root nodeid dynamic Darrick J. Wong
` (6 subsequent siblings)
22 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:56 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
If userspace asks us to perform readahead on a file, take i_rwsem so
that it can't race with hole punching or writes.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 3 +++
fs/fuse/file.c | 1 +
fs/fuse/file_iomap.c | 20 ++++++++++++++++++++
3 files changed, 24 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 2572eab6100fe4..63ce9ddb96477c 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1728,6 +1728,8 @@ int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos,
int fuse_dev_ioctl_iomap_support(struct file *file,
struct fuse_iomap_support __user *argp);
+
+int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice);
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
@@ -1754,6 +1756,7 @@ int fuse_dev_ioctl_iomap_support(struct file *file,
# define fuse_iomap_fallocate(...) (-ENOSYS)
# define fuse_iomap_flush_unmap_range(...) (-ENOSYS)
# define fuse_dev_ioctl_iomap_support(...) (-EOPNOTSUPP)
+# define fuse_iomap_fadvise NULL
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 825b7ac9158d08..6575deae7e65f6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3186,6 +3186,7 @@ static const struct file_operations fuse_file_operations = {
.poll = fuse_file_poll,
.fallocate = fuse_file_fallocate,
.copy_file_range = fuse_copy_file_range,
+ .fadvise = fuse_iomap_fadvise,
};
static const struct address_space_operations fuse_file_aops = {
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index d2b918521b7395..c740fb1420bee0 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -7,6 +7,7 @@
#include <linux/fiemap.h>
#include <linux/pagemap.h>
#include <linux/falloc.h>
+#include <linux/fadvise.h>
#include "fuse_i.h"
#include "fuse_trace.h"
#include "iomap_priv.h"
@@ -1889,3 +1890,22 @@ int fuse_dev_ioctl_iomap_support(struct file *file,
return -EFAULT;
return 0;
}
+
+int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice)
+{
+ struct inode *inode = file_inode(file);
+ bool needlock = advice == POSIX_FADV_WILLNEED &&
+ fuse_inode_has_iomap(inode);
+ int ret;
+
+ /*
+ * Operations creating pages in page cache need protection from hole
+ * punching and similar ops
+ */
+ if (needlock)
+ inode_lock_shared(inode);
+ ret = generic_fadvise(file, start, end, advice);
+ if (needlock)
+ inode_unlock_shared(inode);
+ return ret;
+}
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 17/23] fuse: make the root nodeid dynamic
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (15 preceding siblings ...)
2025-08-21 0:56 ` [PATCH 16/23] fuse: implement fadvise for iomap files Darrick J. Wong
@ 2025-08-21 0:56 ` Darrick J. Wong
2025-08-21 0:57 ` [PATCH 18/23] fuse: allow setting of root nodeid Darrick J. Wong
` (5 subsequent siblings)
22 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:56 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Change this from a hardcoded constant to a dynamic field so that fuse
servers don't need to translate.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 7 +++++--
fs/fuse/fuse_trace.h | 6 ++++--
fs/fuse/dir.c | 10 ++++++----
fs/fuse/inode.c | 11 +++++++----
fs/fuse/readdir.c | 10 +++++-----
5 files changed, 27 insertions(+), 17 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 63ce9ddb96477c..66cf8dcf9216e7 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -665,6 +665,9 @@ struct fuse_conn {
struct rcu_head rcu;
+ /* node id of the root directory */
+ u64 root_nodeid;
+
/** The user id for this mount */
kuid_t user_id;
@@ -1097,9 +1100,9 @@ static inline u64 get_node_id(struct inode *inode)
return get_fuse_inode(inode)->nodeid;
}
-static inline int invalid_nodeid(u64 nodeid)
+static inline int invalid_nodeid(const struct fuse_conn *fc, u64 nodeid)
{
- return !nodeid || nodeid == FUSE_ROOT_ID;
+ return !nodeid || nodeid == fc->root_nodeid;
}
static inline u64 fuse_get_attr_version(struct fuse_conn *fc)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index d3a0bd066370f5..1f2ff30bececd4 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -1012,6 +1012,7 @@ TRACE_EVENT(fuse_iomap_config,
TP_STRUCT__entry(
__field(dev_t, connection)
+ __field(uint64_t, root_nodeid)
__field(uint32_t, flags)
__field(uint32_t, blocksize)
@@ -1026,6 +1027,7 @@ TRACE_EVENT(fuse_iomap_config,
TP_fast_assign(
__entry->connection = fm->fc->dev;
+ __entry->root_nodeid = fm->fc->root_nodeid;
__entry->flags = outarg->flags;
__entry->blocksize = outarg->s_blocksize;
__entry->max_links = outarg->s_max_links;
@@ -1036,8 +1038,8 @@ TRACE_EVENT(fuse_iomap_config,
__entry->uuid_len = outarg->s_uuid_len;
),
- TP_printk("connection %u flags (%s) blocksize 0x%x max_links %u time_gran %u time_min %lld time_max %lld maxbytes 0x%llx uuid_len %u",
- __entry->connection,
+ TP_printk("connection %u root_ino 0x%llx flags (%s) blocksize 0x%x max_links %u time_gran %u time_min %lld time_max %lld maxbytes 0x%llx uuid_len %u",
+ __entry->connection, __entry->root_nodeid,
__print_flags(__entry->flags, "|", FUSE_IOMAP_CONFIG_STRINGS),
__entry->blocksize, __entry->max_links, __entry->time_gran,
__entry->time_min, __entry->time_max, __entry->maxbytes,
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 07aa338208b5cc..02c8e705af1e35 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -386,7 +386,7 @@ int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name
err = -EIO;
if (fuse_invalid_attr(&outarg->attr))
goto out_put_forget;
- if (outarg->nodeid == FUSE_ROOT_ID && outarg->generation != 0) {
+ if (outarg->nodeid == fm->fc->root_nodeid && outarg->generation != 0) {
pr_warn_once("root generation should be zero\n");
outarg->generation = 0;
}
@@ -436,7 +436,7 @@ static struct dentry *fuse_lookup(struct inode *dir, struct dentry *entry,
goto out_err;
err = -EIO;
- if (inode && get_node_id(inode) == FUSE_ROOT_ID)
+ if (inode && get_node_id(inode) == fc->root_nodeid)
goto out_iput;
newent = d_splice_alias(inode, entry);
@@ -687,7 +687,8 @@ static int fuse_create_open(struct mnt_idmap *idmap, struct inode *dir,
goto out_free_ff;
err = -EIO;
- if (!S_ISREG(outentry.attr.mode) || invalid_nodeid(outentry.nodeid) ||
+ if (!S_ISREG(outentry.attr.mode) ||
+ invalid_nodeid(fm->fc, outentry.nodeid) ||
fuse_invalid_attr(&outentry.attr))
goto out_free_ff;
@@ -838,7 +839,8 @@ static struct dentry *create_new_entry(struct mnt_idmap *idmap, struct fuse_moun
goto out_put_forget_req;
err = -EIO;
- if (invalid_nodeid(outarg.nodeid) || fuse_invalid_attr(&outarg.attr))
+ if (invalid_nodeid(fm->fc, outarg.nodeid) ||
+ fuse_invalid_attr(&outarg.attr))
goto out_put_forget_req;
if ((outarg.attr.mode ^ mode) & S_IFMT)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index abb2beef3cfe1f..f2d519c0f737e6 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1001,6 +1001,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
fc->max_pages_limit = fuse_max_pages_limit;
fc->name_max = FUSE_NAME_LOW_MAX;
fc->timeout.req_timeout = 0;
+ fc->root_nodeid = FUSE_ROOT_ID;
if (IS_ENABLED(CONFIG_FUSE_BACKING))
fuse_backing_files_init(fc);
@@ -1056,12 +1057,14 @@ EXPORT_SYMBOL_GPL(fuse_conn_get);
static struct inode *fuse_get_root_inode(struct super_block *sb, unsigned int mode)
{
struct fuse_attr attr;
+ struct fuse_conn *fc = get_fuse_conn_super(sb);
+
memset(&attr, 0, sizeof(attr));
attr.mode = mode;
- attr.ino = FUSE_ROOT_ID;
+ attr.ino = fc->root_nodeid;
attr.nlink = 1;
- return fuse_iget(sb, FUSE_ROOT_ID, 0, &attr, 0, 0, 0);
+ return fuse_iget(sb, fc->root_nodeid, 0, &attr, 0, 0, 0);
}
struct fuse_inode_handle {
@@ -1105,7 +1108,7 @@ static struct dentry *fuse_get_dentry(struct super_block *sb,
goto out_iput;
entry = d_obtain_alias(inode);
- if (!IS_ERR(entry) && get_node_id(inode) != FUSE_ROOT_ID)
+ if (!IS_ERR(entry) && get_node_id(inode) != fc->root_nodeid)
fuse_invalidate_entry_cache(entry);
return entry;
@@ -1198,7 +1201,7 @@ static struct dentry *fuse_get_parent(struct dentry *child)
}
parent = d_obtain_alias(inode);
- if (!IS_ERR(parent) && get_node_id(inode) != FUSE_ROOT_ID)
+ if (!IS_ERR(parent) && get_node_id(inode) != fc->root_nodeid)
fuse_invalidate_entry_cache(parent);
return parent;
diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
index c2aae2eef0868b..45dd932eb03a5e 100644
--- a/fs/fuse/readdir.c
+++ b/fs/fuse/readdir.c
@@ -185,12 +185,12 @@ static int fuse_direntplus_link(struct file *file,
return 0;
}
- if (invalid_nodeid(o->nodeid))
- return -EIO;
- if (fuse_invalid_attr(&o->attr))
- return -EIO;
-
fc = get_fuse_conn(dir);
+ if (invalid_nodeid(fc, o->nodeid))
+ return -EIO;
+ if (fuse_invalid_attr(&o->attr))
+ return -EIO;
+
epoch = atomic_read(&fc->epoch);
name.hash = full_name_hash(parent, name.name, name.len);
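
For readers following the nodeid checks, here is a minimal userspace model of what the hunks above appear to rely on. It assumes invalid_nodeid() still rejects a zero nodeid and the root nodeid, but now compares against a per-connection value instead of the FUSE_ROOT_ID constant. The names demo_conn and demo_invalid_nodeid are made up for illustration and are not the kernel definitions.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_FUSE_ROOT_ID 1ULL	/* default root nodeid in the FUSE protocol */

/* Hypothetical stand-in for the single fuse_conn field used here. */
struct demo_conn {
	uint64_t root_nodeid;
};

/*
 * Assumed shape of the per-connection check: a reply that creates or looks
 * up a child must not name nodeid 0 or the connection's root nodeid.
 */
static bool demo_invalid_nodeid(const struct demo_conn *fc, uint64_t nodeid)
{
	return nodeid == 0 || nodeid == fc->root_nodeid;
}
```

With the default root, nodeid 1 is rejected and 2 is accepted; on a connection whose root nodeid is 2, the roles swap.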
* [PATCH 18/23] fuse: allow setting of root nodeid
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (16 preceding siblings ...)
2025-08-21 0:56 ` [PATCH 17/23] fuse: make the root nodeid dynamic Darrick J. Wong
@ 2025-08-21 0:57 ` Darrick J. Wong
2025-08-21 0:57 ` [PATCH 19/23] fuse: invalidate ranges of block devices being used for iomap Darrick J. Wong
` (4 subsequent siblings)
22 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:57 UTC
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Provide a new mount option so that fuse servers can actually set the
root nodeid.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 2 ++
fs/fuse/inode.c | 11 +++++++++++
2 files changed, 13 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 66cf8dcf9216e7..a81138da1e55f6 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -601,6 +601,7 @@ struct fuse_fs_context {
int fd;
struct file *file;
unsigned int rootmode;
+ u64 root_nodeid;
kuid_t user_id;
kgid_t group_id;
bool is_bdev:1;
@@ -614,6 +615,7 @@ struct fuse_fs_context {
bool no_control:1;
bool no_force_umount:1;
bool legacy_opts_show:1;
+ bool root_nodeid_present:1;
enum fuse_dax_mode dax_mode;
unsigned int max_read;
unsigned int blksize;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index f2d519c0f737e6..18dc9492d19174 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -785,6 +785,7 @@ enum {
OPT_ALLOW_OTHER,
OPT_MAX_READ,
OPT_BLKSIZE,
+ OPT_ROOT_NODEID,
OPT_ERR
};
@@ -799,6 +800,7 @@ static const struct fs_parameter_spec fuse_fs_parameters[] = {
fsparam_u32 ("max_read", OPT_MAX_READ),
fsparam_u32 ("blksize", OPT_BLKSIZE),
fsparam_string ("subtype", OPT_SUBTYPE),
+ fsparam_u64 ("root_nodeid", OPT_ROOT_NODEID),
{}
};
@@ -894,6 +896,11 @@ static int fuse_parse_param(struct fs_context *fsc, struct fs_parameter *param)
ctx->blksize = result.uint_32;
break;
+ case OPT_ROOT_NODEID:
+ ctx->root_nodeid = result.uint_64;
+ ctx->root_nodeid_present = true;
+ break;
+
default:
return -EINVAL;
}
@@ -929,6 +936,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
seq_printf(m, ",max_read=%u", fc->max_read);
if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
seq_printf(m, ",blksize=%lu", sb->s_blocksize);
+ if (fc->root_nodeid && fc->root_nodeid != FUSE_ROOT_ID)
+ seq_printf(m, ",root_nodeid=%llu", fc->root_nodeid);
}
#ifdef CONFIG_FUSE_DAX
if (fc->dax_mode == FUSE_DAX_ALWAYS)
@@ -1879,6 +1888,8 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
sb->s_flags |= SB_POSIXACL;
fc->default_permissions = ctx->default_permissions;
+ if (ctx->root_nodeid_present)
+ fc->root_nodeid = ctx->root_nodeid;
fc->allow_other = ctx->allow_other;
fc->user_id = ctx->user_id;
fc->group_id = ctx->group_id;
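
As an illustration of how a server might use the new option, the sketch below formats a mount option string that carries root_nodeid= next to the long-standing fd=, rootmode=, user_id= and group_id= options. The helper name demo_fuse_mount_opts is hypothetical, and treating 0 as "not set" is a choice of this sketch, not something the patch does.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Build the option string a fuse server could pass to mount(2).
 * root_nodeid= is only appended for a non-default root (not 0, not 1),
 * mirroring the condition fuse_show_options() uses in the patch.
 * Returns the snprintf() result.
 */
static int demo_fuse_mount_opts(char *buf, size_t len, int fd,
				unsigned int rootmode, unsigned int uid,
				unsigned int gid, uint64_t root_nodeid)
{
	if (root_nodeid == 0 || root_nodeid == 1)
		return snprintf(buf, len,
				"fd=%d,rootmode=%o,user_id=%u,group_id=%u",
				fd, rootmode, uid, gid);

	return snprintf(buf, len,
			"fd=%d,rootmode=%o,user_id=%u,group_id=%u,root_nodeid=%llu",
			fd, rootmode, uid, gid,
			(unsigned long long)root_nodeid);
}
```

The option is declared with fsparam_u64, so the value is formatted as a plain unsigned decimal here.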
* [PATCH 19/23] fuse: invalidate ranges of block devices being used for iomap
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (17 preceding siblings ...)
2025-08-21 0:57 ` [PATCH 18/23] fuse: allow setting of root nodeid Darrick J. Wong
@ 2025-08-21 0:57 ` Darrick J. Wong
2025-08-21 0:57 ` [PATCH 20/23] fuse: implement inline data file IO via iomap Darrick J. Wong
` (3 subsequent siblings)
22 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:57 UTC
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Make it easier to invalidate the page cache for a block device that is
being used in conjunction with iomap. This allows a fuse server to kill
all cached data for a block that is being freed, so that block reuse
doesn't result in file corruption. Right now, the only way to do this
is with fadvise, which ignores and doesn't wait for pages undergoing
writeback.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 4 ++++
fs/fuse/fuse_trace.h | 26 +++++++++++++++++++++++++
include/uapi/linux/fuse.h | 10 ++++++++++
fs/fuse/dev.c | 27 ++++++++++++++++++++++++++
fs/fuse/file_iomap.c | 47 +++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 114 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index a81138da1e55f6..362fa87241ac70 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1690,6 +1690,9 @@ int fuse_iomap_backing_close(struct fuse_conn *fc, struct fuse_backing *fb);
void fuse_iomap_mount(struct fuse_mount *fm);
void fuse_iomap_unmount(struct fuse_mount *fm);
+int fuse_iomap_dev_inval(struct fuse_conn *fc,
+ const struct fuse_iomap_dev_inval_out *arg);
+
int fuse_iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
u64 start, u64 length);
loff_t fuse_iomap_lseek(struct file *file, loff_t offset, int whence);
@@ -1742,6 +1745,7 @@ int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice);
# define fuse_iomap_backing_close(...) (-EOPNOTSUPP)
# define fuse_iomap_mount(...) ((void)0)
# define fuse_iomap_unmount(...) ((void)0)
+# define fuse_iomap_dev_inval(...) (-ENOSYS)
# define fuse_iomap_fiemap NULL
# define fuse_iomap_lseek(...) (-ENOSYS)
# define fuse_iomap_bmap(...) (-ENOSYS)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 1f2ff30bececd4..2f4c78ba498177 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -1045,6 +1045,32 @@ TRACE_EVENT(fuse_iomap_config,
__entry->time_min, __entry->time_max, __entry->maxbytes,
__entry->uuid_len)
);
+
+TRACE_EVENT(fuse_iomap_dev_inval,
+ TP_PROTO(const struct fuse_conn *fc,
+ const struct fuse_iomap_dev_inval_out *arg),
+ TP_ARGS(fc, arg),
+
+ TP_STRUCT__entry(
+ __field(dev_t, connection)
+ __field(int, dev)
+ __field(unsigned long long, offset)
+ __field(unsigned long long, length)
+ ),
+
+ TP_fast_assign(
+ __entry->connection = fc->dev;
+ __entry->dev = arg->dev;
+ __entry->offset = arg->offset;
+ __entry->length = arg->length;
+ ),
+
+ TP_printk("connection %u dev %d offset 0x%llx length 0x%llx",
+ __entry->connection,
+ __entry->dev,
+ __entry->offset,
+ __entry->length)
+);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 1a677e807c2846..1f8e3ba60e7ec5 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -241,6 +241,7 @@
* SEEK_{DATA,HOLE}, buffered I/O, and direct I/O
* - add FUSE_ATTR_IOMAP to enable iomap for specific inodes
* - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry
+ * - add FUSE_NOTIFY_IOMAP_DEV_INVAL to invalidate iomap bdev ranges
*/
#ifndef _LINUX_FUSE_H
@@ -691,6 +692,7 @@ enum fuse_notify_code {
FUSE_NOTIFY_DELETE = 6,
FUSE_NOTIFY_RESEND = 7,
FUSE_NOTIFY_INC_EPOCH = 8,
+ FUSE_NOTIFY_IOMAP_DEV_INVAL = 9,
FUSE_NOTIFY_CODE_MAX,
};
@@ -1453,4 +1455,12 @@ struct fuse_iomap_config_out {
int64_t s_maxbytes; /* max file size */
};
+struct fuse_iomap_dev_inval_out {
+ uint32_t dev; /* device cookie */
+ uint32_t reserved; /* zero */
+
+ uint64_t offset; /* range to invalidate pagecache, bytes */
+ uint64_t length;
+};
+
#endif /* _LINUX_FUSE_H */
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index d239946a46c463..575cb6e15d84d5 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1833,6 +1833,30 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size,
return err;
}
+static int fuse_notify_iomap_dev_inval(struct fuse_conn *fc, unsigned int size,
+ struct fuse_copy_state *cs)
+{
+ struct fuse_iomap_dev_inval_out outarg;
+ int err = -EINVAL;
+
+ if (size != sizeof(outarg))
+ goto err;
+
+ err = fuse_copy_one(cs, &outarg, sizeof(outarg));
+ if (err)
+ goto err;
+ if (outarg.reserved) {
+ err = -EINVAL;
+ goto err;
+ }
+ fuse_copy_finish(cs);
+
+ return fuse_iomap_dev_inval(fc, &outarg);
+err:
+ fuse_copy_finish(cs);
+ return err;
+}
+
struct fuse_retrieve_args {
struct fuse_args_pages ap;
struct fuse_notify_retrieve_in inarg;
@@ -2079,6 +2103,9 @@ static int fuse_notify(struct fuse_conn *fc, enum fuse_notify_code code,
case FUSE_NOTIFY_INC_EPOCH:
return fuse_notify_inc_epoch(fc);
+ case FUSE_NOTIFY_IOMAP_DEV_INVAL:
+ return fuse_notify_iomap_dev_inval(fc, size, cs);
+
default:
fuse_copy_finish(cs);
return -EINVAL;
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index c740fb1420bee0..1b389d7792e965 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1909,3 +1909,50 @@ int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice)
inode_unlock_shared(inode);
return ret;
}
+
+int fuse_iomap_dev_inval(struct fuse_conn *fc,
+ const struct fuse_iomap_dev_inval_out *arg)
+{
+ struct fuse_backing *fb;
+ struct block_device *bdev;
+ loff_t end;
+ int ret = 0;
+
+ trace_fuse_iomap_dev_inval(fc, arg);
+
+ if (!fc->iomap || arg->dev == FUSE_IOMAP_DEV_NULL)
+ return -EINVAL;
+
+ down_read(&fc->killsb);
+ fb = fuse_backing_lookup(fc, arg->dev);
+ if (!fb) {
+ ret = -ENODEV;
+ goto out_killsb;
+ }
+ if (!fb->iomap) {
+ ret = -ENODEV;
+ goto out_fb;
+ }
+ bdev = fb->bdev;
+
+ inode_lock(bdev->bd_mapping->host);
+ filemap_invalidate_lock(bdev->bd_mapping);
+
+ if (check_add_overflow(arg->offset, arg->length, &end) ||
+ arg->offset >= bdev_nr_bytes(bdev)) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ end = min(end, bdev_nr_bytes(bdev));
+ truncate_inode_pages_range(bdev->bd_mapping, arg->offset, end - 1);
+
+out_unlock:
+ filemap_invalidate_unlock(bdev->bd_mapping);
+ inode_unlock(bdev->bd_mapping->host);
+out_fb:
+ fuse_backing_put(fb);
+out_killsb:
+ up_read(&fc->killsb);
+ return ret;
+}
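
The range handling in fuse_iomap_dev_inval() can be modelled in userspace roughly as follows. This is a sketch only: demo_inval_range is a made-up name, the device size is passed in as a plain integer, and __builtin_add_overflow (a GCC/Clang builtin) stands in for the kernel's check_add_overflow().

```c
#include <errno.h>
#include <stdint.h>

/*
 * Reject ranges whose end does not fit in a signed 64-bit offset or that
 * start at or beyond the end of the device; otherwise clamp the end to the
 * device size. On success *last is the last byte (inclusive) to invalidate.
 * Zero-length ranges are not special-cased, as in the patch.
 */
static int demo_inval_range(uint64_t offset, uint64_t length,
			    uint64_t dev_bytes, uint64_t *last)
{
	int64_t end;	/* the kernel code uses a signed loff_t here */

	if (__builtin_add_overflow(offset, length, &end) ||
	    offset >= dev_bytes)
		return -EINVAL;

	if ((uint64_t)end > dev_bytes)
		end = (int64_t)dev_bytes;

	*last = (uint64_t)end - 1;
	return 0;
}
```

A request that runs past the end of the device is therefore trimmed rather than refused, while one that starts past the end fails with -EINVAL.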
* [PATCH 20/23] fuse: implement inline data file IO via iomap
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (18 preceding siblings ...)
2025-08-21 0:57 ` [PATCH 19/23] fuse: invalidate ranges of block devices being used for iomap Darrick J. Wong
@ 2025-08-21 0:57 ` Darrick J. Wong
2025-08-21 0:57 ` [PATCH 21/23] fuse: allow more statx fields Darrick J. Wong
` (2 subsequent siblings)
22 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:57 UTC
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Implement inline data file IO by issuing FUSE_READ/FUSE_WRITE commands
in response to an inline data mapping.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_trace.h | 45 +++++++++++++
fs/fuse/file_iomap.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 224 insertions(+)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 2f4c78ba498177..4ebd9a9e697ce2 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -234,6 +234,7 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
#if IS_ENABLED(CONFIG_FUSE_IOMAP)
struct iomap_writepage_ctx;
struct iomap_ioend;
+struct iomap;
/* tracepoint boilerplate so we don't have to keep doing this */
#define FUSE_IOMAP_OPFLAGS_FIELD \
@@ -1071,6 +1072,50 @@ TRACE_EVENT(fuse_iomap_dev_inval,
__entry->offset,
__entry->length)
);
+
+DECLARE_EVENT_CLASS(fuse_iomap_inline_class,
+ TP_PROTO(const struct inode *inode, loff_t pos, uint64_t count,
+ const struct iomap *map),
+ TP_ARGS(inode, pos, count, map),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ FUSE_IOMAP_MAP_FIELDS(map)
+ __field(bool, has_buf)
+ __field(uint64_t, validity_cookie)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = pos;
+ __entry->length = count;
+
+ __entry->mapdev = FUSE_IOMAP_DEV_NULL;
+ __entry->mapaddr = map->addr;
+ __entry->mapoffset = map->offset;
+ __entry->maplength = map->length;
+ __entry->maptype = map->type;
+ __entry->mapflags = map->flags;
+
+ __entry->has_buf = map->inline_data != NULL;
+ __entry->validity_cookie= map->validity_cookie;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() FUSE_IOMAP_MAP_FMT() " has_buf? %d cookie 0x%llx",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ FUSE_IOMAP_MAP_PRINTK_ARGS(map),
+ __entry->has_buf,
+ __entry->validity_cookie)
+);
+#define DEFINE_FUSE_IOMAP_INLINE_EVENT(name) \
+DEFINE_EVENT(fuse_iomap_inline_class, name, \
+ TP_PROTO(const struct inode *inode, loff_t pos, uint64_t count, \
+ const struct iomap *map), \
+ TP_ARGS(inode, pos, count, map))
+DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_inline_read);
+DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_inline_write);
+DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_iomap);
+DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_srcmap);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 1b389d7792e965..4c8fef25b0749b 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -421,6 +421,157 @@ fuse_iomap_find_dev(struct fuse_conn *fc, const struct fuse_iomap_io *map)
return ret;
}
+static inline int fuse_iomap_inline_alloc(struct iomap *iomap)
+{
+ ASSERT(iomap->inline_data == NULL);
+ ASSERT(iomap->length > 0);
+
+ iomap->inline_data = kvzalloc(iomap->length, GFP_KERNEL);
+ return iomap->inline_data ? 0 : -ENOMEM;
+}
+
+static inline void fuse_iomap_inline_free(struct iomap *iomap)
+{
+ kvfree(iomap->inline_data);
+ iomap->inline_data = NULL;
+}
+
+/*
+ * Use the FUSE_READ command to read inline file data from the fuse server.
+ * Note that there's no file handle attached, so the fuse server must be able
+ * to reconnect to the inode via the nodeid.
+ */
+static int fuse_iomap_inline_read(struct inode *inode, loff_t pos,
+ loff_t count, struct iomap *iomap)
+{
+ struct fuse_read_in in = {
+ .offset = pos,
+ .size = count,
+ };
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_mount *fm = get_fuse_mount(inode);
+ FUSE_ARGS(args);
+ ssize_t ret;
+
+ if (BAD_DATA(!iomap_inline_data_valid(iomap)))
+ return -EFSCORRUPTED;
+
+ trace_fuse_iomap_inline_read(inode, pos, count, iomap);
+
+ args.opcode = FUSE_READ;
+ args.nodeid = fi->nodeid;
+ args.in_numargs = 1;
+ args.in_args[0].size = sizeof(in);
+ args.in_args[0].value = &in;
+ args.out_argvar = true;
+ args.out_numargs = 1;
+ args.out_args[0].size = count;
+ args.out_args[0].value = iomap_inline_data(iomap, pos);
+
+ ret = fuse_simple_request(fm, &args);
+ if (ret < 0) {
+ fuse_iomap_inline_free(iomap);
+ return ret;
+ }
+ /* no readahead means something bad happened */
+ if (ret == 0) {
+ fuse_iomap_inline_free(iomap);
+ return -EIO;
+ }
+
+ return 0;
+}
+
+/*
+ * Use the FUSE_WRITE command to write inline file data to the fuse server.
+ * Note that there's no file handle attached, so the fuse server must be able
+ * to reconnect to the inode via the nodeid.
+ */
+static int fuse_iomap_inline_write(struct inode *inode, loff_t pos,
+ loff_t count, struct iomap *iomap)
+{
+ struct fuse_write_in in = {
+ .offset = pos,
+ .size = count,
+ };
+ struct fuse_write_out out = { };
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_mount *fm = get_fuse_mount(inode);
+ FUSE_ARGS(args);
+ ssize_t ret;
+
+ if (BAD_DATA(!iomap_inline_data_valid(iomap)))
+ return -EFSCORRUPTED;
+
+ trace_fuse_iomap_inline_write(inode, pos, count, iomap);
+
+ args.opcode = FUSE_WRITE;
+ args.nodeid = fi->nodeid;
+ args.in_numargs = 2;
+ args.in_args[0].size = sizeof(in);
+ args.in_args[0].value = &in;
+ args.in_args[1].size = count;
+ args.in_args[1].value = iomap_inline_data(iomap, pos);
+ args.out_numargs = 1;
+ args.out_args[0].size = sizeof(out);
+ args.out_args[0].value = &out;
+
+ ret = fuse_simple_request(fm, &args);
+ if (ret < 0) {
+ fuse_iomap_inline_free(iomap);
+ return ret;
+ }
+ /* short write means something bad happened */
+ if (out.size < count) {
+ fuse_iomap_inline_free(iomap);
+ return -EIO;
+ }
+
+ return 0;
+}
+
+/* Set up inline data buffers for iomap_begin */
+static int fuse_iomap_set_inline(struct inode *inode, unsigned opflags,
+ loff_t pos, loff_t count,
+ struct iomap *iomap, struct iomap *srcmap)
+{
+ int err;
+
+ if (opflags & IOMAP_REPORT)
+ return 0;
+
+ if (fuse_is_iomap_file_write(opflags)) {
+ if (iomap->type == IOMAP_INLINE) {
+ err = fuse_iomap_inline_alloc(iomap);
+ if (err)
+ return err;
+ }
+
+ if (srcmap->type == IOMAP_INLINE) {
+ err = fuse_iomap_inline_alloc(srcmap);
+ if (!err)
+ err = fuse_iomap_inline_read(inode, pos, count,
+ srcmap);
+ if (err) {
+ fuse_iomap_inline_free(iomap);
+ return err;
+ }
+ }
+ } else if (iomap->type == IOMAP_INLINE) {
+ /* inline data read */
+ err = fuse_iomap_inline_alloc(iomap);
+ if (!err)
+ err = fuse_iomap_inline_read(inode, pos, count, iomap);
+ if (err)
+ return err;
+ }
+
+ trace_fuse_iomap_set_inline_iomap(inode, pos, count, iomap);
+ trace_fuse_iomap_set_inline_srcmap(inode, pos, count, srcmap);
+
+ return 0;
+}
+
static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
unsigned opflags, struct iomap *iomap,
struct iomap *srcmap)
@@ -490,12 +641,20 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
fuse_iomap_from_server(inode, iomap, read_dev, &outarg.read);
}
+ if (iomap->type == IOMAP_INLINE || srcmap->type == IOMAP_INLINE) {
+ err = fuse_iomap_set_inline(inode, opflags, pos, count, iomap,
+ srcmap);
+ if (err)
+ goto out_write_dev;
+ }
+
/*
* XXX: if we ever want to support closing devices, we need a way to
* track the fuse_backing refcount all the way through bio endios.
* For now we put the refcount here because you can't remove an iomap
* device until unmount time.
*/
+out_write_dev:
fuse_backing_put(write_dev);
out_read_dev:
fuse_backing_put(read_dev);
@@ -534,8 +693,28 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count,
{
struct fuse_inode *fi = get_fuse_inode(inode);
struct fuse_mount *fm = get_fuse_mount(inode);
+ struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
+ struct iomap *srcmap = &iter->srcmap;
int err = 0;
+ if (srcmap->inline_data)
+ fuse_iomap_inline_free(srcmap);
+
+ if (iomap->inline_data) {
+ if (fuse_is_iomap_file_write(opflags) && written > 0) {
+ err = fuse_iomap_inline_write(inode, pos, written,
+ iomap);
+ fuse_iomap_inline_free(iomap);
+ if (err)
+ return err;
+ } else {
+ fuse_iomap_inline_free(iomap);
+ }
+
+ /* fuse server should already be aware of what happened */
+ return 0;
+ }
+
if (fuse_should_send_iomap_end(fm, iomap, opflags, count, written)) {
struct fuse_iomap_end_in inarg = {
.opflags = fuse_iomap_op_to_server(opflags),
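
The comments in this patch note that the inline FUSE_READ/FUSE_WRITE requests carry no file handle, so the server has to find the inode from the nodeid alone. A toy server-side sketch of that lookup is shown below. Everything in it (demo_inode, demo_find, the 60-byte inline area, the sample table) is hypothetical and only meant to show the idea.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory inode with inline data, keyed by nodeid. */
struct demo_inode {
	uint64_t nodeid;
	size_t size;		/* bytes of valid inline data */
	char data[60];		/* inline area; the size is arbitrary */
};

static struct demo_inode demo_inodes[] = {
	{ .nodeid = 12, .size = 5, .data = "hello" },
	{ .nodeid = 13, .size = 0 },
};

/* No file handle arrives with the request, so look up by nodeid. */
static struct demo_inode *demo_find(uint64_t nodeid)
{
	size_t i;

	for (i = 0; i < sizeof(demo_inodes) / sizeof(demo_inodes[0]); i++)
		if (demo_inodes[i].nodeid == nodeid)
			return &demo_inodes[i];
	return NULL;
}

/* Serve a read: bytes copied, 0 at or past EOF, -1 for an unknown nodeid. */
static long demo_inline_read(uint64_t nodeid, uint64_t off, size_t len,
			     void *buf)
{
	struct demo_inode *ino = demo_find(nodeid);

	if (!ino)
		return -1;
	if (off >= ino->size)
		return 0;
	if (len > ino->size - off)
		len = ino->size - off;
	memcpy(buf, ino->data + off, len);
	return (long)len;
}

/* Serve a write: bytes stored, short if the inline area is too small. */
static long demo_inline_write(uint64_t nodeid, uint64_t off, size_t len,
			      const void *buf)
{
	struct demo_inode *ino = demo_find(nodeid);

	if (!ino)
		return -1;
	if (off >= sizeof(ino->data))
		return 0;
	if (len > sizeof(ino->data) - off)
		len = sizeof(ino->data) - off;
	memcpy(ino->data + off, buf, len);
	if (off + len > ino->size)
		ino->size = off + len;
	return (long)len;
}
```

Note that the kernel side in this patch turns a zero-byte read reply and a short write into -EIO, so a real server would need to satisfy inline requests in full.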
* [PATCH 21/23] fuse: allow more statx fields
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (19 preceding siblings ...)
2025-08-21 0:57 ` [PATCH 20/23] fuse: implement inline data file IO via iomap Darrick J. Wong
@ 2025-08-21 0:57 ` Darrick J. Wong
2025-08-21 0:58 ` [PATCH 22/23] fuse: support atomic writes with iomap Darrick J. Wong
2025-08-21 0:58 ` [PATCH 23/23] fuse: enable iomap Darrick J. Wong
22 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:57 UTC
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Allow the fuse server to supply us with the more recently added fields
of struct statx.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 8 +++++
include/uapi/linux/fuse.h | 15 +++++++++
fs/fuse/dir.c | 73 ++++++++++++++++++++++++++++++++++++++-------
3 files changed, 84 insertions(+), 12 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 362fa87241ac70..4ca29315b2a434 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1677,6 +1677,14 @@ void fuse_iomap_sysfs_cleanup(struct kobject *kobj);
sector_t fuse_bmap(struct address_space *mapping, sector_t block);
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+int fuse_iomap_sysfs_init(struct kobject *kobj);
+void fuse_iomap_sysfs_cleanup(struct kobject *kobj);
+#else
+# define fuse_iomap_sysfs_init(...) (0)
+# define fuse_iomap_sysfs_cleanup(...) ((void)0)
+#endif
+
#if IS_ENABLED(CONFIG_FUSE_IOMAP)
bool fuse_iomap_enabled(void);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 1f8e3ba60e7ec5..cfeee8a280896a 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -335,7 +335,20 @@ struct fuse_statx {
uint32_t rdev_minor;
uint32_t dev_major;
uint32_t dev_minor;
- uint64_t __spare2[14];
+
+ uint64_t mnt_id;
+ uint32_t dio_mem_align;
+ uint32_t dio_offset_align;
+ uint64_t subvol;
+
+ uint32_t atomic_write_unit_min;
+ uint32_t atomic_write_unit_max;
+ uint32_t atomic_write_segments_max;
+ uint32_t dio_read_offset_align;
+ uint32_t atomic_write_unit_max_opt;
+ uint32_t __spare2[1];
+
+ uint64_t __spare3[8];
};
struct fuse_kstatfs {
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 02c8e705af1e35..305b926b4a589a 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1276,6 +1276,48 @@ static void fuse_statx_to_attr(struct fuse_statx *sx, struct fuse_attr *attr)
attr->blksize = sx->blksize;
}
+#define FUSE_SUPPORTED_STATX_MASK (STATX_BASIC_STATS | \
+ STATX_BTIME | \
+ STATX_DIOALIGN | \
+ STATX_SUBVOL | \
+ STATX_WRITE_ATOMIC)
+
+#define FUSE_UNCACHED_STATX_MASK (STATX_DIOALIGN | \
+ STATX_SUBVOL | \
+ STATX_WRITE_ATOMIC)
+
+static void kstat_from_fuse_statx(struct kstat *stat,
+ const struct fuse_statx *sx)
+{
+ stat->result_mask = sx->mask & FUSE_SUPPORTED_STATX_MASK;
+
+ stat->attributes = sx->attributes;
+ stat->attributes_mask = sx->attributes_mask;
+
+ if (sx->mask & STATX_BTIME) {
+ stat->btime.tv_sec = sx->btime.tv_sec;
+ stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, NSEC_PER_SEC - 1);
+ }
+
+ if (sx->mask & STATX_DIOALIGN) {
+ stat->dio_mem_align = sx->dio_mem_align;
+ stat->dio_offset_align = sx->dio_offset_align;
+ }
+
+ if (sx->mask & STATX_SUBVOL)
+ stat->subvol = sx->subvol;
+
+ if (sx->mask & STATX_WRITE_ATOMIC) {
+ stat->atomic_write_unit_min = sx->atomic_write_unit_min;
+ stat->atomic_write_unit_max = sx->atomic_write_unit_max;
+ stat->atomic_write_unit_max_opt = sx->atomic_write_unit_max_opt;
+ stat->atomic_write_segments_max = sx->atomic_write_segments_max;
+ }
+
+ if (sx->mask & STATX_DIO_READ_ALIGN)
+ stat->dio_read_offset_align = sx->dio_read_offset_align;
+}
+
static int fuse_do_statx(struct mnt_idmap *idmap, struct inode *inode,
struct file *file, struct kstat *stat)
{
@@ -1299,7 +1341,7 @@ static int fuse_do_statx(struct mnt_idmap *idmap, struct inode *inode,
}
/* For now leave sync hints as the default, request all stats. */
inarg.sx_flags = 0;
- inarg.sx_mask = STATX_BASIC_STATS | STATX_BTIME;
+ inarg.sx_mask = FUSE_SUPPORTED_STATX_MASK;
args.opcode = FUSE_STATX;
args.nodeid = get_node_id(inode);
args.in_numargs = 1;
@@ -1327,11 +1369,7 @@ static int fuse_do_statx(struct mnt_idmap *idmap, struct inode *inode,
}
if (stat) {
- stat->result_mask = sx->mask & (STATX_BASIC_STATS | STATX_BTIME);
- stat->btime.tv_sec = sx->btime.tv_sec;
- stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, NSEC_PER_SEC - 1);
- stat->attributes = sx->attributes;
- stat->attributes_mask = sx->attributes_mask;
+ kstat_from_fuse_statx(stat, sx);
fuse_fillattr(idmap, inode, &attr, stat);
stat->result_mask |= STATX_TYPE;
}
@@ -1396,16 +1434,29 @@ static int fuse_update_get_attr(struct mnt_idmap *idmap, struct inode *inode,
u32 inval_mask = READ_ONCE(fi->inval_mask);
u32 cache_mask = fuse_get_cache_mask(inode);
-
- /* FUSE only supports basic stats and possibly btime */
- request_mask &= STATX_BASIC_STATS | STATX_BTIME;
+ /* Only ask for supported stats */
+ request_mask &= FUSE_SUPPORTED_STATX_MASK;
retry:
if (fc->no_statx)
request_mask &= STATX_BASIC_STATS;
if (!request_mask)
sync = false;
- else if (flags & AT_STATX_FORCE_SYNC)
+ else if (request_mask & FUSE_UNCACHED_STATX_MASK) {
+ switch (flags & AT_STATX_SYNC_TYPE) {
+ case AT_STATX_DONT_SYNC:
+ request_mask &= ~FUSE_UNCACHED_STATX_MASK;
+ sync = false;
+ break;
+ case AT_STATX_FORCE_SYNC:
+ case AT_STATX_SYNC_AS_STAT:
+ sync = true;
+ break;
+ default:
+ WARN_ON(1);
+ break;
+ }
+ } else if (flags & AT_STATX_FORCE_SYNC)
sync = true;
else if (flags & AT_STATX_DONT_SYNC)
sync = false;
@@ -1416,7 +1467,7 @@ static int fuse_update_get_attr(struct mnt_idmap *idmap, struct inode *inode,
if (sync) {
forget_all_cached_acls(inode);
- /* Try statx if BTIME is requested */
+ /* Try statx if a field not covered by regular stat is wanted */
if (!fc->no_statx && (request_mask & ~STATX_BASIC_STATS)) {
err = fuse_do_statx(idmap, inode, file, stat);
if (err == -ENOSYS) {
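
The new branch in fuse_update_get_attr() for fields that fuse does not cache can be modelled as below. This is a sketch under assumptions: the DEMO_ constants repeat the values from the Linux uapi headers under different names to avoid clashes, demo_uncached_needs_sync is a made-up helper, and it only covers the case where the request mask already contains one of the uncached bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as in the Linux uapi headers, renamed for this sketch. */
#define DEMO_STATX_BASIC_STATS		0x000007ffU
#define DEMO_STATX_DIOALIGN		0x00002000U
#define DEMO_STATX_SUBVOL		0x00008000U
#define DEMO_STATX_WRITE_ATOMIC		0x00010000U

#define DEMO_AT_STATX_SYNC_TYPE		0x6000
#define DEMO_AT_STATX_SYNC_AS_STAT	0x0000
#define DEMO_AT_STATX_FORCE_SYNC	0x2000
#define DEMO_AT_STATX_DONT_SYNC		0x4000

#define DEMO_UNCACHED_MASK	(DEMO_STATX_DIOALIGN | \
				 DEMO_STATX_SUBVOL | \
				 DEMO_STATX_WRITE_ATOMIC)

/*
 * Decide whether a request for uncached fields needs a round trip to the
 * server. With DONT_SYNC the uncached bits are dropped from the request
 * and no sync happens; with the other two sync types the server is asked.
 */
static bool demo_uncached_needs_sync(uint32_t *request_mask, int flags)
{
	switch (flags & DEMO_AT_STATX_SYNC_TYPE) {
	case DEMO_AT_STATX_DONT_SYNC:
		*request_mask &= ~DEMO_UNCACHED_MASK;
		return false;
	case DEMO_AT_STATX_FORCE_SYNC:
	case DEMO_AT_STATX_SYNC_AS_STAT:
		return true;
	default:
		/* Both bits set; the patch only warns in this case. */
		return false;
	}
}
```

In other words, a caller passing AT_STATX_DONT_SYNC gets only what is cached, and the result simply will not report the DIO alignment, subvolume, or atomic write fields.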
* [PATCH 22/23] fuse: support atomic writes with iomap
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (20 preceding siblings ...)
2025-08-21 0:57 ` [PATCH 21/23] fuse: allow more statx fields Darrick J. Wong
@ 2025-08-21 0:58 ` Darrick J. Wong
2025-08-21 0:58 ` [PATCH 23/23] fuse: enable iomap Darrick J. Wong
22 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:58 UTC
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Support untorn (atomic) writes through iomap, limited to exactly one
whole filesystem block per write.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 9 ++++++++
fs/fuse/fuse_trace.h | 4 +++-
include/uapi/linux/fuse.h | 5 ++++
fs/fuse/file_iomap.c | 51 ++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 67 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 4ca29315b2a434..e72cc25c564048 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -243,6 +243,8 @@ enum {
FUSE_I_CACHE_IO_MODE,
/* Use iomap for this inode */
FUSE_I_IOMAP,
+ /* Enable untorn writes */
+ FUSE_I_ATOMIC,
};
struct fuse_conn;
@@ -1718,6 +1720,13 @@ static inline bool fuse_inode_has_iomap(const struct inode *inode)
return test_bit(FUSE_I_IOMAP, &fi->state);
}
+static inline bool fuse_inode_has_atomic(const struct inode *inode)
+{
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+
+ return test_bit(FUSE_I_ATOMIC, &fi->state);
+}
+
static inline bool fuse_want_iomap_directio(const struct kiocb *iocb)
{
return (iocb->ki_flags & IOCB_DIRECT) &&
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 4ebd9a9e697ce2..79de0e65608360 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -331,6 +331,7 @@ TRACE_DEFINE_ENUM(FUSE_I_BAD);
TRACE_DEFINE_ENUM(FUSE_I_BTIME);
TRACE_DEFINE_ENUM(FUSE_I_CACHE_IO_MODE);
TRACE_DEFINE_ENUM(FUSE_I_IOMAP);
+TRACE_DEFINE_ENUM(FUSE_I_ATOMIC);
#define FUSE_IFLAG_STRINGS \
{ 1 << FUSE_I_ADVISE_RDPLUS, "advise_rdplus" }, \
@@ -339,7 +340,8 @@ TRACE_DEFINE_ENUM(FUSE_I_IOMAP);
{ 1 << FUSE_I_BAD, "bad" }, \
{ 1 << FUSE_I_BTIME, "btime" }, \
{ 1 << FUSE_I_CACHE_IO_MODE, "cacheio" }, \
- { 1 << FUSE_I_IOMAP, "iomap" }
+ { 1 << FUSE_I_IOMAP, "iomap" }, \
+ { 1 << FUSE_I_ATOMIC, "atomic" }
#define IOMAP_IOEND_STRINGS \
{ IOMAP_IOEND_SHARED, "shared" }, \
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index cfeee8a280896a..70b5530e587d48 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -242,6 +242,7 @@
* - add FUSE_ATTR_IOMAP to enable iomap for specific inodes
* - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry
* - add FUSE_NOTIFY_IOMAP_DEV_INVAL to invalidate iomap bdev ranges
+ * - add FUSE_ATTR_ATOMIC for single-fsblock atomic write support
*/
#ifndef _LINUX_FUSE_H
@@ -597,10 +598,12 @@ struct fuse_file_lock {
* FUSE_ATTR_SUBMOUNT: Object is a submount root
* FUSE_ATTR_DAX: Enable DAX for this file in per inode DAX mode
* FUSE_ATTR_IOMAP: Use iomap for this inode
+ * FUSE_ATTR_ATOMIC: Enable untorn writes
*/
#define FUSE_ATTR_SUBMOUNT (1 << 0)
#define FUSE_ATTR_DAX (1 << 1)
#define FUSE_ATTR_IOMAP (1 << 2)
+#define FUSE_ATTR_ATOMIC (1 << 3)
/**
* Open flags
@@ -1153,6 +1156,8 @@ struct fuse_backing_map {
/* basic file I/O functionality through iomap */
#define FUSE_IOMAP_SUPPORT_FILEIO (1ULL << 0)
+/* untorn writes through iomap */
+#define FUSE_IOMAP_SUPPORT_ATOMIC (1ULL << 1)
struct fuse_iomap_support {
uint64_t flags;
uint64_t padding;
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 4c8fef25b0749b..ee199c1fd27b1f 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1122,6 +1122,8 @@ void fuse_iomap_open(struct inode *inode, struct file *file)
{
if (fuse_inode_has_iomap(inode))
file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
+ if (fuse_inode_has_atomic(inode))
+ file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
}
enum fuse_ilock_type {
@@ -1173,12 +1175,33 @@ static inline void fuse_inode_clear_iomap(struct inode *inode)
clear_bit(FUSE_I_IOMAP, &fi->state);
}
+static inline void fuse_inode_set_atomic(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ ASSERT(fuse_has_iomap(inode));
+
+ set_bit(FUSE_I_ATOMIC, &fi->state);
+}
+
+static inline void fuse_inode_clear_atomic(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ ASSERT(fuse_has_iomap(inode));
+
+ clear_bit(FUSE_I_ATOMIC, &fi->state);
+}
+
void fuse_iomap_init_inode(struct inode *inode, unsigned attr_flags)
{
struct fuse_conn *conn = get_fuse_conn(inode);
if (conn->iomap && (attr_flags & FUSE_ATTR_IOMAP))
fuse_inode_set_iomap(inode);
+ if (fuse_inode_has_iomap(inode) &&
+ (attr_flags & FUSE_ATTR_ATOMIC))
+ fuse_inode_set_atomic(inode);
trace_fuse_iomap_init_inode(inode);
}
@@ -1189,6 +1212,8 @@ void fuse_iomap_evict_inode(struct inode *inode)
if (fuse_inode_has_iomap(inode))
fuse_inode_clear_iomap(inode);
+ if (fuse_inode_has_atomic(inode))
+ fuse_inode_clear_atomic(inode);
}
ssize_t fuse_iomap_direct_read(struct kiocb *iocb, struct iov_iter *to)
@@ -1383,6 +1408,17 @@ fuse_iomap_write_checks(
return kiocb_modified(iocb);
}
+static inline ssize_t fuse_iomap_atomic_write_valid(struct kiocb *iocb,
+ struct iov_iter *from)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (iov_iter_count(from) != i_blocksize(inode))
+ return -EINVAL;
+
+ return generic_atomic_write_valid(iocb, from);
+}
+
ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
{
struct inode *inode = file_inode(iocb->ki_filp);
@@ -1399,6 +1435,12 @@ ssize_t fuse_iomap_direct_write(struct kiocb *iocb, struct iov_iter *from)
if (!count)
return 0;
+ if (iocb->ki_flags & IOCB_ATOMIC) {
+ ret = fuse_iomap_atomic_write_valid(iocb, from);
+ if (ret)
+ return ret;
+ }
+
/*
* direct I/O must be aligned to the fsblock size or we fall back to
* the old paths
@@ -1814,6 +1856,12 @@ ssize_t fuse_iomap_buffered_write(struct kiocb *iocb, struct iov_iter *from)
if (!iov_iter_count(from))
return 0;
+ if (iocb->ki_flags & IOCB_ATOMIC) {
+ ret = fuse_iomap_atomic_write_valid(iocb, from);
+ if (ret)
+ return ret;
+ }
+
ret = fuse_iomap_ilock_iocb(iocb, EXCL);
if (ret)
return ret;
@@ -2063,7 +2111,8 @@ int fuse_dev_ioctl_iomap_support(struct file *file,
struct fuse_iomap_support ios = { };
if (fuse_iomap_enabled())
- ios.flags = FUSE_IOMAP_SUPPORT_FILEIO;
+ ios.flags = FUSE_IOMAP_SUPPORT_FILEIO |
+ FUSE_IOMAP_SUPPORT_ATOMIC;
if (copy_to_user(argp, &ios, sizeof(ios)))
return -EFAULT;
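
Note on the write-side checks: the rules applied to an IOCB_ATOMIC write
in this patch can be modelled in userspace roughly as follows. This is
only a sketch. model_atomic_write_valid() and its parameters are
hypothetical names, not kernel API. Only the "length must equal the
fsblock size" rule is taken from the patch; the power-of-two, natural
alignment and direct-I/O rules are assumed to match what
generic_atomic_write_valid() enforces, and its single-user-buffer
requirement is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of this is kernel code. */
static bool is_pow2(uint64_t v)
{
	return v && !(v & (v - 1));
}

static int model_atomic_write_valid(uint64_t pos, uint64_t len,
				    uint32_t blocksize, bool direct)
{
	/* Precondition of the model: a filesystem always has a block size. */
	assert(blocksize != 0);

	/* From the patch: only single-fsblock untorn writes are accepted. */
	if (len != blocksize)
		return -EINVAL;

	/* Assumed generic checks: power-of-two length, natural alignment. */
	if (!is_pow2(len))
		return -EINVAL;
	if (pos % len)
		return -EINVAL;

	/* Assumed generic check: untorn writes must be direct I/O. */
	if (!direct)
		return -EOPNOTSUPP;

	return 0;
}
```

Under these assumptions a 4096-byte write at offset 4096 on a 4k-block
file passes, while an 8192-byte write or a write at offset 512 fails
with -EINVAL.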
* [PATCH 23/23] fuse: enable iomap
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (21 preceding siblings ...)
2025-08-21 0:58 ` [PATCH 22/23] fuse: support atomic writes with iomap Darrick J. Wong
@ 2025-08-21 0:58 ` Darrick J. Wong
22 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:58 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Remove the guard that we used to avoid bisection problems.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/file_iomap.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index ee199c1fd27b1f..3141518cc6e67d 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -104,9 +104,6 @@ void fuse_iomap_sysfs_cleanup(struct kobject *fuse_kobj)
bool fuse_iomap_enabled(void)
{
- /* Don't let anyone touch iomap until the end of the patchset. */
- return false;
-
/*
* There are fears that a fuse+iomap server could somehow DoS the
* system by doing things like going out to lunch during a writeback
* [PATCH 1/4] fuse: cache iomaps
2025-08-21 0:47 ` [PATCHSET RFC v4 3/4] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
@ 2025-08-21 0:58 ` Darrick J. Wong
2025-08-21 0:59 ` [PATCH 2/4] fuse: use the iomap cache for iomap_begin Darrick J. Wong
` (2 subsequent siblings)
3 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:58 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Cache each file's iomap mappings in the kernel so that we don't have to
upcall the server to look them up.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
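
Note on the validity counter: the im_seq field and the validity_cookie
returned by a cache lookup suggest the usual "sample a sequence number,
recheck it later" pattern. The sketch below is a userspace model of that
idea only. Every model_* name is hypothetical, the flat array stands in
for the extent tree, and the assumption that each cache change bumps the
counter is not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model of a per-inode mapping cache. */
struct model_map {
	uint64_t offset, length, addr;
};

struct model_cache {
	struct model_map maps[8];
	size_t nr;
	uint64_t seq;	/* plays the role of im_seq (assumed semantics) */
};

struct model_lookup {
	struct model_map map;		/* cached mapping */
	uint64_t validity_cookie;	/* seq sampled at lookup time */
};

enum model_result { MODEL_HIT, MODEL_MISS };

/* Add a mapping; any change to the cache bumps the counter. */
static void model_upsert(struct model_cache *c, struct model_map m)
{
	assert(c->nr < 8);
	c->maps[c->nr++] = m;
	c->seq++;
}

/* Drop everything; this also bumps the counter. */
static void model_invalidate_all(struct model_cache *c)
{
	c->nr = 0;
	c->seq++;
}

/* Find the mapping covering pos and remember the counter value. */
static enum model_result model_cache_lookup(const struct model_cache *c,
					    uint64_t pos,
					    struct model_lookup *out)
{
	for (size_t i = 0; i < c->nr; i++) {
		const struct model_map *m = &c->maps[i];

		if (pos >= m->offset && pos - m->offset < m->length) {
			out->map = *m;
			out->validity_cookie = c->seq;
			return MODEL_HIT;
		}
	}
	return MODEL_MISS;
}

/* A looked-up mapping is stale once the cache has changed under it. */
static bool model_still_valid(const struct model_cache *c,
			      const struct model_lookup *l)
{
	return l->validity_cookie == c->seq;
}
```

In this model a caller that gets MODEL_HIT can use the mapping without
asking the server, but must repeat the lookup if model_still_valid()
later returns false.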
fs/fuse/fuse_i.h | 37 +
fs/fuse/fuse_trace.h | 295 ++++++++
fs/fuse/iomap_priv.h | 135 ++++
include/uapi/linux/fuse.h | 5
fs/fuse/Makefile | 2
fs/fuse/file_iomap.c | 23 +
fs/fuse/iomap_cache.c | 1660 +++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 2150 insertions(+), 7 deletions(-)
create mode 100644 fs/fuse/iomap_cache.c
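
Note on tree block sizing: iomap_cache.c derives its fanout from a
256-byte block. The sketch below repeats that arithmetic and the linear
key scan used by fuse_iext_node_pos() in a standalone form. struct
model_io is an assumed 32-byte stand-in for struct fuse_iomap_io (whose
definition is not part of this patch), and all model_* names are
hypothetical. With that assumption and 64-bit pointers, an inner node
holds 16 keys and a leaf holds 7 records.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed record layout; the real struct fuse_iomap_io may differ. */
struct model_io {
	uint64_t offset, length, addr;
	uint16_t type, flags;
	uint32_t dev;
};

struct model_leaf;

enum {
	NODE_SIZE	= 256,
	KEYS_PER_NODE	= NODE_SIZE / (sizeof(uint64_t) + sizeof(void *)),
	RECS_PER_LEAF	= (NODE_SIZE - (2 * sizeof(struct model_leaf *))) /
			  sizeof(struct model_io),
};

#define MODEL_KEY_INVALID	(1ULL << 63)

/* Inner block: KEYS_PER_NODE keys, then as many child pointers. */
struct model_node {
	uint64_t keys[KEYS_PER_NODE];
	void *ptrs[KEYS_PER_NODE];
};

/* Leaf block: RECS_PER_LEAF records, then sibling pointers. */
struct model_leaf {
	struct model_io recs[RECS_PER_LEAF];
	struct model_leaf *prev, *next;
};

/* Both block types must fit in one NODE_SIZE allocation. */
static_assert(sizeof(struct model_node) <= NODE_SIZE, "node too big");
static_assert(sizeof(struct model_leaf) <= NODE_SIZE, "leaf too big");

/*
 * Return the index of the last key that is <= offset.  Unused slots
 * hold MODEL_KEY_INVALID, which compares greater than any file offset,
 * so the scan stops at the first unused slot.
 */
static int model_node_pos(const struct model_node *node, uint64_t offset)
{
	int i;

	for (i = 1; i < KEYS_PER_NODE; i++) {
		if (node->keys[i] > offset)
			break;
	}

	return i - 1;
}
```

The scan is linear rather than a binary search, which is cheap at this
fanout because a whole key array spans only a couple of cache lines.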
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e72cc25c564048..54b8aab94a9cd5 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -110,6 +110,24 @@ struct fuse_backing {
struct rcu_head rcu;
};
+#if IS_ENABLED(CONFIG_FUSE_IOMAP)
+/*
+ * File incore mapping information, present for each of the read & write forks.
+ */
+struct fuse_ifork {
+ int64_t if_bytes; /* bytes in if_data */
+ void *if_data; /* extent tree root */
+ int if_height; /* height of the extent tree */
+};
+
+struct fuse_iomap_cache {
+ struct fuse_ifork im_read;
+ struct fuse_ifork *im_write;
+ uint64_t im_seq; /* validity counter */
+ struct rw_semaphore im_lock; /* mapping lock */
+};
+#endif
+
/** FUSE inode */
struct fuse_inode {
/** Inode data */
@@ -175,6 +193,7 @@ struct fuse_inode {
spinlock_t ioend_lock;
struct work_struct ioend_work;
struct list_head ioend_list;
+ struct fuse_iomap_cache cache;
#endif
};
@@ -245,6 +264,11 @@ enum {
FUSE_I_IOMAP,
/* Enable untorn writes */
FUSE_I_ATOMIC,
+ /*
+ * Cache iomaps in the kernel. This is required for any filesystem
+ * that needs to synchronize pagecache write and writeback.
+ */
+ FUSE_I_IOMAP_CACHE,
};
struct fuse_conn;
@@ -1755,6 +1779,18 @@ int fuse_dev_ioctl_iomap_support(struct file *file,
struct fuse_iomap_support __user *argp);
int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice);
+
+static inline bool fuse_inode_caches_iomaps(const struct inode *inode)
+{
+ const struct fuse_inode *fi = get_fuse_inode_c(inode);
+
+ return test_bit(FUSE_I_IOMAP_CACHE, &fi->state);
+}
+
+enum fuse_iomap_iodir {
+ READ_MAPPING,
+ WRITE_MAPPING,
+};
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
@@ -1783,6 +1819,7 @@ int fuse_iomap_fadvise(struct file *file, loff_t start, loff_t end, int advice);
# define fuse_iomap_flush_unmap_range(...) (-ENOSYS)
# define fuse_dev_ioctl_iomap_support(...) (-EOPNOTSUPP)
# define fuse_iomap_fadvise NULL
+# define fuse_inode_caches_iomaps(...) (false)
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 79de0e65608360..eb604eaf3bafad 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -235,6 +235,8 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
struct iomap_writepage_ctx;
struct iomap_ioend;
struct iomap;
+struct fuse_iext_cursor;
+struct fuse_iomap_lookup;
/* tracepoint boilerplate so we don't have to keep doing this */
#define FUSE_IOMAP_OPFLAGS_FIELD \
@@ -265,6 +267,16 @@ struct iomap;
__entry->prefix##addr, \
__print_flags(__entry->prefix##flags, "|", FUSE_IOMAP_F_STRINGS)
+#define FUSE_IOMAP_IODIR_FIELD \
+ __field(enum fuse_iomap_iodir, iodir)
+
+#define FUSE_IOMAP_IODIR_FMT \
+ " iodir %s"
+
+#define FUSE_IOMAP_IODIR_PRINTK_ARGS \
+ __print_symbolic(__entry->iodir, FUSE_IOMAP_FORK_STRINGS)
+
+
/* combinations of boilerplate to reduce typing further */
#define FUSE_IOMAP_OP_FIELDS(prefix) \
FUSE_INODE_FIELDS \
@@ -332,6 +344,7 @@ TRACE_DEFINE_ENUM(FUSE_I_BTIME);
TRACE_DEFINE_ENUM(FUSE_I_CACHE_IO_MODE);
TRACE_DEFINE_ENUM(FUSE_I_IOMAP);
TRACE_DEFINE_ENUM(FUSE_I_ATOMIC);
+TRACE_DEFINE_ENUM(FUSE_I_IOMAP_CACHE);
#define FUSE_IFLAG_STRINGS \
{ 1 << FUSE_I_ADVISE_RDPLUS, "advise_rdplus" }, \
@@ -341,7 +354,8 @@ TRACE_DEFINE_ENUM(FUSE_I_ATOMIC);
{ 1 << FUSE_I_BTIME, "btime" }, \
{ 1 << FUSE_I_CACHE_IO_MODE, "cacheio" }, \
{ 1 << FUSE_I_IOMAP, "iomap" }, \
- { 1 << FUSE_I_ATOMIC, "atomic" }
+ { 1 << FUSE_I_ATOMIC, "atomic" }, \
+ { 1 << FUSE_I_IOMAP_CACHE, "iomap_cache" }
#define IOMAP_IOEND_STRINGS \
{ IOMAP_IOEND_SHARED, "shared" }, \
@@ -357,6 +371,22 @@ TRACE_DEFINE_ENUM(FUSE_I_ATOMIC);
{ FUSE_IOMAP_CONFIG_TIME, "time" }, \
{ FUSE_IOMAP_CONFIG_MAXBYTES, "maxbytes" }
+TRACE_DEFINE_ENUM(READ_MAPPING);
+TRACE_DEFINE_ENUM(WRITE_MAPPING);
+
+#define FUSE_IOMAP_FORK_STRINGS \
+ { READ_MAPPING, "read" }, \
+ { WRITE_MAPPING, "write" }
+
+#define FUSE_IEXT_STATE_STRINGS \
+ { FUSE_IEXT_LEFT_CONTIG, "l_cont" }, \
+ { FUSE_IEXT_RIGHT_CONTIG, "r_cont" }, \
+ { FUSE_IEXT_LEFT_FILLING, "l_fill" }, \
+ { FUSE_IEXT_RIGHT_FILLING, "r_fill" }, \
+ { FUSE_IEXT_LEFT_VALID, "l_valid" }, \
+ { FUSE_IEXT_RIGHT_VALID, "r_valid" }, \
+ { FUSE_IEXT_WRITE_MAPPING, "write" }
+
DECLARE_EVENT_CLASS(fuse_iomap_check_class,
TP_PROTO(const char *func, int line, const char *condition),
@@ -1118,6 +1148,269 @@ DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_inline_read);
DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_inline_write);
DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_iomap);
DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_srcmap);
+
+DECLARE_EVENT_CLASS(fuse_iext_class,
+ TP_PROTO(const struct inode *inode, const struct fuse_iext_cursor *cur,
+ int state, unsigned long caller_ip),
+
+ TP_ARGS(inode, cur, state, caller_ip),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ FUSE_IOMAP_MAP_FIELDS(map)
+ __field(void *, leaf)
+ __field(int, pos)
+ __field(int, iext_state)
+ __field(unsigned long, caller_ip)
+ ),
+ TP_fast_assign(
+ const struct fuse_ifork *ifp;
+ struct fuse_iomap_io r = { };
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+
+ if (state & FUSE_IEXT_WRITE_MAPPING)
+ ifp = fi->cache.im_write;
+ else
+ ifp = &fi->cache.im_read;
+ if (ifp)
+ fuse_iext_get_extent(ifp, cur, &r);
+
+ __entry->mapoffset = r.offset;
+ __entry->mapaddr = r.addr;
+ __entry->maplength = r.length;
+ __entry->mapdev = r.dev;
+ __entry->maptype = r.type;
+ __entry->mapflags = r.flags;
+
+ __entry->leaf = cur->leaf;
+ __entry->pos = cur->pos;
+
+ __entry->iext_state = state;
+ __entry->caller_ip = caller_ip;
+ ),
+ TP_printk(FUSE_INODE_FMT " state (%s) cur %p/%d " FUSE_IOMAP_MAP_FMT() " caller %pS",
+ FUSE_INODE_PRINTK_ARGS,
+ __print_flags(__entry->iext_state, "|", FUSE_IEXT_STATE_STRINGS),
+ __entry->leaf,
+ __entry->pos,
+ FUSE_IOMAP_MAP_PRINTK_ARGS(map),
+ (void *)__entry->caller_ip)
+)
+
+#define DEFINE_IEXT_EVENT(name) \
+DEFINE_EVENT(fuse_iext_class, name, \
+ TP_PROTO(const struct inode *inode, const struct fuse_iext_cursor *cur, \
+ int state, unsigned long caller_ip), \
+ TP_ARGS(inode, cur, state, caller_ip))
+DEFINE_IEXT_EVENT(fuse_iext_insert);
+DEFINE_IEXT_EVENT(fuse_iext_remove);
+DEFINE_IEXT_EVENT(fuse_iext_pre_update);
+DEFINE_IEXT_EVENT(fuse_iext_post_update);
+
+DECLARE_EVENT_CLASS(fuse_iext_update_class,
+ TP_PROTO(const struct inode *inode, uint32_t iext_state,
+ const struct fuse_iomap_io *map),
+ TP_ARGS(inode, iext_state, map),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ FUSE_IOMAP_MAP_FIELDS(map)
+ __field(uint32_t, iext_state)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->mapoffset = map->offset;
+ __entry->maplength = map->length;
+ __entry->maptype = map->type;
+ __entry->mapflags = map->flags;
+ __entry->mapdev = map->dev;
+ __entry->mapaddr = map->addr;
+
+ __entry->iext_state = iext_state;
+ ),
+
+ TP_printk(FUSE_INODE_FMT " state (%s)" FUSE_IOMAP_MAP_FMT(),
+ FUSE_INODE_PRINTK_ARGS,
+ __print_flags(__entry->iext_state, "|", FUSE_IEXT_STATE_STRINGS),
+ FUSE_IOMAP_MAP_PRINTK_ARGS(map))
+);
+#define DEFINE_IEXT_UPDATE_EVENT(name) \
+DEFINE_EVENT(fuse_iext_update_class, name, \
+ TP_PROTO(const struct inode *inode, uint32_t iext_state, \
+ const struct fuse_iomap_io *map), \
+ TP_ARGS(inode, iext_state, map))
+DEFINE_IEXT_UPDATE_EVENT(fuse_iext_del_mapping);
+DEFINE_IEXT_UPDATE_EVENT(fuse_iext_add_mapping);
+
+DECLARE_EVENT_CLASS(fuse_iext_alt_update_class,
+ TP_PROTO(const struct inode *inode, const struct fuse_iomap_io *map),
+ TP_ARGS(inode, map),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ FUSE_IOMAP_MAP_FIELDS(map)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+
+ __entry->mapoffset = map->offset;
+ __entry->maplength = map->length;
+ __entry->maptype = map->type;
+ __entry->mapflags = map->flags;
+ __entry->mapdev = map->dev;
+ __entry->mapaddr = map->addr;
+ ),
+
+ TP_printk(FUSE_INODE_FMT FUSE_IOMAP_MAP_FMT(),
+ FUSE_INODE_PRINTK_ARGS,
+ FUSE_IOMAP_MAP_PRINTK_ARGS(map))
+);
+#define DEFINE_IEXT_ALT_UPDATE_EVENT(name) \
+DEFINE_EVENT(fuse_iext_alt_update_class, name, \
+ TP_PROTO(const struct inode *inode, const struct fuse_iomap_io *map), \
+ TP_ARGS(inode, map))
+DEFINE_IEXT_ALT_UPDATE_EVENT(fuse_iext_del_mapping_got);
+DEFINE_IEXT_ALT_UPDATE_EVENT(fuse_iext_add_mapping_left);
+DEFINE_IEXT_ALT_UPDATE_EVENT(fuse_iext_add_mapping_right);
+
+TRACE_EVENT(fuse_iomap_cache_remove,
+ TP_PROTO(const struct inode *inode, enum fuse_iomap_iodir iodir,
+ loff_t offset, uint64_t length, unsigned long caller_ip),
+ TP_ARGS(inode, iodir, offset, length, caller_ip),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ FUSE_IOMAP_IODIR_FIELD
+ __field(unsigned long, caller_ip)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->iodir = iodir;
+ __entry->offset = offset;
+ __entry->length = length;
+ __entry->caller_ip = caller_ip;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() FUSE_IOMAP_IODIR_FMT " caller %pS",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ FUSE_IOMAP_IODIR_PRINTK_ARGS,
+ (void *)__entry->caller_ip)
+);
+
+DECLARE_EVENT_CLASS(fuse_iomap_cached_mapping_class,
+ TP_PROTO(const struct inode *inode, enum fuse_iomap_iodir iodir,
+ const struct fuse_iomap_io *map, unsigned long caller_ip),
+ TP_ARGS(inode, iodir, map, caller_ip),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ FUSE_IOMAP_IODIR_FIELD
+ FUSE_IOMAP_MAP_FIELDS(map)
+ __field(unsigned long, caller_ip)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->iodir = iodir;
+
+ __entry->mapoffset = map->offset;
+ __entry->maplength = map->length;
+ __entry->maptype = map->type;
+ __entry->mapflags = map->flags;
+ __entry->mapdev = map->dev;
+ __entry->mapaddr = map->addr;
+
+ __entry->caller_ip = caller_ip;
+ ),
+
+ TP_printk(FUSE_INODE_FMT FUSE_IOMAP_IODIR_FMT FUSE_IOMAP_MAP_FMT() " caller %pS",
+ FUSE_INODE_PRINTK_ARGS,
+ FUSE_IOMAP_IODIR_PRINTK_ARGS,
+ FUSE_IOMAP_MAP_PRINTK_ARGS(map),
+ (void *)__entry->caller_ip)
+);
+#define DEFINE_FUSE_IOMAP_CACHED_MAPPING_EVENT(name) \
+DEFINE_EVENT(fuse_iomap_cached_mapping_class, name, \
+ TP_PROTO(const struct inode *inode, enum fuse_iomap_iodir iodir, \
+ const struct fuse_iomap_io *map, unsigned long caller_ip), \
+ TP_ARGS(inode, iodir, map, caller_ip))
+DEFINE_FUSE_IOMAP_CACHED_MAPPING_EVENT(fuse_iomap_cache_add);
+DEFINE_FUSE_IOMAP_CACHED_MAPPING_EVENT(fuse_iext_check_mapping);
+
+TRACE_EVENT(fuse_iomap_cache_lookup,
+ TP_PROTO(const struct inode *inode, enum fuse_iomap_iodir iodir,
+ loff_t pos, uint64_t count, unsigned long caller_ip),
+ TP_ARGS(inode, iodir, pos, count, caller_ip),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ FUSE_IOMAP_IODIR_FIELD
+ __field(unsigned long, caller_ip)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->iodir = iodir;
+ __entry->offset = pos;
+ __entry->length = count;
+ __entry->caller_ip = caller_ip;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() FUSE_IOMAP_IODIR_FMT " caller %pS",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ FUSE_IOMAP_IODIR_PRINTK_ARGS,
+ (void *)__entry->caller_ip)
+);
+
+TRACE_EVENT(fuse_iomap_cache_lookup_result,
+ TP_PROTO(const struct inode *inode, enum fuse_iomap_iodir iodir,
+ loff_t pos, uint64_t count, const struct fuse_iomap_io *got,
+ const struct fuse_iomap_lookup *map),
+ TP_ARGS(inode, iodir, pos, count, got, map),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+
+ FUSE_IOMAP_MAP_FIELDS(got)
+ FUSE_IOMAP_MAP_FIELDS(map)
+
+ FUSE_IOMAP_IODIR_FIELD
+ __field(uint64_t, validity_cookie)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->iodir = iodir;
+ __entry->offset = pos;
+ __entry->length = count;
+
+ __entry->gotoffset = got->offset;
+ __entry->gotlength = got->length;
+ __entry->gottype = got->type;
+ __entry->gotflags = got->flags;
+ __entry->gotdev = got->dev;
+ __entry->gotaddr = got->addr;
+
+ __entry->mapoffset = map->map.offset;
+ __entry->maplength = map->map.length;
+ __entry->maptype = map->map.type;
+ __entry->mapflags = map->map.flags;
+ __entry->mapdev = map->map.dev;
+ __entry->mapaddr = map->map.addr;
+
+ __entry->validity_cookie = map->validity_cookie;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT() FUSE_IOMAP_IODIR_FMT FUSE_IOMAP_MAP_FMT("map") FUSE_IOMAP_MAP_FMT("got") " cookie 0x%llx",
+ FUSE_IO_RANGE_PRINTK_ARGS(),
+ FUSE_IOMAP_IODIR_PRINTK_ARGS,
+ FUSE_IOMAP_MAP_PRINTK_ARGS(map),
+ FUSE_IOMAP_MAP_PRINTK_ARGS(got),
+ __entry->validity_cookie)
+);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/iomap_priv.h b/fs/fuse/iomap_priv.h
index 7002eb38f87fe1..8e4a32879025a4 100644
--- a/fs/fuse/iomap_priv.h
+++ b/fs/fuse/iomap_priv.h
@@ -1,5 +1,9 @@
// SPDX-License-Identifier: GPL-2.0
/*
+ * The fuse_iext code comes from xfs_iext_tree.[ch] and is:
+ * Copyright (c) 2017 Christoph Hellwig.
+ *
+ * Everything else is:
* Copyright (C) 2025 Oracle. All Rights Reserved.
* Author: Darrick J. Wong <djwong@kernel.org>
*/
@@ -40,13 +44,134 @@ while (static_branch_unlikely(&fuse_iomap_debug)) { \
})
#endif /* CONFIG_FUSE_IOMAP_DEBUG */
-enum fuse_iomap_iodir {
- READ_MAPPING,
- WRITE_MAPPING,
-};
-
#define EFSCORRUPTED EUCLEAN
+void fuse_iomap_cache_lock(struct inode *inode);
+void fuse_iomap_cache_unlock(struct inode *inode);
+void fuse_iomap_cache_lock_shared(struct inode *inode);
+void fuse_iomap_cache_unlock_shared(struct inode *inode);
+
+struct fuse_iext_leaf;
+
+struct fuse_iext_cursor {
+ struct fuse_iext_leaf *leaf;
+ int pos;
+};
+
+#define FUSE_IEXT_LEFT_CONTIG (1u << 0)
+#define FUSE_IEXT_RIGHT_CONTIG (1u << 1)
+#define FUSE_IEXT_LEFT_FILLING (1u << 2)
+#define FUSE_IEXT_RIGHT_FILLING (1u << 3)
+#define FUSE_IEXT_LEFT_VALID (1u << 4)
+#define FUSE_IEXT_RIGHT_VALID (1u << 5)
+#define FUSE_IEXT_WRITE_MAPPING (1u << 6)
+
+struct fuse_ifork *fuse_iext_state_to_fork(struct fuse_iomap_cache *ip,
+ unsigned int state);
+
+uint64_t fuse_iext_count(const struct fuse_ifork *ifp);
+void fuse_iext_insert_raw(struct fuse_iomap_cache *ip,
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur,
+ const struct fuse_iomap_io *irec);
+void fuse_iext_insert(struct fuse_iomap_cache *,
+ struct fuse_iext_cursor *cur,
+ const struct fuse_iomap_io *, int);
+void fuse_iext_remove(struct fuse_iomap_cache *,
+ struct fuse_iext_cursor *,
+ int);
+void fuse_iext_destroy(struct fuse_ifork *);
+
+bool fuse_iext_lookup_extent(struct fuse_iomap_cache *ip,
+ struct fuse_ifork *ifp, loff_t bno,
+ struct fuse_iext_cursor *cur,
+ struct fuse_iomap_io *gotp);
+bool fuse_iext_lookup_extent_before(struct fuse_iomap_cache *ip,
+ struct fuse_ifork *ifp, loff_t *end,
+ struct fuse_iext_cursor *cur,
+ struct fuse_iomap_io *gotp);
+bool fuse_iext_get_extent(const struct fuse_ifork *ifp,
+ const struct fuse_iext_cursor *cur,
+ struct fuse_iomap_io *gotp);
+void fuse_iext_update_extent(struct fuse_iomap_cache *ip, int state,
+ struct fuse_iext_cursor *cur,
+ struct fuse_iomap_io *gotp);
+
+void fuse_iext_first(struct fuse_ifork *, struct fuse_iext_cursor *);
+void fuse_iext_last(struct fuse_ifork *, struct fuse_iext_cursor *);
+void fuse_iext_next(struct fuse_ifork *, struct fuse_iext_cursor *);
+void fuse_iext_prev(struct fuse_ifork *, struct fuse_iext_cursor *);
+
+static inline bool fuse_iext_next_extent(struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur, struct fuse_iomap_io *gotp)
+{
+ fuse_iext_next(ifp, cur);
+ return fuse_iext_get_extent(ifp, cur, gotp);
+}
+
+static inline bool fuse_iext_prev_extent(struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur, struct fuse_iomap_io *gotp)
+{
+ fuse_iext_prev(ifp, cur);
+ return fuse_iext_get_extent(ifp, cur, gotp);
+}
+
+/*
+ * Return the extent after cur in gotp without updating the cursor.
+ */
+static inline bool fuse_iext_peek_next_extent(struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur, struct fuse_iomap_io *gotp)
+{
+ struct fuse_iext_cursor ncur = *cur;
+
+ fuse_iext_next(ifp, &ncur);
+ return fuse_iext_get_extent(ifp, &ncur, gotp);
+}
+
+/*
+ * Return the extent before cur in gotp without updating the cursor.
+ */
+static inline bool fuse_iext_peek_prev_extent(struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur, struct fuse_iomap_io *gotp)
+{
+ struct fuse_iext_cursor ncur = *cur;
+
+ fuse_iext_prev(ifp, &ncur);
+ return fuse_iext_get_extent(ifp, &ncur, gotp);
+}
+
+#define for_each_fuse_iext(ifp, ext, got) \
+ for (fuse_iext_first((ifp), (ext)); \
+ fuse_iext_get_extent((ifp), (ext), (got)); \
+ fuse_iext_next((ifp), (ext)))
+
+static inline uint64_t fuse_iext_read_seq(struct fuse_iomap_cache *ip)
+{
+ return (uint64_t)READ_ONCE(ip->im_seq);
+}
+
+int fuse_iomap_cache_remove(struct inode *inode, enum fuse_iomap_iodir iodir,
+ loff_t off, uint64_t len);
+
+int fuse_iomap_cache_upsert(struct inode *inode, enum fuse_iomap_iodir iodir,
+ const struct fuse_iomap_io *map);
+
+enum fuse_iomap_lookup_result {
+ LOOKUP_HIT,
+ LOOKUP_MISS,
+ LOOKUP_NOFORK,
+};
+
+struct fuse_iomap_lookup {
+ struct fuse_iomap_io map; /* cached mapping */
+ uint64_t validity_cookie; /* used with .iomap_valid() */
+};
+
+enum fuse_iomap_lookup_result
+fuse_iomap_cache_lookup(struct inode *inode, enum fuse_iomap_iodir iodir,
+ loff_t off, uint64_t len,
+ struct fuse_iomap_lookup *mval);
+
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _FS_FUSE_IOMAP_PRIV_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 70b5530e587d48..df23eb65f0b497 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1344,6 +1344,8 @@ struct fuse_uring_cmd_req {
/* fuse-specific mapping type indicating that writes use the read mapping */
#define FUSE_IOMAP_TYPE_PURE_OVERWRITE (255)
+/* fuse-specific mapping type saying the server has populated the cache */
+#define FUSE_IOMAP_TYPE_RETRY_CACHE (254)
#define FUSE_IOMAP_DEV_NULL (0U) /* null device cookie */
@@ -1481,4 +1483,7 @@ struct fuse_iomap_dev_inval_out {
uint64_t length;
};
+/* invalidate all cached iomap mappings up to EOF */
+#define FUSE_IOMAP_INVAL_TO_EOF (~0ULL)
+
#endif /* _LINUX_FUSE_H */
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 27be39317701d6..e3ed1da6cfb6e7 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -18,6 +18,6 @@ fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
fuse-$(CONFIG_FUSE_BACKING) += backing.o
fuse-$(CONFIG_SYSCTL) += sysctl.o
fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
-fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o
+fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o iomap_cache.o
virtiofs-y := virtio_fs.o
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 3141518cc6e67d..545798f0d915a1 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -1190,6 +1190,21 @@ static inline void fuse_inode_clear_atomic(struct inode *inode)
clear_bit(FUSE_I_ATOMIC, &fi->state);
}
+static inline void fuse_iomap_clear_cache(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ ASSERT(fuse_has_iomap(inode));
+
+ clear_bit(FUSE_I_IOMAP_CACHE, &fi->state);
+
+ fuse_iext_destroy(&fi->cache.im_read);
+ if (fi->cache.im_write) {
+ fuse_iext_destroy(fi->cache.im_write);
+ kfree(fi->cache.im_write);
+ }
+}
+
void fuse_iomap_init_inode(struct inode *inode, unsigned attr_flags)
{
struct fuse_conn *conn = get_fuse_conn(inode);
@@ -1207,6 +1222,8 @@ void fuse_iomap_evict_inode(struct inode *inode)
{
trace_fuse_iomap_evict_inode(inode);
+ if (fuse_inode_caches_iomaps(inode))
+ fuse_iomap_clear_cache(inode);
if (fuse_inode_has_iomap(inode))
fuse_inode_clear_iomap(inode);
if (fuse_inode_has_atomic(inode))
@@ -1766,6 +1783,12 @@ static inline void fuse_inode_set_iomap(struct inode *inode)
min_order = inode->i_blkbits - PAGE_SHIFT;
mapping_set_folio_min_order(inode->i_mapping, min_order);
+
+ memset(&fi->cache.im_read, 0, sizeof(fi->cache.im_read));
+ fi->cache.im_seq = 0;
+ fi->cache.im_write = NULL;
+
+ init_rwsem(&fi->cache.im_lock);
set_bit(FUSE_I_IOMAP, &fi->state);
}
diff --git a/fs/fuse/iomap_cache.c b/fs/fuse/iomap_cache.c
new file mode 100644
index 00000000000000..5bfa0e26346d1f
--- /dev/null
+++ b/fs/fuse/iomap_cache.c
@@ -0,0 +1,1660 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * fuse_iext* code adapted from xfs_iext_tree.c:
+ * Copyright (c) 2017 Christoph Hellwig.
+ *
+ * fuse_iomap_cache*lock* code adapted from xfs_inode.c:
+ * Copyright (c) 2000-2006 Silicon Graphics, Inc.
+ * All Rights Reserved.
+ *
+ * Copyright (C) 2025 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "fuse_i.h"
+#include "iomap_priv.h"
+#include "fuse_trace.h"
+#include <linux/iomap.h>
+
+/* maximum length of a mapping that we're willing to cache */
+#define FUSE_IOMAP_MAX_LEN ((loff_t)(1ULL << 63))
+
+void fuse_iomap_cache_lock_shared(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_iomap_cache *ip = &fi->cache;
+
+ down_read(&ip->im_lock);
+}
+
+void fuse_iomap_cache_unlock_shared(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_iomap_cache *ip = &fi->cache;
+
+ up_read(&ip->im_lock);
+}
+
+void fuse_iomap_cache_lock(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_iomap_cache *ip = &fi->cache;
+
+ down_write(&ip->im_lock);
+}
+
+void fuse_iomap_cache_unlock(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_iomap_cache *ip = &fi->cache;
+
+ up_write(&ip->im_lock);
+}
+
+static inline void assert_cache_locked_shared(struct fuse_iomap_cache *ip)
+{
+ rwsem_assert_held(&ip->im_lock);
+}
+
+static inline void assert_cache_locked(struct fuse_iomap_cache *ip)
+{
+ rwsem_assert_held_write_nolockdep(&ip->im_lock);
+}
+
+static inline struct fuse_inode *FUSE_I(struct fuse_iomap_cache *ip)
+{
+ return container_of(ip, struct fuse_inode, cache);
+}
+
+static inline struct inode *VFS_I(struct fuse_iomap_cache *ip)
+{
+ struct fuse_inode *fi = FUSE_I(ip);
+
+ return &fi->inode;
+}
+
+static inline uint32_t
+fuse_iomap_fork_to_state(const struct fuse_iomap_cache *ip,
+ const struct fuse_ifork *ifp)
+{
+ ASSERT(ifp == ip->im_write || ifp == &ip->im_read);
+
+ if (ifp == ip->im_write)
+ return FUSE_IEXT_WRITE_MAPPING;
+ return 0;
+}
+
+/* Convert iext state flags to a mapping fork. */
+struct fuse_ifork *
+fuse_iext_state_to_fork(
+ struct fuse_iomap_cache *ip,
+ unsigned int state)
+{
+ if (state & FUSE_IEXT_WRITE_MAPPING)
+ return ip->im_write;
+ return &ip->im_read;
+}
+
+/* The internal iext tree record is a struct fuse_iomap_io */
+
+static bool fuse_iext_rec_is_empty(const struct fuse_iomap_io *rec)
+{
+ return rec->length == 0;
+}
+
+static inline void fuse_iext_rec_clear(struct fuse_iomap_io *rec)
+{
+ memset(rec, 0, sizeof(*rec));
+}
+
+static void
+fuse_iext_set(
+ struct fuse_iomap_io *rec,
+ const struct fuse_iomap_io *irec)
+{
+ ASSERT(irec->length > 0);
+
+ *rec = *irec;
+}
+
+static void
+fuse_iext_get(
+ struct fuse_iomap_io *irec,
+ const struct fuse_iomap_io *rec)
+{
+ *irec = *rec;
+}
+
+enum {
+ NODE_SIZE = 256,
+ KEYS_PER_NODE = NODE_SIZE / (sizeof(uint64_t) + sizeof(void *)),
+ RECS_PER_LEAF = (NODE_SIZE - (2 * sizeof(struct fuse_iext_leaf *))) /
+ sizeof(struct fuse_iomap_io),
+};
+
+/*
+ * In-core extent btree block layout:
+ *
+ * There are two types of blocks in the btree: leaf and inner (non-leaf) blocks.
+ *
+ * The leaf blocks are made up of %RECS_PER_LEAF mapping records, each of which
+ * is a struct fuse_iomap_io (offset, length, addr, type, flags and dev). The
+ * records are followed by pointers to the previous and next leaf blocks (if
+ * there are any).
+ *
+ * The inner (non-leaf) blocks first contain KEYS_PER_NODE lookup keys, followed
+ * by an equal number of pointers to the btree blocks at the next lower level.
+ *
+ * +-------+-------+-------+-------+-------+----------+----------+
+ * Leaf: | rec 1 | rec 2 | rec 3 | rec 4 | rec N | prev-ptr | next-ptr |
+ * +-------+-------+-------+-------+-------+----------+----------+
+ *
+ * +-------+-------+-------+-------+-------+-------+------+-------+
+ * Inner: | key 1 | key 2 | key 3 | key N | ptr 1 | ptr 2 | ptr3 | ptr N |
+ * +-------+-------+-------+-------+-------+-------+------+-------+
+ */
+struct fuse_iext_node {
+ uint64_t keys[KEYS_PER_NODE];
+#define FUSE_IEXT_KEY_INVALID (1ULL << 63)
+ void *ptrs[KEYS_PER_NODE];
+};
+
+struct fuse_iext_leaf {
+ struct fuse_iomap_io recs[RECS_PER_LEAF];
+ struct fuse_iext_leaf *prev;
+ struct fuse_iext_leaf *next;
+};
+
+inline uint64_t fuse_iext_count(const struct fuse_ifork *ifp)
+{
+ return ifp->if_bytes / sizeof(struct fuse_iomap_io);
+}
+
+static inline int fuse_iext_max_recs(const struct fuse_ifork *ifp)
+{
+ if (ifp->if_height == 1)
+ return fuse_iext_count(ifp);
+ return RECS_PER_LEAF;
+}
+
+static inline struct fuse_iomap_io *cur_rec(const struct fuse_iext_cursor *cur)
+{
+ return &cur->leaf->recs[cur->pos];
+}
+
+static inline bool fuse_iext_valid(const struct fuse_ifork *ifp,
+ const struct fuse_iext_cursor *cur)
+{
+ if (!cur->leaf)
+ return false;
+ if (cur->pos < 0 || cur->pos >= fuse_iext_max_recs(ifp))
+ return false;
+ if (fuse_iext_rec_is_empty(cur_rec(cur)))
+ return false;
+ return true;
+}
+
+static void *
+fuse_iext_find_first_leaf(
+ struct fuse_ifork *ifp)
+{
+ struct fuse_iext_node *node = ifp->if_data;
+ int height;
+
+ if (!ifp->if_height)
+ return NULL;
+
+ for (height = ifp->if_height; height > 1; height--) {
+ node = node->ptrs[0];
+ ASSERT(node);
+ }
+
+ return node;
+}
+
+static void *
+fuse_iext_find_last_leaf(
+ struct fuse_ifork *ifp)
+{
+ struct fuse_iext_node *node = ifp->if_data;
+ int height, i;
+
+ if (!ifp->if_height)
+ return NULL;
+
+ for (height = ifp->if_height; height > 1; height--) {
+ for (i = 1; i < KEYS_PER_NODE; i++)
+ if (!node->ptrs[i])
+ break;
+ node = node->ptrs[i - 1];
+ ASSERT(node);
+ }
+
+ return node;
+}
+
+void
+fuse_iext_first(
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur)
+{
+ cur->pos = 0;
+ cur->leaf = fuse_iext_find_first_leaf(ifp);
+}
+
+void
+fuse_iext_last(
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur)
+{
+ int i;
+
+ cur->leaf = fuse_iext_find_last_leaf(ifp);
+ if (!cur->leaf) {
+ cur->pos = 0;
+ return;
+ }
+
+ for (i = 1; i < fuse_iext_max_recs(ifp); i++) {
+ if (fuse_iext_rec_is_empty(&cur->leaf->recs[i]))
+ break;
+ }
+ cur->pos = i - 1;
+}
+
+void
+fuse_iext_next(
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur)
+{
+ if (!cur->leaf) {
+ ASSERT(cur->pos <= 0 || cur->pos >= RECS_PER_LEAF);
+ fuse_iext_first(ifp, cur);
+ return;
+ }
+
+ ASSERT(cur->pos >= 0);
+ ASSERT(cur->pos < fuse_iext_max_recs(ifp));
+
+ cur->pos++;
+ if (ifp->if_height > 1 && !fuse_iext_valid(ifp, cur) &&
+ cur->leaf->next) {
+ cur->leaf = cur->leaf->next;
+ cur->pos = 0;
+ }
+}
+
+void
+fuse_iext_prev(
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur)
+{
+ if (!cur->leaf) {
+ ASSERT(cur->pos <= 0 || cur->pos >= RECS_PER_LEAF);
+ fuse_iext_last(ifp, cur);
+ return;
+ }
+
+ ASSERT(cur->pos >= 0);
+ ASSERT(cur->pos <= RECS_PER_LEAF);
+
+recurse:
+ do {
+ cur->pos--;
+ if (fuse_iext_valid(ifp, cur))
+ return;
+ } while (cur->pos > 0);
+
+ if (ifp->if_height > 1 && cur->leaf->prev) {
+ cur->leaf = cur->leaf->prev;
+ cur->pos = RECS_PER_LEAF;
+ goto recurse;
+ }
+}
+
+static inline int
+fuse_iext_key_cmp(
+ struct fuse_iext_node *node,
+ int n,
+ loff_t offset)
+{
+ if (node->keys[n] > offset)
+ return 1;
+ if (node->keys[n] < offset)
+ return -1;
+ return 0;
+}
+
+static inline int
+fuse_iext_rec_cmp(
+ struct fuse_iomap_io *rec,
+ loff_t offset)
+{
+ if (rec->offset > offset)
+ return 1;
+ if (rec->offset + rec->length <= offset)
+ return -1;
+ return 0;
+}
+
+static void *
+fuse_iext_find_level(
+ struct fuse_ifork *ifp,
+ loff_t offset,
+ int level)
+{
+ struct fuse_iext_node *node = ifp->if_data;
+ int height, i;
+
+ if (!ifp->if_height)
+ return NULL;
+
+ for (height = ifp->if_height; height > level; height--) {
+ for (i = 1; i < KEYS_PER_NODE; i++)
+ if (fuse_iext_key_cmp(node, i, offset) > 0)
+ break;
+
+ node = node->ptrs[i - 1];
+ if (!node)
+ break;
+ }
+
+ return node;
+}
+
+static int
+fuse_iext_node_pos(
+ struct fuse_iext_node *node,
+ loff_t offset)
+{
+ int i;
+
+ for (i = 1; i < KEYS_PER_NODE; i++) {
+ if (fuse_iext_key_cmp(node, i, offset) > 0)
+ break;
+ }
+
+ return i - 1;
+}
+
+static int
+fuse_iext_node_insert_pos(
+ struct fuse_iext_node *node,
+ loff_t offset)
+{
+ int i;
+
+ for (i = 0; i < KEYS_PER_NODE; i++) {
+ if (fuse_iext_key_cmp(node, i, offset) > 0)
+ return i;
+ }
+
+ return KEYS_PER_NODE;
+}
+
+static int
+fuse_iext_node_nr_entries(
+ struct fuse_iext_node *node,
+ int start)
+{
+ int i;
+
+ for (i = start; i < KEYS_PER_NODE; i++) {
+ if (node->keys[i] == FUSE_IEXT_KEY_INVALID)
+ break;
+ }
+
+ return i;
+}
+
+static int
+fuse_iext_leaf_nr_entries(
+ struct fuse_ifork *ifp,
+ struct fuse_iext_leaf *leaf,
+ int start)
+{
+ int i;
+
+ for (i = start; i < fuse_iext_max_recs(ifp); i++) {
+ if (fuse_iext_rec_is_empty(&leaf->recs[i]))
+ break;
+ }
+
+ return i;
+}
+
+static inline uint64_t
+fuse_iext_leaf_key(
+ struct fuse_iext_leaf *leaf,
+ int n)
+{
+ return leaf->recs[n].offset;
+}
+
+static inline void *
+fuse_iext_alloc_node(
+ int size)
+{
+ return kzalloc(size, GFP_KERNEL | __GFP_NOLOCKDEP | __GFP_NOFAIL);
+}
+
+static void
+fuse_iext_grow(
+ struct fuse_ifork *ifp)
+{
+ struct fuse_iext_node *node = fuse_iext_alloc_node(NODE_SIZE);
+ int i;
+
+ if (ifp->if_height == 1) {
+ struct fuse_iext_leaf *prev = ifp->if_data;
+
+ node->keys[0] = fuse_iext_leaf_key(prev, 0);
+ node->ptrs[0] = prev;
+ } else {
+ struct fuse_iext_node *prev = ifp->if_data;
+
+ ASSERT(ifp->if_height > 1);
+
+ node->keys[0] = prev->keys[0];
+ node->ptrs[0] = prev;
+ }
+
+ for (i = 1; i < KEYS_PER_NODE; i++)
+ node->keys[i] = FUSE_IEXT_KEY_INVALID;
+
+ ifp->if_data = node;
+ ifp->if_height++;
+}
+
+static void
+fuse_iext_update_node(
+ struct fuse_ifork *ifp,
+ loff_t old_offset,
+ loff_t new_offset,
+ int level,
+ void *ptr)
+{
+ struct fuse_iext_node *node = ifp->if_data;
+ int height, i;
+
+ for (height = ifp->if_height; height > level; height--) {
+ for (i = 0; i < KEYS_PER_NODE; i++) {
+ if (i > 0 && fuse_iext_key_cmp(node, i, old_offset) > 0)
+ break;
+ if (node->keys[i] == old_offset)
+ node->keys[i] = new_offset;
+ }
+ node = node->ptrs[i - 1];
+ ASSERT(node);
+ }
+
+ ASSERT(node == ptr);
+}
+
+static struct fuse_iext_node *
+fuse_iext_split_node(
+ struct fuse_iext_node **nodep,
+ int *pos,
+ int *nr_entries)
+{
+ struct fuse_iext_node *node = *nodep;
+ struct fuse_iext_node *new = fuse_iext_alloc_node(NODE_SIZE);
+ const int nr_move = KEYS_PER_NODE / 2;
+ int nr_keep = nr_move + (KEYS_PER_NODE & 1);
+ int i = 0;
+
+ /* for sequential append operations just spill over into the new node */
+ if (*pos == KEYS_PER_NODE) {
+ *nodep = new;
+ *pos = 0;
+ *nr_entries = 0;
+ goto done;
+ }
+
+
+ for (i = 0; i < nr_move; i++) {
+ new->keys[i] = node->keys[nr_keep + i];
+ new->ptrs[i] = node->ptrs[nr_keep + i];
+
+ node->keys[nr_keep + i] = FUSE_IEXT_KEY_INVALID;
+ node->ptrs[nr_keep + i] = NULL;
+ }
+
+ if (*pos >= nr_keep) {
+ *nodep = new;
+ *pos -= nr_keep;
+ *nr_entries = nr_move;
+ } else {
+ *nr_entries = nr_keep;
+ }
+done:
+ for (; i < KEYS_PER_NODE; i++)
+ new->keys[i] = FUSE_IEXT_KEY_INVALID;
+ return new;
+}
+
+static void
+fuse_iext_insert_node(
+ struct fuse_ifork *ifp,
+ uint64_t offset,
+ void *ptr,
+ int level)
+{
+ struct fuse_iext_node *node, *new;
+ int i, pos, nr_entries;
+
+again:
+ if (ifp->if_height < level)
+ fuse_iext_grow(ifp);
+
+ new = NULL;
+ node = fuse_iext_find_level(ifp, offset, level);
+ pos = fuse_iext_node_insert_pos(node, offset);
+ nr_entries = fuse_iext_node_nr_entries(node, pos);
+
+ ASSERT(pos >= nr_entries || fuse_iext_key_cmp(node, pos, offset) != 0);
+ ASSERT(nr_entries <= KEYS_PER_NODE);
+
+ if (nr_entries == KEYS_PER_NODE)
+ new = fuse_iext_split_node(&node, &pos, &nr_entries);
+
+ /*
+ * Update the pointers in higher levels if the first entry changes
+ * in an existing node.
+ */
+ if (node != new && pos == 0 && nr_entries > 0)
+ fuse_iext_update_node(ifp, node->keys[0], offset, level, node);
+
+ for (i = nr_entries; i > pos; i--) {
+ node->keys[i] = node->keys[i - 1];
+ node->ptrs[i] = node->ptrs[i - 1];
+ }
+ node->keys[pos] = offset;
+ node->ptrs[pos] = ptr;
+
+ if (new) {
+ offset = new->keys[0];
+ ptr = new;
+ level++;
+ goto again;
+ }
+}
+
+static struct fuse_iext_leaf *
+fuse_iext_split_leaf(
+ struct fuse_iext_cursor *cur,
+ int *nr_entries)
+{
+ struct fuse_iext_leaf *leaf = cur->leaf;
+ struct fuse_iext_leaf *new = fuse_iext_alloc_node(NODE_SIZE);
+ const int nr_move = RECS_PER_LEAF / 2;
+ int nr_keep = nr_move + (RECS_PER_LEAF & 1);
+ int i;
+
+ /* for sequential append operations just spill over into the new node */
+ if (cur->pos == RECS_PER_LEAF) {
+ cur->leaf = new;
+ cur->pos = 0;
+ *nr_entries = 0;
+ goto done;
+ }
+
+ for (i = 0; i < nr_move; i++) {
+ new->recs[i] = leaf->recs[nr_keep + i];
+ fuse_iext_rec_clear(&leaf->recs[nr_keep + i]);
+ }
+
+ if (cur->pos >= nr_keep) {
+ cur->leaf = new;
+ cur->pos -= nr_keep;
+ *nr_entries = nr_move;
+ } else {
+ *nr_entries = nr_keep;
+ }
+done:
+ if (leaf->next)
+ leaf->next->prev = new;
+ new->next = leaf->next;
+ new->prev = leaf;
+ leaf->next = new;
+ return new;
+}
+
+static void
+fuse_iext_alloc_root(
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur)
+{
+ ASSERT(ifp->if_bytes == 0);
+
+ ifp->if_data = fuse_iext_alloc_node(sizeof(struct fuse_iomap_io));
+ ifp->if_height = 1;
+
+ /* now that we have a node step into it */
+ cur->leaf = ifp->if_data;
+ cur->pos = 0;
+}
+
+static void
+fuse_iext_realloc_root(
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur)
+{
+ int64_t new_size = ifp->if_bytes + sizeof(struct fuse_iomap_io);
+ void *new;
+
+ /* account for the prev/next pointers */
+ if (new_size / sizeof(struct fuse_iomap_io) == RECS_PER_LEAF)
+ new_size = NODE_SIZE;
+
+ new = krealloc(ifp->if_data, new_size,
+ GFP_KERNEL | __GFP_NOLOCKDEP | __GFP_NOFAIL);
+ memset(new + ifp->if_bytes, 0, new_size - ifp->if_bytes);
+ ifp->if_data = new;
+ cur->leaf = new;
+}
+
+/*
+ * Increment the sequence counter on extent tree changes. We use WRITE_ONCE
+ * here to ensure the update to the sequence counter is seen before the
+ * modifications to the extent tree itself take effect.
+ */
+static inline void fuse_iext_inc_seq(struct fuse_iomap_cache *ip)
+{
+ WRITE_ONCE(ip->im_seq, READ_ONCE(ip->im_seq) + 1);
+}
+
+void
+fuse_iext_insert_raw(
+ struct fuse_iomap_cache *ip,
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur,
+ const struct fuse_iomap_io *irec)
+{
+ loff_t offset = irec->offset;
+ struct fuse_iext_leaf *new = NULL;
+ int nr_entries, i;
+
+ fuse_iext_inc_seq(ip);
+
+ if (ifp->if_height == 0)
+ fuse_iext_alloc_root(ifp, cur);
+ else if (ifp->if_height == 1)
+ fuse_iext_realloc_root(ifp, cur);
+
+ nr_entries = fuse_iext_leaf_nr_entries(ifp, cur->leaf, cur->pos);
+ ASSERT(nr_entries <= RECS_PER_LEAF);
+ ASSERT(cur->pos >= nr_entries ||
+ fuse_iext_rec_cmp(cur_rec(cur), irec->offset) != 0);
+
+ if (nr_entries == RECS_PER_LEAF)
+ new = fuse_iext_split_leaf(cur, &nr_entries);
+
+ /*
+ * Update the pointers in higher levels if the first entry changes
+ * in an existing node.
+ */
+ if (cur->leaf != new && cur->pos == 0 && nr_entries > 0) {
+ fuse_iext_update_node(ifp, fuse_iext_leaf_key(cur->leaf, 0),
+ offset, 1, cur->leaf);
+ }
+
+ for (i = nr_entries; i > cur->pos; i--)
+ cur->leaf->recs[i] = cur->leaf->recs[i - 1];
+ fuse_iext_set(cur_rec(cur), irec);
+ ifp->if_bytes += sizeof(struct fuse_iomap_io);
+
+ if (new)
+ fuse_iext_insert_node(ifp, fuse_iext_leaf_key(new, 0), new, 2);
+}
+
+void
+fuse_iext_insert(
+ struct fuse_iomap_cache *ip,
+ struct fuse_iext_cursor *cur,
+ const struct fuse_iomap_io *irec,
+ int state)
+{
+ struct fuse_ifork *ifp = fuse_iext_state_to_fork(ip, state);
+
+ fuse_iext_insert_raw(ip, ifp, cur, irec);
+ trace_fuse_iext_insert(VFS_I(ip), cur, state, _RET_IP_);
+}
+
+static struct fuse_iext_node *
+fuse_iext_rebalance_node(
+ struct fuse_iext_node *parent,
+ int *pos,
+ struct fuse_iext_node *node,
+ int nr_entries)
+{
+ /*
+ * If the neighbouring nodes are completely full, or have different
+ * parents, we might never be able to merge our node, and will only
+ * delete it once the number of entries hits zero.
+ */
+ if (nr_entries == 0)
+ return node;
+
+ if (*pos > 0) {
+ struct fuse_iext_node *prev = parent->ptrs[*pos - 1];
+ int nr_prev = fuse_iext_node_nr_entries(prev, 0), i;
+
+ if (nr_prev + nr_entries <= KEYS_PER_NODE) {
+ for (i = 0; i < nr_entries; i++) {
+ prev->keys[nr_prev + i] = node->keys[i];
+ prev->ptrs[nr_prev + i] = node->ptrs[i];
+ }
+ return node;
+ }
+ }
+
+ if (*pos + 1 < fuse_iext_node_nr_entries(parent, *pos)) {
+ struct fuse_iext_node *next = parent->ptrs[*pos + 1];
+ int nr_next = fuse_iext_node_nr_entries(next, 0), i;
+
+ if (nr_entries + nr_next <= KEYS_PER_NODE) {
+ /*
+ * Merge the next node into this node so that we don't
+ * have to do an additional update of the keys in the
+ * higher levels.
+ */
+ for (i = 0; i < nr_next; i++) {
+ node->keys[nr_entries + i] = next->keys[i];
+ node->ptrs[nr_entries + i] = next->ptrs[i];
+ }
+
+ ++*pos;
+ return next;
+ }
+ }
+
+ return NULL;
+}
+
+static void
+fuse_iext_remove_node(
+ struct fuse_ifork *ifp,
+ loff_t offset,
+ void *victim)
+{
+ struct fuse_iext_node *node, *parent;
+ int level = 2, pos, nr_entries, i;
+
+ ASSERT(level <= ifp->if_height);
+ node = fuse_iext_find_level(ifp, offset, level);
+ pos = fuse_iext_node_pos(node, offset);
+again:
+ ASSERT(node->ptrs[pos]);
+ ASSERT(node->ptrs[pos] == victim);
+ kfree(victim);
+
+ nr_entries = fuse_iext_node_nr_entries(node, pos) - 1;
+ offset = node->keys[0];
+ for (i = pos; i < nr_entries; i++) {
+ node->keys[i] = node->keys[i + 1];
+ node->ptrs[i] = node->ptrs[i + 1];
+ }
+ node->keys[nr_entries] = FUSE_IEXT_KEY_INVALID;
+ node->ptrs[nr_entries] = NULL;
+
+ if (pos == 0 && nr_entries > 0) {
+ fuse_iext_update_node(ifp, offset, node->keys[0], level, node);
+ offset = node->keys[0];
+ }
+
+ if (nr_entries >= KEYS_PER_NODE / 2)
+ return;
+
+ if (level < ifp->if_height) {
+ /*
+ * If we aren't at the root yet try to find a neighbour node to
+ * merge with (or delete the node if it is empty), and then
+ * recurse up to the next level.
+ */
+ level++;
+ parent = fuse_iext_find_level(ifp, offset, level);
+ pos = fuse_iext_node_pos(parent, offset);
+
+ ASSERT(pos != KEYS_PER_NODE);
+ ASSERT(parent->ptrs[pos] == node);
+
+ node = fuse_iext_rebalance_node(parent, &pos, node, nr_entries);
+ if (node) {
+ victim = node;
+ node = parent;
+ goto again;
+ }
+ } else if (nr_entries == 1) {
+ /*
+ * If we are at the root and only one entry is left we can just
+ * free this node and update the root pointer.
+ */
+ ASSERT(node == ifp->if_data);
+ ifp->if_data = node->ptrs[0];
+ ifp->if_height--;
+ kfree(node);
+ }
+}
+
+static void
+fuse_iext_rebalance_leaf(
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *cur,
+ struct fuse_iext_leaf *leaf,
+ loff_t offset,
+ int nr_entries)
+{
+ /*
+ * If the neighbouring nodes are completely full we might never be able
+ * to merge our node, and will only delete it once the number of
+ * entries hits zero.
+ */
+ if (nr_entries == 0)
+ goto remove_node;
+
+ if (leaf->prev) {
+ int nr_prev = fuse_iext_leaf_nr_entries(ifp, leaf->prev, 0), i;
+
+ if (nr_prev + nr_entries <= RECS_PER_LEAF) {
+ for (i = 0; i < nr_entries; i++)
+ leaf->prev->recs[nr_prev + i] = leaf->recs[i];
+
+ if (cur->leaf == leaf) {
+ cur->leaf = leaf->prev;
+ cur->pos += nr_prev;
+ }
+ goto remove_node;
+ }
+ }
+
+ if (leaf->next) {
+ int nr_next = fuse_iext_leaf_nr_entries(ifp, leaf->next, 0), i;
+
+ if (nr_entries + nr_next <= RECS_PER_LEAF) {
+ /*
+ * Merge the next node into this node so that we don't
+ * have to do an additional update of the keys in the
+ * higher levels.
+ */
+ for (i = 0; i < nr_next; i++) {
+ leaf->recs[nr_entries + i] =
+ leaf->next->recs[i];
+ }
+
+ if (cur->leaf == leaf->next) {
+ cur->leaf = leaf;
+ cur->pos += nr_entries;
+ }
+
+ offset = fuse_iext_leaf_key(leaf->next, 0);
+ leaf = leaf->next;
+ goto remove_node;
+ }
+ }
+
+ return;
+remove_node:
+ if (leaf->prev)
+ leaf->prev->next = leaf->next;
+ if (leaf->next)
+ leaf->next->prev = leaf->prev;
+ fuse_iext_remove_node(ifp, offset, leaf);
+}
+
+static void
+fuse_iext_free_last_leaf(
+ struct fuse_ifork *ifp)
+{
+ ifp->if_height--;
+ kfree(ifp->if_data);
+ ifp->if_data = NULL;
+}
+
+void
+fuse_iext_remove(
+ struct fuse_iomap_cache *ip,
+ struct fuse_iext_cursor *cur,
+ int state)
+{
+ struct fuse_ifork *ifp = fuse_iext_state_to_fork(ip, state);
+ struct fuse_iext_leaf *leaf = cur->leaf;
+ loff_t offset = fuse_iext_leaf_key(leaf, 0);
+ int i, nr_entries;
+
+ trace_fuse_iext_remove(VFS_I(ip), cur, state, _RET_IP_);
+
+ ASSERT(ifp->if_height > 0);
+ ASSERT(ifp->if_data != NULL);
+ ASSERT(fuse_iext_valid(ifp, cur));
+
+ fuse_iext_inc_seq(ip);
+
+ nr_entries = fuse_iext_leaf_nr_entries(ifp, leaf, cur->pos) - 1;
+ for (i = cur->pos; i < nr_entries; i++)
+ leaf->recs[i] = leaf->recs[i + 1];
+ fuse_iext_rec_clear(&leaf->recs[nr_entries]);
+ ifp->if_bytes -= sizeof(struct fuse_iomap_io);
+
+ if (cur->pos == 0 && nr_entries > 0) {
+ fuse_iext_update_node(ifp, offset, fuse_iext_leaf_key(leaf, 0), 1,
+ leaf);
+ offset = fuse_iext_leaf_key(leaf, 0);
+ } else if (cur->pos == nr_entries) {
+ if (ifp->if_height > 1 && leaf->next)
+ cur->leaf = leaf->next;
+ else
+ cur->leaf = NULL;
+ cur->pos = 0;
+ }
+
+ if (nr_entries >= RECS_PER_LEAF / 2)
+ return;
+
+ if (ifp->if_height > 1)
+ fuse_iext_rebalance_leaf(ifp, cur, leaf, offset, nr_entries);
+ else if (nr_entries == 0)
+ fuse_iext_free_last_leaf(ifp);
+}
+
+/*
+ * Look up the extent covering offset.
+ *
+ * If there is an extent covering offset return true, and store the
+ * expanded extent structure in *gotp, and the extent cursor in *cur.
+ * If there is no extent covering offset, but there is an extent after it (e.g.
+ * it lies in a hole) return that extent in *gotp and its cursor in *cur
+ * instead.
+ * If offset is beyond the last extent return false, and return an invalid
+ * cursor value.
+ */
+bool
+fuse_iext_lookup_extent(
+ struct fuse_iomap_cache *ip,
+ struct fuse_ifork *ifp,
+ loff_t offset,
+ struct fuse_iext_cursor *cur,
+ struct fuse_iomap_io *gotp)
+{
+ cur->leaf = fuse_iext_find_level(ifp, offset, 1);
+ if (!cur->leaf) {
+ cur->pos = 0;
+ return false;
+ }
+
+ for (cur->pos = 0; cur->pos < fuse_iext_max_recs(ifp); cur->pos++) {
+ struct fuse_iomap_io *rec = cur_rec(cur);
+
+ if (fuse_iext_rec_is_empty(rec))
+ break;
+ if (fuse_iext_rec_cmp(rec, offset) >= 0)
+ goto found;
+ }
+
+ /* Try looking in the next node for an entry > offset */
+ if (ifp->if_height == 1 || !cur->leaf->next)
+ return false;
+ cur->leaf = cur->leaf->next;
+ cur->pos = 0;
+ if (!fuse_iext_valid(ifp, cur))
+ return false;
+found:
+ fuse_iext_get(gotp, cur_rec(cur));
+ return true;
+}
+
+/*
+ * Returns the last extent before end, and if this extent doesn't cover
+ * end, update end to the end of the extent.
+ */
+bool
+fuse_iext_lookup_extent_before(
+ struct fuse_iomap_cache *ip,
+ struct fuse_ifork *ifp,
+ loff_t *end,
+ struct fuse_iext_cursor *cur,
+ struct fuse_iomap_io *gotp)
+{
+ /* Could be optimized to skip looking up the next extent on a match. */
+ if (fuse_iext_lookup_extent(ip, ifp, *end - 1, cur, gotp) &&
+ gotp->offset <= *end - 1)
+ return true;
+ if (!fuse_iext_prev_extent(ifp, cur, gotp))
+ return false;
+ *end = gotp->offset + gotp->length;
+ return true;
+}
+
+void
+fuse_iext_update_extent(
+ struct fuse_iomap_cache *ip,
+ int state,
+ struct fuse_iext_cursor *cur,
+ struct fuse_iomap_io *new)
+{
+ struct fuse_ifork *ifp = fuse_iext_state_to_fork(ip, state);
+
+ fuse_iext_inc_seq(ip);
+
+ if (cur->pos == 0) {
+ struct fuse_iomap_io old;
+
+ fuse_iext_get(&old, cur_rec(cur));
+ if (new->offset != old.offset) {
+ fuse_iext_update_node(ifp, old.offset,
+ new->offset, 1, cur->leaf);
+ }
+ }
+
+ trace_fuse_iext_pre_update(VFS_I(ip), cur, state, _RET_IP_);
+ fuse_iext_set(cur_rec(cur), new);
+ trace_fuse_iext_post_update(VFS_I(ip), cur, state, _RET_IP_);
+}
+
+/*
+ * Return true if the cursor points at an extent and return the extent structure
+ * in gotp. Else return false.
+ */
+bool
+fuse_iext_get_extent(
+ const struct fuse_ifork *ifp,
+ const struct fuse_iext_cursor *cur,
+ struct fuse_iomap_io *gotp)
+{
+ if (!fuse_iext_valid(ifp, cur))
+ return false;
+ fuse_iext_get(gotp, cur_rec(cur));
+ return true;
+}
+
+/*
+ * This is a recursive function, so we need to be extremely careful with
+ * stack usage.
+ */
+static void
+fuse_iext_destroy_node(
+ struct fuse_iext_node *node,
+ int level)
+{
+ int i;
+
+ if (level > 1) {
+ for (i = 0; i < KEYS_PER_NODE; i++) {
+ if (node->keys[i] == FUSE_IEXT_KEY_INVALID)
+ break;
+ fuse_iext_destroy_node(node->ptrs[i], level - 1);
+ }
+ }
+
+ kfree(node);
+}
+
+void
+fuse_iext_destroy(
+ struct fuse_ifork *ifp)
+{
+ fuse_iext_destroy_node(ifp->if_data, ifp->if_height);
+
+ ifp->if_bytes = 0;
+ ifp->if_height = 0;
+ ifp->if_data = NULL;
+}
+
+static inline struct fuse_ifork *
+fuse_iomap_fork_ptr(
+ struct fuse_iomap_cache *ip,
+ enum fuse_iomap_iodir iodir)
+{
+ switch (iodir) {
+ case READ_MAPPING:
+ return &ip->im_read;
+ case WRITE_MAPPING:
+ return ip->im_write;
+ default:
+ ASSERT(0);
+ return NULL;
+ }
+}
+
+static inline bool fuse_iomap_addrs_adjacent(const struct fuse_iomap_io *left,
+ const struct fuse_iomap_io *right)
+{
+ switch (left->type) {
+ case FUSE_IOMAP_TYPE_MAPPED:
+ case FUSE_IOMAP_TYPE_UNWRITTEN:
+ return left->addr + left->length == right->addr;
+ default:
+ return left->addr == FUSE_IOMAP_NULL_ADDR &&
+ right->addr == FUSE_IOMAP_NULL_ADDR;
+ }
+}
+
+static inline bool fuse_iomap_can_merge(const struct fuse_iomap_io *left,
+ const struct fuse_iomap_io *right)
+{
+ return (left->dev == right->dev &&
+ left->offset + left->length == right->offset &&
+ left->type == right->type &&
+ fuse_iomap_addrs_adjacent(left, right) &&
+ left->flags == right->flags &&
+ left->length + right->length <= FUSE_IOMAP_MAX_LEN);
+}
+
+static inline bool fuse_iomap_can_merge3(const struct fuse_iomap_io *left,
+ const struct fuse_iomap_io *new,
+ const struct fuse_iomap_io *right)
+{
+ return left->length + new->length + right->length <= FUSE_IOMAP_MAX_LEN;
+}
+
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+static void fuse_iext_check_mappings(struct inode *inode,
+ struct fuse_iomap_cache *ip,
+ struct fuse_ifork *ifp)
+{
+ struct fuse_inode *fi = FUSE_I(ip);
+ struct fuse_iext_cursor icur;
+ struct fuse_iomap_io prev, got;
+ unsigned long long nr = 0;
+ enum fuse_iomap_iodir iodir;
+
+ if (!ifp || !static_branch_unlikely(&fuse_iomap_debug))
+ return;
+
+ if (ifp == ip->im_write)
+ iodir = WRITE_MAPPING;
+ else
+ iodir = READ_MAPPING;
+
+ fuse_iext_first(ifp, &icur);
+ if (!fuse_iext_get_extent(ifp, &icur, &prev))
+ return;
+ trace_fuse_iext_check_mapping(inode, iodir, &prev, _RET_IP_);
+ nr++;
+
+ fuse_iext_next(ifp, &icur);
+ while (fuse_iext_get_extent(ifp, &icur, &got)) {
+ trace_fuse_iext_check_mapping(inode, iodir, &got, _RET_IP_);
+ if (got.length == 0 ||
+ got.offset < prev.offset + prev.length ||
+ fuse_iomap_can_merge(&prev, &got)) {
+ printk(KERN_ERR "FUSE IOMAP CORRUPTION ino=%llu nr=%llu\n",
+ fi->orig_ino, nr);
+ printk(KERN_ERR "prev: offset=%llu length=%llu type=%u flags=0x%x dev=%u addr=%llu\n",
+ prev.offset, prev.length, prev.type, prev.flags,
+ prev.dev, prev.addr);
+ printk(KERN_ERR "curr: offset=%llu length=%llu type=%u flags=0x%x dev=%u addr=%llu\n",
+ got.offset, got.length, got.type, got.flags,
+ got.dev, got.addr);
+ }
+
+ prev = got;
+ nr++;
+ fuse_iext_next(ifp, &icur);
+ }
+}
+#else
+# define fuse_iext_check_mappings(...) ((void)0)
+#endif
+
+static void
+fuse_iext_del_mapping(
+ struct fuse_iomap_cache *ip,
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *icur,
+ struct fuse_iomap_io *got, /* current extent entry */
+ struct fuse_iomap_io *del) /* data to remove from extents */
+{
+ struct fuse_iomap_io new; /* new record to be inserted */
+ /* first addr (fsblock aligned) past del */
+ uint64_t del_endaddr;
+ /* first offset (fsblock aligned) past del */
+ uint64_t del_endoff = del->offset + del->length;
+ /* first offset (fsblock aligned) past got */
+ uint64_t got_endoff = got->offset + got->length;
+ uint32_t state = fuse_iomap_fork_to_state(ip, ifp);
+
+ ASSERT(del->length > 0);
+ ASSERT(got->offset <= del->offset);
+ ASSERT(got_endoff >= del_endoff);
+
+ switch (del->type) {
+ case FUSE_IOMAP_TYPE_MAPPED:
+ case FUSE_IOMAP_TYPE_UNWRITTEN:
+ del_endaddr = del->addr + del->length;
+ break;
+ default:
+ del_endaddr = FUSE_IOMAP_NULL_ADDR;
+ break;
+ }
+
+ if (got->offset == del->offset)
+ state |= FUSE_IEXT_LEFT_FILLING;
+ if (got_endoff == del_endoff)
+ state |= FUSE_IEXT_RIGHT_FILLING;
+
+ trace_fuse_iext_del_mapping(VFS_I(ip), state, del);
+ trace_fuse_iext_del_mapping_got(VFS_I(ip), got);
+
+ switch (state & (FUSE_IEXT_LEFT_FILLING | FUSE_IEXT_RIGHT_FILLING)) {
+ case FUSE_IEXT_LEFT_FILLING | FUSE_IEXT_RIGHT_FILLING:
+ /*
+ * Matches the whole extent. Delete the entry.
+ */
+ fuse_iext_remove(ip, icur, state);
+ fuse_iext_prev(ifp, icur);
+ break;
+ case FUSE_IEXT_LEFT_FILLING:
+ /*
+ * Deleting the first part of the extent.
+ */
+ got->offset = del_endoff;
+ got->addr = del_endaddr;
+ got->length -= del->length;
+ fuse_iext_update_extent(ip, state, icur, got);
+ break;
+ case FUSE_IEXT_RIGHT_FILLING:
+ /*
+ * Deleting the last part of the extent.
+ */
+ got->length -= del->length;
+ fuse_iext_update_extent(ip, state, icur, got);
+ break;
+ case 0:
+ /*
+ * Deleting the middle of the extent.
+ */
+ got->length = del->offset - got->offset;
+ fuse_iext_update_extent(ip, state, icur, got);
+
+ new.offset = del_endoff;
+ new.length = got_endoff - del_endoff;
+ new.type = got->type;
+ new.flags = got->flags;
+ new.addr = del_endaddr;
+ new.dev = got->dev;
+
+ fuse_iext_next(ifp, icur);
+ fuse_iext_insert(ip, icur, &new, state);
+ break;
+ }
+}
+
+int
+fuse_iomap_cache_remove(
+ struct inode *inode,
+ enum fuse_iomap_iodir iodir,
+ loff_t start, /* first file offset deleted */
+ uint64_t len) /* length to unmap */
+{
+ struct fuse_iext_cursor icur;
+ struct fuse_iomap_io got; /* current extent record */
+ struct fuse_iomap_io del; /* extent being deleted */
+ loff_t end;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_iomap_cache *ip = &fi->cache;
+ struct fuse_ifork *ifp = fuse_iomap_fork_ptr(ip, iodir);
+ bool wasreal;
+ bool done = false;
+ int ret = 0;
+
+ assert_cache_locked(ip);
+
+ trace_fuse_iomap_cache_remove(inode, iodir, start, len, _RET_IP_);
+
+ if (!ifp || fuse_iext_count(ifp) == 0)
+ return 0;
+
+ /* Fast shortcut if the caller wants to erase everything */
+ if (start == 0 && len >= inode->i_sb->s_maxbytes) {
+ fuse_iext_destroy(ifp);
+ return 0;
+ }
+
+ if (!len)
+ goto out;
+
+ /*
+ * If the caller wants us to remove everything to EOF, we set the end
+ * of the removal range to the maximum file offset. We don't support
+ * unsigned file offsets.
+ */
+ if (len == FUSE_IOMAP_INVAL_TO_EOF) {
+ const unsigned int blocksize = i_blocksize(inode);
+
+ len = round_up(inode->i_sb->s_maxbytes, blocksize) - start;
+ }
+
+ /*
+ * Now that we've settled len, look up the extent before the end of the
+ * range.
+ */
+ end = start + len;
+ if (!fuse_iext_lookup_extent_before(ip, ifp, &end, &icur, &got))
+ goto out;
+ end--;
+
+ while (end != -1 && end >= start) {
+ /*
+ * Is the found extent after a hole in which end lives?
+ * Just back up to the previous extent, if so.
+ */
+ if (got.offset > end &&
+ !fuse_iext_prev_extent(ifp, &icur, &got)) {
+ done = true;
+ break;
+ }
+ /*
+ * Is the last block of this extent before the range
+ * we're supposed to delete? If so, we're done.
+ */
+ end = min_t(loff_t, end, got.offset + got.length - 1);
+ if (end < start)
+ break;
+ /*
+ * Then deal with the (possibly delayed) allocated space
+ * we found.
+ */
+ del = got;
+ switch (del.type) {
+ case FUSE_IOMAP_TYPE_DELALLOC:
+ case FUSE_IOMAP_TYPE_HOLE:
+ case FUSE_IOMAP_TYPE_INLINE:
+ case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
+ wasreal = false;
+ break;
+ case FUSE_IOMAP_TYPE_MAPPED:
+ case FUSE_IOMAP_TYPE_UNWRITTEN:
+ wasreal = true;
+ break;
+ default:
+ ASSERT(0);
+ ret = -EFSCORRUPTED;
+ goto out;
+ }
+
+ if (got.offset < start) {
+ del.offset = start;
+ del.length -= start - got.offset;
+ if (wasreal)
+ del.addr += start - got.offset;
+ }
+ if (del.offset + del.length > end + 1)
+ del.length = end + 1 - del.offset;
+
+ fuse_iext_del_mapping(ip, ifp, &icur, &got, &del);
+ end = del.offset - 1;
+
+ /*
+ * If not done go on to the next (previous) record.
+ */
+ if (end != -1 && end >= start) {
+ if (!fuse_iext_get_extent(ifp, &icur, &got) ||
+ (got.offset > end &&
+ !fuse_iext_prev_extent(ifp, &icur, &got))) {
+ done = true;
+ break;
+ }
+ }
+ }
+
+ /* Should have removed everything */
+ if (len == 0 || done || end == (loff_t)-1 || end < start)
+ ret = 0;
+ else
+ ret = -EFSCORRUPTED;
+
+out:
+ fuse_iext_check_mappings(inode, ip, ifp);
+ return ret;
+}
+
+static void
+fuse_iext_add_mapping(
+ struct fuse_iomap_cache *ip,
+ struct fuse_ifork *ifp,
+ struct fuse_iext_cursor *icur,
+ const struct fuse_iomap_io *new) /* new extent entry */
+{
+ struct fuse_iomap_io left; /* left neighbor extent entry */
+ struct fuse_iomap_io right; /* right neighbor extent entry */
+ uint32_t state = fuse_iomap_fork_to_state(ip, ifp);
+
+ /*
+ * Check and set flags if this segment has a left neighbor.
+ */
+ if (fuse_iext_peek_prev_extent(ifp, icur, &left))
+ state |= FUSE_IEXT_LEFT_VALID;
+
+ /*
+ * Check and set flags if this segment has a current value.
+ * Not true if we're inserting into the "hole" at eof.
+ */
+ if (fuse_iext_get_extent(ifp, icur, &right))
+ state |= FUSE_IEXT_RIGHT_VALID;
+
+ /*
+ * We're inserting a real allocation between "left" and "right".
+ * Set the contiguity flags. Don't let extents get too large.
+ */
+ if ((state & FUSE_IEXT_LEFT_VALID) && fuse_iomap_can_merge(&left, new))
+ state |= FUSE_IEXT_LEFT_CONTIG;
+
+ if ((state & FUSE_IEXT_RIGHT_VALID) &&
+ fuse_iomap_can_merge(new, &right) &&
+ (!(state & FUSE_IEXT_LEFT_CONTIG) ||
+ fuse_iomap_can_merge3(&left, new, &right)))
+ state |= FUSE_IEXT_RIGHT_CONTIG;
+
+ trace_fuse_iext_add_mapping(VFS_I(ip), state, new);
+ if (state & FUSE_IEXT_LEFT_VALID)
+ trace_fuse_iext_add_mapping_left(VFS_I(ip), &left);
+ if (state & FUSE_IEXT_RIGHT_VALID)
+ trace_fuse_iext_add_mapping_right(VFS_I(ip), &right);
+
+ /*
+ * Select which case we're in here, and implement it.
+ */
+ switch (state & (FUSE_IEXT_LEFT_CONTIG | FUSE_IEXT_RIGHT_CONTIG)) {
+ case FUSE_IEXT_LEFT_CONTIG | FUSE_IEXT_RIGHT_CONTIG:
+ /*
+ * New allocation is contiguous with real allocations on the
+ * left and on the right.
+ * Merge all three into a single extent record.
+ */
+ left.length += new->length + right.length;
+
+ fuse_iext_remove(ip, icur, state);
+ fuse_iext_prev(ifp, icur);
+ fuse_iext_update_extent(ip, state, icur, &left);
+ break;
+
+ case FUSE_IEXT_LEFT_CONTIG:
+ /*
+ * New allocation is contiguous with a real allocation
+ * on the left.
+ * Merge the new allocation with the left neighbor.
+ */
+ left.length += new->length;
+
+ fuse_iext_prev(ifp, icur);
+ fuse_iext_update_extent(ip, state, icur, &left);
+ break;
+
+ case FUSE_IEXT_RIGHT_CONTIG:
+ /*
+ * New allocation is contiguous with a real allocation
+ * on the right.
+ * Merge the new allocation with the right neighbor.
+ */
+ right.offset = new->offset;
+ right.addr = new->addr;
+ right.length += new->length;
+ fuse_iext_update_extent(ip, state, icur, &right);
+ break;
+
+ case 0:
+ /*
+ * New allocation is not contiguous with another
+ * real allocation.
+ * Insert a new entry.
+ */
+ fuse_iext_insert(ip, icur, new, state);
+ break;
+ }
+}
+
+static int
+fuse_iomap_cache_add(
+ struct inode *inode,
+ enum fuse_iomap_iodir iodir,
+ const struct fuse_iomap_io *new)
+{
+ struct fuse_iext_cursor icur;
+ struct fuse_iomap_io got;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_iomap_cache *ip = &fi->cache;
+ struct fuse_ifork *ifp = fuse_iomap_fork_ptr(ip, iodir);
+
+ assert_cache_locked(ip);
+ ASSERT(new->length > 0);
+ ASSERT(new->offset < inode->i_sb->s_maxbytes);
+
+ trace_fuse_iomap_cache_add(inode, iodir, new, _RET_IP_);
+
+ if (!ifp) {
+ ifp = kzalloc(sizeof(struct fuse_ifork),
+ GFP_KERNEL | __GFP_NOFAIL);
+ if (!ifp)
+ return -ENOMEM;
+
+ ip->im_write = ifp;
+ }
+
+ if (fuse_iext_lookup_extent(ip, ifp, new->offset, &icur, &got)) {
+ /* make sure we only add into a hole. */
+ ASSERT(got.offset > new->offset);
+ ASSERT(got.offset - new->offset >= new->length);
+
+ if (got.offset <= new->offset ||
+ got.offset - new->offset < new->length)
+ return -EFSCORRUPTED;
+ }
+
+ fuse_iext_add_mapping(ip, ifp, &icur, new);
+ fuse_iext_check_mappings(inode, ip, ifp);
+ return 0;
+}
+
+int
+fuse_iomap_cache_upsert(
+ struct inode *inode,
+ enum fuse_iomap_iodir iodir,
+ const struct fuse_iomap_io *map)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_iomap_cache *ip = &fi->cache;
+ int err;
+
+ /*
+ * We interpret no write fork to mean that all writes are pure
+ * overwrites. Avoid wasting memory if we're trying to upsert a
+ * pure overwrite.
+ */
+ if (iodir == WRITE_MAPPING &&
+ map->type == FUSE_IOMAP_TYPE_PURE_OVERWRITE &&
+ ip->im_write == NULL)
+ return 0;
+
+ err = fuse_iomap_cache_remove(inode, iodir, map->offset, map->length);
+ if (err)
+ return err;
+
+ return fuse_iomap_cache_add(inode, iodir, map);
+}
+
+/*
+ * Trim the returned map to the required bounds
+ */
+static void
+fuse_iomap_trim(
+ struct fuse_inode *fi,
+ struct fuse_iomap_lookup *mval,
+ const struct fuse_iomap_io *got,
+ loff_t off,
+ loff_t len)
+{
+ struct fuse_iomap_cache *ip = &fi->cache;
+ const unsigned int blocksize = i_blocksize(&fi->inode);
+ const loff_t aligned_off = round_down(off, blocksize);
+ const loff_t aligned_end = round_up(off + len, blocksize);
+ const loff_t aligned_len = aligned_end - aligned_off;
+
+ ASSERT(aligned_off >= got->offset);
+
+ switch (got->type) {
+ case FUSE_IOMAP_TYPE_MAPPED:
+ case FUSE_IOMAP_TYPE_UNWRITTEN:
+ mval->map.addr = got->addr + (aligned_off - got->offset);
+ break;
+ default:
+ mval->map.addr = FUSE_IOMAP_NULL_ADDR;
+ break;
+ }
+ mval->map.offset = aligned_off;
+ mval->map.length = min_t(loff_t, aligned_len,
+ got->length - (aligned_off - got->offset));
+ mval->map.type = got->type;
+ mval->map.flags = got->flags;
+ mval->map.dev = got->dev;
+ mval->validity_cookie = fuse_iext_read_seq(ip);
+}
+
+enum fuse_iomap_lookup_result
+fuse_iomap_cache_lookup(
+ struct inode *inode,
+ enum fuse_iomap_iodir iodir,
+ loff_t off,
+ uint64_t len,
+ struct fuse_iomap_lookup *mval)
+{
+ struct fuse_iomap_io got;
+ struct fuse_iext_cursor icur;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_iomap_cache *ip = &fi->cache;
+ struct fuse_ifork *ifp = fuse_iomap_fork_ptr(ip, iodir);
+
+ assert_cache_locked_shared(ip);
+
+ trace_fuse_iomap_cache_lookup(inode, iodir, off, len, _RET_IP_);
+
+ if (!ifp) {
+ /*
+ * No write fork at all means this filesystem doesn't do out of
+ * place writes.
+ */
+ return LOOKUP_NOFORK;
+ }
+
+ if (!fuse_iext_lookup_extent(ip, ifp, off, &icur, &got)) {
+ /*
+ * The fork does not contain a mapping at or beyond off,
+ * which is a cache miss.
+ */
+ return LOOKUP_MISS;
+ }
+
+ if (got.offset > off) {
+ /*
+ * Found a mapping, but it doesn't cover the start of the
+ * range, which is effectively a miss.
+ */
+ return LOOKUP_MISS;
+ }
+
+ /* Found a mapping in the cache, return it */
+ fuse_iomap_trim(fi, mval, &got, off, len);
+
+ trace_fuse_iomap_cache_lookup_result(inode, iodir, off, len, &got,
+ mval);
+ return LOOKUP_HIT;
+}
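
For readers following the four cases in fuse_iext_del_mapping() above, here
is a minimal userspace sketch of the same arithmetic. It is not part of the
patch: struct ext and ext_punch() are hypothetical names, the struct is a
simplified stand-in for struct fuse_iomap_io, and it assumes a "real"
(mapped/unwritten) extent, so the device address advances with the file
offset.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct fuse_iomap_io. */
struct ext {
	uint64_t offset;	/* file offset of the mapping */
	uint64_t length;	/* length in bytes */
	uint64_t addr;		/* device address (real mappings only) */
};

/*
 * Remove [del_off, del_off + del_len) from *got, which must fully contain
 * that range.  Returns how many pieces survive:
 *   0: the whole extent is gone (left and right filling)
 *   1: *got was shortened in place (left or right filling)
 *   2: *got keeps the front part and *tail receives the back part
 */
static int ext_punch(struct ext *got, uint64_t del_off, uint64_t del_len,
		     struct ext *tail)
{
	uint64_t del_end = del_off + del_len;
	uint64_t got_end = got->offset + got->length;
	int left_filling, right_filling;

	assert(del_len > 0);
	assert(got->offset <= del_off);
	assert(got_end >= del_end);

	left_filling = (got->offset == del_off);
	right_filling = (got_end == del_end);

	if (left_filling && right_filling)
		return 0;

	if (left_filling) {
		/* Drop the front: offset and address move together. */
		got->offset = del_end;
		got->addr += del_len;
		got->length -= del_len;
		return 1;
	}

	if (right_filling) {
		/* Drop the back: only the length changes. */
		got->length -= del_len;
		return 1;
	}

	/* Punch the middle: build the tail first, then shorten the front. */
	tail->offset = del_end;
	tail->length = got_end - del_end;
	tail->addr = got->addr + (del_end - got->offset);
	got->length = del_off - got->offset;
	return 2;
}
```

The middle case is the only one that grows the record count, which is why
the patch has to advance the cursor and call fuse_iext_insert() there, while
the other three cases only update or remove the existing record.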
* [PATCH 2/4] fuse: use the iomap cache for iomap_begin
2025-08-21 0:47 ` [PATCHSET RFC v4 3/4] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
2025-08-21 0:58 ` [PATCH 1/4] fuse: cache iomaps Darrick J. Wong
@ 2025-08-21 0:59 ` Darrick J. Wong
2025-08-21 0:59 ` [PATCH 3/4] fuse: invalidate iomap cache after file updates Darrick J. Wong
2025-08-21 0:59 ` [PATCH 4/4] fuse: enable iomap cache management Darrick J. Wong
3 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:59 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Look inside the iomap cache to try to satisfy iomap_begin.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
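Editorial note (below the "---", so not part of the commit message): the
validity-cookie scheme this patch introduces can be modelled in userspace
roughly as follows. This is only a sketch under stated assumptions. The two
cookie values are taken from the patch; the model_* names and the struct are
hypothetical stand-ins for fuse_iomap_cache, fuse_iext_inc_seq() and
fuse_iomap_revalidate(), and the READ_ONCE/WRITE_ONCE and locking details are
left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Cookie values as defined in iomap_priv.h by this patch. */
#define ALWAYS_VALID	((uint64_t)0)	/* mapping came straight from the server */
#define INIT_COOKIE	((uint64_t)1)	/* initial counter value, never 0 */

/* Hypothetical stand-in for the per-inode cache sequence counter. */
struct model_cache {
	uint64_t seq;
};

static void model_cache_init(struct model_cache *c)
{
	c->seq = INIT_COOKIE;
}

/* Bump the counter on every cache change, skipping the reserved value 0. */
static void model_cache_inc_seq(struct model_cache *c)
{
	uint64_t new_val;

	/* Invariant: the counter itself is never ALWAYS_VALID. */
	assert(c->seq != ALWAYS_VALID);

	new_val = c->seq + 1;
	if (new_val == ALWAYS_VALID)
		new_val++;
	c->seq = new_val;
}

/* A mapping is still usable if it is "always valid" or its cookie is current. */
static bool model_revalidate(const struct model_cache *c, uint64_t cookie)
{
	if (cookie == ALWAYS_VALID)
		return true;
	return cookie == c->seq;
}
```

In this model, a cookie sampled before a cache change stops validating once
the counter moves, while a mapping tagged ALWAYS_VALID keeps validating no
matter how often the counter changes. Reserving 0 is what lets the two kinds
of mapping share a single cookie field.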
fs/fuse/fuse_trace.h | 34 ++++++++
fs/fuse/iomap_priv.h | 5 +
fs/fuse/file_iomap.c | 221 ++++++++++++++++++++++++++++++++++++++++++++++++-
fs/fuse/iomap_cache.c | 6 +
4 files changed, 260 insertions(+), 6 deletions(-)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index eb604eaf3bafad..94e7a4222d2ac2 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -319,6 +319,7 @@ struct fuse_iomap_lookup;
#define FUSE_IOMAP_TYPE_STRINGS \
{ FUSE_IOMAP_TYPE_PURE_OVERWRITE, "overwrite" }, \
+ { FUSE_IOMAP_TYPE_RETRY_CACHE, "retry" }, \
{ FUSE_IOMAP_TYPE_HOLE, "hole" }, \
{ FUSE_IOMAP_TYPE_DELALLOC, "delalloc" }, \
{ FUSE_IOMAP_TYPE_MAPPED, "mapped" }, \
@@ -1411,6 +1412,39 @@ TRACE_EVENT(fuse_iomap_cache_lookup_result,
FUSE_IOMAP_MAP_PRINTK_ARGS(got),
__entry->validity_cookie)
);
+
+TRACE_EVENT(fuse_iomap_invalid,
+ TP_PROTO(const struct inode *inode, const struct iomap *map,
+ uint64_t validity_cookie),
+ TP_ARGS(inode, map, validity_cookie),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ FUSE_IOMAP_MAP_FIELDS(map)
+ __field(uint64_t, old_validity_cookie)
+ __field(uint64_t, validity_cookie)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+
+ __entry->mapoffset = map->offset;
+ __entry->maplength = map->length;
+ __entry->maptype = map->type;
+ __entry->mapflags = map->flags;
+ __entry->mapaddr = map->addr;
+ __entry->mapdev = FUSE_IOMAP_DEV_NULL;
+
+ __entry->old_validity_cookie = map->validity_cookie;
+ __entry->validity_cookie = validity_cookie;
+ ),
+
+ TP_printk(FUSE_INODE_FMT FUSE_IOMAP_MAP_FMT() " old_cookie 0x%llx new_cookie 0x%llx",
+ FUSE_INODE_PRINTK_ARGS,
+ FUSE_IOMAP_MAP_PRINTK_ARGS(map),
+ __entry->old_validity_cookie,
+ __entry->validity_cookie)
+);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/fs/fuse/iomap_priv.h b/fs/fuse/iomap_priv.h
index 8e4a32879025a4..8f1aef381942b6 100644
--- a/fs/fuse/iomap_priv.h
+++ b/fs/fuse/iomap_priv.h
@@ -145,6 +145,11 @@ static inline bool fuse_iext_peek_prev_extent(struct fuse_ifork *ifp,
fuse_iext_get_extent((ifp), (ext), (got)); \
fuse_iext_next((ifp), (ext)))
+/* iomaps that come direct from the fuse server are presumed to be valid */
+#define FUSE_IOMAP_ALWAYS_VALID ((uint64_t)0)
+/* set initial iomap cookie value to avoid ALWAYS_VALID */
+#define FUSE_IOMAP_INIT_COOKIE ((uint64_t)1)
+
static inline uint64_t fuse_iext_read_seq(struct fuse_iomap_cache *ip)
{
return (uint64_t)READ_ONCE(ip->im_seq);
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 545798f0d915a1..706eff6863d0a7 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -162,6 +162,7 @@ static inline bool fuse_iomap_check_type(uint16_t fuse_type)
case FUSE_IOMAP_TYPE_UNWRITTEN:
case FUSE_IOMAP_TYPE_INLINE:
case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
+ case FUSE_IOMAP_TYPE_RETRY_CACHE:
return true;
}
@@ -267,9 +268,14 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
const unsigned int blocksize = i_blocksize(inode);
uint64_t end;
- /* Type and flags must be known */
+ /*
+ * Type and flags must be known. Mapping type "retry cache" doesn't
+ * use any of the other fields.
+ */
if (BAD_DATA(!fuse_iomap_check_type(map->type)))
return false;
+ if (map->type == FUSE_IOMAP_TYPE_RETRY_CACHE)
+ return true;
if (BAD_DATA(!fuse_iomap_check_flags(map->flags)))
return false;
@@ -300,6 +306,14 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
if (BAD_DATA(map->addr == FUSE_IOMAP_NULL_ADDR))
return false;
break;
+ case FUSE_IOMAP_TYPE_RETRY_CACHE:
+ /*
+ * We only accept cache retries if we have a cache to query.
+ * There must not be a device addr.
+ */
+ if (BAD_DATA(!fuse_inode_caches_iomaps(inode)))
+ return false;
+ fallthrough;
case FUSE_IOMAP_TYPE_DELALLOC:
case FUSE_IOMAP_TYPE_HOLE:
case FUSE_IOMAP_TYPE_INLINE:
@@ -569,6 +583,149 @@ static int fuse_iomap_set_inline(struct inode *inode, unsigned opflags,
return 0;
}
+/* Convert a mapping from the cache into something the kernel can use */
+static int fuse_iomap_from_cache(struct inode *inode, struct iomap *iomap,
+ const struct fuse_iomap_lookup *lmap)
+{
+ struct fuse_mount *fm = get_fuse_mount(inode);
+ struct fuse_backing *fb;
+
+ fb = fuse_iomap_find_dev(fm->fc, &lmap->map);
+ if (IS_ERR(fb))
+ return PTR_ERR(fb);
+
+ fuse_iomap_from_server(inode, iomap, fb, &lmap->map);
+ iomap->validity_cookie = lmap->validity_cookie;
+
+ fuse_backing_put(fb);
+ return 0;
+}
+
+#if IS_ENABLED(CONFIG_FUSE_IOMAP_DEBUG)
+static inline int
+fuse_iomap_cached_validate(const struct inode *inode,
+ enum fuse_iomap_iodir dir,
+ const struct fuse_iomap_lookup *lmap)
+{
+ if (!static_branch_unlikely(&fuse_iomap_debug))
+ return 0;
+
+ /* Make sure the mappings aren't garbage */
+ if (!fuse_iomap_check_mapping(inode, &lmap->map, dir))
+ return -EFSCORRUPTED;
+
+ /* The cache should not be storing "retry cache" mappings */
+ if (BAD_DATA(lmap->map.type == FUSE_IOMAP_TYPE_RETRY_CACHE))
+ return -EFSCORRUPTED;
+
+ return 0;
+}
+#else
+# define fuse_iomap_cached_validate(...) (0)
+#endif
+
+/*
+ * Look up iomappings from the cache. Returns 1 if iomap and srcmap were
+ * satisfied from cache; 0 if not; or a negative errno.
+ */
+static int fuse_iomap_try_cache(struct inode *inode, loff_t pos, loff_t count,
+ unsigned opflags, struct iomap *iomap,
+ struct iomap *srcmap)
+{
+ struct fuse_iomap_lookup lmap;
+ struct iomap *dest = iomap;
+ enum fuse_iomap_lookup_result res;
+ int ret;
+
+ if (!fuse_inode_caches_iomaps(inode))
+ return 0;
+
+ fuse_iomap_cache_lock_shared(inode);
+
+ if (fuse_is_iomap_file_write(opflags)) {
+ res = fuse_iomap_cache_lookup(inode, WRITE_MAPPING, pos, count,
+ &lmap);
+ switch (res) {
+ case LOOKUP_HIT:
+ ret = fuse_iomap_cached_validate(inode, WRITE_MAPPING,
+ &lmap);
+ if (ret)
+ goto out_unlock;
+
+ if (lmap.map.type != FUSE_IOMAP_TYPE_PURE_OVERWRITE) {
+ ret = fuse_iomap_from_cache(inode, dest, &lmap);
+ if (ret)
+ goto out_unlock;
+
+ dest = srcmap;
+ }
+ fallthrough;
+ case LOOKUP_NOFORK:
+ /* move on to the read fork */
+ break;
+ case LOOKUP_MISS:
+ ret = 0;
+ goto out_unlock;
+ }
+ }
+
+ res = fuse_iomap_cache_lookup(inode, READ_MAPPING, pos, count, &lmap);
+ switch (res) {
+ case LOOKUP_HIT:
+ break;
+ case LOOKUP_NOFORK:
+ ASSERT(res != LOOKUP_NOFORK);
+ ret = -EFSCORRUPTED;
+ goto out_unlock;
+ case LOOKUP_MISS:
+ ret = 0;
+ goto out_unlock;
+ }
+
+ ret = fuse_iomap_cached_validate(inode, READ_MAPPING, &lmap);
+ if (ret)
+ goto out_unlock;
+
+ ret = fuse_iomap_from_cache(inode, dest, &lmap);
+ if (ret)
+ goto out_unlock;
+
+ if (fuse_is_iomap_file_write(opflags)) {
+ switch (iomap->type) {
+ case IOMAP_HOLE:
+ if (opflags & (IOMAP_ZERO | IOMAP_UNSHARE))
+ ret = 1;
+ else
+ ret = 0;
+ break;
+ case IOMAP_DELALLOC:
+ if (opflags & IOMAP_DIRECT)
+ ret = 0;
+ else
+ ret = 1;
+ break;
+ default:
+ ret = 1;
+ break;
+ }
+ } else {
+ ret = 1;
+ }
+
+out_unlock:
+ fuse_iomap_cache_unlock_shared(inode);
+ if (ret < 1)
+ return ret;
+
+ if (iomap->type == IOMAP_INLINE || srcmap->type == IOMAP_INLINE) {
+ ret = fuse_iomap_set_inline(inode, opflags, pos, count, iomap,
+ srcmap);
+ if (ret)
+ return ret;
+ }
+ return 1;
+}
+
static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
unsigned opflags, struct iomap *iomap,
struct iomap *srcmap)
@@ -589,6 +746,21 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
trace_fuse_iomap_begin(inode, pos, count, opflags);
+ /*
+ * Try to read mappings from the cache; if we find something then use
+ * it; otherwise we upcall the fuse server. For atomic writes we must
+ * always query the server.
+ */
+ if (!(opflags & FUSE_IOMAP_OP_ATOMIC)) {
+ err = fuse_iomap_try_cache(inode, pos, count, opflags, iomap,
+ srcmap);
+ if (err < 0)
+ return err;
+ if (err == 1)
+ return 0;
+ }
+
+retry:
args.opcode = FUSE_IOMAP_BEGIN;
args.nodeid = get_node_id(inode);
args.in_numargs = 1;
@@ -610,6 +782,24 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
if (err)
return err;
+ /*
+ * If the fuse server tells us it populated the cache, we'll try the
+ * cache lookup again. Note that we dropped the cache lock, so it's
+ * entirely possible that another thread could have invalidated the
+ * cache -- if the cache misses, we'll call the server again.
+ */
+ if (outarg.read.type == FUSE_IOMAP_TYPE_RETRY_CACHE) {
+ err = fuse_iomap_try_cache(inode, pos, count, opflags, iomap,
+ srcmap);
+ if (err < 0)
+ return err;
+ if (err == 1)
+ return 0;
+ if (signal_pending(current))
+ return -EINTR;
+ goto retry;
+ }
+
read_dev = fuse_iomap_find_dev(fm->fc, &outarg.read);
if (IS_ERR(read_dev))
return PTR_ERR(read_dev);
@@ -637,6 +827,8 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t count,
*/
fuse_iomap_from_server(inode, iomap, read_dev, &outarg.read);
}
+ iomap->validity_cookie = FUSE_IOMAP_ALWAYS_VALID;
+ srcmap->validity_cookie = FUSE_IOMAP_ALWAYS_VALID;
if (iomap->type == IOMAP_INLINE || srcmap->type == IOMAP_INLINE) {
err = fuse_iomap_set_inline(inode, opflags, pos, count, iomap,
@@ -1316,7 +1508,26 @@ static int fuse_iomap_direct_write_sync(struct kiocb *iocb, loff_t start,
return err;
}
+static bool fuse_iomap_revalidate(struct inode *inode,
+ const struct iomap *iomap)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ uint64_t validity_cookie;
+
+ if (iomap->validity_cookie == FUSE_IOMAP_ALWAYS_VALID)
+ return true;
+
+ validity_cookie = fuse_iext_read_seq(&fi->cache);
+ if (iomap->validity_cookie != validity_cookie) {
+ trace_fuse_iomap_invalid(inode, iomap, validity_cookie);
+ return false;
+ }
+
+ return true;
+}
+
static const struct iomap_write_ops fuse_iomap_write_ops = {
+ .iomap_valid = fuse_iomap_revalidate,
};
static int
@@ -1598,14 +1809,14 @@ static void fuse_iomap_end_bio(struct bio *bio)
* mapping is valid, false otherwise.
*/
static bool fuse_iomap_revalidate_writeback(struct iomap_writepage_ctx *wpc,
+ struct inode *inode,
loff_t offset)
{
if (offset < wpc->iomap.offset ||
offset >= wpc->iomap.offset + wpc->iomap.length)
return false;
- /* XXX actually use revalidation cookie */
- return true;
+ return fuse_iomap_revalidate(inode, &wpc->iomap);
}
/*
@@ -1659,7 +1870,7 @@ static ssize_t fuse_iomap_writeback_range(struct iomap_writepage_ctx *wpc,
trace_fuse_iomap_writeback_range(inode, offset, len, end_pos);
- if (!fuse_iomap_revalidate_writeback(wpc, offset)) {
+ if (!fuse_iomap_revalidate_writeback(wpc, inode, offset)) {
/* Pretend that this is a directio write */
ret = fuse_iomap_begin(inode, offset, len,
IOMAP_DIRECT | IOMAP_WRITE,
@@ -1785,7 +1996,7 @@ static inline void fuse_inode_set_iomap(struct inode *inode)
mapping_set_folio_min_order(inode->i_mapping, min_order);
memset(&fi->cache.im_read, 0, sizeof(fi->cache.im_read));
- fi->cache.im_seq = 0;
+ fi->cache.im_seq = FUSE_IOMAP_INIT_COOKIE;
fi->cache.im_write = NULL;
init_rwsem(&fi->cache.im_lock);
diff --git a/fs/fuse/iomap_cache.c b/fs/fuse/iomap_cache.c
index 5bfa0e26346d1f..572bccf99a97a8 100644
--- a/fs/fuse/iomap_cache.c
+++ b/fs/fuse/iomap_cache.c
@@ -660,7 +660,11 @@ fuse_iext_realloc_root(
*/
static inline void fuse_iext_inc_seq(struct fuse_iomap_cache *ip)
{
- WRITE_ONCE(ip->im_seq, READ_ONCE(ip->im_seq) + 1);
+ uint64_t new_val = READ_ONCE(ip->im_seq) + 1;
+
+ if (new_val == FUSE_IOMAP_ALWAYS_VALID)
+ new_val++;
+ WRITE_ONCE(ip->im_seq, new_val);
}
void
* [PATCH 3/4] fuse: invalidate iomap cache after file updates
2025-08-21 0:47 ` [PATCHSET RFC v4 3/4] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
2025-08-21 0:58 ` [PATCH 1/4] fuse: cache iomaps Darrick J. Wong
2025-08-21 0:59 ` [PATCH 2/4] fuse: use the iomap cache for iomap_begin Darrick J. Wong
@ 2025-08-21 0:59 ` Darrick J. Wong
2025-08-21 0:59 ` [PATCH 4/4] fuse: enable iomap cache management Darrick J. Wong
3 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:59 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
The kernel doesn't know what the fuse server might have done in response
to truncate, fallocate, or ioend events. Therefore, it must invalidate
the mapping cache after those operations to ensure cache coherency.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
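Editorial note (below the "---", so not part of the commit message): the
range rounding done by fuse_iomap_cache_invalidate_range() can be sketched in
userspace as below. This is a minimal model, assuming the block size is a
power of two. The helper name and the struct inval_range type are
hypothetical, and the sentinel mirrors FUSE_IOMAP_INVAL_TO_EOF from the uapi
header.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors FUSE_IOMAP_INVAL_TO_EOF: invalidate everything up to EOF. */
#define INVAL_TO_EOF	(~0ULL)

/* Hypothetical result type: the block-aligned range to drop from the cache. */
struct inval_range {
	uint64_t offset;
	uint64_t length;
};

/*
 * Widen [offset, offset + length) to block boundaries.  The start is rounded
 * down, the length grows by the amount the start moved and is then rounded
 * up.  The "to EOF" sentinel is passed through untouched.
 */
static struct inval_range align_inval_range(uint64_t offset, uint64_t length,
					    uint32_t blocksize)
{
	struct inval_range r;
	uint64_t mask;

	/* Assumption of this sketch: blocksize is a nonzero power of two. */
	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
	mask = (uint64_t)blocksize - 1;

	r.offset = offset & ~mask;
	r.length = length;
	if (length != INVAL_TO_EOF) {
		r.length += offset - r.offset;
		r.length = (r.length + mask) & ~mask;
	}
	return r;
}
```

With 4096-byte blocks, a 100-byte write at offset 5000 widens to the single
block starting at 4096, and a 2-byte write at offset 4095 straddles a block
boundary and so widens to two blocks starting at 0.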
fs/fuse/fuse_i.h | 7 +++++++
fs/fuse/fuse_trace.h | 37 +++++++++++++++++++++++++++++++++++++
fs/fuse/iomap_priv.h | 9 +++++++++
fs/fuse/dir.c | 6 ++++++
fs/fuse/file.c | 10 +++++++---
fs/fuse/file_iomap.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++++-
fs/fuse/iomap_cache.c | 29 +++++++++++++++++++++++++++++
7 files changed, 143 insertions(+), 4 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 54b8aab94a9cd5..0a7192b633dd3a 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1769,11 +1769,15 @@ int fuse_iomap_mmap(struct file *file, struct vm_area_struct *vma);
ssize_t fuse_iomap_buffered_read(struct kiocb *iocb, struct iov_iter *to);
ssize_t fuse_iomap_buffered_write(struct kiocb *iocb, struct iov_iter *from);
int fuse_iomap_setsize_start(struct inode *inode, loff_t newsize);
+int fuse_iomap_setsize_finish(struct inode *inode, loff_t newsize);
void fuse_iomap_set_i_blkbits(struct inode *inode, u8 new_blkbits);
int fuse_iomap_fallocate(struct file *file, int mode, loff_t offset,
loff_t length, loff_t new_size);
int fuse_iomap_flush_unmap_range(struct inode *inode, loff_t pos,
loff_t endpos);
+void fuse_iomap_open_truncate(struct inode *inode);
+void fuse_iomap_copied_file_range(struct inode *inode, loff_t offset,
+ size_t written);
int fuse_dev_ioctl_iomap_support(struct file *file,
struct fuse_iomap_support __user *argp);
@@ -1814,9 +1818,12 @@ enum fuse_iomap_iodir {
# define fuse_iomap_buffered_read(...) (-ENOSYS)
# define fuse_iomap_buffered_write(...) (-ENOSYS)
# define fuse_iomap_setsize_start(...) (-ENOSYS)
+# define fuse_iomap_setsize_finish(...) (-ENOSYS)
# define fuse_iomap_set_i_blkbits(...) ((void)0)
# define fuse_iomap_fallocate(...) (-ENOSYS)
# define fuse_iomap_flush_unmap_range(...) (-ENOSYS)
+# define fuse_iomap_open_truncate(...) ((void)0)
+# define fuse_iomap_copied_file_range(...) ((void)0)
# define fuse_dev_ioctl_iomap_support(...) (-EOPNOTSUPP)
# define fuse_iomap_fadvise NULL
# define fuse_inode_caches_iomaps(...) (false)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 94e7a4222d2ac2..cd8aa9e0633eee 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -991,6 +991,7 @@ DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_truncate_down);
DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_punch_range);
DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_setsize);
DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_flush_unmap_range);
+DEFINE_FUSE_IOMAP_FILE_RANGE_EVENT(fuse_iomap_cache_invalidate_range);
TRACE_EVENT(fuse_iomap_set_i_blkbits,
TP_PROTO(const struct inode *inode, u8 new_blkbits),
@@ -1150,6 +1151,42 @@ DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_inline_write);
DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_iomap);
DEFINE_FUSE_IOMAP_INLINE_EVENT(fuse_iomap_set_inline_srcmap);
+TRACE_EVENT(fuse_iomap_open_truncate,
+ TP_PROTO(const struct inode *inode),
+
+ TP_ARGS(inode),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ ),
+
+ TP_printk(FUSE_INODE_FMT,
+ FUSE_INODE_PRINTK_ARGS)
+);
+
+TRACE_EVENT(fuse_iomap_copied_file_range,
+ TP_PROTO(const struct inode *inode, loff_t offset,
+ size_t written),
+ TP_ARGS(inode, offset, written),
+
+ TP_STRUCT__entry(
+ FUSE_IO_RANGE_FIELDS()
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->offset = offset;
+ __entry->length = written;
+ ),
+
+ TP_printk(FUSE_IO_RANGE_FMT(),
+ FUSE_IO_RANGE_PRINTK_ARGS())
+);
+
DECLARE_EVENT_CLASS(fuse_iext_class,
TP_PROTO(const struct inode *inode, const struct fuse_iext_cursor *cur,
int state, unsigned long caller_ip),
diff --git a/fs/fuse/iomap_priv.h b/fs/fuse/iomap_priv.h
index 8f1aef381942b6..e78c49af638e0f 100644
--- a/fs/fuse/iomap_priv.h
+++ b/fs/fuse/iomap_priv.h
@@ -177,6 +177,15 @@ fuse_iomap_cache_lookup(struct inode *inode, enum fuse_iomap_iodir iodir,
loff_t off, uint64_t len,
struct fuse_iomap_lookup *mval);
+int fuse_iomap_cache_invalidate_range(struct inode *inode, loff_t offset,
+ uint64_t length);
+static inline int fuse_iomap_cache_invalidate(struct inode *inode,
+ loff_t offset)
+{
+ return fuse_iomap_cache_invalidate_range(inode, offset,
+ FUSE_IOMAP_INVAL_TO_EOF);
+}
+
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _FS_FUSE_IOMAP_PRIV_H */
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 305b926b4a589a..05cb79beb8e426 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -2187,6 +2187,12 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
goto error;
}
+ if (fuse_inode_has_iomap(inode) && is_truncate) {
+ err = fuse_iomap_setsize_finish(inode, outarg.attr.size);
+ if (err)
+ goto error;
+ }
+
spin_lock(&fi->lock);
/* the kernel maintains i_mtime locally */
if (trust_local_cmtime) {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6575deae7e65f6..701042c04ab733 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -279,9 +279,11 @@ static int fuse_open(struct inode *inode, struct file *file)
if (is_wb_truncate || dax_truncate)
fuse_release_nowrite(inode);
if (!err) {
- if (is_truncate)
+ if (is_truncate) {
truncate_pagecache(inode, 0);
- else if (!(ff->open_flags & FOPEN_KEEP_CACHE))
+ if (fuse_inode_has_iomap(inode))
+ fuse_iomap_open_truncate(inode);
+ } else if (!(ff->open_flags & FOPEN_KEEP_CACHE))
invalidate_inode_pages2(inode->i_mapping);
}
if (dax_truncate)
@@ -3131,7 +3133,9 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
if (err)
goto out;
- if (!fuse_inode_has_iomap(inode_out))
+ if (fuse_inode_has_iomap(inode_out))
+ fuse_iomap_copied_file_range(inode_out, pos_out, outarg.size);
+ else
truncate_inode_pages_range(inode_out->i_mapping,
ALIGN_DOWN(pos_out, PAGE_SIZE),
ALIGN(pos_out + outarg.size, PAGE_SIZE) - 1);
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index 706eff6863d0a7..b4a2c4ea00a6f8 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -896,6 +896,7 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count,
fuse_iomap_inline_free(iomap);
if (err)
return err;
+ fuse_iomap_cache_invalidate_range(inode, pos, written);
} else {
fuse_iomap_inline_free(iomap);
}
@@ -1036,9 +1037,11 @@ static int fuse_iomap_ioend(struct inode *inode, loff_t pos, size_t written,
/*
* If there weren't any ioend errors, update the incore isize, which
- * confusingly takes the new i_size as "pos".
+ * confusingly takes the new i_size as "pos". Invalidate cached
+ * mappings for the file range that we just completed.
*/
fuse_write_update_attr(inode, pos + written, written);
+ fuse_iomap_cache_invalidate_range(inode, pos, written);
return 0;
}
@@ -2201,6 +2204,19 @@ fuse_iomap_setsize_start(
return filemap_write_and_wait(inode->i_mapping);
}
+int
+fuse_iomap_setsize_finish(
+ struct inode *inode,
+ loff_t newsize)
+{
+ ASSERT(fuse_has_iomap(inode));
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ trace_fuse_iomap_setsize(inode, newsize, 0);
+
+ return fuse_iomap_cache_invalidate(inode, newsize);
+}
+
/*
* Prepare for a file data block remapping operation by flushing and unmapping
* all pagecache for the entire range.
@@ -2309,6 +2325,14 @@ fuse_iomap_fallocate(
trace_fuse_iomap_fallocate(inode, mode, offset, length, new_size);
+ if (mode & (FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_INSERT_RANGE))
+ error = fuse_iomap_cache_invalidate(inode, offset);
+ else
+ error = fuse_iomap_cache_invalidate_range(inode, offset,
+ length);
+ if (error)
+ return error;
+
/*
* If we unmapped blocks from the file range, then we zero the
* pagecache for those regions and push them to disk rather than make
@@ -2326,6 +2350,8 @@ fuse_iomap_fallocate(
*/
if (new_size) {
error = fuse_iomap_setsize_start(inode, new_size);
+ if (!error)
+ error = fuse_iomap_setsize_finish(inode, new_size);
if (error)
return error;
@@ -2415,3 +2441,24 @@ int fuse_iomap_dev_inval(struct fuse_conn *fc,
up_read(&fc->killsb);
return ret;
}
+
+void fuse_iomap_open_truncate(struct inode *inode)
+{
+ ASSERT(fuse_has_iomap(inode));
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ trace_fuse_iomap_open_truncate(inode);
+
+ fuse_iomap_cache_invalidate(inode, 0);
+}
+
+void fuse_iomap_copied_file_range(struct inode *inode, loff_t offset,
+ size_t written)
+{
+ ASSERT(fuse_has_iomap(inode));
+ ASSERT(fuse_inode_has_iomap(inode));
+
+ trace_fuse_iomap_copied_file_range(inode, offset, written);
+
+ fuse_iomap_cache_invalidate_range(inode, offset, written);
+}
diff --git a/fs/fuse/iomap_cache.c b/fs/fuse/iomap_cache.c
index 572bccf99a97a8..a13eb5eec72415 100644
--- a/fs/fuse/iomap_cache.c
+++ b/fs/fuse/iomap_cache.c
@@ -1412,6 +1412,35 @@ fuse_iomap_cache_remove(
return ret;
}
+int fuse_iomap_cache_invalidate_range(struct inode *inode, loff_t offset,
+ uint64_t length)
+{
+ loff_t aligned_offset;
+ const unsigned int blocksize = i_blocksize(inode);
+ int ret, ret2;
+
+ if (!fuse_inode_caches_iomaps(inode))
+ return 0;
+
+ trace_fuse_iomap_cache_invalidate_range(inode, offset, length);
+
+ aligned_offset = round_down(offset, blocksize);
+ if (length != FUSE_IOMAP_INVAL_TO_EOF) {
+ length += offset - aligned_offset;
+ length = round_up(length, blocksize);
+ }
+
+ fuse_iomap_cache_lock(inode);
+ ret = fuse_iomap_cache_remove(inode, READ_MAPPING,
+ aligned_offset, length);
+ ret2 = fuse_iomap_cache_remove(inode, WRITE_MAPPING,
+ aligned_offset, length);
+ fuse_iomap_cache_unlock(inode);
+ if (ret)
+ return ret;
+ return ret2;
+}
+
static void
fuse_iext_add_mapping(
struct fuse_iomap_cache *ip,
* [PATCH 4/4] fuse: enable iomap cache management
2025-08-21 0:47 ` [PATCHSET RFC v4 3/4] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
` (2 preceding siblings ...)
2025-08-21 0:59 ` [PATCH 3/4] fuse: invalidate iomap cache after file updates Darrick J. Wong
@ 2025-08-21 0:59 ` Darrick J. Wong
3 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:59 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Provide a means for the fuse server to upload iomappings to the kernel
and invalidate them. This is how we enable iomap caching for better
performance. This is also required for correct synchronization between
pagecache writes and writeback.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
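Editorial note (below the "---", so not part of the commit message): from the
fuse server's side, the payload of the proposed FUSE_NOTIFY_IOMAP_INVAL
notification might be assembled as sketched below. The struct is a local
mirror of fuse_iomap_inval_out as it appears in this patch, since the uapi
header is not merged. The sketch_* names and the helper are hypothetical, and
how the payload is actually written to the fuse device is deliberately not
shown.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mirrors FUSE_IOMAP_INVAL_TO_EOF from the uapi header. */
#define SKETCH_INVAL_TO_EOF	(~0ULL)

/* Local mirror of struct fuse_iomap_inval_out as proposed in this patch. */
struct sketch_iomap_inval_out {
	uint64_t nodeid;	/* Inode ID */
	uint64_t attr_ino;	/* matches fuse_attr:ino */
	uint64_t read_offset;	/* range to invalidate read iomaps, bytes */
	uint64_t read_length;	/* can be SKETCH_INVAL_TO_EOF */
	uint64_t write_offset;	/* range to invalidate write iomaps, bytes */
	uint64_t write_length;	/* can be SKETCH_INVAL_TO_EOF */
};

/*
 * The kernel side rejects a notification whose size differs from
 * sizeof(outarg), so the mirror must not contain padding.
 */
static_assert(sizeof(struct sketch_iomap_inval_out) == 6 * sizeof(uint64_t),
	      "unexpected padding in payload mirror");

/* Hypothetical helper: drop read and write mappings from offset to EOF. */
static struct sketch_iomap_inval_out
sketch_inval_from(uint64_t nodeid, uint64_t attr_ino, uint64_t offset)
{
	struct sketch_iomap_inval_out out;

	memset(&out, 0, sizeof(out));
	out.nodeid = nodeid;
	out.attr_ino = attr_ino;
	out.read_offset = offset;
	out.read_length = SKETCH_INVAL_TO_EOF;
	out.write_offset = offset;
	out.write_length = SKETCH_INVAL_TO_EOF;
	return out;
}
```

The size check matters because fuse_notify_iomap_inval() returns -EINVAL
unless the notification size equals sizeof(outarg) exactly.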
fs/fuse/fuse_i.h | 7 +
fs/fuse/fuse_trace.h | 68 +++++++++++++
include/uapi/linux/fuse.h | 28 +++++
fs/fuse/dev.c | 44 ++++++++
fs/fuse/file_iomap.c | 244 ++++++++++++++++++++++++++++++++++++++++++++-
5 files changed, 387 insertions(+), 4 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 0a7192b633dd3a..a710c56b205e30 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1795,6 +1795,11 @@ enum fuse_iomap_iodir {
READ_MAPPING,
WRITE_MAPPING,
};
+
+int fuse_iomap_upsert(struct fuse_conn *fc,
+ const struct fuse_iomap_upsert_out *outarg);
+int fuse_iomap_inval(struct fuse_conn *fc,
+ const struct fuse_iomap_inval_out *outarg);
#else
# define fuse_iomap_enabled(...) (false)
# define fuse_has_iomap(...) (false)
@@ -1827,6 +1832,8 @@ enum fuse_iomap_iodir {
# define fuse_dev_ioctl_iomap_support(...) (-EOPNOTSUPP)
# define fuse_iomap_fadvise NULL
# define fuse_inode_caches_iomaps(...) (false)
+# define fuse_iomap_upsert(...) (-ENOSYS)
+# define fuse_iomap_inval(...) (-ENOSYS)
#endif
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index cd8aa9e0633eee..80af541a54c5bd 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -320,6 +320,7 @@ struct fuse_iomap_lookup;
#define FUSE_IOMAP_TYPE_STRINGS \
{ FUSE_IOMAP_TYPE_PURE_OVERWRITE, "overwrite" }, \
{ FUSE_IOMAP_TYPE_RETRY_CACHE, "retry" }, \
+ { FUSE_IOMAP_TYPE_NOCACHE, "nocache" }, \
{ FUSE_IOMAP_TYPE_HOLE, "hole" }, \
{ FUSE_IOMAP_TYPE_DELALLOC, "delalloc" }, \
{ FUSE_IOMAP_TYPE_MAPPED, "mapped" }, \
@@ -784,6 +785,7 @@ DEFINE_EVENT(fuse_inode_state_class, name, \
TP_ARGS(inode))
DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_init_inode);
DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_evict_inode);
+DEFINE_FUSE_INODE_STATE_EVENT(fuse_iomap_cache_enable);
TRACE_EVENT(fuse_iomap_end_ioend,
TP_PROTO(const struct iomap_ioend *ioend),
@@ -1482,6 +1484,72 @@ TRACE_EVENT(fuse_iomap_invalid,
__entry->old_validity_cookie,
__entry->validity_cookie)
);
+
+TRACE_EVENT(fuse_iomap_upsert,
+ TP_PROTO(const struct inode *inode,
+ const struct fuse_iomap_upsert_out *outarg),
+ TP_ARGS(inode, outarg),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ __field(uint64_t, attr_ino)
+
+ FUSE_IOMAP_MAP_FIELDS(read)
+ FUSE_IOMAP_MAP_FIELDS(write)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->attr_ino = outarg->attr_ino;
+ __entry->readoffset = outarg->read.offset;
+ __entry->readlength = outarg->read.length;
+ __entry->readaddr = outarg->read.addr;
+ __entry->readtype = outarg->read.type;
+ __entry->readflags = outarg->read.flags;
+ __entry->readdev = outarg->read.dev;
+ __entry->writeoffset = outarg->write.offset;
+ __entry->writelength = outarg->write.length;
+ __entry->writeaddr = outarg->write.addr;
+ __entry->writetype = outarg->write.type;
+ __entry->writeflags = outarg->write.flags;
+ __entry->writedev = outarg->write.dev;
+ ),
+
+ TP_printk(FUSE_INODE_FMT " attr_ino 0x%llx" FUSE_IOMAP_MAP_FMT("read") FUSE_IOMAP_MAP_FMT("write"),
+ FUSE_INODE_PRINTK_ARGS,
+ __entry->attr_ino,
+ FUSE_IOMAP_MAP_PRINTK_ARGS(read),
+ FUSE_IOMAP_MAP_PRINTK_ARGS(write))
+);
+
+TRACE_EVENT(fuse_iomap_inval,
+ TP_PROTO(const struct inode *inode,
+ const struct fuse_iomap_inval_out *outarg),
+ TP_ARGS(inode, outarg),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ __field(uint64_t, attr_ino)
+
+ FUSE_FILE_RANGE_FIELDS(read)
+ FUSE_FILE_RANGE_FIELDS(write)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->attr_ino = outarg->attr_ino;
+ __entry->readoffset = outarg->read_offset;
+ __entry->readlength = outarg->read_length;
+ __entry->writeoffset = outarg->write_offset;
+ __entry->writelength = outarg->write_length;
+ ),
+
+ TP_printk(FUSE_INODE_FMT " attr_ino 0x%llx" FUSE_FILE_RANGE_FMT("read") FUSE_FILE_RANGE_FMT("write"),
+ FUSE_INODE_PRINTK_ARGS,
+ __entry->attr_ino,
+ FUSE_FILE_RANGE_PRINTK_ARGS(read),
+ FUSE_FILE_RANGE_PRINTK_ARGS(write))
+);
#endif /* CONFIG_FUSE_IOMAP */
#endif /* _TRACE_FUSE_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index df23eb65f0b497..41f07513bbdd9d 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -243,6 +243,8 @@
* - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry
* - add FUSE_NOTIFY_IOMAP_DEV_INVAL to invalidate iomap bdev ranges
* - add FUSE_ATTR_ATOMIC for single-fsblock atomic write support
+ * - add FUSE_NOTIFY_IOMAP_UPSERT and FUSE_NOTIFY_IOMAP_INVAL so fuse servers
+ * can cache iomappings in the kernel
*/
#ifndef _LINUX_FUSE_H
@@ -709,6 +711,8 @@ enum fuse_notify_code {
FUSE_NOTIFY_RESEND = 7,
FUSE_NOTIFY_INC_EPOCH = 8,
FUSE_NOTIFY_IOMAP_DEV_INVAL = 9,
+ FUSE_NOTIFY_IOMAP_UPSERT = 10,
+ FUSE_NOTIFY_IOMAP_INVAL = 11,
FUSE_NOTIFY_CODE_MAX,
};
@@ -1346,6 +1350,8 @@ struct fuse_uring_cmd_req {
#define FUSE_IOMAP_TYPE_PURE_OVERWRITE (255)
/* fuse-specific mapping type saying the server has populated the cache */
#define FUSE_IOMAP_TYPE_RETRY_CACHE (254)
+/* do not upsert this mapping */
+#define FUSE_IOMAP_TYPE_NOCACHE (253)
#define FUSE_IOMAP_DEV_NULL (0U) /* null device cookie */
@@ -1486,4 +1492,26 @@ struct fuse_iomap_dev_inval_out {
/* invalidate all cached iomap mappings up to EOF */
#define FUSE_IOMAP_INVAL_TO_EOF (~0ULL)
+struct fuse_iomap_inval_out {
+ uint64_t nodeid; /* Inode ID */
+ uint64_t attr_ino; /* matches fuse_attr:ino */
+
+ uint64_t read_offset; /* range to invalidate read iomaps, bytes */
+ uint64_t read_length; /* can be FUSE_IOMAP_INVAL_TO_EOF */
+
+ uint64_t write_offset; /* range to invalidate write iomaps, bytes */
+ uint64_t write_length; /* can be FUSE_IOMAP_INVAL_TO_EOF */
+};
+
+struct fuse_iomap_upsert_out {
+ uint64_t nodeid; /* Inode ID */
+ uint64_t attr_ino; /* matches fuse_attr:ino */
+
+ /* read file data from here */
+ struct fuse_iomap_io read;
+
+ /* write file data to here, if applicable */
+ struct fuse_iomap_io write;
+};
+
#endif /* _LINUX_FUSE_H */
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 575cb6e15d84d5..dc730bbb91d31d 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1857,6 +1857,46 @@ static int fuse_notify_iomap_dev_inval(struct fuse_conn *fc, unsigned int size,
return err;
}
+static int fuse_notify_iomap_upsert(struct fuse_conn *fc, unsigned int size,
+ struct fuse_copy_state *cs)
+{
+ struct fuse_iomap_upsert_out outarg;
+ int err = -EINVAL;
+
+ if (size != sizeof(outarg))
+ goto err;
+
+ err = fuse_copy_one(cs, &outarg, sizeof(outarg));
+ if (err)
+ goto err;
+ fuse_copy_finish(cs);
+
+ return fuse_iomap_upsert(fc, &outarg);
+err:
+ fuse_copy_finish(cs);
+ return err;
+}
+
+static int fuse_notify_iomap_inval(struct fuse_conn *fc, unsigned int size,
+ struct fuse_copy_state *cs)
+{
+ struct fuse_iomap_inval_out outarg;
+ int err = -EINVAL;
+
+ if (size != sizeof(outarg))
+ goto err;
+
+ err = fuse_copy_one(cs, &outarg, sizeof(outarg));
+ if (err)
+ goto err;
+ fuse_copy_finish(cs);
+
+ return fuse_iomap_inval(fc, &outarg);
+err:
+ fuse_copy_finish(cs);
+ return err;
+}
+
struct fuse_retrieve_args {
struct fuse_args_pages ap;
struct fuse_notify_retrieve_in inarg;
@@ -2105,6 +2145,10 @@ static int fuse_notify(struct fuse_conn *fc, enum fuse_notify_code code,
case FUSE_NOTIFY_IOMAP_DEV_INVAL:
return fuse_notify_iomap_dev_inval(fc, size, cs);
+ case FUSE_NOTIFY_IOMAP_UPSERT:
+ return fuse_notify_iomap_upsert(fc, size, cs);
+ case FUSE_NOTIFY_IOMAP_INVAL:
+ return fuse_notify_iomap_inval(fc, size, cs);
default:
fuse_copy_finish(cs);
diff --git a/fs/fuse/file_iomap.c b/fs/fuse/file_iomap.c
index b4a2c4ea00a6f8..f37f2890a343b5 100644
--- a/fs/fuse/file_iomap.c
+++ b/fs/fuse/file_iomap.c
@@ -163,6 +163,7 @@ static inline bool fuse_iomap_check_type(uint16_t fuse_type)
case FUSE_IOMAP_TYPE_INLINE:
case FUSE_IOMAP_TYPE_PURE_OVERWRITE:
case FUSE_IOMAP_TYPE_RETRY_CACHE:
+ case FUSE_IOMAP_TYPE_NOCACHE:
return true;
}
@@ -269,12 +270,13 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
uint64_t end;
/*
- * Type and flags must be known. Mapping type "retry cache" doesn't
- * use any of the other fields.
+ * Type and flags must be known. Mapping types "retry cache" and "do
+ * not insert in cache" don't use any of the other fields.
*/
if (BAD_DATA(!fuse_iomap_check_type(map->type)))
return false;
- if (map->type == FUSE_IOMAP_TYPE_RETRY_CACHE)
+ if (map->type == FUSE_IOMAP_TYPE_RETRY_CACHE ||
+ map->type == FUSE_IOMAP_TYPE_NOCACHE)
return true;
if (BAD_DATA(!fuse_iomap_check_flags(map->flags)))
return false;
@@ -328,6 +330,9 @@ static inline bool fuse_iomap_check_mapping(const struct inode *inode,
if (BAD_DATA(iodir != WRITE_MAPPING))
return false;
break;
+ case FUSE_IOMAP_TYPE_NOCACHE:
+ /* We're ignoring this mapping */
+ break;
default:
/* should have been caught already */
ASSERT(0);
@@ -383,6 +388,15 @@ fuse_iomap_begin_validate(const struct inode *inode,
if (!fuse_iomap_check_mapping(inode, &outarg->write, WRITE_MAPPING))
return -EFSCORRUPTED;
+ /*
+ * ->iomap_begin requires real mappings or "retry from cache"; "do not
+ * add to cache" does not apply here.
+ */
+ if (BAD_DATA(outarg->read.type == FUSE_IOMAP_TYPE_NOCACHE))
+ return -EFSCORRUPTED;
+ if (BAD_DATA(outarg->write.type == FUSE_IOMAP_TYPE_NOCACHE))
+ return -EFSCORRUPTED;
+
/*
* Must have returned a mapping for at least the first byte in the
* range. The main mapping check already validated that the length
@@ -614,9 +628,11 @@ fuse_iomap_cached_validate(const struct inode *inode,
if (!fuse_iomap_check_mapping(inode, &lmap->map, dir))
return -EFSCORRUPTED;
- /* The cache should not be storing "retry cache" mappings */
+ /* The cache should not be storing cache management mappings */
if (BAD_DATA(lmap->map.type == FUSE_IOMAP_TYPE_RETRY_CACHE))
return -EFSCORRUPTED;
+ if (BAD_DATA(lmap->map.type == FUSE_IOMAP_TYPE_NOCACHE))
+ return -EFSCORRUPTED;
return 0;
}
@@ -2462,3 +2478,223 @@ void fuse_iomap_copied_file_range(struct inode *inode, loff_t offset,
fuse_iomap_cache_invalidate_range(inode, offset, written);
}
+
+static inline bool
+fuse_iomap_upsert_validate_dev(
+ const struct fuse_backing *fb,
+ const struct fuse_iomap_io *map)
+{
+ uint64_t map_end;
+ sector_t device_bytes;
+
+ if (!fb) {
+ if (BAD_DATA(map->addr != FUSE_IOMAP_NULL_ADDR))
+ return false;
+
+ return true;
+ }
+
+ if (BAD_DATA(map->addr == FUSE_IOMAP_NULL_ADDR))
+ return false;
+
+ if (BAD_DATA(check_add_overflow(map->addr, map->length, &map_end)))
+ return false;
+
+ device_bytes = bdev_nr_sectors(fb->bdev) << SECTOR_SHIFT;
+ if (BAD_DATA(map_end > device_bytes))
+ return false;
+
+ return true;
+}
+
+/* Validate one of the incoming upsert mappings */
+static inline bool
+fuse_iomap_upsert_validate_mapping(struct inode *inode,
+ enum fuse_iomap_iodir iodir,
+ const struct fuse_iomap_io *map)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_backing *fb;
+ bool ret;
+
+ if (!fuse_iomap_check_mapping(inode, map, iodir))
+ return false;
+
+ /*
+ * A "retry cache" instruction makes no sense when we're adding to
+ * the mapping cache.
+ */
+ if (BAD_DATA(map->type == FUSE_IOMAP_TYPE_RETRY_CACHE))
+ return false;
+
+ if (map->type == FUSE_IOMAP_TYPE_NOCACHE)
+ return true;
+
+ /* Make sure we can find the device */
+ fb = fuse_iomap_find_dev(fc, map);
+ if (IS_ERR(fb))
+ return false;
+
+ ret = fuse_iomap_upsert_validate_dev(fb, map);
+ fuse_backing_put(fb);
+ return ret;
+}
+
+/* Check the incoming upsert mappings to make sure they're not nonsense */
+static inline int
+fuse_iomap_upsert_validate(struct inode *inode,
+ const struct fuse_iomap_upsert_out *outarg)
+{
+ if (!fuse_iomap_upsert_validate_mapping(inode, READ_MAPPING,
+ &outarg->read))
+ return -EFSCORRUPTED;
+ if (!fuse_iomap_upsert_validate_mapping(inode, WRITE_MAPPING,
+ &outarg->write))
+ return -EFSCORRUPTED;
+
+ return 0;
+}
+
+int fuse_iomap_upsert(struct fuse_conn *fc,
+ const struct fuse_iomap_upsert_out *outarg)
+{
+ struct inode *inode;
+ struct fuse_inode *fi;
+ int ret;
+
+ if (!fc->iomap)
+ return -EINVAL;
+
+ down_read(&fc->killsb);
+ inode = fuse_ilookup(fc, outarg->nodeid, NULL);
+ if (!inode) {
+ ret = -ESTALE;
+ goto out_sb;
+ }
+
+ trace_fuse_iomap_upsert(inode, outarg);
+
+ fi = get_fuse_inode(inode);
+ if (BAD_DATA(fi->orig_ino != outarg->attr_ino)) {
+ ret = -EINVAL;
+ goto out_inode;
+ }
+
+ if (fuse_is_bad(inode)) {
+ ret = -EIO;
+ goto out_inode;
+ }
+
+ ret = fuse_iomap_upsert_validate(inode, outarg);
+ if (ret)
+ goto out_inode;
+
+ fuse_iomap_cache_lock(inode);
+
+ if (!test_and_set_bit(FUSE_I_IOMAP_CACHE, &fi->state))
+ trace_fuse_iomap_cache_enable(inode);
+
+ if (outarg->read.type != FUSE_IOMAP_TYPE_NOCACHE) {
+ ret = fuse_iomap_cache_upsert(inode, READ_MAPPING,
+ &outarg->read);
+ if (ret)
+ goto out_unlock;
+ }
+
+ if (outarg->write.type != FUSE_IOMAP_TYPE_NOCACHE) {
+ ret = fuse_iomap_cache_upsert(inode, WRITE_MAPPING,
+ &outarg->write);
+ if (ret)
+ goto out_unlock;
+ }
+
+out_unlock:
+ fuse_iomap_cache_unlock(inode);
+out_inode:
+ iput(inode);
+out_sb:
+ up_read(&fc->killsb);
+ return ret;
+}
+
+static inline bool fuse_iomap_inval_validate(const struct inode *inode,
+ uint64_t offset, uint64_t length)
+{
+ const unsigned int blocksize = i_blocksize(inode);
+
+ if (length == 0)
+ return true;
+
+ /* Range can't start beyond maxbytes */
+ if (BAD_DATA(offset >= inode->i_sb->s_maxbytes))
+ return false;
+
+ /* File range must be aligned to blocksize */
+ if (BAD_DATA(!IS_ALIGNED(offset, blocksize)))
+ return false;
+ if (length != FUSE_IOMAP_INVAL_TO_EOF &&
+ BAD_DATA(!IS_ALIGNED(length, blocksize)))
+ return false;
+
+ return true;
+}
+
+int fuse_iomap_inval(struct fuse_conn *fc,
+ const struct fuse_iomap_inval_out *outarg)
+{
+ struct inode *inode;
+ struct fuse_inode *fi;
+ int ret = 0, ret2 = 0;
+
+ if (!fc->iomap)
+ return -EINVAL;
+
+ down_read(&fc->killsb);
+ inode = fuse_ilookup(fc, outarg->nodeid, NULL);
+ if (!inode) {
+ ret = -ESTALE;
+ goto out_sb;
+ }
+
+ trace_fuse_iomap_inval(inode, outarg);
+
+ fi = get_fuse_inode(inode);
+ if (BAD_DATA(fi->orig_ino != outarg->attr_ino)) {
+ ret = -EINVAL;
+ goto out_inode;
+ }
+
+ if (fuse_is_bad(inode)) {
+ ret = -EIO;
+ goto out_inode;
+ }
+
+ if (!fuse_iomap_inval_validate(inode, outarg->write_offset,
+ outarg->write_length)) {
+ ret = -EFSCORRUPTED;
+ goto out_inode;
+ }
+
+ if (!fuse_iomap_inval_validate(inode, outarg->read_offset,
+ outarg->read_length)) {
+ ret = -EFSCORRUPTED;
+ goto out_inode;
+ }
+
+ fuse_iomap_cache_lock(inode);
+ if (outarg->read_length)
+ ret2 = fuse_iomap_cache_remove(inode, READ_MAPPING,
+ outarg->read_offset,
+ outarg->read_length);
+ if (outarg->write_length)
+ ret = fuse_iomap_cache_remove(inode, WRITE_MAPPING,
+ outarg->write_offset,
+ outarg->write_length);
+ fuse_iomap_cache_unlock(inode);
+
+out_inode:
+ iput(inode);
+out_sb:
+ up_read(&fc->killsb);
+ return ret ? ret : ret2;
+}
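The range checks in the upsert and invalidation paths above can be modelled in userspace. The following is a minimal sketch, not the kernel code: all `sketch_` names are hypothetical, `SKETCH_INVAL_TO_EOF` is a stand-in whose value is an assumption (the patch does not show how FUSE_IOMAP_INVAL_TO_EOF is defined), and `__builtin_add_overflow` is the GCC/Clang builtin standing in for the kernel's check_add_overflow().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for FUSE_IOMAP_INVAL_TO_EOF; the real value is not shown. */
#define SKETCH_INVAL_TO_EOF UINT64_MAX

/* True if [addr, addr + length) lies entirely within a device of device_bytes. */
static bool sketch_range_fits_device(uint64_t addr, uint64_t length,
				     uint64_t device_bytes)
{
	uint64_t end;

	/* The builtin returns true when addr + length wraps around. */
	if (__builtin_add_overflow(addr, length, &end))
		return false;
	return end <= device_bytes;
}

/* Model of the invalidation range checks; blocksize must be nonzero. */
static bool sketch_inval_range_ok(uint64_t offset, uint64_t length,
				  uint64_t blocksize, uint64_t maxbytes)
{
	if (length == 0)
		return true;		/* empty range, nothing to remove */
	if (offset >= maxbytes)
		return false;		/* range starts beyond the maximum file size */
	if (offset % blocksize)
		return false;		/* start must be block aligned */
	if (length != SKETCH_INVAL_TO_EOF && length % blocksize)
		return false;		/* length must be aligned unless "to EOF" */
	return true;
}
```

A zero-length range is accepted without further checks, which matches how fuse_iomap_inval() skips the cache removal for a direction whose length is zero.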
* [PATCH 1/6] fuse: force a ctime update after a fileattr_set call when in iomap mode
2025-08-21 0:48 ` [PATCHSET RFC v4 4/4] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
@ 2025-08-21 0:59 ` Darrick J. Wong
2025-08-21 1:00 ` [PATCH 2/6] fuse: synchronize inode->i_flags after fileattr_[gs]et Darrick J. Wong
` (4 subsequent siblings)
5 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 0:59 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
In iomap mode, the kernel is in charge of driving ctime updates to
the fuse server and ignores updates coming from the fuse server.
Therefore, when someone calls fileattr_set to change file attributes, we
must force a ctime update.
Found by generic/277.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/ioctl.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/fs/fuse/ioctl.c b/fs/fuse/ioctl.c
index 57032eadca6c27..f5f7d806262cdf 100644
--- a/fs/fuse/ioctl.c
+++ b/fs/fuse/ioctl.c
@@ -548,8 +548,13 @@ int fuse_fileattr_set(struct mnt_idmap *idmap,
struct fuse_file *ff;
unsigned int flags = fa->flags;
struct fsxattr xfa;
+ struct file_kattr old_ma = { };
+ bool is_wb = (fuse_get_cache_mask(inode) & STATX_CTIME);
int err;
+ if (is_wb)
+ vfs_fileattr_get(dentry, &old_ma);
+
ff = fuse_priv_ioctl_prepare(inode);
if (IS_ERR(ff))
return PTR_ERR(ff);
@@ -573,6 +578,12 @@ int fuse_fileattr_set(struct mnt_idmap *idmap,
cleanup:
fuse_priv_ioctl_cleanup(inode, ff);
+ /*
+ * If we cache ctime updates and the fileattr changed, then force a
+ * ctime update.
+ */
+ if (is_wb && memcmp(&old_ma, fa, sizeof(old_ma)))
+ fuse_update_ctime(inode);
if (err == -ENOTTY)
err = -EOPNOTSUPP;
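The snapshot-and-compare pattern in this patch can be illustrated outside the kernel. This is a rough sketch under stated assumptions: `struct sketch_fileattr` and `struct sketch_inode` are hypothetical stand-ins for struct file_kattr and the fuse inode, and the assignment in the middle replaces the real round trip to the fuse server.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for struct file_kattr; three same-sized fields, no padding. */
struct sketch_fileattr {
	unsigned int flags;
	unsigned int xflags;
	unsigned int projid;
};

/* Hypothetical stand-in for the inode state this patch cares about. */
struct sketch_inode {
	struct sketch_fileattr attr;
	unsigned long ctime_bumps;	/* counts forced ctime updates */
	bool kernel_owns_ctime;		/* models (cache mask & STATX_CTIME) */
};

static void sketch_fileattr_set(struct sketch_inode *ino,
				const struct sketch_fileattr *want)
{
	struct sketch_fileattr old = { 0 };

	/* Snapshot the old attributes only if we will need them later. */
	if (ino->kernel_owns_ctime)
		old = ino->attr;

	/* Stands in for sending the new attributes to the fuse server. */
	ino->attr = *want;

	/* Force a ctime update only when something actually changed. */
	if (ino->kernel_owns_ctime && memcmp(&old, want, sizeof(old)) != 0)
		ino->ctime_bumps++;
}
```

Because memcmp() also compares padding bytes, the sketch uses a struct without padding; a real structure with padding would need to be zero-initialized consistently on both sides for the comparison to be meaningful.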
* [PATCH 2/6] fuse: synchronize inode->i_flags after fileattr_[gs]et
2025-08-21 0:48 ` [PATCHSET RFC v4 4/4] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
2025-08-21 0:59 ` [PATCH 1/6] fuse: force a ctime update after a fileattr_set call when in iomap mode Darrick J. Wong
@ 2025-08-21 1:00 ` Darrick J. Wong
2025-08-21 1:00 ` [PATCH 3/6] fuse: cache atime when in iomap mode Darrick J. Wong
` (3 subsequent siblings)
5 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:00 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
There are three inode flags (immutable, append, sync) that are enforced
by the VFS. Whenever we go around setting iflags, let's update the VFS
state so that they actually work.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_i.h | 1 +
fs/fuse/fuse_trace.h | 23 +++++++++++++
fs/fuse/dir.c | 1 +
fs/fuse/inode.c | 1 +
fs/fuse/ioctl.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 116 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index a710c56b205e30..f7a7d8ad641d5b 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1588,6 +1588,7 @@ long fuse_file_compat_ioctl(struct file *file, unsigned int cmd,
int fuse_fileattr_get(struct dentry *dentry, struct file_kattr *fa);
int fuse_fileattr_set(struct mnt_idmap *idmap,
struct dentry *dentry, struct file_kattr *fa);
+void fuse_fileattr_init(struct inode *inode, const struct fuse_attr *attr);
/* iomode.c */
int fuse_file_cached_io_open(struct inode *inode, struct fuse_file *ff);
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index 80af541a54c5bd..aea9ea0835d497 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -176,6 +176,29 @@ TRACE_EVENT(fuse_request_end,
__entry->unique, __entry->len, __entry->error)
);
+TRACE_EVENT(fuse_fileattr_update_inode,
+ TP_PROTO(const struct inode *inode, unsigned int old_iflags),
+
+ TP_ARGS(inode, old_iflags),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ __field(unsigned int, old_iflags)
+ __field(unsigned int, new_iflags)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->old_iflags = old_iflags;
+ __entry->new_iflags = inode->i_flags;
+ ),
+
+ TP_printk(FUSE_INODE_FMT " old_iflags 0x%x iflags 0x%x",
+ FUSE_INODE_PRINTK_ARGS,
+ __entry->old_iflags,
+ __entry->new_iflags)
+);
+
#ifdef CONFIG_FUSE_BACKING
#define FUSE_BACKING_PASSTHROUGH (1U << 0)
#define FUSE_BACKING_IOMAP (1U << 1)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 05cb79beb8e426..d2f9bcccd776f0 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1254,6 +1254,7 @@ static void fuse_fillattr(struct mnt_idmap *idmap, struct inode *inode,
blkbits = inode->i_sb->s_blocksize_bits;
stat->blksize = 1 << blkbits;
+ generic_fill_statx_attr(inode, stat);
}
static void fuse_statx_to_attr(struct fuse_statx *sx, struct fuse_attr *attr)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 18dc9492d19174..b1793df3cbbd1a 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -524,6 +524,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
inode->i_flags |= S_NOCMTIME;
inode->i_generation = generation;
fuse_init_inode(inode, attr, fc);
+ fuse_fileattr_init(inode, attr);
unlock_new_inode(inode);
} else if (fuse_stale_inode(inode, generation, attr)) {
/* nodeid was reused, any I/O on the old inode should fail */
diff --git a/fs/fuse/ioctl.c b/fs/fuse/ioctl.c
index f5f7d806262cdf..c320ea80cb3db8 100644
--- a/fs/fuse/ioctl.c
+++ b/fs/fuse/ioctl.c
@@ -4,6 +4,7 @@
*/
#include "fuse_i.h"
+#include "fuse_trace.h"
#include <linux/uio.h>
#include <linux/compat.h>
@@ -502,6 +503,92 @@ static void fuse_priv_ioctl_cleanup(struct inode *inode, struct fuse_file *ff)
fuse_file_release(inode, ff, O_RDONLY, NULL, S_ISDIR(inode->i_mode));
}
+static inline void update_iflag(struct inode *inode, unsigned int iflag,
+ bool set)
+{
+ if (set)
+ inode->i_flags |= iflag;
+ else
+ inode->i_flags &= ~iflag;
+}
+
+static void fuse_fileattr_update_inode(struct inode *inode,
+ const struct file_kattr *fa)
+{
+ unsigned int old_iflags = inode->i_flags;
+
+ /*
+ * Prior to iomap, the fuse driver sent all file IO operations to the
+ * fuse server, which was wholly responsible for enforcing the
+ * immutable and append bits. With iomap, we let more of the kernel IO
+ * path stay within the kernel, so we actually have to set the VFS
+ * flags now so that the enforcement can take place inside the kernel.
+ */
+ if (!fuse_has_iomap(inode))
+ return;
+
+ /*
+ * Configure VFS enforcement of the three inode flags that we support.
+ * XXX: still need to figure out what's going on wrt NOATIME in fuse.
+ */
+ if (fa->flags_valid) {
+ update_iflag(inode, S_SYNC, fa->flags & FS_SYNC_FL);
+ update_iflag(inode, S_IMMUTABLE, fa->flags & FS_IMMUTABLE_FL);
+ update_iflag(inode, S_APPEND, fa->flags & FS_APPEND_FL);
+ } else if (fa->fsx_xflags) {
+ update_iflag(inode, S_SYNC, fa->fsx_xflags & FS_XFLAG_SYNC);
+ update_iflag(inode, S_IMMUTABLE,
+ fa->fsx_xflags & FS_XFLAG_IMMUTABLE);
+ update_iflag(inode, S_APPEND, fa->fsx_xflags & FS_XFLAG_APPEND);
+ }
+
+ trace_fuse_fileattr_update_inode(inode, old_iflags);
+
+ if (old_iflags != inode->i_flags)
+ fuse_invalidate_attr(inode);
+}
+
+void fuse_fileattr_init(struct inode *inode, const struct fuse_attr *attr)
+{
+ struct file_kattr fa;
+ struct fsxattr xfa = { };
+ struct fuse_file *ff;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ unsigned int flags = 0;
+ int err;
+
+ if (!fuse_has_iomap(inode))
+ return;
+
+ /*
+ * Don't do this when we're setting up the root inode because the
+ * connection workers haven't been set up yet.
+ */
+ if (attr->ino == fc->root_nodeid && attr->blksize == 0)
+ return;
+
+ ff = fuse_priv_ioctl_prepare(inode);
+ if (IS_ERR(ff))
+ return;
+
+ err = fuse_priv_ioctl(inode, ff, FS_IOC_FSGETXATTR, &xfa, sizeof(xfa));
+ if (!err) {
+ fileattr_fill_xflags(&fa, xfa.fsx_xflags);
+ fuse_fileattr_update_inode(inode, &fa);
+ goto cleanup;
+ }
+
+ err = fuse_priv_ioctl(inode, ff, FS_IOC_GETFLAGS, &flags, sizeof(flags));
+ if (!err) {
+ fileattr_fill_flags(&fa, flags);
+ fuse_fileattr_update_inode(inode, &fa);
+ goto cleanup;
+ }
+
+cleanup:
+ fuse_priv_ioctl_cleanup(inode, ff);
+}
+
int fuse_fileattr_get(struct dentry *dentry, struct file_kattr *fa)
{
struct inode *inode = d_inode(dentry);
@@ -574,7 +661,10 @@ int fuse_fileattr_set(struct mnt_idmap *idmap,
err = fuse_priv_ioctl(inode, ff, FS_IOC_FSSETXATTR,
&xfa, sizeof(xfa));
+ if (err)
+ goto cleanup;
}
+ fuse_fileattr_update_inode(inode, fa);
cleanup:
fuse_priv_ioctl_cleanup(inode, ff);
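The translation from fileattr flags to VFS inode flags in this patch boils down to setting or clearing one bit per attribute. The sketch below models that in plain C. The `SKETCH_` constants are illustrative stand-ins for the FS_*_FL and S_* bits and are not taken from the kernel headers; only the flags_valid branch is modelled.

```c
#include <stdbool.h>

/* Illustrative stand-ins for FS_SYNC_FL, FS_IMMUTABLE_FL, FS_APPEND_FL. */
#define SKETCH_FL_SYNC		0x08u
#define SKETCH_FL_IMMUTABLE	0x10u
#define SKETCH_FL_APPEND	0x20u

/* Illustrative stand-ins for S_SYNC, S_APPEND, S_IMMUTABLE. */
#define SKETCH_S_SYNC		(1u << 0)
#define SKETCH_S_APPEND		(1u << 2)
#define SKETCH_S_IMMUTABLE	(1u << 3)

/* Set or clear one bit, leaving every other bit untouched. */
static unsigned int sketch_update_iflag(unsigned int iflags, unsigned int bit,
					bool set)
{
	return set ? (iflags | bit) : (iflags & ~bit);
}

/* Recompute the three enforced bits from the fileattr flags word. */
static unsigned int sketch_sync_iflags(unsigned int iflags, unsigned int fl)
{
	iflags = sketch_update_iflag(iflags, SKETCH_S_SYNC,
				     fl & SKETCH_FL_SYNC);
	iflags = sketch_update_iflag(iflags, SKETCH_S_IMMUTABLE,
				     fl & SKETCH_FL_IMMUTABLE);
	iflags = sketch_update_iflag(iflags, SKETCH_S_APPEND,
				     fl & SKETCH_FL_APPEND);
	return iflags;
}
```

A caller can compare the returned value with the previous flags to decide whether cached attributes need invalidating, which is what the old_iflags comparison in fuse_fileattr_update_inode() does.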
* [PATCH 3/6] fuse: cache atime when in iomap mode
2025-08-21 0:48 ` [PATCHSET RFC v4 4/4] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
2025-08-21 0:59 ` [PATCH 1/6] fuse: force a ctime update after a fileattr_set call when in iomap mode Darrick J. Wong
2025-08-21 1:00 ` [PATCH 2/6] fuse: synchronize inode->i_flags after fileattr_[gs]et Darrick J. Wong
@ 2025-08-21 1:00 ` Darrick J. Wong
2025-08-21 1:00 ` [PATCH 4/6] fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems Darrick J. Wong
` (2 subsequent siblings)
5 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:00 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
When we're running in iomap mode, allow the kernel to cache the access
timestamp to further reduce the number of roundtrips to the fuse server.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/dir.c | 5 +++++
fs/fuse/inode.c | 19 ++++++++++++++++---
2 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index d2f9bcccd776f0..a3ea50b99054ff 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -2030,6 +2030,11 @@ int fuse_flush_times(struct inode *inode, struct fuse_file *ff)
inarg.ctime = inode_get_ctime_sec(inode);
inarg.ctimensec = inode_get_ctime_nsec(inode);
}
+ if (fuse_has_iomap(inode)) {
+ inarg.valid |= FATTR_ATIME;
+ inarg.atime = inode_get_atime_sec(inode);
+ inarg.atimensec = inode_get_atime_nsec(inode);
+ }
if (ff) {
inarg.valid |= FATTR_FH;
inarg.fh = ff->fh;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index b1793df3cbbd1a..91143845c615c4 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -264,7 +264,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
attr->mtimensec = min_t(u32, attr->mtimensec, NSEC_PER_SEC - 1);
attr->ctimensec = min_t(u32, attr->ctimensec, NSEC_PER_SEC - 1);
- inode_set_atime(inode, attr->atime, attr->atimensec);
+ if (!(cache_mask & STATX_ATIME))
+ inode_set_atime(inode, attr->atime, attr->atimensec);
/* mtime from server may be stale due to local buffered write */
if (!(cache_mask & STATX_MTIME)) {
inode_set_mtime(inode, attr->mtime, attr->mtimensec);
@@ -331,8 +332,12 @@ u32 fuse_get_cache_mask(struct inode *inode)
{
struct fuse_conn *fc = get_fuse_conn(inode);
- if (S_ISREG(inode->i_mode) &&
- (fuse_inode_has_iomap(inode) || fc->writeback_cache))
+ if (!S_ISREG(inode->i_mode))
+ return 0;
+
+ if (fuse_inode_has_iomap(inode))
+ return STATX_MTIME | STATX_CTIME | STATX_ATIME | STATX_SIZE;
+ if (fc->writeback_cache)
return STATX_MTIME | STATX_CTIME | STATX_SIZE;
return 0;
@@ -451,6 +456,14 @@ static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr,
new_decode_dev(attr->rdev));
} else
BUG();
+
+ /*
+ * iomap caches atime too, so we must load it from the fuse server
+ * at instantiation time.
+ */
+ if (fuse_has_iomap(inode))
+ inode_set_atime(inode, attr->atime, attr->atimensec);
+
/*
* Ensure that we don't cache acls for daemons without FUSE_POSIX_ACL
* so they see the exact same behavior as before.
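The effect of the widened cache mask can be shown with a small model: a timestamp reported by the server is applied only when the kernel does not cache that field itself. This is a sketch with hypothetical names; the `SKETCH_` bits are stand-ins for the STATX_* constants and do not use their real values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for STATX_ATIME, STATX_MTIME, STATX_CTIME, STATX_SIZE. */
#define SKETCH_ATIME	0x1u
#define SKETCH_MTIME	0x2u
#define SKETCH_CTIME	0x4u
#define SKETCH_SIZE	0x8u

struct sketch_times {
	int64_t atime;
	int64_t mtime;
	int64_t ctime;
};

/* Which fields does the kernel cache, per the logic in this patch? */
static uint32_t sketch_cache_mask(bool is_regular, bool has_iomap,
				  bool writeback_cache)
{
	if (!is_regular)
		return 0;
	if (has_iomap)
		return SKETCH_MTIME | SKETCH_CTIME | SKETCH_ATIME | SKETCH_SIZE;
	if (writeback_cache)
		return SKETCH_MTIME | SKETCH_CTIME | SKETCH_SIZE;
	return 0;
}

/* Copy server timestamps into the local copy, skipping cached fields. */
static void sketch_apply_server_times(struct sketch_times *local,
				      const struct sketch_times *server,
				      uint32_t cache_mask)
{
	if (!(cache_mask & SKETCH_ATIME))
		local->atime = server->atime;
	if (!(cache_mask & SKETCH_MTIME))
		local->mtime = server->mtime;
	if (!(cache_mask & SKETCH_CTIME))
		local->ctime = server->ctime;
}
```

In the iomap case all three timestamps stay under kernel control, which is why the patch also has to load atime from the server once at inode instantiation.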
* [PATCH 4/6] fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems
2025-08-21 0:48 ` [PATCHSET RFC v4 4/4] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (2 preceding siblings ...)
2025-08-21 1:00 ` [PATCH 3/6] fuse: cache atime when in iomap mode Darrick J. Wong
@ 2025-08-21 1:00 ` Darrick J. Wong
2025-08-21 1:00 ` [PATCH 5/6] fuse: update ctime when updating acls on an iomap inode Darrick J. Wong
2025-08-21 1:01 ` [PATCH 6/6] fuse: always cache ACLs when using iomap Darrick J. Wong
5 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:00 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Let the kernel handle killing the suid/sgid bits because the
write/falloc/truncate/chown code already does this, and we don't have to
worry about external modifications that are only visible to the fuse
server (i.e. we're not a cluster fs).
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/fuse_trace.h | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/dir.c | 15 ++++++++++---
2 files changed, 70 insertions(+), 3 deletions(-)
diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
index aea9ea0835d497..18606eb0bf8dd7 100644
--- a/fs/fuse/fuse_trace.h
+++ b/fs/fuse/fuse_trace.h
@@ -199,6 +199,64 @@ TRACE_EVENT(fuse_fileattr_update_inode,
__entry->new_iflags)
);
+TRACE_EVENT(fuse_setattr_fill,
+ TP_PROTO(const struct inode *inode,
+ const struct fuse_setattr_in *inarg),
+ TP_ARGS(inode, inarg),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ __field(umode_t, mode)
+ __field(uint32_t, valid)
+ __field(umode_t, new_mode)
+ __field(uint64_t, new_size)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->mode = inode->i_mode;
+ __entry->valid = inarg->valid;
+ __entry->new_mode = inarg->mode;
+ __entry->new_size = inarg->size;
+ ),
+
+ TP_printk(FUSE_INODE_FMT " mode 0%o valid 0x%x new_mode 0%o new_size 0x%llx",
+ FUSE_INODE_PRINTK_ARGS,
+ __entry->mode,
+ __entry->valid,
+ __entry->new_mode,
+ __entry->new_size)
+);
+
+TRACE_EVENT(fuse_setattr,
+ TP_PROTO(const struct inode *inode,
+ const struct iattr *inarg),
+ TP_ARGS(inode, inarg),
+
+ TP_STRUCT__entry(
+ FUSE_INODE_FIELDS
+ __field(umode_t, mode)
+ __field(uint32_t, valid)
+ __field(umode_t, new_mode)
+ __field(uint64_t, new_size)
+ ),
+
+ TP_fast_assign(
+ FUSE_INODE_ASSIGN(inode, fi, fm);
+ __entry->mode = inode->i_mode;
+ __entry->valid = inarg->ia_valid;
+ __entry->new_mode = inarg->ia_mode;
+ __entry->new_size = inarg->ia_size;
+ ),
+
+ TP_printk(FUSE_INODE_FMT " mode 0%o valid 0x%x new_mode 0%o new_size 0x%llx",
+ FUSE_INODE_PRINTK_ARGS,
+ __entry->mode,
+ __entry->valid,
+ __entry->new_mode,
+ __entry->new_size)
+);
+
#ifdef CONFIG_FUSE_BACKING
#define FUSE_BACKING_PASSTHROUGH (1U << 0)
#define FUSE_BACKING_IOMAP (1U << 1)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index a3ea50b99054ff..e8eef46d8e1b52 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -7,6 +7,7 @@
*/
#include "fuse_i.h"
+#include "fuse_trace.h"
#include <linux/pagemap.h>
#include <linux/file.h>
@@ -1999,6 +2000,8 @@ static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_args *args,
struct fuse_setattr_in *inarg_p,
struct fuse_attr_out *outarg_p)
{
+ trace_fuse_setattr_fill(inode, inarg_p);
+
args->opcode = FUSE_SETATTR;
args->nodeid = get_node_id(inode);
args->in_numargs = 1;
@@ -2273,15 +2276,21 @@ static int fuse_setattr(struct mnt_idmap *idmap, struct dentry *entry,
if (!fuse_allow_current_process(get_fuse_conn(inode)))
return -EACCES;
- if (attr->ia_valid & (ATTR_KILL_SUID | ATTR_KILL_SGID)) {
+ trace_fuse_setattr(inode, attr);
+
+ if (!fuse_has_iomap(inode) &&
+ (attr->ia_valid & (ATTR_KILL_SUID | ATTR_KILL_SGID))) {
attr->ia_valid &= ~(ATTR_KILL_SUID | ATTR_KILL_SGID |
ATTR_MODE);
/*
* The only sane way to reliably kill suid/sgid is to do it in
- * the userspace filesystem
+ * the userspace filesystem if this isn't an iomap file. For
+ * iomap filesystems we let the kernel kill the setuid/setgid
+ * bits.
*
- * This should be done on write(), truncate() and chown().
+ * This should be done on write(), truncate(), chown(), and
+ * fallocate().
*/
if (!fc->handle_killpriv && !fc->handle_killpriv_v2) {
/*
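For readers unfamiliar with what "killing" the suid/sgid bits means, here is a deliberately simplified userspace model. It assumes the classic rule (always drop setuid; drop setgid only when group-execute is set) and ignores the capability and group-membership checks that the real VFS helpers perform, so it should be read as an illustration rather than as the kernel's exact policy. The function name is hypothetical.

```c
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Simplified model: return the mode a file would have after a write,
 * truncate, chown, or fallocate by an unprivileged caller.
 */
static mode_t sketch_kill_priv_bits(mode_t mode)
{
	/* setuid is always dropped */
	mode &= ~S_ISUID;

	/* setgid is dropped only when the file is group-executable */
	if (mode & S_IXGRP)
		mode &= ~S_ISGID;

	return mode;
}
```

With iomap enabled, the patch leaves ATTR_KILL_SUID and ATTR_KILL_SGID in place so that this decision is made in the kernel instead of being delegated to the fuse server.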
* [PATCH 5/6] fuse: update ctime when updating acls on an iomap inode
2025-08-21 0:48 ` [PATCHSET RFC v4 4/4] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (3 preceding siblings ...)
2025-08-21 1:00 ` [PATCH 4/6] fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems Darrick J. Wong
@ 2025-08-21 1:00 ` Darrick J. Wong
2025-08-21 1:01 ` [PATCH 6/6] fuse: always cache ACLs when using iomap Darrick J. Wong
5 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:00 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
In iomap mode, the fuse kernel driver is in charge of updating file
attributes, so we need to update ctime after an ACL change.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/acl.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index 4f37390e3f3ce7..efab333415131c 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -169,10 +169,24 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
ret = 0;
}
- /* If we scheduled a mode update above, push that to userspace now. */
if (!ret) {
struct iattr attr = { };
+ /*
+ * When we're running in iomap mode, we need to update mode and
+ * ctime ourselves instead of letting the fuse server figure
+ * that out.
+ */
+ if (fuse_has_iomap(inode)) {
+ attr.ia_valid |= ATTR_CTIME;
+ inode_set_ctime_current(inode);
+ attr.ia_ctime = inode_get_ctime(inode);
+ }
+
+ /*
+ * If we scheduled a mode update above, push that to userspace
+ * now.
+ */
if (mode != inode->i_mode) {
attr.ia_valid |= ATTR_MODE;
attr.ia_mode = mode;
* [PATCH 6/6] fuse: always cache ACLs when using iomap
2025-08-21 0:48 ` [PATCHSET RFC v4 4/4] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (4 preceding siblings ...)
2025-08-21 1:00 ` [PATCH 5/6] fuse: update ctime when updating acls on an iomap inode Darrick J. Wong
@ 2025-08-21 1:01 ` Darrick J. Wong
5 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:01 UTC (permalink / raw)
To: djwong, miklos; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
From: Darrick J. Wong <djwong@kernel.org>
Keep ACLs cached in memory when we're using iomap, so that we don't have
to make a round trip to the fuse server. This might want to become a
FUSE_ATTR_ flag.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/fuse/acl.c | 8 ++++++--
fs/fuse/dir.c | 11 ++++++++---
fs/fuse/readdir.c | 3 ++-
3 files changed, 16 insertions(+), 6 deletions(-)
diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index efab333415131c..404c96ad68ea66 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -201,8 +201,12 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
* Fuse daemons without FUSE_POSIX_ACL never cached POSIX ACLs
* and didn't invalidate attributes. Retain that behavior.
*/
- forget_all_cached_acls(inode);
- fuse_invalidate_attr(inode);
+ if (!ret && fuse_has_iomap(inode)) {
+ set_cached_acl(inode, type, acl);
+ } else {
+ forget_all_cached_acls(inode);
+ fuse_invalidate_attr(inode);
+ }
}
return ret;
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index e8eef46d8e1b52..d317b7965a8259 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -261,7 +261,8 @@ static int fuse_dentry_revalidate(struct inode *dir, const struct qstr *name,
fuse_stale_inode(inode, outarg.generation, &outarg.attr))
goto invalid;
- forget_all_cached_acls(inode);
+ if (!fuse_inode_has_iomap(inode))
+ forget_all_cached_acls(inode);
fuse_change_attributes(inode, &outarg.attr, NULL,
ATTR_TIMEOUT(&outarg),
attr_version);
@@ -1468,7 +1469,8 @@ static int fuse_update_get_attr(struct mnt_idmap *idmap, struct inode *inode,
sync = time_before64(fi->i_time, get_jiffies_64());
if (sync) {
- forget_all_cached_acls(inode);
+ if (!fuse_inode_has_iomap(inode))
+ forget_all_cached_acls(inode);
/* Try statx if a field not covered by regular stat is wanted */
if (!fc->no_statx && (request_mask & ~STATX_BASIC_STATS)) {
err = fuse_do_statx(idmap, inode, file, stat);
@@ -1645,6 +1647,9 @@ static int fuse_access(struct inode *inode, int mask)
static int fuse_perm_getattr(struct inode *inode, int mask)
{
+ if (fuse_inode_has_iomap(inode))
+ return 0;
+
if (mask & MAY_NOT_BLOCK)
return -ECHILD;
@@ -2321,7 +2326,7 @@ static int fuse_setattr(struct mnt_idmap *idmap, struct dentry *entry,
* If filesystem supports acls it may have updated acl xattrs in
* the filesystem, so forget cached acls for the inode.
*/
- if (fc->posix_acl)
+ if (fc->posix_acl && !fuse_inode_has_iomap(inode))
forget_all_cached_acls(inode);
/* Directory mode changed, may need to revalidate access */
diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
index 45dd932eb03a5e..f7c2a45f23678e 100644
--- a/fs/fuse/readdir.c
+++ b/fs/fuse/readdir.c
@@ -224,7 +224,8 @@ static int fuse_direntplus_link(struct file *file,
fi->nlookup++;
spin_unlock(&fi->lock);
- forget_all_cached_acls(inode);
+ if (!fuse_inode_has_iomap(inode))
+ forget_all_cached_acls(inode);
fuse_change_attributes(inode, &o->attr, NULL,
ATTR_TIMEOUT(o),
attr_version);
* [PATCH 1/1] libfuse: don't put HAVE_STATX in a public header
2025-08-21 0:48 ` [PATCHSET RFC v4 1/4] libfuse: general bug fixes Darrick J. Wong
@ 2025-08-21 1:01 ` Darrick J. Wong
2025-08-21 21:39 ` Bernd Schubert
2025-08-22 0:33 ` Joanne Koong
0 siblings, 2 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:01 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
fuse.h and fuse_lowlevel.h are public headers, so they must not expose
internal build system config variables to downstream clients. Doing so can
also lead to function pointer ordering issues if (say) libfuse gets built
with HAVE_STATX but the client program doesn't define HAVE_STATX.
Get rid of the conditionals in the public header files to fix this.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse.h | 2 --
include/fuse_lowlevel.h | 2 --
example/memfs_ll.cc | 2 +-
example/passthrough.c | 2 +-
example/passthrough_fh.c | 2 +-
example/passthrough_ll.c | 2 +-
6 files changed, 4 insertions(+), 8 deletions(-)
diff --git a/include/fuse.h b/include/fuse.h
index 06feacb070fbfb..209102651e9454 100644
--- a/include/fuse.h
+++ b/include/fuse.h
@@ -854,7 +854,6 @@ struct fuse_operations {
*/
off_t (*lseek) (const char *, off_t off, int whence, struct fuse_file_info *);
-#ifdef HAVE_STATX
/**
* Get extended file attributes.
*
@@ -865,7 +864,6 @@ struct fuse_operations {
*/
int (*statx)(const char *path, int flags, int mask, struct statx *stxbuf,
struct fuse_file_info *fi);
-#endif
};
/** Extra context that may be needed by some filesystems
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index 844ee710295973..8d87be413bfe37 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -1327,7 +1327,6 @@ struct fuse_lowlevel_ops {
void (*tmpfile) (fuse_req_t req, fuse_ino_t parent,
mode_t mode, struct fuse_file_info *fi);
-#ifdef HAVE_STATX
/**
* Get extended file attributes.
*
@@ -1343,7 +1342,6 @@ struct fuse_lowlevel_ops {
*/
void (*statx)(fuse_req_t req, fuse_ino_t ino, int flags, int mask,
struct fuse_file_info *fi);
-#endif
};
/**
diff --git a/example/memfs_ll.cc b/example/memfs_ll.cc
index edda34b4e43d39..7055a434a439cd 100644
--- a/example/memfs_ll.cc
+++ b/example/memfs_ll.cc
@@ -6,7 +6,7 @@
See the file GPL2.txt.
*/
-#define FUSE_USE_VERSION 317
+#define FUSE_USE_VERSION FUSE_MAKE_VERSION(3, 18)
#include <algorithm>
#include <stdio.h>
diff --git a/example/passthrough.c b/example/passthrough.c
index fdaa19e331a17d..1f09c2dc05df1e 100644
--- a/example/passthrough.c
+++ b/example/passthrough.c
@@ -23,7 +23,7 @@
*/
-#define FUSE_USE_VERSION 31
+#define FUSE_USE_VERSION FUSE_MAKE_VERSION(3, 18)
#define _GNU_SOURCE
diff --git a/example/passthrough_fh.c b/example/passthrough_fh.c
index 0d4fb5bd4df0d6..6403fbb74c7759 100644
--- a/example/passthrough_fh.c
+++ b/example/passthrough_fh.c
@@ -23,7 +23,7 @@
* \include passthrough_fh.c
*/
-#define FUSE_USE_VERSION 31
+#define FUSE_USE_VERSION FUSE_MAKE_VERSION(3, 18)
#define _GNU_SOURCE
diff --git a/example/passthrough_ll.c b/example/passthrough_ll.c
index 5ca6efa2300abe..8a5ac2e9226b59 100644
--- a/example/passthrough_ll.c
+++ b/example/passthrough_ll.c
@@ -35,7 +35,7 @@
*/
#define _GNU_SOURCE
-#define FUSE_USE_VERSION FUSE_MAKE_VERSION(3, 12)
+#define FUSE_USE_VERSION FUSE_MAKE_VERSION(3, 18)
#include <fuse_lowlevel.h>
#include <unistd.h>
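The hazard described in this commit message can be reproduced with two toy structures. The sketch below is hypothetical and unrelated to the real struct fuse_operations layout; it only shows that a member guarded by a build-time macro changes the offsets of later members and the overall size, so a library and a client compiled with different settings disagree about where each function pointer lives.

```c
#include <stddef.h>

/* Layout seen by code compiled with the build flag defined. */
struct sketch_ops_built_with {
	void (*first)(void);
	void (*optional)(void);	/* present only when the flag is set */
	void (*last)(void);
};

/* Layout seen by code compiled without the flag. */
struct sketch_ops_built_without {
	void (*first)(void);
	void (*last)(void);
};

/* Nonzero when the two builds place 'last' at different offsets. */
static int sketch_layouts_disagree(void)
{
	return offsetof(struct sketch_ops_built_with, last) !=
	       offsetof(struct sketch_ops_built_without, last);
}
```

In the headers touched by this patch the statx pointer is the final member, so the visible mismatch there would be in structure size rather than in the offset of a later member, but the same reasoning applies once anything is appended after it.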
* [PATCH 01/21] libfuse: bump kernel and library ABI versions
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
@ 2025-08-21 1:01 ` Darrick J. Wong
2025-08-21 1:01 ` [PATCH 02/21] libfuse: add kernel gates for FUSE_IOMAP Darrick J. Wong
` (19 subsequent siblings)
20 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:01 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Bump the kernel ABI version to 7.99 and the libfuse ABI version to 3.99
to start our development. This patch exists to avoid confusion during
the prototyping stage.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse_kernel.h | 4 +++-
ChangeLog.rst | 12 +++++++++++-
lib/fuse_versionscript | 3 +++
lib/meson.build | 2 +-
meson.build | 2 +-
5 files changed, 19 insertions(+), 4 deletions(-)
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 122d6586e8d4da..4d68c4e8a71d5f 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -235,6 +235,8 @@
*
* 7.44
* - add FUSE_NOTIFY_INC_EPOCH
+ *
+ * 7.99
*/
#ifndef _LINUX_FUSE_H
@@ -270,7 +272,7 @@
#define FUSE_KERNEL_VERSION 7
/** Minor version number of this interface */
-#define FUSE_KERNEL_MINOR_VERSION 44
+#define FUSE_KERNEL_MINOR_VERSION 99
/** The node ID of the root inode */
#define FUSE_ROOT_ID 1
diff --git a/ChangeLog.rst b/ChangeLog.rst
index 505d9dba84100f..bdb133a5f7db74 100644
--- a/ChangeLog.rst
+++ b/ChangeLog.rst
@@ -1,4 +1,14 @@
-libfuse 3.18
+libfuse 3.99
+
+libfuse 3.99-rc0 (2025-07-18)
+===============================
+
+* Add prototypes of iomap and syncfs (djwong)
+
+libfuse 3.18-rc0 (2025-07-18)
+===============================
+
+* Add statx, among other things (djwong)
libfuse 3.17.1-rc0 (2024-02.10)
===============================
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index 0e581f1711412c..ba8f3b00478b30 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -217,6 +217,9 @@ FUSE_3.18 {
fuse_fs_statx;
} FUSE_3.17;
+FUSE_3.99 {
+} FUSE_3.18;
+
# Local Variables:
# indent-tabs-mode: t
# End:
diff --git a/lib/meson.build b/lib/meson.build
index fcd95741c9d374..8efe71abfabc9e 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -49,7 +49,7 @@ libfuse = library('fuse3',
dependencies: deps,
install: true,
link_depends: 'fuse_versionscript',
- c_args: [ '-DFUSE_USE_VERSION=317',
+ c_args: [ '-DFUSE_USE_VERSION=399',
'-DFUSERMOUNT_DIR="@0@"'.format(fusermount_path) ],
link_args: ['-Wl,--version-script,' + meson.current_source_dir()
+ '/fuse_versionscript' ])
diff --git a/meson.build b/meson.build
index f98ef8a6d60f33..0abb2cf0be5563 100644
--- a/meson.build
+++ b/meson.build
@@ -1,5 +1,5 @@
project('libfuse3', ['c'],
- version: '3.18.0-rc0', # Version with RC suffix
+ version: '3.99.0-rc0', # Version with RC suffix
meson_version: '>= 0.60.0',
default_options: [
'buildtype=debugoptimized',
^ permalink raw reply related [flat|nested] 210+ messages in thread
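Note on the version numbers in patch 01 above: the value 399 passed via -DFUSE_USE_VERSION and the FUSE_MAKE_VERSION(3, 99) spelling used elsewhere in this series are the same number. A minimal sketch of the arithmetic follows, assuming the usual libfuse definition of FUSE_MAKE_VERSION; api_has_iomap() is a hypothetical helper made up for illustration and is not part of libfuse.

```c
#include <assert.h>

/* Assumed to match the definition in libfuse's fuse_common.h. */
#define FUSE_MAKE_VERSION(maj, min) ((maj) * 100 + (min))

/*
 * Hypothetical helper (not a libfuse function): does the API level a
 * server was compiled against include the prototype iomap interfaces
 * that this series gates behind 3.99?
 */
static inline int api_has_iomap(int use_version)
{
	/* Version numbers are small non-negative integers. */
	assert(use_version >= 0);
	return use_version >= FUSE_MAKE_VERSION(3, 99);
}
```

With this encoding a server built with FUSE_USE_VERSION 31 or FUSE_MAKE_VERSION(3, 18) compares below 399, so it would not see the prototype API.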
* [PATCH 02/21] libfuse: add kernel gates for FUSE_IOMAP
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
2025-08-21 1:01 ` [PATCH 01/21] libfuse: bump kernel and library ABI versions Darrick J. Wong
@ 2025-08-21 1:01 ` Darrick J. Wong
2025-08-21 1:02 ` [PATCH 03/21] libfuse: add fuse commands for iomap_begin and end Darrick J. Wong
` (18 subsequent siblings)
20 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:01 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Add some flags to query and request kernel support for filesystem iomap
for regular files.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse_common.h | 5 +++++
include/fuse_kernel.h | 5 +++++
lib/fuse_lowlevel.c | 10 +++++++++-
3 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/include/fuse_common.h b/include/fuse_common.h
index b82f2c41deb30c..8f87263d78f999 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -520,6 +520,11 @@ struct fuse_loop_config_v1 {
*/
#define FUSE_CAP_OVER_IO_URING (1UL << 31)
+/**
+ * Client supports using iomap for FIEMAP and SEEK_{DATA,HOLE}
+ */
+#define FUSE_CAP_IOMAP (1ULL << 32)
+
/**
* Ioctl flags
*
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 4d68c4e8a71d5f..6779b9c69bb9e2 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -237,6 +237,8 @@
* - add FUSE_NOTIFY_INC_EPOCH
*
* 7.99
+ * - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
+ * SEEK_{DATA,HOLE}
*/
#ifndef _LINUX_FUSE_H
@@ -445,6 +447,8 @@ struct fuse_file_lock {
* FUSE_OVER_IO_URING: Indicate that client supports io-uring
* FUSE_REQUEST_TIMEOUT: kernel supports timing out requests.
* init_out.request_timeout contains the timeout (in secs)
+ * FUSE_IOMAP: Client supports iomap for FIEMAP and SEEK_{DATA,HOLE} file
+ * operations.
*/
#define FUSE_ASYNC_READ (1 << 0)
#define FUSE_POSIX_LOCKS (1 << 1)
@@ -492,6 +496,7 @@ struct fuse_file_lock {
#define FUSE_ALLOW_IDMAP (1ULL << 40)
#define FUSE_OVER_IO_URING (1ULL << 41)
#define FUSE_REQUEST_TIMEOUT (1ULL << 42)
+#define FUSE_IOMAP (1ULL << 43)
/**
* CUSE INIT request/reply flags
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 6afcecd3bdda96..33c71ba216679c 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -2686,7 +2686,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
se->conn.capable_ext |= FUSE_CAP_NO_EXPORT_SUPPORT;
if (inargflags & FUSE_OVER_IO_URING)
se->conn.capable_ext |= FUSE_CAP_OVER_IO_URING;
-
+ if (inargflags & FUSE_IOMAP)
+ se->conn.capable_ext |= FUSE_CAP_IOMAP;
} else {
se->conn.max_readahead = 0;
}
@@ -2732,6 +2733,9 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
FUSE_CAP_READDIRPLUS_AUTO);
LL_SET_DEFAULT(1, FUSE_CAP_OVER_IO_URING);
+ /* servers need to opt-in to iomap explicitly */
+ LL_SET_DEFAULT(0, FUSE_CAP_IOMAP);
+
/* This could safely become default, but libfuse needs an API extension
* to support it
* LL_SET_DEFAULT(1, FUSE_CAP_SETXATTR_EXT);
@@ -2850,6 +2854,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
outargflags |= FUSE_REQUEST_TIMEOUT;
outarg.request_timeout = se->conn.request_timeout;
}
+ if (se->conn.want_ext & FUSE_CAP_IOMAP)
+ outargflags |= FUSE_IOMAP;
if (inargflags & FUSE_INIT_EXT) {
outargflags |= FUSE_INIT_EXT;
@@ -2891,6 +2897,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
if (se->conn.want_ext & FUSE_CAP_PASSTHROUGH)
fuse_log(FUSE_LOG_DEBUG, " max_stack_depth=%u\n",
outarg.max_stack_depth);
+ if (se->conn.want_ext & FUSE_CAP_IOMAP)
+ fuse_log(FUSE_LOG_DEBUG, " iomap=1\n");
}
if (arg->minor < 5)
outargsize = FUSE_COMPAT_INIT_OUT_SIZE;
^ permalink raw reply related [flat|nested] 210+ messages in thread
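Note on the negotiation in patch 02 above: the kernel advertises FUSE_IOMAP in the INIT request, libfuse records that as FUSE_CAP_IOMAP in capable_ext, the server opts in through want_ext (it is off by default), and libfuse echoes FUSE_IOMAP in the INIT reply. The three steps can be sketched in isolation as below. The two flag values are copied from the patch; struct conn_sketch and the three functions are hypothetical stand-ins for illustration, not libfuse code. A real server would presumably set the bit from its init callback, for example with fuse_set_feature_flag().

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the patch above. */
#define FUSE_IOMAP	(1ULL << 43)	/* kernel INIT flag */
#define FUSE_CAP_IOMAP	(1ULL << 32)	/* libfuse capability bit */

/* Hypothetical stand-in for the two fuse_conn_info fields involved. */
struct conn_sketch {
	uint64_t capable_ext;	/* what the kernel offers */
	uint64_t want_ext;	/* what the server asked for */
};

/* Step 1: translate the kernel's INIT flags into capability bits. */
static inline void record_kernel_caps(struct conn_sketch *c,
				      uint64_t inargflags)
{
	if (inargflags & FUSE_IOMAP)
		c->capable_ext |= FUSE_CAP_IOMAP;
}

/* Step 2: the server opts in, but only if the kernel offered iomap. */
static inline int server_opt_in(struct conn_sketch *c)
{
	if (!(c->capable_ext & FUSE_CAP_IOMAP))
		return 0;
	c->want_ext |= FUSE_CAP_IOMAP;
	return 1;
}

/* Step 3: build the flags for the INIT reply. */
static inline uint64_t build_reply_flags(const struct conn_sketch *c)
{
	uint64_t outargflags = 0;

	/* The sketch asserts; the real library rejects unknown wants. */
	assert((c->want_ext & ~c->capable_ext) == 0);
	if (c->want_ext & FUSE_CAP_IOMAP)
		outargflags |= FUSE_IOMAP;
	return outargflags;
}
```

The point of the explicit opt-in is that a server which never touches want_ext gets a reply without FUSE_IOMAP, so the kernel should not send it iomap requests.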
* [PATCH 03/21] libfuse: add fuse commands for iomap_begin and end
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
2025-08-21 1:01 ` [PATCH 01/21] libfuse: bump kernel and library ABI versions Darrick J. Wong
2025-08-21 1:01 ` [PATCH 02/21] libfuse: add kernel gates for FUSE_IOMAP Darrick J. Wong
@ 2025-08-21 1:02 ` Darrick J. Wong
2025-08-21 1:02 ` [PATCH 04/21] libfuse: add upper level iomap commands Darrick J. Wong
` (17 subsequent siblings)
20 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:02 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Teach the low level API how to handle iomap begin and end commands that
we get from the kernel.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse_common.h | 68 +++++++++++++++++++++++++++++++
include/fuse_kernel.h | 40 ++++++++++++++++++
include/fuse_lowlevel.h | 59 +++++++++++++++++++++++++++
lib/fuse_lowlevel.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++
lib/fuse_versionscript | 3 +
5 files changed, 272 insertions(+)
diff --git a/include/fuse_common.h b/include/fuse_common.h
index 8f87263d78f999..d10364a077f31d 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -1145,7 +1145,75 @@ bool fuse_get_feature_flag(struct fuse_conn_info *conn, uint64_t flag);
*/
int fuse_convert_to_conn_want_ext(struct fuse_conn_info *conn);
+/**
+ * iomap operations.
+ * These APIs are introduced in version 399 (FUSE_MAKE_VERSION(3, 99)).
+ */
+/* mapping types; see corresponding IOMAP_TYPE_ */
+#define FUSE_IOMAP_TYPE_HOLE (0)
+#define FUSE_IOMAP_TYPE_DELALLOC (1)
+#define FUSE_IOMAP_TYPE_MAPPED (2)
+#define FUSE_IOMAP_TYPE_UNWRITTEN (3)
+#define FUSE_IOMAP_TYPE_INLINE (4)
+
+/* fuse-specific mapping type indicating that writes use the read mapping */
+#define FUSE_IOMAP_TYPE_PURE_OVERWRITE (255)
+
+#define FUSE_IOMAP_DEV_NULL (0U) /* null device cookie */
+
+/* mapping flags passed back from iomap_begin; see corresponding IOMAP_F_ */
+#define FUSE_IOMAP_F_NEW (1U << 0)
+#define FUSE_IOMAP_F_DIRTY (1U << 1)
+#define FUSE_IOMAP_F_SHARED (1U << 2)
+#define FUSE_IOMAP_F_MERGED (1U << 3)
+#define FUSE_IOMAP_F_BOUNDARY (1U << 4)
+#define FUSE_IOMAP_F_ANON_WRITE (1U << 5)
+#define FUSE_IOMAP_F_ATOMIC_BIO (1U << 6)
+
+/* fuse-specific mapping flag asking for ->iomap_end call */
+#define FUSE_IOMAP_F_WANT_IOMAP_END (1U << 7)
+
+/* mapping flags passed to iomap_end */
+#define FUSE_IOMAP_F_SIZE_CHANGED (1U << 8)
+#define FUSE_IOMAP_F_STALE (1U << 9)
+
+/* operation flags from iomap; see corresponding IOMAP_* */
+#define FUSE_IOMAP_OP_WRITE (1U << 0)
+#define FUSE_IOMAP_OP_ZERO (1U << 1)
+#define FUSE_IOMAP_OP_REPORT (1U << 2)
+#define FUSE_IOMAP_OP_FAULT (1U << 3)
+#define FUSE_IOMAP_OP_DIRECT (1U << 4)
+#define FUSE_IOMAP_OP_NOWAIT (1U << 5)
+#define FUSE_IOMAP_OP_OVERWRITE_ONLY (1U << 6)
+#define FUSE_IOMAP_OP_UNSHARE (1U << 7)
+#define FUSE_IOMAP_OP_DAX (1U << 8)
+#define FUSE_IOMAP_OP_ATOMIC (1U << 9)
+#define FUSE_IOMAP_OP_DONTCACHE (1U << 10)
+
+#define FUSE_IOMAP_NULL_ADDR (-1ULL) /* addr is not valid */
+
+struct fuse_file_iomap {
+ uint64_t addr; /* disk offset of mapping, bytes */
+ uint64_t offset; /* file offset of mapping, bytes */
+ uint64_t length; /* length of mapping, bytes */
+ uint16_t type; /* FUSE_IOMAP_TYPE_* */
+ uint16_t flags; /* FUSE_IOMAP_F_* */
+ uint32_t dev; /* device cookie */
+};
+
+static inline bool fuse_iomap_is_write(unsigned int opflags)
+{
+ return opflags & (FUSE_IOMAP_OP_WRITE | FUSE_IOMAP_OP_ZERO |
+ FUSE_IOMAP_OP_UNSHARE);
+}
+
+static inline bool fuse_iomap_need_write_allocate(unsigned int opflags,
+ const struct fuse_file_iomap *map)
+{
+ return map->type == FUSE_IOMAP_TYPE_HOLE &&
+ !(opflags & FUSE_IOMAP_OP_ZERO);
+}
/* ----------------------------------------------------------- *
* Compatibility stuff *
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 6779b9c69bb9e2..2bcb3b394c0169 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -665,6 +665,9 @@ enum fuse_opcode {
FUSE_TMPFILE = 51,
FUSE_STATX = 52,
+ FUSE_IOMAP_BEGIN = 4094,
+ FUSE_IOMAP_END = 4095,
+
/* CUSE specific operations */
CUSE_INIT = 4096,
@@ -1297,4 +1300,41 @@ struct fuse_uring_cmd_req {
uint8_t padding[6];
};
+struct fuse_iomap_io {
+ uint64_t offset; /* file offset of mapping, bytes */
+ uint64_t length; /* length of mapping, bytes */
+ uint64_t addr; /* disk offset of mapping, bytes */
+ uint16_t type; /* FUSE_IOMAP_TYPE_* */
+ uint16_t flags; /* FUSE_IOMAP_F_* */
+ uint32_t dev; /* device cookie */
+};
+
+struct fuse_iomap_begin_in {
+ uint32_t opflags; /* FUSE_IOMAP_OP_* */
+ uint32_t reserved; /* zero */
+ uint64_t attr_ino; /* matches fuse_attr:ino */
+ uint64_t pos; /* file position, in bytes */
+ uint64_t count; /* operation length, in bytes */
+};
+
+struct fuse_iomap_begin_out {
+ /* read file data from here */
+ struct fuse_iomap_io read;
+
+ /* write file data to here, if applicable */
+ struct fuse_iomap_io write;
+};
+
+struct fuse_iomap_end_in {
+ uint32_t opflags; /* FUSE_IOMAP_OP_* */
+ uint32_t reserved; /* zero */
+ uint64_t attr_ino; /* matches fuse_attr:ino */
+ uint64_t pos; /* file position, in bytes */
+ uint64_t count; /* operation length, in bytes */
+ int64_t written; /* bytes processed */
+
+ /* mapping that the kernel acted upon */
+ struct fuse_iomap_io map;
+};
+
#endif /* _LINUX_FUSE_H */
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index 8d87be413bfe37..f9704533b5276d 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -1342,6 +1342,43 @@ struct fuse_lowlevel_ops {
*/
void (*statx)(fuse_req_t req, fuse_ino_t ino, int flags, int mask,
struct fuse_file_info *fi);
+
+ /**
+ * Fetch file I/O mappings to begin an operation
+ *
+ * Valid replies:
+ * fuse_reply_iomap_begin
+ * fuse_reply_err
+ *
+ * @param req request handle
+ * @param nodeid the inode number
+ * @param attr_ino inode number as told by fuse_attr::ino
+ * @param pos position in file, in bytes
+ * @param count length of operation, in bytes
+ * @param opflags mask of FUSE_IOMAP_OP_ flags specifying operation
+ */
+ void (*iomap_begin) (fuse_req_t req, fuse_ino_t nodeid,
+ uint64_t attr_ino, off_t pos, uint64_t count,
+ uint32_t opflags);
+
+ /**
+ * Complete an iomap operation
+ *
+ * Valid replies:
+ * fuse_reply_err
+ *
+ * @param req request handle
+ * @param nodeid the inode number
+ * @param attr_ino inode number as told by fuse_attr::ino
+ * @param pos position in file, in bytes
+ * @param count length of operation, in bytes
+ * @param written number of bytes processed, or a negative errno
+ * @param opflags mask of FUSE_IOMAP_OP_ flags specifying operation
+ * @param iomap file I/O mapping that was acted upon
+ */
+ void (*iomap_end) (fuse_req_t req, fuse_ino_t nodeid, uint64_t attr_ino,
+ off_t pos, uint64_t count, uint32_t opflags,
+ ssize_t written, const struct fuse_file_iomap *iomap);
};
/**
@@ -1736,6 +1773,28 @@ int fuse_reply_lseek(fuse_req_t req, off_t off);
*/
int fuse_reply_statx(fuse_req_t req, int flags, struct statx *statx, double attr_timeout);
+/**
+ * Set an iomap write mapping to be a pure overwrite of the read mapping.
+ * @param write mapping for file data writes
+ * @param read mapping for file data reads
+ */
+void fuse_iomap_pure_overwrite(struct fuse_file_iomap *write,
+ const struct fuse_file_iomap *read);
+
+/**
+ * Reply with iomappings for an iomap_begin operation
+ *
+ * Possible requests:
+ * iomap_begin
+ *
+ * @param req request handle
+ * @param read mapping for file data reads
+ * @param write mapping for file data writes
+ * @return zero for success, -errno for failure to send reply
+ */
+int fuse_reply_iomap_begin(fuse_req_t req, const struct fuse_file_iomap *read,
+ const struct fuse_file_iomap *write);
+
/* ----------------------------------------------------------- *
* Notification *
* ----------------------------------------------------------- */
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 33c71ba216679c..c8106cb25a02d3 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -2491,6 +2491,104 @@ static void do_statx(fuse_req_t req, fuse_ino_t nodeid, const void *inarg)
_do_statx(req, nodeid, inarg, NULL);
}
+void fuse_iomap_pure_overwrite(struct fuse_file_iomap *write,
+ const struct fuse_file_iomap *read)
+{
+ write->addr = FUSE_IOMAP_NULL_ADDR;
+ write->offset = read->offset;
+ write->length = read->length;
+ write->type = FUSE_IOMAP_TYPE_PURE_OVERWRITE;
+ write->flags = 0;
+ write->dev = FUSE_IOMAP_DEV_NULL;
+}
+
+static inline void fuse_iomap_to_kernel(struct fuse_iomap_io *fmap,
+ const struct fuse_file_iomap *fimap)
+{
+ fmap->addr = fimap->addr;
+ fmap->offset = fimap->offset;
+ fmap->length = fimap->length;
+ fmap->type = fimap->type;
+ fmap->flags = fimap->flags;
+ fmap->dev = fimap->dev;
+}
+
+static inline void fuse_iomap_from_kernel(struct fuse_file_iomap *fimap,
+ const struct fuse_iomap_io *fmap)
+{
+ fimap->addr = fmap->addr;
+ fimap->offset = fmap->offset;
+ fimap->length = fmap->length;
+ fimap->type = fmap->type;
+ fimap->flags = fmap->flags;
+ fimap->dev = fmap->dev;
+}
+
+int fuse_reply_iomap_begin(fuse_req_t req, const struct fuse_file_iomap *read,
+ const struct fuse_file_iomap *write)
+{
+ struct fuse_iomap_begin_out arg = {
+ .write = {
+ .addr = FUSE_IOMAP_NULL_ADDR,
+ .offset = read->offset,
+ .length = read->length,
+ .type = FUSE_IOMAP_TYPE_PURE_OVERWRITE,
+ .flags = 0,
+ .dev = FUSE_IOMAP_DEV_NULL,
+ },
+ };
+
+ fuse_iomap_to_kernel(&arg.read, read);
+ if (write)
+ fuse_iomap_to_kernel(&arg.write, write);
+
+ return send_reply_ok(req, &arg, sizeof(arg));
+}
+
+static void _do_iomap_begin(fuse_req_t req, const fuse_ino_t nodeid,
+ const void *op_in, const void *in_payload)
+{
+ const struct fuse_iomap_begin_in *arg = op_in;
+ (void)in_payload;
+ (void)nodeid;
+
+ if (req->se->op.iomap_begin)
+ req->se->op.iomap_begin(req, nodeid, arg->attr_ino, arg->pos,
+ arg->count, arg->opflags);
+ else
+ fuse_reply_err(req, ENOSYS);
+}
+
+static void do_iomap_begin(fuse_req_t req, const fuse_ino_t nodeid,
+ const void *inarg)
+{
+ _do_iomap_begin(req, nodeid, inarg, NULL);
+}
+
+static void _do_iomap_end(fuse_req_t req, const fuse_ino_t nodeid,
+ const void *op_in, const void *in_payload)
+{
+ const struct fuse_iomap_end_in *arg = op_in;
+ (void)in_payload;
+ (void)nodeid;
+
+ if (req->se->op.iomap_end) {
+ struct fuse_file_iomap fimap;
+
+ fuse_iomap_from_kernel(&fimap, &arg->map);
+ req->se->op.iomap_end(req, nodeid, arg->attr_ino, arg->pos,
+ arg->count, arg->opflags, arg->written,
+ &fimap);
+ } else
+ fuse_reply_err(req, ENOSYS);
+}
+
+static void do_iomap_end(fuse_req_t req, const fuse_ino_t nodeid,
+ const void *inarg)
+{
+ _do_iomap_end(req, nodeid, inarg, NULL);
+}
+
static bool want_flags_valid(uint64_t capable, uint64_t want)
{
uint64_t unknown_flags = want & (~capable);
@@ -3376,6 +3474,8 @@ static struct {
[FUSE_COPY_FILE_RANGE] = { do_copy_file_range, "COPY_FILE_RANGE" },
[FUSE_LSEEK] = { do_lseek, "LSEEK" },
[FUSE_STATX] = { do_statx, "STATX" },
+ [FUSE_IOMAP_BEGIN] = { do_iomap_begin, "IOMAP_BEGIN" },
+ [FUSE_IOMAP_END] = { do_iomap_end, "IOMAP_END" },
[CUSE_INIT] = { cuse_lowlevel_init, "CUSE_INIT" },
};
@@ -3431,6 +3531,8 @@ static struct {
[FUSE_COPY_FILE_RANGE] = { _do_copy_file_range, "COPY_FILE_RANGE" },
[FUSE_LSEEK] = { _do_lseek, "LSEEK" },
[FUSE_STATX] = { _do_statx, "STATX" },
+ [FUSE_IOMAP_BEGIN] = { _do_iomap_begin, "IOMAP_BEGIN" },
+ [FUSE_IOMAP_END] = { _do_iomap_end, "IOMAP_END" },
[CUSE_INIT] = { _cuse_lowlevel_init, "CUSE_INIT" },
};
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index ba8f3b00478b30..17c9e538a67bfa 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -218,6 +218,9 @@ FUSE_3.18 {
} FUSE_3.17;
FUSE_3.99 {
+ global:
+ fuse_iomap_pure_overwrite;
+ fuse_reply_iomap_begin;
} FUSE_3.18;
# Local Variables:
^ permalink raw reply related [flat|nested] 210+ messages in thread
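Note on the mapping helpers in patch 03 above: a server that has already allocated space for a range can hand back one read mapping and mark the write mapping as a "pure overwrite" of it. The sketch below copies the relevant definitions from the patch so that it builds without a patched libfuse. sketch_fill_mappings() and the extent it describes (1 MiB at file offset 0, stored at disk offset 4 MiB on device cookie 1) are made-up values for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies of definitions from the patch above. */
#define FUSE_IOMAP_TYPE_MAPPED		(2)
#define FUSE_IOMAP_TYPE_PURE_OVERWRITE	(255)
#define FUSE_IOMAP_DEV_NULL		(0U)
#define FUSE_IOMAP_NULL_ADDR		(-1ULL)
#define FUSE_IOMAP_OP_WRITE		(1U << 0)
#define FUSE_IOMAP_OP_ZERO		(1U << 1)
#define FUSE_IOMAP_OP_UNSHARE		(1U << 7)

struct fuse_file_iomap {
	uint64_t addr;		/* disk offset of mapping, bytes */
	uint64_t offset;	/* file offset of mapping, bytes */
	uint64_t length;	/* length of mapping, bytes */
	uint16_t type;		/* FUSE_IOMAP_TYPE_* */
	uint16_t flags;		/* FUSE_IOMAP_F_* */
	uint32_t dev;		/* device cookie */
};

static inline bool fuse_iomap_is_write(unsigned int opflags)
{
	return opflags & (FUSE_IOMAP_OP_WRITE | FUSE_IOMAP_OP_ZERO |
			  FUSE_IOMAP_OP_UNSHARE);
}

static inline void fuse_iomap_pure_overwrite(struct fuse_file_iomap *write,
					     const struct fuse_file_iomap *read)
{
	write->addr = FUSE_IOMAP_NULL_ADDR;
	write->offset = read->offset;
	write->length = read->length;
	write->type = FUSE_IOMAP_TYPE_PURE_OVERWRITE;
	write->flags = 0;
	write->dev = FUSE_IOMAP_DEV_NULL;
}

/* Hypothetical server logic: one preallocated extent, overwritten in place. */
static inline void sketch_fill_mappings(struct fuse_file_iomap *read,
					struct fuse_file_iomap *write)
{
	assert(read && write);

	read->addr = 4ULL << 20;	/* made-up disk offset */
	read->offset = 0;
	read->length = 1ULL << 20;
	read->type = FUSE_IOMAP_TYPE_MAPPED;
	read->flags = 0;
	read->dev = 1;			/* made-up device cookie */

	/* Space already exists, so writes land where reads come from. */
	fuse_iomap_pure_overwrite(write, read);
}
```

Passing a NULL write mapping to fuse_reply_iomap_begin() in the patch yields the same pure-overwrite default, so the explicit call matters mainly for servers that build both mappings themselves.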
* [PATCH 04/21] libfuse: add upper level iomap commands
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (2 preceding siblings ...)
2025-08-21 1:02 ` [PATCH 03/21] libfuse: add fuse commands for iomap_begin and end Darrick J. Wong
@ 2025-08-21 1:02 ` Darrick J. Wong
2025-08-21 1:02 ` [PATCH 05/21] libfuse: add a lowlevel notification to add a new device to iomap Darrick J. Wong
` (16 subsequent siblings)
20 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:02 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Teach the upper level fuse library about the iomap begin and end
operations, and connect it to the lower level. This is needed for
fuse2fs to start using iomap.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse.h | 17 ++++++++++
lib/fuse.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 115 insertions(+)
diff --git a/include/fuse.h b/include/fuse.h
index 209102651e9454..958034a539abe6 100644
--- a/include/fuse.h
+++ b/include/fuse.h
@@ -864,6 +864,23 @@ struct fuse_operations {
*/
int (*statx)(const char *path, int flags, int mask, struct statx *stxbuf,
struct fuse_file_info *fi);
+
+ /**
+ * Send a mapping to the kernel so that a file IO operation can run.
+ */
+ int (*iomap_begin) (const char *path, uint64_t nodeid,
+ uint64_t attr_ino, off_t pos_in,
+ uint64_t length_in, uint32_t opflags_in,
+ struct fuse_file_iomap *read_out,
+ struct fuse_file_iomap *write_out);
+
+ /**
+ * Respond to the outcome of a previous file mapping operation.
+ */
+ int (*iomap_end) (const char *path, uint64_t nodeid, uint64_t attr_ino,
+ off_t pos_in, uint64_t length_in,
+ uint32_t opflags_in, ssize_t written_in,
+ const struct fuse_file_iomap *iomap);
};
/** Extra context that may be needed by some filesystems
diff --git a/lib/fuse.c b/lib/fuse.c
index e997531f4433bb..eef0967f796ed6 100644
--- a/lib/fuse.c
+++ b/lib/fuse.c
@@ -2793,6 +2793,45 @@ int fuse_fs_chmod(struct fuse_fs *fs, const char *path, mode_t mode,
return fs->op.chmod(path, mode, fi);
}
+static int fuse_fs_iomap_begin(struct fuse_fs *fs, const char *path,
+ fuse_ino_t nodeid, uint64_t attr_ino, off_t pos,
+ uint64_t count, uint32_t opflags,
+ struct fuse_file_iomap *read,
+ struct fuse_file_iomap *write)
+{
+ fuse_get_context()->private_data = fs->user_data;
+ if (!fs->op.iomap_begin)
+ return -ENOSYS;
+
+ if (fs->debug) {
+ fuse_log(FUSE_LOG_DEBUG,
+ "iomap_begin[%s] nodeid %llu attr_ino %llu pos %lld count %llu opflags 0x%x\n",
+ path, (unsigned long long)nodeid, (unsigned long long)attr_ino, (long long)pos, (unsigned long long)count, opflags);
+ }
+
+ return fs->op.iomap_begin(path, nodeid, attr_ino, pos, count, opflags,
+ read, write);
+}
+
+static int fuse_fs_iomap_end(struct fuse_fs *fs, const char *path,
+ fuse_ino_t nodeid, uint64_t attr_ino, off_t pos,
+ uint64_t count, uint32_t opflags, ssize_t written,
+ const struct fuse_file_iomap *iomap)
+{
+ fuse_get_context()->private_data = fs->user_data;
+ if (!fs->op.iomap_end)
+ return -ENOSYS;
+
+ if (fs->debug) {
+ fuse_log(FUSE_LOG_DEBUG,
+ "iomap_end[%s] nodeid %llu attr_ino %llu pos %lld count %llu opflags 0x%x written %zd\n",
+ path, (unsigned long long)nodeid, (unsigned long long)attr_ino, (long long)pos, (unsigned long long)count, opflags, written);
+ }
+
+ return fs->op.iomap_end(path, nodeid, attr_ino, pos, count, opflags,
+ written, iomap);
+}
+
static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
int valid, struct fuse_file_info *fi)
{
@@ -4466,6 +4505,63 @@ static void fuse_lib_statx(fuse_req_t req, fuse_ino_t ino, int flags, int mask,
}
#endif
+static void fuse_lib_iomap_begin(fuse_req_t req, fuse_ino_t nodeid,
+ uint64_t attr_ino, off_t pos, uint64_t count,
+ uint32_t opflags)
+{
+ struct fuse *f = req_fuse_prepare(req);
+ struct fuse_file_iomap read = { };
+ struct fuse_file_iomap write = { };
+ struct fuse_intr_data d;
+ char *path;
+ int err;
+
+ err = get_path_nullok(f, nodeid, &path);
+ if (err) {
+ reply_err(req, err);
+ return;
+ }
+
+ fuse_prepare_interrupt(f, req, &d);
+ err = fuse_fs_iomap_begin(f->fs, path, nodeid, attr_ino, pos, count,
+ opflags, &read, &write);
+ fuse_finish_interrupt(f, req, &d);
+ free_path(f, nodeid, path);
+ if (err) {
+ reply_err(req, err);
+ return;
+ }
+
+ if (write.length == 0)
+ fuse_iomap_pure_overwrite(&write, &read);
+
+ fuse_reply_iomap_begin(req, &read, &write);
+}
+
+static void fuse_lib_iomap_end(fuse_req_t req, fuse_ino_t nodeid,
+ uint64_t attr_ino, off_t pos, uint64_t count,
+ uint32_t opflags, ssize_t written,
+ const struct fuse_file_iomap *iomap)
+{
+ struct fuse *f = req_fuse_prepare(req);
+ struct fuse_intr_data d;
+ char *path;
+ int err;
+
+ err = get_path_nullok(f, nodeid, &path);
+ if (err) {
+ reply_err(req, err);
+ return;
+ }
+
+ fuse_prepare_interrupt(f, req, &d);
+ err = fuse_fs_iomap_end(f->fs, path, nodeid, attr_ino, pos, count,
+ opflags, written, iomap);
+ fuse_finish_interrupt(f, req, &d);
+ free_path(f, nodeid, path);
+ reply_err(req, err);
+}
+
static int clean_delay(struct fuse *f)
{
/*
@@ -4567,6 +4663,8 @@ static struct fuse_lowlevel_ops fuse_path_ops = {
#ifdef HAVE_STATX
.statx = fuse_lib_statx,
#endif
+ .iomap_begin = fuse_lib_iomap_begin,
+ .iomap_end = fuse_lib_iomap_end,
};
int fuse_notify_poll(struct fuse_pollhandle *ph)
^ permalink raw reply related [flat|nested] 210+ messages in thread
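Note on the upper-level callback in patch 04 above: fuse_lib_iomap_begin() zero-initializes both mappings, calls the server's iomap_begin, and converts the write mapping into a pure overwrite whenever the server leaves write.length at zero. A handler for a trivial layout could therefore fill in only the read mapping. The sketch below uses the callback signature from the patch, with a local copy of struct fuse_file_iomap so that it builds standalone. The layout (file data identity-mapped onto one device starting at 1 MiB, in 4 KiB blocks) and all SKETCH_* names are assumptions for illustration, not anything fuse2fs does.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>

/* Local copies of definitions from patch 03. */
#define FUSE_IOMAP_TYPE_MAPPED	(2)

struct fuse_file_iomap {
	uint64_t addr;		/* disk offset of mapping, bytes */
	uint64_t offset;	/* file offset of mapping, bytes */
	uint64_t length;	/* length of mapping, bytes */
	uint16_t type;		/* FUSE_IOMAP_TYPE_* */
	uint16_t flags;		/* FUSE_IOMAP_F_* */
	uint32_t dev;		/* device cookie */
};

/* Made-up layout parameters for this sketch. */
#define SKETCH_BLOCK_SIZE	4096ULL
#define SKETCH_DATA_START	(1ULL << 20)
#define SKETCH_DEV_COOKIE	1U

/* Hypothetical iomap_begin handler; returns 0 or a negative errno. */
static inline int sketch_iomap_begin(const char *path, uint64_t nodeid,
				     uint64_t attr_ino, off_t pos,
				     uint64_t count, uint32_t opflags,
				     struct fuse_file_iomap *read_out,
				     struct fuse_file_iomap *write_out)
{
	uint64_t start, end;

	(void)path; (void)nodeid; (void)attr_ino; (void)opflags;
	assert(read_out && write_out);

	if (pos < 0 || count == 0)
		return -EINVAL;

	/* Round the request out to whole blocks. */
	start = (uint64_t)pos & ~(SKETCH_BLOCK_SIZE - 1);
	end = ((uint64_t)pos + count + SKETCH_BLOCK_SIZE - 1) &
	      ~(SKETCH_BLOCK_SIZE - 1);
	if (end <= start)
		return -EFBIG;	/* range wrapped around */

	read_out->offset = start;
	read_out->length = end - start;
	read_out->addr = SKETCH_DATA_START + start;
	read_out->type = FUSE_IOMAP_TYPE_MAPPED;
	read_out->flags = 0;
	read_out->dev = SKETCH_DEV_COOKIE;

	/* write_out is left untouched: length 0 means "pure overwrite". */
	return 0;
}
```

Relying on the zero-length convention keeps simple servers short, at the cost of making a genuinely empty write mapping impossible to express; whether that matters is a question for the series author.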
* [PATCH 05/21] libfuse: add a lowlevel notification to add a new device to iomap
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (3 preceding siblings ...)
2025-08-21 1:02 ` [PATCH 04/21] libfuse: add upper level iomap commands Darrick J. Wong
@ 2025-08-21 1:02 ` Darrick J. Wong
2025-08-21 1:02 ` [PATCH 06/21] libfuse: add upper-level iomap add device function Darrick J. Wong
` (15 subsequent siblings)
20 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:02 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Plumb in the pieces needed to attach block devices to a fuse+iomap mount
for use with iomap operations. This enables us to have filesystems
where the metadata could live somewhere else, but the actual file IO
goes to locally attached storage.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse_lowlevel.h | 29 ++++++++++++++++++++++++++++
lib/fuse_lowlevel.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++
lib/fuse_versionscript | 2 ++
3 files changed, 79 insertions(+)
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index f9704533b5276d..45655781e510a0 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -1998,6 +1998,35 @@ int fuse_lowlevel_notify_store(struct fuse_session *se, fuse_ino_t ino,
int fuse_lowlevel_notify_retrieve(struct fuse_session *se, fuse_ino_t ino,
size_t size, off_t offset, void *cookie);
+/**
+ * Attach an open file descriptor to a fuse+iomap mount. Currently must be
+ * a block device.
+ *
+ * Added in FUSE protocol version 7.99. If the kernel does not support
+ * this (or a newer) version, the function will return -ENOSYS and do
+ * nothing.
+ *
+ * @param se the session object
+ * @param fd file descriptor of an open block device
+ * @param flags flags for the operation; none defined so far
+ * @return positive nonzero device id on success, or negative errno on failure
+ */
+int fuse_lowlevel_iomap_device_add(struct fuse_session *se, int fd,
+ unsigned int flags);
+
+/**
+ * Detach an open file from a fuse+iomap mount. Must be a device id returned
+ * by fuse_lowlevel_iomap_device_add.
+ *
+ * Added in FUSE protocol version 7.99. If the kernel does not support
+ * this (or a newer) version, the function will return -ENOSYS and do
+ * nothing.
+ *
+ * @param se the session object
+ * @param device_id device index as returned by fuse_lowlevel_iomap_device_add
+ * @return 0 on success, or negative errno on failure
+ */
+int fuse_lowlevel_iomap_device_remove(struct fuse_session *se, int device_id);
/* ----------------------------------------------------------- *
* Utility functions *
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index c8106cb25a02d3..fec4e3265e53c1 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -580,6 +580,54 @@ int fuse_passthrough_close(fuse_req_t req, int backing_id)
return ret;
}
+int fuse_lowlevel_iomap_device_add(struct fuse_session *se, int fd,
+ unsigned int flags)
+{
+ struct fuse_backing_map map = {
+ .fd = fd,
+ .flags = flags,
+ };
+ int ret;
+
+ if (!(se->conn.want_ext & FUSE_CAP_IOMAP))
+ return -ENOSYS;
+
+ ret = ioctl(se->fd, FUSE_DEV_IOC_BACKING_OPEN, &map);
+ if (ret == 0) {
+ /* not supposed to happen */
+ ret = -1;
+ errno = ERANGE;
+ }
+ if (ret < 0) {
+ int err = errno;
+
+ fuse_log(FUSE_LOG_ERR, "fuse: iomap_device_add: %s\n",
+ strerror(err));
+ return -err;
+ }
+
+ return ret;
+}
+
+int fuse_lowlevel_iomap_device_remove(struct fuse_session *se, int device_id)
+{
+ int ret;
+
+ if (!(se->conn.want_ext & FUSE_CAP_IOMAP))
+ return -ENOSYS;
+
+ ret = ioctl(se->fd, FUSE_DEV_IOC_BACKING_CLOSE, &device_id);
+ if (ret < 0) {
+ int err = errno;
+
+ fuse_log(FUSE_LOG_ERR, "fuse: iomap_device_remove: %s\n",
+ strerror(err));
+ return -err;
+ }
+
+ return ret;
+}
+
int fuse_reply_open(fuse_req_t req, const struct fuse_file_info *f)
{
struct fuse_open_out arg;
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index 17c9e538a67bfa..d785303bab99ea 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -221,6 +221,8 @@ FUSE_3.99 {
global:
fuse_iomap_pure_overwrite;
fuse_reply_iomap_begin;
+ fuse_lowlevel_iomap_device_add;
+ fuse_lowlevel_iomap_device_remove;
} FUSE_3.18;
# Local Variables:
^ permalink raw reply related [flat|nested] 210+ messages in thread
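Note on the return convention in patch 05 above: fuse_lowlevel_iomap_device_add() returns a positive device id, folds an unexpected ioctl result of 0 into ERANGE, and turns ioctl failures into a negative errno. The zero case is presumably excluded because 0 is the null device cookie (FUSE_IOMAP_DEV_NULL) from patch 03, though the patch only says "not supposed to happen". The mapping can be sketched without a /dev/fuse descriptor as below; sketch_device_add_result() is a hypothetical helper for illustration and does not exist in libfuse.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper mirroring the result handling in the patch:
 * ioctl_ret is what ioctl() returned and saved_errno is the value of
 * errno captured immediately after the call.
 */
static inline int sketch_device_add_result(int ioctl_ret, int saved_errno)
{
	if (ioctl_ret == 0)
		return -ERANGE;		/* id 0 is never a valid device */
	if (ioctl_ret < 0) {
		assert(saved_errno > 0);
		return -saved_errno;	/* failure: negative errno */
	}
	return ioctl_ret;		/* success: positive device id */
}
```

A caller would keep the positive id, store it in the dev field of its mappings, and later pass it to fuse_lowlevel_iomap_device_remove().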
* [PATCH 06/21] libfuse: add upper-level iomap add device function
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (4 preceding siblings ...)
2025-08-21 1:02 ` [PATCH 05/21] libfuse: add a lowlevel notification to add a new device to iomap Darrick J. Wong
@ 2025-08-21 1:02 ` Darrick J. Wong
2025-08-21 1:03 ` [PATCH 07/21] libfuse: add iomap ioend low level handler Darrick J. Wong
` (14 subsequent siblings)
20 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:02 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Make it so that the upper level fuse library can add iomap devices too.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse.h | 19 +++++++++++++++++++
lib/fuse.c | 16 ++++++++++++++++
lib/fuse_versionscript | 2 ++
3 files changed, 37 insertions(+)
diff --git a/include/fuse.h b/include/fuse.h
index 958034a539abe6..524b77b5d7bbd0 100644
--- a/include/fuse.h
+++ b/include/fuse.h
@@ -1381,6 +1381,25 @@ void fuse_fs_init(struct fuse_fs *fs, struct fuse_conn_info *conn,
struct fuse_config *cfg);
void fuse_fs_destroy(struct fuse_fs *fs);
+/**
+ * Attach an open file descriptor to a fuse+iomap mount. Currently must be
+ * a block device.
+ *
+ * @param fd file descriptor of an open block device
+ * @param flags flags for the operation; none defined so far
+ * @return positive nonzero device id on success, or negative errno on failure
+ */
+int fuse_fs_iomap_device_add(int fd, unsigned int flags);
+
+/**
+ * Detach a device from a fuse+iomap mount. The id must be one returned
+ * by fuse_fs_iomap_device_add.
+ *
+ * @param device_id device index as returned by fuse_fs_iomap_device_add
+ * @return 0 on success, or negative errno on failure
+ */
+int fuse_fs_iomap_device_remove(int device_id);
+
int fuse_notify_poll(struct fuse_pollhandle *ph);
/**
diff --git a/lib/fuse.c b/lib/fuse.c
index eef0967f796ed6..632e935b8dff3e 100644
--- a/lib/fuse.c
+++ b/lib/fuse.c
@@ -2832,6 +2832,22 @@ static int fuse_fs_iomap_end(struct fuse_fs *fs, const char *path,
written, iomap);
}
+int fuse_fs_iomap_device_add(int fd, unsigned int flags)
+{
+ struct fuse_context *ctxt = fuse_get_context();
+ struct fuse_session *se = fuse_get_session(ctxt->fuse);
+
+ return fuse_lowlevel_iomap_device_add(se, fd, flags);
+}
+
+int fuse_fs_iomap_device_remove(int device_id)
+{
+ struct fuse_context *ctxt = fuse_get_context();
+ struct fuse_session *se = fuse_get_session(ctxt->fuse);
+
+ return fuse_lowlevel_iomap_device_remove(se, device_id);
+}
+
static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
int valid, struct fuse_file_info *fi)
{
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index d785303bab99ea..03cce1f0f184c3 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -223,6 +223,8 @@ FUSE_3.99 {
fuse_reply_iomap_begin;
fuse_lowlevel_iomap_device_add;
fuse_lowlevel_iomap_device_remove;
+ fuse_fs_iomap_device_add;
+ fuse_fs_iomap_device_remove;
} FUSE_3.18;
# Local Variables:
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 07/21] libfuse: add iomap ioend low level handler
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (5 preceding siblings ...)
2025-08-21 1:02 ` [PATCH 06/21] libfuse: add upper-level iomap add device function Darrick J. Wong
@ 2025-08-21 1:03 ` Darrick J. Wong
2025-08-21 1:03 ` [PATCH 08/21] libfuse: add upper level iomap ioend commands Darrick J. Wong
` (13 subsequent siblings)
20 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:03 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Teach the low level library about the iomap ioend handler, which gets
called by the kernel when we finish a file write that isn't a pure
overwrite operation.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse_common.h | 11 +++++++++++
include/fuse_kernel.h | 11 +++++++++++
include/fuse_lowlevel.h | 20 ++++++++++++++++++++
lib/fuse_lowlevel.c | 23 +++++++++++++++++++++++
4 files changed, 65 insertions(+)
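Note for server authors (not part of the commit message): the sketch below shows one way a low-level iomap_ioend handler might triage the FUSE_IOMAP_IOEND_* flags added by this patch. The flag values mirror the fuse_common.h hunk; the ioend_action enum and the ioend_plan() helper are hypothetical names invented for illustration, and the mapping from flags to follow-up work is an assumption about typical filesystem behavior rather than something this series requires.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as defined in the fuse_common.h hunk of this patch. */
#define FUSE_IOMAP_IOEND_SHARED		(1U << 0)	/* out of place write extent */
#define FUSE_IOMAP_IOEND_UNWRITTEN	(1U << 1)	/* unwritten extent */
#define FUSE_IOMAP_IOEND_BOUNDARY	(1U << 2)	/* don't merge into previous ioend */
#define FUSE_IOMAP_IOEND_DIRECT		(1U << 3)	/* is direct I/O */
#define FUSE_IOMAP_IOEND_APPEND		(1U << 4)	/* is append ioend */

/* Hypothetical follow-up work a server might schedule; not part of libfuse. */
enum ioend_action {
	IOEND_CANCEL	= 1 << 0,	/* drop any staged mapping changes */
	IOEND_REMAP	= 1 << 1,	/* point the file range at new_addr */
	IOEND_CONVERT	= 1 << 2,	/* mark the unwritten extent as written */
	IOEND_EXTEND	= 1 << 3,	/* push the on-disk file size forward */
};

/* Turn the kernel's ioend report into a bitmask of ioend_action values. */
static unsigned int ioend_plan(uint32_t ioendflags, int error)
{
	unsigned int plan = 0;

	/* struct fuse_iomap_ioend_in documents error as "negative errno or 0". */
	assert(error <= 0);

	if (error)
		return IOEND_CANCEL;
	if (ioendflags & FUSE_IOMAP_IOEND_SHARED)
		plan |= IOEND_REMAP;
	if (ioendflags & FUSE_IOMAP_IOEND_UNWRITTEN)
		plan |= IOEND_CONVERT;
	if (ioendflags & FUSE_IOMAP_IOEND_APPEND)
		plan |= IOEND_EXTEND;
	return plan;
}
```

A real handler would perform the planned work and then finish the request with fuse_reply_err(), the only reply the new callback documents as valid.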
diff --git a/include/fuse_common.h b/include/fuse_common.h
index d10364a077f31d..77e971c3fed17d 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -1215,6 +1215,17 @@ static inline bool fuse_iomap_need_write_allocate(unsigned int opflags,
!(opflags & FUSE_IOMAP_OP_ZERO);
}
+/* out of place write extent */
+#define FUSE_IOMAP_IOEND_SHARED (1U << 0)
+/* unwritten extent */
+#define FUSE_IOMAP_IOEND_UNWRITTEN (1U << 1)
+/* don't merge into previous ioend */
+#define FUSE_IOMAP_IOEND_BOUNDARY (1U << 2)
+/* is direct I/O */
+#define FUSE_IOMAP_IOEND_DIRECT (1U << 3)
+/* is append ioend */
+#define FUSE_IOMAP_IOEND_APPEND (1U << 4)
+
/* ----------------------------------------------------------- *
* Compatibility stuff *
* ----------------------------------------------------------- */
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 2bcb3b394c0169..849238c17baf5e 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -665,6 +665,7 @@ enum fuse_opcode {
FUSE_TMPFILE = 51,
FUSE_STATX = 52,
+ FUSE_IOMAP_IOEND = 4093,
FUSE_IOMAP_BEGIN = 4094,
FUSE_IOMAP_END = 4095,
@@ -1337,4 +1338,14 @@ struct fuse_iomap_end_in {
struct fuse_iomap_io map;
};
+struct fuse_iomap_ioend_in {
+ uint32_t ioendflags; /* FUSE_IOMAP_IOEND_* */
+ int32_t error; /* negative errno or 0 */
+ uint64_t attr_ino; /* matches fuse_attr:ino */
+ uint64_t pos; /* file position, in bytes */
+ uint64_t new_addr; /* disk offset of new mapping, in bytes */
+ uint32_t written; /* bytes processed */
+ uint32_t reserved1; /* zero */
+};
+
#endif /* _LINUX_FUSE_H */
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index 45655781e510a0..7f7f418b281601 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -1379,6 +1379,26 @@ struct fuse_lowlevel_ops {
void (*iomap_end) (fuse_req_t req, fuse_ino_t nodeid, uint64_t attr_ino,
off_t pos, uint64_t count, uint32_t opflags,
ssize_t written, const struct fuse_file_iomap *iomap);
+
+ /**
+ * Complete an iomap IO operation
+ *
+ * Valid replies:
+ * fuse_reply_err
+ *
+ * @param req request handle
+ * @param nodeid the inode number
+ * @param attr_ino inode number as told by fuse_attr::ino
+ * @param pos position in file, in bytes
+ * @param written number of bytes processed
+ * @param ioendflags mask of FUSE_IOMAP_IOEND_ flags specifying operation
+ * @param error negative errno if the IO failed, or 0
+ * @param new_addr disk address of new mapping, in bytes
+ */
+ void (*iomap_ioend) (fuse_req_t req, fuse_ino_t nodeid,
+ uint64_t attr_ino, off_t pos, size_t written,
+ uint32_t ioendflags, int error,
+ uint64_t new_addr);
};
/**
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index fec4e3265e53c1..ce7971a23be94b 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -2637,6 +2637,27 @@ static void do_iomap_end(fuse_req_t req, const fuse_ino_t nodeid,
_do_iomap_end(req, nodeid, inarg, NULL);
}
+static void _do_iomap_ioend(fuse_req_t req, const fuse_ino_t nodeid,
+ const void *op_in, const void *in_payload)
+{
+ const struct fuse_iomap_ioend_in *arg = op_in;
+ (void)in_payload;
+ (void)nodeid;
+
+ if (req->se->op.iomap_ioend)
+ req->se->op.iomap_ioend(req, nodeid, arg->attr_ino, arg->pos,
+ arg->written, arg->ioendflags,
+ arg->error, arg->new_addr);
+ else
+ fuse_reply_err(req, ENOSYS);
+}
+
+static void do_iomap_ioend(fuse_req_t req, const fuse_ino_t nodeid,
+ const void *inarg)
+{
+ _do_iomap_ioend(req, nodeid, inarg, NULL);
+}
+
static bool want_flags_valid(uint64_t capable, uint64_t want)
{
uint64_t unknown_flags = want & (~capable);
@@ -3524,6 +3545,7 @@ static struct {
[FUSE_STATX] = { do_statx, "STATX" },
[FUSE_IOMAP_BEGIN] = { do_iomap_begin, "IOMAP_BEGIN" },
[FUSE_IOMAP_END] = { do_iomap_end, "IOMAP_END" },
+ [FUSE_IOMAP_IOEND] = { do_iomap_ioend, "IOMAP_IOEND" },
[CUSE_INIT] = { cuse_lowlevel_init, "CUSE_INIT" },
};
@@ -3581,6 +3603,7 @@ static struct {
[FUSE_STATX] = { _do_statx, "STATX" },
[FUSE_IOMAP_BEGIN] = { _do_iomap_begin, "IOMAP_BEGIN" },
[FUSE_IOMAP_END] = { _do_iomap_end, "IOMAP_END" },
+ [FUSE_IOMAP_IOEND] = { _do_iomap_ioend, "IOMAP_IOEND" },
[CUSE_INIT] = { _cuse_lowlevel_init, "CUSE_INIT" },
};
* [PATCH 08/21] libfuse: add upper level iomap ioend commands
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (6 preceding siblings ...)
2025-08-21 1:03 ` [PATCH 07/21] libfuse: add iomap ioend low level handler Darrick J. Wong
@ 2025-08-21 1:03 ` Darrick J. Wong
2025-08-21 1:03 ` [PATCH 09/21] libfuse: add a reply function to send FUSE_ATTR_* to the kernel Darrick J. Wong
` (12 subsequent siblings)
20 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:03 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Teach the upper level fuse library about iomap ioend events, which
happen when a write that isn't a pure overwrite completes.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse.h | 8 ++++++++
lib/fuse.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 53 insertions(+)
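Note for server authors (not part of the commit message): a minimal sketch of what a high-level ->iomap_ioend implementation could look like. The parameter list matches the member added to struct fuse_operations in this patch; everything else (struct toy_inode, the single global inode, the size-tracking policy) is hypothetical and exists only to make the example self-contained.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical in-memory inode; a real server would look this up. */
struct toy_inode {
	uint64_t ino;
	off_t size;
};

static struct toy_inode toy = { .ino = 42, .size = 4096 };

/* Same parameter list as the ->iomap_ioend member added to fuse_operations. */
static int toy_iomap_ioend(const char *path, uint64_t nodeid,
			   uint64_t attr_ino, off_t pos_in, size_t written_in,
			   uint32_t ioendflags_in, int error_in,
			   uint64_t new_addr_in)
{
	off_t end;

	/* path may be NULL: the library resolves it with get_path_nullok(). */
	(void)path;
	(void)nodeid;
	(void)ioendflags_in;
	(void)new_addr_in;

	/* Assumed from the wire struct comment: negative errno or 0. */
	assert(error_in <= 0);

	if (attr_ino != toy.ino)
		return -ENOENT;
	if (error_in)
		return error_in;	/* pass the I/O error back as the reply */

	/* The write succeeded; grow the recorded size if it went past EOF. */
	end = pos_in + (off_t)written_in;
	if (end > toy.size)
		toy.size = end;
	return 0;
}
```

Returning a negative errno follows the usual high-level libfuse convention; fuse_lib_iomap_ioend() forwards the return value to reply_err().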
diff --git a/include/fuse.h b/include/fuse.h
index 524b77b5d7bbd0..1357f4319bcc21 100644
--- a/include/fuse.h
+++ b/include/fuse.h
@@ -881,6 +881,14 @@ struct fuse_operations {
off_t pos_in, uint64_t length_in,
uint32_t opflags_in, ssize_t written_in,
const struct fuse_file_iomap *iomap);
+
+ /**
+ * Respond to the outcome of a file IO operation.
+ */
+ int (*iomap_ioend) (const char *path, uint64_t nodeid,
+ uint64_t attr_ino, off_t pos_in, size_t written_in,
+ uint32_t ioendflags_in, int error_in,
+ uint64_t new_addr_in);
};
/** Extra context that may be needed by some filesystems
diff --git a/lib/fuse.c b/lib/fuse.c
index 632e935b8dff3e..725ab615d456e3 100644
--- a/lib/fuse.c
+++ b/lib/fuse.c
@@ -2848,6 +2848,26 @@ int fuse_fs_iomap_device_remove(int device_id)
return fuse_lowlevel_iomap_device_remove(se, device_id);
}
+static int fuse_fs_iomap_ioend(struct fuse_fs *fs, const char *path,
+ uint64_t nodeid, uint64_t attr_ino, off_t pos,
+ size_t written, uint32_t ioendflags, int error,
+ uint64_t new_addr)
+{
+ fuse_get_context()->private_data = fs->user_data;
+ if (!fs->op.iomap_ioend)
+ return -ENOSYS;
+
+ if (fs->debug) {
+ fuse_log(FUSE_LOG_DEBUG,
+ "iomap_ioend[%s] nodeid %llu attr_ino %llu pos %llu written %zu ioendflags 0x%x error %d\n",
+ path, (unsigned long long)nodeid, (unsigned long long)attr_ino,
+ (unsigned long long)pos, written, ioendflags, error);
+ }
+
+ return fs->op.iomap_ioend(path, nodeid, attr_ino, pos, written,
+ ioendflags, error, new_addr);
+}
+
static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
int valid, struct fuse_file_info *fi)
{
@@ -4578,6 +4598,30 @@ static void fuse_lib_iomap_end(fuse_req_t req, fuse_ino_t nodeid,
reply_err(req, err);
}
+static void fuse_lib_iomap_ioend(fuse_req_t req, fuse_ino_t nodeid,
+ uint64_t attr_ino, off_t pos, size_t written,
+ uint32_t ioendflags, int error,
+ uint64_t new_addr)
+{
+ struct fuse *f = req_fuse_prepare(req);
+ struct fuse_intr_data d;
+ char *path;
+ int err;
+
+ err = get_path_nullok(f, nodeid, &path);
+ if (err) {
+ reply_err(req, err);
+ return;
+ }
+
+ fuse_prepare_interrupt(f, req, &d);
+ err = fuse_fs_iomap_ioend(f->fs, path, nodeid, attr_ino, pos, written,
+ ioendflags, error, new_addr);
+ fuse_finish_interrupt(f, req, &d);
+ free_path(f, nodeid, path);
+ reply_err(req, err);
+}
+
static int clean_delay(struct fuse *f)
{
/*
@@ -4681,6 +4725,7 @@ static struct fuse_lowlevel_ops fuse_path_ops = {
#endif
.iomap_begin = fuse_lib_iomap_begin,
.iomap_end = fuse_lib_iomap_end,
+ .iomap_ioend = fuse_lib_iomap_ioend,
};
int fuse_notify_poll(struct fuse_pollhandle *ph)
* [PATCH 09/21] libfuse: add a reply function to send FUSE_ATTR_* to the kernel
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (7 preceding siblings ...)
2025-08-21 1:03 ` [PATCH 08/21] libfuse: add upper level iomap ioend commands Darrick J. Wong
@ 2025-08-21 1:03 ` Darrick J. Wong
2025-08-21 1:03 ` [PATCH 10/21] libfuse: connect high level fuse library to fuse_reply_attr_iflags Darrick J. Wong
` (11 subsequent siblings)
20 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:03 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Create new fuse_reply_{attr,create,entry}_iflags functions so that we
can send FUSE_ATTR_* flags to the kernel when instantiating an inode.
Servers are expected to send FUSE_IFLAG_* values, which will be
translated into what the kernel can understand.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse_common.h | 3 ++
include/fuse_lowlevel.h | 83 +++++++++++++++++++++++++++++++++++++++++++++++
lib/fuse_lowlevel.c | 64 ++++++++++++++++++++++++++++--------
lib/fuse_versionscript | 4 ++
4 files changed, 139 insertions(+), 15 deletions(-)
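Note (not part of the commit message): the translation step described above can be sketched in isolation as follows. FUSE_IFLAG_DAX uses the value from this patch; FUSE_ATTR_DAX is written out here with the value I believe the kernel's uapi fuse.h uses, which is an assumption to check against your headers. The helper name iflags_to_attr_flags() is hypothetical; in the patch the same logic lives inside convert_stat().

```c
#include <assert.h>
#include <stdint.h>

/* Library-side flag, as defined in the fuse_common.h hunk of this patch. */
#define FUSE_IFLAG_DAX	(1U << 0)

/* Wire-side flag; value assumed to match the kernel's uapi fuse.h. */
#define FUSE_ATTR_DAX	(1U << 1)

/* The two namespaces are numbered independently, hence the translation. */
static_assert(FUSE_IFLAG_DAX != FUSE_ATTR_DAX,
	      "library and wire flags use different bit positions");

/* Convert server-supplied FUSE_IFLAG_* bits into FUSE_ATTR_* wire bits. */
static uint32_t iflags_to_attr_flags(unsigned int iflags)
{
	uint32_t flags = 0;

	if (iflags & FUSE_IFLAG_DAX)
		flags |= FUSE_ATTR_DAX;
	/* Bits the library does not know about are silently dropped. */
	return flags;
}
```

Keeping the server-facing flags separate from the wire flags lets the library renumber or filter them without changing the kernel ABI, and the existing fuse_reply_{attr,create,entry} entry points stay source compatible by forwarding with iflags set to 0.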
diff --git a/include/fuse_common.h b/include/fuse_common.h
index 77e971c3fed17d..9181ec6cb5e5e9 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -1226,6 +1226,9 @@ static inline bool fuse_iomap_need_write_allocate(unsigned int opflags,
/* is append ioend */
#define FUSE_IOMAP_IOEND_APPEND (1U << 4)
+/* enable fsdax */
+#define FUSE_IFLAG_DAX (1U << 0)
+
/* ----------------------------------------------------------- *
* Compatibility stuff *
* ----------------------------------------------------------- */
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index 7f7f418b281601..e0642032127686 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -243,6 +243,7 @@ struct fuse_lowlevel_ops {
*
* Valid replies:
* fuse_reply_entry
+ * fuse_reply_entry_iflags
* fuse_reply_err
*
* @param req request handle
@@ -302,6 +303,7 @@ struct fuse_lowlevel_ops {
*
* Valid replies:
* fuse_reply_attr
+ * fuse_reply_attr_iflags
* fuse_reply_err
*
* @param req request handle
@@ -337,6 +339,7 @@ struct fuse_lowlevel_ops {
*
* Valid replies:
* fuse_reply_attr
+ * fuse_reply_attr_iflags
* fuse_reply_err
*
* @param req request handle
@@ -368,6 +371,7 @@ struct fuse_lowlevel_ops {
*
* Valid replies:
* fuse_reply_entry
+ * fuse_reply_entry_iflags
* fuse_reply_err
*
* @param req request handle
@@ -384,6 +388,7 @@ struct fuse_lowlevel_ops {
*
* Valid replies:
* fuse_reply_entry
+ * fuse_reply_entry_iflags
* fuse_reply_err
*
* @param req request handle
@@ -433,6 +438,7 @@ struct fuse_lowlevel_ops {
*
* Valid replies:
* fuse_reply_entry
+ * fuse_reply_entry_iflags
* fuse_reply_err
*
* @param req request handle
@@ -481,6 +487,7 @@ struct fuse_lowlevel_ops {
*
* Valid replies:
* fuse_reply_entry
+ * fuse_reply_entry_iflags
* fuse_reply_err
*
* @param req request handle
@@ -972,6 +979,7 @@ struct fuse_lowlevel_ops {
*
* Valid replies:
* fuse_reply_create
+ * fuse_reply_create_iflags
* fuse_reply_err
*
* @param req request handle
@@ -1317,6 +1325,7 @@ struct fuse_lowlevel_ops {
*
* Valid replies:
* fuse_reply_create
+ * fuse_reply_create_iflags
* fuse_reply_err
*
* @param req request handle
@@ -1451,6 +1460,23 @@ void fuse_reply_none(fuse_req_t req);
*/
int fuse_reply_entry(fuse_req_t req, const struct fuse_entry_param *e);
+/**
+ * Reply with a directory entry and FUSE_IFLAG_*
+ *
+ * Possible requests:
+ * lookup, mknod, mkdir, symlink, link
+ *
+ * Side effects:
+ * increments the lookup count on success
+ *
+ * @param req request handle
+ * @param e the entry parameters
+ * @param iflags FUSE_IFLAG_*
+ * @return zero for success, -errno for failure to send reply
+ */
+int fuse_reply_entry_iflags(fuse_req_t req, const struct fuse_entry_param *e,
+ unsigned int iflags);
+
/**
* Reply with a directory entry and open parameters
*
@@ -1472,6 +1498,29 @@ int fuse_reply_entry(fuse_req_t req, const struct fuse_entry_param *e);
int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e,
const struct fuse_file_info *fi);
+/**
+ * Reply with a directory entry, open parameters and FUSE_IFLAG_*
+ *
+ * currently the following members of 'fi' are used:
+ * fh, direct_io, keep_cache, cache_readdir, nonseekable, noflush,
+ * parallel_direct_writes
+ *
+ * Possible requests:
+ * create
+ *
+ * Side effects:
+ * increments the lookup count on success
+ *
+ * @param req request handle
+ * @param e the entry parameters
+ * @param iflags FUSE_IFLAG_*
+ * @param fi file information
+ * @return zero for success, -errno for failure to send reply
+ */
+int fuse_reply_create_iflags(fuse_req_t req, const struct fuse_entry_param *e,
+ unsigned int iflags,
+ const struct fuse_file_info *fi);
+
/**
* Reply with attributes
*
@@ -1486,6 +1535,21 @@ int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e,
int fuse_reply_attr(fuse_req_t req, const struct stat *attr,
double attr_timeout);
+/**
+ * Reply with attributes and FUSE_IFLAG_* flags
+ *
+ * Possible requests:
+ * getattr, setattr
+ *
+ * @param req request handle
+ * @param attr the attributes
+ * @param iflags set of FUSE_IFLAG_* flags
+ * @param attr_timeout validity timeout (in seconds) for the attributes
+ * @return zero for success, -errno for failure to send reply
+ */
+int fuse_reply_attr_iflags(fuse_req_t req, const struct stat *attr,
+ unsigned int iflags, double attr_timeout);
+
/**
* Reply with the contents of a symbolic link
*
@@ -1713,6 +1777,25 @@ size_t fuse_add_direntry_plus(fuse_req_t req, char *buf, size_t bufsize,
const char *name,
const struct fuse_entry_param *e, off_t off);
+/**
+ * Add a directory entry with attributes and FUSE_IFLAG_* flags to the buffer
+ *
+ * See documentation of `fuse_add_direntry_plus()` for more details.
+ *
+ * @param req request handle
+ * @param buf the point where the new entry will be added to the buffer
+ * @param bufsize remaining size of the buffer
+ * @param name the name of the entry
+ * @param iflags FUSE_IFLAG_*
+ * @param e the directory entry
+ * @param off the offset of the next entry
+ * @return the space needed for the entry
+ */
+size_t fuse_add_direntry_plus_iflags(fuse_req_t req, char *buf, size_t bufsize,
+ const char *name, unsigned int iflags,
+ const struct fuse_entry_param *e,
+ off_t off);
+
/**
* Reply to ask for data fetch and output buffer preparation. ioctl
* will be retried with the specified input data fetched and output
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index ce7971a23be94b..04bc858f54d01f 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -102,7 +102,8 @@ static void trace_request_reply(uint64_t unique, unsigned int len,
}
#endif
-static void convert_stat(const struct stat *stbuf, struct fuse_attr *attr)
+static void convert_stat(const struct stat *stbuf, struct fuse_attr *attr,
+ unsigned int iflags)
{
attr->ino = stbuf->st_ino;
attr->mode = stbuf->st_mode;
@@ -119,6 +120,10 @@ static void convert_stat(const struct stat *stbuf, struct fuse_attr *attr)
attr->atimensec = ST_ATIM_NSEC(stbuf);
attr->mtimensec = ST_MTIM_NSEC(stbuf);
attr->ctimensec = ST_CTIM_NSEC(stbuf);
+
+ attr->flags = 0;
+ if (iflags & FUSE_IFLAG_DAX)
+ attr->flags |= FUSE_ATTR_DAX;
}
static void convert_attr(const struct fuse_setattr_in *attr, struct stat *stbuf)
@@ -438,7 +443,8 @@ static unsigned int calc_timeout_nsec(double t)
}
static void fill_entry(struct fuse_entry_out *arg,
- const struct fuse_entry_param *e)
+ const struct fuse_entry_param *e,
+ unsigned int iflags)
{
arg->nodeid = e->ino;
arg->generation = e->generation;
@@ -446,14 +452,15 @@ static void fill_entry(struct fuse_entry_out *arg,
arg->entry_valid_nsec = calc_timeout_nsec(e->entry_timeout);
arg->attr_valid = calc_timeout_sec(e->attr_timeout);
arg->attr_valid_nsec = calc_timeout_nsec(e->attr_timeout);
- convert_stat(&e->attr, &arg->attr);
+ convert_stat(&e->attr, &arg->attr, iflags);
}
/* `buf` is allowed to be empty so that the proper size may be
allocated by the caller */
-size_t fuse_add_direntry_plus(fuse_req_t req, char *buf, size_t bufsize,
- const char *name,
- const struct fuse_entry_param *e, off_t off)
+size_t fuse_add_direntry_plus_iflags(fuse_req_t req, char *buf, size_t bufsize,
+ const char *name, unsigned int iflags,
+ const struct fuse_entry_param *e,
+ off_t off)
{
(void)req;
size_t namelen;
@@ -468,7 +475,7 @@ size_t fuse_add_direntry_plus(fuse_req_t req, char *buf, size_t bufsize,
struct fuse_direntplus *dp = (struct fuse_direntplus *) buf;
memset(&dp->entry_out, 0, sizeof(dp->entry_out));
- fill_entry(&dp->entry_out, e);
+ fill_entry(&dp->entry_out, e, iflags);
struct fuse_dirent *dirent = &dp->dirent;
dirent->ino = e->attr.st_ino;
@@ -481,6 +488,14 @@ size_t fuse_add_direntry_plus(fuse_req_t req, char *buf, size_t bufsize,
return entlen_padded;
}
+size_t fuse_add_direntry_plus(fuse_req_t req, char *buf, size_t bufsize,
+ const char *name,
+ const struct fuse_entry_param *e, off_t off)
+{
+ return fuse_add_direntry_plus_iflags(req, buf, bufsize, name, 0, e,
+ off);
+}
+
static void fill_open(struct fuse_open_out *arg,
const struct fuse_file_info *f)
{
@@ -503,7 +518,8 @@ static void fill_open(struct fuse_open_out *arg,
arg->open_flags |= FOPEN_PARALLEL_DIRECT_WRITES;
}
-int fuse_reply_entry(fuse_req_t req, const struct fuse_entry_param *e)
+int fuse_reply_entry_iflags(fuse_req_t req, const struct fuse_entry_param *e,
+ unsigned int iflags)
{
struct fuse_entry_out arg;
size_t size = req->se->conn.proto_minor < 9 ?
@@ -515,12 +531,18 @@ int fuse_reply_entry(fuse_req_t req, const struct fuse_entry_param *e)
return fuse_reply_err(req, ENOENT);
memset(&arg, 0, sizeof(arg));
- fill_entry(&arg, e);
+ fill_entry(&arg, e, iflags);
return send_reply_ok(req, &arg, size);
}
-int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e,
- const struct fuse_file_info *f)
+int fuse_reply_entry(fuse_req_t req, const struct fuse_entry_param *e)
+{
+ return fuse_reply_entry_iflags(req, e, 0);
+}
+
+int fuse_reply_create_iflags(fuse_req_t req, const struct fuse_entry_param *e,
+ unsigned int iflags,
+ const struct fuse_file_info *f)
{
alignas(uint64_t) char buf[sizeof(struct fuse_entry_out) + sizeof(struct fuse_open_out)];
size_t entrysize = req->se->conn.proto_minor < 9 ?
@@ -529,14 +551,20 @@ int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e,
struct fuse_open_out *oarg = (struct fuse_open_out *) (buf + entrysize);
memset(buf, 0, sizeof(buf));
- fill_entry(earg, e);
+ fill_entry(earg, e, iflags);
fill_open(oarg, f);
return send_reply_ok(req, buf,
entrysize + sizeof(struct fuse_open_out));
}
-int fuse_reply_attr(fuse_req_t req, const struct stat *attr,
- double attr_timeout)
+int fuse_reply_create(fuse_req_t req, const struct fuse_entry_param *e,
+ const struct fuse_file_info *f)
+{
+ return fuse_reply_create_iflags(req, e, 0, f);
+}
+
+int fuse_reply_attr_iflags(fuse_req_t req, const struct stat *attr,
+ unsigned int iflags, double attr_timeout)
{
struct fuse_attr_out arg;
size_t size = req->se->conn.proto_minor < 9 ?
@@ -545,11 +573,17 @@ int fuse_reply_attr(fuse_req_t req, const struct stat *attr,
memset(&arg, 0, sizeof(arg));
arg.attr_valid = calc_timeout_sec(attr_timeout);
arg.attr_valid_nsec = calc_timeout_nsec(attr_timeout);
- convert_stat(attr, &arg.attr);
+ convert_stat(attr, &arg.attr, iflags);
return send_reply_ok(req, &arg, size);
}
+int fuse_reply_attr(fuse_req_t req, const struct stat *attr,
+ double attr_timeout)
+{
+ return fuse_reply_attr_iflags(req, attr, 0, attr_timeout);
+}
+
int fuse_reply_readlink(fuse_req_t req, const char *linkname)
{
return send_reply_ok(req, linkname, strlen(linkname));
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index 03cce1f0f184c3..df78723e0f2518 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -225,6 +225,10 @@ FUSE_3.99 {
fuse_lowlevel_iomap_device_remove;
fuse_fs_iomap_device_add;
fuse_fs_iomap_device_remove;
+ fuse_reply_attr_iflags;
+ fuse_reply_create_iflags;
+ fuse_reply_entry_iflags;
+ fuse_add_direntry_plus_iflags;
} FUSE_3.18;
# Local Variables:
* [PATCH 10/21] libfuse: connect high level fuse library to fuse_reply_attr_iflags
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (8 preceding siblings ...)
2025-08-21 1:03 ` [PATCH 09/21] libfuse: add a reply function to send FUSE_ATTR_* to the kernel Darrick J. Wong
@ 2025-08-21 1:03 ` Darrick J. Wong
2025-08-21 1:04 ` [PATCH 11/21] libfuse: support direct I/O through iomap Darrick J. Wong
` (10 subsequent siblings)
20 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:03 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Create a new ->getattr_iflags function so that iomap filesystems can set
the appropriate in-kernel inode flags on instantiation.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse.h | 7 ++
lib/fuse.c | 191 ++++++++++++++++++++++++++++++++++++++++++--------------
2 files changed, 151 insertions(+), 47 deletions(-)
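Note (not part of the commit message): the dispatch that the new getattr() helper performs can be sketched on its own as shown here. struct toy_ops, toy_dispatch_getattr() and the two toy callbacks are hypothetical stand-ins, and the real callbacks also take a struct fuse_file_info pointer that is left out to keep the example short.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>

#define FUSE_IFLAG_DAX	(1U << 0)	/* as defined earlier in this series */

/* Hypothetical, trimmed-down stand-in for struct fuse_operations. */
struct toy_ops {
	int (*getattr)(const char *path, struct stat *buf);
	int (*getattr_iflags)(const char *path, struct stat *buf,
			      unsigned int *iflags);
};

/* Pick the iflags-aware callback only when iomap was negotiated. */
static int toy_dispatch_getattr(const struct toy_ops *ops, bool want_iflags,
				const char *path, struct stat *buf,
				unsigned int *iflags)
{
	assert(ops && buf && iflags);

	*iflags = 0;	/* plain getattr never reports any flags */
	if (want_iflags) {
		if (!ops->getattr_iflags)
			return -ENOSYS;
		return ops->getattr_iflags(path, buf, iflags);
	}
	if (!ops->getattr)
		return -ENOSYS;
	return ops->getattr(path, buf);
}

/* Toy server callbacks used to exercise the dispatcher. */
static int toy_getattr(const char *path, struct stat *buf)
{
	(void)path;
	buf->st_size = 4096;
	return 0;
}

static int toy_getattr_iflags(const char *path, struct stat *buf,
			      unsigned int *iflags)
{
	(void)path;
	buf->st_size = 4096;
	*iflags = FUSE_IFLAG_DAX;
	return 0;
}
```

As the patch is posted, a server that enables FUSE_CAP_IOMAP but implements only ->getattr appears to get -ENOSYS from this path, so such servers would presumably want to provide ->getattr_iflags as well.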
diff --git a/include/fuse.h b/include/fuse.h
index 1357f4319bcc21..7256f43fd5c39a 100644
--- a/include/fuse.h
+++ b/include/fuse.h
@@ -889,6 +889,13 @@ struct fuse_operations {
uint64_t attr_ino, off_t pos_in, size_t written_in,
uint32_t ioendflags_in, int error_in,
uint64_t new_addr_in);
+
+ /**
+ * Get file attributes and FUSE_IFLAG_* flags. Otherwise the same as
+ * getattr.
+ */
+ int (*getattr_iflags) (const char *path, struct stat *buf,
+ unsigned int *iflags, struct fuse_file_info *fi);
};
/** Extra context that may be needed by some filesystems
diff --git a/lib/fuse.c b/lib/fuse.c
index 725ab615d456e3..6b211084e2175a 100644
--- a/lib/fuse.c
+++ b/lib/fuse.c
@@ -123,6 +123,7 @@ struct fuse {
struct list_head partial_slabs;
struct list_head full_slabs;
pthread_t prune_thread;
+ bool want_iflags;
};
struct lock {
@@ -144,6 +145,7 @@ struct node {
char *name;
uint64_t nlookup;
int open_count;
+ unsigned int iflags;
struct timespec stat_updated;
struct timespec mtime;
off_t size;
@@ -1628,6 +1630,24 @@ int fuse_fs_getattr(struct fuse_fs *fs, const char *path, struct stat *buf,
return fs->op.getattr(path, buf, fi);
}
+static int fuse_fs_getattr_iflags(struct fuse_fs *fs, const char *path,
+ struct stat *buf, unsigned int *iflags,
+ struct fuse_file_info *fi)
+{
+ fuse_get_context()->private_data = fs->user_data;
+ if (!fs->op.getattr_iflags)
+ return -ENOSYS;
+
+ if (fs->debug) {
+ char buf[10];
+
+ fuse_log(FUSE_LOG_DEBUG, "getattr_iflags[%s] %s\n",
+ file_info_string(fi, buf, sizeof(buf)),
+ path);
+ }
+ return fs->op.getattr_iflags(path, buf, iflags, fi);
+}
+
int fuse_fs_rename(struct fuse_fs *fs, const char *oldpath,
const char *newpath, unsigned int flags)
{
@@ -2473,7 +2493,7 @@ static void update_stat(struct node *node, const struct stat *stbuf)
}
static int do_lookup(struct fuse *f, fuse_ino_t nodeid, const char *name,
- struct fuse_entry_param *e)
+ struct fuse_entry_param *e, unsigned int *iflags)
{
struct node *node;
@@ -2491,25 +2511,64 @@ static int do_lookup(struct fuse *f, fuse_ino_t nodeid, const char *name,
pthread_mutex_unlock(&f->lock);
}
set_stat(f, e->ino, &e->attr);
+ *iflags = node->iflags;
return 0;
}
+static int lookup_and_update(struct fuse *f, fuse_ino_t nodeid,
+ const char *name, struct fuse_entry_param *e,
+ unsigned int iflags)
+{
+ struct node *node;
+
+ node = find_node(f, nodeid, name);
+ if (node == NULL)
+ return -ENOMEM;
+
+ e->ino = node->nodeid;
+ e->generation = node->generation;
+ e->entry_timeout = f->conf.entry_timeout;
+ e->attr_timeout = f->conf.attr_timeout;
+ if (f->conf.auto_cache) {
+ pthread_mutex_lock(&f->lock);
+ update_stat(node, &e->attr);
+ pthread_mutex_unlock(&f->lock);
+ }
+ set_stat(f, e->ino, &e->attr);
+ node->iflags = iflags;
+ return 0;
+}
+
+static int getattr(struct fuse *f, const char *path, struct stat *buf,
+ unsigned int *iflags, struct fuse_file_info *fi)
+{
+ if (f->want_iflags)
+ return fuse_fs_getattr_iflags(f->fs, path, buf, iflags, fi);
+ return fuse_fs_getattr(f->fs, path, buf, fi);
+}
+
static int lookup_path(struct fuse *f, fuse_ino_t nodeid,
const char *name, const char *path,
- struct fuse_entry_param *e, struct fuse_file_info *fi)
+ struct fuse_entry_param *e, unsigned int *iflags,
+ struct fuse_file_info *fi)
{
int res;
memset(e, 0, sizeof(struct fuse_entry_param));
- res = fuse_fs_getattr(f->fs, path, &e->attr, fi);
- if (res == 0) {
- res = do_lookup(f, nodeid, name, e);
- if (res == 0 && f->conf.debug) {
- fuse_log(FUSE_LOG_DEBUG, " NODEID: %llu\n",
- (unsigned long long) e->ino);
- }
- }
- return res;
+ *iflags = 0;
+ res = getattr(f, path, &e->attr, iflags, fi);
+ if (res)
+ return res;
+
+ res = lookup_and_update(f, nodeid, name, e, *iflags);
+ if (res)
+ return res;
+
+ if (f->conf.debug)
+ fuse_log(FUSE_LOG_DEBUG, " NODEID: %llu iflags 0x%x\n",
+ (unsigned long long) e->ino, *iflags);
+
+ return 0;
}
static struct fuse_context_i *fuse_get_context_internal(void)
@@ -2593,11 +2652,14 @@ static inline void reply_err(fuse_req_t req, int err)
}
static void reply_entry(fuse_req_t req, const struct fuse_entry_param *e,
- int err)
+ unsigned int iflags, int err)
{
if (!err) {
struct fuse *f = req_fuse(req);
- if (fuse_reply_entry(req, e) == -ENOENT) {
+ int entry_res;
+
+ entry_res = fuse_reply_entry_iflags(req, e, iflags);
+ if (entry_res == -ENOENT) {
/* Skip forget for negative result */
if (e->ino != 0)
forget_node(f, e->ino, 1);
@@ -2638,6 +2700,9 @@ static void fuse_lib_init(void *data, struct fuse_conn_info *conn)
/* Disable the receiving and processing of FUSE_INTERRUPT requests */
conn->no_interrupt = 1;
}
+
+ if (conn->want_ext & FUSE_CAP_IOMAP)
+ f->want_iflags = true;
}
void fuse_fs_destroy(struct fuse_fs *fs)
@@ -2661,6 +2726,7 @@ static void fuse_lib_lookup(fuse_req_t req, fuse_ino_t parent,
struct fuse *f = req_fuse_prepare(req);
struct fuse_entry_param e;
char *path;
+ unsigned int iflags = 0;
int err;
struct node *dot = NULL;
@@ -2675,7 +2741,7 @@ static void fuse_lib_lookup(fuse_req_t req, fuse_ino_t parent,
dot = get_node_nocheck(f, parent);
if (dot == NULL) {
pthread_mutex_unlock(&f->lock);
- reply_entry(req, &e, -ESTALE);
+ reply_entry(req, &e, 0, -ESTALE);
return;
}
dot->refctr++;
@@ -2695,7 +2761,7 @@ static void fuse_lib_lookup(fuse_req_t req, fuse_ino_t parent,
if (f->conf.debug)
fuse_log(FUSE_LOG_DEBUG, "LOOKUP %s\n", path);
fuse_prepare_interrupt(f, req, &d);
- err = lookup_path(f, parent, name, path, &e, NULL);
+ err = lookup_path(f, parent, name, path, &e, &iflags, NULL);
if (err == -ENOENT && f->conf.negative_timeout != 0.0) {
e.ino = 0;
e.entry_timeout = f->conf.negative_timeout;
@@ -2709,7 +2775,7 @@ static void fuse_lib_lookup(fuse_req_t req, fuse_ino_t parent,
unref_node(f, dot);
pthread_mutex_unlock(&f->lock);
}
- reply_entry(req, &e, err);
+ reply_entry(req, &e, iflags, err);
}
static void do_forget(struct fuse *f, fuse_ino_t ino, uint64_t nlookup)
@@ -2745,6 +2811,7 @@ static void fuse_lib_getattr(fuse_req_t req, fuse_ino_t ino,
struct fuse *f = req_fuse_prepare(req);
struct stat buf;
char *path;
+ unsigned int iflags = 0;
int err;
memset(&buf, 0, sizeof(buf));
@@ -2756,7 +2823,7 @@ static void fuse_lib_getattr(fuse_req_t req, fuse_ino_t ino,
if (!err) {
struct fuse_intr_data d;
fuse_prepare_interrupt(f, req, &d);
- err = fuse_fs_getattr(f->fs, path, &buf, fi);
+ err = getattr(f, path, &buf, &iflags, fi);
fuse_finish_interrupt(f, req, &d);
free_path(f, ino, path);
}
@@ -2769,9 +2836,11 @@ static void fuse_lib_getattr(fuse_req_t req, fuse_ino_t ino,
buf.st_nlink--;
if (f->conf.auto_cache)
update_stat(node, &buf);
+ node->iflags = iflags;
pthread_mutex_unlock(&f->lock);
set_stat(f, ino, &buf);
- fuse_reply_attr(req, &buf, f->conf.attr_timeout);
+ fuse_reply_attr_iflags(req, &buf, iflags,
+ f->conf.attr_timeout);
} else
reply_err(req, err);
}
@@ -2874,6 +2943,7 @@ static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
struct fuse *f = req_fuse_prepare(req);
struct stat buf;
char *path;
+ unsigned int iflags = 0;
int err;
memset(&buf, 0, sizeof(buf));
@@ -2932,19 +3002,23 @@ static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
err = fuse_fs_utimens(f->fs, path, tv, fi);
}
if (!err) {
- err = fuse_fs_getattr(f->fs, path, &buf, fi);
+ err = getattr(f, path, &buf, &iflags, fi);
}
fuse_finish_interrupt(f, req, &d);
free_path(f, ino, path);
}
if (!err) {
- if (f->conf.auto_cache) {
- pthread_mutex_lock(&f->lock);
- update_stat(get_node(f, ino), &buf);
- pthread_mutex_unlock(&f->lock);
- }
+ struct node *node;
+
+ pthread_mutex_lock(&f->lock);
+ node = get_node(f, ino);
+ if (f->conf.auto_cache)
+ update_stat(node, &buf);
+ node->iflags = iflags;
+ pthread_mutex_unlock(&f->lock);
set_stat(f, ino, &buf);
- fuse_reply_attr(req, &buf, f->conf.attr_timeout);
+ fuse_reply_attr_iflags(req, &buf, iflags,
+ f->conf.attr_timeout);
} else
reply_err(req, err);
}
@@ -2995,6 +3069,7 @@ static void fuse_lib_mknod(fuse_req_t req, fuse_ino_t parent, const char *name,
struct fuse *f = req_fuse_prepare(req);
struct fuse_entry_param e;
char *path;
+ unsigned int iflags = 0;
int err;
err = get_path_name(f, parent, name, &path);
@@ -3011,7 +3086,7 @@ static void fuse_lib_mknod(fuse_req_t req, fuse_ino_t parent, const char *name,
err = fuse_fs_create(f->fs, path, mode, &fi);
if (!err) {
err = lookup_path(f, parent, name, path, &e,
- &fi);
+ &iflags, &fi);
fuse_fs_release(f->fs, path, &fi);
}
}
@@ -3019,12 +3094,12 @@ static void fuse_lib_mknod(fuse_req_t req, fuse_ino_t parent, const char *name,
err = fuse_fs_mknod(f->fs, path, mode, rdev);
if (!err)
err = lookup_path(f, parent, name, path, &e,
- NULL);
+ &iflags, NULL);
}
fuse_finish_interrupt(f, req, &d);
free_path(f, parent, path);
}
- reply_entry(req, &e, err);
+ reply_entry(req, &e, iflags, err);
}
static void fuse_lib_mkdir(fuse_req_t req, fuse_ino_t parent, const char *name,
@@ -3033,6 +3108,7 @@ static void fuse_lib_mkdir(fuse_req_t req, fuse_ino_t parent, const char *name,
struct fuse *f = req_fuse_prepare(req);
struct fuse_entry_param e;
char *path;
+ unsigned int iflags = 0;
int err;
err = get_path_name(f, parent, name, &path);
@@ -3042,11 +3118,12 @@ static void fuse_lib_mkdir(fuse_req_t req, fuse_ino_t parent, const char *name,
fuse_prepare_interrupt(f, req, &d);
err = fuse_fs_mkdir(f->fs, path, mode);
if (!err)
- err = lookup_path(f, parent, name, path, &e, NULL);
+ err = lookup_path(f, parent, name, path, &e, &iflags,
+ NULL);
fuse_finish_interrupt(f, req, &d);
free_path(f, parent, path);
}
- reply_entry(req, &e, err);
+ reply_entry(req, &e, iflags, err);
}
static void fuse_lib_unlink(fuse_req_t req, fuse_ino_t parent,
@@ -3116,6 +3193,7 @@ static void fuse_lib_symlink(fuse_req_t req, const char *linkname,
struct fuse *f = req_fuse_prepare(req);
struct fuse_entry_param e;
char *path;
+ unsigned int iflags = 0;
int err;
err = get_path_name(f, parent, name, &path);
@@ -3125,11 +3203,12 @@ static void fuse_lib_symlink(fuse_req_t req, const char *linkname,
fuse_prepare_interrupt(f, req, &d);
err = fuse_fs_symlink(f->fs, linkname, path);
if (!err)
- err = lookup_path(f, parent, name, path, &e, NULL);
+ err = lookup_path(f, parent, name, path, &e, &iflags,
+ NULL);
fuse_finish_interrupt(f, req, &d);
free_path(f, parent, path);
}
- reply_entry(req, &e, err);
+ reply_entry(req, &e, iflags, err);
}
static void fuse_lib_rename(fuse_req_t req, fuse_ino_t olddir,
@@ -3177,6 +3256,7 @@ static void fuse_lib_link(fuse_req_t req, fuse_ino_t ino, fuse_ino_t newparent,
struct fuse_entry_param e;
char *oldpath;
char *newpath;
+ unsigned int iflags = 0;
int err;
err = get_path2(f, ino, NULL, newparent, newname,
@@ -3188,11 +3268,11 @@ static void fuse_lib_link(fuse_req_t req, fuse_ino_t ino, fuse_ino_t newparent,
err = fuse_fs_link(f->fs, oldpath, newpath);
if (!err)
err = lookup_path(f, newparent, newname, newpath,
- &e, NULL);
+ &e, &iflags, NULL);
fuse_finish_interrupt(f, req, &d);
free_path2(f, ino, newparent, NULL, NULL, oldpath, newpath);
}
- reply_entry(req, &e, err);
+ reply_entry(req, &e, iflags, err);
}
static void fuse_do_release(struct fuse *f, fuse_ino_t ino, const char *path,
@@ -3235,6 +3315,7 @@ static void fuse_lib_create(fuse_req_t req, fuse_ino_t parent,
struct fuse_intr_data d;
struct fuse_entry_param e;
char *path;
+ unsigned int iflags = 0;
int err;
err = get_path_name(f, parent, name, &path);
@@ -3242,7 +3323,8 @@ static void fuse_lib_create(fuse_req_t req, fuse_ino_t parent,
fuse_prepare_interrupt(f, req, &d);
err = fuse_fs_create(f->fs, path, mode, fi);
if (!err) {
- err = lookup_path(f, parent, name, path, &e, fi);
+ err = lookup_path(f, parent, name, path, &e,
+ &iflags, fi);
if (err)
fuse_fs_release(f->fs, path, fi);
else if (!S_ISREG(e.attr.st_mode)) {
@@ -3262,10 +3344,14 @@ static void fuse_lib_create(fuse_req_t req, fuse_ino_t parent,
fuse_finish_interrupt(f, req, &d);
}
if (!err) {
+ int create_res;
+
pthread_mutex_lock(&f->lock);
get_node(f, e.ino)->open_count++;
pthread_mutex_unlock(&f->lock);
- if (fuse_reply_create(req, &e, fi) == -ENOENT) {
+
+ create_res = fuse_reply_create_iflags(req, &e, iflags, fi);
+ if (create_res == -ENOENT) {
/* The open syscall was interrupted, so it
must be cancelled */
fuse_do_release(f, e.ino, path, fi);
@@ -3299,13 +3385,16 @@ static void open_auto_cache(struct fuse *f, fuse_ino_t ino, const char *path,
if (diff_timespec(&now, &node->stat_updated) >
f->conf.ac_attr_timeout) {
struct stat stbuf;
+ unsigned int iflags = 0;
int err;
+
pthread_mutex_unlock(&f->lock);
- err = fuse_fs_getattr(f->fs, path, &stbuf, fi);
+ err = getattr(f, path, &stbuf, &iflags, fi);
pthread_mutex_lock(&f->lock);
- if (!err)
+ if (!err) {
update_stat(node, &stbuf);
- else
+ node->iflags = iflags;
+ } else
node->cache_valid = 0;
}
}
@@ -3634,6 +3723,7 @@ static int fill_dir_plus(void *dh_, const char *name, const struct stat *statp,
.ino = 0,
};
struct fuse *f = dh->fuse;
+ unsigned int iflags = 0;
int res;
if ((flags & ~FUSE_FILL_DIR_PLUS) != 0) {
@@ -3658,6 +3748,7 @@ static int fill_dir_plus(void *dh_, const char *name, const struct stat *statp,
if (off) {
size_t newlen;
+ size_t thislen;
if (dh->filled) {
dh->error = -EIO;
@@ -3673,7 +3764,8 @@ static int fill_dir_plus(void *dh_, const char *name, const struct stat *statp,
if (statp && (flags & FUSE_FILL_DIR_PLUS)) {
if (!is_dot_or_dotdot(name)) {
- res = do_lookup(f, dh->nodeid, name, &e);
+ res = do_lookup(f, dh->nodeid, name, &e,
+ &iflags);
if (res) {
dh->error = res;
return 1;
@@ -3681,10 +3773,12 @@ static int fill_dir_plus(void *dh_, const char *name, const struct stat *statp,
}
}
- newlen = dh->len +
- fuse_add_direntry_plus(dh->req, dh->contents + dh->len,
- dh->needlen - dh->len, name,
- &e, off);
+ thislen = fuse_add_direntry_plus_iflags(dh->req,
+ dh->contents + dh->len,
+ dh->needlen - dh->len,
+ name, iflags, &e, off);
+ newlen = dh->len + thislen;
+
if (newlen > dh->needlen)
return 1;
dh->len = newlen;
@@ -3771,6 +3865,7 @@ static int readdir_fill_from_list(fuse_req_t req, struct fuse_dh *dh,
unsigned rem = dh->needlen - dh->len;
unsigned thislen;
unsigned newlen;
+ unsigned int iflags = 0;
pos++;
if (flags & FUSE_READDIR_PLUS) {
@@ -3782,15 +3877,17 @@ static int readdir_fill_from_list(fuse_req_t req, struct fuse_dh *dh,
if (de->flags & FUSE_FILL_DIR_PLUS &&
!is_dot_or_dotdot(de->name)) {
res = do_lookup(dh->fuse, dh->nodeid,
- de->name, &e);
+ de->name, &e, &iflags);
if (res) {
dh->error = res;
return 1;
}
}
- thislen = fuse_add_direntry_plus(req, p, rem,
- de->name, &e, pos);
+ thislen = fuse_add_direntry_plus_iflags(req, p, rem,
+ de->name,
+ iflags, &e,
+ pos);
} else {
thislen = fuse_add_direntry(req, p, rem,
de->name, &de->stat, pos);
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 11/21] libfuse: support direct I/O through iomap
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (9 preceding siblings ...)
2025-08-21 1:03 ` [PATCH 10/21] libfuse: connect high level fuse library to fuse_reply_attr_iflags Darrick J. Wong
@ 2025-08-21 1:04 ` Darrick J. Wong
2025-08-21 1:04 ` [PATCH 12/21] libfuse: support buffered " Darrick J. Wong
` (9 subsequent siblings)
20 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:04 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Make it so that fuse servers can ask the kernel fuse driver to use iomap
to support direct IO.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse_common.h | 2 ++
include/fuse_kernel.h | 7 +++++--
lib/fuse_lowlevel.c | 2 ++
3 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/include/fuse_common.h b/include/fuse_common.h
index 9181ec6cb5e5e9..6e8b2958373258 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -1228,6 +1228,8 @@ static inline bool fuse_iomap_need_write_allocate(unsigned int opflags,
/* enable fsdax */
#define FUSE_IFLAG_DAX (1U << 0)
+/* use iomap for this inode */
+#define FUSE_IFLAG_IOMAP (1U << 1)
/* ----------------------------------------------------------- *
* Compatibility stuff *
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 849238c17baf5e..86c81871ca2b37 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -238,7 +238,8 @@
*
* 7.99
* - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
- * SEEK_{DATA,HOLE}
+ * SEEK_{DATA,HOLE}, and direct I/O
+ * - add FUSE_ATTR_IOMAP to enable iomap for specific inodes
*/
#ifndef _LINUX_FUSE_H
@@ -448,7 +449,7 @@ struct fuse_file_lock {
* FUSE_REQUEST_TIMEOUT: kernel supports timing out requests.
* init_out.request_timeout contains the timeout (in secs)
* FUSE_IOMAP: Client supports iomap for FIEMAP and SEEK_{DATA,HOLE} file
- * operations.
+ * operations and direct I/O.
*/
#define FUSE_ASYNC_READ (1 << 0)
#define FUSE_POSIX_LOCKS (1 << 1)
@@ -580,9 +581,11 @@ struct fuse_file_lock {
*
* FUSE_ATTR_SUBMOUNT: Object is a submount root
* FUSE_ATTR_DAX: Enable DAX for this file in per inode DAX mode
+ * FUSE_ATTR_IOMAP: Use iomap for this inode
*/
#define FUSE_ATTR_SUBMOUNT (1 << 0)
#define FUSE_ATTR_DAX (1 << 1)
+#define FUSE_ATTR_IOMAP (1 << 2)
/**
* Open flags
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 04bc858f54d01f..6a96c0f62d5884 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -124,6 +124,8 @@ static void convert_stat(const struct stat *stbuf, struct fuse_attr *attr,
attr->flags = 0;
if (iflags & FUSE_IFLAG_DAX)
attr->flags |= FUSE_ATTR_DAX;
+ if (iflags & FUSE_IFLAG_IOMAP)
+ attr->flags |= FUSE_ATTR_IOMAP;
}
static void convert_attr(const struct fuse_setattr_in *attr, struct stat *stbuf)
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 12/21] libfuse: support buffered I/O through iomap
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (10 preceding siblings ...)
2025-08-21 1:04 ` [PATCH 11/21] libfuse: support direct I/O through iomap Darrick J. Wong
@ 2025-08-21 1:04 ` Darrick J. Wong
2025-08-21 1:04 ` [PATCH 13/21] libfuse: don't allow hardlinking of iomap files in the upper level fuse library Darrick J. Wong
` (8 subsequent siblings)
20 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:04 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Make it so that fuse servers can ask the kernel fuse driver to use iomap
to support buffered IO.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse_kernel.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 86c81871ca2b37..eafad773a1fd5f 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -238,7 +238,7 @@
*
* 7.99
* - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
- * SEEK_{DATA,HOLE}, and direct I/O
+ * SEEK_{DATA,HOLE}, buffered I/O, and direct I/O
* - add FUSE_ATTR_IOMAP to enable iomap for specific inodes
*/
@@ -449,7 +449,7 @@ struct fuse_file_lock {
* FUSE_REQUEST_TIMEOUT: kernel supports timing out requests.
* init_out.request_timeout contains the timeout (in secs)
* FUSE_IOMAP: Client supports iomap for FIEMAP and SEEK_{DATA,HOLE} file
- * operations and direct I/O.
+ * operations, buffered I/O, and direct I/O.
*/
#define FUSE_ASYNC_READ (1 << 0)
#define FUSE_POSIX_LOCKS (1 << 1)
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 13/21] libfuse: don't allow hardlinking of iomap files in the upper level fuse library
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (11 preceding siblings ...)
2025-08-21 1:04 ` [PATCH 12/21] libfuse: support buffered " Darrick J. Wong
@ 2025-08-21 1:04 ` Darrick J. Wong
2025-08-21 1:05 ` [PATCH 14/21] libfuse: allow discovery of the kernel's iomap capabilities Darrick J. Wong
` (7 subsequent siblings)
20 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:04 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
The upper level fuse library creates a separate node object for every
(i)node referenced by a directory entry. Unfortunately, it doesn't
account for the possibility of hardlinks, which means that we can create
multiple nodeids that refer to the same hardlinked inode. Inode locking
in iomap mode in the kernel relies on there only being one inode object for
a hardlinked file, so we cannot allow anyone to hardlink an iomap file.
The client had better not turn on iomap for an existing hardlinked file.
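
As a rough sketch of the rule described above (not the libfuse
implementation), a server-side check might refuse iomap for any file that
has more than one link. The helper name can_enable_iomap_sketch and the
iomap_negotiated parameter are hypothetical; the parameter stands in for
the FUSE_CAP_IOMAP check on the connection:

```c
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: iomap is only considered safe when the capability
 * was negotiated and the file has at most one directory entry.
 */
static bool can_enable_iomap_sketch(const struct stat *st,
				    bool iomap_negotiated)
{
	if (!iomap_negotiated)
		return false;

	/* Two or more links could mean two nodeids for one inode. */
	return st->st_nlink < 2;
}
```

A server would presumably call something like this from its getattr path
before setting the iomap flag on an inode.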
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse.h | 18 ++++++++++
lib/fuse.c | 90 +++++++++++++++++++++++++++++++++++++++++++-----
lib/fuse_versionscript | 2 +
3 files changed, 101 insertions(+), 9 deletions(-)
diff --git a/include/fuse.h b/include/fuse.h
index 7256f43fd5c39a..4c4fff837437c8 100644
--- a/include/fuse.h
+++ b/include/fuse.h
@@ -1415,6 +1415,24 @@ int fuse_fs_iomap_device_add(int fd, unsigned int flags);
*/
int fuse_fs_iomap_device_remove(int device_id);
+/**
+ * Decide if we can enable iomap mode for a particular file for an upper-level
+ * fuse server.
+ *
+ * @param statbuf stat information for the file.
+ * @return true if it can be enabled, false if not.
+ */
+bool fuse_fs_can_enable_iomap(const struct stat *statbuf);
+
+/**
+ * Decide if we can enable iomap mode for a particular file for an upper-level
+ * fuse server.
+ *
+ * @param statxbuf statx information for the file.
+ * @return true if it can be enabled, false if not.
+ */
+bool fuse_fs_can_enable_iomapx(const struct statx *statxbuf);
+
int fuse_notify_poll(struct fuse_pollhandle *ph);
/**
diff --git a/lib/fuse.c b/lib/fuse.c
index 6b211084e2175a..cbf2c5d3a67895 100644
--- a/lib/fuse.c
+++ b/lib/fuse.c
@@ -3249,10 +3249,66 @@ static void fuse_lib_rename(fuse_req_t req, fuse_ino_t olddir,
reply_err(req, err);
}
+/*
+ * Decide if file IO for this inode can use iomap.
+ *
+ * The upper level libfuse creates internal node ids that have nothing to do
+ * with the inode numbers that the fuse server gives it. These internal node
+ * ids are what actually get igetted in the kernel, which means that there can
+ * be multiple fuse_inode objects in the kernel for a single hardlinked inode
+ * in the fuse server.
+ *
+ * What this means, horrifyingly, is that on a fuse filesystem that supports
+ * hard links, the in-kernel i_rwsem does not protect against concurrent writes
+ * between files that point to the same inode. That in turn means that the
+ * file mode and size can get desynchronized between the multiple fuse_inode
+ * objects. This also means that we cannot cache iomaps in the kernel AT ALL
+ * because the caches will get out of sync, leading to WARN_ONs from the iomap
+ * zeroing code and probably data corruption after that.
+ *
+ * Therefore, libfuse must never create hardlinks of iomap files, and the
+ * predicates below allow fuse servers to decide if they can turn on iomap for
+ * existing hardlinked files.
+ */
+bool fuse_fs_can_enable_iomap(const struct stat *statbuf)
+{
+ struct fuse_context *ctxt = fuse_get_context();
+ struct fuse_session *se = fuse_get_session(ctxt->fuse);
+
+ if (!(se->conn.want_ext & FUSE_CAP_IOMAP))
+ return false;
+
+ return statbuf->st_nlink < 2;
+}
+
+bool fuse_fs_can_enable_iomapx(const struct statx *statxbuf)
+{
+ struct fuse_context *ctxt = fuse_get_context();
+ struct fuse_session *se = fuse_get_session(ctxt->fuse);
+
+ if (!(se->conn.want_ext & FUSE_CAP_IOMAP))
+ return false;
+
+ return statxbuf->stx_nlink < 2;
+}
+
+static bool fuse_lib_can_link(fuse_req_t req, fuse_ino_t ino)
+{
+ struct fuse *f = req_fuse_prepare(req);
+ struct node *node;
+
+ if (!(req->se->conn.want_ext & FUSE_CAP_IOMAP))
+ return true;
+
+ node = get_node(f, ino);
+ return !(node->iflags & FUSE_IFLAG_IOMAP);
+}
+
static void fuse_lib_link(fuse_req_t req, fuse_ino_t ino, fuse_ino_t newparent,
const char *newname)
{
struct fuse *f = req_fuse_prepare(req);
+ struct fuse_intr_data d;
struct fuse_entry_param e;
char *oldpath;
char *newpath;
@@ -3261,17 +3317,33 @@ static void fuse_lib_link(fuse_req_t req, fuse_ino_t ino, fuse_ino_t newparent,
err = get_path2(f, ino, NULL, newparent, newname,
&oldpath, &newpath, NULL, NULL);
- if (!err) {
- struct fuse_intr_data d;
+ if (err)
+ goto out_reply;
- fuse_prepare_interrupt(f, req, &d);
- err = fuse_fs_link(f->fs, oldpath, newpath);
- if (!err)
- err = lookup_path(f, newparent, newname, newpath,
- &e, &iflags, NULL);
- fuse_finish_interrupt(f, req, &d);
- free_path2(f, ino, newparent, NULL, NULL, oldpath, newpath);
+ /*
+ * The upper level fuse library creates a separate node object for
+ * every (i)node referenced by a directory entry. Unfortunately, it
+ * doesn't account for the possibility of hardlinks, which means that
+ * we can create multiple nodeids that refer to the same hardlinked
+ * inode. Inode locking in iomap mode in the kernel relies on there
+ * only being one inode object for a hardlinked file, so we cannot allow
+ * anyone to hardlink an iomap file. The client had better not turn on
+ * iomap for an existing hardlinked file.
+ */
+ if (!fuse_lib_can_link(req, ino)) {
+ err = -EPERM;
+ goto out_path;
}
+
+ fuse_prepare_interrupt(f, req, &d);
+ err = fuse_fs_link(f->fs, oldpath, newpath);
+ if (!err)
+ err = lookup_path(f, newparent, newname, newpath,
+ &e, &iflags, NULL);
+ fuse_finish_interrupt(f, req, &d);
+out_path:
+ free_path2(f, ino, newparent, NULL, NULL, oldpath, newpath);
+out_reply:
reply_entry(req, &e, iflags, err);
}
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index df78723e0f2518..aa16efdd8a9879 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -229,6 +229,8 @@ FUSE_3.99 {
fuse_reply_create_iflags;
fuse_reply_entry_iflags;
fuse_add_direntry_plus_iflags;
+ fuse_fs_can_enable_iomap;
+ fuse_fs_can_enable_iomapx;
} FUSE_3.18;
# Local Variables:
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 14/21] libfuse: allow discovery of the kernel's iomap capabilities
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (12 preceding siblings ...)
2025-08-21 1:04 ` [PATCH 13/21] libfuse: don't allow hardlinking of iomap files in the upper level fuse library Darrick J. Wong
@ 2025-08-21 1:05 ` Darrick J. Wong
2025-08-21 1:05 ` [PATCH 15/21] libfuse: add lower level iomap_config implementation Darrick J. Wong
` (6 subsequent siblings)
20 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:05 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Create a library function so that we can discover the kernel's iomap
capabilities ahead of time.
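
A minimal sketch of the discovery call, assuming the ioctl number and
struct layout that this patch adds to fuse_kernel.h. All of the *_sketch
and SKETCH_* names are hypothetical, as is taking the device path as a
parameter. Unlike the library function, this variant checks the ioctl
return value explicitly and omits O_CLOEXEC to stay portable:

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Local mirror of struct fuse_iomap_support from this patch. */
struct fuse_iomap_support_sketch {
	uint64_t flags;
	uint64_t padding;
};

#define SKETCH_DEV_IOC_MAGIC	229
#define SKETCH_DEV_IOC_IOMAP_SUPPORT \
	_IOR(SKETCH_DEV_IOC_MAGIC, 3, struct fuse_iomap_support_sketch)

/*
 * Hypothetical variant of the discovery helper: returns 0 ("no iomap
 * support") if the device cannot be opened or the ioctl fails.
 */
static uint64_t discover_iomap_sketch(const char *devpath)
{
	struct fuse_iomap_support_sketch ios = { 0 };
	int fd;
	int ret;

	fd = open(devpath, O_RDONLY);
	if (fd < 0)
		return 0;

	ret = ioctl(fd, SKETCH_DEV_IOC_IOMAP_SUPPORT, &ios);
	close(fd);

	return ret < 0 ? 0 : ios.flags;
}
```

On a kernel without this ioctl the call should simply fail, so the caller
sees zero flags and can fall back to non-iomap operation.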
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse_common.h | 7 +++++++
include/fuse_kernel.h | 7 +++++++
include/fuse_lowlevel.h | 5 +++++
lib/fuse_lowlevel.c | 15 +++++++++++++++
lib/fuse_versionscript | 1 +
5 files changed, 35 insertions(+)
diff --git a/include/fuse_common.h b/include/fuse_common.h
index 6e8b2958373258..f9cc3702411680 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -542,6 +542,13 @@ struct fuse_loop_config_v1 {
#define FUSE_IOCTL_MAX_IOV 256
+/**
+ * iomap discovery flags
+ *
+ * FUSE_IOMAP_SUPPORT_FILEIO: basic file I/O functionality through iomap
+ */
+#define FUSE_IOMAP_SUPPORT_FILEIO (1ULL << 0)
+
/**
* Connection information, passed to the ->init() method
*
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index eafad773a1fd5f..dbd2ce1fbbe6ed 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -1134,12 +1134,19 @@ struct fuse_backing_map {
uint64_t padding;
};
+struct fuse_iomap_support {
+ uint64_t flags;
+ uint64_t padding;
+};
+
/* Device ioctls: */
#define FUSE_DEV_IOC_MAGIC 229
#define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
#define FUSE_DEV_IOC_BACKING_OPEN _IOW(FUSE_DEV_IOC_MAGIC, 1, \
struct fuse_backing_map)
#define FUSE_DEV_IOC_BACKING_CLOSE _IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
+#define FUSE_DEV_IOC_IOMAP_SUPPORT _IOR(FUSE_DEV_IOC_MAGIC, 3, \
+ struct fuse_iomap_support)
struct fuse_lseek_in {
uint64_t fh;
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index e0642032127686..2931a57ec4079b 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -2556,6 +2556,11 @@ int fuse_session_receive_buf(struct fuse_session *se, struct fuse_buf *buf);
*/
bool fuse_req_is_uring(fuse_req_t req);
+/**
+ * Discover the kernel's iomap capabilities. Returns FUSE_IOMAP_SUPPORT_* flags.
+ */
+uint64_t fuse_lowlevel_discover_iomap(void);
+
#ifdef __cplusplus
}
#endif
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 6a96c0f62d5884..ab10204c8042d9 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -4603,3 +4603,18 @@ int fuse_session_exited(struct fuse_session *se)
return exited ? 1 : 0;
}
+
+uint64_t fuse_lowlevel_discover_iomap(void)
+{
+ struct fuse_iomap_support ios = { };
+ int fd;
+
+ fd = open("/dev/fuse", O_RDONLY | O_CLOEXEC);
+ if (fd < 0)
+ return 0;
+
+ ioctl(fd, FUSE_DEV_IOC_IOMAP_SUPPORT, &ios);
+ close(fd);
+
+ return ios.flags;
+}
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index aa16efdd8a9879..5275a17ba1ed51 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -231,6 +231,7 @@ FUSE_3.99 {
fuse_add_direntry_plus_iflags;
fuse_fs_can_enable_iomap;
fuse_fs_can_enable_iomapx;
+ fuse_lowlevel_discover_iomap;
} FUSE_3.18;
# Local Variables:
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 15/21] libfuse: add lower level iomap_config implementation
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (13 preceding siblings ...)
2025-08-21 1:05 ` [PATCH 14/21] libfuse: allow discovery of the kernel's iomap capabilities Darrick J. Wong
@ 2025-08-21 1:05 ` Darrick J. Wong
2025-08-21 1:05 ` [PATCH 16/21] libfuse: add upper " Darrick J. Wong
` (5 subsequent siblings)
20 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:05 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Add FUSE_IOMAP_CONFIG helpers to the low level fuse library.
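
The reply helper copies a field only when the matching flag is set, and
clamps the UUID length to the buffer size. A trimmed-down sketch of that
pattern follows; the sketch_* and SKETCH_* names are hypothetical
stand-ins for struct fuse_iomap_config and the FUSE_IOMAP_CONFIG_* flags,
and only two of the fields are shown:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical flag values standing in for FUSE_IOMAP_CONFIG_*. */
#define SKETCH_CONFIG_UUID	(1ULL << 1)
#define SKETCH_CONFIG_BLOCKSIZE	(1ULL << 2)

/* Trimmed-down stand-in for the config structures. */
struct sketch_config {
	uint64_t flags;
	char s_uuid[16];
	uint8_t s_uuid_len;
	uint32_t s_blocksize;
};

/* Copy only the fields that the flags mark as valid. */
static void sketch_fill_reply(struct sketch_config *out,
			      const struct sketch_config *cfg)
{
	memset(out, 0, sizeof(*out));
	out->flags = cfg->flags;

	if (cfg->flags & SKETCH_CONFIG_BLOCKSIZE)
		out->s_blocksize = cfg->s_blocksize;

	if (cfg->flags & SKETCH_CONFIG_UUID) {
		/* Clamp the length so the copy cannot overrun s_uuid. */
		out->s_uuid_len = cfg->s_uuid_len;
		if (out->s_uuid_len > sizeof(out->s_uuid))
			out->s_uuid_len = sizeof(out->s_uuid);
		memcpy(out->s_uuid, cfg->s_uuid, out->s_uuid_len);
	}
}
```

Fields whose flag is clear stay zeroed in the reply, so the kernel never
sees stale values from the server's structure.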
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse_common.h | 31 ++++++++++++++++++
include/fuse_kernel.h | 31 ++++++++++++++++++
include/fuse_lowlevel.h | 27 +++++++++++++++
lib/fuse_lowlevel.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++
lib/fuse_versionscript | 1 +
5 files changed, 172 insertions(+)
diff --git a/include/fuse_common.h b/include/fuse_common.h
index f9cc3702411680..8e585cc7483643 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -1238,6 +1238,37 @@ static inline bool fuse_iomap_need_write_allocate(unsigned int opflags,
/* use iomap for this inode */
#define FUSE_IFLAG_IOMAP (1U << 1)
+/* Which fields are set in fuse_iomap_config_out? */
+#define FUSE_IOMAP_CONFIG_SID (1ULL << 0)
+#define FUSE_IOMAP_CONFIG_UUID (1ULL << 1)
+#define FUSE_IOMAP_CONFIG_BLOCKSIZE (1ULL << 2)
+#define FUSE_IOMAP_CONFIG_MAX_LINKS (1ULL << 3)
+#define FUSE_IOMAP_CONFIG_TIME (1ULL << 4)
+#define FUSE_IOMAP_CONFIG_MAXBYTES (1ULL << 5)
+
+struct fuse_iomap_config {
+ uint64_t flags; /* FUSE_IOMAP_CONFIG_* */
+
+ char s_id[32]; /* Informational name */
+ char s_uuid[16]; /* UUID */
+
+ uint8_t s_uuid_len; /* length of s_uuid */
+
+ uint8_t s_pad[3]; /* must be zeroes */
+
+ uint32_t s_blocksize; /* fs block size */
+ uint32_t s_max_links; /* max hard links */
+
+ /* Granularity of c/m/atime in ns (cannot be worse than a second) */
+ uint32_t s_time_gran;
+
+ /* Time limits for c/m/atime in seconds */
+ int64_t s_time_min;
+ int64_t s_time_max;
+
+ int64_t s_maxbytes; /* max file size */
+};
+
/* ----------------------------------------------------------- *
* Compatibility stuff *
* ----------------------------------------------------------- */
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index dbd2ce1fbbe6ed..46960711691d99 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -240,6 +240,7 @@
* - add FUSE_IOMAP and iomap_{begin,end,ioend} handlers for FIEMAP and
* SEEK_{DATA,HOLE}, buffered I/O, and direct I/O
* - add FUSE_ATTR_IOMAP to enable iomap for specific inodes
+ * - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry
*/
#ifndef _LINUX_FUSE_H
@@ -668,6 +669,7 @@ enum fuse_opcode {
FUSE_TMPFILE = 51,
FUSE_STATX = 52,
+ FUSE_IOMAP_CONFIG = 4092,
FUSE_IOMAP_IOEND = 4093,
FUSE_IOMAP_BEGIN = 4094,
FUSE_IOMAP_END = 4095,
@@ -1358,4 +1360,33 @@ struct fuse_iomap_ioend_in {
uint32_t reserved1; /* zero */
};
+struct fuse_iomap_config_in {
+ uint64_t flags; /* supported FUSE_IOMAP_CONFIG_* flags */
+ int64_t maxbytes; /* max supported file size */
+ uint64_t padding[6]; /* zero */
+};
+
+struct fuse_iomap_config_out {
+ uint64_t flags; /* FUSE_IOMAP_CONFIG_* */
+
+ char s_id[32]; /* Informational name */
+ char s_uuid[16]; /* UUID */
+
+ uint8_t s_uuid_len; /* length of s_uuid */
+
+ uint8_t s_pad[3]; /* must be zeroes */
+
+ uint32_t s_blocksize; /* fs block size */
+ uint32_t s_max_links; /* max hard links */
+
+ /* Granularity of c/m/atime in ns (cannot be worse than a second) */
+ uint32_t s_time_gran;
+
+ /* Time limits for c/m/atime in seconds */
+ int64_t s_time_min;
+ int64_t s_time_max;
+
+ int64_t s_maxbytes; /* max file size */
+};
+
#endif /* _LINUX_FUSE_H */
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index 2931a57ec4079b..1b2a6c00d0f9dc 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -1408,6 +1408,20 @@ struct fuse_lowlevel_ops {
uint64_t attr_ino, off_t pos, size_t written,
uint32_t ioendflags, int error,
uint64_t new_addr);
+
+ /**
+ * Configure the filesystem geometry for iomap mode
+ *
+ * Valid replies:
+ * fuse_reply_iomap_config
+ * fuse_reply_err
+ *
+ * @param req request handle
+ * @param flags FUSE_IOMAP_CONFIG_* flags that can be passed back
+ * @param maxbytes maximum supported file size
+ */
+ void (*iomap_config) (fuse_req_t req, uint64_t flags,
+ uint64_t maxbytes);
};
/**
@@ -1898,6 +1912,19 @@ void fuse_iomap_pure_overwrite(struct fuse_file_iomap *write,
int fuse_reply_iomap_begin(fuse_req_t req, const struct fuse_file_iomap *read,
const struct fuse_file_iomap *write);
+/**
+ * Reply with iomap configuration
+ *
+ * Possible requests:
+ * iomap_config
+ *
+ * @param req request handle
+ * @param cfg iomap configuration
+ * @return zero for success, -errno for failure to send reply
+ */
+int fuse_reply_iomap_config(fuse_req_t req,
+ const struct fuse_iomap_config *cfg);
+
/* ----------------------------------------------------------- *
* Notification *
* ----------------------------------------------------------- */
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index ab10204c8042d9..60627ec35cd367 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -2694,6 +2694,86 @@ static void do_iomap_ioend(fuse_req_t req, const fuse_ino_t nodeid,
_do_iomap_ioend(req, nodeid, inarg, NULL);
}
+#define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER))
+#define offsetofend(TYPE, MEMBER) \
+ (offsetof(TYPE, MEMBER) + sizeof_field(TYPE, MEMBER))
+
+#define FUSE_IOMAP_CONFIG_V1 (FUSE_IOMAP_CONFIG_SID | \
+ FUSE_IOMAP_CONFIG_UUID | \
+ FUSE_IOMAP_CONFIG_BLOCKSIZE | \
+ FUSE_IOMAP_CONFIG_MAX_LINKS | \
+ FUSE_IOMAP_CONFIG_TIME | \
+ FUSE_IOMAP_CONFIG_MAXBYTES)
+
+#define FUSE_IOMAP_CONFIG_ALL (FUSE_IOMAP_CONFIG_V1)
+
+static ssize_t iomap_config_reply_size(const struct fuse_iomap_config *cfg)
+{
+ if (cfg->flags & ~FUSE_IOMAP_CONFIG_ALL)
+ return -EINVAL;
+
+ return offsetofend(struct fuse_iomap_config_out, s_maxbytes);
+}
+
+int fuse_reply_iomap_config(fuse_req_t req, const struct fuse_iomap_config *cfg)
+{
+ struct fuse_iomap_config_out arg = {
+ .flags = cfg->flags,
+ };
+ const ssize_t reply_size = iomap_config_reply_size(cfg);
+
+ if (reply_size < 0)
+ return fuse_reply_err(req, -reply_size);
+
+ if (cfg->flags & FUSE_IOMAP_CONFIG_BLOCKSIZE)
+ arg.s_blocksize = cfg->s_blocksize;
+
+ if (cfg->flags & FUSE_IOMAP_CONFIG_SID)
+ memcpy(arg.s_id, cfg->s_id, sizeof(arg.s_id));
+
+ if (cfg->flags & FUSE_IOMAP_CONFIG_UUID) {
+ arg.s_uuid_len = cfg->s_uuid_len;
+ if (arg.s_uuid_len > sizeof(arg.s_uuid))
+ arg.s_uuid_len = sizeof(arg.s_uuid);
+ memcpy(arg.s_uuid, cfg->s_uuid, arg.s_uuid_len);
+ }
+
+ if (cfg->flags & FUSE_IOMAP_CONFIG_MAX_LINKS)
+ arg.s_max_links = cfg->s_max_links;
+
+ if (cfg->flags & FUSE_IOMAP_CONFIG_TIME) {
+ arg.s_time_gran = cfg->s_time_gran;
+ arg.s_time_min = cfg->s_time_min;
+ arg.s_time_max = cfg->s_time_max;
+ }
+
+ if (cfg->flags & FUSE_IOMAP_CONFIG_MAXBYTES)
+ arg.s_maxbytes = cfg->s_maxbytes;
+
+ return send_reply_ok(req, &arg, reply_size);
+}
+
+static void _do_iomap_config(fuse_req_t req, const fuse_ino_t nodeid,
+ const void *op_in, const void *in_payload)
+{
+ (void)nodeid;
+ (void)in_payload;
+ const struct fuse_iomap_config_in *arg = op_in;
+
+ if (req->se->op.iomap_config)
+ req->se->op.iomap_config(req,
+ arg->flags & FUSE_IOMAP_CONFIG_ALL,
+ arg->maxbytes);
+ else
+ fuse_reply_err(req, ENOSYS);
+}
+
+static void do_iomap_config(fuse_req_t req, const fuse_ino_t nodeid,
+ const void *inarg)
+{
+ _do_iomap_config(req, nodeid, inarg, NULL);
+}
+
static bool want_flags_valid(uint64_t capable, uint64_t want)
{
uint64_t unknown_flags = want & (~capable);
@@ -3579,6 +3659,7 @@ static struct {
[FUSE_COPY_FILE_RANGE] = { do_copy_file_range, "COPY_FILE_RANGE" },
[FUSE_LSEEK] = { do_lseek, "LSEEK" },
[FUSE_STATX] = { do_statx, "STATX" },
+ [FUSE_IOMAP_CONFIG] = { do_iomap_config, "IOMAP_CONFIG" },
[FUSE_IOMAP_BEGIN] = { do_iomap_begin, "IOMAP_BEGIN" },
[FUSE_IOMAP_END] = { do_iomap_end, "IOMAP_END" },
[FUSE_IOMAP_IOEND] = { do_iomap_ioend, "IOMAP_IOEND" },
@@ -3637,6 +3718,7 @@ static struct {
[FUSE_COPY_FILE_RANGE] = { _do_copy_file_range, "COPY_FILE_RANGE" },
[FUSE_LSEEK] = { _do_lseek, "LSEEK" },
[FUSE_STATX] = { _do_statx, "STATX" },
+ [FUSE_IOMAP_CONFIG] = { _do_iomap_config, "IOMAP_CONFIG" },
[FUSE_IOMAP_BEGIN] = { _do_iomap_begin, "IOMAP_BEGIN" },
[FUSE_IOMAP_END] = { _do_iomap_end, "IOMAP_END" },
[FUSE_IOMAP_IOEND] = { _do_iomap_ioend, "IOMAP_IOEND" },
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index 5275a17ba1ed51..f886d268c8a99f 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -232,6 +232,7 @@ FUSE_3.99 {
fuse_fs_can_enable_iomap;
fuse_fs_can_enable_iomapx;
fuse_lowlevel_discover_iomap;
+ fuse_reply_iomap_config;
} FUSE_3.18;
# Local Variables:
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 16/21] libfuse: add upper level iomap_config implementation
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (14 preceding siblings ...)
2025-08-21 1:05 ` [PATCH 15/21] libfuse: add lower level iomap_config implementation Darrick J. Wong
@ 2025-08-21 1:05 ` Darrick J. Wong
2025-08-21 1:05 ` [PATCH 17/21] libfuse: allow root_nodeid mount option Darrick J. Wong
` (4 subsequent siblings)
20 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:05 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Add FUSE_IOMAP_CONFIG helpers to the upper level fuse library.
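
For illustration only, a server's iomap_config handler might look roughly
like the sketch below: it sets a flag only if the kernel advertised it in
supported_flags. The sketch_* and SKETCH_* names, the 4096-byte block
size, and the 1 TiB limit are all hypothetical, and clamping to the
kernel's maxbytes is an assumption about the intended semantics:

```c
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical flag values standing in for FUSE_IOMAP_CONFIG_*. */
#define SKETCH_CONFIG_BLOCKSIZE	(1ULL << 2)
#define SKETCH_CONFIG_MAXBYTES	(1ULL << 5)

/* Trimmed-down stand-in for struct fuse_iomap_config. */
struct sketch_iomap_config {
	uint64_t flags;
	uint32_t s_blocksize;
	int64_t s_maxbytes;
};

/* Fill in only the fields that the kernel says it understands. */
static int sketch_iomap_config(uint64_t supported_flags, off_t maxbytes,
			       struct sketch_iomap_config *cfg)
{
	const int64_t fs_max = 1LL << 40;	/* hypothetical 1 TiB limit */

	if (supported_flags & SKETCH_CONFIG_BLOCKSIZE) {
		cfg->flags |= SKETCH_CONFIG_BLOCKSIZE;
		cfg->s_blocksize = 4096;
	}

	if (supported_flags & SKETCH_CONFIG_MAXBYTES) {
		cfg->flags |= SKETCH_CONFIG_MAXBYTES;
		/* Assumption: never report more than the kernel supports. */
		cfg->s_maxbytes = fs_max < maxbytes ? fs_max : maxbytes;
	}

	return 0;
}
```

The caller is expected to hand in a zeroed structure, as
fuse_lib_iomap_config does with its local cfg.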
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse.h | 7 +++++++
lib/fuse.c | 37 +++++++++++++++++++++++++++++++++++++
2 files changed, 44 insertions(+)
diff --git a/include/fuse.h b/include/fuse.h
index 4c4fff837437c8..74b86e8d27fb35 100644
--- a/include/fuse.h
+++ b/include/fuse.h
@@ -896,6 +896,13 @@ struct fuse_operations {
*/
int (*getattr_iflags) (const char *path, struct stat *buf,
unsigned int *iflags, struct fuse_file_info *fi);
+
+ /**
+ * Configure the filesystem geometry that will be used by iomap
+ * files.
+ */
+ int (*iomap_config) (uint64_t supported_flags, off_t maxbytes,
+ struct fuse_iomap_config *cfg);
};
/** Extra context that may be needed by some filesystems
diff --git a/lib/fuse.c b/lib/fuse.c
index cbf2c5d3a67895..177c524eff736b 100644
--- a/lib/fuse.c
+++ b/lib/fuse.c
@@ -2937,6 +2937,23 @@ static int fuse_fs_iomap_ioend(struct fuse_fs *fs, const char *path,
ioendflags, error, new_addr);
}
+static int fuse_fs_iomap_config(struct fuse_fs *fs, uint64_t flags,
+ uint64_t maxbytes,
+ struct fuse_iomap_config *cfg)
+{
+ fuse_get_context()->private_data = fs->user_data;
+ if (!fs->op.iomap_config)
+ return -ENOSYS;
+
+ if (fs->debug) {
+ fuse_log(FUSE_LOG_DEBUG,
+ "iomap_config flags 0x%llx maxbytes %lld\n",
+ (unsigned long long)flags, (long long)maxbytes);
+ }
+
+ return fs->op.iomap_config(flags, maxbytes, cfg);
+}
+
static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
int valid, struct fuse_file_info *fi)
{
@@ -4791,6 +4808,25 @@ static void fuse_lib_iomap_ioend(fuse_req_t req, fuse_ino_t nodeid,
reply_err(req, err);
}
+static void fuse_lib_iomap_config(fuse_req_t req, uint64_t flags,
+ uint64_t maxbytes)
+{
+ struct fuse_iomap_config cfg = { };
+ struct fuse *f = req_fuse_prepare(req);
+ struct fuse_intr_data d;
+ int err;
+
+ fuse_prepare_interrupt(f, req, &d);
+ err = fuse_fs_iomap_config(f->fs, flags, maxbytes, &cfg);
+ fuse_finish_interrupt(f, req, &d);
+ if (err) {
+ reply_err(req, err);
+ return;
+ }
+
+ fuse_reply_iomap_config(req, &cfg);
+}
+
static int clean_delay(struct fuse *f)
{
/*
@@ -4895,6 +4931,7 @@ static struct fuse_lowlevel_ops fuse_path_ops = {
.iomap_begin = fuse_lib_iomap_begin,
.iomap_end = fuse_lib_iomap_end,
.iomap_ioend = fuse_lib_iomap_ioend,
+ .iomap_config = fuse_lib_iomap_config,
};
int fuse_notify_poll(struct fuse_pollhandle *ph)
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 17/21] libfuse: allow root_nodeid mount option
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (15 preceding siblings ...)
2025-08-21 1:05 ` [PATCH 16/21] libfuse: add upper " Darrick J. Wong
@ 2025-08-21 1:05 ` Darrick J. Wong
2025-08-21 1:06 ` [PATCH 18/21] libfuse: add low level code to invalidate iomap block device ranges Darrick J. Wong
` (3 subsequent siblings)
20 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:05 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Accept the root_nodeid= mount option and pass it through to the kernel
so that fuse servers can configure the root nodeid if they want to.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/mount.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/lib/mount.c b/lib/mount.c
index 2eb967399c9606..140489fa74bb55 100644
--- a/lib/mount.c
+++ b/lib/mount.c
@@ -100,6 +100,7 @@ static const struct fuse_opt fuse_mount_opts[] = {
FUSE_OPT_KEY("defcontext=", KEY_KERN_OPT),
FUSE_OPT_KEY("rootcontext=", KEY_KERN_OPT),
FUSE_OPT_KEY("max_read=", KEY_KERN_OPT),
+ FUSE_OPT_KEY("root_nodeid=", KEY_KERN_OPT),
FUSE_OPT_KEY("user=", KEY_MTAB_OPT),
FUSE_OPT_KEY("-n", KEY_MTAB_OPT),
FUSE_OPT_KEY("-r", KEY_RO),
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 18/21] libfuse: add low level code to invalidate iomap block device ranges
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (16 preceding siblings ...)
2025-08-21 1:05 ` [PATCH 17/21] libfuse: allow root_nodeid mount option Darrick J. Wong
@ 2025-08-21 1:06 ` Darrick J. Wong
2025-08-21 1:06 ` [PATCH 19/21] libfuse: add upper-level API to invalidate parts of an iomap block device Darrick J. Wong
` (2 subsequent siblings)
20 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:06 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Make it easier to invalidate the page cache for a block device that is
being used in conjunction with iomap. This allows a fuse server to kill
all cached data for a block that is being freed, so that block reuse
doesn't result in file corruption.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse_kernel.h | 9 +++++++++
include/fuse_lowlevel.h | 15 +++++++++++++++
lib/fuse_lowlevel.c | 22 ++++++++++++++++++++++
lib/fuse_versionscript | 1 +
4 files changed, 47 insertions(+)
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 46960711691d99..1470b59d742165 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -241,6 +241,7 @@
* SEEK_{DATA,HOLE}, buffered I/O, and direct I/O
* - add FUSE_ATTR_IOMAP to enable iomap for specific inodes
* - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry
+ * - add FUSE_NOTIFY_IOMAP_DEV_INVAL to invalidate iomap bdev ranges
*/
#ifndef _LINUX_FUSE_H
@@ -691,6 +692,7 @@ enum fuse_notify_code {
FUSE_NOTIFY_DELETE = 6,
FUSE_NOTIFY_RESEND = 7,
FUSE_NOTIFY_INC_EPOCH = 8,
+ FUSE_NOTIFY_IOMAP_DEV_INVAL = 9,
FUSE_NOTIFY_CODE_MAX,
};
@@ -1389,4 +1391,11 @@ struct fuse_iomap_config_out {
int64_t s_maxbytes; /* max file size */
};
+struct fuse_iomap_dev_inval {
+ uint32_t dev; /* device cookie */
+ uint32_t reserved; /* zero */
+
+ uint64_t offset; /* range to invalidate pagecache, bytes */
+ uint64_t length;
+};
#endif /* _LINUX_FUSE_H */
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index 1b2a6c00d0f9dc..b7a099bea6921e 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -2158,6 +2158,21 @@ int fuse_lowlevel_iomap_device_add(struct fuse_session *se, int fd,
*/
int fuse_lowlevel_iomap_device_remove(struct fuse_session *se, int device_id);
+/**
+ * Invalidate the page cache of a block device opened for use with iomap.
+ *
+ * Added in FUSE protocol version 7.99. If the kernel does not support
+ * this (or a newer) version, the function will return -ENOSYS and do
+ * nothing.
+ *
+ * @param se the session object
+ * @param dev device cookie returned by fuse_lowlevel_iomap_device_add
+ * @param offset start of the range to invalidate, in bytes
+ * @param length length of the range to invalidate, in bytes
+ */
+int fuse_lowlevel_iomap_device_invalidate(struct fuse_session *se, int dev,
+ off_t offset, off_t length);
+
/* ----------------------------------------------------------- *
* Utility functions *
* ----------------------------------------------------------- */
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 60627ec35cd367..f730a7fd4ead09 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -3480,6 +3480,28 @@ int fuse_lowlevel_notify_store(struct fuse_session *se, fuse_ino_t ino,
return res;
}
+int fuse_lowlevel_iomap_device_invalidate(struct fuse_session *se, int dev,
+ off_t offset, off_t length)
+{
+ struct fuse_iomap_dev_inval arg = {
+ .dev = dev,
+ .offset = offset,
+ .length = length,
+ };
+ struct iovec iov[2];
+
+ if (!se)
+ return -EINVAL;
+
+ if (!(se->conn.want_ext & FUSE_CAP_IOMAP))
+ return -ENOSYS;
+
+ iov[1].iov_base = &arg;
+ iov[1].iov_len = sizeof(arg);
+
+ return send_notify_iov(se, FUSE_NOTIFY_IOMAP_DEV_INVAL, iov, 2);
+}
+
struct fuse_retrieve_req {
struct fuse_notify_req nreq;
void *cookie;
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index f886d268c8a99f..65ce70649b031c 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -233,6 +233,7 @@ FUSE_3.99 {
fuse_fs_can_enable_iomapx;
fuse_lowlevel_discover_iomap;
fuse_reply_iomap_config;
+ fuse_lowlevel_iomap_device_invalidate;
} FUSE_3.18;
# Local Variables:
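
For readers following along: a minimal sketch of how a server might turn a run
of freed blocks into the byte range that FUSE_NOTIFY_IOMAP_DEV_INVAL carries.
The structure is a local copy of the one in the patch; the helper name
freed_blocks_to_inval and the block-size parameter are assumptions made for
illustration and are not part of libfuse.

```c
#include <assert.h>
#include <stdint.h>

/* Local copy of the wire structure from the patch, for illustration only. */
struct fuse_iomap_dev_inval {
	uint32_t dev;      /* device cookie */
	uint32_t reserved; /* zero */
	uint64_t offset;   /* range to invalidate pagecache, bytes */
	uint64_t length;
};

/* The payload has no implicit padding: 4 + 4 + 8 + 8 bytes. */
static_assert(sizeof(struct fuse_iomap_dev_inval) == 24,
	      "unexpected padding in fuse_iomap_dev_inval");

/*
 * Hypothetical helper (not a libfuse function): describe a run of freed
 * filesystem blocks as the byte range that the notification expects.
 */
static struct fuse_iomap_dev_inval
freed_blocks_to_inval(uint32_t dev, uint64_t first_block, uint64_t nr_blocks,
		      uint32_t block_size)
{
	struct fuse_iomap_dev_inval arg = {
		.dev = dev,
		/* reserved is zeroed by the designated initializer */
		.offset = first_block * block_size,
		.length = nr_blocks * block_size,
	};

	return arg;
}
```

A server would then hand arg.dev, arg.offset and arg.length to
fuse_lowlevel_iomap_device_invalidate() before reusing the blocks.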
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 19/21] libfuse: add upper-level API to invalidate parts of an iomap block device
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (17 preceding siblings ...)
2025-08-21 1:06 ` [PATCH 18/21] libfuse: add low level code to invalidate iomap block device ranges Darrick J. Wong
@ 2025-08-21 1:06 ` Darrick J. Wong
2025-08-21 1:06 ` [PATCH 20/21] libfuse: add strictatime/lazytime mount options Darrick J. Wong
2025-08-21 1:06 ` [PATCH 21/21] libfuse: add atomic write support Darrick J. Wong
20 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:06 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Wire up the upper-level wrapper for
fuse_lowlevel_iomap_device_invalidate.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse.h | 10 ++++++++++
lib/fuse.c | 9 +++++++++
lib/fuse_versionscript | 1 +
3 files changed, 20 insertions(+)
diff --git a/include/fuse.h b/include/fuse.h
index 74b86e8d27fb35..e53e92786cea08 100644
--- a/include/fuse.h
+++ b/include/fuse.h
@@ -1422,6 +1422,16 @@ int fuse_fs_iomap_device_add(int fd, unsigned int flags);
*/
int fuse_fs_iomap_device_remove(int device_id);
+/**
+ * Invalidate any pagecache for the given iomap (block) device.
+ *
+ * @param device_id device index as returned by fuse_fs_iomap_device_add
+ * @param offset starting offset of the range to invalidate, in bytes
+ * @param length length of the range to invalidate, in bytes
+ * @return 0 on success, or negative errno on failure
+ */
+int fuse_fs_iomap_device_invalidate(int device_id, off_t offset, off_t length);
+
/**
* Decide if we can enable iomap mode for a particular file for an upper-level
* fuse server.
diff --git a/lib/fuse.c b/lib/fuse.c
index 177c524eff736b..1c813ec5a697a0 100644
--- a/lib/fuse.c
+++ b/lib/fuse.c
@@ -2917,6 +2917,15 @@ int fuse_fs_iomap_device_remove(int device_id)
return fuse_lowlevel_iomap_device_remove(se, device_id);
}
+int fuse_fs_iomap_device_invalidate(int device_id, off_t offset, off_t length)
+{
+ struct fuse_context *ctxt = fuse_get_context();
+ struct fuse_session *se = fuse_get_session(ctxt->fuse);
+
+ return fuse_lowlevel_iomap_device_invalidate(se, device_id, offset,
+ length);
+}
+
static int fuse_fs_iomap_ioend(struct fuse_fs *fs, const char *path,
uint64_t nodeid, uint64_t attr_ino, off_t pos,
size_t written, uint32_t ioendflags, int error,
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index 65ce70649b031c..102f449d28a0be 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -234,6 +234,7 @@ FUSE_3.99 {
fuse_lowlevel_discover_iomap;
fuse_reply_iomap_config;
fuse_lowlevel_iomap_device_invalidate;
+ fuse_fs_iomap_device_invalidate;
} FUSE_3.18;
# Local Variables:
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 20/21] libfuse: add strictatime/lazytime mount options
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (18 preceding siblings ...)
2025-08-21 1:06 ` [PATCH 19/21] libfuse: add upper-level API to invalidate parts of an iomap block device Darrick J. Wong
@ 2025-08-21 1:06 ` Darrick J. Wong
2025-08-21 1:06 ` [PATCH 21/21] libfuse: add atomic write support Darrick J. Wong
20 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:06 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
fuse+iomap leaves the kernel completely in charge of handling
timestamps. Add the lazytime and strictatime mount options so that
fuse+iomap filesystems can take advantage of those options.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/mount.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/lib/mount.c b/lib/mount.c
index 140489fa74bb55..01d473902d50d7 100644
--- a/lib/mount.c
+++ b/lib/mount.c
@@ -117,9 +117,16 @@ static const struct fuse_opt fuse_mount_opts[] = {
FUSE_OPT_KEY("dirsync", KEY_KERN_FLAG),
FUSE_OPT_KEY("noatime", KEY_KERN_FLAG),
FUSE_OPT_KEY("nodiratime", KEY_KERN_FLAG),
- FUSE_OPT_KEY("nostrictatime", KEY_KERN_FLAG),
FUSE_OPT_KEY("symfollow", KEY_KERN_FLAG),
FUSE_OPT_KEY("nosymfollow", KEY_KERN_FLAG),
+#ifdef MS_LAZYTIME
+ FUSE_OPT_KEY("lazytime", KEY_KERN_FLAG),
+ FUSE_OPT_KEY("nolazytime", KEY_KERN_FLAG),
+#endif
+#ifdef MS_STRICTATIME
+ FUSE_OPT_KEY("strictatime", KEY_KERN_FLAG),
+ FUSE_OPT_KEY("nostrictatime", KEY_KERN_FLAG),
+#endif
FUSE_OPT_END
};
@@ -190,11 +197,18 @@ static const struct mount_flags mount_flags[] = {
{"noatime", MS_NOATIME, 1},
{"nodiratime", MS_NODIRATIME, 1},
{"norelatime", MS_RELATIME, 0},
- {"nostrictatime", MS_STRICTATIME, 0},
{"symfollow", MS_NOSYMFOLLOW, 0},
{"nosymfollow", MS_NOSYMFOLLOW, 1},
#ifndef __NetBSD__
{"dirsync", MS_DIRSYNC, 1},
+#endif
+#ifdef MS_LAZYTIME
+ {"lazytime", MS_LAZYTIME, 1},
+ {"nolazytime", MS_LAZYTIME, 0},
+#endif
+#ifdef MS_STRICTATIME
+ {"strictatime", MS_STRICTATIME, 1},
+ {"nostrictatime", MS_STRICTATIME, 0},
#endif
{NULL, 0, 0}
};
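
A sketch of what the third column in the mount_flags table means, assuming the
usual convention that 1 sets the flag and 0 clears it. The EX_MS_* values and
the apply_mount_opt helper are placeholders invented for this example; real
code takes MS_LAZYTIME and MS_STRICTATIME from <sys/mount.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Placeholder bit values for illustration; not the real MS_* constants. */
#define EX_MS_STRICTATIME (1UL << 0)
#define EX_MS_LAZYTIME    (1UL << 1)

struct mount_flag {
	const char *opt;
	unsigned long flag;
	int on; /* 1: the option sets the flag, 0: the option clears it */
};

static const struct mount_flag example_flags[] = {
	{"lazytime",      EX_MS_LAZYTIME,    1},
	{"nolazytime",    EX_MS_LAZYTIME,    0},
	{"strictatime",   EX_MS_STRICTATIME, 1},
	{"nostrictatime", EX_MS_STRICTATIME, 0},
	{NULL, 0, 0}
};

/* Apply one option to *flags; returns 0 if the option is known, else -1. */
static int apply_mount_opt(const char *opt, unsigned long *flags)
{
	const struct mount_flag *mf;

	for (mf = example_flags; mf->opt != NULL; mf++) {
		if (strcmp(mf->opt, opt) != 0)
			continue;
		if (mf->on)
			*flags |= mf->flag;
		else
			*flags &= ~mf->flag;
		return 0;
	}
	return -1;
}
```

With a table like this, "lazytime,nolazytime" leaves the flag cleared because
options are applied in order.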
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 21/21] libfuse: add atomic write support
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
` (19 preceding siblings ...)
2025-08-21 1:06 ` [PATCH 20/21] libfuse: add strictatime/lazytime mount options Darrick J. Wong
@ 2025-08-21 1:06 ` Darrick J. Wong
20 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:06 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Add the single flag that we need to turn on atomic write support in
fuse.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse_common.h | 4 ++++
include/fuse_kernel.h | 3 +++
lib/fuse_lowlevel.c | 2 ++
3 files changed, 9 insertions(+)
diff --git a/include/fuse_common.h b/include/fuse_common.h
index 8e585cc7483643..19770262c4b518 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -548,6 +548,8 @@ struct fuse_loop_config_v1 {
* FUSE_IOMAP_SUPPORT_FILEIO: basic file I/O functionality through iomap
*/
#define FUSE_IOMAP_SUPPORT_FILEIO (1ULL << 0)
+/* untorn writes through iomap */
+#define FUSE_IOMAP_SUPPORT_ATOMIC (1ULL << 1)
/**
* Connection information, passed to the ->init() method
@@ -1237,6 +1239,8 @@ static inline bool fuse_iomap_need_write_allocate(unsigned int opflags,
#define FUSE_IFLAG_DAX (1U << 0)
/* use iomap for this inode */
#define FUSE_IFLAG_IOMAP (1U << 1)
+/* enable untorn writes */
+#define FUSE_IFLAG_ATOMIC (1U << 2)
/* Which fields are set in fuse_iomap_config_out? */
#define FUSE_IOMAP_CONFIG_SID (1 << 0ULL)
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 1470b59d742165..fcf02c9371ba3a 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -242,6 +242,7 @@
* - add FUSE_ATTR_IOMAP to enable iomap for specific inodes
* - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry
* - add FUSE_NOTIFY_IOMAP_DEV_INVAL to invalidate iomap bdev ranges
+ * - add FUSE_ATTR_ATOMIC for single-fsblock atomic write support
*/
#ifndef _LINUX_FUSE_H
@@ -584,10 +585,12 @@ struct fuse_file_lock {
* FUSE_ATTR_SUBMOUNT: Object is a submount root
* FUSE_ATTR_DAX: Enable DAX for this file in per inode DAX mode
* FUSE_ATTR_IOMAP: Use iomap for this inode
+ * FUSE_ATTR_ATOMIC: Enable untorn writes
*/
#define FUSE_ATTR_SUBMOUNT (1 << 0)
#define FUSE_ATTR_DAX (1 << 1)
#define FUSE_ATTR_IOMAP (1 << 2)
+#define FUSE_ATTR_ATOMIC (1 << 3)
/**
* Open flags
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index f730a7fd4ead09..ee73de4a8950be 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -126,6 +126,8 @@ static void convert_stat(const struct stat *stbuf, struct fuse_attr *attr,
attr->flags |= FUSE_ATTR_DAX;
if (iflags & FUSE_IFLAG_IOMAP)
attr->flags |= FUSE_ATTR_IOMAP;
+ if (iflags & FUSE_IFLAG_ATOMIC)
+ attr->flags |= FUSE_ATTR_ATOMIC;
}
static void convert_attr(const struct fuse_setattr_in *attr, struct stat *stbuf)
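
Note that the FUSE_IFLAG_* bits used by the library and the FUSE_ATTR_* bits
sent to the kernel sit at different positions, so they have to be translated
one by one rather than copied. A standalone sketch of that translation, using
the values from this series; iflags_to_attr_flags is a name made up for the
example and only mirrors what convert_stat does in the hunk above.

```c
#include <assert.h>
#include <stdint.h>

/* Library-side inode flags, values as defined in this series. */
#define FUSE_IFLAG_DAX     (1U << 0)
#define FUSE_IFLAG_IOMAP   (1U << 1)
#define FUSE_IFLAG_ATOMIC  (1U << 2)

/* Wire-protocol attribute flags, values as defined in this series. */
#define FUSE_ATTR_SUBMOUNT (1 << 0)
#define FUSE_ATTR_DAX      (1 << 1)
#define FUSE_ATTR_IOMAP    (1 << 2)
#define FUSE_ATTR_ATOMIC   (1 << 3)

/* Hypothetical helper: map library iflags to wire attr flags bit by bit. */
static uint32_t iflags_to_attr_flags(unsigned int iflags)
{
	uint32_t attr_flags = 0;

	if (iflags & FUSE_IFLAG_DAX)
		attr_flags |= FUSE_ATTR_DAX;
	if (iflags & FUSE_IFLAG_IOMAP)
		attr_flags |= FUSE_ATTR_IOMAP;
	if (iflags & FUSE_IFLAG_ATOMIC)
		attr_flags |= FUSE_ATTR_ATOMIC;

	return attr_flags;
}
```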
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 1/2] libfuse: enable iomap cache management for lowlevel fuse
2025-08-21 0:48 ` [PATCHSET RFC v4 3/4] libfuse: cache iomap mappings for even better file IO performance Darrick J. Wong
@ 2025-08-21 1:07 ` Darrick J. Wong
2025-08-21 1:07 ` [PATCH 2/2] libfuse: add upper-level iomap cache management Darrick J. Wong
1 sibling, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:07 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Add the library methods so that fuse servers can manage an in-kernel
iomap cache. This enables better performance on small IOs and is
required if the filesystem needs synchronization between pagecache
writes and writeback.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse_common.h | 12 ++++++++
include/fuse_kernel.h | 26 +++++++++++++++++
include/fuse_lowlevel.h | 41 ++++++++++++++++++++++++++
lib/fuse_lowlevel.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++
lib/fuse_versionscript | 2 +
5 files changed, 154 insertions(+)
diff --git a/include/fuse_common.h b/include/fuse_common.h
index 19770262c4b518..a1c5199f2cb4ee 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -1168,6 +1168,10 @@ int fuse_convert_to_conn_want_ext(struct fuse_conn_info *conn);
/* fuse-specific mapping type indicating that writes use the read mapping */
#define FUSE_IOMAP_TYPE_PURE_OVERWRITE (255)
+/* fuse-specific mapping type saying the server has populated the cache */
+#define FUSE_IOMAP_TYPE_RETRY_CACHE (254)
+/* do not upsert this mapping */
+#define FUSE_IOMAP_TYPE_NOCACHE (253)
#define FUSE_IOMAP_DEV_NULL (0U) /* null device cookie */
@@ -1273,6 +1277,14 @@ struct fuse_iomap_config{
int64_t s_maxbytes; /* max file size */
};
+/* invalidate to end of file */
+#define FUSE_IOMAP_INVAL_TO_EOF (~0ULL)
+
+struct fuse_iomap_inval {
+ uint64_t offset; /* file offset to invalidate, bytes */
+ uint64_t length; /* length to invalidate, bytes */
+};
+
/* ----------------------------------------------------------- *
* Compatibility stuff *
* ----------------------------------------------------------- */
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index fcf02c9371ba3a..0c30aebaf95c32 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -243,6 +243,8 @@
* - add FUSE_IOMAP_CONFIG so the fuse server can configure more fs geometry
* - add FUSE_NOTIFY_IOMAP_DEV_INVAL to invalidate iomap bdev ranges
* - add FUSE_ATTR_ATOMIC for single-fsblock atomic write support
+ * - add FUSE_NOTIFY_IOMAP_UPSERT and FUSE_NOTIFY_IOMAP_INVAL so fuse servers
+ * can cache iomappings in the kernel
*/
#ifndef _LINUX_FUSE_H
@@ -696,6 +698,8 @@ enum fuse_notify_code {
FUSE_NOTIFY_RESEND = 7,
FUSE_NOTIFY_INC_EPOCH = 8,
FUSE_NOTIFY_IOMAP_DEV_INVAL = 9,
+ FUSE_NOTIFY_IOMAP_UPSERT = 10,
+ FUSE_NOTIFY_IOMAP_INVAL = 11,
FUSE_NOTIFY_CODE_MAX,
};
@@ -1401,4 +1405,26 @@ struct fuse_iomap_dev_inval {
uint64_t offset; /* range to invalidate pagecache, bytes */
uint64_t length;
};
+
+struct fuse_iomap_inval_out {
+ uint64_t nodeid; /* Inode ID */
+ uint64_t attr_ino; /* matches fuse_attr:ino */
+
+ uint64_t read_offset; /* range to invalidate read iomaps, bytes */
+ uint64_t read_length; /* can be FUSE_IOMAP_INVAL_TO_EOF */
+
+ uint64_t write_offset; /* range to invalidate write iomaps, bytes */
+ uint64_t write_length; /* can be FUSE_IOMAP_INVAL_TO_EOF */
+};
+
+struct fuse_iomap_upsert_out {
+ uint64_t nodeid; /* Inode ID */
+ uint64_t attr_ino; /* matches fuse_attr:ino */
+
+ /* read file data from here */
+ struct fuse_iomap_io read;
+
+ /* write file data to here, if applicable */
+ struct fuse_iomap_io write;
+};
#endif /* _LINUX_FUSE_H */
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index b7a099bea6921e..326c8f061aecfa 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -2173,6 +2173,47 @@ int fuse_lowlevel_iomap_device_remove(struct fuse_session *se, int device_id);
int fuse_lowlevel_iomap_device_invalidate(struct fuse_session *se, int dev,
off_t offset, off_t length);
+/**
+ * Upsert some file mapping information into the kernel. This is necessary
+ * for filesystems that require coordination of mapping state changes between
+ * buffered writes and writeback, and desirable for better performance
+ * elsewhere.
+ *
+ * Added in FUSE protocol version 7.99. If the kernel does not support
+ * this (or a newer) version, the function will return -ENOSYS and do
+ * nothing.
+ *
+ * @param se the session object
+ * @param nodeid the inode number
+ * @param attr_ino inode number as told by fuse_attr::ino
+ * @param read mapping information for file reads
+ * @param write mapping information for file writes
+ * @return zero for success, -errno for failure
+ */
+int fuse_lowlevel_notify_iomap_upsert(struct fuse_session *se,
+ fuse_ino_t nodeid, uint64_t attr_ino,
+ const struct fuse_file_iomap *read,
+ const struct fuse_file_iomap *write);
+
+/**
+ * Invalidate some file mapping information in the kernel.
+ *
+ * Added in FUSE protocol version 7.99. If the kernel does not support
+ * this (or a newer) version, the function will return -ENOSYS and do
+ * nothing.
+ *
+ * @param se the session object
+ * @param nodeid the inode number
+ * @param attr_ino inode number as told by fuse_attr::ino
+ * @param read read mapping range to invalidate
+ * @param write write mapping range to invalidate
+ * @return zero for success, -errno for failure
+ */
+int fuse_lowlevel_notify_iomap_inval(struct fuse_session *se,
+ fuse_ino_t nodeid, uint64_t attr_ino,
+ const struct fuse_iomap_inval *read,
+ const struct fuse_iomap_inval *write);
+
/* ----------------------------------------------------------- *
* Utility functions *
* ----------------------------------------------------------- */
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index ee73de4a8950be..721abe2686d9c4 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -3504,6 +3504,79 @@ int fuse_lowlevel_iomap_device_invalidate(struct fuse_session *se, int dev,
return send_notify_iov(se, FUSE_NOTIFY_IOMAP_DEV_INVAL, iov, 2);
}
+int fuse_lowlevel_notify_iomap_upsert(struct fuse_session *se,
+ fuse_ino_t nodeid, uint64_t attr_ino,
+ const struct fuse_file_iomap *read,
+ const struct fuse_file_iomap *write)
+{
+ struct fuse_iomap_upsert_out outarg = {
+ .nodeid = nodeid,
+ .attr_ino = attr_ino,
+ .read = {
+ .type = FUSE_IOMAP_TYPE_NOCACHE,
+ },
+ .write = {
+ .type = FUSE_IOMAP_TYPE_NOCACHE,
+ }
+ };
+ struct iovec iov[2];
+
+ if (!se)
+ return -EINVAL;
+
+ if (se->conn.proto_minor < 99)
+ return -ENOSYS;
+
+ if (!read && !write)
+ return 0;
+
+ if (read)
+ fuse_iomap_to_kernel(&outarg.read, read);
+
+ if (write)
+ fuse_iomap_to_kernel(&outarg.write, write);
+
+ iov[1].iov_base = &outarg;
+ iov[1].iov_len = sizeof(outarg);
+
+ return send_notify_iov(se, FUSE_NOTIFY_IOMAP_UPSERT, iov, 2);
+}
+
+int fuse_lowlevel_notify_iomap_inval(struct fuse_session *se,
+ fuse_ino_t nodeid, uint64_t attr_ino,
+ const struct fuse_iomap_inval *read,
+ const struct fuse_iomap_inval *write)
+{
+ struct fuse_iomap_inval_out outarg = {
+ .nodeid = nodeid,
+ .attr_ino = attr_ino,
+ };
+ struct iovec iov[2];
+
+ if (!se)
+ return -EINVAL;
+
+ if (se->conn.proto_minor < 99)
+ return -ENOSYS;
+
+ if (!read && !write)
+ return 0;
+
+ if (read) {
+ outarg.read_offset = read->offset;
+ outarg.read_length = read->length;
+ }
+ if (write) {
+ outarg.write_offset = write->offset;
+ outarg.write_length = write->length;
+ }
+
+ iov[1].iov_base = &outarg;
+ iov[1].iov_len = sizeof(outarg);
+
+ return send_notify_iov(se, FUSE_NOTIFY_IOMAP_INVAL, iov, 2);
+}
+
struct fuse_retrieve_req {
struct fuse_notify_req nreq;
void *cookie;
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index 102f449d28a0be..a83966b9e48018 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -235,6 +235,8 @@ FUSE_3.99 {
fuse_reply_iomap_config;
fuse_lowlevel_iomap_device_invalidate;
fuse_fs_iomap_device_invalidate;
+ fuse_lowlevel_notify_iomap_upsert;
+ fuse_lowlevel_notify_iomap_inval;
} FUSE_3.18;
# Local Variables:
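
A rough sketch of how the invalidation payload is assembled from the optional
read and write ranges, written against local copies of the structures in this
patch. The helper build_inval is hypothetical; it follows the logic of
fuse_lowlevel_notify_iomap_inval but leaves out the session checks and the
actual send. How the kernel treats a zero-length side is not spelled out in
this thread, so the sketch only shows that an omitted side stays zeroed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* invalidate to end of file */
#define FUSE_IOMAP_INVAL_TO_EOF (~0ULL)

/* Local copies of the structures from the patch, for illustration only. */
struct fuse_iomap_inval {
	uint64_t offset; /* file offset to invalidate, bytes */
	uint64_t length; /* length to invalidate, bytes */
};

struct fuse_iomap_inval_out {
	uint64_t nodeid;
	uint64_t attr_ino;
	uint64_t read_offset;
	uint64_t read_length;
	uint64_t write_offset;
	uint64_t write_length;
};

/*
 * Hypothetical helper: fill *out from the optional ranges.
 * Returns 1 if there is something to send, 0 if both ranges are NULL.
 */
static int build_inval(struct fuse_iomap_inval_out *out, uint64_t nodeid,
		       uint64_t attr_ino,
		       const struct fuse_iomap_inval *read,
		       const struct fuse_iomap_inval *write)
{
	*out = (struct fuse_iomap_inval_out){
		.nodeid = nodeid,
		.attr_ino = attr_ino,
	};

	if (!read && !write)
		return 0;

	if (read) {
		out->read_offset = read->offset;
		out->read_length = read->length;
	}
	if (write) {
		out->write_offset = write->offset;
		out->write_length = write->length;
	}
	return 1;
}
```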
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 2/2] libfuse: add upper-level iomap cache management
2025-08-21 0:48 ` [PATCHSET RFC v4 3/4] libfuse: cache iomap mappings for even better file IO performance Darrick J. Wong
2025-08-21 1:07 ` [PATCH 1/2] libfuse: enable iomap cache management for lowlevel fuse Darrick J. Wong
@ 2025-08-21 1:07 ` Darrick J. Wong
1 sibling, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:07 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Make it so that upper-level fuse servers can use the iomap cache too.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse.h | 31 +++++++++++++++++++++++++++++++
lib/fuse.c | 30 ++++++++++++++++++++++++++++++
lib/fuse_versionscript | 2 ++
3 files changed, 63 insertions(+)
diff --git a/include/fuse.h b/include/fuse.h
index e53e92786cea08..f8a57154017a2a 100644
--- a/include/fuse.h
+++ b/include/fuse.h
@@ -1450,6 +1450,37 @@ bool fuse_fs_can_enable_iomap(const struct stat *statbuf);
*/
bool fuse_fs_can_enable_iomapx(const struct statx *statxbuf);
+/**
+ * Upsert some file mapping information into the kernel. This is necessary
+ * for filesystems that require coordination of mapping state changes between
+ * buffered writes and writeback, and desirable for better performance
+ * elsewhere.
+ *
+ * @param nodeid the inode number
+ * @param attr_ino inode number as told by fuse_attr::ino
+ * @param read mapping information for file reads
+ * @param write mapping information for file writes
+ * @return zero for success, -errno for failure
+ */
+int fuse_fs_iomap_upsert(uint64_t nodeid, uint64_t attr_ino,
+ const struct fuse_file_iomap *read,
+ const struct fuse_file_iomap *write);
+
+/**
+ * Invalidate some file mapping information in the kernel.
+ *
+ * @param nodeid the inode number
+ * @param attr_ino inode number as told by fuse_attr::ino
+ * @param read_off start of the range of read mappings to invalidate
+ * @param read_len length of the range of read mappings to invalidate
+ * @param write_off start of the range of write mappings to invalidate
+ * @param write_len length of the range of write mappings to invalidate
+ * @return zero for success, -errno for failure
+ */
+int fuse_fs_iomap_inval(uint64_t nodeid, uint64_t attr_ino, loff_t read_off,
+ uint64_t read_len, loff_t write_off,
+ uint64_t write_len);
+
int fuse_notify_poll(struct fuse_pollhandle *ph);
/**
diff --git a/lib/fuse.c b/lib/fuse.c
index 1c813ec5a697a0..7b28f848116abb 100644
--- a/lib/fuse.c
+++ b/lib/fuse.c
@@ -2963,6 +2963,36 @@ static int fuse_fs_iomap_config(struct fuse_fs *fs, uint64_t flags,
return fs->op.iomap_config(flags, maxbytes, cfg);
}
+int fuse_fs_iomap_upsert(uint64_t nodeid, uint64_t attr_ino,
+ const struct fuse_file_iomap *read,
+ const struct fuse_file_iomap *write)
+{
+ struct fuse_context *ctxt = fuse_get_context();
+ struct fuse_session *se = fuse_get_session(ctxt->fuse);
+
+ return fuse_lowlevel_notify_iomap_upsert(se, nodeid, attr_ino,
+ read, write);
+}
+
+int fuse_fs_iomap_inval(uint64_t nodeid, uint64_t attr_ino, loff_t read_off,
+ uint64_t read_len, loff_t write_off,
+ uint64_t write_len)
+{
+ struct fuse_context *ctxt = fuse_get_context();
+ struct fuse_session *se = fuse_get_session(ctxt->fuse);
+ struct fuse_iomap_inval read = {
+ .offset = read_off,
+ .length = read_len,
+ };
+ struct fuse_iomap_inval write = {
+ .offset = write_off,
+ .length = write_len,
+ };
+
+ return fuse_lowlevel_notify_iomap_inval(se, nodeid, attr_ino, &read,
+ &write);
+}
+
static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
int valid, struct fuse_file_info *fi)
{
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index a83966b9e48018..9a4baed32bc477 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -237,6 +237,8 @@ FUSE_3.99 {
fuse_fs_iomap_device_invalidate;
fuse_lowlevel_notify_iomap_upsert;
fuse_lowlevel_notify_iomap_inval;
+ fuse_fs_iomap_upsert;
+ fuse_fs_iomap_inval;
} FUSE_3.18;
# Local Variables:
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 1/2] libfuse: wire up FUSE_SYNCFS to the low level library
2025-08-21 0:49 ` [PATCHSET RFC v4 4/4] libfuse: implement syncfs Darrick J. Wong
@ 2025-08-21 1:07 ` Darrick J. Wong
2025-08-21 1:07 ` [PATCH 2/2] libfuse: add syncfs support to the upper library Darrick J. Wong
2025-08-21 21:41 ` [PATCHSET RFC v4 4/4] libfuse: implement syncfs Bernd Schubert
2 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:07 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Create hooks in the lowlevel library for syncfs.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse_lowlevel.h | 16 ++++++++++++++++
lib/fuse_lowlevel.c | 19 +++++++++++++++++++
2 files changed, 35 insertions(+)
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index 326c8f061aecfa..90a09b066c71f0 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -1422,6 +1422,22 @@ struct fuse_lowlevel_ops {
*/
void (*iomap_config) (fuse_req_t req, uint64_t flags,
uint64_t maxbytes);
+
+ /**
+ * Flush the entire filesystem to disk.
+ *
+ * If this request is answered with an error code of ENOSYS, this is
+ * treated as a permanent failure, i.e. all future syncfs() requests
+ * will fail with the same error code without being sent to the
+ * filesystem process.
+ *
+ * Valid replies:
+ * fuse_reply_err
+ *
+ * @param req request handle
+ * @param ino the inode number
+ */
+ void (*syncfs) (fuse_req_t req, fuse_ino_t ino);
};
/**
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 721abe2686d9c4..e5c7c4487cef8c 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -2776,6 +2776,23 @@ static void do_iomap_config(fuse_req_t req, const fuse_ino_t nodeid,
_do_iomap_config(req, nodeid, inarg, NULL);
}
+static void _do_syncfs(fuse_req_t req, const fuse_ino_t nodeid,
+ const void *op_in, const void *in_payload)
+{
+ (void)op_in;
+ (void)in_payload;
+
+ if (req->se->op.syncfs)
+ req->se->op.syncfs(req, nodeid);
+ else
+ fuse_reply_err(req, ENOSYS);
+}
+
+static void do_syncfs(fuse_req_t req, const fuse_ino_t nodeid, const void *inarg)
+{
+ _do_syncfs(req, nodeid, inarg, NULL);
+}
+
static bool want_flags_valid(uint64_t capable, uint64_t want)
{
uint64_t unknown_flags = want & (~capable);
@@ -3756,6 +3773,7 @@ static struct {
[FUSE_COPY_FILE_RANGE] = { do_copy_file_range, "COPY_FILE_RANGE" },
[FUSE_LSEEK] = { do_lseek, "LSEEK" },
[FUSE_STATX] = { do_statx, "STATX" },
+ [FUSE_SYNCFS] = { do_syncfs, "SYNCFS" },
[FUSE_IOMAP_CONFIG]= { do_iomap_config, "IOMAP_CONFIG" },
[FUSE_IOMAP_BEGIN] = { do_iomap_begin, "IOMAP_BEGIN" },
[FUSE_IOMAP_END] = { do_iomap_end, "IOMAP_END" },
@@ -3815,6 +3833,7 @@ static struct {
[FUSE_COPY_FILE_RANGE] = { _do_copy_file_range, "COPY_FILE_RANGE" },
[FUSE_LSEEK] = { _do_lseek, "LSEEK" },
[FUSE_STATX] = { _do_statx, "STATX" },
+ [FUSE_SYNCFS] = { _do_syncfs, "SYNCFS" },
[FUSE_IOMAP_CONFIG] = { _do_iomap_config, "IOMAP_CONFIG" },
[FUSE_IOMAP_BEGIN] = { _do_iomap_begin, "IOMAP_BEGIN" },
[FUSE_IOMAP_END] = { _do_iomap_end, "IOMAP_END" },
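
The dispatch rule in _do_syncfs (call the handler if the server registered
one, otherwise reply ENOSYS) can be shown in isolation. This is a simplified
model, not libfuse code: struct example_ops, dispatch_syncfs and
example_syncfs are invented for the example, and the handler signature is
reduced to a single private pointer instead of (fuse_req_t, fuse_ino_t).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the ops table: only the syncfs hook. */
struct example_ops {
	int (*syncfs)(void *fs_private);
};

/* Returns the errno value the dispatcher would reply with (0 on success). */
static int dispatch_syncfs(const struct example_ops *ops, void *fs_private)
{
	if (!ops->syncfs)
		return ENOSYS; /* no handler registered by the server */
	return ops->syncfs(fs_private);
}

/* Example handler: count how many times a flush was requested. */
static int example_syncfs(void *fs_private)
{
	int *flushes = fs_private;

	(*flushes)++;
	return 0;
}
```

Per the header comment in this patch, an ENOSYS reply is treated as permanent,
so a server that can flush at all should register the handler from the start.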
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 2/2] libfuse: add syncfs support to the upper library
2025-08-21 0:49 ` [PATCHSET RFC v4 4/4] libfuse: implement syncfs Darrick J. Wong
2025-08-21 1:07 ` [PATCH 1/2] libfuse: wire up FUSE_SYNCFS to the low level library Darrick J. Wong
@ 2025-08-21 1:07 ` Darrick J. Wong
2025-08-21 21:41 ` [PATCHSET RFC v4 4/4] libfuse: implement syncfs Bernd Schubert
2 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:07 UTC (permalink / raw)
To: djwong, bschubert; +Cc: John, joannelkoong, bernd, linux-fsdevel, miklos, neal
From: Darrick J. Wong <djwong@kernel.org>
Support syncfs in the upper level library.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/fuse.h | 5 +++++
lib/fuse.c | 31 +++++++++++++++++++++++++++++++
2 files changed, 36 insertions(+)
diff --git a/include/fuse.h b/include/fuse.h
index f8a57154017a2a..baf7a2e90af5e7 100644
--- a/include/fuse.h
+++ b/include/fuse.h
@@ -903,6 +903,11 @@ struct fuse_operations {
*/
int (*iomap_config) (uint64_t supported_flags, off_t maxbytes,
struct fuse_iomap_config *cfg);
+
+ /**
+ * Flush the entire filesystem to disk.
+ */
+ int (*syncfs) (const char *path);
};
/** Extra context that may be needed by some filesystems
diff --git a/lib/fuse.c b/lib/fuse.c
index 7b28f848116abb..4e207491532e8b 100644
--- a/lib/fuse.c
+++ b/lib/fuse.c
@@ -2993,6 +2993,16 @@ int fuse_fs_iomap_inval(uint64_t nodeid, uint64_t attr_ino, loff_t read_off,
&write);
}
+static int fuse_fs_syncfs(struct fuse_fs *fs, const char *path)
+{
+ fuse_get_context()->private_data = fs->user_data;
+ if (!fs->op.syncfs)
+ return -ENOSYS;
+ if (fs->debug)
+ fuse_log(FUSE_LOG_DEBUG, "syncfs[%s]\n", path);
+ return fs->op.syncfs(path);
+}
+
static void fuse_lib_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
int valid, struct fuse_file_info *fi)
{
@@ -4866,6 +4876,26 @@ static void fuse_lib_iomap_config(fuse_req_t req, uint64_t flags,
fuse_reply_iomap_config(req, &cfg);
}
+static void fuse_lib_syncfs(fuse_req_t req, fuse_ino_t ino)
+{
+ struct fuse *f = req_fuse_prepare(req);
+ struct fuse_intr_data d;
+ char *path;
+ int err;
+
+ err = get_path(f, ino, &path);
+ if (err) {
+ reply_err(req, err);
+ return;
+ }
+
+ fuse_prepare_interrupt(f, req, &d);
+ err = fuse_fs_syncfs(f->fs, path);
+ fuse_finish_interrupt(f, req, &d);
+ free_path(f, ino, path);
+ reply_err(req, err);
+}
+
static int clean_delay(struct fuse *f)
{
/*
@@ -4967,6 +4997,7 @@ static struct fuse_lowlevel_ops fuse_path_ops = {
#ifdef HAVE_STATX
.statx = fuse_lib_statx,
#endif
+ .syncfs = fuse_lib_syncfs,
.iomap_begin = fuse_lib_iomap_begin,
.iomap_end = fuse_lib_iomap_end,
.iomap_ioend = fuse_lib_iomap_ioend,
* [PATCH 01/20] fuse2fs: port fuse2fs to lowlevel libfuse API
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
@ 2025-08-21 1:08 ` Darrick J. Wong
2025-08-21 1:08 ` [PATCH 02/20] fuse4fs: drop fuse 2.x support code Darrick J. Wong
` (18 subsequent siblings)
19 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:08 UTC (permalink / raw)
To: tytso
Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, amir73il,
joannelkoong, neal
From: Darrick J. Wong <djwong@kernel.org>
Copy fuse2fs.c to fuse4fs.c. This will become our testbed for trying
out lowlevel fuse server support in the next few patches.
Namespacing conversions performed via:
sed -e 's/fuse2fs/fuse4fs/g' -e 's/FUSE2FS/FUSE4FS/g' -e 's/F2OP_/F4OP_/g' -e 's/FUSE server/FUSE low-level server/g'
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
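Not part of the patch: update_atime() in the new misc/fuse4fs.c skips the inode write when atime is already at least as new as mtime and lies within thirty seconds of the current time. The standalone sketch below restates that comparison so it can be tested on its own. The demo_* names are hypothetical; the thirty-second window and the use of doubles are taken from the function in the diff.

```c
#include <stdbool.h>
#include <time.h>

/* Convert a timespec to fractional seconds, as update_atime() does. */
static double demo_seconds(const struct timespec *ts)
{
	return ts->tv_sec + ((double)ts->tv_nsec / 1000000000);
}

/*
 * Return true when the atime write can be skipped: atime is not older
 * than mtime, and atime was refreshed within the last thirty seconds.
 */
static bool demo_skip_atime_update(const struct timespec *atime,
				   const struct timespec *mtime,
				   const struct timespec *now)
{
	double datime = demo_seconds(atime);
	double dmtime = demo_seconds(mtime);
	double dnow = demo_seconds(now);

	return datime >= dmtime && datime >= dnow - 30;
}
```

With this predicate, a stale atime (older than thirty seconds) or an atime that predates mtime both result in an update being written.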
configure | 50
configure.ac | 31
lib/config.h.in | 3
misc/Makefile.in | 22
misc/fuse4fs.c | 5607 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 5712 insertions(+), 1 deletion(-)
create mode 100644 misc/fuse4fs.c
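Not part of the patch: the timestamp helpers carried over into misc/fuse4fs.c pack two "epoch" bits and a 30-bit nanosecond count into one 32-bit extra field. The sketch below is a standalone restatement of that packing, assuming a 64-bit seconds value and a compiler that truncates on narrowing conversions (as gcc and clang do). The demo_* names are hypothetical; the bit layout follows EXT4_EPOCH_BITS and EXT4_NSEC_MASK in the diff.

```c
#include <stdint.h>

#define DEMO_EPOCH_BITS 2
#define DEMO_EPOCH_MASK ((1U << DEMO_EPOCH_BITS) - 1)
#define DEMO_NSEC_MASK  (~0U << DEMO_EPOCH_BITS)

/*
 * Pack the epoch bits (how many multiples of 2^32 the seconds value lies
 * beyond its signed 32-bit truncation) together with the nanoseconds.
 */
static uint32_t demo_encode_extra(int64_t sec, uint32_t nsec)
{
	uint32_t epoch = (uint32_t)(((sec - (int32_t)sec) >> 32) &
				    DEMO_EPOCH_MASK);

	return epoch | (nsec << DEMO_EPOCH_BITS);
}

/* Rebuild seconds and nanoseconds from the low 32 bits plus the extra field. */
static void demo_decode_extra(uint32_t lo, uint32_t extra,
			      int64_t *sec, uint32_t *nsec)
{
	*sec = (int32_t)lo + ((int64_t)(extra & DEMO_EPOCH_MASK) << 32);
	*nsec = (extra & DEMO_NSEC_MASK) >> DEMO_EPOCH_BITS;
}
```

The low 32 bits of the seconds value are stored in the classic inode field, and the extra field supplies the bits needed to reach past the signed 32-bit range without changing how pre-1970 timestamps decode.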
diff --git a/configure b/configure
index 71750b1a8ee972..8afc53f89f2bf4 100755
--- a/configure
+++ b/configure
@@ -701,6 +701,7 @@ gcc_ranlib
gcc_ar
UNI_DIFF_OPTS
SEM_INIT_LIB
+FUSE4_CMT
FUSE_CMT
FUSE_LIB
fuse3_LIBS
@@ -14719,6 +14720,55 @@ elif test -n "$FUSE_LIB"
then
FUSE_USE_VERSION=29
fi
+
+FUSE4_CMT=
+if test "$FUSE_USE_VERSION" -ge 30
+then
+{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: checking for lowlevel interface in libfuse" >&5
+printf %s "checking for lowlevel interface in libfuse... " >&6; }
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+#define _GNU_SOURCE
+#define _FILE_OFFSET_BITS 64
+#define FUSE_USE_VERSION 30
+#include <fuse_lowlevel.h>
+
+int
+main (void)
+{
+
+struct fuse_lowlevel_ops fs_ops = {
+ .init = NULL,
+ .destroy = NULL,
+};
+
+ ;
+ return 0;
+}
+
+_ACEOF
+if ac_fn_c_try_link "$LINENO"
+then :
+ have_fuse_lowlevel=yes
+ { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+printf "%s\n" "yes" >&6; }
+else $as_nop
+ { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: no" >&5
+printf "%s\n" "no" >&6; }
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.beam \
+ conftest$ac_exeext conftest.$ac_ext
+if test "$have_fuse_lowlevel" = yes; then
+
+printf "%s\n" "#define HAVE_FUSE_LOWLEVEL 1" >>confdefs.h
+
+else
+ FUSE4_CMT="#"
+fi
+fi
+
+
if test -n "$FUSE_USE_VERSION"
then
diff --git a/configure.ac b/configure.ac
index 0591999b52b019..37dbfa0be4d7fc 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1447,6 +1447,37 @@ elif test -n "$FUSE_LIB"
then
FUSE_USE_VERSION=29
fi
+
+FUSE4_CMT=
+if test "$FUSE_USE_VERSION" -ge 30
+then
+dnl
+dnl see if fuse3 supports lowlevel interface
+dnl
+AC_MSG_CHECKING(for lowlevel interface in libfuse)
+AC_LINK_IFELSE(
+[ AC_LANG_PROGRAM([[
+#define _GNU_SOURCE
+#define _FILE_OFFSET_BITS 64
+#define FUSE_USE_VERSION 30
+#include <fuse_lowlevel.h>
+ ]], [[
+struct fuse_lowlevel_ops fs_ops = {
+ .init = NULL,
+ .destroy = NULL,
+};
+ ]])
+], have_fuse_lowlevel=yes
+ AC_MSG_RESULT(yes),
+ AC_MSG_RESULT(no))
+if test "$have_fuse_lowlevel" = yes; then
+ AC_DEFINE(HAVE_FUSE_LOWLEVEL, 1, [Define to 1 if fuse supports lowlevel API])
+else
+ FUSE4_CMT="#"
+fi
+fi
+AC_SUBST(FUSE4_CMT)
+
if test -n "$FUSE_USE_VERSION"
then
AC_DEFINE_UNQUOTED(FUSE_USE_VERSION, $FUSE_USE_VERSION,
diff --git a/lib/config.h.in b/lib/config.h.in
index a4d8ce1c3765ed..c3379758c3c9bc 100644
--- a/lib/config.h.in
+++ b/lib/config.h.in
@@ -73,6 +73,9 @@
/* Define to 1 if PR_SET_IO_FLUSHER is present */
#undef HAVE_PR_SET_IO_FLUSHER
+/* Define to 1 if fuse supports lowlevel API */
+#undef HAVE_FUSE_LOWLEVEL
+
/* Define to 1 if you have the Mac OS X function
CFLocaleCopyPreferredLanguages in the CoreFoundation framework. */
#undef HAVE_CFLOCALECOPYPREFERREDLANGUAGES
diff --git a/misc/Makefile.in b/misc/Makefile.in
index 0e3bed66dcb63d..7c6b33cb864204 100644
--- a/misc/Makefile.in
+++ b/misc/Makefile.in
@@ -35,6 +35,7 @@ MKDIR_P = @MKDIR_P@
@BLKID_CMT@FINDFS_MAN= findfs.8
@FUSE_CMT@FUSE_PROG= fuse2fs
+@FUSE4_CMT@FUSE_PROG+=fuse4fs
SPROGS= mke2fs badblocks tune2fs dumpe2fs $(BLKID_PROG) logsave \
$(E2IMAGE_PROG) @FSCK_PROG@ e2undo
@@ -73,6 +74,7 @@ E4CRYPT_OBJS= e4crypt.o
E2FREEFRAG_OBJS= e2freefrag.o
E2FUZZ_OBJS= e2fuzz.o
FUSE2FS_OBJS= fuse2fs.o journal.o recovery.o revoke.o
+FUSE4FS_OBJS= fuse4fs.o journal.o recovery.o revoke.o
PROFILED_TUNE2FS_OBJS= profiled/tune2fs.o profiled/util.o profiled/journal.o \
profiled/recovery.o profiled/revoke.o
@@ -99,6 +101,8 @@ PROFILED_E4DEFRAG_OBJS= profiled/e4defrag.o
PROFILED_E4CRYPT_OBJS= profiled/e4crypt.o
PROFILED_FUSE2FS_OJBS= profiled/fuse2fs.o profiled/journal.o \
profiled/recovery.o profiled/revoke.o
+PROFILED_FUSE4FS_OJBS= profiled/fuse4fs.o profiled/journal.o \
+ profiled/recovery.o profiled/revoke.o
SRCS= $(srcdir)/tune2fs.c $(srcdir)/mklost+found.c $(srcdir)/mke2fs.c $(srcdir)/mk_hugefiles.c \
$(srcdir)/chattr.c $(srcdir)/lsattr.c $(srcdir)/dumpe2fs.c \
@@ -108,7 +112,7 @@ SRCS= $(srcdir)/tune2fs.c $(srcdir)/mklost+found.c $(srcdir)/mke2fs.c $(srcdir)/
$(srcdir)/ismounted.c $(srcdir)/e2undo.c \
$(srcdir)/e2freefrag.c $(srcdir)/create_inode.c \
$(srcdir)/create_inode_libarchive.c \
- $(srcdir)/fuse2fs.c $(srcdir)/e2fuzz.c \
+ $(srcdir)/fuse2fs.c $(srcdir)/fuse4fs.c $(srcdir)/e2fuzz.c \
$(srcdir)/check_fuzzer.c \
$(srcdir)/../debugfs/journal.c $(srcdir)/../e2fsck/revoke.c \
$(srcdir)/../e2fsck/recovery.c
@@ -429,6 +433,13 @@ fuse2fs: $(FUSE2FS_OBJS) $(DEPLIBS) $(DEPLIBBLKID) $(DEPLIBUUID) \
$(LIBFUSE) $(LIBBLKID) $(LIBUUID) $(LIBEXT2FS) $(LIBINTL) \
$(CLOCK_GETTIME_LIB) $(SYSLIBS) $(LIBS_E2P)
+fuse4fs: $(FUSE4FS_OBJS) $(DEPLIBS) $(DEPLIBBLKID) $(DEPLIBUUID) \
+ $(LIBEXT2FS) $(DEPLIBS_E2P)
+ $(E) " LD $@"
+ $(Q) $(CC) $(ALL_LDFLAGS) -o fuse4fs $(FUSE4FS_OBJS) $(LIBS) \
+ $(LIBFUSE) $(LIBBLKID) $(LIBUUID) $(LIBEXT2FS) $(LIBINTL) \
+ $(CLOCK_GETTIME_LIB) $(SYSLIBS) $(LIBS_E2P)
+
journal.o: $(srcdir)/../debugfs/journal.c
$(E) " CC $<"
$(Q) $(CC) -c $(JOURNAL_CFLAGS) -I$(srcdir) \
@@ -881,6 +892,15 @@ fuse2fs.o: $(srcdir)/fuse2fs.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/ext2fs/bitops.h $(top_srcdir)/lib/ext2fs/ext2fsP.h \
$(top_srcdir)/lib/ext2fs/ext2fs.h $(top_srcdir)/version.h \
$(top_srcdir)/lib/e2p/e2p.h
+fuse4fs.o: $(srcdir)/fuse4fs.c $(top_builddir)/lib/config.h \
+ $(top_builddir)/lib/dirpaths.h $(top_srcdir)/lib/ext2fs/ext2fs.h \
+ $(top_builddir)/lib/ext2fs/ext2_types.h $(top_srcdir)/lib/ext2fs/ext2_fs.h \
+ $(top_srcdir)/lib/ext2fs/ext3_extents.h $(top_srcdir)/lib/et/com_err.h \
+ $(top_srcdir)/lib/ext2fs/ext2_io.h $(top_builddir)/lib/ext2fs/ext2_err.h \
+ $(top_srcdir)/lib/ext2fs/ext2_ext_attr.h $(top_srcdir)/lib/ext2fs/hashmap.h \
+ $(top_srcdir)/lib/ext2fs/bitops.h $(top_srcdir)/lib/ext2fs/ext2fsP.h \
+ $(top_srcdir)/lib/ext2fs/ext2fs.h $(top_srcdir)/version.h \
+ $(top_srcdir)/lib/e2p/e2p.h
e2fuzz.o: $(srcdir)/e2fuzz.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(top_srcdir)/lib/ext2fs/ext2_fs.h \
$(top_builddir)/lib/ext2fs/ext2_types.h $(top_srcdir)/lib/ext2fs/ext2fs.h \
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
new file mode 100644
index 00000000000000..1b8240e56562d6
--- /dev/null
+++ b/misc/fuse4fs.c
@@ -0,0 +1,5607 @@
+/*
+ * fuse4fs.c - FUSE low-level server for e2fsprogs.
+ *
+ * Copyright (C) 2014-2025 Oracle.
+ *
+ * %Begin-Header%
+ * This file may be redistributed under the terms of the GNU Public
+ * License.
+ * %End-Header%
+ */
+#ifndef _GNU_SOURCE
+#define _GNU_SOURCE
+#endif
+#include "config.h"
+#include <pthread.h>
+#ifdef __linux__
+# include <linux/fs.h>
+# include <linux/falloc.h>
+# include <linux/xattr.h>
+# include <sys/prctl.h>
+#endif
+#ifdef HAVE_SYS_XATTR_H
+#include <sys/xattr.h>
+#endif
+#include <sys/ioctl.h>
+#include <unistd.h>
+#include <ctype.h>
+#include <stdbool.h>
+#define FUSE_DARWIN_ENABLE_EXTENSIONS 0
+#ifdef __SET_FOB_FOR_FUSE
+# error Do not set magic value __SET_FOB_FOR_FUSE!!!!
+#endif
+#ifndef _FILE_OFFSET_BITS
+/*
+ * Old versions of libfuse (e.g. Debian 2.9.9 package) required that the build
+ * system set _FILE_OFFSET_BITS explicitly, even if doing so isn't required to
+ * get a 64-bit off_t. AC_SYS_LARGEFILE doesn't set any _FILE_OFFSET_BITS if
+ * it's not required (such as on aarch64), so we must inject it here.
+ */
+# define __SET_FOB_FOR_FUSE
+# define _FILE_OFFSET_BITS 64
+#endif /* _FILE_OFFSET_BITS */
+#include <fuse.h>
+#ifdef __SET_FOB_FOR_FUSE
+# undef _FILE_OFFSET_BITS
+#endif /* __SET_FOB_FOR_FUSE */
+#include <inttypes.h>
+#include "ext2fs/ext2fs.h"
+#include "ext2fs/ext2_fs.h"
+#include "ext2fs/ext2fsP.h"
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
+# define FUSE_PLATFORM_OPTS ""
+#else
+# ifdef __linux__
+# define FUSE_PLATFORM_OPTS ",use_ino,big_writes"
+# else
+# define FUSE_PLATFORM_OPTS ",use_ino"
+# endif
+#endif
+
+#include "../version.h"
+#include "uuid/uuid.h"
+#include "e2p/e2p.h"
+
+#ifdef ENABLE_NLS
+#include <libintl.h>
+#include <locale.h>
+#define _(a) (gettext(a))
+#ifdef gettext_noop
+#define N_(a) gettext_noop(a)
+#else
+#define N_(a) (a)
+#endif
+#define P_(singular, plural, n) (ngettext(singular, plural, n))
+#ifndef NLS_CAT_NAME
+#define NLS_CAT_NAME "e2fsprogs"
+#endif
+#ifndef LOCALEDIR
+#define LOCALEDIR "/usr/share/locale"
+#endif
+#else
+#define _(a) (a)
+#define N_(a) a
+#define P_(singular, plural, n) ((n) == 1 ? (singular) : (plural))
+#endif
+
+#ifndef XATTR_NAME_POSIX_ACL_DEFAULT
+#define XATTR_NAME_POSIX_ACL_DEFAULT "posix_acl_default"
+#endif
+#ifndef XATTR_SECURITY_PREFIX
+#define XATTR_SECURITY_PREFIX "security."
+#define XATTR_SECURITY_PREFIX_LEN (sizeof (XATTR_SECURITY_PREFIX) - 1)
+#endif
+
+/*
+ * Linux and MacOS implement the setxattr(2) interface, which defines
+ * XATTR_CREATE and XATTR_REPLACE. However, FreeBSD uses
+ * extattr_set_file(2), which does not have a flags or options
+ * parameter, and does not define XATTR_CREATE and XATTR_REPLACE.
+ */
+#ifndef XATTR_CREATE
+#define XATTR_CREATE 0
+#endif
+
+#ifndef XATTR_REPLACE
+#define XATTR_REPLACE 0
+#endif
+
+#if !defined(EUCLEAN)
+#if defined(EBADMSG)
+#define EUCLEAN EBADMSG
+#elif defined(EPROTO)
+#define EUCLEAN EPROTO
+#else
+#define EUCLEAN EIO
+#endif
+#endif /* !defined(EUCLEAN) */
+
+#if !defined(ENODATA)
+#ifdef ENOATTR
+#define ENODATA ENOATTR
+#else
+#define ENODATA ENOENT
+#endif
+#endif /* !defined(ENODATA) */
+
+static inline uint64_t round_up(uint64_t b, unsigned int align)
+{
+ unsigned int m;
+
+ if (align == 0)
+ return b;
+ m = b % align;
+ if (m)
+ b += align - m;
+ return b;
+}
+
+static inline uint64_t round_down(uint64_t b, unsigned int align)
+{
+ unsigned int m;
+
+ if (align == 0)
+ return b;
+ m = b % align;
+ return b - m;
+}
+
+#define dbg_printf(fuse4fs, format, ...) \
+ while ((fuse4fs)->debug) { \
+ printf("FUSE4FS (%s): tid=%d " format, (fuse4fs)->shortdev, gettid(), ##__VA_ARGS__); \
+ fflush(stdout); \
+ break; \
+ }
+
+#define log_printf(fuse4fs, format, ...) \
+ do { \
+ printf("FUSE4FS (%s): " format, (fuse4fs)->shortdev, ##__VA_ARGS__); \
+ fflush(stdout); \
+ } while (0)
+
+#define err_printf(fuse4fs, format, ...) \
+ do { \
+ fprintf(stderr, "FUSE4FS (%s): " format, (fuse4fs)->shortdev, ##__VA_ARGS__); \
+ fflush(stderr); \
+ } while (0)
+
+#define timing_printf(fuse4fs, format, ...) \
+ while ((fuse4fs)->timing) { \
+ printf("FUSE4FS (%s): " format, (fuse4fs)->shortdev, ##__VA_ARGS__); \
+ break; \
+ }
+
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(2, 8)
+# ifdef _IOR
+# ifdef _IOW
+# define SUPPORT_I_FLAGS
+# endif
+# endif
+#endif
+
+#ifdef FALLOC_FL_KEEP_SIZE
+# define FL_KEEP_SIZE_FLAG FALLOC_FL_KEEP_SIZE
+# define SUPPORT_FALLOCATE
+#else
+# define FL_KEEP_SIZE_FLAG (0)
+#endif
+
+#ifdef FALLOC_FL_PUNCH_HOLE
+# define FL_PUNCH_HOLE_FLAG FALLOC_FL_PUNCH_HOLE
+#else
+# define FL_PUNCH_HOLE_FLAG (0)
+#endif
+
+#ifdef FALLOC_FL_ZERO_RANGE
+# define FL_ZERO_RANGE_FLAG FALLOC_FL_ZERO_RANGE
+#else
+# define FL_ZERO_RANGE_FLAG (0)
+#endif
+
+errcode_t ext2fs_run_ext3_journal(ext2_filsys *fs);
+
+const char *err_shortdev;
+
+#ifdef CONFIG_JBD_DEBUG /* Enabled by configure --enable-jbd-debug */
+int journal_enable_debug = -1;
+#endif
+
+/*
+ * ext2_file_t contains a struct inode, so we can't leave files open.
+ * Use this as a proxy instead.
+ */
+#define FUSE4FS_FILE_MAGIC (0xEF53DEAFUL)
+struct fuse4fs_file_handle {
+ unsigned long magic;
+ ext2_ino_t ino;
+ int open_flags;
+};
+
+enum fuse4fs_opstate {
+ F4OP_READONLY,
+ F4OP_WRITABLE,
+ F4OP_SHUTDOWN,
+};
+
+/* Main program context */
+#define FUSE4FS_MAGIC (0xEF53DEADUL)
+struct fuse4fs {
+ unsigned long magic;
+ ext2_filsys fs;
+ pthread_mutex_t bfl;
+ char *device;
+ char *shortdev;
+ uint8_t ro;
+ uint8_t debug;
+ uint8_t no_default_opts;
+ uint8_t errors_behavior; /* actually an enum */
+ uint8_t minixdf;
+ uint8_t fakeroot;
+ uint8_t alloc_all_blocks;
+ uint8_t norecovery;
+ uint8_t kernel;
+ uint8_t directio;
+ uint8_t acl;
+ uint8_t dirsync;
+ uint8_t unmount_in_destroy;
+ uint8_t noblkdev;
+
+ enum fuse4fs_opstate opstate;
+ int logfd;
+ int blocklog;
+ unsigned int blockmask;
+ unsigned long offset;
+ unsigned int next_generation;
+ unsigned long long cache_size;
+ char *lockfile;
+#ifdef HAVE_CLOCK_MONOTONIC
+ struct timespec lock_start_time;
+ struct timespec op_start_time;
+ uint8_t timing;
+#endif
+};
+
+#define FUSE4FS_CHECK_HANDLE(ff, fh) \
+ do { \
+ if ((fh) == NULL || (fh)->magic != FUSE4FS_FILE_MAGIC) { \
+ fprintf(stderr, \
+ "FUSE4FS: Corrupt in-memory file handle at %s:%d!\n", \
+ __func__, __LINE__); \
+ fflush(stderr); \
+ return -EUCLEAN; \
+ } \
+ } while (0)
+
+#define __FUSE4FS_CHECK_CONTEXT(ff, retcode, shutcode) \
+ do { \
+ if ((ff) == NULL || (ff)->magic != FUSE4FS_MAGIC) { \
+ fprintf(stderr, \
+ "FUSE4FS: Corrupt in-memory data at %s:%d!\n", \
+ __func__, __LINE__); \
+ fflush(stderr); \
+ retcode; \
+ } \
+ if ((ff)->opstate == F4OP_SHUTDOWN) { \
+ shutcode; \
+ } \
+ } while (0)
+
+#define FUSE4FS_CHECK_CONTEXT(ff) \
+ __FUSE4FS_CHECK_CONTEXT((ff), return -EUCLEAN, return -EIO)
+#define FUSE4FS_CHECK_CONTEXT_RETURN(ff) \
+ __FUSE4FS_CHECK_CONTEXT((ff), return, return)
+#define FUSE4FS_CHECK_CONTEXT_ABORT(ff) \
+ __FUSE4FS_CHECK_CONTEXT((ff), abort(), abort())
+
+static int __translate_error(ext2_filsys fs, ext2_ino_t ino, errcode_t err,
+ const char *func, int line);
+#define translate_error(fs, ino, err) __translate_error((fs), (ino), (err), \
+ __func__, __LINE__)
+
+/* for macosx */
+#ifndef W_OK
+# define W_OK 2
+#endif
+
+#ifndef R_OK
+# define R_OK 4
+#endif
+
+static inline int u_log2(unsigned int arg)
+{
+ int l = 0;
+
+ arg >>= 1;
+ while (arg) {
+ l++;
+ arg >>= 1;
+ }
+ return l;
+}
+
+static inline blk64_t FUSE4FS_B_TO_FSBT(const struct fuse4fs *ff, off_t pos)
+{
+ return pos >> ff->blocklog;
+}
+
+static inline blk64_t FUSE4FS_B_TO_FSB(const struct fuse4fs *ff, off_t pos)
+{
+ return (pos + ff->blockmask) >> ff->blocklog;
+}
+
+static inline unsigned int FUSE4FS_OFF_IN_FSB(const struct fuse4fs *ff,
+ off_t pos)
+{
+ return pos & ff->blockmask;
+}
+
+static inline off_t FUSE4FS_FSB_TO_B(const struct fuse4fs *ff, blk64_t bno)
+{
+ return bno << ff->blocklog;
+}
+
+#define EXT4_EPOCH_BITS 2
+#define EXT4_EPOCH_MASK ((1 << EXT4_EPOCH_BITS) - 1)
+#define EXT4_NSEC_MASK (~0UL << EXT4_EPOCH_BITS)
+
+/*
+ * Extended fields will fit into an inode if the filesystem was formatted
+ * with large inodes (-I 256 or larger) and there are not currently any EAs
+ * consuming all of the available space. For new inodes we always reserve
+ * enough space for the kernel's known extended fields, but for inodes
+ * created with an old kernel this might not have been the case. None of
+ * the extended inode fields is critical for correct filesystem operation.
+ * This macro checks if a certain field fits in the inode. Note that
+ * inode-size = GOOD_OLD_INODE_SIZE + i_extra_isize
+ */
+#define EXT4_FITS_IN_INODE(ext4_inode, field) \
+ ((offsetof(typeof(*ext4_inode), field) + \
+ sizeof((ext4_inode)->field)) \
+ <= ((size_t) EXT2_GOOD_OLD_INODE_SIZE + \
+ (ext4_inode)->i_extra_isize)) \
+
+static inline __u32 ext4_encode_extra_time(const struct timespec *time)
+{
+ __u32 extra = sizeof(time->tv_sec) > 4 ?
+ ((time->tv_sec - (__s32)time->tv_sec) >> 32) &
+ EXT4_EPOCH_MASK : 0;
+ return extra | (time->tv_nsec << EXT4_EPOCH_BITS);
+}
+
+static inline void ext4_decode_extra_time(struct timespec *time, __u32 extra)
+{
+ if (sizeof(time->tv_sec) > 4 && (extra & EXT4_EPOCH_MASK)) {
+ __u64 extra_bits = extra & EXT4_EPOCH_MASK;
+ /*
+ * Prior to kernel 3.14?, we had a broken decode function,
+ * wherein we effectively did this:
+ * if (extra_bits == 3)
+ * extra_bits = 0;
+ */
+ time->tv_sec += extra_bits << 32;
+ }
+ time->tv_nsec = ((extra) & EXT4_NSEC_MASK) >> EXT4_EPOCH_BITS;
+}
+
+#define EXT4_CLAMP_TIMESTAMP(xtime, timespec, raw_inode) \
+do { \
+ if ((timespec)->tv_sec < EXT4_TIMESTAMP_MIN) \
+ (timespec)->tv_sec = EXT4_TIMESTAMP_MIN; \
+ if ((timespec)->tv_sec < EXT4_TIMESTAMP_MIN) \
+ (timespec)->tv_sec = EXT4_TIMESTAMP_MIN; \
+ \
+ if (EXT4_FITS_IN_INODE(raw_inode, xtime ## _extra)) { \
+ if ((timespec)->tv_sec > EXT4_EXTRA_TIMESTAMP_MAX) \
+ (timespec)->tv_sec = EXT4_EXTRA_TIMESTAMP_MAX; \
+ } else { \
+ if ((timespec)->tv_sec > EXT4_NON_EXTRA_TIMESTAMP_MAX) \
+ (timespec)->tv_sec = EXT4_NON_EXTRA_TIMESTAMP_MAX; \
+ } \
+} while (0)
+
+#define EXT4_INODE_SET_XTIME(xtime, timespec, raw_inode) \
+do { \
+ typeof(*(timespec)) _ts = *(timespec); \
+ \
+ EXT4_CLAMP_TIMESTAMP(xtime, &_ts, raw_inode); \
+ (raw_inode)->xtime = _ts.tv_sec; \
+ if (EXT4_FITS_IN_INODE(raw_inode, xtime ## _extra)) \
+ (raw_inode)->xtime ## _extra = \
+ ext4_encode_extra_time(&_ts); \
+} while (0)
+
+#define EXT4_EINODE_SET_XTIME(xtime, timespec, raw_inode) \
+do { \
+ typeof(*(timespec)) _ts = *(timespec); \
+ \
+ EXT4_CLAMP_TIMESTAMP(xtime, &_ts, raw_inode); \
+ if (EXT4_FITS_IN_INODE(raw_inode, xtime)) \
+ (raw_inode)->xtime = _ts.tv_sec; \
+ if (EXT4_FITS_IN_INODE(raw_inode, xtime ## _extra)) \
+ (raw_inode)->xtime ## _extra = \
+ ext4_encode_extra_time(&_ts); \
+} while (0)
+
+#define EXT4_INODE_GET_XTIME(xtime, timespec, raw_inode) \
+do { \
+ (timespec)->tv_sec = (signed)((raw_inode)->xtime); \
+ if (EXT4_FITS_IN_INODE(raw_inode, xtime ## _extra)) \
+ ext4_decode_extra_time((timespec), \
+ (raw_inode)->xtime ## _extra); \
+ else \
+ (timespec)->tv_nsec = 0; \
+} while (0)
+
+#define EXT4_EINODE_GET_XTIME(xtime, timespec, raw_inode) \
+do { \
+ if (EXT4_FITS_IN_INODE(raw_inode, xtime)) \
+ (timespec)->tv_sec = \
+ (signed)((raw_inode)->xtime); \
+ if (EXT4_FITS_IN_INODE(raw_inode, xtime ## _extra)) \
+ ext4_decode_extra_time((timespec), \
+ raw_inode->xtime ## _extra); \
+ else \
+ (timespec)->tv_nsec = 0; \
+} while (0)
+
+static inline errcode_t fuse4fs_read_inode(ext2_filsys fs, ext2_ino_t ino,
+ struct ext2_inode_large *inode)
+{
+ memset(inode, 0, sizeof(*inode));
+ return ext2fs_read_inode_full(fs, ino, EXT2_INODE(inode),
+ sizeof(*inode));
+}
+
+static inline errcode_t fuse4fs_write_inode(ext2_filsys fs, ext2_ino_t ino,
+ struct ext2_inode_large *inode)
+{
+ return ext2fs_write_inode_full(fs, ino, EXT2_INODE(inode),
+ sizeof(*inode));
+}
+
+static inline struct fuse4fs *fuse4fs_get(void)
+{
+ struct fuse_context *ctxt = fuse_get_context();
+
+ return ctxt->private_data;
+}
+
+static inline struct fuse4fs_file_handle *
+fuse4fs_get_handle(const struct fuse_file_info *fp)
+{
+ return (struct fuse4fs_file_handle *)(uintptr_t)fp->fh;
+}
+
+static inline void
+fuse4fs_set_handle(struct fuse_file_info *fp, struct fuse4fs_file_handle *fh)
+{
+ fp->fh = (uintptr_t)fh;
+}
+
+#ifdef HAVE_CLOCK_MONOTONIC
+static inline ext2_filsys fuse4fs_start(struct fuse4fs *ff)
+{
+ struct timespec lock_time;
+ int ret;
+
+ if (ff->timing)
+ clock_gettime(CLOCK_MONOTONIC, &lock_time);
+
+ pthread_mutex_lock(&ff->bfl);
+ if (ff->timing) {
+ ret = clock_gettime(CLOCK_MONOTONIC, &ff->op_start_time);
+ if (ret)
+ ff->timing = 0;
+ ff->lock_start_time = lock_time;
+ }
+ return ff->fs;
+}
+
+static inline double ms_from_timespec(const struct timespec *ts)
+{
+ return ((double)ts->tv_sec * 1000) + ((double)ts->tv_nsec / 1000000);
+}
+
+static inline void fuse4fs_finish_timing(struct fuse4fs *ff, const char *func)
+{
+ struct timespec now;
+ double lockf, startf, nowf;
+ int ret;
+
+ if (!ff->timing)
+ return;
+
+ ret = clock_gettime(CLOCK_MONOTONIC, &now);
+ if (ret) {
+ ff->timing = 0;
+ return;
+ }
+
+ lockf = ms_from_timespec(&ff->lock_start_time);
+ startf = ms_from_timespec(&ff->op_start_time);
+ nowf = ms_from_timespec(&now);
+ timing_printf(ff, "%s: lock=%.2fms elapsed=%.2fms\n", func,
+ startf - lockf, nowf - startf);
+}
+#else
+static inline ext2_filsys fuse4fs_start(struct fuse4fs *ff)
+{
+ pthread_mutex_lock(&ff->bfl);
+ return ff->fs;
+}
+# define fuse4fs_finish_timing(...) ((void)0)
+#endif
+
+static inline void __fuse4fs_finish(struct fuse4fs *ff, int ret,
+ const char *func)
+{
+ fuse4fs_finish_timing(ff, func);
+ if (ret)
+ dbg_printf(ff, "%s: libfuse ret=%d\n", func, ret);
+ pthread_mutex_unlock(&ff->bfl);
+}
+#define fuse4fs_finish(ff, ret) __fuse4fs_finish((ff), (ret), __func__)
+
+static void get_now(struct timespec *now)
+{
+#ifdef CLOCK_REALTIME
+ if (!clock_gettime(CLOCK_REALTIME, now))
+ return;
+#endif
+
+ now->tv_sec = time(NULL);
+ now->tv_nsec = 0;
+}
+
+static void increment_version(struct ext2_inode_large *inode)
+{
+ __u64 ver;
+
+ ver = inode->osd1.linux1.l_i_version;
+ if (EXT4_FITS_IN_INODE(inode, i_version_hi))
+ ver |= (__u64)inode->i_version_hi << 32;
+ ver++;
+ inode->osd1.linux1.l_i_version = ver;
+ if (EXT4_FITS_IN_INODE(inode, i_version_hi))
+ inode->i_version_hi = ver >> 32;
+}
+
+static void init_times(struct ext2_inode_large *inode)
+{
+ struct timespec now;
+
+ get_now(&now);
+ EXT4_INODE_SET_XTIME(i_atime, &now, inode);
+ EXT4_INODE_SET_XTIME(i_ctime, &now, inode);
+ EXT4_INODE_SET_XTIME(i_mtime, &now, inode);
+ EXT4_EINODE_SET_XTIME(i_crtime, &now, inode);
+ increment_version(inode);
+}
+
+static int update_ctime(ext2_filsys fs, ext2_ino_t ino,
+ struct ext2_inode_large *pinode)
+{
+ errcode_t err;
+ struct timespec now;
+ struct ext2_inode_large inode;
+
+ get_now(&now);
+
+ /* If user already has an inode buffer, just update that */
+ if (pinode) {
+ increment_version(pinode);
+ EXT4_INODE_SET_XTIME(i_ctime, &now, pinode);
+ return 0;
+ }
+
+ /* Otherwise we have to read-modify-write the inode */
+ err = fuse4fs_read_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ increment_version(&inode);
+ EXT4_INODE_SET_XTIME(i_ctime, &now, &inode);
+
+ err = fuse4fs_write_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ return 0;
+}
+
+static int update_atime(ext2_filsys fs, ext2_ino_t ino)
+{
+ errcode_t err;
+ struct ext2_inode_large inode, *pinode;
+ struct timespec atime, mtime, now;
+ double datime, dmtime, dnow;
+
+ err = fuse4fs_read_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ pinode = &inode;
+ EXT4_INODE_GET_XTIME(i_atime, &atime, pinode);
+ EXT4_INODE_GET_XTIME(i_mtime, &mtime, pinode);
+ get_now(&now);
+
+ datime = atime.tv_sec + ((double)atime.tv_nsec / 1000000000);
+ dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / 1000000000);
+ dnow = now.tv_sec + ((double)now.tv_nsec / 1000000000);
+
+ /*
+ * If atime is newer than mtime and atime has been updated within the
+ * last thirty seconds, skip the atime update. Same idea as Linux
+ * "relatime". Use doubles to account for nanosecond resolution.
+ */
+ if (datime >= dmtime && datime >= dnow - 30)
+ return 0;
+ EXT4_INODE_SET_XTIME(i_atime, &now, &inode);
+
+ err = fuse4fs_write_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ return 0;
+}
+
+static int update_mtime(ext2_filsys fs, ext2_ino_t ino,
+ struct ext2_inode_large *pinode)
+{
+ errcode_t err;
+ struct ext2_inode_large inode;
+ struct timespec now;
+
+ if (pinode) {
+ get_now(&now);
+ EXT4_INODE_SET_XTIME(i_mtime, &now, pinode);
+ EXT4_INODE_SET_XTIME(i_ctime, &now, pinode);
+ increment_version(pinode);
+ return 0;
+ }
+
+ err = fuse4fs_read_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ get_now(&now);
+ EXT4_INODE_SET_XTIME(i_mtime, &now, &inode);
+ EXT4_INODE_SET_XTIME(i_ctime, &now, &inode);
+ increment_version(&inode);
+
+ err = fuse4fs_write_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ return 0;
+}
+
+static int ext2_file_type(unsigned int mode)
+{
+ if (LINUX_S_ISREG(mode))
+ return EXT2_FT_REG_FILE;
+
+ if (LINUX_S_ISDIR(mode))
+ return EXT2_FT_DIR;
+
+ if (LINUX_S_ISCHR(mode))
+ return EXT2_FT_CHRDEV;
+
+ if (LINUX_S_ISBLK(mode))
+ return EXT2_FT_BLKDEV;
+
+ if (LINUX_S_ISLNK(mode))
+ return EXT2_FT_SYMLINK;
+
+ if (LINUX_S_ISFIFO(mode))
+ return EXT2_FT_FIFO;
+
+ if (LINUX_S_ISSOCK(mode))
+ return EXT2_FT_SOCK;
+
+ return 0;
+}
+
+static int fs_can_allocate(struct fuse4fs *ff, blk64_t num)
+{
+ ext2_filsys fs = ff->fs;
+ blk64_t reserved;
+
+ dbg_printf(ff, "%s: Asking for %llu; alloc_all=%d total=%llu free=%llu "
+ "rsvd=%llu\n", __func__, num, ff->alloc_all_blocks,
+ ext2fs_blocks_count(fs->super),
+ ext2fs_free_blocks_count(fs->super),
+ ext2fs_r_blocks_count(fs->super));
+ if (num > ext2fs_blocks_count(fs->super))
+ return 0;
+
+ if (ff->alloc_all_blocks)
+ return 1;
+
+ /*
+ * Different meaning for r_blocks -- libext2fs has bugs where the FS
+ * can get corrupted if it totally runs out of blocks. Avoid this
+ * by refusing to allocate any of the reserve blocks to anybody.
+ */
+ reserved = ext2fs_r_blocks_count(fs->super);
+ if (reserved == 0)
+ reserved = ext2fs_blocks_count(fs->super) / 10;
+ return ext2fs_free_blocks_count(fs->super) > reserved + num;
+}
+
+static int fuse4fs_is_writeable(struct fuse4fs *ff)
+{
+ return ff->opstate == F4OP_WRITABLE &&
+ (ff->fs->super->s_error_count == 0);
+}
+
+static inline int is_superuser(struct fuse4fs *ff, struct fuse_context *ctxt)
+{
+ if (ff->fakeroot)
+ return 1;
+ return ctxt->uid == 0;
+}
+
+static inline int want_check_owner(struct fuse4fs *ff,
+ struct fuse_context *ctxt)
+{
+ /*
+ * The kernel is responsible for access control, so we allow anything
+ * that the superuser can do.
+ */
+ if (ff->kernel)
+ return 0;
+ return !is_superuser(ff, ctxt);
+}
+
+/* Test for append permission */
+#define A_OK 16
+
+static int check_iflags_access(struct fuse4fs *ff, ext2_ino_t ino,
+ const struct ext2_inode *inode, int mask)
+{
+ EXT2FS_BUILD_BUG_ON((A_OK & (R_OK | W_OK | X_OK | F_OK)) != 0);
+
+ /* no writing or metadata changes to read-only or broken fs */
+ if ((mask & (W_OK | A_OK)) && !fuse4fs_is_writeable(ff))
+ return -EROFS;
+
+ dbg_printf(ff, "access ino=%d mask=e%s%s%s%s iflags=0x%x\n",
+ ino,
+ (mask & R_OK ? "r" : ""),
+ (mask & W_OK ? "w" : ""),
+ (mask & X_OK ? "x" : ""),
+ (mask & A_OK ? "a" : ""),
+ inode->i_flags);
+
+ /* is immutable? */
+ if ((mask & W_OK) &&
+ (inode->i_flags & EXT2_IMMUTABLE_FL))
+ return -EPERM;
+
+ /* is append-only? */
+ if ((inode->i_flags & EXT2_APPEND_FL) && (mask & W_OK) && !(mask & A_OK))
+ return -EPERM;
+
+ return 0;
+}
+
+static int check_inum_access(struct fuse4fs *ff, ext2_ino_t ino, int mask)
+{
+ struct fuse_context *ctxt = fuse_get_context();
+ ext2_filsys fs = ff->fs;
+ struct ext2_inode inode;
+ mode_t perms;
+ errcode_t err;
+ int ret;
+
+ /* no writing to read-only or broken fs */
+ if ((mask & (W_OK | A_OK)) && !fuse4fs_is_writeable(ff))
+ return -EROFS;
+
+ err = ext2fs_read_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+ perms = inode.i_mode & 0777;
+
+ dbg_printf(ff, "access ino=%d mask=e%s%s%s%s perms=0%o iflags=0x%x "
+ "fuid=%d fgid=%d uid=%d gid=%d\n", ino,
+ (mask & R_OK ? "r" : ""),
+ (mask & W_OK ? "w" : ""),
+ (mask & X_OK ? "x" : ""),
+ (mask & A_OK ? "a" : ""),
+ perms, inode.i_flags,
+ inode_uid(inode), inode_gid(inode),
+ ctxt->uid, ctxt->gid);
+
+ /* existence check */
+ if (mask == 0)
+ return 0;
+
+ ret = check_iflags_access(ff, ino, &inode, mask);
+ if (ret)
+ return ret;
+
+ /* If kernel is responsible for mode and acl checks, we're done. */
+ if (ff->kernel)
+ return 0;
+
+ /* Figure out what root's allowed to do */
+ if (is_superuser(ff, ctxt)) {
+ /* Non-file access always ok */
+ if (!LINUX_S_ISREG(inode.i_mode))
+ return 0;
+
+ /* R/W access to a file always ok */
+ if (!(mask & X_OK))
+ return 0;
+
+ /* X access to a file ok if a user/group/other can X */
+ if (perms & 0111)
+ return 0;
+
+ /* Trying to execute a file that's not executable. BZZT! */
+ return -EACCES;
+ }
+
+ /* Remove the A_OK (append) bit before testing permissions */
+ mask &= ~A_OK;
+
+ /* allow owner, if perms match */
+ if (inode_uid(inode) == ctxt->uid) {
+ if ((mask & (perms >> 6)) == mask)
+ return 0;
+ return -EACCES;
+ }
+
+ /* allow group, if perms match */
+ if (inode_gid(inode) == ctxt->gid) {
+ if ((mask & (perms >> 3)) == mask)
+ return 0;
+ return -EACCES;
+ }
+
+ /* otherwise check other */
+ if ((mask & perms) == mask)
+ return 0;
+ return -EACCES;
+}
+
+static errcode_t fuse4fs_acquire_lockfile(struct fuse4fs *ff)
+{
+ char *resolved;
+ int lockfd;
+ errcode_t err;
+
+ lockfd = open(ff->lockfile, O_RDWR | O_CREAT | O_EXCL, 0400);
+ if (lockfd < 0) {
+ if (errno == EEXIST)
+ err = EWOULDBLOCK;
+ else
+ err = errno;
+ err_printf(ff, "%s: %s: %s\n", ff->lockfile,
+ _("opening lockfile failed"),
+ strerror(err));
+ ff->lockfile = NULL;
+ return err;
+ }
+ close(lockfd);
+
+ resolved = realpath(ff->lockfile, NULL);
+ if (!resolved) {
+ err = errno;
+ err_printf(ff, "%s: %s: %s\n", ff->lockfile,
+ _("resolving lockfile failed"),
+ strerror(err));
+ unlink(ff->lockfile);
+ ff->lockfile = NULL;
+ return err;
+ }
+ free(ff->lockfile);
+ ff->lockfile = resolved;
+
+ return 0;
+}
+
+static void fuse4fs_release_lockfile(struct fuse4fs *ff)
+{
+ if (unlink(ff->lockfile)) {
+ errcode_t err = errno;
+
+ err_printf(ff, "%s: %s: %s\n", ff->lockfile,
+ _("removing lockfile failed"),
+ strerror(err));
+ }
+ free(ff->lockfile);
+}
+
+static void fuse4fs_unmount(struct fuse4fs *ff)
+{
+ errcode_t err;
+
+ if (!ff->fs)
+ return;
+
+ err = ext2fs_close(ff->fs);
+ if (err)
+ err_printf(ff, "%s\n", error_message(err));
+
+ ff->fs = NULL;
+
+ if (ff->lockfile)
+ fuse4fs_release_lockfile(ff);
+}
+
+static errcode_t fuse4fs_open(struct fuse4fs *ff, int libext2_flags)
+{
+ char options[128];
+ int flags = EXT2_FLAG_64BITS | EXT2_FLAG_THREADS | EXT2_FLAG_RW |
+ libext2_flags;
+ errcode_t err;
+
+ if (ff->lockfile) {
+ err = fuse4fs_acquire_lockfile(ff);
+ if (err)
+ return err;
+ }
+
+ snprintf(options, sizeof(options) - 1, "offset=%lu", ff->offset);
+ ff->opstate = F4OP_READONLY;
+
+ if (ff->directio)
+ flags |= EXT2_FLAG_DIRECT_IO;
+
+ err = ext2fs_open2(ff->device, options, flags, 0, 0, unix_io_manager,
+ &ff->fs);
+ if (err == EPERM) {
+ err_printf(ff, "%s.\n",
+ _("read-only device, trying to mount norecovery"));
+ flags &= ~EXT2_FLAG_RW;
+ ff->ro = 1;
+ ff->norecovery = 1;
+ err = ext2fs_open2(ff->device, options, flags, 0, 0,
+ unix_io_manager, &ff->fs);
+ }
+ if (err) {
+ err_printf(ff, "%s.\n", error_message(err));
+ err_printf(ff, "%s\n", _("Please run e2fsck -fy."));
+ return err;
+ }
+
+ ff->fs->priv_data = ff;
+ ff->blocklog = u_log2(ff->fs->blocksize);
+ ff->blockmask = ff->fs->blocksize - 1;
+ return 0;
+}
+
+static inline bool fuse4fs_on_bdev(const struct fuse4fs *ff)
+{
+ return ff->fs->io->flags & CHANNEL_FLAGS_BLOCK_DEVICE;
+}
+
+static errcode_t fuse4fs_config_cache(struct fuse4fs *ff)
+{
+ char buf[128];
+ errcode_t err;
+
+ snprintf(buf, sizeof(buf), "cache_blocks=%llu",
+ FUSE4FS_B_TO_FSBT(ff, ff->cache_size));
+ err = io_channel_set_options(ff->fs->io, buf);
+ if (err) {
+ err_printf(ff, "%s %lluk: %s\n",
+ _("cannot set disk cache size to"),
+ ff->cache_size >> 10,
+ error_message(err));
+ return err;
+ }
+
+ return 0;
+}
+
+static errcode_t fuse4fs_check_support(struct fuse4fs *ff)
+{
+ ext2_filsys fs = ff->fs;
+
+ if (ext2fs_has_feature_quota(fs->super)) {
+ err_printf(ff, "%s\n", _("quotas not supported."));
+ return EXT2_ET_UNSUPP_FEATURE;
+ }
+ if (ext2fs_has_feature_verity(fs->super)) {
+ err_printf(ff, "%s\n", _("verity not supported."));
+ return EXT2_ET_UNSUPP_FEATURE;
+ }
+ if (ext2fs_has_feature_encrypt(fs->super)) {
+ err_printf(ff, "%s\n", _("encryption not supported."));
+ return EXT2_ET_UNSUPP_FEATURE;
+ }
+ if (ext2fs_has_feature_casefold(fs->super)) {
+ err_printf(ff, "%s\n", _("casefolding not supported."));
+ return EXT2_ET_UNSUPP_FEATURE;
+ }
+
+ if (fs->super->s_state & EXT2_ERROR_FS) {
+ err_printf(ff, "%s\n",
+ _("Errors detected; running e2fsck is required."));
+ return EXT2_ET_FILESYSTEM_CORRUPTED;
+ }
+
+ return 0;
+}
+
+static int fuse4fs_check_norecovery(struct fuse4fs *ff)
+{
+ if (ext2fs_has_feature_journal_needs_recovery(ff->fs->super) &&
+ !ff->ro) {
+ log_printf(ff, "%s\n",
+ _("Required journal recovery suppressed and not mounted read-only."));
+ return 32;
+ }
+
+ /*
+ * Amazingly, norecovery allows a rw mount when there's a clean journal
+ * present.
+ */
+ return 0;
+}
+
+static errcode_t fuse4fs_mount(struct fuse4fs *ff)
+{
+ ext2_filsys fs = ff->fs;
+ errcode_t err;
+
+ if (ext2fs_has_feature_journal_needs_recovery(fs->super)) {
+ if (ff->norecovery) {
+ log_printf(ff, "%s\n",
+ _("Mounting read-only without recovering journal."));
+ } else {
+ log_printf(ff, "%s\n", _("Recovering journal."));
+ err = ext2fs_run_ext3_journal(&fs);
+ if (err) {
+ err_printf(ff, "%s.\n", error_message(err));
+ err_printf(ff, "%s\n",
+ _("Please run e2fsck -fy."));
+ return err;
+ }
+ ext2fs_clear_feature_journal_needs_recovery(fs->super);
+ ext2fs_mark_super_dirty(fs);
+
+ err = fuse4fs_check_support(ff);
+ if (err)
+ return err;
+ }
+ }
+
+ if (fs->flags & EXT2_FLAG_RW) {
+ if (ext2fs_has_feature_journal(fs->super))
+ log_printf(ff, "%s",
+ _("Warning: fuse4fs does not support using the journal.\n"
+ "There may be file system corruption or data loss if\n"
+ "the file system is not gracefully unmounted.\n"));
+ err = ext2fs_read_inode_bitmap(fs);
+ if (err) {
+ translate_error(fs, 0, err);
+ return err;
+ }
+ err = ext2fs_read_block_bitmap(fs);
+ if (err) {
+ translate_error(fs, 0, err);
+ return err;
+ }
+ ff->opstate = F4OP_WRITABLE;
+ }
+
+ if (!(fs->super->s_state & EXT2_VALID_FS))
+ err_printf(ff, "%s\n",
+ _("Warning: Mounting unchecked fs, running e2fsck is recommended."));
+ if (fs->super->s_max_mnt_count > 0 &&
+ fs->super->s_mnt_count >= fs->super->s_max_mnt_count)
+ err_printf(ff, "%s\n",
+ _("Warning: Maximal mount count reached, running e2fsck is recommended."));
+ if (fs->super->s_checkinterval > 0 &&
+ (time_t) (fs->super->s_lastcheck +
+ fs->super->s_checkinterval) <= time(0))
+ err_printf(ff, "%s\n",
+ _("Warning: Check time reached; running e2fsck is recommended."));
+ if (fs->super->s_last_orphan)
+ err_printf(ff, "%s\n",
+ _("Orphans detected; running e2fsck is recommended."));
+
+ if (!ff->errors_behavior)
+ ff->errors_behavior = fs->super->s_errors;
+
+ /* Clear the valid flag so that an unclean shutdown forces a fsck */
+ if (ff->opstate == F4OP_WRITABLE) {
+ fs->super->s_mnt_count++;
+ ext2fs_set_tstamp(fs->super, s_mtime, time(NULL));
+ fs->super->s_state &= ~EXT2_VALID_FS;
+ ext2fs_mark_super_dirty(fs);
+ err = ext2fs_flush2(fs, 0);
+ if (err)
+ return translate_error(fs, 0, err);
+ }
+
+ return 0;
+}
+
+static void op_destroy(void *p EXT2FS_ATTR((unused)))
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ ext2_filsys fs;
+ errcode_t err;
+
+ FUSE4FS_CHECK_CONTEXT_RETURN(ff);
+
+ fs = fuse4fs_start(ff);
+
+ dbg_printf(ff, "%s: dev=%s\n", __func__, fs->device_name);
+ if (ff->opstate == F4OP_WRITABLE) {
+ fs->super->s_state |= EXT2_VALID_FS;
+ if (fs->super->s_error_count)
+ fs->super->s_state |= EXT2_ERROR_FS;
+ ext2fs_mark_super_dirty(fs);
+ err = ext2fs_set_gdt_csum(fs);
+ if (err)
+ translate_error(fs, 0, err);
+
+ err = ext2fs_flush2(fs, 0);
+ if (err)
+ translate_error(fs, 0, err);
+ }
+
+ if (ff->debug && fs->io->manager->get_stats) {
+ io_stats stats = NULL;
+
+ fs->io->manager->get_stats(fs->io, &stats);
+ dbg_printf(ff, "read: %lluk\n", stats->bytes_read >> 10);
+ dbg_printf(ff, "write: %lluk\n", stats->bytes_written >> 10);
+ dbg_printf(ff, "hits: %llu\n", stats->cache_hits);
+ dbg_printf(ff, "misses: %llu\n", stats->cache_misses);
+ dbg_printf(ff, "hit_ratio: %.1f%%\n",
+ (100.0 * stats->cache_hits) /
+ (stats->cache_hits + stats->cache_misses));
+ }
+
+ if (ff->kernel) {
+ char uuid[UUID_STR_SIZE];
+
+ uuid_unparse(fs->super->s_uuid, uuid);
+ log_printf(ff, "%s %s.\n", _("unmounting filesystem"), uuid);
+ }
+
+ if (ff->unmount_in_destroy)
+ fuse4fs_unmount(ff);
+
+ fuse4fs_finish(ff, 0);
+}
+
+/* Reopen @stream with @fileno */
+static int fuse4fs_freopen_stream(const char *path, int fileno, FILE *stream)
+{
+ char _fdpath[256];
+ const char *fdpath;
+ FILE *fp;
+ int ret;
+
+ ret = snprintf(_fdpath, sizeof(_fdpath), "/dev/fd/%d", fileno);
+ if (ret >= sizeof(_fdpath))
+ fdpath = path;
+ else
+ fdpath = _fdpath;
+
+ /*
+ * C23 defines std{out,err} as an expression of type FILE* that need
+ * not be an lvalue. What this means is that we can't just assign to
+ * stdout: we have to use freopen, which takes a path.
+ *
+ * There's no guarantee that the OS provides a /dev/fd/X alias for open
+ * file descriptors, so if that fails, fall back to the original log
+ * file path. We'd rather not do a path-based reopen because that
+ * exposes us to rename race attacks.
+ */
+ fp = freopen(fdpath, "a", stream);
+ if (!fp && errno == ENOENT && fdpath == _fdpath)
+ fp = freopen(path, "a", stream);
+ if (!fp) {
+ perror(fdpath);
+ return -1;
+ }
+
+ return 0;
+}
+
+/* Redirect stdout/stderr to a file, or return a mount-compatible error. */
+static int fuse4fs_capture_output(struct fuse4fs *ff, const char *path)
+{
+ int ret;
+ int fd;
+
+ /*
+ * First, open the log file path with system calls so that we can
+ * redirect the stdout/stderr file numbers (typically 1 and 2) to our
+ * logfile descriptor. We'd like to avoid allocating extra file
+ * objects in the kernel if we can, because stdout and stderr will
+ * share the same file position.
+ */
+ if (ff->logfd < 0) {
+ fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
+ if (fd < 0) {
+ perror(path);
+ return -1;
+ }
+
+ /*
+ * Save the newly opened fd in case we have to do this again in
+ * op_init.
+ */
+ ff->logfd = fd;
+ }
+
+ ret = dup2(ff->logfd, STDOUT_FILENO);
+ if (ret < 0) {
+ perror(path);
+ return -1;
+ }
+
+ ret = dup2(ff->logfd, STDERR_FILENO);
+ if (ret < 0) {
+ perror(path);
+ return -1;
+ }
+
+ /*
+ * Now that we've changed STD{OUT,ERR}_FILENO to be the log file, use
+ * freopen to make sure that std{out,err} (the C library abstractions)
+ * point to the STDXXX_FILENO because any of our library dependencies
+ * might decide to printf to one of those streams and we want to
+ * capture all output in the log.
+ */
+ ret = fuse4fs_freopen_stream(path, STDOUT_FILENO, stdout);
+ if (ret)
+ return ret;
+ ret = fuse4fs_freopen_stream(path, STDERR_FILENO, stderr);
+ if (ret)
+ return ret;
+
+ return 0;
+}
+
+/* Set up debug and error logging files */
+static int fuse4fs_setup_logging(struct fuse4fs *ff)
+{
+ char *logfile = getenv("FUSE4FS_LOGFILE");
+ if (logfile)
+ return fuse4fs_capture_output(ff, logfile);
+
+ /* in kernel mode, try to log errors to the kernel log */
+ if (ff->kernel)
+ fuse4fs_capture_output(ff, "/dev/ttyprintk");
+
+ return 0;
+}
+
+#if FUSE_VERSION < FUSE_MAKE_VERSION(3, 17)
+static inline int fuse_set_feature_flag(struct fuse_conn_info *conn,
+ uint64_t flag)
+{
+ if (conn->capable & flag) {
+ conn->want |= flag;
+ return 1;
+ }
+
+ return 0;
+}
+#endif
+
+static void *op_init(struct fuse_conn_info *conn
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
+ , struct fuse_config *cfg EXT2FS_ATTR((unused))
+#endif
+ )
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ ext2_filsys fs;
+
+ FUSE4FS_CHECK_CONTEXT_ABORT(ff);
+
+ /*
+ * Configure logging a second time, because libfuse might have
+ * redirected std{out,err} as part of daemonization. If this fails,
+ * give up and move on.
+ */
+ fuse4fs_setup_logging(ff);
+ if (ff->logfd >= 0)
+ close(ff->logfd);
+ ff->logfd = -1;
+
+ fs = ff->fs;
+ dbg_printf(ff, "%s: dev=%s\n", __func__, fs->device_name);
+#ifdef FUSE_CAP_IOCTL_DIR
+ fuse_set_feature_flag(conn, FUSE_CAP_IOCTL_DIR);
+#endif
+#ifdef FUSE_CAP_POSIX_ACL
+ if (ff->acl)
+ fuse_set_feature_flag(conn, FUSE_CAP_POSIX_ACL);
+#endif
+#ifdef FUSE_CAP_CACHE_SYMLINKS
+ fuse_set_feature_flag(conn, FUSE_CAP_CACHE_SYMLINKS);
+#endif
+#ifdef FUSE_CAP_NO_EXPORT_SUPPORT
+ fuse_set_feature_flag(conn, FUSE_CAP_NO_EXPORT_SUPPORT);
+#endif
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
+ conn->time_gran = 1;
+ cfg->use_ino = 1;
+ if (ff->debug)
+ cfg->debug = 1;
+ cfg->nullpath_ok = 1;
+#endif
+
+ if (ff->kernel) {
+ char uuid[UUID_STR_SIZE];
+
+ uuid_unparse(fs->super->s_uuid, uuid);
+ log_printf(ff, "%s %s.\n", _("mounted filesystem"), uuid);
+ }
+
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 17)
+ /*
+ * THIS MUST GO LAST!
+ *
+ * fuse_set_feature_flag in 3.17.0 has a strange bug: it sets feature
+ * flags in conn->want_ext, but not conn->want. Upon return to
+ * libfuse, the lower level library observes that want and want_ext
+ * have gotten out of sync, and refuses to mount. Therefore,
+ * synchronize the two. This bug went away in 3.17.3, but we're stuck
+ * with this forever because Debian trixie released with 3.17.2.
+ */
+ conn->want = conn->want_ext & 0xFFFFFFFF;
+#endif
+ return ff;
+}
+
+static int stat_inode(ext2_filsys fs, ext2_ino_t ino, struct stat *statbuf)
+{
+ struct ext2_inode_large inode;
+ dev_t fakedev = 0;
+ errcode_t err;
+ int ret = 0;
+ struct timespec tv;
+
+ err = fuse4fs_read_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ memcpy(&fakedev, fs->super->s_uuid, sizeof(fakedev));
+ statbuf->st_dev = fakedev;
+ statbuf->st_ino = ino;
+ statbuf->st_mode = inode.i_mode;
+ statbuf->st_nlink = inode.i_links_count;
+ statbuf->st_uid = inode_uid(inode);
+ statbuf->st_gid = inode_gid(inode);
+ statbuf->st_size = EXT2_I_SIZE(&inode);
+ statbuf->st_blksize = fs->blocksize;
+ statbuf->st_blocks = ext2fs_get_stat_i_blocks(fs,
+ EXT2_INODE(&inode));
+ EXT4_INODE_GET_XTIME(i_atime, &tv, &inode);
+#if HAVE_STRUCT_STAT_ST_ATIM
+ statbuf->st_atim = tv;
+#else
+ statbuf->st_atime = tv.tv_sec;
+#endif
+ EXT4_INODE_GET_XTIME(i_mtime, &tv, &inode);
+#if HAVE_STRUCT_STAT_ST_ATIM
+ statbuf->st_mtim = tv;
+#else
+ statbuf->st_mtime = tv.tv_sec;
+#endif
+ EXT4_INODE_GET_XTIME(i_ctime, &tv, &inode);
+#if HAVE_STRUCT_STAT_ST_ATIM
+ statbuf->st_ctim = tv;
+#else
+ statbuf->st_ctime = tv.tv_sec;
+#endif
+ if (LINUX_S_ISCHR(inode.i_mode) ||
+ LINUX_S_ISBLK(inode.i_mode)) {
+ if (inode.i_block[0])
+ statbuf->st_rdev = inode.i_block[0];
+ else
+ statbuf->st_rdev = inode.i_block[1];
+ }
+
+ return ret;
+}
+
+static int __fuse4fs_file_ino(struct fuse4fs *ff, const char *path,
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
+ struct fuse_file_info *fp EXT2FS_ATTR((unused)),
+#endif
+ ext2_ino_t *inop,
+ const char *func,
+ int line)
+{
+ ext2_filsys fs = ff->fs;
+ errcode_t err;
+
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
+ if (fp) {
+ struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
+
+ if (fh->ino == 0)
+ return -ESTALE;
+
+ *inop = fh->ino;
+ dbg_printf(ff, "%s: get ino=%d\n", func, fh->ino);
+ return 0;
+ }
+#endif
+ dbg_printf(ff, "%s: get path=%s\n", func, path);
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, inop);
+ if (err)
+ return __translate_error(fs, 0, err, func, line);
+
+ return 0;
+}
+
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
+# define fuse4fs_file_ino(ff, path, fp, inop) \
+ __fuse4fs_file_ino((ff), (path), (fp), (inop), __func__, __LINE__)
+#else
+# define fuse4fs_file_ino(ff, path, fp, inop) \
+ __fuse4fs_file_ino((ff), (path), NULL, (inop), __func__, __LINE__)
+#endif
+
+static int op_getattr(const char *path, struct stat *statbuf
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
+ , struct fuse_file_info *fi
+#endif
+ )
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ ext2_filsys fs;
+ ext2_ino_t ino;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ fs = fuse4fs_start(ff);
+ ret = fuse4fs_file_ino(ff, path, fi, &ino);
+ if (ret)
+ goto out;
+ ret = stat_inode(fs, ino, statbuf);
+out:
+ fuse4fs_finish(ff, ret);
+ return ret;
+}
+
+static int op_readlink(const char *path, char *buf, size_t len)
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ ext2_filsys fs;
+ errcode_t err;
+ ext2_ino_t ino;
+ struct ext2_inode inode;
+ unsigned int got;
+ ext2_file_t file;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ dbg_printf(ff, "%s: path=%s\n", __func__, path);
+ fs = fuse4fs_start(ff);
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, &ino);
+ if (err || ino == 0) {
+ ret = translate_error(fs, 0, err);
+ goto out;
+ }
+
+ err = ext2fs_read_inode(fs, ino, &inode);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out;
+ }
+
+ if (!LINUX_S_ISLNK(inode.i_mode)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ len--;
+ if (inode.i_size < len)
+ len = inode.i_size;
+ if (ext2fs_is_fast_symlink(&inode))
+ memcpy(buf, (char *)inode.i_block, len);
+ else {
+ /* big/inline symlink */
+
+ err = ext2fs_file_open(fs, ino, 0, &file);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out;
+ }
+
+ err = ext2fs_file_read(file, buf, len, &got);
+ if (err)
+ ret = translate_error(fs, ino, err);
+ else if (got != len)
+ ret = translate_error(fs, ino, EXT2_ET_INODE_CORRUPTED);
+
+ err = ext2fs_file_close(file);
+ if (ret)
+ goto out;
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out;
+ }
+ }
+ buf[len] = 0;
+
+ if (fuse4fs_is_writeable(ff)) {
+ ret = update_atime(fs, ino);
+ if (ret)
+ goto out;
+ }
+
+out:
+ fuse4fs_finish(ff, ret);
+ return ret;
+}
+
+static int __getxattr(struct fuse4fs *ff, ext2_ino_t ino, const char *name,
+ void **value, size_t *value_len)
+{
+ ext2_filsys fs = ff->fs;
+ struct ext2_xattr_handle *h;
+ errcode_t err;
+ int ret = 0;
+
+ err = ext2fs_xattrs_open(fs, ino, &h);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = ext2fs_xattrs_read(h);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out_close;
+ }
+
+ err = ext2fs_xattr_get(h, name, value, value_len);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out_close;
+ }
+
+out_close:
+ err = ext2fs_xattrs_close(&h);
+ if (err && !ret)
+ ret = translate_error(fs, ino, err);
+ return ret;
+}
+
+static int __setxattr(struct fuse4fs *ff, ext2_ino_t ino, const char *name,
+ void *value, size_t valuelen)
+{
+ ext2_filsys fs = ff->fs;
+ struct ext2_xattr_handle *h;
+ errcode_t err;
+ int ret = 0;
+
+ err = ext2fs_xattrs_open(fs, ino, &h);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = ext2fs_xattrs_read(h);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out_close;
+ }
+
+ err = ext2fs_xattr_set(h, name, value, valuelen);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out_close;
+ }
+
+out_close:
+ err = ext2fs_xattrs_close(&h);
+ if (err && !ret)
+ ret = translate_error(fs, ino, err);
+ return ret;
+}
+
+static int propagate_default_acls(struct fuse4fs *ff, ext2_ino_t parent,
+ ext2_ino_t child)
+{
+ void *def;
+ size_t deflen;
+ int ret;
+
+ if (!ff->acl)
+ return 0;
+
+ ret = __getxattr(ff, parent, XATTR_NAME_POSIX_ACL_DEFAULT, &def,
+ &deflen);
+ switch (ret) {
+ case -ENODATA:
+ case -ENOENT:
+ /* no default acl */
+ return 0;
+ case 0:
+ break;
+ default:
+ return ret;
+ }
+
+ ret = __setxattr(ff, child, XATTR_NAME_POSIX_ACL_DEFAULT, def, deflen);
+ ext2fs_free_mem(&def);
+ return ret;
+}
+
+static inline void fuse4fs_set_uid(struct ext2_inode_large *inode, uid_t uid)
+{
+ inode->i_uid = uid;
+ ext2fs_set_i_uid_high(*inode, uid >> 16);
+}
+
+static inline void fuse4fs_set_gid(struct ext2_inode_large *inode, gid_t gid)
+{
+ inode->i_gid = gid;
+ ext2fs_set_i_gid_high(*inode, gid >> 16);
+}
+
+static int fuse4fs_new_child_gid(struct fuse4fs *ff, ext2_ino_t parent,
+ gid_t *gid, int *parent_sgid)
+{
+ struct ext2_inode_large inode;
+ struct fuse_context *ctxt = fuse_get_context();
+ errcode_t err;
+
+ err = fuse4fs_read_inode(ff->fs, parent, &inode);
+ if (err)
+ return translate_error(ff->fs, parent, err);
+
+ if (inode.i_mode & S_ISGID) {
+ if (parent_sgid)
+ *parent_sgid = 1;
+ *gid = inode.i_gid;
+ } else {
+ if (parent_sgid)
+ *parent_sgid = 0;
+ *gid = ctxt->gid;
+ }
+
+ return 0;
+}
+
+/*
+ * Flush dirty data to disk if we're running in dirsync mode. If @flushed is a
+ * non-null pointer, this function sets @flushed to 1 if we decided to flush
+ * data, or 0 if not.
+ */
+static inline int fuse4fs_dirsync_flush(struct fuse4fs *ff, ext2_ino_t ino,
+ int *flushed)
+{
+ struct ext2_inode_large inode;
+ ext2_filsys fs = ff->fs;
+ errcode_t err;
+
+ if (ff->dirsync)
+ goto flush;
+
+ err = fuse4fs_read_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, 0, err);
+
+ if (inode.i_flags & EXT2_DIRSYNC_FL)
+ goto flush;
+
+ if (flushed)
+ *flushed = 0;
+ return 0;
+flush:
+ err = ext2fs_flush2(fs, 0);
+ if (err)
+ return translate_error(fs, 0, err);
+
+ if (flushed)
+ *flushed = 1;
+ return 0;
+}
+
+static void fuse4fs_set_extra_isize(struct fuse4fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode)
+{
+ ext2_filsys fs = ff->fs;
+ size_t extra = sizeof(struct ext2_inode_large) -
+ EXT2_GOOD_OLD_INODE_SIZE;
+
+ if (ext2fs_has_feature_extra_isize(fs->super)) {
+ dbg_printf(ff, "%s: ino=%u extra=%zu want=%u min=%u\n",
+ __func__, ino, extra, fs->super->s_want_extra_isize,
+ fs->super->s_min_extra_isize);
+
+ if (fs->super->s_want_extra_isize > extra)
+ extra = fs->super->s_want_extra_isize;
+ if (fs->super->s_min_extra_isize > extra)
+ extra = fs->super->s_min_extra_isize;
+ }
+
+ inode->i_extra_isize = extra;
+}
+
+static int op_mknod(const char *path, mode_t mode, dev_t dev)
+{
+ struct fuse_context *ctxt = fuse_get_context();
+ struct fuse4fs *ff = fuse4fs_get();
+ ext2_filsys fs;
+ ext2_ino_t parent, child;
+ char *temp_path;
+ errcode_t err;
+ char *node_name, a;
+ int filetype;
+ struct ext2_inode_large inode;
+ gid_t gid;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ dbg_printf(ff, "%s: path=%s mode=0%o dev=0x%x\n", __func__, path, mode,
+ (unsigned int)dev);
+ temp_path = strdup(path);
+ if (!temp_path) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ node_name = strrchr(temp_path, '/');
+ if (!node_name) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ node_name++;
+ a = *node_name;
+ *node_name = 0;
+
+ fs = fuse4fs_start(ff);
+ if (!fs_can_allocate(ff, 2)) {
+ ret = -ENOSPC;
+ goto out2;
+ }
+
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, temp_path,
+ &parent);
+ if (err) {
+ ret = translate_error(fs, 0, err);
+ goto out2;
+ }
+
+ ret = check_inum_access(ff, parent, A_OK | W_OK);
+ if (ret)
+ goto out2;
+
+ *node_name = a;
+
+ if (LINUX_S_ISCHR(mode))
+ filetype = EXT2_FT_CHRDEV;
+ else if (LINUX_S_ISBLK(mode))
+ filetype = EXT2_FT_BLKDEV;
+ else if (LINUX_S_ISFIFO(mode))
+ filetype = EXT2_FT_FIFO;
+ else if (LINUX_S_ISSOCK(mode))
+ filetype = EXT2_FT_SOCK;
+ else {
+ ret = -EINVAL;
+ goto out2;
+ }
+
+ ret = fuse4fs_new_child_gid(ff, parent, &gid, NULL);
+ if (ret)
+ goto out2;
+
+ err = ext2fs_new_inode(fs, parent, mode, 0, &child);
+ if (err) {
+ ret = translate_error(fs, 0, err);
+ goto out2;
+ }
+
+ dbg_printf(ff, "%s: create ino=%d/name=%s in dir=%d\n", __func__, child,
+ node_name, parent);
+ err = ext2fs_link(fs, parent, node_name, child,
+ filetype | EXT2FS_LINK_EXPAND);
+ if (err) {
+ ret = translate_error(fs, parent, err);
+ goto out2;
+ }
+
+ ret = update_mtime(fs, parent, NULL);
+ if (ret)
+ goto out2;
+
+ memset(&inode, 0, sizeof(inode));
+ inode.i_mode = mode;
+
+ if (dev & ~0xFFFF)
+ inode.i_block[1] = dev;
+ else
+ inode.i_block[0] = dev;
+ inode.i_links_count = 1;
+ fuse4fs_set_extra_isize(ff, child, &inode);
+ fuse4fs_set_uid(&inode, ctxt->uid);
+ fuse4fs_set_gid(&inode, gid);
+
+ err = ext2fs_write_new_inode(fs, child, EXT2_INODE(&inode));
+ if (err) {
+ ret = translate_error(fs, child, err);
+ goto out2;
+ }
+
+ inode.i_generation = ff->next_generation++;
+ init_times(&inode);
+ err = fuse4fs_write_inode(fs, child, &inode);
+ if (err) {
+ ret = translate_error(fs, child, err);
+ goto out2;
+ }
+
+ ext2fs_inode_alloc_stats2(fs, child, 1, 0);
+
+ ret = propagate_default_acls(ff, parent, child);
+ if (ret)
+ goto out2;
+
+ ret = fuse4fs_dirsync_flush(ff, parent, NULL);
+ if (ret)
+ goto out2;
+
+out2:
+ fuse4fs_finish(ff, ret);
+out:
+ free(temp_path);
+ return ret;
+}
+
+static int op_mkdir(const char *path, mode_t mode)
+{
+ struct fuse_context *ctxt = fuse_get_context();
+ struct fuse4fs *ff = fuse4fs_get();
+ ext2_filsys fs;
+ ext2_ino_t parent, child;
+ char *temp_path;
+ errcode_t err;
+ char *node_name, a;
+ struct ext2_inode_large inode;
+ char *block;
+ blk64_t blk;
+ int ret = 0;
+ gid_t gid;
+ int parent_sgid;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ dbg_printf(ff, "%s: path=%s mode=0%o\n", __func__, path, mode);
+ temp_path = strdup(path);
+ if (!temp_path) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ node_name = strrchr(temp_path, '/');
+ if (!node_name) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ node_name++;
+ a = *node_name;
+ *node_name = 0;
+
+ fs = fuse4fs_start(ff);
+ if (!fs_can_allocate(ff, 1)) {
+ ret = -ENOSPC;
+ goto out2;
+ }
+
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, temp_path,
+ &parent);
+ if (err) {
+ ret = translate_error(fs, 0, err);
+ goto out2;
+ }
+
+ ret = check_inum_access(ff, parent, A_OK | W_OK);
+ if (ret)
+ goto out2;
+
+ ret = fuse4fs_new_child_gid(ff, parent, &gid, &parent_sgid);
+ if (ret)
+ goto out2;
+
+ *node_name = a;
+
+ err = ext2fs_mkdir2(fs, parent, 0, 0, EXT2FS_LINK_EXPAND,
+ node_name, NULL);
+ if (err) {
+ ret = translate_error(fs, parent, err);
+ goto out2;
+ }
+
+ ret = update_mtime(fs, parent, NULL);
+ if (ret)
+ goto out2;
+
+ /* Still have to update the uid/gid of the dir */
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, temp_path,
+ &child);
+ if (err) {
+ ret = translate_error(fs, 0, err);
+ goto out2;
+ }
+ dbg_printf(ff, "%s: created ino=%d/path=%s in dir=%d\n", __func__, child,
+ node_name, parent);
+
+ err = fuse4fs_read_inode(fs, child, &inode);
+ if (err) {
+ ret = translate_error(fs, child, err);
+ goto out2;
+ }
+
+ fuse4fs_set_extra_isize(ff, child, &inode);
+ fuse4fs_set_uid(&inode, ctxt->uid);
+ fuse4fs_set_gid(&inode, gid);
+ inode.i_mode = LINUX_S_IFDIR | (mode & ~S_ISUID);
+ if (parent_sgid)
+ inode.i_mode |= S_ISGID;
+ inode.i_generation = ff->next_generation++;
+ init_times(&inode);
+
+ err = fuse4fs_write_inode(fs, child, &inode);
+ if (err) {
+ ret = translate_error(fs, child, err);
+ goto out2;
+ }
+
+ /* Rewrite the directory block checksum, having set i_generation */
+ if ((inode.i_flags & EXT4_INLINE_DATA_FL) ||
+ !ext2fs_has_feature_metadata_csum(fs->super))
+ goto out2;
+ err = ext2fs_new_dir_block(fs, child, parent, &block);
+ if (err) {
+ ret = translate_error(fs, child, err);
+ goto out2;
+ }
+ err = ext2fs_bmap2(fs, child, EXT2_INODE(&inode), NULL, 0, 0,
+ NULL, &blk);
+ if (err) {
+ ret = translate_error(fs, child, err);
+ goto out3;
+ }
+ err = ext2fs_write_dir_block4(fs, blk, block, 0, child);
+ if (err) {
+ ret = translate_error(fs, child, err);
+ goto out3;
+ }
+
+ ret = propagate_default_acls(ff, parent, child);
+ if (ret)
+ goto out3;
+
+ ret = fuse4fs_dirsync_flush(ff, parent, NULL);
+ if (ret)
+ goto out3;
+
+out3:
+ ext2fs_free_mem(&block);
+out2:
+ fuse4fs_finish(ff, ret);
+out:
+ free(temp_path);
+ return ret;
+}
+
+static int fuse4fs_unlink(struct fuse4fs *ff, const char *path,
+ ext2_ino_t *parent)
+{
+ ext2_filsys fs = ff->fs;
+ errcode_t err;
+ ext2_ino_t dir;
+ char *filename = strdup(path);
+ char *base_name;
+ int ret;
+
+ if (!filename)
+ return -ENOMEM;
+
+ base_name = strrchr(filename, '/');
+ if (base_name) {
+ *base_name++ = '\0';
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, filename,
+ &dir);
+ if (err) {
+ free(filename);
+ return translate_error(fs, 0, err);
+ }
+ } else {
+ dir = EXT2_ROOT_INO;
+ base_name = filename;
+ }
+
+ ret = check_inum_access(ff, dir, W_OK);
+ if (ret) {
+ free(filename);
+ return ret;
+ }
+
+ dbg_printf(ff, "%s: unlinking name=%s from dir=%d\n", __func__,
+ base_name, dir);
+ err = ext2fs_unlink(fs, dir, base_name, 0, 0);
+ free(filename);
+ if (err)
+ return translate_error(fs, dir, err);
+
+ ret = update_mtime(fs, dir, NULL);
+ if (ret)
+ return ret;
+
+ if (parent)
+ *parent = dir;
+ return 0;
+}
+
+static int remove_ea_inodes(struct fuse4fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode)
+{
+ ext2_filsys fs = ff->fs;
+ struct ext2_xattr_handle *h;
+ errcode_t err;
+ int ret = 0;
+
+ /*
+ * The xattr handle maintains its own private copy of the inode, so
+ * write ours to disk so that we can read it.
+ */
+ err = fuse4fs_write_inode(fs, ino, inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = ext2fs_xattrs_open(fs, ino, &h);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = ext2fs_xattrs_read(h);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out_close;
+ }
+
+ err = ext2fs_xattr_remove_all(h);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out_close;
+ }
+
+out_close:
+ ext2fs_xattrs_close(&h);
+ if (ret)
+ return ret;
+
+ /* Now read the inode back in. */
+ err = fuse4fs_read_inode(fs, ino, inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ return 0;
+}
+
+static int remove_inode(struct fuse4fs *ff, ext2_ino_t ino)
+{
+ ext2_filsys fs = ff->fs;
+ errcode_t err;
+ struct ext2_inode_large inode;
+ int ret = 0;
+
+ err = fuse4fs_read_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ dbg_printf(ff, "%s: put ino=%d links=%d\n", __func__, ino,
+ inode.i_links_count);
+
+ switch (inode.i_links_count) {
+ case 0:
+ return 0; /* XXX: already done? */
+ case 1:
+ inode.i_links_count--;
+ ext2fs_set_dtime(fs, EXT2_INODE(&inode));
+ break;
+ default:
+ inode.i_links_count--;
+ }
+
+ ret = update_ctime(fs, ino, &inode);
+ if (ret)
+ return ret;
+
+ if (inode.i_links_count)
+ goto write_out;
+
+ if (ext2fs_has_feature_ea_inode(fs->super)) {
+ ret = remove_ea_inodes(ff, ino, &inode);
+ if (ret)
+ return ret;
+ }
+
+ /* Nobody holds this file; free its blocks! */
+ err = ext2fs_free_ext_attr(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ if (ext2fs_inode_has_valid_blocks2(fs, EXT2_INODE(&inode))) {
+ err = ext2fs_punch(fs, ino, EXT2_INODE(&inode), NULL,
+ 0, ~0ULL);
+ if (err)
+ return translate_error(fs, ino, err);
+ }
+
+ ext2fs_inode_alloc_stats2(fs, ino, -1,
+ LINUX_S_ISDIR(inode.i_mode));
+
+write_out:
+ err = fuse4fs_write_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ return 0;
+}
+
+static int __op_unlink(struct fuse4fs *ff, const char *path)
+{
+ ext2_filsys fs = ff->fs;
+ ext2_ino_t parent, ino;
+ errcode_t err;
+ int ret = 0;
+
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, &ino);
+ if (err) {
+ ret = translate_error(fs, 0, err);
+ goto out;
+ }
+
+ ret = check_inum_access(ff, ino, W_OK);
+ if (ret)
+ goto out;
+
+ ret = fuse4fs_unlink(ff, path, &parent);
+ if (ret)
+ goto out;
+
+ ret = remove_inode(ff, ino);
+ if (ret)
+ goto out;
+
+ ret = fuse4fs_dirsync_flush(ff, parent, NULL);
+ if (ret)
+ goto out;
+
+out:
+ return ret;
+}
+
+static int op_unlink(const char *path)
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ int ret;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ fuse4fs_start(ff);
+ ret = __op_unlink(ff, path);
+ fuse4fs_finish(ff, ret);
+ return ret;
+}
+
+struct rd_struct {
+ ext2_ino_t parent;
+ int empty;
+};
+
+static int rmdir_proc(ext2_ino_t dir EXT2FS_ATTR((unused)),
+ int entry EXT2FS_ATTR((unused)),
+ struct ext2_dir_entry *dirent,
+ int offset EXT2FS_ATTR((unused)),
+ int blocksize EXT2FS_ATTR((unused)),
+ char *buf EXT2FS_ATTR((unused)),
+ void *private)
+{
+ struct rd_struct *rds = (struct rd_struct *) private;
+
+ if (dirent->inode == 0)
+ return 0;
+ if (((dirent->name_len & 0xFF) == 1) && (dirent->name[0] == '.'))
+ return 0;
+ if (((dirent->name_len & 0xFF) == 2) && (dirent->name[0] == '.') &&
+ (dirent->name[1] == '.')) {
+ rds->parent = dirent->inode;
+ return 0;
+ }
+ rds->empty = 0;
+ return 0;
+}
+
+static int __op_rmdir(struct fuse4fs *ff, const char *path)
+{
+ ext2_filsys fs = ff->fs;
+ ext2_ino_t parent, child;
+ errcode_t err;
+ struct ext2_inode_large inode;
+ struct rd_struct rds;
+ int ret = 0;
+
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, &child);
+ if (err) {
+ ret = translate_error(fs, 0, err);
+ goto out;
+ }
+ dbg_printf(ff, "%s: rmdir path=%s ino=%d\n", __func__, path, child);
+
+ ret = check_inum_access(ff, child, W_OK);
+ if (ret)
+ goto out;
+
+ rds.parent = 0;
+ rds.empty = 1;
+
+ err = ext2fs_dir_iterate2(fs, child, 0, 0, rmdir_proc, &rds);
+ if (err) {
+ ret = translate_error(fs, child, err);
+ goto out;
+ }
+
+ /* the kernel checks parent permissions before emptiness */
+ if (rds.parent == 0) {
+ ret = translate_error(fs, child, EXT2_ET_FILESYSTEM_CORRUPTED);
+ goto out;
+ }
+
+ ret = check_inum_access(ff, rds.parent, W_OK);
+ if (ret)
+ goto out;
+
+ if (rds.empty == 0) {
+ ret = -ENOTEMPTY;
+ goto out;
+ }
+
+ ret = fuse4fs_unlink(ff, path, &parent);
+ if (ret)
+ goto out;
+ /*
+ * An empty directory has two links (the parent's dirent and its own
+ * "." entry), so it has to be "removed" twice to free the inode.
+ */
+ ret = remove_inode(ff, child);
+ if (ret)
+ goto out;
+ ret = remove_inode(ff, child);
+ if (ret)
+ goto out;
+
+ if (rds.parent) {
+ dbg_printf(ff, "%s: decr dir=%d link count\n", __func__,
+ rds.parent);
+ err = fuse4fs_read_inode(fs, rds.parent, &inode);
+ if (err) {
+ ret = translate_error(fs, rds.parent, err);
+ goto out;
+ }
+ if (inode.i_links_count > 1)
+ inode.i_links_count--;
+ ret = update_mtime(fs, rds.parent, &inode);
+ if (ret)
+ goto out;
+ err = fuse4fs_write_inode(fs, rds.parent, &inode);
+ if (err) {
+ ret = translate_error(fs, rds.parent, err);
+ goto out;
+ }
+ }
+
+ ret = fuse4fs_dirsync_flush(ff, parent, NULL);
+ if (ret)
+ goto out;
+
+out:
+ return ret;
+}
+
+static int op_rmdir(const char *path)
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ int ret;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ fuse4fs_start(ff);
+ ret = __op_rmdir(ff, path);
+ fuse4fs_finish(ff, ret);
+ return ret;
+}
+
+static int op_symlink(const char *src, const char *dest)
+{
+ struct fuse_context *ctxt = fuse_get_context();
+ struct fuse4fs *ff = fuse4fs_get();
+ ext2_filsys fs;
+ ext2_ino_t parent, child;
+ char *temp_path;
+ errcode_t err;
+ char *node_name, a;
+ struct ext2_inode_large inode;
+ gid_t gid;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ dbg_printf(ff, "%s: symlink %s to %s\n", __func__, src, dest);
+ temp_path = strdup(dest);
+ if (!temp_path) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ node_name = strrchr(temp_path, '/');
+ if (!node_name) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ node_name++;
+ a = *node_name;
+ *node_name = 0;
+
+ fs = fuse4fs_start(ff);
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, temp_path,
+ &parent);
+ *node_name = a;
+ if (err) {
+ ret = translate_error(fs, 0, err);
+ goto out2;
+ }
+
+ ret = check_inum_access(ff, parent, A_OK | W_OK);
+ if (ret)
+ goto out2;
+
+ ret = fuse4fs_new_child_gid(ff, parent, &gid, NULL);
+ if (ret)
+ goto out2;
+
+ /* Create symlink */
+ err = ext2fs_symlink(fs, parent, 0, node_name, src);
+ if (err == EXT2_ET_DIR_NO_SPACE) {
+ err = ext2fs_expand_dir(fs, parent);
+ if (err) {
+ ret = translate_error(fs, parent, err);
+ goto out2;
+ }
+
+ err = ext2fs_symlink(fs, parent, 0, node_name, src);
+ }
+ if (err) {
+ ret = translate_error(fs, parent, err);
+ goto out2;
+ }
+
+ /* Update parent dir's mtime */
+ ret = update_mtime(fs, parent, NULL);
+ if (ret)
+ goto out2;
+
+ /* Still have to update the uid/gid of the symlink */
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, temp_path,
+ &child);
+ if (err) {
+ ret = translate_error(fs, 0, err);
+ goto out2;
+ }
+ dbg_printf(ff, "%s: symlinking ino=%d/name=%s to dir=%d\n", __func__,
+ child, node_name, parent);
+
+ err = fuse4fs_read_inode(fs, child, &inode);
+ if (err) {
+ ret = translate_error(fs, child, err);
+ goto out2;
+ }
+
+ fuse4fs_set_extra_isize(ff, child, &inode);
+ fuse4fs_set_uid(&inode, ctxt->uid);
+ fuse4fs_set_gid(&inode, gid);
+ inode.i_generation = ff->next_generation++;
+ init_times(&inode);
+
+ err = fuse4fs_write_inode(fs, child, &inode);
+ if (err) {
+ ret = translate_error(fs, child, err);
+ goto out2;
+ }
+
+ ret = fuse4fs_dirsync_flush(ff, parent, NULL);
+ if (ret)
+ goto out2;
+
+out2:
+ fuse4fs_finish(ff, ret);
+out:
+ free(temp_path);
+ return ret;
+}
+
+struct update_dotdot {
+ ext2_ino_t new_dotdot;
+};
+
+static int update_dotdot_helper(ext2_ino_t dir EXT2FS_ATTR((unused)),
+ int entry EXT2FS_ATTR((unused)),
+ struct ext2_dir_entry *dirent,
+ int offset EXT2FS_ATTR((unused)),
+ int blocksize EXT2FS_ATTR((unused)),
+ char *buf EXT2FS_ATTR((unused)),
+ void *priv_data)
+{
+ struct update_dotdot *ud = priv_data;
+
+ if (ext2fs_dirent_name_len(dirent) == 2 &&
+ dirent->name[0] == '.' && dirent->name[1] == '.') {
+ dirent->inode = ud->new_dotdot;
+ return DIRENT_CHANGED | DIRENT_ABORT;
+ }
+
+ return 0;
+}
+
+static int op_rename(const char *from, const char *to
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
+ , unsigned int flags EXT2FS_ATTR((unused))
+#endif
+ )
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ ext2_filsys fs;
+ errcode_t err;
+ ext2_ino_t from_ino, to_ino, to_dir_ino, from_dir_ino;
+ char *temp_to = NULL, *temp_from = NULL;
+ char *cp, a;
+ struct ext2_inode inode;
+ struct update_dotdot ud;
+ int flushed = 0;
+ int ret = 0;
+
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
+ /* renameat2 is not supported */
+ if (flags)
+ return -ENOSYS;
+#endif
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ dbg_printf(ff, "%s: renaming %s to %s\n", __func__, from, to);
+ fs = fuse4fs_start(ff);
+ if (!fs_can_allocate(ff, 5)) {
+ ret = -ENOSPC;
+ goto out;
+ }
+
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, from, &from_ino);
+ if (err || from_ino == 0) {
+ ret = translate_error(fs, 0, err);
+ goto out;
+ }
+
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, to, &to_ino);
+ if (err && err != EXT2_ET_FILE_NOT_FOUND) {
+ ret = translate_error(fs, 0, err);
+ goto out;
+ }
+
+ if (err == EXT2_ET_FILE_NOT_FOUND)
+ to_ino = 0;
+
+ /* Already the same file? */
+ if (to_ino != 0 && to_ino == from_ino) {
+ ret = 0;
+ goto out;
+ }
+
+ ret = check_inum_access(ff, from_ino, W_OK);
+ if (ret)
+ goto out;
+
+ if (to_ino) {
+ ret = check_inum_access(ff, to_ino, W_OK);
+ if (ret)
+ goto out;
+ }
+
+ temp_to = strdup(to);
+ if (!temp_to) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ temp_from = strdup(from);
+ if (!temp_from) {
+ ret = -ENOMEM;
+ goto out2;
+ }
+
+ /* Find parent dir of the source and check write access */
+ cp = strrchr(temp_from, '/');
+ if (!cp) {
+ ret = -EINVAL;
+ goto out2;
+ }
+
+ a = *(cp + 1);
+ *(cp + 1) = 0;
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, temp_from,
+ &from_dir_ino);
+ *(cp + 1) = a;
+ if (err) {
+ ret = translate_error(fs, 0, err);
+ goto out2;
+ }
+ if (from_dir_ino == 0) {
+ ret = -ENOENT;
+ goto out2;
+ }
+
+ ret = check_inum_access(ff, from_dir_ino, W_OK);
+ if (ret)
+ goto out2;
+
+ /* Find parent dir of the destination and check write access */
+ cp = strrchr(temp_to, '/');
+ if (!cp) {
+ ret = -EINVAL;
+ goto out2;
+ }
+
+ a = *(cp + 1);
+ *(cp + 1) = 0;
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, temp_to,
+ &to_dir_ino);
+ *(cp + 1) = a;
+ if (err) {
+ ret = translate_error(fs, 0, err);
+ goto out2;
+ }
+ if (to_dir_ino == 0) {
+ ret = -ENOENT;
+ goto out2;
+ }
+
+ ret = check_inum_access(ff, to_dir_ino, W_OK);
+ if (ret)
+ goto out2;
+
+ /* If the target exists, unlink it first */
+ if (to_ino != 0) {
+ err = ext2fs_read_inode(fs, to_ino, &inode);
+ if (err) {
+ ret = translate_error(fs, to_ino, err);
+ goto out2;
+ }
+
+ dbg_printf(ff, "%s: unlinking %s ino=%d\n", __func__,
+ LINUX_S_ISDIR(inode.i_mode) ? "dir" : "file",
+ to_ino);
+ if (LINUX_S_ISDIR(inode.i_mode))
+ ret = __op_rmdir(ff, to);
+ else
+ ret = __op_unlink(ff, to);
+ if (ret)
+ goto out2;
+ }
+
+ /* Get ready to do the move */
+ err = ext2fs_read_inode(fs, from_ino, &inode);
+ if (err) {
+ ret = translate_error(fs, from_ino, err);
+ goto out2;
+ }
+
+ /* Link in the new file */
+ dbg_printf(ff, "%s: linking ino=%d/path=%s to dir=%d\n", __func__,
+ from_ino, cp + 1, to_dir_ino);
+ err = ext2fs_link(fs, to_dir_ino, cp + 1, from_ino,
+ ext2_file_type(inode.i_mode) | EXT2FS_LINK_EXPAND);
+ if (err) {
+ ret = translate_error(fs, to_dir_ino, err);
+ goto out2;
+ }
+
+ /* Update '..' pointer if dir */
+ err = ext2fs_read_inode(fs, from_ino, &inode);
+ if (err) {
+ ret = translate_error(fs, from_ino, err);
+ goto out2;
+ }
+
+ if (LINUX_S_ISDIR(inode.i_mode)) {
+ ud.new_dotdot = to_dir_ino;
+ dbg_printf(ff, "%s: updating .. entry for dir=%d\n", __func__,
+ to_dir_ino);
+ err = ext2fs_dir_iterate2(fs, from_ino, 0, NULL,
+ update_dotdot_helper, &ud);
+ if (err) {
+ ret = translate_error(fs, from_ino, err);
+ goto out2;
+ }
+
+ /* Decrease from_dir_ino's links_count */
+ dbg_printf(ff, "%s: moving linkcount from dir=%d to dir=%d\n",
+ __func__, from_dir_ino, to_dir_ino);
+ err = ext2fs_read_inode(fs, from_dir_ino, &inode);
+ if (err) {
+ ret = translate_error(fs, from_dir_ino, err);
+ goto out2;
+ }
+ inode.i_links_count--;
+ err = ext2fs_write_inode(fs, from_dir_ino, &inode);
+ if (err) {
+ ret = translate_error(fs, from_dir_ino, err);
+ goto out2;
+ }
+
+ /* Increase to_dir_ino's links_count */
+ err = ext2fs_read_inode(fs, to_dir_ino, &inode);
+ if (err) {
+ ret = translate_error(fs, to_dir_ino, err);
+ goto out2;
+ }
+ inode.i_links_count++;
+ err = ext2fs_write_inode(fs, to_dir_ino, &inode);
+ if (err) {
+ ret = translate_error(fs, to_dir_ino, err);
+ goto out2;
+ }
+ }
+
+ /* Update timestamps */
+ ret = update_ctime(fs, from_ino, NULL);
+ if (ret)
+ goto out2;
+
+ ret = update_mtime(fs, to_dir_ino, NULL);
+ if (ret)
+ goto out2;
+
+ /* Remove the old file */
+ ret = fuse4fs_unlink(ff, from, NULL);
+ if (ret)
+ goto out2;
+
+ ret = fuse4fs_dirsync_flush(ff, from_dir_ino, &flushed);
+ if (ret)
+ goto out2;
+
+ if (from_dir_ino != to_dir_ino && !flushed) {
+ ret = fuse4fs_dirsync_flush(ff, to_dir_ino, NULL);
+ if (ret)
+ goto out2;
+ }
+
+out2:
+ free(temp_from);
+ free(temp_to);
+out:
+ fuse4fs_finish(ff, ret);
+ return ret;
+}
+
+static int op_link(const char *src, const char *dest)
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ ext2_filsys fs;
+ char *temp_path;
+ errcode_t err;
+ char *node_name, a;
+ ext2_ino_t parent, ino;
+ struct ext2_inode_large inode;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ dbg_printf(ff, "%s: src=%s dest=%s\n", __func__, src, dest);
+ temp_path = strdup(dest);
+ if (!temp_path) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ node_name = strrchr(temp_path, '/');
+ if (!node_name) {
+ ret = -EINVAL;
+ goto out;
+ }
+ node_name++;
+ a = *node_name;
+ *node_name = 0;
+
+ fs = fuse4fs_start(ff);
+ if (!fs_can_allocate(ff, 2)) {
+ ret = -ENOSPC;
+ goto out2;
+ }
+
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, temp_path,
+ &parent);
+ *node_name = a;
+ if (err) {
+ ret = -ENOENT;
+ goto out2;
+ }
+
+ ret = check_inum_access(ff, parent, A_OK | W_OK);
+ if (ret)
+ goto out2;
+
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, src, &ino);
+ if (err || ino == 0) {
+ ret = translate_error(fs, 0, err);
+ goto out2;
+ }
+
+ err = fuse4fs_read_inode(fs, ino, &inode);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out2;
+ }
+
+ ret = check_iflags_access(ff, ino, EXT2_INODE(&inode), W_OK);
+ if (ret)
+ goto out2;
+
+ inode.i_links_count++;
+ ret = update_ctime(fs, ino, &inode);
+ if (ret)
+ goto out2;
+
+ err = fuse4fs_write_inode(fs, ino, &inode);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out2;
+ }
+
+ dbg_printf(ff, "%s: linking ino=%d/name=%s to dir=%d\n", __func__, ino,
+ node_name, parent);
+ err = ext2fs_link(fs, parent, node_name, ino,
+ ext2_file_type(inode.i_mode) | EXT2FS_LINK_EXPAND);
+ if (err) {
+ ret = translate_error(fs, parent, err);
+ goto out2;
+ }
+
+ ret = update_mtime(fs, parent, NULL);
+ if (ret)
+ goto out2;
+
+ ret = fuse4fs_dirsync_flush(ff, parent, NULL);
+ if (ret)
+ goto out2;
+
+out2:
+ fuse4fs_finish(ff, ret);
+out:
+ free(temp_path);
+ return ret;
+}
+
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
+/* Obtain the group ids of the process that initiated the fuse request */
+static int get_req_groups(struct fuse4fs *ff, gid_t **gids, size_t *nr_gids)
+{
+ ext2_filsys fs = ff->fs;
+ errcode_t err;
+ gid_t *array;
+ int nr = 32; /* nobody has more than 32 groups right? */
+ int ret;
+
+ do {
+ err = ext2fs_get_array(nr, sizeof(gid_t), &array);
+ if (err)
+ return translate_error(fs, 0, err);
+
+ ret = fuse_getgroups(nr, array);
+ if (ret < 0) {
+ /*
+ * If there's an error, we failed to find the group
+ * membership of the process that initiated the file
+ * change, either because the process went away or
+ * because there's no Linux procfs. Regardless of the
+ * cause, we return -ENOENT.
+ */
+ ext2fs_free_mem(&array);
+ return -ENOENT;
+ }
+
+ if (ret <= nr) {
+ *gids = array;
+ *nr_gids = ret;
+ return 0;
+ }
+
+ ext2fs_free_mem(&array);
+ nr = ret;
+ } while (1);
+
+ /* shut up gcc */
+ return -ENOMEM;
+}
+
+/*
+ * Is this file's group id in the set of groups associated with the process
+ * that initiated the fuse request? Returns 1 for yes, 0 for no, or a negative
+ * errno.
+ */
+static int in_file_group(struct fuse_context *ctxt,
+ const struct ext2_inode_large *inode)
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ gid_t *gids = NULL;
+ size_t i, nr_gids = 0;
+ gid_t gid = inode_gid(*inode);
+ int ret;
+
+ ret = get_req_groups(ff, &gids, &nr_gids);
+ if (ret == -ENOENT) {
+ /* magic return code for "could not get caller group info" */
+ return ctxt->gid == inode_gid(*inode);
+ }
+ if (ret < 0)
+ return ret;
+
+ ret = 0;
+ for (i = 0; i < nr_gids; i++) {
+ if (gids[i] == gid) {
+ ret = 1;
+ break;
+ }
+ }
+
+ ext2fs_free_mem(&gids);
+ return ret;
+}
+#else
+static int in_file_group(struct fuse_context *ctxt,
+ const struct ext2_inode_large *inode)
+{
+ return ctxt->gid == inode_gid(*inode);
+}
+#endif
+
+static int op_chmod(const char *path, mode_t mode
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
+ , struct fuse_file_info *fi
+#endif
+ )
+{
+ struct fuse_context *ctxt = fuse_get_context();
+ struct fuse4fs *ff = fuse4fs_get();
+ ext2_filsys fs;
+ errcode_t err;
+ ext2_ino_t ino;
+ struct ext2_inode_large inode;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ fs = fuse4fs_start(ff);
+ ret = fuse4fs_file_ino(ff, path, fi, &ino);
+ if (ret)
+ goto out;
+ dbg_printf(ff, "%s: path=%s mode=0%o ino=%d\n", __func__, path, mode, ino);
+
+ err = fuse4fs_read_inode(fs, ino, &inode);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out;
+ }
+
+ ret = check_iflags_access(ff, ino, EXT2_INODE(&inode), W_OK);
+ if (ret)
+ goto out;
+
+ if (want_check_owner(ff, ctxt) && ctxt->uid != inode_uid(inode)) {
+ ret = -EPERM;
+ goto out;
+ }
+
+ /*
+ * Clear the setgid bit unless the caller belongs to the file's group.
+ * in_file_group() consults the caller's supplementary groups when
+ * libfuse can report them, and falls back to the primary gid
+ * otherwise.
+ */
+ if (!is_superuser(ff, ctxt)) {
+ ret = in_file_group(ctxt, &inode);
+ if (ret < 0)
+ goto out;
+
+ if (!ret)
+ mode &= ~S_ISGID;
+ }
+
+ inode.i_mode &= ~0xFFF;
+ inode.i_mode |= mode & 0xFFF;
+
+ dbg_printf(ff, "%s: path=%s new_mode=0%o ino=%d\n", __func__,
+ path, inode.i_mode, ino);
+
+ ret = update_ctime(fs, ino, &inode);
+ if (ret)
+ goto out;
+
+ err = fuse4fs_write_inode(fs, ino, &inode);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out;
+ }
+
+out:
+ fuse4fs_finish(ff, ret);
+ return ret;
+}
+
+static int op_chown(const char *path, uid_t owner, gid_t group
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
+ , struct fuse_file_info *fi
+#endif
+ )
+{
+ struct fuse_context *ctxt = fuse_get_context();
+ struct fuse4fs *ff = fuse4fs_get();
+ ext2_filsys fs;
+ errcode_t err;
+ ext2_ino_t ino;
+ struct ext2_inode_large inode;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ fs = fuse4fs_start(ff);
+ ret = fuse4fs_file_ino(ff, path, fi, &ino);
+ if (ret)
+ goto out;
+ dbg_printf(ff, "%s: path=%s owner=%d group=%d ino=%d\n", __func__,
+ path, owner, group, ino);
+
+ err = fuse4fs_read_inode(fs, ino, &inode);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out;
+ }
+
+ ret = check_iflags_access(ff, ino, EXT2_INODE(&inode), W_OK);
+ if (ret)
+ goto out;
+
+ /* FUSE seems to feed us ~0 to mean "don't change" */
+ if (owner != (uid_t) ~0) {
+ /* Only root gets to change UID. */
+ if (want_check_owner(ff, ctxt) &&
+ !(inode_uid(inode) == ctxt->uid && owner == ctxt->uid)) {
+ ret = -EPERM;
+ goto out;
+ }
+ fuse4fs_set_uid(&inode, owner);
+ }
+
+ if (group != (gid_t) ~0) {
+ /* Only root or the owner get to change GID. */
+ if (want_check_owner(ff, ctxt) &&
+ inode_uid(inode) != ctxt->uid) {
+ ret = -EPERM;
+ goto out;
+ }
+
+ /* XXX: We /should/ check the caller's group membership here. */
+ fuse4fs_set_gid(&inode, group);
+ }
+
+ ret = update_ctime(fs, ino, &inode);
+ if (ret)
+ goto out;
+
+ err = fuse4fs_write_inode(fs, ino, &inode);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out;
+ }
+
+out:
+ fuse4fs_finish(ff, ret);
+ return ret;
+}
+
+static int fuse4fs_punch_posteof(struct fuse4fs *ff, ext2_ino_t ino,
+ off_t new_size)
+{
+ ext2_filsys fs = ff->fs;
+ struct ext2_inode_large inode;
+ blk64_t truncate_block = FUSE4FS_B_TO_FSB(ff, new_size);
+ errcode_t err;
+
+ err = fuse4fs_read_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = ext2fs_punch(fs, ino, EXT2_INODE(&inode), 0, truncate_block,
+ ~0ULL);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = fuse4fs_write_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ return 0;
+}
+
+static int fuse4fs_truncate(struct fuse4fs *ff, ext2_ino_t ino, off_t new_size)
+{
+ ext2_filsys fs = ff->fs;
+ ext2_file_t file;
+ __u64 old_isize;
+ errcode_t err;
+ int ret = 0;
+
+ err = ext2fs_file_open(fs, ino, EXT2_FILE_WRITE, &file);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = ext2fs_file_get_lsize(file, &old_isize);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out_close;
+ }
+
+ dbg_printf(ff, "%s: ino=%u isize=0x%llx new_size=0x%llx\n", __func__,
+ ino,
+ (unsigned long long)old_isize,
+ (unsigned long long)new_size);
+
+ err = ext2fs_file_set_size2(file, new_size);
+ if (err)
+ ret = translate_error(fs, ino, err);
+
+out_close:
+ err = ext2fs_file_close(file);
+ if (ret)
+ return ret;
+ if (err)
+ return translate_error(fs, ino, err);
+
+ ret = update_mtime(fs, ino, NULL);
+ if (ret)
+ return ret;
+
+ /*
+ * Truncating to the current size is usually understood to mean that
+ * we should clear out post-EOF preallocations.
+ */
+ if (new_size == old_isize)
+ return fuse4fs_punch_posteof(ff, ino, new_size);
+
+ return 0;
+}
+
+static int op_truncate(const char *path, off_t len
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
+ , struct fuse_file_info *fi
+#endif
+ )
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ ext2_ino_t ino;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ fuse4fs_start(ff);
+ ret = fuse4fs_file_ino(ff, path, fi, &ino);
+ if (ret)
+ goto out;
+ dbg_printf(ff, "%s: ino=%d len=%jd\n", __func__, ino, (intmax_t) len);
+
+ ret = check_inum_access(ff, ino, W_OK);
+ if (ret)
+ goto out;
+
+ ret = fuse4fs_truncate(ff, ino, len);
+ if (ret)
+ goto out;
+
+out:
+ fuse4fs_finish(ff, ret);
+ return ret;
+}
+
+#ifdef __linux__
+static void detect_linux_executable_open(int kernel_flags, int *access_check,
+ int *e2fs_open_flags)
+{
+ /*
+ * On Linux, execve will bleed __FMODE_EXEC into the file mode flags,
+ * and FUSE is more than happy to let that slip through.
+ */
+ if (kernel_flags & 0x20) {
+ *access_check = X_OK;
+ *e2fs_open_flags &= ~EXT2_FILE_WRITE;
+ }
+}
+#else
+static void detect_linux_executable_open(int kernel_flags, int *access_check,
+ int *e2fs_open_flags)
+{
+ /* empty */
+}
+#endif /* __linux__ */
+
+static int __op_open(struct fuse4fs *ff, const char *path,
+ struct fuse_file_info *fp)
+{
+ ext2_filsys fs = ff->fs;
+ errcode_t err;
+ struct fuse4fs_file_handle *file;
+ int check = 0, ret = 0;
+
+ dbg_printf(ff, "%s: path=%s oflags=0o%o\n", __func__, path, fp->flags);
+ err = ext2fs_get_mem(sizeof(*file), &file);
+ if (err)
+ return translate_error(fs, 0, err);
+ file->magic = FUSE4FS_FILE_MAGIC;
+
+ file->open_flags = 0;
+ switch (fp->flags & O_ACCMODE) {
+ case O_RDONLY:
+ check = R_OK;
+ break;
+ case O_WRONLY:
+ check = W_OK;
+ file->open_flags |= EXT2_FILE_WRITE;
+ break;
+ case O_RDWR:
+ check = R_OK | W_OK;
+ file->open_flags |= EXT2_FILE_WRITE;
+ break;
+ }
+
+ /*
+ * If the caller wants to truncate the file, we need to ask for full
+ * write access even if the caller claims to be appending.
+ */
+ if ((fp->flags & O_APPEND) && !(fp->flags & O_TRUNC))
+ check |= A_OK;
+
+ detect_linux_executable_open(fp->flags, &check, &file->open_flags);
+
+ if (fp->flags & O_CREAT)
+ file->open_flags |= EXT2_FILE_CREATE;
+
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, &file->ino);
+ if (err || file->ino == 0) {
+ ret = translate_error(fs, 0, err);
+ goto out;
+ }
+ dbg_printf(ff, "%s: ino=%d\n", __func__, file->ino);
+
+ ret = check_inum_access(ff, file->ino, check);
+ if (ret) {
+ /*
+ * In a regular (Linux) fs driver, the kernel will open
+ * binaries for reading if the user has --x privileges (i.e.
+ * execute without read). Since the kernel doesn't have any
+ * way to tell us if it's opening a file via execve, we'll
+ * just assume that allowing access is ok if asking for ro mode
+ * fails but asking for x mode succeeds. Of course we can
+ * also employ undocumented hacks (see above).
+ */
+ if (check == R_OK) {
+ ret = check_inum_access(ff, file->ino, X_OK);
+ if (ret)
+ goto out;
+ } else
+ goto out;
+ }
+
+ if (fp->flags & O_TRUNC) {
+ ret = fuse4fs_truncate(ff, file->ino, 0);
+ if (ret)
+ goto out;
+ }
+
+ fuse4fs_set_handle(fp, file);
+
+out:
+ if (ret)
+ ext2fs_free_mem(&file);
+ return ret;
+}
+
+static int op_open(const char *path, struct fuse_file_info *fp)
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ int ret;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ fuse4fs_start(ff);
+ ret = __op_open(ff, path, fp);
+ fuse4fs_finish(ff, ret);
+ return ret;
+}
+
+static int op_read(const char *path EXT2FS_ATTR((unused)), char *buf,
+ size_t len, off_t offset,
+ struct fuse_file_info *fp)
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
+ ext2_filsys fs;
+ ext2_file_t efp;
+ errcode_t err;
+ unsigned int got = 0;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ FUSE4FS_CHECK_HANDLE(ff, fh);
+ dbg_printf(ff, "%s: ino=%d off=0x%llx len=0x%zx\n", __func__, fh->ino,
+ (unsigned long long)offset, len);
+ fs = fuse4fs_start(ff);
+ err = ext2fs_file_open(fs, fh->ino, fh->open_flags, &efp);
+ if (err) {
+ ret = translate_error(fs, fh->ino, err);
+ goto out;
+ }
+
+ err = ext2fs_file_llseek(efp, offset, SEEK_SET, NULL);
+ if (err) {
+ ret = translate_error(fs, fh->ino, err);
+ goto out2;
+ }
+
+ err = ext2fs_file_read(efp, buf, len, &got);
+ if (err) {
+ ret = translate_error(fs, fh->ino, err);
+ goto out2;
+ }
+
+out2:
+ err = ext2fs_file_close(efp);
+ if (ret)
+ goto out;
+ if (err) {
+ ret = translate_error(fs, fh->ino, err);
+ goto out;
+ }
+
+ if (fuse4fs_is_writeable(ff)) {
+ ret = update_atime(fs, fh->ino);
+ if (ret)
+ goto out;
+ }
+out:
+ fuse4fs_finish(ff, ret);
+ return got ? (int) got : ret;
+}
+
+static int op_write(const char *path EXT2FS_ATTR((unused)),
+ const char *buf, size_t len, off_t offset,
+ struct fuse_file_info *fp)
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
+ ext2_filsys fs;
+ ext2_file_t efp;
+ errcode_t err;
+ unsigned int got = 0;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ FUSE4FS_CHECK_HANDLE(ff, fh);
+ dbg_printf(ff, "%s: ino=%d off=0x%llx len=0x%zx\n", __func__, fh->ino,
+ (unsigned long long) offset, len);
+ fs = fuse4fs_start(ff);
+ if (!fuse4fs_is_writeable(ff)) {
+ ret = -EROFS;
+ goto out;
+ }
+
+ if (!fs_can_allocate(ff, FUSE4FS_B_TO_FSB(ff, len))) {
+ ret = -ENOSPC;
+ goto out;
+ }
+
+ err = ext2fs_file_open(fs, fh->ino, fh->open_flags, &efp);
+ if (err) {
+ ret = translate_error(fs, fh->ino, err);
+ goto out;
+ }
+
+ err = ext2fs_file_llseek(efp, offset, SEEK_SET, NULL);
+ if (err) {
+ ret = translate_error(fs, fh->ino, err);
+ goto out2;
+ }
+
+ err = ext2fs_file_write(efp, buf, len, &got);
+ if (err) {
+ ret = translate_error(fs, fh->ino, err);
+ goto out2;
+ }
+
+ err = ext2fs_file_flush(efp);
+ if (err) {
+ got = 0;
+ ret = translate_error(fs, fh->ino, err);
+ goto out2;
+ }
+
+out2:
+ err = ext2fs_file_close(efp);
+ if (ret)
+ goto out;
+ if (err) {
+ ret = translate_error(fs, fh->ino, err);
+ goto out;
+ }
+
+ ret = update_mtime(fs, fh->ino, NULL);
+ if (ret)
+ goto out;
+
+out:
+ fuse4fs_finish(ff, ret);
+ return got ? (int) got : ret;
+}
+
+static int op_release(const char *path EXT2FS_ATTR((unused)),
+ struct fuse_file_info *fp)
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
+ ext2_filsys fs;
+ errcode_t err;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ FUSE4FS_CHECK_HANDLE(ff, fh);
+ dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
+ fs = fuse4fs_start(ff);
+
+ if ((fp->flags & O_SYNC) &&
+ fuse4fs_is_writeable(ff) &&
+ (fh->open_flags & EXT2_FILE_WRITE)) {
+ err = ext2fs_flush2(fs, EXT2_FLAG_FLUSH_NO_SYNC);
+ if (err)
+ ret = translate_error(fs, fh->ino, err);
+ }
+
+ fp->fh = 0;
+ fuse4fs_finish(ff, ret);
+
+ ext2fs_free_mem(&fh);
+
+ return ret;
+}
+
+static int op_fsync(const char *path EXT2FS_ATTR((unused)),
+ int datasync EXT2FS_ATTR((unused)),
+ struct fuse_file_info *fp)
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
+ ext2_filsys fs;
+ errcode_t err;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ FUSE4FS_CHECK_HANDLE(ff, fh);
+ dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
+ fs = fuse4fs_start(ff);
+ /* For now, flush everything, even if it's slow */
+ if (fuse4fs_is_writeable(ff) && (fh->open_flags & EXT2_FILE_WRITE)) {
+ err = ext2fs_flush2(fs, 0);
+ if (err)
+ ret = translate_error(fs, fh->ino, err);
+ }
+ fuse4fs_finish(ff, ret);
+
+ return ret;
+}
+
+static int op_statfs(const char *path EXT2FS_ATTR((unused)),
+ struct statvfs *buf)
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ ext2_filsys fs;
+ uint64_t fsid, *f;
+ blk64_t overhead, reserved, free;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ dbg_printf(ff, "%s: path=%s\n", __func__, path);
+ fs = fuse4fs_start(ff);
+ buf->f_bsize = fs->blocksize;
+ buf->f_frsize = 0;
+
+ if (ff->minixdf)
+ overhead = 0;
+ else
+ overhead = fs->desc_blocks +
+ (blk64_t)fs->group_desc_count *
+ (fs->inode_blocks_per_group + 2);
+ reserved = ext2fs_r_blocks_count(fs->super);
+ if (!reserved)
+ reserved = ext2fs_blocks_count(fs->super) / 10;
+ free = ext2fs_free_blocks_count(fs->super);
+
+ buf->f_blocks = ext2fs_blocks_count(fs->super) - overhead;
+ buf->f_bfree = free;
+ if (free < reserved)
+ buf->f_bavail = 0;
+ else
+ buf->f_bavail = free - reserved;
+ buf->f_files = fs->super->s_inodes_count;
+ buf->f_ffree = fs->super->s_free_inodes_count;
+ buf->f_favail = fs->super->s_free_inodes_count;
+ f = (uint64_t *)fs->super->s_uuid;
+ fsid = *f;
+ f++;
+ fsid ^= *f;
+ buf->f_fsid = fsid;
+ buf->f_flag = 0;
+ if (ff->opstate != F4OP_WRITABLE)
+ buf->f_flag |= ST_RDONLY;
+ buf->f_namemax = EXT2_NAME_LEN;
+ fuse4fs_finish(ff, 0);
+
+ return 0;
+}
+
+static const char *valid_xattr_prefixes[] = {
+ "user.",
+ "trusted.",
+ "security.",
+ "gnu.",
+ "system.",
+};
+
+static int validate_xattr_name(const char *name)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(valid_xattr_prefixes); i++) {
+ if (!strncmp(name, valid_xattr_prefixes[i],
+ strlen(valid_xattr_prefixes[i])))
+ return 1;
+ }
+
+ return 0;
+}
+
+static int op_getxattr(const char *path, const char *key, char *value,
+ size_t len)
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ ext2_filsys fs;
+ void *ptr;
+ size_t plen;
+ ext2_ino_t ino;
+ errcode_t err;
+ int ret = 0;
+
+ if (!validate_xattr_name(key))
+ return -ENODATA;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ fs = fuse4fs_start(ff);
+ if (!ext2fs_has_feature_xattr(fs->super)) {
+ ret = -ENOTSUP;
+ goto out;
+ }
+
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, &ino);
+ if (err || ino == 0) {
+ ret = translate_error(fs, 0, err);
+ goto out;
+ }
+ dbg_printf(ff, "%s: ino=%d name=%s\n", __func__, ino, key);
+
+ ret = check_inum_access(ff, ino, R_OK);
+ if (ret)
+ goto out;
+
+ ret = __getxattr(ff, ino, key, &ptr, &plen);
+ if (ret)
+ goto out;
+
+ if (!len) {
+ ret = plen;
+ } else if (len < plen) {
+ ret = -ERANGE;
+ } else {
+ memcpy(value, ptr, plen);
+ ret = plen;
+ }
+
+ ext2fs_free_mem(&ptr);
+out:
+ fuse4fs_finish(ff, ret);
+
+ return ret;
+}
+
+static int count_buffer_space(char *name, char *value EXT2FS_ATTR((unused)),
+ size_t value_len EXT2FS_ATTR((unused)),
+ void *data)
+{
+ unsigned int *x = data;
+
+ *x = *x + strlen(name) + 1;
+ return 0;
+}
+
+static int copy_names(char *name, char *value EXT2FS_ATTR((unused)),
+ size_t value_len EXT2FS_ATTR((unused)), void *data)
+{
+ char **b = data;
+ size_t name_len = strlen(name);
+
+ memcpy(*b, name, name_len + 1);
+ *b = *b + name_len + 1;
+
+ return 0;
+}
+
+static int op_listxattr(const char *path, char *names, size_t len)
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ ext2_filsys fs;
+ struct ext2_xattr_handle *h;
+ unsigned int bufsz;
+ ext2_ino_t ino;
+ errcode_t err;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ fs = fuse4fs_start(ff);
+ if (!ext2fs_has_feature_xattr(fs->super)) {
+ ret = -ENOTSUP;
+ goto out;
+ }
+
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, &ino);
+ if (err || ino == 0) {
+ ret = translate_error(fs, 0, err);
+ goto out;
+ }
+ dbg_printf(ff, "%s: ino=%d\n", __func__, ino);
+
+ ret = check_inum_access(ff, ino, R_OK);
+ if (ret)
+ goto out;
+
+ err = ext2fs_xattrs_open(fs, ino, &h);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out;
+ }
+
+ err = ext2fs_xattrs_read(h);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out2;
+ }
+
+ /* Count buffer space needed for names */
+ bufsz = 0;
+ err = ext2fs_xattrs_iterate(h, count_buffer_space, &bufsz);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out2;
+ }
+
+ if (len == 0) {
+ ret = bufsz;
+ goto out2;
+ } else if (len < bufsz) {
+ ret = -ERANGE;
+ goto out2;
+ }
+
+ /* Copy names out */
+ memset(names, 0, len);
+ err = ext2fs_xattrs_iterate(h, copy_names, &names);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out2;
+ }
+ ret = bufsz;
+out2:
+ err = ext2fs_xattrs_close(&h);
+ if (err && !ret)
+ ret = translate_error(fs, ino, err);
+out:
+ fuse4fs_finish(ff, ret);
+
+ return ret;
+}
+
+static int op_setxattr(const char *path,
+ const char *key, const char *value,
+ size_t len, int flags)
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ ext2_filsys fs;
+ struct ext2_xattr_handle *h;
+ ext2_ino_t ino;
+ errcode_t err;
+ int ret = 0;
+
+ if (flags & ~(XATTR_CREATE | XATTR_REPLACE))
+ return -EOPNOTSUPP;
+
+ if (!validate_xattr_name(key))
+ return -EINVAL;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ fs = fuse4fs_start(ff);
+ if (!ext2fs_has_feature_xattr(fs->super)) {
+ ret = -ENOTSUP;
+ goto out;
+ }
+
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, &ino);
+ if (err || ino == 0) {
+ ret = translate_error(fs, 0, err);
+ goto out;
+ }
+ dbg_printf(ff, "%s: ino=%d name=%s\n", __func__, ino, key);
+
+ ret = check_inum_access(ff, ino, W_OK);
+ if (ret == -EACCES) {
+ ret = -EPERM;
+ goto out;
+ } else if (ret)
+ goto out;
+
+ err = ext2fs_xattrs_open(fs, ino, &h);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out;
+ }
+
+ err = ext2fs_xattrs_read(h);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out2;
+ }
+
+ if (flags & (XATTR_CREATE | XATTR_REPLACE)) {
+ void *buf;
+ size_t buflen;
+
+ err = ext2fs_xattr_get(h, key, &buf, &buflen);
+ switch (err) {
+ case EXT2_ET_EA_KEY_NOT_FOUND:
+ if (flags & XATTR_REPLACE) {
+ ret = -ENODATA;
+ goto out2;
+ }
+ break;
+ case 0:
+ ext2fs_free_mem(&buf);
+ if (flags & XATTR_CREATE) {
+ ret = -EEXIST;
+ goto out2;
+ }
+ break;
+ default:
+ ret = translate_error(fs, ino, err);
+ goto out2;
+ }
+ }
+
+ err = ext2fs_xattr_set(h, key, value, len);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out2;
+ }
+
+ ret = update_ctime(fs, ino, NULL);
+out2:
+ err = ext2fs_xattrs_close(&h);
+ if (!ret && err)
+ ret = translate_error(fs, ino, err);
+out:
+ fuse4fs_finish(ff, ret);
+
+ return ret;
+}
+
+static int op_removexattr(const char *path, const char *key)
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ ext2_filsys fs;
+ struct ext2_xattr_handle *h;
+ void *buf;
+ size_t buflen;
+ ext2_ino_t ino;
+ errcode_t err;
+ int ret = 0;
+
+ /*
+ * Once in a while libfuse gives us a no-name xattr to delete as part
+ * of clearing ACLs. Just pretend we cleared them.
+ */
+ if (key[0] == 0)
+ return 0;
+
+ if (!validate_xattr_name(key))
+ return -ENODATA;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ fs = fuse4fs_start(ff);
+ if (!ext2fs_has_feature_xattr(fs->super)) {
+ ret = -ENOTSUP;
+ goto out;
+ }
+
+ if (!fs_can_allocate(ff, 1)) {
+ ret = -ENOSPC;
+ goto out;
+ }
+
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, &ino);
+ if (err || ino == 0) {
+ ret = translate_error(fs, 0, err);
+ goto out;
+ }
+ dbg_printf(ff, "%s: ino=%d name=%s\n", __func__, ino, key);
+
+ ret = check_inum_access(ff, ino, W_OK);
+ if (ret)
+ goto out;
+
+ err = ext2fs_xattrs_open(fs, ino, &h);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out;
+ }
+
+ err = ext2fs_xattrs_read(h);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out2;
+ }
+
+ err = ext2fs_xattr_get(h, key, &buf, &buflen);
+ switch (err) {
+ case EXT2_ET_EA_KEY_NOT_FOUND:
+ /*
+ * ACLs are special snowflakes that require a 0 return when
+ * the ACL never existed in the first place.
+ */
+ if (!strncmp(XATTR_SECURITY_PREFIX, key,
+ XATTR_SECURITY_PREFIX_LEN))
+ ret = 0;
+ else
+ ret = -ENODATA;
+ goto out2;
+ case 0:
+ ext2fs_free_mem(&buf);
+ break;
+ default:
+ ret = translate_error(fs, ino, err);
+ goto out2;
+ }
+
+ err = ext2fs_xattr_remove(h, key);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out2;
+ }
+
+ ret = update_ctime(fs, ino, NULL);
+out2:
+ err = ext2fs_xattrs_close(&h);
+ if (err && !ret)
+ ret = translate_error(fs, ino, err);
+out:
+ fuse4fs_finish(ff, ret);
+
+ return ret;
+}
+
+struct readdir_iter {
+ void *buf;
+ ext2_filsys fs;
+ fuse_fill_dir_t func;
+
+ struct fuse4fs *ff;
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
+ enum fuse_readdir_flags flags;
+#endif
+ unsigned int nr;
+ off_t startpos;
+ off_t dirpos;
+};
+
+static inline mode_t dirent_fmode(ext2_filsys fs,
+ const struct ext2_dir_entry *dirent)
+{
+ if (!ext2fs_has_feature_filetype(fs->super))
+ return 0;
+
+ switch (ext2fs_dirent_file_type(dirent)) {
+ case EXT2_FT_REG_FILE:
+ return S_IFREG;
+ case EXT2_FT_DIR:
+ return S_IFDIR;
+ case EXT2_FT_CHRDEV:
+ return S_IFCHR;
+ case EXT2_FT_BLKDEV:
+ return S_IFBLK;
+ case EXT2_FT_FIFO:
+ return S_IFIFO;
+ case EXT2_FT_SOCK:
+ return S_IFSOCK;
+ case EXT2_FT_SYMLINK:
+ return S_IFLNK;
+ }
+
+ return 0;
+}
+
+static int op_readdir_iter(ext2_ino_t dir EXT2FS_ATTR((unused)),
+ int entry EXT2FS_ATTR((unused)),
+ struct ext2_dir_entry *dirent,
+ int offset EXT2FS_ATTR((unused)),
+ int blocksize EXT2FS_ATTR((unused)),
+ char *buf EXT2FS_ATTR((unused)), void *data)
+{
+ struct readdir_iter *i = data;
+ char namebuf[EXT2_NAME_LEN + 1];
+ struct stat stat = {
+ .st_ino = dirent->inode,
+ .st_mode = dirent_fmode(i->fs, dirent),
+ };
+ int ret;
+
+ i->dirpos++;
+ if (i->startpos >= i->dirpos)
+ return 0;
+
+ dbg_printf(i->ff, "READDIR%s ino=%d %u offset=0x%llx\n",
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
+ i->flags == FUSE_READDIR_PLUS ? "PLUS" : "",
+#else
+ "",
+#endif
+ dir,
+ i->nr++,
+ (unsigned long long)i->dirpos);
+
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
+ if (i->flags == FUSE_READDIR_PLUS) {
+ ret = stat_inode(i->fs, dirent->inode, &stat);
+ if (ret)
+ return DIRENT_ABORT;
+ }
+#endif
+
+ memcpy(namebuf, dirent->name, dirent->name_len & 0xFF);
+ namebuf[dirent->name_len & 0xFF] = 0;
+ ret = i->func(i->buf, namebuf, &stat, i->dirpos
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
+ , 0
+#endif
+ );
+ if (ret)
+ return DIRENT_ABORT;
+
+ return 0;
+}
+
+static int op_readdir(const char *path EXT2FS_ATTR((unused)),
+ void *buf, fuse_fill_dir_t fill_func,
+ off_t offset,
+ struct fuse_file_info *fp
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
+ , enum fuse_readdir_flags flags
+#endif
+ )
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
+ errcode_t err;
+ struct readdir_iter i = {
+ .ff = ff,
+ .dirpos = 0,
+ .startpos = offset,
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
+ .flags = flags,
+#endif
+ };
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ FUSE4FS_CHECK_HANDLE(ff, fh);
+ dbg_printf(ff, "%s: ino=%d offset=0x%llx\n", __func__, fh->ino,
+ (unsigned long long)offset);
+ i.fs = fuse4fs_start(ff);
+ i.buf = buf;
+ i.func = fill_func;
+ err = ext2fs_dir_iterate2(i.fs, fh->ino, 0, NULL, op_readdir_iter, &i);
+ if (err) {
+ ret = translate_error(i.fs, fh->ino, err);
+ goto out;
+ }
+
+ if (fuse4fs_is_writeable(ff)) {
+ ret = update_atime(i.fs, fh->ino);
+ if (ret)
+ goto out;
+ }
+out:
+ fuse4fs_finish(ff, ret);
+ return ret;
+}
+
+static int op_access(const char *path, int mask)
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ ext2_filsys fs;
+ errcode_t err;
+ ext2_ino_t ino;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ dbg_printf(ff, "%s: path=%s mask=0x%x\n", __func__, path, mask);
+ fs = fuse4fs_start(ff);
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, &ino);
+ if (err || ino == 0) {
+ ret = translate_error(fs, 0, err);
+ goto out;
+ }
+
+ ret = check_inum_access(ff, ino, mask);
+ if (ret)
+ goto out;
+
+out:
+ fuse4fs_finish(ff, ret);
+ return ret;
+}
+
+static int op_create(const char *path, mode_t mode, struct fuse_file_info *fp)
+{
+ struct fuse_context *ctxt = fuse_get_context();
+ struct fuse4fs *ff = fuse4fs_get();
+ ext2_filsys fs;
+ ext2_ino_t parent, child;
+ char *temp_path;
+ errcode_t err;
+ char *node_name, a;
+ int filetype;
+ struct ext2_inode_large inode;
+ gid_t gid;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ dbg_printf(ff, "%s: path=%s mode=0%o\n", __func__, path, mode);
+ temp_path = strdup(path);
+ if (!temp_path) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ node_name = strrchr(temp_path, '/');
+ if (!node_name) {
+ ret = -EINVAL;
+ goto out;
+ }
+ node_name++;
+ a = *node_name;
+ *node_name = 0;
+
+ fs = fuse4fs_start(ff);
+ if (!fs_can_allocate(ff, 1)) {
+ ret = -ENOSPC;
+ goto out2;
+ }
+
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, temp_path,
+ &parent);
+ if (err) {
+ ret = translate_error(fs, 0, err);
+ goto out2;
+ }
+
+ ret = check_inum_access(ff, parent, A_OK | W_OK);
+ if (ret)
+ goto out2;
+
+ ret = fuse4fs_new_child_gid(ff, parent, &gid, NULL);
+ if (ret)
+ goto out2;
+
+ *node_name = a;
+
+ filetype = ext2_file_type(mode);
+
+ err = ext2fs_new_inode(fs, parent, mode, 0, &child);
+ if (err) {
+ ret = translate_error(fs, parent, err);
+ goto out2;
+ }
+
+ dbg_printf(ff, "%s: creating ino=%d/name=%s in dir=%d\n", __func__, child,
+ node_name, parent);
+ err = ext2fs_link(fs, parent, node_name, child,
+ filetype | EXT2FS_LINK_EXPAND);
+ if (err) {
+ ret = translate_error(fs, parent, err);
+ goto out2;
+ }
+
+ ret = update_mtime(fs, parent, NULL);
+ if (ret)
+ goto out2;
+
+ memset(&inode, 0, sizeof(inode));
+ inode.i_mode = mode;
+ inode.i_links_count = 1;
+ fuse4fs_set_extra_isize(ff, child, &inode);
+ fuse4fs_set_uid(&inode, ctxt->uid);
+ fuse4fs_set_gid(&inode, gid);
+ if (ext2fs_has_feature_extents(fs->super)) {
+ ext2_extent_handle_t handle;
+
+ inode.i_flags &= ~EXT4_EXTENTS_FL;
+ err = ext2fs_extent_open2(fs, child,
+ EXT2_INODE(&inode), &handle);
+ if (err) {
+ ret = translate_error(fs, child, err);
+ goto out2;
+ }
+
+ ext2fs_extent_free(handle);
+ }
+
+ err = ext2fs_write_new_inode(fs, child, EXT2_INODE(&inode));
+ if (err) {
+ ret = translate_error(fs, child, err);
+ goto out2;
+ }
+
+ inode.i_generation = ff->next_generation++;
+ init_times(&inode);
+ err = fuse4fs_write_inode(fs, child, &inode);
+ if (err) {
+ ret = translate_error(fs, child, err);
+ goto out2;
+ }
+
+ ext2fs_inode_alloc_stats2(fs, child, 1, 0);
+
+ ret = propagate_default_acls(ff, parent, child);
+ if (ret)
+ goto out2;
+
+ fp->flags &= ~O_TRUNC;
+ ret = __op_open(ff, path, fp);
+ if (ret)
+ goto out2;
+
+ ret = fuse4fs_dirsync_flush(ff, parent, NULL);
+ if (ret)
+ goto out2;
+
+out2:
+ fuse4fs_finish(ff, ret);
+out:
+ free(temp_path);
+ return ret;
+}
+
+#if FUSE_VERSION < FUSE_MAKE_VERSION(3, 0)
+static int op_ftruncate(const char *path EXT2FS_ATTR((unused)),
+ off_t len, struct fuse_file_info *fp)
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
+ ext2_filsys fs;
+ ext2_file_t efp;
+ errcode_t err;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ FUSE4FS_CHECK_HANDLE(ff, fh);
+ dbg_printf(ff, "%s: ino=%d len=%jd\n", __func__, fh->ino,
+ (intmax_t) len);
+ fs = fuse4fs_start(ff);
+ if (!fuse4fs_is_writeable(ff)) {
+ ret = -EROFS;
+ goto out;
+ }
+
+ err = ext2fs_file_open(fs, fh->ino, fh->open_flags, &efp);
+ if (err) {
+ ret = translate_error(fs, fh->ino, err);
+ goto out;
+ }
+
+ err = ext2fs_file_set_size2(efp, len);
+ if (err) {
+ ret = translate_error(fs, fh->ino, err);
+ goto out2;
+ }
+
+out2:
+ err = ext2fs_file_close(efp);
+ if (ret)
+ goto out;
+ if (err) {
+ ret = translate_error(fs, fh->ino, err);
+ goto out;
+ }
+
+ ret = update_mtime(fs, fh->ino, NULL);
+ if (ret)
+ goto out;
+
+out:
+ fuse4fs_finish(ff, ret);
+ return ret;
+}
+
+static int op_fgetattr(const char *path EXT2FS_ATTR((unused)),
+ struct stat *statbuf,
+ struct fuse_file_info *fp)
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ ext2_filsys fs;
+ struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ FUSE4FS_CHECK_HANDLE(ff, fh);
+ dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
+ fs = fuse4fs_start(ff);
+ ret = stat_inode(fs, fh->ino, statbuf);
+ fuse4fs_finish(ff, ret);
+
+ return ret;
+}
+#endif /* FUSE_VERSION < FUSE_MAKE_VERSION(3, 0) */
+
+static int op_utimens(const char *path, const struct timespec ctv[2]
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
+ , struct fuse_file_info *fi
+#endif
+ )
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ struct timespec tv[2];
+ ext2_filsys fs;
+ errcode_t err;
+ ext2_ino_t ino;
+ struct ext2_inode_large inode;
+ int access = W_OK;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ fs = fuse4fs_start(ff);
+ ret = fuse4fs_file_ino(ff, path, fi, &ino);
+ if (ret)
+ goto out;
+ dbg_printf(ff, "%s: ino=%d atime=%lld.%ld mtime=%lld.%ld\n", __func__,
+ ino,
+ (long long int)ctv[0].tv_sec, ctv[0].tv_nsec,
+ (long long int)ctv[1].tv_sec, ctv[1].tv_nsec);
+
+ /*
+ * ext4 allows timestamp updates of append-only files but only if we're
+ * setting to current time
+ */
+ if (ctv[0].tv_nsec == UTIME_NOW && ctv[1].tv_nsec == UTIME_NOW)
+ access |= A_OK;
+ ret = check_inum_access(ff, ino, access);
+ if (ret)
+ goto out;
+
+ err = fuse4fs_read_inode(fs, ino, &inode);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out;
+ }
+
+ tv[0] = ctv[0];
+ tv[1] = ctv[1];
+#ifdef UTIME_NOW
+ if (tv[0].tv_nsec == UTIME_NOW)
+ get_now(tv);
+ if (tv[1].tv_nsec == UTIME_NOW)
+ get_now(tv + 1);
+#endif /* UTIME_NOW */
+#ifdef UTIME_OMIT
+ if (tv[0].tv_nsec != UTIME_OMIT)
+ EXT4_INODE_SET_XTIME(i_atime, &tv[0], &inode);
+ if (tv[1].tv_nsec != UTIME_OMIT)
+ EXT4_INODE_SET_XTIME(i_mtime, &tv[1], &inode);
+#endif /* UTIME_OMIT */
+ ret = update_ctime(fs, ino, &inode);
+ if (ret)
+ goto out;
+
+ err = fuse4fs_write_inode(fs, ino, &inode);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out;
+ }
+
+out:
+ fuse4fs_finish(ff, ret);
+ return ret;
+}
+
+#define FUSE4FS_MODIFIABLE_IFLAGS \
+ (EXT2_FL_USER_MODIFIABLE & ~(EXT4_EXTENTS_FL | EXT4_CASEFOLD_FL | \
+ EXT3_JOURNAL_DATA_FL))
+
+static inline int set_iflags(struct ext2_inode_large *inode, __u32 iflags)
+{
+ if ((inode->i_flags ^ iflags) & ~FUSE4FS_MODIFIABLE_IFLAGS)
+ return -EINVAL;
+
+ inode->i_flags = (inode->i_flags & ~FUSE4FS_MODIFIABLE_IFLAGS) |
+ (iflags & FUSE4FS_MODIFIABLE_IFLAGS);
+ return 0;
+}
+
+#ifdef SUPPORT_I_FLAGS
+static int ioctl_getflags(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
+ void *data)
+{
+ ext2_filsys fs = ff->fs;
+ errcode_t err;
+ struct ext2_inode_large inode;
+
+ dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
+ err = fuse4fs_read_inode(fs, fh->ino, &inode);
+ if (err)
+ return translate_error(fs, fh->ino, err);
+
+ *(__u32 *)data = inode.i_flags & EXT2_FL_USER_VISIBLE;
+ return 0;
+}
+
+static int ioctl_setflags(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
+ void *data)
+{
+ ext2_filsys fs = ff->fs;
+ errcode_t err;
+ struct ext2_inode_large inode;
+ int ret;
+ __u32 flags = *(__u32 *)data;
+ struct fuse_context *ctxt = fuse_get_context();
+
+ dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
+ err = fuse4fs_read_inode(fs, fh->ino, &inode);
+ if (err)
+ return translate_error(fs, fh->ino, err);
+
+ if (want_check_owner(ff, ctxt) && inode_uid(inode) != ctxt->uid)
+ return -EPERM;
+
+ ret = set_iflags(&inode, flags);
+ if (ret)
+ return ret;
+
+ ret = update_ctime(fs, fh->ino, &inode);
+ if (ret)
+ return ret;
+
+ err = fuse4fs_write_inode(fs, fh->ino, &inode);
+ if (err)
+ return translate_error(fs, fh->ino, err);
+
+ return 0;
+}
+
+static int ioctl_getversion(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
+ void *data)
+{
+ ext2_filsys fs = ff->fs;
+ errcode_t err;
+ struct ext2_inode_large inode;
+
+ dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
+ err = fuse4fs_read_inode(fs, fh->ino, &inode);
+ if (err)
+ return translate_error(fs, fh->ino, err);
+
+ *(__u32 *)data = inode.i_generation;
+ return 0;
+}
+
+static int ioctl_setversion(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
+ void *data)
+{
+ ext2_filsys fs = ff->fs;
+ errcode_t err;
+ struct ext2_inode_large inode;
+ int ret;
+ __u32 generation = *(__u32 *)data;
+ struct fuse_context *ctxt = fuse_get_context();
+
+ dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
+ err = fuse4fs_read_inode(fs, fh->ino, &inode);
+ if (err)
+ return translate_error(fs, fh->ino, err);
+
+ if (want_check_owner(ff, ctxt) && inode_uid(inode) != ctxt->uid)
+ return -EPERM;
+
+ inode.i_generation = generation;
+
+ ret = update_ctime(fs, fh->ino, &inode);
+ if (ret)
+ return ret;
+
+ err = fuse4fs_write_inode(fs, fh->ino, &inode);
+ if (err)
+ return translate_error(fs, fh->ino, err);
+
+ return 0;
+}
+#endif /* SUPPORT_I_FLAGS */
+
+#ifdef FS_IOC_FSGETXATTR
+static __u32 iflags_to_fsxflags(__u32 iflags)
+{
+ __u32 xflags = 0;
+
+ if (iflags & FS_SYNC_FL)
+ xflags |= FS_XFLAG_SYNC;
+ if (iflags & FS_IMMUTABLE_FL)
+ xflags |= FS_XFLAG_IMMUTABLE;
+ if (iflags & FS_APPEND_FL)
+ xflags |= FS_XFLAG_APPEND;
+ if (iflags & FS_NODUMP_FL)
+ xflags |= FS_XFLAG_NODUMP;
+ if (iflags & FS_NOATIME_FL)
+ xflags |= FS_XFLAG_NOATIME;
+ if (iflags & FS_DAX_FL)
+ xflags |= FS_XFLAG_DAX;
+ if (iflags & FS_PROJINHERIT_FL)
+ xflags |= FS_XFLAG_PROJINHERIT;
+ return xflags;
+}
+
+static int ioctl_fsgetxattr(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
+ void *data)
+{
+ ext2_filsys fs = ff->fs;
+ errcode_t err;
+ struct ext2_inode_large inode;
+ struct fsxattr *fsx = data;
+ unsigned int inode_size;
+
+ dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
+ err = fuse4fs_read_inode(fs, fh->ino, &inode);
+ if (err)
+ return translate_error(fs, fh->ino, err);
+
+ memset(fsx, 0, sizeof(*fsx));
+ inode_size = EXT2_GOOD_OLD_INODE_SIZE + inode.i_extra_isize;
+ if (ext2fs_inode_includes(inode_size, i_projid))
+ fsx->fsx_projid = inode_projid(inode);
+ fsx->fsx_xflags = iflags_to_fsxflags(inode.i_flags);
+ return 0;
+}
+
+static __u32 fsxflags_to_iflags(__u32 xflags)
+{
+ __u32 iflags = 0;
+
+ if (xflags & FS_XFLAG_IMMUTABLE)
+ iflags |= FS_IMMUTABLE_FL;
+ if (xflags & FS_XFLAG_APPEND)
+ iflags |= FS_APPEND_FL;
+ if (xflags & FS_XFLAG_SYNC)
+ iflags |= FS_SYNC_FL;
+ if (xflags & FS_XFLAG_NOATIME)
+ iflags |= FS_NOATIME_FL;
+ if (xflags & FS_XFLAG_NODUMP)
+ iflags |= FS_NODUMP_FL;
+ if (xflags & FS_XFLAG_DAX)
+ iflags |= FS_DAX_FL;
+ if (xflags & FS_XFLAG_PROJINHERIT)
+ iflags |= FS_PROJINHERIT_FL;
+ return iflags;
+}
+
+static int ioctl_fssetxattr(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
+ void *data)
+{
+ ext2_filsys fs = ff->fs;
+ errcode_t err;
+ struct ext2_inode_large inode;
+ int ret;
+ struct fuse_context *ctxt = fuse_get_context();
+ struct fsxattr *fsx = data;
+ __u32 flags = fsxflags_to_iflags(fsx->fsx_xflags);
+ unsigned int inode_size;
+
+ dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
+ err = fuse4fs_read_inode(fs, fh->ino, &inode);
+ if (err)
+ return translate_error(fs, fh->ino, err);
+
+ if (want_check_owner(ff, ctxt) && inode_uid(inode) != ctxt->uid)
+ return -EPERM;
+
+ ret = set_iflags(&inode, flags);
+ if (ret)
+ return ret;
+
+ inode_size = EXT2_GOOD_OLD_INODE_SIZE + inode.i_extra_isize;
+ if (ext2fs_inode_includes(inode_size, i_projid))
+ inode.i_projid = fsx->fsx_projid;
+
+ ret = update_ctime(fs, fh->ino, &inode);
+ if (ret)
+ return ret;
+
+ err = fuse4fs_write_inode(fs, fh->ino, &inode);
+ if (err)
+ return translate_error(fs, fh->ino, err);
+
+ return 0;
+}
+#endif /* FS_IOC_FSGETXATTR */
+
+#ifdef FITRIM
+static int ioctl_fitrim(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
+ void *data)
+{
+ ext2_filsys fs = ff->fs;
+ struct fstrim_range *fr = data;
+ blk64_t start, end, max_blocks, b, cleared, minlen;
+ blk64_t max_blks = ext2fs_blocks_count(fs->super);
+ errcode_t err = 0;
+
+ if (!fuse4fs_is_writeable(ff))
+ return -EROFS;
+
+ start = FUSE4FS_B_TO_FSBT(ff, fr->start);
+ if (fr->len == -1ULL)
+ end = -1ULL;
+ else
+ end = FUSE4FS_B_TO_FSBT(ff, fr->start + fr->len - 1);
+ minlen = FUSE4FS_B_TO_FSBT(ff, fr->minlen);
+
+ if (EXT2FS_NUM_B2C(fs, minlen) > EXT2_CLUSTERS_PER_GROUP(fs->super) ||
+ start >= max_blks ||
+ fr->len < fs->blocksize)
+ return -EINVAL;
+
+ dbg_printf(ff, "%s: start=0x%llx end=0x%llx minlen=0x%llx\n", __func__,
+ start, end, minlen);
+
+ if (start < fs->super->s_first_data_block)
+ start = fs->super->s_first_data_block;
+
+ if (end < fs->super->s_first_data_block)
+ end = fs->super->s_first_data_block;
+ if (end >= ext2fs_blocks_count(fs->super))
+ end = ext2fs_blocks_count(fs->super) - 1;
+
+ cleared = 0;
+ max_blocks = FUSE4FS_B_TO_FSBT(ff, 2048ULL * 1024 * 1024);
+
+ fr->len = 0;
+ while (start <= end) {
+ err = ext2fs_find_first_zero_block_bitmap2(fs->block_map,
+ start, end, &start);
+ switch (err) {
+ case 0:
+ break;
+ case ENOENT:
+ /* no free blocks found, so we're done */
+ err = 0;
+ goto out;
+ default:
+ return translate_error(fs, fh->ino, err);
+ }
+
+ b = start + max_blocks < end ? start + max_blocks : end;
+ err = ext2fs_find_first_set_block_bitmap2(fs->block_map,
+ start, b, &b);
+ switch (err) {
+ case 0:
+ break;
+ case ENOENT:
+ /*
+ * No in-use blocks found between start and b; the whole
+ * range is free, so discard all of it.
+ */
+ err = 0;
+ break;
+ default:
+ return translate_error(fs, fh->ino, err);
+ }
+
+ if (b - start >= minlen) {
+ err = io_channel_discard(fs->io, start, b - start);
+ if (err)
+ return translate_error(fs, fh->ino, err);
+ cleared += b - start;
+ fr->len = FUSE4FS_FSB_TO_B(ff, cleared);
+ }
+ start = b + 1;
+ }
+
+out:
+ fr->len = FUSE4FS_FSB_TO_B(ff, cleared);
+ dbg_printf(ff, "%s: len=%llu err=%ld\n", __func__, fr->len, err);
+ return err;
+}
+#endif /* FITRIM */
+
+#ifndef EXT4_IOC_SHUTDOWN
+# define EXT4_IOC_SHUTDOWN _IOR('X', 125, __u32)
+#endif
+
+static int ioctl_shutdown(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
+ void *data)
+{
+ struct fuse_context *ctxt = fuse_get_context();
+ ext2_filsys fs = ff->fs;
+
+ if (!is_superuser(ff, ctxt))
+ return -EPERM;
+
+ err_printf(ff, "%s.\n", _("shut down requested"));
+
+ /*
+ * EXT4_IOC_SHUTDOWN inherited the inverted polarity on the ioctl
+ * direction from XFS. Unfortunately, that means we can't implement
+ * any of the flags. Flush whatever is dirty and shut down.
+ */
+ if (ff->opstate == F4OP_WRITABLE)
+ ext2fs_flush2(fs, 0);
+ ff->opstate = F4OP_SHUTDOWN;
+
+ return 0;
+}
+
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(2, 8)
+static int op_ioctl(const char *path EXT2FS_ATTR((unused)),
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
+ unsigned int cmd,
+#else
+ int cmd,
+#endif
+ void *arg EXT2FS_ATTR((unused)),
+ struct fuse_file_info *fp,
+ unsigned int flags EXT2FS_ATTR((unused)), void *data)
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ FUSE4FS_CHECK_HANDLE(ff, fh);
+ fuse4fs_start(ff);
+ switch ((unsigned long) cmd) {
+#ifdef SUPPORT_I_FLAGS
+ case EXT2_IOC_GETFLAGS:
+ ret = ioctl_getflags(ff, fh, data);
+ break;
+ case EXT2_IOC_SETFLAGS:
+ ret = ioctl_setflags(ff, fh, data);
+ break;
+ case EXT2_IOC_GETVERSION:
+ ret = ioctl_getversion(ff, fh, data);
+ break;
+ case EXT2_IOC_SETVERSION:
+ ret = ioctl_setversion(ff, fh, data);
+ break;
+#endif
+#ifdef FS_IOC_FSGETXATTR
+ case FS_IOC_FSGETXATTR:
+ ret = ioctl_fsgetxattr(ff, fh, data);
+ break;
+ case FS_IOC_FSSETXATTR:
+ ret = ioctl_fssetxattr(ff, fh, data);
+ break;
+#endif
+#ifdef FITRIM
+ case FITRIM:
+ ret = ioctl_fitrim(ff, fh, data);
+ break;
+#endif
+ case EXT4_IOC_SHUTDOWN:
+ ret = ioctl_shutdown(ff, fh, data);
+ break;
+ default:
+ dbg_printf(ff, "%s: Unknown ioctl %d\n", __func__, cmd);
+ ret = -ENOTTY;
+ }
+ fuse4fs_finish(ff, ret);
+
+ return ret;
+}
+#endif /* FUSE 28 */
+
+static int op_bmap(const char *path, size_t blocksize EXT2FS_ATTR((unused)),
+ uint64_t *idx)
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ ext2_filsys fs;
+ ext2_ino_t ino;
+ errcode_t err;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ fs = fuse4fs_start(ff);
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, &ino);
+ if (err) {
+ ret = translate_error(fs, 0, err);
+ goto out;
+ }
+ dbg_printf(ff, "%s: ino=%d blk=%"PRIu64"\n", __func__, ino, *idx);
+
+ err = ext2fs_bmap2(fs, ino, NULL, NULL, 0, *idx, 0, (blk64_t *)idx);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out;
+ }
+
+out:
+ fuse4fs_finish(ff, ret);
+ return ret;
+}
+
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(2, 9)
+# ifdef SUPPORT_FALLOCATE
+static int fuse4fs_allocate_range(struct fuse4fs *ff,
+ struct fuse4fs_file_handle *fh, int mode,
+ off_t offset, off_t len)
+{
+ ext2_filsys fs = ff->fs;
+ struct ext2_inode_large inode;
+ blk64_t start, end;
+ __u64 fsize;
+ errcode_t err;
+ int flags;
+
+ start = FUSE4FS_B_TO_FSBT(ff, offset);
+ end = FUSE4FS_B_TO_FSBT(ff, offset + len - 1);
+ dbg_printf(ff, "%s: ino=%d mode=0x%x offset=0x%llx len=0x%llx start=0x%llx end=0x%llx\n",
+ __func__, fh->ino, mode,
+ (unsigned long long)offset,
+ (unsigned long long)len,
+ (unsigned long long)start,
+ (unsigned long long)end);
+ if (!fs_can_allocate(ff, FUSE4FS_B_TO_FSB(ff, len)))
+ return -ENOSPC;
+
+ err = fuse4fs_read_inode(fs, fh->ino, &inode);
+ if (err)
+ return translate_error(fs, fh->ino, err);
+ fsize = EXT2_I_SIZE(&inode);
+
+ /* Indirect files do not support unwritten extents */
+ if (!(inode.i_flags & EXT4_EXTENTS_FL))
+ return -EOPNOTSUPP;
+
+ /* Allocate a bunch of blocks */
+ flags = (mode & FL_KEEP_SIZE_FLAG ? 0 :
+ EXT2_FALLOCATE_INIT_BEYOND_EOF);
+ err = ext2fs_fallocate(fs, flags, fh->ino,
+ EXT2_INODE(&inode),
+ ~0ULL, start, end - start + 1);
+ if (err && err != EXT2_ET_BLOCK_ALLOC_FAIL)
+ return translate_error(fs, fh->ino, err);
+
+ /* Update i_size */
+ if (!(mode & FL_KEEP_SIZE_FLAG)) {
+ if ((__u64) offset + len > fsize) {
+ err = ext2fs_inode_size_set(fs,
+ EXT2_INODE(&inode),
+ offset + len);
+ if (err)
+ return translate_error(fs, fh->ino, err);
+ }
+ }
+
+ err = update_mtime(fs, fh->ino, &inode);
+ if (err)
+ return err;
+
+ err = fuse4fs_write_inode(fs, fh->ino, &inode);
+ if (err)
+ return translate_error(fs, fh->ino, err);
+
+ return err;
+}
+
+static errcode_t clean_block_middle(struct fuse4fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode,
+ off_t offset, off_t len, char **buf)
+{
+ ext2_filsys fs = ff->fs;
+ blk64_t blk;
+ off_t residue = FUSE4FS_OFF_IN_FSB(ff, offset);
+ int retflags;
+ errcode_t err;
+
+ if (!*buf) {
+ err = ext2fs_get_mem(fs->blocksize, buf);
+ if (err)
+ return err;
+ }
+
+ err = ext2fs_bmap2(fs, ino, EXT2_INODE(inode), *buf, 0,
+ FUSE4FS_B_TO_FSBT(ff, offset), &retflags, &blk);
+ if (err)
+ return err;
+ if (!blk || (retflags & BMAP_RET_UNINIT))
+ return 0;
+
+ err = io_channel_read_blk64(fs->io, blk, 1, *buf);
+ if (err)
+ return err;
+
+ dbg_printf(ff, "%s: ino=%d offset=0x%llx len=0x%llx\n",
+ __func__, ino,
+ (unsigned long long)offset + residue,
+ (unsigned long long)len);
+ memset(*buf + residue, 0, len);
+
+ return io_channel_write_blk64(fs->io, blk, 1, *buf);
+}
+
+static errcode_t clean_block_edge(struct fuse4fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode, off_t offset,
+ int clean_before, char **buf)
+{
+ ext2_filsys fs = ff->fs;
+ blk64_t blk;
+ int retflags;
+ off_t residue;
+ errcode_t err;
+
+ residue = FUSE4FS_OFF_IN_FSB(ff, offset);
+ if (residue == 0)
+ return 0;
+
+ if (!*buf) {
+ err = ext2fs_get_mem(fs->blocksize, buf);
+ if (err)
+ return err;
+ }
+
+ err = ext2fs_bmap2(fs, ino, EXT2_INODE(inode), *buf, 0,
+ FUSE4FS_B_TO_FSBT(ff, offset), &retflags, &blk);
+ if (err)
+ return err;
+
+ if (!blk || (retflags & BMAP_RET_UNINIT))
+ return 0;
+
+ err = io_channel_read_blk64(fs->io, blk, 1, *buf);
+ if (err)
+ return err;
+
+ if (clean_before) {
+ dbg_printf(ff, "%s: ino=%d before offset=0x%llx len=0x%llx\n",
+ __func__, ino,
+ (unsigned long long)offset,
+ (unsigned long long)residue);
+ memset(*buf, 0, residue);
+ } else {
+ dbg_printf(ff, "%s: ino=%d after offset=0x%llx len=0x%llx\n",
+ __func__, ino,
+ (unsigned long long)offset,
+ (unsigned long long)fs->blocksize - residue);
+ memset(*buf + residue, 0, fs->blocksize - residue);
+ }
+
+ return io_channel_write_blk64(fs->io, blk, 1, *buf);
+}
+
+static int fuse4fs_punch_range(struct fuse4fs *ff,
+ struct fuse4fs_file_handle *fh, int mode,
+ off_t offset, off_t len)
+{
+ ext2_filsys fs = ff->fs;
+ struct ext2_inode_large inode;
+ blk64_t start, end;
+ errcode_t err;
+ char *buf = NULL;
+
+ /* kernel ext4 punch requires this flag to be set */
+ if (!(mode & FL_KEEP_SIZE_FLAG))
+ return -EINVAL;
+
+ /*
+ * Unmap out all full blocks in the middle of the range being punched.
+ * The start of the unmap range should be the first byte of the first
+ * fsblock that starts within the range. The end of the range should
+ * be the next byte after the last fsblock to end in the range.
+ */
+ start = FUSE4FS_B_TO_FSBT(ff, round_up(offset, fs->blocksize));
+ end = FUSE4FS_B_TO_FSBT(ff, round_down(offset + len, fs->blocksize));
+
+ dbg_printf(ff,
+ "%s: ino=%d mode=0x%x offset=0x%llx len=0x%llx start=0x%llx end=0x%llx\n",
+ __func__, fh->ino, mode,
+ (unsigned long long)offset,
+ (unsigned long long)len,
+ (unsigned long long)start,
+ (unsigned long long)end);
+
+ err = fuse4fs_read_inode(fs, fh->ino, &inode);
+ if (err)
+ return translate_error(fs, fh->ino, err);
+
+ /*
+ * Indirect files do not support unwritten extents, which means we
+ * can't support zero range. Punch goes first in zero-range, which
+ * is why the check is here.
+ */
+ if ((mode & FL_ZERO_RANGE_FLAG) && !(inode.i_flags & EXT4_EXTENTS_FL))
+ return -EOPNOTSUPP;
+
+ /* Zero everything before the first block and after the last block */
+ if (FUSE4FS_B_TO_FSBT(ff, offset) == FUSE4FS_B_TO_FSBT(ff, offset + len))
+ err = clean_block_middle(ff, fh->ino, &inode, offset,
+ len, &buf);
+ else {
+ err = clean_block_edge(ff, fh->ino, &inode, offset, 0, &buf);
+ if (!err)
+ err = clean_block_edge(ff, fh->ino, &inode,
+ offset + len, 1, &buf);
+ }
+ if (buf)
+ ext2fs_free_mem(&buf);
+ if (err)
+ return translate_error(fs, fh->ino, err);
+
+ /*
+ * Unmap full blocks in the middle, which is to say that end - start
+ * must be at least one fsblock. ext2fs_punch takes a closed interval
+ * as its argument, so we pass [start, end - 1].
+ */
+ if (start < end) {
+ err = ext2fs_punch(fs, fh->ino, EXT2_INODE(&inode),
+ NULL, start, end - 1);
+ if (err)
+ return translate_error(fs, fh->ino, err);
+ }
+
+ err = update_mtime(fs, fh->ino, &inode);
+ if (err)
+ return err;
+
+ err = fuse4fs_write_inode(fs, fh->ino, &inode);
+ if (err)
+ return translate_error(fs, fh->ino, err);
+
+ return 0;
+}
+
+static int fuse4fs_zero_range(struct fuse4fs *ff,
+ struct fuse4fs_file_handle *fh, int mode,
+ off_t offset, off_t len)
+{
+ int ret = fuse4fs_punch_range(ff, fh, mode | FL_KEEP_SIZE_FLAG, offset,
+ len);
+
+ if (!ret)
+ ret = fuse4fs_allocate_range(ff, fh, mode, offset, len);
+ return ret;
+}
+
+static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode,
+ off_t offset, off_t len,
+ struct fuse_file_info *fp)
+{
+ struct fuse4fs *ff = fuse4fs_get();
+ struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
+ int ret;
+
+ /* Catch unknown flags */
+ if (mode & ~(FL_ZERO_RANGE_FLAG | FL_PUNCH_HOLE_FLAG | FL_KEEP_SIZE_FLAG))
+ return -EOPNOTSUPP;
+
+ FUSE4FS_CHECK_CONTEXT(ff);
+ FUSE4FS_CHECK_HANDLE(ff, fh);
+ fuse4fs_start(ff);
+ if (!fuse4fs_is_writeable(ff)) {
+ ret = -EROFS;
+ goto out;
+ }
+
+ dbg_printf(ff, "%s: ino=%d mode=0x%x start=0x%llx end=0x%llx\n", __func__,
+ fh->ino, mode,
+ (unsigned long long)offset,
+ (unsigned long long)offset + len);
+
+ if (mode & FL_ZERO_RANGE_FLAG)
+ ret = fuse4fs_zero_range(ff, fh, mode, offset, len);
+ else if (mode & FL_PUNCH_HOLE_FLAG)
+ ret = fuse4fs_punch_range(ff, fh, mode, offset, len);
+ else
+ ret = fuse4fs_allocate_range(ff, fh, mode, offset, len);
+out:
+ fuse4fs_finish(ff, ret);
+
+ return ret;
+}
+# endif /* SUPPORT_FALLOCATE */
+#endif /* FUSE 29 */
+
+static struct fuse_operations fs_ops = {
+ .init = op_init,
+ .destroy = op_destroy,
+ .getattr = op_getattr,
+ .readlink = op_readlink,
+ .mknod = op_mknod,
+ .mkdir = op_mkdir,
+ .unlink = op_unlink,
+ .rmdir = op_rmdir,
+ .symlink = op_symlink,
+ .rename = op_rename,
+ .link = op_link,
+ .chmod = op_chmod,
+ .chown = op_chown,
+ .truncate = op_truncate,
+ .open = op_open,
+ .read = op_read,
+ .write = op_write,
+ .statfs = op_statfs,
+ .release = op_release,
+ .fsync = op_fsync,
+ .setxattr = op_setxattr,
+ .getxattr = op_getxattr,
+ .listxattr = op_listxattr,
+ .removexattr = op_removexattr,
+ .opendir = op_open,
+ .readdir = op_readdir,
+ .releasedir = op_release,
+ .fsyncdir = op_fsync,
+ .access = op_access,
+ .create = op_create,
+#if FUSE_VERSION < FUSE_MAKE_VERSION(3, 0)
+ .ftruncate = op_ftruncate,
+ .fgetattr = op_fgetattr,
+#endif
+ .utimens = op_utimens,
+#if (FUSE_VERSION >= FUSE_MAKE_VERSION(2, 9)) && (FUSE_VERSION < FUSE_MAKE_VERSION(3, 0))
+# if defined(UTIME_NOW) || defined(UTIME_OMIT)
+ .flag_utime_omit_ok = 1,
+# endif
+#endif
+ .bmap = op_bmap,
+#ifdef SUPERFLUOUS
+ .lock = op_lock,
+ .poll = op_poll,
+#endif
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(2, 8)
+ .ioctl = op_ioctl,
+#if FUSE_VERSION < FUSE_MAKE_VERSION(3, 0)
+ .flag_nullpath_ok = 1,
+#endif
+#endif
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(2, 9)
+#if FUSE_VERSION < FUSE_MAKE_VERSION(3, 0)
+ .flag_nopath = 1,
+#endif
+# ifdef SUPPORT_FALLOCATE
+ .fallocate = op_fallocate,
+# endif
+#endif
+};
+
+static int get_random_bytes(void *p, size_t sz)
+{
+ int fd;
+ ssize_t r;
+
+ fd = open("/dev/urandom", O_RDONLY);
+ if (fd < 0) {
+ perror("/dev/urandom");
+ return 0;
+ }
+
+ r = read(fd, p, sz);
+
+ close(fd);
+ return (size_t) r == sz;
+}
+
+enum {
+ FUSE4FS_IGNORED,
+ FUSE4FS_VERSION,
+ FUSE4FS_HELP,
+ FUSE4FS_HELPFULL,
+ FUSE4FS_CACHE_SIZE,
+ FUSE4FS_DIRSYNC,
+ FUSE4FS_ERRORS_BEHAVIOR,
+};
+
+#define FUSE4FS_OPT(t, p, v) { t, offsetof(struct fuse4fs, p), v }
+
+static struct fuse_opt fuse4fs_opts[] = {
+ FUSE4FS_OPT("ro", ro, 1),
+ FUSE4FS_OPT("rw", ro, 0),
+ FUSE4FS_OPT("minixdf", minixdf, 1),
+ FUSE4FS_OPT("bsddf", minixdf, 0),
+ FUSE4FS_OPT("fakeroot", fakeroot, 1),
+ FUSE4FS_OPT("fuse4fs_debug", debug, 1),
+ FUSE4FS_OPT("no_default_opts", no_default_opts, 1),
+ FUSE4FS_OPT("norecovery", norecovery, 1),
+ FUSE4FS_OPT("noload", norecovery, 1),
+ FUSE4FS_OPT("offset=%lu", offset, 0),
+ FUSE4FS_OPT("kernel", kernel, 1),
+ FUSE4FS_OPT("directio", directio, 1),
+ FUSE4FS_OPT("acl", acl, 1),
+ FUSE4FS_OPT("noacl", acl, 0),
+ FUSE4FS_OPT("lockfile=%s", lockfile, 0),
+#ifdef HAVE_CLOCK_MONOTONIC
+ FUSE4FS_OPT("timing", timing, 1),
+#endif
+ FUSE4FS_OPT("noblkdev", noblkdev, 1),
+
+ FUSE_OPT_KEY("user_xattr", FUSE4FS_IGNORED),
+ FUSE_OPT_KEY("noblock_validity", FUSE4FS_IGNORED),
+ FUSE_OPT_KEY("nodelalloc", FUSE4FS_IGNORED),
+ FUSE_OPT_KEY("cache_size=%s", FUSE4FS_CACHE_SIZE),
+ FUSE_OPT_KEY("dirsync", FUSE4FS_DIRSYNC),
+ FUSE_OPT_KEY("errors=%s", FUSE4FS_ERRORS_BEHAVIOR),
+
+ FUSE_OPT_KEY("-V", FUSE4FS_VERSION),
+ FUSE_OPT_KEY("--version", FUSE4FS_VERSION),
+ FUSE_OPT_KEY("-h", FUSE4FS_HELP),
+ FUSE_OPT_KEY("--help", FUSE4FS_HELP),
+ FUSE_OPT_KEY("--helpfull", FUSE4FS_HELPFULL),
+ FUSE_OPT_END
+};
+
+
+static int fuse4fs_opt_proc(void *data, const char *arg,
+ int key, struct fuse_args *outargs)
+{
+ struct fuse4fs *ff = data;
+
+ switch (key) {
+ case FUSE4FS_DIRSYNC:
+ ff->dirsync = 1;
+ /* pass through to libfuse */
+ return 1;
+ case FUSE_OPT_KEY_NONOPT:
+ if (!ff->device) {
+ ff->device = strdup(arg);
+ return 0;
+ }
+ return 1;
+ case FUSE4FS_CACHE_SIZE:
+ ff->cache_size = parse_num_blocks2(arg + 11, -1);
+ if (ff->cache_size < 1 || ff->cache_size > INT32_MAX) {
+ fprintf(stderr, "%s: %s\n", arg,
+ _("cache size must be between 1 block and 2GB."));
+ return -1;
+ }
+
+ /* do not pass through to libfuse */
+ return 0;
+ case FUSE4FS_ERRORS_BEHAVIOR:
+ if (strcmp(arg + 7, "continue") == 0)
+ ff->errors_behavior = EXT2_ERRORS_CONTINUE;
+ else if (strcmp(arg + 7, "remount-ro") == 0)
+ ff->errors_behavior = EXT2_ERRORS_RO;
+ else if (strcmp(arg + 7, "panic") == 0)
+ ff->errors_behavior = EXT2_ERRORS_PANIC;
+ else {
+ fprintf(stderr, "%s: %s\n", arg,
+ _("unknown errors behavior."));
+ return -1;
+ }
+
+ /* do not pass through to libfuse */
+ return 0;
+ case FUSE4FS_IGNORED:
+ return 0;
+ case FUSE4FS_HELP:
+ case FUSE4FS_HELPFULL:
+ fprintf(stderr,
+ "usage: %s device/image mountpoint [options]\n"
+ "\n"
+ "general options:\n"
+ " -o opt,[opt...] mount options\n"
+ " -h --help print help\n"
+ " -V --version print version\n"
+ "\n"
+ "fuse4fs options:\n"
+ " -o errors=panic dump core on error\n"
+ " -o minixdf minix-style df\n"
+ " -o fakeroot pretend to be root for permission checks\n"
+ " -o no_default_opts do not include default fuse options\n"
+ " -o offset=<bytes> similar to mount -o offset=<bytes>, mount the partition starting at <bytes>\n"
+ " -o norecovery don't replay the journal\n"
+ " -o fuse4fs_debug enable fuse4fs debugging\n"
+ " -o lockfile=<file> file to show that fuse is still using the file system image\n"
+ " -o kernel run this as if it were the kernel, which sets:\n"
+ " allow_other,default_permissions,suid,dev\n"
+ " -o directio use O_DIRECT to read and write the disk\n"
+ " -o cache_size=N[KMG] use a disk cache of this size\n"
+ " -o errors= behavior when an error is encountered:\n"
+ " continue|remount-ro|panic\n"
+ "\n",
+ outargs->argv[0]);
+ if (key == FUSE4FS_HELPFULL) {
+ fuse_opt_add_arg(outargs, "-h");
+ fuse_main(outargs->argc, outargs->argv, &fs_ops, NULL);
+ } else {
+ fprintf(stderr, "Try --helpfull to get a list of "
+ "all flags, including the FUSE options.\n");
+ }
+ exit(1);
+
+ case FUSE4FS_VERSION:
+ fprintf(stderr, "fuse4fs %s (%s)\n", E2FSPROGS_VERSION,
+ E2FSPROGS_DATE);
+ fuse_opt_add_arg(outargs, "--version");
+ fuse_main(outargs->argc, outargs->argv, &fs_ops, NULL);
+ exit(0);
+ }
+ return 1;
+}
+
+static const char *get_subtype(const char *argv0)
+{
+ size_t argvlen = strlen(argv0);
+
+ if (argvlen < 4)
+ goto out_default;
+
+ if (argv0[argvlen - 4] == 'e' &&
+ argv0[argvlen - 3] == 'x' &&
+ argv0[argvlen - 2] == 't' &&
+ isdigit(argv0[argvlen - 1]))
+ return &argv0[argvlen - 4];
+
+out_default:
+ return "ext4";
+}
+
+/* Figure out a reasonable default size for the disk cache */
+static unsigned long long default_cache_size(void)
+{
+ long pages = 0, pagesize = 0;
+ unsigned long long max_cache;
+ unsigned long long ret = 32ULL << 20; /* 32 MB */
+
+#ifdef _SC_PHYS_PAGES
+ pages = sysconf(_SC_PHYS_PAGES);
+#endif
+#ifdef _SC_PAGESIZE
+ pagesize = sysconf(_SC_PAGESIZE);
+#endif
+ if (pages > 0 && pagesize > 0) {
+ max_cache = (unsigned long long)pagesize * pages / 20;
+
+ if (max_cache > 0 && ret > max_cache)
+ ret = max_cache;
+ }
+ return ret;
+}
+
+static inline bool fuse4fs_want_fuseblk(const struct fuse4fs *ff)
+{
+ if (ff->noblkdev)
+ return false;
+
+ /* libfuse won't let non-root do fuseblk mounts */
+ if (getuid() != 0)
+ return false;
+
+ return fuse4fs_on_bdev(ff);
+}
+
+static void fuse4fs_com_err_proc(const char *whoami, errcode_t code,
+ const char *fmt, va_list args)
+{
+ fprintf(stderr, "FUSE4FS (%s): ", err_shortdev ? err_shortdev : "?");
+ if (whoami)
+ fprintf(stderr, "%s: ", whoami);
+ fprintf(stderr, "%s ", error_message(code));
+ vfprintf(stderr, fmt, args);
+ fprintf(stderr, "\n");
+ fflush(stderr);
+}
+
+int main(int argc, char *argv[])
+{
+ struct fuse_args args = FUSE_ARGS_INIT(argc, argv);
+ struct fuse4fs fctx;
+ errcode_t err;
+ FILE *orig_stderr = stderr;
+ char extra_args[BUFSIZ];
+ int ret;
+
+ memset(&fctx, 0, sizeof(fctx));
+ fctx.magic = FUSE4FS_MAGIC;
+ fctx.logfd = -1;
+ fctx.opstate = F4OP_WRITABLE;
+
+ ret = fuse_opt_parse(&args, &fctx, fuse4fs_opts, fuse4fs_opt_proc);
+ if (ret)
+ exit(1);
+ if (fctx.device == NULL) {
+ fprintf(stderr, "Missing ext4 device/image\n");
+ fprintf(stderr, "See '%s -h' for usage\n", argv[0]);
+ exit(1);
+ }
+
+ /* /dev/sda -> sda for reporting */
+ fctx.shortdev = strrchr(fctx.device, '/');
+ if (fctx.shortdev)
+ fctx.shortdev++;
+ else
+ fctx.shortdev = fctx.device;
+
+ /* capture library error messages */
+ err_shortdev = fctx.shortdev;
+ set_com_err_hook(fuse4fs_com_err_proc);
+
+#ifdef ENABLE_NLS
+ setlocale(LC_MESSAGES, "");
+ setlocale(LC_CTYPE, "");
+ bindtextdomain(NLS_CAT_NAME, LOCALEDIR);
+ textdomain(NLS_CAT_NAME);
+ set_com_err_gettext(gettext);
+#endif
+ add_error_table(&et_ext2_error_table);
+
+ ret = fuse4fs_setup_logging(&fctx);
+ if (ret) {
+ /* operational error */
+ ret = 2;
+ goto out;
+ }
+
+#ifdef HAVE_PR_SET_IO_FLUSHER
+ /*
+ * Register as a filesystem I/O server process so that our memory
+ * allocations don't cause fs reclaim.
+ */
+ ret = prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0);
+ if (ret < 0) {
+ err_printf(&fctx, "%s: %s.\n",
+ _("Could not register as IO flusher thread"),
+ strerror(errno));
+ ret = 0;
+ }
+#endif
+
+ /* Will we allow users to allocate every last block? */
+ if (getenv("FUSE4FS_ALLOC_ALL_BLOCKS")) {
+ log_printf(&fctx, "%s\n",
+ _("Allowing users to allocate all blocks. This is dangerous!"));
+ fctx.alloc_all_blocks = 1;
+ }
+
+ err = fuse4fs_open(&fctx, EXT2_FLAG_EXCLUSIVE);
+ if (err) {
+ ret = 32;
+ goto out;
+ }
+
+ if (fuse4fs_want_fuseblk(&fctx)) {
+ /*
+ * If this is a block device, we want to close the fs, reopen
+ * the block device in non-exclusive mode, and start the fuse
+ * driver in fuseblk mode (which will reopen the block device
+ * in exclusive mode) so that unmount will wait until
+ * op_destroy completes.
+ */
+ fuse4fs_unmount(&fctx);
+ err = fuse4fs_open(&fctx, 0);
+ if (err) {
+ ret = 32;
+ goto out;
+ }
+
+ /* "blkdev" is the magic mount option for fuseblk mode */
+ snprintf(extra_args, BUFSIZ, "-oblkdev,blksize=%u",
+ fctx.fs->blocksize);
+ fuse_opt_add_arg(&args, extra_args);
+ fctx.unmount_in_destroy = 1;
+ }
+
+ if (!fctx.cache_size)
+ fctx.cache_size = default_cache_size();
+ if (fctx.cache_size) {
+ err = fuse4fs_config_cache(&fctx);
+ if (err) {
+ ret = 32;
+ goto out;
+ }
+ }
+
+ err = fuse4fs_check_support(&fctx);
+ if (err) {
+ ret = 32;
+ goto out;
+ }
+
+ /*
+ * ext4 can't do COW of shared blocks, so if the feature is enabled,
+ * we must force ro mode.
+ */
+ if (ext2fs_has_feature_shared_blocks(fctx.fs->super))
+ fctx.ro = 1;
+
+ if (fctx.norecovery) {
+ ret = fuse4fs_check_norecovery(&fctx);
+ if (ret)
+ goto out;
+ }
+
+ err = fuse4fs_mount(&fctx);
+ if (err) {
+ ret = 32;
+ goto out;
+ }
+
+ /* Initialize generation counter */
+ get_random_bytes(&fctx.next_generation, sizeof(unsigned int));
+
+ /* Set up default fuse parameters */
+ snprintf(extra_args, BUFSIZ, "-okernel_cache,subtype=%s,"
+ "fsname=%s,attr_timeout=0" FUSE_PLATFORM_OPTS,
+ get_subtype(argv[0]),
+ fctx.device);
+ if (fctx.no_default_opts == 0)
+ fuse_opt_add_arg(&args, extra_args);
+
+ if (fctx.ro)
+ fuse_opt_add_arg(&args, "-oro");
+
+ if (fctx.fakeroot) {
+#ifdef HAVE_MOUNT_NODEV
+ fuse_opt_add_arg(&args,"-onodev");
+#endif
+#ifdef HAVE_MOUNT_NOSUID
+ fuse_opt_add_arg(&args,"-onosuid");
+#endif
+ }
+
+ if (fctx.kernel) {
+ /*
+ * ACLs are always enforced when kernel mode is enabled, to
+ * match the kernel ext4 driver which always enables ACLs.
+ */
+ fctx.acl = 1;
+ fuse_opt_insert_arg(&args, 1,
+ "-oallow_other,default_permissions,suid,dev");
+ }
+
+ if (fctx.debug) {
+ int i;
+
+ printf("FUSE4FS (%s): fuse arguments:", fctx.shortdev);
+ for (i = 0; i < args.argc; i++)
+ printf(" '%s'", args.argv[i]);
+ printf("\n");
+ fflush(stdout);
+ }
+
+ pthread_mutex_init(&fctx.bfl, NULL);
+ ret = fuse_main(args.argc, args.argv, &fs_ops, &fctx);
+ pthread_mutex_destroy(&fctx.bfl);
+
+ switch(ret) {
+ case 0:
+ /* success */
+ ret = 0;
+ break;
+ case 1:
+ case 2:
+ /* invalid option or no mountpoint */
+ ret = 1;
+ break;
+ case 3:
+ case 4:
+ case 5:
+ case 6:
+ case 7:
+ /* setup or mounting failed */
+ ret = 32;
+ break;
+ default:
+ /* fuse started up enough to call op_init */
+ ret = 0;
+ break;
+ }
+out:
+ if (ret & 1) {
+ fprintf(orig_stderr, "%s\n",
+ _("Mount failed due to unrecognized options. Check dmesg(1) for details."));
+ fflush(orig_stderr);
+ }
+ if (ret & 32) {
+ fprintf(orig_stderr, "%s\n",
+ _("Mount failed while opening filesystem. Check dmesg(1) for details."));
+ fflush(orig_stderr);
+ }
+ fuse4fs_unmount(&fctx);
+ reset_com_err_hook();
+ err_shortdev = NULL;
+ if (fctx.device)
+ free(fctx.device);
+ fuse_opt_free_args(&args);
+ return ret;
+}
+
+static int __translate_error(ext2_filsys fs, ext2_ino_t ino, errcode_t err,
+ const char *func, int line)
+{
+ struct timespec now;
+ int ret = err;
+ struct fuse4fs *ff = fs->priv_data;
+ int is_err = 0;
+
+ /* Translate ext2 error to unix error code */
+ switch (err) {
+ case 0:
+ break;
+ case EXT2_ET_NO_MEMORY:
+ case EXT2_ET_TDB_ERR_OOM:
+ ret = -ENOMEM;
+ break;
+ case EXT2_ET_INVALID_ARGUMENT:
+ case EXT2_ET_LLSEEK_FAILED:
+ ret = -EINVAL;
+ break;
+ case EXT2_ET_NO_DIRECTORY:
+ ret = -ENOTDIR;
+ break;
+ case EXT2_ET_FILE_NOT_FOUND:
+ ret = -ENOENT;
+ break;
+ case EXT2_ET_DIR_NO_SPACE:
+ is_err = 1;
+ /* fallthrough */
+ case EXT2_ET_TOOSMALL:
+ case EXT2_ET_BLOCK_ALLOC_FAIL:
+ case EXT2_ET_INODE_ALLOC_FAIL:
+ case EXT2_ET_EA_NO_SPACE:
+ ret = -ENOSPC;
+ break;
+ case EXT2_ET_SYMLINK_LOOP:
+ ret = -EMLINK;
+ break;
+ case EXT2_ET_FILE_TOO_BIG:
+ ret = -EFBIG;
+ break;
+ case EXT2_ET_TDB_ERR_EXISTS:
+ case EXT2_ET_FILE_EXISTS:
+ ret = -EEXIST;
+ break;
+ case EXT2_ET_MMP_FAILED:
+ case EXT2_ET_MMP_FSCK_ON:
+ ret = -EBUSY;
+ break;
+ case EXT2_ET_EA_KEY_NOT_FOUND:
+ ret = -ENODATA;
+ break;
+ case EXT2_ET_UNIMPLEMENTED:
+ ret = -EOPNOTSUPP;
+ break;
+ case EXT2_ET_MAGIC_EXT2_FILE:
+ case EXT2_ET_MAGIC_EXT2FS_FILSYS:
+ case EXT2_ET_MAGIC_BADBLOCKS_LIST:
+ case EXT2_ET_MAGIC_BADBLOCKS_ITERATE:
+ case EXT2_ET_MAGIC_INODE_SCAN:
+ case EXT2_ET_MAGIC_IO_CHANNEL:
+ case EXT2_ET_MAGIC_UNIX_IO_CHANNEL:
+ case EXT2_ET_MAGIC_IO_MANAGER:
+ case EXT2_ET_MAGIC_BLOCK_BITMAP:
+ case EXT2_ET_MAGIC_INODE_BITMAP:
+ case EXT2_ET_MAGIC_GENERIC_BITMAP:
+ case EXT2_ET_MAGIC_TEST_IO_CHANNEL:
+ case EXT2_ET_MAGIC_DBLIST:
+ case EXT2_ET_MAGIC_ICOUNT:
+ case EXT2_ET_MAGIC_PQ_IO_CHANNEL:
+ case EXT2_ET_MAGIC_E2IMAGE:
+ case EXT2_ET_MAGIC_INODE_IO_CHANNEL:
+ case EXT2_ET_MAGIC_EXTENT_HANDLE:
+ case EXT2_ET_BAD_MAGIC:
+ case EXT2_ET_MAGIC_EXTENT_PATH:
+ case EXT2_ET_MAGIC_GENERIC_BITMAP64:
+ case EXT2_ET_MAGIC_BLOCK_BITMAP64:
+ case EXT2_ET_MAGIC_INODE_BITMAP64:
+ case EXT2_ET_MAGIC_RESERVED_13:
+ case EXT2_ET_MAGIC_RESERVED_14:
+ case EXT2_ET_MAGIC_RESERVED_15:
+ case EXT2_ET_MAGIC_RESERVED_16:
+ case EXT2_ET_MAGIC_RESERVED_17:
+ case EXT2_ET_MAGIC_RESERVED_18:
+ case EXT2_ET_MAGIC_RESERVED_19:
+ case EXT2_ET_MMP_MAGIC_INVALID:
+ case EXT2_ET_MAGIC_EA_HANDLE:
+ case EXT2_ET_DIR_CORRUPTED:
+ case EXT2_ET_CORRUPT_SUPERBLOCK:
+ case EXT2_ET_RESIZE_INODE_CORRUPT:
+ case EXT2_ET_TDB_ERR_CORRUPT:
+ case EXT2_ET_UNDO_FILE_CORRUPT:
+ case EXT2_ET_FILESYSTEM_CORRUPTED:
+ case EXT2_ET_CORRUPT_JOURNAL_SB:
+ case EXT2_ET_INODE_CORRUPTED:
+ case EXT2_ET_EA_INODE_CORRUPTED:
+ /* same errno that linux uses */
+ is_err = 1;
+ ret = -EUCLEAN;
+ break;
+ case EIO:
+#ifdef EILSEQ
+ case EILSEQ:
+#endif
+ case EUCLEAN:
+ /* these errnos usually denote corruption or persistence fail */
+ is_err = 1;
+ ret = -err;
+ break;
+ default:
+ if (err < 256) {
+ /* other errno are usually operational errors */
+ ret = -err;
+ } else {
+ is_err = 1;
+ ret = -EIO;
+ }
+ break;
+ }
+
+ if (!is_err)
+ return ret;
+
+ if (ino)
+ err_printf(ff, "%s (inode #%d) at %s:%d.\n",
+ error_message(err), ino, func, line);
+ else
+ err_printf(ff, "%s at %s:%d.\n",
+ error_message(err), func, line);
+
+ /* Make a note in the error log */
+ get_now(&now);
+ ext2fs_set_tstamp(fs->super, s_last_error_time, now.tv_sec);
+ fs->super->s_last_error_ino = ino;
+ fs->super->s_last_error_line = line;
+ fs->super->s_last_error_block = err; /* Yeah... */
+ strncpy((char *)fs->super->s_last_error_func, func,
+ sizeof(fs->super->s_last_error_func));
+ if (ext2fs_get_tstamp(fs->super, s_first_error_time) == 0) {
+ ext2fs_set_tstamp(fs->super, s_first_error_time, now.tv_sec);
+ fs->super->s_first_error_ino = ino;
+ fs->super->s_first_error_line = line;
+ fs->super->s_first_error_block = err;
+ strncpy((char *)fs->super->s_first_error_func, func,
+ sizeof(fs->super->s_first_error_func));
+ }
+
+ fs->super->s_state |= EXT2_ERROR_FS;
+ fs->super->s_error_count++;
+ ext2fs_mark_super_dirty(fs);
+ ext2fs_flush(fs);
+ switch (ff->errors_behavior) {
+ case EXT2_ERRORS_CONTINUE:
+ err_printf(ff, "%s\n",
+ _("Continuing after errors; is this a good idea?"));
+ break;
+ case EXT2_ERRORS_RO:
+ if (ff->opstate == F4OP_WRITABLE) {
+ err_printf(ff, "%s\n",
+ _("Remounting read-only due to errors."));
+ ff->opstate = F4OP_READONLY;
+ }
+ fs->flags &= ~EXT2_FLAG_RW;
+ break;
+ case EXT2_ERRORS_PANIC:
+ err_printf(ff, "%s\n",
+ _("Aborting filesystem mount due to errors."));
+ abort();
+ break;
+ }
+
+ return ret;
+}
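For reference, the switch at the end of main() in this patch folds the fuse_main() return value into mount(8)-style exit codes. A minimal standalone restatement of that mapping, assuming the same return-value ranges that the comments in the patch describe. The helper name fuse4fs_exit_status() is hypothetical and does not appear in the patch:

```c
/*
 * Hypothetical helper (not in the patch): restates the switch at the
 * end of main() that converts fuse_main()'s return value into the exit
 * status that fuse4fs reports.
 */
static inline int fuse4fs_exit_status(int fuse_ret)
{
	switch (fuse_ret) {
	case 0:
		/* success */
		return 0;
	case 1:
	case 2:
		/* invalid option or no mountpoint */
		return 1;
	case 3:
	case 4:
	case 5:
	case 6:
	case 7:
		/* setup or mounting failed */
		return 32;
	default:
		/* fuse started up enough to call op_init */
		return 0;
	}
}
```

An exit status of 1 then selects the "unrecognized options" message and 32 selects the "opening filesystem" message, via the (ret & 1) and (ret & 32) tests after the out: label.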
* [PATCH 02/20] fuse4fs: drop fuse 2.x support code
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
2025-08-21 1:08 ` [PATCH 01/20] fuse2fs: port fuse2fs to lowlevel libfuse API Darrick J. Wong
@ 2025-08-21 1:08 ` Darrick J. Wong
2025-08-21 1:08 ` [PATCH 03/20] fuse4fs: namespace some helpers Darrick J. Wong
` (17 subsequent siblings)
19 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:08 UTC (permalink / raw)
To: tytso
Cc: amir73il, John, bernd, linux-fsdevel, linux-ext4, miklos,
amir73il, joannelkoong, neal
From: Darrick J. Wong <djwong@kernel.org>
We only enable fuse4fs if libfuse is from the 3.x series and the
lowlevel libfuse API is present, so drop support for 2.x. This part is
cribbed from Amir, who used an LLM-aided conversion.
Note: I actually check for the lowlevel ops in configure.ac because
there are some fuse3 forks <cough>Windows<cough> that do not provide
that API.
Co-developed-by: Claude claude-4-sonnet
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse4fs.c | 219 ++++++--------------------------------------------------
1 file changed, 24 insertions(+), 195 deletions(-)
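With the version conditionals removed, the file assumes fuse 3 everywhere. One way to make that assumption explicit would be a single compile-time guard. This is a sketch only and not part of the patch; it is redundant if configure.ac already rejects older libfuse. The stand-in macro definitions and the name fuse4fs_fuse_version_ok() are assumptions for illustration:

```c
/*
 * Stand-in definitions so this sketch compiles without libfuse headers;
 * a real build would get both macros from <fuse.h>.  The encoding and
 * the 3.16 version number are assumptions for illustration only.
 */
#ifndef FUSE_MAKE_VERSION
# define FUSE_MAKE_VERSION(maj, min)	((maj) * 100 + (min))
#endif
#ifndef FUSE_VERSION
# define FUSE_VERSION			FUSE_MAKE_VERSION(3, 16)
#endif

/* Refuse to build against fuse 2.x now that the compat code is gone. */
#if FUSE_VERSION < FUSE_MAKE_VERSION(3, 0)
# error "fuse4fs requires libfuse 3.x"
#endif

/* Hypothetical runtime form of the same comparison. */
static inline int fuse4fs_fuse_version_ok(int version)
{
	return version >= FUSE_MAKE_VERSION(3, 0);
}
```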
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 1b8240e56562d6..e6e5729936f6a1 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -48,15 +48,6 @@
#include "ext2fs/ext2fs.h"
#include "ext2fs/ext2_fs.h"
#include "ext2fs/ext2fsP.h"
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
-# define FUSE_PLATFORM_OPTS ""
-#else
-# ifdef __linux__
-# define FUSE_PLATFORM_OPTS ",use_ino,big_writes"
-# else
-# define FUSE_PLATFORM_OPTS ",use_ino"
-# endif
-#endif
#include "../version.h"
#include "uuid/uuid.h"
@@ -171,11 +162,9 @@ static inline uint64_t round_down(uint64_t b, unsigned int align)
break; \
}
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(2, 8)
-# ifdef _IOR
-# ifdef _IOW
-# define SUPPORT_I_FLAGS
-# endif
+#ifdef _IOR
+# ifdef _IOW
+# define SUPPORT_I_FLAGS
# endif
#endif
@@ -1292,11 +1281,8 @@ static inline int fuse_set_feature_flag(struct fuse_conn_info *conn,
}
#endif
-static void *op_init(struct fuse_conn_info *conn
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
- , struct fuse_config *cfg EXT2FS_ATTR((unused))
-#endif
- )
+static void *op_init(struct fuse_conn_info *conn,
+ struct fuse_config *cfg EXT2FS_ATTR((unused)))
{
struct fuse4fs *ff = fuse4fs_get();
ext2_filsys fs;
@@ -1328,13 +1314,11 @@ static void *op_init(struct fuse_conn_info *conn
#ifdef FUSE_CAP_NO_EXPORT_SUPPORT
fuse_set_feature_flag(conn, FUSE_CAP_NO_EXPORT_SUPPORT);
#endif
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
conn->time_gran = 1;
cfg->use_ino = 1;
if (ff->debug)
cfg->debug = 1;
cfg->nullpath_ok = 1;
-#endif
if (ff->kernel) {
char uuid[UUID_STR_SIZE];
@@ -1412,9 +1396,7 @@ static int stat_inode(ext2_filsys fs, ext2_ino_t ino, struct stat *statbuf)
}
static int __fuse4fs_file_ino(struct fuse4fs *ff, const char *path,
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
struct fuse_file_info *fp EXT2FS_ATTR((unused)),
-#endif
ext2_ino_t *inop,
const char *func,
int line)
@@ -1422,7 +1404,6 @@ static int __fuse4fs_file_ino(struct fuse4fs *ff, const char *path,
ext2_filsys fs = ff->fs;
errcode_t err;
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
if (fp) {
struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
@@ -1433,7 +1414,7 @@ static int __fuse4fs_file_ino(struct fuse4fs *ff, const char *path,
dbg_printf(ff, "%s: get ino=%d\n", func, fh->ino);
return 0;
}
-#endif
+
dbg_printf(ff, "%s: get path=%s\n", func, path);
err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, inop);
if (err)
@@ -1442,19 +1423,11 @@ static int __fuse4fs_file_ino(struct fuse4fs *ff, const char *path,
return 0;
}
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
# define fuse4fs_file_ino(ff, path, fp, inop) \
__fuse4fs_file_ino((ff), (path), (fp), (inop), __func__, __LINE__)
-#else
-# define fuse4fs_file_ino(ff, path, fp, inop) \
- __fuse4fs_file_ino((ff), (path), NULL, (inop), __func__, __LINE__)
-#endif
-static int op_getattr(const char *path, struct stat *statbuf
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
- , struct fuse_file_info *fi
-#endif
- )
+static int op_getattr(const char *path, struct stat *statbuf,
+ struct fuse_file_info *fi)
{
struct fuse4fs *ff = fuse4fs_get();
ext2_filsys fs;
@@ -2439,11 +2412,8 @@ static int update_dotdot_helper(ext2_ino_t dir EXT2FS_ATTR((unused)),
return 0;
}
-static int op_rename(const char *from, const char *to
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
- , unsigned int flags EXT2FS_ATTR((unused))
-#endif
- )
+static int op_rename(const char *from, const char *to,
+ unsigned int flags EXT2FS_ATTR((unused)))
{
struct fuse4fs *ff = fuse4fs_get();
ext2_filsys fs;
@@ -2456,11 +2426,9 @@ static int op_rename(const char *from, const char *to
int flushed = 0;
int ret = 0;
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
/* renameat2 is not supported */
if (flags)
return -ENOSYS;
-#endif
FUSE4FS_CHECK_CONTEXT(ff);
dbg_printf(ff, "%s: renaming %s to %s\n", __func__, from, to);
@@ -2774,7 +2742,6 @@ static int op_link(const char *src, const char *dest)
return ret;
}
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
/* Obtain group ids of the process that sent us a command(?) */
static int get_req_groups(struct fuse4fs *ff, gid_t **gids, size_t *nr_gids)
{
@@ -2849,19 +2816,8 @@ static int in_file_group(struct fuse_context *ctxt,
ext2fs_free_mem(&gids);
return ret;
}
-#else
-static int in_file_group(struct fuse_context *ctxt,
- const struct ext2_inode_large *inode)
-{
- return ctxt->gid == inode_gid(*inode);
-}
-#endif
-static int op_chmod(const char *path, mode_t mode
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
- , struct fuse_file_info *fi
-#endif
- )
+static int op_chmod(const char *path, mode_t mode, struct fuse_file_info *fi)
{
struct fuse_context *ctxt = fuse_get_context();
struct fuse4fs *ff = fuse4fs_get();
@@ -2928,11 +2884,8 @@ static int op_chmod(const char *path, mode_t mode
return ret;
}
-static int op_chown(const char *path, uid_t owner, gid_t group
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
- , struct fuse_file_info *fi
-#endif
- )
+static int op_chown(const char *path, uid_t owner, gid_t group,
+ struct fuse_file_info *fi)
{
struct fuse_context *ctxt = fuse_get_context();
struct fuse4fs *ff = fuse4fs_get();
@@ -3070,11 +3023,7 @@ static int fuse4fs_truncate(struct fuse4fs *ff, ext2_ino_t ino, off_t new_size)
return 0;
}
-static int op_truncate(const char *path, off_t len
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
- , struct fuse_file_info *fi
-#endif
- )
+static int op_truncate(const char *path, off_t len, struct fuse_file_info *fi)
{
struct fuse4fs *ff = fuse4fs_get();
ext2_ino_t ino;
@@ -3802,9 +3751,7 @@ struct readdir_iter {
fuse_fill_dir_t func;
struct fuse4fs *ff;
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
enum fuse_readdir_flags flags;
-#endif
unsigned int nr;
off_t startpos;
off_t dirpos;
@@ -3856,44 +3803,29 @@ static int op_readdir_iter(ext2_ino_t dir EXT2FS_ATTR((unused)),
return 0;
dbg_printf(i->ff, "READDIR%s ino=%d %u offset=0x%llx\n",
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
i->flags == FUSE_READDIR_PLUS ? "PLUS" : "",
-#else
- "",
-#endif
dir,
i->nr++,
(unsigned long long)i->dirpos);
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
if (i->flags == FUSE_READDIR_PLUS) {
ret = stat_inode(i->fs, dirent->inode, &stat);
if (ret)
return DIRENT_ABORT;
}
-#endif
memcpy(namebuf, dirent->name, dirent->name_len & 0xFF);
namebuf[dirent->name_len & 0xFF] = 0;
- ret = i->func(i->buf, namebuf, &stat, i->dirpos
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
- , 0
-#endif
- );
+ ret = i->func(i->buf, namebuf, &stat, i->dirpos, 0);
if (ret)
return DIRENT_ABORT;
return 0;
}
-static int op_readdir(const char *path EXT2FS_ATTR((unused)),
- void *buf, fuse_fill_dir_t fill_func,
- off_t offset,
- struct fuse_file_info *fp
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
- , enum fuse_readdir_flags flags
-#endif
- )
+static int op_readdir(const char *path EXT2FS_ATTR((unused)), void *buf,
+ fuse_fill_dir_t fill_func, off_t offset,
+ struct fuse_file_info *fp, enum fuse_readdir_flags flags)
{
struct fuse4fs *ff = fuse4fs_get();
struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
@@ -3902,9 +3834,7 @@ static int op_readdir(const char *path EXT2FS_ATTR((unused)),
.ff = ff,
.dirpos = 0,
.startpos = offset,
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
.flags = flags,
-#endif
};
int ret = 0;
@@ -4087,82 +4017,8 @@ static int op_create(const char *path, mode_t mode, struct fuse_file_info *fp)
return ret;
}
-#if FUSE_VERSION < FUSE_MAKE_VERSION(3, 0)
-static int op_ftruncate(const char *path EXT2FS_ATTR((unused)),
- off_t len, struct fuse_file_info *fp)
-{
- struct fuse4fs *ff = fuse4fs_get();
- struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
- ext2_filsys fs;
- ext2_file_t efp;
- errcode_t err;
- int ret = 0;
-
- FUSE4FS_CHECK_CONTEXT(ff);
- FUSE4FS_CHECK_HANDLE(ff, fh);
- dbg_printf(ff, "%s: ino=%d len=%jd\n", __func__, fh->ino,
- (intmax_t) len);
- fs = fuse4fs_start(ff);
- if (!fuse4fs_is_writeable(ff)) {
- ret = -EROFS;
- goto out;
- }
-
- err = ext2fs_file_open(fs, fh->ino, fh->open_flags, &efp);
- if (err) {
- ret = translate_error(fs, fh->ino, err);
- goto out;
- }
-
- err = ext2fs_file_set_size2(efp, len);
- if (err) {
- ret = translate_error(fs, fh->ino, err);
- goto out2;
- }
-
-out2:
- err = ext2fs_file_close(efp);
- if (ret)
- goto out;
- if (err) {
- ret = translate_error(fs, fh->ino, err);
- goto out;
- }
-
- ret = update_mtime(fs, fh->ino, NULL);
- if (ret)
- goto out;
-
-out:
- fuse4fs_finish(ff, ret);
- return ret;
-}
-
-static int op_fgetattr(const char *path EXT2FS_ATTR((unused)),
- struct stat *statbuf,
- struct fuse_file_info *fp)
-{
- struct fuse4fs *ff = fuse4fs_get();
- ext2_filsys fs;
- struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
- int ret = 0;
-
- FUSE4FS_CHECK_CONTEXT(ff);
- FUSE4FS_CHECK_HANDLE(ff, fh);
- dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
- fs = fuse4fs_start(ff);
- ret = stat_inode(fs, fh->ino, statbuf);
- fuse4fs_finish(ff, ret);
-
- return ret;
-}
-#endif /* FUSE_VERSION < FUSE_MAKE_VERSION(3, 0) */
-
-static int op_utimens(const char *path, const struct timespec ctv[2]
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
- , struct fuse_file_info *fi
-#endif
- )
+static int op_utimens(const char *path, const struct timespec ctv[2],
+ struct fuse_file_info *fi)
{
struct fuse4fs *ff = fuse4fs_get();
struct timespec tv[2];
@@ -4560,13 +4416,8 @@ static int ioctl_shutdown(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
return 0;
}
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(2, 8)
static int op_ioctl(const char *path EXT2FS_ATTR((unused)),
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
unsigned int cmd,
-#else
- int cmd,
-#endif
void *arg EXT2FS_ATTR((unused)),
struct fuse_file_info *fp,
unsigned int flags EXT2FS_ATTR((unused)), void *data)
@@ -4617,7 +4468,6 @@ static int op_ioctl(const char *path EXT2FS_ATTR((unused)),
return ret;
}
-#endif /* FUSE 28 */
static int op_bmap(const char *path, size_t blocksize EXT2FS_ATTR((unused)),
uint64_t *idx)
@@ -4648,8 +4498,7 @@ static int op_bmap(const char *path, size_t blocksize EXT2FS_ATTR((unused)),
return ret;
}
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(2, 9)
-# ifdef SUPPORT_FALLOCATE
+#ifdef SUPPORT_FALLOCATE
static int fuse4fs_allocate_range(struct fuse4fs *ff,
struct fuse4fs_file_handle *fh, int mode,
off_t offset, off_t len)
@@ -4925,8 +4774,7 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode,
return ret;
}
-# endif /* SUPPORT_FALLOCATE */
-#endif /* FUSE 29 */
+#endif /* SUPPORT_FALLOCATE */
static struct fuse_operations fs_ops = {
.init = op_init,
@@ -4959,34 +4807,15 @@ static struct fuse_operations fs_ops = {
.fsyncdir = op_fsync,
.access = op_access,
.create = op_create,
-#if FUSE_VERSION < FUSE_MAKE_VERSION(3, 0)
- .ftruncate = op_ftruncate,
- .fgetattr = op_fgetattr,
-#endif
.utimens = op_utimens,
-#if (FUSE_VERSION >= FUSE_MAKE_VERSION(2, 9)) && (FUSE_VERSION < FUSE_MAKE_VERSION(3, 0))
-# if defined(UTIME_NOW) || defined(UTIME_OMIT)
- .flag_utime_omit_ok = 1,
-# endif
-#endif
.bmap = op_bmap,
#ifdef SUPERFLUOUS
.lock = op_lock,
.poll = op_poll,
#endif
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(2, 8)
.ioctl = op_ioctl,
-#if FUSE_VERSION < FUSE_MAKE_VERSION(3, 0)
- .flag_nullpath_ok = 1,
-#endif
-#endif
-#if FUSE_VERSION >= FUSE_MAKE_VERSION(2, 9)
-#if FUSE_VERSION < FUSE_MAKE_VERSION(3, 0)
- .flag_nopath = 1,
-#endif
-# ifdef SUPPORT_FALLOCATE
+#ifdef SUPPORT_FALLOCATE
.fallocate = op_fallocate,
-# endif
#endif
};
@@ -5347,7 +5176,7 @@ int main(int argc, char *argv[])
/* Set up default fuse parameters */
snprintf(extra_args, BUFSIZ, "-okernel_cache,subtype=%s,"
- "fsname=%s,attr_timeout=0" FUSE_PLATFORM_OPTS,
+ "fsname=%s,attr_timeout=0",
get_subtype(argv[0]),
fctx.device);
if (fctx.no_default_opts == 0)
* [PATCH 03/20] fuse4fs: namespace some helpers
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
2025-08-21 1:08 ` [PATCH 01/20] fuse2fs: port fuse2fs to lowlevel libfuse API Darrick J. Wong
2025-08-21 1:08 ` [PATCH 02/20] fuse4fs: drop fuse 2.x support code Darrick J. Wong
@ 2025-08-21 1:08 ` Darrick J. Wong
2025-08-21 1:08 ` [PATCH 04/20] fuse4fs: convert to low level API Darrick J. Wong
` (16 subsequent siblings)
19 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:08 UTC (permalink / raw)
To: tytso
Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, amir73il,
joannelkoong, neal
From: Darrick J. Wong <djwong@kernel.org>
Prepend "fuse4fs_" to all helper functions that take a struct fuse4fs
object pointer.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse4fs.c | 177 ++++++++++++++++++++++++++++----------------------------
1 file changed, 90 insertions(+), 87 deletions(-)
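To illustrate the convention, here are two of the renamed helpers as they look after this patch, reduced to a standalone sketch. The struct definitions are trimmed stand-ins (assumptions) that carry only the fields these helpers read; the real struct fuse4fs and libfuse's struct fuse_context have many more members:

```c
/* Trimmed stand-in for the real struct fuse4fs. */
struct fuse4fs {
	int fakeroot;
	int kernel;
};

/* Trimmed stand-in for libfuse's struct fuse_context. */
struct fuse_context {
	unsigned int uid;
};

static inline int fuse4fs_is_superuser(struct fuse4fs *ff,
				       const struct fuse_context *ctxt)
{
	if (ff->fakeroot)
		return 1;
	return ctxt->uid == 0;
}

static inline int fuse4fs_want_check_owner(struct fuse4fs *ff,
					   const struct fuse_context *ctxt)
{
	/* In kernel mode the kernel is responsible for access control. */
	if (ff->kernel)
		return 0;
	return !fuse4fs_is_superuser(ff, ctxt);
}
```

The shared prefix makes it obvious at a call site that a helper operates on the fuse4fs context rather than on a bare ext2_filsys.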
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index e6e5729936f6a1..124a16eb0614a8 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -2,6 +2,7 @@
* fuse4fs.c - FUSE low-level server for e2fsprogs.
*
* Copyright (C) 2014-2025 Oracle.
+ * Copyright (C) 2025 CTERA Networks.
*
* %Begin-Header%
* This file may be redistributed under the terms of the GNU Public
@@ -691,7 +692,7 @@ static int ext2_file_type(unsigned int mode)
return 0;
}
-static int fs_can_allocate(struct fuse4fs *ff, blk64_t num)
+static int fuse4fs_can_allocate(struct fuse4fs *ff, blk64_t num)
{
ext2_filsys fs = ff->fs;
blk64_t reserved;
@@ -718,21 +719,22 @@ static int fs_can_allocate(struct fuse4fs *ff, blk64_t num)
return ext2fs_free_blocks_count(fs->super) > reserved + num;
}
-static int fuse4fs_is_writeable(struct fuse4fs *ff)
+static int fuse4fs_is_writeable(const struct fuse4fs *ff)
{
return ff->opstate == F4OP_WRITABLE &&
(ff->fs->super->s_error_count == 0);
}
-static inline int is_superuser(struct fuse4fs *ff, struct fuse_context *ctxt)
+static inline int fuse4fs_is_superuser(struct fuse4fs *ff,
+ const struct fuse_context *ctxt)
{
if (ff->fakeroot)
return 1;
return ctxt->uid == 0;
}
-static inline int want_check_owner(struct fuse4fs *ff,
- struct fuse_context *ctxt)
+static inline int fuse4fs_want_check_owner(struct fuse4fs *ff,
+ const struct fuse_context *ctxt)
{
/*
* The kernel is responsible for access control, so we allow anything
@@ -740,14 +742,14 @@ static inline int want_check_owner(struct fuse4fs *ff,
*/
if (ff->kernel)
return 0;
- return !is_superuser(ff, ctxt);
+ return !fuse4fs_is_superuser(ff, ctxt);
}
/* Test for append permission */
#define A_OK 16
-static int check_iflags_access(struct fuse4fs *ff, ext2_ino_t ino,
- const struct ext2_inode *inode, int mask)
+static int fuse4fs_iflags_access(struct fuse4fs *ff, ext2_ino_t ino,
+ const struct ext2_inode *inode, int mask)
{
EXT2FS_BUILD_BUG_ON((A_OK & (R_OK | W_OK | X_OK | F_OK)) != 0);
@@ -775,7 +777,7 @@ static int check_iflags_access(struct fuse4fs *ff, ext2_ino_t ino,
return 0;
}
-static int check_inum_access(struct fuse4fs *ff, ext2_ino_t ino, int mask)
+static int fuse4fs_inum_access(struct fuse4fs *ff, ext2_ino_t ino, int mask)
{
struct fuse_context *ctxt = fuse_get_context();
ext2_filsys fs = ff->fs;
@@ -807,7 +809,7 @@ static int check_inum_access(struct fuse4fs *ff, ext2_ino_t ino, int mask)
if (mask == 0)
return 0;
- ret = check_iflags_access(ff, ino, &inode, mask);
+ ret = fuse4fs_iflags_access(ff, ino, &inode, mask);
if (ret)
return ret;
@@ -816,7 +818,7 @@ static int check_inum_access(struct fuse4fs *ff, ext2_ino_t ino, int mask)
return 0;
/* Figure out what root's allowed to do */
- if (is_superuser(ff, ctxt)) {
+ if (fuse4fs_is_superuser(ff, ctxt)) {
/* Non-file access always ok */
if (!LINUX_S_ISREG(inode.i_mode))
return 0;
@@ -1517,8 +1519,8 @@ static int op_readlink(const char *path, char *buf, size_t len)
return ret;
}
-static int __getxattr(struct fuse4fs *ff, ext2_ino_t ino, const char *name,
- void **value, size_t *value_len)
+static int fuse4fs_getxattr(struct fuse4fs *ff, ext2_ino_t ino,
+ const char *name, void **value, size_t *value_len)
{
ext2_filsys fs = ff->fs;
struct ext2_xattr_handle *h;
@@ -1548,8 +1550,8 @@ static int __getxattr(struct fuse4fs *ff, ext2_ino_t ino, const char *name,
return ret;
}
-static int __setxattr(struct fuse4fs *ff, ext2_ino_t ino, const char *name,
- void *value, size_t valuelen)
+static int fuse4fs_setxattr(struct fuse4fs *ff, ext2_ino_t ino,
+ const char *name, void *value, size_t valuelen)
{
ext2_filsys fs = ff->fs;
struct ext2_xattr_handle *h;
@@ -1579,8 +1581,8 @@ static int __setxattr(struct fuse4fs *ff, ext2_ino_t ino, const char *name,
return ret;
}
-static int propagate_default_acls(struct fuse4fs *ff, ext2_ino_t parent,
- ext2_ino_t child)
+static int fuse4fs_propagate_default_acls(struct fuse4fs *ff, ext2_ino_t parent,
+ ext2_ino_t child)
{
void *def;
size_t deflen;
@@ -1589,8 +1591,8 @@ static int propagate_default_acls(struct fuse4fs *ff, ext2_ino_t parent,
if (!ff->acl)
return 0;
- ret = __getxattr(ff, parent, XATTR_NAME_POSIX_ACL_DEFAULT, &def,
- &deflen);
+ ret = fuse4fs_getxattr(ff, parent, XATTR_NAME_POSIX_ACL_DEFAULT, &def,
+ &deflen);
switch (ret) {
case -ENODATA:
case -ENOENT:
@@ -1602,7 +1604,8 @@ static int propagate_default_acls(struct fuse4fs *ff, ext2_ino_t parent,
return ret;
}
- ret = __setxattr(ff, child, XATTR_NAME_POSIX_ACL_DEFAULT, def, deflen);
+ ret = fuse4fs_setxattr(ff, child, XATTR_NAME_POSIX_ACL_DEFAULT, def,
+ deflen);
ext2fs_free_mem(&def);
return ret;
}
@@ -1731,7 +1734,7 @@ static int op_mknod(const char *path, mode_t mode, dev_t dev)
*node_name = 0;
fs = fuse4fs_start(ff);
- if (!fs_can_allocate(ff, 2)) {
+ if (!fuse4fs_can_allocate(ff, 2)) {
ret = -ENOSPC;
goto out2;
}
@@ -1743,7 +1746,7 @@ static int op_mknod(const char *path, mode_t mode, dev_t dev)
goto out2;
}
- ret = check_inum_access(ff, parent, A_OK | W_OK);
+ ret = fuse4fs_inum_access(ff, parent, A_OK | W_OK);
if (ret)
goto out2;
@@ -1813,7 +1816,7 @@ static int op_mknod(const char *path, mode_t mode, dev_t dev)
ext2fs_inode_alloc_stats2(fs, child, 1, 0);
- ret = propagate_default_acls(ff, parent, child);
+ ret = fuse4fs_propagate_default_acls(ff, parent, child);
if (ret)
goto out2;
@@ -1861,7 +1864,7 @@ static int op_mkdir(const char *path, mode_t mode)
*node_name = 0;
fs = fuse4fs_start(ff);
- if (!fs_can_allocate(ff, 1)) {
+ if (!fuse4fs_can_allocate(ff, 1)) {
ret = -ENOSPC;
goto out2;
}
@@ -1873,7 +1876,7 @@ static int op_mkdir(const char *path, mode_t mode)
goto out2;
}
- ret = check_inum_access(ff, parent, A_OK | W_OK);
+ ret = fuse4fs_inum_access(ff, parent, A_OK | W_OK);
if (ret)
goto out2;
@@ -1946,7 +1949,7 @@ static int op_mkdir(const char *path, mode_t mode)
goto out3;
}
- ret = propagate_default_acls(ff, parent, child);
+ ret = fuse4fs_propagate_default_acls(ff, parent, child);
if (ret)
goto out3;
@@ -1987,7 +1990,7 @@ static int fuse4fs_unlink(struct fuse4fs *ff, const char *path,
base_name = filename;
}
- ret = check_inum_access(ff, dir, W_OK);
+ ret = fuse4fs_inum_access(ff, dir, W_OK);
if (ret) {
free(filename);
return ret;
@@ -2009,8 +2012,8 @@ static int fuse4fs_unlink(struct fuse4fs *ff, const char *path,
return 0;
}
-static int remove_ea_inodes(struct fuse4fs *ff, ext2_ino_t ino,
- struct ext2_inode_large *inode)
+static int fuse4fs_remove_ea_inodes(struct fuse4fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode)
{
ext2_filsys fs = ff->fs;
struct ext2_xattr_handle *h;
@@ -2054,7 +2057,7 @@ static int remove_ea_inodes(struct fuse4fs *ff, ext2_ino_t ino,
return 0;
}
-static int remove_inode(struct fuse4fs *ff, ext2_ino_t ino)
+static int fuse4fs_remove_inode(struct fuse4fs *ff, ext2_ino_t ino)
{
ext2_filsys fs = ff->fs;
errcode_t err;
@@ -2087,7 +2090,7 @@ static int remove_inode(struct fuse4fs *ff, ext2_ino_t ino)
goto write_out;
if (ext2fs_has_feature_ea_inode(fs->super)) {
- ret = remove_ea_inodes(ff, ino, &inode);
+ ret = fuse4fs_remove_ea_inodes(ff, ino, &inode);
if (ret)
return ret;
}
@@ -2128,7 +2131,7 @@ static int __op_unlink(struct fuse4fs *ff, const char *path)
goto out;
}
- ret = check_inum_access(ff, ino, W_OK);
+ ret = fuse4fs_inum_access(ff, ino, W_OK);
if (ret)
goto out;
@@ -2136,7 +2139,7 @@ static int __op_unlink(struct fuse4fs *ff, const char *path)
if (ret)
goto out;
- ret = remove_inode(ff, ino);
+ ret = fuse4fs_remove_inode(ff, ino);
if (ret)
goto out;
@@ -2204,7 +2207,7 @@ static int __op_rmdir(struct fuse4fs *ff, const char *path)
}
dbg_printf(ff, "%s: rmdir path=%s ino=%d\n", __func__, path, child);
- ret = check_inum_access(ff, child, W_OK);
+ ret = fuse4fs_inum_access(ff, child, W_OK);
if (ret)
goto out;
@@ -2223,7 +2226,7 @@ static int __op_rmdir(struct fuse4fs *ff, const char *path)
goto out;
}
- ret = check_inum_access(ff, rds.parent, W_OK);
+ ret = fuse4fs_inum_access(ff, rds.parent, W_OK);
if (ret)
goto out;
@@ -2236,10 +2239,10 @@ static int __op_rmdir(struct fuse4fs *ff, const char *path)
if (ret)
goto out;
/* Directories have to be "removed" twice. */
- ret = remove_inode(ff, child);
+ ret = fuse4fs_remove_inode(ff, child);
if (ret)
goto out;
- ret = remove_inode(ff, child);
+ ret = fuse4fs_remove_inode(ff, child);
if (ret)
goto out;
@@ -2321,7 +2324,7 @@ static int op_symlink(const char *src, const char *dest)
goto out2;
}
- ret = check_inum_access(ff, parent, A_OK | W_OK);
+ ret = fuse4fs_inum_access(ff, parent, A_OK | W_OK);
if (ret)
goto out2;
@@ -2433,7 +2436,7 @@ static int op_rename(const char *from, const char *to,
FUSE4FS_CHECK_CONTEXT(ff);
dbg_printf(ff, "%s: renaming %s to %s\n", __func__, from, to);
fs = fuse4fs_start(ff);
- if (!fs_can_allocate(ff, 5)) {
+ if (!fuse4fs_can_allocate(ff, 5)) {
ret = -ENOSPC;
goto out;
}
@@ -2459,12 +2462,12 @@ static int op_rename(const char *from, const char *to,
goto out;
}
- ret = check_inum_access(ff, from_ino, W_OK);
+ ret = fuse4fs_inum_access(ff, from_ino, W_OK);
if (ret)
goto out;
if (to_ino) {
- ret = check_inum_access(ff, to_ino, W_OK);
+ ret = fuse4fs_inum_access(ff, to_ino, W_OK);
if (ret)
goto out;
}
@@ -2502,7 +2505,7 @@ static int op_rename(const char *from, const char *to,
goto out2;
}
- ret = check_inum_access(ff, from_dir_ino, W_OK);
+ ret = fuse4fs_inum_access(ff, from_dir_ino, W_OK);
if (ret)
goto out2;
@@ -2527,7 +2530,7 @@ static int op_rename(const char *from, const char *to,
goto out2;
}
- ret = check_inum_access(ff, to_dir_ino, W_OK);
+ ret = fuse4fs_inum_access(ff, to_dir_ino, W_OK);
if (ret)
goto out2;
@@ -2674,7 +2677,7 @@ static int op_link(const char *src, const char *dest)
*node_name = 0;
fs = fuse4fs_start(ff);
- if (!fs_can_allocate(ff, 2)) {
+ if (!fuse4fs_can_allocate(ff, 2)) {
ret = -ENOSPC;
goto out2;
}
@@ -2687,7 +2690,7 @@ static int op_link(const char *src, const char *dest)
goto out2;
}
- ret = check_inum_access(ff, parent, A_OK | W_OK);
+ ret = fuse4fs_inum_access(ff, parent, A_OK | W_OK);
if (ret)
goto out2;
@@ -2703,7 +2706,7 @@ static int op_link(const char *src, const char *dest)
goto out2;
}
- ret = check_iflags_access(ff, ino, EXT2_INODE(&inode), W_OK);
+ ret = fuse4fs_iflags_access(ff, ino, EXT2_INODE(&inode), W_OK);
if (ret)
goto out2;
@@ -2743,7 +2746,7 @@ static int op_link(const char *src, const char *dest)
}
/* Obtain group ids of the process that sent us a command(?) */
-static int get_req_groups(struct fuse4fs *ff, gid_t **gids, size_t *nr_gids)
+static int fuse4fs_get_groups(struct fuse4fs *ff, gid_t **gids, size_t *nr_gids)
{
ext2_filsys fs = ff->fs;
errcode_t err;
@@ -2788,8 +2791,8 @@ static int get_req_groups(struct fuse4fs *ff, gid_t **gids, size_t *nr_gids)
* that initiated the fuse request? Returns 1 for yes, 0 for no, or a negative
* errno.
*/
-static int in_file_group(struct fuse_context *ctxt,
- const struct ext2_inode_large *inode)
+static int fuse4fs_in_file_group(struct fuse_context *ctxt,
+ const struct ext2_inode_large *inode)
{
struct fuse4fs *ff = fuse4fs_get();
gid_t *gids = NULL;
@@ -2797,7 +2800,7 @@ static int in_file_group(struct fuse_context *ctxt,
gid_t gid = inode_gid(*inode);
int ret;
- ret = get_req_groups(ff, &gids, &nr_gids);
+ ret = fuse4fs_get_groups(ff, &gids, &nr_gids);
if (ret == -ENOENT) {
/* magic return code for "could not get caller group info" */
return ctxt->gid == inode_gid(*inode);
@@ -2840,11 +2843,11 @@ static int op_chmod(const char *path, mode_t mode, struct fuse_file_info *fi)
goto out;
}
- ret = check_iflags_access(ff, ino, EXT2_INODE(&inode), W_OK);
+ ret = fuse4fs_iflags_access(ff, ino, EXT2_INODE(&inode), W_OK);
if (ret)
goto out;
- if (want_check_owner(ff, ctxt) && ctxt->uid != inode_uid(inode)) {
+ if (fuse4fs_want_check_owner(ff, ctxt) && ctxt->uid != inode_uid(inode)) {
ret = -EPERM;
goto out;
}
@@ -2854,8 +2857,8 @@ static int op_chmod(const char *path, mode_t mode, struct fuse_file_info *fi)
* of the user's groups, but FUSE only tells us about the primary
* group.
*/
- if (!is_superuser(ff, ctxt)) {
- ret = in_file_group(ctxt, &inode);
+ if (!fuse4fs_is_superuser(ff, ctxt)) {
+ ret = fuse4fs_in_file_group(ctxt, &inode);
if (ret < 0)
goto out;
@@ -2909,14 +2912,14 @@ static int op_chown(const char *path, uid_t owner, gid_t group,
goto out;
}
- ret = check_iflags_access(ff, ino, EXT2_INODE(&inode), W_OK);
+ ret = fuse4fs_iflags_access(ff, ino, EXT2_INODE(&inode), W_OK);
if (ret)
goto out;
/* FUSE seems to feed us ~0 to mean "don't change" */
if (owner != (uid_t) ~0) {
/* Only root gets to change UID. */
- if (want_check_owner(ff, ctxt) &&
+ if (fuse4fs_want_check_owner(ff, ctxt) &&
!(inode_uid(inode) == ctxt->uid && owner == ctxt->uid)) {
ret = -EPERM;
goto out;
@@ -2926,7 +2929,7 @@ static int op_chown(const char *path, uid_t owner, gid_t group,
if (group != (gid_t) ~0) {
/* Only root or the owner get to change GID. */
- if (want_check_owner(ff, ctxt) &&
+ if (fuse4fs_want_check_owner(ff, ctxt) &&
inode_uid(inode) != ctxt->uid) {
ret = -EPERM;
goto out;
@@ -3036,7 +3039,7 @@ static int op_truncate(const char *path, off_t len, struct fuse_file_info *fi)
goto out;
dbg_printf(ff, "%s: ino=%d len=%jd\n", __func__, ino, (intmax_t) len);
- ret = check_inum_access(ff, ino, W_OK);
+ ret = fuse4fs_inum_access(ff, ino, W_OK);
if (ret)
goto out;
@@ -3118,7 +3121,7 @@ static int __op_open(struct fuse4fs *ff, const char *path,
}
dbg_printf(ff, "%s: ino=%d\n", __func__, file->ino);
- ret = check_inum_access(ff, file->ino, check);
+ ret = fuse4fs_inum_access(ff, file->ino, check);
if (ret) {
/*
* In a regular (Linux) fs driver, the kernel will open
@@ -3130,7 +3133,7 @@ static int __op_open(struct fuse4fs *ff, const char *path,
* also employ undocumented hacks (see above).
*/
if (check == R_OK) {
- ret = check_inum_access(ff, file->ino, X_OK);
+ ret = fuse4fs_inum_access(ff, file->ino, X_OK);
if (ret)
goto out;
} else
@@ -3239,7 +3242,7 @@ static int op_write(const char *path EXT2FS_ATTR((unused)),
goto out;
}
- if (!fs_can_allocate(ff, FUSE4FS_B_TO_FSB(ff, len))) {
+ if (!fuse4fs_can_allocate(ff, FUSE4FS_B_TO_FSB(ff, len))) {
ret = -ENOSPC;
goto out;
}
@@ -3439,11 +3442,11 @@ static int op_getxattr(const char *path, const char *key, char *value,
}
dbg_printf(ff, "%s: ino=%d name=%s\n", __func__, ino, key);
- ret = check_inum_access(ff, ino, R_OK);
+ ret = fuse4fs_inum_access(ff, ino, R_OK);
if (ret)
goto out;
- ret = __getxattr(ff, ino, key, &ptr, &plen);
+ ret = fuse4fs_getxattr(ff, ino, key, &ptr, &plen);
if (ret)
goto out;
@@ -3509,7 +3512,7 @@ static int op_listxattr(const char *path, char *names, size_t len)
}
dbg_printf(ff, "%s: ino=%d\n", __func__, ino);
- ret = check_inum_access(ff, ino, R_OK);
+ ret = fuse4fs_inum_access(ff, ino, R_OK);
if (ret)
goto out;
@@ -3590,7 +3593,7 @@ static int op_setxattr(const char *path EXT2FS_ATTR((unused)),
}
dbg_printf(ff, "%s: ino=%d name=%s\n", __func__, ino, key);
- ret = check_inum_access(ff, ino, W_OK);
+ ret = fuse4fs_inum_access(ff, ino, W_OK);
if (ret == -EACCES) {
ret = -EPERM;
goto out;
@@ -3679,7 +3682,7 @@ static int op_removexattr(const char *path, const char *key)
goto out;
}
- if (!fs_can_allocate(ff, 1)) {
+ if (!fuse4fs_can_allocate(ff, 1)) {
ret = -ENOSPC;
goto out;
}
@@ -3691,7 +3694,7 @@ static int op_removexattr(const char *path, const char *key)
}
dbg_printf(ff, "%s: ino=%d name=%s\n", __func__, ino, key);
- ret = check_inum_access(ff, ino, W_OK);
+ ret = fuse4fs_inum_access(ff, ino, W_OK);
if (ret)
goto out;
@@ -3878,7 +3881,7 @@ static int op_access(const char *path, int mask)
goto out;
}
- ret = check_inum_access(ff, ino, mask);
+ ret = fuse4fs_inum_access(ff, ino, mask);
if (ret)
goto out;
@@ -3918,7 +3921,7 @@ static int op_create(const char *path, mode_t mode, struct fuse_file_info *fp)
*node_name = 0;
fs = fuse4fs_start(ff);
- if (!fs_can_allocate(ff, 1)) {
+ if (!fuse4fs_can_allocate(ff, 1)) {
ret = -ENOSPC;
goto out2;
}
@@ -3930,7 +3933,7 @@ static int op_create(const char *path, mode_t mode, struct fuse_file_info *fp)
goto out2;
}
- ret = check_inum_access(ff, parent, A_OK | W_OK);
+ ret = fuse4fs_inum_access(ff, parent, A_OK | W_OK);
if (ret)
goto out2;
@@ -3997,7 +4000,7 @@ static int op_create(const char *path, mode_t mode, struct fuse_file_info *fp)
ext2fs_inode_alloc_stats2(fs, child, 1, 0);
- ret = propagate_default_acls(ff, parent, child);
+ ret = fuse4fs_propagate_default_acls(ff, parent, child);
if (ret)
goto out2;
@@ -4045,7 +4048,7 @@ static int op_utimens(const char *path, const struct timespec ctv[2],
*/
if (ctv[0].tv_nsec == UTIME_NOW && ctv[1].tv_nsec == UTIME_NOW)
access |= A_OK;
- ret = check_inum_access(ff, ino, access);
+ ret = fuse4fs_inum_access(ff, ino, access);
if (ret)
goto out;
@@ -4130,7 +4133,7 @@ static int ioctl_setflags(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
if (err)
return translate_error(fs, fh->ino, err);
- if (want_check_owner(ff, ctxt) && inode_uid(inode) != ctxt->uid)
+ if (fuse4fs_want_check_owner(ff, ctxt) && inode_uid(inode) != ctxt->uid)
return -EPERM;
ret = set_iflags(&inode, flags);
@@ -4179,7 +4182,7 @@ static int ioctl_setversion(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
if (err)
return translate_error(fs, fh->ino, err);
- if (want_check_owner(ff, ctxt) && inode_uid(inode) != ctxt->uid)
+ if (fuse4fs_want_check_owner(ff, ctxt) && inode_uid(inode) != ctxt->uid)
return -EPERM;
inode.i_generation = generation;
@@ -4278,7 +4281,7 @@ static int ioctl_fssetxattr(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
if (err)
return translate_error(fs, fh->ino, err);
- if (want_check_owner(ff, ctxt) && inode_uid(inode) != ctxt->uid)
+ if (fuse4fs_want_check_owner(ff, ctxt) && inode_uid(inode) != ctxt->uid)
return -EPERM;
ret = set_iflags(&inode, flags);
@@ -4399,7 +4402,7 @@ static int ioctl_shutdown(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
struct fuse_context *ctxt = fuse_get_context();
ext2_filsys fs = ff->fs;
- if (!is_superuser(ff, ctxt))
+ if (!fuse4fs_is_superuser(ff, ctxt))
return -EPERM;
err_printf(ff, "%s.\n", _("shut down requested"));
@@ -4518,7 +4521,7 @@ static int fuse4fs_allocate_range(struct fuse4fs *ff,
(unsigned long long)len,
(unsigned long long)start,
(unsigned long long)end);
- if (!fs_can_allocate(ff, FUSE4FS_B_TO_FSB(ff, len)))
+ if (!fuse4fs_can_allocate(ff, FUSE4FS_B_TO_FSB(ff, len)))
return -ENOSPC;
err = fuse4fs_read_inode(fs, fh->ino, &inode);
@@ -4561,9 +4564,9 @@ static int fuse4fs_allocate_range(struct fuse4fs *ff,
return err;
}
-static errcode_t clean_block_middle(struct fuse4fs *ff, ext2_ino_t ino,
- struct ext2_inode_large *inode,
- off_t offset, off_t len, char **buf)
+static errcode_t fuse4fs_zero_middle(struct fuse4fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode,
+ off_t offset, off_t len, char **buf)
{
ext2_filsys fs = ff->fs;
blk64_t blk;
@@ -4597,9 +4600,9 @@ static errcode_t clean_block_middle(struct fuse4fs *ff, ext2_ino_t ino,
return io_channel_write_blk64(fs->io, blk, 1, *buf);
}
-static errcode_t clean_block_edge(struct fuse4fs *ff, ext2_ino_t ino,
- struct ext2_inode_large *inode, off_t offset,
- int clean_before, char **buf)
+static errcode_t fuse4fs_zero_edge(struct fuse4fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode, off_t offset,
+ int clean_before, char **buf)
{
ext2_filsys fs = ff->fs;
blk64_t blk;
@@ -4690,13 +4693,13 @@ static int fuse4fs_punch_range(struct fuse4fs *ff,
/* Zero everything before the first block and after the last block */
if (FUSE4FS_B_TO_FSBT(ff, offset) == FUSE4FS_B_TO_FSBT(ff, offset + len))
- err = clean_block_middle(ff, fh->ino, &inode, offset,
+ err = fuse4fs_zero_middle(ff, fh->ino, &inode, offset,
len, &buf);
else {
- err = clean_block_edge(ff, fh->ino, &inode, offset, 0, &buf);
+ err = fuse4fs_zero_edge(ff, fh->ino, &inode, offset, 0, &buf);
if (!err)
- err = clean_block_edge(ff, fh->ino, &inode,
- offset + len, 1, &buf);
+ err = fuse4fs_zero_edge(ff, fh->ino, &inode,
+ offset + len, 1, &buf);
}
if (buf)
ext2fs_free_mem(&buf);
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 04/20] fuse4fs: convert to low level API
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
` (2 preceding siblings ...)
2025-08-21 1:08 ` [PATCH 03/20] fuse4fs: namespace some helpers Darrick J. Wong
@ 2025-08-21 1:08 ` Darrick J. Wong
2025-08-21 1:09 ` [PATCH 05/20] libsupport: port the kernel list.h to libsupport Darrick J. Wong
` (15 subsequent siblings)
19 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:08 UTC (permalink / raw)
To: tytso
Cc: amir73il, John, bernd, linux-fsdevel, linux-ext4, miklos,
amir73il, joannelkoong, neal
From: Darrick J. Wong <djwong@kernel.org>
Convert fuse4fs to the low-level libfuse API. Amir supplied the automated
translation; I ported and cleaned it up by hand, and did the QA work to
make sure it still runs correctly.
Co-developed-by: Claude claude-4-sonnet
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse4fs.c | 2005 ++++++++++++++++++++++++++++++--------------------------
1 file changed, 1073 insertions(+), 932 deletions(-)
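
A note for reviewers on the inode number handling: the low-level API hands
each handler a fuse nodeid instead of a path, so the server has to translate
between the fuse root nodeid and the ext2 root inode in both directions. The
following is a minimal standalone sketch of that translation, not the code in
the patch. All SKETCH_/sketch_ names are hypothetical; the constants assume
that libfuse defines FUSE_ROOT_ID as 1 and libext2fs defines EXT2_ROOT_INO as
2, and the typedefs assume a 64-bit fuse_ino_t and a 32-bit ext2_ino_t.

```c
#include <stdint.h>

/*
 * Assumed values: FUSE_ROOT_ID is 1 in libfuse and EXT2_ROOT_INO is 2 in
 * libext2fs. They are redefined under SKETCH_ names only so that this
 * example builds without either library installed.
 */
#define SKETCH_FUSE_ROOT_ID	1ULL
#define SKETCH_EXT2_ROOT_INO	2U

typedef uint64_t sketch_fuse_ino_t;	/* stand-in for fuse_ino_t */
typedef uint32_t sketch_ext2_ino_t;	/* stand-in for ext2_ino_t */

/*
 * Translate a nodeid received from the kernel into an ext2 inode number.
 * Returns 0 on success, or -1 if the nodeid does not fit in 32 bits and
 * therefore cannot name an ext2 inode.
 */
static int sketch_ino_from_fuse(sketch_ext2_ino_t *inop,
				sketch_fuse_ino_t fino)
{
	if (fino > UINT32_MAX)
		return -1;

	if (fino == SKETCH_FUSE_ROOT_ID)
		*inop = SKETCH_EXT2_ROOT_INO;
	else
		*inop = (sketch_ext2_ino_t)fino;
	return 0;
}

/* Translate an ext2 inode number into the nodeid reported to the kernel. */
static void sketch_ino_to_fuse(sketch_fuse_ino_t *finop,
			       sketch_ext2_ino_t ino)
{
	if (ino == SKETCH_EXT2_ROOT_INO)
		*finop = SKETCH_FUSE_ROOT_ID;
	else
		*finop = ino;
}
```

In this sketch the range check is folded into the conversion helper and
reported through the return value, so a handler would reply with an error and
return as soon as the helper fails. Every other inode number passes through
unchanged in both directions.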
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 124a16eb0614a8..0dd47dcf18d77a 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -41,7 +41,7 @@
# define __SET_FOB_FOR_FUSE
# define _FILE_OFFSET_BITS 64
#endif /* _FILE_OFFSET_BITS */
-#include <fuse.h>
+#include <fuse_lowlevel.h>
#ifdef __SET_FOB_FOR_FUSE
# undef _FILE_OFFSET_BITS
#endif /* __SET_FOB_FOR_FUSE */
@@ -116,6 +116,8 @@
#endif
#endif /* !defined(ENODATA) */
+#define FUSE4FS_ATTR_TIMEOUT (0.0)
+
static inline uint64_t round_up(uint64_t b, unsigned int align)
{
unsigned int m;
@@ -249,16 +251,18 @@ struct fuse4fs {
struct timespec op_start_time;
uint8_t timing;
#endif
+ struct fuse_session *fuse;
};
-#define FUSE4FS_CHECK_HANDLE(ff, fh) \
+#define FUSE4FS_CHECK_HANDLE(req, fh) \
do { \
if ((fh) == NULL || (fh)->magic != FUSE4FS_FILE_MAGIC) { \
fprintf(stderr, \
"FUSE4FS: Corrupt in-memory file handle at %s:%d!\n", \
__func__, __LINE__); \
fflush(stderr); \
- return -EUCLEAN; \
+ fuse_reply_err(req, EUCLEAN); \
+ return; \
} \
} while (0)
@@ -270,19 +274,52 @@ struct fuse4fs {
__func__, __LINE__); \
fflush(stderr); \
retcode; \
+ return; \
} \
if ((ff)->opstate == F4OP_SHUTDOWN) { \
shutcode; \
+ return; \
} \
} while (0)
-#define FUSE4FS_CHECK_CONTEXT(ff) \
- __FUSE4FS_CHECK_CONTEXT((ff), return -EUCLEAN, return -EIO)
+#define FUSE4FS_CHECK_CONTEXT(req) \
+ __FUSE4FS_CHECK_CONTEXT(fuse4fs_get(req), \
+ fuse_reply_err((req), EUCLEAN), \
+ fuse_reply_err((req), EIO))
#define FUSE4FS_CHECK_CONTEXT_RETURN(ff) \
__FUSE4FS_CHECK_CONTEXT((ff), return, return)
#define FUSE4FS_CHECK_CONTEXT_ABORT(ff) \
__FUSE4FS_CHECK_CONTEXT((ff), abort(), abort())
+static inline void fuse4fs_ino_from_fuse(ext2_ino_t *inop, fuse_ino_t fino)
+{
+ if (fino == FUSE_ROOT_ID)
+ *inop = EXT2_ROOT_INO;
+ else
+ *inop = fino;
+}
+
+static inline void fuse4fs_ino_to_fuse(fuse_ino_t *finop, ext2_ino_t ino)
+{
+ if (ino == EXT2_ROOT_INO)
+ *finop = FUSE_ROOT_ID;
+ else
+ *finop = ino;
+}
+
+#define FUSE4FS_CONVERT_FINO(req, ext2_inop, fuse_ino) \
+ do { \
+ if ((fuse_ino) > UINT32_MAX) { \
+ fprintf(stderr, \
+ "FUSE4FS: Bogus inode number 0x%llx at %s:%d!\n", \
+ (unsigned long long)(fuse_ino), __func__, __LINE__); \
+ fflush(stderr); \
+ fuse_reply_err((req), EIO); \
+ return; \
+ } \
+ fuse4fs_ino_from_fuse(ext2_inop, fuse_ino); \
+ } while (0)
+
static int __translate_error(ext2_filsys fs, ext2_ino_t ino, errcode_t err,
const char *func, int line);
#define translate_error(fs, ino, err) __translate_error((fs), (ino), (err), \
@@ -449,11 +486,9 @@ static inline errcode_t fuse4fs_write_inode(ext2_filsys fs, ext2_ino_t ino,
sizeof(*inode));
}
-static inline struct fuse4fs *fuse4fs_get(void)
+static inline struct fuse4fs *fuse4fs_get(fuse_req_t req)
{
- struct fuse_context *ctxt = fuse_get_context();
-
- return ctxt->private_data;
+ return (struct fuse4fs *)fuse_req_userdata(req);
}
static inline struct fuse4fs_file_handle *
@@ -466,6 +501,7 @@ static inline void
fuse4fs_set_handle(struct fuse_file_info *fp, struct fuse4fs_file_handle *fh)
{
fp->fh = (uintptr_t)fh;
+ fp->keep_cache = 1;
}
#ifdef HAVE_CLOCK_MONOTONIC
@@ -726,7 +762,7 @@ static int fuse4fs_is_writeable(const struct fuse4fs *ff)
}
static inline int fuse4fs_is_superuser(struct fuse4fs *ff,
- const struct fuse_context *ctxt)
+ const struct fuse_ctx *ctxt)
{
if (ff->fakeroot)
return 1;
@@ -734,7 +770,7 @@ static inline int fuse4fs_is_superuser(struct fuse4fs *ff,
}
static inline int fuse4fs_want_check_owner(struct fuse4fs *ff,
- const struct fuse_context *ctxt)
+ const struct fuse_ctx *ctxt)
{
/*
* The kernel is responsible for access control, so we allow anything
@@ -777,9 +813,9 @@ static int fuse4fs_iflags_access(struct fuse4fs *ff, ext2_ino_t ino,
return 0;
}
-static int fuse4fs_inum_access(struct fuse4fs *ff, ext2_ino_t ino, int mask)
+static int fuse4fs_inum_access(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
+ ext2_ino_t ino, int mask)
{
- struct fuse_context *ctxt = fuse_get_context();
ext2_filsys fs = ff->fs;
struct ext2_inode inode;
mode_t perms;
@@ -1114,9 +1150,9 @@ static errcode_t fuse4fs_mount(struct fuse4fs *ff)
return 0;
}
-static void op_destroy(void *p EXT2FS_ATTR((unused)))
+static void op_destroy(void *userdata)
{
- struct fuse4fs *ff = fuse4fs_get();
+ struct fuse4fs *ff = userdata;
ext2_filsys fs;
errcode_t err;
@@ -1283,24 +1319,13 @@ static inline int fuse_set_feature_flag(struct fuse_conn_info *conn,
}
#endif
-static void *op_init(struct fuse_conn_info *conn,
- struct fuse_config *cfg EXT2FS_ATTR((unused)))
+static void op_init(void *userdata, struct fuse_conn_info *conn)
{
- struct fuse4fs *ff = fuse4fs_get();
+ struct fuse4fs *ff = userdata;
ext2_filsys fs;
FUSE4FS_CHECK_CONTEXT_ABORT(ff);
- /*
- * Configure logging a second time, because libfuse might have
- * redirected std{out,err} as part of daemonization. If this fails,
- * give up and move on.
- */
- fuse4fs_setup_logging(ff);
- if (ff->logfd >= 0)
- close(ff->logfd);
- ff->logfd = -1;
-
fs = ff->fs;
dbg_printf(ff, "%s: dev=%s\n", __func__, fs->device_name);
#ifdef FUSE_CAP_IOCTL_DIR
@@ -1317,10 +1342,6 @@ static void *op_init(struct fuse_conn_info *conn,
fuse_set_feature_flag(conn, FUSE_CAP_NO_EXPORT_SUPPORT);
#endif
conn->time_gran = 1;
- cfg->use_ino = 1;
- if (ff->debug)
- cfg->debug = 1;
- cfg->nullpath_ok = 1;
if (ff->kernel) {
char uuid[UUID_STR_SIZE];
@@ -1342,132 +1363,151 @@ static void *op_init(struct fuse_conn_info *conn,
*/
conn->want = conn->want_ext & 0xFFFFFFFF;
#endif
- return ff;
}
-static int stat_inode(ext2_filsys fs, ext2_ino_t ino, struct stat *statbuf)
+struct fuse4fs_stat {
+ struct fuse_entry_param entry;
+};
+
+static int fuse4fs_stat_inode(struct fuse4fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inodep,
+ struct fuse4fs_stat *fstat)
{
struct ext2_inode_large inode;
+ ext2_filsys fs = ff->fs;
+ struct fuse_entry_param *entry = &fstat->entry;
+ struct stat *statbuf = &entry->attr;
dev_t fakedev = 0;
errcode_t err;
- int ret = 0;
struct timespec tv;
- err = fuse4fs_read_inode(fs, ino, &inode);
- if (err)
- return translate_error(fs, ino, err);
+ memset(fstat, 0, sizeof(*fstat));
+
+ if (!inodep) {
+ err = fuse4fs_read_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+ inodep = &inode;
+ }
memcpy(&fakedev, fs->super->s_uuid, sizeof(fakedev));
statbuf->st_dev = fakedev;
statbuf->st_ino = ino;
- statbuf->st_mode = inode.i_mode;
- statbuf->st_nlink = inode.i_links_count;
- statbuf->st_uid = inode_uid(inode);
- statbuf->st_gid = inode_gid(inode);
- statbuf->st_size = EXT2_I_SIZE(&inode);
+ statbuf->st_mode = inodep->i_mode;
+ statbuf->st_nlink = inodep->i_links_count;
+ statbuf->st_uid = inode_uid(*inodep);
+ statbuf->st_gid = inode_gid(*inodep);
+ statbuf->st_size = EXT2_I_SIZE(inodep);
statbuf->st_blksize = fs->blocksize;
statbuf->st_blocks = ext2fs_get_stat_i_blocks(fs,
- EXT2_INODE(&inode));
- EXT4_INODE_GET_XTIME(i_atime, &tv, &inode);
+ EXT2_INODE(inodep));
+ EXT4_INODE_GET_XTIME(i_atime, &tv, inodep);
#if HAVE_STRUCT_STAT_ST_ATIM
statbuf->st_atim = tv;
#else
statbuf->st_atime = tv.tv_sec;
#endif
- EXT4_INODE_GET_XTIME(i_mtime, &tv, &inode);
+ EXT4_INODE_GET_XTIME(i_mtime, &tv, inodep);
#if HAVE_STRUCT_STAT_ST_ATIM
statbuf->st_mtim = tv;
#else
statbuf->st_mtime = tv.tv_sec;
#endif
- EXT4_INODE_GET_XTIME(i_ctime, &tv, &inode);
+ EXT4_INODE_GET_XTIME(i_ctime, &tv, inodep);
#if HAVE_STRUCT_STAT_ST_ATIM
statbuf->st_ctim = tv;
#else
statbuf->st_ctime = tv.tv_sec;
#endif
- if (LINUX_S_ISCHR(inode.i_mode) ||
- LINUX_S_ISBLK(inode.i_mode)) {
- if (inode.i_block[0])
- statbuf->st_rdev = inode.i_block[0];
+ if (LINUX_S_ISCHR(inodep->i_mode) ||
+ LINUX_S_ISBLK(inodep->i_mode)) {
+ if (inodep->i_block[0])
+ statbuf->st_rdev = inodep->i_block[0];
else
- statbuf->st_rdev = inode.i_block[1];
+ statbuf->st_rdev = inodep->i_block[1];
}
- return ret;
-}
-
-static int __fuse4fs_file_ino(struct fuse4fs *ff, const char *path,
- struct fuse_file_info *fp EXT2FS_ATTR((unused)),
- ext2_ino_t *inop,
- const char *func,
- int line)
-{
- ext2_filsys fs = ff->fs;
- errcode_t err;
-
- if (fp) {
- struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
-
- if (fh->ino == 0)
- return -ESTALE;
-
- *inop = fh->ino;
- dbg_printf(ff, "%s: get ino=%d\n", func, fh->ino);
- return 0;
- }
-
- dbg_printf(ff, "%s: get path=%s\n", func, path);
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, inop);
- if (err)
- return __translate_error(fs, 0, err, func, line);
+ fuse4fs_ino_to_fuse(&entry->ino, ino);
+ entry->generation = inodep->i_generation;
+ entry->attr_timeout = FUSE4FS_ATTR_TIMEOUT;
+ entry->entry_timeout = FUSE4FS_ATTR_TIMEOUT;
return 0;
}
-# define fuse4fs_file_ino(ff, path, fp, inop) \
- __fuse4fs_file_ino((ff), (path), (fp), (inop), __func__, __LINE__)
-
-static int op_getattr(const char *path, struct stat *statbuf,
- struct fuse_file_info *fi)
+static void op_lookup(fuse_req_t req, fuse_ino_t fino, const char *name)
{
- struct fuse4fs *ff = fuse4fs_get();
+ struct fuse4fs_stat fstat;
+ struct fuse4fs *ff = fuse4fs_get(req);
ext2_filsys fs;
- ext2_ino_t ino;
+ ext2_ino_t parent, child;
+ errcode_t err;
int ret = 0;
- FUSE4FS_CHECK_CONTEXT(ff);
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &parent, fino);
+ dbg_printf(ff, "%s: parent=%d name='%s'\n", __func__, parent, name);
fs = fuse4fs_start(ff);
- ret = fuse4fs_file_ino(ff, path, fi, &ino);
+
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, parent, name, &child);
+ if (err || child == 0) {
+ ret = translate_error(fs, 0, err);
+ goto out;
+ }
+
+ ret = fuse4fs_stat_inode(ff, child, NULL, &fstat);
if (ret)
goto out;
- ret = stat_inode(fs, ino, statbuf);
+
out:
fuse4fs_finish(ff, ret);
- return ret;
+
+ if (ret)
+ fuse_reply_err(req, -ret);
+ else
+ fuse_reply_entry(req, &fstat.entry);
}
-static int op_readlink(const char *path, char *buf, size_t len)
+static void op_getattr(fuse_req_t req, fuse_ino_t fino,
+ struct fuse_file_info *fi EXT2FS_ATTR((unused)))
{
- struct fuse4fs *ff = fuse4fs_get();
- ext2_filsys fs;
- errcode_t err;
+ struct fuse4fs_stat fstat;
+ struct fuse4fs *ff = fuse4fs_get(req);
ext2_ino_t ino;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &ino, fino);
+ fuse4fs_start(ff);
+ ret = fuse4fs_stat_inode(ff, ino, NULL, &fstat);
+ fuse4fs_finish(ff, ret);
+
+ if (ret)
+ fuse_reply_err(req, -ret);
+ else
+ fuse_reply_attr(req, &fstat.entry.attr,
+ fstat.entry.attr_timeout);
+}
+
+static void op_readlink(fuse_req_t req, fuse_ino_t fino)
+{
struct ext2_inode inode;
+ char buf[PATH_MAX + 1];
+ struct fuse4fs *ff = fuse4fs_get(req);
+ ext2_filsys fs;
+ ext2_file_t file;
+ errcode_t err;
+ ext2_ino_t ino;
+ size_t len = PATH_MAX;
unsigned int got;
- ext2_file_t file;
int ret = 0;
- FUSE4FS_CHECK_CONTEXT(ff);
- dbg_printf(ff, "%s: path=%s\n", __func__, path);
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &ino, fino);
+ dbg_printf(ff, "%s: ino=%d\n", __func__, ino);
fs = fuse4fs_start(ff);
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, &ino);
- if (err || ino == 0) {
- ret = translate_error(fs, 0, err);
- goto out;
- }
- err = ext2fs_read_inode(fs, ino, &inode);
+ err = ext2fs_read_inode(fs, ino, &inode);
if (err) {
ret = translate_error(fs, ino, err);
goto out;
@@ -1478,7 +1518,6 @@ static int op_readlink(const char *path, char *buf, size_t len)
goto out;
}
- len--;
if (inode.i_size < len)
len = inode.i_size;
if (ext2fs_is_fast_symlink(&inode))
@@ -1516,7 +1555,11 @@ static int op_readlink(const char *path, char *buf, size_t len)
out:
fuse4fs_finish(ff, ret);
- return ret;
+
+ if (ret)
+ fuse_reply_err(req, -ret);
+ else
+ fuse_reply_readlink(req, buf);
}
static int fuse4fs_getxattr(struct fuse4fs *ff, ext2_ino_t ino,
@@ -1622,11 +1665,12 @@ static inline void fuse4fs_set_gid(struct ext2_inode_large *inode, gid_t gid)
ext2fs_set_i_gid_high(*inode, gid >> 16);
}
-static int fuse4fs_new_child_gid(struct fuse4fs *ff, ext2_ino_t parent,
- gid_t *gid, int *parent_sgid)
+static int fuse4fs_new_child_gid(struct fuse4fs *ff,
+ const struct fuse_ctx *ctxt,
+ ext2_ino_t parent, gid_t *gid,
+ int *parent_sgid)
{
struct ext2_inode_large inode;
- struct fuse_context *ctxt = fuse_get_context();
errcode_t err;
err = fuse4fs_read_inode(ff->fs, parent, &inode);
@@ -1702,36 +1746,44 @@ static void fuse4fs_set_extra_isize(struct fuse4fs *ff, ext2_ino_t ino,
inode->i_extra_isize = extra;
}
-static int op_mknod(const char *path, mode_t mode, dev_t dev)
+static void fuse4fs_reply_entry(fuse_req_t req, ext2_ino_t ino,
+ struct ext2_inode_large *inode, int ret)
{
- struct fuse_context *ctxt = fuse_get_context();
- struct fuse4fs *ff = fuse4fs_get();
+ struct fuse4fs_stat fstat;
+ struct fuse4fs *ff = fuse4fs_get(req);
+
+ if (ret) {
+ fuse_reply_err(req, -ret);
+ return;
+ }
+
+ /* Get stat info for the new entry */
+ ret = fuse4fs_stat_inode(ff, ino, inode, &fstat);
+ if (ret) {
+ fuse_reply_err(req, -ret);
+ return;
+ }
+
+ fuse_reply_entry(req, &fstat.entry);
+}
+
+static void op_mknod(fuse_req_t req, fuse_ino_t fino, const char *name,
+ mode_t mode, dev_t dev)
+{
+ struct ext2_inode_large inode;
+ const struct fuse_ctx *ctxt = fuse_req_ctx(req);
+ struct fuse4fs *ff = fuse4fs_get(req);
ext2_filsys fs;
ext2_ino_t parent, child;
- char *temp_path;
errcode_t err;
- char *node_name, a;
int filetype;
- struct ext2_inode_large inode;
gid_t gid;
int ret = 0;
- FUSE4FS_CHECK_CONTEXT(ff);
- dbg_printf(ff, "%s: path=%s mode=0%o dev=0x%x\n", __func__, path, mode,
- (unsigned int)dev);
- temp_path = strdup(path);
- if (!temp_path) {
- ret = -ENOMEM;
- goto out;
- }
- node_name = strrchr(temp_path, '/');
- if (!node_name) {
- ret = -ENOMEM;
- goto out;
- }
- node_name++;
- a = *node_name;
- *node_name = 0;
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &parent, fino);
+ dbg_printf(ff, "%s: parent=%d name='%s' mode=0%o dev=0x%x\n",
+ __func__, parent, name, mode, (unsigned int)dev);
fs = fuse4fs_start(ff);
if (!fuse4fs_can_allocate(ff, 2)) {
@@ -1739,33 +1791,14 @@ static int op_mknod(const char *path, mode_t mode, dev_t dev)
goto out2;
}
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, temp_path,
- &parent);
- if (err) {
- ret = translate_error(fs, 0, err);
- goto out2;
- }
-
- ret = fuse4fs_inum_access(ff, parent, A_OK | W_OK);
+ ret = fuse4fs_inum_access(ff, ctxt, parent, A_OK | W_OK);
if (ret)
goto out2;
- *node_name = a;
+ /* On a low level server, mknod handles all non-directory types */
+ filetype = ext2_file_type(mode);
- if (LINUX_S_ISCHR(mode))
- filetype = EXT2_FT_CHRDEV;
- else if (LINUX_S_ISBLK(mode))
- filetype = EXT2_FT_BLKDEV;
- else if (LINUX_S_ISFIFO(mode))
- filetype = EXT2_FT_FIFO;
- else if (LINUX_S_ISSOCK(mode))
- filetype = EXT2_FT_SOCK;
- else {
- ret = -EINVAL;
- goto out2;
- }
-
- err = fuse4fs_new_child_gid(ff, parent, &gid, NULL);
+ err = fuse4fs_new_child_gid(ff, ctxt, parent, &gid, NULL);
if (err)
goto out2;
@@ -1775,9 +1808,9 @@ static int op_mknod(const char *path, mode_t mode, dev_t dev)
goto out2;
}
- dbg_printf(ff, "%s: create ino=%d/name=%s in dir=%d\n", __func__, child,
- node_name, parent);
- err = ext2fs_link(fs, parent, node_name, child,
+ dbg_printf(ff, "%s: create ino=%d name='%s' in dir=%d\n", __func__,
+ child, name, parent);
+ err = ext2fs_link(fs, parent, name, child,
filetype | EXT2FS_LINK_EXPAND);
if (err) {
ret = translate_error(fs, parent, err);
@@ -1826,42 +1859,28 @@ static int op_mknod(const char *path, mode_t mode, dev_t dev)
out2:
fuse4fs_finish(ff, ret);
-out:
- free(temp_path);
- return ret;
+ fuse4fs_reply_entry(req, child, &inode, ret);
}
-static int op_mkdir(const char *path, mode_t mode)
+static void op_mkdir(fuse_req_t req, fuse_ino_t fino, const char *name,
+ mode_t mode)
{
- struct fuse_context *ctxt = fuse_get_context();
- struct fuse4fs *ff = fuse4fs_get();
+ struct ext2_inode_large inode;
+ const struct fuse_ctx *ctxt = fuse_req_ctx(req);
+ struct fuse4fs *ff = fuse4fs_get(req);
ext2_filsys fs;
ext2_ino_t parent, child;
- char *temp_path;
errcode_t err;
- char *node_name, a;
- struct ext2_inode_large inode;
char *block;
blk64_t blk;
int ret = 0;
gid_t gid;
int parent_sgid;
- FUSE4FS_CHECK_CONTEXT(ff);
- dbg_printf(ff, "%s: path=%s mode=0%o\n", __func__, path, mode);
- temp_path = strdup(path);
- if (!temp_path) {
- ret = -ENOMEM;
- goto out;
- }
- node_name = strrchr(temp_path, '/');
- if (!node_name) {
- ret = -ENOMEM;
- goto out;
- }
- node_name++;
- a = *node_name;
- *node_name = 0;
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &parent, fino);
+ dbg_printf(ff, "%s: parent=%d name='%s' mode=0%o\n",
+ __func__, parent, name, mode);
fs = fuse4fs_start(ff);
if (!fuse4fs_can_allocate(ff, 1)) {
@@ -1869,25 +1888,15 @@ static int op_mkdir(const char *path, mode_t mode)
goto out2;
}
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, temp_path,
- &parent);
- if (err) {
- ret = translate_error(fs, 0, err);
- goto out2;
- }
-
- ret = fuse4fs_inum_access(ff, parent, A_OK | W_OK);
+ ret = fuse4fs_inum_access(ff, ctxt, parent, A_OK | W_OK);
if (ret)
goto out2;
- err = fuse4fs_new_child_gid(ff, parent, &gid, &parent_sgid);
+ err = fuse4fs_new_child_gid(ff, ctxt, parent, &gid, &parent_sgid);
if (err)
goto out2;
- *node_name = a;
-
- err = ext2fs_mkdir2(fs, parent, 0, 0, EXT2FS_LINK_EXPAND,
- node_name, NULL);
+ err = ext2fs_mkdir2(fs, parent, 0, 0, EXT2FS_LINK_EXPAND, name, NULL);
if (err) {
ret = translate_error(fs, parent, err);
goto out2;
@@ -1898,14 +1907,13 @@ static int op_mkdir(const char *path, mode_t mode)
goto out2;
/* Still have to update the uid/gid of the dir */
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, temp_path,
- &child);
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, parent, name, &child);
if (err) {
ret = translate_error(fs, 0, err);
goto out2;
}
- dbg_printf(ff, "%s: created ino=%d/path=%s in dir=%d\n", __func__, child,
- node_name, parent);
+ dbg_printf(ff, "%s: created ino=%d name='%s' in dir=%d\n",
+ __func__, child, name, parent);
err = fuse4fs_read_inode(fs, child, &inode);
if (err) {
@@ -1961,55 +1969,7 @@ static int op_mkdir(const char *path, mode_t mode)
ext2fs_free_mem(&block);
out2:
fuse4fs_finish(ff, ret);
-out:
- free(temp_path);
- return ret;
-}
-
-static int fuse4fs_unlink(struct fuse4fs *ff, const char *path,
- ext2_ino_t *parent)
-{
- ext2_filsys fs = ff->fs;
- errcode_t err;
- ext2_ino_t dir;
- char *filename = strdup(path);
- char *base_name;
- int ret;
-
- base_name = strrchr(filename, '/');
- if (base_name) {
- *base_name++ = '\0';
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, filename,
- &dir);
- if (err) {
- free(filename);
- return translate_error(fs, 0, err);
- }
- } else {
- dir = EXT2_ROOT_INO;
- base_name = filename;
- }
-
- ret = fuse4fs_inum_access(ff, dir, W_OK);
- if (ret) {
- free(filename);
- return ret;
- }
-
- dbg_printf(ff, "%s: unlinking name=%s from dir=%d\n", __func__,
- base_name, dir);
- err = ext2fs_unlink(fs, dir, base_name, 0, 0);
- free(filename);
- if (err)
- return translate_error(fs, dir, err);
-
- ret = update_mtime(fs, dir, NULL);
- if (ret)
- return ret;
-
- if (parent)
- *parent = dir;
- return 0;
+ fuse4fs_reply_entry(req, child, &inode, ret);
}
static int fuse4fs_remove_ea_inodes(struct fuse4fs *ff, ext2_ino_t ino,
@@ -2118,49 +2078,78 @@ static int fuse4fs_remove_inode(struct fuse4fs *ff, ext2_ino_t ino)
return 0;
}
-static int __op_unlink(struct fuse4fs *ff, const char *path)
+static int fuse4fs_unlink(struct fuse4fs *ff, ext2_ino_t parent,
+ const char *name, ext2_ino_t child)
{
ext2_filsys fs = ff->fs;
- ext2_ino_t parent, ino;
errcode_t err;
int ret = 0;
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, &ino);
+ err = ext2fs_unlink(fs, parent, name, child, 0);
+ if (err) {
+ ret = translate_error(fs, parent, err);
+ goto out;
+ }
+
+ ret = update_mtime(fs, parent, NULL);
+ if (ret)
+ goto out;
+out:
+ return ret;
+}
+
+static int fuse4fs_rmfile(struct fuse4fs *ff, ext2_ino_t parent,
+ const char *name, ext2_ino_t child)
+{
+ int ret;
+
+ ret = fuse4fs_unlink(ff, parent, name, child);
+ if (ret)
+ return ret;
+
+ return fuse4fs_remove_inode(ff, child);
+}
+
+static void op_unlink(fuse_req_t req, fuse_ino_t fino, const char *name)
+{
+ const struct fuse_ctx *ctxt = fuse_req_ctx(req);
+ struct fuse4fs *ff = fuse4fs_get(req);
+ ext2_filsys fs;
+ ext2_ino_t parent, child;
+ errcode_t err;
+ int ret;
+
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &parent, fino);
+ fs = fuse4fs_start(ff);
+
+ /* Get the inode number for the file */
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, parent, name, &child);
if (err) {
ret = translate_error(fs, 0, err);
goto out;
}
- ret = fuse4fs_inum_access(ff, ino, W_OK);
+ ret = fuse4fs_inum_access(ff, ctxt, child, W_OK);
if (ret)
goto out;
- ret = fuse4fs_unlink(ff, path, &parent);
+ ret = fuse4fs_inum_access(ff, ctxt, parent, W_OK);
if (ret)
goto out;
- ret = fuse4fs_remove_inode(ff, ino);
+ dbg_printf(ff, "%s: unlink parent=%d name='%s' child=%d\n",
+ __func__, parent, name, child);
+ ret = fuse4fs_rmfile(ff, parent, name, child);
if (ret)
goto out;
ret = fuse4fs_dirsync_flush(ff, parent, NULL);
if (ret)
goto out;
-
out:
- return ret;
-}
-
-static int op_unlink(const char *path)
-{
- struct fuse4fs *ff = fuse4fs_get();
- int ret;
-
- FUSE4FS_CHECK_CONTEXT(ff);
- fuse4fs_start(ff);
- ret = __op_unlink(ff, path);
fuse4fs_finish(ff, ret);
- return ret;
+ fuse_reply_err(req, -ret);
}
struct rd_struct {
@@ -2191,51 +2180,36 @@ static int rmdir_proc(ext2_ino_t dir EXT2FS_ATTR((unused)),
return 0;
}
-static int __op_rmdir(struct fuse4fs *ff, const char *path)
+static int fuse4fs_rmdir(struct fuse4fs *ff, ext2_ino_t parent,
+ const char *name, ext2_ino_t child)
{
ext2_filsys fs = ff->fs;
- ext2_ino_t parent, child;
errcode_t err;
struct ext2_inode_large inode;
- struct rd_struct rds;
+ struct rd_struct rds = {
+ .parent = 0,
+ .empty = 1,
+ };
int ret = 0;
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, &child);
- if (err) {
- ret = translate_error(fs, 0, err);
- goto out;
- }
- dbg_printf(ff, "%s: rmdir path=%s ino=%d\n", __func__, path, child);
-
- ret = fuse4fs_inum_access(ff, child, W_OK);
- if (ret)
- goto out;
-
- rds.parent = 0;
- rds.empty = 1;
-
err = ext2fs_dir_iterate2(fs, child, 0, 0, rmdir_proc, &rds);
if (err) {
ret = translate_error(fs, child, err);
goto out;
}
- /* the kernel checks parent permissions before emptiness */
+ /* Make sure we found a dotdot entry */
if (rds.parent == 0) {
ret = translate_error(fs, child, EXT2_ET_FILESYSTEM_CORRUPTED);
goto out;
}
- ret = fuse4fs_inum_access(ff, rds.parent, W_OK);
- if (ret)
- goto out;
-
if (rds.empty == 0) {
ret = -ENOTEMPTY;
goto out;
}
- ret = fuse4fs_unlink(ff, path, &parent);
+ ret = fuse4fs_unlink(ff, parent, name, child);
if (ret)
goto out;
/* Directories have to be "removed" twice. */
@@ -2266,74 +2240,81 @@ static int __op_rmdir(struct fuse4fs *ff, const char *path)
}
}
- ret = fuse4fs_dirsync_flush(ff, parent, NULL);
- if (ret)
- goto out;
-
out:
return ret;
}
-static int op_rmdir(const char *path)
+static void op_rmdir(fuse_req_t req, fuse_ino_t fino, const char *name)
{
- struct fuse4fs *ff = fuse4fs_get();
+ const struct fuse_ctx *ctxt = fuse_req_ctx(req);
+ struct fuse4fs *ff = fuse4fs_get(req);
+ ext2_filsys fs;
+ ext2_ino_t parent, child;
+ errcode_t err;
int ret;
- FUSE4FS_CHECK_CONTEXT(ff);
- fuse4fs_start(ff);
- ret = __op_rmdir(ff, path);
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &parent, fino);
+ fs = fuse4fs_start(ff);
+
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, parent, name, &child);
+ if (err) {
+ ret = translate_error(fs, 0, err);
+ goto out;
+ }
+
+ ret = fuse4fs_inum_access(ff, ctxt, parent, W_OK);
+ if (ret)
+ goto out;
+
+ ret = fuse4fs_inum_access(ff, ctxt, child, W_OK);
+ if (ret)
+ goto out;
+
+ dbg_printf(ff, "%s: rmdir parent=%d name='%s' child=%d\n",
+ __func__, parent, name, child);
+ ret = fuse4fs_rmdir(ff, parent, name, child);
+ if (ret)
+ goto out;
+
+ ret = fuse4fs_dirsync_flush(ff, parent, NULL);
+ if (ret)
+ goto out;
+
+out:
fuse4fs_finish(ff, ret);
- return ret;
+ fuse_reply_err(req, -ret);
}
-static int op_symlink(const char *src, const char *dest)
+static void op_symlink(fuse_req_t req, const char *target, fuse_ino_t fino,
+ const char *name)
{
- struct fuse_context *ctxt = fuse_get_context();
- struct fuse4fs *ff = fuse4fs_get();
- ext2_filsys fs;
- ext2_ino_t parent, child;
- char *temp_path;
- errcode_t err;
- char *node_name, a;
struct ext2_inode_large inode;
+ const struct fuse_ctx *ctxt = fuse_req_ctx(req);
+ struct fuse4fs *ff = fuse4fs_get(req);
+ ext2_filsys fs;
+ ext2_ino_t parent, child;
+ errcode_t err;
gid_t gid;
int ret = 0;
- FUSE4FS_CHECK_CONTEXT(ff);
- dbg_printf(ff, "%s: symlink %s to %s\n", __func__, src, dest);
- temp_path = strdup(dest);
- if (!temp_path) {
- ret = -ENOMEM;
- goto out;
- }
- node_name = strrchr(temp_path, '/');
- if (!node_name) {
- ret = -ENOMEM;
- goto out;
- }
- node_name++;
- a = *node_name;
- *node_name = 0;
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &parent, fino);
+ dbg_printf(ff, "%s: symlink dir=%d name='%s' target='%s'\n",
+ __func__, parent, name, target);
fs = fuse4fs_start(ff);
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, temp_path,
- &parent);
- *node_name = a;
- if (err) {
- ret = translate_error(fs, 0, err);
- goto out2;
- }
- ret = fuse4fs_inum_access(ff, parent, A_OK | W_OK);
+ ret = fuse4fs_inum_access(ff, ctxt, parent, A_OK | W_OK);
if (ret)
goto out2;
- err = fuse4fs_new_child_gid(ff, parent, &gid, NULL);
+ err = fuse4fs_new_child_gid(ff, ctxt, parent, &gid, NULL);
if (err)
goto out2;
/* Create symlink */
- err = ext2fs_symlink(fs, parent, 0, node_name, src);
+ err = ext2fs_symlink(fs, parent, 0, name, target);
if (err == EXT2_ET_DIR_NO_SPACE) {
err = ext2fs_expand_dir(fs, parent);
if (err) {
@@ -2341,7 +2322,7 @@ static int op_symlink(const char *src, const char *dest)
goto out2;
}
- err = ext2fs_symlink(fs, parent, 0, node_name, src);
+ err = ext2fs_symlink(fs, parent, 0, name, target);
}
if (err) {
ret = translate_error(fs, parent, err);
@@ -2354,14 +2335,13 @@ static int op_symlink(const char *src, const char *dest)
goto out2;
/* Still have to update the uid/gid of the symlink */
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, temp_path,
- &child);
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, parent, name, &child);
if (err) {
ret = translate_error(fs, 0, err);
goto out2;
}
- dbg_printf(ff, "%s: symlinking ino=%d/name=%s to dir=%d\n", __func__,
- child, node_name, parent);
+ dbg_printf(ff, "%s: symlinking dir=%d name='%s' child=%d\n",
+ __func__, parent, name, child);
err = fuse4fs_read_inode(fs, child, &inode);
if (err) {
@@ -2387,9 +2367,7 @@ static int op_symlink(const char *src, const char *dest)
out2:
fuse4fs_finish(ff, ret);
-out:
- free(temp_path);
- return ret;
+ fuse4fs_reply_entry(req, child, &inode, ret);
}
struct update_dotdot {
@@ -2415,39 +2393,43 @@ static int update_dotdot_helper(ext2_ino_t dir EXT2FS_ATTR((unused)),
return 0;
}
-static int op_rename(const char *from, const char *to,
- unsigned int flags EXT2FS_ATTR((unused)))
+static void op_rename(fuse_req_t req, fuse_ino_t from_parent, const char *from,
+ fuse_ino_t to_parent, const char *to, unsigned int flags)
{
- struct fuse4fs *ff = fuse4fs_get();
+ const struct fuse_ctx *ctxt = fuse_req_ctx(req);
+ struct fuse4fs *ff = fuse4fs_get(req);
ext2_filsys fs;
errcode_t err;
ext2_ino_t from_ino, to_ino, to_dir_ino, from_dir_ino;
- char *temp_to = NULL, *temp_from = NULL;
- char *cp, a;
struct ext2_inode inode;
struct update_dotdot ud;
int flushed = 0;
int ret = 0;
/* renameat2 is not supported */
- if (flags)
- return -ENOSYS;
+ if (flags) {
+ fuse_reply_err(req, ENOSYS);
+ return;
+ }
- FUSE4FS_CHECK_CONTEXT(ff);
- dbg_printf(ff, "%s: renaming %s to %s\n", __func__, from, to);
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &from_dir_ino, from_parent);
+ FUSE4FS_CONVERT_FINO(req, &to_dir_ino, to_parent);
+ dbg_printf(ff, "%s: renaming dir=%d name='%s' to dir=%d name='%s'\n",
+ __func__, from_dir_ino, from, to_dir_ino, to);
fs = fuse4fs_start(ff);
if (!fuse4fs_can_allocate(ff, 5)) {
ret = -ENOSPC;
goto out;
}
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, from, &from_ino);
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, from_dir_ino, from, &from_ino);
if (err || from_ino == 0) {
ret = translate_error(fs, 0, err);
goto out;
}
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, to, &to_ino);
+ err = ext2fs_namei(fs, EXT2_ROOT_INO, to_dir_ino, to, &to_ino);
if (err && err != EXT2_ET_FILE_NOT_FOUND) {
ret = translate_error(fs, 0, err);
goto out;
@@ -2456,136 +2438,80 @@ static int op_rename(const char *from, const char *to,
if (err == EXT2_ET_FILE_NOT_FOUND)
to_ino = 0;
+ dbg_printf(ff,
+ "%s: renaming dir=%d name='%s' child=%d to dir=%d name='%s' child=%d\n",
+ __func__, from_dir_ino, from, from_ino, to_dir_ino, to,
+ to_ino);
+
/* Already the same file? */
if (to_ino != 0 && to_ino == from_ino) {
ret = 0;
goto out;
}
- ret = fuse4fs_inum_access(ff, from_ino, W_OK);
+ ret = fuse4fs_inum_access(ff, ctxt, from_ino, W_OK);
if (ret)
goto out;
if (to_ino) {
- ret = fuse4fs_inum_access(ff, to_ino, W_OK);
+ ret = fuse4fs_inum_access(ff, ctxt, to_ino, W_OK);
if (ret)
goto out;
}
- temp_to = strdup(to);
- if (!temp_to) {
- ret = -ENOMEM;
- goto out;
- }
-
- temp_from = strdup(from);
- if (!temp_from) {
- ret = -ENOMEM;
- goto out2;
- }
-
- /* Find parent dir of the source and check write access */
- cp = strrchr(temp_from, '/');
- if (!cp) {
- ret = -EINVAL;
- goto out2;
- }
-
- a = *(cp + 1);
- *(cp + 1) = 0;
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, temp_from,
- &from_dir_ino);
- *(cp + 1) = a;
- if (err) {
- ret = translate_error(fs, 0, err);
- goto out2;
- }
- if (from_dir_ino == 0) {
- ret = -ENOENT;
- goto out2;
- }
-
- ret = fuse4fs_inum_access(ff, from_dir_ino, W_OK);
+ ret = fuse4fs_inum_access(ff, ctxt, from_dir_ino, W_OK);
if (ret)
- goto out2;
-
- /* Find parent dir of the destination and check write access */
- cp = strrchr(temp_to, '/');
- if (!cp) {
- ret = -EINVAL;
- goto out2;
- }
-
- a = *(cp + 1);
- *(cp + 1) = 0;
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, temp_to,
- &to_dir_ino);
- *(cp + 1) = a;
- if (err) {
- ret = translate_error(fs, 0, err);
- goto out2;
- }
- if (to_dir_ino == 0) {
- ret = -ENOENT;
- goto out2;
- }
+ goto out;
- ret = fuse4fs_inum_access(ff, to_dir_ino, W_OK);
+ ret = fuse4fs_inum_access(ff, ctxt, to_dir_ino, W_OK);
if (ret)
- goto out2;
+ goto out;
/* If the target exists, unlink it first */
if (to_ino != 0) {
err = ext2fs_read_inode(fs, to_ino, &inode);
if (err) {
ret = translate_error(fs, to_ino, err);
- goto out2;
+ goto out;
}
- dbg_printf(ff, "%s: unlinking %s ino=%d\n", __func__,
- LINUX_S_ISDIR(inode.i_mode) ? "dir" : "file",
- to_ino);
+ dbg_printf(ff, "%s: unlink dir=%d name='%s' child=%d\n",
+ __func__, to_dir_ino, to, to_ino);
if (LINUX_S_ISDIR(inode.i_mode))
- ret = __op_rmdir(ff, to);
+ ret = fuse4fs_rmdir(ff, to_dir_ino, to, to_ino);
else
- ret = __op_unlink(ff, to);
+ ret = fuse4fs_rmfile(ff, to_dir_ino, to, to_ino);
if (ret)
- goto out2;
+ goto out;
}
/* Get ready to do the move */
err = ext2fs_read_inode(fs, from_ino, &inode);
if (err) {
ret = translate_error(fs, from_ino, err);
- goto out2;
+ goto out;
}
/* Link in the new file */
- dbg_printf(ff, "%s: linking ino=%d/path=%s to dir=%d\n", __func__,
- from_ino, cp + 1, to_dir_ino);
- err = ext2fs_link(fs, to_dir_ino, cp + 1, from_ino,
+ dbg_printf(ff, "%s: link dir=%d name='%s' child=%d\n",
+ __func__, to_dir_ino, to, from_ino);
+ err = ext2fs_link(fs, to_dir_ino, to, from_ino,
ext2_file_type(inode.i_mode) | EXT2FS_LINK_EXPAND);
if (err) {
ret = translate_error(fs, to_dir_ino, err);
- goto out2;
+ goto out;
}
/* Update '..' pointer if dir */
- err = ext2fs_read_inode(fs, from_ino, &inode);
- if (err) {
- ret = translate_error(fs, from_ino, err);
- goto out2;
- }
-
if (LINUX_S_ISDIR(inode.i_mode)) {
ud.new_dotdot = to_dir_ino;
- dbg_printf(ff, "%s: updating .. entry for dir=%d\n", __func__,
- to_dir_ino);
+ dbg_printf(ff, "%s: updating .. entry for child=%d parent=%d\n",
+ __func__, from_ino, to_dir_ino);
err = ext2fs_dir_iterate2(fs, from_ino, 0, NULL,
update_dotdot_helper, &ud);
if (err) {
ret = translate_error(fs, from_ino, err);
- goto out2;
+ goto out;
}
/* Decrease from_dir_ino's links_count */
@@ -2594,87 +2520,76 @@ static int op_rename(const char *from, const char *to,
err = ext2fs_read_inode(fs, from_dir_ino, &inode);
if (err) {
ret = translate_error(fs, from_dir_ino, err);
- goto out2;
+ goto out;
}
inode.i_links_count--;
err = ext2fs_write_inode(fs, from_dir_ino, &inode);
if (err) {
ret = translate_error(fs, from_dir_ino, err);
- goto out2;
+ goto out;
}
/* Increase to_dir_ino's links_count */
err = ext2fs_read_inode(fs, to_dir_ino, &inode);
if (err) {
ret = translate_error(fs, to_dir_ino, err);
- goto out2;
+ goto out;
}
inode.i_links_count++;
err = ext2fs_write_inode(fs, to_dir_ino, &inode);
if (err) {
ret = translate_error(fs, to_dir_ino, err);
- goto out2;
+ goto out;
}
}
/* Update timestamps */
ret = update_ctime(fs, from_ino, NULL);
if (ret)
- goto out2;
+ goto out;
ret = update_mtime(fs, to_dir_ino, NULL);
if (ret)
- goto out2;
+ goto out;
/* Remove the old file */
- ret = fuse4fs_unlink(ff, from, NULL);
+ dbg_printf(ff, "%s: unlink dir=%d name='%s' child=%d\n",
+ __func__, from_dir_ino, from, from_ino);
+ ret = fuse4fs_unlink(ff, from_dir_ino, from, from_ino);
if (ret)
- goto out2;
+ goto out;
ret = fuse4fs_dirsync_flush(ff, from_dir_ino, &flushed);
if (ret)
- goto out2;
+ goto out;
if (from_dir_ino != to_dir_ino && !flushed) {
ret = fuse4fs_dirsync_flush(ff, to_dir_ino, NULL);
if (ret)
- goto out2;
+ goto out;
}
-out2:
- free(temp_from);
- free(temp_to);
out:
fuse4fs_finish(ff, ret);
- return ret;
+ fuse_reply_err(req, -ret);
}
-static int op_link(const char *src, const char *dest)
+static void op_link(fuse_req_t req, fuse_ino_t child_fino,
+ fuse_ino_t parent_fino, const char *name)
{
- struct fuse4fs *ff = fuse4fs_get();
+ struct ext2_inode_large inode;
+ const struct fuse_ctx *ctxt = fuse_req_ctx(req);
+ struct fuse4fs *ff = fuse4fs_get(req);
ext2_filsys fs;
- char *temp_path;
errcode_t err;
- char *node_name, a;
- ext2_ino_t parent, ino;
- struct ext2_inode_large inode;
+ ext2_ino_t parent, child;
int ret = 0;
- FUSE4FS_CHECK_CONTEXT(ff);
- dbg_printf(ff, "%s: src=%s dest=%s\n", __func__, src, dest);
- temp_path = strdup(dest);
- if (!temp_path) {
- ret = -ENOMEM;
- goto out;
- }
- node_name = strrchr(temp_path, '/');
- if (!node_name) {
- ret = -ENOMEM;
- goto out;
- }
- node_name++;
- a = *node_name;
- *node_name = 0;
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &parent, parent_fino);
+ FUSE4FS_CONVERT_FINO(req, &child, child_fino);
+ dbg_printf(ff, "%s: link dir=%d name='%s' child=%d\n",
+ __func__, parent, name, child);
fs = fuse4fs_start(ff);
if (!fuse4fs_can_allocate(ff, 2)) {
@@ -2682,48 +2597,32 @@ static int op_link(const char *src, const char *dest)
goto out2;
}
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, temp_path,
- &parent);
- *node_name = a;
- if (err) {
- err = -ENOENT;
- goto out2;
- }
-
- ret = fuse4fs_inum_access(ff, parent, A_OK | W_OK);
+ ret = fuse4fs_inum_access(ff, ctxt, parent, A_OK | W_OK);
if (ret)
goto out2;
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, src, &ino);
- if (err || ino == 0) {
- ret = translate_error(fs, 0, err);
- goto out2;
- }
-
- err = fuse4fs_read_inode(fs, ino, &inode);
+ err = fuse4fs_read_inode(fs, child, &inode);
if (err) {
- ret = translate_error(fs, ino, err);
+ ret = translate_error(fs, child, err);
goto out2;
}
- ret = fuse4fs_iflags_access(ff, ino, EXT2_INODE(&inode), W_OK);
+ ret = fuse4fs_iflags_access(ff, child, EXT2_INODE(&inode), W_OK);
if (ret)
goto out2;
inode.i_links_count++;
- ret = update_ctime(fs, ino, &inode);
+ ret = update_ctime(fs, child, &inode);
if (ret)
goto out2;
- err = fuse4fs_write_inode(fs, ino, &inode);
+ err = fuse4fs_write_inode(fs, child, &inode);
if (err) {
- ret = translate_error(fs, ino, err);
+ ret = translate_error(fs, child, err);
goto out2;
}
- dbg_printf(ff, "%s: linking ino=%d/name=%s to dir=%d\n", __func__, ino,
- node_name, parent);
- err = ext2fs_link(fs, parent, node_name, ino,
+ err = ext2fs_link(fs, parent, name, child,
ext2_file_type(inode.i_mode) | EXT2FS_LINK_EXPAND);
if (err) {
ret = translate_error(fs, parent, err);
@@ -2740,13 +2639,12 @@ static int op_link(const char *src, const char *dest)
out2:
fuse4fs_finish(ff, ret);
-out:
- free(temp_path);
- return ret;
+ fuse4fs_reply_entry(req, child, &inode, ret);
}
/* Obtain group ids of the process that sent us a command(?) */
-static int fuse4fs_get_groups(struct fuse4fs *ff, gid_t **gids, size_t *nr_gids)
+static int fuse4fs_get_groups(struct fuse4fs *ff, fuse_req_t req, gid_t **gids,
+ size_t *nr_gids)
{
ext2_filsys fs = ff->fs;
errcode_t err;
@@ -2759,7 +2657,7 @@ static int fuse4fs_get_groups(struct fuse4fs *ff, gid_t **gids, size_t *nr_gids)
if (err)
return translate_error(fs, 0, err);
- ret = fuse_getgroups(nr, array);
+ ret = fuse_req_getgroups(req, nr, array);
if (ret < 0) {
/*
* If there's an error, we failed to find the group
@@ -2791,17 +2689,18 @@ static int fuse4fs_get_groups(struct fuse4fs *ff, gid_t **gids, size_t *nr_gids)
* that initiated the fuse request? Returns 1 for yes, 0 for no, or a negative
* errno.
*/
-static int fuse4fs_in_file_group(struct fuse_context *ctxt,
+static int fuse4fs_in_file_group(struct fuse4fs *ff, fuse_req_t req,
const struct ext2_inode_large *inode)
{
- struct fuse4fs *ff = fuse4fs_get();
gid_t *gids = NULL;
size_t i, nr_gids = 0;
gid_t gid = inode_gid(*inode);
int ret;
- ret = fuse4fs_get_groups(ff, &gids, &nr_gids);
+ ret = fuse4fs_get_groups(ff, req, &gids, &nr_gids);
if (ret == -ENOENT) {
+ const struct fuse_ctx *ctxt = fuse_req_ctx(req);
+
/* magic return code for "could not get caller group info" */
return ctxt->gid == inode_gid(*inode);
}
@@ -2820,37 +2719,21 @@ static int fuse4fs_in_file_group(struct fuse_context *ctxt,
return ret;
}
-static int op_chmod(const char *path, mode_t mode, struct fuse_file_info *fi)
+static int fuse4fs_chmod(struct fuse4fs *ff, fuse_req_t req, ext2_ino_t ino,
+ mode_t mode, struct ext2_inode_large *inode)
{
- struct fuse_context *ctxt = fuse_get_context();
- struct fuse4fs *ff = fuse4fs_get();
- ext2_filsys fs;
- errcode_t err;
- ext2_ino_t ino;
- struct ext2_inode_large inode;
+ const struct fuse_ctx *ctxt = fuse_req_ctx(req);
int ret = 0;
- FUSE4FS_CHECK_CONTEXT(ff);
- fs = fuse4fs_start(ff);
- ret = fuse4fs_file_ino(ff, path, fi, &ino);
- if (ret)
- goto out;
- dbg_printf(ff, "%s: path=%s mode=0%o ino=%d\n", __func__, path, mode, ino);
-
- err = fuse4fs_read_inode(fs, ino, &inode);
- if (err) {
- ret = translate_error(fs, ino, err);
- goto out;
- }
+ dbg_printf(ff, "%s: ino=%d mode=0%o\n", __func__, ino, mode);
- ret = fuse4fs_iflags_access(ff, ino, EXT2_INODE(&inode), W_OK);
+ ret = fuse4fs_iflags_access(ff, ino, EXT2_INODE(inode), W_OK);
if (ret)
- goto out;
+ return ret;
- if (fuse4fs_want_check_owner(ff, ctxt) && ctxt->uid != inode_uid(inode)) {
- ret = -EPERM;
- goto out;
- }
+ if (fuse4fs_want_check_owner(ff, ctxt) &&
+ ctxt->uid != inode_uid(*inode))
+ return -EPERM;
/*
* XXX: We should really check that the inode gid is not in /any/
@@ -2858,100 +2741,60 @@ static int op_chmod(const char *path, mode_t mode, struct fuse_file_info *fi)
* group.
*/
if (!fuse4fs_is_superuser(ff, ctxt)) {
- ret = fuse4fs_in_file_group(ctxt, &inode);
+ ret = fuse4fs_in_file_group(ff, req, inode);
if (ret < 0)
- goto out;
+ return ret;
if (!ret)
mode &= ~S_ISGID;
}
- inode.i_mode &= ~0xFFF;
- inode.i_mode |= mode & 0xFFF;
+ inode->i_mode &= ~0xFFF;
+ inode->i_mode |= mode & 0xFFF;
- dbg_printf(ff, "%s: path=%s new_mode=0%o ino=%d\n", __func__,
- path, inode.i_mode, ino);
+ dbg_printf(ff, "%s: ino=%d new_mode=0%o\n",
+ __func__, ino, inode->i_mode);
- ret = update_ctime(fs, ino, &inode);
- if (ret)
- goto out;
-
- err = fuse4fs_write_inode(fs, ino, &inode);
- if (err) {
- ret = translate_error(fs, ino, err);
- goto out;
- }
-
-out:
- fuse4fs_finish(ff, ret);
- return ret;
+ return 0;
}
-static int op_chown(const char *path, uid_t owner, gid_t group,
- struct fuse_file_info *fi)
+static int fuse4fs_chown(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
+ ext2_ino_t ino, const int to_set,
+ const struct stat *attr,
+ struct ext2_inode_large *inode)
{
- struct fuse_context *ctxt = fuse_get_context();
- struct fuse4fs *ff = fuse4fs_get();
- ext2_filsys fs;
- errcode_t err;
- ext2_ino_t ino;
- struct ext2_inode_large inode;
+ uid_t owner = (to_set & FUSE_SET_ATTR_UID) ? attr->st_uid : (uid_t)~0;
+ gid_t group = (to_set & FUSE_SET_ATTR_GID) ? attr->st_gid : (gid_t)~0;
int ret = 0;
- FUSE4FS_CHECK_CONTEXT(ff);
- fs = fuse4fs_start(ff);
- ret = fuse4fs_file_ino(ff, path, fi, &ino);
- if (ret)
- goto out;
- dbg_printf(ff, "%s: path=%s owner=%d group=%d ino=%d\n", __func__,
- path, owner, group, ino);
-
- err = fuse4fs_read_inode(fs, ino, &inode);
- if (err) {
- ret = translate_error(fs, ino, err);
- goto out;
- }
+ dbg_printf(ff, "%s: ino=%d owner=%d group=%d\n",
+ __func__, ino, owner, group);
- ret = fuse4fs_iflags_access(ff, ino, EXT2_INODE(&inode), W_OK);
+ ret = fuse4fs_iflags_access(ff, ino, EXT2_INODE(inode), W_OK);
if (ret)
- goto out;
+ return ret;
/* FUSE seems to feed us ~0 to mean "don't change" */
if (owner != (uid_t) ~0) {
/* Only root gets to change UID. */
if (fuse4fs_want_check_owner(ff, ctxt) &&
- !(inode_uid(inode) == ctxt->uid && owner == ctxt->uid)) {
- ret = -EPERM;
- goto out;
- }
- fuse4fs_set_uid(&inode, owner);
+ !(inode_uid(*inode) == ctxt->uid && owner == ctxt->uid))
+ return -EPERM;
+
+ fuse4fs_set_uid(inode, owner);
}
if (group != (gid_t) ~0) {
/* Only root or the owner get to change GID. */
if (fuse4fs_want_check_owner(ff, ctxt) &&
- inode_uid(inode) != ctxt->uid) {
- ret = -EPERM;
- goto out;
- }
+ inode_uid(*inode) != ctxt->uid)
+ return -EPERM;
/* XXX: We /should/ check group membership but FUSE */
- fuse4fs_set_gid(&inode, group);
+ fuse4fs_set_gid(inode, group);
}
- ret = update_ctime(fs, ino, &inode);
- if (ret)
- goto out;
-
- err = fuse4fs_write_inode(fs, ino, &inode);
- if (err) {
- ret = translate_error(fs, ino, err);
- goto out;
- }
-
-out:
- fuse4fs_finish(ff, ret);
- return ret;
+ return 0;
}
static int fuse4fs_punch_posteof(struct fuse4fs *ff, ext2_ino_t ino,
@@ -3026,32 +2869,6 @@ static int fuse4fs_truncate(struct fuse4fs *ff, ext2_ino_t ino, off_t new_size)
return 0;
}
-static int op_truncate(const char *path, off_t len, struct fuse_file_info *fi)
-{
- struct fuse4fs *ff = fuse4fs_get();
- ext2_ino_t ino;
- int ret = 0;
-
- FUSE4FS_CHECK_CONTEXT(ff);
- fuse4fs_start(ff);
- ret = fuse4fs_file_ino(ff, path, fi, &ino);
- if (ret)
- goto out;
- dbg_printf(ff, "%s: ino=%d len=%jd\n", __func__, ino, (intmax_t) len);
-
- ret = fuse4fs_inum_access(ff, ino, W_OK);
- if (ret)
- goto out;
-
- ret = fuse4fs_truncate(ff, ino, len);
- if (ret)
- goto out;
-
-out:
- fuse4fs_finish(ff, ret);
- return ret;
-}
-
#ifdef __linux__
static void detect_linux_executable_open(int kernel_flags, int *access_check,
int *e2fs_open_flags)
@@ -3073,19 +2890,20 @@ static void detect_linux_executable_open(int kernel_flags, int *access_check,
}
#endif /* __linux__ */
-static int __op_open(struct fuse4fs *ff, const char *path,
- struct fuse_file_info *fp)
+static int fuse4fs_open_file(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
+ ext2_ino_t ino, struct fuse_file_info *fp)
{
ext2_filsys fs = ff->fs;
errcode_t err;
struct fuse4fs_file_handle *file;
int check = 0, ret = 0;
- dbg_printf(ff, "%s: path=%s oflags=0o%o\n", __func__, path, fp->flags);
+ dbg_printf(ff, "%s: ino=%d oflags=0o%o\n", __func__, ino, fp->flags);
err = ext2fs_get_mem(sizeof(*file), &file);
if (err)
return translate_error(fs, 0, err);
file->magic = FUSE4FS_FILE_MAGIC;
+ file->ino = ino;
file->open_flags = 0;
switch (fp->flags & O_ACCMODE) {
@@ -3114,14 +2932,7 @@ static int __op_open(struct fuse4fs *ff, const char *path,
if (fp->flags & O_CREAT)
file->open_flags |= EXT2_FILE_CREATE;
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, &file->ino);
- if (err || file->ino == 0) {
- ret = translate_error(fs, 0, err);
- goto out;
- }
- dbg_printf(ff, "%s: ino=%d\n", __func__, file->ino);
-
- ret = fuse4fs_inum_access(ff, file->ino, check);
+ ret = fuse4fs_inum_access(ff, ctxt, file->ino, check);
if (ret) {
/*
* In a regular (Linux) fs driver, the kernel will open
@@ -3133,7 +2944,7 @@ static int __op_open(struct fuse4fs *ff, const char *path,
* also employ undocumented hacks (see above).
*/
if (check == R_OK) {
- ret = fuse4fs_inum_access(ff, file->ino, X_OK);
+ ret = fuse4fs_inum_access(ff, ctxt, file->ino, X_OK);
if (ret)
goto out;
} else
@@ -3154,34 +2965,48 @@ static int __op_open(struct fuse4fs *ff, const char *path,
return ret;
}
-static int op_open(const char *path, struct fuse_file_info *fp)
+static void op_open(fuse_req_t req, fuse_ino_t fino, struct fuse_file_info *fp)
{
- struct fuse4fs *ff = fuse4fs_get();
+ const struct fuse_ctx *ctxt = fuse_req_ctx(req);
+ struct fuse4fs *ff = fuse4fs_get(req);
+ ext2_ino_t ino;
int ret;
- FUSE4FS_CHECK_CONTEXT(ff);
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &ino, fino);
fuse4fs_start(ff);
- ret = __op_open(ff, path, fp);
+ ret = fuse4fs_open_file(ff, ctxt, ino, fp);
fuse4fs_finish(ff, ret);
- return ret;
+
+ if (ret)
+ fuse_reply_err(req, -ret);
+ else
+ fuse_reply_open(req, fp);
}
-static int op_read(const char *path EXT2FS_ATTR((unused)), char *buf,
- size_t len, off_t offset,
- struct fuse_file_info *fp)
+static void op_read(fuse_req_t req, fuse_ino_t fino EXT2FS_ATTR((unused)),
+ size_t len, off_t offset, struct fuse_file_info *fp)
{
- struct fuse4fs *ff = fuse4fs_get();
+ struct fuse4fs *ff = fuse4fs_get(req);
struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
+ char *buf;
ext2_filsys fs;
ext2_file_t efp;
errcode_t err;
unsigned int got = 0;
int ret = 0;
- FUSE4FS_CHECK_CONTEXT(ff);
- FUSE4FS_CHECK_HANDLE(ff, fh);
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CHECK_HANDLE(req, fh);
+
+ buf = calloc(len, sizeof(char));
+ if (!buf) {
+ fuse_reply_err(req, errno);
+ return;
+ }
dbg_printf(ff, "%s: ino=%d off=0x%llx len=0x%zx\n", __func__, fh->ino,
(unsigned long long)offset, len);
+
fs = fuse4fs_start(ff);
err = ext2fs_file_open(fs, fh->ino, fh->open_flags, &efp);
if (err) {
@@ -3217,14 +3042,18 @@ static int op_read(const char *path EXT2FS_ATTR((unused)), char *buf,
}
out:
fuse4fs_finish(ff, ret);
- return got ? (int) got : ret;
+ if (got)
+ fuse_reply_buf(req, buf, got);
+ else
+ fuse_reply_err(req, -ret);
+ free(buf);
}
-static int op_write(const char *path EXT2FS_ATTR((unused)),
- const char *buf, size_t len, off_t offset,
- struct fuse_file_info *fp)
+static void op_write(fuse_req_t req, fuse_ino_t fino EXT2FS_ATTR((unused)),
+ const char *buf, size_t len, off_t offset,
+ struct fuse_file_info *fp)
{
- struct fuse4fs *ff = fuse4fs_get();
+ struct fuse4fs *ff = fuse4fs_get(req);
struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
ext2_filsys fs;
ext2_file_t efp;
@@ -3232,8 +3061,8 @@ static int op_write(const char *path EXT2FS_ATTR((unused)),
unsigned int got = 0;
int ret = 0;
- FUSE4FS_CHECK_CONTEXT(ff);
- FUSE4FS_CHECK_HANDLE(ff, fh);
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CHECK_HANDLE(req, fh);
dbg_printf(ff, "%s: ino=%d off=0x%llx len=0x%zx\n", __func__, fh->ino,
(unsigned long long) offset, len);
fs = fuse4fs_start(ff);
@@ -3287,20 +3116,23 @@ static int op_write(const char *path EXT2FS_ATTR((unused)),
out:
fuse4fs_finish(ff, ret);
- return got ? (int) got : ret;
+ if (got)
+ fuse_reply_write(req, got);
+ else
+ fuse_reply_err(req, -ret);
}
-static int op_release(const char *path EXT2FS_ATTR((unused)),
- struct fuse_file_info *fp)
+static void op_release(fuse_req_t req, fuse_ino_t fino EXT2FS_ATTR((unused)),
+ struct fuse_file_info *fp)
{
- struct fuse4fs *ff = fuse4fs_get();
+ struct fuse4fs *ff = fuse4fs_get(req);
struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
ext2_filsys fs;
errcode_t err;
int ret = 0;
- FUSE4FS_CHECK_CONTEXT(ff);
- FUSE4FS_CHECK_HANDLE(ff, fh);
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CHECK_HANDLE(req, fh);
dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
fs = fuse4fs_start(ff);
@@ -3317,21 +3149,21 @@ static int op_release(const char *path EXT2FS_ATTR((unused)),
ext2fs_free_mem(&fh);
- return ret;
+ fuse_reply_err(req, -ret);
}
-static int op_fsync(const char *path EXT2FS_ATTR((unused)),
- int datasync EXT2FS_ATTR((unused)),
- struct fuse_file_info *fp)
+static void op_fsync(fuse_req_t req, fuse_ino_t fino EXT2FS_ATTR((unused)),
+ int datasync EXT2FS_ATTR((unused)),
+ struct fuse_file_info *fp)
{
- struct fuse4fs *ff = fuse4fs_get();
+ struct fuse4fs *ff = fuse4fs_get(req);
struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
ext2_filsys fs;
errcode_t err;
int ret = 0;
- FUSE4FS_CHECK_CONTEXT(ff);
- FUSE4FS_CHECK_HANDLE(ff, fh);
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CHECK_HANDLE(req, fh);
dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
fs = fuse4fs_start(ff);
/* For now, flush everything, even if it's slow */
@@ -3342,22 +3174,24 @@ static int op_fsync(const char *path EXT2FS_ATTR((unused)),
}
fuse4fs_finish(ff, ret);
- return ret;
+ fuse_reply_err(req, -ret);
}
-static int op_statfs(const char *path EXT2FS_ATTR((unused)),
- struct statvfs *buf)
+static void op_statfs(fuse_req_t req, fuse_ino_t fino)
{
- struct fuse4fs *ff = fuse4fs_get();
+ struct statvfs buf;
+ struct fuse4fs *ff = fuse4fs_get(req);
ext2_filsys fs;
uint64_t fsid, *f;
+ ext2_ino_t ino;
blk64_t overhead, reserved, free;
- FUSE4FS_CHECK_CONTEXT(ff);
- dbg_printf(ff, "%s: path=%s\n", __func__, path);
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &ino, fino);
+ dbg_printf(ff, "%s: ino=%d\n", __func__, ino);
fs = fuse4fs_start(ff);
- buf->f_bsize = fs->blocksize;
- buf->f_frsize = 0;
+ buf.f_bsize = fs->blocksize;
+ buf.f_frsize = 0;
if (ff->minixdf)
overhead = 0;
@@ -3370,27 +3204,27 @@ static int op_statfs(const char *path EXT2FS_ATTR((unused)),
reserved = ext2fs_blocks_count(fs->super) / 10;
free = ext2fs_free_blocks_count(fs->super);
- buf->f_blocks = ext2fs_blocks_count(fs->super) - overhead;
- buf->f_bfree = free;
+ buf.f_blocks = ext2fs_blocks_count(fs->super) - overhead;
+ buf.f_bfree = free;
if (free < reserved)
- buf->f_bavail = 0;
+ buf.f_bavail = 0;
else
- buf->f_bavail = free - reserved;
- buf->f_files = fs->super->s_inodes_count;
- buf->f_ffree = fs->super->s_free_inodes_count;
- buf->f_favail = fs->super->s_free_inodes_count;
+ buf.f_bavail = free - reserved;
+ buf.f_files = fs->super->s_inodes_count;
+ buf.f_ffree = fs->super->s_free_inodes_count;
+ buf.f_favail = fs->super->s_free_inodes_count;
f = (uint64_t *)fs->super->s_uuid;
fsid = *f;
f++;
fsid ^= *f;
- buf->f_fsid = fsid;
- buf->f_flag = 0;
+ buf.f_fsid = fsid;
+ buf.f_flag = 0;
if (ff->opstate != F4OP_WRITABLE)
- buf->f_flag |= ST_RDONLY;
- buf->f_namemax = EXT2_NAME_LEN;
+ buf.f_flag |= ST_RDONLY;
+ buf.f_namemax = EXT2_NAME_LEN;
fuse4fs_finish(ff, 0);
- return 0;
+ fuse_reply_statfs(req, &buf);
}
static const char *valid_xattr_prefixes[] = {
@@ -3414,35 +3248,33 @@ static int validate_xattr_name(const char *name)
return 0;
}
-static int op_getxattr(const char *path, const char *key, char *value,
- size_t len)
+static void op_getxattr(fuse_req_t req, fuse_ino_t fino, const char *key,
+ size_t len)
{
- struct fuse4fs *ff = fuse4fs_get();
+ const struct fuse_ctx *ctxt = fuse_req_ctx(req);
+ struct fuse4fs *ff = fuse4fs_get(req);
ext2_filsys fs;
- void *ptr;
+ void *ptr = NULL;
size_t plen;
ext2_ino_t ino;
- errcode_t err;
int ret = 0;
- if (!validate_xattr_name(key))
- return -ENODATA;
+ if (!validate_xattr_name(key)) {
+ fuse_reply_err(req, ENODATA);
+ return;
+ }
- FUSE4FS_CHECK_CONTEXT(ff);
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &ino, fino);
fs = fuse4fs_start(ff);
if (!ext2fs_has_feature_xattr(fs->super)) {
ret = -ENOTSUP;
goto out;
}
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, &ino);
- if (err || ino == 0) {
- ret = translate_error(fs, 0, err);
- goto out;
- }
- dbg_printf(ff, "%s: ino=%d name=%s\n", __func__, ino, key);
+ dbg_printf(ff, "%s: ino=%d name='%s'\n", __func__, ino, key);
- ret = fuse4fs_inum_access(ff, ino, R_OK);
+ ret = fuse4fs_inum_access(ff, ctxt, ino, R_OK);
if (ret)
goto out;
@@ -3451,19 +3283,26 @@ static int op_getxattr(const char *path, const char *key, char *value,
goto out;
if (!len) {
+ /* Just tell us the length */
ret = plen;
} else if (len < plen) {
+ /* Caller's buffer wasn't big enough */
ret = -ERANGE;
} else {
- memcpy(value, ptr, plen);
+ /* We have data */
ret = plen;
}
+out:
+ fuse4fs_finish(ff, ret);
+
+ if (ret < 0)
+ fuse_reply_err(req, -ret);
+ else if (!len)
+ fuse_reply_xattr(req, ret);
+ else
+ fuse_reply_buf(req, ptr, ret);
ext2fs_free_mem(&ptr);
-out:
- fuse4fs_finish(ff, ret);
-
- return ret;
}
static int count_buffer_space(char *name, char *value EXT2FS_ATTR((unused)),
@@ -3488,31 +3327,30 @@ static int copy_names(char *name, char *value EXT2FS_ATTR((unused)),
return 0;
}
-static int op_listxattr(const char *path, char *names, size_t len)
+static void op_listxattr(fuse_req_t req, fuse_ino_t fino, size_t len)
{
- struct fuse4fs *ff = fuse4fs_get();
+ const struct fuse_ctx *ctxt = fuse_req_ctx(req);
+ struct fuse4fs *ff = fuse4fs_get(req);
ext2_filsys fs;
struct ext2_xattr_handle *h;
+ char *names = NULL;
+ char *next_name;
unsigned int bufsz;
ext2_ino_t ino;
errcode_t err;
int ret = 0;
- FUSE4FS_CHECK_CONTEXT(ff);
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &ino, fino);
fs = fuse4fs_start(ff);
if (!ext2fs_has_feature_xattr(fs->super)) {
ret = -ENOTSUP;
goto out;
}
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, &ino);
- if (err || ino == 0) {
- ret = translate_error(fs, ino, err);
- goto out;
- }
dbg_printf(ff, "%s: ino=%d\n", __func__, ino);
- ret = fuse4fs_inum_access(ff, ino, R_OK);
+ ret = fuse4fs_inum_access(ff, ctxt, ino, R_OK);
if (ret)
goto out;
@@ -3537,21 +3375,28 @@ static int op_listxattr(const char *path, char *names, size_t len)
}
if (len == 0) {
- ret = bufsz;
+ /* Just tell us the length */
goto out2;
} else if (len < bufsz) {
+ /* Caller's buffer wasn't big enough */
ret = -ERANGE;
goto out2;
}
/* Copy names out */
- memset(names, 0, len);
- err = ext2fs_xattrs_iterate(h, copy_names, &names);
+ names = calloc(len, sizeof(char));
+ if (!names) {
+ ret = translate_error(fs, ino, errno);
+ goto out2;
+ }
+ next_name = names;
+
+ err = ext2fs_xattrs_iterate(h, copy_names, &next_name);
if (err) {
ret = translate_error(fs, ino, err);
goto out2;
}
- ret = bufsz;
+
out2:
err = ext2fs_xattrs_close(&h);
if (err && !ret)
@@ -3559,41 +3404,47 @@ static int op_listxattr(const char *path, char *names, size_t len)
out:
fuse4fs_finish(ff, ret);
- return ret;
+ if (ret < 0)
+ fuse_reply_err(req, -ret);
+ else if (names)
+ fuse_reply_buf(req, names, bufsz);
+ else
+ fuse_reply_xattr(req, bufsz);
+ free(names);
}
-static int op_setxattr(const char *path EXT2FS_ATTR((unused)),
- const char *key, const char *value,
- size_t len, int flags)
+static void op_setxattr(fuse_req_t req, fuse_ino_t fino, const char *key,
+ const char *value, size_t len, int flags)
{
- struct fuse4fs *ff = fuse4fs_get();
+ const struct fuse_ctx *ctxt = fuse_req_ctx(req);
+ struct fuse4fs *ff = fuse4fs_get(req);
ext2_filsys fs;
struct ext2_xattr_handle *h;
ext2_ino_t ino;
errcode_t err;
int ret = 0;
- if (flags & ~(XATTR_CREATE | XATTR_REPLACE))
- return -EOPNOTSUPP;
+ if (flags & ~(XATTR_CREATE | XATTR_REPLACE)) {
+ fuse_reply_err(req, EOPNOTSUPP);
+ return;
+ }
- if (!validate_xattr_name(key))
- return -EINVAL;
+ if (!validate_xattr_name(key)) {
+ fuse_reply_err(req, EINVAL);
+ return;
+ }
- FUSE4FS_CHECK_CONTEXT(ff);
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &ino, fino);
fs = fuse4fs_start(ff);
if (!ext2fs_has_feature_xattr(fs->super)) {
ret = -ENOTSUP;
goto out;
}
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, &ino);
- if (err || ino == 0) {
- ret = translate_error(fs, 0, err);
- goto out;
- }
- dbg_printf(ff, "%s: ino=%d name=%s\n", __func__, ino, key);
+ dbg_printf(ff, "%s: ino=%d name='%s'\n", __func__, ino, key);
- ret = fuse4fs_inum_access(ff, ino, W_OK);
+ ret = fuse4fs_inum_access(ff, ctxt, ino, W_OK);
if (ret == -EACCES) {
ret = -EPERM;
goto out;
@@ -3650,13 +3501,13 @@ static int op_setxattr(const char *path EXT2FS_ATTR((unused)),
ret = translate_error(fs, ino, err);
out:
fuse4fs_finish(ff, ret);
-
- return ret;
+ fuse_reply_err(req, -ret);
}
-static int op_removexattr(const char *path, const char *key)
+static void op_removexattr(fuse_req_t req, fuse_ino_t fino, const char *key)
{
- struct fuse4fs *ff = fuse4fs_get();
+ const struct fuse_ctx *ctxt = fuse_req_ctx(req);
+ struct fuse4fs *ff = fuse4fs_get(req);
ext2_filsys fs;
struct ext2_xattr_handle *h;
void *buf;
@@ -3669,13 +3520,18 @@ static int op_removexattr(const char *path, const char *key)
* Once in a while libfuse gives us a no-name xattr to delete as part
* of clearing ACLs. Just pretend we cleared them.
*/
- if (key[0] == 0)
- return 0;
+ if (key[0] == 0) {
+ fuse_reply_err(req, 0);
+ return;
+ }
- if (!validate_xattr_name(key))
- return -ENODATA;
+ if (!validate_xattr_name(key)) {
+ fuse_reply_err(req, ENODATA);
+ return;
+ }
- FUSE4FS_CHECK_CONTEXT(ff);
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &ino, fino);
fs = fuse4fs_start(ff);
if (!ext2fs_has_feature_xattr(fs->super)) {
ret = -ENOTSUP;
@@ -3687,14 +3543,9 @@ static int op_removexattr(const char *path, const char *key)
goto out;
}
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, &ino);
- if (err || ino == 0) {
- ret = translate_error(fs, 0, err);
- goto out;
- }
dbg_printf(ff, "%s: ino=%d name=%s\n", __func__, ino, key);
- ret = fuse4fs_inum_access(ff, ino, W_OK);
+ ret = fuse4fs_inum_access(ff, ctxt, ino, W_OK);
if (ret)
goto out;
@@ -3744,24 +3595,26 @@ static int op_removexattr(const char *path, const char *key)
ret = translate_error(fs, ino, err);
out:
fuse4fs_finish(ff, ret);
-
- return ret;
+ fuse_reply_err(req, -ret);
}
struct readdir_iter {
void *buf;
- ext2_filsys fs;
- fuse_fill_dir_t func;
+ size_t bufsz;
+ size_t bufused;
+ ext2_filsys fs;
struct fuse4fs *ff;
- enum fuse_readdir_flags flags;
+ fuse_req_t req;
+
+ bool readdirplus;
unsigned int nr;
off_t startpos;
off_t dirpos;
};
static inline mode_t dirent_fmode(ext2_filsys fs,
- const struct ext2_dir_entry *dirent)
+ const struct ext2_dir_entry *dirent)
{
if (!ext2fs_has_feature_filetype(fs->super))
return 0;
@@ -3795,10 +3648,15 @@ static int op_readdir_iter(ext2_ino_t dir EXT2FS_ATTR((unused)),
{
struct readdir_iter *i = data;
char namebuf[EXT2_NAME_LEN + 1];
- struct stat stat = {
- .st_ino = dirent->inode,
- .st_mode = dirent_fmode(i->fs, dirent),
+ struct fuse4fs_stat fstat = {
+ .entry = {
+ .attr = {
+ .st_ino = dirent->inode,
+ .st_mode = dirent_fmode(i->fs, dirent),
+ },
+ },
};
+ size_t entrysize;
int ret;
i->dirpos++;
@@ -3806,48 +3664,67 @@ static int op_readdir_iter(ext2_ino_t dir EXT2FS_ATTR((unused)),
return 0;
dbg_printf(i->ff, "READDIR%s ino=%d %u offset=0x%llx\n",
- i->flags == FUSE_READDIR_PLUS ? "PLUS" : "",
+ i->readdirplus ? "PLUS" : "",
dir,
i->nr++,
(unsigned long long)i->dirpos);
- if (i->flags == FUSE_READDIR_PLUS) {
- ret = stat_inode(i->fs, dirent->inode, &stat);
+ if (i->readdirplus) {
+ ret = fuse4fs_stat_inode(i->ff, dirent->inode, NULL, &fstat);
if (ret)
return DIRENT_ABORT;
}
memcpy(namebuf, dirent->name, dirent->name_len & 0xFF);
namebuf[dirent->name_len & 0xFF] = 0;
- ret = i->func(i->buf, namebuf, &stat, i->dirpos , 0);
- if (ret)
+
+ if (i->readdirplus) {
+ entrysize = fuse_add_direntry_plus(i->req, i->buf + i->bufused,
+ i->bufsz - i->bufused,
+ namebuf, &fstat.entry,
+ i->dirpos);
+ } else {
+ entrysize = fuse_add_direntry(i->req, i->buf + i->bufused,
+ i->bufsz - i->bufused, namebuf,
+ &fstat.entry.attr, i->dirpos);
+ }
+ if (entrysize > i->bufsz - i->bufused) {
+ /* Buffer is full */
return DIRENT_ABORT;
+ }
+ i->bufused += entrysize;
return 0;
}
-static int op_readdir(const char *path EXT2FS_ATTR((unused)), void *buf,
- fuse_fill_dir_t fill_func, off_t offset,
- struct fuse_file_info *fp, enum fuse_readdir_flags flags)
+static void __op_readdir(fuse_req_t req, fuse_ino_t fino, size_t size,
+ off_t offset, bool plus, struct fuse_file_info *fp)
{
- struct fuse4fs *ff = fuse4fs_get();
+ struct fuse4fs *ff = fuse4fs_get(req);
struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
errcode_t err;
struct readdir_iter i = {
.ff = ff,
+ .req = req,
.dirpos = 0,
.startpos = offset,
- .flags = flags,
+ .readdirplus = plus,
+ .bufsz = size,
};
int ret = 0;
- FUSE4FS_CHECK_CONTEXT(ff);
- FUSE4FS_CHECK_HANDLE(ff, fh);
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CHECK_HANDLE(req, fh);
dbg_printf(ff, "%s: ino=%d offset=0x%llx\n", __func__, fh->ino,
(unsigned long long)offset);
i.fs = fuse4fs_start(ff);
- i.buf = buf;
- i.func = fill_func;
+
+ err = ext2fs_get_mem(size, &i.buf);
+ if (err) {
+ ret = translate_error(i.fs, fh->ino, err);
+ goto out;
+ }
+
err = ext2fs_dir_iterate2(i.fs, fh->ino, 0, NULL, op_readdir_iter, &i);
if (err) {
ret = translate_error(i.fs, fh->ino, err);
@@ -3861,64 +3738,66 @@ static int op_readdir(const char *path EXT2FS_ATTR((unused)), void *buf,
}
out:
fuse4fs_finish(ff, ret);
- return ret;
+ if (ret)
+ fuse_reply_err(req, -ret);
+ else
+ fuse_reply_buf(req, i.buf, i.bufused);
+
+ ext2fs_free_mem(&i.buf);
+}
+
+static void op_readdir(fuse_req_t req, fuse_ino_t fino, size_t size,
+ off_t offset, struct fuse_file_info *fp)
+{
+ __op_readdir(req, fino, size, offset, false, fp);
+}
+
+static void op_readdirplus(fuse_req_t req, fuse_ino_t fino, size_t size,
+ off_t offset, struct fuse_file_info *fp)
+{
+ __op_readdir(req, fino, size, offset, true, fp);
}
-static int op_access(const char *path, int mask)
+static void op_access(fuse_req_t req, fuse_ino_t fino, int mask)
{
- struct fuse4fs *ff = fuse4fs_get();
- ext2_filsys fs;
- errcode_t err;
+ const struct fuse_ctx *ctxt = fuse_req_ctx(req);
+ struct fuse4fs *ff = fuse4fs_get(req);
ext2_ino_t ino;
int ret = 0;
- FUSE4FS_CHECK_CONTEXT(ff);
- dbg_printf(ff, "%s: path=%s mask=0x%x\n", __func__, path, mask);
- fs = fuse4fs_start(ff);
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, &ino);
- if (err || ino == 0) {
- ret = translate_error(fs, 0, err);
- goto out;
- }
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &ino, fino);
+ dbg_printf(ff, "%s: ino=%d mask=0x%x\n",
+ __func__, ino, mask);
+ fuse4fs_start(ff);
- ret = fuse4fs_inum_access(ff, ino, mask);
+ ret = fuse4fs_inum_access(ff, ctxt, ino, mask);
if (ret)
goto out;
out:
fuse4fs_finish(ff, ret);
- return ret;
+ fuse_reply_err(req, -ret);
}
-static int op_create(const char *path, mode_t mode, struct fuse_file_info *fp)
+static void op_create(fuse_req_t req, fuse_ino_t fino, const char *name,
+ mode_t mode, struct fuse_file_info *fp)
{
- struct fuse_context *ctxt = fuse_get_context();
- struct fuse4fs *ff = fuse4fs_get();
+ struct ext2_inode_large inode;
+ struct fuse4fs_stat fstat;
+ const struct fuse_ctx *ctxt = fuse_req_ctx(req);
+ struct fuse4fs *ff = fuse4fs_get(req);
ext2_filsys fs;
ext2_ino_t parent, child;
- char *temp_path;
errcode_t err;
- char *node_name, a;
int filetype;
- struct ext2_inode_large inode;
gid_t gid;
int ret = 0;
- FUSE4FS_CHECK_CONTEXT(ff);
- dbg_printf(ff, "%s: path=%s mode=0%o\n", __func__, path, mode);
- temp_path = strdup(path);
- if (!temp_path) {
- ret = -ENOMEM;
- goto out;
- }
- node_name = strrchr(temp_path, '/');
- if (!node_name) {
- ret = -ENOMEM;
- goto out;
- }
- node_name++;
- a = *node_name;
- *node_name = 0;
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &parent, fino);
+ dbg_printf(ff, "%s: parent=%d name='%s' mode=0%o\n",
+ __func__, parent, name, mode);
fs = fuse4fs_start(ff);
if (!fuse4fs_can_allocate(ff, 1)) {
@@ -3926,23 +3805,14 @@ static int op_create(const char *path, mode_t mode, struct fuse_file_info *fp)
goto out2;
}
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, temp_path,
- &parent);
- if (err) {
- ret = translate_error(fs, 0, err);
- goto out2;
- }
-
- ret = fuse4fs_inum_access(ff, parent, A_OK | W_OK);
+ ret = fuse4fs_inum_access(ff, ctxt, parent, A_OK | W_OK);
if (ret)
goto out2;
- err = fuse4fs_new_child_gid(ff, parent, &gid, NULL);
+ err = fuse4fs_new_child_gid(ff, ctxt, parent, &gid, NULL);
if (err)
goto out2;
- *node_name = a;
-
filetype = ext2_file_type(mode);
err = ext2fs_new_inode(fs, parent, mode, 0, &child);
@@ -3951,9 +3821,9 @@ static int op_create(const char *path, mode_t mode, struct fuse_file_info *fp)
goto out2;
}
- dbg_printf(ff, "%s: creating ino=%d/name=%s in dir=%d\n", __func__, child,
- node_name, parent);
- err = ext2fs_link(fs, parent, node_name, child,
+ dbg_printf(ff, "%s: creating dir=%d name='%s' child=%d\n",
+ __func__, parent, name, child);
+ err = ext2fs_link(fs, parent, name, child,
filetype | EXT2FS_LINK_EXPAND);
if (err) {
ret = translate_error(fs, parent, err);
@@ -4005,7 +3875,7 @@ static int op_create(const char *path, mode_t mode, struct fuse_file_info *fp)
goto out2;
fp->flags &= ~O_TRUNC;
- ret = __op_open(ff, path, fp);
+ ret = fuse4fs_open_file(ff, ctxt, child, fp);
if (ret)
goto out2;
@@ -4013,44 +3883,152 @@ static int op_create(const char *path, mode_t mode, struct fuse_file_info *fp)
if (ret)
goto out2;
+ ret = fuse4fs_stat_inode(ff, child, NULL, &fstat);
+ if (ret)
+ goto out2;
+
out2:
fuse4fs_finish(ff, ret);
-out:
- free(temp_path);
- return ret;
+
+ if (ret)
+ fuse_reply_err(req, -ret);
+ else
+ fuse_reply_create(req, &fstat.entry, fp);
+}
+
+enum fuse4fs_time_action {
+ TA_NOW, /* set to current time */
+ TA_OMIT, /* do not set timestamp */
+ TA_THIS, /* set to specific timestamp */
+};
+
+static inline const char *
+fuse4fs_time_action_string(enum fuse4fs_time_action act)
+{
+ switch (act) {
+ case TA_NOW:
+ return "now";
+ case TA_OMIT:
+ return "omit";
+ case TA_THIS:
+ return "specific";
+ }
+ return NULL; /* shut up gcc */
}
-static int op_utimens(const char *path, const struct timespec ctv[2],
- struct fuse_file_info *fi)
+static int fuse4fs_utimens(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
+ ext2_ino_t ino, const int to_set,
+ const struct stat *attr,
+ struct ext2_inode_large *inode)
{
- struct fuse4fs *ff = fuse4fs_get();
- struct timespec tv[2];
- ext2_filsys fs;
- errcode_t err;
- ext2_ino_t ino;
- struct ext2_inode_large inode;
+ enum fuse4fs_time_action aact = TA_OMIT;
+ enum fuse4fs_time_action mact = TA_OMIT;
+ struct timespec atime = { };
+ struct timespec mtime = { };
+ struct timespec now = { };
int access = W_OK;
int ret = 0;
- FUSE4FS_CHECK_CONTEXT(ff);
- fs = fuse4fs_start(ff);
- ret = fuse4fs_file_ino(ff, path, fi, &ino);
- if (ret)
- goto out;
- dbg_printf(ff, "%s: ino=%d atime=%lld.%ld mtime=%lld.%ld\n", __func__,
- ino,
- (long long int)ctv[0].tv_sec, ctv[0].tv_nsec,
- (long long int)ctv[1].tv_sec, ctv[1].tv_nsec);
+ if (to_set & (FUSE_SET_ATTR_ATIME_NOW | FUSE_SET_ATTR_MTIME_NOW))
+ get_now(&now);
+
+ if (to_set & FUSE_SET_ATTR_ATIME_NOW) {
+ atime = now;
+ aact = TA_NOW;
+ } else if (to_set & FUSE_SET_ATTR_ATIME) {
+#if HAVE_STRUCT_STAT_ST_ATIM
+ atime = attr->st_atim;
+#else
+ atime.tv_sec = attr->st_atime;
+#endif
+ aact = TA_THIS;
+ }
+
+ if (to_set & FUSE_SET_ATTR_MTIME_NOW) {
+ mtime = now;
+ mact = TA_NOW;
+ } else if (to_set & FUSE_SET_ATTR_MTIME) {
+#if HAVE_STRUCT_STAT_ST_ATIM
+ mtime = attr->st_mtim;
+#else
+ mtime.tv_sec = attr->st_mtime;
+#endif
+ mact = TA_THIS;
+ }
+
+ dbg_printf(ff, "%s: ino=%d atime=%s:%lld.%ld mtime=%s:%lld.%ld\n",
+ __func__, ino, fuse4fs_time_action_string(aact),
+ (long long int)atime.tv_sec, atime.tv_nsec,
+ fuse4fs_time_action_string(mact),
+ (long long int)mtime.tv_sec, mtime.tv_nsec);
/*
* ext4 allows timestamp updates of append-only files but only if we're
* setting to current time
*/
- if (ctv[0].tv_nsec == UTIME_NOW && ctv[1].tv_nsec == UTIME_NOW)
+ if (aact == TA_NOW && mact == TA_NOW)
access |= A_OK;
- ret = fuse4fs_inum_access(ff, ino, access);
+ ret = fuse4fs_inum_access(ff, ctxt, ino, access);
if (ret)
+ return ret;
+
+ if (aact != TA_OMIT)
+ EXT4_INODE_SET_XTIME(i_atime, &atime, inode);
+ if (mact != TA_OMIT)
+ EXT4_INODE_SET_XTIME(i_mtime, &mtime, inode);
+
+ return 0;
+}
+
+static int fuse4fs_setsize(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
+ ext2_ino_t ino, off_t new_size,
+ struct ext2_inode_large *inode)
+{
+ errcode_t err;
+ int ret;
+
+ /* Write inode because truncate makes its own copy */
+ err = fuse4fs_write_inode(ff->fs, ino, inode);
+ if (err)
+ return translate_error(ff->fs, ino, err);
+
+ ret = fuse4fs_inum_access(ff, ctxt, ino, W_OK);
+ if (ret)
+ return ret;
+
+ ret = fuse4fs_truncate(ff, ino, new_size);
+ if (ret)
+ return ret;
+
+ /* Re-read inode after truncate */
+ err = fuse4fs_read_inode(ff->fs, ino, inode);
+ if (err)
+ return translate_error(ff->fs, ino, err);
+
+ return 0;
+}
+
+static void op_setattr(fuse_req_t req, fuse_ino_t fino, struct stat *attr,
+ int to_set, struct fuse_file_info *fi EXT2FS_ATTR((unused)))
+{
+ struct ext2_inode_large inode;
+ struct fuse4fs_stat fstat;
+ const struct fuse_ctx *ctxt = fuse_req_ctx(req);
+ struct fuse4fs *ff = fuse4fs_get(req);
+ ext2_filsys fs;
+ ext2_ino_t ino;
+ errcode_t err;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &ino, fino);
+ dbg_printf(ff, "%s: ino=%d to_set=0x%x\n", __func__, ino, to_set);
+ fs = fuse4fs_start(ff);
+
+ if (!fuse4fs_is_writeable(ff)) {
+ ret = -EROFS;
goto out;
+ }
err = fuse4fs_read_inode(fs, ino, &inode);
if (err) {
@@ -4058,20 +4036,35 @@ static int op_utimens(const char *path, const struct timespec ctv[2],
goto out;
}
- tv[0] = ctv[0];
- tv[1] = ctv[1];
-#ifdef UTIME_NOW
- if (tv[0].tv_nsec == UTIME_NOW)
- get_now(tv);
- if (tv[1].tv_nsec == UTIME_NOW)
- get_now(tv + 1);
-#endif /* UTIME_NOW */
-#ifdef UTIME_OMIT
- if (tv[0].tv_nsec != UTIME_OMIT)
- EXT4_INODE_SET_XTIME(i_atime, &tv[0], &inode);
- if (tv[1].tv_nsec != UTIME_OMIT)
- EXT4_INODE_SET_XTIME(i_mtime, &tv[1], &inode);
-#endif /* UTIME_OMIT */
+ /* Handle mode change using helper */
+ if (to_set & FUSE_SET_ATTR_MODE) {
+ ret = fuse4fs_chmod(ff, req, ino, attr->st_mode, &inode);
+ if (ret)
+ goto out;
+ }
+
+ /* Handle owner/group change using helper */
+ if (to_set & (FUSE_SET_ATTR_UID | FUSE_SET_ATTR_GID)) {
+ ret = fuse4fs_chown(ff, ctxt, ino, to_set, attr, &inode);
+ if (ret)
+ goto out;
+ }
+
+ /* Handle size change using helper */
+ if (to_set & FUSE_SET_ATTR_SIZE) {
+ ret = fuse4fs_setsize(ff, ctxt, ino, attr->st_size, &inode);
+ if (ret)
+ goto out;
+ }
+
+ /* Handle time changes using helper */
+ if (to_set & (FUSE_SET_ATTR_ATIME | FUSE_SET_ATTR_MTIME)) {
+ ret = fuse4fs_utimens(ff, ctxt, ino, to_set, attr, &inode);
+ if (ret)
+ goto out;
+ }
+
+ /* Update ctime for any attribute change */
ret = update_ctime(fs, ino, &inode);
if (ret)
goto out;
@@ -4082,9 +4075,17 @@ static int op_utimens(const char *path, const struct timespec ctv[2],
goto out;
}
+ /* Get updated stat info to return */
+ ret = fuse4fs_stat_inode(ff, ino, &inode, &fstat);
+
out:
fuse4fs_finish(ff, ret);
- return ret;
+
+ if (ret)
+ fuse_reply_err(req, -ret);
+ else
+ fuse_reply_attr(req, &fstat.entry.attr,
+ fstat.entry.attr_timeout);
}
#define FUSE4FS_MODIFIABLE_IFLAGS \
@@ -4103,32 +4104,38 @@ static inline int set_iflags(struct ext2_inode_large *inode, __u32 iflags)
#ifdef SUPPORT_I_FLAGS
static int ioctl_getflags(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
- void *data)
+ __u32 *outdata, size_t *outsize)
{
ext2_filsys fs = ff->fs;
errcode_t err;
struct ext2_inode_large inode;
+ if (*outsize < sizeof(__u32))
+ return -EFAULT;
+
dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
err = fuse4fs_read_inode(fs, fh->ino, &inode);
if (err)
return translate_error(fs, fh->ino, err);
- *(__u32 *)data = inode.i_flags & EXT2_FL_USER_VISIBLE;
+ *outdata = inode.i_flags & EXT2_FL_USER_VISIBLE;
+ *outsize = sizeof(__u32);
return 0;
}
-static int ioctl_setflags(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
- void *data)
+static int ioctl_setflags(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
+ struct fuse4fs_file_handle *fh, const __u32 *indata,
+ size_t insize)
{
ext2_filsys fs = ff->fs;
errcode_t err;
struct ext2_inode_large inode;
int ret;
- __u32 flags = *(__u32 *)data;
- struct fuse_context *ctxt = fuse_get_context();
- dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
+ if (insize < sizeof(__u32))
+ return -EFAULT;
+
+ dbg_printf(ff, "%s: ino=%d iflags=0x%x\n", __func__, fh->ino, *indata);
err = fuse4fs_read_inode(fs, fh->ino, &inode);
if (err)
return translate_error(fs, fh->ino, err);
@@ -4136,7 +4143,7 @@ static int ioctl_setflags(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
if (fuse4fs_want_check_owner(ff, ctxt) && inode_uid(inode) != ctxt->uid)
return -EPERM;
- ret = set_iflags(&inode, flags);
+ ret = set_iflags(&inode, *indata);
if (ret)
return ret;
@@ -4152,32 +4159,38 @@ static int ioctl_setflags(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
}
static int ioctl_getversion(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
- void *data)
+ __u32 *outdata, size_t *outsize)
{
ext2_filsys fs = ff->fs;
errcode_t err;
struct ext2_inode_large inode;
+ if (*outsize < sizeof(__u32))
+ return -EFAULT;
+
dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
err = fuse4fs_read_inode(fs, fh->ino, &inode);
if (err)
return translate_error(fs, fh->ino, err);
- *(__u32 *)data = inode.i_generation;
+ *outdata = inode.i_generation;
+ *outsize = sizeof(__u32);
return 0;
}
-static int ioctl_setversion(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
- void *data)
+static int ioctl_setversion(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
+ struct fuse4fs_file_handle *fh, const __u32 *indata,
+ size_t insize)
{
ext2_filsys fs = ff->fs;
errcode_t err;
struct ext2_inode_large inode;
int ret;
- __u32 generation = *(__u32 *)data;
- struct fuse_context *ctxt = fuse_get_context();
- dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
+ if (insize < sizeof(__u32))
+ return -EFAULT;
+
+ dbg_printf(ff, "%s: ino=%d generation=%u\n", __func__, fh->ino, *indata);
err = fuse4fs_read_inode(fs, fh->ino, &inode);
if (err)
return translate_error(fs, fh->ino, err);
@@ -4185,7 +4198,7 @@ static int ioctl_setversion(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
if (fuse4fs_want_check_owner(ff, ctxt) && inode_uid(inode) != ctxt->uid)
return -EPERM;
- inode.i_generation = generation;
+ inode.i_generation = *indata;
ret = update_ctime(fs, fh->ino, &inode);
if (ret)
@@ -4222,14 +4235,16 @@ static __u32 iflags_to_fsxflags(__u32 iflags)
}
static int ioctl_fsgetxattr(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
- void *data)
+ struct fsxattr *fsx, size_t *outsize)
{
ext2_filsys fs = ff->fs;
errcode_t err;
struct ext2_inode_large inode;
- struct fsxattr *fsx = data;
unsigned int inode_size;
+ if (*outsize < sizeof(struct fsxattr))
+ return -EFAULT;
+
dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
err = fuse4fs_read_inode(fs, fh->ino, &inode);
if (err)
@@ -4240,6 +4255,7 @@ static int ioctl_fsgetxattr(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
if (ext2fs_inode_includes(inode_size, i_projid))
fsx->fsx_projid = inode_projid(inode);
fsx->fsx_xflags = iflags_to_fsxflags(inode.i_flags);
+ *outsize = sizeof(struct fsxattr);
return 0;
}
@@ -4264,18 +4280,21 @@ static __u32 fsxflags_to_iflags(__u32 xflags)
return iflags;
}
-static int ioctl_fssetxattr(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
- void *data)
+static int ioctl_fssetxattr(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
+ struct fuse4fs_file_handle *fh,
+ const struct fsxattr *fsx, size_t insize)
{
ext2_filsys fs = ff->fs;
errcode_t err;
struct ext2_inode_large inode;
int ret;
- struct fuse_context *ctxt = fuse_get_context();
- struct fsxattr *fsx = data;
- __u32 flags = fsxflags_to_iflags(fsx->fsx_xflags);
+ __u32 flags;
unsigned int inode_size;
+ if (insize < sizeof(struct fsxattr))
+ return -EFAULT;
+
+ flags = fsxflags_to_iflags(fsx->fsx_xflags);
dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
err = fuse4fs_read_inode(fs, fh->ino, &inode);
if (err)
@@ -4306,17 +4325,24 @@ static int ioctl_fssetxattr(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
#ifdef FITRIM
static int ioctl_fitrim(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
- void *data)
+ const struct fstrim_range *fr_in, size_t insize,
+ struct fstrim_range *fr, size_t *outsize)
{
ext2_filsys fs = ff->fs;
- struct fstrim_range *fr = data;
blk64_t start, end, max_blocks, b, cleared, minlen;
blk64_t max_blks = ext2fs_blocks_count(fs->super);
errcode_t err = 0;
+ if (insize < sizeof(struct fstrim_range))
+ return -EFAULT;
+
+ if (*outsize < sizeof(struct fstrim_range))
+ return -EFAULT;
+
if (!fuse4fs_is_writeable(ff))
return -EROFS;
+ memcpy(fr, fr_in, sizeof(*fr));
start = FUSE4FS_B_TO_FSBT(ff, fr->start);
if (fr->len == -1ULL)
end = -1ULL;
@@ -4387,6 +4413,7 @@ static int ioctl_fitrim(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
out:
fr->len = FUSE4FS_FSB_TO_B(ff, cleared);
+ *outsize = sizeof(struct fstrim_range);
dbg_printf(ff, "%s: len=%llu err=%ld\n", __func__, fr->len, err);
return err;
}
@@ -4396,10 +4423,10 @@ static int ioctl_fitrim(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
# define EXT4_IOC_SHUTDOWN _IOR('X', 125, __u32)
#endif
-static int ioctl_shutdown(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
- void *data)
+static int ioctl_shutdown(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
+ struct fuse4fs_file_handle *fh, const void *indata,
+ size_t insize)
{
- struct fuse_context *ctxt = fuse_get_context();
ext2_filsys fs = ff->fs;
if (!fuse4fs_is_superuser(ff, ctxt))
@@ -4419,49 +4446,61 @@ static int ioctl_shutdown(struct fuse4fs *ff, struct fuse4fs_file_handle *fh,
return 0;
}
-static int op_ioctl(const char *path EXT2FS_ATTR((unused)),
- unsigned int cmd,
- void *arg EXT2FS_ATTR((unused)),
- struct fuse_file_info *fp,
- unsigned int flags EXT2FS_ATTR((unused)), void *data)
+static void op_ioctl(fuse_req_t req, fuse_ino_t fino EXT2FS_ATTR((unused)),
+ unsigned int cmd,
+ void *arg EXT2FS_ATTR((unused)),
+ struct fuse_file_info *fp,
+ unsigned int flags EXT2FS_ATTR((unused)),
+ const void *indata, size_t insize,
+ size_t outsize)
{
- struct fuse4fs *ff = fuse4fs_get();
+ const struct fuse_ctx *ctxt = fuse_req_ctx(req);
+ struct fuse4fs *ff = fuse4fs_get(req);
struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
+ void *outdata = NULL;
int ret = 0;
- FUSE4FS_CHECK_CONTEXT(ff);
- FUSE4FS_CHECK_HANDLE(ff, fh);
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CHECK_HANDLE(req, fh);
+
+ if (outsize > 0) {
+ outdata = calloc(outsize, sizeof(char));
+ if (!outdata) {
+ fuse_reply_err(req, errno);
+ return;
+ }
+ }
+
fuse4fs_start(ff);
switch ((unsigned long) cmd) {
#ifdef SUPPORT_I_FLAGS
case EXT2_IOC_GETFLAGS:
- ret = ioctl_getflags(ff, fh, data);
+ ret = ioctl_getflags(ff, fh, outdata, &outsize);
break;
case EXT2_IOC_SETFLAGS:
- ret = ioctl_setflags(ff, fh, data);
+ ret = ioctl_setflags(ff, ctxt, fh, indata, insize);
break;
case EXT2_IOC_GETVERSION:
- ret = ioctl_getversion(ff, fh, data);
+ ret = ioctl_getversion(ff, fh, outdata, &outsize);
break;
case EXT2_IOC_SETVERSION:
- ret = ioctl_setversion(ff, fh, data);
+ ret = ioctl_setversion(ff, ctxt, fh, indata, insize);
break;
#endif
#ifdef FS_IOC_FSGETXATTR
case FS_IOC_FSGETXATTR:
- ret = ioctl_fsgetxattr(ff, fh, data);
+ ret = ioctl_fsgetxattr(ff, fh, outdata, &outsize);
break;
case FS_IOC_FSSETXATTR:
- ret = ioctl_fssetxattr(ff, fh, data);
+ ret = ioctl_fssetxattr(ff, ctxt, fh, indata, insize);
break;
#endif
#ifdef FITRIM
case FITRIM:
- ret = ioctl_fitrim(ff, fh, data);
+ ret = ioctl_fitrim(ff, fh, indata, insize, outdata, &outsize);
break;
#endif
case EXT4_IOC_SHUTDOWN:
- ret = ioctl_shutdown(ff, fh, data);
+ ret = ioctl_shutdown(ff, ctxt, fh, indata, insize);
break;
default:
dbg_printf(ff, "%s: Unknown ioctl %d\n", __func__, cmd);
@@ -4469,28 +4508,29 @@ static int op_ioctl(const char *path EXT2FS_ATTR((unused)),
}
fuse4fs_finish(ff, ret);
- return ret;
+ if (ret)
+ fuse_reply_err(req, -ret);
+ else
+ fuse_reply_ioctl(req, 0, outdata, outsize);
+ free(outdata);
}
-static int op_bmap(const char *path, size_t blocksize EXT2FS_ATTR((unused)),
- uint64_t *idx)
+static void op_bmap(fuse_req_t req, fuse_ino_t fino,
+ size_t blocksize EXT2FS_ATTR((unused)), uint64_t idx)
{
- struct fuse4fs *ff = fuse4fs_get();
+ struct fuse4fs *ff = fuse4fs_get(req);
ext2_filsys fs;
ext2_ino_t ino;
+ blk64_t blkno;
errcode_t err;
int ret = 0;
- FUSE4FS_CHECK_CONTEXT(ff);
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &ino, fino);
fs = fuse4fs_start(ff);
- err = ext2fs_namei(fs, EXT2_ROOT_INO, EXT2_ROOT_INO, path, &ino);
- if (err) {
- ret = translate_error(fs, 0, err);
- goto out;
- }
- dbg_printf(ff, "%s: ino=%d blk=%"PRIu64"\n", __func__, ino, *idx);
+ dbg_printf(ff, "%s: ino=%d blk=%"PRIu64"\n", __func__, ino, idx);
- err = ext2fs_bmap2(fs, ino, NULL, NULL, 0, *idx, 0, (blk64_t *)idx);
+ err = ext2fs_bmap2(fs, ino, NULL, NULL, 0, idx, 0, &blkno);
if (err) {
ret = translate_error(fs, ino, err);
goto out;
@@ -4498,7 +4538,10 @@ static int op_bmap(const char *path, size_t blocksize EXT2FS_ATTR((unused)),
out:
fuse4fs_finish(ff, ret);
- return ret;
+ if (ret)
+ fuse_reply_err(req, -ret);
+ else
+ fuse_reply_bmap(req, blkno);
}
#ifdef SUPPORT_FALLOCATE
@@ -4741,20 +4784,22 @@ static int fuse4fs_zero_range(struct fuse4fs *ff,
return ret;
}
-static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode,
- off_t offset, off_t len,
- struct fuse_file_info *fp)
+static void op_fallocate(fuse_req_t req, fuse_ino_t fino EXT2FS_ATTR((unused)),
+ int mode, off_t offset, off_t len,
+ struct fuse_file_info *fp)
{
- struct fuse4fs *ff = fuse4fs_get();
+ struct fuse4fs *ff = fuse4fs_get(req);
struct fuse4fs_file_handle *fh = fuse4fs_get_handle(fp);
int ret;
/* Catch unknown flags */
- if (mode & ~(FL_ZERO_RANGE_FLAG | FL_PUNCH_HOLE_FLAG | FL_KEEP_SIZE_FLAG))
- return -EOPNOTSUPP;
+ if (mode & ~(FL_ZERO_RANGE_FLAG | FL_PUNCH_HOLE_FLAG | FL_KEEP_SIZE_FLAG)) {
+ fuse_reply_err(req, EOPNOTSUPP);
+ return;
+ }
- FUSE4FS_CHECK_CONTEXT(ff);
- FUSE4FS_CHECK_HANDLE(ff, fh);
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CHECK_HANDLE(req, fh);
fuse4fs_start(ff);
if (!fuse4fs_is_writeable(ff)) {
ret = -EROFS;
@@ -4774,12 +4819,13 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode,
ret = fuse4fs_allocate_range(ff, fh, mode, offset, len);
out:
fuse4fs_finish(ff, ret);
-
- return ret;
+ fuse_reply_err(req, -ret);
}
#endif /* SUPPORT_FALLOCATE */
-static struct fuse_operations fs_ops = {
+static struct fuse_lowlevel_ops fs_ops = {
+ .lookup = op_lookup,
+ .setattr = op_setattr,
.init = op_init,
.destroy = op_destroy,
.getattr = op_getattr,
@@ -4791,9 +4837,6 @@ static struct fuse_operations fs_ops = {
.symlink = op_symlink,
.rename = op_rename,
.link = op_link,
- .chmod = op_chmod,
- .chown = op_chown,
- .truncate = op_truncate,
.open = op_open,
.read = op_read,
.write = op_write,
@@ -4806,11 +4849,11 @@ static struct fuse_operations fs_ops = {
.removexattr = op_removexattr,
.opendir = op_open,
.readdir = op_readdir,
+ .readdirplus = op_readdirplus,
.releasedir = op_release,
.fsyncdir = op_fsync,
.access = op_access,
.create = op_create,
- .utimens = op_utimens,
.bmap = op_bmap,
#ifdef SUPERFLUOUS
.lock = op_lock,
@@ -4959,8 +5002,8 @@ static int fuse4fs_opt_proc(void *data, const char *arg,
"\n",
outargs->argv[0]);
if (key == FUSE4FS_HELPFULL) {
- fuse_opt_add_arg(outargs, "-h");
- fuse_main(outargs->argc, outargs->argv, &fs_ops, NULL);
+ printf("FUSE options:\n");
+ fuse_cmdline_help();
} else {
fprintf(stderr, "Try --helpfull to get a list of "
"all flags, including the FUSE options.\n");
@@ -4970,8 +5013,7 @@ static int fuse4fs_opt_proc(void *data, const char *arg,
case FUSE4FS_VERSION:
fprintf(stderr, "fuse4fs %s (%s)\n", E2FSPROGS_VERSION,
E2FSPROGS_DATE);
- fuse_opt_add_arg(outargs, "--version");
- fuse_main(outargs->argc, outargs->argv, &fs_ops, NULL);
+ fprintf(stderr, "FUSE library version %s\n", fuse_pkgversion());
exit(0);
}
return 1;
@@ -5040,6 +5082,106 @@ static void fuse4fs_com_err_proc(const char *whoami, errcode_t code,
fflush(stderr);
}
+static int fuse4fs_main(struct fuse_args *args, struct fuse4fs *ff)
+{
+ struct fuse_cmdline_opts opts;
+ struct fuse_session *se;
+ struct fuse_loop_config *loop_config = NULL;
+ int ret = 0;
+
+ if (fuse_parse_cmdline(args, &opts) != 0) {
+ ret = 1;
+ goto out;
+ }
+
+ if (ff->debug)
+ opts.debug = true;
+
+ if (opts.show_help) {
+ fuse_cmdline_help();
+ ret = 0;
+ goto out_free_opts;
+ }
+
+ if (opts.show_version) {
+ printf("FUSE library version %s\n", fuse_pkgversion());
+ ret = 0;
+ goto out_free_opts;
+ }
+
+ if (!opts.mountpoint) {
+ fprintf(stderr, "error: no mountpoint specified\n");
+ ret = 2;
+ goto out_free_opts;
+ }
+
+ se = fuse_session_new(args, &fs_ops, sizeof(fs_ops), ff);
+ if (se == NULL) {
+ ret = 3;
+ goto out_free_opts;
+ }
+ ff->fuse = se;
+
+ if (fuse_session_mount(se, opts.mountpoint) != 0) {
+ ret = 4;
+ goto out_destroy_session;
+ }
+
+ if (fuse_daemonize(opts.foreground) != 0) {
+ ret = 5;
+ goto out_unmount;
+ }
+
+ /*
+ * Configure logging a second time, because libfuse might have
+ * redirected std{out,err} as part of daemonization. If this fails,
+ * give up and move on.
+ */
+ fuse4fs_setup_logging(ff);
+ if (ff->logfd >= 0)
+ close(ff->logfd);
+ ff->logfd = -1;
+
+ if (fuse_set_signal_handlers(se) != 0) {
+ ret = 6;
+ goto out_unmount;
+ }
+
+ loop_config = fuse_loop_cfg_create();
+ if (loop_config == NULL) {
+ ret = 7;
+ goto out_remove_signal_handlers;
+ }
+
+ /*
+ * Since there's a Big Kernel Lock around all the libext2fs code, we
+ * only need to start three threads -- one to decode a request, another
+ * to do the filesystem work, and a third to transmit the reply.
+ */
+ fuse_loop_cfg_set_clone_fd(loop_config, opts.clone_fd);
+ fuse_loop_cfg_set_idle_threads(loop_config, opts.max_idle_threads);
+ fuse_loop_cfg_set_max_threads(loop_config, 3);
+
+ if (fuse_session_loop_mt(se, loop_config) != 0) {
+ ret = 8;
+ goto out_loopcfg;
+ }
+
+out_loopcfg:
+ fuse_loop_cfg_destroy(loop_config);
+out_remove_signal_handlers:
+ fuse_remove_signal_handlers(se);
+out_unmount:
+ fuse_session_unmount(se);
+out_destroy_session:
+ ff->fuse = NULL;
+ fuse_session_destroy(se);
+out_free_opts:
+ free(opts.mountpoint);
+out:
+ return ret;
+}
+
int main(int argc, char *argv[])
{
struct fuse_args args = FUSE_ARGS_INIT(argc, argv);
@@ -5178,8 +5320,7 @@ int main(int argc, char *argv[])
get_random_bytes(&fctx.next_generation, sizeof(unsigned int));
/* Set up default fuse parameters */
- snprintf(extra_args, BUFSIZ, "-okernel_cache,subtype=%s,"
- "fsname=%s,attr_timeout=0",
+ snprintf(extra_args, BUFSIZ, "-osubtype=%s,fsname=%s",
get_subtype(argv[0]),
fctx.device);
if (fctx.no_default_opts == 0)
@@ -5218,7 +5359,7 @@ int main(int argc, char *argv[])
}
pthread_mutex_init(&fctx.bfl, NULL);
- ret = fuse_main(args.argc, args.argv, &fs_ops, &fctx);
+ ret = fuse4fs_main(&args, &fctx);
pthread_mutex_destroy(&fctx.bfl);
switch(ret) {
* [PATCH 05/20] libsupport: port the kernel list.h to libsupport
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
` (3 preceding siblings ...)
2025-08-21 1:08 ` [PATCH 04/20] fuse4fs: convert to low level API Darrick J. Wong
@ 2025-08-21 1:09 ` Darrick J. Wong
2025-08-21 1:09 ` [PATCH 06/20] libsupport: add a cache Darrick J. Wong
` (14 subsequent siblings)
19 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:09 UTC (permalink / raw)
To: tytso
Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, amir73il,
joannelkoong, neal
From: Darrick J. Wong <djwong@kernel.org>
The next patch adds the xfsprogs cache manager code to e2fsprogs. That
code goes into libsupport so that it doesn't become part of the
libext2fs ABI, and it depends on a richer set of list_head helpers than
kernel-list.h provides. Port the Linux 6.17 list.h to libsupport and
drop the copy in libext2fs.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
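For reviewers who have not used the kernel list API from userspace, here
is a rough sketch of the embedded list_head pattern that the ported
header provides. It is not taken from the patch: the helpers below are
simplified stand-ins written for illustration (no hardening checks, no
GNU statement expressions), and struct demo_item and
demo_sum_after_removal() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for helpers of the same name in list.h. */
struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* An empty list is a head that points at itself in both directions. */
static void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

/* Insert @new just before @head, i.e. at the tail of the queue. */
static void list_add_tail(struct list_head *new, struct list_head *head)
{
	struct list_head *prev = head->prev;

	head->prev = new;
	new->next = head;
	new->prev = prev;
	prev->next = new;
}

/* Unlink @entry and leave it as a valid empty list of its own. */
static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Hypothetical payload type; the list linkage lives inside the object. */
struct demo_item {
	int value;
	struct list_head link;
};

/* Queue three items, unlink the middle one, and sum what is left. */
int demo_sum_after_removal(void)
{
	struct demo_item items[3] = {
		{ .value = 1 }, { .value = 2 }, { .value = 3 },
	};
	struct list_head queue;
	struct list_head *pos;
	int sum = 0;
	int i;

	INIT_LIST_HEAD(&queue);
	for (i = 0; i < 3; i++)
		list_add_tail(&items[i].link, &queue);
	assert(!list_empty(&queue));

	list_del_init(&items[1].link);
	assert(list_empty(&items[1].link));

	/* Walk the ring until we are back at the head. */
	for (pos = queue.next; pos != &queue; pos = pos->next)
		sum += container_of(pos, struct demo_item, link)->value;

	return sum;
}
```

Calling demo_sum_after_removal() from a test program walks the two
remaining entries in insertion order. Code built against the real header
would normally use list_for_each_entry() instead of the open-coded loop,
but that macro relies on typeof, so the sketch sticks to plain C.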
lib/ext2fs/jfs_compat.h | 2
lib/ext2fs/kernel-list.h | 111 ------
lib/support/list.h | 894 ++++++++++++++++++++++++++++++++++++++++++++++
debugfs/Makefile.in | 12 -
e2fsck/Makefile.in | 56 +--
lib/e2p/Makefile.in | 4
lib/ext2fs/Makefile.in | 14 -
misc/Makefile.in | 12 -
misc/tune2fs.c | 4
9 files changed, 944 insertions(+), 165 deletions(-)
delete mode 100644 lib/ext2fs/kernel-list.h
create mode 100644 lib/support/list.h
diff --git a/lib/ext2fs/jfs_compat.h b/lib/ext2fs/jfs_compat.h
index 30b05822b6fd4d..8e598bcfa73ef7 100644
--- a/lib/ext2fs/jfs_compat.h
+++ b/lib/ext2fs/jfs_compat.h
@@ -2,7 +2,7 @@
#ifndef _JFS_COMPAT_H
#define _JFS_COMPAT_H
-#include "kernel-list.h"
+#include "support/list.h"
#include <errno.h>
#ifdef HAVE_NETINET_IN_H
#include <netinet/in.h>
diff --git a/lib/ext2fs/kernel-list.h b/lib/ext2fs/kernel-list.h
deleted file mode 100644
index dd7b8e07dd56c4..00000000000000
--- a/lib/ext2fs/kernel-list.h
+++ /dev/null
@@ -1,111 +0,0 @@
-#ifndef _LINUX_LIST_H
-#define _LINUX_LIST_H
-
-#include "compiler.h"
-
-/*
- * Simple doubly linked list implementation.
- *
- * Some of the internal functions ("__xxx") are useful when
- * manipulating whole lists rather than single entries, as
- * sometimes we already know the next/prev entries and we can
- * generate better code by using them directly rather than
- * using the generic single-entry routines.
- */
-
-struct list_head {
- struct list_head *next, *prev;
-};
-
-#define LIST_HEAD_INIT(name) { &(name), &(name) }
-
-#define INIT_LIST_HEAD(ptr) do { \
- (ptr)->next = (ptr); (ptr)->prev = (ptr); \
-} while (0)
-
-#if (!defined(__GNUC__) && !defined(__WATCOMC__))
-#define __inline__
-#endif
-
-/*
- * Insert a new entry between two known consecutive entries.
- *
- * This is only for internal list manipulation where we know
- * the prev/next entries already!
- */
-static __inline__ void __list_add(struct list_head * new,
- struct list_head * prev,
- struct list_head * next)
-{
- next->prev = new;
- new->next = next;
- new->prev = prev;
- prev->next = new;
-}
-
-/*
- * Insert a new entry after the specified head..
- */
-static __inline__ void list_add(struct list_head *new, struct list_head *head)
-{
- __list_add(new, head, head->next);
-}
-
-/*
- * Insert a new entry at the tail
- */
-static __inline__ void list_add_tail(struct list_head *new, struct list_head *head)
-{
- __list_add(new, head->prev, head);
-}
-
-/*
- * Delete a list entry by making the prev/next entries
- * point to each other.
- *
- * This is only for internal list manipulation where we know
- * the prev/next entries already!
- */
-static __inline__ void __list_del(struct list_head * prev,
- struct list_head * next)
-{
- next->prev = prev;
- prev->next = next;
-}
-
-static __inline__ void list_del(struct list_head *entry)
-{
- __list_del(entry->prev, entry->next);
-}
-
-static __inline__ int list_empty(struct list_head *head)
-{
- return head->next == head;
-}
-
-/*
- * Splice in "list" into "head"
- */
-static __inline__ void list_splice(struct list_head *list, struct list_head *head)
-{
- struct list_head *first = list->next;
-
- if (first != list) {
- struct list_head *last = list->prev;
- struct list_head *at = head->next;
-
- first->prev = head;
- head->next = first;
-
- last->next = at;
- at->prev = last;
- }
-}
-
-#define list_entry(ptr, type, member) \
- container_of(ptr, type, member)
-
-#define list_for_each(pos, head) \
- for (pos = (head)->next; pos != (head); pos = pos->next)
-
-#endif
diff --git a/lib/support/list.h b/lib/support/list.h
new file mode 100644
index 00000000000000..df6c99708e4a8e
--- /dev/null
+++ b/lib/support/list.h
@@ -0,0 +1,894 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_LIST_H
+#define _LINUX_LIST_H
+
+#include <stdbool.h>
+
+struct list_head {
+ struct list_head *next, *prev;
+};
+
+#ifdef __GNUC__
+#define container_of(ptr, type, member) ({ \
+ __typeof__( ((type *)0)->member ) *__mptr = (ptr); \
+ (type *)( (char *)__mptr - offsetof(type,member) );})
+#else
+#define container_of(ptr, type, member) \
+ ((type *)((char *)(ptr) - offsetof(type, member)))
+#endif
+
+/*
+ * Circular doubly linked list implementation.
+ *
+ * Some of the internal functions ("__xxx") are useful when
+ * manipulating whole lists rather than single entries, as
+ * sometimes we already know the next/prev entries and we can
+ * generate better code by using them directly rather than
+ * using the generic single-entry routines.
+ */
+
+#define LIST_HEAD_INIT(name) { &(name), &(name) }
+
+#define LIST_HEAD(name) \
+ struct list_head name = LIST_HEAD_INIT(name)
+
+/**
+ * INIT_LIST_HEAD - Initialize a list_head structure
+ * @list: list_head structure to be initialized.
+ *
+ * Initializes the list_head to point to itself. If it is a list header,
+ * the result is an empty list.
+ */
+static inline void INIT_LIST_HEAD(struct list_head *list)
+{
+ list->next = list;
+ list->prev = list;
+}
+
+#ifdef CONFIG_LIST_HARDENED
+
+#ifdef CONFIG_DEBUG_LIST
+# define __list_valid_slowpath
+#else
+# define __list_valid_slowpath __cold __preserve_most
+#endif
+
+/*
+ * Performs the full set of list corruption checks before __list_add().
+ * On list corruption reports a warning, and returns false.
+ */
+bool __list_valid_slowpath __list_add_valid_or_report(struct list_head *new,
+ struct list_head *prev,
+ struct list_head *next);
+
+/*
+ * Performs list corruption checks before __list_add(). Returns false if a
+ * corruption is detected, true otherwise.
+ *
+ * With CONFIG_LIST_HARDENED only, performs minimal list integrity checking
+ * inline to catch non-faulting corruptions, and only if a corruption is
+ * detected calls the reporting function __list_add_valid_or_report().
+ */
+static __always_inline bool __list_add_valid(struct list_head *new,
+ struct list_head *prev,
+ struct list_head *next)
+{
+ bool ret = true;
+
+ if (!IS_ENABLED(CONFIG_DEBUG_LIST)) {
+ /*
+ * With the hardening version, elide checking if next and prev
+ * are NULL, since the immediate dereference of them below would
+ * result in a fault if NULL.
+ *
+ * With the reduced set of checks, we can afford to inline the
+ * checks, which also gives the compiler a chance to elide some
+ * of them completely if they can be proven at compile-time. If
+ * one of the pre-conditions does not hold, the slow-path will
+ * show a report which pre-condition failed.
+ */
+ if (likely(next->prev == prev && prev->next == next && new != prev && new != next))
+ return true;
+ ret = false;
+ }
+
+ ret &= __list_add_valid_or_report(new, prev, next);
+ return ret;
+}
+
+/*
+ * Performs the full set of list corruption checks before __list_del_entry().
+ * On list corruption reports a warning, and returns false.
+ */
+bool __list_valid_slowpath __list_del_entry_valid_or_report(struct list_head *entry);
+
+/*
+ * Performs list corruption checks before __list_del_entry(). Returns false if a
+ * corruption is detected, true otherwise.
+ *
+ * With CONFIG_LIST_HARDENED only, performs minimal list integrity checking
+ * inline to catch non-faulting corruptions, and only if a corruption is
+ * detected calls the reporting function __list_del_entry_valid_or_report().
+ */
+static __always_inline bool __list_del_entry_valid(struct list_head *entry)
+{
+ bool ret = true;
+
+ if (!IS_ENABLED(CONFIG_DEBUG_LIST)) {
+ struct list_head *prev = entry->prev;
+ struct list_head *next = entry->next;
+
+ /*
+ * With the hardening version, elide checking if next and prev
+ * are NULL, LIST_POISON1 or LIST_POISON2, since the immediate
+ * dereference of them below would result in a fault.
+ */
+ if (likely(prev->next == entry && next->prev == entry))
+ return true;
+ ret = false;
+ }
+
+ ret &= __list_del_entry_valid_or_report(entry);
+ return ret;
+}
+#else
+static inline bool __list_add_valid(struct list_head *new,
+ struct list_head *prev,
+ struct list_head *next)
+{
+ return true;
+}
+static inline bool __list_del_entry_valid(struct list_head *entry)
+{
+ return true;
+}
+#endif
+
+/*
+ * Insert a new entry between two known consecutive entries.
+ *
+ * This is only for internal list manipulation where we know
+ * the prev/next entries already!
+ */
+static inline void __list_add(struct list_head *new,
+ struct list_head *prev,
+ struct list_head *next)
+{
+ if (!__list_add_valid(new, prev, next))
+ return;
+
+ next->prev = new;
+ new->next = next;
+ new->prev = prev;
+ prev->next = new;
+}
+
+/**
+ * list_add - add a new entry
+ * @new: new entry to be added
+ * @head: list head to add it after
+ *
+ * Insert a new entry after the specified head.
+ * This is good for implementing stacks.
+ */
+static inline void list_add(struct list_head *new, struct list_head *head)
+{
+ __list_add(new, head, head->next);
+}
+
+
+/**
+ * list_add_tail - add a new entry
+ * @new: new entry to be added
+ * @head: list head to add it before
+ *
+ * Insert a new entry before the specified head.
+ * This is useful for implementing queues.
+ */
+static inline void list_add_tail(struct list_head *new, struct list_head *head)
+{
+ __list_add(new, head->prev, head);
+}
+
+/*
+ * Delete a list entry by making the prev/next entries
+ * point to each other.
+ *
+ * This is only for internal list manipulation where we know
+ * the prev/next entries already!
+ */
+static inline void __list_del(struct list_head * prev, struct list_head * next)
+{
+ next->prev = prev;
+ prev->next = next;
+}
+
+/*
+ * Delete a list entry and clear the 'prev' pointer.
+ *
+ * This is a special-purpose list clearing method used in the networking code
+ * for lists allocated as per-cpu, where we don't want to incur the extra
+ * WRITE_ONCE() overhead of a regular list_del_init(). The code that uses this
+ * needs to check the node 'prev' pointer instead of calling list_empty().
+ */
+static inline void __list_del_clearprev(struct list_head *entry)
+{
+ __list_del(entry->prev, entry->next);
+ entry->prev = NULL;
+}
+
+static inline void __list_del_entry(struct list_head *entry)
+{
+ if (!__list_del_entry_valid(entry))
+ return;
+
+ __list_del(entry->prev, entry->next);
+}
+
+/**
+ * list_del - deletes entry from list.
+ * @entry: the element to delete from the list.
+ * Note: list_empty() on entry does not return true after this, the entry is
+ * in an undefined state.
+ */
+static inline void list_del(struct list_head *entry)
+{
+ __list_del_entry(entry);
+ entry->next = NULL;
+ entry->prev = NULL;
+}
+
+/**
+ * list_replace - replace old entry by new one
+ * @old : the element to be replaced
+ * @new : the new element to insert
+ *
+ * If @old was empty, it will be overwritten.
+ */
+static inline void list_replace(struct list_head *old,
+ struct list_head *new)
+{
+ new->next = old->next;
+ new->next->prev = new;
+ new->prev = old->prev;
+ new->prev->next = new;
+}
+
+/**
+ * list_replace_init - replace old entry by new one and initialize the old one
+ * @old : the element to be replaced
+ * @new : the new element to insert
+ *
+ * If @old was empty, it will be overwritten.
+ */
+static inline void list_replace_init(struct list_head *old,
+ struct list_head *new)
+{
+ list_replace(old, new);
+ INIT_LIST_HEAD(old);
+}
+
+/**
+ * list_swap - replace entry1 with entry2 and re-add entry1 at entry2's position
+ * @entry1: the location to place entry2
+ * @entry2: the location to place entry1
+ */
+static inline void list_swap(struct list_head *entry1,
+ struct list_head *entry2)
+{
+ struct list_head *pos = entry2->prev;
+
+ list_del(entry2);
+ list_replace(entry1, entry2);
+ if (pos == entry1)
+ pos = entry2;
+ list_add(entry1, pos);
+}
+
+/**
+ * list_del_init - deletes entry from list and reinitialize it.
+ * @entry: the element to delete from the list.
+ */
+static inline void list_del_init(struct list_head *entry)
+{
+ __list_del_entry(entry);
+ INIT_LIST_HEAD(entry);
+}
+
+/**
+ * list_move - delete from one list and add as another's head
+ * @list: the entry to move
+ * @head: the head that will precede our entry
+ */
+static inline void list_move(struct list_head *list, struct list_head *head)
+{
+ __list_del_entry(list);
+ list_add(list, head);
+}
+
+/**
+ * list_move_tail - delete from one list and add as another's tail
+ * @list: the entry to move
+ * @head: the head that will follow our entry
+ */
+static inline void list_move_tail(struct list_head *list,
+ struct list_head *head)
+{
+ __list_del_entry(list);
+ list_add_tail(list, head);
+}
+
+/**
+ * list_bulk_move_tail - move a subsection of a list to its tail
+ * @head: the head that will follow our entry
+ * @first: first entry to move
+ * @last: last entry to move, can be the same as first
+ *
+ * Move all entries between @first and including @last before @head.
+ * All three entries must belong to the same linked list.
+ */
+static inline void list_bulk_move_tail(struct list_head *head,
+ struct list_head *first,
+ struct list_head *last)
+{
+ first->prev->next = last->next;
+ last->next->prev = first->prev;
+
+ head->prev->next = first;
+ first->prev = head->prev;
+
+ last->next = head;
+ head->prev = last;
+}
+
+/**
+ * list_is_first -- tests whether @list is the first entry in list @head
+ * @list: the entry to test
+ * @head: the head of the list
+ */
+static inline int list_is_first(const struct list_head *list, const struct list_head *head)
+{
+ return list->prev == head;
+}
+
+/**
+ * list_is_last - tests whether @list is the last entry in list @head
+ * @list: the entry to test
+ * @head: the head of the list
+ */
+static inline int list_is_last(const struct list_head *list, const struct list_head *head)
+{
+ return list->next == head;
+}
+
+/**
+ * list_is_head - tests whether @list is the list @head
+ * @list: the entry to test
+ * @head: the head of the list
+ */
+static inline int list_is_head(const struct list_head *list, const struct list_head *head)
+{
+ return list == head;
+}
+
+/**
+ * list_empty - tests whether a list is empty
+ * @head: the list to test.
+ */
+static inline int list_empty(const struct list_head *head)
+{
+ return head->next == head;
+}
+
+/**
+ * list_rotate_left - rotate the list to the left
+ * @head: the head of the list
+ */
+static inline void list_rotate_left(struct list_head *head)
+{
+ struct list_head *first;
+
+ if (!list_empty(head)) {
+ first = head->next;
+ list_move_tail(first, head);
+ }
+}
+
+/**
+ * list_rotate_to_front() - Rotate list to specific item.
+ * @list: The desired new front of the list.
+ * @head: The head of the list.
+ *
+ * Rotates list so that @list becomes the new front of the list.
+ */
+static inline void list_rotate_to_front(struct list_head *list,
+ struct list_head *head)
+{
+ /*
+ * Deletes the list head from the list denoted by @head and
+ * places it as the tail of @list, this effectively rotates the
+ * list so that @list is at the front.
+ */
+ list_move_tail(head, list);
+}
+
+/**
+ * list_is_singular - tests whether a list has just one entry.
+ * @head: the list to test.
+ */
+static inline int list_is_singular(const struct list_head *head)
+{
+ return !list_empty(head) && (head->next == head->prev);
+}
+
+static inline void __list_cut_position(struct list_head *list,
+ struct list_head *head, struct list_head *entry)
+{
+ struct list_head *new_first = entry->next;
+ list->next = head->next;
+ list->next->prev = list;
+ list->prev = entry;
+ entry->next = list;
+ head->next = new_first;
+ new_first->prev = head;
+}
+
+/**
+ * list_cut_position - cut a list into two
+ * @list: a new list to add all removed entries
+ * @head: a list with entries
+ * @entry: an entry within head, could be the head itself
+ * and if so we won't cut the list
+ *
+ * This helper moves the initial part of @head, up to and
+ * including @entry, from @head to @list. You should
+ * pass on @entry an element you know is on @head. @list
+ * should be an empty list or a list you do not care about
+ * losing its data.
+ *
+ */
+static inline void list_cut_position(struct list_head *list,
+ struct list_head *head, struct list_head *entry)
+{
+ if (list_empty(head))
+ return;
+ if (list_is_singular(head) && !list_is_head(entry, head) && (entry != head->next))
+ return;
+ if (list_is_head(entry, head))
+ INIT_LIST_HEAD(list);
+ else
+ __list_cut_position(list, head, entry);
+}
+
+/**
+ * list_cut_before - cut a list into two, before given entry
+ * @list: a new list to add all removed entries
+ * @head: a list with entries
+ * @entry: an entry within head, could be the head itself
+ *
+ * This helper moves the initial part of @head, up to but
+ * excluding @entry, from @head to @list. You should pass
+ * in @entry an element you know is on @head. @list should
+ * be an empty list or a list you do not care about losing
+ * its data.
+ * If @entry == @head, all entries on @head are moved to
+ * @list.
+ */
+static inline void list_cut_before(struct list_head *list,
+ struct list_head *head,
+ struct list_head *entry)
+{
+ if (head->next == entry) {
+ INIT_LIST_HEAD(list);
+ return;
+ }
+ list->next = head->next;
+ list->next->prev = list;
+ list->prev = entry->prev;
+ list->prev->next = list;
+ head->next = entry;
+ entry->prev = head;
+}
+
+static inline void __list_splice(const struct list_head *list,
+ struct list_head *prev,
+ struct list_head *next)
+{
+ struct list_head *first = list->next;
+ struct list_head *last = list->prev;
+
+ first->prev = prev;
+ prev->next = first;
+
+ last->next = next;
+ next->prev = last;
+}
+
+/**
+ * list_splice - join two lists, this is designed for stacks
+ * @list: the new list to add.
+ * @head: the place to add it in the first list.
+ */
+static inline void list_splice(const struct list_head *list,
+ struct list_head *head)
+{
+ if (!list_empty(list))
+ __list_splice(list, head, head->next);
+}
+
+/**
+ * list_splice_tail - join two lists, each list being a queue
+ * @list: the new list to add.
+ * @head: the place to add it in the first list.
+ */
+static inline void list_splice_tail(struct list_head *list,
+ struct list_head *head)
+{
+ if (!list_empty(list))
+ __list_splice(list, head->prev, head);
+}
+
+/**
+ * list_splice_init - join two lists and reinitialise the emptied list.
+ * @list: the new list to add.
+ * @head: the place to add it in the first list.
+ *
+ * The list at @list is reinitialised
+ */
+static inline void list_splice_init(struct list_head *list,
+ struct list_head *head)
+{
+ if (!list_empty(list)) {
+ __list_splice(list, head, head->next);
+ INIT_LIST_HEAD(list);
+ }
+}
+
+/**
+ * list_splice_tail_init - join two lists and reinitialise the emptied list
+ * @list: the new list to add.
+ * @head: the place to add it in the first list.
+ *
+ * Each of the lists is a queue.
+ * The list at @list is reinitialised
+ */
+static inline void list_splice_tail_init(struct list_head *list,
+ struct list_head *head)
+{
+ if (!list_empty(list)) {
+ __list_splice(list, head->prev, head);
+ INIT_LIST_HEAD(list);
+ }
+}
+
+/**
+ * list_entry - get the struct for this entry
+ * @ptr: the &struct list_head pointer.
+ * @type: the type of the struct this is embedded in.
+ * @member: the name of the list_head within the struct.
+ */
+#define list_entry(ptr, type, member) \
+ container_of(ptr, type, member)
+
+/**
+ * list_first_entry - get the first element from a list
+ * @ptr: the list head to take the element from.
+ * @type: the type of the struct this is embedded in.
+ * @member: the name of the list_head within the struct.
+ *
+ * Note, that list is expected to be not empty.
+ */
+#define list_first_entry(ptr, type, member) \
+ list_entry((ptr)->next, type, member)
+
+/**
+ * list_last_entry - get the last element from a list
+ * @ptr: the list head to take the element from.
+ * @type: the type of the struct this is embedded in.
+ * @member: the name of the list_head within the struct.
+ *
+ * Note, that list is expected to be not empty.
+ */
+#define list_last_entry(ptr, type, member) \
+ list_entry((ptr)->prev, type, member)
+
+/**
+ * list_first_entry_or_null - get the first element from a list
+ * @ptr: the list head to take the element from.
+ * @type: the type of the struct this is embedded in.
+ * @member: the name of the list_head within the struct.
+ *
+ * Note that if the list is empty, it returns NULL.
+ */
+#define list_first_entry_or_null(ptr, type, member) ({ \
+ struct list_head *head__ = (ptr); \
+ struct list_head *pos__ = head__->next; \
+ pos__ != head__ ? list_entry(pos__, type, member) : NULL; \
+})
+
+/**
+ * list_next_entry - get the next element in list
+ * @pos: the type * to cursor
+ * @member: the name of the list_head within the struct.
+ */
+#define list_next_entry(pos, member) \
+ list_entry((pos)->member.next, typeof(*(pos)), member)
+
+/**
+ * list_next_entry_circular - get the next element in list
+ * @pos: the type * to cursor.
+ * @head: the list head to take the element from.
+ * @member: the name of the list_head within the struct.
+ *
+ * Wraparound if pos is the last element (return the first element).
+ * Note, that list is expected to be not empty.
+ */
+#define list_next_entry_circular(pos, head, member) \
+ (list_is_last(&(pos)->member, head) ? \
+ list_first_entry(head, typeof(*(pos)), member) : list_next_entry(pos, member))
+
+/**
+ * list_prev_entry - get the prev element in list
+ * @pos: the type * to cursor
+ * @member: the name of the list_head within the struct.
+ */
+#define list_prev_entry(pos, member) \
+ list_entry((pos)->member.prev, typeof(*(pos)), member)
+
+/**
+ * list_prev_entry_circular - get the prev element in list
+ * @pos: the type * to cursor.
+ * @head: the list head to take the element from.
+ * @member: the name of the list_head within the struct.
+ *
+ * Wraparound if pos is the first element (return the last element).
+ * Note, that list is expected to be not empty.
+ */
+#define list_prev_entry_circular(pos, head, member) \
+ (list_is_first(&(pos)->member, head) ? \
+ list_last_entry(head, typeof(*(pos)), member) : list_prev_entry(pos, member))
+
+/**
+ * list_for_each - iterate over a list
+ * @pos: the &struct list_head to use as a loop cursor.
+ * @head: the head for your list.
+ */
+#define list_for_each(pos, head) \
+ for (pos = (head)->next; !list_is_head(pos, (head)); pos = pos->next)
+
+/**
+ * list_for_each_rcu - Iterate over a list in an RCU-safe fashion
+ * @pos: the &struct list_head to use as a loop cursor.
+ * @head: the head for your list.
+ */
+#define list_for_each_rcu(pos, head) \
+ for (pos = rcu_dereference((head)->next); \
+ !list_is_head(pos, (head)); \
+ pos = rcu_dereference(pos->next))
+
+/**
+ * list_for_each_continue - continue iteration over a list
+ * @pos: the &struct list_head to use as a loop cursor.
+ * @head: the head for your list.
+ *
+ * Continue to iterate over a list, continuing after the current position.
+ */
+#define list_for_each_continue(pos, head) \
+ for (pos = pos->next; !list_is_head(pos, (head)); pos = pos->next)
+
+/**
+ * list_for_each_prev - iterate over a list backwards
+ * @pos: the &struct list_head to use as a loop cursor.
+ * @head: the head for your list.
+ */
+#define list_for_each_prev(pos, head) \
+ for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
+
+/**
+ * list_for_each_safe - iterate over a list safe against removal of list entry
+ * @pos: the &struct list_head to use as a loop cursor.
+ * @n: another &struct list_head to use as temporary storage
+ * @head: the head for your list.
+ */
+#define list_for_each_safe(pos, n, head) \
+ for (pos = (head)->next, n = pos->next; \
+ !list_is_head(pos, (head)); \
+ pos = n, n = pos->next)
+
+/**
+ * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
+ * @pos: the &struct list_head to use as a loop cursor.
+ * @n: another &struct list_head to use as temporary storage
+ * @head: the head for your list.
+ */
+#define list_for_each_prev_safe(pos, n, head) \
+ for (pos = (head)->prev, n = pos->prev; \
+ !list_is_head(pos, (head)); \
+ pos = n, n = pos->prev)
+
+/**
+ * list_count_nodes - count nodes in the list
+ * @head: the head for your list.
+ */
+static inline size_t list_count_nodes(struct list_head *head)
+{
+ struct list_head *pos;
+ size_t count = 0;
+
+ list_for_each(pos, head)
+ count++;
+
+ return count;
+}
+
+/**
+ * list_entry_is_head - test if the entry points to the head of the list
+ * @pos: the type * to cursor
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ */
+#define list_entry_is_head(pos, head, member) \
+ list_is_head(&pos->member, (head))
+
+/**
+ * list_for_each_entry - iterate over list of given type
+ * @pos: the type * to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ */
+#define list_for_each_entry(pos, head, member) \
+ for (pos = list_first_entry(head, typeof(*pos), member); \
+ !list_entry_is_head(pos, head, member); \
+ pos = list_next_entry(pos, member))
+
+/**
+ * list_for_each_entry_reverse - iterate backwards over list of given type.
+ * @pos: the type * to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ */
+#define list_for_each_entry_reverse(pos, head, member) \
+ for (pos = list_last_entry(head, typeof(*pos), member); \
+ !list_entry_is_head(pos, head, member); \
+ pos = list_prev_entry(pos, member))
+
+/**
+ * list_prepare_entry - prepare a pos entry for use in list_for_each_entry_continue()
+ * @pos: the type * to use as a start point
+ * @head: the head of the list
+ * @member: the name of the list_head within the struct.
+ *
+ * Prepares a pos entry for use as a start point in list_for_each_entry_continue().
+ */
+#define list_prepare_entry(pos, head, member) \
+ ((pos) ? : list_entry(head, typeof(*pos), member))
+
+/**
+ * list_for_each_entry_continue - continue iteration over list of given type
+ * @pos: the type * to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ *
+ * Continue to iterate over list of given type, continuing after
+ * the current position.
+ */
+#define list_for_each_entry_continue(pos, head, member) \
+ for (pos = list_next_entry(pos, member); \
+ !list_entry_is_head(pos, head, member); \
+ pos = list_next_entry(pos, member))
+
+/**
+ * list_for_each_entry_continue_reverse - iterate backwards from the given point
+ * @pos: the type * to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ *
+ * Start to iterate over list of given type backwards, continuing after
+ * the current position.
+ */
+#define list_for_each_entry_continue_reverse(pos, head, member) \
+ for (pos = list_prev_entry(pos, member); \
+ !list_entry_is_head(pos, head, member); \
+ pos = list_prev_entry(pos, member))
+
+/**
+ * list_for_each_entry_from - iterate over list of given type from the current point
+ * @pos: the type * to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ *
+ * Iterate over list of given type, continuing from current position.
+ */
+#define list_for_each_entry_from(pos, head, member) \
+ for (; !list_entry_is_head(pos, head, member); \
+ pos = list_next_entry(pos, member))
+
+/**
+ * list_for_each_entry_from_reverse - iterate backwards over list of given type
+ * from the current point
+ * @pos: the type * to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ *
+ * Iterate backwards over list of given type, continuing from current position.
+ */
+#define list_for_each_entry_from_reverse(pos, head, member) \
+ for (; !list_entry_is_head(pos, head, member); \
+ pos = list_prev_entry(pos, member))
+
+/**
+ * list_for_each_entry_safe - iterate over list of given type safe against removal of list entry
+ * @pos: the type * to use as a loop cursor.
+ * @n: another type * to use as temporary storage
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ */
+#define list_for_each_entry_safe(pos, n, head, member) \
+ for (pos = list_first_entry(head, typeof(*pos), member), \
+ n = list_next_entry(pos, member); \
+ !list_entry_is_head(pos, head, member); \
+ pos = n, n = list_next_entry(n, member))
+
+/**
+ * list_for_each_entry_safe_continue - continue list iteration safe against removal
+ * @pos: the type * to use as a loop cursor.
+ * @n: another type * to use as temporary storage
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ *
+ * Iterate over list of given type, continuing after current point,
+ * safe against removal of list entry.
+ */
+#define list_for_each_entry_safe_continue(pos, n, head, member) \
+ for (pos = list_next_entry(pos, member), \
+ n = list_next_entry(pos, member); \
+ !list_entry_is_head(pos, head, member); \
+ pos = n, n = list_next_entry(n, member))
+
+/**
+ * list_for_each_entry_safe_from - iterate over list from current point safe against removal
+ * @pos: the type * to use as a loop cursor.
+ * @n: another type * to use as temporary storage
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ *
+ * Iterate over list of given type from current point, safe against
+ * removal of list entry.
+ */
+#define list_for_each_entry_safe_from(pos, n, head, member) \
+ for (n = list_next_entry(pos, member); \
+ !list_entry_is_head(pos, head, member); \
+ pos = n, n = list_next_entry(n, member))
+
+/**
+ * list_for_each_entry_safe_reverse - iterate backwards over list safe against removal
+ * @pos: the type * to use as a loop cursor.
+ * @n: another type * to use as temporary storage
+ * @head: the head for your list.
+ * @member: the name of the list_head within the struct.
+ *
+ * Iterate backwards over list of given type, safe against removal
+ * of list entry.
+ */
+#define list_for_each_entry_safe_reverse(pos, n, head, member) \
+ for (pos = list_last_entry(head, typeof(*pos), member), \
+ n = list_prev_entry(pos, member); \
+ !list_entry_is_head(pos, head, member); \
+ pos = n, n = list_prev_entry(n, member))
+
+/**
+ * list_safe_reset_next - reset a stale list_for_each_entry_safe loop
+ * @pos: the loop cursor used in the list_for_each_entry_safe loop
+ * @n: temporary storage used in list_for_each_entry_safe
+ * @member: the name of the list_head within the struct.
+ *
+ * list_safe_reset_next is not safe to use in general if the list may be
+ * modified concurrently (e.g. the lock is dropped in the loop body). An
+ * exception to this is if the cursor element (pos) is pinned in the list,
+ * and list_safe_reset_next is called after re-taking the lock and before
+ * completing the current iteration of the loop body.
+ */
+#define list_safe_reset_next(pos, n, member) \
+ n = list_next_entry(pos, member)
+
+#endif
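
[Editorial note, not part of the patch.] The _safe iterators above cache the
next entry in @n before the loop body runs, so the body may unlink and free
@pos. A minimal standalone sketch of that pattern follows. It assumes
simplified stand-ins for the list.h definitions (in the tree they would come
from lib/support/list.h), and the struct job type and demo_remove_even()
helper are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-ins for the list.h definitions (assumption: the real
 * header provides equivalents with the same names). */
struct list_head { struct list_head *next, *prev; };

#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_first_entry(head, type, member) \
	list_entry((head)->next, type, member)
#define list_next_entry(pos, member) \
	list_entry((pos)->member.next, __typeof__(*(pos)), member)
#define list_entry_is_head(pos, head, member) \
	(&(pos)->member == (head))
#define list_for_each_entry_safe(pos, n, head, member) \
	for (pos = list_first_entry(head, __typeof__(*pos), member), \
	     n = list_next_entry(pos, member); \
	     !list_entry_is_head(pos, head, member); \
	     pos = n, n = list_next_entry(n, member))

static void list_add_tail(struct list_head *item, struct list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

static void list_del(struct list_head *item)
{
	item->prev->next = item->next;
	item->next->prev = item->prev;
}

/* Hypothetical payload type, for illustration only. */
struct job {
	int id;
	struct list_head link;
};

/* Queue jobs 0..5, free the even-numbered ones while walking the list,
 * then drain the list and return how many jobs had survived. */
static int demo_remove_even(void)
{
	struct list_head queue = { &queue, &queue };	/* empty list */
	struct job *pos, *n;
	int i, remaining = 0;

	for (i = 0; i < 6; i++) {
		struct job *j = malloc(sizeof(*j));

		if (!j)
			return -1;
		j->id = i;
		list_add_tail(&j->link, &queue);
	}

	/* Freeing pos is fine here because n already points past it. */
	list_for_each_entry_safe(pos, n, &queue, link) {
		if (pos->id % 2 == 0) {
			list_del(&pos->link);
			free(pos);
		}
	}

	/* Count and release whatever is left. */
	list_for_each_entry_safe(pos, n, &queue, link) {
		remaining++;
		list_del(&pos->link);
		free(pos);
	}
	return remaining;
}
```

The same loop written with plain list_for_each_entry() would read
pos->link.next after pos had been freed, which is the case the _safe
variants exist to avoid.
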
diff --git a/debugfs/Makefile.in b/debugfs/Makefile.in
index 689bf0c4a3c13d..700ae87418c268 100644
--- a/debugfs/Makefile.in
+++ b/debugfs/Makefile.in
@@ -195,7 +195,7 @@ debugfs.o: $(srcdir)/debugfs.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h $(top_srcdir)/version.h \
$(srcdir)/../e2fsck/jfs_user.h $(top_srcdir)/lib/ext2fs/kernel-jbd.h \
- $(top_srcdir)/lib/ext2fs/jfs_compat.h $(top_srcdir)/lib/ext2fs/kernel-list.h \
+ $(top_srcdir)/lib/ext2fs/jfs_compat.h $(top_srcdir)/lib/support/list.h \
$(top_srcdir)/lib/ext2fs/compiler.h $(top_srcdir)/lib/support/plausible.h
util.o: $(srcdir)/util.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(top_srcdir)/lib/ss/ss.h \
@@ -287,7 +287,7 @@ logdump.o: $(srcdir)/logdump.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h $(srcdir)/../e2fsck/jfs_user.h \
$(top_srcdir)/lib/ext2fs/kernel-jbd.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h
htree.o: $(srcdir)/htree.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/debugfs.h $(top_srcdir)/lib/ss/ss.h \
@@ -408,7 +408,7 @@ journal.o: $(srcdir)/journal.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/ext2fs/ext2_io.h $(top_builddir)/lib/ext2fs/ext2_err.h \
$(top_srcdir)/lib/ext2fs/ext2_ext_attr.h $(top_srcdir)/lib/ext2fs/hashmap.h \
$(top_srcdir)/lib/ext2fs/bitops.h $(top_srcdir)/lib/ext2fs/kernel-jbd.h \
- $(top_srcdir)/lib/ext2fs/jfs_compat.h $(top_srcdir)/lib/ext2fs/kernel-list.h \
+ $(top_srcdir)/lib/ext2fs/jfs_compat.h $(top_srcdir)/lib/support/list.h \
$(top_srcdir)/lib/ext2fs/compiler.h
revoke.o: $(srcdir)/../e2fsck/revoke.c $(srcdir)/../e2fsck/jfs_user.h \
$(top_builddir)/lib/config.h $(top_builddir)/lib/dirpaths.h \
@@ -418,7 +418,7 @@ revoke.o: $(srcdir)/../e2fsck/revoke.c $(srcdir)/../e2fsck/jfs_user.h \
$(top_builddir)/lib/ext2fs/ext2_err.h \
$(top_srcdir)/lib/ext2fs/ext2_ext_attr.h $(top_srcdir)/lib/ext2fs/hashmap.h \
$(top_srcdir)/lib/ext2fs/bitops.h $(top_srcdir)/lib/ext2fs/kernel-jbd.h \
- $(top_srcdir)/lib/ext2fs/jfs_compat.h $(top_srcdir)/lib/ext2fs/kernel-list.h \
+ $(top_srcdir)/lib/ext2fs/jfs_compat.h $(top_srcdir)/lib/support/list.h \
$(top_srcdir)/lib/ext2fs/compiler.h
recovery.o: $(srcdir)/../e2fsck/recovery.c $(srcdir)/../e2fsck/jfs_user.h \
$(top_builddir)/lib/config.h $(top_builddir)/lib/dirpaths.h \
@@ -428,7 +428,7 @@ recovery.o: $(srcdir)/../e2fsck/recovery.c $(srcdir)/../e2fsck/jfs_user.h \
$(top_builddir)/lib/ext2fs/ext2_err.h \
$(top_srcdir)/lib/ext2fs/ext2_ext_attr.h $(top_srcdir)/lib/ext2fs/hashmap.h \
$(top_srcdir)/lib/ext2fs/bitops.h $(top_srcdir)/lib/ext2fs/kernel-jbd.h \
- $(top_srcdir)/lib/ext2fs/jfs_compat.h $(top_srcdir)/lib/ext2fs/kernel-list.h \
+ $(top_srcdir)/lib/ext2fs/jfs_compat.h $(top_srcdir)/lib/support/list.h \
$(top_srcdir)/lib/ext2fs/compiler.h
do_journal.o: $(srcdir)/do_journal.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/debugfs.h $(top_srcdir)/lib/ss/ss.h \
@@ -442,7 +442,7 @@ do_journal.o: $(srcdir)/do_journal.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/kernel-jbd.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(srcdir)/journal.h $(srcdir)/../e2fsck/jfs_user.h
do_orphan.o: $(srcdir)/do_orphan.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/debugfs.h $(top_srcdir)/lib/ss/ss.h \
diff --git a/e2fsck/Makefile.in b/e2fsck/Makefile.in
index fbb7b156d5c759..52fad9cbfd2b23 100644
--- a/e2fsck/Makefile.in
+++ b/e2fsck/Makefile.in
@@ -282,7 +282,7 @@ e2fsck.o: $(srcdir)/e2fsck.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(srcdir)/problem.h
super.o: $(srcdir)/super.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/e2fsck.h \
@@ -296,7 +296,7 @@ super.o: $(srcdir)/super.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(srcdir)/problem.h
pass1.o: $(srcdir)/pass1.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/e2fsck.h \
@@ -310,7 +310,7 @@ pass1.o: $(srcdir)/pass1.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(top_srcdir)/lib/e2p/e2p.h $(srcdir)/problem.h
pass1b.o: $(srcdir)/pass1b.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(top_srcdir)/lib/et/com_err.h \
@@ -324,7 +324,7 @@ pass1b.o: $(srcdir)/pass1b.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(srcdir)/problem.h $(top_srcdir)/lib/support/dict.h
pass2.o: $(srcdir)/pass2.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/e2fsck.h \
@@ -338,7 +338,7 @@ pass2.o: $(srcdir)/pass2.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(srcdir)/problem.h $(top_srcdir)/lib/support/dict.h
pass3.o: $(srcdir)/pass3.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/e2fsck.h \
@@ -352,7 +352,7 @@ pass3.o: $(srcdir)/pass3.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(srcdir)/problem.h
pass4.o: $(srcdir)/pass4.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/e2fsck.h \
@@ -366,7 +366,7 @@ pass4.o: $(srcdir)/pass4.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(srcdir)/problem.h
pass5.o: $(srcdir)/pass5.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/e2fsck.h \
@@ -380,7 +380,7 @@ pass5.o: $(srcdir)/pass5.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(srcdir)/problem.h
journal.o: $(srcdir)/journal.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/jfs_user.h $(srcdir)/e2fsck.h \
@@ -394,7 +394,7 @@ journal.o: $(srcdir)/journal.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(top_srcdir)/lib/ext2fs/kernel-jbd.h $(srcdir)/problem.h
recovery.o: $(srcdir)/recovery.c $(srcdir)/jfs_user.h \
$(top_builddir)/lib/config.h $(top_builddir)/lib/dirpaths.h \
@@ -408,7 +408,7 @@ recovery.o: $(srcdir)/recovery.c $(srcdir)/jfs_user.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(top_srcdir)/lib/ext2fs/kernel-jbd.h
revoke.o: $(srcdir)/revoke.c $(srcdir)/jfs_user.h \
$(top_builddir)/lib/config.h $(top_builddir)/lib/dirpaths.h \
@@ -422,7 +422,7 @@ revoke.o: $(srcdir)/revoke.c $(srcdir)/jfs_user.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(top_srcdir)/lib/ext2fs/kernel-jbd.h
badblocks.o: $(srcdir)/badblocks.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(top_srcdir)/lib/et/com_err.h \
@@ -436,7 +436,7 @@ badblocks.o: $(srcdir)/badblocks.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h
util.o: $(srcdir)/util.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/e2fsck.h \
$(top_srcdir)/lib/ext2fs/ext2_fs.h $(top_builddir)/lib/ext2fs/ext2_types.h \
@@ -449,7 +449,7 @@ util.o: $(srcdir)/util.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h
unix.o: $(srcdir)/unix.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(top_srcdir)/lib/e2p/e2p.h \
$(top_srcdir)/lib/ext2fs/ext2_fs.h $(top_builddir)/lib/ext2fs/ext2_types.h \
@@ -463,7 +463,7 @@ unix.o: $(srcdir)/unix.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(srcdir)/problem.h $(srcdir)/jfs_user.h \
$(top_srcdir)/lib/ext2fs/kernel-jbd.h $(top_srcdir)/version.h
dirinfo.o: $(srcdir)/dirinfo.c $(top_builddir)/lib/config.h \
@@ -478,7 +478,7 @@ dirinfo.o: $(srcdir)/dirinfo.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(top_srcdir)/lib/ext2fs/tdb.h
dx_dirinfo.o: $(srcdir)/dx_dirinfo.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/e2fsck.h \
@@ -492,7 +492,7 @@ dx_dirinfo.o: $(srcdir)/dx_dirinfo.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h
ehandler.o: $(srcdir)/ehandler.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/e2fsck.h \
$(top_srcdir)/lib/ext2fs/ext2_fs.h $(top_builddir)/lib/ext2fs/ext2_types.h \
@@ -505,7 +505,7 @@ ehandler.o: $(srcdir)/ehandler.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h
problem.o: $(srcdir)/problem.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/e2fsck.h \
$(top_srcdir)/lib/ext2fs/ext2_fs.h $(top_builddir)/lib/ext2fs/ext2_types.h \
@@ -518,7 +518,7 @@ problem.o: $(srcdir)/problem.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(srcdir)/problem.h $(srcdir)/problemP.h
message.o: $(srcdir)/message.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(top_srcdir)/lib/support/quotaio.h \
@@ -531,7 +531,7 @@ message.o: $(srcdir)/message.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/quotaio_tree.h $(srcdir)/e2fsck.h \
$(top_srcdir)/lib/support/profile.h $(top_builddir)/lib/support/prof_err.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(srcdir)/problem.h
ea_refcount.o: $(srcdir)/ea_refcount.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/e2fsck.h \
@@ -545,7 +545,7 @@ ea_refcount.o: $(srcdir)/ea_refcount.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h
rehash.o: $(srcdir)/rehash.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/e2fsck.h \
$(top_srcdir)/lib/ext2fs/ext2_fs.h $(top_builddir)/lib/ext2fs/ext2_types.h \
@@ -558,7 +558,7 @@ rehash.o: $(srcdir)/rehash.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(srcdir)/problem.h $(top_srcdir)/lib/support/sort_r.h
readahead.o: $(srcdir)/readahead.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/e2fsck.h \
@@ -572,7 +572,7 @@ readahead.o: $(srcdir)/readahead.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h
region.o: $(srcdir)/region.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/e2fsck.h \
$(top_srcdir)/lib/ext2fs/ext2_fs.h $(top_builddir)/lib/ext2fs/ext2_types.h \
@@ -585,7 +585,7 @@ region.o: $(srcdir)/region.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h
sigcatcher.o: $(srcdir)/sigcatcher.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/e2fsck.h \
$(top_srcdir)/lib/ext2fs/ext2_fs.h $(top_builddir)/lib/ext2fs/ext2_types.h \
@@ -598,7 +598,7 @@ sigcatcher.o: $(srcdir)/sigcatcher.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h
logfile.o: $(srcdir)/logfile.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/e2fsck.h \
$(top_srcdir)/lib/ext2fs/ext2_fs.h $(top_builddir)/lib/ext2fs/ext2_types.h \
@@ -611,7 +611,7 @@ logfile.o: $(srcdir)/logfile.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h
quota.o: $(srcdir)/quota.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/e2fsck.h \
$(top_srcdir)/lib/ext2fs/ext2_fs.h $(top_builddir)/lib/ext2fs/ext2_types.h \
@@ -624,7 +624,7 @@ quota.o: $(srcdir)/quota.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(srcdir)/problem.h
extents.o: $(srcdir)/extents.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/e2fsck.h \
@@ -638,7 +638,7 @@ extents.o: $(srcdir)/extents.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(srcdir)/problem.h
encrypted_files.o: $(srcdir)/encrypted_files.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/e2fsck.h \
@@ -652,5 +652,5 @@ encrypted_files.o: $(srcdir)/encrypted_files.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(srcdir)/problem.h $(top_srcdir)/lib/ext2fs/rbtree.h
diff --git a/lib/e2p/Makefile.in b/lib/e2p/Makefile.in
index 92d9c018fe46c8..f642f5ec367c93 100644
--- a/lib/e2p/Makefile.in
+++ b/lib/e2p/Makefile.in
@@ -130,7 +130,7 @@ feature.o: $(srcdir)/feature.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/ext2fs/ext2_err.h \
$(top_srcdir)/lib/ext2fs/ext2_ext_attr.h $(top_srcdir)/lib/ext2fs/hashmap.h \
$(top_srcdir)/lib/ext2fs/bitops.h $(top_srcdir)/lib/ext2fs/kernel-jbd.h \
- $(top_srcdir)/lib/ext2fs/jfs_compat.h $(top_srcdir)/lib/ext2fs/kernel-list.h \
+ $(top_srcdir)/lib/ext2fs/jfs_compat.h $(top_srcdir)/lib/support/list.h \
$(top_srcdir)/lib/ext2fs/compiler.h
fgetflags.o: $(srcdir)/fgetflags.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/e2p.h \
@@ -173,7 +173,7 @@ ljs.o: $(srcdir)/ljs.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/ext2fs/ext2_ext_attr.h $(top_srcdir)/lib/ext2fs/hashmap.h \
$(top_srcdir)/lib/ext2fs/bitops.h $(srcdir)/e2p.h \
$(top_srcdir)/lib/ext2fs/kernel-jbd.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h
mntopts.o: $(srcdir)/mntopts.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/e2p.h \
$(top_srcdir)/lib/ext2fs/ext2_fs.h $(top_builddir)/lib/ext2fs/ext2_types.h
diff --git a/lib/ext2fs/Makefile.in b/lib/ext2fs/Makefile.in
index e9a6ced244ea26..1d0991defff804 100644
--- a/lib/ext2fs/Makefile.in
+++ b/lib/ext2fs/Makefile.in
@@ -1032,7 +1032,7 @@ mkjournal.o: $(srcdir)/mkjournal.c $(top_builddir)/lib/config.h \
$(srcdir)/ext3_extents.h $(top_srcdir)/lib/et/com_err.h $(srcdir)/ext2_io.h \
$(top_builddir)/lib/ext2fs/ext2_err.h $(srcdir)/ext2_ext_attr.h \
$(srcdir)/hashmap.h $(srcdir)/bitops.h $(srcdir)/kernel-jbd.h \
- $(srcdir)/jfs_compat.h $(srcdir)/kernel-list.h $(srcdir)/compiler.h
+ $(srcdir)/jfs_compat.h $(srcdir)/../support/list.h $(srcdir)/compiler.h
mmp.o: $(srcdir)/mmp.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/ext2_fs.h \
$(top_builddir)/lib/ext2fs/ext2_types.h $(srcdir)/ext2fs.h \
@@ -1263,7 +1263,7 @@ debugfs.o: $(top_srcdir)/debugfs/debugfs.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/quotaio.h $(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h $(top_srcdir)/debugfs/../version.h \
$(srcdir)/../../e2fsck/jfs_user.h $(srcdir)/kernel-jbd.h \
- $(srcdir)/jfs_compat.h $(srcdir)/kernel-list.h $(srcdir)/compiler.h \
+ $(srcdir)/jfs_compat.h $(srcdir)/../support/list.h $(srcdir)/compiler.h \
$(top_srcdir)/lib/support/plausible.h
util.o: $(top_srcdir)/debugfs/util.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(top_srcdir)/lib/ss/ss.h \
@@ -1353,7 +1353,7 @@ logdump.o: $(top_srcdir)/debugfs/logdump.c $(top_builddir)/lib/config.h \
$(top_srcdir)/debugfs/../misc/create_inode.h $(top_srcdir)/lib/e2p/e2p.h \
$(top_srcdir)/lib/support/quotaio.h $(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h $(srcdir)/../../e2fsck/jfs_user.h \
- $(srcdir)/kernel-jbd.h $(srcdir)/jfs_compat.h $(srcdir)/kernel-list.h \
+ $(srcdir)/kernel-jbd.h $(srcdir)/jfs_compat.h $(srcdir)/../support/list.h \
$(srcdir)/compiler.h $(srcdir)/fast_commit.h
htree.o: $(top_srcdir)/debugfs/htree.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(top_srcdir)/debugfs/debugfs.h \
@@ -1469,14 +1469,14 @@ journal.o: $(top_srcdir)/debugfs/journal.c $(top_builddir)/lib/config.h \
$(srcdir)/ext3_extents.h $(top_srcdir)/lib/et/com_err.h $(srcdir)/ext2_io.h \
$(top_builddir)/lib/ext2fs/ext2_err.h $(srcdir)/ext2_ext_attr.h \
$(srcdir)/hashmap.h $(srcdir)/bitops.h $(srcdir)/kernel-jbd.h \
- $(srcdir)/jfs_compat.h $(srcdir)/kernel-list.h $(srcdir)/compiler.h
+ $(srcdir)/jfs_compat.h $(srcdir)/../support/list.h $(srcdir)/compiler.h
revoke.o: $(top_srcdir)/e2fsck/revoke.c $(top_srcdir)/e2fsck/jfs_user.h \
$(top_builddir)/lib/config.h $(top_builddir)/lib/dirpaths.h \
$(srcdir)/ext2_fs.h $(top_builddir)/lib/ext2fs/ext2_types.h \
$(srcdir)/ext2fs.h $(srcdir)/ext3_extents.h $(top_srcdir)/lib/et/com_err.h \
$(srcdir)/ext2_io.h $(top_builddir)/lib/ext2fs/ext2_err.h \
$(srcdir)/ext2_ext_attr.h $(srcdir)/hashmap.h $(srcdir)/bitops.h \
- $(srcdir)/kernel-jbd.h $(srcdir)/jfs_compat.h $(srcdir)/kernel-list.h \
+ $(srcdir)/kernel-jbd.h $(srcdir)/jfs_compat.h $(srcdir)/../support/list.h \
$(srcdir)/compiler.h
recovery.o: $(top_srcdir)/e2fsck/recovery.c $(top_srcdir)/e2fsck/jfs_user.h \
$(top_builddir)/lib/config.h $(top_builddir)/lib/dirpaths.h \
@@ -1484,7 +1484,7 @@ recovery.o: $(top_srcdir)/e2fsck/recovery.c $(top_srcdir)/e2fsck/jfs_user.h \
$(srcdir)/ext2fs.h $(srcdir)/ext3_extents.h $(top_srcdir)/lib/et/com_err.h \
$(srcdir)/ext2_io.h $(top_builddir)/lib/ext2fs/ext2_err.h \
$(srcdir)/ext2_ext_attr.h $(srcdir)/hashmap.h $(srcdir)/bitops.h \
- $(srcdir)/kernel-jbd.h $(srcdir)/jfs_compat.h $(srcdir)/kernel-list.h \
+ $(srcdir)/kernel-jbd.h $(srcdir)/jfs_compat.h $(srcdir)/../support/list.h \
$(srcdir)/compiler.h
do_journal.o: $(top_srcdir)/debugfs/do_journal.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(top_srcdir)/debugfs/debugfs.h \
@@ -1497,7 +1497,7 @@ do_journal.o: $(top_srcdir)/debugfs/do_journal.c $(top_builddir)/lib/config.h \
$(top_srcdir)/debugfs/../misc/create_inode.h $(top_srcdir)/lib/e2p/e2p.h \
$(top_srcdir)/lib/support/quotaio.h $(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h $(srcdir)/kernel-jbd.h \
- $(srcdir)/jfs_compat.h $(srcdir)/kernel-list.h $(srcdir)/compiler.h \
+ $(srcdir)/jfs_compat.h $(srcdir)/../support/list.h $(srcdir)/compiler.h \
$(top_srcdir)/debugfs/journal.h $(srcdir)/../../e2fsck/jfs_user.h
do_orphan.o: $(top_srcdir)/debugfs/do_orphan.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(top_srcdir)/debugfs/debugfs.h \
diff --git a/misc/Makefile.in b/misc/Makefile.in
index 7c6b33cb864204..edf7f356f6d0e8 100644
--- a/misc/Makefile.in
+++ b/misc/Makefile.in
@@ -747,7 +747,7 @@ tune2fs.o: $(srcdir)/tune2fs.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/ext2fs/ext2_io.h $(top_builddir)/lib/ext2fs/ext2_err.h \
$(top_srcdir)/lib/ext2fs/ext2_ext_attr.h $(top_srcdir)/lib/ext2fs/hashmap.h \
$(top_srcdir)/lib/ext2fs/bitops.h $(top_srcdir)/lib/ext2fs/kernel-jbd.h \
- $(top_srcdir)/lib/ext2fs/jfs_compat.h $(top_srcdir)/lib/ext2fs/kernel-list.h \
+ $(top_srcdir)/lib/ext2fs/jfs_compat.h $(top_srcdir)/lib/support/list.h \
$(top_srcdir)/lib/ext2fs/compiler.h $(top_srcdir)/lib/support/plausible.h \
$(top_srcdir)/lib/support/quotaio.h $(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h $(top_srcdir)/lib/support/devname.h \
@@ -800,7 +800,7 @@ dumpe2fs.o: $(srcdir)/dumpe2fs.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/ext2fs/ext2_ext_attr.h $(top_srcdir)/lib/ext2fs/hashmap.h \
$(top_srcdir)/lib/ext2fs/bitops.h $(top_srcdir)/lib/e2p/e2p.h \
$(top_srcdir)/lib/ext2fs/kernel-jbd.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(top_srcdir)/lib/support/devname.h $(top_srcdir)/lib/support/nls-enable.h \
$(top_srcdir)/lib/support/plausible.h $(top_srcdir)/version.h
badblocks.o: $(srcdir)/badblocks.c $(top_builddir)/lib/config.h \
@@ -823,7 +823,7 @@ util.o: $(srcdir)/util.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/ext2fs/ext2_err.h \
$(top_srcdir)/lib/ext2fs/ext2_ext_attr.h $(top_srcdir)/lib/ext2fs/hashmap.h \
$(top_srcdir)/lib/ext2fs/bitops.h $(top_srcdir)/lib/ext2fs/kernel-jbd.h \
- $(top_srcdir)/lib/ext2fs/jfs_compat.h $(top_srcdir)/lib/ext2fs/kernel-list.h \
+ $(top_srcdir)/lib/ext2fs/jfs_compat.h $(top_srcdir)/lib/support/list.h \
$(top_srcdir)/lib/ext2fs/compiler.h $(top_srcdir)/lib/support/nls-enable.h \
$(top_srcdir)/lib/support/devname.h $(srcdir)/util.h
uuidgen.o: $(srcdir)/uuidgen.c $(top_builddir)/lib/config.h \
@@ -927,7 +927,7 @@ journal.o: $(srcdir)/../debugfs/journal.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(top_srcdir)/lib/ext2fs/kernel-jbd.h
revoke.o: $(srcdir)/../e2fsck/revoke.c $(srcdir)/../e2fsck/jfs_user.h \
$(top_builddir)/lib/config.h $(top_builddir)/lib/dirpaths.h \
@@ -941,7 +941,7 @@ revoke.o: $(srcdir)/../e2fsck/revoke.c $(srcdir)/../e2fsck/jfs_user.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(top_srcdir)/lib/ext2fs/kernel-jbd.h
recovery.o: $(srcdir)/../e2fsck/recovery.c $(srcdir)/../e2fsck/jfs_user.h \
$(top_builddir)/lib/config.h $(top_builddir)/lib/dirpaths.h \
@@ -955,5 +955,5 @@ recovery.o: $(srcdir)/../e2fsck/recovery.c $(srcdir)/../e2fsck/jfs_user.h \
$(top_srcdir)/lib/support/dqblk_v2.h \
$(top_srcdir)/lib/support/quotaio_tree.h \
$(top_srcdir)/lib/ext2fs/fast_commit.h $(top_srcdir)/lib/ext2fs/jfs_compat.h \
- $(top_srcdir)/lib/ext2fs/kernel-list.h $(top_srcdir)/lib/ext2fs/compiler.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/ext2fs/compiler.h \
$(top_srcdir)/lib/ext2fs/kernel-jbd.h
diff --git a/misc/tune2fs.c b/misc/tune2fs.c
index 3db57632c88d42..ac440176351e83 100644
--- a/misc/tune2fs.c
+++ b/misc/tune2fs.c
@@ -2857,10 +2857,6 @@ static int expand_inode_table(ext2_filsys fs, unsigned long new_ino_size)
}
-#define list_for_each_safe(pos, pnext, head) \
- for (pos = (head)->next, pnext = pos->next; pos != (head); \
- pos = pnext, pnext = pos->next)
-
static void free_blk_move_list(void)
{
struct list_head *entry, *tmp;
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 06/20] libsupport: add a cache
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
` (4 preceding siblings ...)
2025-08-21 1:09 ` [PATCH 05/20] libsupport: port the kernel list.h to libsupport Darrick J. Wong
@ 2025-08-21 1:09 ` Darrick J. Wong
2025-08-21 1:09 ` [PATCH 07/20] cache: disable debugging Darrick J. Wong
` (13 subsequent siblings)
19 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:09 UTC (permalink / raw)
To: tytso
Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, amir73il,
joannelkoong, neal
From: Darrick J. Wong <djwong@kernel.org>
Reuse the cache code from xfsprogs.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
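Note for reviewers unfamiliar with the xfsprogs cache: the client supplies
the hash and compare callbacks declared in cache.h. Below is a minimal
sketch of what a client might provide, assuming a cache keyed by block
number. The demo_* names and the stand-in declarations are hypothetical and
are not part of this patch; only the callback signatures and the
CACHE_HIT/CACHE_MISS/CACHE_PURGE values mirror cache.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for declarations in lib/support/list.h and cache.h. */
struct list_head { struct list_head *next, *prev; };
struct cache_node { struct list_head cn_hash; /* other fields omitted */ };
typedef void *cache_key_t;
enum { CACHE_HIT, CACHE_MISS, CACHE_PURGE };

/* Hypothetical cached object with an embedded cache_node. */
struct demo_buf {
	struct cache_node	node;
	uint64_t		blkno;
};

/* Hypothetical lookup key: the block number being requested. */
struct demo_key {
	uint64_t		blkno;
};

/* Same shape as cache_node_hash_t: map a key to a bucket index. */
unsigned int demo_hash(cache_key_t key, unsigned int hashsize,
		       unsigned int hashshift)
{
	const struct demo_key *k = key;

	(void)hashshift;	/* this simple modulo hash ignores the shift */
	assert(hashsize != 0);
	return (unsigned int)(k->blkno % hashsize);
}

/* Same shape as cache_node_compare_t: does this node hold the key? */
int demo_compare(struct cache_node *node, cache_key_t key)
{
	/* Recover the containing object from the embedded node. */
	const struct demo_buf *buf = (const struct demo_buf *)
		((const char *)node - offsetof(struct demo_buf, node));
	const struct demo_key *k = key;

	return buf->blkno == k->blkno ? CACHE_HIT : CACHE_MISS;
}
```

A real client would also fill in alloc, flush and relse in struct
cache_operations; bulkrelse, get and put are marked optional in cache.h.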
lib/support/cache.h | 139 +++++++++
lib/support/list.h | 7
lib/support/xbitops.h | 128 ++++++++
lib/support/Makefile.in | 8 -
lib/support/cache.c | 739 +++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 1019 insertions(+), 2 deletions(-)
create mode 100644 lib/support/cache.h
create mode 100644 lib/support/xbitops.h
create mode 100644 lib/support/cache.c
diff --git a/lib/support/cache.h b/lib/support/cache.h
new file mode 100644
index 00000000000000..16b17a9b7a1a51
--- /dev/null
+++ b/lib/support/cache.h
@@ -0,0 +1,139 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2006 Silicon Graphics, Inc.
+ * All Rights Reserved.
+ */
+#ifndef __CACHE_H__
+#define __CACHE_H__
+
+/*
+ * initialisation flags
+ */
+/*
+ * xfs_db always writes changes immediately, and so we need to purge buffers
+ * when we get a buffer lookup mismatch due to reading the same block with a
+ * different buffer configuration.
+ */
+#define CACHE_MISCOMPARE_PURGE (1 << 0)
+
+/*
+ * cache object compare return values
+ */
+enum {
+ CACHE_HIT,
+ CACHE_MISS,
+ CACHE_PURGE,
+};
+
+#define HASH_CACHE_RATIO 8
+
+/*
+ * Cache priorities range from BASE to MAX.
+ *
+ * For prefetch support, the top half of the range starts at
+ * CACHE_PREFETCH_PRIORITY and every time the buffer is fetched and is at or
+ * above this priority level, it is reduced to below this level (refer to
+ * libxfs_buf_get).
+ *
+ * If we have dirty nodes, we can't recycle them until they've been cleaned. To
+ * keep these out of the reclaimable lists (as there can be lots of them) give
+ * them their own priority that the shaker doesn't attempt to walk.
+ */
+
+#define CACHE_BASE_PRIORITY 0
+#define CACHE_PREFETCH_PRIORITY 8
+#define CACHE_MAX_PRIORITY 15
+#define CACHE_DIRTY_PRIORITY (CACHE_MAX_PRIORITY + 1)
+#define CACHE_NR_PRIORITIES CACHE_DIRTY_PRIORITY
+
+/*
+ * Simple, generic implementation of a cache (arbitrary data).
+ * Provides a hash table with a capped number of cache entries.
+ */
+
+struct cache;
+struct cache_node;
+
+typedef void *cache_key_t;
+
+typedef void (*cache_walk_t)(struct cache_node *);
+typedef struct cache_node * (*cache_node_alloc_t)(cache_key_t);
+typedef int (*cache_node_flush_t)(struct cache_node *);
+typedef void (*cache_node_relse_t)(struct cache_node *);
+typedef unsigned int (*cache_node_hash_t)(cache_key_t, unsigned int,
+ unsigned int);
+typedef int (*cache_node_compare_t)(struct cache_node *, cache_key_t);
+typedef unsigned int (*cache_bulk_relse_t)(struct cache *, struct list_head *);
+typedef int (*cache_node_get_t)(struct cache_node *);
+typedef void (*cache_node_put_t)(struct cache_node *);
+
+struct cache_operations {
+ cache_node_hash_t hash;
+ cache_node_alloc_t alloc;
+ cache_node_flush_t flush;
+ cache_node_relse_t relse;
+ cache_node_compare_t compare;
+ cache_bulk_relse_t bulkrelse; /* optional */
+ cache_node_get_t get; /* optional */
+ cache_node_put_t put; /* optional */
+};
+
+struct cache_hash {
+ struct list_head ch_list; /* hash chain head */
+ unsigned int ch_count; /* hash chain length */
+ pthread_mutex_t ch_mutex; /* hash chain mutex */
+};
+
+struct cache_mru {
+ struct list_head cm_list; /* MRU head */
+ unsigned int cm_count; /* MRU length */
+ pthread_mutex_t cm_mutex; /* MRU lock */
+};
+
+struct cache_node {
+ struct list_head cn_hash; /* hash chain */
+ struct list_head cn_mru; /* MRU chain */
+ unsigned int cn_count; /* reference count */
+ unsigned int cn_hashidx; /* hash chain index */
+ int cn_priority; /* priority, -1 = free list */
+ int cn_old_priority;/* saved pre-dirty prio */
+ pthread_mutex_t cn_mutex; /* node mutex */
+};
+
+struct cache {
+ int c_flags; /* behavioural flags */
+ unsigned int c_maxcount; /* max cache nodes */
+ unsigned int c_count; /* count of nodes */
+ pthread_mutex_t c_mutex; /* node count mutex */
+ cache_node_hash_t hash; /* node hash function */
+ cache_node_alloc_t alloc; /* allocation function */
+ cache_node_flush_t flush; /* flush dirty data function */
+ cache_node_relse_t relse; /* memory free function */
+ cache_node_compare_t compare; /* comparison routine */
+ cache_bulk_relse_t bulkrelse; /* bulk release routine */
+ cache_node_get_t get; /* prepare cache node after get */
+ cache_node_put_t put; /* prepare to put cache node */
+ unsigned int c_hashsize; /* hash bucket count */
+ unsigned int c_hashshift; /* hash key shift */
+ struct cache_hash *c_hash; /* hash table buckets */
+ struct cache_mru c_mrus[CACHE_DIRTY_PRIORITY + 1];
+ unsigned long long c_misses; /* cache misses */
+ unsigned long long c_hits; /* cache hits */
+ unsigned int c_max; /* max nodes ever used */
+};
+
+struct cache *cache_init(int, unsigned int, const struct cache_operations *);
+void cache_destroy(struct cache *);
+void cache_walk(struct cache *, cache_walk_t);
+void cache_purge(struct cache *);
+void cache_flush(struct cache *);
+
+int cache_node_get(struct cache *, cache_key_t, struct cache_node **);
+void cache_node_put(struct cache *, struct cache_node *);
+void cache_node_set_priority(struct cache *, struct cache_node *, int);
+int cache_node_get_priority(struct cache_node *);
+int cache_node_purge(struct cache *, cache_key_t, struct cache_node *);
+void cache_report(FILE *fp, const char *, struct cache *);
+int cache_overflowed(struct cache *);
+
+#endif /* __CACHE_H__ */
diff --git a/lib/support/list.h b/lib/support/list.h
index df6c99708e4a8e..0e00e446dd7214 100644
--- a/lib/support/list.h
+++ b/lib/support/list.h
@@ -17,6 +17,13 @@ struct list_head {
((type *)((char *)(ptr) - offsetof(type, member)))
#endif
+static inline void list_head_destroy(struct list_head *list)
+{
+ list->next = list->prev = NULL;
+}
+
+#define list_head_init(list) INIT_LIST_HEAD(list)
+
/*
* Circular doubly linked list implementation.
*
diff --git a/lib/support/xbitops.h b/lib/support/xbitops.h
new file mode 100644
index 00000000000000..78a8f2a8545f4c
--- /dev/null
+++ b/lib/support/xbitops.h
@@ -0,0 +1,128 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef __BITOPS_H__
+#define __BITOPS_H__
+
+/*
+ * fls: find last bit set.
+ */
+
+static inline int fls(int x)
+{
+ int r = 32;
+
+ if (!x)
+ return 0;
+ if (!(x & 0xffff0000u)) {
+ x = (x & 0xffffu) << 16;
+ r -= 16;
+ }
+ if (!(x & 0xff000000u)) {
+ x = (x & 0xffffffu) << 8;
+ r -= 8;
+ }
+ if (!(x & 0xf0000000u)) {
+ x = (x & 0xfffffffu) << 4;
+ r -= 4;
+ }
+ if (!(x & 0xc0000000u)) {
+ x = (x & 0x3fffffffu) << 2;
+ r -= 2;
+ }
+ if (!(x & 0x80000000u)) {
+ r -= 1;
+ }
+ return r;
+}
+
+static inline int fls64(uint64_t x)
+{
+ uint32_t h = x >> 32;
+ if (h)
+ return fls(h) + 32;
+ return fls(x);
+}
+
+static inline unsigned fls_long(unsigned long l)
+{
+ if (sizeof(l) == 4)
+ return fls(l);
+ return fls64(l);
+}
+
+/*
+ * ffz: find first zero bit.
+ * Result is undefined if no zero bit exists.
+ */
+#define ffz(x) ffs(~(x))
+
+/*
+ * XFS bit manipulation routines. Repeated here so that some programs
+ * don't have to link in all of libxfs just to have bit manipulation.
+ */
+
+/*
+ * masks with n high/low bits set, 64-bit values
+ */
+static inline uint64_t mask64hi(int n)
+{
+ return (uint64_t)-1 << (64 - (n));
+}
+static inline uint32_t mask32lo(int n)
+{
+ return ((uint32_t)1 << (n)) - 1;
+}
+static inline uint64_t mask64lo(int n)
+{
+ return ((uint64_t)1 << (n)) - 1;
+}
+
+/* Get high bit set out of 32-bit argument, -1 if none set */
+static inline int highbit32(uint32_t v)
+{
+ return fls(v) - 1;
+}
+
+/* Get high bit set out of 64-bit argument, -1 if none set */
+static inline int highbit64(uint64_t v)
+{
+ return fls64(v) - 1;
+}
+
+/* Get low bit set out of 32-bit argument, -1 if none set */
+static inline int lowbit32(uint32_t v)
+{
+ return ffs(v) - 1;
+}
+
+/* Get low bit set out of 64-bit argument, -1 if none set */
+static inline int lowbit64(uint64_t v)
+{
+ uint32_t w = (uint32_t)v;
+ int n = 0;
+
+ if (w) { /* lower bits */
+ n = ffs(w);
+ } else { /* upper bits */
+ w = (uint32_t)(v >> 32);
+ if (w) {
+ n = ffs(w);
+ if (n)
+ n += 32;
+ }
+ }
+ return n - 1;
+}
+
+/**
+ * __rounddown_pow_of_two() - round down to nearest power of two
+ * @n: value to round down
+ */
+static inline __attribute__((const))
+unsigned long __rounddown_pow_of_two(unsigned long n)
+{
+ return 1UL << (fls_long(n) - 1);
+}
+
+#define rounddown_pow_of_two(n) __rounddown_pow_of_two(n)
+
+#endif
diff --git a/lib/support/Makefile.in b/lib/support/Makefile.in
index 3f26cd30172f51..13d6f06f150afd 100644
--- a/lib/support/Makefile.in
+++ b/lib/support/Makefile.in
@@ -25,7 +25,8 @@ OBJS= cstring.o \
quotaio_v2.o \
quotaio_tree.o \
dict.o \
- devname.o
+ devname.o \
+ cache.o
SRCS= $(srcdir)/argv_parse.c \
$(srcdir)/cstring.c \
@@ -40,7 +41,8 @@ SRCS= $(srcdir)/argv_parse.c \
$(srcdir)/quotaio_tree.c \
$(srcdir)/quotaio_v2.c \
$(srcdir)/dict.c \
- $(srcdir)/devname.c
+ $(srcdir)/devname.c \
+ $(srcdir)/cache.c
LIBRARY= libsupport
LIBDIR= support
@@ -183,3 +185,5 @@ dict.o: $(srcdir)/dict.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/dict.h
devname.o: $(srcdir)/devname.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/devname.h $(srcdir)/nls-enable.h
+cache.o: $(srcdir)/cache.c $(top_builddir)/lib/config.h \
+ $(srcdir)/cache.h $(srcdir)/list.h $(srcdir)/xbitops.h
diff --git a/lib/support/cache.c b/lib/support/cache.c
new file mode 100644
index 00000000000000..fe04f62f262aaa
--- /dev/null
+++ b/lib/support/cache.c
@@ -0,0 +1,739 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2006 Silicon Graphics, Inc.
+ * All Rights Reserved.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#include "list.h"
+#include "cache.h"
+#include "xbitops.h"
+
+#define CACHE_DEBUG 1
+#undef CACHE_DEBUG
+#define CACHE_DEBUG 1
+#undef CACHE_ABORT
+/* #define CACHE_ABORT 1 */
+
+#define CACHE_SHAKE_COUNT 64
+
+#ifdef CACHE_DEBUG
+# include <assert.h>
+# define ASSERT(x) assert(x)
+#endif
+
+static unsigned int cache_generic_bulkrelse(struct cache *, struct list_head *);
+
+struct cache *
+cache_init(
+ int flags,
+ unsigned int hashsize,
+ const struct cache_operations *cache_operations)
+{
+ struct cache * cache;
+ unsigned int i, maxcount;
+
+ maxcount = hashsize * HASH_CACHE_RATIO;
+
+ if (!(cache = malloc(sizeof(struct cache))))
+ return NULL;
+ if (!(cache->c_hash = calloc(hashsize, sizeof(struct cache_hash)))) {
+ free(cache);
+ return NULL;
+ }
+
+ cache->c_flags = flags;
+ cache->c_count = 0;
+ cache->c_max = 0;
+ cache->c_hits = 0;
+ cache->c_misses = 0;
+ cache->c_maxcount = maxcount;
+ cache->c_hashsize = hashsize;
+ cache->c_hashshift = fls(hashsize) - 1;
+ cache->hash = cache_operations->hash;
+ cache->alloc = cache_operations->alloc;
+ cache->flush = cache_operations->flush;
+ cache->relse = cache_operations->relse;
+ cache->compare = cache_operations->compare;
+ cache->bulkrelse = cache_operations->bulkrelse ?
+ cache_operations->bulkrelse : cache_generic_bulkrelse;
+ cache->get = cache_operations->get;
+ cache->put = cache_operations->put;
+ pthread_mutex_init(&cache->c_mutex, NULL);
+
+ for (i = 0; i < hashsize; i++) {
+ list_head_init(&cache->c_hash[i].ch_list);
+ cache->c_hash[i].ch_count = 0;
+ pthread_mutex_init(&cache->c_hash[i].ch_mutex, NULL);
+ }
+
+ for (i = 0; i <= CACHE_DIRTY_PRIORITY; i++) {
+ list_head_init(&cache->c_mrus[i].cm_list);
+ cache->c_mrus[i].cm_count = 0;
+ pthread_mutex_init(&cache->c_mrus[i].cm_mutex, NULL);
+ }
+ return cache;
+}
+
+static void
+cache_expand(
+ struct cache * cache)
+{
+ pthread_mutex_lock(&cache->c_mutex);
+#ifdef CACHE_DEBUG
+ fprintf(stderr, "doubling cache size to %u\n", 2 * cache->c_maxcount);
+#endif
+ cache->c_maxcount *= 2;
+ pthread_mutex_unlock(&cache->c_mutex);
+}
+
+void
+cache_walk(
+ struct cache * cache,
+ cache_walk_t visit)
+{
+ struct cache_hash * hash;
+ struct list_head * head;
+ struct list_head * pos;
+ unsigned int i;
+
+ for (i = 0; i < cache->c_hashsize; i++) {
+ hash = &cache->c_hash[i];
+ head = &hash->ch_list;
+ pthread_mutex_lock(&hash->ch_mutex);
+ for (pos = head->next; pos != head; pos = pos->next)
+ visit((struct cache_node *)pos);
+ pthread_mutex_unlock(&hash->ch_mutex);
+ }
+}
+
+#ifdef CACHE_ABORT
+#define cache_abort() abort()
+#else
+#define cache_abort() do { } while (0)
+#endif
+
+#ifdef CACHE_DEBUG
+static void
+cache_zero_check(
+ struct cache_node * node)
+{
+ if (node->cn_count > 0) {
+ fprintf(stderr, "%s: refcount is %u, not zero (node=%p)\n",
+ __FUNCTION__, node->cn_count, node);
+ cache_abort();
+ }
+}
+#define cache_destroy_check(c) cache_walk((c), cache_zero_check)
+#else
+#define cache_destroy_check(c) do { } while (0)
+#endif
+
+void
+cache_destroy(
+ struct cache * cache)
+{
+ unsigned int i;
+
+ cache_destroy_check(cache);
+ for (i = 0; i < cache->c_hashsize; i++) {
+ list_head_destroy(&cache->c_hash[i].ch_list);
+ pthread_mutex_destroy(&cache->c_hash[i].ch_mutex);
+ }
+ for (i = 0; i <= CACHE_DIRTY_PRIORITY; i++) {
+ list_head_destroy(&cache->c_mrus[i].cm_list);
+ pthread_mutex_destroy(&cache->c_mrus[i].cm_mutex);
+ }
+ pthread_mutex_destroy(&cache->c_mutex);
+ free(cache->c_hash);
+ free(cache);
+}
+
+static unsigned int
+cache_generic_bulkrelse(
+ struct cache * cache,
+ struct list_head * list)
+{
+ struct cache_node * node;
+ unsigned int count = 0;
+
+ while (!list_empty(list)) {
+ node = list_entry(list->next, struct cache_node, cn_mru);
+ pthread_mutex_destroy(&node->cn_mutex);
+ list_del_init(&node->cn_mru);
+ cache->relse(node);
+ count++;
+ }
+
+ return count;
+}
+
+/*
+ * Park unflushable nodes on their own special MRU so that cache_shake() doesn't
+ * end up repeatedly scanning them in the futile attempt to clean them before
+ * reclaim.
+ */
+static void
+cache_add_to_dirty_mru(
+ struct cache *cache,
+ struct cache_node *node)
+{
+ struct cache_mru *mru = &cache->c_mrus[CACHE_DIRTY_PRIORITY];
+
+ pthread_mutex_lock(&mru->cm_mutex);
+ node->cn_old_priority = node->cn_priority;
+ node->cn_priority = CACHE_DIRTY_PRIORITY;
+ list_add(&node->cn_mru, &mru->cm_list);
+ mru->cm_count++;
+ pthread_mutex_unlock(&mru->cm_mutex);
+}
+
+/*
+ * We've hit the limit on cache size, so we need to start reclaiming nodes we've
+ * used. The MRU specified by the priority is shaken. Returns new priority at
+ * end of the call (in case we call again). We are not allowed to reclaim dirty
+ * objects, so we have to flush them first. If flushing fails, we move them to
+ * the "dirty, unreclaimable" list.
+ *
+ * Hence we skip priorities > CACHE_MAX_PRIORITY unless "purge" is set as we
+ * park unflushable (and hence unreclaimable) buffers at these priorities.
+ * Trying to shake unreclaimable buffer lists when there is memory pressure is a
+ * waste of time and CPU and greatly slows down cache node recycling operations.
+ * Hence we only try to free them if we are being asked to purge the cache of
+ * all entries.
+ */
+static unsigned int
+cache_shake(
+ struct cache * cache,
+ unsigned int priority,
+ bool purge)
+{
+ struct cache_mru *mru;
+ struct cache_hash * hash;
+ struct list_head temp;
+ struct list_head * head;
+ struct list_head * pos;
+ struct list_head * n;
+ struct cache_node * node;
+ unsigned int count;
+
+ ASSERT(priority <= CACHE_DIRTY_PRIORITY);
+ if (priority > CACHE_MAX_PRIORITY && !purge)
+ priority = 0;
+
+ mru = &cache->c_mrus[priority];
+ count = 0;
+ list_head_init(&temp);
+ head = &mru->cm_list;
+
+ pthread_mutex_lock(&mru->cm_mutex);
+ for (pos = head->prev, n = pos->prev; pos != head;
+ pos = n, n = pos->prev) {
+ node = list_entry(pos, struct cache_node, cn_mru);
+
+ if (pthread_mutex_trylock(&node->cn_mutex) != 0)
+ continue;
+
+ /* memory pressure is not allowed to release dirty objects */
+ if (cache->flush(node) && !purge) {
+ list_del(&node->cn_mru);
+ mru->cm_count--;
+ node->cn_priority = -1;
+ pthread_mutex_unlock(&node->cn_mutex);
+ cache_add_to_dirty_mru(cache, node);
+ continue;
+ }
+
+ hash = cache->c_hash + node->cn_hashidx;
+ if (pthread_mutex_trylock(&hash->ch_mutex) != 0) {
+ pthread_mutex_unlock(&node->cn_mutex);
+ continue;
+ }
+ ASSERT(node->cn_count == 0);
+ ASSERT(node->cn_priority == priority);
+ node->cn_priority = -1;
+
+ list_move(&node->cn_mru, &temp);
+ list_del_init(&node->cn_hash);
+ hash->ch_count--;
+ mru->cm_count--;
+ pthread_mutex_unlock(&hash->ch_mutex);
+ pthread_mutex_unlock(&node->cn_mutex);
+
+ count++;
+ if (!purge && count == CACHE_SHAKE_COUNT)
+ break;
+ }
+ pthread_mutex_unlock(&mru->cm_mutex);
+
+ if (count > 0) {
+ cache->bulkrelse(cache, &temp);
+
+ pthread_mutex_lock(&cache->c_mutex);
+ cache->c_count -= count;
+ pthread_mutex_unlock(&cache->c_mutex);
+ }
+
+ return (count == CACHE_SHAKE_COUNT) ? priority : ++priority;
+}
+
+/*
+ * Allocate a new hash node (updating atomic counter in the process),
+ * unless doing so will push us over the maximum cache size.
+ */
+static struct cache_node *
+cache_node_allocate(
+ struct cache * cache,
+ cache_key_t key)
+{
+ unsigned int nodesfree;
+ struct cache_node * node;
+
+ pthread_mutex_lock(&cache->c_mutex);
+ nodesfree = (cache->c_count < cache->c_maxcount);
+ if (nodesfree) {
+ cache->c_count++;
+ if (cache->c_count > cache->c_max)
+ cache->c_max = cache->c_count;
+ }
+ cache->c_misses++;
+ pthread_mutex_unlock(&cache->c_mutex);
+ if (!nodesfree)
+ return NULL;
+ node = cache->alloc(key);
+ if (node == NULL) { /* uh-oh */
+ pthread_mutex_lock(&cache->c_mutex);
+ cache->c_count--;
+ pthread_mutex_unlock(&cache->c_mutex);
+ return NULL;
+ }
+ pthread_mutex_init(&node->cn_mutex, NULL);
+ list_head_init(&node->cn_mru);
+ node->cn_count = 1;
+ node->cn_priority = 0;
+ node->cn_old_priority = -1;
+ return node;
+}
+
+int
+cache_overflowed(
+ struct cache * cache)
+{
+ return cache->c_maxcount == cache->c_max;
+}
+
+
+static int
+__cache_node_purge(
+ struct cache * cache,
+ struct cache_node * node)
+{
+ int count;
+ struct cache_mru * mru;
+
+ pthread_mutex_lock(&node->cn_mutex);
+ count = node->cn_count;
+ if (count != 0) {
+ pthread_mutex_unlock(&node->cn_mutex);
+ return count;
+ }
+
+ /* can't purge dirty objects */
+ if (cache->flush(node)) {
+ pthread_mutex_unlock(&node->cn_mutex);
+ return 1;
+ }
+
+ mru = &cache->c_mrus[node->cn_priority];
+ pthread_mutex_lock(&mru->cm_mutex);
+ list_del_init(&node->cn_mru);
+ mru->cm_count--;
+ pthread_mutex_unlock(&mru->cm_mutex);
+
+ pthread_mutex_unlock(&node->cn_mutex);
+ pthread_mutex_destroy(&node->cn_mutex);
+ list_del_init(&node->cn_hash);
+ cache->relse(node);
+ return 0;
+}
+
+/*
+ * Lookup in the cache hash table. With any luck we'll get a cache
+ * hit, in which case this will all be over quickly and painlessly.
+ * Otherwise, we allocate a new node, taking care not to expand the
+ * cache beyond the requested maximum size (shrink it if it would).
+ * Returns zero if hit in cache, otherwise one. A node is _always_
+ * returned, however.
+ */
+int
+cache_node_get(
+ struct cache * cache,
+ cache_key_t key,
+ struct cache_node ** nodep)
+{
+ struct cache_node * node = NULL;
+ struct cache_hash * hash;
+ struct cache_mru * mru;
+ struct list_head * head;
+ struct list_head * pos;
+ struct list_head * n;
+ unsigned int hashidx;
+ int priority = 0;
+ int purged = 0;
+
+ hashidx = cache->hash(key, cache->c_hashsize, cache->c_hashshift);
+ hash = cache->c_hash + hashidx;
+ head = &hash->ch_list;
+
+ for (;;) {
+ pthread_mutex_lock(&hash->ch_mutex);
+ for (pos = head->next, n = pos->next; pos != head;
+ pos = n, n = pos->next) {
+ int result;
+
+ node = list_entry(pos, struct cache_node, cn_hash);
+ result = cache->compare(node, key);
+ switch (result) {
+ case CACHE_HIT:
+ break;
+ case CACHE_PURGE:
+ if ((cache->c_flags & CACHE_MISCOMPARE_PURGE) &&
+ !__cache_node_purge(cache, node)) {
+ purged++;
+ hash->ch_count--;
+ }
+ /* FALL THROUGH */
+ case CACHE_MISS:
+ goto next_object;
+ }
+
+ /*
+ * node found, bump node's reference count, remove it
+ * from its MRU list, and update stats.
+ */
+ pthread_mutex_lock(&node->cn_mutex);
+
+ if (node->cn_count == 0 && cache->get) {
+ int err = cache->get(node);
+ if (err) {
+ pthread_mutex_unlock(&node->cn_mutex);
+ goto next_object;
+ }
+ }
+ if (node->cn_count == 0) {
+ ASSERT(node->cn_priority >= 0);
+ ASSERT(!list_empty(&node->cn_mru));
+ mru = &cache->c_mrus[node->cn_priority];
+ pthread_mutex_lock(&mru->cm_mutex);
+ mru->cm_count--;
+ list_del_init(&node->cn_mru);
+ pthread_mutex_unlock(&mru->cm_mutex);
+ if (node->cn_old_priority != -1) {
+ ASSERT(node->cn_priority ==
+ CACHE_DIRTY_PRIORITY);
+ node->cn_priority = node->cn_old_priority;
+ node->cn_old_priority = -1;
+ }
+ }
+ node->cn_count++;
+
+ pthread_mutex_unlock(&node->cn_mutex);
+ pthread_mutex_unlock(&hash->ch_mutex);
+
+ pthread_mutex_lock(&cache->c_mutex);
+ cache->c_hits++;
+ pthread_mutex_unlock(&cache->c_mutex);
+
+ *nodep = node;
+ return 0;
+next_object:
+ continue; /* what the hell, gcc? */
+ }
+ pthread_mutex_unlock(&hash->ch_mutex);
+ /*
+ * not found, allocate a new entry
+ */
+ node = cache_node_allocate(cache, key);
+ if (node)
+ break;
+ priority = cache_shake(cache, priority, false);
+ /*
+ * We start at 0; if we free CACHE_SHAKE_COUNT we get
+ * back the same priority, if not we get back priority+1.
+ * If we exceed CACHE_MAX_PRIORITY all slots are full; grow it.
+ */
+ if (priority > CACHE_MAX_PRIORITY) {
+ priority = 0;
+ cache_expand(cache);
+ }
+ }
+
+ node->cn_hashidx = hashidx;
+
+ /* add new node to appropriate hash */
+ pthread_mutex_lock(&hash->ch_mutex);
+ hash->ch_count++;
+ list_add(&node->cn_hash, &hash->ch_list);
+ pthread_mutex_unlock(&hash->ch_mutex);
+
+ if (purged) {
+ pthread_mutex_lock(&cache->c_mutex);
+ cache->c_count -= purged;
+ pthread_mutex_unlock(&cache->c_mutex);
+ }
+
+ *nodep = node;
+ return 1;
+}
+
+void
+cache_node_put(
+ struct cache * cache,
+ struct cache_node * node)
+{
+ struct cache_mru * mru;
+
+ pthread_mutex_lock(&node->cn_mutex);
+#ifdef CACHE_DEBUG
+ if (node->cn_count < 1) {
+ fprintf(stderr, "%s: node put on refcount %u (node=%p)\n",
+ __FUNCTION__, node->cn_count, node);
+ cache_abort();
+ }
+ if (!list_empty(&node->cn_mru)) {
+ fprintf(stderr, "%s: node put on node (%p) in MRU list\n",
+ __FUNCTION__, node);
+ cache_abort();
+ }
+#endif
+ node->cn_count--;
+
+ if (node->cn_count == 0 && cache->put)
+ cache->put(node);
+ if (node->cn_count == 0) {
+ /* add unreferenced node to appropriate MRU for shaker */
+ mru = &cache->c_mrus[node->cn_priority];
+ pthread_mutex_lock(&mru->cm_mutex);
+ mru->cm_count++;
+ list_add(&node->cn_mru, &mru->cm_list);
+ pthread_mutex_unlock(&mru->cm_mutex);
+ }
+
+ pthread_mutex_unlock(&node->cn_mutex);
+}
+
+void
+cache_node_set_priority(
+ struct cache * cache,
+ struct cache_node * node,
+ int priority)
+{
+ if (priority < 0)
+ priority = 0;
+ else if (priority > CACHE_MAX_PRIORITY)
+ priority = CACHE_MAX_PRIORITY;
+
+ pthread_mutex_lock(&node->cn_mutex);
+ ASSERT(node->cn_count > 0);
+ node->cn_priority = priority;
+ node->cn_old_priority = -1;
+ pthread_mutex_unlock(&node->cn_mutex);
+}
+
+int
+cache_node_get_priority(
+ struct cache_node * node)
+{
+ int priority;
+
+ pthread_mutex_lock(&node->cn_mutex);
+ priority = node->cn_priority;
+ pthread_mutex_unlock(&node->cn_mutex);
+
+ return priority;
+}
+
+
+/*
+ * Purge a specific node from the cache. Reference count must be zero.
+ */
+int
+cache_node_purge(
+ struct cache * cache,
+ cache_key_t key,
+ struct cache_node * node)
+{
+ struct list_head * head;
+ struct list_head * pos;
+ struct list_head * n;
+ struct cache_hash * hash;
+ int count = -1;
+
+ hash = cache->c_hash + cache->hash(key, cache->c_hashsize,
+ cache->c_hashshift);
+ head = &hash->ch_list;
+ pthread_mutex_lock(&hash->ch_mutex);
+ for (pos = head->next, n = pos->next; pos != head;
+ pos = n, n = pos->next) {
+ if ((struct cache_node *)pos != node)
+ continue;
+
+ count = __cache_node_purge(cache, node);
+ if (!count)
+ hash->ch_count--;
+ break;
+ }
+ pthread_mutex_unlock(&hash->ch_mutex);
+
+ if (count == 0) {
+ pthread_mutex_lock(&cache->c_mutex);
+ cache->c_count--;
+ pthread_mutex_unlock(&cache->c_mutex);
+ }
+#ifdef CACHE_DEBUG
+ if (count >= 1) {
+ fprintf(stderr, "%s: refcount was %u, not zero (node=%p)\n",
+ __FUNCTION__, count, node);
+ cache_abort();
+ }
+ if (count == -1) {
+ fprintf(stderr, "%s: purge node not found! (node=%p)\n",
+ __FUNCTION__, node);
+ cache_abort();
+ }
+#endif
+ return count == 0;
+}
+
+/*
+ * Purge all nodes from the cache. All reference counts must be zero.
+ */
+void
+cache_purge(
+ struct cache * cache)
+{
+ int i;
+
+ for (i = 0; i <= CACHE_DIRTY_PRIORITY; i++)
+ cache_shake(cache, i, true);
+
+#ifdef CACHE_DEBUG
+ if (cache->c_count != 0) {
+ /* flush referenced nodes to disk */
+ cache_flush(cache);
+ fprintf(stderr, "%s: shake on cache %p left %u nodes!?\n",
+ __FUNCTION__, cache, cache->c_count);
+ cache_abort();
+ }
+#endif
+}
+
+/*
+ * Flush all nodes in the cache to disk.
+ */
+void
+cache_flush(
+ struct cache * cache)
+{
+ struct cache_hash * hash;
+ struct list_head * head;
+ struct list_head * pos;
+ struct cache_node * node;
+ int i;
+
+ if (!cache->flush)
+ return;
+
+ for (i = 0; i < cache->c_hashsize; i++) {
+ hash = &cache->c_hash[i];
+
+ pthread_mutex_lock(&hash->ch_mutex);
+ head = &hash->ch_list;
+ for (pos = head->next; pos != head; pos = pos->next) {
+ node = (struct cache_node *)pos;
+ pthread_mutex_lock(&node->cn_mutex);
+ cache->flush(node);
+ pthread_mutex_unlock(&node->cn_mutex);
+ }
+ pthread_mutex_unlock(&hash->ch_mutex);
+ }
+}
+
+#define HASH_REPORT (3 * HASH_CACHE_RATIO)
+void
+cache_report(
+ FILE *fp,
+ const char *name,
+ struct cache *cache)
+{
+ int i;
+ unsigned long count, index, total;
+ unsigned long hash_bucket_lengths[HASH_REPORT + 2];
+
+ if ((cache->c_hits + cache->c_misses) == 0)
+ return;
+
+ /* report cache summary */
+ fprintf(fp, "%s: %p\n"
+ "Max supported entries = %u\n"
+ "Max utilized entries = %u\n"
+ "Active entries = %u\n"
+ "Hash table size = %u\n"
+ "Hits = %llu\n"
+ "Misses = %llu\n"
+ "Hit ratio = %5.2f\n",
+ name, cache,
+ cache->c_maxcount,
+ cache->c_max,
+ cache->c_count,
+ cache->c_hashsize,
+ cache->c_hits,
+ cache->c_misses,
+ (double)cache->c_hits * 100 /
+ (cache->c_hits + cache->c_misses)
+ );
+
+ for (i = 0; i <= CACHE_MAX_PRIORITY; i++)
+ fprintf(fp, "MRU %d entries = %6u (%3u%%)\n",
+ i, cache->c_mrus[i].cm_count,
+ cache->c_mrus[i].cm_count * 100 / cache->c_count);
+
+ i = CACHE_DIRTY_PRIORITY;
+ fprintf(fp, "Dirty MRU %d entries = %6u (%3u%%)\n",
+ i, cache->c_mrus[i].cm_count,
+ cache->c_mrus[i].cm_count * 100 / cache->c_count);
+
+ /* report hash bucket lengths */
+ bzero(hash_bucket_lengths, sizeof(hash_bucket_lengths));
+
+ for (i = 0; i < cache->c_hashsize; i++) {
+ count = cache->c_hash[i].ch_count;
+ if (count > HASH_REPORT)
+ index = HASH_REPORT + 1;
+ else
+ index = count;
+ hash_bucket_lengths[index]++;
+ }
+
+ total = 0;
+ for (i = 0; i < HASH_REPORT + 1; i++) {
+ total += i * hash_bucket_lengths[i];
+ if (hash_bucket_lengths[i] == 0)
+ continue;
+ fprintf(fp, "Hash buckets with %2d entries %6ld (%3ld%%)\n",
+ i, hash_bucket_lengths[i],
+ (i * hash_bucket_lengths[i] * 100) / cache->c_count);
+ }
+ if (hash_bucket_lengths[i]) /* last report bucket is the overflow bucket */
+ fprintf(fp, "Hash buckets with >%2d entries %6ld (%3ld%%)\n",
+ i - 1, hash_bucket_lengths[i],
+ ((cache->c_count - total) * 100) / cache->c_count);
+}
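
For readers following the bit helpers: cache_init() sets c_hashshift to
fls(hashsize) - 1, and xbitops.h builds rounddown_pow_of_two() on the same
primitive. The sketch below restates that arithmetic with a plain loop
instead of the unrolled binary search in the patch. The sketch_* names are
hypothetical, chosen to avoid clashing with any platform-provided fls().

```c
#include <assert.h>

/* 1-based index of the highest set bit; 0 when no bit is set. */
int sketch_fls(unsigned int x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/* Shift value as computed in cache_init(): fls(hashsize) - 1. */
unsigned int sketch_hashshift(unsigned int hashsize)
{
	assert(hashsize != 0);	/* 0 would make the subtraction wrap */
	return (unsigned int)(sketch_fls(hashsize) - 1);
}

/* Largest power of two that is <= n, for nonzero n. */
unsigned long sketch_rounddown_pow_of_two(unsigned long n)
{
	int bits = 0;

	assert(n != 0);
	while (n) {
		bits++;
		n >>= 1;
	}
	return 1UL << (bits - 1);
}
```

When hashsize is not a power of two, the shift corresponds to the next
lower power of two, so a hash callback that relies on c_hashshift alone
would presumably not reach every bucket.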
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 07/20] cache: disable debugging
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
` (5 preceding siblings ...)
2025-08-21 1:09 ` [PATCH 06/20] libsupport: add a cache Darrick J. Wong
@ 2025-08-21 1:09 ` Darrick J. Wong
2025-08-21 1:09 ` [PATCH 08/20] cache: use modern list iterator macros Darrick J. Wong
` (12 subsequent siblings)
19 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:09 UTC (permalink / raw)
To: tytso
Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, amir73il,
joannelkoong, neal
From: Darrick J. Wong <djwong@kernel.org>
Not sure why debugging is turned on by default in the xfsprogs cache
code, but let's turn it off.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
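Note on the #else branch: cache.c calls ASSERT() unconditionally, so once
CACHE_DEBUG is off the macro still needs a definition or the file will
likely fail to build. A minimal sketch of the pattern follows, using
hypothetical DEMO_* names rather than the real macros. It also shows that
the no-op form never evaluates its argument, so assertion expressions must
not carry side effects the code depends on.

```c
#include <assert.h>

#undef DEMO_DEBUG
/* #define DEMO_DEBUG 1 */

#ifdef DEMO_DEBUG
# define DEMO_ASSERT(x)	assert(x)
#else
# define DEMO_ASSERT(x)	do { } while (0)
#endif

int demo_checks_run;

/* Counts how many times an assertion expression was evaluated. */
int demo_check(void)
{
	demo_checks_run++;
	return 1;
}

/* Returns the evaluation count after one DEMO_ASSERT() call. */
int demo_run(void)
{
	DEMO_ASSERT(demo_check());
	return demo_checks_run;
}
```

With DEMO_DEBUG undefined as above, demo_run() leaves the counter
untouched because the macro discards its argument.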
lib/support/cache.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/lib/support/cache.c b/lib/support/cache.c
index fe04f62f262aaa..08e0b484cca298 100644
--- a/lib/support/cache.c
+++ b/lib/support/cache.c
@@ -17,9 +17,8 @@
#include "cache.h"
#include "xbitops.h"
-#define CACHE_DEBUG 1
#undef CACHE_DEBUG
-#define CACHE_DEBUG 1
+/* #define CACHE_DEBUG 1 */
#undef CACHE_ABORT
/* #define CACHE_ABORT 1 */
@@ -28,6 +27,8 @@
#ifdef CACHE_DEBUG
# include <assert.h>
# define ASSERT(x) assert(x)
+#else
+# define ASSERT(x) do { } while (0)
#endif
static unsigned int cache_generic_bulkrelse(struct cache *, struct list_head *);
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 08/20] cache: use modern list iterator macros
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
` (6 preceding siblings ...)
2025-08-21 1:09 ` [PATCH 07/20] cache: disable debugging Darrick J. Wong
@ 2025-08-21 1:09 ` Darrick J. Wong
2025-08-21 1:10 ` [PATCH 09/20] cache: embed struct cache in the owner Darrick J. Wong
` (11 subsequent siblings)
19 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:09 UTC (permalink / raw)
To: tytso
Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, amir73il,
joannelkoong, neal
From: Darrick J. Wong <djwong@kernel.org>
Use the list iterator macros from list.h.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
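For reviewers who haven't used the iterator macros: a minimal sketch of
what list_for_each_entry does, using stand-in definitions rather than the
real list.h (the macro bodies, list_add_tail and struct demo_node below are
assumptions for illustration, and __typeof__ is a GCC/Clang extension).
The point is that the iterator yields the containing object directly, so
the loop body needs neither list_entry() nor a pointer cast.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for list.h; the real header is assumed to look similar. */
struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_for_each_entry(pos, head, member) \
	for (pos = container_of((head)->next, __typeof__(*pos), member); \
	     &pos->member != (head); \
	     pos = container_of(pos->member.next, __typeof__(*pos), member))

static void list_head_init(struct list_head *h)
{
	h->next = h->prev = h;
}

static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Hypothetical node type; the list link is deliberately not first. */
struct demo_node {
	int			value;
	struct list_head	link;
};

/* Sum every value on the list without any casts in the loop body. */
static int demo_sum(struct list_head *head)
{
	struct demo_node *pos;
	int sum = 0;

	list_for_each_entry(pos, head, link)
		sum += pos->value;
	return sum;
}
```

Because the macro goes through offsetof, it keeps working if the link is
not the first member of the structure, which a plain cast from
struct list_head * would not.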
lib/support/cache.c | 71 +++++++++++++++++----------------------------------
1 file changed, 24 insertions(+), 47 deletions(-)
diff --git a/lib/support/cache.c b/lib/support/cache.c
index 08e0b484cca298..d8f8231ac36d28 100644
--- a/lib/support/cache.c
+++ b/lib/support/cache.c
@@ -98,20 +98,18 @@ cache_expand(
void
cache_walk(
- struct cache * cache,
+ struct cache *cache,
cache_walk_t visit)
{
- struct cache_hash * hash;
- struct list_head * head;
- struct list_head * pos;
+ struct cache_hash *hash;
+ struct cache_node *pos;
unsigned int i;
for (i = 0; i < cache->c_hashsize; i++) {
hash = &cache->c_hash[i];
- head = &hash->ch_list;
pthread_mutex_lock(&hash->ch_mutex);
- for (pos = head->next; pos != head; pos = pos->next)
- visit((struct cache_node *)pos);
+ list_for_each_entry(pos, &hash->ch_list, cn_hash)
+ visit(pos);
pthread_mutex_unlock(&hash->ch_mutex);
}
}
@@ -218,12 +216,9 @@ cache_shake(
bool purge)
{
struct cache_mru *mru;
- struct cache_hash * hash;
+ struct cache_hash *hash;
struct list_head temp;
- struct list_head * head;
- struct list_head * pos;
- struct list_head * n;
- struct cache_node * node;
+ struct cache_node *node, *n;
unsigned int count;
ASSERT(priority <= CACHE_DIRTY_PRIORITY);
@@ -233,13 +228,9 @@ cache_shake(
mru = &cache->c_mrus[priority];
count = 0;
list_head_init(&temp);
- head = &mru->cm_list;
pthread_mutex_lock(&mru->cm_mutex);
- for (pos = head->prev, n = pos->prev; pos != head;
- pos = n, n = pos->prev) {
- node = list_entry(pos, struct cache_node, cn_mru);
-
+ list_for_each_entry_safe_reverse(node, n, &mru->cm_list, cn_mru) {
if (pthread_mutex_trylock(&node->cn_mutex) != 0)
continue;
@@ -376,31 +367,25 @@ __cache_node_purge(
*/
int
cache_node_get(
- struct cache * cache,
+ struct cache *cache,
cache_key_t key,
- struct cache_node ** nodep)
+ struct cache_node **nodep)
{
- struct cache_node * node = NULL;
- struct cache_hash * hash;
- struct cache_mru * mru;
- struct list_head * head;
- struct list_head * pos;
- struct list_head * n;
+ struct cache_hash *hash;
+ struct cache_mru *mru;
+ struct cache_node *node = NULL, *n;
unsigned int hashidx;
int priority = 0;
int purged = 0;
hashidx = cache->hash(key, cache->c_hashsize, cache->c_hashshift);
hash = cache->c_hash + hashidx;
- head = &hash->ch_list;
for (;;) {
pthread_mutex_lock(&hash->ch_mutex);
- for (pos = head->next, n = pos->next; pos != head;
- pos = n, n = pos->next) {
+ list_for_each_entry_safe(node, n, &hash->ch_list, cn_hash) {
int result;
- node = list_entry(pos, struct cache_node, cn_hash);
result = cache->compare(node, key);
switch (result) {
case CACHE_HIT:
@@ -568,23 +553,19 @@ cache_node_get_priority(
*/
int
cache_node_purge(
- struct cache * cache,
+ struct cache *cache,
cache_key_t key,
- struct cache_node * node)
+ struct cache_node *node)
{
- struct list_head * head;
- struct list_head * pos;
- struct list_head * n;
- struct cache_hash * hash;
+ struct cache_node *pos, *n;
+ struct cache_hash *hash;
int count = -1;
hash = cache->c_hash + cache->hash(key, cache->c_hashsize,
cache->c_hashshift);
- head = &hash->ch_list;
pthread_mutex_lock(&hash->ch_mutex);
- for (pos = head->next, n = pos->next; pos != head;
- pos = n, n = pos->next) {
- if ((struct cache_node *)pos != node)
+ list_for_each_entry_safe(pos, n, &hash->ch_list, cn_hash) {
+ if (pos != node)
continue;
count = __cache_node_purge(cache, node);
@@ -642,12 +623,10 @@ cache_purge(
*/
void
cache_flush(
- struct cache * cache)
+ struct cache *cache)
{
- struct cache_hash * hash;
- struct list_head * head;
- struct list_head * pos;
- struct cache_node * node;
+ struct cache_hash *hash;
+ struct cache_node *node;
int i;
if (!cache->flush)
@@ -657,9 +636,7 @@ cache_flush(
hash = &cache->c_hash[i];
pthread_mutex_lock(&hash->ch_mutex);
- head = &hash->ch_list;
- for (pos = head->next; pos != head; pos = pos->next) {
- node = (struct cache_node *)pos;
+ list_for_each_entry(node, &hash->ch_list, cn_hash) {
pthread_mutex_lock(&node->cn_mutex);
cache->flush(node);
pthread_mutex_unlock(&node->cn_mutex);
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 09/20] cache: embed struct cache in the owner
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
` (7 preceding siblings ...)
2025-08-21 1:09 ` [PATCH 08/20] cache: use modern list iterator macros Darrick J. Wong
@ 2025-08-21 1:10 ` Darrick J. Wong
2025-08-21 1:10 ` [PATCH 10/20] cache: pass cache pointer to callbacks Darrick J. Wong
` (10 subsequent siblings)
19 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:10 UTC (permalink / raw)
To: tytso
Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, amir73il,
joannelkoong, neal
From: Darrick J. Wong <djwong@kernel.org>
It'll be easier to embed a struct cache into the object that owns the
cache rather than passing pointers around. This is the prelude to the
next patch, which will enable cache functions to walk back to the owning
struct.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
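A rough sketch of the calling convention after this change, using a
cut-down hypothetical struct mini_cache instead of the real struct cache:
the owner embeds the cache, init fills in caller-provided storage and
returns 0 or a positive errno, and destroy zeroes the structure so that an
"initialized" check goes false again. All names below are made up for
illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, cut-down cache; not the real struct cache. */
struct mini_cache {
	unsigned int	nbuckets;
	int		*buckets;
};

static bool mini_cache_initialized(const struct mini_cache *c)
{
	return c->buckets != NULL;
}

/* Initialize caller-provided storage; returns 0 or a positive errno. */
static int mini_cache_init(unsigned int nbuckets, struct mini_cache *c)
{
	memset(c, 0, sizeof(*c));
	c->buckets = calloc(nbuckets, sizeof(*c->buckets));
	if (!c->buckets)
		return ENOMEM;
	c->nbuckets = nbuckets;
	return 0;
}

/* Free the buckets and zero the struct so it reads as uninitialized. */
static void mini_cache_destroy(struct mini_cache *c)
{
	free(c->buckets);
	memset(c, 0, sizeof(*c));
}

/* Hypothetical owner that embeds the cache instead of pointing at it. */
struct owner {
	int			id;
	struct mini_cache	cache;
};
```

A caller would then do mini_cache_init(16, &owner.cache) and check the
return value, with no separate allocation to free on the error path.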
lib/support/cache.h | 10 ++++++++--
lib/support/cache.c | 38 ++++++++++++++++++++------------------
2 files changed, 28 insertions(+), 20 deletions(-)
diff --git a/lib/support/cache.h b/lib/support/cache.h
index 16b17a9b7a1a51..993f1385dedcee 100644
--- a/lib/support/cache.h
+++ b/lib/support/cache.h
@@ -122,8 +122,14 @@ struct cache {
unsigned int c_max; /* max nodes ever used */
};
-struct cache *cache_init(int, unsigned int, const struct cache_operations *);
-void cache_destroy(struct cache *);
+static inline bool cache_initialized(const struct cache *cache)
+{
+ return cache->hash != NULL;
+}
+
+int cache_init(int flags, unsigned int size,
+ const struct cache_operations *ops, struct cache *cache);
+void cache_destroy(struct cache *cache);
void cache_walk(struct cache *, cache_walk_t);
void cache_purge(struct cache *);
void cache_flush(struct cache *);
diff --git a/lib/support/cache.c b/lib/support/cache.c
index d8f8231ac36d28..8b4f9f03c3899b 100644
--- a/lib/support/cache.c
+++ b/lib/support/cache.c
@@ -12,6 +12,7 @@
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
+#include <errno.h>
#include "list.h"
#include "cache.h"
@@ -33,23 +34,18 @@
static unsigned int cache_generic_bulkrelse(struct cache *, struct list_head *);
-struct cache *
+int
cache_init(
int flags,
unsigned int hashsize,
- const struct cache_operations *cache_operations)
+ const struct cache_operations *cache_operations,
+ struct cache *cache)
{
- struct cache * cache;
unsigned int i, maxcount;
maxcount = hashsize * HASH_CACHE_RATIO;
- if (!(cache = malloc(sizeof(struct cache))))
- return NULL;
- if (!(cache->c_hash = calloc(hashsize, sizeof(struct cache_hash)))) {
- free(cache);
- return NULL;
- }
+ memset(cache, 0, sizeof(*cache));
cache->c_flags = flags;
cache->c_count = 0;
@@ -57,8 +53,6 @@ cache_init(
cache->c_hits = 0;
cache->c_misses = 0;
cache->c_maxcount = maxcount;
- cache->c_hashsize = hashsize;
- cache->c_hashshift = fls(hashsize) - 1;
cache->hash = cache_operations->hash;
cache->alloc = cache_operations->alloc;
cache->flush = cache_operations->flush;
@@ -70,18 +64,26 @@ cache_init(
cache->put = cache_operations->put;
pthread_mutex_init(&cache->c_mutex, NULL);
+ for (i = 0; i <= CACHE_DIRTY_PRIORITY; i++) {
+ list_head_init(&cache->c_mrus[i].cm_list);
+ cache->c_mrus[i].cm_count = 0;
+ pthread_mutex_init(&cache->c_mrus[i].cm_mutex, NULL);
+ }
+
+ cache->c_hash = calloc(hashsize, sizeof(struct cache_hash));
+ if (!cache->c_hash)
+ return ENOMEM;
+
+ cache->c_hashsize = hashsize;
+ cache->c_hashshift = fls(hashsize) - 1;
+
for (i = 0; i < hashsize; i++) {
list_head_init(&cache->c_hash[i].ch_list);
cache->c_hash[i].ch_count = 0;
pthread_mutex_init(&cache->c_hash[i].ch_mutex, NULL);
}
- for (i = 0; i <= CACHE_DIRTY_PRIORITY; i++) {
- list_head_init(&cache->c_mrus[i].cm_list);
- cache->c_mrus[i].cm_count = 0;
- pthread_mutex_init(&cache->c_mrus[i].cm_mutex, NULL);
- }
- return cache;
+ return 0;
}
static void
@@ -153,7 +155,7 @@ cache_destroy(
}
pthread_mutex_destroy(&cache->c_mutex);
free(cache->c_hash);
- free(cache);
+ memset(cache, 0, sizeof(*cache));
}
static unsigned int
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 10/20] cache: pass cache pointer to callbacks
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
` (8 preceding siblings ...)
2025-08-21 1:10 ` [PATCH 09/20] cache: embed struct cache in the owner Darrick J. Wong
@ 2025-08-21 1:10 ` Darrick J. Wong
2025-08-21 1:10 ` [PATCH 11/20] cache: pass a private data pointer through cache_walk Darrick J. Wong
` (9 subsequent siblings)
19 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:10 UTC (permalink / raw)
To: tytso
Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, amir73il,
joannelkoong, neal
From: Darrick J. Wong <djwong@kernel.org>
Pass the cache pointer to the cache node callbacks so that subsequent
patches don't have to waste memory putting pointers to struct fuse4fs in
the cached objects.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
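To illustrate why the extra argument saves memory, here is a small sketch
(hypothetical types, not the real cache or fuse4fs structures) in which a
flush callback recovers its owner from the cache pointer with container_of
instead of storing a back pointer in every cached node. It assumes the
cache is embedded in the owner, as in the previous patch.

```c
#include <assert.h>
#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct mini_cache;

/* Hypothetical cached object; note there is no owner pointer in it. */
struct mini_node {
	int dirty;
};

typedef int (*mini_flush_t)(struct mini_cache *c, struct mini_node *n);

struct mini_cache {
	mini_flush_t flush;
};

/* Hypothetical owner with the cache embedded in it. */
struct owner {
	int			flushes;
	struct mini_cache	cache;
};

/* Walk back from the cache to the owner; the node stays small. */
static int owner_flush(struct mini_cache *c, struct mini_node *n)
{
	struct owner *o = container_of(c, struct owner, cache);

	o->flushes++;
	n->dirty = 0;
	return 0;	/* zero here means "no longer dirty" */
}
```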
lib/support/cache.h | 12 ++++++------
lib/support/cache.c | 21 +++++++++++----------
2 files changed, 17 insertions(+), 16 deletions(-)
diff --git a/lib/support/cache.h b/lib/support/cache.h
index 993f1385dedcee..0168fdca027896 100644
--- a/lib/support/cache.h
+++ b/lib/support/cache.h
@@ -56,16 +56,16 @@ struct cache_node;
typedef void *cache_key_t;
-typedef void (*cache_walk_t)(struct cache_node *);
-typedef struct cache_node * (*cache_node_alloc_t)(cache_key_t);
-typedef int (*cache_node_flush_t)(struct cache_node *);
-typedef void (*cache_node_relse_t)(struct cache_node *);
+typedef void (*cache_walk_t)(struct cache *c, struct cache_node *cn);
+typedef struct cache_node * (*cache_node_alloc_t)(struct cache *c, cache_key_t k);
+typedef int (*cache_node_flush_t)(struct cache *c, struct cache_node *cn);
+typedef void (*cache_node_relse_t)(struct cache *c, struct cache_node *cn);
typedef unsigned int (*cache_node_hash_t)(cache_key_t, unsigned int,
unsigned int);
typedef int (*cache_node_compare_t)(struct cache_node *, cache_key_t);
typedef unsigned int (*cache_bulk_relse_t)(struct cache *, struct list_head *);
-typedef int (*cache_node_get_t)(struct cache_node *);
-typedef void (*cache_node_put_t)(struct cache_node *);
+typedef int (*cache_node_get_t)(struct cache *c, struct cache_node *cn);
+typedef void (*cache_node_put_t)(struct cache *c, struct cache_node *cn);
struct cache_operations {
cache_node_hash_t hash;
diff --git a/lib/support/cache.c b/lib/support/cache.c
index 8b4f9f03c3899b..2e2e36ccc3ef78 100644
--- a/lib/support/cache.c
+++ b/lib/support/cache.c
@@ -111,7 +111,7 @@ cache_walk(
hash = &cache->c_hash[i];
pthread_mutex_lock(&hash->ch_mutex);
list_for_each_entry(pos, &hash->ch_list, cn_hash)
- visit(pos);
+ visit(cache, pos);
pthread_mutex_unlock(&hash->ch_mutex);
}
}
@@ -125,7 +125,8 @@ cache_walk(
#ifdef CACHE_DEBUG
static void
cache_zero_check(
- struct cache_node * node)
+ struct cache *cache,
+ struct cache_node *node)
{
if (node->cn_count > 0) {
fprintf(stderr, "%s: refcount is %u, not zero (node=%p)\n",
@@ -170,7 +171,7 @@ cache_generic_bulkrelse(
node = list_entry(list->next, struct cache_node, cn_mru);
pthread_mutex_destroy(&node->cn_mutex);
list_del_init(&node->cn_mru);
- cache->relse(node);
+ cache->relse(cache, node);
count++;
}
@@ -237,7 +238,7 @@ cache_shake(
continue;
/* memory pressure is not allowed to release dirty objects */
- if (cache->flush(node) && !purge) {
+ if (cache->flush(cache, node) && !purge) {
list_del(&node->cn_mru);
mru->cm_count--;
node->cn_priority = -1;
@@ -302,7 +303,7 @@ cache_node_allocate(
pthread_mutex_unlock(&cache->c_mutex);
if (!nodesfree)
return NULL;
- node = cache->alloc(key);
+ node = cache->alloc(cache, key);
if (node == NULL) { /* uh-oh */
pthread_mutex_lock(&cache->c_mutex);
cache->c_count--;
@@ -341,7 +342,7 @@ __cache_node_purge(
}
/* can't purge dirty objects */
- if (cache->flush(node)) {
+ if (cache->flush(cache, node)) {
pthread_mutex_unlock(&node->cn_mutex);
return 1;
}
@@ -355,7 +356,7 @@ __cache_node_purge(
pthread_mutex_unlock(&node->cn_mutex);
pthread_mutex_destroy(&node->cn_mutex);
list_del_init(&node->cn_hash);
- cache->relse(node);
+ cache->relse(cache, node);
return 0;
}
@@ -410,7 +411,7 @@ cache_node_get(
pthread_mutex_lock(&node->cn_mutex);
if (node->cn_count == 0 && cache->get) {
- int err = cache->get(node);
+ int err = cache->get(cache, node);
if (err) {
pthread_mutex_unlock(&node->cn_mutex);
goto next_object;
@@ -505,7 +506,7 @@ cache_node_put(
node->cn_count--;
if (node->cn_count == 0 && cache->put)
- cache->put(node);
+ cache->put(cache, node);
if (node->cn_count == 0) {
/* add unreferenced node to appropriate MRU for shaker */
mru = &cache->c_mrus[node->cn_priority];
@@ -640,7 +641,7 @@ cache_flush(
pthread_mutex_lock(&hash->ch_mutex);
list_for_each_entry(node, &hash->ch_list, cn_hash) {
pthread_mutex_lock(&node->cn_mutex);
- cache->flush(node);
+ cache->flush(cache, node);
pthread_mutex_unlock(&node->cn_mutex);
}
pthread_mutex_unlock(&hash->ch_mutex);
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 11/20] cache: pass a private data pointer through cache_walk
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
` (9 preceding siblings ...)
2025-08-21 1:10 ` [PATCH 10/20] cache: pass cache pointer to callbacks Darrick J. Wong
@ 2025-08-21 1:10 ` Darrick J. Wong
2025-08-21 1:11 ` [PATCH 12/20] cache: add a helper to grab a new refcount for a cache_node Darrick J. Wong
` (8 subsequent siblings)
19 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:10 UTC (permalink / raw)
To: tytso
Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, amir73il,
joannelkoong, neal
From: Darrick J. Wong <djwong@kernel.org>
Allow cache_walk callers to pass a private data pointer through to the
callback function.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
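A minimal sketch of how a caller might use the new argument, with an
array-backed hypothetical mini_cache standing in for the real hash table
(no locking shown). The walk function hands the caller's pointer to every
visit, so the callback can accumulate a result without a global variable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct cache_node and struct cache. */
struct mini_node {
	unsigned int refcount;
};

struct mini_cache {
	struct mini_node	*nodes;
	size_t			nr;
};

typedef void (*mini_walk_t)(struct mini_cache *c, struct mini_node *n,
			    void *data);

/* Visit every node, passing the caller's private pointer through. */
static void mini_walk(struct mini_cache *c, mini_walk_t visit, void *data)
{
	size_t i;

	for (i = 0; i < c->nr; i++)
		visit(c, &c->nodes[i], data);
}

/* Example callback: count nodes that still hold references. */
static void count_referenced(struct mini_cache *c, struct mini_node *n,
			     void *data)
{
	unsigned int *count = data;

	(void)c;
	if (n->refcount > 0)
		(*count)++;
}
```

Callers that need no state can simply pass NULL, as cache_destroy_check
does in the diff below.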
lib/support/cache.h | 4 ++--
lib/support/cache.c | 10 ++++++----
2 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/lib/support/cache.h b/lib/support/cache.h
index 0168fdca027896..b18b6d3325e9ad 100644
--- a/lib/support/cache.h
+++ b/lib/support/cache.h
@@ -56,7 +56,7 @@ struct cache_node;
typedef void *cache_key_t;
-typedef void (*cache_walk_t)(struct cache *c, struct cache_node *cn);
+typedef void (*cache_walk_t)(struct cache *c, struct cache_node *cn, void *d);
typedef struct cache_node * (*cache_node_alloc_t)(struct cache *c, cache_key_t k);
typedef int (*cache_node_flush_t)(struct cache *c, struct cache_node *cn);
typedef void (*cache_node_relse_t)(struct cache *c, struct cache_node *cn);
@@ -130,7 +130,7 @@ static inline bool cache_initialized(const struct cache *cache)
int cache_init(int flags, unsigned int size,
const struct cache_operations *ops, struct cache *cache);
void cache_destroy(struct cache *cache);
-void cache_walk(struct cache *, cache_walk_t);
+void cache_walk(struct cache *cache, cache_walk_t fn, void *data);
void cache_purge(struct cache *);
void cache_flush(struct cache *);
diff --git a/lib/support/cache.c b/lib/support/cache.c
index 2e2e36ccc3ef78..606acd5453cf10 100644
--- a/lib/support/cache.c
+++ b/lib/support/cache.c
@@ -101,7 +101,8 @@ cache_expand(
void
cache_walk(
struct cache *cache,
- cache_walk_t visit)
+ cache_walk_t visit,
+ void *data)
{
struct cache_hash *hash;
struct cache_node *pos;
@@ -111,7 +112,7 @@ cache_walk(
hash = &cache->c_hash[i];
pthread_mutex_lock(&hash->ch_mutex);
list_for_each_entry(pos, &hash->ch_list, cn_hash)
- visit(cache, pos);
+ visit(cache, pos, data);
pthread_mutex_unlock(&hash->ch_mutex);
}
}
@@ -126,7 +127,8 @@ cache_walk(
static void
cache_zero_check(
struct cache *cache,
- struct cache_node *node)
+ struct cache_node *node,
+ void *data)
{
if (node->cn_count > 0) {
fprintf(stderr, "%s: refcount is %u, not zero (node=%p)\n",
@@ -134,7 +136,7 @@ cache_zero_check(
cache_abort();
}
}
-#define cache_destroy_check(c) cache_walk((c), cache_zero_check)
+#define cache_destroy_check(c) cache_walk((c), cache_zero_check, NULL)
#else
#define cache_destroy_check(c) do { } while (0)
#endif
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 12/20] cache: add a helper to grab a new refcount for a cache_node
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
` (10 preceding siblings ...)
2025-08-21 1:10 ` [PATCH 11/20] cache: pass a private data pointer through cache_walk Darrick J. Wong
@ 2025-08-21 1:11 ` Darrick J. Wong
2025-08-21 1:11 ` [PATCH 13/20] cache: return results of a cache flush Darrick J. Wong
` (7 subsequent siblings)
19 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:11 UTC (permalink / raw)
To: tytso
Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, amir73il,
joannelkoong, neal
From: Darrick J. Wong <djwong@kernel.org>
Create a helper to bump the refcount of a cache node.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/support/cache.h | 1 +
lib/support/cache.c | 57 +++++++++++++++++++++++++++++----------------------
2 files changed, 33 insertions(+), 25 deletions(-)
diff --git a/lib/support/cache.h b/lib/support/cache.h
index b18b6d3325e9ad..e8f1c82ef7869c 100644
--- a/lib/support/cache.h
+++ b/lib/support/cache.h
@@ -141,5 +141,6 @@ int cache_node_get_priority(struct cache_node *);
int cache_node_purge(struct cache *, cache_key_t, struct cache_node *);
void cache_report(FILE *fp, const char *, struct cache *);
int cache_overflowed(struct cache *);
+struct cache_node *cache_node_grab(struct cache *cache, struct cache_node *node);
#endif /* __CACHE_H__ */
diff --git a/lib/support/cache.c b/lib/support/cache.c
index 606acd5453cf10..49568ffa6de2e4 100644
--- a/lib/support/cache.c
+++ b/lib/support/cache.c
@@ -362,6 +362,35 @@ __cache_node_purge(
return 0;
}
+/* Grab a new refcount to the cache node object. Caller must hold cn_mutex. */
+struct cache_node *cache_node_grab(struct cache *cache, struct cache_node *node)
+{
+ struct cache_mru *mru;
+
+ if (node->cn_count == 0 && cache->get) {
+ int err = cache->get(cache, node);
+ if (err)
+ return NULL;
+ }
+ if (node->cn_count == 0) {
+ ASSERT(node->cn_priority >= 0);
+ ASSERT(!list_empty(&node->cn_mru));
+ mru = &cache->c_mrus[node->cn_priority];
+ pthread_mutex_lock(&mru->cm_mutex);
+ mru->cm_count--;
+ list_del_init(&node->cn_mru);
+ pthread_mutex_unlock(&mru->cm_mutex);
+ if (node->cn_old_priority != -1) {
+ ASSERT(node->cn_priority ==
+ CACHE_DIRTY_PRIORITY);
+ node->cn_priority = node->cn_old_priority;
+ node->cn_old_priority = -1;
+ }
+ }
+ node->cn_count++;
+ return node;
+}
+
/*
* Lookup in the cache hash table. With any luck we'll get a cache
* hit, in which case this will all be over quickly and painlessly.
@@ -377,7 +406,6 @@ cache_node_get(
struct cache_node **nodep)
{
struct cache_hash *hash;
- struct cache_mru *mru;
struct cache_node *node = NULL, *n;
unsigned int hashidx;
int priority = 0;
@@ -411,31 +439,10 @@ cache_node_get(
* from its MRU list, and update stats.
*/
pthread_mutex_lock(&node->cn_mutex);
-
- if (node->cn_count == 0 && cache->get) {
- int err = cache->get(cache, node);
- if (err) {
- pthread_mutex_unlock(&node->cn_mutex);
- goto next_object;
- }
+ if (!cache_node_grab(cache, node)) {
+ pthread_mutex_unlock(&node->cn_mutex);
+ goto next_object;
}
- if (node->cn_count == 0) {
- ASSERT(node->cn_priority >= 0);
- ASSERT(!list_empty(&node->cn_mru));
- mru = &cache->c_mrus[node->cn_priority];
- pthread_mutex_lock(&mru->cm_mutex);
- mru->cm_count--;
- list_del_init(&node->cn_mru);
- pthread_mutex_unlock(&mru->cm_mutex);
- if (node->cn_old_priority != -1) {
- ASSERT(node->cn_priority ==
- CACHE_DIRTY_PRIORITY);
- node->cn_priority = node->cn_old_priority;
- node->cn_old_priority = -1;
- }
- }
- node->cn_count++;
-
pthread_mutex_unlock(&node->cn_mutex);
pthread_mutex_unlock(&hash->ch_mutex);
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 13/20] cache: return results of a cache flush
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
` (11 preceding siblings ...)
2025-08-21 1:11 ` [PATCH 12/20] cache: add a helper to grab a new refcount for a cache_node Darrick J. Wong
@ 2025-08-21 1:11 ` Darrick J. Wong
2025-08-21 1:11 ` [PATCH 14/20] cache: add a "get only if incore" flag to cache_node_get Darrick J. Wong
` (6 subsequent siblings)
19 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:11 UTC (permalink / raw)
To: tytso
Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, amir73il,
joannelkoong, neal
From: Darrick J. Wong <djwong@kernel.org>
Modify cache_flush to return whether or not there were errors whilst
flushing the cache.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
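The return convention is easy to get backwards, so here is a small sketch
with hypothetical types: the per-node flush returns true when the node is
still dirty afterwards, while the whole-cache flush returns true when
nothing is left dirty. Every node is attempted even after an earlier
failure; the results are ORed together rather than short-circuited.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node; "writable" simulates whether writeback can succeed. */
struct mini_node {
	bool dirty;
	bool writable;
};

/* Per-node flush: returns true if the node is STILL dirty afterwards. */
static bool mini_node_flush(struct mini_node *n)
{
	if (n->dirty && n->writable)
		n->dirty = false;
	return n->dirty;
}

/* Whole-cache flush: returns true only if every node came out clean. */
static bool mini_flush_all(struct mini_node *nodes, size_t nr)
{
	bool still_dirty = false;
	size_t i;

	for (i = 0; i < nr; i++)
		still_dirty |= mini_node_flush(&nodes[i]);

	return !still_dirty;
}
```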
lib/support/cache.h | 4 ++--
lib/support/cache.c | 11 +++++++----
2 files changed, 9 insertions(+), 6 deletions(-)
diff --git a/lib/support/cache.h b/lib/support/cache.h
index e8f1c82ef7869c..8d39ca5c02a285 100644
--- a/lib/support/cache.h
+++ b/lib/support/cache.h
@@ -58,7 +58,7 @@ typedef void *cache_key_t;
typedef void (*cache_walk_t)(struct cache *c, struct cache_node *cn, void *d);
typedef struct cache_node * (*cache_node_alloc_t)(struct cache *c, cache_key_t k);
-typedef int (*cache_node_flush_t)(struct cache *c, struct cache_node *cn);
+typedef bool (*cache_node_flush_t)(struct cache *c, struct cache_node *cn);
typedef void (*cache_node_relse_t)(struct cache *c, struct cache_node *cn);
typedef unsigned int (*cache_node_hash_t)(cache_key_t, unsigned int,
unsigned int);
@@ -132,7 +132,7 @@ int cache_init(int flags, unsigned int size,
void cache_destroy(struct cache *cache);
void cache_walk(struct cache *cache, cache_walk_t fn, void *data);
void cache_purge(struct cache *);
-void cache_flush(struct cache *);
+bool cache_flush(struct cache *cache);
int cache_node_get(struct cache *, cache_key_t, struct cache_node **);
void cache_node_put(struct cache *, struct cache_node *);
diff --git a/lib/support/cache.c b/lib/support/cache.c
index 49568ffa6de2e4..fa07b4ad8222d2 100644
--- a/lib/support/cache.c
+++ b/lib/support/cache.c
@@ -631,18 +631,19 @@ cache_purge(
}
/*
- * Flush all nodes in the cache to disk.
+ * Flush all nodes in the cache to disk. Returns true if the flush succeeded.
*/
-void
+bool
cache_flush(
struct cache *cache)
{
struct cache_hash *hash;
struct cache_node *node;
int i;
+ bool still_dirty = false;
if (!cache->flush)
- return;
+ return true;
for (i = 0; i < cache->c_hashsize; i++) {
hash = &cache->c_hash[i];
@@ -650,11 +651,13 @@ cache_flush(
pthread_mutex_lock(&hash->ch_mutex);
list_for_each_entry(node, &hash->ch_list, cn_hash) {
pthread_mutex_lock(&node->cn_mutex);
- cache->flush(cache, node);
+ still_dirty |= cache->flush(cache, node);
pthread_mutex_unlock(&node->cn_mutex);
}
pthread_mutex_unlock(&hash->ch_mutex);
}
+
+ return !still_dirty;
}
#define HASH_REPORT (3 * HASH_CACHE_RATIO)
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 14/20] cache: add a "get only if incore" flag to cache_node_get
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
` (12 preceding siblings ...)
2025-08-21 1:11 ` [PATCH 13/20] cache: return results of a cache flush Darrick J. Wong
@ 2025-08-21 1:11 ` Darrick J. Wong
2025-08-21 1:11 ` [PATCH 15/20] cache: support gradual expansion Darrick J. Wong
` (5 subsequent siblings)
19 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:11 UTC (permalink / raw)
To: tytso
Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, amir73il,
joannelkoong, neal
From: Darrick J. Wong <djwong@kernel.org>
Add a new flag to cache_node_get so that callers can specify that they
only want the cache to return an existing cache node, and not create a
new one.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
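A sketch of the intended semantics, using a tiny fixed-size hypothetical
cache rather than the real one (MINI_GET_INCORE, mini_get and the -1 "cache
full" return are all made up for illustration). With the flag set, a miss
is not an error: the call succeeds and hands back a NULL node, so callers
must check the node pointer and not only the return value.

```c
#include <assert.h>
#include <stddef.h>

#define MINI_GET_INCORE	(1U << 0)	/* don't allocate a new node */
#define MINI_SLOTS	4

/* Hypothetical fixed-size cache keyed by an int. */
struct mini_node {
	int key;
	int used;
};

struct mini_cache {
	struct mini_node slots[MINI_SLOTS];
};

/* Returns 0 on success (node may be NULL with INCORE), -1 if full. */
static int mini_get(struct mini_cache *c, int key, unsigned int flags,
		    struct mini_node **nodep)
{
	size_t i;

	*nodep = NULL;
	for (i = 0; i < MINI_SLOTS; i++) {
		if (c->slots[i].used && c->slots[i].key == key) {
			*nodep = &c->slots[i];
			return 0;
		}
	}

	/* Lookup-only callers stop here with *nodep == NULL. */
	if (flags & MINI_GET_INCORE)
		return 0;

	for (i = 0; i < MINI_SLOTS; i++) {
		if (!c->slots[i].used) {
			c->slots[i].used = 1;
			c->slots[i].key = key;
			*nodep = &c->slots[i];
			return 0;
		}
	}
	return -1;
}
```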
lib/support/cache.h | 5 ++++-
lib/support/cache.c | 7 +++++++
2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/lib/support/cache.h b/lib/support/cache.h
index 8d39ca5c02a285..98b2182d49a6e0 100644
--- a/lib/support/cache.h
+++ b/lib/support/cache.h
@@ -134,7 +134,10 @@ void cache_walk(struct cache *cache, cache_walk_t fn, void *data);
void cache_purge(struct cache *);
bool cache_flush(struct cache *cache);
-int cache_node_get(struct cache *, cache_key_t, struct cache_node **);
+/* don't allocate a new node */
+#define CACHE_GET_INCORE (1U << 0)
+int cache_node_get(struct cache *c, cache_key_t key, unsigned int cgflags,
+ struct cache_node **nodep);
void cache_node_put(struct cache *, struct cache_node *);
void cache_node_set_priority(struct cache *, struct cache_node *, int);
int cache_node_get_priority(struct cache_node *);
diff --git a/lib/support/cache.c b/lib/support/cache.c
index fa07b4ad8222d2..9da6c59b3b6391 100644
--- a/lib/support/cache.c
+++ b/lib/support/cache.c
@@ -403,6 +403,7 @@ int
cache_node_get(
struct cache *cache,
cache_key_t key,
+ unsigned int cgflags,
struct cache_node **nodep)
{
struct cache_hash *hash;
@@ -456,6 +457,12 @@ cache_node_get(
continue; /* what the hell, gcc? */
}
pthread_mutex_unlock(&hash->ch_mutex);
+
+ if (cgflags & CACHE_GET_INCORE) {
+ *nodep = NULL;
+ return 0;
+ }
+
/*
* not found, allocate a new entry
*/
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 15/20] cache: support gradual expansion
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
` (13 preceding siblings ...)
2025-08-21 1:11 ` [PATCH 14/20] cache: add a "get only if incore" flag to cache_node_get Darrick J. Wong
@ 2025-08-21 1:11 ` Darrick J. Wong
2025-08-21 1:12 ` [PATCH 16/20] cache: implement automatic shrinking Darrick J. Wong
` (4 subsequent siblings)
19 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:11 UTC (permalink / raw)
To: tytso
Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, amir73il,
joannelkoong, neal
From: Darrick J. Wong <djwong@kernel.org>
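It's probably not a good idea to keep doubling the cache size once it has
grown beyond some arbitrary limit, so add an optional resize callback that
lets cache users choose their own growth policy if they want to.  If no
callback is supplied, or it doesn't return a larger size, cache_expand
falls back to doubling.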
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
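The expansion rule can be sketched in isolation as follows. This is a
simplified model, not the patched function: the callback here takes only
the current size (the real one also takes the cache pointer), and
next_maxcount is a hypothetical name. It shows the 5/4 growth step and the
fallback to doubling when the callback is missing or makes no progress,
which happens for very small sizes because of integer division.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified resize callback: takes only the current size. */
typedef unsigned int (*resize_fn)(unsigned int curr_size);

/* Grow by 25%, mirroring cache_gradual_resize in this patch. */
static unsigned int gradual_resize(unsigned int curr_size)
{
	return curr_size * 5 / 4;
}

/* Compute the next maxcount; fall back to doubling if needed. */
static unsigned int next_maxcount(resize_fn resize, unsigned int maxcount)
{
	unsigned int new_size = 0;

	if (resize)
		new_size = resize(maxcount);
	if (new_size <= maxcount)
		new_size = maxcount * 2;
	return new_size;
}
```

For example, a maxcount of 64 grows to 80 with the gradual policy and to
128 without a callback, while a maxcount of 3 still doubles to 6 because
3 * 5 / 4 rounds down to 3.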
lib/support/cache.h | 10 ++++++++++
lib/support/cache.c | 12 ++++++++++--
2 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/lib/support/cache.h b/lib/support/cache.h
index 98b2182d49a6e0..ae37945c545f46 100644
--- a/lib/support/cache.h
+++ b/lib/support/cache.h
@@ -66,6 +66,14 @@ typedef int (*cache_node_compare_t)(struct cache_node *, cache_key_t);
typedef unsigned int (*cache_bulk_relse_t)(struct cache *, struct list_head *);
typedef int (*cache_node_get_t)(struct cache *c, struct cache_node *cn);
typedef void (*cache_node_put_t)(struct cache *c, struct cache_node *cn);
+typedef unsigned int (*cache_node_resize_t)(const struct cache *c,
+ unsigned int curr_size);
+
+static inline unsigned int cache_gradual_resize(const struct cache *cache,
+ unsigned int curr_size)
+{
+ return curr_size * 5 / 4;
+}
struct cache_operations {
cache_node_hash_t hash;
@@ -76,6 +84,7 @@ struct cache_operations {
cache_bulk_relse_t bulkrelse; /* optional */
cache_node_get_t get; /* optional */
cache_node_put_t put; /* optional */
+ cache_node_resize_t resize; /* optional */
};
struct cache_hash {
@@ -113,6 +122,7 @@ struct cache {
cache_bulk_relse_t bulkrelse; /* bulk release routine */
cache_node_get_t get; /* prepare cache node after get */
cache_node_put_t put; /* prepare to put cache node */
+ cache_node_resize_t resize; /* compute new maxcount */
unsigned int c_hashsize; /* hash bucket count */
unsigned int c_hashshift; /* hash key shift */
struct cache_hash *c_hash; /* hash table buckets */
diff --git a/lib/support/cache.c b/lib/support/cache.c
index 9da6c59b3b6391..dbaddc1bd36d3d 100644
--- a/lib/support/cache.c
+++ b/lib/support/cache.c
@@ -62,6 +62,7 @@ cache_init(
cache_operations->bulkrelse : cache_generic_bulkrelse;
cache->get = cache_operations->get;
cache->put = cache_operations->put;
+ cache->resize = cache_operations->resize;
pthread_mutex_init(&cache->c_mutex, NULL);
for (i = 0; i <= CACHE_DIRTY_PRIORITY; i++) {
@@ -90,11 +91,18 @@ static void
cache_expand(
struct cache * cache)
{
+ unsigned int new_size = 0;
+
pthread_mutex_lock(&cache->c_mutex);
+ if (cache->resize)
+ new_size = cache->resize(cache, cache->c_maxcount);
+ if (new_size <= cache->c_maxcount)
+ new_size = cache->c_maxcount * 2;
#ifdef CACHE_DEBUG
- fprintf(stderr, "doubling cache size to %d\n", 2 * cache->c_maxcount);
+ fprintf(stderr, "increasing cache max size from %u to %u\n",
+ cache->c_maxcount, new_size);
#endif
- cache->c_maxcount *= 2;
+ cache->c_maxcount = new_size;
pthread_mutex_unlock(&cache->c_mutex);
}
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 16/20] cache: implement automatic shrinking
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
` (14 preceding siblings ...)
2025-08-21 1:11 ` [PATCH 15/20] cache: support gradual expansion Darrick J. Wong
@ 2025-08-21 1:12 ` Darrick J. Wong
2025-08-21 1:12 ` [PATCH 17/20] fuse4fs: add cache to track open files Darrick J. Wong
` (3 subsequent siblings)
19 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:12 UTC (permalink / raw)
To: tytso
Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, amir73il,
joannelkoong, neal
From: Darrick J. Wong <djwong@kernel.org>
Shrink the cache whenever all of the following are true: maxcount has been
expanded beyond its initial value, we release a cached object to one of
the MRU lists, and the number of objects sitting on the MRU lists is enough
to drop the cache count down a level.  This enables a cache to reduce its
memory consumption after a spike in which reclamation wasn't possible.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
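A sketch of the resize policy with the new direction argument. The
gradual_resize function mirrors cache_gradual_resize from this patch minus
the cache pointer; shrink_target is a hypothetical helper, and clamping at
the original maximum is my assumption based on the commit message (shrink
only once maxcount has grown past its initial value), not something taken
from the patched code.

```c
#include <assert.h>
#include <stddef.h>

/* dir > 0 grows by 25%, dir < 0 shrinks by 10%, dir == 0 is a no-op. */
static unsigned int gradual_resize(unsigned int curr_size, int dir)
{
	if (dir < 0)
		return curr_size * 9 / 10;
	else if (dir > 0)
		return curr_size * 5 / 4;
	return curr_size;
}

/*
 * Hypothetical: the next lower maxcount, never below the original
 * maximum.  The clamp is an assumption, not the patch's exact logic.
 */
static unsigned int shrink_target(unsigned int maxcount,
				  unsigned int orig_max)
{
	unsigned int new_size;

	if (maxcount <= orig_max)
		return maxcount;

	new_size = gradual_resize(maxcount, -1);
	if (new_size < orig_max)
		new_size = orig_max;
	return new_size;
}
```

Note that the two steps are not inverses: growing 100 gives 125, and
shrinking 125 gives 112, so a cache takes more shrink steps than expand
steps to return to its original size.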
lib/support/cache.h | 17 ++++++-
lib/support/cache.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 126 insertions(+), 9 deletions(-)
diff --git a/lib/support/cache.h b/lib/support/cache.h
index ae37945c545f46..cd738b6cd3a460 100644
--- a/lib/support/cache.h
+++ b/lib/support/cache.h
@@ -16,6 +16,9 @@
*/
#define CACHE_MISCOMPARE_PURGE (1 << 0)
+/* Automatically shrink the cache's max_count when possible. */
+#define CACHE_CAN_SHRINK (1U << 1)
+
/*
* cache object campare return values
*/
@@ -67,12 +70,18 @@ typedef unsigned int (*cache_bulk_relse_t)(struct cache *, struct list_head *);
typedef int (*cache_node_get_t)(struct cache *c, struct cache_node *cn);
typedef void (*cache_node_put_t)(struct cache *c, struct cache_node *cn);
typedef unsigned int (*cache_node_resize_t)(const struct cache *c,
- unsigned int curr_size);
+ unsigned int curr_size,
+ int dir);
static inline unsigned int cache_gradual_resize(const struct cache *cache,
- unsigned int curr_size)
+ unsigned int curr_size,
+ int dir)
{
- return curr_size * 5 / 4;
+ if (dir < 0)
+ return curr_size * 9 / 10;
+ else if (dir > 0)
+ return curr_size * 5 / 4;
+ return curr_size;
}
struct cache_operations {
@@ -111,6 +120,7 @@ struct cache_node {
struct cache {
int c_flags; /* behavioural flags */
+ unsigned int c_orig_max; /* original max cache nodes */
unsigned int c_maxcount; /* max cache nodes */
unsigned int c_count; /* count of nodes */
pthread_mutex_t c_mutex; /* node count mutex */
@@ -143,6 +153,7 @@ void cache_destroy(struct cache *cache);
void cache_walk(struct cache *cache, cache_walk_t fn, void *data);
void cache_purge(struct cache *);
bool cache_flush(struct cache *cache);
+void cache_shrink(struct cache *cache);
/* don't allocate a new node */
#define CACHE_GET_INCORE (1U << 0)
diff --git a/lib/support/cache.c b/lib/support/cache.c
index dbaddc1bd36d3d..7e1ddc3cc8788d 100644
--- a/lib/support/cache.c
+++ b/lib/support/cache.c
@@ -53,6 +53,7 @@ cache_init(
cache->c_hits = 0;
cache->c_misses = 0;
cache->c_maxcount = maxcount;
+ cache->c_orig_max = maxcount;
cache->hash = cache_operations->hash;
cache->alloc = cache_operations->alloc;
cache->flush = cache_operations->flush;
@@ -95,7 +96,7 @@ cache_expand(
pthread_mutex_lock(&cache->c_mutex);
if (cache->resize)
- new_size = cache->resize(cache, cache->c_maxcount);
+ new_size = cache->resize(cache, cache->c_maxcount, 1);
if (new_size <= cache->c_maxcount)
new_size = cache->c_maxcount * 2;
#ifdef CACHE_DEBUG
@@ -226,7 +227,8 @@ static unsigned int
cache_shake(
struct cache * cache,
unsigned int priority,
- bool purge)
+ bool purge,
+ unsigned int nr_to_shake)
{
struct cache_mru *mru;
struct cache_hash *hash;
@@ -274,7 +276,7 @@ cache_shake(
pthread_mutex_unlock(&node->cn_mutex);
count++;
- if (!purge && count == CACHE_SHAKE_COUNT)
+ if (!purge && count == nr_to_shake)
break;
}
pthread_mutex_unlock(&mru->cm_mutex);
@@ -287,7 +289,7 @@ cache_shake(
pthread_mutex_unlock(&cache->c_mutex);
}
- return (count == CACHE_SHAKE_COUNT) ? priority : ++priority;
+ return (count == nr_to_shake) ? priority : ++priority;
}
/*
@@ -477,7 +479,7 @@ cache_node_get(
node = cache_node_allocate(cache, key);
if (node)
break;
- priority = cache_shake(cache, priority, false);
+ priority = cache_shake(cache, priority, false, CACHE_SHAKE_COUNT);
/*
* We start at 0; if we free CACHE_SHAKE_COUNT we get
* back the same priority, if not we get back priority+1.
@@ -507,12 +509,112 @@ cache_node_get(
return 1;
}
+static unsigned int cache_mru_count(const struct cache *cache)
+{
+ const struct cache_mru *mru = cache->c_mrus;
+ unsigned int mru_count = 0;
+ unsigned int i;
+
+ for (i = 0; i < CACHE_NR_PRIORITIES; i++, mru++)
+ mru_count += mru->cm_count;
+
+ return mru_count;
+}
+
+
+void cache_shrink(struct cache *cache)
+{
+ unsigned int mru_count = 0;
+ unsigned int threshold = 0;
+ unsigned int priority = 0;
+ unsigned int new_size;
+
+ pthread_mutex_lock(&cache->c_mutex);
+ /* Don't shrink below the original cache size */
+ if (cache->c_maxcount <= cache->c_orig_max)
+ goto out_unlock;
+
+ mru_count = cache_mru_count(cache);
+
+ /*
+ * If there's not even a batch of nodes on the MRU to try to free,
+ * don't bother with the rest.
+ */
+ if (mru_count < CACHE_SHAKE_COUNT)
+ goto out_unlock;
+
+ /*
+ * Figure out the next step down in size, but don't go below the
+ * original size.
+ */
+ if (cache->resize)
+ new_size = cache->resize(cache, cache->c_maxcount, -1);
+ else
+ new_size = cache->c_maxcount / 2;
+ if (new_size >= cache->c_maxcount)
+ goto out_unlock;
+ if (new_size < cache->c_orig_max)
+ new_size = cache->c_orig_max;
+
+ /*
+ * If we can't purge enough nodes to get the node count below new_size,
+ * don't resize the cache.
+ */
+ if (cache->c_count - mru_count >= new_size)
+ goto out_unlock;
+
+#ifdef CACHE_DEBUG
+ fprintf(stderr, "decreasing cache max size from %u to %u (currently %u)\n",
+ cache->c_maxcount, new_size, cache->c_count);
+#endif
+ cache->c_maxcount = new_size;
+
+ /* Try to reduce the number of cached objects. */
+ do {
+ unsigned int new_priority;
+
+ /*
+ * The threshold is the amount we need to purge to get c_count
+ * below the new maxcount. Try to free some objects off the
+ * MRU. Drop c_mutex because cache_shake will take it.
+ */
+ threshold = cache->c_count - new_size;
+ pthread_mutex_unlock(&cache->c_mutex);
+
+ new_priority = cache_shake(cache, priority, false, threshold);
+
+ /* Either we made no progress or we ran out of MRU levels */
+ if (new_priority == priority ||
+ new_priority > CACHE_MAX_PRIORITY)
+ return;
+ priority = new_priority;
+
+ pthread_mutex_lock(&cache->c_mutex);
+ /*
+ * Someone could have walked in and changed the cache maxsize
+ * again while we had the lock dropped. If that happened, stop
+ * clearing.
+ */
+ if (cache->c_maxcount != new_size)
+ goto out_unlock;
+
+ mru_count = cache_mru_count(cache);
+ if (cache->c_count - mru_count >= new_size)
+ goto out_unlock;
+ } while (1);
+
+out_unlock:
+ pthread_mutex_unlock(&cache->c_mutex);
+ return;
+}
+
void
cache_node_put(
struct cache * cache,
struct cache_node * node)
{
struct cache_mru * mru;
+ bool was_put = false;
pthread_mutex_lock(&node->cn_mutex);
#ifdef CACHE_DEBUG
@@ -528,6 +630,7 @@ cache_node_put(
}
#endif
node->cn_count--;
+ was_put = (node->cn_count == 0);
if (node->cn_count == 0 && cache->put)
cache->put(cache, node);
@@ -541,6 +644,9 @@ cache_node_put(
}
pthread_mutex_unlock(&node->cn_mutex);
+
+ if (was_put && (cache->c_flags & CACHE_CAN_SHRINK))
+ cache_shrink(cache);
}
void
@@ -632,7 +738,7 @@ cache_purge(
int i;
for (i = 0; i <= CACHE_DIRTY_PRIORITY; i++)
- cache_shake(cache, i, true);
+ cache_shake(cache, i, true, CACHE_SHAKE_COUNT);
#ifdef CACHE_DEBUG
if (cache->c_count != 0) {
^ permalink raw reply related [flat|nested] 210+ messages in thread
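The shrink decision in patch 16 can be read as a pure function of four counters. The following is a minimal sketch of that decision only, without locking and without the cache_shake loop. The struct, the function names, and the value of SKETCH_SHAKE_COUNT are assumptions made for illustration; only the 9/10 and 5/4 ratios and the order of the checks are taken from the patch.

```c
#include <assert.h>

/* Assumed batch size; stands in for CACHE_SHAKE_COUNT. */
#define SKETCH_SHAKE_COUNT 64

/* Hypothetical stand-in for the fields that cache_shrink() consults. */
struct shrink_state {
	unsigned int orig_max;	/* like c_orig_max */
	unsigned int maxcount;	/* like c_maxcount */
	unsigned int count;	/* like c_count */
	unsigned int mru_count;	/* nodes sitting on the MRU lists */
};

/* Same ratios as cache_gradual_resize(): shrink to 9/10, grow to 5/4. */
static unsigned int sketch_resize(unsigned int curr_size, int dir)
{
	if (dir < 0)
		return curr_size * 9 / 10;
	if (dir > 0)
		return curr_size * 5 / 4;
	return curr_size;
}

/*
 * Return the new maximum if one shrink step is allowed, or the current
 * maximum if any of the preconditions fails.
 */
static unsigned int sketch_shrink_target(const struct shrink_state *s)
{
	unsigned int new_size;

	/* Never expanded, so there is nothing to give back. */
	if (s->maxcount <= s->orig_max)
		return s->maxcount;
	/* Not even one batch of reclaimable nodes. */
	if (s->mru_count < SKETCH_SHAKE_COUNT)
		return s->maxcount;

	new_size = sketch_resize(s->maxcount, -1);
	if (new_size >= s->maxcount)
		return s->maxcount;
	/* Don't go below the original size. */
	if (new_size < s->orig_max)
		new_size = s->orig_max;
	/* Freeing the whole MRU would still leave too many nodes. */
	if (s->count - s->mru_count >= new_size)
		return s->maxcount;
	return new_size;
}
```

With an original size of 1024 and a current maximum of 1280, one step down lands on 1152, provided that enough nodes are idle on the MRU lists; a maximum of 1100 is clamped to 1024 rather than dropping to 990.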
* [PATCH 17/20] fuse4fs: add cache to track open files
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
` (15 preceding siblings ...)
2025-08-21 1:12 ` [PATCH 16/20] cache: implement automatic shrinking Darrick J. Wong
@ 2025-08-21 1:12 ` Darrick J. Wong
2025-08-21 1:12 ` [PATCH 18/20] fuse4fs: use the orphaned inode list Darrick J. Wong
` (2 subsequent siblings)
19 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:12 UTC (permalink / raw)
To: tytso
Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, amir73il,
joannelkoong, neal
From: Darrick J. Wong <djwong@kernel.org>
Add our own inode cache so that we can track open files.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/support/cache.h | 7 +++
misc/Makefile.in | 3 +
misc/fuse4fs.c | 132 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 141 insertions(+), 1 deletion(-)
diff --git a/lib/support/cache.h b/lib/support/cache.h
index cd738b6cd3a460..f482948a3b6331 100644
--- a/lib/support/cache.h
+++ b/lib/support/cache.h
@@ -6,6 +6,13 @@
#ifndef __CACHE_H__
#define __CACHE_H__
+/* 2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
+#define GOLDEN_RATIO_PRIME 0x9e37fffffffc0001UL
+#ifndef CACHE_LINE_SIZE
+/* if the system didn't tell us, guess something reasonable */
+#define CACHE_LINE_SIZE 64
+#endif
+
/*
* initialisation flags
*/
diff --git a/misc/Makefile.in b/misc/Makefile.in
index edf7f356f6d0e8..36694d682d3b59 100644
--- a/misc/Makefile.in
+++ b/misc/Makefile.in
@@ -900,7 +900,8 @@ fuse4fs.o: $(srcdir)/fuse4fs.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/ext2fs/ext2_ext_attr.h $(top_srcdir)/lib/ext2fs/hashmap.h \
$(top_srcdir)/lib/ext2fs/bitops.h $(top_srcdir)/lib/ext2fs/ext2fsP.h \
$(top_srcdir)/lib/ext2fs/ext2fs.h $(top_srcdir)/version.h \
- $(top_srcdir)/lib/e2p/e2p.h
+ $(top_srcdir)/lib/e2p/e2p.h $(top_srcdir)/lib/support/cache.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/support/xbitops.h
e2fuzz.o: $(srcdir)/e2fuzz.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(top_srcdir)/lib/ext2fs/ext2_fs.h \
$(top_builddir)/lib/ext2fs/ext2_types.h $(top_srcdir)/lib/ext2fs/ext2fs.h \
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 0dd47dcf18d77a..e2a9e7bfe54b00 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -27,6 +27,7 @@
#include <unistd.h>
#include <ctype.h>
#include <stdbool.h>
+#include <assert.h>
#define FUSE_DARWIN_ENABLE_EXTENSIONS 0
#ifdef __SET_FOB_FOR_FUSE
# error Do not set magic value __SET_FOB_FOR_FUSE!!!!
@@ -49,6 +50,8 @@
#include "ext2fs/ext2fs.h"
#include "ext2fs/ext2_fs.h"
#include "ext2fs/ext2fsP.h"
+#include "support/list.h"
+#include "support/cache.h"
#include "../version.h"
#include "uuid/uuid.h"
@@ -205,6 +208,7 @@ int journal_enable_debug = -1;
#define FUSE4FS_FILE_MAGIC (0xEF53DEAFUL)
struct fuse4fs_file_handle {
unsigned long magic;
+ struct fuse4fs_inode *fi;
ext2_ino_t ino;
int open_flags;
};
@@ -252,6 +256,7 @@ struct fuse4fs {
uint8_t timing;
#endif
struct fuse_session *fuse;
+ struct cache inodes;
};
#define FUSE4FS_CHECK_HANDLE(req, fh) \
@@ -346,6 +351,115 @@ static inline int u_log2(unsigned int arg)
return l;
}
+struct fuse4fs_inode {
+ struct cache_node i_cnode;
+ ext2_ino_t i_ino;
+ unsigned int i_open_count;
+};
+
+struct fuse4fs_ikey {
+ ext2_ino_t i_ino;
+};
+
+#define ICKEY(key) ((struct fuse4fs_ikey *)(key))
+#define ICNODE(node) (container_of((node), struct fuse4fs_inode, i_cnode))
+
+static unsigned int
+icache_hash(cache_key_t key, unsigned int hashsize, unsigned int hashshift)
+{
+ uint64_t hashval = ICKEY(key)->i_ino;
+ uint64_t tmp;
+
+ tmp = hashval ^ (GOLDEN_RATIO_PRIME + hashval) / CACHE_LINE_SIZE;
+ tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> hashshift);
+ return tmp % hashsize;
+}
+
+static int icache_compare(struct cache_node *node, cache_key_t key)
+{
+ struct fuse4fs_inode *fi = ICNODE(node);
+ struct fuse4fs_ikey *ikey = ICKEY(key);
+
+ if (fi->i_ino == ikey->i_ino)
+ return CACHE_HIT;
+
+ return CACHE_MISS;
+}
+
+static struct cache_node *icache_alloc(struct cache *c, cache_key_t key)
+{
+ struct fuse4fs_ikey *ikey = ICKEY(key);
+ struct fuse4fs_inode *fi;
+
+ fi = calloc(1, sizeof(struct fuse4fs_inode));
+ if (!fi)
+ return NULL;
+
+ fi->i_ino = ikey->i_ino;
+ return &fi->i_cnode;
+}
+
+static bool icache_flush(struct cache *c, struct cache_node *node)
+{
+ return false;
+}
+
+static void icache_relse(struct cache *c, struct cache_node *node)
+{
+ struct fuse4fs_inode *fi = ICNODE(node);
+
+ assert(fi->i_open_count == 0);
+ free(fi);
+}
+
+static unsigned int icache_bulkrelse(struct cache *cache,
+ struct list_head *list)
+{
+ struct cache_node *cn, *n;
+ int count = 0;
+
+ if (list_empty(list))
+ return 0;
+
+ list_for_each_entry_safe(cn, n, list, cn_mru) {
+ icache_relse(cache, cn);
+ count++;
+ }
+
+ return count;
+}
+
+static const struct cache_operations icache_ops = {
+ .hash = icache_hash,
+ .alloc = icache_alloc,
+ .flush = icache_flush,
+ .relse = icache_relse,
+ .compare = icache_compare,
+ .bulkrelse = icache_bulkrelse,
+ .resize = cache_gradual_resize,
+};
+
+static errcode_t fuse4fs_iget(struct fuse4fs *ff, ext2_ino_t ino,
+ struct fuse4fs_inode **fip)
+{
+ struct fuse4fs_ikey ikey = {
+ .i_ino = ino,
+ };
+ struct cache_node *node = NULL;
+
+ cache_node_get(&ff->inodes, &ikey, 0, &node);
+ if (!node)
+ return ENOMEM;
+
+ *fip = ICNODE(node);
+ return 0;
+}
+
+static void fuse4fs_iput(struct fuse4fs *ff, struct fuse4fs_inode *fi)
+{
+ cache_node_put(&ff->inodes, &fi->i_cnode);
+}
+
static inline blk64_t FUSE4FS_B_TO_FSBT(const struct fuse4fs *ff, off_t pos)
{
return pos >> ff->blocklog;
@@ -949,6 +1063,11 @@ static void fuse4fs_unmount(struct fuse4fs *ff)
if (!ff->fs)
return;
+ if (cache_initialized(&ff->inodes)) {
+ cache_purge(&ff->inodes);
+ cache_destroy(&ff->inodes);
+ }
+
err = ext2fs_close(ff->fs);
if (err)
err_printf(ff, "%s\n", error_message(err));
@@ -995,6 +1114,10 @@ static errcode_t fuse4fs_open(struct fuse4fs *ff, int libext2_flags)
return err;
}
+ err = cache_init(CACHE_CAN_SHRINK, 1U << 10, &icache_ops, &ff->inodes);
+ if (err)
+ return translate_error(ff->fs, 0, err);
+
ff->fs->priv_data = ff;
ff->blocklog = u_log2(ff->fs->blocksize);
ff->blockmask = ff->fs->blocksize - 1;
@@ -2049,6 +2172,7 @@ static int fuse4fs_remove_inode(struct fuse4fs *ff, ext2_ino_t ino)
if (inode.i_links_count)
goto write_out;
+
if (ext2fs_has_feature_ea_inode(fs->super)) {
ret = fuse4fs_remove_ea_inodes(ff, ino, &inode);
if (ret)
@@ -2957,6 +3081,13 @@ static int fuse4fs_open_file(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
goto out;
}
+ err = fuse4fs_iget(ff, file->ino, &file->fi);
+ if (err) {
+ ret = translate_error(fs, 0, err);
+ goto out;
+ }
+ file->fi->i_open_count++;
+
fuse4fs_set_handle(fp, file);
out:
@@ -3144,6 +3275,7 @@ static void op_release(fuse_req_t req, fuse_ino_t fino EXT2FS_ATTR((unused)),
ret = translate_error(fs, fh->ino, err);
}
+ fuse4fs_iput(ff, fh->fi);
fp->fh = 0;
fuse4fs_finish(ff, ret);
^ permalink raw reply related [flat|nested] 210+ messages in thread
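The inode cache in patch 17 buckets entries by hashing the inode number. Below is a standalone sketch of the same arithmetic as icache_hash, for experimenting with bucket distribution outside the tree. The function name is hypothetical, and UINT64_C() is used instead of the UL suffix as an assumption so that the constant stays 64 bits wide on 32-bit builds.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_GOLDEN_RATIO_PRIME	UINT64_C(0x9e37fffffffc0001)
#define SKETCH_CACHE_LINE_SIZE		64

/*
 * Map an inode number to a bucket index in [0, hashsize).  hashsize must
 * be nonzero and hashshift must be less than 64.
 */
static unsigned int sketch_icache_hash(uint32_t ino, unsigned int hashsize,
				       unsigned int hashshift)
{
	uint64_t hashval = ino;
	uint64_t tmp;

	/* Division binds tighter than xor, as in the patch. */
	tmp = hashval ^ (SKETCH_GOLDEN_RATIO_PRIME + hashval) /
			SKETCH_CACHE_LINE_SIZE;
	tmp = tmp ^ ((tmp ^ SKETCH_GOLDEN_RATIO_PRIME) >> hashshift);
	return tmp % hashsize;
}
```

The result is always smaller than hashsize, so it can index the bucket array directly.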
* [PATCH 18/20] fuse4fs: use the orphaned inode list
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
` (16 preceding siblings ...)
2025-08-21 1:12 ` [PATCH 17/20] fuse4fs: add cache to track open files Darrick J. Wong
@ 2025-08-21 1:12 ` Darrick J. Wong
2025-08-21 1:12 ` [PATCH 19/20] fuse4fs: implement FUSE_TMPFILE Darrick J. Wong
2025-08-21 1:13 ` [PATCH 20/20] fuse4fs: create incore reverse orphan list Darrick J. Wong
19 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:12 UTC (permalink / raw)
To: tytso
Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, amir73il,
joannelkoong, neal
From: Darrick J. Wong <djwong@kernel.org>
Put open but unlinked files on the orphan list, and remove them when the
last open fd releases the inode.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse4fs.c | 181 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 178 insertions(+), 3 deletions(-)
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index e2a9e7bfe54b00..1d1797a483a139 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -955,6 +955,13 @@ static int fuse4fs_inum_access(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
inode_uid(inode), inode_gid(inode),
ctxt->uid, ctxt->gid);
+ /* linked files cannot be on the unlinked list or deleted */
+ if (inode.i_dtime != 0) {
+ dbg_printf(ff, "%s: unlinked ino=%d dtime=0x%x\n",
+ __func__, ino, inode.i_dtime);
+ return -ENOENT;
+ }
+
/* existence check */
if (mask == 0)
return 0;
@@ -2140,9 +2147,80 @@ static int fuse4fs_remove_ea_inodes(struct fuse4fs *ff, ext2_ino_t ino,
return 0;
}
+static int fuse4fs_add_to_orphans(struct fuse4fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode)
+{
+ ext2_filsys fs = ff->fs;
+
+ dbg_printf(ff, "%s: orphan ino=%d dtime=%d next=%d\n",
+ __func__, ino, inode->i_dtime, fs->super->s_last_orphan);
+
+ inode->i_dtime = fs->super->s_last_orphan;
+ fs->super->s_last_orphan = ino;
+ ext2fs_mark_super_dirty(fs);
+
+ return 0;
+}
+
+static int fuse4fs_remove_from_orphans(struct fuse4fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode)
+{
+ ext2_filsys fs = ff->fs;
+ ext2_ino_t prev_orphan;
+ errcode_t err;
+
+ dbg_printf(ff, "%s: super=%d ino=%d next=%d\n",
+ __func__, fs->super->s_last_orphan, ino, inode->i_dtime);
+
+ /* If we're lucky, the ondisk superblock points to us */
+ if (fs->super->s_last_orphan == ino) {
+ dbg_printf(ff, "%s: superblock\n", __func__);
+
+ fs->super->s_last_orphan = inode->i_dtime;
+ inode->i_dtime = 0;
+ ext2fs_mark_super_dirty(fs);
+ return 0;
+ }
+
+ /* Otherwise walk the ondisk orphan list. */
+ prev_orphan = fs->super->s_last_orphan;
+ while (prev_orphan != 0) {
+ struct ext2_inode_large orphan;
+
+ err = fuse4fs_read_inode(fs, prev_orphan, &orphan);
+ if (err)
+ return translate_error(fs, prev_orphan, err);
+
+ if (orphan.i_dtime == prev_orphan)
+ return translate_error(fs, prev_orphan,
+ EXT2_ET_FILESYSTEM_CORRUPTED);
+
+ if (orphan.i_dtime == ino) {
+ dbg_printf(ff, "%s: prev=%d\n",
+ __func__, prev_orphan);
+
+ orphan.i_dtime = inode->i_dtime;
+ inode->i_dtime = 0;
+
+ err = fuse4fs_write_inode(fs, prev_orphan, &orphan);
+ if (err)
+ return translate_error(fs, prev_orphan, err);
+
+ return 0;
+ }
+
+ dbg_printf(ff, "%s: orphan=%d next=%d\n",
+ __func__, prev_orphan, orphan.i_dtime);
+ prev_orphan = orphan.i_dtime;
+ }
+
+ return translate_error(fs, ino, EXT2_ET_FILESYSTEM_CORRUPTED);
+}
+
static int fuse4fs_remove_inode(struct fuse4fs *ff, ext2_ino_t ino)
{
ext2_filsys fs = ff->fs;
+ struct fuse4fs_inode *fi;
errcode_t err;
struct ext2_inode_large inode;
int ret = 0;
@@ -2159,7 +2237,6 @@ static int fuse4fs_remove_inode(struct fuse4fs *ff, ext2_ino_t ino)
return 0; /* XXX: already done? */
case 1:
inode.i_links_count--;
- ext2fs_set_dtime(fs, EXT2_INODE(&inode));
break;
default:
inode.i_links_count--;
@@ -2172,6 +2249,26 @@ static int fuse4fs_remove_inode(struct fuse4fs *ff, ext2_ino_t ino)
if (inode.i_links_count)
goto write_out;
+ err = fuse4fs_iget(ff, ino, &fi);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ dbg_printf(ff, "%s: put ino=%d opencount=%d\n", __func__, ino,
+ fi->i_open_count);
+
+ /*
+ * The file is unlinked but still open; add it to the orphan list and
+ * free it later.
+ */
+ if (fi->i_open_count > 0) {
+ fuse4fs_iput(ff, fi);
+ ret = fuse4fs_add_to_orphans(ff, ino, &inode);
+ if (ret)
+ return ret;
+
+ goto write_out;
+ }
+ fuse4fs_iput(ff, fi);
if (ext2fs_has_feature_ea_inode(fs->super)) {
ret = fuse4fs_remove_ea_inodes(ff, ino, &inode);
@@ -2191,6 +2288,7 @@ static int fuse4fs_remove_inode(struct fuse4fs *ff, ext2_ino_t ino)
return translate_error(fs, ino, err);
}
+ ext2fs_set_dtime(fs, EXT2_INODE(&inode));
ext2fs_inode_alloc_stats2(fs, ino, -1,
LINUX_S_ISDIR(inode.i_mode));
@@ -2735,6 +2833,16 @@ static void op_link(fuse_req_t req, fuse_ino_t child_fino,
if (ret)
goto out2;
+ /*
+ * Linking a file back into the filesystem requires removing it from
+ * the orphan list.
+ */
+ if (inode.i_links_count == 0) {
+ ret = fuse4fs_remove_from_orphans(ff, child, &inode);
+ if (ret)
+ goto out2;
+ }
+
inode.i_links_count++;
ret = update_ctime(fs, child, &inode);
if (ret)
@@ -3015,7 +3123,8 @@ static void detect_linux_executable_open(int kernel_flags, int *access_check,
#endif /* __linux__ */
static int fuse4fs_open_file(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
- ext2_ino_t ino, struct fuse_file_info *fp)
+ ext2_ino_t ino,
+ struct fuse_file_info *fp)
{
ext2_filsys fs = ff->fs;
errcode_t err;
@@ -3089,6 +3198,8 @@ static int fuse4fs_open_file(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
file->fi->i_open_count++;
fuse4fs_set_handle(fp, file);
+ dbg_printf(ff, "%s: ino=%d fh=%p opencount=%d\n", __func__, ino, file,
+ file->fi->i_open_count);
out:
if (ret)
@@ -3105,6 +3216,8 @@ static void op_open(fuse_req_t req, fuse_ino_t fino, struct fuse_file_info *fp)
FUSE4FS_CHECK_CONTEXT(req);
FUSE4FS_CONVERT_FINO(req, &ino, fino);
+ dbg_printf(ff, "%s: ino=%d\n", __func__, ino);
+
fuse4fs_start(ff);
ret = fuse4fs_open_file(ff, ctxt, ino, fp);
fuse4fs_finish(ff, ret);
@@ -3253,6 +3366,55 @@ static void op_write(fuse_req_t req, fuse_ino_t fino EXT2FS_ATTR((unused)),
fuse_reply_err(req, -ret);
}
+static int fuse4fs_free_unlinked(struct fuse4fs *ff, ext2_ino_t ino)
+{
+ struct ext2_inode_large inode;
+ ext2_filsys fs = ff->fs;
+ errcode_t err;
+ int ret = 0;
+
+ err = fuse4fs_read_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ if (inode.i_links_count > 0)
+ return 0;
+
+ dbg_printf(ff, "%s: ino=%d links=%d\n", __func__, ino,
+ inode.i_links_count);
+
+ if (ext2fs_has_feature_ea_inode(fs->super)) {
+ ret = fuse4fs_remove_ea_inodes(ff, ino, &inode);
+ if (ret)
+ return ret;
+ }
+
+ /* Nobody holds this file; free its blocks! */
+ err = ext2fs_free_ext_attr(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ if (ext2fs_inode_has_valid_blocks2(fs, EXT2_INODE(&inode))) {
+ err = ext2fs_punch(fs, ino, EXT2_INODE(&inode), NULL,
+ 0, ~0ULL);
+ if (err)
+ return translate_error(fs, ino, err);
+ }
+
+ ret = fuse4fs_remove_from_orphans(ff, ino, &inode);
+ if (ret)
+ return ret;
+
+ ext2fs_set_dtime(fs, EXT2_INODE(&inode));
+ ext2fs_inode_alloc_stats2(fs, ino, -1, LINUX_S_ISDIR(inode.i_mode));
+
+ err = fuse4fs_write_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ return 0;
+}
+
static void op_release(fuse_req_t req, fuse_ino_t fino EXT2FS_ATTR((unused)),
struct fuse_file_info *fp)
{
@@ -3264,9 +3426,21 @@ static void op_release(fuse_req_t req, fuse_ino_t fino EXT2FS_ATTR((unused)),
FUSE4FS_CHECK_CONTEXT(req);
FUSE4FS_CHECK_HANDLE(req, fh);
- dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
+ dbg_printf(ff, "%s: ino=%d fh=%p opencount=%u\n",
+ __func__, fh->ino, fh, fh->fi->i_open_count);
+
fs = fuse4fs_start(ff);
+ /*
+ * If the file is no longer open and is unlinked, free it, which
+ * removes it from the ondisk list.
+ */
+ if (--fh->fi->i_open_count == 0) {
+ ret = fuse4fs_free_unlinked(ff, fh->ino);
+ if (ret)
+ goto out_iput;
+ }
+
if ((fp->flags & O_SYNC) &&
fuse4fs_is_writeable(ff) &&
(fh->open_flags & EXT2_FILE_WRITE)) {
@@ -3275,6 +3449,7 @@ static void op_release(fuse_req_t req, fuse_ino_t fino EXT2FS_ATTR((unused)),
ret = translate_error(fs, fh->ino, err);
}
+out_iput:
fuse4fs_iput(ff, fh->fi);
fp->fh = 0;
fuse4fs_finish(ff, ret);
^ permalink raw reply related [flat|nested] 210+ messages in thread
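Patch 18 treats the ondisk orphan list as a singly linked list: the superblock's s_last_orphan is the head and each orphan's i_dtime holds the next inode number. A minimal in-memory model of the add and the slow-path remove might look like the sketch below. The struct, the array, and the function names are hypothetical, and error reporting is reduced to 0 or -1 rather than translate_error().

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_INODES 16	/* hypothetical tiny filesystem */

struct sketch_fs {
	uint32_t last_orphan;			/* like s_last_orphan */
	uint32_t dtime[SKETCH_NR_INODES];	/* like each inode's i_dtime */
};

/* Push ino onto the head of the list, as fuse4fs_add_to_orphans does. */
static void sketch_orphan_add(struct sketch_fs *fs, uint32_t ino)
{
	fs->dtime[ino] = fs->last_orphan;
	fs->last_orphan = ino;
}

/*
 * Unlink ino from the list.  Returns 0 on success, or -1 if ino is not
 * on the list or an entry points at itself.
 */
static int sketch_orphan_remove(struct sketch_fs *fs, uint32_t ino)
{
	uint32_t prev = fs->last_orphan;

	/* Lucky case: the head points at us. */
	if (prev == ino) {
		fs->last_orphan = fs->dtime[ino];
		fs->dtime[ino] = 0;
		return 0;
	}

	/* Otherwise walk until we find the entry whose next is ino. */
	while (prev != 0) {
		uint32_t next = fs->dtime[prev];

		if (next == prev)
			return -1;	/* self-loop: corrupt list */
		if (next == ino) {
			fs->dtime[prev] = fs->dtime[ino];
			fs->dtime[ino] = 0;
			return 0;
		}
		prev = next;
	}
	return -1;
}
```

The walk is linear in the number of orphans that were added after ino.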
* [PATCH 19/20] fuse4fs: implement FUSE_TMPFILE
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
` (17 preceding siblings ...)
2025-08-21 1:12 ` [PATCH 18/20] fuse4fs: use the orphaned inode list Darrick J. Wong
@ 2025-08-21 1:12 ` Darrick J. Wong
2025-08-21 1:13 ` [PATCH 20/20] fuse4fs: create incore reverse orphan list Darrick J. Wong
19 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:12 UTC (permalink / raw)
To: tytso
Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, amir73il,
joannelkoong, neal
From: Darrick J. Wong <djwong@kernel.org>
Allow creation of O_TMPFILE files now that we know how to use the
unlinked list.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse4fs.c | 93 ++++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 67 insertions(+), 26 deletions(-)
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 1d1797a483a139..3f88e98a20c203 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -897,22 +897,25 @@ static inline int fuse4fs_want_check_owner(struct fuse4fs *ff,
/* Test for append permission */
#define A_OK 16
+/* Test for linked file */
+#define L_OK 32
static int fuse4fs_iflags_access(struct fuse4fs *ff, ext2_ino_t ino,
const struct ext2_inode *inode, int mask)
{
- EXT2FS_BUILD_BUG_ON((A_OK & (R_OK | W_OK | X_OK | F_OK)) != 0);
+ EXT2FS_BUILD_BUG_ON(((A_OK | L_OK) & (R_OK | W_OK | X_OK | F_OK)) != 0);
/* no writing or metadata changes to read-only or broken fs */
if ((mask & (W_OK | A_OK)) && !fuse4fs_is_writeable(ff))
return -EROFS;
- dbg_printf(ff, "access ino=%d mask=e%s%s%s%s iflags=0x%x\n",
+ dbg_printf(ff, "access ino=%d mask=e%s%s%s%s%s iflags=0x%x\n",
ino,
(mask & R_OK ? "r" : ""),
(mask & W_OK ? "w" : ""),
(mask & X_OK ? "x" : ""),
(mask & A_OK ? "a" : ""),
+ (mask & L_OK ? "l" : ""),
inode->i_flags);
/* is immutable? */
@@ -945,21 +948,31 @@ static int fuse4fs_inum_access(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
return translate_error(fs, ino, err);
perms = inode.i_mode & 0777;
- dbg_printf(ff, "access ino=%d mask=e%s%s%s%s perms=0%o iflags=0x%x "
+ dbg_printf(ff, "access ino=%d mask=e%s%s%s%s%s perms=0%o iflags=0x%x "
"fuid=%d fgid=%d uid=%d gid=%d\n", ino,
(mask & R_OK ? "r" : ""),
(mask & W_OK ? "w" : ""),
(mask & X_OK ? "x" : ""),
(mask & A_OK ? "a" : ""),
+ (mask & L_OK ? "l" : ""),
perms, inode.i_flags,
inode_uid(inode), inode_gid(inode),
ctxt->uid, ctxt->gid);
- /* linked files cannot be on the unlinked list or deleted */
- if (inode.i_dtime != 0) {
- dbg_printf(ff, "%s: unlinked ino=%d dtime=0x%x\n",
- __func__, ino, inode.i_dtime);
- return -ENOENT;
+ if (mask & L_OK) {
+ /* linked files cannot be on the unlinked list or deleted */
+ if (inode.i_dtime != 0) {
+ dbg_printf(ff, "%s: unlinked ino=%d dtime=0x%x\n",
+ __func__, ino, inode.i_dtime);
+ return -ENOENT;
+ }
+ } else {
+ /* unlinked files cannot be deleted */
+ if (inode.i_dtime >= fs->super->s_inodes_count) {
+ dbg_printf(ff, "%s: deleted ino=%d dtime=0x%x\n",
+ __func__, ino, inode.i_dtime);
+ return -ENOENT;
+ }
}
/* existence check */
@@ -3123,7 +3136,7 @@ static void detect_linux_executable_open(int kernel_flags, int *access_check,
#endif /* __linux__ */
static int fuse4fs_open_file(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
- ext2_ino_t ino,
+ ext2_ino_t ino, bool linked,
struct fuse_file_info *fp)
{
ext2_filsys fs = ff->fs;
@@ -3153,6 +3166,9 @@ static int fuse4fs_open_file(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
break;
}
+ if (linked)
+ check |= L_OK;
+
/*
* If the caller wants to truncate the file, we need to ask for full
* write access even if the caller claims to be appending.
@@ -3219,7 +3235,7 @@ static void op_open(fuse_req_t req, fuse_ino_t fino, struct fuse_file_info *fp)
dbg_printf(ff, "%s: ino=%d\n", __func__, ino);
fuse4fs_start(ff);
- ret = fuse4fs_open_file(ff, ctxt, ino, fp);
+ ret = fuse4fs_open_file(ff, ctxt, ino, true, fp);
fuse4fs_finish(ff, ret);
if (ret)
@@ -4128,22 +4144,28 @@ static void op_create(fuse_req_t req, fuse_ino_t fino, const char *name,
goto out2;
}
- dbg_printf(ff, "%s: creating dir=%d name='%s' child=%d\n",
- __func__, parent, name, child);
- err = ext2fs_link(fs, parent, name, child,
- filetype | EXT2FS_LINK_EXPAND);
- if (err) {
- ret = translate_error(fs, parent, err);
- goto out2;
+ if (name) {
+ dbg_printf(ff, "%s: creating dir=%d name='%s' child=%d\n",
+ __func__, parent, name, child);
+
+ err = ext2fs_link(fs, parent, name, child,
+ filetype | EXT2FS_LINK_EXPAND);
+ if (err) {
+ ret = translate_error(fs, parent, err);
+ goto out2;
+ }
+
+ ret = update_mtime(fs, parent, NULL);
+ if (ret)
+ goto out2;
+ } else {
+ dbg_printf(ff, "%s: creating dir=%d tempfile=%d\n",
+ __func__, parent, child);
}
- ret = update_mtime(fs, parent, NULL);
- if (ret)
- goto out2;
-
memset(&inode, 0, sizeof(inode));
inode.i_mode = mode;
- inode.i_links_count = 1;
+ inode.i_links_count = name ? 1 : 0;
fuse4fs_set_extra_isize(ff, child, &inode);
fuse4fs_set_uid(&inode, ctxt->uid);
fuse4fs_set_gid(&inode, gid);
@@ -4161,6 +4183,12 @@ static void op_create(fuse_req_t req, fuse_ino_t fino, const char *name,
ext2fs_extent_free(handle);
}
+ if (!name) {
+ ret = fuse4fs_add_to_orphans(ff, child, &inode);
+ if (ret)
+ goto out2;
+ }
+
err = ext2fs_write_new_inode(fs, child, EXT2_INODE(&inode));
if (err) {
ret = translate_error(fs, child, err);
@@ -4182,13 +4210,15 @@ static void op_create(fuse_req_t req, fuse_ino_t fino, const char *name,
goto out2;
fp->flags &= ~O_TRUNC;
- ret = fuse4fs_open_file(ff, ctxt, child, fp);
+ ret = fuse4fs_open_file(ff, ctxt, child, name != NULL, fp);
if (ret)
goto out2;
- ret = fuse4fs_dirsync_flush(ff, parent, NULL);
- if (ret)
- goto out2;
+ if (name) {
+ ret = fuse4fs_dirsync_flush(ff, parent, NULL);
+ if (ret)
+ goto out2;
+ }
ret = fuse4fs_stat_inode(ff, child, NULL, &fstat);
if (ret)
@@ -4203,6 +4233,14 @@ static void op_create(fuse_req_t req, fuse_ino_t fino, const char *name,
fuse_reply_create(req, &fstat.entry, fp);
}
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 17)
+static void op_tmpfile(fuse_req_t req, fuse_ino_t fino, mode_t mode,
+ struct fuse_file_info *fp)
+{
+ op_create(req, fino, NULL, mode, fp);
+}
+#endif
+
enum fuse4fs_time_action {
TA_NOW, /* set to current time */
TA_OMIT, /* do not set timestamp */
@@ -5161,6 +5199,9 @@ static struct fuse_lowlevel_ops fs_ops = {
.fsyncdir = op_fsync,
.access = op_access,
.create = op_create,
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 17)
+ .tmpfile = op_tmpfile,
+#endif
.bmap = op_bmap,
#ifdef SUPERFLUOUS
.lock = op_lock,
^ permalink raw reply related [flat|nested] 210+ messages in thread
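Patch 19 relies on i_dtime carrying two kinds of value: a next-orphan inode number while the inode sits on the orphan list, and a deletion timestamp once the inode is freed. The access check tells them apart by comparing against s_inodes_count. A minimal sketch of that test follows; the function name and parameter names are hypothetical, and the L_OK flag is modelled as a plain bool.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns 0 if the inode may be opened, or -ENOENT otherwise.
 * want_linked corresponds to passing L_OK in the access mask.
 */
static int sketch_dtime_check(uint32_t dtime, uint32_t inodes_count,
			      bool want_linked)
{
	/* Linked files cannot be on the unlinked list or deleted. */
	if (want_linked)
		return dtime != 0 ? -ENOENT : 0;

	/*
	 * An open-by-handle of an unlinked file is fine as long as dtime
	 * still looks like an inode number and not like a timestamp.
	 */
	return dtime >= inodes_count ? -ENOENT : 0;
}
```

An O_TMPFILE inode with dtime 42 on a 1000-inode filesystem passes the check without L_OK and fails it with L_OK, which is why op_create passes `name != NULL` for the linked argument.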
* [PATCH 20/20] fuse4fs: create incore reverse orphan list
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
` (18 preceding siblings ...)
2025-08-21 1:12 ` [PATCH 19/20] fuse4fs: implement FUSE_TMPFILE Darrick J. Wong
@ 2025-08-21 1:13 ` Darrick J. Wong
19 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:13 UTC (permalink / raw)
To: tytso
Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, amir73il,
joannelkoong, neal
From: Darrick J. Wong <djwong@kernel.org>
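Create an incore reverse orphan list (a back pointer to the previous
orphan, kept in each cached inode) so that removing an open unlinked
inode doesn't have to walk the ondisk orphan list from the head.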
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse4fs.c | 178 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 174 insertions(+), 4 deletions(-)
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 3f88e98a20c203..cd7e30eaeb7757 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -351,10 +351,20 @@ static inline int u_log2(unsigned int arg)
return l;
}
+/* inode is not on unlinked list */
+#define FUSE4FS_NULL_INO ((ext2_ino_t)~0ULL)
+
struct fuse4fs_inode {
struct cache_node i_cnode;
ext2_ino_t i_ino;
unsigned int i_open_count;
+
+ /*
+ * FUSE4FS_NULL_INO: inode is not on the orphan list
+ * 0: inode is the first on the orphan list
+ * otherwise: inode is in the middle of the list
+ */
+ ext2_ino_t i_prev_orphan;
};
struct fuse4fs_ikey {
@@ -396,12 +406,15 @@ static struct cache_node *icache_alloc(struct cache *c, cache_key_t key)
return NULL;
fi->i_ino = ikey->i_ino;
+ fi->i_prev_orphan = FUSE4FS_NULL_INO;
return &fi->i_cnode;
}
static bool icache_flush(struct cache *c, struct cache_node *node)
{
- return false;
+ struct fuse4fs_inode *fi = ICNODE(node);
+
+ return fi->i_prev_orphan != FUSE4FS_NULL_INO;
}
static void icache_relse(struct cache *c, struct cache_node *node)
@@ -2164,10 +2177,31 @@ static int fuse4fs_add_to_orphans(struct fuse4fs *ff, ext2_ino_t ino,
struct ext2_inode_large *inode)
{
ext2_filsys fs = ff->fs;
+ struct fuse4fs_inode *fi;
+ ext2_ino_t orphan_ino = fs->super->s_last_orphan;
+ errcode_t err;
dbg_printf(ff, "%s: orphan ino=%d dtime=%d next=%d\n",
__func__, ino, inode->i_dtime, fs->super->s_last_orphan);
+ /* Make the first orphan on the list point back to us */
+ if (orphan_ino != 0) {
+ err = fuse4fs_iget(ff, orphan_ino, &fi);
+ if (err)
+ return translate_error(fs, orphan_ino, err);
+
+ fi->i_prev_orphan = ino;
+ fuse4fs_iput(ff, fi);
+ }
+
+ /* Add ourselves to the head of the orphan list */
+ err = fuse4fs_iget(ff, ino, &fi);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ fi->i_prev_orphan = 0;
+ fuse4fs_iput(ff, fi);
+
inode->i_dtime = fs->super->s_last_orphan;
fs->super->s_last_orphan = ino;
ext2fs_mark_super_dirty(fs);
@@ -2175,24 +2209,158 @@ static int fuse4fs_add_to_orphans(struct fuse4fs *ff, ext2_ino_t ino,
return 0;
}
+/*
+ * Given the orphan list excerpt: prev_orphan -> ino -> next_orphan, set
+ * next_orphan's backpointer to ino's backpointer (prev_orphan), having removed
+ * ino from the orphan list.
+ */
+static int fuse2fs_update_next_orphan_backlink(struct fuse4fs *ff,
+ ext2_ino_t prev_orphan,
+ ext2_ino_t ino,
+ ext2_ino_t next_orphan)
+{
+ struct fuse4fs_inode *fi;
+ errcode_t err;
+ int ret = 0;
+
+ err = fuse4fs_iget(ff, next_orphan, &fi);
+ if (err)
+ return translate_error(ff->fs, next_orphan, err);
+
+ dbg_printf(ff, "%s: ino=%d cached next=%d nextprev=%d prev=%d\n",
+ __func__, ino, next_orphan, fi->i_prev_orphan,
+ prev_orphan);
+
+ if (fi->i_prev_orphan != ino) {
+ ret = translate_error(ff->fs, next_orphan,
+ EXT2_ET_FILESYSTEM_CORRUPTED);
+ goto out_iput;
+ }
+
+ fi->i_prev_orphan = prev_orphan;
+out_iput:
+ fuse4fs_iput(ff, fi);
+ return ret;
+}
+
+/*
+ * Remove ino from the orphan list the fast way. Returns 1 for success, 0 if
+ * it didn't do anything, or a negative errno.
+ */
+static int fuse4fs_fast_remove_from_orphans(struct fuse4fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode)
+{
+ struct ext2_inode_large orphan;
+ ext2_filsys fs = ff->fs;
+ struct fuse4fs_inode *fi;
+ ext2_ino_t prev_orphan;
+ ext2_ino_t next_orphan = 0;
+ errcode_t err;
+ int ret = 0;
+
+ err = fuse4fs_iget(ff, ino, &fi);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ prev_orphan = fi->i_prev_orphan;
+ switch (prev_orphan) {
+ case 0:
+ /* First inode in the list */
+ dbg_printf(ff, "%s: ino=%d cached superblock\n", __func__, ino);
+
+ fs->super->s_last_orphan = inode->i_dtime;
+ next_orphan = inode->i_dtime;
+ inode->i_dtime = 0;
+ ext2fs_mark_super_dirty(fs);
+ fi->i_prev_orphan = FUSE4FS_NULL_INO;
+ break;
+ case FUSE4FS_NULL_INO:
+ /* unknown */
+ dbg_printf(ff, "%s: ino=%d broken list??\n", __func__, ino);
+ ret = 0;
+ goto out_iput;
+ default:
+ /* We're in the middle of the list */
+ err = fuse4fs_read_inode(fs, prev_orphan, &orphan);
+ if (err) {
+ ret = translate_error(fs, prev_orphan, err);
+ goto out_iput;
+ }
+
+ dbg_printf(ff,
+ "%s: ino=%d cached prev=%d prevnext=%d next=%d\n",
+ __func__, ino, prev_orphan, orphan.i_dtime,
+ inode->i_dtime);
+
+ if (orphan.i_dtime != ino) {
+ ret = translate_error(fs, prev_orphan,
+ EXT2_ET_FILESYSTEM_CORRUPTED);
+ goto out_iput;
+ }
+
+ fi->i_prev_orphan = FUSE4FS_NULL_INO;
+ orphan.i_dtime = inode->i_dtime;
+ next_orphan = inode->i_dtime;
+ inode->i_dtime = 0;
+
+ err = fuse4fs_write_inode(fs, prev_orphan, &orphan);
+ if (err) {
+ ret = translate_error(fs, prev_orphan, err);
+ goto out_iput;
+ }
+
+ break;
+ }
+
+ /*
+ * Make the next orphaned inode point back to our own previous list
+ * entry
+ */
+ if (next_orphan != 0) {
+ ret = fuse2fs_update_next_orphan_backlink(ff, prev_orphan, ino,
+ next_orphan);
+ if (ret)
+ goto out_iput;
+ }
+ ret = 1;
+
+out_iput:
+ fuse4fs_iput(ff, fi);
+ return ret;
+}
+
static int fuse4fs_remove_from_orphans(struct fuse4fs *ff, ext2_ino_t ino,
struct ext2_inode_large *inode)
{
ext2_filsys fs = ff->fs;
ext2_ino_t prev_orphan;
+ ext2_ino_t next_orphan;
errcode_t err;
+ int ret;
dbg_printf(ff, "%s: super=%d ino=%d next=%d\n",
__func__, fs->super->s_last_orphan, ino, inode->i_dtime);
- /* If we're lucky, the ondisk superblock points to us */
+ /*
+ * Fast way: use the incore list, which doesn't include any orphans
+ * that were already on the superblock when we mounted.
+ */
+ ret = fuse4fs_fast_remove_from_orphans(ff, ino, inode);
+ if (ret < 0)
+ return ret;
+ if (ret == 1)
+ return 0;
+
+ /* Slow way: If we're lucky, the ondisk superblock points to us */
if (fs->super->s_last_orphan == ino) {
dbg_printf(ff, "%s: superblock\n", __func__);
+ next_orphan = inode->i_dtime;
fs->super->s_last_orphan = inode->i_dtime;
inode->i_dtime = 0;
ext2fs_mark_super_dirty(fs);
- return 0;
+ return fuse2fs_update_next_orphan_backlink(ff, 0, ino,
+ next_orphan);
}
/* Otherwise walk the ondisk orphan list. */
@@ -2212,6 +2380,7 @@ static int fuse4fs_remove_from_orphans(struct fuse4fs *ff, ext2_ino_t ino,
dbg_printf(ff, "%s: prev=%d\n",
__func__, prev_orphan);
+ next_orphan = inode->i_dtime;
orphan.i_dtime = inode->i_dtime;
inode->i_dtime = 0;
@@ -2219,7 +2388,8 @@ static int fuse4fs_remove_from_orphans(struct fuse4fs *ff, ext2_ino_t ino,
if (err)
return translate_error(fs, prev_orphan, err);
- return 0;
+ return fuse2fs_update_next_orphan_backlink(ff,
+ prev_orphan, ino, next_orphan);
}
dbg_printf(ff, "%s: orphan=%d next=%d\n",
* [PATCH 01/10] libext2fs: make it possible to extract the fd from an IO manager
2025-08-21 0:49 ` [PATCHSET RFC v4 2/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
@ 2025-08-21 1:13 ` Darrick J. Wong
2025-08-21 1:13 ` [PATCH 02/10] libext2fs: always fsync the device when flushing the cache Darrick J. Wong
` (8 subsequent siblings)
9 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:13 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Make it so that we can extract the fd from an open IO manager. This
will be used in subsequent patches to register the open block device
with the fuse iomap kernel driver.
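As a rough illustration of the optional-hook dispatch this patch adds, here is a minimal standalone sketch. The demo_* types and functions are hypothetical stand-ins for io_channel and struct_io_manager, and EOPNOTSUPP stands in for EXT2_ET_OP_NOT_SUPPORTED; none of this is the real libext2fs API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for io_channel and struct_io_manager. */
struct demo_channel;

struct demo_manager {
	/* Optional hook: managers not backed by an fd leave this NULL. */
	int (*get_fd)(struct demo_channel *ch, int *fd);
};

struct demo_channel {
	const struct demo_manager *manager;
	int dev;	/* backing file descriptor, if any */
};

int demo_unix_get_fd(struct demo_channel *ch, int *fd)
{
	*fd = ch->dev;
	return 0;
}

/* Same shape as io_channel_get_fd(): a NULL hook means "not supported". */
int demo_channel_get_fd(struct demo_channel *ch, int *fd)
{
	if (!ch->manager->get_fd)
		return EOPNOTSUPP;	/* stand-in for EXT2_ET_OP_NOT_SUPPORTED */
	return ch->manager->get_fd(ch, fd);
}

/* One manager that implements the hook and one that does not. */
const struct demo_manager demo_unix_manager = { .get_fd = demo_unix_get_fd };
const struct demo_manager demo_mem_manager = { .get_fd = NULL };
```

A caller would presumably treat the not-supported return as "this channel has no fd to register" and carry on without it, rather than as a fatal error.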
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/ext2fs/ext2_io.h | 4 +++-
debian/libext2fs2t64.symbols | 1 +
lib/ext2fs/io_manager.c | 8 ++++++++
lib/ext2fs/unix_io.c | 15 +++++++++++++++
4 files changed, 27 insertions(+), 1 deletion(-)
diff --git a/lib/ext2fs/ext2_io.h b/lib/ext2fs/ext2_io.h
index 39a4e8fcf6b515..f53983b30996b4 100644
--- a/lib/ext2fs/ext2_io.h
+++ b/lib/ext2fs/ext2_io.h
@@ -102,7 +102,8 @@ struct struct_io_manager {
unsigned long long count);
errcode_t (*zeroout)(io_channel channel, unsigned long long block,
unsigned long long count);
- long reserved[14];
+ errcode_t (*get_fd)(io_channel channel, int *fd);
+ long reserved[13];
};
#define IO_FLAG_RW 0x0001
@@ -145,6 +146,7 @@ extern errcode_t io_channel_alloc_buf(io_channel channel,
extern errcode_t io_channel_cache_readahead(io_channel io,
unsigned long long block,
unsigned long long count);
+extern errcode_t io_channel_get_fd(io_channel io, int *fd);
#ifdef _WIN32
/* windows_io.c */
diff --git a/debian/libext2fs2t64.symbols b/debian/libext2fs2t64.symbols
index a3042c3292da93..8e3214ee31e337 100644
--- a/debian/libext2fs2t64.symbols
+++ b/debian/libext2fs2t64.symbols
@@ -693,6 +693,7 @@ libext2fs.so.2 libext2fs2t64 #MINVER#
io_channel_alloc_buf@Base 1.42.3
io_channel_cache_readahead@Base 1.43
io_channel_discard@Base 1.42
+ io_channel_get_fd@Base 1.47.99
io_channel_read_blk64@Base 1.41.1
io_channel_set_options@Base 1.37
io_channel_write_blk64@Base 1.41.1
diff --git a/lib/ext2fs/io_manager.c b/lib/ext2fs/io_manager.c
index dca6af09996b70..6b4dca5e4dbca2 100644
--- a/lib/ext2fs/io_manager.c
+++ b/lib/ext2fs/io_manager.c
@@ -150,3 +150,11 @@ errcode_t io_channel_cache_readahead(io_channel io, unsigned long long block,
return io->manager->cache_readahead(io, block, count);
}
+
+errcode_t io_channel_get_fd(io_channel io, int *fd)
+{
+ if (!io->manager->get_fd)
+ return EXT2_ET_OP_NOT_SUPPORTED;
+
+ return io->manager->get_fd(io, fd);
+}
diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index cb408f51779aa7..561eddad6b8b17 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -1663,6 +1663,19 @@ static errcode_t unix_zeroout(io_channel channel, unsigned long long block,
unimplemented:
return EXT2_ET_UNIMPLEMENTED;
}
+
+static errcode_t unix_get_fd(io_channel channel, int *fd)
+{
+ struct unix_private_data *data;
+
+ EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+ data = (struct unix_private_data *) channel->private_data;
+ EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
+
+ *fd = data->dev;
+ return 0;
+}
+
#if __GNUC_PREREQ (4, 6)
#pragma GCC diagnostic pop
#endif
@@ -1684,6 +1697,7 @@ static struct struct_io_manager struct_unix_manager = {
.discard = unix_discard,
.cache_readahead = unix_cache_readahead,
.zeroout = unix_zeroout,
+ .get_fd = unix_get_fd,
};
io_manager unix_io_manager = &struct_unix_manager;
@@ -1705,6 +1719,7 @@ static struct struct_io_manager struct_unixfd_manager = {
.discard = unix_discard,
.cache_readahead = unix_cache_readahead,
.zeroout = unix_zeroout,
+ .get_fd = unix_get_fd,
};
io_manager unixfd_io_manager = &struct_unixfd_manager;
* [PATCH 02/10] libext2fs: always fsync the device when flushing the cache
2025-08-21 0:49 ` [PATCHSET RFC v4 2/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
2025-08-21 1:13 ` [PATCH 01/10] libext2fs: make it possible to extract the fd from an IO manager Darrick J. Wong
@ 2025-08-21 1:13 ` Darrick J. Wong
2025-08-21 1:13 ` [PATCH 03/10] libext2fs: always fsync the device when closing the unix IO manager Darrick J. Wong
` (7 subsequent siblings)
9 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:13 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
When we're flushing the unix IO manager's buffer cache, we should always
fsync the block device, because something could have written to the
block device -- either the buffer cache itself, or a direct write.
Regardless, the callers all want all dirtied regions to be persisted to
stable media.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/ext2fs/unix_io.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 561eddad6b8b17..14f5a0c434191a 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -1463,7 +1463,8 @@ static errcode_t unix_flush(io_channel channel)
retval = flush_cached_blocks(channel, data, 0);
#endif
#ifdef HAVE_FSYNC
- if (!retval && fsync(data->dev) != 0)
+ /* always fsync the device, even if flushing our own cache failed */
+ if (fsync(data->dev) != 0 && !retval)
return errno;
#endif
return retval;
* [PATCH 03/10] libext2fs: always fsync the device when closing the unix IO manager
2025-08-21 0:49 ` [PATCHSET RFC v4 2/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
2025-08-21 1:13 ` [PATCH 01/10] libext2fs: make it possible to extract the fd from an IO manager Darrick J. Wong
2025-08-21 1:13 ` [PATCH 02/10] libext2fs: always fsync the device when flushing the cache Darrick J. Wong
@ 2025-08-21 1:13 ` Darrick J. Wong
2025-08-21 1:14 ` [PATCH 04/10] libext2fs: only fsync the unix fd if we wrote to the device Darrick J. Wong
` (6 subsequent siblings)
9 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:13 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
unix_close is the last chance that libext2fs has to report write
failures to users. Although it's likely that ext2fs_close already
called ext2fs_flush and told the IO manager to flush, we could do one
more sync before we close the file descriptor. Also don't override the
fsync's errno with the close's errno.
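The "first error wins" ordering described here (and in the previous patch for the flush path) can be sketched in isolation. sync_and_close() and its flush_err parameter are hypothetical names for illustration; flush_err plays the role of the flush_cached_blocks() return value.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Sync and close fd, returning the first error seen.  flush_err is the
 * result of an earlier (hypothetical) cache flush step; 0 means success.
 */
int sync_and_close(int fd, int flush_err)
{
	int ret = flush_err;

	/* Always attempt the fsync, but never overwrite an earlier error. */
	if (fsync(fd) != 0 && !ret)
		ret = errno;

	/* Likewise, close's errno is reported only if nothing failed before. */
	if (close(fd) < 0 && !ret)
		ret = errno;

	return ret;
}
```

The point of the ordering is that the earliest failure is usually the most informative one, so later steps still run but only report an error if none has been recorded yet.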
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/ext2fs/unix_io.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 14f5a0c434191a..80fff984e48224 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -1147,8 +1147,11 @@ static errcode_t unix_close(io_channel channel)
#ifndef NO_IO_CACHE
retval = flush_cached_blocks(channel, data, 0);
#endif
+ /* always fsync the device, even if flushing our own cache failed */
+ if (fsync(data->dev) != 0 && !retval)
+ retval = errno;
- if (close(data->dev) < 0)
+ if (close(data->dev) < 0 && !retval)
retval = errno;
free_cache(data);
free(data->cache);
* [PATCH 04/10] libext2fs: only fsync the unix fd if we wrote to the device
2025-08-21 0:49 ` [PATCHSET RFC v4 2/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
` (2 preceding siblings ...)
2025-08-21 1:13 ` [PATCH 03/10] libext2fs: always fsync the device when closing the unix IO manager Darrick J. Wong
@ 2025-08-21 1:14 ` Darrick J. Wong
2025-08-21 1:14 ` [PATCH 05/10] libext2fs: invalidate cached blocks when freeing them Darrick J. Wong
` (5 subsequent siblings)
9 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:14 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
As an optimization, only fsync the block device fd if we tried to write
to the io channel.
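A minimal sketch of the dirty-flag idea follows. Note that it swaps the patch's mutex-protected flag for C11 atomics so that the example stays self-contained; the demo_* names and the nr_fsyncs counter are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define DEMO_STATE_DIRTY (1U << 0)	/* device needs fsyncing */

struct demo_dev {
	int fd;
	atomic_uint state;		/* DEMO_STATE_* */
	unsigned int nr_fsyncs;		/* instrumentation for the example only */
};

/* Called on every write path before data is sent to the device. */
void demo_mark_dirty(struct demo_dev *d)
{
	atomic_fetch_or(&d->state, DEMO_STATE_DIRTY);
}

/* fsync only if something was written since the last sync. */
int demo_maybe_fsync(struct demo_dev *d)
{
	/* Test and clear in one step so a concurrent mark is not lost. */
	unsigned int old = atomic_fetch_and(&d->state, ~DEMO_STATE_DIRTY);

	if (!(old & DEMO_STATE_DIRTY))
		return 0;

	d->nr_fsyncs++;
	return fsync(d->fd) != 0 ? errno : 0;
}
```

Clearing the flag before issuing the fsync (as the patch also does) means a write that races with the sync re-marks the device dirty and is picked up by the next flush.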
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/ext2fs/unix_io.c | 48 ++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 42 insertions(+), 6 deletions(-)
diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 80fff984e48224..61ecdc9b8b56b2 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -129,10 +129,13 @@ struct unix_cache {
#define WRITE_DIRECT_SIZE 4 /* Must be smaller than CACHE_SIZE */
#define READ_DIRECT_SIZE 4 /* Should be smaller than CACHE_SIZE */
+#define UNIX_STATE_DIRTY (1U << 0) /* device needs fsyncing */
+
struct unix_private_data {
int magic;
int dev;
int flags;
+ unsigned int state; /* UNIX_STATE_* */
int align;
int access_time;
ext2_loff_t offset;
@@ -1132,10 +1135,37 @@ static errcode_t unix_open(const char *name, int flags,
return unix_open_channel(name, fd, flags, channel, unix_io_manager);
}
+static void mark_dirty(io_channel channel)
+{
+ struct unix_private_data *data =
+ (struct unix_private_data *) channel->private_data;
+
+ mutex_lock(data, CACHE_MTX);
+ data->state |= UNIX_STATE_DIRTY;
+ mutex_unlock(data, CACHE_MTX);
+}
+
+static errcode_t maybe_fsync(io_channel channel)
+{
+ struct unix_private_data *data =
+ (struct unix_private_data *) channel->private_data;
+ int was_dirty;
+
+ mutex_lock(data, CACHE_MTX);
+ was_dirty = data->state & UNIX_STATE_DIRTY;
+ data->state &= ~UNIX_STATE_DIRTY;
+ mutex_unlock(data, CACHE_MTX);
+
+ if (was_dirty && fsync(data->dev) != 0)
+ return errno;
+
+ return 0;
+}
+
static errcode_t unix_close(io_channel channel)
{
struct unix_private_data *data;
- errcode_t retval = 0;
+ errcode_t retval = 0, retval2;
EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
data = (struct unix_private_data *) channel->private_data;
@@ -1148,8 +1178,9 @@ static errcode_t unix_close(io_channel channel)
retval = flush_cached_blocks(channel, data, 0);
#endif
/* always fsync the device, even if flushing our own cache failed */
- if (fsync(data->dev) != 0 && !retval)
- retval = errno;
+ retval2 = maybe_fsync(channel);
+ if (retval2 && !retval)
+ retval = retval2;
if (close(data->dev) < 0 && !retval)
retval = errno;
@@ -1317,6 +1348,8 @@ static errcode_t unix_write_blk64(io_channel channel, unsigned long long block,
data = (struct unix_private_data *) channel->private_data;
EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
+ mark_dirty(channel);
+
#ifdef NO_IO_CACHE
return raw_write_blk(channel, data, block, count, buf, 0);
#else
@@ -1441,6 +1474,8 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
if (lseek(data->dev, offset + data->offset, SEEK_SET) < 0)
return errno;
+ mark_dirty(channel);
+
actual = write(data->dev, buf, size);
if (actual < 0)
return errno;
@@ -1456,7 +1491,7 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
static errcode_t unix_flush(io_channel channel)
{
struct unix_private_data *data;
- errcode_t retval = 0;
+ errcode_t retval = 0, retval2;
EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
data = (struct unix_private_data *) channel->private_data;
@@ -1467,8 +1502,9 @@ static errcode_t unix_flush(io_channel channel)
#endif
#ifdef HAVE_FSYNC
/* always fsync the device, even if flushing our own cache failed */
- if (fsync(data->dev) != 0 && !retval)
- return errno;
+ retval2 = maybe_fsync(channel);
+ if (retval2 && !retval)
+ retval = retval2;
#endif
return retval;
}
* [PATCH 05/10] libext2fs: invalidate cached blocks when freeing them
2025-08-21 0:49 ` [PATCHSET RFC v4 2/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
` (3 preceding siblings ...)
2025-08-21 1:14 ` [PATCH 04/10] libext2fs: only fsync the unix fd if we wrote to the device Darrick J. Wong
@ 2025-08-21 1:14 ` Darrick J. Wong
2025-08-21 1:14 ` [PATCH 06/10] libext2fs: only flush affected blocks in unix_write_byte Darrick J. Wong
` (4 subsequent siblings)
9 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:14 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
When we're freeing blocks, we should tell the IO manager to drop them
from any cache it might be maintaining to improve performance.
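To make the range check concrete, here is a small standalone sketch of dropping cache entries that fall inside a freed block range. The demo_cache_entry type and demo_invalidate_range() are hypothetical simplifications of the unix IO manager's cache, with locking omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified cache slot: just a block number and a flag. */
struct demo_cache_entry {
	unsigned long long block;
	int in_use;
};

/*
 * Drop every cached entry whose block lies in [block, block + count).
 * Dirty contents would simply be discarded.  Returns the number dropped.
 */
size_t demo_invalidate_range(struct demo_cache_entry *cache, size_t nr,
			     unsigned long long block,
			     unsigned long long count)
{
	size_t i, dropped = 0;

	for (i = 0; i < nr; i++) {
		if (!cache[i].in_use)
			continue;
		/* Subtract instead of adding so block + count cannot wrap. */
		if (cache[i].block < block || cache[i].block - block >= count)
			continue;
		cache[i].in_use = 0;
		dropped++;
	}
	return dropped;
}
```

Discarding rather than writing back is safe here only because the blocks are being freed, so whatever the cache held for them no longer matters.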
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/ext2fs/ext2_io.h | 8 +++++++-
debian/libext2fs2t64.symbols | 1 +
lib/ext2fs/alloc_stats.c | 6 ++++++
lib/ext2fs/io_manager.c | 9 +++++++++
lib/ext2fs/unix_io.c | 35 +++++++++++++++++++++++++++++++++++
5 files changed, 58 insertions(+), 1 deletion(-)
diff --git a/lib/ext2fs/ext2_io.h b/lib/ext2fs/ext2_io.h
index f53983b30996b4..26ecd128954a0e 100644
--- a/lib/ext2fs/ext2_io.h
+++ b/lib/ext2fs/ext2_io.h
@@ -103,7 +103,10 @@ struct struct_io_manager {
errcode_t (*zeroout)(io_channel channel, unsigned long long block,
unsigned long long count);
errcode_t (*get_fd)(io_channel channel, int *fd);
- long reserved[13];
+ errcode_t (*invalidate_blocks)(io_channel channel,
+ unsigned long long block,
+ unsigned long long count);
+ long reserved[12];
};
#define IO_FLAG_RW 0x0001
@@ -147,6 +150,9 @@ extern errcode_t io_channel_cache_readahead(io_channel io,
unsigned long long block,
unsigned long long count);
extern errcode_t io_channel_get_fd(io_channel io, int *fd);
+extern errcode_t io_channel_invalidate_blocks(io_channel io,
+ unsigned long long block,
+ unsigned long long count);
#ifdef _WIN32
/* windows_io.c */
diff --git a/debian/libext2fs2t64.symbols b/debian/libext2fs2t64.symbols
index 8e3214ee31e337..864a284b940009 100644
--- a/debian/libext2fs2t64.symbols
+++ b/debian/libext2fs2t64.symbols
@@ -694,6 +694,7 @@ libext2fs.so.2 libext2fs2t64 #MINVER#
io_channel_cache_readahead@Base 1.43
io_channel_discard@Base 1.42
io_channel_get_fd@Base 1.47.99
+ io_channel_invalidate_blocks@Base 1.47.99
io_channel_read_blk64@Base 1.41.1
io_channel_set_options@Base 1.37
io_channel_write_blk64@Base 1.41.1
diff --git a/lib/ext2fs/alloc_stats.c b/lib/ext2fs/alloc_stats.c
index 95a6438f252e0f..68bbe6807a8ed3 100644
--- a/lib/ext2fs/alloc_stats.c
+++ b/lib/ext2fs/alloc_stats.c
@@ -82,6 +82,9 @@ void ext2fs_block_alloc_stats2(ext2_filsys fs, blk64_t blk, int inuse)
-inuse * (blk64_t) EXT2FS_CLUSTER_RATIO(fs));
ext2fs_mark_super_dirty(fs);
ext2fs_mark_bb_dirty(fs);
+ if (inuse < 0)
+ io_channel_invalidate_blocks(fs->io, blk,
+ EXT2FS_CLUSTER_RATIO(fs));
if (fs->block_alloc_stats)
(fs->block_alloc_stats)(fs, (blk64_t) blk, inuse);
}
@@ -144,11 +147,14 @@ void ext2fs_block_alloc_stats_range(ext2_filsys fs, blk64_t blk,
ext2fs_bg_flags_clear(fs, group, EXT2_BG_BLOCK_UNINIT);
ext2fs_group_desc_csum_set(fs, group);
ext2fs_free_blocks_count_add(fs->super, -inuse * (blk64_t) n);
+
blk += n;
num -= n;
}
ext2fs_mark_super_dirty(fs);
ext2fs_mark_bb_dirty(fs);
+ if (inuse < 0)
+ io_channel_invalidate_blocks(fs->io, orig_blk, orig_num);
if (fs->block_alloc_stats_range)
(fs->block_alloc_stats_range)(fs, orig_blk, orig_num, inuse);
}
diff --git a/lib/ext2fs/io_manager.c b/lib/ext2fs/io_manager.c
index 6b4dca5e4dbca2..c91fab4eb290d5 100644
--- a/lib/ext2fs/io_manager.c
+++ b/lib/ext2fs/io_manager.c
@@ -158,3 +158,12 @@ errcode_t io_channel_get_fd(io_channel io, int *fd)
return io->manager->get_fd(io, fd);
}
+
+errcode_t io_channel_invalidate_blocks(io_channel io, unsigned long long block,
+ unsigned long long count)
+{
+ if (!io->manager->invalidate_blocks)
+ return EXT2_ET_OP_NOT_SUPPORTED;
+
+ return io->manager->invalidate_blocks(io, block, count);
+}
diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 61ecdc9b8b56b2..0d1006207c60cd 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -667,6 +667,25 @@ static errcode_t reuse_cache(io_channel channel,
#define FLUSH_INVALIDATE 0x01
#define FLUSH_NOLOCK 0x02
+/* Remove blocks from the cache. Dirty contents are discarded. */
+static void invalidate_cached_blocks(io_channel channel,
+ struct unix_private_data *data,
+ unsigned long long block,
+ unsigned long long count)
+{
+ struct unix_cache *cache;
+ int i;
+
+ mutex_lock(data, CACHE_MTX);
+ for (i = 0, cache = data->cache; i < data->cache_size; i++, cache++) {
+ if (!cache->in_use || cache->block < block ||
+ cache->block >= block + count)
+ continue;
+ cache->in_use = 0;
+ }
+ mutex_unlock(data, CACHE_MTX);
+}
+
/*
* Flush all of the blocks in the cache
*/
@@ -1716,6 +1735,20 @@ static errcode_t unix_get_fd(io_channel channel, int *fd)
return 0;
}
+static errcode_t unix_invalidate_blocks(io_channel channel,
+ unsigned long long block,
+ unsigned long long count)
+{
+ struct unix_private_data *data;
+
+ EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+ data = (struct unix_private_data *) channel->private_data;
+ EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
+
+ invalidate_cached_blocks(channel, data, block, count);
+ return 0;
+}
+
#if __GNUC_PREREQ (4, 6)
#pragma GCC diagnostic pop
#endif
@@ -1738,6 +1771,7 @@ static struct struct_io_manager struct_unix_manager = {
.cache_readahead = unix_cache_readahead,
.zeroout = unix_zeroout,
.get_fd = unix_get_fd,
+ .invalidate_blocks = unix_invalidate_blocks,
};
io_manager unix_io_manager = &struct_unix_manager;
@@ -1760,6 +1794,7 @@ static struct struct_io_manager struct_unixfd_manager = {
.cache_readahead = unix_cache_readahead,
.zeroout = unix_zeroout,
.get_fd = unix_get_fd,
+ .invalidate_blocks = unix_invalidate_blocks,
};
io_manager unixfd_io_manager = &struct_unixfd_manager;
* [PATCH 06/10] libext2fs: only flush affected blocks in unix_write_byte
2025-08-21 0:49 ` [PATCHSET RFC v4 2/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
` (4 preceding siblings ...)
2025-08-21 1:14 ` [PATCH 05/10] libext2fs: invalidate cached blocks when freeing them Darrick J. Wong
@ 2025-08-21 1:14 ` Darrick J. Wong
2025-08-21 1:14 ` [PATCH 07/10] libext2fs: allow unix_write_byte when the write would be aligned Darrick J. Wong
` (3 subsequent siblings)
9 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:14 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
There's no need to invalidate the entire cache when writing a range of
bytes to the device. The only cached blocks we need to invalidate are
the ones that overlap the byte range being written directly.
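The byte-range to block-range conversion used for this can be sketched on its own. demo_bytes_to_blocks() and struct demo_blk_range are hypothetical names; the arithmetic follows the bno/nbno computation in the diff below (round the start down, round the end up).

```c
#include <assert.h>

struct demo_blk_range {
	unsigned long long first;	/* first block touched */
	unsigned long long count;	/* number of blocks touched */
};

/* Convert [offset, offset + size) in bytes to the blocks it overlaps. */
struct demo_blk_range demo_bytes_to_blocks(unsigned long long offset,
					   unsigned long long size,
					   unsigned int block_size)
{
	struct demo_blk_range r;
	unsigned long long end;

	r.first = offset / block_size;				/* round down */
	end = (offset + size + block_size - 1) / block_size;	/* round up */
	r.count = end - r.first;
	return r;
}
```

For example, a 200-byte write at offset 4000 with 4096-byte blocks straddles a block boundary, so two cached blocks would need to be invalidated.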
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/ext2fs/unix_io.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 0d1006207c60cd..4036c4b6f1481e 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -1469,6 +1469,7 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
{
struct unix_private_data *data;
errcode_t retval = 0;
+ unsigned long long bno, nbno;
ssize_t actual;
EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
@@ -1484,10 +1485,17 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
#ifndef NO_IO_CACHE
/*
- * Flush out the cache completely
+ * Flush all the dirty blocks, then invalidate the blocks we're about
+ * to write.
*/
- if ((retval = flush_cached_blocks(channel, data, FLUSH_INVALIDATE)))
+ retval = flush_cached_blocks(channel, data, 0);
+ if (retval)
return retval;
+
+ bno = offset / channel->block_size;
+ nbno = (offset + size + channel->block_size - 1) / channel->block_size;
+
+ invalidate_cached_blocks(channel, data, bno, nbno - bno);
#endif
if (lseek(data->dev, offset + data->offset, SEEK_SET) < 0)
* [PATCH 07/10] libext2fs: allow unix_write_byte when the write would be aligned
2025-08-21 0:49 ` [PATCHSET RFC v4 2/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
` (5 preceding siblings ...)
2025-08-21 1:14 ` [PATCH 06/10] libext2fs: only flush affected blocks in unix_write_byte Darrick J. Wong
@ 2025-08-21 1:14 ` Darrick J. Wong
2025-08-21 1:15 ` [PATCH 08/10] libext2fs: allow clients to ask to write full superblocks Darrick J. Wong
` (2 subsequent siblings)
9 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:14 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
If someone calls write_byte on an IO channel with an alignment
requirement and the range to be written is aligned correctly, go ahead
and do the write. This will be needed later when we try to speed up
superblock writes.
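A sketch of the alignment test, assuming the alignment is a power of two (as it is for typical O_DIRECT sector or page constraints). The demo_* helpers are hypothetical; the real code uses the IS_ALIGNED macro from unix_io.c.

```c
#include <assert.h>
#include <stdint.h>

/* align must be a power of two for the mask trick to be valid. */
int demo_is_aligned(uint64_t n, uint64_t align)
{
	return (n & (align - 1)) == 0;
}

/*
 * A byte-granular write can go straight to an alignment-constrained
 * device only if both its start and its end land on aligned offsets.
 * dev_offset is the channel's base offset into the device.
 */
int demo_range_is_aligned(uint64_t dev_offset, uint64_t offset,
			  uint64_t size, uint64_t align)
{
	return demo_is_aligned(dev_offset + offset, align) &&
	       demo_is_aligned(dev_offset + offset + size, align);
}
```

Under these assumptions, a 1024-byte write at byte 1024 passes with 512-byte alignment but not with 4096-byte alignment, in which case the caller would still see EXT2_ET_UNIMPLEMENTED and fall back.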
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/ext2fs/unix_io.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 4036c4b6f1481e..2ee61395e1275f 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -1480,7 +1480,9 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
#ifdef ALIGN_DEBUG
printf("unix_write_byte: O_DIRECT fallback\n");
#endif
- return EXT2_ET_UNIMPLEMENTED;
+ if (!IS_ALIGNED(data->offset + offset, channel->align) ||
+ !IS_ALIGNED(data->offset + offset + size, channel->align))
+ return EXT2_ET_UNIMPLEMENTED;
}
#ifndef NO_IO_CACHE
* [PATCH 08/10] libext2fs: allow clients to ask to write full superblocks
2025-08-21 0:49 ` [PATCHSET RFC v4 2/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
` (6 preceding siblings ...)
2025-08-21 1:14 ` [PATCH 07/10] libext2fs: allow unix_write_byte when the write would be aligned Darrick J. Wong
@ 2025-08-21 1:15 ` Darrick J. Wong
2025-08-21 1:15 ` [PATCH 09/10] libext2fs: allow callers to disallow I/O to file data blocks Darrick J. Wong
2025-08-21 1:15 ` [PATCH 10/10] libext2fs: add posix advisory locking to the unix IO manager Darrick J. Wong
9 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:15 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
write_primary_superblock currently does this weird dance where it will
try to write only the dirty bytes of the primary superblock to disk. In
theory, this is done so that tune2fs can incrementally update superblock
bytes when the filesystem is mounted; ext2 was famous for allowing this
dance to set new fs parameters and have them take effect in real
time.
The ability to do this safely was obliterated back in 2001 when ext3 was
introduced with journalling, because tune2fs has no way to know if the
journal has already logged an updated primary superblock but not yet
written it to disk, which means that they can race to write, and changes
can be lost.
This (non-)safety was further obliterated back in 2012 when I added
checksums to all the metadata blocks in ext4 because anyone else with
the block device open can see the primary superblock in an intermediate
state where the checksum does not match the superblock contents.
At this point in 2025 it's kind of stupid for fuse2fs to be doing this
because you can't have the kernel and fuse2fs mount the same filesystem
at the same time. It also makes fuse2fs op_fsync slow because libext2fs
performs a bunch of small writes and introduces extra fsyncs.
So, add a new flag to ask for full superblock writes, which fuse2fs will
use later.
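To give a feel for the cost being avoided, here is a sketch that counts how many separate writes an incremental update would issue, assuming the incremental path issues one write per contiguous run of changed 16-bit words. demo_count_dirty_runs() and DEMO_SB_WORDS are hypothetical; with the new flag the same update would be a single 1024-byte write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SB_WORDS 512	/* 1024-byte superblock viewed as 16-bit words */

/*
 * Count the contiguous runs of words that differ between the old and
 * new superblock images.  Under the stated assumption, each run would
 * cost one small write in the incremental scheme.
 */
size_t demo_count_dirty_runs(const uint16_t *old_sb, const uint16_t *new_sb,
			     size_t nwords)
{
	size_t i = 0, runs = 0;

	while (i < nwords) {
		if (old_sb[i] == new_sb[i]) {
			i++;
			continue;
		}
		runs++;
		/* Skip to the end of this run of changed words. */
		while (i < nwords && old_sb[i] != new_sb[i])
			i++;
	}
	return runs;
}
```

Changes scattered across the superblock (counters, timestamps, checksum) each form their own run, which is why one full write is likely cheaper as well as safer.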
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/ext2fs/ext2fs.h | 1 +
lib/ext2fs/closefs.c | 7 +++++++
2 files changed, 8 insertions(+)
diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index bb2170b78d6308..dee9feb02624ed 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -220,6 +220,7 @@ typedef struct ext2_file *ext2_file_t;
#define EXT2_FLAG_IBITMAP_TAIL_PROBLEM 0x2000000
#define EXT2_FLAG_THREADS 0x4000000
#define EXT2_FLAG_IGNORE_SWAP_DIRENT 0x8000000
+#define EXT2_FLAG_WRITE_FULL_SUPER 0x10000000
/*
* Internal flags for use by the ext2fs library only
diff --git a/lib/ext2fs/closefs.c b/lib/ext2fs/closefs.c
index 8e5bec03a050de..9a67db76e7b326 100644
--- a/lib/ext2fs/closefs.c
+++ b/lib/ext2fs/closefs.c
@@ -196,6 +196,13 @@ static errcode_t write_primary_superblock(ext2_filsys fs,
int check_idx, write_idx, size;
errcode_t retval;
+ if (fs->flags & EXT2_FLAG_WRITE_FULL_SUPER) {
+ retval = io_channel_write_byte(fs->io, SUPERBLOCK_OFFSET,
+ SUPERBLOCK_SIZE, super);
+ if (!retval)
+ return 0;
+ }
+
if (!fs->io->manager->write_byte || !fs->orig_super) {
fallback:
io_channel_set_blksize(fs->io, SUPERBLOCK_OFFSET);
* [PATCH 09/10] libext2fs: allow callers to disallow I/O to file data blocks
2025-08-21 0:49 ` [PATCHSET RFC v4 2/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
` (7 preceding siblings ...)
2025-08-21 1:15 ` [PATCH 08/10] libext2fs: allow clients to ask to write full superblocks Darrick J. Wong
@ 2025-08-21 1:15 ` Darrick J. Wong
2025-08-21 1:15 ` [PATCH 10/10] libext2fs: add posix advisory locking to the unix IO manager Darrick J. Wong
9 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:15 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Add a flag to ext2_file_t to disallow read and write I/O to file data
blocks. This is needed for fuse2fs iomap support, which will keep all
the file data I/O inside the kernel.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/ext2fs/ext2fs.h | 3 +++
lib/ext2fs/fileio.c | 12 +++++++++++-
2 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index dee9feb02624ed..7d36b1a839dc57 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -178,6 +178,9 @@ typedef struct ext2_struct_dblist *ext2_dblist;
#define EXT2_FILE_WRITE 0x0001
#define EXT2_FILE_CREATE 0x0002
+/* no file I/O to disk blocks, only to inline data */
+#define EXT2_FILE_NOBLOCKIO 0x0004
+
#define EXT2_FILE_MASK 0x00FF
#define EXT2_FILE_BUF_DIRTY 0x4000
diff --git a/lib/ext2fs/fileio.c b/lib/ext2fs/fileio.c
index 3a36e9e7fff43b..95ee45ec7371ae 100644
--- a/lib/ext2fs/fileio.c
+++ b/lib/ext2fs/fileio.c
@@ -314,6 +314,11 @@ errcode_t ext2fs_file_read(ext2_file_t file, void *buf,
if (file->inode.i_flags & EXT4_INLINE_DATA_FL)
return ext2fs_file_read_inline_data(file, buf, wanted, got);
+ if (file->flags & EXT2_FILE_NOBLOCKIO) {
+ retval = EXT2_ET_OP_NOT_SUPPORTED;
+ goto fail;
+ }
+
while ((file->pos < EXT2_I_SIZE(&file->inode)) && (wanted > 0)) {
retval = sync_buffer_position(file);
if (retval)
@@ -441,6 +446,11 @@ errcode_t ext2fs_file_write(ext2_file_t file, const void *buf,
retval = 0;
}
+ if (file->flags & EXT2_FILE_NOBLOCKIO) {
+ retval = EXT2_ET_OP_NOT_SUPPORTED;
+ goto fail;
+ }
+
while (nbytes > 0) {
retval = sync_buffer_position(file);
if (retval)
@@ -609,7 +619,7 @@ static errcode_t ext2fs_file_zero_past_offset(ext2_file_t file,
int ret_flags;
errcode_t retval;
- if (off == 0)
+ if (off == 0 || (file->flags & EXT2_FILE_NOBLOCKIO))
return 0;
retval = sync_buffer_position(file);
* [PATCH 10/10] libext2fs: add posix advisory locking to the unix IO manager
2025-08-21 0:49 ` [PATCHSET RFC v4 2/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
` (8 preceding siblings ...)
2025-08-21 1:15 ` [PATCH 09/10] libext2fs: allow callers to disallow I/O to file data blocks Darrick J. Wong
@ 2025-08-21 1:15 ` Darrick J. Wong
9 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:15 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Add support for using flock() to protect the files opened by the Unix IO
manager so that we can't mount the same fs multiple times. This also
prevents systemd and udev from accessing the device while e2fsprogs is
doing something with the device.
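The mutual exclusion that flock() provides between two independent opens of the same file can be sketched as follows. This example uses LOCK_NB for an immediate answer instead of the alarm()-based five-second timeout in the patch, and demo_open_locked() is a hypothetical helper, not part of libext2fs.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Open path and try to take an advisory lock without blocking.
 * Returns the locked fd, or -1 with errno set (EWOULDBLOCK if another
 * open file description holds a conflicting lock).
 */
int demo_open_locked(const char *path, int exclusive)
{
	int op = (exclusive ? LOCK_EX : LOCK_SH) | LOCK_NB;
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return -1;

	if (flock(fd, op) != 0) {
		int saved = errno;

		close(fd);
		errno = saved;	/* preserve flock's errno across close */
		return -1;
	}
	return fd;
}
```

Because flock() locks belong to the open file description, a second open() of the same path is refused even within one process, and closing the first descriptor releases the lock.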
Link: https://systemd.io/BLOCK_DEVICE_LOCKING/
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/ext2fs/unix_io.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 64 insertions(+)
diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 2ee61395e1275f..4a841f7f2133d4 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -65,6 +65,12 @@
#include <pthread.h>
#endif
+#if defined(HAVE_SYS_FILE_H) && defined(HAVE_SIGNAL_H)
+# include <sys/file.h>
+# include <signal.h>
+# define WANT_LOCK_UNIX_FD
+#endif
+
#if defined(__linux__) && defined(_IO) && !defined(BLKROGET)
#define BLKROGET _IO(0x12, 94) /* Get read-only status (0 = read_write). */
#endif
@@ -149,6 +155,9 @@ struct unix_private_data {
pthread_mutex_t bounce_mutex;
pthread_mutex_t stats_mutex;
#endif
+#ifdef WANT_LOCK_UNIX_FD
+ int lock_flags;
+#endif
};
#define IS_ALIGNED(n, align) ((((uintptr_t) n) & \
@@ -897,6 +906,47 @@ int ext2fs_fstat(int fd, ext2fs_struct_stat *buf)
#endif
}
+#ifdef WANT_LOCK_UNIX_FD
+static void unix_lock_alarm_handler(int signal, siginfo_t *data, void *p)
+{
+ /* do nothing, the signal will abort the flock operation */
+}
+
+static int unix_lock_fd(int fd, int flags)
+{
+ struct sigaction newsa = {
+ .sa_flags = SA_SIGINFO,
+ .sa_sigaction = unix_lock_alarm_handler,
+ };
+ struct sigaction oldsa;
+ const int operation = (flags & IO_FLAG_EXCLUSIVE) ? LOCK_EX : LOCK_SH;
+ int ret;
+
+ /* wait five seconds for the lock */
+ ret = sigaction(SIGALRM, &newsa, &oldsa);
+ if (ret)
+ return ret;
+
+ alarm(5);
+
+ ret = flock(fd, operation);
+ if (ret == 0)
+ ret = operation;
+ else if (errno == EINTR) {
+ errno = EWOULDBLOCK;
+ ret = -1;
+ }
+
+ alarm(0);
+ sigaction(SIGALRM, &oldsa, NULL);
+ return ret;
+}
+
+static void unix_unlock_fd(int fd)
+{
+ flock(fd, LOCK_UN);
+}
+#endif
static errcode_t unix_open_channel(const char *name, int fd,
int flags, io_channel *channel,
@@ -935,6 +985,16 @@ static errcode_t unix_open_channel(const char *name, int fd,
if (retval)
goto cleanup;
+#ifdef WANT_LOCK_UNIX_FD
+ if (flags & IO_FLAG_RW) {
+ data->lock_flags = unix_lock_fd(fd, flags);
+ if (data->lock_flags < 0) {
+ retval = errno;
+ goto cleanup;
+ }
+ }
+#endif
+
strcpy(io->name, name);
io->private_data = data;
io->block_size = 1024;
@@ -1201,6 +1261,10 @@ static errcode_t unix_close(io_channel channel)
if (retval2 && !retval)
retval = retval2;
+#ifdef WANT_LOCK_UNIX_FD
+ if (data->lock_flags)
+ unix_unlock_fd(data->dev);
+#endif
if (close(data->dev) < 0 && !retval)
retval = errno;
free_cache(data);
* [PATCH 01/19] fuse2fs: implement bare minimum iomap for file mapping reporting
2025-08-21 0:49 ` [PATCHSET RFC v4 3/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
@ 2025-08-21 1:15 ` Darrick J. Wong
2025-08-21 1:16 ` [PATCH 02/19] fuse2fs: add iomap= mount option Darrick J. Wong
` (17 subsequent siblings)
18 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:15 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Add enough of an iomap implementation that we can do FIEMAP and
SEEK_DATA and SEEK_HOLE.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
configure | 46 +++++
configure.ac | 31 +++
lib/config.h.in | 3
misc/fuse2fs.c | 521 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
misc/fuse4fs.c | 521 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
5 files changed, 1108 insertions(+), 14 deletions(-)
diff --git a/configure b/configure
index 8afc53f89f2bf4..9a0398e26b36e9 100755
--- a/configure
+++ b/configure
@@ -14769,6 +14769,52 @@ fi
fi
+if test "$FUSE_USE_VERSION" -ge 30
+then
+{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: checking for iomap_begin in libfuse" >&5
+printf %s "checking for iomap_begin in libfuse... " >&6; }
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+#define _GNU_SOURCE
+#define _FILE_OFFSET_BITS 64
+#define FUSE_USE_VERSION 399
+#include <fuse.h>
+
+int
+main (void)
+{
+
+struct fuse_operations fs_ops = {
+ .iomap_begin = NULL,
+ .iomap_end = NULL,
+};
+struct fuse_file_iomap narf = { };
+
+ ;
+ return 0;
+}
+
+_ACEOF
+if ac_fn_c_try_link "$LINENO"
+then :
+ have_fuse_iomap=yes
+ { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+printf "%s\n" "yes" >&6; }
+else $as_nop
+ { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: no" >&5
+printf "%s\n" "no" >&6; }
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.beam \
+ conftest$ac_exeext conftest.$ac_ext
+if test "$have_fuse_iomap" = yes; then
+ FUSE_USE_VERSION=399
+
+printf "%s\n" "#define HAVE_FUSE_IOMAP 1" >>confdefs.h
+
+fi
+fi
+
if test -n "$FUSE_USE_VERSION"
then
diff --git a/configure.ac b/configure.ac
index 37dbfa0be4d7fc..bac5d512dd8c5f 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1478,6 +1478,37 @@ fi
fi
AC_SUBST(FUSE4_CMT)
+if test "$FUSE_USE_VERSION" -ge 30
+then
+dnl
+dnl see if fuse3 supports iomap
+dnl
+AC_MSG_CHECKING(for iomap_begin in libfuse)
+AC_LINK_IFELSE(
+[ AC_LANG_PROGRAM([[
+#define _GNU_SOURCE
+#define _FILE_OFFSET_BITS 64
+#define FUSE_USE_VERSION 399
+#include <fuse.h>
+ ]], [[
+struct fuse_operations fs_ops = {
+ .iomap_begin = NULL,
+ .iomap_end = NULL,
+};
+struct fuse_file_iomap narf = { };
+ ]])
+], have_fuse_iomap=yes
+ AC_MSG_RESULT(yes),
+ AC_MSG_RESULT(no))
+if test "$have_fuse_iomap" = yes; then
+ FUSE_USE_VERSION=399
+ AC_DEFINE(HAVE_FUSE_IOMAP, 1, [Define to 1 if fuse supports iomap])
+fi
+fi
+
+dnl
+dnl set FUSE_USE_VERSION now that we've done all the feature tests
+dnl
if test -n "$FUSE_USE_VERSION"
then
AC_DEFINE_UNQUOTED(FUSE_USE_VERSION, $FUSE_USE_VERSION,
diff --git a/lib/config.h.in b/lib/config.h.in
index c3379758c3c9bc..55e515020af422 100644
--- a/lib/config.h.in
+++ b/lib/config.h.in
@@ -76,6 +76,9 @@
/* Define to 1 if fuse supports lowlevel API */
#undef HAVE_FUSE_LOWLEVEL
+/* Define to 1 if fuse supports iomap */
+#undef HAVE_FUSE_IOMAP
+
/* Define to 1 if you have the Mac OS X function
CFLocaleCopyPreferredLanguages in the CoreFoundation framework. */
#undef HAVE_CFLOCALECOPYPREFERREDLANGUAGES
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index fa7642b8854c7d..7c87573677e172 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -146,6 +146,9 @@ static inline uint64_t round_down(uint64_t b, unsigned int align)
return b - m;
}
+#define max(a, b) ((a) > (b) ? (a) : (b))
+#define min(a, b) ((a) < (b) ? (a) : (b))
+
#define dbg_printf(fuse2fs, format, ...) \
while ((fuse2fs)->debug) { \
printf("FUSE2FS (%s): tid=%d " format, (fuse2fs)->shortdev, gettid(), ##__VA_ARGS__); \
@@ -223,6 +226,14 @@ enum fuse2fs_opstate {
F2OP_SHUTDOWN,
};
+#ifdef HAVE_FUSE_IOMAP
+enum fuse2fs_iomap_state {
+ IOMAP_DISABLED,
+ IOMAP_UNKNOWN,
+ IOMAP_ENABLED,
+};
+#endif
+
/* Main program context */
#define FUSE2FS_MAGIC (0xEF53DEADUL)
struct fuse2fs {
@@ -249,6 +260,9 @@ struct fuse2fs {
enum fuse2fs_opstate opstate;
int logfd;
int blocklog;
+#ifdef HAVE_FUSE_IOMAP
+ enum fuse2fs_iomap_state iomap_state;
+#endif
unsigned int blockmask;
unsigned long offset;
unsigned int next_generation;
@@ -542,6 +556,15 @@ static inline void __fuse2fs_finish(struct fuse2fs *ff, int ret,
}
#define fuse2fs_finish(ff, ret) __fuse2fs_finish((ff), (ret), __func__)
+#ifdef HAVE_FUSE_IOMAP
+static inline int fuse2fs_iomap_enabled(const struct fuse2fs *ff)
+{
+ return ff->iomap_state >= IOMAP_ENABLED;
+}
+#else
+# define fuse2fs_iomap_enabled(...) (0)
+#endif
+
static void get_now(struct timespec *now)
{
#ifdef CLOCK_REALTIME
@@ -936,7 +959,7 @@ static errcode_t fuse2fs_open(struct fuse2fs *ff, int libext2_flags)
{
char options[128];
int flags = EXT2_FLAG_64BITS | EXT2_FLAG_THREADS | EXT2_FLAG_RW |
- libext2_flags;
+ EXT2_FLAG_WRITE_FULL_SUPER | libext2_flags;
errcode_t err;
if (ff->lockfile) {
@@ -1292,6 +1315,29 @@ static inline int fuse_set_feature_flag(struct fuse_conn_info *conn,
}
#endif
+#ifdef HAVE_FUSE_IOMAP
+static void fuse2fs_iomap_enable(struct fuse_conn_info *conn,
+ struct fuse2fs *ff)
+{
+ /* iomap only works with block devices */
+ if (ff->iomap_state != IOMAP_DISABLED && fuse2fs_on_bdev(ff) &&
+ fuse_set_feature_flag(conn, FUSE_CAP_IOMAP)) {
+ /*
+ * If we're mounting in iomap mode, we need to unmount in
+ * op_destroy so that the block device will be released before
+ * umount(2) returns.
+ */
+ ff->unmount_in_destroy = 1;
+ ff->iomap_state = IOMAP_ENABLED;
+ }
+
+ if (ff->iomap_state == IOMAP_UNKNOWN)
+ ff->iomap_state = IOMAP_DISABLED;
+}
+#else
+# define fuse2fs_iomap_enable(...) ((void)0)
+#endif
+
static void *op_init(struct fuse_conn_info *conn
#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
, struct fuse_config *cfg EXT2FS_ATTR((unused))
@@ -1328,6 +1374,8 @@ static void *op_init(struct fuse_conn_info *conn
#ifdef FUSE_CAP_NO_EXPORT_SUPPORT
fuse_set_feature_flag(conn, FUSE_CAP_NO_EXPORT_SUPPORT);
#endif
+ fuse2fs_iomap_enable(conn, ff);
+
#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
conn->time_gran = 1;
cfg->use_ino = 1;
@@ -4928,6 +4976,459 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode,
# endif /* SUPPORT_FALLOCATE */
#endif /* FUSE 29 */
+#ifdef HAVE_FUSE_IOMAP
+static void fuse2fs_iomap_hole(struct fuse2fs *ff, struct fuse_file_iomap *iomap,
+ off_t pos, uint64_t count)
+{
+ iomap->dev = FUSE_IOMAP_DEV_NULL;
+ iomap->addr = FUSE_IOMAP_NULL_ADDR;
+ iomap->offset = pos;
+ iomap->length = count;
+ iomap->type = FUSE_IOMAP_TYPE_HOLE;
+}
+
+static void fuse2fs_iomap_hole_to_eof(struct fuse2fs *ff,
+ struct fuse_file_iomap *iomap, off_t pos,
+ off_t count,
+ const struct ext2_inode_large *inode)
+{
+ ext2_filsys fs = ff->fs;
+ uint64_t isize = EXT2_I_SIZE(inode);
+
+ /*
+ * We have to be careful about handling a hole to the right of the
+ * entire mapping tree. First, the mapping must start and end on a
+ * block boundary because they must be aligned to at least an LBA for
+ * the block layer; and to the fsblock for smoother operation.
+ *
+ * As for the length -- we could return a mapping all the way to
+ * i_size, but i_size could be less than pos/count if we're zeroing the
+ * EOF block in anticipation of a truncate operation. Similarly, we
+ * don't want to end the mapping at pos+count because we know there's
+ * nothing mapped beyond here.
+ */
+ uint64_t startoff = round_down(pos, fs->blocksize);
+ uint64_t eofoff = round_up(max(pos + count, isize), fs->blocksize);
+
+ dbg_printf(ff,
+ "pos=0x%llx count=0x%llx isize=0x%llx startoff=0x%llx eofoff=0x%llx\n",
+ (unsigned long long)pos,
+ (unsigned long long)count,
+ (unsigned long long)isize,
+ (unsigned long long)startoff,
+ (unsigned long long)eofoff);
+
+ fuse2fs_iomap_hole(ff, iomap, startoff, eofoff - startoff);
+}
+
+#define DEBUG_IOMAP
+#ifdef DEBUG_IOMAP
+# define __DUMP_EXTENT(ff, func, tag, startoff, err, extent) \
+ do { \
+ dbg_printf((ff), \
+ "%s: %s startoff 0x%llx err %ld lblk 0x%llx pblk 0x%llx len 0x%x flags 0x%x\n", \
+ (func), (tag), (startoff), (err), (extent)->e_lblk, \
+ (extent)->e_pblk, (extent)->e_len, \
+ (extent)->e_flags & EXT2_EXTENT_FLAGS_UNINIT); \
+ } while(0)
+# define DUMP_EXTENT(ff, tag, startoff, err, extent) \
+ __DUMP_EXTENT((ff), __func__, (tag), (startoff), (err), (extent))
+
+# define __DUMP_INFO(ff, func, tag, startoff, err, info) \
+ do { \
+ dbg_printf((ff), \
+ "%s: %s startoff 0x%llx err %ld entry %d/%d/%d level %d/%d\n", \
+ (func), (tag), (startoff), (err), \
+ (info)->curr_entry, (info)->num_entries, \
+ (info)->max_entries, (info)->curr_level, \
+ (info)->max_depth); \
+ } while(0)
+# define DUMP_INFO(ff, tag, startoff, err, info) \
+ __DUMP_INFO((ff), __func__, (tag), (startoff), (err), (info))
+#else
+# define __DUMP_EXTENT(...) ((void)0)
+# define DUMP_EXTENT(...) ((void)0)
+# define DUMP_INFO(...) ((void)0)
+#endif
+
+static inline errcode_t __fuse2fs_get_mapping_at(struct fuse2fs *ff,
+ ext2_extent_handle_t handle,
+ blk64_t startoff,
+ struct ext2fs_extent *bmap,
+ const char *func)
+{
+ errcode_t err;
+
+ /*
+ * Find the file mapping at startoff. We don't check the return value
+ * of _goto because _get will error out if _goto failed. There's a
+ * subtlety to the outcome of _goto when startoff falls in a sparse
+ * hole however:
+ *
+ * Most of the time, _goto points the cursor at the mapping whose lblk
+ * is just to the left of startoff. The mapping may or may not overlap
+ * startoff; this is ok. In other words, the tree lookup behaves as if
+ * we asked it to use a less than or equals comparison.
+ *
+ * However, if startoff is to the left of the first mapping in the
+ * extent tree, _goto points the cursor at that first mapping because
+ * it doesn't know how to deal with this situation. In this case,
+ * the tree lookup behaves as if we asked it to use a greater than
+ * or equals comparison.
+ *
+ * Note: If _get() returns 'no current node', that means that there
+ * aren't any mappings at all.
+ */
+ ext2fs_extent_goto(handle, startoff);
+ err = ext2fs_extent_get(handle, EXT2_EXTENT_CURRENT, bmap);
+ __DUMP_EXTENT(ff, func, "lookup", startoff, err, bmap);
+ if (err == EXT2_ET_NO_CURRENT_NODE)
+ err = EXT2_ET_EXTENT_NOT_FOUND;
+ return err;
+}
+
+static inline errcode_t __fuse2fs_get_next_mapping(struct fuse2fs *ff,
+ ext2_extent_handle_t handle,
+ blk64_t startoff,
+ struct ext2fs_extent *bmap,
+ const char *func)
+{
+ struct ext2fs_extent newex;
+ struct ext2_extent_info info;
+ errcode_t err;
+
+ /*
+ * The extent tree code has this (probably broken) behavior that if
+ * more than two of the highest levels of the cursor point at the
+ * rightmost edge of an extent tree block, a _NEXT_LEAF movement fails
+ * to move the cursor position of any of the lower levels. IOWs, if
+ * leaf level N is at the right edge, it will only advance level N-1
+ * to the right. If N-1 was at the right edge, the cursor resets to
+ * record 0 of that level and goes down to the wrong leaf.
+ *
+ * Work around this by walking up (towards root level 0) the extent
+ * tree until we find a level where we're not already at the rightmost
+ * edge. The _NEXT_LEAF movement will walk down the tree to find the
+ * leaves.
+ */
+ err = ext2fs_extent_get_info(handle, &info);
+ DUMP_INFO(ff, "UP?", startoff, err, &info);
+ if (err)
+ return err;
+
+ while (info.curr_entry == info.num_entries && info.curr_level > 0) {
+ err = ext2fs_extent_get(handle, EXT2_EXTENT_UP, &newex);
+ DUMP_EXTENT(ff, "UP", startoff, err, &newex);
+ if (err)
+ return err;
+ err = ext2fs_extent_get_info(handle, &info);
+ DUMP_INFO(ff, "UP", startoff, err, &info);
+ if (err)
+ return err;
+ }
+
+ /*
+ * If we're at the root and there are no more entries, there's nothing
+ * else to be found.
+ */
+ if (info.curr_level == 0 && info.curr_entry == info.num_entries)
+ return EXT2_ET_EXTENT_NOT_FOUND;
+
+ /* Otherwise grab this next leaf and return it. */
+ err = ext2fs_extent_get(handle, EXT2_EXTENT_NEXT_LEAF, &newex);
+ DUMP_EXTENT(ff, "NEXT", startoff, err, &newex);
+ if (err)
+ return err;
+
+ *bmap = newex;
+ return 0;
+}
+
+#define fuse2fs_get_mapping_at(ff, handle, startoff, bmap) \
+ __fuse2fs_get_mapping_at((ff), (handle), (startoff), (bmap), __func__)
+#define fuse2fs_get_next_mapping(ff, handle, startoff, bmap) \
+ __fuse2fs_get_next_mapping((ff), (handle), (startoff), (bmap), __func__)
+
+static errcode_t fuse2fs_iomap_begin_extent(struct fuse2fs *ff, uint64_t ino,
+ struct ext2_inode_large *inode,
+ off_t pos, uint64_t count,
+ uint32_t opflags,
+ struct fuse_file_iomap *iomap)
+{
+ ext2_extent_handle_t handle;
+ struct ext2fs_extent extent = { };
+ ext2_filsys fs = ff->fs;
+ const blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos);
+ errcode_t err;
+ int ret = 0;
+
+ err = ext2fs_extent_open2(fs, ino, EXT2_INODE(inode), &handle);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = fuse2fs_get_mapping_at(ff, handle, startoff, &extent);
+ if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+ /* No mappings at all; the whole range is a hole. */
+ fuse2fs_iomap_hole_to_eof(ff, iomap, pos, count, inode);
+ goto out_handle;
+ }
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out_handle;
+ }
+
+ if (startoff < extent.e_lblk) {
+ /*
+ * Mapping starts to the right of the current position.
+ * Synthesize a hole going to that next extent.
+ */
+ fuse2fs_iomap_hole(ff, iomap, FUSE2FS_FSB_TO_B(ff, startoff),
+ FUSE2FS_FSB_TO_B(ff, extent.e_lblk - startoff));
+ goto out_handle;
+ }
+
+ if (startoff >= extent.e_lblk + extent.e_len) {
+ /*
+ * Mapping ends to the left of the current position. Try to
+ * find the next mapping. If there is no next mapping, the
+ * whole range is in a hole.
+ */
+ err = fuse2fs_get_next_mapping(ff, handle, startoff, &extent);
+ if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+ fuse2fs_iomap_hole_to_eof(ff, iomap, pos, count, inode);
+ goto out_handle;
+ }
+
+ /*
+ * If the new mapping starts to the right of startoff, there's
+ * a hole from startoff to the start of the new mapping.
+ */
+ if (startoff < extent.e_lblk) {
+ fuse2fs_iomap_hole(ff, iomap,
+ FUSE2FS_FSB_TO_B(ff, startoff),
+ FUSE2FS_FSB_TO_B(ff, extent.e_lblk - startoff));
+ goto out_handle;
+ }
+
+ /*
+ * The new mapping starts at startoff. Something weird
+ * happened in the extent tree lookup, but we found a valid
+ * mapping so we'll run with it.
+ */
+ }
+
+ /* Mapping overlaps startoff, report this. */
+ iomap->dev = FUSE_IOMAP_DEV_NULL;
+ iomap->addr = FUSE2FS_FSB_TO_B(ff, extent.e_pblk);
+ iomap->offset = FUSE2FS_FSB_TO_B(ff, extent.e_lblk);
+ iomap->length = FUSE2FS_FSB_TO_B(ff, extent.e_len);
+ if (extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT)
+ iomap->type = FUSE_IOMAP_TYPE_UNWRITTEN;
+ else
+ iomap->type = FUSE_IOMAP_TYPE_MAPPED;
+
+out_handle:
+ ext2fs_extent_free(handle);
+ return ret;
+}
+
+static int fuse2fs_iomap_begin_indirect(struct fuse2fs *ff, uint64_t ino,
+ struct ext2_inode_large *inode,
+ off_t pos, uint64_t count,
+ uint32_t opflags,
+ struct fuse_file_iomap *iomap)
+{
+ ext2_filsys fs = ff->fs;
+ blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos);
+ uint64_t isize = EXT2_I_SIZE(inode);
+ uint64_t real_count = min(count, 131072);
+ const blk64_t endoff = FUSE2FS_B_TO_FSB(ff, pos + real_count);
+ blk64_t startblock;
+ errcode_t err;
+
+ err = ext2fs_bmap2(fs, ino, EXT2_INODE(inode), NULL, 0, startoff, NULL,
+ &startblock);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ iomap->dev = FUSE_IOMAP_DEV_NULL;
+ iomap->offset = FUSE2FS_FSB_TO_B(ff, startoff);
+ iomap->flags |= FUSE_IOMAP_F_MERGED;
+ if (startblock) {
+ iomap->addr = FUSE2FS_FSB_TO_B(ff, startblock);
+ iomap->type = FUSE_IOMAP_TYPE_MAPPED;
+ } else {
+ iomap->addr = FUSE_IOMAP_NULL_ADDR;
+ iomap->type = FUSE_IOMAP_TYPE_HOLE;
+ }
+ iomap->length = fs->blocksize;
+
+ /* See how long the mapping goes for. */
+ for (startoff++; startoff < endoff; startoff++) {
+ blk64_t prev_startblock = startblock;
+
+ err = ext2fs_bmap2(fs, ino, EXT2_INODE(inode), NULL, 0,
+ startoff, NULL, &startblock);
+ if (err)
+ break;
+
+ if (iomap->type == FUSE_IOMAP_TYPE_MAPPED) {
+ if (startblock == prev_startblock + 1)
+ iomap->length += fs->blocksize;
+ else
+ break;
+ } else {
+ if (startblock == 0)
+ iomap->length += fs->blocksize;
+ else
+ break;
+ }
+ }
+
+ /*
+ * If this is a hole that goes beyond EOF, report this as a hole to the
+ * end of the range queried so that FIEMAP doesn't go mad.
+ */
+ if (iomap->type == FUSE_IOMAP_TYPE_HOLE &&
+ iomap->offset + iomap->length >= isize)
+ fuse2fs_iomap_hole_to_eof(ff, iomap, pos, count, inode);
+
+ return 0;
+}
+
+static int fuse2fs_iomap_begin_inline(struct fuse2fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode, off_t pos,
+ uint64_t count, struct fuse_file_iomap *iomap)
+{
+ uint64_t one_fsb = FUSE2FS_FSB_TO_B(ff, 1);
+
+ if (pos >= one_fsb) {
+ fuse2fs_iomap_hole_to_eof(ff, iomap, pos, count, inode);
+ } else {
+ /* ext4 only supports inline data files up to 1 fsb */
+ iomap->dev = FUSE_IOMAP_DEV_NULL;
+ iomap->addr = FUSE_IOMAP_NULL_ADDR;
+ iomap->offset = 0;
+ iomap->length = one_fsb;
+ iomap->type = FUSE_IOMAP_TYPE_INLINE;
+ }
+
+ return 0;
+}
+
+static int fuse2fs_iomap_begin_report(struct fuse2fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode,
+ off_t pos, uint64_t count,
+ uint32_t opflags,
+ struct fuse_file_iomap *read)
+{
+ if (inode->i_flags & EXT4_INLINE_DATA_FL)
+ return fuse2fs_iomap_begin_inline(ff, ino, inode, pos, count,
+ read);
+
+ if (inode->i_flags & EXT4_EXTENTS_FL)
+ return fuse2fs_iomap_begin_extent(ff, ino, inode, pos, count,
+ opflags, read);
+
+ return fuse2fs_iomap_begin_indirect(ff, ino, inode, pos, count,
+ opflags, read);
+}
+
+static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode, off_t pos,
+ uint64_t count, uint32_t opflags,
+ struct fuse_file_iomap *read)
+{
+ return -ENOSYS;
+}
+
+static int fuse2fs_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode, off_t pos,
+ uint64_t count, uint32_t opflags,
+ struct fuse_file_iomap *read)
+{
+ return -ENOSYS;
+}
+
+static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
+ off_t pos, uint64_t count, uint32_t opflags,
+ struct fuse_file_iomap *read,
+ struct fuse_file_iomap *write)
+{
+ struct fuse2fs *ff = fuse2fs_get();
+ struct ext2_inode_large inode;
+ ext2_filsys fs;
+ errcode_t err;
+ int ret = 0;
+
+ FUSE2FS_CHECK_CONTEXT(ff);
+
+ dbg_printf(ff,
+ "%s: path=%s nodeid=%llu attr_ino=%llu pos=0x%llx count=0x%llx opflags=0x%x\n",
+ __func__, path,
+ (unsigned long long)nodeid,
+ (unsigned long long)attr_ino,
+ (unsigned long long)pos,
+ (unsigned long long)count,
+ opflags);
+
+ fs = fuse2fs_start(ff);
+ err = fuse2fs_read_inode(fs, attr_ino, &inode);
+ if (err) {
+ ret = translate_error(fs, attr_ino, err);
+ goto out_unlock;
+ }
+
+ if (opflags & FUSE_IOMAP_OP_REPORT)
+ ret = fuse2fs_iomap_begin_report(ff, attr_ino, &inode, pos,
+ count, opflags, read);
+ else if (fuse_iomap_is_write(opflags))
+ ret = fuse2fs_iomap_begin_write(ff, attr_ino, &inode, pos,
+ count, opflags, read);
+ else
+ ret = fuse2fs_iomap_begin_read(ff, attr_ino, &inode, pos,
+ count, opflags, read);
+ if (ret)
+ goto out_unlock;
+
+ dbg_printf(ff, "%s: nodeid=%llu attr_ino=%llu pos=0x%llx -> addr=0x%llx offset=0x%llx length=0x%llx type=%u\n",
+ __func__,
+ (unsigned long long)nodeid,
+ (unsigned long long)attr_ino,
+ (unsigned long long)pos,
+ (unsigned long long)read->addr,
+ (unsigned long long)read->offset,
+ (unsigned long long)read->length,
+ read->type);
+
+out_unlock:
+ fuse2fs_finish(ff, ret);
+ return ret;
+}
+
+static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
+ off_t pos, uint64_t count, uint32_t opflags,
+ ssize_t written, const struct fuse_file_iomap *iomap)
+{
+ struct fuse2fs *ff = fuse2fs_get();
+
+ FUSE2FS_CHECK_CONTEXT(ff);
+
+ dbg_printf(ff,
+ "%s: path=%s nodeid=%llu attr_ino=%llu pos=0x%llx count=0x%llx opflags=0x%x written=0x%zx mapflags=0x%x\n",
+ __func__, path,
+ (unsigned long long)nodeid,
+ (unsigned long long)attr_ino,
+ (unsigned long long)pos,
+ (unsigned long long)count,
+ opflags,
+ written,
+ iomap->flags);
+
+ return 0;
+}
+#endif /* HAVE_FUSE_IOMAP */
+
static struct fuse_operations fs_ops = {
.init = op_init,
.destroy = op_destroy,
@@ -4988,6 +5489,10 @@ static struct fuse_operations fs_ops = {
.fallocate = op_fallocate,
# endif
#endif
+#ifdef HAVE_FUSE_IOMAP
+ .iomap_begin = op_iomap_begin,
+ .iomap_end = op_iomap_end,
+#endif /* HAVE_FUSE_IOMAP */
};
static int get_random_bytes(void *p, size_t sz)
@@ -5211,17 +5716,19 @@ static void fuse2fs_com_err_proc(const char *whoami, errcode_t code,
int main(int argc, char *argv[])
{
struct fuse_args args = FUSE_ARGS_INIT(argc, argv);
- struct fuse2fs fctx;
+ struct fuse2fs fctx = {
+ .magic = FUSE2FS_MAGIC,
+ .opstate = F2OP_WRITABLE,
+ .logfd = -1,
+#ifdef HAVE_FUSE_IOMAP
+ .iomap_state = IOMAP_UNKNOWN,
+#endif
+ };
errcode_t err;
FILE *orig_stderr = stderr;
char extra_args[BUFSIZ];
int ret;
- memset(&fctx, 0, sizeof(fctx));
- fctx.magic = FUSE2FS_MAGIC;
- fctx.logfd = -1;
- fctx.opstate = F2OP_WRITABLE;
-
ret = fuse_opt_parse(&args, &fctx, fuse2fs_opts, fuse2fs_opt_proc);
if (ret)
exit(1);
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index cd7e30eaeb7757..93570d25e91d5c 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -143,6 +143,9 @@ static inline uint64_t round_down(uint64_t b, unsigned int align)
return b - m;
}
+#define max(a, b) ((a) > (b) ? (a) : (b))
+#define min(a, b) ((a) < (b) ? (a) : (b))
+
#define dbg_printf(fuse4fs, format, ...) \
while ((fuse4fs)->debug) { \
printf("FUSE4FS (%s): tid=%d " format, (fuse4fs)->shortdev, gettid(), ##__VA_ARGS__); \
@@ -219,6 +222,14 @@ enum fuse4fs_opstate {
F4OP_SHUTDOWN,
};
+#ifdef HAVE_FUSE_IOMAP
+enum fuse4fs_iomap_state {
+ IOMAP_DISABLED,
+ IOMAP_UNKNOWN,
+ IOMAP_ENABLED,
+};
+#endif
+
/* Main program context */
#define FUSE4FS_MAGIC (0xEF53DEADUL)
struct fuse4fs {
@@ -245,6 +256,9 @@ struct fuse4fs {
enum fuse4fs_opstate opstate;
int logfd;
int blocklog;
+#ifdef HAVE_FUSE_IOMAP
+ enum fuse4fs_iomap_state iomap_state;
+#endif
unsigned int blockmask;
unsigned long offset;
unsigned int next_generation;
@@ -695,6 +709,15 @@ static inline void __fuse4fs_finish(struct fuse4fs *ff, int ret,
}
#define fuse4fs_finish(ff, ret) __fuse4fs_finish((ff), (ret), __func__)
+#ifdef HAVE_FUSE_IOMAP
+static inline int fuse4fs_iomap_enabled(const struct fuse4fs *ff)
+{
+ return ff->iomap_state >= IOMAP_ENABLED;
+}
+#else
+# define fuse4fs_iomap_enabled(...) (0)
+#endif
+
static void get_now(struct timespec *now)
{
#ifdef CLOCK_REALTIME
@@ -1115,7 +1138,7 @@ static errcode_t fuse4fs_open(struct fuse4fs *ff, int libext2_flags)
{
char options[128];
int flags = EXT2_FLAG_64BITS | EXT2_FLAG_THREADS | EXT2_FLAG_RW |
- libext2_flags;
+ EXT2_FLAG_WRITE_FULL_SUPER | libext2_flags;
errcode_t err;
if (ff->lockfile) {
@@ -1475,6 +1498,29 @@ static inline int fuse_set_feature_flag(struct fuse_conn_info *conn,
}
#endif
+#ifdef HAVE_FUSE_IOMAP
+static void fuse4fs_iomap_enable(struct fuse_conn_info *conn,
+ struct fuse4fs *ff)
+{
+ /* iomap only works with block devices */
+ if (ff->iomap_state != IOMAP_DISABLED && fuse4fs_on_bdev(ff) &&
+ fuse_set_feature_flag(conn, FUSE_CAP_IOMAP)) {
+ /*
+ * If we're mounting in iomap mode, we need to unmount in
+ * op_destroy so that the block device will be released before
+ * umount(2) returns.
+ */
+ ff->unmount_in_destroy = 1;
+ ff->iomap_state = IOMAP_ENABLED;
+ }
+
+ if (ff->iomap_state == IOMAP_UNKNOWN)
+ ff->iomap_state = IOMAP_DISABLED;
+}
+#else
+# define fuse4fs_iomap_enable(...) ((void)0)
+#endif
+
static void op_init(void *userdata, struct fuse_conn_info *conn)
{
struct fuse4fs *ff = userdata;
@@ -1497,6 +1543,7 @@ static void op_init(void *userdata, struct fuse_conn_info *conn)
#ifdef FUSE_CAP_NO_EXPORT_SUPPORT
fuse_set_feature_flag(conn, FUSE_CAP_NO_EXPORT_SUPPORT);
#endif
+ fuse4fs_iomap_enable(conn, ff);
conn->time_gran = 1;
if (ff->kernel) {
@@ -5338,6 +5385,460 @@ static void op_fallocate(fuse_req_t req, fuse_ino_t fino EXT2FS_ATTR((unused)),
}
#endif /* SUPPORT_FALLOCATE */
+#ifdef HAVE_FUSE_IOMAP
+static void fuse4fs_iomap_hole(struct fuse4fs *ff, struct fuse_file_iomap *iomap,
+ off_t pos, uint64_t count)
+{
+ iomap->dev = FUSE_IOMAP_DEV_NULL;
+ iomap->addr = FUSE_IOMAP_NULL_ADDR;
+ iomap->offset = pos;
+ iomap->length = count;
+ iomap->type = FUSE_IOMAP_TYPE_HOLE;
+}
+
+static void fuse4fs_iomap_hole_to_eof(struct fuse4fs *ff,
+ struct fuse_file_iomap *iomap, off_t pos,
+ off_t count,
+ const struct ext2_inode_large *inode)
+{
+ ext2_filsys fs = ff->fs;
+ uint64_t isize = EXT2_I_SIZE(inode);
+
+ /*
+ * We have to be careful about handling a hole to the right of the
+ * entire mapping tree. First, the mapping must start and end on a
+ * block boundary because they must be aligned to at least an LBA for
+ * the block layer; and to the fsblock for smoother operation.
+ *
+ * As for the length -- we could return a mapping all the way to
+ * i_size, but i_size could be less than pos/count if we're zeroing the
+ * EOF block in anticipation of a truncate operation. Similarly, we
+ * don't want to end the mapping at pos+count because we know there's
+ * nothing mapped beyond here.
+ */
+ uint64_t startoff = round_down(pos, fs->blocksize);
+ uint64_t eofoff = round_up(max(pos + count, isize), fs->blocksize);
+
+ dbg_printf(ff,
+ "pos=0x%llx count=0x%llx isize=0x%llx startoff=0x%llx eofoff=0x%llx\n",
+ (unsigned long long)pos,
+ (unsigned long long)count,
+ (unsigned long long)isize,
+ (unsigned long long)startoff,
+ (unsigned long long)eofoff);
+
+ fuse4fs_iomap_hole(ff, iomap, startoff, eofoff - startoff);
+}
+
+#define DEBUG_IOMAP
+#ifdef DEBUG_IOMAP
+# define __DUMP_EXTENT(ff, func, tag, startoff, err, extent) \
+ do { \
+ dbg_printf((ff), \
+ "%s: %s startoff 0x%llx err %ld lblk 0x%llx pblk 0x%llx len 0x%x flags 0x%x\n", \
+ (func), (tag), (startoff), (err), (extent)->e_lblk, \
+ (extent)->e_pblk, (extent)->e_len, \
+ (extent)->e_flags & EXT2_EXTENT_FLAGS_UNINIT); \
+ } while(0)
+# define DUMP_EXTENT(ff, tag, startoff, err, extent) \
+ __DUMP_EXTENT((ff), __func__, (tag), (startoff), (err), (extent))
+
+# define __DUMP_INFO(ff, func, tag, startoff, err, info) \
+ do { \
+ dbg_printf((ff), \
+ "%s: %s startoff 0x%llx err %ld entry %d/%d/%d level %d/%d\n", \
+ (func), (tag), (startoff), (err), \
+ (info)->curr_entry, (info)->num_entries, \
+ (info)->max_entries, (info)->curr_level, \
+ (info)->max_depth); \
+ } while(0)
+# define DUMP_INFO(ff, tag, startoff, err, info) \
+ __DUMP_INFO((ff), __func__, (tag), (startoff), (err), (info))
+#else
+# define __DUMP_EXTENT(...) ((void)0)
+# define DUMP_EXTENT(...) ((void)0)
+# define DUMP_INFO(...) ((void)0)
+#endif
+
+static inline errcode_t __fuse4fs_get_mapping_at(struct fuse4fs *ff,
+ ext2_extent_handle_t handle,
+ blk64_t startoff,
+ struct ext2fs_extent *bmap,
+ const char *func)
+{
+ errcode_t err;
+
+ /*
+ * Find the file mapping at startoff. We don't check the return value
+ * of _goto because _get will error out if _goto failed. There's a
+ * subtlety to the outcome of _goto when startoff falls in a sparse
+ * hole however:
+ *
+ * Most of the time, _goto points the cursor at the mapping whose lblk
+ * is just to the left of startoff. The mapping may or may not overlap
+ * startoff; this is ok. In other words, the tree lookup behaves as if
+ * we asked it to use a less than or equals comparison.
+ *
+ * However, if startoff is to the left of the first mapping in the
+ * extent tree, _goto points the cursor at that first mapping because
+ * it doesn't know how to deal with this situation. In this case,
+ * the tree lookup behaves as if we asked it to use a greater than
+ * or equals comparison.
+ *
+ * Note: If _get() returns 'no current node', that means that there
+ * aren't any mappings at all.
+ */
+ ext2fs_extent_goto(handle, startoff);
+ err = ext2fs_extent_get(handle, EXT2_EXTENT_CURRENT, bmap);
+ __DUMP_EXTENT(ff, func, "lookup", startoff, err, bmap);
+ if (err == EXT2_ET_NO_CURRENT_NODE)
+ err = EXT2_ET_EXTENT_NOT_FOUND;
+ return err;
+}
+
+static inline errcode_t __fuse4fs_get_next_mapping(struct fuse4fs *ff,
+ ext2_extent_handle_t handle,
+ blk64_t startoff,
+ struct ext2fs_extent *bmap,
+ const char *func)
+{
+ struct ext2fs_extent newex;
+ struct ext2_extent_info info;
+ errcode_t err;
+
+ /*
+ * The extent tree code has this (probably broken) behavior that if
+ * more than two of the highest levels of the cursor point at the
+ * rightmost edge of an extent tree block, a _NEXT_LEAF movement fails
+ * to move the cursor position of any of the lower levels. IOWs, if
+ * leaf level N is at the right edge, it will only advance level N-1
+ * to the right. If N-1 was at the right edge, the cursor resets to
+ * record 0 of that level and goes down to the wrong leaf.
+ *
+ * Work around this by walking up (towards root level 0) the extent
+ * tree until we find a level where we're not already at the rightmost
+ * edge. The _NEXT_LEAF movement will walk down the tree to find the
+ * leaves.
+ */
+ err = ext2fs_extent_get_info(handle, &info);
+ DUMP_INFO(ff, "UP?", startoff, err, &info);
+ if (err)
+ return err;
+
+ while (info.curr_entry == info.num_entries && info.curr_level > 0) {
+ err = ext2fs_extent_get(handle, EXT2_EXTENT_UP, &newex);
+ DUMP_EXTENT(ff, "UP", startoff, err, &newex);
+ if (err)
+ return err;
+ err = ext2fs_extent_get_info(handle, &info);
+ DUMP_INFO(ff, "UP", startoff, err, &info);
+ if (err)
+ return err;
+ }
+
+ /*
+ * If we're at the root and there are no more entries, there's nothing
+ * else to be found.
+ */
+ if (info.curr_level == 0 && info.curr_entry == info.num_entries)
+ return EXT2_ET_EXTENT_NOT_FOUND;
+
+ /* Otherwise grab this next leaf and return it. */
+ err = ext2fs_extent_get(handle, EXT2_EXTENT_NEXT_LEAF, &newex);
+ DUMP_EXTENT(ff, "NEXT", startoff, err, &newex);
+ if (err)
+ return err;
+
+ *bmap = newex;
+ return 0;
+}
+
+#define fuse4fs_get_mapping_at(ff, handle, startoff, bmap) \
+ __fuse4fs_get_mapping_at((ff), (handle), (startoff), (bmap), __func__)
+#define fuse4fs_get_next_mapping(ff, handle, startoff, bmap) \
+ __fuse4fs_get_next_mapping((ff), (handle), (startoff), (bmap), __func__)
+
+static errcode_t fuse4fs_iomap_begin_extent(struct fuse4fs *ff, uint64_t ino,
+ struct ext2_inode_large *inode,
+ off_t pos, uint64_t count,
+ uint32_t opflags,
+ struct fuse_file_iomap *iomap)
+{
+ ext2_extent_handle_t handle;
+ struct ext2fs_extent extent = { };
+ ext2_filsys fs = ff->fs;
+ const blk64_t startoff = FUSE4FS_B_TO_FSBT(ff, pos);
+ errcode_t err;
+ int ret = 0;
+
+ err = ext2fs_extent_open2(fs, ino, EXT2_INODE(inode), &handle);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = fuse4fs_get_mapping_at(ff, handle, startoff, &extent);
+ if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+ /* No mappings at all; the whole range is a hole. */
+ fuse4fs_iomap_hole_to_eof(ff, iomap, pos, count, inode);
+ goto out_handle;
+ }
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out_handle;
+ }
+
+ if (startoff < extent.e_lblk) {
+ /*
+ * Mapping starts to the right of the current position.
+ * Synthesize a hole going to that next extent.
+ */
+ fuse4fs_iomap_hole(ff, iomap, FUSE4FS_FSB_TO_B(ff, startoff),
+ FUSE4FS_FSB_TO_B(ff, extent.e_lblk - startoff));
+ goto out_handle;
+ }
+
+ if (startoff >= extent.e_lblk + extent.e_len) {
+ /*
+ * Mapping ends to the left of the current position. Try to
+ * find the next mapping. If there is no next mapping, the
+ * whole range is in a hole.
+ */
+ err = fuse4fs_get_next_mapping(ff, handle, startoff, &extent);
+ if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+ fuse4fs_iomap_hole_to_eof(ff, iomap, pos, count, inode);
+ goto out_handle;
+ }
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out_handle;
+ }
+
+ /*
+ * If the new mapping starts to the right of startoff, there's
+ * a hole from startoff to the start of the new mapping.
+ */
+ if (startoff < extent.e_lblk) {
+ fuse4fs_iomap_hole(ff, iomap,
+ FUSE4FS_FSB_TO_B(ff, startoff),
+ FUSE4FS_FSB_TO_B(ff, extent.e_lblk - startoff));
+ goto out_handle;
+ }
+
+ /*
+ * The new mapping starts at startoff. Something weird
+ * happened in the extent tree lookup, but we found a valid
+ * mapping so we'll run with it.
+ */
+ }
+
+ /* Mapping overlaps startoff, report this. */
+ iomap->dev = FUSE_IOMAP_DEV_NULL;
+ iomap->addr = FUSE4FS_FSB_TO_B(ff, extent.e_pblk);
+ iomap->offset = FUSE4FS_FSB_TO_B(ff, extent.e_lblk);
+ iomap->length = FUSE4FS_FSB_TO_B(ff, extent.e_len);
+ if (extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT)
+ iomap->type = FUSE_IOMAP_TYPE_UNWRITTEN;
+ else
+ iomap->type = FUSE_IOMAP_TYPE_MAPPED;
+
+out_handle:
+ ext2fs_extent_free(handle);
+ return ret;
+}
+
+static int fuse4fs_iomap_begin_indirect(struct fuse4fs *ff, uint64_t ino,
+ struct ext2_inode_large *inode,
+ off_t pos, uint64_t count,
+ uint32_t opflags,
+ struct fuse_file_iomap *iomap)
+{
+ ext2_filsys fs = ff->fs;
+ blk64_t startoff = FUSE4FS_B_TO_FSBT(ff, pos);
+ uint64_t isize = EXT2_I_SIZE(inode);
+ uint64_t real_count = min(count, 131072);
+ const blk64_t endoff = FUSE4FS_B_TO_FSB(ff, pos + real_count);
+ blk64_t startblock;
+ errcode_t err;
+
+ err = ext2fs_bmap2(fs, ino, EXT2_INODE(inode), NULL, 0, startoff, NULL,
+ &startblock);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ iomap->dev = FUSE_IOMAP_DEV_NULL;
+ iomap->offset = FUSE4FS_FSB_TO_B(ff, startoff);
+ iomap->flags |= FUSE_IOMAP_F_MERGED;
+ if (startblock) {
+ iomap->addr = FUSE4FS_FSB_TO_B(ff, startblock);
+ iomap->type = FUSE_IOMAP_TYPE_MAPPED;
+ } else {
+ iomap->addr = FUSE_IOMAP_NULL_ADDR;
+ iomap->type = FUSE_IOMAP_TYPE_HOLE;
+ }
+ iomap->length = fs->blocksize;
+
+ /* See how long the mapping goes for. */
+ for (startoff++; startoff < endoff; startoff++) {
+ blk64_t prev_startblock = startblock;
+
+ err = ext2fs_bmap2(fs, ino, EXT2_INODE(inode), NULL, 0,
+ startoff, NULL, &startblock);
+ if (err)
+ break;
+
+ if (iomap->type == FUSE_IOMAP_TYPE_MAPPED) {
+ if (startblock == prev_startblock + 1)
+ iomap->length += fs->blocksize;
+ else
+ break;
+ } else {
+ if (startblock == 0)
+ iomap->length += fs->blocksize;
+ else
+ break;
+ }
+ }
+
+ /*
+ * If this is a hole that goes beyond EOF, report this as a hole to the
+ * end of the range queried so that FIEMAP doesn't go mad.
+ */
+ if (iomap->type == FUSE_IOMAP_TYPE_HOLE &&
+ iomap->offset + iomap->length >= isize)
+ fuse4fs_iomap_hole_to_eof(ff, iomap, pos, count, inode);
+
+ return 0;
+}
+
+static int fuse4fs_iomap_begin_inline(struct fuse4fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode, off_t pos,
+ uint64_t count, struct fuse_file_iomap *iomap)
+{
+ uint64_t one_fsb = FUSE4FS_FSB_TO_B(ff, 1);
+
+ if (pos >= one_fsb) {
+ fuse4fs_iomap_hole_to_eof(ff, iomap, pos, count, inode);
+ } else {
+ /* ext4 only supports inline data files up to 1 fsb */
+ iomap->dev = FUSE_IOMAP_DEV_NULL;
+ iomap->addr = FUSE_IOMAP_NULL_ADDR;
+ iomap->offset = 0;
+ iomap->length = one_fsb;
+ iomap->type = FUSE_IOMAP_TYPE_INLINE;
+ }
+
+ return 0;
+}
+
+static int fuse4fs_iomap_begin_report(struct fuse4fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode,
+ off_t pos, uint64_t count,
+ uint32_t opflags,
+ struct fuse_file_iomap *read)
+{
+ if (inode->i_flags & EXT4_INLINE_DATA_FL)
+ return fuse4fs_iomap_begin_inline(ff, ino, inode, pos, count,
+ read);
+
+ if (inode->i_flags & EXT4_EXTENTS_FL)
+ return fuse4fs_iomap_begin_extent(ff, ino, inode, pos, count,
+ opflags, read);
+
+ return fuse4fs_iomap_begin_indirect(ff, ino, inode, pos, count,
+ opflags, read);
+}
+
+static int fuse4fs_iomap_begin_read(struct fuse4fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode, off_t pos,
+ uint64_t count, uint32_t opflags,
+ struct fuse_file_iomap *read)
+{
+ return -ENOSYS;
+}
+
+static int fuse4fs_iomap_begin_write(struct fuse4fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode, off_t pos,
+ uint64_t count, uint32_t opflags,
+ struct fuse_file_iomap *read)
+{
+ return -ENOSYS;
+}
+
+static void op_iomap_begin(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
+ off_t pos, uint64_t count, uint32_t opflags)
+{
+ struct fuse4fs *ff = fuse4fs_get(req);
+ struct ext2_inode_large inode;
+ struct fuse_file_iomap read = { };
+ ext2_filsys fs;
+ ext2_ino_t ino;
+ errcode_t err;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &ino, fino);
+
+ dbg_printf(ff, "%s: ino=%d pos=0x%llx count=0x%llx opflags=0x%x\n",
+ __func__, ino,
+ (unsigned long long)pos,
+ (unsigned long long)count,
+ opflags);
+
+ fs = fuse4fs_start(ff);
+ err = fuse4fs_read_inode(fs, ino, &inode);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out_unlock;
+ }
+
+ if (opflags & FUSE_IOMAP_OP_REPORT)
+ ret = fuse4fs_iomap_begin_report(ff, ino, &inode, pos, count,
+ opflags, &read);
+ else if (fuse_iomap_is_write(opflags))
+ ret = fuse4fs_iomap_begin_write(ff, ino, &inode, pos, count,
+ opflags, &read);
+ else
+ ret = fuse4fs_iomap_begin_read(ff, ino, &inode, pos, count,
+ opflags, &read);
+ if (ret)
+ goto out_unlock;
+
+ dbg_printf(ff,
+ "%s: ino=%d pos=0x%llx -> addr=0x%llx offset=0x%llx length=0x%llx type=%u flags=0x%x\n",
+ __func__, ino,
+ (unsigned long long)pos,
+ (unsigned long long)read.addr,
+ (unsigned long long)read.offset,
+ (unsigned long long)read.length,
+ read.type,
+ read.flags);
+
+out_unlock:
+ fuse4fs_finish(ff, ret);
+ if (ret)
+ fuse_reply_err(req, -ret);
+ else
+ fuse_reply_iomap_begin(req, &read, NULL);
+}
+
+static void op_iomap_end(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
+ off_t pos, uint64_t count, uint32_t opflags,
+ ssize_t written, const struct fuse_file_iomap *iomap)
+{
+ struct fuse4fs *ff = fuse4fs_get(req);
+ ext2_ino_t ino;
+
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &ino, fino);
+
+ dbg_printf(ff,
+ "%s: ino=%d pos=0x%llx count=0x%llx opflags=0x%x written=0x%zx mapflags=0x%x\n",
+ __func__, ino,
+ (unsigned long long)pos,
+ (unsigned long long)count,
+ opflags,
+ written,
+ iomap->flags);
+
+ fuse_reply_err(req, 0);
+}
+#endif /* HAVE_FUSE_IOMAP */
+
static struct fuse_lowlevel_ops fs_ops = {
.lookup = op_lookup,
.setattr = op_setattr,
@@ -5381,6 +5882,10 @@ static struct fuse_lowlevel_ops fs_ops = {
#ifdef SUPPORT_FALLOCATE
.fallocate = op_fallocate,
#endif
+#ifdef HAVE_FUSE_IOMAP
+ .iomap_begin = op_iomap_begin,
+ .iomap_end = op_iomap_end,
+#endif /* HAVE_FUSE_IOMAP */
};
static int get_random_bytes(void *p, size_t sz)
@@ -5703,17 +6208,19 @@ static int fuse4fs_main(struct fuse_args *args, struct fuse4fs *ff)
int main(int argc, char *argv[])
{
struct fuse_args args = FUSE_ARGS_INIT(argc, argv);
- struct fuse4fs fctx;
+ struct fuse4fs fctx = {
+ .magic = FUSE4FS_MAGIC,
+ .opstate = F4OP_WRITABLE,
+ .logfd = -1,
+#ifdef HAVE_FUSE_IOMAP
+ .iomap_state = IOMAP_UNKNOWN,
+#endif
+ };
errcode_t err;
FILE *orig_stderr = stderr;
char extra_args[BUFSIZ];
int ret;
- memset(&fctx, 0, sizeof(fctx));
- fctx.magic = FUSE4FS_MAGIC;
- fctx.logfd = -1;
- fctx.opstate = F4OP_WRITABLE;
-
ret = fuse_opt_parse(&args, &fctx, fuse4fs_opts, fuse4fs_opt_proc);
if (ret)
exit(1);
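For readers following the lookup logic in fuse4fs_iomap_begin_extent above,
here is a minimal standalone sketch of the same three-way decision (mapped,
hole up to the next extent, hole to EOF). It is not the fuse4fs code: the
sketch_* names and the flat sorted array are assumptions made for
illustration, standing in for the libext2fs extent handle and its "less than
or equals" cursor placement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct ext2fs_extent. */
struct sketch_extent {
	uint64_t lblk;	/* first logical block covered */
	uint64_t len;	/* length in blocks */
};

enum sketch_kind { SKETCH_MAPPED, SKETCH_HOLE, SKETCH_HOLE_TO_EOF };

struct sketch_result {
	enum sketch_kind kind;
	uint64_t start;	/* first block of the reported range */
	uint64_t len;	/* length in blocks; 0 for SKETCH_HOLE_TO_EOF */
};

/*
 * Classify startoff against a sorted, non-overlapping extent array.
 * Assumption: the array plays the role of the extent tree.
 */
static struct sketch_result
sketch_classify(const struct sketch_extent *ex, size_t nr, uint64_t startoff)
{
	struct sketch_result res = { SKETCH_HOLE_TO_EOF, startoff, 0 };
	size_t i = 0;

	assert(nr == 0 || ex != NULL);

	/* No mappings at all: the whole range is a hole. */
	if (nr == 0)
		return res;

	/*
	 * "Less than or equals" lookup: pick the last extent whose lblk
	 * is <= startoff.  If startoff is left of every extent we stay on
	 * extent 0, which mimics the "greater than or equals" fallback.
	 */
	while (i + 1 < nr && ex[i + 1].lblk <= startoff)
		i++;

	/* Extent starts to the right of startoff: hole up to it. */
	if (startoff < ex[i].lblk) {
		res.kind = SKETCH_HOLE;
		res.len = ex[i].lblk - startoff;
		return res;
	}

	/* Extent ends to the left of startoff: look at the next one. */
	if (startoff >= ex[i].lblk + ex[i].len) {
		if (i + 1 == nr)
			return res;	/* nothing further: hole to EOF */
		i++;
		if (startoff < ex[i].lblk) {
			res.kind = SKETCH_HOLE;
			res.len = ex[i].lblk - startoff;
			return res;
		}
	}

	/* Extent overlaps startoff: report the whole extent. */
	res.kind = SKETCH_MAPPED;
	res.start = ex[i].lblk;
	res.len = ex[i].len;
	return res;
}
```

With extents {10,5} and {20,5}, a lookup at block 17 lands on the first
extent, finds that it ends at block 15, advances to the second extent, and
reports a three-block hole. That is the same path the patch takes through
fuse4fs_get_next_mapping.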
* [PATCH 02/19] fuse2fs: add iomap= mount option
2025-08-21 0:49 ` [PATCHSET RFC v4 3/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
2025-08-21 1:15 ` [PATCH 01/19] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
@ 2025-08-21 1:16 ` Darrick J. Wong
2025-08-21 1:16 ` [PATCH 03/19] fuse2fs: implement iomap configuration Darrick J. Wong
` (16 subsequent siblings)
18 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:16 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Add a mount option to control iomap usage so that we can test
before-and-after scenarios.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
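A rough standalone sketch of the tri-state parsing this patch adds, in case
it helps review. The sketch_* names are made up for illustration. Unlike the
patch, which relies on libfuse only handing matching keys to the option
callback and then indexes arg + 6, the sketch checks the "iomap=" prefix
itself so it can be exercised in isolation.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical mirror of enum fuse2fs_feature_toggle. */
enum sketch_toggle { SK_DISABLE, SK_ENABLE, SK_DEFAULT };

/*
 * Parse "iomap", "iomap=0", "iomap=1" or "iomap=default".
 * Returns 0 and sets *want on success, -1 on anything else.
 */
static int sketch_parse_iomap(const char *arg, enum sketch_toggle *want)
{
	static const char prefix[] = "iomap=";
	const size_t plen = sizeof(prefix) - 1;
	const char *val;

	assert(arg != NULL && want != NULL);

	/* A bare "iomap" means "enable". */
	if (strcmp(arg, "iomap") == 0) {
		*want = SK_ENABLE;
		return 0;
	}

	if (strncmp(arg, prefix, plen) != 0)
		return -1;

	val = arg + plen;
	if (strcmp(val, "1") == 0)
		*want = SK_ENABLE;
	else if (strcmp(val, "0") == 0)
		*want = SK_DISABLE;
	else if (strcmp(val, "default") == 0)
		*want = SK_DEFAULT;
	else
		return -1;

	return 0;
}
```

On an unrecognized value the function leaves *want untouched, so the
FT_DEFAULT initializer set in main() would survive a failed parse.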
misc/fuse2fs.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
misc/fuse4fs.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 92 insertions(+)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 7c87573677e172..c63acd7a0ed155 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -226,6 +226,12 @@ enum fuse2fs_opstate {
F2OP_SHUTDOWN,
};
+enum fuse2fs_feature_toggle {
+ FT_DISABLE,
+ FT_ENABLE,
+ FT_DEFAULT,
+};
+
#ifdef HAVE_FUSE_IOMAP
enum fuse2fs_iomap_state {
IOMAP_DISABLED,
@@ -261,6 +267,7 @@ struct fuse2fs {
int logfd;
int blocklog;
#ifdef HAVE_FUSE_IOMAP
+ enum fuse2fs_feature_toggle iomap_want;
enum fuse2fs_iomap_state iomap_state;
#endif
unsigned int blockmask;
@@ -1333,6 +1340,12 @@ static void fuse2fs_iomap_enable(struct fuse_conn_info *conn,
if (ff->iomap_state == IOMAP_UNKNOWN)
ff->iomap_state = IOMAP_DISABLED;
+
+ if (!fuse2fs_iomap_enabled(ff)) {
+ if (ff->iomap_want == FT_ENABLE)
+ err_printf(ff, "%s\n", _("Could not enable iomap."));
+ return;
+ }
}
#else
# define fuse2fs_iomap_enable(...) ((void)0)
@@ -5520,6 +5533,9 @@ enum {
FUSE2FS_CACHE_SIZE,
FUSE2FS_DIRSYNC,
FUSE2FS_ERRORS_BEHAVIOR,
+#ifdef HAVE_FUSE_IOMAP
+ FUSE2FS_IOMAP,
+#endif
};
#define FUSE2FS_OPT(t, p, v) { t, offsetof(struct fuse2fs, p), v }
@@ -5551,6 +5567,10 @@ static struct fuse_opt fuse2fs_opts[] = {
FUSE_OPT_KEY("cache_size=%s", FUSE2FS_CACHE_SIZE),
FUSE_OPT_KEY("dirsync", FUSE2FS_DIRSYNC),
FUSE_OPT_KEY("errors=%s", FUSE2FS_ERRORS_BEHAVIOR),
+#ifdef HAVE_FUSE_IOMAP
+ FUSE_OPT_KEY("iomap=%s", FUSE2FS_IOMAP),
+ FUSE_OPT_KEY("iomap", FUSE2FS_IOMAP),
+#endif
FUSE_OPT_KEY("-V", FUSE2FS_VERSION),
FUSE_OPT_KEY("--version", FUSE2FS_VERSION),
@@ -5602,6 +5622,23 @@ static int fuse2fs_opt_proc(void *data, const char *arg,
/* do not pass through to libfuse */
return 0;
+#ifdef HAVE_FUSE_IOMAP
+ case FUSE2FS_IOMAP:
+ if (strcmp(arg, "iomap") == 0 || strcmp(arg + 6, "1") == 0)
+ ff->iomap_want = FT_ENABLE;
+ else if (strcmp(arg + 6, "0") == 0)
+ ff->iomap_want = FT_DISABLE;
+ else if (strcmp(arg + 6, "default") == 0)
+ ff->iomap_want = FT_DEFAULT;
+ else {
+ fprintf(stderr, "%s: %s\n", arg,
+ _("unknown iomap= behavior."));
+ return -1;
+ }
+
+ /* do not pass through to libfuse */
+ return 0;
+#endif
case FUSE2FS_IGNORED:
return 0;
case FUSE2FS_HELP:
@@ -5629,6 +5666,9 @@ static int fuse2fs_opt_proc(void *data, const char *arg,
" -o cache_size=N[KMG] use a disk cache of this size\n"
" -o errors= behavior when an error is encountered:\n"
" continue|remount-ro|panic\n"
+#ifdef HAVE_FUSE_IOMAP
+ " -o iomap= 0 to disable iomap, 1 to enable iomap\n"
+#endif
"\n",
outargs->argv[0]);
if (key == FUSE2FS_HELPFULL) {
@@ -5721,6 +5761,7 @@ int main(int argc, char *argv[])
.opstate = F2OP_WRITABLE,
.logfd = -1,
#ifdef HAVE_FUSE_IOMAP
+ .iomap_want = FT_DEFAULT,
.iomap_state = IOMAP_UNKNOWN,
#endif
};
@@ -5738,6 +5779,11 @@ int main(int argc, char *argv[])
exit(1);
}
+#ifdef HAVE_FUSE_IOMAP
+ if (fctx.iomap_want == FT_DISABLE)
+ fctx.iomap_state = IOMAP_DISABLED;
+#endif
+
/* /dev/sda -> sda for reporting */
fctx.shortdev = strrchr(fctx.device, '/');
if (fctx.shortdev)
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 93570d25e91d5c..2bc25ff37055d5 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -222,6 +222,12 @@ enum fuse4fs_opstate {
F4OP_SHUTDOWN,
};
+enum fuse4fs_feature_toggle {
+ FT_DISABLE,
+ FT_ENABLE,
+ FT_DEFAULT,
+};
+
#ifdef HAVE_FUSE_IOMAP
enum fuse4fs_iomap_state {
IOMAP_DISABLED,
@@ -257,6 +263,7 @@ struct fuse4fs {
int logfd;
int blocklog;
#ifdef HAVE_FUSE_IOMAP
+ enum fuse4fs_feature_toggle iomap_want;
enum fuse4fs_iomap_state iomap_state;
#endif
unsigned int blockmask;
@@ -1516,6 +1523,12 @@ static void fuse4fs_iomap_enable(struct fuse_conn_info *conn,
if (ff->iomap_state == IOMAP_UNKNOWN)
ff->iomap_state = IOMAP_DISABLED;
+
+ if (!fuse4fs_iomap_enabled(ff)) {
+ if (ff->iomap_want == FT_ENABLE)
+ err_printf(ff, "%s\n", _("Could not enable iomap."));
+ return;
+ }
}
#else
# define fuse4fs_iomap_enable(...) ((void)0)
@@ -5913,6 +5926,9 @@ enum {
FUSE4FS_CACHE_SIZE,
FUSE4FS_DIRSYNC,
FUSE4FS_ERRORS_BEHAVIOR,
+#ifdef HAVE_FUSE_IOMAP
+ FUSE4FS_IOMAP,
+#endif
};
#define FUSE4FS_OPT(t, p, v) { t, offsetof(struct fuse4fs, p), v }
@@ -5944,6 +5960,10 @@ static struct fuse_opt fuse4fs_opts[] = {
FUSE_OPT_KEY("cache_size=%s", FUSE4FS_CACHE_SIZE),
FUSE_OPT_KEY("dirsync", FUSE4FS_DIRSYNC),
FUSE_OPT_KEY("errors=%s", FUSE4FS_ERRORS_BEHAVIOR),
+#ifdef HAVE_FUSE_IOMAP
+ FUSE_OPT_KEY("iomap=%s", FUSE4FS_IOMAP),
+ FUSE_OPT_KEY("iomap", FUSE4FS_IOMAP),
+#endif
FUSE_OPT_KEY("-V", FUSE4FS_VERSION),
FUSE_OPT_KEY("--version", FUSE4FS_VERSION),
@@ -5995,6 +6015,23 @@ static int fuse4fs_opt_proc(void *data, const char *arg,
/* do not pass through to libfuse */
return 0;
+#ifdef HAVE_FUSE_IOMAP
+ case FUSE4FS_IOMAP:
+ if (strcmp(arg, "iomap") == 0 || strcmp(arg + 6, "1") == 0)
+ ff->iomap_want = FT_ENABLE;
+ else if (strcmp(arg + 6, "0") == 0)
+ ff->iomap_want = FT_DISABLE;
+ else if (strcmp(arg + 6, "default") == 0)
+ ff->iomap_want = FT_DEFAULT;
+ else {
+ fprintf(stderr, "%s: %s\n", arg,
+ _("unknown iomap= behavior."));
+ return -1;
+ }
+
+ /* do not pass through to libfuse */
+ return 0;
+#endif
case FUSE4FS_IGNORED:
return 0;
case FUSE4FS_HELP:
@@ -6022,6 +6059,9 @@ static int fuse4fs_opt_proc(void *data, const char *arg,
" -o cache_size=N[KMG] use a disk cache of this size\n"
" -o errors= behavior when an error is encountered:\n"
" continue|remount-ro|panic\n"
+#ifdef HAVE_FUSE_IOMAP
+ " -o iomap= 0 to disable iomap, 1 to enable iomap\n"
+#endif
"\n",
outargs->argv[0]);
if (key == FUSE4FS_HELPFULL) {
@@ -6213,6 +6253,7 @@ int main(int argc, char *argv[])
.opstate = F4OP_WRITABLE,
.logfd = -1,
#ifdef HAVE_FUSE_IOMAP
+ .iomap_want = FT_DEFAULT,
.iomap_state = IOMAP_UNKNOWN,
#endif
};
@@ -6230,6 +6271,11 @@ int main(int argc, char *argv[])
exit(1);
}
+#ifdef HAVE_FUSE_IOMAP
+ if (fctx.iomap_want == FT_DISABLE)
+ fctx.iomap_state = IOMAP_DISABLED;
+#endif
+
/* /dev/sda -> sda for reporting */
fctx.shortdev = strrchr(fctx.device, '/');
if (fctx.shortdev)
* [PATCH 03/19] fuse2fs: implement iomap configuration
2025-08-21 0:49 ` [PATCHSET RFC v4 3/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
2025-08-21 1:15 ` [PATCH 01/19] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
2025-08-21 1:16 ` [PATCH 02/19] fuse2fs: add iomap= mount option Darrick J. Wong
@ 2025-08-21 1:16 ` Darrick J. Wong
2025-08-21 1:16 ` [PATCH 04/19] fuse2fs: register block devices for use with iomap Darrick J. Wong
` (15 subsequent siblings)
18 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:16 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Upload the filesystem geometry to the kernel when asked.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
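To make the s_maxbytes arithmetic easier to check by hand, here is a
standalone sketch of the same computation with the filesystem state passed
in as plain parameters. The function name and the huge_file parameter are
assumptions for illustration; the arithmetic follows fuse2fs_max_size below.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the maximum extent-format file size, in bytes.
 * blocklog is log2(block size); huge_file mirrors the huge_file feature.
 */
static int64_t sketch_max_size(unsigned int blocklog, int huge_file,
			       int64_t upper_limit)
{
	int64_t res;

	/* ext4 block sizes range from 1k to 64k. */
	assert(blocklog >= 10 && blocklog <= 16);

	if (!huge_file) {
		/* i_blocks is a 32-bit count of 512-byte sectors. */
		upper_limit = (1LL << 32) - 1;

		/* Convert sectors to fs blocks, then blocks to bytes. */
		upper_limit >>= (blocklog - 9);
		upper_limit <<= blocklog;
	}

	/* 32-bit logical block number, lowered by one fs block. */
	res = (1LL << 32) - 1;
	res <<= blocklog;

	/* Clamp to the caller-supplied limit. */
	if (res > upper_limit)
		res = upper_limit;

	return res;
}
```

For 4k blocks (blocklog 12) with huge_file set and no tighter caller limit,
this yields 0xFFFFFFFF000 bytes, i.e. 16TiB minus one block. Without
huge_file the sector-count limit wins and the result is 0x1FFFFFFF000 bytes.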
misc/fuse2fs.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
misc/fuse4fs.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 186 insertions(+), 6 deletions(-)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index c63acd7a0ed155..5b17aadc006560 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -201,6 +201,10 @@ static inline uint64_t round_down(uint64_t b, unsigned int align)
# define FL_ZERO_RANGE_FLAG (0)
#endif
+#ifndef NSEC_PER_SEC
+# define NSEC_PER_SEC (1000000000L)
+#endif
+
errcode_t ext2fs_run_ext3_journal(ext2_filsys *fs);
const char *err_shortdev;
@@ -655,9 +659,9 @@ static int update_atime(ext2_filsys fs, ext2_ino_t ino)
EXT4_INODE_GET_XTIME(i_mtime, &mtime, pinode);
get_now(&now);
- datime = atime.tv_sec + ((double)atime.tv_nsec / 1000000000);
- dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / 1000000000);
- dnow = now.tv_sec + ((double)now.tv_nsec / 1000000000);
+ datime = atime.tv_sec + ((double)atime.tv_nsec / NSEC_PER_SEC);
+ dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / NSEC_PER_SEC);
+ dnow = now.tv_sec + ((double)now.tv_nsec / NSEC_PER_SEC);
/*
* If atime is newer than mtime and atime hasn't been updated in thirty
@@ -5440,6 +5444,91 @@ static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
return 0;
}
+
+/*
+ * Maximal extent format file size.
+ * Resulting logical blkno at s_maxbytes must fit in our on-disk
+ * extent format containers, within a sector_t, and within i_blocks
+ * in the vfs. ext4 inode has 48 bits of i_block in fsblock units,
+ * so that won't be a limiting factor.
+ *
+ * However, there is another limiting factor. We store extents in the form
+ * of starting block and length, hence the resulting length of the extent
+ * covering the maximum file size must fit into on-disk format containers as
+ * well. Given that the length is always 1 unit bigger than the max unit
+ * (because we count 0 as well), we have to lower s_maxbytes by one fs block.
+ *
+ * Note, this does *not* consider any metadata overhead for vfs i_blocks.
+ */
+static off_t fuse2fs_max_size(struct fuse2fs *ff, off_t upper_limit)
+{
+ off_t res;
+
+ if (!ext2fs_has_feature_huge_file(ff->fs->super)) {
+ upper_limit = (1LL << 32) - 1;
+
+ /* total blocks in file system block size */
+ upper_limit >>= (ff->blocklog - 9);
+ upper_limit <<= ff->blocklog;
+ }
+
+ /*
+ * 32-bit extent-start container, ee_block. We lower the maxbytes
+ * by one fs block, so ee_len can cover the extent of maximum file
+ * size
+ */
+ res = (1LL << 32) - 1;
+ res <<= ff->blocklog;
+
+ /* Sanity check against vm- & vfs- imposed limits */
+ if (res > upper_limit)
+ res = upper_limit;
+
+ return res;
+}
+
+static int op_iomap_config(uint64_t flags, off_t maxbytes,
+ struct fuse_iomap_config *cfg)
+{
+ struct fuse2fs *ff = fuse2fs_get();
+ ext2_filsys fs;
+
+ FUSE2FS_CHECK_CONTEXT(ff);
+
+ dbg_printf(ff, "%s: flags=0x%llx maxbytes=0x%llx\n", __func__,
+ (unsigned long long)flags,
+ (unsigned long long)maxbytes);
+ fs = fuse2fs_start(ff);
+
+ cfg->flags |= FUSE_IOMAP_CONFIG_UUID;
+ memcpy(cfg->s_uuid, fs->super->s_uuid, sizeof(cfg->s_uuid));
+ cfg->s_uuid_len = sizeof(fs->super->s_uuid);
+
+ cfg->flags |= FUSE_IOMAP_CONFIG_BLOCKSIZE;
+ cfg->s_blocksize = FUSE2FS_FSB_TO_B(ff, 1);
+
+ /*
+ * If the inode is large enough to house i_[acm]time_extra then we
+ * can turn on nanosecond timestamps; i_crtime was the next field added
+ * after i_atime_extra.
+ */
+ cfg->flags |= FUSE_IOMAP_CONFIG_TIME;
+ if (fs->super->s_inode_size >=
+ offsetof(struct ext2_inode_large, i_crtime)) {
+ cfg->s_time_gran = 1;
+ cfg->s_time_max = EXT4_EXTRA_TIMESTAMP_MAX;
+ } else {
+ cfg->s_time_gran = NSEC_PER_SEC;
+ cfg->s_time_max = EXT4_NON_EXTRA_TIMESTAMP_MAX;
+ }
+ cfg->s_time_min = EXT4_TIMESTAMP_MIN;
+
+ cfg->flags |= FUSE_IOMAP_CONFIG_MAXBYTES;
+ cfg->s_maxbytes = fuse2fs_max_size(ff, maxbytes);
+
+ fuse2fs_finish(ff, 0);
+ return 0;
+}
#endif /* HAVE_FUSE_IOMAP */
static struct fuse_operations fs_ops = {
@@ -5505,6 +5594,7 @@ static struct fuse_operations fs_ops = {
#ifdef HAVE_FUSE_IOMAP
.iomap_begin = op_iomap_begin,
.iomap_end = op_iomap_end,
+ .iomap_config = op_iomap_config,
#endif /* HAVE_FUSE_IOMAP */
};
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 2bc25ff37055d5..5876af19387c96 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -196,6 +196,10 @@ static inline uint64_t round_down(uint64_t b, unsigned int align)
# define FL_ZERO_RANGE_FLAG (0)
#endif
+#ifndef NSEC_PER_SEC
+# define NSEC_PER_SEC (1000000000L)
+#endif
+
errcode_t ext2fs_run_ext3_journal(ext2_filsys *fs);
const char *err_shortdev;
@@ -808,9 +812,9 @@ static int update_atime(ext2_filsys fs, ext2_ino_t ino)
EXT4_INODE_GET_XTIME(i_mtime, &mtime, pinode);
get_now(&now);
- datime = atime.tv_sec + ((double)atime.tv_nsec / 1000000000);
- dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / 1000000000);
- dnow = now.tv_sec + ((double)now.tv_nsec / 1000000000);
+ datime = atime.tv_sec + ((double)atime.tv_nsec / NSEC_PER_SEC);
+ dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / NSEC_PER_SEC);
+ dnow = now.tv_sec + ((double)now.tv_nsec / NSEC_PER_SEC);
/*
* If atime is newer than mtime and atime hasn't been updated in thirty
@@ -5850,6 +5854,91 @@ static void op_iomap_end(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
fuse_reply_err(req, 0);
}
+
+/*
+ * Maximal extent format file size.
+ * Resulting logical blkno at s_maxbytes must fit in our on-disk
+ * extent format containers, within a sector_t, and within i_blocks
+ * in the vfs. ext4 inode has 48 bits of i_block in fsblock units,
+ * so that won't be a limiting factor.
+ *
+ * However, there is another limiting factor. We store extents in the form
+ * of starting block and length, hence the resulting length of the extent
+ * covering the maximum file size must fit into on-disk format containers as
+ * well. Given that the length is always 1 unit bigger than the max unit
+ * (because we count 0 as well), we have to lower s_maxbytes by one fs block.
+ *
+ * Note, this does *not* consider any metadata overhead for vfs i_blocks.
+ */
+static off_t fuse4fs_max_size(struct fuse4fs *ff, off_t upper_limit)
+{
+ off_t res;
+
+ if (!ext2fs_has_feature_huge_file(ff->fs->super)) {
+ upper_limit = (1LL << 32) - 1;
+
+ /* total blocks in file system block size */
+ upper_limit >>= (ff->blocklog - 9);
+ upper_limit <<= ff->blocklog;
+ }
+
+ /*
+ * 32-bit extent-start container, ee_block. We lower the maxbytes
+ * by one fs block, so ee_len can cover the extent of maximum file
+ * size
+ */
+ res = (1LL << 32) - 1;
+ res <<= ff->blocklog;
+
+ /* Sanity check against vm- & vfs- imposed limits */
+ if (res > upper_limit)
+ res = upper_limit;
+
+ return res;
+}
+
+static void op_iomap_config(fuse_req_t req, uint64_t flags, uint64_t maxbytes)
+{
+ struct fuse_iomap_config cfg = { };
+ struct fuse4fs *ff = fuse4fs_get(req);
+ ext2_filsys fs;
+
+ FUSE4FS_CHECK_CONTEXT(req);
+
+ dbg_printf(ff, "%s: flags=0x%llx maxbytes=0x%llx\n", __func__,
+ (unsigned long long)flags,
+ (unsigned long long)maxbytes);
+ fs = fuse4fs_start(ff);
+
+ cfg.flags |= FUSE_IOMAP_CONFIG_UUID;
+ memcpy(cfg.s_uuid, fs->super->s_uuid, sizeof(cfg.s_uuid));
+ cfg.s_uuid_len = sizeof(fs->super->s_uuid);
+
+ cfg.flags |= FUSE_IOMAP_CONFIG_BLOCKSIZE;
+ cfg.s_blocksize = FUSE4FS_FSB_TO_B(ff, 1);
+
+ /*
+ * If the inode is large enough to house i_[acm]time_extra then we
+ * can turn on nanosecond timestamps; i_crtime was the next field added
+ * after i_atime_extra.
+ */
+ cfg.flags |= FUSE_IOMAP_CONFIG_TIME;
+ if (fs->super->s_inode_size >=
+ offsetof(struct ext2_inode_large, i_crtime)) {
+ cfg.s_time_gran = 1;
+ cfg.s_time_max = EXT4_EXTRA_TIMESTAMP_MAX;
+ } else {
+ cfg.s_time_gran = NSEC_PER_SEC;
+ cfg.s_time_max = EXT4_NON_EXTRA_TIMESTAMP_MAX;
+ }
+ cfg.s_time_min = EXT4_TIMESTAMP_MIN;
+
+ cfg.flags |= FUSE_IOMAP_CONFIG_MAXBYTES;
+ cfg.s_maxbytes = fuse4fs_max_size(ff, maxbytes);
+
+ fuse4fs_finish(ff, 0);
+ fuse_reply_iomap_config(req, &cfg);
+}
#endif /* HAVE_FUSE_IOMAP */
static struct fuse_lowlevel_ops fs_ops = {
@@ -5898,6 +5987,7 @@ static struct fuse_lowlevel_ops fs_ops = {
#ifdef HAVE_FUSE_IOMAP
.iomap_begin = op_iomap_begin,
.iomap_end = op_iomap_end,
+ .iomap_config = op_iomap_config,
#endif /* HAVE_FUSE_IOMAP */
};
* [PATCH 04/19] fuse2fs: register block devices for use with iomap
2025-08-21 0:49 ` [PATCHSET RFC v4 3/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
` (2 preceding siblings ...)
2025-08-21 1:16 ` [PATCH 03/19] fuse2fs: implement iomap configuration Darrick J. Wong
@ 2025-08-21 1:16 ` Darrick J. Wong
2025-08-21 1:17 ` [PATCH 05/19] fuse2fs: implement directio file reads Darrick J. Wong
` (14 subsequent siblings)
18 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:16 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Register the ext4 block device with the kernel for use with iomap. For
now this is redundant with using fuseblk mode because the kernel
automatically registers any fuseblk devices, but eventually we'll go
back to regular fuse mode and we'll have to pin the bdev ourselves.
In theory this interface supports strange beasts where the metadata can
exist somewhere else entirely (or be made up by AI) while the file data
persists to real disks.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse2fs.c | 42 ++++++++++++++++++++++++++++++++++++++----
misc/fuse4fs.c | 44 ++++++++++++++++++++++++++++++++++++++++----
2 files changed, 78 insertions(+), 8 deletions(-)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 5b17aadc006560..8bf0fbcff093a7 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -41,6 +41,7 @@
# define _FILE_OFFSET_BITS 64
#endif /* _FILE_OFFSET_BITS */
#include <fuse.h>
+#include <fuse_lowlevel.h>
#ifdef __SET_FOB_FOR_FUSE
# undef _FILE_OFFSET_BITS
#endif /* __SET_FOB_FOR_FUSE */
@@ -273,6 +274,7 @@ struct fuse2fs {
#ifdef HAVE_FUSE_IOMAP
enum fuse2fs_feature_toggle iomap_want;
enum fuse2fs_iomap_state iomap_state;
+ uint32_t iomap_dev;
#endif
unsigned int blockmask;
unsigned long offset;
@@ -5235,7 +5237,7 @@ static errcode_t fuse2fs_iomap_begin_extent(struct fuse2fs *ff, uint64_t ino,
}
/* Mapping overlaps startoff, report this. */
- iomap->dev = FUSE_IOMAP_DEV_NULL;
+ iomap->dev = ff->iomap_dev;
iomap->addr = FUSE2FS_FSB_TO_B(ff, extent.e_pblk);
iomap->offset = FUSE2FS_FSB_TO_B(ff, extent.e_lblk);
iomap->length = FUSE2FS_FSB_TO_B(ff, extent.e_len);
@@ -5268,13 +5270,14 @@ static int fuse2fs_iomap_begin_indirect(struct fuse2fs *ff, uint64_t ino,
if (err)
return translate_error(fs, ino, err);
- iomap->dev = FUSE_IOMAP_DEV_NULL;
iomap->offset = FUSE2FS_FSB_TO_B(ff, startoff);
iomap->flags |= FUSE_IOMAP_F_MERGED;
if (startblock) {
+ iomap->dev = ff->iomap_dev;
iomap->addr = FUSE2FS_FSB_TO_B(ff, startblock);
iomap->type = FUSE_IOMAP_TYPE_MAPPED;
} else {
+ iomap->dev = FUSE_IOMAP_DEV_NULL;
iomap->addr = FUSE_IOMAP_NULL_ADDR;
iomap->type = FUSE_IOMAP_TYPE_HOLE;
}
@@ -5487,11 +5490,36 @@ static off_t fuse2fs_max_size(struct fuse2fs *ff, off_t upper_limit)
return res;
}
+static int fuse2fs_iomap_config_devices(struct fuse2fs *ff)
+{
+ errcode_t err;
+ int fd;
+ int ret;
+
+ err = io_channel_get_fd(ff->fs->io, &fd);
+ if (err)
+ return translate_error(ff->fs, 0, err);
+
+ ret = fuse_fs_iomap_device_add(fd, 0);
+ if (ret < 0) {
+ dbg_printf(ff, "%s: cannot register iomap dev fd=%d, err=%d\n",
+ __func__, fd, -ret);
+ return translate_error(ff->fs, 0, -ret);
+ }
+
+ ff->iomap_dev = ret;
+ dbg_printf(ff, "%s: registered iomap dev fd=%d iomap_dev=%u\n",
+ __func__, fd, ff->iomap_dev);
+
+ return 0;
+}
+
static int op_iomap_config(uint64_t flags, off_t maxbytes,
struct fuse_iomap_config *cfg)
{
struct fuse2fs *ff = fuse2fs_get();
ext2_filsys fs;
+ int ret = 0;
FUSE2FS_CHECK_CONTEXT(ff);
@@ -5526,8 +5554,13 @@ static int op_iomap_config(uint64_t flags, off_t maxbytes,
cfg->flags |= FUSE_IOMAP_CONFIG_MAXBYTES;
cfg->s_maxbytes = fuse2fs_max_size(ff, maxbytes);
- fuse2fs_finish(ff, 0);
- return 0;
+ ret = fuse2fs_iomap_config_devices(ff);
+ if (ret)
+ goto out_unlock;
+
+out_unlock:
+ fuse2fs_finish(ff, ret);
+ return ret;
}
#endif /* HAVE_FUSE_IOMAP */
@@ -5853,6 +5886,7 @@ int main(int argc, char *argv[])
#ifdef HAVE_FUSE_IOMAP
.iomap_want = FT_DEFAULT,
.iomap_state = IOMAP_UNKNOWN,
+ .iomap_dev = FUSE_IOMAP_DEV_NULL,
#endif
};
errcode_t err;
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 5876af19387c96..5debaf892b2113 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -269,6 +269,7 @@ struct fuse4fs {
#ifdef HAVE_FUSE_IOMAP
enum fuse4fs_feature_toggle iomap_want;
enum fuse4fs_iomap_state iomap_state;
+ uint32_t iomap_dev;
#endif
unsigned int blockmask;
unsigned long offset;
@@ -5644,7 +5645,7 @@ static errcode_t fuse4fs_iomap_begin_extent(struct fuse4fs *ff, uint64_t ino,
}
/* Mapping overlaps startoff, report this. */
- iomap->dev = FUSE_IOMAP_DEV_NULL;
+ iomap->dev = ff->iomap_dev;
iomap->addr = FUSE4FS_FSB_TO_B(ff, extent.e_pblk);
iomap->offset = FUSE4FS_FSB_TO_B(ff, extent.e_lblk);
iomap->length = FUSE4FS_FSB_TO_B(ff, extent.e_len);
@@ -5677,13 +5678,14 @@ static int fuse4fs_iomap_begin_indirect(struct fuse4fs *ff, uint64_t ino,
if (err)
return translate_error(fs, ino, err);
- iomap->dev = FUSE_IOMAP_DEV_NULL;
iomap->offset = FUSE4FS_FSB_TO_B(ff, startoff);
iomap->flags |= FUSE_IOMAP_F_MERGED;
if (startblock) {
+ iomap->dev = ff->iomap_dev;
iomap->addr = FUSE4FS_FSB_TO_B(ff, startblock);
iomap->type = FUSE_IOMAP_TYPE_MAPPED;
} else {
+ iomap->dev = FUSE_IOMAP_DEV_NULL;
iomap->addr = FUSE_IOMAP_NULL_ADDR;
iomap->type = FUSE_IOMAP_TYPE_HOLE;
}
@@ -5897,11 +5899,36 @@ static off_t fuse4fs_max_size(struct fuse4fs *ff, off_t upper_limit)
return res;
}
+static int fuse4fs_iomap_config_devices(struct fuse4fs *ff)
+{
+ errcode_t err;
+ int fd;
+ int ret;
+
+ err = io_channel_get_fd(ff->fs->io, &fd);
+ if (err)
+ return translate_error(ff->fs, 0, err);
+
+ ret = fuse_lowlevel_iomap_device_add(ff->fuse, fd, 0);
+ if (ret < 0) {
+ dbg_printf(ff, "%s: cannot register iomap dev fd=%d, err=%d\n",
+ __func__, fd, -ret);
+ return translate_error(ff->fs, 0, -ret);
+ }
+
+ ff->iomap_dev = ret;
+
+ dbg_printf(ff, "%s: registered iomap dev fd=%d iomap_dev=%u\n",
+ __func__, fd, ff->iomap_dev);
+ return 0;
+}
+
static void op_iomap_config(fuse_req_t req, uint64_t flags, uint64_t maxbytes)
{
struct fuse_iomap_config cfg = { };
struct fuse4fs *ff = fuse4fs_get(req);
ext2_filsys fs;
+ int ret = 0;
FUSE4FS_CHECK_CONTEXT(req);
@@ -5936,8 +5963,16 @@ static void op_iomap_config(fuse_req_t req, uint64_t flags, uint64_t maxbytes)
cfg.flags |= FUSE_IOMAP_CONFIG_MAXBYTES;
cfg.s_maxbytes = fuse4fs_max_size(ff, maxbytes);
- fuse4fs_finish(ff, 0);
- fuse_reply_iomap_config(req, &cfg);
+ ret = fuse4fs_iomap_config_devices(ff);
+ if (ret)
+ goto out_unlock;
+
+out_unlock:
+ fuse4fs_finish(ff, ret);
+ if (ret)
+ fuse_reply_err(req, -ret);
+ else
+ fuse_reply_iomap_config(req, &cfg);
}
#endif /* HAVE_FUSE_IOMAP */
@@ -6345,6 +6380,7 @@ int main(int argc, char *argv[])
#ifdef HAVE_FUSE_IOMAP
.iomap_want = FT_DEFAULT,
.iomap_state = IOMAP_UNKNOWN,
+ .iomap_dev = FUSE_IOMAP_DEV_NULL,
#endif
};
errcode_t err;
* [PATCH 05/19] fuse2fs: implement directio file reads
2025-08-21 0:49 ` [PATCHSET RFC v4 3/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
` (3 preceding siblings ...)
2025-08-21 1:16 ` [PATCH 04/19] fuse2fs: register block devices for use with iomap Darrick J. Wong
@ 2025-08-21 1:17 ` Darrick J. Wong
2025-08-21 1:17 ` [PATCH 06/19] fuse2fs: add extent dump function for debugging Darrick J. Wong
` (13 subsequent siblings)
18 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:17 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Implement file reads via iomap. Currently only directio is supported, and
inline data reads still fall back to the slow path.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
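Not part of the patch: a minimal standalone sketch of the order of checks
that the read path in this patch follows, for anyone who wants to poke at
the logic outside fuse2fs. The SKETCH_* names and sketch_read_path() are
hypothetical. The two inode flag values are the ext4 on-disk ones, but the
direct-I/O bit is a placeholder and is not claimed to match the real
FUSE_IOMAP_OP_DIRECT.

```c
#include <errno.h>
#include <stdint.h>

/* Inode flag values from the ext4 on-disk format. */
#define SKETCH_EXTENTS_FL      0x00080000U
#define SKETCH_INLINE_DATA_FL  0x10000000U

/* Placeholder bit; the real FUSE_IOMAP_OP_DIRECT value is not assumed. */
#define SKETCH_OP_DIRECT       0x1U

enum sketch_path {
	SKETCH_PATH_EXTENT = 1,
	SKETCH_PATH_INDIRECT = 2,
};

/*
 * Returns which mapping walker would be chosen, or -ENOSYS when the
 * request is not handled through iomap (non-direct I/O, inline data).
 */
static int sketch_read_path(uint32_t i_flags, uint32_t opflags)
{
	if (!(opflags & SKETCH_OP_DIRECT))
		return -ENOSYS;
	if (i_flags & SKETCH_INLINE_DATA_FL)
		return -ENOSYS;
	if (i_flags & SKETCH_EXTENTS_FL)
		return SKETCH_PATH_EXTENT;
	return SKETCH_PATH_INDIRECT;
}
```

The inline data check comes before the extent check because an inline data
file has no block mapping to report at all.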
misc/fuse2fs.c | 14 +++++++++++++-
misc/fuse4fs.c | 14 +++++++++++++-
2 files changed, 26 insertions(+), 2 deletions(-)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 8bf0fbcff093a7..1dda9c45cb5089 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5359,7 +5359,19 @@ static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
uint64_t count, uint32_t opflags,
struct fuse_file_iomap *read)
{
- return -ENOSYS;
+ if (!(opflags & FUSE_IOMAP_OP_DIRECT))
+ return -ENOSYS;
+
+ /* fall back to slow path for inline data reads */
+ if (inode->i_flags & EXT4_INLINE_DATA_FL)
+ return -ENOSYS;
+
+ if (inode->i_flags & EXT4_EXTENTS_FL)
+ return fuse2fs_iomap_begin_extent(ff, ino, inode, pos, count,
+ opflags, read);
+
+ return fuse2fs_iomap_begin_indirect(ff, ino, inode, pos, count,
+ opflags, read);
}
static int fuse2fs_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 5debaf892b2113..2aa7ab646592e9 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -5767,7 +5767,19 @@ static int fuse4fs_iomap_begin_read(struct fuse4fs *ff, ext2_ino_t ino,
uint64_t count, uint32_t opflags,
struct fuse_file_iomap *read)
{
- return -ENOSYS;
+ if (!(opflags & FUSE_IOMAP_OP_DIRECT))
+ return -ENOSYS;
+
+ /* fall back to slow path for inline data reads */
+ if (inode->i_flags & EXT4_INLINE_DATA_FL)
+ return -ENOSYS;
+
+ if (inode->i_flags & EXT4_EXTENTS_FL)
+ return fuse4fs_iomap_begin_extent(ff, ino, inode, pos, count,
+ opflags, read);
+
+ return fuse4fs_iomap_begin_indirect(ff, ino, inode, pos, count,
+ opflags, read);
}
static int fuse4fs_iomap_begin_write(struct fuse4fs *ff, ext2_ino_t ino,
* [PATCH 06/19] fuse2fs: add extent dump function for debugging
2025-08-21 0:49 ` [PATCHSET RFC v4 3/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
` (4 preceding siblings ...)
2025-08-21 1:17 ` [PATCH 05/19] fuse2fs: implement directio file reads Darrick J. Wong
@ 2025-08-21 1:17 ` Darrick J. Wong
2025-08-21 1:17 ` [PATCH 07/19] fuse2fs: implement direct write support Darrick J. Wong
` (12 subsequent siblings)
18 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:17 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Add a function to dump an inode's extent map for debugging purposes.
This helped debug a problem with generic/299 failing on 1k fsblock
filesystems:
--- a/tests/generic/299.out 2025-07-15 14:45:15.030113607 -0700
+++ b/tests/generic/299.out.bad 2025-07-16 19:33:50.889344998 -0700
@@ -3,3 +3,4 @@ QA output created by 299
Run fio with random aio-dio pattern
Start fallocate/truncate loop
+fio: io_u error on file /opt/direct_aio.0.0: Input/output error: write offset=2602827776, buflen=131072
(The cause was misuse of the libext2fs extent code.)
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
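Not part of the patch: the extent dump in this patch is triggered when the
mapping handed back to the kernel does not cover the first byte of the
request. A minimal sketch of that coverage test follows; struct
sketch_mapping and sketch_mapping_covers() are hypothetical stand-ins for
the offset/length fields of the reply. It is written as a subtraction so
that offset + length cannot wrap, which is a choice of this sketch and not
something the patch does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the offset/length pair in the iomap reply. */
struct sketch_mapping {
	uint64_t offset;	/* file offset of the mapping, in bytes */
	uint64_t length;	/* length of the mapping, in bytes */
};

/* True if offset <= pos < offset + length, without overflowing the sum. */
static bool sketch_mapping_covers(const struct sketch_mapping *m,
				  uint64_t pos)
{
	return m->offset <= pos && pos - m->offset < m->length;
}
```

A mapping for which this returns false is the "BAD DATA" case that the
debug hook reports.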
misc/fuse2fs.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
misc/fuse4fs.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 146 insertions(+)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 1dda9c45cb5089..4a9fda62f99bc2 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -578,6 +578,74 @@ static inline int fuse2fs_iomap_enabled(const struct fuse2fs *ff)
# define fuse2fs_iomap_enabled(...) (0)
#endif
+static inline void fuse2fs_dump_extents(struct fuse2fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode,
+ const char *why)
+{
+ ext2_filsys fs = ff->fs;
+ unsigned int nr = 0;
+ blk64_t blockcount = 0;
+ struct ext2_inode_large xinode;
+ struct ext2fs_extent extent;
+ ext2_extent_handle_t extents;
+ int op = EXT2_EXTENT_ROOT;
+ errcode_t retval;
+
+ if (!inode) {
+ inode = &xinode;
+
+ retval = fuse2fs_read_inode(fs, ino, inode);
+ if (retval) {
+ com_err(__func__, retval, _("reading ino %u"), ino);
+ return;
+ }
+ }
+
+ if (!(inode->i_flags & EXT4_EXTENTS_FL))
+ return;
+
+ printf("%s: %s ino=%u isize %llu iblocks %llu\n", __func__, why, ino,
+ EXT2_I_SIZE(inode),
+ (ext2fs_get_stat_i_blocks(fs, EXT2_INODE(inode)) * 512) /
+ fs->blocksize);
+ fflush(stdout);
+
+ retval = ext2fs_extent_open(fs, ino, &extents);
+ if (retval) {
+ com_err(__func__, retval, _("opening extents of ino \"%u\""),
+ ino);
+ return;
+ }
+
+ while ((retval = ext2fs_extent_get(extents, op, &extent)) == 0) {
+ op = EXT2_EXTENT_NEXT;
+
+ if (extent.e_flags & EXT2_EXTENT_FLAGS_SECOND_VISIT)
+ continue;
+
+ printf("[%u]: %s ino=%u lblk 0x%llx pblk 0x%llx len 0x%x flags 0x%x\n",
+ nr++, why, ino, extent.e_lblk, extent.e_pblk,
+ extent.e_len, extent.e_flags);
+ fflush(stdout);
+ if (extent.e_flags & EXT2_EXTENT_FLAGS_LEAF)
+ blockcount += extent.e_len;
+ else
+ blockcount++;
+ }
+ if (retval == EXT2_ET_EXTENT_NO_NEXT)
+ retval = 0;
+ if (retval) {
+ com_err(__func__, retval, _("getting extents of ino %u"),
+ ino);
+ }
+ if (inode->i_file_acl)
+ blockcount++;
+ printf("%s: %s sum(e_len) %llu\n", __func__, why, blockcount);
+ fflush(stdout);
+
+ ext2fs_extent_free(extents);
+}
+
static void get_now(struct timespec *now)
{
#ifdef CLOCK_REALTIME
@@ -5433,6 +5501,11 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
(unsigned long long)read->length,
read->type);
+ /* Not filling even the first byte will make the kernel unhappy. */
+ if (ff->debug && (read->offset > pos ||
+ read->offset + read->length <= pos))
+ fuse2fs_dump_extents(ff, attr_ino, &inode, "BAD DATA");
+
out_unlock:
fuse2fs_finish(ff, ret);
return ret;
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 2aa7ab646592e9..0ac5de90498dac 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -730,6 +730,74 @@ static inline int fuse4fs_iomap_enabled(const struct fuse4fs *ff)
# define fuse4fs_iomap_enabled(...) (0)
#endif
+static inline void fuse4fs_dump_extents(struct fuse4fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode,
+ const char *why)
+{
+ ext2_filsys fs = ff->fs;
+ unsigned int nr = 0;
+ blk64_t blockcount = 0;
+ struct ext2_inode_large xinode;
+ struct ext2fs_extent extent;
+ ext2_extent_handle_t extents;
+ int op = EXT2_EXTENT_ROOT;
+ errcode_t retval;
+
+ if (!inode) {
+ inode = &xinode;
+
+ retval = fuse4fs_read_inode(fs, ino, inode);
+ if (retval) {
+ com_err(__func__, retval, _("reading ino %u"), ino);
+ return;
+ }
+ }
+
+ if (!(inode->i_flags & EXT4_EXTENTS_FL))
+ return;
+
+ printf("%s: %s ino=%u isize %llu iblocks %llu\n", __func__, why, ino,
+ EXT2_I_SIZE(inode),
+ (ext2fs_get_stat_i_blocks(fs, EXT2_INODE(inode)) * 512) /
+ fs->blocksize);
+ fflush(stdout);
+
+ retval = ext2fs_extent_open(fs, ino, &extents);
+ if (retval) {
+ com_err(__func__, retval, _("opening extents of ino \"%u\""),
+ ino);
+ return;
+ }
+
+ while ((retval = ext2fs_extent_get(extents, op, &extent)) == 0) {
+ op = EXT2_EXTENT_NEXT;
+
+ if (extent.e_flags & EXT2_EXTENT_FLAGS_SECOND_VISIT)
+ continue;
+
+ printf("[%u]: %s ino=%u lblk 0x%llx pblk 0x%llx len 0x%x flags 0x%x\n",
+ nr++, why, ino, extent.e_lblk, extent.e_pblk,
+ extent.e_len, extent.e_flags);
+ fflush(stdout);
+ if (extent.e_flags & EXT2_EXTENT_FLAGS_LEAF)
+ blockcount += extent.e_len;
+ else
+ blockcount++;
+ }
+ if (retval == EXT2_ET_EXTENT_NO_NEXT)
+ retval = 0;
+ if (retval) {
+ com_err(__func__, retval, _("getting extents of ino %u"),
+ ino);
+ }
+ if (inode->i_file_acl)
+ blockcount++;
+ printf("%s: %s sum(e_len) %llu\n", __func__, why, blockcount);
+ fflush(stdout);
+
+ ext2fs_extent_free(extents);
+}
+
static void get_now(struct timespec *now)
{
#ifdef CLOCK_REALTIME
@@ -5839,6 +5907,11 @@ static void op_iomap_begin(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
read.type,
read.flags);
+ /* Not filling even the first byte will make the kernel unhappy. */
+ if (ff->debug && (read.offset > pos ||
+ read.offset + read.length <= pos))
+ fuse4fs_dump_extents(ff, ino, &inode, "BAD DATA");
+
out_unlock:
fuse4fs_finish(ff, ret);
if (ret)
* [PATCH 07/19] fuse2fs: implement direct write support
2025-08-21 0:49 ` [PATCHSET RFC v4 3/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
` (5 preceding siblings ...)
2025-08-21 1:17 ` [PATCH 06/19] fuse2fs: add extent dump function for debugging Darrick J. Wong
@ 2025-08-21 1:17 ` Darrick J. Wong
2025-08-21 1:17 ` [PATCH 08/19] fuse2fs: turn on iomap for pagecache IO Darrick J. Wong
` (11 subsequent siblings)
18 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:17 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Wire up an iomap_begin method that can allocate into holes so that we
can do directio writes.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
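Not part of the patch: a worked version of the size limit arithmetic used
by the write path, as a standalone sketch. All sketch_* names are
hypothetical. It assumes 32-bit block numbers in indirect blocks (hence
blocksize / 4 entries per block) and 32-bit logical block numbers for
extent files, matching what the patch computes.

```c
#include <stdint.h>

/*
 * Block count term for a block-mapped (non-extent) file: 12 direct
 * blocks plus the single, double and triple indirect trees, where each
 * indirect block holds blocksize / 4 block numbers.
 */
static uint64_t sketch_max_indirect_blocks(uint32_t blocksize)
{
	uint64_t n = blocksize >> 2;

	return 12 + n + n * n + n * n * n;
}

/* Extent-mapped files address logical blocks with 32 bits. */
static uint64_t sketch_max_extent_block(void)
{
	return (1ULL << 32) - 1;
}

/*
 * Trim a request so that it does not run past max_size.  The caller is
 * expected to have rejected pos >= max_size already.
 */
static uint64_t sketch_clamp_count(uint64_t pos, uint64_t count,
				   uint64_t max_size)
{
	if (count > max_size - pos)
		return max_size - pos;
	return count;
}
```

With 1k blocks the indirect term comes to 16843020 blocks (about 16 GiB);
with 4k blocks it is 1074791436 blocks (about 4 TiB).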
misc/fuse2fs.c | 470 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
misc/fuse4fs.c | 473 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 937 insertions(+), 6 deletions(-)
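Not part of the patch: the ioend handler converts the written part of an
unwritten extent by shortening it and inserting new records on either
side. The sketch below only computes the resulting pieces; it does not
touch an extent tree, and struct sketch_extent and
sketch_split_unwritten() are hypothetical simplifications, not libext2fs
types.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified extent record; not the libext2fs struct. */
struct sketch_extent {
	uint64_t lblk;		/* first logical block */
	uint64_t pblk;		/* first physical block */
	uint32_t len;		/* length in blocks */
	bool unwritten;
};

/*
 * Mark the part of unwritten extent ex that overlaps [start, stop) as
 * written.  Stores up to three pieces in out[] in logical order and
 * returns how many were produced.  Assumes ex is unwritten and really
 * does overlap the range.
 */
static int sketch_split_unwritten(const struct sketch_extent *ex,
				  uint64_t start, uint64_t stop,
				  struct sketch_extent out[3])
{
	uint64_t end = ex->lblk + ex->len;
	uint64_t cstart = start > ex->lblk ? start : ex->lblk;
	uint64_t cstop = stop < end ? stop : end;
	int n = 0;

	/* Left remainder stays unwritten. */
	if (cstart > ex->lblk) {
		out[n++] = (struct sketch_extent){
			.lblk = ex->lblk,
			.pblk = ex->pblk,
			.len = (uint32_t)(cstart - ex->lblk),
			.unwritten = true,
		};
	}

	/* Middle piece is what the write covered. */
	out[n++] = (struct sketch_extent){
		.lblk = cstart,
		.pblk = ex->pblk + (cstart - ex->lblk),
		.len = (uint32_t)(cstop - cstart),
		.unwritten = false,
	};

	/* Right remainder stays unwritten. */
	if (cstop < end) {
		out[n++] = (struct sketch_extent){
			.lblk = cstop,
			.pblk = ex->pblk + (cstop - ex->lblk),
			.len = (uint32_t)(end - cstop),
			.unwritten = true,
		};
	}

	return n;
}
```

For example, an unwritten extent covering blocks 10..29 with a write to
[15, 25) yields three pieces: unwritten 10..14, written 15..24 and
unwritten 25..29. The physical start of each piece is shifted by the same
amount as its logical start.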
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 4a9fda62f99bc2..e8e9056a661e71 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5442,12 +5442,103 @@ static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
opflags, read);
}
+static int fuse2fs_iomap_write_allocate(struct fuse2fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode, off_t pos,
+ uint64_t count, uint32_t opflags,
+ struct fuse_file_iomap *read, bool *dirty)
+{
+ ext2_filsys fs = ff->fs;
+ blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos);
+ blk64_t stopoff = FUSE2FS_B_TO_FSB(ff, pos + count);
+ blk64_t old_iblocks;
+ errcode_t err;
+ int ret;
+
+ dbg_printf(ff, "%s: write_alloc ino=%u startoff 0x%llx blockcount 0x%llx\n",
+ __func__, ino, startoff, stopoff - startoff);
+
+ if (!fs_can_allocate(ff, stopoff - startoff))
+ return -ENOSPC;
+
+ old_iblocks = ext2fs_get_stat_i_blocks(fs, EXT2_INODE(inode));
+ err = ext2fs_fallocate(fs, EXT2_FALLOCATE_FORCE_UNINIT, ino,
+ EXT2_INODE(inode), ~0ULL, startoff,
+ stopoff - startoff);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ /*
+ * New allocations for file data blocks on indirect mapped files are
+ * zeroed through the IO manager so we have to flush it to disk.
+ */
+ if (!(inode->i_flags & EXT4_EXTENTS_FL) &&
+ old_iblocks != ext2fs_get_stat_i_blocks(fs, EXT2_INODE(inode))) {
+ err = io_channel_flush(fs->io);
+ if (err)
+ return translate_error(fs, ino, err);
+ }
+
+ /* pick up the newly allocated mapping */
+ ret = fuse2fs_iomap_begin_read(ff, ino, inode, pos, count, opflags,
+ read);
+ if (ret)
+ return ret;
+
+ read->flags |= FUSE_IOMAP_F_DIRTY;
+ *dirty = true;
+ return 0;
+}
+
+static off_t fuse2fs_max_file_size(const struct fuse2fs *ff,
+ const struct ext2_inode_large *inode)
+{
+ ext2_filsys fs = ff->fs;
+ blk64_t addr_per_block, max_map_block;
+
+ if (inode->i_flags & EXT4_EXTENTS_FL) {
+ max_map_block = (1ULL << 32) - 1;
+ } else {
+ addr_per_block = fs->blocksize >> 2;
+ max_map_block = addr_per_block;
+ max_map_block += addr_per_block * addr_per_block;
+ max_map_block += addr_per_block * addr_per_block * addr_per_block;
+ max_map_block += 12;
+ }
+
+ return FUSE2FS_FSB_TO_B(ff, max_map_block) + (fs->blocksize - 1);
+}
+
static int fuse2fs_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
struct ext2_inode_large *inode, off_t pos,
uint64_t count, uint32_t opflags,
- struct fuse_file_iomap *read)
+ struct fuse_file_iomap *read,
+ bool *dirty)
{
- return -ENOSYS;
+ off_t max_size = fuse2fs_max_file_size(ff, inode);
+ int ret;
+
+ if (!(opflags & FUSE_IOMAP_OP_DIRECT))
+ return -ENOSYS;
+
+ if (pos >= max_size)
+ return -EFBIG;
+
+ if (pos >= max_size - count)
+ count = max_size - pos;
+
+ ret = fuse2fs_iomap_begin_read(ff, ino, inode, pos, count, opflags,
+ read);
+ if (ret)
+ return ret;
+
+ if (fuse_iomap_need_write_allocate(opflags, read)) {
+ ret = fuse2fs_iomap_write_allocate(ff, ino, inode, pos, count,
+ opflags, read, dirty);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
}
static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
@@ -5459,6 +5550,7 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
struct ext2_inode_large inode;
ext2_filsys fs;
errcode_t err;
+ bool dirty = false;
int ret = 0;
FUSE2FS_CHECK_CONTEXT(ff);
@@ -5484,7 +5576,7 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
count, opflags, read);
else if (fuse_iomap_is_write(opflags))
ret = fuse2fs_iomap_begin_write(ff, attr_ino, &inode, pos,
- count, opflags, read);
+ count, opflags, read, &dirty);
else
ret = fuse2fs_iomap_begin_read(ff, attr_ino, &inode, pos,
count, opflags, read);
@@ -5506,6 +5598,14 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
read->offset + read->length <= pos))
fuse2fs_dump_extents(ff, attr_ino, &inode, "BAD DATA");
+ if (dirty) {
+ err = fuse2fs_write_inode(fs, attr_ino, &inode);
+ if (err) {
+ ret = translate_error(fs, attr_ino, err);
+ goto out_unlock;
+ }
+ }
+
out_unlock:
fuse2fs_finish(ff, ret);
return ret;
@@ -5643,6 +5743,369 @@ static int op_iomap_config(uint64_t flags, off_t maxbytes,
if (ret)
goto out_unlock;
+out_unlock:
+ fuse2fs_finish(ff, ret);
+ return ret;
+}
+
+static inline bool fuse2fs_can_merge_mappings(const struct ext2fs_extent *left,
+ const struct ext2fs_extent *right)
+{
+ uint64_t max_len = (left->e_flags & EXT2_EXTENT_FLAGS_UNINIT) ?
+ EXT_UNINIT_MAX_LEN : EXT_INIT_MAX_LEN;
+
+ return left->e_lblk + left->e_len == right->e_lblk &&
+ left->e_pblk + left->e_len == right->e_pblk &&
+ (left->e_flags & EXT2_EXTENT_FLAGS_UNINIT) ==
+ (right->e_flags & EXT2_EXTENT_FLAGS_UNINIT) &&
+ (uint64_t)left->e_len + right->e_len <= max_len;
+}
+
+static int fuse2fs_try_merge_mappings(struct fuse2fs *ff, ext2_ino_t ino,
+ ext2_extent_handle_t handle,
+ blk64_t startoff)
+{
+ ext2_filsys fs = ff->fs;
+ struct ext2fs_extent left, right;
+ errcode_t err;
+
+ /* Look up the mapping before startoff */
+ err = fuse2fs_get_mapping_at(ff, handle, startoff - 1, &left);
+ if (err == EXT2_ET_EXTENT_NOT_FOUND)
+ return 0;
+ if (err)
+ return translate_error(fs, ino, err);
+
+ /* Look up the mapping at startoff */
+ err = fuse2fs_get_mapping_at(ff, handle, startoff, &right);
+ if (err == EXT2_ET_EXTENT_NOT_FOUND)
+ return 0;
+ if (err)
+ return translate_error(fs, ino, err);
+
+ /* Can we combine them? */
+ if (!fuse2fs_can_merge_mappings(&left, &right))
+ return 0;
+
+ /*
+ * Delete the mapping after startoff because libext2fs cannot handle
+ * overlapping mappings.
+ */
+ err = ext2fs_extent_delete(handle, 0);
+ DUMP_EXTENT(ff, "remover", startoff, err, &right);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = ext2fs_extent_fix_parents(handle);
+ DUMP_EXTENT(ff, "fixremover", startoff, err, &right);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ /* Move back and lengthen the mapping before startoff */
+ err = ext2fs_extent_goto(handle, left.e_lblk);
+ DUMP_EXTENT(ff, "movel", startoff - 1, err, &left);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ left.e_len += right.e_len;
+ err = ext2fs_extent_replace(handle, 0, &left);
+ DUMP_EXTENT(ff, "replacel", startoff - 1, err, &left);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = ext2fs_extent_fix_parents(handle);
+ DUMP_EXTENT(ff, "fixreplacel", startoff - 1, err, &left);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ return 0;
+}
+
+static int fuse2fs_convert_unwritten_mapping(struct fuse2fs *ff,
+ ext2_ino_t ino,
+ struct ext2_inode_large *inode,
+ ext2_extent_handle_t handle,
+ blk64_t *cursor, blk64_t stopoff)
+{
+ ext2_filsys fs = ff->fs;
+ struct ext2fs_extent extent;
+ blk64_t startoff = *cursor;
+ errcode_t err;
+
+ /*
+ * Find the mapping at startoff. Note that we can find holes because
+ * the mapping data can change due to racing writes.
+ */
+ err = fuse2fs_get_mapping_at(ff, handle, startoff, &extent);
+ if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+ /*
+ * If we didn't find any mappings at all then the file is
+ * completely sparse. There's nothing to convert.
+ */
+ *cursor = stopoff;
+ return 0;
+ }
+ if (err)
+ return translate_error(fs, ino, err);
+
+ /*
+ * The mapping is completely to the left of the range that we want.
+ * Let's see what's in the next extent, if there is one.
+ */
+ if (startoff >= extent.e_lblk + extent.e_len) {
+ /*
+ * Mapping ends to the left of the current position. Try to
+ * find the next mapping. If there is no next mapping, then
+ * we're done.
+ */
+ err = fuse2fs_get_next_mapping(ff, handle, startoff, &extent);
+ if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+ *cursor = stopoff;
+ return 0;
+ }
+ if (err)
+ return translate_error(fs, ino, err);
+ }
+
+ /*
+ * The mapping is completely to the right of the range that we want,
+ * so we're done.
+ */
+ if (extent.e_lblk >= stopoff) {
+ *cursor = stopoff;
+ return 0;
+ }
+
+ /*
+ * At this point, we have a mapping that overlaps [startoff, stopoff).
+ * If the mapping is already written, move on to the next one.
+ */
+ if (!(extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT))
+ goto next;
+
+ if (startoff > extent.e_lblk) {
+ struct ext2fs_extent newex = extent;
+
+ /*
+ * Unwritten mapping starts before startoff. Shorten
+ * the previous mapping...
+ */
+ newex.e_len = startoff - extent.e_lblk;
+ err = ext2fs_extent_replace(handle, 0, &newex);
+ DUMP_EXTENT(ff, "shortenp", startoff, err, &newex);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = ext2fs_extent_fix_parents(handle);
+ DUMP_EXTENT(ff, "fixshortenp", startoff, err, &newex);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ /* ...and create new written mapping at startoff. */
+ extent.e_len -= newex.e_len;
+ extent.e_lblk += newex.e_len;
+ extent.e_pblk += newex.e_len;
+ extent.e_flags = newex.e_flags & ~EXT2_EXTENT_FLAGS_UNINIT;
+
+ err = ext2fs_extent_insert(handle,
+ EXT2_EXTENT_INSERT_AFTER,
+ &extent);
+ DUMP_EXTENT(ff, "insertx", startoff, err, &extent);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = ext2fs_extent_fix_parents(handle);
+ DUMP_EXTENT(ff, "fixinsertx", startoff, err, &extent);
+ if (err)
+ return translate_error(fs, ino, err);
+ }
+
+ if (extent.e_lblk + extent.e_len > stopoff) {
+ struct ext2fs_extent newex = extent;
+
+ /*
+ * Unwritten mapping ends after stopoff. Shorten the current
+ * mapping...
+ */
+ extent.e_len = stopoff - extent.e_lblk;
+ extent.e_flags &= ~EXT2_EXTENT_FLAGS_UNINIT;
+
+ err = ext2fs_extent_replace(handle, 0, &extent);
+ DUMP_EXTENT(ff, "shortenn", startoff, err, &extent);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = ext2fs_extent_fix_parents(handle);
+ DUMP_EXTENT(ff, "fixshortenn", startoff, err, &extent);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ /* ...and create a new unwritten mapping at stopoff. */
+ newex.e_pblk += extent.e_len;
+ newex.e_lblk += extent.e_len;
+ newex.e_len -= extent.e_len;
+ newex.e_flags |= EXT2_EXTENT_FLAGS_UNINIT;
+
+ err = ext2fs_extent_insert(handle,
+ EXT2_EXTENT_INSERT_AFTER,
+ &newex);
+ DUMP_EXTENT(ff, "insertn", startoff, err, &newex);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = ext2fs_extent_fix_parents(handle);
+ DUMP_EXTENT(ff, "fixinsertn", startoff, err, &newex);
+ if (err)
+ return translate_error(fs, ino, err);
+ }
+
+ /* Still unwritten? Update the state. */
+ if (extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT) {
+ extent.e_flags &= ~EXT2_EXTENT_FLAGS_UNINIT;
+
+ err = ext2fs_extent_replace(handle, 0, &extent);
+ DUMP_EXTENT(ff, "replacex", startoff, err, &extent);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = ext2fs_extent_fix_parents(handle);
+ DUMP_EXTENT(ff, "fixreplacex", startoff, err, &extent);
+ if (err)
+ return translate_error(fs, ino, err);
+ }
+
+next:
+ /* Try to merge with the previous extent */
+ if (startoff > 0) {
+ err = fuse2fs_try_merge_mappings(ff, ino, handle, startoff);
+ if (err)
+ return translate_error(fs, ino, err);
+ }
+
+ *cursor = extent.e_lblk + extent.e_len;
+ return 0;
+}
+
+static int fuse2fs_convert_unwritten_mappings(struct fuse2fs *ff,
+ ext2_ino_t ino,
+ struct ext2_inode_large *inode,
+ off_t pos, size_t written)
+{
+ ext2_extent_handle_t handle;
+ ext2_filsys fs = ff->fs;
+ blk64_t startoff = FUSE2FS_B_TO_FSBT(ff, pos);
+ const blk64_t stopoff = FUSE2FS_B_TO_FSB(ff, pos + written);
+ errcode_t err;
+ int ret;
+
+ err = ext2fs_extent_open2(fs, ino, EXT2_INODE(inode), &handle);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ /* Walk every mapping in the range, converting them. */
+ while (startoff < stopoff) {
+ blk64_t old_startoff = startoff;
+
+ ret = fuse2fs_convert_unwritten_mapping(ff, ino, inode, handle,
+ &startoff, stopoff);
+ if (ret)
+ goto out_handle;
+ if (startoff <= old_startoff) {
+ /* Do not go backwards. */
+ ret = translate_error(fs, ino, EXT2_ET_INODE_CORRUPTED);
+ goto out_handle;
+ }
+ }
+
+ /* Try to merge the right edge */
+ ret = fuse2fs_try_merge_mappings(ff, ino, handle, stopoff);
+out_handle:
+ ext2fs_extent_free(handle);
+ return ret;
+}
+
+static int op_iomap_ioend(const char *path, uint64_t nodeid, uint64_t attr_ino,
+ off_t pos, size_t written, uint32_t ioendflags,
+ int error, uint64_t new_addr)
+{
+ struct fuse2fs *ff = fuse2fs_get();
+ struct ext2_inode_large inode;
+ ext2_filsys fs;
+ errcode_t err;
+ bool dirty = false;
+ int ret = 0;
+
+ FUSE2FS_CHECK_CONTEXT(ff);
+
+ dbg_printf(ff,
+ "%s: path=%s nodeid=%llu attr_ino=%llu pos=0x%llx written=0x%zx ioendflags=0x%x error=%d new_addr=%llu\n",
+ __func__, path,
+ (unsigned long long)nodeid,
+ (unsigned long long)attr_ino,
+ (unsigned long long)pos,
+ written,
+ ioendflags,
+ error,
+ (unsigned long long)new_addr);
+
+ fs = fuse2fs_start(ff);
+ if (error) {
+ ret = error;
+ goto out_unlock;
+ }
+
+ /* should never see these ioend types */
+ if (ioendflags & FUSE_IOMAP_IOEND_SHARED) {
+ ret = translate_error(fs, attr_ino,
+ EXT2_ET_FILESYSTEM_CORRUPTED);
+ goto out_unlock;
+ }
+
+ err = fuse2fs_read_inode(fs, attr_ino, &inode);
+ if (err) {
+ ret = translate_error(fs, attr_ino, err);
+ goto out_unlock;
+ }
+
+ if (ioendflags & FUSE_IOMAP_IOEND_UNWRITTEN) {
+ /* unwritten extents are only supported on extents files */
+ if (!(inode.i_flags & EXT4_EXTENTS_FL)) {
+ ret = translate_error(fs, attr_ino,
+ EXT2_ET_FILESYSTEM_CORRUPTED);
+ goto out_unlock;
+ }
+
+ ret = fuse2fs_convert_unwritten_mappings(ff, attr_ino, &inode,
+ pos, written);
+ if (ret)
+ goto out_unlock;
+
+ dirty = true;
+ }
+
+ if (ioendflags & FUSE_IOMAP_IOEND_APPEND) {
+ ext2_off64_t isize = EXT2_I_SIZE(&inode);
+
+ if (pos + written > isize) {
+ err = ext2fs_inode_size_set(fs, EXT2_INODE(&inode),
+ pos + written);
+ if (err) {
+ ret = translate_error(fs, attr_ino, err);
+ goto out_unlock;
+ }
+
+ dirty = true;
+ }
+ }
+
+ if (dirty) {
+ err = fuse2fs_write_inode(fs, attr_ino, &inode);
+ if (err) {
+ ret = translate_error(fs, attr_ino, err);
+ goto out_unlock;
+ }
+ }
+
out_unlock:
fuse2fs_finish(ff, ret);
return ret;
@@ -5713,6 +6176,7 @@ static struct fuse_operations fs_ops = {
.iomap_begin = op_iomap_begin,
.iomap_end = op_iomap_end,
.iomap_config = op_iomap_config,
+ .iomap_ioend = op_iomap_ioend,
#endif /* HAVE_FUSE_IOMAP */
};
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 0ac5de90498dac..ff50182b929974 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -5850,12 +5850,106 @@ static int fuse4fs_iomap_begin_read(struct fuse4fs *ff, ext2_ino_t ino,
opflags, read);
}
+static int fuse4fs_iomap_write_allocate(struct fuse4fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode,
+ off_t pos, uint64_t count,
+ uint32_t opflags,
+ struct fuse_file_iomap *read,
+ bool *dirty)
+{
+ ext2_filsys fs = ff->fs;
+ blk64_t startoff = FUSE4FS_B_TO_FSBT(ff, pos);
+ blk64_t stopoff = FUSE4FS_B_TO_FSB(ff, pos + count);
+ blk64_t old_iblocks;
+ errcode_t err;
+ int ret;
+
+ dbg_printf(ff,
+ "%s: ino=%u startoff 0x%llx blockcount 0x%llx\n",
+ __func__, ino, startoff, stopoff - startoff);
+
+ if (!fuse4fs_can_allocate(ff, stopoff - startoff))
+ return -ENOSPC;
+
+ old_iblocks = ext2fs_get_stat_i_blocks(fs, EXT2_INODE(inode));
+ err = ext2fs_fallocate(fs, EXT2_FALLOCATE_FORCE_UNINIT, ino,
+ EXT2_INODE(inode), ~0ULL, startoff,
+ stopoff - startoff);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ /*
+ * New allocations for file data blocks on indirect mapped files are
+ * zeroed through the IO manager so we have to flush it to disk.
+ */
+ if (!(inode->i_flags & EXT4_EXTENTS_FL) &&
+ old_iblocks != ext2fs_get_stat_i_blocks(fs, EXT2_INODE(inode))) {
+ err = io_channel_flush(fs->io);
+ if (err)
+ return translate_error(fs, ino, err);
+ }
+
+ /* pick up the newly allocated mapping */
+ ret = fuse4fs_iomap_begin_read(ff, ino, inode, pos, count, opflags,
+ read);
+ if (ret)
+ return ret;
+
+ read->flags |= FUSE_IOMAP_F_DIRTY;
+ *dirty = true;
+ return 0;
+}
+
+static off_t fuse4fs_max_file_size(const struct fuse4fs *ff,
+ const struct ext2_inode_large *inode)
+{
+ ext2_filsys fs = ff->fs;
+ blk64_t addr_per_block, max_map_block;
+
+ if (inode->i_flags & EXT4_EXTENTS_FL) {
+ max_map_block = (1ULL << 32) - 1;
+ } else {
+ addr_per_block = fs->blocksize >> 2;
+ max_map_block = addr_per_block;
+ max_map_block += addr_per_block * addr_per_block;
+ max_map_block += addr_per_block * addr_per_block * addr_per_block;
+ max_map_block += 12;
+ }
+
+ return FUSE4FS_FSB_TO_B(ff, max_map_block) + (fs->blocksize - 1);
+}
+
static int fuse4fs_iomap_begin_write(struct fuse4fs *ff, ext2_ino_t ino,
struct ext2_inode_large *inode, off_t pos,
uint64_t count, uint32_t opflags,
- struct fuse_file_iomap *read)
+ struct fuse_file_iomap *read,
+ bool *dirty)
{
- return -ENOSYS;
+ off_t max_size = fuse4fs_max_file_size(ff, inode);
+ int ret;
+
+ if (!(opflags & FUSE_IOMAP_OP_DIRECT))
+ return -ENOSYS;
+
+ if (pos >= max_size)
+ return -EFBIG;
+
+ if (pos >= max_size - count)
+ count = max_size - pos;
+
+ ret = fuse4fs_iomap_begin_read(ff, ino, inode, pos, count, opflags,
+ read);
+ if (ret)
+ return ret;
+
+ if (fuse_iomap_need_write_allocate(opflags, read)) {
+ ret = fuse4fs_iomap_write_allocate(ff, ino, inode, pos, count,
+ opflags, read, dirty);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
}
static void op_iomap_begin(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
@@ -5867,6 +5961,7 @@ static void op_iomap_begin(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
ext2_filsys fs;
ext2_ino_t ino;
errcode_t err;
+ bool dirty = false;
int ret = 0;
FUSE4FS_CHECK_CONTEXT(req);
@@ -5890,7 +5985,7 @@ static void op_iomap_begin(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
opflags, &read);
else if (fuse_iomap_is_write(opflags))
ret = fuse4fs_iomap_begin_write(ff, ino, &inode, pos, count,
- opflags, &read);
+ opflags, &read, &dirty);
else
ret = fuse4fs_iomap_begin_read(ff, ino, &inode, pos, count,
opflags, &read);
@@ -5912,6 +6007,14 @@ static void op_iomap_begin(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
read.offset + read.length <= pos))
fuse4fs_dump_extents(ff, ino, &inode, "BAD DATA");
+ if (dirty) {
+ err = fuse4fs_write_inode(fs, ino, &inode);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out_unlock;
+ }
+ }
+
out_unlock:
fuse4fs_finish(ff, ret);
if (ret)
@@ -6059,6 +6162,369 @@ static void op_iomap_config(fuse_req_t req, uint64_t flags, uint64_t maxbytes)
else
fuse_reply_iomap_config(req, &cfg);
}
+
+static inline bool fuse4fs_can_merge_mappings(const struct ext2fs_extent *left,
+ const struct ext2fs_extent *right)
+{
+ uint64_t max_len = (left->e_flags & EXT2_EXTENT_FLAGS_UNINIT) ?
+ EXT_UNINIT_MAX_LEN : EXT_INIT_MAX_LEN;
+
+ return left->e_lblk + left->e_len == right->e_lblk &&
+ left->e_pblk + left->e_len == right->e_pblk &&
+ (left->e_flags & EXT2_EXTENT_FLAGS_UNINIT) ==
+ (right->e_flags & EXT2_EXTENT_FLAGS_UNINIT) &&
+ (uint64_t)left->e_len + right->e_len <= max_len;
+}
+
+static int fuse4fs_try_merge_mappings(struct fuse4fs *ff, ext2_ino_t ino,
+ ext2_extent_handle_t handle,
+ blk64_t startoff)
+{
+ ext2_filsys fs = ff->fs;
+ struct ext2fs_extent left, right;
+ errcode_t err;
+
+ /* Look up the mapping before startoff */
+ err = fuse4fs_get_mapping_at(ff, handle, startoff - 1, &left);
+ if (err == EXT2_ET_EXTENT_NOT_FOUND)
+ return 0;
+ if (err)
+ return translate_error(fs, ino, err);
+
+ /* Look up the mapping at startoff */
+ err = fuse4fs_get_mapping_at(ff, handle, startoff, &right);
+ if (err == EXT2_ET_EXTENT_NOT_FOUND)
+ return 0;
+ if (err)
+ return translate_error(fs, ino, err);
+
+ /* Can we combine them? */
+ if (!fuse4fs_can_merge_mappings(&left, &right))
+ return 0;
+
+ /*
+ * Delete the mapping after startoff because libext2fs cannot handle
+ * overlapping mappings.
+ */
+ err = ext2fs_extent_delete(handle, 0);
+ DUMP_EXTENT(ff, "remover", startoff, err, &right);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = ext2fs_extent_fix_parents(handle);
+ DUMP_EXTENT(ff, "fixremover", startoff, err, &right);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ /* Move back and lengthen the mapping before startoff */
+ err = ext2fs_extent_goto(handle, left.e_lblk);
+ DUMP_EXTENT(ff, "movel", startoff - 1, err, &left);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ left.e_len += right.e_len;
+ err = ext2fs_extent_replace(handle, 0, &left);
+ DUMP_EXTENT(ff, "replacel", startoff - 1, err, &left);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = ext2fs_extent_fix_parents(handle);
+ DUMP_EXTENT(ff, "fixreplacel", startoff - 1, err, &left);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ return 0;
+}
+
+static int fuse4fs_convert_unwritten_mapping(struct fuse4fs *ff,
+ ext2_ino_t ino,
+ struct ext2_inode_large *inode,
+ ext2_extent_handle_t handle,
+ blk64_t *cursor, blk64_t stopoff)
+{
+ ext2_filsys fs = ff->fs;
+ struct ext2fs_extent extent;
+ blk64_t startoff = *cursor;
+ errcode_t err;
+
+ /*
+ * Find the mapping at startoff. Note that we can find holes because
+ * the mapping data can change due to racing writes.
+ */
+ err = fuse4fs_get_mapping_at(ff, handle, startoff, &extent);
+ if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+ /*
+ * If we didn't find any mappings at all then the file is
+ * completely sparse. There's nothing to convert.
+ */
+ *cursor = stopoff;
+ return 0;
+ }
+ if (err)
+ return translate_error(fs, ino, err);
+
+ /*
+ * The mapping is completely to the left of the range that we want.
+ * Let's see what's in the next extent, if there is one.
+ */
+ if (startoff >= extent.e_lblk + extent.e_len) {
+ /*
+ * Mapping ends to the left of the current position. Try to
+ * find the next mapping. If there is no next mapping, then
+ * we're done.
+ */
+ err = fuse4fs_get_next_mapping(ff, handle, startoff, &extent);
+ if (err == EXT2_ET_EXTENT_NOT_FOUND) {
+ *cursor = stopoff;
+ return 0;
+ }
+ if (err)
+ return translate_error(fs, ino, err);
+ }
+
+ /*
+ * The mapping is completely to the right of the range that we want,
+ * so we're done.
+ */
+ if (extent.e_lblk >= stopoff) {
+ *cursor = stopoff;
+ return 0;
+ }
+
+ /*
+ * At this point, we have a mapping that overlaps [startoff, stopoff).
+ * If the mapping is already written, move on to the next one.
+ */
+ if (!(extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT))
+ goto next;
+
+ if (startoff > extent.e_lblk) {
+ struct ext2fs_extent newex = extent;
+
+ /*
+ * Unwritten mapping starts before startoff. Shorten
+ * the previous mapping...
+ */
+ newex.e_len = startoff - extent.e_lblk;
+ err = ext2fs_extent_replace(handle, 0, &newex);
+ DUMP_EXTENT(ff, "shortenp", startoff, err, &newex);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = ext2fs_extent_fix_parents(handle);
+ DUMP_EXTENT(ff, "fixshortenp", startoff, err, &newex);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ /* ...and create a new written mapping at startoff. */
+ extent.e_len -= newex.e_len;
+ extent.e_lblk += newex.e_len;
+ extent.e_pblk += newex.e_len;
+ extent.e_flags = newex.e_flags & ~EXT2_EXTENT_FLAGS_UNINIT;
+
+ err = ext2fs_extent_insert(handle,
+ EXT2_EXTENT_INSERT_AFTER,
+ &extent);
+ DUMP_EXTENT(ff, "insertx", startoff, err, &extent);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = ext2fs_extent_fix_parents(handle);
+ DUMP_EXTENT(ff, "fixinsertx", startoff, err, &extent);
+ if (err)
+ return translate_error(fs, ino, err);
+ }
+
+ if (extent.e_lblk + extent.e_len > stopoff) {
+ struct ext2fs_extent newex = extent;
+
+ /*
+ * Unwritten mapping ends after stopoff. Shorten the current
+ * mapping...
+ */
+ extent.e_len = stopoff - extent.e_lblk;
+ extent.e_flags &= ~EXT2_EXTENT_FLAGS_UNINIT;
+
+ err = ext2fs_extent_replace(handle, 0, &extent);
+ DUMP_EXTENT(ff, "shortenn", startoff, err, &extent);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = ext2fs_extent_fix_parents(handle);
+ DUMP_EXTENT(ff, "fixshortenn", startoff, err, &extent);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ /* ...and create a new unwritten mapping at stopoff. */
+ newex.e_pblk += extent.e_len;
+ newex.e_lblk += extent.e_len;
+ newex.e_len -= extent.e_len;
+ newex.e_flags |= EXT2_EXTENT_FLAGS_UNINIT;
+
+ err = ext2fs_extent_insert(handle,
+ EXT2_EXTENT_INSERT_AFTER,
+ &newex);
+ DUMP_EXTENT(ff, "insertn", startoff, err, &newex);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = ext2fs_extent_fix_parents(handle);
+ DUMP_EXTENT(ff, "fixinsertn", startoff, err, &newex);
+ if (err)
+ return translate_error(fs, ino, err);
+ }
+
+ /* Still unwritten? Update the state. */
+ if (extent.e_flags & EXT2_EXTENT_FLAGS_UNINIT) {
+ extent.e_flags &= ~EXT2_EXTENT_FLAGS_UNINIT;
+
+ err = ext2fs_extent_replace(handle, 0, &extent);
+ DUMP_EXTENT(ff, "replacex", startoff, err, &extent);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = ext2fs_extent_fix_parents(handle);
+ DUMP_EXTENT(ff, "fixreplacex", startoff, err, &extent);
+ if (err)
+ return translate_error(fs, ino, err);
+ }
+
+next:
+ /* Try to merge with the previous extent */
+ if (startoff > 0) {
+ int ret = fuse4fs_try_merge_mappings(ff, ino, handle, startoff);
+ if (ret)
+ return ret;
+ }
+
+ *cursor = extent.e_lblk + extent.e_len;
+ return 0;
+}
+
+static int fuse4fs_convert_unwritten_mappings(struct fuse4fs *ff,
+ ext2_ino_t ino,
+ struct ext2_inode_large *inode,
+ off_t pos, size_t written)
+{
+ ext2_extent_handle_t handle;
+ ext2_filsys fs = ff->fs;
+ blk64_t startoff = FUSE4FS_B_TO_FSBT(ff, pos);
+ const blk64_t stopoff = FUSE4FS_B_TO_FSB(ff, pos + written);
+ errcode_t err;
+ int ret;
+
+ err = ext2fs_extent_open2(fs, ino, EXT2_INODE(inode), &handle);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ /* Walk every mapping in the range, converting them. */
+ while (startoff < stopoff) {
+ blk64_t old_startoff = startoff;
+
+ ret = fuse4fs_convert_unwritten_mapping(ff, ino, inode, handle,
+ &startoff, stopoff);
+ if (ret)
+ goto out_handle;
+ if (startoff <= old_startoff) {
+ /* Do not go backwards. */
+ ret = translate_error(fs, ino, EXT2_ET_INODE_CORRUPTED);
+ goto out_handle;
+ }
+ }
+
+ /* Try to merge the right edge */
+ ret = fuse4fs_try_merge_mappings(ff, ino, handle, stopoff);
+out_handle:
+ ext2fs_extent_free(handle);
+ return ret;
+}
+
+static void op_iomap_ioend(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
+ off_t pos, size_t written, uint32_t ioendflags,
+ int error, uint64_t new_addr)
+{
+ struct fuse4fs *ff = fuse4fs_get(req);
+ struct ext2_inode_large inode;
+ ext2_filsys fs;
+ ext2_ino_t ino;
+ errcode_t err;
+ bool dirty = false;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &ino, fino);
+
+ dbg_printf(ff,
+ "%s: ino=%d pos=0x%llx written=0x%zx ioendflags=0x%x error=%d new_addr=0x%llx\n",
+ __func__, ino,
+ (unsigned long long)pos,
+ written,
+ ioendflags,
+ error,
+ (unsigned long long)new_addr);
+
+ if (error) {
+ fuse_reply_err(req, -error);
+ return;
+ }
+
+ fs = fuse4fs_start(ff);
+
+ /* should never see these ioend types */
+ if (ioendflags & FUSE_IOMAP_IOEND_SHARED) {
+ ret = translate_error(fs, ino, EXT2_ET_FILESYSTEM_CORRUPTED);
+ goto out_unlock;
+ }
+
+ err = fuse4fs_read_inode(fs, ino, &inode);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out_unlock;
+ }
+
+ if (ioendflags & FUSE_IOMAP_IOEND_UNWRITTEN) {
+ /* unwritten extents are only supported on extents files */
+ if (!(inode.i_flags & EXT4_EXTENTS_FL)) {
+ ret = translate_error(fs, ino,
+ EXT2_ET_FILESYSTEM_CORRUPTED);
+ goto out_unlock;
+ }
+
+ ret = fuse4fs_convert_unwritten_mappings(ff, ino, &inode,
+ pos, written);
+ if (ret)
+ goto out_unlock;
+
+ dirty = true;
+ }
+
+ if (ioendflags & FUSE_IOMAP_IOEND_APPEND) {
+ ext2_off64_t isize = EXT2_I_SIZE(&inode);
+
+ if (pos + written > isize) {
+ err = ext2fs_inode_size_set(fs, EXT2_INODE(&inode),
+ pos + written);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out_unlock;
+ }
+
+ dirty = true;
+ }
+ }
+
+ if (dirty) {
+ err = fuse4fs_write_inode(fs, ino, &inode);
+ if (err) {
+ ret = translate_error(fs, ino, err);
+ goto out_unlock;
+ }
+ }
+
+out_unlock:
+ fuse4fs_finish(ff, ret);
+ fuse_reply_err(req, -ret);
+}
#endif /* HAVE_FUSE_IOMAP */
static struct fuse_lowlevel_ops fs_ops = {
@@ -6108,6 +6574,7 @@ static struct fuse_lowlevel_ops fs_ops = {
.iomap_begin = op_iomap_begin,
.iomap_end = op_iomap_end,
.iomap_config = op_iomap_config,
+ .iomap_ioend = op_iomap_ioend,
#endif /* HAVE_FUSE_IOMAP */
};
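Editorial sketch, not part of the patch: the core of
fuse4fs_convert_unwritten_mapping() is splitting one unwritten mapping
into at most three pieces around the written range [startoff, stopoff).
The standalone arithmetic below tries to illustrate only that step.  The
names struct sk_extent, sk_split_unwritten() and SK_UNINIT are invented
for illustration and are not libext2fs API; the sketch performs no
extent-tree updates and assumes the caller already knows that the
mapping overlaps the range.

```c
#include <assert.h>
#include <stdint.h>

#define SK_UNINIT 0x1u	/* hypothetical stand-in for the unwritten flag */

struct sk_extent {
	uint64_t lblk;	/* first logical block */
	uint64_t pblk;	/* first physical block */
	uint64_t len;	/* length in blocks */
	uint32_t flags;
};

/*
 * Split the unwritten extent @ex around [startoff, stopoff) and store the
 * resulting pieces in @out, which must have room for three entries.
 * Returns the number of pieces.  @ex must overlap the range.
 */
static int sk_split_unwritten(const struct sk_extent *ex, uint64_t startoff,
			      uint64_t stopoff, struct sk_extent out[3])
{
	uint64_t end = ex->lblk + ex->len;
	uint64_t cstart = startoff > ex->lblk ? startoff : ex->lblk;
	uint64_t cend = stopoff < end ? stopoff : end;
	int n = 0;

	assert(cstart < cend);

	if (cstart > ex->lblk) {
		/* Left piece lies before the written range; stays unwritten. */
		out[n] = *ex;
		out[n].len = cstart - ex->lblk;
		n++;
	}

	/* Middle piece covers the written range; clear the unwritten flag. */
	out[n].lblk = cstart;
	out[n].pblk = ex->pblk + (cstart - ex->lblk);
	out[n].len = cend - cstart;
	out[n].flags = ex->flags & ~SK_UNINIT;
	n++;

	if (cend < end) {
		/* Right piece lies after the written range; stays unwritten. */
		out[n].lblk = cend;
		out[n].pblk = ex->pblk + (cend - ex->lblk);
		out[n].len = end - cend;
		out[n].flags = ex->flags | SK_UNINIT;
		n++;
	}

	return n;
}
```

For an unwritten mapping of 20 blocks at logical block 10 and a write
covering blocks 15 through 24, this yields three pieces: 5 unwritten
blocks, 10 written blocks, and 5 unwritten blocks.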
* [PATCH 08/19] fuse2fs: turn on iomap for pagecache IO
From: Darrick J. Wong @ 2025-08-21 1:17 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Turn on iomap for pagecache IO to regular files.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
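Editorial sketch, not part of the patch: the op_iomap_end change below
only pushes the file size forward for buffered writes that extended the
file.  The decision can be written as a small predicate, as a rough
illustration.  The SK_* bit values are placeholders invented here and do
not match the real FUSE_IOMAP_* definitions in libfuse; sk_should_extend()
and sk_new_size() are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

/* Placeholder bit values; the real ones come from the libfuse headers. */
#define SK_OP_WRITE		(1u << 0)
#define SK_OP_DIRECT		(1u << 1)
#define SK_F_SIZE_CHANGED	(1u << 0)

/* Should iomap_end update the inode size after this operation? */
static bool sk_should_extend(uint32_t opflags, uint32_t iomap_flags,
			     ssize_t written)
{
	return (opflags & SK_OP_WRITE) &&
	       !(opflags & SK_OP_DIRECT) &&
	       (iomap_flags & SK_F_SIZE_CHANGED) &&
	       written > 0;
}

/* The size only ever grows; a write inside EOF leaves it alone. */
static uint64_t sk_new_size(uint64_t isize, uint64_t pos, ssize_t written)
{
	uint64_t end;

	assert(written > 0);
	end = pos + (uint64_t)written;
	return end > isize ? end : isize;
}
```

Direct writes are excluded here because, in this series, their size
updates arrive through the ioend path instead.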
misc/fuse2fs.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++------
misc/fuse4fs.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 108 insertions(+), 14 deletions(-)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index e8e9056a661e71..895addcbc59e04 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5427,9 +5427,6 @@ static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
uint64_t count, uint32_t opflags,
struct fuse_file_iomap *read)
{
- if (!(opflags & FUSE_IOMAP_OP_DIRECT))
- return -ENOSYS;
-
/* fall back to slow path for inline data reads */
if (inode->i_flags & EXT4_INLINE_DATA_FL)
return -ENOSYS;
@@ -5517,9 +5514,6 @@ static int fuse2fs_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
off_t max_size = fuse2fs_max_file_size(ff, inode);
int ret;
- if (!(opflags & FUSE_IOMAP_OP_DIRECT))
- return -ENOSYS;
-
if (pos >= max_size)
return -EFBIG;
@@ -5611,11 +5605,50 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
return ret;
}
+static int fuse2fs_iomap_append_setsize(struct fuse2fs *ff, ext2_ino_t ino,
+ loff_t newsize)
+{
+ ext2_filsys fs = ff->fs;
+ struct ext2_inode_large inode;
+ ext2_off64_t isize;
+ errcode_t err;
+
+ dbg_printf(ff, "%s: ino=%u newsize=%llu\n", __func__, ino,
+ (unsigned long long)newsize);
+
+ err = fuse2fs_read_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ isize = EXT2_I_SIZE(&inode);
+ if (newsize <= isize)
+ return 0;
+
+ dbg_printf(ff, "%s: ino=%u oldsize=%llu newsize=%llu\n", __func__, ino,
+ (unsigned long long)isize,
+ (unsigned long long)newsize);
+
+ /*
+ * XXX cheesily update the ondisk size even though we only want to do
+ * the incore size until writeback happens
+ */
+ err = ext2fs_inode_size_set(fs, EXT2_INODE(&inode), newsize);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = fuse2fs_write_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ return 0;
+}
+
static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
off_t pos, uint64_t count, uint32_t opflags,
ssize_t written, const struct fuse_file_iomap *iomap)
{
struct fuse2fs *ff = fuse2fs_get();
+ int ret = 0;
FUSE2FS_CHECK_CONTEXT(ff);
@@ -5630,7 +5663,21 @@ static int op_iomap_end(const char *path, uint64_t nodeid, uint64_t attr_ino,
written,
iomap->flags);
- return 0;
+ fuse2fs_start(ff);
+
+ /* XXX is this really necessary? */
+ if ((opflags & FUSE_IOMAP_OP_WRITE) &&
+ !(opflags & FUSE_IOMAP_OP_DIRECT) &&
+ (iomap->flags & FUSE_IOMAP_F_SIZE_CHANGED) &&
+ written > 0) {
+ ret = fuse2fs_iomap_append_setsize(ff, attr_ino, pos + written);
+ if (ret)
+ goto out_unlock;
+ }
+
+out_unlock:
+ fuse2fs_finish(ff, ret);
+ return ret;
}
/*
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index ff50182b929974..2373c5a371e2b0 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -5835,9 +5835,6 @@ static int fuse4fs_iomap_begin_read(struct fuse4fs *ff, ext2_ino_t ino,
uint64_t count, uint32_t opflags,
struct fuse_file_iomap *read)
{
- if (!(opflags & FUSE_IOMAP_OP_DIRECT))
- return -ENOSYS;
-
/* fall back to slow path for inline data reads */
if (inode->i_flags & EXT4_INLINE_DATA_FL)
return -ENOSYS;
@@ -5928,9 +5925,6 @@ static int fuse4fs_iomap_begin_write(struct fuse4fs *ff, ext2_ino_t ino,
off_t max_size = fuse4fs_max_file_size(ff, inode);
int ret;
- if (!(opflags & FUSE_IOMAP_OP_DIRECT))
- return -ENOSYS;
-
if (pos >= max_size)
return -EFBIG;
@@ -6023,12 +6017,51 @@ static void op_iomap_begin(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
fuse_reply_iomap_begin(req, &read, NULL);
}
+static int fuse4fs_iomap_append_setsize(struct fuse4fs *ff, ext2_ino_t ino,
+ loff_t newsize)
+{
+ ext2_filsys fs = ff->fs;
+ struct ext2_inode_large inode;
+ ext2_off64_t isize;
+ errcode_t err;
+
+ dbg_printf(ff, "%s: ino=%u newsize=%llu\n", __func__, ino,
+ (unsigned long long)newsize);
+
+ err = fuse4fs_read_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ isize = EXT2_I_SIZE(&inode);
+ if (newsize <= isize)
+ return 0;
+
+ dbg_printf(ff, "%s: ino=%u oldsize=%llu newsize=%llu\n", __func__, ino,
+ (unsigned long long)isize,
+ (unsigned long long)newsize);
+
+ /*
+ * XXX cheesily update the ondisk size even though we only want to do
+ * the incore size until writeback happens
+ */
+ err = ext2fs_inode_size_set(fs, EXT2_INODE(&inode), newsize);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ err = fuse4fs_write_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ return 0;
+}
+
static void op_iomap_end(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
off_t pos, uint64_t count, uint32_t opflags,
ssize_t written, const struct fuse_file_iomap *iomap)
{
struct fuse4fs *ff = fuse4fs_get(req);
ext2_ino_t ino;
+ int ret = 0;
FUSE4FS_CHECK_CONTEXT(req);
FUSE4FS_CONVERT_FINO(req, &ino, fino);
@@ -6042,7 +6075,21 @@ static void op_iomap_end(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
written,
iomap->flags);
- fuse_reply_err(req, 0);
+ fuse4fs_start(ff);
+
+ /* XXX is this really necessary? */
+ if ((opflags & FUSE_IOMAP_OP_WRITE) &&
+ !(opflags & FUSE_IOMAP_OP_DIRECT) &&
+ (iomap->flags & FUSE_IOMAP_F_SIZE_CHANGED) &&
+ written > 0) {
+ ret = fuse4fs_iomap_append_setsize(ff, ino, pos + written);
+ if (ret)
+ goto out_unlock;
+ }
+
+out_unlock:
+ fuse4fs_finish(ff, ret);
+ fuse_reply_err(req, -ret);
}
/*
* [PATCH 09/19] fuse2fs: don't zero bytes in punch hole
From: Darrick J. Wong @ 2025-08-21 1:18 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
When iomap is in use for the pagecache, it will take care of zeroing the
unaligned parts of punched out regions so we don't have to do it
ourselves.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
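Editorial sketch, not part of the patch: the "unaligned parts of punched
out regions" are the partial blocks at the head and tail of the punched
byte range, which have to be zeroed because a partial block cannot be
deallocated.  The standalone helper below shows one way to compute those
two byte ranges.  struct sk_range and sk_punch_edges() are names made up
for this illustration; they are not fuse2fs or kernel functions.

```c
#include <assert.h>
#include <stdint.h>

struct sk_range {
	uint64_t start;	/* byte offset */
	uint64_t len;	/* length in bytes, 0 if nothing to zero */
};

/*
 * For a punch of [offset, offset + len), compute the partial-block byte
 * ranges that must be zeroed.  If the whole punch lies inside a single
 * block, it is reported in @head and @tail is empty.
 */
static void sk_punch_edges(uint64_t offset, uint64_t len, uint32_t blocksize,
			   struct sk_range *head, struct sk_range *tail)
{
	uint64_t end = offset + len;
	uint64_t first_full, last_full;

	assert(blocksize > 0);

	/* Start of the first whole block at or after offset (round up). */
	first_full = (offset + blocksize - 1) / blocksize * blocksize;
	/* End of the last whole block at or before end (round down). */
	last_full = end / blocksize * blocksize;

	tail->start = 0;
	tail->len = 0;
	head->start = offset;

	if (first_full > last_full) {
		/* No block boundary inside the range: zero all of it. */
		head->len = len;
		return;
	}

	head->len = first_full - offset;
	tail->start = last_full;
	tail->len = end - last_full;
}
```

With 4096-byte blocks, punching 8192 bytes at offset 100 leaves a
3996-byte head to zero at offset 100 and a 100-byte tail at offset 8192;
only the block in between can be freed outright.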
misc/fuse2fs.c | 9 +++++++++
misc/fuse4fs.c | 8 ++++++++
2 files changed, 17 insertions(+)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 895addcbc59e04..dcf002f380b843 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -576,6 +576,7 @@ static inline int fuse2fs_iomap_enabled(const struct fuse2fs *ff)
}
#else
# define fuse2fs_iomap_enabled(...) (0)
+# define fuse2fs_iomap_enabled(...) (0)
#endif
static inline void fuse2fs_dump_extents(struct fuse2fs *ff, ext2_ino_t ino,
@@ -4857,6 +4858,10 @@ static errcode_t clean_block_middle(struct fuse2fs *ff, ext2_ino_t ino,
int retflags;
errcode_t err;
+ /* the kernel does this for us in iomap mode */
+ if (fuse2fs_iomap_enabled(ff))
+ return 0;
+
if (!*buf) {
err = ext2fs_get_mem(fs->blocksize, buf);
if (err)
@@ -4893,6 +4898,10 @@ static errcode_t clean_block_edge(struct fuse2fs *ff, ext2_ino_t ino,
off_t residue;
errcode_t err;
+ /* the kernel does this for us in iomap mode */
+ if (fuse2fs_iomap_enabled(ff))
+ return 0;
+
residue = FUSE2FS_OFF_IN_FSB(ff, offset);
if (residue == 0)
return 0;
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 2373c5a371e2b0..3082c23e398adf 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -5265,6 +5265,10 @@ static errcode_t fuse4fs_zero_middle(struct fuse4fs *ff, ext2_ino_t ino,
int retflags;
errcode_t err;
+ /* the kernel does this for us in iomap mode */
+ if (fuse4fs_iomap_enabled(ff))
+ return 0;
+
if (!*buf) {
err = ext2fs_get_mem(fs->blocksize, buf);
if (err)
@@ -5301,6 +5305,10 @@ static errcode_t fuse4fs_zero_edge(struct fuse4fs *ff, ext2_ino_t ino,
off_t residue;
errcode_t err;
+ /* the kernel does this for us in iomap mode */
+ if (fuse4fs_iomap_enabled(ff))
+ return 0;
+
residue = FUSE4FS_OFF_IN_FSB(ff, offset);
if (residue == 0)
return 0;
* [PATCH 10/19] fuse2fs: don't do file data block IO when iomap is enabled
From: Darrick J. Wong @ 2025-08-21 1:18 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
When iomap is in use for the page cache, the kernel will take care of
all the file data block IO for us, including zeroing of punched ranges
and post-EOF bytes. fuse2fs only needs to do IO for inline data.
Therefore, set the NOBLOCKIO ext2_file flag so that libext2fs will not
do any regular file IO to or from disk blocks at all.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
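Editorial sketch, not part of the patch: fuse2fs_file_uses_iomap() below
uses a three-way return convention (0 for no, 1 for yes, negative errno
on failure), and both callers turn that into open flags with the same
switch statement.  The pattern can be isolated roughly as follows.  The
SK_FILE_* values and sk_open_flags() are hypothetical; the real flag
values live in the libext2fs headers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag values, for illustration only. */
#define SK_FILE_WRITE		0x1
#define SK_FILE_NOBLOCKIO	0x2

/*
 * @uses_iomap: 0 = no, 1 = yes, negative errno = the check itself failed.
 * On success, returns 0 and stores the open flags in @flags; otherwise
 * passes the negative errno through to the caller.
 */
static int sk_open_flags(int uses_iomap, int *flags)
{
	assert(flags != NULL);

	*flags = SK_FILE_WRITE;
	switch (uses_iomap) {
	case 0:
		return 0;
	case 1:
		/* The kernel does the block IO; the library must not. */
		*flags |= SK_FILE_NOBLOCKIO;
		return 0;
	default:
		return uses_iomap;
	}
}
```

Resetting the "yes" answer to 0 before returning matters: without it a
positive 1 would leak out as if it were an error code.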
misc/fuse2fs.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
misc/fuse4fs.c | 11 ++++++++-
2 files changed, 81 insertions(+), 2 deletions(-)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index dcf002f380b843..588b0053f43c95 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -3158,15 +3158,72 @@ static int fuse2fs_punch_posteof(struct fuse2fs *ff, ext2_ino_t ino,
return 0;
}
+/*
+ * Decide if file IO for this inode can use iomap.
+ *
+ * It turns out that libfuse creates internal node ids that have nothing to do
+ * with the ext2_ino_t that we give it. These internal node ids are what
+ * actually gets igetted in the kernel, which means that there can be multiple
+ * fuse_inode objects in the kernel for a single hardlinked ondisk ext2 inode.
+ *
+ * What this means, horrifyingly, is that on a fuse filesystem that supports
+ * hard links, the in-kernel i_rwsem does not protect against concurrent writes
+ * between files that point to the same inode. That in turn means that the
+ * file mode and size can get desynchronized between the multiple fuse_inode
+ * objects. This also means that we cannot cache iomaps in the kernel AT ALL
+ * because the caches will get out of sync, leading to WARN_ONs from the iomap
+ * zeroing code and probably data corruption after that.
+ *
+ * Therefore, libfuse won't let us create hardlinks of iomap files, and we must
+ * never turn on iomap for existing hardlinked files. Long term it means we
+ * have to find a way around this loss of functionality. fuse4fs gets around
+ * this by being a low level fuse driver and controlling the nodeids itself.
+ *
+ * Returns 0 for no, 1 for yes, or a negative errno.
+ */
+#ifdef HAVE_FUSE_IOMAP
+static int fuse2fs_file_uses_iomap(struct fuse2fs *ff, ext2_ino_t ino)
+{
+ struct stat statbuf;
+ int ret;
+
+ if (!fuse2fs_iomap_enabled(ff))
+ return 0;
+
+ ret = stat_inode(ff->fs, ino, &statbuf);
+ if (ret)
+ return ret;
+
+ /* the kernel handles all block IO for us in iomap mode */
+ return fuse_fs_can_enable_iomap(&statbuf);
+}
+#else
+# define fuse2fs_file_uses_iomap(...) (0)
+#endif
+
static int fuse2fs_truncate(struct fuse2fs *ff, ext2_ino_t ino, off_t new_size)
{
ext2_filsys fs = ff->fs;
ext2_file_t file;
__u64 old_isize;
errcode_t err;
+ int flags = EXT2_FILE_WRITE;
int ret = 0;
- err = ext2fs_file_open(fs, ino, EXT2_FILE_WRITE, &file);
+ /* the kernel handles all eof zeroing for us in iomap mode */
+ ret = fuse2fs_file_uses_iomap(ff, ino);
+ switch (ret) {
+ case 0:
+ break;
+ case 1:
+ flags |= EXT2_FILE_NOBLOCKIO;
+ ret = 0;
+ break;
+ default:
+ return ret;
+ }
+
+ err = ext2fs_file_open(fs, ino, flags, &file);
if (err)
return translate_error(fs, ino, err);
@@ -3324,6 +3381,19 @@ static int __op_open(struct fuse2fs *ff, const char *path,
goto out;
}
+ /* the kernel handles all block IO for us in iomap mode */
+ ret = fuse2fs_file_uses_iomap(ff, file->ino);
+ switch (ret) {
+ case 0:
+ break;
+ case 1:
+ file->open_flags |= EXT2_FILE_NOBLOCKIO;
+ ret = 0;
+ break;
+ default:
+ goto out;
+ }
+
if (fp->flags & O_TRUNC) {
ret = fuse2fs_truncate(ff, file->ino, 0);
if (ret)
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 3082c23e398adf..e08c5af5abfd27 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -3375,9 +3375,14 @@ static int fuse4fs_truncate(struct fuse4fs *ff, ext2_ino_t ino, off_t new_size)
ext2_file_t file;
__u64 old_isize;
errcode_t err;
+ int flags = EXT2_FILE_WRITE;
int ret = 0;
- err = ext2fs_file_open(fs, ino, EXT2_FILE_WRITE, &file);
+ /* the kernel handles all eof zeroing for us in iomap mode */
+ if (fuse4fs_iomap_enabled(ff))
+ flags |= EXT2_FILE_NOBLOCKIO;
+
+ err = ext2fs_file_open(fs, ino, flags, &file);
if (err)
return translate_error(fs, ino, err);
@@ -3472,6 +3477,10 @@ static int fuse4fs_open_file(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
if (linked)
check |= L_OK;
+ /* the kernel handles all block IO for us in iomap mode */
+ if (fuse4fs_iomap_enabled(ff))
+ file->open_flags |= EXT2_FILE_NOBLOCKIO;
+
/*
* If the caller wants to truncate the file, we need to ask for full
* write access even if the caller claims to be appending.
* [PATCH 11/19] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely
From: Darrick J. Wong @ 2025-08-21 1:18 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Since fuse in iomap mode guarantees that op_destroy will be called
before umount returns, we don't need to use fuseblk mode to get that
guarantee. Disable fuseblk mode, which saves us the trouble of closing
and reopening the device.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
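Editorial sketch, not part of the patch: the mount-time decision added
below is "probe for iomap first, and only fall back to fuseblk if the
probe fails".  A reduced version of that logic might look like this.
The capability bit value, the enum, and both helper names are
placeholders invented for this note; in particular, the is_blockdev
test is an assumption about what the real fuse2fs_want_fuseblk()
checks, since only its noblkdev test is visible in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bit; the real value comes from libfuse. */
#define SK_IOMAP_SUPPORT_FILEIO	(1ull << 0)

enum sk_toggle { SK_DEFAULT, SK_DISABLE };

/* Did the user allow iomap, and does the kernel offer file IO support? */
static bool sk_iomap_detected(enum sk_toggle want, uint64_t kernel_caps)
{
	if (want == SK_DISABLE)
		return false;
	return (kernel_caps & SK_IOMAP_SUPPORT_FILEIO) != 0;
}

/* fuseblk is only worth the close/reopen dance when iomap is absent. */
static bool sk_use_fuseblk(bool iomap_detected, bool is_blockdev,
			   bool noblkdev)
{
	if (iomap_detected)
		return false;
	return is_blockdev && !noblkdev;
}
```

Comparing the masked value against zero keeps the result well defined
even if a capability bit above bit 31 is ever tested.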
misc/fuse2fs.c | 20 +++++++++++++++++++-
misc/fuse4fs.c | 20 +++++++++++++++++++-
2 files changed, 38 insertions(+), 2 deletions(-)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 588b0053f43c95..97b010b8dc1055 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -275,6 +275,7 @@ struct fuse2fs {
enum fuse2fs_feature_toggle iomap_want;
enum fuse2fs_iomap_state iomap_state;
uint32_t iomap_dev;
+ uint64_t iomap_cap;
#endif
unsigned int blockmask;
unsigned long offset;
@@ -1056,6 +1057,8 @@ static errcode_t fuse2fs_open(struct fuse2fs *ff, int libext2_flags)
if (ff->directio)
flags |= EXT2_FLAG_DIRECT_IO;
+ dbg_printf(ff, "opening with flags=0x%x\n", flags);
+
err = ext2fs_open2(ff->device, options, flags, 0, 0, unix_io_manager,
&ff->fs);
if (err == EPERM) {
@@ -6527,6 +6530,19 @@ static unsigned long long default_cache_size(void)
return ret;
}
+#ifdef HAVE_FUSE_IOMAP
+static inline bool fuse2fs_discover_iomap(struct fuse2fs *ff)
+{
+ if (ff->iomap_want == FT_DISABLE)
+ return false;
+
+ ff->iomap_cap = fuse_lowlevel_discover_iomap();
+ return ff->iomap_cap & FUSE_IOMAP_SUPPORT_FILEIO;
+}
+#else
+# define fuse2fs_discover_iomap(...) (false)
+#endif
+
static inline bool fuse2fs_want_fuseblk(const struct fuse2fs *ff)
{
if (ff->noblkdev)
@@ -6567,6 +6583,7 @@ int main(int argc, char *argv[])
errcode_t err;
FILE *orig_stderr = stderr;
char extra_args[BUFSIZ];
+ bool iomap_detected = false;
int ret;
ret = fuse_opt_parse(&args, &fctx, fuse2fs_opts, fuse2fs_opt_proc);
@@ -6637,7 +6654,8 @@ int main(int argc, char *argv[])
goto out;
}
- if (fuse2fs_want_fuseblk(&fctx)) {
+ iomap_detected = fuse2fs_discover_iomap(&fctx);
+ if (!iomap_detected && fuse2fs_want_fuseblk(&fctx)) {
/*
* If this is a block device, we want to close the fs, reopen
* the block device in non-exclusive mode, and start the fuse
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index e08c5af5abfd27..3bb6140b35570e 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -270,6 +270,7 @@ struct fuse4fs {
enum fuse4fs_feature_toggle iomap_want;
enum fuse4fs_iomap_state iomap_state;
uint32_t iomap_dev;
+ uint64_t iomap_cap;
#endif
unsigned int blockmask;
unsigned long offset;
@@ -1233,6 +1234,8 @@ static errcode_t fuse4fs_open(struct fuse4fs *ff, int libext2_flags)
if (ff->directio)
flags |= EXT2_FLAG_DIRECT_IO;
+ dbg_printf(ff, "opening with flags=0x%x\n", flags);
+
err = ext2fs_open2(ff->device, options, flags, 0, 0, unix_io_manager,
&ff->fs);
if (err == EPERM) {
@@ -6862,6 +6865,19 @@ static unsigned long long default_cache_size(void)
return ret;
}
+#ifdef HAVE_FUSE_IOMAP
+static inline bool fuse4fs_discover_iomap(struct fuse4fs *ff)
+{
+ if (ff->iomap_want == FT_DISABLE)
+ return false;
+
+ ff->iomap_cap = fuse_lowlevel_discover_iomap();
+ return ff->iomap_cap & FUSE_IOMAP_SUPPORT_FILEIO;
+}
+#else
+# define fuse4fs_discover_iomap(...) (false)
+#endif
+
static inline bool fuse4fs_want_fuseblk(const struct fuse4fs *ff)
{
if (ff->noblkdev)
@@ -7002,6 +7018,7 @@ int main(int argc, char *argv[])
errcode_t err;
FILE *orig_stderr = stderr;
char extra_args[BUFSIZ];
+ bool iomap_detected = false;
int ret;
ret = fuse_opt_parse(&args, &fctx, fuse4fs_opts, fuse4fs_opt_proc);
@@ -7072,7 +7089,8 @@ int main(int argc, char *argv[])
goto out;
}
- if (fuse4fs_want_fuseblk(&fctx)) {
+ iomap_detected = fuse4fs_discover_iomap(&fctx);
+ if (!iomap_detected && fuse4fs_want_fuseblk(&fctx)) {
/*
* If this is a block device, we want to close the fs, reopen
* the block device in non-exclusive mode, and start the fuse
* [PATCH 12/19] fuse2fs: enable file IO to inline data files
From: Darrick J. Wong @ 2025-08-21 1:18 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Enable file reads from and writes to inline data files.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse2fs.c | 42 ++++++++++++++++++++++++++++++++++++++++--
misc/fuse4fs.c | 3 ++-
2 files changed, 42 insertions(+), 3 deletions(-)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 97b010b8dc1055..fc83d2d21c600b 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1472,7 +1472,16 @@ static void *op_init(struct fuse_conn_info *conn
cfg->use_ino = 1;
if (ff->debug)
cfg->debug = 1;
- cfg->nullpath_ok = 1;
+
+ /*
+ * Inline data file io depends on op_read/write being fed a path, so we
+ * have to slow everyone down to look up the path from the nodeid.
+ */
+ if (fuse2fs_iomap_enabled(ff) &&
+ ext2fs_has_feature_inline_data(ff->fs->super))
+ cfg->nullpath_ok = 0;
+ else
+ cfg->nullpath_ok = 1;
#endif
if (ff->kernel) {
@@ -3427,6 +3436,9 @@ static int op_read(const char *path EXT2FS_ATTR((unused)), char *buf,
size_t len, off_t offset,
struct fuse_file_info *fp)
{
+ struct fuse2fs_file_handle fhurk = {
+ .magic = FUSE2FS_FILE_MAGIC,
+ };
struct fuse2fs *ff = fuse2fs_get();
struct fuse2fs_file_handle *fh = fuse2fs_get_handle(fp);
ext2_filsys fs;
@@ -3436,10 +3448,21 @@ static int op_read(const char *path EXT2FS_ATTR((unused)), char *buf,
int ret = 0;
FUSE2FS_CHECK_CONTEXT(ff);
+
+ if (!fh)
+ fh = &fhurk;
+
FUSE2FS_CHECK_HANDLE(ff, fh);
dbg_printf(ff, "%s: ino=%d off=0x%llx len=0x%zx\n", __func__, fh->ino,
(unsigned long long)offset, len);
fs = fuse2fs_start(ff);
+
+ if (fh == &fhurk) {
+ ret = fuse2fs_file_ino(ff, path, NULL, &fhurk.ino);
+ if (ret)
+ goto out;
+ }
+
err = ext2fs_file_open(fs, fh->ino, fh->open_flags, &efp);
if (err) {
ret = translate_error(fs, fh->ino, err);
@@ -3481,6 +3504,10 @@ static int op_write(const char *path EXT2FS_ATTR((unused)),
const char *buf, size_t len, off_t offset,
struct fuse_file_info *fp)
{
+ struct fuse2fs_file_handle fhurk = {
+ .magic = FUSE2FS_FILE_MAGIC,
+ .open_flags = EXT2_FILE_WRITE,
+ };
struct fuse2fs *ff = fuse2fs_get();
struct fuse2fs_file_handle *fh = fuse2fs_get_handle(fp);
ext2_filsys fs;
@@ -3490,6 +3517,10 @@ static int op_write(const char *path EXT2FS_ATTR((unused)),
int ret = 0;
FUSE2FS_CHECK_CONTEXT(ff);
+
+ if (!fh)
+ fh = &fhurk;
+
FUSE2FS_CHECK_HANDLE(ff, fh);
dbg_printf(ff, "%s: ino=%d off=0x%llx len=0x%zx\n", __func__, fh->ino,
(unsigned long long) offset, len);
@@ -3504,6 +3535,12 @@ static int op_write(const char *path EXT2FS_ATTR((unused)),
goto out;
}
+ if (fh == &fhurk) {
+ ret = fuse2fs_file_ino(ff, path, NULL, &fhurk.ino);
+ if (ret)
+ goto out;
+ }
+
err = ext2fs_file_open(fs, fh->ino, fh->open_flags, &efp);
if (err) {
ret = translate_error(fs, fh->ino, err);
@@ -5511,7 +5548,8 @@ static int fuse2fs_iomap_begin_read(struct fuse2fs *ff, ext2_ino_t ino,
{
/* fall back to slow path for inline data reads */
if (inode->i_flags & EXT4_INLINE_DATA_FL)
- return -ENOSYS;
+ return fuse2fs_iomap_begin_inline(ff, ino, inode, pos, count,
+ read);
if (inode->i_flags & EXT4_EXTENTS_FL)
return fuse2fs_iomap_begin_extent(ff, ino, inode, pos, count,
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 3bb6140b35570e..6de9f69d05de0b 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -5857,7 +5857,8 @@ static int fuse4fs_iomap_begin_read(struct fuse4fs *ff, ext2_ino_t ino,
{
/* fall back to slow path for inline data reads */
if (inode->i_flags & EXT4_INLINE_DATA_FL)
- return -ENOSYS;
+ return fuse4fs_iomap_begin_inline(ff, ino, inode, pos, count,
+ read);
if (inode->i_flags & EXT4_EXTENTS_FL)
return fuse4fs_iomap_begin_extent(ff, ino, inode, pos, count,
* [PATCH 13/19] fuse2fs: set iomap-related inode flags
From: Darrick J. Wong @ 2025-08-21 1:19 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Set FUSE_IFLAG_* when we do a getattr, so that all files will have iomap
enabled.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse2fs.c | 20 ++++++++++++++++++++
misc/fuse4fs.c | 46 +++++++++++++++++++++++++++++++++++-----------
2 files changed, 55 insertions(+), 11 deletions(-)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index fc83d2d21c600b..291416afb93d6c 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1620,6 +1620,23 @@ static int op_getattr(const char *path, struct stat *statbuf
return ret;
}
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 99)
+static int op_getattr_iflags(const char *path, struct stat *statbuf,
+ unsigned int *iflags, struct fuse_file_info *fi)
+{
+ int ret = op_getattr(path, statbuf, fi);
+
+ if (ret)
+ return ret;
+
+ if (fuse_fs_can_enable_iomap(statbuf))
+ *iflags |= FUSE_IFLAG_IOMAP;
+
+ return 0;
+}
+#endif
+
+
static int op_readlink(const char *path, char *buf, size_t len)
{
struct fuse2fs *ff = fuse2fs_get();
@@ -6339,6 +6356,9 @@ static struct fuse_operations fs_ops = {
.fallocate = op_fallocate,
# endif
#endif
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 99)
+ .getattr_iflags = op_getattr_iflags,
+#endif
#ifdef HAVE_FUSE_IOMAP
.iomap_begin = op_iomap_begin,
.iomap_end = op_iomap_end,
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 6de9f69d05de0b..37a7ab3a3718e4 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -1659,6 +1659,7 @@ static void op_init(void *userdata, struct fuse_conn_info *conn)
struct fuse4fs_stat {
struct fuse_entry_param entry;
+ unsigned int iflags;
};
static int fuse4fs_stat_inode(struct fuse4fs *ff, ext2_ino_t ino,
@@ -1724,9 +1725,29 @@ static int fuse4fs_stat_inode(struct fuse4fs *ff, ext2_ino_t ino,
entry->attr_timeout = FUSE4FS_ATTR_TIMEOUT;
entry->entry_timeout = FUSE4FS_ATTR_TIMEOUT;
+ fstat->iflags = 0;
+#ifdef HAVE_FUSE_IOMAP
+ if (fuse4fs_iomap_enabled(ff))
+ fstat->iflags |= FUSE_IFLAG_IOMAP;
+#endif
+
return 0;
}
+#if FUSE_VERSION < FUSE_MAKE_VERSION(3, 99)
+#define fuse_reply_entry_iflags(req, entry, iflags) \
+ fuse_reply_entry((req), (entry))
+
+#define fuse_reply_attr_iflags(req, entry, iflags, timeout) \
+ fuse_reply_attr((req), (entry), (timeout))
+
+#define fuse_add_direntry_plus_iflags(req, buf, sz, name, iflags, entry, dirpos) \
+ fuse_add_direntry_plus((req), (buf), (sz), (name), (entry), (dirpos))
+
+#define fuse_reply_create_iflags(req, entry, iflags, fp) \
+ fuse_reply_create((req), (entry), (fp))
+#endif
+
static void op_lookup(fuse_req_t req, fuse_ino_t fino, const char *name)
{
struct fuse4fs_stat fstat;
@@ -1757,7 +1778,7 @@ static void op_lookup(fuse_req_t req, fuse_ino_t fino, const char *name)
if (ret)
fuse_reply_err(req, -ret);
else
- fuse_reply_entry(req, &fstat.entry);
+ fuse_reply_entry_iflags(req, &fstat.entry, fstat.iflags);
}
static void op_getattr(fuse_req_t req, fuse_ino_t fino,
@@ -1777,8 +1798,8 @@ static void op_getattr(fuse_req_t req, fuse_ino_t fino,
if (ret)
fuse_reply_err(req, -ret);
else
- fuse_reply_attr(req, &fstat.entry.attr,
- fstat.entry.attr_timeout);
+ fuse_reply_attr_iflags(req, &fstat.entry.attr, fstat.iflags,
+ fstat.entry.attr_timeout);
}
static void op_readlink(fuse_req_t req, fuse_ino_t fino)
@@ -2056,7 +2077,7 @@ static void fuse4fs_reply_entry(fuse_req_t req, ext2_ino_t ino,
return;
}
- fuse_reply_entry(req, &fstat.entry);
+ fuse_reply_entry_iflags(req, &fstat.entry, fstat.iflags);
}
static void op_mknod(fuse_req_t req, fuse_ino_t fino, const char *name,
@@ -4317,10 +4338,13 @@ static int op_readdir_iter(ext2_ino_t dir EXT2FS_ATTR((unused)),
namebuf[dirent->name_len & 0xFF] = 0;
if (i->readdirplus) {
- entrysize = fuse_add_direntry_plus(i->req, i->buf + i->bufused,
- i->bufsz - i->bufused,
- namebuf, &fstat.entry,
- i->dirpos);
+ entrysize = fuse_add_direntry_plus_iflags(i->req,
+ i->buf + i->bufused,
+ i->bufsz - i->bufused,
+ namebuf,
+ fstat.iflags,
+ &fstat.entry,
+ i->dirpos);
} else {
entrysize = fuse_add_direntry(i->req, i->buf + i->bufused,
i->bufsz - i->bufused, namebuf,
@@ -4545,7 +4569,7 @@ static void op_create(fuse_req_t req, fuse_ino_t fino, const char *name,
if (ret)
fuse_reply_err(req, -ret);
else
- fuse_reply_create(req, &fstat.entry, fp);
+ fuse_reply_create_iflags(req, &fstat.entry, fstat.iflags, fp);
}
#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 17)
@@ -4744,8 +4768,8 @@ static void op_setattr(fuse_req_t req, fuse_ino_t fino, struct stat *attr,
if (ret)
fuse_reply_err(req, -ret);
else
- fuse_reply_attr(req, &fstat.entry.attr,
- fstat.entry.attr_timeout);
+ fuse_reply_attr_iflags(req, &fstat.entry.attr, fstat.iflags,
+ fstat.entry.attr_timeout);
}
#define FUSE4FS_MODIFIABLE_IFLAGS \
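The fallback macros in the fuse4fs hunk above let every caller use the *_iflags names regardless of the library version. Below is a minimal, self-contained sketch of that pattern. All demo_* and DEMO_* names are hypothetical stand-ins and are not part of libfuse; the version numbers only mirror the ones in the patch.

```c
#include <stdio.h>

/* Hypothetical stand-ins for a library version and its reply helper. */
#define DEMO_MAKE_VERSION(maj, min)	((maj) * 100 + (min))
#ifndef DEMO_VERSION
# define DEMO_VERSION			DEMO_MAKE_VERSION(3, 17)
#endif

/* The "old" helper that every library version is assumed to provide. */
static int demo_reply_entry(int req, const char *name)
{
	printf("req %d: entry %s\n", req, name);
	return 0;
}

#if DEMO_VERSION < DEMO_MAKE_VERSION(3, 99)
/* Old library: discard iflags and call the old helper. */
# define demo_reply_entry_iflags(req, name, iflags) \
	demo_reply_entry((req), (name))
#else
/* New library: a real helper would transmit iflags as well. */
static int demo_reply_entry_iflags(int req, const char *name,
				   unsigned int iflags)
{
	printf("req %d: entry %s iflags 0x%x\n", req, name, iflags);
	return 0;
}
#endif
```

One consequence of this approach is that the macro form never evaluates its iflags argument, so callers should not pass an expression with side effects there.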
* [PATCH 14/19] fuse2fs: add strictatime/lazytime mount options
2025-08-21 0:49 ` [PATCHSET RFC v4 3/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
` (12 preceding siblings ...)
2025-08-21 1:19 ` [PATCH 13/19] fuse2fs: set iomap-related inode flags Darrick J. Wong
@ 2025-08-21 1:19 ` Darrick J. Wong
2025-08-21 1:19 ` [PATCH 15/19] fuse2fs: configure block device block size Darrick J. Wong
` (4 subsequent siblings)
18 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:19 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
In iomap mode, we can support the strictatime/lazytime mount options.
Add them to fuse2fs.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse2fs.c | 28 ++++++++++++++++++++++++++++
misc/fuse4fs.c | 28 ++++++++++++++++++++++++++++
2 files changed, 56 insertions(+)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 291416afb93d6c..9ac9077f3508f7 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -267,6 +267,7 @@ struct fuse2fs {
uint8_t dirsync;
uint8_t unmount_in_destroy;
uint8_t noblkdev;
+ uint8_t iomap_passthrough_options;
enum fuse2fs_opstate opstate;
int logfd;
@@ -1422,6 +1423,8 @@ static void fuse2fs_iomap_enable(struct fuse_conn_info *conn,
if (!fuse2fs_iomap_enabled(ff)) {
if (ff->iomap_want == FT_ENABLE)
err_printf(ff, "%s\n", _("Could not enable iomap."));
+ if (ff->iomap_passthrough_options)
+ err_printf(ff, "%s\n", _("Some mount options require iomap."));
return;
}
}
@@ -6394,6 +6397,7 @@ enum {
FUSE2FS_ERRORS_BEHAVIOR,
#ifdef HAVE_FUSE_IOMAP
FUSE2FS_IOMAP,
+ FUSE2FS_IOMAP_PASSTHROUGH,
#endif
};
@@ -6420,6 +6424,17 @@ static struct fuse_opt fuse2fs_opts[] = {
#endif
FUSE2FS_OPT("noblkdev", noblkdev, 1),
+#ifdef HAVE_FUSE_IOMAP
+#ifdef MS_LAZYTIME
+ FUSE_OPT_KEY("lazytime", FUSE2FS_IOMAP_PASSTHROUGH),
+ FUSE_OPT_KEY("nolazytime", FUSE2FS_IOMAP_PASSTHROUGH),
+#endif
+#ifdef MS_STRICTATIME
+ FUSE_OPT_KEY("strictatime", FUSE2FS_IOMAP_PASSTHROUGH),
+ FUSE_OPT_KEY("nostrictatime", FUSE2FS_IOMAP_PASSTHROUGH),
+#endif
+#endif
+
FUSE_OPT_KEY("user_xattr", FUSE2FS_IGNORED),
FUSE_OPT_KEY("noblock_validity", FUSE2FS_IGNORED),
FUSE_OPT_KEY("nodelalloc", FUSE2FS_IGNORED),
@@ -6446,6 +6461,12 @@ static int fuse2fs_opt_proc(void *data, const char *arg,
struct fuse2fs *ff = data;
switch (key) {
+#ifdef HAVE_FUSE_IOMAP
+ case FUSE2FS_IOMAP_PASSTHROUGH:
+ ff->iomap_passthrough_options = 1;
+ /* pass through to libfuse */
+ return 1;
+#endif
case FUSE2FS_DIRSYNC:
ff->dirsync = 1;
/* pass through to libfuse */
@@ -6735,6 +6756,13 @@ int main(int argc, char *argv[])
fctx.unmount_in_destroy = 1;
}
+ if (fctx.iomap_passthrough_options && !iomap_detected) {
+ err_printf(&fctx, "%s\n",
+ _("Some mount options require iomap."));
+ ret |= 1;
+ goto out;
+ }
+
if (!fctx.cache_size)
fctx.cache_size = default_cache_size();
if (fctx.cache_size) {
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 37a7ab3a3718e4..1050238c88632d 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -262,6 +262,7 @@ struct fuse4fs {
uint8_t dirsync;
uint8_t unmount_in_destroy;
uint8_t noblkdev;
+ uint8_t iomap_passthrough_options;
enum fuse4fs_opstate opstate;
int logfd;
@@ -1603,6 +1604,8 @@ static void fuse4fs_iomap_enable(struct fuse_conn_info *conn,
if (!fuse4fs_iomap_enabled(ff)) {
if (ff->iomap_want == FT_ENABLE)
err_printf(ff, "%s\n", _("Could not enable iomap."));
+ if (ff->iomap_passthrough_options)
+ err_printf(ff, "%s\n", _("Some mount options require iomap."));
return;
}
}
@@ -6697,6 +6700,7 @@ enum {
FUSE4FS_ERRORS_BEHAVIOR,
#ifdef HAVE_FUSE_IOMAP
FUSE4FS_IOMAP,
+ FUSE4FS_IOMAP_PASSTHROUGH,
#endif
};
@@ -6723,6 +6727,17 @@ static struct fuse_opt fuse4fs_opts[] = {
#endif
FUSE4FS_OPT("noblkdev", noblkdev, 1),
+#ifdef HAVE_FUSE_IOMAP
+#ifdef MS_LAZYTIME
+ FUSE_OPT_KEY("lazytime", FUSE4FS_IOMAP_PASSTHROUGH),
+ FUSE_OPT_KEY("nolazytime", FUSE4FS_IOMAP_PASSTHROUGH),
+#endif
+#ifdef MS_STRICTATIME
+ FUSE_OPT_KEY("strictatime", FUSE4FS_IOMAP_PASSTHROUGH),
+ FUSE_OPT_KEY("nostrictatime", FUSE4FS_IOMAP_PASSTHROUGH),
+#endif
+#endif
+
FUSE_OPT_KEY("user_xattr", FUSE4FS_IGNORED),
FUSE_OPT_KEY("noblock_validity", FUSE4FS_IGNORED),
FUSE_OPT_KEY("nodelalloc", FUSE4FS_IGNORED),
@@ -6749,6 +6764,12 @@ static int fuse4fs_opt_proc(void *data, const char *arg,
struct fuse4fs *ff = data;
switch (key) {
+#ifdef HAVE_FUSE_IOMAP
+ case FUSE4FS_IOMAP_PASSTHROUGH:
+ ff->iomap_passthrough_options = 1;
+ /* pass through to libfuse */
+ return 1;
+#endif
case FUSE4FS_DIRSYNC:
ff->dirsync = 1;
/* pass through to libfuse */
@@ -7137,6 +7158,13 @@ int main(int argc, char *argv[])
fctx.unmount_in_destroy = 1;
}
+ if (fctx.iomap_passthrough_options && !iomap_detected) {
+ err_printf(&fctx, "%s\n",
+ _("Some mount options require iomap."));
+ ret |= 1;
+ goto out;
+ }
+
if (!fctx.cache_size)
fctx.cache_size = default_cache_size();
if (fctx.cache_size) {
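The option handling above records that an iomap-only option was seen and refuses the mount later if iomap was not detected. Here is a rough sketch of that decision, assuming the options arrive as one comma-separated string. The demo_* names are hypothetical, and the real code relies on libfuse's option parser rather than scanning the string by hand.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Options that only make sense when iomap is in use (from the patch). */
static const char *const demo_iomap_only_opts[] = {
	"lazytime", "nolazytime", "strictatime", "nostrictatime",
};

/* Does the option token of the given length appear in the list? */
static bool demo_opt_needs_iomap(const char *opt, size_t len)
{
	size_t i;

	for (i = 0; i < sizeof(demo_iomap_only_opts) /
			sizeof(demo_iomap_only_opts[0]); i++) {
		const char *candidate = demo_iomap_only_opts[i];

		if (strlen(candidate) == len && !strncmp(opt, candidate, len))
			return true;
	}
	return false;
}

/* Returns 1 if the mount should be refused, 0 otherwise. */
static int demo_check_mount_opts(const char *opts, bool iomap_detected)
{
	bool needs_iomap = false;

	while (*opts) {
		size_t len = strcspn(opts, ",");

		if (demo_opt_needs_iomap(opts, len))
			needs_iomap = true;
		opts += len;
		if (*opts == ',')
			opts++;
	}
	return needs_iomap && !iomap_detected;
}
```

For instance, demo_check_mount_opts("rw,lazytime", false) asks for the mount to be refused, while the same options with iomap detected are accepted.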
* [PATCH 15/19] fuse2fs: configure block device block size
2025-08-21 0:49 ` [PATCHSET RFC v4 3/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
` (13 preceding siblings ...)
2025-08-21 1:19 ` [PATCH 14/19] fuse2fs: add strictatime/lazytime mount options Darrick J. Wong
@ 2025-08-21 1:19 ` Darrick J. Wong
2025-08-21 1:19 ` [PATCH 16/19] fuse4fs: don't use inode number translation when possible Darrick J. Wong
` (3 subsequent siblings)
18 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:19 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Set the blocksize of the block device to the filesystem blocksize.
This prevents the bdev pagecache from caching file data blocks that
iomap will read and write directly. Cache duplication is dangerous.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse2fs.c | 41 +++++++++++++++++++++++++++++++++++++++++
misc/fuse4fs.c | 41 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 82 insertions(+)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 9ac9077f3508f7..874fe3bbcc3b9f 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5862,6 +5862,43 @@ static off_t fuse2fs_max_size(struct fuse2fs *ff, off_t upper_limit)
return res;
}
+/*
+ * Set the block device's blocksize to the fs blocksize.
+ *
+ * This is required to avoid creating uptodate bdev pagecache that aliases file
+ * data blocks because iomap reads and writes directly to file data blocks.
+ */
+static int fuse2fs_set_bdev_blocksize(struct fuse2fs *ff, int fd)
+{
+ int blocksize = ff->fs->blocksize;
+ int set_error;
+ int ret;
+
+ ret = ioctl(fd, BLKBSZSET, &blocksize);
+ if (!ret)
+ return 0;
+
+ /*
+ * Save the original errno so we can report that if the block device
+ * blocksize isn't set in an agreeable way.
+ */
+ set_error = errno;
+
+ ret = ioctl(fd, BLKBSZGET, &blocksize);
+ if (ret)
+ goto out_bad;
+
+ if (blocksize <= ff->fs->blocksize)
+ return 0;
+
+ /* Pretend that BLKBSZSET rejected our proposed block size */
+ set_error = EINVAL;
+out_bad:
+ err_printf(ff, "%s: cannot set blocksize %u: %s\n", __func__,
+ blocksize, strerror(set_error));
+ return EIO;
+}
+
static int fuse2fs_iomap_config_devices(struct fuse2fs *ff)
{
errcode_t err;
@@ -5872,6 +5909,10 @@ static int fuse2fs_iomap_config_devices(struct fuse2fs *ff)
if (err)
return translate_error(ff->fs, 0, err);
+ ret = fuse2fs_set_bdev_blocksize(ff, fd);
+ if (ret)
+ return ret;
+
ret = fuse_fs_iomap_device_add(fd, 0);
if (ret < 0) {
dbg_printf(ff, "%s: cannot register iomap dev fd=%d, err=%d\n",
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 1050238c88632d..304bac191e7c4c 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -6182,6 +6182,43 @@ static off_t fuse4fs_max_size(struct fuse4fs *ff, off_t upper_limit)
return res;
}
+/*
+ * Set the block device's blocksize to the fs blocksize.
+ *
+ * This is required to avoid creating uptodate bdev pagecache that aliases file
+ * data blocks because iomap reads and writes directly to file data blocks.
+ */
+static int fuse4fs_set_bdev_blocksize(struct fuse4fs *ff, int fd)
+{
+ int blocksize = ff->fs->blocksize;
+ int set_error;
+ int ret;
+
+ ret = ioctl(fd, BLKBSZSET, &blocksize);
+ if (!ret)
+ return 0;
+
+ /*
+ * Save the original errno so we can report that if the block device
+ * blocksize isn't set in an agreeable way.
+ */
+ set_error = errno;
+
+ ret = ioctl(fd, BLKBSZGET, &blocksize);
+ if (ret)
+ goto out_bad;
+
+ if (blocksize <= ff->fs->blocksize)
+ return 0;
+
+ /* Pretend that BLKBSZSET rejected our proposed block size */
+ set_error = EINVAL;
+out_bad:
+ err_printf(ff, "%s: cannot set blocksize %u: %s\n", __func__,
+ blocksize, strerror(set_error));
+ return EIO;
+}
+
static int fuse4fs_iomap_config_devices(struct fuse4fs *ff)
{
errcode_t err;
@@ -6192,6 +6229,10 @@ static int fuse4fs_iomap_config_devices(struct fuse4fs *ff)
if (err)
return translate_error(ff->fs, 0, err);
+ ret = fuse4fs_set_bdev_blocksize(ff, fd);
+ if (ret)
+ return ret;
+
ret = fuse_lowlevel_iomap_device_add(ff->fuse, fd, 0);
if (ret < 0) {
dbg_printf(ff, "%s: cannot register iomap dev fd=%d, err=%d\n",
* [PATCH 16/19] fuse4fs: don't use inode number translation when possible
2025-08-21 0:49 ` [PATCHSET RFC v4 3/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
` (14 preceding siblings ...)
2025-08-21 1:19 ` [PATCH 15/19] fuse2fs: configure block device block size Darrick J. Wong
@ 2025-08-21 1:19 ` Darrick J. Wong
2025-08-21 1:20 ` [PATCH 17/19] fuse4fs: separate invalidation Darrick J. Wong
` (2 subsequent siblings)
18 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:19 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Prior to the integration of iomap into fuse, the fuse client (aka the
kernel) required that the root directory have an inumber of
FUSE_ROOT_ID, which is 1. However, the ext2 filesystem defines the root
inode number to be EXT2_ROOT_INO, which is 2. This dissonance means
that we have to have translator functions, and that any access to
inumber 1 (the ext2 badblocks file) will instead redirect to the root
directory.
That's horrible. Use the new mount option to set the root directory
nodeid to EXT2_ROOT_INO so that we don't need this translation.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse4fs.c | 29 +++++++++++++++++++++++------
1 file changed, 23 insertions(+), 6 deletions(-)
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 304bac191e7c4c..5127712e19e6f9 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -263,6 +263,7 @@ struct fuse4fs {
uint8_t unmount_in_destroy;
uint8_t noblkdev;
uint8_t iomap_passthrough_options;
+ uint8_t translate_inums;
enum fuse4fs_opstate opstate;
int logfd;
@@ -324,17 +325,19 @@ struct fuse4fs {
#define FUSE4FS_CHECK_CONTEXT_ABORT(ff) \
__FUSE4FS_CHECK_CONTEXT((ff), abort(), abort())
-static inline void fuse4fs_ino_from_fuse(ext2_ino_t *inop, fuse_ino_t fino)
+static inline void fuse4fs_ino_from_fuse(const struct fuse4fs *ff,
+ ext2_ino_t *inop, fuse_ino_t fino)
{
- if (fino == FUSE_ROOT_ID)
+ if (ff->translate_inums && fino == FUSE_ROOT_ID)
*inop = EXT2_ROOT_INO;
else
*inop = fino;
}
-static inline void fuse4fs_ino_to_fuse(fuse_ino_t *finop, ext2_ino_t ino)
+static inline void fuse4fs_ino_to_fuse(const struct fuse4fs *ff,
+ fuse_ino_t *finop, ext2_ino_t ino)
{
- if (ino == EXT2_ROOT_INO)
+ if (ff->translate_inums && ino == EXT2_ROOT_INO)
*finop = FUSE_ROOT_ID;
else
*finop = ino;
@@ -350,7 +353,7 @@ static inline void fuse4fs_ino_to_fuse(fuse_ino_t *finop, ext2_ino_t ino)
fuse_reply_err((req), EIO); \
return; \
} \
- fuse4fs_ino_from_fuse(ext2_inop, fuse_ino); \
+ fuse4fs_ino_from_fuse(fuse4fs_get(req), ext2_inop, fuse_ino); \
} while (0)
static int __translate_error(ext2_filsys fs, ext2_ino_t ino, errcode_t err,
@@ -1723,7 +1726,7 @@ static int fuse4fs_stat_inode(struct fuse4fs *ff, ext2_ino_t ino,
statbuf->st_rdev = inodep->i_block[1];
}
- fuse4fs_ino_to_fuse(&entry->ino, ino);
+ fuse4fs_ino_to_fuse(ff, &entry->ino, ino);
entry->generation = inodep->i_generation;
entry->attr_timeout = FUSE4FS_ATTR_TIMEOUT;
entry->entry_timeout = FUSE4FS_ATTR_TIMEOUT;
@@ -7101,6 +7104,7 @@ int main(int argc, char *argv[])
.iomap_state = IOMAP_UNKNOWN,
.iomap_dev = FUSE_IOMAP_DEV_NULL,
#endif
+ .translate_inums = 1,
};
errcode_t err;
FILE *orig_stderr = stderr;
@@ -7206,6 +7210,19 @@ int main(int argc, char *argv[])
goto out;
}
+ if (iomap_detected) {
+ /*
+ * The root_nodeid mount option was added when iomap support
+ * was added to fuse. This enables us to control the root
+ * nodeid in the kernel, which enables a 1:1 translation of
+ * ext2 to kernel inumbers.
+ */
+ snprintf(extra_args, BUFSIZ, "-oroot_nodeid=%d",
+ EXT2_ROOT_INO);
+ fuse_opt_add_arg(&args, extra_args);
+ fctx.translate_inums = 0;
+ }
+
if (!fctx.cache_size)
fctx.cache_size = default_cache_size();
if (fctx.cache_size) {
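To illustrate why the translation is undesirable, here is a small standalone sketch of the two helpers with the translate flag. The DEMO_* constants carry the values quoted in the commit message (1 and 2); the function names and the plain integer types are hypothetical simplifications of the fuse4fs helpers.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_FUSE_ROOT_ID	1	/* root nodeid the kernel expects */
#define DEMO_EXT2_ROOT_INO	2	/* root inode number on disk */

/* Map a kernel nodeid to an on-disk inode number. */
static uint32_t demo_ino_from_fuse(bool translate, uint64_t fino)
{
	if (translate && fino == DEMO_FUSE_ROOT_ID)
		return DEMO_EXT2_ROOT_INO;
	return (uint32_t)fino;
}

/* Map an on-disk inode number to a kernel nodeid. */
static uint64_t demo_ino_to_fuse(bool translate, uint32_t ino)
{
	if (translate && ino == DEMO_EXT2_ROOT_INO)
		return DEMO_FUSE_ROOT_ID;
	return ino;
}
```

With translation on, nodeids 1 and 2 both resolve to inode 2, so inode 1 cannot be reached. With translation off, the mapping is the identity and inode 1 is addressable again.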
* [PATCH 17/19] fuse4fs: separate invalidation
2025-08-21 0:49 ` [PATCHSET RFC v4 3/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
` (15 preceding siblings ...)
2025-08-21 1:19 ` [PATCH 16/19] fuse4fs: don't use inode number translation when possible Darrick J. Wong
@ 2025-08-21 1:20 ` Darrick J. Wong
2025-08-21 1:20 ` [PATCH 18/19] fuse2fs: implement statx Darrick J. Wong
2025-08-21 1:20 ` [PATCH 19/19] fuse2fs: enable atomic writes Darrick J. Wong
18 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:20 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
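Hook the libext2fs block allocation stats callbacks so that whenever
blocks are freed, we ask the kernel to invalidate the corresponding
range of the iomap block device's page cache.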
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse2fs.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
misc/fuse4fs.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 121 insertions(+)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 874fe3bbcc3b9f..cc835f894122a4 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -277,6 +277,9 @@ struct fuse2fs {
enum fuse2fs_iomap_state iomap_state;
uint32_t iomap_dev;
uint64_t iomap_cap;
+ void (*old_alloc_stats)(ext2_filsys fs, blk64_t blk, int inuse);
+ void (*old_alloc_stats_range)(ext2_filsys fs, blk64_t blk, blk_t num,
+ int inuse);
#endif
unsigned int blockmask;
unsigned long offset;
@@ -5927,6 +5930,50 @@ static int fuse2fs_iomap_config_devices(struct fuse2fs *ff)
return 0;
}
+static void fuse2fs_invalidate_bdev(struct fuse2fs *ff, blk64_t blk, blk_t num)
+{
+ off_t offset = FUSE2FS_FSB_TO_B(ff, blk);
+ off_t length = FUSE2FS_FSB_TO_B(ff, num);
+ int ret;
+
+ ret = fuse_fs_iomap_device_invalidate(ff->iomap_dev, offset, length);
+ if (!ret)
+ return;
+
+ if (num == 1)
+ err_printf(ff, "%s %llu: %s\n",
+ _("error invalidating block"),
+ (unsigned long long)blk,
+ strerror(ret));
+ else
+ err_printf(ff, "%s %llu-%llu: %s\n",
+ _("error invalidating blocks"),
+ (unsigned long long)blk,
+ (unsigned long long)blk + num - 1,
+ strerror(ret));
+}
+
+static void fuse2fs_alloc_stats(ext2_filsys fs, blk64_t blk, int inuse)
+{
+ struct fuse2fs *ff = fs->priv_data;
+
+ if (inuse < 0)
+ fuse2fs_invalidate_bdev(ff, blk, 1);
+ if (ff->old_alloc_stats)
+ ff->old_alloc_stats(fs, blk, inuse);
+}
+
+static void fuse2fs_alloc_stats_range(ext2_filsys fs, blk64_t blk, blk_t num,
+ int inuse)
+{
+ struct fuse2fs *ff = fs->priv_data;
+
+ if (inuse < 0)
+ fuse2fs_invalidate_bdev(ff, blk, num);
+ if (ff->old_alloc_stats_range)
+ ff->old_alloc_stats_range(fs, blk, num, inuse);
+}
+
static int op_iomap_config(uint64_t flags, off_t maxbytes,
struct fuse_iomap_config *cfg)
{
@@ -5971,6 +6018,19 @@ static int op_iomap_config(uint64_t flags, off_t maxbytes,
if (ret)
goto out_unlock;
+ /*
+ * If we let iomap do all file block IO, then we need to watch for
+ * freed blocks so that we can invalidate any page cache that might
+ * get written to the block device.
+ */
+ if (fuse2fs_iomap_enabled(ff)) {
+ ext2fs_set_block_alloc_stats_callback(ff->fs,
+ fuse2fs_alloc_stats, &ff->old_alloc_stats);
+ ext2fs_set_block_alloc_stats_range_callback(ff->fs,
+ fuse2fs_alloc_stats_range,
+ &ff->old_alloc_stats_range);
+ }
+
out_unlock:
fuse2fs_finish(ff, ret);
return ret;
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 5127712e19e6f9..2371b9b37cc16a 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -273,6 +273,9 @@ struct fuse4fs {
enum fuse4fs_iomap_state iomap_state;
uint32_t iomap_dev;
uint64_t iomap_cap;
+ void (*old_alloc_stats)(ext2_filsys fs, blk64_t blk, int inuse);
+ void (*old_alloc_stats_range)(ext2_filsys fs, blk64_t blk, blk_t num,
+ int inuse);
#endif
unsigned int blockmask;
unsigned long offset;
@@ -6250,6 +6253,51 @@ static int fuse4fs_iomap_config_devices(struct fuse4fs *ff)
return 0;
}
+static void fuse4fs_invalidate_bdev(struct fuse4fs *ff, blk64_t blk, blk_t num)
+{
+ off_t offset = FUSE4FS_FSB_TO_B(ff, blk);
+ off_t length = FUSE4FS_FSB_TO_B(ff, num);
+ int ret;
+
+ ret = fuse_lowlevel_iomap_device_invalidate(ff->fuse, ff->iomap_dev,
+ offset, length);
+ if (!ret)
+ return;
+
+ if (num == 1)
+ err_printf(ff, "%s %llu: %s\n",
+ _("error invalidating block"),
+ (unsigned long long)blk,
+ strerror(ret));
+ else
+ err_printf(ff, "%s %llu-%llu: %s\n",
+ _("error invalidating blocks"),
+ (unsigned long long)blk,
+ (unsigned long long)blk + num - 1,
+ strerror(ret));
+}
+
+static void fuse4fs_alloc_stats(ext2_filsys fs, blk64_t blk, int inuse)
+{
+ struct fuse4fs *ff = fs->priv_data;
+
+ if (inuse < 0)
+ fuse4fs_invalidate_bdev(ff, blk, 1);
+ if (ff->old_alloc_stats)
+ ff->old_alloc_stats(fs, blk, inuse);
+}
+
+static void fuse4fs_alloc_stats_range(ext2_filsys fs, blk64_t blk, blk_t num,
+ int inuse)
+{
+ struct fuse4fs *ff = fs->priv_data;
+
+ if (inuse < 0)
+ fuse4fs_invalidate_bdev(ff, blk, num);
+ if (ff->old_alloc_stats_range)
+ ff->old_alloc_stats_range(fs, blk, num, inuse);
+}
+
static void op_iomap_config(fuse_req_t req, uint64_t flags, uint64_t maxbytes)
{
struct fuse_iomap_config cfg = { };
@@ -6294,6 +6342,19 @@ static void op_iomap_config(fuse_req_t req, uint64_t flags, uint64_t maxbytes)
if (ret)
goto out_unlock;
+ /*
+ * If we let iomap do all file block IO, then we need to watch for
+ * freed blocks so that we can invalidate any page cache that might
+ * get written to the block device.
+ */
+ if (fuse4fs_iomap_enabled(ff)) {
+ ext2fs_set_block_alloc_stats_callback(ff->fs,
+ fuse4fs_alloc_stats, &ff->old_alloc_stats);
+ ext2fs_set_block_alloc_stats_range_callback(ff->fs,
+ fuse4fs_alloc_stats_range,
+ &ff->old_alloc_stats_range);
+ }
+
out_unlock:
fuse4fs_finish(ff, ret);
if (ret)
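The callbacks above follow a save-and-chain pattern: install a new hook, remember the previous one, and call it after doing the new work. A minimal sketch of that pattern follows. Every demo_* name is hypothetical, and a counter stands in for the actual page cache invalidation call.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*demo_alloc_stats_fn)(uint64_t blk, int inuse);

/* Hypothetical filesystem handle with a single hook slot. */
struct demo_fs {
	demo_alloc_stats_fn alloc_stats;
};

static demo_alloc_stats_fn demo_old_alloc_stats;
static unsigned int demo_invalidations;
static unsigned int demo_old_calls;

/* Install a new hook and hand back whatever was installed before. */
static void demo_set_alloc_stats(struct demo_fs *fs, demo_alloc_stats_fn fn,
				 demo_alloc_stats_fn *old)
{
	if (old)
		*old = fs->alloc_stats;
	fs->alloc_stats = fn;
}

/* Stands in for a hook that some other component installed earlier. */
static void demo_prior_hook(uint64_t blk, int inuse)
{
	(void)blk;
	(void)inuse;
	demo_old_calls++;
}

/* New hook: act only on frees (inuse < 0), then chain to the old hook. */
static void demo_alloc_stats(uint64_t blk, int inuse)
{
	if (inuse < 0)
		demo_invalidations++;	/* real code would invalidate here */
	if (demo_old_alloc_stats)
		demo_old_alloc_stats(blk, inuse);
}
```

Calling the new hook once for an allocation and once for a free would bump the invalidation counter once and the prior hook's counter twice.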
* [PATCH 18/19] fuse2fs: implement statx
2025-08-21 0:49 ` [PATCHSET RFC v4 3/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
` (16 preceding siblings ...)
2025-08-21 1:20 ` [PATCH 17/19] fuse4fs: separate invalidation Darrick J. Wong
@ 2025-08-21 1:20 ` Darrick J. Wong
2025-08-21 1:20 ` [PATCH 19/19] fuse2fs: enable atomic writes Darrick J. Wong
18 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:20 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
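Implement statx so that callers can retrieve the birth time, the file
attribute flags, and the direct I/O alignment of the underlying device.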
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse2fs.c | 128 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
misc/fuse4fs.c | 133 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 261 insertions(+)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index cc835f894122a4..a00c32e9f2cae8 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -23,6 +23,7 @@
#include <sys/xattr.h>
#endif
#include <sys/ioctl.h>
+#include <sys/sysmacros.h>
#include <unistd.h>
#include <ctype.h>
#include <stdbool.h>
@@ -1642,6 +1643,130 @@ static int op_getattr_iflags(const char *path, struct stat *statbuf,
}
#endif
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 18) && defined(STATX_BASIC_STATS)
+static inline void fuse2fs_set_statx_attr(struct statx *stx,
+ uint64_t statx_flag, int set)
+{
+ if (set)
+ stx->stx_attributes |= statx_flag;
+ stx->stx_attributes_mask |= statx_flag;
+}
+
+static void fuse2fs_statx_directio(struct fuse2fs *ff, struct statx *stx)
+{
+ struct statx devx;
+ errcode_t err;
+ int fd;
+
+ err = io_channel_get_fd(ff->fs->io, &fd);
+ if (err)
+ return;
+
+ err = statx(fd, "", AT_EMPTY_PATH, STATX_DIOALIGN, &devx);
+ if (err)
+ return;
+ if (!(devx.stx_mask & STATX_DIOALIGN))
+ return;
+
+ stx->stx_mask |= STATX_DIOALIGN;
+ stx->stx_dio_mem_align = devx.stx_dio_mem_align;
+ stx->stx_dio_offset_align = devx.stx_dio_offset_align;
+}
+
+static int fuse2fs_statx(struct fuse2fs *ff, ext2_ino_t ino, int statx_mask,
+ struct statx *stx)
+{
+ struct ext2_inode_large inode;
+ ext2_filsys fs = ff->fs;
+ dev_t fakedev = 0;
+ errcode_t err;
+ struct timespec tv;
+
+ err = fuse2fs_read_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ memcpy(&fakedev, fs->super->s_uuid, sizeof(fakedev));
+ stx->stx_mask = STATX_BASIC_STATS | STATX_BTIME;
+ stx->stx_dev_major = major(fakedev);
+ stx->stx_dev_minor = minor(fakedev);
+ stx->stx_ino = ino;
+ stx->stx_mode = inode.i_mode;
+ stx->stx_nlink = inode.i_links_count;
+ stx->stx_uid = inode_uid(inode);
+ stx->stx_gid = inode_gid(inode);
+ stx->stx_size = EXT2_I_SIZE(&inode);
+ stx->stx_blksize = fs->blocksize;
+ stx->stx_blocks = ext2fs_get_stat_i_blocks(fs,
+ EXT2_INODE(&inode));
+ EXT4_INODE_GET_XTIME(i_atime, &tv, &inode);
+ stx->stx_atime.tv_sec = tv.tv_sec;
+ stx->stx_atime.tv_nsec = tv.tv_nsec;
+
+ EXT4_INODE_GET_XTIME(i_mtime, &tv, &inode);
+ stx->stx_mtime.tv_sec = tv.tv_sec;
+ stx->stx_mtime.tv_nsec = tv.tv_nsec;
+
+ EXT4_INODE_GET_XTIME(i_ctime, &tv, &inode);
+ stx->stx_ctime.tv_sec = tv.tv_sec;
+ stx->stx_ctime.tv_nsec = tv.tv_nsec;
+
+ EXT4_INODE_GET_XTIME(i_crtime, &tv, &inode);
+ stx->stx_btime.tv_sec = tv.tv_sec;
+ stx->stx_btime.tv_nsec = tv.tv_nsec;
+
+ dbg_printf(ff, "%s: ino=%d atime=%lld.%d mtime=%lld.%d ctime=%lld.%d btime=%lld.%d\n",
+ __func__, ino,
+ (long long int)stx->stx_atime.tv_sec, stx->stx_atime.tv_nsec,
+ (long long int)stx->stx_mtime.tv_sec, stx->stx_mtime.tv_nsec,
+ (long long int)stx->stx_ctime.tv_sec, stx->stx_ctime.tv_nsec,
+ (long long int)stx->stx_btime.tv_sec, stx->stx_btime.tv_nsec);
+
+ if (LINUX_S_ISCHR(inode.i_mode) ||
+ LINUX_S_ISBLK(inode.i_mode)) {
+ if (inode.i_block[0]) {
+ stx->stx_rdev_major = major(inode.i_block[0]);
+ stx->stx_rdev_minor = minor(inode.i_block[0]);
+ } else {
+ stx->stx_rdev_major = major(inode.i_block[1]);
+ stx->stx_rdev_minor = minor(inode.i_block[1]);
+ }
+ }
+
+ fuse2fs_set_statx_attr(stx, STATX_ATTR_COMPRESSED,
+ inode.i_flags & EXT2_COMPR_FL);
+ fuse2fs_set_statx_attr(stx, STATX_ATTR_IMMUTABLE,
+ inode.i_flags & EXT2_IMMUTABLE_FL);
+ fuse2fs_set_statx_attr(stx, STATX_ATTR_APPEND,
+ inode.i_flags & EXT2_APPEND_FL);
+ fuse2fs_set_statx_attr(stx, STATX_ATTR_NODUMP,
+ inode.i_flags & EXT2_NODUMP_FL);
+
+ fuse2fs_statx_directio(ff, stx);
+
+ return 0;
+}
+
+static int op_statx(const char *path, int statx_flags, int statx_mask,
+ struct statx *stx, struct fuse_file_info *fi)
+{
+ struct fuse2fs *ff = fuse2fs_get();
+ ext2_ino_t ino;
+ int ret = 0;
+
+ FUSE2FS_CHECK_CONTEXT(ff);
+ fuse2fs_start(ff);
+ ret = fuse2fs_file_ino(ff, path, fi, &ino);
+ if (ret)
+ goto out;
+ ret = fuse2fs_statx(ff, ino, statx_mask, stx);
+out:
+ fuse2fs_finish(ff, ret);
+ return ret;
+}
+#else
+# define op_statx NULL
+#endif
static int op_readlink(const char *path, char *buf, size_t len)
{
@@ -6460,6 +6585,9 @@ static struct fuse_operations fs_ops = {
.fallocate = op_fallocate,
# endif
#endif
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 18)
+ .statx = op_statx,
+#endif
#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 99)
.getattr_iflags = op_getattr_iflags,
#endif
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 2371b9b37cc16a..b45f92a1cdbe25 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -24,6 +24,7 @@
#include <sys/xattr.h>
#endif
#include <sys/ioctl.h>
+#include <sys/sysmacros.h>
#include <unistd.h>
#include <ctype.h>
#include <stdbool.h>
@@ -1811,6 +1812,135 @@ static void op_getattr(fuse_req_t req, fuse_ino_t fino,
fstat.entry.attr_timeout);
}
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 18) && defined(STATX_BASIC_STATS)
+static inline void fuse4fs_set_statx_attr(struct statx *stx,
+ uint64_t statx_flag, int set)
+{
+ if (set)
+ stx->stx_attributes |= statx_flag;
+ stx->stx_attributes_mask |= statx_flag;
+}
+
+static void fuse4fs_statx_directio(struct fuse4fs *ff, struct statx *stx)
+{
+ struct statx devx;
+ errcode_t err;
+ int fd;
+
+ err = io_channel_get_fd(ff->fs->io, &fd);
+ if (err)
+ return;
+
+ err = statx(fd, "", AT_EMPTY_PATH, STATX_DIOALIGN, &devx);
+ if (err)
+ return;
+ if (!(devx.stx_mask & STATX_DIOALIGN))
+ return;
+
+ stx->stx_mask |= STATX_DIOALIGN;
+ stx->stx_dio_mem_align = devx.stx_dio_mem_align;
+ stx->stx_dio_offset_align = devx.stx_dio_offset_align;
+}
+
+static int fuse4fs_statx(struct fuse4fs *ff, ext2_ino_t ino, int statx_mask,
+ struct statx *stx)
+{
+ struct ext2_inode_large inode;
+ ext2_filsys fs = ff->fs;
+ dev_t fakedev = 0;
+ errcode_t err;
+ struct timespec tv;
+
+ err = fuse4fs_read_inode(fs, ino, &inode);
+ if (err)
+ return translate_error(fs, ino, err);
+
+ memcpy(&fakedev, fs->super->s_uuid, sizeof(fakedev));
+ stx->stx_mask = STATX_BASIC_STATS | STATX_BTIME;
+ stx->stx_dev_major = major(fakedev);
+ stx->stx_dev_minor = minor(fakedev);
+ stx->stx_ino = ino;
+ stx->stx_mode = inode.i_mode;
+ stx->stx_nlink = inode.i_links_count;
+ stx->stx_uid = inode_uid(inode);
+ stx->stx_gid = inode_gid(inode);
+ stx->stx_size = EXT2_I_SIZE(&inode);
+ stx->stx_blksize = fs->blocksize;
+ stx->stx_blocks = ext2fs_get_stat_i_blocks(fs,
+ EXT2_INODE(&inode));
+ EXT4_INODE_GET_XTIME(i_atime, &tv, &inode);
+ stx->stx_atime.tv_sec = tv.tv_sec;
+ stx->stx_atime.tv_nsec = tv.tv_nsec;
+
+ EXT4_INODE_GET_XTIME(i_mtime, &tv, &inode);
+ stx->stx_mtime.tv_sec = tv.tv_sec;
+ stx->stx_mtime.tv_nsec = tv.tv_nsec;
+
+ EXT4_INODE_GET_XTIME(i_ctime, &tv, &inode);
+ stx->stx_ctime.tv_sec = tv.tv_sec;
+ stx->stx_ctime.tv_nsec = tv.tv_nsec;
+
+ EXT4_INODE_GET_XTIME(i_crtime, &tv, &inode);
+ stx->stx_btime.tv_sec = tv.tv_sec;
+ stx->stx_btime.tv_nsec = tv.tv_nsec;
+
+ dbg_printf(ff, "%s: ino=%d atime=%lld.%d mtime=%lld.%d ctime=%lld.%d btime=%lld.%d\n",
+ __func__, ino,
+ (long long int)stx->stx_atime.tv_sec, stx->stx_atime.tv_nsec,
+ (long long int)stx->stx_mtime.tv_sec, stx->stx_mtime.tv_nsec,
+ (long long int)stx->stx_ctime.tv_sec, stx->stx_ctime.tv_nsec,
+ (long long int)stx->stx_btime.tv_sec, stx->stx_btime.tv_nsec);
+
+ if (LINUX_S_ISCHR(inode.i_mode) ||
+ LINUX_S_ISBLK(inode.i_mode)) {
+ if (inode.i_block[0]) {
+ stx->stx_rdev_major = major(inode.i_block[0]);
+ stx->stx_rdev_minor = minor(inode.i_block[0]);
+ } else {
+ stx->stx_rdev_major = major(inode.i_block[1]);
+ stx->stx_rdev_minor = minor(inode.i_block[1]);
+ }
+ }
+
+ fuse4fs_set_statx_attr(stx, STATX_ATTR_COMPRESSED,
+ inode.i_flags & EXT2_COMPR_FL);
+ fuse4fs_set_statx_attr(stx, STATX_ATTR_IMMUTABLE,
+ inode.i_flags & EXT2_IMMUTABLE_FL);
+ fuse4fs_set_statx_attr(stx, STATX_ATTR_APPEND,
+ inode.i_flags & EXT2_APPEND_FL);
+ fuse4fs_set_statx_attr(stx, STATX_ATTR_NODUMP,
+ inode.i_flags & EXT2_NODUMP_FL);
+
+ fuse4fs_statx_directio(ff, stx);
+
+ return 0;
+}
+
+static void op_statx(fuse_req_t req, fuse_ino_t fino, int flags, int mask,
+ struct fuse_file_info *fi)
+{
+ struct statx stx = { };
+ struct fuse4fs *ff = fuse4fs_get(req);
+ ext2_ino_t ino;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(req);
+ FUSE4FS_CONVERT_FINO(req, &ino, fino);
+ fuse4fs_start(ff);
+ ret = fuse4fs_statx(ff, ino, mask, &stx);
+ if (ret)
+ goto out;
+out:
+ fuse4fs_finish(ff, ret);
+ if (ret)
+ fuse_reply_err(req, -ret);
+ else
+ fuse_reply_statx(req, 0, &stx, FUSE4FS_ATTR_TIMEOUT);
+}
+#else
+# define op_statx NULL
+#endif
+
static void op_readlink(fuse_req_t req, fuse_ino_t fino)
{
struct ext2_inode inode;
@@ -6770,6 +6900,9 @@ static struct fuse_lowlevel_ops fs_ops = {
#ifdef SUPPORT_FALLOCATE
.fallocate = op_fallocate,
#endif
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 18)
+ .statx = op_statx,
+#endif
#ifdef HAVE_FUSE_IOMAP
.iomap_begin = op_iomap_begin,
.iomap_end = op_iomap_end,
* [PATCH 19/19] fuse2fs: enable atomic writes
2025-08-21 0:49 ` [PATCHSET RFC v4 3/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
` (17 preceding siblings ...)
2025-08-21 1:20 ` [PATCH 18/19] fuse2fs: implement statx Darrick J. Wong
@ 2025-08-21 1:20 ` Darrick J. Wong
18 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:20 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Advertise iomap's ability to perform single-fsblock atomic writes.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
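For illustration, the clamping done in fuse2fs_configure_atomic_write()
can be looked at in isolation. What follows is a minimal sketch, not the
patch's code: struct awu_range and awu_clamp_to_fsblock() are hypothetical
names, and the device limits are passed in as plain integers instead of
being read with statx().

```c
#include <stdbool.h>

/* Hypothetical result container; not a type from fuse2fs or libfuse. */
struct awu_range {
	unsigned int min;
	unsigned int max;
};

/*
 * Narrow the device's atomic write unit range to a single fs block.
 *
 * Returns true and fills *out when the block size lies inside the
 * device's [dev_min, dev_max] range; returns false (leaving *out
 * untouched) when it does not, in which case nothing is advertised.
 */
static bool awu_clamp_to_fsblock(unsigned int blocksize,
				 unsigned int dev_min,
				 unsigned int dev_max,
				 struct awu_range *out)
{
	/* Same idea as max(blocksize, dev_min) / min(blocksize, dev_max). */
	unsigned int awu_min = blocksize > dev_min ? blocksize : dev_min;
	unsigned int awu_max = blocksize < dev_max ? blocksize : dev_max;

	if (awu_min > awu_max)
		return false;

	out->min = awu_min;
	out->max = awu_max;
	return true;
}
```

With a 4096-byte block size and a device advertising 512..65536, both
limits come out as 4096. If the device minimum is larger than the block
size, or the device maximum is smaller, the function reports failure.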
misc/fuse2fs.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
misc/fuse4fs.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 133 insertions(+), 2 deletions(-)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index a00c32e9f2cae8..04bb96f3438f23 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -281,6 +281,9 @@ struct fuse2fs {
void (*old_alloc_stats)(ext2_filsys fs, blk64_t blk, int inuse);
void (*old_alloc_stats_range)(ext2_filsys fs, blk64_t blk, blk_t num,
int inuse);
+#ifdef STATX_WRITE_ATOMIC
+ unsigned int awu_min, awu_max;
+#endif
#endif
unsigned int blockmask;
unsigned long offset;
@@ -580,9 +583,21 @@ static inline int fuse2fs_iomap_enabled(const struct fuse2fs *ff)
{
return ff->iomap_state >= IOMAP_ENABLED;
}
+
+static inline int fuse2fs_iomap_can_hw_atomic(const struct fuse2fs *ff)
+{
+ return fuse2fs_iomap_enabled(ff) &&
+ (ff->iomap_cap & FUSE_IOMAP_SUPPORT_ATOMIC) &&
+#ifdef STATX_WRITE_ATOMIC
+ ff->awu_min > 0 && ff->awu_max > 0;
+#else
+ 0;
+#endif
+}
#else
# define fuse2fs_iomap_enabled(...) (0)
# define fuse2fs_iomap_enabled(...) (0)
+# define fuse2fs_iomap_can_hw_atomic(...) (0)
#endif
static inline void fuse2fs_dump_extents(struct fuse2fs *ff, ext2_ino_t ino,
@@ -1631,14 +1646,19 @@ static int op_getattr(const char *path, struct stat *statbuf
static int op_getattr_iflags(const char *path, struct stat *statbuf,
unsigned int *iflags, struct fuse_file_info *fi)
{
+ struct fuse2fs *ff = fuse2fs_get();
int ret = op_getattr(path, statbuf, fi);
if (ret)
return ret;
- if (fuse_fs_can_enable_iomap(statbuf))
+ if (fuse_fs_can_enable_iomap(statbuf)) {
*iflags |= FUSE_IFLAG_IOMAP;
+ if (fuse2fs_iomap_can_hw_atomic(ff))
+ *iflags |= FUSE_IFLAG_ATOMIC;
+ }
+
return 0;
}
#endif
@@ -1744,6 +1764,15 @@ static int fuse2fs_statx(struct fuse2fs *ff, ext2_ino_t ino, int statx_mask,
fuse2fs_statx_directio(ff, stx);
+#ifdef STATX_WRITE_ATOMIC
+ if (fuse_fs_can_enable_iomapx(stx) && fuse2fs_iomap_can_hw_atomic(ff)) {
+ stx->stx_mask |= STATX_WRITE_ATOMIC;
+ stx->stx_atomic_write_unit_min = ff->awu_min;
+ stx->stx_atomic_write_unit_max = ff->awu_max;
+ stx->stx_atomic_write_segments_max = 1;
+ }
+#endif
+
return 0;
}
@@ -5868,6 +5897,9 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
}
}
+ if (opflags & FUSE_IOMAP_OP_ATOMIC)
+ read->flags |= FUSE_IOMAP_F_ATOMIC_BIO;
+
out_unlock:
fuse2fs_finish(ff, ret);
return ret;
@@ -6027,6 +6059,38 @@ static int fuse2fs_set_bdev_blocksize(struct fuse2fs *ff, int fd)
return EIO;
}
+#ifdef STATX_WRITE_ATOMIC
+static void fuse2fs_configure_atomic_write(struct fuse2fs *ff, int bdev_fd)
+{
+ struct statx devx;
+ unsigned int awu_min, awu_max;
+ int ret;
+
+ if (!ext2fs_has_feature_extents(ff->fs->super))
+ return;
+
+ ret = statx(bdev_fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &devx);
+ if (ret)
+ return;
+ if (!(devx.stx_mask & STATX_WRITE_ATOMIC))
+ return;
+
+ awu_min = max(ff->fs->blocksize, devx.stx_atomic_write_unit_min);
+ awu_max = min(ff->fs->blocksize, devx.stx_atomic_write_unit_max);
+ if (awu_min > awu_max)
+ return;
+
+ log_printf(ff, "%s awu_min: %u, awu_max: %u\n",
+ _("Supports (experimental) DIO atomic writes"),
+ awu_min, awu_max);
+
+ ff->awu_min = awu_min;
+ ff->awu_max = awu_max;
+}
+#else
+# define fuse2fs_configure_atomic_write(...) ((void)0)
+#endif
+
static int fuse2fs_iomap_config_devices(struct fuse2fs *ff)
{
errcode_t err;
@@ -6051,6 +6115,8 @@ static int fuse2fs_iomap_config_devices(struct fuse2fs *ff)
dbg_printf(ff, "%s: registered iomap dev fd=%d iomap_dev=%u\n",
__func__, fd, ff->iomap_dev);
+ fuse2fs_configure_atomic_write(ff, fd);
+
ff->iomap_dev = ret;
return 0;
}
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index b45f92a1cdbe25..43fc21149ba564 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -277,6 +277,9 @@ struct fuse4fs {
void (*old_alloc_stats)(ext2_filsys fs, blk64_t blk, int inuse);
void (*old_alloc_stats_range)(ext2_filsys fs, blk64_t blk, blk_t num,
int inuse);
+#ifdef STATX_WRITE_ATOMIC
+ unsigned int awu_min, awu_max;
+#endif
#endif
unsigned int blockmask;
unsigned long offset;
@@ -735,8 +738,20 @@ static inline int fuse4fs_iomap_enabled(const struct fuse4fs *ff)
{
return ff->iomap_state >= IOMAP_ENABLED;
}
+
+static inline int fuse4fs_iomap_can_hw_atomic(const struct fuse4fs *ff)
+{
+ return fuse4fs_iomap_enabled(ff) &&
+ (ff->iomap_cap & FUSE_IOMAP_SUPPORT_ATOMIC) &&
+#ifdef STATX_WRITE_ATOMIC
+ ff->awu_min > 0 && ff->awu_max > 0;
+#else
+ 0;
+#endif
+}
#else
# define fuse4fs_iomap_enabled(...) (0)
+# define fuse4fs_iomap_can_hw_atomic(...) (0)
#endif
static inline void fuse4fs_dump_extents(struct fuse4fs *ff, ext2_ino_t ino,
@@ -1737,8 +1752,12 @@ static int fuse4fs_stat_inode(struct fuse4fs *ff, ext2_ino_t ino,
fstat->iflags = 0;
#ifdef HAVE_FUSE_IOMAP
- if (fuse4fs_iomap_enabled(ff))
+ if (fuse4fs_iomap_enabled(ff)) {
fstat->iflags |= FUSE_IFLAG_IOMAP;
+
+ if (fuse4fs_iomap_can_hw_atomic(ff))
+ fstat->iflags |= FUSE_IFLAG_ATOMIC;
+ }
#endif
return 0;
@@ -1913,6 +1932,15 @@ static int fuse4fs_statx(struct fuse4fs *ff, ext2_ino_t ino, int statx_mask,
fuse4fs_statx_directio(ff, stx);
+#ifdef STATX_WRITE_ATOMIC
+ if (fuse4fs_iomap_can_hw_atomic(ff)) {
+ stx->stx_mask |= STATX_WRITE_ATOMIC;
+ stx->stx_atomic_write_unit_min = ff->awu_min;
+ stx->stx_atomic_write_unit_max = ff->awu_max;
+ stx->stx_atomic_write_segments_max = 1;
+ }
+#endif
+
return 0;
}
@@ -6193,6 +6221,9 @@ static void op_iomap_begin(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
}
}
+ if (opflags & FUSE_IOMAP_OP_ATOMIC)
+ read.flags |= FUSE_IOMAP_F_ATOMIC_BIO;
+
out_unlock:
fuse4fs_finish(ff, ret);
if (ret)
@@ -6355,6 +6386,38 @@ static int fuse4fs_set_bdev_blocksize(struct fuse4fs *ff, int fd)
return EIO;
}
+#ifdef STATX_WRITE_ATOMIC
+static void fuse4fs_configure_atomic_write(struct fuse4fs *ff, int bdev_fd)
+{
+ struct statx devx;
+ unsigned int awu_min, awu_max;
+ int ret;
+
+ if (!ext2fs_has_feature_extents(ff->fs->super))
+ return;
+
+ ret = statx(bdev_fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &devx);
+ if (ret)
+ return;
+ if (!(devx.stx_mask & STATX_WRITE_ATOMIC))
+ return;
+
+ awu_min = max(ff->fs->blocksize, devx.stx_atomic_write_unit_min);
+ awu_max = min(ff->fs->blocksize, devx.stx_atomic_write_unit_max);
+ if (awu_min > awu_max)
+ return;
+
+ log_printf(ff, "%s awu_min: %u, awu_max: %u\n",
+ _("Supports (experimental) DIO atomic writes"),
+ awu_min, awu_max);
+
+ ff->awu_min = awu_min;
+ ff->awu_max = awu_max;
+}
+#else
+# define fuse4fs_configure_atomic_write(...) ((void)0)
+#endif
+
static int fuse4fs_iomap_config_devices(struct fuse4fs *ff)
{
errcode_t err;
@@ -6379,6 +6442,8 @@ static int fuse4fs_iomap_config_devices(struct fuse4fs *ff)
dbg_printf(ff, "%s: registered iomap dev fd=%d iomap_dev=%u\n",
__func__, fd, ff->iomap_dev);
+ fuse4fs_configure_atomic_write(ff, fd);
+
ff->iomap_dev = ret;
return 0;
}
* [PATCH 1/2] fuse2fs: enable caching of iomaps
2025-08-21 0:50 ` [PATCHSET RFC v4 4/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
@ 2025-08-21 1:20 ` Darrick J. Wong
2025-08-21 1:21 ` [PATCH 2/2] fuse2fs: be smarter about caching iomaps Darrick J. Wong
1 sibling, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:20 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Cache the iomaps that we generate in the kernel so that subsequent I/O can
reuse them, for better performance.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse2fs.c | 23 +++++++++++++++++++++++
misc/fuse4fs.c | 24 ++++++++++++++++++++++++
2 files changed, 47 insertions(+)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 04bb96f3438f23..da384b10bc6bc5 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -284,6 +284,7 @@ struct fuse2fs {
#ifdef STATX_WRITE_ATOMIC
unsigned int awu_min, awu_max;
#endif
+ uint8_t iomap_cache;
#endif
unsigned int blockmask;
unsigned long offset;
@@ -5900,6 +5901,23 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
if (opflags & FUSE_IOMAP_OP_ATOMIC)
read->flags |= FUSE_IOMAP_F_ATOMIC_BIO;
+ /*
+ * Cache the mapping in the kernel so that we can reuse them for
+ * subsequent IO.
+ */
+ if (ff->iomap_cache) {
+ ret = fuse_fs_iomap_upsert(nodeid, attr_ino, read, NULL);
+ if (ret) {
+ ret = translate_error(fs, attr_ino, -ret);
+ goto out_unlock;
+ } else {
+ /* Tell the kernel to retry from cache */
+ read->type = FUSE_IOMAP_TYPE_RETRY_CACHE;
+ read->dev = FUSE_IOMAP_DEV_NULL;
+ read->addr = FUSE_IOMAP_NULL_ADDR;
+ }
+ }
+
out_unlock:
fuse2fs_finish(ff, ret);
return ret;
@@ -6718,6 +6736,10 @@ static struct fuse_opt fuse2fs_opts[] = {
FUSE2FS_OPT("timing", timing, 1),
#endif
FUSE2FS_OPT("noblkdev", noblkdev, 1),
+#ifdef HAVE_FUSE_IOMAP
+ FUSE2FS_OPT("iomap_cache", iomap_cache, 1),
+ FUSE2FS_OPT("noiomap_cache", iomap_cache, 0),
+#endif
#ifdef HAVE_FUSE_IOMAP
#ifdef MS_LAZYTIME
@@ -6952,6 +6974,7 @@ int main(int argc, char *argv[])
.iomap_want = FT_DEFAULT,
.iomap_state = IOMAP_UNKNOWN,
.iomap_dev = FUSE_IOMAP_DEV_NULL,
+ .iomap_cache = 1,
#endif
};
errcode_t err;
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 43fc21149ba564..a2601b5ca94970 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -280,6 +280,7 @@ struct fuse4fs {
#ifdef STATX_WRITE_ATOMIC
unsigned int awu_min, awu_max;
#endif
+ uint8_t iomap_cache;
#endif
unsigned int blockmask;
unsigned long offset;
@@ -6224,6 +6225,24 @@ static void op_iomap_begin(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
if (opflags & FUSE_IOMAP_OP_ATOMIC)
read.flags |= FUSE_IOMAP_F_ATOMIC_BIO;
+ /*
+ * Cache the mapping in the kernel so that we can reuse them for
+ * subsequent IO.
+ */
+ if (ff->iomap_cache) {
+ ret = fuse_lowlevel_notify_iomap_upsert(ff->fuse, fino, ino,
+ &read, NULL);
+ if (ret) {
+ ret = translate_error(fs, ino, -ret);
+ goto out_unlock;
+ } else {
+ /* Tell the kernel to retry from cache */
+ read.type = FUSE_IOMAP_TYPE_RETRY_CACHE;
+ read.dev = FUSE_IOMAP_DEV_NULL;
+ read.addr = FUSE_IOMAP_NULL_ADDR;
+ }
+ }
+
out_unlock:
fuse4fs_finish(ff, ret);
if (ret)
@@ -7029,6 +7048,10 @@ static struct fuse_opt fuse4fs_opts[] = {
FUSE4FS_OPT("timing", timing, 1),
#endif
FUSE4FS_OPT("noblkdev", noblkdev, 1),
+#ifdef HAVE_FUSE_IOMAP
+ FUSE4FS_OPT("iomap_cache", iomap_cache, 1),
+ FUSE4FS_OPT("noiomap_cache", iomap_cache, 0),
+#endif
#ifdef HAVE_FUSE_IOMAP
#ifdef MS_LAZYTIME
@@ -7362,6 +7385,7 @@ int main(int argc, char *argv[])
.iomap_want = FT_DEFAULT,
.iomap_state = IOMAP_UNKNOWN,
.iomap_dev = FUSE_IOMAP_DEV_NULL,
+ .iomap_cache = 1,
#endif
.translate_inums = 1,
};
* [PATCH 2/2] fuse2fs: be smarter about caching iomaps
2025-08-21 0:50 ` [PATCHSET RFC v4 4/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
2025-08-21 1:20 ` [PATCH 1/2] fuse2fs: enable caching of iomaps Darrick J. Wong
@ 2025-08-21 1:21 ` Darrick J. Wong
1 sibling, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:21 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
There's no point in caching iomaps when we're initiating a disk write to
an unwritten region -- we'll just replace the mapping in the ioend.
Save ourselves a bit of overhead by screening for that.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
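The screening rule can be read as a standalone predicate. The sketch
below is hedged: the SKETCH_* constants and sketch_should_cache() are
hypothetical stand-ins for the FUSE iomap flags and the helper added by
this patch, their numeric values are made up, and it assumes that
FUSE2FS_FSB_TO_B(ff, 16) means "16 filesystem blocks, in bytes".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the FUSE iomap op flags; values are made up. */
#define SKETCH_OP_WRITE		(1U << 0)
#define SKETCH_OP_DIRECT	(1U << 1)

/* Hypothetical stand-in for the mapping type field. */
enum sketch_type {
	SKETCH_TYPE_MAPPED,
	SKETCH_TYPE_UNWRITTEN,
};

/*
 * Decide whether a mapping is worth pushing into the kernel cache.
 * Only small unwritten extents that are about to be written are
 * skipped; everything else is cached when caching is enabled.
 */
static bool sketch_should_cache(bool cache_enabled, uint32_t opflags,
				enum sketch_type type, uint64_t length,
				uint32_t blocksize)
{
	if (!cache_enabled)
		return false;

	/* Neither a write nor direct I/O: always cache. */
	if ((opflags & (SKETCH_OP_WRITE | SKETCH_OP_DIRECT)) == 0)
		return true;

	/* Written or hole mappings will not be replaced at I/O end. */
	if (type != SKETCH_TYPE_UNWRITTEN)
		return true;

	/* Large unwritten extents are still worth caching. */
	return length >= (uint64_t)16 * blocksize;
}
```

With 4096-byte blocks, a write into a one-block unwritten extent is not
cached, while a write into a 16-block (65536-byte) unwritten extent is.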
misc/fuse2fs.c | 24 +++++++++++++++++++++++-
misc/fuse4fs.c | 24 +++++++++++++++++++++++-
2 files changed, 46 insertions(+), 2 deletions(-)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index da384b10bc6bc5..1b44b836484b14 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5833,6 +5833,28 @@ static int fuse2fs_iomap_begin_write(struct fuse2fs *ff, ext2_ino_t ino,
return 0;
}
+static inline int fuse2fs_should_cache_iomap(struct fuse2fs *ff,
+ uint32_t opflags,
+ const struct fuse_file_iomap *map)
+{
+ if (!ff->iomap_cache)
+ return 0;
+
+ /*
+ * Don't cache small unwritten extents that are being written to the
+ * device because the overhead of keeping the cache updated will tank
+ * performance.
+ */
+ if ((opflags & (FUSE_IOMAP_OP_WRITE | FUSE_IOMAP_OP_DIRECT)) == 0)
+ return 1;
+ if (map->type != FUSE_IOMAP_TYPE_UNWRITTEN)
+ return 1;
+ if (map->length >= FUSE2FS_FSB_TO_B(ff, 16))
+ return 1;
+
+ return 0;
+}
+
static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
off_t pos, uint64_t count, uint32_t opflags,
struct fuse_file_iomap *read,
@@ -5905,7 +5927,7 @@ static int op_iomap_begin(const char *path, uint64_t nodeid, uint64_t attr_ino,
* Cache the mapping in the kernel so that we can reuse them for
* subsequent IO.
*/
- if (ff->iomap_cache) {
+ if (fuse2fs_should_cache_iomap(ff, opflags, read)) {
ret = fuse_fs_iomap_upsert(nodeid, attr_ino, read, NULL);
if (ret) {
ret = translate_error(fs, attr_ino, -ret);
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index a2601b5ca94970..df8da745fcd7c7 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -6159,6 +6159,28 @@ static int fuse4fs_iomap_begin_write(struct fuse4fs *ff, ext2_ino_t ino,
return 0;
}
+static inline int fuse4fs_should_cache_iomap(struct fuse4fs *ff,
+ uint32_t opflags,
+ const struct fuse_file_iomap *map)
+{
+ if (!ff->iomap_cache)
+ return 0;
+
+ /*
+ * Don't cache small unwritten extents that are being written to the
+ * device because the overhead of keeping the cache updated will tank
+ * performance.
+ */
+ if ((opflags & (FUSE_IOMAP_OP_WRITE | FUSE_IOMAP_OP_DIRECT)) == 0)
+ return 1;
+ if (map->type != FUSE_IOMAP_TYPE_UNWRITTEN)
+ return 1;
+ if (map->length >= FUSE4FS_FSB_TO_B(ff, 16))
+ return 1;
+
+ return 0;
+}
+
static void op_iomap_begin(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
off_t pos, uint64_t count, uint32_t opflags)
{
@@ -6229,7 +6251,7 @@ static void op_iomap_begin(fuse_req_t req, fuse_ino_t fino, uint64_t dontcare,
* Cache the mapping in the kernel so that we can reuse them for
* subsequent IO.
*/
- if (ff->iomap_cache) {
+ if (fuse4fs_should_cache_iomap(ff, opflags, &read)) {
ret = fuse_lowlevel_notify_iomap_upsert(ff->fuse, fino, ino,
&read, NULL);
if (ret) {
* [PATCH 1/8] fuse2fs: skip permission checking on utimens when iomap is enabled
2025-08-21 0:50 ` [PATCHSET RFC v4 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
@ 2025-08-21 1:21 ` Darrick J. Wong
2025-08-21 1:21 ` [PATCH 2/8] fuse2fs: let the kernel tell us about acl/mode updates Darrick J. Wong
` (6 subsequent siblings)
7 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:21 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
When iomap is enabled, the kernel is in charge of enforcing permission
checks on timestamp updates for files, so we no longer need to do that in
userspace.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse2fs.c | 11 +++++++----
misc/fuse4fs.c | 11 +++++++----
2 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 1b44b836484b14..95e850e3cd49f1 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -4609,13 +4609,16 @@ static int op_utimens(const char *path, const struct timespec ctv[2]
/*
* ext4 allows timestamp updates of append-only files but only if we're
- * setting to current time
+ * setting to current time. If iomap is enabled, the kernel does the
+ * permission checking for timestamp updates; skip the access check.
*/
if (ctv[0].tv_nsec == UTIME_NOW && ctv[1].tv_nsec == UTIME_NOW)
access |= A_OK;
- ret = check_inum_access(ff, ino, access);
- if (ret)
- goto out;
+ if (!fuse2fs_iomap_enabled(ff)) {
+ ret = check_inum_access(ff, ino, access);
+ if (ret)
+ goto out;
+ }
err = fuse2fs_read_inode(fs, ino, &inode);
if (err) {
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index df8da745fcd7c7..8d547e03f558df 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -4816,13 +4816,16 @@ static int fuse4fs_utimens(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
/*
* ext4 allows timestamp updates of append-only files but only if we're
- * setting to current time
+ * setting to current time. If iomap is enabled, the kernel does the
+ * permission checking for timestamp updates; skip the access check.
*/
if (aact == TA_NOW && mact == TA_NOW)
access |= A_OK;
- ret = fuse4fs_inum_access(ff, ctxt, ino, access);
- if (ret)
- return ret;
+ if (!fuse4fs_iomap_enabled(ff)) {
+ ret = fuse4fs_inum_access(ff, ctxt, ino, access);
+ if (ret)
+ return ret;
+ }
if (aact != TA_OMIT)
EXT4_INODE_SET_XTIME(i_atime, &atime, inode);
* [PATCH 2/8] fuse2fs: let the kernel tell us about acl/mode updates
2025-08-21 0:50 ` [PATCHSET RFC v4 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
2025-08-21 1:21 ` [PATCH 1/8] fuse2fs: skip permission checking on utimens " Darrick J. Wong
@ 2025-08-21 1:21 ` Darrick J. Wong
2025-08-21 1:21 ` [PATCH 3/8] fuse2fs: better debugging for file mode updates Darrick J. Wong
` (5 subsequent siblings)
7 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:21 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
When the kernel is running in iomap mode, it also manages all ACL updates
and the resulting file mode changes for us. Disable the manual
implementations of both in fuse2fs and fuse4fs.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse2fs.c | 4 ++--
misc/fuse4fs.c | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 95e850e3cd49f1..11ddf6a4001955 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1939,7 +1939,7 @@ static int propagate_default_acls(struct fuse2fs *ff, ext2_ino_t parent,
size_t deflen;
int ret;
- if (!ff->acl)
+ if (!ff->acl || fuse2fs_iomap_enabled(ff))
return 0;
ret = __getxattr(ff, parent, XATTR_NAME_POSIX_ACL_DEFAULT, &def,
@@ -3224,7 +3224,7 @@ static int op_chmod(const char *path, mode_t mode
* of the user's groups, but FUSE only tells us about the primary
* group.
*/
- if (!is_superuser(ff, ctxt)) {
+ if (!fuse2fs_iomap_enabled(ff) && !is_superuser(ff, ctxt)) {
ret = in_file_group(ctxt, &inode);
if (ret < 0)
goto out;
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 8d547e03f558df..ef6f3b33db99fd 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -2112,7 +2112,7 @@ static int fuse4fs_propagate_default_acls(struct fuse4fs *ff, ext2_ino_t parent,
size_t deflen;
int ret;
- if (!ff->acl)
+ if (!ff->acl || fuse4fs_iomap_enabled(ff))
return 0;
ret = fuse4fs_getxattr(ff, parent, XATTR_NAME_POSIX_ACL_DEFAULT, &def,
@@ -3480,7 +3480,7 @@ static int fuse4fs_chmod(struct fuse4fs *ff, fuse_req_t req, ext2_ino_t ino,
* of the user's groups, but FUSE only tells us about the primary
* group.
*/
- if (!fuse4fs_is_superuser(ff, ctxt)) {
+ if (!fuse4fs_iomap_enabled(ff) && !fuse4fs_is_superuser(ff, ctxt)) {
ret = fuse4fs_in_file_group(ff, req, inode);
if (ret < 0)
return ret;
* [PATCH 3/8] fuse2fs: better debugging for file mode updates
2025-08-21 0:50 ` [PATCHSET RFC v4 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
2025-08-21 1:21 ` [PATCH 1/8] fuse2fs: skip permission checking on utimens " Darrick J. Wong
2025-08-21 1:21 ` [PATCH 2/8] fuse2fs: let the kernel tell us about acl/mode updates Darrick J. Wong
@ 2025-08-21 1:21 ` Darrick J. Wong
2025-08-21 1:22 ` [PATCH 4/8] fuse2fs: debug timestamp updates Darrick J. Wong
` (4 subsequent siblings)
7 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:21 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Improve the tracing of a chmod operation so that we can debug file mode
updates.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
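The mode computation that the new trace output reports can be checked
by hand. The helper below is a minimal sketch with a hypothetical name,
sketch_chmod_mode(); it mirrors the expression used in the diff, where
the low 12 bits (permissions plus setuid, setgid and sticky) come from
the request and the file type bits come from the existing inode mode.

```c
#include <stdint.h>

/*
 * Combine an existing on-disk mode with a requested chmod mode.
 * The file type bits (above the low 12) are preserved from old_mode;
 * the low 12 bits are taken from the request.
 */
static uint16_t sketch_chmod_mode(uint16_t old_mode, uint32_t requested)
{
	return (uint16_t)((old_mode & ~0xFFF) | (requested & 0xFFF));
}
```

For a regular file with mode 0100644, a request of 0755 yields 0100755.
Any file type bits present in the request are ignored.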
misc/fuse2fs.c | 12 +++++++-----
misc/fuse4fs.c | 10 ++++++----
2 files changed, 13 insertions(+), 9 deletions(-)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 11ddf6a4001955..44f76e9bed5f42 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -3195,6 +3195,7 @@ static int op_chmod(const char *path, mode_t mode
errcode_t err;
ext2_ino_t ino;
struct ext2_inode_large inode;
+ mode_t new_mode;
int ret = 0;
FUSE2FS_CHECK_CONTEXT(ff);
@@ -3233,11 +3234,12 @@ static int op_chmod(const char *path, mode_t mode
mode &= ~S_ISGID;
}
- inode.i_mode &= ~0xFFF;
- inode.i_mode |= mode & 0xFFF;
+ new_mode = (inode.i_mode & ~0xFFF) | (mode & 0xFFF);
- dbg_printf(ff, "%s: path=%s new_mode=0%o ino=%d\n", __func__,
- path, inode.i_mode, ino);
+ dbg_printf(ff, "%s: path=%s old_mode=0%o new_mode=0%o ino=%d\n",
+ __func__, path, inode.i_mode, new_mode, ino);
+
+ inode.i_mode = new_mode;
ret = update_ctime(fs, ino, &inode);
if (ret)
@@ -3260,12 +3262,12 @@ static int op_chown(const char *path, uid_t owner, gid_t group
#endif
)
{
+ struct ext2_inode_large inode;
struct fuse_context *ctxt = fuse_get_context();
struct fuse2fs *ff = fuse2fs_get();
ext2_filsys fs;
errcode_t err;
ext2_ino_t ino;
- struct ext2_inode_large inode;
int ret = 0;
FUSE2FS_CHECK_CONTEXT(ff);
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index ef6f3b33db99fd..b68573f654279d 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -3463,6 +3463,7 @@ static int fuse4fs_chmod(struct fuse4fs *ff, fuse_req_t req, ext2_ino_t ino,
mode_t mode, struct ext2_inode_large *inode)
{
const struct fuse_ctx *ctxt = fuse_req_ctx(req);
+ mode_t new_mode;
int ret = 0;
dbg_printf(ff, "%s: ino=%d mode=0%o\n", __func__, ino, mode);
@@ -3489,11 +3490,12 @@ static int fuse4fs_chmod(struct fuse4fs *ff, fuse_req_t req, ext2_ino_t ino,
mode &= ~S_ISGID;
}
- inode->i_mode &= ~0xFFF;
- inode->i_mode |= mode & 0xFFF;
+ new_mode = (inode->i_mode & ~0xFFF) | (mode & 0xFFF);
- dbg_printf(ff, "%s: ino=%d new_mode=0%o\n",
- __func__, ino, inode->i_mode);
+ dbg_printf(ff, "%s: ino=%d old_mode=0%o new_mode=0%o\n",
+ __func__, ino, inode->i_mode, new_mode);
+
+ inode->i_mode = new_mode;
return 0;
}
* [PATCH 4/8] fuse2fs: debug timestamp updates
2025-08-21 0:50 ` [PATCHSET RFC v4 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (2 preceding siblings ...)
2025-08-21 1:21 ` [PATCH 3/8] fuse2fs: better debugging for file mode updates Darrick J. Wong
@ 2025-08-21 1:22 ` Darrick J. Wong
2025-08-21 1:22 ` [PATCH 5/8] fuse2fs: use coarse timestamps for iomap mode Darrick J. Wong
` (3 subsequent siblings)
7 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:22 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Add tracing for timestamp updates to files.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse2fs.c | 99 +++++++++++++++++++++++++++++++++++---------------------
1 file changed, 62 insertions(+), 37 deletions(-)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 44f76e9bed5f42..fe7d6a2568dcf0 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -693,7 +693,8 @@ static void increment_version(struct ext2_inode_large *inode)
inode->i_version_hi = ver >> 32;
}
-static void init_times(struct ext2_inode_large *inode)
+static void fuse2fs_init_timestamps(struct fuse2fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *inode)
{
struct timespec now;
@@ -703,11 +704,15 @@ static void init_times(struct ext2_inode_large *inode)
EXT4_INODE_SET_XTIME(i_mtime, &now, inode);
EXT4_EINODE_SET_XTIME(i_crtime, &now, inode);
increment_version(inode);
+
+ dbg_printf(ff, "%s: ino=%u time %ld:%lu\n", __func__, ino, now.tv_sec,
+ now.tv_nsec);
}
-static int update_ctime(ext2_filsys fs, ext2_ino_t ino,
- struct ext2_inode_large *pinode)
+static int fuse2fs_update_ctime(struct fuse2fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *pinode)
{
+ ext2_filsys fs = ff->fs;
errcode_t err;
struct timespec now;
struct ext2_inode_large inode;
@@ -718,6 +723,10 @@ static int update_ctime(ext2_filsys fs, ext2_ino_t ino,
if (pinode) {
increment_version(pinode);
EXT4_INODE_SET_XTIME(i_ctime, &now, pinode);
+
+ dbg_printf(ff, "%s: ino=%u ctime %ld:%lu\n", __func__, ino,
+ now.tv_sec, now.tv_nsec);
+
return 0;
}
@@ -729,6 +738,9 @@ static int update_ctime(ext2_filsys fs, ext2_ino_t ino,
increment_version(&inode);
EXT4_INODE_SET_XTIME(i_ctime, &now, &inode);
+ dbg_printf(ff, "%s: ino=%u ctime %ld:%lu\n", __func__, ino,
+ now.tv_sec, now.tv_nsec);
+
err = fuse2fs_write_inode(fs, ino, &inode);
if (err)
return translate_error(fs, ino, err);
@@ -736,8 +748,9 @@ static int update_ctime(ext2_filsys fs, ext2_ino_t ino,
return 0;
}
-static int update_atime(ext2_filsys fs, ext2_ino_t ino)
+static int fuse2fs_update_atime(struct fuse2fs *ff, ext2_ino_t ino)
{
+ ext2_filsys fs = ff->fs;
errcode_t err;
struct ext2_inode_large inode, *pinode;
struct timespec atime, mtime, now;
@@ -756,6 +769,10 @@ static int update_atime(ext2_filsys fs, ext2_ino_t ino)
dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / NSEC_PER_SEC);
dnow = now.tv_sec + ((double)now.tv_nsec / NSEC_PER_SEC);
+ dbg_printf(ff, "%s: ino=%u atime %ld:%lu mtime %ld:%lu now %ld:%lu\n",
+ __func__, ino, atime.tv_sec, atime.tv_nsec, mtime.tv_sec,
+ mtime.tv_nsec, now.tv_sec, now.tv_nsec);
+
/*
* If atime is newer than mtime and atime hasn't been updated in thirty
* seconds, skip the atime update. Same idea as Linux "relatime". Use
@@ -772,9 +789,10 @@ static int update_atime(ext2_filsys fs, ext2_ino_t ino)
return 0;
}
-static int update_mtime(ext2_filsys fs, ext2_ino_t ino,
- struct ext2_inode_large *pinode)
+static int fuse2fs_update_mtime(struct fuse2fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *pinode)
{
+ ext2_filsys fs = ff->fs;
errcode_t err;
struct ext2_inode_large inode;
struct timespec now;
@@ -784,6 +802,10 @@ static int update_mtime(ext2_filsys fs, ext2_ino_t ino,
EXT4_INODE_SET_XTIME(i_mtime, &now, pinode);
EXT4_INODE_SET_XTIME(i_ctime, &now, pinode);
increment_version(pinode);
+
+ dbg_printf(ff, "%s: ino=%u mtime/ctime %ld:%lu\n",
+ __func__, ino, now.tv_sec, now.tv_nsec);
+
return 0;
}
@@ -796,6 +818,9 @@ static int update_mtime(ext2_filsys fs, ext2_ino_t ino,
EXT4_INODE_SET_XTIME(i_ctime, &now, &inode);
increment_version(&inode);
+ dbg_printf(ff, "%s: ino=%u mtime/ctime %ld:%lu\n",
+ __func__, ino, now.tv_sec, now.tv_nsec);
+
err = fuse2fs_write_inode(fs, ino, &inode);
if (err)
return translate_error(fs, ino, err);
@@ -1860,7 +1885,7 @@ static int op_readlink(const char *path, char *buf, size_t len)
buf[len] = 0;
if (fuse2fs_is_writeable(ff)) {
- ret = update_atime(fs, ino);
+ ret = fuse2fs_update_atime(ff, ino);
if (ret)
goto out;
}
@@ -2134,7 +2159,7 @@ static int op_mknod(const char *path, mode_t mode, dev_t dev)
goto out2;
}
- ret = update_mtime(fs, parent, NULL);
+ ret = fuse2fs_update_mtime(ff, parent, NULL);
if (ret)
goto out2;
@@ -2157,7 +2182,7 @@ static int op_mknod(const char *path, mode_t mode, dev_t dev)
}
inode.i_generation = ff->next_generation++;
- init_times(&inode);
+ fuse2fs_init_timestamps(ff, child, &inode);
err = fuse2fs_write_inode(fs, child, &inode);
if (err) {
ret = translate_error(fs, child, err);
@@ -2243,7 +2268,7 @@ static int op_mkdir(const char *path, mode_t mode)
goto out2;
}
- ret = update_mtime(fs, parent, NULL);
+ ret = fuse2fs_update_mtime(ff, parent, NULL);
if (ret)
goto out2;
@@ -2270,7 +2295,7 @@ static int op_mkdir(const char *path, mode_t mode)
if (parent_sgid)
inode.i_mode |= S_ISGID;
inode.i_generation = ff->next_generation++;
- init_times(&inode);
+ fuse2fs_init_timestamps(ff, child, &inode);
err = fuse2fs_write_inode(fs, child, &inode);
if (err) {
@@ -2353,7 +2378,7 @@ static int fuse2fs_unlink(struct fuse2fs *ff, const char *path,
if (err)
return translate_error(fs, dir, err);
- ret = update_mtime(fs, dir, NULL);
+ ret = fuse2fs_update_mtime(ff, dir, NULL);
if (ret)
return ret;
@@ -2432,7 +2457,7 @@ static int remove_inode(struct fuse2fs *ff, ext2_ino_t ino)
inode.i_links_count--;
}
- ret = update_ctime(fs, ino, &inode);
+ ret = fuse2fs_update_ctime(ff, ino, &inode);
if (ret)
return ret;
@@ -2606,7 +2631,7 @@ static int __op_rmdir(struct fuse2fs *ff, const char *path)
}
if (inode.i_links_count > 1)
inode.i_links_count--;
- ret = update_mtime(fs, rds.parent, &inode);
+ ret = fuse2fs_update_mtime(ff, rds.parent, &inode);
if (ret)
goto out;
err = fuse2fs_write_inode(fs, rds.parent, &inode);
@@ -2699,7 +2724,7 @@ static int op_symlink(const char *src, const char *dest)
}
/* Update parent dir's mtime */
- ret = update_mtime(fs, parent, NULL);
+ ret = fuse2fs_update_mtime(ff, parent, NULL);
if (ret)
goto out2;
@@ -2723,7 +2748,7 @@ static int op_symlink(const char *src, const char *dest)
fuse2fs_set_uid(&inode, ctxt->uid);
fuse2fs_set_gid(&inode, gid);
inode.i_generation = ff->next_generation++;
- init_times(&inode);
+ fuse2fs_init_timestamps(ff, child, &inode);
err = fuse2fs_write_inode(fs, child, &inode);
if (err) {
@@ -2973,11 +2998,11 @@ static int op_rename(const char *from, const char *to
}
/* Update timestamps */
- ret = update_ctime(fs, from_ino, NULL);
+ ret = fuse2fs_update_ctime(ff, from_ino, NULL);
if (ret)
goto out2;
- ret = update_mtime(fs, to_dir_ino, NULL);
+ ret = fuse2fs_update_mtime(ff, to_dir_ino, NULL);
if (ret)
goto out2;
@@ -3066,7 +3091,7 @@ static int op_link(const char *src, const char *dest)
goto out2;
inode.i_links_count++;
- ret = update_ctime(fs, ino, &inode);
+ ret = fuse2fs_update_ctime(ff, ino, &inode);
if (ret)
goto out2;
@@ -3085,7 +3110,7 @@ static int op_link(const char *src, const char *dest)
goto out2;
}
- ret = update_mtime(fs, parent, NULL);
+ ret = fuse2fs_update_mtime(ff, parent, NULL);
if (ret)
goto out2;
@@ -3241,7 +3266,7 @@ static int op_chmod(const char *path, mode_t mode
inode.i_mode = new_mode;
- ret = update_ctime(fs, ino, &inode);
+ ret = fuse2fs_update_ctime(ff, ino, &inode);
if (ret)
goto out;
@@ -3311,7 +3336,7 @@ static int op_chown(const char *path, uid_t owner, gid_t group
fuse2fs_set_gid(&inode, group);
}
- ret = update_ctime(fs, ino, &inode);
+ ret = fuse2fs_update_ctime(ff, ino, &inode);
if (ret)
goto out;
@@ -3441,7 +3466,7 @@ static int fuse2fs_truncate(struct fuse2fs *ff, ext2_ino_t ino, off_t new_size)
if (err)
return translate_error(fs, ino, err);
- ret = update_mtime(fs, ino, NULL);
+ ret = fuse2fs_update_mtime(ff, ino, NULL);
if (ret)
return ret;
@@ -3671,7 +3696,7 @@ static int op_read(const char *path EXT2FS_ATTR((unused)), char *buf,
}
if (fuse2fs_is_writeable(ff)) {
- ret = update_atime(fs, fh->ino);
+ ret = fuse2fs_update_atime(ff, fh->ino);
if (ret)
goto out;
}
@@ -3755,7 +3780,7 @@ static int op_write(const char *path EXT2FS_ATTR((unused)),
goto out;
}
- ret = update_mtime(fs, fh->ino, NULL);
+ ret = fuse2fs_update_mtime(ff, fh->ino, NULL);
if (ret)
goto out;
@@ -4117,7 +4142,7 @@ static int op_setxattr(const char *path EXT2FS_ATTR((unused)),
goto out2;
}
- ret = update_ctime(fs, ino, NULL);
+ ret = fuse2fs_update_ctime(ff, ino, NULL);
out2:
err = ext2fs_xattrs_close(&h);
if (!ret && err)
@@ -4211,7 +4236,7 @@ static int op_removexattr(const char *path, const char *key)
goto out2;
}
- ret = update_ctime(fs, ino, NULL);
+ ret = fuse2fs_update_ctime(ff, ino, NULL);
out2:
err = ext2fs_xattrs_close(&h);
if (err && !ret)
@@ -4348,7 +4373,7 @@ static int op_readdir(const char *path EXT2FS_ATTR((unused)),
}
if (fuse2fs_is_writeable(ff)) {
- ret = update_atime(i.fs, fh->ino);
+ ret = fuse2fs_update_atime(ff, fh->ino);
if (ret)
goto out;
}
@@ -4453,7 +4478,7 @@ static int op_create(const char *path, mode_t mode, struct fuse_file_info *fp)
goto out2;
}
- ret = update_mtime(fs, parent, NULL);
+ ret = fuse2fs_update_mtime(ff, parent, NULL);
if (ret)
goto out2;
@@ -4484,7 +4509,7 @@ static int op_create(const char *path, mode_t mode, struct fuse_file_info *fp)
}
inode.i_generation = ff->next_generation++;
- init_times(&inode);
+ fuse2fs_init_timestamps(ff, child, &inode);
err = fuse2fs_write_inode(fs, child, &inode);
if (err) {
ret = translate_error(fs, child, err);
@@ -4555,7 +4580,7 @@ static int op_ftruncate(const char *path EXT2FS_ATTR((unused)),
goto out;
}
- ret = update_mtime(fs, fh->ino, NULL);
+ ret = fuse2fs_update_mtime(ff, fh->ino, NULL);
if (ret)
goto out;
@@ -4642,7 +4667,7 @@ static int op_utimens(const char *path, const struct timespec ctv[2]
if (tv[1].tv_nsec != UTIME_OMIT)
EXT4_INODE_SET_XTIME(i_mtime, &tv[1], &inode);
#endif /* UTIME_OMIT */
- ret = update_ctime(fs, ino, &inode);
+ ret = fuse2fs_update_ctime(ff, ino, &inode);
if (ret)
goto out;
@@ -4710,7 +4735,7 @@ static int ioctl_setflags(struct fuse2fs *ff, struct fuse2fs_file_handle *fh,
if (ret)
return ret;
- ret = update_ctime(fs, fh->ino, &inode);
+ ret = fuse2fs_update_ctime(ff, fh->ino, &inode);
if (ret)
return ret;
@@ -4757,7 +4782,7 @@ static int ioctl_setversion(struct fuse2fs *ff, struct fuse2fs_file_handle *fh,
inode.i_generation = generation;
- ret = update_ctime(fs, fh->ino, &inode);
+ ret = fuse2fs_update_ctime(ff, fh->ino, &inode);
if (ret)
return ret;
@@ -4862,7 +4887,7 @@ static int ioctl_fssetxattr(struct fuse2fs *ff, struct fuse2fs_file_handle *fh,
if (ext2fs_inode_includes(inode_size, i_projid))
inode.i_projid = fsx->fsx_projid;
- ret = update_ctime(fs, fh->ino, &inode);
+ ret = fuse2fs_update_ctime(ff, fh->ino, &inode);
if (ret)
return ret;
@@ -5130,7 +5155,7 @@ static int fuse2fs_allocate_range(struct fuse2fs *ff,
}
}
- err = update_mtime(fs, fh->ino, &inode);
+ err = fuse2fs_update_mtime(ff, fh->ino, &inode);
if (err)
return err;
@@ -5303,7 +5328,7 @@ static int fuse2fs_punch_range(struct fuse2fs *ff,
return translate_error(fs, fh->ino, err);
}
- err = update_mtime(fs, fh->ino, &inode);
+ err = fuse2fs_update_mtime(ff, fh->ino, &inode);
if (err)
return err;
* [PATCH 5/8] fuse2fs: use coarse timestamps for iomap mode
2025-08-21 0:50 ` [PATCHSET RFC v4 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (3 preceding siblings ...)
2025-08-21 1:22 ` [PATCH 4/8] fuse2fs: debug timestamp updates Darrick J. Wong
@ 2025-08-21 1:22 ` Darrick J. Wong
2025-08-21 1:22 ` [PATCH 6/8] fuse2fs: add tracing for retrieving timestamps Darrick J. Wong
` (2 subsequent siblings)
7 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:22 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
In iomap mode, the kernel is responsible for maintaining timestamps
because file writes don't upcall to fuse2fs. The kernel decides whether
[cm]time needs updating by checking whether [cm]time exactly matches the
coarse clock, rather than checking that [cm]time < coarse_clock. If
fuse2fs sets a fine-grained timestamp that is slightly ahead of the
coarse clock, timestamps can therefore appear to go backwards.
generic/423 doesn't like seeing btime > ctime from statx, so we'll use
the coarse clock in iomap mode.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse2fs.c | 34 +++++++++++++----
misc/fuse4fs.c | 110 +++++++++++++++++++++++++++++++++-----------------------
2 files changed, 90 insertions(+), 54 deletions(-)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index fe7d6a2568dcf0..df84884ba6b7d0 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -669,8 +669,24 @@ static inline void fuse2fs_dump_extents(struct fuse2fs *ff, ext2_ino_t ino,
ext2fs_extent_free(extents);
}
-static void get_now(struct timespec *now)
+static void fuse2fs_get_now(struct fuse2fs *ff, struct timespec *now)
{
+#ifdef CLOCK_REALTIME_COARSE
+ /*
+ * In iomap mode, the kernel is responsible for maintaining timestamps
+ * because file writes don't upcall to fuse2fs. The kernel's predicate
+ * for deciding if [cm]time should be updated bases its decisions off
+ * [cm]time being an exact match for the coarse clock (instead of
+ * checking that [cm]time < coarse_clock) which means that fuse2fs
+ * setting a fine-grained timestamp that is slightly ahead of the
+ * coarse clock can result in timestamps appearing to go backwards.
+ * generic/423 doesn't like seeing btime > ctime from statx, so we'll
+ * use the coarse clock in iomap mode.
+ */
+ if (fuse2fs_iomap_enabled(ff) &&
+ !clock_gettime(CLOCK_REALTIME_COARSE, now))
+ return;
+#endif
#ifdef CLOCK_REALTIME
if (!clock_gettime(CLOCK_REALTIME, now))
return;
@@ -698,7 +714,7 @@ static void fuse2fs_init_timestamps(struct fuse2fs *ff, ext2_ino_t ino,
{
struct timespec now;
- get_now(&now);
+ fuse2fs_get_now(ff, &now);
EXT4_INODE_SET_XTIME(i_atime, &now, inode);
EXT4_INODE_SET_XTIME(i_ctime, &now, inode);
EXT4_INODE_SET_XTIME(i_mtime, &now, inode);
@@ -717,7 +733,7 @@ static int fuse2fs_update_ctime(struct fuse2fs *ff, ext2_ino_t ino,
struct timespec now;
struct ext2_inode_large inode;
- get_now(&now);
+ fuse2fs_get_now(ff, &now);
/* If user already has a inode buffer, just update that */
if (pinode) {
@@ -763,7 +779,7 @@ static int fuse2fs_update_atime(struct fuse2fs *ff, ext2_ino_t ino)
pinode = &inode;
EXT4_INODE_GET_XTIME(i_atime, &atime, pinode);
EXT4_INODE_GET_XTIME(i_mtime, &mtime, pinode);
- get_now(&now);
+ fuse2fs_get_now(ff, &now);
datime = atime.tv_sec + ((double)atime.tv_nsec / NSEC_PER_SEC);
dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / NSEC_PER_SEC);
@@ -798,7 +814,7 @@ static int fuse2fs_update_mtime(struct fuse2fs *ff, ext2_ino_t ino,
struct timespec now;
if (pinode) {
- get_now(&now);
+ fuse2fs_get_now(ff, &now);
EXT4_INODE_SET_XTIME(i_mtime, &now, pinode);
EXT4_INODE_SET_XTIME(i_ctime, &now, pinode);
increment_version(pinode);
@@ -813,7 +829,7 @@ static int fuse2fs_update_mtime(struct fuse2fs *ff, ext2_ino_t ino,
if (err)
return translate_error(fs, ino, err);
- get_now(&now);
+ fuse2fs_get_now(ff, &now);
EXT4_INODE_SET_XTIME(i_mtime, &now, &inode);
EXT4_INODE_SET_XTIME(i_ctime, &now, &inode);
increment_version(&inode);
@@ -4657,9 +4673,9 @@ static int op_utimens(const char *path, const struct timespec ctv[2]
tv[1] = ctv[1];
#ifdef UTIME_NOW
if (tv[0].tv_nsec == UTIME_NOW)
- get_now(tv);
+ fuse2fs_get_now(ff, tv);
if (tv[1].tv_nsec == UTIME_NOW)
- get_now(tv + 1);
+ fuse2fs_get_now(ff, tv + 1);
#endif /* UTIME_NOW */
#ifdef UTIME_OMIT
if (tv[0].tv_nsec != UTIME_OMIT)
@@ -7389,7 +7405,7 @@ static int __translate_error(ext2_filsys fs, ext2_ino_t ino, errcode_t err,
error_message(err), func, line);
/* Make a note in the error log */
- get_now(&now);
+ fuse2fs_get_now(ff, &now);
ext2fs_set_tstamp(fs->super, s_last_error_time, now.tv_sec);
fs->super->s_last_error_ino = ino;
fs->super->s_last_error_line = line;
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index b68573f654279d..a06e963eab6afd 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -823,8 +823,24 @@ static inline void fuse4fs_dump_extents(struct fuse4fs *ff, ext2_ino_t ino,
ext2fs_extent_free(extents);
}
-static void get_now(struct timespec *now)
+static void fuse4fs_get_now(struct fuse4fs *ff, struct timespec *now)
{
+#ifdef CLOCK_REALTIME_COARSE
+ /*
+ * In iomap mode, the kernel is responsible for maintaining timestamps
+ * because file writes don't upcall to fuse4fs. The kernel's predicate
+ * for deciding if [cm]time should be updated bases its decisions off
+ * [cm]time being an exact match for the coarse clock (instead of
+ * checking that [cm]time < coarse_clock) which means that fuse4fs
+ * setting a fine-grained timestamp that is slightly ahead of the
+ * coarse clock can result in timestamps appearing to go backwards.
+ * generic/423 doesn't like seeing btime > ctime from statx, so we'll
+ * use the coarse clock in iomap mode.
+ */
+ if (fuse4fs_iomap_enabled(ff) &&
+ !clock_gettime(CLOCK_REALTIME_COARSE, now))
+ return;
+#endif
#ifdef CLOCK_REALTIME
if (!clock_gettime(CLOCK_REALTIME, now))
return;
@@ -847,11 +863,12 @@ static void increment_version(struct ext2_inode_large *inode)
inode->i_version_hi = ver >> 32;
}
-static void init_times(struct ext2_inode_large *inode)
+static void fuse4fs_init_timestamps(struct fuse4fs *ff,
+ struct ext2_inode_large *inode)
{
struct timespec now;
- get_now(&now);
+ fuse4fs_get_now(ff, &now);
EXT4_INODE_SET_XTIME(i_atime, &now, inode);
EXT4_INODE_SET_XTIME(i_ctime, &now, inode);
EXT4_INODE_SET_XTIME(i_mtime, &now, inode);
@@ -859,14 +876,15 @@ static void init_times(struct ext2_inode_large *inode)
increment_version(inode);
}
-static int update_ctime(ext2_filsys fs, ext2_ino_t ino,
- struct ext2_inode_large *pinode)
+static int fuse4fs_update_ctime(struct fuse4fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *pinode)
{
- errcode_t err;
struct timespec now;
struct ext2_inode_large inode;
+ ext2_filsys fs = ff->fs;
+ errcode_t err;
- get_now(&now);
+ fuse4fs_get_now(ff, &now);
/* If user already has a inode buffer, just update that */
if (pinode) {
@@ -890,12 +908,13 @@ static int update_ctime(ext2_filsys fs, ext2_ino_t ino,
return 0;
}
-static int update_atime(ext2_filsys fs, ext2_ino_t ino)
+static int fuse4fs_update_atime(struct fuse4fs *ff, ext2_ino_t ino)
{
- errcode_t err;
struct ext2_inode_large inode, *pinode;
struct timespec atime, mtime, now;
+ ext2_filsys fs = ff->fs;
double datime, dmtime, dnow;
+ errcode_t err;
err = fuse4fs_read_inode(fs, ino, &inode);
if (err)
@@ -904,7 +923,7 @@ static int update_atime(ext2_filsys fs, ext2_ino_t ino)
pinode = &inode;
EXT4_INODE_GET_XTIME(i_atime, &atime, pinode);
EXT4_INODE_GET_XTIME(i_mtime, &mtime, pinode);
- get_now(&now);
+ fuse4fs_get_now(ff, &now);
datime = atime.tv_sec + ((double)atime.tv_nsec / NSEC_PER_SEC);
dmtime = mtime.tv_sec + ((double)mtime.tv_nsec / NSEC_PER_SEC);
@@ -926,15 +945,16 @@ static int update_atime(ext2_filsys fs, ext2_ino_t ino)
return 0;
}
-static int update_mtime(ext2_filsys fs, ext2_ino_t ino,
- struct ext2_inode_large *pinode)
+static int fuse4fs_update_mtime(struct fuse4fs *ff, ext2_ino_t ino,
+ struct ext2_inode_large *pinode)
{
- errcode_t err;
struct ext2_inode_large inode;
struct timespec now;
+ ext2_filsys fs = ff->fs;
+ errcode_t err;
if (pinode) {
- get_now(&now);
+ fuse4fs_get_now(ff, &now);
EXT4_INODE_SET_XTIME(i_mtime, &now, pinode);
EXT4_INODE_SET_XTIME(i_ctime, &now, pinode);
increment_version(pinode);
@@ -945,7 +965,7 @@ static int update_mtime(ext2_filsys fs, ext2_ino_t ino,
if (err)
return translate_error(fs, ino, err);
- get_now(&now);
+ fuse4fs_get_now(ff, &now);
EXT4_INODE_SET_XTIME(i_mtime, &now, &inode);
EXT4_INODE_SET_XTIME(i_ctime, &now, &inode);
increment_version(&inode);
@@ -2029,7 +2049,7 @@ static void op_readlink(fuse_req_t req, fuse_ino_t fino)
buf[len] = 0;
if (fuse4fs_is_writeable(ff)) {
- ret = update_atime(fs, ino);
+ ret = fuse4fs_update_atime(ff, ino);
if (ret)
goto out;
}
@@ -2298,7 +2318,7 @@ static void op_mknod(fuse_req_t req, fuse_ino_t fino, const char *name,
goto out2;
}
- ret = update_mtime(fs, parent, NULL);
+ ret = fuse4fs_update_mtime(ff, parent, NULL);
if (ret)
goto out2;
@@ -2321,7 +2341,7 @@ static void op_mknod(fuse_req_t req, fuse_ino_t fino, const char *name,
}
inode.i_generation = ff->next_generation++;
- init_times(&inode);
+ fuse4fs_init_timestamps(ff, &inode);
err = fuse4fs_write_inode(fs, child, &inode);
if (err) {
ret = translate_error(fs, child, err);
@@ -2383,7 +2403,7 @@ static void op_mkdir(fuse_req_t req, fuse_ino_t fino, const char *name,
goto out2;
}
- ret = update_mtime(fs, parent, NULL);
+ ret = fuse4fs_update_mtime(ff, parent, NULL);
if (ret)
goto out2;
@@ -2409,7 +2429,7 @@ static void op_mkdir(fuse_req_t req, fuse_ino_t fino, const char *name,
if (parent_sgid)
inode.i_mode |= S_ISGID;
inode.i_generation = ff->next_generation++;
- init_times(&inode);
+ fuse4fs_init_timestamps(ff, &inode);
err = fuse4fs_write_inode(fs, child, &inode);
if (err) {
@@ -2750,7 +2770,7 @@ static int fuse4fs_remove_inode(struct fuse4fs *ff, ext2_ino_t ino)
inode.i_links_count--;
}
- ret = update_ctime(fs, ino, &inode);
+ ret = fuse4fs_update_ctime(ff, ino, &inode);
if (ret)
return ret;
@@ -2821,7 +2841,7 @@ static int fuse4fs_unlink(struct fuse4fs *ff, ext2_ino_t parent,
goto out;
}
- ret = update_mtime(fs, parent, NULL);
+ ret = fuse4fs_update_mtime(ff, parent, NULL);
if (ret)
goto out;
out:
@@ -2960,7 +2980,7 @@ static int fuse4fs_rmdir(struct fuse4fs *ff, ext2_ino_t parent,
}
if (inode.i_links_count > 1)
inode.i_links_count--;
- ret = update_mtime(fs, rds.parent, &inode);
+ ret = fuse4fs_update_mtime(ff, rds.parent, &inode);
if (ret)
goto out;
err = fuse4fs_write_inode(fs, rds.parent, &inode);
@@ -3060,7 +3080,7 @@ static void op_symlink(fuse_req_t req, const char *target, fuse_ino_t fino,
}
/* Update parent dir's mtime */
- ret = update_mtime(fs, parent, NULL);
+ ret = fuse4fs_update_mtime(ff, parent, NULL);
if (ret)
goto out2;
@@ -3083,7 +3103,7 @@ static void op_symlink(fuse_req_t req, const char *target, fuse_ino_t fino,
fuse4fs_set_uid(&inode, ctxt->uid);
fuse4fs_set_gid(&inode, gid);
inode.i_generation = ff->next_generation++;
- init_times(&inode);
+ fuse4fs_init_timestamps(ff, &inode);
err = fuse4fs_write_inode(fs, child, &inode);
if (err) {
@@ -3274,11 +3294,11 @@ static void op_rename(fuse_req_t req, fuse_ino_t from_parent, const char *from,
}
/* Update timestamps */
- ret = update_ctime(fs, from_ino, NULL);
+ ret = fuse4fs_update_ctime(ff, from_ino, NULL);
if (ret)
goto out;
- ret = update_mtime(fs, to_dir_ino, NULL);
+ ret = fuse4fs_update_mtime(ff, to_dir_ino, NULL);
if (ret)
goto out;
@@ -3352,7 +3372,7 @@ static void op_link(fuse_req_t req, fuse_ino_t child_fino,
}
inode.i_links_count++;
- ret = update_ctime(fs, child, &inode);
+ ret = fuse4fs_update_ctime(ff, child, &inode);
if (ret)
goto out2;
@@ -3369,7 +3389,7 @@ static void op_link(fuse_req_t req, fuse_ino_t child_fino,
goto out2;
}
- ret = update_mtime(fs, parent, NULL);
+ ret = fuse4fs_update_mtime(ff, parent, NULL);
if (ret)
goto out2;
@@ -3602,7 +3622,7 @@ static int fuse4fs_truncate(struct fuse4fs *ff, ext2_ino_t ino, off_t new_size)
if (err)
return translate_error(fs, ino, err);
- ret = update_mtime(fs, ino, NULL);
+ ret = fuse4fs_update_mtime(ff, ino, NULL);
if (ret)
return ret;
@@ -3802,7 +3822,7 @@ static void op_read(fuse_req_t req, fuse_ino_t fino EXT2FS_ATTR((unused)),
}
if (fuse4fs_is_writeable(ff)) {
- ret = update_atime(fs, fh->ino);
+ ret = fuse4fs_update_atime(ff, fh->ino);
if (ret)
goto out;
}
@@ -3876,7 +3896,7 @@ static void op_write(fuse_req_t req, fuse_ino_t fino EXT2FS_ATTR((unused)),
goto out;
}
- ret = update_mtime(fs, fh->ino, NULL);
+ ret = fuse4fs_update_mtime(ff, fh->ino, NULL);
if (ret)
goto out;
@@ -4323,7 +4343,7 @@ static void op_setxattr(fuse_req_t req, fuse_ino_t fino, const char *key,
goto out2;
}
- ret = update_ctime(fs, ino, NULL);
+ ret = fuse4fs_update_ctime(ff, ino, NULL);
out2:
err = ext2fs_xattrs_close(&h);
if (!ret && err)
@@ -4417,7 +4437,7 @@ static void op_removexattr(fuse_req_t req, fuse_ino_t fino, const char *key)
goto out2;
}
- ret = update_ctime(fs, ino, NULL);
+ ret = fuse4fs_update_ctime(ff, ino, NULL);
out2:
err = ext2fs_xattrs_close(&h);
if (err && !ret)
@@ -4564,7 +4584,7 @@ static void __op_readdir(fuse_req_t req, fuse_ino_t fino, size_t size,
}
if (fuse4fs_is_writeable(ff)) {
- ret = update_atime(i.fs, fh->ino);
+ ret = fuse4fs_update_atime(i.ff, fh->ino);
if (ret)
goto out;
}
@@ -4664,7 +4684,7 @@ static void op_create(fuse_req_t req, fuse_ino_t fino, const char *name,
goto out2;
}
- ret = update_mtime(fs, parent, NULL);
+ ret = fuse4fs_update_mtime(ff, parent, NULL);
if (ret)
goto out2;
} else {
@@ -4705,7 +4725,7 @@ static void op_create(fuse_req_t req, fuse_ino_t fino, const char *name,
}
inode.i_generation = ff->next_generation++;
- init_times(&inode);
+ fuse4fs_init_timestamps(ff, &inode);
err = fuse4fs_write_inode(fs, child, &inode);
if (err) {
ret = translate_error(fs, child, err);
@@ -4784,7 +4804,7 @@ static int fuse4fs_utimens(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
int ret = 0;
if (to_set & (FUSE_SET_ATTR_ATIME_NOW | FUSE_SET_ATTR_MTIME_NOW))
- get_now(&now);
+ fuse4fs_get_now(ff, &now);
if (to_set & FUSE_SET_ATTR_ATIME_NOW) {
atime = now;
@@ -4922,7 +4942,7 @@ static void op_setattr(fuse_req_t req, fuse_ino_t fino, struct stat *attr,
}
/* Update ctime for any attribute change */
- ret = update_ctime(fs, ino, &inode);
+ ret = fuse4fs_update_ctime(ff, ino, &inode);
if (ret)
goto out;
@@ -5004,7 +5024,7 @@ static int ioctl_setflags(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
if (ret)
return ret;
- ret = update_ctime(fs, fh->ino, &inode);
+ ret = fuse4fs_update_ctime(ff, fh->ino, &inode);
if (ret)
return ret;
@@ -5057,7 +5077,7 @@ static int ioctl_setversion(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
inode.i_generation = *indata;
- ret = update_ctime(fs, fh->ino, &inode);
+ ret = fuse4fs_update_ctime(ff, fh->ino, &inode);
if (ret)
return ret;
@@ -5168,7 +5188,7 @@ static int ioctl_fssetxattr(struct fuse4fs *ff, const struct fuse_ctx *ctxt,
if (ext2fs_inode_includes(inode_size, i_projid))
inode.i_projid = fsx->fsx_projid;
- ret = update_ctime(fs, fh->ino, &inode);
+ ret = fuse4fs_update_ctime(ff, fh->ino, &inode);
if (ret)
return ret;
@@ -5453,7 +5473,7 @@ static int fuse4fs_allocate_range(struct fuse4fs *ff,
}
}
- err = update_mtime(fs, fh->ino, &inode);
+ err = fuse4fs_update_mtime(ff, fh->ino, &inode);
if (err)
return err;
@@ -5626,7 +5646,7 @@ static int fuse4fs_punch_range(struct fuse4fs *ff,
return translate_error(fs, fh->ino, err);
}
- err = update_mtime(fs, fh->ino, &inode);
+ err = fuse4fs_update_mtime(ff, fh->ino, &inode);
if (err)
return err;
@@ -7788,7 +7808,7 @@ static int __translate_error(ext2_filsys fs, ext2_ino_t ino, errcode_t err,
error_message(err), func, line);
/* Make a note in the error log */
- get_now(&now);
+ fuse4fs_get_now(ff, &now);
ext2fs_set_tstamp(fs->super, s_last_error_time, now.tv_sec);
fs->super->s_last_error_ino = ino;
fs->super->s_last_error_line = line;
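The clock selection that this patch adds to fuse2fs_get_now() and fuse4fs_get_now() can be sketched in isolation. This is a minimal illustration, not the patch's code: sketch_get_now() and sketch_timespec_cmp() are hypothetical names, and a plain boolean stands in for the fuse2fs_iomap_enabled(ff) check.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for clock_gettime() under strict -std= modes */
#endif
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical stand-in for fuse2fs_get_now(): prefer the coarse realtime
 * clock when use_coarse is set, then fall back to the fine-grained clock,
 * then to second granularity.
 */
static void sketch_get_now(bool use_coarse, struct timespec *now)
{
	assert(now != NULL);
#ifdef CLOCK_REALTIME_COARSE
	if (use_coarse && !clock_gettime(CLOCK_REALTIME_COARSE, now))
		return;
#else
	(void)use_coarse;	/* no coarse clock on this platform */
#endif
	if (!clock_gettime(CLOCK_REALTIME, now))
		return;
	now->tv_sec = time(NULL);
	now->tv_nsec = 0;
}

/* Return <0, 0 or >0 if a is earlier than, equal to, or later than b. */
static int sketch_timespec_cmp(const struct timespec *a,
			       const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec ? -1 : 1;
	if (a->tv_nsec != b->tv_nsec)
		return a->tv_nsec < b->tv_nsec ? -1 : 1;
	return 0;
}
```

On Linux the coarse clock reports the time of the most recent timekeeping tick, so a coarse reading is not expected to be later than a fine-grained reading taken afterwards. The reverse does not hold: a fine-grained reading can be later than a coarse reading taken after it, which is the ordering problem the commit message describes.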
* [PATCH 6/8] fuse2fs: add tracing for retrieving timestamps
2025-08-21 0:50 ` [PATCHSET RFC v4 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (4 preceding siblings ...)
2025-08-21 1:22 ` [PATCH 5/8] fuse2fs: use coarse timestamps for iomap mode Darrick J. Wong
@ 2025-08-21 1:22 ` Darrick J. Wong
2025-08-21 1:23 ` [PATCH 7/8] fuse2fs: enable syncfs Darrick J. Wong
2025-08-21 1:23 ` [PATCH 8/8] fuse2fs: skip the gdt write in op_destroy if syncfs is working Darrick J. Wong
7 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:22 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Add tracing to the stat path that logs the atime, mtime, and ctime being
returned, so that we can debug unexpected timestamp behavior.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse2fs.c | 22 +++++++++++++++-------
1 file changed, 15 insertions(+), 7 deletions(-)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index df84884ba6b7d0..80bd47549925bf 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1571,9 +1571,11 @@ static void *op_init(struct fuse_conn_info *conn
return ff;
}
-static int stat_inode(ext2_filsys fs, ext2_ino_t ino, struct stat *statbuf)
+static int fuse2fs_stat(struct fuse2fs *ff, ext2_ino_t ino,
+ struct stat *statbuf)
{
struct ext2_inode_large inode;
+ ext2_filsys fs = ff->fs;
dev_t fakedev = 0;
errcode_t err;
int ret = 0;
@@ -1612,6 +1614,13 @@ static int stat_inode(ext2_filsys fs, ext2_ino_t ino, struct stat *statbuf)
#else
statbuf->st_ctime = tv.tv_sec;
#endif
+
+ dbg_printf(ff, "%s: ino=%d atime=%lld.%ld mtime=%lld.%ld ctime=%lld.%ld\n",
+ __func__, ino,
+ (long long int)statbuf->st_atim.tv_sec, statbuf->st_atim.tv_nsec,
+ (long long int)statbuf->st_mtim.tv_sec, statbuf->st_mtim.tv_nsec,
+ (long long int)statbuf->st_ctim.tv_sec, statbuf->st_ctim.tv_nsec);
+
if (LINUX_S_ISCHR(inode.i_mode) ||
LINUX_S_ISBLK(inode.i_mode)) {
if (inode.i_block[0])
@@ -1669,16 +1678,15 @@ static int op_getattr(const char *path, struct stat *statbuf
)
{
struct fuse2fs *ff = fuse2fs_get();
- ext2_filsys fs;
ext2_ino_t ino;
int ret = 0;
FUSE2FS_CHECK_CONTEXT(ff);
- fs = fuse2fs_start(ff);
+ fuse2fs_start(ff);
ret = fuse2fs_file_ino(ff, path, fi, &ino);
if (ret)
goto out;
- ret = stat_inode(fs, ino, statbuf);
+ ret = fuse2fs_stat(ff, ino, statbuf);
out:
fuse2fs_finish(ff, ret);
return ret;
@@ -3423,7 +3431,7 @@ static int fuse2fs_file_uses_iomap(struct fuse2fs *ff, ext2_ino_t ino)
if (!fuse2fs_iomap_enabled(ff))
return 0;
- ret = stat_inode(ff->fs, ino, &statbuf);
+ ret = fuse2fs_stat(ff, ino, &statbuf);
if (ret)
return ret;
@@ -4334,7 +4342,7 @@ static int op_readdir_iter(ext2_ino_t dir EXT2FS_ATTR((unused)),
#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
if (i->flags == FUSE_READDIR_PLUS) {
- ret = stat_inode(i->fs, dirent->inode, &stat);
+ ret = fuse2fs_stat(i->ff, dirent->inode, &stat);
if (ret)
return DIRENT_ABORT;
}
@@ -4618,7 +4626,7 @@ static int op_fgetattr(const char *path EXT2FS_ATTR((unused)),
FUSE2FS_CHECK_HANDLE(ff, fh);
dbg_printf(ff, "%s: ino=%d\n", __func__, fh->ino);
fs = fuse2fs_start(ff);
- ret = stat_inode(fs, fh->ino, statbuf);
+ ret = fuse2fs_stat(ff, fh->ino, statbuf);
fuse2fs_finish(ff, ret);
return ret;
* [PATCH 7/8] fuse2fs: enable syncfs
2025-08-21 0:50 ` [PATCHSET RFC v4 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (5 preceding siblings ...)
2025-08-21 1:22 ` [PATCH 6/8] fuse2fs: add tracing for retrieving timestamps Darrick J. Wong
@ 2025-08-21 1:23 ` Darrick J. Wong
2025-08-21 1:23 ` [PATCH 8/8] fuse2fs: skip the gdt write in op_destroy if syncfs is working Darrick J. Wong
7 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:23 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Enable syncfs calls in fuse2fs and fuse4fs.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse2fs.c | 36 ++++++++++++++++++++++++++++++++++++
misc/fuse4fs.c | 39 ++++++++++++++++++++++++++++++++++++++-
2 files changed, 74 insertions(+), 1 deletion(-)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 80bd47549925bf..62aca0ab56ec07 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -5414,6 +5414,41 @@ static int op_fallocate(const char *path EXT2FS_ATTR((unused)), int mode,
# endif /* SUPPORT_FALLOCATE */
#endif /* FUSE 29 */
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 99)
+static int op_syncfs(const char *path)
+{
+ struct fuse2fs *ff = fuse2fs_get();
+ ext2_filsys fs;
+ errcode_t err;
+ int ret = 0;
+
+ FUSE2FS_CHECK_CONTEXT(ff);
+ dbg_printf(ff, "%s: path=%s\n", __func__, path);
+ fs = fuse2fs_start(ff);
+
+ if (ff->opstate == F2OP_WRITABLE) {
+ if (fs->super->s_error_count)
+ fs->super->s_state |= EXT2_ERROR_FS;
+ ext2fs_mark_super_dirty(fs);
+ err = ext2fs_set_gdt_csum(fs);
+ if (err) {
+ ret = translate_error(fs, 0, err);
+ goto out_unlock;
+ }
+
+ err = ext2fs_flush2(fs, 0);
+ if (err) {
+ ret = translate_error(fs, 0, err);
+ goto out_unlock;
+ }
+ }
+
+out_unlock:
+ fuse2fs_finish(ff, ret);
+ return ret;
+}
+#endif
+
#ifdef HAVE_FUSE_IOMAP
static void fuse2fs_iomap_hole(struct fuse2fs *ff, struct fuse_file_iomap *iomap,
off_t pos, uint64_t count)
@@ -6750,6 +6785,7 @@ static struct fuse_operations fs_ops = {
#endif
#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 99)
.getattr_iflags = op_getattr_iflags,
+ .syncfs = op_syncfs,
#endif
#ifdef HAVE_FUSE_IOMAP
.iomap_begin = op_iomap_begin,
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index a06e963eab6afd..e01b83e271415c 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -1968,7 +1968,7 @@ static int fuse4fs_statx(struct fuse4fs *ff, ext2_ino_t ino, int statx_mask,
static void op_statx(fuse_req_t req, fuse_ino_t fino, int flags, int mask,
struct fuse_file_info *fi)
{
- struct statx stx;
+ struct statx stx = { };
struct fuse4fs *ff = fuse4fs_get(req);
ext2_ino_t ino;
int ret = 0;
@@ -5708,6 +5708,40 @@ static void op_fallocate(fuse_req_t req, fuse_ino_t fino EXT2FS_ATTR((unused)),
}
#endif /* SUPPORT_FALLOCATE */
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 99)
+static void op_syncfs(fuse_req_t req, fuse_ino_t ino)
+{
+ struct fuse4fs *ff = fuse4fs_get(req);
+ ext2_filsys fs;
+ errcode_t err;
+ int ret = 0;
+
+ FUSE4FS_CHECK_CONTEXT(req);
+ fs = fuse4fs_start(ff);
+
+ if (ff->opstate == F4OP_WRITABLE) {
+ if (fs->super->s_error_count)
+ fs->super->s_state |= EXT2_ERROR_FS;
+ ext2fs_mark_super_dirty(fs);
+ err = ext2fs_set_gdt_csum(fs);
+ if (err) {
+ ret = translate_error(fs, 0, err);
+ goto out_unlock;
+ }
+
+ err = ext2fs_flush2(fs, 0);
+ if (err) {
+ ret = translate_error(fs, 0, err);
+ goto out_unlock;
+ }
+ }
+
+out_unlock:
+ fuse4fs_finish(ff, ret);
+ fuse_reply_err(req, -ret);
+}
+#endif
+
#ifdef HAVE_FUSE_IOMAP
static void fuse4fs_iomap_hole(struct fuse4fs *ff, struct fuse_file_iomap *iomap,
off_t pos, uint64_t count)
@@ -7034,6 +7068,9 @@ static struct fuse_lowlevel_ops fs_ops = {
#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 18)
.statx = op_statx,
#endif
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 99)
+ .syncfs = op_syncfs,
+#endif
#ifdef HAVE_FUSE_IOMAP
.iomap_begin = op_iomap_begin,
.iomap_end = op_iomap_end,
* [PATCH 8/8] fuse2fs: skip the gdt write in op_destroy if syncfs is working
2025-08-21 0:50 ` [PATCHSET RFC v4 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
` (6 preceding siblings ...)
2025-08-21 1:23 ` [PATCH 7/8] fuse2fs: enable syncfs Darrick J. Wong
@ 2025-08-21 1:23 ` Darrick J. Wong
7 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:23 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
As an umount-time performance enhancement, don't bother to write the
group descriptor tables in op_destroy if we know that op_syncfs will do
it for us. That only happens if iomap is enabled.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse2fs.c | 19 ++++++++++++++++---
misc/fuse4fs.c | 19 ++++++++++++++++---
2 files changed, 32 insertions(+), 6 deletions(-)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index 62aca0ab56ec07..f5d68cc549ad69 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -269,6 +269,7 @@ struct fuse2fs {
uint8_t unmount_in_destroy;
uint8_t noblkdev;
uint8_t iomap_passthrough_options;
+ uint8_t write_gdt_on_destroy;
enum fuse2fs_opstate opstate;
int logfd;
@@ -1309,9 +1310,11 @@ static void op_destroy(void *p EXT2FS_ATTR((unused)))
if (fs->super->s_error_count)
fs->super->s_state |= EXT2_ERROR_FS;
ext2fs_mark_super_dirty(fs);
- err = ext2fs_set_gdt_csum(fs);
- if (err)
- translate_error(fs, 0, err);
+ if (ff->write_gdt_on_destroy) {
+ err = ext2fs_set_gdt_csum(fs);
+ if (err)
+ translate_error(fs, 0, err);
+ }
err = ext2fs_flush2(fs, 0);
if (err)
@@ -5443,6 +5446,15 @@ static int op_syncfs(const char *path)
}
}
+ /*
+ * When iomap is enabled, the kernel will call syncfs right before
+ * calling the destroy method. If any syncfs succeeds, then we know
+ * that there will be a last syncfs and that it will write the GDT, so
+ * destroy doesn't need to waste time doing that.
+ */
+ if (fuse2fs_iomap_enabled(ff))
+ ff->write_gdt_on_destroy = 0;
+
out_unlock:
fuse2fs_finish(ff, ret);
return ret;
@@ -7088,6 +7100,7 @@ int main(int argc, char *argv[])
.iomap_dev = FUSE_IOMAP_DEV_NULL,
.iomap_cache = 1,
#endif
+ .write_gdt_on_destroy = 1,
};
errcode_t err;
FILE *orig_stderr = stderr;
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index e01b83e271415c..6f03c6a0933a3d 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -265,6 +265,7 @@ struct fuse4fs {
uint8_t noblkdev;
uint8_t iomap_passthrough_options;
uint8_t translate_inums;
+ uint8_t write_gdt_on_destroy;
enum fuse4fs_opstate opstate;
int logfd;
@@ -1472,9 +1473,11 @@ static void op_destroy(void *userdata)
if (fs->super->s_error_count)
fs->super->s_state |= EXT2_ERROR_FS;
ext2fs_mark_super_dirty(fs);
- err = ext2fs_set_gdt_csum(fs);
- if (err)
- translate_error(fs, 0, err);
+ if (ff->write_gdt_on_destroy) {
+ err = ext2fs_set_gdt_csum(fs);
+ if (err)
+ translate_error(fs, 0, err);
+ }
err = ext2fs_flush2(fs, 0);
if (err)
@@ -5736,6 +5739,15 @@ static void op_syncfs(fuse_req_t req, fuse_ino_t ino)
}
}
+ /*
+ * When iomap is enabled, the kernel will call syncfs right before
+ * calling the destroy method. If any syncfs succeeds, then we know
+ * that there will be a last syncfs and that it will write the GDT, so
+ * destroy doesn't need to waste time doing that.
+ */
+ if (fuse4fs_iomap_enabled(ff))
+ ff->write_gdt_on_destroy = 0;
+
out_unlock:
fuse4fs_finish(ff, ret);
fuse_reply_err(req, -ret);
@@ -7472,6 +7484,7 @@ int main(int argc, char *argv[])
.iomap_cache = 1,
#endif
.translate_inums = 1,
+ .write_gdt_on_destroy = 1,
};
errcode_t err;
FILE *orig_stderr = stderr;
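The interaction between op_syncfs and op_destroy in this patch reduces to a single flag, which can be modelled without libext2fs. The sketch below assumes a trimmed-down state structure; struct sketch_state and the sketch_* functions are hypothetical, and a counter stands in for the real ext2fs_set_gdt_csum() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for struct fuse2fs. */
struct sketch_state {
	bool iomap_enabled;
	uint8_t write_gdt_on_destroy;	/* initialised to 1, as in main() */
	unsigned int gdt_writes;	/* simulated GDT checksum updates */
};

static void sketch_init(struct sketch_state *st, bool iomap_enabled)
{
	assert(st != NULL);
	st->iomap_enabled = iomap_enabled;
	st->write_gdt_on_destroy = 1;
	st->gdt_writes = 0;
}

/*
 * Simulated syncfs: flush_err is the (assumed) result of the flush, with
 * nonzero meaning failure. Only a successful sync in iomap mode clears
 * the flag, mirroring how the error paths jump past the flag update.
 */
static int sketch_syncfs(struct sketch_state *st, int flush_err)
{
	st->gdt_writes++;
	if (flush_err)
		return flush_err;
	if (st->iomap_enabled)
		st->write_gdt_on_destroy = 0;
	return 0;
}

/* Simulated destroy: repeat the GDT update only if nobody did it for us. */
static void sketch_destroy(struct sketch_state *st)
{
	if (st->write_gdt_on_destroy)
		st->gdt_writes++;
}
```

In this model a successful syncfs in iomap mode leaves destroy with nothing to do, while a failed syncfs, or a mount without iomap, keeps the conservative behavior of updating the GDT checksums again at destroy time.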
* [PATCH 1/6] libsupport: add caching IO manager
2025-08-21 0:50 ` [PATCHSET RFC v4 6/6] fuse2fs: improve block and inode caching Darrick J. Wong
@ 2025-08-21 1:23 ` Darrick J. Wong
2025-08-21 1:23 ` [PATCH 2/6] iocache: add the actual buffer cache Darrick J. Wong
` (4 subsequent siblings)
5 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:23 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Start creating a caching IO manager so that we can have better caching
of metadata blocks in fuse2fs. For now it's just a passthrough cache.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/support/iocache.h | 17 +++
lib/ext2fs/io_manager.c | 3
lib/support/Makefile.in | 6 +
lib/support/iocache.c | 306 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 331 insertions(+), 1 deletion(-)
create mode 100644 lib/support/iocache.h
create mode 100644 lib/support/iocache.c
diff --git a/lib/support/iocache.h b/lib/support/iocache.h
new file mode 100644
index 00000000000000..3c1d1df00e25bd
--- /dev/null
+++ b/lib/support/iocache.h
@@ -0,0 +1,17 @@
+/*
+ * iocache.h - IO cache
+ *
+ * Copyright (C) 2025 Oracle.
+ *
+ * %Begin-Header%
+ * This file may be redistributed under the terms of the GNU Public
+ * License.
+ * %End-Header%
+ */
+#ifndef __IOCACHE_H__
+#define __IOCACHE_H__
+
+errcode_t iocache_set_backing_manager(io_manager manager);
+extern io_manager iocache_io_manager;
+
+#endif /* __IOCACHE_H__ */
diff --git a/lib/ext2fs/io_manager.c b/lib/ext2fs/io_manager.c
index c91fab4eb290d5..7a6a6bfedc8a1c 100644
--- a/lib/ext2fs/io_manager.c
+++ b/lib/ext2fs/io_manager.c
@@ -16,9 +16,12 @@
#if HAVE_SYS_TYPES_H
#include <sys/types.h>
#endif
+#include <stdbool.h>
#include "ext2_fs.h"
#include "ext2fs.h"
+#include "support/list.h"
+#include "support/cache.h"
errcode_t io_channel_set_options(io_channel channel, const char *opts)
{
diff --git a/lib/support/Makefile.in b/lib/support/Makefile.in
index 13d6f06f150afd..98a9bd42eef55e 100644
--- a/lib/support/Makefile.in
+++ b/lib/support/Makefile.in
@@ -14,6 +14,7 @@ MKDIR_P = @MKDIR_P@
all::
OBJS= cstring.o \
+ iocache.o \
mkquota.o \
plausible.o \
profile.o \
@@ -42,7 +43,8 @@ SRCS= $(srcdir)/argv_parse.c \
$(srcdir)/quotaio_v2.c \
$(srcdir)/dict.c \
$(srcdir)/devname.c \
- $(srcdir)/cache.c
+ $(srcdir)/cache.c \
+ $(srcdir)/iocache.c
LIBRARY= libsupport
LIBDIR= support
@@ -187,3 +189,5 @@ devname.o: $(srcdir)/devname.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/devname.h $(srcdir)/nls-enable.h
cache.o: $(srcdir)/cache.c $(top_builddir)/lib/config.h \
$(srcdir)/cache.h $(srcdir)/list.h $(srcdir)/xbitops.h
+iocache.o: $(srcdir)/iocache.c $(top_builddir)/lib/config.h \
+ $(srcdir)/iocache.h $(srcdir)/cache.h $(srcdir)/list.h $(srcdir)/xbitops.h
diff --git a/lib/support/iocache.c b/lib/support/iocache.c
new file mode 100644
index 00000000000000..9870780d65ef61
--- /dev/null
+++ b/lib/support/iocache.c
@@ -0,0 +1,306 @@
+/*
+ * iocache.c - caching IO manager for e2fsprogs.
+ *
+ * Copyright (C) 2025 Oracle.
+ *
+ * %Begin-Header%
+ * This file may be redistributed under the terms of the GNU Public
+ * License.
+ * %End-Header%
+ */
+#include "config.h"
+#include "ext2fs/ext2_fs.h"
+#include "ext2fs/ext2fs.h"
+#include "ext2fs/ext2fsP.h"
+#include "support/iocache.h"
+
+#define IOCACHE_IO_CHANNEL_MAGIC 0x424F5254 /* BORT */
+
+static io_manager iocache_backing_manager;
+
+struct iocache_private_data {
+ int magic;
+ io_channel real;
+};
+
+static struct iocache_private_data *IOCACHE(io_channel channel)
+{
+ return (struct iocache_private_data *)channel->private_data;
+}
+
+static errcode_t iocache_read_error(io_channel channel, unsigned long block,
+ int count, void *data, size_t size,
+ int actual_bytes_read, errcode_t error)
+{
+ io_channel iocache_channel = channel->app_data;
+
+ return iocache_channel->read_error(iocache_channel, block, count, data,
+ size, actual_bytes_read, error);
+}
+
+static errcode_t iocache_write_error(io_channel channel, unsigned long block,
+ int count, const void *data, size_t size,
+ int actual_bytes_written,
+ errcode_t error)
+{
+ io_channel iocache_channel = channel->app_data;
+
+ return iocache_channel->write_error(iocache_channel, block, count, data,
+ size, actual_bytes_written, error);
+}
+
+static errcode_t iocache_open(const char *name, int flags, io_channel *channel)
+{
+ io_channel io = NULL;
+ io_channel real;
+ struct iocache_private_data *data = NULL;
+ errcode_t retval;
+
+ if (!name)
+ return EXT2_ET_BAD_DEVICE_NAME;
+ if (!iocache_backing_manager)
+ return EXT2_ET_INVALID_ARGUMENT;
+
+ retval = iocache_backing_manager->open(name, flags, &real);
+ if (retval)
+ return retval;
+
+ retval = ext2fs_get_mem(sizeof(struct struct_io_channel), &io);
+ if (retval)
+ goto out_backing;
+ memset(io, 0, sizeof(struct struct_io_channel));
+ io->magic = EXT2_ET_MAGIC_IO_CHANNEL;
+
+ retval = ext2fs_get_mem(sizeof(struct iocache_private_data), &data);
+ if (retval)
+ goto out_channel;
+ memset(data, 0, sizeof(struct iocache_private_data));
+ data->magic = IOCACHE_IO_CHANNEL_MAGIC;
+
+ io->manager = iocache_io_manager;
+ retval = ext2fs_get_mem(strlen(name) + 1, &io->name);
+ if (retval)
+ goto out_data;
+
+ strcpy(io->name, name);
+ io->private_data = data;
+ io->block_size = real->block_size;
+ io->read_error = 0;
+ io->write_error = 0;
+ io->refcount = 1;
+ io->flags = real->flags;
+ data->real = real;
+ real->app_data = io;
+ real->read_error = iocache_read_error;
+ real->write_error = iocache_write_error;
+
+ *channel = io;
+ return 0;
+
+out_data:
+ ext2fs_free_mem(&data);
+out_channel:
+ ext2fs_free_mem(&io);
+out_backing:
+ io_channel_close(real);
+ return retval;
+}
+
+static errcode_t iocache_close(io_channel channel)
+{
+ struct iocache_private_data *data = IOCACHE(channel);
+ errcode_t retval = 0;
+
+ EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+ EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+ if (--channel->refcount > 0)
+ return 0;
+ if (data->real)
+ retval = io_channel_close(data->real);
+ ext2fs_free_mem(&channel->private_data);
+ if (channel->name)
+ ext2fs_free_mem(&channel->name);
+ ext2fs_free_mem(&channel);
+
+ return retval;
+}
+
+static errcode_t iocache_set_blksize(io_channel channel, int blksize)
+{
+ struct iocache_private_data *data = IOCACHE(channel);
+ errcode_t retval;
+
+ EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+ EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+ retval = io_channel_set_blksize(data->real, blksize);
+ if (retval)
+ return retval;
+
+ channel->block_size = data->real->block_size;
+ return 0;
+}
+
+static errcode_t iocache_flush(io_channel channel)
+{
+ struct iocache_private_data *data = IOCACHE(channel);
+
+ EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+ EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+ return io_channel_flush(data->real);
+}
+
+static errcode_t iocache_write_byte(io_channel channel, unsigned long offset,
+ int count, const void *buf)
+{
+ struct iocache_private_data *data = IOCACHE(channel);
+
+ EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+ EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+ return io_channel_write_byte(data->real, offset, count, buf);
+}
+
+static errcode_t iocache_set_option(io_channel channel, const char *option,
+ const char *arg)
+{
+ struct iocache_private_data *data = IOCACHE(channel);
+
+ EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+ EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+ return data->real->manager->set_option(data->real, option, arg);
+}
+
+static errcode_t iocache_get_stats(io_channel channel, io_stats *io_stats)
+{
+ struct iocache_private_data *data = IOCACHE(channel);
+
+ EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+ EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+ return data->real->manager->get_stats(data->real, io_stats);
+}
+
+static errcode_t iocache_read_blk64(io_channel channel,
+ unsigned long long block, int count,
+ void *buf)
+{
+ struct iocache_private_data *data = IOCACHE(channel);
+
+ EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+ EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+ return io_channel_read_blk64(data->real, block, count, buf);
+}
+
+static errcode_t iocache_write_blk64(io_channel channel,
+ unsigned long long block, int count,
+ const void *buf)
+{
+ struct iocache_private_data *data = IOCACHE(channel);
+
+ EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+ EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+ return io_channel_write_blk64(data->real, block, count, buf);
+}
+
+static errcode_t iocache_read_blk(io_channel channel, unsigned long block,
+ int count, void *buf)
+{
+ return iocache_read_blk64(channel, block, count, buf);
+}
+
+static errcode_t iocache_write_blk(io_channel channel, unsigned long block,
+ int count, const void *buf)
+{
+ return iocache_write_blk64(channel, block, count, buf);
+}
+
+static errcode_t iocache_discard(io_channel channel, unsigned long long block,
+ unsigned long long count)
+{
+ struct iocache_private_data *data = IOCACHE(channel);
+
+ EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+ EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+ return io_channel_discard(data->real, block, count);
+}
+
+static errcode_t iocache_cache_readahead(io_channel channel,
+ unsigned long long block,
+ unsigned long long count)
+{
+ struct iocache_private_data *data = IOCACHE(channel);
+
+ EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+ EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+ return io_channel_cache_readahead(data->real, block, count);
+}
+
+static errcode_t iocache_zeroout(io_channel channel, unsigned long long block,
+ unsigned long long count)
+{
+ struct iocache_private_data *data = IOCACHE(channel);
+
+ EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+ EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+ return io_channel_zeroout(data->real, block, count);
+}
+
+static errcode_t iocache_get_fd(io_channel channel, int *fd)
+{
+ struct iocache_private_data *data = IOCACHE(channel);
+
+ EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+ EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+ return io_channel_get_fd(data->real, fd);
+}
+
+static errcode_t iocache_invalidate_blocks(io_channel channel,
+ unsigned long long block,
+ unsigned long long count)
+{
+ struct iocache_private_data *data = IOCACHE(channel);
+
+ EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+ EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+
+ return io_channel_invalidate_blocks(data->real, block, count);
+}
+
+static struct struct_io_manager struct_iocache_manager = {
+ .magic = EXT2_ET_MAGIC_IO_MANAGER,
+ .name = "iocache I/O manager",
+ .open = iocache_open,
+ .close = iocache_close,
+ .set_blksize = iocache_set_blksize,
+ .read_blk = iocache_read_blk,
+ .write_blk = iocache_write_blk,
+ .flush = iocache_flush,
+ .write_byte = iocache_write_byte,
+ .set_option = iocache_set_option,
+ .get_stats = iocache_get_stats,
+ .read_blk64 = iocache_read_blk64,
+ .write_blk64 = iocache_write_blk64,
+ .discard = iocache_discard,
+ .cache_readahead = iocache_cache_readahead,
+ .zeroout = iocache_zeroout,
+ .get_fd = iocache_get_fd,
+ .invalidate_blocks = iocache_invalidate_blocks,
+};
+
+io_manager iocache_io_manager = &struct_iocache_manager;
+
+errcode_t iocache_set_backing_manager(io_manager manager)
+{
+ iocache_backing_manager = manager;
+ return 0;
+}
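
The passthrough layering in this patch can be illustrated with a much smaller interface. The sketch below is only an analogy: demo_manager, demo_channel, mem_manager and pass_manager are hypothetical names, and the real struct struct_io_manager has many more methods. It shows a wrapper manager whose every method forwards to the channel it wraps, which is what iocache does at this stage of the series.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_BLKSZ 16
#define DEMO_NBLKS 8

struct demo_channel;

/* Hypothetical, simplified manager: a table of function pointers. */
struct demo_manager {
	const char *name;
	int (*read_blk)(struct demo_channel *ch, unsigned long long blk,
			void *buf);
	int (*write_blk)(struct demo_channel *ch, unsigned long long blk,
			 const void *buf);
};

struct demo_channel {
	const struct demo_manager *manager;
	void *private_data;
};

/* Backing store: a few blocks held in memory. */
struct demo_mem {
	unsigned char blocks[DEMO_NBLKS][DEMO_BLKSZ];
};

static int mem_read(struct demo_channel *ch, unsigned long long blk, void *buf)
{
	struct demo_mem *m = ch->private_data;

	if (blk >= DEMO_NBLKS)
		return -1;
	memcpy(buf, m->blocks[blk], DEMO_BLKSZ);
	return 0;
}

static int mem_write(struct demo_channel *ch, unsigned long long blk,
		     const void *buf)
{
	struct demo_mem *m = ch->private_data;

	if (blk >= DEMO_NBLKS)
		return -1;
	memcpy(m->blocks[blk], buf, DEMO_BLKSZ);
	return 0;
}

static const struct demo_manager mem_manager = {
	.name = "demo memory manager",
	.read_blk = mem_read,
	.write_blk = mem_write,
};

/* Passthrough layer: private data is just a pointer to the real channel. */
struct demo_pass {
	struct demo_channel *real;
};

static int pass_read(struct demo_channel *ch, unsigned long long blk, void *buf)
{
	struct demo_pass *p = ch->private_data;

	return p->real->manager->read_blk(p->real, blk, buf);
}

static int pass_write(struct demo_channel *ch, unsigned long long blk,
		      const void *buf)
{
	struct demo_pass *p = ch->private_data;

	return p->real->manager->write_blk(p->real, blk, buf);
}

static const struct demo_manager pass_manager = {
	.name = "demo passthrough manager",
	.read_blk = pass_read,
	.write_blk = pass_write,
};
```

A caller that only knows about the top channel cannot tell the two managers apart, which is why a cache can later be slotted into the forwarding functions without touching callers.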
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 2/6] iocache: add the actual buffer cache
2025-08-21 0:50 ` [PATCHSET RFC v4 6/6] fuse2fs: improve block and inode caching Darrick J. Wong
2025-08-21 1:23 ` [PATCH 1/6] libsupport: add caching IO manager Darrick J. Wong
@ 2025-08-21 1:23 ` Darrick J. Wong
2025-08-21 1:24 ` [PATCH 3/6] iocache: bump buffer mru priority every 50 accesses Darrick J. Wong
` (3 subsequent siblings)
5 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:23 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Wire up buffer caching into our new caching IO manager.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/support/iocache.c | 469 +++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 447 insertions(+), 22 deletions(-)
diff --git a/lib/support/iocache.c b/lib/support/iocache.c
index 9870780d65ef61..ab879e85d18f2a 100644
--- a/lib/support/iocache.c
+++ b/lib/support/iocache.c
@@ -9,46 +9,288 @@
* %End-Header%
*/
#include "config.h"
+#include <assert.h>
+#include <stdbool.h>
+#include <pthread.h>
+#include <unistd.h>
#include "ext2fs/ext2_fs.h"
#include "ext2fs/ext2fs.h"
#include "ext2fs/ext2fsP.h"
#include "support/iocache.h"
+#include "support/list.h"
+#include "support/cache.h"
#define IOCACHE_IO_CHANNEL_MAGIC 0x424F5254 /* BORT */
static io_manager iocache_backing_manager;
+static inline uint64_t B_TO_FSBT(io_channel channel, uint64_t number) {
+ return number / channel->block_size;
+}
+
+static inline uint64_t B_TO_FSB(io_channel channel, uint64_t number) {
+ return (number + channel->block_size - 1) / channel->block_size;
+}
+
struct iocache_private_data {
int magic;
- io_channel real;
+ io_channel real; /* lower level io channel */
+ io_channel channel; /* cache channel */
+ struct cache cache;
+ pthread_mutex_t stats_lock;
+ struct struct_io_stats io_stats;
+ unsigned long long write_errors;
};
+#define IOCACHEDATA(cache) \
+ (container_of(cache, struct iocache_private_data, cache))
+
static struct iocache_private_data *IOCACHE(io_channel channel)
{
return (struct iocache_private_data *)channel->private_data;
}
-static errcode_t iocache_read_error(io_channel channel, unsigned long block,
- int count, void *data, size_t size,
- int actual_bytes_read, errcode_t error)
+struct iocache_buf {
+ struct cache_node node;
+ struct list_head list;
+ blk64_t block;
+ void *buf;
+ errcode_t write_error;
+ unsigned int uptodate:1;
+ unsigned int dirty:1;
+};
+
+static inline void iocache_buf_lock(struct iocache_buf *ubuf)
{
- io_channel iocache_channel = channel->app_data;
+ pthread_mutex_lock(&ubuf->node.cn_mutex);
+}
- return iocache_channel->read_error(iocache_channel, block, count, data,
- size, actual_bytes_read, error);
+static inline void iocache_buf_unlock(struct iocache_buf *ubuf)
+{
+ pthread_mutex_unlock(&ubuf->node.cn_mutex);
}
-static errcode_t iocache_write_error(io_channel channel, unsigned long block,
- int count, const void *data, size_t size,
- int actual_bytes_written,
- errcode_t error)
+struct iocache_key {
+ blk64_t block;
+};
+
+#define IOKEY(key) ((struct iocache_key *)(key))
+#define IOBUF(node) (container_of((node), struct iocache_buf, node))
+
+static unsigned int
+iocache_hash(cache_key_t key, unsigned int hashsize, unsigned int hashshift)
{
- io_channel iocache_channel = channel->app_data;
+ uint64_t hashval = IOKEY(key)->block;
+ uint64_t tmp;
- return iocache_channel->write_error(iocache_channel, block, count, data,
- size, actual_bytes_written, error);
+ tmp = hashval ^ (GOLDEN_RATIO_PRIME + hashval) / CACHE_LINE_SIZE;
+ tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> hashshift);
+ return tmp % hashsize;
}
+static int iocache_compare(struct cache_node *node, cache_key_t key)
+{
+ struct iocache_buf *ubuf = IOBUF(node);
+ struct iocache_key *ukey = IOKEY(key);
+
+ if (ubuf->block == ukey->block)
+ return CACHE_HIT;
+
+ return CACHE_MISS;
+}
+
+static struct cache_node *iocache_alloc_node(struct cache *cache,
+ cache_key_t key)
+{
+ struct iocache_private_data *data = IOCACHEDATA(cache);
+ struct iocache_key *ukey = IOKEY(key);
+ struct iocache_buf *ubuf;
+ errcode_t retval;
+
+ retval = ext2fs_get_mem(sizeof(struct iocache_buf), &ubuf);
+ if (retval)
+ return NULL;
+ memset(ubuf, 0, sizeof(*ubuf));
+
+ retval = io_channel_alloc_buf(data->channel, 0, &ubuf->buf);
+ if (retval) {
+ ext2fs_free_mem(&ubuf);
+ return NULL;
+ }
+ memset(ubuf->buf, 0, data->channel->block_size);
+
+ INIT_LIST_HEAD(&ubuf->list);
+ ubuf->block = ukey->block;
+ return &ubuf->node;
+}
+
+static bool iocache_flush_node(struct cache *cache, struct cache_node *node)
+{
+ struct iocache_private_data *data = IOCACHEDATA(cache);
+ struct iocache_buf *ubuf = IOBUF(node);
+ errcode_t retval;
+
+ if (ubuf->dirty) {
+ retval = io_channel_write_blk64(data->real, ubuf->block, 1,
+ ubuf->buf);
+ if (retval) {
+ ubuf->write_error = retval;
+ data->write_errors++;
+ } else {
+ ubuf->dirty = 0;
+ ubuf->write_error = 0;
+ }
+ }
+
+ return ubuf->dirty;
+}
+
+static void iocache_relse(struct cache *cache, struct cache_node *node)
+{
+ struct iocache_buf *ubuf = IOBUF(node);
+
+ assert(!ubuf->dirty);
+
+ ext2fs_free_mem(&ubuf->buf);
+ ext2fs_free_mem(&ubuf);
+}
+
+static unsigned int iocache_bulkrelse(struct cache *cache,
+ struct list_head *list)
+{
+ struct cache_node *cn, *n;
+ int count = 0;
+
+ if (list_empty(list))
+ return 0;
+
+ list_for_each_entry_safe(cn, n, list, cn_mru) {
+ iocache_relse(cache, cn);
+ count++;
+ }
+
+ return count;
+}
+
+/* Flush all dirty buffers in the cache to disk. */
+static errcode_t iocache_flush_cache(struct iocache_private_data *data)
+{
+ return cache_flush(&data->cache) ? 0 : EIO;
+}
+
+/* Flush all dirty buffers in this range of the cache to disk. */
+static errcode_t iocache_flush_range(struct iocache_private_data *data,
+ blk64_t block, uint64_t count)
+{
+ uint64_t i;
+ bool still_dirty = false;
+
+ for (i = 0; i < count; i++) {
+ struct iocache_key ukey = {
+ .block = block + i,
+ };
+ struct cache_node *node;
+
+ cache_node_get(&data->cache, &ukey, CACHE_GET_INCORE,
+ &node);
+ if (!node)
+ continue;
+
+ /* cache_flush holds cn_mutex across the node flush */
+ pthread_mutex_lock(&node->cn_mutex);
+ still_dirty |= iocache_flush_node(&data->cache, node);
+ pthread_mutex_unlock(&node->cn_mutex);
+
+ cache_node_put(&data->cache, node);
+ }
+
+ return still_dirty ? EIO : 0;
+}
+
+static void iocache_add_list(struct cache *cache, struct cache_node *node,
+ void *data)
+{
+ struct iocache_buf *ubuf = IOBUF(node);
+ struct list_head *list = data;
+
+ assert(node->cn_count == 0 || node->cn_count == 1);
+
+ iocache_buf_lock(ubuf);
+ cache_node_grab(cache, node);
+ list_add_tail(&ubuf->list, list);
+ iocache_buf_unlock(ubuf);
+}
+
+static void iocache_invalidate_bufs(struct iocache_private_data *data,
+ struct list_head *list)
+{
+ struct iocache_buf *ubuf, *n;
+
+ list_for_each_entry_safe(ubuf, n, list, list) {
+ struct iocache_key ukey = {
+ .block = ubuf->block,
+ };
+
+ assert(ubuf->node.cn_count == 1);
+
+ iocache_buf_lock(ubuf);
+ ubuf->dirty = 0;
+ list_del_init(&ubuf->list);
+ iocache_buf_unlock(ubuf);
+
+ cache_node_put(&data->cache, &ubuf->node);
+ cache_node_purge(&data->cache, &ukey, &ubuf->node);
+ }
+}
+
+/*
+ * Remove all blocks from the cache. Dirty contents are discarded. Buffer
+ * refcounts must be zero!
+ */
+static void iocache_invalidate_cache(struct iocache_private_data *data)
+{
+ LIST_HEAD(list);
+
+ cache_walk(&data->cache, iocache_add_list, &list);
+ iocache_invalidate_bufs(data, &list);
+}
+
+/*
+ * Remove a range of blocks from the cache. Dirty contents are discarded.
+ * Buffer refcounts must be zero!
+ */
+static void iocache_invalidate_range(struct iocache_private_data *data,
+ blk64_t block, uint64_t count)
+{
+ LIST_HEAD(list);
+ uint64_t i;
+
+ for (i = 0; i < count; i++) {
+ struct iocache_key ukey = {
+ .block = block + i,
+ };
+ struct cache_node *node;
+
+ cache_node_get(&data->cache, &ukey, CACHE_GET_INCORE,
+ &node);
+ if (node) {
+ iocache_add_list(&data->cache, node, &list);
+ cache_node_put(&data->cache, node);
+ }
+ }
+ iocache_invalidate_bufs(data, &list);
+}
+
+static const struct cache_operations iocache_ops = {
+ .hash = iocache_hash,
+ .alloc = iocache_alloc_node,
+ .flush = iocache_flush_node,
+ .relse = iocache_relse,
+ .compare = iocache_compare,
+ .bulkrelse = iocache_bulkrelse,
+ .resize = cache_gradual_resize,
+};
+
static errcode_t iocache_open(const char *name, int flags, io_channel *channel)
{
io_channel io = NULL;
@@ -65,6 +307,9 @@ static errcode_t iocache_open(const char *name, int flags, io_channel *channel)
if (retval)
return retval;
+ /* disable any static cache in the lower io manager */
+ real->manager->set_option(real, "cache", "off");
+
retval = ext2fs_get_mem(sizeof(struct struct_io_channel), &io);
if (retval)
goto out_backing;
@@ -76,12 +321,19 @@ static errcode_t iocache_open(const char *name, int flags, io_channel *channel)
goto out_channel;
memset(data, 0, sizeof(struct iocache_private_data));
data->magic = IOCACHE_IO_CHANNEL_MAGIC;
+ data->io_stats.num_fields = 4;
+ data->channel = io;
io->manager = iocache_io_manager;
retval = ext2fs_get_mem(strlen(name) + 1, &io->name);
if (retval)
goto out_data;
+ retval = cache_init(CACHE_CAN_SHRINK, 1U << 10, &iocache_ops,
+ &data->cache);
+ if (retval)
+ goto out_name;
+
strcpy(io->name, name);
io->private_data = data;
io->block_size = real->block_size;
@@ -91,12 +343,14 @@ static errcode_t iocache_open(const char *name, int flags, io_channel *channel)
io->flags = real->flags;
data->real = real;
real->app_data = io;
- real->read_error = iocache_read_error;
- real->write_error = iocache_write_error;
+
+ pthread_mutex_init(&data->stats_lock, NULL);
*channel = io;
return 0;
+out_name:
+ ext2fs_free_mem(&io->name);
out_data:
ext2fs_free_mem(&data);
out_channel:
@@ -116,6 +370,10 @@ static errcode_t iocache_close(io_channel channel)
if (--channel->refcount > 0)
return 0;
+ pthread_mutex_destroy(&data->stats_lock);
+ cache_flush(&data->cache);
+ cache_purge(&data->cache);
+ cache_destroy(&data->cache);
if (data->real)
retval = io_channel_close(data->real);
ext2fs_free_mem(&channel->private_data);
@@ -134,6 +392,11 @@ static errcode_t iocache_set_blksize(io_channel channel, int blksize)
EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+ retval = iocache_flush_cache(data);
+ if (retval)
+ return retval;
+ iocache_invalidate_cache(data);
+
retval = io_channel_set_blksize(data->real, blksize);
if (retval)
return retval;
@@ -145,21 +408,34 @@ static errcode_t iocache_set_blksize(io_channel channel, int blksize)
static errcode_t iocache_flush(io_channel channel)
{
struct iocache_private_data *data = IOCACHE(channel);
+ errcode_t retval = 0;
+ errcode_t retval2;
EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
- return io_channel_flush(data->real);
+ retval = iocache_flush_cache(data);
+ retval2 = io_channel_flush(data->real);
+ if (retval)
+ return retval;
+ return retval2;
}
static errcode_t iocache_write_byte(io_channel channel, unsigned long offset,
int count, const void *buf)
{
struct iocache_private_data *data = IOCACHE(channel);
+ blk64_t bno = B_TO_FSBT(channel, offset);
+ blk64_t next_bno = B_TO_FSB(channel, offset + count);
+ errcode_t retval;
EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+ retval = iocache_flush_range(data, bno, next_bno - bno);
+ if (retval)
+ return retval;
+ iocache_invalidate_range(data, bno, next_bno - bno);
return io_channel_write_byte(data->real, offset, count, buf);
}
@@ -170,6 +446,16 @@ static errcode_t iocache_set_option(io_channel channel, const char *option,
EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+ errcode_t retval;
+
+ /* don't let unix io cache options leak through */
+ if (!strcmp(option, "cache_blocks") || !strcmp(option, "cache"))
+ return 0;
+
+ retval = iocache_flush_cache(data);
+ if (retval)
+ return retval;
+ iocache_invalidate_cache(data);
return data->real->manager->set_option(data->real, option, arg);
}
@@ -181,31 +467,157 @@ static errcode_t iocache_get_stats(io_channel channel, io_stats *io_stats)
EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
- return data->real->manager->get_stats(data->real, io_stats);
+ /*
+ * Yes, io_stats is a double-pointer, and we let the caller scribble on
+ * our stats struct WITHOUT LOCKING!
+ */
+ if (io_stats)
+ *io_stats = &data->io_stats;
+ return 0;
+}
+
+static void iocache_update_stats(struct iocache_private_data *data,
+ unsigned long long bytes_read,
+ unsigned long long bytes_written,
+ int cache_op)
+{
+ pthread_mutex_lock(&data->stats_lock);
+ data->io_stats.bytes_read += bytes_read;
+ data->io_stats.bytes_written += bytes_written;
+ if (cache_op == CACHE_HIT)
+ data->io_stats.cache_hits++;
+ else
+ data->io_stats.cache_misses++;
+ pthread_mutex_unlock(&data->stats_lock);
}
static errcode_t iocache_read_blk64(io_channel channel,
unsigned long long block, int count,
void *buf)
{
+ struct iocache_key ukey = {
+ .block = block,
+ };
struct iocache_private_data *data = IOCACHE(channel);
+ unsigned long long i;
+ errcode_t retval = 0;
EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
- return io_channel_read_blk64(data->real, block, count, buf);
+ /*
+ * If we're doing an odd-sized read, flush out the cache and then do a
+ * direct read.
+ */
+ if (count < 0) {
+ uint64_t fsbcount = B_TO_FSB(channel, -count);
+
+ retval = iocache_flush_range(data, block, fsbcount);
+ if (retval)
+ return retval;
+ iocache_invalidate_range(data, block, fsbcount);
+ iocache_update_stats(data, 0, 0, CACHE_MISS);
+ return io_channel_read_blk64(data->real, block, count, buf);
+ }
+
+ for (i = 0; i < count; i++, ukey.block++, buf += channel->block_size) {
+ struct cache_node *node;
+ struct iocache_buf *ubuf;
+
+ cache_node_get(&data->cache, &ukey, 0, &node);
+ if (!node) {
+ /* cannot instantiate cache, just do a direct read */
+ retval = io_channel_read_blk64(data->real, ukey.block,
+ 1, buf);
+ if (retval)
+ return retval;
+ iocache_update_stats(data, channel->block_size, 0,
+ CACHE_MISS);
+ continue;
+ }
+
+ ubuf = IOBUF(node);
+ iocache_buf_lock(ubuf);
+ if (!ubuf->uptodate) {
+ retval = io_channel_read_blk64(data->real, ukey.block,
+ 1, ubuf->buf);
+ if (!retval) {
+ ubuf->uptodate = 1;
+ iocache_update_stats(data, channel->block_size,
+ 0, CACHE_MISS);
+ }
+ } else {
+ iocache_update_stats(data, channel->block_size, 0,
+ CACHE_HIT);
+ }
+ if (ubuf->uptodate)
+ memcpy(buf, ubuf->buf, channel->block_size);
+ iocache_buf_unlock(ubuf);
+ cache_node_put(&data->cache, node);
+ if (retval)
+ return retval;
+ }
+
+ return 0;
}
static errcode_t iocache_write_blk64(io_channel channel,
unsigned long long block, int count,
const void *buf)
{
+ struct iocache_key ukey = {
+ .block = block,
+ };
struct iocache_private_data *data = IOCACHE(channel);
+ unsigned long long i;
+ errcode_t retval;
EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
- return io_channel_write_blk64(data->real, block, count, buf);
+ /*
+ * If we're doing an odd-sized write, flush out the cache and then do a
+ * direct write.
+ */
+ if (count < 0) {
+ uint64_t fsbcount = B_TO_FSB(channel, -count);
+
+ retval = iocache_flush_range(data, block, fsbcount);
+ if (retval)
+ return retval;
+ iocache_invalidate_range(data, block, fsbcount);
+ iocache_update_stats(data, 0, 0, CACHE_MISS);
+ return io_channel_write_blk64(data->real, block, count, buf);
+ }
+
+ for (i = 0; i < count; i++, ukey.block++, buf += channel->block_size) {
+ struct cache_node *node;
+ struct iocache_buf *ubuf;
+
+ cache_node_get(&data->cache, &ukey, 0, &node);
+ if (!node) {
+ /* cannot instantiate cache, do a direct write */
+ retval = io_channel_write_blk64(data->real, ukey.block,
+ 1, buf);
+ if (retval)
+ return retval;
+ iocache_update_stats(data, 0, channel->block_size,
+ CACHE_MISS);
+ continue;
+ }
+
+ ubuf = IOBUF(node);
+ iocache_buf_lock(ubuf);
+ memcpy(ubuf->buf, buf, channel->block_size);
+ iocache_update_stats(data, 0, channel->block_size,
+ ubuf->uptodate ? CACHE_HIT : CACHE_MISS);
+ ubuf->dirty = 1;
+ ubuf->uptodate = 1;
+ iocache_buf_unlock(ubuf);
+ cache_node_put(&data->cache, node);
+ }
+
+ return 0;
}
static errcode_t iocache_read_blk(io_channel channel, unsigned long block,
@@ -224,11 +636,17 @@ static errcode_t iocache_discard(io_channel channel, unsigned long long block,
unsigned long long count)
{
struct iocache_private_data *data = IOCACHE(channel);
+ errcode_t retval;
EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
- return io_channel_discard(data->real, block, count);
+ retval = io_channel_discard(data->real, block, count);
+ if (retval)
+ return retval;
+
+ iocache_invalidate_range(data, block, count);
+ return 0;
}
static errcode_t iocache_cache_readahead(io_channel channel,
@@ -247,11 +665,17 @@ static errcode_t iocache_zeroout(io_channel channel, unsigned long long block,
unsigned long long count)
{
struct iocache_private_data *data = IOCACHE(channel);
+ errcode_t retval;
EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
- return io_channel_zeroout(data->real, block, count);
+ retval = io_channel_zeroout(data->real, block, count);
+ if (retval)
+ return retval;
+
+ iocache_invalidate_range(data, block, count);
+ return 0;
}
static errcode_t iocache_get_fd(io_channel channel, int *fd)
@@ -273,6 +697,7 @@ static errcode_t iocache_invalidate_blocks(io_channel channel,
EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
EXT2_CHECK_MAGIC(data, IOCACHE_IO_CHANNEL_MAGIC);
+ iocache_invalidate_range(data, block, count);
return io_channel_invalidate_blocks(data->real, block, count);
}
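
The read and write paths in this patch follow a write-back pattern: reads fill a buffer on a miss, writes only dirty the buffer, and a flush pushes dirty buffers down to the real channel. A rough sketch of that pattern follows, under simplifying assumptions: one slot per block instead of a hashed cache, no locking, no eviction, and an in-memory "disk". All demo_* names are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

#define DEMO_BLKSZ 16
#define DEMO_NBLKS 8

/* Hypothetical in-memory "disk" that counts how often it is touched. */
struct demo_disk {
	unsigned char blocks[DEMO_NBLKS][DEMO_BLKSZ];
	unsigned int reads;
	unsigned int writes;
};

/* One cached block, mirroring the uptodate/dirty bits of iocache_buf. */
struct demo_buf {
	unsigned char data[DEMO_BLKSZ];
	bool uptodate;
	bool dirty;
};

/* One slot per block: no hashing, no eviction. */
struct demo_cache {
	struct demo_disk *disk;
	struct demo_buf bufs[DEMO_NBLKS];
};

/* Read one block, going to the disk only if the buffer is not uptodate. */
static int demo_read(struct demo_cache *c, unsigned int blk, void *out)
{
	struct demo_buf *b;

	if (blk >= DEMO_NBLKS)
		return -1;
	b = &c->bufs[blk];
	if (!b->uptodate) {
		memcpy(b->data, c->disk->blocks[blk], DEMO_BLKSZ);
		c->disk->reads++;
		b->uptodate = true;
	}
	memcpy(out, b->data, DEMO_BLKSZ);
	return 0;
}

/* Write one block into the cache only; the disk is updated at flush time. */
static int demo_write(struct demo_cache *c, unsigned int blk, const void *in)
{
	struct demo_buf *b;

	if (blk >= DEMO_NBLKS)
		return -1;
	b = &c->bufs[blk];
	memcpy(b->data, in, DEMO_BLKSZ);
	b->uptodate = true;
	b->dirty = true;
	return 0;
}

/* Push every dirty buffer to the disk and mark it clean. */
static void demo_flush(struct demo_cache *c)
{
	unsigned int blk;

	for (blk = 0; blk < DEMO_NBLKS; blk++) {
		struct demo_buf *b = &c->bufs[blk];

		if (!b->dirty)
			continue;
		memcpy(c->disk->blocks[blk], b->data, DEMO_BLKSZ);
		c->disk->writes++;
		b->dirty = false;
	}
}
```

In this sketch a write followed by a read of the same block never touches the disk, and repeated reads of a clean block cost one disk read. The patch gets the same effect through cache_node_get() and the per-buffer mutex.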
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 3/6] iocache: bump buffer mru priority every 50 accesses
2025-08-21 0:50 ` [PATCHSET RFC v4 6/6] fuse2fs: improve block and inode caching Darrick J. Wong
2025-08-21 1:23 ` [PATCH 1/6] libsupport: add caching IO manager Darrick J. Wong
2025-08-21 1:23 ` [PATCH 2/6] iocache: add the actual buffer cache Darrick J. Wong
@ 2025-08-21 1:24 ` Darrick J. Wong
2025-08-21 1:24 ` [PATCH 4/6] fuse2fs: enable caching IO manager Darrick J. Wong
` (2 subsequent siblings)
5 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:24 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
If a buffer is hot enough to survive more than 50 accesses without being
reclaimed, bump its priority to the next MRU so it sticks around longer.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/support/cache.h | 1 +
lib/support/cache.c | 16 ++++++++++++++++
lib/support/iocache.c | 9 +++++++++
3 files changed, 26 insertions(+)
diff --git a/lib/support/cache.h b/lib/support/cache.h
index f482948a3b6331..5a8e19f5d18e78 100644
--- a/lib/support/cache.h
+++ b/lib/support/cache.h
@@ -173,5 +173,6 @@ int cache_node_purge(struct cache *, cache_key_t, struct cache_node *);
void cache_report(FILE *fp, const char *, struct cache *);
int cache_overflowed(struct cache *);
struct cache_node *cache_node_grab(struct cache *cache, struct cache_node *node);
+void cache_node_bump_priority(struct cache *cache, struct cache_node *node);
#endif /* __CACHE_H__ */
diff --git a/lib/support/cache.c b/lib/support/cache.c
index 7e1ddc3cc8788d..34df5cb51cd5e4 100644
--- a/lib/support/cache.c
+++ b/lib/support/cache.c
@@ -649,6 +649,22 @@ cache_node_put(
cache_shrink(cache);
}
+/* Bump the priority of a cache node. Caller must hold cn_mutex. */
+void
+cache_node_bump_priority(
+ struct cache *cache,
+ struct cache_node *node)
+{
+ int *priop;
+
+ if (node->cn_priority == CACHE_DIRTY_PRIORITY)
+ priop = &node->cn_old_priority;
+ else
+ priop = &node->cn_priority;
+ if (*priop < CACHE_MAX_PRIORITY)
+ (*priop)++;
+}
+
void
cache_node_set_priority(
struct cache * cache,
diff --git a/lib/support/iocache.c b/lib/support/iocache.c
index ab879e85d18f2a..92d88331bfa54d 100644
--- a/lib/support/iocache.c
+++ b/lib/support/iocache.c
@@ -56,6 +56,7 @@ struct iocache_buf {
blk64_t block;
void *buf;
errcode_t write_error;
+ uint8_t access;
unsigned int uptodate:1;
unsigned int dirty:1;
};
@@ -552,6 +553,10 @@ static errcode_t iocache_read_blk64(io_channel channel,
}
if (ubuf->uptodate)
memcpy(buf, ubuf->buf, channel->block_size);
+ if (++ubuf->access > 50) {
+ cache_node_bump_priority(&data->cache, node);
+ ubuf->access = 0;
+ }
iocache_buf_unlock(ubuf);
cache_node_put(&data->cache, node);
if (retval)
@@ -613,6 +618,10 @@ static errcode_t iocache_write_blk64(io_channel channel,
ubuf->uptodate ? CACHE_HIT : CACHE_MISS);
ubuf->dirty = 1;
ubuf->uptodate = 1;
+ if (++ubuf->access > 50) {
+ cache_node_bump_priority(&data->cache, node);
+ ubuf->access = 0;
+ }
iocache_buf_unlock(ubuf);
cache_node_put(&data->cache, node);
}
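
The counting rule in the two iocache.c hunks above can be tried in isolation. The sketch below assumes a hypothetical demo_node with a plain integer priority and a made-up ceiling DEMO_MAX_PRIORITY; it leaves out the dirty-node case that cache_node_bump_priority handles through cn_old_priority. Because the test is "++access > 50", the bump lands on the 51st access since the last reset.

```c
#include <stdint.h>

#define DEMO_MAX_PRIORITY	15	/* hypothetical ceiling */
#define DEMO_BUMP_THRESHOLD	50	/* same constant as the patch */

/* Hypothetical node: just the two fields the counting rule needs. */
struct demo_node {
	int priority;
	uint8_t access;
};

/*
 * Record one access.  Returns 1 if the access counter crossed the
 * threshold and was reset, 0 otherwise.  The priority is raised by one
 * step on a crossing unless it is already at the ceiling.
 */
static int demo_touch(struct demo_node *n)
{
	if (++n->access > DEMO_BUMP_THRESHOLD) {
		if (n->priority < DEMO_MAX_PRIORITY)
			n->priority++;
		n->access = 0;
		return 1;
	}
	return 0;
}
```

Keeping the counter in a uint8_t is safe here because it is reset long before it could wrap at 255.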
^ permalink raw reply related [flat|nested] 210+ messages in thread
* [PATCH 4/6] fuse2fs: enable caching IO manager
2025-08-21 0:50 ` [PATCHSET RFC v4 6/6] fuse2fs: improve block and inode caching Darrick J. Wong
` (2 preceding siblings ...)
2025-08-21 1:24 ` [PATCH 3/6] iocache: bump buffer mru priority every 50 accesses Darrick J. Wong
@ 2025-08-21 1:24 ` Darrick J. Wong
2025-08-21 1:24 ` [PATCH 5/6] fuse2fs: increase inode cache size Darrick J. Wong
2025-08-21 1:24 ` [PATCH 6/6] libext2fs: improve caching for inodes Darrick J. Wong
5 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:24 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
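Enable the new dynamic iocache I/O manager in the fuse2fs and fuse4fs
servers, and remove the old cache controls: the cache_size= mount option,
the default cache sizing heuristic, and the cache_blocks channel option.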
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/Makefile.in | 7 ++++-
misc/fuse2fs.c | 71 ++++--------------------------------------------------
misc/fuse4fs.c | 69 ++--------------------------------------------------
3 files changed, 13 insertions(+), 134 deletions(-)
diff --git a/misc/Makefile.in b/misc/Makefile.in
index 36694d682d3b59..8a31b7fc42e643 100644
--- a/misc/Makefile.in
+++ b/misc/Makefile.in
@@ -891,7 +891,9 @@ fuse2fs.o: $(srcdir)/fuse2fs.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/ext2fs/ext2_ext_attr.h $(top_srcdir)/lib/ext2fs/hashmap.h \
$(top_srcdir)/lib/ext2fs/bitops.h $(top_srcdir)/lib/ext2fs/ext2fsP.h \
$(top_srcdir)/lib/ext2fs/ext2fs.h $(top_srcdir)/version.h \
- $(top_srcdir)/lib/e2p/e2p.h
+ $(top_srcdir)/lib/e2p/e2p.h $(top_srcdir)/lib/support/cache.h \
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/support/xbitops.h \
+ $(top_srcdir)/lib/support/iocache.h
fuse4fs.o: $(srcdir)/fuse4fs.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(top_srcdir)/lib/ext2fs/ext2fs.h \
$(top_builddir)/lib/ext2fs/ext2_types.h $(top_srcdir)/lib/ext2fs/ext2_fs.h \
@@ -901,7 +903,8 @@ fuse4fs.o: $(srcdir)/fuse4fs.c $(top_builddir)/lib/config.h \
$(top_srcdir)/lib/ext2fs/bitops.h $(top_srcdir)/lib/ext2fs/ext2fsP.h \
$(top_srcdir)/lib/ext2fs/ext2fs.h $(top_srcdir)/version.h \
$(top_srcdir)/lib/e2p/e2p.h $(top_srcdir)/lib/support/cache.h \
- $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/support/xbitops.h
+ $(top_srcdir)/lib/support/list.h $(top_srcdir)/lib/support/xbitops.h \
+ $(top_srcdir)/lib/support/iocache.h
e2fuzz.o: $(srcdir)/e2fuzz.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(top_srcdir)/lib/ext2fs/ext2_fs.h \
$(top_builddir)/lib/ext2fs/ext2_types.h $(top_srcdir)/lib/ext2fs/ext2fs.h \
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index f5d68cc549ad69..d3ac5f7b6627cd 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -50,6 +50,9 @@
#include "ext2fs/ext2fs.h"
#include "ext2fs/ext2_fs.h"
#include "ext2fs/ext2fsP.h"
+#include "support/list.h"
+#include "support/cache.h"
+#include "support/iocache.h"
#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
# define FUSE_PLATFORM_OPTS ""
#else
@@ -290,7 +293,6 @@ struct fuse2fs {
unsigned int blockmask;
unsigned long offset;
unsigned int next_generation;
- unsigned long long cache_size;
char *lockfile;
#ifdef HAVE_CLOCK_MONOTONIC
struct timespec lock_start_time;
@@ -1122,7 +1124,7 @@ static errcode_t fuse2fs_open(struct fuse2fs *ff, int libext2_flags)
dbg_printf(ff, "opening with flags=0x%x\n", flags);
- err = ext2fs_open2(ff->device, options, flags, 0, 0, unix_io_manager,
+ err = ext2fs_open2(ff->device, options, flags, 0, 0, iocache_io_manager,
&ff->fs);
if (err == EPERM) {
err_printf(ff, "%s.\n",
@@ -1150,25 +1152,6 @@ static inline bool fuse2fs_on_bdev(const struct fuse2fs *ff)
return ff->fs->io->flags & CHANNEL_FLAGS_BLOCK_DEVICE;
}
-static errcode_t fuse2fs_config_cache(struct fuse2fs *ff)
-{
- char buf[128];
- errcode_t err;
-
- snprintf(buf, sizeof(buf), "cache_blocks=%llu",
- FUSE2FS_B_TO_FSBT(ff, ff->cache_size));
- err = io_channel_set_options(ff->fs->io, buf);
- if (err) {
- err_printf(ff, "%s %lluk: %s\n",
- _("cannot set disk cache size to"),
- ff->cache_size >> 10,
- error_message(err));
- return err;
- }
-
- return 0;
-}
-
static errcode_t fuse2fs_check_support(struct fuse2fs *ff)
{
ext2_filsys fs = ff->fs;
@@ -6829,7 +6812,6 @@ enum {
FUSE2FS_VERSION,
FUSE2FS_HELP,
FUSE2FS_HELPFULL,
- FUSE2FS_CACHE_SIZE,
FUSE2FS_DIRSYNC,
FUSE2FS_ERRORS_BEHAVIOR,
#ifdef HAVE_FUSE_IOMAP
@@ -6879,7 +6861,6 @@ static struct fuse_opt fuse2fs_opts[] = {
FUSE_OPT_KEY("user_xattr", FUSE2FS_IGNORED),
FUSE_OPT_KEY("noblock_validity", FUSE2FS_IGNORED),
FUSE_OPT_KEY("nodelalloc", FUSE2FS_IGNORED),
- FUSE_OPT_KEY("cache_size=%s", FUSE2FS_CACHE_SIZE),
FUSE_OPT_KEY("dirsync", FUSE2FS_DIRSYNC),
FUSE_OPT_KEY("errors=%s", FUSE2FS_ERRORS_BEHAVIOR),
#ifdef HAVE_FUSE_IOMAP
@@ -6918,16 +6899,6 @@ static int fuse2fs_opt_proc(void *data, const char *arg,
return 0;
}
return 1;
- case FUSE2FS_CACHE_SIZE:
- ff->cache_size = parse_num_blocks2(arg + 11, -1);
- if (ff->cache_size < 1 || ff->cache_size > INT32_MAX) {
- fprintf(stderr, "%s: %s\n", arg,
- _("cache size must be between 1 block and 2GB."));
- return -1;
- }
-
- /* do not pass through to libfuse */
- return 0;
case FUSE2FS_ERRORS_BEHAVIOR:
if (strcmp(arg + 7, "continue") == 0)
ff->errors_behavior = EXT2_ERRORS_CONTINUE;
@@ -6984,7 +6955,6 @@ static int fuse2fs_opt_proc(void *data, const char *arg,
" -o kernel run this as if it were the kernel, which sets:\n"
" allow_others,default_permissions,suid,dev\n"
" -o directio use O_DIRECT to read and write the disk\n"
- " -o cache_size=N[KMG] use a disk cache of this size\n"
" -o errors= behavior when an error is encountered:\n"
" continue|remount-ro|panic\n"
#ifdef HAVE_FUSE_IOMAP
@@ -7028,28 +6998,6 @@ static const char *get_subtype(const char *argv0)
return "ext4";
}
-/* Figure out a reasonable default size for the disk cache */
-static unsigned long long default_cache_size(void)
-{
- long pages = 0, pagesize = 0;
- unsigned long long max_cache;
- unsigned long long ret = 32ULL << 20; /* 32 MB */
-
-#ifdef _SC_PHYS_PAGES
- pages = sysconf(_SC_PHYS_PAGES);
-#endif
-#ifdef _SC_PAGESIZE
- pagesize = sysconf(_SC_PAGESIZE);
-#endif
- if (pages > 0 && pagesize > 0) {
- max_cache = (unsigned long long)pagesize * pages / 20;
-
- if (max_cache > 0 && ret > max_cache)
- ret = max_cache;
- }
- return ret;
-}
-
#ifdef HAVE_FUSE_IOMAP
static inline bool fuse2fs_discover_iomap(struct fuse2fs *ff)
{
@@ -7170,6 +7118,7 @@ int main(int argc, char *argv[])
fctx.alloc_all_blocks = 1;
}
+ iocache_set_backing_manager(unix_io_manager);
err = fuse2fs_open(&fctx, EXT2_FLAG_EXCLUSIVE);
if (err) {
ret = 32;
@@ -7206,16 +7155,6 @@ int main(int argc, char *argv[])
goto out;
}
- if (!fctx.cache_size)
- fctx.cache_size = default_cache_size();
- if (fctx.cache_size) {
- err = fuse2fs_config_cache(&fctx);
- if (err) {
- ret = 32;
- goto out;
- }
- }
-
err = fuse2fs_check_support(&fctx);
if (err) {
ret = 32;
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 6f03c6a0933a3d..85d73a9088d237 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -53,6 +53,7 @@
#include "ext2fs/ext2fsP.h"
#include "support/list.h"
#include "support/cache.h"
+#include "support/iocache.h"
#include "../version.h"
#include "uuid/uuid.h"
@@ -286,7 +287,6 @@ struct fuse4fs {
unsigned int blockmask;
unsigned long offset;
unsigned int next_generation;
- unsigned long long cache_size;
char *lockfile;
#ifdef HAVE_CLOCK_MONOTONIC
struct timespec lock_start_time;
@@ -1281,7 +1281,7 @@ static errcode_t fuse4fs_open(struct fuse4fs *ff, int libext2_flags)
dbg_printf(ff, "opening with flags=0x%x\n", flags);
- err = ext2fs_open2(ff->device, options, flags, 0, 0, unix_io_manager,
+ err = ext2fs_open2(ff->device, options, flags, 0, 0, iocache_io_manager,
&ff->fs);
if (err == EPERM) {
err_printf(ff, "%s.\n",
@@ -1313,25 +1313,6 @@ static inline bool fuse4fs_on_bdev(const struct fuse4fs *ff)
return ff->fs->io->flags & CHANNEL_FLAGS_BLOCK_DEVICE;
}
-static errcode_t fuse4fs_config_cache(struct fuse4fs *ff)
-{
- char buf[128];
- errcode_t err;
-
- snprintf(buf, sizeof(buf), "cache_blocks=%llu",
- FUSE4FS_B_TO_FSBT(ff, ff->cache_size));
- err = io_channel_set_options(ff->fs->io, buf);
- if (err) {
- err_printf(ff, "%s %lluk: %s\n",
- _("cannot set disk cache size to"),
- ff->cache_size >> 10,
- error_message(err));
- return err;
- }
-
- return 0;
-}
-
static errcode_t fuse4fs_check_support(struct fuse4fs *ff)
{
ext2_filsys fs = ff->fs;
@@ -7113,7 +7094,6 @@ enum {
FUSE4FS_VERSION,
FUSE4FS_HELP,
FUSE4FS_HELPFULL,
- FUSE4FS_CACHE_SIZE,
FUSE4FS_DIRSYNC,
FUSE4FS_ERRORS_BEHAVIOR,
#ifdef HAVE_FUSE_IOMAP
@@ -7163,7 +7143,6 @@ static struct fuse_opt fuse4fs_opts[] = {
FUSE_OPT_KEY("user_xattr", FUSE4FS_IGNORED),
FUSE_OPT_KEY("noblock_validity", FUSE4FS_IGNORED),
FUSE_OPT_KEY("nodelalloc", FUSE4FS_IGNORED),
- FUSE_OPT_KEY("cache_size=%s", FUSE4FS_CACHE_SIZE),
FUSE_OPT_KEY("dirsync", FUSE4FS_DIRSYNC),
FUSE_OPT_KEY("errors=%s", FUSE4FS_ERRORS_BEHAVIOR),
#ifdef HAVE_FUSE_IOMAP
@@ -7202,16 +7181,6 @@ static int fuse4fs_opt_proc(void *data, const char *arg,
return 0;
}
return 1;
- case FUSE4FS_CACHE_SIZE:
- ff->cache_size = parse_num_blocks2(arg + 11, -1);
- if (ff->cache_size < 1 || ff->cache_size > INT32_MAX) {
- fprintf(stderr, "%s: %s\n", arg,
- _("cache size must be between 1 block and 2GB."));
- return -1;
- }
-
- /* do not pass through to libfuse */
- return 0;
case FUSE4FS_ERRORS_BEHAVIOR:
if (strcmp(arg + 7, "continue") == 0)
ff->errors_behavior = EXT2_ERRORS_CONTINUE;
@@ -7268,7 +7237,6 @@ static int fuse4fs_opt_proc(void *data, const char *arg,
" -o kernel run this as if it were the kernel, which sets:\n"
" allow_others,default_permissions,suid,dev\n"
" -o directio use O_DIRECT to read and write the disk\n"
- " -o cache_size=N[KMG] use a disk cache of this size\n"
" -o errors= behavior when an error is encountered:\n"
" continue|remount-ro|panic\n"
#ifdef HAVE_FUSE_IOMAP
@@ -7311,28 +7279,6 @@ static const char *get_subtype(const char *argv0)
return "ext4";
}
-/* Figure out a reasonable default size for the disk cache */
-static unsigned long long default_cache_size(void)
-{
- long pages = 0, pagesize = 0;
- unsigned long long max_cache;
- unsigned long long ret = 32ULL << 20; /* 32 MB */
-
-#ifdef _SC_PHYS_PAGES
- pages = sysconf(_SC_PHYS_PAGES);
-#endif
-#ifdef _SC_PAGESIZE
- pagesize = sysconf(_SC_PAGESIZE);
-#endif
- if (pages > 0 && pagesize > 0) {
- max_cache = (unsigned long long)pagesize * pages / 20;
-
- if (max_cache > 0 && ret > max_cache)
- ret = max_cache;
- }
- return ret;
-}
-
#ifdef HAVE_FUSE_IOMAP
static inline bool fuse4fs_discover_iomap(struct fuse4fs *ff)
{
@@ -7554,6 +7500,7 @@ int main(int argc, char *argv[])
fctx.alloc_all_blocks = 1;
}
+ iocache_set_backing_manager(unix_io_manager);
err = fuse4fs_open(&fctx, EXT2_FLAG_EXCLUSIVE);
if (err) {
ret = 32;
@@ -7603,16 +7550,6 @@ int main(int argc, char *argv[])
fctx.translate_inums = 0;
}
- if (!fctx.cache_size)
- fctx.cache_size = default_cache_size();
- if (fctx.cache_size) {
- err = fuse4fs_config_cache(&fctx);
- if (err) {
- ret = 32;
- goto out;
- }
- }
-
err = fuse4fs_check_support(&fctx);
if (err) {
ret = 32;
^ permalink raw reply related [flat|nested] 210+ messages in thread
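For comparison with the new behavior, the sizing heuristic that the patch above removes can be restated as a standalone function: 32 MiB, capped at 1/20 of physical memory when that is smaller. This is only a sketch of the deleted default_cache_size() logic; the `sketch_` names and the split into a pure clamp helper are assumptions made for illustration.

```c
#include <unistd.h>

#define SKETCH_DEFAULT_CACHE	(32ULL << 20)	/* 32 MiB */

/* Pure form of the removed heuristic: min(32 MiB, RAM / 20). */
static unsigned long long sketch_clamp_cache_size(long pages, long pagesize)
{
	unsigned long long ret = SKETCH_DEFAULT_CACHE;
	unsigned long long max_cache;

	if (pages > 0 && pagesize > 0) {
		max_cache = (unsigned long long)pagesize * pages / 20;
		if (max_cache > 0 && ret > max_cache)
			ret = max_cache;
	}
	return ret;
}

/* Query the system the same way the removed helper did. */
static unsigned long long sketch_default_cache_size(void)
{
	long pages = 0, pagesize = 0;

#ifdef _SC_PHYS_PAGES
	pages = sysconf(_SC_PHYS_PAGES);
#endif
#ifdef _SC_PAGESIZE
	pagesize = sysconf(_SC_PAGESIZE);
#endif
	return sketch_clamp_cache_size(pages, pagesize);
}
```

With the patch applied none of this runs; the iocache manager is expected to size itself, and the cache_size= option is no longer parsed.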
* [PATCH 5/6] fuse2fs: increase inode cache size
2025-08-21 0:50 ` [PATCHSET RFC v4 6/6] fuse2fs: improve block and inode caching Darrick J. Wong
` (3 preceding siblings ...)
2025-08-21 1:24 ` [PATCH 4/6] fuse2fs: enable caching IO manager Darrick J. Wong
@ 2025-08-21 1:24 ` Darrick J. Wong
2025-08-21 1:24 ` [PATCH 6/6] libext2fs: improve caching for inodes Darrick J. Wong
5 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:24 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Increase the internal inode cache size to 1024 entries. Whether this
improves performance is still an open question.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
misc/fuse2fs.c | 4 ++++
misc/fuse4fs.c | 4 ++++
2 files changed, 8 insertions(+)
diff --git a/misc/fuse2fs.c b/misc/fuse2fs.c
index d3ac5f7b6627cd..0c310443b1504b 100644
--- a/misc/fuse2fs.c
+++ b/misc/fuse2fs.c
@@ -1141,6 +1141,10 @@ static errcode_t fuse2fs_open(struct fuse2fs *ff, int libext2_flags)
return err;
}
+ err = ext2fs_create_inode_cache(ff->fs, 1024);
+ if (err)
+ return translate_error(ff->fs, 0, err);
+
ff->fs->priv_data = ff;
ff->blocklog = u_log2(ff->fs->blocksize);
ff->blockmask = ff->fs->blocksize - 1;
diff --git a/misc/fuse4fs.c b/misc/fuse4fs.c
index 85d73a9088d237..186a3188acfa59 100644
--- a/misc/fuse4fs.c
+++ b/misc/fuse4fs.c
@@ -1302,6 +1302,10 @@ static errcode_t fuse4fs_open(struct fuse4fs *ff, int libext2_flags)
if (err)
return translate_error(ff->fs, 0, err);
+ err = ext2fs_create_inode_cache(ff->fs, 1024);
+ if (err)
+ return translate_error(ff->fs, 0, err);
+
ff->fs->priv_data = ff;
ff->blocklog = u_log2(ff->fs->blocksize);
ff->blockmask = ff->fs->blocksize - 1;
^ permalink raw reply related [flat|nested] 210+ messages in thread
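One way to reason about the question in the commit message is a toy model of the pre-existing inode cache, which (per the code removed in patch 6/6) is a fixed array searched linearly and refilled round-robin. The sketch below counts hits for a workload that cycles through a set of inode numbers. It is a hypothetical model with made-up names, not a benchmark, and it ignores the cost of the linear scan, which grows with the slot count.

```c
#include <stdlib.h>

/*
 * Count cache hits when cycling through inode numbers 1..nr_inodes
 * for the given number of rounds, using a round-robin cache with
 * nslots entries. Slot value 0 means "empty".
 */
static unsigned long sketch_fifo_hits(unsigned int nslots,
				      unsigned int nr_inodes,
				      unsigned int rounds)
{
	unsigned int *slots;
	unsigned int next = 0;	/* next slot to overwrite */
	unsigned int r, ino, i;
	unsigned long hits = 0;

	if (nslots == 0)
		return 0;
	slots = calloc(nslots, sizeof(*slots));
	if (!slots)
		return 0;

	for (r = 0; r < rounds; r++) {
		for (ino = 1; ino <= nr_inodes; ino++) {
			int found = 0;

			/* Linear scan, as in the old lookup loop. */
			for (i = 0; i < nslots; i++) {
				if (slots[i] == ino) {
					found = 1;
					break;
				}
			}
			if (found) {
				hits++;
				continue;
			}
			/* Miss: overwrite the oldest slot. */
			slots[next] = ino;
			next = (next + 1) % nslots;
		}
	}
	free(slots);
	return hits;
}
```

In this model a working set larger than the cache never hits at all, while one that fits hits on every access after the first pass. Real workloads are not cyclic, so this only shows why a larger cache might help, not that it does.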
* [PATCH 6/6] libext2fs: improve caching for inodes
2025-08-21 0:50 ` [PATCHSET RFC v4 6/6] fuse2fs: improve block and inode caching Darrick J. Wong
` (4 preceding siblings ...)
2025-08-21 1:24 ` [PATCH 5/6] fuse2fs: increase inode cache size Darrick J. Wong
@ 2025-08-21 1:24 ` Darrick J. Wong
5 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 1:24 UTC (permalink / raw)
To: tytso; +Cc: John, bernd, linux-fsdevel, linux-ext4, miklos, joannelkoong,
neal
From: Darrick J. Wong <djwong@kernel.org>
Use our new cache code to improve the ondisk inode cache inside
libext2fs. Known problems: list.h is now duplicated, and libext2fs
needs to link against libsupport.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/ext2fs/ext2fsP.h | 13 ++-
debugfs/Makefile.in | 4 -
e2fsck/Makefile.in | 4 -
lib/ext2fs/Makefile.in | 4 +
lib/ext2fs/inode.c | 215 +++++++++++++++++++++++++++++++++++++----------
resize/Makefile.in | 4 -
tests/progs/Makefile.in | 4 -
7 files changed, 187 insertions(+), 61 deletions(-)
diff --git a/lib/ext2fs/ext2fsP.h b/lib/ext2fs/ext2fsP.h
index 428081c9e2ff38..8490dd5139d543 100644
--- a/lib/ext2fs/ext2fsP.h
+++ b/lib/ext2fs/ext2fsP.h
@@ -82,21 +82,26 @@ struct dir_context {
errcode_t errcode;
};
+#include "support/list.h"
+#include "support/cache.h"
+
/*
* Inode cache structure
*/
struct ext2_inode_cache {
void * buffer;
blk64_t buffer_blk;
- int cache_last;
- unsigned int cache_size;
int refcount;
- struct ext2_inode_cache_ent *cache;
+ struct cache cache;
};
struct ext2_inode_cache_ent {
+ struct cache_node node;
ext2_ino_t ino;
- struct ext2_inode *inode;
+ uint8_t access;
+
+ /* bytes representing a host-endian ext2_inode_large object */
+ char raw[];
};
/*
diff --git a/debugfs/Makefile.in b/debugfs/Makefile.in
index 700ae87418c268..8dfd802692b839 100644
--- a/debugfs/Makefile.in
+++ b/debugfs/Makefile.in
@@ -38,9 +38,9 @@ SRCS= debug_cmds.c $(srcdir)/debugfs.c $(srcdir)/util.c $(srcdir)/ls.c \
$(srcdir)/../e2fsck/recovery.c $(srcdir)/do_journal.c \
$(srcdir)/do_orphan.c
-LIBS= $(LIBSUPPORT) $(LIBEXT2FS) $(LIBE2P) $(LIBSS) $(LIBCOM_ERR) $(LIBBLKID) \
+LIBS= $(LIBEXT2FS) $(LIBSUPPORT) $(LIBE2P) $(LIBSS) $(LIBCOM_ERR) $(LIBBLKID) \
$(LIBUUID) $(LIBMAGIC) $(SYSLIBS) $(LIBARCHIVE)
-DEPLIBS= $(DEPLIBSUPPORT) $(LIBEXT2FS) $(LIBE2P) $(DEPLIBSS) $(DEPLIBCOM_ERR) \
+DEPLIBS= $(LIBEXT2FS) $(DEPLIBSUPPORT) $(LIBE2P) $(DEPLIBSS) $(DEPLIBCOM_ERR) \
$(DEPLIBBLKID) $(DEPLIBUUID)
STATIC_LIBS= $(STATIC_LIBSUPPORT) $(STATIC_LIBEXT2FS) $(STATIC_LIBSS) \
diff --git a/e2fsck/Makefile.in b/e2fsck/Makefile.in
index 52fad9cbfd2b23..61451f2d9e3276 100644
--- a/e2fsck/Makefile.in
+++ b/e2fsck/Makefile.in
@@ -16,9 +16,9 @@ PROGS= e2fsck
MANPAGES= e2fsck.8
FMANPAGES= e2fsck.conf.5
-LIBS= $(LIBSUPPORT) $(LIBEXT2FS) $(LIBCOM_ERR) $(LIBBLKID) $(LIBUUID) \
+LIBS= $(LIBEXT2FS) $(LIBSUPPORT) $(LIBCOM_ERR) $(LIBBLKID) $(LIBUUID) \
$(LIBINTL) $(LIBE2P) $(LIBMAGIC) $(SYSLIBS)
-DEPLIBS= $(DEPLIBSUPPORT) $(LIBEXT2FS) $(DEPLIBCOM_ERR) $(DEPLIBBLKID) \
+DEPLIBS= $(LIBEXT2FS) $(DEPLIBSUPPORT) $(DEPLIBCOM_ERR) $(DEPLIBBLKID) \
$(DEPLIBUUID) $(DEPLIBE2P)
STATIC_LIBS= $(STATIC_LIBSUPPORT) $(STATIC_LIBEXT2FS) $(STATIC_LIBCOM_ERR) \
diff --git a/lib/ext2fs/Makefile.in b/lib/ext2fs/Makefile.in
index 1d0991defff804..89254ded7c0723 100644
--- a/lib/ext2fs/Makefile.in
+++ b/lib/ext2fs/Makefile.in
@@ -976,7 +976,9 @@ inode.o: $(srcdir)/inode.c $(top_builddir)/lib/config.h \
$(srcdir)/ext2fs.h $(srcdir)/ext2_fs.h $(srcdir)/ext3_extents.h \
$(top_srcdir)/lib/et/com_err.h $(srcdir)/ext2_io.h \
$(top_builddir)/lib/ext2fs/ext2_err.h $(srcdir)/ext2_ext_attr.h \
- $(srcdir)/hashmap.h $(srcdir)/bitops.h $(srcdir)/e2image.h
+ $(srcdir)/hashmap.h $(srcdir)/bitops.h $(srcdir)/e2image.h \
+ $(srcdir)/../support/cache.h $(srcdir)/../support/list.h \
+ $(srcdir)/../support/xbitops.h
inode_io.o: $(srcdir)/inode_io.c $(top_builddir)/lib/config.h \
$(top_builddir)/lib/dirpaths.h $(srcdir)/ext2_fs.h \
$(top_builddir)/lib/ext2fs/ext2_types.h $(srcdir)/ext2fs.h \
diff --git a/lib/ext2fs/inode.c b/lib/ext2fs/inode.c
index c9389a2324be07..8ca82af1ab35d3 100644
--- a/lib/ext2fs/inode.c
+++ b/lib/ext2fs/inode.c
@@ -59,18 +59,145 @@ struct ext2_struct_inode_scan {
int reserved[6];
};
+struct ext2_inode_cache_key {
+ ext2_filsys fs;
+ ext2_ino_t ino;
+};
+
+#define ICKEY(key) ((struct ext2_inode_cache_key *)(key))
+#define ICNODE(node) (container_of((node), struct ext2_inode_cache_ent, node))
+
+static unsigned int
+ext2_inode_cache_hash(cache_key_t key, unsigned int hashsize,
+ unsigned int hashshift)
+{
+ uint64_t hashval = ICKEY(key)->ino;
+ uint64_t tmp;
+
+ tmp = hashval ^ (GOLDEN_RATIO_PRIME + hashval) / CACHE_LINE_SIZE;
+ tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> hashshift);
+ return tmp % hashsize;
+}
+
+static int ext2_inode_cache_compare(struct cache_node *node, cache_key_t key)
+{
+ struct ext2_inode_cache_ent *ent = ICNODE(node);
+ struct ext2_inode_cache_key *ikey = ICKEY(key);
+
+ if (ent->ino == ikey->ino)
+ return CACHE_HIT;
+
+ return CACHE_MISS;
+}
+
+static struct cache_node *ext2_inode_cache_alloc(struct cache *c,
+ cache_key_t key)
+{
+ struct ext2_inode_cache_key *ikey = ICKEY(key);
+ struct ext2_inode_cache_ent *ent;
+
+ ent = calloc(1, sizeof(struct ext2_inode_cache_ent) +
+ EXT2_INODE_SIZE(ikey->fs->super));
+ if (!ent)
+ return NULL;
+
+ ent->ino = ikey->ino;
+ return &ent->node;
+}
+
+static bool ext2_inode_cache_flush(struct cache *c, struct cache_node *node)
+{
+ /* can always drop inode cache */
+ return 0;
+}
+
+static void ext2_inode_cache_relse(struct cache *c, struct cache_node *node)
+{
+ struct ext2_inode_cache_ent *ent = ICNODE(node);
+
+ free(ent);
+}
+
+static unsigned int ext2_inode_cache_bulkrelse(struct cache *cache,
+ struct list_head *list)
+{
+ struct cache_node *cn, *n;
+ int count = 0;
+
+ if (list_empty(list))
+ return 0;
+
+ list_for_each_entry_safe(cn, n, list, cn_mru) {
+ ext2_inode_cache_relse(cache, cn);
+ count++;
+ }
+
+ return count;
+}
+
+static const struct cache_operations ext2_inode_cache_ops = {
+ .hash = ext2_inode_cache_hash,
+ .alloc = ext2_inode_cache_alloc,
+ .flush = ext2_inode_cache_flush,
+ .relse = ext2_inode_cache_relse,
+ .compare = ext2_inode_cache_compare,
+ .bulkrelse = ext2_inode_cache_bulkrelse,
+ .resize = cache_gradual_resize,
+};
+
+static errcode_t ext2_inode_cache_iget(ext2_filsys fs, ext2_ino_t ino,
+ unsigned int getflags,
+ struct ext2_inode_cache_ent **entp)
+{
+ struct ext2_inode_cache_key ikey = {
+ .fs = fs,
+ .ino = ino,
+ };
+ struct cache_node *node = NULL;
+
+ cache_node_get(&fs->icache->cache, &ikey, getflags, &node);
+ if (!node)
+ return ENOMEM;
+
+ *entp = ICNODE(node);
+ return 0;
+}
+
+static void ext2_inode_cache_iput(ext2_filsys fs,
+ struct ext2_inode_cache_ent *ent)
+{
+ cache_node_put(&fs->icache->cache, &ent->node);
+}
+
+static int ext2_inode_cache_ipurge(ext2_filsys fs, ext2_ino_t ino,
+ struct ext2_inode_cache_ent *ent)
+{
+ struct ext2_inode_cache_key ikey = {
+ .fs = fs,
+ .ino = ino,
+ };
+
+ return cache_node_purge(&fs->icache->cache, &ikey, &ent->node);
+}
+
+static void ext2_inode_cache_ibump(ext2_filsys fs,
+ struct ext2_inode_cache_ent *ent)
+{
+ if (++ent->access > 50) {
+ cache_node_bump_priority(&fs->icache->cache, &ent->node);
+ ent->access = 0;
+ }
+}
+
/*
* This routine flushes the icache, if it exists.
*/
errcode_t ext2fs_flush_icache(ext2_filsys fs)
{
- unsigned i;
-
if (!fs->icache)
return 0;
- for (i=0; i < fs->icache->cache_size; i++)
- fs->icache->cache[i].ino = 0;
+ cache_purge(&fs->icache->cache);
fs->icache->buffer_blk = 0;
return 0;
@@ -81,23 +208,20 @@ errcode_t ext2fs_flush_icache(ext2_filsys fs)
*/
void ext2fs_free_inode_cache(struct ext2_inode_cache *icache)
{
- unsigned i;
-
if (--icache->refcount)
return;
if (icache->buffer)
ext2fs_free_mem(&icache->buffer);
- for (i = 0; i < icache->cache_size; i++)
- ext2fs_free_mem(&icache->cache[i].inode);
- if (icache->cache)
- ext2fs_free_mem(&icache->cache);
+ if (cache_initialized(&icache->cache)) {
+ cache_purge(&icache->cache);
+ cache_destroy(&icache->cache);
+ }
icache->buffer_blk = 0;
ext2fs_free_mem(&icache);
}
errcode_t ext2fs_create_inode_cache(ext2_filsys fs, unsigned int cache_size)
{
- unsigned i;
errcode_t retval;
if (fs->icache)
@@ -112,22 +236,12 @@ errcode_t ext2fs_create_inode_cache(ext2_filsys fs, unsigned int cache_size)
goto errout;
fs->icache->buffer_blk = 0;
- fs->icache->cache_last = -1;
- fs->icache->cache_size = cache_size;
fs->icache->refcount = 1;
- retval = ext2fs_get_array(fs->icache->cache_size,
- sizeof(struct ext2_inode_cache_ent),
- &fs->icache->cache);
+ retval = cache_init(0, cache_size, &ext2_inode_cache_ops,
+ &fs->icache->cache);
if (retval)
goto errout;
- for (i = 0; i < fs->icache->cache_size; i++) {
- retval = ext2fs_get_mem(EXT2_INODE_SIZE(fs->super),
- &fs->icache->cache[i].inode);
- if (retval)
- goto errout;
- }
-
ext2fs_flush_icache(fs);
return 0;
errout:
@@ -762,12 +876,12 @@ errcode_t ext2fs_read_inode2(ext2_filsys fs, ext2_ino_t ino,
unsigned long block, offset;
char *ptr;
errcode_t retval;
- unsigned i;
int clen, inodes_per_block;
io_channel io;
int length = EXT2_INODE_SIZE(fs->super);
struct ext2_inode_large *iptr;
- int cache_slot, fail_csum;
+ struct ext2_inode_cache_ent *ent = NULL;
+ int fail_csum;
EXT2_CHECK_MAGIC(fs, EXT2_ET_MAGIC_EXT2FS_FILSYS);
@@ -794,12 +908,12 @@ errcode_t ext2fs_read_inode2(ext2_filsys fs, ext2_ino_t ino,
return retval;
}
/* Check to see if it's in the inode cache */
- for (i = 0; i < fs->icache->cache_size; i++) {
- if (fs->icache->cache[i].ino == ino) {
- memcpy(inode, fs->icache->cache[i].inode,
- (bufsize > length) ? length : bufsize);
- return 0;
- }
+ ext2_inode_cache_iget(fs, ino, CACHE_GET_INCORE, &ent);
+ if (ent) {
+ memcpy(inode, ent->raw, (bufsize > length) ? length : bufsize);
+ ext2_inode_cache_ibump(fs, ent);
+ ext2_inode_cache_iput(fs, ent);
+ return 0;
}
if (fs->flags & EXT2_FLAG_IMAGE_FILE) {
inodes_per_block = fs->blocksize / EXT2_INODE_SIZE(fs->super);
@@ -827,8 +941,10 @@ errcode_t ext2fs_read_inode2(ext2_filsys fs, ext2_ino_t ino,
}
offset &= (EXT2_BLOCK_SIZE(fs->super) - 1);
- cache_slot = (fs->icache->cache_last + 1) % fs->icache->cache_size;
- iptr = (struct ext2_inode_large *)fs->icache->cache[cache_slot].inode;
+ retval = ext2_inode_cache_iget(fs, ino, 0, &ent);
+ if (retval)
+ return retval;
+ iptr = (struct ext2_inode_large *)ent->raw;
ptr = (char *) iptr;
while (length) {
@@ -863,13 +979,15 @@ errcode_t ext2fs_read_inode2(ext2_filsys fs, ext2_ino_t ino,
0, length);
#endif
- /* Update the inode cache bookkeeping */
- if (!fail_csum) {
- fs->icache->cache_last = cache_slot;
- fs->icache->cache[cache_slot].ino = ino;
- }
memcpy(inode, iptr, (bufsize > length) ? length : bufsize);
+ /* Update the inode cache bookkeeping */
+ if (!fail_csum)
+ ext2_inode_cache_ibump(fs, ent);
+ ext2_inode_cache_iput(fs, ent);
+ if (fail_csum)
+ ext2_inode_cache_ipurge(fs, ino, ent);
+
if (!(fs->flags & EXT2_FLAG_IGNORE_CSUM_ERRORS) &&
!(flags & READ_INODE_NOCSUM) && fail_csum)
return EXT2_ET_INODE_CSUM_INVALID;
@@ -899,8 +1017,8 @@ errcode_t ext2fs_write_inode2(ext2_filsys fs, ext2_ino_t ino,
unsigned long block, offset;
errcode_t retval = 0;
struct ext2_inode_large *w_inode;
+ struct ext2_inode_cache_ent *ent;
char *ptr;
- unsigned i;
int clen;
int length = EXT2_INODE_SIZE(fs->super);
@@ -933,19 +1051,20 @@ errcode_t ext2fs_write_inode2(ext2_filsys fs, ext2_ino_t ino,
}
/* Check to see if the inode cache needs to be updated */
- if (fs->icache) {
- for (i=0; i < fs->icache->cache_size; i++) {
- if (fs->icache->cache[i].ino == ino) {
- memcpy(fs->icache->cache[i].inode, inode,
- (bufsize > length) ? length : bufsize);
- break;
- }
- }
- } else {
+ if (!fs->icache) {
retval = ext2fs_create_inode_cache(fs, 4);
if (retval)
goto errout;
}
+
+ retval = ext2_inode_cache_iget(fs, ino, 0, &ent);
+ if (retval)
+ goto errout;
+
+ memcpy(ent->raw, inode, (bufsize > length) ? length : bufsize);
+ ext2_inode_cache_ibump(fs, ent);
+ ext2_inode_cache_iput(fs, ent);
+
memcpy(w_inode, inode, (bufsize > length) ? length : bufsize);
if (!(fs->flags & EXT2_FLAG_RW)) {
diff --git a/resize/Makefile.in b/resize/Makefile.in
index 27f721305e052e..d03d3bfc309968 100644
--- a/resize/Makefile.in
+++ b/resize/Makefile.in
@@ -28,8 +28,8 @@ SRCS= $(srcdir)/extent.c \
$(srcdir)/resource_track.c \
$(srcdir)/sim_progress.c
-LIBS= $(LIBE2P) $(LIBEXT2FS) $(LIBCOM_ERR) $(LIBINTL) $(SYSLIBS)
-DEPLIBS= $(LIBE2P) $(LIBEXT2FS) $(DEPLIBCOM_ERR)
+LIBS= $(LIBE2P) $(LIBEXT2FS) $(LIBSUPPORT) $(LIBCOM_ERR) $(LIBINTL) $(SYSLIBS)
+DEPLIBS= $(LIBE2P) $(LIBEXT2FS) $(DEPLIBSUPPORT) $(DEPLIBCOM_ERR)
STATIC_LIBS= $(STATIC_LIBE2P) $(STATIC_LIBEXT2FS) $(STATIC_LIBCOM_ERR) \
$(LIBINTL) $(SYSLIBS)
diff --git a/tests/progs/Makefile.in b/tests/progs/Makefile.in
index 1a8e9299a1c1ca..64069a52c57cd3 100644
--- a/tests/progs/Makefile.in
+++ b/tests/progs/Makefile.in
@@ -23,8 +23,8 @@ TEST_ICOUNT_OBJS= test_icount.o test_icount_cmds.o
SRCS= $(srcdir)/test_icount.c \
$(srcdir)/test_rel.c
-LIBS= $(LIBEXT2FS) $(LIBSS) $(LIBCOM_ERR) $(SYSLIBS)
-DEPLIBS= $(LIBEXT2FS) $(DEPLIBSS) $(DEPLIBCOM_ERR)
+LIBS= $(LIBEXT2FS) $(LIBSUPPORT) $(LIBSS) $(LIBCOM_ERR) $(SYSLIBS)
+DEPLIBS= $(LIBEXT2FS) $(DEPLIBSUPPORT) $(DEPLIBSS) $(DEPLIBCOM_ERR)
.c.o:
$(E) " CC $<"
^ permalink raw reply related [flat|nested] 210+ messages in thread
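The patch above stores each cached inode as raw bytes trailing the cache entry, allocated in one calloc() sized by the filesystem's inode size. A minimal standalone sketch of that layout follows. The `sketch_` names are hypothetical and the struct omits the embedded cache_node, so this illustrates the flexible-array-member pattern only, not the libext2fs structure itself.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct sketch_inode_ent {
	uint32_t	ino;
	uint8_t		access;
	/* inode_size bytes of inode data follow the header */
	char		raw[];
};

/* Allocate a zeroed entry with room for one on-disk inode. */
static struct sketch_inode_ent *sketch_ent_alloc(uint32_t ino,
						 size_t inode_size)
{
	struct sketch_inode_ent *ent;

	ent = calloc(1, sizeof(*ent) + inode_size);
	if (!ent)
		return NULL;
	ent->ino = ino;
	return ent;
}

/*
 * Copy caller data into the entry, never writing more than
 * inode_size bytes. Returns the number of bytes copied.
 */
static size_t sketch_ent_store(struct sketch_inode_ent *ent,
			       size_t inode_size,
			       const void *buf, size_t bufsize)
{
	size_t n = (bufsize > inode_size) ? inode_size : bufsize;

	memcpy(ent->raw, buf, n);
	return n;
}
```

A single allocation per entry means the release callback can be a plain free(), which matches what ext2_inode_cache_relse() does in the patch.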
* Re: [PATCH 04/23] fuse: move the backing file idr and code into a new source file
2025-08-21 0:53 ` [PATCH 04/23] fuse: move the backing file idr and code into a new source file Darrick J. Wong
@ 2025-08-21 7:21 ` Amir Goldstein
2025-08-21 7:42 ` Amir Goldstein
0 siblings, 1 reply; 210+ messages in thread
From: Amir Goldstein @ 2025-08-21 7:21 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: miklos, bernd, neal, John, linux-fsdevel, joannelkoong
On Thu, Aug 21, 2025 at 2:53 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> iomap support for fuse is also going to want the ability to attach
> backing files to a fuse filesystem. Move the fuse_backing code into a
> separate file so that both can use it.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Are you going to make FUSE_IOMAP depend on FUSE_PASSTHROUGH later on?
I can't think of a reason why not.
Thanks,
Amir.
> ---
> fs/fuse/fuse_i.h | 47 ++++++++-----
> fs/fuse/Makefile | 2 -
> fs/fuse/backing.c | 174 +++++++++++++++++++++++++++++++++++++++++++++++++
> fs/fuse/passthrough.c | 158 --------------------------------------------
> 4 files changed, 203 insertions(+), 178 deletions(-)
> create mode 100644 fs/fuse/backing.c
>
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 2cd9f4cdc6a7ef..2be2cbdf060536 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -1535,29 +1535,11 @@ struct fuse_file *fuse_file_open(struct fuse_mount *fm, u64 nodeid,
> void fuse_file_release(struct inode *inode, struct fuse_file *ff,
> unsigned int open_flags, fl_owner_t id, bool isdir);
>
> -/* passthrough.c */
> -static inline struct fuse_backing *fuse_inode_backing(struct fuse_inode *fi)
> -{
> -#ifdef CONFIG_FUSE_PASSTHROUGH
> - return READ_ONCE(fi->fb);
> -#else
> - return NULL;
> -#endif
> -}
> -
> -static inline struct fuse_backing *fuse_inode_backing_set(struct fuse_inode *fi,
> - struct fuse_backing *fb)
> -{
> -#ifdef CONFIG_FUSE_PASSTHROUGH
> - return xchg(&fi->fb, fb);
> -#else
> - return NULL;
> -#endif
> -}
> -
> +/* backing.c */
> #ifdef CONFIG_FUSE_PASSTHROUGH
> struct fuse_backing *fuse_backing_get(struct fuse_backing *fb);
> void fuse_backing_put(struct fuse_backing *fb);
> +struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, int backing_id);
> #else
>
> static inline struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
> @@ -1568,6 +1550,11 @@ static inline struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
> static inline void fuse_backing_put(struct fuse_backing *fb)
> {
> }
> +static inline struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc,
> + int backing_id)
> +{
> + return NULL;
> +}
> #endif
>
> void fuse_backing_files_init(struct fuse_conn *fc);
> @@ -1575,6 +1562,26 @@ void fuse_backing_files_free(struct fuse_conn *fc);
> int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map);
> int fuse_backing_close(struct fuse_conn *fc, int backing_id);
>
> +/* passthrough.c */
> +static inline struct fuse_backing *fuse_inode_backing(struct fuse_inode *fi)
> +{
> +#ifdef CONFIG_FUSE_PASSTHROUGH
> + return READ_ONCE(fi->fb);
> +#else
> + return NULL;
> +#endif
> +}
> +
> +static inline struct fuse_backing *fuse_inode_backing_set(struct fuse_inode *fi,
> + struct fuse_backing *fb)
> +{
> +#ifdef CONFIG_FUSE_PASSTHROUGH
> + return xchg(&fi->fb, fb);
> +#else
> + return NULL;
> +#endif
> +}
> +
> struct fuse_backing *fuse_passthrough_open(struct file *file,
> struct inode *inode,
> int backing_id);
> diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
> index 70709a7a3f9523..c79f786d0c90c3 100644
> --- a/fs/fuse/Makefile
> +++ b/fs/fuse/Makefile
> @@ -14,7 +14,7 @@ fuse-y := trace.o # put trace.o first so we see ftrace errors sooner
> fuse-y += dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
> fuse-y += iomode.o
> fuse-$(CONFIG_FUSE_DAX) += dax.o
> -fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
> +fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o backing.o
> fuse-$(CONFIG_SYSCTL) += sysctl.o
> fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
> fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o
> diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
> new file mode 100644
> index 00000000000000..ddb23b7400fc72
> --- /dev/null
> +++ b/fs/fuse/backing.c
> @@ -0,0 +1,174 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * FUSE passthrough to backing file.
> + *
> + * Copyright (c) 2023 CTERA Networks.
> + */
> +
> +#include "fuse_i.h"
> +
> +#include <linux/file.h>
> +
> +struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
> +{
> + if (fb && refcount_inc_not_zero(&fb->count))
> + return fb;
> + return NULL;
> +}
> +
> +static void fuse_backing_free(struct fuse_backing *fb)
> +{
> + pr_debug("%s: fb=0x%p\n", __func__, fb);
> +
> + if (fb->file)
> + fput(fb->file);
> + put_cred(fb->cred);
> + kfree_rcu(fb, rcu);
> +}
> +
> +void fuse_backing_put(struct fuse_backing *fb)
> +{
> + if (fb && refcount_dec_and_test(&fb->count))
> + fuse_backing_free(fb);
> +}
> +
> +void fuse_backing_files_init(struct fuse_conn *fc)
> +{
> + idr_init(&fc->backing_files_map);
> +}
> +
> +static int fuse_backing_id_alloc(struct fuse_conn *fc, struct fuse_backing *fb)
> +{
> + int id;
> +
> + idr_preload(GFP_KERNEL);
> + spin_lock(&fc->lock);
> + /* FIXME: xarray might be space inefficient */
> + id = idr_alloc_cyclic(&fc->backing_files_map, fb, 1, 0, GFP_ATOMIC);
> + spin_unlock(&fc->lock);
> + idr_preload_end();
> +
> + WARN_ON_ONCE(id == 0);
> + return id;
> +}
> +
> +static struct fuse_backing *fuse_backing_id_remove(struct fuse_conn *fc,
> + int id)
> +{
> + struct fuse_backing *fb;
> +
> + spin_lock(&fc->lock);
> + fb = idr_remove(&fc->backing_files_map, id);
> + spin_unlock(&fc->lock);
> +
> + return fb;
> +}
> +
> +static int fuse_backing_id_free(int id, void *p, void *data)
> +{
> + struct fuse_backing *fb = p;
> +
> + WARN_ON_ONCE(refcount_read(&fb->count) != 1);
> + fuse_backing_free(fb);
> + return 0;
> +}
> +
> +void fuse_backing_files_free(struct fuse_conn *fc)
> +{
> + idr_for_each(&fc->backing_files_map, fuse_backing_id_free, NULL);
> + idr_destroy(&fc->backing_files_map);
> +}
> +
> +int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> +{
> + struct file *file;
> + struct super_block *backing_sb;
> + struct fuse_backing *fb = NULL;
> + int res;
> +
> + pr_debug("%s: fd=%d flags=0x%x\n", __func__, map->fd, map->flags);
> +
> + /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> + res = -EPERM;
> + if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
> + goto out;
> +
> + res = -EINVAL;
> + if (map->flags || map->padding)
> + goto out;
> +
> + file = fget_raw(map->fd);
> + res = -EBADF;
> + if (!file)
> + goto out;
> +
> + backing_sb = file_inode(file)->i_sb;
> + res = -ELOOP;
> + if (backing_sb->s_stack_depth >= fc->max_stack_depth)
> + goto out_fput;
> +
> + fb = kmalloc(sizeof(struct fuse_backing), GFP_KERNEL);
> + res = -ENOMEM;
> + if (!fb)
> + goto out_fput;
> +
> + fb->file = file;
> + fb->cred = prepare_creds();
> + refcount_set(&fb->count, 1);
> +
> + res = fuse_backing_id_alloc(fc, fb);
> + if (res < 0) {
> + fuse_backing_free(fb);
> + fb = NULL;
> + }
> +
> +out:
> + pr_debug("%s: fb=0x%p, ret=%i\n", __func__, fb, res);
> +
> + return res;
> +
> +out_fput:
> + fput(file);
> + goto out;
> +}
> +
> +int fuse_backing_close(struct fuse_conn *fc, int backing_id)
> +{
> + struct fuse_backing *fb = NULL;
> + int err;
> +
> + pr_debug("%s: backing_id=%d\n", __func__, backing_id);
> +
> + /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> + err = -EPERM;
> + if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
> + goto out;
> +
> + err = -EINVAL;
> + if (backing_id <= 0)
> + goto out;
> +
> + err = -ENOENT;
> + fb = fuse_backing_id_remove(fc, backing_id);
> + if (!fb)
> + goto out;
> +
> + fuse_backing_put(fb);
> + err = 0;
> +out:
> + pr_debug("%s: fb=0x%p, err=%i\n", __func__, fb, err);
> +
> + return err;
> +}
> +
> +struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, int backing_id)
> +{
> + struct fuse_backing *fb;
> +
> + rcu_read_lock();
> + fb = idr_find(&fc->backing_files_map, backing_id);
> + fb = fuse_backing_get(fb);
> + rcu_read_unlock();
> +
> + return fb;
> +}
> diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> index 607ef735ad4ab3..e0b8d885bc81f3 100644
> --- a/fs/fuse/passthrough.c
> +++ b/fs/fuse/passthrough.c
> @@ -144,158 +144,6 @@ ssize_t fuse_passthrough_mmap(struct file *file, struct vm_area_struct *vma)
> return backing_file_mmap(backing_file, vma, &ctx);
> }
>
> -struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
> -{
> - if (fb && refcount_inc_not_zero(&fb->count))
> - return fb;
> - return NULL;
> -}
> -
> -static void fuse_backing_free(struct fuse_backing *fb)
> -{
> - pr_debug("%s: fb=0x%p\n", __func__, fb);
> -
> - if (fb->file)
> - fput(fb->file);
> - put_cred(fb->cred);
> - kfree_rcu(fb, rcu);
> -}
> -
> -void fuse_backing_put(struct fuse_backing *fb)
> -{
> - if (fb && refcount_dec_and_test(&fb->count))
> - fuse_backing_free(fb);
> -}
> -
> -void fuse_backing_files_init(struct fuse_conn *fc)
> -{
> - idr_init(&fc->backing_files_map);
> -}
> -
> -static int fuse_backing_id_alloc(struct fuse_conn *fc, struct fuse_backing *fb)
> -{
> - int id;
> -
> - idr_preload(GFP_KERNEL);
> - spin_lock(&fc->lock);
> - /* FIXME: xarray might be space inefficient */
> - id = idr_alloc_cyclic(&fc->backing_files_map, fb, 1, 0, GFP_ATOMIC);
> - spin_unlock(&fc->lock);
> - idr_preload_end();
> -
> - WARN_ON_ONCE(id == 0);
> - return id;
> -}
> -
> -static struct fuse_backing *fuse_backing_id_remove(struct fuse_conn *fc,
> - int id)
> -{
> - struct fuse_backing *fb;
> -
> - spin_lock(&fc->lock);
> - fb = idr_remove(&fc->backing_files_map, id);
> - spin_unlock(&fc->lock);
> -
> - return fb;
> -}
> -
> -static int fuse_backing_id_free(int id, void *p, void *data)
> -{
> - struct fuse_backing *fb = p;
> -
> - WARN_ON_ONCE(refcount_read(&fb->count) != 1);
> - fuse_backing_free(fb);
> - return 0;
> -}
> -
> -void fuse_backing_files_free(struct fuse_conn *fc)
> -{
> - idr_for_each(&fc->backing_files_map, fuse_backing_id_free, NULL);
> - idr_destroy(&fc->backing_files_map);
> -}
> -
> -int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> -{
> - struct file *file;
> - struct super_block *backing_sb;
> - struct fuse_backing *fb = NULL;
> - int res;
> -
> - pr_debug("%s: fd=%d flags=0x%x\n", __func__, map->fd, map->flags);
> -
> - /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> - res = -EPERM;
> - if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
> - goto out;
> -
> - res = -EINVAL;
> - if (map->flags || map->padding)
> - goto out;
> -
> - file = fget_raw(map->fd);
> - res = -EBADF;
> - if (!file)
> - goto out;
> -
> - backing_sb = file_inode(file)->i_sb;
> - res = -ELOOP;
> - if (backing_sb->s_stack_depth >= fc->max_stack_depth)
> - goto out_fput;
> -
> - fb = kmalloc(sizeof(struct fuse_backing), GFP_KERNEL);
> - res = -ENOMEM;
> - if (!fb)
> - goto out_fput;
> -
> - fb->file = file;
> - fb->cred = prepare_creds();
> - refcount_set(&fb->count, 1);
> -
> - res = fuse_backing_id_alloc(fc, fb);
> - if (res < 0) {
> - fuse_backing_free(fb);
> - fb = NULL;
> - }
> -
> -out:
> - pr_debug("%s: fb=0x%p, ret=%i\n", __func__, fb, res);
> -
> - return res;
> -
> -out_fput:
> - fput(file);
> - goto out;
> -}
> -
> -int fuse_backing_close(struct fuse_conn *fc, int backing_id)
> -{
> - struct fuse_backing *fb = NULL;
> - int err;
> -
> - pr_debug("%s: backing_id=%d\n", __func__, backing_id);
> -
> - /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> - err = -EPERM;
> - if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
> - goto out;
> -
> - err = -EINVAL;
> - if (backing_id <= 0)
> - goto out;
> -
> - err = -ENOENT;
> - fb = fuse_backing_id_remove(fc, backing_id);
> - if (!fb)
> - goto out;
> -
> - fuse_backing_put(fb);
> - err = 0;
> -out:
> - pr_debug("%s: fb=0x%p, err=%i\n", __func__, fb, err);
> -
> - return err;
> -}
> -
> /*
> * Setup passthrough to a backing file.
> *
> @@ -315,12 +163,8 @@ struct fuse_backing *fuse_passthrough_open(struct file *file,
> if (backing_id <= 0)
> goto out;
>
> - rcu_read_lock();
> - fb = idr_find(&fc->backing_files_map, backing_id);
> - fb = fuse_backing_get(fb);
> - rcu_read_unlock();
> -
> err = -ENOENT;
> + fb = fuse_backing_lookup(fc, backing_id);
> if (!fb)
> goto out;
>
>
>
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 04/23] fuse: move the backing file idr and code into a new source file
2025-08-21 7:21 ` Amir Goldstein
@ 2025-08-21 7:42 ` Amir Goldstein
2025-08-21 16:15 ` Darrick J. Wong
0 siblings, 1 reply; 210+ messages in thread
From: Amir Goldstein @ 2025-08-21 7:42 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: miklos, bernd, neal, John, linux-fsdevel, joannelkoong
On Thu, Aug 21, 2025 at 9:21 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Thu, Aug 21, 2025 at 2:53 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > iomap support for fuse is also going to want the ability to attach
> > backing files to a fuse filesystem. Move the fuse_backing code into a
> > separate file so that both can use it.
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
>
> Are you going to make FUSE_IOMAP depend on FUSE_PASSTHROUGH later on?
> I can't think of a reason why not.
Ah, I see. They will both depend on FUSE_BACKING.
Cool.
>
> Thanks,
> Amir.
>
> > ---
> > fs/fuse/fuse_i.h | 47 ++++++++-----
> > fs/fuse/Makefile | 2 -
> > fs/fuse/backing.c | 174 +++++++++++++++++++++++++++++++++++++++++++++++++
> > fs/fuse/passthrough.c | 158 --------------------------------------------
> > 4 files changed, 203 insertions(+), 178 deletions(-)
> > create mode 100644 fs/fuse/backing.c
> >
> >
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 2cd9f4cdc6a7ef..2be2cbdf060536 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -1535,29 +1535,11 @@ struct fuse_file *fuse_file_open(struct fuse_mount *fm, u64 nodeid,
> > void fuse_file_release(struct inode *inode, struct fuse_file *ff,
> > unsigned int open_flags, fl_owner_t id, bool isdir);
> >
> > -/* passthrough.c */
> > -static inline struct fuse_backing *fuse_inode_backing(struct fuse_inode *fi)
> > -{
> > -#ifdef CONFIG_FUSE_PASSTHROUGH
> > - return READ_ONCE(fi->fb);
> > -#else
> > - return NULL;
> > -#endif
> > -}
> > -
> > -static inline struct fuse_backing *fuse_inode_backing_set(struct fuse_inode *fi,
> > - struct fuse_backing *fb)
> > -{
> > -#ifdef CONFIG_FUSE_PASSTHROUGH
> > - return xchg(&fi->fb, fb);
> > -#else
> > - return NULL;
> > -#endif
> > -}
> > -
> > +/* backing.c */
> > #ifdef CONFIG_FUSE_PASSTHROUGH
> > struct fuse_backing *fuse_backing_get(struct fuse_backing *fb);
> > void fuse_backing_put(struct fuse_backing *fb);
> > +struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, int backing_id);
> > #else
> >
> > static inline struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
> > @@ -1568,6 +1550,11 @@ static inline struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
> > static inline void fuse_backing_put(struct fuse_backing *fb)
> > {
> > }
> > +static inline struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc,
> > + int backing_id)
> > +{
> > + return NULL;
> > +}
> > #endif
> >
> > void fuse_backing_files_init(struct fuse_conn *fc);
> > @@ -1575,6 +1562,26 @@ void fuse_backing_files_free(struct fuse_conn *fc);
> > int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map);
> > int fuse_backing_close(struct fuse_conn *fc, int backing_id);
> >
> > +/* passthrough.c */
> > +static inline struct fuse_backing *fuse_inode_backing(struct fuse_inode *fi)
> > +{
> > +#ifdef CONFIG_FUSE_PASSTHROUGH
> > + return READ_ONCE(fi->fb);
> > +#else
> > + return NULL;
> > +#endif
> > +}
> > +
> > +static inline struct fuse_backing *fuse_inode_backing_set(struct fuse_inode *fi,
> > + struct fuse_backing *fb)
> > +{
> > +#ifdef CONFIG_FUSE_PASSTHROUGH
> > + return xchg(&fi->fb, fb);
> > +#else
> > + return NULL;
> > +#endif
> > +}
> > +
> > struct fuse_backing *fuse_passthrough_open(struct file *file,
> > struct inode *inode,
> > int backing_id);
> > diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
> > index 70709a7a3f9523..c79f786d0c90c3 100644
> > --- a/fs/fuse/Makefile
> > +++ b/fs/fuse/Makefile
> > @@ -14,7 +14,7 @@ fuse-y := trace.o # put trace.o first so we see ftrace errors sooner
> > fuse-y += dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
> > fuse-y += iomode.o
> > fuse-$(CONFIG_FUSE_DAX) += dax.o
> > -fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
> > +fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o backing.o
> > fuse-$(CONFIG_SYSCTL) += sysctl.o
> > fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
> > fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o
> > diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
> > new file mode 100644
> > index 00000000000000..ddb23b7400fc72
> > --- /dev/null
> > +++ b/fs/fuse/backing.c
> > @@ -0,0 +1,174 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * FUSE passthrough to backing file.
> > + *
> > + * Copyright (c) 2023 CTERA Networks.
> > + */
> > +
> > +#include "fuse_i.h"
> > +
> > +#include <linux/file.h>
> > +
> > +struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
> > +{
> > + if (fb && refcount_inc_not_zero(&fb->count))
> > + return fb;
> > + return NULL;
> > +}
> > +
> > +static void fuse_backing_free(struct fuse_backing *fb)
> > +{
> > + pr_debug("%s: fb=0x%p\n", __func__, fb);
> > +
> > + if (fb->file)
> > + fput(fb->file);
> > + put_cred(fb->cred);
> > + kfree_rcu(fb, rcu);
> > +}
> > +
> > +void fuse_backing_put(struct fuse_backing *fb)
> > +{
> > + if (fb && refcount_dec_and_test(&fb->count))
> > + fuse_backing_free(fb);
> > +}
> > +
> > +void fuse_backing_files_init(struct fuse_conn *fc)
> > +{
> > + idr_init(&fc->backing_files_map);
> > +}
> > +
> > +static int fuse_backing_id_alloc(struct fuse_conn *fc, struct fuse_backing *fb)
> > +{
> > + int id;
> > +
> > + idr_preload(GFP_KERNEL);
> > + spin_lock(&fc->lock);
> > + /* FIXME: xarray might be space inefficient */
> > + id = idr_alloc_cyclic(&fc->backing_files_map, fb, 1, 0, GFP_ATOMIC);
> > + spin_unlock(&fc->lock);
> > + idr_preload_end();
> > +
> > + WARN_ON_ONCE(id == 0);
> > + return id;
> > +}
> > +
> > +static struct fuse_backing *fuse_backing_id_remove(struct fuse_conn *fc,
> > + int id)
> > +{
> > + struct fuse_backing *fb;
> > +
> > + spin_lock(&fc->lock);
> > + fb = idr_remove(&fc->backing_files_map, id);
> > + spin_unlock(&fc->lock);
> > +
> > + return fb;
> > +}
> > +
> > +static int fuse_backing_id_free(int id, void *p, void *data)
> > +{
> > + struct fuse_backing *fb = p;
> > +
> > + WARN_ON_ONCE(refcount_read(&fb->count) != 1);
> > + fuse_backing_free(fb);
> > + return 0;
> > +}
> > +
> > +void fuse_backing_files_free(struct fuse_conn *fc)
> > +{
> > + idr_for_each(&fc->backing_files_map, fuse_backing_id_free, NULL);
> > + idr_destroy(&fc->backing_files_map);
> > +}
> > +
> > +int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> > +{
> > + struct file *file;
> > + struct super_block *backing_sb;
> > + struct fuse_backing *fb = NULL;
> > + int res;
> > +
> > + pr_debug("%s: fd=%d flags=0x%x\n", __func__, map->fd, map->flags);
> > +
> > + /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> > + res = -EPERM;
> > + if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
> > + goto out;
> > +
> > + res = -EINVAL;
> > + if (map->flags || map->padding)
> > + goto out;
> > +
> > + file = fget_raw(map->fd);
> > + res = -EBADF;
> > + if (!file)
> > + goto out;
> > +
> > + backing_sb = file_inode(file)->i_sb;
> > + res = -ELOOP;
> > + if (backing_sb->s_stack_depth >= fc->max_stack_depth)
> > + goto out_fput;
> > +
> > + fb = kmalloc(sizeof(struct fuse_backing), GFP_KERNEL);
> > + res = -ENOMEM;
> > + if (!fb)
> > + goto out_fput;
> > +
> > + fb->file = file;
> > + fb->cred = prepare_creds();
> > + refcount_set(&fb->count, 1);
> > +
> > + res = fuse_backing_id_alloc(fc, fb);
> > + if (res < 0) {
> > + fuse_backing_free(fb);
> > + fb = NULL;
> > + }
> > +
> > +out:
> > + pr_debug("%s: fb=0x%p, ret=%i\n", __func__, fb, res);
> > +
> > + return res;
> > +
> > +out_fput:
> > + fput(file);
> > + goto out;
> > +}
> > +
> > +int fuse_backing_close(struct fuse_conn *fc, int backing_id)
> > +{
> > + struct fuse_backing *fb = NULL;
> > + int err;
> > +
> > + pr_debug("%s: backing_id=%d\n", __func__, backing_id);
> > +
> > + /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> > + err = -EPERM;
> > + if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
> > + goto out;
> > +
> > + err = -EINVAL;
> > + if (backing_id <= 0)
> > + goto out;
> > +
> > + err = -ENOENT;
> > + fb = fuse_backing_id_remove(fc, backing_id);
> > + if (!fb)
> > + goto out;
> > +
> > + fuse_backing_put(fb);
> > + err = 0;
> > +out:
> > + pr_debug("%s: fb=0x%p, err=%i\n", __func__, fb, err);
> > +
> > + return err;
> > +}
> > +
> > +struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, int backing_id)
> > +{
> > + struct fuse_backing *fb;
> > +
> > + rcu_read_lock();
> > + fb = idr_find(&fc->backing_files_map, backing_id);
> > + fb = fuse_backing_get(fb);
> > + rcu_read_unlock();
> > +
> > + return fb;
> > +}
> > diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> > index 607ef735ad4ab3..e0b8d885bc81f3 100644
> > --- a/fs/fuse/passthrough.c
> > +++ b/fs/fuse/passthrough.c
> > @@ -144,158 +144,6 @@ ssize_t fuse_passthrough_mmap(struct file *file, struct vm_area_struct *vma)
> > return backing_file_mmap(backing_file, vma, &ctx);
> > }
> >
> > -struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
> > -{
> > - if (fb && refcount_inc_not_zero(&fb->count))
> > - return fb;
> > - return NULL;
> > -}
> > -
> > -static void fuse_backing_free(struct fuse_backing *fb)
> > -{
> > - pr_debug("%s: fb=0x%p\n", __func__, fb);
> > -
> > - if (fb->file)
> > - fput(fb->file);
> > - put_cred(fb->cred);
> > - kfree_rcu(fb, rcu);
> > -}
> > -
> > -void fuse_backing_put(struct fuse_backing *fb)
> > -{
> > - if (fb && refcount_dec_and_test(&fb->count))
> > - fuse_backing_free(fb);
> > -}
> > -
> > -void fuse_backing_files_init(struct fuse_conn *fc)
> > -{
> > - idr_init(&fc->backing_files_map);
> > -}
> > -
> > -static int fuse_backing_id_alloc(struct fuse_conn *fc, struct fuse_backing *fb)
> > -{
> > - int id;
> > -
> > - idr_preload(GFP_KERNEL);
> > - spin_lock(&fc->lock);
> > - /* FIXME: xarray might be space inefficient */
> > - id = idr_alloc_cyclic(&fc->backing_files_map, fb, 1, 0, GFP_ATOMIC);
> > - spin_unlock(&fc->lock);
> > - idr_preload_end();
> > -
> > - WARN_ON_ONCE(id == 0);
> > - return id;
> > -}
> > -
> > -static struct fuse_backing *fuse_backing_id_remove(struct fuse_conn *fc,
> > - int id)
> > -{
> > - struct fuse_backing *fb;
> > -
> > - spin_lock(&fc->lock);
> > - fb = idr_remove(&fc->backing_files_map, id);
> > - spin_unlock(&fc->lock);
> > -
> > - return fb;
> > -}
> > -
> > -static int fuse_backing_id_free(int id, void *p, void *data)
> > -{
> > - struct fuse_backing *fb = p;
> > -
> > - WARN_ON_ONCE(refcount_read(&fb->count) != 1);
> > - fuse_backing_free(fb);
> > - return 0;
> > -}
> > -
> > -void fuse_backing_files_free(struct fuse_conn *fc)
> > -{
> > - idr_for_each(&fc->backing_files_map, fuse_backing_id_free, NULL);
> > - idr_destroy(&fc->backing_files_map);
> > -}
> > -
> > -int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> > -{
> > - struct file *file;
> > - struct super_block *backing_sb;
> > - struct fuse_backing *fb = NULL;
> > - int res;
> > -
> > - pr_debug("%s: fd=%d flags=0x%x\n", __func__, map->fd, map->flags);
> > -
> > - /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> > - res = -EPERM;
> > - if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
> > - goto out;
> > -
> > - res = -EINVAL;
> > - if (map->flags || map->padding)
> > - goto out;
> > -
> > - file = fget_raw(map->fd);
> > - res = -EBADF;
> > - if (!file)
> > - goto out;
> > -
> > - backing_sb = file_inode(file)->i_sb;
> > - res = -ELOOP;
> > - if (backing_sb->s_stack_depth >= fc->max_stack_depth)
> > - goto out_fput;
> > -
> > - fb = kmalloc(sizeof(struct fuse_backing), GFP_KERNEL);
> > - res = -ENOMEM;
> > - if (!fb)
> > - goto out_fput;
> > -
> > - fb->file = file;
> > - fb->cred = prepare_creds();
> > - refcount_set(&fb->count, 1);
> > -
> > - res = fuse_backing_id_alloc(fc, fb);
> > - if (res < 0) {
> > - fuse_backing_free(fb);
> > - fb = NULL;
> > - }
> > -
> > -out:
> > - pr_debug("%s: fb=0x%p, ret=%i\n", __func__, fb, res);
> > -
> > - return res;
> > -
> > -out_fput:
> > - fput(file);
> > - goto out;
> > -}
> > -
> > -int fuse_backing_close(struct fuse_conn *fc, int backing_id)
> > -{
> > - struct fuse_backing *fb = NULL;
> > - int err;
> > -
> > - pr_debug("%s: backing_id=%d\n", __func__, backing_id);
> > -
> > - /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> > - err = -EPERM;
> > - if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
> > - goto out;
> > -
> > - err = -EINVAL;
> > - if (backing_id <= 0)
> > - goto out;
> > -
> > - err = -ENOENT;
> > - fb = fuse_backing_id_remove(fc, backing_id);
> > - if (!fb)
> > - goto out;
> > -
> > - fuse_backing_put(fb);
> > - err = 0;
> > -out:
> > - pr_debug("%s: fb=0x%p, err=%i\n", __func__, fb, err);
> > -
> > - return err;
> > -}
> > -
> > /*
> > * Setup passthrough to a backing file.
> > *
> > @@ -315,12 +163,8 @@ struct fuse_backing *fuse_passthrough_open(struct file *file,
> > if (backing_id <= 0)
> > goto out;
> >
> > - rcu_read_lock();
> > - fb = idr_find(&fc->backing_files_map, backing_id);
> > - fb = fuse_backing_get(fb);
> > - rcu_read_unlock();
> > -
> > err = -ENOENT;
> > + fb = fuse_backing_lookup(fc, backing_id);
> > if (!fb)
> > goto out;
> >
> >
> >
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 06/23] fuse: add an ioctl to add new iomap devices
2025-08-21 0:54 ` [PATCH 06/23] fuse: add an ioctl to add new iomap devices Darrick J. Wong
@ 2025-08-21 8:09 ` Amir Goldstein
2025-08-21 16:15 ` Darrick J. Wong
0 siblings, 1 reply; 210+ messages in thread
From: Amir Goldstein @ 2025-08-21 8:09 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: miklos, bernd, neal, John, linux-fsdevel, joannelkoong
On Thu, Aug 21, 2025 at 2:54 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Add an ioctl that allows fuse servers to register block devices for use
> with iomap. This is (for now) separate from the backing file open/close
> ioctl (despite using the same struct) to keep the codepaths separate.
Is it, though? I'm pretty sure this commit does not add a new ioctl;
it reuses the existing one (which is fine by me).
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
> fs/fuse/fuse_i.h | 9 +++++
> fs/fuse/fuse_trace.h | 49 ++++++++++++++++++++++++++-
> fs/fuse/Kconfig | 1 +
> fs/fuse/backing.c | 19 ++++++++---
> fs/fuse/file_iomap.c | 88 ++++++++++++++++++++++++++++++++++++++++++++-----
> fs/fuse/passthrough.c | 13 +++++++
> fs/fuse/trace.c | 1 +
> 7 files changed, 163 insertions(+), 17 deletions(-)
>
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 1762517a1b99c8..f4834a02d16c98 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -100,6 +100,10 @@ struct fuse_submount_lookup {
> struct fuse_backing {
> struct file *file;
> struct cred *cred;
> + struct block_device *bdev;
> +
> + unsigned int passthrough:1;
> + unsigned int iomap:1;
>
> /** refcount */
> refcount_t count;
> @@ -1639,9 +1643,14 @@ static inline bool fuse_has_iomap(const struct inode *inode)
> {
> return get_fuse_conn_c(inode)->iomap;
> }
> +
> +int fuse_iomap_backing_open(struct fuse_conn *fc, struct fuse_backing *fb);
> +int fuse_iomap_backing_close(struct fuse_conn *fc, struct fuse_backing *fb);
> #else
> # define fuse_iomap_enabled(...) (false)
> # define fuse_has_iomap(...) (false)
> +# define fuse_iomap_backing_open(...) (-EOPNOTSUPP)
> +# define fuse_iomap_backing_close(...) (-EOPNOTSUPP)
> #endif
>
> #endif /* _FS_FUSE_I_H */
> diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
> index 660d9b5206a175..c3671a605a32f6 100644
> --- a/fs/fuse/fuse_trace.h
> +++ b/fs/fuse/fuse_trace.h
> @@ -175,6 +175,13 @@ TRACE_EVENT(fuse_request_end,
> );
>
> #ifdef CONFIG_FUSE_BACKING
> +#define FUSE_BACKING_PASSTHROUGH (1U << 0)
> +#define FUSE_BACKING_IOMAP (1U << 1)
> +
> +#define FUSE_BACKING_FLAG_STRINGS \
> + { FUSE_BACKING_PASSTHROUGH, "pass" }, \
> + { FUSE_BACKING_IOMAP, "iomap" }
> +
> TRACE_EVENT(fuse_backing_class,
> TP_PROTO(const struct fuse_conn *fc, unsigned int idx,
> const struct fuse_backing *fb),
> @@ -184,7 +191,9 @@ TRACE_EVENT(fuse_backing_class,
> TP_STRUCT__entry(
> __field(dev_t, connection)
> __field(unsigned int, idx)
> + __field(unsigned int, flags)
> __field(unsigned long, ino)
> + __field(dev_t, rdev)
> ),
>
> TP_fast_assign(
> @@ -193,12 +202,23 @@ TRACE_EVENT(fuse_backing_class,
> __entry->connection = fc->dev;
> __entry->idx = idx;
> __entry->ino = inode->i_ino;
> + __entry->flags = 0;
> + if (fb->passthrough)
> + __entry->flags |= FUSE_BACKING_PASSTHROUGH;
> + if (fb->iomap) {
> + __entry->rdev = inode->i_rdev;
> + __entry->flags |= FUSE_BACKING_IOMAP;
> + } else {
> + __entry->rdev = 0;
> + }
> ),
>
> - TP_printk("connection %u idx %u ino 0x%lx",
> + TP_printk("connection %u idx %u flags (%s) ino 0x%lx rdev %u:%u",
> __entry->connection,
> __entry->idx,
> - __entry->ino)
> + __print_flags(__entry->flags, "|", FUSE_BACKING_FLAG_STRINGS),
> + __entry->ino,
> + MAJOR(__entry->rdev), MINOR(__entry->rdev))
> );
> #define DEFINE_FUSE_BACKING_EVENT(name) \
> DEFINE_EVENT(fuse_backing_class, name, \
> @@ -210,7 +230,6 @@ DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
> #endif
>
> #if IS_ENABLED(CONFIG_FUSE_IOMAP)
> -
> /* tracepoint boilerplate so we don't have to keep doing this */
> #define FUSE_IOMAP_OPFLAGS_FIELD \
> __field(unsigned, opflags)
> @@ -452,6 +471,30 @@ TRACE_EVENT(fuse_iomap_end_error,
> __entry->written,
> __entry->error)
> );
> +
> +TRACE_EVENT(fuse_iomap_dev_add,
> + TP_PROTO(const struct fuse_conn *fc,
> + const struct fuse_backing_map *map),
> +
> + TP_ARGS(fc, map),
> +
> + TP_STRUCT__entry(
> + __field(dev_t, connection)
> + __field(int, fd)
> + __field(unsigned int, flags)
> + ),
> +
> + TP_fast_assign(
> + __entry->connection = fc->dev;
> + __entry->fd = map->fd;
> + __entry->flags = map->flags;
> + ),
> +
> + TP_printk("connection %u fd %d flags 0x%x",
> + __entry->connection,
> + __entry->fd,
> + __entry->flags)
> +);
> #endif /* CONFIG_FUSE_IOMAP */
>
> #endif /* _TRACE_FUSE_H */
> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> index ebb9a2d76b532e..1ab3d3604c07d0 100644
> --- a/fs/fuse/Kconfig
> +++ b/fs/fuse/Kconfig
> @@ -75,6 +75,7 @@ config FUSE_IOMAP
> depends on FUSE_FS
> depends on BLOCK
> select FS_IOMAP
> + select FUSE_BACKING
> help
> For supported fuseblk servers, this allows the file IO path to run
> through the kernel.
> diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
> index c128bed95a76b8..c63990254649ca 100644
> --- a/fs/fuse/backing.c
> +++ b/fs/fuse/backing.c
> @@ -67,16 +67,19 @@ static struct fuse_backing *fuse_backing_id_remove(struct fuse_conn *fc,
>
> static int fuse_backing_id_free(int id, void *p, void *data)
> {
> + struct fuse_conn *fc = data;
> struct fuse_backing *fb = p;
>
> WARN_ON_ONCE(refcount_read(&fb->count) != 1);
> +
> + trace_fuse_backing_close(fc, id, fb);
> fuse_backing_free(fb);
> return 0;
> }
>
> void fuse_backing_files_free(struct fuse_conn *fc)
> {
> - idr_for_each(&fc->backing_files_map, fuse_backing_id_free, NULL);
> + idr_for_each(&fc->backing_files_map, fuse_backing_id_free, fc);
> idr_destroy(&fc->backing_files_map);
> }
>
> @@ -84,12 +87,12 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> {
> struct file *file = NULL;
> struct fuse_backing *fb = NULL;
> - int res, passthrough_res;
> + int res, passthrough_res, iomap_res;
>
> pr_debug("%s: fd=%d flags=0x%x\n", __func__, map->fd, map->flags);
>
> res = -EPERM;
> - if (!fc->passthrough)
> + if (!fc->passthrough && !fc->iomap)
> goto out;
>
> res = -EINVAL;
> @@ -125,10 +128,13 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> * default.
> */
> passthrough_res = fuse_passthrough_backing_open(fc, fb);
> + iomap_res = fuse_iomap_backing_open(fc, fb);
>
> if (refcount_read(&fb->count) < 2) {
> if (passthrough_res)
> res = passthrough_res;
> + if (!res && iomap_res)
> + res = iomap_res;
> if (!res)
> res = -EPERM;
> goto out_fb;
> @@ -157,12 +163,12 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> int fuse_backing_close(struct fuse_conn *fc, int backing_id)
> {
> struct fuse_backing *fb = NULL, *test_fb;
> - int err, passthrough_err;
> + int err, passthrough_err, iomap_err;
>
> pr_debug("%s: backing_id=%d\n", __func__, backing_id);
>
> err = -EPERM;
> - if (!fc->passthrough)
> + if (!fc->passthrough && !fc->iomap)
> goto out;
>
> err = -EINVAL;
> @@ -187,10 +193,13 @@ int fuse_backing_close(struct fuse_conn *fc, int backing_id)
> * error code will be passed up. EBUSY is the default.
> */
> passthrough_err = fuse_passthrough_backing_close(fc, fb);
> + iomap_err = fuse_iomap_backing_close(fc, fb);
>
> if (refcount_read(&fb->count) > 1) {
> if (passthrough_err)
> err = passthrough_err;
> + if (!err && iomap_err)
> + err = iomap_err;
> if (!err)
> err = -EBUSY;
> goto out_fb;
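For reference, the error selection in the two hunks above can be reduced to a
small standalone function. This is only a userspace sketch: the helper name
pick_backing_error() is hypothetical and not part of the patch, and the value
of res/err on entry is an assumption, since the hunks do not show it.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper (not in the patch): mirrors the order in which the
 * quoted hunks choose an error code when the backing object cannot be
 * opened (default -EPERM) or closed (default -EBUSY).
 */
static int pick_backing_error(int res, int passthrough_res, int iomap_res,
			      int default_err)
{
	assert(default_err < 0);	/* the fallback must be a real error */

	if (passthrough_res)
		res = passthrough_res;	/* passthrough result overrides res */
	if (!res && iomap_res)
		res = iomap_res;	/* iomap only fills in while res is 0 */
	if (!res)
		res = default_err;	/* neither backend reported anything */
	return res;
}
```

With this ordering, a nonzero passthrough result always takes precedence over
the iomap result; that appears to include the -EOPNOTSUPP value that the stub
macro yields when passthrough is compiled out.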
Do you really think that we need to support both file passthrough and file
iomap on the same fuse filesystem?
Unless you have a specific use case in mind, it looks like over-design to me.
We could enforce either fc->passthrough or fc->iomap on init.
To put it another way: unless you intend to test a combination of file
passthrough and file iomap, I think you should leave that combination out of
the supported configurations.
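A minimal sketch of such an init-time check, written as standalone userspace C.
Everything here is an assumption made for illustration: struct conn_features
stands in for the two struct fuse_conn bits, check_backing_features() is a
hypothetical name, and -EINVAL is just one plausible error code, not something
taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fc->passthrough and fc->iomap bits. */
struct conn_features {
	bool passthrough;
	bool iomap;
};

/*
 * Hypothetical init-time check: allow at most one user of backing files
 * per connection. Returns 0 if the combination is acceptable and -EINVAL
 * if both features were requested together.
 */
static int check_backing_features(const struct conn_features *cf)
{
	assert(cf != NULL);

	if (cf->passthrough && cf->iomap)
		return -EINVAL;
	return 0;
}
```

Rejecting the combination once at init would mean fuse_backing_open() and
fuse_backing_close() only ever have a single backend result to report.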
Thanks,
Amir.
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 05/23] fuse: move the passthrough-specific code back to passthrough.c
2025-08-21 0:53 ` [PATCH 05/23] fuse: move the passthrough-specific code back to passthrough.c Darrick J. Wong
@ 2025-08-21 9:05 ` Amir Goldstein
2025-08-21 16:13 ` Darrick J. Wong
0 siblings, 1 reply; 210+ messages in thread
From: Amir Goldstein @ 2025-08-21 9:05 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: miklos, bernd, neal, John, linux-fsdevel, joannelkoong
On Thu, Aug 21, 2025 at 2:53 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> In preparation for iomap, move the passthrough-specific validation code
> back to passthrough.c and create a new Kconfig item for conditional
> compilation of backing.c. In the next patch, iomap will share the
> backing structures.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
> fs/fuse/fuse_i.h | 14 ++++++
> fs/fuse/fuse_trace.h | 35 ++++++++++++++++
> fs/fuse/Kconfig | 4 ++
> fs/fuse/Makefile | 3 +
> fs/fuse/backing.c | 106 +++++++++++++++++++++++++++++++++++++------------
> fs/fuse/dev.c | 4 +-
> fs/fuse/inode.c | 4 +-
> fs/fuse/passthrough.c | 28 +++++++++++++
> 8 files changed, 165 insertions(+), 33 deletions(-)
>
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 2be2cbdf060536..1762517a1b99c8 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -958,7 +958,7 @@ struct fuse_conn {
> /* New writepages go into this bucket */
> struct fuse_sync_bucket __rcu *curr_bucket;
>
> -#ifdef CONFIG_FUSE_PASSTHROUGH
> +#ifdef CONFIG_FUSE_BACKING
> /** IDR for backing files ids */
> struct idr backing_files_map;
> #endif
> @@ -1536,7 +1536,7 @@ void fuse_file_release(struct inode *inode, struct fuse_file *ff,
> unsigned int open_flags, fl_owner_t id, bool isdir);
>
> /* backing.c */
> -#ifdef CONFIG_FUSE_PASSTHROUGH
> +#ifdef CONFIG_FUSE_BACKING
> struct fuse_backing *fuse_backing_get(struct fuse_backing *fb);
> void fuse_backing_put(struct fuse_backing *fb);
> struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, int backing_id);
> @@ -1596,6 +1596,16 @@ static inline struct file *fuse_file_passthrough(struct fuse_file *ff)
> #endif
> }
>
> +#ifdef CONFIG_FUSE_PASSTHROUGH
> +int fuse_passthrough_backing_open(struct fuse_conn *fc,
> + struct fuse_backing *fb);
> +int fuse_passthrough_backing_close(struct fuse_conn *fc,
> + struct fuse_backing *fb);
> +#else
> +# define fuse_passthrough_backing_open(...) (-EOPNOTSUPP)
> +# define fuse_passthrough_backing_close(...) (-EOPNOTSUPP)
> +#endif
> +
> ssize_t fuse_passthrough_read_iter(struct kiocb *iocb, struct iov_iter *iter);
> ssize_t fuse_passthrough_write_iter(struct kiocb *iocb, struct iov_iter *iter);
> ssize_t fuse_passthrough_splice_read(struct file *in, loff_t *ppos,
> diff --git a/fs/fuse/fuse_trace.h b/fs/fuse/fuse_trace.h
> index 2389072b734636..660d9b5206a175 100644
> --- a/fs/fuse/fuse_trace.h
> +++ b/fs/fuse/fuse_trace.h
> @@ -174,6 +174,41 @@ TRACE_EVENT(fuse_request_end,
> __entry->unique, __entry->len, __entry->error)
> );
>
> +#ifdef CONFIG_FUSE_BACKING
> +TRACE_EVENT(fuse_backing_class,
> + TP_PROTO(const struct fuse_conn *fc, unsigned int idx,
> + const struct fuse_backing *fb),
> +
> + TP_ARGS(fc, idx, fb),
> +
> + TP_STRUCT__entry(
> + __field(dev_t, connection)
> + __field(unsigned int, idx)
> + __field(unsigned long, ino)
> + ),
> +
> + TP_fast_assign(
> + struct inode *inode = file_inode(fb->file);
> +
> + __entry->connection = fc->dev;
> + __entry->idx = idx;
> + __entry->ino = inode->i_ino;
> + ),
> +
> + TP_printk("connection %u idx %u ino 0x%lx",
> + __entry->connection,
> + __entry->idx,
> + __entry->ino)
> +);
> +#define DEFINE_FUSE_BACKING_EVENT(name) \
> +DEFINE_EVENT(fuse_backing_class, name, \
> + TP_PROTO(const struct fuse_conn *fc, unsigned int idx, \
> + const struct fuse_backing *fb), \
> + TP_ARGS(fc, idx, fb))
> +DEFINE_FUSE_BACKING_EVENT(fuse_backing_open);
> +DEFINE_FUSE_BACKING_EVENT(fuse_backing_close);
> +#endif
> +
> #if IS_ENABLED(CONFIG_FUSE_IOMAP)
>
> /* tracepoint boilerplate so we don't have to keep doing this */
> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> index 6be74396ef5198..ebb9a2d76b532e 100644
> --- a/fs/fuse/Kconfig
> +++ b/fs/fuse/Kconfig
> @@ -59,12 +59,16 @@ config FUSE_PASSTHROUGH
> default y
> depends on FUSE_FS
> select FS_STACK
> + select FUSE_BACKING
> help
> This allows bypassing FUSE server by mapping specific FUSE operations
> to be performed directly on a backing file.
>
> If you want to allow passthrough operations, answer Y.
>
> +config FUSE_BACKING
> + bool
> +
> config FUSE_IOMAP
> bool "FUSE file IO over iomap"
> default y
> diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
> index c79f786d0c90c3..27be39317701d6 100644
> --- a/fs/fuse/Makefile
> +++ b/fs/fuse/Makefile
> @@ -14,7 +14,8 @@ fuse-y := trace.o # put trace.o first so we see ftrace errors sooner
> fuse-y += dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
> fuse-y += iomode.o
> fuse-$(CONFIG_FUSE_DAX) += dax.o
> -fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o backing.o
> +fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
> +fuse-$(CONFIG_FUSE_BACKING) += backing.o
> fuse-$(CONFIG_SYSCTL) += sysctl.o
> fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
> fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o
> diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
> index ddb23b7400fc72..c128bed95a76b8 100644
> --- a/fs/fuse/backing.c
> +++ b/fs/fuse/backing.c
> @@ -6,6 +6,7 @@
> */
>
> #include "fuse_i.h"
> +#include "fuse_trace.h"
>
> #include <linux/file.h>
>
> @@ -81,16 +82,14 @@ void fuse_backing_files_free(struct fuse_conn *fc)
>
> int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> {
> - struct file *file;
> - struct super_block *backing_sb;
> + struct file *file = NULL;
> struct fuse_backing *fb = NULL;
> - int res;
> + int res, passthrough_res;
>
> pr_debug("%s: fd=%d flags=0x%x\n", __func__, map->fd, map->flags);
>
> - /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> res = -EPERM;
> - if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
> + if (!fc->passthrough)
> goto out;
>
> res = -EINVAL;
> @@ -102,46 +101,68 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> if (!file)
> goto out;
>
> - backing_sb = file_inode(file)->i_sb;
> - res = -ELOOP;
> - if (backing_sb->s_stack_depth >= fc->max_stack_depth)
> - goto out_fput;
> -
> fb = kmalloc(sizeof(struct fuse_backing), GFP_KERNEL);
> res = -ENOMEM;
> if (!fb)
> - goto out_fput;
> + goto out_file;
>
> + /* fb now owns file */
> fb->file = file;
> + file = NULL;
> fb->cred = prepare_creds();
> refcount_set(&fb->count, 1);
>
> + /*
> + * Each _backing_open function should either:
> + *
> + * 1. Take a ref to fb if it wants the file and return 0.
> + * 2. Return 0 without taking a ref if the backing file isn't needed.
> + * 3. Return an errno explaining why it couldn't attach.
> + *
> + * If at least one subsystem bumps the reference count to open it,
> + * we'll install it into the index and return the index. If nobody
> + * opens the file, the error code will be passed up. EPERM is the
> + * default.
> + */
> + passthrough_res = fuse_passthrough_backing_open(fc, fb);
> +
> + if (refcount_read(&fb->count) < 2) {
> + if (passthrough_res)
> + res = passthrough_res;
> + if (!res)
> + res = -EPERM;
> + goto out_fb;
> + }
> +
> res = fuse_backing_id_alloc(fc, fb);
> - if (res < 0) {
> - fuse_backing_free(fb);
> - fb = NULL;
> - }
> + if (res < 0)
> + goto out_fb;
> +
> + trace_fuse_backing_open(fc, res, fb);
>
> -out:
> pr_debug("%s: fb=0x%p, ret=%i\n", __func__, fb, res);
> -
> + fuse_backing_put(fb);
> return res;
>
> -out_fput:
> - fput(file);
> - goto out;
> +out_fb:
> + fuse_backing_free(fb);
> +out_file:
> + if (file)
> + fput(file);
> +out:
> + pr_debug("%s: ret=%i\n", __func__, res);
> + return res;
> }
>
> int fuse_backing_close(struct fuse_conn *fc, int backing_id)
> {
> - struct fuse_backing *fb = NULL;
> - int err;
> + struct fuse_backing *fb = NULL, *test_fb;
> + int err, passthrough_err;
>
> pr_debug("%s: backing_id=%d\n", __func__, backing_id);
>
> - /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> err = -EPERM;
> - if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
> + if (!fc->passthrough)
> goto out;
>
> err = -EINVAL;
> @@ -149,12 +170,45 @@ int fuse_backing_close(struct fuse_conn *fc, int backing_id)
> goto out;
>
> err = -ENOENT;
> - fb = fuse_backing_id_remove(fc, backing_id);
> + fb = fuse_backing_lookup(fc, backing_id);
> if (!fb)
> goto out;
>
> + /*
> + * Each _backing_close function should either:
> + *
> + * 1. Release the ref that it took in _backing_open and return 0.
> + * 2. Don't release the ref if the backing file is busy, and return 0.
> + * 3. Return an errno explaining why it couldn't detach.
> + *
> + * If there are no more active references to the backing file, it will
> + * be closed and removed from the index. If there are still active
> + * references to the backing file other than the one we just took, the
That does not look right.
The fuse_backing object can often outlive the backing_id mapping:
1. fuse server attaches backing fd to backing id 1
2. fuse server opens a file with passthrough to backing id 1
3. fuse inode holds a refcount to the fuse_backing object
4. fuse server closes backing id 1 mapping
5. fuse server closes file, drops last reference to fuse_backing object
IOW, fb->count is not about being in the index.
With your code the fuse server call in #4 above will end up leaving the
fuse_backing object in the index and after #5 it will remain a dangling
object in the index.
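The lifetime in steps 1-5 can be sketched with a small userspace model.
This is only an illustration under stated assumptions: struct backing,
struct conn, backing_open(), backing_lookup(), backing_close() and
backing_put() are hypothetical stand-ins, not the kernel's fuse_backing
code, and a plain int replaces refcount_t. The point it tries to show is
that unregistering an id always clears the index slot, while the object
itself is freed only when its last user drops its reference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_IDS 8

/* Hypothetical stand-in for struct fuse_backing. */
struct backing {
	int refs;	/* one ref held by the index slot, one per user */
};

/* Hypothetical stand-in for the per-connection id index. */
struct conn {
	struct backing *index[MAX_IDS];	/* id -> object; id 0 is never used */
};

static int live_objects;	/* number of objects not yet freed */

/* Drop one reference; free the object when the last one goes away. */
static void backing_put(struct backing *fb)
{
	if (--fb->refs == 0) {
		free(fb);
		live_objects--;
	}
}

/* Step 1: register an object. The index slot owns the initial reference. */
static int backing_open(struct conn *c)
{
	for (int id = 1; id < MAX_IDS; id++) {
		if (!c->index[id]) {
			struct backing *fb = malloc(sizeof(*fb));

			if (!fb)
				return -1;
			fb->refs = 1;
			c->index[id] = fb;
			live_objects++;
			return id;
		}
	}
	return -1;
}

/* Steps 2-3: a user (e.g. an open file) takes its own reference. */
static struct backing *backing_lookup(struct conn *c, int id)
{
	struct backing *fb = (id > 0 && id < MAX_IDS) ? c->index[id] : NULL;

	if (fb)
		fb->refs++;
	return fb;
}

/* Step 4: unregister. The slot is cleared even if users still hold refs. */
static int backing_close(struct conn *c, int id)
{
	struct backing *fb;

	if (id <= 0 || id >= MAX_IDS || !c->index[id])
		return -1;
	fb = c->index[id];
	c->index[id] = NULL;
	backing_put(fb);	/* drops only the index's reference */
	return 0;
}
```

In this model, calling backing_close() while a lookup reference is still
held leaves the index empty and the object alive; the final backing_put()
(step 5) is what frees it. Keeping the object in the index until the
count drops, as the patch does, is what would leave the stale entry
described above.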
TBH, I don't understand why we need any of the complexity
of two subsystems claiming the same fuse_backing object for two
different purposes.
Also, I think that an explicit statement from the server about the
purpose of the backing file is due (as your commit message implies).
This could easily be done with the backing open flags member:
diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
index c63990254649c..e5a675fca7505 100644
--- a/fs/fuse/backing.c
+++ b/fs/fuse/backing.c
@@ -96,7 +96,7 @@ int fuse_backing_open(struct fuse_conn *fc, struct
fuse_backing_map *map)
goto out;
res = -EINVAL;
- if (map->flags || map->padding)
+ if (map->flags & ~FUSE_BACKING_VALID_FLAGS || map->padding)
goto out;
file = fget_raw(map->fd);
@@ -127,8 +127,10 @@ int fuse_backing_open(struct fuse_conn *fc,
struct fuse_backing_map *map)
* opens the file, the error code will be passed up. EPERM is the
* default.
*/
- passthrough_res = fuse_passthrough_backing_open(fc, fb);
- iomap_res = fuse_iomap_backing_open(fc, fb);
+ if (map->flags & FUSE_BACKING_IOMAP)
+ iomap_res = fuse_iomap_backing_open(fc, fb);
+ else
+ passthrough_res = fuse_passthrough_backing_open(fc, fb);
if (refcount_read(&fb->count) < 2) {
if (passthrough_res)
@@ -192,8 +194,10 @@ int fuse_backing_close(struct fuse_conn *fc, int
backing_id)
* references to the backing file other than the one we just took, the
* error code will be passed up. EBUSY is the default.
*/
- passthrough_err = fuse_passthrough_backing_close(fc, fb);
- iomap_err = fuse_iomap_backing_close(fc, fb);
+ if (fb->bdev)
+ iomap_err = fuse_iomap_backing_close(fc, fb);
+ else
+ passthrough_err = fuse_passthrough_backing_close(fc, fb);
if (refcount_read(&fb->count) > 1) {
if (passthrough_err)
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 70b5530e587d4..ee81903ad2f98 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1148,6 +1148,10 @@ struct fuse_notify_retrieve_in {
uint64_t dummy4;
};
+/* basic file I/O functionality through iomap */
+#define FUSE_BACKING_IOMAP (1 << 0)
+#define FUSE_BACKING_VALID_FLAGS (FUSE_BACKING_IOMAP)
+
struct fuse_backing_map {
int32_t fd;
uint32_t flags;
> + * error code will be passed up. EBUSY is the default.
> + */
> + passthrough_err = fuse_passthrough_backing_close(fc, fb);
> +
> + if (refcount_read(&fb->count) > 1) {
> + if (passthrough_err)
> + err = passthrough_err;
> + if (!err)
> + err = -EBUSY;
> + goto out_fb;
> + }
> +
> + trace_fuse_backing_close(fc, backing_id, fb);
> +
> + err = -ENOENT;
> + test_fb = fuse_backing_id_remove(fc, backing_id);
> + if (!test_fb)
> + goto out_fb;
> +
> + WARN_ON(fb != test_fb);
> + pr_debug("%s: fb=0x%p, err=0\n", __func__, fb);
> + fuse_backing_put(fb);
> + return 0;
> +out_fb:
> fuse_backing_put(fb);
> - err = 0;
> out:
> pr_debug("%s: fb=0x%p, err=%i\n", __func__, fb, err);
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index dbde17fff0cda9..31d9f006836ac1 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -2623,7 +2623,7 @@ static long fuse_dev_ioctl_backing_open(struct file *file,
> if (!fud)
> return -EPERM;
>
> - if (!IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> + if (!IS_ENABLED(CONFIG_FUSE_BACKING))
> return -EOPNOTSUPP;
>
> if (copy_from_user(&map, argp, sizeof(map)))
> @@ -2640,7 +2640,7 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
> if (!fud)
> return -EPERM;
>
> - if (!IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> + if (!IS_ENABLED(CONFIG_FUSE_BACKING))
> return -EOPNOTSUPP;
>
> if (get_user(backing_id, argp))
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 9448a11c828fef..1f3f91981410aa 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -993,7 +993,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
> fc->name_max = FUSE_NAME_LOW_MAX;
> fc->timeout.req_timeout = 0;
>
> - if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> + if (IS_ENABLED(CONFIG_FUSE_BACKING))
> fuse_backing_files_init(fc);
>
> INIT_LIST_HEAD(&fc->mounts);
> @@ -1030,7 +1030,7 @@ void fuse_conn_put(struct fuse_conn *fc)
> WARN_ON(atomic_read(&bucket->count) != 1);
> kfree(bucket);
> }
> - if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> + if (IS_ENABLED(CONFIG_FUSE_BACKING))
> fuse_backing_files_free(fc);
> call_rcu(&fc->rcu, delayed_release);
> }
> diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> index e0b8d885bc81f3..dfc61cc4bd21af 100644
> --- a/fs/fuse/passthrough.c
> +++ b/fs/fuse/passthrough.c
> @@ -197,3 +197,31 @@ void fuse_passthrough_release(struct fuse_file *ff, struct fuse_backing *fb)
> put_cred(ff->cred);
> ff->cred = NULL;
> }
> +
> +int fuse_passthrough_backing_open(struct fuse_conn *fc,
> + struct fuse_backing *fb)
> +{
> + struct super_block *backing_sb;
> +
> + /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
This limitation is not specific to fuse passthrough.
While the fuse passthrough use case is likely to request many fuse
backing files, the limitation is here to protect against malicious actors,
and the same ioctl used by the iomap fuse server can just as well open
many "lsof invisible" files, so the limitation should be in the generic
function.
> +
> + backing_sb = file_inode(fb->file)->i_sb;
> + if (backing_sb->s_stack_depth >= fc->max_stack_depth)
> + return -ELOOP;
> +
> + fuse_backing_get(fb);
> + return 0;
> +}
> +
> +int fuse_passthrough_backing_close(struct fuse_conn *fc,
> + struct fuse_backing *fb)
> +{
> + /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
This comment in upstream is probably not very accurate, because there is no
harm done in closing the backing files, but sure, keep it for symmetry.
Same comment as above though: unless there are reasons to relax
CAP_SYS_ADMIN for file iomap, I would leave this in the generic code.
And then there is not much justification left for the close helpers IMO,
especially given that the implementation wrt removing from index is
incorrect, I would keep it simple:
@@ -175,11 +177,19 @@ int fuse_backing_close(struct fuse_conn *fc, int
backing_id)
if (backing_id <= 0)
goto out;
- err = -ENOENT;
- fb = fuse_backing_lookup(fc, backing_id);
- if (!fb)
+ err = -EPERM;
+ if (!capable(CAP_SYS_ADMIN))
goto out;
+ err = -EBUSY;
+ if (fb->bdev)
+ goto out;
+
+ fb = fuse_backing_id_remove(fc, backing_id);
+ if (!fb)
+ err = -ENOENT;
+ goto out_fb;
+
Thanks,
Amir.
^ permalink raw reply related [flat|nested] 210+ messages in thread
* Re: [PATCH 05/23] fuse: move the passthrough-specific code back to passthrough.c
2025-08-21 9:05 ` Amir Goldstein
@ 2025-08-21 16:13 ` Darrick J. Wong
0 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 16:13 UTC (permalink / raw)
To: Amir Goldstein; +Cc: miklos, bernd, neal, John, linux-fsdevel, joannelkoong
On Thu, Aug 21, 2025 at 11:05:28AM +0200, Amir Goldstein wrote:
> On Thu, Aug 21, 2025 at 2:53 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > In preparation for iomap, move the passthrough-specific validation code
> > back to passthrough.c and create a new Kconfig item for conditional
> > compilation of backing.c. In the next patch, iomap will share the
> > backing structures.
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> > fs/fuse/fuse_i.h | 14 ++++++
> > fs/fuse/fuse_trace.h | 35 ++++++++++++++++
> > fs/fuse/Kconfig | 4 ++
> > fs/fuse/Makefile | 3 +
> > fs/fuse/backing.c | 106 +++++++++++++++++++++++++++++++++++++------------
> > fs/fuse/dev.c | 4 +-
> > fs/fuse/inode.c | 4 +-
> > fs/fuse/passthrough.c | 28 +++++++++++++
> > 8 files changed, 165 insertions(+), 33 deletions(-)
> >
> >
<snip to the relevant parts>
> > diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
> > index ddb23b7400fc72..c128bed95a76b8 100644
> > --- a/fs/fuse/backing.c
> > +++ b/fs/fuse/backing.c
> > @@ -102,46 +101,68 @@ int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> > if (!file)
> > goto out;
> >
> > - backing_sb = file_inode(file)->i_sb;
> > - res = -ELOOP;
> > - if (backing_sb->s_stack_depth >= fc->max_stack_depth)
> > - goto out_fput;
> > -
> > fb = kmalloc(sizeof(struct fuse_backing), GFP_KERNEL);
> > res = -ENOMEM;
> > if (!fb)
> > - goto out_fput;
> > + goto out_file;
> >
> > + /* fb now owns file */
> > fb->file = file;
> > + file = NULL;
> > fb->cred = prepare_creds();
> > refcount_set(&fb->count, 1);
> >
> > + /*
> > + * Each _backing_open function should either:
> > + *
> > + * 1. Take a ref to fb if it wants the file and return 0.
> > + * 2. Return 0 without taking a ref if the backing file isn't needed.
> > + * 3. Return an errno explaining why it couldn't attach.
> > + *
> > + * If at least one subsystem bumps the reference count to open it,
> > + * we'll install it into the index and return the index. If nobody
> > + * opens the file, the error code will be passed up. EPERM is the
> > + * default.
> > + */
> > + passthrough_res = fuse_passthrough_backing_open(fc, fb);
> > +
> > + if (refcount_read(&fb->count) < 2) {
> > + if (passthrough_res)
> > + res = passthrough_res;
> > + if (!res)
> > + res = -EPERM;
> > + goto out_fb;
> > + }
> > +
> > res = fuse_backing_id_alloc(fc, fb);
> > - if (res < 0) {
> > - fuse_backing_free(fb);
> > - fb = NULL;
> > - }
> > + if (res < 0)
> > + goto out_fb;
> > +
> > + trace_fuse_backing_open(fc, res, fb);
> >
> > -out:
> > pr_debug("%s: fb=0x%p, ret=%i\n", __func__, fb, res);
> > -
> > + fuse_backing_put(fb);
> > return res;
> >
> > -out_fput:
> > - fput(file);
> > - goto out;
> > +out_fb:
> > + fuse_backing_free(fb);
> > +out_file:
> > + if (file)
> > + fput(file);
> > +out:
> > + pr_debug("%s: ret=%i\n", __func__, res);
> > + return res;
> > }
> >
> > int fuse_backing_close(struct fuse_conn *fc, int backing_id)
> > {
> > - struct fuse_backing *fb = NULL;
> > - int err;
> > + struct fuse_backing *fb = NULL, *test_fb;
> > + int err, passthrough_err;
> >
> > pr_debug("%s: backing_id=%d\n", __func__, backing_id);
> >
> > - /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> > err = -EPERM;
> > - if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
> > + if (!fc->passthrough)
> > goto out;
> >
> > err = -EINVAL;
> > @@ -149,12 +170,45 @@ int fuse_backing_close(struct fuse_conn *fc, int backing_id)
> > goto out;
> >
> > err = -ENOENT;
> > - fb = fuse_backing_id_remove(fc, backing_id);
> > + fb = fuse_backing_lookup(fc, backing_id);
> > if (!fb)
> > goto out;
> >
> > + /*
> > + * Each _backing_close function should either:
> > + *
> > + * 1. Release the ref that it took in _backing_open and return 0.
> > + * 2. Don't release the ref if the backing file is busy, and return 0.
> > + * 3. Return an errno explaining why it couldn't detach.
> > + *
> > + * If there are no more active references to the backing file, it will
> > + * be closed and removed from the index. If there are still active
> > + * references to the backing file other than the one we just took, the
>
> That does not look right.
> The fuse_backing object can often outlive the backing_id mapping:
> 1. fuse server attaches backing fd to backing id 1
> 2. fuse server opens a file with passthrough to backing id 1
> 3. fuse inode holds a refcount to the fuse_backing object
> 4. fuse server closes backing id 1 mapping
> 5. fuse server closes file, drops last reference to fuse_backing object
Ah, I didn't account for backing files needing to outlive being
registered in the index. Ok, my whole approach above is wrong. :)
> IOW, fb->count is not about being in the index.
> With your code the fuse server call in #4 above will end up leaving the
> fuse_backing object in the index and after #5 it will remain a dangling
> object in the index.
>
> TBH, I don't understand why we need any of the complexity
> of two subsystems claiming the same fuse_backing object for two
> different purposes.
I decided I should explore your suggestion from v2:
https://lore.kernel.org/linux-fsdevel/CAOQ4uxiZTTEOs4HYD0vGi3XtihyDiQbDFXBCuGKoJyFPQv_+Lw@mail.gmail.com/
...and it didn't occur to me that, well, there's plenty of device_id
address space so if some weird server has to register the same fd twice
for two subsystems to use it, that's completely ok. :)
> Also, I think that an explicit statement from the server about the
> purpose of the backing file is due (as your commit message implies).
> This could easily be done with the backing open flags member:
Hrm. That /would/ eliminate all the stupid {iomap,passthrough}_res
juggling if you were only allowed to register a backing id with a
single subsystem. Worst case, a hybrid iomap+passthrough fs ends up
with the same file registered with multiple ids.
Yeah, let's do that.
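A minimal userspace sketch of the "one subsystem per backing id" rule,
assuming the flag names from the proposal quoted below. struct
backing_map, enum backing_owner and backing_pick_owner() are
hypothetical names for illustration, not kernel code; the struct only
mirrors the fd/flags/padding fields discussed in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Flag values as proposed in this thread. */
#define FUSE_BACKING_IOMAP		(1U << 0)
#define FUSE_BACKING_VALID_FLAGS	(FUSE_BACKING_IOMAP)

/* Hypothetical stand-in for struct fuse_backing_map. */
struct backing_map {
	int32_t fd;
	uint32_t flags;
	uint64_t padding;
};

/* Exactly one subsystem owns each registration. */
enum backing_owner {
	OWNER_PASSTHROUGH,
	OWNER_IOMAP,
};

/*
 * Reject unknown flags and nonzero padding, then pick the single owner
 * from the flags. Returns 0 on success or -EINVAL.
 */
static int backing_pick_owner(const struct backing_map *map,
			      enum backing_owner *owner)
{
	if ((map->flags & ~FUSE_BACKING_VALID_FLAGS) || map->padding)
		return -EINVAL;

	*owner = (map->flags & FUSE_BACKING_IOMAP) ? OWNER_IOMAP
						   : OWNER_PASSTHROUGH;
	return 0;
}
```

With the owner decided up front, there is only one result code to
propagate, so the {iomap,passthrough}_res juggling goes away. A hybrid
server would simply register the same fd twice, once per flag setting.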
> diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
> index c63990254649c..e5a675fca7505 100644
> --- a/fs/fuse/backing.c
> +++ b/fs/fuse/backing.c
> @@ -96,7 +96,7 @@ int fuse_backing_open(struct fuse_conn *fc, struct
> fuse_backing_map *map)
> goto out;
>
> res = -EINVAL;
> - if (map->flags || map->padding)
> + if (map->flags & ~FUSE_BACKING_VALID_FLAGS || map->padding)
> goto out;
>
> file = fget_raw(map->fd);
> @@ -127,8 +127,10 @@ int fuse_backing_open(struct fuse_conn *fc,
> struct fuse_backing_map *map)
> * opens the file, the error code will be passed up. EPERM is the
> * default.
> */
> - passthrough_res = fuse_passthrough_backing_open(fc, fb);
> - iomap_res = fuse_iomap_backing_open(fc, fb);
> + if (map->flags & FUSE_BACKING_IOMAP)
> + iomap_res = fuse_iomap_backing_open(fc, fb);
> + else
> + passthrough_res = fuse_passthrough_backing_open(fc, fb);
>
> if (refcount_read(&fb->count) < 2) {
> if (passthrough_res)
> @@ -192,8 +194,10 @@ int fuse_backing_close(struct fuse_conn *fc, int
> backing_id)
> * references to the backing file other than the one we just took, the
> * error code will be passed up. EBUSY is the default.
> */
> - passthrough_err = fuse_passthrough_backing_close(fc, fb);
> - iomap_err = fuse_iomap_backing_close(fc, fb);
> + if (fb->bdev)
> + iomap_err = fuse_iomap_backing_close(fc, fb);
> + else
> + passthrough_err = fuse_passthrough_backing_close(fc, fb);
Yes, that's a lot better, thanks. :D
>
> if (refcount_read(&fb->count) > 1) {
> if (passthrough_err)
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 70b5530e587d4..ee81903ad2f98 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -1148,6 +1148,10 @@ struct fuse_notify_retrieve_in {
> uint64_t dummy4;
> };
>
> +/* basic file I/O functionality through iomap */
> +#define FUSE_BACKING_IOMAP (1 << 0)
> +#define FUSE_BACKING_VALID_FLAGS (FUSE_BACKING_IOMAP)
> +
> struct fuse_backing_map {
> int32_t fd;
> uint32_t flags;
>
>
> > + * error code will be passed up. EBUSY is the default.
> > + */
> > + passthrough_err = fuse_passthrough_backing_close(fc, fb);
> > +
> > + if (refcount_read(&fb->count) > 1) {
> > + if (passthrough_err)
> > + err = passthrough_err;
> > + if (!err)
> > + err = -EBUSY;
> > + goto out_fb;
> > + }
> > +
> > + trace_fuse_backing_close(fc, backing_id, fb);
> > +
> > + err = -ENOENT;
> > + test_fb = fuse_backing_id_remove(fc, backing_id);
> > + if (!test_fb)
> > + goto out_fb;
> > +
> > + WARN_ON(fb != test_fb);
> > + pr_debug("%s: fb=0x%p, err=0\n", __func__, fb);
> > + fuse_backing_put(fb);
> > + return 0;
> > +out_fb:
> > fuse_backing_put(fb);
> > - err = 0;
> > out:
> > pr_debug("%s: fb=0x%p, err=%i\n", __func__, fb, err);
> >
> > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > index dbde17fff0cda9..31d9f006836ac1 100644
> > --- a/fs/fuse/dev.c
> > +++ b/fs/fuse/dev.c
> > @@ -2623,7 +2623,7 @@ static long fuse_dev_ioctl_backing_open(struct file *file,
> > if (!fud)
> > return -EPERM;
> >
> > - if (!IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> > + if (!IS_ENABLED(CONFIG_FUSE_BACKING))
> > return -EOPNOTSUPP;
> >
> > if (copy_from_user(&map, argp, sizeof(map)))
> > @@ -2640,7 +2640,7 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
> > if (!fud)
> > return -EPERM;
> >
> > - if (!IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> > + if (!IS_ENABLED(CONFIG_FUSE_BACKING))
> > return -EOPNOTSUPP;
> >
> > if (get_user(backing_id, argp))
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index 9448a11c828fef..1f3f91981410aa 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -993,7 +993,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
> > fc->name_max = FUSE_NAME_LOW_MAX;
> > fc->timeout.req_timeout = 0;
> >
> > - if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> > + if (IS_ENABLED(CONFIG_FUSE_BACKING))
> > fuse_backing_files_init(fc);
> >
> > INIT_LIST_HEAD(&fc->mounts);
> > @@ -1030,7 +1030,7 @@ void fuse_conn_put(struct fuse_conn *fc)
> > WARN_ON(atomic_read(&bucket->count) != 1);
> > kfree(bucket);
> > }
> > - if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> > + if (IS_ENABLED(CONFIG_FUSE_BACKING))
> > fuse_backing_files_free(fc);
> > call_rcu(&fc->rcu, delayed_release);
> > }
> > diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> > index e0b8d885bc81f3..dfc61cc4bd21af 100644
> > --- a/fs/fuse/passthrough.c
> > +++ b/fs/fuse/passthrough.c
> > @@ -197,3 +197,31 @@ void fuse_passthrough_release(struct fuse_file *ff, struct fuse_backing *fb)
> > put_cred(ff->cred);
> > ff->cred = NULL;
> > }
> > +
> > +int fuse_passthrough_backing_open(struct fuse_conn *fc,
> > + struct fuse_backing *fb)
> > +{
> > + struct super_block *backing_sb;
> > +
> > + /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> > + if (!capable(CAP_SYS_ADMIN))
> > + return -EPERM;
>
> This limitation is not specific to fuse passthrough.
> While the fuse passthrough use case is likely to request many fuse
> backing files, the limitation is here to protect against malicious actors,
> and the same ioctl used by the iomap fuse server can just as well open
> many "lsof invisible" files, so the limitation should be in the generic
> function.
Hrmm. Well for iomap block devices I'm not as worried because (a) you
sort of need privileges to open them, (b) there aren't that many block
devices, and (c) to use fuse-iomap at all you need CAP_SYS_RAWIO.
As for the invisibility problem, I wonder if I could just make
/sys/block/XXX/holders have a symlink to the fuse mount? For fuse2fs we
need to maintain the open fd to /dev/XXX, but I suppose that's not
necessarily true for a fuse server.
> > +
> > + backing_sb = file_inode(fb->file)->i_sb;
> > + if (backing_sb->s_stack_depth >= fc->max_stack_depth)
> > + return -ELOOP;
> > +
> > + fuse_backing_get(fb);
> > + return 0;
> > +}
> > +
> > +int fuse_passthrough_backing_close(struct fuse_conn *fc,
> > + struct fuse_backing *fb)
> > +{
> > + /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> > + if (!capable(CAP_SYS_ADMIN))
> > + return -EPERM;
> > +
>
> This comment in upstream is probably not very accurate, because there is no
> harm done in closing the backing files, but sure, keep it for symmetry.
> Same comment as above though: unless there are reasons to relax
> CAP_SYS_ADMIN for file iomap, I would leave this in the generic code.
<nod>
> And then there is not much justification left for the close helpers IMO,
> especially given that the implementation wrt removing from index is
> incorrect, I would keep it simple:
>
> @@ -175,11 +177,19 @@ int fuse_backing_close(struct fuse_conn *fc, int
> backing_id)
> if (backing_id <= 0)
> goto out;
>
> - err = -ENOENT;
> - fb = fuse_backing_lookup(fc, backing_id);
> - if (!fb)
> + err = -EPERM;
> + if (!capable(CAP_SYS_ADMIN))
> goto out;
>
> + err = -EBUSY;
> + if (fb->bdev)
> + goto out;
> +
> + fb = fuse_backing_id_remove(fc, backing_id);
> + if (!fb)
> + err = -ENOENT;
> + goto out_fb;
I'll think about this, though I don't know how much of the security
checking would need to be relaxed to enable the completely locked down
fuse4fs systemd service that I was imagining. My guess is that I'll
have to establish iomap capability (aka CAP_SYS_RAWIO) when /dev/fuse is
first opened by the mount helper, then the mount helper opens all the
block devices required, and finally it passes all these fds into the
contained service.
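For the "passes all these fds into the contained service" step, one
common mechanism is SCM_RIGHTS over an AF_UNIX socket. The sketch below
is an assumption about how that hand-off could look, not something
settled in this thread: send_fd() and recv_fd() are hypothetical helper
names, and a systemd-based setup might well use a different fd-passing
facility.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one open fd over a connected AF_UNIX socket. Returns 0 or -1. */
static int send_fd(int sock, int fd)
{
	char byte = 0;	/* at least one byte of real data must be sent */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment */
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd sent by send_fd(). Returns the new fd or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

In such a scheme the privileged mount helper would call send_fd() once
for /dev/fuse and once per block device, and the locked-down service
would call recv_fd() without needing the privileges to open any of
them itself.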
<shrug> As I said, food for thought for v5 :)
--D
> +
>
> Thanks,
> Amir.
>
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 06/23] fuse: add an ioctl to add new iomap devices
2025-08-21 8:09 ` Amir Goldstein
@ 2025-08-21 16:15 ` Darrick J. Wong
0 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 16:15 UTC (permalink / raw)
To: Amir Goldstein; +Cc: miklos, bernd, neal, John, linux-fsdevel, joannelkoong
On Thu, Aug 21, 2025 at 10:09:29AM +0200, Amir Goldstein wrote:
> On Thu, Aug 21, 2025 at 2:54 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Add an ioctl that allows fuse servers to register block devices for use
> > with iomap. This is (for now) separate from the backing file open/close
> > ioctl (despite using the same struct) to keep the codepaths separate.
>
> Is it though? I'm pretty sure this commit does not add a new ioctl
> and reuses the same one (which is fine by me).
Oops, stale message. :(
<snip>
> > diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
> > index c128bed95a76b8..c63990254649ca 100644
> > --- a/fs/fuse/backing.c
> > +++ b/fs/fuse/backing.c
> > @@ -187,10 +193,13 @@ int fuse_backing_close(struct fuse_conn *fc, int backing_id)
> > * error code will be passed up. EBUSY is the default.
> > */
> > passthrough_err = fuse_passthrough_backing_close(fc, fb);
> > + iomap_err = fuse_iomap_backing_close(fc, fb);
> >
> > if (refcount_read(&fb->count) > 1) {
> > if (passthrough_err)
> > err = passthrough_err;
> > + if (!err && iomap_err)
> > + err = iomap_err;
> > if (!err)
> > err = -EBUSY;
> > goto out_fb;
>
> Do you really think that we need to support both file passthrough and file iomap
> on the same fuse filesystem?
Probably not.
> Unless you have a specific use case in mind, it looks like over-design to me.
> We could enforce either fc->passthrough or fc->iomap on init.
>
> To put it another way: unless you intend to test a combination of file
> passthrough and file iomap, I think you should leave this configuration
> out of the config possibilities.
Nah, one subsystem per backing device_id is ok with me. If someday
someone builds a hybrid filesystem then ... hopefully they don't need
more than INT_MAX backing files to be in the index.
--D
> Thanks,
> Amir.
>
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 04/23] fuse: move the backing file idr and code into a new source file
2025-08-21 7:42 ` Amir Goldstein
@ 2025-08-21 16:15 ` Darrick J. Wong
0 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 16:15 UTC (permalink / raw)
To: Amir Goldstein; +Cc: miklos, bernd, neal, John, linux-fsdevel, joannelkoong
On Thu, Aug 21, 2025 at 09:42:07AM +0200, Amir Goldstein wrote:
> On Thu, Aug 21, 2025 at 9:21 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Thu, Aug 21, 2025 at 2:53 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > iomap support for fuse is also going to want the ability to attach
> > > backing files to a fuse filesystem. Move the fuse_backing code into a
> > > separate file so that both can use it.
> > >
> > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> >
> > Are you going to make FUSE_IOMAP depend on FUSE_PASSTHROUGH later on?
> > I can't think of a reason why not.
>
> Ah I see. They will both depend on FUSE_BACKING
> cool
Yep. Thanks for your feedback! :)
--D
> >
> > Thanks,
> > Amir.
> >
> > > ---
> > > fs/fuse/fuse_i.h | 47 ++++++++-----
> > > fs/fuse/Makefile | 2 -
> > > fs/fuse/backing.c | 174 +++++++++++++++++++++++++++++++++++++++++++++++++
> > > fs/fuse/passthrough.c | 158 --------------------------------------------
> > > 4 files changed, 203 insertions(+), 178 deletions(-)
> > > create mode 100644 fs/fuse/backing.c
> > >
> > >
> > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > index 2cd9f4cdc6a7ef..2be2cbdf060536 100644
> > > --- a/fs/fuse/fuse_i.h
> > > +++ b/fs/fuse/fuse_i.h
> > > @@ -1535,29 +1535,11 @@ struct fuse_file *fuse_file_open(struct fuse_mount *fm, u64 nodeid,
> > > void fuse_file_release(struct inode *inode, struct fuse_file *ff,
> > > unsigned int open_flags, fl_owner_t id, bool isdir);
> > >
> > > -/* passthrough.c */
> > > -static inline struct fuse_backing *fuse_inode_backing(struct fuse_inode *fi)
> > > -{
> > > -#ifdef CONFIG_FUSE_PASSTHROUGH
> > > - return READ_ONCE(fi->fb);
> > > -#else
> > > - return NULL;
> > > -#endif
> > > -}
> > > -
> > > -static inline struct fuse_backing *fuse_inode_backing_set(struct fuse_inode *fi,
> > > - struct fuse_backing *fb)
> > > -{
> > > -#ifdef CONFIG_FUSE_PASSTHROUGH
> > > - return xchg(&fi->fb, fb);
> > > -#else
> > > - return NULL;
> > > -#endif
> > > -}
> > > -
> > > +/* backing.c */
> > > #ifdef CONFIG_FUSE_PASSTHROUGH
> > > struct fuse_backing *fuse_backing_get(struct fuse_backing *fb);
> > > void fuse_backing_put(struct fuse_backing *fb);
> > > +struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, int backing_id);
> > > #else
> > >
> > > static inline struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
> > > @@ -1568,6 +1550,11 @@ static inline struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
> > > static inline void fuse_backing_put(struct fuse_backing *fb)
> > > {
> > > }
> > > +static inline struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc,
> > > + int backing_id)
> > > +{
> > > + return NULL;
> > > +}
> > > #endif
> > >
> > > void fuse_backing_files_init(struct fuse_conn *fc);
> > > @@ -1575,6 +1562,26 @@ void fuse_backing_files_free(struct fuse_conn *fc);
> > > int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map);
> > > int fuse_backing_close(struct fuse_conn *fc, int backing_id);
> > >
> > > +/* passthrough.c */
> > > +static inline struct fuse_backing *fuse_inode_backing(struct fuse_inode *fi)
> > > +{
> > > +#ifdef CONFIG_FUSE_PASSTHROUGH
> > > + return READ_ONCE(fi->fb);
> > > +#else
> > > + return NULL;
> > > +#endif
> > > +}
> > > +
> > > +static inline struct fuse_backing *fuse_inode_backing_set(struct fuse_inode *fi,
> > > + struct fuse_backing *fb)
> > > +{
> > > +#ifdef CONFIG_FUSE_PASSTHROUGH
> > > + return xchg(&fi->fb, fb);
> > > +#else
> > > + return NULL;
> > > +#endif
> > > +}
> > > +
> > > struct fuse_backing *fuse_passthrough_open(struct file *file,
> > > struct inode *inode,
> > > int backing_id);
> > > diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
> > > index 70709a7a3f9523..c79f786d0c90c3 100644
> > > --- a/fs/fuse/Makefile
> > > +++ b/fs/fuse/Makefile
> > > @@ -14,7 +14,7 @@ fuse-y := trace.o # put trace.o first so we see ftrace errors sooner
> > > fuse-y += dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
> > > fuse-y += iomode.o
> > > fuse-$(CONFIG_FUSE_DAX) += dax.o
> > > -fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
> > > +fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o backing.o
> > > fuse-$(CONFIG_SYSCTL) += sysctl.o
> > > fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
> > > fuse-$(CONFIG_FUSE_IOMAP) += file_iomap.o
> > > diff --git a/fs/fuse/backing.c b/fs/fuse/backing.c
> > > new file mode 100644
> > > index 00000000000000..ddb23b7400fc72
> > > --- /dev/null
> > > +++ b/fs/fuse/backing.c
> > > @@ -0,0 +1,174 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * FUSE passthrough to backing file.
> > > + *
> > > + * Copyright (c) 2023 CTERA Networks.
> > > + */
> > > +
> > > +#include "fuse_i.h"
> > > +
> > > +#include <linux/file.h>
> > > +
> > > +struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
> > > +{
> > > + if (fb && refcount_inc_not_zero(&fb->count))
> > > + return fb;
> > > + return NULL;
> > > +}
> > > +
> > > +static void fuse_backing_free(struct fuse_backing *fb)
> > > +{
> > > + pr_debug("%s: fb=0x%p\n", __func__, fb);
> > > +
> > > + if (fb->file)
> > > + fput(fb->file);
> > > + put_cred(fb->cred);
> > > + kfree_rcu(fb, rcu);
> > > +}
> > > +
> > > +void fuse_backing_put(struct fuse_backing *fb)
> > > +{
> > > + if (fb && refcount_dec_and_test(&fb->count))
> > > + fuse_backing_free(fb);
> > > +}
> > > +
> > > +void fuse_backing_files_init(struct fuse_conn *fc)
> > > +{
> > > + idr_init(&fc->backing_files_map);
> > > +}
> > > +
> > > +static int fuse_backing_id_alloc(struct fuse_conn *fc, struct fuse_backing *fb)
> > > +{
> > > + int id;
> > > +
> > > + idr_preload(GFP_KERNEL);
> > > + spin_lock(&fc->lock);
> > > + /* FIXME: xarray might be space inefficient */
> > > + id = idr_alloc_cyclic(&fc->backing_files_map, fb, 1, 0, GFP_ATOMIC);
> > > + spin_unlock(&fc->lock);
> > > + idr_preload_end();
> > > +
> > > + WARN_ON_ONCE(id == 0);
> > > + return id;
> > > +}
> > > +
> > > +static struct fuse_backing *fuse_backing_id_remove(struct fuse_conn *fc,
> > > + int id)
> > > +{
> > > + struct fuse_backing *fb;
> > > +
> > > + spin_lock(&fc->lock);
> > > + fb = idr_remove(&fc->backing_files_map, id);
> > > + spin_unlock(&fc->lock);
> > > +
> > > + return fb;
> > > +}
> > > +
> > > +static int fuse_backing_id_free(int id, void *p, void *data)
> > > +{
> > > + struct fuse_backing *fb = p;
> > > +
> > > + WARN_ON_ONCE(refcount_read(&fb->count) != 1);
> > > + fuse_backing_free(fb);
> > > + return 0;
> > > +}
> > > +
> > > +void fuse_backing_files_free(struct fuse_conn *fc)
> > > +{
> > > + idr_for_each(&fc->backing_files_map, fuse_backing_id_free, NULL);
> > > + idr_destroy(&fc->backing_files_map);
> > > +}
> > > +
> > > +int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> > > +{
> > > + struct file *file;
> > > + struct super_block *backing_sb;
> > > + struct fuse_backing *fb = NULL;
> > > + int res;
> > > +
> > > + pr_debug("%s: fd=%d flags=0x%x\n", __func__, map->fd, map->flags);
> > > +
> > > + /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> > > + res = -EPERM;
> > > + if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
> > > + goto out;
> > > +
> > > + res = -EINVAL;
> > > + if (map->flags || map->padding)
> > > + goto out;
> > > +
> > > + file = fget_raw(map->fd);
> > > + res = -EBADF;
> > > + if (!file)
> > > + goto out;
> > > +
> > > + backing_sb = file_inode(file)->i_sb;
> > > + res = -ELOOP;
> > > + if (backing_sb->s_stack_depth >= fc->max_stack_depth)
> > > + goto out_fput;
> > > +
> > > + fb = kmalloc(sizeof(struct fuse_backing), GFP_KERNEL);
> > > + res = -ENOMEM;
> > > + if (!fb)
> > > + goto out_fput;
> > > +
> > > + fb->file = file;
> > > + fb->cred = prepare_creds();
> > > + refcount_set(&fb->count, 1);
> > > +
> > > + res = fuse_backing_id_alloc(fc, fb);
> > > + if (res < 0) {
> > > + fuse_backing_free(fb);
> > > + fb = NULL;
> > > + }
> > > +
> > > +out:
> > > + pr_debug("%s: fb=0x%p, ret=%i\n", __func__, fb, res);
> > > +
> > > + return res;
> > > +
> > > +out_fput:
> > > + fput(file);
> > > + goto out;
> > > +}
> > > +
> > > +int fuse_backing_close(struct fuse_conn *fc, int backing_id)
> > > +{
> > > + struct fuse_backing *fb = NULL;
> > > + int err;
> > > +
> > > + pr_debug("%s: backing_id=%d\n", __func__, backing_id);
> > > +
> > > + /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> > > + err = -EPERM;
> > > + if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
> > > + goto out;
> > > +
> > > + err = -EINVAL;
> > > + if (backing_id <= 0)
> > > + goto out;
> > > +
> > > + err = -ENOENT;
> > > + fb = fuse_backing_id_remove(fc, backing_id);
> > > + if (!fb)
> > > + goto out;
> > > +
> > > + fuse_backing_put(fb);
> > > + err = 0;
> > > +out:
> > > + pr_debug("%s: fb=0x%p, err=%i\n", __func__, fb, err);
> > > +
> > > + return err;
> > > +}
> > > +
> > > +struct fuse_backing *fuse_backing_lookup(struct fuse_conn *fc, int backing_id)
> > > +{
> > > + struct fuse_backing *fb;
> > > +
> > > + rcu_read_lock();
> > > + fb = idr_find(&fc->backing_files_map, backing_id);
> > > + fb = fuse_backing_get(fb);
> > > + rcu_read_unlock();
> > > +
> > > + return fb;
> > > +}
> > > diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> > > index 607ef735ad4ab3..e0b8d885bc81f3 100644
> > > --- a/fs/fuse/passthrough.c
> > > +++ b/fs/fuse/passthrough.c
> > > @@ -144,158 +144,6 @@ ssize_t fuse_passthrough_mmap(struct file *file, struct vm_area_struct *vma)
> > > return backing_file_mmap(backing_file, vma, &ctx);
> > > }
> > >
> > > -struct fuse_backing *fuse_backing_get(struct fuse_backing *fb)
> > > -{
> > > - if (fb && refcount_inc_not_zero(&fb->count))
> > > - return fb;
> > > - return NULL;
> > > -}
> > > -
> > > -static void fuse_backing_free(struct fuse_backing *fb)
> > > -{
> > > - pr_debug("%s: fb=0x%p\n", __func__, fb);
> > > -
> > > - if (fb->file)
> > > - fput(fb->file);
> > > - put_cred(fb->cred);
> > > - kfree_rcu(fb, rcu);
> > > -}
> > > -
> > > -void fuse_backing_put(struct fuse_backing *fb)
> > > -{
> > > - if (fb && refcount_dec_and_test(&fb->count))
> > > - fuse_backing_free(fb);
> > > -}
> > > -
> > > -void fuse_backing_files_init(struct fuse_conn *fc)
> > > -{
> > > - idr_init(&fc->backing_files_map);
> > > -}
> > > -
> > > -static int fuse_backing_id_alloc(struct fuse_conn *fc, struct fuse_backing *fb)
> > > -{
> > > - int id;
> > > -
> > > - idr_preload(GFP_KERNEL);
> > > - spin_lock(&fc->lock);
> > > - /* FIXME: xarray might be space inefficient */
> > > - id = idr_alloc_cyclic(&fc->backing_files_map, fb, 1, 0, GFP_ATOMIC);
> > > - spin_unlock(&fc->lock);
> > > - idr_preload_end();
> > > -
> > > - WARN_ON_ONCE(id == 0);
> > > - return id;
> > > -}
> > > -
> > > -static struct fuse_backing *fuse_backing_id_remove(struct fuse_conn *fc,
> > > - int id)
> > > -{
> > > - struct fuse_backing *fb;
> > > -
> > > - spin_lock(&fc->lock);
> > > - fb = idr_remove(&fc->backing_files_map, id);
> > > - spin_unlock(&fc->lock);
> > > -
> > > - return fb;
> > > -}
> > > -
> > > -static int fuse_backing_id_free(int id, void *p, void *data)
> > > -{
> > > - struct fuse_backing *fb = p;
> > > -
> > > - WARN_ON_ONCE(refcount_read(&fb->count) != 1);
> > > - fuse_backing_free(fb);
> > > - return 0;
> > > -}
> > > -
> > > -void fuse_backing_files_free(struct fuse_conn *fc)
> > > -{
> > > - idr_for_each(&fc->backing_files_map, fuse_backing_id_free, NULL);
> > > - idr_destroy(&fc->backing_files_map);
> > > -}
> > > -
> > > -int fuse_backing_open(struct fuse_conn *fc, struct fuse_backing_map *map)
> > > -{
> > > - struct file *file;
> > > - struct super_block *backing_sb;
> > > - struct fuse_backing *fb = NULL;
> > > - int res;
> > > -
> > > - pr_debug("%s: fd=%d flags=0x%x\n", __func__, map->fd, map->flags);
> > > -
> > > - /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> > > - res = -EPERM;
> > > - if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
> > > - goto out;
> > > -
> > > - res = -EINVAL;
> > > - if (map->flags || map->padding)
> > > - goto out;
> > > -
> > > - file = fget_raw(map->fd);
> > > - res = -EBADF;
> > > - if (!file)
> > > - goto out;
> > > -
> > > - backing_sb = file_inode(file)->i_sb;
> > > - res = -ELOOP;
> > > - if (backing_sb->s_stack_depth >= fc->max_stack_depth)
> > > - goto out_fput;
> > > -
> > > - fb = kmalloc(sizeof(struct fuse_backing), GFP_KERNEL);
> > > - res = -ENOMEM;
> > > - if (!fb)
> > > - goto out_fput;
> > > -
> > > - fb->file = file;
> > > - fb->cred = prepare_creds();
> > > - refcount_set(&fb->count, 1);
> > > -
> > > - res = fuse_backing_id_alloc(fc, fb);
> > > - if (res < 0) {
> > > - fuse_backing_free(fb);
> > > - fb = NULL;
> > > - }
> > > -
> > > -out:
> > > - pr_debug("%s: fb=0x%p, ret=%i\n", __func__, fb, res);
> > > -
> > > - return res;
> > > -
> > > -out_fput:
> > > - fput(file);
> > > - goto out;
> > > -}
> > > -
> > > -int fuse_backing_close(struct fuse_conn *fc, int backing_id)
> > > -{
> > > - struct fuse_backing *fb = NULL;
> > > - int err;
> > > -
> > > - pr_debug("%s: backing_id=%d\n", __func__, backing_id);
> > > -
> > > - /* TODO: relax CAP_SYS_ADMIN once backing files are visible to lsof */
> > > - err = -EPERM;
> > > - if (!fc->passthrough || !capable(CAP_SYS_ADMIN))
> > > - goto out;
> > > -
> > > - err = -EINVAL;
> > > - if (backing_id <= 0)
> > > - goto out;
> > > -
> > > - err = -ENOENT;
> > > - fb = fuse_backing_id_remove(fc, backing_id);
> > > - if (!fb)
> > > - goto out;
> > > -
> > > - fuse_backing_put(fb);
> > > - err = 0;
> > > -out:
> > > - pr_debug("%s: fb=0x%p, err=%i\n", __func__, fb, err);
> > > -
> > > - return err;
> > > -}
> > > -
> > > /*
> > > * Setup passthrough to a backing file.
> > > *
> > > @@ -315,12 +163,8 @@ struct fuse_backing *fuse_passthrough_open(struct file *file,
> > > if (backing_id <= 0)
> > > goto out;
> > >
> > > - rcu_read_lock();
> > > - fb = idr_find(&fc->backing_files_map, backing_id);
> > > - fb = fuse_backing_get(fb);
> > > - rcu_read_unlock();
> > > -
> > > err = -ENOENT;
> > > + fb = fuse_backing_lookup(fc, backing_id);
> > > if (!fb)
> > > goto out;
> > >
> > >
> > >
>
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 1/1] libfuse: don't put HAVE_STATX in a public header
2025-08-21 1:01 ` [PATCH 1/1] libfuse: don't put HAVE_STATX in a public header Darrick J. Wong
@ 2025-08-21 21:39 ` Bernd Schubert
2025-08-21 22:27 ` Darrick J. Wong
2025-08-22 0:33 ` Joanne Koong
1 sibling, 1 reply; 210+ messages in thread
From: Bernd Schubert @ 2025-08-21 21:39 UTC (permalink / raw)
To: Darrick J. Wong, bschubert
Cc: John, joannelkoong, linux-fsdevel, miklos, neal
On 8/21/25 03:01, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> fuse.h and fuse_lowlevel.h are public headers, don't expose internal
> build system config variables to downstream clients. This can also lead
> to function pointer ordering issues if (say) libfuse gets built with
> HAVE_STATX but the client program doesn't define a HAVE_STATX.
>
> Get rid of the conditionals in the public header files to fix this.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
> include/fuse.h | 2 --
> include/fuse_lowlevel.h | 2 --
> example/memfs_ll.cc | 2 +-
> example/passthrough.c | 2 +-
> example/passthrough_fh.c | 2 +-
> example/passthrough_ll.c | 2 +-
> 6 files changed, 4 insertions(+), 8 deletions(-)
>
>
> diff --git a/include/fuse.h b/include/fuse.h
> index 06feacb070fbfb..209102651e9454 100644
> --- a/include/fuse.h
> +++ b/include/fuse.h
> @@ -854,7 +854,6 @@ struct fuse_operations {
> */
> off_t (*lseek) (const char *, off_t off, int whence, struct fuse_file_info *);
>
> -#ifdef HAVE_STATX
> /**
> * Get extended file attributes.
> *
> @@ -865,7 +864,6 @@ struct fuse_operations {
> */
> int (*statx)(const char *path, int flags, int mask, struct statx *stxbuf,
> struct fuse_file_info *fi);
> -#endif
> };
>
> /** Extra context that may be needed by some filesystems
> diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
> index 844ee710295973..8d87be413bfe37 100644
> --- a/include/fuse_lowlevel.h
> +++ b/include/fuse_lowlevel.h
> @@ -1327,7 +1327,6 @@ struct fuse_lowlevel_ops {
> void (*tmpfile) (fuse_req_t req, fuse_ino_t parent,
> mode_t mode, struct fuse_file_info *fi);
>
> -#ifdef HAVE_STATX
> /**
> * Get extended file attributes.
> *
> @@ -1343,7 +1342,6 @@ struct fuse_lowlevel_ops {
> */
> void (*statx)(fuse_req_t req, fuse_ino_t ino, int flags, int mask,
> struct fuse_file_info *fi);
> -#endif
> };
>
> /**
> diff --git a/example/memfs_ll.cc b/example/memfs_ll.cc
> index edda34b4e43d39..7055a434a439cd 100644
> --- a/example/memfs_ll.cc
> +++ b/example/memfs_ll.cc
> @@ -6,7 +6,7 @@
> See the file GPL2.txt.
> */
>
> -#define FUSE_USE_VERSION 317
> +#define FUSE_USE_VERSION FUSE_MAKE_VERSION(3, 18)
>
> #include <algorithm>
> #include <stdio.h>
> diff --git a/example/passthrough.c b/example/passthrough.c
> index fdaa19e331a17d..1f09c2dc05df1e 100644
> --- a/example/passthrough.c
> +++ b/example/passthrough.c
> @@ -23,7 +23,7 @@
> */
>
>
> -#define FUSE_USE_VERSION 31
> +#define FUSE_USE_VERSION FUSE_MAKE_VERSION(3, 18)
>
> #define _GNU_SOURCE
>
> diff --git a/example/passthrough_fh.c b/example/passthrough_fh.c
> index 0d4fb5bd4df0d6..6403fbb74c7759 100644
> --- a/example/passthrough_fh.c
> +++ b/example/passthrough_fh.c
> @@ -23,7 +23,7 @@
> * \include passthrough_fh.c
> */
>
> -#define FUSE_USE_VERSION 31
> +#define FUSE_USE_VERSION FUSE_MAKE_VERSION(3, 18)
>
> #define _GNU_SOURCE
>
> diff --git a/example/passthrough_ll.c b/example/passthrough_ll.c
> index 5ca6efa2300abe..8a5ac2e9226b59 100644
> --- a/example/passthrough_ll.c
> +++ b/example/passthrough_ll.c
> @@ -35,7 +35,7 @@
> */
>
> #define _GNU_SOURCE
> -#define FUSE_USE_VERSION FUSE_MAKE_VERSION(3, 12)
> +#define FUSE_USE_VERSION FUSE_MAKE_VERSION(3, 18)
>
> #include <fuse_lowlevel.h>
> #include <unistd.h>
>
Thanks, I'm going to apply it to libfuse tomorrow. I think the version
update in the examples is not strictly needed, but doesn't hurt either.
Thanks,
Bernd
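
To illustrate the layout problem the commit message describes, here is a
minimal sketch. It assumes a hypothetical ops table with one member
declared after the conditional statx pointer. The struct and member names
below are made up for illustration and are not the real libfuse
definitions.

```c
#include <stddef.h>

/* Hypothetical stand-ins for an ops table; not the real libfuse structs. */
struct ops_as_built {		/* library side: compiled with HAVE_STATX */
	int (*lseek)(void);
	int (*statx)(void);
	int (*later_op)(void);	/* hypothetical member added after statx */
};

struct ops_as_seen {		/* client side: HAVE_STATX not defined */
	int (*lseek)(void);
	int (*later_op)(void);
};

/* Offset at which the library expects to find later_op. */
static size_t library_offset(void)
{
	return offsetof(struct ops_as_built, later_op);
}

/* Offset at which the client actually stored later_op. */
static size_t client_offset(void)
{
	return offsetof(struct ops_as_seen, later_op);
}

/* Nonzero when the two sides disagree about offsets or total size. */
static int layouts_disagree(void)
{
	return library_offset() != client_offset() ||
	       sizeof(struct ops_as_built) != sizeof(struct ops_as_seen);
}
```

In this sketch the library would read the client's later_op pointer out of
the slot it believes holds statx, which is why removing the conditional
from the public headers avoids the mismatch.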
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCHSET RFC v4 4/4] libfuse: implement syncfs
2025-08-21 0:49 ` [PATCHSET RFC v4 4/4] libfuse: implement syncfs Darrick J. Wong
2025-08-21 1:07 ` [PATCH 1/2] libfuse: wire up FUSE_SYNCFS to the low level library Darrick J. Wong
2025-08-21 1:07 ` [PATCH 2/2] libfuse: add syncfs support to the upper library Darrick J. Wong
@ 2025-08-21 21:41 ` Bernd Schubert
2025-08-21 22:29 ` Darrick J. Wong
2 siblings, 1 reply; 210+ messages in thread
From: Bernd Schubert @ 2025-08-21 21:41 UTC (permalink / raw)
To: Darrick J. Wong, bschubert
Cc: John, joannelkoong, linux-fsdevel, miklos, neal
On 8/21/25 02:49, Darrick J. Wong wrote:
> Hi all,
>
> Implement syncfs in libfuse so that iomap-compatible fuse servers can
> receive syncfs commands.
>
> If you're going to start using this code, I strongly recommend pulling
> from my git trees, which are linked below.
>
> With a bit of luck, this should all go splendidly.
> Comments and questions are, as always, welcome.
>
> --D
>
> kernel git tree:
> https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-attrs
> ---
> Commits in this patchset:
> * libfuse: wire up FUSE_SYNCFS to the low level library
> * libfuse: add syncfs support to the upper library
> ---
> include/fuse.h | 5 +++++
> include/fuse_lowlevel.h | 16 ++++++++++++++++
> lib/fuse.c | 31 +++++++++++++++++++++++++++++++
> lib/fuse_lowlevel.c | 19 +++++++++++++++++++
> 4 files changed, 71 insertions(+)
>
Thank you, both look good to me. This is independent of iomap, so can we
apply it immediately?
Thanks,
Bernd
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 7/7] fuse: enable FUSE_SYNCFS for all servers
2025-08-21 0:52 ` [PATCH 7/7] fuse: enable FUSE_SYNCFS for all servers Darrick J. Wong
@ 2025-08-21 22:18 ` Joanne Koong
2025-08-21 22:28 ` Darrick J. Wong
0 siblings, 1 reply; 210+ messages in thread
From: Joanne Koong @ 2025-08-21 22:18 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: miklos, bernd, neal, John, linux-fsdevel
On Wed, Aug 20, 2025 at 5:52 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Turn on syncfs for all fuse servers so that the ones in the know can
> flush cached intermediate data and logs to disk.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
> fs/fuse/inode.c | 1 +
> 1 file changed, 1 insertion(+)
>
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 463879830ecf34..b05510799f93e1 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1814,6 +1814,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
> if (!sb_set_blocksize(sb, ctx->blksize))
> goto err;
> #endif
> + fc->sync_fs = 1;
AFAICT, this enables syncfs only for fuseblk servers. Is this what you
intended?
Thanks,
Joanne
> } else {
> sb->s_blocksize = PAGE_SIZE;
> sb->s_blocksize_bits = PAGE_SHIFT;
>
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 1/1] libfuse: don't put HAVE_STATX in a public header
2025-08-21 21:39 ` Bernd Schubert
@ 2025-08-21 22:27 ` Darrick J. Wong
0 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 22:27 UTC (permalink / raw)
To: Bernd Schubert; +Cc: bschubert, John, joannelkoong, linux-fsdevel, miklos, neal
On Thu, Aug 21, 2025 at 11:39:25PM +0200, Bernd Schubert wrote:
>
>
> On 8/21/25 03:01, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > fuse.h and fuse_lowlevel.h are public headers, don't expose internal
> > build system config variables to downstream clients. This can also lead
> > to function pointer ordering issues if (say) libfuse gets built with
> > HAVE_STATX but the client program doesn't define a HAVE_STATX.
> >
> > Get rid of the conditionals in the public header files to fix this.
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> > include/fuse.h | 2 --
> > include/fuse_lowlevel.h | 2 --
> > example/memfs_ll.cc | 2 +-
> > example/passthrough.c | 2 +-
> > example/passthrough_fh.c | 2 +-
> > example/passthrough_ll.c | 2 +-
> > 6 files changed, 4 insertions(+), 8 deletions(-)
> >
> >
> > diff --git a/include/fuse.h b/include/fuse.h
> > index 06feacb070fbfb..209102651e9454 100644
> > --- a/include/fuse.h
> > +++ b/include/fuse.h
> > @@ -854,7 +854,6 @@ struct fuse_operations {
> > */
> > off_t (*lseek) (const char *, off_t off, int whence, struct fuse_file_info *);
> >
> > -#ifdef HAVE_STATX
> > /**
> > * Get extended file attributes.
> > *
> > @@ -865,7 +864,6 @@ struct fuse_operations {
> > */
> > int (*statx)(const char *path, int flags, int mask, struct statx *stxbuf,
> > struct fuse_file_info *fi);
> > -#endif
> > };
> >
> > /** Extra context that may be needed by some filesystems
> > diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
> > index 844ee710295973..8d87be413bfe37 100644
> > --- a/include/fuse_lowlevel.h
> > +++ b/include/fuse_lowlevel.h
> > @@ -1327,7 +1327,6 @@ struct fuse_lowlevel_ops {
> > void (*tmpfile) (fuse_req_t req, fuse_ino_t parent,
> > mode_t mode, struct fuse_file_info *fi);
> >
> > -#ifdef HAVE_STATX
> > /**
> > * Get extended file attributes.
> > *
> > @@ -1343,7 +1342,6 @@ struct fuse_lowlevel_ops {
> > */
> > void (*statx)(fuse_req_t req, fuse_ino_t ino, int flags, int mask,
> > struct fuse_file_info *fi);
> > -#endif
> > };
> >
> > /**
> > diff --git a/example/memfs_ll.cc b/example/memfs_ll.cc
> > index edda34b4e43d39..7055a434a439cd 100644
> > --- a/example/memfs_ll.cc
> > +++ b/example/memfs_ll.cc
> > @@ -6,7 +6,7 @@
> > See the file GPL2.txt.
> > */
> >
> > -#define FUSE_USE_VERSION 317
> > +#define FUSE_USE_VERSION FUSE_MAKE_VERSION(3, 18)
> >
> > #include <algorithm>
> > #include <stdio.h>
> > diff --git a/example/passthrough.c b/example/passthrough.c
> > index fdaa19e331a17d..1f09c2dc05df1e 100644
> > --- a/example/passthrough.c
> > +++ b/example/passthrough.c
> > @@ -23,7 +23,7 @@
> > */
> >
> >
> > -#define FUSE_USE_VERSION 31
> > +#define FUSE_USE_VERSION FUSE_MAKE_VERSION(3, 18)
> >
> > #define _GNU_SOURCE
> >
> > diff --git a/example/passthrough_fh.c b/example/passthrough_fh.c
> > index 0d4fb5bd4df0d6..6403fbb74c7759 100644
> > --- a/example/passthrough_fh.c
> > +++ b/example/passthrough_fh.c
> > @@ -23,7 +23,7 @@
> > * \include passthrough_fh.c
> > */
> >
> > -#define FUSE_USE_VERSION 31
> > +#define FUSE_USE_VERSION FUSE_MAKE_VERSION(3, 18)
> >
> > #define _GNU_SOURCE
> >
> > diff --git a/example/passthrough_ll.c b/example/passthrough_ll.c
> > index 5ca6efa2300abe..8a5ac2e9226b59 100644
> > --- a/example/passthrough_ll.c
> > +++ b/example/passthrough_ll.c
> > @@ -35,7 +35,7 @@
> > */
> >
> > #define _GNU_SOURCE
> > -#define FUSE_USE_VERSION FUSE_MAKE_VERSION(3, 12)
> > +#define FUSE_USE_VERSION FUSE_MAKE_VERSION(3, 18)
> >
> > #include <fuse_lowlevel.h>
> > #include <unistd.h>
> >
>
>
> Thanks, I'm going to apply it to libfuse tomorrow. I think the version
> update in the examples is not strictly needed, but doesn't hurt either.
Thank you!
Yeah, I don't think the example updates are strictly necessary either,
but the examples might as well expose the full current API to someone who
wants to copy-paste them into a new server.
--D
>
> Thanks,
> Bernd
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 7/7] fuse: enable FUSE_SYNCFS for all servers
2025-08-21 22:18 ` Joanne Koong
@ 2025-08-21 22:28 ` Darrick J. Wong
2025-08-21 22:54 ` Bernd Schubert
0 siblings, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 22:28 UTC (permalink / raw)
To: Joanne Koong; +Cc: miklos, bernd, neal, John, linux-fsdevel
On Thu, Aug 21, 2025 at 03:18:11PM -0700, Joanne Koong wrote:
> On Wed, Aug 20, 2025 at 5:52 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Turn on syncfs for all fuse servers so that the ones in the know can
> > flush cached intermediate data and logs to disk.
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> > fs/fuse/inode.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> >
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index 463879830ecf34..b05510799f93e1 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -1814,6 +1814,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
> > if (!sb_set_blocksize(sb, ctx->blksize))
> > goto err;
> > #endif
> > + fc->sync_fs = 1;
>
> AFAICT, this enables syncfs only for fuseblk servers. Is this what you
> intended?
I meant to say for all fuseblk servers, but TBH I can't see why you
wouldn't want to enable it for non-fuseblk servers too?
(Maybe I was being overly cautious ;))
--D
>
> Thanks,
> Joanne
> > } else {
> > sb->s_blocksize = PAGE_SIZE;
> > sb->s_blocksize_bits = PAGE_SHIFT;
> >
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCHSET RFC v4 4/4] libfuse: implement syncfs
2025-08-21 21:41 ` [PATCHSET RFC v4 4/4] libfuse: implement syncfs Bernd Schubert
@ 2025-08-21 22:29 ` Darrick J. Wong
0 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-21 22:29 UTC (permalink / raw)
To: Bernd Schubert; +Cc: bschubert, John, joannelkoong, linux-fsdevel, miklos, neal
On Thu, Aug 21, 2025 at 11:41:54PM +0200, Bernd Schubert wrote:
>
>
> On 8/21/25 02:49, Darrick J. Wong wrote:
> > Hi all,
> >
> > Implement syncfs in libfuse so that iomap-compatible fuse servers can
> > receive syncfs commands.
> >
> > If you're going to start using this code, I strongly recommend pulling
> > from my git trees, which are linked below.
> >
> > With a bit of luck, this should all go splendidly.
> > Comments and questions are, as always, welcome.
> >
> > --D
> >
> > kernel git tree:
> > https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-attrs
> > ---
> > Commits in this patchset:
> > * libfuse: wire up FUSE_SYNCFS to the low level library
> > * libfuse: add syncfs support to the upper library
> > ---
> > include/fuse.h | 5 +++++
> > include/fuse_lowlevel.h | 16 ++++++++++++++++
> > lib/fuse.c | 31 +++++++++++++++++++++++++++++++
> > lib/fuse_lowlevel.c | 19 +++++++++++++++++++
> > 4 files changed, 71 insertions(+)
> >
>
> Thank you, both look good to me. This is independent of io-map - we can
> apply it immediately?
Yes, please! Note that we'll have to decide if the kernel is going to
enable sending syncfs for all servers, not just the virtiofs ones.
--D
>
> Thanks,
> Bernd
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 7/7] fuse: enable FUSE_SYNCFS for all servers
2025-08-21 22:28 ` Darrick J. Wong
@ 2025-08-21 22:54 ` Bernd Schubert
2025-08-21 23:31 ` Joanne Koong
2025-08-22 11:32 ` Shachar Sharon
0 siblings, 2 replies; 210+ messages in thread
From: Bernd Schubert @ 2025-08-21 22:54 UTC (permalink / raw)
To: Darrick J. Wong, Joanne Koong; +Cc: miklos, neal, John, linux-fsdevel
On 8/22/25 00:28, Darrick J. Wong wrote:
> On Thu, Aug 21, 2025 at 03:18:11PM -0700, Joanne Koong wrote:
>> On Wed, Aug 20, 2025 at 5:52 PM Darrick J. Wong <djwong@kernel.org> wrote:
>>>
>>> From: Darrick J. Wong <djwong@kernel.org>
>>>
>>> Turn on syncfs for all fuse servers so that the ones in the know can
>>> flush cached intermediate data and logs to disk.
>>>
>>> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
>>> ---
>>> fs/fuse/inode.c | 1 +
>>> 1 file changed, 1 insertion(+)
>>>
>>>
>>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>>> index 463879830ecf34..b05510799f93e1 100644
>>> --- a/fs/fuse/inode.c
>>> +++ b/fs/fuse/inode.c
>>> @@ -1814,6 +1814,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
>>> if (!sb_set_blocksize(sb, ctx->blksize))
>>> goto err;
>>> #endif
>>> + fc->sync_fs = 1;
>>
>> AFAICT, this enables syncfs only for fuseblk servers. Is this what you
>> intended?
>
> I meant to say for all fuseblk servers, but TBH I can't see why you
> wouldn't want to enable it for non-fuseblk servers too?
>
> (Maybe I was being overly cautious ;))
Just checked, the initial commit message has
<quote 2d82ab251ef0f6e7716279b04e9b5a01a86ca530>
Note that such an operation allows the file server to DoS sync(). Since a
typical FUSE file server is an untrusted piece of software running in
userspace, this is disabled by default. Only enable it with virtiofs for
now since virtiofsd is supposedly trusted by the guest kernel.
</quote>
With that we could at least enable it for all privileged servers? And for
non-privileged servers this could be async?
Thanks,
Bernd
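
The policy floated above could be sketched roughly as follows. This is
plain userspace C meant only to show the decision, not kernel code: the
enum, the struct, and pick_syncfs_mode() are all hypothetical names, and
"trusted" stands in for whatever test the kernel would use to tell a
privileged server (or virtiofsd) from an unprivileged one.

```c
#include <stdbool.h>

/* Hypothetical policy values; none of these names exist in the kernel. */
enum syncfs_mode {
	SYNCFS_SKIP,	/* never send FUSE_SYNCFS to the server */
	SYNCFS_ASYNC,	/* send it, but do not make sync() wait for the reply */
	SYNCFS_WAIT,	/* send it and wait for the reply */
};

/* Hypothetical description of a fuse server, for illustration only. */
struct server_info {
	bool supports_syncfs;	/* server implements the request at all */
	bool trusted;		/* e.g. virtiofsd, or a privileged server */
};

/*
 * An untrusted server that never replies must not be able to stall
 * sync(), so only trusted servers get the blocking variant.
 */
static enum syncfs_mode pick_syncfs_mode(const struct server_info *s)
{
	if (!s->supports_syncfs)
		return SYNCFS_SKIP;
	return s->trusted ? SYNCFS_WAIT : SYNCFS_ASYNC;
}
```

The point of the split is that the DoS concern quoted from the original
commit applies only to the waiting case. A fire-and-forget request still
lets a cooperative unprivileged server flush its state.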
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 7/7] fuse: enable FUSE_SYNCFS for all servers
2025-08-21 22:54 ` Bernd Schubert
@ 2025-08-21 23:31 ` Joanne Koong
2025-08-22 11:32 ` Shachar Sharon
1 sibling, 0 replies; 210+ messages in thread
From: Joanne Koong @ 2025-08-21 23:31 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Darrick J. Wong, miklos, neal, John, linux-fsdevel
On Thu, Aug 21, 2025 at 3:54 PM Bernd Schubert <bernd@bsbernd.com> wrote:
> On 8/22/25 00:28, Darrick J. Wong wrote:
> > On Thu, Aug 21, 2025 at 03:18:11PM -0700, Joanne Koong wrote:
> >> On Wed, Aug 20, 2025 at 5:52 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >>>
> >>> From: Darrick J. Wong <djwong@kernel.org>
> >>>
> >>> Turn on syncfs for all fuse servers so that the ones in the know can
> >>> flush cached intermediate data and logs to disk.
> >>>
> >>> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> >>> ---
> >>> fs/fuse/inode.c | 1 +
> >>> 1 file changed, 1 insertion(+)
> >>>
> >>>
> >>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> >>> index 463879830ecf34..b05510799f93e1 100644
> >>> --- a/fs/fuse/inode.c
> >>> +++ b/fs/fuse/inode.c
> >>> @@ -1814,6 +1814,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
> >>> if (!sb_set_blocksize(sb, ctx->blksize))
> >>> goto err;
> >>> #endif
> >>> + fc->sync_fs = 1;
> >>
> >> AFAICT, this enables syncfs only for fuseblk servers. Is this what you
> >> intended?
> >
> > I meant to say for all fuseblk servers, but TBH I can't see why you
> > wouldn't want to enable it for non-fuseblk servers too?
> >
> > (Maybe I was being overly cautious ;))
>
> Just checked, the initial commit message has
>
>
> <quote 2d82ab251ef0f6e7716279b04e9b5a01a86ca530>
> Note that such an operation allows the file server to DoS sync(). Since a
> typical FUSE file server is an untrusted piece of software running in
> userspace, this is disabled by default. Only enable it with virtiofs for
> now since virtiofsd is supposedly trusted by the guest kernel.
> </quote>
>
>
> With that we could at least enable for all privileged servers? And for
> non-privileged this could be an async?
This sounds reasonable to me.
Thanks,
Joanne
>
>
> Thanks,
> Bernd
>
>
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
2025-08-21 0:51 ` [PATCH 4/7] fuse: implement file attributes mask for statx Darrick J. Wong
@ 2025-08-22 0:01 ` Joanne Koong
2025-08-26 18:56 ` Darrick J. Wong
2025-08-29 6:24 ` Miklos Szeredi
1 sibling, 1 reply; 210+ messages in thread
From: Joanne Koong @ 2025-08-22 0:01 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: miklos, bernd, neal, John, linux-fsdevel
On Wed, Aug 20, 2025 at 5:51 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Actually copy the attributes/attributes_mask from userspace.
This makes sense to me.
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
> fs/fuse/fuse_i.h | 4 ++++
> fs/fuse/dir.c | 4 ++++
> fs/fuse/inode.c | 3 +++
> 3 files changed, 11 insertions(+)
>
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 3/7] fuse: capture the unique id of fuse commands being sent
2025-08-21 0:51 ` [PATCH 3/7] fuse: capture the unique id of fuse commands being sent Darrick J. Wong
@ 2025-08-22 0:15 ` Joanne Koong
2025-08-26 18:52 ` Darrick J. Wong
0 siblings, 1 reply; 210+ messages in thread
From: Joanne Koong @ 2025-08-22 0:15 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: miklos, bernd, neal, John, linux-fsdevel
On Wed, Aug 20, 2025 at 5:51 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> The fuse_request_{send,end} tracepoints capture the value of
> req->in.h.unique in the trace output. It would be really nice if we
> could use this to match a request to its response for debugging and
> latency analysis, but the call to trace_fuse_request_send occurs before
> the unique id has been set:
>
> fuse_request_send: connection 8388608 req 0 opcode 1 (FUSE_LOOKUP) len 107
> fuse_request_end: connection 8388608 req 6 len 16 error -2
>
> Move the callsites to trace_fuse_request_send to after the unique id has
> been set, or right before we decide to cancel a request having not set
> one.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
> fs/fuse/dev.c | 6 +++++-
> fs/fuse/dev_uring.c | 8 +++++++-
I think we'll also need to do the equivalent for virtio.
> 2 files changed, 12 insertions(+), 2 deletions(-)
>
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 6f2b277973ca7d..05d6e7779387a4 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -376,10 +376,15 @@ static void fuse_dev_queue_req(struct fuse_iqueue *fiq, struct fuse_req *req)
> if (fiq->connected) {
> if (req->in.h.opcode != FUSE_NOTIFY_REPLY)
> req->in.h.unique = fuse_get_unique_locked(fiq);
> +
> + /* tracepoint captures in.h.unique */
> + trace_fuse_request_send(req);
> +
> list_add_tail(&req->list, &fiq->pending);
> fuse_dev_wake_and_unlock(fiq);
> } else {
> spin_unlock(&fiq->lock);
> + trace_fuse_request_send(req);
Should this request still show up in the trace even though the request
doesn't actually get sent to the server? imo that makes it
misleading/confusing unless the trace also indicates -ENOTCONN.
> req->out.h.error = -ENOTCONN;
> clear_bit(FR_PENDING, &req->flags);
> fuse_request_end(req);
> @@ -398,7 +403,6 @@ static void fuse_send_one(struct fuse_iqueue *fiq, struct fuse_req *req)
> req->in.h.len = sizeof(struct fuse_in_header) +
> fuse_len_args(req->args->in_numargs,
> (struct fuse_arg *) req->args->in_args);
> - trace_fuse_request_send(req);
> fiq->ops->send_req(fiq, req);
> }
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 249b210becb1cc..14f263d4419392 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -7,6 +7,7 @@
> #include "fuse_i.h"
> #include "dev_uring_i.h"
> #include "fuse_dev_i.h"
> +#include "fuse_trace.h"
>
> #include <linux/fs.h>
> #include <linux/io_uring/cmd.h>
> @@ -1265,12 +1266,17 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
>
> err = -EINVAL;
> queue = fuse_uring_task_to_queue(ring);
> - if (!queue)
> + if (!queue) {
> + trace_fuse_request_send(req);
Same question here.
Thanks,
Joanne
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 1/1] libfuse: don't put HAVE_STATX in a public header
2025-08-21 1:01 ` [PATCH 1/1] libfuse: don't put HAVE_STATX in a public header Darrick J. Wong
2025-08-21 21:39 ` Bernd Schubert
@ 2025-08-22 0:33 ` Joanne Koong
2025-08-22 12:54 ` Bernd Schubert
1 sibling, 1 reply; 210+ messages in thread
From: Joanne Koong @ 2025-08-22 0:33 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: bschubert, John, bernd, linux-fsdevel, miklos, neal
On Wed, Aug 20, 2025 at 6:01 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> fuse.h and fuse_lowlevel.h are public headers, don't expose internal
> build system config variables to downstream clients. This can also lead
> to function pointer ordering issues if (say) libfuse gets built with
> HAVE_STATX but the client program doesn't define a HAVE_STATX.
>
> Get rid of the conditionals in the public header files to fix this.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
> include/fuse.h | 2 --
> include/fuse_lowlevel.h | 2 --
> example/memfs_ll.cc | 2 +-
> example/passthrough.c | 2 +-
> example/passthrough_fh.c | 2 +-
> example/passthrough_ll.c | 2 +-
> 6 files changed, 4 insertions(+), 8 deletions(-)
>
>
> diff --git a/include/fuse.h b/include/fuse.h
> index 06feacb070fbfb..209102651e9454 100644
> --- a/include/fuse.h
> +++ b/include/fuse.h
> @@ -854,7 +854,6 @@ struct fuse_operations {
> */
> off_t (*lseek) (const char *, off_t off, int whence, struct fuse_file_info *);
>
> -#ifdef HAVE_STATX
> /**
> * Get extended file attributes.
> *
> @@ -865,7 +864,6 @@ struct fuse_operations {
> */
> int (*statx)(const char *path, int flags, int mask, struct statx *stxbuf,
> struct fuse_file_info *fi);
> -#endif
> };
Are we able to just remove this ifdef? Won't this break compilation on
old systems that don't recognize "struct statx"?
Thanks,
Joanne
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 7/7] fuse: enable FUSE_SYNCFS for all servers
2025-08-21 22:54 ` Bernd Schubert
2025-08-21 23:31 ` Joanne Koong
@ 2025-08-22 11:32 ` Shachar Sharon
2025-08-22 17:21 ` Joanne Koong
1 sibling, 1 reply; 210+ messages in thread
From: Shachar Sharon @ 2025-08-22 11:32 UTC (permalink / raw)
To: Bernd Schubert
Cc: Darrick J. Wong, Joanne Koong, miklos, neal, John, linux-fsdevel
To the best of my understanding, there are two code paths which may
yield FUSE_SYNCFS: one from user-space syscall syncfs(2) and the other
from within the kernel itself. Unfortunately, there is no way to
distinguish between the two at sb->s_op->sync_fs level, and the DoS
argument refers to the second (kernel) case. If we could somehow
propagate this info all the way down to the fuse layer then I see no
reason for preventing (non-privileged) user-space programs from
calling syncfs(2) over FUSE mounted file-systems.
Please correct me if I am wrong with my analysis.
- Shachar.
On Fri, Aug 22, 2025 at 1:57 AM Bernd Schubert <bernd@bsbernd.com> wrote:
>
>
>
> On 8/22/25 00:28, Darrick J. Wong wrote:
> > On Thu, Aug 21, 2025 at 03:18:11PM -0700, Joanne Koong wrote:
> >> On Wed, Aug 20, 2025 at 5:52 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >>>
> >>> From: Darrick J. Wong <djwong@kernel.org>
> >>>
> >>> Turn on syncfs for all fuse servers so that the ones in the know can
> >>> flush cached intermediate data and logs to disk.
> >>>
> >>> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> >>> ---
> >>> fs/fuse/inode.c | 1 +
> >>> 1 file changed, 1 insertion(+)
> >>>
> >>>
> >>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> >>> index 463879830ecf34..b05510799f93e1 100644
> >>> --- a/fs/fuse/inode.c
> >>> +++ b/fs/fuse/inode.c
> >>> @@ -1814,6 +1814,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
> >>> if (!sb_set_blocksize(sb, ctx->blksize))
> >>> goto err;
> >>> #endif
> >>> + fc->sync_fs = 1;
> >>
> >> AFAICT, this enables syncfs only for fuseblk servers. Is this what you
> >> intended?
> >
> > I meant to say for all fuseblk servers, but TBH I can't see why you
> > wouldn't want to enable it for non-fuseblk servers too?
> >
> > (Maybe I was being overly cautious ;))
>
> Just checked, the initial commit message has
>
>
> <quote 2d82ab251ef0f6e7716279b04e9b5a01a86ca530>
> Note that such an operation allows the file server to DoS sync(). Since a
> typical FUSE file server is an untrusted piece of software running in
> userspace, this is disabled by default. Only enable it with virtiofs for
> now since virtiofsd is supposedly trusted by the guest kernel.
> </quote>
>
>
> With that we could at least enable for all privileged servers? And for
> non-privileged this could be an async?
>
>
> Thanks,
> Bernd
>
>
>
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 1/1] libfuse: don't put HAVE_STATX in a public header
2025-08-22 0:33 ` Joanne Koong
@ 2025-08-22 12:54 ` Bernd Schubert
2025-08-26 19:43 ` Darrick J. Wong
0 siblings, 1 reply; 210+ messages in thread
From: Bernd Schubert @ 2025-08-22 12:54 UTC (permalink / raw)
To: Joanne Koong, Darrick J. Wong
Cc: John@groves.net, bernd@bsbernd.com, linux-fsdevel@vger.kernel.org,
miklos@szeredi.hu, neal@gompa.dev
On 8/22/25 02:33, Joanne Koong wrote:
> On Wed, Aug 20, 2025 at 6:01 PM Darrick J. Wong <djwong@kernel.org> wrote:
>>
>> From: Darrick J. Wong <djwong@kernel.org>
>>
>> fuse.h and fuse_lowlevel.h are public headers, don't expose internal
>> build system config variables to downstream clients. This can also lead
>> to function pointer ordering issues if (say) libfuse gets built with
>> HAVE_STATX but the client program doesn't define a HAVE_STATX.
>>
>> Get rid of the conditionals in the public header files to fix this.
>>
>> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
>> ---
>> include/fuse.h | 2 --
>> include/fuse_lowlevel.h | 2 --
>> example/memfs_ll.cc | 2 +-
>> example/passthrough.c | 2 +-
>> example/passthrough_fh.c | 2 +-
>> example/passthrough_ll.c | 2 +-
>> 6 files changed, 4 insertions(+), 8 deletions(-)
>>
>>
>> diff --git a/include/fuse.h b/include/fuse.h
>> index 06feacb070fbfb..209102651e9454 100644
>> --- a/include/fuse.h
>> +++ b/include/fuse.h
>> @@ -854,7 +854,6 @@ struct fuse_operations {
>> */
>> off_t (*lseek) (const char *, off_t off, int whence, struct fuse_file_info *);
>>
>> -#ifdef HAVE_STATX
>> /**
>> * Get extended file attributes.
>> *
>> @@ -865,7 +864,6 @@ struct fuse_operations {
>> */
>> int (*statx)(const char *path, int flags, int mask, struct statx *stxbuf,
>> struct fuse_file_info *fi);
>> -#endif
>> };
>
> Are we able to just remove this ifdef? Won't this break compilation on
> old systems that don't recognize "struct statx"?
Yeah, you had added forward declaration actually. Slipped through in
my review that we don't need the HAVE_STATX anymore.
We can also extend the patch a bit to remove HAVE_STATX from the public
config.
Another alternative for this patch would be to replace HAVE_STATX by
HAVE_FUSE_STATX.
The commit message is also not entirely right, as it says
> to function pointer ordering issues if (say) libfuse gets built with
> HAVE_STATX but the client program doesn't define a HAVE_STATX.
Actually not, because /usr/include/fuse3/libfuse_config.h defines HAVE_STATX.
I'm more worried that there might be a conflict of HAVE_STATX from libfuse
with HAVE_STATX from the application.
Thanks,
Bernd
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 7/7] fuse: enable FUSE_SYNCFS for all servers
2025-08-22 11:32 ` Shachar Sharon
@ 2025-08-22 17:21 ` Joanne Koong
2025-08-26 19:31 ` Darrick J. Wong
0 siblings, 1 reply; 210+ messages in thread
From: Joanne Koong @ 2025-08-22 17:21 UTC (permalink / raw)
To: synarete; +Cc: Bernd Schubert, Darrick J. Wong, miklos, neal, John,
linux-fsdevel
On Fri, Aug 22, 2025 at 4:32 AM Shachar Sharon <synarete@gmail.com> wrote:
>
> To the best of my understanding, there are two code paths which may
> yield FUSE_SYNCFS: one from user-space syscall syncfs(2) and the other
> from within the kernel itself. Unfortunately, there is no way to
> distinguish between the two at sb->s_op->sync_fs level, and the DoS
> argument refers to the second (kernel) case. If we could somehow
> propagate this info all the way down to the fuse layer then I see no
> reason for preventing (non-privileged) user-space programs from
> calling syncfs(2) over FUSE mounted file-systems.
I interpreted the DoS comment as referring to the scenario where a
userspace program calls generic sync() and if an untrusted fuse
server deliberately hangs on servicing that request then it'll hang
sync forever. I think if this only affected the syncfs() syscall then
it wouldn't be a problem since the caller is directly invoking it on a
fuse fd, but if it affects generic sync() that seems like a big issue
to me. Or at least that's my understanding of the code with
ksys_sync() -> iterate_supers(sync_fs_one_sb, &wait).
Thanks,
Joanne
>
>
> Please correct me if I am wrong with my analysis.
>
>
> - Shachar.
>
> On Fri, Aug 22, 2025 at 1:57 AM Bernd Schubert <bernd@bsbernd.com> wrote:
> >
> >
> >
> > On 8/22/25 00:28, Darrick J. Wong wrote:
> > > On Thu, Aug 21, 2025 at 03:18:11PM -0700, Joanne Koong wrote:
> > >> On Wed, Aug 20, 2025 at 5:52 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >>>
> > >>> From: Darrick J. Wong <djwong@kernel.org>
> > >>>
> > >>> Turn on syncfs for all fuse servers so that the ones in the know can
> > >>> flush cached intermediate data and logs to disk.
> > >>>
> > >>> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > >>> ---
> > >>> fs/fuse/inode.c | 1 +
> > >>> 1 file changed, 1 insertion(+)
> > >>>
> > >>>
> > >>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > >>> index 463879830ecf34..b05510799f93e1 100644
> > >>> --- a/fs/fuse/inode.c
> > >>> +++ b/fs/fuse/inode.c
> > >>> @@ -1814,6 +1814,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
> > >>> if (!sb_set_blocksize(sb, ctx->blksize))
> > >>> goto err;
> > >>> #endif
> > >>> + fc->sync_fs = 1;
> > >>
> > >> AFAICT, this enables syncfs only for fuseblk servers. Is this what you
> > >> intended?
> > >
> > > I meant to say for all fuseblk servers, but TBH I can't see why you
> > > wouldn't want to enable it for non-fuseblk servers too?
> > >
> > > (Maybe I was being overly cautious ;))
> >
> > Just checked, the initial commit message has
> >
> >
> > <quote 2d82ab251ef0f6e7716279b04e9b5a01a86ca530>
> > Note that such an operation allows the file server to DoS sync(). Since a
> > typical FUSE file server is an untrusted piece of software running in
> > userspace, this is disabled by default. Only enable it with virtiofs for
> > now since virtiofsd is supposedly trusted by the guest kernel.
> > </quote>
> >
> >
> > With that we could at least enable for all privileged servers? And for
> > non-privileged this could be an async?
> >
> >
> > Thanks,
> > Bernd
> >
> >
> >
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 3/7] fuse: capture the unique id of fuse commands being sent
2025-08-22 0:15 ` Joanne Koong
@ 2025-08-26 18:52 ` Darrick J. Wong
2025-09-03 15:48 ` Miklos Szeredi
2025-09-03 15:51 ` Bernd Schubert
0 siblings, 2 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-26 18:52 UTC (permalink / raw)
To: Joanne Koong; +Cc: miklos, bernd, neal, John, linux-fsdevel
On Thu, Aug 21, 2025 at 05:15:50PM -0700, Joanne Koong wrote:
> On Wed, Aug 20, 2025 at 5:51 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > The fuse_request_{send,end} tracepoints capture the value of
> > req->in.h.unique in the trace output. It would be really nice if we
> > could use this to match a request to its response for debugging and
> > latency analysis, but the call to trace_fuse_request_send occurs before
> > the unique id has been set:
> >
> > fuse_request_send: connection 8388608 req 0 opcode 1 (FUSE_LOOKUP) len 107
> > fuse_request_end: connection 8388608 req 6 len 16 error -2
> >
> > Move the callsites to trace_fuse_request_send to after the unique id has
> > been set, or right before we decide to cancel a request having not set
> > one.
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> > fs/fuse/dev.c | 6 +++++-
> > fs/fuse/dev_uring.c | 8 +++++++-
>
> I think we'll also need to do the equivalent for virtio.
Ackpth, virtio sends commands too??
Oh, yes, it does -- judging from the fuse_get_unique calls, at least
virtio_fs_send_req and maybe virtio_fs_send_forget need to add a call to
trace_fuse_request_send?
> > 2 files changed, 12 insertions(+), 2 deletions(-)
> >
> >
> > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > index 6f2b277973ca7d..05d6e7779387a4 100644
> > --- a/fs/fuse/dev.c
> > +++ b/fs/fuse/dev.c
> > @@ -376,10 +376,15 @@ static void fuse_dev_queue_req(struct fuse_iqueue *fiq, struct fuse_req *req)
> > if (fiq->connected) {
> > if (req->in.h.opcode != FUSE_NOTIFY_REPLY)
> > req->in.h.unique = fuse_get_unique_locked(fiq);
> > +
> > + /* tracepoint captures in.h.unique */
> > + trace_fuse_request_send(req);
> > +
> > list_add_tail(&req->list, &fiq->pending);
> > fuse_dev_wake_and_unlock(fiq);
> > } else {
> > spin_unlock(&fiq->lock);
> > + trace_fuse_request_send(req);
>
> Should this request still show up in the trace even though the request
> doesn't actually get sent to the server? imo that makes it
> misleading/confusing unless the trace also indicates -ENOTCONN.
Hrmm. I was thinking that it would be very nice to have
fuse_request_{send,end} bracket the start and end of a fuse request,
even if we kill it immediately.
OTOH from a tracing "efficiency" perspective it's probably ok for
never-sent requests only to ever hit the fuse_request_end tracepoint
since the id will not get reused for quite some time.
<shrug> Thoughts?
--D
> > req->out.h.error = -ENOTCONN;
> > clear_bit(FR_PENDING, &req->flags);
> > fuse_request_end(req);
> > @@ -398,7 +403,6 @@ static void fuse_send_one(struct fuse_iqueue *fiq, struct fuse_req *req)
> > req->in.h.len = sizeof(struct fuse_in_header) +
> > fuse_len_args(req->args->in_numargs,
> > (struct fuse_arg *) req->args->in_args);
> > - trace_fuse_request_send(req);
> > fiq->ops->send_req(fiq, req);
> > }
> >
> > diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> > index 249b210becb1cc..14f263d4419392 100644
> > --- a/fs/fuse/dev_uring.c
> > +++ b/fs/fuse/dev_uring.c
> > @@ -7,6 +7,7 @@
> > #include "fuse_i.h"
> > #include "dev_uring_i.h"
> > #include "fuse_dev_i.h"
> > +#include "fuse_trace.h"
> >
> > #include <linux/fs.h>
> > #include <linux/io_uring/cmd.h>
> > @@ -1265,12 +1266,17 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
> >
> > err = -EINVAL;
> > queue = fuse_uring_task_to_queue(ring);
> > - if (!queue)
> > + if (!queue) {
> > + trace_fuse_request_send(req);
>
> Same question here.
>
> Thanks,
> Joanne
>
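For anyone who wants to do the request/response matching described in the
commit message from userspace, here is a minimal sketch. It assumes only the
two line formats quoted above; the struct and function names are made up
for illustration and are not part of the kernel or of any tracing tool.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one parsed tracepoint line. */
struct fuse_trace_event {
	int is_send;			/* 1: fuse_request_send, 0: fuse_request_end */
	unsigned long connection;	/* value after "connection" */
	unsigned long long req;		/* value after "req", i.e. in.h.unique */
};

/* Parse one trace line; returns 0 on success, -1 if it is not recognised. */
int parse_fuse_trace_line(const char *line, struct fuse_trace_event *ev)
{
	const char *p;

	if ((p = strstr(line, "fuse_request_send:")) != NULL) {
		ev->is_send = 1;
		p += strlen("fuse_request_send:");
	} else if ((p = strstr(line, "fuse_request_end:")) != NULL) {
		ev->is_send = 0;
		p += strlen("fuse_request_end:");
	} else {
		return -1;
	}

	if (sscanf(p, " connection %lu req %llu",
		   &ev->connection, &ev->req) != 2)
		return -1;
	return 0;
}

/* A send and an end belong together if connection and unique id agree. */
int same_request(const struct fuse_trace_event *send,
		 const struct fuse_trace_event *end)
{
	return send->is_send && !end->is_send &&
	       send->connection == end->connection &&
	       send->req == end->req;
}
```

Fed the two lines from the commit message, same_request() reports no match
(req 0 vs. req 6), which is the problem being fixed. Once the send
tracepoint fires after the unique id is assigned, both lines should carry
the same req value and the pair matches.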
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
2025-08-22 0:01 ` Joanne Koong
@ 2025-08-26 18:56 ` Darrick J. Wong
0 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-26 18:56 UTC (permalink / raw)
To: Joanne Koong; +Cc: miklos, bernd, neal, John, linux-fsdevel
On Thu, Aug 21, 2025 at 05:01:01PM -0700, Joanne Koong wrote:
> On Wed, Aug 20, 2025 at 5:51 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Actually copy the attributes/attributes_mask from userspace.
>
> This makes sense to me.
>
> Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Thanks!
--d
> >
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> > fs/fuse/fuse_i.h | 4 ++++
> > fs/fuse/dir.c | 4 ++++
> > fs/fuse/inode.c | 3 +++
> > 3 files changed, 11 insertions(+)
> >
>
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 7/7] fuse: enable FUSE_SYNCFS for all servers
2025-08-22 17:21 ` Joanne Koong
@ 2025-08-26 19:31 ` Darrick J. Wong
2025-08-26 22:07 ` Joanne Koong
0 siblings, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-26 19:31 UTC (permalink / raw)
To: Joanne Koong; +Cc: synarete, Bernd Schubert, miklos, neal, John, linux-fsdevel
On Fri, Aug 22, 2025 at 10:21:44AM -0700, Joanne Koong wrote:
> On Fri, Aug 22, 2025 at 4:32 AM Shachar Sharon <synarete@gmail.com> wrote:
> >
> > To the best of my understanding, there are two code paths which may
> > yield FUSE_SYNCFS: one from user-space syscall syncfs(2) and the other
> > from within the kernel itself. Unfortunately, there is no way to
> > distinguish between the two at sb->s_op->sync_fs level, and the DoS
> > argument refers to the second (kernel) case. If we could somehow
> > propagate this info all the way down to the fuse layer then I see no
> > reason for preventing (non-privileged) user-space programs from
> > calling syncfs(2) over FUSE mounted file-systems.
>
> I interpreted the DoS comment as referring to the scenario where a
> userspace program calls generic sync() and if an untrusted fuse
> server deliberately hangs on servicing that request then it'll hang
> sync forever. I think if this only affected the syncfs() syscall then
> it wouldn't be a problem since the caller is directly invoking it on a
> fuse fd, but if it affects generic sync() that seems like a big issue
> to me. Or at least that's my understanding of the code with
> ksys_sync() -> iterate_supers(sync_fs_one_sb, &wait).
<shrug> I think you can already DoS sync() (and by extension any other
place in the kernel where we try to flush out all filesystems in one go)
by dropping a FUSE_SETATTR call on the floor, because that's how we
flush dirty inodes to disk? Or by doing the same for a FUSE_FSYNC
call?
--D
> Thanks,
> Joanne
> >
> >
> > Please correct me if I am wrong with my analysis.
> >
> >
> > - Shachar.
> >
> > On Fri, Aug 22, 2025 at 1:57 AM Bernd Schubert <bernd@bsbernd.com> wrote:
> > >
> > >
> > >
> > > On 8/22/25 00:28, Darrick J. Wong wrote:
> > > > On Thu, Aug 21, 2025 at 03:18:11PM -0700, Joanne Koong wrote:
> > > >> On Wed, Aug 20, 2025 at 5:52 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >>>
> > > >>> From: Darrick J. Wong <djwong@kernel.org>
> > > >>>
> > > >>> Turn on syncfs for all fuse servers so that the ones in the know can
> > > >>> flush cached intermediate data and logs to disk.
> > > >>>
> > > >>> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > >>> ---
> > > >>> fs/fuse/inode.c | 1 +
> > > >>> 1 file changed, 1 insertion(+)
> > > >>>
> > > >>>
> > > >>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > >>> index 463879830ecf34..b05510799f93e1 100644
> > > >>> --- a/fs/fuse/inode.c
> > > >>> +++ b/fs/fuse/inode.c
> > > >>> @@ -1814,6 +1814,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
> > > >>> if (!sb_set_blocksize(sb, ctx->blksize))
> > > >>> goto err;
> > > >>> #endif
> > > >>> + fc->sync_fs = 1;
> > > >>
> > > >> AFAICT, this enables syncfs only for fuseblk servers. Is this what you
> > > >> intended?
> > > >
> > > > I meant to say for all fuseblk servers, but TBH I can't see why you
> > > > wouldn't want to enable it for non-fuseblk servers too?
> > > >
> > > > (Maybe I was being overly cautious ;))
> > >
> > > Just checked, the initial commit message has
> > >
> > >
> > > <quote 2d82ab251ef0f6e7716279b04e9b5a01a86ca530>
> > > Note that such an operation allows the file server to DoS sync(). Since a
> > > typical FUSE file server is an untrusted piece of software running in
> > > userspace, this is disabled by default. Only enable it with virtiofs for
> > > now since virtiofsd is supposedly trusted by the guest kernel.
> > > </quote>
> > >
> > >
> > > With that we could at least enable for all privileged servers? And for
> > > non-privileged this could be an async?
> > >
> > >
> > > Thanks,
> > > Bernd
> > >
> > >
> > >
>
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 1/1] libfuse: don't put HAVE_STATX in a public header
2025-08-22 12:54 ` Bernd Schubert
@ 2025-08-26 19:43 ` Darrick J. Wong
0 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-26 19:43 UTC (permalink / raw)
To: Bernd Schubert
Cc: Joanne Koong, John@groves.net, bernd@bsbernd.com,
linux-fsdevel@vger.kernel.org, miklos@szeredi.hu, neal@gompa.dev
On Fri, Aug 22, 2025 at 12:54:01PM +0000, Bernd Schubert wrote:
> On 8/22/25 02:33, Joanne Koong wrote:
> > On Wed, Aug 20, 2025 at 6:01 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >>
> >> From: Darrick J. Wong <djwong@kernel.org>
> >>
> >> fuse.h and fuse_lowlevel.h are public headers, don't expose internal
> >> build system config variables to downstream clients. This can also lead
> >> to function pointer ordering issues if (say) libfuse gets built with
> >> HAVE_STATX but the client program doesn't define a HAVE_STATX.
> >>
> >> Get rid of the conditionals in the public header files to fix this.
> >>
> >> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> >> ---
> >> include/fuse.h | 2 --
> >> include/fuse_lowlevel.h | 2 --
> >> example/memfs_ll.cc | 2 +-
> >> example/passthrough.c | 2 +-
> >> example/passthrough_fh.c | 2 +-
> >> example/passthrough_ll.c | 2 +-
> >> 6 files changed, 4 insertions(+), 8 deletions(-)
> >>
> >>
> >> diff --git a/include/fuse.h b/include/fuse.h
> >> index 06feacb070fbfb..209102651e9454 100644
> >> --- a/include/fuse.h
> >> +++ b/include/fuse.h
> >> @@ -854,7 +854,6 @@ struct fuse_operations {
> >> */
> >> off_t (*lseek) (const char *, off_t off, int whence, struct fuse_file_info *);
> >>
> >> -#ifdef HAVE_STATX
> >> /**
> >> * Get extended file attributes.
> >> *
> >> @@ -865,7 +864,6 @@ struct fuse_operations {
> >> */
> >> int (*statx)(const char *path, int flags, int mask, struct statx *stxbuf,
> >> struct fuse_file_info *fi);
> >> -#endif
> >> };
> >
> > Are we able to just remove this ifdef? Won't this break compilation on
> > old systems that don't recognize "struct statx"?
>
> Yeah, you had added forward declaration actually. Slipped through in
> my review that we don't need the HAVE_STATX anymore.
>
> We can also extend the patch a bit to remove HAVE_STATX from the public
> config.
> Another alternative for this patch would be to replace HAVE_STATX by
> HAVE_FUSE_STATX.
> The commit message is also not entirely right, as it says
<shrug> libfuse itself doesn't define a struct statx, so what does it
have, aside from the incomplete struct declaration? TBH I wonder what
will happen when struct statx grows, but everybody gets to deal with
that problem because we didn't explicitly encode the size in either the
syscall or the struct definition.
Presumably fuse servers will detect and set their own HAVE_STATX,
and only supply a ->statx function if they HAVE_STATX. They don't have
to know if libfuse itself got built with statx support; if it didn't,
then nothing will ever call ->statx, afaict.
> > to function pointer ordering issues if (say) libfuse gets built with
> > HAVE_STATX but the client program doesn't define a HAVE_STATX.
>
> Actually not, because /usr/include/fuse3/libfuse_config.h defines HAVE_STATX.
> I'm more worried that there might be a conflict of HAVE_STATX from libfuse
> with HAVE_STATX from the application.
<nod>
--D
>
> Thanks,
> Bernd
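To make the two points above concrete, here is a small sketch. It is not
the libfuse header: the struct names and the later_op member are invented,
and it assumes a hypothetical operations table where some member follows
the conditional one. It shows that an incomplete "struct statx" declaration
is enough to declare the function pointer, and that hiding a member behind
a macro shifts the offset of everything after it.

```c
#include <stddef.h>

/* Incomplete type: enough to declare a pointer parameter, no definition. */
struct statx;

/* Hypothetical table as seen by a build that had the macro defined. */
struct demo_ops_with_statx {
	int (*getattr)(const char *path);
	int (*statx)(const char *path, int flags, int mask,
		     struct statx *stxbuf);
	int (*later_op)(const char *path);	/* invented trailing member */
};

/* The same table as seen by a build that did not define the macro. */
struct demo_ops_without_statx {
	int (*getattr)(const char *path);
	int (*later_op)(const char *path);
};

/* Offset of later_op as each side of the ABI would compute it. */
size_t later_op_offset_with(void)
{
	return offsetof(struct demo_ops_with_statx, later_op);
}

size_t later_op_offset_without(void)
{
	return offsetof(struct demo_ops_without_statx, later_op);
}
```

If a library and its client disagreed about which of the two layouts is in
use, the library would read later_op from the wrong slot. With the
conditional gone there is only one layout, and a server that cannot
implement statx can simply leave the pointer NULL.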
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 7/7] fuse: enable FUSE_SYNCFS for all servers
2025-08-26 19:31 ` Darrick J. Wong
@ 2025-08-26 22:07 ` Joanne Koong
2025-08-27 15:18 ` Miklos Szeredi
0 siblings, 1 reply; 210+ messages in thread
From: Joanne Koong @ 2025-08-26 22:07 UTC (permalink / raw)
To: Darrick J. Wong
Cc: synarete, Bernd Schubert, miklos, neal, John, linux-fsdevel
On Tue, Aug 26, 2025 at 12:31 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Fri, Aug 22, 2025 at 10:21:44AM -0700, Joanne Koong wrote:
> > On Fri, Aug 22, 2025 at 4:32 AM Shachar Sharon <synarete@gmail.com> wrote:
> > >
> > > To the best of my understanding, there are two code paths which may
> > > yield FUSE_SYNCFS: one from user-space syscall syncfs(2) and the other
> > > from within the kernel itself. Unfortunately, there is no way to
> > > distinguish between the two at sb->s_op->sync_fs level, and the DoS
> > > argument refers to the second (kernel) case. If we could somehow
> > > propagate this info all the way down to the fuse layer then I see no
> > > reason for preventing (non-privileged) user-space programs from
> > > calling syncfs(2) over FUSE mounted file-systems.
> >
> > I interpreted the DoS comment as referring to the scenario where a
> > userspace program calls generic sync() and if an untrusted fuse
> > server deliberately hangs on servicing that request then it'll hang
> > sync forever. I think if this only affected the syncfs() syscall then
> > it wouldn't be a problem since the caller is directly invoking it on a
> > fuse fd, but if it affects generic sync() that seems like a big issue
> > to me. Or at least that's my understanding of the code with
> > ksys_sync() -> iterate_supers(sync_fs_one_sb, &wait).
>
> <shrug> I think you can already DoS sync() (and by extension any other
> place in the kernel where we try to flush out all filesystems in one go)
> by dropping a FUSE_SETATTR call on the floor, because that's how we
> flush dirty inodes to disk? Or by doing the same for a FUSE_FSYNC
> call?
Isn't the sync() in fuse right now gated by fc->sync_fs (which is only
set to true for virtiofsd)? I don't see where FUSE_SETATTR or
FUSE_FSYNC get sent in the sync() path to untrusted servers.
Thanks,
Joanne
>
> --D
>
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 7/7] fuse: enable FUSE_SYNCFS for all servers
2025-08-26 22:07 ` Joanne Koong
@ 2025-08-27 15:18 ` Miklos Szeredi
2025-08-27 19:12 ` Darrick J. Wong
0 siblings, 1 reply; 210+ messages in thread
From: Miklos Szeredi @ 2025-08-27 15:18 UTC (permalink / raw)
To: Joanne Koong
Cc: Darrick J. Wong, synarete, Bernd Schubert, neal, John,
linux-fsdevel
On Wed, 27 Aug 2025 at 00:07, Joanne Koong <joannelkoong@gmail.com> wrote:
> Isn't the sync() in fuse right now gated by fc->sync_fs (which is only
> set to true for virtiofsd)? I don't see where FUSE_SETATTR or
> FUSE_FSYNC get sent in the sync() path to untrusted servers.
Hmm, it's through sync_inodes_one_sb() that fuse_write_inode() could
get called, which then would trigger a FUSE_SETATTR.
Does anyone know how useful sync() is in practice? I guess most
applications have switched to syncfs() which is more specific.
In any case, I don't remember a complaint about sync(2) ignoring fuse
filesystems.
Thanks,
Miklos
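For comparison, this is roughly what the "more specific" call looks like
from an application. A minimal sketch: the helper name is made up, and it
assumes glibc, which only declares syncfs() when _GNU_SOURCE is defined.

```c
#define _GNU_SOURCE		/* glibc declares syncfs() only with this */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Flush only the filesystem that contains 'path'.
 * Returns 0 on success, -1 with errno set on failure.
 */
int flush_filesystem_containing(const char *path)
{
	int fd, ret, saved_errno;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = syncfs(fd);

	/* close() may clobber errno, so preserve the syncfs() result. */
	saved_errno = errno;
	close(fd);
	errno = saved_errno;
	return ret;
}
```

A caller that uses this on a fuse mount has picked that filesystem
explicitly, so only that caller waits if the server is slow. sync(2) takes
no argument and walks every mounted filesystem, which is why a stalled
server is a bigger concern there.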
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 7/7] fuse: enable FUSE_SYNCFS for all servers
2025-08-27 15:18 ` Miklos Szeredi
@ 2025-08-27 19:12 ` Darrick J. Wong
2025-08-28 14:08 ` Miklos Szeredi
0 siblings, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-27 19:12 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Joanne Koong, synarete, Bernd Schubert, neal, John, linux-fsdevel
On Wed, Aug 27, 2025 at 05:18:23PM +0200, Miklos Szeredi wrote:
> On Wed, 27 Aug 2025 at 00:07, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> > Isn't the sync() in fuse right now gated by fc->sync_fs (which is only
> > set to true for virtiofsd)? I don't see where FUSE_SETATTR or
> > FUSE_FSYNC get sent in the sync() path to untrusted servers.
>
> Hmm, it's through sync_inodes_one_sb() that fuse_write_inode() could
> get called, which then would trigger a FUSE_SETATTR.
<nod> So SETATTR is a theoretical DoS vector, but that's already a
property of most filesystems that write to an off-cpu device such as a
disk or another computer. ;)
> Does anyone know how useful sync() is in practice? I guess most
> applications have switched to syncfs() which is more specific.
Well old greybeards such as myself reboot busted systems with
$ sync
$ sync
$ sync
<sysrq-b>
because that's what you'd type after "startx &" fscked up the display.
It's 2025 and ... that still happens. :(
Debian codesearch shows a few thousand hits for sync(), some of which
are in things like LibreOffice.
> In any case, I don't remember a complaint about sync(2) ignoring fuse
> filesystems.
Well sync() will poke all the fuse filesystems, right?
--D
> Thanks,
> Miklos
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 7/7] fuse: enable FUSE_SYNCFS for all servers
2025-08-27 19:12 ` Darrick J. Wong
@ 2025-08-28 14:08 ` Miklos Szeredi
2025-08-28 14:23 ` Miklos Szeredi
2025-08-28 15:01 ` Darrick J. Wong
0 siblings, 2 replies; 210+ messages in thread
From: Miklos Szeredi @ 2025-08-28 14:08 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Joanne Koong, synarete, Bernd Schubert, neal, John, linux-fsdevel
On Wed, 27 Aug 2025 at 21:12, Darrick J. Wong <djwong@kernel.org> wrote:
> Well sync() will poke all the fuse filesystems, right?
Only those with writeback_cache enabled. But yeah, apparently this
was overlooked when dealing with "don't allow DoS-ing sync(2)".
Can't see a good way out of this.
Thanks,
Miklos
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 7/7] fuse: enable FUSE_SYNCFS for all servers
2025-08-28 14:08 ` Miklos Szeredi
@ 2025-08-28 14:23 ` Miklos Szeredi
2025-08-28 15:01 ` Darrick J. Wong
1 sibling, 0 replies; 210+ messages in thread
From: Miklos Szeredi @ 2025-08-28 14:23 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Joanne Koong, synarete, Bernd Schubert, neal, John, linux-fsdevel
On Thu, 28 Aug 2025 at 16:08, Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Wed, 27 Aug 2025 at 21:12, Darrick J. Wong <djwong@kernel.org> wrote:
>
> > Well sync() will poke all the fuse filesystems, right?
>
> Only those with writeback_cache enabled.
And when servicing shared write mmaps...
Thanks,
Miklos
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 7/7] fuse: enable FUSE_SYNCFS for all servers
2025-08-28 14:08 ` Miklos Szeredi
2025-08-28 14:23 ` Miklos Szeredi
@ 2025-08-28 15:01 ` Darrick J. Wong
2025-08-28 15:52 ` Joanne Koong
1 sibling, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-28 15:01 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Joanne Koong, synarete, Bernd Schubert, neal, John, linux-fsdevel
On Thu, Aug 28, 2025 at 04:08:19PM +0200, Miklos Szeredi wrote:
> On Wed, 27 Aug 2025 at 21:12, Darrick J. Wong <djwong@kernel.org> wrote:
>
> > Well sync() will poke all the fuse filesystems, right?
>
> Only those with writeback_cache enabled. But yeah, apparently this
> was overlooked when dealing with "don't allow DoS-ing sync(2)".
>
> Can't see a good way out of this.
I wonder, is it possible to shift a fuse_simple_request to behave like a
fuse_simple_background request?  For certain DoS-happy requests, one
could use wait_event_interruptible_timeout(&req->waitq...) with a really
high timeout.
If the wait times out, we shift the completion to asynchronous and
return -ETIMEDOUT to the (blocked) caller. That would allow the system
to make progress though you'd probably have to take some drastic action
if the fuse server sends back a failure (e.g. setting FUSE_I_BAD).
(The problem with timeouts is that I tried setting a 60s timeout on
fuse2fs and discovered that certain horrid fstests actually create
monster files that take 45min to FUSE_RELEASE and so I don't know what a
reasonable timeout is...)
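
Roughly, as a userspace model of that hand-off (not kernel code): the
struct, the state names and both helpers below are invented purely to
make the transitions concrete, and a plain bool stands in for setting
FUSE_I_BAD.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical request states; none of these names exist in fs/fuse. */
enum model_state {
	MODEL_WAITING,		/* caller is still blocked on the reply */
	MODEL_DETACHED,		/* caller gave up; completion is async */
	MODEL_DONE,
};

struct model_req {
	enum model_state state;
	int error;		/* error code from the server's reply */
	bool inode_bad;		/* stand-in for setting FUSE_I_BAD */
};

/* Called when the (very long) wait expires without a reply. */
static int model_wait_timed_out(struct model_req *req)
{
	assert(req->state == MODEL_WAITING);
	req->state = MODEL_DETACHED;
	return -ETIMEDOUT;	/* what the blocked caller would see */
}

/*
 * Called when the server finally replies.  Returns true if a waiter is
 * still there to be woken, false if the reply was handled in the
 * background.
 */
static bool model_complete(struct model_req *req, int error)
{
	bool had_waiter = (req->state == MODEL_WAITING);

	req->error = error;
	/* Nobody is left to see a late failure, so poison the inode. */
	if (!had_waiter && error)
		req->inode_bad = true;
	req->state = MODEL_DONE;
	return had_waiter;
}
```

The only interesting case is a failure that arrives after the caller
has already been told -ETIMEDOUT; that is where the drastic action
would have to go.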
--D
> Thanks,
> Miklos
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 7/7] fuse: enable FUSE_SYNCFS for all servers
2025-08-28 15:01 ` Darrick J. Wong
@ 2025-08-28 15:52 ` Joanne Koong
0 siblings, 0 replies; 210+ messages in thread
From: Joanne Koong @ 2025-08-28 15:52 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Miklos Szeredi, synarete, Bernd Schubert, neal, John,
linux-fsdevel
On Thu, Aug 28, 2025 at 8:01 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Thu, Aug 28, 2025 at 04:08:19PM +0200, Miklos Szeredi wrote:
> > On Wed, 27 Aug 2025 at 21:12, Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > > Well sync() will poke all the fuse filesystems, right?
> >
> > Only those with writeback_cache enabled. But yeah, apparently this
> > was overlooked when dealing with "don't allow DoS-ing sync(2)".
> >
> > Can't see a good way out of this.
>
> I wonder, is it possible to shift a fuse_simple_request to behave like a
> fuse_simple_background request?  For certain DoS-happy requests, one
> could use wait_event_interruptible_timeout(&req->waitq...) with a really
> high timeout.
>
> If the wait times out, we shift the completion to asynchronous and
> return -ETIMEDOUT to the (blocked) caller. That would allow the system
> to make progress though you'd probably have to take some drastic action
> if the fuse server sends back a failure (e.g. setting FUSE_I_BAD).
>
> (The problem with timeouts is that I tried setting a 60s timeout on
> fuse2fs and discovered that certain horrid fstests actually create
> monster files that take 45min to FUSE_RELEASE and so I don't know what a
> reasonable timeout is...)
Why not just send the setattr request in fuse_write_inode() as a
background request instead of first sending it synchronously with a
timeout? For the sync() case, the only DoS path is (as Miklos pointed
out to me in his earlier comment) the sync_inodes_one_sb() ->
fuse_write_inode(). But the only thing fuse_write_inode() does is call
fuse_flush_times() to send the inode i_mtime to the server. Is this
i_mtime info even that important? It seems fine to me to have that
relayed always as a background request.
Thanks,
Joanne
>
> --D
>
> > Thanks,
> > Miklos
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
2025-08-21 0:51 ` [PATCH 4/7] fuse: implement file attributes mask for statx Darrick J. Wong
2025-08-22 0:01 ` Joanne Koong
@ 2025-08-29 6:24 ` Miklos Szeredi
2025-08-29 15:39 ` Darrick J. Wong
1 sibling, 1 reply; 210+ messages in thread
From: Miklos Szeredi @ 2025-08-29 6:24 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
On Thu, 21 Aug 2025 at 02:51, Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Actually copy the attributes/attributes_mask from userspace.
Some attributes should definitely not be copied (like MOUNT_ROOT,
AUTOMOUNT). This should probably be VFS responsibility to prevent
messing with these.
I guess the others are okay, they can already be queried through one
of the fileattr interfaces.  But I think we should still have an explicit
mask to prevent the server setting anything other than the currently
defined attributes.
Thanks,
Miklos
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
2025-08-29 6:24 ` Miklos Szeredi
@ 2025-08-29 15:39 ` Darrick J. Wong
2025-09-02 9:41 ` Miklos Szeredi
0 siblings, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-08-29 15:39 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
On Fri, Aug 29, 2025 at 08:24:42AM +0200, Miklos Szeredi wrote:
> On Thu, 21 Aug 2025 at 02:51, Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Actually copy the attributes/attributes_mask from userspace.
>
> Some attributes should definitely not be copied (like MOUNT_ROOT,
> AUTOMOUNT). This should probably be VFS responsibility to prevent
> messing with these.
>
> I guess the others are okay, they can already be queried through one
> of the fileattr interfaces.  But I think we should still have an explicit
> mask to prevent the server setting anything other than the currently
> defined attributes.
Ok, will do. Thanks for the feedback!
Though unfortunately there isn't a pre-existing mask for "flags the vfs
will set for you" other than grepping:
fs/stat.c:121: * Fill in the STATX_ATTR_* flags in the kstat structure for properties of the
fs/stat.c:127: stat->attributes |= STATX_ATTR_IMMUTABLE;
fs/stat.c:129: stat->attributes |= STATX_ATTR_APPEND;
fs/stat.c:153: stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
fs/stat.c:163: stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
fs/stat.c:201: stat->attributes |= STATX_ATTR_AUTOMOUNT;
fs/stat.c:204: stat->attributes |= STATX_ATTR_DAX;
fs/stat.c:206: stat->attributes_mask |= (STATX_ATTR_AUTOMOUNT |
fs/stat.c:207: STATX_ATTR_DAX);
fs/stat.c:312: stat->attributes |= STATX_ATTR_MOUNT_ROOT;
fs/stat.c:313: stat->attributes_mask |= STATX_ATTR_MOUNT_ROOT;
So I guess that's (IMMUTABLE | APPEND | AUTOMOUNT | DAX | MOUNT_ROOT) ?
IMMUTABLE | APPEND seem to be captured in KSTAT_ATTR_VFS_FLAGS, so maybe
that just needs to include the last three, and then we can use it to
clear those bits from the fuse server's reply.
--D
> Thanks,
> Miklos
>
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
2025-08-29 15:39 ` Darrick J. Wong
@ 2025-09-02 9:41 ` Miklos Szeredi
2025-09-02 20:57 ` Darrick J. Wong
0 siblings, 1 reply; 210+ messages in thread
From: Miklos Szeredi @ 2025-09-02 9:41 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
On Fri, 29 Aug 2025 at 17:39, Darrick J. Wong <djwong@kernel.org> wrote:
> IMMUTABLE | APPEND seem to be captured in KSTAT_ATTR_VFS_FLAGS, so maybe
> that just needs to include the last three, and then we can use it to
> clear those bits from the fuse server's reply.
Hmm. Fuse kernel module passes IMMUTABLE, APPEND and DAX through the
fileattr interfaces. I.e. it doesn't query the respective VFS flags
nor does it try to set them.
For IMMUTABLE and APPEND I can imagine the server being able to handle
these mostly (i.e. ops that should be rejected are rejected).  It would be nice
if the VFS was also aware. I wonder if we can fix this at this
point.
As for DAX, I don't see how the current behavior makes any sense, but
again not seeing clearly what the best solution is. Currently fuse
doesn't support DAX in the traditional sense, but does have DAX
functionality in virtiofs and will in famfs.  Is this flag useful
in that case?
I also feel that all unknown flags should also be masked off, but
maybe that's too paranoid.
Thanks,
Miklos
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
2025-09-02 9:41 ` Miklos Szeredi
@ 2025-09-02 20:57 ` Darrick J. Wong
2025-09-03 9:55 ` Miklos Szeredi
0 siblings, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-09-02 20:57 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
On Tue, Sep 02, 2025 at 11:41:45AM +0200, Miklos Szeredi wrote:
> On Fri, 29 Aug 2025 at 17:39, Darrick J. Wong <djwong@kernel.org> wrote:
>
> > IMMUTABLE | APPEND seem to be captured in KSTAT_ATTR_VFS_FLAGS, so maybe
> > that just needs to include the last three, and then we can use it to
> > clear those bits from the fuse server's reply.
>
> Hmm. Fuse kernel module passes IMMUTABLE, APPEND and DAX through the
> fileattr interfaces. I.e. it doesn't query the respective VFS flags
> nor does it try to set them.
>
> For IMMUTABLE and APPEND I can imagine the server being able to handle
> these mostly (i.e. reject ops should be rejected). It would be nice
> if the VFS was also aware. I wonder if we can fix this at this
> point.
You can, kind of -- either send the server FS_IOC_FSGETXATTR or
FS_IOC_GETFLAGS right after igetting an inode and set the VFS
immutable/append flags from that; or we could add a couple of flag bits
to fuse_attr::flags to avoid the extra upcall. You'd also have to
update the S_IMMUTABLE/S_APPEND flags based on the results of
FS_IOC_FSSETXATTR/FS_IOC_SETFLAGS.
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=16584b3fcdaaeb789f22847e9f82964957493a18
(I didn't enable any of that for non-iomap files to avoid changing
expected behaviors)
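
As a rough illustration of the "update S_IMMUTABLE/S_APPEND from the
GETFLAGS/SETFLAGS result" step, here is a userspace model.  The
MODEL_S_* bits are stand-ins for the kernel-internal inode flags and
model_sync_inode_flags() is a made-up helper; only FS_APPEND_FL and
FS_IMMUTABLE_FL come from the real uapi header.

```c
#include <assert.h>
#include <linux/fs.h>	/* FS_APPEND_FL, FS_IMMUTABLE_FL */

/* Stand-ins for the kernel-internal S_APPEND / S_IMMUTABLE bits. */
#define MODEL_S_APPEND		(1U << 0)
#define MODEL_S_IMMUTABLE	(1U << 1)

/*
 * Recompute the cached inode flags from the FS_*_FL flags the server
 * reported.  Both bits are cleared first so that a flag the server
 * dropped is dropped from the cache as well.
 */
static unsigned int model_sync_inode_flags(unsigned int i_flags,
					   unsigned int fsflags)
{
	i_flags &= ~(MODEL_S_APPEND | MODEL_S_IMMUTABLE);
	if (fsflags & FS_APPEND_FL)
		i_flags |= MODEL_S_APPEND;
	if (fsflags & FS_IMMUTABLE_FL)
		i_flags |= MODEL_S_IMMUTABLE;
	return i_flags;
}

/* Usage: once at inode instantiation, again after a SETFLAGS. */
static void model_example(void)
{
	unsigned int f = model_sync_inode_flags(0, FS_IMMUTABLE_FL);

	assert(f == MODEL_S_IMMUTABLE);
	/* The server now reports append-only instead of immutable. */
	f = model_sync_inode_flags(f, FS_APPEND_FL);
	assert(f == MODEL_S_APPEND);
}
```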
> As for DAX, I don't see how the current behavior makes any sense, but
> again not seeing clearly what the best solution is. Currently fuse
> doesn't support DAX in the traditional sense, but does have DAX
> functionality in virtiofs and will in famfs.  Is this flag useful
> in that case?
At this point, STATX_ATTR_DAX means that S_DAX is set on the VFS inode,
and no other code is allowed to set that statx file attribute bit, per
dax.rst:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/dax.rst
The flag is very much needed for virtiofs/famfs (and any future
fuse+iomap+fsdax combination), because that's how application programs
are supposed to detect that they can use load/store to mmap file regions
without needing fsync/msync.
> I also feel that all unknown flags should also be masked off, but
> maybe that's too paranoid.
That isn't a terrible idea.
--D
> Thanks,
> Miklos
>
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
2025-09-02 20:57 ` Darrick J. Wong
@ 2025-09-03 9:55 ` Miklos Szeredi
2025-09-03 15:49 ` Darrick J. Wong
0 siblings, 1 reply; 210+ messages in thread
From: Miklos Szeredi @ 2025-09-03 9:55 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
On Tue, 2 Sept 2025 at 22:57, Darrick J. Wong <djwong@kernel.org> wrote:
> You can, kind of -- either send the server FS_IOC_FSGETXATTR or
> FS_IOC_GETFLAGS right after igetting an inode and set the VFS
> immutable/append flags from that; or we could add a couple of flag bits
> to fuse_attr::flags to avoid the extra upcall.
How about a new FUSE_LOOKUPX that uses fuse_statx instead of fuse_attr
to initialize the inode?
> The flag is very much needed for virtiofs/famfs (and any future
> fuse+iomap+fsdax combination), because that's how application programs
> are supposed to detect that they can use load/store to mmap file regions
> without needing fsync/msync.
Makes sense.
> > I also feel that all unknown flags should also be masked off, but
> > maybe that's too paranoid.
>
> That isn't a terrible idea.
So in conclusion, the following can be passed through from the fuse
server to the statx syscall (directly or cached):
COMPRESSED
NODUMP
ENCRYPTED
VERITY
WRITE_ATOMIC
The following should be set (cached) in the relevant inode flag:
IMMUTABLE
APPEND
The following should be ignored and the VFS flag be used instead:
AUTOMOUNT
MOUNT_ROOT
DAX
And other attributes should just be ignored.
Agree?
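
Written down as masks, the split above would be something like the
following.  This is only a sketch: the MODEL_* names and the filter
helper are invented here, and no such masks exist in the tree today.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/stat.h>		/* STATX_ATTR_* */

/* Fallbacks for older uapi headers; values from include/uapi/linux/stat.h. */
#ifndef STATX_ATTR_MOUNT_ROOT
#define STATX_ATTR_MOUNT_ROOT	0x00002000
#endif
#ifndef STATX_ATTR_VERITY
#define STATX_ATTR_VERITY	0x00100000
#endif
#ifndef STATX_ATTR_DAX
#define STATX_ATTR_DAX		0x00200000
#endif
#ifndef STATX_ATTR_WRITE_ATOMIC
#define STATX_ATTR_WRITE_ATOMIC	0x00400000
#endif

/* Passed through from the server's statx reply. */
#define MODEL_ATTR_PASSTHROUGH	(STATX_ATTR_COMPRESSED | STATX_ATTR_NODUMP | \
				 STATX_ATTR_ENCRYPTED | STATX_ATTR_VERITY | \
				 STATX_ATTR_WRITE_ATOMIC)
/* Cached in the inode flags rather than copied per call. */
#define MODEL_ATTR_INODE_FLAGS	(STATX_ATTR_IMMUTABLE | STATX_ATTR_APPEND)
/* Owned by the VFS; whatever the server says is ignored. */
#define MODEL_ATTR_VFS_OWNED	(STATX_ATTR_AUTOMOUNT | STATX_ATTR_MOUNT_ROOT | \
				 STATX_ATTR_DAX)

static_assert((MODEL_ATTR_PASSTHROUGH &
	       (MODEL_ATTR_INODE_FLAGS | MODEL_ATTR_VFS_OWNED)) == 0,
	      "the three groups must not overlap");

/* Keep only the bits the server may report straight to statx. */
static uint64_t model_filter_server_attrs(uint64_t server_attrs)
{
	return server_attrs & MODEL_ATTR_PASSTHROUGH;
}
```

Anything outside the first mask, including bits that are not defined
yet, simply falls away in the AND.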
Thanks,
Miklos
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 1/7] fuse: fix livelock in synchronous file put from fuseblk workers
2025-08-21 0:50 ` [PATCH 1/7] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
@ 2025-09-03 15:20 ` Miklos Szeredi
2025-09-03 15:23 ` Darrick J. Wong
0 siblings, 1 reply; 210+ messages in thread
From: Miklos Szeredi @ 2025-09-03 15:20 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: stable, bernd, neal, John, linux-fsdevel, joannelkoong
On Thu, 21 Aug 2025 at 02:50, Darrick J. Wong <djwong@kernel.org> wrote:
> Fix this by only using synchronous fputs for fuseblk servers if the
> process doesn't have PF_LOCAL_THROTTLE. Hopefully the fuseblk server
> had the good sense to call PR_SET_IO_FLUSHER to mark itself as a
> filesystem server.
I'm still not convinced. This patch adds complexity and depends on
the server doing some magic, which makes it unreliable.
Doing it async unconditionally removes complexity and fixes the issue reliably.
Thanks,
Miklos
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 1/7] fuse: fix livelock in synchronous file put from fuseblk workers
2025-09-03 15:20 ` Miklos Szeredi
@ 2025-09-03 15:23 ` Darrick J. Wong
0 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-09-03 15:23 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: stable, bernd, neal, John, linux-fsdevel, joannelkoong
On Wed, Sep 03, 2025 at 05:20:13PM +0200, Miklos Szeredi wrote:
> On Thu, 21 Aug 2025 at 02:50, Darrick J. Wong <djwong@kernel.org> wrote:
>
> > Fix this by only using synchronous fputs for fuseblk servers if the
> > process doesn't have PF_LOCAL_THROTTLE. Hopefully the fuseblk server
> > had the good sense to call PR_SET_IO_FLUSHER to mark itself as a
> > filesystem server.
>
> I'm still not convinced. This patch adds complexity and depends on
> the server doing some magic, which makes it unreliable.
>
> Doing it async unconditionally removes complexity and fixes the issue reliably.
Works for me, I'll make it use async mode unconditionally.
--D
> Thanks,
> Miklos
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
2025-08-21 0:51 ` [PATCH 2/7] fuse: flush pending fuse events before aborting the connection Darrick J. Wong
@ 2025-09-03 15:45 ` Miklos Szeredi
2025-09-03 17:49 ` Darrick J. Wong
0 siblings, 1 reply; 210+ messages in thread
From: Miklos Szeredi @ 2025-09-03 15:45 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
On Thu, 21 Aug 2025 at 02:51, Darrick J. Wong <djwong@kernel.org> wrote:
> Create a function to push all the background requests to the queue and
> then wait for the number of pending events to hit zero, and call this
> before fuse_abort_conn. That way, all the pending events are processed
> by the fuse server and we don't end up with a corrupt filesystem.
The flushing should be dependent on fc->destroy. Without that we
really don't want server to block umount, not even for 30s.
I hate timeout based solutions, so my preference would be to remove
the timeout completely. It wouldn't really make a difference anyway,
since FUSE_DESTROY is sent synchronously without a timeout.
Thinking about blocking umount: if we did this in a private user/mount
ns, then it wouldn't be a problem. But how can we be sure? Is
checking sb->s_user_ns != &init_user_ns sufficient?
Thanks,
Miklos
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 3/7] fuse: capture the unique id of fuse commands being sent
2025-08-26 18:52 ` Darrick J. Wong
@ 2025-09-03 15:48 ` Miklos Szeredi
2025-09-03 15:54 ` Darrick J. Wong
2025-09-03 15:51 ` Bernd Schubert
1 sibling, 1 reply; 210+ messages in thread
From: Miklos Szeredi @ 2025-09-03 15:48 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Joanne Koong, bernd, neal, John, linux-fsdevel
On Tue, 26 Aug 2025 at 20:52, Darrick J. Wong <djwong@kernel.org> wrote:
> Hrmm. I was thinking that it would be very nice to have
> fuse_request_{send,end} bracket the start and end of a fuse request,
> even if we kill it immediately.
I'm fine with that, and it would possibly simplify some code that checks
for an error and calls ->end manually. But that makes it a
non-trivial change unfortunately.
Thanks,
Miklos
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 4/7] fuse: implement file attributes mask for statx
2025-09-03 9:55 ` Miklos Szeredi
@ 2025-09-03 15:49 ` Darrick J. Wong
0 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-09-03 15:49 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
On Wed, Sep 03, 2025 at 11:55:25AM +0200, Miklos Szeredi wrote:
> On Tue, 2 Sept 2025 at 22:57, Darrick J. Wong <djwong@kernel.org> wrote:
>
> > You can, kind of -- either send the server FS_IOC_FSGETXATTR or
> > FS_IOC_GETFLAGS right after igetting an inode and set the VFS
> > immutable/append flags from that; or we could add a couple of flag bits
> > to fuse_attr::flags to avoid the extra upcall.
>
> How about a new FUSE_LOOKUPX that uses fuse_statx instead of fuse_attr
> to initialize the inode?
Or what if we enlarged fuse_attr? Its fields mostly duplicate what was
already in struct stat (and now struct statx):
struct fuse_attrx {
struct fuse_statx statx;
uint32_t flags; /* fuse_attr::flags */
};
Hrmm, fuse_attr is embedded in structs fuse_entry_out and
fuse_attr_out. FUSE_{LOOKUP,OPEN,CREATE,SETATTR,GETATTR} (and
direntplus) would need to be rototilled to support the new structure,
and either you need new command codes for all that, or I guess one could
set out_argvar = true and switch the out-struct decoding based on the
size returned.
That sounds like a project in and of itself.
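
To illustrate just the "switch the decoding on the size returned" part,
here is a toy userspace sketch.  The two layouts are deliberately tiny
stand-ins, not fuse_attr or the fuse_attrx floated above, and every
model_* name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for the existing, smaller reply layout. */
struct model_attr {
	uint64_t ino;
	uint32_t mode;
	uint32_t flags;
};

/* Toy stand-in for an enlarged reply carrying statx-style attributes. */
struct model_attrx {
	uint64_t ino;
	uint32_t mode;
	uint32_t flags;
	uint64_t attributes;
	uint64_t attributes_mask;
};

static_assert(sizeof(struct model_attr) != sizeof(struct model_attrx),
	      "the size switch only works if the layouts differ in size");

struct model_decoded {
	struct model_attrx ax;
	int extended;		/* nonzero if the larger layout was seen */
};

/* Pick the layout from the reply length; reject anything else. */
static int model_decode_attr_reply(const void *buf, size_t len,
				   struct model_decoded *out)
{
	memset(out, 0, sizeof(*out));
	if (len == sizeof(struct model_attrx)) {
		memcpy(&out->ax, buf, sizeof(struct model_attrx));
		out->extended = 1;
		return 0;
	}
	if (len == sizeof(struct model_attr)) {
		struct model_attr a;

		memcpy(&a, buf, sizeof(a));
		out->ax.ino = a.ino;
		out->ax.mode = a.mode;
		out->ax.flags = a.flags;
		return 0;	/* attributes stay zero: server said nothing */
	}
	return -EINVAL;
}
```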
> > The flag is very much needed for virtiofs/famfs (and any future
> > fuse+iomap+fsdax combination), because that's how application programs
> > are supposed to detect that they can use load/store to mmap file regions
> > without needing fsync/msync.
>
> Makes sense.
>
> > > I also feel that all unknown flags should also be masked off, but
> > > maybe that's too paranoid.
> >
> > That isn't a terrible idea.
>
> So in conclusion, the following can be passed through from the fuse
> server to the statx syscall (directly or cached):
>
> COMPRESSED
> NODUMP
> ENCRYPTED
> VERITY
> WRITE_ATOMIC
Right.
> The following should be set (cached) in the relevant inode flag:
>
> IMMUTABLE
> APPEND
Right, S_IMMUTABLE and S_APPEND.
> The following should be ignored and the VFS flag be used instead:
>
> AUTOMOUNT
> MOUNT_ROOT
> DAX
Yes to the first two.
As for /setting/ S_DAX, I think we just keep using FUSE_ATTR_DAX like we
do now, since that's baked into the kernel <-> libfuse interface and
can't go away. That would fit nicely with how the other filesystems do
it.
> And other attributes should just be ignored.
I prefer to say "Other attribute bits are undefined and should just be
ignored." :)
> Agree?
I think we do, except maybe the difficult first point. :)
--D
> Thanks,
> Miklos
>
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 3/7] fuse: capture the unique id of fuse commands being sent
2025-08-26 18:52 ` Darrick J. Wong
2025-09-03 15:48 ` Miklos Szeredi
@ 2025-09-03 15:51 ` Bernd Schubert
1 sibling, 0 replies; 210+ messages in thread
From: Bernd Schubert @ 2025-09-03 15:51 UTC (permalink / raw)
To: Darrick J. Wong, Joanne Koong; +Cc: miklos, neal, John, linux-fsdevel
On 8/26/25 20:52, Darrick J. Wong wrote:
> On Thu, Aug 21, 2025 at 05:15:50PM -0700, Joanne Koong wrote:
>> On Wed, Aug 20, 2025 at 5:51 PM Darrick J. Wong <djwong@kernel.org> wrote:
>>>
>>> From: Darrick J. Wong <djwong@kernel.org>
>>>
>>> The fuse_request_{send,end} tracepoints capture the value of
>>> req->in.h.unique in the trace output. It would be really nice if we
>>> could use this to match a request to its response for debugging and
>>> latency analysis, but the call to trace_fuse_request_send occurs before
>>> the unique id has been set:
>>>
>>> fuse_request_send: connection 8388608 req 0 opcode 1 (FUSE_LOOKUP) len 107
>>> fuse_request_end: connection 8388608 req 6 len 16 error -2
>>>
>>> Move the callsites to trace_fuse_request_send to after the unique id has
>>> been set, or right before we decide to cancel a request having not set
>>> one.
>>>
>>> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
>>> ---
>>> fs/fuse/dev.c | 6 +++++-
>>> fs/fuse/dev_uring.c | 8 +++++++-
>>
>> I think we'll also need to do the equivalent for virtio.
>
> Ackpth, virtio sends commands too??
>
> Oh, yes, it does -- judging from the fuse_get_unique calls, at least
> virtio_fs_send_req and maybe virtio_fs_send_forget need to add a call to
> trace_fuse_request_send?
>
>>> 2 files changed, 12 insertions(+), 2 deletions(-)
>>>
>>>
>>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>>> index 6f2b277973ca7d..05d6e7779387a4 100644
>>> --- a/fs/fuse/dev.c
>>> +++ b/fs/fuse/dev.c
>>> @@ -376,10 +376,15 @@ static void fuse_dev_queue_req(struct fuse_iqueue *fiq, struct fuse_req *req)
>>> if (fiq->connected) {
>>> if (req->in.h.opcode != FUSE_NOTIFY_REPLY)
>>> req->in.h.unique = fuse_get_unique_locked(fiq);
>>> +
>>> + /* tracepoint captures in.h.unique */
>>> + trace_fuse_request_send(req);
>>> +
>>> list_add_tail(&req->list, &fiq->pending);
>>> fuse_dev_wake_and_unlock(fiq);
>>> } else {
>>> spin_unlock(&fiq->lock);
>>> + trace_fuse_request_send(req);
>>
>> Should this request still show up in the trace even though the request
>> doesn't actually get sent to the server? imo that makes it
>> misleading/confusing unless the trace also indicates -ENOTCONN.
>
> Hrmm. I was thinking that it would be very nice to have
> fuse_request_{send,end} bracket the start and end of a fuse request,
> even if we kill it immediately.
>
> OTOH from a tracing "efficiency" perspective it's probably ok for
> never-sent requests only to ever hit the fuse_request_end tracepoint
> since the id will not get reused for quite some time.
>
> <shrug> Thoughts?
>
> --D
>
>>> req->out.h.error = -ENOTCONN;
>>> clear_bit(FR_PENDING, &req->flags);
>>> fuse_request_end(req);
>>> @@ -398,7 +403,6 @@ static void fuse_send_one(struct fuse_iqueue *fiq, struct fuse_req *req)
>>> req->in.h.len = sizeof(struct fuse_in_header) +
>>> fuse_len_args(req->args->in_numargs,
>>> (struct fuse_arg *) req->args->in_args);
>>> - trace_fuse_request_send(req);
>>> fiq->ops->send_req(fiq, req);
>>> }
>>>
>>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
>>> index 249b210becb1cc..14f263d4419392 100644
>>> --- a/fs/fuse/dev_uring.c
>>> +++ b/fs/fuse/dev_uring.c
>>> @@ -7,6 +7,7 @@
>>> #include "fuse_i.h"
>>> #include "dev_uring_i.h"
>>> #include "fuse_dev_i.h"
>>> +#include "fuse_trace.h"
>>>
>>> #include <linux/fs.h>
>>> #include <linux/io_uring/cmd.h>
>>> @@ -1265,12 +1266,17 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
>>>
>>> err = -EINVAL;
>>> queue = fuse_uring_task_to_queue(ring);
>>> - if (!queue)
>>> + if (!queue) {
>>> + trace_fuse_request_send(req);
>>
>> Same question here.
>>
>> Thanks,
>> Joanne
>>
I really need to find time to update my related branch - as I wrote
before, I already have all of that, I think.
Thanks,
Bernd
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 3/7] fuse: capture the unique id of fuse commands being sent
2025-09-03 15:48 ` Miklos Szeredi
@ 2025-09-03 15:54 ` Darrick J. Wong
2025-09-03 18:47 ` Darrick J. Wong
0 siblings, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-09-03 15:54 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: Joanne Koong, bernd, neal, John, linux-fsdevel
On Wed, Sep 03, 2025 at 05:48:46PM +0200, Miklos Szeredi wrote:
> On Tue, 26 Aug 2025 at 20:52, Darrick J. Wong <djwong@kernel.org> wrote:
>
> > Hrmm. I was thinking that it would be very nice to have
> > fuse_request_{send,end} bracket the start and end of a fuse request,
> > even if we kill it immediately.
>
> I'm fine with that, and it would possibly simplify some code that checks
> for an error and calls ->end manually. But that makes it a
> non-trivial change unfortunately.
Yes, and then you have to poke the idr structure for a request id even
if that caller already knows that the connection's dead. That seems
like a waste of cycles, but OTOH maybe we just don't care?
(Though I suppose seeing more than one request id of zero in the trace
output implies very strongly that the connection is really dead)
--D
> Thanks,
> Miklos
>
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 5/7] fuse: update file mode when updating acls
2025-08-21 0:51 ` [PATCH 5/7] fuse: update file mode when updating acls Darrick J. Wong
@ 2025-09-03 16:01 ` Miklos Szeredi
2025-09-03 17:51 ` Darrick J. Wong
0 siblings, 1 reply; 210+ messages in thread
From: Miklos Szeredi @ 2025-09-03 16:01 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
On Thu, 21 Aug 2025 at 02:51, Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> If someone sets ACLs on a file that can be expressed fully as Unix DAC
> mode bits, most filesystems will then update the mode bits and drop the
> ACL xattr to reduce inefficiency in the file access paths. Let's do
> that too. Note that means that we can setacl and end up with no ACL
> xattrs, so we also need to tolerate ENODATA returns from
> fuse_removexattr.
This goes against the model of leaving this sort of task to the
server. I understand your desire to do it in the kernel, since that
simplifies your server. But fuse is often used in passthrough mode,
where this will be done by the kernel, just one layer down the stack.
In that case splitting a setxattr into a removexattr + chmod makes
little sense.
Maybe extend the meaning of fc->default_permissions to mean: userspace
doesn't want to deal with any mode related stuff. Thoughts?
Thanks,
Miklos
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 6/7] fuse: propagate default and file acls on creation
2025-08-21 0:52 ` [PATCH 6/7] fuse: propagate default and file acls on creation Darrick J. Wong
@ 2025-09-03 16:15 ` Miklos Szeredi
2025-09-03 16:27 ` Darrick J. Wong
0 siblings, 1 reply; 210+ messages in thread
From: Miklos Szeredi @ 2025-09-03 16:15 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
On Thu, 21 Aug 2025 at 02:52, Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Propagate the default and file access ACLs to new children when creating
> them, just like the other kernel filesystems.
Another problem of this and the previous patch is being racy. Not
"real" filesystems like fuse2fs, but this is going to trip network fs
up badly, where such races would be really difficult to test.
We could add a new feature flag, but we seem to have proliferation of
this sort. We have default_permissions, then handle_killpriv, then
handle_killpriv_v2. Seems like we need a flag to tell the kernel to
treat this as a local fs, where it can do all the local fs'y things
without fear of breaking remote fs.
Does that make sense?
Thanks,
Miklos
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 6/7] fuse: propagate default and file acls on creation
2025-09-03 16:15 ` Miklos Szeredi
@ 2025-09-03 16:27 ` Darrick J. Wong
0 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-09-03 16:27 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
On Wed, Sep 03, 2025 at 06:15:30PM +0200, Miklos Szeredi wrote:
> On Thu, 21 Aug 2025 at 02:52, Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Propagate the default and file access ACLs to new children when creating
> > them, just like the other kernel filesystems.
>
> Another problem of this and the previous patch is being racy. Not
> "real" filesystems like fuse2fs, but this is going to trip network fs
> up badly, where such races would be really difficult to test.
Ahh, right -- I neglected that the fuse interface is more or less what
you'd need for a client node of a network/cluster filesystem.
> We could add a new feature flag, but we seem to have proliferation of
> this sort. We have default_permissions, then handle_killpriv, then
> handle_killpriv_v2. Seems like we need a flag to tell the kernel to
> treat this as a local fs, where it can do all the local fs'y things
> without fear of breaking remote fs.
>
> Does that make sense?
Yeah.
How about I hide the functionality of this ACL patch and the previous
one behind (fc->iomap || sb->s_bdev != NULL)? The iomap functionality
that I'm working on is only useful for filesystems that want to behave
like a local fs, including all the "I went out to lunch DoS" warts.
AFAICT the other fuse developers seem to accept that fuseblk servers can
do that too. Does that sound ok?
If anyone ever wanted to use fuse+iomap for a cluster fs, I guess I'd
have to go back to issuing FUSE_READ/WRITE requests to userspace for
permission checking and resource acquisition. But so far no cluster
filesystems use fs/iomap/ so it's just unsupported.
(And to make this explicit to anyone watching on the list -- all of my
work is completely separate from Joanne's efforts to adapt fuse to use
iomap for tracking pagecache dirty state.)
--D
> Thanks,
> Miklos
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 2/7] fuse: flush pending fuse events before aborting the connection
2025-09-03 15:45 ` Miklos Szeredi
@ 2025-09-03 17:49 ` Darrick J. Wong
0 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-09-03 17:49 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
On Wed, Sep 03, 2025 at 05:45:27PM +0200, Miklos Szeredi wrote:
> On Thu, 21 Aug 2025 at 02:51, Darrick J. Wong <djwong@kernel.org> wrote:
>
> > Create a function to push all the background requests to the queue and
> > then wait for the number of pending events to hit zero, and call this
> > before fuse_abort_conn. That way, all the pending events are processed
> > by the fuse server and we don't end up with a corrupt filesystem.
>
> The flushing should be dependent on fc->destroy. Without that we
> really don't want server to block umount, not even for 30s.
<nod> I once thought it was crucial to flush all the FUSE_RELEASE
requests to the fuse server prior to the server's ->destroy method being
called, but it turns out that's not true -- all the open unlinked files
created by generic/488 actually do get cleaned up even in the !fuseblk
case.
It's just libext2fs that's somewhat stupid and leaves dead dirents all
over the root directory, which (mis)led me into thinking that the
unlinked files weren't being cleaned up correctly.
> I hate timeout based solutions, so my preference would be to remove
> the timeout completely. It wouldn't really make a difference anyway,
> since FUSE_DESTROY is sent synchronously without a timeout.
Hrmm, the timeouts waiting for FUSE_RELEASE might not be so useful
anyway. If someone configured a request_timeout then the requests will
automatically cancel if the fuse server wedges itself.
OTOH I don't set any request_timeout in fuse[24]fs because they use
FUSE_RELEASE to free an open-but-unlinked file, and that can take 45min
if (say) you have a file with ten million extents to free as part of
freeing the file.
I think the problem here is that there's no way for a fuse server to
report back to the kernel that it's making progress on a very long
running request; and even if there were, the kernel probably shouldn't
trust such a report.
In the default case there's no request timeout so the kernel will wait
forever.
In any case, I think I agree that the time_after check isn't necessary.
Either we trust the server to be making progress (and do not have a
request timeout) or we notice the connection died and move on to
aborting all the requests.
> Thinking about blocking umount: if we did this in a private user/mount
> ns, then it wouldn't be a problem. But how can we be sure? Is
> checking sb->s_user_ns != &init_user_ns sufficient?
I'm not sure we can -- what if you mount a filesystem in a private mount
ns, but then some supervisor process adds another mount to the same sb
in the init_user_ns?
--D
> Thanks,
> Miklos
>
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 5/7] fuse: update file mode when updating acls
2025-09-03 16:01 ` Miklos Szeredi
@ 2025-09-03 17:51 ` Darrick J. Wong
0 siblings, 0 replies; 210+ messages in thread
From: Darrick J. Wong @ 2025-09-03 17:51 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: bernd, neal, John, linux-fsdevel, joannelkoong
On Wed, Sep 03, 2025 at 06:01:00PM +0200, Miklos Szeredi wrote:
> On Thu, 21 Aug 2025 at 02:51, Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > If someone sets ACLs on a file that can be expressed fully as Unix DAC
> > mode bits, most filesystems will then update the mode bits and drop the
> > ACL xattr to reduce inefficiency in the file access paths. Let's do
> > that too. Note that means that we can setacl and end up with no ACL
> > xattrs, so we also need to tolerate ENODATA returns from
> > fuse_removexattr.
>
> This goes against the model of leaving this sort of task to the
> server. I understand your desire to do it in the kernel, since that
> simplifies your server. But fuse is often used in passthrough mode,
> where this will be done by the kernel, just one layer down the stack.
> In that case splitting a setxattr into a removexattr + chmod makes
> little sense.
Ah, right. I temporarily forgot about network/cluster filesystems where
the local kernel isn't necessarily in charge of the file metadata and
permissions.
> Maybe extend the meaning of fc->default_permissions to mean: userspace
> doesn't want to deal with any mode related stuff. Thoughts?
As suggested in the thread for the next patch, maybe I should just hide
this new acl behavior behind (fc->iomap || sb->s_bdev != NULL)?
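
A minimal userspace sketch of the gate proposed above, assuming the
predicate is exactly (fc->iomap || sb->s_bdev != NULL). The model_*
structs and the helper name are hypothetical stand-ins for struct
fuse_conn and struct super_block; nothing here is the actual patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant bit of struct fuse_conn. */
struct model_conn {
	bool iomap;		/* server negotiated iomap (assumed field) */
};

/* Hypothetical stand-in for the relevant bit of struct super_block. */
struct model_sb {
	const void *s_bdev;	/* non-NULL for block-device (fuseblk) mounts */
};

/*
 * Return true when the kernel, rather than the server, should fold a
 * mode-equivalent ACL into the mode bits. Passthrough and network
 * servers (no iomap, no block device) keep handling this themselves.
 */
static inline bool model_kernel_manages_acl_mode(const struct model_conn *fc,
						 const struct model_sb *sb)
{
	return fc->iomap || sb->s_bdev != NULL;
}
```

With a gate like this, a plain passthrough server would still see the
setxattr untouched, which seems to be the concern raised above.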
--D
> Thanks,
> Miklos
>
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 3/7] fuse: capture the unique id of fuse commands being sent
2025-09-03 15:54 ` Darrick J. Wong
@ 2025-09-03 18:47 ` Darrick J. Wong
2025-09-03 23:05 ` Joanne Koong
0 siblings, 1 reply; 210+ messages in thread
From: Darrick J. Wong @ 2025-09-03 18:47 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: Joanne Koong, bernd, neal, John, linux-fsdevel
On Wed, Sep 03, 2025 at 08:54:05AM -0700, Darrick J. Wong wrote:
> On Wed, Sep 03, 2025 at 05:48:46PM +0200, Miklos Szeredi wrote:
> > On Tue, 26 Aug 2025 at 20:52, Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > > Hrmm. I was thinking that it would be very nice to have
> > > fuse_request_{send,end} bracket the start and end of a fuse request,
> > > even if we kill it immediately.
> >
> > I'm fine with that, and would possibly simplify some code that checks
> > for an error and calls ->end manually. But that makes it a
> > non-trivial change unfortunately.
>
> Yes, and then you have to poke the idr structure for a request id even
> if that caller already knows that the connection's dead. That seems
> like a waste of cycles, but OTOH maybe we just don't care?
>
> (Though I suppose seeing more than one request id of zero in the trace
> output implies very strongly that the connection is really dead)
Well.... given the fuse_iqueue::reqctr usage, the first request gets a
unique id of 2 and increments by two thereafter. So it's a pretty safe
bet that unique==0 means the request isn't actually being sent, or that
you're very lucky in that your fuse server has been running for a /very/
long time.
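
To illustrate the counter behavior described above, here is a small
userspace model. It assumes bit 0 of the id is reserved, so ids advance
in steps of two starting from 2. The model_* names and the step macro
are illustrative stand-ins, not the kernel definitions.

```c
#include <stdint.h>

/* Assumed step: bit 0 is reserved, so ids advance by 2. */
#define MODEL_REQ_ID_STEP	(UINT64_C(1) << 1)

/* Stand-in for the part of fuse_iqueue that holds the counter. */
struct model_iqueue {
	uint64_t reqctr;	/* starts at 0; lock-protected in the kernel */
};

/* Hand out the next unique id: 2, 4, 6, ... */
static inline uint64_t model_get_unique(struct model_iqueue *fiq)
{
	fiq->reqctr += MODEL_REQ_ID_STEP;
	return fiq->reqctr;
}

/*
 * A zero id means no id was ever assigned. The counter can only come
 * back around to zero after 2^63 requests, hence the "very long time".
 */
static inline int model_unique_is_assigned(uint64_t unique)
{
	return unique != 0;
}
```

Under these assumptions a trace consumer could treat unique==0 as
"never queued" without a separate flag.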
I think I just won't call trace_fuse_request_send for requests that are
immediately ended; and I'll refactor the req->in.h.unique assignment
into a helper so that virtiofs and friends can call the helper and get
the tracepoint automatically.
For example, fuse_dev_queue_req now becomes:
static inline void fuse_request_assign_unique_locked(struct fuse_iqueue *fiq,
						      struct fuse_req *req)
{
	if (req->in.h.opcode != FUSE_NOTIFY_REPLY)
		req->in.h.unique = fuse_get_unique_locked(fiq);

	/* tracepoint captures in.h.unique and in.h.len */
	trace_fuse_request_send(req);
}

static void fuse_dev_queue_req(struct fuse_iqueue *fiq, struct fuse_req *req)
{
	spin_lock(&fiq->lock);
	if (fiq->connected) {
		fuse_request_assign_unique_locked(fiq, req);
		list_add_tail(&req->list, &fiq->pending);
		fuse_dev_wake_and_unlock(fiq);
	} else {
		spin_unlock(&fiq->lock);
		req->out.h.error = -ENOTCONN;
		clear_bit(FR_PENDING, &req->flags);
		fuse_request_end(req);
	}
}
--D
^ permalink raw reply [flat|nested] 210+ messages in thread
* Re: [PATCH 3/7] fuse: capture the unique id of fuse commands being sent
2025-09-03 18:47 ` Darrick J. Wong
@ 2025-09-03 23:05 ` Joanne Koong
0 siblings, 0 replies; 210+ messages in thread
From: Joanne Koong @ 2025-09-03 23:05 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Miklos Szeredi, bernd, neal, John, linux-fsdevel
On Wed, Sep 3, 2025 at 11:47 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Wed, Sep 03, 2025 at 08:54:05AM -0700, Darrick J. Wong wrote:
> > On Wed, Sep 03, 2025 at 05:48:46PM +0200, Miklos Szeredi wrote:
> > > On Tue, 26 Aug 2025 at 20:52, Darrick J. Wong <djwong@kernel.org> wrote:
> > >
Sorry for the late reply on this.
> > > > Hrmm. I was thinking that it would be very nice to have
> > > > fuse_request_{send,end} bracket the start and end of a fuse request,
> > > > even if we kill it immediately.
Oh interesting, I didn't realize there was a trace_fuse_request_end().
I get now why you wanted the trace_fuse_request_send() for the
!fiq->connected case, for symmetry. I was thinking of it from the
client userspace side (one idea I have, which idk if it is actually
that useful or not, is building some sort of observability "wireshark
for fuse" tool that gives more visibility into the requests being sent
to/from the server like their associated kernel vs libfuse timestamps
to know where the latency is happening. this issue has come up in prod
a few times when debugging slow requests); from this perspective, it
seemed confusing to see requests show up that were never in good faith
attempted to be sent to the server.
If you want to preserve the symmetry, maybe one idea is only doing the
trace_fuse_request_end() if the req.in.h.unique code is valid? That
would skip doing the trace for the !fiq->connected case.
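
A rough userspace sketch of that idea, assuming a unique id of zero
means "never assigned". The model_* names, the counter, and the stderr
"tracepoint" are all hypothetical; the real tracepoint and completion
path are not reproduced here.

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the fields of a fuse request that matter here. */
struct model_req {
	uint64_t unique;	/* 0 if the request was never queued */
	int error;
};

/* Counts emitted end events so the behavior can be checked. */
static int model_trace_end_count;

/* Stand-in for the request_end tracepoint. */
static void model_trace_request_end(const struct model_req *req)
{
	model_trace_end_count++;
	fprintf(stderr, "request_end: unique=%llu error=%d\n",
		(unsigned long long)req->unique, req->error);
}

/* Only emit the end event when a matching send event could exist. */
static void model_request_end(struct model_req *req)
{
	if (req->unique != 0)
		model_trace_request_end(req);
	/* normal completion work would follow here */
}
```

This keeps send/end events paired for every request that reached the
queue, while a request ended on a dead connection produces neither.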
Thanks,
Joanne
> > >
> > > I'm fine with that, and would possibly simplify some code that checks
> > > for an error and calls ->end manually. But that makes it a
> > > non-trivial change unfortunately.
> >
> > Yes, and then you have to poke the idr structure for a request id even
> > if that caller already knows that the connection's dead. That seems
> > like a waste of cycles, but OTOH maybe we just don't care?
> >
> > (Though I suppose seeing more than one request id of zero in the trace
> > output implies very strongly that the connection is really dead)
>
> Well.... given the fuse_iqueue::reqctr usage, the first request gets a
> unique id of 2 and increments by two thereafter. So it's a pretty safe
> bet that unique==0 means the request isn't actually being sent, or that
> you're very lucky in that your fuse server has been running for a /very/
> long time.
>
> I think I just won't call trace_fuse_request_send for requests that are
> immediately ended; and I'll refactor the req->in.h.unique assignment
> into a helper so that virtiofs and friends can call the helper and get
> the tracepoint automatically.
>
> For example, fuse_dev_queue_req now becomes:
>
>
> static inline void fuse_request_assign_unique_locked(struct fuse_iqueue *fiq,
>						       struct fuse_req *req)
> {
>	if (req->in.h.opcode != FUSE_NOTIFY_REPLY)
>		req->in.h.unique = fuse_get_unique_locked(fiq);
>
>	/* tracepoint captures in.h.unique and in.h.len */
>	trace_fuse_request_send(req);
> }
>
> static void fuse_dev_queue_req(struct fuse_iqueue *fiq, struct fuse_req *req)
> {
>	spin_lock(&fiq->lock);
>	if (fiq->connected) {
>		fuse_request_assign_unique_locked(fiq, req);
>		list_add_tail(&req->list, &fiq->pending);
>		fuse_dev_wake_and_unlock(fiq);
>	} else {
>		spin_unlock(&fiq->lock);
>		req->out.h.error = -ENOTCONN;
>		clear_bit(FR_PENDING, &req->flags);
>		fuse_request_end(req);
>	}
> }
>
> --D
^ permalink raw reply [flat|nested] 210+ messages in thread
end of thread, other threads:[~2025-09-03 23:05 UTC | newest]
Thread overview: 210+ messages
2025-08-21 0:37 [RFC v4] fuse: use fs-iomap for better performance so we can containerize ext4 Darrick J. Wong
2025-08-21 0:47 ` [PATCHSET RFC v4 1/4] fuse: general bug fixes Darrick J. Wong
2025-08-21 0:50 ` [PATCH 1/7] fuse: fix livelock in synchronous file put from fuseblk workers Darrick J. Wong
2025-09-03 15:20 ` Miklos Szeredi
2025-09-03 15:23 ` Darrick J. Wong
2025-08-21 0:51 ` [PATCH 2/7] fuse: flush pending fuse events before aborting the connection Darrick J. Wong
2025-09-03 15:45 ` Miklos Szeredi
2025-09-03 17:49 ` Darrick J. Wong
2025-08-21 0:51 ` [PATCH 3/7] fuse: capture the unique id of fuse commands being sent Darrick J. Wong
2025-08-22 0:15 ` Joanne Koong
2025-08-26 18:52 ` Darrick J. Wong
2025-09-03 15:48 ` Miklos Szeredi
2025-09-03 15:54 ` Darrick J. Wong
2025-09-03 18:47 ` Darrick J. Wong
2025-09-03 23:05 ` Joanne Koong
2025-09-03 15:51 ` Bernd Schubert
2025-08-21 0:51 ` [PATCH 4/7] fuse: implement file attributes mask for statx Darrick J. Wong
2025-08-22 0:01 ` Joanne Koong
2025-08-26 18:56 ` Darrick J. Wong
2025-08-29 6:24 ` Miklos Szeredi
2025-08-29 15:39 ` Darrick J. Wong
2025-09-02 9:41 ` Miklos Szeredi
2025-09-02 20:57 ` Darrick J. Wong
2025-09-03 9:55 ` Miklos Szeredi
2025-09-03 15:49 ` Darrick J. Wong
2025-08-21 0:51 ` [PATCH 5/7] fuse: update file mode when updating acls Darrick J. Wong
2025-09-03 16:01 ` Miklos Szeredi
2025-09-03 17:51 ` Darrick J. Wong
2025-08-21 0:52 ` [PATCH 6/7] fuse: propagate default and file acls on creation Darrick J. Wong
2025-09-03 16:15 ` Miklos Szeredi
2025-09-03 16:27 ` Darrick J. Wong
2025-08-21 0:52 ` [PATCH 7/7] fuse: enable FUSE_SYNCFS for all servers Darrick J. Wong
2025-08-21 22:18 ` Joanne Koong
2025-08-21 22:28 ` Darrick J. Wong
2025-08-21 22:54 ` Bernd Schubert
2025-08-21 23:31 ` Joanne Koong
2025-08-22 11:32 ` Shachar Sharon
2025-08-22 17:21 ` Joanne Koong
2025-08-26 19:31 ` Darrick J. Wong
2025-08-26 22:07 ` Joanne Koong
2025-08-27 15:18 ` Miklos Szeredi
2025-08-27 19:12 ` Darrick J. Wong
2025-08-28 14:08 ` Miklos Szeredi
2025-08-28 14:23 ` Miklos Szeredi
2025-08-28 15:01 ` Darrick J. Wong
2025-08-28 15:52 ` Joanne Koong
2025-08-21 0:47 ` [PATCHSET RFC v4 2/4] fuse: allow servers to use iomap for better file IO performance Darrick J. Wong
2025-08-21 0:52 ` [PATCH 01/23] fuse: move CREATE_TRACE_POINTS to a separate file Darrick J. Wong
2025-08-21 0:53 ` [PATCH 02/23] fuse: implement the basic iomap mechanisms Darrick J. Wong
2025-08-21 0:53 ` [PATCH 03/23] fuse: make debugging configurable at runtime Darrick J. Wong
2025-08-21 0:53 ` [PATCH 04/23] fuse: move the backing file idr and code into a new source file Darrick J. Wong
2025-08-21 7:21 ` Amir Goldstein
2025-08-21 7:42 ` Amir Goldstein
2025-08-21 16:15 ` Darrick J. Wong
2025-08-21 0:53 ` [PATCH 05/23] fuse: move the passthrough-specific code back to passthrough.c Darrick J. Wong
2025-08-21 9:05 ` Amir Goldstein
2025-08-21 16:13 ` Darrick J. Wong
2025-08-21 0:54 ` [PATCH 06/23] fuse: add an ioctl to add new iomap devices Darrick J. Wong
2025-08-21 8:09 ` Amir Goldstein
2025-08-21 16:15 ` Darrick J. Wong
2025-08-21 0:54 ` [PATCH 07/23] fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount Darrick J. Wong
2025-08-21 0:54 ` [PATCH 08/23] fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE} Darrick J. Wong
2025-08-21 0:54 ` [PATCH 09/23] fuse: implement direct IO with iomap Darrick J. Wong
2025-08-21 0:55 ` [PATCH 10/23] fuse: implement buffered " Darrick J. Wong
2025-08-21 0:55 ` [PATCH 11/23] fuse: enable caching of timestamps Darrick J. Wong
2025-08-21 0:55 ` [PATCH 12/23] fuse: implement large folios for iomap pagecache files Darrick J. Wong
2025-08-21 0:55 ` [PATCH 13/23] fuse: use an unrestricted backing device with iomap pagecache io Darrick J. Wong
2025-08-21 0:56 ` [PATCH 14/23] fuse: advertise support for iomap Darrick J. Wong
2025-08-21 0:56 ` [PATCH 15/23] fuse: query filesystem geometry when using iomap Darrick J. Wong
2025-08-21 0:56 ` [PATCH 16/23] fuse: implement fadvise for iomap files Darrick J. Wong
2025-08-21 0:56 ` [PATCH 17/23] fuse: make the root nodeid dynamic Darrick J. Wong
2025-08-21 0:57 ` [PATCH 18/23] fuse: allow setting of root nodeid Darrick J. Wong
2025-08-21 0:57 ` [PATCH 19/23] fuse: invalidate ranges of block devices being used for iomap Darrick J. Wong
2025-08-21 0:57 ` [PATCH 20/23] fuse: implement inline data file IO via iomap Darrick J. Wong
2025-08-21 0:57 ` [PATCH 21/23] fuse: allow more statx fields Darrick J. Wong
2025-08-21 0:58 ` [PATCH 22/23] fuse: support atomic writes with iomap Darrick J. Wong
2025-08-21 0:58 ` [PATCH 23/23] fuse: enable iomap Darrick J. Wong
2025-08-21 0:47 ` [PATCHSET RFC v4 3/4] fuse: cache iomap mappings for even better file IO performance Darrick J. Wong
2025-08-21 0:58 ` [PATCH 1/4] fuse: cache iomaps Darrick J. Wong
2025-08-21 0:59 ` [PATCH 2/4] fuse: use the iomap cache for iomap_begin Darrick J. Wong
2025-08-21 0:59 ` [PATCH 3/4] fuse: invalidate iomap cache after file updates Darrick J. Wong
2025-08-21 0:59 ` [PATCH 4/4] fuse: enable iomap cache management Darrick J. Wong
2025-08-21 0:48 ` [PATCHSET RFC v4 4/4] fuse: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
2025-08-21 0:59 ` [PATCH 1/6] fuse: force a ctime update after a fileattr_set call when in iomap mode Darrick J. Wong
2025-08-21 1:00 ` [PATCH 2/6] fuse: synchronize inode->i_flags after fileattr_[gs]et Darrick J. Wong
2025-08-21 1:00 ` [PATCH 3/6] fuse: cache atime when in iomap mode Darrick J. Wong
2025-08-21 1:00 ` [PATCH 4/6] fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap filesystems Darrick J. Wong
2025-08-21 1:00 ` [PATCH 5/6] fuse: update ctime when updating acls on an iomap inode Darrick J. Wong
2025-08-21 1:01 ` [PATCH 6/6] fuse: always cache ACLs when using iomap Darrick J. Wong
2025-08-21 0:48 ` [PATCHSET RFC v4 1/4] libfuse: general bug fixes Darrick J. Wong
2025-08-21 1:01 ` [PATCH 1/1] libfuse: don't put HAVE_STATX in a public header Darrick J. Wong
2025-08-21 21:39 ` Bernd Schubert
2025-08-21 22:27 ` Darrick J. Wong
2025-08-22 0:33 ` Joanne Koong
2025-08-22 12:54 ` Bernd Schubert
2025-08-26 19:43 ` Darrick J. Wong
2025-08-21 0:48 ` [PATCHSET RFC v4 2/4] libfuse: allow servers to use iomap for better file IO performance Darrick J. Wong
2025-08-21 1:01 ` [PATCH 01/21] libfuse: bump kernel and library ABI versions Darrick J. Wong
2025-08-21 1:01 ` [PATCH 02/21] libfuse: add kernel gates for FUSE_IOMAP Darrick J. Wong
2025-08-21 1:02 ` [PATCH 03/21] libfuse: add fuse commands for iomap_begin and end Darrick J. Wong
2025-08-21 1:02 ` [PATCH 04/21] libfuse: add upper level iomap commands Darrick J. Wong
2025-08-21 1:02 ` [PATCH 05/21] libfuse: add a lowlevel notification to add a new device to iomap Darrick J. Wong
2025-08-21 1:02 ` [PATCH 06/21] libfuse: add upper-level iomap add device function Darrick J. Wong
2025-08-21 1:03 ` [PATCH 07/21] libfuse: add iomap ioend low level handler Darrick J. Wong
2025-08-21 1:03 ` [PATCH 08/21] libfuse: add upper level iomap ioend commands Darrick J. Wong
2025-08-21 1:03 ` [PATCH 09/21] libfuse: add a reply function to send FUSE_ATTR_* to the kernel Darrick J. Wong
2025-08-21 1:03 ` [PATCH 10/21] libfuse: connect high level fuse library to fuse_reply_attr_iflags Darrick J. Wong
2025-08-21 1:04 ` [PATCH 11/21] libfuse: support direct I/O through iomap Darrick J. Wong
2025-08-21 1:04 ` [PATCH 12/21] libfuse: support buffered " Darrick J. Wong
2025-08-21 1:04 ` [PATCH 13/21] libfuse: don't allow hardlinking of iomap files in the upper level fuse library Darrick J. Wong
2025-08-21 1:05 ` [PATCH 14/21] libfuse: allow discovery of the kernel's iomap capabilities Darrick J. Wong
2025-08-21 1:05 ` [PATCH 15/21] libfuse: add lower level iomap_config implementation Darrick J. Wong
2025-08-21 1:05 ` [PATCH 16/21] libfuse: add upper " Darrick J. Wong
2025-08-21 1:05 ` [PATCH 17/21] libfuse: allow root_nodeid mount option Darrick J. Wong
2025-08-21 1:06 ` [PATCH 18/21] libfuse: add low level code to invalidate iomap block device ranges Darrick J. Wong
2025-08-21 1:06 ` [PATCH 19/21] libfuse: add upper-level API to invalidate parts of an iomap block device Darrick J. Wong
2025-08-21 1:06 ` [PATCH 20/21] libfuse: add strictatime/lazytime mount options Darrick J. Wong
2025-08-21 1:06 ` [PATCH 21/21] libfuse: add atomic write support Darrick J. Wong
2025-08-21 0:48 ` [PATCHSET RFC v4 3/4] libfuse: cache iomap mappings for even better file IO performance Darrick J. Wong
2025-08-21 1:07 ` [PATCH 1/2] libfuse: enable iomap cache management for lowlevel fuse Darrick J. Wong
2025-08-21 1:07 ` [PATCH 2/2] libfuse: add upper-level iomap cache management Darrick J. Wong
2025-08-21 0:49 ` [PATCHSET RFC v4 4/4] libfuse: implement syncfs Darrick J. Wong
2025-08-21 1:07 ` [PATCH 1/2] libfuse: wire up FUSE_SYNCFS to the low level library Darrick J. Wong
2025-08-21 1:07 ` [PATCH 2/2] libfuse: add syncfs support to the upper library Darrick J. Wong
2025-08-21 21:41 ` [PATCHSET RFC v4 4/4] libfuse: implement syncfs Bernd Schubert
2025-08-21 22:29 ` Darrick J. Wong
2025-08-21 0:49 ` [PATCHSET RFC v4 1/6] fuse4fs: fork a low level fuse server Darrick J. Wong
2025-08-21 1:08 ` [PATCH 01/20] fuse2fs: port fuse2fs to lowlevel libfuse API Darrick J. Wong
2025-08-21 1:08 ` [PATCH 02/20] fuse4fs: drop fuse 2.x support code Darrick J. Wong
2025-08-21 1:08 ` [PATCH 03/20] fuse4fs: namespace some helpers Darrick J. Wong
2025-08-21 1:08 ` [PATCH 04/20] fuse4fs: convert to low level API Darrick J. Wong
2025-08-21 1:09 ` [PATCH 05/20] libsupport: port the kernel list.h to libsupport Darrick J. Wong
2025-08-21 1:09 ` [PATCH 06/20] libsupport: add a cache Darrick J. Wong
2025-08-21 1:09 ` [PATCH 07/20] cache: disable debugging Darrick J. Wong
2025-08-21 1:09 ` [PATCH 08/20] cache: use modern list iterator macros Darrick J. Wong
2025-08-21 1:10 ` [PATCH 09/20] cache: embed struct cache in the owner Darrick J. Wong
2025-08-21 1:10 ` [PATCH 10/20] cache: pass cache pointer to callbacks Darrick J. Wong
2025-08-21 1:10 ` [PATCH 11/20] cache: pass a private data pointer through cache_walk Darrick J. Wong
2025-08-21 1:11 ` [PATCH 12/20] cache: add a helper to grab a new refcount for a cache_node Darrick J. Wong
2025-08-21 1:11 ` [PATCH 13/20] cache: return results of a cache flush Darrick J. Wong
2025-08-21 1:11 ` [PATCH 14/20] cache: add a "get only if incore" flag to cache_node_get Darrick J. Wong
2025-08-21 1:11 ` [PATCH 15/20] cache: support gradual expansion Darrick J. Wong
2025-08-21 1:12 ` [PATCH 16/20] cache: implement automatic shrinking Darrick J. Wong
2025-08-21 1:12 ` [PATCH 17/20] fuse4fs: add cache to track open files Darrick J. Wong
2025-08-21 1:12 ` [PATCH 18/20] fuse4fs: use the orphaned inode list Darrick J. Wong
2025-08-21 1:12 ` [PATCH 19/20] fuse4fs: implement FUSE_TMPFILE Darrick J. Wong
2025-08-21 1:13 ` [PATCH 20/20] fuse4fs: create incore reverse orphan list Darrick J. Wong
2025-08-21 0:49 ` [PATCHSET RFC v4 2/6] libext2fs: refactoring for fuse2fs iomap support Darrick J. Wong
2025-08-21 1:13 ` [PATCH 01/10] libext2fs: make it possible to extract the fd from an IO manager Darrick J. Wong
2025-08-21 1:13 ` [PATCH 02/10] libext2fs: always fsync the device when flushing the cache Darrick J. Wong
2025-08-21 1:13 ` [PATCH 03/10] libext2fs: always fsync the device when closing the unix IO manager Darrick J. Wong
2025-08-21 1:14 ` [PATCH 04/10] libext2fs: only fsync the unix fd if we wrote to the device Darrick J. Wong
2025-08-21 1:14 ` [PATCH 05/10] libext2fs: invalidate cached blocks when freeing them Darrick J. Wong
2025-08-21 1:14 ` [PATCH 06/10] libext2fs: only flush affected blocks in unix_write_byte Darrick J. Wong
2025-08-21 1:14 ` [PATCH 07/10] libext2fs: allow unix_write_byte when the write would be aligned Darrick J. Wong
2025-08-21 1:15 ` [PATCH 08/10] libext2fs: allow clients to ask to write full superblocks Darrick J. Wong
2025-08-21 1:15 ` [PATCH 09/10] libext2fs: allow callers to disallow I/O to file data blocks Darrick J. Wong
2025-08-21 1:15 ` [PATCH 10/10] libext2fs: add posix advisory locking to the unix IO manager Darrick J. Wong
2025-08-21 0:49 ` [PATCHSET RFC v4 3/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
2025-08-21 1:15 ` [PATCH 01/19] fuse2fs: implement bare minimum iomap for file mapping reporting Darrick J. Wong
2025-08-21 1:16 ` [PATCH 02/19] fuse2fs: add iomap= mount option Darrick J. Wong
2025-08-21 1:16 ` [PATCH 03/19] fuse2fs: implement iomap configuration Darrick J. Wong
2025-08-21 1:16 ` [PATCH 04/19] fuse2fs: register block devices for use with iomap Darrick J. Wong
2025-08-21 1:17 ` [PATCH 05/19] fuse2fs: implement directio file reads Darrick J. Wong
2025-08-21 1:17 ` [PATCH 06/19] fuse2fs: add extent dump function for debugging Darrick J. Wong
2025-08-21 1:17 ` [PATCH 07/19] fuse2fs: implement direct write support Darrick J. Wong
2025-08-21 1:17 ` [PATCH 08/19] fuse2fs: turn on iomap for pagecache IO Darrick J. Wong
2025-08-21 1:18 ` [PATCH 09/19] fuse2fs: don't zero bytes in punch hole Darrick J. Wong
2025-08-21 1:18 ` [PATCH 10/19] fuse2fs: don't do file data block IO when iomap is enabled Darrick J. Wong
2025-08-21 1:18 ` [PATCH 11/19] fuse2fs: avoid fuseblk mode if fuse-iomap support is likely Darrick J. Wong
2025-08-21 1:18 ` [PATCH 12/19] fuse2fs: enable file IO to inline data files Darrick J. Wong
2025-08-21 1:19 ` [PATCH 13/19] fuse2fs: set iomap-related inode flags Darrick J. Wong
2025-08-21 1:19 ` [PATCH 14/19] fuse2fs: add strictatime/lazytime mount options Darrick J. Wong
2025-08-21 1:19 ` [PATCH 15/19] fuse2fs: configure block device block size Darrick J. Wong
2025-08-21 1:19 ` [PATCH 16/19] fuse4fs: don't use inode number translation when possible Darrick J. Wong
2025-08-21 1:20 ` [PATCH 17/19] fuse4fs: separate invalidation Darrick J. Wong
2025-08-21 1:20 ` [PATCH 18/19] fuse2fs: implement statx Darrick J. Wong
2025-08-21 1:20 ` [PATCH 19/19] fuse2fs: enable atomic writes Darrick J. Wong
2025-08-21 0:50 ` [PATCHSET RFC v4 4/6] fuse2fs: use fuse iomap data paths for better file I/O performance Darrick J. Wong
2025-08-21 1:20 ` [PATCH 1/2] fuse2fs: enable caching of iomaps Darrick J. Wong
2025-08-21 1:21 ` [PATCH 2/2] fuse2fs: be smarter about caching iomaps Darrick J. Wong
2025-08-21 0:50 ` [PATCHSET RFC v4 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled Darrick J. Wong
2025-08-21 1:21 ` [PATCH 1/8] fuse2fs: skip permission checking on utimens " Darrick J. Wong
2025-08-21 1:21 ` [PATCH 2/8] fuse2fs: let the kernel tell us about acl/mode updates Darrick J. Wong
2025-08-21 1:21 ` [PATCH 3/8] fuse2fs: better debugging for file mode updates Darrick J. Wong
2025-08-21 1:22 ` [PATCH 4/8] fuse2fs: debug timestamp updates Darrick J. Wong
2025-08-21 1:22 ` [PATCH 5/8] fuse2fs: use coarse timestamps for iomap mode Darrick J. Wong
2025-08-21 1:22 ` [PATCH 6/8] fuse2fs: add tracing for retrieving timestamps Darrick J. Wong
2025-08-21 1:23 ` [PATCH 7/8] fuse2fs: enable syncfs Darrick J. Wong
2025-08-21 1:23 ` [PATCH 8/8] fuse2fs: skip the gdt write in op_destroy if syncfs is working Darrick J. Wong
2025-08-21 0:50 ` [PATCHSET RFC v4 6/6] fuse2fs: improve block and inode caching Darrick J. Wong
2025-08-21 1:23 ` [PATCH 1/6] libsupport: add caching IO manager Darrick J. Wong
2025-08-21 1:23 ` [PATCH 2/6] iocache: add the actual buffer cache Darrick J. Wong
2025-08-21 1:24 ` [PATCH 3/6] iocache: bump buffer mru priority every 50 accesses Darrick J. Wong
2025-08-21 1:24 ` [PATCH 4/6] fuse2fs: enable caching IO manager Darrick J. Wong
2025-08-21 1:24 ` [PATCH 5/6] fuse2fs: increase inode cache size Darrick J. Wong
2025-08-21 1:24 ` [PATCH 6/6] libext2fs: improve caching for inodes Darrick J. Wong
-- strict thread matches above, loose matches on Subject: below --
2025-07-17 23:23 [PATCHSET RFC v3 1/4] fuse: fixes and cleanups ahead of iomap support Darrick J. Wong
2025-07-17 23:27 ` [PATCH 4/7] fuse: implement file attributes mask for statx Darrick J. Wong
2025-08-18 15:11 ` Miklos Szeredi
2025-08-18 20:01 ` Darrick J. Wong
2025-08-18 20:04 ` Darrick J. Wong
2025-08-19 15:01 ` Miklos Szeredi
2025-08-19 22:51 ` Darrick J. Wong
2025-08-20 9:16 ` Miklos Szeredi
2025-08-20 9:40 ` Miklos Szeredi
2025-08-20 15:16 ` Darrick J. Wong
2025-08-20 15:31 ` Miklos Szeredi
2025-08-20 15:09 ` Darrick J. Wong
2025-08-20 15:23 ` Miklos Szeredi
2025-08-20 15:29 ` Darrick J. Wong