Is NFSv4.2's clone_blksize per-file or per-file-system?

* Is NFSv4.2's clone_blksize per-file or per-file-system?
@ 2025-08-09  1:47 Rick Macklem
       [not found] ` <CAABAsM5nzVzPDB3Ubeqg35F7Qd8pBveiYPi1M+KFnMPjb2dxXw@mail.gmail.com>
  0 siblings, 1 reply; 8+ messages in thread
From: Rick Macklem @ 2025-08-09  1:47 UTC (permalink / raw)
  To: NFSv4, Linux NFS Mailing List

Hi,

I'm looking at RFC7862 and I cannot find where it
states if the clone_blksize attribute is per-file or
per-file-system.

If it is not in the RFC, which do others think it is?
(Or maybe, if you have implemented CLONE,
which does your implementation assume?)

In case you are wondering why I am asking,
it turns out that files in a ZFS volume can have
different block sizes. (It can be changed after the
file system is created.)

Thanks, rick

* Re: Is NFSv4.2's clone_blksize per-file or per-file-system?
       [not found] ` <CAABAsM5nzVzPDB3Ubeqg35F7Qd8pBveiYPi1M+KFnMPjb2dxXw@mail.gmail.com>
@ 2025-08-09  3:46   ` Rick Macklem
       [not found]     ` <CADaq8jeAhdOLrD9Y6o1xJsMuGYZLoJdMAonfB5RuX63xV_i0UA@mail.gmail.com>
  0 siblings, 1 reply; 8+ messages in thread
From: Rick Macklem @ 2025-08-09  3:46 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: NFSv4, Linux NFS Mailing List

On Fri, Aug 8, 2025 at 8:38 PM Trond Myklebust <trondmy@gmail.com> wrote:
> Yes, but since ZFS only supports filesystem level snapshots, and not actual file cloning, does that matter to anything?
ZFS now has a feature it calls block cloning, which does clone file ranges.
(It was only added recently. I do not know whether the Linux port uses it yet.)

rick

>
> Cheers
>   Trond

* Re: [nfsv4] Is NFSv4.2's clone_blksize per-file or per-file-system?
       [not found]     ` <CADaq8jeAhdOLrD9Y6o1xJsMuGYZLoJdMAonfB5RuX63xV_i0UA@mail.gmail.com>
@ 2025-08-09 21:02       ` Rick Macklem
       [not found]         ` <CADaq8jdfV0EjVehzGNFw2MxKZvc_Dj-t6Af0NqNKe3oZ66xDMQ@mail.gmail.com>
  2025-08-09 21:49       ` Rick Macklem
  1 sibling, 1 reply; 8+ messages in thread
From: Rick Macklem @ 2025-08-09 21:02 UTC (permalink / raw)
  To: David Noveck; +Cc: Trond Myklebust, NFSv4, Linux NFS Mailing List

On Sat, Aug 9, 2025 at 1:12 PM David Noveck <davenoveck@gmail.com> wrote:
>  Before you told us about ZFS,  I would have assumed per-fs.
>
> Given the uncertainty in the spec, you may wind up dealing with clients that assume it is per-fs.
>
> Although this is not a catastrophe, you might want to file an errata report explaining the negative consequences of assuming this is per-fs. It won't get into a spec for a long while, but it does provide as much warning as you can give right now.
>
>> >> In case you are wondering why I am asking,
>> >> it turns out that files in a ZFS volume can have
>> >> different block sizes. (It can be changed after the
>> >> file system is created.)
>
>
> The guy who allowed that probably thinks it's a helpful feature.  Sigh!
It's not just that it can be changed after creation; it turns out to be based
on file size as well.  A small file gets 512 bytes and a larger one gets a full
record (128K on my test system).

And, yes, block cloning requires alignment to 512 bytes or 128 Kbytes,
depending on the file.

I can return 128K for clone_blksize and that will (sub-optimally) handle
the 512-byte case, but I think it is also possible to increase the record
size beyond 128K after the file system has files in it.
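
To illustrate why advertising the larger value still covers the 512-byte case: any offset that is a multiple of 128K is also a multiple of 512, but not the other way around. This is only a sketch; `is_aligned` is a hypothetical helper, and the two sizes are simply the values mentioned above.

```python
SMALL_BLKSIZE = 512        # block size reported for small files (from the text above)
RECORD_SIZE = 128 * 1024   # full record size on the test system described above


def is_aligned(offset: int, blksize: int) -> bool:
    """Return True if offset is an exact multiple of blksize."""
    return offset % blksize == 0


# Every 128K-aligned offset is also 512-aligned ...
assert all(is_aligned(n * RECORD_SIZE, SMALL_BLKSIZE) for n in range(1024))
# ... but a 512-aligned offset is usually not 128K-aligned.
print(is_aligned(4096, SMALL_BLKSIZE), is_aligned(4096, RECORD_SIZE))  # True False
```

The cost of the larger value is that ranges a 512-byte-block file could otherwise accept are rejected, which is the "sub-optimal" part.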

I'll take a look at the Linux client to try and see if/how it uses
clone_blksize.  I need to decide if I should always return 128K
(or whatever the full recordsize is) or 512 for the small files.

Thanks for the comments, rick

* Re: [nfsv4] Is NFSv4.2's clone_blksize per-file or per-file-system?
       [not found]     ` <CADaq8jeAhdOLrD9Y6o1xJsMuGYZLoJdMAonfB5RuX63xV_i0UA@mail.gmail.com>
  2025-08-09 21:02       ` [nfsv4] " Rick Macklem
@ 2025-08-09 21:49       ` Rick Macklem
  1 sibling, 0 replies; 8+ messages in thread
From: Rick Macklem @ 2025-08-09 21:49 UTC (permalink / raw)
  To: David Noveck; +Cc: Trond Myklebust, NFSv4, Linux NFS Mailing List

On Sat, Aug 9, 2025 at 1:12 PM David Noveck <davenoveck@gmail.com> wrote:
>  Before you told us about ZFS,  I would have assumed per-fs.
>
> Given the uncertainty in the spec, you may wind up dealing with clients that assume it is per-fs.
Actually, it looks like the Linux client assumes per-server.
I'm going to ask over on linux-nfs@ to see what the implications of
getting the value wrong are. All I can see is that remap_file_range will
return EINVAL if a request isn't aligned.
--> The NFSv4.2 server would also reply with EINVAL if the alignment is not
      correct, I think? (Actually, RFC 7862 says it must be aligned, but fails
      to specify the error reply for this case. However, it specifies
      NFS4ERR_INVAL for other cases, so I think it makes sense to return that.)

      Hopefully someone over on linux-nfs@ will know if returning NFS4ERR_INVAL
      for CLONE from the NFSv4.2 server will be more serious than a return of
      EINVAL for remap_file_range. (Other than a wasted RPC round trip.)
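
As a rough sketch of the server-side behaviour proposed above (not taken from any real server): check both offsets against the advertised clone_blksize and reply NFS4ERR_INVAL on a mismatch. The function `check_clone_alignment` is hypothetical, treating a non-positive clone_blksize as "no constraint" is an assumption, and the rules for the count are left out for brevity. The numeric status values are the ones defined by the NFSv4 protocol.

```python
NFS4_OK = 0
NFS4ERR_INVAL = 22  # NFSv4 status code for an invalid argument


def check_clone_alignment(src_offset: int, dst_offset: int, clone_blksize: int) -> int:
    """Hypothetical helper: status a server might return for a CLONE request."""
    if clone_blksize <= 0:
        # Assumption: no usable clone_blksize means no alignment rule to enforce.
        return NFS4_OK
    if src_offset % clone_blksize or dst_offset % clone_blksize:
        # Misaligned source or destination offset.
        return NFS4ERR_INVAL
    return NFS4_OK
```

A client that checks alignment locally first would fail with EINVAL before sending anything; a client that does not would get the same outcome from the server after one round trip.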

rick

* Re: [nfsv4] Is NFSv4.2's clone_blksize per-file or per-file-system?
       [not found]         ` <CADaq8jdfV0EjVehzGNFw2MxKZvc_Dj-t6Af0NqNKe3oZ66xDMQ@mail.gmail.com>
@ 2025-08-10 14:32           ` Rick Macklem
  2025-08-10 14:52             ` Rick Macklem
  0 siblings, 1 reply; 8+ messages in thread
From: Rick Macklem @ 2025-08-10 14:32 UTC (permalink / raw)
  To: David Noveck; +Cc: Trond Myklebust, NFSv4, Linux NFS Mailing List

On Sun, Aug 10, 2025 at 6:58 AM David Noveck <davenoveck@gmail.com> wrote:
> I don't see the point of returning anything but 128K given what you said above.
> If a file has to be smaller than 512 to merit the 512 block size, it could still be cloned with a 128K clone_blksize.  The spec makes an exception for the last block of a file being shorter than the block size, so returning a 512-byte clone_blksize is unnecessary.
I'll be experimenting with it soon.
What I do not know (you could write what I know about ZFS on a
postage stamp;-) is whether the blksize for a file changes as it
grows.
--> So the problem is that a file might get 512 because it is small when
     first created and then grow large. Again, I do not currently know
     what determines the blksize. It might be set by the first write being
     less than a record size when the file is created, or maybe it switches
     to recordsize (128K in my case) when the file grows beyond 128K, or ???
     - I do know that ZFS allocates new blocks whenever data is written
       to a file, even if the file is not growing. (Which is why it cannot
       support ALLOCATE at this time and probably never will.)

I'll be poking at it. For now, I just do not know, rick


* Re: [nfsv4] Is NFSv4.2's clone_blksize per-file or per-file-system?
  2025-08-10 14:32           ` Rick Macklem
@ 2025-08-10 14:52             ` Rick Macklem
  2025-08-10 15:27               ` Rick Macklem
  0 siblings, 1 reply; 8+ messages in thread
From: Rick Macklem @ 2025-08-10 14:52 UTC (permalink / raw)
  To: David Noveck; +Cc: Trond Myklebust, NFSv4, Linux NFS Mailing List

On Sun, Aug 10, 2025 at 7:32 AM Rick Macklem <rick.macklem@gmail.com> wrote:
> I'll be poking at it. For now, I just do not know, rick
I should have done a scan before posting.
I just ran a little program that printed out the blksize of every
regular file in a ZFS file system.
It turns out that the blksize is any exact multiple of 512 up to
128K (the record size for the volume).
Since most are C sources or objects, most are less than 128K.
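
A scan like the one described could look roughly like the sketch below. It is not the program that was actually run: it assumes the per-file block size is what stat(2) reports in st_blksize (available on Unix-like systems), and `scan_blksizes` is a hypothetical name.

```python
import os
import stat
from collections import Counter


def scan_blksizes(root: str) -> Counter:
    """Count regular files under root, keyed by the st_blksize that stat reports."""
    counts: Counter = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat so symlinks are not followed
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode):
                counts[st.st_blksize] += 1
    return counts
```

Calling `scan_blksizes()` on the mount point of the file system and printing the sorted items gives a histogram of block sizes, which is enough to see whether values other than the record size occur.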

If I return 128K, then most files would not be CLONEable unless
the CLONE is for the entire file.
Of course, I do not currently know how clients actually use
clone_blksize either. (Do they check alignment using it before
doing a CLONE or ???)

I'll be playing around with CLONE for both FreeBSD and Linux
in the coming days.
I'll post if/when I have useful info, rick

* Re: [nfsv4] Is NFSv4.2's clone_blksize per-file or per-file-system?
  2025-08-10 14:52             ` Rick Macklem
@ 2025-08-10 15:27               ` Rick Macklem
  2025-08-10 15:38                 ` Rick Macklem
  0 siblings, 1 reply; 8+ messages in thread
From: Rick Macklem @ 2025-08-10 15:27 UTC (permalink / raw)
  To: David Noveck; +Cc: Trond Myklebust, NFSv4, Linux NFS Mailing List

On Sun, Aug 10, 2025 at 7:52 AM Rick Macklem <rick.macklem@gmail.com> wrote:
> If I return 128K, then most files would not be CLONEable unless
> the CLONE is for the entire file.
It appears that your suggestion of 128K is correct for ZFS.
I am still not sure, but it appears that, for files up to 128K,
the files are a single block (which is any multiple of 512).
--> As such, only the entire small file can be cloned.

So, returning 128K for all files in the file system seems like
it will be the correct choice.
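
To make the consequence concrete, here is a small sketch of the alignment rule as it has been described in this thread: offsets must be multiples of clone_blksize, and the length must be too unless the range ends at end-of-file. `can_clone` is a hypothetical helper, and the exact wording of the rule should be checked against RFC 7862 rather than taken from this code.

```python
CLONE_BLKSIZE = 128 * 1024  # value the server would advertise in this scenario


def can_clone(src_offset: int, dst_offset: int, count: int, src_size: int,
              blksize: int = CLONE_BLKSIZE) -> bool:
    """Hypothetical check: would this range pass the alignment rule?"""
    if src_offset % blksize or dst_offset % blksize:
        return False
    ends_at_eof = src_offset + count == src_size
    return count % blksize == 0 or ends_at_eof


# A 20000-byte file: only a clone of the whole file qualifies.
print(can_clone(0, 0, 20000, 20000))       # True
print(can_clone(0, 0, 512, 20000))         # False
print(can_clone(512, 512, 19488, 20000))   # False
```

With 128K advertised, a file smaller than 128K passes only for offset 0 and a count that reaches end-of-file, which matches the observation that only the entire small file can be cloned.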

It still leaves the per-filesystem vs per-server question
since (if I read it correctly) the Linux client uses clone_blksize
per-server (and not per-server file system).

I do not think per-server is the correct choice, since different
file systems on a server could have different block sizes.

rick

* Re: [nfsv4] Is NFSv4.2's clone_blksize per-file or per-file-system?
  2025-08-10 15:27               ` Rick Macklem
@ 2025-08-10 15:38                 ` Rick Macklem
  0 siblings, 0 replies; 8+ messages in thread
From: Rick Macklem @ 2025-08-10 15:38 UTC (permalink / raw)
  To: David Noveck; +Cc: Trond Myklebust, NFSv4, Linux NFS Mailing List

On Sun, Aug 10, 2025 at 8:27 AM Rick Macklem <rick.macklem@gmail.com> wrote:
> It still leaves the per-filesystem vs per-server question
> since (if I read it correctly) the Linux client uses clone_blksize
> per-server (and not per-server file system).
Actually, there's a good chance I got this wrong. I recall that
the Linux client creates a separate "mount" that shows up in
places like "df" for every server file system.
So, it is fairly likely that the Linux client is per-file system.

Maybe someone like Trond can clarify this w.r.t. the Linux client?

rick
