All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev
@ 2025-08-20  7:23 Friendy Su
  2025-08-20  7:44 ` Gao Xiang
  0 siblings, 1 reply; 12+ messages in thread
From: Friendy Su @ 2025-08-20  7:23 UTC (permalink / raw)
  To: linux-erofs; +Cc: Friendy Su, Yuezhang Mo, Daniel Palmer

Set proper 'dsunit' to let file body align on huge page on blobdev,

where 'dsunit' * 'blocksize' = huge page size (2M).

When do mmap() a file mounted with dax=always, aligning on huge page
makes kernel map huge page(2M) per page fault exception, compared with
mapping normal page(4K) per page fault.

This greatly improves mmap() performance by reducing times of page
fault being triggered.

Signed-off-by: Friendy Su <friendy.su@sony.com>
Reviewed-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Daniel Palmer <daniel.palmer@sony.com>
---
 lib/blobchunk.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/lib/blobchunk.c b/lib/blobchunk.c
index bbc69cf..e47afb5 100644
--- a/lib/blobchunk.c
+++ b/lib/blobchunk.c
@@ -309,6 +309,21 @@ int erofs_blob_write_chunked_file(struct erofs_inode *inode, int fd,
 	minextblks = BLK_ROUND_UP(sbi, inode->i_size);
 	interval_start = 0;
 
+	/* Align file on 'dsunit' */
+	if (sbi->bmgr->dsunit > 1) {
+		off_t off = lseek(blobfile, 0, SEEK_CUR);
+
+		erofs_dbg("Try to round up 0x%llx to align on %d blocks (dsunit)",
+				off, sbi->bmgr->dsunit);
+		off = roundup(off, sbi->bmgr->dsunit * erofs_blksiz(sbi));
+		if (lseek(blobfile, off, SEEK_SET) != off) {
+			ret = -errno;
+			erofs_err("lseek to blobdev 0x%llx error", off);
+			goto err;
+		}
+		erofs_dbg("Aligned on 0x%llx", off);
+	}
+
 	for (pos = 0; pos < inode->i_size; pos += len) {
 #ifdef SEEK_DATA
 		off_t offset = lseek(fd, pos + startoff, SEEK_DATA);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev
  2025-08-20  7:23 [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev Friendy Su
@ 2025-08-20  7:44 ` Gao Xiang
  2025-08-20  9:00   ` Friendy.Su
  0 siblings, 1 reply; 12+ messages in thread
From: Gao Xiang @ 2025-08-20  7:44 UTC (permalink / raw)
  To: Friendy Su, linux-erofs; +Cc: Yuezhang Mo, Daniel Palmer

Hi Friendly,

On 2025/8/20 15:23, Friendy Su wrote:
> Set proper 'dsunit' to let file body align on huge page on blobdev,
> 
> where 'dsunit' * 'blocksize' = huge page size (2M).
> 
> When do mmap() a file mounted with dax=always, aligning on huge page
> makes kernel map huge page(2M) per page fault exception, compared with
> mapping normal page(4K) per page fault.
> 
> This greatly improves mmap() performance by reducing times of page
> fault being triggered.
> 
> Signed-off-by: Friendy Su <friendy.su@sony.com>
> Reviewed-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
> Reviewed-by: Daniel Palmer <daniel.palmer@sony.com>

Sigh, thanks for the patch! Actually the blob chunk implementation
(this file) is now an implementation debt since:

  1) In principle, chunks (and deduplicated pclusters) should
     across blobs (considering the main device is also a blob);

  2) Each blob should have its own space allocation context
     which is independent to the in-memory chunk records...

My mid-term plan tends to use the current metabox way
(i.e. use buffer management for all extra blobs too..)

> ---
>   lib/blobchunk.c | 15 +++++++++++++++
>   1 file changed, 15 insertions(+)
> 
> diff --git a/lib/blobchunk.c b/lib/blobchunk.c
> index bbc69cf..e47afb5 100644
> --- a/lib/blobchunk.c
> +++ b/lib/blobchunk.c
> @@ -309,6 +309,21 @@ int erofs_blob_write_chunked_file(struct erofs_inode *inode, int fd,
>   	minextblks = BLK_ROUND_UP(sbi, inode->i_size);
>   	interval_start = 0;
>   
> +	/* Align file on 'dsunit' */
> +	if (sbi->bmgr->dsunit > 1) {
> +		off_t off = lseek(blobfile, 0, SEEK_CUR);
> +
> +		erofs_dbg("Try to round up 0x%llx to align on %d blocks (dsunit)",
> +				off, sbi->bmgr->dsunit);
> +		off = roundup(off, sbi->bmgr->dsunit * erofs_blksiz(sbi));
> +		if (lseek(blobfile, off, SEEK_SET) != off) {
> +			ret = -errno;
> +			erofs_err("lseek to blobdev 0x%llx error", off);
> +			goto err;
> +		}
> +		erofs_dbg("Aligned on 0x%llx", off);
> +	}

But since you have use case on the current chunk
approach, so...

As for this patch, what if the inode itself is
chunk-deduplicated, could we apply this if the inode
only has one new chunk instead at least for now?

Thanks,
Gao Xiang


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev
  2025-08-20  7:44 ` Gao Xiang
@ 2025-08-20  9:00   ` Friendy.Su
  2025-08-20  9:12     ` Gao Xiang
  0 siblings, 1 reply; 12+ messages in thread
From: Friendy.Su @ 2025-08-20  9:00 UTC (permalink / raw)
  To: Gao Xiang, linux-erofs@lists.ozlabs.org
  Cc: Yuezhang.Mo@sony.com, Daniel.Palmer@sony.com

Hi, Gao, 

Thanks for your review !

> As for this patch, what if the inode itself is
> chunk-deduplicated, could we apply this if the inode
> only has one new chunk instead at least for now?

Do you mean inode has 3 chunks, chunk#2 and chunk#3 duplicate chunck#1?

This patch only makes the 1st chunk exactly written to blobdev aligned on dsunit. Deduplicated chunks will not be written to blobdev.

In example above, chunk#1 is written to dsunit aligned block addr, chunk#2,#3 are not written. The next file will be written from next dsunit alignment addr.

Best Regards
Friendy Su

________________________________________
From: Gao Xiang <hsiangkao@linux.alibaba.com>
Sent: Wednesday, August 20, 2025 15:44
To: Su, Friendy; linux-erofs@lists.ozlabs.org
Cc: Mo, Yuezhang; Palmer, Daniel (SGC)
Subject: Re: [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev

Hi Friendly, On 2025/8/20 15: 23, Friendy Su wrote: > Set proper 'dsunit' to let file body align on huge page on blobdev, > > where 'dsunit' * 'blocksize' = huge page size (2M). > > When do mmap() a file mounted with dax=always,


Hi Friendly,

On 2025/8/20 15:23, Friendy Su wrote:
> Set proper 'dsunit' to let file body align on huge page on blobdev,
>
> where 'dsunit' * 'blocksize' = huge page size (2M).
>
> When do mmap() a file mounted with dax=always, aligning on huge page
> makes kernel map huge page(2M) per page fault exception, compared with
> mapping normal page(4K) per page fault.
>
> This greatly improves mmap() performance by reducing times of page
> fault being triggered.
>
> Signed-off-by: Friendy Su <friendy.su@sony.com>
> Reviewed-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
> Reviewed-by: Daniel Palmer <daniel.palmer@sony.com>

Sigh, thanks for the patch! Actually the blob chunk implementation
(this file) is now an implementation debt since:

  1) In principle, chunks (and deduplicated pclusters) should
     across blobs (considering the main device is also a blob);

  2) Each blob should have its own space allocation context
     which is independent to the in-memory chunk records...

My mid-term plan tends to use the current metabox way
(i.e. use buffer management for all extra blobs too..)

> ---
>   lib/blobchunk.c | 15 +++++++++++++++
>   1 file changed, 15 insertions(+)
>
> diff --git a/lib/blobchunk.c b/lib/blobchunk.c
> index bbc69cf..e47afb5 100644
> --- a/lib/blobchunk.c
> +++ b/lib/blobchunk.c
> @@ -309,6 +309,21 @@ int erofs_blob_write_chunked_file(struct erofs_inode *inode, int fd,
>       minextblks = BLK_ROUND_UP(sbi, inode->i_size);
>       interval_start = 0;
>
> +     /* Align file on 'dsunit' */
> +     if (sbi->bmgr->dsunit > 1) {
> +             off_t off = lseek(blobfile, 0, SEEK_CUR);
> +
> +             erofs_dbg("Try to round up 0x%llx to align on %d blocks (dsunit)",
> +                             off, sbi->bmgr->dsunit);
> +             off = roundup(off, sbi->bmgr->dsunit * erofs_blksiz(sbi));
> +             if (lseek(blobfile, off, SEEK_SET) != off) {
> +                     ret = -errno;
> +                     erofs_err("lseek to blobdev 0x%llx error", off);
> +                     goto err;
> +             }
> +             erofs_dbg("Aligned on 0x%llx", off);
> +     }

But since you have use case on the current chunk
approach, so...

As for this patch, what if the inode itself is
chunk-deduplicated, could we apply this if the inode
only has one new chunk instead at least for now?

Thanks,
Gao Xiang



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev
  2025-08-20  9:00   ` Friendy.Su
@ 2025-08-20  9:12     ` Gao Xiang
  2025-08-20  9:38       ` Friendy.Su
  0 siblings, 1 reply; 12+ messages in thread
From: Gao Xiang @ 2025-08-20  9:12 UTC (permalink / raw)
  To: Friendy.Su@sony.com, linux-erofs@lists.ozlabs.org
  Cc: Yuezhang.Mo@sony.com, Daniel.Palmer@sony.com

Hi,

On 2025/8/20 17:00, Friendy.Su@sony.com wrote:
> Hi, Gao,
> 
> Thanks for your review !
> 
>> As for this patch, what if the inode itself is
>> chunk-deduplicated, could we apply this if the inode
>> only has one new chunk instead at least for now?
> 
> Do you mean inode has 3 chunks, chunk#2 and chunk#3 duplicate chunck#1?
> 
> This patch only makes the 1st chunk exactly written to blobdev aligned on dsunit. Deduplicated chunks will not be written to blobdev.
> 
> In example above, chunk#1 is written to dsunit aligned block addr, chunk#2,#3 are not written. The next file will be written from next dsunit alignment addr.

What's your `--chunksize` ? consider the following:

  chunksize = 4096
  dsunit = 512 = 2M

and two inodes:

inode A (8k)    2M, 2M+4k

inode B (12k)   4M, 2M, 4M+4k, 4M+8k?

Is it possible? what's the expected behavior of
this case.

Thanks,
Gao Xiang

> 
> Best Regards
> Friendy Su


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev
  2025-08-20  9:12     ` Gao Xiang
@ 2025-08-20  9:38       ` Friendy.Su
  2025-08-21  2:00         ` Gao Xiang
  0 siblings, 1 reply; 12+ messages in thread
From: Friendy.Su @ 2025-08-20  9:38 UTC (permalink / raw)
  To: Gao Xiang, linux-erofs@lists.ozlabs.org
  Cc: Yuezhang.Mo@sony.com, Daniel.Palmer@sony.com

Hi, Gao,

> What's your `--chunksize` ? consider the following:

  chunksize = 4096
  dsunit = 512 = 2M

> and two inodes:

> inode A (8k)    2M, 2M+4k

> inode B (12k)   4M, 2M, 4M+4k, 4M+8k?

> Is it possible? what's the expected behavior of
> this case.

Yes. This is the expected behavior. See runtime below:

1. Created duplicated chunk:
root@localhost:~# dd if=A-8k of=B-12k seek=1 bs=4k count=1 conv=notrunc

inode A-8k    (A-chunk#1, A-chunk#2)
inode B-12k   (B-chunk#1, B-chunk#2(=A-chunk#1), B-chunk#3)

2. mkfs.erofs:
root@localhost:~# ./install/usr/local/bin/mkfs.erofs -d 7 --blobdev /dev/sdb1 --dsunit 512 --chunksize 4096 -Enoinline_data /dev/sdb2 dir/
mkfs.erofs 1.8.10-g08df294b
<I> erofs_io: successfully to open /dev/sdb2
	c_version:           [1.8.10-g08df294b]
	c_dbg_lvl:           [       7]
	c_dry_run:           [       0]
<D> erofs: file /root/dir/A-8k added (type 1)
<D> erofs: file /root/dir/B-12k added (type 1)
<D> erofs: Inline tail-end data (60 bytes) to /root/dir
<I> erofs: file / dumped (mode 40755)
<D> erofs: Try to round up 0x0 to align on 512 blocks (dsunit)
<D> erofs: Aligned on 0x0
<D> erofs: Writing chunk (4096 bytes) to 0       <--- A-chunk#1
<D> erofs: Writing chunk (4096 bytes) to 1       <--- A-chunk#2
<I> erofs: file /A-8k dumped (mode 100644)
<D> erofs: Try to round up 0x2000 to align on 512 blocks (dsunit)
<D> erofs: Aligned on 0x200000                       <--- Align on dsunit 2M
<D> erofs: Writing chunk (4096 bytes) to 512   <--- B-chunk#1
<D> erofs: Found duplicated chunk at 0            <--- B-chunk#2 deduplicated 
<D> erofs: Writing chunk (4096 bytes) to 513    <--- B-chunk#3
<I> erofs: file /B-12k dumped (mode 100644)
<D> erofs: Assign nid 44 to file /root/dir/A-8k (mode 100644)
<D> erofs: Assign nid 47 to file /root/dir/B-12k (mode 100644)
<I> erofs: superblock checksum 0x1e00ad75 written
------
Filesystem UUID: 8baa7ef7-bd06-4b51-b0da-78202eb3b8fa
Filesystem total blocks: 515 (of 4096-byte blocks)
Filesystem total inodes: 3
Filesystem total metadata blocks: 1
Filesystem total deduplicated bytes (of source files): 4096

3. mount & check:
root@localhost:~# mount -o device=/dev/sdb1 /dev/sdb2 /mnt/test
root@localhost:~# diff /mnt/test/A-8k dir/A-8k 
root@localhost:~# diff /mnt/test/B-12k dir/B-12k 

Best Regards
Friendy Su
________________________________________
From: Gao Xiang <hsiangkao@linux.alibaba.com>
Sent: Wednesday, August 20, 2025 17:12
To: Su, Friendy; linux-erofs@lists.ozlabs.org
Cc: Mo, Yuezhang; Palmer, Daniel (SGC)
Subject: Re: [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev

Hi, On 2025/8/20 17: 00, Friendy. Su@ sony. com wrote: > Hi, Gao, > > Thanks for your review ! > >> As for this patch, what if the inode itself is >> chunk-deduplicated, could we apply this if the inode >> only has


Hi,

On 2025/8/20 17:00, Friendy.Su@sony.com wrote:
> Hi, Gao,
>
> Thanks for your review !
>
>> As for this patch, what if the inode itself is
>> chunk-deduplicated, could we apply this if the inode
>> only has one new chunk instead at least for now?
>
> Do you mean inode has 3 chunks, chunk#2 and chunk#3 duplicate chunck#1?
>
> This patch only makes the 1st chunk exactly written to blobdev aligned on dsunit. Deduplicated chunks will not be written to blobdev.
>
> In example above, chunk#1 is written to dsunit aligned block addr, chunk#2,#3 are not written. The next file will be written from next dsunit alignment addr.

What's your `--chunksize` ? consider the following:

  chunksize = 4096
  dsunit = 512 = 2M

and two inodes:

inode A (8k)    2M, 2M+4k

inode B (12k)   4M, 2M, 4M+4k, 4M+8k?

Is it possible? what's the expected behavior of
this case.

Thanks,
Gao Xiang

>
> Best Regards
> Friendy Su



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev
  2025-08-20  9:38       ` Friendy.Su
@ 2025-08-21  2:00         ` Gao Xiang
  2025-08-21  3:17           ` Friendy.Su
  0 siblings, 1 reply; 12+ messages in thread
From: Gao Xiang @ 2025-08-21  2:00 UTC (permalink / raw)
  To: Friendy.Su@sony.com, linux-erofs@lists.ozlabs.org
  Cc: Yuezhang.Mo@sony.com, Daniel.Palmer@sony.com



On 2025/8/20 17:38, Friendy.Su@sony.com wrote:
> Hi, Gao,
> 
>> What's your `--chunksize` ? consider the following:
> 
>    chunksize = 4096
>    dsunit = 512 = 2M
> 
>> and two inodes:
> 
>> inode A (8k)    2M, 2M+4k
> 
>> inode B (12k)   4M, 2M, 4M+4k, 4M+8k?
> 
>> Is it possible? what's the expected behavior of
>> this case.
> 
> Yes. This is the expected behavior. See runtime below:

I understand that is the expected behavior according to this
patch, but I'm just unsure if it's an expected behavior for
the future wider setups (because some users may use `--dsunit`
for other usage).

So I tend to just constrain the case to your limited case
first, could you explain more if chunk deduplication is
needed for your scenarios? and what's your real `chunksize`?

Maybe adding another command option for this is better.

Thanks,
Gao Xiang


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev
  2025-08-21  2:00         ` Gao Xiang
@ 2025-08-21  3:17           ` Friendy.Su
  2025-08-21  3:33             ` Gao Xiang
  0 siblings, 1 reply; 12+ messages in thread
From: Friendy.Su @ 2025-08-21  3:17 UTC (permalink / raw)
  To: Gao Xiang, linux-erofs@lists.ozlabs.org
  Cc: Yuezhang.Mo@sony.com, Daniel.Palmer@sony.com

Hi, Gao,

>So I tend to just constrain the case to your limited case
>first, could you explain more if chunk deduplication is
>needed for your scenarios? and what's your real `chunksize`?

chunk deduplication is needed. 

As I wrote in commit msg, we expect scenario below:

1. mount with -o dax=always, 
2. application calls mmap(addr, file).
3. application read from addr, page fault is triggered. 
We hope in kernel, erofs_dax_vm_ops.huge_fault() can be handled, do not fallback to erofs_dax_vm_ops.fault().

This required file body on blob devices aligned on huge page(2M), and deduplicate unit also is 2M. We can specify --dsunit=512, --chunksize=2*1024*1024 to fulfill this.

I don't think need a new command option. 
Currently, '--dsunit' can be set for formatting blobdev. The following cmdline completes successfully. User certainly thinks mkfs.erofs has executed --dsunit alignment. 
But actually, it does not.  This patch just simply makes actual runtime fit for cmdline looks like.

mkfs.erofs --blobdev /dev/sdb1 --dsunit 512 ......

If actually `--dsunit` does not work on blobdev, should prompt warning msg to user.

Best Regards
Friendy Su

________________________________________
From: Gao Xiang <hsiangkao@linux.alibaba.com>
Sent: Thursday, August 21, 2025 10:00
To: Su, Friendy; linux-erofs@lists.ozlabs.org
Cc: Mo, Yuezhang; Palmer, Daniel (SGC)
Subject: Re: [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev

On 2025/8/20 17: 38, Friendy. Su@ sony. com wrote: > Hi, Gao, > >> What's your `--chunksize` ? consider the following: > > chunksize = 4096 > dsunit = 512 = 2M > >> and two inodes: > >> inode A (8k) 2M, 2M+4k




On 2025/8/20 17:38, Friendy.Su@sony.com wrote:
> Hi, Gao,
>
>> What's your `--chunksize` ? consider the following:
>
>    chunksize = 4096
>    dsunit = 512 = 2M
>
>> and two inodes:
>
>> inode A (8k)    2M, 2M+4k
>
>> inode B (12k)   4M, 2M, 4M+4k, 4M+8k?
>
>> Is it possible? what's the expected behavior of
>> this case.
>
> Yes. This is the expected behavior. See runtime below:

I understand that is the expected behavior according to this
patch, but I'm just unsure if it's an expected behavior for
the future wider setups (because some users may use `--dsunit`
for other usage).

So I tend to just constrain the case to your limited case
first, could you explain more if chunk deduplication is
needed for your scenarios? and what's your real `chunksize`?

Maybe adding another command option for this is better.

Thanks,
Gao Xiang



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev
  2025-08-21  3:17           ` Friendy.Su
@ 2025-08-21  3:33             ` Gao Xiang
  2025-08-21  3:39               ` Gao Xiang
  0 siblings, 1 reply; 12+ messages in thread
From: Gao Xiang @ 2025-08-21  3:33 UTC (permalink / raw)
  To: Friendy.Su@sony.com, linux-erofs@lists.ozlabs.org
  Cc: Yuezhang.Mo@sony.com, Daniel.Palmer@sony.com

Hi Friendy,

On 2025/8/21 11:17, Friendy.Su@sony.com wrote:
> Hi, Gao,
> 
>> So I tend to just constrain the case to your limited case
>> first, could you explain more if chunk deduplication is
>> needed for your scenarios? and what's your real `chunksize`?
> 
> chunk deduplication is needed.
> 
> As I wrote in commit msg, we expect scenario below:
> 
> 1. mount with -o dax=always,
> 2. application calls mmap(addr, file).
> 3. application read from addr, page fault is triggered.
> We hope in kernel, erofs_dax_vm_ops.huge_fault() can be handled, do not fallback to erofs_dax_vm_ops.fault().

I totally understand the runtime side, but in short:

> 
> This required file body on blob devices aligned on huge page(2M), and deduplicate unit also is 2M. We can specify --dsunit=512, --chunksize=2*1024*1024 to fulfill this.
> 
> I don't think need a new command option.
> Currently, '--dsunit' can be set for formatting blobdev. The following cmdline completes successfully. User certainly thinks mkfs.erofs has executed --dsunit alignment.
> But actually, it does not.  This patch just simply makes actual runtime fit for cmdline looks like.
> 
> mkfs.erofs --blobdev /dev/sdb1 --dsunit 512 ......
> 
> If actually `--dsunit` does not work on blobdev, should prompt warning msg to user.

My cercern is why `--chunksize=4096 --dsunit=512` will not
lead to each 4k chunk to the 2M data boundary, is it obvious?

chunksize = 4096
dsunit = 512 = 2M

inode A (8k)    2M, 4M
inode B (12k)	6M, 2M, 4M, 8M?

Are you sure if there is no such use case in the
future? Mixing `--chunksize=4096 --dsunit=512` seems
non-obvious for this case.

Thanks,
Gao Xiang


> 
> Best Regards
> Friendy Su
> 
> ________________________________________
> From: Gao Xiang <hsiangkao@linux.alibaba.com>
> Sent: Thursday, August 21, 2025 10:00
> To: Su, Friendy; linux-erofs@lists.ozlabs.org
> Cc: Mo, Yuezhang; Palmer, Daniel (SGC)
> Subject: Re: [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev
> 
> On 2025/8/20 17: 38, Friendy. Su@ sony. com wrote: > Hi, Gao, > >> What's your `--chunksize` ? consider the following: > > chunksize = 4096 > dsunit = 512 = 2M > >> and two inodes: > >> inode A (8k) 2M, 2M+4k
> 
> 
> 
> 
> On 2025/8/20 17:38, Friendy.Su@sony.com wrote:
>> Hi, Gao,
>>
>>> What's your `--chunksize` ? consider the following:
>>
>>     chunksize = 4096
>>     dsunit = 512 = 2M
>>
>>> and two inodes:
>>
>>> inode A (8k)    2M, 2M+4k
>>
>>> inode B (12k)   4M, 2M, 4M+4k, 4M+8k?
>>
>>> Is it possible? what's the expected behavior of
>>> this case.
>>
>> Yes. This is the expected behavior. See runtime below:
> 
> I understand that is the expected behavior according to this
> patch, but I'm just unsure if it's an expected behavior for
> the future wider setups (because some users may use `--dsunit`
> for other usage).
> 
> So I tend to just constrain the case to your limited case
> first, could you explain more if chunk deduplication is
> needed for your scenarios? and what's your real `chunksize`?
> 
> Maybe adding another command option for this is better.
> 
> Thanks,
> Gao Xiang



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev
  2025-08-21  3:33             ` Gao Xiang
@ 2025-08-21  3:39               ` Gao Xiang
  2025-08-21  9:36                 ` Friendy.Su
  0 siblings, 1 reply; 12+ messages in thread
From: Gao Xiang @ 2025-08-21  3:39 UTC (permalink / raw)
  To: Friendy.Su@sony.com, linux-erofs@lists.ozlabs.org
  Cc: Yuezhang.Mo@sony.com, Daniel.Palmer@sony.com



On 2025/8/21 11:33, Gao Xiang wrote:
> Hi Friendy,
> 
> On 2025/8/21 11:17, Friendy.Su@sony.com wrote:
>> Hi, Gao,
>>
>>> So I tend to just constrain the case to your limited case
>>> first, could you explain more if chunk deduplication is
>>> needed for your scenarios? and what's your real `chunksize`?
>>
>> chunk deduplication is needed.
>>
>> As I wrote in commit msg, we expect scenario below:
>>
>> 1. mount with -o dax=always,
>> 2. application calls mmap(addr, file).
>> 3. application read from addr, page fault is triggered.
>> We hope in kernel, erofs_dax_vm_ops.huge_fault() can be handled, do not fallback to erofs_dax_vm_ops.fault().
> 
> I totally understand the runtime side, but in short:
> 
>>
>> This required file body on blob devices aligned on huge page(2M), and deduplicate unit also is 2M. We can specify --dsunit=512, --chunksize=2*1024*1024 to fulfill this.
>>
>> I don't think need a new command option.
>> Currently, '--dsunit' can be set for formatting blobdev. The following cmdline completes successfully. User certainly thinks mkfs.erofs has executed --dsunit alignment.
>> But actually, it does not.  This patch just simply makes actual runtime fit for cmdline looks like.
>>
>> mkfs.erofs --blobdev /dev/sdb1 --dsunit 512 ......
>>
>> If actually `--dsunit` does not work on blobdev, should prompt warning msg to user.
> 
> My cercern is why `--chunksize=4096 --dsunit=512` will not
> lead to each 4k chunk to the 2M data boundary, is it obvious?
> 
> chunksize = 4096
> dsunit = 512 = 2M
> 
> inode A (8k)    2M, 4M
> inode B (12k)    6M, 2M, 4M, 8M?
> 
> Are you sure if there is no such use case in the
> future? Mixing `--chunksize=4096 --dsunit=512` seems
> non-obvious for this case.

Or as I mentioned before, I'm fine to leave each `dsunit` logical
aligned chunks (but not any deduplicated chunk in this logical range)
alignes with dsunit value, it enables PMD-mapping as you mentioned.

But if there is some deduplciated chunks in the logical dsunit
boundary, don't align it at all since there is no real benefit.
Although I'm still not sure what's the default behavior of `dsunit`
for chunks.

Thanks,
Gao Xiang

> 
> Thanks,
> Gao Xiang
> 
> 
>>
>> Best Regards
>> Friendy Su
>>
>> ________________________________________
>> From: Gao Xiang <hsiangkao@linux.alibaba.com>
>> Sent: Thursday, August 21, 2025 10:00
>> To: Su, Friendy; linux-erofs@lists.ozlabs.org
>> Cc: Mo, Yuezhang; Palmer, Daniel (SGC)
>> Subject: Re: [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev
>>
>> On 2025/8/20 17: 38, Friendy. Su@ sony. com wrote: > Hi, Gao, > >> What's your `--chunksize` ? consider the following: > > chunksize = 4096 > dsunit = 512 = 2M > >> and two inodes: > >> inode A (8k) 2M, 2M+4k
>>
>>
>>
>>
>> On 2025/8/20 17:38, Friendy.Su@sony.com wrote:
>>> Hi, Gao,
>>>
>>>> What's your `--chunksize` ? consider the following:
>>>
>>>     chunksize = 4096
>>>     dsunit = 512 = 2M
>>>
>>>> and two inodes:
>>>
>>>> inode A (8k)    2M, 2M+4k
>>>
>>>> inode B (12k)   4M, 2M, 4M+4k, 4M+8k?
>>>
>>>> Is it possible? what's the expected behavior of
>>>> this case.
>>>
>>> Yes. This is the expected behavior. See runtime below:
>>
>> I understand that is the expected behavior according to this
>> patch, but I'm just unsure if it's an expected behavior for
>> the future wider setups (because some users may use `--dsunit`
>> for other usage).
>>
>> So I tend to just constrain the case to your limited case
>> first, could you explain more if chunk deduplication is
>> needed for your scenarios? and what's your real `chunksize`?
>>
>> Maybe adding another command option for this is better.
>>
>> Thanks,
>> Gao Xiang
> 



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev
  2025-08-21  3:39               ` Gao Xiang
@ 2025-08-21  9:36                 ` Friendy.Su
  2025-08-21 10:03                   ` Gao Xiang
  0 siblings, 1 reply; 12+ messages in thread
From: Friendy.Su @ 2025-08-21  9:36 UTC (permalink / raw)
  To: Gao Xiang, linux-erofs@lists.ozlabs.org
  Cc: Yuezhang.Mo@sony.com, Daniel.Palmer@sony.com

Hi, Gao,

> But if there is some deduplciated chunks in the logical dsunit
> boundary, don't align it at all since there is no real benefit.
> Although I'm still not sure what's the default behavior of `dsunit`
> for chunks.

Exactly, if `--chunksize=4096 --dsunit=512`, any 4K deduplicated will cause PMD map failure.  Can we consider the following countermeasure as default behavior:

1. In man page, describe 'chunksize' and 'dsunit' should be collaborated to achieve the best performance.

2. At runtime, if chunksize < dsunit, 
  prompt alert message, tell user it is better set chunksize=dsunit. But still format with user set options. 
  The benefit is user can still set as desired. Current, for the use cases we can imagine, chunksize=dsunit is best. But maybe users have their own use cases, it is better let users do what they really wanted.


If mkfs.erofs force to align every 2M, even there is only 4K not deduplicated in 2M, the 4K actually still occupies 2M. 
0, 2M(only 4K data new, others all deduplicated), 4M........
Space usage efficiency is same as chunksize=dsunit=2M.


Best Regards
Friendy 




________________________________________
From: Gao Xiang <hsiangkao@linux.alibaba.com>
Sent: Thursday, August 21, 2025 11:39
To: Su, Friendy; linux-erofs@lists.ozlabs.org
Cc: Mo, Yuezhang; Palmer, Daniel (SGC)
Subject: Re: [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev

On 2025/8/21 11: 33, Gao Xiang wrote: > Hi Friendy, > > On 2025/8/21 11: 17, Friendy. Su@ sony. com wrote: >> Hi, Gao, >> >>> So I tend to just constrain the case to your limited case >>> first, could you explain




On 2025/8/21 11:33, Gao Xiang wrote:
> Hi Friendy,
>
> On 2025/8/21 11:17, Friendy.Su@sony.com wrote:
>> Hi, Gao,
>>
>>> So I tend to just constrain the case to your limited case
>>> first, could you explain more if chunk deduplication is
>>> needed for your scenarios? and what's your real `chunksize`?
>>
>> chunk deduplication is needed.
>>
>> As I wrote in commit msg, we expect scenario below:
>>
>> 1. mount with -o dax=always,
>> 2. application calls mmap(addr, file).
>> 3. application read from addr, page fault is triggered.
>> We hope in kernel, erofs_dax_vm_ops.huge_fault() can be handled, do not fallback to erofs_dax_vm_ops.fault().
>
> I totally understand the runtime side, but in short:
>
>>
>> This required file body on blob devices aligned on huge page(2M), and deduplicate unit also is 2M. We can specify --dsunit=512, --chunksize=2*1024*1024 to fulfill this.
>>
>> I don't think need a new command option.
>> Currently, '--dsunit' can be set for formatting blobdev. The following cmdline completes successfully. User certainly thinks mkfs.erofs has executed --dsunit alignment.
>> But actually, it does not.  This patch just simply makes actual runtime fit for cmdline looks like.
>>
>> mkfs.erofs --blobdev /dev/sdb1 --dsunit 512 ......
>>
>> If actually `--dsunit` does not work on blobdev, should prompt warning msg to user.
>
> My cercern is why `--chunksize=4096 --dsunit=512` will not
> lead to each 4k chunk to the 2M data boundary, is it obvious?
>
> chunksize = 4096
> dsunit = 512 = 2M
>
> inode A (8k)    2M, 4M
> inode B (12k)    6M, 2M, 4M, 8M?
>
> Are you sure if there is no such use case in the
> future? Mixing `--chunksize=4096 --dsunit=512` seems
> non-obvious for this case.

Or as I mentioned before, I'm fine to leave each `dsunit` logical
aligned chunks (but not any deduplicated chunk in this logical range)
alignes with dsunit value, it enables PMD-mapping as you mentioned.

But if there is some deduplciated chunks in the logical dsunit
boundary, don't align it at all since there is no real benefit.
Although I'm still not sure what's the default behavior of `dsunit`
for chunks.

Thanks,
Gao Xiang

>
> Thanks,
> Gao Xiang
>
>
>>
>> Best Regards
>> Friendy Su
>>
>> ________________________________________
>> From: Gao Xiang <hsiangkao@linux.alibaba.com>
>> Sent: Thursday, August 21, 2025 10:00
>> To: Su, Friendy; linux-erofs@lists.ozlabs.org
>> Cc: Mo, Yuezhang; Palmer, Daniel (SGC)
>> Subject: Re: [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev
>>
>> On 2025/8/20 17: 38, Friendy. Su@ sony. com wrote: > Hi, Gao, > >> What's your `--chunksize` ? consider the following: > > chunksize = 4096 > dsunit = 512 = 2M > >> and two inodes: > >> inode A (8k) 2M, 2M+4k
>>
>>
>>
>>
>> On 2025/8/20 17:38, Friendy.Su@sony.com wrote:
>>> Hi, Gao,
>>>
>>>> What's your `--chunksize` ? consider the following:
>>>
>>>     chunksize = 4096
>>>     dsunit = 512 = 2M
>>>
>>>> and two inodes:
>>>
>>>> inode A (8k)    2M, 2M+4k
>>>
>>>> inode B (12k)   4M, 2M, 4M+4k, 4M+8k?
>>>
>>>> Is it possible? what's the expected behavior of
>>>> this case.
>>>
>>> Yes. This is the expected behavior. See runtime below:
>>
>> I understand that is the expected behavior according to this
>> patch, but I'm just unsure if it's an expected behavior for
>> the future wider setups (because some users may use `--dsunit`
>> for other usage).
>>
>> So I tend to just constrain the case to your limited case
>> first, could you explain more if chunk deduplication is
>> needed for your scenarios? and what's your real `chunksize`?
>>
>> Maybe adding another command option for this is better.
>>
>> Thanks,
>> Gao Xiang
>




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev
  2025-08-21  9:36                 ` Friendy.Su
@ 2025-08-21 10:03                   ` Gao Xiang
  2025-08-21 10:13                     ` Friendy.Su
  0 siblings, 1 reply; 12+ messages in thread
From: Gao Xiang @ 2025-08-21 10:03 UTC (permalink / raw)
  To: Friendy.Su@sony.com, linux-erofs@lists.ozlabs.org
  Cc: Yuezhang.Mo@sony.com, Daniel.Palmer@sony.com



On 2025/8/21 17:36, Friendy.Su@sony.com wrote:
> Hi, Gao,
> 
>> But if there is some deduplciated chunks in the logical dsunit
>> boundary, don't align it at all since there is no real benefit.
>> Although I'm still not sure what's the default behavior of `dsunit`
>> for chunks.
> 
> Exactly, if `--chunksize=4096 --dsunit=512`, any 4K deduplicated will cause PMD map failure.  Can we consider the following countermeasure as default behavior:
> 
> 1. In man page, describe 'chunksize' and 'dsunit' should be collaborated to achieve the best performance.

Agreed, we should mention that in the manpage:

> 
> 2. At runtime, if chunksize < dsunit,
>    prompt alert message, tell user it is better set chunksize=dsunit. But still format with user set options.
>    The benefit is user can still set as desired. Current, for the use cases we can imagine, chunksize=dsunit is best. But maybe users have their own use cases, it is better let users do what they really wanted.

But could we just apply your patch only to the dsunit>=chunksize
case to fulfill your exist use case? I mean we still ignore
`dsunit` if dsunit!=chunksize and warn out users explicitly (
`dsunit` doesn't work due to dsunit!=chunksize).

I really need to think about chunksize != dsunit cases, but
since you don't have such urgent need, let's keep the old `ignore`
behavior for now...

Thanks,
Gao XIang

> 
> 
> If mkfs.erofs force to align every 2M, even there is only 4K not deduplicated in 2M, the 4K actually still occupies 2M.
> 0, 2M(only 4K data new, others all deduplicated), 4M........
> Space usage efficiency is same as chunksize=dsunit=2M.
> 
> 
> Best Regards
> Friendy
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev
  2025-08-21 10:03                   ` Gao Xiang
@ 2025-08-21 10:13                     ` Friendy.Su
  0 siblings, 0 replies; 12+ messages in thread
From: Friendy.Su @ 2025-08-21 10:13 UTC (permalink / raw)
  To: Gao Xiang, linux-erofs@lists.ozlabs.org
  Cc: Yuezhang.Mo@sony.com, Daniel.Palmer@sony.com

> But could we just apply your patch only to the dsunit>=chunksize
> case to fulfill your exist use case? I mean we still ignore
> `dsunit` if dsunit!=chunksize and warn out users explicitly (
> `dsunit` doesn't work due to dsunit!=chunksize).

Agree.
I will update patch.


________________________________________
From: Gao Xiang <hsiangkao@linux.alibaba.com>
Sent: Thursday, August 21, 2025 18:03
To: Su, Friendy; linux-erofs@lists.ozlabs.org
Cc: Mo, Yuezhang; Palmer, Daniel (SGC)
Subject: Re: [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev

On 2025/8/21 17: 36, Friendy. Su@ sony. com wrote: > Hi, Gao, > >> But if there is some deduplciated chunks in the logical dsunit >> boundary, don't align it at all since there is no real benefit. >> Although I'm still not




On 2025/8/21 17:36, Friendy.Su@sony.com wrote:
> Hi, Gao,
>
>> But if there is some deduplciated chunks in the logical dsunit
>> boundary, don't align it at all since there is no real benefit.
>> Although I'm still not sure what's the default behavior of `dsunit`
>> for chunks.
>
> Exactly, if `--chunksize=4096 --dsunit=512`, any 4K deduplicated will cause PMD map failure.  Can we consider the following countermeasure as default behavior:
>
> 1. In man page, describe 'chunksize' and 'dsunit' should be collaborated to achieve the best performance.

Agreed, we should mention that in the manpage:

>
> 2. At runtime, if chunksize < dsunit,
>    prompt alert message, tell user it is better set chunksize=dsunit. But still format with user set options.
>    The benefit is user can still set as desired. Current, for the use cases we can imagine, chunksize=dsunit is best. But maybe users have their own use cases, it is better let users do what they really wanted.

But could we just apply your patch only to the dsunit>=chunksize
case to fulfill your exist use case? I mean we still ignore
`dsunit` if dsunit!=chunksize and warn out users explicitly (
`dsunit` doesn't work due to dsunit!=chunksize).

I really need to think about chunksize != dsunit cases, but
since you don't have such urgent need, let's keep the old `ignore`
behavior for now...

Thanks,
Gao XIang

>
>
> If mkfs.erofs force to align every 2M, even there is only 4K not deduplicated in 2M, the 4K actually still occupies 2M.
> 0, 2M(only 4K data new, others all deduplicated), 4M........
> Space usage efficiency is same as chunksize=dsunit=2M.
>
>
> Best Regards
> Friendy
>



^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-08-21 10:14 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-20  7:23 [PATCH v1] erofs-utils: mkfs: Implement 'dsunit' alignment on blobdev Friendy Su
2025-08-20  7:44 ` Gao Xiang
2025-08-20  9:00   ` Friendy.Su
2025-08-20  9:12     ` Gao Xiang
2025-08-20  9:38       ` Friendy.Su
2025-08-21  2:00         ` Gao Xiang
2025-08-21  3:17           ` Friendy.Su
2025-08-21  3:33             ` Gao Xiang
2025-08-21  3:39               ` Gao Xiang
2025-08-21  9:36                 ` Friendy.Su
2025-08-21 10:03                   ` Gao Xiang
2025-08-21 10:13                     ` Friendy.Su

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.