* [PATCH 0/2] ceph: fix missing snapshot context in write operations
From: ethanwu @ 2025-09-25 10:42 UTC (permalink / raw)
To: ceph-devel; +Cc: xiubli, idryomov, Slava.Dubeyko, Pavan.Rallabhandi, ethanwu
This series addresses two instances where Ceph filesystem operations
were missing proper snapshot context handling, which could lead to
data inconsistencies in snapshots.
The issue occurs in two scenarios:
1. ceph_zero_partial_object() during fallocate punch hole operations
2. ceph_uninline_data() when converting inline data to regular objects
Both functions were passing NULL snapshot context to OSD write
operations instead of acquiring the appropriate context. This could
result in snapshot data corruption where subsequent reads from snapshots
would return modified data instead of the original snapshot content.
The fix ensures that proper snapshot context is acquired and passed to
all OSD write operations in these code paths.
ethanwu (2):
ceph: fix snapshot context missing in ceph_zero_partial_object
ceph: fix snapshot context missing in ceph_uninline_data
fs/ceph/addr.c | 24 ++++++++++++++++++++++--
fs/ceph/file.c | 17 ++++++++++++++++-
2 files changed, 38 insertions(+), 3 deletions(-)
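Both fixes pick the snapshot context the same way. A condensed sketch of that
pattern, distilled from the diffs in patches 1 and 2 (locking kept, request
setup and error handling omitted):

	struct ceph_snap_context *snapc;

	spin_lock(&ci->i_ceph_lock);
	if (__ceph_have_pending_cap_snap(ci)) {
		/* a cap snap is still pending writeback: use its context */
		struct ceph_cap_snap *capsnap =
			list_last_entry(&ci->i_cap_snaps,
					struct ceph_cap_snap, ci_item);
		snapc = ceph_get_snap_context(capsnap->context);
	} else {
		/* otherwise write against the head snapshot context */
		snapc = ceph_get_snap_context(ci->i_head_snapc);
	}
	spin_unlock(&ci->i_ceph_lock);

	/* ... pass snapc instead of NULL to ceph_osdc_new_request() ... */

	ceph_put_snap_context(snapc);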
--
2.43.0
* [PATCH 1/2] ceph: fix snapshot context missing in ceph_zero_partial_object
From: ethanwu @ 2025-09-25 10:42 UTC (permalink / raw)
To: ceph-devel; +Cc: xiubli, idryomov, Slava.Dubeyko, Pavan.Rallabhandi, ethanwu
The ceph_zero_partial_object function was missing proper snapshot
context for its OSD write operations, which could lead to data
inconsistencies in snapshots.
Reproducer:
../src/vstart.sh --new -x --localhost --bluestore
./bin/ceph auth caps client.fs_a mds 'allow rwps fsname=a' mon 'allow r fsname=a' osd 'allow rw tag cephfs data=a'
mount -t ceph fs_a@.a=/ /mnt/mycephfs/ -o conf=./ceph.conf
dd if=/dev/urandom of=/mnt/mycephfs/foo bs=64K count=1
mkdir /mnt/mycephfs/.snap/snap1
md5sum /mnt/mycephfs/.snap/snap1/foo
fallocate -p -o 0 -l 4096 /mnt/mycephfs/foo
echo 3 > /proc/sys/vm/drop/caches
md5sum /mnt/mycephfs/.snap/snap1/foo # get different md5sum!!
Fixes: ad7a60de882ac ("ceph: punch hole support")
Signed-off-by: ethanwu <ethanwu@synology.com>
---
fs/ceph/file.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index c02f100f8552..58cc2cbae8bc 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -2572,6 +2572,7 @@ static int ceph_zero_partial_object(struct inode *inode,
struct ceph_inode_info *ci = ceph_inode(inode);
struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
struct ceph_osd_request *req;
+ struct ceph_snap_context *snapc;
int ret = 0;
loff_t zero = 0;
int op;
@@ -2586,12 +2587,25 @@ static int ceph_zero_partial_object(struct inode *inode,
op = CEPH_OSD_OP_ZERO;
}
+ spin_lock(&ci->i_ceph_lock);
+ if (__ceph_have_pending_cap_snap(ci)) {
+ struct ceph_cap_snap *capsnap =
+ list_last_entry(&ci->i_cap_snaps,
+ struct ceph_cap_snap,
+ ci_item);
+ snapc = ceph_get_snap_context(capsnap->context);
+ } else {
+ BUG_ON(!ci->i_head_snapc);
+ snapc = ceph_get_snap_context(ci->i_head_snapc);
+ }
+ spin_unlock(&ci->i_ceph_lock);
+
req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout,
ceph_vino(inode),
offset, length,
0, 1, op,
CEPH_OSD_FLAG_WRITE,
- NULL, 0, 0, false);
+ snapc, 0, 0, false);
if (IS_ERR(req)) {
ret = PTR_ERR(req);
goto out;
@@ -2605,6 +2619,7 @@ static int ceph_zero_partial_object(struct inode *inode,
ceph_osdc_put_request(req);
out:
+ ceph_put_snap_context(snapc);
return ret;
}
--
2.43.0
* [PATCH 2/2] ceph: fix snapshot context missing in ceph_uninline_data
From: ethanwu @ 2025-09-25 10:42 UTC (permalink / raw)
To: ceph-devel; +Cc: xiubli, idryomov, Slava.Dubeyko, Pavan.Rallabhandi, ethanwu
The ceph_uninline_data function was missing proper snapshot context
handling for its OSD write operations. Both CEPH_OSD_OP_CREATE and
CEPH_OSD_OP_WRITE requests were passing NULL instead of the appropriate
snapshot context, which could lead to unnecessary object clone.
Reproducer:
../src/vstart.sh --new -x --localhost --bluestore
// turn on cephfs inline data
./bin/ceph fs set a inline_data true --yes-i-really-really-mean-it
// allow fs_a client to take snapshot
./bin/ceph auth caps client.fs_a mds 'allow rwps fsname=a' mon 'allow r fsname=a' osd 'allow rw tag cephfs data=a'
// mount cephfs with fuse, since kernel cephfs doesn't support inline write
ceph-fuse --id fs_a -m 127.0.0.1:40318 --conf ceph.conf -d /mnt/mycephfs/
// bump snapshot seq
mkdir /mnt/mycephfs/.snap/snap1
echo "foo" > /mnt/mycephfs/test
// umount and mount it again using kernel cephfs client
umount /mnt/mycephfs
mount -t ceph fs_a@.a=/ /mnt/mycephfs/ -o conf=./ceph.conf
echo "bar" >> /mnt/mycephfs/test
./bin/rados listsnaps -p cephfs.a.data $(printf "%x\n" $(stat -c %i /mnt/mycephfs/test)).00000000
will see this object does unnecessary clone
1000000000a.00000000 (seq:2):
cloneid snaps size overlap
2 2 4 []
head - 8
but it's expected to see
10000000000.00000000 (seq:2):
cloneid snaps size overlap
head - 8
since there's no snapshot between these 2 writes
clone happened because the first osd request CEPH_OSD_OP_CREATE doesn't
pass snap context so object is created with snap seq 0, but later data
writeback is equipped with snapshot context.
snap.seq(1) > object snap seq(0), so osd does object clone.
This fix properly acquires the snapshot context before performing
write operations.
Signed-off-by: ethanwu <ethanwu@synology.com>
---
fs/ceph/addr.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 8b202d789e93..0e87a3e8fd2c 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -2202,6 +2202,7 @@ int ceph_uninline_data(struct file *file)
struct ceph_osd_request *req = NULL;
struct ceph_cap_flush *prealloc_cf = NULL;
struct folio *folio = NULL;
+ struct ceph_snap_context *snapc = NULL;
u64 inline_version = CEPH_INLINE_NONE;
struct page *pages[1];
int err = 0;
@@ -2229,6 +2230,24 @@ int ceph_uninline_data(struct file *file)
if (inline_version == 1) /* initial version, no data */
goto out_uninline;
+ down_read(&fsc->mdsc->snap_rwsem);
+ spin_lock(&ci->i_ceph_lock);
+ if (__ceph_have_pending_cap_snap(ci)) {
+ struct ceph_cap_snap *capsnap =
+ list_last_entry(&ci->i_cap_snaps,
+ struct ceph_cap_snap,
+ ci_item);
+ snapc = ceph_get_snap_context(capsnap->context);
+ } else {
+ if (!ci->i_head_snapc) {
+ ci->i_head_snapc = ceph_get_snap_context(
+ ci->i_snap_realm->cached_context);
+ }
+ snapc = ceph_get_snap_context(ci->i_head_snapc);
+ }
+ spin_unlock(&ci->i_ceph_lock);
+ up_read(&fsc->mdsc->snap_rwsem);
+
folio = read_mapping_folio(inode->i_mapping, 0, file);
if (IS_ERR(folio)) {
err = PTR_ERR(folio);
@@ -2244,7 +2263,7 @@ int ceph_uninline_data(struct file *file)
req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout,
ceph_vino(inode), 0, &len, 0, 1,
CEPH_OSD_OP_CREATE, CEPH_OSD_FLAG_WRITE,
- NULL, 0, 0, false);
+ snapc, 0, 0, false);
if (IS_ERR(req)) {
err = PTR_ERR(req);
goto out_unlock;
@@ -2260,7 +2279,7 @@ int ceph_uninline_data(struct file *file)
req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout,
ceph_vino(inode), 0, &len, 1, 3,
CEPH_OSD_OP_WRITE, CEPH_OSD_FLAG_WRITE,
- NULL, ci->i_truncate_seq,
+ snapc, ci->i_truncate_seq,
ci->i_truncate_size, false);
if (IS_ERR(req)) {
err = PTR_ERR(req);
@@ -2323,6 +2342,7 @@ int ceph_uninline_data(struct file *file)
folio_put(folio);
}
out:
+ ceph_put_snap_context(snapc);
ceph_free_cap_flush(prealloc_cf);
doutc(cl, "%llx.%llx inline_version %llu = %d\n",
ceph_vinop(inode), inline_version, err);
--
2.43.0
* Re: [PATCH 0/2] ceph: fix missing snapshot context in write operations
From: Viacheslav Dubeyko @ 2025-09-25 19:36 UTC (permalink / raw)
To: ethanwu@synology.com, ceph-devel@vger.kernel.org
Cc: Xiubo Li, idryomov@gmail.com, Pavan Rallabhandi
On Thu, 2025-09-25 at 18:42 +0800, ethanwu wrote:
> This series addresses two instances where Ceph filesystem operations
> were missing proper snapshot context handling, which could lead to
> data inconsistencies in snapshots.
>
> The issue occurs in two scenarios:
> 1. ceph_zero_partial_object() during fallocate punch hole operations
> 2. ceph_uninline_data() when converting inline data to regular objects
>
> Both functions were passing NULL snapshot context to OSD write
> operations instead of acquiring the appropriate context. This could
> result in snapshot data corruption where subsequent reads from snapshots
> would return modified data instead of the original snapshot content.
>
> The fix ensures that proper snapshot context is acquired and passed to
> all OSD write operations in these code paths.
>
> ethanwu (2):
> ceph: fix snapshot context missing in ceph_zero_partial_object
> ceph: fix snapshot context missing in ceph_uninline_data
>
> fs/ceph/addr.c | 24 ++++++++++++++++++++++--
> fs/ceph/file.c | 17 ++++++++++++++++-
> 2 files changed, 38 insertions(+), 3 deletions(-)
Let me spend some time testing the patch set.
Thanks,
Slava.
* Re: [PATCH 1/2] ceph: fix snapshot context missing in ceph_zero_partial_object
From: Viacheslav Dubeyko @ 2025-09-26 21:40 UTC (permalink / raw)
To: ethanwu@synology.com, ceph-devel@vger.kernel.org
Cc: Xiubo Li, idryomov@gmail.com, Pavan Rallabhandi
On Thu, 2025-09-25 at 18:42 +0800, ethanwu wrote:
> The ceph_zero_partial_object function was missing proper snapshot
> context for its OSD write operations, which could lead to data
> inconsistencies in snapshots.
>
> Reproducer:
> ../src/vstart.sh --new -x --localhost --bluestore
> ./bin/ceph auth caps client.fs_a mds 'allow rwps fsname=a' mon 'allow r fsname=a' osd 'allow rw tag cephfs data=a'
> mount -t ceph fs_a@.a=/ /mnt/mycephfs/ -o conf=./ceph.conf
> dd if=/dev/urandom of=/mnt/mycephfs/foo bs=64K count=1
> mkdir /mnt/mycephfs/.snap/snap1
> md5sum /mnt/mycephfs/.snap/snap1/foo
> fallocate -p -o 0 -l 4096 /mnt/mycephfs/foo
> echo 3 > /proc/sys/vm/drop/caches
I have on my side: 'echo 3 > /proc/sys/vm/drop_caches'.
> md5sum /mnt/mycephfs/.snap/snap1/foo # get different md5sum!!
>
> Fixes: ad7a60de882ac ("ceph: punch hole support")
> Signed-off-by: ethanwu <ethanwu@synology.com>
> ---
> fs/ceph/file.c | 17 ++++++++++++++++-
> 1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index c02f100f8552..58cc2cbae8bc 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -2572,6 +2572,7 @@ static int ceph_zero_partial_object(struct inode *inode,
> struct ceph_inode_info *ci = ceph_inode(inode);
> struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> struct ceph_osd_request *req;
> + struct ceph_snap_context *snapc;
> int ret = 0;
> loff_t zero = 0;
> int op;
> @@ -2586,12 +2587,25 @@ static int ceph_zero_partial_object(struct inode *inode,
> op = CEPH_OSD_OP_ZERO;
> }
>
> + spin_lock(&ci->i_ceph_lock);
> + if (__ceph_have_pending_cap_snap(ci)) {
> + struct ceph_cap_snap *capsnap =
> + list_last_entry(&ci->i_cap_snaps,
> + struct ceph_cap_snap,
> + ci_item);
> + snapc = ceph_get_snap_context(capsnap->context);
> + } else {
> + BUG_ON(!ci->i_head_snapc);
By the way, why did you decide to use BUG_ON() instead of returning an error here?
And you decided not to check ci->i_cap_snaps above.
> + snapc = ceph_get_snap_context(ci->i_head_snapc);
> + }
> + spin_unlock(&ci->i_ceph_lock);
> +
> req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout,
> ceph_vino(inode),
> offset, length,
> 0, 1, op,
> CEPH_OSD_FLAG_WRITE,
> - NULL, 0, 0, false);
> + snapc, 0, 0, false);
> if (IS_ERR(req)) {
> ret = PTR_ERR(req);
> goto out;
> @@ -2605,6 +2619,7 @@ static int ceph_zero_partial_object(struct inode *inode,
> ceph_osdc_put_request(req);
>
> out:
> + ceph_put_snap_context(snapc);
> return ret;
> }
>
Looks good.
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Thanks,
Slava.
* Re: [PATCH 2/2] ceph: fix snapshot context missing in ceph_uninline_data
From: Viacheslav Dubeyko @ 2025-09-26 21:52 UTC (permalink / raw)
To: ethanwu@synology.com, ceph-devel@vger.kernel.org
Cc: Xiubo Li, idryomov@gmail.com, Pavan Rallabhandi
On Thu, 2025-09-25 at 18:42 +0800, ethanwu wrote:
> The ceph_uninline_data function was missing proper snapshot context
> handling for its OSD write operations. Both CEPH_OSD_OP_CREATE and
> CEPH_OSD_OP_WRITE requests were passing NULL instead of the appropriate
> snapshot context, which could lead to unnecessary object clone.
>
> Reproducer:
> ../src/vstart.sh --new -x --localhost --bluestore
> // turn on cephfs inline data
> ./bin/ceph fs set a inline_data true --yes-i-really-really-mean-it
> // allow fs_a client to take snapshot
> ./bin/ceph auth caps client.fs_a mds 'allow rwps fsname=a' mon 'allow r fsname=a' osd 'allow rw tag cephfs data=a'
> // mount cephfs with fuse, since kernel cephfs doesn't support inline write
> ceph-fuse --id fs_a -m 127.0.0.1:40318 --conf ceph.conf -d /mnt/mycephfs/
> // bump snapshot seq
> mkdir /mnt/mycephfs/.snap/snap1
> echo "foo" > /mnt/mycephfs/test
> // umount and mount it again using kernel cephfs client
> umount /mnt/mycephfs
> mount -t ceph fs_a@.a=/ /mnt/mycephfs/ -o conf=./ceph.conf
> echo "bar" >> /mnt/mycephfs/test
> ./bin/rados listsnaps -p cephfs.a.data $(printf "%x\n" $(stat -c %i /mnt/mycephfs/test)).00000000
>
> will see this object does unnecessary clone
> 1000000000a.00000000 (seq:2):
> cloneid snaps size overlap
> 2 2 4 []
> head - 8
>
> but it's expected to see
> 10000000000.00000000 (seq:2):
> cloneid snaps size overlap
> head - 8
>
> since there's no snapshot between these 2 writes
>
Maybe I am doing something wrong here, but I have the same issue on the second
virtual machine with the patch applied:
VM1 (without patch):
sudo ceph-fuse --id admin /mnt/cephfs/
/mnt/cephfs/test-snapshot3# mkdir ./.snap/snap1
/mnt/cephfs/test-snapshot3# echo "foo" > ./test
umount /mnt/cephfs
mount -t ceph :/ /mnt/cephfs/ -o name=admin,fs=cephfs
/mnt/cephfs/test-snapshot3# echo "bar" >> ./test
/mnt/cephfs/test-snapshot3# rados listsnaps -p cephfs_data $(printf "%x\n"
$(stat -c %i ./test)).00000000
10000313cb1.00000000:
cloneid snaps size overlap
4 4 4 []
head - 8
VM2 (with patch):
ceph-fuse --id admin /mnt/cephfs/
/mnt/cephfs/test-snapshot4# mkdir ./.snap/snap1
/mnt/cephfs/test-snapshot4# echo "foo" > ./test
umount /mnt/cephfs
mount -t ceph :/ /mnt/cephfs/ -o name=admin,fs=cephfs
/mnt/cephfs/test-snapshot4# echo "bar" >> ./test
/mnt/cephfs/test-snapshot4# rados listsnaps -p cephfs_data $(printf "%x\n"
$(stat -c %i ./test)).00000000
10000313cb3.00000000:10000313cb3.00000000:
cloneid snaps size overlap
5 5 0 []
head - 4
Maybe the reproduction path is not completely correct? What could be wrong on my
side?
Thanks,
Slava.
> clone happened because the first osd request CEPH_OSD_OP_CREATE doesn't
> pass snap context so object is created with snap seq 0, but later data
> writeback is equipped with snapshot context.
> snap.seq(1) > object snap seq(0), so osd does object clone.
>
> This fix properly acquires the snapshot context before performing
> write operations.
>
> Signed-off-by: ethanwu <ethanwu@synology.com>
> ---
> fs/ceph/addr.c | 24 ++++++++++++++++++++++--
> 1 file changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 8b202d789e93..0e87a3e8fd2c 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -2202,6 +2202,7 @@ int ceph_uninline_data(struct file *file)
> struct ceph_osd_request *req = NULL;
> struct ceph_cap_flush *prealloc_cf = NULL;
> struct folio *folio = NULL;
> + struct ceph_snap_context *snapc = NULL;
> u64 inline_version = CEPH_INLINE_NONE;
> struct page *pages[1];
> int err = 0;
> @@ -2229,6 +2230,24 @@ int ceph_uninline_data(struct file *file)
> if (inline_version == 1) /* initial version, no data */
> goto out_uninline;
>
> + down_read(&fsc->mdsc->snap_rwsem);
> + spin_lock(&ci->i_ceph_lock);
> + if (__ceph_have_pending_cap_snap(ci)) {
> + struct ceph_cap_snap *capsnap =
> + list_last_entry(&ci->i_cap_snaps,
> + struct ceph_cap_snap,
> + ci_item);
> + snapc = ceph_get_snap_context(capsnap->context);
> + } else {
> + if (!ci->i_head_snapc) {
> + ci->i_head_snapc = ceph_get_snap_context(
> + ci->i_snap_realm->cached_context);
> + }
> + snapc = ceph_get_snap_context(ci->i_head_snapc);
> + }
> + spin_unlock(&ci->i_ceph_lock);
> + up_read(&fsc->mdsc->snap_rwsem);
> +
> folio = read_mapping_folio(inode->i_mapping, 0, file);
> if (IS_ERR(folio)) {
> err = PTR_ERR(folio);
> @@ -2244,7 +2263,7 @@ int ceph_uninline_data(struct file *file)
> req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout,
> ceph_vino(inode), 0, &len, 0, 1,
> CEPH_OSD_OP_CREATE, CEPH_OSD_FLAG_WRITE,
> - NULL, 0, 0, false);
> + snapc, 0, 0, false);
> if (IS_ERR(req)) {
> err = PTR_ERR(req);
> goto out_unlock;
> @@ -2260,7 +2279,7 @@ int ceph_uninline_data(struct file *file)
> req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout,
> ceph_vino(inode), 0, &len, 1, 3,
> CEPH_OSD_OP_WRITE, CEPH_OSD_FLAG_WRITE,
> - NULL, ci->i_truncate_seq,
> + snapc, ci->i_truncate_seq,
> ci->i_truncate_size, false);
> if (IS_ERR(req)) {
> err = PTR_ERR(req);
> @@ -2323,6 +2342,7 @@ int ceph_uninline_data(struct file *file)
> folio_put(folio);
> }
> out:
> + ceph_put_snap_context(snapc);
> ceph_free_cap_flush(prealloc_cf);
> doutc(cl, "%llx.%llx inline_version %llu = %d\n",
> ceph_vinop(inode), inline_version, err);
* RE: [PATCH 1/2] ceph: fix snapshot context missing in ceph_zero_partial_object
From: Viacheslav Dubeyko @ 2025-09-30 18:32 UTC (permalink / raw)
To: ethanwu@synology.com
Cc: ceph-devel@vger.kernel.org, Xiubo Li, Pavan Rallabhandi,
idryomov@gmail.com, ethan198912@gmail.com
On Tue, 2025-09-30 at 15:07 +0800, ethanwu wrote:
>
> On 2025-09-27 05:41, Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > On Thu, 2025-09-25 at 18:42 +0800, ethanwu wrote:
> > > The ceph_zero_partial_object function was missing proper snapshot
> > > context for its OSD write operations, which could lead to data
> > > inconsistencies in snapshots.
> > >
> > > Reproducer:
> > > ../src/vstart.sh --new -x --localhost --bluestore
> > > ./bin/ceph auth caps client.fs_a mds 'allow rwps fsname=a' mon 'allow r fsname=a' osd 'allow rw tag cephfs data=a'
> > > mount -t ceph fs_a@.a=/ /mnt/mycephfs/ -o conf=./ceph.conf
> > > dd if=/dev/urandom of=/mnt/mycephfs/foo bs=64K count=1
> > > mkdir /mnt/mycephfs/.snap/snap1
> > > md5sum /mnt/mycephfs/.snap/snap1/foo
> > > fallocate -p -o 0 -l 4096 /mnt/mycephfs/foo
> > > echo 3 > /proc/sys/vm/drop/caches
> >
> > I have on my side: 'echo 3 > /proc/sys/vm/drop_caches'.
>
> Thanks for pointing this out, I'll update in V2.
> >
> >
> > > md5sum /mnt/mycephfs/.snap/snap1/foo # get different md5sum!!
> > >
> > > Fixes: ad7a60de882ac ("ceph: punch hole support")
> > > Signed-off-by: ethanwu <ethanwu@synology.com>
> > > ---
> > > fs/ceph/file.c | 17 ++++++++++++++++-
> > > 1 file changed, 16 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > index c02f100f8552..58cc2cbae8bc 100644
> > > --- a/fs/ceph/file.c
> > > +++ b/fs/ceph/file.c
> > > @@ -2572,6 +2572,7 @@ static int ceph_zero_partial_object(struct inode *inode,
> > > struct ceph_inode_info *ci = ceph_inode(inode);
> > > struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> > > struct ceph_osd_request *req;
> > > + struct ceph_snap_context *snapc;
> > > int ret = 0;
> > > loff_t zero = 0;
> > > int op;
> > > @@ -2586,12 +2587,25 @@ static int ceph_zero_partial_object(struct inode *inode,
> > > op = CEPH_OSD_OP_ZERO;
> > > }
> > >
> > > + spin_lock(&ci->i_ceph_lock);
> > > + if (__ceph_have_pending_cap_snap(ci)) {
> > > + struct ceph_cap_snap *capsnap =
> > > + list_last_entry(&ci->i_cap_snaps,
> > > + struct ceph_cap_snap,
> > > + ci_item);
> > > + snapc = ceph_get_snap_context(capsnap->context);
> > > + } else {
> > > + BUG_ON(!ci->i_head_snapc);
> >
> > By the way, why did you decide to use BUG_ON() instead of returning an error here?
>
> I followed the rest of the places that use i_head_snapc.
> They call BUG_ON() when ci->i_head_snapc is NULL;
> but running ./scripts/checkpatch.pl --strict warns to avoid using BUG_ON(),
> so if this is the latest coding style, I can change it.
Frankly speaking, there are various ways to handle situations like this.
Developers may prefer BUG_ON() to stop the code execution at the place of the
problem. However, end-users would like to see the code running rather than
crashing, so they would prefer to see an error code returned. And we are
writing the code for end-users. :) So returning an error code is the more
gentle way, from my point of view.
> > And you decided not to check ci->i_cap_snaps above.
>
> I'm not quite sure what you mean.
> __ceph_have_pending_cap_snap already checks whether i_cap_snaps is empty or not.
> Is this the check you want?
>
I am simply trying to double check that all necessary code checks are in place.
If i_cap_snaps is already checked, then we don't need to add another check.
Thanks,
Slava.
> Thanks,
> ethanwu
> >
> > > + snapc = ceph_get_snap_context(ci->i_head_snapc);
> > > + }
> > > + spin_unlock(&ci->i_ceph_lock);
> > > +
> > > req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout,
> > > ceph_vino(inode),
> > > offset, length,
> > > 0, 1, op,
> > > CEPH_OSD_FLAG_WRITE,
> > > - NULL, 0, 0, false);
> > > + snapc, 0, 0, false);
> > > if (IS_ERR(req)) {
> > > ret = PTR_ERR(req);
> > > goto out;
> > > @@ -2605,6 +2619,7 @@ static int ceph_zero_partial_object(struct inode *inode,
> > > ceph_osdc_put_request(req);
> > >
> > >
> > >
> > >
> > > out:
> > > + ceph_put_snap_context(snapc);
> > > return ret;
> > > }
> > >
> > >
> > >
> > >
> >
> > Looks good.
> >
> > Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
> > Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
> >
> > Thanks,
> > Slava.
* RE: [PATCH 2/2] ceph: fix snapshot context missing in ceph_uninline_data
From: Viacheslav Dubeyko @ 2025-09-30 18:44 UTC (permalink / raw)
To: ethanwu@synology.com
Cc: ceph-devel@vger.kernel.org, Xiubo Li, Pavan Rallabhandi,
idryomov@gmail.com, ethan198912@gmail.com
On Tue, 2025-09-30 at 15:30 +0800, ethanwu wrote:
>
> On 2025-09-27 05:52, Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > On Thu, 2025-09-25 at 18:42 +0800, ethanwu wrote:
> > > The ceph_uninline_data function was missing proper snapshot context
> > > handling for its OSD write operations. Both CEPH_OSD_OP_CREATE and
> > > CEPH_OSD_OP_WRITE requests were passing NULL instead of the appropriate
> > > snapshot context, which could lead to unnecessary object clone.
> > >
> > > Reproducer:
> > > ../src/vstart.sh --new -x --localhost --bluestore
> > > // turn on cephfs inline data
> > > ./bin/ceph fs set a inline_data true --yes-i-really-really-mean-it
> > > // allow fs_a client to take snapshot
> > > ./bin/ceph auth caps client.fs_a mds 'allow rwps fsname=a' mon 'allow r fsname=a' osd 'allow rw tag cephfs data=a'
> > > // mount cephfs with fuse, since kernel cephfs doesn't support inline write
> > > ceph-fuse --id fs_a -m 127.0.0.1:40318 --conf ceph.conf -d /mnt/mycephfs/
> > > // bump snapshot seq
> > > mkdir /mnt/mycephfs/.snap/snap1
> > > echo "foo" > /mnt/mycephfs/test
> > > // umount and mount it again using kernel cephfs client
> > > umount /mnt/mycephfs
> > > mount -t ceph fs_a@.a=/ /mnt/mycephfs/ -o conf=./ceph.conf
> > > echo "bar" >> /mnt/mycephfs/test
> > > ./bin/rados listsnaps -p cephfs.a.data $(printf "%x\n" $(stat -c %i /mnt/mycephfs/test)).00000000
> > >
> > > will see this object does unnecessary clone
> > > 1000000000a.00000000 (seq:2):
> > > cloneid snaps size overlap
> > > 2 2 4 []
> > > head - 8
> > >
> > > but it's expected to see
> > > 10000000000.00000000 (seq:2):
> > > cloneid snaps size overlap
> > > head - 8
> > >
> > > since there's no snapshot between these 2 writes
> > >
> >
> > Maybe I am doing something wrong here, but I have the same issue on the second
> > virtual machine with the patch applied:
> >
> > VM1 (without patch):
> >
> > sudo ceph-fuse --id admin /mnt/cephfs/
> > /mnt/cephfs/test-snapshot3# mkdir ./.snap/snap1
> > /mnt/cephfs/test-snapshot3# echo "foo" > ./test
> > umount /mnt/cephfs
> >
> > mount -t ceph :/ /mnt/cephfs/ -o name=admin,fs=cephfs
> > /mnt/cephfs/test-snapshot3# echo "bar" >> ./test
> > /mnt/cephfs/test-snapshot3# rados listsnaps -p cephfs_data $(printf "%x\n"
> > $(stat -c %i ./test)).00000000
> > 10000313cb1.00000000:
> > cloneid snaps size overlap
> > 4 4 4 []
> > head - 8
> >
> > VM2 (with patch):
> >
> > ceph-fuse --id admin /mnt/cephfs/
> > /mnt/cephfs/test-snapshot4# mkdir ./.snap/snap1
> > /mnt/cephfs/test-snapshot4# echo "foo" > ./test
> > umount /mnt/cephfs
> >
> > mount -t ceph :/ /mnt/cephfs/ -o name=admin,fs=cephfs
> > /mnt/cephfs/test-snapshot4# echo "bar" >> ./test
> > /mnt/cephfs/test-snapshot4# rados listsnaps -p cephfs_data $(printf "%x\n"
> > $(stat -c %i ./test)).00000000
> > 10000313cb3.00000000:10000313cb3.00000000:
> > cloneid snaps size overlap
> > 5 5 0 []
> > head - 4
>
>
> Looks like the clone comes from the first time we access the pool.
> ceph_write_iter
> --ceph_get_caps
> ----__ceph_get_caps
> ------ceph_pool_perm_check
> --------__ceph_pool_perm_get
> ---------ceph_osdc_alloc_request
>
> If no existing pool perm entry is found, it will try to read/write the pool that the inode layout points to,
> but ceph_osdc_alloc_request doesn't pass a snapc,
>
>
> so the server side will have a zero-size object with snap seq 0.
>
> The next time a write arrives, since the write is equipped with snap seq > 0,
> the server side will do an object clone.
>
> This can be easily reproduced when there are snapshots (so the write will have snap seq > 0).
> The first write to the pool, echo "foo" > /mnt/mycephfs/file,
> will result in the following output:
> 10000000009.00000000 (seq:3):
> cloneid snaps size overlap
> 3 2,3 0 []
> head - 4
>
> I think this could be fixed by using a special reserved inode number to test pool perm instead of using that inode.
> If I change
> ceph_oid_printf(&rd_req->r_base_oid, "%llx.00000000", ci->i_vino.ino);
> in __ceph_pool_perm_get to
> ceph_oid_printf(&rd_req->r_base_oid, "%llx.00000000", LLONG_MAX);
> the unnecessary clone won't happen again.
>
Frankly speaking, I don't quite follow your answer. :) So, does the patch need
to be reworked? Or are my testing steps not fully correct?
Thanks,
Slava.
> thanks,
> ethanwu
> >
> > Maybe the reproduction path is not completely correct? What could be wrong on my
> > side?
> >
> > Thanks,
> > Slava.
> >
> > > clone happened because the first osd request CEPH_OSD_OP_CREATE doesn't
> > > pass snap context so object is created with snap seq 0, but later data
> > > writeback is equipped with snapshot context.
> > > snap.seq(1) > object snap seq(0), so osd does object clone.
> > >
> > > This fix properly acquires the snapshot context before performing
> > > write operations.
> > >
> > > Signed-off-by: ethanwu <ethanwu@synology.com>
> > > ---
> > > fs/ceph/addr.c | 24 ++++++++++++++++++++++--
> > > 1 file changed, 22 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > index 8b202d789e93..0e87a3e8fd2c 100644
> > > --- a/fs/ceph/addr.c
> > > +++ b/fs/ceph/addr.c
> > > @@ -2202,6 +2202,7 @@ int ceph_uninline_data(struct file *file)
> > > struct ceph_osd_request *req = NULL;
> > > struct ceph_cap_flush *prealloc_cf = NULL;
> > > struct folio *folio = NULL;
> > > + struct ceph_snap_context *snapc = NULL;
> > > u64 inline_version = CEPH_INLINE_NONE;
> > > struct page *pages[1];
> > > int err = 0;
> > > @@ -2229,6 +2230,24 @@ int ceph_uninline_data(struct file *file)
> > > if (inline_version == 1) /* initial version, no data */
> > > goto out_uninline;
> > >
> > >
> > > + down_read(&fsc->mdsc->snap_rwsem);
> > > + spin_lock(&ci->i_ceph_lock);
> > > + if (__ceph_have_pending_cap_snap(ci)) {
> > > + struct ceph_cap_snap *capsnap =
> > > + list_last_entry(&ci->i_cap_snaps,
> > > + struct ceph_cap_snap,
> > > + ci_item);
> > > + snapc = ceph_get_snap_context(capsnap->context);
> > > + } else {
> > > + if (!ci->i_head_snapc) {
> > > + ci->i_head_snapc = ceph_get_snap_context(
> > > + ci->i_snap_realm->cached_context);
> > > + }
> > > + snapc = ceph_get_snap_context(ci->i_head_snapc);
> > > + }
> > > + spin_unlock(&ci->i_ceph_lock);
> > > + up_read(&fsc->mdsc->snap_rwsem);
> > > +
> > > folio = read_mapping_folio(inode->i_mapping, 0, file);
> > > if (IS_ERR(folio)) {
> > > err = PTR_ERR(folio);
> > > @@ -2244,7 +2263,7 @@ int ceph_uninline_data(struct file *file)
> > > req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout,
> > > ceph_vino(inode), 0, &len, 0, 1,
> > > CEPH_OSD_OP_CREATE, CEPH_OSD_FLAG_WRITE,
> > > - NULL, 0, 0, false);
> > > + snapc, 0, 0, false);
> > > if (IS_ERR(req)) {
> > > err = PTR_ERR(req);
> > > goto out_unlock;
> > > @@ -2260,7 +2279,7 @@ int ceph_uninline_data(struct file *file)
> > > req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout,
> > > ceph_vino(inode), 0, &len, 1, 3,
> > > CEPH_OSD_OP_WRITE, CEPH_OSD_FLAG_WRITE,
> > > - NULL, ci->i_truncate_seq,
> > > + snapc, ci->i_truncate_seq,
> > > ci->i_truncate_size, false);
> > > if (IS_ERR(req)) {
> > > err = PTR_ERR(req);
> > > @@ -2323,6 +2342,7 @@ int ceph_uninline_data(struct file *file)
> > > folio_put(folio);
> > > }
> > > out:
> > > + ceph_put_snap_context(snapc);
> > > ceph_free_cap_flush(prealloc_cf);
> > > doutc(cl, "%llx.%llx inline_version %llu = %d\n",
> > > ceph_vinop(inode), inline_version, err);
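For reference, a minimal sketch of the __ceph_pool_perm_get() tweak ethanwu
suggests above, written as an illustrative diff rather than a posted patch
(using a reserved object name, with LLONG_MAX as he proposes, is only a
suggestion, not an agreed-upon design):

	-	ceph_oid_printf(&rd_req->r_base_oid, "%llx.00000000", ci->i_vino.ino);
	+	/* probe pool permissions with a reserved object name so the probe
	+	 * never creates this inode's real object with snap seq 0 */
	+	ceph_oid_printf(&rd_req->r_base_oid, "%llx.00000000", LLONG_MAX);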
* RE: [PATCH 1/2] ceph: fix snapshot context missing in ceph_zero_partial_object
From: Viacheslav Dubeyko @ 2025-10-01 18:22 UTC (permalink / raw)
To: ethanwu@synology.com
Cc: ceph-devel@vger.kernel.org, Xiubo Li, Pavan Rallabhandi,
idryomov@gmail.com, ethan198912@gmail.com
On Wed, 2025-10-01 at 11:34 +0800, ethanwu wrote:
> On 2025-10-01 02:33, Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > On Tue, 2025-09-30 at 15:07 +0800, ethanwu wrote:
> > >
> > > On 2025-09-27 05:41, Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > > > On Thu, 2025-09-25 at 18:42 +0800, ethanwu wrote:
> > > > > The ceph_zero_partial_object function was missing proper snapshot
> > > > > context for its OSD write operations, which could lead to data
> > > > > inconsistencies in snapshots.
> > > > >
> > > > > Reproducer:
> > > > > ../src/vstart.sh --new -x --localhost --bluestore
> > > > > ./bin/ceph auth caps client.fs_a mds 'allow rwps fsname=a' mon 'allow r fsname=a' osd 'allow rw tag cephfs data=a'
> > > > > mount -t ceph fs_a@.a=/ /mnt/mycephfs/ -o conf=./ceph.conf
> > > > > dd if=/dev/urandom of=/mnt/mycephfs/foo bs=64K count=1
> > > > > mkdir /mnt/mycephfs/.snap/snap1
> > > > > md5sum /mnt/mycephfs/.snap/snap1/foo
> > > > > fallocate -p -o 0 -l 4096 /mnt/mycephfs/foo
> > > > > echo 3 > /proc/sys/vm/drop/caches
> > > >
> > > > I have on my side: 'echo 3 > /proc/sys/vm/drop_caches'.
> > >
> > > Thanks for pointing this out, I'll update in V2.
> > > >
> > > >
> > > > > md5sum /mnt/mycephfs/.snap/snap1/foo # get different md5sum!!
> > > > >
> > > > > Fixes: ad7a60de882ac ("ceph: punch hole support")
> > > > > Signed-off-by: ethanwu <ethanwu@synology.com>
> > > > > ---
> > > > > fs/ceph/file.c | 17 ++++++++++++++++-
> > > > > 1 file changed, 16 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > > > index c02f100f8552..58cc2cbae8bc 100644
> > > > > --- a/fs/ceph/file.c
> > > > > +++ b/fs/ceph/file.c
> > > > > @@ -2572,6 +2572,7 @@ static int ceph_zero_partial_object(struct inode *inode,
> > > > > struct ceph_inode_info *ci = ceph_inode(inode);
> > > > > struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> > > > > struct ceph_osd_request *req;
> > > > > + struct ceph_snap_context *snapc;
> > > > > int ret = 0;
> > > > > loff_t zero = 0;
> > > > > int op;
> > > > > @@ -2586,12 +2587,25 @@ static int ceph_zero_partial_object(struct inode *inode,
> > > > > op = CEPH_OSD_OP_ZERO;
> > > > > }
> > > > >
> > > > > + spin_lock(&ci->i_ceph_lock);
> > > > > + if (__ceph_have_pending_cap_snap(ci)) {
> > > > > + struct ceph_cap_snap *capsnap =
> > > > > + list_last_entry(&ci->i_cap_snaps,
> > > > > + struct ceph_cap_snap,
> > > > > + ci_item);
> > > > > + snapc = ceph_get_snap_context(capsnap->context);
> > > > > + } else {
> > > > > + BUG_ON(!ci->i_head_snapc);
> > > >
> > > > By the way, why did you decide to use BUG_ON() instead of returning an error here?
> > >
> > > I followed the rest of the places that use i_head_snapc.
> > > They call BUG_ON() when ci->i_head_snapc is NULL;
> > > but running ./scripts/checkpatch.pl --strict warns to avoid using BUG_ON(),
> > > so if this is the latest coding style, I can change it.
> >
> > Frankly speaking, there are various ways to handle situations like this.
> > Developers may prefer BUG_ON() to stop the code execution at the place of the
> > problem. However, end-users would like to see the code running rather than
> > crashing, so they would prefer to see an error code returned. And we are
> > writing the code for end-users. :) So returning an error code is the more
> > gentle way, from my point of view.
>
>
> OK, then I'll return an error code here.
> I don't see any specific errno suitable for this case; is returning EIO OK here?
>
>
As far as I can see, -EIO looks like a reasonable error code for this function.
Thanks,
Slava.
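A minimal sketch of what that v2 change to ceph_zero_partial_object() might
look like (illustrative only; the actual v2 patch may structure it differently):

	} else {
		if (!ci->i_head_snapc) {
			spin_unlock(&ci->i_ceph_lock);
			/* no head snap context; fail gracefully instead of BUG_ON() */
			return -EIO;
		}
		snapc = ceph_get_snap_context(ci->i_head_snapc);
	}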
* RE: [PATCH 2/2] ceph: fix snapshot context missing in ceph_uninline_data
From: Viacheslav Dubeyko @ 2025-10-28 15:26 UTC (permalink / raw)
To: ethanwu@synology.com
Cc: ceph-devel@vger.kernel.org, Xiubo Li, Pavan Rallabhandi,
idryomov@gmail.com, ethan198912@gmail.com
On Wed, 2025-10-01 at 17:15 +0800, ethanwu wrote:
>
> On 2025-10-01 02:44, Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > On Tue, 2025-09-30 at 15:30 +0800, ethanwu wrote:
> > >
> > >
> > > On 2025-09-27 05:52, Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > > > On Thu, 2025-09-25 at 18:42 +0800, ethanwu wrote:
> > > > > The ceph_uninline_data function was missing proper snapshot context
> > > > > handling for its OSD write operations. Both CEPH_OSD_OP_CREATE and
> > > > > CEPH_OSD_OP_WRITE requests were passing NULL instead of the appropriate
> > > > > snapshot context, which could lead to unnecessary object clone.
> > > > >
> > > > > Reproducer:
> > > > > ../src/vstart.sh --new -x --localhost --bluestore
> > > > > // turn on cephfs inline data
> > > > > ./bin/ceph fs set a inline_data true --yes-i-really-really-mean-it
> > > > > // allow fs_a client to take snapshot
> > > > > ./bin/ceph auth caps client.fs_a mds 'allow rwps fsname=a' mon 'allow r fsname=a' osd 'allow rw tag cephfs data=a'
> > > > > // mount cephfs with fuse, since kernel cephfs doesn't support inline write
> > > > > ceph-fuse --id fs_a -m 127.0.0.1:40318 --conf ceph.conf -d /mnt/mycephfs/
> > > > > // bump snapshot seq
> > > > > mkdir /mnt/mycephfs/.snap/snap1
> > > > > echo "foo" > /mnt/mycephfs/test
> > > > > // umount and mount it again using kernel cephfs client
> > > > > umount /mnt/mycephfs
> > > > > mount -t ceph fs_a@.a=/ /mnt/mycephfs/ -o conf=./ceph.conf
> > > > > echo "bar" >> /mnt/mycephfs/test
> > > > > ./bin/rados listsnaps -p cephfs.a.data $(printf "%x\n" $(stat -c %i /mnt/mycephfs/test)).00000000
> > > > >
> > > > > will see this object does unnecessary clone
> > > > > 1000000000a.00000000 (seq:2):
> > > > > cloneid snaps size overlap
> > > > > 2 2 4 []
> > > > > head - 8
> > > > >
> > > > > but it's expected to see
> > > > > 10000000000.00000000 (seq:2):
> > > > > cloneid snaps size overlap
> > > > > head - 8
> > > > >
> > > > > since there's no snapshot between these 2 writes
> > > > >
> > > >
> > > > Maybe I am doing something wrong here, but I have the same issue on the second
> > > > virtual machine with the patch applied:
> > > >
> > > > VM1 (without patch):
> > > >
> > > > sudo ceph-fuse --id admin /mnt/cephfs/
> > > > /mnt/cephfs/test-snapshot3# mkdir ./.snap/snap1
> > > > /mnt/cephfs/test-snapshot3# echo "foo" > ./test
> > > > umount /mnt/cephfs
> > > >
> > > > mount -t ceph :/ /mnt/cephfs/ -o name=admin,fs=cephfs
> > > > /mnt/cephfs/test-snapshot3# echo "bar" >> ./test
> > > > /mnt/cephfs/test-snapshot3# rados listsnaps -p cephfs_data $(printf "%x\n"
> > > > $(stat -c %i ./test)).00000000
> > > > 10000313cb1.00000000:
> > > > cloneid snaps size overlap
> > > > 4 4 4 []
> > > > head - 8
> > > >
> > > > VM2 (with patch):
> > > >
> > > > ceph-fuse --id admin /mnt/cephfs/
> > > > /mnt/cephfs/test-snapshot4# mkdir ./.snap/snap1
> > > > /mnt/cephfs/test-snapshot4# echo "foo" > ./test
> > > > umount /mnt/cephfs
> > > >
> > > > mount -t ceph :/ /mnt/cephfs/ -o name=admin,fs=cephfs
> > > > /mnt/cephfs/test-snapshot4# echo "bar" >> ./test
> > > > /mnt/cephfs/test-snapshot4# rados listsnaps -p cephfs_data $(printf "%x\n"
> > > > $(stat -c %i ./test)).00000000
> > > > 10000313cb3.00000000:10000313cb3.00000000:
> > > > cloneid snaps size overlap
> > > > 5 5 0 []
> > > > head - 4
> > >
> > >
> > >
> > > Looks like the clone comes from the first time we access the pool.
> > > ceph_write_iter
> > > --ceph_get_caps
> > > ----__ceph_get_caps
> > > ------ceph_pool_perm_check
> > > --------__ceph_pool_perm_get
> > > ---------ceph_osdc_alloc_request
> > >
> > >
> > > If no existing pool perm entry is found, it will try to read/write the pool that the inode layout points to,
> > > but ceph_osdc_alloc_request doesn't pass a snapc,
> > >
> > >
> > > so the server side will have a zero-size object with snap seq 0.
> > >
> > >
> > > The next time a write arrives, since the write is equipped with snap seq > 0,
> > > the server side will do an object clone.
> > >
> > >
> > > This can be easily reproduced when there are snapshots (so the write will have snap seq > 0).
> > > The first write to the pool, echo "foo" > /mnt/mycephfs/file,
> > > will result in the following output:
> > > 10000000009.00000000 (seq:3):
> > > cloneid snaps size overlap
> > > 3 2,3 0 []
> > > head - 4
> > >
> > >
> > > I think this could be fixed by using a special reserved inode number to test pool perm instead of using that inode.
> > > If I change
> > > ceph_oid_printf(&rd_req->r_base_oid, "%llx.00000000", ci->i_vino.ino);
> > > in __ceph_pool_perm_get to
> > > ceph_oid_printf(&rd_req->r_base_oid, "%llx.00000000", LLONG_MAX);
> > > the unnecessary clone won't happen again.
> > >
> > >
> >
> > Frankly speaking, I don't quite follow your answer. :) So, does the patch need
> > to be reworked? Or are my testing steps not fully correct?
> >
> > Thanks,
> > Slava.
>
>
> It's another place that misses the snapshot context.
> The pool permission check in libcephfs and kernel cephfs does not attach a snapshot context
> when issuing CEPH_OSD_OP_CREATE for the pool write permission check.
>
> When kernel cephfs calls ceph_get_caps or ceph_try_get_caps,
> it may need ceph_pool_perm_check (if no pool permission check has ever been done for that pool).
> ceph_pool_perm_check will try to issue both a read and a write (using CEPH_OSD_OP_CREATE) request on the target inode to the pool that the inode belongs to
> (see fs/ceph/addr.c:__ceph_pool_perm_get).
> It uses ceph_osdc_alloc_request without giving a snap context,
> so the server side will first create an empty object with no snapshot sequence.
>
> We'll see the following output from ./bin/rados listsnaps -p cephfs.a.data $(printf "%x\n" $(stat -c %i /mnt/mycephfs/test)).00000000
> 10000000032.00000000 (seq:0):
> cloneid snaps size overlap
> head - 0
>
> Later, when the data is written back, the write op is equipped with a snapshot context.
> Comparing the snapshot seq with the object snap seq,
> the OSD thinks there's a snapshot and a clone is needed.
>
> ceph-fuse --id admin /mnt/cephfs/
> /mnt/cephfs/test-snapshot4# mkdir ./.snap/snap1
> /mnt/cephfs/test-snapshot4# echo "foo" > ./dummy #error one
> /mnt/cephfs/test-snapshot4# echo "foo" > ./test #correct one because data is inlined,
>
> rados listsnaps -p cephfs_data $(printf "%x\n" $(stat -c %i ./dummy)).00000000
> will see
> 10000000032.00000000 (seq:0):
> cloneid snaps size overlap
> head - 0
> but
> rados listsnaps -p cephfs_data $(printf "%x\n" $(stat -c %i ./test)).00000000
> shows (2) No such file or directory
> Seeing ENOENT is the correct behavior because the data is inlined.
>
Sorry, I was on a personal trip for the last three weeks, so I wasn't able to take a
deeper look into this.
I still cannot confirm that the patch fixes the issue. The patch's commit
message contains an explanation of the steps and the expected outcome, and I see a
different outcome; this is why I cannot confirm that the patch fixes the issue. I
believe the commit message needs a clearer explanation of the steps, so that I
can reproduce them and see the same result as the expected outcome.
Thanks,
Slava.