From: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
To: "ethanwu@synology.com" <ethanwu@synology.com>
Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>,
Xiubo Li <xiubli@redhat.com>,
Pavan Rallabhandi <Pavan.Rallabhandi@ibm.com>,
"idryomov@gmail.com" <idryomov@gmail.com>,
"ethan198912@gmail.com" <ethan198912@gmail.com>
Subject: RE: [PATCH 2/2] ceph: fix snapshot context missing in ceph_uninline_data
Date: Tue, 28 Oct 2025 15:26:20 +0000
Message-ID: <05735ecdb98feb50bd5edfa6b9910e4c375d6e6a.camel@ibm.com>
In-Reply-To: <1084d2db-580a-417b-a99c-cbde647fd249@Mail>
On Wed, 2025-10-01 at 17:15 +0800, ethanwu wrote:
>
> On 2025-10-01 02:44, Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > On Tue, 2025-09-30 at 15:30 +0800, ethanwu wrote:
> > >
> > >
> > > On 2025-09-27 05:52, Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > > > On Thu, 2025-09-25 at 18:42 +0800, ethanwu wrote:
> > > > > The ceph_uninline_data function was missing proper snapshot context
> > > > > handling for its OSD write operations. Both CEPH_OSD_OP_CREATE and
> > > > > CEPH_OSD_OP_WRITE requests were passing NULL instead of the appropriate
> > > > > snapshot context, which could lead to unnecessary object clones.
> > > > >
> > > > > Reproducer:
> > > > > ../src/vstart.sh --new -x --localhost --bluestore
> > > > > // turn on cephfs inline data
> > > > > ./bin/ceph fs set a inline_data true --yes-i-really-really-mean-it
> > > > > // allow fs_a client to take snapshot
> > > > > ./bin/ceph auth caps client.fs_a mds 'allow rwps fsname=a' mon 'allow r fsname=a' osd 'allow rw tag cephfs data=a'
> > > > > // mount cephfs with fuse, since kernel cephfs doesn't support inline write
> > > > > ceph-fuse --id fs_a -m 127.0.0.1:40318 --conf ceph.conf -d /mnt/mycephfs/
> > > > > // bump snapshot seq
> > > > > mkdir /mnt/mycephfs/.snap/snap1
> > > > > echo "foo" > /mnt/mycephfs/test
> > > > > // umount and mount it again using kernel cephfs client
> > > > > umount /mnt/mycephfs
> > > > > mount -t ceph fs_a@.a=/ /mnt/mycephfs/ -o conf=./ceph.conf
> > > > > echo "bar" >> /mnt/mycephfs/test
> > > > > ./bin/rados listsnaps -p cephfs.a.data $(printf "%x\n" $(stat -c %i /mnt/mycephfs/test)).00000000
> > > > >
> > > > > We will see that this object does an unnecessary clone:
> > > > > 1000000000a.00000000 (seq:2):
> > > > > cloneid snaps size overlap
> > > > > 2 2 4 []
> > > > > head - 8
> > > > >
> > > > > but it's expected to see
> > > > > 10000000000.00000000 (seq:2):
> > > > > cloneid snaps size overlap
> > > > > head - 8
> > > > >
> > > > > since there is no snapshot between these two writes.
> > > > >
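For context, my reading of the change described above (passing a snap context instead of
NULL when ceph_uninline_data issues its CEPH_OSD_OP_CREATE/CEPH_OSD_OP_WRITE requests) is
roughly the sketch below. This is not the submitted patch, only an illustration; the helpers
and fields (ceph_get_snap_context(), ci->i_head_snapc, ceph_osdc_new_request()) and the local
variables (ci, fsc, inode, req, len) follow my understanding of current mainline fs/ceph code
and are assumptions on my side:

	/* Sketch only, not the actual patch: take a reference on the inode's
	 * head snap context and hand it to the OSD requests instead of NULL. */
	struct ceph_snap_context *snapc;

	spin_lock(&ci->i_ceph_lock);
	snapc = ceph_get_snap_context(ci->i_head_snapc);
	spin_unlock(&ci->i_ceph_lock);

	req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout,
				    ceph_vino(inode), 0, &len, 0, 1,
				    CEPH_OSD_OP_CREATE, CEPH_OSD_FLAG_WRITE,
				    snapc, 0, 0, false);
	/* ... the CEPH_OSD_OP_WRITE request would be given the same snapc ... */

	ceph_put_snap_context(snapc);
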
> > > >
> > > > Maybe I am doing something wrong here. But I have the same issue on the second
> > > > virtual machine with the patch applied:
> > > >
> > > > VM1 (without patch):
> > > >
> > > > sudo ceph-fuse --id admin /mnt/cephfs/
> > > > /mnt/cephfs/test-snapshot3# mkdir ./.snap/snap1
> > > > /mnt/cephfs/test-snapshot3# echo "foo" > ./test
> > > > umount /mnt/cephfs
> > > >
> > > > mount -t ceph :/ /mnt/cephfs/ -o name=admin,fs=cephfs
> > > > /mnt/cephfs/test-snapshot3# echo "bar" >> ./test
> > > > /mnt/cephfs/test-snapshot3# rados listsnaps -p cephfs_data $(printf "%x\n"
> > > > $(stat -c %i ./test)).00000000
> > > > 10000313cb1.00000000:
> > > > cloneid snaps size overlap
> > > > 4 4 4 []
> > > > head - 8
> > > >
> > > > VM2 (with patch):
> > > >
> > > > ceph-fuse --id admin /mnt/cephfs/
> > > > /mnt/cephfs/test-snapshot4# mkdir ./.snap/snap1
> > > > /mnt/cephfs/test-snapshot4# echo "foo" > ./test
> > > > umount /mnt/cephfs
> > > >
> > > > mount -t ceph :/ /mnt/cephfs/ -o name=admin,fs=cephfs
> > > > /mnt/cephfs/test-snapshot4# echo "bar" >> ./test
> > > > /mnt/cephfs/test-snapshot4# rados listsnaps -p cephfs_data $(printf "%x\n"
> > > > $(stat -c %i ./test)).00000000
> > > > 10000313cb3.00000000:
> > > > cloneid snaps size overlap
> > > > 5 5 0 []
> > > > head - 4
> > >
> > >
> > >
> > > Looks like the clone comes from the first time we access the pool:
> > > ceph_write_iter
> > > --ceph_get_caps
> > > ----__ceph_get_caps
> > > ------ceph_pool_perm_check
> > > --------__ceph_pool_perm_get
> > > ---------ceph_osdc_alloc_request
> > >
> > >
> > > If no existing pool perm is found, it will try to read/write the pool the inode layout points to,
> > > but ceph_osdc_alloc_request doesn't pass a snapc,
> > >
> > > so the server side ends up with a zero-size object at snap seq 0.
> > >
> > > The next time a write arrives, since the write carries snap seq > 0,
> > > the server side will do an object clone.
> > >
> > >
> > > This can be easily reproduced when there are snapshots (so the write will have snap seq > 0).
> > > The first write to the pool, echo "foo" > /mnt/mycephfs/file,
> > > will result in the following output:
> > > 10000000009.00000000 (seq:3):
> > > cloneid snaps size overlap
> > > 3 2,3 0 []
> > > head - 4
> > >
> > >
> > > I think this could be fixed by using a special reserved inode number for the pool perm test instead of that inode's own object.
> > > If I change
> > > ceph_oid_printf(&rd_req->r_base_oid, "%llx.00000000", ci->i_vino.ino);
> > > in __ceph_pool_perm_get to
> > > ceph_oid_printf(&rd_req->r_base_oid, "%llx.00000000", LLONG_MAX);
> > > the unnecessary clone won't happen again.
> > >
> > >
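If it helps: from my reading of __ceph_pool_perm_get() in current mainline (so treat the
exact lines as an assumption, not a quote), the write probe copies its oid from the read
probe, which is why the single ceph_oid_printf() change above would cover both requests:

	/* read probe: this is the oid the suggested change would point at a
	 * reserved inode number instead of the inode's own object */
	ceph_oid_printf(&rd_req->r_base_oid, "%llx.00000000", ci->i_vino.ino);

	/* write probe (CEPH_OSD_OP_CREATE) reuses the same oid */
	ceph_oid_copy(&wr_req->r_base_oid, &rd_req->r_base_oid);
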
> >
> > Frankly speaking, I don't quite follow your answer. :) So, does the patch need
> > to be reworked? Or are my testing steps not fully correct?
> >
> > Thanks,
> > Slava.
>
>
> It's another place that misses the snapshot context:
> in the pool permission check, neither libcephfs nor kernel cephfs attaches a snapshot context
> when issuing CEPH_OSD_OP_CREATE for the pool write permission check.
>
> When kernel cephfs calls ceph_get_caps or ceph_try_get_caps,
> it may need ceph_pool_perm_check (if no pool permission check has ever been done for that pool).
> ceph_pool_perm_check will issue both a read and a write (CEPH_OSD_OP_CREATE) request on the target inode to the pool that inode belongs to
> (see fs/ceph/addr.c:__ceph_pool_perm_get).
> It uses ceph_osdc_alloc_request without giving a snap context,
> so the server side will first create an empty object with no snapshot sequence.
>
> We'll see the following output from ./bin/rados listsnaps -p cephfs.a.data $(printf "%x\n" $(stat -c %i /mnt/mycephfs/test)).00000000:
> 10000000032.00000000 (seq:0):
> cloneid snaps size overlap
> head - 0
>
> Later, when the data is written back, the write op is equipped with a snapshot context.
> Comparing the write's snap seq with the object's snap seq,
> the OSD thinks there is a new snapshot and a clone is needed.
>
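For reference, this matches my reading of the probe setup in __ceph_pool_perm_get()
(fs/ceph/addr.c); the lines below are paraphrased from memory of current mainline, so take
them as a sketch rather than an exact quote:

	/* both probe requests are allocated with a NULL snap context */
	rd_req = ceph_osdc_alloc_request(&fsc->client->osdc, NULL, 1, false, GFP_KERNEL);
	osd_req_op_init(rd_req, 0, CEPH_OSD_OP_STAT, 0);

	wr_req = ceph_osdc_alloc_request(&fsc->client->osdc, NULL, 1, false, GFP_KERNEL);
	osd_req_op_init(wr_req, 0, CEPH_OSD_OP_CREATE, CEPH_OSD_OP_FLAG_EXCL);
	/* the CREATE carries no snap context, so the object is created at snap
	 * seq 0; a later write carrying a newer seq then makes the OSD clone it */
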
> ceph-fuse --id admin /mnt/cephfs/
> /mnt/cephfs/test-snapshot4# mkdir ./.snap/snap1
> /mnt/cephfs/test-snapshot4# echo "foo" > ./dummy #error one
> /mnt/cephfs/test-snapshot4# echo "foo" > ./test #correct one because data is inlined,
>
> rados listsnaps -p cephfs_data $(printf "%x\n" $(stat -c %i ./dummy)).00000000
> will see
> 10000000032.00000000 (seq:0):
> cloneid snaps size overlap
> head - 0
> but
> rados listsnaps -p cephfs_data $(printf "%x\n" $(stat -c %i ./test)).00000000
> shows: (2) No such file or directory
> Seeing ENOENT is the correct behavior because the data is inlined.
>
Sorry, I was on a personal trip for the last three weeks, so I wasn't able to take a
deeper look into this.

I still cannot confirm that the patch fixes the issue. The patch's commit
message contains an explanation of the steps and the expected outcome, but I see a
different outcome, which is why I cannot confirm that the patch fixes the issue. I
believe the commit message needs a clearer explanation of the steps, so that I
can reproduce them and see the same result as the expected outcome.

Thanks,
Slava.
Thread overview: 10+ messages
2025-09-25 10:42 [PATCH 0/2] ceph: fix missing snapshot context in write operations ethanwu
2025-09-25 10:42 ` [PATCH 1/2] ceph: fix snapshot context missing in ceph_zero_partial_object ethanwu
2025-09-26 21:40 ` Viacheslav Dubeyko
[not found] ` <af89b60b-17e9-4362-bccf-977b4a573a93@Mail>
2025-09-30 18:32 ` Viacheslav Dubeyko
[not found] ` <71034561-e258-46a7-a524-9343a6416a4f@Mail>
2025-10-01 18:22 ` Viacheslav Dubeyko
2025-09-25 10:42 ` [PATCH 2/2] ceph: fix snapshot context missing in ceph_uninline_data ethanwu
2025-09-26 21:52 ` Viacheslav Dubeyko
[not found] ` <a16d910c-4838-475f-a5b1-a47323ece137@Mail>
2025-09-30 18:44 ` Viacheslav Dubeyko
[not found] ` <1084d2db-580a-417b-a99c-cbde647fd249@Mail>
2025-10-28 15:26 ` Viacheslav Dubeyko [this message]
2025-09-25 19:36 ` [PATCH 0/2] ceph: fix missing snapshot context in write operations Viacheslav Dubeyko