ceph-devel.vger.kernel.org archive mirror
From: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
To: "ethanwu@synology.com" <ethanwu@synology.com>
Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>,
	Xiubo Li <xiubli@redhat.com>,
	Pavan Rallabhandi <Pavan.Rallabhandi@ibm.com>,
	"idryomov@gmail.com" <idryomov@gmail.com>,
	"ethan198912@gmail.com" <ethan198912@gmail.com>
Subject: RE: [PATCH 2/2] ceph: fix snapshot context missing in ceph_uninline_data
Date: Tue, 30 Sep 2025 18:44:16 +0000
Message-ID: <ec7440fbf1411b76a1a56e3511e4463716614cf2.camel@ibm.com>
In-Reply-To: <a16d910c-4838-475f-a5b1-a47323ece137@Mail>

On Tue, 2025-09-30 at 15:30 +0800, ethanwu wrote:
>  
> Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote on 2025-09-27 05:52:
> > On Thu, 2025-09-25 at 18:42 +0800, ethanwu wrote:
> > > The ceph_uninline_data function was missing proper snapshot context
> > > handling for its OSD write operations. Both CEPH_OSD_OP_CREATE and
> > > CEPH_OSD_OP_WRITE requests were passing NULL instead of the appropriate
> > > snapshot context, which could lead to unnecessary object clones.
> > > 
> > > Reproducer:
> > > ../src/vstart.sh --new -x --localhost --bluestore
> > > // turn on cephfs inline data
> > > ./bin/ceph fs set a inline_data true --yes-i-really-really-mean-it
> > > // allow fs_a client to take snapshot
> > > ./bin/ceph auth caps client.fs_a mds 'allow rwps fsname=a' mon 'allow r fsname=a' osd 'allow rw tag cephfs data=a'
> > > // mount cephfs with fuse, since kernel cephfs doesn't support inline write
> > > ceph-fuse --id fs_a -m 127.0.0.1:40318 --conf ceph.conf -d /mnt/mycephfs/
> > > // bump snapshot seq
> > > mkdir /mnt/mycephfs/.snap/snap1
> > > echo "foo" > /mnt/mycephfs/test
> > > // umount and mount it again using kernel cephfs client
> > > umount /mnt/mycephfs
> > > mount -t ceph fs_a@.a=/ /mnt/mycephfs/ -o conf=./ceph.conf
> > > echo "bar" >> /mnt/mycephfs/test
> > > ./bin/rados listsnaps -p cephfs.a.data $(printf "%x\n" $(stat -c %i /mnt/mycephfs/test)).00000000
> > > 
> > > You will see that this object did an unnecessary clone:
> > > 1000000000a.00000000 (seq:2):
> > > cloneid snaps   size    overlap
> > > 2       2       4       []
> > > head    -       8
> > > 
> > > but the expected output is:
> > > 10000000000.00000000 (seq:2):
> > > cloneid snaps   size    overlap
> > > head    -       8
> > > 
> > > since no snapshot was taken between these two writes.
> > > 
> > 
> > Maybe I am doing something wrong here. But I have the same issue on the second
> > virtual machine with the applied patch:
> > 
> > VM1 (without patch):
> > 
> > sudo ceph-fuse --id admin /mnt/cephfs/
> > /mnt/cephfs/test-snapshot3# mkdir ./.snap/snap1
> > /mnt/cephfs/test-snapshot3# echo "foo" > ./test
> > umount /mnt/cephfs
> > 
> > mount -t ceph :/ /mnt/cephfs/ -o name=admin,fs=cephfs
> > /mnt/cephfs/test-snapshot3# echo "bar" >> ./test
> > /mnt/cephfs/test-snapshot3# rados listsnaps -p cephfs_data $(printf "%x\n"
> > $(stat -c %i ./test)).00000000
> > 10000313cb1.00000000:
> > cloneid snaps size overlap
> > 4 4 4 []
> > head - 8
> > 
> > VM2 (with patch):
> > 
> > ceph-fuse --id admin /mnt/cephfs/
> > /mnt/cephfs/test-snapshot4# mkdir ./.snap/snap1
> > /mnt/cephfs/test-snapshot4# echo "foo" > ./test
> > umount /mnt/cephfs
> > 
> > mount -t ceph :/ /mnt/cephfs/ -o name=admin,fs=cephfs
> > /mnt/cephfs/test-snapshot4# echo "bar" >> ./test
> > /mnt/cephfs/test-snapshot4# rados listsnaps -p cephfs_data $(printf "%x\n"
> > $(stat -c %i ./test)).00000000
> > 10000313cb3.00000000:
> > cloneid snaps size overlap
> > 5 5 0 []
> > head - 4
> 
>  
> Looks like the clone comes from the first time we access the pool:
> ceph_write_iter
> --ceph_get_caps
> ----__ceph_get_caps
> ------ceph_pool_perm_check
> --------__ceph_pool_perm_get
> ----------ceph_osdc_alloc_request
>  
> If no existing pool perm entry is found, it will try to read/write the pool that the
> inode layout points to, but ceph_osdc_alloc_request doesn't pass a snapc,
> 
> 
> so the server side ends up with a zero-size object whose snap seq is 0.
>  
> The next time a write arrives, since that write is equipped with snap seq > 0,
> the server side does an object clone.
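>  
> For reference, a heavily condensed sketch of the probe being described (paraphrased
> from __ceph_pool_perm_get in fs/ceph/addr.c; exact arguments may differ across kernel
> versions, the NULL snap context is the relevant part):
>  
>     /* Probe requests are allocated without a snap context ... */
>     rd_req = ceph_osdc_alloc_request(&fsc->client->osdc, NULL /* snapc */,
>                                      1, false, GFP_KERNEL);
>     /* ... and the probe object is named after the inode being checked, so
>      * the write probe creates <ino>.00000000 on the OSD with snap seq 0. */
>     ceph_oid_printf(&rd_req->r_base_oid, "%llx.00000000", ci->i_vino.ino);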
>  
> This can be easily reproduced when there are snapshots (so the write will have snap seq > 0).
> The first write to the pool, echo "foo" > /mnt/mycephfs/file,
> will result in the following output:
> 10000000009.00000000 (seq:3):
> cloneid snaps size overlap
> 3 2,3 0 []
> head - 4
>  
> I think this could be fixed by using a special reserved inode number for the pool perm
> test instead of that inode. If, under __ceph_pool_perm_get, I change
> ceph_oid_printf(&rd_req->r_base_oid, "%llx.00000000", ci->i_vino.ino);
> to
> ceph_oid_printf(&rd_req->r_base_oid, "%llx.00000000", LLONG_MAX);
> the unnecessary clone doesn't happen again.
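>  
> As a minimal sketch of that idea (illustrative only; the constant name below is
> hypothetical, and ethanwu's quick test simply used LLONG_MAX as a never-allocated
> inode number):
>  
>     /* Hypothetical reserved inode number for the pool permission probe, so
>      * the probe object never collides with a real file's backing object. */
>     #define CEPH_POOL_PERM_PROBE_INO  ((u64)LLONG_MAX)
>  
>     ceph_oid_printf(&rd_req->r_base_oid, "%llx.00000000",
>                     CEPH_POOL_PERM_PROBE_INO);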
>  

Frankly speaking, I don't quite follow your answer. :) So, does the patch need
to be reworked? Or are my testing steps not fully correct?

Thanks,
Slava.


> thanks,
> ethanwu
> > 
> > Maybe the reproduction path is not completely correct? What could be wrong on my
> > side?
> > 
> > Thanks,
> > Slava.
> > 
> > > The clone happened because the first OSD request, CEPH_OSD_OP_CREATE, doesn't
> > > pass a snap context, so the object is created with snap seq 0, but the later data
> > > writeback is equipped with a snapshot context.
> > > Since snap seq (1) > object snap seq (0), the OSD does an object clone.
> > > 
> > > This fix properly acquires the snapshot context before performing
> > > the write operations.
> > > 
> > > Signed-off-by: ethanwu <ethanwu@synology.com>
> > > ---
> > >  fs/ceph/addr.c | 24 ++++++++++++++++++++++--
> > >  1 file changed, 22 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > index 8b202d789e93..0e87a3e8fd2c 100644
> > > --- a/fs/ceph/addr.c
> > > +++ b/fs/ceph/addr.c
> > > @@ -2202,6 +2202,7 @@ int ceph_uninline_data(struct file *file)
> > >   struct ceph_osd_request *req = NULL;
> > >   struct ceph_cap_flush *prealloc_cf = NULL;
> > >   struct folio *folio = NULL;
> > > + struct ceph_snap_context *snapc = NULL;
> > >   u64 inline_version = CEPH_INLINE_NONE;
> > >   struct page *pages[1];
> > >   int err = 0;
> > > @@ -2229,6 +2230,24 @@ int ceph_uninline_data(struct file *file)
> > >   if (inline_version == 1) /* initial version, no data */
> > >    goto out_uninline;
> > >  
> > > 
> > > + down_read(&fsc->mdsc->snap_rwsem);
> > > + spin_lock(&ci->i_ceph_lock);
> > > + if (__ceph_have_pending_cap_snap(ci)) {
> > > +  struct ceph_cap_snap *capsnap =
> > > +    list_last_entry(&ci->i_cap_snaps,
> > > +      struct ceph_cap_snap,
> > > +      ci_item);
> > > +  snapc = ceph_get_snap_context(capsnap->context);
> > > + } else {
> > > +  if (!ci->i_head_snapc) {
> > > +   ci->i_head_snapc = ceph_get_snap_context(
> > > +    ci->i_snap_realm->cached_context);
> > > +  }
> > > +  snapc = ceph_get_snap_context(ci->i_head_snapc);
> > > + }
> > > + spin_unlock(&ci->i_ceph_lock);
> > > + up_read(&fsc->mdsc->snap_rwsem);
> > > +
> > >   folio = read_mapping_folio(inode->i_mapping, 0, file);
> > >   if (IS_ERR(folio)) {
> > >    err = PTR_ERR(folio);
> > > @@ -2244,7 +2263,7 @@ int ceph_uninline_data(struct file *file)
> > >   req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout,
> > >          ceph_vino(inode), 0, &len, 0, 1,
> > >          CEPH_OSD_OP_CREATE, CEPH_OSD_FLAG_WRITE,
> > > -        NULL, 0, 0, false);
> > > +        snapc, 0, 0, false);
> > >   if (IS_ERR(req)) {
> > >    err = PTR_ERR(req);
> > >    goto out_unlock;
> > > @@ -2260,7 +2279,7 @@ int ceph_uninline_data(struct file *file)
> > >   req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout,
> > >          ceph_vino(inode), 0, &len, 1, 3,
> > >          CEPH_OSD_OP_WRITE, CEPH_OSD_FLAG_WRITE,
> > > -        NULL, ci->i_truncate_seq,
> > > +        snapc, ci->i_truncate_seq,
> > >          ci->i_truncate_size, false);
> > >   if (IS_ERR(req)) {
> > >    err = PTR_ERR(req);
> > > @@ -2323,6 +2342,7 @@ int ceph_uninline_data(struct file *file)
> > >    folio_put(folio);
> > >   }
> > >  out:
> > > + ceph_put_snap_context(snapc);
> > >   ceph_free_cap_flush(prealloc_cf);
> > >   doutc(cl, "%llx.%llx inline_version %llu = %d\n",
> > >         ceph_vinop(inode), inline_version, err);


Thread overview: 10+ messages
2025-09-25 10:42 [PATCH 0/2] ceph: fix missing snapshot context in write operations ethanwu
2025-09-25 10:42 ` [PATCH 1/2] ceph: fix snapshot context missing in ceph_zero_partial_object ethanwu
2025-09-26 21:40   ` Viacheslav Dubeyko
     [not found]     ` <af89b60b-17e9-4362-bccf-977b4a573a93@Mail>
2025-09-30 18:32       ` Viacheslav Dubeyko
     [not found]         ` <71034561-e258-46a7-a524-9343a6416a4f@Mail>
2025-10-01 18:22           ` Viacheslav Dubeyko
2025-09-25 10:42 ` [PATCH 2/2] ceph: fix snapshot context missing in ceph_uninline_data ethanwu
2025-09-26 21:52   ` Viacheslav Dubeyko
     [not found]     ` <a16d910c-4838-475f-a5b1-a47323ece137@Mail>
2025-09-30 18:44       ` Viacheslav Dubeyko [this message]
     [not found]         ` <1084d2db-580a-417b-a99c-cbde647fd249@Mail>
2025-10-28 15:26           ` Viacheslav Dubeyko
2025-09-25 19:36 ` [PATCH 0/2] ceph: fix missing snapshot context in write operations Viacheslav Dubeyko
