Re: [PATCH v3] ceph: blocklist the kclient when receiving corrupted snap trace

public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed

From: Xiubo Li <xiubli@redhat.com>
To: Venky Shankar <vshankar@redhat.com>
Cc: Ilya Dryomov <idryomov@gmail.com>,
	ceph-devel@vger.kernel.org, jlayton@kernel.org,
	mchangir@redhat.com, atomlin@atomlin.com, stable@vger.kernel.org
Subject: Re: [PATCH v3] ceph: blocklist the kclient when receiving corrupted snap trace
Date: Fri, 9 Dec 2022 16:07:27 +0800	[thread overview]
Message-ID: <1eb59764-5489-8af9-d502-4f3bd44be4dd@redhat.com> (raw)
In-Reply-To: <CACPzV1mhhskPUz1URs6FUJE9CebOVm1xcRTRgK2yYiwD+soO2g@mail.gmail.com>


On 09/12/2022 15:15, Venky Shankar wrote:
> On Fri, Dec 9, 2022 at 12:29 PM Xiubo Li <xiubli@redhat.com> wrote:
>>
>> On 09/12/2022 14:14, Venky Shankar wrote:
>>> On Thu, Dec 8, 2022 at 6:10 AM Xiubo Li <xiubli@redhat.com> wrote:
>>>> On 07/12/2022 22:20, Ilya Dryomov wrote:
>>>>> On Wed, Dec 7, 2022 at 2:31 PM Xiubo Li <xiubli@redhat.com> wrote:
>>>>>> On 07/12/2022 21:19, Xiubo Li wrote:
>>>>>>> On 07/12/2022 18:59, Ilya Dryomov wrote:
>>>>>>>> On Tue, Dec 6, 2022 at 1:59 PM <xiubli@redhat.com> wrote:
>>>>>>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>>>>>>
>>>>>>>>> When received corrupted snap trace we don't know what exactly has
>>>>>>>>> happened in MDS side. And we shouldn't continue writing to OSD,
>>>>>>>>> which may corrupt the snapshot contents.
>>>>>>>>>
>>>>>>>>> Just try to blocklist this client and If fails we need to crash the
>>>>>>>>> client instead of leaving it writeable to OSDs.
>>>>>>>>>
>>>>>>>>> Cc: stable@vger.kernel.org
>>>>>>>>> URL: https://tracker.ceph.com/issues/57686
>>>>>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> Thanks Aaron's feedback.
>>>>>>>>>
>>>>>>>>> V3:
>>>>>>>>> - Fixed ERROR: spaces required around that ':' (ctx:VxW)
>>>>>>>>>
>>>>>>>>> V2:
>>>>>>>>> - Switched to WARN() to taint the Linux kernel.
>>>>>>>>>
>>>>>>>>>      fs/ceph/mds_client.c |  3 ++-
>>>>>>>>>      fs/ceph/mds_client.h |  1 +
>>>>>>>>>      fs/ceph/snap.c       | 25 +++++++++++++++++++++++++
>>>>>>>>>      3 files changed, 28 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>>>>>>>> index cbbaf334b6b8..59094944af28 100644
>>>>>>>>> --- a/fs/ceph/mds_client.c
>>>>>>>>> +++ b/fs/ceph/mds_client.c
>>>>>>>>> @@ -5648,7 +5648,8 @@ static void mds_peer_reset(struct
>>>>>>>>> ceph_connection *con)
>>>>>>>>>             struct ceph_mds_client *mdsc = s->s_mdsc;
>>>>>>>>>
>>>>>>>>>             pr_warn("mds%d closed our session\n", s->s_mds);
>>>>>>>>> -       send_mds_reconnect(mdsc, s);
>>>>>>>>> +       if (!mdsc->no_reconnect)
>>>>>>>>> +               send_mds_reconnect(mdsc, s);
>>>>>>>>>      }
>>>>>>>>>
>>>>>>>>>      static void mds_dispatch(struct ceph_connection *con, struct
>>>>>>>>> ceph_msg *msg)
>>>>>>>>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
>>>>>>>>> index 728b7d72bf76..8e8f0447c0ad 100644
>>>>>>>>> --- a/fs/ceph/mds_client.h
>>>>>>>>> +++ b/fs/ceph/mds_client.h
>>>>>>>>> @@ -413,6 +413,7 @@ struct ceph_mds_client {
>>>>>>>>>             atomic_t                num_sessions;
>>>>>>>>>             int                     max_sessions;  /* len of sessions
>>>>>>>>> array */
>>>>>>>>>             int                     stopping;      /* true if shutting
>>>>>>>>> down */
>>>>>>>>> +       int                     no_reconnect;  /* true if snap trace
>>>>>>>>> is corrupted */
>>>>>>>>>
>>>>>>>>>             atomic64_t              quotarealms_count; /* # realms with
>>>>>>>>> quota */
>>>>>>>>>             /*
>>>>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>>>>>>> index c1c452afa84d..023852b7c527 100644
>>>>>>>>> --- a/fs/ceph/snap.c
>>>>>>>>> +++ b/fs/ceph/snap.c
>>>>>>>>> @@ -767,8 +767,10 @@ int ceph_update_snap_trace(struct
>>>>>>>>> ceph_mds_client *mdsc,
>>>>>>>>>             struct ceph_snap_realm *realm;
>>>>>>>>>             struct ceph_snap_realm *first_realm = NULL;
>>>>>>>>>             struct ceph_snap_realm *realm_to_rebuild = NULL;
>>>>>>>>> +       struct ceph_client *client = mdsc->fsc->client;
>>>>>>>>>             int rebuild_snapcs;
>>>>>>>>>             int err = -ENOMEM;
>>>>>>>>> +       int ret;
>>>>>>>>>             LIST_HEAD(dirty_realms);
>>>>>>>>>
>>>>>>>>>             lockdep_assert_held_write(&mdsc->snap_rwsem);
>>>>>>>>> @@ -885,6 +887,29 @@ int ceph_update_snap_trace(struct
>>>>>>>>> ceph_mds_client *mdsc,
>>>>>>>>>             if (first_realm)
>>>>>>>>>                     ceph_put_snap_realm(mdsc, first_realm);
>>>>>>>>>             pr_err("%s error %d\n", __func__, err);
>>>>>>>>> +
>>>>>>>>> +       /*
>>>>>>>>> +        * When receiving a corrupted snap trace we don't know what
>>>>>>>>> +        * exactly has happened in MDS side. And we shouldn't continue
>>>>>>>>> +        * writing to OSD, which may corrupt the snapshot contents.
>>>>>>>>> +        *
>>>>>>>>> +        * Just try to blocklist this kclient and if it fails we need
>>>>>>>>> +        * to crash the kclient instead of leaving it writeable.
>>>>>>>> Hi Xiubo,
>>>>>>>>
>>>>>>>> I'm not sure I understand this "let's blocklist ourselves" concept.
>>>>>>>> If the kernel client shouldn't continue writing to OSDs in this case,
>>>>>>>> why not just stop issuing writes -- perhaps initiating some equivalent
>>>>>>>> of a read-only remount like many local filesystems would do on I/O
>>>>>>>> errors (e.g. errors=remount-ro mode)?
>>>>>>> The following patch seems working. Let me do more test to make sure
>>>>>>> there is not further crash.
>>>>>>>
>>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>>>>> index c1c452afa84d..cd487f8a4cb5 100644
>>>>>>> --- a/fs/ceph/snap.c
>>>>>>> +++ b/fs/ceph/snap.c
>>>>>>> @@ -767,6 +767,7 @@ int ceph_update_snap_trace(struct ceph_mds_client
>>>>>>> *mdsc,
>>>>>>>            struct ceph_snap_realm *realm;
>>>>>>>            struct ceph_snap_realm *first_realm = NULL;
>>>>>>>            struct ceph_snap_realm *realm_to_rebuild = NULL;
>>>>>>> +       struct super_block *sb = mdsc->fsc->sb;
>>>>>>>            int rebuild_snapcs;
>>>>>>>            int err = -ENOMEM;
>>>>>>>            LIST_HEAD(dirty_realms);
>>>>>>> @@ -885,6 +886,9 @@ int ceph_update_snap_trace(struct ceph_mds_client
>>>>>>> *mdsc,
>>>>>>>            if (first_realm)
>>>>>>>                    ceph_put_snap_realm(mdsc, first_realm);
>>>>>>>            pr_err("%s error %d\n", __func__, err);
>>>>>>> +       pr_err("Remounting filesystem read-only\n");
>>>>>>> +       sb->s_flags |= SB_RDONLY;
>>>>>>> +
>>>>>>>            return err;
>>>>>>>     }
>>>>>>>
>>>>>>>
>>>>>> For readonly approach is also my first thought it should be, but I was
>>>>>> just not very sure whether it would be the best approach.
>>>>>>
>>>>>> Because by evicting the kclient we could prevent the buffer to be wrote
>>>>>> to OSDs. But the readonly one seems won't ?
>>>>> The read-only setting is more for the VFS and the user.  Technically,
>>>>> the kernel client could just stop issuing writes (i.e. OSD requests
>>>>> containing a write op) and not set SB_RDONLY.  That should cover any
>>>>> buffered data as well.
>>>>    From reading the local exit4 and other fs, they all doing it this way
>>>> and the VFS will help stop further writing. Tested the above patch and
>>>> it worked as expected.
>>>>
>>>> I think to stop the following OSD requests we can just check the
>>>> SB_RDONLY flag to prevent the buffer writeback.
>>>>
>>>>> By employing self-blocklisting, you are shifting the responsibility
>>>>> of rejecting OSD requests to the OSDs.  I'm saying that not issuing
>>>>> OSD requests from a potentially busted client in the first place is
>>>>> probably a better idea.  At the very least you wouldn't need to BUG
>>>>> on ceph_monc_blocklist_add() errors.
>>>> I found an issue for the read-only approach:
>>>>
>>>> In read-only mode it still can access to the MDSs and OSDs, which will
>>>> continue trying to update the snap realms with the corrupted snap trace
>>>> as before when reading. What if users try to read or backup the
>>>> snapshots by using the corrupted snap realms ?
>>>>
>>>> Isn't that a problem ?
>>> Yeh - this might end up in more problems in various places than what
>>> this change is supposed to handle.
>>>
>>> Maybe we could track affected realms (although, not that granular) and
>>> disallow reads to them (and its children), but I think it might not be
>>> worth putting in the effort.
>> IMO this doesn't make much sense.
>>
>> When reading we need to use the inodes to get the corresponding realms,
>> once the metadatas are corrupted and aborted here the inodes'
>> corresponding realms could be incorrect. That's because when updating
>> the snap realms here it's possible that the snap realm hierarchy will be
>> adjusted and some inodes' realms will be changed.
> My point was if at all we could identify the realms correctly, which
> seems risky with the corrupted info received. Seems like this needs to
> be an all or none approach.

Yeah. Agree.

>
>> Thanks
>>
>> - Xiubo
>>
>>>> Thanks
>>>>
>>>> - Xiubo
>>>>
>>>>> Thanks,
>>>>>
>>>>>                    Ilya
>>>>>
>

     prev parent reply	other threads:[~2022-12-09  8:08 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-06 12:59 [PATCH v3] ceph: blocklist the kclient when receiving corrupted snap trace xiubli
2022-12-07 10:59 ` Ilya Dryomov
2022-12-07 12:35   ` Xiubo Li
2022-12-07 14:28     ` Ilya Dryomov
2022-12-08  1:04       ` Xiubo Li
2022-12-09 11:28         ` Ilya Dryomov
2022-12-07 13:19   ` Xiubo Li
2022-12-07 13:30     ` Xiubo Li
2022-12-07 14:20       ` Ilya Dryomov
2022-12-08  0:36         ` Xiubo Li
2022-12-09  6:14           ` Venky Shankar
2022-12-09  6:59             ` Xiubo Li
2022-12-09  7:15               ` Venky Shankar
2022-12-09  8:07                 ` Xiubo Li [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1eb59764-5489-8af9-d502-4f3bd44be4dd@redhat.com \
    --to=xiubli@redhat.com \
    --cc=atomlin@atomlin.com \
    --cc=ceph-devel@vger.kernel.org \
    --cc=idryomov@gmail.com \
    --cc=jlayton@kernel.org \
    --cc=mchangir@redhat.com \
    --cc=stable@vger.kernel.org \
    --cc=vshankar@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox