From: Xiubo Li <xiubli@redhat.com>
To: Venky Shankar <vshankar@redhat.com>
Cc: Ilya Dryomov <idryomov@gmail.com>,
ceph-devel@vger.kernel.org, jlayton@kernel.org,
mchangir@redhat.com, atomlin@atomlin.com, stable@vger.kernel.org
Subject: Re: [PATCH v3] ceph: blocklist the kclient when receiving corrupted snap trace
Date: Fri, 9 Dec 2022 16:07:27 +0800 [thread overview]
Message-ID: <1eb59764-5489-8af9-d502-4f3bd44be4dd@redhat.com> (raw)
In-Reply-To: <CACPzV1mhhskPUz1URs6FUJE9CebOVm1xcRTRgK2yYiwD+soO2g@mail.gmail.com>
On 09/12/2022 15:15, Venky Shankar wrote:
> On Fri, Dec 9, 2022 at 12:29 PM Xiubo Li <xiubli@redhat.com> wrote:
>>
>> On 09/12/2022 14:14, Venky Shankar wrote:
>>> On Thu, Dec 8, 2022 at 6:10 AM Xiubo Li <xiubli@redhat.com> wrote:
>>>> On 07/12/2022 22:20, Ilya Dryomov wrote:
>>>>> On Wed, Dec 7, 2022 at 2:31 PM Xiubo Li <xiubli@redhat.com> wrote:
>>>>>> On 07/12/2022 21:19, Xiubo Li wrote:
>>>>>>> On 07/12/2022 18:59, Ilya Dryomov wrote:
>>>>>>>> On Tue, Dec 6, 2022 at 1:59 PM <xiubli@redhat.com> wrote:
>>>>>>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>>>>>>
>>>>>>>>> When receiving a corrupted snap trace we don't know what exactly
>>>>>>>>> has happened on the MDS side, and we shouldn't continue writing to
>>>>>>>>> the OSDs, which may corrupt the snapshot contents.
>>>>>>>>>
>>>>>>>>> Just try to blocklist this client, and if that fails we need to
>>>>>>>>> crash the client instead of leaving it writeable to the OSDs.
>>>>>>>>>
>>>>>>>>> Cc: stable@vger.kernel.org
>>>>>>>>> URL: https://tracker.ceph.com/issues/57686
>>>>>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> Thanks for Aaron's feedback.
>>>>>>>>>
>>>>>>>>> V3:
>>>>>>>>> - Fixed ERROR: spaces required around that ':' (ctx:VxW)
>>>>>>>>>
>>>>>>>>> V2:
>>>>>>>>> - Switched to WARN() to taint the Linux kernel.
>>>>>>>>>
>>>>>>>>> fs/ceph/mds_client.c | 3 ++-
>>>>>>>>> fs/ceph/mds_client.h | 1 +
>>>>>>>>> fs/ceph/snap.c | 25 +++++++++++++++++++++++++
>>>>>>>>> 3 files changed, 28 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>>>>>>>> index cbbaf334b6b8..59094944af28 100644
>>>>>>>>> --- a/fs/ceph/mds_client.c
>>>>>>>>> +++ b/fs/ceph/mds_client.c
>>>>>>>>> @@ -5648,7 +5648,8 @@ static void mds_peer_reset(struct ceph_connection *con)
>>>>>>>>> struct ceph_mds_client *mdsc = s->s_mdsc;
>>>>>>>>>
>>>>>>>>> pr_warn("mds%d closed our session\n", s->s_mds);
>>>>>>>>> - send_mds_reconnect(mdsc, s);
>>>>>>>>> + if (!mdsc->no_reconnect)
>>>>>>>>> + send_mds_reconnect(mdsc, s);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
>>>>>>>>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
>>>>>>>>> index 728b7d72bf76..8e8f0447c0ad 100644
>>>>>>>>> --- a/fs/ceph/mds_client.h
>>>>>>>>> +++ b/fs/ceph/mds_client.h
>>>>>>>>> @@ -413,6 +413,7 @@ struct ceph_mds_client {
>>>>>>>>> atomic_t num_sessions;
>>>>>>>>>          int max_sessions; /* len of sessions array */
>>>>>>>>>          int stopping; /* true if shutting down */
>>>>>>>>> +        int no_reconnect; /* true if snap trace is corrupted */
>>>>>>>>>
>>>>>>>>>          atomic64_t quotarealms_count; /* # realms with quota */
>>>>>>>>> /*
>>>>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>>>>>>> index c1c452afa84d..023852b7c527 100644
>>>>>>>>> --- a/fs/ceph/snap.c
>>>>>>>>> +++ b/fs/ceph/snap.c
>>>>>>>>> @@ -767,8 +767,10 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
>>>>>>>>> struct ceph_snap_realm *realm;
>>>>>>>>> struct ceph_snap_realm *first_realm = NULL;
>>>>>>>>> struct ceph_snap_realm *realm_to_rebuild = NULL;
>>>>>>>>> + struct ceph_client *client = mdsc->fsc->client;
>>>>>>>>> int rebuild_snapcs;
>>>>>>>>> int err = -ENOMEM;
>>>>>>>>> + int ret;
>>>>>>>>> LIST_HEAD(dirty_realms);
>>>>>>>>>
>>>>>>>>> lockdep_assert_held_write(&mdsc->snap_rwsem);
>>>>>>>>> @@ -885,6 +887,29 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
>>>>>>>>> if (first_realm)
>>>>>>>>> ceph_put_snap_realm(mdsc, first_realm);
>>>>>>>>> pr_err("%s error %d\n", __func__, err);
>>>>>>>>> +
>>>>>>>>> +       /*
>>>>>>>>> +        * When receiving a corrupted snap trace we don't know what
>>>>>>>>> +        * exactly has happened on the MDS side, and we shouldn't
>>>>>>>>> +        * continue writing to the OSDs, which may corrupt the
>>>>>>>>> +        * snapshot contents.
>>>>>>>>> +        *
>>>>>>>>> +        * Just try to blocklist this kclient and if it fails we need
>>>>>>>>> +        * to crash the kclient instead of leaving it writeable.
>>>>>>>> Hi Xiubo,
>>>>>>>>
>>>>>>>> I'm not sure I understand this "let's blocklist ourselves" concept.
>>>>>>>> If the kernel client shouldn't continue writing to OSDs in this case,
>>>>>>>> why not just stop issuing writes -- perhaps initiating some equivalent
>>>>>>>> of a read-only remount like many local filesystems would do on I/O
>>>>>>>> errors (e.g. errors=remount-ro mode)?
>>>>>>> The following patch seems to work. Let me do more testing to make
>>>>>>> sure there are no further crashes.
>>>>>>>
>>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>>>>> index c1c452afa84d..cd487f8a4cb5 100644
>>>>>>> --- a/fs/ceph/snap.c
>>>>>>> +++ b/fs/ceph/snap.c
>>>>>>> @@ -767,6 +767,7 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
>>>>>>> struct ceph_snap_realm *realm;
>>>>>>> struct ceph_snap_realm *first_realm = NULL;
>>>>>>> struct ceph_snap_realm *realm_to_rebuild = NULL;
>>>>>>> + struct super_block *sb = mdsc->fsc->sb;
>>>>>>> int rebuild_snapcs;
>>>>>>> int err = -ENOMEM;
>>>>>>> LIST_HEAD(dirty_realms);
>>>>>>> @@ -885,6 +886,9 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
>>>>>>> if (first_realm)
>>>>>>> ceph_put_snap_realm(mdsc, first_realm);
>>>>>>> pr_err("%s error %d\n", __func__, err);
>>>>>>> + pr_err("Remounting filesystem read-only\n");
>>>>>>> + sb->s_flags |= SB_RDONLY;
>>>>>>> +
>>>>>>> return err;
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>> The read-only approach was also my first thought, but I was just not
>>>>>> very sure whether it would be the best one.
>>>>>>
>>>>>> Because by evicting the kclient we could prevent the buffered data
>>>>>> from being written to the OSDs. But the read-only approach seems like
>>>>>> it won't?
>>>>> The read-only setting is more for the VFS and the user. Technically,
>>>>> the kernel client could just stop issuing writes (i.e. OSD requests
>>>>> containing a write op) and not set SB_RDONLY. That should cover any
>>>>> buffered data as well.
>>>> From reading the local ext4 and other filesystems, they all do it this
>>>> way and the VFS will help stop further writes. I tested the above patch
>>>> and it worked as expected.
>>>>
>>>> I think to stop the subsequent OSD requests we can just check the
>>>> SB_RDONLY flag to prevent the buffered writeback.
>>>>
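A minimal sketch of that SB_RDONLY gate, assuming it sits at the top of
ceph_writepages_start() in fs/ceph/addr.c (the exact hook point is not
settled in this thread, so take the placement as an assumption):

static int ceph_writepages_start(struct address_space *mapping,
                                 struct writeback_control *wbc)
{
        struct inode *inode = mapping->host;

        /*
         * Assumed placement: if the superblock was flipped to
         * read-only after a corrupted snap trace, refuse to start
         * writeback to the OSDs rather than risk corrupting the
         * snapshot contents.
         */
        if (inode->i_sb->s_flags & SB_RDONLY)
                return -EROFS;

        /* ... the normal writeback path continues here ... */
}

Dirty pages skipped by the early return would still need to be discarded
or invalidated somewhere; the sketch only shows where the gate would sit.
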
>>>>> By employing self-blocklisting, you are shifting the responsibility
>>>>> of rejecting OSD requests to the OSDs. I'm saying that not issuing
>>>>> OSD requests from a potentially busted client in the first place is
>>>>> probably a better idea. At the very least you wouldn't need to BUG
>>>>> on ceph_monc_blocklist_add() errors.
>>>> I found an issue with the read-only approach:
>>>>
>>>> In read-only mode the client can still access the MDSs and OSDs, and it
>>>> will keep trying to update the snap realms with the corrupted snap
>>>> trace as before when reading. What if users try to read or back up the
>>>> snapshots using the corrupted snap realms?
>>>>
>>>> Isn't that a problem?
>>> Yeah - this might end up causing more problems in various places than
>>> what this change is supposed to handle.
>>>
>>> Maybe we could track the affected realms (although that's not very
>>> granular) and disallow reads to them (and their children), but I think
>>> it might not be worth putting in the effort.
>> IMO this doesn't make much sense.
>>
>> When reading we need to use the inodes to get the corresponding realms,
>> and once the metadata is corrupted and the update is aborted here, the
>> inodes' corresponding realms could be incorrect. That's because when
>> updating the snap realms it's possible that the snap realm hierarchy
>> will be adjusted and some inodes' realms will be changed.
> My point was whether we could identify the realms correctly at all, which
> seems risky with the corrupted info received. Seems like this needs to
> be an all-or-none approach.
Yeah. Agree.
>
>> Thanks
>>
>> - Xiubo
>>
>>>> Thanks
>>>>
>>>> - Xiubo
>>>>
>>>>> Thanks,
>>>>>
>>>>> Ilya
>>>>>
>
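For reference, the self-blocklisting being debated above boils down to
roughly the following sketch in the ceph_update_snap_trace() error path.
The call itself is the one Ilya names (ceph_monc_blocklist_add()); the
WARN-on-failure shown here is only one of the reactions discussed (v2 of
the patch switched to WARN(), while BUG-ing on failure is the alternative):

        struct ceph_client *client = mdsc->fsc->client;
        int ret;

        /*
         * Ask the monitors to blocklist our own address so the OSDs
         * will reject any further requests from this potentially
         * busted client.
         */
        ret = ceph_monc_blocklist_add(&client->monc,
                                      &client->msgr.inst.addr);
        if (ret)
                WARN(1, "%s: blocklisting ourselves failed (%d)\n",
                     __func__, ret);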
Thread overview: 14+ messages
2022-12-06 12:59 [PATCH v3] ceph: blocklist the kclient when receiving corrupted snap trace xiubli
2022-12-07 10:59 ` Ilya Dryomov
2022-12-07 12:35 ` Xiubo Li
2022-12-07 14:28 ` Ilya Dryomov
2022-12-08 1:04 ` Xiubo Li
2022-12-09 11:28 ` Ilya Dryomov
2022-12-07 13:19 ` Xiubo Li
2022-12-07 13:30 ` Xiubo Li
2022-12-07 14:20 ` Ilya Dryomov
2022-12-08 0:36 ` Xiubo Li
2022-12-09 6:14 ` Venky Shankar
2022-12-09 6:59 ` Xiubo Li
2022-12-09 7:15 ` Venky Shankar
2022-12-09 8:07 ` Xiubo Li [this message]