From: Xiubo Li <xiubli@redhat.com>
To: Ilya Dryomov <idryomov@gmail.com>
Cc: ceph-devel@vger.kernel.org, jlayton@kernel.org,
mchangir@redhat.com, atomlin@atomlin.com, stable@vger.kernel.org
Subject: Re: [PATCH v3] ceph: blocklist the kclient when receiving corrupted snap trace
Date: Wed, 7 Dec 2022 21:19:31 +0800
Message-ID: <baa681e9-4472-bcfb-601f-132dc6658888@redhat.com>
In-Reply-To: <CAOi1vP8hkXZ7w9D5LnMViyjqVCmsKo3H2dg1QpzgHCPuNfvACQ@mail.gmail.com>
On 07/12/2022 18:59, Ilya Dryomov wrote:
> On Tue, Dec 6, 2022 at 1:59 PM <xiubli@redhat.com> wrote:
>> From: Xiubo Li <xiubli@redhat.com>
>>
>> When we receive a corrupted snap trace we don't know what exactly
>> has happened on the MDS side, and we shouldn't continue writing to
>> the OSDs, which may corrupt the snapshot contents.
>>
>> Just try to blocklist this client, and if that fails we need to
>> crash the client instead of leaving it writeable to the OSDs.
>>
>> Cc: stable@vger.kernel.org
>> URL: https://tracker.ceph.com/issues/57686
>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>> ---
>>
>> Thanks to Aaron for the feedback.
>>
>> V3:
>> - Fixed ERROR: spaces required around that ':' (ctx:VxW)
>>
>> V2:
>> - Switched to WARN() to taint the Linux kernel.
>>
>> fs/ceph/mds_client.c | 3 ++-
>> fs/ceph/mds_client.h | 1 +
>> fs/ceph/snap.c | 25 +++++++++++++++++++++++++
>> 3 files changed, 28 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>> index cbbaf334b6b8..59094944af28 100644
>> --- a/fs/ceph/mds_client.c
>> +++ b/fs/ceph/mds_client.c
>> @@ -5648,7 +5648,8 @@ static void mds_peer_reset(struct ceph_connection *con)
>> struct ceph_mds_client *mdsc = s->s_mdsc;
>>
>> pr_warn("mds%d closed our session\n", s->s_mds);
>> - send_mds_reconnect(mdsc, s);
>> + if (!mdsc->no_reconnect)
>> + send_mds_reconnect(mdsc, s);
>> }
>>
>> static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
>> index 728b7d72bf76..8e8f0447c0ad 100644
>> --- a/fs/ceph/mds_client.h
>> +++ b/fs/ceph/mds_client.h
>> @@ -413,6 +413,7 @@ struct ceph_mds_client {
>> atomic_t num_sessions;
>> int max_sessions; /* len of sessions array */
>> int stopping; /* true if shutting down */
>> + int no_reconnect; /* true if snap trace is corrupted */
>>
>> atomic64_t quotarealms_count; /* # realms with quota */
>> /*
>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>> index c1c452afa84d..023852b7c527 100644
>> --- a/fs/ceph/snap.c
>> +++ b/fs/ceph/snap.c
>> @@ -767,8 +767,10 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
>> struct ceph_snap_realm *realm;
>> struct ceph_snap_realm *first_realm = NULL;
>> struct ceph_snap_realm *realm_to_rebuild = NULL;
>> + struct ceph_client *client = mdsc->fsc->client;
>> int rebuild_snapcs;
>> int err = -ENOMEM;
>> + int ret;
>> LIST_HEAD(dirty_realms);
>>
>> lockdep_assert_held_write(&mdsc->snap_rwsem);
>> @@ -885,6 +887,29 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
>> if (first_realm)
>> ceph_put_snap_realm(mdsc, first_realm);
>> pr_err("%s error %d\n", __func__, err);
>> +
>> + /*
>> + * When receiving a corrupted snap trace we don't know what
>> + * exactly has happened on the MDS side, and we shouldn't continue
>> + * writing to the OSDs, which may corrupt the snapshot contents.
>> + *
>> + * Just try to blocklist this kclient, and if that fails we need
>> + * to crash the kclient instead of leaving it writeable.
> Hi Xiubo,
>
> I'm not sure I understand this "let's blocklist ourselves" concept.
> If the kernel client shouldn't continue writing to OSDs in this case,
> why not just stop issuing writes -- perhaps initiating some equivalent
> of a read-only remount like many local filesystems would do on I/O
> errors (e.g. errors=remount-ro mode)?
The following patch seems to work. Let me do more testing to make sure
there are no further crashes.
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index c1c452afa84d..cd487f8a4cb5 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -767,6 +767,7 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
struct ceph_snap_realm *realm;
struct ceph_snap_realm *first_realm = NULL;
struct ceph_snap_realm *realm_to_rebuild = NULL;
+ struct super_block *sb = mdsc->fsc->sb;
int rebuild_snapcs;
int err = -ENOMEM;
LIST_HEAD(dirty_realms);
@@ -885,6 +886,9 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
if (first_realm)
ceph_put_snap_realm(mdsc, first_realm);
pr_err("%s error %d\n", __func__, err);
+ pr_err("Remounting filesystem read-only\n");
+ sb->s_flags |= SB_RDONLY;
+
return err;
}
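
For context, once SB_RDONLY is set, new writes should start failing on
the client side with -EROFS through the generic VFS permission checks,
roughly like this (a simplified sketch of sb_permission() in
fs/namei.c, not part of the patch):

static int sb_permission(struct super_block *sb, struct inode *inode,
                         int mask)
{
        if (unlikely(mask & MAY_WRITE)) {
                umode_t mode = inode->i_mode;

                /* Nobody gets write access to a read-only fs. */
                if (sb_rdonly(sb) &&
                    (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
                        return -EROFS;
        }
        return 0;
}

Note this only covers paths that consult sb_rdonly()/IS_RDONLY();
writeback of already-dirty data would likely need separate handling.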
>
> Or, perhaps, all in-memory snap contexts could somehow be invalidated
> in this case, making writes fail naturally -- on the client side,
> without actually being sent to OSDs just to be nixed by the blocklist
> hammer.
>
> But further, what makes a failure to decode a snap trace special?
> AFAIK we don't do anything close to this for any other decoding
> failure. Wouldn't the "when we receive a corrupted XYZ we don't
> know what exactly has happened on the MDS side" argument apply to
> pretty much all decoding failures?
>
>> + *
>> + * Then this kclient must be remounted to continue once the
>> + * corrupted metadata has been fixed on the MDS side.
>> + */
>> + mdsc->no_reconnect = 1;
>> + ret = ceph_monc_blocklist_add(&client->monc, &client->msgr.inst.addr);
>> + if (ret) {
>> + pr_err("%s blocklist of %s failed: %d\n", __func__,
>> + ceph_pr_addr(&client->msgr.inst.addr), ret);
>> + BUG();
> ... and this is a rough equivalent of errors=panic mode.
>
> Is there a corresponding userspace client PR that can be referenced?
> This needs additional background and justification IMO.
>
> Thanks,
>
> Ilya
>
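As a rough sketch of the snap context invalidation idea above (the
"invalid" flag below is hypothetical, there is no such field in
struct ceph_snap_context today), it could look something like:

/*
 * Hypothetical sketch only: mark every cached snap context invalid
 * when snap trace decoding fails, and have the write path refuse to
 * build OSD requests from an invalid context, so writes fail
 * naturally on the client side.
 */
struct ceph_snap_context {
        refcount_t nref;
        u64 seq;
        bool invalid;   /* hypothetical: set on snap trace decode error */
        u32 num_snaps;
        u64 snaps[];
};

static int ceph_snapc_check(struct ceph_snap_context *snapc)
{
        if (unlikely(READ_ONCE(snapc->invalid)))
                return -EIO;    /* fail writes before they reach the OSDs */
        return 0;
}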