From: Xiubo Li <xiubli@redhat.com>
To: Ilya Dryomov <idryomov@gmail.com>
Cc: ceph-devel@vger.kernel.org, jlayton@kernel.org,
mchangir@redhat.com, atomlin@atomlin.com, stable@vger.kernel.org
Subject: Re: [PATCH v3] ceph: blocklist the kclient when receiving corrupted snap trace
Date: Wed, 7 Dec 2022 20:35:26 +0800 [thread overview]
Message-ID: <897fc89b-775f-88ce-1550-90c47220dc18@redhat.com> (raw)
In-Reply-To: <CAOi1vP8hkXZ7w9D5LnMViyjqVCmsKo3H2dg1QpzgHCPuNfvACQ@mail.gmail.com>
On 07/12/2022 18:59, Ilya Dryomov wrote:
> On Tue, Dec 6, 2022 at 1:59 PM <xiubli@redhat.com> wrote:
>> From: Xiubo Li <xiubli@redhat.com>
>>
>> When we receive a corrupted snap trace we don't know what exactly
>> has happened on the MDS side, and we shouldn't continue writing to
>> the OSDs, which may corrupt the snapshot contents.
>>
>> Just try to blocklist this client, and if that fails we need to crash
>> the client instead of leaving it writeable to the OSDs.
>>
>> Cc: stable@vger.kernel.org
>> URL: https://tracker.ceph.com/issues/57686
>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>> ---
>>
>> Thanks for Aaron's feedback.
>>
>> V3:
>> - Fixed ERROR: spaces required around that ':' (ctx:VxW)
>>
>> V2:
>> - Switched to WARN() to taint the Linux kernel.
>>
>> fs/ceph/mds_client.c | 3 ++-
>> fs/ceph/mds_client.h | 1 +
>> fs/ceph/snap.c | 25 +++++++++++++++++++++++++
>> 3 files changed, 28 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>> index cbbaf334b6b8..59094944af28 100644
>> --- a/fs/ceph/mds_client.c
>> +++ b/fs/ceph/mds_client.c
>> @@ -5648,7 +5648,8 @@ static void mds_peer_reset(struct ceph_connection *con)
>> struct ceph_mds_client *mdsc = s->s_mdsc;
>>
>> pr_warn("mds%d closed our session\n", s->s_mds);
>> - send_mds_reconnect(mdsc, s);
>> + if (!mdsc->no_reconnect)
>> + send_mds_reconnect(mdsc, s);
>> }
>>
>> static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
>> index 728b7d72bf76..8e8f0447c0ad 100644
>> --- a/fs/ceph/mds_client.h
>> +++ b/fs/ceph/mds_client.h
>> @@ -413,6 +413,7 @@ struct ceph_mds_client {
>> atomic_t num_sessions;
>> int max_sessions; /* len of sessions array */
>> int stopping; /* true if shutting down */
>> + int no_reconnect; /* true if snap trace is corrupted */
>>
>> atomic64_t quotarealms_count; /* # realms with quota */
>> /*
>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>> index c1c452afa84d..023852b7c527 100644
>> --- a/fs/ceph/snap.c
>> +++ b/fs/ceph/snap.c
>> @@ -767,8 +767,10 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
>> struct ceph_snap_realm *realm;
>> struct ceph_snap_realm *first_realm = NULL;
>> struct ceph_snap_realm *realm_to_rebuild = NULL;
>> + struct ceph_client *client = mdsc->fsc->client;
>> int rebuild_snapcs;
>> int err = -ENOMEM;
>> + int ret;
>> LIST_HEAD(dirty_realms);
>>
>> lockdep_assert_held_write(&mdsc->snap_rwsem);
>> @@ -885,6 +887,29 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
>> if (first_realm)
>> ceph_put_snap_realm(mdsc, first_realm);
>> pr_err("%s error %d\n", __func__, err);
>> +
>> + /*
>> + * When receiving a corrupted snap trace we don't know what
>> + * exactly has happened in MDS side. And we shouldn't continue
>> + * writing to OSD, which may corrupt the snapshot contents.
>> + *
>> + * Just try to blocklist this kclient and if it fails we need
>> + * to crash the kclient instead of leaving it writeable.
> Hi Xiubo,
>
> I'm not sure I understand this "let's blocklist ourselves" concept.
> If the kernel client shouldn't continue writing to OSDs in this case,
> why not just stop issuing writes -- perhaps initiating some equivalent
> of a read-only remount like many local filesystems would do on I/O
> errors (e.g. errors=remount-ro mode)?
I still haven't found a way to handle it like that from the ceph layer. From
what I saw, local filesystems just mark the inodes with EIO when this happens.
>
> Or, perhaps, all in-memory snap contexts could somehow be invalidated
> in this case, making writes fail naturally -- on the client side,
> without actually being sent to OSDs just to be nixed by the blocklist
> hammer.
>
> But further, what makes a failure to decode a snap trace special?
In the known tracker issue, the snapid of one inode was corrupted on the MDS
side, and the snap trace built from that corrupted snapid is itself corrupted.
There may be other cases as well.
> AFAIK we don't do anything close to this for any other decoding
> failure. Wouldn't "when received corrupted XYZ we don't know what
> exactly has happened in MDS side" argument apply to pretty much all
> decoding failures?
The snap trace is different from the other cases. A corrupted snap trace will
affect the whole snap realm hierarchy, which in the worst case affects every
inode in the mount.
This is why I was trying to evict the mount to prevent further IOs.
>
>> + *
>> + * Then this kclient must be remounted to continue after the
>> + * corrupted metadata fixed in the MDS side.
>> + */
>> + mdsc->no_reconnect = 1;
>> + ret = ceph_monc_blocklist_add(&client->monc, &client->msgr.inst.addr);
>> + if (ret) {
>> + pr_err("%s blocklist of %s failed: %d", __func__,
>> + ceph_pr_addr(&client->msgr.inst.addr), ret);
>> + BUG();
> ... and this is a rough equivalent of errors=panic mode.
>
> Is there a corresponding userspace client PR that can be referenced?
> This needs additional background and justification IMO.
Not yet. Anyway, we shouldn't let the client continue doing IOs if adding it
to the blocklist fails.
- Xiubo
>
> Thanks,
>
> Ilya
>