Subject: Re: [PATCH v3] ceph: blocklist the kclient when receiving corrupted snap trace
From: Xiubo Li
To: Ilya Dryomov
Cc: ceph-devel@vger.kernel.org, jlayton@kernel.org, mchangir@redhat.com,
    atomlin@atomlin.com, stable@vger.kernel.org
References: <20221206125915.37404-1-xiubli@redhat.com>
Date: Wed, 7 Dec 2022 21:30:52 +0800

On 07/12/2022 21:19, Xiubo Li wrote:
>
> On 07/12/2022 18:59, Ilya Dryomov wrote:
>> On Tue, Dec 6, 2022 at 1:59 PM wrote:
>>> From: Xiubo Li
>>>
>>> When receiving a corrupted snap trace we don't know what exactly
>>> has happened on the MDS side, and we shouldn't continue writing to
>>> the OSDs, which may corrupt the snapshot contents.
>>>
>>> Just try to blocklist this client, and if that fails we need to
>>> crash the client instead of leaving it writeable to the OSDs.
>>>
>>> Cc: stable@vger.kernel.org
>>> URL: https://tracker.ceph.com/issues/57686
>>> Signed-off-by: Xiubo Li
>>> ---
>>>
>>> Thanks for Aaron's feedback.
>>>
>>> V3:
>>> - Fixed ERROR: spaces required around that ':' (ctx:VxW)
>>>
>>> V2:
>>> - Switched to WARN() to taint the Linux kernel.
>>>
>>>   fs/ceph/mds_client.c |  3 ++-
>>>   fs/ceph/mds_client.h |  1 +
>>>   fs/ceph/snap.c       | 25 +++++++++++++++++++++++++
>>>   3 files changed, 28 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>> index cbbaf334b6b8..59094944af28 100644
>>> --- a/fs/ceph/mds_client.c
>>> +++ b/fs/ceph/mds_client.c
>>> @@ -5648,7 +5648,8 @@ static void mds_peer_reset(struct ceph_connection *con)
>>>          struct ceph_mds_client *mdsc = s->s_mdsc;
>>>
>>>          pr_warn("mds%d closed our session\n", s->s_mds);
>>> -       send_mds_reconnect(mdsc, s);
>>> +       if (!mdsc->no_reconnect)
>>> +               send_mds_reconnect(mdsc, s);
>>>   }
>>>
>>>   static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
>>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
>>> index 728b7d72bf76..8e8f0447c0ad 100644
>>> --- a/fs/ceph/mds_client.h
>>> +++ b/fs/ceph/mds_client.h
>>> @@ -413,6 +413,7 @@ struct ceph_mds_client {
>>>          atomic_t                num_sessions;
>>>          int                     max_sessions;  /* len of sessions array */
>>>          int                     stopping;      /* true if shutting down */
>>> +       int                     no_reconnect;  /* true if snap trace is corrupted */
>>>
>>>          atomic64_t              quotarealms_count; /* # realms with quota */
>>>          /*
>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>> index c1c452afa84d..023852b7c527 100644
>>> --- a/fs/ceph/snap.c
>>> +++ b/fs/ceph/snap.c
>>> @@ -767,8 +767,10 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
>>>          struct ceph_snap_realm *realm;
>>>          struct ceph_snap_realm *first_realm = NULL;
>>>          struct ceph_snap_realm *realm_to_rebuild = NULL;
>>> +       struct ceph_client *client = mdsc->fsc->client;
>>>          int rebuild_snapcs;
>>>          int err = -ENOMEM;
>>> +       int ret;
>>>          LIST_HEAD(dirty_realms);
>>>
>>>          lockdep_assert_held_write(&mdsc->snap_rwsem);
>>> @@ -885,6 +887,29 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
>>>          if (first_realm)
>>>                  ceph_put_snap_realm(mdsc, first_realm);
>>>          pr_err("%s error %d\n", __func__, err);
>>> +
>>> +       /*
>>> +        * When receiving a corrupted snap trace we don't know what
>>> +        * exactly has happened on the MDS side, and we shouldn't
>>> +        * continue writing to the OSDs, which may corrupt the
>>> +        * snapshot contents.
>>> +        *
>>> +        * Just try to blocklist this kclient and if it fails we need
>>> +        * to crash the kclient instead of leaving it writeable.
>>
>> Hi Xiubo,
>>
>> I'm not sure I understand this "let's blocklist ourselves" concept.
>> If the kernel client shouldn't continue writing to OSDs in this case,
>> why not just stop issuing writes -- perhaps initiating some equivalent
>> of a read-only remount like many local filesystems would do on I/O
>> errors (e.g. errors=remount-ro mode)?
>
> The following patch seems to work. Let me do more testing to make sure
> there are no further crashes.
>
> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> index c1c452afa84d..cd487f8a4cb5 100644
> --- a/fs/ceph/snap.c
> +++ b/fs/ceph/snap.c
> @@ -767,6 +767,7 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
>         struct ceph_snap_realm *realm;
>         struct ceph_snap_realm *first_realm = NULL;
>         struct ceph_snap_realm *realm_to_rebuild = NULL;
> +       struct super_block *sb = mdsc->fsc->sb;
>         int rebuild_snapcs;
>         int err = -ENOMEM;
>         LIST_HEAD(dirty_realms);
> @@ -885,6 +886,9 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
>         if (first_realm)
>                 ceph_put_snap_realm(mdsc, first_realm);
>         pr_err("%s error %d\n", __func__, err);
> +       pr_err("Remounting filesystem read-only\n");
> +       sb->s_flags |= SB_RDONLY;
> +
>         return err;
>  }

The readonly approach was also my first thought, but I was just not
very sure whether it would be the best one. By evicting the kclient we
can prevent the buffered data from being written to the OSDs, but the
readonly approach seemingly won't?

- Xiubo

>> Or, perhaps, all in-memory snap contexts could somehow be invalidated
>> in this case, making writes fail naturally -- on the client side,
>> without actually being sent to OSDs just to be nixed by the blocklist
>> hammer.
>>
>> But further, what makes a failure to decode a snap trace special?
>> AFAIK we don't do anything close to this for any other decoding
>> failure.  Wouldn't the "when receiving a corrupted XYZ we don't know
>> what exactly has happened on the MDS side" argument apply to pretty
>> much all decoding failures?
>>
>>> +        *
>>> +        * Then this kclient must be remounted to continue after the
>>> +        * corrupted metadata has been fixed on the MDS side.
>>> +        */
>>> +       mdsc->no_reconnect = 1;
>>> +       ret = ceph_monc_blocklist_add(&client->monc,
>>> +                                     &client->msgr.inst.addr);
>>> +       if (ret) {
>>> +               pr_err("%s blocklist of %s failed: %d", __func__,
>>> +                      ceph_pr_addr(&client->msgr.inst.addr), ret);
>>> +               BUG();
>>
>> ... and this is a rough equivalent of errors=panic mode.
>>
>> Is there a corresponding userspace client PR that can be referenced?
>> This needs additional background and justification IMO.
>>
>> Thanks,
>>
>>                 Ilya
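
For illustration, a minimal sketch of the read-only fallback discussed
above, combining the SB_RDONLY remount from Xiubo's follow-up patch
with a client-side write bailout along the lines Ilya suggests. The
abort_err field and both helper names are hypothetical illustrations,
not code that exists in the kernel client or in the eventual fix:

/*
 * Sketch only, assuming a new "abort_err" field on struct
 * ceph_mds_client: flag the client when a corrupted snap trace is
 * detected, remount read-only, and fail further writes with -EIO on
 * the client side instead of blocklisting.
 */
static void ceph_abort_on_corrupt_snap_trace(struct ceph_mds_client *mdsc)
{
	struct super_block *sb = mdsc->fsc->sb;

	pr_err("ceph: corrupted snap trace, remounting read-only\n");
	sb->s_flags |= SB_RDONLY;		/* errors=remount-ro equivalent */
	WRITE_ONCE(mdsc->abort_err, -EIO);	/* hypothetical field */
}

/* Checked at each write entry point before any OSD request is built: */
static int ceph_writes_aborted(struct ceph_mds_client *mdsc)
{
	return READ_ONCE(mdsc->abort_err);	/* 0, or -EIO once aborted */
}

Failing writes in the client this way would avoid sending requests to
the OSDs only to have them rejected, which is the crux of Ilya's
objection to the blocklist approach.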