From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH v3] ceph: blocklist the kclient when receiving corrupted
 snap trace
To: Venky Shankar
Cc: Ilya Dryomov, ceph-devel@vger.kernel.org, jlayton@kernel.org,
 mchangir@redhat.com, atomlin@atomlin.com, stable@vger.kernel.org
References: <20221206125915.37404-1-xiubli@redhat.com>
 <487811e4-ffba-cfe5-db2b-5379602e7b26@redhat.com>
From: Xiubo Li
Message-ID: <1eb59764-5489-8af9-d502-4f3bd44be4dd@redhat.com>
Date: Fri, 9 Dec 2022 16:07:27 +0800
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
 Thunderbird/78.10.1
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Content-Language: en-US
Precedence: bulk
List-ID:
X-Mailing-List: stable@vger.kernel.org

On 09/12/2022 15:15, Venky Shankar wrote:
> On Fri, Dec 9, 2022 at 12:29 PM Xiubo Li wrote:
>>
>> On 09/12/2022 14:14, Venky Shankar wrote:
>>> On Thu, Dec 8, 2022 at 6:10 AM Xiubo Li wrote:
>>>> On 07/12/2022 22:20, Ilya Dryomov wrote:
>>>>> On Wed, Dec 7, 2022 at 2:31 PM Xiubo Li wrote:
>>>>>> On 07/12/2022 21:19, Xiubo Li wrote:
>>>>>>> On 07/12/2022 18:59, Ilya Dryomov wrote:
>>>>>>>> On Tue, Dec 6, 2022 at 1:59 PM wrote:
>>>>>>>>> From: Xiubo Li
>>>>>>>>>
>>>>>>>>> When receiving a corrupted snap trace we don't know what exactly
>>>>>>>>> has happened on the MDS side, and we shouldn't continue writing
>>>>>>>>> to the OSDs, which may corrupt the snapshot contents.
>>>>>>>>>
>>>>>>>>> Just try to blocklist this client, and if that fails we need to
>>>>>>>>> crash the client instead of leaving it writeable to the OSDs.
>>>>>>>>>
>>>>>>>>> Cc: stable@vger.kernel.org
>>>>>>>>> URL: https://tracker.ceph.com/issues/57686
>>>>>>>>> Signed-off-by: Xiubo Li
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> Thanks for Aaron's feedback.
>>>>>>>>>
>>>>>>>>> V3:
>>>>>>>>> - Fixed ERROR: spaces required around that ':' (ctx:VxW)
>>>>>>>>>
>>>>>>>>> V2:
>>>>>>>>> - Switched to WARN() to taint the Linux kernel.
>>>>>>>>>
>>>>>>>>>  fs/ceph/mds_client.c |  3 ++-
>>>>>>>>>  fs/ceph/mds_client.h |  1 +
>>>>>>>>>  fs/ceph/snap.c       | 25 +++++++++++++++++++++++++
>>>>>>>>>  3 files changed, 28 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>>>>>>>> index cbbaf334b6b8..59094944af28 100644
>>>>>>>>> --- a/fs/ceph/mds_client.c
>>>>>>>>> +++ b/fs/ceph/mds_client.c
>>>>>>>>> @@ -5648,7 +5648,8 @@ static void mds_peer_reset(struct ceph_connection *con)
>>>>>>>>>      struct ceph_mds_client *mdsc = s->s_mdsc;
>>>>>>>>>
>>>>>>>>>      pr_warn("mds%d closed our session\n", s->s_mds);
>>>>>>>>> -    send_mds_reconnect(mdsc, s);
>>>>>>>>> +    if (!mdsc->no_reconnect)
>>>>>>>>> +        send_mds_reconnect(mdsc, s);
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>>  static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
>>>>>>>>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
>>>>>>>>> index 728b7d72bf76..8e8f0447c0ad 100644
>>>>>>>>> --- a/fs/ceph/mds_client.h
>>>>>>>>> +++ b/fs/ceph/mds_client.h
>>>>>>>>> @@ -413,6 +413,7 @@ struct ceph_mds_client {
>>>>>>>>>      atomic_t num_sessions;
>>>>>>>>>      int max_sessions;   /* len of sessions array */
>>>>>>>>>      int stopping;       /* true if shutting down */
>>>>>>>>> +    int no_reconnect;   /* true if snap trace is corrupted */
>>>>>>>>>
>>>>>>>>>      atomic64_t quotarealms_count; /* # realms with quota */
>>>>>>>>>      /*
>>>>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>>>>>>> index c1c452afa84d..023852b7c527 100644
>>>>>>>>> --- a/fs/ceph/snap.c
>>>>>>>>> +++ b/fs/ceph/snap.c
>>>>>>>>> @@ -767,8 +767,10 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
>>>>>>>>>      struct ceph_snap_realm *realm;
>>>>>>>>>      struct ceph_snap_realm *first_realm = NULL;
>>>>>>>>>      struct ceph_snap_realm *realm_to_rebuild = NULL;
>>>>>>>>> +    struct ceph_client *client = mdsc->fsc->client;
>>>>>>>>>      int rebuild_snapcs;
>>>>>>>>>      int err = -ENOMEM;
>>>>>>>>> +    int ret;
>>>>>>>>>      LIST_HEAD(dirty_realms);
>>>>>>>>>
>>>>>>>>>      lockdep_assert_held_write(&mdsc->snap_rwsem);
>>>>>>>>> @@ -885,6 +887,29 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
>>>>>>>>>      if (first_realm)
>>>>>>>>>          ceph_put_snap_realm(mdsc, first_realm);
>>>>>>>>>      pr_err("%s error %d\n", __func__, err);
>>>>>>>>> +
>>>>>>>>> +    /*
>>>>>>>>> +     * When receiving a corrupted snap trace we don't know what
>>>>>>>>> +     * exactly has happened in MDS side. And we shouldn't continue
>>>>>>>>> +     * writing to OSD, which may corrupt the snapshot contents.
>>>>>>>>> +     *
>>>>>>>>> +     * Just try to blocklist this kclient and if it fails we need
>>>>>>>>> +     * to crash the kclient instead of leaving it writeable.
>>>>>>>> Hi Xiubo,
>>>>>>>>
>>>>>>>> I'm not sure I understand this "let's blocklist ourselves" concept.
>>>>>>>> If the kernel client shouldn't continue writing to OSDs in this case,
>>>>>>>> why not just stop issuing writes -- perhaps initiating some equivalent
>>>>>>>> of a read-only remount like many local filesystems would do on I/O
>>>>>>>> errors (e.g. errors=remount-ro mode)?
>>>>>>> The following patch seems to work. Let me do more testing to make
>>>>>>> sure there are no further crashes.
>>>>>>>
>>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>>>>> index c1c452afa84d..cd487f8a4cb5 100644
>>>>>>> --- a/fs/ceph/snap.c
>>>>>>> +++ b/fs/ceph/snap.c
>>>>>>> @@ -767,6 +767,7 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
>>>>>>>      struct ceph_snap_realm *realm;
>>>>>>>      struct ceph_snap_realm *first_realm = NULL;
>>>>>>>      struct ceph_snap_realm *realm_to_rebuild = NULL;
>>>>>>> +    struct super_block *sb = mdsc->fsc->sb;
>>>>>>>      int rebuild_snapcs;
>>>>>>>      int err = -ENOMEM;
>>>>>>>      LIST_HEAD(dirty_realms);
>>>>>>> @@ -885,6 +886,9 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
>>>>>>>      if (first_realm)
>>>>>>>          ceph_put_snap_realm(mdsc, first_realm);
>>>>>>>      pr_err("%s error %d\n", __func__, err);
>>>>>>> +    pr_err("Remounting filesystem read-only\n");
>>>>>>> +    sb->s_flags |= SB_RDONLY;
>>>>>>> +
>>>>>>>      return err;
>>>>>>>  }
>>>>>>>
>>>>>>>
>>>>>> The read-only approach was also my first thought, but I was just not
>>>>>> very sure whether it would be the best approach.
>>>>>>
>>>>>> Because by evicting the kclient we could prevent the buffered data
>>>>>> from being written to the OSDs, but the read-only approach seems it
>>>>>> won't?
>>>>> The read-only setting is more for the VFS and the user. Technically,
>>>>> the kernel client could just stop issuing writes (i.e. OSD requests
>>>>> containing a write op) and not set SB_RDONLY. That should cover any
>>>>> buffered data as well.
>>>> From reading the local ext4 and other filesystems, they all do it this
>>>> way and the VFS will help stop further writing. I tested the above
>>>> patch and it worked as expected.
>>>>
>>>> I think to stop the subsequent OSD requests we can just check the
>>>> SB_RDONLY flag to prevent the buffer writeback.
>>>>
>>>>> By employing self-blocklisting, you are shifting the responsibility
>>>>> of rejecting OSD requests to the OSDs.
>>>>> I'm saying that not issuing
>>>>> OSD requests from a potentially busted client in the first place is
>>>>> probably a better idea. At the very least you wouldn't need to BUG
>>>>> on ceph_monc_blocklist_add() errors.
>>>> I found an issue with the read-only approach:
>>>>
>>>> In read-only mode the client can still access the MDSs and OSDs, and
>>>> it will continue trying to update the snap realms with the corrupted
>>>> snap trace as before when reading. What if users try to read or back
>>>> up the snapshots by using the corrupted snap realms?
>>>>
>>>> Isn't that a problem?
>>> Yeah - this might end up in more problems in various places than what
>>> this change is supposed to handle.
>>>
>>> Maybe we could track affected realms (although, not that granular) and
>>> disallow reads to them (and their children), but I think it might not
>>> be worth putting in the effort.
>> IMO this doesn't make much sense.
>>
>> When reading we need to use the inodes to get the corresponding realms.
>> Once the metadata is corrupted and we abort here, the inodes'
>> corresponding realms could be incorrect. That's because when updating
>> the snap realms here it's possible that the snap realm hierarchy will
>> be adjusted and some inodes' realms will be changed.
> My point was whether we could identify the realms correctly at all,
> which seems risky with the corrupted info received. Seems like this
> needs to be an all-or-none approach.

Yeah. Agree.

>
>> Thanks
>>
>> - Xiubo
>>
>>>> Thanks
>>>>
>>>> - Xiubo
>>>>
>>>>> Thanks,
>>>>>
>>>>> Ilya
>>>>>
>