From: Thiago Macieira
To: Borislav Petkov, "Luck, Tony"
Cc: "Joseph, Jithu", "hdegoede@redhat.com", "markgross@kernel.org",
 "tglx@linutronix.de", "mingo@redhat.com", "dave.hansen@linux.intel.com",
 "x86@kernel.org", "hpa@zytor.com", "gregkh@linuxfoundation.org",
 "Raj, Ashok", "linux-kernel@vger.kernel.org",
 "platform-driver-x86@vger.kernel.org", "patches@lists.linux.dev",
 "Shankar, Ravi V", "Jimenez Gonzalez, Athenas", "Mehta, Sohil"
Subject: Re: [PATCH v2 12/14] platform/x86/intel/ifs: Add current_batch sysfs entry
Date: Sat, 12 Nov 2022 18:35:23 -0800
Message-ID: <2687702.9iZYToFQE1@tjmaciei-mobl5>
Organization: Intel Corporation
References: <20221021203413.1220137-1-jithu.joseph@intel.com>
X-Mailing-List: patches@lists.linux.dev

On Saturday, 12 November 2022 15:32:47 PST Luck, Tony wrote:
> > Because if this is going to be run during downtime, as Thiago says, then
> > you can just as well use debugfs for this. And then there's no need to
> > cast any API in stone and so on.
>
> Did Thiago say “during downtime”? I think
> he talked about some users opportunistic
> use of scan tests. But that’s far from only
> during downtime. We fully expect CSPs to
> run these scans periodically on production
> machines.

Let me clarify. I did not mean full system downtime for maintenance, but I did
mean that there's a gap in consumer workload, for both threads of one or more
cores. As Tony said, it should have little observable effect on any other core,
meaning an IFS run can be scheduled *as* any other workload (albeit a
privileged one) for a subset of the machine, while the rest of the system
remains in production. This allows them a lot of flexibility and is the reason
I am talking about containers, with the implied constraint that the
container's view of the filesystem is narrower than the kernel's.

There'll be some coordination required to get all cores to have run all tests,
but it should be doable over a period of time, and I'm thinking days, not
years. This should still be short enough to reveal if the system can detect a
defect or wear-out before any real workload is impacted by it.

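The coordination described above can be sketched as a simple rotation. This is
only an illustration under stated assumptions, not code from this thread or
from the IFS driver: the function names, the window size, and the
"pass"/"fail"/"untested" result strings are hypothetical placeholders, and the
sketch does not touch sysfs at all. It only shows how every (core, batch) pair
could be spread over a series of small maintenance windows.

```python
def plan_scan_windows(core_ids, batch_ids, cores_per_window):
    """Return a list of windows; each window is a list of (core, batch) pairs.

    Only `cores_per_window` cores are taken out of the consumer workload at
    a time, so the rest of the machine can stay in production.
    """
    if cores_per_window < 1:
        raise ValueError("cores_per_window must be at least 1")
    windows = []
    for batch in batch_ids:
        # Walk the cores in fixed-size groups for this test batch.
        for start in range(0, len(core_ids), cores_per_window):
            group = core_ids[start:start + cores_per_window]
            windows.append([(core, batch) for core in group])
    return windows


def cores_to_offline(results):
    """Return the sorted cores that reported a failure in any batch.

    `results` maps (core, batch) to a status string; the strings used here
    are an assumption made for this sketch.
    """
    return sorted({core for (core, _batch), status in results.items()
                   if status == "fail"})


if __name__ == "__main__":
    # Hypothetical machine: 8 cores, 3 test batches, 2 cores per window.
    for number, window in enumerate(plan_scan_windows(list(range(8)), [1, 2, 3], 2)):
        print(number, window)
```

With one window per gap in the workload, the number of windows returned gives a
rough idea of how long a full pass over the machine would take.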
If an issue is detected, the admin can decide whether to offline the core(s)
reporting problems but keep the rest serving workloads and generating revenue,
or offline the entire machine for full maintenance and to run more invasive and
time-consuming tests.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering