References: <20240924043924.3562257-1-jiaqiyan@google.com> <20240924043924.3562257-2-jiaqiyan@google.com> <7658ca1f-1b3b-4352-93d9-66f8dfd28408@oracle.com>
In-Reply-To: <7658ca1f-1b3b-4352-93d9-66f8dfd28408@oracle.com>
From: Jiaqi Yan
Date: Thu, 10 Oct 2024 16:21:27 -0700
Subject: Re: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy
To: jane.chu@oracle.com
Cc: nao.horiguchi@gmail.com, linmiaohe@huawei.com, tony.luck@intel.com, wangkefeng.wang@huawei.com, akpm@linux-foundation.org, osalvador@suse.de, rientjes@google.com, duenwen@google.com, jthoughton@google.com, jgg@nvidia.com, ankita@nvidia.com, peterx@redhat.com, linux-mm@kvack.org

On Mon, Oct 7, 2024 at 10:24 AM wrote:
>
> On 10/3/2024 4:51 PM, Jiaqi Yan wrote:
> > soned page (sub- or huge-) will eventually be isolated, because,
> > The code here is "global policy". The "per-VMA policy", proposed in
> > 0/2 but code not sent, should be able to support isolation + offline
> > at some point (all VMAs are gone and page becomes free).
> "per-VMA policy" sounds interesting.
> >> Another thing I'm curious at is whether you have tested with real
> >> hardware UE - the one that triggers MCE.
> >> When a real UE is consumed by
> > Yes, with our workload. Can you share more about what is the "training
> > process"? Is it something to train memory or screen memory errors?
>
> The cover letter mentioned "Machine Learning (ML) workloads", so I used
> it as an example.

Got you. In that case, if the ML workload (running in a VM) wants to
do what you described, wouldn't losing a 1G hugetlb page to kernel
offlining make it even harder for the VM/workload to execute its
recovery logic?

>
> -jane
>