All of lore.kernel.org
 help / color / mirror / Atom feed
From: Oscar Salvador <osalvador@suse.de>
To: "Luck, Tony" <tony.luck@intel.com>
Cc: David Hildenbrand <david@redhat.com>,
	Borislav Petkov <bp@alien8.de>,
	Yazen Ghannam <yazen.ghannam@amd.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: Machine check recovery broken in v6.9-rc1
Date: Sun, 7 Apr 2024 06:51:51 +0200	[thread overview]
Message-ID: <ZhImZwKhp-CZ0MFN@localhost.localdomain> (raw)
In-Reply-To: <SJ1PR11MB608323D7E6113B78A35F4999FC012@SJ1PR11MB6083.namprd11.prod.outlook.com>

On Sun, Apr 07, 2024 at 12:08:30AM +0000, Luck, Tony wrote:
> Oscar.
> 
> Both the 6.1 and 6.9-rc2 patches make the BUG (and subsequent issues) go away.

Thanks for the switf test Tony!

> Here's what's happening.
> 
> When the machine check occurs there's a scramble from various subsystems
> to report the memory error.
> 
> ghes_do_memory_failure() calls memory_failure_queue() which later
> calls memory_failure() from a kernel thread. Side note: this happens TWICE
> for each error. Not sure yet if this is a BIOS issue logging more than once.
> or some Linux issues in acpi/apei/ghes.c code.
> 
> uc_decode_notifier() [called from a different kernel thread] also calls
> do_memory_failure()
> 
> Finally kill_me_maybe() [called from task_work on return to the application
> when returning from the machine check handler] also calls memory_failure()
> 
> do_memory_failure() is somewhat prepared for multiple reports of the same
> error. It uses an atomic test and set operation to mark the page as poisoned.
> 
> First called to report the error does all the real work. Late arrivals take a
> shorter path, but may still take some action(s) depending on the "flags"
> passed in:
> 
>         if (TestSetPageHWPoison(p)) {
>                 pr_err("%#lx: already hardware poisoned\n", pfn);
>                 res = -EHWPOISON;
>                 if (flags & MF_ACTION_REQUIRED)
>                         res = kill_accessing_process(current, pfn, flags);
>                 if (flags & MF_COUNT_INCREASED)
>                         put_page(p);
>                 goto unlock_mutex;
>         }

Thanks for the detailed explanation.

> In this case the last to arrive has MF_ACTION_REQUIRED set, so calls
> kill_accessing_process() ... which is in the stack trace that led to the:
> 
>    kernel BUG at include/linux/swapops.h:88!
> 
> I'm not sure that I fully understand your patch. I guess that it is making sure to
> handle the case that the page has already been marked as poisoned?

Basically what is happening is:

1) We mark the page as HWPoison
2) We see that the page is mapped by someone
3) We try to unmap it, and in the process we create a hwpoison swap entry.
   See the following chunk from try_to_unmap_one():

   "
    if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
            pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
            if (folio_test_hugetlb(folio)) {
                    hugetlb_count_sub(folio_nr_pages(folio), mm);
                    set_huge_pte_at(mm, address, pvmw.pte, pteval,
                                    hsz);
            } else {
                    dec_mm_counter(mm, mm_counter(folio));
                    set_pte_at(mm, address, pvmw.pte, pteval);
            }
	    ...
    }
   "
4) Now there is a second memory event (maybe the previous one has
   already finished, I do not think it matters for the sake of this
   problem)
5) The second event sees that the page has already been marked as
   HWPoison but since it has MF_ACTION_REQUIRED specified, it
   goes to kill_accessing_process() to do what its name says.
6) We walk the page tables of the accessing process to see if it has
   the poisoned pfn.
7) check_hwpoisoned_entry()
   (which is called from
   walk_page_range()->walk_{pgd,p4d,pud,pmd}_range()->ops->pmd_entry())
   checks whether any of the ptes is poisoned.
8) Since the previous MCE event unmapped the page, pte_present() == 0,
   so we want to get the swap entry, and this is where it falls off the
   cliff.
   See check_hwpoisoned_entry()

   static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
                                     unsigned long poisoned_pfn, struct to_kill *tk)
   {
          unsigned long pfn = 0;

          if (pte_present(pte)) {
                  pfn = pte_pfn(pte);
          } else {
                  swp_entry_t swp = pte_to_swp_entry(pte);

                  if (is_hwpoison_entry(swp))
			pfn = swp_offset_pfn(swp);
         }
	 ...
   }

is_hwpoison_entry() returns true (remember the make_hwpoison_entry()
call we did?)
But when we try to get the pfn from the swap entry, we stumble upon the
VM_BUG_ON(), because is_pfn_swap_entry() only checks for:
 is_migration_entry()
 is_device_private_entry()
 is_device_exclusive_entry()

but it should also check for is_hwpoison_entry().
Since it does not, is_pfn_swap_entry() returns false in our case,
leading to the VM_BUG_ON.

Note that this should only matter in environments where CONFIG_DEBUG_VM
is set.

I hope I shed some light in here.

> Anyway ... thanks for the quick fix. I hope the above helps write a good
> commit message to get this applied and backported to stable.
> 
> Tested-by: Tony Luck <tony.luck@intel.com>

Thanks again Tony, much appreciated.

I will write the patch and most likely send it out either today in the
afternoon or tomorrow early in the
morning.


-- 
Oscar Salvador
SUSE Labs


      parent reply	other threads:[~2024-04-07  4:52 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-04 22:05 Machine check recovery broken in v6.9-rc1 Tony Luck
2024-04-04 23:39 ` Tony Luck
2024-04-05  7:19 ` David Hildenbrand
2024-04-05 15:05   ` Luck, Tony
2024-04-05 23:58     ` Tony Luck
2024-04-06  2:18       ` Oscar Salvador
2024-04-06  3:54         ` Oscar Salvador
2024-04-06  4:13           ` Oscar Salvador
2024-04-07  0:08             ` Luck, Tony
2024-04-07  3:59               ` Miaohe Lin
2024-04-07  4:51               ` Oscar Salvador [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZhImZwKhp-CZ0MFN@localhost.localdomain \
    --to=osalvador@suse.de \
    --cc=bp@alien8.de \
    --cc=david@redhat.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=naoya.horiguchi@nec.com \
    --cc=tony.luck@intel.com \
    --cc=yazen.ghannam@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.