public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Ethan Zhao <haifeng.zhao@linux.intel.com>
To: yunhui cui <cuiyunhui@bytedance.com>, Ethan Zhao <etzhao1900@gmail.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>,
	Baolu Lu <baolu.lu@linux.intel.com>,
	dwmw2@infradead.org, joro@8bytes.org, will@kernel.org,
	robin.murphy@arm.com, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org
Subject: Re: [External] Re: [PATCH v2] iommu/vt-d: fix system hang on reboot -f
Date: Fri, 28 Feb 2025 12:34:29 +0800	[thread overview]
Message-ID: <c31b072a-7ec6-49f5-9ac7-a19cb6fbd5be@linux.intel.com> (raw)
In-Reply-To: <CAEEQ3w=1H7o7G56hYmwEGErfuFX-2nG=wesJYL7p0dDqLv_rpw@mail.gmail.com>


在 2025/2/28 10:18, yunhui cui 写道:
> Hi All,
>
> On Fri, Feb 28, 2025 at 8:51 AM Ethan Zhao <etzhao1900@gmail.com> wrote:
>>
>> On 2/28/2025 4:38 AM, Jason Gunthorpe wrote:
>>> On Thu, Feb 27, 2025 at 08:40:31AM +0800, Ethan Zhao wrote:
>>>> On 2/26/2025 9:04 PM, Jason Gunthorpe wrote:
>>>>> On Wed, Feb 26, 2025 at 01:55:28PM +0800, Ethan Zhao wrote:
>>>>>>> Provided the system does not respond to those events when this function
>>>>>>> is called, it's fine to remove the lock.
>>>>>> I agree.
>>>>> I think it is running the destruction of the iommu far too late in the
>>>>> process. IMHO it should be done after all the drivers have been
>>>>> shutdown, before the CPUs go single threaded.
>>>> Hmm... so far it is fine, the iommu_shutdown only has a little work to
>>>> do, disable the translation, the PMR disabling is just backward compatible,
>>>> was deprecated already. if we move it to one position where all CPUs are
>>>> cycling, we don't know what kind of user-land tasks left there (i.e. reboot -f
>>>> case), it would be hard to full-fill the requirement of Intel VT-d, no ongoing
>>>> transaction there on hardware when issue the translation disabling command.
>>> There is no guarentee device dma is halted anyhow at this point either.
>> The -f option to reboot command, suggests data corruption possibility.although
>>
>> the IOMMU strives not to cross the transaction boundaries of its address
>> translation.
>>
>> over all, we should make the reboot -f function works. not to hang
>> there. meanwhile
>>
>> it doesn't break anything else so far.
> patch v1:
> if (!down_write_trylock(&dmar_global_lock))
> return;

We should also think about the corner case of reboot without -f option, there might be something

wrong with one of the devices there on other CPUs they don't release the lock and were shutdown

by NMI, the dmar_global_lock is held by one of them, so we just bail out here at at once ?

leaving the IOMMU translation etc in enabling status. Is it better to loop with 10 seconds timeout

to try the lock then make sure no better choice, then continue the disabling of IOMMU translation & PMR ?.

Even the reboot -f case, leaving the IOMMU translation in enabling status is worse choice.

So seems we shouldn't bail out immediately after one time trylock whatever.

>
>   patch v3:
> /* All other CPUs were brought down, hotplug interrupts were disabled,
> no lock and RCU checking needed anymore*/
> list_for_each_entry(drhd, &dmar_drhd_units, list) {
> iommu = drhd->iommu;
> /* Disable PMRs explicitly here. */
> iommu_disable_protect_mem_regions(iommu);
> iommu_disable_translation(iommu);
> }
>
> Since we can remove down_write/up_write, it indicates that during the
> IOMMU shutdown process, we are not particularly concerned about
> whether others are accessing the critical section. Therefore, it can
> be understood that Patch v1 can successfully acquire the lock and
> proceed with subsequent operations. From this perspective, Patch v1 is
> reasonable and can prevent situations where there is an actual owner
> of dmar_global_lock, even though it does not perform a real IOMMU
> shutdown.
>
> However, if the IOMMU shutdown truly fails, IOMMU_WAIT_OP will trigger
> a panic(). Removing down_write/up_write offers better maintainability
> than Patch v1 (as we can retrieve the cause from the vmcore). But this
> might not be significant, since the reboot could have been completed
> successfully, and the occurrence of panic() might instead cause
> confusion.
>
> In summary, it is essential to complete the reboot action here rather
> than causing the system to hang. So, which do you prefer, v1 or v3?

Would v3 miss something ? please comment.

normal reboot/shutdown without -f case, function works right without any data corrupt or

leaving device in confusion status, good or bad situation.

reboot -f case, functions works, strives to not corrupt data or leave device in confusion status.

...

Thanks,

Ethan

>
>
>> Thanks,
>>
>> Ethan
>>
>>> Jason
> Thanks,
> Yunhui

-- 
"firm, enduring, strong, and long-lived"


      reply	other threads:[~2025-02-28  4:34 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-25  6:48 [PATCH v2] iommu/vt-d: fix system hang on reboot -f Yunhui Cui
2025-02-25  7:01 ` Baolu Lu
2025-02-25  8:54   ` Ethan Zhao
2025-02-25 14:26     ` Jason Gunthorpe
2025-02-26  0:35       ` Ethan Zhao
2025-02-26  3:50       ` Ethan Zhao
2025-02-26  5:18         ` Baolu Lu
2025-02-26  5:55           ` Ethan Zhao
2025-02-26 13:04             ` Jason Gunthorpe
2025-02-27  0:40               ` Ethan Zhao
2025-02-27 20:38                 ` Jason Gunthorpe
2025-02-28  0:51                   ` Ethan Zhao
2025-02-28  2:18                     ` [External] " yunhui cui
2025-02-28  4:34                       ` Ethan Zhao [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c31b072a-7ec6-49f5-9ac7-a19cb6fbd5be@linux.intel.com \
    --to=haifeng.zhao@linux.intel.com \
    --cc=baolu.lu@linux.intel.com \
    --cc=cuiyunhui@bytedance.com \
    --cc=dwmw2@infradead.org \
    --cc=etzhao1900@gmail.com \
    --cc=iommu@lists.linux.dev \
    --cc=jgg@ziepe.ca \
    --cc=joro@8bytes.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=robin.murphy@arm.com \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox