From: Joao Martins <joao.m.martins@oracle.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>,
	Jason Wang <jasowang@redhat.com>,
	Richard Henderson <richard.henderson@linaro.org>,
	qemu-devel@nongnu.org, Daniel Jordan <daniel.m.jordan@oracle.com>,
	David Edmondson <david.edmondson@oracle.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>, Ani Sinha <ani@anisinha.ca>,
	Igor Mammedov <imammedo@redhat.com>,
	Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Subject: Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
Date: Fri, 25 Feb 2022 12:36:55 +0000
Message-ID: <16ac8c6a-1b47-d590-eeef-b7a28b0b471d@oracle.com>
In-Reply-To: <20220225001353-mutt-send-email-mst@kernel.org>

On 2/25/22 05:22, Michael S. Tsirkin wrote:
> On Thu, Feb 24, 2022 at 08:34:40PM +0000, Joao Martins wrote:
>> On 2/24/22 20:12, Michael S. Tsirkin wrote:
>>> On Thu, Feb 24, 2022 at 08:04:48PM +0000, Joao Martins wrote:
>>>> On 2/24/22 19:54, Michael S. Tsirkin wrote:
>>>>> On Thu, Feb 24, 2022 at 07:44:26PM +0000, Joao Martins wrote:
>>>>>> On 2/24/22 18:30, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:
>>>>>>>> On 2/24/22 17:23, Michael S. Tsirkin wrote:
>>>>>>>>> On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:
>>>>>>>>>> On 2/23/22 23:35, Joao Martins wrote:
>>>>>>>>>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
>>>>>>>>>>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>>>>>>>>>>>> +                                          uint64_t pci_hole64_size)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>>>>>>>> +    uint32_t eax, vendor[3];
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>>>>>>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
>>>>>>>>>>>>> +        return;
>>>>>>>>>>>>> +    }
>>>>>>>>>>>>
>>>>>>>>>>>> Wait a sec, should this actually be tying things to the host CPU ID?
>>>>>>>>>>>> It's really about what we present to the guest though,
>>>>>>>>>>>> isn't it?
>>>>>>>>>>>
>>>>>>>>>>> Using cpuid was the easier catch-all, without going into Linux UAPI
>>>>>>>>>>> specifics. But it doesn't have to be tied to that; this only matters
>>>>>>>>>>> for systems with an IOMMU present.
>>>>>>>>>>>
>>>>>>>>>>>> Also, can't we tie this to whether the AMD IOMMU is present?
>>>>>>>>>>>>
>>>>>>>>>>> I think so, I can add that. Something like an amd_iommu_exists() helper
>>>>>>>>>>> in util/vfio-helpers.c which checks whether there are any sysfs child
>>>>>>>>>>> entries that start with ivhd in /sys/class/iommu/. Given that this HT
>>>>>>>>>>> region has been hardcoded in the IOMMU reserved regions since >=4.11
>>>>>>>>>>> (up to latest), I don't think it's even worth checking that the range
>>>>>>>>>>> exists in:
>>>>>>>>>>>
>>>>>>>>>>> 	/sys/kernel/iommu_groups/0/reserved_regions
>>>>>>>>>>>
>>>>>>>>>>> (Also that sysfs ABI is >= 4.11 only)
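
Purely to illustrate the approach described above (scanning /sys/class/iommu/
for entries starting with "ivhd"), a minimal sketch of such a helper (not the
staged hunk) could look like this:

#include <dirent.h>
#include <stdbool.h>
#include <string.h>

/*
 * Sketch only: report whether the host has an AMD IOMMU by looking for
 * sysfs entries named ivhd* under /sys/class/iommu/. Error handling is
 * reduced to the bare minimum.
 */
static bool qemu_amd_iommu_is_present(void)
{
    struct dirent *entry;
    bool found = false;
    DIR *dir;

    dir = opendir("/sys/class/iommu");
    if (!dir) {
        return false;
    }
    while ((entry = readdir(dir)) != NULL) {
        if (strncmp(entry->d_name, "ivhd", 4) == 0) {
            found = true;
            break;
        }
    }
    closedir(dir);
    return found;
}
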
>>>>>>>>>>
>>>>>>>>>> Here's what I have staged in local tree, to address your comment.
>>>>>>>>>>
>>>>>>>>>> Naturally the first chunk is what's affected by this patch; the rest is a
>>>>>>>>>> predecessor patch to introduce qemu_amd_iommu_is_present(). It seems to
>>>>>>>>>> pass all the tests and whatnot.
>>>>>>>>>>
>>>>>>>>>> I am not entirely sure this is the right place to put such a helper; open
>>>>>>>>>> to suggestions. Regarding the naming of the helper, I tried to follow the
>>>>>>>>>> rest of the file's style.
>>>>>>>>>>

[snip]

>>>>>>>>>
>>>>>>>>> why are we checking whether an AMD IOMMU is present
>>>>>>>>> on the host? 
>>>>>>>>> Isn't what we care about whether it's
>>>>>>>>> present in the VM? What we are reserving after all
>>>>>>>>> is a range of GPAs, not actual host PA's ...
>>>>>>>>>
>>>>>>>> Right.
>>>>>>>>
>>>>>>>> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
>>>>>>>> so we need to avoid mapping that portion of the address space entirely
>>>>>>>> and use some other portion of the GPA space. This has to do with the
>>>>>>>> host IOMMU, which is where the DMA mapping of guest PAs takes place.
>>>>>>>> So, if you do not have a host IOMMU, you can't service guest DMA/PCIe
>>>>>>>> via VFIO through the host IOMMU, and therefore you don't have this
>>>>>>>> problem.
>>>>>>>>
>>>>>>>> Whether one uses a guest IOMMU or not does not affect the result; that
>>>>>>>> would be more a case of mimicking real hardware than of fixing the issue
>>>>>>>> this series addresses.
>>>>>>>
>>>>>>>
>>>>>>> Hmm okay ... my first reaction was to say let's put it under VFIO then.
>>>>>>> And ideally move the logic reporting reserved ranges there too.
>>>>>>> However, I think vdpa has the same issue too.
>>>>>>> CC Jason for an opinion.
>>>>>>
>>>>>> It does have the same problem.
>>>>>>
>>>>>> This is not specific to VFIO, it's to anything that uses the iommu.
>>>>>
>>>>> Right. Most VMs don't use the host IOMMU though ;) It's unfortunate
>>>>> that we don't have a global "enable-vfio" flag since vfio devices
>>>>> can be hot-plugged. I think this is not the first time we wanted
>>>>> something like this, right Alex?
>>>>>
>>>>>> It's
>>>>>> just that VFIO doesn't let you shoot in the foot :)
>>>>>>
>>>>>> vDPA would need to validate iova ranges as well to prevent mapping on
>>>>>> reserved IOVAs similar to commit 9b77e5c7984 ("vfio/type1: check dma
>>>>>> map request is within a valid iova range"). Now you could argue that
>>>>>> it's an IOMMU core problem, maybe.
>>>>>>
>>>>>> But I see this as a separate problem,
>>>>>> because regardless of said validation, your guest provisioning
>>>>>> is still going to fail for guests with >=1010G of max GPA, and even if
>>>>>> it doesn't fail, you shouldn't be letting it DMA map those in the first
>>>>>> place (e.g. you could see spurious INVALID_DEVICE_REQUEST fault events
>>>>>> from IOMMU if DMA is attempted over the first megabyte of that 1T hole).
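
To make this concrete, the decision this series takes can be sketched roughly
as below. The constants follow the AMD HyperTransport reserved window just
below 1T; the helper and field names are illustrative, not the actual patch:

/* AMD reserves 0xFD_0000_0000 .. 0xFF_FFFF_FFFF for HyperTransport. */
#define AMD_HT_START    0xFD00000000ULL     /* 1012 GiB */
#define AMD_ABOVE_1TB   0x10000000000ULL    /* 1 TiB */

static void relocate_above_4g_start(X86MachineState *x86ms,
                                    uint64_t pci_hole64_size)
{
    /* Highest GPA the guest would use with the default layout. */
    uint64_t max_gpa = x86ms->above_4g_mem_start +
                       x86ms->above_4g_mem_size + pci_hole64_size;

    /*
     * If RAM or the 64-bit PCI hole would reach into the HT window,
     * start the above-4G RAM at 1T instead, so no usable GPA falls
     * inside the reserved range.
     */
    if (max_gpa >= AMD_HT_START) {
        x86ms->above_4g_mem_start = AMD_ABOVE_1TB;
    }
}
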
>>>>>
>>>>> I wonder what's the status on a system without an IOMMU.
>>>>>
>>>> In theory you should be OK. Also, it's worth keeping in mind that AMD machines
>>>> with >=1T have this 12G hole marked as Reserved, so even DMA, at least when the
>>>> CPU is the initiator, should be fine in the worst-case scenario.
>>>
>>> Not sure I understand this last sentence.
>>>
>> I was trying to say that the E820 from hardware is marked as Reserved and any DMA
>> from/to endpoints from kernel-supplied CPU addresses will use those reserved ranges.
> 
> meaning "will *not* use" I guess?
> 
Yes, correct.

Sorry, I missed a word there. That's been happening to me quite a lot these days :(

>>>>>>> Also, some concerns
>>>>>>> - I think this changes memory layout for working VMs not using VFIO.
>>>>>>>   Better preserve the old layout for old machine types?
>>>>>>>
>>>>>> Oh definitely, and I do that in this series. See the last commit, and
>>>>>> in the past it was also a machine-property exposed to userspace.
>>>>>> Otherwise I would be breaking (badly) compat/migration.
>>>>>>
>>>>>> I would like to emphasize that these memory layout changes apply
>>>>>> *exclusively* to AMD hosts with guest memory bigger than ~950G-~1010G.
>>>>>> It's not every guest, but a fairly limited subset of big-memory guests
>>>>>> that are not the norm.
>>>>>>
>>>>>> That said, the phys-bits property errors might give us a bit more trouble,
>>>>>> so it might help to be more conservative.
>>>>>>
>>>>>>> - You mention the phys bits issue very briefly, and it's pretty
>>>>>>>   concerning.  Do we maybe want to also disable the work around if phys
>>>>>>>   bits is too small? 
>>>>>>
>>>>>> We are doing that here (well, in v4), as suggested by Igor. Note that in
>>>>>> this series it's a warning, but in v4 I have it as an error, together with
>>>>>> the 32-bit issues that I found through qtest.
>>>>>>
>>>>>> I share your concern about making this an error because of compatibility.
>>>>>> Perhaps we could go back to the previous version, turn it back into a
>>>>>> warning, and additionally even disable the relocation altogether. Or, if
>>>>>> desired, even make it depend on a machine property to enable it.
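
For reference, the guard being debated has roughly this shape; pc_max_used_gpa()
is a stand-in for whatever helper computes the highest GPA after relocation, and
whether the check warns or errors out is exactly the open question:

static void check_phys_bits(X86CPU *cpu, PCMachineState *pcms,
                            uint64_t pci_hole64_size)
{
    uint64_t maxphysaddr = ((uint64_t)1 << cpu->phys_bits) - 1;
    /* Hypothetical helper: highest GPA used once RAM/PCI hole are placed. */
    uint64_t maxusedaddr = pc_max_used_gpa(pcms, pci_hole64_size);

    if (maxphysaddr < maxusedaddr) {
        /*
         * warn_report() keeps existing guests booting; switching this to
         * error_report() plus exit() trades that compatibility for safety.
         */
        warn_report("Address space limit 0x%" PRIx64 " < 0x%" PRIx64
                    ": phys-bits (%u) is too low",
                    maxphysaddr, maxusedaddr, cpu->phys_bits);
    }
}
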
>>>>>
>>>>> I would be inclined to disable the relocation. And maybe block vfio? I'm
>>>>> not 100% sure about that last one, but this can be a vfio decision to
>>>>> make.
>>>>>
>>>> I don't see this as a VFIO problem (VFIO is actually being a good citizen and
>>>> doing the right thing :)). From my perspective this fix is also useful for
>>>> vDPA (which we also care about), and it will likewise let us avoid DMA mapping
>>>> in these GPA ranges.
>>>>
>>>> The reason I mention VFIO in the cover letter is that it is the one visible
>>>> UAPI through which users will notice that their 1TB+ VMs suddenly break.
>>>
>>> What I mean is that most VMs don't use either VFIO or VDPA.
>>> If we had VFIO/VDPA as accelerators that needed to be listed
>>> upfront when qemu is started, we could solve this with
>>> a bit less risk by not changing anything for VMs without these two.
>>>
>> That wouldn't work for the vfio/vdpa hotplug case (which we do use), which is
>> part of the reason I chose to always avoid the 1TB hole. VFIO even tells you
>> what the allowed IOVA ranges are when you create the container.
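
As an aside, for anyone who wants to see those ranges from userspace, a rough,
trimmed sketch (error handling omitted) of querying the type1 container via the
IOVA-range info capability from linux/vfio.h:

#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Sketch: print the valid IOVA ranges a type1 VFIO container reports. */
static void print_valid_iova_ranges(int container_fd)
{
    struct vfio_iommu_type1_info probe = { .argsz = sizeof(probe) };
    struct vfio_iommu_type1_info *info;
    struct vfio_info_cap_header *hdr;

    /* First call only learns how large the info (with capabilities) is. */
    ioctl(container_fd, VFIO_IOMMU_GET_INFO, &probe);
    info = calloc(1, probe.argsz);
    info->argsz = probe.argsz;
    ioctl(container_fd, VFIO_IOMMU_GET_INFO, info);

    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
        free(info);
        return;
    }

    /* Walk the capability chain looking for the IOVA range capability. */
    for (hdr = (void *)info + info->cap_offset; ;
         hdr = (void *)info + hdr->next) {
        if (hdr->id == VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE) {
            struct vfio_iommu_type1_info_cap_iova_range *cap = (void *)hdr;

            for (unsigned i = 0; i < cap->nr_iovas; i++) {
                printf("valid iova: 0x%llx-0x%llx\n",
                       (unsigned long long)cap->iova_ranges[i].start,
                       (unsigned long long)cap->iova_ranges[i].end);
            }
            break;
        }
        if (!hdr->next) {
            break;
        }
    }
    free(info);
}
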
>>
>> The machine property, though, could communicate /the intent/ to add
>> any VFIO or vDPA devices in the future and maybe cover that case. That
>> lets one avoid any 'accelerator-only' problems and restrict the issues
>> of this series to VMs with VFIO/VDPA, if the risk is a concern.
> 
> Well Alex nacked that idea (and I certainly trust him where VFIO is
> concerned), I guess for now we'll just live with the risk.
> 

My reading was that he nacked a 'VFIO-only' global property, not necessarily
the machine property for valid-iova. Hmm, I might be misreading it, as at the
end of the day the result will be the same.


Thread overview: 29+ messages
2022-02-23 18:44 [PATCH v3 0/6] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
2022-02-23 18:44 ` [PATCH v3 1/6] hw/i386: add 4g boundary start to X86MachineState Joao Martins
2022-02-23 18:44 ` [PATCH v3 2/6] i386/pc: create pci-host qdev prior to pc_memory_init() Joao Martins
2022-02-23 18:44 ` [PATCH v3 3/6] i386/pc: pass pci_hole64_size " Joao Martins
2022-02-23 18:44 ` [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable Joao Martins
2022-02-23 21:22   ` Michael S. Tsirkin
2022-02-23 23:35     ` Joao Martins
2022-02-24 16:07       ` Joao Martins
2022-02-24 17:23         ` Michael S. Tsirkin
2022-02-24 17:54           ` Joao Martins
2022-02-24 18:30             ` Michael S. Tsirkin
2022-02-24 19:44               ` Joao Martins
2022-02-24 19:54                 ` Michael S. Tsirkin
2022-02-24 20:04                   ` Joao Martins
2022-02-24 20:12                     ` Michael S. Tsirkin
2022-02-24 20:34                       ` Joao Martins
2022-02-24 21:40                         ` Alex Williamson
2022-02-25 12:36                           ` Joao Martins
2022-02-25 12:49                             ` Michael S. Tsirkin
2022-02-25 17:40                               ` Joao Martins
2022-02-25 16:15                             ` Alex Williamson
2022-02-25 17:40                               ` Joao Martins
2022-02-25  5:22                         ` Michael S. Tsirkin
2022-02-25 12:36                           ` Joao Martins [this message]
2022-02-25  3:52               ` Jason Wang
2022-02-24 14:27   ` Joao Martins
2022-02-23 18:44 ` [PATCH v3 5/6] i386/pc: warn if phys-bits is too low Joao Martins
2022-02-24 14:42   ` Joao Martins
2022-02-23 18:44 ` [PATCH v3 6/6] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type Joao Martins
