From: "Christian König" <christian.koenig@amd.com>
To: Andrey Grodzovsky <andrey.grodzovsky@amd.com>,
Bjorn Helgaas <helgaas@kernel.org>
Cc: "Alexander.Deucher@amd.com" <Alexander.Deucher@amd.com>,
Sergei Miroshnichenko <s.miroshnichenko@yadro.com>,
linux-pci@vger.kernel.org
Subject: Re: Remapping BOs and registers post BAR movements [was - Re: Question about supporting AMD eGPU hot plug case]
Date: Fri, 5 Mar 2021 09:36:20 +0100 [thread overview]
Message-ID: <e078a773-dc5e-b9a2-3078-3562b8dd215a@amd.com> (raw)
In-Reply-To: <0fda1f26-9b26-faee-8208-0fe1bf644935@amd.com>
Hi Andrey,
Am 03.03.21 um 17:05 schrieb Andrey Grodzovsky:
> Ping
>
> Andrey
>
> On 2021-03-02 12:48 p.m., Andrey Grodzovsky wrote:
>> Adding PCI mailing list per Bjorn's advise (sorry, skipped the
>> address last time).
>>
>> On 2021-03-02 12:22 p.m., Christian König wrote:
>>> Hi Andrey,
>>>
>>>> Can you elaborate more why you need such a lock on a global level
>>>> and not
>>>> enough per driver (per device more specifically) local lock
>>>> acquired in
>>>> rescan_preapare (before BARs shuffling starts) and released in
>>>> rescan_done
>>>> (after BARs shuffling done) ?
>>>
>>> because this adds inter device dependencies.
>>>
>>> E.g. when we have device A, B and C and each is taking it's own lock
>>> we end up with 3 locks.
>>>
>>> If those device have inter dependencies on their own, for example
>>> because of DMA-buf we will end up in quite a locking mess.
>>>
>>> Apart from that if we don't always keep the order A, B, C but
>>> instead end up with A, C, B lockdep will certainly start to complain.
>>>
>>> Regards,
>>> Christian.
>>
>> But as far as I understand those locks are agnostic to each other -
>> each lock is taken only by it's respective device and not by others.
Well they should be agnostic to each other, but are they really? I mean
we could have inter device lock dependencies because of DMA-buf or good
knows what.
>> So I don't see deadlock danger here but, I do suspect lockdep might
>> throw a false warning...
Yeah and disabling or ignoring the lockdep warning could cause problems
because we don't see issues caused by inter device lock dependencies any
more.
I just have the feeling that this would cause a lot of problems.
Christian.
>>
>> Andrey
>>
>>>
>>> Am 02.03.21 um 18:16 schrieb Andrey Grodzovsky:
>>>> Per Bjorn's advise adding PCI mailing list.
>>>>
>>>> Christian, please reply on top of this mail and not on top the
>>>> earlier, original one.
>>>>
>>>> Andrey
>>>>
>>>> On 2021-03-02 11:49 a.m., Bjorn Helgaas wrote:
>>>>> Please add linux-pci@vger.kernel.org to the cc list. There's a
>>>>> lot of
>>>>> useful context in this thread and it needs to be archived.
>>>>> On Tue, Mar 02, 2021 at 11:37:25AM -0500, Andrey Grodzovsky wrote:
>>>>>> Adding Sergei and Bjorn since you are suggesting some PCI related
>>>>>> changes.
>>>>>>
>>>>>> On 2021-03-02 6:28 a.m., Christian König wrote:
>>>>>>> Yeah, I'm thinking about that as well for quite a while now.
>>>>>>>
>>>>>>> I'm not 100% sure what Sergei does in his patch set will work
>>>>>>> for us. He
>>>>>>> basically tells the driver to stop any processing then shuffle
>>>>>>> around
>>>>>>> the BARs and then tells him to start again.
>>>>>>>
>>>>>>> Because of the necessity to reprogram bridges there isn't much
>>>>>>> other way
>>>>>>> how to do this, but we would need something like taking a lock,
>>>>>>> returning and dropping the lock later on. If you do this for
>>>>>>> multiple
>>>>>>> drivers at the same time this results in a locking nightmare.
>>>>>>>
>>>>>>> What we need instead is a upper level PCI layer rw-lock which we
>>>>>>> can
>>>>>>> take on the read side in the driver, similar to our internal
>>>>>>> reset lock.
>>>>>>
>>>>>>
>>>>>> Can you elaborate more why you need such a lock on a global level
>>>>>> and not
>>>>>> enough per driver (per device more specifically) local lock
>>>>>> acquired in
>>>>>> rescan_preapare (before BARs shuffling starts) and released in
>>>>>> rescan_done
>>>>>> (after BARs shuffling done) ?
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Wiring up changing the mapping is the smaller problem in that
>>>>>>> picture.
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 01.03.21 um 22:14 schrieb Andrey Grodzovsky:
>>>>>>>> Hi, I started looking into actual implementation for this, as
>>>>>>>> reference point I am using Sege's NVME implementations (see
>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FYADRO-KNS%2Flinux%2Fcommit%2F58c375df086502538706ee23e8ef7c8bb5a4178f&data=04%7C01%7Candrey.grodzovsky%40amd.com%7C44bc68a68feb4b71836c08d8dd9b306a%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637503005898302093%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=nuTptAjGewG2T7tXOQr2h%2Bdr42lyg3m1%2Bfv0mJzd8tw%3D&reserved=0)
>>>>>>>>
>>>>>>>> and our own amdgpu_device_resize_fb_bar. Our use case being more
>>>>>>>> complicated then NVME's arises some questions or thoughts:
>>>>>>>>
>>>>>>>> Before the PCI core will move the BARs (VRAM and registers) we
>>>>>>>> have
>>>>>>>> to stop the HW and SW and then unmap all MMIO mapped entitles
>>>>>>>> in the
>>>>>>>> driver. This includes, registers of all types and VRAM placed BOs
>>>>>>>> with CPU visibility. My concern is how to prevent accesses to all
>>>>>>>> MMIO ranges between iounmap in rescan_preapare and ioremap in
>>>>>>>> rescan_done. For register accesses we are again back to same issue
>>>>>>>> we had with device unplug and GPU reset we are discussing with
>>>>>>>> Denis
>>>>>>>> now. So here at least as first solution just to confirm the
>>>>>>>> feature
>>>>>>>> works I can introduce low level (in register accessors) RW locking
>>>>>>>> to avoid access until all register access driver pointers are
>>>>>>>> remapped post BARs move. What more concerns me is how to - In
>>>>>>>> rescan_preapare: collect all CPU accessible BOs placed in VRAM (i
>>>>>>>> can use amdgpu specific list like I currently use for device
>>>>>>>> unplug), lock them to prevent any further use of them, unpin them
>>>>>>>> (and unmap). In rescan_preapare: remap them back and unlock
>>>>>>>> them for
>>>>>>>> further use. Also, there is a difference between kernel BOs where
>>>>>>>> indeed iounmap, ioremap should be enough and user space mapped BOs
>>>>>>>> where it seems like in hot unplug, we need in rescan_preapare to
>>>>>>>> only unamp them, in between drop new page faults with
>>>>>>>> VM_FAULT_RETRY
>>>>>>>> and post rescan_done allow page faults to go through to refill all
>>>>>>>> the user page tables with updated MMIO addresses.
>>>>>>>>
>>>>>>>> Let me know your thoughts.
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>> On 2021-02-26 4:03 p.m., Andrey Grodzovsky wrote:
>>>>>>>>>
>>>>>>>>> On 2021-02-26 1:44 a.m., Sergei Miroshnichenko wrote:
>>>>>>>>>> On Thu, 2021-02-25 at 13:28 -0500, Andrey Grodzovsky wrote:
>>>>>>>>>>> On 2021-02-25 2:00 a.m., Sergei Miroshnichenko wrote:
>>>>>>>>>>>> On Wed, 2021-02-24 at 17:51 -0500, Andrey Grodzovsky wrote:
>>>>>>>>>>>>> On 2021-02-24 1:23 p.m., Sergei Miroshnichenko wrote:
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>> Are you saying that even without hot-plugging, while both
>>>>>>>>>>>>> nvme
>>>>>>>>>>>>> and
>>>>>>>>>>>>> AMD
>>>>>>>>>>>>> card are present
>>>>>>>>>>>>> right from boot, you still get BARs moving and MMIO ranges
>>>>>>>>>>>>> reassigned
>>>>>>>>>>>>> for NVME BARs
>>>>>>>>>>>>> just because amdgpu driver will start resize of AMD card
>>>>>>>>>>>>> BARs and
>>>>>>>>>>>>> this
>>>>>>>>>>>>> will trigger NVMEs BARs move to
>>>>>>>>>>>>> allow AMD card BARs to cover full range of VIDEO RAM ?
>>>>>>>>>>>> Yes. Unconditionally, because it is unknown beforehand if
>>>>>>>>>>>> NVMe's
>>>>>>>>>>>> BAR
>>>>>>>>>>>> movement will help. In this particular case BAR movement is
>>>>>>>>>>>> not
>>>>>>>>>>>> needed,
>>>>>>>>>>>> but is done anyway.
>>>>>>>>>>>>
>>>>>>>>>>>> BARs are not moved one by one, but the kernel releases all the
>>>>>>>>>>>> releasable ones, and then recalculates a new BAR layout to
>>>>>>>>>>>> fit them
>>>>>>>>>>>> all. Kernel's algorithm is different from BIOS's, so NVME has
>>>>>>>>>>>> appeared
>>>>>>>>>>>> at a new place.
>>>>>>>>>>>>
>>>>>>>>>>>> This is triggered by following:
>>>>>>>>>>>> - at boot, if BIOS had assigned not every BAR;
>>>>>>>>>>>> - during pci_resize_resource();
>>>>>>>>>>>> - during pci_rescan_bus() -- after a pciehp event or a
>>>>>>>>>>>> manual via
>>>>>>>>>>>> sysfs.
>>>>>>>>>>> By manual via sysfs you mean something like this - 'echo 1 >
>>>>>>>>>>> /sys/bus/pci/drivers/amdgpu/0000\:0c\:00.0/remove && echo 1 >
>>>>>>>>>>> /sys/bus/pci/rescan ' ? I am looking into how most reliably
>>>>>>>>>>> trigger
>>>>>>>>>>> PCI
>>>>>>>>>>> code to call my callbacks even without having external PCI
>>>>>>>>>>> cage for
>>>>>>>>>>> GPU
>>>>>>>>>>> (will take me some time to get it).
>>>>>>>>>> Yeah, this is our way to go when a device can't be physically
>>>>>>>>>> removed
>>>>>>>>>> or unpowered remotely. With just a bit shorter path:
>>>>>>>>>>
>>>>>>>>>> sudo sh -c 'echo 1 >
>>>>>>>>>> /sys/bus/pci/devices/0000\:0c\:00.0/remove'
>>>>>>>>>> sudo sh -c 'echo 1 > /sys/bus/pci/rescan'
>>>>>>>>>>
>>>>>>>>>> Or, just a second command (rescan) is enough: a BAR movement
>>>>>>>>>> attempt
>>>>>>>>>> will be triggered even if there were no changes in PCI topology.
>>>>>>>>>>
>>>>>>>>>> Serge
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ok, able to trigger stub callbacks call after doing rescan from
>>>>>>>>> sysfs. Also I see BARs movement
>>>>>>>>>
>>>>>>>>> [ 1860.968036] amdgpu 0000:09:00.0: BAR 0: assigned [mem
>>>>>>>>> 0xc0000000-0xcfffffff 64bit pref]
>>>>>>>>> [ 1860.968050] amdgpu 0000:09:00.0: BAR 0 updated: 0xe0000000 ->
>>>>>>>>> 0xc0000000
>>>>>>>>> [ 1860.968054] amdgpu 0000:09:00.0: BAR 2: assigned [mem
>>>>>>>>> 0xd0000000-0xd01fffff 64bit pref]
>>>>>>>>> [ 1860.968068] amdgpu 0000:09:00.0: BAR 2 updated: 0xf0000000 ->
>>>>>>>>> 0xd0000000
>>>>>>>>> [ 1860.968071] amdgpu 0000:09:00.0: BAR 5: assigned [mem
>>>>>>>>> 0xf0100000-0xf013ffff]
>>>>>>>>> [ 1860.968078] amdgpu 0000:09:00.0: BAR 5 updated: 0xfce00000 ->
>>>>>>>>> 0xf0100000
>>>>>>>>>
>>>>>>>>> As expected, after the move the device becomes unresponsive as
>>>>>>>>> unmap/ioremap wasn't done.
>>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>
>>>
next prev parent reply other threads:[~2021-03-05 8:37 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20210302164944.GA424771@bjorn-Precision-5520>
[not found] ` <81cc1a93-3a77-7ef7-c9f9-6083c1c313f7@amd.com>
[not found] ` <5c2b2379-9961-c6f6-b343-b874cc22260f@amd.com>
2021-03-02 17:48 ` Remapping BOs and registers post BAR movements [was - Re: Question about supporting AMD eGPU hot plug case] Andrey Grodzovsky
2021-03-03 16:05 ` Andrey Grodzovsky
2021-03-05 8:36 ` Christian König [this message]
2021-03-05 23:19 ` Andrey Grodzovsky
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=e078a773-dc5e-b9a2-3078-3562b8dd215a@amd.com \
--to=christian.koenig@amd.com \
--cc=Alexander.Deucher@amd.com \
--cc=andrey.grodzovsky@amd.com \
--cc=helgaas@kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=s.miroshnichenko@yadro.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox