* Re: Remapping BOs and registers post BAR movements [was - Re: Question about supporting AMD eGPU hot plug case]
       [not found] ` <5c2b2379-9961-c6f6-b343-b874cc22260f@amd.com>
@ 2021-03-02 17:48   ` Andrey Grodzovsky
  2021-03-03 16:05     ` Andrey Grodzovsky
  0 siblings, 1 reply; 4+ messages in thread
From: Andrey Grodzovsky @ 2021-03-02 17:48 UTC (permalink / raw)
  To: Christian König, Bjorn Helgaas
  Cc: Alexander.Deucher@amd.com, Sergei Miroshnichenko, linux-pci

Adding the PCI mailing list per Bjorn's advice (sorry, skipped the
address last time).

On 2021-03-02 12:22 p.m., Christian König wrote:
> Hi Andrey,
>
>> Can you elaborate more on why you need such a lock on a global level,
>> and why a per-driver (per-device, more specifically) local lock,
>> acquired in rescan_prepare (before the BAR shuffling starts) and
>> released in rescan_done (after the BAR shuffling is done), is not
>> enough?
>
> because this adds inter-device dependencies.
>
> E.g. when we have devices A, B and C and each is taking its own lock,
> we end up with 3 locks.
>
> If those devices have interdependencies of their own, for example
> because of DMA-buf, we will end up in quite a locking mess.
>
> Apart from that, if we don't always keep the order A, B, C but instead
> end up with A, C, B, lockdep will certainly start to complain.
>
> Regards,
> Christian.

But as far as I understand, those locks are agnostic to each other -
each lock is taken only by its respective device and not by others. So
I don't see a deadlock danger here, but I do suspect lockdep might
throw a false warning...

Andrey

>
> Am 02.03.21 um 18:16 schrieb Andrey Grodzovsky:
>> Per Bjorn's advice, adding the PCI mailing list.
>>
>> Christian, please reply on top of this mail and not on top of the
>> earlier, original one.
>>
>> Andrey
>>
>> On 2021-03-02 11:49 a.m., Bjorn Helgaas wrote:
>>> Please add linux-pci@vger.kernel.org to the cc list.  There's a lot
>>> of useful context in this thread and it needs to be archived.
>>> On Tue, Mar 02, 2021 at 11:37:25AM -0500, Andrey Grodzovsky wrote:
>>>> Adding Sergei and Bjorn since you are suggesting some PCI-related
>>>> changes.
>>>>
>>>> On 2021-03-02 6:28 a.m., Christian König wrote:
>>>>> Yeah, I've been thinking about that as well for quite a while now.
>>>>>
>>>>> I'm not 100% sure that what Sergei does in his patch set will work
>>>>> for us. He basically tells the driver to stop any processing, then
>>>>> shuffles the BARs around and then tells it to start again.
>>>>>
>>>>> Because of the necessity to reprogram bridges there isn't much
>>>>> other way to do this, but we would need something like taking a
>>>>> lock, returning, and dropping the lock later on. If you do this
>>>>> for multiple drivers at the same time, it results in a locking
>>>>> nightmare.
>>>>>
>>>>> What we need instead is an upper-level PCI-layer rw-lock which we
>>>>> can take on the read side in the driver, similar to our internal
>>>>> reset lock.
>>>>
>>>> Can you elaborate more on why you need such a lock on a global
>>>> level, and why a per-driver (per-device, more specifically) local
>>>> lock, acquired in rescan_prepare (before the BAR shuffling starts)
>>>> and released in rescan_done (after the BAR shuffling is done), is
>>>> not enough?
>>>>
>>>> Andrey
>>>>
>>>>>
>>>>> Wiring up changing the mapping is the smaller problem in that
>>>>> picture.
>>>>>
>>>>> Christian.
>>>>>
>>>>> Am 01.03.21 um 22:14 schrieb Andrey Grodzovsky:
>>>>>> Hi, I started looking into the actual implementation for this;
>>>>>> as a reference point I am using Sergei's NVMe implementation (see
>>>>>> https://github.com/YADRO-KNS/linux/commit/58c375df086502538706ee23e8ef7c8bb5a4178f)
>>>>>> and our own amdgpu_device_resize_fb_bar. Our use case, being more
>>>>>> complicated than NVMe's, raises some questions and thoughts:
>>>>>>
>>>>>> Before the PCI core moves the BARs (VRAM and registers) we have
>>>>>> to stop the HW and SW and then unmap all MMIO-mapped entities in
>>>>>> the driver. This includes registers of all types and VRAM-placed
>>>>>> BOs with CPU visibility. My concern is how to prevent accesses to
>>>>>> all MMIO ranges between the iounmap in rescan_prepare and the
>>>>>> ioremap in rescan_done. For register accesses we are back to the
>>>>>> same issue we had with device unplug and GPU reset, which we are
>>>>>> discussing with Denis now. So here, at least as a first solution
>>>>>> just to confirm the feature works, I can introduce low-level RW
>>>>>> locking (in the register accessors) to prevent access until all
>>>>>> register-access driver pointers are remapped after the BAR move.
>>>>>> What concerns me more is how to handle the BOs. In
>>>>>> rescan_prepare: collect all CPU-accessible BOs placed in VRAM (I
>>>>>> can use an amdgpu-specific list like the one I currently use for
>>>>>> device unplug), lock them to prevent any further use, then unpin
>>>>>> (and unmap) them. In rescan_done: remap them and unlock them for
>>>>>> further use. Also, there is a difference between kernel BOs,
>>>>>> where iounmap/ioremap should indeed be enough, and
>>>>>> user-space-mapped BOs, where it seems that, as in hot unplug, we
>>>>>> need to only unmap them in rescan_prepare, drop new page faults
>>>>>> with VM_FAULT_RETRY in between, and after rescan_done let page
>>>>>> faults go through to refill the user page tables with the updated
>>>>>> MMIO addresses.
>>>>>>
>>>>>> Let me know your thoughts.
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>> On 2021-02-26 4:03 p.m., Andrey Grodzovsky wrote:
>>>>>>>
>>>>>>> On 2021-02-26 1:44 a.m., Sergei Miroshnichenko wrote:
>>>>>>>> On Thu, 2021-02-25 at 13:28 -0500, Andrey Grodzovsky wrote:
>>>>>>>>> On 2021-02-25 2:00 a.m., Sergei Miroshnichenko wrote:
>>>>>>>>>> On Wed, 2021-02-24 at 17:51 -0500, Andrey Grodzovsky wrote:
>>>>>>>>>>> On 2021-02-24 1:23 p.m., Sergei Miroshnichenko wrote:
>>>>>>>>>>>> ...
>>>>>>>>>>> Are you saying that even without hot-plugging, while both
>>>>>>>>>>> the NVMe and the AMD card are present right from boot, you
>>>>>>>>>>> still get BARs moving and MMIO ranges reassigned for the
>>>>>>>>>>> NVMe's BARs, just because the amdgpu driver starts a resize
>>>>>>>>>>> of the AMD card's BARs and this triggers a move of the
>>>>>>>>>>> NVMe's BARs to let the AMD card's BARs cover the full range
>>>>>>>>>>> of video RAM?
>>>>>>>>>> Yes. Unconditionally, because it is unknown beforehand
>>>>>>>>>> whether the NVMe's BAR movement will help. In this particular
>>>>>>>>>> case the BAR movement is not needed, but it is done anyway.
>>>>>>>>>>
>>>>>>>>>> BARs are not moved one by one; rather, the kernel releases
>>>>>>>>>> all the releasable ones and then recalculates a new BAR
>>>>>>>>>> layout that fits them all. The kernel's algorithm is
>>>>>>>>>> different from the BIOS's, so the NVMe has appeared at a new
>>>>>>>>>> place.
>>>>>>>>>>
>>>>>>>>>> This is triggered by the following:
>>>>>>>>>> - at boot, if the BIOS hadn't assigned every BAR;
>>>>>>>>>> - during pci_resize_resource();
>>>>>>>>>> - during pci_rescan_bus() -- after a pciehp event or a manual
>>>>>>>>>>   rescan via sysfs.
>>>>>>>>> By a manual rescan via sysfs you mean something like this -
>>>>>>>>> 'echo 1 > /sys/bus/pci/drivers/amdgpu/0000\:0c\:00.0/remove &&
>>>>>>>>> echo 1 > /sys/bus/pci/rescan'? I am looking into how to most
>>>>>>>>> reliably trigger the PCI code to call my callbacks even
>>>>>>>>> without having an external PCI cage for the GPU (it will take
>>>>>>>>> me some time to get one).
>>>>>>>> Yeah, this is our way to go when a device can't be physically
>>>>>>>> removed or unpowered remotely. With just a bit shorter path:
>>>>>>>>
>>>>>>>>    sudo sh -c 'echo 1 > /sys/bus/pci/devices/0000\:0c\:00.0/remove'
>>>>>>>>    sudo sh -c 'echo 1 > /sys/bus/pci/rescan'
>>>>>>>>
>>>>>>>> Or, just the second command (rescan) is enough: a BAR movement
>>>>>>>> attempt will be triggered even if there were no changes in the
>>>>>>>> PCI topology.
>>>>>>>>
>>>>>>>> Serge
>>>>>>>
>>>>>>>
>>>>>>> Ok, I am able to trigger the stub callback calls after doing a
>>>>>>> rescan from sysfs. I also see the BARs moving:
>>>>>>>
>>>>>>> [ 1860.968036] amdgpu 0000:09:00.0: BAR 0: assigned [mem 0xc0000000-0xcfffffff 64bit pref]
>>>>>>> [ 1860.968050] amdgpu 0000:09:00.0: BAR 0 updated: 0xe0000000 -> 0xc0000000
>>>>>>> [ 1860.968054] amdgpu 0000:09:00.0: BAR 2: assigned [mem 0xd0000000-0xd01fffff 64bit pref]
>>>>>>> [ 1860.968068] amdgpu 0000:09:00.0: BAR 2 updated: 0xf0000000 -> 0xd0000000
>>>>>>> [ 1860.968071] amdgpu 0000:09:00.0: BAR 5: assigned [mem 0xf0100000-0xf013ffff]
>>>>>>> [ 1860.968078] amdgpu 0000:09:00.0: BAR 5 updated: 0xfce00000 -> 0xf0100000
>>>>>>>
>>>>>>> As expected, after the move the device becomes unresponsive, as
>>>>>>> the iounmap/ioremap wasn't done.
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>

^ permalink raw reply	[flat|nested] 4+ messages in thread
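The rescan_prepare/rescan_done hooks referred to above are the driver
callbacks from Sergei's movable-BARs patch set. A minimal sketch of how
amdgpu might wire them up - the bar_move_sem rwsem and the
amdgpu_device_halt_activity()/amdgpu_device_resume_activity() helpers
are hypothetical names, and only the register BAR is shown (VRAM BOs
need the same treatment):

	static void amdgpu_pci_rescan_prepare(struct pci_dev *pdev)
	{
		struct drm_device *ddev = pci_get_drvdata(pdev);
		struct amdgpu_device *adev = drm_to_adev(ddev);

		/* Block register/BO access until rescan_done remaps. */
		down_write(&adev->bar_move_sem);
		amdgpu_device_halt_activity(adev);	/* stop HW and SW queues */
		iounmap(adev->rmmio);			/* drop the stale mapping */
		adev->rmmio = NULL;
	}

	static void amdgpu_pci_rescan_done(struct pci_dev *pdev)
	{
		struct drm_device *ddev = pci_get_drvdata(pdev);
		struct amdgpu_device *adev = drm_to_adev(ddev);

		/* BAR 5 (registers) may now live at a new bus address. */
		adev->rmmio = ioremap(pci_resource_start(pdev, 5),
				      pci_resource_len(pdev, 5));
		amdgpu_device_resume_activity(adev);
		up_write(&adev->bar_move_sem);
	}

	static struct pci_driver amdgpu_kms_pci_driver = {
		/* ... existing fields ... */
		.rescan_prepare	= amdgpu_pci_rescan_prepare,
		.rescan_done	= amdgpu_pci_rescan_done,
	};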
* Re: Remapping BOs and registers post BAR movements [was - Re: Question about supporting AMD eGPU hot plug case]
  2021-03-02 17:48 ` Remapping BOs and registers post BAR movements [was - Re: Question about supporting AMD eGPU hot plug case] Andrey Grodzovsky
@ 2021-03-03 16:05   ` Andrey Grodzovsky
  2021-03-05  8:36     ` Christian König
  0 siblings, 1 reply; 4+ messages in thread
From: Andrey Grodzovsky @ 2021-03-03 16:05 UTC (permalink / raw)
  To: Christian König, Bjorn Helgaas
  Cc: Alexander.Deucher@amd.com, Sergei Miroshnichenko, linux-pci

Ping

Andrey

On 2021-03-02 12:48 p.m., Andrey Grodzovsky wrote:
> Adding the PCI mailing list per Bjorn's advice (sorry, skipped the
> address last time).
>
> On 2021-03-02 12:22 p.m., Christian König wrote:
>> Hi Andrey,
>>
>>> Can you elaborate more on why you need such a lock on a global
>>> level, and why a per-driver (per-device, more specifically) local
>>> lock, acquired in rescan_prepare (before the BAR shuffling starts)
>>> and released in rescan_done (after the BAR shuffling is done), is
>>> not enough?
>>
>> because this adds inter-device dependencies.
>>
>> E.g. when we have devices A, B and C and each is taking its own
>> lock, we end up with 3 locks.
>>
>> If those devices have interdependencies of their own, for example
>> because of DMA-buf, we will end up in quite a locking mess.
>>
>> Apart from that, if we don't always keep the order A, B, C but
>> instead end up with A, C, B, lockdep will certainly start to
>> complain.
>>
>> Regards,
>> Christian.
>
> But as far as I understand, those locks are agnostic to each other -
> each lock is taken only by its respective device and not by others.
> So I don't see a deadlock danger here, but I do suspect lockdep might
> throw a false warning...
>
> Andrey
>
> [...]

^ permalink raw reply	[flat|nested] 4+ messages in thread
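The "low-level RW locking in the register accessors" idea from the
first message could, as a hedged sketch, look like the following;
bar_move_sem is the same hypothetical rwsem as above, and the accessor
shape only loosely mirrors amdgpu's amdgpu_mm_rreg():

	static uint32_t amdgpu_mm_rreg_guarded(struct amdgpu_device *adev,
					       uint32_t reg)
	{
		uint32_t val;

		/* Readers take the rwsem shared; the rescan path holds it
		 * exclusive while adev->rmmio is invalid, so a register
		 * read can never dereference a stale mapping. */
		down_read(&adev->bar_move_sem);
		val = readl(adev->rmmio + (reg * 4));
		up_read(&adev->bar_move_sem);

		return val;
	}

The cost is two atomic operations on every register access, which is
why the thread frames this as a first solution just to confirm the
feature works.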
* Re: Remapping BOs and registers post BAR movements [was - Re: Question about supporting AMD eGPU hot plug case]
  2021-03-03 16:05   ` Andrey Grodzovsky
@ 2021-03-05  8:36     ` Christian König
  2021-03-05 23:19       ` Andrey Grodzovsky
  0 siblings, 1 reply; 4+ messages in thread
From: Christian König @ 2021-03-05 8:36 UTC (permalink / raw)
  To: Andrey Grodzovsky, Bjorn Helgaas
  Cc: Alexander.Deucher@amd.com, Sergei Miroshnichenko, linux-pci

Hi Andrey,

Am 03.03.21 um 17:05 schrieb Andrey Grodzovsky:
> Ping
>
> Andrey
>
> On 2021-03-02 12:48 p.m., Andrey Grodzovsky wrote:
>> Adding the PCI mailing list per Bjorn's advice (sorry, skipped the
>> address last time).
>>
>> [...]
>>
>> But as far as I understand, those locks are agnostic to each other -
>> each lock is taken only by its respective device and not by others.

Well, they should be agnostic to each other, but are they really? I
mean, we could have inter-device lock dependencies because of DMA-buf
or God knows what.

>> So I don't see a deadlock danger here, but I do suspect lockdep
>> might throw a false warning...

Yeah, and disabling or ignoring the lockdep warning could cause
problems, because we would no longer see issues caused by inter-device
lock dependencies.

I just have the feeling that this would cause a lot of problems.

Christian.

>>
>> Andrey
>>
>> [...]

^ permalink raw reply	[flat|nested] 4+ messages in thread
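The upper-level PCI-layer rw-lock Christian proposes could be hosted in
the PCI core rather than per device. None of the symbols below exist
today; this sketch only illustrates the shape:

	/* drivers/pci/ -- hypothetical */
	static DECLARE_RWSEM(pci_bar_move_sem);

	/* Write side, taken once by the rescan/BAR-move path. */
	void pci_bar_move_begin(void)
	{
		down_write(&pci_bar_move_sem);
	}

	void pci_bar_move_end(void)
	{
		up_write(&pci_bar_move_sem);
	}

	/* Read side, taken by any driver about to dereference its
	 * MMIO mappings. */
	void pci_bar_access_lock(void)
	{
		down_read(&pci_bar_move_sem);
	}

	void pci_bar_access_unlock(void)
	{
		up_read(&pci_bar_move_sem);
	}

Because every driver shares the one lock, there is no per-device A, B,
C ordering for lockdep to object to; the trade-off is that a single BAR
move stalls MMIO access for every device in the system.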
* Re: Remapping BOs and registers post BAR movements [was - Re: Question about supporting AMD eGPU hot plug case]
  2021-03-05  8:36     ` Christian König
@ 2021-03-05 23:19       ` Andrey Grodzovsky
  0 siblings, 0 replies; 4+ messages in thread
From: Andrey Grodzovsky @ 2021-03-05 23:19 UTC (permalink / raw)
  To: Christian König, Bjorn Helgaas
  Cc: Alexander.Deucher@amd.com, Sergei Miroshnichenko, linux-pci

On 2021-03-05 3:36 a.m., Christian König wrote:
> Hi Andrey,
>
> Am 03.03.21 um 17:05 schrieb Andrey Grodzovsky:
>> Ping
>>
>> Andrey
>>
>> On 2021-03-02 12:48 p.m., Andrey Grodzovsky wrote:
>>> Adding the PCI mailing list per Bjorn's advice (sorry, skipped the
>>> address last time).
>>>
>>> [...]
>>>
>>> But as far as I understand, those locks are agnostic to each other
>>> - each lock is taken only by its respective device and not by
>>> others.
>
> Well, they should be agnostic to each other, but are they really? I
> mean, we could have inter-device lock dependencies because of DMA-buf
> or God knows what.
>
>>> So I don't see a deadlock danger here, but I do suspect lockdep
>>> might throw a false warning...
>
> Yeah, and disabling or ignoring the lockdep warning could cause
> problems, because we would no longer see issues caused by
> inter-device lock dependencies.
>
> I just have the feeling that this would cause a lot of problems.
>
> Christian.

Sergei, can you please address Christian's suggestion below - quote:
"What we need instead is an upper-level PCI-layer rw-lock which we can
take on the read side in the driver."

Andrey

> [...]

^ permalink raw reply	[flat|nested] 4+ messages in thread
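For the user-space-mapped BOs, the first message proposes dropping new
page faults with VM_FAULT_RETRY while the BARs move. A hedged sketch,
assuming rescan_prepare has already zapped the existing user PTEs with
unmap_mapping_range(), and with bar_move_sem, amdgpu_vma_to_adev() and
amdgpu_bo_vm_insert_pfns() as hypothetical names; the retry dance
mirrors what TTM does in ttm_bo_vm_reserve():

	static vm_fault_t amdgpu_gem_fault_guarded(struct vm_fault *vmf)
	{
		struct amdgpu_device *adev = amdgpu_vma_to_adev(vmf->vma);
		vm_fault_t ret;

		if (!down_read_trylock(&adev->bar_move_sem)) {
			/* BARs are being moved: drop mmap_lock and ask the
			 * core MM to retry the fault later instead of
			 * sleeping on the rwsem with mmap_lock held. */
			if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
				if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
					mmap_read_unlock(vmf->vma->vm_mm);
				return VM_FAULT_RETRY;
			}
			down_read(&adev->bar_move_sem); /* no retry allowed */
		}

		/* Past rescan_done: refill the user page tables with PFNs
		 * computed from the updated BAR addresses. */
		ret = amdgpu_bo_vm_insert_pfns(vmf);

		up_read(&adev->bar_move_sem);
		return ret;
	}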
end of thread, newest: ~2021-03-05 23:20 UTC

Thread overview: 4+ messages
[not found] <20210302164944.GA424771@bjorn-Precision-5520>
[not found] ` <81cc1a93-3a77-7ef7-c9f9-6083c1c313f7@amd.com>
[not found] ` <5c2b2379-9961-c6f6-b343-b874cc22260f@amd.com>
2021-03-02 17:48 ` Remapping BOs and registers post BAR movements [was - Re: Question about supporting AMD eGPU hot plug case] Andrey Grodzovsky
2021-03-03 16:05 ` Andrey Grodzovsky
2021-03-05 8:36 ` Christian König
2021-03-05 23:19 ` Andrey Grodzovsky