* Re: Trying to test my gart/iommu vmcore problem on RH
[not found] ` <1219081942.3361.436.camel@amd.troyhebe>
@ 2008-08-19 13:47 ` Vivek Goyal
2008-08-21 4:50 ` Eric W. Biederman
0 siblings, 1 reply; 14+ messages in thread
From: Vivek Goyal @ 2008-08-19 13:47 UTC (permalink / raw)
To: Bob Montgomery; +Cc: Heber, Troy, Loftin, Terry, Kexec Mailing List
Hi Bob,
I am CCing this thread to kexec mailing list. It is good to discuss
the issue there to get the ideas.
I will summarize the discussion so far.
Bob is running into an MCA in the second kernel in kdump. The reason seems
to be that the second kernel is trying to access the memory area marked as
GART aperture (by the first kernel). Because the GART aperture does not
appear as "reserved" or anything else in /proc/iomem (in the case where the
kernel has overridden the BIOS settings and has reserved a memory area), the
second kernel thinks it is a valid RAM area, tries to dump it, and runs into
issues.
A few options Bob is considering are:
- Update the "e820" memory map to mark the GART aperture as reserved, which
will be reflected in /proc/iomem also. Kexec-tools will not pass the reserved
area to the second kernel and it will not try to dump this area.
- Mark the GART aperture as "GART aperture" in /proc/iomem and modify
kexec-tools to filter out this memory from the memory map passed to the
second kernel.
- Disable CPU-side GART access in the first kernel so that even if the second
kernel tries to access it, it does not run into issues.
Thanks
Vivek
On Mon, Aug 18, 2008 at 11:52:22AM -0600, Bob Montgomery wrote:
> On Fri, 2008-08-15 at 13:13 +0000, Vivek Goyal wrote:
>
> >
> > I checked that aperture is allocated in mem_init(), which is little late
> > in the game but bootmem allocator is still in effect and we have not
> > released the pages to free list. Maybe it is possible to modify e820
> > memory map even now. Somebody will have to experiment..
>
> I'll try to study this a bit this week.
>
>
>
> >
> > If not, then I also like the idea of marking the region as "GART Aperture"
> > in /proc/iomem and let kexec-tools filter it.
>
> This of course requires two things to change to get a fix - the kernel
> and the kexec-tools.
> >
> > Not very sure about the idea of disabling cpu side access. Will it run
> > into issues like MCE if DMAs are still going on? It does MCA if one
> > tries to disable GART when DMAs are going on.
>
> I am disabling CPU side access in the *first kernel*, when the GART is
> initially set up. The kdump kernel just inherits that setting when it
> boots. So no DMA is going on when I do the disable. The reason it
> seems safe to me is that in the first kernel, CPU side access is
> effectively disabled by this (in arch/x86_64/pci-gart.c on our
> 2.6.18-based kernel):
>
> /*
> * Unmap the IOMMU part of the GART. The alias of the page is
> * always mapped with cache enabled and there is no full cache
> * coherency across the GART remapping. The unmapping avoids
> * automatic prefetches from the CPU allocating cache lines in
> * there. All CPU accesses are done via the direct mapping to
> * the backing memory. The GART address is only used by PCI
> * devices.
> */
> clear_kernel_mapping((unsigned long)__va(iommu_bus_base),
> iommu_size);
>
>
> I notice some changes in the equivalent area in 2.6.26:
> /*
> * Unmap the IOMMU part of the GART. The alias of the page is
> * always mapped with cache enabled and there is no full cache
> * coherency across the GART remapping. The unmapping avoids
> * automatic prefetches from the CPU allocating cache lines in
> * there. All CPU accesses are done via the direct mapping to
> * the backing memory. The GART address is only used by PCI
> * devices.
> */
> set_memory_np((unsigned long)__va(iommu_bus_base),
> iommu_size >> PAGE_SHIFT);
> /*
> * Tricky. The GART table remaps the physical memory range,
> * so the CPU wont notice potential aliases and if the memory
> * is remapped to UC later on, we might surprise the PCI devices
> * with a stray writeout of a cacheline. So play it sure and
> * do an explicit, full-scale wbinvd() _after_ having marked all
> * the pages as Not-Present:
> */
> wbinvd();
>
>
> set_memory_np does:
> change_page_attr_clear(addr, numpages, __pgprot(_PAGE_PRESENT));
>
> wbinvd() does the wbinvd (write back and invalidate caches) instruction.
>
> Bob Montgomery
>
>
>
>
> >
> > Thanks
> > Vivek
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
* Re: Trying to test my gart/iommu vmcore problem on RH
2008-08-19 13:47 ` Trying to test my gart/iommu vmcore problem on RH Vivek Goyal
@ 2008-08-21 4:50 ` Eric W. Biederman
2008-08-22 22:05 ` Bob Montgomery
0 siblings, 1 reply; 14+ messages in thread
From: Eric W. Biederman @ 2008-08-21 4:50 UTC (permalink / raw)
To: Vivek Goyal
Cc: Heber, Troy, Kexec Mailing List, Loftin, Terry, Bob Montgomery
Vivek Goyal <vgoyal@redhat.com> writes:
> Hi Bob,
>
> I am CCing this thread to kexec mailing list. It is good to discuss
> the issue there to get the ideas.
>
> I will summarize the discussion so far.
>
> Bob is running into MCA in second kernel in kdump. Reason seems to be
> that second kernel is trying to access the memory area marked as
> GART aperture (by first kernel). Because GART aperture does not appear
> as "reserved" or something else in /proc/iomem (in case kernel has
> overridden the BIOS settings and has reserved a memory area), second
> kernel thinks it is a valid RAM area and tries to dump it and runs into
> issues.
>
> Few options Bob is considering are.
>
> - Update "e820" memory map to mark GART aperture as reserved, which will
> be reflected in /proc/iomem also. Kexec-tools will not pass reserved
> area to second kernel and it will not try to dump this area.
>
>
> - Mark GART aperture as "GART aperture" in /proc/iomem and modify
> kexec-tools to filter out this memory from memory map passed to second
> kernel.
We should definitely reserve the resource, and it should definitely
show up in /proc/iomem.
> - Disable cpu side GART access in first kernel so that even if second
> kernel tries to access it, it does not run into issues.
This is an interesting one. When I looked at this years ago I had the
feeling that if we did this we could actually always use a 2G Aperture
at a fixed address, and require going through the gart for all of lowmem.
But that is a little more than we are talking about.
Eric
* Re: Trying to test my gart/iommu vmcore problem on RH
2008-08-21 4:50 ` Eric W. Biederman
@ 2008-08-22 22:05 ` Bob Montgomery
2008-08-22 23:48 ` Eric W. Biederman
2008-08-25 13:02 ` Vivek Goyal
0 siblings, 2 replies; 14+ messages in thread
From: Bob Montgomery @ 2008-08-22 22:05 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Heber, Troy, Loftin, Terry, Kexec Mailing List, Vivek Goyal
On Thu, 2008-08-21 at 04:50 +0000, Eric W. Biederman wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> > Few options Bob is considering are.
> >
> > - Update "e820" memory map to mark GART aperture as reserved, which will
> > be reflected in /proc/iomem also. Kexec-tools will not pass reserved
> > area to second kernel and it will not try to dump this area.
> >
> >
> > - Mark GART aperture as "GART aperture" in /proc/iomem and modify
> > kexec-tools to filter out this memory from memory map passed to second
> > kernel.
>
>
> We should definitely reserve the resource, and it should definitely
> show up in /proc/iomem.
Reserving it as a child resource called "GART" in a "System RAM"
resource is already done in kernels newer than mine (at least by 2.6.26).
I haven't seen kexec-tools do anything with that yet.
kexec-tools looks for "Crash kernel" in /proc/iomem now and explicitly
excludes that area.
Example:
000f0000-000fffff : System ROM
00100000-cfe4ffff : System RAM
00200000-0042635a : Kernel code
0042635b-00592037 : Kernel data
01000000-08ffffff : Crash kernel
0c000000-0fffffff : GART
cfe50000-cfe57fff : ACPI Tables
cfe58000-cfffffff : reserved
If it could be "reserved" earlier, so that it isn't a child resource of a
"System RAM" area but a "reserved" area that divides two "System RAM"
areas, then the current kexec-tools would exclude it (as it excludes
all "reserved" areas from the /proc/vmcore map), and it would no longer
be possible to trigger the MCE (or the mysterious hang) by reading
from /proc/vmcore. But currently (in my older kernel) the original
iomem_resource is constructed from the e820 map before I know where (and
how big) the aperture will be.
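To make the "divides two System RAM areas" idea concrete, here is a minimal sketch of the range arithmetic a filter would need. It is not kexec-tools code: `struct mem_range` and `exclude_range` are hypothetical names, and ranges are assumed to be inclusive, as they are printed in /proc/iomem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive physical address range, as printed in /proc/iomem. */
struct mem_range {
    uint64_t start;
    uint64_t end;
};

/*
 * Hypothetical helper (not a kexec-tools function): remove `hole` from
 * `ram` and write the surviving pieces to out[0..1].
 * Returns the number of pieces written (0, 1 or 2).
 */
size_t exclude_range(struct mem_range ram, struct mem_range hole,
                     struct mem_range out[2])
{
    size_t n = 0;

    assert(out != NULL);
    assert(ram.start <= ram.end && hole.start <= hole.end);

    /* No overlap: the RAM range survives unchanged. */
    if (hole.end < ram.start || hole.start > ram.end) {
        out[n++] = ram;
        return n;
    }
    /* Piece below the hole, if any. */
    if (hole.start > ram.start) {
        out[n].start = ram.start;
        out[n].end = hole.start - 1;
        n++;
    }
    /* Piece above the hole, if any. */
    if (hole.end < ram.end) {
        out[n].start = hole.end + 1;
        out[n].end = ram.end;
        n++;
    }
    return n;
}
```

With the example listing above, excluding 0c000000-0fffffff from 00100000-cfe4ffff would leave 00100000-0bffffff and 10000000-cfe4ffff.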
But either way we fix it in iomem to exclude it from /proc/vmcore, a
read of /dev/oldmem in the aperture area would still trigger the MCE.
At least it does on my system.
>
> > - Disable cpu side GART access in first kernel so that even if second
> > kernel tries to access it, it does not run into issues.
This has the advantage of "fixing" accesses through both /proc/vmcore
and /dev/oldmem. And for me, it's an easy patch to pci-gart.c in
init_k8_gatt that just sets bit 4 instead of clearing both 4 and 5:
- ctl |= 1;
- ctl &= ~((1<<4) | (1<<5));
+ ctl |= 1; /* set GartEn */
+ ctl |= (1<<4); /* set DisGartCpu */
+ ctl &= ~(1<<5); /* clear DisGartIO */
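For readers who want to see the two bit patterns side by side, here is a standalone sketch of the same manipulation. The bit positions (0, 4, 5) and their names are taken from the patch comments; the macro and function names are hypothetical and nothing here touches real hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as named in the patch comments. */
#define GART_EN       (1u << 0)  /* GartEn */
#define DIS_GART_CPU  (1u << 4)  /* DisGartCpu */
#define DIS_GART_IO   (1u << 5)  /* DisGartIO */

/* Original behaviour: enable the GART for both CPU and IO accesses. */
uint32_t gart_ctl_cpu_and_io(uint32_t ctl)
{
    ctl |= GART_EN;
    ctl &= ~(DIS_GART_CPU | DIS_GART_IO);
    return ctl;
}

/* Patched behaviour: enable the GART for IO only, CPU side disabled. */
uint32_t gart_ctl_io_only(uint32_t ctl)
{
    ctl |= GART_EN;
    ctl |= DIS_GART_CPU;
    ctl &= ~DIS_GART_IO;
    /* Sanity check: IO translation must remain enabled. */
    assert((ctl & DIS_GART_IO) == 0);
    return ctl;
}
```

Both helpers leave every other bit of the register value untouched, which is the property the one-line patch relies on.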
>
> This is an interesting one. When I looked at this years ago I had the
> feeling that if we did this we could actually always use a 2G Aperture
> at a fixed address, and require going through the gart for all of lowmem.
During discussions here, a colleague suggested that with CPU-side access
of the aperture disabled, we could allocate the crash kernel in the
wasted memory "under" the aperture.
>
> But that is a little more than we are talking about.
Yes, also.
>
> Eric
Bob Montgomery
* Re: Trying to test my gart/iommu vmcore problem on RH
2008-08-22 22:05 ` Bob Montgomery
@ 2008-08-22 23:48 ` Eric W. Biederman
2008-08-25 13:16 ` Vivek Goyal
2008-08-25 13:02 ` Vivek Goyal
1 sibling, 1 reply; 14+ messages in thread
From: Eric W. Biederman @ 2008-08-22 23:48 UTC (permalink / raw)
To: bob.montgomery
Cc: Heber, Troy, Loftin, Terry, Kexec Mailing List, Vivek Goyal
Hmm. Thinking about this we actually have 2 problems.
- Communication about what is going on.
- How to handle an iommu in the event of a crash dump scenario.
The current solution is to ignore the iommu, and use swiotlb. This
solution does not look like it will work for future iommus.
The original plan (and it still sounds like a good one) was to reserve
a section of the iommu (as we do for the physical memory). So we
could have addresses that are only used for the crash dump kernel. Then
have the crash dump kernel just use that section of the iommu.
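As a rough illustration of that reservation scheme, the toy allocator below sets aside the tail of a translation table for the crash dump kernel. The table size, the size of the reserved section, and all names are assumptions made for the sketch; no real IOMMU driver is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_ENTRIES   8  /* total translation entries (illustrative) */
#define TOY_RESERVED  2  /* tail entries set aside for the crash kernel */

struct toy_iommu {
    bool used[TOY_ENTRIES];
};

/*
 * Allocate one entry. The first kernel may only use the low section,
 * the crash dump kernel only the reserved tail. Returns the entry
 * index, or -1 if the caller's section is full.
 */
int toy_iommu_alloc(struct toy_iommu *t, bool crash_kernel)
{
    size_t lo = crash_kernel ? TOY_ENTRIES - TOY_RESERVED : 0;
    size_t hi = crash_kernel ? TOY_ENTRIES : TOY_ENTRIES - TOY_RESERVED;

    assert(t != NULL);
    for (size_t i = lo; i < hi; i++) {
        if (!t->used[i]) {
            t->used[i] = true;
            return (int)i;
        }
    }
    return -1;
}
```

Because the two sections never overlap, entries handed out after the crash cannot collide with mappings that leftover DMA from the first kernel may still be using.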
Either we need to do that or we need to disable the iommu, before we
use swiotlb.
The problem is we can not reliably kill on-going DMA transactions
at the time of a kernel panic, and likely doing so would greatly
decrease our kernel reliability.
Eric
* Re: Trying to test my gart/iommu vmcore problem on RH
2008-08-22 22:05 ` Bob Montgomery
2008-08-22 23:48 ` Eric W. Biederman
@ 2008-08-25 13:02 ` Vivek Goyal
1 sibling, 0 replies; 14+ messages in thread
From: Vivek Goyal @ 2008-08-25 13:02 UTC (permalink / raw)
To: Bob Montgomery
Cc: Heber, Troy, Loftin, Terry, Kexec Mailing List, Eric W. Biederman
On Fri, Aug 22, 2008 at 04:05:51PM -0600, Bob Montgomery wrote:
> On Thu, 2008-08-21 at 04:50 +0000, Eric W. Biederman wrote:
> > Vivek Goyal <vgoyal@redhat.com> writes:
>
> > > Few options Bob is considering are.
> > >
> > > - Update "e820" memory map to mark GART aperture as reserved, which will
> > > be reflected in /proc/iomem also. Kexec-tools will not pass reserved
> > > area to second kernel and it will not try to dump this area.
> > >
> > >
> > > - Mark GART aperture as "GART aperture" in /proc/iomem and modify
> > > kexec-tools to filter out this memory from memory map passed to second
> > > kernel.
> >
> >
> > We should definitely reserve the resource, and it should definitely
> > show up in /proc/iomem.
>
> Reserving it as a child resource called "GART" in a "System RAM"
> resource is already in newer kernels than mine (at least in by 2.6.26).
> I haven't seen that kexec-tools does anything with that yet.
> kexec-tools looks for "Crash kernel" in /proc/iomem now and explicitly
> excludes that area.
>
> Example:
> 000f0000-000fffff : System ROM
> 00100000-cfe4ffff : System RAM
> 00200000-0042635a : Kernel code
> 0042635b-00592037 : Kernel data
> 01000000-08ffffff : Crash kernel
> 0c000000-0fffffff : GART
> cfe50000-cfe57fff : ACPI Tables
> cfe58000-cfffffff : reserved
>
> If it could be "reserved" earlier, so that it isn't a child resource of a
> "System RAM" area but a "reserved" area that divides two "System RAM"
> areas, then the current kexec-tools would exclude it (as it excludes
> all "reserved" areas from the /proc/vmcore map), and it would no longer
> be possible to trigger the MCE (or the mysterious hang) by reading
> from /proc/vmcore. But currently (in my older kernel) the original
> iomem_resource is constructed from the e820 map before I know where (and
> how big) the aperture will be.
>
I think above /proc/iomem entries look good. It makes logical sense
that "GART" is child of "System RAM".
It would be great if you could provide a patch for kexec-tools to
explicitly exclude "GART" aperture from the memory map passed to second
kernel.
> But either way we fix it in iomem to exclude it from /proc/vmcore, a
> read of /dev/oldmem in the aperture area would still trigger the MCE.
> At least it does on my system.
>
> >
> > > - Disable cpu side GART access in first kernel so that even if second
> > > kernel tries to access it, it does not run into issues.
>
> This has the advantage of "fixing" accesses through both /proc/vmcore
> and /dev/oldmem. And for me, it's an easy patch to pci-gart.c in
> init_k8_gatt that just sets bit 4 instead of clearing both 4 and 5:
>
> - ctl |= 1;
> - ctl &= ~((1<<4) | (1<<5));
> + ctl |= 1; /* set GartEn */
> + ctl |= (1<<4); /* set DisGartCpu */
> + ctl &= ~(1<<5); /* clear DisGartIO */
>
This looks interesting from the point of view that it solves the issue
for /dev/oldmem also. But I am not sure whether disabling CPU-side access
can have any side effects. Some GART/IOMMU expert needs to comment on this.
You should post it as an independent patch on LKML and see if somebody
can find an issue with the above.
Even if the patch disabling CPU-side access goes in, I think we should
still fix older kernels to export "GART" info in /proc/iomem and
modify kexec-tools to exclude that area.
> >
> > This is an interesting one. When I looked at this years ago I had the
> > feeling that if we did this we could actually always use a 2G Aperture
> > at a fixed address, and require going through the gart for all of lowmem.
>
> During discussions here, a colleague suggested that with CPU-side access
> of the aperture disabled, we could allocate the crash kernel in the
> wasted memory "under" the aperture.
Interesting. How big is the aperture? Generally it is 64MB, and the kdump
kernel often requires more than that.
Thanks
Vivek
* Re: Trying to test my gart/iommu vmcore problem on RH
2008-08-22 23:48 ` Eric W. Biederman
@ 2008-08-25 13:16 ` Vivek Goyal
2008-08-25 13:46 ` Eric W. Biederman
0 siblings, 1 reply; 14+ messages in thread
From: Vivek Goyal @ 2008-08-25 13:16 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Heber, Troy, Kexec Mailing List, Loftin, Terry, bob.montgomery
On Fri, Aug 22, 2008 at 04:48:10PM -0700, Eric W. Biederman wrote:
>
> Hmm. Thinking about this we actually have 2 problems.
> - Communication about what is going on.
> - How to handle an iommu in the event of a crash dump scenario.
>
> The current solution is to ignore the iommu, and use swiotlb. This
> solution does not look like it will work for future iommus.
>
Does setting up swiotlb require the iommu to be disabled in the second kernel?
IOW, can swiotlb work reliably given that the iommu is active and
there are some active mappings (as created by the first kernel)?
I am wondering whether it is possible that I set up a DMA using swiotlb and
the physical address overlaps with an IO address set up in the IOMMU, so that
the DMA goes to a different buffer altogether.
> The original plan (and it still sounds like a good one) was to reserve
> a section of the iommu (as we do for the physical memory). So we
> could have addresses that are only used for the crash dump kernel. Then
> have the crash dump kernel just use that section of the iommu.
>
This would also require that second kernel keeps using first kernel's
iommu settings/tables and not try to initialize the iommu freshly.
One patch from Chandru is now mainline which seems to be solving the issue
for calgary IOMMU. He seems to be re-using first kernel's iommu tables
in second kernel hence avoiding re-initializing iommu and avoiding MCE.
git commit 95b68dec0d52c7b8fea3698b3938cf3ab936436b
This patch has the risk that second kernel might not find any free entries
to setup DMA and that's why reserving a section of iommu will help.
> Either we need to do that or we need to disable the iommu, before we
> use swiotlb.
>
I thought disabling the iommu was not an option, as it leads to an MCE if
there is a DMA going on.
> The problem is we can not reliably kill on-going DMA transactions
> at the time of a kernel panic, and likely doing so would greatly
> decrease our kernel reliability.
Maybe re-using iommu tables in the second kernel along with reserving some
entries for kdump is the way to go..
Thanks
Vivek
* Re: Trying to test my gart/iommu vmcore problem on RH
2008-08-25 13:16 ` Vivek Goyal
@ 2008-08-25 13:46 ` Eric W. Biederman
2008-09-04 23:28 ` Bob Montgomery
0 siblings, 1 reply; 14+ messages in thread
From: Eric W. Biederman @ 2008-08-25 13:46 UTC (permalink / raw)
To: Vivek Goyal
Cc: Heber, Troy, Kexec Mailing List, Loftin, Terry, bob.montgomery
Vivek Goyal <vgoyal@redhat.com> writes:
> On Fri, Aug 22, 2008 at 04:48:10PM -0700, Eric W. Biederman wrote:
>>
>> Hmm. Thinking about this we actually have 2 problems.
>> - Communication about what is going on.
>> - How to handle an iommu in the event of a crash dump scenario.
>>
>> The current solution is to ignore the iommu, and use swiotlb. This
>> solution does not look like it will work for future iommus.
>>
>
> Does setting up of swiotlb require iommu to be disabled in second kernel?
Not precisely. But in a full iommu all accesses go through the iommu,
and the iommus start becoming per bus. So in practice either we need
to disable full iommus or work with them.
> IOW, can swiotlb work reliably given the fact that iommu is active and
> there are some active mappings (as created by first kernel).
>
> I am thinking is there a possibility that I set a DMA using swiotlb and the
> physical address can overlap with IO address setup in IOMMU and that DMA might
> go to a different buffer altogether.
Yes. Which is why I would very much prefer to reserve some IOMMU entries
instead of turning off an iommu altogether.
>> The original plan (and it still sounds like a good one) was to reserve
>> a section of the iommu (as we do for the physical memory). So we
>> could have addresses that are only used for the crash dump kernel. Then
>> have the crash dump kernel just use that section of the iommu.
>>
>
> This would also require that second kernel keeps using first kernel's
> iommu settings/tables and not try to initialize the iommu freshly.
Not completely anyway.
> One patch from Chandru is now mainline which seems to be solving the issue
> for calgary IOMMU. He seems to be re-using first kernel's iommu tables
> in second kernel hence avoiding re-initializing iommu and avoiding MCE.
>
> git commit 95b68dec0d52c7b8fea3698b3938cf3ab936436b
>
> This patch has the risk that second kernel might not find any free entries
> to setup DMA and that's why reserving a section of iommu will help.
Yes. That, and we would know there aren't any pending DMAs going to
incorrectly set up entries.
>> Either we need to do that or we need to disable the iommu, before we
>> use swiotlb.
>>
>
> I thought disabling iommu was not an option as it leads to MCE if there is
> a DMA going on.
Good point. Looks like I oversimplified.
>> The problem is we can not reliably kill on-going DMA transactions
>> at the time of a kernel panic, and likely doing so would greatly
>> decrease our kernel reliability.
>
> Maybe re-using iommu tables in second kernel along with reserving some
> entries for kdump is the way to go..
That is the best plan we have been able to come up with. Making
AMD's iommu look more like a full strength iommu should help reinforce
that model.
Eric
* Re: Trying to test my gart/iommu vmcore problem on RH
2008-08-25 13:46 ` Eric W. Biederman
@ 2008-09-04 23:28 ` Bob Montgomery
2008-09-05 1:46 ` Eric W. Biederman
2008-09-05 15:12 ` Vivek Goyal
0 siblings, 2 replies; 14+ messages in thread
From: Bob Montgomery @ 2008-09-04 23:28 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Heber, Troy, Loftin, Terry, Kexec Mailing List, Vivek Goyal
On Mon, 2008-08-25 at 13:46 +0000, Eric W. Biederman wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
>
> > On Fri, Aug 22, 2008 at 04:48:10PM -0700, Eric W. Biederman wrote:
> >>
> >> Hmm. Thinking about this we actually have 2 problems.
> >> - Communication about what is going on.
> >> - How to handle an iommu in the event of a crash dump scenario.
> >>
> >> The current solution is to ignore the iommu, and use swiotlb. This
> >> solution does not look like it will work for future iommus.
Howdy all,
There are several aspects to this problem that make solutions come in
and out of contention:
1. Kexec vs Kdump
Kexec: If we are kexec'ing normally, we assume that the shutdown has
successfully stopped DMAs prior to starting our new kernel, and if not,
it's a bug in the previous kernel's driver shutdown. So no issue here,
right?
Kdump: The driver shutdown has been skipped as we go down during a
crash, so assume that leftover DMA operations might be in progress as
the kdump kernel comes up. BUT! They will be in progress to some area
of memory other than the memory being used by the kdump kernel (it has
its own crashkernel sandbox). And on my 2.6.18-based system, with an
AMD64 NB GART-acting-as-IOMMU, the kdump kernel *does not* try to
initialize or use an IOMMU when it comes up because its memory size is
too small to need one (no one is setting crashkernel=4G@4G). So the
kdump kernel can successfully ignore the old IOs using the old GART
aperture IOMMU.
EXCEPT(!) for the fact that we've left CPU-side translations turned on in
the GART NB hardware and the kdump kernel will currently read through
that zone using /proc/vmcore or /dev/oldmem. That's why I like fixing
my stone-age problem by turning off CPU-side access.
Note that real (future?) IOMMUs don't even have the concept of
translating accesses from the CPU side. They only work on IO requests.
So reading old memory areas from the crashed kernel shouldn't cause an
IOMMU to "do" anything.
2. GART vs Calgary vs "new AMD IOMMU" vs "new Intel IOMMU"
The GART-as-IOMMU thing is not a "real" IOMMU. It doesn't offer much
of the interesting protection of a real IOMMU, just "valid", "coherent"
and a translation address. An IO card is still free to screw up and hit
other addresses outside the aperture if it wants to, or hit other pages
in the aperture that really belong to some other driver, or to write to
a page that it should only read, etc. Consequently, there isn't much
desire to utilize the GART thing unless I really need 32-bit IO card
access to 40-bit address space. Since I don't need that in the kdump
kernel (currently), there's no reason to try to use the GART there, so
it's safe to ignore it, as long as I don't provoke it :-)
BUT, if I had a real IOMMU that provided cool protection stuff and
domain stuff, and not just address range expansion for old IO cards,
then I might want to (or be forced to) use it all the time, independent
of memory size, and then the kdump kernel might really need to deal with
sharing it in some way with old leftover DMAs from the crashed kernel
that we're dumping. And this, I think, is the only real issue looming.
But this should only be a kdump issue, and not a kexec issue (see #1
above), because the previous kernel should have shut all that down
before it kexec'd, right?
3. IOMMU vs swiotlb
Isn't swiotlb just a way of hiding bounce buffer copies and management
inside of the dma_map_single and dma_unmap_single calls? If so, it's
just software(TM) and it just uses addresses in the kdump kernel
sandbox, which (by definition) are not addresses that could have been
used in the old kernel that crashed. There shouldn't be any conflict
between kdump kernel and old crashed kernel if one or both are using
swiotlb. Once again, in *my* current situation, there's no reason to
use swiotlb in the kdump kernel because my memory range is restricted to
my crashkernel sandbox and I don't need any IOMMU address translation
capability.
If the original kernel had been using swiotlb, then there's really no
issue, because any leftover DMAs are just writing to the old bounce
buffers anyway, and there's no driver left waiting to call
dma_unmap_single to copy the result into the real buffer.
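The bounce-buffer argument can be modelled in a few lines. This is a deliberately simplified toy, not the kernel's swiotlb: it has a single fixed-size slot, it copies in both directions regardless of DMA direction, and `toy_map_single` / `toy_unmap_single` are hypothetical stand-ins for the real map and unmap calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_SLOT_SIZE 64

/* Toy model: one bounce slot standing in for the whole bounce pool. */
struct toy_mapping {
    unsigned char bounce[BOUNCE_SLOT_SIZE]; /* what the device sees */
    void *orig;                             /* driver's real buffer */
    size_t len;
};

/* Map: copy the driver's data into the slot and return the slot address. */
unsigned char *toy_map_single(struct toy_mapping *m, void *buf, size_t len)
{
    assert(m != NULL && buf != NULL && len <= BOUNCE_SLOT_SIZE);
    m->orig = buf;
    m->len = len;
    memcpy(m->bounce, buf, len);
    return m->bounce;
}

/* Unmap: copy whatever the device wrote back into the real buffer. */
void toy_unmap_single(struct toy_mapping *m)
{
    assert(m != NULL && m->orig != NULL);
    memcpy(m->orig, m->bounce, m->len);
    m->orig = NULL;
}
```

In this model a device write that arrives after the driver is gone only changes `bounce`; the real buffer is untouched unless something calls the unmap function, which is the situation described for leftover DMAs from a crashed kernel that was using swiotlb.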
What considerations have I missed?
Bob Montgomery
(vacation last week)
* Re: Trying to test my gart/iommu vmcore problem on RH
2008-09-04 23:28 ` Bob Montgomery
@ 2008-09-05 1:46 ` Eric W. Biederman
2008-09-05 15:12 ` Vivek Goyal
1 sibling, 0 replies; 14+ messages in thread
From: Eric W. Biederman @ 2008-09-05 1:46 UTC (permalink / raw)
To: bob.montgomery
Cc: Heber, Troy, Loftin, Terry, Kexec Mailing List, Eric W. Biederman,
Vivek Goyal
Bob Montgomery <bob.montgomery@hp.com> writes:
> On Mon, 2008-08-25 at 13:46 +0000, Eric W. Biederman wrote:
>> Vivek Goyal <vgoyal@redhat.com> writes:
>>
>> > On Fri, Aug 22, 2008 at 04:48:10PM -0700, Eric W. Biederman wrote:
>> >>
>> >> Hmm. Thinking about this we actually have 2 problems.
>> >> - Communication about what is going on.
>> >> - How to handle an iommu in the event of a crash dump scenario.
>> >>
>> >> The current solution is to ignore the iommu, and use swiotlb. This
>> >> solution does not look like it will work for future iommus.
> Howdy all,
>
> There are several aspects to this problem that make solutions come in
> and out of contention:
>
> 1. Kexec vs Kdump
>
> Kexec: If we are kexec'ing normally, we assume that the shutdown has
> successfully stopped DMAs prior to starting our new kernel, and if not,
> it's a bug in the previous kernel's driver shutdown. So no issue here,
> right?
Correct.
> Kdump: The driver shutdown has been skipped as we go down during a
> crash, so assume that leftover DMA operations might be in progress as
> the kdump kernel comes up. BUT! They will be in progress to some area
> of memory other than the memory being used by the kdump kernel (it has
> its own crashkernel sandbox). And on my 2.6.18-based system, with an
> AMD64 NB GART-acting-as-IOMMU, the kdump kernel *does not* try to
> initialize or use an IOMMU when it comes up because its memory size is
> too small to need one (no one is setting crashkernel=4G@4G). So the
> kdump kernel can successfully ignore the old IOs using the old GART
> aperture IOMMU.
>
> EXCEPT(!) for the fact that we've left CPU-side translations turned on in
> the GART NB hardware and the kdump kernel will currently read through
> that zone using /proc/vmcore or /dev/oldmem. That's why I like fixing
> my stone-age problem by turning off CPU-side access.
>
> Note that real (future?) IOMMUs don't even have the concept of
> translating accesses from the CPU side. They only work on IO requests.
> So reading old memory areas from the crashed kernel shouldn't cause an
> IOMMU to "do" anything.
Good point. I don't think linux uses the translations either.
The downside of this is that it increases our dependence on the
kernel that crashed not having done something bad. So at least
long term it would be good to have code that can share iommus and do
the right thing with them.
> 2. GART vs Calgary vs "new AMD IOMMU" vs "new Intel IOMMU"
>
> The GART-as-IOMMU thing is not a "real" IOMMU. It doesn't offer much
> of the interesting protection of a real IOMMU, just "valid", "coherent"
> and a translation address. An IO card is still free to screw up and hit
> other addresses outside the aperture if it wants to, or hit other pages
> in the aperture that really belong to some other driver, or to write to
> a page that it should only read, etc. Consequently, there isn't much
> desire to utilize the GART thing unless I really need 32-bit IO card
> access to 40-bit address space. Since I don't need that in the kdump
> kernel (currently), there's no reason to try to use the GART there, so
> it's safe to ignore it, as long as I don't provoke it :-)
>
> BUT, if I had a real IOMMU that provided cool protection stuff and
> domain stuff, and not just address range expansion for old IO cards,
> then I might want to (or be forced to) use it all the time, independent
> of memory size, and then the kdump kernel might really need to deal with
> sharing it in some way with old leftover DMAs from the crashed kernel
> that we're dumping. And this, I think, is the only real issue looming.
Yes. How do we properly share an iommu is the looming issue.
> But this should only be a kdump issue, and not a kexec issue (see #1
> above), because the previous kernel should have shut all that down
> before it kexec'd, right?
Correct.
> 3. IOMMU vs swiotlb
>
> Isn't swiotlb just a way of hiding bounce buffer copies and management
> inside of the dma_map_single and dma_unmap_single calls? If so, it's
> just software(TM) and it just uses addresses in the kdump kernel
> sandbox, which (by definition) are not addresses that could have been
> used in the old kernel that crashed. There shouldn't be any conflict
> between kdump kernel and old crashed kernel if one or both are using
> swiotlb. Once again, in *my* current situation, there's no reason to
> use swiotlb in the kdump kernel because my memory range is restricted to
> my crashkernel sandbox and I don't need any IOMMU address translation
> capability.
> If the original kernel had been using swiotlb, then there's really no
> issue, because any leftover DMAs are just writing to the old bounce
> buffers anyway, and there's no driver left waiting to call
> dma_unmap_single to copy the result into the real buffer.
>
> What considerations have I missed?
>
> Bob Montgomery
> (vacation last week)
>
If the BIOS provides us with an aperture in pci mmio space for the AMD
GART there is no memory loss, and the issue that you are seeing can
not occur.
This has a couple of interesting implications.
1) If you disable translation of cpu side accesses then you can continue
to use the memory at the bus addresses used for the GART.
2) If you can continue to use the memory you can make the GART aperture
its maximum size (2G, I think). This begins to provide protection from
errant DMA addresses sent by devices.
3) If we enable access to the memory where the GART lives we have a
situation where bus addresses and memory addresses are not always
in the same domain. Giving us true iommu fun.
So specific recommendations.
1) Since cpu side accesses to the GART appear to be just silly, let's
disable them.
2) We need to figure out how to communicate the disjoint address
spaces that come with iommus in /sbin/kexec, the user space code
that sets this up.
3) If anyone has the oomph let's put the AMD K8 GART into large
window mode and see if we can sort through all of the iommu issues
on a platform that a lot of people have to work with.
Eric
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Trying to test my gart/iommu vmcore problem on RH
2008-09-04 23:28 ` Bob Montgomery
2008-09-05 1:46 ` Eric W. Biederman
@ 2008-09-05 15:12 ` Vivek Goyal
2008-09-09 21:12 ` Bob Montgomery
1 sibling, 1 reply; 14+ messages in thread
From: Vivek Goyal @ 2008-09-05 15:12 UTC (permalink / raw)
To: Bob Montgomery
Cc: Heber, Troy, Loftin, Terry, Kexec Mailing List, Eric W. Biederman
On Thu, Sep 04, 2008 at 05:28:48PM -0600, Bob Montgomery wrote:
> On Mon, 2008-08-25 at 13:46 +0000, Eric W. Biederman wrote:
> > Vivek Goyal <vgoyal@redhat.com> writes:
> >
> > > On Fri, Aug 22, 2008 at 04:48:10PM -0700, Eric W. Biederman wrote:
> > >>
> > >> Hmm. Thinking about this we actually have 2 problems.
> > >> - Communication about what is going on.
> > >> - How to handle an iommu in the event of a crash dump scenario.
> > >>
> > >> The current solution is to ignore the iommu, and use swiotlb. This
> > >> solution does not look like it will work for future iommus.
> Howdy all,
>
> There are several aspects to this problem that make solutions come in
> and out of contention:
>
> 1. Kexec vs Kdump
>
> Kexec: If we are kexec'ing normally, we assume that the shutdown has
> successfully stopped DMAs prior to starting our new kernel, and if not,
> it's a bug in the previous kernel's driver shutdown. So no issue here,
> right?
>
> Kdump: The driver shutdown has been skipped as we go down during a
> crash, so assume that leftover DMA operations might be in progress as
> the kdump kernel comes up. BUT! They will be in progress to some area
> of memory other than the memory being used by the kdump kernel (it has
> its own crashkernel sandbox). And on my 2.6.18-based system, with an
> AMD64 NB GART-acting-as-IOMMU, the kdump kernel *does not* try to
> initialize or use an IOMMU when it comes up because its memory size is
> too small to need one (no one is setting crashkernel=4G@4G). So the
> kdump kernel can successfully ignore the old IOs using the old GART
> aperture IOMMU.
>
> EXCEPT(!) for the fact that we've left CPU-side translations turned on in
> the GART NB hardware and the kdump kernel will currently read through
> that zone using /proc/vmcore or /dev/oldmem. That's why I like fixing
> my stone-age problem by turning off CPU-side access.
>
> Note that real (future?) IOMMUs don't even have the concept of
> translating accesses from the CPU side. They only work on IO requests.
> So reading old memory areas from the crashed kernel shouldn't cause an
> IOMMU to "do" anything.
>
>
> 2. GART vs Calgary vs "new AMD IOMMU" vs "new Intel IOMMU"
>
> The GART-as-IOMMU thing is not a "real" IOMMU. It doesn't offer much
> of the interesting protection of a real IOMMU, just "valid", "coherent"
> and a translation address. An IO card is still free to screw up and hit
> other addresses outside the aperture if it wants to, or hit other pages
> in the aperture that really belong to some other driver, or to write to
> a page that it should only read, etc. Consequently, there isn't much
> desire to utilize the GART thing unless I really need 32-bit IO card
> access to 40-bit address space. Since I don't need that in the kdump
> kernel (currently), there's no reason to try to use the GART there, so
> it's safe to ignore it, as long as I don't provoke it :-)
>
> BUT, if I had a real IOMMU that provided cool protection stuff and
> domain stuff, and not just address range expansion for old IO cards,
> then I might want to (or be forced to) use it all the time, independent
> of memory size, and then the kdump kernel might really need to deal with
> sharing it in some way with old leftover DMAs from the crashed kernel
> that we're dumping. And this, I think, is the only real issue looming.
>
> But this should only be a kdump issue, and not a kexec issue (see #1
> above), because the previous kernel should have shut all that down
> before it kexec'd, right?
>
>
> 3. IOMMU vs swiotlb
>
> Isn't swiotlb just a way of hiding bounce buffer copies and management
> inside of the dma_map_single and dma_unmap_single calls? If so, it's
> just software(TM) and it just uses addresses in the kdump kernel
> sandbox, which (by definition) are not addresses that could have been
> used in the old kernel that crashed. There shouldn't be any conflict
> between kdump kernel and old crashed kernel if one or both are using
> swiotlb. Once again, in *my* current situation, there's no reason to
> use swiotlb in the kdump kernel because my memory range is restricted to
> my crashkernel sandbox and I don't need any IOMMU address translation
> capability.
>
> If the original kernel had been using swiotlb, then there's really no
> issue, because any leftover DMAs are just writing to the old bounce
> buffers anyway, and there's no driver left waiting to call
> dma_unmap_single to copy the result into the real buffer.
>
> What considerations have I missed?
>
Nice summary Bob. Few thoughts.
- So until and unless one is reserving memory for crashkernel above 4G,
there is no need for initializing the IOMMU in second kernel (At this
moment I am not too worried about need of isolation in second kernel). If
that's the case, we shouldn't have initialized the calgary iommu in second
kernel and just should have left it alone and things probably would have
been fine?
The only issue is how to make sure that the first kernel has not
set up an IOMMU entry with a bus address that falls in the crash kernel
reserved area. I am not very familiar with the dma/iommu code and how
bus addresses are selected. Because if there is bus address overlap in
first kernel and second kernel, IOMMU will trap the second kernel's DMA
attempts and redirect it somewhere else. If we don't run into this issue
then it is fine otherwise we will be forced to use IOMMU in second kernel
and try to find free bus addresses/entries so that we don't conflict with
the first kernel's settings.
- Disabling cpu side access seems to make sense. We can give it a try
and hope we don't run into other hidden issues.
Thanks
Vivek
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Trying to test my gart/iommu vmcore problem on RH
2008-09-05 15:12 ` Vivek Goyal
@ 2008-09-09 21:12 ` Bob Montgomery
2008-09-22 23:31 ` Bob Montgomery
0 siblings, 1 reply; 14+ messages in thread
From: Bob Montgomery @ 2008-09-09 21:12 UTC (permalink / raw)
To: Vivek Goyal
Cc: Heber, Troy, Loftin, Terry, Kexec Mailing List, Eric W. Biederman
On Fri, 2008-09-05 at 15:12 +0000, Vivek Goyal wrote:
> Nice summary Bob. Few thoughts.
>
> - So until and unless one is reserving memory for crashkernel above 4G,
> there is no need for initializing the IOMMU in second kernel (At this
> moment I am not too worried about need of isolation in second kernel). If
> that's the case, we shouldn't have initialized the calgary iommu in second
> kernel and just should have left it alone and things probably would have
> been fine?
Did you ever try booting the kdump kernel with iommu=off? That should
have prevented the detection and initialization of the calgary iommu
from the kdump kernel, which (I ass-u-me) is what you had problems with,
if leftover IOs from the crashed kernel were still in progress when the
kdump kernel initialized it?
As for 4G, the info in 2.6.26 Documentation/x86_64/boot-options.txt
mentions 3G as the decision for using an iommu in several places. The
test in our kernel for using the GART IOMMU is (end_pfn > MAX_DMA32_PFN)
where MAX_DMA32_PFN is:
#define MAX_DMA32_PFN ((4UL*1024*1024*1024) >> PAGE_SHIFT)
>
> The only issue is how to make sure that the first kernel has not
> set up an IOMMU entry with a bus address that falls in the crash kernel
> reserved area. I am not very familiar with the dma/iommu code and how
> bus addresses are selected. Because if there is bus address overlap in
> first kernel and second kernel, IOMMU will trap the second kernel's DMA
> attempts and redirect it somewhere else. If we don't run into this issue
> then it is fine otherwise we will be forced to use IOMMU in second kernel
> and try to find free bus addresses/entries so that we don't conflict with
> the first kernel's settings.
I finally thought about this long enough to figure out what you meant, I
think :-). If the existence of a real IOMMU in the first kernel
allows the use of completely virtualized addresses on the IO side, then
there's no reason that they would have to avoid real RAM addresses when
handing out addresses to IO cards with that IOMMU.
In particular, our little joke about allocating the crash kernel under
the GART aperture (with CPU-side access turned off) would prevent the
kdump kernel from doing non-iommu IO to the crash kernel address range,
because the still-active GART from the old kernel would grab any of
those addresses coming in from the IO side.
The kdump kernel wouldn't be in danger of being overwritten, it just
might not be able to set up any IOs that work to its own address space
if an IOMMU is out there waiting to grab them.
For the calgary case, we'd maybe have to add the Crash Kernel range to
the list of things sent to iommu_range_reserve in
calgary_reserve_regions, to prevent those addresses from ever being
given out.
Bob M.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Trying to test my gart/iommu vmcore problem on RH
2008-09-09 21:12 ` Bob Montgomery
@ 2008-09-22 23:31 ` Bob Montgomery
2008-09-23 2:29 ` Eric W. Biederman
0 siblings, 1 reply; 14+ messages in thread
From: Bob Montgomery @ 2008-09-22 23:31 UTC (permalink / raw)
To: Vivek Goyal
Cc: Heber, Troy, Loftin, Terry, Kexec Mailing List, Eric W. Biederman
On Tue, 2008-09-09 at 21:12 +0000, I wrote:
> The kdump kernel wouldn't be in danger of being overwritten, it just
> might not be able to set up any IOs that work to its own address space
> if an IOMMU is out there waiting to grab them.
>
> For the calgary case, we'd maybe have to add the Crash Kernel range to
> the list of things sent to iommu_range_reserve in
> calgary_reserve_regions, to prevent those addresses from ever being
> given out.
>
> Bob M.
While this, or something like it, is necessary, it isn't sufficient.
I think what we would really need to do is to have the primary kernel
set up an identity mapping for all pages in the Crash kernel range,
or the subset of pages that could be IO targets when the kdump kernel
is running. This would allow a still-running IOMMU from the primary
kernel to translate kdump IOs to kdump addresses transparently.
And that leads to the Kdump IO Rule:
The primary kernel is responsible for setting up any necessary
conditions to allow the kdump kernel to perform its required
IO without detecting any iommu.
The kdump kernel must refrain from detecting and initializing
any iommu.
This has these effects:
A) Primary kernel: depending on what it is using as an IOMMU,
it may have to do some (or considerable) setup, to guarantee
that the kdump kernel can have IO capability to its Crash
kernel address range.
B) Primary kernel: the Crash kernel range must be set up in an address
range whose physical addresses are accessible to IO cards
without address remapping.
C) Kdump kernel: the kdump kernel must ignore any IOMMU hardware that
might be "detectable".
The setup responsibilities for the primary kernel depend on what it is
currently using for dma mapping:
1) no iommu (nommu_map_single): No setup is required for kdump.
Leftover IOs will go to IO buffers allocated by the primary
kernel outside of the Crash kernel area. Kdump IOs will
go to IO buffers allocated by the kdump kernel in the Crash
kernel area.
2) swiotlb (swiotlb_map_single_phys): No setup is required for kdump.
Leftover IOs will go to the primary kernel bounce buffers
outside of the Crash kernel area. Kdump IOs will go to IO
buffers allocated by the kdump kernel in the Crash kernel area.
3) GART (gart_map_single): No setup is required for kdump. Leftover
IOs will be mapped through the GART aperture to IO buffers
allocated by the primary kernel outside of the Crash Kernel
area. Kdump IOs will go to IO buffers allocated by the kdump
kernel in the Crash kernel area.
4) Calgary IOMMU (calgary_map_single): The Crash kernel memory range
must be pre-allocated for IO and identity-mapped, so any IO
operation to an address in the Crash kernel range is allowed
to complete to that same address. To preallocate for a
128MB Crash kernel area, 32K entries (256 Kbytes) are used
from the Calgary table. For a 4GB system, the default size
of the table is 1024K entries (8 Mbytes).
Leftover IOs will go to IO buffers allocated by the primary
kernel and remapped by the Calgary IOMMU. Neither the IO-side
address (iova) nor the physical address of a leftover IO will
be in the Crash kernel area. Kdump IOs will go to IO buffers
allocated by the kdump kernel, remapped by the Calgary IOMMU
to those same addresses (iova equals physical address within
the Crash kernel area).
5) Intel IOMMU (intel_map_single): The Crash kernel memory range must
be pre-allocated and identity-mapped for each hw device that
is needed by the kdump kernel, so any IO operation to an
address in the Crash kernel range is allowed to complete to
that same address.
Leftover IOs will go to IO buffers allocated by the primary
kernel and remapped by the Intel IOMMU. Neither the IO-side
address (iova) nor the physical address of a leftover IO
will be in the Crash kernel area. Kdump IOs will go to IO
buffers allocated by the kdump kernel, remapped by the Intel
IOMMU to those same addresses (iova equals physical address
within the Crash kernel area).
This all assumes no virtual machine stuff yet.
Possible? Comments? Corrections?
Bob Montgomery
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Trying to test my gart/iommu vmcore problem on RH
2008-09-22 23:31 ` Bob Montgomery
@ 2008-09-23 2:29 ` Eric W. Biederman
2008-09-23 19:12 ` Bob Montgomery
0 siblings, 1 reply; 14+ messages in thread
From: Eric W. Biederman @ 2008-09-23 2:29 UTC (permalink / raw)
To: bob.montgomery
Cc: Heber, Troy, Loftin, Terry, Kexec Mailing List, Vivek Goyal
Bob Montgomery <bob.montgomery@hp.com> writes:
> On Tue, 2008-09-09 at 21:12 +0000, I wrote:
>
>> The kdump kernel wouldn't be in danger of being overwritten, it just
>> might not be able to set up any IOs that work to its own address space
>> if an IOMMU is out there waiting to grab them.
>>
>> For the calgary case, we'd maybe have to add the Crash Kernel range to
>> the list of things sent to iommu_range_reserve in
>> calgary_reserve_regions, to prevent those addresses from ever being
>> given out.
>>
>> Bob M.
>
> While this, or something like it, is necessary, it isn't sufficient.
> I think what we would really need to do is to have the primary kernel
> set up an identity mapping for all pages in the Crash kernel range,
> or the subset of pages that could be IO targets when the kdump kernel
> is running. This would allow a still-running IOMMU from the primary
> kernel to translate kdump IOs to kdump addresses transparently.
> And that leads to the Kdump IO Rule:
>
> The primary kernel is responsible for setting up any necessary
> conditions to allow the kdump kernel to perform its required
> IO without detecting any iommu.
Reserving a range of addresses in the iommu I agree with.
If that range of addresses allows for identity mapping I
like it better.
I'm not certain about requiring it.
I don't like setting up the identity mapping beforehand,
it allows devices to trash the kdump kernel by accident.
> The kdump kernel must refrain from detecting and initializing
> any iommu.
Why? I can fully understand avoiding addresses that are in flight.
I can definitely see this being simpler in the kdump kernel.
However this feels like it makes a less robust kdump kernel by
not allowing it to touch the iommu.
> This has these effects:
>
> A) Primary kernel: depending on what it is using as an IOMMU,
> it may have to do some (or considerable) setup, to guarantee
> that the kdump kernel can have IO capability to its Crash
> kernel address range.
>
> B) Primary kernel: the Crash kernel range must be set up in an address
> range whose physical addresses are accessible to IO cards
> without address remapping.
Below <= 16MB? That doesn't work in general.
Especially not if we are running on an SGI box and someone had
unplugged node 0 (with all of the memory below 4G).
> C) Kdump kernel: the kdump kernel must ignore any IOMMU hardware that
> might be "detectable".
> The setup responsibilities for the primary kernel depend on what it is
> currently using for dma mapping:
>
> 1) no iommu (nommu_map_single): No setup is required for kdump.
> Leftover IOs will go to IO buffers allocated by the primary
> kernel outside of the Crash kernel area. Kdump IOs will
> go to IO buffers allocated by the kdump kernel in the Crash
> kernel area.
>
> 2) swiotlb (swiotlb_map_single_phys): No setup is required for kdump.
> Leftover IOs will go to the primary kernel bounce buffers
> outside of the Crash kernel area. Kdump IOs will go to IO
> buffers allocated by the kdump kernel in the Crash kernel area.
>
> 3) GART (gart_map_single): No setup is required for kdump. Leftover
> IOs will be mapped through the GART aperture to IO buffers
> allocated by the primary kernel outside of the Crash Kernel
> area. Kdump IOs will go to IO buffers allocated by the kdump
> kernel in the Crash kernel area.
>
> 4) Calgary IOMMU (calgary_map_single): The Crash kernel memory range
> must be pre-allocated for IO and identity-mapped, so any IO
> operation to an address in the Crash kernel range is allowed
> to complete to that same address. To preallocate for a
> 128MB Crash kernel area, 32K entries (256 Kbytes) are used
> from the Calgary table. For a 4GB system, the default size
> of the table is 1024K entries (8 Mbytes).
>
> Leftover IOs will go to IO buffers allocated by the primary
> kernel and remapped by the Calgary IOMMU. Neither the IO-side
> address (iova) nor the physical address of a leftover IO will
> be in the Crash kernel area. Kdump IOs will go to IO buffers
> allocated by the kdump kernel, remapped by the Calgary IOMMU
> to those same addresses (iova equals physical address within
> the Crash kernel area).
>
> 5) Intel IOMMU (intel_map_single): The Crash kernel memory range must
> be pre-allocated and identity-mapped for each hw device that
> is needed by the kdump kernel, so any IO operation to an
> address in the Crash kernel range is allowed to complete to
> that same address.
>
> Leftover IOs will go to IO buffers allocated by the primary
> kernel and remapped by the Intel IOMMU. Neither the IO-side
> address (iova) nor the physical address of a leftover IO
> will be in the Crash kernel area. Kdump IOs will go to IO
> buffers allocated by the kdump kernel, remapped by the Intel
> IOMMU to those same addresses (iova equals physical address
> within the Crash kernel area).
>
>
> This all assumes no virtual machine stuff yet.
>
> Possible? Comments? Corrections?
Possible.
I would very much like the option of doing the iommu setup, and possibly
fiddling in the kdump kernel. As long as we are not reusing the same
addresses in the iommu I don't see a problem.
I like the theoretical option of disabling ongoing DMA's, with the
more complete IOMMUs. It isn't strictly necessary but I expect it
would give a better result.
Eric
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Trying to test my gart/iommu vmcore problem on RH
2008-09-23 2:29 ` Eric W. Biederman
@ 2008-09-23 19:12 ` Bob Montgomery
0 siblings, 0 replies; 14+ messages in thread
From: Bob Montgomery @ 2008-09-23 19:12 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Heber, Troy, Loftin, Terry, Kexec Mailing List, Vivek Goyal
On Tue, 2008-09-23 at 02:29 +0000, Eric W. Biederman wrote:
> Bob Montgomery <bob.montgomery@hp.com> writes:
> > And that leads to the Kdump IO Rule:
> >
> > The primary kernel is responsible for setting up any necessary
> > conditions to allow the kdump kernel to perform its required
> > IO without detecting any iommu.
>
> Reserving a range of addresses in the iommu I agree with.
> If that range of addresses allows for identity mapping I
> like it better.
>
> I'm not certain about requiring it.
>
> I don't like setting up the identity mapping beforehand,
> it allows devices to trash the kdump kernel by accident.
The reason for having the primary kernel set up any mapping needed by a
kdump kernel *in advance* is that for a HW IOMMU, this setup actually
consists of modifying data structures (arrays, trees, lists) that are in
the primary kernel's memory, as well as setting registers in the HW.
When the kdump kernel comes up, none of those structures are in its
memory range. They're just part of the artifacts left in
/dev/oldmem. So yes, the kdump kernel could query any hardware that it
found, verify that the hardware had previously been in use, read HW
registers to get the root pointers, or list addresses or whatever, and
then modify arrays, trees, or lists in that non-owned memory to map its
DMA, but it's kind of an unprecedented step for the kdump kernel to
take. (Blindly copying oldmem pages is one thing, manipulating live
data structures in oldmem seems like quite another thing.)
Regarding the danger of trashing the kdump kernel prior to its launch:
Currently, any driver or errant kernel code can trash the kdump area.
And any IO card on a non-IOMMU or swiotlb system can trash it. So it
doesn't seem like much of an extension of a risk that already exists.
It does however negate one possibility to lower some of that risk.
> > The kdump kernel must refrain from detecting and initializing
> > any iommu.
>
> Why? I can fully understand avoiding addresses that are in flight.
> I can definitely see this being simpler in the kdump kernel.
> However this feels like it makes a less robust kdump kernel by
> not allowing it to touch the iommu.
As pointed out above, "touching the iommu" really includes touching its
data structures created by the primary kernel in what is now the oldmem
area. In addition, I'm not sure the kdump kernel can determine which
addresses are in flight by querying either the HW or the oldmem
structures. It could probably determine which ones were unused at the
time of the crash.
> > This has these effects:
> >
> > A) Primary kernel: depending on what it is using as an IOMMU,
> > it may have to do some (or considerable) setup, to guarantee
> > that the kdump kernel can have IO capability to its Crash
> > kernel address range.
> >
> > B) Primary kernel: the Crash kernel range must be set up in an address
> > range whose physical addresses are accessible to IO cards
> > without address remapping.
>
> Below <= 16MB? That doesn't work in general.
I didn't think this was working now. Aren't most crash kernels
allocated above 16 MB? And I assumed lots of systems don't have IOMMU
capability. Do you have an example where this would be an issue for an
IO card needed by the kdump kernel?
> Especially not if we are running on an SGI box and someone had
> unplugged node 0 (with all of the memory below 4G).
How does an SGI box with an unplugged node 0 do kdump IO currently?
> > Possible? Comments? Corrections?
>
> Possible.
>
> I would very much like the option of doing the iommu setup, and possibly
> fiddling in the kdump kernel. As long as we are not reusing the same
> addresses in the iommu I don't see a problem.
The problem I see is the oldmem area. Now we could come up with a plan
to allow the primary kernel to do all of its iommu related allocations
in the Crash kernel area, effectively creating an area of memory that is
shared between the primary kernel and kdump. (This would be complicated
in cases where the iommu state is in a dynamic tree vs. a fixed size
array.) Then the kdump kernel would wake up and just take over
maintenance of the iommu. But even the much smaller proposal to
preallocate entries in the iommu data structures to allow the kdump
kernel to do its IO is already violating one of the principles of kdump.
It is making kdump operation dependent on the integrity of a primary
kernel data structure. Actually taking over a shared iommu data
structure from the primary kernel seems like an even bigger
philosophical violation.
> I like the theoretical option of disabling ongoing DMA's, with the
> more complete IOMMUs. It isn't strictly necessary but I expect it
> would give a better result.
It seems like this implies 1) stopping the DMA at the IOMMU, 2)
surviving the resulting error condition when the IO card fails its next
access (hopefully not MCE on a modern IOMMU), 3) verifying that the IO
card won't try another access later after you've started using the IOMMU
in the kdump kernel, and then 4) reinitializing and using the IOMMU. Is
it doable?
Thanks for considering,
Bob Montgomery
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2008-09-23 19:12 UTC | newest]
Thread overview: 14+ messages
[not found] <1218138156.3361.386.camel@amd.troyhebe>
[not found] ` <20080808014024.GB3911@redhat.com>
[not found] ` <1218750773.3361.425.camel@amd.troyhebe>
[not found] ` <20080815131359.GA10208@redhat.com>
[not found] ` <1219081942.3361.436.camel@amd.troyhebe>
2008-08-19 13:47 ` Trying to test my gart/iommu vmcore problem on RH Vivek Goyal
2008-08-21 4:50 ` Eric W. Biederman
2008-08-22 22:05 ` Bob Montgomery
2008-08-22 23:48 ` Eric W. Biederman
2008-08-25 13:16 ` Vivek Goyal
2008-08-25 13:46 ` Eric W. Biederman
2008-09-04 23:28 ` Bob Montgomery
2008-09-05 1:46 ` Eric W. Biederman
2008-09-05 15:12 ` Vivek Goyal
2008-09-09 21:12 ` Bob Montgomery
2008-09-22 23:31 ` Bob Montgomery
2008-09-23 2:29 ` Eric W. Biederman
2008-09-23 19:12 ` Bob Montgomery
2008-08-25 13:02 ` Vivek Goyal