* Re: PCI memory allocation bug with CONFIG_HIGHMEM [not found] <1aJdi-7TH-25@gated-at.bofh.it> @ 2004-01-06 3:32 ` Andi Kleen 2004-01-06 3:40 ` Linus Torvalds 0 siblings, 1 reply; 41+ messages in thread From: Andi Kleen @ 2004-01-06 3:32 UTC (permalink / raw) To: David Hinds; +Cc: linux-kernel, torvalds David Hinds <dhinds@sonic.net> writes: > In arch/i386/kernel/setup.c we have: > > /* Tell the PCI layer not to allocate too close to the RAM area.. */ > low_mem_size = ((max_low_pfn << PAGE_SHIFT) + 0xfffff) & ~0xfffff; > if (low_mem_size > pci_mem_start) > pci_mem_start = low_mem_size; > > which is meant to round up pci_mem_start to the nearest 1 MB boundary > past the top of physical RAM. However this does not consider highmem. > Should this just be using max_pfn rather than max_low_pfn? max_pfn would get memory >4GB on highmem systems, which generally doesn't work because many PCI devices only support 32bit addresses. IMHO the only reliable way to get physical bus space for mappings is to allocate some memory and map the mapping over that. On x86-64 the allocation must be GFP_DMA, on i386 it can be GFP_KERNEL. The problem is that BIOS commonly use physical address space without marking it in the e820 map. For example the AGP aperture is normally not marked in any way in the e820 map, but you definitely cannot reuse its bus space. The old code assumed that there is a memory hole below the highest memory address <4GB, but that can be not true on a system with >3GB. We unfortunately must assume on such systems that all holes in e820 space are already used by something. On a system with <3GB you are usually lucky because there is some space left, but even that can break and e.g. conflict with reserved ACPI mappings. In theory you could have a heuristic with something like "if E820_RAM is <2GB just allocate it after the highest E820_RAM map not conflicting with other E820 mappings", but this would be quite hackish and may break on weird systems. 
BTW drivers/char/mem.c makes the same broken assumption. It really wants to default to uncached access for any holes, but default to cached for real memory. Doing that also requires reliable hole detection, which we don't have. One approach I haven't checked is that the ACPI memory map may have fixed the problem (no defined way to get a hole) As long as you only have e820 I think there is no real alternative to the "put io map over memory" technique. -Andi ^ permalink raw reply [flat|nested] 41+ messages in thread
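As an illustration of the rounding expression from setup.c quoted in this message, here is a minimal user-space sketch, not kernel code. The helper names round_up_to_mb and fits_in_32bit_bus are hypothetical, and PAGE_SHIFT is assumed to be 12 (4 KB pages). It shows the expression rounding a page frame number up to the next 1 MB boundary, and why a page frame number taken from above 4 GB produces an address that a 32-bit-only PCI device could not decode.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12           /* assumption: 4 KB pages */
#define MB_MASK    0xfffffULL   /* low 20 bits, i.e. 1 MB minus one */

/* Byte address of page frame `pfn`, rounded up to the next 1 MB boundary. */
uint64_t round_up_to_mb(uint64_t pfn)
{
    /* Guard the shift so the byte address cannot overflow 64 bits. */
    assert(pfn <= (UINT64_MAX >> PAGE_SHIFT));
    uint64_t bytes = pfn << PAGE_SHIFT;
    return (bytes + MB_MASK) & ~MB_MASK;
}

/* Nonzero if `addr` can be expressed as a 32-bit bus address. */
int fits_in_32bit_bus(uint64_t addr)
{
    return addr <= 0xffffffffULL;
}
```

With a low-memory limit at page frame 0x37fe0 the result is 0x38000000, which still fits in 32 bits. With a page frame at 6 GB (0x180000) the result is 0x180000000, which does not.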
* Re: PCI memory allocation bug with CONFIG_HIGHMEM 2004-01-06 3:32 ` PCI memory allocation bug with CONFIG_HIGHMEM Andi Kleen @ 2004-01-06 3:40 ` Linus Torvalds 2004-01-06 4:05 ` Andi Kleen 2004-01-06 22:56 ` Eric W. Biederman 0 siblings, 2 replies; 41+ messages in thread From: Linus Torvalds @ 2004-01-06 3:40 UTC (permalink / raw) To: Andi Kleen; +Cc: David Hinds, linux-kernel On Tue, 6 Jan 2004, Andi Kleen wrote: > > IMHO the only reliable way to get physical bus space for mappings > is to allocate some memory and map the mapping over that. You literally can't do that: the RAM addresses are decoded by the northbridge before they ever hit the PCI bus, so it's impossible to "map over" RAM in general. Normally, the way this works is that there are magic northbridge mapping registers that remap part of the memory, so that the memory that is physically in the upper 4GB of RAM shows up somewhere else (or just possibly disappears entirely - once you have more than 4GB of RAM, you might not care too much about a few tens of megs missing). Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: PCI memory allocation bug with CONFIG_HIGHMEM 2004-01-06 3:40 ` Linus Torvalds @ 2004-01-06 4:05 ` Andi Kleen 2004-01-06 5:04 ` Linus Torvalds 2004-01-06 22:56 ` Eric W. Biederman 1 sibling, 1 reply; 41+ messages in thread From: Andi Kleen @ 2004-01-06 4:05 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andi Kleen, David Hinds, linux-kernel On Mon, Jan 05, 2004 at 07:40:11PM -0800, Linus Torvalds wrote: > > > On Tue, 6 Jan 2004, Andi Kleen wrote: > > > > IMHO the only reliable way to get physical bus space for mappings > > is to allocate some memory and map the mapping over that. > > You literally can't do that: the RAM addresses are decoded by the > northbridge before they ever hit the PCI bus, so it's impossible to "map > over" RAM in general. Are you sure? I have a doc from AMD somewhere on the memory ordering on K8 and it gives this order: (highest to lowest) AGP aperture, TSEG, ASEG, IORR, Fixed MTRR, TOP_MEM Note that TOP_MEM comes last, IORR comes earlier. It would require setting an IORR though, which would be admittedly a bit nasty (there are not that many of them). As long as it is only a single area it should be possible though, we already have some code to change IORRs in the AGP driver. That would be admittedly AMD specific, but I suspect Intel has a similar mechanism. I have successfully mapped the AGP aperture over RAM and also seen it shadowing PCI mappings. I admit I haven't tried it with PCI mappings. But can you suggest a reliable way to find a memory hole in e820? I haven't one figured out and AFAIK there isn't even any guarantee by the BIOS that there is any. e.g. Opteron BIOS tend to use all the precious space < 4GB up for existing mappings and I would expect other i386 BIOS to behave the same. -Andi ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: PCI memory allocation bug with CONFIG_HIGHMEM 2004-01-06 4:05 ` Andi Kleen @ 2004-01-06 5:04 ` Linus Torvalds 2004-01-06 8:12 ` Andi Kleen 0 siblings, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2004-01-06 5:04 UTC (permalink / raw) To: Andi Kleen; +Cc: Andi Kleen, David Hinds, linux-kernel On Mon, 6 Jan 2004, Andi Kleen wrote: > > > > You literally can't do that: the RAM addresses are decoded by the > > northbridge before they ever hit the PCI bus, so it's impossible to "map > > over" RAM in general. > > Are you sure? I have a doc from AMD somewhere on the memory ordering > on K8 and it gives this order: (highest to lowest) > > AGP aperture, TSEG, ASEG, IORR, Fixed MTRR, TOP_MEM Those are all in the CPU or northbridge (well, on the Opteron, the northbridge is integrated so it all boils down to the CPU). So yes, I'm sure. You have to have northbridge-specific code to punch a "hole" in the RAM decoder, and some of them are "bios-locked", i.e. they have registers that become read-only after the first time they are written (or after a special lock-bit has been written). So in some cases you can't do it at all. > I have successfully mapped the AGP aperture > over RAM and also seen it shadowing PCI mappings. I admit I haven't tried > it with PCI mappings. The AGP aperture is generally done in the northbridge, so it all depends on what the decode priority is for the northbridge chip. That's implementation-dependent. > But can you suggest a reliable way to find a memory hole in e820? > I haven't one figured out and AFAIK there isn't even any guarantee > by the BIOS that there is any. e.g. Opteron BIOS tend to use all > the precious space < 4GB up for existing mappings and I would expect > other i386 BIOS to behave the same. If you have a proper e820 map, then it should work correctly, with anything that is RAM being marked as such (or being marked as "reserved"). 
The problems happen when you do _not_ have a proper e820 map, either due to bootloader bugs or BIOS problems, or because the user overrode the values with a "mem=xxxx" thing. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
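To make the "find a hole in e820" question concrete, here is a minimal user-space sketch of a gap search over a firmware memory map. It is only an illustration under stated assumptions: struct map_entry and largest_gap_below_4gb are hypothetical names, the map is assumed to be sorted by start address with no overlapping entries, and every described entry (RAM or reserved) is treated as occupied. As Andi points out in this thread, a gap found this way is only a candidate, since hardware such as the AGP aperture may sit in an undescribed hole.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one firmware memory map entry. */
struct map_entry {
    uint64_t start;   /* first byte of the described range */
    uint64_t size;    /* length of the range in bytes */
};

#define LIMIT_4GB 0x100000000ULL

/*
 * Return the size of the largest undescribed gap below 4 GB and store
 * its start address in *gap_start. Returns 0 if there is no gap.
 * The map must be sorted by start address and must not overlap.
 */
uint64_t largest_gap_below_4gb(const struct map_entry *map, size_t n,
                               uint64_t *gap_start)
{
    uint64_t best = 0;
    uint64_t cursor = 0;   /* end of the highest range seen so far */

    assert(gap_start != NULL);
    *gap_start = 0;

    for (size_t i = 0; i <= n; i++) {
        /* After the last entry, the 4 GB limit closes the final gap. */
        uint64_t next = (i < n) ? map[i].start : LIMIT_4GB;
        if (next > LIMIT_4GB)
            next = LIMIT_4GB;

        if (next > cursor && next - cursor > best) {
            best = next - cursor;
            *gap_start = cursor;
        }
        if (i < n) {
            uint64_t end = map[i].start + map[i].size;
            if (end > cursor)
                cursor = end;
        }
        if (cursor >= LIMIT_4GB)
            break;   /* everything from here on is above 4 GB */
    }
    return best;
}
```

On a map with 1 GB of RAM and a reserved block just below 4 GB, this reports the gap between the top of RAM and the reserved block. On a map that describes the whole range below 4 GB it reports no gap at all, which is the situation Andi describes for some Opteron BIOSes.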
* Re: PCI memory allocation bug with CONFIG_HIGHMEM 2004-01-06 5:04 ` Linus Torvalds @ 2004-01-06 8:12 ` Andi Kleen 2004-01-06 9:11 ` Mika Penttilä 0 siblings, 1 reply; 41+ messages in thread From: Andi Kleen @ 2004-01-06 8:12 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andi Kleen, David Hinds, linux-kernel > If you ahve a proper e820 map, then it should work correctly, with > anything that is RAM being marked as such (or being marked as "reserved"). Every e820 map i've seen did not have the AGP aperture marked reserved. It is just an undescribed hole. In fact when you mark the aperture in the e820 map the Linux AGP driver stops working, it relies on it being in an undescribed hole. This means you cannot just reuse holes. And there is no other way to get mapping space. -Andi ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: PCI memory allocation bug with CONFIG_HIGHMEM 2004-01-06 8:12 ` Andi Kleen @ 2004-01-06 9:11 ` Mika Penttilä 2004-01-06 9:44 ` Andi Kleen 0 siblings, 1 reply; 41+ messages in thread From: Mika Penttilä @ 2004-01-06 9:11 UTC (permalink / raw) To: Andi Kleen; +Cc: Linus Torvalds, Andi Kleen, David Hinds, linux-kernel Andi Kleen wrote: >>If you ahve a proper e820 map, then it should work correctly, with >>anything that is RAM being marked as such (or being marked as "reserved"). >> >> > >Every e820 map i've seen did not have the AGP aperture marked reserved. > Why should it? It's not ram, and the aperture is marked as reserved while doing PCI resource assignment/reservation. >It is just an undescribed hole. In fact when you mark the aperture in the >e820 map the Linux AGP driver stops working, it relies on it being >in an undescribed hole. > > > ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: PCI memory allocation bug with CONFIG_HIGHMEM 2004-01-06 9:11 ` Mika Penttilä @ 2004-01-06 9:44 ` Andi Kleen 2004-01-06 10:16 ` Mika Penttilä 2004-01-06 15:27 ` Linus Torvalds 0 siblings, 2 replies; 41+ messages in thread From: Andi Kleen @ 2004-01-06 9:44 UTC (permalink / raw) To: Mika Penttilä; +Cc: Linus Torvalds, Andi Kleen, David Hinds, linux-kernel On Tue, Jan 06, 2004 at 11:11:21AM +0200, Mika Penttilä wrote: > > > Andi Kleen wrote: > > >>If you have a proper e820 map, then it should work correctly, with > >>anything that is RAM being marked as such (or being marked as "reserved"). > >> > >> > > > >Every e820 map i've seen did not have the AGP aperture marked reserved. > > > Why should it? It's not ram, and the aperture is marked as reserved > while doing PCI resource assignment/reservation. It implies that you cannot just put your IO mappings into any holes, because something else like the aperture may already be there. In my opinion it would have been cleaner if the aperture always had a reserved entry in the e820 map. Or better, all usable holes would get a special entry. Then you could actually reliably allocate IO space on your own. Currently it's just impossible. -Andi ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: PCI memory allocation bug with CONFIG_HIGHMEM 2004-01-06 9:44 ` Andi Kleen @ 2004-01-06 10:16 ` Mika Penttilä 2004-01-06 10:49 ` Andi Kleen 2004-01-06 15:27 ` Linus Torvalds 1 sibling, 1 reply; 41+ messages in thread From: Mika Penttilä @ 2004-01-06 10:16 UTC (permalink / raw) To: Andi Kleen; +Cc: Linus Torvalds, Andi Kleen, David Hinds, linux-kernel Andi Kleen wrote: >On Tue, Jan 06, 2004 at 11:11:21AM +0200, Mika Penttil? wrote: > > >>Andi Kleen wrote: >> >> >> >>>>If you ahve a proper e820 map, then it should work correctly, with >>>>anything that is RAM being marked as such (or being marked as "reserved"). >>>> >>>> >>>> >>>> >>>Every e820 map i've seen did not have the AGP aperture marked reserved. >>> >>> >>> >>Why should it? It's not ram, and the aperture is marked as reserved >>while doing PCI resource assignment/reservation. >> >> > >It implies that you cannot just put your IO mappings >into any holes. Because something else like the aperture may >be already there. > But AGP aperture is controlled with the standard APBASE pci base register, so you always know where it is, can relocate it and reserve address space for it. Of course there may exist other uncontrollable hw, which may cause problems. > >In my opinion it would have been cleaner if the aperture had always >an reserved entry in the e820 map. Or better all usable holes get >an special entry. Then you could actually reliable allocate IO space > on your own. Currently it's just impossible. > >-Andi > > --Mika ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: PCI memory allocation bug with CONFIG_HIGHMEM 2004-01-06 10:16 ` Mika Penttilä @ 2004-01-06 10:49 ` Andi Kleen 0 siblings, 0 replies; 41+ messages in thread From: Andi Kleen @ 2004-01-06 10:49 UTC (permalink / raw) To: Mika Penttilä; +Cc: Linus Torvalds, Andi Kleen, David Hinds, linux-kernel On Tue, Jan 06, 2004 at 12:16:14PM +0200, Mika Penttilä wrote: > But AGP aperture is controlled with the standard APBASE pci base > register, so you always know where it is, can relocate it and reserve > address space for it. Of course there may exist other uncontrollable hw, > which may cause problems. Actually not. There are quite a lot of chipsets that require special programming for the AGP aperture (why do you think drivers/char/agp/*.c is so big?). And not even everything is AGPv2 compliant. And as Linus points out, you would likely need to do some northbridge-specific magic to make that area usable for PCI then. Also you would need to put it over RAM because again there is no reliable way to find a hole. -Andi ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: PCI memory allocation bug with CONFIG_HIGHMEM 2004-01-06 9:44 ` Andi Kleen 2004-01-06 10:16 ` Mika Penttilä @ 2004-01-06 15:27 ` Linus Torvalds 2004-01-06 15:37 ` Andi Kleen 1 sibling, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2004-01-06 15:27 UTC (permalink / raw) To: Andi Kleen; +Cc: Mika Penttilä, Andi Kleen, David Hinds, linux-kernel On Tue, 6 Jan 2004, Andi Kleen wrote: > > In my opinion it would have been cleaner if the aperture had always > an reserved entry in the e820 map. That does sound like a bug in the AGP drivers. It shouldn't be hard at all to make them reserve their aperture. Hint hint. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: PCI memory allocation bug with CONFIG_HIGHMEM 2004-01-06 15:27 ` Linus Torvalds @ 2004-01-06 15:37 ` Andi Kleen 2004-01-06 15:48 ` Linus Torvalds 2004-01-06 22:45 ` Eric W. Biederman 0 siblings, 2 replies; 41+ messages in thread From: Andi Kleen @ 2004-01-06 15:37 UTC (permalink / raw) To: Linus Torvalds; +Cc: Mika Penttilä, Andi Kleen, David Hinds, linux-kernel On Tue, Jan 06, 2004 at 07:27:33AM -0800, Linus Torvalds wrote: > > > On Tue, 6 Jan 2004, Andi Kleen wrote: > > > > In my opinion it would have been cleaner if the aperture had always > > an reserved entry in the e820 map. > > That does sound like a bug in the AGP drivers. It shouldn't be hard at all > to make them reserve their aperture. > > Hint hint. No, it's a bug in the BIOS that they're not marked. But I've actually seen a BIOS that marked it and it led to the Linux AGP driver failing (due to some interaction with how setup.c sets up resources). So the Linux driver currently even relies on the broken state. Anyways, I already implemented reservation for the aperture for the K8 driver some time ago. And it's in your tree. But it doesn't help for finding IO holes because there could be other unmarked hardware lurking there ... Or worse, there is just no free space below 4GB. -Andi ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: PCI memory allocation bug with CONFIG_HIGHMEM 2004-01-06 15:37 ` Andi Kleen @ 2004-01-06 15:48 ` Linus Torvalds 2004-01-06 22:29 ` Adam Belay 2004-01-06 22:45 ` Eric W. Biederman 1 sibling, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2004-01-06 15:48 UTC (permalink / raw) To: Andi Kleen; +Cc: Mika Penttil?, Andi Kleen, David Hinds, linux-kernel On Tue, 6 Jan 2004, Andi Kleen wrote: > > Anyways, I already implemented reservation for the aperture for the K8 > driver some time ago. And it's in your tree. But it doesn't help for > finding IO holes because there could be other unmarked hardware lurking > there ... Or worse there is just no free space below 4GB. The "unmarked hardware" is why we have PCI quirks. Look at drivers/pci/quirks.c, and notice how many of the quirks are all about quirk_io_region(). Exactly because there isn't any way for the BIOS to tell us about these things on the IO side. (Actually, there is: PnP-BIOS calls are supposed to give us that information. However, not only are the BIOSes buggy and don't give a complete list _anyway_, anybody who uses the PnP-BIOS is much more likely to just get a kernel oops when the BIOS is buggy and assumes that only Windows will call it. So I strongly suggest you not _ever_ use pnp unless you absolutely have to). The same quirks could be done on the MMIO side for northbridges. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: PCI memory allocation bug with CONFIG_HIGHMEM 2004-01-06 15:48 ` Linus Torvalds @ 2004-01-06 22:29 ` Adam Belay 2004-01-07 4:06 ` Linus Torvalds 2004-01-07 8:32 ` Helge Hafting 0 siblings, 2 replies; 41+ messages in thread From: Adam Belay @ 2004-01-06 22:29 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Mika Penttil?, Andi Kleen, David Hinds, linux-kernel, Andrew Morton, Grover, Andrew On Tue, Jan 06, 2004 at 07:48:37AM -0800, Linus Torvalds wrote: > > > On Tue, 6 Jan 2004, Andi Kleen wrote: > > > > Anyways, I already implemented reservation for the aperture for the K8 > > driver some time ago. And it's in your tree. But it doesn't help for > > finding IO holes because there could be other unmarked hardware lurking > > there ... Or worse there is just no free space below 4GB. > > The "unmarked hardware" is why we have PCI quirks. Look at > drivers/pci/quirks.c, and notice how many of the quirks are all about > quirk_io_region(). Exactly because there isn't any way for the BIOS to > tell us about these things on the IO side. > > (Actually, there is: PnP-BIOS calls are supposed to give us that > information. However, not only are the BIOSes buggy and don't give a > complete list _anyway_, anybody who uses the PnP-BIOS is much more likely > to just get a kernel oops when the BIOS is buggy and assumes that only > Windows will call it. So I strongly suggest you not _ever_ use pnp unless > you absolutely have to). For those with legacy systems, the isapnp protocol, a component of pnp, is unaffected by this problem. Most systems that support ISA addin cards, have correctly implemented PnPBIOSes. > > The same quirks could be done on the MMIO side for northbridges. > > Linus For the past few weeks I've been doing research on the PnPBIOS general protection faults, and I've come up with a few observations and a proposed solution. Any comments would be appreciated. 1.) There probably isn't anything wrong with the way we're calling the PnPBIOS. 
After searching through various mailing lists I discovered that several other open source operating systems, although having many variations on the PnPBIOS code, are having identical problems (including that the same type of calls trigger the faults). A while back I added a change that was similar to some of apm's buggy bios handling code. It appears to fix the problems with getting dynamic resource information on many buggy systems. I later decided (see pnp-fix-1 in -mm) to get static resource information (the resources set at boot time) because the specifications suggest using that call when enumerating devices for the first time. To my surprise, many have reported problems with the PnPBIOS driver found in -mm. In addition, there are some, but significantly fewer, BIOSes that are completely broken and don't work with either call type. The recent escd fix I have made corrects a thinko in the PnPBIOS code and it turns out that faults from calling /proc/pnp/bus/escd were probably not caused by BIOS bugs. I've attached this fix to the end of the email. This leaves only the get node calls. 2.) Windows works with buggy BIOSes because of the way it calls them. I looked into how Windows handles the PnPBIOS and may have discovered why it works on buggy BIOS. It turns out that exclusively realmode calls are used. See www.missl.cs.umd.edu/Projects/sebos/winint/index2.html#pnpbios. My knowledge is limited in this area of the x86 architecture but it is my impression that it would not be possible, or perhaps worth it, to implement realmode calls for the Linux PnPBIOS driver because of the time it is initialized. 3.) BIOS bugs appear to affect mostly laptops. The Oops seems to generally occur when getting information about the mouse controller. Because of touchpads and external mouses, the BIOS code may be a little different from desktop systems. Nonetheless my laptop, as well as all my other test systems, do not have any PnPBIOS problems. 4.) 
PnPBIOS support may not be fully implemented on a few rare systems with ACPI. The PnPBIOS standard has been obsoleted by ACPI. Only in systems made before ACPI, or systems with blacklisted ACPI support (there are many), is the PnPBIOS necessary. Unfortunately, resource management in the Linux ACPI driver isn't fully supported relative to resource management in the Linux PnPBIOS driver. It is conceivable that some PnPBIOSes only implement a minimal set of calls properly.

A proposed solution...

For 2.6...

1.) only get dynamic resource information
2.) blacklist any BIOSes that fail on dynamic resource calls. We might get lucky and there will be few enough that it is possible to create a blacklist. Also look into them to see if they work with static instead.
3.) attach a warning, by printk and/or kconfig help, to the /proc/bus/pnp interface as it is able to make any PnPBIOS call. (done in -mm)
4.) As a last resort disable PnPBIOS support if ACPI is successful. Although the two can currently coexist, this would prevent the buggy BIOSes found in more modern x86 systems from being used. Of course this would be useless if the user decides to not include the ACPI driver.
5.) Look into other ways of finding out if the PnPBIOS might be buggy, currently we only have DMI. Any others?

For the next development kernel...

I am working on a new resource management infrastructure, tied more closely to the driver model and sysfs, and some ACPI patches. They should make it easier for us to take advantage of ACPI resource management. Although one of my biggest focuses is ACPI, I'd like to maintain compatibility with older protocols such as PnPBIOS. Also it is a major goal to make it usable for all architectures (like the existing resource management code), but perhaps even for Open Firmware when it is further implemented. From there we can phase out PnPBIOS support where ACPI provides an alternative.
It's worth noting that PnPBIOS support is useful on the majority of systems that support it. In later kernels it can serve as an alternative when ACPI is buggy or unsupported.

Thanks,
Adam

--- a/drivers/pnp/pnpbios/bioscalls.c	2003-11-26 20:44:47.000000000 +0000
+++ b/drivers/pnp/pnpbios/bioscalls.c	2003-12-02 21:17:42.000000000 +0000
@@ -493,7 +493,7 @@
 	if (!pnp_bios_present())
 		return ESCD_FUNCTION_NOT_SUPPORTED;
 	status = call_pnp_bios(PNP_READ_ESCD, 0, PNP_TS1, PNP_TS2, PNP_DS, 0, 0, 0,
-			       data, 65536, (void *)nvram_base, 65536);
+			       data, 65536, __va((void *)nvram_base), 65536);
 	return status;
 }
 
@@ -516,7 +516,7 @@
 	if (!pnp_bios_present())
 		return ESCD_FUNCTION_NOT_SUPPORTED;
 	status = call_pnp_bios(PNP_WRITE_ESCD, 0, PNP_TS1, PNP_TS2, PNP_DS, 0, 0, 0,
-			       data, 65536, nvram_base, 65536);
+			       data, 65536, __va((void *)nvram_base), 65536);
 	return status;
 }
 #endif

^ permalink raw reply [flat|nested] 41+ messages in thread
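The patch in this message wraps nvram_base in __va(), which suggests that nvram_base holds a physical address that has to be converted to a kernel virtual address before it is handed to call_pnp_bios(). The following is a minimal user-space model of that conversion, not the kernel macro itself. It assumes the default i386 3G/1G split, where the kernel's direct mapping of low memory starts at 0xC0000000; sketch_va, sketch_pa and SKETCH_PAGE_OFFSET are hypothetical names chosen for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: default i386 3G/1G split, direct map starts at 3 GB. */
#define SKETCH_PAGE_OFFSET 0xC0000000u

/* Model of __va(): physical address to direct-mapped virtual address. */
uint32_t sketch_va(uint32_t phys)
{
    /* Addresses that would wrap past 4 GB are outside this model. */
    assert(phys <= UINT32_MAX - SKETCH_PAGE_OFFSET);
    return phys + SKETCH_PAGE_OFFSET;
}

/* Model of __pa(): the inverse conversion. */
uint32_t sketch_pa(uint32_t virt)
{
    assert(virt >= SKETCH_PAGE_OFFSET);
    return virt - SKETCH_PAGE_OFFSET;
}
```

Under this assumption a physical address of 0xF0000 corresponds to the virtual address 0xC00F0000. Passing the unconverted physical value where a virtual address is expected would point at an unrelated location.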
* Re: PCI memory allocation bug with CONFIG_HIGHMEM 2004-01-06 22:29 ` Adam Belay @ 2004-01-07 4:06 ` Linus Torvalds 2004-01-07 5:02 ` Andi Kleen 2004-01-07 8:32 ` Helge Hafting 1 sibling, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2004-01-07 4:06 UTC (permalink / raw) To: Adam Belay Cc: Andi Kleen, Mika Penttil?, Andi Kleen, David Hinds, linux-kernel, Andrew Morton, Grover, Andrew On Tue, 6 Jan 2004, Adam Belay wrote: > > 5.) Look into other ways of finding out if the PnPBIOS might be buggy, > currently we only have DMI. > > Any others? We could use the exception mechanism, and try to fix up any BIOS errors. That would require: - make the BIOS calls save all important registers just before entry (esp in particular, and the "after-call EIP") and set a flag saying "fix me up". Do this per-CPU. Clear the flags after exit. - add magic knowledge to "fixup_exception()" path that looks at the per-cpu fix-me-up flag, and if it is set, restore all the segments (which the BIOS may have crapped on), %esp and %eip to the magic fixup values. - test it with a bogus trap (on purpose) which has reset all the x86 registers, including an offset %esp. This could make us recover from some (most?) BIOS bugs and at least dynamically notice when the BIOS does bad bad things. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
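The recovery scheme Linus outlines (save state before the call, set a "fix me up" flag, and have the fault path jump back to the saved state) has a rough user-space analogy in POSIX sigsetjmp/siglongjmp. The sketch below is only that analogy, not kernel code and not how fixup_exception() works: the names guarded_call, fixup_state and fixup_armed are hypothetical, a signal handler stands in for the exception path, and the fault is simulated with raise(SIGSEGV) rather than a real bad access.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static sigjmp_buf fixup_state;              /* saved state, like the saved %esp/%eip */
static volatile sig_atomic_t fixup_armed;   /* the "fix me up" flag */

static void fault_handler(int sig)
{
    if (fixup_armed)
        siglongjmp(fixup_state, 1);   /* resume at the saved point */

    /* Fault outside a guarded call: restore default handling and re-raise. */
    signal(sig, SIG_DFL);
    raise(sig);
}

/*
 * Call fn() with the fixup armed.
 * Returns 0 if fn() returned normally, -1 if it faulted and we recovered.
 */
int guarded_call(void (*fn)(void))
{
    struct sigaction sa, old;
    int result;

    assert(fn != NULL);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = fault_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);

    /* Second argument 1: also save and restore the signal mask. */
    if (sigsetjmp(fixup_state, 1) == 0) {
        fixup_armed = 1;
        fn();
        result = 0;
    } else {
        result = -1;   /* arrived here via siglongjmp from the handler */
    }

    fixup_armed = 0;   /* clear the flag after exit, as in the proposal */
    sigaction(SIGSEGV, &old, NULL);
    return result;
}

/* Stand-ins for a well-behaved and a buggy firmware call. */
void well_behaved_call(void) { }
void buggy_call(void) { raise(SIGSEGV); }
```

A caller can then treat a -1 return as "the call misbehaved" and, for example, stop using that call, which matches the goal of noticing bad BIOS behaviour dynamically instead of oopsing.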
* Re: PCI memory allocation bug with CONFIG_HIGHMEM 2004-01-07 4:06 ` Linus Torvalds @ 2004-01-07 5:02 ` Andi Kleen 2004-01-07 5:55 ` Dave Jones 0 siblings, 1 reply; 41+ messages in thread From: Andi Kleen @ 2004-01-07 5:02 UTC (permalink / raw) To: Linus Torvalds Cc: Adam Belay, Mika Penttil?, Andi Kleen, David Hinds, linux-kernel, Andrew Morton, Grover, Andrew On Tue, Jan 06, 2004 at 08:06:42PM -0800, Linus Torvalds wrote: > > > On Tue, 6 Jan 2004, Adam Belay wrote: > > > > 5.) Look into other ways of finding out if the PnPBIOS might be buggy, > > currently we only have DMI. > > > > Any others? > > We could use the exception mechanism, and try to fix up any BIOS errors. > That would require: [...] It would not work for x86-64 unfortunately where you cannot do any BIOS calls after the system is running (it would only be possible early in boot) My hope was actually that there is some ACPI mechanism to do all this, but I haven't done much research in this area yet. -Andi ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: PCI memory allocation bug with CONFIG_HIGHMEM 2004-01-07 5:02 ` Andi Kleen @ 2004-01-07 5:55 ` Dave Jones 2004-01-07 6:06 ` Linus Torvalds 2004-01-07 6:51 ` Andi Kleen 0 siblings, 2 replies; 41+ messages in thread From: Dave Jones @ 2004-01-07 5:55 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Adam Belay, Mika Penttil?, Andi Kleen, David Hinds, linux-kernel, Andrew Morton, Grover, Andrew On Wed, Jan 07, 2004 at 06:02:56AM +0100, Andi Kleen wrote: > > > 5.) Look into other ways of finding out if the PnPBIOS might be buggy, > > > currently we only have DMI. > > > Any others? > > We could use the exception mechanism, and try to fix up any BIOS errors. > > That would require: > > [...] It would not work for x86-64 unfortunately where you cannot do > any BIOS calls after the system is running (it would only be possible > early in boot) Why on earth would you want to call PNPBIOS on AMD64 anyway ? Dave ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: PCI memory allocation bug with CONFIG_HIGHMEM 2004-01-07 5:55 ` Dave Jones @ 2004-01-07 6:06 ` Linus Torvalds 2004-01-07 6:08 ` Dave Jones 2004-01-07 6:51 ` Andi Kleen 1 sibling, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2004-01-07 6:06 UTC (permalink / raw) To: Dave Jones Cc: Andi Kleen, Adam Belay, Mika Penttil?, Andi Kleen, David Hinds, linux-kernel, Andrew Morton, Grover, Andrew On Wed, 7 Jan 2004, Dave Jones wrote: > > Why on earth would you want to call PNPBIOS on AMD64 anyway ? For the same reason normal PC's still like to: no technical reason, except for the fact that system vendors like to hide bugs and quirks by having magic stuff in ACPI or PnPBIOS to tell the OS "hands off" or "this is how to route this strange irq". It's like ACPI: it would be a whole lot better if the hardware was just standard and documented and didn't need any magic configuration tables and strange code snippets to do magic acts of perversion. But sadly, it ain't so, and PnP and ACPI are there as imperfect ways of doing what needs to be done. Of course, as with most system vendor crud, some BIOSes are more imperfect than others. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: PCI memory allocation bug with CONFIG_HIGHMEM 2004-01-07 6:06 ` Linus Torvalds @ 2004-01-07 6:08 ` Dave Jones 2004-01-07 6:45 ` Linus Torvalds 0 siblings, 1 reply; 41+ messages in thread From: Dave Jones @ 2004-01-07 6:08 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Adam Belay, Mika Penttil?, Andi Kleen, David Hinds, linux-kernel, Andrew Morton, Grover, Andrew On Tue, Jan 06, 2004 at 10:06:24PM -0800, Linus Torvalds wrote: > > Why on earth would you want to call PNPBIOS on AMD64 anyway ? > > For the same reason normal PC's still like to: no technical reason, except > for the fact that system vendors like to hide bugs and quirks by having > magic stuff in ACPI or PnPBIOS to tell the OS "hands off" or "this is how > to route this strange irq". But PNPBIOS is an ISA relic isn't it ? No amd64 system I know of even has an ISA bus. Dave ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: PCI memory allocation bug with CONFIG_HIGHMEM 2004-01-07 6:08 ` Dave Jones @ 2004-01-07 6:45 ` Linus Torvalds 0 siblings, 0 replies; 41+ messages in thread From: Linus Torvalds @ 2004-01-07 6:45 UTC (permalink / raw) To: Dave Jones Cc: Andi Kleen, Adam Belay, Mika Penttil?, Andi Kleen, David Hinds, linux-kernel, Andrew Morton, Grover, Andrew On Wed, 7 Jan 2004, Dave Jones wrote: > > But PNPBIOS is an ISA relic isn't it ? It still shows up. BIOSes use it exactly to tell the system about reserved magic IO regions (like the IO registers that are reserved for ACPI). ISA may be gone, but the crap it left behind lingers on. The BIOS writers know that they can affect windows IO region allocation with it, so they still do - to make sure windows boots even when the hardware has some strange IO resource allocations. And yes, that is likely to be an issue on x86-64 too.. As far as windows is concerned, it's just another 32-bit CPU. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: PCI memory allocation bug with CONFIG_HIGHMEM 2004-01-07 5:55 ` Dave Jones 2004-01-07 6:06 ` Linus Torvalds @ 2004-01-07 6:51 ` Andi Kleen 2004-01-07 2:43 ` Adam Belay 1 sibling, 1 reply; 41+ messages in thread From: Andi Kleen @ 2004-01-07 6:51 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Adam Belay, Mika Penttil?, Andi Kleen, David Hinds, linux-kernel, Andrew Morton, Grover, Andrew On Wed, Jan 07, 2004 at 05:55:57AM +0000, Dave Jones wrote: > On Wed, Jan 07, 2004 at 06:02:56AM +0100, Andi Kleen wrote: > > > > 5.) Look into other ways of finding out if the PnPBIOS might be buggy, > > > > currently we only have DMI. > > > > Any others? > > > We could use the exception mechanism, and try to fix up any BIOS errors. > > > That would require: > > > > [...] It would not work for x86-64 unfortunately where you cannot do > > any BIOS calls after the system is running (it would only be possible > > early in boot) > > Why on earth would you want to call PNPBIOS on AMD64 anyway ? See the preceding thread. We're currently missing a reliable way to find free IO space for PCI resources, which is needed for some cases. The PNPBIOS was discussed as one of the possible solutions. For AMD64 clearly something ACPI based is needed though. -Andi ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: PCI memory allocation bug with CONFIG_HIGHMEM
From: Adam Belay @ 2004-01-07 2:43 UTC
To: Andi Kleen
Cc: Dave Jones, Linus Torvalds, Mika Penttilä, Andi Kleen, David Hinds,
    linux-kernel, Andrew Morton, Grover, Andrew

On Wed, Jan 07, 2004 at 07:51:56AM +0100, Andi Kleen wrote:
> On Wed, Jan 07, 2004 at 05:55:57AM +0000, Dave Jones wrote:
> > On Wed, Jan 07, 2004 at 06:02:56AM +0100, Andi Kleen wrote:
> > > > > 5.) Look into other ways of finding out if the PnPBIOS might be buggy,
> > > > > currently we only have DMI.
> > > > > Any others?
> > > > We could use the exception mechanism, and try to fix up any BIOS errors.
> > > > That would require:
> > >
> > > [...] It would not work for x86-64 unfortunately where you cannot do
> > > any BIOS calls after the system is running (it would only be possible
> > > early in boot)
> >
> > Why on earth would you want to call PNPBIOS on AMD64 anyway ?
>
> See the preceding thread. We're currently missing a reliable way to find
> free IO space for PCI resources, which is needed for some cases. The
> PNPBIOS was discussed as one of the possible solutions.
>
> For AMD64 clearly something ACPI based is needed though.
>
> -Andi

Just as an example...

Here is how the PnPBIOS reserves io space for which it can't find an
actual device: (notice it isn't necessarily related to ISA)

09 PNP0c02 system peripheral: other
    flags: [no disable] [no config] [static]
    allocated resources:
        io 0x04d0-0x04d1 [16-bit decode]
        io 0x0cf8-0x0cff [16-bit decode]
        io 0x0010-0x001f [16-bit decode]
        io 0x0022-0x002d [16-bit decode]
        io 0x0030-0x003f [16-bit decode]
        io 0x0050-0x0052 [16-bit decode]
        io 0x0072-0x0077 [16-bit decode]
        io 0x0091-0x0093 [16-bit decode]
        io 0x00a2-0x00be [16-bit decode]
        io 0x0400-0x047f [16-bit decode]
        io 0x0540-0x054f [16-bit decode]
        io 0x0500-0x053f [16-bit decode]
        io disabled [16-bit decode]
        io disabled [16-bit decode]
        io disabled [16-bit decode]
        mem disabled [8/16 bit] [r/o] [cacheable] [shadow]
        mem disabled [8/16 bit] [r/o] [cacheable] [shadow]
        mem disabled [8/16 bit] [r/o] [cacheable] [shadow]
        mem disabled [8/16 bit] [r/o] [cacheable] [shadow]
        mem disabled [8/16 bit] [r/o] [cacheable] [shadow]
        mem disabled [8/16 bit] [r/o] [cacheable] [shadow]
        mem disabled [8/16 bit] [r/o] [cacheable] [shadow]
        mem disabled [8/16 bit] [r/o] [cacheable] [shadow]
        mem 0xffb00000-0xffbfffff [32 bit] [r/o]

And here is the output for ACPI on the same system:

00000970: Device SYSR (\_SB_.PCI0.SBRG.SYSR)
00000978: Name _HID (\_SB_.PCI0.SBRG.SYSR._HID)
0000097d: PNP0c02 (0x020cd041)
-->snip
00000995: Name _CRS (\_SB_.PCI0.SBRG.SYSR._CRS)
-->snip
0000099f: Interpreted as PnP Resource Descriptor:
0000099f: Fixed I/O Ports: 0x10 @ 0x10
0000099f: Fixed I/O Ports: 0x1e @ 0x22
0000099f: Fixed I/O Ports: 0x1c @ 0x44
0000099f: Fixed I/O Ports: 0x2 @ 0x62
0000099f: Fixed I/O Ports: 0xb @ 0x65
0000099f: Fixed I/O Ports: 0xe @ 0x72
0000099f: Fixed I/O Ports: 0x1 @ 0x80
0000099f: Fixed I/O Ports: 0x3 @ 0x84
0000099f: Fixed I/O Ports: 0x1 @ 0x88
0000099f: Fixed I/O Ports: 0x3 @ 0x8c
0000099f: Fixed I/O Ports: 0x10 @ 0x90
0000099f: Fixed I/O Ports: 0x1e @ 0xa2
0000099f: Fixed I/O Ports: 0x10 @ 0xe0
0000099f: I/O Ports: 16 bit address decoding,
0000099f: minbase 0x4d0, maxbase 0x4d0, align 0x0, count 0x2
0000099f: I/O Ports: 16 bit address decoding,
0000099f: minbase 0x400, maxbase 0x400, align 0x0, count 0x70
0000099f: I/O Ports: 16 bit address decoding,
0000099f: minbase 0x470, maxbase 0x470, align 0x0, count 0x10
0000099f: I/O Ports: 16 bit address decoding,
0000099f: minbase 0x500, maxbase 0x500, align 0x0, count 0x40
0000099f: I/O Ports: 16 bit address decoding,
0000099f: minbase 0x800, maxbase 0x800, align 0x0, count 0x80
0000099f: 32-bit rw Fixed memory range:
0000099f: base 0xfff00000, count 0x100000
0000099f: 32-bit rw Fixed memory range:
0000099f: base 0xffb00000, count 0x100000
0000099f: Bad checksum 0x6, should be 0 // hmm, interesting ;-)

So they seem to provide a potential solution for this sort of problem.

Thanks,
Adam

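[Editorial sketch, not part of the original message.] The two dumps above describe ranges differently. The PnPBIOS listing gives inclusive start-end pairs, while the ACPI listing appears to give a base plus a count (reading "0x10 @ 0x10" as "count @ base" is an assumption about the dump tool's format). Under that assumption the conversion is end = base + count - 1, which makes the two listings easy to compare. The struct and function names below are hypothetical, and this is plain user-space C, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive address range, in the style of the PnPBIOS listing. */
struct range {
    uint64_t start;
    uint64_t end;
};

/*
 * Convert a (base, count) descriptor, as the ACPI dump seems to print
 * it, into an inclusive start-end range. A count of zero has no
 * inclusive representation, so it is rejected.
 */
struct range range_from_count(uint64_t base, uint64_t count)
{
    struct range r;

    assert(count != 0);
    r.start = base;
    r.end = base + count - 1;
    return r;
}

/* Non-zero if range b starts right after range a ends. */
int ranges_adjacent(struct range a, struct range b)
{
    return a.end + 1 == b.start;
}
```

With the values from the dumps, base 0xffb00000 and count 0x100000 give 0xffb00000-0xffbfffff, the same memory range the PnPBIOS reports, and the ACPI ranges at 0x400 (count 0x70) and 0x470 (count 0x10) sit back to back and together cover the PnPBIOS range 0x0400-0x047f.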
* Re: PCI memory allocation bug with CONFIG_HIGHMEM
From: Helge Hafting @ 2004-01-07 8:32 UTC
To: Adam Belay
Cc: Linus Torvalds, Andi Kleen, Mika Penttilä, Andi Kleen, David Hinds,
    linux-kernel, Andrew Morton, Grover, Andrew

Adam Belay wrote:

> 2.) Windows works with buggy BIOSes because of the way it calls them.
>
> I looked into how Windows handles the PnPBIOS and may have discovered why it
> works on buggy BIOS. It turns out that exclusively realmode calls are used.
> See www.missl.cs.umd.edu/Projects/sebos/winint/index2.html#pnpbios. My
> knowledge is limited in this area of the x86 architecture but it is my
> impression that it would not be possible, or perhaps worth it, to implement
> realmode calls for the Linux PnPBIOS driver because of the time it is
> initialized.

Are these PnPBIOS calls needed at boot only?
If so, consider querying the bios early in the boot code - before
switching to protected mode. Just store the results, and let the
driver read them later instead of doing calls that crash.

Helge Hafting

* Re: PCI memory allocation bug with CONFIG_HIGHMEM
From: Eric W. Biederman @ 2004-01-06 22:45 UTC
To: Andi Kleen
Cc: Linus Torvalds, Mika Penttilä, Andi Kleen, David Hinds, linux-kernel

Andi Kleen <ak@colin2.muc.de> writes:

> On Tue, Jan 06, 2004 at 07:27:33AM -0800, Linus Torvalds wrote:
> >
> >
> > On Tue, 6 Jan 2004, Andi Kleen wrote:
> > >
> > > In my opinion it would have been cleaner if the aperture had always
> > > an reserved entry in the e820 map.
> >
> > That does sound like a bug in the AGP drivers. It shouldn't be hard at all
> > to make them reserve their aperture.
> >
> > Hint hint.
>
> No, it's a bug in the BIOS that they're not marked. But I've actually
> seen a BIOS that marked it and it lead to the Linux AGP driver failing
> (due to some interaction with how setup.c sets up resources). So the Linux
> driver currently even relies on the broken state.

And mtd map drivers for rom chips run into the same problem, except in
that case the region is almost always reserved by the BIOS.

Which means it's just silly for the drivers to fail when request_mem_region
fails. They are looking at the hardware and know where the regions are,
and there is not a parent device we can request a subregion from when
it is the BIOS that reserves the region.

Eric

* Re: PCI memory allocation bug with CONFIG_HIGHMEM
From: Linus Torvalds @ 2004-01-07 0:06 UTC
To: Eric W. Biederman
Cc: Andi Kleen, Mika Penttilä, Andi Kleen, David Hinds, linux-kernel

On Tue, 6 Jan 2004, Eric W. Biederman wrote:
>
> And mtd map drivers for rom chips run into the same problem except in
> that case regions is almost always reserved by the BIOS.
>
> Which means it's just silly for the drivers to fail when request_mem_region
> fails.

Note: you're not supposed to need to do "request_mem_region()" for modern
drivers. You should only need to claim ownership of the resources, and the
PCI driver interfaces should do that automatically.

What you should do for resources you know about is to just _create_ them.
Not necessarily request them (although that is one way of creating them),
but you can literally just tell the kernel that they are there. That will
already mean that anybody else that tries to allocate a resource will
avoid that area.

So if you know the hardware is there, and it _tells_ you it's there
(unlike, say, an ISA device), you can just call "request_mem_region()"
without ever even checking the error return (although you had better make
sure that the name allocation is stable if you are a module - don't want
to start oopsing in /proc if the module gets unloaded).

The PCI layer already does all of that for the "standard" resources. It's
just that the generic code can't do it for nonstandard regions, so drivers
for chips that don't have just the regular BAR things should create their
own resource entries.

Linus

* Re: PCI memory allocation bug with CONFIG_HIGHMEM
From: Eric W. Biederman @ 2004-01-07 4:58 UTC
To: Linus Torvalds
Cc: Andi Kleen, Mika Penttilä, Andi Kleen, David Hinds, linux-kernel

Linus Torvalds <torvalds@osdl.org> writes:

> On Tue, 6 Jan 2004, Eric W. Biederman wrote:
> >
> > And mtd map drivers for rom chips run into the same problem except in
> > that case regions is almost always reserved by the BIOS.
> >
> > Which means it's just silly for the drivers to fail when request_mem_region
> > fails.
>
> Note: you're not supposed to need to do "request_mem_region()" for modern
> drivers. You should only need to claim ownership of the resources, and the
> PCI driver interfaces should do that automatically.
>
> What you should do for resources you know about is to just _create_ them.

Which I can do. But what if the BIOS has marked them as reserved?
The BIOS always does this for ROM chips. And it sounds like this occasionally
happens for AGP apertures.

> Not necessarily request them (although that is one way of creating them),
> but you can literally just tell the kernel that they are there. That will
> already mean that anybody else that tries to allocate a resource will
> avoid that area.
>
> So if you know the hardware is there, and it _tells_ you it's there
> (unlike, say, an ISA device), you can just call "request_mem_region()"
> without ever even checking the error return (although you had better make
> sure that the name allocation is stable if you are a module - don't want
> to start oopsin in /proc if the module gets unloaded).

Or to oops when the module is unloaded, when you try and free the resource.
But completely freeing the resource is actually bad manners, because the
resources are used no matter what and you don't want to allocate anything
else in there.

> The PCI layer already does all of that for the "standard" resources. It's
> just that the generic code can't do it for nonstandard regions, so drivers
> for chips that don't have just the regular BAR things should create their
> own resource entries..

So thinking out loud about the twist that is in my experience.

Southbridges have a special decode region for BIOS ROM chips. It is
at least 64K, but can be as big as 8M or so at the end of the address
space.

On my machine at home the e820 map looks like:

00000000-0009fbff : System RAM
0009fc00-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000c7fff : Video ROM
000c8000-000c8fff : Extension ROM
000f0000-000fffff : System ROM
00100000-1fffbfff : System RAM
  00100000-002bdc69 : Kernel code
  002bdc6a-00347183 : Kernel data
1fffc000-1fffefff : ACPI Tables
1ffff000-1fffffff : ACPI Non-volatile Storage
cb800000-cb8fffff : Intel Corp. 82557 [Ethernet Pro 100]
cc000000-cc000fff : Intel Corp. 82557 [Ethernet Pro 100]
  cc000000-cc000fff : eepro100
cc800000-cddfffff : PCI Bus #01
  cc800000-ccffffff : Matrox Graphics, Inc. MGA G400 AGP
  cd000000-cd003fff : Matrox Graphics, Inc. MGA G400 AGP
cdf00000-cfffffff : PCI Bus #01
  ce000000-cfffffff : Matrox Graphics, Inc. MGA G400 AGP
d0000000-dfffffff : VIA Technologies, Inc. VT82C693A/694x [Apollo PRO133x]
ffff0000-ffffffff : reserved

That last reserved region is 64K. Which, looking at the pci registers,
is technically correct at the moment. Only 64K happen to be decoded.

If I wanted to flash my ROM what I need to do is:
- Load a driver for the region where ROM chips can possibly be
  at the top of memory. This is the region 0xFFF00000 - 0xFFFFFFFF
  on the via686.
- The driver comes in and looks for the via686, and finds it so it
  knows it can do something.
- The driver can attempt to get the region 0xFFF00000 - 0xFFFFFFFF,
  but that is impossible.
- The driver enables the decodes on all of 0xFFF00000 - 0xFFFFFFFF
  in the via686.
- In general the driver would need to flip a bit in the via686 or
  someplace to enable the WE (write-enable) line to the ROM chip.
- The driver would then call in the mtd subsystem at likely offsets
  into the region 0xFFF00000 - 0xFFFFFFFF and have it do a JEDEC
  mostly standard probe to see if a BIOS chip starts at that offset.

This basic algorithm works; see drivers/mtd/maps/amd76xrom.c and
drivers/mtd/maps/ich2rom.c. But it does not really play nice with
the existing kernel infrastructure.

So to do this cleanly it looks like I need to write a pci quirk for
the southbridge. Adding a BAR that enables decodes to the BIOS ROM chip.
And that quirk should always be present, so that nothing even thinks
of using that region for something else.

With the quirk doing the heavy lifting the map driver would just
need to do something like grab a child resource for the ROM chip to
show that I am actively using it.

The very practical question. After the BIOS has allocated:
0xFFFF0000 - 0xFFFFFFFF how do I allocate
0xFFF00000 - 0xFFFFFFFF in the pci quirk?

The area is already allocated and it chops off the area I need to
allocate. Which is a general mess, and that happens to be a very
typical scenario for BIOS ROMS.

Because the conflicting resource is allocated in what is now
legacy_init_iomem_resources() from bootmem and just dropped on the
floor, I can't free the conflicting resource. I don't know of anything
I can do cleanly without modifying the code.

Basically the question becomes what to do about an incorrect e820 map
that you don't find out about until you start initializing drivers.

Eric

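[Editorial sketch, not part of the original message.] One way to read "likely offsets" in the probe step above is to assume that each candidate chip has a power-of-two size and ends at the top of the 4 GB address space, so a chip of a given size would start at 0x100000000 minus that size. Both assumptions are editorial, not Eric's, and the names below are hypothetical. The sketch only lists candidate base addresses inside the 0xFFF00000 - 0xFFFFFFFF window; the actual JEDEC probe would be left to the mtd layer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROM_WINDOW_START 0xFFF00000u
#define ROM_WINDOW_END   0xFFFFFFFFu   /* inclusive */

/*
 * Fill out[] with candidate chip base addresses, largest chip first,
 * assuming power-of-two chip sizes that end at ROM_WINDOW_END.
 * min_size is the smallest chip size to consider, in bytes.
 * Returns the number of candidates written (at most max_out).
 */
size_t rom_probe_candidates(uint32_t min_size, uint32_t *out, size_t max_out)
{
    uint32_t window = ROM_WINDOW_END - ROM_WINDOW_START + 1u; /* 1 MB */
    uint32_t size;
    size_t n = 0;

    assert(out != NULL);
    if (min_size == 0)
        return 0;

    for (size = window; size >= min_size && n < max_out; size >>= 1)
        out[n++] = ROM_WINDOW_END - size + 1u;

    return n;
}
```

Calling it with a minimum size of 64K yields five candidates, from 0xFFF00000 (a single 1 MB chip) down to 0xFFFF0000 (a 64K chip), which is the range the BIOS had marked reserved.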
* Re: PCI memory allocation bug with CONFIG_HIGHMEM
From: Linus Torvalds @ 2004-01-07 5:32 UTC
To: Eric W. Biederman
Cc: Andi Kleen, Mika Penttilä, Andi Kleen, David Hinds, Kernel Mailing List

On Tue, 6 Jan 2004, Eric W. Biederman wrote:
> >
> > What you should do for resources you know about is to just _create_ them.
>
> Which I can do. But what if the BIOS has marked them as reserved?
> The BIOS always does this for ROM chips. And it sounds like this occasionally
> happens for AGP apertures.

So? The resource functions will refuse to insert an overlapping resource
and return an error, so if the BIOS already did it through a proper e820
map, then it's a no-op.

But that's fine - it's _supposed_ to be a no-op in that case.

> Or to oops when the module is unloaded, when you try and free the resource.

Or, more appropriately, if it's a fixed resource (which it will be, if
this is some special chipset feature), you don't ever try to free it. Just
leave it be. Just make sure that the resource name etc points to stable
data (and "pci_name(dev)" is a good such data). See the quirk entries in
drivers/pci/quirks.c.

Alternatively, you actually keep track of whether it was your resource or
not, the error code will have told you. Don't try to release something
that wasn't yours.

> If I wanted to flash my ROM what I need to do is:
>
> [ horrorcase deleted ]
>
> So to do this cleanly it looks like I need to write a pci quirk for
> the southbridge. Adding a BAR that enables decodes to the BIOS ROM chip.
> And that quirk should always be present, so that nothing even thinks
> of using that region for something else.

Sounds correct. However, the BIOS map will still clash with this quirk, so
there may be some double resource allocations in the resource maps. The
quirks get run _after_ the memory setup has run, which is why you end up
with this problem:

> The very practical question. After the BIOS has allocated:
> 0xFFFF0000 - 0xFFFFFFFF how do I allocate
> 0xFFF00000 - 0xFFFFFFFF in the pci quirk?

And _this_ is the only really nasty case. It's nasty exactly because the
BIOS is involved, and we have _no_ idea why the heck the BIOS marked
certain areas reserved.

The resource allocation code in kernel/resource.c _will_ help you if you
want to do this right. The internal "__request_resource()" function will
pinpoint any conflicting entry, and in fact we already have a
"insert_resource()" that uses exactly this to try to "fix up" these
issues.

The "insert_resource()" function is able to put "new" resources below old
ones, but it does assume that the resources are fully overlapping in
_some_ way. It will correctly insert your PCI quirk (because the BIOS
allocation is wholly inside of the quirk you want to add), but it would
_not_ be able to handle two different regions conflicting.

And such a conflict could happen if the BIOS uses a single "reserved"
region for two different PCI resources. Then your quirk might cover one of
the PCI resources fully, but wouldn't cover the whole BIOS "reserved"
area. "insert_resource()" would still be happy if your quirk is wholly
inside, but it would _not_ be happy if your quirk is bigger than the BIOS
allocation in one direction but not the other.

See?

Right now, the ia64 port actually does _exactly_ this to mark all the
strange PCI window stuff into the resource tree. For a different reason,
but with a number of similar issues.

Linus

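[Editorial sketch, not part of the original message.] The containment rule described above can be modelled with a small classifier over inclusive ranges. This is a toy user-space model with hypothetical names, not the code in kernel/resource.c. It separates the cases named in the message: the new range wholly covers the existing one (the quirk around the BIOS "reserved" entry), the new range sits wholly inside the existing one, and a partial overlap, which is the case insert_resource() is said not to handle.

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive address range. */
struct span {
    uint64_t start;
    uint64_t end;
};

enum overlap {
    OVERLAP_NONE,           /* ranges do not touch at all */
    OVERLAP_REQ_INSIDE_CUR, /* new range fits wholly inside the old one */
    OVERLAP_REQ_COVERS_CUR, /* new range wholly covers the old one */
    OVERLAP_PARTIAL         /* bigger in one direction but not the other */
};

/* Classify how a requested range relates to an existing one. */
enum overlap classify_overlap(struct span req, struct span cur)
{
    assert(req.start <= req.end && cur.start <= cur.end);

    if (req.end < cur.start || req.start > cur.end)
        return OVERLAP_NONE;
    if (req.start >= cur.start && req.end <= cur.end)
        return OVERLAP_REQ_INSIDE_CUR;
    if (req.start <= cur.start && req.end >= cur.end)
        return OVERLAP_REQ_COVERS_CUR;
    return OVERLAP_PARTIAL;
}

/* In this model, only a partial overlap cannot be nested in a tree. */
int can_nest(struct span req, struct span cur)
{
    return classify_overlap(req, cur) != OVERLAP_PARTIAL;
}
```

For the ranges in the thread, a quirk for 0xFFF00000 - 0xFFFFFFFF against the BIOS entry 0xFFFF0000 - 0xFFFFFFFF classifies as "covers", so it can be nested. A quirk that started below the BIOS entry but ended inside it would classify as partial.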
* Re: PCI memory allocation bug with CONFIG_HIGHMEM
From: Eric W. Biederman @ 2004-01-07 15:53 UTC
To: Linus Torvalds
Cc: Andi Kleen, Mika Penttilä, Andi Kleen, David Hinds, Kernel Mailing List

Linus Torvalds <torvalds@osdl.org> writes:

> On Tue, 6 Jan 2004, Eric W. Biederman wrote:
> > If I wanted to flash my ROM what I need to do is:
> >
> > [ horrorcase deleted ]
> >
> > So to do this cleanly it looks like I need to write a pci quirk for
> > the southbridge. Adding a BAR that enables decodes to the BIOS ROM chip.
> > And that quirk should always be present, so that nothing even thinks
> > of using that region for something else.
>
> Sounds correct. However, the BIOS map will still clash with this quirk, so
> there may be some double resource allocations in the resource maps. The
> quirks get run _after_ the memory setup has run, which is why you end up
> with this problem:
>
> > The very practical question. After the BIOS has allocated:
> > 0xFFFF0000 - 0xFFFFFFFF how do I allocate
> > 0xFFF00000 - 0xFFFFFFFF in the pci quirk?
>
> And _this_ is the only really nasty case. It's nasty exactly because the
> BIOS is involved, and we have _no_ idea why the heck the BIOS marked
> certain areas reserved.
>
> The resource allocation code in kernel/resource.c _will_ help you if you
> want to do this right. The internal "__request_resource()" function will
> pinpoint any conflicting entry, and in fact we already have a
> "insert_resource()" that uses exactly this to try to "fix up" these
> issues.

Last time I was looking I got as far as __request_resource, but it was
and still is private to resource.c so it needs a wrapper around it
that does something useful.

> The "insert_resource()" function is able to put "new" resources below old
> ones, but it does assume that the the resources are fully overlapping in
> _some_ way. It will correctly insert your PCI quirk (because the BIOS
> allocation is wholly inside of the quirk you want to add), but it would
> _not_ be able to handle two different regions conflicting.

And this looks like a useful wrapper, that comes very close to what
I need. insert_resource is new since last time I looked so I missed it.

> And such a conflict could happen if the BIOS uses a single "reserved"
> region for two different PCI resources. Then your quirk might cover one of
> the PCI resources fully, but wouldn't cover the whole BIOS "reserved"
> area. "insert_resource()" would still be happy if your quirk is wholly
> inside, but it would _not_ be happy if your quirk is bigger than the BIOS
> allocation in one direction but not the other.
>
> See?

Yes. insert_resource does have its limitations but it doesn't look
like I am likely to run into them.

> Right now, the ia64 port actually does _exactly_ this to mark all the
> strange PCI window stuff into the resource tree. For a different reason,
> but with a number of similar issues.

It comes very close to what this weird case needs. And things
are at least close enough now that I can hack something up for 2.6.

However insert_resource does not quite match what I think needs to
happen. After a pci quirk applies insert_resource I will get
something like:

fff00000-ffffffff : BIOS ROM Window
  ffff0000-ffffffff : reserved

With the reserved region still present and marked as BUSY.

Ideally the map driver would carve up the window into sub regions for
each ROM chip. Usually that is just one sub region but it is possible
to have multiple ROMs in it. So I would expect to wind up with
something like:

fff00000-ffffffff : BIOS ROM Window
  fffc0000-ffffffff : mtd0
  ffff0000-ffffffff : reserved

But that again runs afoul of the reserved region so it really won't
work. I could again use insert_resource and wind up with:

fff00000-ffffffff : BIOS ROM Window
  fffc0000-ffffffff : mtd0
    ffff0000-ffffffff : reserved

But that is increasingly a hack instead of a clean solution.

Would it be reasonable to write a variant of request_resource that just
drops BIOS resources? I can live with the restrictions of the current
insert_resource, but especially if I do this in a quirk I just want
the BIOS resources to go away.

Eric

* Re: PCI memory allocation bug with CONFIG_HIGHMEM
From: Linus Torvalds @ 2004-01-07 16:32 UTC
To: Eric W. Biederman
Cc: Andi Kleen, Mika Penttilä, Andi Kleen, David Hinds, Kernel Mailing List

On Wed, 7 Jan 2004, Eric W. Biederman wrote:
>
> However insert_resource does not quite match what I think needs to
> happen. After a pci quirk applies insert_resource I will get
> something like:
>
> fff00000-ffffffff : BIOS ROM Window
>   ffff0000-ffffffff : reserved
>
> With the reserved region still present and marked as BUSY.

I would suggest ignoring it. Not only because being overly complicated is
bad, but simply because nobody should care.

At some point adding extra regions is _purely_ for "documentation"
reasons, and while that may be nice, it's not worth worrying about. The
only thing you really want from a _correctness_ standpoint is to make sure
that nobody else will try to allocate their stuff in that area, and your
"BIOS ROM Window" resource should do that already.

> Would it be reasonable to write a variant of request_resource that just
> drops BIOS resources.

It would not be impossible to just have a "force_resource()" that would
simply override _any_ existing resource, but quite frankly, I'd be more
nervous about that.

We could also mark the e820 non-RAM resources with some special
IORESOURCE_TENTATIVE flag, and allow just overriding those.

But even the simple "insert_resource()" has some potential problems: if
the BIOS has allocated the minimal window for itself (64kB at 0xffff0000),
and has allocated some _other_ chip at 0xfffe0000 that the kernel doesn't
know about yet, your insert_resource() would do the wrong thing and claim
the whole area for the BIOS writing.

Maybe that doesn't happen, but it's something to think about.

At some point, the _correct_ answer may be: don't do complex things, and
write a bootable floppy (without any OS at all, or a really minimal one)
to do BIOS rom updates.

Linus

* Re: PCI memory allocation bug with CONFIG_HIGHMEM
From: Eric W. Biederman @ 2004-01-07 17:32 UTC
To: Linus Torvalds
Cc: Andi Kleen, Mika Penttilä, Andi Kleen, David Hinds, Kernel Mailing List

Linus Torvalds <torvalds@osdl.org> writes:

> On Wed, 7 Jan 2004, Eric W. Biederman wrote:
> >
> > However insert_resource does not quite match what I think needs to
> > happen. After a pci quirk applies insert_resource I will get
> > something like:
> >
> > fff00000-ffffffff : BIOS ROM Window
> >   ffff0000-ffffffff : reserved
> >
> > With the reserved region still present and marked as BUSY.
>
> I would suggest ignoring it. Not only because being overly complicated is
> bad, but simply because nobody should care.
>
> At some point adding extra regions is _purely_ for "documentation"
> reasons, and while that may be nice, it's not worth worrying about. The
> only thing you really want from a _correctness_ standpoint is to make sure
> that nobody else will try to allocate their stuff in that area, and your
> "BIOS ROM Window" resource should do that already.

Right, it is a documentation thing. The case that causes me to pull my
hair out is Itanium boards. Typically they have 6 or 7 1MB rom chips
for their firmware. My goal in going down this road last time was so
that user space could figure out which rom chip is at which address and
how those correspond to mtd devices. Using the existing interfaces to
export this information looked like the cleanest way to make certain
that information was available, until I ran into snags like the above.

And once I replace the BIOS I can fix these things at the source, but...

> > Would it be reasonable to write a variant of request_resource that just
> > drops BIOS resources.
>
> It would not be impossible to just have a "force_resource()" that would
> simply override _any_ existing resource, but quite frankly, I'd be more
> nervous about that.

Same here.

> We could also mark the e820 non-RAM resources with some special
> IORESOURCE_TENTATIVE flag, and allow just overriding those.
>
> But even the simple "insert_resource()" has some potential problems: if
> the BIOS has allocated the minimal window for itself (64kB at 0xffff0000),
> and has allocated some _other_ chip at 0xfffe0000 that the kernel doesn't
> know about yet, your insert_resource() would do the wrong thing and claim
> the whole area for the BIOS writing.
>
> Maybe that doesn't happen, but it's something to think about.

Agreed. In practice it does not happen, but it is worth thinking about.

The important thing to maintain is that nothing else grabs the area
the BIOS reserves with a dynamic resource. So as long as there is a
resource over that area the kernel is safe; even if I do grab it with
insert_resource it does not really matter to the rest of the kernel
because someone has it.

The ROM chips actually have IDs so I can always positively identify
those, I just don't always know their count. The worst case would be
the rom chip probe causing problems. And that can be avoided by
simply not loading the driver, so I think we are fairly safe.

In the case where I open up the decoder beyond the size it is
currently set for, I can test for conflicts from other devices. The
only reason I would not see another device at that point is if either
(a) there are ordering problems in the kernel or (b) a SMM bios is
doing truly stupid things.

The case where there is a device there and we aren't using it is not a
problem because I am just reserving a region of the address space.

Now that I have thought about it some more, I think the right thing to
do with IORESOURCE_TENTATIVE is, instead of removing tentative
resources, to just push them aside. So in my terrible case I would get:

fff00000-ffffffff : BIOS ROM Window
fffffff-ffffffff : reserved

And that cleans up all of the structure freeing problems. I guess I
can do that right now with __request_resource after I find the
conflict and confirm it has the name "reserved". I still like the
tentative idea because then anyone else who needs the same
functionality would not need to reimplement it.

> At some point, the _correct_ answer may be: don't do complex things, and
> write a bootable floppy (without any OS at all, or a really minimal one)
> to do BIOS rom updates.

That works to some extent. But it is actually a lot more dangerous
because you have to be there in person to verify everything is working
fine, and to insert the floppy. Doing it from Linux I can update an
entire cluster in a minute, and verify everything automatically. And
it happens faster because I can load it all over the network.

Eric

* Re: PCI memory allocation bug with CONFIG_HIGHMEM
From: Eric W. Biederman @ 2004-01-08 19:34 UTC
To: Linus Torvalds
Cc: Kernel Mailing List

Linus Torvalds <torvalds@osdl.org> writes:

> At some point, the _correct_ answer may be: don't do complex things, and
> write a bootable floppy (without any OS at all, or a really minimal one)
> to do BIOS rom updates.

ROM chips fall into the linux mtd layer quite cleanly, and they are
just quirky enough they need someplace where lots of eyes look at the
code, and lots of people use the code. And the linux mtd layer
appears to be that place.

I have had enough success in actually using the linux kernel for
flashing ROMS that it is becoming worthwhile to actually fix up the
last couple of annoying cases.

Plus I'm close to the point of finding some value in jffs2 and the
other flash filesystems, at which point I will need to use the mtd
layer anyway.

Eric

* Re: PCI memory allocation bug with CONFIG_HIGHMEM
From: Russell King @ 2004-01-07 9:31 UTC
To: Eric W. Biederman
Cc: Linus Torvalds, Andi Kleen, Mika Penttilä, Andi Kleen, David Hinds,
    linux-kernel

On Tue, Jan 06, 2004 at 09:58:23PM -0700, Eric W. Biederman wrote:
> ffff0000-ffffffff : reserved
>
> That last reserved region is 64K. Which looking at the pci registers
> is technically correct at the moment. Only 64K happen to be decoded.

We already have this distinction between in use (or busy) resources and
allocated resources. Surely the BIOS ROM region should be an allocation
resource not a busy resource, so that the MTD driver can obtain a busy
resource against it?

--
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 PCMCIA      - http://pcmcia.arm.linux.org.uk/
                 2.6 Serial core

* Re: PCI memory allocation bug with CONFIG_HIGHMEM
From: Eric W. Biederman @ 2004-01-07 15:06 UTC
To: Russell King
Cc: Eric W. Biederman, Linus Torvalds, Andi Kleen, Mika Penttilä,
    Andi Kleen, David Hinds, linux-kernel

Russell King <rmk+lkml@arm.linux.org.uk> writes:

> On Tue, Jan 06, 2004 at 09:58:23PM -0700, Eric W. Biederman wrote:
> > ffff0000-ffffffff : reserved
> >
> > That last reserved region is 64K. Which looking at the pci registers
> > is technically correct at the moment. Only 64K happen to be decoded.
>
> We already have this distinction between in use (or busy) resources and
> allocated resources. Surely the BIOS ROM region should be an allocation
> resource not a busy resource, so that the MTD driver can obtain a busy
> resource against it?

Nope, the BIOS region is allocated as BUSY, at least as it comes
out of the E820 map.

From arch/i386/kernel/setup.c:legacy_init_iomem_resources
....
    res->start = e820.map[i].addr;
    res->end = res->start + e820.map[i].size - 1;
    res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
    request_resource(&iomem_resource, res);

Eric

* Re: PCI memory allocation bug with CONFIG_HIGHMEM
From: Russell King @ 2004-01-07 20:29 UTC
To: Eric W. Biederman
Cc: Linus Torvalds, Andi Kleen, Mika Penttilä, Andi Kleen, David Hinds,
    linux-kernel

On Wed, Jan 07, 2004 at 08:06:04AM -0700, Eric W. Biederman wrote:
> Russell King <rmk+lkml@arm.linux.org.uk> writes:
>
> > On Tue, Jan 06, 2004 at 09:58:23PM -0700, Eric W. Biederman wrote:
> > > ffff0000-ffffffff : reserved
> > >
> > > That last reserved region is 64K. Which looking at the pci registers
> > > is technically correct at the moment. Only 64K happen to be decoded.
> >
> > We already have this distinction between in use (or busy) resources and
> > allocated resources. Surely the BIOS ROM region should be an allocation
> > resource not a busy resource, so that the MTD driver can obtain a busy
> > resource against it?
>
> Nope the BIOS region is allocated as BUSY, at least as it comes
> out of the E820 map.
>
> From arch/i386/kernel/setup.c:legacy_init_iomem_resources
> ....
>     res->start = e820.map[i].addr;
>     res->end = res->start + e820.map[i].size - 1;
>     res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
>     request_resource(&iomem_resource, res);

I was hoping someone was going to take my comments as a suggestion for
a possible solution to the problem.

--
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 PCMCIA      - http://pcmcia.arm.linux.org.uk/
                 2.6 Serial core

* Re: PCI memory allocation bug with CONFIG_HIGHMEM
From: Eric W. Biederman @ 2004-01-06 22:56 UTC
To: Linus Torvalds
Cc: Andi Kleen, David Hinds, linux-kernel

Linus Torvalds <torvalds@osdl.org> writes:

> On Tue, 6 Jan 2004, Andi Kleen wrote:
> >
> > IMHO the only reliable way to get physical bus space for mappings
> > is to allocate some memory and map the mapping over that.
>
> You literally can't do that: the RAM addresses are decoded by the
> northbridge before they ever hit the PCI bus, so it's impossible to "map
> over" RAM in general.

On AMD cpus starting at least with the K7 it is a cpu function. They
have both memory access and IO access FSB cycles. The cpu decodes the
address by looking at the IORRs and TOP_MEM (IO range registers are
similar to mtrrs but for specifying IO regions). Of course there are
some northbridges that don't ignore the mem/io bits..

> Normally, the way this works is that there are magic northbridge mapping
> registers that remap part of the memory,

So far I have only seen this on the Intel E7500 and its descendants.

> so that the memory that is
> physically in the upper 4GB of RAM shows up somewhere else (or just
> possibly disappears entirely

Having the memory disappear entirely is much more common.

> - once you have more than 4GB of RAM, you
> might not care too much about a few tens of megs missing).

At least not until you plug in a card with a 256M pci memory region
and lose half a gig of RAM.

There is also the trick of just not mapping the RAM into the address
space in a contiguous fashion. I have been very tempted lately to just
set up boxes with one DIMM below 4G and have all of the rest above to
make this easier. But 32bit OSes, and the performance hit they take
when accessing memory above 4G, keep this from being a good idea yet.

Eric
* PCI memory allocation bug with CONFIG_HIGHMEM
From: David Hinds @ 2004-01-05 20:07 UTC
To: linux-kernel
Cc: Amit, Russell King

In arch/i386/kernel/setup.c we have:

	/* Tell the PCI layer not to allocate too close to the RAM area.. */
	low_mem_size = ((max_low_pfn << PAGE_SHIFT) + 0xfffff) & ~0xfffff;
	if (low_mem_size > pci_mem_start)
		pci_mem_start = low_mem_size;

which is meant to round up pci_mem_start to the nearest 1 MB boundary
past the top of physical RAM. However this does not consider highmem.
Should this just be using max_pfn rather than max_low_pfn?

(I have a report of this failing on a laptop with a highmem kernel,
causing a PCI memory resource to be allocated on top of a RAM area)

-- Dave
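The arithmetic in the snippet quoted above can be checked in isolation. The following is a minimal sketch, not the kernel code itself: it assumes 4 KB pages (PAGE_SHIFT of 12, as on i386), and the helper names round_up_1mb and raise_pci_mem_start are hypothetical names introduced only for illustration.

```c
/* Assumption: 4 KB pages, as on i386. */
#define PAGE_SHIFT 12

/* Round a byte address up to the next 1 MB boundary, using the same
 * add-then-mask idiom as the setup.c snippet. */
static unsigned long long round_up_1mb(unsigned long long addr)
{
	return (addr + 0xfffffULL) & ~0xfffffULL;
}

/* Hypothetical helper: returns the PCI allocation start after applying
 * the "do not allocate too close to RAM" adjustment. max_low_pfn is the
 * page frame number just past the top of low memory. */
unsigned long long raise_pci_mem_start(unsigned long long pci_mem_start,
				       unsigned long long max_low_pfn)
{
	unsigned long long low_mem_size = round_up_1mb(max_low_pfn << PAGE_SHIFT);

	if (low_mem_size > pci_mem_start)
		pci_mem_start = low_mem_size;
	return pci_mem_start;
}
```

With illustrative numbers: a machine whose low memory ends 4 KB short of 256 MB (max_low_pfn of 0xffff) gets its start rounded up to 0x10000000, which clears the hidden top page. On a highmem kernel with about 1 GB of RAM, max_low_pfn typically stops near 896 MB (0x38000), so the computed start of 0x38000000 still sits inside RAM, which is the failure described in the report.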
* Re: PCI memory allocation bug with CONFIG_HIGHMEM
From: Russell King @ 2004-01-05 23:00 UTC
To: David Hinds
Cc: linux-kernel, Amit

On Mon, Jan 05, 2004 at 12:07:07PM -0800, David Hinds wrote:
>
> In arch/i386/kernel/setup.c we have:
>
> 	/* Tell the PCI layer not to allocate too close to the RAM area.. */
> 	low_mem_size = ((max_low_pfn << PAGE_SHIFT) + 0xfffff) & ~0xfffff;
> 	if (low_mem_size > pci_mem_start)
> 		pci_mem_start = low_mem_size;
>
> which is meant to round up pci_mem_start to the nearest 1 MB boundary
> past the top of physical RAM. However this does not consider highmem.
> Should this just be using max_pfn rather than max_low_pfn?
>
> (I have a report of this failing on a laptop with a highmem kernel,
> causing a PCI memory resource to be allocated on top of a RAM area)

Beware - people sometimes use mem= to tell the kernel how much RAM is
available for its use. Unfortunately, this overrides the E820 map,
and causes the kernel to believe that all memory above the end of RAM
is available for use.

This is not the case, especially on ACPI systems.

I have come to the conclusion that the use of mem= is a _very_ bad idea
unless someone has an extremely good reason to override the E820 map.
And even then, it must be used with extreme care, and also in combination
with the reserve= parameter to ensure that reserved memory areas remain
marked as such. (Reserved regions as in the ACPI data tables.)

Failure to follow this will result in non-functional PCMCIA/Cardbus
because of memory resource collisions between system RAM and PCI memory
space.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 PCMCIA      - http://pcmcia.arm.linux.org.uk/
                 2.6 Serial core
* Re: PCI memory allocation bug with CONFIG_HIGHMEM
From: David Hinds @ 2004-01-05 23:45 UTC
To: linux-kernel, Amit

On Mon, Jan 05, 2004 at 11:00:16PM +0000, Russell King wrote:
> On Mon, Jan 05, 2004 at 12:07:07PM -0800, David Hinds wrote:
> >
> > In arch/i386/kernel/setup.c we have:
> >
> > 	/* Tell the PCI layer not to allocate too close to the RAM area.. */
> > 	low_mem_size = ((max_low_pfn << PAGE_SHIFT) + 0xfffff) & ~0xfffff;
> > 	if (low_mem_size > pci_mem_start)
> > 		pci_mem_start = low_mem_size;
> >
> > which is meant to round up pci_mem_start to the nearest 1 MB boundary
> > past the top of physical RAM. However this does not consider highmem.
> > Should this just be using max_pfn rather than max_low_pfn?
> >
> > (I have a report of this failing on a laptop with a highmem kernel,
> > causing a PCI memory resource to be allocated on top of a RAM area)
>
> Beware - people sometimes use mem= to tell the kernel how much RAM is
> available for its use. Unfortunately, this overrides the E820 map,
> and causes the kernel to believe that all memory above the end of RAM
> is available for use.
>
> This is not the case, especially on ACPI systems.

Yes, and that was the original reason for this snippet of code. It is
just a quick fix and shouldn't be needed if the E820 map is correct or
if the user has specified a correct mem= parameter.

-- Dave
* Re: PCI memory allocation bug with CONFIG_HIGHMEM
From: Linus Torvalds @ 2004-01-06 0:36 UTC
To: David Hinds
Cc: linux-kernel, Amit, Russell King

On Mon, 5 Jan 2004, David Hinds wrote:
>
> In arch/i386/kernel/setup.c we have:
>
> 	/* Tell the PCI layer not to allocate too close to the RAM area.. */
> 	low_mem_size = ((max_low_pfn << PAGE_SHIFT) + 0xfffff) & ~0xfffff;
> 	if (low_mem_size > pci_mem_start)
> 		pci_mem_start = low_mem_size;
>
> which is meant to round up pci_mem_start to the nearest 1 MB boundary
> past the top of physical RAM. However this does not consider highmem.
> Should this just be using max_pfn rather than max_low_pfn?

Yes and no. That doesn't really work either, for any machine with more
than 4GB of RAM.

We want to find the memory hole (in the low 4GB region), and usually the
e820 memory map should make that all happen properly. What does that
report on this laptop?

This is why we put the memory resources in /proc/iomem, and mark them
busy: so that the PCI subsystem won't try to allocate PCI memory in the
RAM (or ACPI reserved) area. The "pci_mem_start" thing is just a point to
_start_ the allocation, the PCI subsystem still should honor the fact that
we have memory above it. That's the whole point of doing proper resource
allocation, after all.

Does this not work, or have you disabled e820 for some reason?

		Linus
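The distinction drawn in the message above, a start hint versus a list of busy regions that the allocator must still respect, can be illustrated with a small stand-alone sketch. This is not the kernel's resource allocator: struct range and find_free are hypothetical names, the busy list is assumed to be sorted and non-overlapping, and align is assumed to be a power of two.

```c
/* Hypothetical, simplified stand-in for a busy resource: an address
 * range with an inclusive end. */
struct range {
	unsigned long long start;
	unsigned long long end;
};

/* Return the lowest aligned address >= min_start such that
 * [addr, addr + size - 1] overlaps no busy range and does not exceed
 * limit. Returns 0 if nothing fits. busy[] must be sorted by start and
 * non-overlapping; align must be a power of two. */
unsigned long long find_free(const struct range *busy, int n,
			     unsigned long long min_start,
			     unsigned long long size,
			     unsigned long long align,
			     unsigned long long limit)
{
	unsigned long long addr = (min_start + align - 1) & ~(align - 1);

	for (int i = 0; i < n; i++) {
		if (busy[i].end < addr)
			continue;	/* busy range lies entirely below the candidate */
		if (addr + size - 1 < busy[i].start)
			break;		/* candidate fits in the gap before this range */
		/* Overlap: move the candidate just past this busy range. */
		addr = (busy[i].end + 1 + align - 1) & ~(align - 1);
	}
	if (addr + size - 1 > limit)
		return 0;
	return addr;
}
```

In this model, a start hint that falls inside RAM is harmless as long as RAM and the ACPI area both appear in the busy list: the search simply skips past them. If the top 4 KB is missing from the list, as with a mem= value that is 4 KB too small and no reserved entry, the first "free" address is exactly that page.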
* Re: PCI memory allocation bug with CONFIG_HIGHMEM
From: David Hinds @ 2004-01-06 0:44 UTC
To: Linus Torvalds
Cc: linux-kernel, Amit, Russell King

On Mon, Jan 05, 2004 at 04:36:15PM -0800, Linus Torvalds wrote:
>
> On Mon, 5 Jan 2004, David Hinds wrote:
> >
> > In arch/i386/kernel/setup.c we have:
> >
> > 	/* Tell the PCI layer not to allocate too close to the RAM area.. */
> > 	low_mem_size = ((max_low_pfn << PAGE_SHIFT) + 0xfffff) & ~0xfffff;
> > 	if (low_mem_size > pci_mem_start)
> > 		pci_mem_start = low_mem_size;
> >
> > which is meant to round up pci_mem_start to the nearest 1 MB boundary
> > past the top of physical RAM. However this does not consider highmem.
> > Should this just be using max_pfn rather than max_low_pfn?
>
> Yes and no. That doesn't really work either, for any machine with more
> than 4GB of RAM.

Ugh.

> We want to find the memory hole (in the low 4GB region), and usually the
> e820 memory map should make that all happen properly. What does that
> report on this laptop?
>
> This is why we put the memory resources in /proc/iomem, and mark them
> busy: so that the PCI subsystem won't try to allocate PCI memory in the
> RAM (or ACPI reserved) area. The "pci_mem_start" thing is just a point to
> _start_ the allocation, the PCI subsystem still should honor the fact that
> we have memory above it. That's the whole point of doing proper resource
> allocation, after all.
>
> Does this not work, or have you disabled e820 for some reason?

The original problem was actually that grub was passing a bogus mem=
parameter to the kernel that was 4K too small, I guess because it was
intending to indicate the amount of "available" memory (the top 4K is
reserved for ACPI). If highmem had not been enabled, the above code
would have corrected the problem; but with highmem, the computed
low_mem_size was incorrect. I would say that grub is just broken and
is misusing the mem= parameter, but this has been a problem for years
and they don't seem interested in fixing it.

-- Dave
* Re: PCI memory allocation bug with CONFIG_HIGHMEM
From: Linus Torvalds @ 2004-01-06 1:11 UTC
To: David Hinds
Cc: linux-kernel, Amit, Russell King

On Mon, 5 Jan 2004, David Hinds wrote:
>
> The original problem was actually that grub was passing a bogus mem=
> parameter to the kernel that was 4K too small, I guess because it was
> intending to indicate the amount of "available" memory (the top 4K is
> reserved for ACPI). If highmem had not been enabled, the above code
> would have corrected the problem; but with highmem, the computed
> low_mem_size was incorrect. I would say that grub is just broken and
> is misusing the mem= parameter, but this has been a problem for years
> and they don't seem interested in fixing it.

Hmm.. I suspect that it might be ok to check "max_pfn" for being less than
4GB, and use that if so. Add something like

	if (max_pfn < 0x100000)
		if (pci_mem_start < (max_pfn << PAGE_SHIFT))
			pci_mem_start = max_pfn << PAGE_SHIFT;

to that sequence too..

I dunno. Ugly as hell. The basic issue is that if the kernel doesn't know
the RAM layout, there's no way it will get things right all the time, so
e820 or another "good" memory layout should really always be used.

"mem=xxx" really doesn't work too well on modern machines. The issue is
just too complex, with RAM that is reserved etc..

		Linus
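The extra check proposed in the message above can be tried out as a stand-alone function. This is only a sketch under stated assumptions: 4 KB pages (so page frame 0x100000 corresponds to 4 GB), and a hypothetical wrapper name, bump_past_max_pfn, that does not exist in the kernel.

```c
/* Assumption: 4 KB pages, so pfn 0x100000 is the 4 GB boundary. */
#define PAGE_SHIFT 12
#define PFN_4GB 0x100000ULL

/* Hypothetical wrapper around the suggested check: if all RAM ends
 * below 4 GB, make sure PCI allocation does not start below the top
 * of RAM. Above 4 GB the start value is left alone, because the
 * top-of-RAM address says nothing about where the low hole is. */
unsigned long long bump_past_max_pfn(unsigned long long pci_mem_start,
				     unsigned long long max_pfn)
{
	if (max_pfn < PFN_4GB) {
		unsigned long long top_of_ram = max_pfn << PAGE_SHIFT;

		if (pci_mem_start < top_of_ram)
			pci_mem_start = top_of_ram;
	}
	return pci_mem_start;
}
```

For the highmem laptop case with illustrative numbers (start at 0x38000000, max_pfn of 0x3ffff), this moves the start to 0x3ffff000, the top of the RAM the kernel was told about. Note that this is not rounded up to a 1 MB boundary, so a hidden ACPI page immediately above would still be exposed unless the rounding is applied as well.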
* Re: PCI memory allocation bug with CONFIG_HIGHMEM
From: Linus Torvalds @ 2004-01-06 1:41 UTC
To: David Hinds
Cc: linux-kernel, Amit, Russell King

On Mon, 5 Jan 2004, Linus Torvalds wrote:
>
> Hmm.. I suspect that it might be ok to check "max_pfn" for being less than
> 4GB, and use that if so. Add something like
>
> 	if (max_pfn < 0x100000)
> 		if (pci_mem_start < (max_pfn << PAGE_SHIFT))
> 			pci_mem_start = max_pfn << PAGE_SHIFT;

Actually, that would suck.

I think the proper fix would be to make the "mem=" stuff do the right
thing to the iomem_resource handling, and add the "round up" code there
too (and mark it as being reserved).

Basically, it shouldn't be impossible to get a "reasonably good" map from
"mem=xxxx" that would work more of the time. It wouldn't necessarily be
perfect, but it would be better than what we have now.

You can always use much more complicated "exactmap" stuff to really
generate a full e820 map, but I suspect nobody has ever done that in real
life. Something like

	mem=exactmap mem=0x9f000@0 mem=0xfe00000@0x100000 mem=0x100000$0xff00000

can be used to give you 255MB of RAM with the last 1MB marked as being
"reserved". Or it _should_ work that way. I've never used it myself ;)

Anyway, we could change what the "simple" form of "mem=xxx" means to
something that is more likely to have success. Anybody willing to look at
this?

		Linus
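To make the exactmap example above easier to read, here is a minimal sketch that decodes one value of the form size@start (usable RAM) or size$start (reserved) into a start address and size. It is an illustration only: struct region and parse_mem_region are hypothetical names, and unlike the kernel's own command-line parsing this sketch does not accept K/M/G size suffixes.

```c
#include <stdlib.h>

enum region_type { REGION_RAM, REGION_RESERVED };

/* Hypothetical decoded form of one exactmap entry. */
struct region {
	unsigned long long start;
	unsigned long long size;
	enum region_type type;
};

/* Parse "size@start" (RAM) or "size$start" (reserved), with numbers in
 * any base strtoull understands (0x prefix for hex). Returns 0 on
 * success and -1 on malformed input. */
int parse_mem_region(const char *arg, struct region *out)
{
	char *end;
	const char *start_str;
	unsigned long long size, start;

	size = strtoull(arg, &end, 0);
	if (end == arg)
		return -1;		/* no size digits */

	if (*end == '@')
		out->type = REGION_RAM;
	else if (*end == '$')
		out->type = REGION_RESERVED;
	else
		return -1;		/* missing or unknown separator */

	start_str = end + 1;
	start = strtoull(start_str, &end, 0);
	if (end == start_str || *end != '\0')
		return -1;		/* no start digits, or trailing junk */

	out->start = start;
	out->size = size;
	return 0;
}
```

Decoding the three entries from the message gives RAM at 0x0 to 0x9efff, RAM at 0x100000 to 0xfefffff, and a reserved megabyte at 0xff00000 to 0xfffffff, so the reserved block ends exactly at the 256 MB boundary.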
Thread overview: 41+ messages
[not found] <1aJdi-7TH-25@gated-at.bofh.it>
2004-01-06 3:32 ` PCI memory allocation bug with CONFIG_HIGHMEM Andi Kleen
2004-01-06 3:40 ` Linus Torvalds
2004-01-06 4:05 ` Andi Kleen
2004-01-06 5:04 ` Linus Torvalds
2004-01-06 8:12 ` Andi Kleen
2004-01-06 9:11 ` Mika Penttilä
2004-01-06 9:44 ` Andi Kleen
2004-01-06 10:16 ` Mika Penttilä
2004-01-06 10:49 ` Andi Kleen
2004-01-06 15:27 ` Linus Torvalds
2004-01-06 15:37 ` Andi Kleen
2004-01-06 15:48 ` Linus Torvalds
2004-01-06 22:29 ` Adam Belay
2004-01-07 4:06 ` Linus Torvalds
2004-01-07 5:02 ` Andi Kleen
2004-01-07 5:55 ` Dave Jones
2004-01-07 6:06 ` Linus Torvalds
2004-01-07 6:08 ` Dave Jones
2004-01-07 6:45 ` Linus Torvalds
2004-01-07 6:51 ` Andi Kleen
2004-01-07 2:43 ` Adam Belay
2004-01-07 8:32 ` Helge Hafting
2004-01-06 22:45 ` Eric W. Biederman
2004-01-07 0:06 ` Linus Torvalds
2004-01-07 4:58 ` Eric W. Biederman
2004-01-07 5:32 ` Linus Torvalds
2004-01-07 15:53 ` Eric W. Biederman
2004-01-07 16:32 ` Linus Torvalds
2004-01-07 17:32 ` Eric W. Biederman
2004-01-08 19:34 ` Eric W. Biederman
2004-01-07 9:31 ` Russell King
2004-01-07 15:06 ` Eric W. Biederman
2004-01-07 20:29 ` Russell King
2004-01-06 22:56 ` Eric W. Biederman
2004-01-05 20:07 David Hinds
2004-01-05 23:00 ` Russell King
2004-01-05 23:45 ` David Hinds
2004-01-06 0:36 ` Linus Torvalds
2004-01-06 0:44 ` David Hinds
2004-01-06 1:11 ` Linus Torvalds
2004-01-06 1:41 ` Linus Torvalds