PCI resource problems caused by improper address rounding

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* PCI resource problems caused by improper address rounding
@ 2007-12-18  0:25 Chuck Ebbert
  2007-12-18  0:57 ` Linus Torvalds
  2007-12-22  9:22 ` Andrew Morton
  0 siblings, 2 replies; 33+ messages in thread
From: Chuck Ebbert @ 2007-12-18  0:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ivan Kokshaysky, Linus Torvalds, Daniel Ritz, Greg KH

Looks like a commit that I can't find in git due to the arch merge
has broken PCI address assignment. This patch by Richard Henderson
against 2.6.23 fixes it for x86_64:

--- linux-2.6.23.x86_64/arch/x86_64/kernel/e820.c	2007-10-09 13:31:38.000000000 -0700
+++ linux-2.6.23.x86_64-rth/arch/x86_64/kernel/e820.c	2007-12-15 12:37:44.000000000 -0800
@@ -718,8 +718,8 @@ __init void e820_setup_gap(void)
 	while ((gapsize >> 4) > round)
 		round += round;
 	/* Fun with two's complement */
-	pci_mem_start = (gapstart + round) & -round;
+	pci_mem_start = (gapstart + round - 1) & -round;
 
 	printk(KERN_INFO "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",
 		pci_mem_start, gapstart, gapsize);


Here is the original changeset, taken from the Mercurial repo. It was
merged in 2.6.14:

# HG changeset patch
# User Daniel Ritz <daniel.ritz@gmx.ch>
# Date 1126304746 -700
# Node ID 51367d6e0b839be0b425a8f67c29f625b670f126
# Parent f4852c862b04efc9f8e2c7913191f5f7d140d895
[PATCH] Update PCI IOMEM allocation start

This fixes the problem with "Averatec 6240 pcmcia_socket0: unable to
apply power", which was due to the CardBus IOMEM register region being
allocated at an address that was actually inside the RAM window that had
been reserved for video frame-buffers in an UMA setup.

The BIOS _should_ have marked that region reserved in the e820 memory
descriptor tables, but did not.

It is fixed by rounding up the default starting address of PCI memory
allocations, so that we leave a bigger gap after the final known memory
location.  The amount of rounding depends on how big the unused memory
gap is that we can allocate IOMEM from.

Based on example code by Linus.

Acked-by: Greg KH <greg@kroah.com>
Acked-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

committer: Linus Torvalds <torvalds@g5.osdl.org> 1126304746 -0700


--- a/arch/i386/kernel/setup.c	Fri Sep 09 22:28:40 2005 +0011
+++ b/arch/i386/kernel/setup.c	Fri Sep 09 22:37:26 2005 +0011
@@ -1300,7 +1300,7 @@ legacy_init_iomem_resources(struct resou
  */
 static void __init register_memory(void)
 {
-	unsigned long gapstart, gapsize;
+	unsigned long gapstart, gapsize, round;
 	unsigned long long last;
 	int	      i;
 
@@ -1345,14 +1345,14 @@ static void __init register_memory(void)
 	}
 
 	/*
-	 * Start allocating dynamic PCI memory a bit into the gap,
-	 * aligned up to the nearest megabyte.
-	 *
-	 * Question: should we try to pad it up a bit (do something
-	 * like " + (gapsize >> 3)" in there too?). We now have the
-	 * technology.
+	 * See how much we want to round up: start off with
+	 * rounding to the next 1MB area.
 	 */
-	pci_mem_start = (gapstart + 0xfffff) & ~0xfffff;
+	round = 0x100000;
+	while ((gapsize >> 4) > round)
+		round += round;
+	/* Fun with two's complement */
+	pci_mem_start = (gapstart + round) & -round;
 
 	printk("Allocating PCI resources starting at %08lx (gap: %08lx:%08lx)\n",
 		pci_mem_start, gapstart, gapsize);
--- a/arch/x86_64/kernel/e820.c	Fri Sep 09 22:28:40 2005 +0011
+++ b/arch/x86_64/kernel/e820.c	Fri Sep 09 22:37:26 2005 +0011
@@ -567,7 +567,7 @@ unsigned long pci_mem_start = 0xaeedbabe
  */
 __init void e820_setup_gap(void)
 {
-	unsigned long gapstart, gapsize;
+	unsigned long gapstart, gapsize, round;
 	unsigned long last;
 	int i;
 	int found = 0;
@@ -604,14 +604,14 @@ __init void e820_setup_gap(void)
 	}
 
 	/*
-	 * Start allocating dynamic PCI memory a bit into the gap,
-	 * aligned up to the nearest megabyte.
-	 *
-	 * Question: should we try to pad it up a bit (do something
-	 * like " + (gapsize >> 3)" in there too?). We now have the
-	 * technology.
+	 * See how much we want to round up: start off with
+	 * rounding to the next 1MB area.
 	 */
-	pci_mem_start = (gapstart + 0xfffff) & ~0xfffff;
+	round = 0x100000;
+	while ((gapsize >> 4) > round)
+		round += round;
+	/* Fun with two's complement */
+	pci_mem_start = (gapstart + round) & -round;
 
 	printk(KERN_INFO "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",
 		pci_mem_start, gapstart, gapsize);

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-18  0:25 PCI resource problems caused by improper address rounding Chuck Ebbert
@ 2007-12-18  0:57 ` Linus Torvalds
  2007-12-18 17:34   ` Chuck Ebbert
  2007-12-22  9:22 ` Andrew Morton
  1 sibling, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2007-12-18  0:57 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: linux-kernel, Ivan Kokshaysky, Daniel Ritz, Greg KH

On Mon, 17 Dec 2007, Chuck Ebbert wrote:
>
> Looks like a commit that I can't find in git due to the arch merge
> has broken PCI address assignment. This patch by Richard Henderson
> against 2.6.23 fixes it for x86_64:
> 
> --- linux-2.6.23.x86_64/arch/x86_64/kernel/e820.c	2007-10-09 13:31:38.000000000 -0700
> +++ linux-2.6.23.x86_64-rth/arch/x86_64/kernel/e820.c	2007-12-15 12:37:44.000000000 -0800
> @@ -718,8 +718,8 @@ __init void e820_setup_gap(void)
>  	while ((gapsize >> 4) > round)
>  		round += round;
>  	/* Fun with two's complement */
> -	pci_mem_start = (gapstart + round) & -round;
> +	pci_mem_start = (gapstart + round - 1) & -round;

No, it's very much meant to be that way.

We do *not* want to have the PCI memory abutthe end of memory exactly. So 
it leaves a gap in between "gapstart" and the actual start of PCI memory 
addressing very much on purpose.

In fact, the very commit (it's f0eca9626c6becb6fc56106b2e4287c6c784af3d in 
the kernel tree) you mention actually explicitly *explains* that, although 
maybe it's a bit indirect: if you start allocating PCI resources directly 
after the end-of-RAM thing, you can easily end up using addresses that are 
actually inside the magic stolen system RAM that is being used for UMA 
video etc.

So you very much want to have a buffer in between the end-of-RAM and the 
actual start of the region we try to allocate in. 

So why do you want them to be close, anyway? 

		Linus

PS. On a different topic: if you do

	git log --follow arch/x86/kernel/e820_64.c

you'd see the history past the renames in git. Or just do a "git blame -C" 
which will also follow renames (and copies).

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-18  0:57 ` Linus Torvalds
@ 2007-12-18 17:34   ` Chuck Ebbert
  2007-12-18 18:21     ` Linus Torvalds
  0 siblings, 1 reply; 33+ messages in thread
From: Chuck Ebbert @ 2007-12-18 17:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Ivan Kokshaysky, Daniel Ritz, Greg KH

On 12/17/2007 07:57 PM, Linus Torvalds wrote:
> 
> On Mon, 17 Dec 2007, Chuck Ebbert wrote:
>> Looks like a commit that I can't find in git due to the arch merge
>> has broken PCI address assignment. This patch by Richard Henderson
>> against 2.6.23 fixes it for x86_64:
>>
>> --- linux-2.6.23.x86_64/arch/x86_64/kernel/e820.c	2007-10-09 13:31:38.000000000 -0700
>> +++ linux-2.6.23.x86_64-rth/arch/x86_64/kernel/e820.c	2007-12-15 12:37:44.000000000 -0800
>> @@ -718,8 +718,8 @@ __init void e820_setup_gap(void)
>>  	while ((gapsize >> 4) > round)
>>  		round += round;
>>  	/* Fun with two's complement */
>> -	pci_mem_start = (gapstart + round) & -round;
>> +	pci_mem_start = (gapstart + round - 1) & -round;
> 
> No, it's very much meant to be that way.
> 
> We do *not* want to have the PCI memory abutthe end of memory exactly. So 
> it leaves a gap in between "gapstart" and the actual start of PCI memory 
> addressing very much on purpose.
> 
> In fact, the very commit (it's f0eca9626c6becb6fc56106b2e4287c6c784af3d in 
> the kernel tree) you mention actually explicitly *explains* that, although 
> maybe it's a bit indirect: if you start allocating PCI resources directly 
> after the end-of-RAM thing, you can easily end up using addresses that are 
> actually inside the magic stolen system RAM that is being used for UMA 
> video etc.
> 
> So you very much want to have a buffer in between the end-of-RAM and the 
> actual start of the region we try to allocate in. 
> 
> So why do you want them to be close, anyway? 
> 

Because otherwise some video adapters with 256MB of memory end up with their
resources allocated above 4GB, and that doesn't work very well.

https://bugzilla.redhat.com/show_bug.cgi?id=425794#c0

> 
> PS. On a different topic: if you do
> 
> 	git log --follow arch/x86/kernel/e820_64.c
> 
> you'd see the history past the renames in git. Or just do a "git blame -C" 
> which will also follow renames (and copies).

The history in the web interface just ends at the rename.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-18 17:34   ` Chuck Ebbert
@ 2007-12-18 18:21     ` Linus Torvalds
  2007-12-18 20:22       ` Richard Henderson
  0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2007-12-18 18:21 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: linux-kernel, Ivan Kokshaysky, Daniel Ritz, Greg KH,
	Richard Henderson

On Tue, 18 Dec 2007, Chuck Ebbert wrote:
> > 
> > So why do you want them to be close, anyway? 
> 
> Because otherwise some video adapters with 256MB of memory end up with their
> resources allocated above 4GB, and that doesn't work very well.
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=425794#c0

That bugzilla entry doesn't even have a dmesg output or anything like 
that. I'd really like to see what the 

The fact is, that patch is not safe. We very much _want_ to make the PCI 
region allocator use a window that is in the *middle* of the gap, and not 
close to either end of the gap, and the code literally tries to make the 
default start of the PCI allocation gap start be about 1/16th of the 
actual gap size in question, so that we don't hit BIOS allocations that 
it didn't tell us about by mistake.

But without dmesg and lspci output to see what the actual allocations are, 
there's no way to even _guess_ at whether there is a correct fix or not, 
just the fix that totally misses the point of having any rounding-up at 
all.

That patch might as well just do

	pci_mem_start = gapstart;

and get rid of all that rounding code entirely, since the patch just 
assumes that it's safe to use memory after gapstart (which is known to not 
be true, and is the whole and only reason for that code in the first 
place: BIOSes *invariably* get resource allocation wrong, and forget to 
tell us about some resource they set up).

Now, it's entirely possible that the only reasonable end result is that we 
do have to avoid rounding up that far, but I definitely want to see what 
the actual resource situation is - that patch is *not* obviously correct, 
and it definitely breaks the whole point of the code. 

The *other* patch in the bugzilla entry seems more correct, in that yes, 
we should make sure that we don't allocate resources over 4G if the 
resource won't fit. That said, I think that patch is wrong too: we should 
just fix pcibios_align_resource() to check that case for MEM resouces (the 
same way it already knows about the magic rules for IO resources).

So I'd suggest just fixing pcibios_align_resource() instead. Something 
like the appended might work (and then you could perhaps quirk it to 
always clear the PCI_BASE_ADDRESS_MEM_TYPE_64 thing for VGA controllers, 
although I really don't think the kernel is the right place to do that, 
and that would be an X server issue!).

NOTE! This patch is an independent issue of the whole "what window do we 
use to allocate new resources, and how do we align it" thing.

			Linus

---
 arch/x86/pci/i386.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
index 42ba0e2..abc642b 100644
--- a/arch/x86/pci/i386.c
+++ b/arch/x86/pci/i386.c
@@ -70,6 +70,20 @@ pcibios_align_resource(void *data, struct resource *res,
 			start = (start + 0x3ff) & ~0x3ff;
 			res->start = start;
 		}
+	} else {
+		u64 max;
+		switch (res->flags & PCI_BASE_ADDRESS_MEM_MASK) {
+		case PCI_BASE_ADDRESS_MEM_TYPE_1M:
+			max = 0xfffff;
+			break;
+		case PCI_BASE_ADDRESS_MEM_TYPE_64:
+			max = -1;
+			break;
+		default:
+			max = 0xffffffff;
+		}
+		if (res->start > max)
+			res->start = res->end;
 	}
 }

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-18 18:21     ` Linus Torvalds
@ 2007-12-18 20:22       ` Richard Henderson
  2007-12-18 21:09         ` Linus Torvalds
                           ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Richard Henderson @ 2007-12-18 20:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chuck Ebbert, linux-kernel, Ivan Kokshaysky, Daniel Ritz, Greg KH

On Tue, Dec 18, 2007 at 10:21:50AM -0800, Linus Torvalds wrote:
> > https://bugzilla.redhat.com/show_bug.cgi?id=425794#c0
> 
> That bugzilla entry doesn't even have a dmesg output or anything like 
> that. I'd really like to see what the 

I've added dmesg, /proc/iomem, and lspci -v output to that bug.

Basically, we have

	c0000000-cfffffff : free
	ddf00000-dfefffff : PCI Bus #04
	e0000000-efffffff : pnp 00:0b
	f0000000-fedfffff : less than 256MB

The annoying part is that there's no device (that I can see) 
behind PCI Bus #04, so it might as well be disabled and that
entire d0000000-dfffffff area reclaimed.

> That patch might as well just do
> 
> 	pci_mem_start = gapstart;
> 
> and get rid of all that rounding code entirely, since the patch just 
> assumes that it's safe to use memory after gapstart (which is known to not 
> be true, and is the whole and only reason for that code in the first 
> place: BIOSes *invariably* get resource allocation wrong, and forget to 
> tell us about some resource they set up).

That would have been an excellent comment to add to that code then,
rather than just "rounding up to the next 1MB area", because purely
as rounding code it is erroneous.

> The *other* patch in the bugzilla entry seems more correct, in that yes, 
> we should make sure that we don't allocate resources over 4G if the 
> resource won't fit. That said, I think that patch is wrong too: we should 
> just fix pcibios_align_resource() to check that case for MEM resouces (the 
> same way it already knows about the magic rules for IO resources).

I'll give that patch a try, modified a tad to still include the
force_32_bit quirk.

> So I'd suggest just fixing pcibios_align_resource() instead. Something 
> like the appended might work (and then you could perhaps quirk it to 
> always clear the PCI_BASE_ADDRESS_MEM_TYPE_64 thing for VGA controllers, 

That won't work, because PCI_BASE_ADDRESS_MEM_TYPE_64 controls how
many bits need to be written back to the BAR.  If we changed that
to PCI_BASE_ADDRESS_MEM_TYPE_32, we wouldn't clear the high 32-bits
of the BAR.

> ... and that would be an X server issue!).

Of course, fixing the X server to *handle* 64-bit BARs is the correct
solution.  I've no idea how involved that is, but I have a sneeking
suspicion that it uses that damned CARD32 datatype for everything.

r~

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-18 20:22       ` Richard Henderson
@ 2007-12-18 21:09         ` Linus Torvalds
  2007-12-18 21:46           ` Chuck Ebbert
                             ` (3 more replies)
  2007-12-18 21:23         ` Ivan Kokshaysky
  2007-12-20 21:10         ` Benjamin Herrenschmidt
  2 siblings, 4 replies; 33+ messages in thread
From: Linus Torvalds @ 2007-12-18 21:09 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Chuck Ebbert, linux-kernel, Ivan Kokshaysky, Daniel Ritz, Greg KH,
	Keith Packard, Bjorn Helgaas

On Tue, 18 Dec 2007, Richard Henderson wrote:
> 
> I've added dmesg, /proc/iomem, and lspci -v output to that bug.
> 
> Basically, we have
> 
> 	c0000000-cfffffff : free
> 	ddf00000-dfefffff : PCI Bus #04
> 	e0000000-efffffff : pnp 00:0b
> 	f0000000-fedfffff : less than 256MB

Gaah. 

That really is very unlucky. That 256M only goes at one point in the low 
4GB, but the thing is, it fits perfectly well above it, and dammit, that 
resource is explicitly a 64-bit resource or a really good reason. 

However, I wonder about that

	e0000000-efffffff : pnp 00:0b

thing. I actually suspect that that whole allocation is literally *meant* 
for that 256MB graphics aperture, but the kernel explicitly avoids it 
because it's listed in the PnP tables.

I wonder what the heck is the point of that pnp entry. Just for fun, can 
you try to just disable CONFIG_PNP, and see if it all works then?

Björn Helgaas added to Cc to clarify what those pnp entries tend to mean, 
and whether there is possibly some way to match up a specific pnp entry 
with the PCI device that might want to use it. Because that is a nice 
256MB region that really doesn't seem to make sense for anything else than 
the graphics buffer - there's nothing else in your system that seems 
likely (although I guess it could be for some docking port, but even then 
I'd have expected one of the PCI bridges to map it!)

But apart from the question about that pnp 00:0b device, the kernel 
resource allocation really does look perfectly fine, and while we could 
shoe-horn it into the low 4GB in this case by just hoping that there is 
nothing undocumented there (and there probably isn't), it's really 
annoying considering that big graphics areas are a hell of a good reason 
to use those 64-bit resources.

It's not like 256MB is even as large as they come, half-gig graphics cards 
are getting to be fairly common at the high end, and X absolutely _has_ to 
be able to handle a 64-bit address for those. 

Also, I'm surprised it doesn't work with X already: the ChangeLog for X 
says that there are "Minor fixes to the handling of 64-bit PCI BARs [..]" 
in 4.6.99.18, so I'd have assumed that XFree86-4.7.0 should be able to 
handle this perfectly well.

I'll add Keithp to the cc too, to see if the X issues can be clarified. 
Maybe he can set us right. But maybe you just have an old X server? If so, 
considering the situation, I really think the kernel has done a good job 
already, and I'd be *very* nervous about making the kernel allocate new 
PCI resources right after the end-of-memory thing.

I bet it would work in this case, but as mentioned, we definitely know of 
cases where the BIOS did *not* document the magic memory region that was 
stolen for UMA graphics, and trying to put PCI devices just after the top 
of reserved memory in the e820 list causes machines to not work at all 
because the address decoding will clash.

Of course, we could also make the minimum address more of a *hint*, and 
only make the resource allocator only abut the top-of-known-memory when it 
absolutely has to, but on the other hand, in this case it really doesn't 
have to, since there's just _tons_ of space for 64-bit resources. So the 
correct thing really does seem to be to just use the 64-bit hw that is 
there.

> That would have been an excellent comment to add to that code then,
> rather than just "rounding up to the next 1MB area", because purely
> as rounding code it is erroneous.

Patches to add comments are welcome. There are few enough people who 
actually work on the PCI resource allocation code these days (I wish there 
were more), and it's very rare that anybody else than me or Ivan ends up 
even *looking* at it. So it's not been a big issue.

				Linus

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-18 21:09         ` Linus Torvalds
@ 2007-12-18 21:46           ` Chuck Ebbert
  2007-12-18 21:56             ` Linus Torvalds
  2007-12-18 22:17             ` Richard Henderson
  2007-12-18 21:51           ` Richard Henderson
                             ` (2 subsequent siblings)
  3 siblings, 2 replies; 33+ messages in thread
From: Chuck Ebbert @ 2007-12-18 21:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Richard Henderson, linux-kernel, Ivan Kokshaysky, Daniel Ritz,
	Greg KH, Keith Packard, Bjorn Helgaas

On 12/18/2007 04:09 PM, Linus Torvalds wrote:
> 
> I wonder what the heck is the point of that pnp entry. Just for fun, can 
> you try to just disable CONFIG_PNP, and see if it all works then?
> 

pnpacpi=off should work.

PnP is also trying (and failing) to reserve all physical memory.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-18 21:46           ` Chuck Ebbert
@ 2007-12-18 21:56             ` Linus Torvalds
  2007-12-18 22:17             ` Richard Henderson
  1 sibling, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2007-12-18 21:56 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Richard Henderson, linux-kernel, Ivan Kokshaysky, Daniel Ritz,
	Greg KH, Keith Packard, Bjorn Helgaas

On Tue, 18 Dec 2007, Chuck Ebbert wrote:

> On 12/18/2007 04:09 PM, Linus Torvalds wrote:
> > 
> > I wonder what the heck is the point of that pnp entry. Just for fun, can 
> > you try to just disable CONFIG_PNP, and see if it all works then?
> 
> pnpacpi=off should work.
> 
> PnP is also trying (and failing) to reserve all physical memory.

Yeah, that really is a pretty confused-looking pnp table thing. But I have 
absolutely zero idea how PnP is even supposed to work - the whole thing is 
just a total hack for Windows, afaik.

The sad part is that *normally* the right thing to do about almost any 
BIOS information is what we do right now: just avoid that magic address 
range like the plague, because we have no clue what the heck the BIOS is 
up to. But it looks like in this particular case, some of the problems 
may arise exactly *because* we avoid that range.

It would be good to know what Windows does. If ACPI is found, does it 
perhaps just ignore all the PnP entries these days?

			Linus

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-18 21:46           ` Chuck Ebbert
  2007-12-18 21:56             ` Linus Torvalds
@ 2007-12-18 22:17             ` Richard Henderson
  1 sibling, 0 replies; 33+ messages in thread
From: Richard Henderson @ 2007-12-18 22:17 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Linus Torvalds, linux-kernel, Ivan Kokshaysky, Daniel Ritz,
	Greg KH, Keith Packard, Bjorn Helgaas

On Tue, Dec 18, 2007 at 04:46:09PM -0500, Chuck Ebbert wrote:
> pnpacpi=off should work.

This does result in the graphics bar being placed at e0000000,
and does result in a system lockup when X starts.  So it appears
as if there's really something there.


r~

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-18 21:09         ` Linus Torvalds
  2007-12-18 21:46           ` Chuck Ebbert
@ 2007-12-18 21:51           ` Richard Henderson
  2007-12-18 22:31             ` Linus Torvalds
  2007-12-18 22:16           ` Keith Packard
  2007-12-19  0:29           ` Bjorn Helgaas
  3 siblings, 1 reply; 33+ messages in thread
From: Richard Henderson @ 2007-12-18 21:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chuck Ebbert, linux-kernel, Ivan Kokshaysky, Daniel Ritz, Greg KH,
	Keith Packard, Bjorn Helgaas

On Tue, Dec 18, 2007 at 01:09:15PM -0800, Linus Torvalds wrote:
> However, I wonder about that
> 
> 	e0000000-efffffff : pnp 00:0b
> 
> thing. I actually suspect that that whole allocation is literally *meant* 
> for that 256MB graphics aperture, but the kernel explicitly avoids it 
> because it's listed in the PnP tables.

I assumed it was reserved for the pccard thing, which I don't see
listed or allocated anywhere else.  

> I wonder what the heck is the point of that pnp entry. Just for fun, can 
> you try to just disable CONFIG_PNP, and see if it all works then?

I'll try that, for grins.

> I'll add Keithp to the cc too, to see if the X issues can be clarified. 
> Maybe he can set us right. But maybe you just have an old X server?

I've got xorg-x11-server-Xorg-1.3.0.0-36.fc8 installed.  I wouldn't
have thought that was too old, since Fedora 8 just came out, but it's
not like I keep up on these things.  I'll give 1.4.99 a try, as that's
what's current in Rawhide.

> Of course, we could also make the minimum address more of a *hint*, and 
> only make the resource allocator only abut the top-of-known-memory when it 
> absolutely has to....

Another way to look at this is that the graphics BAR came in from
the BIOS allocated at c0000000, and we ignored that.  Perhaps there's
a way to give weight to the BIOS settings when consdering where the
PCI region is supposed to start?

On that system for which there was undeclared resources, did the BIOS
avoid that resource for the other PCI resources?  I suspect it did...

r~

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-18 21:51           ` Richard Henderson
@ 2007-12-18 22:31             ` Linus Torvalds
  2007-12-19  1:38               ` Linus Torvalds
  0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2007-12-18 22:31 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Chuck Ebbert, linux-kernel, Ivan Kokshaysky, Daniel Ritz, Greg KH,
	Keith Packard, Bjorn Helgaas

On Tue, 18 Dec 2007, Richard Henderson wrote:
> 
> Another way to look at this is that the graphics BAR came in from
> the BIOS allocated at c0000000, and we ignored that.

We did?

> Perhaps there's a way to give weight to the BIOS settings when 
> consdering where the PCI region is supposed to start?

Normally, we *always* keep the BIOS allocations, unless it explicitly 
clashes with something and we have reason to believe that they cannot 
work. And there's nothing it clashes with, so we definitely *should* have 
kept it.

Why do you think it came pre-allocated at 0xc0000000? I'm seeing the 
message 

   PCI: Cannot allocate resource region 1 of device 0000:01:00.0

and I can well imagine that that is it, but if it was a valid allocation, 
then we really should have kept it there.

That question also brings up another issue: how come did we actually 
choose address 0xc0000000 with the original patch you sent in? If we can't 
find it in the parent resources, we shouldn't have accepted it even if it 
had room for it!

Which brings up *another* potential fix for this thing: as mentioned, 
Intel bridges often claim to be "Normal decode", but the core ones seem to 
almost never actually be that, and they tend to be "Negative decode". So 
what may be going on is:

 - the kernel sees that BIOS allocation at 0xc0000000 (I'll take your word 
   for it, it doesn't actually say so without PCI debugging enabled)

 - ...and notices that the PCI BAR is behind a PCI bridge that does
   not claim to be able to actually bridge that resource (it's normal 
   decode, and the ranges it *does* decode are elsewhere!)

 - so clearly that old allocation is pure crap and has to be re-done in a 
   range that is actually properly bridged.

but that decision bases itself on the Intel bridge not lying, and if it 
turns out that the bridge at 0000:00:01.0 actually is transparent, then 
the original allocation would have been ok.

That said, your bridge at 00:1e.0 is *also* transparent, and it's actually 
against the PCI specs to have two transparent bridges on the same PCI bus, 
so I'm a bit surprised about that. But it does bring up a new thing you 
could *try*, namely this patch...

(You obviously have to replace "insert-your-device-here" with the proper 
PCI device ID for the thing you have - your lspci output only gives the 
name, not the numbers)

We seem to have a multitude of possible reasons for this insanity. It 
would be interesting to hear which one(s) of the possibilities make a 
difference, if any.

			Linus

---
 drivers/pci/quirks.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 26cc4dc..c3b52f5 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -820,6 +820,7 @@ static void __devinit quirk_transparent_bridge(struct pci_dev *dev)
 {
 	dev->transparent = 1;
 }
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL,	insert-your-device-id-here,	quirk_transparent_bridge );
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL,	PCI_DEVICE_ID_INTEL_82380FB,	quirk_transparent_bridge );
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_TOSHIBA,	0x605,	quirk_transparent_bridge );

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-18 22:31             ` Linus Torvalds
@ 2007-12-19  1:38               ` Linus Torvalds
  2007-12-20 21:52                 ` Richard Henderson
  0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2007-12-19  1:38 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Chuck Ebbert, linux-kernel, Ivan Kokshaysky, Daniel Ritz, Greg KH,
	Keith Packard, Bjorn Helgaas

On Tue, 18 Dec 2007, Linus Torvalds wrote:
> 
> That question also brings up another issue: how come did we actually 
> choose address 0xc0000000 with the original patch you sent in? If we can't 
> find it in the parent resources, we shouldn't have accepted it even if it 
> had room for it!

That

	PCI: Cannot allocate resource region 9 of bridge 0000:00:01.0
	PCI: Cannot allocate resource region 1 of device 0000:01:00.0

thing is really starting to bug me.

I bet that is the real problem here, but it's not printing out enough 
information about the resource to actually give us much of a clue about 
what is wrong.

I suspect that it had a bridge mapping (device 0:01.0) that included the 
range from 0xc0000000 to 0xcfffffff, but there was something stupid wrong 
with it (eg the BIOS had allocated overlapping regions), so we disabled 
it. That, in turn, then caused us to also refuse the existing 0xc0000000 
mapping for the graphics card (device 01:00.0), because now there was no 
valid resource for it.

But that PCI bridge resource handling happens even *before* we look at any 
PnP reserved areas (because we - for really good reasons - trust the 
hardware a _lot_ more than we trust any idiotic firmware tables), so I 
wonder what that strange PCI bridge mapping in 00:01.0 was - it must have 
been _really_ off in order to not fit in the resource tree.

Could you just make it print out what the bridge resources are when it 
scans them? Something like the appended..

		Linus

---
diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
index 42ba0e2..37c4b92 100644
--- a/arch/x86/pci/i386.c
+++ b/arch/x86/pci/i386.c
@@ -117,11 +117,16 @@ static void __init pcibios_allocate_bus_resources(struct list_head *bus_list)
 	/* Depth-First Search on bus tree */
 	list_for_each_entry(bus, bus_list, node) {
 		if ((dev = bus->self)) {
+			printk(KERN_DEBUG "PCI: Bridge %s\n", pci_name(dev));
 			for (idx = PCI_BRIDGE_RESOURCES;
 			    idx < PCI_NUM_RESOURCES; idx++) {
 				r = &dev->resource[idx];
 				if (!r->flags)
 					continue;
+				printk(KERN_DEBUG "PCI: Bridge resource "
+					"%08llx-%08llx (f=%lx)\n",
+					r->start, r->end, r->flags);
+
 				pr = pci_find_parent_resource(dev, r);
 				if (!r->start || !pr ||
 				    request_resource(pr, r) < 0) {

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-19  1:38               ` Linus Torvalds
@ 2007-12-20 21:52                 ` Richard Henderson
  2007-12-20 22:24                   ` Linus Torvalds
  0 siblings, 1 reply; 33+ messages in thread
From: Richard Henderson @ 2007-12-20 21:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chuck Ebbert, linux-kernel, Ivan Kokshaysky, Daniel Ritz, Greg KH,
	Keith Packard, Bjorn Helgaas

On Tue, Dec 18, 2007 at 05:38:58PM -0800, Linus Torvalds wrote:
> That
> 
> 	PCI: Cannot allocate resource region 9 of bridge 0000:00:01.0
> 	PCI: Cannot allocate resource region 1 of device 0000:01:00.0
> 
> thing is really starting to bug me.
> 
> I bet that is the real problem here, but it's not printing out enough 
> information about the resource to actually give us much of a clue about 
> what is wrong.
> 
> I suspect that it had a bridge mapping (device 0:01.0) that included the 
> range from 0xc0000000 to 0xcfffffff, but there was something stupid wrong 
> with it (eg the BIOS had allocated overlapping regions), so we disabled 
> it. That, in turn, then caused us to also refuse the existing 0xc0000000 
> mapping for the graphics card (device 01:00.0), because now there was no 
> valid resource for it.

That is exactly it.  The relevant section of the debug info is

PCI: Bridge 0000:00:01.0
PCI: Bridge resource 7 00008000-00008fff (%f=100)
PCI: Bridge resource 8 f7d00000-fddfffff (%f=200)
PCI: Bridge resource 9 bdf00000-ddefffff (%f=1201)

The bridge was assigned to a piece of the end of physical memory.  



r~

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-20 21:52                 ` Richard Henderson
@ 2007-12-20 22:24                   ` Linus Torvalds
  2007-12-21  0:39                     ` Richard Henderson
  2007-12-21  2:28                     ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 33+ messages in thread
From: Linus Torvalds @ 2007-12-20 22:24 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Chuck Ebbert, linux-kernel, Ivan Kokshaysky, Daniel Ritz, Greg KH,
	Keith Packard, Bjorn Helgaas

On Thu, 20 Dec 2007, Richard Henderson wrote:

> On Tue, Dec 18, 2007 at 05:38:58PM -0800, Linus Torvalds wrote:
> > That
> > 
> > 	PCI: Cannot allocate resource region 9 of bridge 0000:00:01.0
> > 	PCI: Cannot allocate resource region 1 of device 0000:01:00.0
> > 
> > thing is really starting to bug me.
> > 
> > I bet that is the real problem here, but it's not printing out enough 
> > information about the resource to actually give us much of a clue about 
> > what is wrong.
> > 
> > I suspect that it had a bridge mapping (device 0:01.0) that included the 
> > range from 0xc0000000 to 0xcfffffff, but there was something stupid wrong 
> > with it (eg the BIOS had allocated overlapping regions), so we disabled 
> > it. That, in turn, then caused us to also refuse the existing 0xc0000000 
> > mapping for the graphics card (device 01:00.0), because now there was no 
> > valid resource for it.
> 
> That is exactly it.  The relevant section of the debug info is
> 
> PCI: Bridge 0000:00:01.0
> PCI: Bridge resource 7 00008000-00008fff (%f=100)
> PCI: Bridge resource 8 f7d00000-fddfffff (%f=200)
> PCI: Bridge resource 9 bdf00000-ddefffff (%f=1201)
> 
> The bridge was assigned to a piece of the end of physical memory.  

Oh, wow. That's just really bogus. So the kernel message about

	PCI: Cannot allocate resource region 9 of bridge 0000:00:01.0

was perfectly fine, and we did absolutely the right thing.

But it also says that if the graphics adaptor really had a resource mapped 
at 0xc0000000 - 0xcfffffff by the BIOS, then that mapping never worked at 
all, since it never had any bridge mapping it could rely on. So our 
decision to unmap that one as invalid was _also_ right.

Damn. Very irritating.

You know what? I think this simple (BUT TOTALLY UNTESTED!) patch will get 
your case right, and I think it is preferable to just always lowering the 
"minimum" starting point.

What it does is to just take the minimum PCI address for new allocations 
(which is only used for the case where we don't have an explicit starting 
point for the parent bus anyway!), and just saying "we'll always align it 
down to the required alignment of the allocation".

I'm not exactly 100% happy with it, but it does mean that if we need a big 
area, we'll relax the suggested starting point by that amount. It's not 
wonderful, but it essentially admits that the minimum for the allocations 
is really just a hint, and if we need lots of space for a resource, we'll 
relax the minimum point appropriately.

So in your case, it should *result* in the exact same situation that your 
patch did, but at the same time, when dealing with the (more common) case 
of smaller allocations, we still continue to try to avoid being too close 
to the top-of-memory.

So it's not perfect, but perhaps it is a good compromise between being 
careful and having to make room?

Does this work for your case?

		Linus

---
 drivers/pci/bus.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 9e5ea07..d48d270 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -61,7 +61,7 @@ pci_bus_alloc_resource(struct pci_bus *bus, struct resource *res,

 		/* Ok, try it out.. */
 		ret = allocate_resource(r, res, size,
-					r->start ? : min,
+					r->start ? : min & -align,
 					-1, align,
 					alignf, alignf_data);
 		if (ret == 0)

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-20 22:24                   ` Linus Torvalds
@ 2007-12-21  0:39                     ` Richard Henderson
  2007-12-21  1:00                       ` Linus Torvalds
  2007-12-21  2:28                     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 33+ messages in thread
From: Richard Henderson @ 2007-12-21  0:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chuck Ebbert, linux-kernel, Ivan Kokshaysky, Daniel Ritz, Greg KH,
	Keith Packard, Bjorn Helgaas

On Thu, Dec 20, 2007 at 02:24:48PM -0800, Linus Torvalds wrote:
> I'm not exactly 100% happy with it, but it does mean that if we need a big 
> area, we'll relax the suggested starting point by that amount. It's not 
> wonderful, but it essentially admits that the minimum for the allocations 
> is really just a hint, and if we need lots of space for a resource, we'll 
> relax the minimum point appropriately.

This breaks in odd cases where the amount of memory in the system
is not a nice round number.  Like throwing two 128MB sticks into
a system that already has 2gb.  A 512MB allocation will get placed
back at 2gb, on top of the end of ram.  In order to get this kind
of thing to work, you'd have to have a hard and a soft minimum.

Even then, any random large allocation is going to ignore that
buffer that you added.  It'd be better if we could still tie this
ignoring of the buffer to whether the bios placed the resource
there in the first place.

Perhaps this is one of those things that just aren't going to be
solved properly without an xserver upgrade...

r~

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-21  0:39                     ` Richard Henderson
@ 2007-12-21  1:00                       ` Linus Torvalds
  0 siblings, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2007-12-21  1:00 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Chuck Ebbert, linux-kernel, Ivan Kokshaysky, Daniel Ritz, Greg KH,
	Keith Packard, Bjorn Helgaas

On Thu, 20 Dec 2007, Richard Henderson wrote:
>
> This breaks in odd cases where the amount of memory in the system
> is not a nice round number.  Like throwing two 128MB sticks into
> a system that already has 2gb.  A 512MB allocation will get placed
> back at 2gb, on top of the end of ram.

No, no, you misunderstand.

The kernel *always* takes known memory allocations into account. The 
"minimum PCI starting allocation" value is not there to protect memory we 
know about: the resource management already does that!

So if you have real memory of 2GB+128MB, and you want a 512MB allocation, 
then yes, maybe the "preferred starting point" would be rounded back down 
to 2GB, but the resource allocator would still take known resources into 
account, and skip that address as being a conflict, and then try the next 
address that suits the alignment requirements, and try to see if there's a 
big enough hole at the 2.5GB mark.

So it would all work fine.

The reason we have that "min" parameter is not because of those _known_ 
resources, it's exactly because we have been bitten too many times by 
BIOSes that lay out magic undocumented resources in memory that we simply 
don't know about, because they aren't standard BAR resources, but some 
other special magic stuff. Things like the special ACPI areas that the 
northbridge recognizes, but aren't exposed as regular BAR's, but as just 
magic registers hidden in some undocumented NB register space.

So the reason we have those PCIBIOS_MIN_IO/MEM things is not because we'd 
trample on top of memory without them, it's because we might trample the 
BIOS resources that it never told us about!

Quite often, that's things like stolen RAM that doesn't show up in the 
e820 tables (it *should* show up as "reserved", but BIOS writers are 
generally incompetent drug-addicts picked up from the streets, who just 
randomly change BIOS tables until Windows boots on the machine), or the 
afore-mentioned magic IO registers for some special motherboard resource.

> In order to get this kind of thing to work, you'd have to have a hard 
> and a soft minimum.

We do have that "hard limit" - the resource management keeps track of all 
the resources it already knows about. The "soft limit" is exactly that 
PCIBIOS_MIN_MEM (which on a PC is that "pci_mem_start" variable). It's 
just a hint, but it's a pretty important one, exactly because we've been 
burned so many times by crap firmware and undocumented memory and MMIO 
ranges.

				Linus

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-20 22:24                   ` Linus Torvalds
  2007-12-21  0:39                     ` Richard Henderson
@ 2007-12-21  2:28                     ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 33+ messages in thread
From: Benjamin Herrenschmidt @ 2007-12-21  2:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Richard Henderson, Chuck Ebbert, linux-kernel, Ivan Kokshaysky,
	Daniel Ritz, Greg KH, Keith Packard, Bjorn Helgaas


> So in your case, it should *result* in the exact same situation that your 
> patch did, but at the same time, when dealing with the (more common) case 
> of smaller allocations, we still continue to try to avoid being too close 
> to the top-of-memory.
> 
> So it's not perfect, but perhaps it is a good compromise between being 
> careful and having to make room?
> 
> Does this work for your case?

I'm not totally happy with changing the generic code like that, to
possibly not enforce "min" anymore. Other archs may have very good
reasons to provide a min value here... Though at the same time, at
least on powerpc, the parent resource of the host bridge will be the
real limit, so that may not be a big issue.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-18 21:09         ` Linus Torvalds
  2007-12-18 21:46           ` Chuck Ebbert
  2007-12-18 21:51           ` Richard Henderson
@ 2007-12-18 22:16           ` Keith Packard
  2007-12-19  0:29           ` Bjorn Helgaas
  3 siblings, 0 replies; 33+ messages in thread
From: Keith Packard @ 2007-12-18 22:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: keithp, Richard Henderson, Chuck Ebbert, linux-kernel,
	Ivan Kokshaysky, Daniel Ritz, Greg KH, Bjorn Helgaas

[-- Attachment #1: Type: text/plain, Size: 2122 bytes --]


On Tue, 2007-12-18 at 13:09 -0800, Linus Torvalds wrote:

> It's not like 256MB is even as large as they come, half-gig graphics cards 
> are getting to be fairly common at the high end, and X absolutely _has_ to 
> be able to handle a 64-bit address for those. 

We're now using a system-dependent wrapper library 'libpciaccess' for
all of this stuff, it uses 64-bits for all PCI addresses and should make
this transparent to the X driver.

In addition, our kernel drivers are moving to support graphics cards
that have memory beyond that addressable through their aperture, so we
should be able to manage cards with even more memory, some of which is
not reachable from the CPU.

> Also, I'm surprised it doesn't work with X already: the ChangeLog for X 
> says that there are "Minor fixes to the handling of 64-bit PCI BARs [..]" 
> in 4.6.99.18, so I'd have assumed that XFree86-4.7.0 should be able to 
> handle this perfectly well.

And that code has been replaced with an even more general library that
abstracts away all of the PCI routing issues.

> I'll add Keithp to the cc too, to see if the X issues can be clarified. 
> Maybe he can set us right. But maybe you just have an old X server? If so, 
> considering the situation, I really think the kernel has done a good job 
> already, and I'd be *very* nervous about making the kernel allocate new 
> PCI resources right after the end-of-memory thing.

Trying a libpciaccess-based X server is certainly something worth doing,
that should be 1.4 or later (thanks, git-describe).

> I bet it would work in this case, but as mentioned, we definitely know of 
> cases where the BIOS did *not* document the magic memory region that was 
> stolen for UMA graphics, and trying to put PCI devices just after the top 
> of reserved memory in the e820 list causes machines to not work at all 
> because the address decoding will clash.

There is an additional single-page BAR on 9xx chips which may end up
mapped and not documented. Our kernel driver should correctly deal with
that now though.

-- 
keith.packard@intel.com

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-18 21:09         ` Linus Torvalds
                             ` (2 preceding siblings ...)
  2007-12-18 22:16           ` Keith Packard
@ 2007-12-19  0:29           ` Bjorn Helgaas
  3 siblings, 0 replies; 33+ messages in thread
From: Bjorn Helgaas @ 2007-12-19  0:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Richard Henderson, Chuck Ebbert, linux-kernel, Ivan Kokshaysky,
	Daniel Ritz, Greg KH, Keith Packard

On Tuesday 18 December 2007 02:09:15 pm Linus Torvalds wrote:
> 
> On Tue, 18 Dec 2007, Richard Henderson wrote:
> > 
> > I've added dmesg, /proc/iomem, and lspci -v output to that bug.
> > 
> > Basically, we have
> > 
> > 	c0000000-cfffffff : free
> > 	ddf00000-dfefffff : PCI Bus #04
> > 	e0000000-efffffff : pnp 00:0b
> > 	f0000000-fedfffff : less than 256MB
> 
> Gaah. 
> 
> That really is very unlucky. That 256M only goes at one point in the low 
> 4GB, but the thing is, it fits perfectly well above it, and dammit, that 
> resource is explicitly a 64-bit resource or a really good reason. 
> 
> However, I wonder about that
> 
> 	e0000000-efffffff : pnp 00:0b
> 
> thing. I actually suspect that that whole allocation is literally *meant* 
> for that 256MB graphics aperture, but the kernel explicitly avoids it 
> because it's listed in the PnP tables.
> 
> I wonder what the heck is the point of that pnp entry. Just for fun, can 
> you try to just disable CONFIG_PNP, and see if it all works then?

00:0b must be a "motherboard" device, probably PNP0C01 or PNP0C02.
Those are catch-all devices with no real programming model associated
with them; they only describe resource usage.  AFAICT, they're mostly
used to describe legacy stuff like interrupt controllers, timers, etc.

My laptop has the same range for one of its PNP0C02 devices.  I'll
try to dig up a chipset spec and see what might look like that range.

We used to ignore anything past the first 8 I/O port regions and 4
memory regions (PNP_MAX_PORT and PNP_MAX_MEM), but those limits have
been recently bumped a bit [1].  That will cause additional reservations
that may explain some of the issues we're seeing.

[1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a7839e960675b549f06209d18283d5cee2ce9261

> Björn Helgaas added to Cc to clarify what those pnp entries tend to mean, 
> and whether there is possibly some way to match up a specific pnp entry 
> with the PCI device that might want to use it. Because that is a nice 
> 256MB region that really doesn't seem to make sense for anything else than 
> the graphics buffer - there's nothing else in your system that seems 
> likely (although I guess it could be for some docking port, but even then 
> I'd have expected one of the PCI bridges to map it!)
> 
> But apart from the question about that pnp 00:0b device, the kernel 
> resource allocation really does look perfectly fine, and while we could 
> shoe-horn it into the low 4GB in this case by just hoping that there is 
> nothing undocumented there (and there probably isn't), it's really 
> annoying considering that big graphics areas are a hell of a good reason 
> to use those 64-bit resources.
> 
> It's not like 256MB is even as large as they come, half-gig graphics cards 
> are getting to be fairly common at the high end, and X absolutely _has_ to 
> be able to handle a 64-bit address for those. 
> 
> Also, I'm surprised it doesn't work with X already: the ChangeLog for X 
> says that there are "Minor fixes to the handling of 64-bit PCI BARs [..]" 
> in 4.6.99.18, so I'd have assumed that XFree86-4.7.0 should be able to 
> handle this perfectly well.
> 
> I'll add Keithp to the cc too, to see if the X issues can be clarified. 
> Maybe he can set us right. But maybe you just have an old X server? If so, 
> considering the situation, I really think the kernel has done a good job 
> already, and I'd be *very* nervous about making the kernel allocate new 
> PCI resources right after the end-of-memory thing.
> 
> I bet it would work in this case, but as mentioned, we definitely know of 
> cases where the BIOS did *not* document the magic memory region that was 
> stolen for UMA graphics, and trying to put PCI devices just after the top 
> of reserved memory in the e820 list causes machines to not work at all 
> because the address decoding will clash.
> 
> Of course, we could also make the minimum address more of a *hint*, and 
> only make the resource allocator only abut the top-of-known-memory when it 
> absolutely has to, but on the other hand, in this case it really doesn't 
> have to, since there's just _tons_ of space for 64-bit resources. So the 
> correct thing really does seem to be to just use the 64-bit hw that is 
> there.
> 
> > That would have been an excellent comment to add to that code then,
> > rather than just "rounding up to the next 1MB area", because purely
> > as rounding code it is erroneous.
> 
> Patches to add comments are welcome. There are few enough people who 
> actually work on the PCI resource allocation code these days (I wish there 
> were more), and it's very rare that anybody else than me or Ivan ends up 
> even *looking* at it. So it's not been a big issue.
> 
> 				Linus
> 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-18 20:22       ` Richard Henderson
  2007-12-18 21:09         ` Linus Torvalds
@ 2007-12-18 21:23         ` Ivan Kokshaysky
  2007-12-18 21:46           ` Linus Torvalds
  2007-12-20 21:10         ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 33+ messages in thread
From: Ivan Kokshaysky @ 2007-12-18 21:23 UTC (permalink / raw)
  To: Linus Torvalds, Chuck Ebbert, linux-kernel, Daniel Ritz, Greg KH

On Tue, Dec 18, 2007 at 12:22:34PM -0800, Richard Henderson wrote:
> On Tue, Dec 18, 2007 at 10:21:50AM -0800, Linus Torvalds wrote:
> > ... and that would be an X server issue!).
> 
> Of course, fixing the X server to *handle* 64-bit BARs is the correct
> solution.  I've no idea how involved that is, but I have a sneeking
> suspicion that it uses that damned CARD32 datatype for everything.

Doh. Let's fix the kernel first...

Does this make any difference? (the patch is self explaining ;-)

Ivan.

--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -208,8 +208,9 @@ pci_setup_bridge(struct pci_bus *bus)
 	}
 	pci_write_config_dword(bridge, PCI_PREF_MEMORY_BASE, l);
 
-	/* Clear out the upper 32 bits of PREF base. */
-	pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, 0);
+	/* Set up the upper 32 bits of PREF base. */
+	l = region.start >> 16 >> 16;
+	pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, l);
 
 	pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, bus->bridge_ctl);
 }

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-18 21:23         ` Ivan Kokshaysky
@ 2007-12-18 21:46           ` Linus Torvalds
  2007-12-20  8:46             ` Ivan Kokshaysky
  0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2007-12-18 21:46 UTC (permalink / raw)
  To: Ivan Kokshaysky; +Cc: Chuck Ebbert, linux-kernel, Daniel Ritz, Greg KH

On Wed, 19 Dec 2007, Ivan Kokshaysky wrote:
> 
> Doh. Let's fix the kernel first...
> 
> Does this make any difference? (the patch is self explaining ;-)

Heh, indeed. Good catch - that

	Prefetchable memory behind bridge: 0000000000000000-000000000fffffff

on device 00:01.0 does look totally broken, and it would make more sense 
if it matched what the device has (and what /proc/iomem resports).

That said, Intel bridges tend to be transparent even when they *claim* 
normal decode, so I wonder whether it actually matters in this case. But 
your patch looks very obviously right. 

So maybe the rest of the kernel and X both already did everything right, 
and it was just this stupid bridge setup thing that was broken (and 
forcing the IOMEM resource to below the 4G mark just hid it).

			Linus

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-18 21:46           ` Linus Torvalds
@ 2007-12-20  8:46             ` Ivan Kokshaysky
  2007-12-20 21:21               ` Benjamin Herrenschmidt
  2007-12-22  9:12               ` Andrew Morton
  0 siblings, 2 replies; 33+ messages in thread
From: Ivan Kokshaysky @ 2007-12-20  8:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Richard Henderson, Chuck Ebbert, linux-kernel, Daniel Ritz,
	Greg KH

On Tue, Dec 18, 2007 at 01:46:56PM -0800, Linus Torvalds wrote:
> Heh, indeed. Good catch - that
> 
> 	Prefetchable memory behind bridge: 0000000000000000-000000000fffffff
> 
> on device 00:01.0 does look totally broken, and it would make more sense 
> if it matched what the device has (and what /proc/iomem resports).

Sigh. The patch was way too incomplete - I somehow missed
PCI_PREF_LIMIT_UPPER32... Corrected patch appended - I think it's
worth applying in either case since a bridge window at address 0 is
an obvious bug and this patch fixes it.

> That said, Intel bridges tend to be transparent even when they *claim* 
> normal decode, so I wonder whether it actually matters in this case. But 
> your patch looks very obviously right. 

I don't think that transparency is the case here - I read specs for some
recent Intel chipsets and it looks like they are pretty accurate now with
a "subtractive decode" flag - over the last couple of years, at least.
There are, of course, some strange "priority decode rules", but they
can be safely ignored as far as the kernel resource management is
concerned.

> So maybe the rest of the kernel and X both already did everything right, 
> and it was just this stupid bridge setup thing that was broken (and 
> forcing the IOMEM resource to below the 4G mark just hid it).

I've just checked the setup-bus code and have to say that its ability to
correctly handle the 64-bit BARs is close to zero...

On the positive side, getting it right doesn't seem to be as complicated
as I thought initially - mainly because only the prefetchable memory
window of p2p bridge is 64-bit. This effectively limits >4G allocations
to prefetchable resources only.

Anyway, using PCI bus space above 4G on x86 seems to be a must these
days, and I have some spare hardware to play with. Maybe I'll be able
to get something working by mid January or so...

Ivan.

---
PCI: do respect full 64-bit address for bridge prefetch window

Prevent the prefetch window from being programmed with a bogus address
when its respective resource gets allocated above the 4G mark.

Note that we cannot yet guarantee correct resource allocations
above 4G, though it might work in some simple cases.

Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
---
 drivers/pci/setup-bus.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 401e03c..643e72e 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -208,8 +208,11 @@ pci_setup_bridge(struct pci_bus *bus)
 	}
 	pci_write_config_dword(bridge, PCI_PREF_MEMORY_BASE, l);

-	/* Clear out the upper 32 bits of PREF base. */
-	pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, 0);
+	/* Set up the upper 32 bits of PREF base/limit. */
+	l = region.start >> 16 >> 16;
+	pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, l);
+	l = region.end >> 16 >> 16;
+	pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, l);

 	pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, bus->bridge_ctl);
 }

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-20  8:46             ` Ivan Kokshaysky
@ 2007-12-20 21:21               ` Benjamin Herrenschmidt
  2007-12-22  9:12               ` Andrew Morton
  1 sibling, 0 replies; 33+ messages in thread
From: Benjamin Herrenschmidt @ 2007-12-20 21:21 UTC (permalink / raw)
  To: Ivan Kokshaysky
  Cc: Linus Torvalds, Richard Henderson, Chuck Ebbert, linux-kernel,
	Daniel Ritz, Greg KH

Another turd is pci_scan_device() which can't cope with 64 bits BARs on
32 bits platforms even when they have 64 bits resources. I'll send a fix
for that.

Ben.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-20  8:46             ` Ivan Kokshaysky
  2007-12-20 21:21               ` Benjamin Herrenschmidt
@ 2007-12-22  9:12               ` Andrew Morton
  2007-12-22  9:20                 ` Andrew Morton
  1 sibling, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2007-12-22  9:12 UTC (permalink / raw)
  To: Ivan Kokshaysky
  Cc: Linus Torvalds, Richard Henderson, Chuck Ebbert, linux-kernel,
	Daniel Ritz, Greg KH

On Thu, 20 Dec 2007 11:46:16 +0300 Ivan Kokshaysky <ink@jurassic.park.msu.ru> wrote:

> PCI: do respect full 64-bit address for bridge prefetch window
> 
> Prevent the prefetch window from being programmed with a bogus address
> when its respective resource gets allocated above the 4G mark.
> 
> Note that we cannot yet guarantee correct resource allocations
> above 4G, though it might work in some simple cases.
> 

So.. did we agree that this patch is good to go?

> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -208,8 +208,11 @@ pci_setup_bridge(struct pci_bus *bus)
>  	}
>  	pci_write_config_dword(bridge, PCI_PREF_MEMORY_BASE, l);
>  
> -	/* Clear out the upper 32 bits of PREF base. */
> -	pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, 0);
> +	/* Set up the upper 32 bits of PREF base/limit. */
> +	l = region.start >> 16 >> 16;

We have the little upper_32_bits() helper for this.

> +	pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, l);
> +	l = region.end >> 16 >> 16;
> +	pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, l);
>  
>  	pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, bus->bridge_ctl);
>  }


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-22  9:12               ` Andrew Morton
@ 2007-12-22  9:20                 ` Andrew Morton
  0 siblings, 0 replies; 33+ messages in thread
From: Andrew Morton @ 2007-12-22  9:20 UTC (permalink / raw)
  To: Ivan Kokshaysky, Linus Torvalds, Richard Henderson, Chuck Ebbert,
	linux-kernel, Daniel Ritz, Greg KH

On Sat, 22 Dec 2007 01:12:18 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 20 Dec 2007 11:46:16 +0300 Ivan Kokshaysky <ink@jurassic.park.msu.ru> wrote:
> 
> > PCI: do respect full 64-bit address for bridge prefetch window
> > 
> > Prevent the prefetch window from being programmed with a bogus address
> > when its respective resource gets allocated above the 4G mark.
> > 
> > Note that we cannot yet guarantee correct resource allocations
> > above 4G, though it might work in some simple cases.
> > 
> 
> So.. did we agree that this patch is good to go?

Oh, I see Greg merged a differnet patch.

> > --- a/drivers/pci/setup-bus.c
> > +++ b/drivers/pci/setup-bus.c
> > @@ -208,8 +208,11 @@ pci_setup_bridge(struct pci_bus *bus)
> >  	}
> >  	pci_write_config_dword(bridge, PCI_PREF_MEMORY_BASE, l);
> >  
> > -	/* Clear out the upper 32 bits of PREF base. */
> > -	pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, 0);
> > +	/* Set up the upper 32 bits of PREF base/limit. */
> > +	l = region.start >> 16 >> 16;
> 
> We have the little upper_32_bits() helper for this.
> 

Which could use this.

--- a/drivers/pci/setup-bus.c~gregkh-pci-pci-fix-bus-resource-assignment-on-32-bits-with-64b-resources-cleanup
+++ a/drivers/pci/setup-bus.c
@@ -206,10 +206,8 @@ pci_setup_bridge(struct pci_bus *bus)
 	if (bus->resource[2]->flags & IORESOURCE_PREFETCH) {
 		l = (region.start >> 16) & 0xfff0;
 		l |= region.end & 0xfff00000;
-#ifdef CONFIG_RESOURCES_64BIT
-		bu = region.start >> 32;
-		lu = region.end >> 32;
-#endif
+		bu = upper_32_bits(region.start);
+		lu = upper_32_bits(region.end);
 		DBG(KERN_INFO "  PREFETCH window: 0x%016llx-0x%016llx\n",
 		    (unsigned long long)region.start,
 		    (unsigned long long)region.end);


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-18 20:22       ` Richard Henderson
  2007-12-18 21:09         ` Linus Torvalds
  2007-12-18 21:23         ` Ivan Kokshaysky
@ 2007-12-20 21:10         ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 33+ messages in thread
From: Benjamin Herrenschmidt @ 2007-12-20 21:10 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Linus Torvalds, Chuck Ebbert, linux-kernel, Ivan Kokshaysky,
	Daniel Ritz, Greg KH

> That won't work, because PCI_BASE_ADDRESS_MEM_TYPE_64 controls how
> many bits need to be written back to the BAR.  If we changed that
> to PCI_BASE_ADDRESS_MEM_TYPE_32, we wouldn't clear the high 32-bits
> of the BAR.
> 
> > ... and that would be an X server issue!).
> 
> Of course, fixing the X server to *handle* 64-bit BARs is the correct
> solution.  I've no idea how involved that is, but I have a sneeking
> suspicion that it uses that damned CARD32 datatype for everything.

A lot more than X needs to be fixed to handle 64-bit BARs btw. There's a
whole load of places in drivers/pci/* where we just puke if we see a
value >4G being assigned.

Now, there is some hope that the new X with libpciaccess can cope with
that, and even if it is broken, it would be much easier to fix, as X in
that case is no longer trying to bypass the kernel, but instead uses
proper kernel interfaces to map device resources.

That used to be Xorg pci-rework branch, though it might have been merged
in the  trunk by now.

Ben.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-18  0:25 PCI resource problems caused by improper address rounding Chuck Ebbert
  2007-12-18  0:57 ` Linus Torvalds
@ 2007-12-22  9:22 ` Andrew Morton
  1 sibling, 0 replies; 33+ messages in thread
From: Andrew Morton @ 2007-12-22  9:22 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: linux-kernel, Ivan Kokshaysky, Linus Torvalds, Daniel Ritz,
	Greg KH, Richard Henderson, Ingo Molnar

On Mon, 17 Dec 2007 19:25:27 -0500 Chuck Ebbert <cebbert@redhat.com> wrote:

> Looks like a commit that I can't find in git due to the arch merge
> has broken PCI address assignment. This patch by Richard Henderson
> against 2.6.23 fixes it for x86_64:
> 
> --- linux-2.6.23.x86_64/arch/x86_64/kernel/e820.c	2007-10-09 13:31:38.000000000 -0700
> +++ linux-2.6.23.x86_64-rth/arch/x86_64/kernel/e820.c	2007-12-15 12:37:44.000000000 -0800
> @@ -718,8 +718,8 @@ __init void e820_setup_gap(void)
>  	while ((gapsize >> 4) > round)
>  		round += round;
>  	/* Fun with two's complement */
> -	pci_mem_start = (gapstart + round) & -round;
> +	pci_mem_start = (gapstart + round - 1) & -round;
>  
>  	printk(KERN_INFO "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",
>  		pci_mem_start, gapstart, gapsize);
> 
> 
> Here is the original changeset, taken from the Mercurial repo. It was
> merged in 2.6.14:
> 
> # HG changeset patch
> # User Daniel Ritz <daniel.ritz@gmx.ch>
> # Date 1126304746 -700
> # Node ID 51367d6e0b839be0b425a8f67c29f625b670f126
> # Parent f4852c862b04efc9f8e2c7913191f5f7d140d895
> [PATCH] Update PCI IOMEM allocation start
> 
> This fixes the problem with "Averatec 6240 pcmcia_socket0: unable to
> apply power", which was due to the CardBus IOMEM register region being
> allocated at an address that was actually inside the RAM window that had
> been reserved for video frame-buffers in an UMA setup.
> 
> The BIOS _should_ have marked that region reserved in the e820 memory
> descriptor tables, but did not.
> 
> It is fixed by rounding up the default starting address of PCI memory
> allocations, so that we leave a bigger gap after the final known memory
> location.  The amount of rounding depends on how big the unused memory
> gap is that we can allocate IOMEM from.
> 
> Based on example code by Linus.
> 
> Acked-by: Greg KH <greg@kroah.com>
> Acked-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> 
> committer: Linus Torvalds <torvalds@g5.osdl.org> 1126304746 -0700
> 
> 
> --- a/arch/i386/kernel/setup.c	Fri Sep 09 22:28:40 2005 +0011
> +++ b/arch/i386/kernel/setup.c	Fri Sep 09 22:37:26 2005 +0011
> @@ -1300,7 +1300,7 @@ legacy_init_iomem_resources(struct resou
>   */
>  static void __init register_memory(void)
>  {
> -	unsigned long gapstart, gapsize;
> +	unsigned long gapstart, gapsize, round;
>  	unsigned long long last;
>  	int	      i;
>  
> @@ -1345,14 +1345,14 @@ static void __init register_memory(void)
>  	}
>  
>  	/*
> -	 * Start allocating dynamic PCI memory a bit into the gap,
> -	 * aligned up to the nearest megabyte.
> -	 *
> -	 * Question: should we try to pad it up a bit (do something
> -	 * like " + (gapsize >> 3)" in there too?). We now have the
> -	 * technology.
> +	 * See how much we want to round up: start off with
> +	 * rounding to the next 1MB area.
>  	 */
> -	pci_mem_start = (gapstart + 0xfffff) & ~0xfffff;
> +	round = 0x100000;
> +	while ((gapsize >> 4) > round)
> +		round += round;
> +	/* Fun with two's complement */
> +	pci_mem_start = (gapstart + round) & -round;
>  
>  	printk("Allocating PCI resources starting at %08lx (gap: %08lx:%08lx)\n",
>  		pci_mem_start, gapstart, gapsize);
> --- a/arch/x86_64/kernel/e820.c	Fri Sep 09 22:28:40 2005 +0011
> +++ b/arch/x86_64/kernel/e820.c	Fri Sep 09 22:37:26 2005 +0011
> @@ -567,7 +567,7 @@ unsigned long pci_mem_start = 0xaeedbabe
>   */
>  __init void e820_setup_gap(void)
>  {
> -	unsigned long gapstart, gapsize;
> +	unsigned long gapstart, gapsize, round;
>  	unsigned long last;
>  	int i;
>  	int found = 0;
> @@ -604,14 +604,14 @@ __init void e820_setup_gap(void)
>  	}
>  
>  	/*
> -	 * Start allocating dynamic PCI memory a bit into the gap,
> -	 * aligned up to the nearest megabyte.
> -	 *
> -	 * Question: should we try to pad it up a bit (do something
> -	 * like " + (gapsize >> 3)" in there too?). We now have the
> -	 * technology.
> +	 * See how much we want to round up: start off with
> +	 * rounding to the next 1MB area.
>  	 */
> -	pci_mem_start = (gapstart + 0xfffff) & ~0xfffff;
> +	round = 0x100000;
> +	while ((gapsize >> 4) > round)
> +		round += round;
> +	/* Fun with two's complement */
> +	pci_mem_start = (gapstart + round) & -round;
>  
>  	printk(KERN_INFO "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",
>  		pci_mem_start, gapstart, gapsize);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 33+ messages in thread

[parent not found: <fa.WmGIH8th8MfmciABVSBi6whxeFE@ifi.uio.no>]

[parent not found: <fa.Obg5E3fyax+MaF94//uo40q/Zyk@ifi.uio.no>]

[parent not found: <fa./6K5nXEIpws4VU8HtJhQjF4AoGg@ifi.uio.no>]

[parent not found: <fa.V82IIxMkW3eu+9B44NfoyYYQDP4@ifi.uio.no>]

[parent not found: <fa.DM9AyNQQtam66XpKVeXeqS639os@ifi.uio.no>]

[parent not found: <fa.f5O3U527Rv8DNk05hDFRjdCaeFE@ifi.uio.no>]

* Re: PCI resource problems caused by improper address rounding
       [not found]         ` <fa.f5O3U527Rv8DNk05hDFRjdCaeFE@ifi.uio.no>
@ 2007-12-19  0:11           ` Robert Hancock
  2007-12-19  0:55             ` Chuck Ebbert
       [not found]           ` <fa.PJGSMm4TIW6lRYng/jDqooIvj8U@ifi.uio.no>
  1 sibling, 1 reply; 33+ messages in thread
From: Robert Hancock @ 2007-12-19  0:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Richard Henderson, Chuck Ebbert, linux-kernel, Ivan Kokshaysky,
	Daniel Ritz, Greg KH, Keith Packard, Bjorn Helgaas

Linus Torvalds wrote:
> 
> On Tue, 18 Dec 2007, Richard Henderson wrote:
>> I've added dmesg, /proc/iomem, and lspci -v output to that bug.
>>
>> Basically, we have
>>
>> 	c0000000-cfffffff : free
>> 	ddf00000-dfefffff : PCI Bus #04
>> 	e0000000-efffffff : pnp 00:0b
>> 	f0000000-fedfffff : less than 256MB
> 
> Gaah. 
> 
> That really is very unlucky. That 256M only goes at one point in the low 
> 4GB, but the thing is, it fits perfectly well above it, and dammit, that 
> resource is explicitly a 64-bit resource or a really good reason. 
> 
> However, I wonder about that
> 
> 	e0000000-efffffff : pnp 00:0b
> 
> thing. I actually suspect that that whole allocation is literally *meant* 
> for that 256MB graphics aperture, but the kernel explicitly avoids it 
> because it's listed in the PnP tables.

That is probably the MMCONFIG aperture, in that case any attempt to map 
the graphics BAR there will have disastrous results. (This BIOS has an 
MCFG table, though it looks like this Fedora kernel has MMCONFIG 
disabled, so we can't tell what it actually contains.)

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-19  0:11           ` Robert Hancock
@ 2007-12-19  0:55             ` Chuck Ebbert
  2007-12-19  1:12               ` Richard Henderson
  0 siblings, 1 reply; 33+ messages in thread
From: Chuck Ebbert @ 2007-12-19  0:55 UTC (permalink / raw)
  To: Robert Hancock
  Cc: Linus Torvalds, Richard Henderson, linux-kernel, Ivan Kokshaysky,
	Daniel Ritz, Greg KH, Keith Packard, Bjorn Helgaas

On 12/18/2007 07:11 PM, Robert Hancock wrote:
>> However, I wonder about that
>>
>>     e0000000-efffffff : pnp 00:0b
>>
>> thing. I actually suspect that that whole allocation is literally
>> *meant* for that 256MB graphics aperture, but the kernel explicitly
>> avoids it because it's listed in the PnP tables.
> 
> That is probably the MMCONFIG aperture, in that case any attempt to map
> the graphics BAR there will have disastrous results. (This BIOS has an
> MCFG table, though it looks like this Fedora kernel has MMCONFIG
> disabled, so we can't tell what it actually contains.)
> 

You can boot with "pci=mmconf" to enable it.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-19  0:55             ` Chuck Ebbert
@ 2007-12-19  1:12               ` Richard Henderson
  2007-12-19  3:12                 ` Linus Torvalds
  0 siblings, 1 reply; 33+ messages in thread
From: Richard Henderson @ 2007-12-19  1:12 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Robert Hancock, Linus Torvalds, linux-kernel, Ivan Kokshaysky,
	Daniel Ritz, Greg KH, Keith Packard, Bjorn Helgaas

On Tue, Dec 18, 2007 at 07:55:37PM -0500, Chuck Ebbert wrote:
> You can boot with "pci=mmconf" to enable it.

Heh.

PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
PCI: Not using MMCONFIG.


r~

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
  2007-12-19  1:12               ` Richard Henderson
@ 2007-12-19  3:12                 ` Linus Torvalds
  0 siblings, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2007-12-19  3:12 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Chuck Ebbert, Robert Hancock, linux-kernel, Ivan Kokshaysky,
	Daniel Ritz, Greg KH, Keith Packard, Bjorn Helgaas



On Tue, 18 Dec 2007, Richard Henderson wrote:
>
> Heh.
> 
> PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
> PCI: Not using MMCONFIG.

Well, that at least confirms that e0000000 is indeed the mmconfig area.

One of these days we'll trust the ACPI resource data enough that we can 
use mmconfig even when it's just reserved in those PnP things, which is 
apparently how BIOS writers are suggested to do it (stupidly enough, but 
whatever)

		Linus

^ permalink raw reply	[flat|nested] 33+ messages in thread

[parent not found: <fa.PJGSMm4TIW6lRYng/jDqooIvj8U@ifi.uio.no>]

[parent not found: <fa.0UHHdYi5zqyJ2xOPhNk/BhJkxYM@ifi.uio.no>]

* Re: PCI resource problems caused by improper address rounding
       [not found]             ` <fa.0UHHdYi5zqyJ2xOPhNk/BhJkxYM@ifi.uio.no>
@ 2007-12-19  0:18               ` Robert Hancock
  0 siblings, 0 replies; 33+ messages in thread
From: Robert Hancock @ 2007-12-19  0:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chuck Ebbert, Richard Henderson, linux-kernel, Ivan Kokshaysky,
	Daniel Ritz, Greg KH, Keith Packard, Bjorn Helgaas

Linus Torvalds wrote:
> 
> On Tue, 18 Dec 2007, Chuck Ebbert wrote:
> 
>> On 12/18/2007 04:09 PM, Linus Torvalds wrote:
>>> I wonder what the heck is the point of that pnp entry. Just for fun, can 
>>> you try to just disable CONFIG_PNP, and see if it all works then?
>> pnpacpi=off should work.
>>
>> PnP is also trying (and failing) to reserve all physical memory.
> 
> Yeah, that really is a pretty confused-looking pnp table thing. But I have 
> absolutely zero idea how PnP is even supposed to work - the whole thing is 
> just a total hack for Windows, afaik.
> 
> The sad part is that *normally* the right thing to do about almost any 
> BIOS information is what we do right now: just avoid that magic address 
> range like the plague, because we have no clue what the heck the BIOS is 
> up to. But it looks like in this particular case, some of the problems 
> may arise exactly *because* we avoid that range.
> 
> It would be good to know what Windows does. If ACPI is found, does it 
> perhaps just ignore all the PnP entries these days?
> 
> 			Linus

ACPI is where those PnP entries are coming from (on any modern system 
anyway). They do show up in Device Manager as devices with resources 
(the one that reserves all of system RAM on my machine is labeled 
"System board", others like the one that reserves the MMCONFIG aperature 
are "Motherboard resources" - the name is based on the PNP device ID, I 
believe).

It could be that Windows is stupid enough that it will map things over 
top of physical RAM if the BIOS doesn't explicitly reserve it like that. 
  I suspect based on some comments in Microsoft documents that Windows 
uses the E820 table to figure out where the RAM is, and ACPI/PnP 
information to figure out where IO mappings are, but may not really 
combine those two pieces of information into one overall map like Linux 
does, which would explain why it needs to reserve all physical RAM..

(As mentioned in another post, I would guess the BIOS is reserving that 
memory range since it's the MMCONFIG aperture..)

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PCI resource problems caused by improper address rounding
       [not found] ` <fa.Obg5E3fyax+MaF94//uo40q/Zyk@ifi.uio.no>
       [not found]   ` <fa./6K5nXEIpws4VU8HtJhQjF4AoGg@ifi.uio.no>
@ 2007-12-19  0:38   ` Robert Hancock
  1 sibling, 0 replies; 33+ messages in thread
From: Robert Hancock @ 2007-12-19  0:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chuck Ebbert, linux-kernel, Ivan Kokshaysky, Daniel Ritz, Greg KH

Linus Torvalds wrote:
> 
> On Mon, 17 Dec 2007, Chuck Ebbert wrote:
>> Looks like a commit that I can't find in git due to the arch merge
>> has broken PCI address assignment. This patch by Richard Henderson
>> against 2.6.23 fixes it for x86_64:
>>
>> --- linux-2.6.23.x86_64/arch/x86_64/kernel/e820.c	2007-10-09 13:31:38.000000000 -0700
>> +++ linux-2.6.23.x86_64-rth/arch/x86_64/kernel/e820.c	2007-12-15 12:37:44.000000000 -0800
>> @@ -718,8 +718,8 @@ __init void e820_setup_gap(void)
>>  	while ((gapsize >> 4) > round)
>>  		round += round;
>>  	/* Fun with two's complement */
>> -	pci_mem_start = (gapstart + round) & -round;
>> +	pci_mem_start = (gapstart + round - 1) & -round;
> 
> No, it's very much meant to be that way.
> 
> We do *not* want to have the PCI memory abutthe end of memory exactly. So 
> it leaves a gap in between "gapstart" and the actual start of PCI memory 
> addressing very much on purpose.
> 
> In fact, the very commit (it's f0eca9626c6becb6fc56106b2e4287c6c784af3d in 
> the kernel tree) you mention actually explicitly *explains* that, although 
> maybe it's a bit indirect: if you start allocating PCI resources directly 
> after the end-of-RAM thing, you can easily end up using addresses that are 
> actually inside the magic stolen system RAM that is being used for UMA 
> video etc.
> 
> So you very much want to have a buffer in between the end-of-RAM and the 
> actual start of the region we try to allocate in. 
> 
> So why do you want them to be close, anyway? 
> 
> 		Linus
> 
> PS. On a different topic: if you do
> 
> 	git log --follow arch/x86/kernel/e820_64.c
> 
> you'd see the history past the renames in git. Or just do a "git blame -C" 
> which will also follow renames (and copies).

That patch is from the 2.6.14 era - I don't think we even did PnP ACPI 
resource reservation handling then? It could be that the BIOS was trying 
to tell us that UMA memory region is reserved using PnP ACPI 
reservations, but we just ignored it.

It seems rather arbitrary in how much it leaves unused - and in this 
case, likely prevents us from using the nice big open gap that the BIOS 
presumably expected the graphics card to be mapped into.

I suspect this buffer space insertion is really not needed at this 
point. The patch description is likely technically correct in that the 
BIOS should have reserved it in E820, but (according to MS comments in a 
presentation I read) Windows doesn't use E820 for anything other than 
figuring out where RAM is, it uses PnP ACPI for figuring out areas it 
needs to avoid. Since BIOS writers test against that behavior, there are 
surely lots of systems where ignoring PnP ACPI reservations and relying 
only on E820 would result in things really going blammo (like mappings 
things over MMCONFIG tables for instance). So disabling it on modern 
machines is really not an option. And if it's enabled, you likely 
wouldn't hit the problem it tries to fix.

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2007-12-22  9:24 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-12-18  0:25 PCI resource problems caused by improper address rounding Chuck Ebbert
2007-12-18  0:57 ` Linus Torvalds
2007-12-18 17:34   ` Chuck Ebbert
2007-12-18 18:21     ` Linus Torvalds
2007-12-18 20:22       ` Richard Henderson
2007-12-18 21:09         ` Linus Torvalds
2007-12-18 21:46           ` Chuck Ebbert
2007-12-18 21:56             ` Linus Torvalds
2007-12-18 22:17             ` Richard Henderson
2007-12-18 21:51           ` Richard Henderson
2007-12-18 22:31             ` Linus Torvalds
2007-12-19  1:38               ` Linus Torvalds
2007-12-20 21:52                 ` Richard Henderson
2007-12-20 22:24                   ` Linus Torvalds
2007-12-21  0:39                     ` Richard Henderson
2007-12-21  1:00                       ` Linus Torvalds
2007-12-21  2:28                     ` Benjamin Herrenschmidt
2007-12-18 22:16           ` Keith Packard
2007-12-19  0:29           ` Bjorn Helgaas
2007-12-18 21:23         ` Ivan Kokshaysky
2007-12-18 21:46           ` Linus Torvalds
2007-12-20  8:46             ` Ivan Kokshaysky
2007-12-20 21:21               ` Benjamin Herrenschmidt
2007-12-22  9:12               ` Andrew Morton
2007-12-22  9:20                 ` Andrew Morton
2007-12-20 21:10         ` Benjamin Herrenschmidt
2007-12-22  9:22 ` Andrew Morton
     [not found] <fa.WmGIH8th8MfmciABVSBi6whxeFE@ifi.uio.no>
     [not found] ` <fa.Obg5E3fyax+MaF94//uo40q/Zyk@ifi.uio.no>
     [not found]   ` <fa./6K5nXEIpws4VU8HtJhQjF4AoGg@ifi.uio.no>
     [not found]     ` <fa.V82IIxMkW3eu+9B44NfoyYYQDP4@ifi.uio.no>
     [not found]       ` <fa.DM9AyNQQtam66XpKVeXeqS639os@ifi.uio.no>
     [not found]         ` <fa.f5O3U527Rv8DNk05hDFRjdCaeFE@ifi.uio.no>
2007-12-19  0:11           ` Robert Hancock
2007-12-19  0:55             ` Chuck Ebbert
2007-12-19  1:12               ` Richard Henderson
2007-12-19  3:12                 ` Linus Torvalds
     [not found]           ` <fa.PJGSMm4TIW6lRYng/jDqooIvj8U@ifi.uio.no>
     [not found]             ` <fa.0UHHdYi5zqyJ2xOPhNk/BhJkxYM@ifi.uio.no>
2007-12-19  0:18               ` Robert Hancock
2007-12-19  0:38   ` Robert Hancock

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox