public inbox for intel-gfx@lists.freedesktop.org
 help / color / mirror / Atom feed
* [PATCH] intel-gtt: fix memory corruption with GM965 and >4GB RAM
       [not found] <20110223233022.GA3439@x61s.reliablesolutions.de>
@ 2011-02-25 12:30 ` Jan Niehusmann
  2011-02-25 20:22   ` Chris Wilson
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Niehusmann @ 2011-02-25 12:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: intel-gfx, Chris Wilson

On Thu, Feb 24, 2011 at 12:30:22AM +0100, Jan Niehusmann wrote to
linux-kernel@vger.kernel.org:
> On a Thinkpad x61s, I noticed some memory corruption when
> plugging/unplugging the external VGA connection.
> 
> Symptoms:
> ---------
> 
> 4 bytes at the beginning of a page get overwritten by zeroes. 
> The address of the corruption varies when rebooting the machine, but
> stays constant while it's running (so it's possible to repeatedly write
> some data and then corrupt it again by plugging the cable).

Further investigation revealed that the corrupted address is
(dev_priv->status_page_dmah->busaddr & 0xffffffff), ie. the beginning of
the hardware status page of the i965 graphics card, cut to 32 bits.

So it seems that for some memory access, the hardware uses only 32 bit
addressing. If the hardware status page is located >4GB, this corrupts
unrelated memory.

The corruption was observed on a Thinkpad x61s, using the Mobile Intel
GM965 Express Chipset. The first four bytes of the wrong page are
overwritten with zeroes whenever the VGA cable gets plugged or unplugged.

This patch simply works around this issue by confining the dma memory
to 32 bits.

It's not known if other chipsets are affected as well.

---
 drivers/char/agp/intel-gtt.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/char/agp/intel-gtt.c b/drivers/char/agp/intel-gtt.c
index 29ac6d4..f7977f2 100644
--- a/drivers/char/agp/intel-gtt.c
+++ b/drivers/char/agp/intel-gtt.c
@@ -1379,7 +1379,7 @@ static const struct intel_gtt_driver i965_gtt_driver = {
 	.setup = i9xx_setup,
 	.cleanup = i9xx_cleanup,
 	.write_entry = i965_write_entry,
-	.dma_mask_size = 36,
+	.dma_mask_size = 32,
 	.check_flags = i830_check_flags,
 	.chipset_flush = i9xx_chipset_flush,
 };
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] intel-gtt: fix memory corruption with GM965 and >4GB RAM
  2011-02-25 12:30 ` [PATCH] intel-gtt: fix memory corruption with GM965 and >4GB RAM Jan Niehusmann
@ 2011-02-25 20:22   ` Chris Wilson
  2011-02-25 21:16     ` Jan Niehusmann
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Wilson @ 2011-02-25 20:22 UTC (permalink / raw)
  To: Jan Niehusmann, linux-kernel; +Cc: intel-gfx

On Fri, 25 Feb 2011 13:30:56 +0100, Jan Niehusmann <jan@gondor.com> wrote:
> On Thu, Feb 24, 2011 at 12:30:22AM +0100, Jan Niehusmann wrote to
> linux-kernel@vger.kernel.org:
> > On a Thinkpad x61s, I noticed some memory corruption when
> > plugging/unplugging the external VGA connection.
> > 
> > Symptoms:
> > ---------
> > 
> > 4 bytes at the beginning of a page get overwritten by zeroes. 
> > The address of the corruption varies when rebooting the machine, but
> > stays constant while it's running (so it's possible to repeatedly write
> > some data and then corrupt it again by plugging the cable).
> 
> Further investigation revealed that the corrupted address is
> (dev_priv->status_page_dmah->busaddr & 0xffffffff), ie. the beginning of
> the hardware status page of the i965 graphics card, cut to 32 bits.

965GM explicitly supports 36bits of addressing in the PTE. The only
exception is that general state (part of the 3D engine) must be located in
the lower 4GiB.

Simply ignoring the upper 4bits is the wrong approach and means that the
PTE then point to random pages, and completely irrelevant to the physical
address used in the hardware status page address register.

I have been considering:

diff --git a/drivers/gpu/drm/i915/i915_dma.c
b/drivers/gpu/drm/i915/i915_dma.c
index ffa2196..268e448 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1896,6 +1896,8 @@ int i915_driver_load(struct drm_device *dev,
unsigned long
        /* overlay on gen2 is broken and can't address above 1G */
        if (IS_GEN2(dev))
                dma_set_coherent_mask(&dev->pdev->dev, DMA_BIT_MASK(30));
+       if (IS_BRROADWATER(dev) || IS_CRESTLINE(dev))
+               dma_set_coherent_mask(&dev->pdev->dev, DMA_BIT_MASK(32));
 
        mmio_bar = IS_GEN2(dev) ? 1 : 0;
        dev_priv->regs = pci_iomap(dev->pdev, mmio_bar, 0);

to prevent hitting the erratum.

However your bug looks to be:

diff --git a/drivers/gpu/drm/i915/i915_dma.c
b/drivers/gpu/drm/i915/i915_dma.c
index ffa2196..3b80507 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -66,9 +66,9 @@ static int i915_init_phys_hws(struct drm_device *dev)
 
        memset_io(ring->status_page.page_addr, 0, PAGE_SIZE);
 
-       if (INTEL_INFO(dev)->gen >= 4)
-               dev_priv->dma_status_page |= (dev_priv->dma_status_page >> 28) &
-                                            0xf0;
+       if (INTEL_INFO(dev)->gen >= 4) /* 36-bit addressing */
+               dev_priv->dma_status_page |=
+                       (dev_priv->status_page_dmah->busaddr >> 28) & 0xf0;
 
        I915_WRITE(HWS_PGA, dev_priv->dma_status_page);
        DRM_DEBUG_DRIVER("Enabled hardware status page\n");

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] intel-gtt: fix memory corruption with GM965 and >4GB RAM
  2011-02-25 20:22   ` Chris Wilson
@ 2011-02-25 21:16     ` Jan Niehusmann
  2011-02-25 22:18       ` Chris Wilson
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Niehusmann @ 2011-02-25 21:16 UTC (permalink / raw)
  To: Chris Wilson; +Cc: linux-kernel, intel-gfx

Hi Chris,

On Fri, Feb 25, 2011 at 08:22:53PM +0000, Chris Wilson wrote:
> On Fri, 25 Feb 2011 13:30:56 +0100, Jan Niehusmann <jan@gondor.com> wrote:
> > Further investigation revealed that the corrupted address is
> > (dev_priv->status_page_dmah->busaddr & 0xffffffff), ie. the beginning of
> > the hardware status page of the i965 graphics card, cut to 32 bits.
> 
> 965GM explicitly supports 36bits of addressing in the PTE. The only
> exception is that general state (part of the 3D engine) must be located in
> the lower 4GiB.

I'm not claiming that 965GM doesn't do 36 bits. In fact I actually see
activity in /sys/kernel/debug/dri/64/i915_gem_hws, and everything seems
to be working well, when the status page is above 4GB - if one ignores
the tiny detail that the wrong memory location gets overwritten,
sometimes...

> Simply ignoring the upper 4bits is the wrong approach and means that the
> PTE then point to random pages, and completely irrelevant to the physical
> address used in the hardware status page address register.

Doesn't setting DMA_BIT_MASK(32) only change the region DMA memory is
allocated from? I made that change just to make sure one gets addresses
which are safe even if the chipset sometimes ignores address bit 32. The
only negative impact I could think of is the allocation may fail if no
appropriate memory is available. Am I wrong?

> I have been considering:

> +       if (IS_BRROADWATER(dev) || IS_CRESTLINE(dev))
> +               dma_set_coherent_mask(&dev->pdev->dev, DMA_BIT_MASK(32));

> to prevent hitting the erratum.

So is there a known erratum about these chips? I didn't find errata
documents online, but I only did a short google search and may have
missed them.

> However your bug looks to be:

> -       if (INTEL_INFO(dev)->gen >= 4)
> -               dev_priv->dma_status_page |= (dev_priv->dma_status_page >> 28) &
> -                                            0xf0;
> +       if (INTEL_INFO(dev)->gen >= 4) /* 36-bit addressing */
> +               dev_priv->dma_status_page |=
> +                       (dev_priv->status_page_dmah->busaddr >> 28) & 0xf0;

Don't think so. dev_priv->dma_status_page gets initialized to
dev_priv->status_page_dmah->busaddr a few lines above, and it's 64 bit,
so that diff doesn't change the result of the computation.

Jan

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] intel-gtt: fix memory corruption with GM965 and >4GB RAM
  2011-02-25 21:16     ` Jan Niehusmann
@ 2011-02-25 22:18       ` Chris Wilson
  2011-02-25 23:05         ` Daniel Vetter
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Wilson @ 2011-02-25 22:18 UTC (permalink / raw)
  To: Jan Niehusmann; +Cc: linux-kernel, intel-gfx

On Fri, 25 Feb 2011 22:16:46 +0100, Jan Niehusmann <jan@gondor.com> wrote:
> Hi Chris,
> 
> On Fri, Feb 25, 2011 at 08:22:53PM +0000, Chris Wilson wrote:
> > On Fri, 25 Feb 2011 13:30:56 +0100, Jan Niehusmann <jan@gondor.com> wrote:
> > > Further investigation revealed that the corrupted address is
> > > (dev_priv->status_page_dmah->busaddr & 0xffffffff), ie. the beginning of
> > > the hardware status page of the i965 graphics card, cut to 32 bits.
> > 
> > 965GM explicitly supports 36bits of addressing in the PTE. The only
> > exception is that general state (part of the 3D engine) must be located in
> > the lower 4GiB.
> 
> I'm not claiming that 965GM doesn't do 36 bits. In fact I actually see
> activity in /sys/kernel/debug/dri/64/i915_gem_hws, and everything seems
> to be working well, when the status page is above 4GB - if one ignores
> the tiny detail that the wrong memory location gets overwritten,
> sometimes...
> 
> > Simply ignoring the upper 4bits is the wrong approach and means that the
> > PTE then point to random pages, and completely irrelevant to the physical
> > address used in the hardware status page address register.
> 
> Doesn't setting DMA_BIT_MASK(32) only change the region DMA memory is
> allocated from? I made that change just to make sure one gets addresses
> which are safe even if the chipset sometimes ignores address bit 32. The
> only negative impact I could think of is the allocation may fail if no
> appropriate memory is available. Am I wrong?

I just thought Daniel had wired up the dma_mask_size differently and
didn't realise it also did pci_set_dma_mask on the same pci dev. So
our patches were morally equivalent. ;-)

> 
> > I have been considering:
> 
> > +       if (IS_BRROADWATER(dev) || IS_CRESTLINE(dev))
> > +               dma_set_coherent_mask(&dev->pdev->dev, DMA_BIT_MASK(32));
> 
> > to prevent hitting the erratum.
> 
> So is there a known erratum about these chips? I didn't find errata
> documents online, but I only did a short google search and may have
> missed them.

Hah. I wish our documentation were that organised. If you grab the PRM
for gen4 from intellinuxgraphics.org, you should find it mentioned in the
state descriptions for the 3D pipeline.

> > However your bug looks to be:
> 
> > -       if (INTEL_INFO(dev)->gen >= 4)
> > -               dev_priv->dma_status_page |= (dev_priv->dma_status_page >> 28) &
> > -                                            0xf0;
> > +       if (INTEL_INFO(dev)->gen >= 4) /* 36-bit addressing */
> > +               dev_priv->dma_status_page |=
> > +                       (dev_priv->status_page_dmah->busaddr >> 28) & 0xf0;
> 
> Don't think so. dev_priv->dma_status_page gets initialized to
> dev_priv->status_page_dmah->busaddr a few lines above, and it's 64 bit,
> so that diff doesn't change the result of the computation.

And here I was working on the assumption that the code to program a 32bit
register would indeed create a 32bit value.

So, I'm happy to use your patch to workaround the known erratum. I just
wish I had an explanation as to what is actually causing the corruption.
What I want to make sure is that we don't paper over a real bug by
thinking it is yet another silicon issue.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] intel-gtt: fix memory corruption with GM965 and >4GB RAM
  2011-02-25 22:18       ` Chris Wilson
@ 2011-02-25 23:05         ` Daniel Vetter
  2011-02-28  6:46           ` [Intel-gfx] " Zhenyu Wang
  2011-03-01 22:24           ` [PATCH] drm/i915: " Jan Niehusmann
  0 siblings, 2 replies; 9+ messages in thread
From: Daniel Vetter @ 2011-02-25 23:05 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx, linux-kernel

On Fri, Feb 25, 2011 at 10:18:16PM +0000, Chris Wilson wrote:
> So, I'm happy to use your patch to workaround the known erratum. I just
> wish I had an explanation as to what is actually causing the corruption.
> What I want to make sure is that we don't paper over a real bug by
> thinking it is yet another silicon issue.

Actually, on style points I prefer your patch: The hw status page is
allocated with drm_pci_alloc which calls dma_alloc_coherent, so setting
the coherent mask is sufficient. The dma mask set in the gtt is
essentially useless, because we call get_user_pages on everything anyway
(in gem - iirc agp uses it). I just think it's confusing to limit the
general dma mask and continue to happily map pages above 4G.

Cheers, Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Intel-gfx] [PATCH] intel-gtt: fix memory corruption with GM965 and >4GB RAM
  2011-02-25 23:05         ` Daniel Vetter
@ 2011-02-28  6:46           ` Zhenyu Wang
  2011-02-28 19:57             ` Daniel Vetter
  2011-03-01 22:24           ` [PATCH] drm/i915: " Jan Niehusmann
  1 sibling, 1 reply; 9+ messages in thread
From: Zhenyu Wang @ 2011-02-28  6:46 UTC (permalink / raw)
  To: Chris Wilson, Jan Niehusmann, intel-gfx, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1089 bytes --]

On 2011.02.26 00:05:27 +0100, Daniel Vetter wrote:
> On Fri, Feb 25, 2011 at 10:18:16PM +0000, Chris Wilson wrote:
> > So, I'm happy to use your patch to workaround the known erratum. I just
> > wish I had an explanation as to what is actually causing the corruption.
> > What I want to make sure is that we don't paper over a real bug by
> > thinking it is yet another silicon issue.
> 
> Actually, on style points I prefer your patch: The hw status page is
> allocated with drm_pci_alloc which calls dma_alloc_coherent, so setting
> the coherent mask is sufficient. The dma mask set in the gtt is
> essentially useless, because we call get_user_pages on everything anyway
> (in gem - iirc agp uses it). I just think it's confusing to limit the
> general dma mask and continue to happily map pages above 4G.
> 

Think about IOMMU engine, we need to set dma_mask properly for
returned dma mapping address be limited in max range that can be
handled in GTT entry. 

-- 
Open Source Technology Center, Intel ltd.

$gpg --keyserver wwwkeys.pgp.net --recv-keys 4D781827

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Intel-gfx] [PATCH] intel-gtt: fix memory corruption with GM965 and >4GB RAM
  2011-02-28  6:46           ` [Intel-gfx] " Zhenyu Wang
@ 2011-02-28 19:57             ` Daniel Vetter
  0 siblings, 0 replies; 9+ messages in thread
From: Daniel Vetter @ 2011-02-28 19:57 UTC (permalink / raw)
  To: Chris Wilson, Jan Niehusmann, intel-gfx, linux-kernel

On Mon, Feb 28, 2011 at 02:46:35PM +0800, Zhenyu Wang wrote:
> > Actually, on style points I prefer your patch: The hw status page is
> > allocated with drm_pci_alloc which calls dma_alloc_coherent, so setting
> > the coherent mask is sufficient. The dma mask set in the gtt is
> > essentially useless, because we call get_user_pages on everything anyway
> > (in gem - iirc agp uses it). I just think it's confusing to limit the
> > general dma mask and continue to happily map pages above 4G.
> > 
> 
> Think about IOMMU engine, we need to set dma_mask properly for
> returned dma mapping address be limited in max range that can be
> handled in GTT entry. 

If I understand it correctly, this is just about broadwater/crestline,
i.e. the original i965 series. Only the later eaglelake/cantiga (gm45 in
our codebase) was shipped with an iommu attached. So no problem there.
Anyway, my comment was just style nitpick, essentially to keep in line
with the existing  work-around for broken overlay reg files on gen2.
-Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH] drm/i915: fix memory corruption with GM965 and >4GB RAM
  2011-02-25 23:05         ` Daniel Vetter
  2011-02-28  6:46           ` [Intel-gfx] " Zhenyu Wang
@ 2011-03-01 22:24           ` Jan Niehusmann
  2011-03-01 22:32             ` Chris Wilson
  1 sibling, 1 reply; 9+ messages in thread
From: Jan Niehusmann @ 2011-03-01 22:24 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, linux-kernel

On Sat, Feb 26, 2011 at 12:05:27AM +0100, Daniel Vetter wrote:
> Actually, on style points I prefer your patch: The hw status page is
> allocated with drm_pci_alloc which calls dma_alloc_coherent, so setting
> the coherent mask is sufficient. The dma mask set in the gtt is
> essentially useless, because we call get_user_pages on everything anyway
> (in gem - iirc agp uses it). I just think it's confusing to limit the
> general dma mask and continue to happily map pages above 4G.

Indeed, setting dma_alloc_coherent should be sufficient, AFAIKT. After
all, only the HWS seems to be affected, which is allocated with
drm_pci_alloc, which in turn uses dma_alloc_coherent.

I just tried the patch below, it also works for me (as expected). Added
the comment because otherwise it wouldn't be obvious why the mask gets
set to 32 bit.

What I don't know is if really only Broadwater and Crestline chips are
affected. The tests were done with a Crestline one. But I think it's a
fair guess that the bug would have been noticed earlier if more recent
chips were affected, as >4GB RAM have become much more common since then.

Signed-off-by: Jan Niehusmann <jan@gondor.com>

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 17bd766..1961580 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1895,6 +1895,12 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
 	if (IS_GEN2(dev))
 		dma_set_coherent_mask(&dev->pdev->dev, DMA_BIT_MASK(30));
 
+	/* 965GM sometimes incorrectly writes to hardware status page (HWS)
+	 * using 32bit addressing, overwriting memory if HWS is located
+	 * above 4GB. */
+	if (IS_BROADWATER(dev) || IS_CRESTLINE(dev))
+		dma_set_coherent_mask(&dev->pdev->dev, DMA_BIT_MASK(32));
+
 	mmio_bar = IS_GEN2(dev) ? 1 : 0;
 	dev_priv->regs = pci_iomap(dev->pdev, mmio_bar, 0);
 	if (!dev_priv->regs) {

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/i915: fix memory corruption with GM965 and >4GB RAM
  2011-03-01 22:24           ` [PATCH] drm/i915: " Jan Niehusmann
@ 2011-03-01 22:32             ` Chris Wilson
  0 siblings, 0 replies; 9+ messages in thread
From: Chris Wilson @ 2011-03-01 22:32 UTC (permalink / raw)
  To: Jan Niehusmann, intel-gfx, linux-kernel

On Tue, 1 Mar 2011 23:24:16 +0100, Jan Niehusmann <jan@gondor.com> wrote:
> Indeed, setting dma_alloc_coherent should be sufficient, AFAIKT. After
> all, only the HWS seems to be affected, which is allocated with
> drm_pci_alloc, which in turn uses dma_alloc_coherent.
> 
> I just tried the patch below, it also works for me (as expected). Added
> the comment because otherwise it wouldn't be obvious why the mask gets
> set to 32 bit.
> 
> What I don't know is if really only Broadwater and Crestline chips are
> affected. The tests were done with a Crestline one. But I think it's a
> fair guess that the bug would have been noticed earlier if more recent
> chips were affected, as >4GB RAM have become much more common since then.

The later chips do not use a physical address for the hardware status
page. The patch looks good, thanks.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2011-03-01 22:32 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20110223233022.GA3439@x61s.reliablesolutions.de>
2011-02-25 12:30 ` [PATCH] intel-gtt: fix memory corruption with GM965 and >4GB RAM Jan Niehusmann
2011-02-25 20:22   ` Chris Wilson
2011-02-25 21:16     ` Jan Niehusmann
2011-02-25 22:18       ` Chris Wilson
2011-02-25 23:05         ` Daniel Vetter
2011-02-28  6:46           ` [Intel-gfx] " Zhenyu Wang
2011-02-28 19:57             ` Daniel Vetter
2011-03-01 22:24           ` [PATCH] drm/i915: " Jan Niehusmann
2011-03-01 22:32             ` Chris Wilson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox