From: Mikhail Krylov <sqarert@gmail.com>
To: Robin Murphy <robin.murphy@arm.com>
Cc: Alex Deucher <alexdeucher@gmail.com>,
	Maling list - DRI developers <dri-devel@lists.freedesktop.org>,
	amd-gfx list <amd-gfx@lists.freedesktop.org>
Subject: Re: Screen corruption using radeon kernel driver
Date: Thu, 1 Dec 2022 18:28:21 +0300	[thread overview]
Message-ID: <Y4jIFb2JK5hOG01+@sqrt.uni.cx> (raw)
In-Reply-To: <087f62d7-7d82-4e42-305b-c48176d7d77b@arm.com>

On Thu, Dec 01, 2022 at 02:00:58PM +0000, Robin Murphy wrote:
> On 2022-11-30 19:59, Mikhail Krylov wrote:
> > On Wed, Nov 30, 2022 at 11:07:32AM -0500, Alex Deucher wrote:
> > > On Wed, Nov 30, 2022 at 10:42 AM Robin Murphy <robin.murphy@arm.com> wrote:
> > > > 
> > > > On 2022-11-30 14:28, Alex Deucher wrote:
> > > > > On Wed, Nov 30, 2022 at 7:54 AM Robin Murphy <robin.murphy@arm.com> wrote:
> > > > > > 
> > > > > > On 2022-11-29 17:11, Mikhail Krylov wrote:
> > > > > > > On Tue, Nov 29, 2022 at 11:05:28AM -0500, Alex Deucher wrote:
> > > > > > > > On Tue, Nov 29, 2022 at 10:59 AM Mikhail Krylov <sqarert@gmail.com> wrote:
> > > > > > > > > 
> > > > > > > > > On Tue, Nov 29, 2022 at 09:44:19AM -0500, Alex Deucher wrote:
> > > > > > > > > > On Mon, Nov 28, 2022 at 3:48 PM Mikhail Krylov <sqarert@gmail.com> wrote:
> > > > > > > > > > > 
> > > > > > > > > > > On Mon, Nov 28, 2022 at 09:50:50AM -0500, Alex Deucher wrote:
> > > > > > > > > > > 
> > > > > > > > > > > > > > [excessive quoting removed]
> > > > > > > > > > > 
> > > > > > > > > > > > > So, is there any progress on this issue? I do understand it's not a high
> > > > > > > > > > > > > priority one, and today I've checked it on 6.0 kernel, and
> > > > > > > > > > > > > unfortunately, it still persists...
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I'm considering writing a patch that will allow user to override
> > > > > > > > > > > > > need_dma32/dma_bits setting with a module parameter. I'll have some time
> > > > > > > > > > > > > after the New Year for that.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Is it at all possible that such a patch will be merged into kernel?
> > > > > > > > > > > > > 
> > > > > > > > > > > > On Mon, Nov 28, 2022 at 9:31 AM Mikhail Krylov <sqarert@gmail.com> wrote:
> > > > > > > > > > > > > Unless someone familiar with HIGHMEM can figure out what is going wrong
> > > > > > > > > > > > we should just revert the patch.
> > > > > > > > > > > > 
> > > > > > > > > > > > Alex
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Okay, I was suggesting that mostly because
> > > > > > > > > > > 
> > > > > > > > > > > a) it works for me with dma_bits = 40 (I understand that's what it is
> > > > > > > > > > > without the original patch applied);
> > > > > > > > > > > 
> > > > > > > > > > > > b) there's a hint of uncertainty on this line
> > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/radeon/radeon_device.c#n1359
> > > > > > > > > > > saying that for AGP dma_bits = 32 is the safest option, so apparently there are
> > > > > > > > > > > setups, unlike mine, where dma_bits = 32 is better than 40.
> > > > > > > > > > > 
> > > > > > > > > > > But I'm in no position to argue, just wanted to make myself clear.
> > > > > > > > > > > I'm okay with rebuilding the kernel for my machine until the original
> > > > > > > > > > > patch is reverted or any other fix is applied.
> > > > > > > > > > 
> > > > > > > > > > What GPU do you have and is it AGP?  If it is AGP, does setting
> > > > > > > > > > radeon.agpmode=-1 also fix it?
> > > > > > > > > > 
> > > > > > > > > > Alex
> > > > > > > > > 
> > > > > > > > > That is ATI Radeon X1950, and, unfortunately, radeon.agpmode=-1 doesn't
> > > > > > > > > help, it just makes 3D acceleration in games such as OpenArena stop
> > > > > > > > > working.
> > > > > > > > 
> > > > > > > > Just to confirm, is the board AGP or PCIe?
> > > > > > > > 
> > > > > > > > Alex
> > > > > > > 
> > > > > > > It is AGP. That's an old machine.
> > > > > > 
> > > > > > Can you check whether dma_addressing_limited() is actually returning the
> > > > > > expected result at the point of radeon_ttm_init()? Disabling highmem is
> > > > > > presumably just hiding whatever problem exists, by throwing away all
> > > > > >    >32-bit RAM such that use_dma32 doesn't matter.
> > > > > 
> > > > > The device in question only supports a 32 bit DMA mask so
> > > > > dma_addressing_limited() should return true.  Bounce buffers are not
> > > > > really usable on GPUs because they map so much memory.  If
> > > > > dma_addressing_limited() returns false, that would explain it.
> > > > 
> > > > Right, it appears to be the only part of the offending commit that
> > > > *could* reasonably make any difference, so I'm primarily wondering if
> > > > dma_get_required_mask() somehow gets confused.
> > > 
> > > Mikhail,
> > > 
> > > Can you see what dma_addressing_limited() and dma_get_required_mask()
> > > return in this case?
> > > 
> > > Alex
> > > 
> > > 
> > > > 
> > > > Thanks,
> > > > Robin.
> > 
> > Unfortunately, right now I don't have enough time for kernel
> > modifications and rebuilds (I will later!), so I did some quick-and-dirty
> > research with kprobe.
> > 
> > The problem is that dma_addressing_limited() seems to be inlined and
> > kprobe fails to intercept it.
> > 
> > But I managed to get the result of dma_get_required_mask(). It returns
> > 0x7fffffff (!) on the vanilla (with the patch, buggy) kernel:
> > $ sudo kprobe-perf 'r:dma_get_required_mask $retval'
> > Tracing kprobe dma_get_required_mask. Ctrl-C to end.
> >          modprobe-1244    [000] d...   105.582816: dma_get_required_mask: (radeon_ttm_init+0x61/0x240 [radeon] <- dma_get_required_mask) arg1=0x7fffffff
> > 
> > This function does not even get called in the kernel without the patch
> > that I built myself. I believe that's because ttm_bo_device_init()
> > doesn't call it without the patch.
> > 
> > Hope that helps at least a bit. If not, I'll be able to do more thorough
> > research in a couple of weeks, probably.
> 
> Hmm, just to clarify, what's your actual RAM layout? I've been assuming
> that the issue must be caused by unexpected DMA address truncation, but
> double-checking the older threads it seems that might not be the case.
> I just did a quick sanity-check of both HIGHMEM4G and HIGHMEM64G configs
> in a VM with either 2GB or 4GB of RAM assigned, and the
> dma_direct_get_required_mask() calculation seemed to return the
> appropriate result for all combinations.
> 
> Otherwise, the only significant difference of use_dma32 seems to be to
> switch TTM's allocation flags from GFP_HIGHUSER to GFP_DMA32. Could it
> just be that the highmem support somewhere between TTM and radeon has
> bitrotted, and it hasn't been noticed until this change because everyone
> still using a 32-bit system with highmem also happens not to be using a
> newer 40-bit-capable GPU? Or perhaps it never worked for AGP at all, in
> which case an explicit special case might be clearer?
> 
> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
> index d33fec488713..acb2d534bff5 100644
> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> @@ -696,6 +696,7 @@ int radeon_ttm_init(struct radeon_device *rdev)
>  			       rdev->ddev->anon_inode->i_mapping,
>  			       rdev->ddev->vma_offset_manager,
>  			       rdev->need_swiotlb,
> +			       rdev->flags & RADEON_IS_AGP ||
>  			       dma_addressing_limited(&rdev->pdev->dev));
>  	if (r) {
>  		DRM_ERROR("failed initializing buffer object driver(%d).\n", r);
> 
> Robin.

Sorry, I'm not sure what you mean, so I'll try to guess:

The bug exists on the stock 32-bit non-PAE Debian kernel (the PAE one also
runs on this machine, but the bug persists there too):

https://packages.debian.org/stable/kernel/linux-image-5.10.0-18-686

It has the following memory-layout-related options:

CONFIG_HIGHMEM4G=y
CONFIG_VMSPLIT_3G=y
CONFIG_HIGHMEM=y

The machine itself has 1.5G of RAM (1024M + 512M sticks).
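
A minimal userspace sketch of the mask arithmetic discussed above. It
assumes RAM is contiguous from physical address 0 (real firmware memory
maps have holes, so the true top address can differ) and that the required
mask is simply the smallest all-ones value covering the highest physical
address. The names required_mask() and addressing_limited() are
hypothetical stand-ins modelled on dma_get_required_mask() and
dma_addressing_limited(); they are not the kernel functions themselves:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model: smallest all-ones mask that covers the highest
 * physical address in the system.
 */
static uint64_t required_mask(uint64_t highest_phys_addr)
{
	uint64_t mask = 0;

	/* Grow the mask one bit at a time until it covers the address. */
	while (mask < highest_phys_addr)
		mask = (mask << 1) | 1;
	return mask;
}

/*
 * Hypothetical model: the device counts as "limited" only when its DMA
 * mask cannot cover all of the memory that is actually installed.
 */
static bool addressing_limited(uint64_t device_mask, uint64_t highest_phys_addr)
{
	return device_mask < required_mask(highest_phys_addr);
}
```

Under those assumptions, 1.5G of RAM puts the top address at 0x5fffffff,
so required_mask() gives 0x7fffffff, which matches the kprobe output quoted
above. A 32-bit device mask (0xffffffff) is not smaller than that, so
addressing_limited() is false. If the kernel check behaves the same way,
use_dma32 would end up false on this machine with the patch applied.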


Thread overview: 38+ messages
2022-04-23 16:31 Screen corruption using radeon kernel driver Krylov Michael
2022-04-25 17:22 ` Alex Deucher
2022-05-16 13:01   ` Mikhail Krylov
2022-11-28 14:31   ` Mikhail Krylov
2022-11-28 14:50     ` Alex Deucher
2022-11-28 20:48       ` Mikhail Krylov
2022-11-29 14:44         ` Alex Deucher
2022-11-29 15:59           ` Mikhail Krylov
2022-11-29 16:05             ` Alex Deucher
2022-11-29 17:11               ` Mikhail Krylov
2022-11-30 12:54                 ` Robin Murphy
2022-11-30 14:28                   ` Alex Deucher
2022-11-30 15:42                     ` Robin Murphy
2022-11-30 16:07                       ` Alex Deucher
2022-11-30 19:59                         ` Mikhail Krylov
2022-12-01 14:00                           ` Robin Murphy
2022-12-01 14:06                             ` Alex Deucher
2022-12-01 15:28                             ` Mikhail Krylov [this message]
2022-12-10 15:32                         ` Mikhail Krylov
2022-12-11  5:52                           ` Luben Tuikov
2022-12-11 11:42                             ` [PATCH] drm/radeon: Fix screen corruption Luben Tuikov
2022-12-12  2:08                               ` [PATCH] drm/radeon: Fix screen corruption (v2) Luben Tuikov
2022-12-14 21:53                                 ` Robin Murphy
2022-12-14 22:02                                   ` Alex Deucher
2022-12-14 23:08                                     ` Robin Murphy
2022-12-15  8:07                                       ` Christian König
2022-12-15  9:08                                         ` Luben Tuikov
2022-12-15  9:46                                           ` Christian König
2022-12-15 10:19                                             ` Luben Tuikov
2022-12-15 11:27                                               ` Christian König
2022-12-15 11:40                                                 ` Luben Tuikov
2022-12-15 11:53                                                   ` Robin Murphy
2022-12-15 12:07                                                     ` Luben Tuikov
2023-01-19 16:56                                                       ` Krylov Michael
2023-01-20  4:31                                                         ` Luben Tuikov
