* [PATCH 0/3] drm/radeon kexec fixes
@ 2013-09-08 12:09 Markus Trippelsdorf
2013-09-08 12:10 ` [PATCH 1/3] kexec: get rid of late printk Markus Trippelsdorf
` (3 more replies)
0 siblings, 4 replies; 19+ messages in thread
From: Markus Trippelsdorf @ 2013-09-08 12:09 UTC (permalink / raw)
To: dri-devel; +Cc: Alex Deucher, kexec, Eric Biederman
Here are a couple of patches that get kexec working with radeon devices.
I've tested this on my RS780.
Comments or flames are welcome.
Thanks.
Markus Trippelsdorf (3):
kexec: get rid of late printk
drm/radeon: Implement radeon_pci_shutdown
drm/radeon: get rid of r100_restore_sanity hack
drivers/gpu/drm/radeon/r100.c | 27 ---------------------------
drivers/gpu/drm/radeon/r300.c | 2 --
drivers/gpu/drm/radeon/r420.c | 2 --
drivers/gpu/drm/radeon/r520.c | 2 --
drivers/gpu/drm/radeon/radeon_asic.h | 1 -
drivers/gpu/drm/radeon/radeon_drv.c | 10 ++++++++++
drivers/gpu/drm/radeon/rs400.c | 2 --
drivers/gpu/drm/radeon/rs600.c | 2 --
drivers/gpu/drm/radeon/rs690.c | 2 --
drivers/gpu/drm/radeon/rv515.c | 2 --
kernel/kexec.c | 1 -
11 files changed, 10 insertions(+), 43 deletions(-)
--
Markus
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 1/3] kexec: get rid of late printk
2013-09-08 12:09 [PATCH 0/3] drm/radeon kexec fixes Markus Trippelsdorf
@ 2013-09-08 12:10 ` Markus Trippelsdorf
2013-09-08 20:11 ` Daniel Vetter
2013-09-08 12:10 ` [PATCH 2/3] drm/radeon: Implement radeon_pci_shutdown Markus Trippelsdorf
` (2 subsequent siblings)
3 siblings, 1 reply; 19+ messages in thread
From: Markus Trippelsdorf @ 2013-09-08 12:10 UTC (permalink / raw)
To: dri-devel; +Cc: Alex Deucher, kexec, Eric Biederman
kexec calls:
printk(KERN_EMERG "Starting new kernel\n");
late before calling machine_shutdown().
However at this point the underlying fb device may have already been
shutdown. This causes the kernel to hang.
Fix by simply getting rid of the printk call.
Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de>
---
kernel/kexec.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 59f7b55..f33fa9f 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1679,7 +1679,6 @@ int kernel_kexec(void)
#endif
{
kernel_restart_prepare(NULL);
- printk(KERN_EMERG "Starting new kernel\n");
machine_shutdown();
}
--
Markus
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 2/3] drm/radeon: Implement radeon_pci_shutdown
2013-09-08 12:09 [PATCH 0/3] drm/radeon kexec fixes Markus Trippelsdorf
2013-09-08 12:10 ` [PATCH 1/3] kexec: get rid of late printk Markus Trippelsdorf
@ 2013-09-08 12:10 ` Markus Trippelsdorf
2013-09-09 13:32 ` Konrad Rzeszutek Wilk
2013-09-08 12:11 ` [PATCH 3/3] drm/radeon: get rid of r100_restore_sanity hack Markus Trippelsdorf
2013-09-09 0:32 ` [PATCH 0/3] drm/radeon kexec fixes Eric W. Biederman
3 siblings, 1 reply; 19+ messages in thread
From: Markus Trippelsdorf @ 2013-09-08 12:10 UTC (permalink / raw)
To: dri-devel; +Cc: Alex Deucher, kexec, Eric Biederman
Currently radeon devices are not properbly shutdown during kexec. This
cases a varity of issues, e.g. dpm initialization failures.
Fix this by implementing a radeon_pci_shutdown function, that unloads
the driver cleanly.
Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de>
---
drivers/gpu/drm/radeon/radeon_drv.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
index cb4445f..d71edee 100644
--- a/drivers/gpu/drm/radeon/radeon_drv.c
+++ b/drivers/gpu/drm/radeon/radeon_drv.c
@@ -380,6 +380,15 @@ static const struct file_operations radeon_driver_kms_fops = {
#endif
};
+
+static void
+radeon_pci_shutdown(struct pci_dev *pdev)
+{
+ struct drm_device *dev = pci_get_drvdata(pdev);
+
+ radeon_driver_unload_kms(dev);
+}
+
static struct drm_driver kms_driver = {
.driver_features =
DRIVER_USE_AGP |
@@ -453,6 +462,7 @@ static struct pci_driver radeon_kms_pci_driver = {
.remove = radeon_pci_remove,
.suspend = radeon_pci_suspend,
.resume = radeon_pci_resume,
+ .shutdown = radeon_pci_shutdown,
};
static int __init radeon_init(void)
--
Markus
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 3/3] drm/radeon: get rid of r100_restore_sanity hack
2013-09-08 12:09 [PATCH 0/3] drm/radeon kexec fixes Markus Trippelsdorf
2013-09-08 12:10 ` [PATCH 1/3] kexec: get rid of late printk Markus Trippelsdorf
2013-09-08 12:10 ` [PATCH 2/3] drm/radeon: Implement radeon_pci_shutdown Markus Trippelsdorf
@ 2013-09-08 12:11 ` Markus Trippelsdorf
2013-09-09 0:32 ` [PATCH 0/3] drm/radeon kexec fixes Eric W. Biederman
3 siblings, 0 replies; 19+ messages in thread
From: Markus Trippelsdorf @ 2013-09-08 12:11 UTC (permalink / raw)
To: dri-devel; +Cc: Alex Deucher, kexec, Eric Biederman
Now that radeon devices are properly shutdown during kexec, we can get
rid of r100_restore_sanity.
Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de>
---
drivers/gpu/drm/radeon/r100.c | 27 ---------------------------
drivers/gpu/drm/radeon/r300.c | 2 --
drivers/gpu/drm/radeon/r420.c | 2 --
drivers/gpu/drm/radeon/r520.c | 2 --
drivers/gpu/drm/radeon/radeon_asic.h | 1 -
drivers/gpu/drm/radeon/rs400.c | 2 --
drivers/gpu/drm/radeon/rs600.c | 2 --
drivers/gpu/drm/radeon/rs690.c | 2 --
drivers/gpu/drm/radeon/rv515.c | 2 --
9 files changed, 42 deletions(-)
diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c
index 9fc61dd..d53dcd8 100644
--- a/drivers/gpu/drm/radeon/r100.c
+++ b/drivers/gpu/drm/radeon/r100.c
@@ -3938,31 +3938,6 @@ void r100_fini(struct radeon_device *rdev)
rdev->bios = NULL;
}
-/*
- * Due to how kexec works, it can leave the hw fully initialised when it
- * boots the new kernel. However doing our init sequence with the CP and
- * WB stuff setup causes GPU hangs on the RN50 at least. So at startup
- * do some quick sanity checks and restore sane values to avoid this
- * problem.
- */
-void r100_restore_sanity(struct radeon_device *rdev)
-{
- u32 tmp;
-
- tmp = RREG32(RADEON_CP_CSQ_CNTL);
- if (tmp) {
- WREG32(RADEON_CP_CSQ_CNTL, 0);
- }
- tmp = RREG32(RADEON_CP_RB_CNTL);
- if (tmp) {
- WREG32(RADEON_CP_RB_CNTL, 0);
- }
- tmp = RREG32(RADEON_SCRATCH_UMSK);
- if (tmp) {
- WREG32(RADEON_SCRATCH_UMSK, 0);
- }
-}
-
int r100_init(struct radeon_device *rdev)
{
int r;
@@ -3975,8 +3950,6 @@ int r100_init(struct radeon_device *rdev)
radeon_scratch_init(rdev);
/* Initialize surface registers */
radeon_surface_init(rdev);
- /* sanity check some register to avoid hangs like after kexec */
- r100_restore_sanity(rdev);
/* TODO: disable VGA need to use VGA request */
/* BIOS*/
if (!radeon_get_bios(rdev)) {
diff --git a/drivers/gpu/drm/radeon/r300.c b/drivers/gpu/drm/radeon/r300.c
index d8dd269..57ba534 100644
--- a/drivers/gpu/drm/radeon/r300.c
+++ b/drivers/gpu/drm/radeon/r300.c
@@ -1480,8 +1480,6 @@ int r300_init(struct radeon_device *rdev)
/* Initialize surface registers */
radeon_surface_init(rdev);
/* TODO: disable VGA need to use VGA request */
- /* restore some register to sane defaults */
- r100_restore_sanity(rdev);
/* BIOS*/
if (!radeon_get_bios(rdev)) {
if (ASIC_IS_AVIVO(rdev))
diff --git a/drivers/gpu/drm/radeon/r420.c b/drivers/gpu/drm/radeon/r420.c
index 4e796ec..9ee3360 100644
--- a/drivers/gpu/drm/radeon/r420.c
+++ b/drivers/gpu/drm/radeon/r420.c
@@ -371,8 +371,6 @@ int r420_init(struct radeon_device *rdev)
/* Initialize surface registers */
radeon_surface_init(rdev);
/* TODO: disable VGA need to use VGA request */
- /* restore some register to sane defaults */
- r100_restore_sanity(rdev);
/* BIOS*/
if (!radeon_get_bios(rdev)) {
if (ASIC_IS_AVIVO(rdev))
diff --git a/drivers/gpu/drm/radeon/r520.c b/drivers/gpu/drm/radeon/r520.c
index e1aece7..4709c10 100644
--- a/drivers/gpu/drm/radeon/r520.c
+++ b/drivers/gpu/drm/radeon/r520.c
@@ -256,8 +256,6 @@ int r520_init(struct radeon_device *rdev)
radeon_scratch_init(rdev);
/* Initialize surface registers */
radeon_surface_init(rdev);
- /* restore some register to sane defaults */
- r100_restore_sanity(rdev);
/* TODO: disable VGA need to use VGA request */
/* BIOS*/
if (!radeon_get_bios(rdev)) {
diff --git a/drivers/gpu/drm/radeon/radeon_asic.h b/drivers/gpu/drm/radeon/radeon_asic.h
index 818bbe6..6eee9e2 100644
--- a/drivers/gpu/drm/radeon/radeon_asic.h
+++ b/drivers/gpu/drm/radeon/radeon_asic.h
@@ -122,7 +122,6 @@ void r100_mc_resume(struct radeon_device *rdev, struct r100_mc_save *save);
void r100_vram_init_sizes(struct radeon_device *rdev);
int r100_cp_reset(struct radeon_device *rdev);
void r100_vga_render_disable(struct radeon_device *rdev);
-void r100_restore_sanity(struct radeon_device *rdev);
int r100_cs_track_check_pkt3_indx_buffer(struct radeon_cs_parser *p,
struct radeon_cs_packet *pkt,
struct radeon_bo *robj);
diff --git a/drivers/gpu/drm/radeon/rs400.c b/drivers/gpu/drm/radeon/rs400.c
index b8074a8..23bbf89 100644
--- a/drivers/gpu/drm/radeon/rs400.c
+++ b/drivers/gpu/drm/radeon/rs400.c
@@ -510,8 +510,6 @@ int rs400_init(struct radeon_device *rdev)
/* Initialize surface registers */
radeon_surface_init(rdev);
/* TODO: disable VGA need to use VGA request */
- /* restore some register to sane defaults */
- r100_restore_sanity(rdev);
/* BIOS*/
if (!radeon_get_bios(rdev)) {
if (ASIC_IS_AVIVO(rdev))
diff --git a/drivers/gpu/drm/radeon/rs600.c b/drivers/gpu/drm/radeon/rs600.c
index 670b555..cfe6f80 100644
--- a/drivers/gpu/drm/radeon/rs600.c
+++ b/drivers/gpu/drm/radeon/rs600.c
@@ -1018,8 +1018,6 @@ int rs600_init(struct radeon_device *rdev)
radeon_scratch_init(rdev);
/* Initialize surface registers */
radeon_surface_init(rdev);
- /* restore some register to sane defaults */
- r100_restore_sanity(rdev);
/* BIOS */
if (!radeon_get_bios(rdev)) {
if (ASIC_IS_AVIVO(rdev))
diff --git a/drivers/gpu/drm/radeon/rs690.c b/drivers/gpu/drm/radeon/rs690.c
index d8ddfb3..e1f4036 100644
--- a/drivers/gpu/drm/radeon/rs690.c
+++ b/drivers/gpu/drm/radeon/rs690.c
@@ -789,8 +789,6 @@ int rs690_init(struct radeon_device *rdev)
radeon_scratch_init(rdev);
/* Initialize surface registers */
radeon_surface_init(rdev);
- /* restore some register to sane defaults */
- r100_restore_sanity(rdev);
/* TODO: disable VGA need to use VGA request */
/* BIOS*/
if (!radeon_get_bios(rdev)) {
diff --git a/drivers/gpu/drm/radeon/rv515.c b/drivers/gpu/drm/radeon/rv515.c
index 8ea1573..f9692d7 100644
--- a/drivers/gpu/drm/radeon/rv515.c
+++ b/drivers/gpu/drm/radeon/rv515.c
@@ -627,8 +627,6 @@ int rv515_init(struct radeon_device *rdev)
/* Initialize surface registers */
radeon_surface_init(rdev);
/* TODO: disable VGA need to use VGA request */
- /* restore some register to sane defaults */
- r100_restore_sanity(rdev);
/* BIOS*/
if (!radeon_get_bios(rdev)) {
if (ASIC_IS_AVIVO(rdev))
--
Markus
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] kexec: get rid of late printk
2013-09-08 12:10 ` [PATCH 1/3] kexec: get rid of late printk Markus Trippelsdorf
@ 2013-09-08 20:11 ` Daniel Vetter
2013-09-08 20:42 ` Bruno Prémont
0 siblings, 1 reply; 19+ messages in thread
From: Daniel Vetter @ 2013-09-08 20:11 UTC (permalink / raw)
To: Markus Trippelsdorf; +Cc: kexec, Eric Biederman, dri-devel
On Sun, Sep 8, 2013 at 2:10 PM, Markus Trippelsdorf
<markus@trippelsdorf.de> wrote:
> kexec calls:
> printk(KERN_EMERG "Starting new kernel\n");
> late before calling machine_shutdown().
> However at this point the underlying fb device may have already been
> shutdown. This causes the kernel to hang.
> Fix by simply getting rid of the printk call.
>
> Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Shouldn't this be taken care of with the suspend/resume_console calls?
At least that's my understanding how it works in the suspend/hibernate
code, maybe kexec needs similar treatment ...
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] kexec: get rid of late printk
2013-09-08 20:11 ` Daniel Vetter
@ 2013-09-08 20:42 ` Bruno Prémont
0 siblings, 0 replies; 19+ messages in thread
From: Bruno Prémont @ 2013-09-08 20:42 UTC (permalink / raw)
To: Daniel Vetter; +Cc: dri-devel, kexec, Eric Biederman, Markus Trippelsdorf
On Sun, 08 September 2013 Daniel Vetter <daniel@ffwll.ch> wrote:
> On Sun, Sep 8, 2013 at 2:10 PM, Markus Trippelsdorf
> <markus@trippelsdorf.de> wrote:
> > kexec calls:
> > printk(KERN_EMERG "Starting new kernel\n");
> > late before calling machine_shutdown().
> > However at this point the underlying fb device may have already been
> > shutdown. This causes the kernel to hang.
> > Fix by simply getting rid of the printk call.
> >
> > Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de>
>
> Shouldn't this be taken care of with the suspend/resume_console calls?
> At least that's my understanding how it works in the suspend/hibernate
> code, maybe kexec needs similar treatment ...
Is it suspend/resume_console? Shouldn't the fbcon be short-circuited
or disabled once there is no more underlying fb?
Serial console, if present, as well as netconsole if network device
is still alive should continue working I would say.
Bruno
> -Daniel
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/3] drm/radeon kexec fixes
2013-09-08 12:09 [PATCH 0/3] drm/radeon kexec fixes Markus Trippelsdorf
` (2 preceding siblings ...)
2013-09-08 12:11 ` [PATCH 3/3] drm/radeon: get rid of r100_restore_sanity hack Markus Trippelsdorf
@ 2013-09-09 0:32 ` Eric W. Biederman
2013-09-09 9:21 ` Markus Trippelsdorf
3 siblings, 1 reply; 19+ messages in thread
From: Eric W. Biederman @ 2013-09-09 0:32 UTC (permalink / raw)
To: Markus Trippelsdorf; +Cc: Alex Deucher, kexec, dri-devel
Markus Trippelsdorf <markus@trippelsdorf.de> writes:
> Here are a couple of patches that get kexec working with radeon devices.
> I've tested this on my RS780.
> Comments or flames are welcome.
> Thanks.
A couple of high level comments.
This looks promising for the usual case.
Removing the printk at the end of the kexec path seems a little dubious,
what of other cpus, interrupt handlers, etc. Basically estabilishing a
new rule on when printk is allowed seems a little dubious at this point,
even if it is a useful debugging trick.
Having a clean shutdown of the radeon definitely seems worth doing,
because the cases where we care abouty video are when a person is in
front of the system.
I don't know if you want to remove the sanity checks. They seem cheap
and safe regardless. Are they expensive or ineffective? Moreover if
they work a reasonable amount of the time that means that the kexec on
panic case (where we don't shut anything down) can actually use the
video, and that in general the driver will be more robust. I don't
expect anyone much cares as kexec on panic is mostly used to just write
a core file to the network, or the local disk. But if it is easy to
keep that case working most of the time, why not.
Eric
> Markus Trippelsdorf (3):
> kexec: get rid of late printk
> drm/radeon: Implement radeon_pci_shutdown
> drm/radeon: get rid of r100_restore_sanity hack
>
> drivers/gpu/drm/radeon/r100.c | 27 ---------------------------
> drivers/gpu/drm/radeon/r300.c | 2 --
> drivers/gpu/drm/radeon/r420.c | 2 --
> drivers/gpu/drm/radeon/r520.c | 2 --
> drivers/gpu/drm/radeon/radeon_asic.h | 1 -
> drivers/gpu/drm/radeon/radeon_drv.c | 10 ++++++++++
> drivers/gpu/drm/radeon/rs400.c | 2 --
> drivers/gpu/drm/radeon/rs600.c | 2 --
> drivers/gpu/drm/radeon/rs690.c | 2 --
> drivers/gpu/drm/radeon/rv515.c | 2 --
> kernel/kexec.c | 1 -
> 11 files changed, 10 insertions(+), 43 deletions(-)
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/3] drm/radeon kexec fixes
2013-09-09 0:32 ` [PATCH 0/3] drm/radeon kexec fixes Eric W. Biederman
@ 2013-09-09 9:21 ` Markus Trippelsdorf
2013-09-09 9:38 ` Christian König
2013-09-09 13:04 ` Alex Deucher
0 siblings, 2 replies; 19+ messages in thread
From: Markus Trippelsdorf @ 2013-09-09 9:21 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Alex Deucher, kexec, dri-devel
On 2013.09.08 at 17:32 -0700, Eric W. Biederman wrote:
> Markus Trippelsdorf <markus@trippelsdorf.de> writes:
>
> > Here are a couple of patches that get kexec working with radeon devices.
> > I've tested this on my RS780.
> > Comments or flames are welcome.
> > Thanks.
>
> A couple of high level comments.
>
> This looks promising for the usual case.
>
> Removing the printk at the end of the kexec path seems a little dubious,
> what of other cpus, interrupt handlers, etc. Basically estabilishing a
> new rule on when printk is allowed seems a little dubious at this point,
> even if it is a useful debugging trick.
OK. I will drop this patch. It doesn't seem to be necessary, because I
cannot reproduce the printk related hang anymore.
> Having a clean shutdown of the radeon definitely seems worth doing,
> because the cases where we care abouty video are when a person is in
> front of the system.
Yes. But please note that even with radeon_pci_shutdown implemented, I
still get ring test failures on roughly every eighth kexec boot:
[drm:r600_dma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD)
radeon 0000:01:05.0: disabling GPU acceleration
That's definitely better than the current state of affairs, with ring
test failures on every second boot. But I haven't figured out the reason
for these failures yet. It's curious that once a ring test failure
occurs, it will reliably fail after each kexec invocation, no matter how
often repeated. Only a reboot brings the machine back to normal.
> I don't know if you want to remove the sanity checks. They seem cheap
> and safe regardless. Are they expensive or ineffective? Moreover if
> they work a reasonable amount of the time that means that the kexec on
> panic case (where we don't shut anything down) can actually use the
> video, and that in general the driver will be more robust. I don't
> expect anyone much cares as kexec on panic is mostly used to just write
> a core file to the network, or the local disk. But if it is easy to
> keep that case working most of the time, why not.
IIRC Alex said the sanity checks are expensive and boot-time could be
improved by dropping them. Maybe he can chime in?
--
Markus
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/3] drm/radeon kexec fixes
2013-09-09 9:21 ` Markus Trippelsdorf
@ 2013-09-09 9:38 ` Christian König
2013-09-11 9:01 ` Markus Trippelsdorf
2013-09-09 13:04 ` Alex Deucher
1 sibling, 1 reply; 19+ messages in thread
From: Christian König @ 2013-09-09 9:38 UTC (permalink / raw)
To: Markus Trippelsdorf; +Cc: kexec, Eric W. Biederman, dri-devel
Am 09.09.2013 11:21, schrieb Markus Trippelsdorf:
> On 2013.09.08 at 17:32 -0700, Eric W. Biederman wrote:
>> Markus Trippelsdorf <markus@trippelsdorf.de> writes:
>>
>>> Here are a couple of patches that get kexec working with radeon devices.
>>> I've tested this on my RS780.
>>> Comments or flames are welcome.
>>> Thanks.
>> A couple of high level comments.
>>
>> This looks promising for the usual case.
>>
>> Removing the printk at the end of the kexec path seems a little dubious,
>> what of other cpus, interrupt handlers, etc. Basically estabilishing a
>> new rule on when printk is allowed seems a little dubious at this point,
>> even if it is a useful debugging trick.
> OK. I will drop this patch. It doesn't seem to be necessary, because I
> cannot reproduce the printk related hang anymore.
>
>> Having a clean shutdown of the radeon definitely seems worth doing,
>> because the cases where we care abouty video are when a person is in
>> front of the system.
> Yes. But please note that even with radeon_pci_shutdown implemented, I
> still get ring test failures on roughly every eighth kexec boot:
>
> [drm:r600_dma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD)
> radeon 0000:01:05.0: disabling GPU acceleration
>
> That's definitely better than the current state of affairs, with ring
> test failures on every second boot. But I haven't figured out the reason
> for these failures yet. It's curious that once a ring test failure
> occurs, it will reliably fail after each kexec invocation, no matter how
> often repeated. Only a reboot brings the machine back to normal.
The main problem here is that the AMD gfx hardware doesn't really
support being reinitialized once booted (for various reasons). It's a
(intended) limitation of the hardware design that you can only
initialize certain blocks once every power cycle, so the whole approach
actually will never work 100% reliable.
All you can hope for is that stopping the hardware while shutting down
the old kernel and starting it again results in exactly the same
hardware parameters (offsets, clock etc...) otherwise starting the
blocks will just fail and you end up with disabled acceleration like above.
Sorry, but there isn't much we can do about this,
Christian.
>> I don't know if you want to remove the sanity checks. They seem cheap
>> and safe regardless. Are they expensive or ineffective? Moreover if
>> they work a reasonable amount of the time that means that the kexec on
>> panic case (where we don't shut anything down) can actually use the
>> video, and that in general the driver will be more robust. I don't
>> expect anyone much cares as kexec on panic is mostly used to just write
>> a core file to the network, or the local disk. But if it is easy to
>> keep that case working most of the time, why not.
> IIRC Alex said the sanity checks are expensive and boot-time could be
> improved by dropping them. Maybe he can chime in?
>
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/3] drm/radeon kexec fixes
2013-09-09 9:21 ` Markus Trippelsdorf
2013-09-09 9:38 ` Christian König
@ 2013-09-09 13:04 ` Alex Deucher
2013-09-10 18:27 ` Eric W. Biederman
1 sibling, 1 reply; 19+ messages in thread
From: Alex Deucher @ 2013-09-09 13:04 UTC (permalink / raw)
To: Markus Trippelsdorf
Cc: kexec, Eric W. Biederman, Maling list - DRI developers
On Mon, Sep 9, 2013 at 5:21 AM, Markus Trippelsdorf
<markus@trippelsdorf.de> wrote:
> On 2013.09.08 at 17:32 -0700, Eric W. Biederman wrote:
>> Markus Trippelsdorf <markus@trippelsdorf.de> writes:
>>
>> > Here are a couple of patches that get kexec working with radeon devices.
>> > I've tested this on my RS780.
>> > Comments or flames are welcome.
>> > Thanks.
>>
>> A couple of high level comments.
>>
>> This looks promising for the usual case.
>>
>> Removing the printk at the end of the kexec path seems a little dubious,
>> what of other cpus, interrupt handlers, etc. Basically estabilishing a
>> new rule on when printk is allowed seems a little dubious at this point,
>> even if it is a useful debugging trick.
>
> OK. I will drop this patch. It doesn't seem to be necessary, because I
> cannot reproduce the printk related hang anymore.
>
>> Having a clean shutdown of the radeon definitely seems worth doing,
>> because the cases where we care abouty video are when a person is in
>> front of the system.
>
> Yes. But please note that even with radeon_pci_shutdown implemented, I
> still get ring test failures on roughly every eighth kexec boot:
>
> [drm:r600_dma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD)
> radeon 0000:01:05.0: disabling GPU acceleration
>
> That's definitely better than the current state of affairs, with ring
> test failures on every second boot. But I haven't figured out the reason
> for these failures yet. It's curious that once a ring test failure
> occurs, it will reliably fail after each kexec invocation, no matter how
> often repeated. Only a reboot brings the machine back to normal.
>
>> I don't know if you want to remove the sanity checks. They seem cheap
>> and safe regardless. Are they expensive or ineffective? Moreover if
>> they work a reasonable amount of the time that means that the kexec on
>> panic case (where we don't shut anything down) can actually use the
>> video, and that in general the driver will be more robust. I don't
>> expect anyone much cares as kexec on panic is mostly used to just write
>> a core file to the network, or the local disk. But if it is easy to
>> keep that case working most of the time, why not.
>
> IIRC Alex said the sanity checks are expensive and boot-time could be
> improved by dropping them. Maybe he can chime in?
They shouldn't be necessary with a proper shutdown, but in this
particular case, they are not very expensive. What is expensive is
having a separate sanity check functions for all the various hw blocks
to teardown everything on startup prior to starting it up in case
kexec, etc. left the system in a bad state. It ends up amounting to a
full tear down sequence followed by a full start up sequence every
time you load the driver.
I can't really comment on the first patch, but the rest seem fine.
Alex
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] drm/radeon: Implement radeon_pci_shutdown
2013-09-08 12:10 ` [PATCH 2/3] drm/radeon: Implement radeon_pci_shutdown Markus Trippelsdorf
@ 2013-09-09 13:32 ` Konrad Rzeszutek Wilk
0 siblings, 0 replies; 19+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-09-09 13:32 UTC (permalink / raw)
To: Markus Trippelsdorf; +Cc: kexec, Eric Biederman, dri-devel
On Sun, Sep 08, 2013 at 02:10:58PM +0200, Markus Trippelsdorf wrote:
> Currently radeon devices are not properbly shutdown during kexec. This
properly
> cases a varity of issues, e.g. dpm initialization failures.
> Fix this by implementing a radeon_pci_shutdown function, that unloads
> the driver cleanly.
>
> Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de>
> ---
> drivers/gpu/drm/radeon/radeon_drv.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
> index cb4445f..d71edee 100644
> --- a/drivers/gpu/drm/radeon/radeon_drv.c
> +++ b/drivers/gpu/drm/radeon/radeon_drv.c
> @@ -380,6 +380,15 @@ static const struct file_operations radeon_driver_kms_fops = {
> #endif
> };
>
> +
> +static void
> +radeon_pci_shutdown(struct pci_dev *pdev)
> +{
> + struct drm_device *dev = pci_get_drvdata(pdev);
> +
> + radeon_driver_unload_kms(dev);
> +}
> +
> static struct drm_driver kms_driver = {
> .driver_features =
> DRIVER_USE_AGP |
> @@ -453,6 +462,7 @@ static struct pci_driver radeon_kms_pci_driver = {
> .remove = radeon_pci_remove,
> .suspend = radeon_pci_suspend,
> .resume = radeon_pci_resume,
> + .shutdown = radeon_pci_shutdown,
> };
>
> static int __init radeon_init(void)
> --
> Markus
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/3] drm/radeon kexec fixes
2013-09-09 13:04 ` Alex Deucher
@ 2013-09-10 18:27 ` Eric W. Biederman
2013-09-10 20:40 ` Alex Deucher
0 siblings, 1 reply; 19+ messages in thread
From: Eric W. Biederman @ 2013-09-10 18:27 UTC (permalink / raw)
To: Alex Deucher
Cc: Christian König, kexec, Maling list - DRI developers,
Markus Trippelsdorf
Alex Deucher <alexdeucher@gmail.com> writes:
> On Mon, Sep 9, 2013 at 5:21 AM, Markus Trippelsdorf
> <markus@trippelsdorf.de> wrote:
>
>> IIRC Alex said the sanity checks are expensive and boot-time could be
>> improved by dropping them. Maybe he can chime in?
>
> They shouldn't be necessary with a proper shutdown, but in this
> particular case, they are not very expensive. What is expensive is
> having a separate sanity check functions for all the various hw blocks
> to teardown everything on startup prior to starting it up in case
> kexec, etc. left the system in a bad state. It ends up amounting to a
> full tear down sequence followed by a full start up sequence every
> time you load the driver.
>
> I can't really comment on the first patch, but the rest seem fine.
Let me reask the question just a little bit.
Is it the sanity checks that are expensive? Or is it the
reinitialization that is triggered by the sanity checks that is
expensive?
From what Christian said in the other reply it sounds like this is a
game we will never completely win, but it would be nice to have half a
chance in the kexec on panic case to have video. So I am curious to
know if the checks are expensive when we are coming at hardware in a
clean state.
Eric
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/3] drm/radeon kexec fixes
2013-09-10 18:27 ` Eric W. Biederman
@ 2013-09-10 20:40 ` Alex Deucher
2013-09-11 8:53 ` Markus Trippelsdorf
0 siblings, 1 reply; 19+ messages in thread
From: Alex Deucher @ 2013-09-10 20:40 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Christian König, kexec, Maling list - DRI developers,
Markus Trippelsdorf
On Tue, Sep 10, 2013 at 2:27 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Alex Deucher <alexdeucher@gmail.com> writes:
>
>> On Mon, Sep 9, 2013 at 5:21 AM, Markus Trippelsdorf
>> <markus@trippelsdorf.de> wrote:
>>
>>> IIRC Alex said the sanity checks are expensive and boot-time could be
>>> improved by dropping them. Maybe he can chime in?
>>
>> They shouldn't be necessary with a proper shutdown, but in this
>> particular case, they are not very expensive. What is expensive is
>> having a separate sanity check functions for all the various hw blocks
>> to teardown everything on startup prior to starting it up in case
>> kexec, etc. left the system in a bad state. It ends up amounting to a
>> full tear down sequence followed by a full start up sequence every
>> time you load the driver.
>>
>> I can't really comment on the first patch, but the rest seem fine.
>
> Let me reask the question just a little bit.
>
> Is it the sanity checks that are expensive? Or is it the
> reinitialization that is triggered by the sanity checks that is
> expensive?
>
> From what Christian said in the other reply it sounds like this is a
> game we will never completely win, but it would be nice to have half a
> chance in the kexec on panic case to have video. So I am curious to
> know if the checks are expensive when we are coming at hardware in a
> clean state.
The particular sanity checks from this patch set are not expensive,
but we had previously discussed more extensive sanity checks for other
aspects of the chips in prior conversations. Prior to this patch set,
the hw is not torn down properly (might have been in the middle of DMA
for example) when kexec happens. That's why the sanity checks were
added in the first place. With this patch set, the sanity checks
shouldn't be necessary.
Alex
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/3] drm/radeon kexec fixes
2013-09-10 20:40 ` Alex Deucher
@ 2013-09-11 8:53 ` Markus Trippelsdorf
2013-09-11 9:21 ` Christian König
2013-09-11 13:40 ` Alex Deucher
0 siblings, 2 replies; 19+ messages in thread
From: Markus Trippelsdorf @ 2013-09-11 8:53 UTC (permalink / raw)
To: Alex Deucher
Cc: Christian König, kexec, Eric W. Biederman,
Maling list - DRI developers
On 2013.09.10 at 16:40 -0400, Alex Deucher wrote:
> On Tue, Sep 10, 2013 at 2:27 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> > Alex Deucher <alexdeucher@gmail.com> writes:
> >
> >> On Mon, Sep 9, 2013 at 5:21 AM, Markus Trippelsdorf
> >> <markus@trippelsdorf.de> wrote:
> >>
> >>> IIRC Alex said the sanity checks are expensive and boot-time could be
> >>> improved by dropping them. Maybe he can chime in?
> >>
> >> They shouldn't be necessary with a proper shutdown, but in this
> >> particular case, they are not very expensive. What is expensive is
> >> having a separate sanity check functions for all the various hw blocks
> >> to teardown everything on startup prior to starting it up in case
> >> kexec, etc. left the system in a bad state. It ends up amounting to a
> >> full tear down sequence followed by a full start up sequence every
> >> time you load the driver.
> >>
> >> I can't really comment on the first patch, but the rest seem fine.
> >
> > Let me reask the question just a little bit.
> >
> > Is it the sanity checks that are expensive? Or is it the
> > reinitialization that is triggered by the sanity checks that is
> > expensive?
> >
> > From what Christian said in the other reply it sounds like this is a
> > game we will never completely win, but it would be nice to have half a
> > chance in the kexec on panic case to have video. So I am curious to
> > know if the checks are expensive when we are coming at hardware in a
> > clean state.
>
> The particular sanity checks from this patch set are not expensive,
> but we had previously discussed more extensive sanity checks for other
> aspects of the chips in prior conversations. Prior to this patch set,
> the hw is not torn down properly (might have been in the middle of DMA
> for example) when kexec happens. That's why the sanity checks were
> added in the first place. With this patch set, the sanity checks
> shouldn't be necessary.
I think you're talking past each other.
What Eric worries about is the »kexec on panic« case, where the shutdown
method *isn't* called. In this case the sanity checks, that are only
superfluous when the hardware was shutdown normally during kexec (the
default case), may actually help. And because the checks aren't
expensive, it wouldn't hurt to just leave them in place.
--
Markus
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/3] drm/radeon kexec fixes
2013-09-09 9:38 ` Christian König
@ 2013-09-11 9:01 ` Markus Trippelsdorf
2013-09-11 9:10 ` Christian König
2013-09-11 13:30 ` Alex Deucher
0 siblings, 2 replies; 19+ messages in thread
From: Markus Trippelsdorf @ 2013-09-11 9:01 UTC (permalink / raw)
To: Christian König; +Cc: kexec, Eric W. Biederman, dri-devel
On 2013.09.09 at 11:38 +0200, Christian König wrote:
> Am 09.09.2013 11:21, schrieb Markus Trippelsdorf:
> > On 2013.09.08 at 17:32 -0700, Eric W. Biederman wrote:
> >> Markus Trippelsdorf <markus@trippelsdorf.de> writes:
> >>
> >>> Here are a couple of patches that get kexec working with radeon devices.
> >>> I've tested this on my RS780.
> >>> Comments or flames are welcome.
> >>> Thanks.
> >> A couple of high level comments.
> >>
> >> This looks promising for the usual case.
> >>
> >> Removing the printk at the end of the kexec path seems a little dubious,
> >> what of other cpus, interrupt handlers, etc. Basically estabilishing a
> >> new rule on when printk is allowed seems a little dubious at this point,
> >> even if it is a useful debugging trick.
> > OK. I will drop this patch. It doesn't seem to be necessary, because I
> > cannot reproduce the printk related hang anymore.
> >
> >> Having a clean shutdown of the radeon definitely seems worth doing,
> >> because the cases where we care abouty video are when a person is in
> >> front of the system.
> > Yes. But please note that even with radeon_pci_shutdown implemented, I
> > still get ring test failures on roughly every eighth kexec boot:
> >
> > [drm:r600_dma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD)
> > radeon 0000:01:05.0: disabling GPU acceleration
> >
> > That's definitely better than the current state of affairs, with ring
> > test failures on every second boot. But I haven't figured out the reason
> > for these failures yet. It's curious that once a ring test failure
> > occurs, it will reliably fail after each kexec invocation, no matter how
> > often repeated. Only a reboot brings the machine back to normal.
>
> The main problem here is that the AMD gfx hardware doesn't really
> support being reinitialized once booted (for various reasons). It's a
> (intended) limitation of the hardware design that you can only
> initialize certain blocks once every power cycle, so the whole approach
> actually will never work 100% reliable.
>
> All you can hope for is that stopping the hardware while shutting down
> the old kernel and starting it again results in exactly the same
> hardware parameters (offsets, clock etc...) otherwise starting the
> blocks will just fail and you end up with disabled acceleration like above.
>
> Sorry, but there isn't much we can do about this,
I've tested this further and it turned out that if I revert commit
f5d9b7f0f9 on top of my "drm/radeon: Implement radeon_pci_shutdown"
patch, the initialization failures seem to go away completely.
Any idea what's going on?
Here's the patch:
diff --git a/drivers/gpu/drm/radeon/r600_dpm.c b/drivers/gpu/drm/radeon/r600_dpm.c
index fa0de46..4e8c1988 100644
--- a/drivers/gpu/drm/radeon/r600_dpm.c
+++ b/drivers/gpu/drm/radeon/r600_dpm.c
@@ -296,9 +296,9 @@ bool r600_dynamicpm_enabled(struct radeon_device *rdev)
void r600_enable_sclk_control(struct radeon_device *rdev, bool enable)
{
if (enable)
- WREG32_P(SCLK_PWRMGT_CNTL, 0, ~SCLK_PWRMGT_OFF);
+ WREG32_P(GENERAL_PWRMGT, 0, ~SCLK_PWRMGT_OFF);
else
- WREG32_P(SCLK_PWRMGT_CNTL, SCLK_PWRMGT_OFF, ~SCLK_PWRMGT_OFF);
+ WREG32_P(GENERAL_PWRMGT, SCLK_PWRMGT_OFF, ~SCLK_PWRMGT_OFF);
}
void r600_enable_mclk_control(struct radeon_device *rdev, bool enable)
--
Markus
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 0/3] drm/radeon kexec fixes
2013-09-11 9:01 ` Markus Trippelsdorf
@ 2013-09-11 9:10 ` Christian König
2013-09-11 13:30 ` Alex Deucher
1 sibling, 0 replies; 19+ messages in thread
From: Christian König @ 2013-09-11 9:10 UTC (permalink / raw)
To: Markus Trippelsdorf; +Cc: kexec, Eric W. Biederman, dri-devel
Am 11.09.2013 11:01, schrieb Markus Trippelsdorf:
> On 2013.09.09 at 11:38 +0200, Christian König wrote:
>> Am 09.09.2013 11:21, schrieb Markus Trippelsdorf:
>>> On 2013.09.08 at 17:32 -0700, Eric W. Biederman wrote:
>>>> Markus Trippelsdorf <markus@trippelsdorf.de> writes:
>>>>
>>>>> Here are a couple of patches that get kexec working with radeon devices.
>>>>> I've tested this on my RS780.
>>>>> Comments or flames are welcome.
>>>>> Thanks.
>>>> A couple of high level comments.
>>>>
>>>> This looks promising for the usual case.
>>>>
>>>> Removing the printk at the end of the kexec path seems a little dubious,
>>>> what of other cpus, interrupt handlers, etc. Basically estabilishing a
>>>> new rule on when printk is allowed seems a little dubious at this point,
>>>> even if it is a useful debugging trick.
>>> OK. I will drop this patch. It doesn't seem to be necessary, because I
>>> cannot reproduce the printk related hang anymore.
>>>
>>>> Having a clean shutdown of the radeon definitely seems worth doing,
>>>> because the cases where we care abouty video are when a person is in
>>>> front of the system.
>>> Yes. But please note that even with radeon_pci_shutdown implemented, I
>>> still get ring test failures on roughly every eighth kexec boot:
>>>
>>> [drm:r600_dma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD)
>>> radeon 0000:01:05.0: disabling GPU acceleration
>>>
>>> That's definitely better than the current state of affairs, with ring
>>> test failures on every second boot. But I haven't figured out the reason
>>> for these failures yet. It's curious that once a ring test failure
>>> occurs, it will reliably fail after each kexec invocation, no matter how
>>> often repeated. Only a reboot brings the machine back to normal.
>> The main problem here is that the AMD gfx hardware doesn't really
>> support being reinitialized once booted (for various reasons). It's a
>> (intended) limitation of the hardware design that you can only
>> initialize certain blocks once every power cycle, so the whole approach
>> actually will never work 100% reliable.
>>
>> All you can hope for is that stopping the hardware while shutting down
>> the old kernel and starting it again results in exactly the same
>> hardware parameters (offsets, clock etc...) otherwise starting the
>> blocks will just fail and you end up with disabled acceleration like above.
>>
>> Sorry, but there isn't much we can do about this,
> I've tested this further and it turned out that if I revert commit
> f5d9b7f0f9 on top of my "drm/radeon: Implement radeon_pci_shutdown"
> patch, the initialization failures seem to go away completely.
>
> Any idea what's going on?
Well DPM is mostly Alex domain, but if I have to guess I would say that
the SCLK is gated by the hardware when the driver is unloaded and since
DPM initialized only later not ungated when the driver loads again.
> Here's the patch:
>
> diff --git a/drivers/gpu/drm/radeon/r600_dpm.c b/drivers/gpu/drm/radeon/r600_dpm.c
> index fa0de46..4e8c1988 100644
> --- a/drivers/gpu/drm/radeon/r600_dpm.c
> +++ b/drivers/gpu/drm/radeon/r600_dpm.c
> @@ -296,9 +296,9 @@ bool r600_dynamicpm_enabled(struct radeon_device *rdev)
> void r600_enable_sclk_control(struct radeon_device *rdev, bool enable)
> {
> if (enable)
> - WREG32_P(SCLK_PWRMGT_CNTL, 0, ~SCLK_PWRMGT_OFF);
> + WREG32_P(GENERAL_PWRMGT, 0, ~SCLK_PWRMGT_OFF);
> else
> - WREG32_P(SCLK_PWRMGT_CNTL, SCLK_PWRMGT_OFF, ~SCLK_PWRMGT_OFF);
> + WREG32_P(GENERAL_PWRMGT, SCLK_PWRMGT_OFF, ~SCLK_PWRMGT_OFF);
> }
>
> void r600_enable_mclk_control(struct radeon_device *rdev, bool enable)
The patch just breaks SCLK gating on R6xx again, so no gain here.
Christian.
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/3] drm/radeon kexec fixes
2013-09-11 8:53 ` Markus Trippelsdorf
@ 2013-09-11 9:21 ` Christian König
2013-09-11 13:40 ` Alex Deucher
1 sibling, 0 replies; 19+ messages in thread
From: Christian König @ 2013-09-11 9:21 UTC (permalink / raw)
To: Markus Trippelsdorf
Cc: Alex Deucher, kexec, Eric W. Biederman,
Maling list - DRI developers
Am 11.09.2013 10:53, schrieb Markus Trippelsdorf:
> On 2013.09.10 at 16:40 -0400, Alex Deucher wrote:
>> On Tue, Sep 10, 2013 at 2:27 PM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> Alex Deucher <alexdeucher@gmail.com> writes:
>>>
>>>> On Mon, Sep 9, 2013 at 5:21 AM, Markus Trippelsdorf
>>>> <markus@trippelsdorf.de> wrote:
>>>>
>>>>> IIRC Alex said the sanity checks are expensive and boot-time could be
>>>>> improved by dropping them. Maybe he can chime in?
>>>> They shouldn't be necessary with a proper shutdown, but in this
>>>> particular case, they are not very expensive. What is expensive is
>>>> having a separate sanity check functions for all the various hw blocks
>>>> to teardown everything on startup prior to starting it up in case
>>>> kexec, etc. left the system in a bad state. It ends up amounting to a
>>>> full tear down sequence followed by a full start up sequence every
>>>> time you load the driver.
>>>>
>>>> I can't really comment on the first patch, but the rest seem fine.
>>> Let me reask the question just a little bit.
>>>
>>> Is it the sanity checks that are expensive? Or is it the
>>> reinitialization that is triggered by the sanity checks that is
>>> expensive?
>>>
>>> From what Christian said in the other reply it sounds like this is a
>>> game we will never completely win, but it would be nice to have half a
>>> chance in the kexec on panic case to have video. So I am curious to
>>> know if the checks are expensive when we are coming at hardware in a
>>> clean state.
>> The particular sanity checks from this patch set are not expensive,
>> but we had previously discussed more extensive sanity checks for other
>> aspects of the chips in prior conversations. Prior to this patch set,
>> the hw is not torn down properly (might have been in the middle of DMA
>> for example) when kexec happens. That's why the sanity checks were
>> added in the first place. With this patch set, the sanity checks
>> shouldn't be necessary.
> I think you're talking past each other.
> What Eric worries about is the »kexec on panic« case, where the shutdown
> method *isn't* called. In this case the sanity checks, that are only
> superfluous when the hardware was shutdown normally during kexec (the
> default case), may actually help. And because the checks aren't
> expensive, it wouldn't hurt to just leave them in place.
When we don't know the state the hardware is in it is really hard to get
into a working configuration again, even with the sanity checks in place.
For example you can't just cut power or reset the UVD block without
knowing it's state, cause then you have a good chance that you hit into
the middle of a register or memory access of the VCPU. This usually
results in the next few registers accesses only return either 0x0 or
0xffffffff and after that the system completely locks up.
Isn't it possible to know that we are in a "kexec after panic" case and
only try to initialize the display hardware and not all the other
blocks? That would at least allow us to get an error message of what
happened on the screen and otherwise advise the user to do a clean reset.
Christian.
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/3] drm/radeon kexec fixes
2013-09-11 9:01 ` Markus Trippelsdorf
2013-09-11 9:10 ` Christian König
@ 2013-09-11 13:30 ` Alex Deucher
1 sibling, 0 replies; 19+ messages in thread
From: Alex Deucher @ 2013-09-11 13:30 UTC (permalink / raw)
To: Markus Trippelsdorf
Cc: Christian König, kexec, Eric W. Biederman,
Maling list - DRI developers
On Wed, Sep 11, 2013 at 5:01 AM, Markus Trippelsdorf
<markus@trippelsdorf.de> wrote:
> On 2013.09.09 at 11:38 +0200, Christian König wrote:
>> Am 09.09.2013 11:21, schrieb Markus Trippelsdorf:
>> > On 2013.09.08 at 17:32 -0700, Eric W. Biederman wrote:
>> >> Markus Trippelsdorf <markus@trippelsdorf.de> writes:
>> >>
>> >>> Here are a couple of patches that get kexec working with radeon devices.
>> >>> I've tested this on my RS780.
>> >>> Comments or flames are welcome.
>> >>> Thanks.
>> >> A couple of high level comments.
>> >>
>> >> This looks promising for the usual case.
>> >>
>> >> Removing the printk at the end of the kexec path seems a little dubious,
>> >> what of other cpus, interrupt handlers, etc. Basically estabilishing a
>> >> new rule on when printk is allowed seems a little dubious at this point,
>> >> even if it is a useful debugging trick.
>> > OK. I will drop this patch. It doesn't seem to be necessary, because I
>> > cannot reproduce the printk related hang anymore.
>> >
>> >> Having a clean shutdown of the radeon definitely seems worth doing,
>> >> because the cases where we care abouty video are when a person is in
>> >> front of the system.
>> > Yes. But please note that even with radeon_pci_shutdown implemented, I
>> > still get ring test failures on roughly every eighth kexec boot:
>> >
>> > [drm:r600_dma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD)
>> > radeon 0000:01:05.0: disabling GPU acceleration
>> >
>> > That's definitely better than the current state of affairs, with ring
>> > test failures on every second boot. But I haven't figured out the reason
>> > for these failures yet. It's curious that once a ring test failure
>> > occurs, it will reliably fail after each kexec invocation, no matter how
>> > often repeated. Only a reboot brings the machine back to normal.
>>
>> The main problem here is that the AMD gfx hardware doesn't really
>> support being reinitialized once booted (for various reasons). It's a
>> (intended) limitation of the hardware design that you can only
>> initialize certain blocks once every power cycle, so the whole approach
>> actually will never work 100% reliable.
>>
>> All you can hope for is that stopping the hardware while shutting down
>> the old kernel and starting it again results in exactly the same
>> hardware parameters (offsets, clock etc...) otherwise starting the
>> blocks will just fail and you end up with disabled acceleration like above.
>>
>> Sorry, but there isn't much we can do about this,
>
> I've tested this further and it turned out that if I revert commit
> f5d9b7f0f9 on top of my "drm/radeon: Implement radeon_pci_shutdown"
> patch, the initialization failures seem to go away completely.
>
> Any idea what's going on?
>
You are disabling dynamic power management with that patch reverted.
The patch fixed a copy paste typo in the register. Bit 0
(SCLK_PWRMGT_OFF) of register SCLK_PWRMGT_CNTL controls whether
dynamic engine clock control is enabled. Bit 0 (GLOBAL_PWRMGT_EN) of
register GENERAL_PWRMGT controls whether dynamic power management
(dynamic engine/memory/voltage, controls etc.) is enabled at all.
Alex
> Here's the patch:
>
> diff --git a/drivers/gpu/drm/radeon/r600_dpm.c b/drivers/gpu/drm/radeon/r600_dpm.c
> index fa0de46..4e8c1988 100644
> --- a/drivers/gpu/drm/radeon/r600_dpm.c
> +++ b/drivers/gpu/drm/radeon/r600_dpm.c
> @@ -296,9 +296,9 @@ bool r600_dynamicpm_enabled(struct radeon_device *rdev)
> void r600_enable_sclk_control(struct radeon_device *rdev, bool enable)
> {
> if (enable)
> - WREG32_P(SCLK_PWRMGT_CNTL, 0, ~SCLK_PWRMGT_OFF);
> + WREG32_P(GENERAL_PWRMGT, 0, ~SCLK_PWRMGT_OFF);
> else
> - WREG32_P(SCLK_PWRMGT_CNTL, SCLK_PWRMGT_OFF, ~SCLK_PWRMGT_OFF);
> + WREG32_P(GENERAL_PWRMGT, SCLK_PWRMGT_OFF, ~SCLK_PWRMGT_OFF);
> }
>
> void r600_enable_mclk_control(struct radeon_device *rdev, bool enable)
>
>
> --
> Markus
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/3] drm/radeon kexec fixes
2013-09-11 8:53 ` Markus Trippelsdorf
2013-09-11 9:21 ` Christian König
@ 2013-09-11 13:40 ` Alex Deucher
1 sibling, 0 replies; 19+ messages in thread
From: Alex Deucher @ 2013-09-11 13:40 UTC (permalink / raw)
To: Markus Trippelsdorf
Cc: Christian König, kexec, Eric W. Biederman,
Maling list - DRI developers
On Wed, Sep 11, 2013 at 4:53 AM, Markus Trippelsdorf
<markus@trippelsdorf.de> wrote:
> On 2013.09.10 at 16:40 -0400, Alex Deucher wrote:
>> On Tue, Sep 10, 2013 at 2:27 PM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>> > Alex Deucher <alexdeucher@gmail.com> writes:
>> >
>> >> On Mon, Sep 9, 2013 at 5:21 AM, Markus Trippelsdorf
>> >> <markus@trippelsdorf.de> wrote:
>> >>
>> >>> IIRC Alex said the sanity checks are expensive and boot-time could be
>> >>> improved by dropping them. Maybe he can chime in?
>> >>
>> >> They shouldn't be necessary with a proper shutdown, but in this
>> >> particular case, they are not very expensive. What is expensive is
>> >> having a separate sanity check functions for all the various hw blocks
>> >> to teardown everything on startup prior to starting it up in case
>> >> kexec, etc. left the system in a bad state. It ends up amounting to a
>> >> full tear down sequence followed by a full start up sequence every
>> >> time you load the driver.
>> >>
>> >> I can't really comment on the first patch, but the rest seem fine.
>> >
>> > Let me reask the question just a little bit.
>> >
>> > Is it the sanity checks that are expensive? Or is it the
>> > reinitialization that is triggered by the sanity checks that is
>> > expensive?
>> >
>> > From what Christian said in the other reply it sounds like this is a
>> > game we will never completely win, but it would be nice to have half a
>> > chance in the kexec on panic case to have video. So I am curious to
>> > know if the checks are expensive when we are coming at hardware in a
>> > clean state.
>>
>> The particular sanity checks from this patch set are not expensive,
>> but we had previously discussed more extensive sanity checks for other
>> aspects of the chips in prior conversations. Prior to this patch set,
>> the hw is not torn down properly (might have been in the middle of DMA
>> for example) when kexec happens. That's why the sanity checks were
>> added in the first place. With this patch set, the sanity checks
>> shouldn't be necessary.
>
> I think you're talking past each other.
> What Eric worries about is the »kexec on panic« case, where the shutdown
> method *isn't* called. In this case the sanity checks, that are only
> superfluous when the hardware was shutdown normally during kexec (the
> default case), may actually help. And because the checks aren't
> expensive, it wouldn't hurt to just leave them in place.
For the particular sanity check in this patch set, that's fine. Even
with that in place, it only covers a small subset of GPUs and it
doesn't cover all possible problematic areas. In general kexec on
panic seems like a recipe for trouble. You may end up in the middle
of several dma transactions across multiple engines on the GPU. We
could add code to tear down every block on the chip on start up in
case such an event happened, but like Christian said, some blocks
can't really be recovered depending on how things were ended, and then
we've doubled the start up time for the driver. Additionally, most
of the tear down is dependent on various driver state which we
wouldn't have at start up, so most of code would need to
re-architected so that it could be called independently of the
relevant driver state.
Alex
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2013-09-11 13:41 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-09-08 12:09 [PATCH 0/3] drm/radeon kexec fixes Markus Trippelsdorf
2013-09-08 12:10 ` [PATCH 1/3] kexec: get rid of late printk Markus Trippelsdorf
2013-09-08 20:11 ` Daniel Vetter
2013-09-08 20:42 ` Bruno Prémont
2013-09-08 12:10 ` [PATCH 2/3] drm/radeon: Implement radeon_pci_shutdown Markus Trippelsdorf
2013-09-09 13:32 ` Konrad Rzeszutek Wilk
2013-09-08 12:11 ` [PATCH 3/3] drm/radeon: get rid of r100_restore_sanity hack Markus Trippelsdorf
2013-09-09 0:32 ` [PATCH 0/3] drm/radeon kexec fixes Eric W. Biederman
2013-09-09 9:21 ` Markus Trippelsdorf
2013-09-09 9:38 ` Christian König
2013-09-11 9:01 ` Markus Trippelsdorf
2013-09-11 9:10 ` Christian König
2013-09-11 13:30 ` Alex Deucher
2013-09-09 13:04 ` Alex Deucher
2013-09-10 18:27 ` Eric W. Biederman
2013-09-10 20:40 ` Alex Deucher
2013-09-11 8:53 ` Markus Trippelsdorf
2013-09-11 9:21 ` Christian König
2013-09-11 13:40 ` Alex Deucher
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox