public inbox for igt-dev@lists.freedesktop.org
 help / color / mirror / Atom feed
From: Antonio Argenziano <antonio.argenziano@intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>,
	Carlos Santa <carlos.santa@intel.com>,
	igt-dev@lists.freedesktop.org
Cc: Michel Thierry <michel.thierry@intel.com>
Subject: Re: [igt-dev] [PATCH i-g-t v2] tests/gem_watchdog: Initial set of tests for GPU watchdog
Date: Tue, 9 Oct 2018 09:12:46 -0700	[thread overview]
Message-ID: <13b95cc9-f760-d8f9-36c9-d98161366cbf@intel.com> (raw)
In-Reply-To: <153907367250.16699.2903901256164230675@skylake-alporthouse-com>



On 09/10/18 01:27, Chris Wilson wrote:
> Quoting Antonio Argenziano (2018-10-08 22:43:39)
>>
>>
>> On 05/10/18 18:05, Carlos Santa wrote:
>>> This test adds basic set of tests to reset the different
>>> GPU engines through the watchdog timer.
>>>
>>> Credits to Antonio for the original codebase this is based on.
>>>
>>> This was verified on SKL/ULT GT3:
>>>
>>> $./gem_watchdog --run-subtest basic-vecs0
>>> IGT-Version: 1.23-gaaeb2007206d (x86_64) (Linux: 4.18.0-rc7+ x86_64)
>>> Starting subtest: basic-vecs0
>>> Subtest basic-vecs0: SUCCESS (2.402s)
>>> $ sudo cat /sys/kernel/debug/dri/0/i915_reset_info
>>> full gpu reset = 0
>>> GuC watchdog/media reset = 0
>>> rcs0 = 0
>>> bcs0 = 0
>>> vcs0 = 0
>>> vcs1 = 0
>>> vecs0 = 1
>>>
>>> v2: (Review comments from Chris Wilson)
>>>    * Replace send_canary() by timestamps before/after the hang
>>>      and measure dt. Use dt < 2*threshold + reset + submission
>>>      to check watchdog vs hangcheck
>>>    * Initialize drm_i915_gem_context_param args only once at
>>>      the struct declaration
>>>    * Avoid using MAX_ENGINES implicitly to declare engines_thresholds
>>>      array
>>>    * Remove unnecessary igt_assert(!check_error_state(fd))
>>>    * Use the class:instance interface when looping through the engines
>>>
>>>     (Review by Petri Latvala)
>>>    * Update the correct patch's year timestamp
>>>    * Include IGT_DESCRIPTION() label
>>>
>>> Cc: Antonio Argenziano <antonio.argenziano@intel.com>
>>> Cc: Michel Thierry <michel.thierry@intel.com>
>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>> Signed-off-by: Carlos Santa <carlos.santa@intel.com>
>>> ---
>>>    tests/Makefile.sources |   1 +
>>>    tests/gem_watchdog.c   | 186 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>    tests/meson.build      |   1 +
>>>    3 files changed, 188 insertions(+)
>>>    create mode 100644 tests/gem_watchdog.c
>>>
>>
>>
>>> +
>>> +static void media_hang_simple(int fd, const struct intel_execution_engine2 *e)
>>> +{
>>> +     uint32_t ctx;
>>> +     unsigned flags = HANG_ALLOW_CAPTURE;
>>> +     struct timeval start, end;
>>> +     double dt_msecs;
>>> +
>>> +     /* Submit on default context */
>>> +     ctx = 0;
>>> +     context_set_watchdog(fd, e, ctx, WATCHDOG_THRESHOLD);
>>> +
>>> +     clear_error_state(fd);
>>> +
>>> +     gettimeofday(&start, NULL);
>>> +     inject_hang(fd, ctx, e, flags);
>>> +     gettimeofday(&end, NULL);
>>> +     dt_msecs = elapsed(&start, &end)/1000;
>>> +
>>> +     /* reset time for watchdog should be less than 2*threshold + engine reset time + submission */
>>> +     igt_assert(dt_msecs < 2*WATCHDOG_THRESHOLD + 15);
>>> +
>>> +     /* Assert if error state is not clean */
>>> +     igt_assert(!check_error_state(fd));
>>
>> If you have a small enough threshold you don't need the dmesg check for
>> resets. IMO, ideally there would be a way to have the hang_detector
>> running but that doesn't work because it would disable watchdog as well.
> 
> There will be no be capture. The stop_machine() employed there is the
> antithesis of light and fast reset. There will be no uevent since that
> is the global reset, so hang detector will not work. The way to detect
> the failure is to check the fence status which will report the reset of
> that particular request. inject_hang() needs to be reworked and
> gettimeofday() can be replaced by any of the igt timers.

What I was trying to say is that if we had the interface to say have 
watchdog and global reset we could use the hang detector to fail the 
test but, I guess the fence would make things simpler.

> context_set_watchdog uapi is very wrong. The assert that dt is less than
> 2*threshold is intriguing since the kernel patches should require 2
> evaluation intervals to detect a stuck seqno, and the arbitrary epsilon
> will be upset by the slightest sneeze, let alone scheduler misbehaviour
> and kasan.

Would making the process realtime solve the problem? I think you 
originally were suggesting to use 3 batches with the hang sandwiched 
between timestamps. I wonder if we could do it with only one batch like:

TIMESTAMP
TIMESTAMP <--+
LOOP --------+

But I am not sure if the second timestamp might get corrupted in the reset.

> 
>>> +}
>>> +
>>> +igt_main
>>> +{
>>> +     int fd = -1;
>>> +
>>> +     igt_skip_on_simulation();
> 
> This is definitely suitable for sim as we are testing control of hw
> features (which should help underline how far off dt asserts will be).

I think an allow_hang() and a feature based igt_require() in the fixture 
below are needed.

> 
>>> +     igt_fixture {
>>> +             fd = drm_open_driver(DRIVER_INTEL);
>>> +             igt_require_gem(fd);
>>> +     }
>>> +
>>> +     for (const struct intel_execution_engine2 *e = intel_execution_engines2; e->name; e++) {
>>> +             igt_subtest_group {
>>> +                     igt_fixture {
>>> +                             gem_require_engine(fd, e->class, e->instance);
>>
>> Move the require inside the subtest to maintain a consistent test list
>> across platforms.
>>
>> Thanks,
>> Antonio
>>
>>> +                     }
>>> +
>>> +                     /* default exec-id is purely symbolic */
> 
> And this is purely garbage.

I think that was me, a long time ago...

Antonio

> -Chris
> 
_______________________________________________
igt-dev mailing list
igt-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/igt-dev

  reply	other threads:[~2018-10-09 16:15 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-06  1:05 [igt-dev] [PATCH i-g-t v2] tests/gem_watchdog: GPU engine reset Carlos Santa
2018-10-06  1:05 ` [igt-dev] [PATCH i-g-t v2] tests/gem_watchdog: Initial set of tests for GPU watchdog Carlos Santa
2018-10-08 21:43   ` Antonio Argenziano
2018-10-09  8:27     ` Chris Wilson
2018-10-09 16:12       ` Antonio Argenziano [this message]
2018-10-06  2:39 ` [igt-dev] ✓ Fi.CI.BAT: success for tests/gem_watchdog: Initial set of tests for GPU watchdog (rev2) Patchwork
2018-10-06 11:28 ` [igt-dev] ✓ Fi.CI.IGT: " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=13b95cc9-f760-d8f9-36c9-d98161366cbf@intel.com \
    --to=antonio.argenziano@intel.com \
    --cc=carlos.santa@intel.com \
    --cc=chris@chris-wilson.co.uk \
    --cc=igt-dev@lists.freedesktop.org \
    --cc=michel.thierry@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox