From: Antonio Argenziano <antonio.argenziano@intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>,
Carlos Santa <carlos.santa@intel.com>,
igt-dev@lists.freedesktop.org
Cc: Michel Thierry <michel.thierry@intel.com>
Subject: Re: [igt-dev] [PATCH i-g-t v2] tests/gem_watchdog: Initial set of tests for GPU watchdog
Date: Tue, 9 Oct 2018 09:12:46 -0700 [thread overview]
Message-ID: <13b95cc9-f760-d8f9-36c9-d98161366cbf@intel.com> (raw)
In-Reply-To: <153907367250.16699.2903901256164230675@skylake-alporthouse-com>
On 09/10/18 01:27, Chris Wilson wrote:
> Quoting Antonio Argenziano (2018-10-08 22:43:39)
>>
>>
>> On 05/10/18 18:05, Carlos Santa wrote:
>>> This test adds basic set of tests to reset the different
>>> GPU engines through the watchdog timer.
>>>
>>> Credits to Antonio for the original codebase this is based on.
>>>
>>> This was verified on SKL/ULT GT3:
>>>
>>> $./gem_watchdog --run-subtest basic-vecs0
>>> IGT-Version: 1.23-gaaeb2007206d (x86_64) (Linux: 4.18.0-rc7+ x86_64)
>>> Starting subtest: basic-vecs0
>>> Subtest basic-vecs0: SUCCESS (2.402s)
>>> $ sudo cat /sys/kernel/debug/dri/0/i915_reset_info
>>> full gpu reset = 0
>>> GuC watchdog/media reset = 0
>>> rcs0 = 0
>>> bcs0 = 0
>>> vcs0 = 0
>>> vcs1 = 0
>>> vecs0 = 1
>>>
>>> v2: (Review comments from Chris Wilson)
>>> * Replace send_canary() by timestamps before/after the hang
>>> and measure dt. Use dt < 2*threshold + reset + submission
>>> to check watchdog vs hangcheck
>>> * Initialize drm_i915_gem_context_param args only once at
>>> the struct declaration
>>> * Avoid using MAX_ENGINES implicitly to declare engines_thresholds
>>> array
>>> * Remove unnecessary igt_assert(!check_error_state(fd))
>>> * Use the class:instance interface when looping through the engines
>>>
>>> (Review by Petri Latvala)
>>> * Update the correct patch's year timestamp
>>> * Include IGT_DESCRIPTION() label
>>>
>>> Cc: Antonio Argenziano <antonio.argenziano@intel.com>
>>> Cc: Michel Thierry <michel.thierry@intel.com>
>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>> Signed-off-by: Carlos Santa <carlos.santa@intel.com>
>>> ---
>>> tests/Makefile.sources | 1 +
>>> tests/gem_watchdog.c | 186 +++++++++++++++++++++++++++++++++++++++++++++++++
>>> tests/meson.build | 1 +
>>> 3 files changed, 188 insertions(+)
>>> create mode 100644 tests/gem_watchdog.c
>>>
>>
>>
>>> +
>>> +static void media_hang_simple(int fd, const struct intel_execution_engine2 *e)
>>> +{
>>> + uint32_t ctx;
>>> + unsigned flags = HANG_ALLOW_CAPTURE;
>>> + struct timeval start, end;
>>> + double dt_msecs;
>>> +
>>> + /* Submit on default context */
>>> + ctx = 0;
>>> + context_set_watchdog(fd, e, ctx, WATCHDOG_THRESHOLD);
>>> +
>>> + clear_error_state(fd);
>>> +
>>> + gettimeofday(&start, NULL);
>>> + inject_hang(fd, ctx, e, flags);
>>> + gettimeofday(&end, NULL);
>>> + dt_msecs = elapsed(&start, &end)/1000;
>>> +
>>> + /* reset time for watchdog should be less than 2*threshold + engine reset time + submission */
>>> + igt_assert(dt_msecs < 2*WATCHDOG_THRESHOLD + 15);
>>> +
>>> + /* Assert if error state is not clean */
>>> + igt_assert(!check_error_state(fd));
>>
>> If you have a small enough threshold you don't need the dmesg check for
>> resets. IMO, ideally there would be a way to have the hang_detector
>> running but that doesn't work because it would disable watchdog as well.
>
> There will be no be capture. The stop_machine() employed there is the
> antithesis of light and fast reset. There will be no uevent since that
> is the global reset, so hang detector will not work. The way to detect
> the failure is to check the fence status which will report the reset of
> that particular request. inject_hang() needs to be reworked and
> gettimeofday() can be replaced by any of the igt timers.
What I was trying to say is that if we had the interface to say have
watchdog and global reset we could use the hang detector to fail the
test but, I guess the fence would make things simpler.
> context_set_watchdog uapi is very wrong. The assert that dt is less than
> 2*threshold is intriguing since the kernel patches should require 2
> evaluation intervals to detect a stuck seqno, and the arbitrary epsilon
> will be upset by the slightest sneeze, let alone scheduler misbehaviour
> and kasan.
Would making the process realtime solve the problem? I think you
originally were suggesting to use 3 batches with the hang sandwiched
between timestamps. I wonder if we could do it with only one batch like:
TIMESTAMP
TIMESTAMP <--+
LOOP --------+
But I am not sure if the second timestamp might get corrupted in the reset.
>
>>> +}
>>> +
>>> +igt_main
>>> +{
>>> + int fd = -1;
>>> +
>>> + igt_skip_on_simulation();
>
> This is definitely suitable for sim as we are testing control of hw
> features (which should help underline how far off dt asserts will be).
I think an allow_hang() and a feature based igt_require() in the fixture
below are needed.
>
>>> + igt_fixture {
>>> + fd = drm_open_driver(DRIVER_INTEL);
>>> + igt_require_gem(fd);
>>> + }
>>> +
>>> + for (const struct intel_execution_engine2 *e = intel_execution_engines2; e->name; e++) {
>>> + igt_subtest_group {
>>> + igt_fixture {
>>> + gem_require_engine(fd, e->class, e->instance);
>>
>> Move the require inside the subtest to maintain a consistent test list
>> across platforms.
>>
>> Thanks,
>> Antonio
>>
>>> + }
>>> +
>>> + /* default exec-id is purely symbolic */
>
> And this is purely garbage.
I think that was me, a long time ago...
Antonio
> -Chris
>
_______________________________________________
igt-dev mailing list
igt-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/igt-dev
next prev parent reply other threads:[~2018-10-09 16:15 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-10-06 1:05 [igt-dev] [PATCH i-g-t v2] tests/gem_watchdog: GPU engine reset Carlos Santa
2018-10-06 1:05 ` [igt-dev] [PATCH i-g-t v2] tests/gem_watchdog: Initial set of tests for GPU watchdog Carlos Santa
2018-10-08 21:43 ` Antonio Argenziano
2018-10-09 8:27 ` Chris Wilson
2018-10-09 16:12 ` Antonio Argenziano [this message]
2018-10-06 2:39 ` [igt-dev] ✓ Fi.CI.BAT: success for tests/gem_watchdog: Initial set of tests for GPU watchdog (rev2) Patchwork
2018-10-06 11:28 ` [igt-dev] ✓ Fi.CI.IGT: " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=13b95cc9-f760-d8f9-36c9-d98161366cbf@intel.com \
--to=antonio.argenziano@intel.com \
--cc=carlos.santa@intel.com \
--cc=chris@chris-wilson.co.uk \
--cc=igt-dev@lists.freedesktop.org \
--cc=michel.thierry@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox