From: Peter Senna Tschudin <peter.senna@linux.intel.com>
To: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>,
"igt-dev@lists.freedesktop.org" <igt-dev@lists.freedesktop.org>
Cc: "Ryszard Knop" <ryszard.knop@intel.com>,
"Zbigniew Kempczyński" <zbigniew.kempczynski@intel.com>,
"Lucas De Marchi" <lucas.demarchi@intel.com>,
luciano.coelho@intel.com, nirmoy.das@intel.com,
stuart.summers@intel.com, himal.prasad.ghimiray@intel.com,
dominik.karol.piatkowski@intel.com,
katarzyna.piecielska@intel.com
Subject: Re: [PATCH i-g-t v10] igt-runner fact checking
Date: Mon, 9 Dec 2024 12:06:33 +0100
Message-ID: <c3b14136-db8a-4d3d-82cb-038c2241fe76@linux.intel.com>
In-Reply-To: <15344780.JCcGWNJJiE@jkrzyszt-mobl2.ger.corp.intel.com>
Hi Janusz,
On 09.12.2024 10:17, Janusz Krzysztofik wrote:
> On Friday, 6 December 2024 06:45:31 CET Peter Senna Tschudin wrote:
>> Hi Janusz,
>>
>> Thank you for your detailed comments. I appreciate the opportunity
>> to clarify and address your concerns.
>>
>> On 05.12.2024 15:05, Janusz Krzysztofik wrote:
>>> Hi Peter,
>>>
>>> On Wednesday, 4 December 2024 19:44:53 CET Peter Senna Tschudin wrote:
>>>> When using igt-runner, collect facts before each test and after the
>>>> last test, and report when facts change. The facts are:
>>>> - GPUs on PCI bus: hardware.pci.gpu_at_addr.0000:03:00.0: 8086:e20b Intel Battlemage (Gen20)
>>>> - Associations between PCI GPU and DRM card: hardware.pci.drm_card_at_addr.0000:03:00.0: card1
>>>> - Kernel taints: kernel.is_tainted.taint_warn: true
>>>> - GPU kernel modules loaded: kernel.kmod_is_loaded.i915: true
>>>>
>>>> This change imposes little execution overhead and adds just a few
>>>> lines of logging. The facts will be printed on normal igt-runner
>>>> output. Here is a real example from our CI showing
>>>> hotreplug-lateclose changing the DRM card number
>>>
>>> Since you give that as an example of how helpful your facts can be, and follow
>>> that with a kernel taint example, that may indicate you think, and users of
>>> your facts may then be misled, having read that, that the taint was related to
>>> the change of card number, while both had nothing to do with each other.
>>
>> Let’s take a step back to define the purpose and scope of igt-facts:
>> - Definition of a fact from the dictionary: A fact is an objectively verifiable
>> piece of information.
>> - Purpose of igt-facts: Track which tests cause changes to the facts.
>>
>> The operation is straightforward: facts are collected before and after each test,
>> and any differences are logged. Here’s an example showing a fact change and a new
>> fact after running hotreplug-lateclose:
>>
>> [249.858249] [FACT core_hotunplug (hotreplug-lateclose)] changed: hardware.pci.drm_card_at_addr.0000:00:02.0: card0 -> card1
>> [249.858392] [FACT core_hotunplug (hotreplug-lateclose)] new: kernel.is_tainted.taint_die: true
>>
>> This output highlights the facts without implying causation between them. The
>> tool (and my commit message) neither explains relationships between facts nor
>> misleads users into assuming causation.
>
> For me your commit message does.
Can you please provide some external support for your claims?
I am very careful when writing, and I am aware of my own limitations.
After reading what ChatGPT wrote about my text, I felt comfortable
that I was not misleading anyone. Please mind the wording of my
question: I did not hint at a desired outcome to ChatGPT.
My Question:
-- // --
Can you please indicate any potential causality relationship
that the following text may offer?
Here is a real example from our CI showing hotreplug-lateclose
changing the DRM card number and tainting the kernel on the
abort path
-- // --
ChatGPT answer:
--//--
The text suggests a potential causality relationship between the
"hotreplug-lateclose" operation and its effects on the system:
1. Causal Event: The "hotreplug-lateclose" operation.
This likely refers to a process in the CI (Continuous Integration)
environment related to hotplugging devices and managing late-close
scenarios. Hotplugging involves dynamically adding or removing
hardware components without shutting down the system.
2. Effects:
* Changing the DRM card number: This implies that the operation
affects the enumeration or reordering of DRM (Direct Rendering
Manager) devices, causing the card number to be reassigned.
* Kernel Tainting on the abort path: Kernel tainting typically
indicates that the kernel has entered a state that is not
guaranteed to be stable or reliable. This could occur due
to improper handling of errors or unexpected conditions
during the "abort path" (likely a failure or cleanup process
in the operation).
Suggested Causal Chain:
1. The "hotreplug-lateclose" operation executes in the CI.
2. This causes a change in DRM card enumeration (potentially due
to how devices are initialized or re-initialized).
3. On the abort path of this operation, a condition arises that
taints the kernel (possibly due to resource cleanup issues,
unhandled errors, or driver misbehavior).
This text highlights a possible link between hotreplug-lateclose
handling and instabilities in DRM device management as well as
kernel state integrity.
--//--
>
> Can you please provide a full list of "facts" your code is supposed to handle?
This is in the commit message already, at the very beginning.
> Can you please explain why you selected just those "facts", not others?
It was either something that was missing, such as a convenient way to
learn when something strange happened, like a GPU disappearing from the
PCI bus, or something that we believe may cause errors downstream, such
as a taint, and the list of loaded modules.
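As a rough illustration of the taint fact (a sketch only, not the code
from the patch): the kernel exposes its taint state as a bitmask in
/proc/sys/kernel/tainted, where bit 7 is set after an oops and bit 9
after a WARN. The helper names and the table below are made up for this
example; only the kernel.is_tainted.taint_die and
kernel.is_tainted.taint_warn fact names come from the output quoted in
this thread.

```python
# Sketch only: decode a kernel taint bitmask into fact-style entries.
# The helper names and the table are hypothetical, not taken from the patch.
TAINT_BITS = {
    7: "taint_die",   # kernel died recently (OOPS or BUG)
    9: "taint_warn",  # kernel issued a WARN
}


def taint_facts(mask):
    """Return {fact_name: True} for each known taint bit set in mask."""
    return {
        f"kernel.is_tainted.{name}": True
        for bit, name in sorted(TAINT_BITS.items())
        if mask & (1 << bit)
    }


def read_taint_mask(path="/proc/sys/kernel/tainted"):
    """Read the current taint bitmask (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())
```

Calling taint_facts(read_taint_mask()) before and after a test would
give two small dictionaries that can then be compared.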
For the DRM card number association, we believe that there may be a caching
issue: we are trying to figure out whether the drm-reopen cache handles a
change in the DRM card number association well. We have weak evidence pointing
to a probable problem akin to missing cache invalidation.
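To make the card number case concrete, here is a minimal sketch of the
before/after comparison, assuming each snapshot is a plain name-to-value
mapping. The function name, the "deleted" label and the exact message
layout are assumptions for this example and not the patch's
implementation; the fact names and values are the ones from the CI
output quoted earlier.

```python
# Sketch only: compare two fact snapshots and describe what changed.
# The function name and message layout are illustrative assumptions.
def diff_facts(before, after):
    """Return a list of 'new', 'changed' and 'deleted' lines."""
    lines = []
    for name in sorted(after):
        if name not in before:
            lines.append(f"new: {name}: {after[name]}")
        elif before[name] != after[name]:
            lines.append(f"changed: {name}: {before[name]} -> {after[name]}")
    for name in sorted(before):
        if name not in after:
            lines.append(f"deleted: {name}: {before[name]}")
    return lines


# Snapshots taken around hotreplug-lateclose, using the values seen in CI.
before = {"hardware.pci.drm_card_at_addr.0000:00:02.0": "card0"}
after = {
    "hardware.pci.drm_card_at_addr.0000:00:02.0": "card1",
    "kernel.is_tainted.taint_die": "true",
}
for line in diff_facts(before, after):
    print(line)
```

Each difference is reported on its own line, and nothing in the
comparison links one difference to another.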
Thanks!
[...]