* Add env info to igt_runner (was: Re: [PATCH i-g-t 4/4] lib/igt_device_scan: Fix scan vs bind/unbind/reload)
@ 2024-12-19 17:24 Lucas De Marchi
2024-12-19 18:38 ` Gustavo Sousa
2024-12-20 17:05 ` Kamil Konieczny
0 siblings, 2 replies; 6+ messages in thread
From: Lucas De Marchi @ 2024-12-19 17:24 UTC (permalink / raw)
To: Zbigniew Kempczyński
Cc: igt-dev, Francois Dugast, Peter Senna Tschudin, Gustavo Sousa,
Kamil Konieczny, Ashutosh Dixit, ryszard.knop
hijacking the thread and adding some people to Cc for the igt_runner question.
Previously In-Reply-To: <rnw3q6mhthnwyvowvszr2gllyjtbb2mozk4em272xlmkvm7pyl@szbhtg3sd7d7>
On Thu, Dec 19, 2024 at 10:35:00AM -0600, Lucas De Marchi wrote:
>On Wed, Dec 18, 2024 at 07:34:19AM +0100, Zbigniew Kempczyński wrote:
>>On Tue, Dec 17, 2024 at 09:13:24PM -0800, Lucas De Marchi wrote:
>>>There's no guarantee a card will end up with the same device node when
>>>modules are loaded/unloaded and drivers bound/unbound. There's some
>>>fundamental issue with the igt's the way it is and it's also puzzling
>>>from the logs it looks like the device vanished from the bus, when in
>>>reality is just the SW state out of sync with what the kernel is
>>>exporting.
>>>
>>>Re-scanning when trying to match a device is not expensive compared to
>>>what most tests are doing, so simply force it to occur whenever trying
>>>to match a card.
>>
>>I also should comment the above. It is generally true, but I've noticed
>>getting attributes might be expensive. Even it may take up to few
>>seconds, that's why I've added some attributes we don't fetch from udev
>>(see is_on_blacklist()). If I'm not wrong getting 'config' was a cause
>>to limit attributes we fetch.
>
>why would we get all attributes and exclude some? Shouldn't we get only
>the attributes we actually use? AFAIK this logic is basically used by
>--device/IGT_DEVICE, right? What filters we normally use?
>
>I usually pass the pci slot (because I know that won't change
>dynamically and cause surprises). Apparently CI passes vendor/devid:
>
> export IGT_DEVICE=pci:vendor=$1,device=$2
>
>(but it seems to vary depending on pipeline)
>
>Some devs pass the device node directly too as in a lot of places
>there's only ever card0 possible.
Could we dump the env and args somewhere so we know how igt_runner or
individual tests are being called without looking at the CI piepeline
sources? I was thinking about either having that info in the stdout
output of igt_runner or in the json. Another possibility would be in
dmesg, but I'm not sure it's a good option. Thoughts?
My preferred option would be to have e.g.:
{
"__type__": "TestrunResult",
"results_version": 10,
"name": "xe-2403-995cd30a4e222b6a7b4b40c36219e4937fd7109e\/bat-bmg-1\/0",
"uname": "Linux bat-bmg-1 6.13.0-rc3-xe+ #1 SMP PREEMPT_DYNAMIC Thu Dec 19 14:40:51 UTC 2024 x86_64",
"time_elapsed": {
"__type__": "TimeAttribute",
"start": 1734621126.8734231,
"end": 1734621288.5994539
},
"environment": {
"IGT_DEVICE": ...
<any IGT_* env var>
},
"argv": [ ... ]
Lucas De Marchi
^ permalink raw reply [flat|nested] 6+ messages in thread* Re: Add env info to igt_runner (was: Re: [PATCH i-g-t 4/4] lib/igt_device_scan: Fix scan vs bind/unbind/reload) 2024-12-19 17:24 Add env info to igt_runner (was: Re: [PATCH i-g-t 4/4] lib/igt_device_scan: Fix scan vs bind/unbind/reload) Lucas De Marchi @ 2024-12-19 18:38 ` Gustavo Sousa 2024-12-20 17:05 ` Kamil Konieczny 1 sibling, 0 replies; 6+ messages in thread From: Gustavo Sousa @ 2024-12-19 18:38 UTC (permalink / raw) To: Zbigniew Kempczyński, Lucas De Marchi Cc: igt-dev, Francois Dugast, Peter Senna Tschudin, Kamil Konieczny, Ashutosh Dixit, ryszard.knop Quoting Lucas De Marchi (2024-12-19 14:24:09-03:00) >hijacking the thread and adding some people to Cc for the igt_runner question. >Previously In-Reply-To: <rnw3q6mhthnwyvowvszr2gllyjtbb2mozk4em272xlmkvm7pyl@szbhtg3sd7d7> > >On Thu, Dec 19, 2024 at 10:35:00AM -0600, Lucas De Marchi wrote: >>On Wed, Dec 18, 2024 at 07:34:19AM +0100, Zbigniew Kempczyński wrote: >>>On Tue, Dec 17, 2024 at 09:13:24PM -0800, Lucas De Marchi wrote: >>>>There's no guarantee a card will end up with the same device node when >>>>modules are loaded/unloaded and drivers bound/unbound. There's some >>>>fundamental issue with the igt's the way it is and it's also puzzling >>>>from the logs it looks like the device vanished from the bus, when in >>>>reality is just the SW state out of sync with what the kernel is >>>>exporting. >>>> >>>>Re-scanning when trying to match a device is not expensive compared to >>>>what most tests are doing, so simply force it to occur whenever trying >>>>to match a card. >>> >>>I also should comment the above. It is generally true, but I've noticed >>>getting attributes might be expensive. Even it may take up to few >>>seconds, that's why I've added some attributes we don't fetch from udev >>>(see is_on_blacklist()). If I'm not wrong getting 'config' was a cause >>>to limit attributes we fetch. >> >>why would we get all attributes and exclude some? Shouldn't we get only >>the attributes we actually use? AFAIK this logic is basically used by >>--device/IGT_DEVICE, right? What filters we normally use? >> >>I usually pass the pci slot (because I know that won't change >>dynamically and cause surprises). Apparently CI passes vendor/devid: >> >> export IGT_DEVICE=pci:vendor=$1,device=$2 >> >>(but it seems to vary depending on pipeline) >> >>Some devs pass the device node directly too as in a lot of places >>there's only ever card0 possible. > > >Could we dump the env and args somewhere so we know how igt_runner or >individual tests are being called without looking at the CI piepeline >sources? I was thinking about either having that info in the stdout >output of igt_runner or in the json. Another possibility would be in >dmesg, but I'm not sure it's a good option. Thoughts? > >My preferred option would be to have e.g.: > >{ > "__type__": "TestrunResult", > "results_version": 10, > "name": "xe-2403-995cd30a4e222b6a7b4b40c36219e4937fd7109e\/bat-bmg-1\/0", > "uname": "Linux bat-bmg-1 6.13.0-rc3-xe+ #1 SMP PREEMPT_DYNAMIC Thu Dec 19 14:40:51 UTC 2024 x86_64", > "time_elapsed": { > "__type__": "TimeAttribute", > "start": 1734621126.8734231, > "end": 1734621288.5994539 > }, > "environment": { Nitpick: maybe "env"? :-) > "IGT_DEVICE": ... > <any IGT_* env var> > }, > "argv": [ ... ] > I like the idea and I also prefer putting that info in the results.json, which makes general info about the execution more self-contained. Does CI always make results.json available and easy to find? Depending on the answer to this, we might think about additionally dumping that info in some user visible log from CI results pages. -- Gustavo Sousa ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Add env info to igt_runner (was: Re: [PATCH i-g-t 4/4] lib/igt_device_scan: Fix scan vs bind/unbind/reload) 2024-12-19 17:24 Add env info to igt_runner (was: Re: [PATCH i-g-t 4/4] lib/igt_device_scan: Fix scan vs bind/unbind/reload) Lucas De Marchi 2024-12-19 18:38 ` Gustavo Sousa @ 2024-12-20 17:05 ` Kamil Konieczny 2024-12-20 17:37 ` Lucas De Marchi 1 sibling, 1 reply; 6+ messages in thread From: Kamil Konieczny @ 2024-12-20 17:05 UTC (permalink / raw) To: Lucas De Marchi Cc: Zbigniew Kempczyński, igt-dev, Francois Dugast, Peter Senna Tschudin, Gustavo Sousa, Ashutosh Dixit, ryszard.knop, jari.tahvanainen Hi Lucas, On 2024-12-19 at 11:24:09 -0600, Lucas De Marchi wrote: > hijacking the thread and adding some people to Cc for the igt_runner question. > Previously In-Reply-To: <rnw3q6mhthnwyvowvszr2gllyjtbb2mozk4em272xlmkvm7pyl@szbhtg3sd7d7> > > On Thu, Dec 19, 2024 at 10:35:00AM -0600, Lucas De Marchi wrote: > > On Wed, Dec 18, 2024 at 07:34:19AM +0100, Zbigniew Kempczyński wrote: > > > On Tue, Dec 17, 2024 at 09:13:24PM -0800, Lucas De Marchi wrote: > > > > There's no guarantee a card will end up with the same device node when > > > > modules are loaded/unloaded and drivers bound/unbound. There's some > > > > fundamental issue with the igt's the way it is and it's also puzzling > > > > from the logs it looks like the device vanished from the bus, when in > > > > reality is just the SW state out of sync with what the kernel is > > > > exporting. > > > > > > > > Re-scanning when trying to match a device is not expensive compared to > > > > what most tests are doing, so simply force it to occur whenever trying > > > > to match a card. > > > > > > I also should comment the above. It is generally true, but I've noticed > > > getting attributes might be expensive. Even it may take up to few > > > seconds, that's why I've added some attributes we don't fetch from udev > > > (see is_on_blacklist()). If I'm not wrong getting 'config' was a cause > > > to limit attributes we fetch. > > > > why would we get all attributes and exclude some? Shouldn't we get only > > the attributes we actually use? AFAIK this logic is basically used by > > --device/IGT_DEVICE, right? What filters we normally use? > > > > I usually pass the pci slot (because I know that won't change > > dynamically and cause surprises). Apparently CI passes vendor/devid: > > > > export IGT_DEVICE=pci:vendor=$1,device=$2 > > > > (but it seems to vary depending on pipeline) > > > > Some devs pass the device node directly too as in a lot of places > > there's only ever card0 possible. > > > Could we dump the env and args somewhere so we know how igt_runner or > individual tests are being called without looking at the CI piepeline > sources? I was thinking about either having that info in the stdout > output of igt_runner or in the json. Another possibility would be in > dmesg, but I'm not sure it's a good option. Thoughts? Not only that, also parameters used to start igt_runner, what was in .igtrc file (if any), current wall time, testlist prepared to run, free memory and free disk. metadata file for igt_resume, it will enable with prepared teslist to re-execute run. Also kernel config from /boot ? Or should it be in shard run info (avoided duplication). Maybe some other info, either igt_facts or lspci output? Should we ask also display team and our CI? +cc Jari from display Regards, Kamil > > My preferred option would be to have e.g.: > > { > "__type__": "TestrunResult", > "results_version": 10, > "name": "xe-2403-995cd30a4e222b6a7b4b40c36219e4937fd7109e\/bat-bmg-1\/0", > "uname": "Linux bat-bmg-1 6.13.0-rc3-xe+ #1 SMP PREEMPT_DYNAMIC Thu Dec 19 14:40:51 UTC 2024 x86_64", > "time_elapsed": { > "__type__": "TimeAttribute", > "start": 1734621126.8734231, > "end": 1734621288.5994539 > }, > "environment": { > "IGT_DEVICE": ... > <any IGT_* env var> > }, > "argv": [ ... ] > > > Lucas De Marchi ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Add env info to igt_runner (was: Re: [PATCH i-g-t 4/4] lib/igt_device_scan: Fix scan vs bind/unbind/reload) 2024-12-20 17:05 ` Kamil Konieczny @ 2024-12-20 17:37 ` Lucas De Marchi 2024-12-20 21:15 ` Knop, Ryszard 0 siblings, 1 reply; 6+ messages in thread From: Lucas De Marchi @ 2024-12-20 17:37 UTC (permalink / raw) To: Kamil Konieczny, Zbigniew Kempczyński, igt-dev, Francois Dugast, Peter Senna Tschudin, Gustavo Sousa, Ashutosh Dixit, ryszard.knop, jari.tahvanainen On Fri, Dec 20, 2024 at 06:05:22PM +0100, Kamil Konieczny wrote: >Hi Lucas, >On 2024-12-19 at 11:24:09 -0600, Lucas De Marchi wrote: >> hijacking the thread and adding some people to Cc for the igt_runner question. >> Previously In-Reply-To: <rnw3q6mhthnwyvowvszr2gllyjtbb2mozk4em272xlmkvm7pyl@szbhtg3sd7d7> >> >> On Thu, Dec 19, 2024 at 10:35:00AM -0600, Lucas De Marchi wrote: >> > On Wed, Dec 18, 2024 at 07:34:19AM +0100, Zbigniew Kempczyński wrote: >> > > On Tue, Dec 17, 2024 at 09:13:24PM -0800, Lucas De Marchi wrote: >> > > > There's no guarantee a card will end up with the same device node when >> > > > modules are loaded/unloaded and drivers bound/unbound. There's some >> > > > fundamental issue with the igt's the way it is and it's also puzzling >> > > > from the logs it looks like the device vanished from the bus, when in >> > > > reality is just the SW state out of sync with what the kernel is >> > > > exporting. >> > > > >> > > > Re-scanning when trying to match a device is not expensive compared to >> > > > what most tests are doing, so simply force it to occur whenever trying >> > > > to match a card. >> > > >> > > I also should comment the above. It is generally true, but I've noticed >> > > getting attributes might be expensive. Even it may take up to few >> > > seconds, that's why I've added some attributes we don't fetch from udev >> > > (see is_on_blacklist()). If I'm not wrong getting 'config' was a cause >> > > to limit attributes we fetch. >> > >> > why would we get all attributes and exclude some? Shouldn't we get only >> > the attributes we actually use? AFAIK this logic is basically used by >> > --device/IGT_DEVICE, right? What filters we normally use? >> > >> > I usually pass the pci slot (because I know that won't change >> > dynamically and cause surprises). Apparently CI passes vendor/devid: >> > >> > export IGT_DEVICE=pci:vendor=$1,device=$2 >> > >> > (but it seems to vary depending on pipeline) >> > >> > Some devs pass the device node directly too as in a lot of places >> > there's only ever card0 possible. >> >> >> Could we dump the env and args somewhere so we know how igt_runner or >> individual tests are being called without looking at the CI piepeline >> sources? I was thinking about either having that info in the stdout >> output of igt_runner or in the json. Another possibility would be in >> dmesg, but I'm not sure it's a good option. Thoughts? > >Not only that, also parameters used to start igt_runner, >what was in .igtrc file (if any), current wall time, does CI actually have an .igtrc? We can add, but I'd prioritize things that are used and that we don't have annotated anywhere (with easy/public access). >testlist prepared to run, free memory and free disk. >metadata file for igt_resume, it will enable with prepared >teslist to re-execute run. talking about igt_resume, there may be some issues doing this at the igt_runner level: it may not match from one test to another if the env didn't match when it started running from when it finished. 1) it may have gone through igt_resume after a reboot (hopefully in the same machine) 2) for shards we slice the testlist and give it to different machines. Ideally they have the same env, config, etc, but that also is not guaranteed. Checking random tests in https://intel-gfx-ci.01.org/tree/intel-xe/shards-all.html? https://intel-gfx-ci.01.org/tree/intel-xe/xe-2404-26e6464dff2b3fe53049bd3b6e426cec43beb165/shard-bmg-1/igt@kms_async_flips@async-flip-with-page-flip-events-atomic@pipe-a-dp-2-4-rc-ccs.html https://intel-gfx-ci.01.org/tree/intel-xe/xe-2404-26e6464dff2b3fe53049bd3b6e426cec43beb165/shard-bmg-1/igt@core_auth@basic-auth.html Do I understand it right that we simply have multiple resultsXX.json for different runs so it's fine to create it at the global level? What about a resume (1)? > >Also kernel config from /boot ? Or should it be in shard >run info (avoided duplication). may be too much as we could extract it from the kernel used by CI since we have CONFIG_IKCONFIG=y. If we want igt_runner to collect this info and save in the results, then it should probably grab it from /proc/config.gz to make sure it's guaranteed to be in sync with the actual kernel being used. Quick check on what we'd need: $ # simulate grabbing the /proc/config.gz and piping it through base64 $ # to be able to add in the json $ ./scripts/extract-ikconfig build64/arch/x86/boot/compressed/vmlinux | gzip | base64 > config.gz.b64 $ ls -lh config.gz.b64 -rw-rw-r-- 1 lucas lucas 70K Dec 20 09:30 config.gz.b64 Humn... I would concentrate on things that aren't currently available anywhere. > >Maybe some other info, either igt_facts or lspci output? for lscpci output it seems there's already a TODO comment that nobody ever tackled :). And "Options" may is a reference to what we are talking here wrt env and args: $ git grep -A2 -B2 lspci runner/ runner/resultgen.c- * Result fields that are TODO: runner/resultgen.c- * runner/resultgen.c: * - lspci runner/resultgen.c- * - options runner/resultgen.c- */ Lucas De Marchi >Should we ask also display team and our CI? > >+cc Jari from display > >Regards, >Kamil > >> >> My preferred option would be to have e.g.: >> >> { >> "__type__": "TestrunResult", >> "results_version": 10, >> "name": "xe-2403-995cd30a4e222b6a7b4b40c36219e4937fd7109e\/bat-bmg-1\/0", >> "uname": "Linux bat-bmg-1 6.13.0-rc3-xe+ #1 SMP PREEMPT_DYNAMIC Thu Dec 19 14:40:51 UTC 2024 x86_64", >> "time_elapsed": { >> "__type__": "TimeAttribute", >> "start": 1734621126.8734231, >> "end": 1734621288.5994539 >> }, >> "environment": { >> "IGT_DEVICE": ... >> <any IGT_* env var> >> }, >> "argv": [ ... ] >> >> >> Lucas De Marchi ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Add env info to igt_runner (was: Re: [PATCH i-g-t 4/4] lib/igt_device_scan: Fix scan vs bind/unbind/reload) 2024-12-20 17:37 ` Lucas De Marchi @ 2024-12-20 21:15 ` Knop, Ryszard 2024-12-20 21:37 ` Lucas De Marchi 0 siblings, 1 reply; 6+ messages in thread From: Knop, Ryszard @ 2024-12-20 21:15 UTC (permalink / raw) To: Sousa, Gustavo, Kempczynski, Zbigniew, Dugast, Francois, Tahvanainen, Jari, Dixit, Ashutosh, igt-dev@lists.freedesktop.org, kamil.konieczny@linux.intel.com, peter.senna@linux.intel.com, De Marchi, Lucas On Fri, 2024-12-20 at 11:37 -0600, Lucas De Marchi wrote: > On Fri, Dec 20, 2024 at 06:05:22PM +0100, Kamil Konieczny wrote: > > Hi Lucas, > > On 2024-12-19 at 11:24:09 -0600, Lucas De Marchi wrote: > > > hijacking the thread and adding some people to Cc for the igt_runner question. > > > Previously In-Reply-To: <rnw3q6mhthnwyvowvszr2gllyjtbb2mozk4em272xlmkvm7pyl@szbhtg3sd7d7> > > > > > > On Thu, Dec 19, 2024 at 10:35:00AM -0600, Lucas De Marchi wrote: > > > > On Wed, Dec 18, 2024 at 07:34:19AM +0100, Zbigniew Kempczyński wrote: > > > > > On Tue, Dec 17, 2024 at 09:13:24PM -0800, Lucas De Marchi wrote: > > > > > > There's no guarantee a card will end up with the same device node when > > > > > > modules are loaded/unloaded and drivers bound/unbound. There's some > > > > > > fundamental issue with the igt's the way it is and it's also puzzling > > > > > > from the logs it looks like the device vanished from the bus, when in > > > > > > reality is just the SW state out of sync with what the kernel is > > > > > > exporting. > > > > > > > > > > > > Re-scanning when trying to match a device is not expensive compared to > > > > > > what most tests are doing, so simply force it to occur whenever trying > > > > > > to match a card. > > > > > > > > > > I also should comment the above. It is generally true, but I've noticed > > > > > getting attributes might be expensive. Even it may take up to few > > > > > seconds, that's why I've added some attributes we don't fetch from udev > > > > > (see is_on_blacklist()). If I'm not wrong getting 'config' was a cause > > > > > to limit attributes we fetch. > > > > > > > > why would we get all attributes and exclude some? Shouldn't we get only > > > > the attributes we actually use? AFAIK this logic is basically used by > > > > --device/IGT_DEVICE, right? What filters we normally use? > > > > > > > > I usually pass the pci slot (because I know that won't change > > > > dynamically and cause surprises). Apparently CI passes vendor/devid: > > > > > > > > export IGT_DEVICE=pci:vendor=$1,device=$2 > > > > > > > > (but it seems to vary depending on pipeline) > > > > > > > > Some devs pass the device node directly too as in a lot of places > > > > there's only ever card0 possible. > > > > > > > > > Could we dump the env and args somewhere so we know how igt_runner or > > > individual tests are being called without looking at the CI piepeline > > > sources? I was thinking about either having that info in the stdout > > > output of igt_runner or in the json. Another possibility would be in > > > dmesg, but I'm not sure it's a good option. Thoughts? > > > > Not only that, also parameters used to start igt_runner, > > what was in .igtrc file (if any), current wall time, > > does CI actually have an .igtrc? We can add, but I'd prioritize > things that are used and that we don't have annotated anywhere (with > easy/public access). Yes, but only on hosts working with Chamelium (for display output mappings). It should(tm) not be present on other DUTs. > > testlist prepared to run, free memory and free disk. > > metadata file for igt_resume, it will enable with prepared > > teslist to re-execute run. > > talking about igt_resume, there may be some issues doing this at > the igt_runner level: it may not match from one test to another if the > env didn't match when it started running from when it finished. > > 1) it may have gone through igt_resume after a reboot (hopefully in the > same machine) > > 2) for shards we slice the testlist and give it to different machines. > Ideally they have the same env, config, etc, but that also is not > guaranteed. > > Checking random tests in https://intel-gfx-ci.01.org/tree/intel-xe/shards-all.html? > > https://intel-gfx-ci.01.org/tree/intel-xe/xe-2404-26e6464dff2b3fe53049bd3b6e426cec43beb165/shard-bmg-1/igt@kms_async_flips@async-flip-with-page-flip-events-atomic@pipe-a-dp-2-4-rc-ccs.html > https://intel-gfx-ci.01.org/tree/intel-xe/xe-2404-26e6464dff2b3fe53049bd3b6e426cec43beb165/shard-bmg-1/igt@core_auth@basic-auth.html > > Do I understand it right that we simply have multiple resultsXX.json > for different runs so it's fine to create it at the global level? > What about a resume (1)? There's also the case where we collate results across multiple machines into a single larger JSON file that is then used for vis generation on 01.org (a single JSON will always cover the same kernel/IGT/scenario combo, but possibly on different hosts). A per-test env collection would be ideal, but it can also take a lot of extra space if you're just collecting all the envs, so be careful with this. In addition, envs may contain CI access tokens and internal data that we might not exactly want to publish - if you are going to implement this, add a configurable env key allowlist + blocklist. You can store all env keys not in the blocklist, but if a key is not in the allowlist, replace its value with "[ REDACTED ]" or something like that. This way we can explicitly strip out security-sensitive vars and useless values ($LS_COLORS, as we all know, are very relevant for CI tests :), while people can see what vars are available that we did not yet cover one way or another and request extending that allowlist. Thanks, Ryszard > > > > Also kernel config from /boot ? Or should it be in shard > > run info (avoided duplication). > > may be too much as we could extract it from the kernel used by CI > since we have CONFIG_IKCONFIG=y. If we want igt_runner to collect this > info and save in the results, then it should probably grab it from > /proc/config.gz to make sure it's guaranteed to be in sync with the > actual kernel being used. Quick check on what we'd need: > > $ # simulate grabbing the /proc/config.gz and piping it through base64 > $ # to be able to add in the json > $ ./scripts/extract-ikconfig build64/arch/x86/boot/compressed/vmlinux | gzip | base64 > config.gz.b64 > $ ls -lh config.gz.b64 > -rw-rw-r-- 1 lucas lucas 70K Dec 20 09:30 config.gz.b64 > > > Humn... I would concentrate on things that aren't currently available > anywhere. > > > > > Maybe some other info, either igt_facts or lspci output? > > for lscpci output it seems there's already a TODO comment that nobody > ever tackled :). And "Options" may is a reference to what we are talking > here wrt env and args: > > $ git grep -A2 -B2 lspci runner/ > runner/resultgen.c- * Result fields that are TODO: > runner/resultgen.c- * > runner/resultgen.c: * - lspci > runner/resultgen.c- * - options > runner/resultgen.c- */ > > Lucas De Marchi > > > Should we ask also display team and our CI? > > > > +cc Jari from display > > > > Regards, > > Kamil > > > > > > > > My preferred option would be to have e.g.: > > > > > > { > > > "__type__": "TestrunResult", > > > "results_version": 10, > > > "name": "xe-2403-995cd30a4e222b6a7b4b40c36219e4937fd7109e\/bat-bmg-1\/0", > > > "uname": "Linux bat-bmg-1 6.13.0-rc3-xe+ #1 SMP PREEMPT_DYNAMIC Thu Dec 19 14:40:51 UTC 2024 x86_64", > > > "time_elapsed": { > > > "__type__": "TimeAttribute", > > > "start": 1734621126.8734231, > > > "end": 1734621288.5994539 > > > }, > > > "environment": { > > > "IGT_DEVICE": ... > > > <any IGT_* env var> > > > }, > > > "argv": [ ... ] > > > > > > > > > Lucas De Marchi ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Add env info to igt_runner (was: Re: [PATCH i-g-t 4/4] lib/igt_device_scan: Fix scan vs bind/unbind/reload) 2024-12-20 21:15 ` Knop, Ryszard @ 2024-12-20 21:37 ` Lucas De Marchi 0 siblings, 0 replies; 6+ messages in thread From: Lucas De Marchi @ 2024-12-20 21:37 UTC (permalink / raw) To: Knop, Ryszard Cc: Sousa, Gustavo, Kempczynski, Zbigniew, Dugast, Francois, Tahvanainen, Jari, Dixit, Ashutosh, igt-dev@lists.freedesktop.org, kamil.konieczny@linux.intel.com, peter.senna@linux.intel.com On Fri, Dec 20, 2024 at 03:15:02PM -0600, Knop, Ryszard wrote: >On Fri, 2024-12-20 at 11:37 -0600, Lucas De Marchi wrote: >> On Fri, Dec 20, 2024 at 06:05:22PM +0100, Kamil Konieczny wrote: >> > Hi Lucas, >> > On 2024-12-19 at 11:24:09 -0600, Lucas De Marchi wrote: >> > > hijacking the thread and adding some people to Cc for the igt_runner question. >> > > Previously In-Reply-To: <rnw3q6mhthnwyvowvszr2gllyjtbb2mozk4em272xlmkvm7pyl@szbhtg3sd7d7> >> > > >> > > On Thu, Dec 19, 2024 at 10:35:00AM -0600, Lucas De Marchi wrote: >> > > > On Wed, Dec 18, 2024 at 07:34:19AM +0100, Zbigniew Kempczyński wrote: >> > > > > On Tue, Dec 17, 2024 at 09:13:24PM -0800, Lucas De Marchi wrote: >> > > > > > There's no guarantee a card will end up with the same device node when >> > > > > > modules are loaded/unloaded and drivers bound/unbound. There's some >> > > > > > fundamental issue with the igt's the way it is and it's also puzzling >> > > > > > from the logs it looks like the device vanished from the bus, when in >> > > > > > reality is just the SW state out of sync with what the kernel is >> > > > > > exporting. >> > > > > > >> > > > > > Re-scanning when trying to match a device is not expensive compared to >> > > > > > what most tests are doing, so simply force it to occur whenever trying >> > > > > > to match a card. >> > > > > >> > > > > I also should comment the above. It is generally true, but I've noticed >> > > > > getting attributes might be expensive. Even it may take up to few >> > > > > seconds, that's why I've added some attributes we don't fetch from udev >> > > > > (see is_on_blacklist()). If I'm not wrong getting 'config' was a cause >> > > > > to limit attributes we fetch. >> > > > >> > > > why would we get all attributes and exclude some? Shouldn't we get only >> > > > the attributes we actually use? AFAIK this logic is basically used by >> > > > --device/IGT_DEVICE, right? What filters we normally use? >> > > > >> > > > I usually pass the pci slot (because I know that won't change >> > > > dynamically and cause surprises). Apparently CI passes vendor/devid: >> > > > >> > > > export IGT_DEVICE=pci:vendor=$1,device=$2 >> > > > >> > > > (but it seems to vary depending on pipeline) >> > > > >> > > > Some devs pass the device node directly too as in a lot of places >> > > > there's only ever card0 possible. >> > > >> > > >> > > Could we dump the env and args somewhere so we know how igt_runner or >> > > individual tests are being called without looking at the CI piepeline >> > > sources? I was thinking about either having that info in the stdout >> > > output of igt_runner or in the json. Another possibility would be in >> > > dmesg, but I'm not sure it's a good option. Thoughts? >> > >> > Not only that, also parameters used to start igt_runner, >> > what was in .igtrc file (if any), current wall time, >> >> does CI actually have an .igtrc? We can add, but I'd prioritize >> things that are used and that we don't have annotated anywhere (with >> easy/public access). > >Yes, but only on hosts working with Chamelium (for display output >mappings). It should(tm) not be present on other DUTs. > > >> > testlist prepared to run, free memory and free disk. >> > metadata file for igt_resume, it will enable with prepared >> > teslist to re-execute run. >> >> talking about igt_resume, there may be some issues doing this at >> the igt_runner level: it may not match from one test to another if the >> env didn't match when it started running from when it finished. >> >> 1) it may have gone through igt_resume after a reboot (hopefully in the >> same machine) >> >> 2) for shards we slice the testlist and give it to different machines. >> Ideally they have the same env, config, etc, but that also is not >> guaranteed. >> >> Checking random tests in https://intel-gfx-ci.01.org/tree/intel-xe/shards-all.html? >> >> https://intel-gfx-ci.01.org/tree/intel-xe/xe-2404-26e6464dff2b3fe53049bd3b6e426cec43beb165/shard-bmg-1/igt@kms_async_flips@async-flip-with-page-flip-events-atomic@pipe-a-dp-2-4-rc-ccs.html >> https://intel-gfx-ci.01.org/tree/intel-xe/xe-2404-26e6464dff2b3fe53049bd3b6e426cec43beb165/shard-bmg-1/igt@core_auth@basic-auth.html >> >> Do I understand it right that we simply have multiple resultsXX.json >> for different runs so it's fine to create it at the global level? >> What about a resume (1)? > >There's also the case where we collate results across multiple machines >into a single larger JSON file that is then used for vis generation on >01.org (a single JSON will always cover the same kernel/IGT/scenario >combo, but possibly on different hosts). A per-test env collection >would be ideal, but it can also take a lot of extra space if you're >just collecting all the envs, so be careful with this. > >In addition, envs may contain CI access tokens and internal data that >we might not exactly want to publish - if you are going to implement >this, add a configurable env key allowlist + blocklist. > >You can store all env keys not in the blocklist, but if a key is not in >the allowlist, replace its value with "[ REDACTED ]" or something like >that. This way we can explicitly strip out security-sensitive vars and >useless values ($LS_COLORS, as we all know, are very relevant for CI >tests :), while people can see what vars are available that we did not >yet cover one way or another and request extending that allowlist. my plan is to avoid the whole thing by simply print the vars igt itself makes use of and hand pick the ones we actually care: $ git grep getenv | grep IGT lib/igt_core.c: key_file_env = getenv("IGT_CONFIG_PATH"); lib/igt_core.c: if (!isatty(STDOUT_FILENO) || getenv("IGT_PLAIN_OUTPUT")) lib/igt_core.c: env = getenv("IGT_LOG_LEVEL"); lib/igt_core.c: igt_frame_dump_path = getenv("IGT_FRAME_DUMP_PATH"); lib/igt_core.c: stderr_needs_sentinel = getenv("IGT_SENTINEL_ON_STDERR") != NULL; lib/igt_core.c: env = getenv("IGT_FORCE_DRIVER"); lib/igt_core.c: env = getenv("IGT_DEVICE"); lib/igt_core.c: env = getenv("IGT_RUNNER_SOCKET_FD"); lib/igt_core.c: const char *param = getenv("IGT_SRANDOM"); lib/igt_frame.c: id = getenv("IGT_FRAME_DUMP_ID"); lib/igt_gt.c: if (getenv("IGT_NO_FORCEWAKE")) lib/igt_kmod.c: dirname = getenv("IGT_KMOD_DIRNAME"); lib/igt_kmod.c: config_paths_str = getenv("IGT_KMOD_CONFIG_PATHS"); lib/igt_kms.c: env = getenv("IGT_KMS_RESOLUTION"); lib/igt_kms.c: if (!getenv("IGT_KMS_IGNORE_HPD")) lib/igt_pipe_crc.c: env_source = getenv("IGT_CRC_SOURCE"); lib/igt_psr.c: env = getenv("IGT_PSR_SINK_STATUS_CHECKS"); runner/executor.c: ping_hostname = getenv("IGT_PING_HOSTNAME"); runner/executor.c: if (socketfd >= 0 && !getenv("IGT_RUNNER_DISABLE_SOCKET_COMMUNICATION")) { runner/settings.c: if ((env_test_root = getenv("IGT_TEST_ROOT")) != NULL) { tests/intel/i915_module_load.c: param = getenv("IGT_SRANDOM"); tests/intel/perf_pmu.c: if (getenv("IGT_NO_FORCEWAKE")) tests/kms_writeback.c: path_name = getenv("IGT_FRAME_DUMP_PATH"); tests/meta_test.c: igt_skip_on_f(!getenv("IGT_CI_META_TEST"), tools/intel_pm_rpm.c: env_device = getenv("IGT_DEVICE"); tools/lsgpu.c: env_device = getenv("IGT_DEVICE"); So initial list would be: IGT_PLAIN_OUTPUT IGT_LOG_LEVEL IGT_SENTINEL_ON_STDERR IGT_FORCE_DRIVER IGT_DEVICE IGT_NO_FORCEWAKE IGT_KMOD_DIRNAME IGT_KMOD_CONFIG_PATHS IGT_KMS_RESOLUTION IGT_KMS_IGNORE_HPD IGT_CRC_SOURCE IGT_PSR_SINK_STATUS_CHECKS Lucas De Marchi > >Thanks, Ryszard > >> > >> > Also kernel config from /boot ? Or should it be in shard >> > run info (avoided duplication). >> >> may be too much as we could extract it from the kernel used by CI >> since we have CONFIG_IKCONFIG=y. If we want igt_runner to collect this >> info and save in the results, then it should probably grab it from >> /proc/config.gz to make sure it's guaranteed to be in sync with the >> actual kernel being used. Quick check on what we'd need: >> >> $ # simulate grabbing the /proc/config.gz and piping it through base64 >> $ # to be able to add in the json >> $ ./scripts/extract-ikconfig build64/arch/x86/boot/compressed/vmlinux | gzip | base64 > config.gz.b64 >> $ ls -lh config.gz.b64 >> -rw-rw-r-- 1 lucas lucas 70K Dec 20 09:30 config.gz.b64 >> >> >> Humn... I would concentrate on things that aren't currently available >> anywhere. >> >> > >> > Maybe some other info, either igt_facts or lspci output? >> >> for lscpci output it seems there's already a TODO comment that nobody >> ever tackled :). And "Options" may is a reference to what we are talking >> here wrt env and args: >> >> $ git grep -A2 -B2 lspci runner/ >> runner/resultgen.c- * Result fields that are TODO: >> runner/resultgen.c- * >> runner/resultgen.c: * - lspci >> runner/resultgen.c- * - options >> runner/resultgen.c- */ >> >> Lucas De Marchi >> >> > Should we ask also display team and our CI? >> > >> > +cc Jari from display >> > >> > Regards, >> > Kamil >> > >> > > >> > > My preferred option would be to have e.g.: >> > > >> > > { >> > > "__type__": "TestrunResult", >> > > "results_version": 10, >> > > "name": "xe-2403-995cd30a4e222b6a7b4b40c36219e4937fd7109e\/bat-bmg-1\/0", >> > > "uname": "Linux bat-bmg-1 6.13.0-rc3-xe+ #1 SMP PREEMPT_DYNAMIC Thu Dec 19 14:40:51 UTC 2024 x86_64", >> > > "time_elapsed": { >> > > "__type__": "TimeAttribute", >> > > "start": 1734621126.8734231, >> > > "end": 1734621288.5994539 >> > > }, >> > > "environment": { >> > > "IGT_DEVICE": ... >> > > <any IGT_* env var> >> > > }, >> > > "argv": [ ... ] >> > > >> > > >> > > Lucas De Marchi > ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2024-12-20 21:37 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-12-19 17:24 Add env info to igt_runner (was: Re: [PATCH i-g-t 4/4] lib/igt_device_scan: Fix scan vs bind/unbind/reload) Lucas De Marchi 2024-12-19 18:38 ` Gustavo Sousa 2024-12-20 17:05 ` Kamil Konieczny 2024-12-20 17:37 ` Lucas De Marchi 2024-12-20 21:15 ` Knop, Ryszard 2024-12-20 21:37 ` Lucas De Marchi
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox