Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: "Teres Alexis, Alan Previn" <alan.previn.teres.alexis@intel.com>
To: "Dong, Zhanjun" <zhanjun.dong@intel.com>,
	"intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH v8 1/6] drm/xe/guc: Prepare GuC register list and update ADS size for error capture
Date: Fri, 10 May 2024 18:43:14 +0000	[thread overview]
Message-ID: <f55c5f7ef0ab613be9372d1f30cfb5071e872435.camel@intel.com> (raw)
In-Reply-To: <20240507014736.1057093-2-zhanjun.dong@intel.com>

On Mon, 2024-05-06 at 18:47 -0700, Zhanjun Dong wrote:
> Add referenced registers defines and list of registers.
> Update GuC ADS size allocation to include space for
> the lists of error state capture register descriptors.
> 
> 
alan:snip
> +int xe_calculate_guc_regs_capture_worst_size(struct xe_guc *guc)
> +{
> +       size_t total_size, class_size, instance_size, global_size;
> +       int i, j;
> +
> +       /* This function calculates the worst case register lists size by
> +        * including all possible engines classes. It is called during the
> +        * first of a two-phase GuC (and ADS-population) initialization
> +        * sequence, that is, during the pre-hwconfig phase before we have
> +        * the exact engine fusing info.
> +        */
> +       total_size = PAGE_SIZE; /* Pad a page in front for empty lists */
> +       for (i = 0; i < GUC_CAPTURE_LIST_INDEX_MAX; i++) {
> +               for (j = 0; j < GUC_LAST_ENGINE_CLASS; j++) {
alan: so in rev6, i mentioned in one of the other hunks that in guc interface,
the 'max engine class' definition used by guc-error-capture register-list-arrays
does not follow the definition of engine-class enum of above (which is used
for submission and other interfaces with GuC). I notice you fixed this on
another hunk but missed it above. I should have made it clear this change is
required for ALL cases where we are looping through the engine classes' register
lists used in guc- error-capture.
> +                       if (xe_guc_capture_getlistsize(guc, i,
> +                                                      GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
> +                                                      j, &class_size) < 0)
> +                               class_size = 0;
> +                       if (xe_guc_capture_getlistsize(guc, i,
> +                                                      GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
> +                                                      j, &instance_size) < 0)
> 
alan:snip

>  #define MAX_GOLDEN_LRC_SIZE    (SZ_4K * 64)
>  
>  int xe_guc_ads_init(struct xe_guc_ads *ads)
> @@ -398,6 +443,7 @@ int xe_guc_ads_init(struct xe_guc_ads *ads)
>         struct xe_bo *bo;
>  
>         ads->golden_lrc_size = calculate_golden_lrc_size(ads);
> +       ads->capture_size = xe_calculate_guc_regs_capture_worst_size(ads_to_guc(ads));
>         ads->regset_size = calculate_regset_size(gt);
>         ads->ads_waklv_size = calculate_waklv_size(ads);
>  
> @@ -431,9 +477,10 @@ int xe_guc_ads_init_post_hwconfig(struct xe_guc_ads *ads)
>         xe_gt_assert(gt, ads->bo);
>  
>         ads->golden_lrc_size = calculate_golden_lrc_size(ads);
> +       ads->capture_size = xe_calculate_guc_regs_capture_worst_size(ads_to_guc(ads));
alan: nit: we had an offline conversation that despite post_hw_config knowing
exact engine class list, we simpliify and improve execution time by using
worst-case-size in both pre and post hwconfig when deciding on size. Would be nice to
add a comment on that? (so folks dont assume its a bug)

alan:snip


>         ads->regset_size = calculate_regset_size(gt);
>  
> -       xe_gt_assert(gt, ads->golden_lrc_size +
> +       xe_gt_assert(gt, ads->golden_lrc_size + ads->capture_size +
>                      (ads->regset_size - prev_regset_size) <=
>                      MAX_GOLDEN_LRC_SIZE);
alan: i missed this in rev6 - why are we using ads->capture_size to verify MAX_GOLDEN_LRC_SIZE is
sufficient? i dont believe this macro is related to guc-error-capture reg list.

alan:snip

> +static int guc_capture_prep_lists(struct xe_guc_ads *ads)
> +{
> +       struct xe_guc *guc = ads_to_guc(ads);
> +       struct xe_gt *gt = ads_to_gt(ads);
> +       u32 ads_ggtt, capture_offset, null_ggtt, total_size = 0;
> +       struct iosys_map info_map;
> +       size_t size = 0;
> +       void *ptr;
>         int i, j;
alan:snip
> +                       /********************************************************/
> +                       /*** engine exists: start with engine-class registers ***/
> +                       /********************************************************/
> +                       write_empty_list = true; /* starting assumption is an empty list */
> +                       size = 0;
> +                       if (!xe_guc_capture_getlistsize(guc, i,
> +                                                       GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
> +                                                       j, &size)) {
> +                               if (total_size + size > ads->capture_size)
> +                                       xe_gt_dbg(gt, "Capture size overflow :%lu vs %d\n",
> +                                                 total_size + size, ads->capture_size);
alan: i think premerge hooks are failing due to format specifiers - please do fix that.
alan:snip
> +       if (ads->capture_size != PAGE_ALIGN(total_size))
> +               xe_gt_info(gt, "ADS capture alloc size changed from %d to %d\n",
> +                          ads->capture_size, PAGE_ALIGN(total_size));
> +       return PAGE_ALIGN(total_size);
>  }
alan:snip

> +#define COMMON_XELP_BASE_GLOBAL \
> +       { FORCEWAKE_GT,             0,      0, "FORCEWAKE" }
> +
> +#define COMMON_BASE_ENGINE_INSTANCE \
> +       { RING_ESR(0),              0,      0, "ESR" }, \
> +       { RING_EMR(0),              0,      0, "EMR" }, \
> +       { RING_EIR(0),              0,      0, "EIR" }, \
> +       { RING_EXECLIST_STATUS_HI(0), 0,    0, "RING_EXECLIST_STATUS_HI" }, \
> +       { RING_EXECLIST_STATUS_LO(0), 0,    0, "RING_EXECLIST_STATUS_LO" }, \
> +       { RING_DMA_FADD(0),         0,      0, "RING_DMA_FADD_LDW" }, \
> +       { RING_DMA_FADD_UDW(0),     0,      0, "RING_DMA_FADD_UDW" }, \
> +       { RING_IPEHR(0),            0,      0, "IPEHR" }, \
> +       { RING_BBADDR(0),           0,      0, "RING_BBADDR_LOW32" }, \
> +       { RING_BBADDR_UDW(0),       0,      0, "RING_BBADDR_UP32" }, \
> +       { RING_ACTHD(0),            0,      0, "ACTHD_LDW" }, \
> +       { RING_ACTHD_UDW(0),        0,      0, "ACTHD_UDW" }, \
> +       { RING_START(0),            0,      0, "START" }, \
> +       { RING_HEAD(0),             0,      0, "HEAD" }, \
> +       { RING_TAIL(0),             0,      0, "TAIL" }, \
> +       { RING_CTL(0),              0,      0, "CTL" }, \
> +       { RING_MI_MODE(0),          0,      0, "MODE" }, \
> +       { RING_HWS_PGA(0),          0,      0, "HWS" }, \
> +       { RING_MODE(0),             0,      0, "GFX_MODE" }
alan: I notice that "struct __guc_mmio_reg_descr.regname" is introduced here in
Patch-#1 and used in the population of the static reglist table above but then
never ever referenced in the entire series. In fact in Patch #6, its removed.

NOTE: this is different from i915 where this variable was used in the reporting
patch of i915's series but it looks like from this xe series version, we are
creating a shared "register->offset->name" lookup table list in xe_hw_engine.c
for both guc and execlist in Patch 6.

I believe this flirts dangerously close to breaking linux patch rules. A variable/
func not used in the entire series really shouldnt be added. In this case, we are
NOT consuming regname at all anywhere in the series. And then we remove it in the
subsequent patch-6.

alan:snip



> +static int
> +guc_capture_getlistsize(struct xe_guc *guc, u32 owner, u32 type, u32 classid,
> +                       size_t *size, bool is_purpose_est)
> +{
> +       struct xe_guc_state_capture *gc = guc->capture;
> +       struct __guc_capture_ads_cache *cache = &gc->ads_cache[owner][type][classid];
> +       int num_regs;
> +
> +       if (!gc->reglists) {
> +               xe_gt_warn(guc_to_gt(guc), "No capture reglist for this device\n");
> +               return -ENODEV;
> +       }
> +
> +       if (cache->is_valid) {
> +               *size = cache->size;
> +               return cache->status;
> +       }
> +
> +       if (!is_purpose_est && owner == GUC_CAPTURE_LIST_INDEX_PF &&
alan: so this one took me a while to figure out why it has become orphaned.
(after i went through the entire series hunting for it). I'm referign to "is_purpose_est".

So this is another i915 inherittance which did have valid reasoning. In i915 when we
were doing "check_guc_capture_size" (which is a one-time event), we wanted to flag if
any of the register tables were non existent. take note that at this early log-buffer-
calculation time, we assumed all engine classes are valid on the platform at
pre-hw-config time. The reason is because we did actually discover cases when new
platforms added new engines and the developer completely missed adding the register
table for error capture as well. With this message, it would get flagged as a warning.
However, this same lower level function was also used for the regular register
list population which can get called everytime we reset guc including coming
out of suspend or if we performed a GT reset and so we didn't want that warning
to happen every single time - only once per module load. so thats why we
created a wraper with that "is_purpose_est" to differentate if the caller
was for the log-buffer-size-estmation or for actual population where we only
print the warn in the former.

so my review comment would be: we already learnt the lessons from i915 so lets
use this properly. i suggest we fix patch #3 so that check_guc_capture_size
will call above guc_capture_getlistsize with is_purpose_est set true as opposed to
calling xe_calculate_guc_regs_capture_worst_size. I'll make the same comment
in Patch3

alan:snip

> +
> +#endif /* _XE_GUC_CAPTURE_H */
> diff --git a/drivers/gpu/drm/xe/xe_guc_capture_fwif.h b/drivers/gpu/drm/xe/xe_guc_capture_fwif.h
> new file mode 100644
> index 000000000000..79bc277afaa8
> --- /dev/null
alan:snip

> +struct guc_state_capture_group_header_t {
> +       u32 owner;
alan: If i read the the spec correctly, capture_group_header's owner
has VFID field at GENMASK(7,0).
> +       u32 info;
> +#define CAP_GRP_HDR_NUM_CAPTURES GENMASK(7, 0)
> +#define CAP_GRP_HDR_CAPTURE_TYPE GENMASK(15, 8) /* guc_capture_group_types */
> +} __packed;
alan:snip


  reply	other threads:[~2024-05-10 18:43 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-07  1:47 [PATCH v8 0/6] drm/xe/guc: Add GuC based register capture for error capture Zhanjun Dong
2024-05-07  1:47 ` [PATCH v8 1/6] drm/xe/guc: Prepare GuC register list and update ADS size " Zhanjun Dong
2024-05-10 18:43   ` Teres Alexis, Alan Previn [this message]
2024-05-14 22:44     ` Dong, Zhanjun
2024-05-10 18:58   ` Teres Alexis, Alan Previn
2024-05-07  1:47 ` [PATCH v8 2/6] drm/xe/guc: Add XE_LP steered register lists Zhanjun Dong
2024-05-11  0:17   ` Teres Alexis, Alan Previn
2024-05-14 23:00     ` Dong, Zhanjun
2024-05-07  1:47 ` [PATCH v8 3/6] drm/xe/guc: Add capture size check in GuC log buffer Zhanjun Dong
2024-05-08 22:57   ` Teres Alexis, Alan Previn
2024-05-15 21:39     ` Dong, Zhanjun
2024-05-07  1:47 ` [PATCH v8 4/6] drm/xe/guc: Extract GuC error capture lists Zhanjun Dong
2024-05-11  1:43   ` Teres Alexis, Alan Previn
2024-05-15 21:45     ` Dong, Zhanjun
2024-05-15 21:55       ` Dong, Zhanjun
2024-05-07  1:47 ` [PATCH v8 5/6] drm/xe/guc: Pre-allocate output nodes for extraction Zhanjun Dong
2024-05-11 18:07   ` Teres Alexis, Alan Previn
2024-05-07  1:47 ` [PATCH v8 6/6] drm/xe/guc: Plumb GuC-capture into dev coredump Zhanjun Dong
2024-05-11 20:25   ` Teres Alexis, Alan Previn
2024-05-07  4:17 ` ✓ CI.Patch_applied: success for drm/xe/guc: Add GuC based register capture for error capture (rev8) Patchwork
2024-05-07  4:18 ` ✗ CI.checkpatch: warning " Patchwork
2024-05-07  4:19 ` ✓ CI.KUnit: success " Patchwork
2024-05-07  4:31 ` ✓ CI.Build: " Patchwork
2024-05-07  4:41 ` ✗ CI.Hooks: failure " Patchwork
2024-05-07  4:49 ` ✓ CI.checksparse: success " Patchwork
2024-05-07  5:24 ` ✗ CI.BAT: failure " Patchwork
2024-05-07  9:35 ` ✗ CI.FULL: " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f55c5f7ef0ab613be9372d1f30cfb5071e872435.camel@intel.com \
    --to=alan.previn.teres.alexis@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=zhanjun.dong@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox