public inbox for igt-dev@lists.freedesktop.org
From: "Hajda, Andrzej" <andrzej.hajda@intel.com>
To: <priyanka.dandamudi@intel.com>, <igt-dev@lists.freedesktop.org>
Cc: "Konieczny, Kamil" <kamil.konieczny@intel.com>
Subject: Re: [PATCH i-g-t 1/5] lib/[gpu_cmds|xehp_media]: Introduce gpu execution commands for an efficient 64bit mode.
Date: Fri, 27 Feb 2026 11:02:04 +0100	[thread overview]
Message-ID: <c6af424d-33fa-4c40-8a43-c217499aa379@intel.com> (raw)
In-Reply-To: <20260224082800.1581935-2-priyanka.dandamudi@intel.com>

On 24.02.2026 09:27, priyanka.dandamudi@intel.com wrote:
> From: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>
> 
> The efficient 64-bit mode is introduced with XE3p and makes all heaps
> managed by SW. In order to use efficient 64-bit mode, the batchbuffer
> has to use the newly introduced instructions (COMPUTE_WALKER2, etc.)
> and a new interface_descriptor for compute pipeline configuration and
> execution.
> 
> Signed-off-by: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>
> Signed-off-by: Priyanka Dandamudi <priyanka.dandamudi@intel.com>
> ---
>   lib/gpu_cmds.c   | 210 ++++++++++++++++++++++++++++++++++++++++++++++++
>   lib/gpu_cmds.h   |  17 ++++
>   lib/xehp_media.h |  65 +++++++++++++++
>   3 files changed, 292 insertions(+)
> 
> diff --git a/lib/gpu_cmds.c b/lib/gpu_cmds.c
> index a6a9247dce..dd12c046c7 100644
> --- a/lib/gpu_cmds.c
> +++ b/lib/gpu_cmds.c
> @@ -934,6 +934,38 @@ xehp_fill_interface_descriptor(struct intel_bb *ibb,
>   	idd->desc5.num_threads_in_tg = 1;
>   }
>   
> +/*
> + * XE3P
> + */

This comment adds nothing; the xe3p prefix in the function name already
says it. Either expand it into full documentation or drop it.

> +void
> +xe3p_fill_interface_descriptor(struct intel_bb *ibb,
> +			       struct intel_buf *dst,
> +			       const uint32_t kernel[][4],
> +			       size_t size,
> +			       struct xe3p_interface_descriptor_data *idd)
> +{
> +	uint64_t kernel_offset;
> +
> +	kernel_offset = gen7_fill_kernel(ibb, kernel, size);
> +	kernel_offset += ibb->batch_offset;
> +
> +	memset(idd, 0, sizeof(*idd));
> +
> +	/* 64-bit canonical format setting is needed. */
> +	idd->dw00.kernel_start_pointer = (((uint32_t)kernel_offset) >> 6);
> +	idd->dw01.kernel_start_pointer_high = kernel_offset >> 32;
> +
> +	/* Single program flow: no SIMD-lane-specific branching in EU thread execution */
> +	idd->dw02.single_program_flow = 1;
> +	idd->dw02.floating_point_mode = GEN8_FLOATING_POINT_IEEE_754;
> +
> +	/*
> +	 * For testing purposes, use only one thread per thread group.
> +	 * This makes it possible to identify threads by thread group id.
> +	 */
> +	idd->dw05.number_of_threads_in_gpgpu_thread_group = 1;
> +}
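As a side note, the two kernel-start-pointer fields split a canonical
64-bit address across dw00 (bits 6..31, valid because the kernel is
64-byte aligned) and dw01 (bits 32..63). A standalone sketch of how they
recombine; the helper name is mine, not from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Recombine the descriptor fields from the hunk above into the original
 * 64-byte-aligned kernel offset. The helper name is illustrative. */
uint64_t xe3p_kernel_start(uint32_t dw00_ksp, uint32_t dw01_high)
{
	/* dw00.kernel_start_pointer holds address bits 6..31,
	 * dw01.kernel_start_pointer_high holds bits 32..63. */
	return ((uint64_t)dw01_high << 32) | ((uint64_t)dw00_ksp << 6);
}
```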
> +
>   static uint32_t
>   xehp_fill_surface_state(struct intel_bb *ibb,
>   			struct intel_buf *buf,
> @@ -1086,6 +1118,66 @@ xehp_emit_state_base_address(struct intel_bb *ibb)
>   	intel_bb_out(ibb, 0);                                       //dw21
>   }
>   
> +void
> +xe3p_emit_state_base_address(struct intel_bb *ibb)
> +{
> +	intel_bb_out(ibb, GEN8_STATE_BASE_ADDRESS | 0x14);            //dw0
> +
> +	/* general state */
> +	intel_bb_out(ibb, 0 | BASE_ADDRESS_MODIFY);                   //dw1-dw2

Not sure what the point of "0 |" is, here and in other places.

> +	intel_bb_out(ibb, 0);
> +
> +	/*
> +	 * For full 64b mode, set BASEADDR_DIS.
> +	 * In full 64b mode, all heaps are managed by SW and the
> +	 * STATE_BASE_ADDRESS base addresses are ignored by HW.
> +	 * The stateless data port MOCS is not set, so EU threads have to
> +	 * use only uncached accesses (without MOCS) on load/store.
> +	 */
> +	intel_bb_out(ibb, 1 << 30);                                   //dw3

Define and use a BASEADDR_DIS macro instead of the "1 << 30" magic number.
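A minimal sketch of what that could look like; the macro name and value
placement are my suggestion, not the patch's:

```c
#include <assert.h>
#include <stdint.h>

/* Suggested replacement for the magic number. The name is illustrative;
 * the real definition belongs next to the other command defines in the
 * IGT headers. */
#define XE3P_SBA_BASEADDR_DIS (1u << 30)

/* dw3 of STATE_BASE_ADDRESS would then read:
 *	intel_bb_out(ibb, XE3P_SBA_BASEADDR_DIS);
 */
uint32_t xe3p_sba_dw3(void)
{
	return XE3P_SBA_BASEADDR_DIS;
}
```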

> +
> +	/* surface state */
> +	intel_bb_out(ibb, 0 | BASE_ADDRESS_MODIFY);                   //dw4-dw5
> +	intel_bb_out(ibb, 0);
> +
> +	/* dynamic state */
> +	intel_bb_out(ibb, 0 | BASE_ADDRESS_MODIFY);                   //dw6-dw7
> +	intel_bb_out(ibb, 0);
> +
> +	intel_bb_out(ibb, 0);                                         //dw8-dw9
> +	intel_bb_out(ibb, 0);
> +
> +	/* instruction */
> +	intel_bb_emit_reloc(ibb, ibb->handle,
> +			    I915_GEM_DOMAIN_INSTRUCTION,              //dw10-dw11
> +			    0, BASE_ADDRESS_MODIFY, 0x0);
> +
> +	/* general state buffer size */
> +	intel_bb_out(ibb, 0xfffff000 | 1);                            //dw12
> +
> +	/* dynamic state buffer size */
> +	intel_bb_out(ibb, ALIGN(ibb->size, 1 << 12) | 1);             //dw13
> +
> +	intel_bb_out(ibb, 0);                          	              //dw14
> +
> +	/* instruction buffer size */
> +	intel_bb_out(ibb, ALIGN(ibb->size, 1 << 12) | 1);             //dw15
> +
> +	/* Bindless surface state base address */
> +	intel_bb_out(ibb, 0 | BASE_ADDRESS_MODIFY);                   //dw16-17
> +	intel_bb_out(ibb, 0);
> +
> +	/* Bindless surface state size */
> +	/* number of surface state entries in the Bindless Surface State buffer */
> +	intel_bb_out(ibb, 0xfffff000);                                //dw18
> +
> +	/* Bindless sampler state */
> +	intel_bb_out(ibb, 0 | BASE_ADDRESS_MODIFY);                   //dw19-20
> +	intel_bb_out(ibb, 0);
> +	/*  Bindless sampler state size */
> +	intel_bb_out(ibb, 0);                                         //dw21
> +}
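The dynamic-state and instruction buffer size dwords above round
ibb->size up to a 4 KiB page with ALIGN and OR in the enable bit. A
standalone sketch of that computation; ALIGN is reproduced here under
the usual power-of-two definition, which is an assumption about the IGT
macro:

```c
#include <assert.h>
#include <stdint.h>

/* Power-of-two round-up, as ALIGN is conventionally defined. */
#define ALIGN_POW2(v, a) (((v) + (a) - 1) & ~((uint32_t)(a) - 1))

/* Buffer-size dword as emitted above: page-aligned size | enable bit. */
uint32_t xe3p_buffer_size_dw(uint32_t size)
{
	return ALIGN_POW2(size, 1u << 12) | 1;
}
```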
> +
>   void
>   xehp_emit_compute_walk(struct intel_bb *ibb,
>   		       unsigned int x, unsigned int y,
> @@ -1175,3 +1267,121 @@ xehp_emit_compute_walk(struct intel_bb *ibb,
>   		intel_bb_out(ibb, 0x0);
>   	}
>   }
> +
> +void
> +xe3p_emit_compute_walk2(struct intel_bb *ibb,
> +			unsigned int x, unsigned int y,
> +			unsigned int width, unsigned int height,
> +			struct xe3p_interface_descriptor_data *pidd,
> +			uint32_t max_threads)
> +{
> +	/*
> +	 * Max Threads field range: [1, 2^16-1];
> +	 * Max Threads limit range: [64, number of subslices * number of EUs per subslice * number of threads per EU]
> +	 * TODO: MAX_THREADS needs to use (number of subslices * number of EUs per subslice * number of threads per EU)
> +	 */

The comment should either be removed or applied to the code.

As this is upstreaming of internal code, maybe it is OK to keep it as
is, to avoid rebase conflicts? Please ask the IGT maintainers which
solution is preferred here.
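For reference, the clamp the comment describes, bounding the thread
count to [64, 2^16 - 1], would look like this (a sketch with an
illustrative helper name, not the patch's code):

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a requested thread count to the documented limits:
 * at least 64, at most the 16-bit field maximum of 2^16 - 1. */
uint32_t xe3p_clamp_max_threads(uint32_t requested)
{
	const uint32_t lo = 64, hi = (1u << 16) - 1;

	if (requested < lo)
		return lo;
	if (requested > hi)
		return hi;
	return requested;
}
```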

Regards
Andrzej



> +	const uint32_t MAX_THREADS = (1 << 16) - 1;
> +	uint32_t x_dim, y_dim, mask, max;
> +
> +	/*
> +	 * Simply do SIMD16 based dispatch, so every thread uses
> +	 * SIMD16 channels.
> +	 *
> +	 * Define our own thread group size, e.g. 16x1 for every group; then
> +	 * each group has 1 thread in SIMD16 dispatch. So thread
> +	 * width/height/depth are all 1.
> +	 *
> +	 * Then thread group X = width / 16 (aligned to 16)
> +	 * thread group Y = height;
> +	 */
> +	x_dim = (x + width + 15) / 16;
> +	y_dim = y + height;
> +
> +	mask = (x + width) & 15;
> +	if (mask == 0)
> +		mask = (1 << 16) - 1;
> +	else
> +		mask = (1 << mask) - 1;
> +
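The thread-group count and execution mask above can be recomputed
standalone to see what a given extent produces (helper names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* Thread groups along X: the SIMD16 extent rounded up to 16 lanes. */
uint32_t xe3p_x_dim(unsigned int x, unsigned int width)
{
	return (x + width + 15) / 16;
}

/* Execution mask for the last thread group: all 16 lanes active when
 * the extent is 16-aligned, otherwise only the remainder lanes. */
uint32_t xe3p_exec_mask(unsigned int x, unsigned int width)
{
	uint32_t rem = (x + width) & 15;

	return rem ? (1u << rem) - 1 : (1u << 16) - 1;
}
```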
> +	intel_bb_out(ibb, XE3P_COMPUTE_WALKER2 | 0x3e);			//header, 0x3e => dw length: 62
> +
> +	intel_bb_out(ibb, 0); /* debug object id */			//dw0
> +	intel_bb_out(ibb, 0);						//dw1
> +
> +	/* Maximum Number of Threads */
> +	max = min_t(uint32_t, max_t(uint32_t, max_threads, 64), MAX_THREADS);
> +	intel_bb_out(ibb, max << 16);					//dw2
> +
> +	/* SIMD size, size: SIMT16 | enable inline Parameter | Message SIMT16 */
> +	intel_bb_out(ibb, 1 << 30 | 1 << 25 | 1 << 17);			//dw3
> +
> +	/* Execution mask: masking the use of some SIMD lanes by the last thread in a thread group */
> +	intel_bb_out(ibb, mask);					//dw4
> +
> +	/*
> +	 * LWS =(Local_X_Max+1)*(Local_Y_Max+1)*(Local_Z_Max+1).
> +	 */
> +	intel_bb_out(ibb, (x_dim << 20) | (y_dim << 10) | 1);		//dw5
> +
> +	/* Thread Group ID X Dimension */
> +	intel_bb_out(ibb, x_dim);					//dw6
> +
> +	/* Thread Group ID Y Dimension */
> +	intel_bb_out(ibb, y_dim);					//dw7
> +
> +	/* Thread Group ID Z Dimension */
> +	intel_bb_out(ibb, 1);						//dw8
> +
> +	/* Thread Group ID Starting X, Y, Z */
> +	intel_bb_out(ibb, x / 16);					//dw9
> +	intel_bb_out(ibb, y);						//dw10
> +	intel_bb_out(ibb, 0);						//dw11
> +
> +	/* partition type / id / size */
> +	intel_bb_out(ibb, 0);						//dw12-13
> +	intel_bb_out(ibb, 0);
> +
> +	/* Preempt X / Y / Z */
> +	intel_bb_out(ibb, 0);						//dw14
> +	intel_bb_out(ibb, 0);						//dw15
> +	intel_bb_out(ibb, 0);						//dw16
> +
> +	/* APQID, PostSync ID, Over dispatch TG count, Walker ID for preemption restore */
> +	intel_bb_out(ibb, 0);						//dw17
> +
> +	/* Interface descriptor data */
> +	for (int i = 0; i < 8; i++) {					//dw18-25
> +		intel_bb_out(ibb, ((uint32_t *) pidd)[i]);
> +	}
> +
> +	/* Post Sync command payload 0 */
> +	for (int i = 0; i < 5; i++) {					//dw26-30
> +		intel_bb_out(ibb, 0);
> +	}
> +
> +	/* Inline data */
> +	/* DW31 and DW32 of Inline data will be copied into R0.14 and R0.15. */
> +	/* The rest of DW33 through DW46 will be copied to the following GRFs. */
> +	intel_bb_out(ibb, x_dim);					//dw31
> +	for (int i = 0; i < 15; i++) {					//dw32-46
> +		intel_bb_out(ibb, 0);
> +	}
> +
> +	/* Post Sync command payload 1 */
> +	for (int i = 0; i < 5; i++) {					//dw47-51
> +		intel_bb_out(ibb, 0);
> +	}
> +
> +	/* Post Sync command payload 2 */
> +	for (int i = 0; i < 5; i++) {					//dw52-56
> +		intel_bb_out(ibb, 0);
> +	}
> +
> +	/* Post Sync command payload 3 */
> +	for (int i = 0; i < 5; i++) {					//dw57-61
> +		intel_bb_out(ibb, 0);
> +	}
> +
> +	/* Preempt CS Interrupt Vector: Saved by HW on a TG preemption */
> +	intel_bb_out(ibb, 0);						//dw62
> +}
> diff --git a/lib/gpu_cmds.h b/lib/gpu_cmds.h
> index 846d2122ac..c38eaad865 100644
> --- a/lib/gpu_cmds.h
> +++ b/lib/gpu_cmds.h
> @@ -126,6 +126,13 @@ xehp_fill_interface_descriptor(struct intel_bb *ibb,
>   void
>   xehp_emit_state_compute_mode(struct intel_bb *ibb, bool vrt);
>   
> +void
> +xe3p_fill_interface_descriptor(struct intel_bb *ibb,
> +			       struct intel_buf *dst,
> +			       const uint32_t kernel[][4],
> +			       size_t size,
> +			       struct xe3p_interface_descriptor_data *idd);
> +
>   void
>   xehp_emit_state_binding_table_pool_alloc(struct intel_bb *ibb);
>   
> @@ -137,6 +144,9 @@ xehp_emit_cfe_state(struct intel_bb *ibb, uint32_t threads);
>   void
>   xehp_emit_state_base_address(struct intel_bb *ibb);
>   
> +void
> +xe3p_emit_state_base_address(struct intel_bb *ibb);
> +
>   void
>   xehp_emit_compute_walk(struct intel_bb *ibb,
>   		       unsigned int x, unsigned int y,
> @@ -144,4 +154,11 @@ xehp_emit_compute_walk(struct intel_bb *ibb,
>   		       struct xehp_interface_descriptor_data *pidd,
>   		       uint8_t color);
>   
> +void
> +xe3p_emit_compute_walk2(struct intel_bb *ibb,
> +			unsigned int x, unsigned int y,
> +			unsigned int width, unsigned int height,
> +			struct xe3p_interface_descriptor_data *pidd,
> +			uint32_t max_threads);
> +
>   #endif /* GPU_CMDS_H */
> diff --git a/lib/xehp_media.h b/lib/xehp_media.h
> index 20227bd3a6..c88e0dfb62 100644
> --- a/lib/xehp_media.h
> +++ b/lib/xehp_media.h
> @@ -83,6 +83,71 @@ struct xehp_interface_descriptor_data {
>   	} desc7;
>   };
>   
> +struct xe3p_interface_descriptor_data {
> +	struct {
> +		uint32_t rsvd0:					BITRANGE(0, 5);
> +		uint32_t kernel_start_pointer:			BITRANGE(6, 31);
> +	} dw00;
> +
> +	struct {
> +		uint32_t kernel_start_pointer_high:		BITRANGE(0, 31);
> +	} dw01;
> +
> +	struct {
> +		uint32_t eu_thread_scheduling_mode_override:	BITRANGE(0, 1);
> +		uint32_t rsvd5:					BITRANGE(2, 6);
> +		uint32_t software_exception_enable:		BITRANGE(7, 7);
> +		uint32_t rsvd4:					BITRANGE(8, 12);
> +		uint32_t illegal_opcode_exception_enable:	BITRANGE(13, 13);
> +		uint32_t rsvd3:					BITRANGE(14, 15);
> +		uint32_t floating_point_mode:			BITRANGE(16, 16);
> +		uint32_t rsvd2:					BITRANGE(17, 17);
> +		uint32_t single_program_flow:			BITRANGE(18, 18);
> +		uint32_t denorm_mode:				BITRANGE(19, 19);
> +		uint32_t thread_preemption:			BITRANGE(20, 20);
> +		uint32_t rsvd1:					BITRANGE(21, 25);
> +		uint32_t registers_per_thread:			BITRANGE(26, 30);
> +		uint32_t rsvd0:					BITRANGE(31, 31);
> +	} dw02;
> +
> +	struct {
> +		uint32_t rsvd0:					BITRANGE(0, 31);
> +	} dw03;
> +
> +	struct {
> +		uint32_t rsvd0:					BITRANGE(0, 31);
> +	} dw04;
> +
> +	struct {
> +		uint32_t number_of_threads_in_gpgpu_thread_group: BITRANGE(0, 7);
> +		uint32_t rsvd3:					BITRANGE(8, 12);
> +		uint32_t thread_group_forward_progress_guarantee: BITRANGE(13, 13);
> +		uint32_t rsvd2:					BITRANGE(14, 14);
> +		uint32_t btd_mode:				BITRANGE(15, 15);
> +		uint32_t shared_local_memory_size:		BITRANGE(16, 20);
> +		uint32_t rsvd1:					BITRANGE(21, 21);
> +		uint32_t rounding_mode:				BITRANGE(22, 23);
> +		uint32_t rsvd0:					BITRANGE(24, 24);
> +		uint32_t thread_group_dispatch_size:		BITRANGE(25, 27);
> +		uint32_t number_of_barriers:			BITRANGE(28, 31);
> +	} dw05;
> +
> +	struct {
> +		uint32_t rsvd3:					BITRANGE(0, 7);
> +		uint32_t z_pass_async_compute_thread_limit:	BITRANGE(8, 10);
> +		uint32_t rsvd2:					BITRANGE(11, 11);
> +		uint32_t np_z_async_throttle_settings:		BITRANGE(12, 13);
> +		uint32_t rsvd1:					BITRANGE(14, 15);
> +		uint32_t ps_async_thread_limit:			BITRANGE(16, 18);
> +		uint32_t rsvd0:					BITRANGE(19, 31);
> +	} dw06;
> +
> +	struct {
> +		uint32_t preferred_slm_allocation_size:		BITRANGE(0, 3);
> +		uint32_t rsvd0:					BITRANGE(4, 31);
> +	} dw07;
> +};
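Since the walker emits the descriptor as eight raw dwords via
((uint32_t *)pidd)[i], each dwNN sub-struct must pack to exactly one
dword. That can be checked at compile time; the sketch below mirrors
dw02, with BITRANGE reproduced under the assumption that it expands to
the inclusive span width:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed expansion of the BITRANGE macro from lib/xehp_media.h. */
#define BITRANGE(start, end) ((end) - (start) + 1)

/* Mirror of dw02 from the struct above. */
struct xe3p_dw02_mirror {
	uint32_t eu_thread_scheduling_mode_override:	BITRANGE(0, 1);
	uint32_t rsvd5:					BITRANGE(2, 6);
	uint32_t software_exception_enable:		BITRANGE(7, 7);
	uint32_t rsvd4:					BITRANGE(8, 12);
	uint32_t illegal_opcode_exception_enable:	BITRANGE(13, 13);
	uint32_t rsvd3:					BITRANGE(14, 15);
	uint32_t floating_point_mode:			BITRANGE(16, 16);
	uint32_t rsvd2:					BITRANGE(17, 17);
	uint32_t single_program_flow:			BITRANGE(18, 18);
	uint32_t denorm_mode:				BITRANGE(19, 19);
	uint32_t thread_preemption:			BITRANGE(20, 20);
	uint32_t rsvd1:					BITRANGE(21, 25);
	uint32_t registers_per_thread:			BITRANGE(26, 30);
	uint32_t rsvd0:					BITRANGE(31, 31);
};

/* Field widths must sum to 32 so dw02 occupies exactly one dword. */
_Static_assert(sizeof(struct xe3p_dw02_mirror) == sizeof(uint32_t),
	       "dw02 must pack to one dword");
```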
> +
>   struct xehp_surface_state {
>   	struct {
>   		uint32_t cube_pos_z: BITRANGE(0, 0);



Thread overview: 16+ messages
2026-02-24  8:27 [PATCH i-g-t 0/5] Add support for xe3p and prefetch fault test priyanka.dandamudi
2026-02-24  8:27 ` [PATCH i-g-t 1/5] lib/[gpu_cmds|xehp_media]: Introduce gpu execution commands for an efficient 64bit mode priyanka.dandamudi
2026-02-27  6:57   ` Zbigniew Kempczyński
2026-02-27 10:02   ` Hajda, Andrzej [this message]
2026-02-24  8:27 ` [PATCH i-g-t 2/5] lib/allocator: add canonicalize helper for 57 bit priyanka.dandamudi
2026-02-27  6:58   ` Zbigniew Kempczyński
2026-02-27 14:04   ` Hajda, Andrzej
2026-02-24  8:27 ` [PATCH i-g-t 3/5] lib/gpgpu_shader: Add support for xe3p efficient 64bit addressing priyanka.dandamudi
2026-02-24  8:27 ` [PATCH i-g-t 4/5] scripts/generate_iga64_codes: Add support for xe3p and xe3 priyanka.dandamudi
2026-02-27  7:00   ` Zbigniew Kempczyński
2026-02-27  9:41   ` Hajda, Andrzej
2026-02-24  8:28 ` [PATCH i-g-t 5/5] tests/intel/xe_prefetch_fault: Add test for prefetch fault priyanka.dandamudi
2026-02-24 12:21 ` ✓ Xe.CI.BAT: success for Add support for xe3p and prefetch fault test Patchwork
2026-02-24 12:44 ` ✓ i915.CI.BAT: " Patchwork
2026-02-24 17:52 ` ✗ i915.CI.Full: failure " Patchwork
2026-02-24 22:19 ` ✗ Xe.CI.FULL: " Patchwork
