public inbox for igt-dev@lists.freedesktop.org
From: "Hajda, Andrzej" <andrzej.hajda@intel.com>
To: <priyanka.dandamudi@intel.com>, <igt-dev@lists.freedesktop.org>,
	<zbigniew.kempczynski@intel.com>
Subject: Re: [PATCH i-g-t v2 2/5] lib/[gpu_cmds|xehp_media]: Introduce gpu execution commands for an efficient 64bit mode.
Date: Tue, 3 Mar 2026 09:29:52 +0100	[thread overview]
Message-ID: <f3b70c57-55d3-4875-b97f-3d4cb9cdaed9@intel.com> (raw)
In-Reply-To: <20260303053104.1674811-3-priyanka.dandamudi@intel.com>

W dniu 3.03.2026 o 06:31, priyanka.dandamudi@intel.com pisze:
> From: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>
> 
> The efficient 64-bit mode is introduced with XE3p; it makes all heaps
> managed by SW. In order to use the efficient 64-bit mode, the batch buffer
> has to use the newly introduced instructions (COMPUTE_WALKER2, etc.) and a
> new interface descriptor for compute pipeline configuration and execution.
> 
> v2: Use xe_canonical_va in xe3p_fill_interface_descriptor for kernel_offset.
> Define BASEADDR_DIS and use it instead of hardcoding; also make some
> minor changes. (Andrzej)
> 
> Signed-off-by: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>
> Signed-off-by: Priyanka Dandamudi <priyanka.dandamudi@intel.com>
> ---
>   include/intel_gpu_commands.h |   1 +
>   lib/gpu_cmds.c               | 208 +++++++++++++++++++++++++++++++++++
>   lib/gpu_cmds.h               |  17 +++
>   lib/xehp_media.h             |  65 +++++++++++
>   4 files changed, 291 insertions(+)
> 
> diff --git a/include/intel_gpu_commands.h b/include/intel_gpu_commands.h
> index 5158bb0ea0..1998db4794 100644
> --- a/include/intel_gpu_commands.h
> +++ b/include/intel_gpu_commands.h
> @@ -400,6 +400,7 @@
>   #define MI_CONDITIONAL_BATCH_BUFFER_END MI_INSTR(0x36, 0)
>   #define  MI_DO_COMPARE		REG_BIT(21)
>   
> +#define BASEADDR_DIS (1 << 30)
>   #define STATE_BASE_ADDRESS \
>   	((0x3 << 29) | (0x0 << 27) | (0x1 << 24) | (0x1 << 16))
>   #define BASE_ADDRESS_MODIFY		REG_BIT(0)
> diff --git a/lib/gpu_cmds.c b/lib/gpu_cmds.c
> index a6a9247dce..cd912eb2ac 100644
> --- a/lib/gpu_cmds.c
> +++ b/lib/gpu_cmds.c
> @@ -24,6 +24,7 @@
>   
>   #include "gpu_cmds.h"
>   #include "intel_mocs.h"
> +#include "xe/xe_util.h"
>   
>   static uint32_t
>   xehp_fill_surface_state(struct intel_bb *ibb,
> @@ -934,6 +935,36 @@ xehp_fill_interface_descriptor(struct intel_bb *ibb,
>   	idd->desc5.num_threads_in_tg = 1;
>   }
>   
> +void
> +xe3p_fill_interface_descriptor(struct intel_bb *ibb,
> +			       struct intel_buf *dst,
> +			       const uint32_t kernel[][4],
> +			       size_t size,
> +			       struct xe3p_interface_descriptor_data *idd)
> +{
> +	uint64_t kernel_offset;
> +
> +	kernel_offset = gen7_fill_kernel(ibb, kernel, size);
> +	kernel_offset += ibb->batch_offset;
> +	kernel_offset = xe_canonical_va(ibb->fd, kernel_offset);
> +
> +	memset(idd, 0, sizeof(*idd));
> +
> +	/* 64-bit canonical format setting is needed. */
> +	idd->dw00.kernel_start_pointer = (((uint32_t)kernel_offset) >> 6);
> +	idd->dw01.kernel_start_pointer_high = kernel_offset >> 32;
> +
> +	/* Single program flow: no SIMD-specific branching in SIMD execution in EU threads */
> +	idd->dw02.single_program_flow = 1;
> +	idd->dw02.floating_point_mode = GEN8_FLOATING_POINT_IEEE_754;
> +
> +	/*
> +	 * For testing purposes, use only one thread per thread group.
> +	 * This makes it possible to identify threads by thread group id.
> +	 */
> +	idd->dw05.number_of_threads_in_gpgpu_thread_group = 1;
> +}
> +
>   static uint32_t
>   xehp_fill_surface_state(struct intel_bb *ibb,
>   			struct intel_buf *buf,
> @@ -1086,6 +1117,66 @@ xehp_emit_state_base_address(struct intel_bb *ibb)
>   	intel_bb_out(ibb, 0);                                       //dw21
>   }
>   
> +void
> +xe3p_emit_state_base_address(struct intel_bb *ibb)
> +{
> +	intel_bb_out(ibb, GEN8_STATE_BASE_ADDRESS | 0x14);            //dw0
> +
> +	/* general state */
> +	intel_bb_out(ibb, BASE_ADDRESS_MODIFY);                   //dw1-dw2
> +	intel_bb_out(ibb, 0);
> +
> +	/*
> +	 * For full 64b mode, set BASEADDR_DIS.
> +	 * In full 64b mode, all heaps are managed by SW and the
> +	 * STATE_BASE_ADDRESS base addresses are ignored by HW.
> +	 * The stateless data port MOCS is not set, so EU threads have to
> +	 * do load/store accesses uncached, without MOCS.
> +	 */
> +	intel_bb_out(ibb, BASEADDR_DIS);                                   //dw3
> +
> +	/* surface state */
> +	intel_bb_out(ibb, BASE_ADDRESS_MODIFY);                   //dw4-dw5
> +	intel_bb_out(ibb, 0);
> +
> +	/* dynamic state */
> +	intel_bb_out(ibb, BASE_ADDRESS_MODIFY);                   //dw6-dw7
> +	intel_bb_out(ibb, 0);
> +
> +	intel_bb_out(ibb, 0);                                         //dw8-dw9
> +	intel_bb_out(ibb, 0);
> +
> +	/* instruction */
> +	intel_bb_emit_reloc(ibb, ibb->handle,
> +			    I915_GEM_DOMAIN_INSTRUCTION,              //dw10-dw11
> +			    0, BASE_ADDRESS_MODIFY, 0x0);
> +
> +	/* general state buffer size */
> +	intel_bb_out(ibb, 0xfffff000 | 1);                            //dw12
> +
> +	/* dynamic state buffer size */
> +	intel_bb_out(ibb, ALIGN(ibb->size, 1 << 12) | 1);             //dw13
> +
> +	intel_bb_out(ibb, 0);                          	              //dw14
> +
> +	/* instruction buffer size */
> +	intel_bb_out(ibb, ALIGN(ibb->size, 1 << 12) | 1);             //dw15
> +
> +	/* Bindless surface state base address */
> +	intel_bb_out(ibb, BASE_ADDRESS_MODIFY);                   //dw16-17
> +	intel_bb_out(ibb, 0);
> +
> +	/* Bindless surface state size */
> +	/* number of surface state entries in the Bindless Surface State buffer */
> +	intel_bb_out(ibb, 0xfffff000);                                //dw18
> +
> +	/* Bindless sampler state */
> +	intel_bb_out(ibb, BASE_ADDRESS_MODIFY);                   //dw19-20
> +	intel_bb_out(ibb, 0);
> +	/*  Bindless sampler state size */
> +	intel_bb_out(ibb, 0);                                         //dw21
> +}
> +
>   void
>   xehp_emit_compute_walk(struct intel_bb *ibb,
>   		       unsigned int x, unsigned int y,
> @@ -1175,3 +1266,120 @@ xehp_emit_compute_walk(struct intel_bb *ibb,
>   		intel_bb_out(ibb, 0x0);
>   	}
>   }
> +
> +void
> +xe3p_emit_compute_walk2(struct intel_bb *ibb,
> +			unsigned int x, unsigned int y,
> +			unsigned int width, unsigned int height,
> +			struct xe3p_interface_descriptor_data *pidd,
> +			uint32_t max_threads)
> +{
> +	/*
> +	 * Max Threads valid range: [1, 2^16-1].
> +	 * Max Threads limit range: [64, number of subslices * number of EUs
> +	 * per subslice * number of threads per EU].
> +	 */
> +	const uint32_t MAX_THREADS = (1 << 16) - 1;
> +	uint32_t x_dim, y_dim, mask, max;
> +
> +	/*
> +	 * Simply do SIMD16-based dispatch, so every thread uses
> +	 * SIMD16 channels.
> +	 *
> +	 * Define our own thread group size, e.g. 16x1 for every group; then
> +	 * each group will have 1 thread in SIMD16 dispatch, so thread
> +	 * width/height/depth are all 1.
> +	 *
> +	 * Then thread group X = width / 16 (aligned up to 16),
> +	 * thread group Y = height.
> +	 */
> +	x_dim = (x + width + 15) / 16;
> +	y_dim = y + height;
> +
> +	mask = (x + width) & 15;
> +	if (mask == 0)
> +		mask = (1 << 16) - 1;
> +	else
> +		mask = (1 << mask) - 1;
> +
> +	intel_bb_out(ibb, XE3P_COMPUTE_WALKER2 | 0x3e);			//dw0, 0x3e => dw length: 62
> +
> +	intel_bb_out(ibb, 0); /* debug object id */			//dw0
> +	intel_bb_out(ibb, 0);						//dw1
> +
> +	/* Maximum Number of Threads */
> +	max = min_t(max_threads, max_t(max_threads, max_threads, 64), MAX_THREADS);

max = clamp(max_threads, 64, MAX_THREADS) looks better, up to you.


Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com>

Regards
Andrzej


> +	intel_bb_out(ibb, max << 16);					//dw2
> +
> +	/* SIMD size, size: SIMT16 | enable inline Parameter | Message SIMT16 */
> +	intel_bb_out(ibb, 1 << 30 | 1 << 25 | 1 << 17);			//dw3
> +
> +	/* Execution mask: masking the use of some SIMD lanes by the last thread in a thread group */
> +	intel_bb_out(ibb, mask);					//dw4
> +
> +	/*
> +	 * LWS =(Local_X_Max+1)*(Local_Y_Max+1)*(Local_Z_Max+1).
> +	 */
> +	intel_bb_out(ibb, (x_dim << 20) | (y_dim << 10) | 1);		//dw5
> +
> +	/* Thread Group ID X Dimension */
> +	intel_bb_out(ibb, x_dim);					//dw6
> +
> +	/* Thread Group ID Y Dimension */
> +	intel_bb_out(ibb, y_dim);					//dw7
> +
> +	/* Thread Group ID Z Dimension */
> +	intel_bb_out(ibb, 1);						//dw8
> +
> +	/* Thread Group ID Starting X, Y, Z */
> +	intel_bb_out(ibb, x / 16);					//dw9
> +	intel_bb_out(ibb, y);						//dw10
> +	intel_bb_out(ibb, 0);						//dw11
> +
> +	/* partition type / id / size */
> +	intel_bb_out(ibb, 0);						//dw12-13
> +	intel_bb_out(ibb, 0);
> +
> +	/* Preempt X / Y / Z */
> +	intel_bb_out(ibb, 0);						//dw14
> +	intel_bb_out(ibb, 0);						//dw15
> +	intel_bb_out(ibb, 0);						//dw16
> +
> +	/* APQID, PostSync ID, Over dispatch TG count, Walker ID for preemption restore */
> +	intel_bb_out(ibb, 0);						//dw17
> +
> +	/* Interface descriptor data */
> +	for (int i = 0; i < 8; i++) {					//dw18-25
> +		intel_bb_out(ibb, ((uint32_t *) pidd)[i]);
> +	}
> +
> +	/* Post Sync command payload 0 */
> +	for (int i = 0; i < 5; i++) {					//dw26-30
> +		intel_bb_out(ibb, 0);
> +	}
> +
> +	/* Inline data */
> +	/* DW31 and DW32 of Inline data will be copied into R0.14 and R0.15. */
> +	/* The rest of DW33 through DW46 will be copied to the following GRFs. */
> +	intel_bb_out(ibb, x_dim);					//dw31
> +	for (int i = 0; i < 15; i++) {					//dw32-46
> +		intel_bb_out(ibb, 0);
> +	}
> +
> +	/* Post Sync command payload 1 */
> +	for (int i = 0; i < 5; i++) {					//dw47-51
> +		intel_bb_out(ibb, 0);
> +	}
> +
> +	/* Post Sync command payload 2 */
> +	for (int i = 0; i < 5; i++) {					//dw52-56
> +		intel_bb_out(ibb, 0);
> +	}
> +
> +	/* Post Sync command payload 3 */
> +	for (int i = 0; i < 5; i++) {					//dw57-61
> +		intel_bb_out(ibb, 0);
> +	}
> +
> +	/* Preempt CS Interrupt Vector: Saved by HW on a TG preemption */
> +	intel_bb_out(ibb, 0);						//dw62
> +}
> diff --git a/lib/gpu_cmds.h b/lib/gpu_cmds.h
> index 846d2122ac..c38eaad865 100644
> --- a/lib/gpu_cmds.h
> +++ b/lib/gpu_cmds.h
> @@ -126,6 +126,13 @@ xehp_fill_interface_descriptor(struct intel_bb *ibb,
>   void
>   xehp_emit_state_compute_mode(struct intel_bb *ibb, bool vrt);
>   
> +void
> +xe3p_fill_interface_descriptor(struct intel_bb *ibb,
> +			       struct intel_buf *dst,
> +			       const uint32_t kernel[][4],
> +			       size_t size,
> +			       struct xe3p_interface_descriptor_data *idd);
> +
>   void
>   xehp_emit_state_binding_table_pool_alloc(struct intel_bb *ibb);
>   
> @@ -137,6 +144,9 @@ xehp_emit_cfe_state(struct intel_bb *ibb, uint32_t threads);
>   void
>   xehp_emit_state_base_address(struct intel_bb *ibb);
>   
> +void
> +xe3p_emit_state_base_address(struct intel_bb *ibb);
> +
>   void
>   xehp_emit_compute_walk(struct intel_bb *ibb,
>   		       unsigned int x, unsigned int y,
> @@ -144,4 +154,11 @@ xehp_emit_compute_walk(struct intel_bb *ibb,
>   		       struct xehp_interface_descriptor_data *pidd,
>   		       uint8_t color);
>   
> +void
> +xe3p_emit_compute_walk2(struct intel_bb *ibb,
> +			unsigned int x, unsigned int y,
> +			unsigned int width, unsigned int height,
> +			struct xe3p_interface_descriptor_data *pidd,
> +			uint32_t max_threads);
> +
>   #endif /* GPU_CMDS_H */
> diff --git a/lib/xehp_media.h b/lib/xehp_media.h
> index 20227bd3a6..c88e0dfb62 100644
> --- a/lib/xehp_media.h
> +++ b/lib/xehp_media.h
> @@ -83,6 +83,71 @@ struct xehp_interface_descriptor_data {
>   	} desc7;
>   };
>   
> +struct xe3p_interface_descriptor_data {
> +	struct {
> +		uint32_t rsvd0:					BITRANGE(0, 5);
> +		uint32_t kernel_start_pointer:			BITRANGE(6, 31);
> +	} dw00;
> +
> +	struct {
> +		uint32_t kernel_start_pointer_high:		BITRANGE(0, 31);
> +	} dw01;
> +
> +	struct {
> +		uint32_t eu_thread_scheduling_mode_override:	BITRANGE(0, 1);
> +		uint32_t rsvd5:					BITRANGE(2, 6);
> +		uint32_t software_exception_enable:		BITRANGE(7, 7);
> +		uint32_t rsvd4:					BITRANGE(8, 12);
> +		uint32_t illegal_opcode_exception_enable:	BITRANGE(13, 13);
> +		uint32_t rsvd3:					BITRANGE(14, 15);
> +		uint32_t floating_point_mode:			BITRANGE(16, 16);
> +		uint32_t rsvd2:					BITRANGE(17, 17);
> +		uint32_t single_program_flow:			BITRANGE(18, 18);
> +		uint32_t denorm_mode:				BITRANGE(19, 19);
> +		uint32_t thread_preemption:			BITRANGE(20, 20);
> +		uint32_t rsvd1:					BITRANGE(21, 25);
> +		uint32_t registers_per_thread:			BITRANGE(26, 30);
> +		uint32_t rsvd0:					BITRANGE(31, 31);
> +	} dw02;
> +
> +	struct {
> +		uint32_t rsvd0:					BITRANGE(0, 31);
> +	} dw03;
> +
> +	struct {
> +		uint32_t rsvd0:					BITRANGE(0, 31);
> +	} dw04;
> +
> +	struct {
> +		uint32_t number_of_threads_in_gpgpu_thread_group: BITRANGE(0, 7);
> +		uint32_t rsvd3:					BITRANGE(8, 12);
> +		uint32_t thread_group_forward_progress_guarantee: BITRANGE(13, 13);
> +		uint32_t rsvd2:					BITRANGE(14, 14);
> +		uint32_t btd_mode:				BITRANGE(15, 15);
> +		uint32_t shared_local_memory_size:		BITRANGE(16, 20);
> +		uint32_t rsvd1:					BITRANGE(21, 21);
> +		uint32_t rounding_mode:				BITRANGE(22, 23);
> +		uint32_t rsvd0:					BITRANGE(24, 24);
> +		uint32_t thread_group_dispatch_size:		BITRANGE(25, 27);
> +		uint32_t number_of_barriers:			BITRANGE(28, 31);
> +	} dw05;
> +
> +	struct {
> +		uint32_t rsvd3:					BITRANGE(0, 7);
> +		uint32_t z_pass_async_compute_thread_limit:	BITRANGE(8, 10);
> +		uint32_t rsvd2:					BITRANGE(11, 11);
> +		uint32_t np_z_async_throttle_settings:		BITRANGE(12, 13);
> +		uint32_t rsvd1:					BITRANGE(14, 15);
> +		uint32_t ps_async_thread_limit:			BITRANGE(16, 18);
> +		uint32_t rsvd0:					BITRANGE(19, 31);
> +	} dw06;
> +
> +	struct {
> +		uint32_t preferred_slm_allocation_size:		BITRANGE(0, 3);
> +		uint32_t rsvd0:					BITRANGE(4, 31);
> +	} dw07;
> +};
> +
>   struct xehp_surface_state {
>   	struct {
>   		uint32_t cube_pos_z: BITRANGE(0, 0);


  parent reply	other threads:[~2026-03-03  8:30 UTC|newest]

Thread overview: 14+ messages
2026-03-03  5:30 [PATCH i-g-t v2 0/5] Add support for xe3p and prefetch fault test priyanka.dandamudi
2026-03-03  5:31 ` [PATCH i-g-t v2 1/5] lib/xe_util.h: add canonical helper for 48/57 bit vas priyanka.dandamudi
2026-03-03  8:06   ` Hajda, Andrzej
2026-03-03  5:31 ` [PATCH i-g-t v2 2/5] lib/[gpu_cmds|xehp_media]: Introduce gpu execution commands for an efficient 64bit mode priyanka.dandamudi
2026-03-03  8:25   ` Zbigniew Kempczyński
2026-03-03  8:29   ` Hajda, Andrzej [this message]
2026-03-03  5:31 ` [PATCH i-g-t v2 3/5] lib/gpgpu_shader: Add support for xe3p efficient 64bit addressing priyanka.dandamudi
2026-03-03  8:30   ` Zbigniew Kempczyński
2026-03-03  5:31 ` [PATCH i-g-t v2 4/5] scripts/generate_iga64_codes: Add support for xe3p and xe3 priyanka.dandamudi
2026-03-03  5:31 ` [PATCH i-g-t v2 5/5] tests/intel/xe_prefetch_fault: Add test for prefetch fault priyanka.dandamudi
2026-03-03 23:20 ` ✓ Xe.CI.BAT: success for Add support for xe3p and prefetch fault test (rev2) Patchwork
2026-03-03 23:43 ` ✓ i915.CI.BAT: " Patchwork
2026-03-04 14:01 ` ✗ Xe.CI.FULL: failure " Patchwork
2026-03-04 17:45 ` ✗ i915.CI.Full: " Patchwork
