All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Christian König" <christian.koenig@amd.com>
To: Lucas De Marchi <lucas.demarchi@intel.com>,
	intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Cc: daniel.vetter@ffwll.ch, tzimmermann@suse.de,
	Chris Wilson <chris@chris-wilson.co.uk>
Subject: Re: [Intel-gfx] [PATCH] iosys-map: Add word-sized reads
Date: Fri, 10 Jun 2022 09:20:18 +0200	[thread overview]
Message-ID: <4faac373-1a30-e1a0-0c33-e10e8fc6184c@amd.com> (raw)
In-Reply-To: <20220609232020.2292649-1-lucas.demarchi@intel.com>

Am 10.06.22 um 01:20 schrieb Lucas De Marchi:
> Instead of always falling back to memcpy_fromio() for any size, prefer
> using read{b,w,l}(). When reading struct members it's common to read
> individual integer variables individually. Going through memcpy_fromio()
> for each of them poses a high penalty.
>
> Employ a similar trick as __seqprop() by using _Generic() to generate
> only the specific call based on a type-compatible variable.
>
> For a pariticular i915 workload producing GPU context switches,
> __get_engine_usage_record() is particularly hot since the engine usage
> is read from device local memory with dgfx, possibly multiple times
> since it's racy. Test execution time for this test shows a ~12.5%
> improvement with DG2:
>
> Before:
> 	nrepeats = 1000; min = 7.63243e+06; max = 1.01817e+07;
> 	median = 9.52548e+06; var = 526149;
> After:
> 	nrepeats = 1000; min = 7.03402e+06; max = 8.8832e+06;
> 	median = 8.33955e+06; var = 333113;
>
> Other things attempted that didn't prove very useful:
> 1) Change the _Generic() on x86 to just dereference the memory address
> 2) Change __get_engine_usage_record() to do just 1 read per loop,
>     comparing with the previous value read
> 3) Change __get_engine_usage_record() to access the fields directly as it
>     was before the conversion to iosys-map
>
> (3) did gave a small improvement (~3%), but doesn't seem to scale well
> to other similar cases in the driver.
>
> Additional test by Chris Wilson using gem_create from igt with some
> changes to track object creation time. This happens to accidentaly
> stress this code path:
>
> 	Pre iosys_map conversion of engine busyness:
> 	lmem0: Creating    262144 4KiB objects took 59274.2ms
>
> 	Unpatched:
> 	lmem0: Creating    262144 4KiB objects took 108830.2ms
>
> 	With readl (this patch):
> 	lmem0: Creating    262144 4KiB objects took 61348.6ms
>
> 	s/readl/READ_ONCE/
> 	lmem0: Creating    262144 4KiB objects took 61333.2ms
>
> So we do take a little bit more time than before the conversion, but
> that is due to other factors: bringing the READ_ONCE back would be as
> good as just doing this conversion.
>
> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
> ---
>
> If this is acceptable we should probably add the write counterpart, too.
> Sending here only the read for now since this fixes the issue we are
> seeing and to gather feedback.

As far as I can see looks sane to me, but the kernel test robot tears 
the patch apart.

Probably just a typo somewhere in the 32bit handling.

Apart from that looks good to me.

Regards,
Christian.

>
>   include/linux/iosys-map.h | 26 ++++++++++++++++++++++----
>   1 file changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/iosys-map.h b/include/linux/iosys-map.h
> index e69a002d5aa4..4ae3e459419e 100644
> --- a/include/linux/iosys-map.h
> +++ b/include/linux/iosys-map.h
> @@ -333,6 +333,20 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
>   		memset(dst->vaddr + offset, value, len);
>   }
>   
> +#ifdef CONFIG_64BIT
> +#define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_)			\
> +	u64: val_ = readq(vaddr_iomem_),
> +#else
> +#define __iosys_map_u64_case(val_, vaddr_iomem_)
> +#endif
> +
> +#define __iosys_map_rd_io(val__, vaddr_iomem__, type__) _Generic(val__,	\
> +	u8: val__ = readb(vaddr_iomem__),				\
> +	u16: val__ = readw(vaddr_iomem__),				\
> +	u32: val__ = readl(vaddr_iomem__),				\
> +	__iosys_map_rd_io_u64_case(val__, vaddr_iomem__)		\
> +	default: memcpy_fromio(&(val__), vaddr_iomem__, sizeof(val__)))
> +
>   /**
>    * iosys_map_rd - Read a C-type value from the iosys_map
>    *
> @@ -346,10 +360,14 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
>    * Returns:
>    * The value read from the mapping.
>    */
> -#define iosys_map_rd(map__, offset__, type__) ({			\
> -	type__ val;							\
> -	iosys_map_memcpy_from(&val, map__, offset__, sizeof(val));	\
> -	val;								\
> +#define iosys_map_rd(map__, offset__, type__) ({				\
> +	type__ val;								\
> +	if ((map__)->is_iomem) {						\
> +		__iosys_map_rd_io(val, (map__)->vaddr_iomem + offset__, type__);\
> +	} else {								\
> +		memcpy(&val, (map__)->vaddr + offset__, sizeof(val));		\
> +	}									\
> +	val;									\
>   })
>   
>   /**


WARNING: multiple messages have this Message-ID (diff)
From: "Christian König" <christian.koenig@amd.com>
To: Lucas De Marchi <lucas.demarchi@intel.com>,
	intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Cc: daniel.vetter@ffwll.ch, tzimmermann@suse.de,
	Chris Wilson <chris@chris-wilson.co.uk>
Subject: Re: [PATCH] iosys-map: Add word-sized reads
Date: Fri, 10 Jun 2022 09:20:18 +0200	[thread overview]
Message-ID: <4faac373-1a30-e1a0-0c33-e10e8fc6184c@amd.com> (raw)
In-Reply-To: <20220609232020.2292649-1-lucas.demarchi@intel.com>

Am 10.06.22 um 01:20 schrieb Lucas De Marchi:
> Instead of always falling back to memcpy_fromio() for any size, prefer
> using read{b,w,l}(). When reading struct members it's common to read
> individual integer variables individually. Going through memcpy_fromio()
> for each of them poses a high penalty.
>
> Employ a similar trick as __seqprop() by using _Generic() to generate
> only the specific call based on a type-compatible variable.
>
> For a pariticular i915 workload producing GPU context switches,
> __get_engine_usage_record() is particularly hot since the engine usage
> is read from device local memory with dgfx, possibly multiple times
> since it's racy. Test execution time for this test shows a ~12.5%
> improvement with DG2:
>
> Before:
> 	nrepeats = 1000; min = 7.63243e+06; max = 1.01817e+07;
> 	median = 9.52548e+06; var = 526149;
> After:
> 	nrepeats = 1000; min = 7.03402e+06; max = 8.8832e+06;
> 	median = 8.33955e+06; var = 333113;
>
> Other things attempted that didn't prove very useful:
> 1) Change the _Generic() on x86 to just dereference the memory address
> 2) Change __get_engine_usage_record() to do just 1 read per loop,
>     comparing with the previous value read
> 3) Change __get_engine_usage_record() to access the fields directly as it
>     was before the conversion to iosys-map
>
> (3) did gave a small improvement (~3%), but doesn't seem to scale well
> to other similar cases in the driver.
>
> Additional test by Chris Wilson using gem_create from igt with some
> changes to track object creation time. This happens to accidentaly
> stress this code path:
>
> 	Pre iosys_map conversion of engine busyness:
> 	lmem0: Creating    262144 4KiB objects took 59274.2ms
>
> 	Unpatched:
> 	lmem0: Creating    262144 4KiB objects took 108830.2ms
>
> 	With readl (this patch):
> 	lmem0: Creating    262144 4KiB objects took 61348.6ms
>
> 	s/readl/READ_ONCE/
> 	lmem0: Creating    262144 4KiB objects took 61333.2ms
>
> So we do take a little bit more time than before the conversion, but
> that is due to other factors: bringing the READ_ONCE back would be as
> good as just doing this conversion.
>
> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
> ---
>
> If this is acceptable we should probably add the write counterpart, too.
> Sending here only the read for now since this fixes the issue we are
> seeing and to gather feedback.

As far as I can see looks sane to me, but the kernel test robot tears 
the patch apart.

Probably just a typo somewhere in the 32bit handling.

Apart from that looks good to me.

Regards,
Christian.

>
>   include/linux/iosys-map.h | 26 ++++++++++++++++++++++----
>   1 file changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/iosys-map.h b/include/linux/iosys-map.h
> index e69a002d5aa4..4ae3e459419e 100644
> --- a/include/linux/iosys-map.h
> +++ b/include/linux/iosys-map.h
> @@ -333,6 +333,20 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
>   		memset(dst->vaddr + offset, value, len);
>   }
>   
> +#ifdef CONFIG_64BIT
> +#define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_)			\
> +	u64: val_ = readq(vaddr_iomem_),
> +#else
> +#define __iosys_map_u64_case(val_, vaddr_iomem_)
> +#endif
> +
> +#define __iosys_map_rd_io(val__, vaddr_iomem__, type__) _Generic(val__,	\
> +	u8: val__ = readb(vaddr_iomem__),				\
> +	u16: val__ = readw(vaddr_iomem__),				\
> +	u32: val__ = readl(vaddr_iomem__),				\
> +	__iosys_map_rd_io_u64_case(val__, vaddr_iomem__)		\
> +	default: memcpy_fromio(&(val__), vaddr_iomem__, sizeof(val__)))
> +
>   /**
>    * iosys_map_rd - Read a C-type value from the iosys_map
>    *
> @@ -346,10 +360,14 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
>    * Returns:
>    * The value read from the mapping.
>    */
> -#define iosys_map_rd(map__, offset__, type__) ({			\
> -	type__ val;							\
> -	iosys_map_memcpy_from(&val, map__, offset__, sizeof(val));	\
> -	val;								\
> +#define iosys_map_rd(map__, offset__, type__) ({				\
> +	type__ val;								\
> +	if ((map__)->is_iomem) {						\
> +		__iosys_map_rd_io(val, (map__)->vaddr_iomem + offset__, type__);\
> +	} else {								\
> +		memcpy(&val, (map__)->vaddr + offset__, sizeof(val));		\
> +	}									\
> +	val;									\
>   })
>   
>   /**


  parent reply	other threads:[~2022-06-10  7:20 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-09 23:20 [Intel-gfx] [PATCH] iosys-map: Add word-sized reads Lucas De Marchi
2022-06-09 23:20 ` Lucas De Marchi
2022-06-09 23:36 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
2022-06-09 23:36 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2022-06-10  2:55 ` [Intel-gfx] [PATCH] " kernel test robot
2022-06-10  2:55   ` kernel test robot
2022-06-10  2:55   ` kernel test robot
2022-06-10  7:05 ` [Intel-gfx] ✓ Fi.CI.BAT: success for " Patchwork
2022-06-10  7:20 ` Christian König [this message]
2022-06-10  7:20   ` [PATCH] " Christian König
2022-06-10 17:35   ` [Intel-gfx] " Lucas De Marchi
2022-06-10 17:35     ` Lucas De Marchi
2022-06-10 14:17 ` [Intel-gfx] " kernel test robot
2022-06-10 14:17   ` kernel test robot
2022-06-11  7:02 ` [Intel-gfx] ✓ Fi.CI.IGT: success for " Patchwork
2022-06-13 10:58 ` [Intel-gfx] [PATCH] " Thomas Zimmermann
2022-06-13 10:58   ` Thomas Zimmermann
2022-06-13 18:37   ` [Intel-gfx] " Lucas De Marchi
2022-06-13 18:37     ` Lucas De Marchi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4faac373-1a30-e1a0-0c33-e10e8fc6184c@amd.com \
    --to=christian.koenig@amd.com \
    --cc=chris@chris-wilson.co.uk \
    --cc=daniel.vetter@ffwll.ch \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=lucas.demarchi@intel.com \
    --cc=tzimmermann@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.