Re: [PATCH v2] radeon: Deinline indirect register accessor functions

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Denys Vlasenko <dvlasenk@redhat.com>
To: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
	Denys Vlasenko <vda.linux@googlemail.com>,
	"Koenig, Christian" <Christian.Koenig@amd.com>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2] radeon: Deinline indirect register accessor functions
Date: Wed, 20 May 2015 12:47:43 +0200	[thread overview]
Message-ID: <555C664F.2030002@redhat.com> (raw)
In-Reply-To: <A3397C8B8B789E45844E7EC5DEAD89D06364C87C@satlexdag05.amd.com>

[-- Attachment #1: Type: text/plain, Size: 5919 bytes --]

On 05/19/2015 01:06 AM, Deucher, Alexander wrote:
>> -----Original Message-----
>> From: Denys Vlasenko [mailto:vda.linux@googlemail.com]
>> Sent: Monday, May 18, 2015 6:50 PM
>> To: Koenig, Christian
>> Cc: Denys Vlasenko; Deucher, Alexander; Linux Kernel Mailing List
>> Subject: Re: [PATCH v2] radeon: Deinline indirect register accessor functions
>>
>> On Mon, May 18, 2015 at 9:09 PM, Christian König
>> <christian.koenig@amd.com> wrote:
>>>> r600_uvd_ctx_rreg: 111 bytes, 4 callsites
>>>> r600_uvd_ctx_wreg: 113 bytes, 5 callsites
>>>> eg_pif_phy0_rreg: 106 bytes, 13 callsites
>>>> eg_pif_phy0_wreg: 108 bytes, 13 callsites
>>>> eg_pif_phy1_rreg: 107 bytes, 13 callsites
>>>> eg_pif_phy1_wreg: 108 bytes, 13 callsites
>>>> rv370_pcie_rreg: 111 bytes, 21 callsites
>>>> rv370_pcie_wreg: 113 bytes, 24 callsites
>>>> r600_rcu_rreg: 111 bytes, 16 callsites
>>>> r600_rcu_wreg: 113 bytes, 25 callsites
>>>> cik_didt_rreg: 106 bytes, 10 callsites
>>>> cik_didt_wreg: 107 bytes, 10 callsites
>>>> tn_smc_rreg: 106 bytes, 126 callsites
>>>> tn_smc_wreg: 107 bytes, 116 callsites
>>>> eg_cg_rreg: 107 bytes, 20 callsites
>>>> eg_cg_wreg: 108 bytes, 52 callsites
>>
>>> Sorry haven't noticed that before:
>>>
>>> radeon_device.c is most likely not the right place for the non-inlined
>>> functions. Please move them into to the appropriate files for each
>>> generation.
>>
>> Will do (probably tomorrow, not today).
> 
> Is this whole exercise really worthwhile?
> This will be the 3rd or 4th time these have been inlined/uninlined.

When code grows by 65000 bytes, there ought to be a good reason to inline.
I don't see it.

Let's take a look what these functions actually do. cik_didt_wreg is():

       spin_lock_irqsave(&rdev->didt_idx_lock, flags);
       WREG32(CIK_DIDT_IND_INDEX, (reg));
       WREG32(CIK_DIDT_IND_DATA, (v));
       spin_unlock_irqrestore(&rdev->didt_idx_lock, flags);

this compiles to (on defconfig + radeon enabled):

       55                      push   %rbp
       48 89 e5                mov    %rsp,%rbp
       48 83 ec 20             sub    $0x20,%rsp
       4c 89 65 e8             mov    %r12,-0x18(%rbp)
       4c 8d a7 cc 01 00 00    lea    0x1cc(%rdi),%r12
       48 89 5d e0             mov    %rbx,-0x20(%rbp)
       48 89 fb                mov    %rdi,%rbx
       4c 89 6d f0             mov    %r13,-0x10(%rbp)
       4c 89 75 f8             mov    %r14,-0x8(%rbp)
       4c 89 e7                mov    %r12,%rdi
       41 89 d6                mov    %edx,%r14d
       41 89 f5                mov    %esi,%r13d
       e8 20 6b 4d 00          callq  <_raw_spin_lock_irqsave> //spin_lock_irqsave
       48 8b 93 d0 01 00 00    mov    0x1d0(%rbx),%rdx
       44 89 aa 00 ca 00 00    mov    %r13d,0xca00(%rdx)       //WREG32
       48 8b 93 d0 01 00 00    mov    0x1d0(%rbx),%rdx
       44 89 b2 04 ca 00 00    mov    %r14d,0xca04(%rdx)       //WREG32
       4c 89 e7                mov    %r12,%rdi
       48 89 c6                mov    %rax,%rsi
       e8 b9 69 4d 00          callq  <_raw_spin_unlock_irqrestore> //spin_unlock_irqrestore
       48 8b 5d e0             mov    -0x20(%rbp),%rbx
       4c 8b 65 e8             mov    -0x18(%rbp),%r12
       4c 8b 6d f0             mov    -0x10(%rbp),%r13
       4c 8b 75 f8             mov    -0x8(%rbp),%r14
       c9                      leaveq
       c3                      retq

<_raw_spin_lock_irqsave>:
       55                      push   %rbp
       48 89 e5                mov    %rsp,%rbp
       9c                      pushfq
       58                      pop    %rax
       fa                      cli
       ba 00 01 00 00          mov    $0x100,%edx
       f0 66 0f c1 17          lock xadd %dx,(%rdi)  // expensive
       0f b6 ce                movzbl %dh,%ecx
       38 d1                   cmp    %dl,%cl
       75 04                   jne    <_raw_spin_lock_irqsave+0x1c>
       5d                      pop    %rbp
       c3                      retq
       f3 90                   pause
       0f b6 17                movzbl (%rdi),%edx
       38 ca                   cmp    %cl,%dl
       75 f7                   jne    <_raw_spin_lock_irqsave+0x1a>
       5d                      pop    %rbp
       c3                      retq

<_raw_spin_unlock_irqrestore>:
       55                      push   %rbp
       48 89 e5                mov    %rsp,%rbp
       80 07 01                addb   $0x1,(%rdi)
       56                      push   %rsi
       9d                      popfq                  //expensive
       5d                      pop    %rbp
       c3                      retq

Now, using attached test program, I measure how long
call+ret pair takes:

# ./timing_test64 callret
400000000 loops in 0.71467s = 1.79 nsec/loop for callret

Unlocked read-modify-write memory operation:

# ./timing_test64 or
400000000 loops in 0.86119s = 2.15 nsec/loop for or

Locked read-modify-write memory operations:

# ./timing_test64 lock_or
100000000 loops in 0.68902s = 6.89 nsec/loop for lock_or
# ./timing_test64 lock_xadd
100000000 loops in 0.68582s = 6.86 nsec/loop for lock_xadd

And POPF:

# ./timing_test64 popf
100000000 loops in 0.68861s = 6.89 nsec/loop for popf

This is on Sandy Bridge CPU with cycle time of about 0.30 ns:

# ./timing_test64 nothing
2000000000 loops in 0.59716s = 0.30 nsec/loop for nothing


So, what do we see?

call+ret takes 5 cycles. This is cheaper that one unlocked
RMW memory operation which is 7 cycles.

Locked RMW is 21 cycles in the ideal case (this is what
spin_lock_irqsave does). POPF is also 21 cycles
(spin_unlock_irqrestore does this). Add to this two mmio
accesses (easily 50s of cycles) and all other necessary operations
visible in the assembly code - 5 memory stores,
7 memory loads, and two call+ret pairs.

I expect overhead of call+ret added by deinlining to be in 1-4%,
if you run a microbenchmark which does nothing but one of these ops.
-- 
vda

[-- Attachment #2: timing_test.c --]
[-- Type: text/x-csrc, Size: 8267 bytes --]

// To be unaffected by random cacheline placement, use generous "align":
//
// i686-gcc -O2 -Wall -falign-loops=32 -falign-jumps=32 -falign-labels=32 -static
// x86_64-gcc -O2 -Wall -falign-loops=32 -falign-jumps=32 -falign-labels=32 -static

#include <inttypes.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#include <sys/syscall.h>
#include <stdio.h>

#if !defined(__i386__)
#define get_sysenter_addr() 0
#else
#include <elf.h>
long sysenter_addr;
long get_sysenter_addr(char **envp)
{
	Elf32_auxv_t *auxv;
	while (*envp++ != NULL)
		continue;
	for (auxv = (void *)envp; auxv->a_type != AT_NULL; auxv++)
		if( auxv->a_type == AT_SYSINFO)
			return (sysenter_addr = auxv->a_un.a_val);
	fprintf(stderr, "AT_SYSINFO not supplied, can't test\n");
	exit(0); /* this is not a failure */
}

void sysenter_getpid(void)
{
	asm volatile(
	"\n"   "	mov	$20,%eax" // GETPID
	"\n"   "	call	*sysenter_addr"
	);
}
#endif

#if defined(__i386__)
#define L_or_Q "l"
#define E_or_R "e"
#else
#define L_or_Q "q"
#define E_or_R "r"
#endif

static int memvar;

asm (
"\n"   "	.text"
"\n"   "ret__:	ret"
);

int main(int argc, char **argv, char **envp)
{
	struct timespec start, end;
	unsigned long long duration;
	size_t loops, i;
	const char *mode;

	if (argc < 2) {
		printf("Usage: timing_test [MILLIONS_OF_ITERATIONS] MODE\n");
		return 1;
	}
	mode = argv[2];
	if (!mode) {
		mode = argv[1];
		loops = 10*1000;
	} else {
		loops = (size_t)atol(argv[1]) * 1000000;
	}

 again:
	if (!strcmp(mode, "nothing")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile ("# nothing");
		}
	} else if (!strcmp(mode, "nop")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile ("nop");
		}
	} else if (!strcmp(mode, "rdtsc")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			unsigned int a, d;
			asm volatile ("rdtsc" : "=a" (a), "=d" (d));
		}
	} else if (!strcmp(mode, "lfence_rdtsc")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			unsigned int a, d;
			asm volatile ("lfence;rdtsc" : "=a" (a), "=d" (d));
		}
	} else if (!strcmp(mode, "lfence_rdtsc_lfence")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			unsigned int a, d;
			asm volatile ("");
			asm volatile ("lfence;rdtsc;lfence" : "=a" (a), "=d" (d));
		}
	} else if (!strcmp(mode, "mfence_rdtsc_mfence")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			unsigned int a, d;
			asm volatile ("mfence;rdtsc;mfence" : "=a" (a), "=d" (d));
		}
	} else if (!strcmp(mode, "rdtscp")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			unsigned int a, c, d;
			asm volatile ("rdtscp" : "=a" (a), "=c" (c), "=d" (d));
		}
	} else if (!strcmp(mode, "gettimeofday")) {
		struct timeval tv;
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--)
			gettimeofday(&tv, 0);
	} else if (!strcmp(mode, "getpid")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--)
			syscall(SYS_getpid);
#if defined(__i386__)
	} else if (!strcmp(mode, "sysenter_getpid")) {
		get_sysenter_addr(envp);
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--)
			sysenter_getpid();
	} else if (!strcmp(mode, "iret")) {
		/* "push cs" is itself a bit expensive, moving it out of loop */
		long saved_cs;
		asm volatile ("mov %%cs,%0" : "=r" (saved_cs));
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile (
			"\n"   "	push	$0"	// flags
			"\n"   "	push	%0"	// cs
			"\n"   "	push	$1f"	// ip
			"\n"   "	iret"
			"\n"   "1:"
			:
			: "r" (saved_cs)
			);
		}
#endif
#if defined(__x86_64__)
	} else if (!strcmp(mode, "iret")) {
		long saved_cs;
		long saved_ss;
		asm volatile ("mov %%cs,%0" : "=r" (saved_cs));
		asm volatile ("mov %%ss,%0" : "=r" (saved_ss));
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile (
			"\n"   "	mov	%%rsp,%%rax"
			"\n"   "	push	%0"	// ss
			"\n"   "	push	%%rax"	// sp
			"\n"   "	push	$0"	// flags
			"\n"   "	push	%1"	// cs
			"\n"   "	push	$1f"	// ip
			"\n"   "	iretq"
			"\n"   "1:"
			:
			: "r" (saved_ss), "r" (saved_cs)
			: "ax"
			);
		}
#endif
	} else if (!strcmp(mode, "lret")) {
		/* "push cs" is itself a bit expensive, moving it out of loop */
		long saved_cs;
		asm volatile ("mov %%cs,%0" : "=r" (saved_cs));
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile (
			"\n"   "	push	%0"
			"\n"   "	push	$1f"
			"\n"   "	lret"L_or_Q
			"\n"   "1:"
			:
			: "r" (saved_cs)
			);
		}
	} else if (!strcmp(mode, "callret")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile ("call ret__");
		}
	} else if (!strcmp(mode, "ret")) {
		/* This is useful to measure delays due to
		 * return stack branch prediction not working
		 * (we aren't using paired call/rets here, as CPU expects).
		 * I observed "callret" test above being 4 times faster than this:
		 */
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile (
			"\n"   "	push	$1f"
			"\n"   "	ret"
			"\n"   "1:"
			);
		}
	} else if (!strcmp(mode, "loadss")) {
		long saved_ss;
		asm volatile ("mov %%ss,%0" : "=r" (saved_ss));
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile ("mov %0,%%ss" : : "r" (saved_ss));
		}
	} else if (!strcmp(mode, "readss")) {
		long saved_ss;
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile ("mov %%ss,%0" : "=r" (saved_ss));
		}
	} else if (!strcmp(mode, "leave")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile (
			"\n"   "	push	%"E_or_R"bp"
			"\n"   "	mov	%"E_or_R"sp,%"E_or_R"bp"
			"\n"   "	leave"
			);
		}
	} else if (!strcmp(mode, "noleave")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile (
			"\n"   "	push	%"E_or_R"bp"
			"\n"   "	mov	%"E_or_R"sp,%"E_or_R"bp"
			"\n"   "	mov	%"E_or_R"bp,%"E_or_R"sp"
			"\n"   "	pop	%"E_or_R"bp"
			);
		}
	} else if (!strcmp(mode, "pushf")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile (
			"\n"   "	pushf"
			"\n"   "	pop	%%"E_or_R"ax"
			:
			:
			: "ax"
			);
		}
	} else if (!strcmp(mode, "popf")) {
		long flags;
		asm volatile (
		"\n"   "	pushf"
		"\n"   "	pop	%0"
		: "=r" (flags)
		);
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile (
			"\n"   "	push	%0"
			"\n"   "	popf"
			:
			: "r" (flags)
			: "ax"
			);
		}
	} else if (!strcmp(mode, "or")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile (
			"\n"   "	orl $1,%0"
			:
			: "m" (memvar)
			);
		}
	} else if (!strcmp(mode, "lock_or")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile (
			"\n"   "	lock orl $1,%0"
			:
			: "m" (memvar)
			);
		}
	} else if (!strcmp(mode, "lock_xadd")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile (
			"\n"   "	lock xaddl %0,%1"
			:
			: "r" (0), "m" (memvar)
			);
		}
	} else if (!strcmp(mode, "rdpmc")) {
		// Unlikely to work.
		unsigned int eax, edx;
		unsigned int ecx = 0;
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--)
			asm volatile ("rdpmc" : "=a" (eax), "=d" (edx) : "c" (ecx));
	} else {
		printf("Unknown mode %s\n", mode);
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &end);
	duration = (1000*1000*1000ULL * end.tv_sec + end.tv_nsec)
	         - (1000*1000*1000ULL * start.tv_sec + start.tv_nsec);
	printf("%lu loops in %.5fs = %.2f nsec/loop for %s\n",
		(unsigned long)loops, (double)duration * 1e-9,
		(double)duration / loops,
		mode
	);
	if (!argv[2]) {
		if (duration < 90*1000*1000) {
			loops *= 10;
			goto again;
		}
		if (duration < 490*1000*1000) {
			loops *= 2;
			goto again;
		}
	}
	return 0;
}

     prev parent reply	other threads:[~2015-05-20 10:47 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-05-18 19:02 [PATCH v2] radeon: Deinline indirect register accessor functions Denys Vlasenko
2015-05-18 19:09 ` Christian König
2015-05-18 22:50   ` Denys Vlasenko
2015-05-18 23:06     ` Deucher, Alexander
2015-05-20 10:47       ` Denys Vlasenko [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=555C664F.2030002@redhat.com \
    --to=dvlasenk@redhat.com \
    --cc=Alexander.Deucher@amd.com \
    --cc=Christian.Koenig@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=vda.linux@googlemail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.