public inbox for netdev@vger.kernel.org
* Re: [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line
       [not found] <1353900555-5966-1-git-send-email-ling.ma.program@gmail.com>
@ 2012-11-26  6:44 ` Eric Dumazet
  2012-11-26 20:40   ` Ben Hutchings
  2012-11-27 13:48   ` Ling Ma
  2013-02-02 15:03 ` [PATCH v2 net-next] inet: Get critical " Eric Dumazet
  1 sibling, 2 replies; 13+ messages in thread
From: Eric Dumazet @ 2012-11-26  6:44 UTC (permalink / raw)
  To: ling.ma.program; +Cc: linux-kernel, netdev

On Mon, 2012-11-26 at 11:29 +0800, ling.ma.program@gmail.com wrote:
> From: Ma Ling <ling.ma.program@gmail.com>
> 
> In order to reduce memory latency when a last-level cache miss occurs,
> modern CPUs, e.g. x86 and ARM, introduced Critical Word First (CWF) or
> Early Restart (ER) to get data ASAP. With CWF, if the critical word is
> the first member of the cache line, memory feeds the CPU the critical
> word and then fills the rest of the cache line in order; otherwise,
> after the critical word it costs more cycles to fill the remaining
> cache line. With Early Restart, the CPU restarts as soon as the
> critical word of the cache line arrives.
> 
> The hash value is the critical word, so in this patch we place it as
> the first member of the cache line (the sock address is cache-line
> aligned), which is good for Early Restart platforms as well.
> 
> Thanks
> Ling

networking patches should be sent to netdev.

(I understand this patch is more a generic one, but at least CC netdev)

You give no performance numbers for this change...

I never heard of this CWF/ER, where are the official Intel documents
about this, and what models really benefit from it ?

Also, why not move skc_net as well?

BTW, skc_daddr & skc_rcv_saddr are 'critical' as well, we use them in
INET_MATCH()

It seems we have a 32bit hole on 64bit arches, so we probably should
move inet_dport/inet_num in it. It could well remove a full cache line
miss (I'll send a patch for this after tests)

Thanks
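
The kind of 32-bit hole described above can be illustrated with a
minimal sketch in C. The struct demo_hdr and its field names are
hypothetical and do not reproduce the real struct sock_common; the
sketch only assumes an LP64 target, where a pointer is 8-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only; not the kernel structure. */
struct demo_hdr {
	uint64_t addrpair;     /* two 32-bit addresses, offset 0 */
	uint32_t hash;         /* offset 8 */
	uint16_t family;       /* offset 12 */
	uint8_t  state;        /* offset 14 */
	uint8_t  reuse;        /* offset 15 */
	int32_t  bound_dev_if; /* offset 16, ends at offset 20 */
	void    *next;         /* pointer: 8-byte aligned on LP64 */
};

/* Number of padding bytes the compiler inserts before 'next'. */
size_t demo_hole_bytes(void)
{
	size_t end_of_prev = offsetof(struct demo_hdr, bound_dev_if) +
			     sizeof(int32_t);

	assert(offsetof(struct demo_hdr, next) >= end_of_prev);
	return offsetof(struct demo_hdr, next) - end_of_prev;
}
```

On an LP64 build demo_hole_bytes() reports 4 bytes of padding, which is
room for two 16-bit port fields without growing the structure.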

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line
  2012-11-26  6:44 ` [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line Eric Dumazet
@ 2012-11-26 20:40   ` Ben Hutchings
  2012-11-27 13:48   ` Ling Ma
  1 sibling, 0 replies; 13+ messages in thread
From: Ben Hutchings @ 2012-11-26 20:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: ling.ma.program, linux-kernel, netdev

On Sun, 2012-11-25 at 22:44 -0800, Eric Dumazet wrote:
> On Mon, 2012-11-26 at 11:29 +0800, ling.ma.program@gmail.com wrote:
> > From: Ma Ling <ling.ma.program@gmail.com>
> > 
> > In order to reduce memory latency when a last-level cache miss occurs,
> > modern CPUs, e.g. x86 and ARM, introduced Critical Word First (CWF) or
> > Early Restart (ER) to get data ASAP. With CWF, if the critical word is
> > the first member of the cache line, memory feeds the CPU the critical
> > word and then fills the rest of the cache line in order; otherwise,
> > after the critical word it costs more cycles to fill the remaining
> > cache line. With Early Restart, the CPU restarts as soon as the
> > critical word of the cache line arrives.
> > 
> > The hash value is the critical word, so in this patch we place it as
> > the first member of the cache line (the sock address is cache-line
> > aligned), which is good for Early Restart platforms as well.
> > 
> > Thanks
> > Ling
> 
> networking patches should be sent to netdev.
> 
> (I understand this patch is more a generic one, but at least CC netdev)
> 
> You give no performance numbers for this change...
> 
> I never heard of this CWF/ER, where are the official Intel documents
> about this, and what models really benefit from it ?
[...]

CWF is a standard feature of SDRAM.  Ulrich Drepper's series of articles
on memory covers this in part 2 <http://lwn.net/Articles/252125/>
section 3.5.2.  As for whether it's slower to start fetching from the
middle, that may depend on the memory controller and memory type that
are used.  Drepper's benchmark showed only a small penalty (<1%) for
fetching from the middle, though he didn't say anything particular about
the hardware configuration.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* Re: [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line
  2012-11-26  6:44 ` [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line Eric Dumazet
  2012-11-26 20:40   ` Ben Hutchings
@ 2012-11-27 13:48   ` Ling Ma
  2012-11-27 13:58     ` Eric Dumazet
  1 sibling, 1 reply; 13+ messages in thread
From: Ling Ma @ 2012-11-27 13:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netdev

> networking patches should be sent to netdev.
>
> (I understand this patch is more a generic one, but at least CC netdev)
Ling: OK, this is my first inet patch; I will send it to netdev later.

> You give no performance numbers for this change...
Ling: once I get a machine, I will send out test results.

> I never heard of this CWF/ER, where are the official Intel documents
> about this, and what models really benefit from it ?
Ling:
Arm implemented it.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388f/Caccifbd.html
AMD also used it.
http://classes.soe.ucsc.edu/cmpe202/Fall04/papers/opteron.pdf

> Also, why not move skc_net as well?
>
> BTW, skc_daddr & skc_rcv_saddr are 'critical' as well, we use them in
> INET_MATCH()
Ling: in the lookup routine, the hash value is the most important key:
if it matches, the other values are most likely to match as well. CWF
is limited by memory bandwidth (usually 64 bits), so we only move the
hash value to be the critical first word.

Thanks
Ling


* Re: [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line
  2012-11-27 13:48   ` Ling Ma
@ 2012-11-27 13:58     ` Eric Dumazet
  2012-12-02 13:25       ` Ling Ma
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2012-11-27 13:58 UTC (permalink / raw)
  To: Ling Ma; +Cc: linux-kernel, netdev

On Tue, 2012-11-27 at 21:48 +0800, Ling Ma wrote:

> Ling: in the lookup routine, the hash value is the most important key:
> if it matches, the other values are most likely to match as well. CWF
> is limited by memory bandwidth (usually 64 bits), so we only move the
> hash value to be the critical first word.

In practice, we have at most one TCP socket per hash slot.
99.9999 % of lookups need all fields to complete.

Your patch introduces a misalignment error. I am not sure all 64 bit
arches are able to cope with that gracefully.

It seems all CWF docs I could find are very old stuff, mostly academic,
without good performance data.

I was asking for up-to-date statements from Intel/AMD/... about current
CPUs and current memory, because optimizing for 10-year-old CPUs is
not worth the pain.

I am assuming CPUs implement CWF/ER automatically, and that
only prefetches could have a slight disadvantage if the needed word is
not the first word in the cache line. It's not clear why the prefetch()
hint could not also use CWF. It seems that could also be done by the
hardware.

So before random patches in linux kernel adding their possible bugs, we
need a good study.

Thanks
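
The point that a lookup needs all fields, not just the hash, can be
sketched as follows. This is a simplified illustration, not the
kernel's INET_MATCH(): struct demo_sock, demo_match() and the field
packing are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified socket key; names are illustrative only. */
struct demo_sock {
	uint32_t hash;
	uint32_t portpair;     /* remote and local port packed together */
	uint64_t addrpair;     /* remote and local IPv4 address packed together */
	int      bound_dev_if; /* 0 means "not bound to an interface" */
};

/*
 * Returns true only if every field matches. With about one socket per
 * hash slot the hash almost always matches, so a successful lookup
 * ends up reading all of the fields below.
 */
bool demo_match(const struct demo_sock *sk, uint32_t hash,
		uint64_t addrpair, uint32_t portpair, int dif)
{
	assert(sk != NULL);

	if (sk->hash != hash)
		return false;	/* cheap early exit on hash mismatch */
	if (sk->addrpair != addrpair)
		return false;
	if (sk->portpair != portpair)
		return false;
	/* An unbound socket accepts any incoming interface. */
	return sk->bound_dev_if == 0 || sk->bound_dev_if == dif;
}
```

In this sketch the hash comparison only rejects quickly; accepting a
socket still requires the address pair, port pair, and interface.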


* Re: [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line
  2012-11-27 13:58     ` Eric Dumazet
@ 2012-12-02 13:25       ` Ling Ma
  2012-12-02 17:20         ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: Ling Ma @ 2012-12-02 13:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 410 bytes --]

Hi Eric,

Attached is the benchmark test-cwf.c (cc -o test-cwf test-cwf.c). The
result shows that when a last-level cache (LLC) miss occurs and the CPU
fetches data from memory, having the critical word as the first 64-bit
member of the cache line performs better (158290336 cycles) than other
positions (offset 0x10, 164100732 cycles); performance improves by 3.6%
in this case.
cpu-info is attached as well.

Thanks
Ling
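
The quoted improvement can be checked with a few lines of arithmetic.
The helper names below are illustrative only; the two cycle counts are
the ones reported above. Depending on which run is taken as the
baseline, the difference comes out at roughly 3.5% or 3.7%, which is
consistent with the 3.6% figure.

```c
#include <assert.h>
#include <stdint.h>

/* Saving, in percent, of the faster run relative to the slower run. */
double saving_percent(uint64_t fast, uint64_t slow)
{
	assert(slow >= fast && slow > 0);
	return 100.0 * (double)(slow - fast) / (double)slow;
}

/* Extra cost, in percent, of the slower run relative to the faster run. */
double overhead_percent(uint64_t fast, uint64_t slow)
{
	assert(slow >= fast && fast > 0);
	return 100.0 * (double)(slow - fast) / (double)fast;
}
```

For example, saving_percent(158290336, 164100732) uses the offset 0x10
run as the baseline, while overhead_percent() uses the offset 0 run.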

[-- Attachment #2: test-cwf.c --]
[-- Type: text/x-csrc, Size: 1986 bytes --]

#include<stdio.h>
#include<string.h>
#include<stdlib.h>
#include<unistd.h>
#define MAX_BUF_NUM (1 << 20)
#define MAX_BUF_SIZE (1 << 8)
#define ACCESS_OFFSET (0x10)

# define HP_TIMING_NOW(Var) \
 ({ unsigned long long _hi, _lo; \
  asm volatile ("rdtsc" : "=a" (_lo), "=d" (_hi)); \
  (Var) = _hi << 32 | _lo; })

#define repeat_times  (64)

static void init_buf(char **buf)
{
	int i = 0;
	char *start;
	char *end;
	int pagesize = getpagesize();
	*buf = malloc(MAX_BUF_SIZE * MAX_BUF_NUM + pagesize);
	if(*buf == NULL) {
		printf("\nfailed to malloc space!\n");
		exit(1);
	} else  {
		*buf = *buf + pagesize;
		*buf = (char *)(((unsigned long)*buf) & (-pagesize));
	}
	
	start = *buf;
	end = *buf + (MAX_BUF_SIZE * MAX_BUF_NUM) - MAX_BUF_SIZE;

	while(1) {
		*((unsigned char **)start) = end;
		*((unsigned char **)(start + ACCESS_OFFSET)) = (end + ACCESS_OFFSET);
		start = start + MAX_BUF_SIZE;
		if(start == end)
			break;
		*((unsigned char **)end) = start;
		*((unsigned char **)(end + ACCESS_OFFSET)) = start + ACCESS_OFFSET;
		end = end - MAX_BUF_SIZE;
	}

}

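/*
 * Pointer-chasing loop: performs num dependent loads starting at access.
 * Relies on the x86-64 System V calling convention (access in %rdi, num
 * in %rsi) and on an unoptimized build; %rax accumulates the loaded values.
 */
unsigned long lookingup_memmory(char *access, int num)
{
	__asm__("sub $1, %rsi");
	__asm__("xor %rax, %rax");
	__asm__("1:");
	__asm__("mov (%rdi), %r8");
	__asm__("add %r8, %rax");
	__asm__("mov %r8, %rdi");
	__asm__("sub $1, %rsi");
	__asm__("jae 1b");
}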

static unsigned long test_lookup_time(char *buf)
{
	unsigned long i, start, end, best_time = ~0;

	for(i = 0; i < repeat_times; i++) {
		HP_TIMING_NOW(start);
		lookingup_memmory(buf, MAX_BUF_NUM);
		HP_TIMING_NOW(end);
		if(best_time > (end - start))
			best_time = (end - start);
	}

	return best_time;

}
int main(void)
{
	char *buf1 = NULL;
	char *buf2 = NULL;
	unsigned long aligned_time, unaligned_time;
	

	init_buf(&buf1);
	init_buf(&buf2);
	
	aligned_time = test_lookup_time(buf1);
	unaligned_time = test_lookup_time(buf2 + ACCESS_OFFSET);

	printf("looking-up aligned time %lu, looking-up unaligned time %lu\n", aligned_time, unaligned_time);
}
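
The pointer chase performed by the inline assembly above can also be
sketched in plain C. This is an illustrative alternative rather than
the code that produced the reported numbers; chase_pointers() is a
hypothetical name, and a compiler may schedule it differently from the
hand-written loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Follow a chain of pointers num times, starting at access, and return
 * the sum of the pointer values loaded. Each load address depends on
 * the previous load, so the loads cannot overlap.
 */
uintptr_t chase_pointers(const void *access, size_t num)
{
	uintptr_t sum = 0;

	assert(access != NULL || num == 0);
	while (num--) {
		/* volatile keeps the compiler from eliding the load */
		const void *next = *(const void * const volatile *)access;

		sum += (uintptr_t)next;
		access = next;
	}
	return sum;
}
```

Because every iteration waits for the previous load, the loop time is
dominated by memory latency rather than bandwidth, which is the
quantity the benchmark tries to compare between the two offsets.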





[-- Attachment #3: cpu-info --]
[-- Type: application/octet-stream, Size: 7050 bytes --]

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.005
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.01
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.005
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 1
cpu cores	: 4
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.01
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.005
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 9
cpu cores	: 4
apicid		: 18
initial apicid	: 18
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.01
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.005
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 10
cpu cores	: 4
apicid		: 20
initial apicid	: 20
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.01
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 4
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.005
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.01
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 5
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.005
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 1
cpu cores	: 4
apicid		: 3
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.01
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 6
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.005
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 9
cpu cores	: 4
apicid		: 19
initial apicid	: 19
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.01
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 7
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.005
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 10
cpu cores	: 4
apicid		: 21
initial apicid	: 21
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.01
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:



* Re: [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line
  2012-12-02 13:25       ` Ling Ma
@ 2012-12-02 17:20         ` Eric Dumazet
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2012-12-02 17:20 UTC (permalink / raw)
  To: Ling Ma; +Cc: linux-kernel, netdev

On Sun, 2012-12-02 at 21:25 +0800, Ling Ma wrote:
> Hi Eric,
> 
> Attached is the benchmark test-cwf.c (cc -o test-cwf test-cwf.c). The
> result shows that when a last-level cache (LLC) miss occurs and the CPU
> fetches data from memory, having the critical word as the first 64-bit
> member of the cache line performs better (158290336 cycles) than other
> positions (offset 0x10, 164100732 cycles); performance improves by 3.6%
> in this case.
> cpu-info is attached as well.
> 
> Thanks
> Ling

Thanks Ling.

Note that I was more interested in the case where we read more fields
per cache line, as we do in TCP lookups (skc_daddr, skc_rcv_saddr,
skc_bound_dev_if, skc_net).

I made changes to net-next to prepare your patch. 

You'll have to move both skc_rxhash & skc_portpair before the
skc_addrpair.

I have to fix an endianness sparse problem, I'll send a patch for this
in a separate thread right now.
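
The ordering requested above (hash and port pair first, then the
address pair) can be sketched with a small stand-in structure. The
struct demo_sock_common below is hypothetical and only mirrors the
sizes of the leading fields; it is not the kernel definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the leading fields; not the kernel definition. */
struct demo_sock_common {
	uint32_t hash;     /* stands in for the hash field */
	uint32_t portpair; /* stands in for skc_portpair (two 16-bit ports) */
	uint64_t addrpair; /* stands in for skc_addrpair (two 32-bit addrs) */
};

/* The hash is the first word of the structure... */
static_assert(offsetof(struct demo_sock_common, hash) == 0,
	      "hash must come first");
/* ...and the address pair still lands on an 8-byte boundary, with no hole. */
static_assert(offsetof(struct demo_sock_common, addrpair) == 8,
	      "addrpair must stay 8-byte aligned");

size_t demo_addrpair_offset(void)
{
	return offsetof(struct demo_sock_common, addrpair);
}
```

Placing two 32-bit members ahead of the 64-bit address pair keeps the
pair naturally aligned, so a single 64-bit comparison remains possible
on 64-bit architectures.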


* [PATCH v2 net-next] inet: Get critical word in first 64bit of cache line
       [not found] <1353900555-5966-1-git-send-email-ling.ma.program@gmail.com>
  2012-11-26  6:44 ` [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line Eric Dumazet
@ 2013-02-02 15:03 ` Eric Dumazet
  2013-02-03 21:00   ` saeed bishara
  2013-02-03 21:08   ` David Miller
  1 sibling, 2 replies; 13+ messages in thread
From: Eric Dumazet @ 2013-02-02 15:03 UTC (permalink / raw)
  To: ling.ma.program, David Miller; +Cc: netdev, Maciej Żenczykowski

From: Ma Ling <ling.ma.program@gmail.com>

In order to reduce memory latency when a last-level cache miss occurs,
modern CPUs, e.g. x86 and ARM, introduced Critical Word First (CWF) or
Early Restart (ER) to get data ASAP. With CWF, if the critical word is
the first member of the cache line, memory feeds the CPU the critical
word and then fills the rest of the cache line in order; otherwise,
after the critical word it costs more cycles to fill the remaining
cache line. With Early Restart, the CPU restarts as soon as the
critical word of the cache line arrives.

The hash value is the critical word, so in this patch we place it as
the first member of the cache line (the sock address is cache-line
aligned), which is good for Early Restart platforms as well.

[edumazet: respin on net-next after commit ce43b03e8889]

Signed-off-by: Ma Ling <ling.ma.program@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
---
 include/net/sock.h |   26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index a340ab4..efabd9a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -131,12 +131,12 @@ typedef __u64 __bitwise __addrpair;
 
 /**
  *	struct sock_common - minimal network layer representation of sockets
- *	@skc_daddr: Foreign IPv4 addr
- *	@skc_rcv_saddr: Bound local IPv4 addr
  *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_u16hashes: two u16 hash values used by UDP lookup tables
  *	@skc_dport: placeholder for inet_dport/tw_dport
  *	@skc_num: placeholder for inet_num/tw_num
+ *	@skc_daddr: Foreign IPv4 addr
+ *	@skc_rcv_saddr: Bound local IPv4 addr
  *	@skc_family: network address family
  *	@skc_state: Connection state
  *	@skc_reuse: %SO_REUSEADDR setting
@@ -153,18 +153,10 @@ typedef __u64 __bitwise __addrpair;
  *
  *	This is the minimal network layer representation of sockets, the header
  *	for struct sock and struct inet_timewait_sock.
+ *	Order of first fields is critical for __inet_lookup_established() :
+ *	skc_hash, skc_portpair, skc_addrpair
  */
 struct sock_common {
-	/* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
-	 * address on 64bit arches : cf INET_MATCH() and INET_TW_MATCH()
-	 */
-	union {
-		__addrpair	skc_addrpair;
-		struct {
-			__be32	skc_daddr;
-			__be32	skc_rcv_saddr;
-		};
-	};
 	union  {
 		unsigned int	skc_hash;
 		__u16		skc_u16hashes[2];
@@ -178,6 +170,16 @@ struct sock_common {
 		};
 	};
 
+	/* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
+	 * address on 64bit arches : cf INET_MATCH() and INET_TW_MATCH()
+	 */
+	union {
+		__addrpair	skc_addrpair;
+		struct {
+			__be32	skc_daddr;
+			__be32	skc_rcv_saddr;
+		};
+	};
 	unsigned short		skc_family;
 	volatile unsigned char	skc_state;
 	unsigned char		skc_reuse:4;


* Re: [PATCH v2 net-next] inet: Get critical word in first 64bit of cache line
  2013-02-02 15:03 ` [PATCH v2 net-next] inet: Get critical " Eric Dumazet
@ 2013-02-03 21:00   ` saeed bishara
  2013-02-03 21:08   ` David Miller
  1 sibling, 0 replies; 13+ messages in thread
From: saeed bishara @ 2013-02-03 21:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: ling.ma.program, David Miller, netdev, Maciej Żenczykowski

On Sat, Feb 2, 2013 at 5:03 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Ma Ling <ling.ma.program@gmail.com>
>
> In order to reduce memory latency when a last-level cache miss occurs,
> modern CPUs, e.g. x86 and ARM, introduced Critical Word First (CWF) or
> Early Restart (ER) to get data ASAP. With CWF, if the critical word is
> the first member of the cache line, memory feeds the CPU the critical
> word and then fills the rest of the cache line in order; otherwise,
> after the critical word it costs more cycles to fill the remaining
> cache line. With Early Restart, the CPU restarts as soon as the
> critical word of the cache line arrives.
>
> The hash value is the critical word, so in this patch we place it as
> the first member of the cache line (the sock address is cache-line
> aligned), which is good for Early Restart platforms as well.
I think the description of this patch doesn't make sense. The purpose
of the CWF hardware feature is to relieve the software from having to
move the critical word to be the first member of the cache line.
That of course depends on how you define CWF, but at least
according to http://lwn.net/Articles/252125/ and
https://github.com/jamie-allen/cpu_caches/blob/master/preso/presentation.md
CWF means the hardware will do the job.
So I think the patch may be useful if (1) the system doesn't have
CWF, or (2) CWF does not totally eliminate the additional latency. This
is of course a guess, as you can see.

saeed


* Re: [PATCH v2 net-next] inet: Get critical word in first 64bit of cache line
  2013-02-02 15:03 ` [PATCH v2 net-next] inet: Get critical " Eric Dumazet
  2013-02-03 21:00   ` saeed bishara
@ 2013-02-03 21:08   ` David Miller
  2013-02-04  0:18     ` Eric Dumazet
  1 sibling, 1 reply; 13+ messages in thread
From: David Miller @ 2013-02-03 21:08 UTC (permalink / raw)
  To: eric.dumazet; +Cc: ling.ma.program, netdev, maze

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 02 Feb 2013 07:03:55 -0800

> From: Ma Ling <ling.ma.program@gmail.com>
> 
> In order to reduce memory latency when a last-level cache miss occurs,
> modern CPUs, e.g. x86 and ARM, introduced Critical Word First (CWF) or
> Early Restart (ER) to get data ASAP. With CWF, if the critical word is
> the first member of the cache line, memory feeds the CPU the critical
> word and then fills the rest of the cache line in order; otherwise,
> after the critical word it costs more cycles to fill the remaining
> cache line. With Early Restart, the CPU restarts as soon as the
> critical word of the cache line arrives.
> 
> The hash value is the critical word, so in this patch we place it as
> the first member of the cache line (the sock address is cache-line
> aligned), which is good for Early Restart platforms as well.
> 
> [edumazet: respin on net-next after commit ce43b03e8889]
> 
> Signed-off-by: Ma Ling <ling.ma.program@gmail.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

I completely agree with the other response to this patch in that
the description is bogus.

If CWF is implemented in the cpu, it should exactly relieve us from
having to move things around in structures so carefully like this.

Either the patch should be completely dropped (modern cpus don't
need this) or the commit message changed to reflect reality.

It really makes a terrible impression upon me when the patch says
something which in fact is 180 degrees from reality.


* Re: [PATCH v2 net-next] inet: Get critical word in first 64bit of cache line
  2013-02-03 21:08   ` David Miller
@ 2013-02-04  0:18     ` Eric Dumazet
  2013-02-04  0:25       ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2013-02-04  0:18 UTC (permalink / raw)
  To: David Miller; +Cc: ling.ma.program, netdev, maze

On Sun, 2013-02-03 at 16:08 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Sat, 02 Feb 2013 07:03:55 -0800
> 
> > From: Ma Ling <ling.ma.program@gmail.com>
> > 
> > In order to reduce memory latency when a last-level cache miss occurs,
> > modern CPUs, e.g. x86 and ARM, introduced Critical Word First (CWF) or
> > Early Restart (ER) to get data ASAP. With CWF, if the critical word is
> > the first member of the cache line, memory feeds the CPU the critical
> > word and then fills the rest of the cache line in order; otherwise,
> > after the critical word it costs more cycles to fill the remaining
> > cache line. With Early Restart, the CPU restarts as soon as the
> > critical word of the cache line arrives.
> > 
> > The hash value is the critical word, so in this patch we place it as
> > the first member of the cache line (the sock address is cache-line
> > aligned), which is good for Early Restart platforms as well.
> > 
> > [edumazet: respin on net-next after commit ce43b03e8889]
> > 
> > Signed-off-by: Ma Ling <ling.ma.program@gmail.com>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> 
> I completely agree with the other response to this patch in that
> the description is bogus.
> 
> If CWF is implemented in the cpu, it should exactly relieve us from
> having to move things around in structures so carefully like this.
> 
> Either the patch should be completely dropped (modern cpus don't
> need this) or the commit message changed to reflect reality.
> 
> It really makes a terrible impression upon me when the patch says
> something which in fact is 180 degrees from reality.

Hmm. 

Maybe the changelog is misleading, or maybe all the performance gains I
get from this patch are some artifact of old/bad hardware, or
something else.



(Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz)
# ./cwf
looking-up aligned time 108712072, 
looking-up unaligned time 113268256
looking-up aligned time 108677032, 
looking-up unaligned time 113297636


(Intel(R) Xeon(R) CPU           X5679  @ 3.20GHz)
# ./cwf
looking-up aligned time 139193589, 
looking-up unaligned time 144307821
looking-up aligned time 139136787, 
looking-up unaligned time 144277752

My laptop : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
# ./cwf
looking-up aligned time 84869203, 
looking-up unaligned time 86843462
looking-up aligned time 84253003, 
looking-up unaligned time 86227675

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>

#define CACHELINE_SZ 64L

#define BIGBUFFER_SZ (64<<20)

# define HP_TIMING_NOW(Var) \
 ({ unsigned long long _hi, _lo; \
  asm volatile ("rdtsc" : "=a" (_lo), "=d" (_hi)); \
  (Var) = _hi << 32 | _lo; })

#define repeat_times  20

char *bufzap;

static void zap_cache(void)
{
	memset(bufzap, 2, BIGBUFFER_SZ);
	memset(bufzap, 3, BIGBUFFER_SZ);
	memset(bufzap, 4, BIGBUFFER_SZ);
}

static char *init_buf(void)
{
	void *res;

	if (posix_memalign(&res, CACHELINE_SZ, BIGBUFFER_SZ)) {
		fprintf(stderr, "posix_memalign() failed\n");
		exit(1);
	}

	memset(res, 1, BIGBUFFER_SZ);
	return res;
}

unsigned long total;

static unsigned long random_access(void *buffer,
				   unsigned int off1,
				   unsigned int off2,
				   unsigned int off3)
{
	int i;
	unsigned int n;
	unsigned long sum = 0;
	unsigned long *ptr;

	srandom(7777);
	for (i = 0; i < 1000000; i++) {
		n = random() % (BIGBUFFER_SZ/CACHELINE_SZ);
		ptr = buffer + n*CACHELINE_SZ;
		if (ptr[off1])
			sum++;
		if (ptr[off2])
			sum++;
//		if (ptr[off3])
//			sum++;
	}
	total += sum;
	return sum;
}

static unsigned long test_lookup_time(void *buf, 
				unsigned int off1,
				unsigned int off2,
				unsigned int off3)
{
        unsigned long i, start, end, best_time = ~0;

        for (i = 0; i < repeat_times; i++) {
		zap_cache();
                HP_TIMING_NOW(start);
                random_access(buf, off1, off2, off3);
                HP_TIMING_NOW(end);
                if (best_time > (end - start))
                        best_time = (end - start);
        }

        return best_time;

}

int main(int argc, char *argv[])
{
        char *buf;
        unsigned long aligned_time, unaligned_time;

        buf = init_buf();
        bufzap = init_buf();

        aligned_time = test_lookup_time(buf, 0, 2, 4);
        unaligned_time = test_lookup_time(buf, 4, 2, 0);

        printf("looking-up aligned time %lu, \nlooking-up unaligned time %lu\n", aligned_time, unaligned_time);

        aligned_time = test_lookup_time(buf, 0, 2, 4);
        unaligned_time = test_lookup_time(buf, 4, 2, 0);

        printf("looking-up aligned time %lu, \nlooking-up unaligned time %lu\n", aligned_time, unaligned_time);
}
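
To put the cycle counts above in per-lookup terms, the difference
between the two orders can be divided by the number of iterations in
random_access(), which is 1,000,000. The helper below is a sketch with
illustrative names; it assumes the loop overhead (random() and the
index arithmetic) is the same in both runs and therefore cancels out.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_ITERATIONS 1000000ULL /* loop count used by random_access() */

/* Average extra cycles per iteration for the slower access order. */
double extra_cycles_per_iteration(uint64_t aligned, uint64_t unaligned)
{
	assert(unaligned >= aligned);
	return (double)(unaligned - aligned) / (double)DEMO_ITERATIONS;
}
```

With the first pair of figures for each machine this gives roughly 4.6
cycles (X5660), 5.1 cycles (X5679) and 2.0 cycles (i5-2520M) per
iteration.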


* Re: [PATCH v2 net-next] inet: Get critical word in first 64bit of cache line
  2013-02-04  0:18     ` Eric Dumazet
@ 2013-02-04  0:25       ` Eric Dumazet
  2013-02-04  2:53         ` Ling Ma
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2013-02-04  0:25 UTC (permalink / raw)
  To: David Miller; +Cc: ling.ma.program, netdev, maze

On Sun, 2013-02-03 at 16:18 -0800, Eric Dumazet wrote:
> On Sun, 2013-02-03 at 16:08 -0500, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Sat, 02 Feb 2013 07:03:55 -0800
> > 
> > > From: Ma Ling <ling.ma.program@gmail.com>
> > > 
> > > In order to reduce memory latency when a last-level cache miss occurs,
> > > modern CPUs, e.g. x86 and ARM, introduced Critical Word First (CWF) or
> > > Early Restart (ER) to get data ASAP. With CWF, if the critical word is
> > > the first member of the cache line, memory feeds the CPU the critical
> > > word and then fills the rest of the cache line in order; otherwise,
> > > after the critical word it costs more cycles to fill the remaining
> > > cache line. With Early Restart, the CPU restarts as soon as the
> > > critical word of the cache line arrives.
> > > 
> > > The hash value is the critical word, so in this patch we place it as
> > > the first member of the cache line (the sock address is cache-line
> > > aligned), which is good for Early Restart platforms as well.
> > > 
> > > [edumazet: respin on net-next after commit ce43b03e8889]
> > > 
> > > Signed-off-by: Ma Ling <ling.ma.program@gmail.com>
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > 
> > I completely agree with the other response to this patch in that
> > the description is bogus.
> > 
> > If CWF is implemented in the cpu, it should exactly relieve us from
> > having to move things around in structures so carefully like this.
> > 
> > Either the patch should be completely dropped (modern cpus don't
> > need this) or the commit message changed to reflect reality.
> > 
> > It really makes a terrible impression upon me when the patch says
> > something which in fact is 180 degrees from reality.
> 
> Hmm. 
> 
> Maybe the changelog is misleading, or maybe all the performance gains I
> have from this patch are probably some artifact or old/bad hardware, or
> something else.
> 
> 
> 
> (Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz)
> # ./cwf
> looking-up aligned time 108712072, 
> looking-up unaligned time 113268256
> looking-up aligned time 108677032, 
> looking-up unaligned time 113297636
> 
> 
> (Intel(R) Xeon(R) CPU           X5679  @ 3.20GHz)
> # ./cwf
> looking-up aligned time 139193589, 
> looking-up unaligned time 144307821
> looking-up aligned time 139136787, 
> looking-up unaligned time 144277752
> 
> My laptop : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
> # ./cwf
> looking-up aligned time 84869203, 
> looking-up unaligned time 86843462
> looking-up aligned time 84253003, 
> looking-up unaligned time 86227675
> 
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
> #include <unistd.h>
> 
> #define CACHELINE_SZ 64L
> 
> #define BIGBUFFER_SZ (64<<20)
> 
> # define HP_TIMING_NOW(Var) \
>  ({ unsigned long long _hi, _lo; \
>   asm volatile ("rdtsc" : "=a" (_lo), "=d" (_hi)); \
>   (Var) = _hi << 32 | _lo; })
> 
> #define repeat_times  20
> 
> char *bufzap;
> 
> static void zap_cache(void)
> {
> 	memset(bufzap, 2, BIGBUFFER_SZ);
> 	memset(bufzap, 3, BIGBUFFER_SZ);
> 	memset(bufzap, 4, BIGBUFFER_SZ);
> }
> 
> static char *init_buf(void)
> {
> 	void *res;
> 
> 	if (posix_memalign(&res, CACHELINE_SZ, BIGBUFFER_SZ)) {
> 		fprintf(stderr, "malloc() failed");
> 	        exit(1);
> 	}
> 
> 	memset(res, 1, BIGBUFFER_SZ);
> 	return res;
> }
> 
> unsigned long total;
> 
> static unsigned long random_access(void *buffer,
> 				   unsigned int off1,
> 				   unsigned int off2,
> 				   unsigned int off3)
> {
> 	int i;
> 	unsigned int n;
> 	unsigned long sum = 0;
> 	unsigned long *ptr;
> 
> 	srandom(7777);
> 	for (i = 0; i < 1000000; i++) {
> 		n = random() % (BIGBUFFER_SZ/CACHELINE_SZ);
> 		ptr = buffer + n*CACHELINE_SZ;
> 		if (ptr[off1])
> 			sum++;
> 		if (ptr[off2])
> 			sum++;
> //		if (ptr[off3])
> //			sum++;

Hmm, I don't know why I left these two lines commented out...

Of course, the results are a bit different with them uncommented:

looking-up aligned time 113601316, 
looking-up unaligned time 115964760
looking-up aligned time 113698636, 
looking-up unaligned time 115986072

More testing is probably needed.


* Re: [PATCH v2 net-next] inet: Get critical word in first 64bit of cache line
  2013-02-04  0:25       ` Eric Dumazet
@ 2013-02-04  2:53         ` Ling Ma
  2013-02-04  3:11           ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: Ling Ma @ 2013-02-04  2:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, maze


I attached my test program and cpu-info. The program forces all CPU
loads to issue one by one and avoids CPU hardware prefetch; build it
with: cc -o test-cwf test-cwf.c
The result from ./test-cwf is:
looking-up aligned time 157000272, looking-up unaligned time 162652724
If I am wrong, please correct me.

Thanks
Ling



[-- Attachment #2: cpu-info --]
[-- Type: text/plain, Size: 7050 bytes --]

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.153
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.30
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.153
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 1
cpu cores	: 4
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.30
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.153
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 9
cpu cores	: 4
apicid		: 18
initial apicid	: 18
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.30
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.153
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 10
cpu cores	: 4
apicid		: 20
initial apicid	: 20
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.30
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 4
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.153
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.30
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 5
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.153
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 1
cpu cores	: 4
apicid		: 3
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.30
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 6
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.153
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 9
cpu cores	: 4
apicid		: 19
initial apicid	: 19
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.30
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 7
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.153
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 10
cpu cores	: 4
apicid		: 21
initial apicid	: 21
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.30
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:


[-- Attachment #3: test-cwf.c --]
[-- Type: text/x-csrc, Size: 1986 bytes --]

#include<stdio.h>
#include<string.h>
#include<stdlib.h>
#include<unistd.h>
#define MAX_BUF_NUM (1 << 20)
#define MAX_BUF_SIZE (1 << 8)
#define ACCESS_OFFSET (0x38)

# define HP_TIMING_NOW(Var) \
 ({ unsigned long long _hi, _lo; \
  asm volatile ("rdtsc" : "=a" (_lo), "=d" (_hi)); \
  (Var) = _hi << 32 | _lo; })

#define repeat_times  (64)

static void init_buf(char **buf)
{
	int i = 0;
	char *start;
	char *end;
	int pagesize = getpagesize();
	*buf = malloc(MAX_BUF_SIZE * MAX_BUF_NUM + pagesize);
	if(*buf == NULL) {
		printf("\nfailed to malloc space!\n");
		exit(1);
	} else  {
		*buf = *buf + pagesize;
		*buf = (char *)(((unsigned long)*buf) & (-pagesize));
	}
	
	start = *buf;
	end = *buf + (MAX_BUF_SIZE * MAX_BUF_NUM) - MAX_BUF_SIZE;

	while(1) {
		*((unsigned char **)start) = end;
		*((unsigned char **)(start + ACCESS_OFFSET)) = (end + ACCESS_OFFSET);
		start = start + MAX_BUF_SIZE;
		if(start == end)
			break;
		*((unsigned char **)end) = start;
		*((unsigned char **)(end + ACCESS_OFFSET)) = start + ACCESS_OFFSET;
		end = end - MAX_BUF_SIZE;
	}

}

unsigned long lookingup_memmory(char *access, int num)
{
	__asm__("sub $1, %rsi");
	__asm__("xor %rax, %rax");
	__asm__("1:");
	__asm__("mov (%rdi), %r8");
	__asm__("add %r8, %rax");
	__asm__("mov %r8, %rdi");
	__asm__("sub $1, %rsi");
	__asm__("jae 1b");
}

static unsigned long test_lookup_time(char *buf)
{
	unsigned long i, start, end, best_time = ~0;

	for(i = 0; i < repeat_times; i++) {
		HP_TIMING_NOW(start);
		lookingup_memmory(buf, MAX_BUF_NUM);
		HP_TIMING_NOW(end);
		if(best_time > (end - start))
			best_time = (end - start);
	}

	return best_time;

}
int main(void)
{
	char *buf1 = NULL;
	char *buf2 = NULL;
	unsigned long aligned_time, unaligned_time;

	init_buf(&buf1);
	init_buf(&buf2);

	/* Chain stored at offset 0x38 of each node vs. at offset 0. */
	unaligned_time = test_lookup_time(buf2 + ACCESS_OFFSET);
	aligned_time = test_lookup_time(buf1);

	printf("looking-up aligned time %lu, looking-up unaligned time %lu\n", aligned_time, unaligned_time);
	return 0;
}






* Re: [PATCH v2 net-next] inet: Get critical word in first 64bit of cache line
  2013-02-04  2:53         ` Ling Ma
@ 2013-02-04  3:11           ` Eric Dumazet
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2013-02-04  3:11 UTC (permalink / raw)
  To: Ling Ma; +Cc: David Miller, netdev, maze

On Mon, 2013-02-04 at 10:53 +0800, Ling Ma wrote:
> I attached my test program and cpu-info. The program forces all CPU
> loads to issue one by one and avoids CPU hardware prefetch; build it
> with: cc -o test-cwf test-cwf.c
> The result from ./test-cwf is:
> looking-up aligned time 157000272, looking-up unaligned time 162652724
> If I am wrong, please correct me.

I have no idea why you use assembly code.

unsigned long lookingup_memmory(char *access, int num)
{
        __asm__("sub $1, %rsi");
        __asm__("xor %rax, %rax");
        __asm__("1:");
        __asm__("mov (%rdi), %r8");
        __asm__("add %r8, %rax");
        __asm__("mov %r8, %rdi");
        __asm__("sub $1, %rsi");
        __asm__("jae 1b");
}

Your program is really hard to read.


Thread overview: 13+ messages
     [not found] <1353900555-5966-1-git-send-email-ling.ma.program@gmail.com>
2012-11-26  6:44 ` [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line Eric Dumazet
2012-11-26 20:40   ` Ben Hutchings
2012-11-27 13:48   ` Ling Ma
2012-11-27 13:58     ` Eric Dumazet
2012-12-02 13:25       ` Ling Ma
2012-12-02 17:20         ` Eric Dumazet
2013-02-02 15:03 ` [PATCH v2 net-next] inet: Get critical " Eric Dumazet
2013-02-03 21:00   ` saeed bishara
2013-02-03 21:08   ` David Miller
2013-02-04  0:18     ` Eric Dumazet
2013-02-04  0:25       ` Eric Dumazet
2013-02-04  2:53         ` Ling Ma
2013-02-04  3:11           ` Eric Dumazet
