* Re: [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line
       [not found] <1353900555-5966-1-git-send-email-ling.ma.program@gmail.com>
@ 2012-11-26  6:44 ` Eric Dumazet
  2012-11-26 20:40   ` Ben Hutchings
  2012-11-27 13:48   ` Ling Ma
  2013-02-02 15:03 ` [PATCH v2 net-next] inet: Get critical " Eric Dumazet
  1 sibling, 2 replies; 13+ messages in thread
From: Eric Dumazet @ 2012-11-26  6:44 UTC
  To: ling.ma.program; +Cc: linux-kernel, netdev

On Mon, 2012-11-26 at 11:29 +0800, ling.ma.program@gmail.com wrote:
> From: Ma Ling <ling.ma.program@gmail.com>
>
> In order to reduce memory latency when last level cache miss occurs,
> modern CPUs i.e. x86 and arm introduced Critical Word First(CWF) or
> Early Restart(ER) to get data ASAP. For CWF if critical word is first member
> in cache line, memory feed CPU with critical word, then fill others
> data in cache line one by one, otherwise after critical word it must
> cost more cycle to fill the remaining cache line. For Early First CPU will
> restart until critical word in cache line reaches.
>
> Hash value is critical word, so in this patch we place it as first member
> in cache line(sock address is cache-line aligned), and it is also good for
> Early Restart platform as well .
>
> Thanks
> Ling

networking patches should be sent to netdev.

(I understand this patch is more a generic one, but at least CC netdev)

You give no performance numbers for this change...

I never heard of this CWF/ER, where are the official Intel documents
about this, and what models really benefit from it ?

Also, why not moving skc_net as well ?

BTW, skc_daddr & skc_rcv_saddr are 'critical' as well, we use them in
INET_MATCH()

It seems we have a 32bit hole on 64bit arches, so we probably should
move inet_dport/inet_num in it. It could well remove a full cache line
miss (I'll send a patch for this after tests)

Thanks
* Re: [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line
  2012-11-26  6:44 ` [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line Eric Dumazet
@ 2012-11-26 20:40   ` Ben Hutchings
  2012-11-27 13:48   ` Ling Ma
  1 sibling, 0 replies; 13+ messages in thread
From: Ben Hutchings @ 2012-11-26 20:40 UTC
  To: Eric Dumazet; +Cc: ling.ma.program, linux-kernel, netdev

On Sun, 2012-11-25 at 22:44 -0800, Eric Dumazet wrote:
> On Mon, 2012-11-26 at 11:29 +0800, ling.ma.program@gmail.com wrote:
> > From: Ma Ling <ling.ma.program@gmail.com>
> >
> > In order to reduce memory latency when last level cache miss occurs,
> > modern CPUs i.e. x86 and arm introduced Critical Word First(CWF) or
> > Early Restart(ER) to get data ASAP. For CWF if critical word is first member
> > in cache line, memory feed CPU with critical word, then fill others
> > data in cache line one by one, otherwise after critical word it must
> > cost more cycle to fill the remaining cache line. For Early First CPU will
> > restart until critical word in cache line reaches.
> >
> > Hash value is critical word, so in this patch we place it as first member
> > in cache line(sock address is cache-line aligned), and it is also good for
> > Early Restart platform as well .
> >
> > Thanks
> > Ling
>
> networking patches should be sent to netdev.
>
> (I understand this patch is more a generic one, but at least CC netdev)
>
> You give no performance numbers for this change...
>
> I never heard of this CWF/ER, where are the official Intel documents
> about this, and what models really benefit from it ?
[...]

CWF is a standard feature of SDRAM. Ulrich Drepper's series of articles
on memory covers this in part 2 <http://lwn.net/Articles/252125/>
section 3.5.2.

As for whether it's slower to start fetching from the middle, that may
depend on the memory controller and memory type that are used.
Drepper's benchmark showed only a small penalty (<1%) for fetching from
the middle, though he didn't say anything particular about the hardware
configuration.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
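For readers new to the term, the burst behaviour discussed above can be sketched in C. This is only an idealized model: it assumes a 64-byte line delivered as eight 64-bit beats in plain wrap-around order starting at the word that missed. The name cwf_burst_order and the two size constants are hypothetical, and real DRAM burst orderings and memory controllers may behave differently.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_BYTES 64u                          /* assumed cache-line size */
#define WORD_BYTES 8u                           /* assumed 64-bit transfer per beat */
#define WORDS_PER_LINE (LINE_BYTES / WORD_BYTES)

/*
 * Hypothetical helper: fill order[] with the word indices of one cache
 * line in the order an idealized critical-word-first burst would deliver
 * them, starting at the word containing miss_offset and wrapping around.
 */
void cwf_burst_order(unsigned int miss_offset,
                     unsigned int order[WORDS_PER_LINE])
{
	unsigned int first, beat;

	assert(order != NULL);
	first = (miss_offset % LINE_BYTES) / WORD_BYTES;
	for (beat = 0; beat < WORDS_PER_LINE; beat++)
		order[beat] = (first + beat) % WORDS_PER_LINE;
}
```

In this model a miss at byte offset 0x10 starts the burst at word 2 and finishes with word 1, while a miss at offset 0 delivers words 0 through 7 in order. Whether the first case costs anything extra is exactly the open question in this thread.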
* Re: [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line
  2012-11-26  6:44 ` [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line Eric Dumazet
  2012-11-26 20:40   ` Ben Hutchings
@ 2012-11-27 13:48   ` Ling Ma
  2012-11-27 13:58     ` Eric Dumazet
  1 sibling, 1 reply; 13+ messages in thread
From: Ling Ma @ 2012-11-27 13:48 UTC
  To: Eric Dumazet; +Cc: linux-kernel, netdev

> networking patches should be sent to netdev.
>
> (I understand this patch is more a generic one, but at least CC netdev)
Ling: OK, this is my first inet patch; I will send it to netdev later.

> You give no performance numbers for this change...
Ling: Once I get a machine, I will send out test results.

> I never heard of this CWF/ER, where are the official Intel documents
> about this, and what models really benefit from it ?
Ling: Arm implemented it.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388f/Caccifbd.html
AMD also used it.
http://classes.soe.ucsc.edu/cmpe202/Fall04/papers/opteron.pdf

> Also, why not moving skc_net as well ?
>
> BTW, skc_daddr & skc_rcv_saddr are 'critical' as well, we use them in
> INET_MATCH()
Ling: In the lookup routine, the hash value is the most important key.
If it matches, the other values are most likely to match as well, and
CWF is limited by memory bandwidth (usually 64 bits), so we only move
the hash value to be the critical first word.

Thanks
Ling
* Re: [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line
  2012-11-27 13:48 ` Ling Ma
@ 2012-11-27 13:58   ` Eric Dumazet
  2012-12-02 13:25     ` Ling Ma
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2012-11-27 13:58 UTC
  To: Ling Ma; +Cc: linux-kernel, netdev

On Tue, 2012-11-27 at 21:48 +0800, Ling Ma wrote:

> Ling: in the looking-up routine, hash value is the most important key,
> if it is matched, the other values have most possibility to be
> satisfied, and CFW is limited by memory bandwidth(64bit usually), so
> we only move hash value as critical first word.

In practice, we have at most one TCP socket per hash slot.

99.9999 % of lookups need all fields to complete.

Your patch introduces a misalignment error. I am not sure all 64 bit
arches are able to cope with that gracefully.

It seems all CWF docs I could find are very old stuff, mostly academic,
without good performance data.

I was asking for up-to-date statements from Intel/AMD/... about current
CPUs and current memory, because optimizing for 10-year-old CPUs is not
worth the pain.

I am assuming CPUs are implementing the CWF/ER automatically, and that
only prefetches could have a slight disadvantage if the needed word is
not the first word in the cache line.

It's not clear why the prefetch() hint could not also use CWF. It seems
it also could be done by the hardware.

So before random patches add their possible bugs to the Linux kernel,
we need a good study.

Thanks
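The point that a successful lookup needs all fields, not only the hash, can be sketched as follows. This is a simplified illustration, not the kernel's INET_MATCH(): struct mock_key, the two packing helpers and mock_match are hypothetical names, and the packing order is chosen arbitrarily.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified lookup key; NOT the kernel's struct sock_common. */
struct mock_key {
	uint32_t hash;      /* precomputed hash of the 4-tuple */
	uint32_t portpair;  /* remote and local port packed into 32 bits */
	uint64_t addrpair;  /* remote and local IPv4 address packed into 64 bits */
};

/* Pack two 32-bit addresses into one 64-bit word (order chosen arbitrarily). */
uint64_t mock_pack_addrs(uint32_t daddr, uint32_t saddr)
{
	return ((uint64_t)daddr << 32) | saddr;
}

/* Pack two 16-bit ports into one 32-bit word (order chosen arbitrarily). */
uint32_t mock_pack_ports(uint16_t dport, uint16_t num)
{
	return ((uint32_t)dport << 16) | num;
}

/*
 * A hash hit alone is not a match: addresses and ports are compared too,
 * so all three fields are loaded whenever the lookup succeeds.
 */
bool mock_match(const struct mock_key *sk, uint32_t hash,
                uint64_t addrpair, uint32_t portpair)
{
	assert(sk != NULL);
	if (sk->hash != hash)
		return false;	/* cheap early reject */
	return sk->addrpair == addrpair && sk->portpair == portpair;
}
```

With at most one socket per hash slot, the early reject rarely fires, so in this sketch the common path reads the hash, the address pair and the port pair from the same cache line.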
* Re: [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line
  2012-11-27 13:58 ` Eric Dumazet
@ 2012-12-02 13:25   ` Ling Ma
  2012-12-02 17:20     ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: Ling Ma @ 2012-12-02 13:25 UTC
  To: Eric Dumazet; +Cc: linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 410 bytes --]

Hi Eric,

Attached benchmark test-cwf.c(cc -o test-cwf test-cwf.c), the result
shows when last level cache(LLC) miss and CPU fetches data from
memory, critical word as first 64bit member in cache line has better
performance(costs 158290336 cycles ) than other positions(offset 0x10,
costs 164100732 ) in cache line, the performance is improved by 3.6%
in this case.
cpu-info is also involved too.

Thanks
Ling

[-- Attachment #2: test-cwf.c --]
[-- Type: text/x-csrc, Size: 1986 bytes --]

#include<stdio.h>
#include<string.h>
#include<stdlib.h>
#include<unistd.h>

#define MAX_BUF_NUM (1 << 20)
#define MAX_BUF_SIZE (1 << 8)
#define ACCESS_OFFSET (0x10)
# define HP_TIMING_NOW(Var) \
 ({ unsigned long long _hi, _lo; \
    asm volatile ("rdtsc" : "=a" (_lo), "=d" (_hi)); \
    (Var) = _hi << 32 | _lo; })

#define repeat_times (64)

static void init_buf(char **buf)
{
	int i = 0;
	char *start;
	char *end;
	int pagesize = getpagesize();

	*buf = malloc(MAX_BUF_SIZE * MAX_BUF_NUM + pagesize);
	if(*buf == NULL) {
		printf("\nfait to malloc space!\n");
		exit(1);
	} else {
		*buf = *buf + pagesize;
		*buf = (char *)(((unsigned long)*buf) & (-pagesize));
	}
	start = *buf;
	end = *buf + (MAX_BUF_SIZE * MAX_BUF_NUM) - MAX_BUF_SIZE;
	while(1) {
		*((unsigned char **)start) = end;
		*((unsigned char **)(start + ACCESS_OFFSET)) = (end + ACCESS_OFFSET);
		start = start + MAX_BUF_SIZE;
		if(start == end)
			break;
		*((unsigned char **)end) = start;
		*((unsigned char **)(end + ACCESS_OFFSET)) = start + ACCESS_OFFSET;
		end = end - MAX_BUF_SIZE;
	}
}

unsigned long lookingup_memmory(char *access, int num)
{
	__asm__("sub $1, %rsi");
	__asm__("xor %rax, %rax");
	__asm__("1:");
	__asm__("mov (%rdi), %r8");
	__asm__("add %r8, %rax");
	__asm__("mov %r8, %rdi");
	__asm__("sub $1, %rsi");
	__asm__("jae 1b");
}

static unsigned long test_lookup_time(char *buf)
{
	unsigned long i, start, end, best_time = ~0;

	for(i = 0; i < repeat_times; i++) {
		HP_TIMING_NOW(start);
		lookingup_memmory(buf, MAX_BUF_NUM);
		HP_TIMING_NOW(end);
		if(best_time > (end - start))
			best_time = (end - start);
	}
	return best_time;
}

void main (void)
{
	char *buf1 = NULL;
	char *buf2 = NULL;
	unsigned long aligned_time, unaligned_time;

	init_buf(&buf1);
	init_buf(&buf2);
	aligned_time = test_lookup_time(buf1);
	unaligned_time = test_lookup_time(buf2 + ACCESS_OFFSET);
	printf("looking-up aligned time %ld, looking-up unaligned time %ld\n", aligned_time, unaligned_time);
}

[-- Attachment #3: cpu-info --]
[-- Type: application/octet-stream, Size: 7050 bytes --]

processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz stepping : 2 microcode : 0x10 cpu MHz : 2400.005 cache size : 12288 KB physical id : 0 siblings : 8 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 4800.01 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz stepping : 2 microcode : 0x10 cpu MHz : 2400.005 cache size : 12288 KB physical id : 0 siblings : 8 core id : 1 cpu cores : 4 apicid : 2 initial apicid : 2 fpu : yes fpu_exception : 
yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 4800.01 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz stepping : 2 microcode : 0x10 cpu MHz : 2400.005 cache size : 12288 KB physical id : 0 siblings : 8 core id : 9 cpu cores : 4 apicid : 18 initial apicid : 18 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 4800.01 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz stepping : 2 microcode : 0x10 cpu MHz : 2400.005 cache size : 12288 KB physical id : 0 siblings : 8 core id : 10 cpu cores : 4 apicid : 20 initial apicid : 20 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni 
pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 4800.01 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 4 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz stepping : 2 microcode : 0x10 cpu MHz : 2400.005 cache size : 12288 KB physical id : 0 siblings : 8 core id : 0 cpu cores : 4 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 4800.01 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 5 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz stepping : 2 microcode : 0x10 cpu MHz : 2400.005 cache size : 12288 KB physical id : 0 siblings : 8 core id : 1 cpu cores : 4 apicid : 3 initial apicid : 3 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 4800.01 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual 
power management: processor : 6 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz stepping : 2 microcode : 0x10 cpu MHz : 2400.005 cache size : 12288 KB physical id : 0 siblings : 8 core id : 9 cpu cores : 4 apicid : 19 initial apicid : 19 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 4800.01 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 7 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz stepping : 2 microcode : 0x10 cpu MHz : 2400.005 cache size : 12288 KB physical id : 0 siblings : 8 core id : 10 cpu cores : 4 apicid : 21 initial apicid : 21 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 4800.01 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: ^ permalink raw reply [flat|nested] 13+ messages in thread
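The pointer-chasing idea behind test-cwf.c above can be sketched portably, without inline assembly. This is a rough illustration under stated assumptions, not the attached benchmark: build_chain, walk_chain and visit_index are hypothetical names, the link offset is a parameter instead of being written at both 0 and ACCESS_OFFSET, and all timing (rdtsc) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NODE_SIZE 256u	/* stride between nodes, like MAX_BUF_SIZE in test-cwf.c */

/* Index of the k-th node visited: 0, n-1, 1, n-2, ... (front/back alternation). */
static size_t visit_index(size_t k, size_t n)
{
	return (k % 2 == 0) ? k / 2 : n - 1 - k / 2;
}

/*
 * Link `count` nodes inside buf into one chain. Each link is stored at byte
 * `offset` inside its node and points at the same offset in the next node;
 * the last link is NULL. Returns the address of the first link.
 */
char *build_chain(char *buf, size_t count, size_t offset)
{
	size_t k;

	assert(buf != NULL && count > 0);
	assert(offset + sizeof(char *) <= NODE_SIZE);
	for (k = 0; k < count; k++) {
		char *slot = buf + visit_index(k, count) * NODE_SIZE + offset;
		char *next = (k + 1 < count)
			? buf + visit_index(k + 1, count) * NODE_SIZE + offset
			: NULL;

		memcpy(slot, &next, sizeof(next));
	}
	return buf + offset;
}

/* Follow the links until NULL; every load depends on the previous one. */
size_t walk_chain(const char *slot)
{
	size_t hops = 0;

	while (slot != NULL) {
		const char *next;

		memcpy(&next, slot, sizeof(next));
		slot = next;
		hops++;
	}
	return hops;
}
```

Because each address comes from the previous load, the loads cannot overlap, and jumping between the two ends of the buffer is meant to keep a stride prefetcher from guessing the next line. Timing walk_chain() once with offset 0 and once with offset 0x10 would mirror the aligned/unaligned comparison, assuming the buffer is much larger than the last-level cache.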
* Re: [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line
  2012-12-02 13:25 ` Ling Ma
@ 2012-12-02 17:20   ` Eric Dumazet
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2012-12-02 17:20 UTC
  To: Ling Ma; +Cc: linux-kernel, netdev

On Sun, 2012-12-02 at 21:25 +0800, Ling Ma wrote:
> Hi Eric,
>
> Attached benchmark test-cwf.c(cc -o test-cwf test-cwf.c), the result
> shows when last level cache(LLC) miss and CPU fetches data from
> memory, critical word as first 64bit member in cache line has better
> performance(costs 158290336 cycles ) than other positions(offset 0x10,
> costs 164100732 ) in cache line, the performance is improved by 3.6%
> in this case.
> cpu-info is also involved too.
>
> Thanks
> Ling

Thanks Ling.

Note that I was more interested by the case we read more fields per
cache line, like we do in tcp lookups.

(skc_daddr, skc_rcv_saddr, skc_bound_dev_if, skc_net).

I made changes to net-next to prepare your patch.

You'll have to move both skc_rxhash & skc_portpair before the
skc_addrpair.

I have to fix an endianness sparse problem, I'll send a patch for this
in a separate thread right now.
* [PATCH v2 net-next] inet: Get critical word in first 64bit of cache line
       [not found] <1353900555-5966-1-git-send-email-ling.ma.program@gmail.com>
  2012-11-26  6:44 ` [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line Eric Dumazet
@ 2013-02-02 15:03 ` Eric Dumazet
  2013-02-03 21:00   ` saeed bishara
  2013-02-03 21:08   ` David Miller
  1 sibling, 2 replies; 13+ messages in thread
From: Eric Dumazet @ 2013-02-02 15:03 UTC
  To: ling.ma.program, David Miller; +Cc: netdev, Maciej Żenczykowski

From: Ma Ling <ling.ma.program@gmail.com>

In order to reduce memory latency when last level cache miss occurs,
modern CPUs i.e. x86 and arm introduced Critical Word First(CWF) or
Early Restart(ER) to get data ASAP. For CWF if critical word is first member
in cache line, memory feed CPU with critical word, then fill others
data in cache line one by one, otherwise after critical word it must
cost more cycle to fill the remaining cache line. For Early First CPU
will restart until critical word in cache line reaches.

Hash value is critical word, so in this patch we place it as first
member in cache line (sock address is cache-line aligned), and it is
also good for Early Restart platform as well .

[edumazet: respin on net-next after commit ce43b03e8889]

Signed-off-by: Ma Ling <ling.ma.program@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
---
 include/net/sock.h | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index a340ab4..efabd9a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -131,12 +131,12 @@ typedef __u64 __bitwise __addrpair;
 
 /**
  *	struct sock_common - minimal network layer representation of sockets
- *	@skc_daddr: Foreign IPv4 addr
- *	@skc_rcv_saddr: Bound local IPv4 addr
  *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_u16hashes: two u16 hash values used by UDP lookup tables
  *	@skc_dport: placeholder for inet_dport/tw_dport
  *	@skc_num: placeholder for inet_num/tw_num
+ *	@skc_daddr: Foreign IPv4 addr
+ *	@skc_rcv_saddr: Bound local IPv4 addr
  *	@skc_family: network address family
  *	@skc_state: Connection state
  *	@skc_reuse: %SO_REUSEADDR setting
@@ -153,18 +153,10 @@ typedef __u64 __bitwise __addrpair;
  *
  *	This is the minimal network layer representation of sockets, the header
  *	for struct sock and struct inet_timewait_sock.
+ *	Order of first fields is critical for __inet_lookup_established() :
+ *	skc_hash, skc_portpair, skc_addrpair
  */
 struct sock_common {
-	/* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
-	 * address on 64bit arches : cf INET_MATCH() and INET_TW_MATCH()
-	 */
-	union {
-		__addrpair	skc_addrpair;
-		struct {
-			__be32	skc_daddr;
-			__be32	skc_rcv_saddr;
-		};
-	};
 	union {
 		unsigned int	skc_hash;
 		__u16		skc_u16hashes[2];
@@ -178,6 +170,16 @@ struct sock_common {
 		};
 	};
 
+	/* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
+	 * address on 64bit arches : cf INET_MATCH() and INET_TW_MATCH()
+	 */
+	union {
+		__addrpair	skc_addrpair;
+		struct {
+			__be32	skc_daddr;
+			__be32	skc_rcv_saddr;
+		};
+	};
 	unsigned short		skc_family;
 	volatile unsigned char	skc_state;
 	unsigned char		skc_reuse:4;
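The field order this patch aims for can be checked with a small user-space mock. The sketch below is an assumption-laden stand-in, not the kernel header: struct mock_sock_common and mock_addrpair_offset are hypothetical names, fixed-width types replace the kernel typedefs, and only the leading fields are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical mock of the first fields of struct sock_common after the
 * patch, using fixed-width types in place of the kernel typedefs.
 */
struct mock_sock_common {
	union {
		uint32_t hash;
		uint16_t u16hashes[2];
	};
	union {
		uint32_t portpair;
		struct {
			uint16_t dport;
			uint16_t num;
		};
	};
	union {
		uint64_t addrpair;
		struct {
			uint32_t daddr;
			uint32_t rcv_saddr;
		};
	};
	unsigned short family;
};

/* Compile-time checks: hash first, ports in the next 32 bits, addresses 8-byte aligned. */
_Static_assert(offsetof(struct mock_sock_common, hash) == 0,
	       "hash must be the first member");
_Static_assert(offsetof(struct mock_sock_common, portpair) == 4,
	       "port pair must directly follow the hash");
_Static_assert(offsetof(struct mock_sock_common, addrpair) % 8 == 0,
	       "address pair must sit on an 8-byte boundary");

/* Byte offset of the address pair, for run-time inspection. */
size_t mock_addrpair_offset(void)
{
	return offsetof(struct mock_sock_common, addrpair);
}
```

In this mock the hash and the port pair together fill the first 64 bits, so the 64-bit address pair lands at offset 8 without padding. That is consistent with the earlier remark in this thread about a 32-bit hole on 64-bit arches that the ports could fill.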
* Re: [PATCH v2 net-next] inet: Get critical word in first 64bit of cache line
  2013-02-02 15:03 ` [PATCH v2 net-next] inet: Get critical " Eric Dumazet
@ 2013-02-03 21:00   ` saeed bishara
  2013-02-03 21:08   ` David Miller
  1 sibling, 0 replies; 13+ messages in thread
From: saeed bishara @ 2013-02-03 21:00 UTC
  To: Eric Dumazet
  Cc: ling.ma.program, David Miller, netdev, Maciej Żenczykowski

On Sat, Feb 2, 2013 at 5:03 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Ma Ling <ling.ma.program@gmail.com>
>
> In order to reduce memory latency when last level cache miss occurs,
> modern CPUs i.e. x86 and arm introduced Critical Word First(CWF) or
> Early Restart(ER) to get data ASAP. For CWF if critical word is first
> member
> in cache line, memory feed CPU with critical word, then fill others
> data in cache line one by one, otherwise after critical word it must
> cost more cycle to fill the remaining cache line. For Early First CPU
> will restart until critical word in cache line reaches.
>
> Hash value is critical word, so in this patch we place it as first
> member in cache line (sock address is cache-line aligned), and it is
> also good for Early Restart platform as well .

I think the description of this patch doesn't make sense. The purpose
of the CWF hardware feature is to relieve software from having to move
the critical word to be the first member of the cache line. That of
course depends on how you define CWF, but at least according to
http://lwn.net/Articles/252125/ and
https://github.com/jamie-allen/cpu_caches/blob/master/preso/presentation.md
CWF means the hardware will do the job.

So I think the patch may be useful (1) for systems that don't have CWF,
or (2) because CWF may not totally eliminate the additional latency.
This is of course a prediction, as you see.

saeed
* Re: [PATCH v2 net-next] inet: Get critical word in first 64bit of cache line
  2013-02-02 15:03 ` [PATCH v2 net-next] inet: Get critical " Eric Dumazet
  2013-02-03 21:00   ` saeed bishara
@ 2013-02-03 21:08   ` David Miller
  2013-02-04  0:18     ` Eric Dumazet
  1 sibling, 1 reply; 13+ messages in thread
From: David Miller @ 2013-02-03 21:08 UTC
  To: eric.dumazet; +Cc: ling.ma.program, netdev, maze

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 02 Feb 2013 07:03:55 -0800

> From: Ma Ling <ling.ma.program@gmail.com>
>
> In order to reduce memory latency when last level cache miss occurs,
> modern CPUs i.e. x86 and arm introduced Critical Word First(CWF) or
> Early Restart(ER) to get data ASAP. For CWF if critical word is first
> member
> in cache line, memory feed CPU with critical word, then fill others
> data in cache line one by one, otherwise after critical word it must
> cost more cycle to fill the remaining cache line. For Early First CPU
> will restart until critical word in cache line reaches.
>
> Hash value is critical word, so in this patch we place it as first
> member in cache line (sock address is cache-line aligned), and it is
> also good for Early Restart platform as well .
>
> [edumazet: respin on net-next after commit ce43b03e8889]
>
> Signed-off-by: Ma Ling <ling.ma.program@gmail.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

I completely agree with the other response to this patch in that
the description is bogus.

If CWF is implemented in the cpu, it should exactly relieve us from
having to move things around in structures so carefully like this.

Either the patch should be completely dropped (modern cpus don't
need this) or the commit message changed to reflect reality.

It really makes a terrible impression upon me when the patch says
something which in fact is 180 degrees from reality.
* Re: [PATCH v2 net-next] inet: Get critical word in first 64bit of cache line
  2013-02-03 21:08 ` David Miller
@ 2013-02-04  0:18   ` Eric Dumazet
  2013-02-04  0:25     ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2013-02-04  0:18 UTC
  To: David Miller; +Cc: ling.ma.program, netdev, maze

On Sun, 2013-02-03 at 16:08 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Sat, 02 Feb 2013 07:03:55 -0800
>
> > From: Ma Ling <ling.ma.program@gmail.com>
> >
> > In order to reduce memory latency when last level cache miss occurs,
> > modern CPUs i.e. x86 and arm introduced Critical Word First(CWF) or
> > Early Restart(ER) to get data ASAP. For CWF if critical word is first
> > member
> > in cache line, memory feed CPU with critical word, then fill others
> > data in cache line one by one, otherwise after critical word it must
> > cost more cycle to fill the remaining cache line. For Early First CPU
> > will restart until critical word in cache line reaches.
> >
> > Hash value is critical word, so in this patch we place it as first
> > member in cache line (sock address is cache-line aligned), and it is
> > also good for Early Restart platform as well .
> >
> > [edumazet: respin on net-next after commit ce43b03e8889]
> >
> > Signed-off-by: Ma Ling <ling.ma.program@gmail.com>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>
> I completely agree with the other response to this patch in that
> the description is bogus.
>
> If CWF is implemented in the cpu, it should exactly relieve us from
> having to move things around in structures so carefully like this.
>
> Either the patch should be completely dropped (modern cpus don't
> need this) or the commit message changed to reflect reality.
>
> It really makes a terrible impression upon me when the patch says
> something which in fact is 180 degrees from reality.

Hmm.

Maybe the changelog is misleading, or maybe all the performance gains I
have from this patch are probably some artifact or old/bad hardware, or
something else.

(Intel(R) Xeon(R) CPU X5660 @ 2.80GHz)
# ./cwf
looking-up aligned time 108712072,
looking-up unaligned time 113268256
looking-up aligned time 108677032,
looking-up unaligned time 113297636

(Intel(R) Xeon(R) CPU X5679 @ 3.20GHz)
# ./cwf
looking-up aligned time 139193589,
looking-up unaligned time 144307821
looking-up aligned time 139136787,
looking-up unaligned time 144277752

My laptop : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
# ./cwf
looking-up aligned time 84869203,
looking-up unaligned time 86843462
looking-up aligned time 84253003,
looking-up unaligned time 86227675

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>

#define CACHELINE_SZ 64L

#define BIGBUFFER_SZ (64<<20)

# define HP_TIMING_NOW(Var) \
 ({ unsigned long long _hi, _lo; \
    asm volatile ("rdtsc" : "=a" (_lo), "=d" (_hi)); \
    (Var) = _hi << 32 | _lo; })

#define repeat_times 20

char *bufzap;

static void zap_cache(void)
{
	memset(bufzap, 2, BIGBUFFER_SZ);
	memset(bufzap, 3, BIGBUFFER_SZ);
	memset(bufzap, 4, BIGBUFFER_SZ);
}

static char *init_buf(void)
{
	void *res;

	if (posix_memalign(&res, CACHELINE_SZ, BIGBUFFER_SZ)) {
		fprintf(stderr, "malloc() failed");
		exit(1);
	}

	memset(res, 1, BIGBUFFER_SZ);
	return res;
}

unsigned long total;

static unsigned long random_access(void *buffer,
				   unsigned int off1,
				   unsigned int off2,
				   unsigned int off3)
{
	int i;
	unsigned int n;
	unsigned long sum = 0;
	unsigned long *ptr;

	srandom(7777);
	for (i = 0; i < 1000000; i++) {
		n = random() % (BIGBUFFER_SZ/CACHELINE_SZ);
		ptr = buffer + n*CACHELINE_SZ;
		if (ptr[off1])
			sum++;
		if (ptr[off2])
			sum++;
//		if (ptr[off3])
//			sum++;
	}
	total += sum;
	return sum;
}

static unsigned long test_lookup_time(void *buf,
				      unsigned int off1,
				      unsigned int off2,
				      unsigned int off3)
{
	unsigned long i, start, end, best_time = ~0;

	for (i = 0; i < repeat_times; i++) {
		zap_cache();
		HP_TIMING_NOW(start);
		random_access(buf, off1, off2, off3);
		HP_TIMING_NOW(end);
		if (best_time > (end - start))
			best_time = (end - start);
	}

	return best_time;
}

int main(int argc, char *argv[])
{
	char *buf;
	unsigned long aligned_time, unaligned_time;

	buf = init_buf();
	bufzap = init_buf();
	aligned_time = test_lookup_time(buf, 0, 2, 4);
	unaligned_time = test_lookup_time(buf, 4, 2, 0);
	printf("looking-up aligned time %lu, \nlooking-up unaligned time %lu\n",
	       aligned_time, unaligned_time);
	aligned_time = test_lookup_time(buf, 0, 2, 4);
	unaligned_time = test_lookup_time(buf, 4, 2, 0);
	printf("looking-up aligned time %lu, \nlooking-up unaligned time %lu\n",
	       aligned_time, unaligned_time);
}
* Re: [PATCH v2 net-next] inet: Get critical word in first 64bit of cache line
  2013-02-04  0:18 ` Eric Dumazet
@ 2013-02-04  0:25   ` Eric Dumazet
  2013-02-04  2:53     ` Ling Ma
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2013-02-04  0:25 UTC
  To: David Miller; +Cc: ling.ma.program, netdev, maze

On Sun, 2013-02-03 at 16:18 -0800, Eric Dumazet wrote:
> On Sun, 2013-02-03 at 16:08 -0500, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Sat, 02 Feb 2013 07:03:55 -0800
> >
> > > From: Ma Ling <ling.ma.program@gmail.com>
> > >
> > > In order to reduce memory latency when last level cache miss occurs,
> > > modern CPUs i.e. x86 and arm introduced Critical Word First(CWF) or
> > > Early Restart(ER) to get data ASAP. For CWF if critical word is first
> > > member
> > > in cache line, memory feed CPU with critical word, then fill others
> > > data in cache line one by one, otherwise after critical word it must
> > > cost more cycle to fill the remaining cache line. For Early First CPU
> > > will restart until critical word in cache line reaches.
> > >
> > > Hash value is critical word, so in this patch we place it as first
> > > member in cache line (sock address is cache-line aligned), and it is
> > > also good for Early Restart platform as well .
> > >
> > > [edumazet: respin on net-next after commit ce43b03e8889]
> > >
> > > Signed-off-by: Ma Ling <ling.ma.program@gmail.com>
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> >
> > I completely agree with the other response to this patch in that
> > the description is bogus.
> >
> > If CWF is implemented in the cpu, it should exactly relieve us from
> > having to move things around in structures so carefully like this.
> >
> > Either the patch should be completely dropped (modern cpus don't
> > need this) or the commit message changed to reflect reality.
> >
> > It really makes a terrible impression upon me when the patch says
> > something which in fact is 180 degrees from reality.
>
> Hmm.
>
> Maybe the changelog is misleading, or maybe all the performance gains I
> have from this patch are probably some artifact or old/bad hardware, or
> something else.
>
>
>
> (Intel(R) Xeon(R) CPU X5660 @ 2.80GHz)
> # ./cwf
> looking-up aligned time 108712072,
> looking-up unaligned time 113268256
> looking-up aligned time 108677032,
> looking-up unaligned time 113297636
>
>
> (Intel(R) Xeon(R) CPU X5679 @ 3.20GHz)
> # ./cwf
> looking-up aligned time 139193589,
> looking-up unaligned time 144307821
> looking-up aligned time 139136787,
> looking-up unaligned time 144277752
>
> My laptop : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
> # ./cwf
> looking-up aligned time 84869203,
> looking-up unaligned time 86843462
> looking-up aligned time 84253003,
> looking-up unaligned time 86227675
>
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> #define CACHELINE_SZ 64L
>
> #define BIGBUFFER_SZ (64<<20)
>
> # define HP_TIMING_NOW(Var) \
>  ({ unsigned long long _hi, _lo; \
>     asm volatile ("rdtsc" : "=a" (_lo), "=d" (_hi)); \
>     (Var) = _hi << 32 | _lo; })
>
> #define repeat_times 20
>
> char *bufzap;
>
> static void zap_cache(void)
> {
> 	memset(bufzap, 2, BIGBUFFER_SZ);
> 	memset(bufzap, 3, BIGBUFFER_SZ);
> 	memset(bufzap, 4, BIGBUFFER_SZ);
> }
>
> static char *init_buf(void)
> {
> 	void *res;
>
> 	if (posix_memalign(&res, CACHELINE_SZ, BIGBUFFER_SZ)) {
> 		fprintf(stderr, "malloc() failed");
> 		exit(1);
> 	}
>
> 	memset(res, 1, BIGBUFFER_SZ);
> 	return res;
> }
>
> unsigned long total;
>
> static unsigned long random_access(void *buffer,
> 				   unsigned int off1,
> 				   unsigned int off2,
> 				   unsigned int off3)
> {
> 	int i;
> 	unsigned int n;
> 	unsigned long sum = 0;
> 	unsigned long *ptr;
>
> 	srandom(7777);
> 	for (i = 0; i < 1000000; i++) {
> 		n = random() % (BIGBUFFER_SZ/CACHELINE_SZ);
> 		ptr = buffer + n*CACHELINE_SZ;
> 		if (ptr[off1])
> 			sum++;
> 		if (ptr[off2])
> 			sum++;
> //		if (ptr[off3])
> //			sum++;

Hmm, I don't know why I left a comment on these two lines...

Of course, results are a bit different removing the comments :

looking-up aligned time 113601316,
looking-up unaligned time 115964760
looking-up aligned time 113698636,
looking-up unaligned time 115986072

More testing is probably needed.
* Re: [PATCH v2 net-next] inet: Get critical word in first 64bit of cache line
  2013-02-04 0:25 ` Eric Dumazet
@ 2013-02-04 2:53 ` Ling Ma
  2013-02-04 3:11 ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: Ling Ma @ 2013-02-04 2:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, maze

[-- Attachment #1: Type: text/plain, Size: 4551 bytes --]

I attached my test program(we force all cpu loads issue one by one ,
and avoid cpu hardwre prefetch cc -o test-cwf test-cwf.c.) and
cpu-info, the result from ./test-cwf indicates as below:
looking-up aligned time 157000272, looking-up unaligned time 162652724
If I was wrong please correct me.

Thanks
Ling

2013/2/4, Eric Dumazet <eric.dumazet@gmail.com>:
> On Sun, 2013-02-03 at 16:18 -0800, Eric Dumazet wrote:
>> On Sun, 2013-02-03 at 16:08 -0500, David Miller wrote:
>> > From: Eric Dumazet <eric.dumazet@gmail.com>
>> > Date: Sat, 02 Feb 2013 07:03:55 -0800
>> >
>> > > From: Ma Ling <ling.ma.program@gmail.com>
>> > >
>> > > In order to reduce memory latency when last level cache miss occurs,
>> > > modern CPUs i.e. x86 and arm introduced Critical Word First(CWF) or
>> > > Early Restart(ER) to get data ASAP. For CWF if critical word is first
>> > > member
>> > > in cache line, memory feed CPU with critical word, then fill others
>> > > data in cache line one by one, otherwise after critical word it must
>> > > cost more cycle to fill the remaining cache line. For Early First CPU
>> > > will restart until critical word in cache line reaches.
>> > >
>> > > Hash value is critical word, so in this patch we place it as first
>> > > member in cache line (sock address is cache-line aligned), and it is
>> > > also good for Early Restart platform as well .
>> > >
>> > > [edumazet: respin on net-next after commit ce43b03e8889]
>> > >
>> > > Signed-off-by: Ma Ling <ling.ma.program@gmail.com>
>> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
>> >
>> > I completely agree with the other response to this patch in that
>> > the description is bogus.
>> >
>> > If CWF is implemented in the cpu, it should exactly relieve us from
>> > having to move things around in structures so carefully like this.
>> >
>> > Either the patch should be completely dropped (modern cpus don't
>> > need this) or the commit message changed to reflect reality.
>> >
>> > It really makes a terrible impression upon me when the patch says
>> > something which in fact is 180 degrees from reality.
>>
>> Hmm.
>>
>> Maybe the changelog is misleading, or maybe all the performance gains I
>> have from this patch are probably some artifact or old/bad hardware, or
>> something else.
>>
>>
>>
>> (Intel(R) Xeon(R) CPU X5660 @ 2.80GHz)
>> # ./cwf
>> looking-up aligned time 108712072,
>> looking-up unaligned time 113268256
>> looking-up aligned time 108677032,
>> looking-up unaligned time 113297636
>>
>>
>> (Intel(R) Xeon(R) CPU X5679 @ 3.20GHz)
>> # ./cwf
>> looking-up aligned time 139193589,
>> looking-up unaligned time 144307821
>> looking-up aligned time 139136787,
>> looking-up unaligned time 144277752
>>
>> My laptop : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
>> # ./cwf
>> looking-up aligned time 84869203,
>> looking-up unaligned time 86843462
>> looking-up aligned time 84253003,
>> looking-up unaligned time 86227675
>>
>> #include <stdio.h>
>> #include <string.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>>
>> #define CACHELINE_SZ 64L
>>
>> #define BIGBUFFER_SZ (64<<20)
>>
>> # define HP_TIMING_NOW(Var) \
>>  ({ unsigned long long _hi, _lo; \
>>     asm volatile ("rdtsc" : "=a" (_lo), "=d" (_hi)); \
>>     (Var) = _hi << 32 | _lo; })
>>
>> #define repeat_times 20
>>
>> char *bufzap;
>>
>> static void zap_cache(void)
>> {
>> 	memset(bufzap, 2, BIGBUFFER_SZ);
>> 	memset(bufzap, 3, BIGBUFFER_SZ);
>> 	memset(bufzap, 4, BIGBUFFER_SZ);
>> }
>>
>> static char *init_buf(void)
>> {
>> 	void *res;
>>
>> 	if (posix_memalign(&res, CACHELINE_SZ, BIGBUFFER_SZ)) {
>> 		fprintf(stderr, "malloc() failed");
>> 		exit(1);
>> 	}
>>
>> 	memset(res, 1, BIGBUFFER_SZ);
>> 	return res;
>> }
>>
>> unsigned long total;
>>
>> static unsigned long random_access(void *buffer,
>> 				   unsigned int off1,
>> 				   unsigned int off2,
>> 				   unsigned int off3)
>> {
>> 	int i;
>> 	unsigned int n;
>> 	unsigned long sum = 0;
>> 	unsigned long *ptr;
>>
>> 	srandom(7777);
>> 	for (i = 0; i < 1000000; i++) {
>> 		n = random() % (BIGBUFFER_SZ/CACHELINE_SZ);
>> 		ptr = buffer + n*CACHELINE_SZ;
>> 		if (ptr[off1])
>> 			sum++;
>> 		if (ptr[off2])
>> 			sum++;
>> //		if (ptr[off3])
>> //			sum++;
>
> Hmm, I don't know why I left a comment on these two lines...
>
> Of course, results are a bit different removing the comments :
>
> looking-up aligned time 113601316,
> looking-up unaligned time 115964760
> looking-up aligned time 113698636,
> looking-up unaligned time 115986072
>
> More testing is probably needed.
>
>
>

[-- Attachment #2: cpu-info --]
[-- Type: text/plain, Size: 7050 bytes --]

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
stepping : 2
microcode : 0x10
cpu MHz : 2400.153
cache size : 12288 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips : 4800.30
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
stepping : 2
microcode : 0x10
cpu MHz : 2400.153
cache size : 12288 KB
physical id : 0
siblings : 8
core id : 1
cpu cores : 4
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips : 4800.30
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
stepping : 2
microcode : 0x10
cpu MHz : 2400.153
cache size : 12288 KB
physical id : 0
siblings : 8
core id : 9
cpu cores : 4
apicid : 18
initial apicid : 18
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips : 4800.30
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
stepping : 2
microcode : 0x10
cpu MHz : 2400.153
cache size : 12288 KB
physical id : 0
siblings : 8
core id : 10
cpu cores : 4
apicid : 20
initial apicid : 20
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips : 4800.30
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 4
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
stepping : 2
microcode : 0x10
cpu MHz : 2400.153
cache size : 12288 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips : 4800.30
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 5
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
stepping : 2
microcode : 0x10
cpu MHz : 2400.153
cache size : 12288 KB
physical id : 0
siblings : 8
core id : 1
cpu cores : 4
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips : 4800.30
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 6
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
stepping : 2
microcode : 0x10
cpu MHz : 2400.153
cache size : 12288 KB
physical id : 0
siblings : 8
core id : 9
cpu cores : 4
apicid : 19
initial apicid : 19
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips : 4800.30
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
stepping : 2
microcode : 0x10
cpu MHz : 2400.153
cache size : 12288 KB
physical id : 0
siblings : 8
core id : 10
cpu cores : 4
apicid : 21
initial apicid : 21
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips : 4800.30
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

[-- Attachment #3: test-cwf.c --]
[-- Type: text/x-csrc, Size: 1986 bytes --]

#include<stdio.h>
#include<string.h>
#include<stdlib.h>
#include<unistd.h>

#define MAX_BUF_NUM (1 << 20)
#define MAX_BUF_SIZE (1 << 8)
#define ACCESS_OFFSET (0x38)

# define HP_TIMING_NOW(Var) \
 ({ unsigned long long _hi, _lo; \
    asm volatile ("rdtsc" : "=a" (_lo), "=d" (_hi)); \
    (Var) = _hi << 32 | _lo; })

#define repeat_times (64)

static void init_buf(char **buf)
{
	int i = 0;
	char *start;
	char *end;
	int pagesize = getpagesize();

	*buf = malloc(MAX_BUF_SIZE * MAX_BUF_NUM + pagesize);
	if(*buf == NULL) {
		printf("\nfait to malloc space!\n");
		exit(1);
	} else {
		*buf = *buf + pagesize;
		*buf = (char *)(((unsigned long)*buf) & (-pagesize));
	}
	start = *buf;
	end = *buf + (MAX_BUF_SIZE * MAX_BUF_NUM) - MAX_BUF_SIZE;
	while(1) {
		*((unsigned char **)start) = end;
		*((unsigned char **)(start + ACCESS_OFFSET)) = (end + ACCESS_OFFSET);
		start = start + MAX_BUF_SIZE;
		if(start == end)
			break;
		*((unsigned char **)end) = start;
		*((unsigned char **)(end + ACCESS_OFFSET)) = start + ACCESS_OFFSET;
		end = end - MAX_BUF_SIZE;
	}
}

unsigned long lookingup_memmory(char *access, int num)
{
	__asm__("sub $1, %rsi");
	__asm__("xor %rax, %rax");
	__asm__("1:");
	__asm__("mov (%rdi), %r8");
	__asm__("add %r8, %rax");
	__asm__("mov %r8, %rdi");
	__asm__("sub $1, %rsi");
	__asm__("jae 1b");
}

static unsigned long test_lookup_time(char *buf)
{
	unsigned long i, start, end, best_time = ~0;

	for(i = 0; i < repeat_times; i++) {
		HP_TIMING_NOW(start);
		lookingup_memmory(buf, MAX_BUF_NUM);
		HP_TIMING_NOW(end);
		if(best_time > (end - start))
			best_time = (end - start);
	}
	return best_time;
}

void main (void)
{
	char *buf1 = NULL;
	char *buf2 = NULL;
	unsigned long aligned_time, unaligned_time;

	init_buf(&buf1);
	init_buf(&buf2);
	unaligned_time = test_lookup_time(buf2 + ACCESS_OFFSET);
	aligned_time = test_lookup_time(buf1);
	printf("looking-up aligned time %ld, looking-up unaligned time %ld\n",
	       aligned_time, unaligned_time);
}

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 net-next] inet: Get critical word in first 64bit of cache line
  2013-02-04 2:53 ` Ling Ma
@ 2013-02-04 3:11 ` Eric Dumazet
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2013-02-04 3:11 UTC (permalink / raw)
  To: Ling Ma; +Cc: David Miller, netdev, maze

On Mon, 2013-02-04 at 10:53 +0800, Ling Ma wrote:
> I attached my test program(we force all cpu loads issue one by one ,
> and avoid cpu hardwre prefetch cc -o test-cwf test-cwf.c.) and
> cpu-info, the result from ./test-cwf indicates as below:
> looking-up aligned time 157000272, looking-up unaligned time 162652724
> If I was wrong please correct me.

I have no idea why you use assembly code.

unsigned long lookingup_memmory(char *access, int num)
{
	__asm__("sub $1, %rsi");
	__asm__("xor %rax, %rax");
	__asm__("1:");
	__asm__("mov (%rdi), %r8");
	__asm__("add %r8, %rax");
	__asm__("mov %r8, %rdi");
	__asm__("sub $1, %rsi");
	__asm__("jae 1b");
}

Your program is really hard to read.

^ permalink raw reply [flat|nested] 13+ messages in thread
Thread overview: 13+ messages
[not found] <1353900555-5966-1-git-send-email-ling.ma.program@gmail.com>
2012-11-26 6:44 ` [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line Eric Dumazet
2012-11-26 20:40 ` Ben Hutchings
2012-11-27 13:48 ` Ling Ma
2012-11-27 13:58 ` Eric Dumazet
2012-12-02 13:25 ` Ling Ma
2012-12-02 17:20 ` Eric Dumazet
2013-02-02 15:03 ` [PATCH v2 net-next] inet: Get critical " Eric Dumazet
2013-02-03 21:00 ` saeed bishara
2013-02-03 21:08 ` David Miller
2013-02-04 0:18 ` Eric Dumazet
2013-02-04 0:25 ` Eric Dumazet
2013-02-04 2:53 ` Ling Ma
2013-02-04 3:11 ` Eric Dumazet