Poor network performance between DomU with multiqueue support

* Poor network performance between DomU with multiqueue support
@ 2014-12-02  8:30 zhangleiqiang
  2014-12-02 10:57 ` David Vrabel
  2014-12-02 11:01 ` Wei Liu
  0 siblings, 2 replies; 35+ messages in thread
From: zhangleiqiang @ 2014-12-02  8:30 UTC (permalink / raw)
  To: xen-devel


Hi, all
    I am testing the performance of the Xen netfront/netback drivers with multi-queue support. Throughput from a domU to a remote dom0 is 9.2Gb/s, but throughput from a domU to a remote domU is only 3.6Gb/s, so I suspected that the bottleneck was the path from dom0 to the local domU. However, our tests show that throughput from dom0 to a local domU is 5.8Gb/s.
    If we send packets from one DomU to three other DomUs on a different host simultaneously, the aggregate throughput reaches 9Gb/s. Does that mean the bottleneck is the receiver?
    After some analysis, I found that even though max_queue of netfront/netback is set to 4, there are some strange results:
    1. In the domU, only one rx queue handles softirqs.
    2. In dom0, only two of the netback queue processes are scheduled; the other two are not.

    Are there any problems with my test? In theory, can we achieve 9~10Gb/s between DomUs on different hosts using netfront/netback?
    
    The test environment is as follows:
   1. Hardware
       a. CPU: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz, 2 CPUs with 6 cores each, Hyper-Threading enabled
       b. NIC: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
   2. Software
       a. Host OS: SLES 12 (kernel 3.16-7, git commit d0335e4feea0d3f7a8af3116c5dc166239da7521)
       b. NIC driver: IXGBE 3.21.2
       c. OVS: 2.1.3
       d. MTU: 1600
       e. Dom0: 6U6G (6 vCPUs, 6 GB memory)
       f. Queue number: 4
       g. Xen: 4.4
       h. DomU: 4U4G (4 vCPUs, 4 GB memory)
   3. Networking environment
       a. All network traffic is transmitted and received through OVS
       b. The sender and receiver servers are connected directly through their 10GbE NICs
   4. Testing tools
       a. Sender: netperf
       b. Receiver: netserver


----------
zhangleiqiang (Trump)
Best Regards

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

* Re: Poor network performance between DomU with multiqueue support
  2014-12-02  8:30 Poor network performance between DomU with multiqueue support zhangleiqiang
@ 2014-12-02 10:57 ` David Vrabel
  2014-12-02 11:53   ` Zhangleiqiang (Trump)
  2014-12-02 11:01 ` Wei Liu
  1 sibling, 1 reply; 35+ messages in thread
From: David Vrabel @ 2014-12-02 10:57 UTC (permalink / raw)
  To: zhangleiqiang, xen-devel

On 02/12/14 08:30, zhangleiqiang wrote:
> Hi, all
>     I am testing the performance of xen netfront-netback driver that
> with multi-queues support. The throughput from domU to remote dom0 is
> 9.2Gb/s, but the throughput from domU to remote domU is only 3.6Gb/s, I
> think the bottleneck is the throughput from dom0 to local domU. However,
> we have done some testing and found the throughput from dom0 to local
> domU is 5.8Gb/s.
>     And if we send packets from one DomU to other 3 DomUs on different
> host simultaneously, the sum of throughout can reach 9Gbps. It seems
> like the bottleneck is the receiver?
>     After some analysis, I found that even the max_queue of
> netfront/back is set to 4, there are some strange results as follows:
>     1. In domU, only one rx queue deal with softirq
>     2. In dom0, only two netback queues process are scheduled, other two
> process aren't scheduled.

Multiqueue only has benefits if you have multiple flows since the
source/destination addresses are hashed to a queue number.  This
probably explains why only some of the queues are being used in your test.

David
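
The flow hashing described in this reply can be illustrated with a minimal sketch. It is not the kernel's actual code: hashing the address/port tuple with `zlib.crc32`, the modulo mapping, and the names `flow_to_queue` and `NUM_QUEUES` are assumptions chosen only for illustration, and the addresses and ports are made up. The point it shows is that every packet of one flow lands on the same queue, so a single stream cannot occupy more than one queue.

```python
import zlib

NUM_QUEUES = 4  # matches the queue count used in this thread


def flow_to_queue(src_ip, dst_ip, src_port, dst_port, num_queues=NUM_QUEUES):
    """Map a flow to a queue index (illustrative hash, not the kernel's)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % num_queues


# Hypothetical addresses and ports, purely for illustration.
single_flow = ("10.0.0.1", "10.0.0.2", 40000, 12865)

# Every packet of the same flow hashes to the same queue.
queues_used = {flow_to_queue(*single_flow) for _ in range(1000)}
print(len(queues_used))  # a single flow occupies exactly one queue
```

With only a handful of flows, several of them can also hash to the same queue, which would leave some queues idle.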

* Re: Poor network performance between DomU with multiqueue support
  2014-12-02  8:30 Poor network performance between DomU with multiqueue support zhangleiqiang
  2014-12-02 10:57 ` David Vrabel
@ 2014-12-02 11:01 ` Wei Liu
  2014-12-02 11:50   ` Zhangleiqiang (Trump)
  1 sibling, 1 reply; 35+ messages in thread
From: Wei Liu @ 2014-12-02 11:01 UTC (permalink / raw)
  To: zhangleiqiang; +Cc: wei.liu2, xen-devel

On Tue, Dec 02, 2014 at 04:30:49PM +0800, zhangleiqiang wrote:
> Hi, all
>     I am testing the performance of xen netfront-netback driver that with multi-queues support. The throughput from domU to remote dom0 is 9.2Gb/s, but the throughput from domU to remote domU is only 3.6Gb/s, I think the bottleneck is the throughput from dom0 to local domU. However, we have done some testing and found the throughput from dom0 to local domU is 5.8Gb/s.
>     And if we send packets from one DomU to other 3 DomUs on different host simultaneously, the sum of throughout can reach 9Gbps. It seems like the bottleneck is the receiver?
>     After some analysis, I found that even the max_queue of netfront/back is set to 4, there are some strange results as follows:
>     1. In domU, only one rx queue deal with softirq

Try to bind irq to different vcpus?

>     2. In dom0, only two netback queues process are scheduled, other two process aren't scheduled.

How many Dom0 vcpu do you have? If it only has two then there will only
be two processes running at a time.

> 
>     Are there any issues in my test? In theory, can we achieve 9~10Gb/s between DomUs on different hosts using netfront/netback?
>     
>      The testing environment details are as follows:
>    1. Hardware
>        a. CPU: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz, 2 CPU 6 Cores with Hyper Thread enabled
>        b. NIC: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
>    2. Sofware:
>        a. HostOS: SLES 12 (Kernel 3.16-7,git commit d0335e4feea0d3f7a8af3116c5dc166239da7521 )

And this is a SuSE kernel?

>        b. NIC Driver: IXGBE 3.21.2 
>        c. OVS: 2.1.3
>        d. MTU: 1600
>        e. Dom0：6U6G
>        f. queue number: 4
>        g. xen 4.4
>        h. DomU: 4U4G
>    3. Networking Environment:
>        a. All network flows are transmit/receive through OVS
>        b. Sender server and receiver server are connected directly between 10GE NIC
>    4. Testing Tools:
>        a. Sender: netperf
>        b. Receiver: netserver

* Re: Poor network performance between DomU with multiqueue support
  2014-12-02 11:01 ` Wei Liu
@ 2014-12-02 11:50   ` Zhangleiqiang (Trump)
  2014-12-02 12:11     ` Wei Liu
  0 siblings, 1 reply; 35+ messages in thread
From: Zhangleiqiang (Trump) @ 2014-12-02 11:50 UTC (permalink / raw)
  To: Wei Liu, zhangleiqiang
  Cc: Xiaoding (B), Zhuangyuxin, Luohao (brian), Yuzhou (C),
	xen-devel@lists.xen.org

> -----Original Message-----
> From: xen-devel-bounces@lists.xen.org
> [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of Wei Liu
> Sent: Tuesday, December 02, 2014 7:02 PM
> To: zhangleiqiang
> Cc: wei.liu2@citrix.com; xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] Poor network performance between DomU with
> multiqueue support
> 
> On Tue, Dec 02, 2014 at 04:30:49PM +0800, zhangleiqiang wrote:
> > Hi, all
> >     I am testing the performance of xen netfront-netback driver that with
> multi-queues support. The throughput from domU to remote dom0 is 9.2Gb/s,
> but the throughput from domU to remote domU is only 3.6Gb/s, I think the
> bottleneck is the throughput from dom0 to local domU. However, we have
> done some testing and found the throughput from dom0 to local domU is
> 5.8Gb/s.
> >     And if we send packets from one DomU to other 3 DomUs on different
> host simultaneously, the sum of throughout can reach 9Gbps. It seems like the
> bottleneck is the receiver?
> >     After some analysis, I found that even the max_queue of netfront/back
> is set to 4, there are some strange results as follows:
> >     1. In domU, only one rx queue deal with softirq
> 
> Try to bind irq to different vcpus?

Do you mean binding the IRQs to different vCPUs in the DomU? I will try it now.

> 
> >     2. In dom0, only two netback queues process are scheduled, other two
> process aren't scheduled.
> 
> How many Dom0 vcpu do you have? If it only has two then there will only be
> two processes running at a time.

Dom0 has 6 vCPUs and 6 GB of memory. Only one DomU is running on this Dom0, so four netback processes are running (the max_queue parameter of the netback kernel module is set to 4).
What we observe is that only two of the four netback processes run at about 70% CPU usage, while the other two use very little CPU.
Is there a hash algorithm that determines which netback process handles an incoming packet?

> >
> >     Are there any issues in my test? In theory, can we achieve 9~10Gb/s
> between DomUs on different hosts using netfront/netback?
> >
> >      The testing environment details are as follows:
> >    1. Hardware
> >        a. CPU: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz, 2 CPU 6 Cores with
> Hyper Thread enabled
> >        b. NIC: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network
> Connection (rev 01)
> >    2. Sofware:
> >        a. HostOS: SLES 12 (Kernel 3.16-7,git commit
> > d0335e4feea0d3f7a8af3116c5dc166239da7521 )
> 
> And this is a SuSE kernel?

No, I compiled the Dom0 and DomU kernels myself from the 3.16-7 tag on kernel.org.

> >        b. NIC Driver: IXGBE 3.21.2
> >        c. OVS: 2.1.3
> >        d. MTU: 1600
> >        e. Dom0：6U6G
> >        f. queue number: 4
> >        g. xen 4.4
> >        h. DomU: 4U4G
> >    3. Networking Environment:
> >        a. All network flows are transmit/receive through OVS
> >        b. Sender server and receiver server are connected directly between
> 10GE NIC
> >    4. Testing Tools:
> >        a. Sender: netperf
> >        b. Receiver: netserver

* Re: Poor network performance between DomU with multiqueue support
  2014-12-02 10:57 ` David Vrabel
@ 2014-12-02 11:53   ` Zhangleiqiang (Trump)
  2014-12-02 17:25     ` Zoltan Kiss
  0 siblings, 1 reply; 35+ messages in thread
From: Zhangleiqiang (Trump) @ 2014-12-02 11:53 UTC (permalink / raw)
  To: David Vrabel, zhangleiqiang, xen-devel@lists.xen.org
  Cc: Xiaoding (B), Zhuangyuxin, Luohao (brian), Yuzhou (C)

> -----Original Message-----
> From: xen-devel-bounces@lists.xen.org
> [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of David Vrabel
> Sent: Tuesday, December 02, 2014 6:57 PM
> To: zhangleiqiang; xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] Poor network performance between DomU with
> multiqueue support
> 
> On 02/12/14 08:30, zhangleiqiang wrote:
> > Hi, all
> >     I am testing the performance of xen netfront-netback driver that
> > with multi-queues support. The throughput from domU to remote dom0 is
> > 9.2Gb/s, but the throughput from domU to remote domU is only 3.6Gb/s,
> > I think the bottleneck is the throughput from dom0 to local domU.
> > However, we have done some testing and found the throughput from dom0
> > to local domU is 5.8Gb/s.
> >     And if we send packets from one DomU to other 3 DomUs on different
> > host simultaneously, the sum of throughout can reach 9Gbps. It seems
> > like the bottleneck is the receiver?
> >     After some analysis, I found that even the max_queue of
> > netfront/back is set to 4, there are some strange results as follows:
> >     1. In domU, only one rx queue deal with softirq
> >     2. In dom0, only two netback queues process are scheduled, other
> > two process aren't scheduled.
> 
> Multiqueue only has benefits if you have multiple flows since the
> source/destination addresses are hashed to a queue number.  This probably
> explains why only some of the queues are being used in your test.

Is the hash you mention used to select the netback process or the netfront rx queue?
There are indeed four netback processes running in Dom0: only one DomU is running on this Dom0, and the max_queue parameter of the netback kernel module is set to 4.
What we observe is that only two of the four netback processes run at about 70% CPU usage, while the other two use very little CPU.

* Re: Poor network performance between DomU with multiqueue support
  2014-12-02 11:50   ` Zhangleiqiang (Trump)
@ 2014-12-02 12:11     ` Wei Liu
  2014-12-02 14:46       ` Zhangleiqiang (Trump)
  0 siblings, 1 reply; 35+ messages in thread
From: Wei Liu @ 2014-12-02 12:11 UTC (permalink / raw)
  To: Zhangleiqiang (Trump)
  Cc: Luohao (brian), Wei Liu, Zhuangyuxin, zhangleiqiang, Yuzhou (C),
	xen-devel@lists.xen.org, Xiaoding (B)

On Tue, Dec 02, 2014 at 11:50:59AM +0000, Zhangleiqiang (Trump) wrote:
> > -----Original Message-----
> > From: xen-devel-bounces@lists.xen.org
> > [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of Wei Liu
> > Sent: Tuesday, December 02, 2014 7:02 PM
> > To: zhangleiqiang
> > Cc: wei.liu2@citrix.com; xen-devel@lists.xen.org
> > Subject: Re: [Xen-devel] Poor network performance between DomU with
> > multiqueue support
> > 
> > On Tue, Dec 02, 2014 at 04:30:49PM +0800, zhangleiqiang wrote:
> > > Hi, all
> > >     I am testing the performance of xen netfront-netback driver that with
> > multi-queues support. The throughput from domU to remote dom0 is 9.2Gb/s,
> > but the throughput from domU to remote domU is only 3.6Gb/s, I think the
> > bottleneck is the throughput from dom0 to local domU. However, we have
> > done some testing and found the throughput from dom0 to local domU is
> > 5.8Gb/s.
> > >     And if we send packets from one DomU to other 3 DomUs on different
> > host simultaneously, the sum of throughout can reach 9Gbps. It seems like the
> > bottleneck is the receiver?
> > >     After some analysis, I found that even the max_queue of netfront/back
> > is set to 4, there are some strange results as follows:
> > >     1. In domU, only one rx queue deal with softirq
> > 
> > Try to bind irq to different vcpus?
> 
> Do you mean we try to bind irq to different vcpus in DomU? I will try it now.
> 

Yes. Given the fact that you have two backend threads running while only
one DomU vcpu is busy, it smells like misconfiguration in DomU.

If this phenomenon persists after correctly binding irqs, you might want
to check traffic is steering correctly to different queues.
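
For reference, binding IRQs to vCPUs as suggested here could be sketched as follows. This is a sketch under stated assumptions, not a tested recipe for this setup: it relies on the standard Linux `/proc/irq/<n>/smp_affinity` interface, the helper names (`cpu_mask`, `plan_affinity`, `apply_affinity`) are hypothetical, and the IRQ numbers are placeholders that must be read from `/proc/interrupts` in the guest.

```python
import os


def cpu_mask(cpu):
    """Return the hexadecimal affinity mask that selects a single CPU."""
    return format(1 << cpu, "x")


def plan_affinity(irqs, num_cpus):
    """Spread IRQs over CPUs round-robin; returns {irq: hex mask}."""
    return {irq: cpu_mask(i % num_cpus) for i, irq in enumerate(irqs)}


def apply_affinity(plan, proc_irq="/proc/irq"):
    """Write each mask to <proc_irq>/<irq>/smp_affinity (needs root)."""
    for irq, mask in plan.items():
        path = os.path.join(proc_irq, str(irq), "smp_affinity")
        with open(path, "w") as f:
            f.write(mask)


# Placeholder IRQ numbers for four rx queues on a 4-vCPU guest.
plan = plan_affinity([73, 75, 77, 79], num_cpus=4)
print(plan)  # {73: '1', 75: '2', 77: '4', 79: '8'}
```

Only the planning step runs here; `apply_affinity(plan)` would have to be called as root inside the DomU, and a running irqbalance daemon may overwrite the masks afterwards.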

> > 
> > >     2. In dom0, only two netback queues process are scheduled, other two
> > process aren't scheduled.
> > 
> > How many Dom0 vcpu do you have? If it only has two then there will only be
> > two processes running at a time.
> 
> Dom0 has 6 vcpus, and 6G memory. There are only one DomU running in Dom0 and so four netback processes are running in Dom0 (because the max_queue param of netback kernel module is set to 4). 
> The phenomenon is that only 2 of these four netback process were running with about 70% cpu usage, and another two use little CPU.
> Is there a hash algorithm to determine which netback process to handle the input packet?
> 

I think that's whatever default algorithm Linux kernel is using.

We don't currently support other algorithms.

Wei.

* Re: Poor network performance between DomU with multiqueue support
  2014-12-02 12:11     ` Wei Liu
@ 2014-12-02 14:46       ` Zhangleiqiang (Trump)
  2014-12-02 15:58         ` Wei Liu
  0 siblings, 1 reply; 35+ messages in thread
From: Zhangleiqiang (Trump) @ 2014-12-02 14:46 UTC (permalink / raw)
  To: Wei Liu
  Cc: Luohao (brian), Zhuangyuxin, zhangleiqiang, Yuzhou (C),
	xen-devel@lists.xen.org, Xiaoding (B)

Thanks for your reply, Wei.

I just ran the following test, with these results:

Three DomUs (4U4G) are running on Host A (6U6G) and one DomU (4U4G) is running on Host B (6U6G). I send packets from the three DomUs to the DomU on Host B simultaneously.

1. The "top" output on Host B is as follows:

top - 09:42:11 up  1:07,  2 users,  load average: 2.46, 1.90, 1.47
Tasks: 173 total,   4 running, 169 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.8 si,  1.9 st
%Cpu1  :  0.0 us, 27.0 sy,  0.0 ni, 63.1 id,  0.0 wa,  0.0 hi,  9.5 si,  0.4 st
%Cpu2  :  0.0 us, 90.0 sy,  0.0 ni,  8.3 id,  0.0 wa,  0.0 hi,  1.7 si,  0.0 st
%Cpu3  :  0.4 us,  1.4 sy,  0.0 ni, 95.4 id,  0.0 wa,  0.0 hi,  1.4 si,  1.4 st
%Cpu4  :  0.0 us, 60.2 sy,  0.0 ni, 39.5 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu5  :  0.0 us,  2.8 sy,  0.0 ni, 89.4 id,  0.0 wa,  0.0 hi,  6.9 si,  0.9 st
KiB Mem:   4517144 total,  3116480 used,  1400664 free,      876 buffers
KiB Swap:  2103292 total,        0 used,  2103292 free.  2374656 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                            
 7440 root      20   0       0      0      0 R 71.10 0.000   8:15.38 vif4.0-q3-guest                                                    
 7434 root      20   0       0      0      0 R 59.14 0.000   9:00.58 vif4.0-q0-guest                                                    
   18 root      20   0       0      0      0 R 33.89 0.000   2:35.06 ksoftirqd/2                                                        
   28 root      20   0       0      0      0 S 20.93 0.000   3:01.81 ksoftirqd/4


As shown above, only two netback-related processes (vif4.0-*) are running with high CPU usage, and the other two netback processes are idle. The "ps" output for the vif4.0-* processes is as follows:

root      7434 50.5  0.0      0     0 ?        R    09:23  11:29 [vif4.0-q0-guest]
root      7435  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q0-deall]
root      7436  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q1-guest]
root      7437  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q1-deall]
root      7438  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q2-guest]
root      7439  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q2-deall]
root      7440 48.1  0.0      0     0 ?        R    09:23  10:55 [vif4.0-q3-guest]
root      7441  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q3-deall]
root      9724  0.0  0.0   9244  1520 pts/0    S+   09:46   0:00 grep --color=auto


2. The "rx"-related content of /proc/interrupts in the receiver DomU (on Host B):

          CPU0       CPU1       CPU2       CPU3
 73:         2          0    2925405          0   xen-dyn-event   eth0-q0-rx
 75:        43         93          0        118   xen-dyn-event   eth0-q1-rx
 77:         2       3376         14       1983   xen-dyn-event   eth0-q2-rx
 79:   2414666          0          9          0   xen-dyn-event   eth0-q3-rx

As shown above, it seems that only q0 and q3 handle the interrupts triggered by packet reception.
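
The per-queue distribution can be totalled with a short script. This is a minimal sketch: the counters are copied from the output above, while the helper name `rx_totals` and the parsing rule (sum the numeric columns that follow the IRQ label) are assumptions about the `/proc/interrupts` layout on this guest.

```python
SAMPLE = """\
73: 2 0 2925405 0 xen-dyn-event eth0-q0-rx
75: 43 93 0 118 xen-dyn-event eth0-q1-rx
77: 2 3376 14 1983 xen-dyn-event eth0-q2-rx
79: 2414666 0 9 0 xen-dyn-event eth0-q3-rx
"""


def rx_totals(text):
    """Return {queue name: interrupts summed over all CPUs}."""
    totals = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[-1].endswith("-rx"):
            continue
        # The columns after the "NN:" label are per-CPU counters; stop at
        # the first non-numeric column (the interrupt type).
        counts = []
        for field in fields[1:]:
            if not field.isdigit():
                break
            counts.append(int(field))
        totals[fields[-1]] = sum(counts)
    return totals


totals = rx_totals(SAMPLE)
grand_total = sum(totals.values())
for name, count in totals.items():
    print(f"{name}: {count} ({100 * count / grand_total:.1f}%)")
```

On these numbers, q0 and q3 together account for nearly all rx interrupts, while q1 and q2 are almost unused.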

Any advice? Thanks.
----------
zhangleiqiang (Trump)

Best Regards


* Re: Poor network performance between DomU with multiqueue support
  2014-12-02 14:46       ` Zhangleiqiang (Trump)
@ 2014-12-02 15:58         ` Wei Liu
  2014-12-03 14:43           ` Zhangleiqiang (Trump)
  0 siblings, 1 reply; 35+ messages in thread
From: Wei Liu @ 2014-12-02 15:58 UTC (permalink / raw)
  To: Zhangleiqiang (Trump)
  Cc: Luohao (brian), Wei Liu, Zhuangyuxin, zhangleiqiang, Yuzhou (C),
	xen-devel@lists.xen.org, Xiaoding (B)

On Tue, Dec 02, 2014 at 02:46:36PM +0000, Zhangleiqiang (Trump) wrote:
> Thanks for your reply, Wei.
> 
> I do the following testing just now and found the results as follows:
> 
> There are three DomUs (4U4G) are running on Host A (6U6G) and one DomU (4U4G) is running on Host B (6U6G), I send packets from three DomUs to the DomU on Host B simultaneously. 
> 
> 1. The "top" output of Host B as follows:
> 
> top - 09:42:11 up  1:07,  2 users,  load average: 2.46, 1.90, 1.47
> Tasks: 173 total,   4 running, 169 sleeping,   0 stopped,   0 zombie
> %Cpu0  :  0.0 us,  0.0 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.8 si,  1.9 st
> %Cpu1  :  0.0 us, 27.0 sy,  0.0 ni, 63.1 id,  0.0 wa,  0.0 hi,  9.5 si,  0.4 st
> %Cpu2  :  0.0 us, 90.0 sy,  0.0 ni,  8.3 id,  0.0 wa,  0.0 hi,  1.7 si,  0.0 st
> %Cpu3  :  0.4 us,  1.4 sy,  0.0 ni, 95.4 id,  0.0 wa,  0.0 hi,  1.4 si,  1.4 st
> %Cpu4  :  0.0 us, 60.2 sy,  0.0 ni, 39.5 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
> %Cpu5  :  0.0 us,  2.8 sy,  0.0 ni, 89.4 id,  0.0 wa,  0.0 hi,  6.9 si,  0.9 st
> KiB Mem:   4517144 total,  3116480 used,  1400664 free,      876 buffers
> KiB Swap:  2103292 total,        0 used,  2103292 free.  2374656 cached Mem
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                            
>  7440 root      20   0       0      0      0 R 71.10 0.000   8:15.38 vif4.0-q3-guest                                                    
>  7434 root      20   0       0      0      0 R 59.14 0.000   9:00.58 vif4.0-q0-guest                                                    
>    18 root      20   0       0      0      0 R 33.89 0.000   2:35.06 ksoftirqd/2                                                        
>    28 root      20   0       0      0      0 S 20.93 0.000   3:01.81 ksoftirqd/4
> 
> 
> As shown above, only two netback related processes (vif4.0-*) are running with high cpu usage, and the other 2 netback processes are idle. The "ps" result of vif4.0-* processes as follows:
> 
> root      7434 50.5  0.0      0     0 ?        R    09:23  11:29 [vif4.0-q0-guest]
> root      7435  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q0-deall]
> root      7436  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q1-guest]
> root      7437  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q1-deall]
> root      7438  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q2-guest]
> root      7439  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q2-deall]
> root      7440 48.1  0.0      0     0 ?        R    09:23  10:55 [vif4.0-q3-guest]
> root      7441  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q3-deall]
> root      9724  0.0  0.0   9244  1520 pts/0    S+   09:46   0:00 grep --color=auto
> 
> 
> 2. The "rx" related content in /proc/interupts in receiver DomU (on Host B):
> 
> 73: 	2		0		2925405		0			xen-dyn-event		eth0-q0-rx
> 75: 	43		93		0			118			xen-dyn-event		eth0-q1-rx
> 77: 	2		3376	14			1983		xen-dyn-event		eth0-q2-rx
> 79: 	2414666	0		9			0			xen-dyn-event		eth0-q3-rx
> 
> As shown above, it seems like that only q0 and q3 handles the interrupt triggered by packet receving.
> 
> Any advise? Thanks.

Netback selects queue based on the return value of
skb_get_queue_mapping. The queue mapping is set by core driver or
ndo_select_queue (if specified by individual driver). In this case
netback doesn't have its implementation of ndo_select_queue, so it's up
to core driver to decide which queue to dispatch the packet to.  I
think you need to inspect why Dom0 only steers traffic to these two
queues but not all of them.

Don't know which utility is handy for this job. Probably tc(8) is
useful?

Wei.
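
One way to reason about why only two of the four queues were busy is to count how three flows can fall onto four queues. The sketch below rests on assumptions that the thread does not confirm: each of the three sending DomUs contributes exactly one flow, and the hash spreads flows over the queues uniformly and independently. The function name `distinct_queue_distribution` is hypothetical.

```python
from fractions import Fraction
from itertools import product

NUM_QUEUES = 4  # queue count from this thread
NUM_FLOWS = 3   # assumption: one flow per sending DomU


def distinct_queue_distribution(num_flows, num_queues):
    """Return {number of distinct queues hit: probability}."""
    outcomes = {}
    total = num_queues ** num_flows
    # Enumerate every possible flow-to-queue assignment.
    for assignment in product(range(num_queues), repeat=num_flows):
        used = len(set(assignment))
        outcomes[used] = outcomes.get(used, 0) + 1
    return {used: Fraction(n, total) for used, n in sorted(outcomes.items())}


dist = distinct_queue_distribution(NUM_FLOWS, NUM_QUEUES)
for used, prob in dist.items():
    print(f"{used} queue(s) busy: {prob} ({float(prob):.1%})")
```

Under those assumptions, three flows land on exactly two queues in 9 of 16 cases, and they can never occupy all four queues, so seeing two busy queues in this test would not by itself indicate a steering bug.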

* Re: Poor network performance between DomU with multiqueue support
  2014-12-02 11:53   ` Zhangleiqiang (Trump)
@ 2014-12-02 17:25     ` Zoltan Kiss
  0 siblings, 0 replies; 35+ messages in thread
From: Zoltan Kiss @ 2014-12-02 17:25 UTC (permalink / raw)
  To: Zhangleiqiang (Trump), David Vrabel, zhangleiqiang,
	xen-devel@lists.xen.org
  Cc: Xiaoding (B), Zhuangyuxin, Luohao (brian), Yuzhou (C)



On 02/12/14 11:53, Zhangleiqiang (Trump) wrote:
>> -----Original Message-----
>> From: xen-devel-bounces@lists.xen.org
>> [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of David Vrabel
>> Sent: Tuesday, December 02, 2014 6:57 PM
>> To: zhangleiqiang; xen-devel@lists.xen.org
>> Subject: Re: [Xen-devel] Poor network performance between DomU with
>> multiqueue support
>>
>> On 02/12/14 08:30, zhangleiqiang wrote:
>>> Hi, all
>>>      I am testing the performance of xen netfront-netback driver that
>>> with multi-queues support. The throughput from domU to remote dom0 is
>>> 9.2Gb/s, but the throughput from domU to remote domU is only 3.6Gb/s,
>>> I think the bottleneck is the throughput from dom0 to local domU.
>>> However, we have done some testing and found the throughput from dom0
>>> to local domU is 5.8Gb/s.
>>>      And if we send packets from one DomU to other 3 DomUs on different
>>> host simultaneously, the sum of throughout can reach 9Gbps. It seems
>>> like the bottleneck is the receiver?
>>>      After some analysis, I found that even the max_queue of
>>> netfront/back is set to 4, there are some strange results as follows:
>>>      1. In domU, only one rx queue deal with softirq
>>>      2. In dom0, only two netback queues process are scheduled, other
>>> two process aren't scheduled.
>>
>> Multiqueue only has benefits if you have multiple flows since the
>> source/destination addresses are hashed to a queue number.  This probably
>> explains why only some of the queues are being used in your test.
>
> The hash method you mentioned is used for selection of netback process or netfront rx queue?
It's used in both directions to select the queue.
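
As a rough illustration of why a small number of flows can leave some queues idle, here is a minimal sketch of hash-based queue selection. It is not the kernel's actual algorithm: the function name `select_queue`, the use of CRC32, and the inclusion of ports in the key are all assumptions made for the example.

```python
import zlib

def select_queue(src_ip, dst_ip, src_port, dst_port, num_queues):
    """Map one flow to a queue index.

    Illustrative only: the real kernel uses its own flow hash, not CRC32.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % num_queues

# Three hypothetical flows towards one receiver (addresses are made up).
flows = [("10.0.0.1", "10.0.0.9", 40000 + i, 5001) for i in range(3)]

# Every packet of a given flow lands on the same queue, so three flows
# can keep at most three of the four queues busy, and possibly fewer
# if two flows happen to hash to the same index.
used = {select_queue(*flow, num_queues=4) for flow in flows}
print(f"{len(flows)} flows use {len(used)} of 4 queues: {sorted(used)}")
```

With only a few long-lived flows, the number of busy queues is bounded by the number of flows, which would be consistent with seeing only some of the netback threads doing work.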

> Indeed, there are 4 netback processes running in Dom0, because there are only one DomU running in Dom0 and so four netback processes are running in Dom0 (the max_queue param of netback kernel module is set to 4).
> The phenomenon is that only 2 of these four netback process were running with about 70% cpu usage, and another two use little CPU.
>
>> David
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org
>> http://lists.xen.org/xen-devel
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Poor network performance between DomU with multiqueue support
  2014-12-02 15:58         ` Wei Liu
@ 2014-12-03 14:43           ` Zhangleiqiang (Trump)
  2014-12-04 10:50             ` Wei Liu
  0 siblings, 1 reply; 35+ messages in thread
From: Zhangleiqiang (Trump) @ 2014-12-03 14:43 UTC (permalink / raw)
  To: Wei Liu, xen-devel@lists.xen.org
  Cc: Xiaoding (B), Zhuangyuxin, zhangleiqiang, Luohao (brian),
	Yuzhou (C)

> -----Original Message-----
> From: Wei Liu [mailto:wei.liu2@citrix.com]
> Sent: Tuesday, December 02, 2014 11:59 PM
> To: Zhangleiqiang (Trump)
> Cc: Wei Liu; zhangleiqiang; xen-devel@lists.xen.org; Luohao (brian); Xiaoding
> (B); Yuzhou (C); Zhuangyuxin
> Subject: Re: [Xen-devel] Poor network performance between DomU with
> multiqueue support
> 
> On Tue, Dec 02, 2014 at 02:46:36PM +0000, Zhangleiqiang (Trump) wrote:
> > Thanks for your reply, Wei.
> >
> > I ran the following test just now and found the results below:
> >
> > Three DomUs (4U4G) are running on Host A (6U6G) and one DomU (4U4G) is
> > running on Host B (6U6G). I send packets from the three DomUs to the
> > DomU on Host B simultaneously.
> >
> > 1. The "top" output of Host B as follows:
> >
> > top - 09:42:11 up  1:07,  2 users,  load average: 2.46, 1.90, 1.47
> > Tasks: 173 total,   4 running, 169 sleeping,   0 stopped,   0 zombie
> > %Cpu0  :  0.0 us,  0.0 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.8 si,  1.9 st
> > %Cpu1  :  0.0 us, 27.0 sy,  0.0 ni, 63.1 id,  0.0 wa,  0.0 hi,  9.5 si,  0.4 st
> > %Cpu2  :  0.0 us, 90.0 sy,  0.0 ni,  8.3 id,  0.0 wa,  0.0 hi,  1.7 si,  0.0 st
> > %Cpu3  :  0.4 us,  1.4 sy,  0.0 ni, 95.4 id,  0.0 wa,  0.0 hi,  1.4 si,  1.4 st
> > %Cpu4  :  0.0 us, 60.2 sy,  0.0 ni, 39.5 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
> > %Cpu5  :  0.0 us,  2.8 sy,  0.0 ni, 89.4 id,  0.0 wa,  0.0 hi,  6.9 si,  0.9 st
> > KiB Mem:   4517144 total,  3116480 used,  1400664 free,      876 buffers
> > KiB Swap:  2103292 total,        0 used,  2103292 free.  2374656 cached Mem
> >
> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> >  7440 root      20   0       0      0      0 R 71.10 0.000   8:15.38 vif4.0-q3-guest
> >  7434 root      20   0       0      0      0 R 59.14 0.000   9:00.58 vif4.0-q0-guest
> >    18 root      20   0       0      0      0 R 33.89 0.000   2:35.06 ksoftirqd/2
> >    28 root      20   0       0      0      0 S 20.93 0.000   3:01.81 ksoftirqd/4
> >
> >
> > As shown above, only two netback related processes (vif4.0-*) are running
> > with high cpu usage, and the other 2 netback processes are idle. The "ps"
> > result of vif4.0-* processes as follows:
> >
> > root      7434 50.5  0.0      0     0 ?        R    09:23  11:29 [vif4.0-q0-guest]
> > root      7435  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q0-deall]
> > root      7436  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q1-guest]
> > root      7437  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q1-deall]
> > root      7438  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q2-guest]
> > root      7439  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q2-deall]
> > root      7440 48.1  0.0      0     0 ?        R    09:23  10:55 [vif4.0-q3-guest]
> > root      7441  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q3-deall]
> > root      9724  0.0  0.0   9244  1520 pts/0    S+   09:46   0:00 grep --color=auto
> >
> >
> > 2. The "rx" related content in /proc/interrupts in receiver DomU (on Host B):
> >
> > 73:        2       0 2925405       0   xen-dyn-event   eth0-q0-rx
> > 75:       43      93       0     118   xen-dyn-event   eth0-q1-rx
> > 77:        2    3376      14    1983   xen-dyn-event   eth0-q2-rx
> > 79:  2414666       0       9       0   xen-dyn-event   eth0-q3-rx
> >
> > As shown above, it seems that only q0 and q3 handle the interrupts
> > triggered by packet receiving.
> >
> > Any advice? Thanks.
> 
> Netback selects queue based on the return value of skb_get_queue_mapping.
> The queue mapping is set by core driver or ndo_select_queue (if specified by
> individual driver). In this case netback doesn't have its implementation of
> ndo_select_queue, so it's up to core driver to decide which queue to dispatch
> the packet to.  I think you need to inspect why Dom0 only steers traffic to
> these two queues but not all of them.
> 
> Don't know which utility is handy for this job. Probably tc(8) is useful?

Thanks Wei.

I think the reason only two netback/netfront processes work hard in the results above is the queue selection method. I have tried sending packets from multiple hosts/VMs to one VM, and a few times all of the netback/netfront processes ran with high CPU usage.
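
To make the per-queue imbalance easier to see, the "rx" lines from /proc/interrupts quoted earlier in this message can be summed per queue. This is only a sketch: the sample text is copied from the quoted output, while the function name `per_queue_totals` and the 100,000-interrupt "busy" threshold are arbitrary assumptions for illustration.

```python
# The four "rx" lines quoted in this message (one count column per vCPU).
SAMPLE = """\
73: 2 0 2925405 0 xen-dyn-event eth0-q0-rx
75: 43 93 0 118 xen-dyn-event eth0-q1-rx
77: 2 3376 14 1983 xen-dyn-event eth0-q2-rx
79: 2414666 0 9 0 xen-dyn-event eth0-q3-rx
"""

def per_queue_totals(text):
    """Sum the per-CPU interrupt counts of each line, keyed by its last field."""
    totals = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the CPU header line and anything that is not "<irq>: ...".
        if len(fields) < 3 or not fields[0].endswith(":"):
            continue
        counts = []
        for field in fields[1:]:
            if not field.isdigit():
                break  # reached the interrupt type / name columns
            counts.append(int(field))
        totals[fields[-1]] = sum(counts)
    return totals

totals = per_queue_totals(SAMPLE)
# Arbitrary threshold, chosen only to separate busy from nearly idle queues.
busy = sorted(name for name, count in totals.items() if count > 100_000)
print(totals)
print("busy queues:", busy)
```

On the quoted sample this reports eth0-q0-rx and eth0-q3-rx as the only busy queues, matching the two active vif4.0-q0-guest and vif4.0-q3-guest threads in Dom0.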

However, I have found another issue. Even when using 6 queues and making sure that all 6 netback processes run with high CPU usage (indeed, each of them runs at about 87% CPU), the total VM receive throughput is not much higher than with 4 queues. The results range from 4.5Gbps to 5.04Gbps using TCP with a 512-byte length and from 4.3Gbps to 5.78Gbps using TCP with a 1460-byte length.

According to the test results on the wiki: http://wiki.xen.org/wiki/Xen-netback_and_xen-netfront_multi-queue_performance_testing, VM receive throughput is also much lower than VM transmit throughput.

I am wondering why VM receive throughput cannot reach 8-10Gbps under multi-queue, as VM transmit does. I also tried sending packets directly from the local Dom0 to DomU, and the DomU receive throughput can reach about 8-12Gbps, so I am also wondering why transmitting packets from Dom0 to a remote DomU can only reach about 4-5Gbps.

> Wei.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Poor network performance between DomU with multiqueue support
  2014-12-03 14:43           ` Zhangleiqiang (Trump)
@ 2014-12-04 10:50             ` Wei Liu
  2014-12-04 12:09               ` Zhangleiqiang (Trump)
  2014-12-05  1:17               ` Zhangleiqiang (Trump)
  0 siblings, 2 replies; 35+ messages in thread
From: Wei Liu @ 2014-12-04 10:50 UTC (permalink / raw)
  To: Zhangleiqiang (Trump)
  Cc: Luohao (brian), Wei Liu, Zhuangyuxin, zhangleiqiang, Yuzhou (C),
	xen-devel@lists.xen.org, Xiaoding (B)

On Wed, Dec 03, 2014 at 02:43:37PM +0000, Zhangleiqiang (Trump) wrote:
> > -----Original Message-----
> > From: Wei Liu [mailto:wei.liu2@citrix.com]
> > Sent: Tuesday, December 02, 2014 11:59 PM
> > To: Zhangleiqiang (Trump)
> > Cc: Wei Liu; zhangleiqiang; xen-devel@lists.xen.org; Luohao (brian); Xiaoding
> > (B); Yuzhou (C); Zhuangyuxin
> > Subject: Re: [Xen-devel] Poor network performance between DomU with
> > multiqueue support
> > 
> > On Tue, Dec 02, 2014 at 02:46:36PM +0000, Zhangleiqiang (Trump) wrote:
> > > Thanks for your reply, Wei.
> > >
> > > I do the following testing just now and found the results as follows:
> > >
> > > There are three DomUs (4U4G) are running on Host A (6U6G) and one DomU
> > (4U4G) is running on Host B (6U6G), I send packets from three DomUs to the
> > DomU on Host B simultaneously.
> > >
> > > 1. The "top" output of Host B as follows:
> > >
> > > top - 09:42:11 up  1:07,  2 users,  load average: 2.46, 1.90, 1.47
> > > Tasks: 173 total,   4 running, 169 sleeping,   0 stopped,   0 zombie
> > > %Cpu0  :  0.0 us,  0.0 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.8
> > > si,  1.9 st
> > > %Cpu1  :  0.0 us, 27.0 sy,  0.0 ni, 63.1 id,  0.0 wa,  0.0 hi,  9.5
> > > si,  0.4 st
> > > %Cpu2  :  0.0 us, 90.0 sy,  0.0 ni,  8.3 id,  0.0 wa,  0.0 hi,  1.7
> > > si,  0.0 st
> > > %Cpu3  :  0.4 us,  1.4 sy,  0.0 ni, 95.4 id,  0.0 wa,  0.0 hi,  1.4
> > > si,  1.4 st
> > > %Cpu4  :  0.0 us, 60.2 sy,  0.0 ni, 39.5 id,  0.0 wa,  0.0 hi,  0.3
> > > si,  0.0 st
> > > %Cpu5  :  0.0 us,  2.8 sy,  0.0 ni, 89.4 id,  0.0 wa,  0.0 hi,  6.9 si,  0.9
> > st
> > > KiB Mem:   4517144 total,  3116480 used,  1400664 free,      876
> > buffers
> > > KiB Swap:  2103292 total,        0 used,  2103292 free.  2374656
> > cached Mem
> > >
> > >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM
> > TIME+ COMMAND
> > >  7440 root      20   0       0      0      0 R 71.10 0.000
> > 8:15.38 vif4.0-q3-guest
> > >  7434 root      20   0       0      0      0 R 59.14 0.000
> > 9:00.58 vif4.0-q0-guest
> > >    18 root      20   0       0      0      0 R 33.89 0.000
> > 2:35.06 ksoftirqd/2
> > >    28 root      20   0       0      0      0 S 20.93 0.000
> > 3:01.81 ksoftirqd/4
> > >
> > >
> > > As shown above, only two netback related processes (vif4.0-*) are running
> > with high cpu usage, and the other 2 netback processes are idle. The "ps"
> > result of vif4.0-* processes as follows:
> > >
> > > root      7434 50.5  0.0      0     0 ?        R    09:23  11:29
> > [vif4.0-q0-guest]
> > > root      7435  0.0  0.0      0     0 ?        S    09:23   0:00
> > [vif4.0-q0-deall]
> > > root      7436  0.0  0.0      0     0 ?        S    09:23   0:00
> > [vif4.0-q1-guest]
> > > root      7437  0.0  0.0      0     0 ?        S    09:23   0:00
> > [vif4.0-q1-deall]
> > > root      7438  0.0  0.0      0     0 ?        S    09:23   0:00
> > [vif4.0-q2-guest]
> > > root      7439  0.0  0.0      0     0 ?        S    09:23   0:00
> > [vif4.0-q2-deall]
> > > root      7440 48.1  0.0      0     0 ?        R    09:23  10:55
> > [vif4.0-q3-guest]
> > > root      7441  0.0  0.0      0     0 ?        S    09:23   0:00
> > [vif4.0-q3-deall]
> > > root      9724  0.0  0.0   9244  1520 pts/0    S+   09:46   0:00
> > grep --color=auto
> > >
> > >
> > > 2. The "rx" related content in /proc/interupts in receiver DomU (on Host B):
> > >
> > > 73: 	2		0		2925405		0			xen-dyn-event
> > 	eth0-q0-rx
> > > 75: 	43		93		0			118			xen-dyn-event
> > 	eth0-q1-rx
> > > 77: 	2		3376	14			1983		xen-dyn-event
> > 	eth0-q2-rx
> > > 79: 	2414666	0		9			0			xen-dyn-event
> > 	eth0-q3-rx
> > >
> > > As shown above, it seems like that only q0 and q3 handles the interrupt
> > triggered by packet receving.
> > >
> > > Any advise? Thanks.
> > 
> > Netback selects queue based on the return value of skb_get_queue_mapping.
> > The queue mapping is set by core driver or ndo_select_queue (if specified by
> > individual driver). In this case netback doesn't have its implementation of
> > ndo_select_queue, so it's up to core driver to decide which queue to dispatch
> > the packet to.  I think you need to inspect why Dom0 only steers traffic to
> > these two queues but not all of them.
> > 
> > Don't know which utility is handy for this job. Probably tc(8) is useful?
> 
> Thanks Wei.
> 

> I think the reason for the above results that only two
> netback/netfront processes works hard is the queue select method. I
> have tried to send packets from multiple host/vm to a vm, and all of
> the netback/netfront processes are running with high cpu usage a few
> times.
> 

A few times? You might want to check some patches to rework RX stall
detection by David Vrabel that went in after 3.16.

> However, I find another issue. Even using 6 queues and making sure
> that all of these 6 netback processes running with high cpu usage
> (indeed, any of it running with 87% cpu usage), the whole VM receive
> throughout is not very higher than results when using 4 queues. The
> results are from 4.5Gbps to 5.04 Gbps using TCP with 512 bytes length
> and 4.3Gbps to 5.78Gbps using TCP with 1460 bytes length.
> 

I would like to ask if you're still using 4U4G (4 CPU 4 G?)
configuration? If so, please make sure there are at least the same
number of vcpus as queues.

> According to the testing result from WIKI:
> http://wiki.xen.org/wiki/Xen-netback_and_xen-netfront_multi-queue_performance_testing,
> The VM receive throughput is also more lower than VM transmit. 
> 

I think that's expected, because guest RX data path still uses
grant_copy while guest TX uses grant_map to do zero-copy transmit.

> I am wondering why the VM receive throughout cannot be up to 8-10Gbps
> as VM transmit under multi-queue?  I also tried to send packets
> directly from Local Dom0 to DomU, the DomU receive throughput can
> reach about 8-12Gbps, so I am also wondering why transmitting packets
> from Dom0 to Remote DomU can only reach about 4-5Gbps throughout?

If data is from Dom0 to DomU then SKB is probably not fragmented by
network stack.  You can use tcpdump to check that.

Wei.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Poor network performance between DomU with multiqueue support
  2014-12-04 10:50             ` Wei Liu
@ 2014-12-04 12:09               ` Zhangleiqiang (Trump)
  2014-12-04 13:05                 ` Wei Liu
  2014-12-04 13:35                 ` Zoltan Kiss
  2014-12-05  1:17               ` Zhangleiqiang (Trump)
  1 sibling, 2 replies; 35+ messages in thread
From: Zhangleiqiang (Trump) @ 2014-12-04 12:09 UTC (permalink / raw)
  To: Wei Liu, xen-devel@lists.xen.org
  Cc: Xiaoding (B), Zhuangyuxin, zhangleiqiang, Luohao (brian),
	Yuzhou (C)

> -----Original Message-----
> From: Wei Liu [mailto:wei.liu2@citrix.com]
> Sent: Thursday, December 04, 2014 6:50 PM
> To: Zhangleiqiang (Trump)
> Cc: Wei Liu; xen-devel@lists.xen.org; zhangleiqiang; Luohao (brian); Xiaoding
> (B); Yuzhou (C); Zhuangyuxin
> Subject: Re: [Xen-devel] Poor network performance between DomU with
> multiqueue support
> 
> On Wed, Dec 03, 2014 at 02:43:37PM +0000, Zhangleiqiang (Trump) wrote:
> > > -----Original Message-----
> > > From: Wei Liu [mailto:wei.liu2@citrix.com]
> > > Sent: Tuesday, December 02, 2014 11:59 PM
> > > To: Zhangleiqiang (Trump)
> > > Cc: Wei Liu; zhangleiqiang; xen-devel@lists.xen.org; Luohao (brian);
> > > Xiaoding (B); Yuzhou (C); Zhuangyuxin
> > > Subject: Re: [Xen-devel] Poor network performance between DomU with
> > > multiqueue support
> > >
> > > On Tue, Dec 02, 2014 at 02:46:36PM +0000, Zhangleiqiang (Trump) wrote:
> > > > Thanks for your reply, Wei.
> > > >
> > > > I do the following testing just now and found the results as follows:
> > > >
> > > > There are three DomUs (4U4G) are running on Host A (6U6G) and one
> > > > DomU
> > > (4U4G) is running on Host B (6U6G), I send packets from three DomUs
> > > to the DomU on Host B simultaneously.
> > > >
> > > > 1. The "top" output of Host B as follows:
> > > >
> > > > top - 09:42:11 up  1:07,  2 users,  load average: 2.46, 1.90, 1.47
> > > > Tasks: 173 total,   4 running, 169 sleeping,   0 stopped,   0 zombie
> > > > %Cpu0  :  0.0 us,  0.0 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,
> > > > 0.8 si,  1.9 st
> > > > %Cpu1  :  0.0 us, 27.0 sy,  0.0 ni, 63.1 id,  0.0 wa,  0.0 hi,
> > > > 9.5 si,  0.4 st
> > > > %Cpu2  :  0.0 us, 90.0 sy,  0.0 ni,  8.3 id,  0.0 wa,  0.0 hi,
> > > > 1.7 si,  0.0 st
> > > > %Cpu3  :  0.4 us,  1.4 sy,  0.0 ni, 95.4 id,  0.0 wa,  0.0 hi,
> > > > 1.4 si,  1.4 st
> > > > %Cpu4  :  0.0 us, 60.2 sy,  0.0 ni, 39.5 id,  0.0 wa,  0.0 hi,
> > > > 0.3 si,  0.0 st
> > > > %Cpu5  :  0.0 us,  2.8 sy,  0.0 ni, 89.4 id,  0.0 wa,  0.0 hi,
> > > > 6.9 si,  0.9
> > > st
> > > > KiB Mem:   4517144 total,  3116480 used,  1400664 free,      876
> > > buffers
> > > > KiB Swap:  2103292 total,        0 used,  2103292 free.  2374656
> > > cached Mem
> > > >
> > > >   PID USER      PR  NI    VIRT    RES    SHR
> S  %CPU  %MEM
> > > TIME+ COMMAND
> > > >  7440 root      20   0       0      0      0 R 71.10 0.000
> > > 8:15.38 vif4.0-q3-guest
> > > >  7434 root      20   0       0      0      0 R 59.14 0.000
> > > 9:00.58 vif4.0-q0-guest
> > > >    18 root      20   0       0      0      0 R 33.89 0.000
> > > 2:35.06 ksoftirqd/2
> > > >    28 root      20   0       0      0      0 S 20.93 0.000
> > > 3:01.81 ksoftirqd/4
> > > >
> > > >
> > > > As shown above, only two netback related processes (vif4.0-*) are
> > > > running
> > > with high cpu usage, and the other 2 netback processes are idle. The "ps"
> > > result of vif4.0-* processes as follows:
> > > >
> > > > root      7434 50.5  0.0      0     0 ?        R    09:23
> 11:29
> > > [vif4.0-q0-guest]
> > > > root      7435  0.0  0.0      0     0 ?        S    09:23
> 0:00
> > > [vif4.0-q0-deall]
> > > > root      7436  0.0  0.0      0     0 ?        S    09:23
> 0:00
> > > [vif4.0-q1-guest]
> > > > root      7437  0.0  0.0      0     0 ?        S    09:23
> 0:00
> > > [vif4.0-q1-deall]
> > > > root      7438  0.0  0.0      0     0 ?        S    09:23
> 0:00
> > > [vif4.0-q2-guest]
> > > > root      7439  0.0  0.0      0     0 ?        S    09:23
> 0:00
> > > [vif4.0-q2-deall]
> > > > root      7440 48.1  0.0      0     0 ?        R    09:23
> 10:55
> > > [vif4.0-q3-guest]
> > > > root      7441  0.0  0.0      0     0 ?        S    09:23
> 0:00
> > > [vif4.0-q3-deall]
> > > > root      9724  0.0  0.0   9244  1520 pts/0    S+   09:46
> 0:00
> > > grep --color=auto
> > > >
> > > >
> > > > 2. The "rx" related content in /proc/interupts in receiver DomU (on Host
> B):
> > > >
> > > > 73: 	2		0		2925405		0			xen-dyn-event
> > > 	eth0-q0-rx
> > > > 75: 	43		93		0			118			xen-dyn-event
> > > 	eth0-q1-rx
> > > > 77: 	2		3376	14			1983		xen-dyn-event
> > > 	eth0-q2-rx
> > > > 79: 	2414666	0		9			0			xen-dyn-event
> > > 	eth0-q3-rx
> > > >
> > > > As shown above, it seems like that only q0 and q3 handles the
> > > > interrupt
> > > triggered by packet receving.
> > > >
> > > > Any advise? Thanks.
> > >
> > > Netback selects queue based on the return value of
> skb_get_queue_mapping.
> > > The queue mapping is set by core driver or ndo_select_queue (if
> > > specified by individual driver). In this case netback doesn't have
> > > its implementation of ndo_select_queue, so it's up to core driver to
> > > decide which queue to dispatch the packet to.  I think you need to
> > > inspect why Dom0 only steers traffic to these two queues but not all of
> them.
> > >
> > > Don't know which utility is handy for this job. Probably tc(8) is useful?
> >
> > Thanks Wei.
> >
> 
> > I think the reason for the above results that only two
> > netback/netfront processes works hard is the queue select method. I
> > have tried to send packets from multiple host/vm to a vm, and all of
> > the netback/netfront processes are running with high cpu usage a few
> > times.
> >
> 
> A few times? You might want to check some patches to rework RX stall
> detection by David Vrabel that went in after 3.16.

Thanks for your suggestion. I have switched to the latest stable branch, 3.17.4, and found that the patches you mentioned are not merged in this branch either. I will apply those patches and try again.

> > However, I find another issue. Even using 6 queues and making sure
> > that all of these 6 netback processes running with high cpu usage
> > (indeed, any of it running with 87% cpu usage), the whole VM receive
> > throughout is not very higher than results when using 4 queues. The
> > results are from 4.5Gbps to 5.04 Gbps using TCP with 512 bytes length
> > and 4.3Gbps to 5.78Gbps using TCP with 1460 bytes length.
> >
> 
> I would like to ask if you're still using 4U4G (4 CPU 4 G?) configuration? If so,
> please make sure there are at least the same number of vcpus as queues.

Sorry for misleading you; 4U4G means 4 CPUs and 4 GB of memory :). Yesterday I also found that the max_queue of netback is determined by min(online_cpu, module_param), so when using 6 queues in the previous test, I used a VM with 6 CPUs and 6 GB of memory.
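
The min(online_cpu, module_param) observation can be written down as a small sketch. The function names below are hypothetical, and the second function, which assumes the number of queues actually used is further limited by whatever the other end supports, is an assumption rather than something stated in this thread.

```python
def effective_max_queues(online_cpus, module_param):
    """Queue limit on one side, per the min(online_cpu, module_param) observation."""
    return min(online_cpus, module_param)

def negotiated_queues(backend_limit, frontend_limit):
    """ASSUMPTION: both ends must support a queue for it to be used."""
    return min(backend_limit, frontend_limit)

# A 4-vCPU domain cannot use 6 queues, even with the parameter set to 6.
print(effective_max_queues(online_cpus=4, module_param=6))

# With 6 vCPUs and the parameter set to 6 on both ends, 6 queues are possible.
backend = effective_max_queues(online_cpus=6, module_param=6)
frontend = effective_max_queues(online_cpus=6, module_param=6)
print(negotiated_queues(backend, frontend))
```

This is why the 6-queue test needed a domain with at least 6 vCPUs.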

> > According to the testing result from WIKI:
> > http://wiki.xen.org/wiki/Xen-netback_and_xen-netfront_multi-queue_perf
> > ormance_testing, The VM receive throughput is also more lower than VM
> > transmit.
> >
> 
> I think that's expected, because guest RX data path still uses grant_copy while
> guest TX uses grant_map to do zero-copy transmit.

As I understand it, the RX process is as follows:
1. The physical NIC receives a packet.
2. The Xen hypervisor triggers an interrupt to Dom0.
3. Dom0's NIC driver performs the "RX" operation, and the packet is stored in an SKB that is also owned by/shared with netback.
4. Netback notifies netfront through the event channel that a packet is arriving.
5. Netfront grants a buffer for receiving and passes the GR to netback through the IO ring (if the grant-reuse mechanism is used, netfront just passes the existing GR to netback).
6. Netback does the grant_copy to copy the packet from its SKB to the buffer referenced by the GR, and notifies netfront through the event channel.
7. Netfront copies the data from the buffer to the user-level app's SKB.

Am I right? Why not use zero-copy in the guest RX data path as well?


> > I am wondering why the VM receive throughout cannot be up to 8-10Gbps
> > as VM transmit under multi-queue?  I also tried to send packets
> > directly from Local Dom0 to DomU, the DomU receive throughput can
> > reach about 8-12Gbps, so I am also wondering why transmitting packets
> > from Dom0 to Remote DomU can only reach about 4-5Gbps throughout?
> 
> If data is from Dom0 to DomU then SKB is probably not fragmented by network
> stack.  You can use tcpdump to check that.

In our testing, the MTU is set to 1600. However, even when testing with packets of length 1024 (smaller than 1600), the throughput from Dom0 to the local DomU is much higher than that from Dom0 to a remote DomU. So fragmentation may not be the reason.


> Wei.
> 
> >
> > > Wei.
> > >
> > > > ----------
> > > > zhangleiqiang (Trump)
> > > >
> > > > Best Regards
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Wei Liu [mailto:wei.liu2@citrix.com]
> > > > > Sent: Tuesday, December 02, 2014 8:12 PM
> > > > > To: Zhangleiqiang (Trump)
> > > > > Cc: Wei Liu; zhangleiqiang; xen-devel@lists.xen.org; Luohao
> > > > > (brian); Xiaoding (B); Yuzhou (C); Zhuangyuxin
> > > > > Subject: Re: [Xen-devel] Poor network performance between DomU
> > > > > with multiqueue support
> > > > >
> > > > > On Tue, Dec 02, 2014 at 11:50:59AM +0000, Zhangleiqiang (Trump)
> wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: xen-devel-bounces@lists.xen.org
> > > > > > > [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of Wei
> > > > > > > Liu
> > > > > > > Sent: Tuesday, December 02, 2014 7:02 PM
> > > > > > > To: zhangleiqiang
> > > > > > > Cc: wei.liu2@citrix.com; xen-devel@lists.xen.org
> > > > > > > Subject: Re: [Xen-devel] Poor network performance between
> > > > > > > DomU with multiqueue support
> > > > > > >
> > > > > > > On Tue, Dec 02, 2014 at 04:30:49PM +0800, zhangleiqiang wrote:
> > > > > > > > Hi, all
> > > > > > > >     I am testing the performance of xen netfront-netback
> > > > > > > > driver that with
> > > > > > > multi-queues support. The throughput from domU to remote
> > > > > > > dom0 is 9.2Gb/s, but the throughput from domU to remote domU
> > > > > > > is only 3.6Gb/s, I think the bottleneck is the throughput
> > > > > > > from dom0 to local domU. However, we have done some testing
> > > > > > > and found the throughput from dom0 to local domU is 5.8Gb/s.
> > > > > > > >     And if we send packets from one DomU to other 3 DomUs
> > > > > > > > on different
> > > > > > > host simultaneously, the sum of throughout can reach 9Gbps.
> > > > > > > It seems like the bottleneck is the receiver?
> > > > > > > >     After some analysis, I found that even the max_queue
> > > > > > > > of netfront/back
> > > > > > > is set to 4, there are some strange results as follows:
> > > > > > > >     1. In domU, only one rx queue deal with softirq
> > > > > > >
> > > > > > > Try to bind irq to different vcpus?
> > > > > >
> > > > > > Do you mean we try to bind irq to different vcpus in DomU? I
> > > > > > will try it
> > > now.
> > > > > >
> > > > >
> > > > > Yes. Given the fact that you have two backend threads running
> > > > > while only one DomU vcpu is busy, it smells like misconfiguration in
> DomU.
> > > > >
> > > > > If this phenomenon persists after correctly binding irqs, you
> > > > > might want to check traffic is steering correctly to different queues.
> > > > >
> > > > > > >
> > > > > > > >     2. In dom0, only two netback queues process are
> > > > > > > > scheduled, other two
> > > > > > > process aren't scheduled.
> > > > > > >
> > > > > > > How many Dom0 vcpu do you have? If it only has two then
> > > > > > > there will only be two processes running at a time.
> > > > > >
> > > > > > Dom0 has 6 vcpus, and 6G memory. There are only one DomU
> > > > > > running in
> > > > > Dom0 and so four netback processes are running in Dom0 (because
> > > > > the max_queue param of netback kernel module is set to 4).
> > > > > > The phenomenon is that only 2 of these four netback process
> > > > > > were running
> > > > > with about 70% cpu usage, and another two use little CPU.
> > > > > > Is there a hash algorithm to determine which netback process
> > > > > > to handle the
> > > > > input packet?
> > > > > >
> > > > >
> > > > > I think that's whatever default algorithm Linux kernel is using.
> > > > >
> > > > > We don't currently support other algorithms.
> > > > >
> > > > > Wei.
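
On the hash question quoted above: when the core network stack picks the queue, a common scheme is to hash each flow's addresses and ports and reduce that hash to a queue index, so every packet of one flow lands on the same queue. The sketch below only illustrates that idea under this assumption. flow_hash is an FNV-1a stand-in, not the kernel's flow hash, and struct flow, pick_queue and busy_queues are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flow key: IPv4 addresses and TCP/UDP ports. */
struct flow { uint32_t saddr, daddr; uint16_t sport, dport; };

/* FNV-1a over the flow key; a stand-in, NOT the kernel's flow hash. */
static uint32_t flow_hash(const struct flow *f)
{
    const uint32_t words[3] = { f->saddr, f->daddr,
                                ((uint32_t)f->sport << 16) | f->dport };
    uint32_t h = 2166136261u;
    for (int i = 0; i < 3; i++)
        for (int b = 0; b < 4; b++) {
            h ^= (words[i] >> (8 * b)) & 0xffu;
            h *= 16777619u;
        }
    return h;
}

/* Every packet of a given flow maps to the same queue index. */
static unsigned pick_queue(const struct flow *f, unsigned num_queues)
{
    assert(num_queues > 0);
    return flow_hash(f) % num_queues;
}

/* Count how many of num_queues queues receive at least one flow. */
static unsigned busy_queues(const struct flow *flows, unsigned n,
                            unsigned num_queues)
{
    unsigned used = 0, mask = 0;
    assert(num_queues > 0 && num_queues <= 32);
    for (unsigned i = 0; i < n; i++)
        mask |= 1u << pick_queue(&flows[i], num_queues);
    for (unsigned q = 0; q < num_queues; q++)
        used += (mask >> q) & 1u;
    return used;
}
```

With only a handful of long-lived flows (for example one TCP stream per sender), no more than that many queues can be busy, and two flows may share a queue, which would look like the 2-of-4 pattern reported in this thread.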

* Re: Poor network performance between DomU with multiqueue support
  2014-12-04 12:09               ` Zhangleiqiang (Trump)
@ 2014-12-04 13:05                 ` Wei Liu
  2014-12-04 14:37                   ` Zhangleiqiang (Trump)
  2014-12-04 13:35                 ` Zoltan Kiss
  1 sibling, 1 reply; 35+ messages in thread
From: Wei Liu @ 2014-12-04 13:05 UTC (permalink / raw)
  To: Zhangleiqiang (Trump)
  Cc: Luohao (brian), Wei Liu, Zhuangyuxin, zhangleiqiang, Yuzhou (C),
	xen-devel@lists.xen.org, Xiaoding (B)

On Thu, Dec 04, 2014 at 12:09:33PM +0000, Zhangleiqiang (Trump) wrote:
[...]
> > > However, I find another issue. Even using 6 queues and making sure
> > > that all of these 6 netback processes running with high cpu usage
> > > (indeed, any of it running with 87% cpu usage), the whole VM receive
> > > throughout is not very higher than results when using 4 queues. The
> > > results are from 4.5Gbps to 5.04 Gbps using TCP with 512 bytes length
> > > and 4.3Gbps to 5.78Gbps using TCP with 1460 bytes length.
> > >
> > 
> > I would like to ask if you're still using 4U4G (4 CPU 4 G?) configuration? If so,
> > please make sure there are at least the same number of vcpus as queues.
> 

> Sorry for misleading you, 4U4G means 4 CPU and 4 G memory, :). I also
> found that the max_queue of netback is determinated by min(online_cpu,
> module_param) yesterday, so when using 6 queues in the previous
> testing, I used VM with 6 CPU and 6 G Memory.

> 
> > > According to the testing result from WIKI:
> > > http://wiki.xen.org/wiki/Xen-netback_and_xen-netfront_multi-queue_performance_testing,
> > > The VM receive throughput is also more lower than VM transmit.
> > >
> > 
> > I think that's expected, because guest RX data path still uses grant_copy while
> > guest TX uses grant_map to do zero-copy transmit.
> 
> As I understand, the RX process is as follows: 
> 1. Phy NIC receive packet
> 2. XEN Hypervisor trigger interrupt to Dom0
> 3. Dom0' s NIC driver do the "RX" operation, and the packet is stored into SKB which is also owned/shared with netback
> 4. NetBack notify netfront through event channel that a packet is receiving
> 5. Netfront grant a buffer for receiving and notify netback the GR (if using grant-resue mechanism, netfront just notify the GR to netback) through IO Ring
> 6. NetBack do the grant_copy to copy packet from its SKB to the buffer referenced by GR, and notify netfront through event channel
> 7. Netfront copy the data from buffer to user-level app's SKB
> 
> Am I right?

Step 4 is not correct, netback won't notify netfront at that point.

Step 5 is not correct, all grant refs are pre-allocated and
granted before that.

Other steps look correct.

> Why not using zero-copy transmit in guest RX data pash too ?
> 

A rogue or buggy guest might hold the mapping for an arbitrarily long
period of time.

> 
> > > I am wondering why the VM receive throughout cannot be up to 8-10Gbps
> > > as VM transmit under multi-queue?  I also tried to send packets
> > > directly from Local Dom0 to DomU, the DomU receive throughput can
> > > reach about 8-12Gbps, so I am also wondering why transmitting packets
> > > from Dom0 to Remote DomU can only reach about 4-5Gbps throughout?
> > 
> > If data is from Dom0 to DomU then SKB is probably not fragmented by network
> > stack.  You can use tcpdump to check that.
> 
> In our testing , the MTU is set to 1600. However, even testing with
> packets whose length are 1024 (small than 1600), the throughout
> between Dom0 to Local DomU is more higher than that between Dom0 to
> Remote DomU. So maybe the fragment is not the reason for it.
> 

Don't have much idea about this, sorry.

Wei.


* Re: Poor network performance between DomU with multiqueue support
  2014-12-04 12:09               ` Zhangleiqiang (Trump)
  2014-12-04 13:05                 ` Wei Liu
@ 2014-12-04 13:35                 ` Zoltan Kiss
  2014-12-04 14:31                   ` Zhangleiqiang (Trump)
  1 sibling, 1 reply; 35+ messages in thread
From: Zoltan Kiss @ 2014-12-04 13:35 UTC (permalink / raw)
  To: Zhangleiqiang (Trump), Wei Liu, xen-devel@lists.xen.org
  Cc: Xiaoding (B), Zhuangyuxin, zhangleiqiang, Luohao (brian),
	Yuzhou (C)



On 04/12/14 12:09, Zhangleiqiang (Trump) wrote:
>> I think that's expected, because guest RX data path still uses grant_copy while
>> guest TX uses grant_map to do zero-copy transmit.
> As I understand, the RX process is as follows:
> 1. Phy NIC receive packet
> 2. XEN Hypervisor trigger interrupt to Dom0
> 3. Dom0' s NIC driver do the "RX" operation, and the packet is stored into SKB which is also owned/shared with netback
Not that easy. There is something between the NIC driver and netback 
which directs the packets, e.g. the old bridge driver, ovs, or the IP 
stack of the kernel.
> 4. NetBack notify netfront through event channel that a packet is receiving
> 5. Netfront grant a buffer for receiving and notify netback the GR (if using grant-resue mechanism, netfront just notify the GR to netback) through IO Ring
It looks a bit confusing in the code, but netfront puts "requests" on the
ring buffer, each containing the grant ref of a guest page that the
backend can copy into. When a packet arrives, netback consumes these
requests and sends back a response telling the guest that the grant copy
of the packet has finished and that it can start handling the data.
(Sending a response means placing a response in the ring and triggering
the event channel.)
Ideally, netback should always have requests available in the ring, so it
doesn't have to wait for the guest to refill it.
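
The request/response flow described above can be sketched as a toy model. This is only an illustration, not the netfront/netback code: the struct names, RING_SIZE, PAGE_BYTES and both helper functions are hypothetical, the grant copy is simulated with memcpy, and the event channel is just a counter. Real Xen rings also share slots between requests and responses, which this sketch does not model.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RING_SIZE  4     /* hypothetical; real rings are larger */
#define PAGE_BYTES 64    /* hypothetical buffer size for the demo */

/* A guest buffer, identified by its index used as a pretend grant ref. */
struct guest_page { unsigned char data[PAGE_BYTES]; };

struct rx_request  { int gref; };              /* posted by the frontend */
struct rx_response { int gref; size_t len; };  /* posted by the backend  */

struct rx_ring {
    struct rx_request  req[RING_SIZE];
    struct rx_response rsp[RING_SIZE];
    unsigned req_prod, req_cons;  /* frontend produces, backend consumes */
    unsigned rsp_prod;            /* backend produces responses */
    unsigned events;              /* simulated event-channel notifications */
};

/* Frontend: pre-post an empty buffer so the backend need not wait. */
static int frontend_post_request(struct rx_ring *r, int gref)
{
    if (r->req_prod - r->req_cons == RING_SIZE)
        return -1;                              /* ring is full */
    r->req[r->req_prod % RING_SIZE].gref = gref;
    r->req_prod++;
    return 0;
}

/* Backend: consume one request, copy the packet, then post a response. */
static int backend_deliver(struct rx_ring *r, struct guest_page *pages,
                           const void *pkt, size_t len)
{
    int gref;

    if (r->req_cons == r->req_prod)
        return -1;                  /* no buffer posted: backend must wait */
    assert(len <= PAGE_BYTES);
    gref = r->req[r->req_cons % RING_SIZE].gref;
    r->req_cons++;
    memcpy(pages[gref].data, pkt, len);         /* stands in for grant copy */
    r->rsp[r->rsp_prod % RING_SIZE].gref = gref;
    r->rsp[r->rsp_prod % RING_SIZE].len = len;
    r->rsp_prod++;
    r->events++;                    /* stands in for the event channel kick */
    return gref;
}
```

In this model the backend only stalls when the frontend has not posted a request yet, which is why the ring should ideally never run empty.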

> 6. NetBack do the grant_copy to copy packet from its SKB to the buffer referenced by GR, and notify netfront through event channel
> 7. Netfront copy the data from buffer to user-level app's SKB
Or wherever that SKB should go, yes. Like with any received packet on a 
real network interface.
>
> Am I right? Why not using zero-copy transmit in guest RX data pash too ?
Because that means you are mapping that memory into the guest, and you
have no guarantee of when the guest will release it. And netback
can't just unmap it forcibly after a timeout, because finding a
correct timeout value would be quite impossible.
A malicious/buggy/overloaded guest can hold on to Dom0 memory
indefinitely, and it becomes even worse if the memory came from another
guest: you can't shut down that guest, for example, until all its memory
is returned to it.

Regards,

Zoli

* Re: Poor network performance between DomU with multiqueue support
  2014-12-04 13:35                 ` Zoltan Kiss
@ 2014-12-04 14:31                   ` Zhangleiqiang (Trump)
  2014-12-05 15:20                     ` Zoltan Kiss
  0 siblings, 1 reply; 35+ messages in thread
From: Zhangleiqiang (Trump) @ 2014-12-04 14:31 UTC (permalink / raw)
  To: Zoltan Kiss, xen-devel@lists.xen.org
  Cc: Xiaoding (B), Zhuangyuxin, zhangleiqiang, Luohao (brian),
	Yuzhou (C)

> -----Original Message-----
> From: Zoltan Kiss [mailto:zoltan.kiss@linaro.org]
> Sent: Thursday, December 04, 2014 9:35 PM
> To: Zhangleiqiang (Trump); Wei Liu; xen-devel@lists.xen.org
> Cc: Xiaoding (B); Zhuangyuxin; zhangleiqiang; Luohao (brian); Yuzhou (C)
> Subject: Re: [Xen-devel] Poor network performance between DomU with
> multiqueue support
> 
> 
> 
> On 04/12/14 12:09, Zhangleiqiang (Trump) wrote:
> >> I think that's expected, because guest RX data path still uses
> >> grant_copy while guest TX uses grant_map to do zero-copy transmit.
> > As I understand, the RX process is as follows:
> > 1. Phy NIC receive packet
> > 2. XEN Hypervisor trigger interrupt to Dom0
> > 3. Dom0's NIC driver do the "RX" operation, and the packet is stored
> >    into SKB which is also owned/shared with netback
> Not that easy. There is something between the NIC driver and netback which
> directs the packets, e.g. the old bridge driver, ovs, or the IP stack of the kernel.
> > 4. NetBack notify netfront through event channel that a packet is
> >    receiving
> > 5. Netfront grant a buffer for receiving and notify netback the GR
> >    (if using grant-resue mechanism, netfront just notify the GR to
> >    netback) through IO Ring
> It looks a bit confusing in the code, but netfront put "requests" on the ring
> buffer, which contains the grant ref of the guest page where the backend can
> copy. When the packet comes, netback consumes these requests and send
> back a response telling the guest the grant copy of the packet finished, it can
> start handling the data. (sending a response means it's placing a response in
> the ring and trigger the event channel) And ideally netback should always have
> requests in the ring, so it doesn't have to wait for the guest to fill it up.

> > 6. NetBack do the grant_copy to copy packet from its SKB to the buffer
> >    referenced by GR, and notify netfront through event channel
> > 7. Netfront copy the data from buffer to user-level app's SKB
> Or wherever that SKB should go, yes. Like with any received packet on a real
> network interface.
> >
> > Am I right? Why not using zero-copy transmit in guest RX data pash too ?
> Because that means you are mapping that memory to the guest, and you won't
> have any guarantee when the guest will release them. And netback can't just
> unmap them forcibly after a timeout, because finding a correct timeout value
> would be quite impossible.
> A malicious/buggy/overloaded guest can hold on to Dom0 memory indefinitely,
> but it even becomes worse if the memory came from another
> guest: you can't shutdown that guest for example, until all its memory is
> returned to him.

Thanks for your detailed explanation of the RX data path; I understand it now. :)

About the issue mentioned in my previous mail (poor performance from DomU to DomU, but high throughput from Dom0 to remote Dom0/DomU): do you have any idea about it?

I am wondering whether netfront/netback can be optimized to reach 10Gbps throughput between DomUs running on different hosts connected by a 10GE network. Currently TX does not seem to be the bottleneck, because we can reach an aggregate throughput of 9Gbps when sending packets from one DomU to 3 other DomUs running on a different host. So I think the bottleneck may be RX. Do you agree?

I am also wondering what the main reason is that prevents RX from reaching higher throughput. Compared to KVM+virtio+vhost, which can reach high throughput, the RX path has an extra grant copy operation, which may be one reason. Do you have any thoughts on this as well?

> 
> Regards,
> 
> Zoli

* Re: Poor network performance between DomU with multiqueue support
  2014-12-04 13:05                 ` Wei Liu
@ 2014-12-04 14:37                   ` Zhangleiqiang (Trump)
  0 siblings, 0 replies; 35+ messages in thread
From: Zhangleiqiang (Trump) @ 2014-12-04 14:37 UTC (permalink / raw)
  To: Wei Liu, xen-devel@lists.xen.org
  Cc: Xiaoding (B), Zhuangyuxin, zhangleiqiang, Luohao (brian),
	Yuzhou (C)

Thanks for your detailed explanation, Wei.

I am wondering whether netfront/netback can be optimized to reach 10Gbps throughput between DomUs running on different hosts connected by a 10GE network. Currently RX seems to be the bottleneck, which is also consistent with the test results on the Xen wiki: http://wiki.xen.org/wiki/Xen-netback_and_xen-netfront_multi-queue_performance_testing

I am wondering what factors prevent RX from reaching higher throughput. You mentioned that one reason is that the guest RX data path still uses grant_copy while guest TX uses grant_map to do zero-copy transmit. Do you know of any other factors, or of ongoing work to optimize the RX data path?

----------
zhangleiqiang (Trump)

Best Regards


> -----Original Message-----
> From: Wei Liu [mailto:wei.liu2@citrix.com]
> Sent: Thursday, December 04, 2014 9:06 PM
> To: Zhangleiqiang (Trump)
> Cc: Wei Liu; xen-devel@lists.xen.org; zhangleiqiang; Luohao (brian); Xiaoding
> (B); Yuzhou (C); Zhuangyuxin
> Subject: Re: [Xen-devel] Poor network performance between DomU with
> multiqueue support
> 
> On Thu, Dec 04, 2014 at 12:09:33PM +0000, Zhangleiqiang (Trump) wrote:
> [...]
> > > > However, I find another issue. Even using 6 queues and making sure
> > > > that all of these 6 netback processes running with high cpu usage
> > > > (indeed, any of it running with 87% cpu usage), the whole VM
> > > > receive throughout is not very higher than results when using 4
> > > > queues. The results are from 4.5Gbps to 5.04 Gbps using TCP with
> > > > 512 bytes length and 4.3Gbps to 5.78Gbps using TCP with 1460 bytes
> > > > length.
> > > >
> > >
> > > I would like to ask if you're still using 4U4G (4 CPU 4 G?)
> > > configuration? If so, please make sure there are at least the same
> > > number of vcpus as queues.
> >
> 
> > Sorry for misleading you, 4U4G means 4 CPU and 4 G memory, :). I also
> > found that the max_queue of netback is determinated by min(online_cpu,
> > module_param) yesterday, so when using 6 queues in the previous
> > testing, I used VM with 6 CPU and 6 G Memory.
> 
> >
> > > > According to the testing result from WIKI:
> > > > http://wiki.xen.org/wiki/Xen-netback_and_xen-netfront_multi-queue_performance_testing,
> > > > The VM receive throughput is also more lower than VM transmit.
> > > >
> > >
> > > I think that's expected, because guest RX data path still uses
> > > grant_copy while guest TX uses grant_map to do zero-copy transmit.
> >
> > As I understand, the RX process is as follows:
> > 1. Phy NIC receive packet
> > 2. XEN Hypervisor trigger interrupt to Dom0
> > 3. Dom0's NIC driver do the "RX" operation, and the packet is stored
> >    into SKB which is also owned/shared with netback
> > 4. NetBack notify netfront through event channel that a packet is
> >    receiving
> > 5. Netfront grant a buffer for receiving and notify netback the GR
> >    (if using grant-resue mechanism, netfront just notify the GR to
> >    netback) through IO Ring
> > 6. NetBack do the grant_copy to copy packet from its SKB to the buffer
> >    referenced by GR, and notify netfront through event channel
> > 7. Netfront copy the data from buffer to user-level app's SKB
> >
> > Am I right?
> 
> Step 4 is not correct, netback won't notify netfront at that point.
> 
> Step 5 is not correct, all grant refs are pre-allocated and granted before that.
> 
> Other steps look correct.
> 
> > Why not using zero-copy transmit in guest RX data pash too ?
> >
> 
> A rogue / buggy guest might hold the mapping for arbitrary long period of time.
> 
> >
> > > > I am wondering why the VM receive throughout cannot be up to
> > > > 8-10Gbps as VM transmit under multi-queue?  I also tried to send
> > > > packets directly from Local Dom0 to DomU, the DomU receive
> > > > throughput can reach about 8-12Gbps, so I am also wondering why
> > > > transmitting packets from Dom0 to Remote DomU can only reach about
> > > > 4-5Gbps throughout?
> > >
> > > If data is from Dom0 to DomU then SKB is probably not fragmented by
> > > network stack.  You can use tcpdump to check that.
> >
> > In our testing , the MTU is set to 1600. However, even testing with
> > packets whose length are 1024 (small than 1600), the throughout
> > between Dom0 to Local DomU is more higher than that between Dom0 to
> > Remote DomU. So maybe the fragment is not the reason for it.
> >
> 
> Don't have much idea about this, sorry.
> 
> Wei.

* Re: Poor network performance between DomU with multiqueue support
  2014-12-04 10:50             ` Wei Liu
  2014-12-04 12:09               ` Zhangleiqiang (Trump)
@ 2014-12-05  1:17               ` Zhangleiqiang (Trump)
  2014-12-05 12:42                 ` Wei Liu
  1 sibling, 1 reply; 35+ messages in thread
From: Zhangleiqiang (Trump) @ 2014-12-05  1:17 UTC (permalink / raw)
  To: Wei Liu, xen-devel@lists.xen.org
  Cc: Xiaoding (B), Zhuangyuxin, zhangleiqiang, Luohao (brian),
	Yuzhou (C)

> -----Original Message-----
> From: Wei Liu [mailto:wei.liu2@citrix.com]
> Sent: Thursday, December 04, 2014 6:50 PM
> To: Zhangleiqiang (Trump)
> Cc: Wei Liu; xen-devel@lists.xen.org; zhangleiqiang; Luohao (brian); Xiaoding
> (B); Yuzhou (C); Zhuangyuxin
> Subject: Re: [Xen-devel] Poor network performance between DomU with
> multiqueue support
> 
> On Wed, Dec 03, 2014 at 02:43:37PM +0000, Zhangleiqiang (Trump) wrote:
> > > -----Original Message-----
> > > From: Wei Liu [mailto:wei.liu2@citrix.com]
> > > Sent: Tuesday, December 02, 2014 11:59 PM
> > > To: Zhangleiqiang (Trump)
> > > Cc: Wei Liu; zhangleiqiang; xen-devel@lists.xen.org; Luohao (brian);
> > > Xiaoding (B); Yuzhou (C); Zhuangyuxin
> > > Subject: Re: [Xen-devel] Poor network performance between DomU with
> > > multiqueue support
> > >
> > > On Tue, Dec 02, 2014 at 02:46:36PM +0000, Zhangleiqiang (Trump) wrote:
> > > > Thanks for your reply, Wei.
> > > >
> > > > I do the following testing just now and found the results as follows:
> > > >
> > > > There are three DomUs (4U4G) are running on Host A (6U6G) and one
> > > > DomU (4U4G) is running on Host B (6U6G), I send packets from three
> > > > DomUs to the DomU on Host B simultaneously.
> > > >
> > > > 1. The "top" output of Host B as follows:
> > > >
> > > > top - 09:42:11 up  1:07,  2 users,  load average: 2.46, 1.90, 1.47
> > > > Tasks: 173 total,   4 running, 169 sleeping,   0 stopped,   0 zombie
> > > > %Cpu0  :  0.0 us,  0.0 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.8 si,  1.9 st
> > > > %Cpu1  :  0.0 us, 27.0 sy,  0.0 ni, 63.1 id,  0.0 wa,  0.0 hi,  9.5 si,  0.4 st
> > > > %Cpu2  :  0.0 us, 90.0 sy,  0.0 ni,  8.3 id,  0.0 wa,  0.0 hi,  1.7 si,  0.0 st
> > > > %Cpu3  :  0.4 us,  1.4 sy,  0.0 ni, 95.4 id,  0.0 wa,  0.0 hi,  1.4 si,  1.4 st
> > > > %Cpu4  :  0.0 us, 60.2 sy,  0.0 ni, 39.5 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
> > > > %Cpu5  :  0.0 us,  2.8 sy,  0.0 ni, 89.4 id,  0.0 wa,  0.0 hi,  6.9 si,  0.9 st
> > > > KiB Mem:   4517144 total,  3116480 used,  1400664 free,      876 buffers
> > > > KiB Swap:  2103292 total,        0 used,  2103292 free.  2374656 cached Mem
> > > >
> > > >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> > > >  7440 root      20   0       0      0      0 R 71.10 0.000   8:15.38 vif4.0-q3-guest
> > > >  7434 root      20   0       0      0      0 R 59.14 0.000   9:00.58 vif4.0-q0-guest
> > > >    18 root      20   0       0      0      0 R 33.89 0.000   2:35.06 ksoftirqd/2
> > > >    28 root      20   0       0      0      0 S 20.93 0.000   3:01.81 ksoftirqd/4
> > > >
> > > >
> > > > As shown above, only two netback related processes (vif4.0-*) are
> > > > running with high cpu usage, and the other 2 netback processes are
> > > > idle. The "ps" result of vif4.0-* processes as follows:
> > > >
> > > > root      7434 50.5  0.0      0     0 ?        R    09:23  11:29 [vif4.0-q0-guest]
> > > > root      7435  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q0-deall]
> > > > root      7436  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q1-guest]
> > > > root      7437  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q1-deall]
> > > > root      7438  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q2-guest]
> > > > root      7439  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q2-deall]
> > > > root      7440 48.1  0.0      0     0 ?        R    09:23  10:55 [vif4.0-q3-guest]
> > > > root      7441  0.0  0.0      0     0 ?        S    09:23   0:00 [vif4.0-q3-deall]
> > > > root      9724  0.0  0.0   9244  1520 pts/0    S+   09:46   0:00 grep --color=auto
> > > >
> > > >
> > > > 2. The "rx" related content in /proc/interrupts in receiver DomU
> > > > (on Host B):
> > > >
> > > >  73:          2          0    2925405          0   xen-dyn-event   eth0-q0-rx
> > > >  75:         43         93          0        118   xen-dyn-event   eth0-q1-rx
> > > >  77:          2       3376         14       1983   xen-dyn-event   eth0-q2-rx
> > > >  79:    2414666          0          9          0   xen-dyn-event   eth0-q3-rx
> > > >
> > > > As shown above, it seems like that only q0 and q3 handles the
> > > > interrupt triggered by packet receiving.
> > > >
> > > > Any advise? Thanks.
> > >
> > > Netback selects queue based on the return value of
> > > skb_get_queue_mapping. The queue mapping is set by core driver or
> > > ndo_select_queue (if specified by individual driver). In this case
> > > netback doesn't have its implementation of ndo_select_queue, so it's
> > > up to core driver to decide which queue to dispatch the packet to.
> > > I think you need to inspect why Dom0 only steers traffic to these
> > > two queues but not all of them.
> > >
> > > Don't know which utility is handy for this job. Probably tc(8) is useful?
> >
> > Thanks Wei.
> >
> 
> > I think the reason for the above results that only two
> > netback/netfront processes works hard is the queue select method. I
> > have tried to send packets from multiple host/vm to a vm, and all of
> > the netback/netfront processes are running with high cpu usage a few
> > times.
> >
> 
> A few times? You might want to check some patches to rework RX stall
> detection by David Vrabel that went in after 3.16.
> 
> > However, I find another issue. Even using 6 queues and making sure
> > that all of these 6 netback processes running with high cpu usage
> > (indeed, any of it running with 87% cpu usage), the whole VM receive
> > throughout is not very higher than results when using 4 queues. The
> > results are from 4.5Gbps to 5.04 Gbps using TCP with 512 bytes length
> > and 4.3Gbps to 5.78Gbps using TCP with 1460 bytes length.
> >
> 
> I would like to ask if you're still using 4U4G (4 CPU 4 G?) configuration? If so,
> please make sure there are at least the same number of vcpus as queues.
> 
> > According to the testing result from WIKI:
> > http://wiki.xen.org/wiki/Xen-netback_and_xen-netfront_multi-queue_performance_testing,
> > The VM receive throughput is also more lower than VM transmit.
> >
> 
> I think that's expected, because guest RX data path still uses grant_copy while
> guest TX uses grant_map to do zero-copy transmit.

As far as I know, there are three main grant-related operations used in the split device model: grant mapping, grant transfer and grant copy.
Grant transfer is no longer used, and grant mapping and grant transfer both involve "TLB" refresh work for the hypervisor, am I right?  Or does only grant transfer have this overhead?
Does grant copy necessarily have more overhead than grant mapping?

From the code, I see that in TX, netback does gnttab_batch_copy as well as gnttab_map_refs:

<code> //netback.c:xenvif_tx_action
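	/* Build the grant-copy and grant-map operation lists for this batch. */
	xenvif_tx_build_gops(queue, budget, &nr_cops, &nr_mops);

	/* No copy operations were built, so there is nothing to process. */
	if (nr_cops == 0)
		return 0;

	/* Issue all queued grant copies in one batch. */
	gnttab_batch_copy(queue->tx_copy_ops, nr_cops);
	/* Map guest pages only if map operations were queued. */
	if (nr_mops != 0) {
		ret = gnttab_map_refs(queue->tx_map_ops,
				      NULL,
				      queue->pages_to_map,
				      nr_mops);
		BUG_ON(ret);
	}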
</code>

> > I am wondering why the VM receive throughout cannot be up to 8-10Gbps
> > as VM transmit under multi-queue?  I also tried to send packets
> > directly from Local Dom0 to DomU, the DomU receive throughput can
> > reach about 8-12Gbps, so I am also wondering why transmitting packets
> > from Dom0 to Remote DomU can only reach about 4-5Gbps throughout?
> 
> If data is from Dom0 to DomU then SKB is probably not fragmented by network
> stack.  You can use tcpdump to check that.
> 
> Wei.
> 
> >
> > > Wei.
> > >
> > > > ----------
> > > > zhangleiqiang (Trump)
> > > >
> > > > Best Regards
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Wei Liu [mailto:wei.liu2@citrix.com]
> > > > > Sent: Tuesday, December 02, 2014 8:12 PM
> > > > > To: Zhangleiqiang (Trump)
> > > > > Cc: Wei Liu; zhangleiqiang; xen-devel@lists.xen.org; Luohao
> > > > > (brian); Xiaoding (B); Yuzhou (C); Zhuangyuxin
> > > > > Subject: Re: [Xen-devel] Poor network performance between DomU
> > > > > with multiqueue support
> > > > >
> > > > > On Tue, Dec 02, 2014 at 11:50:59AM +0000, Zhangleiqiang (Trump)
> wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: xen-devel-bounces@lists.xen.org
> > > > > > > [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of Wei
> > > > > > > Liu
> > > > > > > Sent: Tuesday, December 02, 2014 7:02 PM
> > > > > > > To: zhangleiqiang
> > > > > > > Cc: wei.liu2@citrix.com; xen-devel@lists.xen.org
> > > > > > > Subject: Re: [Xen-devel] Poor network performance between
> > > > > > > DomU with multiqueue support
> > > > > > >
> > > > > > > On Tue, Dec 02, 2014 at 04:30:49PM +0800, zhangleiqiang wrote:
> > > > > > > > Hi, all
> > > > > > > >     I am testing the performance of xen netfront-netback
> > > > > > > > driver that with
> > > > > > > multi-queues support. The throughput from domU to remote
> > > > > > > dom0 is 9.2Gb/s, but the throughput from domU to remote domU
> > > > > > > is only 3.6Gb/s, I think the bottleneck is the throughput
> > > > > > > from dom0 to local domU. However, we have done some testing
> > > > > > > and found the throughput from dom0 to local domU is 5.8Gb/s.
> > > > > > > >     And if we send packets from one DomU to other 3 DomUs
> > > > > > > > on different
> > > > > > > host simultaneously, the sum of throughout can reach 9Gbps.
> > > > > > > It seems like the bottleneck is the receiver?
> > > > > > > >     After some analysis, I found that even the max_queue
> > > > > > > > of netfront/back
> > > > > > > is set to 4, there are some strange results as follows:
> > > > > > > >     1. In domU, only one rx queue deal with softirq
> > > > > > >
> > > > > > > Try to bind irq to different vcpus?
> > > > > >
> > > > > > Do you mean we try to bind irq to different vcpus in DomU? I
> > > > > > will try it
> > > now.
> > > > > >
> > > > >
> > > > > Yes. Given the fact that you have two backend threads running
> > > > > while only one DomU vcpu is busy, it smells like misconfiguration in
> DomU.
> > > > >
> > > > > If this phenomenon persists after correctly binding irqs, you
> > > > > might want to check traffic is steering correctly to different queues.
> > > > >
> > > > > > >
> > > > > > > >     2. In dom0, only two netback queues process are
> > > > > > > > scheduled, other two
> > > > > > > process aren't scheduled.
> > > > > > >
> > > > > > > How many Dom0 vcpu do you have? If it only has two then
> > > > > > > there will only be two processes running at a time.
> > > > > >
> > > > > > Dom0 has 6 vcpus, and 6G memory. There are only one DomU
> > > > > > running in
> > > > > Dom0 and so four netback processes are running in Dom0 (because
> > > > > the max_queue param of netback kernel module is set to 4).
> > > > > > The phenomenon is that only 2 of these four netback process
> > > > > > were running
> > > > > with about 70% cpu usage, and another two use little CPU.
> > > > > > Is there a hash algorithm to determine which netback process
> > > > > > to handle the
> > > > > input packet?
> > > > > >
> > > > >
> > > > > I think that's whatever default algorithm Linux kernel is using.
> > > > >
> > > > > We don't currently support other algorithms.
> > > > >
> > > > > Wei.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Poor network performance between DomU with multiqueue support
  2014-12-05  1:17               ` Zhangleiqiang (Trump)
@ 2014-12-05 12:42                 ` Wei Liu
  2014-12-05 15:18                   ` Zoltan Kiss
  2014-12-08  6:44                   ` Zhangleiqiang (Trump)
  0 siblings, 2 replies; 35+ messages in thread
From: Wei Liu @ 2014-12-05 12:42 UTC (permalink / raw)
  To: Zhangleiqiang (Trump)
  Cc: Luohao (brian), Wei Liu, Zhuangyuxin, zhangleiqiang, Yuzhou (C),
	xen-devel@lists.xen.org, Xiaoding (B)

On Fri, Dec 05, 2014 at 01:17:16AM +0000, Zhangleiqiang (Trump) wrote:
[...]
> > I think that's expected, because guest RX data path still uses grant_copy while
> > guest TX uses grant_map to do zero-copy transmit.
> 
> As far as I know, there are three main grant-related operations used in split device model: grant mapping, grant transfer and grant copy. 
> Grant transfer has not used now, and grant mapping and grant transfer both involve "TLB" refresh work for hypervisor, am I right?  Or only grant transfer has this overhead?

Transfer is not used so I can't tell. Grant unmap causes TLB flush.

I saw in an email the other day that the XenServer folks have a planned
improvement to avoid the TLB flush in Xen, which they intend to upstream
in the 4.6 window. I can't say for sure that it will get upstreamed as I
don't work on that.

> Does grant copy surely has more overhead than grant mapping? 
> 

At the very least the zero-copy TX path is faster than the previous
copying path.

But as for the individual micro-operations, I'm not sure.

There was once a persistent-map prototype of netback / netfront that
established a memory pool between FE and BE and then used memcpy to copy
data. Unfortunately that prototype was not done right, so the results
were not good.

> >From the code, I see that in TX, netback will do gnttab_batch_copy as well as gnttab_map_refs:
> 
> <code> //netback.c:xenvif_tx_action
> 	xenvif_tx_build_gops(queue, budget, &nr_cops, &nr_mops);
> 
> 	if (nr_cops == 0)
> 		return 0;
> 
> 	gnttab_batch_copy(queue->tx_copy_ops, nr_cops);
> 	if (nr_mops != 0) {
> 		ret = gnttab_map_refs(queue->tx_map_ops,
> 				      NULL,
> 				      queue->pages_to_map,
> 				      nr_mops);
> 		BUG_ON(ret);
> 	}
> </code>
> 

The copy is for the packet header. Mapping is for packet data.

We need to copy the header from the guest so that it doesn't change
under netback's feet.
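
To illustrate why the header is copied rather than read in place, here is
a minimal user-space sketch. It is not netback code: struct pkt_hdr,
MAX_PAYLOAD and snapshot_and_validate() are hypothetical names made up for
this example, and the header layout is not the real netif wire format. The
idea is only that validation runs on a private snapshot, so a later write
by the guest to the shared page cannot change a value that has already
been checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified header; NOT the real netif request layout. */
struct pkt_hdr {
	uint16_t len;	/* payload length claimed by the guest */
	uint16_t flags;
};

#define MAX_PAYLOAD 1500u	/* assumed limit, for illustration only */

/*
 * Copy the header out of guest-writable memory once, then validate only
 * the private copy. The guest may still modify *shared afterwards, but
 * the value that was checked is the value that will be used.
 */
static bool snapshot_and_validate(const volatile struct pkt_hdr *shared,
				  struct pkt_hdr *priv)
{
	assert(shared != NULL && priv != NULL);

	priv->len = shared->len;
	priv->flags = shared->flags;

	return priv->len <= MAX_PAYLOAD;
}
```

If the backend instead validated shared->len and then read it a second
time when building the SKB, the two reads could observe different values.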

Wei.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Poor network performance between DomU with multiqueue support
  2014-12-05 12:42                 ` Wei Liu
@ 2014-12-05 15:18                   ` Zoltan Kiss
  2014-12-08  6:44                   ` Zhangleiqiang (Trump)
  1 sibling, 0 replies; 35+ messages in thread
From: Zoltan Kiss @ 2014-12-05 15:18 UTC (permalink / raw)
  To: Wei Liu, Zhangleiqiang (Trump)
  Cc: Luohao (brian), Zhuangyuxin, zhangleiqiang, Yuzhou (C),
	xen-devel@lists.xen.org, Xiaoding (B)



On 05/12/14 12:42, Wei Liu wrote:
> On Fri, Dec 05, 2014 at 01:17:16AM +0000, Zhangleiqiang (Trump) wrote:
> [...]
>>> I think that's expected, because guest RX data path still uses grant_copy while
>>> guest TX uses grant_map to do zero-copy transmit.
>>
>> As far as I know, there are three main grant-related operations used in split device model: grant mapping, grant transfer and grant copy.
>> Grant transfer has not used now, and grant mapping and grant transfer both involve "TLB" refresh work for hypervisor, am I right?  Or only grant transfer has this overhead?
>
> Transfer is not used so I can't tell. Grant unmap causes TLB flush.
>
> I saw in an email the other day XenServer folks has some planned
> improvement to avoid TLB flush in Xen to upstream in 4.6 window. I can't
> speak for sure it will get upstreamed as I don't work on that.
>
>> Does grant copy surely has more overhead than grant mapping?
>>
>
> At the very least the zero-copy TX path is faster than previous copying
> path.
>
> But speaking of the micro operation I'm not sure.
>
> There was once persistent map prototype netback / netfront that
> establishes a memory pool between FE and BE then use memcpy to copy
> data. Unfortunately that prototype was not done right so the result was
> not good.
>
>> >From the code, I see that in TX, netback will do gnttab_batch_copy as well as gnttab_map_refs:
>>
>> <code> //netback.c:xenvif_tx_action
>> 	xenvif_tx_build_gops(queue, budget, &nr_cops, &nr_mops);
>>
>> 	if (nr_cops == 0)
>> 		return 0;
>>
>> 	gnttab_batch_copy(queue->tx_copy_ops, nr_cops);
>> 	if (nr_mops != 0) {
>> 		ret = gnttab_map_refs(queue->tx_map_ops,
>> 				      NULL,
>> 				      queue->pages_to_map,
>> 				      nr_mops);
>> 		BUG_ON(ret);
>> 	}
>> </code>
>>
>
> The copy is for the packet header. Mapping is for packet data.
>
> We need to copy header from guest so that it doesn't change under
> netback's feet.

It also matters because if the above-mentioned "TLB flush avoidance"
patch goes into Xen, it will be important to grant copy the header
rather than grant map plus memcpy. The latter is the old way; it
touches the page, so you can't avoid the TLB flush.

>
> Wei.
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Poor network performance between DomU with multiqueue support
  2014-12-04 14:31                   ` Zhangleiqiang (Trump)
@ 2014-12-05 15:20                     ` Zoltan Kiss
  2014-12-05 18:27                       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 35+ messages in thread
From: Zoltan Kiss @ 2014-12-05 15:20 UTC (permalink / raw)
  To: Zhangleiqiang (Trump), xen-devel@lists.xen.org
  Cc: jonathan.davies, Luohao (brian), Zhuangyuxin, zhangleiqiang,
	Yuzhou (C), Xiaoding (B)



On 04/12/14 14:31, Zhangleiqiang (Trump) wrote:
>> -----Original Message-----
>> From: Zoltan Kiss [mailto:zoltan.kiss@linaro.org]
>> Sent: Thursday, December 04, 2014 9:35 PM
>> To: Zhangleiqiang (Trump); Wei Liu; xen-devel@lists.xen.org
>> Cc: Xiaoding (B); Zhuangyuxin; zhangleiqiang; Luohao (brian); Yuzhou (C)
>> Subject: Re: [Xen-devel] Poor network performance between DomU with
>> multiqueue support
>>
>>
>>
>> On 04/12/14 12:09, Zhangleiqiang (Trump) wrote:
>>>> I think that's expected, because guest RX data path still uses
>>>> grant_copy while
>>>>> guest TX uses grant_map to do zero-copy transmit.
>>> As I understand, the RX process is as follows:
>>> 1. Phy NIC receive packet
>>> 2. XEN Hypervisor trigger interrupt to Dom0 3. Dom0' s NIC driver do
>>> the "RX" operation, and the packet is stored into SKB which is also
>>> owned/shared with netback
>> Not that easy. There is something between the NIC driver and netback which
>> directs the packets, e.g. the old bridge driver, ovs, or the IP stack of the kernel.
>>> 4. NetBack notify netfront through event channel that a packet is
>>> receiving 5. Netfront grant a buffer for receiving and notify netback
>>> the GR (if using grant-resue mechanism, netfront just notify the GR to
>>> netback) through IO Ring
>> It looks a bit confusing in the code, but netfront put "requests" on the ring
>> buffer, which contains the grant ref of the guest page where the backend can
>> copy. When the packet comes, netback consumes these requests and send
>> back a response telling the guest the grant copy of the packet finished, it can
>> start handling the data. (sending a response means it's placing a response in
>> the ring and trigger the event channel) And ideally netback should always have
>> requests in the ring, so it doesn't have to wait for the guest to fill it up.
>
>>> 6. NetBack do the grant_copy to copy packet from its SKB to the buffer
>>> referenced by GR, and notify netfront through event channel 7.
>>> Netfront copy the data from buffer to user-level app's SKB
>> Or wherever that SKB should go, yes. Like with any received packet on a real
>> network interface.
>>>
>>> Am I right? Why not using zero-copy transmit in guest RX data pash too ?
>> Because that means you are mapping that memory to the guest, and you won't
>> have any guarantee when the guest will release them. And netback can't just
>> unmap them forcibly after a timeout, because finding a correct timeout value
>> would be quite impossible.
>> A malicious/buggy/overloaded guest can hold on to Dom0 memory indefinitely,
>> but it even becomes worse if the memory came from another
>> guest: you can't shutdown that guest for example, until all its memory is
>> returned to him.
>
> Thanks for your detailed explanation about RX data path, I have get it, :)
>
> About the issue that poor performance between DomU to DomU, but high throughout between Dom0 to remote Dom0/DomU mentioned in my previous mail, do you have any idea about it?
>
> I am wondering if netfront/netback can be optimized to reach the 10Gbps throughout between DomUs running on different hosts connected with 10GE network. Currently, it seems like the TX is not the bottleneck, because we can reach the aggregate throughout of 9Gbps when sending packets from one DomU to other 3 DomUs running on different host. So I think the bottleneck maybe the RX, are you agreed with me?
>
> I am wondering what is the main reason that prevent RX to reach the higher throughout? Compared to KVM+virtio+vhost, which can reach high throughout, the RX has extra grantcopy operation, and the grantcopy operation may be one reason for it. Do you have any idea about it too?
It's fairly certain that grant copy is the bottleneck for single-queue
RX traffic. I don't know what the plan is to address that; currently
only a faster CPU can help you with it.

>
>>
>> Regards,
>>
>> Zoli

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Poor network performance between DomU with multiqueue support
  2014-12-05 15:20                     ` Zoltan Kiss
@ 2014-12-05 18:27                       ` Konrad Rzeszutek Wilk
  2014-12-08  6:50                         ` Zhangleiqiang (Trump)
  0 siblings, 1 reply; 35+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-12-05 18:27 UTC (permalink / raw)
  To: Zoltan Kiss
  Cc: Zhangleiqiang (Trump), jonathan.davies, Luohao (brian),
	Zhuangyuxin, zhangleiqiang, Yuzhou (C), xen-devel@lists.xen.org,
	Xiaoding (B)

On Fri, Dec 05, 2014 at 03:20:55PM +0000, Zoltan Kiss wrote:
> 
> 
> On 04/12/14 14:31, Zhangleiqiang (Trump) wrote:
> >>-----Original Message-----
> >>From: Zoltan Kiss [mailto:zoltan.kiss@linaro.org]
> >>Sent: Thursday, December 04, 2014 9:35 PM
> >>To: Zhangleiqiang (Trump); Wei Liu; xen-devel@lists.xen.org
> >>Cc: Xiaoding (B); Zhuangyuxin; zhangleiqiang; Luohao (brian); Yuzhou (C)
> >>Subject: Re: [Xen-devel] Poor network performance between DomU with
> >>multiqueue support
> >>
> >>
> >>
> >>On 04/12/14 12:09, Zhangleiqiang (Trump) wrote:
> >>>>I think that's expected, because guest RX data path still uses
> >>>>grant_copy while
> >>>>>guest TX uses grant_map to do zero-copy transmit.
> >>>As I understand, the RX process is as follows:
> >>>1. Phy NIC receive packet
> >>>2. XEN Hypervisor trigger interrupt to Dom0 3. Dom0' s NIC driver do
> >>>the "RX" operation, and the packet is stored into SKB which is also
> >>>owned/shared with netback
> >>Not that easy. There is something between the NIC driver and netback which
> >>directs the packets, e.g. the old bridge driver, ovs, or the IP stack of the kernel.
> >>>4. NetBack notify netfront through event channel that a packet is
> >>>receiving 5. Netfront grant a buffer for receiving and notify netback
> >>>the GR (if using grant-resue mechanism, netfront just notify the GR to
> >>>netback) through IO Ring
> >>It looks a bit confusing in the code, but netfront put "requests" on the ring
> >>buffer, which contains the grant ref of the guest page where the backend can
> >>copy. When the packet comes, netback consumes these requests and send
> >>back a response telling the guest the grant copy of the packet finished, it can
> >>start handling the data. (sending a response means it's placing a response in
> >>the ring and trigger the event channel) And ideally netback should always have
> >>requests in the ring, so it doesn't have to wait for the guest to fill it up.
> >
> >>>6. NetBack do the grant_copy to copy packet from its SKB to the buffer
> >>>referenced by GR, and notify netfront through event channel 7.
> >>>Netfront copy the data from buffer to user-level app's SKB
> >>Or wherever that SKB should go, yes. Like with any received packet on a real
> >>network interface.
> >>>
> >>>Am I right? Why not using zero-copy transmit in guest RX data pash too ?
> >>Because that means you are mapping that memory to the guest, and you won't
> >>have any guarantee when the guest will release them. And netback can't just
> >>unmap them forcibly after a timeout, because finding a correct timeout value
> >>would be quite impossible.
> >>A malicious/buggy/overloaded guest can hold on to Dom0 memory indefinitely,
> >>but it even becomes worse if the memory came from another
> >>guest: you can't shutdown that guest for example, until all its memory is
> >>returned to him.
> >
> >Thanks for your detailed explanation about RX data path, I have get it, :)
> >
> >About the issue that poor performance between DomU to DomU, but high throughout between Dom0 to remote Dom0/DomU mentioned in my previous mail, do you have any idea about it?
> >
> >I am wondering if netfront/netback can be optimized to reach the 10Gbps throughout between DomUs running on different hosts connected with 10GE network. Currently, it seems like the TX is not the bottleneck, because we can reach the aggregate throughout of 9Gbps when sending packets from one DomU to other 3 DomUs running on different host. So I think the bottleneck maybe the RX, are you agreed with me?
> >
> >I am wondering what is the main reason that prevent RX to reach the higher throughout? Compared to KVM+virtio+vhost, which can reach high throughout, the RX has extra grantcopy operation, and the grantcopy operation may be one reason for it. Do you have any idea about it too?
> It's quite sure that the grant copy is the bottleneck for a single queue RX
> traffic. I don't know what's the plan to help that, currently only a faster
> CPU can help you with that.

Could the Intel QuickData help with that?
> 
> >
> >>
> >>Regards,
> >>
> >>Zoli
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Poor network performance between DomU with multiqueue support
  2014-12-05 12:42                 ` Wei Liu
  2014-12-05 15:18                   ` Zoltan Kiss
@ 2014-12-08  6:44                   ` Zhangleiqiang (Trump)
  2014-12-08 10:13                     ` Wei Liu
  1 sibling, 1 reply; 35+ messages in thread
From: Zhangleiqiang (Trump) @ 2014-12-08  6:44 UTC (permalink / raw)
  To: Wei Liu, xen-devel@lists.xen.org
  Cc: Zhangleiqiang (Trump), Luohao (brian), Zhuangyuxin, zhangleiqiang,
	Yuzhou (C), Xiaoding (B)

> On Fri, Dec 05, 2014 at 01:17:16AM +0000, Zhangleiqiang (Trump) wrote:
> [...]
> > > I think that's expected, because guest RX data path still uses
> > > grant_copy while guest TX uses grant_map to do zero-copy transmit.
> >
> > As far as I know, there are three main grant-related operations used in split
> device model: grant mapping, grant transfer and grant copy.
> > Grant transfer has not used now, and grant mapping and grant transfer both
> involve "TLB" refresh work for hypervisor, am I right?  Or only grant transfer
> has this overhead?
> 
> Transfer is not used so I can't tell. Grant unmap causes TLB flush.
> 
> I saw in an email the other day XenServer folks has some planned improvement
> to avoid TLB flush in Xen to upstream in 4.6 window. I can't speak for sure it will
> get upstreamed as I don't work on that.
> 
> > Does grant copy surely has more overhead than grant mapping?
> >
> 
> At the very least the zero-copy TX path is faster than previous copying path.
> 
> But speaking of the micro operation I'm not sure.
> 
> There was once persistent map prototype netback / netfront that establishes a
> memory pool between FE and BE then use memcpy to copy data. Unfortunately
> that prototype was not done right so the result was not good.

The most recent mail about persistent grants that I can find was sent on 16 Nov 2012 (http://lists.xen.org/archives/html/xen-devel/2012-11/msg00832.html). Why was it not done right, and why was it not merged upstream?

I also searched for virtio support in Xen and found that the person familiar with it is you, too (http://wiki.xen.org/wiki/Virtio_On_Xen) :-). What is the current state of virtio on Xen?

> > >From the code, I see that in TX, netback will do gnttab_batch_copy as well
> as gnttab_map_refs:
> >
> > <code> //netback.c:xenvif_tx_action
> > 	xenvif_tx_build_gops(queue, budget, &nr_cops, &nr_mops);
> >
> > 	if (nr_cops == 0)
> > 		return 0;
> >
> > 	gnttab_batch_copy(queue->tx_copy_ops, nr_cops);
> > 	if (nr_mops != 0) {
> > 		ret = gnttab_map_refs(queue->tx_map_ops,
> > 				      NULL,
> > 				      queue->pages_to_map,
> > 				      nr_mops);
> > 		BUG_ON(ret);
> > 	}
> > </code>
> >
> 
> The copy is for the packet header. Mapping is for packet data.
> 
> We need to copy header from guest so that it doesn't change under netback's
> feet.
> 
> Wei.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Poor network performance between DomU with multiqueue support
  2014-12-05 18:27                       ` Konrad Rzeszutek Wilk
@ 2014-12-08  6:50                         ` Zhangleiqiang (Trump)
  0 siblings, 0 replies; 35+ messages in thread
From: Zhangleiqiang (Trump) @ 2014-12-08  6:50 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Zoltan Kiss, xen-devel@lists.xen.org
  Cc: jonathan.davies@citrix.com, Luohao (brian), Zhuangyuxin,
	zhangleiqiang, Yuzhou (C), Xiaoding (B)

> On Fri, Dec 05, 2014 at 03:20:55PM +0000, Zoltan Kiss wrote:
> >
> >
> > On 04/12/14 14:31, Zhangleiqiang (Trump) wrote:
> > >>-----Original Message-----
> > >>From: Zoltan Kiss [mailto:zoltan.kiss@linaro.org]
> > >>Sent: Thursday, December 04, 2014 9:35 PM
> > >>To: Zhangleiqiang (Trump); Wei Liu; xen-devel@lists.xen.org
> > >>Cc: Xiaoding (B); Zhuangyuxin; zhangleiqiang; Luohao (brian); Yuzhou
> > >>(C)
> > >>Subject: Re: [Xen-devel] Poor network performance between DomU with
> > >>multiqueue support
> > >>
> > >>
> > >>
> > >>On 04/12/14 12:09, Zhangleiqiang (Trump) wrote:
> > >>>>I think that's expected, because guest RX data path still uses
> > >>>>grant_copy while
> > >>>>>guest TX uses grant_map to do zero-copy transmit.
> > >>>As I understand, the RX process is as follows:
> > >>>1. Phy NIC receive packet
> > >>>2. XEN Hypervisor trigger interrupt to Dom0 3. Dom0' s NIC driver
> > >>>do the "RX" operation, and the packet is stored into SKB which is
> > >>>also owned/shared with netback
> > >>Not that easy. There is something between the NIC driver and netback
> > >>which directs the packets, e.g. the old bridge driver, ovs, or the IP stack of
> the kernel.
> > >>>4. NetBack notify netfront through event channel that a packet is
> > >>>receiving 5. Netfront grant a buffer for receiving and notify
> > >>>netback the GR (if using grant-resue mechanism, netfront just
> > >>>notify the GR to
> > >>>netback) through IO Ring
> > >>It looks a bit confusing in the code, but netfront put "requests" on
> > >>the ring buffer, which contains the grant ref of the guest page
> > >>where the backend can copy. When the packet comes, netback consumes
> > >>these requests and send back a response telling the guest the grant
> > >>copy of the packet finished, it can start handling the data.
> > >>(sending a response means it's placing a response in the ring and
> > >>trigger the event channel) And ideally netback should always have requests
> in the ring, so it doesn't have to wait for the guest to fill it up.
> > >
> > >>>6. NetBack do the grant_copy to copy packet from its SKB to the
> > >>>buffer referenced by GR, and notify netfront through event channel 7.
> > >>>Netfront copy the data from buffer to user-level app's SKB
> > >>Or wherever that SKB should go, yes. Like with any received packet
> > >>on a real network interface.
> > >>>
> > >>>Am I right? Why not using zero-copy transmit in guest RX data pash too ?
> > >>Because that means you are mapping that memory to the guest, and you
> > >>won't have any guarantee when the guest will release them. And
> > >>netback can't just unmap them forcibly after a timeout, because
> > >>finding a correct timeout value would be quite impossible.
> > >>A malicious/buggy/overloaded guest can hold on to Dom0 memory
> > >>indefinitely, but it even becomes worse if the memory came from
> > >>another
> > >>guest: you can't shutdown that guest for example, until all its
> > >>memory is returned to him.
> > >
> > >Thanks for your detailed explanation about RX data path, I have get
> > >it, :)
> > >
> > >About the issue that poor performance between DomU to DomU, but high
> throughout between Dom0 to remote Dom0/DomU mentioned in my previous
> mail, do you have any idea about it?
> > >
> > >I am wondering if netfront/netback can be optimized to reach the 10Gbps
> throughout between DomUs running on different hosts connected with 10GE
> network. Currently, it seems like the TX is not the bottleneck, because we can
> reach the aggregate throughout of 9Gbps when sending packets from one
> DomU to other 3 DomUs running on different host. So I think the bottleneck
> maybe the RX, are you agreed with me?
> > >
> > >I am wondering what is the main reason that prevent RX to reach the higher
> throughout? Compared to KVM+virtio+vhost, which can reach high throughout,
> the RX has extra grantcopy operation, and the grantcopy operation may be one
> reason for it. Do you have any idea about it too?
> > It's quite sure that the grant copy is the bottleneck for a single
> > queue RX traffic. I don't know what's the plan to help that, currently
> > only a faster CPU can help you with that.
> 
> Could the Intel QuickData help with that?

Thanks for your hint.
I am looking for a method that is independent of hardware. I have seen that virtio can reach 10Gbps throughput, and I think the PV network protocol, which is the mainline option for Xen, should also be able to reach that throughput. However, the testing results show that it falls short, so I am wondering what the possible reason is and whether the PV network protocol can be optimized.

> >
> > >
> > >>
> > >>Regards,
> > >>
> > >>Zoli
> >

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Poor network performance between DomU with multiqueue support
  2014-12-08  6:44                   ` Zhangleiqiang (Trump)
@ 2014-12-08 10:13                     ` Wei Liu
  2014-12-08 13:08                       ` Zhangleiqiang (Trump)
  0 siblings, 1 reply; 35+ messages in thread
From: Wei Liu @ 2014-12-08 10:13 UTC (permalink / raw)
  To: Zhangleiqiang (Trump)
  Cc: Luohao (brian), Wei Liu, Zhuangyuxin, zhangleiqiang, Yuzhou (C),
	xen-devel@lists.xen.org, Xiaoding (B)

On Mon, Dec 08, 2014 at 06:44:26AM +0000, Zhangleiqiang (Trump) wrote:
> > On Fri, Dec 05, 2014 at 01:17:16AM +0000, Zhangleiqiang (Trump) wrote:
> > [...]
> > > > I think that's expected, because guest RX data path still uses
> > > > grant_copy while guest TX uses grant_map to do zero-copy transmit.
> > >
> > > As far as I know, there are three main grant-related operations used in split
> > device model: grant mapping, grant transfer and grant copy.
> > > Grant transfer has not used now, and grant mapping and grant transfer both
> > involve "TLB" refresh work for hypervisor, am I right?  Or only grant transfer
> > has this overhead?
> > 
> > Transfer is not used so I can't tell. Grant unmap causes TLB flush.
> > 
> > I saw in an email the other day XenServer folks has some planned improvement
> > to avoid TLB flush in Xen to upstream in 4.6 window. I can't speak for sure it will
> > get upstreamed as I don't work on that.
> > 
> > > Does grant copy surely has more overhead than grant mapping?
> > >
> > 
> > At the very least the zero-copy TX path is faster than previous copying path.
> > 
> > But speaking of the micro operation I'm not sure.
> > 
> > There was once persistent map prototype netback / netfront that establishes a
> > memory pool between FE and BE then use memcpy to copy data. Unfortunately
> > that prototype was not done right so the result was not good.
> 
> The newest mail about persistent grant I can find is sent from 16 Nov
> 2012
> (http://lists.xen.org/archives/html/xen-devel/2012-11/msg00832.html).
> Why is it not done right and not merged into upstream?

AFAICT there's one more memcpy than necessary, i.e. the frontend memcpys
data into the pool and then the backend memcpys data out of the pool,
when the backend should be able to use the page in the pool directly.
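
A rough user-space sketch of the difference, under stated assumptions:
struct pool_page, fe_stage(), be_copy_out() and be_borrow() are
hypothetical names for this example and do not come from the prototype
patches. fe_stage() followed by be_copy_out() is the two-memcpy pattern
described above; be_borrow() shows the backend using the pool page in
place, which removes the second copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL_PAGE_SIZE 4096u

/* Hypothetical stand-in for one page of a pre-granted FE/BE pool. */
struct pool_page {
	unsigned char data[POOL_PAGE_SIZE];
	size_t len;
};

/* Frontend side: stage a packet into the shared pool (first memcpy). */
static void fe_stage(struct pool_page *pg, const void *pkt, size_t len)
{
	assert(len <= POOL_PAGE_SIZE);
	memcpy(pg->data, pkt, len);
	pg->len = len;
}

/* Two-copy backend: copy the data out of the pool again (second memcpy). */
static size_t be_copy_out(const struct pool_page *pg, void *dst, size_t cap)
{
	assert(pg->len <= cap);
	memcpy(dst, pg->data, pg->len);
	return pg->len;
}

/* One-copy backend: hand out a pointer into the pool page instead. */
static const unsigned char *be_borrow(const struct pool_page *pg, size_t *len)
{
	*len = pg->len;
	return pg->data;
}
```

The borrowing variant only works if the backend is finished with the page
before the frontend reuses it, so a real design would also need some form
of ownership hand-back that this sketch leaves out.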

> 
> And I also search for virtio support in XEN, and I find that the one
> who are familiar with it is you, too,
> (http://wiki.xen.org/wiki/Virtio_On_Xen), :-). I am wondering what is
> the current state for virtio on XEN?

Yes, it was me. I never had the time to revisit that. I don't think we
support virtio network at the moment.

Wei.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Poor network performance between DomU with multiqueue support
  2014-12-08 10:13                     ` Wei Liu
@ 2014-12-08 13:08                       ` Zhangleiqiang (Trump)
  2014-12-08 13:55                         ` Wei Liu
  0 siblings, 1 reply; 35+ messages in thread
From: Zhangleiqiang (Trump) @ 2014-12-08 13:08 UTC (permalink / raw)
  To: Wei Liu, xen-devel@lists.xen.org
  Cc: Xiaoding (B), Zhuangyuxin, zhangleiqiang, Luohao (brian),
	Yuzhou (C)

> On Mon, Dec 08, 2014 at 06:44:26AM +0000, Zhangleiqiang (Trump) wrote:
> > > On Fri, Dec 05, 2014 at 01:17:16AM +0000, Zhangleiqiang (Trump) wrote:
> > > [...]
> > > > > I think that's expected, because guest RX data path still uses
> > > > > grant_copy while guest TX uses grant_map to do zero-copy transmit.
> > > >
> > > > As far as I know, there are three main grant-related operations
> > > > used in split
> > > device model: grant mapping, grant transfer and grant copy.
> > > > Grant transfer has not used now, and grant mapping and grant
> > > > transfer both
> > > involve "TLB" refresh work for hypervisor, am I right?  Or only
> > > grant transfer has this overhead?
> > >
> > > Transfer is not used so I can't tell. Grant unmap causes TLB flush.
> > >
> > > I saw in an email the other day XenServer folks has some planned
> > > improvement to avoid TLB flush in Xen to upstream in 4.6 window. I
> > > can't speak for sure it will get upstreamed as I don't work on that.
> > >
> > > > Does grant copy surely has more overhead than grant mapping?
> > > >
> > >
> > > At the very least the zero-copy TX path is faster than previous copying path.
> > >
> > > But speaking of the micro operation I'm not sure.
> > >
> > > There was once persistent map prototype netback / netfront that
> > > establishes a memory pool between FE and BE then use memcpy to copy
> > > data. Unfortunately that prototype was not done right so the result was not
> good.
> >
> > The newest mail about persistent grant I can find is sent from 16 Nov
> > 2012
> > (http://lists.xen.org/archives/html/xen-devel/2012-11/msg00832.html).
> > Why is it not done right and not merged into upstream?
> 
> AFAICT there's one more memcpy than necessary, i.e. frontend memcpy data
> into the pool then backend memcpy data out of the pool, when backend should
> be able to use the page in pool directly.

Memcpy should be cheaper than grant_copy because it does not need a "hypercall", which causes a "VM Exit" to the Xen hypervisor, am I right? For the RX path, memcpy based on persistent grants may therefore perform better than the grant copy used now.

I have seen "move grant copy to guest" and "Fix grant copy alignment problem" listed as optimization methods used in "NetChannel2" (http://www-archive.xenproject.org/files/xensummit_fall07/16_JoseRenatoSantos.pdf). Unfortunately, NetChannel2 does not seem to be supported from 2.6.32 onwards. Do you know these methods, and would they be helpful for RX path optimization in the current upstream implementation?

By the way, after rethinking the test results for the multi-queue PV implementation (kernel 3.17.4 + Xen 4.4), I find that when using four queues for netback/netfront, about 3 netback processes run with high CPU usage on the receiving Dom0 (about 85% usage per process, each running on one CPU core), and the aggregate throughput is only about 5Gbps. I suspect there may be a bug or pitfall in the current multi-queue implementation, because occupying almost all of 3 CPU cores to receive packets at 5Gbps seems abnormal.
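
As a rough sanity check of that figure, the arithmetic can be sketched as below. This is only a back-of-envelope estimate: the 2.4GHz clock is taken from the CPU model in the original report and is assumed to apply to every busy core, the 1500-byte packet size is a hypothetical value for illustration, and the function names are made up for this sketch.

```python
# Back-of-envelope CPU cost of the receive path in Dom0.
# Assumptions (not measured): every busy core runs at 2.4 GHz, and the
# utilisation figure applies equally to each netback process.

def cycles_per_bit(busy_cores: float, utilisation: float,
                   clock_hz: float, throughput_bps: float) -> float:
    """CPU cycles spent per received bit, summed over all busy cores."""
    cycles_per_second = busy_cores * utilisation * clock_hz
    return cycles_per_second / throughput_bps


def cycles_per_packet(per_bit: float, packet_bytes: int) -> float:
    """CPU cycles spent on one packet of the given size."""
    return per_bit * packet_bytes * 8


if __name__ == "__main__":
    # 3 netback processes at about 85% each, about 5 Gbps aggregate.
    per_bit = cycles_per_bit(3, 0.85, 2.4e9, 5e9)
    print(f"{per_bit:.2f} cycles/bit, {per_bit * 8:.1f} cycles/byte")
    # 1500 bytes is a hypothetical packet size, used only for illustration.
    print(f"{cycles_per_packet(per_bit, 1500):.0f} cycles per 1500-byte packet")
```

With the numbers above this works out to roughly 1.2 cycles per bit, or about 10 cycles per byte.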

> >
> > And I also search for virtio support in XEN, and I find that the one
> > who are familiar with it is you, too,
> > (http://wiki.xen.org/wiki/Virtio_On_Xen), :-). I am wondering what is
> > the current state for virtio on XEN?
> 
> Yes, it was me. I never have the time to revisit that. I don't think we support
> virtio network at the moment.
> 
> Wei.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Poor network performance between DomU with multiqueue support
  2014-12-08 13:08                       ` Zhangleiqiang (Trump)
@ 2014-12-08 13:55                         ` Wei Liu
  2014-12-09  2:51                           ` Zhangleiqiang (Trump)
  2014-12-09  9:03                           ` Zhangleiqiang (Trump)
  0 siblings, 2 replies; 35+ messages in thread
From: Wei Liu @ 2014-12-08 13:55 UTC (permalink / raw)
  To: Zhangleiqiang (Trump)
  Cc: Luohao (brian), Wei Liu, Zhuangyuxin, zhangleiqiang, Yuzhou (C),
	xen-devel@lists.xen.org, Xiaoding (B)

On Mon, Dec 08, 2014 at 01:08:18PM +0000, Zhangleiqiang (Trump) wrote:
> > On Mon, Dec 08, 2014 at 06:44:26AM +0000, Zhangleiqiang (Trump) wrote:
> > > > On Fri, Dec 05, 2014 at 01:17:16AM +0000, Zhangleiqiang (Trump) wrote:
> > > > [...]
> > > > > > I think that's expected, because guest RX data path still uses
> > > > > > grant_copy while guest TX uses grant_map to do zero-copy transmit.
> > > > >
> > > > > As far as I know, there are three main grant-related operations
> > > > > used in split
> > > > device model: grant mapping, grant transfer and grant copy.
> > > > > Grant transfer has not used now, and grant mapping and grant
> > > > > transfer both
> > > > involve "TLB" refresh work for hypervisor, am I right?  Or only
> > > > grant transfer has this overhead?
> > > >
> > > > Transfer is not used so I can't tell. Grant unmap causes TLB flush.
> > > >
> > > > I saw in an email the other day XenServer folks has some planned
> > > > improvement to avoid TLB flush in Xen to upstream in 4.6 window. I
> > > > can't speak for sure it will get upstreamed as I don't work on that.
> > > >
> > > > > Does grant copy surely has more overhead than grant mapping?
> > > > >
> > > >
> > > > At the very least the zero-copy TX path is faster than previous copying path.
> > > >
> > > > But speaking of the micro operation I'm not sure.
> > > >
> > > > There was once persistent map prototype netback / netfront that
> > > > establishes a memory pool between FE and BE then use memcpy to copy
> > > > data. Unfortunately that prototype was not done right so the result was not
> > good.
> > >
> > > The newest mail about persistent grant I can find is sent from 16 Nov
> > > 2012
> > > (http://lists.xen.org/archives/html/xen-devel/2012-11/msg00832.html).
> > > Why is it not done right and not merged into upstream?
> > 
> > AFAICT there's one more memcpy than necessary, i.e. frontend memcpy data
> > into the pool then backend memcpy data out of the pool, when backend should
> > be able to use the page in pool directly.
> 
> Memcpy should cheaper than grant_copy because the former needs not the
> "hypercall" which will cause "VM Exit" to "XEN Hypervisor", am I
> right? For RX path, using memcpy based on persistent grant table may
> have higher performance than using grant copy now.

In theory yes. Unfortunately nobody has benchmarked that properly.

If you're interested in doing work on optimising RX performance, you
might want to sync up with XenServer folks?

> 
> I have seen "move grant copy to guest" and "Fix grant copy alignment
> problem" as optimization methods used in "NetChannel2"
> (http://www-archive.xenproject.org/files/xensummit_fall07/16_JoseRenatoSantos.pdf).
> Unfortunately, NetChannel2 seems not be supported from 2.6.32. Do you
> know them and are them be helpful for RX path optimization under
> current upstream implementation?

Not sure, that's long before I ever started working on Xen.

> 
> By the way, after rethinking the testing results for multi-queue pv
> (kernel 3.17.4+XEN 4.4) implementation, I find that when using four
> queues for netback/netfront, there will be about 3 netback process
> running with high CPU usage on receive Dom0 (about 85% usage per
> process running on one CPU core), and the aggregate throughout is only
> about 5Gbps. I doubt that there may be some bug or pitfall in current
> multi-queue implementation, because for 5Gbps throughout, occurring
> about all of 3 CPU core for packet receiving is somehow abnormal.
> 

3.17.4 doesn't contain David Vrabel's fixes.

Look for 
  bc96f648df1bbc2729abbb84513cf4f64273a1f1
  f48da8b14d04ca87ffcffe68829afd45f926ec6a
  ecf08d2dbb96d5a4b4bcc53a39e8d29cc8fef02e
in David Miller's net tree.
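
One way to check whether a given kernel tree already has these fixes is to ask git whether each commit is an ancestor of the ref you built from. A minimal sketch, assuming a local clone of the kernel tree; the helper names and the repository path in the usage note are hypothetical, only `git merge-base --is-ancestor` is a real git command.

```python
import subprocess
from typing import Callable, Dict, Iterable, List, Optional

# Commit IDs quoted in this message.
FIXES = (
    "bc96f648df1bbc2729abbb84513cf4f64273a1f1",
    "f48da8b14d04ca87ffcffe68829afd45f926ec6a",
    "ecf08d2dbb96d5a4b4bcc53a39e8d29cc8fef02e",
)


def git_exit_code(args: List[str], repo: str) -> int:
    """Run git inside `repo` and return its exit status."""
    return subprocess.run(
        ["git", "-C", repo, *args],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode


def contains_commits(
    repo: str,
    ref: str,
    commits: Iterable[str] = FIXES,
    run: Callable[[List[str], str], int] = git_exit_code,
) -> Dict[str, Optional[bool]]:
    """Map each commit to True/False, or None if git could not answer."""
    result: Dict[str, Optional[bool]] = {}
    for commit in commits:
        # Exit status 0: commit is an ancestor of ref; 1: it is not;
        # anything else: an error such as an unknown commit or ref.
        code = run(["merge-base", "--is-ancestor", commit, ref], repo)
        result[commit] = True if code == 0 else False if code == 1 else None
    return result
```

For example, `contains_commits("/path/to/linux", "v3.17.4")` would report, per commit, whether that tag contains it, or None if git could not resolve the commit or the ref in that clone.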

BTW there are some improvements planned for 4.6: "[Xen-devel] [PATCH v3
0/2] gnttab: Improve scaleability". This is orthogonal to the problem
you're trying to solve but it should help improve performance in
general.


Wei.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Poor network performance between DomU with multiqueue support
  2014-12-08 13:55                         ` Wei Liu
@ 2014-12-09  2:51                           ` Zhangleiqiang (Trump)
  2014-12-09 10:05                             ` Ian Campbell
  2014-12-09  9:03                           ` Zhangleiqiang (Trump)
  1 sibling, 1 reply; 35+ messages in thread
From: Zhangleiqiang (Trump) @ 2014-12-09  2:51 UTC (permalink / raw)
  To: Wei Liu, xen-devel@lists.xen.org
  Cc: Xiaoding (B), Zhuangyuxin, zhangleiqiang, Luohao (brian),
	Yuzhou (C)

> On Mon, Dec 08, 2014 at 01:08:18PM +0000, Zhangleiqiang (Trump) wrote:
> > > On Mon, Dec 08, 2014 at 06:44:26AM +0000, Zhangleiqiang (Trump) wrote:
> > > > > On Fri, Dec 05, 2014 at 01:17:16AM +0000, Zhangleiqiang (Trump)
> wrote:
> > > > > [...]
> > > >
> > > > The newest mail about persistent grant I can find is sent from 16
> > > > Nov
> > > > 2012
> > > > (http://lists.xen.org/archives/html/xen-devel/2012-11/msg00832.html).
> > > > Why is it not done right and not merged into upstream?
> > >
> > > AFAICT there's one more memcpy than necessary, i.e. frontend memcpy
> > > data into the pool then backend memcpy data out of the pool, when
> > > backend should be able to use the page in pool directly.
> >
> > Memcpy should cheaper than grant_copy because the former needs not the
> > "hypercall" which will cause "VM Exit" to "XEN Hypervisor", am I
> > right? For RX path, using memcpy based on persistent grant table may
> > have higher performance than using grant copy now.
> 
> In theory yes. Unfortunately nobody has benchmarked that properly.
> 
> If you're interested in doing work on optimising RX performance, you might
> want to sync up with XenServer folks?

What is the recommended way to have a discussion with the XenServer folks? Through the XenServer forum or a separate mailing list? I find that most of the discussions in the forum are about the XenServer product.

> >
> > I have seen "move grant copy to guest" and "Fix grant copy alignment
> > problem" as optimization methods used in "NetChannel2"
> >
> (http://www-archive.xenproject.org/files/xensummit_fall07/16_JoseRenatoSa
> ntos.pdf).
> > Unfortunately, NetChannel2 seems not be supported from 2.6.32. Do you
> > know them and are them be helpful for RX path optimization under
> > current upstream implementation?
> 
> Not sure, that's long before I ever started working on Xen.
> 
> >
> > By the way, after rethinking the testing results for multi-queue pv
> > (kernel 3.17.4+XEN 4.4) implementation, I find that when using four
> > queues for netback/netfront, there will be about 3 netback process
> > running with high CPU usage on receive Dom0 (about 85% usage per
> > process running on one CPU core), and the aggregate throughout is only
> > about 5Gbps. I doubt that there may be some bug or pitfall in current
> > multi-queue implementation, because for 5Gbps throughout, occurring
> > about all of 3 CPU core for packet receiving is somehow abnormal.
> >
> 
> 3.17.4 doesn't contain David Vrabel's fixes.
> 
> Look for
>   bc96f648df1bbc2729abbb84513cf4f64273a1f1
>   f48da8b14d04ca87ffcffe68829afd45f926ec6a
>   ecf08d2dbb96d5a4b4bcc53a39e8d29cc8fef02e
> in David Miller's net tree.
> 
> BTW there are some improvement planned for 4.6: "[Xen-devel] [PATCH v3 0/2]
> gnttab: Improve scaleability". This is orthogonal to the problem you're trying to
> solve but it should help improve performance in general.

Thanks for your pointer, it is helpful.

> 
> Wei.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Poor network performance between DomU with multiqueue support
  2014-12-08 13:55                         ` Wei Liu
  2014-12-09  2:51                           ` Zhangleiqiang (Trump)
@ 2014-12-09  9:03                           ` Zhangleiqiang (Trump)
  1 sibling, 0 replies; 35+ messages in thread
From: Zhangleiqiang (Trump) @ 2014-12-09  9:03 UTC (permalink / raw)
  To: Wei Liu, xen-devel@lists.xen.org
  Cc: Xiaoding (B), Zhuangyuxin, zhangleiqiang, Luohao (brian),
	Yuzhou (C)

> On Mon, Dec 08, 2014 at 01:08:18PM +0000, Zhangleiqiang (Trump) wrote:
> > > On Mon, Dec 08, 2014 at 06:44:26AM +0000, Zhangleiqiang (Trump) wrote:
> > > > > On Fri, Dec 05, 2014 at 01:17:16AM +0000, Zhangleiqiang (Trump)
> wrote:
> > > > > [...]
> > By the way, after rethinking the testing results for multi-queue pv
> > (kernel 3.17.4+XEN 4.4) implementation, I find that when using four
> > queues for netback/netfront, there will be about 3 netback process
> > running with high CPU usage on receive Dom0 (about 85% usage per
> > process running on one CPU core), and the aggregate throughout is only
> > about 5Gbps. I doubt that there may be some bug or pitfall in current
> > multi-queue implementation, because for 5Gbps throughout, occurring
> > about all of 3 CPU core for packet receiving is somehow abnormal.
> >
> 
> 3.17.4 doesn't contain David Vrabel's fixes.
> 
> Look for
>   bc96f648df1bbc2729abbb84513cf4f64273a1f1
>   f48da8b14d04ca87ffcffe68829afd45f926ec6a
>   ecf08d2dbb96d5a4b4bcc53a39e8d29cc8fef02e
> in David Miller's net tree.

I have tested with 3.18-rc5, which includes these patches; however, the problem mentioned above has not improved. There are still 3 netback receive processes, each of which uses about 85% of a CPU core.

> BTW there are some improvement planned for 4.6: "[Xen-devel] [PATCH v3 0/2]
> gnttab: Improve scaleability". This is orthogonal to the problem you're trying to
> solve but it should help improve performance in general.
> 
> 
> Wei.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Poor network performance between DomU with multiqueue support
  2014-12-09  2:51                           ` Zhangleiqiang (Trump)
@ 2014-12-09 10:05                             ` Ian Campbell
  0 siblings, 0 replies; 35+ messages in thread
From: Ian Campbell @ 2014-12-09 10:05 UTC (permalink / raw)
  To: Zhangleiqiang (Trump)
  Cc: Luohao (brian), Wei Liu, Zhuangyuxin, zhangleiqiang, Yuzhou	(C),
	xen-devel@lists.xen.org, Xiaoding (B)

On Tue, 2014-12-09 at 02:51 +0000, Zhangleiqiang (Trump) wrote:
> What is the recommended way to have a discussion with XenServer folks?
> Through the forum of XenServer or the standalone mailing list? I find
> the most of discussions in forum are the production of XenServer.

AIUI development == list, users == forums.

Ian.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Poor network performance between DomU with multiqueue support
       [not found] <3A6795EA1206904E94BEC8EF9DF109AE239B35A9@SZXEMA512-MBX.china.huawei.com>
@ 2015-02-27  9:21 ` openlui
  2015-02-27 10:59   ` Wei Liu
  0 siblings, 1 reply; 35+ messages in thread
From: openlui @ 2015-02-27  9:21 UTC (permalink / raw)
  To: xen-devel@lists.xen.org; +Cc: Zhangleiqiang (Trump)


[-- Attachment #1.1: Type: text/plain, Size: 4742 bytes --]

>On Mon, Dec 08, 2014 at 01:08:18PM +0000, Zhangleiqiang (Trump) wrote:
>> > On Mon, Dec 08, 2014 at 06:44:26AM +0000, Zhangleiqiang (Trump) wrote:
>> > > > On Fri, Dec 05, 2014 at 01:17:16AM +0000, Zhangleiqiang (Trump) wrote:
>> > > > [...]
>> > > > > > I think that's expected, because guest RX data path still 
>> > > > > > uses grant_copy while guest TX uses grant_map to do zero-copy transmit.
>> > > > >
>> > > > > As far as I know, there are three main grant-related 
>> > > > > operations used in split
>> > > > device model: grant mapping, grant transfer and grant copy.
>> > > > > Grant transfer has not used now, and grant mapping and grant 
>> > > > > transfer both
>> > > > involve "TLB" refresh work for hypervisor, am I right?  Or only 
>> > > > grant transfer has this overhead?
>> > > >
>> > > > Transfer is not used so I can't tell. Grant unmap causes TLB flush.
>> > > >
>> > > > I saw in an email the other day XenServer folks has some planned 
>> > > > improvement to avoid TLB flush in Xen to upstream in 4.6 window. 
>> > > > I can't speak for sure it will get upstreamed as I don't work on that.
>> > > >
>> > > > > Does grant copy surely has more overhead than grant mapping?
>> > > > >
>> > > >
>> > > > At the very least the zero-copy TX path is faster than previous copying path.
>> > > >
>> > > > But speaking of the micro operation I'm not sure.
>> > > >
>> > > > There was once persistent map prototype netback / netfront that 
>> > > > establishes a memory pool between FE and BE then use memcpy to 
>> > > > copy data. Unfortunately that prototype was not done right so 
>> > > > the result was not
>> > good.
>> > >
>> > > The newest mail about persistent grant I can find is sent from 16 
>> > > Nov
>> > > 2012
>> > > (http://lists.xen.org/archives/html/xen-devel/2012-11/msg00832.html).
>> > > Why is it not done right and not merged into upstream?
>> > 
>> > AFAICT there's one more memcpy than necessary, i.e. frontend memcpy 
>> > data into the pool then backend memcpy data out of the pool, when 
>> > backend should be able to use the page in pool directly.
>> 
>> Memcpy should cheaper than grant_copy because the former needs not the 
>> "hypercall" which will cause "VM Exit" to "XEN Hypervisor", am I 
>> right? For RX path, using memcpy based on persistent grant table may 
>> have higher performance than using grant copy now.
>
>In theory yes. Unfortunately nobody has benchmarked that properly.
I have done some testing of RX performance using the persistent grant method and the upstream method (3.17.4 branch). The results show that the persistent grant method does perform better than the upstream method (about 6Gbps versus 3.5Gbps). I also find that the persistent grant mechanism is already used in blkfront/blkback, so I am wondering why there have been no efforts to replace grant copy with persistent grants, at least in the RX path. Are there other disadvantages of the persistent grant method that stop us from using it?

PS. I used pkt-gen to send packets from dom0 to a domU running on another dom0. The CPUs of both dom0s are Intel E5640 2.4GHz, and the two dom0s are connected with a 10GE NIC.




>If you're interested in doing work on optimising RX performance, you might want to sync up with XenServer folks?
>
>> 
>> I have seen "move grant copy to guest" and "Fix grant copy alignment 
>> problem" as optimization methods used in "NetChannel2"
>> (http://www-archive.xenproject.org/files/xensummit_fall07/16_JoseRenatoSantos.pdf).
>> Unfortunately, NetChannel2 seems not be supported from 2.6.32. Do you 
>> know them and are them be helpful for RX path optimization under 
>> current upstream implementation?
>
>Not sure, that's long before I ever started working on Xen.
>
>> 
>> By the way, after rethinking the testing results for multi-queue pv 
>> (kernel 3.17.4+XEN 4.4) implementation, I find that when using four 
>> queues for netback/netfront, there will be about 3 netback process 
>> running with high CPU usage on receive Dom0 (about 85% usage per 
>> process running on one CPU core), and the aggregate throughout is only 
>> about 5Gbps. I doubt that there may be some bug or pitfall in current 
>> multi-queue implementation, because for 5Gbps throughout, occurring 
>> about all of 3 CPU core for packet receiving is somehow abnormal.
>> 
>
>3.17.4 doesn't contain David Vrabel's fixes.
>
>Look for
>  bc96f648df1bbc2729abbb84513cf4f64273a1f1
>  f48da8b14d04ca87ffcffe68829afd45f926ec6a
>  ecf08d2dbb96d5a4b4bcc53a39e8d29cc8fef02e
>in David Miller's net tree.
>
>BTW there are some improvement planned for 4.6: "[Xen-devel] [PATCH v3 0/2] gnttab: Improve scaleability". This is orthogonal to the problem you're trying to solve but it should help improve performance in general.
>
>
>Wei.

[-- Attachment #1.2: Type: text/html, Size: 5803 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Poor network performance between DomU with multiqueue support
  2015-02-27  9:21 ` openlui
@ 2015-02-27 10:59   ` Wei Liu
  2015-02-27 11:30     ` David Vrabel
  2015-02-28  2:45     ` openlui
  0 siblings, 2 replies; 35+ messages in thread
From: Wei Liu @ 2015-02-27 10:59 UTC (permalink / raw)
  To: openlui
  Cc: Zhangleiqiang (Trump), wei.liu2, David Vrabel,
	xen-devel@lists.xen.org

Cc'ing David (XenServer kernel maintainer)

On Fri, Feb 27, 2015 at 05:21:11PM +0800, openlui wrote:
> >On Mon, Dec 08, 2014 at 01:08:18PM +0000, Zhangleiqiang (Trump) wrote:
> >> > On Mon, Dec 08, 2014 at 06:44:26AM +0000, Zhangleiqiang (Trump) wrote:
> >> > > > On Fri, Dec 05, 2014 at 01:17:16AM +0000, Zhangleiqiang (Trump) wrote:
> >> > > > [...]
> >> > > > > > I think that's expected, because guest RX data path still 
> >> > > > > > uses grant_copy while guest TX uses grant_map to do zero-copy transmit.
> >> > > > >
> >> > > > > As far as I know, there are three main grant-related 
> >> > > > > operations used in split
> >> > > > device model: grant mapping, grant transfer and grant copy.
> >> > > > > Grant transfer has not used now, and grant mapping and grant 
> >> > > > > transfer both
> >> > > > involve "TLB" refresh work for hypervisor, am I right?  Or only 
> >> > > > grant transfer has this overhead?
> >> > > >
> >> > > > Transfer is not used so I can't tell. Grant unmap causes TLB flush.
> >> > > >
> >> > > > I saw in an email the other day XenServer folks has some planned 
> >> > > > improvement to avoid TLB flush in Xen to upstream in 4.6 window. 
> >> > > > I can't speak for sure it will get upstreamed as I don't work on that.
> >> > > >
> >> > > > > Does grant copy surely has more overhead than grant mapping?
> >> > > > >
> >> > > >
> >> > > > At the very least the zero-copy TX path is faster than previous copying path.
> >> > > >
> >> > > > But speaking of the micro operation I'm not sure.
> >> > > >
> >> > > > There was once persistent map prototype netback / netfront that 
> >> > > > establishes a memory pool between FE and BE then use memcpy to 
> >> > > > copy data. Unfortunately that prototype was not done right so 
> >> > > > the result was not
> >> > good.
> >> > >
> >> > > The newest mail about persistent grant I can find is sent from 16 
> >> > > Nov
> >> > > 2012
> >> > > (http://lists.xen.org/archives/html/xen-devel/2012-11/msg00832.html).
> >> > > Why is it not done right and not merged into upstream?
> >> > 
> >> > AFAICT there's one more memcpy than necessary, i.e. frontend memcpy 
> >> > data into the pool then backend memcpy data out of the pool, when 
> >> > backend should be able to use the page in pool directly.
> >> 
> >> Memcpy should cheaper than grant_copy because the former needs not the 
> >> "hypercall" which will cause "VM Exit" to "XEN Hypervisor", am I 
> >> right? For RX path, using memcpy based on persistent grant table may 
> >> have higher performance than using grant copy now.
> >
> >In theory yes. Unfortunately nobody has benchmarked that properly.

> I have some testing for RX performance using persistent grant method
> and upstream method (3.17.4 branch), the results show that persistent
> grant method does have higher performance than upstream method (from
> 3.5Gbps to about 6Gbps). And I find that persistent grant mechanism
> has already used in blkfrong/blkback, I am wondering why there are no
> efforts to replace the grant copy by persistent grant now, at least in
> RX path. Are there other disadvantages in persistent grant method
> which stop we use it? 
> 

I've seen numbers better than 6Gbps. See upstream changeset
1650d5455bd2dc6b5ee134bd6fc1a3236c266b5b.

Persistent grant is not a silver bullet. There is an email thread on the
list discussing whether it should be removed in the block driver.

XenServer folks have been working on improving network performance. It's
my understanding that they choose different routes than persistent
grant. David might have more insight.

Wei.

> PS. I used pkt-gen to send packet from dom0 to a domU running on
> another dom0, the CPUs of both dom0 is Intel E5640 2.4GHz, and the two
> dom0s is connected with a 10GE NIC.
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Poor network performance between DomU with multiqueue support
  2015-02-27 10:59   ` Wei Liu
@ 2015-02-27 11:30     ` David Vrabel
  2015-02-28  3:21       ` openlui
  2015-02-28  2:45     ` openlui
  1 sibling, 1 reply; 35+ messages in thread
From: David Vrabel @ 2015-02-27 11:30 UTC (permalink / raw)
  To: Wei Liu, openlui
  Cc: Zhangleiqiang (Trump), David Vrabel, xen-devel@lists.xen.org

On 27/02/15 10:59, Wei Liu wrote:
> 
> Persistent grant is not silver bullet. There is email thread on the
> list discussing whether it should be removed in block driver.

Persistent grants for to-guest network traffic are a flawed idea.  They
either require:

a) the backend to memcpy into the mapped grant /and/ the frontend to
memcpy out of the persistently mapped pool.  This is clearly going to be
worse for memory bandwidth than a single grant copy.

or

b) the backend to accumulate more and more mappings of guest memory,
which is bad for security and it uses too many grant and map track
resources hence it does not scale to many VIFs.
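
To make the memory bandwidth argument in (a) concrete, here is a simplified model that only counts how many bytes cross the memory bus per packet. It assumes every copy reads the payload once and writes it once, and it ignores caches, hypercall cost and mapping cost entirely; the class and constant names are invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RxPath:
    """A to-guest receive path, described only by how often it copies the payload."""
    name: str
    copies: int

    def bytes_moved(self, payload_bytes: int) -> int:
        # Simplifying assumption: each copy is one read plus one write
        # of the whole payload.
        return 2 * self.copies * payload_bytes


# One grant copy performed by the hypervisor.
GRANT_COPY = RxPath("grant copy", copies=1)
# Option (a): backend memcpy into the pool, then frontend memcpy out of it.
PERSISTENT_DOUBLE_MEMCPY = RxPath("persistent grants, option (a)", copies=2)

if __name__ == "__main__":
    payload = 1400  # hypothetical payload size in bytes
    for path in (GRANT_COPY, PERSISTENT_DOUBLE_MEMCPY):
        print(f"{path.name}: {path.bytes_moved(payload)} bytes moved per packet")
```

Under this model option (a) moves twice as many bytes per packet as a single grant copy, whatever the payload size.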

David

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Poor network performance between DomU with multiqueue support
  2015-02-27 10:59   ` Wei Liu
  2015-02-27 11:30     ` David Vrabel
@ 2015-02-28  2:45     ` openlui
  2015-03-03 10:40       ` Wei Liu
  1 sibling, 1 reply; 35+ messages in thread
From: openlui @ 2015-02-28  2:45 UTC (permalink / raw)
  To: Wei Liu; +Cc: Zhangleiqiang (Trump), David Vrabel, xen-devel@lists.xen.org


[-- Attachment #1.1: Type: text/plain, Size: 4338 bytes --]

At 2015-02-27 18:59:52, "Wei Liu" <wei.liu2@citrix.com> wrote:
>Cc'ing David (XenServer kernel maintainer)
>
>On Fri, Feb 27, 2015 at 05:21:11PM +0800, openlui wrote:
>> >On Mon, Dec 08, 2014 at 01:08:18PM +0000, Zhangleiqiang (Trump) wrote:
>> >> > On Mon, Dec 08, 2014 at 06:44:26AM +0000, Zhangleiqiang (Trump) wrote:
>> >> > > > On Fri, Dec 05, 2014 at 01:17:16AM +0000, Zhangleiqiang (Trump) wrote:
>> >> > > > [...]
>> >> > > > > > I think that's expected, because guest RX data path still 
>> >> > > > > > uses grant_copy while guest TX uses grant_map to do zero-copy transmit.
>> >> > > > >
>> >> > > > > As far as I know, there are three main grant-related 
>> >> > > > > operations used in split
>> >> > > > device model: grant mapping, grant transfer and grant copy.
>> >> > > > > Grant transfer has not used now, and grant mapping and grant 
>> >> > > > > transfer both
>> >> > > > involve "TLB" refresh work for hypervisor, am I right?  Or only 
>> >> > > > grant transfer has this overhead?
>> >> > > >
>> >> > > > Transfer is not used so I can't tell. Grant unmap causes TLB flush.
>> >> > > >
>> >> > > > I saw in an email the other day XenServer folks has some planned 
>> >> > > > improvement to avoid TLB flush in Xen to upstream in 4.6 window. 
>> >> > > > I can't speak for sure it will get upstreamed as I don't work on that.
>> >> > > >
>> >> > > > > Does grant copy surely has more overhead than grant mapping?
>> >> > > > >
>> >> > > >
>> >> > > > At the very least the zero-copy TX path is faster than previous copying path.
>> >> > > >
>> >> > > > But speaking of the micro operation I'm not sure.
>> >> > > >
>> >> > > > There was once persistent map prototype netback / netfront that 
>> >> > > > establishes a memory pool between FE and BE then use memcpy to 
>> >> > > > copy data. Unfortunately that prototype was not done right so 
>> >> > > > the result was not
>> >> > good.
>> >> > >
>> >> > > The newest mail about persistent grant I can find is sent from 16 
>> >> > > Nov
>> >> > > 2012
>> >> > > (http://lists.xen.org/archives/html/xen-devel/2012-11/msg00832.html).
>> >> > > Why is it not done right and not merged into upstream?
>> >> > 
>> >> > AFAICT there's one more memcpy than necessary, i.e. frontend memcpy 
>> >> > data into the pool then backend memcpy data out of the pool, when 
>> >> > backend should be able to use the page in pool directly.
>> >> 
>> >> Memcpy should cheaper than grant_copy because the former needs not the 
>> >> "hypercall" which will cause "VM Exit" to "XEN Hypervisor", am I 
>> >> right? For RX path, using memcpy based on persistent grant table may 
>> >> have higher performance than using grant copy now.
>> >
>> >In theory yes. Unfortunately nobody has benchmarked that properly.
>
>> I have some testing for RX performance using persistent grant method
>> and upstream method (3.17.4 branch), the results show that persistent
>> grant method does have higher performance than upstream method (from
>> 3.5Gbps to about 6Gbps). And I find that persistent grant mechanism
>> has already used in blkfrong/blkback, I am wondering why there are no
>> efforts to replace the grant copy by persistent grant now, at least in
>> RX path. Are there other disadvantages in persistent grant method
>> which stop we use it? 
>> 
>
>I've seen numbers better than 6Gbps. See upstream changeset
>1650d5455bd2dc6b5ee134bd6fc1a3236c266b5b.
Thanks, Wei. The throughput I mentioned (3.5Gbps and 6Gbps) is for 1400-byte UDP packets; I think the result in 1650d5455bd2dc6b5ee134bd6fc1a3236c266b5b is for TCP.

>Persistent grant is not silver bullet. There is email thread on the
>list discussing whether it should be removed in block driver.

I have tried to look for the thread but found no detailed info. Could you give me some keywords to find the thread? Thanks.


>XenServer folks have been working on improving network performance. It's
>my understanding that they choose different routes than persistent
>grant. David might have more insight.


>Wei.
>
>> PS. I used pkt-gen to send packet from dom0 to a domU running on
>> another dom0, the CPUs of both dom0 is Intel E5640 2.4GHz, and the two
>> dom0s is connected with a 10GE NIC.
>> 
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Poor network performance between DomU with multiqueue support
  2015-02-27 11:30     ` David Vrabel
@ 2015-02-28  3:21       ` openlui
  0 siblings, 0 replies; 35+ messages in thread
From: openlui @ 2015-02-28  3:21 UTC (permalink / raw)
  To: David Vrabel; +Cc: Zhangleiqiang (Trump), Wei Liu, xen-devel@lists.xen.org

[-- Attachment #1.1: Type: text/plain, Size: 1658 bytes --]

At 2015-02-27 19:30:20, "David Vrabel" <david.vrabel@citrix.com> wrote:
>On 27/02/15 10:59, Wei Liu wrote:
>> 
>> Persistent grant is not silver bullet. There is email thread on the
>> list discussing whether it should be removed in block driver.
>
>Persistent grants for to-guest network traffic is a flawed idea.  It
>either requires:
>
>a) the backend to memcpy into the mapped grant /and/ the frontend to
>memcpy out of the persistently mapped pool.  This is clearly going to be
>worse for memory bandwidth than a single grant copy.

Yes, the persistent grant method does use more of the DomU's CPU than the grant copy method.

However, although the persistent way has one more memcpy operation than grant copy, it has two fewer "mmap" operations and no hypercall. I have examined the code for grant copy: it needs to "mmap" the memory from the src and dest domains into the hypervisor, then "memcpy" the data from src to dest. More CPU is used by the hypervisor instead of the DomU.

>or
>
>b) the backend to accumulate more and more mappings of guest memory,
>which is bad for security and it uses too many grant and map track
>resources hence it does not scale to many VIFs.

I find that the persistent grant patch has an upper limit on the amount of guest memory that can be mapped by each queue of a VIF. The limit seems to be the VIF's ring size, if I understand correctly, so the amount does not seem high.
Under my benchmark, at least for a single UDP flow, the persistent grant way has much higher throughput than the grant copy way.
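
A rough estimate of how much guest memory would stay mapped, assuming (this is not taken from the patch) 256 slots per RX ring, one 4KiB page per slot, and a persistent mapping for every slot. The function names and the VIF counts are hypothetical.

```python
PAGE_SIZE = 4096   # assumption: 4 KiB pages
RING_SLOTS = 256   # assumption: slots per RX ring, not taken from the patch


def persistent_mappings(vifs: int, queues_per_vif: int,
                        ring_slots: int = RING_SLOTS) -> int:
    """Upper bound on persistently mapped grants held by the backend."""
    return vifs * queues_per_vif * ring_slots


def mapped_bytes(vifs: int, queues_per_vif: int,
                 ring_slots: int = RING_SLOTS,
                 page_size: int = PAGE_SIZE) -> int:
    """Upper bound on guest memory kept mapped, one page per ring slot."""
    return persistent_mappings(vifs, queues_per_vif, ring_slots) * page_size


if __name__ == "__main__":
    for vifs in (1, 100, 1000):  # hypothetical VIF counts
        grants = persistent_mappings(vifs, 4)
        mib = mapped_bytes(vifs, 4) / 2**20
        print(f"{vifs:5d} VIFs x 4 queues: {grants} mappings, {mib:.0f} MiB")
```

Under these assumptions a single VIF with 4 queues stays at 4 MiB, while the total grows linearly with the number of VIFs on the host.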

>David

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Poor network performance between DomU with multiqueue support
  2015-02-28  2:45     ` openlui
@ 2015-03-03 10:40       ` Wei Liu
  0 siblings, 0 replies; 35+ messages in thread
From: Wei Liu @ 2015-03-03 10:40 UTC (permalink / raw)
  To: openlui
  Cc: Zhangleiqiang (Trump), Wei Liu, David Vrabel,
	xen-devel@lists.xen.org

On Sat, Feb 28, 2015 at 10:45:02AM +0800, openlui wrote:
>
> 
> >Persistent grant is not silver bullet. There is email thread on the
> >list discussing whether it should be removed in block driver.
> 
> I have tried to look for the thread but no detailed info. Could you give me some keyword to find the thread, thanks.
> 
> 

Message id <1423988345-4005-5-git-send-email-bob.liu@oracle.com>

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2015-03-03 10:40 UTC | newest]

Thread overview: 35+ messages
2014-12-02  8:30 Poor network performance between DomU with multiqueue support zhangleiqiang
2014-12-02 10:57 ` David Vrabel
2014-12-02 11:53   ` Zhangleiqiang (Trump)
2014-12-02 17:25     ` Zoltan Kiss
2014-12-02 11:01 ` Wei Liu
2014-12-02 11:50   ` Zhangleiqiang (Trump)
2014-12-02 12:11     ` Wei Liu
2014-12-02 14:46       ` Zhangleiqiang (Trump)
2014-12-02 15:58         ` Wei Liu
2014-12-03 14:43           ` Zhangleiqiang (Trump)
2014-12-04 10:50             ` Wei Liu
2014-12-04 12:09               ` Zhangleiqiang (Trump)
2014-12-04 13:05                 ` Wei Liu
2014-12-04 14:37                   ` Zhangleiqiang (Trump)
2014-12-04 13:35                 ` Zoltan Kiss
2014-12-04 14:31                   ` Zhangleiqiang (Trump)
2014-12-05 15:20                     ` Zoltan Kiss
2014-12-05 18:27                       ` Konrad Rzeszutek Wilk
2014-12-08  6:50                         ` Zhangleiqiang (Trump)
2014-12-05  1:17               ` Zhangleiqiang (Trump)
2014-12-05 12:42                 ` Wei Liu
2014-12-05 15:18                   ` Zoltan Kiss
2014-12-08  6:44                   ` Zhangleiqiang (Trump)
2014-12-08 10:13                     ` Wei Liu
2014-12-08 13:08                       ` Zhangleiqiang (Trump)
2014-12-08 13:55                         ` Wei Liu
2014-12-09  2:51                           ` Zhangleiqiang (Trump)
2014-12-09 10:05                             ` Ian Campbell
2014-12-09  9:03                           ` Zhangleiqiang (Trump)
     [not found] <3A6795EA1206904E94BEC8EF9DF109AE239B35A9@SZXEMA512-MBX.china.huawei.com>
2015-02-27  9:21 ` openlui
2015-02-27 10:59   ` Wei Liu
2015-02-27 11:30     ` David Vrabel
2015-02-28  3:21       ` openlui
2015-02-28  2:45     ` openlui
2015-03-03 10:40       ` Wei Liu
