From: Jason Wang
Date: Fri, 22 Jan 2016 14:21:50 +0800
Subject: Re: [Qemu-devel] [RFC PATCH v2 00/10] Add colo-proxy based on netfilter
Message-ID: <56A1CA7E.3090306@redhat.com>
In-Reply-To: <56A1C4A9.3020203@cn.fujitsu.com>
To: Wen Congyang, Zhang Chen, qemu devel
Cc: zhanghailiang, Li Zhijian, Gui jianfeng, "eddie.dong",
 "Dr. David Alan Gilbert", Huang peng, Gong lei, Stefan Hajnoczi,
 jan.kiszka@siemens.com, Yang Hongyang

On 01/22/2016 01:56 PM, Wen Congyang wrote:
> On 01/22/2016 01:41 PM, Jason Wang wrote:
>>
>>
>> On 01/22/2016 11:28 AM, Wen Congyang wrote:
>>> On 01/22/2016 11:15 AM, Jason Wang wrote:
>>>>
>>>> On 01/20/2016 06:30 PM, Wen Congyang wrote:
>>>>> On 01/20/2016 06:19 PM, Jason Wang wrote:
>>>>>>>
>>>>>>> On 01/20/2016 06:01 PM, Wen Congyang wrote:
>>>>>>>>> On 01/20/2016 02:54 PM, Jason Wang wrote:
>>>>>>>>>>> On 01/20/2016 11:29 AM, Zhang Chen wrote:
>>>>>>>>>>>>>>> Sure.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Two main comments/suggestions:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - TCP analysis is missing in the current version; maybe you could point
>>>>>>>>>>>>>>> me to a git tree (or another version of the RFC) for a better
>>>>>>>>>>>>>>> understanding of the design. (Just a skeleton for TCP should be
>>>>>>>>>>>>>>> sufficient to discuss.)
>>>>>>>>>>>>>>> - I prefer to make the code as reusable as possible, so it's better to
>>>>>>>>>>>>>>> split/decouple the reusable parts from the code. A vague idea is:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) Decouple the packet comparing from the netfilter. You've achieved
>>>>>>>>>>>>>>> this 99% since the work is done in a thread. Just let the thread
>>>>>>>>>>>>>>> poll sockets directly; then the comparing can be reused by other
>>>>>>>>>>>>>>> kinds of dataplane.
>>>>>>>>>>>>>>> 2) Implement the traffic mirror/redirector as a filter.
>>>>>>>>>>>>>>> 3) Implement TCP seq rewriting as a filter.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Then, in the primary node, you need just a traffic mirror, which does:
>>>>>>>>>>>>>>> - mirror ingress traffic to the secondary node
>>>>>>>>>>>>>>> - mirror egress traffic to the packet comparing thread
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And in the secondary node, you need two filters:
>>>>>>>>>>>>>>> - A TCP seq rewriter which adjusts the TCP sequence number.
>>>>>>>>>>>>>>> - A traffic redirector which redirects packets from a socket as ingress
>>>>>>>>>>>>>>> traffic, and redirects egress traffic to the socket which can be
>>>>>>>>>>>>>>> polled by the remote packet comparing thread.
>>>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> zhangchen
>>>>>>>>>>>>> Hi, Jason.
>>>>>>>>>>>>> We considered your suggestion to split/decouple
>>>>>>>>>>>>> the reusable parts from the code.
>>>>>>>>>>>>> Because filter plugins are traversed one by one in order,
>>>>>>>>>>>>> we will split colo-proxy into three filters on each side.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But in this plan, primary and secondary both have a socket
>>>>>>>>>>>>> server, so startup is a problem.
>>>>>>>>>>> I believe this issue could be solved by reusing socket chardev.
>>>>>>>>>>>
>>>>>>>>>>>>> [Diagram: two boxes, "Primary qemu" and "Secondary qemu", each with a
>>>>>>>>>>>>> guest on top of a netfilter layer; the primary also sits on a tap
>>>>>>>>>>>>> carrying "guest receive" and "guest send" traffic.
>>>>>>>>>>>>> Primary filter execute order:
>>>>>>>>>>>>>   mirror client (tx) -> redirect server (rx) -> compare (rx)
>>>>>>>>>>>>> Secondary filter execute order:
>>>>>>>>>>>>>   mirror server (tx) -> TCP rewriter, adjust ack / adjust seq (all)
>>>>>>>>>>>>>   -> redirect client (rx)]
>>>>>>>>>>>>>
>>>>>>>>>>>>> NOTE: filter direction is rx/tx/all
>>>>>>>>>>>>> rx: receive packets sent to the netdev
>>>>>>>>>>>>> tx: receive packets sent by the netdev
>>>>>>>>>>>>>
>>>>>>>>>>> I would still like to decouple the comparer from netfilter. It has two
>>>>>>>>>>> obvious advantages:
>>>>>>>>>>>
>>>>>>>>>>> - it can be reused by other dataplanes (e.g. vhost)
>>>>>>>>>>> - the secondary redirector could redirect rx to the comparer on the
>>>>>>>>>>> primary node directly, which simplifies the design.
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> guest recv packet route
>>>>>>>>>>>>>
>>>>>>>>>>>>> primary
>>>>>>>>>>>>> tap --> mirror client filter
>>>>>>>>>>>>> mirror client will send the packet to the guest and, at the
>>>>>>>>>>>>> same time, copy and forward the packet to the secondary
>>>>>>>>>>>>> mirror server.
>>>>>>>>>>>>>
>>>>>>>>>>>>> secondary
>>>>>>>>>>>>> mirror server filter --> TCP rewriter
>>>>>>>>>>>>> if the received packet is a TCP packet, we will adjust ack
>>>>>>>>>>>>> and update the TCP checksum, then send it to the secondary
>>>>>>>>>>>>> guest. Otherwise send it directly to the guest.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> guest send packet route
>>>>>>>>>>>>>
>>>>>>>>>>>>> primary
>>>>>>>>>>>>> guest --> redirect server filter
>>>>>>>>>>>>> redirect server filter receives the primary guest packet
>>>>>>>>>>>>> but does nothing, just passes it to the next filter.
>>>>>>>>>>>>>
>>>>>>>>>>>>> redirect server filter --> compare filter
>>>>>>>>>>>>> compare filter receives the primary guest packet, then
>>>>>>>>>>>>> waits for the secondary redirect packet to compare it.
>>>>>>>>>>>>> if the packets are the same, send the primary packet and clear the
>>>>>>>>>>>>> secondary packet; otherwise send the primary packet and do a
>>>>>>>>>>>>> checkpoint.
>>>>>>>>>>>>>
>>>>>>>>>>>>> secondary
>>>>>>>>>>>>> guest --> TCP rewriter filter
>>>>>>>>>>>>> if the packet is a TCP packet, we will adjust seq
>>>>>>>>>>>>> and update the TCP checksum, then send it to the
>>>>>>>>>>>>> redirect client filter. Otherwise send it directly to the
>>>>>>>>>>>>> redirect client filter.
>>>>>>>>>>>>>
>>>>>>>>>>>>> redirect client filter --> redirect server filter
>>>>>>>>>>>>> forward packet to primary
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> In the failover case (primary is down), the TCP rewriter will keep
>>>>>>>>>>>>> servicing the TCP connections established after the last
>>>>>>>>>>>>> checkpoint.
>>>>>>>>>>>>>
>>>>>>>>>>>>> How about this plan?
>>>>>>>>>>> Sounds good.
>>>>>>>>>>>
>>>>>>>>>>> And there's indeed no need to distinguish client/server when reusing
>>>>>>>>>>> the socket chardev. E.g.:
>>>>>>>>>>>
>>>>>>>>>>> In primary node:
>>>>>>>>>>>
>>>>>>>>>>> ...
>>>>>>>>>>> -chardev socket,id=comparer0,host=ip_primary,port=X,server,nowait
>>>>>>>>>>> -chardev socket,id=comparer1,host=ip_primary,port=Y,server,nowait
>>>>>>>>>>> -chardev socket,id=mirrorer0,host=ip_primary,port=Z,server,nowait
>>>>>>>>>>> -netdev tap,id=hn0
>>>>>>>>>>> -traffic-mirrorer netdev=hn0,id=t0,indev=comparer0,outdev=mirrorer0
>>>>>>>>>>> -colo-comparer primary_traffic=comparer0,secondary_traffic=comparer1
>>>>>>>>> Why does the mirrorer have an indev?
>>>>>>>
>>>>>>> As I said in the previous mails, I would like to decouple packet
>>>>>>> comparing from netfilter. You've already done most of this since the
>>>>>>> comparing is done in an independent thread. So the indev here is to
>>>>>>> mirror the packet sent by the guest to the packet comparing thread.
>>>>>>>
>>>>>>>>> I think we can use traffic-redirector to do it.
>>>>>>>>> The command line is:
>>>>>>>>> -netdev tap,id=hn0
>>>>>>>>> -object traffic-mirrorer,id=f0,netdev=hn0,queue=tx,outdev=mirrorer0
>>>>>>>>> -object traffic-redirector,id=f1,netdev=hn0,queue=rx,outdev=comparer0
>>>>>>>>> -colo-comparer primary_traffic=comparer0,secondary_traffic=comparer1,netdev=hn0
>>>>>>>>> In the comparer thread, we can use qemu_net_queue_send_iov() to send
>>>>>>>>> out the packet.
>>>>>>>>>
>>>>>>>>> Also, we can merge the socketdev comparer1 and mirrorer0.
>>>>>>> It depends on whether or not packet comparing is done in a net filter
>>>>>>> (which I prefer not).
>>>>> I mean that packet comparing is done in a thread, not a net filter.
>>>>> The flow of the packet sent from the guest:
>>>>> 1. traffic-redirector: we will redirect the packet to comparer0; the next
>>>>> filter will never see it.
>>>>> 2. comparing thread: read it from socket chardev comparer0
>>>>> 3. call qemu_net_queue_send_iov() to send it back to the netdev.
>>>> Ok, looks like I missed something.
>>>>
>>>> My suggestion tries its best to keep the packet comparing from being tied
>>>> to a filter or netdev. But your suggestion still needs it to be coupled
>>>> with a netdev. Any advantages of doing this (or is there a reason that the
>>>> packet must be sent to the netdev after doing comparing?). If not, why not just
>>> Yes, the packet must be sent to the netdev after doing comparing. If both
>>> the primary packet and secondary packet are the same (contain the same
>>> application level data), we will drop the secondary packet, and send the
>>> primary packet to the netdev. Otherwise, we will sync the state.
>>
>> And drop primary packet also here?
> No, the primary packet must be sent back to the netdev, so the client can
> receive the response.
>
> For example:
> 1. guest has a ftp server
> 2. we connect to the ftp server via the network
> 3. both primary guest and secondary guest receive this request
> 4. both primary guest and secondary guest ack it
> 5. we compare these two ack packets in the comparing thread
> 6. it is the same (the seqno is different, but it is not important, we can
>    modify it in colo-rewriter). So we drop the secondary packets, and send
>    back the primary packet to the netdev
> 7. The primary ack packet is sent to the ftp client via netdev.
>
> The ftp client only cares about the received packet. So if the packets from
> the primary and secondary guest contain the same data, we can say they are
> in the "same" state.
>
> Thanks
> Wen Congyang
>

Thanks for the example. But I still don't get why it must be done before
comparing, considering it will always be sent regardless of the result of
the comparison?
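
The rule discussed in the example above (the primary packet always goes
back to the netdev; the comparison only decides whether state must be
synced) can be sketched roughly as follows. This is only an illustration in
Python, not QEMU code: Packet, same_application_data and compare_step are
hypothetical names, and comparing raw payload bytes is an assumed stand-in
for whatever "same application level data" means in the real comparer.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Packet:
    """Hypothetical, heavily simplified packet: only what the sketch needs."""
    seq: int        # TCP sequence number (may differ between the two guests)
    payload: bytes  # application-level data


def same_application_data(primary: Packet, secondary: Packet) -> bool:
    # The sequence number is deliberately ignored: per the thread it can
    # differ between guests and is handled separately by the rewriter.
    return primary.payload == secondary.payload


def compare_step(primary: Packet, secondary: Packet) -> Tuple[Packet, bool]:
    """Return (packet to send to the netdev, whether a state sync is needed)."""
    need_sync = not same_application_data(primary, secondary)
    # Assumption of this sketch: the secondary packet is never forwarded to
    # the client; the primary packet is released in both cases.
    return primary, need_sync


if __name__ == "__main__":
    p = Packet(seq=100, payload=b"220 ready")
    s = Packet(seq=900, payload=b"220 ready")
    print(compare_step(p, s))
```

Written this way, the packet handed back to the netdev does not depend on
the comparison result; only the sync decision does.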