From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:54335) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bG2Bf-0002Sy-77 for qemu-devel@nongnu.org; Thu, 23 Jun 2016 06:48:44 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bG2Bb-0000lB-1o for qemu-devel@nongnu.org; Thu, 23 Jun 2016 06:48:42 -0400 Received: from [59.151.112.132] (port=58021 helo=heian.cn.fujitsu.com) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bG2Ba-0000kH-1T for qemu-devel@nongnu.org; Thu, 23 Jun 2016 06:48:38 -0400 References: <1465902912-27527-1-git-send-email-zhangchen.fnst@cn.fujitsu.com> <1465902912-27527-4-git-send-email-zhangchen.fnst@cn.fujitsu.com> <57678CDA.7020700@redhat.com> <20160620121402.GC2891@work-vm> <576A020E.8040804@cn.fujitsu.com> <576A3174.2010905@redhat.com> From: Zhang Chen Message-ID: <576BBE8D.1030009@cn.fujitsu.com> Date: Thu, 23 Jun 2016 18:48:45 +0800 MIME-Version: 1.0 In-Reply-To: <576A3174.2010905@redhat.com> Content-Type: text/plain; charset="utf-8"; format=flowed Content-Transfer-Encoding: 8bit Subject: Re: [Qemu-devel] [RFC PATCH 3/3] filter-rewriter: rewrite tcp packet to keep secondary connection List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Jason Wang , "Dr. David Alan Gilbert" Cc: Yang Hongyang , "eddie . dong" , qemu devel , Li Zhijian , zhanghailiang On 06/22/2016 02:34 PM, Jason Wang wrote: > > > On 2016年06月22日 11:12, Zhang Chen wrote: >> >> >> On 06/20/2016 08:14 PM, Dr. David Alan Gilbert wrote: >>> * Jason Wang (jasowang@redhat.com) wrote: >>>> >>>> On 2016年06月14日 19:15, Zhang Chen wrote: >>>>> We will rewrite tcp packet secondary received and sent. >>>> More verbose please. E.g which fields were rewrote and why. >> >> OK. >> >>>>> Signed-off-by: Zhang Chen >>>>> Signed-off-by: Li Zhijian >>>>> Signed-off-by: Wen Congyang >>>>> --- >>>>> net/filter-rewriter.c | 94 >>>>> +++++++++++++++++++++++++++++++++++++++++++++++++-- >>>>> trace-events | 3 ++ >>>>> 2 files changed, 95 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/net/filter-rewriter.c b/net/filter-rewriter.c >>>>> index 12f88c5..86a2f53 100644 >>>>> --- a/net/filter-rewriter.c >>>>> +++ b/net/filter-rewriter.c >>>>> @@ -21,6 +21,7 @@ >>>>> #include "qemu/main-loop.h" >>>>> #include "qemu/iov.h" >>>>> #include "net/checksum.h" >>>>> +#include "trace.h" >>>>> #define FILTER_COLO_REWRITER(obj) \ >>>>> OBJECT_CHECK(RewriterState, (obj), TYPE_FILTER_REWRITER) >>>>> @@ -64,6 +65,75 @@ static int is_tcp_packet(Packet *pkt) >>>>> } >>>>> } >>>>> +static int handle_primary_tcp_pkt(NetFilterState *nf, >>>>> + Connection *conn, >>>>> + Packet *pkt) >>>>> +{ >>>>> + struct tcphdr *tcp_pkt; >>>>> + >>>>> + tcp_pkt = (struct tcphdr *)pkt->transport_layer; >>>>> + >>>>> + if (trace_event_get_state(TRACE_COLO_FILTER_REWRITER_DEBUG)) { >>>> Why not use tracepoints directly? >>> Because trace can't cope with you having to do an allocation/free. >>> >>>>> + char *sdebug, *ddebug; >>>>> + sdebug = strdup(inet_ntoa(pkt->ip->ip_src)); >>>>> + ddebug = strdup(inet_ntoa(pkt->ip->ip_dst)); >>>>> + fprintf(stderr, "%s: src/dst: %s/%s p: seq/ack=%u/%u" >>>>> + " flags=%x\n", __func__, sdebug, ddebug, >>>>> + ntohl(tcp_pkt->th_seq), ntohl(tcp_pkt->th_ack), >>>>> + tcp_pkt->th_flags); >>> However, this should use the trace_ call to write the result even if >>> it's >>> using trace_event_get_state to switch the whole block on/off. >> >> I will fix it in next version. >> >>> >>>>> + g_free(sdebug); >>>>> + g_free(ddebug); >>>>> + } >>>>> + >>>>> + if (((tcp_pkt->th_flags & (TH_ACK | TH_SYN)) == TH_ACK)) { >>>>> + /* save primary colo tcp packet seq */ >>>>> + conn->primary_seq = ntohl(tcp_pkt->th_ack) - 1; >>>> Looks like primary_seq will only be updated during handshake, I >>>> wonder how >>>> this works. >> >> OK. >> We assume that colo guest is a tcp server. >> >> Firstly, client start a tcp handshake. the packet's seq=client_seq, >> ack=0,flag=SYN. COLO primary guest get this pkt and >> mirror(filter-mirror) >> to secondary guest, secondary get it use filter-redirector. >> Then,primary guest response >> pkt(seq=primary_seq,ack=client_seq+1,flag=ACK|SYN). >> secondary guest response >> pkt(seq=secondary_seq,ack=client_seq+1,flag=ACK|SYN). >> In here,we use filter-rewriter save the secondary_seq to it's tcp >> connection. >> Finally handshake,client send >> pkt(seq=client_seq+1,ack=primary_seq+1,flag=ACK). >> Here,filter-rewriter can get primary_seq, and rewrite ack from >> primary_seq+1 >> to secondary_seq+1, recalculate checksum. So the secondary tcp >> connection >> kept good. >> >> When we send/recv packet. >> client send >> pkt(seq=client_seq+1+data_len,ack=primary_seq+1,flag=ACK|PSH). >> filter-rewriter rewrite ack and send to secondary guest. > > If I read your code correctly, secondary_seq will only be updated > during handshake. So the ack seq will always be same for each packet > received by secondary? Yes. I don't know why kernel do this. But I dump the packet hex found that, the ack packet flag=ACK means only ack enabled.and the seq will affect tcp checksum make connection failed. > >> primary guest response >> pkt(seq=primary_seq+1,ack=client_seq+1+data_len,flag=ACK) >> secondary guest response >> pkt(seq=secondary_seq+1,ack=client_seq+1+data_len,flag=ACK) > > Is ACK a must here? Yes. > >> we rewrite secondary guest seq from secondary_seq+1 to primary_seq+1. >> So tcp connection kept good. > > What if, consider we have a large window, so server(guest) want to > send more than one TCP packets? The code can only advance primary_seq > when we've received an ack which seems wrong. > > So it will be very tricky if you don't track offset. Basically, what I > suggest is rather simple: > > 1) calculate offset during handshake, e.g offset = secondary_seq_syn - > primary_seq_syn > 2) in handle_primary_tcp_pkt: tcp_pkt->th_ack += offset; > 3) in handle_secondary_tcp_pkt: tcp_pkt->th_seq -= offset; > > Looks like this can handle more cases and more robust than current code? Make sense, I will change it in next version. Thanks Zhang Chen > >> >> >>> This code really needs commenting to make it see what's going on; each >>> of these functions should say which way the packet is going (e.g. >>> 'handle packets to the primary from the secondary') - there's a lot >>> of packet flows going on and without the comments it's very hard to >>> follow. >> >> Thanks..I will add comments in next version. >> >>> >>> I think this could be because we're fixing up the sequence numbers >>> on the >>> secondary once we've received the first response from the primary, >>> so it's >>> only the first packet of each connection that the primary has to do >>> this on - >>> but hmm I'm not sure without some comments. >> >> Yes,you are right. >> >> >> >>> >>> Dave >>> >>>>> + >>>>> + /* adjust tcp seq to make secondary guest handle it */ >>>>> + tcp_pkt->th_ack = htonl(conn->secondary_seq + 1); >>>> I'm not sure this can work for all cases. I believe we should also >>>> rewrite >>>> seq here. And to me, a better approach is to track the offset of >>>> seq between >>>> pri and sec during handshake and rewrite both ack and seq based on >>>> this >>>> offset. >> >> In the vast majority of cases, colo guest is a tcp server. >> client kernel and guest kernel make the tcp seq work good. >> we don't need rewrite seq here. we just need rewrite ack >> and checksum can make secondary tcp connection work. If >> colo guest is a tcp client,maybe we can wait colo-compare >> do a checkpoint(secondary haven't send tcp packet in time). >> >> >> Thanks >> Zhang Chen >> >> >>>>> + net_checksum_calculate((uint8_t *)pkt->data, pkt->size); >>>>> + } >>>>> + >>>>> + return 0; >>>>> +} >>>>> + >>>>> +static int handle_secondary_tcp_pkt(NetFilterState *nf, >>>>> + Connection *conn, >>>>> + Packet *pkt) >>>>> +{ >>>>> + struct tcphdr *tcp_pkt; >>>>> + >>>>> + tcp_pkt = (struct tcphdr *)pkt->transport_layer; >>>>> + >>>>> + if (trace_event_get_state(TRACE_COLO_FILTER_REWRITER_DEBUG)) { >>>>> + char *sdebug, *ddebug; >>>>> + sdebug = strdup(inet_ntoa(pkt->ip->ip_src)); >>>>> + ddebug = strdup(inet_ntoa(pkt->ip->ip_dst)); >>>>> + printf("handle_secondary_tcp_pkt conn->secondary_seq = >>>>> %u,\n", >>>>> + conn->secondary_seq); >>>>> + printf("handle_secondary_tcp_pkt conn->primary_seq = %u,\n", >>>>> + conn->primary_seq); >>>>> + fprintf(stderr, "%s: src/dst: %s/%s p: seq/ack=%u/%u" >>>>> + " flags=%x\n", __func__, sdebug, ddebug, >>>>> + ntohl(tcp_pkt->th_seq), ntohl(tcp_pkt->th_ack), >>>>> + tcp_pkt->th_flags); >>>>> + g_free(sdebug); >>>>> + g_free(ddebug); >>>>> + } >>>>> + >>>>> + if (((tcp_pkt->th_flags & (TH_ACK | TH_SYN)) == (TH_ACK | >>>>> TH_SYN))) { >>>>> + /* save client's seq */ >>>>> + conn->secondary_seq = ntohl(tcp_pkt->th_seq); >>>>> + } >>>>> + >>>>> + if ((tcp_pkt->th_flags & (TH_ACK | TH_SYN)) == TH_ACK) { >>>>> + tcp_pkt->th_seq = htonl(conn->primary_seq + 1); >>>>> + net_checksum_calculate((uint8_t *)pkt->data, pkt->size); >>>>> + } >>>>> + >>>>> + return 0; >>>>> +} >>>>> + >>>>> static ssize_t colo_rewriter_receive_iov(NetFilterState *nf, >>>>> NetClientState *sender, >>>>> unsigned flags, >>>>> @@ -106,10 +176,30 @@ static ssize_t >>>>> colo_rewriter_receive_iov(NetFilterState *nf, >>>>> if (sender == nf->netdev) { >>>>> /* This packet is sent by netdev itself */ >>>>> /* NET_FILTER_DIRECTION_TX */ >>>>> - /* handle_primary_tcp_pkt */ >>>>> + if (!handle_primary_tcp_pkt(nf, conn, pkt)) { >>>>> + qemu_net_queue_send(s->incoming_queue, sender, 0, >>>>> + (const uint8_t *)pkt->data, pkt->size, NULL); >>>>> + packet_destroy(pkt, NULL); >>>>> + pkt = NULL; >>>>> + /* >>>>> + * We block the packet here,after rewrite pkt >>>>> + * and will send it >>>>> + */ >>>>> + return 1; >>>>> + } >>>>> } else { >>>>> /* NET_FILTER_DIRECTION_RX */ >>>>> - /* handle_secondary_tcp_pkt */ >>>>> + if (!handle_secondary_tcp_pkt(nf, conn, pkt)) { >>>>> + qemu_net_queue_send(s->incoming_queue, sender, 0, >>>>> + (const uint8_t *)pkt->data, pkt->size, NULL); >>>>> + packet_destroy(pkt, NULL); >>>>> + pkt = NULL; >>>>> + /* >>>>> + * We block the packet here,after rewrite pkt >>>>> + * and will send it >>>>> + */ >>>>> + return 1; >>>>> + } >>>>> } >>>>> } >>>>> diff --git a/trace-events b/trace-events >>>>> index 6686cdf..5d798c6 100644 >>>>> --- a/trace-events >>>>> +++ b/trace-events >>>>> @@ -1927,3 +1927,6 @@ colo_compare_icmp_miscompare_mtu(const char >>>>> *sta, int size) ": %s %d" >>>>> colo_compare_ip_info(int psize, const char *sta, const char >>>>> *stb, int ssize, const char *stc, const char *std) "ppkt size = >>>>> %d, ip_src = %s, ip_dst = %s, spkt size = %d, ip_src = %s, ip_dst >>>>> = %s" >>>>> colo_old_packet_check_found(int64_t old_time) "%" PRId64 >>>>> colo_compare_miscompare(void) "" >>>>> + >>>>> +# net/filter-rewriter.c >>>>> +colo_filter_rewriter_debug(void) "" >>> -- >>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK >>> >>> >>> . >>> >> > > > > . > -- Thanks zhangchen