From: Zhang Chen <zhangchen.fnst@cn.fujitsu.com>
To: Jason Wang <jasowang@redhat.com>,
"Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Yang Hongyang <hongyang.yang@easystack.cn>,
"eddie . dong" <eddie.dong@intel.com>,
qemu devel <qemu-devel@nongnu.org>,
Li Zhijian <lizhijian@cn.fujitsu.com>,
zhanghailiang <zhang.zhanghailiang@huawei.com>
Subject: Re: [Qemu-devel] [RFC PATCH 3/3] filter-rewriter: rewrite tcp packet to keep secondary connection
Date: Thu, 23 Jun 2016 18:48:45 +0800
Message-ID: <576BBE8D.1030009@cn.fujitsu.com>
In-Reply-To: <576A3174.2010905@redhat.com>
On 06/22/2016 02:34 PM, Jason Wang wrote:
>
>
> On 06/22/2016 11:12, Zhang Chen wrote:
>>
>>
>> On 06/20/2016 08:14 PM, Dr. David Alan Gilbert wrote:
>>> * Jason Wang (jasowang@redhat.com) wrote:
>>>>
>>>> On 06/14/2016 19:15, Zhang Chen wrote:
>>>>> We will rewrite the TCP packets that the secondary receives and sends.
>>>> More verbose please. E.g. which fields are rewritten and why.
>>
>> OK.
>>
>>>>> Signed-off-by: Zhang Chen <zhangchen.fnst@cn.fujitsu.com>
>>>>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>>>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>>>> ---
>>>>> net/filter-rewriter.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++--
>>>>> trace-events | 3 ++
>>>>> 2 files changed, 95 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/net/filter-rewriter.c b/net/filter-rewriter.c
>>>>> index 12f88c5..86a2f53 100644
>>>>> --- a/net/filter-rewriter.c
>>>>> +++ b/net/filter-rewriter.c
>>>>> @@ -21,6 +21,7 @@
>>>>> #include "qemu/main-loop.h"
>>>>> #include "qemu/iov.h"
>>>>> #include "net/checksum.h"
>>>>> +#include "trace.h"
>>>>> #define FILTER_COLO_REWRITER(obj) \
>>>>> OBJECT_CHECK(RewriterState, (obj), TYPE_FILTER_REWRITER)
>>>>> @@ -64,6 +65,75 @@ static int is_tcp_packet(Packet *pkt)
>>>>> }
>>>>> }
>>>>> +static int handle_primary_tcp_pkt(NetFilterState *nf,
>>>>> + Connection *conn,
>>>>> + Packet *pkt)
>>>>> +{
>>>>> + struct tcphdr *tcp_pkt;
>>>>> +
>>>>> + tcp_pkt = (struct tcphdr *)pkt->transport_layer;
>>>>> +
>>>>> + if (trace_event_get_state(TRACE_COLO_FILTER_REWRITER_DEBUG)) {
>>>> Why not use tracepoints directly?
>>> Because trace can't cope with you having to do an allocation/free.
>>>
>>>>> + char *sdebug, *ddebug;
>>>>> + sdebug = strdup(inet_ntoa(pkt->ip->ip_src));
>>>>> + ddebug = strdup(inet_ntoa(pkt->ip->ip_dst));
>>>>> + fprintf(stderr, "%s: src/dst: %s/%s p: seq/ack=%u/%u"
>>>>> + " flags=%x\n", __func__, sdebug, ddebug,
>>>>> + ntohl(tcp_pkt->th_seq), ntohl(tcp_pkt->th_ack),
>>>>> + tcp_pkt->th_flags);
>>> However, this should use the trace_ call to write the result even if it's
>>> using trace_event_get_state to switch the whole block on/off.
>>
>> I will fix it in the next version.
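On the allocation point: inet_ntoa() returns a pointer to a static buffer, which is why the debug block copies both addresses before printing them. One way to avoid the allocation/free entirely is to format into caller-provided buffers with inet_ntop(). The following is only a rough sketch of that idea; format_addr_pair() is a hypothetical helper name and is not part of the patch.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not in the patch): write "src/dst" into out
 * without any heap allocation. Returns 0 on success, -1 on failure
 * or if out is too small.
 */
static int format_addr_pair(struct in_addr src, struct in_addr dst,
                            char *out, size_t outlen)
{
    char s[INET_ADDRSTRLEN];
    char d[INET_ADDRSTRLEN];
    int n;

    /* inet_ntop() fills the buffer we pass in, so two calls do not clash */
    if (!inet_ntop(AF_INET, &src, s, sizeof(s)) ||
        !inet_ntop(AF_INET, &dst, d, sizeof(d))) {
        return -1;
    }

    n = snprintf(out, outlen, "%s/%s", s, d);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

With stack buffers there is nothing to free, so the strdup()/g_free() pairing in the debug block would go away as well.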
>>
>>>
>>>>> + g_free(sdebug);
>>>>> + g_free(ddebug);
>>>>> + }
>>>>> +
>>>>> + if (((tcp_pkt->th_flags & (TH_ACK | TH_SYN)) == TH_ACK)) {
>>>>> + /* save primary colo tcp packet seq */
>>>>> + conn->primary_seq = ntohl(tcp_pkt->th_ack) - 1;
>>>> Looks like primary_seq will only be updated during handshake, I wonder how
>>>> this works.
>>
>> OK.
>> We assume that the COLO guest is a TCP server.
>>
>> First, the client starts a TCP handshake with a packet carrying
>> seq=client_seq, ack=0, flag=SYN. The COLO primary guest gets this packet
>> and mirrors it (filter-mirror) to the secondary guest, which receives it
>> through filter-redirector.
>> Then the primary guest responds with
>> pkt(seq=primary_seq,ack=client_seq+1,flag=ACK|SYN)
>> and the secondary guest responds with
>> pkt(seq=secondary_seq,ack=client_seq+1,flag=ACK|SYN).
>> Here filter-rewriter saves secondary_seq in its TCP connection state.
>> To finish the handshake, the client sends
>> pkt(seq=client_seq+1,ack=primary_seq+1,flag=ACK).
>> Here filter-rewriter can get primary_seq, rewrite the ack from
>> primary_seq+1 to secondary_seq+1, and recalculate the checksum, so the
>> secondary TCP connection stays valid.
>>
>> When we send/receive packets:
>> the client sends
>> pkt(seq=client_seq+1+data_len,ack=primary_seq+1,flag=ACK|PSH).
>> filter-rewriter rewrites the ack and sends the packet to the secondary guest.
>
> If I read your code correctly, secondary_seq will only be updated
> during handshake. So the ack seq will always be same for each packet
> received by secondary?
Yes. I don't know why the kernel does this, but when I dumped the packets in
hex I found that flag=ACK on the ack packet means that only ACK is set, and
that the seq affects the TCP checksum, which makes the connection fail.
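On the checksum point: the TCP checksum covers the seq and ack fields, so a packet whose seq or ack has been rewritten is dropped by the receiver unless the checksum is refreshed. The patch does a full recalculation with net_checksum_calculate(). As an alternative sketch only, not what the patch does, the checksum can also be updated incrementally from the old and new field values (RFC 1624, eq. 3). All helper names below are hypothetical, and every value is treated as a plain 16-bit or 32-bit number in one consistent byte order.

```c
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit one's-complement sum. */
static uint16_t csum_fold(uint32_t sum)
{
    while (sum >> 16) {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    return (uint16_t)sum;
}

/* RFC 1624 eq. 3 for one 16-bit word: HC' = ~(~HC + ~m + m'). */
static uint16_t csum_update16(uint16_t csum, uint16_t old_w, uint16_t new_w)
{
    uint32_t sum = (uint16_t)~csum;

    sum += (uint16_t)~old_w;
    sum += new_w;
    return (uint16_t)~csum_fold(sum);
}

/* A 32-bit field such as seq or ack is two consecutive 16-bit words. */
static uint16_t csum_update32(uint16_t csum, uint32_t old_v, uint32_t new_v)
{
    csum = csum_update16(csum, (uint16_t)(old_v >> 16),
                         (uint16_t)(new_v >> 16));
    return csum_update16(csum, (uint16_t)(old_v & 0xffff),
                         (uint16_t)(new_v & 0xffff));
}

/* Reference: full checksum over n 16-bit words, used for comparison. */
static uint16_t csum_full(const uint16_t *words, int n)
{
    uint32_t sum = 0;
    int i;

    for (i = 0; i < n; i++) {
        sum += words[i];
    }
    return (uint16_t)~csum_fold(sum);
}
```

The incremental form only needs the old and new field values, so it does not have to walk the payload again. The full recalculation is simpler and is what the patch uses.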
>
>> the primary guest responds with
>> pkt(seq=primary_seq+1,ack=client_seq+1+data_len,flag=ACK)
>> and the secondary guest responds with
>> pkt(seq=secondary_seq+1,ack=client_seq+1+data_len,flag=ACK)
>
> Is ACK a must here?
Yes.
>
>> We rewrite the secondary guest's seq from secondary_seq+1 to primary_seq+1,
>> so the TCP connection stays valid.
>
> What if we have a large window, so the server (guest) wants to
> send more than one TCP packet? The code can only advance primary_seq
> when we've received an ack, which seems wrong.
>
> So it will be very tricky if you don't track offset. Basically, what I
> suggest is rather simple:
>
> 1) calculate offset during handshake, e.g offset = secondary_seq_syn -
> primary_seq_syn
> 2) in handle_primary_tcp_pkt: tcp_pkt->th_ack += offset;
> 3) in handle_secondary_tcp_pkt: tcp_pkt->th_seq -= offset;
>
> Looks like this can handle more cases and is more robust than the current code?
Makes sense, I will change it in the next version.
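As a rough sketch of the three steps above, under my own assumptions: the struct and function names are hypothetical stand-ins rather than the types in net/filter-rewriter.c, the header fields are kept in network byte order as on the wire, and the checksum still has to be refreshed after each rewrite.

```c
#include <stdint.h>
#include <arpa/inet.h> /* htonl, ntohl */

/* Hypothetical stand-in for the two TCP header fields we touch. */
struct tcp_fields {
    uint32_t th_seq; /* network byte order */
    uint32_t th_ack; /* network byte order */
};

/* Hypothetical per-connection state. */
struct conn_state {
    uint32_t offset; /* secondary_seq_syn - primary_seq_syn, modulo 2^32 */
};

/* Step 1: compute the offset once, from the two SYN|ACK sequence numbers. */
static void conn_set_offset(struct conn_state *c,
                            uint32_t primary_seq_syn,
                            uint32_t secondary_seq_syn)
{
    /* unsigned arithmetic wraps modulo 2^32, as TCP sequence numbers do */
    c->offset = secondary_seq_syn - primary_seq_syn;
}

/* Step 2: packet going to the secondary guest, shift ack forward. */
static void rewrite_to_secondary(const struct conn_state *c,
                                 struct tcp_fields *t)
{
    t->th_ack = htonl(ntohl(t->th_ack) + c->offset);
}

/* Step 3: packet coming from the secondary guest, shift seq back. */
static void rewrite_from_secondary(const struct conn_state *c,
                                   struct tcp_fields *t)
{
    t->th_seq = htonl(ntohl(t->th_seq) - c->offset);
}
```

For example, with primary_seq_syn=1000 and secondary_seq_syn=5000 the offset is 4000, so a client ack of 1301 becomes 5301 for the secondary, and a secondary seq of 5301 goes back to 1301. Because the offset is applied to every packet, this keeps working when the guest sends several segments before any ack arrives.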
Thanks
Zhang Chen
>
>>
>>
>>> This code really needs commenting to make it clear what's going on; each
>>> of these functions should say which way the packet is going (e.g.
>>> 'handle packets to the primary from the secondary') - there's a lot
>>> of packet flows going on and without the comments it's very hard to
>>> follow.
>>
>> Thanks. I will add comments in the next version.
>>
>>>
>>> I think this could be because we're fixing up the sequence numbers on the
>>> secondary once we've received the first response from the primary, so it's
>>> only the first packet of each connection that the primary has to do this on -
>>> but hmm I'm not sure without some comments.
>>
>> Yes, you are right.
>>
>>
>>
>>>
>>> Dave
>>>
>>>>> +
>>>>> + /* adjust tcp seq to make secondary guest handle it */
>>>>> + tcp_pkt->th_ack = htonl(conn->secondary_seq + 1);
>>>> I'm not sure this can work for all cases. I believe we should also rewrite
>>>> seq here. And to me, a better approach is to track the offset of seq between
>>>> pri and sec during handshake and rewrite both ack and seq based on this
>>>> offset.
>>
>> In the vast majority of cases, the COLO guest is a TCP server.
>> The client kernel and the guest kernel keep the TCP seq working,
>> so we don't need to rewrite seq here; rewriting the ack and the
>> checksum is enough to make the secondary TCP connection work. If the
>> COLO guest is a TCP client, maybe we can wait for colo-compare to
>> do a checkpoint (the secondary hasn't sent its TCP packet in time).
>>
>>
>> Thanks
>> Zhang Chen
>>
>>
>>>>> + net_checksum_calculate((uint8_t *)pkt->data, pkt->size);
>>>>> + }
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> +static int handle_secondary_tcp_pkt(NetFilterState *nf,
>>>>> + Connection *conn,
>>>>> + Packet *pkt)
>>>>> +{
>>>>> + struct tcphdr *tcp_pkt;
>>>>> +
>>>>> + tcp_pkt = (struct tcphdr *)pkt->transport_layer;
>>>>> +
>>>>> + if (trace_event_get_state(TRACE_COLO_FILTER_REWRITER_DEBUG)) {
>>>>> + char *sdebug, *ddebug;
>>>>> + sdebug = strdup(inet_ntoa(pkt->ip->ip_src));
>>>>> + ddebug = strdup(inet_ntoa(pkt->ip->ip_dst));
>>>>> + printf("handle_secondary_tcp_pkt conn->secondary_seq = %u,\n",
>>>>> + conn->secondary_seq);
>>>>> + printf("handle_secondary_tcp_pkt conn->primary_seq = %u,\n",
>>>>> + conn->primary_seq);
>>>>> + fprintf(stderr, "%s: src/dst: %s/%s p: seq/ack=%u/%u"
>>>>> + " flags=%x\n", __func__, sdebug, ddebug,
>>>>> + ntohl(tcp_pkt->th_seq), ntohl(tcp_pkt->th_ack),
>>>>> + tcp_pkt->th_flags);
>>>>> + g_free(sdebug);
>>>>> + g_free(ddebug);
>>>>> + }
>>>>> +
>>>>> + if (((tcp_pkt->th_flags & (TH_ACK | TH_SYN)) == (TH_ACK | TH_SYN))) {
>>>>> + /* save client's seq */
>>>>> + conn->secondary_seq = ntohl(tcp_pkt->th_seq);
>>>>> + }
>>>>> +
>>>>> + if ((tcp_pkt->th_flags & (TH_ACK | TH_SYN)) == TH_ACK) {
>>>>> + tcp_pkt->th_seq = htonl(conn->primary_seq + 1);
>>>>> + net_checksum_calculate((uint8_t *)pkt->data, pkt->size);
>>>>> + }
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> static ssize_t colo_rewriter_receive_iov(NetFilterState *nf,
>>>>> NetClientState *sender,
>>>>> unsigned flags,
>>>>> @@ -106,10 +176,30 @@ static ssize_t colo_rewriter_receive_iov(NetFilterState *nf,
>>>>> if (sender == nf->netdev) {
>>>>> /* This packet is sent by netdev itself */
>>>>> /* NET_FILTER_DIRECTION_TX */
>>>>> - /* handle_primary_tcp_pkt */
>>>>> + if (!handle_primary_tcp_pkt(nf, conn, pkt)) {
>>>>> + qemu_net_queue_send(s->incoming_queue, sender, 0,
>>>>> + (const uint8_t *)pkt->data, pkt->size, NULL);
>>>>> + packet_destroy(pkt, NULL);
>>>>> + pkt = NULL;
>>>>> + /*
>>>>> + * We block the packet here,after rewrite pkt
>>>>> + * and will send it
>>>>> + */
>>>>> + return 1;
>>>>> + }
>>>>> } else {
>>>>> /* NET_FILTER_DIRECTION_RX */
>>>>> - /* handle_secondary_tcp_pkt */
>>>>> + if (!handle_secondary_tcp_pkt(nf, conn, pkt)) {
>>>>> + qemu_net_queue_send(s->incoming_queue, sender, 0,
>>>>> + (const uint8_t *)pkt->data, pkt->size, NULL);
>>>>> + packet_destroy(pkt, NULL);
>>>>> + pkt = NULL;
>>>>> + /*
>>>>> + * We block the packet here,after rewrite pkt
>>>>> + * and will send it
>>>>> + */
>>>>> + return 1;
>>>>> + }
>>>>> }
>>>>> }
>>>>> diff --git a/trace-events b/trace-events
>>>>> index 6686cdf..5d798c6 100644
>>>>> --- a/trace-events
>>>>> +++ b/trace-events
>>>>> @@ -1927,3 +1927,6 @@ colo_compare_icmp_miscompare_mtu(const char *sta, int size) ": %s %d"
>>>>> colo_compare_ip_info(int psize, const char *sta, const char *stb, int ssize, const char *stc, const char *std) "ppkt size = %d, ip_src = %s, ip_dst = %s, spkt size = %d, ip_src = %s, ip_dst = %s"
>>>>> colo_old_packet_check_found(int64_t old_time) "%" PRId64
>>>>> colo_compare_miscompare(void) ""
>>>>> +
>>>>> +# net/filter-rewriter.c
>>>>> +colo_filter_rewriter_debug(void) ""
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>
>>>
>>> .
>>>
>>
>
>
>
> .
>
--
Thanks
zhangchen