Re: [Qemu-devel] [RFC PATCH v2] Replication agent module

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Ori Mamluk <omamluk@zerto.com>
To: Stefan Hajnoczi <stefanha@gmail.com>
Cc: Kevin Wolf <kwolf@redhat.com>,
	Roni Luxenberg <rluxenbe@redhat.com>, Oded Kedem <oded@zerto.com>,
	dlaor@redhat.com, qemu-devel@nongnu.org,
	Anthony Liguori <anthony@codemonkey.ws>,
	Yair Kuszpet <yairk@zerto.com>,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [Qemu-devel] [RFC PATCH v2] Replication agent module
Date: Wed, 04 Apr 2012 16:23:07 +0300	[thread overview]
Message-ID: <4F7C4B3B.7050708@zerto.com> (raw)
In-Reply-To: <20120402112240.GA21865@stefanha-thinkpad.localdomain>

[-- Attachment #1: Type: text/plain, Size: 8675 bytes --]

Hi Stephan,
Thanks again for the thorough and detailed review.
Some answers embedded below, generally I took all the points you mentioned.
Main changes were:
- Use qemu-thread
     * WRT to this - I didn't find any recv() function there. Am I 
missing anything? Should I call the native recv?
         Same for send, even though there's a send_all - should I use it?
- Use qemu_socket
- Use QTAILQ for volumes list

I'll shortly send patch V3 with the changes.

I also attached a PDF with a design sketch following your request to 
describe the main flows.


On 02/04/2012 14:22, Stefan Hajnoczi wrote:
> Feedback on specific points below.  The main thing to think about is how
> to integrate with QEMU's event loop.  You have used threads in places
> but are also using qemu_set_fd_handler().  Most of QEMU's functions
> (including the block layer) cannot be executed from arbitrary threads
> because they are not thread-safe.  So it probably makes sense to run all
> repagent network code in fd handler functions as part of the QEMU event
> loop.
Right. Actually I moved exclusively to set_fd_handler. The thread 
remains only for re-opening disconnected sockets.
The systems needs to recover from a rephub disconnection - i.e. a rephub 
which goes down and up again.
No Qemu function is called from that thread context.

>> By default the build and use of this module is disabled. To activate it
>> you need to use a flag in ./configure and a commandline switch to qemu.
> If the code is portable and doesn't have specific external library
> dependencies then it's good to enable it by default.  Distros and users
> can decide to disable it but I think building everything by default
> makes sense - it reduces bitrot because people will be building your
> code on different host platforms all the time.
I see your point. I'll change it future.
Currently as long as the module is not mature I feel better to leave it 
under a build flag.
I think I also have platform dependent sockets code that I still need to 
solve.
>> Missing features:
>> * Dirty bitmap at the protected side
>> 	The protected volume side needs to persistently keep track of
>> 'pending' IOs.
>> 	I want to add a bitmap (can hook like another 'filter' with the
>> repagent), that will synchronously track IOs. And allow getting a list of
>> such IOs.
>> 	Without such a bitmap a failure of any component (rephub or agent)
>> will require reading the whole protected volume.
> To double-check that I understand the point of this:
>
> If the rephub is unavailable for a period of time and then comes up
> again the rephub will be able to request only the regions of the
> replicated volume that have been modified.
>
> If the host or QEMU crashes does the dirty bitmap come into play?
In general, yes - it depends on how persistent we'll implement the bitmap.
An acceptable bitmap for us can reside in host memory  (i.e. not survive 
a host crash).
It should definitely survive a Qemu crash, and preferably also a rephub 
crash.
I prefer to leave the details to next stages, and just remain aware of 
this 'creature'.
BTW it also deals with network disconnections/outages, not just crashes.
The bitmap allows us to avoid reading the entire protected disk after 
each outage event.
>
>> * Capture IOs at the recovery side
>> 	During fail-over or fail-over test, the repagent at the recovery
>> side needs to capture all IOs (reads and writes) done by the fail-over VM
>> 	and answer them by passing them synchronously to the VRA.
> Does fail-over mean that some management stack functionality will spawn
> a new "recovery" VM if a VM disappears?  The recovery VM will have
> access to the replicated volumes and will itself run repagent.
>
> Can you explain this step-by-step?  What is a "VRA"?
VRA should be spelled Rephub. VRA is a VM that runs the Rephub in the 
Zerto current product.
Fail over indeed means what you wrote.
I attached a high level design document that explains these flows.
>> * I wrote a stand-alone Rephub for actively reading the protected volume
>> and copying it - for test purposes.
> Cool, would be great to see a rephub implementation.
It's still very basic and only for testing. Once I establish the 
repagent design I'll add it.
Should it go into the same repository?
>> @@ -685,6 +693,12 @@ int bdrv_open(BlockDriverState *bs, const char
>> *filename, int flags,
>>       int ret;
>>       char tmp_filename[PATH_MAX];
>>
>> +    if (use_repagent&&  bs->device_name[0] != '\0'&&  strcmp(
>> +            bs->device_name, "pflash0") != 0) {
>> +        BlockDriver *bdrv_repagent = bdrv_find_format("repagent");
>> +        drv = bdrv_repagent;
>> +    }
>> +
>>       if (flags&  BDRV_O_SNAPSHOT) {
>>           BlockDriverState *bs1;
>>           int64_t total_size;
> You could add a -drive option.  See blockdev.c:drive_init().
>
>    -drive ...,repagent=on|off  (default "off")
It's not really a per-volume option.
As a way to deliver the option to the block layer I can do the same as 
the snapshot - and add the option to all drives when the repagent global 
option is enabled (    if (snapshot) 
qemu_opts_foreach(qemu_find_opts("drive"), drive_enable_snapshot, NULL, 
0); )

> There still needs to be a way to configure the rephub network connection
> details.  Is it ever useful to connect one volume to rephub A and
> another volume to rephub B, then the connection details would have to be
> per-drive?
I don't see it as a per-drive option. I think a single rephub is enough.
>
>> @@ -1842,6 +1866,18 @@ static int coroutine_fn
>> bdrv_co_do_writev(BlockDriverState *bs,
>>           ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
>>       }
>>
>> +
>> +#ifdef CONFIG_REPAGENT
>> +    if (bs->device_name[0] != '\0') {
>> +        /* We split the IO only at the highest stack driver layer.
>> +           Currently we know that by checking device_name - only
>> +           highest level (closest to the guest) has that name.
>> +           */
>> +/*           repagent_handle_protected_write(bs, sector_num,
>> +                nb_sectors, qiov, ret);
> The top-level BlockDriverState requirement can be handled if you do
> something like this in blockdev.c:drive_init():
>
>    if (use_repagent) {
>        repagent_start(bs);
>    }
>
> Only the top-level BlockDriverState (bs) would be tracked.
I see. Sounds good.
There's still the problem of additional drivers being added above - e.g. 
live snapshot, but as Kevin advised I'll worry about that later on.
>> +       the content of a protected volume.
>> +       Also used to read/write to a recovery volume - the replica of a
>> protected volume.
> It sounds like the write operation will be used to populate a recovery
> volume?
Yes.
>> +        RepCmdDataStartProtect *pcmd_dat
>> +static void repagent_client_read(void *opaque)
>> +{
>> +    printf("repagent_client_read\n");
>> +    int bytes_read =
>> repcmd_listener_socket_read_next_buf(g_client_state.hsock);
> Did I miss where the socket is set to nonblocking?  This function should
> not block if used as an fd handler in QEMU's main loop.
You didn't miss. Good point. You're referencing the risk of the read 
from the socket getting blocked, right?
>> +    if (g_cl            printf("Connected (%d)...\n", c++);
>> +            usleep(1 * 1000 * 1000);
>> +        }
> This thread isn't actually doing anything!  I think this function could
> be integrated into the QEMU event loop instead of having a thread.
Right. The thread is just a heartbeat responsible for reconnecting the 
socket.
Any periodic event would suffice.
Can you give me a pointer in the code that registers a similar event?
> uint32_tzeof(RepCmd); 
> Hint that bytes
>
>> +    }
>> +
>> +    if (bytecount == 0) {
>> +        printf("Disconnected\n");
>> +        return 0;
> What happens if we get disconnected, looks like this thread will
> continue running with nothing to do?
The g_client_state.is_connected flag will reset and the thread will 
busy-try to connect again
>
> +            assert(cmd_state->curCmd.magic2 == REPCMD_MAGIC2);
> Is there a more robust error handling strategy than aborting the QEMU
> process?
I'll leave a todo there - during development it's probably a good way to 
handle it.
>> +            cmd_state->pReadBuf = (uint8_t *)&cmd_state->curCmd;
>> +            cmd_state->bytesGotten = 0;
>> +            cmd_state->bytesToGet = sizeof(RepCmd);
>> +            cmd_state->isGotHeader = 0;
>> +        }
>> +    }
> Since we're running in a dedicated thread sequential code would be
> simpler than this state-machine approach with isGotHeader and isGotData?
Yes. I started sequentially, but current code uses set_fd_handler


[-- Attachment #2: qemu repagent design.pdf --]
[-- Type: application/pdf, Size: 127160 bytes --]

next prev parent reply	other threads:[~2012-04-04 13:23 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-04-01 12:05 [Qemu-devel] [RFC PATCH v2] Replication agent module Ori Mamluk
2012-04-02 11:22 ` Stefan Hajnoczi
2012-04-04 13:23   ` Ori Mamluk [this message]
2012-04-02 16:31 ` Eric Blake

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4F7C4B3B.7050708@zerto.com \
    --to=omamluk@zerto.com \
    --cc=anthony@codemonkey.ws \
    --cc=dlaor@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=oded@zerto.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=rluxenbe@redhat.com \
    --cc=stefanha@gmail.com \
    --cc=yairk@zerto.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.