Replication agent patch version 3.

This patch contains multiple fixes for version v2 (v2 patch is attached).

Main changes:
* Move from native threads to the qemu-thread library
* Move from native sockets to the qemu-socket library
* Use QTAILQ

Also attached is a document describing the main replication flows at a
high level.

Description of the Repagent patch:

Repagent is a new block driver that allows an external replication system
to hook into the Qemu storage stack to replicate a volume of a VM.

This RFC patch adds the repagent client module to Qemu.
Documentation of the module's role and API is in the patch at
replication/qemu-repagent.txt

At a high level, the API is shaped like a filter driver for the block
layer, following a discussion on the mailing list triggered by the first
version submitted. The idea is to keep application-specific logic out of
the Qemu storage code by treating the module as a filter hooking into the
storage layer, rather than as a feature of that layer.

The main motivation behind the module is to allow replication of VMs in a
virtualization environment like oVirt. To achieve this we need basic
replication support in Qemu.

By default the build and use of this module is disabled. To activate it
you need to use a flag in ./configure and a commandline switch to qemu.

The module is not feature complete. I wanted to get approval for the
basic approach and interface first, and then complete the features.

Missing features:
* Dirty bitmap at the protected side
  The protected volume side needs to persistently keep track of 'pending'
  IOs. I want to add a bitmap (it can hook in like another 'filter'
  alongside the repagent) that will synchronously track IOs and allow
  getting a list of such IOs. Without such a bitmap, a failure of any
  component (rephub or agent) will require reading the whole protected
  volume.
* Capture IOs at the recovery side
  During fail-over or a fail-over test, the repagent at the recovery side
  needs to capture all IOs (reads and writes) done by the fail-over VM and
  answer them by passing them synchronously to the Rephub.
  (See attached PDF for the full flow)
* Reporting of IO failures
  TBD
* Sample Rephub
  An application that, together with Repagent, forms a full replication
  solution.
* There is still much debugging code (including printfs) in the code.
  This will naturally go away in more mature versions.

Points and open issues:
* I don't have deep knowledge of the structure of the storage stack.
  I used the "raw" format as an example to make "repagent" a filter driver
  that passes through most of the calls. Is this the right way to do it?
* I didn't find a good way to tell the block layer that the repagent
  filter should be used. I currently use a temporary global flag,
  'use_repagent'. It is set in main after parsing the commandline options,
  and checked in block.c. I guess there should be a better way to convey
  this option.

Tests I ran:
* I hooked a Qemu VM with Repagent to a full Zerto replication solution
  based on VMware, and was able to replicate and recover the VM on a
  remote site.
* I wrote a stand-alone Rephub that actively reads the protected volume
  and copies it, for test purposes.

I would appreciate any feedback or suggestions.

Ori
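
To make the missing "dirty bitmap at the protected side" item more
concrete, here is a minimal sketch of chunk-granular tracking of pending
IOs. It is not part of the patch: every name in it (RepDirtyBitmap,
rep_bitmap_*, REP_CHUNK_SIZE) and the 64 KiB granularity are assumptions
made for illustration only. The sketch keeps the bitmap in memory; the
persistence and synchronous update that the real feature needs are left
out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical names throughout; none of this exists in the patch. */
#define REP_CHUNK_SIZE (64 * 1024)   /* assumed tracking granularity, bytes */

typedef struct RepDirtyBitmap {
    uint8_t *bits;        /* one bit per chunk of the protected volume */
    uint64_t nb_chunks;   /* number of chunks covered by the bitmap */
} RepDirtyBitmap;

/* Allocate an all-clean bitmap covering volume_size bytes. */
static RepDirtyBitmap *rep_bitmap_new(uint64_t volume_size)
{
    RepDirtyBitmap *bm = malloc(sizeof(*bm));
    if (!bm) {
        return NULL;
    }
    bm->nb_chunks = (volume_size + REP_CHUNK_SIZE - 1) / REP_CHUNK_SIZE;
    bm->bits = calloc((bm->nb_chunks + 7) / 8, 1);
    if (!bm->bits) {
        free(bm);
        return NULL;
    }
    return bm;
}

static void rep_bitmap_free(RepDirtyBitmap *bm)
{
    if (bm) {
        free(bm->bits);
        free(bm);
    }
}

/* Mark every chunk touched by the write [offset, offset + len) as pending. */
static void rep_bitmap_mark(RepDirtyBitmap *bm, uint64_t offset, uint64_t len)
{
    if (len == 0) {
        return;
    }
    uint64_t first = offset / REP_CHUNK_SIZE;
    uint64_t last = (offset + len - 1) / REP_CHUNK_SIZE;
    assert(last < bm->nb_chunks);
    for (uint64_t c = first; c <= last; c++) {
        bm->bits[c / 8] |= (uint8_t)(1u << (c % 8));
    }
}

static bool rep_bitmap_is_dirty(const RepDirtyBitmap *bm, uint64_t chunk)
{
    return (bm->bits[chunk / 8] >> (chunk % 8)) & 1u;
}

/*
 * Return the first dirty chunk at or after 'start', or nb_chunks if there
 * is none. Looping over this yields the list of regions to resynchronize.
 */
static uint64_t rep_bitmap_next_dirty(const RepDirtyBitmap *bm, uint64_t start)
{
    for (uint64_t c = start; c < bm->nb_chunks; c++) {
        if (rep_bitmap_is_dirty(bm, c)) {
            return c;
        }
    }
    return bm->nb_chunks;
}
```

With something along these lines, recovery after a rephub or agent failure
would only need to walk rep_bitmap_next_dirty() and re-read the marked
chunks, instead of reading the whole protected volume. Clearing bits once
the Rephub acknowledges an IO is deliberately omitted: two pending IOs can
share a chunk, so a real implementation would need some rule for when a
chunk can safely be marked clean again.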