From mboxrd@z Thu Jan 1 00:00:00 1970
From: Markus Armbruster
Subject: Re: [Qemu-devel] Massive read only kvm guests when backing file was missing
Date: Thu, 27 Mar 2014 08:36:57 +0100
Message-ID: <87y4zwt7mu.fsf@blackfin.pond.sub.org>
References: <20140327064158.GA17563@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain
Cc: Alejandro Comisario, kvm@vger.kernel.org, ghammer@redhat.com, Stefan Hajnoczi, Jason Wang, linux-kernel@vger.kernel.org, qemu-devel@nongnu.org
To: "Michael S. Tsirkin"
Return-path:
In-Reply-To: <20140327064158.GA17563@redhat.com> (Michael S. Tsirkin's message of "Thu, 27 Mar 2014 08:41:58 +0200")
Sender: linux-kernel-owner@vger.kernel.org
List-Id: kvm.vger.kernel.org

"Michael S. Tsirkin" writes:

> On Wed, Mar 26, 2014 at 11:08:03PM -0300, Alejandro Comisario wrote:
>> Hi List!
>> Hope some one can help me, we had a big issue in our cloud the other
>> day, a couple of our openstack regions ( +2000 kvm guests with qcow2 )
>> went read only filesystem from the guest side because the backing
>> files directory (the openstack _base directory) was compromised and
>> the data was lost, when we realized the data was lost, it took us 5
>> mins to restore the backup of the backing files, but by that time all
>> the kvm guests received some kind of IO error from the hypervisor
>> layer, and went read only on root filesystem.
>>
>> My question would be, is there a way to hold the IO operations against
>> the backing files ( i thought that would be 99% READ operations ) for
>> a little longer ( im asking this because i dont quite understand what
>> is the process and when it raises the error ) in a case the backing
>> files are missing (no IO possible) but is recoverable within minutes ?
>>
>> Any tip on how to achieve this if possible, or information about how
>> backing files works on kvm, will be amazing.
>> Waiting for feedback!
>>
>> kindest regards.
>> Alejandro Comisario
>
>
> I'm guessing this is what happened: guests timed out meanwhile.
> You can increase the timeout within the guest:
> echo 600 > /sys/block/sda/device/timeout
> to time out after 10 minutes.
>
> If you have installed qemu guest agent on your system, you can do this
> from the host. Unfortunately by default its memory can be pushed out to swap
> and then on disk error access there might fail :(
> Maybe we should consider mlock on all its memory at least as an option.
>
> You could pause your guests, restart them after the issue is resolved,
> and we could I guess add functionality to pause VM on disk errors
> automatically.

Stefan?  Would -drive rerror=stop do?
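
A minimal sketch of the in-guest timeout change quoted above, applied to
every disk that exposes a timeout attribute instead of sda only. The
function name set_disk_timeouts and its optional second argument (an
alternate sysfs root, for trying it against a scratch directory) are
hypothetical and not part of any existing tool.

```shell
#!/bin/sh
# Sketch only: raise the command timeout on all disks that have one.
# set_disk_timeouts is a hypothetical helper name.
#   $1  timeout in seconds (e.g. 600 for 10 minutes)
#   $2  sysfs block root; defaults to /sys/block (assumption: only
#       overridden for dry runs against a scratch directory)
set_disk_timeouts() {
    timeout_secs=$1
    sysfs_root=${2:-/sys/block}
    for f in "$sysfs_root"/*/device/timeout; do
        # Skip the unexpanded glob and any device without a writable
        # timeout attribute; not every block device exposes one.
        [ -w "$f" ] || continue
        echo "$timeout_secs" > "$f"
    done
}
```

Run as root inside the guest, e.g. "set_disk_timeouts 600". Values
written to sysfs do not survive a reboot, so this would have to be
reapplied at boot.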