From: Fiona Ebner
Date: Thu, 15 Jun 2023 09:04:19 +0200
Subject: Re: Lost partition tables on ide-hd + ahci drive
To: simon.rowe@nutanix.com, QEMU Developers
Cc: "open list:Network Block Dev...", Thomas Lamprecht, jsnow@redhat.com

On 14.06.23 at 16:48, Simon J. Rowe wrote:
> On 02/02/2023 12:08, Fiona Ebner wrote:
>> Hi,
>> over the years we've got 1-2 dozen reports[0] about suddenly
>> missing/corrupted MBR/partition tables. The issue seems to be very rare
>> and there was no success in trying to reproduce it yet. I'm asking here
>> in the hope that somebody has seen something similar.
>>
>> The only commonality seems to be the use of an ide-hd drive with ahci
>> bus.
>>
>> It does seem to happen with both Linux and Windows guests (one of the
>> reports even mentions FreeBSD) and backing storages for the VMs include
>> ZFS, RBD, LVM-Thin as well as file-based storages.
>>
>> Relevant part of an example configuration:
>>
>>>    -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' \
>>>    -drive
>>> 'file=/dev/zvol/myzpool/vm-168-disk-0,if=none,id=drive-sata0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
>>>    -device 'ide-hd,bus=ahci0.0,drive=drive-sata0,id=sata0' \
>> The first reports are from before io_uring was used and there are also
>> reports with writeback cache mode and discard=on,detect-zeroes=unmap.
>>
>> Some reports say that the issue occurred under high IO load.
>>
>> Many reports suspect backups causing the issue. Our backup mechanism
>> uses backup_job_create() for each drive and runs the jobs sequentially.
>> It uses a custom block driver as the backup target which just forwards
>> the writes to the actual target which can be a file or our backup server.
>> (If you really want to see the details, apply the patches in [1] and see
>> pve-backup.c and block/backup-dump.c).
>>
>> Of course, the backup job will read sector 0 of the source disk, but I
>> really can't see where a stray write would happen, why the issue would
>> trigger so rarely or why seemingly only ide-hd+ahci would be affected.
>>
>> So again, just asking if somebody has seen something similar or has a
>> hunch of what the cause might be.
>>
>> [0]: https://bugzilla.proxmox.com/show_bug.cgi?id=2874
>> [1]:
>> https://git.proxmox.com/?p=pve-qemu.git;a=tree;f=debian/patches;hb=HEAD
>>
>>
> We've also seen a handful of similar reports. Again, just the MBR sector
> overwritten by what looks to be guest data (e.g. log messages). The
> common thread with our incidents is again a SATA disk under the AHCI
> controller, we have a network backend (iSCSI) which has experienced a
> failure.
>
> I've tried to repro this with blkdebug and simulated write errors,
> without success.
>

Hi,
which version/build of QEMU are you using? Can you correlate the issue
with any block job or was the drive in use by the guest only?

Best Regards,
Fiona
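
For anyone triaging an affected image, the symptom described in the thread (sector 0 overwritten by what looks like guest data such as log messages) can be checked roughly with a script like the one below. This is only a sketch and not part of QEMU or the Proxmox tooling: the function names, the category labels and the 90% printable-byte threshold are assumptions made up for illustration. The only fixed fact it relies on is that a valid MBR ends with the boot signature bytes 0x55 0xAA at offsets 510 and 511.

```python
import string

SECTOR_SIZE = 512
# Byte values considered "text-like" (hypothetical heuristic).
PRINTABLE = set(string.printable.encode("ascii"))


def classify_sector0(sector: bytes) -> str:
    """Roughly classify the first 512 bytes of a disk image."""
    if len(sector) < SECTOR_SIZE:
        raise ValueError("need at least 512 bytes")
    sector = sector[:SECTOR_SIZE]

    # A valid MBR ends with the boot signature 0x55 0xAA.
    if sector[510:512] == b"\x55\xaa":
        return "mbr-signature-present"

    # Sector was zeroed rather than overwritten with data.
    if not any(sector):
        return "all-zero"

    # Assumed threshold: more than 90% printable bytes suggests text
    # such as log messages landed in sector 0.
    printable = sum(b in PRINTABLE for b in sector)
    if printable / SECTOR_SIZE > 0.9:
        return "mostly-text"

    return "unknown"


def classify_image(path: str) -> str:
    """Read sector 0 of a raw image or block device and classify it."""
    with open(path, "rb") as f:
        return classify_sector0(f.read(SECTOR_SIZE))
```

Calling classify_image() on a raw image or block device path (read access is enough) returns one of the four labels. A "mostly-text" result would match the reports in this thread, but a present signature does not prove that the partition entries themselves are intact.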