* Lost partition tables on ide-hd + ahci drive
@ 2023-02-02 12:08 Fiona Ebner
2023-02-14 18:21 ` John Snow
2023-06-14 14:48 ` Simon J. Rowe
0 siblings, 2 replies; 19+ messages in thread
From: Fiona Ebner @ 2023-02-02 12:08 UTC (permalink / raw)
To: QEMU Developers; +Cc: open list:Network Block Dev..., Thomas Lamprecht, jsnow
Hi,
over the years we've got 1-2 dozen reports[0] about suddenly
missing/corrupted MBR/partition tables. The issue seems to be very rare
and there was no success in trying to reproduce it yet. I'm asking here
in the hope that somebody has seen something similar.
The only commonality seems to be the use of an ide-hd drive with ahci bus.
It does seem to happen with both Linux and Windows guests (one of the
reports even mentions FreeBSD) and backing storages for the VMs include
ZFS, RBD, LVM-Thin as well as file-based storages.
Relevant part of an example configuration:
> -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' \
> -drive 'file=/dev/zvol/myzpool/vm-168-disk-0,if=none,id=drive-sata0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
> -device 'ide-hd,bus=ahci0.0,drive=drive-sata0,id=sata0' \
The first reports are from before io_uring was used and there are also
reports with writeback cache mode and discard=on,detect-zeroes=unmap.
Some reports say that the issue occurred under high IO load.
Many reports suspect backups causing the issue. Our backup mechanism
uses backup_job_create() for each drive and runs the jobs sequentially.
It uses a custom block driver as the backup target which just forwards
the writes to the actual target which can be a file or our backup server.
(If you really want to see the details, apply the patches in [1] and see
pve-backup.c and block/backup-dump.c).
Of course, the backup job will read sector 0 of the source disk, but I
really can't see where a stray write would happen, why the issue would
trigger so rarely or why seemingly only ide-hd+ahci would be affected.
So again, just asking if somebody has seen something similar or has a
hunch of what the cause might be.
[0]: https://bugzilla.proxmox.com/show_bug.cgi?id=2874
[1]: https://git.proxmox.com/?p=pve-qemu.git;a=tree;f=debian/patches;hb=HEAD
* Re: Lost partition tables on ide-hd + ahci drive
2023-02-02 12:08 Lost partition tables on ide-hd + ahci drive Fiona Ebner
@ 2023-02-14 18:21 ` John Snow
2023-02-15 10:53 ` Fiona Ebner
2023-06-14 14:48 ` Simon J. Rowe
1 sibling, 1 reply; 19+ messages in thread
From: John Snow @ 2023-02-14 18:21 UTC (permalink / raw)
To: Fiona Ebner
Cc: QEMU Developers, open list:Network Block Dev..., Thomas Lamprecht
On Thu, Feb 2, 2023 at 7:08 AM Fiona Ebner <f.ebner@proxmox.com> wrote:
>
> Hi,
> over the years we've got 1-2 dozen reports[0] about suddenly
> missing/corrupted MBR/partition tables. The issue seems to be very rare
> and there was no success in trying to reproduce it yet. I'm asking here
> in the hope that somebody has seen something similar.
>
> The only commonality seems to be the use of an ide-hd drive with ahci bus.
>
> It does seem to happen with both Linux and Windows guests (one of the
> reports even mentions FreeBSD) and backing storages for the VMs include
> ZFS, RBD, LVM-Thin as well as file-based storages.
>
> Relevant part of an example configuration:
>
> > -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' \
> > -drive 'file=/dev/zvol/myzpool/vm-168-disk-0,if=none,id=drive-sata0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
> > -device 'ide-hd,bus=ahci0.0,drive=drive-sata0,id=sata0' \
>
> The first reports are from before io_uring was used and there are also
> reports with writeback cache mode and discard=on,detect-zeroes=unmap.
>
> Some reports say that the issue occurred under high IO load.
>
> Many reports suspect backups causing the issue. Our backup mechanism
> uses backup_job_create() for each drive and runs the jobs sequentially.
> It uses a custom block driver as the backup target which just forwards
> the writes to the actual target which can be a file or our backup server.
> (If you really want to see the details, apply the patches in [1] and see
> pve-backup.c and block/backup-dump.c).
>
> Of course, the backup job will read sector 0 of the source disk, but I
> really can't see where a stray write would happen, why the issue would
> trigger so rarely or why seemingly only ide-hd+ahci would be affected.
>
> So again, just asking if somebody has seen something similar or has a
> hunch of what the cause might be.
>
Hi Fiona;
I'm sorry to say that I haven't worked on the block devices (or
backup) for a little while now, so I am not immediately sure what
might be causing this problem. In general, I advise against using AHCI
in production as better performance (and dev support) can be achieved
through virtio. Still, I am not sure why the combination of AHCI with
backup_job_create() would be corrupting the early sectors of the disk.
Do you have any analysis on how much data gets corrupted? Is it the
first sector only, the first few? Has anyone taken a peek at the
backing storage to see if there are any interesting patterns that can
be observed? (Zeroes, garbage, old data?)
Have any errors or warnings been observed in either the guest or the
host that might offer some clues?
Is there any commonality in the storage format being used? Is it
qcow2? Is it network-backed?
Apologies for the "tier 1" questions.
> [0]: https://bugzilla.proxmox.com/show_bug.cgi?id=2874
> [1]: https://git.proxmox.com/?p=pve-qemu.git;a=tree;f=debian/patches;hb=HEAD
>
* Re: Lost partition tables on ide-hd + ahci drive
2023-02-14 18:21 ` John Snow
@ 2023-02-15 10:53 ` Fiona Ebner
2023-02-15 21:47 ` John Snow
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Fiona Ebner @ 2023-02-15 10:53 UTC (permalink / raw)
To: John Snow
Cc: QEMU Developers, open list:Network Block Dev..., Thomas Lamprecht,
Aaron Lauterer
Am 14.02.23 um 19:21 schrieb John Snow:
> On Thu, Feb 2, 2023 at 7:08 AM Fiona Ebner <f.ebner@proxmox.com> wrote:
>>
>> Hi,
>> over the years we've got 1-2 dozen reports[0] about suddenly
>> missing/corrupted MBR/partition tables. The issue seems to be very rare
>> and there was no success in trying to reproduce it yet. I'm asking here
>> in the hope that somebody has seen something similar.
>>
>> The only commonality seems to be the use of an ide-hd drive with ahci bus.
>>
>> It does seem to happen with both Linux and Windows guests (one of the
>> reports even mentions FreeBSD) and backing storages for the VMs include
>> ZFS, RBD, LVM-Thin as well as file-based storages.
>>
>> Relevant part of an example configuration:
>>
>>> -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' \
>>> -drive 'file=/dev/zvol/myzpool/vm-168-disk-0,if=none,id=drive-sata0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
>>> -device 'ide-hd,bus=ahci0.0,drive=drive-sata0,id=sata0' \
>>
>> The first reports are from before io_uring was used and there are also
>> reports with writeback cache mode and discard=on,detect-zeroes=unmap.
>>
>> Some reports say that the issue occurred under high IO load.
>>
>> Many reports suspect backups causing the issue. Our backup mechanism
>> uses backup_job_create() for each drive and runs the jobs sequentially.
>> It uses a custom block driver as the backup target which just forwards
>> the writes to the actual target which can be a file or our backup server.
>> (If you really want to see the details, apply the patches in [1] and see
>> pve-backup.c and block/backup-dump.c).
>>
>> Of course, the backup job will read sector 0 of the source disk, but I
>> really can't see where a stray write would happen, why the issue would
>> trigger so rarely or why seemingly only ide-hd+ahci would be affected.
>>
>> So again, just asking if somebody has seen something similar or has a
>> hunch of what the cause might be.
>>
>
> Hi Fiona;
>
> I'm sorry to say that I haven't worked on the block devices (or
> backup) for a little while now, so I am not immediately sure what
> might be causing this problem. In general, I advise against using AHCI
> in production as better performance (and dev support) can be achieved
> through virtio.
Yes, we also recommend using virtio-{scsi,blk}-pci to our users and most
do. Still, some use AHCI, I'd guess mostly for Windows, but not only.
> Still, I am not sure why the combination of AHCI with
> backup_job_create() would be corrupting the early sectors of the disk.
It's not clear that backup itself is causing the issue. Some of the
reports do correlate it with backup, but there are no precise timestamps
when the corruption happened. It might be that the additional IO during
backup is somehow triggering the issue.
> Do you have any analysis on how much data gets corrupted? Is it the
> first sector only, the first few? Has anyone taken a peek at the
> backing storage to see if there are any interesting patterns that can
> be observed? (Zeroes, garbage, old data?)
It does seem to be the first sector only, but it's not entirely clear.
Many of the affected users said that after fixing the partition table
with TestDisk, the VMs booted/worked normally again. We only have dumps
for the first MiB of three images. In this case, all Windows with Ceph
RBD images.
See below[0] for the dumps. One was a valid MBR and matched the latest
good backup, so that VM didn't boot for some other reason, not sure if
even related to this bug. I did not include this one. One was completely
empty and one contained other data in the first 512 Bytes, then again
zeroes, but those zeroes are nothing special AFAIK.
> Have any errors or warnings been observed in either the guest or the
> host that might offer some clues?
There is a single user who seemed to have hardware issues, and I'd be
inclined to blame those in that case. But none of the other users
reported any errors or warnings, though I can't say if any checked
inside the guests.
> Is there any commonality in the storage format being used? Is it
> qcow2? Is it network-backed?
There are reports with local ZFS volumes, local LVM-Thin volumes, RBD
images, qcow2 on NFS. So no pattern to be seen.
> Apologies for the "tier 1" questions.
Thank you for your time!
Best Regards,
Fiona
@Aaron (had access to the broken images): please correct me/add anything
relevant I missed. Are the broken VMs/backups still present? If yes, can
we ask the user to check the logs inside?
[0]:
> febner@enia ~/Downloads % hexdump -C dump-vm-120.raw
> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 00100000
> febner@enia ~/Downloads % hexdump -C dump-vm-130.raw
> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 000000c0 00 00 19 03 46 4d 66 6e 00 00 00 00 00 00 00 00 |....FMfn........|
> 000000d0 04 f2 7a 01 00 00 00 00 00 00 00 00 00 00 00 00 |..z.............|
> 000000e0 f0 a4 01 00 00 00 00 00 c8 4d 5b 99 0c 81 ff ff |.........M[.....|
> 000000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> 00000100 00 42 e1 38 0d da ff ff 00 bc b4 3b 0d da ff ff |.B.8.......;....|
> 00000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> 00000120 78 00 00 00 01 00 00 00 a8 00 aa 00 00 00 00 00 |x...............|
> 00000130 a0 71 ba b0 0c 81 ff ff 2e 00 2e 00 00 00 00 00 |.q..............|
> 00000140 a0 71 ba b0 0c 81 ff ff 00 00 00 00 00 00 00 00 |.q..............|
> 00000150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 000001a0 5c 00 44 00 65 00 76 00 69 00 63 00 65 00 5c 00 |\.D.e.v.i.c.e.\.|
> 000001b0 48 00 61 00 72 00 64 00 64 00 69 00 73 00 6b 00 |H.a.r.d.d.i.s.k.|
> 000001c0 56 00 6f 00 6c 00 75 00 6d 00 65 00 32 00 5c 00 |V.o.l.u.m.e.2.\.|
> 000001d0 57 00 69 00 6e 00 64 00 6f 00 77 00 73 00 5c 00 |W.i.n.d.o.w.s.\.|
> 000001e0 4d 00 69 00 63 00 72 00 6f 00 73 00 6f 00 66 00 |M.i.c.r.o.s.o.f.|
> 000001f0 74 00 2e 00 4e 00 45 00 54 00 5c 00 46 00 72 00 |t...N.E.T.\.F.r.|
> 00000200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 00100000
* Re: Lost partition tables on ide-hd + ahci drive
2023-02-15 10:53 ` Fiona Ebner
@ 2023-02-15 21:47 ` John Snow
2023-02-16 8:58 ` Fiona Ebner
2023-02-16 14:17 ` Mike Maslenkin
2023-02-17 9:44 ` Aaron Lauterer
2 siblings, 1 reply; 19+ messages in thread
From: John Snow @ 2023-02-15 21:47 UTC (permalink / raw)
To: Fiona Ebner
Cc: QEMU Developers, open list:Network Block Dev..., Thomas Lamprecht,
Aaron Lauterer
On Wed, Feb 15, 2023, 5:53 AM Fiona Ebner <f.ebner@proxmox.com> wrote:
> Am 14.02.23 um 19:21 schrieb John Snow:
> > On Thu, Feb 2, 2023 at 7:08 AM Fiona Ebner <f.ebner@proxmox.com> wrote:
> >>
> >> Hi,
> >> over the years we've got 1-2 dozen reports[0] about suddenly
> >> missing/corrupted MBR/partition tables. The issue seems to be very rare
> >> and there was no success in trying to reproduce it yet. I'm asking here
> >> in the hope that somebody has seen something similar.
> >>
> >> The only commonality seems to be the use of an ide-hd drive with ahci
> bus.
> >>
> >> It does seem to happen with both Linux and Windows guests (one of the
> >> reports even mentions FreeBSD) and backing storages for the VMs include
> >> ZFS, RBD, LVM-Thin as well as file-based storages.
> >>
> >> Relevant part of an example configuration:
> >>
> >>> -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' \
> >>> -drive 'file=/dev/zvol/myzpool/vm-168-disk-0,if=none,id=drive-sata0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
> >>> -device 'ide-hd,bus=ahci0.0,drive=drive-sata0,id=sata0' \
> >>
> >> The first reports are from before io_uring was used and there are also
> >> reports with writeback cache mode and discard=on,detect-zeroes=unmap.
> >>
> >> Some reports say that the issue occurred under high IO load.
> >>
> >> Many reports suspect backups causing the issue. Our backup mechanism
> >> uses backup_job_create() for each drive and runs the jobs sequentially.
> >> It uses a custom block driver as the backup target which just forwards
> >> the writes to the actual target which can be a file or our backup
> server.
> >> (If you really want to see the details, apply the patches in [1] and see
> >> pve-backup.c and block/backup-dump.c).
> >>
> >> Of course, the backup job will read sector 0 of the source disk, but I
> >> really can't see where a stray write would happen, why the issue would
> >> trigger so rarely or why seemingly only ide-hd+ahci would be affected.
> >>
> >> So again, just asking if somebody has seen something similar or has a
> >> hunch of what the cause might be.
> >>
> >
> > Hi Fiona;
> >
> > I'm sorry to say that I haven't worked on the block devices (or
> > backup) for a little while now, so I am not immediately sure what
> > might be causing this problem. In general, I advise against using AHCI
> > in production as better performance (and dev support) can be achieved
> > through virtio.
>
> Yes, we also recommend using virtio-{scsi,blk}-pci to our users and most
> do. Still, some use AHCI, I'd guess mostly for Windows, but not only.
>
> > Still, I am not sure why the combination of AHCI with
> > backup_job_create() would be corrupting the early sectors of the disk.
>
> It's not clear that backup itself is causing the issue. Some of the
> reports do correlate it with backup, but there are no precise timestamps
> when the corruption happened. It might be that the additional IO during
> backup is somehow triggering the issue.
>
> > Do you have any analysis on how much data gets corrupted? Is it the
> > first sector only, the first few? Has anyone taken a peek at the
> > backing storage to see if there are any interesting patterns that can
> > be observed? (Zeroes, garbage, old data?)
>
> It does seem to be the first sector only, but it's not entirely clear.
> Many of the affected users said that after fixing the partition table
> with TestDisk, the VMs booted/worked normally again. We only have dumps
> for the first MiB of three images. In this case, all Windows with Ceph
> RBD images.
>
There was a corruption case I diagnosed for a client many aeons ago where
Ceph under load turned out to be the culprit for qcow2 corruption.
I don't recall the BZ#, but I'd like to think any version in production
these days isn't prone to the same bug.
This was probably around late 2016 or so, but I don't know precisely when
the bug got fixed (after I shuffled it out of my queue!)
> See below[0] for the dumps. One was a valid MBR and matched the latest
> good backup, so that VM didn't boot for some other reason, not sure if
> even related to this bug. I did not include this one. One was completely
> empty and one contained other data in the first 512 Bytes, then again
> zeroes, but those zeroes are nothing special AFAIK.
>
> > Have any errors or warnings been observed in either the guest or the
> > host that might offer some clues?
>
> There is a single user who seemed to have hardware issues, and I'd be
> inclined to blame those in that case. But none of the other users
> reported any errors or warnings, though I can't say if any checked
> inside the guests.
>
> > Is there any commonality in the storage format being used? Is it
> > qcow2? Is it network-backed?
>
> There are reports with local ZFS volumes, local LVM-Thin volumes, RBD
> images, qcow2 on NFS. So no pattern to be seen.
>
> > Apologies for the "tier 1" questions.
>
> Thank you for your time!
>
Hm, I'm not sure I see any pattern that might help. Could be that AHCI is
just bugged during load, but it's tough to know in what way.
What versions of QEMU are in use here? Is there a date on which you noticed
an increased frequency of these reports?
> Best Regards,
> Fiona
>
> @Aaron (had access to the broken images): please correct me/add anything
> relevant I missed. Are the broken VMs/backups still present? If yes, can
> we ask the user to check the logs inside?
>
> [0]:
> > febner@enia ~/Downloads % hexdump -C dump-vm-120.raw
> > 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> > *
> > 00100000
> > febner@enia ~/Downloads % hexdump -C dump-vm-130.raw
> > 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> > *
> > 000000c0 00 00 19 03 46 4d 66 6e 00 00 00 00 00 00 00 00 |....FMfn........|
> > 000000d0 04 f2 7a 01 00 00 00 00 00 00 00 00 00 00 00 00 |..z.............|
> > 000000e0 f0 a4 01 00 00 00 00 00 c8 4d 5b 99 0c 81 ff ff |.........M[.....|
> > 000000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> > 00000100 00 42 e1 38 0d da ff ff 00 bc b4 3b 0d da ff ff |.B.8.......;....|
> > 00000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> > 00000120 78 00 00 00 01 00 00 00 a8 00 aa 00 00 00 00 00 |x...............|
> > 00000130 a0 71 ba b0 0c 81 ff ff 2e 00 2e 00 00 00 00 00 |.q..............|
> > 00000140 a0 71 ba b0 0c 81 ff ff 00 00 00 00 00 00 00 00 |.q..............|
> > 00000150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> > *
> > 000001a0 5c 00 44 00 65 00 76 00 69 00 63 00 65 00 5c 00 |\.D.e.v.i.c.e.\.|
> > 000001b0 48 00 61 00 72 00 64 00 64 00 69 00 73 00 6b 00 |H.a.r.d.d.i.s.k.|
> > 000001c0 56 00 6f 00 6c 00 75 00 6d 00 65 00 32 00 5c 00 |V.o.l.u.m.e.2.\.|
> > 000001d0 57 00 69 00 6e 00 64 00 6f 00 77 00 73 00 5c 00 |W.i.n.d.o.w.s.\.|
> > 000001e0 4d 00 69 00 63 00 72 00 6f 00 73 00 6f 00 66 00 |M.i.c.r.o.s.o.f.|
> > 000001f0 74 00 2e 00 4e 00 45 00 54 00 5c 00 46 00 72 00 |t...N.E.T.\.F.r.|
> > 00000200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> > *
> > 00100000
>
>
* Re: Lost partition tables on ide-hd + ahci drive
2023-02-15 21:47 ` John Snow
@ 2023-02-16 8:58 ` Fiona Ebner
0 siblings, 0 replies; 19+ messages in thread
From: Fiona Ebner @ 2023-02-16 8:58 UTC (permalink / raw)
To: John Snow
Cc: QEMU Developers, open list:Network Block Dev..., Thomas Lamprecht,
Aaron Lauterer
Am 15.02.23 um 22:47 schrieb John Snow:
> Hm, I'm not sure I see any pattern that might help. Could be that AHCI
> is just bugged during load, but it's tough to know in what way.
If we ever get a backtrace where the bad write actually goes through
QEMU, I'll let you know.
We are considering providing a custom build to affected users (using
GDB hooks leads to too much slowdown in these performance-critical
paths) in the hope of catching it if it triggers again. We can't really
roll it out to all users, because most writes to sector zero are
legitimate after all and most users are not affected.
> What versions of QEMU are in use here? Is there a date on which you
> noticed an increased frequency of these reports?
There were a few reports around the time we rolled out 4.2 and 5.0
(Q2/Q3 of 2020), but the frequency was always very low. AFAICT, there
are about 20-40 reports in total that could be this issue. The earliest
ones I know of with lost partitions, but not much more information, are
forum threads from 2017/2018.
With 4.2, there was a rework of our backup patches, so naturally I
suspected that. Before 4.2, we had extended the backup job to allow
using a callback to handle the writes instead of the BlockDriverState
target. But starting from 4.2, we are not messing with that anymore and
are using a custom driver as the backup target. That custom driver
doesn't even know about the source. The source is handled by the usual
backup job mechanisms.
If there was some general mix-up there, I'd not expect it to work for
>99.99% of backups and only trigger in combination with AHCI, but who knows?
Best Regards,
Fiona
* Re: Lost partition tables on ide-hd + ahci drive
2023-02-15 10:53 ` Fiona Ebner
2023-02-15 21:47 ` John Snow
@ 2023-02-16 14:17 ` Mike Maslenkin
2023-02-16 15:25 ` Fiona Ebner
2023-02-17 13:40 ` Fiona Ebner
2023-02-17 9:44 ` Aaron Lauterer
2 siblings, 2 replies; 19+ messages in thread
From: Mike Maslenkin @ 2023-02-16 14:17 UTC (permalink / raw)
To: Fiona Ebner
Cc: John Snow, QEMU Developers, open list:Network Block Dev...,
Thomas Lamprecht, Aaron Lauterer
Would an additional comparison make sense here: check for LBA == 0 and
then check the MBR signature bytes?
Additionally, it's easy to check the buffer_is_zero() result or even
print the FIS contents under these conditions.
The data looks like a part of guest memory of 64-bit Windows.
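As a rough illustration of the kind of check being suggested, here is a
minimal standalone sketch (plain C, not actual QEMU code; the helper name
and the exact hook point into the AHCI/IDE write path are hypothetical).
It only flags writes to LBA 0 whose payload is neither all zeroes nor
carries the 0x55AA MBR boot signature:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical debug helper: imagine it being called from the emulated
 * disk's write path with the target LBA and the first sector of the
 * payload. It ignores all-zero payloads and payloads that carry the
 * 0x55AA MBR boot signature, to cut down on legitimate sector-0 writes. */
static void check_suspicious_sector0_write(uint64_t lba, const uint8_t *buf,
                                           size_t len)
{
    if (lba != 0 || len < 512) {
        return;
    }

    bool all_zero = true;                /* stand-in for buffer_is_zero() */
    for (size_t i = 0; i < 512; i++) {
        if (buf[i] != 0) {
            all_zero = false;
            break;
        }
    }

    bool has_mbr_signature = buf[510] == 0x55 && buf[511] == 0xaa;

    if (!all_zero && !has_mbr_signature) {
        fprintf(stderr, "suspicious write to LBA 0 (%zu bytes, no MBR "
                "signature)\n", len);
        /* A real build could additionally dump the FIS / task-file
         * registers at this point. */
    }
}

int main(void)
{
    uint8_t sector[512] = { 0 };
    memcpy(sector, "NtFs", 4);           /* mimic the garbage seen in the dumps */
    check_suspicious_sector0_write(0, sector, sizeof(sector));
    return 0;
}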
On Wed, Feb 15, 2023 at 1:53 PM Fiona Ebner <f.ebner@proxmox.com> wrote:
>
> Am 14.02.23 um 19:21 schrieb John Snow:
> > On Thu, Feb 2, 2023 at 7:08 AM Fiona Ebner <f.ebner@proxmox.com> wrote:
> >>
> >> Hi,
> >> over the years we've got 1-2 dozen reports[0] about suddenly
> >> missing/corrupted MBR/partition tables. The issue seems to be very rare
> >> and there was no success in trying to reproduce it yet. I'm asking here
> >> in the hope that somebody has seen something similar.
> >>
> >> The only commonality seems to be the use of an ide-hd drive with ahci bus.
> >>
> >> It does seem to happen with both Linux and Windows guests (one of the
> >> reports even mentions FreeBSD) and backing storages for the VMs include
> >> ZFS, RBD, LVM-Thin as well as file-based storages.
> >>
> >> Relevant part of an example configuration:
> >>
> >>> -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' \
> >>> -drive 'file=/dev/zvol/myzpool/vm-168-disk-0,if=none,id=drive-sata0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
> >>> -device 'ide-hd,bus=ahci0.0,drive=drive-sata0,id=sata0' \
> >>
> >> The first reports are from before io_uring was used and there are also
> >> reports with writeback cache mode and discard=on,detect-zeroes=unmap.
> >>
> >> Some reports say that the issue occurred under high IO load.
> >>
> >> Many reports suspect backups causing the issue. Our backup mechanism
> >> uses backup_job_create() for each drive and runs the jobs sequentially.
> >> It uses a custom block driver as the backup target which just forwards
> >> the writes to the actual target which can be a file or our backup server.
> >> (If you really want to see the details, apply the patches in [1] and see
> >> pve-backup.c and block/backup-dump.c).
> >>
> >> Of course, the backup job will read sector 0 of the source disk, but I
> >> really can't see where a stray write would happen, why the issue would
> >> trigger so rarely or why seemingly only ide-hd+ahci would be affected.
> >>
> >> So again, just asking if somebody has seen something similar or has a
> >> hunch of what the cause might be.
> >>
> >
> > Hi Fiona;
> >
> > I'm sorry to say that I haven't worked on the block devices (or
> > backup) for a little while now, so I am not immediately sure what
> > might be causing this problem. In general, I advise against using AHCI
> > in production as better performance (and dev support) can be achieved
> > through virtio.
>
> Yes, we also recommend using virtio-{scsi,blk}-pci to our users and most
> do. Still, some use AHCI, I'd guess mostly for Windows, but not only.
>
> > Still, I am not sure why the combination of AHCI with
> > backup_job_create() would be corrupting the early sectors of the disk.
>
> It's not clear that backup itself is causing the issue. Some of the
> reports do correlate it with backup, but there are no precise timestamps
> when the corruption happened. It might be that the additional IO during
> backup is somehow triggering the issue.
>
> > Do you have any analysis on how much data gets corrupted? Is it the
> > first sector only, the first few? Has anyone taken a peek at the
> > backing storage to see if there are any interesting patterns that can
> > be observed? (Zeroes, garbage, old data?)
>
> It does seem to be the first sector only, but it's not entirely clear.
> Many of the affected users said that after fixing the partition table
> with TestDisk, the VMs booted/worked normally again. We only have dumps
> for the first MiB of three images. In this case, all Windows with Ceph
> RBD images.
>
> See below[0] for the dumps. One was a valid MBR and matched the latest
> good backup, so that VM didn't boot for some other reason, not sure if
> even related to this bug. I did not include this one. One was completely
> empty and one contained other data in the first 512 Bytes, then again
> zeroes, but those zeroes are nothing special AFAIK.
>
> > Have any errors or warnings been observed in either the guest or the
> > host that might offer some clues?
>
> There is a single user who seemed to have hardware issues, and I'd be
> inclined to blame those in that case. But none of the other users
> reported any errors or warnings, though I can't say if any checked
> inside the guests.
>
> > Is there any commonality in the storage format being used? Is it
> > qcow2? Is it network-backed?
>
> There are reports with local ZFS volumes, local LVM-Thin volumes, RBD
> images, qcow2 on NFS. So no pattern to be seen.
>
> > Apologies for the "tier 1" questions.
>
> Thank you for your time!
>
> Best Regards,
> Fiona
>
> @Aaron (had access to the broken images): please correct me/add anything
> relevant I missed. Are the broken VMs/backups still present? If yes, can
> we ask the user to check the logs inside?
>
> [0]:
> > febner@enia ~/Downloads % hexdump -C dump-vm-120.raw
> > 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> > *
> > 00100000
> > febner@enia ~/Downloads % hexdump -C dump-vm-130.raw
> > 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> > *
> > 000000c0 00 00 19 03 46 4d 66 6e 00 00 00 00 00 00 00 00 |....FMfn........|
> > 000000d0 04 f2 7a 01 00 00 00 00 00 00 00 00 00 00 00 00 |..z.............|
> > 000000e0 f0 a4 01 00 00 00 00 00 c8 4d 5b 99 0c 81 ff ff |.........M[.....|
> > 000000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> > 00000100 00 42 e1 38 0d da ff ff 00 bc b4 3b 0d da ff ff |.B.8.......;....|
> > 00000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> > 00000120 78 00 00 00 01 00 00 00 a8 00 aa 00 00 00 00 00 |x...............|
> > 00000130 a0 71 ba b0 0c 81 ff ff 2e 00 2e 00 00 00 00 00 |.q..............|
> > 00000140 a0 71 ba b0 0c 81 ff ff 00 00 00 00 00 00 00 00 |.q..............|
> > 00000150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> > *
> > 000001a0 5c 00 44 00 65 00 76 00 69 00 63 00 65 00 5c 00 |\.D.e.v.i.c.e.\.|
> > 000001b0 48 00 61 00 72 00 64 00 64 00 69 00 73 00 6b 00 |H.a.r.d.d.i.s.k.|
> > 000001c0 56 00 6f 00 6c 00 75 00 6d 00 65 00 32 00 5c 00 |V.o.l.u.m.e.2.\.|
> > 000001d0 57 00 69 00 6e 00 64 00 6f 00 77 00 73 00 5c 00 |W.i.n.d.o.w.s.\.|
> > 000001e0 4d 00 69 00 63 00 72 00 6f 00 73 00 6f 00 66 00 |M.i.c.r.o.s.o.f.|
> > 000001f0 74 00 2e 00 4e 00 45 00 54 00 5c 00 46 00 72 00 |t...N.E.T.\.F.r.|
> > 00000200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> > *
> > 00100000
>
>
* Re: Lost partition tables on ide-hd + ahci drive
2023-02-16 14:17 ` Mike Maslenkin
@ 2023-02-16 15:25 ` Fiona Ebner
2023-02-16 16:15 ` Mike Maslenkin
2023-02-17 13:40 ` Fiona Ebner
1 sibling, 1 reply; 19+ messages in thread
From: Fiona Ebner @ 2023-02-16 15:25 UTC (permalink / raw)
To: Mike Maslenkin
Cc: John Snow, QEMU Developers, open list:Network Block Dev...,
Thomas Lamprecht, Aaron Lauterer
Am 16.02.23 um 15:17 schrieb Mike Maslenkin:
> Does additional comparison make a sense here: check for LBA == 0 and
> then check MBR signature bytes.
> Additionally it’s easy to check buffer_is_zero() result or even print
> FIS contents under these conditions.
> Data looks like a part of guest memory of 64bit Windows.
Thank you for the suggestion! I'll think about adding such a check and
dumping of FIS contents in a custom build for affected users. But in
general it would be too much noise for non-MBR cases: e.g. on a disk
formatted with ext4 (without any partitions), Linux will write to sector
0 on every startup and shutdown.
Best Regards,
Fiona
* Re: Lost partition tables on ide-hd + ahci drive
2023-02-16 15:25 ` Fiona Ebner
@ 2023-02-16 16:15 ` Mike Maslenkin
2023-02-17 12:25 ` Fiona Ebner
0 siblings, 1 reply; 19+ messages in thread
From: Mike Maslenkin @ 2023-02-16 16:15 UTC (permalink / raw)
To: Fiona Ebner
Cc: John Snow, QEMU Developers, open list:Network Block Dev...,
Thomas Lamprecht, Aaron Lauterer
That makes sense for disks without a partition table.
But wouldn't Linux or any other OS write at least 4K bytes in that case?
Who would want to write 512 bytes for any purpose except the boot
sector nowadays?
In the dump mentioned before, only 512 bytes were not zeroed, so I guess
it was caused by IO from the guest OS.
In other cases it can be caused by misconfigured IDE register state
or a broken FIS memory area.
On Thu, Feb 16, 2023 at 6:25 PM Fiona Ebner <f.ebner@proxmox.com> wrote:
>
> Am 16.02.23 um 15:17 schrieb Mike Maslenkin:
> > Does additional comparison make a sense here: check for LBA == 0 and
> > then check MBR signature bytes.
> > Additionally it’s easy to check buffer_is_zero() result or even print
> > FIS contents under these conditions.
> > Data looks like a part of guest memory of 64bit Windows.
>
> Thank you for the suggestion! I'll think about adding such a check and
> dumping of FIS contents in a custom build for affected users. But in
> general it would be too much noise for non-MBR cases: e.g. on a disk
> formatted with ext4 (without any partitions), Linux will write to sector
> 0 on every startup and shutdown.
>
> Best Regards,
> Fiona
>
* Re: Lost partition tables on ide-hd + ahci drive
2023-02-15 10:53 ` Fiona Ebner
2023-02-15 21:47 ` John Snow
2023-02-16 14:17 ` Mike Maslenkin
@ 2023-02-17 9:44 ` Aaron Lauterer
2 siblings, 0 replies; 19+ messages in thread
From: Aaron Lauterer @ 2023-02-17 9:44 UTC (permalink / raw)
To: Fiona Ebner, John Snow
Cc: QEMU Developers, open list:Network Block Dev..., Thomas Lamprecht
I am a bit late, but nonetheless, some comments inline.
On 2/15/23 11:53, Fiona Ebner wrote:
> Am 14.02.23 um 19:21 schrieb John Snow:
>> On Thu, Feb 2, 2023 at 7:08 AM Fiona Ebner <f.ebner@proxmox.com> wrote:
>>>
>>> Hi,
>>> over the years we've got 1-2 dozen reports[0] about suddenly
>>> missing/corrupted MBR/partition tables. The issue seems to be very rare
>>> and there was no success in trying to reproduce it yet. I'm asking here
>>> in the hope that somebody has seen something similar.
>>>
>>> The only commonality seems to be the use of an ide-hd drive with ahci bus.
>>>
>>> It does seem to happen with both Linux and Windows guests (one of the
>>> reports even mentions FreeBSD) and backing storages for the VMs include
>>> ZFS, RBD, LVM-Thin as well as file-based storages.
>>>
>>> Relevant part of an example configuration:
>>>
>>>> -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' \
>>>> -drive 'file=/dev/zvol/myzpool/vm-168-disk-0,if=none,id=drive-sata0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
>>>> -device 'ide-hd,bus=ahci0.0,drive=drive-sata0,id=sata0' \
>>>
>>> The first reports are from before io_uring was used and there are also
>>> reports with writeback cache mode and discard=on,detect-zeroes=unmap.
>>>
>>> Some reports say that the issue occurred under high IO load.
>>>
>>> Many reports suspect backups causing the issue. Our backup mechanism
>>> uses backup_job_create() for each drive and runs the jobs sequentially.
>>> It uses a custom block driver as the backup target which just forwards
>>> the writes to the actual target which can be a file or our backup server.
>>> (If you really want to see the details, apply the patches in [1] and see
>>> pve-backup.c and block/backup-dump.c).
>>>
>>> Of course, the backup job will read sector 0 of the source disk, but I
>>> really can't see where a stray write would happen, why the issue would
>>> trigger so rarely or why seemingly only ide-hd+ahci would be affected.
>>>
>>> So again, just asking if somebody has seen something similar or has a
>>> hunch of what the cause might be.
>>>
>>
>> Hi Fiona;
>>
>> I'm sorry to say that I haven't worked on the block devices (or
>> backup) for a little while now, so I am not immediately sure what
>> might be causing this problem. In general, I advise against using AHCI
>> in production as better performance (and dev support) can be achieved
>> through virtio.
>
> Yes, we also recommend using virtio-{scsi,blk}-pci to our users and most
> do. Still, some use AHCI, I'd guess mostly for Windows, but not only.
>
>> Still, I am not sure why the combination of AHCI with
>> backup_job_create() would be corrupting the early sectors of the disk.
>
> It's not clear that backup itself is causing the issue. Some of the
> reports do correlate it with backup, but there are no precise timestamps
> when the corruption happened. It might be that the additional IO during
> backup is somehow triggering the issue.
>
>> Do you have any analysis on how much data gets corrupted? Is it the
>> first sector only, the first few? Has anyone taken a peek at the
>> backing storage to see if there are any interesting patterns that can
>> be observed? (Zeroes, garbage, old data?)
>
> It does seem to be the first sector only, but it's not entirely clear.
> Many of the affected users said that after fixing the partition table
> with TestDisk, the VMs booted/worked normally again. We only have dumps
> for the first MiB of three images. In this case, all Windows with Ceph
> RBD images.
>
> See below[0] for the dumps. One was a valid MBR and matched the latest
> good backup, so that VM didn't boot for some other reason, not sure if
> even related to this bug. I did not include this one. One was completely
> empty and one contained other data in the first 512 Bytes, then again
> zeroes, but those zeroes are nothing special AFAIK.
Unfortunately, we only had direct access to those 3 disks mentioned. I took a
look at them, and for the first MiB it matches what @Fiona explained. At the
1 MiB mark, all 3 disk images looked normal when compared to a similar test
Windows installation: the start of the NTFS file system. The VMs were installed
in BIOS mode, so no ESP.
Cloning the VMs and replacing the first 512 bytes of the disk image from a
known-good earlier backup to restore the partition table seems to be all that
was necessary. Afterward, those VMs were able to boot all the way to the
Windows login screen. That matches the reports we have from the community.
We were not able to confirm the integrity of the rest of the disk, though.
>
>> Have any errors or warnings been observed in either the guest or the
>> host that might offer some clues?
>
> There is a single user who seemed to have hardware issues, and I'd be
> inclined to blame those in that case. But none of the other users
> reported any errors or warnings, though I can't say if any checked
> inside the guests.
>
>> Is there any commonality in the storage format being used? Is it
>> qcow2? Is it network-backed?
>
> There are reports with local ZFS volumes, local LVM-Thin volumes, RBD
> images, qcow2 on NFS. So no pattern to be seen.
>
>> Apologies for the "tier 1" questions.
>
> Thank you for your time!
>
> Best Regards,
> Fiona
>
> @Aaron (had access to the broken images): please correct me/add anything
> relevant I missed. Are the broken VMs/backups still present? If yes, can
> we ask the user to check the logs inside?
I can ask. I guess the plan would be to clone the failed VM, restore the boot
sector and then check the logs (Event Viewer) to see if there is anything of
interest? It is possible that we won't be able to get access to the VM itself,
if the customer doesn't want that for data privacy reasons.
>
> [0]:
>> febner@enia ~/Downloads % hexdump -C dump-vm-120.raw
>> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> *
>> 00100000
>> febner@enia ~/Downloads % hexdump -C dump-vm-130.raw
>> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> *
>> 000000c0 00 00 19 03 46 4d 66 6e 00 00 00 00 00 00 00 00 |....FMfn........|
>> 000000d0 04 f2 7a 01 00 00 00 00 00 00 00 00 00 00 00 00 |..z.............|
>> 000000e0 f0 a4 01 00 00 00 00 00 c8 4d 5b 99 0c 81 ff ff |.........M[.....|
>> 000000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> 00000100 00 42 e1 38 0d da ff ff 00 bc b4 3b 0d da ff ff |.B.8.......;....|
>> 00000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> 00000120 78 00 00 00 01 00 00 00 a8 00 aa 00 00 00 00 00 |x...............|
>> 00000130 a0 71 ba b0 0c 81 ff ff 2e 00 2e 00 00 00 00 00 |.q..............|
>> 00000140 a0 71 ba b0 0c 81 ff ff 00 00 00 00 00 00 00 00 |.q..............|
>> 00000150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> *
>> 000001a0 5c 00 44 00 65 00 76 00 69 00 63 00 65 00 5c 00 |\.D.e.v.i.c.e.\.|
>> 000001b0 48 00 61 00 72 00 64 00 64 00 69 00 73 00 6b 00 |H.a.r.d.d.i.s.k.|
>> 000001c0 56 00 6f 00 6c 00 75 00 6d 00 65 00 32 00 5c 00 |V.o.l.u.m.e.2.\.|
>> 000001d0 57 00 69 00 6e 00 64 00 6f 00 77 00 73 00 5c 00 |W.i.n.d.o.w.s.\.|
>> 000001e0 4d 00 69 00 63 00 72 00 6f 00 73 00 6f 00 66 00 |M.i.c.r.o.s.o.f.|
>> 000001f0 74 00 2e 00 4e 00 45 00 54 00 5c 00 46 00 72 00 |t...N.E.T.\.F.r.|
>> 00000200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> *
>> 00100000
* Re: Lost partition tables on ide-hd + ahci drive
2023-02-16 16:15 ` Mike Maslenkin
@ 2023-02-17 12:25 ` Fiona Ebner
0 siblings, 0 replies; 19+ messages in thread
From: Fiona Ebner @ 2023-02-17 12:25 UTC (permalink / raw)
To: Mike Maslenkin
Cc: John Snow, QEMU Developers, open list:Network Block Dev...,
Thomas Lamprecht, Aaron Lauterer
Am 16.02.23 um 17:15 schrieb Mike Maslenkin:
> Makes sense for disks without partition table.
> But wouldn't Linux or any other OS write at least 4K bytes in that case?
Yes, it does here.
> Who may want to write 512 bytes for any purposes except for boot
> sector nowadays..
From a quick test, fdisk on Linux even causes a 4KiB write when
partitioning, Windows only 512 bytes.
> In dump mentioned before only 512 bytes were not zeroed, so I guess it
> was caused by IO from guest OS.
Yes, with all the checks you suggested, most false positives could be
avoided, and we can hope to catch something with Windows. The same goes
for Linux, if the corruption there is also just 512 bytes and not 4KiB,
but we don't have any dumps yet unfortunately.
> In other cases it can be caused by misconfigured IDE registers state
> or broken FIS memory area.
I stumbled upon [0], which will be addressed by [1]. Any chance that it
could be related?
[0]: https://gitlab.com/qemu-project/qemu/-/issues/62
[1]: https://lists.nongnu.org/archive/html/qemu-devel/2023-02/msg01141.html
Best Regards,
Fiona
> On Thu, Feb 16, 2023 at 6:25 PM Fiona Ebner <f.ebner@proxmox.com> wrote:
>>
>> Am 16.02.23 um 15:17 schrieb Mike Maslenkin:
>>> Does additional comparison make a sense here: check for LBA == 0 and
>>> then check MBR signature bytes.
>>> Additionally it’s easy to check buffer_is_zero() result or even print
>>> FIS contents under these conditions.
>>> Data looks like a part of guest memory of 64bit Windows.
>>
>> Thank you for the suggestion! I'll think about adding such a check and
>> dumping of FIS contents in a custom build for affected users. But in
>> general it would be too much noise for non-MBR cases: e.g. on a disk
>> formatted with ext4 (without any partitions), Linux will write to sector
>> 0 on every startup and shutdown.
>>
>> Best Regards,
>> Fiona
>>
>
* Re: Lost partition tables on ide-hd + ahci drive
2023-02-16 14:17 ` Mike Maslenkin
2023-02-16 15:25 ` Fiona Ebner
@ 2023-02-17 13:40 ` Fiona Ebner
2023-02-17 21:22 ` Mike Maslenkin
1 sibling, 1 reply; 19+ messages in thread
From: Fiona Ebner @ 2023-02-17 13:40 UTC (permalink / raw)
To: Mike Maslenkin
Cc: John Snow, QEMU Developers, open list:Network Block Dev...,
Thomas Lamprecht, Aaron Lauterer
Am 16.02.23 um 15:17 schrieb Mike Maslenkin:
> Does additional comparison make a sense here: check for LBA == 0 and
> then check MBR signature bytes.
> Additionally it’s easy to check buffer_is_zero() result or even print
> FIS contents under these conditions.
> Data looks like a part of guest memory of 64bit Windows.
Just today we got a new dump [0], and it's very similar. Again only 512
bytes and again guest memory?
> febner@enia ~/Downloads % hexdump -C dump.raw
> 00000000 00 03 22 00 4e 74 46 73 da 4c a3 1c 3b f5 7d 19 |..".NtFs.L..;.}.|
> 00000010 60 a5 a6 d4 0c a8 ff ff 30 15 d9 e6 0c a8 ff ff |`.......0.......|
> 00000020 5c 00 53 00 6f 00 66 00 74 00 77 00 61 00 72 00 |\.S.o.f.t.w.a.r.|
> 00000030 65 00 44 00 69 00 73 00 74 00 72 00 69 00 62 00 |e.D.i.s.t.r.i.b.|
> 00000040 75 00 74 00 69 00 6f 00 6e 00 5c 00 44 00 6f 00 |u.t.i.o.n.\.D.o.|
> 00000050 77 00 6e 00 6c 00 6f 00 61 00 64 00 5c 00 37 00 |w.n.l.o.a.d.\.7.|
> 00000060 33 00 63 00 36 00 33 00 65 00 32 00 64 00 37 00 |3.c.6.3.e.2.d.7.|
> 00000070 66 00 66 00 38 00 66 00 36 00 35 00 31 00 31 00 |f.f.8.f.6.5.1.1.|
> 00000080 39 00 36 00 63 00 65 00 61 00 31 00 65 00 30 00 |9.6.c.e.a.1.e.0.|
> 00000090 39 00 66 00 66 00 36 00 32 00 30 00 65 00 5c 00 |9.f.f.6.2.0.e.\.|
> 000000a0 69 00 6e 00 73 00 74 00 5c 00 70 00 61 00 63 00 |i.n.s.t.\.p.a.c.|
> 000000b0 6b 00 61 00 67 00 65 00 5f 00 39 00 31 00 37 00 |k.a.g.e._.9.1.7.|
> 000000c0 31 00 5f 00 66 00 6f 00 72 00 5f 00 6b 00 62 00 |1._.f.o.r._.k.b.|
> 000000d0 35 00 30 00 32 00 32 00 38 00 33 00 38 00 7e 00 |5.0.2.2.8.3.8.~.|
> 000000e0 33 00 31 00 62 00 66 00 33 00 38 00 35 00 36 00 |3.1.b.f.3.8.5.6.|
> 000000f0 61 00 64 00 33 00 36 00 34 00 65 00 33 00 35 00 |a.d.3.6.4.e.3.5.|
> 00000100 7e 00 61 00 6d 00 64 00 36 00 34 00 7e 00 7e 00 |~.a.m.d.6.4.~.~.|
> 00000110 31 00 30 00 2e 00 30 00 2e 00 31 00 2e 00 31 00 |1.0...0...1...1.|
> 00000120 33 00 2e 00 63 00 61 00 74 00 1d 08 0d a8 ff ff |3...c.a.t.......|
> 00000130 13 03 0f 00 4e 74 46 73 ea 4d a3 1c 3b f5 7d 19 |....NtFs.M..;.}.|
> 00000140 90 05 4d 0f 0d a8 ff ff a0 0c 55 0d 0d a8 ff ff |..M.......U.....|
> 00000150 43 52 4f 53 4f 46 54 2d 57 49 4e 44 4f 57 53 2d |CROSOFT-WINDOWS-|
> 00000160 44 2e 2e 2d 57 49 4e 50 52 4f 56 49 44 45 52 53 |D..-WINPROVIDERS|
> 00000170 2d 41 53 53 4f 43 5f 33 31 42 46 33 38 35 36 41 |-ASSOC_31BF3856A|
> 00000180 0c 03 67 00 70 00 73 00 63 00 72 00 69 00 70 00 |..g.p.s.c.r.i.p.|
> 00000190 74 00 2e 00 65 00 78 00 65 00 37 00 36 00 34 00 |t...e.x.e.7.6.4.|
> 000001a0 37 00 62 00 33 00 36 00 30 00 30 00 63 00 64 00 |7.b.3.6.0.0.c.d.|
> 000001b0 65 00 30 00 34 00 31 00 35 00 39 00 35 00 32 00 |e.0.4.1.5.9.5.2.|
> 000001c0 31 00 2e 00 74 00 6d 00 70 00 47 00 50 00 53 00 |1...t.m.p.G.P.S.|
> 000001d0 43 00 52 00 49 00 50 00 54 00 2e 00 45 00 58 00 |C.R.I.P.T...E.X.|
> 000001e0 45 00 37 00 36 00 34 00 37 00 42 00 33 00 36 00 |E.7.6.4.7.B.3.6.|
> 000001f0 30 00 30 00 43 00 44 00 45 00 30 00 34 00 31 00 |0.0.C.D.E.0.4.1.|
> 00000200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 00100000
[0]:
https://forum.proxmox.com/threads/not-a-bootable-disk-vm-ms-server-2016.122849/post-534473
* Re: Lost partition tables on ide-hd + ahci drive
2023-02-17 13:40 ` Fiona Ebner
@ 2023-02-17 21:22 ` Mike Maslenkin
2023-08-23 8:47 ` Fiona Ebner
0 siblings, 1 reply; 19+ messages in thread
From: Mike Maslenkin @ 2023-02-17 21:22 UTC (permalink / raw)
To: Fiona Ebner
Cc: John Snow, QEMU Developers, open list:Network Block Dev...,
Thomas Lamprecht, Aaron Lauterer
I think it's guest memory again. IMHO it's part of a memory pool and
not real IO data (unless this was pagefile data).
The first 16 bytes look like a POOL_HEADER structure.
The first dump contained a signature from FilterManager and the latest
contains two structures from Ntfs.
It's not clear to me what exact data follows the header structure, but
in the case of Ntfs it looks like a doubly linked list element with
Flink/Blink pointers: 60 a5 a6 d4 0c a8 ff ff is 0xffffa80cd4a6a560,
and 30 15 d9 e6 0c a8 ff ff is 0xffffa80ce6d91530.
The first Ntfs entry looks like the final element of something, while
the second is a middle part of something else.
That is why I think it is not real IO (i.e. disk data sent by the guest
NTFS driver). IMHO.
I cannot tell anything about dma-reentrancy issues, but yes, I would
start by looking at the check_cmd() function call sequence.
The most interesting question is why Sector Count = 1. I thought about
a race with IDE reset, where the registers are initialized with the
value SATA_SIGNATURE_DISK = 0x00000101, but that means LBA=1 as well...
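To make that last point concrete, here is a small standalone illustration
(not QEMU code) of how the disk signature value mentioned above decomposes
into the task-file registers, assuming the usual AHCI PxSIG layout
(LBA high | LBA mid | LBA low | sector count):

#include <stdint.h>
#include <stdio.h>

/* 0x00000101 is the SATA_SIGNATURE_DISK value referred to above. */
#define SATA_SIGNATURE_DISK 0x00000101u

int main(void)
{
    uint32_t sig = SATA_SIGNATURE_DISK;

    unsigned sector_count = sig & 0xff;         /* 0x01 */
    unsigned lba_low      = (sig >> 8) & 0xff;  /* 0x01 */
    unsigned lba_mid      = (sig >> 16) & 0xff; /* 0x00 */
    unsigned lba_high     = (sig >> 24) & 0xff; /* 0x00 */

    /* So stale reset-time register state would look like Sector Count = 1
     * and LBA = 1, not the LBA = 0 write seen in the corrupted disks. */
    printf("count=%u lba=%u\n", sector_count,
           lba_low | (lba_mid << 8) | (lba_high << 16));
    return 0;
}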
Regards,
Mike
On Fri, Feb 17, 2023 at 4:40 PM Fiona Ebner <f.ebner@proxmox.com> wrote:
>
> Am 16.02.23 um 15:17 schrieb Mike Maslenkin:
> > Does additional comparison make a sense here: check for LBA == 0 and
> > then check MBR signature bytes.
> > Additionally it’s easy to check buffer_is_zero() result or even print
> > FIS contents under these conditions.
> > Data looks like a part of guest memory of 64bit Windows.
>
> Just today we got a new dump [0], and it's very similar. Again only 512
> bytes and again guest memory?
>
> > febner@enia ~/Downloads % hexdump -C dump.raw
> > 00000000 00 03 22 00 4e 74 46 73 da 4c a3 1c 3b f5 7d 19 |..".NtFs.L..;.}.|
> > 00000010 60 a5 a6 d4 0c a8 ff ff 30 15 d9 e6 0c a8 ff ff |`.......0.......|
> > 00000020 5c 00 53 00 6f 00 66 00 74 00 77 00 61 00 72 00 |\.S.o.f.t.w.a.r.|
> > 00000030 65 00 44 00 69 00 73 00 74 00 72 00 69 00 62 00 |e.D.i.s.t.r.i.b.|
> > 00000040 75 00 74 00 69 00 6f 00 6e 00 5c 00 44 00 6f 00 |u.t.i.o.n.\.D.o.|
> > 00000050 77 00 6e 00 6c 00 6f 00 61 00 64 00 5c 00 37 00 |w.n.l.o.a.d.\.7.|
> > 00000060 33 00 63 00 36 00 33 00 65 00 32 00 64 00 37 00 |3.c.6.3.e.2.d.7.|
> > 00000070 66 00 66 00 38 00 66 00 36 00 35 00 31 00 31 00 |f.f.8.f.6.5.1.1.|
> > 00000080 39 00 36 00 63 00 65 00 61 00 31 00 65 00 30 00 |9.6.c.e.a.1.e.0.|
> > 00000090 39 00 66 00 66 00 36 00 32 00 30 00 65 00 5c 00 |9.f.f.6.2.0.e.\.|
> > 000000a0 69 00 6e 00 73 00 74 00 5c 00 70 00 61 00 63 00 |i.n.s.t.\.p.a.c.|
> > 000000b0 6b 00 61 00 67 00 65 00 5f 00 39 00 31 00 37 00 |k.a.g.e._.9.1.7.|
> > 000000c0 31 00 5f 00 66 00 6f 00 72 00 5f 00 6b 00 62 00 |1._.f.o.r._.k.b.|
> > 000000d0 35 00 30 00 32 00 32 00 38 00 33 00 38 00 7e 00 |5.0.2.2.8.3.8.~.|
> > 000000e0 33 00 31 00 62 00 66 00 33 00 38 00 35 00 36 00 |3.1.b.f.3.8.5.6.|
> > 000000f0 61 00 64 00 33 00 36 00 34 00 65 00 33 00 35 00 |a.d.3.6.4.e.3.5.|
> > 00000100 7e 00 61 00 6d 00 64 00 36 00 34 00 7e 00 7e 00 |~.a.m.d.6.4.~.~.|
> > 00000110 31 00 30 00 2e 00 30 00 2e 00 31 00 2e 00 31 00 |1.0...0...1...1.|
> > 00000120 33 00 2e 00 63 00 61 00 74 00 1d 08 0d a8 ff ff |3...c.a.t.......|
> > 00000130 13 03 0f 00 4e 74 46 73 ea 4d a3 1c 3b f5 7d 19 |....NtFs.M..;.}.|
> > 00000140 90 05 4d 0f 0d a8 ff ff a0 0c 55 0d 0d a8 ff ff |..M.......U.....|
> > 00000150 43 52 4f 53 4f 46 54 2d 57 49 4e 44 4f 57 53 2d |CROSOFT-WINDOWS-|
> > 00000160 44 2e 2e 2d 57 49 4e 50 52 4f 56 49 44 45 52 53 |D..-WINPROVIDERS|
> > 00000170 2d 41 53 53 4f 43 5f 33 31 42 46 33 38 35 36 41 |-ASSOC_31BF3856A|
> > 00000180 0c 03 67 00 70 00 73 00 63 00 72 00 69 00 70 00 |..g.p.s.c.r.i.p.|
> > 00000190 74 00 2e 00 65 00 78 00 65 00 37 00 36 00 34 00 |t...e.x.e.7.6.4.|
> > 000001a0 37 00 62 00 33 00 36 00 30 00 30 00 63 00 64 00 |7.b.3.6.0.0.c.d.|
> > 000001b0 65 00 30 00 34 00 31 00 35 00 39 00 35 00 32 00 |e.0.4.1.5.9.5.2.|
> > 000001c0 31 00 2e 00 74 00 6d 00 70 00 47 00 50 00 53 00 |1...t.m.p.G.P.S.|
> > 000001d0 43 00 52 00 49 00 50 00 54 00 2e 00 45 00 58 00 |C.R.I.P.T...E.X.|
> > 000001e0 45 00 37 00 36 00 34 00 37 00 42 00 33 00 36 00 |E.7.6.4.7.B.3.6.|
> > 000001f0 30 00 30 00 43 00 44 00 45 00 30 00 34 00 31 00 |0.0.C.D.E.0.4.1.|
> > 00000200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> > *
> > 00100000
>
> [0]:
> https://forum.proxmox.com/threads/not-a-bootable-disk-vm-ms-server-2016.122849/post-534473
>
* Re: Lost partition tables on ide-hd + ahci drive
2023-02-02 12:08 Lost partition tables on ide-hd + ahci drive Fiona Ebner
2023-02-14 18:21 ` John Snow
@ 2023-06-14 14:48 ` Simon J. Rowe
2023-06-15 7:04 ` Fiona Ebner
2023-07-27 13:22 ` Simon Rowe
1 sibling, 2 replies; 19+ messages in thread
From: Simon J. Rowe @ 2023-06-14 14:48 UTC (permalink / raw)
To: Fiona Ebner, QEMU Developers
Cc: open list:Network Block Dev..., Thomas Lamprecht, jsnow
On 02/02/2023 12:08, Fiona Ebner wrote:
> Hi,
> over the years we've got 1-2 dozen reports[0] about suddenly
> missing/corrupted MBR/partition tables. The issue seems to be very rare
> and there was no success in trying to reproduce it yet. I'm asking here
> in the hope that somebody has seen something similar.
>
> The only commonality seems to be the use of an ide-hd drive with ahci bus.
>
> It does seem to happen with both Linux and Windows guests (one of the
> reports even mentions FreeBSD) and backing storages for the VMs include
> ZFS, RBD, LVM-Thin as well as file-based storages.
>
> Relevant part of an example configuration:
>
>> -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' \
>> -drive 'file=/dev/zvol/myzpool/vm-168-disk-0,if=none,id=drive-sata0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
>> -device 'ide-hd,bus=ahci0.0,drive=drive-sata0,id=sata0' \
> The first reports are from before io_uring was used and there are also
> reports with writeback cache mode and discard=on,detect-zeroes=unmap.
>
> Some reports say that the issue occurred under high IO load.
>
> Many reports suspect backups causing the issue. Our backup mechanism
> uses backup_job_create() for each drive and runs the jobs sequentially.
> It uses a custom block driver as the backup target which just forwards
> the writes to the actual target which can be a file or our backup server.
> (If you really want to see the details, apply the patches in [1] and see
> pve-backup.c and block/backup-dump.c).
>
> Of course, the backup job will read sector 0 of the source disk, but I
> really can't see where a stray write would happen, why the issue would
> trigger so rarely or why seemingly only ide-hd+ahci would be affected.
>
> So again, just asking if somebody has seen something similar or has a
> hunch of what the cause might be.
>
> [0]: https://bugzilla.proxmox.com/show_bug.cgi?id=2874
> [1]: https://git.proxmox.com/?p=pve-qemu.git;a=tree;f=debian/patches;hb=HEAD
>
>
We've also seen a handful of similar reports. Again, just the MBR sector
overwritten by what looks to be guest data (e.g. log messages). The
common thread with our incidents is again a SATA disk under the AHCI
controller; we have a network backend (iSCSI) which has experienced a
failure.
I've tried to repro this with blkdebug and simulated write errors,
without success.
Regards
Simon
* Re: Lost partition tables on ide-hd + ahci drive
2023-06-14 14:48 ` Simon J. Rowe
@ 2023-06-15 7:04 ` Fiona Ebner
2023-06-15 8:24 ` Simon Rowe
2023-07-27 13:22 ` Simon Rowe
1 sibling, 1 reply; 19+ messages in thread
From: Fiona Ebner @ 2023-06-15 7:04 UTC (permalink / raw)
To: simon.rowe, QEMU Developers
Cc: open list:Network Block Dev..., Thomas Lamprecht, jsnow
Am 14.06.23 um 16:48 schrieb Simon J. Rowe:
> On 02/02/2023 12:08, Fiona Ebner wrote:
>> Hi,
>> over the years we've got 1-2 dozen reports[0] about suddenly
>> missing/corrupted MBR/partition tables. The issue seems to be very rare
>> and there was no success in trying to reproduce it yet. I'm asking here
>> in the hope that somebody has seen something similar.
>>
>> The only commonality seems to be the use of an ide-hd drive with ahci
>> bus.
>>
>> It does seem to happen with both Linux and Windows guests (one of the
>> reports even mentions FreeBSD) and backing storages for the VMs include
>> ZFS, RBD, LVM-Thin as well as file-based storages.
>>
>> Relevant part of an example configuration:
>>
>>> -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' \
>>> -drive
>>> 'file=/dev/zvol/myzpool/vm-168-disk-0,if=none,id=drive-sata0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
>>> -device 'ide-hd,bus=ahci0.0,drive=drive-sata0,id=sata0' \
>> The first reports are from before io_uring was used and there are also
>> reports with writeback cache mode and discard=on,detect-zeroes=unmap.
>>
>> Some reports say that the issue occurred under high IO load.
>>
>> Many reports suspect backups causing the issue. Our backup mechanism
>> uses backup_job_create() for each drive and runs the jobs sequentially.
>> It uses a custom block driver as the backup target which just forwards
>> the writes to the actual target which can be a file or our backup server.
>> (If you really want to see the details, apply the patches in [1] and see
>> pve-backup.c and block/backup-dump.c).
>>
>> Of course, the backup job will read sector 0 of the source disk, but I
>> really can't see where a stray write would happen, why the issue would
>> trigger so rarely or why seemingly only ide-hd+ahci would be affected.
>>
>> So again, just asking if somebody has seen something similar or has a
>> hunch of what the cause might be.
>>
>> [0]: https://bugzilla.proxmox.com/show_bug.cgi?id=2874
>> [1]:
>> https://git.proxmox.com/?p=pve-qemu.git;a=tree;f=debian/patches;hb=HEAD
>>
>>
> We've also seen a handful of similar reports. Again, just the MBR sector
> overwritten by what looks to be guest data (e.g. log messages). The
> common thread with our incidents is again a SATA disk under the AHCI
> controller, we have a network backend (iSCSI) which has experienced a
> failure.
>
> I've tried to repro this with blkdebug and simulated write errors,
> without success.
>
Hi,
which version/build of QEMU are you using? Can you correlate the issue
with any block job or was the drive in use by the guest only?
Best Regards,
Fiona
* Re: Lost partition tables on ide-hd + ahci drive
2023-06-15 7:04 ` Fiona Ebner
@ 2023-06-15 8:24 ` Simon Rowe
0 siblings, 0 replies; 19+ messages in thread
From: Simon Rowe @ 2023-06-15 8:24 UTC (permalink / raw)
To: Fiona Ebner, QEMU Developers
Cc: open list:Network Block Dev..., Thomas Lamprecht,
jsnow@redhat.com
On Thursday, 15 June 2023 Fiona Ebner wrote:
> which version/build of QEMU are you using? Can you correlate the issue
> with any block job or was the drive in use by the guest only?
I believe this has been seen on a range of releases, including QEMU 4.2 and 2.12. We do carry custom patches, but nothing in the SATA/AHCI code.
I’m not familiar with the storage backend, but in the RCA for one of the incidents the engineer identified an explicit write that hit the MBR. This suggests QEMU is mistakenly redirecting a normal guest write to sector 0, probably following an earlier write failure.
Regards
Simon
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Lost partition tables on ide-hd + ahci drive
2023-06-14 14:48 ` Simon J. Rowe
2023-06-15 7:04 ` Fiona Ebner
@ 2023-07-27 13:22 ` Simon Rowe
1 sibling, 0 replies; 19+ messages in thread
From: Simon Rowe @ 2023-07-27 13:22 UTC (permalink / raw)
To: Fiona Ebner, QEMU Developers
Cc: open list:Network Block Dev..., Thomas Lamprecht,
jsnow@redhat.com
On Wednesday, 14 June 2023 Simon Rowe wrote:
> We've also seen a handful of similar reports. Again, just the MBR sector
> overwritten by what looks to be guest data (e.g. log messages). The
> common thread with our incidents is again a SATA disk under the AHCI
> controller, we have a network backend (iSCSI) which has experienced a
> failure.
>
> I've tried to repro this with blkdebug and simulated write errors,
> without success.
I’ve finally had some success in reproducing this issue. I have a test environment set up as follows:
* QEMU 4.2
* guest booting from CD with a small SATA disk
* guest test harness partitions the disk, then continually writes data to the partition while checking the integrity of the MBR (a rough illustrative sketch of such a harness follows after this list)
* filter script that interposes between QEMU and the iSCSI backend; it drops writes and then resets the connection after a period of time
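For illustration only, here is a minimal sketch of what such a guest-side harness could look like. This is not our actual tool; the device path /dev/sda, the mount point /mnt/test and the buffer sizes are assumptions for a generic Linux guest.

/* mbr_watch.c -- illustrative guest-side harness (not the actual test tool).
 * Assumptions: the MBR lives on /dev/sda and a test partition is mounted
 * at /mnt/test. Keep the disk busy with writes while re-reading sector 0
 * and comparing it against a reference copy taken at startup.
 * Note: without O_DIRECT the page cache may delay detection of a change
 * made underneath the guest; a real harness might use O_DIRECT with
 * aligned buffers instead. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static unsigned char ref[512], cur[512], buf[1 << 20];

    int dfd = open("/dev/sda", O_RDONLY);
    if (dfd < 0 || pread(dfd, ref, sizeof(ref), 0) != (ssize_t)sizeof(ref)) {
        perror("read reference MBR");
        return 1;
    }
    int wfd = open("/mnt/test/fill", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (wfd < 0) {
        perror("open data file");
        return 1;
    }
    memset(buf, 0xA5, sizeof(buf));

    for (unsigned long iter = 0; ; iter++) {
        /* keep write load on the partition; errors are expected once the
         * filter script starts dropping writes / resetting the connection */
        if (write(wfd, buf, sizeof(buf)) < 0) {
            perror("write");
        }
        if (iter % 64 == 0) {
            fsync(wfd);
            if (lseek(wfd, 0, SEEK_SET) < 0) {
                perror("lseek");
            }
        }
        /* re-check sector 0 against the reference copy */
        if (pread(dfd, cur, sizeof(cur), 0) != (ssize_t)sizeof(cur)) {
            perror("re-read MBR");
            continue;
        }
        if (memcmp(ref, cur, sizeof(ref)) != 0) {
            fprintf(stderr, "MBR changed after %lu iterations\n", iter);
            return 2;
        }
    }
}

Building it inside the guest with something like "gcc -O2 -o mbr_watch mbr_watch.c" should be enough; the point is simply to keep write load on the partition while repeatedly comparing sector 0 against a reference copy.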
From tracing in the filter script I can see unsolicited writes to LBA 0 (SCSI opcode 0x2a is WRITE(10); 0x28 is READ(10)) once the SATA controller is reset:
Data in: iSCSI op 01 SCSI op 28 LBA 0 NOP count 5 wait for read False
Data in: iSCSI op 01 SCSI op 28 LBA 0 NOP count 6 wait for read False
Data in: iSCSI op 01 SCSI op 2a LBA 0 NOP count 0 wait for read True
Data in: iSCSI op 01 SCSI op 28 LBA 0 NOP count 0 wait for read False
I have a stack trace from the time the write occurs:
#0 iscsi_co_writev (bs=0x564322ecc220, sector_num=<optimized out>,
nb_sectors=1, iov=0x7fc20c045860, flags=<optimized out>)
at block/iscsi.c:641
#1 0x00005643220e780b in bdrv_driver_pwritev (bs=bs@entry=0x564322ecc220,
offset=offset@entry=0, bytes=bytes@entry=512,
qiov=qiov@entry=0x7fc20c045860, qiov_offset=qiov_offset@entry=0,
flags=flags@entry=0) at block/io.c:1216
#2 0x00005643220e9985 in bdrv_aligned_pwritev (
child=child@entry=0x564322ecb050, req=req@entry=0x7fc2aa90bb00, offset=0,
bytes=512, align=align@entry=512, qiov=0x7fc20c045860, qiov_offset=0,
flags=flags@entry=0) at block/io.c:1980
#3 0x00005643220ea25b in bdrv_co_pwritev_part (child=0x564322ecb050,
offset=offset@entry=0, bytes=bytes@entry=512,
qiov=qiov@entry=0x7fc20c045860, qiov_offset=qiov_offset@entry=0, flags=0)
at block/io.c:2137
#4 0x00005643220ea55b in bdrv_co_pwritev (child=<optimized out>,
offset=offset@entry=0, bytes=bytes@entry=512,
qiov=qiov@entry=0x7fc20c045860, flags=<optimized out>) at block/io.c:2087
#5 0x00005643220aa64d in raw_co_pwritev (bs=0x564322ec4a00, offset=0,
bytes=512, qiov=0x7fc20c045860, flags=<optimized out>)
at block/raw-format.c:258
#6 0x00005643220e7702 in bdrv_driver_pwritev (bs=bs@entry=0x564322ec4a00,
offset=offset@entry=0, bytes=bytes@entry=512,
qiov=qiov@entry=0x7fc20c045860, qiov_offset=qiov_offset@entry=0,
flags=flags@entry=0) at block/io.c:1183
#7 0x00005643220e9985 in bdrv_aligned_pwritev (
child=child@entry=0x564322ed28c0, req=req@entry=0x7fc2aa90be70, offset=0,
bytes=512, align=align@entry=1, qiov=0x7fc20c045860, qiov_offset=0,
flags=flags@entry=0) at block/io.c:1980
#8 0x00005643220ea25b in bdrv_co_pwritev_part (child=0x564322ed28c0,
offset=offset@entry=0, bytes=bytes@entry=512,
qiov=qiov@entry=0x7fc20c045860, qiov_offset=qiov_offset@entry=0, flags=0)
at block/io.c:2137
#9 0x00005643220d63b4 in blk_do_pwritev_part (blk=0x564322ec4570, offset=0,
bytes=512, qiov=0x7fc20c045860, qiov_offset=qiov_offset@entry=0,
flags=<optimized out>) at block/block-backend.c:1231
#10 0x00005643220d650d in blk_aio_write_entry (opaque=0x7fc20c045520)
at block/block-backend.c:1439
#11 0x000056432218706a in coroutine_trampoline (i0=<optimized out>,
i1=<optimized out>) at util/coroutine-ucontext.c:115
#12 0x00007fc2afa20190 in ?? () from /lib64/libc.so.6
#13 0x00007fc2b3e01aa0 in ?? ()
#14 0x0000000000000000 in ?? ()
I’m not familiar with QEMU’s storage code; any suggestions on how to proceed with debugging this?
Regards
Simon
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Lost partition tables on ide-hd + ahci drive
2023-02-17 21:22 ` Mike Maslenkin
@ 2023-08-23 8:47 ` Fiona Ebner
2023-08-23 9:17 ` Fiona Ebner
0 siblings, 1 reply; 19+ messages in thread
From: Fiona Ebner @ 2023-08-23 8:47 UTC (permalink / raw)
To: Mike Maslenkin
Cc: John Snow, QEMU Developers, open list:Network Block Dev...,
Thomas Lamprecht, Aaron Lauterer, simon.rowe
Am 17.02.23 um 22:22 schrieb Mike Maslenkin:
> I cannot tell anything about dma-reentrancy issues, but yes, I would
> start by looking at the check_cmd() function call sequence.
> The most interesting question is why Sector Count = 1. I thought about a race
> with the IDE reset, where the registers are initialized with the
> value SATA_SIGNATURE_DISK = 0x00000101, but this means LBA=1 as well...
>
You got it! Since we got another report (after half a year of nothing)
and also because of Simon's mail, I gave it another shot too and was
finally able to reproduce the issue (with our patched QEMU 8.0, but
patches shouldn't affect IDE code). See below for the traces that
confirm your theory. The reason the write goes to sector 0 and not 1 is
because ide_dma_cb() uses sector_num = ide_get_sector(s); and that will
evaluate to 0 after a reset.
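For context, here is a condensed paraphrase (from memory, not the verbatim upstream code) of ide_get_sector() in hw/ide/core.c. With the post-reset register values visible in the GDB dump below (sector = 1, lcyl = 0, hcyl = 0, select = 0xa0, i.e. the LBA bit clear), the CHS branch computes 0:

/* Paraphrase (not verbatim) of ide_get_sector() in hw/ide/core.c: the
 * sector number is rebuilt from the task-file registers. After a port
 * reset those registers hold the SATA disk signature (nsector = 1,
 * sector = 1, lcyl = 0, hcyl = 0), and with the LBA bit in 'select'
 * clear the CHS branch yields (0 * heads + 0) * sectors + (1 - 1) = 0,
 * i.e. the MBR. */
int64_t ide_get_sector(IDEState *s)
{
    if (s->select & 0x40) {                     /* LBA mode */
        if (!s->lba48) {
            return ((s->select & 0x0f) << 24) | (s->hcyl << 16) |
                   (s->lcyl << 8) | s->sector;
        }
        /* LBA48 additionally folds in the hob_* registers */
        return ((int64_t)s->hob_hcyl << 40) | ((int64_t)s->hob_lcyl << 32) |
               ((int64_t)s->hob_sector << 24) | ((int64_t)s->hcyl << 16) |
               ((int64_t)s->lcyl << 8) | s->sector;
    }
    /* CHS mode */
    return ((s->hcyl << 8) | s->lcyl) * s->heads * s->sectors +
           (s->select & 0x0f) * s->sectors + (s->sector - 1);
}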
So the issue is indeed that ide_dma_cb can get called with an IDEState
just after that state was reset. Can we somehow wait for pending
requests before proceeding with the reset, or can we force an error
return for callbacks that are still pending during reset?
Best Regards,
Fiona
QEMU trace log (-trace dma_*,file=/root/sata.log -trace
ide_*,file=/root/sata.log -trace ahci_*,file=/root/sata.log -trace
*ncq*,file=/root/sata.log -trace handle_cmd*,file=/root/sata.log)
> ahci_port_write ahci(0x5595af6923f0)[0]: port write [reg:PxSCTL] @ 0x2c: 0x00000300
> ahci_reset_port ahci(0x5595af6923f0)[0]: reset port
> ide_reset IDEstate 0x5595af6949d0
> ide_reset IDEstate 0x5595af694da8
> ide_bus_reset_aio aio_cancel
> dma_aio_cancel dbs=0x7f64600089a0
> dma_blk_cb dbs=0x7f64600089a0 ret=0
> dma_complete dbs=0x7f64600089a0 ret=0 cb=0x5595acd40b30
> ahci_populate_sglist ahci(0x5595af6923f0)[0]
> ahci_dma_prepare_buf ahci(0x5595af6923f0)[0]: prepare buf limit=512 prepared=512
> ide_dma_cb IDEState 0x5595af6949d0; sector_num=0 n=1 cmd=DMA WRITE
> dma_blk_io dbs=0x7f6420802010 bs=0x5595ae2c6c30 offset=0 to_dev=1
> dma_blk_cb dbs=0x7f6420802010 ret=0
Info from GDB:
> (gdb) p *qiov
>
> $11 = {iov = 0x7f647c76d840, niov = 1, {{nalloc = 1, local_iov = {iov_base = 0x0,
> iov_len = 512}}, {__pad = "\001\000\000\000\000\000\000\000\000\000\000",
> size = 512}}}
> (gdb) bt
>
> #0 blk_aio_pwritev (blk=0x5595ae2c6c30, offset=0, qiov=0x7f6420802070, flags=0,
> cb=0x5595ace6f0b0 <dma_blk_cb>, opaque=0x7f6420802010)
> at ../block/block-backend.c:1682
> #1 0x00005595ace6f185 in dma_blk_cb (opaque=0x7f6420802010, ret=<optimized out>)
> at ../softmmu/dma-helpers.c:179
> #2 0x00005595ace6f778 in dma_blk_io (ctx=0x5595ae0609f0,
> sg=sg@entry=0x5595af694d00, offset=offset@entry=0, align=align@entry=512,
> io_func=io_func@entry=0x5595ace6ee30 <dma_blk_write_io_func>,
> io_func_opaque=io_func_opaque@entry=0x5595ae2c6c30,
> cb=0x5595acd40b30 <ide_dma_cb>, opaque=0x5595af6949d0,
> dir=DMA_DIRECTION_TO_DEVICE) at ../softmmu/dma-helpers.c:244
> #3 0x00005595ace6f90a in dma_blk_write (blk=0x5595ae2c6c30,
> sg=sg@entry=0x5595af694d00, offset=offset@entry=0, align=align@entry=512,
> cb=cb@entry=0x5595acd40b30 <ide_dma_cb>, opaque=opaque@entry=0x5595af6949d0)
> at ../softmmu/dma-helpers.c:280
> #4 0x00005595acd40e18 in ide_dma_cb (opaque=0x5595af6949d0, ret=<optimized out>)
> at ../hw/ide/core.c:953
> #5 0x00005595ace6f319 in dma_complete (ret=0, dbs=0x7f64600089a0)
> at ../softmmu/dma-helpers.c:107
> #6 dma_blk_cb (opaque=0x7f64600089a0, ret=0) at ../softmmu/dma-helpers.c:127
> #7 0x00005595ad12227d in blk_aio_complete (acb=0x7f6460005b10)
> at ../block/block-backend.c:1527
> #8 blk_aio_complete (acb=0x7f6460005b10) at ../block/block-backend.c:1524
> #9 blk_aio_write_entry (opaque=0x7f6460005b10) at ../block/block-backend.c:1594
> #10 0x00005595ad258cfb in coroutine_trampoline (i0=<optimized out>,
> i1=<optimized out>) at ../util/coroutine-ucontext.c:177
> #11 0x00007f64f2fcb8d0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #12 0x00007f64d0ff3290 in ?? ()
> #13 0x0000000000000000 in ?? ()
This is of course not directly after the reset, since the break happened
a bit later:
> (gdb) p *((IDEState*)0x5595af6949d0)
>
> $12 = {bus = 0x5595af694948, unit = 0 '\000', drive_kind = IDE_HD,
> drive_heads = 16, drive_sectors = 63, cylinders = 8740, heads = 16,
> sectors = 63, chs_trans = 2, nb_sectors = 8810496, mult_sectors = 16,
> identify_set = 1,
> identify_data = "@\000$\"\000\000\020\000\000~\000\002?\000\000\000\000\000\000\000MQ0000 5", ' ' <repeats 12 times>, "\003\000\000\002\004\000.2+5 EQUMH RADDSI K", ' ' <repeats 26 times>, "\020\200\001\000\000\v\000\000\000\002\000\002\a\000$\"\020\000?\000\300m\206\000\020\001\000p\206\000\a\000\a\000\003\000x\000x\000x\000x\000\000@\000\000\000\000\000\000\000\000\000\000\037\000\000\001\000\000\000\000\000\000\360\000\026\000!@\000t\000@!@\0004\000@?\020\000\000\000\000\000\000\000\000\001`", '\000' <repeats 13 times>..., drive_serial = 5,
> drive_serial_str = "QM00005", '\000' <repeats 13 times>,
> drive_model_str = "QEMU HARDDISK", '\000' <repeats 27 times>, wwn = 0,
> feature = 0 '\000', error = 0 '\000', nsector = 1, sector = 1 '\001',
> lcyl = 0 '\000', hcyl = 0 '\000', hob_feature = 0 '\000',
> hob_nsector = 0 '\000', hob_sector = 0 '\000', hob_lcyl = 0 '\000',
> hob_hcyl = 0 '\000', select = 160 '\240', status = 80 'P', io8 = false,
> reset_reverts = false, lba48 = 0 '\000', blk = 0x5595ae2c6c30,
> version = "2.5+\000\000\000\000", events = {eject_request = false,
> new_media = false}, sense_key = 0 '\000', asc = 0 '\000', tray_open = false,
> tray_locked = false, cdrom_changed = 0 '\000', packet_transfer_size = 0,
> elementary_transfer_size = 0, io_buffer_index = 0, lba = 0, cd_sector_size = 0,
> atapi_dma = 0, acct = {bytes = 131072, start_time_ns = 89102481675200,
> type = BLOCK_ACCT_WRITE}, pio_aiocb = 0x0, qiov = {iov = 0x0, niov = 0, {{
> nalloc = 0, local_iov = {iov_base = 0x0, iov_len = 0}}, {
> __pad = '\000' <repeats 11 times>, size = 0}}}, buffered_requests = {
> lh_first = 0x0}, io_buffer_offset = 0, io_buffer_size = 512, sg = {
> sg = 0x7f647c76d390, nsg = 1, nalloc = 2, size = 512, dev = 0x5595af6919c0,
> as = 0x5595af691c00}, req_nb_sectors = 0,
> end_transfer_func = 0x5595acd3cb90 <ide_dummy_transfer_stop>,
> data_ptr = 0x5595af69e800 "\377\377\377\377",
> data_end = 0x5595af69e800 "\377\377\377\377",
> io_buffer = 0x5595af69e800 "\377\377\377\377", io_buffer_total_len = 131076,
> cur_io_buffer_offset = 0, cur_io_buffer_len = 0,
> end_transfer_fn_idx = 0 '\000', sector_write_timer = 0x5595af69db20,
> irq_count = 0, ext_error = 0 '\000', mdata_size = 0, mdata_storage = 0x0,
> media_changed = 0, dma_cmd = IDE_DMA_WRITE, smart_enabled = 1 '\001',
> smart_autosave = 1 '\001', smart_errors = 0, smart_selftest_count = 0 '\000',
> smart_selftest_data = 0x5595af6bf000 "", ncq_queues = 32}
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Lost partition tables on ide-hd + ahci drive
2023-08-23 8:47 ` Fiona Ebner
@ 2023-08-23 9:17 ` Fiona Ebner
2023-08-26 18:07 ` Mike Maslenkin
0 siblings, 1 reply; 19+ messages in thread
From: Fiona Ebner @ 2023-08-23 9:17 UTC (permalink / raw)
To: Mike Maslenkin
Cc: John Snow, QEMU Developers, open list:Network Block Dev...,
Thomas Lamprecht, Aaron Lauterer, simon.rowe
Am 23.08.23 um 10:47 schrieb Fiona Ebner:
> Am 17.02.23 um 22:22 schrieb Mike Maslenkin:
>> I cannot tell anything about dma-reentrancy issues, but yes, I would
>> start by looking at the check_cmd() function call sequence.
>> The most interesting question is why Sector Count = 1. I thought about a race
>> with the IDE reset, where the registers are initialized with the
>> value SATA_SIGNATURE_DISK = 0x00000101, but this means LBA=1 as well...
>>
>
> You got it! Since we got another report (after half a year of nothing)
> and also because of Simon's mail, I gave it another shot too and was
> finally able to reproduce the issue (with our patched QEMU 8.0, but
> patches shouldn't affect IDE code). See below for the traces that
> confirm your theory. The reason the write goes to sector 0 and not 1 is
> because ide_dma_cb() uses sector_num = ide_get_sector(s); and that will
> evaluate to 0 after a reset.
>
> So the issue is indeed that ide_dma_cb can get called with an IDEState
> just after that state was reset. Can we somehow wait for pending
> requests before proceeding with the reset, or can we force an error
> return for callbacks that are still pending during reset?
>
I noticed that ide_bus_reset() does the reset first and then cancels the
aiocb. Maybe it's already enough to switch those around?
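As a rough sketch of what I mean (paraphrased from memory and untested, so not a proper patch), the cancellation would simply be moved in front of the per-drive reset:

/* Sketch only (paraphrase of hw/ide/core.c:ide_bus_reset(), untested):
 * cancel the pending async DMA before ide_reset() overwrites the
 * task-file registers with the signature values, so a callback that
 * still runs during the cancel does not end up computing sector 0. */
void ide_bus_reset(IDEBus *bus)
{
    /* pending async DMA: cancel it first */
    if (bus->dma->aiocb) {
        trace_ide_bus_reset_aio();
        blk_aio_cancel(bus->dma->aiocb);
        bus->dma->aiocb = NULL;
    }

    bus->unit = 0;
    bus->cmd = 0;
    ide_reset(&bus->ifs[0]);
    ide_reset(&bus->ifs[1]);
    ide_clear_hob(bus);

    /* reset dma provider too */
    if (bus->dma->ops->reset) {
        bus->dma->ops->reset(bus->dma);
    }
}

The idea is that a DMA callback which still runs while the aiocb is being cancelled then sees the original task-file registers rather than the post-reset signature values.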
Best Regards,
Fiona
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Lost partition tables on ide-hd + ahci drive
2023-08-23 9:17 ` Fiona Ebner
@ 2023-08-26 18:07 ` Mike Maslenkin
0 siblings, 0 replies; 19+ messages in thread
From: Mike Maslenkin @ 2023-08-26 18:07 UTC (permalink / raw)
To: Fiona Ebner
Cc: John Snow, QEMU Developers, open list:Network Block Dev...,
Thomas Lamprecht, Aaron Lauterer, simon.rowe
On Wed, Aug 23, 2023 at 12:17 PM Fiona Ebner <f.ebner@proxmox.com> wrote:
>
> Am 23.08.23 um 10:47 schrieb Fiona Ebner:
> > Am 17.02.23 um 22:22 schrieb Mike Maslenkin:
> >> I cannot tell anything about dma-reentrancy issues, but yes, I would
> >> start by looking at the check_cmd() function call sequence.
> >> The most interesting question is why Sector Count = 1. I thought about a race
> >> with the IDE reset, where the registers are initialized with the
> >> value SATA_SIGNATURE_DISK = 0x00000101, but this means LBA=1 as well...
> >>
> >
> > You got it! Since we got another report (after half a year of nothing)
> > and also because of Simon's mail, I gave it another shot too and was
> > finally able to reproduce the issue (with our patched QEMU 8.0, but
> > patches shouldn't affect IDE code). See below for the traces that
> > confirm your theory. The reason the write goes to sector 0 and not 1 is
> > because ide_dma_cb() uses sector_num = ide_get_sector(s); and that will
> > evaluate to 0 after a reset.
> >
> > So the issue is indeed that ide_dma_cb can get called with an IDEState
> > just after that state was reset. Can we somehow wait for pending
> > requests before proceeding with the reset, or can we force an error
> > return for callbacks that are still pending during reset?
> >
>
> I noticed that ide_bus_reset() does the reset first and then cancels the
> aiocb. Maybe it's already enough to switch those around?
>
> Best Regards,
> Fiona
Great job! Patch looks good to me.
Since the reason is known now, it may be easier to reproduce the original
case again, but with NCQ disabled.
There is no command-line argument for this, so QEMU has to be rebuilt
without announcing the HOST_CAP_NCQ capability.
I'd expect this to greatly increase the chances of catching the original corruption.
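Roughly (from memory, so please double-check against the actual tree), the capability is announced where the HBA registers are initialized in hw/ide/ahci.c; something like this should make guests fall back to plain non-NCQ DMA commands:

/* Sketch (paraphrased, not a verbatim diff of hw/ide/ahci.c): where the
 * HBA capability register is set up, simply stop advertising NCQ so the
 * guest driver falls back to ordinary DMA READ/WRITE commands. */
s->control_regs.cap = (s->ports - 1) |
                      (AHCI_NUM_COMMAND_SLOTS << 8) |
                      (AHCI_SUPPORTED_SPEED_GEN1 << AHCI_SUPPORTED_SPEED) |
                      /* HOST_CAP_NCQ | */  /* <-- dropped for the repro */
                      HOST_CAP_AHCI | HOST_CAP_64;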
Best Regards,
Mike.
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads: [~2023-08-26 18:08 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-02 12:08 Lost partition tables on ide-hd + ahci drive Fiona Ebner
2023-02-14 18:21 ` John Snow
2023-02-15 10:53 ` Fiona Ebner
2023-02-15 21:47 ` John Snow
2023-02-16 8:58 ` Fiona Ebner
2023-02-16 14:17 ` Mike Maslenkin
2023-02-16 15:25 ` Fiona Ebner
2023-02-16 16:15 ` Mike Maslenkin
2023-02-17 12:25 ` Fiona Ebner
2023-02-17 13:40 ` Fiona Ebner
2023-02-17 21:22 ` Mike Maslenkin
2023-08-23 8:47 ` Fiona Ebner
2023-08-23 9:17 ` Fiona Ebner
2023-08-26 18:07 ` Mike Maslenkin
2023-02-17 9:44 ` Aaron Lauterer
2023-06-14 14:48 ` Simon J. Rowe
2023-06-15 7:04 ` Fiona Ebner
2023-06-15 8:24 ` Simon Rowe
2023-07-27 13:22 ` Simon Rowe