* status of bugzilla #99171 - mdraid broken for O_DIRECT [not found] <A4168F21-4CDF-4BAD-8754-30BAA1315C6F@web.de> @ 2025-10-14 20:14 ` Roland 2025-10-15 6:56 ` Hannes Reinecke 0 siblings, 1 reply; 15+ messages in thread From: Roland @ 2025-10-14 20:14 UTC (permalink / raw) To: Hannes Reinecke, Reindl Harald, linux-raid sorry, resend in text format as mail contained html and bounced from ML. Am 14.10.25 um 08:31 schrieb Hannes Reinecke: > Hmm. I still would argue that the testcase quoted is invalid. > > What you do is to issue writes of a given buffer, while at the > same time modifying the contents of that buffer. > > As we're doing zerocopy with O_DIRECT the buffer passed to pwrite > is _the same_ buffer used when issuing the write to disk. The > block layer now assumes that the buffer will _not_ be modified > when writing to disk (ie between issuing 'pwrite' and the resulting > request being send to disk). > But that's not the case here; it will be modified, and consequently > all sorts of issues will pop up. > We have had all sorts of fun some years back with this issue until > we fixed up all filesystems to do this correctly; if interested > dig up the threads regarding 'stable pages' on linux-fsdevel. > > I would think you will end up with a corrupted filesystem if you > run this without mdraid by just using btrfs with data checksumming. > yes, it's correct. you also end up with corrupted btrfs with this tool, see https://bugzilla.kernel.org/show_bug.cgi?id=99171#c16 > So really I'm not sure how to go from here; I would declare this > as invalid, but what do I know ... > > Cheers, > > Hannes anyhow, i don't see why this testcase is invalid, especially when zfs seems not to be affected. please look at this issue from a security perspective. 
if you can break or corrupt your raid mirror from userspace, even from an isolated layer/environment, i would rather consider this "testcase" to be "malicious code" which is able to subvert the virtualization/block/fs layer stack. how could we prevent non-trusted users in a vm or container environment from executing this "invalid" code ? how can we prevent them from doing harm to the underlying mirror in a hosting environment, for example ? not using it in a hosting environment is a rather weird strategy for a basic linux technology which has existed for years. and leaving it up to the hoster to remember that he needs to disable direct-io for the hypervisor is dissatisfying and error-prone, too. roland ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2025-10-14 20:14 ` status of bugzilla #99171 - mdraid broken for O_DIRECT Roland @ 2025-10-15 6:56 ` Hannes Reinecke 2025-10-15 23:09 ` Roland 0 siblings, 1 reply; 15+ messages in thread From: Hannes Reinecke @ 2025-10-15 6:56 UTC (permalink / raw) To: Roland, Reindl Harald, linux-raid On 10/14/25 22:14, Roland wrote: > sorry, resend in text format as mail contained html and bounced from ML. > > > Am 14.10.25 um 08:31 schrieb Hannes Reinecke: >> Hmm. I still would argue that the testcase quoted is invalid. >> >> What you do is to issue writes of a given buffer, while at the >> same time modifying the contents of that buffer. >> >> As we're doing zerocopy with O_DIRECT the buffer passed to pwrite >> is _the same_ buffer used when issuing the write to disk. The >> block layer now assumes that the buffer will _not_ be modified >> when writing to disk (ie between issuing 'pwrite' and the resulting >> request being send to disk). >> But that's not the case here; it will be modified, and consequently >> all sorts of issues will pop up. >> We have had all sorts of fun some years back with this issue until >> we fixed up all filesystems to do this correctly; if interested >> dig up the threads regarding 'stable pages' on linux-fsdevel. >> >> I would think you will end up with a corrupted filesystem if you >> run this without mdraid by just using btrfs with data checksumming. >> > yes, it's correct. you also end up with corrupted btrfs with this tool, > see https://bugzilla.kernel.org/show_bug.cgi?id=99171#c16 > >> So really I'm not sure how to go from here; I would declare this >> as invalid, but what do I know ... >> >> Cheers, >> >> Hannes > > anyhow, i don't see why this testcase is invalid, especially when zfs > seems not to be affected. > Welll ... I am sure you are aware of the somewhat dubious state of zfs and linux, right? 
And anyway: 'break userspace' is a matter of debate here; the use of O_DIRECT effectively moves the burden of checking I/O from the kernel to userspace; with O_DIRECT you can submit _any_ I/O without the kernel interfering, but at the same time you _must_ ensure that the I/O submitted conforms to the expectations the block layer has. And one of the expectations is that data is not modified between assembling the request and submitting the request to the drive. But that is precisely what the test program does. > please look at this issue from a security perspective. > > if you can break or corrupt your raid mirror from userspace even from an > insulated layer/environment, i would better consider this "testcase" to > be "malicious code" , which is able to subvert the virtualization/block/ > fs layer stack. > > how could we prevent, that non-trused users in a vm or container > environment can execute this "invalid" code ? > Well, yes, but then this is O_DIRECT. > > how can we prevent, that they do harm on the underlying mirror in a > hosting environment for example ? > Well, this has been an ongoing debate for years, and we from the linux side have had long discussions about that, too. But eventually we settled on the notion of 'stable pages', ie that the data buffer for a command _must not_ be modified between assembling the command and submitting the command to the drivers. Precisely such that we _can_ do things like data checksumming. > not using it in a hosting environment is a little bit weird strategy for > a linux basic technoligy which exists for years. > Oh, agreed. We do want to make linux better. But there is a perfectly viable workaround (namely: do not disable caching on the VM ...). So the question really is: where's the advantage? Security and O_DIRECT is always a very tricky subject, as O_DIRECT is precisely there to circumvent checks in the kernel. And yes, some of these checks are there to prevent security issues. 
So of course there will be security implications, but that was kinda the idea. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2025-10-15 6:56 ` Hannes Reinecke @ 2025-10-15 23:09 ` Roland 2025-10-16 6:02 ` Hannes Reinecke 0 siblings, 1 reply; 15+ messages in thread From: Roland @ 2025-10-15 23:09 UTC (permalink / raw) To: Hannes Reinecke, Reindl Harald, linux-raid > Welll ... I am sure you are aware of the somewhat dubious state of zfs > and linux, right? yes, i know about this "dubious" state due to licensing issues, but it has been in this state for years now and is nevertheless a pretty solid, easily installable and usable filesystem, used in many enterprise setups. i have been running dozens of zfs installations for years and did not have a single major issue, data loss or data corruption with them. but that's a different story which doesn't belong here... > And anyway: 'break userspace' is a matter of debate here; the use of > O_DIRECT effectively moves the burden of checking I/O from the kernel > to userspace; with O_DIRECT you can submit _any_ I/O without the kernel > interfering, but at the same time you _must_ ensure that the I/O > submitted conforms to the expectations the block layer has. > And one of the expectation is that data is not modified between > assembling the request and submitting the request to the drive. > > But that is precisely what the test program does. > >> please look at this issue from a security perspective. >> >> if you can break or corrupt your raid mirror from userspace even from >> an insulated layer/environment, i would better consider this >> "testcase" to be "malicious code" , which is able to subvert the >> virtualization/block/ fs layer stack. >> >> how could we prevent, that non-trused users in a vm or container >> environment can execute this "invalid" code ? >> > Well, yes, but then this is O_DIRECT. > >> >> how can we prevent, that they do harm on the underlying mirror in a >> hosting environment for example ? 
>> > > Well, this has been an ongoing debate for years, and we from the linux > side have had long discussions about that, too. > But eventually we settled on the notion of 'stable pages', ie that the > data buffer for a command _must not_ be modified between assembling the > command and submitting the command to the drivers. > Precisely such that we _can_ do things like data checksumming. > >> not using it in a hosting environment is a little bit weird strategy >> for a linux basic technoligy which exists for years. >> > Oh, agreed. We do want to make linux better. > But there is a perfectly viable workaround (namely: do not disable > caching on the VM ...). So the question really is: where's the > advantage? > Security and O_DIRECT is always a very tricky subject, as O_DIRECT > is precisely there to circumvent checks in the kernel. And yes, > some of these checks are there to prevent security issues. > So of course the will be security implications, but that was > kinda the idea. > > Cheers, > > Hannes thank you for your feedback. i see, things are complicated and O_DIRECT is a very special beast.... meanwhile, i gave bcachefs a try today, because it looks interesting. like zfs, it does not seem to be affected by this problem, at least from my first tests reported at https://bugzilla.kernel.org/show_bug.cgi?id=99171#c26 (i hope this is a valid test for consistency) so we have at least a second "software raid" technology besides zfs, which does NOT suffer from the "by design" O_DIRECT breakage. that at least surprises me, as bcachefs is far from production-ready, and i wonder why it just seems to work at this early stage of development. roland ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2025-10-15 23:09 ` Roland @ 2025-10-16 6:02 ` Hannes Reinecke 2025-10-17 20:18 ` Roland 0 siblings, 1 reply; 15+ messages in thread From: Hannes Reinecke @ 2025-10-16 6:02 UTC (permalink / raw) To: Roland, Reindl Harald, linux-raid On 10/16/25 01:09, Roland wrote: [ .. ]> > thank you for your feedback. > > i see, things are complicated and O_DIRECT is a very special beast.... > > meanwhile, i gave bcachefs a try today , because it looks interesting . > > like zfs, it does not seem to be affected by this problem, at least from > my first tests reported at https://bugzilla.kernel.org/show_bug.cgi? > id=99171#c26 (i hope this is a valid test for consistency) > > so we have at least a second "software raid" technology besides zfs, > which does NOT suffer from the "by design" O_DIRECT breakage. > > that's at least surprising me, as bcachefs is far from production > ready, and i wonder why it just seems to work at this early stage of > development. > Hmm. True. I would suggest bringing up this topic on linux-fsdevel; there is always a chance that there is a bug somewhere. At least some explanation would be warranted why bcachefs does not suffer from this issue. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2025-10-16 6:02 ` Hannes Reinecke @ 2025-10-17 20:18 ` Roland 2025-10-20 6:44 ` Hannes Reinecke 0 siblings, 1 reply; 15+ messages in thread From: Roland @ 2025-10-17 20:18 UTC (permalink / raw) To: Hannes Reinecke, Reindl Harald, linux-raid good evening, are you really sure linux-fsdevel is the right place to report this? i was just able to reproduce the mdraid breakage without any filesystem involved: i just put lvm on top of mdraid, passed lvm logical volumes from that to the debian vm, and then ran "break-raid-odirect /dev/sdb" inside the vm. meanwhile, the btrfs issue with O_DIRECT seems to be fixed, at least from my quick tests, reported at https://bugzilla.kernel.org/show_bug.cgi?id=99171#c35 . the btrfs fix is also linked there. regards Roland Am 16.10.25 um 08:02 schrieb Hannes Reinecke: > On 10/16/25 01:09, Roland wrote: > [ .. ]> >> thank you for your feedback. >> >> i see, things are complicated and O_DIRECT is a very special beast.... >> >> meanwhile, i gave bcachefs a try today , because it looks interesting . >> >> like zfs, it does not seem to be affected by this problem, at least >> from my first tests reported at >> https://bugzilla.kernel.org/show_bug.cgi? id=99171#c26 (i hope this >> is a valid test for consistency) >> >> so we have at least a second "software raid" technology besides zfs, >> which does NOT suffer from the "by design" O_DIRECT breakage. >> >> that's at least surprising me, as bcachefs is far from production >> ready, and i wonder why it just seems to work at this early stage of >> development. >> > Hmm. True. > > I would suggest bringing up this topic on linux-fsdevel; there is > always a chance that there is a bug somewhere. > At least some explanation would be warranted why bcachefs does not > suffer from this issue. > > Cheers, > > Hannes ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2025-10-17 20:18 ` Roland @ 2025-10-20 6:44 ` Hannes Reinecke 0 siblings, 0 replies; 15+ messages in thread From: Hannes Reinecke @ 2025-10-20 6:44 UTC (permalink / raw) To: Roland, Reindl Harald, linux-raid On 10/17/25 22:18, Roland wrote: > good evening, > > are you really sure fs-fsdevel is the right place to report this? > > i just was able to reproduce the mdraid breakage without any filesystem > involved, just put lvm on top of mdraid and passed lvm logical volumes > from that to the debian vm, and then ran "break-raid-odirect /dev/sdb" > inside vm. > > meanwhile, btrfs issue with O_DIRECT seems to be fixed, at least from > my quick tests, reported at https://bugzilla.kernel.org/show_bug.cgi? > id=99171#c35 . btrfs fix is also linked there. > Ah, so btrfs is fixed, so indeed a report on fsdevel is pointless. So on the good side my analysis was correct (phew :-); on the flip side my attempt to offload that problem to someone else has failed :-( So guess we need to fix it after all. Curious, though; for RAID5 we do set the 'STABLE_WRITES' flag. Does the issue occur with RAID5, too? Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 15+ messages in thread
* status of bugzilla #99171 - mdraid broken for O_DIRECT @ 2024-10-09 20:08 Roland 2024-10-09 21:38 ` Reindl Harald 0 siblings, 1 reply; 15+ messages in thread From: Roland @ 2024-10-09 20:08 UTC (permalink / raw) To: linux-raid Hello, as the proxmox hypervisor does not offer mdadm software raid at installation time because of this bugticket "MD RAID or DRBD can be broken from userspace when using O_DIRECT" https://bugzilla.kernel.org/show_bug.cgi?id=99171 i tried to find some more references besides the kernel bugzilla entry - and - besides some discussion in the proxmox community - i did not succeed. why is this apparent fundamental design flaw (should we call it that?) so damn unknown ? and what about O_DIRECT with other software raid solutions like btrfs or zfs ? how/why do they get it right, but not mdraid ? the latter has existed for a much longer time and afaik is offered as an install-time option for RHEL (enterprise linux), for example, where people use oracle on top - which IS using O_DIRECT very often/likely. not accusing anyone - just being curious why the totally unknown/unpopular bugticket #99171 has bitrotted for nearly a decade now... regards Roland Sysadmin ps: also see "qemu cache=none should not be used with mdadm" https://bugzilla.proxmox.com/show_bug.cgi?id=5235 ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2024-10-09 20:08 Roland @ 2024-10-09 21:38 ` Reindl Harald 2024-10-10 6:53 ` Hannes Reinecke 0 siblings, 1 reply; 15+ messages in thread From: Reindl Harald @ 2024-10-09 21:38 UTC (permalink / raw) To: Roland, linux-raid Am 09.10.24 um 22:08 schrieb Roland: > as proxmox hypervisor does not offer mdadm software raid at installation > time because of this bugticket > > "MD RAID or DRBD can be broken from userspace when using O_DIRECT" > https://bugzilla.kernel.org/show_bug.cgi?id=99171 > > ps: > also see "qemu cache=none should not be used with mdadm" > https://bugzilla.proxmox.com/show_bug.cgi?id=5235 that all sounds like terrible nonsense if "Yes. O_DIRECT is really fundamentally broken. There's just no way to fix it sanely. Except by teaching people not to use it, and making the normal paths fast enough" is true, then it has to go away it's not acceptable that userspace can break the integrity of the underlying RAID - period ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2024-10-09 21:38 ` Reindl Harald @ 2024-10-10 6:53 ` Hannes Reinecke 2024-10-10 7:29 ` Roland 0 siblings, 1 reply; 15+ messages in thread From: Hannes Reinecke @ 2024-10-10 6:53 UTC (permalink / raw) To: Reindl Harald, Roland, linux-raid On 10/9/24 23:38, Reindl Harald wrote: > > Am 09.10.24 um 22:08 schrieb Roland: >> as proxmox hypervisor does not offer mdadm software raid at installation >> time because of this bugticket >> >> "MD RAID or DRBD can be broken from userspace when using O_DIRECT" >> https://bugzilla.kernel.org/show_bug.cgi?id=99171 >> >> ps: >> also see "qemu cache=none should not be used with mdadm" >> https://bugzilla.proxmox.com/show_bug.cgi?id=5235 > that all sounds like terrible nosense > > if "Yes. O_DIRECT is really fundamentally broken. There's just no way to > fix it sanely. Except by teaching people not to use it, and making the > normal paths fast enough" it has to go away > > it's not acceptable that userspace can break the integrity of the > underlying RAID - period > Take a deep breath, everyone. Nothing has happened, nothing has been broken. All systems continue to operate as normal. If you look closely at the mentioned bug, you'll find that it does modify the buffer at random times, in particular while it's being written to disk. Now, the boilerplate text for O_DIRECT says: the application is in control of the data, and the data will be written without any caching. Applying that to our testcase, it means that the application _can_ modify the data, even if it's in the process of being written to disk (zero copy and all that). We do guarantee that data is consistent once I/O is completed (here: once 'write' returns), but we do not (and, in fact, cannot) guarantee that data is consistent while write() is running. Which means that the test case is actually invalid; you would either need to drop O_DIRECT or modify the buffer only after write() has returned to arrive at a valid example. 
That doesn't mean that I don't agree with the comments about O_DIRECT. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2024-10-10 6:53 ` Hannes Reinecke @ 2024-10-10 7:29 ` Roland 2024-10-10 8:34 ` Hannes Reinecke 0 siblings, 1 reply; 15+ messages in thread From: Roland @ 2024-10-10 7:29 UTC (permalink / raw) To: Hannes Reinecke, Reindl Harald, linux-raid thank you for clearing things up. >Which means that the test case is actually invalid; you either would need drop O_DIRECT or modify the buffer >after write() to arrive with a valid example. ok, but what about running virtual machines in O_DIRECT mode on top of mdraid then ? https://forum.proxmox.com/threads/zfs-on-debian-or-mdadm-softraid-stability-and-reliability-of-zfs.116871/post-505697 i have not seen any report of broken/inconsistent mdraid caused by virtual machines, so is this just a "theoretical" issue ? i'm curious why we can use zfs software raid with virtual machines but not md software raid. shouldn't that have the same problem ( https://www.phoronix.com/news/OpenZFS-Direct-IO ) , at least from now on ? regards Roland Am 10.10.24 um 08:53 schrieb Hannes Reinecke: > On 10/9/24 23:38, Reindl Harald wrote: >> >> Am 09.10.24 um 22:08 schrieb Roland: >>> as proxmox hypervisor does not offer mdadm software raid at >>> installation >>> time because of this bugticket >>> >>> "MD RAID or DRBD can be broken from userspace when using O_DIRECT" >>> https://bugzilla.kernel.org/show_bug.cgi?id=99171 >>> >>> ps: >>> also see "qemu cache=none should not be used with mdadm" >>> https://bugzilla.proxmox.com/show_bug.cgi?id=5235 >> that all sounds like terrible nosense >> >> if "Yes. O_DIRECT is really fundamentally broken. There's just no way >> to fix it sanely. Except by teaching people not to use it, and making >> the normal paths fast enough" it has to go away >> >> it's not acceptable that userspace can break the integrity of the >> underlying RAID - period >> > Take deep breath everyone. > Nothing has happened, nothing has been broken. 
> All systems continue to operate as normal. > > If you look closely at the mentioned bug, you'll find that it does > modify the buffer at random times, in particular while it's being > written to disk. > Now, the boilerplate text for O_DIRECT says: the application is in > control of the data, and the data will be written without any caching. > Applying that to our testcase it means that the application _can_ modify > the data, even if it's in the process of being written to disk (zero > copy and all that). > We do guarantee that data is consistent once I/O is completed (here: > once 'write' returns), but we do not (and, in fact, cannot) guarantee > that data is consistent while write() is running. > > Which means that the test case is actually invalid; you either would > need drop O_DIRECT or modify the buffer after write() to arrive with > a valid example. > > That doesn't mean that I don't agree with the comments about O_DIRECT. > > Cheers, > > Hannes ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2024-10-10 7:29 ` Roland @ 2024-10-10 8:34 ` Hannes Reinecke 2025-10-11 19:25 ` Roland 0 siblings, 1 reply; 15+ messages in thread From: Hannes Reinecke @ 2024-10-10 8:34 UTC (permalink / raw) To: Roland, Reindl Harald, linux-raid On 10/10/24 09:29, Roland wrote: > thank you for clearing things up. > > >Which means that the test case is actually invalid; you either would > need drop O_DIRECT or modify the buffer > >after write() to arrive with a valid example. > > ok, but what about running virtual machines in O_DIRECT mode on top of > mdraid then ? > > https://forum.proxmox.com/threads/zfs-on-debian-or-mdadm-softraid- > stability-and-reliability-of-zfs.116871/post-505697 > The example quoted is this: > Take a virtual machine, give it a disk - put the image on a software > raid and tell qemu to disable caching (iow. use O_DIRECT, because the > guest already does caching anyway). > Run linux in the VM, add part of the/a disk on the raid as swap, and > cause the guest to start swapping a lot. And then ending up with data corruption on MD. Which I really would love to see reproduced, especially with recent kernels, as there is a lot of vagueness around it (add part of the disk on the raid as swap? How? In the host? On the guest?). Hint: we (SUSE) have a bugzilla.suse.com. And if someone would be reproducing that with, say, OpenSUSE Tumbleweed and open a bugzilla someone on this list would be more than happy to have a look and do a proper debugging here. There are a lot of things which have changed since 2017 (Stable pages? Anyone?), so it might be that the cited issue simply is not reproducible anymore. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2024-10-10 8:34 ` Hannes Reinecke @ 2025-10-11 19:25 ` Roland 2025-10-13 6:48 ` Hannes Reinecke 0 siblings, 1 reply; 15+ messages in thread From: Roland @ 2025-10-11 19:25 UTC (permalink / raw) To: Hannes Reinecke, Reindl Harald, linux-raid hello, some late reply for this... > Am 10.10.24 um 10:34 schrieb Hannes Reinecke: > Which I really would love > to see reproduced, especially with recent kernels, as there is a lot of > vagueness around it (add part of the disk on the raid as swap? How? > In the host? On the guest?). here is a reproducer everybody should be able to follow/reproduce. 1. install proxmox pve9 on a system with two empty disks for mdraid. build the mdraid and format it with ext4. 2. add that ext4 mountpoint as a datastore type "dir" for file/vm storage in proxmox. 3. install debian 13 in a normal/default (cache=none, i.e. O_DIRECT = on) linux VM. the virtual disk should be backed by the mdraid/ext4 datastore created above. 4. inside the vm as an ordinary user get break-raid-odirect.c from https://forum.proxmox.com/threads/mdraid-o_direct.156036/post-713543 , compile it and let it run for a while, then terminate it with ctrl-c. 5. on the pve host, check that your raid has not thrown any error and that mismatch_cnt is not already >0 ( cat /sys/block/md127/md/mismatch_cnt ) in the meantime. 6. on the pve host start a raid check with "echo check > /sys/block/md127/md/sync_action" 7. let that check run and wait until it finishes (/proc/mdstat) 8. check for inconsistencies via "cat /sys/block/md127/md/mismatch_cnt" again i am getting: cat /sys/block/md127/md/mismatch_cnt 1048832 so, we see that even with a recent kernel (the pve9 kernel is 6.14, based on the ubuntu kernel), we can break mdraid as a non-root user inside a qemu VM on top of ext4 on top of mdraid. roland Am 10.10.24 um 10:34 schrieb Hannes Reinecke: > On 10/10/24 09:29, Roland wrote: >> thank you for clearing things up. 
>> >> >Which means that the test case is actually invalid; you either would >> need drop O_DIRECT or modify the buffer >> >after write() to arrive with a valid example. >> >> ok, but what about running virtual machines in O_DIRECT mode on top of >> mdraid then ? >> >> https://forum.proxmox.com/threads/zfs-on-debian-or-mdadm-softraid- >> stability-and-reliability-of-zfs.116871/post-505697 >> > > The example quoted is this: > > Take a virtual machine, give it a disk - put the image on a software > > raid and tell qemu to disable caching (iow. use O_DIRECT, because the > > guest already does caching anyway). > > Run linux in the VM, add part of the/a disk on the raid as swap, and > > cause the guest to start swapping a lot. > > And then ending up with data corruption on MD. Which I really would love > to see reproduced, especially with recent kernels, as there is a lot of > vagueness around it (add part of the disk on the raid as swap? How? > In the host? On the guest?). > > Hint: we (SUSE) have a bugzilla.suse.com. And if someone would be > reproducing that with, say, OpenSUSE Tumbleweed and open a bugzilla > someone on this list would be more than happy to have a look and do > a proper debugging here. There are a lot of things which have changed > since 2017 (Stable pages? Anyone?), so it might be that the cited issue > simply is not reproducible anymore. > > Cheers, > > Hannes ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT 2025-10-11 19:25 ` Roland @ 2025-10-13 6:48 ` Hannes Reinecke 2025-10-13 19:06 ` Roland [not found] ` <6fb3e2cb-8eeb-4e76-9364-16348d807784@web.de> 0 siblings, 2 replies; 15+ messages in thread From: Hannes Reinecke @ 2025-10-13 6:48 UTC (permalink / raw) To: Roland, Reindl Harald, linux-raid On 10/11/25 21:25, Roland wrote: > hello, > > some late reply for this... > > > Am 10.10.24 um 10:34 schrieb Hannes Reinecke: > > Which I really would love > > to see reproduced, especially with recent kernels, as there is a lot of > > vagueness around it (add part of the disk on the raid as swap? How? > > In the host? On the guest?). > > here is a reproducer everybody should be able to follow/reproduce. > > 1. install proxmox pve9 on a system with two empty disks for mdraid. > build mdraid and format with ext4 . > > 2. add that ext4 mountpoint as a datastore type "dir" for file/vm > storage in proxmox. > > 3. install a debian13 in a normal/default (cache=none, i.e. O_DIRECT = > on) linux VM. the virtual disk should be backed by that mdraid/ext4 > datastore created above. > > 4. inside the vm as an ordinary user get break-raid-odirect.c from > https://forum.proxmox.com/threads/mdraid-o_direct.156036/post-713543 , > compile that and let that run for a while. then terminate with ctrl-c. > > 5. on the pve host, check if your raid did not throw any error or has > mismatch_count >0 ( cat /sys/block/md127/md/mismatch_cnt ) in the meantime. > > 6. on the pve host start raid check with "echo check > /sys/block/ > md127/md/sync_action" > > 6. let that check run and wait until it finishes (/proc/mdstat) > > 7. 
check for inconsistencies via "cat /sys/block/md127/md/mismatch_cnt" > again > > i am getting: > > cat /sys/block/md127/md/mismatch_cnt > 1048832 > > so , we see that even with recent kernel (pve9 kernel is 6.14 based on > ubuntu kernel), we can break mdraid from non-root user inside a qemu VM > on top ext4 on top of mdraid. > And what would happen if you use 'xfs' instead of 'ext4'? ext4 has some nasty requirements regarding 'flush', and that might well explain the issue here. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT
  2025-10-13  6:48     ` Hannes Reinecke
@ 2025-10-13 19:06       ` Roland
  [not found]             ` <6fb3e2cb-8eeb-4e76-9364-16348d807784@web.de>
  1 sibling, 0 replies; 15+ messages in thread
From: Roland @ 2025-10-13 19:06 UTC (permalink / raw)
  To: Hannes Reinecke, Reindl Harald, linux-raid

hello,

>> 7. check for inconsistencies via "cat
>> /sys/block/md127/md/mismatch_cnt" again
>>
>> i am getting:
>>
>> cat /sys/block/md127/md/mismatch_cnt
>> 1048832
>>
>> so, we see that even with a recent kernel (the pve9 kernel is 6.14,
>> based on the ubuntu kernel), we can break mdraid from a non-root user
>> inside a qemu VM on top of ext4 on top of mdraid.
>>
>
> And what would happen if you use 'xfs' instead of 'ext4'?
> ext4 has some nasty requirements regarding 'flush', and that might well
> explain the issue here.
>
> Cheers,
>
> Hannes

thanks for the feedback and for this hint.

i tested with xfs today, and it seems to make no difference.

i can inject inconsistencies into the raid with the mentioned tool, via
a debian VM with xfs inside debian, hosted on an xfs-formatted md raid1
on proxmox.

root@pve-hpmini-gen8:~# cat /sys/block/md126/md/mismatch_cnt
59648

# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid4] [raid5] [raid6] [raid10] [linear]
md126 : active raid1 sdb3[1] sda3[0]
      48794624 blocks super 1.2 [2/2] [UU]
      [==================>..]  check = 90.5% (44183744/48794624) finish=0.7min speed=107987K/sec
      bitmap: 1/1 pages [4KB], 65536KB chunk

md127 : active raid1 sdb1[2] sda1[3]
      48794624 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>

roland

On 13.10.25 at 08:48, Hannes Reinecke wrote:
> On 10/11/25 21:25, Roland wrote:
>> hello,
>>
>> some late reply for this...
>>
>> > On 10.10.24 at 10:34, Hannes Reinecke wrote:
>> > Which I really would love
>> > to see reproduced, especially with recent kernels, as there is a lot of
>> > vagueness around it (add part of the disk on the raid as swap? How?
>> > In the host? On the guest?).
>>
>> here is a reproducer everybody should be able to follow/reproduce.
>>
>> 1. install proxmox pve9 on a system with two empty disks for mdraid.
>> build mdraid and format with ext4.
>>
>> 2. add that ext4 mountpoint as a datastore of type "dir" for file/vm
>> storage in proxmox.
>>
>> 3. install a debian13 in a normal/default (cache=none, i.e. O_DIRECT
>> = on) linux VM. the virtual disk should be backed by the mdraid/ext4
>> datastore created above.
>>
>> 4. inside the vm, as an ordinary user, get break-raid-odirect.c from
>> https://forum.proxmox.com/threads/mdraid-o_direct.156036/post-713543 ,
>> compile it and let it run for a while. then terminate with ctrl-c.
>>
>> 5. on the pve host, check whether your raid threw any error or shows
>> mismatch_cnt > 0 (cat /sys/block/md127/md/mismatch_cnt) in the
>> meantime.
>>
>> 6. on the pve host, start a raid check with
>> "echo check > /sys/block/md127/md/sync_action"
>>
>> 6. let that check run and wait until it finishes (/proc/mdstat)
>>
>> 7. check for inconsistencies via
>> "cat /sys/block/md127/md/mismatch_cnt" again
>>
>> i am getting:
>>
>> cat /sys/block/md127/md/mismatch_cnt
>> 1048832
>>
>> so, we see that even with a recent kernel (the pve9 kernel is 6.14,
>> based on the ubuntu kernel), we can break mdraid from a non-root user
>> inside a qemu VM on top of ext4 on top of mdraid.
>>
>
> And what would happen if you use 'xfs' instead of 'ext4'?
> ext4 has some nasty requirements regarding 'flush', and that might well
> explain the issue here.
>
> Cheers,
>
> Hannes
[parent not found: <6fb3e2cb-8eeb-4e76-9364-16348d807784@web.de>]
* Re: status of bugzilla #99171 - mdraid broken for O_DIRECT
  [not found]             ` <6fb3e2cb-8eeb-4e76-9364-16348d807784@web.de>
@ 2025-10-14  6:31               ` Hannes Reinecke
  0 siblings, 0 replies; 15+ messages in thread
From: Hannes Reinecke @ 2025-10-14 6:31 UTC (permalink / raw)
  To: Roland, Reindl Harald, linux-raid

On 10/13/25 21:04, Roland wrote:
> hello,
>
>>> 7. check for inconsistencies via
>>> "cat /sys/block/md127/md/mismatch_cnt" again
>>>
>>> i am getting:
>>>
>>> cat /sys/block/md127/md/mismatch_cnt
>>> 1048832
>>>
>>> so, we see that even with a recent kernel (the pve9 kernel is 6.14,
>>> based on the ubuntu kernel), we can break mdraid from a non-root user
>>> inside a qemu VM on top of ext4 on top of mdraid.
>>>
>>
>> And what would happen if you use 'xfs' instead of 'ext4'?
>> ext4 has some nasty requirements regarding 'flush', and that might well
>> explain the issue here.
>>
>> Cheers,
>>
>> Hannes
>
> thanks for the feedback and for this hint.
>
> i tested with xfs today, and it seems to make no difference.
>
> i can inject inconsistencies at block level with the mentioned tool,
> via a debian VM with xfs inside debian, hosted on an xfs-formatted
> md raid1 on proxmox.
>
> root@pve-hpmini-gen8:~# cat /sys/block/md126/md/mismatch_cnt
> 59648
>
> # cat /proc/mdstat
> Personalities : [raid0] [raid1] [raid4] [raid5] [raid6] [raid10] [linear]
> md126 : active raid1 sdb3[1] sda3[0]
>       48794624 blocks super 1.2 [2/2] [UU]
>       [==================>..]  check = 90.5% (44183744/48794624) finish=0.7min speed=107987K/sec
>       bitmap: 1/1 pages [4KB], 65536KB chunk
>
> md127 : active raid1 sdb1[2] sda1[3]
>       48794624 blocks super 1.2 [2/2] [UU]
>       bitmap: 0/1 pages [0KB], 65536KB chunk
>
> unused devices: <none>
>
Hmm. I still would argue that the testcase quoted is invalid.

What you do is to issue writes of a given buffer, while at the
same time modifying the contents of that buffer.

As we're doing zero-copy with O_DIRECT, the buffer passed to pwrite
is _the same_ buffer used when issuing the write to disk. The
block layer assumes that the buffer will _not_ be modified
while writing to disk (i.e. between issuing 'pwrite' and the resulting
request being sent to disk).
But that's not the case here; it will be modified, and consequently
all sorts of issues will pop up.
We had all sorts of fun with this issue some years back, until
we fixed up all filesystems to handle it correctly; if interested,
dig up the threads regarding 'stable pages' on linux-fsdevel.

I would think you will end up with a corrupted filesystem if you
run this without mdraid, just by using btrfs with data checksumming.

So really I'm not sure how to go from here; I would declare this
as invalid, but what do I know ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                   Kernel Storage Architect
hare@suse.de                                 +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
end of thread, other threads:[~2025-10-20 6:44 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <A4168F21-4CDF-4BAD-8754-30BAA1315C6F@web.de>
2025-10-14 20:14 ` status of bugzilla #99171 - mdraid broken for O_DIRECT Roland
2025-10-15 6:56 ` Hannes Reinecke
2025-10-15 23:09 ` Roland
2025-10-16 6:02 ` Hannes Reinecke
2025-10-17 20:18 ` Roland
2025-10-20 6:44 ` Hannes Reinecke
2024-10-09 20:08 Roland
2024-10-09 21:38 ` Reindl Harald
2024-10-10 6:53 ` Hannes Reinecke
2024-10-10 7:29 ` Roland
2024-10-10 8:34 ` Hannes Reinecke
2025-10-11 19:25 ` Roland
2025-10-13 6:48 ` Hannes Reinecke
2025-10-13 19:06 ` Roland
[not found] ` <6fb3e2cb-8eeb-4e76-9364-16348d807784@web.de>
2025-10-14 6:31 ` Hannes Reinecke
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox