* Re: [REGRESSION] LVM-on-LVM: error while submitting device barriers
       [not found] ` <CAOCpoWf3TSQkUUo-qsj0LVEOm-kY0hXdmttLE82Ytc0hjpTSPw@mail.gmail.com>
@ 2024-02-28 17:25 ` Patrick Plenefisch
  2024-02-28 19:19   ` Goffredo Baroncelli
  0 siblings, 1 reply; 15+ messages in thread

From: Patrick Plenefisch @ 2024-02-28 17:25 UTC (permalink / raw)
  To: stable, linux-kernel
  Cc: Alasdair Kergon, Mike Snitzer, Mikulas Patocka, Chris Mason,
	Josef Bacik, David Sterba, regressions, dm-devel, linux-btrfs

I'm unsure if this is just an LVM bug, or a BTRFS+LVM interaction bug,
but LVM is definitely involved somehow.

Upgrading from 5.10 to 6.1, I noticed one of my filesystems was
read-only. In dmesg, I found:

BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr
0, rd 0, flush 1, corrupt 0, gen 0
BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max
tolerance is 0 for writable mount
BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO
failure (errors while submitting device barriers.)
BTRFS info (device dm-75: state E): forced readonly
BTRFS warning (device dm-75: state E): Skipping commit of aborted transaction.
BTRFS: error (device dm-75: state EA) in cleanup_transaction:1992:
errno=-5 IO failure

At first I suspected a btrfs error, but a scrub found no errors, and
the filesystem continued to be read-write on 5.10 kernels.

Here is my setup:

/dev/lvm/brokenDisk is an lvm-on-lvm volume. I have /dev/sd{a,b,c,d}
(of varying sizes) in a lower VG, which has three LVs, all raid1
volumes. Two of those volumes are further used as PVs for upper VGs.
One of the upper VGs has no issues. The non-PV LV has no issues. The
remaining LV, /dev/lowerVG/lvmPool, hosting nested LVM, is used as a
PV for VG "lvm" and has 3 volumes inside. Two of those volumes have no
issues (and are btrfs), but the last one is /dev/lvm/brokenDisk. This
volume is the only one that exhibits this behavior, so something about
it is special.
Or described as layers:

/dev/sd{a,b,c,d} => PV => VG "lowerVG"
/dev/lowerVG/single (RAID1 LV) => BTRFS, works fine
/dev/lowerVG/works (RAID1 LV) => PV => VG "workingUpper"
/dev/workingUpper/{a,b,c} => BTRFS, works fine
/dev/lowerVG/lvmPool (RAID1 LV) => PV => VG "lvm"
/dev/lvm/{a,b} => BTRFS, works fine
/dev/lvm/brokenDisk => BTRFS, exhibits errors

After some investigation, here is what I've found:

1. This regression was introduced in 5.19. On 5.18 and earlier kernels I
can keep this filesystem rw and everything works as expected, while on
5.19.0 and later the filesystem is immediately ro on any write
attempt. I couldn't build rc1, but I did confirm rc2 already has this
regression.
2. Passing /dev/lvm/brokenDisk to a KVM VM as /dev/vdb exhibits the ro
barrier problem even with an unaffected kernel inside the VM.
3. Passing /dev/lowerVG/lvmPool to a KVM VM as /dev/vdb with an
affected kernel inside the VM and using LVM inside the VM exhibits
correct behavior (I can keep the filesystem rw, no barrier errors on
host or guest)
4. I discussed this on IRC with the BTRFS folks, and they think the
BTRFS filesystem is fine (btrfs check and btrfs scrub also agree).
5. The dmesg error can be delayed indefinitely by not writing to the
disk, or by reading with noatime.
6. This affects Debian, Ubuntu, NixOS, and Solus, so I'm fairly
certain it's distro-agnostic and purely a kernel issue.
7. I can't reproduce this with other LVM-on-LVM setups, so I think the
asymmetric nature of the raid1 volumes is potentially contributing.
8. There are no new SMART errors/failures on any of the disks; the
disks are healthy.
9. I previously had raidintegrity=y and caching enabled. They didn't
affect the issue.

#regzbot introduced v5.18..v5.19-rc2

Patrick

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [REGRESSION] LVM-on-LVM: error while submitting device barriers
  2024-02-28 17:25 ` [REGRESSION] LVM-on-LVM: error while submitting device barriers Patrick Plenefisch
@ 2024-02-28 19:19 ` Goffredo Baroncelli
  2024-02-28 19:37   ` Patrick Plenefisch

From: Goffredo Baroncelli @ 2024-02-28 19:19 UTC (permalink / raw)
  To: Patrick Plenefisch, stable, linux-kernel
  Cc: Alasdair Kergon, Mike Snitzer, Mikulas Patocka, Chris Mason,
	Josef Bacik, David Sterba, regressions, dm-devel, linux-btrfs

On 28/02/2024 18.25, Patrick Plenefisch wrote:
> I'm unsure if this is just an LVM bug, or a BTRFS+LVM interaction bug,
> but LVM is definitely involved somehow.
[...]
> Or described as layers:
> /dev/sd{a,b,c,d} => PV => VG "lowerVG"
> /dev/lowerVG/single (RAID1 LV) => BTRFS, works fine
> /dev/lowerVG/works (RAID1 LV) => PV => VG "workingUpper"
> /dev/workingUpper/{a,b,c} => BTRFS, works fine
> /dev/lowerVG/lvmPool (RAID1 LV) => PV => VG "lvm"
> /dev/lvm/{a,b} => BTRFS, works fine
> /dev/lvm/brokenDisk => BTRFS, exhibits errors

I am a bit curious about the reasons for this setup. However, I
understood that:

/dev/sda -+                 +-- single (RAID1) -> ok            +-> a ok
/dev/sdb  |                 |                                   |-> b ok
/dev/sdc  +--> [lowerVG]>--+-- works (RAID1) -> [workingUpper] -+-> c ok
/dev/sdd -+                 |
                            |                       +-> a -> ok
                            +-- lvmPool -> [lvm] ->-|
                                                    +-> b -> ok
                                                    |
                                                    +->brokenDisk -> fail

[xxx] means a VG; the others are LVs that may also act as PVs in
an upper VG.

So, it seems that

1) lowerVG/lvmPool/lvm/a
2) lowerVG/lvmPool/lvm/a
3) lowerVG/lvmPool/lvm/brokenDisk

are equivalent ... so I don't understand how 1) and 2) are fine but 3) is
problematic.

Is my understanding of the LVM layouts correct ?

> After some investigation, here is what I've found:
>
> 1. This regression was introduced in 5.19. On 5.18 and earlier kernels I
> can keep this filesystem rw and everything works as expected, while on
> 5.19.0 and later the filesystem is immediately ro on any write
> attempt. I couldn't build rc1, but I did confirm rc2 already has this
> regression.
> 2. Passing /dev/lvm/brokenDisk to a KVM VM as /dev/vdb exhibits the ro
> barrier problem even with an unaffected kernel inside the VM.

Is /dev/lvm/brokenDisk *always* problematic with affected ( >= 5.19 ) and
UNaffected ( < 5.19 ) kernels ?

> 3. Passing /dev/lowerVG/lvmPool to a KVM VM as /dev/vdb with an
> affected kernel inside the VM and using LVM inside the VM exhibits
> correct behavior (I can keep the filesystem rw, no barrier errors on
> host or guest)

Is /dev/lowerVG/lvmPool problematic only with "affected" kernels ?

[...]

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: [REGRESSION] LVM-on-LVM: error while submitting device barriers
  2024-02-28 19:19 ` Goffredo Baroncelli
@ 2024-02-28 19:37 ` Patrick Plenefisch
  2024-02-29 19:56   ` Goffredo Baroncelli

From: Patrick Plenefisch @ 2024-02-28 19:37 UTC (permalink / raw)
  To: kreijack
  Cc: stable, linux-kernel, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Chris Mason, Josef Bacik, David Sterba,
	regressions, dm-devel, linux-btrfs

On Wed, Feb 28, 2024 at 2:19 PM Goffredo Baroncelli <kreijack@libero.it> wrote:
>
> On 28/02/2024 18.25, Patrick Plenefisch wrote:
> > I'm unsure if this is just an LVM bug, or a BTRFS+LVM interaction bug,
> > but LVM is definitely involved somehow.
[...]
> > Or described as layers:
> > /dev/sd{a,b,c,d} => PV => VG "lowerVG"
> > /dev/lowerVG/single (RAID1 LV) => BTRFS, works fine
> > /dev/lowerVG/works (RAID1 LV) => PV => VG "workingUpper"
> > /dev/workingUpper/{a,b,c} => BTRFS, works fine
> > /dev/lowerVG/lvmPool (RAID1 LV) => PV => VG "lvm"
> > /dev/lvm/{a,b} => BTRFS, works fine
> > /dev/lvm/brokenDisk => BTRFS, exhibits errors
>
> I am a bit curious about the reasons for this setup.

The lowerVG is supposed to be a pool of storage for several VMs &
containers. [workingUpper] is for one VM, and [lvm] is for another VM.
However, right now I'm still trying to organize the files directly
because I don't have all the VMs fully set up yet.

> However, I understood that:
>
> /dev/sda -+                 +-- single (RAID1) -> ok            +-> a ok
> /dev/sdb  |                 |                                   |-> b ok
> /dev/sdc  +--> [lowerVG]>--+-- works (RAID1) -> [workingUpper] -+-> c ok
> /dev/sdd -+                 |
>                             |                       +-> a -> ok
>                             +-- lvmPool -> [lvm] ->-|
>                                                     +-> b -> ok
>                                                     |
>                                                     +->brokenDisk -> fail
>
> [xxx] means a VG; the others are LVs that may also act as PVs in
> an upper VG.

Note that lvmPool is also RAID1, but yes.

> So, it seems that
>
> 1) lowerVG/lvmPool/lvm/a
> 2) lowerVG/lvmPool/lvm/a
> 3) lowerVG/lvmPool/lvm/brokenDisk
>
> are equivalent ... so I don't understand how 1) and 2) are fine but 3) is
> problematic.

I assume you meant lvm/b for 2?

> Is my understanding of the LVM layouts correct ?

Your understanding is correct. The only thing that comes to my mind to
cause the problem is the asymmetry of the SATA devices. I have one 8TB
device, plus 1.5TB, 3TB, and 3TB drives. Doing the math on the actual
extents, lowerVG/single spans (3TB+3TB), and
lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have
the other leg of raid1 on the 8TB drive, but my thought was that the
jump across the 1.5+3TB drive gap was at least "interesting".

> > After some investigation, here is what I've found:
> >
> > 1. This regression was introduced in 5.19. On 5.18 and earlier kernels I
> > can keep this filesystem rw and everything works as expected, while on
> > 5.19.0 and later the filesystem is immediately ro on any write
> > attempt. I couldn't build rc1, but I did confirm rc2 already has this
> > regression.
> > 2. Passing /dev/lvm/brokenDisk to a KVM VM as /dev/vdb exhibits the ro
> > barrier problem even with an unaffected kernel inside the VM.
>
> Is /dev/lvm/brokenDisk *always* problematic with affected ( >= 5.19 ) and
> UNaffected ( < 5.19 ) kernels ?

Yes. I didn't test it in as much depth, but 5.15 and 6.1 in the VM
(and 6.1 on the host) are identically problematic.

> > 3. Passing /dev/lowerVG/lvmPool to a KVM VM as /dev/vdb with an
> > affected kernel inside the VM and using LVM inside the VM exhibits
> > correct behavior (I can keep the filesystem rw, no barrier errors on
> > host or guest)
>
> Is /dev/lowerVG/lvmPool problematic only with "affected" kernels ?

Uh, passing lvmPool directly to the VM is never problematic. I tested
5.10 and 6.1 in the VM (and 6.1 on the host), and neither setup throws
barrier errors.

> [...]
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: [REGRESSION] LVM-on-LVM: error while submitting device barriers
  2024-02-28 19:37 ` Patrick Plenefisch
@ 2024-02-29 19:56 ` Goffredo Baroncelli
  2024-02-29 20:22   ` Patrick Plenefisch

From: Goffredo Baroncelli @ 2024-02-29 19:56 UTC (permalink / raw)
  To: Patrick Plenefisch
  Cc: stable, linux-kernel, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Chris Mason, Josef Bacik, David Sterba,
	regressions, dm-devel, linux-btrfs

On 28/02/2024 20.37, Patrick Plenefisch wrote:
> On Wed, Feb 28, 2024 at 2:19 PM Goffredo Baroncelli <kreijack@libero.it> wrote:
>>
>> On 28/02/2024 18.25, Patrick Plenefisch wrote:
>>> I'm unsure if this is just an LVM bug, or a BTRFS+LVM interaction bug,
>>> but LVM is definitely involved somehow.
[...]
>> I am a bit curious about the reasons for this setup.
>
> The lowerVG is supposed to be a pool of storage for several VMs &
> containers. [workingUpper] is for one VM, and [lvm] is for another VM.
> However, right now I'm still trying to organize the files directly
> because I don't have all the VMs fully set up yet.
>
>> However, I understood that:
>>
>> /dev/sda -+                 +-- single (RAID1) -> ok            +-> a ok
>> /dev/sdb  |                 |                                   |-> b ok
>> /dev/sdc  +--> [lowerVG]>--+-- works (RAID1) -> [workingUpper] -+-> c ok
>> /dev/sdd -+                 |
>>                             |                               +-> a -> ok
>>                             +-- lvmPool (raid1)-> [lvm] ->-|
>>                                                            +-> b -> ok
>>                                                            |
>>                                                            +->brokenDisk -> fail
>>
>> [xxx] means a VG; the others are LVs that may also act as PVs in
>> an upper VG.
>
> Note that lvmPool is also RAID1, but yes.
>
>> So, it seems that
>>
>> 1) lowerVG/lvmPool/lvm/a
>> 2) lowerVG/lvmPool/lvm/a
>> 3) lowerVG/lvmPool/lvm/brokenDisk
>>
>> are equivalent ... so I don't understand how 1) and 2) are fine but 3) is
>> problematic.
>
> I assume you meant lvm/b for 2?

Yes

>> Is my understanding of the LVM layouts correct ?
>
> Your understanding is correct. The only thing that comes to my mind to
> cause the problem is the asymmetry of the SATA devices. I have one 8TB
> device, plus 1.5TB, 3TB, and 3TB drives. Doing the math on the actual
> extents, lowerVG/single spans (3TB+3TB), and
> lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have
> the other leg of raid1 on the 8TB drive, but my thought was that the
> jump across the 1.5+3TB drive gap was at least "interesting".

What about lowerVG/works ?

However yes, I agree that the pair of disks involved may be the answer
to the problem.

Could you show us the output of

$ sudo pvdisplay -m

> [...]

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: [REGRESSION] LVM-on-LVM: error while submitting device barriers
  2024-02-29 19:56 ` Goffredo Baroncelli
@ 2024-02-29 20:22 ` Patrick Plenefisch
  2024-02-29 22:05   ` Goffredo Baroncelli

From: Patrick Plenefisch @ 2024-02-29 20:22 UTC (permalink / raw)
  To: kreijack
  Cc: stable, linux-kernel, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Chris Mason, Josef Bacik, David Sterba,
	regressions, dm-devel, linux-btrfs

On Thu, Feb 29, 2024 at 2:56 PM Goffredo Baroncelli <kreijack@inwind.it> wrote:
>
>> Your understanding is correct. The only thing that comes to my mind to
>> cause the problem is the asymmetry of the SATA devices. I have one 8TB
>> device, plus 1.5TB, 3TB, and 3TB drives. Doing the math on the actual
>> extents, lowerVG/single spans (3TB+3TB), and
>> lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have
>> the other leg of raid1 on the 8TB drive, but my thought was that the
>> jump across the 1.5+3TB drive gap was at least "interesting"
>
> What about lowerVG/works ?

That one is only on two disks; it doesn't span any gaps.

> However yes, I agree that the pair of disks involved may be the answer
> to the problem.
>
> Could you show us the output of
>
> $ sudo pvdisplay -m

I trimmed it, but kept the relevant bits (Free PE is thus not correct):

  --- Physical volume ---
  PV Name               /dev/lowerVG/lvmPool
  VG Name               lvm
  PV Size               <3.00 TiB / not usable 3.00 MiB
  Allocatable           yes
  PE Size               4.00 MiB
  Total PE              786431
  Free PE               82943
  Allocated PE          703488
  PV UUID               7p3LSU-EAHd-xUg0-r9vT-Gzkf-tYFV-mvlU1M

  --- Physical Segments ---
  Physical extent 0 to 159999:
    Logical volume      /dev/lvm/brokenDisk
    Logical extents     0 to 159999
  Physical extent 160000 to 339199:
    Logical volume      /dev/lvm/a
    Logical extents     0 to 179199
  Physical extent 339200 to 349439:
    Logical volume      /dev/lvm/brokenDisk
    Logical extents     160000 to 170239
  Physical extent 349440 to 351999:
    FREE
  Physical extent 352000 to 460026:
    Logical volume      /dev/lvm/brokenDisk
    Logical extents     416261 to 524287
  Physical extent 460027 to 540409:
    FREE
  Physical extent 540410 to 786430:
    Logical volume      /dev/lvm/brokenDisk
    Logical extents     170240 to 416260

  --- Physical volume ---
  PV Name               /dev/sda3
  VG Name               lowerVG
  PV Size               <2.70 TiB / not usable 3.00 MiB
  Allocatable           yes
  PE Size               4.00 MiB
  Total PE              707154
  Free PE               909
  Allocated PE          706245
  PV UUID               W8gJ0P-JuMs-1y3g-b5cO-4RuA-MoFs-3zgKBn

  --- Physical Segments ---
  Physical extent 0 to 52223:
    Logical volume      /dev/lowerVG/single_corig_rimage_0_iorig
    Logical extents     629330 to 681553
  Physical extent 52224 to 628940:
    Logical volume      /dev/lowerVG/single_corig_rimage_0_iorig
    Logical extents     0 to 576716
  Physical extent 628941 to 628941:
    Logical volume      /dev/lowerVG/single_corig_rmeta_0
    Logical extents     0 to 0
  Physical extent 628942 to 628962:
    Logical volume      /dev/lowerVG/single_corig_rimage_0_iorig
    Logical extents     681554 to 681574
  Physical extent 628963 to 634431:
    Logical volume      /dev/lowerVG/single_corig_rimage_0_imeta
    Logical extents     0 to 5468
  Physical extent 634432 to 654540:
    FREE
  Physical extent 654541 to 707153:
    Logical volume      /dev/lowerVG/single_corig_rimage_0_iorig
    Logical extents     576717 to 629329

  --- Physical volume ---
  PV Name               /dev/sdf2
  VG Name               lowerVG
  PV Size               <7.28 TiB / not usable 4.00 MiB
  Allocatable           yes
  PE Size               4.00 MiB
  Total PE              1907645
  Free PE               414967
  Allocated PE          1492678
  PV UUID               my0zQM-832Z-HYPD-sNfW-68ms-nddg-lMyWJM

  --- Physical Segments ---
  Physical extent 0 to 0:
    Logical volume      /dev/lowerVG/single_corig_rmeta_1
    Logical extents     0 to 0
  Physical extent 1 to 681575:
    Logical volume      /dev/lowerVG/single_corig_rimage_1_iorig
    Logical extents     0 to 681574
  Physical extent 681576 to 687044:
    Logical volume      /dev/lowerVG/single_corig_rimage_1_imeta
    Logical extents     0 to 5468
  Physical extent 687045 to 687045:
    Logical volume      /dev/lowerVG/lvmPool_rmeta_0
    Logical extents     0 to 0
  Physical extent 687046 to 1049242:
    Logical volume      /dev/lowerVG/lvmPool_rimage_0
    Logical extents     0 to 362196
  Physical extent 1049243 to 1056551:
    FREE
  Physical extent 1056552 to 1473477:
    Logical volume      /dev/lowerVG/lvmPool_rimage_0
    Logical extents     369506 to 786431
  Physical extent 1473478 to 1480786:
    Logical volume      /dev/lowerVG/lvmPool_rimage_0
    Logical extents     362197 to 369505
  Physical extent 1480787 to 1907644:
    FREE

  --- Physical volume ---
  PV Name               /dev/sdb3
  VG Name               lowerVG
  PV Size               1.33 TiB / not usable 3.00 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              349398
  Free PE               0
  Allocated PE          349398
  PV UUID               Ncmgdw-ZOXS-qTYL-1jAz-w7zt-38V2-f53EpI

  --- Physical Segments ---
  Physical extent 0 to 0:
    Logical volume      /dev/lowerVG/lvmPool_rmeta_1
    Logical extents     0 to 0
  Physical extent 1 to 349397:
    Logical volume      /dev/lowerVG/lvmPool_rimage_1
    Logical extents     0 to 349396

  --- Physical volume ---
  PV Name               /dev/sde2
  VG Name               lowerVG
  PV Size               2.71 TiB / not usable 3.00 MiB
  Allocatable           yes
  PE Size               4.00 MiB
  Total PE              711346
  Free PE               255111
  Allocated PE          456235
  PV UUID               xUG8TG-wvp0-roBo-GPo7-sbvn-aE7I-NAHU07

  --- Physical Segments ---
  Physical extent 0 to 416925:
    Logical volume      /dev/lowerVG/lvmPool_rimage_1
    Logical extents     369506 to 786431
  Physical extent 416926 to 437034:
    Logical volume      /dev/lowerVG/lvmPool_rimage_1
    Logical extents     349397 to 369505
  Physical extent 437035 to 711345:
    FREE

Finally, I am not sure if it's relevant, but I did struggle to expand
the raid1 volumes across gaps when creating this setup. I did file a
bug about that, though I am not sure if it's relevant, as I removed
integrity and cache for brokenDisk & lvmPool:
https://gitlab.com/lvmteam/lvm2/-/issues/6

Patrick
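[Editorial aside: the segment maps above are tedious to cross-check by hand. The sketch below (not part of the thread; it assumes the indented `pvdisplay -m` layout shown above, with one `Physical extent A to B:` header per segment followed by either a `Logical volume` line or `FREE`) tabulates which LVs a PV's extents back, and sums allocated space per LV.]

```python
import re

SEG_RE = re.compile(r"Physical extent (\d+) to (\d+):")
LV_RE = re.compile(r"Logical volume\s+(\S+)")

def segments(pvdisplay_text):
    """Return (start_pe, end_pe, lv_name) tuples parsed from `pvdisplay -m` output."""
    segs, current = [], None
    for line in pvdisplay_text.splitlines():
        m = SEG_RE.search(line)
        if m:
            # Start of a new physical segment; remember its PE range.
            current = (int(m.group(1)), int(m.group(2)))
        elif current and "FREE" in line:
            segs.append((*current, "FREE"))
            current = None
        elif current:
            m = LV_RE.search(line)
            if m:
                segs.append((*current, m.group(1)))
                current = None
    return segs

def extents_per_lv(segs, pe_size_mib=4):
    """Sum extents per LV and convert to GiB (default PE size 4 MiB, as above)."""
    totals = {}
    for start, end, lv in segs:
        totals[lv] = totals.get(lv, 0) + (end - start + 1)
    return {lv: pe * pe_size_mib / 1024 for lv, pe in totals.items()}
```

Run over the lvmPool PV listing above, this reports /dev/lvm/brokenDisk at 524288 extents, i.e. exactly 2 TiB of allocated space on that PV.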
* Re: [REGRESSION] LVM-on-LVM: error while submitting device barriers
  2024-02-29 20:22 ` Patrick Plenefisch
@ 2024-02-29 22:05 ` Goffredo Baroncelli
  2024-03-05 17:45   ` Mike Snitzer

From: Goffredo Baroncelli @ 2024-02-29 22:05 UTC (permalink / raw)
  To: Patrick Plenefisch
  Cc: stable, linux-kernel, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Chris Mason, Josef Bacik, David Sterba,
	regressions, dm-devel, linux-btrfs

On 29/02/2024 21.22, Patrick Plenefisch wrote:
> On Thu, Feb 29, 2024 at 2:56 PM Goffredo Baroncelli <kreijack@inwind.it> wrote:
>>
>>> Your understanding is correct. The only thing that comes to my mind to
>>> cause the problem is the asymmetry of the SATA devices. I have one 8TB
>>> device, plus 1.5TB, 3TB, and 3TB drives. Doing the math on the actual
>>> extents, lowerVG/single spans (3TB+3TB), and
>>> lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have
>>> the other leg of raid1 on the 8TB drive, but my thought was that the
>>> jump across the 1.5+3TB drive gap was at least "interesting"
>>
>> What about lowerVG/works ?
>
> That one is only on two disks; it doesn't span any gaps.

Sorry, but re-reading the original email I found something that I
missed before:

> BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr
> 0, rd 0, flush 1, corrupt 0, gen 0
> BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max
                                               ^^^^^^^^^^^^^^^^^^^^^
> tolerance is 0 for writable mount
> BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO
> failure (errors while submitting device barriers.)

Looking at the code, it seems that if a FLUSH command fails, btrfs
considers that the disk is missing. Then it cannot mount the device RW.

I would investigate with the LVM developers whether it properly passes
the flush/barrier command through all the layers when we have lvm over
lvm (raid1). The fact that the lvm is a raid1 is important, because a
flush command, to be honored, has to be honored by all the devices
involved.

>> However yes, I agree that the pair of disks involved may be the answer
>> to the problem.
>>
>> Could you show us the output of
>>
>> $ sudo pvdisplay -m
>
> I trimmed it, but kept the relevant bits (Free PE is thus not correct):
[...]
> Finally, I am not sure if it's relevant, but I did struggle to expand
> the raid1 volumes across gaps when creating this setup. I did file a
> bug about that, though I am not sure if it's relevant, as I removed
> integrity and cache for brokenDisk & lvmPool:
> https://gitlab.com/lvmteam/lvm2/-/issues/6
>
> Patrick

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
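[Editorial aside: Goffredo's point that a flush is only honored if every device in the stack honors it can be illustrated with a toy model. This is pure Python with no relation to the actual kernel code paths; the class and device names are invented for illustration, loosely following the layout in the thread.]

```python
class Disk:
    """Leaf device; flush_ok=False models a layer that fails or drops a flush."""
    def __init__(self, name, flush_ok=True):
        self.name, self.flush_ok = name, flush_ok

    def flush(self):
        return self.flush_ok

class Raid1:
    """Mirror: a flush succeeds only if *every* leg acknowledges it."""
    def __init__(self, name, legs):
        self.name, self.legs = name, legs

    def flush(self):
        return all(leg.flush() for leg in self.legs)

class Linear:
    """Linear/stacked LV: forwards the flush to each underlying device."""
    def __init__(self, name, below):
        self.name, self.below = name, below

    def flush(self):
        return all(dev.flush() for dev in self.below)

# Rough shape of the reported stack: brokenDisk sits on lvmPool, which is
# itself a RAID1 whose second leg spans two physical disks.
sdf2, sdb3, sde2 = Disk("sdf2"), Disk("sdb3"), Disk("sde2")
lvm_pool = Raid1("lowerVG/lvmPool", [sdf2, Linear("leg1", [sdb3, sde2])])
broken_disk = Linear("lvm/brokenDisk", [lvm_pool])

assert broken_disk.flush()      # every layer acks: btrfs would stay rw
sde2.flush_ok = False           # one device in one leg drops the flush...
assert not broken_disk.flush()  # ...and the failure surfaces at the top
```

A single lost flush anywhere in the fan-out is enough to produce the `flush 1` error counter btrfs reported, which is why the question of whether every DM layer forwards the barrier matters.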
* Re: LVM-on-LVM: error while submitting device barriers 2024-02-29 22:05 ` Goffredo Baroncelli @ 2024-03-05 17:45 ` Mike Snitzer 2024-03-06 15:59 ` Ming Lei 0 siblings, 1 reply; 15+ messages in thread From: Mike Snitzer @ 2024-03-05 17:45 UTC (permalink / raw) To: Patrick Plenefisch Cc: Goffredo Baroncelli, linux-kernel, Alasdair Kergon, Mikulas Patocka, Chris Mason, Josef Bacik, David Sterba, regressions, dm-devel, linux-btrfs, ming.lei On Thu, Feb 29 2024 at 5:05P -0500, Goffredo Baroncelli <kreijack@inwind.it> wrote: > On 29/02/2024 21.22, Patrick Plenefisch wrote: > > On Thu, Feb 29, 2024 at 2:56 PM Goffredo Baroncelli <kreijack@inwind.it> wrote: > > > > > > > Your understanding is correct. The only thing that comes to my mind to > > > > cause the problem is asymmetry of the SATA devices. I have one 8TB > > > > device, plus a 1.5TB, 3TB, and 3TB drives. Doing math on the actual > > > > extents, lowerVG/single spans (3TB+3TB), and > > > > lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have > > > > the other leg of raid1 on the 8TB drive, but my thought was that the > > > > jump across the 1.5+3TB drive gap was at least "interesting" > > > > > > > > > what about lowerVG/works ? > > > > > > > That one is only on two disks, it doesn't span any gaps > > Sorry, but re-reading the original email I found something that I missed before: > > > BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr > > 0, rd 0, flush 1, corrupt 0, gen 0 > > BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > tolerance is 0 for writable mount > > BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO > > failure (errors while submitting device barriers.) > > Looking at the code, it seems that if a FLUSH commands fails, btrfs > considers that the disk is missing. The it cannot mount RW the device. 
> > I would investigate with the LVM developers, if it properly passes > the flush/barrier command through all the layers, when we have an > lvm over lvm (raid1). The fact that the lvm is a raid1, is important because > a flush command to be honored has to be honored by all the > devices involved. Hi Patrick, Your initial report (start of this thread) mentioned that the regression occurred with 5.19. The DM changes that landed during the 5.19 merge window refactored quite a bit of DM core's handling for bio splitting (to simplify DM's newfound support for bio polling) -- Ming Lei (now cc'd) and I wrote these changes: e86f2b005a51 dm: simplify basic targets bdb34759a0db dm: use bio_sectors in dm_accept_partial_bio b992b40dfcc1 dm: don't pass bio to __dm_start_io_acct and dm_end_io_acct e6926ad0c988 dm: pass dm_io instance to dm_io_acct directly d3de6d12694d dm: switch to bdev based IO accounting interfaces 7dd76d1feec7 dm: improve bio splitting and associated IO accounting 2e803cd99ba8 dm: don't grab target io reference in dm_zone_map_bio 0f14d60a023c dm: improve dm_io reference counting ec211631ae24 dm: put all polled dm_io instances into a single list 9d20653fe84e dm: simplify bio-based IO accounting further 4edadf6dcb54 dm: improve abnormal bio processing I'll have a closer look at these DM commits (especially relative to flush bios and your stacked device usage). The last commit (4edadf6dcb54) is marginally relevant (but likely most easily reverted from v5.19-rc2, as a simple test to see if it is somehow the problem... doubtful to be the cause, but worth a try).
(FYI, not relevant because it is specific to REQ_NOWAIT but figured I'd mention it, this commit earlier in the 5.19 DM changes was bogus: 563a225c9fd2 dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio Jens fixed it with this stable@ commit: a9ce385344f9 dm: don't attempt to queue IO under RCU protection) > > > However yes, I agree that the pair of disks involved may be the answer > > > of the problem. > > > > > > Could you show us the output of > > > > > > $ sudo pvdisplay -m > > > > > > > > > > I trimmed it, but kept the relevant bits (Free PE is thus not correct): > > > > > > --- Physical volume --- > > PV Name /dev/lowerVG/lvmPool > > VG Name lvm > > PV Size <3.00 TiB / not usable 3.00 MiB > > Allocatable yes > > PE Size 4.00 MiB > > Total PE 786431 > > Free PE 82943 > > Allocated PE 703488 > > PV UUID 7p3LSU-EAHd-xUg0-r9vT-Gzkf-tYFV-mvlU1M > > > > --- Physical Segments --- > > Physical extent 0 to 159999: > > Logical volume /dev/lvm/brokenDisk > > Logical extents 0 to 159999 > > Physical extent 160000 to 339199: > > Logical volume /dev/lvm/a > > Logical extents 0 to 179199 > > Physical extent 339200 to 349439: > > Logical volume /dev/lvm/brokenDisk > > Logical extents 160000 to 170239 > > Physical extent 349440 to 351999: > > FREE > > Physical extent 352000 to 460026: > > Logical volume /dev/lvm/brokenDisk > > Logical extents 416261 to 524287 > > Physical extent 460027 to 540409: > > FREE > > Physical extent 540410 to 786430: > > Logical volume /dev/lvm/brokenDisk > > Logical extents 170240 to 416260 Please provide the following from guest that activates /dev/lvm/brokenDisk: lsblk dmsetup table Please also provide the same from the host (just for completeness). Also, I didn't see any kernel logs that show DM-specific errors. I doubt you'd have left any DM-specific errors out in your report. So is btrfs the canary here? To be clear: You're only seeing btrfs errors in the kernel log? Mike ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: LVM-on-LVM: error while submitting device barriers 2024-03-05 17:45 ` Mike Snitzer @ 2024-03-06 15:59 ` Ming Lei 2024-03-09 20:39 ` Patrick Plenefisch 0 siblings, 1 reply; 15+ messages in thread From: Ming Lei @ 2024-03-06 15:59 UTC (permalink / raw) To: Mike Snitzer, Patrick Plenefisch Cc: Goffredo Baroncelli, linux-kernel, Alasdair Kergon, Mikulas Patocka, Chris Mason, Josef Bacik, David Sterba, regressions, dm-devel, linux-btrfs On Tue, Mar 05, 2024 at 12:45:13PM -0500, Mike Snitzer wrote: > On Thu, Feb 29 2024 at 5:05P -0500, > Goffredo Baroncelli <kreijack@inwind.it> wrote: > > > On 29/02/2024 21.22, Patrick Plenefisch wrote: > > > On Thu, Feb 29, 2024 at 2:56 PM Goffredo Baroncelli <kreijack@inwind.it> wrote: > > > > > > > > > Your understanding is correct. The only thing that comes to my mind to > > > > > cause the problem is asymmetry of the SATA devices. I have one 8TB > > > > > device, plus a 1.5TB, 3TB, and 3TB drives. Doing math on the actual > > > > > extents, lowerVG/single spans (3TB+3TB), and > > > > > lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have > > > > > the other leg of raid1 on the 8TB drive, but my thought was that the > > > > > jump across the 1.5+3TB drive gap was at least "interesting" > > > > > > > > > > > > what about lowerVG/works ? > > > > > > > > > > That one is only on two disks, it doesn't span any gaps > > > > Sorry, but re-reading the original email I found something that I missed before: > > > > > BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr > > > 0, rd 0, flush 1, corrupt 0, gen 0 > > > BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > tolerance is 0 for writable mount > > > BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO > > > failure (errors while submitting device barriers.) > > > > Looking at the code, it seems that if a FLUSH commands fails, btrfs > > considers that the disk is missing. 
The it cannot mount RW the device. > > > > I would investigate with the LVM developers, if it properly passes > > the flush/barrier command through all the layers, when we have an > > lvm over lvm (raid1). The fact that the lvm is a raid1, is important because > > a flush command to be honored has to be honored by all the > > devices involved. Hello Patrick & Goffredo, I can trigger this kind of btrfs complaint by simulating one FLUSH failure. If you can reproduce this issue easily, please collect log by the following bpftrace script, which may show where the flush failure is, and maybe it can help to narrow down the issue in the whole stack. #!/usr/bin/bpftrace #ifndef BPFTRACE_HAVE_BTF #include <linux/blkdev.h> #endif kprobe:submit_bio_noacct, kprobe:submit_bio / (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 / { $bio = (struct bio *)arg0; @submit_stack[arg0] = kstack; @tracked[arg0] = 1; } kprobe:bio_endio /@tracked[arg0] != 0/ { $bio = (struct bio *)arg0; if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) { return; } if ($bio->bi_status != 0) { printf("dev %s bio failed %d, submitter %s completion %s\n", $bio->bi_bdev->bd_disk->disk_name, $bio->bi_status, @submit_stack[arg0], kstack); } delete(@submit_stack[arg0]); delete(@tracked[arg0]); } END { clear(@submit_stack); clear(@tracked); } Thanks, Ming ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: LVM-on-LVM: error while submitting device barriers 2024-03-06 15:59 ` Ming Lei @ 2024-03-09 20:39 ` Patrick Plenefisch 2024-03-10 11:34 ` Ming Lei 0 siblings, 1 reply; 15+ messages in thread From: Patrick Plenefisch @ 2024-03-09 20:39 UTC (permalink / raw) To: Ming Lei Cc: Mike Snitzer, Goffredo Baroncelli, linux-kernel, Alasdair Kergon, Mikulas Patocka, Chris Mason, Josef Bacik, David Sterba, regressions, dm-devel, linux-btrfs On Wed, Mar 6, 2024 at 11:00 AM Ming Lei <ming.lei@redhat.com> wrote: > > On Tue, Mar 05, 2024 at 12:45:13PM -0500, Mike Snitzer wrote: > > On Thu, Feb 29 2024 at 5:05P -0500, > > Goffredo Baroncelli <kreijack@inwind.it> wrote: > > > > > On 29/02/2024 21.22, Patrick Plenefisch wrote: > > > > On Thu, Feb 29, 2024 at 2:56 PM Goffredo Baroncelli <kreijack@inwind.it> wrote: > > > > > > > > > > > Your understanding is correct. The only thing that comes to my mind to > > > > > > cause the problem is asymmetry of the SATA devices. I have one 8TB > > > > > > device, plus a 1.5TB, 3TB, and 3TB drives. Doing math on the actual > > > > > > extents, lowerVG/single spans (3TB+3TB), and > > > > > > lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have > > > > > > the other leg of raid1 on the 8TB drive, but my thought was that the > > > > > > jump across the 1.5+3TB drive gap was at least "interesting" > > > > > > > > > > > > > > > what about lowerVG/works ? 
> > > > > > > > > > > > > That one is only on two disks, it doesn't span any gaps > > > > > > Sorry, but re-reading the original email I found something that I missed before: > > > > > > > BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr > > > > 0, rd 0, flush 1, corrupt 0, gen 0 > > > > BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > tolerance is 0 for writable mount > > > > BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO > > > > failure (errors while submitting device barriers.) > > > > > > Looking at the code, it seems that if a FLUSH commands fails, btrfs > > > considers that the disk is missing. The it cannot mount RW the device. > > > > > > I would investigate with the LVM developers, if it properly passes > > > the flush/barrier command through all the layers, when we have an > > > lvm over lvm (raid1). The fact that the lvm is a raid1, is important because > > > a flush command to be honored has to be honored by all the > > > devices involved. > > Hello Patrick & Goffredo, > > I can trigger this kind of btrfs complaint by simulating one FLUSH failure. > > If you can reproduce this issue easily, please collect log by the > following bpftrace script, which may show where the flush failure is, > and maybe it can help to narrow down the issue in the whole stack. 
> > > #!/usr/bin/bpftrace > > #ifndef BPFTRACE_HAVE_BTF > #include <linux/blkdev.h> > #endif > > kprobe:submit_bio_noacct, > kprobe:submit_bio > / (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 / > { > $bio = (struct bio *)arg0; > @submit_stack[arg0] = kstack; > @tracked[arg0] = 1; > } > > kprobe:bio_endio > /@tracked[arg0] != 0/ > { > $bio = (struct bio *)arg0; > > if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) { > return; > } > > if ($bio->bi_status != 0) { > printf("dev %s bio failed %d, submitter %s completion %s\n", > $bio->bi_bdev->bd_disk->disk_name, > $bio->bi_status, @submit_stack[arg0], kstack); > } > delete(@submit_stack[arg0]); > delete(@tracked[arg0]); > } > > END { > clear(@submit_stack); > clear(@tracked); > } > Attaching 4 probes... dev dm-77 bio failed 10, submitter submit_bio_noacct+5 __send_duplicate_bios+358 __send_empty_flush+179 dm_submit_bio+857 __submit_bio+132 submit_bio_noacct_nocheck+345 write_all_supers+1718 btrfs_commit_transaction+2342 transaction_kthread+345 kthread+229 ret_from_fork+49 ret_from_fork_asm+27 completion bio_endio+5 dm_submit_bio+955 __submit_bio+132 submit_bio_noacct_nocheck+345 write_all_supers+1718 btrfs_commit_transaction+2342 transaction_kthread+345 kthread+229 ret_from_fork+49 ret_from_fork_asm+27 dev dm-86 bio failed 10, submitter submit_bio_noacct+5 write_all_supers+1718 btrfs_commit_transaction+2342 transaction_kthread+345 kthread+229 ret_from_fork+49 ret_from_fork_asm+27 completion bio_endio+5 clone_endio+295 clone_endio+295 process_one_work+369 worker_thread+635 kthread+229 ret_from_fork+49 ret_from_fork_asm+27 For context, dm-86 is /dev/lvm/brokenDisk and dm-77 is /dev/lowerVG/lvmPool > > > Thanks, > Ming > And to answer Mike's question: > > Also, I didn't see any kernel logs that show DM-specific errors. I > doubt you'd have left any DM-specific errors out in your report. So > is btrfs the canary here? 
To be clear: You're only seeing btrfs > errors in the kernel log? Correct, that's why I initially thought it was a btrfs issue. No DM errors in dmesg, btrfs is just the canary ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: LVM-on-LVM: error while submitting device barriers 2024-03-09 20:39 ` Patrick Plenefisch @ 2024-03-10 11:34 ` Ming Lei 2024-03-10 15:27 ` Mike Snitzer 0 siblings, 1 reply; 15+ messages in thread From: Ming Lei @ 2024-03-10 11:34 UTC (permalink / raw) To: Patrick Plenefisch Cc: Mike Snitzer, Goffredo Baroncelli, linux-kernel, Alasdair Kergon, Mikulas Patocka, Chris Mason, Josef Bacik, David Sterba, regressions, dm-devel, linux-btrfs, ming.lei On Sat, Mar 09, 2024 at 03:39:02PM -0500, Patrick Plenefisch wrote: > On Wed, Mar 6, 2024 at 11:00 AM Ming Lei <ming.lei@redhat.com> wrote: > > > > On Tue, Mar 05, 2024 at 12:45:13PM -0500, Mike Snitzer wrote: > > > On Thu, Feb 29 2024 at 5:05P -0500, > > > Goffredo Baroncelli <kreijack@inwind.it> wrote: > > > > > > > On 29/02/2024 21.22, Patrick Plenefisch wrote: > > > > > On Thu, Feb 29, 2024 at 2:56 PM Goffredo Baroncelli <kreijack@inwind.it> wrote: > > > > > > > > > > > > > Your understanding is correct. The only thing that comes to my mind to > > > > > > > cause the problem is asymmetry of the SATA devices. I have one 8TB > > > > > > > device, plus a 1.5TB, 3TB, and 3TB drives. Doing math on the actual > > > > > > > extents, lowerVG/single spans (3TB+3TB), and > > > > > > > lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have > > > > > > > the other leg of raid1 on the 8TB drive, but my thought was that the > > > > > > > jump across the 1.5+3TB drive gap was at least "interesting" > > > > > > > > > > > > > > > > > > what about lowerVG/works ? 
> > > > > > > > > > > > > > > > That one is only on two disks, it doesn't span any gaps > > > > > > > > Sorry, but re-reading the original email I found something that I missed before: > > > > > > > > > BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr > > > > > 0, rd 0, flush 1, corrupt 0, gen 0 > > > > > BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > > tolerance is 0 for writable mount > > > > > BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO > > > > > failure (errors while submitting device barriers.) > > > > > > > > Looking at the code, it seems that if a FLUSH commands fails, btrfs > > > > considers that the disk is missing. The it cannot mount RW the device. > > > > > > > > I would investigate with the LVM developers, if it properly passes > > > > the flush/barrier command through all the layers, when we have an > > > > lvm over lvm (raid1). The fact that the lvm is a raid1, is important because > > > > a flush command to be honored has to be honored by all the > > > > devices involved. > > > > Hello Patrick & Goffredo, > > > > I can trigger this kind of btrfs complaint by simulating one FLUSH failure. > > > > If you can reproduce this issue easily, please collect log by the > > following bpftrace script, which may show where the flush failure is, > > and maybe it can help to narrow down the issue in the whole stack. 
> > > > > > #!/usr/bin/bpftrace > > > > #ifndef BPFTRACE_HAVE_BTF > > #include <linux/blkdev.h> > > #endif > > > > kprobe:submit_bio_noacct, > > kprobe:submit_bio > > / (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 / > > { > > $bio = (struct bio *)arg0; > > @submit_stack[arg0] = kstack; > > @tracked[arg0] = 1; > > } > > > > kprobe:bio_endio > > /@tracked[arg0] != 0/ > > { > > $bio = (struct bio *)arg0; > > > > if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) { > > return; > > } > > > > if ($bio->bi_status != 0) { > > printf("dev %s bio failed %d, submitter %s completion %s\n", > > $bio->bi_bdev->bd_disk->disk_name, > > $bio->bi_status, @submit_stack[arg0], kstack); > > } > > delete(@submit_stack[arg0]); > > delete(@tracked[arg0]); > > } > > > > END { > > clear(@submit_stack); > > clear(@tracked); > > } > > > > Attaching 4 probes... > dev dm-77 bio failed 10, submitter > submit_bio_noacct+5 > __send_duplicate_bios+358 > __send_empty_flush+179 > dm_submit_bio+857 > __submit_bio+132 > submit_bio_noacct_nocheck+345 > write_all_supers+1718 > btrfs_commit_transaction+2342 > transaction_kthread+345 > kthread+229 > ret_from_fork+49 > ret_from_fork_asm+27 > completion > bio_endio+5 > dm_submit_bio+955 > __submit_bio+132 > submit_bio_noacct_nocheck+345 > write_all_supers+1718 > btrfs_commit_transaction+2342 > transaction_kthread+345 > kthread+229 > ret_from_fork+49 > ret_from_fork_asm+27 > > dev dm-86 bio failed 10, submitter > submit_bio_noacct+5 > write_all_supers+1718 > btrfs_commit_transaction+2342 > transaction_kthread+345 > kthread+229 > ret_from_fork+49 > ret_from_fork_asm+27 > completion > bio_endio+5 > clone_endio+295 > clone_endio+295 > process_one_work+369 > worker_thread+635 > kthread+229 > ret_from_fork+49 > ret_from_fork_asm+27 > > > For context, dm-86 is /dev/lvm/brokenDisk and dm-77 is /dev/lowerVG/lvmPool io_status is 10(BLK_STS_IOERR), which is produced in submission code path on /dev/dm-77(/dev/lowerVG/lvmPool) 
first, so it looks like a device mapper issue. The error should be from the following code only: static void __map_bio(struct bio *clone) ... if (r == DM_MAPIO_KILL) dm_io_dec_pending(io, BLK_STS_IOERR); else dm_io_dec_pending(io, BLK_STS_DM_REQUEUE); break; Patrick, you mentioned lvmPool is raid1, can you explain how lvmPool is built? Is it the dm-raid1 target, or a plain raid1 device built over /dev/lowerVG? Mike, the logic in the following code doesn't change from v5.18-rc2 to v5.19, but I still can't understand why STS_IOERR is set in dm_io_complete() in case of BLK_STS_DM_REQUEUE && !__noflush_suspending(), since DMF_NOFLUSH_SUSPENDING is only set in __dm_suspend(), which is not supposed to happen in Patrick's case. dm_io_complete() ... if (io->status == BLK_STS_DM_REQUEUE) { unsigned long flags; /* * Target requested pushing back the I/O. */ spin_lock_irqsave(&md->deferred_lock, flags); if (__noflush_suspending(md) && !WARN_ON_ONCE(dm_is_zone_write(md, bio))) { /* NOTE early return due to BLK_STS_DM_REQUEUE below */ bio_list_add_head(&md->deferred, bio); } else { /* * noflush suspend was interrupted or this is * a write to a zoned target. */ io->status = BLK_STS_IOERR; } spin_unlock_irqrestore(&md->deferred_lock, flags); } thanks, Ming ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: LVM-on-LVM: error while submitting device barriers 2024-03-10 11:34 ` Ming Lei @ 2024-03-10 15:27 ` Mike Snitzer 2024-03-10 15:47 ` Ming Lei 2024-03-10 18:11 ` Patrick Plenefisch 0 siblings, 2 replies; 15+ messages in thread From: Mike Snitzer @ 2024-03-10 15:27 UTC (permalink / raw) To: Ming Lei Cc: Patrick Plenefisch, Goffredo Baroncelli, linux-kernel, Alasdair Kergon, Mikulas Patocka, Chris Mason, Josef Bacik, David Sterba, regressions, dm-devel, linux-btrfs On Sun, Mar 10 2024 at 7:34P -0400, Ming Lei <ming.lei@redhat.com> wrote: > On Sat, Mar 09, 2024 at 03:39:02PM -0500, Patrick Plenefisch wrote: > > On Wed, Mar 6, 2024 at 11:00 AM Ming Lei <ming.lei@redhat.com> wrote: > > > > > > On Tue, Mar 05, 2024 at 12:45:13PM -0500, Mike Snitzer wrote: > > > > On Thu, Feb 29 2024 at 5:05P -0500, > > > > Goffredo Baroncelli <kreijack@inwind.it> wrote: > > > > > > > > > On 29/02/2024 21.22, Patrick Plenefisch wrote: > > > > > > On Thu, Feb 29, 2024 at 2:56 PM Goffredo Baroncelli <kreijack@inwind.it> wrote: > > > > > > > > > > > > > > > Your understanding is correct. The only thing that comes to my mind to > > > > > > > > cause the problem is asymmetry of the SATA devices. I have one 8TB > > > > > > > > device, plus a 1.5TB, 3TB, and 3TB drives. Doing math on the actual > > > > > > > > extents, lowerVG/single spans (3TB+3TB), and > > > > > > > > lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have > > > > > > > > the other leg of raid1 on the 8TB drive, but my thought was that the > > > > > > > > jump across the 1.5+3TB drive gap was at least "interesting" > > > > > > > > > > > > > > > > > > > > > what about lowerVG/works ? 
> > > > > > > > > > > > > > > > > > > That one is only on two disks, it doesn't span any gaps > > > > > > > > > > Sorry, but re-reading the original email I found something that I missed before: > > > > > > > > > > > BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr > > > > > > 0, rd 0, flush 1, corrupt 0, gen 0 > > > > > > BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max > > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > > > tolerance is 0 for writable mount > > > > > > BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO > > > > > > failure (errors while submitting device barriers.) > > > > > > > > > > Looking at the code, it seems that if a FLUSH commands fails, btrfs > > > > > considers that the disk is missing. The it cannot mount RW the device. > > > > > > > > > > I would investigate with the LVM developers, if it properly passes > > > > > the flush/barrier command through all the layers, when we have an > > > > > lvm over lvm (raid1). The fact that the lvm is a raid1, is important because > > > > > a flush command to be honored has to be honored by all the > > > > > devices involved. > > > > > > Hello Patrick & Goffredo, > > > > > > I can trigger this kind of btrfs complaint by simulating one FLUSH failure. > > > > > > If you can reproduce this issue easily, please collect log by the > > > following bpftrace script, which may show where the flush failure is, > > > and maybe it can help to narrow down the issue in the whole stack. 
> > > > > > > > > #!/usr/bin/bpftrace > > > > > > #ifndef BPFTRACE_HAVE_BTF > > > #include <linux/blkdev.h> > > > #endif > > > > > > kprobe:submit_bio_noacct, > > > kprobe:submit_bio > > > / (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 / > > > { > > > $bio = (struct bio *)arg0; > > > @submit_stack[arg0] = kstack; > > > @tracked[arg0] = 1; > > > } > > > > > > kprobe:bio_endio > > > /@tracked[arg0] != 0/ > > > { > > > $bio = (struct bio *)arg0; > > > > > > if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) { > > > return; > > > } > > > > > > if ($bio->bi_status != 0) { > > > printf("dev %s bio failed %d, submitter %s completion %s\n", > > > $bio->bi_bdev->bd_disk->disk_name, > > > $bio->bi_status, @submit_stack[arg0], kstack); > > > } > > > delete(@submit_stack[arg0]); > > > delete(@tracked[arg0]); > > > } > > > > > > END { > > > clear(@submit_stack); > > > clear(@tracked); > > > } > > > > > > > Attaching 4 probes... > > dev dm-77 bio failed 10, submitter > > submit_bio_noacct+5 > > __send_duplicate_bios+358 > > __send_empty_flush+179 > > dm_submit_bio+857 > > __submit_bio+132 > > submit_bio_noacct_nocheck+345 > > write_all_supers+1718 > > btrfs_commit_transaction+2342 > > transaction_kthread+345 > > kthread+229 > > ret_from_fork+49 > > ret_from_fork_asm+27 > > completion > > bio_endio+5 > > dm_submit_bio+955 > > __submit_bio+132 > > submit_bio_noacct_nocheck+345 > > write_all_supers+1718 > > btrfs_commit_transaction+2342 > > transaction_kthread+345 > > kthread+229 > > ret_from_fork+49 > > ret_from_fork_asm+27 > > > > dev dm-86 bio failed 10, submitter > > submit_bio_noacct+5 > > write_all_supers+1718 > > btrfs_commit_transaction+2342 > > transaction_kthread+345 > > kthread+229 > > ret_from_fork+49 > > ret_from_fork_asm+27 > > completion > > bio_endio+5 > > clone_endio+295 > > clone_endio+295 > > process_one_work+369 > > worker_thread+635 > > kthread+229 > > ret_from_fork+49 > > ret_from_fork_asm+27 > > > > > > For context, 
dm-86 is /dev/lvm/brokenDisk and dm-77 is /dev/lowerVG/lvmPool > > io_status is 10(BLK_STS_IOERR), which is produced in submission code path on > /dev/dm-77(/dev/lowerVG/lvmPool) first, so looks it is one device mapper issue. > > The error should be from the following code only: > > static void __map_bio(struct bio *clone) > > ... > if (r == DM_MAPIO_KILL) > dm_io_dec_pending(io, BLK_STS_IOERR); > else > dm_io_dec_pending(io, BLK_STS_DM_REQUEUE); > break; I agree that the above bpf stack traces for dm-77 indicate that dm_submit_bio failed, which would end up in the above branch if the target's ->map() returned DM_MAPIO_KILL or DM_MAPIO_REQUEUE. But such an early failure speaks to the flush bio never being submitted to the underlying storage. No? dm-raid.c:raid_map does return DM_MAPIO_REQUEUE with: /* * If we're reshaping to add disk(s)), ti->len and * mddev->array_sectors will differ during the process * (ti->len > mddev->array_sectors), so we have to requeue * bios with addresses > mddev->array_sectors here or * there will occur accesses past EOD of the component * data images thus erroring the raid set. */ if (unlikely(bio_end_sector(bio) > mddev->array_sectors)) return DM_MAPIO_REQUEUE; But a flush doesn't have an end_sector (it'd be 0 afaik).. so it seems weird relative to a flush. > Patrick, you mentioned lvmPool is raid1, can you explain how lvmPool is > built? It is dm-raid1 target or over plain raid1 device which is > build over /dev/lowerVG? In my earlier reply I asked Patrick for both: lsblk dmsetup table Picking over the described IO stacks provided earlier (or Goffredo's interpretation of it, via ascii art) isn't really a great way to see the IO stacks that are in use/question. 
> Mike, the logic in the following code doesn't change from v5.18-rc2 to > v5.19, but I still can't understand why STS_IOERR is set in > dm_io_complete() in case of BLK_STS_DM_REQUEUE && !__noflush_suspending(), > since DMF_NOFLUSH_SUSPENDING is only set in __dm_suspend() which > is supposed to not happen in Patrick's case. > > dm_io_complete() > ... > if (io->status == BLK_STS_DM_REQUEUE) { > unsigned long flags; > /* > * Target requested pushing back the I/O. > */ > spin_lock_irqsave(&md->deferred_lock, flags); > if (__noflush_suspending(md) && > !WARN_ON_ONCE(dm_is_zone_write(md, bio))) { > /* NOTE early return due to BLK_STS_DM_REQUEUE below */ > bio_list_add_head(&md->deferred, bio); > } else { > /* > * noflush suspend was interrupted or this is > * a write to a zoned target. > */ > io->status = BLK_STS_IOERR; > } > spin_unlock_irqrestore(&md->deferred_lock, flags); > } Given the reason from dm-raid.c:raid_map returning DM_MAPIO_REQUEUE I think the DM device could be suspending without flush. But regardless, given you logged BLK_STS_IOERR lets assume it isn't, the assumption that "noflush suspend was interrupted" seems like a stale comment -- especially given that target's like dm-raid are now using DM_MAPIO_REQUEUE without concern for the historic tight-coupling of noflush suspend (which was always the case for the biggest historic reason for this code: dm-multipath, see commit 2e93ccc1933d0 from 2006 -- predates my time with developing DM). So all said, this code seems flawed for dm-raid (and possibly other targets that return DM_MAPIO_REQUEUE). I'll look closer this week. Mike ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: LVM-on-LVM: error while submitting device barriers 2024-03-10 15:27 ` Mike Snitzer @ 2024-03-10 15:47 ` Ming Lei 2024-03-10 18:11 ` Patrick Plenefisch 1 sibling, 0 replies; 15+ messages in thread From: Ming Lei @ 2024-03-10 15:47 UTC (permalink / raw) To: Mike Snitzer Cc: Patrick Plenefisch, Goffredo Baroncelli, linux-kernel, Alasdair Kergon, Mikulas Patocka, Chris Mason, Josef Bacik, David Sterba, regressions, dm-devel, linux-btrfs, Heinz Mauelshagen On Sun, Mar 10, 2024 at 11:27:22AM -0400, Mike Snitzer wrote: > On Sun, Mar 10 2024 at 7:34P -0400, > Ming Lei <ming.lei@redhat.com> wrote: > > > On Sat, Mar 09, 2024 at 03:39:02PM -0500, Patrick Plenefisch wrote: > > > On Wed, Mar 6, 2024 at 11:00 AM Ming Lei <ming.lei@redhat.com> wrote: > > > > > > > > On Tue, Mar 05, 2024 at 12:45:13PM -0500, Mike Snitzer wrote: > > > > > On Thu, Feb 29 2024 at 5:05P -0500, > > > > > Goffredo Baroncelli <kreijack@inwind.it> wrote: > > > > > > > > > > > On 29/02/2024 21.22, Patrick Plenefisch wrote: > > > > > > > On Thu, Feb 29, 2024 at 2:56 PM Goffredo Baroncelli <kreijack@inwind.it> wrote: > > > > > > > > > > > > > > > > > Your understanding is correct. The only thing that comes to my mind to > > > > > > > > > cause the problem is asymmetry of the SATA devices. I have one 8TB > > > > > > > > > device, plus a 1.5TB, 3TB, and 3TB drives. Doing math on the actual > > > > > > > > > extents, lowerVG/single spans (3TB+3TB), and > > > > > > > > > lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have > > > > > > > > > the other leg of raid1 on the 8TB drive, but my thought was that the > > > > > > > > > jump across the 1.5+3TB drive gap was at least "interesting" > > > > > > > > > > > > > > > > > > > > > > > > what about lowerVG/works ? 
> > > > > > > > > > > > > > > > > > > > > > That one is only on two disks, it doesn't span any gaps > > > > > > > > > > > > Sorry, but re-reading the original email I found something that I missed before: > > > > > > > > > > > > > BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr > > > > > > > 0, rd 0, flush 1, corrupt 0, gen 0 > > > > > > > BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max > > > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > > > > tolerance is 0 for writable mount > > > > > > > BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO > > > > > > > failure (errors while submitting device barriers.) > > > > > > > > > > > > Looking at the code, it seems that if a FLUSH commands fails, btrfs > > > > > > considers that the disk is missing. The it cannot mount RW the device. > > > > > > > > > > > > I would investigate with the LVM developers, if it properly passes > > > > > > the flush/barrier command through all the layers, when we have an > > > > > > lvm over lvm (raid1). The fact that the lvm is a raid1, is important because > > > > > > a flush command to be honored has to be honored by all the > > > > > > devices involved. > > > > > > > > Hello Patrick & Goffredo, > > > > > > > > I can trigger this kind of btrfs complaint by simulating one FLUSH failure. > > > > > > > > If you can reproduce this issue easily, please collect log by the > > > > following bpftrace script, which may show where the flush failure is, > > > > and maybe it can help to narrow down the issue in the whole stack. 
> > > > > > > > > > > > #!/usr/bin/bpftrace > > > > > > > > #ifndef BPFTRACE_HAVE_BTF > > > > #include <linux/blkdev.h> > > > > #endif > > > > > > > > kprobe:submit_bio_noacct, > > > > kprobe:submit_bio > > > > / (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 / > > > > { > > > > $bio = (struct bio *)arg0; > > > > @submit_stack[arg0] = kstack; > > > > @tracked[arg0] = 1; > > > > } > > > > > > > > kprobe:bio_endio > > > > /@tracked[arg0] != 0/ > > > > { > > > > $bio = (struct bio *)arg0; > > > > > > > > if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) { > > > > return; > > > > } > > > > > > > > if ($bio->bi_status != 0) { > > > > printf("dev %s bio failed %d, submitter %s completion %s\n", > > > > $bio->bi_bdev->bd_disk->disk_name, > > > > $bio->bi_status, @submit_stack[arg0], kstack); > > > > } > > > > delete(@submit_stack[arg0]); > > > > delete(@tracked[arg0]); > > > > } > > > > > > > > END { > > > > clear(@submit_stack); > > > > clear(@tracked); > > > > } > > > > > > > > > > Attaching 4 probes... 
> > > dev dm-77 bio failed 10, submitter > > > submit_bio_noacct+5 > > > __send_duplicate_bios+358 > > > __send_empty_flush+179 > > > dm_submit_bio+857 > > > __submit_bio+132 > > > submit_bio_noacct_nocheck+345 > > > write_all_supers+1718 > > > btrfs_commit_transaction+2342 > > > transaction_kthread+345 > > > kthread+229 > > > ret_from_fork+49 > > > ret_from_fork_asm+27 > > > completion > > > bio_endio+5 > > > dm_submit_bio+955 > > > __submit_bio+132 > > > submit_bio_noacct_nocheck+345 > > > write_all_supers+1718 > > > btrfs_commit_transaction+2342 > > > transaction_kthread+345 > > > kthread+229 > > > ret_from_fork+49 > > > ret_from_fork_asm+27 > > > > > > dev dm-86 bio failed 10, submitter > > > submit_bio_noacct+5 > > > write_all_supers+1718 > > > btrfs_commit_transaction+2342 > > > transaction_kthread+345 > > > kthread+229 > > > ret_from_fork+49 > > > ret_from_fork_asm+27 > > > completion > > > bio_endio+5 > > > clone_endio+295 > > > clone_endio+295 > > > process_one_work+369 > > > worker_thread+635 > > > kthread+229 > > > ret_from_fork+49 > > > ret_from_fork_asm+27 > > > > > > > > > For context, dm-86 is /dev/lvm/brokenDisk and dm-77 is /dev/lowerVG/lvmPool > > > > io_status is 10(BLK_STS_IOERR), which is produced in submission code path on > > /dev/dm-77(/dev/lowerVG/lvmPool) first, so looks it is one device mapper issue. > > > > The error should be from the following code only: > > > > static void __map_bio(struct bio *clone) > > > > ... > > if (r == DM_MAPIO_KILL) > > dm_io_dec_pending(io, BLK_STS_IOERR); > > else > > dm_io_dec_pending(io, BLK_STS_DM_REQUEUE); > > break; > > I agree that the above bpf stack traces for dm-77 indicate that > dm_submit_bio failed, which would end up in the above branch if the > target's ->map() returned DM_MAPIO_KILL or DM_MAPIO_REQUEUE. > > But such an early failure speaks to the flush bio never being > submitted to the underlying storage. No? 
>
> dm-raid.c:raid_map does return DM_MAPIO_REQUEUE with:
>
>         /*
>          * If we're reshaping to add disk(s)), ti->len and
>          * mddev->array_sectors will differ during the process
>          * (ti->len > mddev->array_sectors), so we have to requeue
>          * bios with addresses > mddev->array_sectors here or
>          * there will occur accesses past EOD of the component
>          * data images thus erroring the raid set.
>          */
>         if (unlikely(bio_end_sector(bio) > mddev->array_sectors))
>                 return DM_MAPIO_REQUEUE;
>
> But a flush doesn't have an end_sector (it'd be 0 afaik), so it seems
> weird relative to a flush.

Yeah, I also found the above weird, since DM_MAPIO_REQUEUE is supposed to
work only together with noflush suspend, see commit 2e93ccc1933d ("[PATCH]
dm: suspend: add noflush pushback"), which I think you already mentioned.

If that is the reason, maybe the following change can make a difference:

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 5e41fbae3f6b..07af18baa8dd 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3331,7 +3331,7 @@ static int raid_map(struct dm_target *ti, struct bio *bio)
 	 * there will occur accesses past EOD of the component
 	 * data images thus erroring the raid set.
 	 */
-	if (unlikely(bio_end_sector(bio) > mddev->array_sectors))
+	if (unlikely(bio_has_data(bio) && bio_end_sector(bio) > mddev->array_sectors))
 		return DM_MAPIO_REQUEUE;
 
 	md_handle_request(mddev, bio);

> > Patrick, you mentioned lvmPool is raid1, can you explain how lvmPool is
> > built? Is it a dm-raid1 target, or a plain raid1 device built over
> > /dev/lowerVG?
>
> In my earlier reply I asked Patrick for both:
>   lsblk
>   dmsetup table
>
> Picking over the described IO stacks provided earlier (or Goffredo's
> interpretation of them, via ASCII art) isn't really a great way to see
> the IO stacks that are in use/question.
>
> > Mike, the logic in the following code doesn't change from v5.18-rc2 to
> > v5.19, but I still can't understand why BLK_STS_IOERR is set in
> > dm_io_complete() in the case of BLK_STS_DM_REQUEUE && !__noflush_suspending(),
> > since DMF_NOFLUSH_SUSPENDING is only set in __dm_suspend(), which
> > is not supposed to happen in Patrick's case.
> >
> > dm_io_complete()
> > ...
> >         if (io->status == BLK_STS_DM_REQUEUE) {
> >                 unsigned long flags;
> >                 /*
> >                  * Target requested pushing back the I/O.
> >                  */
> >                 spin_lock_irqsave(&md->deferred_lock, flags);
> >                 if (__noflush_suspending(md) &&
> >                     !WARN_ON_ONCE(dm_is_zone_write(md, bio))) {
> >                         /* NOTE early return due to BLK_STS_DM_REQUEUE below */
> >                         bio_list_add_head(&md->deferred, bio);
> >                 } else {
> >                         /*
> >                          * noflush suspend was interrupted or this is
> >                          * a write to a zoned target.
> >                          */
> >                         io->status = BLK_STS_IOERR;
> >                 }
> >                 spin_unlock_irqrestore(&md->deferred_lock, flags);
> >         }
>
> Given the reason for dm-raid.c:raid_map returning DM_MAPIO_REQUEUE,
> I think the DM device could be suspending without flush.
>
> But regardless, given you logged BLK_STS_IOERR let's assume it isn't;
> the assumption that "noflush suspend was interrupted" seems like a
> stale comment -- especially given that targets like dm-raid are now
> using DM_MAPIO_REQUEUE without concern for the historic tight coupling
> with noflush suspend (which was always the case for the biggest historic
> reason for this code: dm-multipath; see commit 2e93ccc1933d0 from
> 2006 -- it predates my time developing DM).
>
> So, all said, this code seems flawed for dm-raid (and possibly other
> targets that return DM_MAPIO_REQUEUE). I'll look closer this week.

Agreed. The check was added in 9dbd1aa3a81c ("dm raid: add reshaping
support to the target"), so looping Heinz in.

Thanks,
Ming

^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: LVM-on-LVM: error while submitting device barriers 2024-03-10 15:27 ` Mike Snitzer 2024-03-10 15:47 ` Ming Lei @ 2024-03-10 18:11 ` Patrick Plenefisch 2024-03-11 13:13 ` Ming Lei 1 sibling, 1 reply; 15+ messages in thread From: Patrick Plenefisch @ 2024-03-10 18:11 UTC (permalink / raw) To: Mike Snitzer Cc: Ming Lei, Goffredo Baroncelli, linux-kernel, Alasdair Kergon, Mikulas Patocka, Chris Mason, Josef Bacik, David Sterba, regressions, dm-devel, linux-btrfs On Sun, Mar 10, 2024 at 11:27 AM Mike Snitzer <snitzer@kernel.org> wrote: > > On Sun, Mar 10 2024 at 7:34P -0400, > Ming Lei <ming.lei@redhat.com> wrote: > > > On Sat, Mar 09, 2024 at 03:39:02PM -0500, Patrick Plenefisch wrote: > > > On Wed, Mar 6, 2024 at 11:00 AM Ming Lei <ming.lei@redhat.com> wrote: > > > > > > > > #!/usr/bin/bpftrace > > > > > > > > #ifndef BPFTRACE_HAVE_BTF > > > > #include <linux/blkdev.h> > > > > #endif > > > > > > > > kprobe:submit_bio_noacct, > > > > kprobe:submit_bio > > > > / (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 / > > > > { > > > > $bio = (struct bio *)arg0; > > > > @submit_stack[arg0] = kstack; > > > > @tracked[arg0] = 1; > > > > } > > > > > > > > kprobe:bio_endio > > > > /@tracked[arg0] != 0/ > > > > { > > > > $bio = (struct bio *)arg0; > > > > > > > > if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) { > > > > return; > > > > } > > > > > > > > if ($bio->bi_status != 0) { > > > > printf("dev %s bio failed %d, submitter %s completion %s\n", > > > > $bio->bi_bdev->bd_disk->disk_name, > > > > $bio->bi_status, @submit_stack[arg0], kstack); > > > > } > > > > delete(@submit_stack[arg0]); > > > > delete(@tracked[arg0]); > > > > } > > > > > > > > END { > > > > clear(@submit_stack); > > > > clear(@tracked); > > > > } > > > > > > > > > > Attaching 4 probes... 
> > > dev dm-77 bio failed 10, submitter > > > submit_bio_noacct+5 > > > __send_duplicate_bios+358 > > > __send_empty_flush+179 > > > dm_submit_bio+857 > > > __submit_bio+132 > > > submit_bio_noacct_nocheck+345 > > > write_all_supers+1718 > > > btrfs_commit_transaction+2342 > > > transaction_kthread+345 > > > kthread+229 > > > ret_from_fork+49 > > > ret_from_fork_asm+27 > > > completion > > > bio_endio+5 > > > dm_submit_bio+955 > > > __submit_bio+132 > > > submit_bio_noacct_nocheck+345 > > > write_all_supers+1718 > > > btrfs_commit_transaction+2342 > > > transaction_kthread+345 > > > kthread+229 > > > ret_from_fork+49 > > > ret_from_fork_asm+27 > > > > > > dev dm-86 bio failed 10, submitter > > > submit_bio_noacct+5 > > > write_all_supers+1718 > > > btrfs_commit_transaction+2342 > > > transaction_kthread+345 > > > kthread+229 > > > ret_from_fork+49 > > > ret_from_fork_asm+27 > > > completion > > > bio_endio+5 > > > clone_endio+295 > > > clone_endio+295 > > > process_one_work+369 > > > worker_thread+635 > > > kthread+229 > > > ret_from_fork+49 > > > ret_from_fork_asm+27 > > > > > > > > > For context, dm-86 is /dev/lvm/brokenDisk and dm-77 is /dev/lowerVG/lvmPool > > > > io_status is 10(BLK_STS_IOERR), which is produced in submission code path on > > /dev/dm-77(/dev/lowerVG/lvmPool) first, so looks it is one device mapper issue. > > > > The error should be from the following code only: > > > > static void __map_bio(struct bio *clone) > > > > ... > > if (r == DM_MAPIO_KILL) > > dm_io_dec_pending(io, BLK_STS_IOERR); > > else > > dm_io_dec_pending(io, BLK_STS_DM_REQUEUE); > > break; > > I agree that the above bpf stack traces for dm-77 indicate that > dm_submit_bio failed, which would end up in the above branch if the > target's ->map() returned DM_MAPIO_KILL or DM_MAPIO_REQUEUE. > > But such an early failure speaks to the flush bio never being > submitted to the underlying storage. No? 
>
> dm-raid.c:raid_map does return DM_MAPIO_REQUEUE with:
>
>         /*
>          * If we're reshaping to add disk(s)), ti->len and
>          * mddev->array_sectors will differ during the process
>          * (ti->len > mddev->array_sectors), so we have to requeue
>          * bios with addresses > mddev->array_sectors here or
>          * there will occur accesses past EOD of the component
>          * data images thus erroring the raid set.
>          */
>         if (unlikely(bio_end_sector(bio) > mddev->array_sectors))
>                 return DM_MAPIO_REQUEUE;
>
> But a flush doesn't have an end_sector (it'd be 0 afaik), so it seems
> weird relative to a flush.
>
> > Patrick, you mentioned lvmPool is raid1, can you explain how lvmPool is
> > built? Is it a dm-raid1 target, or a plain raid1 device built over
> > /dev/lowerVG?

LVM raid1:
lvcreate --type raid1 -m 1 ...

I had previously added raidintegrity and caching, like "lowerVG/single"
has, but I removed them while trying to root-cause this bug.

> In my earlier reply I asked Patrick for both:
>   lsblk
>   dmsetup table

Oops, here they are, trimmed for relevance:

NAME
sdb
└─sdb2
  ├─lowerVG-single_corig_rmeta_1
  │ └─lowerVG-single_corig
  │   └─lowerVG-single
  ├─lowerVG-single_corig_rimage_1_imeta
  │ └─lowerVG-single_corig_rimage_1
  │   └─lowerVG-single_corig
  │     └─lowerVG-single
  ├─lowerVG-single_corig_rimage_1_iorig
  │ └─lowerVG-single_corig_rimage_1
  │   └─lowerVG-single_corig
  │     └─lowerVG-single
  ├─lowerVG-lvmPool_rmeta_0
  │ └─lowerVG-lvmPool
  │   ├─lvm-a
  │   └─lvm-brokenDisk
  ├─lowerVG-lvmPool_rimage_0
  │ └─lowerVG-lvmPool
  │   ├─lvm-a
  │   └─lvm-brokenDisk
sdc
└─sdc3
  ├─lowerVG-single_corig_rmeta_0
  │ └─lowerVG-single_corig
  │   └─lowerVG-single
  ├─lowerVG-single_corig_rimage_0_imeta
  │ └─lowerVG-single_corig_rimage_0
  │   └─lowerVG-single_corig
  │     └─lowerVG-single
  ├─lowerVG-single_corig_rimage_0_iorig
  │ └─lowerVG-single_corig_rimage_0
  │   └─lowerVG-single_corig
  │     └─lowerVG-single
sdd
└─sdd3
  ├─lowerVG-lvmPool_rmeta_1
  │ └─lowerVG-lvmPool
  │   ├─lvm-a
  │   └─lvm-brokenDisk
  └─lowerVG-lvmPool_rimage_1
    └─lowerVG-lvmPool
      ├─lvm-a
      └─lvm-brokenDisk
sdf
├─sdf2
│ ├─lowerVG-lvmPool_rimage_1
│ │ └─lowerVG-lvmPool
│ │   ├─lvm-a
│ │   └─lvm-brokenDisk

lowerVG-single: 0 5583462400 cache 254:32 254:31 254:71 128 2 metadata2 writethrough mq 0
lowerVG-singleCache_cvol: 0 104857600 linear 259:13 104859648
lowerVG-singleCache_cvol-cdata: 0 104775680 linear 254:30 81920
lowerVG-singleCache_cvol-cmeta: 0 81920 linear 254:30 0
lowerVG-single_corig: 0 5583462400 raid raid1 3 0 region_size 4096 2 254:33 254:36 254:67 254:70
lowerVG-single_corig_rimage_0: 0 5583462400 integrity 254:35 0 4 J 8 meta_device:254:34 recalculate journal_sectors:130944 interleave_sectors:1 buffer_sectors:128 journal_watermark:50 commit_time:10000 internal_hash:crc32c
lowerVG-single_corig_rimage_0_imeta: 0 44802048 linear 8:35 5152466944
lowerVG-single_corig_rimage_0_iorig: 0 4724465664 linear 8:35 427821056
lowerVG-single_corig_rimage_0_iorig: 4724465664 431005696 linear 8:35 5362001920
lowerVG-single_corig_rimage_0_iorig: 5155471360 427819008 linear 8:35 2048
lowerVG-single_corig_rimage_0_iorig: 5583290368 172032 linear 8:35 5152294912
lowerVG-single_corig_rimage_1: 0 5583462400 integrity 254:69 0 4 J 8 meta_device:254:68 recalculate journal_sectors:130944 interleave_sectors:1 buffer_sectors:128 journal_watermark:50 commit_time:10000 internal_hash:crc32c
lowerVG-single_corig_rimage_1_imeta: 0 44802048 linear 8:18 5583472640
lowerVG-single_corig_rimage_1_iorig: 0 5583462400 linear 8:18 10240
lowerVG-single_corig_rmeta_0: 0 8192 linear 8:35 5152286720
lowerVG-single_corig_rmeta_1: 0 8192 linear 8:18 2048
lowerVG-lvmPool: 0 6442450944 raid raid1 3 0 region_size 4096 2 254:73 254:74 254:75 254:76
lowerVG-lvmPool_rimage_0: 0 2967117824 linear 8:18 5628282880
lowerVG-lvmPool_rimage_0: 2967117824 59875328 linear 8:18 12070733824
lowerVG-lvmPool_rimage_0: 3026993152 3415457792 linear 8:18 8655276032
lowerVG-lvmPool_rimage_1: 0 2862260224 linear 8:51 10240
lowerVG-lvmPool_rimage_1: 2862260224 164732928 linear 8:82 3415459840
lowerVG-lvmPool_rimage_1: 3026993152 3415457792 linear 8:82 2048
lowerVG-lvmPool_rmeta_0: 0 8192 linear 8:18 5628274688
lowerVG-lvmPool_rmeta_1: 0 8192 linear 8:51 2048
lvm-a: 0 1468006400 linear 254:77 1310722048
lvm-brokenDisk: 0 1310720000 linear 254:77 2048
lvm-brokenDisk: 1310720000 83886080 linear 254:77 2778728448
lvm-brokenDisk: 1394606080 2015404032 linear 254:77 4427040768
lvm-brokenDisk: 3410010112 884957184 linear 254:77 2883586048

As a side note, is there a way to make lsblk show each device only the
first time it comes up?

> Picking over the described IO stacks provided earlier (or Goffredo's
> interpretation of them, via ASCII art) isn't really a great way to see
> the IO stacks that are in use/question.
>
> > Mike, the logic in the following code doesn't change from v5.18-rc2 to
> > v5.19, but I still can't understand why BLK_STS_IOERR is set in
> > dm_io_complete() in the case of BLK_STS_DM_REQUEUE && !__noflush_suspending(),
> > since DMF_NOFLUSH_SUSPENDING is only set in __dm_suspend(), which
> > is not supposed to happen in Patrick's case.
> >
> > dm_io_complete()
> > ...
> >         if (io->status == BLK_STS_DM_REQUEUE) {
> >                 unsigned long flags;
> >                 /*
> >                  * Target requested pushing back the I/O.
> >                  */
> >                 spin_lock_irqsave(&md->deferred_lock, flags);
> >                 if (__noflush_suspending(md) &&
> >                     !WARN_ON_ONCE(dm_is_zone_write(md, bio))) {
> >                         /* NOTE early return due to BLK_STS_DM_REQUEUE below */
> >                         bio_list_add_head(&md->deferred, bio);
> >                 } else {
> >                         /*
> >                          * noflush suspend was interrupted or this is
> >                          * a write to a zoned target.
> >                          */
> >                         io->status = BLK_STS_IOERR;
> >                 }
> >                 spin_unlock_irqrestore(&md->deferred_lock, flags);
> >         }
>
> Given the reason for dm-raid.c:raid_map returning DM_MAPIO_REQUEUE,
> I think the DM device could be suspending without flush.
>
> But regardless, given you logged BLK_STS_IOERR let's assume it isn't;
> the assumption that "noflush suspend was interrupted" seems like a
> stale comment -- especially given that targets like dm-raid are now
> using DM_MAPIO_REQUEUE without concern for the historic tight coupling
> with noflush suspend (which was always the case for the biggest historic
> reason for this code: dm-multipath; see commit 2e93ccc1933d0 from
> 2006 -- it predates my time developing DM).
>
> So, all said, this code seems flawed for dm-raid (and possibly other
> targets that return DM_MAPIO_REQUEUE). I'll look closer this week.
>
> Mike

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: LVM-on-LVM: error while submitting device barriers 2024-03-10 18:11 ` Patrick Plenefisch @ 2024-03-11 13:13 ` Ming Lei 2024-03-12 22:54 ` Patrick Plenefisch 0 siblings, 1 reply; 15+ messages in thread From: Ming Lei @ 2024-03-11 13:13 UTC (permalink / raw) To: Patrick Plenefisch Cc: Mike Snitzer, Goffredo Baroncelli, linux-kernel, Alasdair Kergon, Mikulas Patocka, Chris Mason, Josef Bacik, David Sterba, regressions, dm-devel, linux-btrfs, ming.lei On Sun, Mar 10, 2024 at 02:11:11PM -0400, Patrick Plenefisch wrote: > On Sun, Mar 10, 2024 at 11:27 AM Mike Snitzer <snitzer@kernel.org> wrote: > > > > On Sun, Mar 10 2024 at 7:34P -0400, > > Ming Lei <ming.lei@redhat.com> wrote: > > > > > On Sat, Mar 09, 2024 at 03:39:02PM -0500, Patrick Plenefisch wrote: > > > > On Wed, Mar 6, 2024 at 11:00 AM Ming Lei <ming.lei@redhat.com> wrote: > > > > > > > > > > #!/usr/bin/bpftrace > > > > > > > > > > #ifndef BPFTRACE_HAVE_BTF > > > > > #include <linux/blkdev.h> > > > > > #endif > > > > > > > > > > kprobe:submit_bio_noacct, > > > > > kprobe:submit_bio > > > > > / (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 / > > > > > { > > > > > $bio = (struct bio *)arg0; > > > > > @submit_stack[arg0] = kstack; > > > > > @tracked[arg0] = 1; > > > > > } > > > > > > > > > > kprobe:bio_endio > > > > > /@tracked[arg0] != 0/ > > > > > { > > > > > $bio = (struct bio *)arg0; > > > > > > > > > > if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) { > > > > > return; > > > > > } > > > > > > > > > > if ($bio->bi_status != 0) { > > > > > printf("dev %s bio failed %d, submitter %s completion %s\n", > > > > > $bio->bi_bdev->bd_disk->disk_name, > > > > > $bio->bi_status, @submit_stack[arg0], kstack); > > > > > } > > > > > delete(@submit_stack[arg0]); > > > > > delete(@tracked[arg0]); > > > > > } > > > > > > > > > > END { > > > > > clear(@submit_stack); > > > > > clear(@tracked); > > > > > } > > > > > > > > > > > > > Attaching 4 probes... 
> > > > dev dm-77 bio failed 10, submitter > > > > submit_bio_noacct+5 > > > > __send_duplicate_bios+358 > > > > __send_empty_flush+179 > > > > dm_submit_bio+857 > > > > __submit_bio+132 > > > > submit_bio_noacct_nocheck+345 > > > > write_all_supers+1718 > > > > btrfs_commit_transaction+2342 > > > > transaction_kthread+345 > > > > kthread+229 > > > > ret_from_fork+49 > > > > ret_from_fork_asm+27 > > > > completion > > > > bio_endio+5 > > > > dm_submit_bio+955 > > > > __submit_bio+132 > > > > submit_bio_noacct_nocheck+345 > > > > write_all_supers+1718 > > > > btrfs_commit_transaction+2342 > > > > transaction_kthread+345 > > > > kthread+229 > > > > ret_from_fork+49 > > > > ret_from_fork_asm+27 > > > > > > > > dev dm-86 bio failed 10, submitter > > > > submit_bio_noacct+5 > > > > write_all_supers+1718 > > > > btrfs_commit_transaction+2342 > > > > transaction_kthread+345 > > > > kthread+229 > > > > ret_from_fork+49 > > > > ret_from_fork_asm+27 > > > > completion > > > > bio_endio+5 > > > > clone_endio+295 > > > > clone_endio+295 > > > > process_one_work+369 > > > > worker_thread+635 > > > > kthread+229 > > > > ret_from_fork+49 > > > > ret_from_fork_asm+27 > > > > > > > > > > > > For context, dm-86 is /dev/lvm/brokenDisk and dm-77 is /dev/lowerVG/lvmPool > > > > > > io_status is 10(BLK_STS_IOERR), which is produced in submission code path on > > > /dev/dm-77(/dev/lowerVG/lvmPool) first, so looks it is one device mapper issue. > > > > > > The error should be from the following code only: > > > > > > static void __map_bio(struct bio *clone) > > > > > > ... > > > if (r == DM_MAPIO_KILL) > > > dm_io_dec_pending(io, BLK_STS_IOERR); > > > else > > > dm_io_dec_pending(io, BLK_STS_DM_REQUEUE); > > > break; > > > > I agree that the above bpf stack traces for dm-77 indicate that > > dm_submit_bio failed, which would end up in the above branch if the > > target's ->map() returned DM_MAPIO_KILL or DM_MAPIO_REQUEUE. 
> >
> > But such an early failure speaks to the flush bio never being
> > submitted to the underlying storage. No?
> >
> > dm-raid.c:raid_map does return DM_MAPIO_REQUEUE with:
> >
> >         /*
> >          * If we're reshaping to add disk(s)), ti->len and
> >          * mddev->array_sectors will differ during the process
> >          * (ti->len > mddev->array_sectors), so we have to requeue
> >          * bios with addresses > mddev->array_sectors here or
> >          * there will occur accesses past EOD of the component
> >          * data images thus erroring the raid set.
> >          */
> >         if (unlikely(bio_end_sector(bio) > mddev->array_sectors))
> >                 return DM_MAPIO_REQUEUE;
> >
> > But a flush doesn't have an end_sector (it'd be 0 afaik), so it seems
> > weird relative to a flush.
> >
> > > Patrick, you mentioned lvmPool is raid1, can you explain how lvmPool is
> > > built? Is it a dm-raid1 target, or a plain raid1 device built over
> > > /dev/lowerVG?
>
> LVM raid1:
> lvcreate --type raid1 -m 1 ...

OK, that is the reason, as Mike mentioned.

dm-raid.c:raid_map returns DM_MAPIO_REQUEUE, which is translated into
BLK_STS_IOERR in dm_io_complete().

The empty flush bio is sent from btrfs with both .bi_size and .bi_sector
set to zero, but the top dm device is linear, so linear_map() remaps
bio->bi_iter.bi_sector to a new sector. The remapped bio is then sent to
dm-raid (raid_map()), which returns DM_MAPIO_REQUEUE.

The one-line patch I sent in my last email should solve this issue:

https://lore.kernel.org/dm-devel/a783e5ed-db56-4100-956a-353170b1b7ed@inwind.it/T/#m8fce3ecb2f98370b7d7ce8db6714bbf644af5459

But the DM_MAPIO_REQUEUE misuse needs a closer look, and I believe Mike is
working on that bigger problem.

I guess most dm targets don't deal with empty bios well, at least linear
and dm-raid; I haven't looked into the others yet. :-(

Thanks,
Ming

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: LVM-on-LVM: error while submitting device barriers 2024-03-11 13:13 ` Ming Lei @ 2024-03-12 22:54 ` Patrick Plenefisch 0 siblings, 0 replies; 15+ messages in thread From: Patrick Plenefisch @ 2024-03-12 22:54 UTC (permalink / raw) To: Ming Lei Cc: Mike Snitzer, Goffredo Baroncelli, linux-kernel, Alasdair Kergon, Mikulas Patocka, Chris Mason, Josef Bacik, David Sterba, regressions, dm-devel, linux-btrfs On Mon, Mar 11, 2024 at 9:13 AM Ming Lei <ming.lei@redhat.com> wrote: > > On Sun, Mar 10, 2024 at 02:11:11PM -0400, Patrick Plenefisch wrote: > > On Sun, Mar 10, 2024 at 11:27 AM Mike Snitzer <snitzer@kernel.org> wrote: > > > > > > On Sun, Mar 10 2024 at 7:34P -0400, > > > Ming Lei <ming.lei@redhat.com> wrote: > > > > > > > On Sat, Mar 09, 2024 at 03:39:02PM -0500, Patrick Plenefisch wrote: > > > > > On Wed, Mar 6, 2024 at 11:00 AM Ming Lei <ming.lei@redhat.com> wrote: > > > > > > > > > > > > #!/usr/bin/bpftrace > > > > > > > > > > > > #ifndef BPFTRACE_HAVE_BTF > > > > > > #include <linux/blkdev.h> > > > > > > #endif > > > > > > > > > > > > kprobe:submit_bio_noacct, > > > > > > kprobe:submit_bio > > > > > > / (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 / > > > > > > { > > > > > > $bio = (struct bio *)arg0; > > > > > > @submit_stack[arg0] = kstack; > > > > > > @tracked[arg0] = 1; > > > > > > } > > > > > > > > > > > > kprobe:bio_endio > > > > > > /@tracked[arg0] != 0/ > > > > > > { > > > > > > $bio = (struct bio *)arg0; > > > > > > > > > > > > if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) { > > > > > > return; > > > > > > } > > > > > > > > > > > > if ($bio->bi_status != 0) { > > > > > > printf("dev %s bio failed %d, submitter %s completion %s\n", > > > > > > $bio->bi_bdev->bd_disk->disk_name, > > > > > > $bio->bi_status, @submit_stack[arg0], kstack); > > > > > > } > > > > > > delete(@submit_stack[arg0]); > > > > > > delete(@tracked[arg0]); > > > > > > } > > > > > > > > > > > > END { > > > > > > clear(@submit_stack); 
> > > > > > clear(@tracked); > > > > > > } > > > > > > > > > > > > > > > > Attaching 4 probes... > > > > > dev dm-77 bio failed 10, submitter > > > > > submit_bio_noacct+5 > > > > > __send_duplicate_bios+358 > > > > > __send_empty_flush+179 > > > > > dm_submit_bio+857 > > > > > __submit_bio+132 > > > > > submit_bio_noacct_nocheck+345 > > > > > write_all_supers+1718 > > > > > btrfs_commit_transaction+2342 > > > > > transaction_kthread+345 > > > > > kthread+229 > > > > > ret_from_fork+49 > > > > > ret_from_fork_asm+27 > > > > > completion > > > > > bio_endio+5 > > > > > dm_submit_bio+955 > > > > > __submit_bio+132 > > > > > submit_bio_noacct_nocheck+345 > > > > > write_all_supers+1718 > > > > > btrfs_commit_transaction+2342 > > > > > transaction_kthread+345 > > > > > kthread+229 > > > > > ret_from_fork+49 > > > > > ret_from_fork_asm+27 > > > > > > > > > > dev dm-86 bio failed 10, submitter > > > > > submit_bio_noacct+5 > > > > > write_all_supers+1718 > > > > > btrfs_commit_transaction+2342 > > > > > transaction_kthread+345 > > > > > kthread+229 > > > > > ret_from_fork+49 > > > > > ret_from_fork_asm+27 > > > > > completion > > > > > bio_endio+5 > > > > > clone_endio+295 > > > > > clone_endio+295 > > > > > process_one_work+369 > > > > > worker_thread+635 > > > > > kthread+229 > > > > > ret_from_fork+49 > > > > > ret_from_fork_asm+27 > > > > > > > > > > > > > > > For context, dm-86 is /dev/lvm/brokenDisk and dm-77 is /dev/lowerVG/lvmPool > > > > > > > > io_status is 10(BLK_STS_IOERR), which is produced in submission code path on > > > > /dev/dm-77(/dev/lowerVG/lvmPool) first, so looks it is one device mapper issue. > > > > > > > > The error should be from the following code only: > > > > > > > > static void __map_bio(struct bio *clone) > > > > > > > > ... 
> > > > if (r == DM_MAPIO_KILL) > > > > dm_io_dec_pending(io, BLK_STS_IOERR); > > > > else > > > > dm_io_dec_pending(io, BLK_STS_DM_REQUEUE); > > > > break; > > > > > > I agree that the above bpf stack traces for dm-77 indicate that > > > dm_submit_bio failed, which would end up in the above branch if the > > > target's ->map() returned DM_MAPIO_KILL or DM_MAPIO_REQUEUE. > > > > > > But such an early failure speaks to the flush bio never being > > > submitted to the underlying storage. No? > > > > > > dm-raid.c:raid_map does return DM_MAPIO_REQUEUE with: > > > > > > /* > > > * If we're reshaping to add disk(s)), ti->len and > > > * mddev->array_sectors will differ during the process > > > * (ti->len > mddev->array_sectors), so we have to requeue > > > * bios with addresses > mddev->array_sectors here or > > > * there will occur accesses past EOD of the component > > > * data images thus erroring the raid set. > > > */ > > > if (unlikely(bio_end_sector(bio) > mddev->array_sectors)) > > > return DM_MAPIO_REQUEUE; > > > > > > But a flush doesn't have an end_sector (it'd be 0 afaik).. so it seems > > > weird relative to a flush. > > > > > > > Patrick, you mentioned lvmPool is raid1, can you explain how lvmPool is > > > > built? It is dm-raid1 target or over plain raid1 device which is > > > > build over /dev/lowerVG? > > > > LVM raid1: > > lvcreate --type raid1 -m 1 ... > > OK, that is the reason, as Mike mentioned. > > dm-raid.c:raid_map returns DM_MAPIO_REQUEUE, which is translated into > BLK_STS_IOERR in dm_io_complete(). > > Empty flush bio is sent from btrfs, both .bi_size and .bi_sector are set > as zero, but the top dm is linear, which(linear_map()) maps new > bio->bi_iter.bi_sector, and the mapped bio is sent to dm-raid(raid_map()), > then DM_MAPIO_REQUEUE is returned. > > The one-line patch I sent in last email should solve this issue. 
>
> https://lore.kernel.org/dm-devel/a783e5ed-db56-4100-956a-353170b1b7ed@inwind.it/T/#m8fce3ecb2f98370b7d7ce8db6714bbf644af5459

With this patch on a 6.6.13 base, I can modify files and the BTRFS volume
stays RW, and no errors are logged in dmesg!

>
> But the DM_MAPIO_REQUEUE misuse needs a closer look, and I believe Mike
> is working on that bigger problem.
>
> I guess most dm targets don't deal with empty bios well, at least linear
> and dm-raid; I haven't looked into the others yet. :-(
>
>
> Thanks,
> Ming
>

^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2024-03-12 22:55 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <CAOCpoWc_HQy4UJzTi9pqtJdO740Wx5Yd702O-mwXBE6RVBX1Eg@mail.gmail.com>
[not found] ` <CAOCpoWf3TSQkUUo-qsj0LVEOm-kY0hXdmttLE82Ytc0hjpTSPw@mail.gmail.com>
2024-02-28 17:25 ` [REGRESSION] LVM-on-LVM: error while submitting device barriers Patrick Plenefisch
2024-02-28 19:19 ` Goffredo Baroncelli
2024-02-28 19:37 ` Patrick Plenefisch
2024-02-29 19:56 ` Goffredo Baroncelli
2024-02-29 20:22 ` Patrick Plenefisch
2024-02-29 22:05 ` Goffredo Baroncelli
2024-03-05 17:45 ` Mike Snitzer
2024-03-06 15:59 ` Ming Lei
2024-03-09 20:39 ` Patrick Plenefisch
2024-03-10 11:34 ` Ming Lei
2024-03-10 15:27 ` Mike Snitzer
2024-03-10 15:47 ` Ming Lei
2024-03-10 18:11 ` Patrick Plenefisch
2024-03-11 13:13 ` Ming Lei
2024-03-12 22:54 ` Patrick Plenefisch
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox