* Regression v6.11 booting cannot mount harddisks (xfs)
From: Jesper Dangaard Brouer @ 2024-09-10 12:19 UTC
To: Linus Torvalds, LKML
Cc: Netdev, Jens Axboe, linux-ide, dlemoal, cassel, handan.babu,
djwong, Linux-XFS, hdegoede, David S. Miller, Jakub Kicinski,
kernel-team
Hi Linus,
My testlab kernel devel server isn't booting correctly on v6.11 branches
(e.g. net-next at 6.11.0-rc5)
I just confirmed this also happens on your tree tag: v6.11-rc7.
The symptom/issue is that harddisk dev names (e.g. /dev/sda, /dev/sdb,
/dev/sdc) get reordered. I switched /etc/fstab to use UUIDs instead
(which boots on v6.10) but on 6.11 it still cannot mount harddisks and
doesn't fully boot.
E.g. errors:
systemd[1]: Expecting device dev-disk-by\x2duuid-0c2b348d\x2de013\x2d482b\x2da91c\x2d029640ec427a.device - /dev/disk/by-uuid/0c2b348d-e013-482b-a91c-029640ec427a...
[DEPEND] Dependency failed for var-lib.mount - /var/lib.
[...]
[ TIME ] Timed out waiting for device dev-d…499e46-b40d-4067-afd4-5f6ad09fcff2.
[DEPEND] Dependency failed for boot.mount - /boot.
That corresponds to fstab's:
- UUID=8b499e46-b40d-4067-afd4-5f6ad09fcff2 /boot xfs defaults 0 0
- UUID=0c2b348d-e013-482b-a91c-029640ec427a /var/lib/ xfs defaults 0 0
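For readers unfamiliar with the unit names in the errors above: systemd derives a `.device` unit name from a device path by escaping it. The following is a minimal Python sketch of that escaping, written as an approximation of what `systemd-escape --path` does; the function names are illustrative and not part of any systemd API.

```python
# Approximate systemd path escaping (illustrative helper names, not a systemd API).
_ALLOWED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789:_."
)


def systemd_escape_path(path: str) -> str:
    """Escape an absolute path the way systemd names path-based units."""
    # Leading, trailing and repeated slashes are dropped first.
    parts = [part for part in path.split("/") if part]
    if not parts:
        return "-"  # the root directory is special-cased
    trimmed = "/".join(parts)
    out = []
    for index, byte in enumerate(trimmed.encode("utf-8")):
        char = chr(byte)
        if char == "/":
            out.append("-")  # path separators become dashes
        elif char in _ALLOWED and not (index == 0 and char == "."):
            out.append(char)
        else:
            out.append("\\x%02x" % byte)  # everything else, including "-", is hex-escaped
    return "".join(out)


def device_unit_name(path: str) -> str:
    """Name of the .device unit systemd waits for when mounting `path`."""
    return systemd_escape_path(path) + ".device"
```

Applied to the `/var/lib/` UUID path from fstab, this yields the same `dev-disk-by\x2duuid-...` name that appears in the "Expecting device" message, which is why a literal dash in the UUID shows up as `\x2d`.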
It looks like disk controller initialization happens in *parallel* on
these newer kernels as dmesg shows init printk's overlapping:
[ 5.683393] scsi 5:0:0:0: Direct-Access ATA SAMSUNG MZ7KM120 003Q PQ: 0 ANSI: 5
[ 5.683641] scsi 7:0:0:0: Direct-Access ATA SAMSUNG MZ7KM120 003Q PQ: 0 ANSI: 5
[ 5.683797] scsi 8:0:0:0: Direct-Access ATA Samsung SSD 840 BB0Q PQ: 0 ANSI: 5
[...]
[ 7.057376] sd 5:0:0:0: [sda] 234441648 512-byte logical blocks: (120 GB/112 GiB)
[ 7.062279] sd 7:0:0:0: [sdb] 234441648 512-byte logical blocks: (120 GB/112 GiB)
[ 7.070628] sd 5:0:0:0: [sda] Write Protect is off
[ 7.070701] sd 8:0:0:0: [sdc] 488397168 512-byte logical blocks: (250 GB/233 GiB)
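When comparing boots, it can help to extract which SCSI address ended up with which sdX name. A small Python sketch follows, assuming dmesg lines shaped like the `sd H:C:T:L: [sdX]` lines quoted above; the helper name is made up for illustration.

```python
import re

# Matches e.g. "sd 5:0:0:0: [sda]" in a dmesg line.
_SD_LINE = re.compile(r"\bsd (\d+:\d+:\d+:\d+): \[(sd[a-z]+)\]")


def scsi_to_disk_names(dmesg_text: str) -> dict:
    """Map each SCSI address (host:channel:target:lun) to its sdX name."""
    mapping = {}
    for match in _SD_LINE.finditer(dmesg_text):
        address, name = match.groups()
        # Keep the first name seen for an address; later lines repeat it.
        mapping.setdefault(address, name)
    return mapping
```

Running this on the dmesg output of a good and a bad boot and diffing the two dictionaries shows whether names moved between controllers, independent of printk interleaving.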
Perhaps this could be a hint to what changed?
Any hints what commit I should try to test revert?
Or good starting point for bisecting?
--Jesper
Extra system info that might be relevant:
00:11.4 SATA controller: Intel Corporation C610/X99 series chipset sSATA Controller [AHCI mode] (rev 05) (prog-if 01 [AHCI 1.0])
Kernel driver in use: ahci
00:1f.2 SATA controller: Intel Corporation C610/X99 series chipset 6-Port SATA Controller [AHCI mode] (rev 05) (prog-if 01 [AHCI 1.0])
Subsystem: Super Micro Computer Inc Device 0834
Kernel driver in use: ahci
$ lsb_release -a
LSB Version: :core-5.0-amd64:core-5.0-noarch
Distributor ID: Fedora
Description: Fedora release 40 (Forty)
Release: 40
Codename: Forty

* Re: Regression v6.11 booting cannot mount harddisks (xfs)
From: Damien Le Moal @ 2024-09-10 13:06 UTC
To: Jesper Dangaard Brouer, Linus Torvalds, LKML
Cc: Netdev, Jens Axboe, linux-ide, cassel, handan.babu,
djwong, Linux-XFS, hdegoede, David S. Miller, Jakub Kicinski,
kernel-team

On 2024/09/10 21:19, Jesper Dangaard Brouer wrote:
> Hi Linus,
>
> My testlab kernel devel server isn't booting correctly on v6.11 branches
> (e.g. net-next at 6.11.0-rc5)
> I just confirmed this also happens on your tree tag: v6.11-rc7.
>
> The symptom/issue is that harddisk dev names (e.g. /dev/sda, /dev/sdb,
> /dev/sdc) get reordered. I switched /etc/fstab to use UUIDs instead
> (which boots on v6.10) but on 6.11 it still cannot mount harddisks and
> doesn't fully boot.

Parallel SCSI device scanning has been around for a long time... This is
controlled with CONFIG_SCSI_SCAN_ASYNC. And yes, that can cause disk names to
change, which is why it is never a good idea to rely on them but instead use
/dev/disk/by-* names. Disabling CONFIG_SCSI_SCAN_ASYNC will likely not guarantee
that disk names will be constant, given that you seem to have 2 AHCI adapters on
your host and PCI device scanning is done in parallel.

> E.g. errors:
> systemd[1]: Expecting device dev-disk-by\x2duuid-0c2b348d\x2de013\x2d482b\x2da91c\x2d029640ec427a.device - /dev/disk/by-uuid/0c2b348d-e013-482b-a91c-029640ec427a...
> [DEPEND] Dependency failed for var-lib.mount - /var/lib.
> [...]
> [ TIME ] Timed out waiting for device dev-d…499e46-b40d-4067-afd4-5f6ad09fcff2.
> [DEPEND] Dependency failed for boot.mount - /boot.
>
> That corresponds to fstab's:
> - UUID=8b499e46-b40d-4067-afd4-5f6ad09fcff2 /boot xfs defaults 0 0
> - UUID=0c2b348d-e013-482b-a91c-029640ec427a /var/lib/ xfs defaults 0 0
>
> It looks like disk controller initialization happens in *parallel* on
> these newer kernels as dmesg shows init printk's overlapping:
>
> [ 5.683393] scsi 5:0:0:0: Direct-Access ATA SAMSUNG MZ7KM120 003Q PQ: 0 ANSI: 5
> [ 5.683641] scsi 7:0:0:0: Direct-Access ATA SAMSUNG MZ7KM120 003Q PQ: 0 ANSI: 5
> [ 5.683797] scsi 8:0:0:0: Direct-Access ATA Samsung SSD 840 BB0Q PQ: 0 ANSI: 5
> [...]
> [ 7.057376] sd 5:0:0:0: [sda] 234441648 512-byte logical blocks: (120 GB/112 GiB)
> [ 7.062279] sd 7:0:0:0: [sdb] 234441648 512-byte logical blocks: (120 GB/112 GiB)
> [ 7.070628] sd 5:0:0:0: [sda] Write Protect is off
> [ 7.070701] sd 8:0:0:0: [sdc] 488397168 512-byte logical blocks: (250 GB/233 GiB)
>
> Perhaps this could be a hint to what changed?

See above. The disk /dev/sdX names not being reliable is rather normal.
Are you sure you have the correct UUIDs of your FSes on the disks? You can
check them with "blkid /dev/sdX[n]"

> Any hints what commit I should try to test revert?
> Or good starting point for bisecting?

You said that 6.10 works, so maybe start from there?

--
Damien Le Moal
Western Digital Research

* Re: Regression v6.11 booting cannot mount harddisks (xfs)
From: Jesper Dangaard Brouer @ 2024-09-10 14:49 UTC
To: Damien Le Moal, Linus Torvalds, LKML
Cc: Netdev, Jens Axboe, linux-ide, cassel, handan.babu,
djwong, Linux-XFS, hdegoede, David S. Miller, Jakub Kicinski,
kernel-team

On 10/09/2024 15.06, Damien Le Moal wrote:
> On 2024/09/10 21:19, Jesper Dangaard Brouer wrote:
[...]
>> Perhaps this could be a hint to what changed?
>
> See above. The disk /dev/sdX names not being reliable is rather normal.
> Are you sure you have the correct UUIDs of your FSes on the disks ? You can
> check them with "blkid /dev/sdX[n]"

I have checked that I use the correct UUIDs.

I checked that my /etc/fstab has the UUID entries found under
/dev/disk/by-uuid/ via this oneliner, which should print an /etc/fstab
entry under each UUID. We can see I have one partition that I'm not using
(0fd3bc38-6496-401f-87f2-87e09532de53), which is expected.

$ for UUID in $(ls /dev/disk/by-uuid/); do echo $UUID; grep -H $UUID /etc/fstab; done
09e8c15f-80d2-47e3-8e73-d3fdfcf33eef
/etc/fstab:UUID=09e8c15f-80d2-47e3-8e73-d3fdfcf33eef / xfs defaults 0 0
0c2b348d-e013-482b-a91c-029640ec427a
/etc/fstab:UUID=0c2b348d-e013-482b-a91c-029640ec427a /var/lib/ xfs defaults 0 0
0fd3bc38-6496-401f-87f2-87e09532de53
581920da-1ccb-4b25-856c-036310032a74
/etc/fstab:UUID=581920da-1ccb-4b25-856c-036310032a74 /nix xfs defaults 0 0
8b499e46-b40d-4067-afd4-5f6ad09fcff2
/etc/fstab:UUID=8b499e46-b40d-4067-afd4-5f6ad09fcff2 /boot xfs defaults 0 0
cd409a50-0371-47ca-9213-49a2bc7b9317
/etc/fstab:UUID=cd409a50-0371-47ca-9213-49a2bc7b9317 swap swap defaults 0 0

>> Any hints what commit I should try to test revert?
>> Or good starting point for bisecting?
>
> You said that 6.10 works, so maybe start from there ?

I tested that I could boot tag v6.10, and have started bisection.

I've not tried to deselect CONFIG_SCSI_SCAN_ASYNC, as the kernel that
worked on tag v6.10 also had CONFIG_SCSI_SCAN_ASYNC enabled. So, it
is likely not related to the async controller init.

--Jesper

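The oneliner above checks in one direction only (UUIDs present on disk against fstab). A hedged Python sketch of a two-way check follows: it also reports fstab UUIDs that have no device, which is the case that makes systemd time out. The function names are illustrative, and the regular expression assumes unquoted `UUID=` entries like the ones shown in this thread.

```python
import os
import re

# Matches unquoted "UUID=<uuid> <mountpoint>" fstab lines; comments do not match.
_FSTAB_UUID = re.compile(r"^\s*UUID=([0-9A-Fa-f-]+)\s+(\S+)")


def fstab_mounts_by_uuid(fstab_text: str) -> dict:
    """Return {uuid: mount point} for every UUID= line in fstab."""
    mounts = {}
    for line in fstab_text.splitlines():
        match = _FSTAB_UUID.match(line)
        if match:
            mounts[match.group(1).lower()] = match.group(2)
    return mounts


def compare_uuids(fstab_text: str, present_uuids) -> tuple:
    """Return (missing, unused): fstab UUIDs with no device, devices not in fstab."""
    mounts = fstab_mounts_by_uuid(fstab_text)
    present = {uuid.lower() for uuid in present_uuids}
    missing = sorted(uuid for uuid in mounts if uuid not in present)
    unused = sorted(present - set(mounts))
    return missing, unused


def check_live_system(fstab="/etc/fstab", by_uuid="/dev/disk/by-uuid"):
    """Run the comparison against the running system (paths are the usual defaults)."""
    with open(fstab) as handle:
        return compare_uuids(handle.read(), os.listdir(by_uuid))
```

On the system described here, the expected result on a working boot would be an empty `missing` list and the one unused partition in `unused`.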
* Re: Regression v6.11 booting cannot mount harddisks (xfs)
From: Jesper Dangaard Brouer @ 2024-09-10 17:53 UTC
To: Damien Le Moal, Linus Torvalds, LKML, Christoph Hellwig
Cc: Netdev, Jens Axboe, linux-ide, cassel, handan.babu,
djwong, Linux-XFS, hdegoede, David S. Miller, Jakub Kicinski,
kernel-team

Hi Hellwig,

I bisected my boot problem down to this commit:

$ git bisect good
af2814149883e2c1851866ea2afcd8eadc040f79 is the first bad commit
commit af2814149883e2c1851866ea2afcd8eadc040f79
Author: Christoph Hellwig <hch@lst.de>
Date: Mon Jun 17 08:04:38 2024 +0200

    block: freeze the queue in queue_attr_store

    queue_attr_store updates attributes used to control generating I/O, and
    can cause malformed bios if changed with I/O in flight. Freeze the queue
    in common code instead of adding it to almost every attribute.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Bart Van Assche <bvanassche@acm.org>
    Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Link: https://lore.kernel.org/r/20240617060532.127975-12-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

 block/blk-mq.c | 5 +++--
 block/blk-sysfs.c | 9 ++-------
 2 files changed, 5 insertions(+), 9 deletions(-)

git describe --contains af2814149883e2c1851866ea2afcd8eadc040f79
v6.11-rc1~80^2~66^2~15

On 10/09/2024 16.49, Jesper Dangaard Brouer wrote:
> On 10/09/2024 15.06, Damien Le Moal wrote:
[...]
>>> Any hints what commit I should try to test revert?
>>> Or good starting point for bisecting?
>>
>> You said that 6.10 works, so maybe start from there ?
>
> I tested I could boot tag v6.10, and have started bisection.
>
> I've not tried to deselect CONFIG_SCSI_SCAN_ASYNC as the kernel that
> worked on tag v6.10 also had this CONFIG_SCSI_SCAN_ASYNC enabled. So, it
> is likely not related to the async controller init.
>
> --Jesper

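As background on how a bisection like the one above narrows thousands of commits down to one, here is a simplified Python sketch. It assumes a linear list of commits ordered oldest to newest, with the first known good and the last known bad; real `git bisect` works on the commit graph, so this is only the core idea, and `is_bad` stands in for "build, boot, see whether the disks mount".

```python
def first_bad_commit(commits, is_bad):
    """Binary-search for the first bad commit.

    `commits` is ordered oldest to newest; commits[0] is known good and
    commits[-1] is known bad. `is_bad(commit)` tests one commit.
    Returns (first bad commit, number of tests run).
    """
    good, bad = 0, len(commits) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1
        if is_bad(commits[mid]):
            bad = mid   # regression is at mid or earlier
        else:
            good = mid  # regression was introduced after mid
    return commits[bad], tests
```

Each test halves the remaining range, so the number of kernel builds grows only with the logarithm of the number of commits between the good and bad points.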
* Re: Regression v6.11 booting cannot mount harddisks (xfs)
From: Linus Torvalds @ 2024-09-10 18:30 UTC
To: Jesper Dangaard Brouer
Cc: Damien Le Moal, LKML, Christoph Hellwig, Netdev, Jens Axboe,
linux-ide, cassel, handan.babu, djwong, Linux-XFS, hdegoede,
David S. Miller, Jakub Kicinski, kernel-team

On Tue, 10 Sept 2024 at 10:53, Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
> af2814149883e2c1851866ea2afcd8eadc040f79 is the first bad commit

Just for fun - can you test moving the queue freezing *inside* the
mutex, ie something like

--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -670,11 +670,11 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
        if (!entry->store)
                return -EIO;

-       blk_mq_freeze_queue(q);
        mutex_lock(&q->sysfs_lock);
+       blk_mq_freeze_queue(q);
        res = entry->store(disk, page, length);
-       mutex_unlock(&q->sysfs_lock);
        blk_mq_unfreeze_queue(q);
+       mutex_unlock(&q->sysfs_lock);
        return res;
 }

(Just do it by hand, my patch is whitespace-damaged on purpose -
untested and not well thought through).

Because I'm wondering whether maybe some IO is done under the
sysfs_lock, and then you might have a deadlock?

Linus

* Re: Regression v6.11 booting cannot mount harddisks (xfs)
From: Jesper Dangaard Brouer @ 2024-09-10 19:07 UTC
To: Linus Torvalds
Cc: Damien Le Moal, LKML, Christoph Hellwig, Netdev, Jens Axboe,
linux-ide, cassel, handan.babu, djwong, Linux-XFS, hdegoede,
David S. Miller, Jakub Kicinski, kernel-team

On 10/09/2024 20.30, Linus Torvalds wrote:
> On Tue, 10 Sept 2024 at 10:53, Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>>
>> af2814149883e2c1851866ea2afcd8eadc040f79 is the first bad commit
>
> Just for fun - can you test moving the queue freezing *inside* the
> mutex, ie something like
>
[...]
>
> (Just do it by hand, my patch is whitespace-damaged on purpose -
> untested and not well thought through).
>
> Because I'm wondering whether maybe some IO is done under the
> sysfs_lock, and then you might have a deadlock?
>
> Linus

Tested the patch (manually applied change) and it did NOT help.

More likely the problem addressed by the fix Jens pointed to is the culprit.

--Jesper

* Re: Regression v6.11 booting cannot mount harddisks (xfs)
From: Jens Axboe @ 2024-09-10 18:38 UTC
To: Jesper Dangaard Brouer, Damien Le Moal, Linus Torvalds, LKML,
Christoph Hellwig
Cc: Netdev, linux-ide, cassel, handan.babu, djwong, Linux-XFS,
hdegoede, David S. Miller, Jakub Kicinski, kernel-team

On 9/10/24 11:53 AM, Jesper Dangaard Brouer wrote:
> Hi Hellwig,
>
> I bisected my boot problem down to this commit:
>
> $ git bisect good
> af2814149883e2c1851866ea2afcd8eadc040f79 is the first bad commit
> commit af2814149883e2c1851866ea2afcd8eadc040f79
> Author: Christoph Hellwig <hch@lst.de>
> Date: Mon Jun 17 08:04:38 2024 +0200
>
> block: freeze the queue in queue_attr_store
[...]
> git describe --contains af2814149883e2c1851866ea2afcd8eadc040f79
> v6.11-rc1~80^2~66^2~15

Curious, do your init scripts attempt to load a modular scheduler
for your root drive?

Reference: https://git.kernel.dk/cgit/linux/commit/?h=for-6.12/block&id=3c031b721c0ee1d6237719a6a9d7487ef757487b

--
Jens Axboe

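To answer a question like the one above on a running system, one can look at which I/O scheduler each disk uses. The sketch below parses the `queue/scheduler` sysfs attribute, whose contents list the available schedulers with the active one in square brackets (for example `none [mq-deadline] kyber bfq`). The helper names are illustrative. Whether a given scheduler is built as a module is a separate question, answered by the kernel config (options such as CONFIG_MQ_IOSCHED_DEADLINE or CONFIG_IOSCHED_BFQ set to `m`).

```python
import glob
import os


def parse_scheduler(text: str) -> tuple:
    """Return (active, available) from the text of a queue/scheduler file."""
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # the bracketed entry is the active scheduler
            active = token
        available.append(token)
    return active, available


def schedulers_on_this_host(sys_block: str = "/sys/block") -> dict:
    """Map each block device name to its (active, available) schedulers."""
    result = {}
    pattern = os.path.join(sys_block, "*", "queue", "scheduler")
    for path in glob.glob(pattern):
        disk = path.split(os.sep)[-3]
        with open(path) as handle:
            result[disk] = parse_scheduler(handle.read())
    return result
```

Writing a scheduler name to that same sysfs file goes through queue_attr_store, the function changed by the bisected commit, which is presumably why the question is relevant here.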
* Re: Regression v6.11 booting cannot mount harddisks (xfs)
From: Linus Torvalds @ 2024-09-10 18:46 UTC
To: Jens Axboe
Cc: Jesper Dangaard Brouer, Damien Le Moal, LKML, Christoph Hellwig,
Netdev, linux-ide, cassel, handan.babu, djwong, Linux-XFS, hdegoede,
David S. Miller, Jakub Kicinski, kernel-team

On Tue, 10 Sept 2024 at 11:38, Jens Axboe <axboe@kernel.dk> wrote:
>
> Curious, do your init scripts attempt to load a modular scheduler
> for your root drive?

Ahh, that sounds more likely than my idea.

Linus

* Re: Regression v6.11 booting cannot mount harddisks (xfs)
From: Jens Axboe @ 2024-09-10 18:56 UTC
To: Linus Torvalds
Cc: Jesper Dangaard Brouer, Damien Le Moal, LKML, Christoph Hellwig,
Netdev, linux-ide, cassel, handan.babu, djwong, Linux-XFS, hdegoede,
David S. Miller, Jakub Kicinski, kernel-team

On 9/10/24 12:46 PM, Linus Torvalds wrote:
> On Tue, 10 Sept 2024 at 11:38, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Curious, do your init scripts attempt to load a modular scheduler
>> for your root drive?
>
> Ahh, that sounds more likely than my idea.

And if confirmed, that makes me think I should migrate that fix to the 6.11
fixes rather than 6.12, where it's currently staged...

--
Jens Axboe

* Re: Regression v6.11 booting cannot mount harddisks (xfs)
From: Jesper Dangaard Brouer @ 2024-09-10 19:19 UTC
To: Jens Axboe, Damien Le Moal, Linus Torvalds, LKML, Christoph Hellwig
Cc: Netdev, linux-ide, cassel, handan.babu, djwong, Linux-XFS,
hdegoede, David S. Miller, Jakub Kicinski, kernel-team

On 10/09/2024 20.38, Jens Axboe wrote:
> On 9/10/24 11:53 AM, Jesper Dangaard Brouer wrote:
[...]
>> git describe --contains af2814149883e2c1851866ea2afcd8eadc040f79
>> v6.11-rc1~80^2~66^2~15
>
> Curious, do your init scripts attempt to load a modular scheduler
> for your root drive?

I have no idea, this is just a standard Fedora 40.

> Reference: https://git.kernel.dk/cgit/linux/commit/?h=for-6.12/block&id=3c031b721c0ee1d6237719a6a9d7487ef757487b

The commit doesn't apply cleanly on top of af2814149883e2c185.

$ patch --dry-run -p1 < ../block-jens/block-jens-bootfix.patch
checking file block/blk-sysfs.c
Hunk #1 FAILED at 23.
Hunk #2 succeeded at 469 (offset 56 lines).
Hunk #3 succeeded at 484 (offset 56 lines).
Hunk #4 succeeded at 723 with fuzz 1 (offset 45 lines).
1 out of 4 hunks FAILED
checking file block/elevator.c
Hunk #1 FAILED at 698.
1 out of 1 hunk FAILED
checking file block/elevator.h
Hunk #1 FAILED at 148.
1 out of 1 hunk FAILED

I will try to apply and adjust manually.

--Jesper

* Re: Regression v6.11 booting cannot mount harddisks (xfs)
From: Jens Axboe @ 2024-09-10 19:21 UTC
To: Jesper Dangaard Brouer, Damien Le Moal, Linus Torvalds, LKML,
Christoph Hellwig
Cc: Netdev, linux-ide, cassel, handan.babu, djwong, Linux-XFS,
hdegoede, David S. Miller, Jakub Kicinski, kernel-team

On 9/10/24 1:19 PM, Jesper Dangaard Brouer wrote:
[...]
> The commit doesn't apply cleanly on top of af2814149883e2c185.
>
> $ patch --dry-run -p1 < ../block-jens/block-jens-bootfix.patch
> checking file block/blk-sysfs.c
> Hunk #1 FAILED at 23.
> Hunk #2 succeeded at 469 (offset 56 lines).
> Hunk #3 succeeded at 484 (offset 56 lines).
> Hunk #4 succeeded at 723 with fuzz 1 (offset 45 lines).
> 1 out of 4 hunks FAILED
> checking file block/elevator.c
> Hunk #1 FAILED at 698.
> 1 out of 1 hunk FAILED
> checking file block/elevator.h
> Hunk #1 FAILED at 148.
> 1 out of 1 hunk FAILED
>
> I will try to apply and adjust manually.

Just apply it on top of current -git, doesn't have to be your bisection
point.

--
Jens Axboe

* Re: Regression v6.11 booting cannot mount harddisks (xfs)
From: Jesper Dangaard Brouer @ 2024-09-10 19:40 UTC
To: Jens Axboe, Damien Le Moal, Linus Torvalds, LKML, Christoph Hellwig
Cc: Netdev, linux-ide, cassel, handan.babu, djwong, Linux-XFS,
hdegoede, David S. Miller, Jakub Kicinski, kernel-team, rjones

On 10/09/2024 21.21, Jens Axboe wrote:
> On 9/10/24 1:19 PM, Jesper Dangaard Brouer wrote:
>> On 10/09/2024 20.38, Jens Axboe wrote:
[...]
>>> Reference: https://git.kernel.dk/cgit/linux/commit/?h=for-6.12/block&id=3c031b721c0ee1d6237719a6a9d7487ef757487b
>>

[1] https://git.kernel.dk/cgit/linux/commit/?h=for-6.12/block&id=3c031b721c0ee1d6237719a6a9d7487ef757487b

>> The commit doesn't apply cleanly on top of af2814149883e2c185.
[...]
>> I will try to apply and adjust manually.
>
> Just apply it on top of current -git, doesn't have to be your bisection
> point.
>

I applied it manually and now my testlab server boots :-)

Just with the patch[1] on top of bisection point
... as it was faster to recompile this way ;-)

--Jesper

* Re: Regression v6.11 booting cannot mount harddisks (xfs)
From: Jens Axboe @ 2024-09-10 19:43 UTC
To: Jesper Dangaard Brouer, Damien Le Moal, Linus Torvalds, LKML,
Christoph Hellwig
Cc: Netdev, linux-ide, cassel, handan.babu, djwong, Linux-XFS,
hdegoede, David S. Miller, Jakub Kicinski, kernel-team, rjones

On 9/10/24 1:40 PM, Jesper Dangaard Brouer wrote:
> On 10/09/2024 21.21, Jens Axboe wrote:
[...]
>> Just apply it on top of current -git, doesn't have to be your bisection
>> point.
>>
>
> I applied it manually and now my testlab server boots :-)

Excellent! I'll get it staged for 6.11 instead. Thanks for reporting and
testing.

> Just with the patch[1] on top of bisection point
> ... as it was faster to recompile this way ;-)

That's a pathetic excuse for a test box then ;-)

--
Jens Axboe