* [REGRESSION][BISECTED][STABLE] hdparm errors since 28ab9769117c
@ 2024-08-07 17:23 Christian Heusel
2024-08-07 18:26 ` Damien Le Moal
0 siblings, 1 reply; 10+ messages in thread
From: Christian Heusel @ 2024-08-07 17:23 UTC (permalink / raw)
To: Igor Pylypiv, Damien Le Moal, Niklas Cassel, linux-ide
Cc: Hannes Reinecke, regressions, stable, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 5067 bytes --]
Hello Igor, hello Niklas,
on my NAS I am encountering the following issue since v6.6.44 (LTS),
when executing the hdparm command for my WD-WCC7K4NLX884 drives to get
the active or standby state:
$ hdparm -C /dev/sda
/dev/sda:
SG_IO: bad/missing sense data, sb[]: f0 00 01 00 50 40 ff 0a 00 00 78 00 00 1d 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
drive state is: unknown
While the expected output is the following:
$ hdparm -C /dev/sda
/dev/sda:
drive state is: active/idle
I did a bisection within the stable series and found the following
commit to be the first bad one:
28ab9769117c ("ata: libata-scsi: Honor the D_SENSE bit for CK_COND=1 and no error")
According to kernel.dance the same commit was also backported to the
v6.10.3 and v6.1.103 stable kernels and I could not find any commit or
pending patch with a "Fixes:" tag for the offending commit.
So far I have not been able to test with the mainline kernel as this is
a remote device which I couldn't rescue in case of a boot failure. Also
just for transparency it does have the out of tree ZFS module loaded,
but AFAIU this shouldn't be an issue here, as the commit seems clearly
related to the error. If needed I can test with an untainted mainline
kernel on Friday when I'm near the device.
I have attached the output of hdparm -I below and would be happy to
provide further debug information or test patches.
Cheers,
Christian
---
#regzbot introduced: 28ab9769117c
#regzbot title: ata: libata-scsi: Sense data errors breaking hdparm with WD drives
---
$ pacman -Q hdparm
hdparm 9.65-2
$ hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: WDC WD40EFRX-68N32N0
Serial Number: WD-WCC7K4NLX884
Firmware Revision: 82.00A82
Transport: Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
Used: unknown (minor revision code 0x006d)
Supported: 10 9 8 7 6 5
Likely used: 10
Configuration:
Logical max current
cylinders 16383 0
heads 16 0
sectors/track 63 0
--
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 7814037168
Logical Sector size: 512 bytes
Physical Sector size: 4096 bytes
Logical Sector-0 offset: 0 bytes
device size with M = 1024*1024: 3815447 MBytes
device size with M = 1000*1000: 4000787 MBytes (4000 GB)
cache/buffer size = unknown
Form Factor: 3.5 inch
Nominal Media Rotation Rate: 5400
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, with device specific minimum
R/W multiple sector transfer: Max = 16 Current = 16
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* NOP cmd
* DOWNLOAD_MICROCODE
Power-Up In Standby feature set
* SET_FEATURES required to spinup after power up
SET_MAX security extension
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* 64-bit World wide name
* IDLE_IMMEDIATE with UNLOAD
* WRITE_UNCORRECTABLE_EXT command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Gen3 signaling speed (6.0Gb/s)
* Native Command Queueing (NCQ)
* Host-initiated interface power management
* Phy event counters
* Idle-Unload when NCQ is active
* NCQ priority information
* READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
* DMA Setup Auto-Activate optimization
* Device-initiated interface power management
* Software settings preservation
* SMART Command Transport (SCT) feature set
* SCT Write Same (AC2)
* SCT Error Recovery Control (AC3)
* SCT Features Control (AC4)
* SCT Data Tables (AC5)
unknown 206[12] (vendor specific)
unknown 206[13] (vendor specific)
* DOWNLOAD MICROCODE DMA command
* WRITE BUFFER DMA command
* READ BUFFER DMA command
Security:
Master password revision code = 65534
supported
not enabled
not locked
frozen
not expired: security count
supported: enhanced erase
504min for SECURITY ERASE UNIT. 504min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 50014ee2647735a1
NAA : 5
IEEE OUI : 0014ee
Unique ID : 2647735a1
Checksum: correct
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [REGRESSION][BISECTED][STABLE] hdparm errors since 28ab9769117c
2024-08-07 17:23 [REGRESSION][BISECTED][STABLE] hdparm errors since 28ab9769117c Christian Heusel
@ 2024-08-07 18:26 ` Damien Le Moal
2024-08-07 22:10 ` Niklas Cassel
0 siblings, 1 reply; 10+ messages in thread
From: Damien Le Moal @ 2024-08-07 18:26 UTC (permalink / raw)
To: Christian Heusel, Igor Pylypiv, Niklas Cassel, linux-ide
Cc: Hannes Reinecke, regressions, stable, linux-kernel
On 2024/08/07 10:23, Christian Heusel wrote:
> Hello Igor, hello Niklas,
>
> on my NAS I am encountering the following issue since v6.6.44 (LTS),
> when executing the hdparm command for my WD-WCC7K4NLX884 drives to get
> the active or standby state:
>
> $ hdparm -C /dev/sda
> /dev/sda:
> SG_IO: bad/missing sense data, sb[]: f0 00 01 00 50 40 ff 0a 00 00 78 00 00 1d 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> drive state is: unknown
>
>
> While the expected output is the following:
>
> $ hdparm -C /dev/sda
> /dev/sda:
> drive state is: active/idle
>
> I did a bisection within the stable series and found the following
> commit to be the first bad one:
>
> 28ab9769117c ("ata: libata-scsi: Honor the D_SENSE bit for CK_COND=1 and no error")
>
> According to kernel.dance the same commit was also backported to the
> v6.10.3 and v6.1.103 stable kernels and I could not find any commit or
> pending patch with a "Fixes:" tag for the offending commit.
>
> So far I have not been able to test with the mainline kernel as this is
> a remote device which I couldn't rescue in case of a boot failure. Also
> just for transparency it does have the out of tree ZFS module loaded,
> but AFAIU this shouldn't be an issue here, as the commit seems clearly
> related to the error. If needed I can test with an untainted mainline
> kernel on Friday when I'm near the device.
>
> I have attached the output of hdparm -I below and would be happy to
> provide further debug information or test patches.
I confirm this, using 6.11-rc2. The problem is actually hdparm code which
assumes that the sense data is in descriptor format without ever looking at the
D_SENSE bit to verify that. So commit 28ab9769117c reveals this issue because as
its title explains, it (correctly) honors D_SENSE instead of always generating
sense data in descriptor format.
Hmm... This is annoying. The kernel is fixed to be spec compliant but that
breaks old/non-compliant applications... We definitely should fix hdparm code,
but I think we still need to revert 28ab9769117c...
Niklas, Igor, thoughts ?
>
> Cheers,
> Christian
>
> ---
>
> #regzbot introduced: 28ab9769117c
> #regzbot title: ata: libata-scsi: Sense data errors breaking hdparm with WD drives
>
> ---
>
> $ pacman -Q hdparm
> hdparm 9.65-2
>
> $ hdparm -I /dev/sda
>
> /dev/sda:
>
> ATA device, with non-removable media
> Model Number: WDC WD40EFRX-68N32N0
> Serial Number: WD-WCC7K4NLX884
> Firmware Revision: 82.00A82
> Transport: Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
> Standards:
> Used: unknown (minor revision code 0x006d)
> Supported: 10 9 8 7 6 5
> Likely used: 10
> Configuration:
> Logical max current
> cylinders 16383 0
> heads 16 0
> sectors/track 63 0
> --
> LBA user addressable sectors: 268435455
> LBA48 user addressable sectors: 7814037168
> Logical Sector size: 512 bytes
> Physical Sector size: 4096 bytes
> Logical Sector-0 offset: 0 bytes
> device size with M = 1024*1024: 3815447 MBytes
> device size with M = 1000*1000: 4000787 MBytes (4000 GB)
> cache/buffer size = unknown
> Form Factor: 3.5 inch
> Nominal Media Rotation Rate: 5400
> Capabilities:
> LBA, IORDY(can be disabled)
> Queue depth: 32
> Standby timer values: spec'd by Standard, with device specific minimum
> R/W multiple sector transfer: Max = 16 Current = 16
> DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
> Cycle time: min=120ns recommended=120ns
> PIO: pio0 pio1 pio2 pio3 pio4
> Cycle time: no flow control=120ns IORDY flow control=120ns
> Commands/features:
> Enabled Supported:
> * SMART feature set
> Security Mode feature set
> * Power Management feature set
> * Write cache
> * Look-ahead
> * Host Protected Area feature set
> * WRITE_BUFFER command
> * READ_BUFFER command
> * NOP cmd
> * DOWNLOAD_MICROCODE
> Power-Up In Standby feature set
> * SET_FEATURES required to spinup after power up
> SET_MAX security extension
> * 48-bit Address feature set
> * Device Configuration Overlay feature set
> * Mandatory FLUSH_CACHE
> * FLUSH_CACHE_EXT
> * SMART error logging
> * SMART self-test
> * General Purpose Logging feature set
> * 64-bit World wide name
> * IDLE_IMMEDIATE with UNLOAD
> * WRITE_UNCORRECTABLE_EXT command
> * {READ,WRITE}_DMA_EXT_GPL commands
> * Segmented DOWNLOAD_MICROCODE
> * Gen1 signaling speed (1.5Gb/s)
> * Gen2 signaling speed (3.0Gb/s)
> * Gen3 signaling speed (6.0Gb/s)
> * Native Command Queueing (NCQ)
> * Host-initiated interface power management
> * Phy event counters
> * Idle-Unload when NCQ is active
> * NCQ priority information
> * READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
> * DMA Setup Auto-Activate optimization
> * Device-initiated interface power management
> * Software settings preservation
> * SMART Command Transport (SCT) feature set
> * SCT Write Same (AC2)
> * SCT Error Recovery Control (AC3)
> * SCT Features Control (AC4)
> * SCT Data Tables (AC5)
> unknown 206[12] (vendor specific)
> unknown 206[13] (vendor specific)
> * DOWNLOAD MICROCODE DMA command
> * WRITE BUFFER DMA command
> * READ BUFFER DMA command
> Security:
> Master password revision code = 65534
> supported
> not enabled
> not locked
> frozen
> not expired: security count
> supported: enhanced erase
> 504min for SECURITY ERASE UNIT. 504min for ENHANCED SECURITY ERASE UNIT.
> Logical Unit WWN Device Identifier: 50014ee2647735a1
> NAA : 5
> IEEE OUI : 0014ee
> Unique ID : 2647735a1
> Checksum: correct
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [REGRESSION][BISECTED][STABLE] hdparm errors since 28ab9769117c
2024-08-07 18:26 ` Damien Le Moal
@ 2024-08-07 22:10 ` Niklas Cassel
2024-08-09 14:30 ` Niklas Cassel
2024-08-09 15:34 ` Damien Le Moal
0 siblings, 2 replies; 10+ messages in thread
From: Niklas Cassel @ 2024-08-07 22:10 UTC (permalink / raw)
To: Damien Le Moal
Cc: Christian Heusel, Igor Pylypiv, linux-ide, Hannes Reinecke,
regressions, stable, linux-kernel
On Wed, Aug 07, 2024 at 11:26:46AM -0700, Damien Le Moal wrote:
> On 2024/08/07 10:23, Christian Heusel wrote:
> > Hello Igor, hello Niklas,
> >
> > on my NAS I am encountering the following issue since v6.6.44 (LTS),
> > when executing the hdparm command for my WD-WCC7K4NLX884 drives to get
> > the active or standby state:
> >
> > $ hdparm -C /dev/sda
> > /dev/sda:
> > SG_IO: bad/missing sense data, sb[]: f0 00 01 00 50 40 ff 0a 00 00 78 00 00 1d 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > drive state is: unknown
> >
> >
> > While the expected output is the following:
> >
> > $ hdparm -C /dev/sda
> > /dev/sda:
> > drive state is: active/idle
> >
> > I did a bisection within the stable series and found the following
> > commit to be the first bad one:
> >
> > 28ab9769117c ("ata: libata-scsi: Honor the D_SENSE bit for CK_COND=1 and no error")
> >
> > According to kernel.dance the same commit was also backported to the
> > v6.10.3 and v6.1.103 stable kernels and I could not find any commit or
> > pending patch with a "Fixes:" tag for the offending commit.
> >
> > So far I have not been able to test with the mainline kernel as this is
> > a remote device which I couldn't rescue in case of a boot failure. Also
> > just for transparency it does have the out of tree ZFS module loaded,
> > but AFAIU this shouldn't be an issue here, as the commit seems clearly
> > related to the error. If needed I can test with an untainted mainline
> > kernel on Friday when I'm near the device.
> >
> > I have attached the output of hdparm -I below and would be happy to
> > provide further debug information or test patches.
>
> I confirm this, using 6.11-rc2. The problem is actually hdparm code which
> assumes that the sense data is in descriptor format without ever looking at the
> D_SENSE bit to verify that. So commit 28ab9769117c reveals this issue because as
> its title explains, it (correctly) honors D_SENSE instead of always generating
> sense data in descriptor format.
You mean: the user space application is using the sense buffer without first
checking if the returned sense buffer is in descriptor or fixed format.
This seems like a fundamentally flawed assumption by the user space program.
If it doesn't even bother checking the first field in the sense buffer, sb[0],
perhaps it shouldn't bother trying to use the sense buffer at all.
(Yes, the D_SENSE bit can be configured by the user, but that doesn't change
the fact that a user space program must check the format of the returned buffer
before trying to use it.)
> Hmm... This is annoying. The kernel is fixed to be spec compliant but that
> breaks old/non-compliant applications... We definitely should fix hdparm code,
> but I think we still need to revert 28ab9769117c...
Well.. if we look at commit:
11093cb1ef56 ("libata-scsi: generate correct ATA pass-through sense")
https://github.com/torvalds/linux/commit/11093cb1ef56147fe33f5750b1eab347bdef30db
We can see that before that commit, the kernel used to call
ata_scsi_set_sense().
Back then ata_scsi_set_sense() was defined as:
https://github.com/torvalds/linux/blob/11093cb1ef56147fe33f5750b1eab347bdef30db/drivers/ata/libata-scsi.c#L280
scsi_build_sense_buffer(0, cmd->sense_buffer, sk, asc, ascq);
Where the first argument to scsi_build_sense_buffer() is if the generated sense
buffer should be fixed or desc format (0 == fixed format), so we used to
generate the sense buffer in fixed format:
https://github.com/torvalds/linux/blob/11093cb1ef56147fe33f5750b1eab347bdef30db/drivers/scsi/scsi_common.c#L231
However, as we can see, the kernel then used to incorrectly just
change sb[0} to say that the buffer was in desc format,
without updating the other fields, e.g. sb[2]:
https://github.com/torvalds/linux/blob/11093cb1ef56147fe33f5750b1eab347bdef30db~/drivers/ata/libata-scsi.c#L1026
so the format was really in some franken format...
following neither fixed or descriptor format.
11093cb1ef56 ("libata-scsi: generate correct ATA pass-through sense")
did change so that successful ATA-passthrough commands always generated
the sense data in descriptor format. However, that commit also managed to
mess up the offsets for fixed format sense...
The commit that later changed ata_scsi_set_sense() to honor D_SENSE
was commit: 06dbde5f3a44 ("libata: Implement control mode page to select
sense format")
So basically:
Before commit 11093cb1ef56 ("libata-scsi: generate correct ATA pass-through
sense"), we generated sense data in some franken format for both successful
and failed ATA-passthrough commands.
After commit 11093cb1ef56 ("libata-scsi: generate correct ATA pass-through
sense") we generate sense data for sucessful ATA-passthrough commands in
descriptor format unconditionally, but still in franken format for failed
ATA-passthrough commands.
After commit 06dbde5f3a44 ("libata: Implement control mode page to select
sense format") we generate sense data for sucessful ATA-passthrough commands
in descriptor format unconditionally, but for failed commands we actually
honored D_SENSE to generate it either in fixed format or descriptor format.
(However, because of a bug in 11093cb1ef56, if using fixed format, the
offsets were wrong...)
The incorrect offsets for fixed format was fixed recently, in commit
38dab832c3f4 ("ata: libata-scsi: Fix offsets for the fixed format sense data")
Commit 28ab9769117c ("ata: libata-scsi: Honor the D_SENSE bit for CK_COND=1 and
no error") fixed so that we actually honor D_SENSE not only for failed
ATA-passthrough commands, but also for successfull ATA-passthrough commands.
TL;DR: it is very hard to say that we have introduced a regression, because
this crap has basically been broken in one way or another since it was
introduced... Personally, I would definitely want all the patches that are in
mainline in the kernel running on my machine, since that is the only thing
that is consistent.
However, that assumes that user space programs that are trying to parse the
sense data actually bothers to check the first field in the sense data,
to see which format the returned sense data is in... Applications that
do not even both with that will have problems on a lot of (historic) kernel
versions.
Kind regards,
Niklas
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [REGRESSION][BISECTED][STABLE] hdparm errors since 28ab9769117c
2024-08-07 22:10 ` Niklas Cassel
@ 2024-08-09 14:30 ` Niklas Cassel
2024-08-09 15:21 ` Damien Le Moal
2024-08-09 15:34 ` Damien Le Moal
1 sibling, 1 reply; 10+ messages in thread
From: Niklas Cassel @ 2024-08-09 14:30 UTC (permalink / raw)
To: Damien Le Moal
Cc: Christian Heusel, Igor Pylypiv, linux-ide, Hannes Reinecke,
regressions, stable, linux-kernel
Hello Damien,
If we want to no longer respect the D_SENSE bit for successful ATA-passthrough
commands, e.g. by replacing the ata_scsi_set_sense() call with a
scsi_build_sense() call in the else clause:
https://github.com/torvalds/linux/blob/v6.11-rc2/drivers/ata/libata-scsi.c#L955
...then I think that we should also replace the ata_scsi_set_sense() call with
a scsi_build_sense() call for failed ATA-passthrough commands too
(in the non-else clause):
https://github.com/torvalds/linux/blob/v6.11-rc2/drivers/ata/libata-scsi.c#L952
..however, that does not sound like a very nice solution IMO.
Another option, if there are a lot of user space programs that incorrectly
assume that the sense data (for both successful and failed commands) is in
descriptor format, without bothering to check the sense data type, one option
might be to change the default value of D_SENSE in the control mode page to 1
in libata's SATL, i.e. set ATA_DFLAG_D_SENSE in dev->flags by default.
That way, we will still respect D_SENSE while generating the sense data
(in case the user issues a mode select to modify the bit), and the default
will be to generate the sense data in descriptor format, as it has been
since 11093cb1ef56 ("libata-scsi: generate correct ATA pass-through sense").
Kind regards,
Niklas
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [REGRESSION][BISECTED][STABLE] hdparm errors since 28ab9769117c
2024-08-09 14:30 ` Niklas Cassel
@ 2024-08-09 15:21 ` Damien Le Moal
0 siblings, 0 replies; 10+ messages in thread
From: Damien Le Moal @ 2024-08-09 15:21 UTC (permalink / raw)
To: Niklas Cassel
Cc: Christian Heusel, Igor Pylypiv, linux-ide, Hannes Reinecke,
regressions, stable, linux-kernel
On 2024/08/09 7:30, Niklas Cassel wrote:
> Hello Damien,
>
> If we want to no longer respect the D_SENSE bit for successful ATA-passthrough
> commands, e.g. by replacing the ata_scsi_set_sense() call with a
> scsi_build_sense() call in the else clause:
> https://github.com/torvalds/linux/blob/v6.11-rc2/drivers/ata/libata-scsi.c#L955
>
> ...then I think that we should also replace the ata_scsi_set_sense() call with
> a scsi_build_sense() call for failed ATA-passthrough commands too
> (in the non-else clause):
> https://github.com/torvalds/linux/blob/v6.11-rc2/drivers/ata/libata-scsi.c#L952
>
> ..however, that does not sound like a very nice solution IMO.
>
>
> Another option, if there are a lot of user space programs that incorrectly
> assume that the sense data (for both successful and failed commands) is in
> descriptor format, without bothering to check the sense data type, one option
> might be to change the default value of D_SENSE in the control mode page to 1
> in libata's SATL, i.e. set ATA_DFLAG_D_SENSE in dev->flags by default.
>
> That way, we will still respect D_SENSE while generating the sense data
> (in case the user issues a mode select to modify the bit), and the default
> will be to generate the sense data in descriptor format, as it has been
> since 11093cb1ef56 ("libata-scsi: generate correct ATA pass-through sense").
That indeed should be acceptable. And we should also patch hdparm to properly
look at the sense format and not assume descriptor format by default.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [REGRESSION][BISECTED][STABLE] hdparm errors since 28ab9769117c
2024-08-07 22:10 ` Niklas Cassel
2024-08-09 14:30 ` Niklas Cassel
@ 2024-08-09 15:34 ` Damien Le Moal
2024-08-09 18:42 ` Christian Heusel
1 sibling, 1 reply; 10+ messages in thread
From: Damien Le Moal @ 2024-08-09 15:34 UTC (permalink / raw)
To: Niklas Cassel
Cc: Christian Heusel, Igor Pylypiv, linux-ide, Hannes Reinecke,
regressions, stable, linux-kernel
On 2024/08/07 15:10, Niklas Cassel wrote:
> On Wed, Aug 07, 2024 at 11:26:46AM -0700, Damien Le Moal wrote:
>> On 2024/08/07 10:23, Christian Heusel wrote:
>>> Hello Igor, hello Niklas,
>>>
>>> on my NAS I am encountering the following issue since v6.6.44 (LTS),
>>> when executing the hdparm command for my WD-WCC7K4NLX884 drives to get
>>> the active or standby state:
>>>
>>> $ hdparm -C /dev/sda
>>> /dev/sda:
>>> SG_IO: bad/missing sense data, sb[]: f0 00 01 00 50 40 ff 0a 00 00 78 00 00 1d 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>> drive state is: unknown
>>>
>>>
>>> While the expected output is the following:
>>>
>>> $ hdparm -C /dev/sda
>>> /dev/sda:
>>> drive state is: active/idle
>>>
>>> I did a bisection within the stable series and found the following
>>> commit to be the first bad one:
>>>
>>> 28ab9769117c ("ata: libata-scsi: Honor the D_SENSE bit for CK_COND=1 and no error")
>>>
>>> According to kernel.dance the same commit was also backported to the
>>> v6.10.3 and v6.1.103 stable kernels and I could not find any commit or
>>> pending patch with a "Fixes:" tag for the offending commit.
>>>
>>> So far I have not been able to test with the mainline kernel as this is
>>> a remote device which I couldn't rescue in case of a boot failure. Also
>>> just for transparency it does have the out of tree ZFS module loaded,
>>> but AFAIU this shouldn't be an issue here, as the commit seems clearly
>>> related to the error. If needed I can test with an untainted mainline
>>> kernel on Friday when I'm near the device.
>>>
>>> I have attached the output of hdparm -I below and would be happy to
>>> provide further debug information or test patches.
>>
>> I confirm this, using 6.11-rc2. The problem is actually hdparm code which
>> assumes that the sense data is in descriptor format without ever looking at the
>> D_SENSE bit to verify that. So commit 28ab9769117c reveals this issue because as
>> its title explains, it (correctly) honors D_SENSE instead of always generating
>> sense data in descriptor format.
>
> You mean: the user space application is using the sense buffer without first
> checking if the returned sense buffer is in descriptor or fixed format.
Yes. The code looks like:
desc = sb + 8;
if (io_hdr.driver_status != SG_DRIVER_SENSE) {
...
} else if (sb[0] != 0x72 || sb[7] < 14 || desc[0] != 0x09 || desc[1] < 0x0c) {
if (verbose || tf->command != ATA_OP_IDENTIFY)
dump_bytes("SG_IO: bad/missing sense data, sb[]",
sb, sizeof(sb));
}
So clearly it assumes descrip@tor format.
> This seems like a fundamentally flawed assumption by the user space program.
> If it doesn't even bother checking the first field in the sense buffer, sb[0],
> perhaps it shouldn't bother trying to use the sense buffer at all.
> (Yes, the D_SENSE bit can be configured by the user, but that doesn't change
> the fact that a user space program must check the format of the returned buffer
> before trying to use it.)
Yep. I agree.
>
>
>> Hmm... This is annoying. The kernel is fixed to be spec compliant but that
>> breaks old/non-compliant applications... We definitely should fix hdparm code,
>> but I think we still need to revert 28ab9769117c...
>
> Well.. if we look at commit:
> 11093cb1ef56 ("libata-scsi: generate correct ATA pass-through sense")
> https://github.com/torvalds/linux/commit/11093cb1ef56147fe33f5750b1eab347bdef30db
>
> We can see that before that commit, the kernel used to call
> ata_scsi_set_sense().
>
> Back then ata_scsi_set_sense() was defined as:
> https://github.com/torvalds/linux/blob/11093cb1ef56147fe33f5750b1eab347bdef30db/drivers/ata/libata-scsi.c#L280
> scsi_build_sense_buffer(0, cmd->sense_buffer, sk, asc, ascq);
>
> Where the first argument to scsi_build_sense_buffer() is if the generated sense
> buffer should be fixed or desc format (0 == fixed format), so we used to
> generate the sense buffer in fixed format:
> https://github.com/torvalds/linux/blob/11093cb1ef56147fe33f5750b1eab347bdef30db/drivers/scsi/scsi_common.c#L231
>
> However, as we can see, the kernel then used to incorrectly just
> change sb[0} to say that the buffer was in desc format,
> without updating the other fields, e.g. sb[2]:
> https://github.com/torvalds/linux/blob/11093cb1ef56147fe33f5750b1eab347bdef30db~/drivers/ata/libata-scsi.c#L1026
> so the format was really in some franken format...
> following neither fixed or descriptor format.
>
> 11093cb1ef56 ("libata-scsi: generate correct ATA pass-through sense")
> did change so that successful ATA-passthrough commands always generated
> the sense data in descriptor format. However, that commit also managed to
> mess up the offsets for fixed format sense...
>
> The commit that later changed ata_scsi_set_sense() to honor D_SENSE
> was commit: 06dbde5f3a44 ("libata: Implement control mode page to select
> sense format")
>
> So basically:
> Before commit 11093cb1ef56 ("libata-scsi: generate correct ATA pass-through
> sense"), we generated sense data in some franken format for both successful
> and failed ATA-passthrough commands.
>
> After commit 11093cb1ef56 ("libata-scsi: generate correct ATA pass-through
> sense") we generate sense data for sucessful ATA-passthrough commands in
> descriptor format unconditionally, but still in franken format for failed
> ATA-passthrough commands.
>
> After commit 06dbde5f3a44 ("libata: Implement control mode page to select
> sense format") we generate sense data for sucessful ATA-passthrough commands
> in descriptor format unconditionally, but for failed commands we actually
> honored D_SENSE to generate it either in fixed format or descriptor format.
> (However, because of a bug in 11093cb1ef56, if using fixed format, the
> offsets were wrong...)
>
>
> The incorrect offsets for fixed format was fixed recently, in commit
> 38dab832c3f4 ("ata: libata-scsi: Fix offsets for the fixed format sense data")
>
> Commit 28ab9769117c ("ata: libata-scsi: Honor the D_SENSE bit for CK_COND=1 and
> no error") fixed so that we actually honor D_SENSE not only for failed
> ATA-passthrough commands, but also for successfull ATA-passthrough commands.
>
> TL;DR: it is very hard to say that we have introduced a regression, because
> this crap has basically been broken in one way or another since it was
> introduced... Personally, I would definitely want all the patches that are in
> mainline in the kernel running on my machine, since that is the only thing
> that is consistent.
>
> However, that assumes that user space programs that are trying to parse the
> sense data actually bothers to check the first field in the sense data,
> to see which format the returned sense data is in... Applications that
> do not even both with that will have problems on a lot of (historic) kernel
> versions.
Yes, indeed. I do not want to revert any of these recent patches, because as you
rightly summarize here, these fix something that has been broken for a long
time. We were just lucky that we did not see more application failures until
now, or rather unlucky that we did not as that would have revealed these
problems earlier.
So I think we will have some patching to do to hdparm at least to fix the
problems there.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [REGRESSION][BISECTED][STABLE] hdparm errors since 28ab9769117c
2024-08-09 15:34 ` Damien Le Moal
@ 2024-08-09 18:42 ` Christian Heusel
2024-08-09 20:13 ` Christian Heusel
0 siblings, 1 reply; 10+ messages in thread
From: Christian Heusel @ 2024-08-09 18:42 UTC (permalink / raw)
To: Damien Le Moal
Cc: Niklas Cassel, Igor Pylypiv, linux-ide, Hannes Reinecke,
regressions, stable, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 2135 bytes --]
On 24/08/09 08:34AM, Damien Le Moal wrote:
> On 2024/08/07 15:10, Niklas Cassel wrote:
> > On Wed, Aug 07, 2024 at 11:26:46AM -0700, Damien Le Moal wrote:
> >> On 2024/08/07 10:23, Christian Heusel wrote:
> >>> Hello Igor, hello Niklas,
> >>>
> >>> on my NAS I am encountering the following issue since v6.6.44 (LTS),
> >>> when executing the hdparm command for my WD-WCC7K4NLX884 drives to get
> >>> the active or standby state:
> >>>
> >>> $ hdparm -C /dev/sda
> >>> /dev/sda:
> >>> SG_IO: bad/missing sense data, sb[]: f0 00 01 00 50 40 ff 0a 00 00 78 00 00 1d 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >>> drive state is: unknown
> >>>
> >>>
> >>> While the expected output is the following:
> >>>
> >>> $ hdparm -C /dev/sda
> >>> /dev/sda:
> >>> drive state is: active/idle
> >>>
>
> Yes, indeed. I do not want to revert any of these recent patches, because as you
> rightly summarize here, these fix something that has been broken for a long
> time. We were just lucky that we did not see more application failures until
> now, or rather unlucky that we did not as that would have revealed these
> problems earlier.
>
> So I think we will have some patching to do to hdparm at least to fix the
> problems there.
It seems like this does not only break hdparm but also hddtemp, which
does not use hdparm as dep as far as I can tell:
# on bad kernel for the above issue
$ hddtemp /dev/sda
/dev/sda: WDC WD40EFRX-68N32N0 : drive is sleeping
# on good kernel for the above issue
$ hddtemp /dev/sda
/dev/sda: WDC WD40EFRX-68N32N0: 31°C
I didn't take the time to actually verify that this is the same issue,
but it seems very likely from what we have gathered in this thread
already.
So while I agree that it might have previously just worked by chance it
seems like there is quite some stuff depending on the previous behavior.
This was first discovered in [this thread in the Arch Linux Forums][0]
by user @GerBra.
~Chris
[0]: https://bbs.archlinux.org/viewtopic.php?id=298407
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [REGRESSION][BISECTED][STABLE] hdparm errors since 28ab9769117c
2024-08-09 18:42 ` Christian Heusel
@ 2024-08-09 20:13 ` Christian Heusel
2024-08-12 9:26 ` Linux regression tracking (Thorsten Leemhuis)
0 siblings, 1 reply; 10+ messages in thread
From: Christian Heusel @ 2024-08-09 20:13 UTC (permalink / raw)
To: Damien Le Moal
Cc: Niklas Cassel, Igor Pylypiv, linux-ide, Hannes Reinecke,
regressions, stable, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 2435 bytes --]
On 24/08/09 08:42PM, Christian Heusel wrote:
> On 24/08/09 08:34AM, Damien Le Moal wrote:
> > On 2024/08/07 15:10, Niklas Cassel wrote:
> > > On Wed, Aug 07, 2024 at 11:26:46AM -0700, Damien Le Moal wrote:
> > >> On 2024/08/07 10:23, Christian Heusel wrote:
> > >>> Hello Igor, hello Niklas,
> > >>>
> > >>> on my NAS I am encountering the following issue since v6.6.44 (LTS),
> > >>> when executing the hdparm command for my WD-WCC7K4NLX884 drives to get
> > >>> the active or standby state:
> > >>>
> > >>> $ hdparm -C /dev/sda
> > >>> /dev/sda:
> > >>> SG_IO: bad/missing sense data, sb[]: f0 00 01 00 50 40 ff 0a 00 00 78 00 00 1d 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > >>> drive state is: unknown
> > >>>
> > >>>
> > >>> While the expected output is the following:
> > >>>
> > >>> $ hdparm -C /dev/sda
> > >>> /dev/sda:
> > >>> drive state is: active/idle
> > >>>
> >
> > Yes, indeed. I do not want to revert any of these recent patches, because as you
> > rightly summarize here, these fix something that has been broken for a long
> > time. We were just lucky that we did not see more application failures until
> > now, or rather unlucky that we did not as that would have revealed these
> > problems earlier.
> >
> > So I think we will have some patching to do to hdparm at least to fix the
> > problems there.
>
> It seems like this does not only break hdparm but also hddtemp, which
> does not use hdparm as dep as far as I can tell:
>
> # on bad kernel for the above issue
> $ hddtemp /dev/sda
> /dev/sda: WDC WD40EFRX-68N32N0 : drive is sleeping
>
> # on good kernel for the above issue
> $ hddtemp /dev/sda
> /dev/sda: WDC WD40EFRX-68N32N0: 31°C
>
> I didn't take the time to actually verify that this is the same issue,
> but it seems very likely from what we have gathered in this thread
> already.
>
> So while I agree that it might have previously just worked by chance it
> seems like there is quite some stuff depending on the previous behavior.
>
> This was first discovered in [this thread in the Arch Linux Forums][0]
> by user @GerBra.
>
> ~Chris
>
> [0]: https://bbs.archlinux.org/viewtopic.php?id=298407
As someone on the same thread has pointed out, this also seems to affect
udiskd:
https://github.com/storaged-project/udisks/issues/732
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [REGRESSION][BISECTED][STABLE] hdparm errors since 28ab9769117c
2024-08-09 20:13 ` Christian Heusel
@ 2024-08-12 9:26 ` Linux regression tracking (Thorsten Leemhuis)
2024-08-19 18:45 ` Salvatore Bonaccorso
0 siblings, 1 reply; 10+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-08-12 9:26 UTC (permalink / raw)
To: Christian Heusel, Damien Le Moal
Cc: Niklas Cassel, Igor Pylypiv, linux-ide, Hannes Reinecke,
regressions, stable, linux-kernel
On 09.08.24 22:13, Christian Heusel wrote:
> On 24/08/09 08:42PM, Christian Heusel wrote:
>> On 24/08/09 08:34AM, Damien Le Moal wrote:
>>> On 2024/08/07 15:10, Niklas Cassel wrote:
>>>> On Wed, Aug 07, 2024 at 11:26:46AM -0700, Damien Le Moal wrote:
>>>>> On 2024/08/07 10:23, Christian Heusel wrote:
>>>>>>
>>>>>> on my NAS I am encountering the following issue since v6.6.44 (LTS),
>>>>>> when executing the hdparm command for my WD-WCC7K4NLX884 drives to get
>>>>>> the active or standby state:
>>> [...]
>>> Yes, indeed. I do not want to revert any of these recent patches, because as you
>>> rightly summarize here, these fix something that has been broken for a long
>>> time. We were just lucky that we did not see more application failures until
>>> now, or rather unlucky that we did not as that would have revealed these
>>> problems earlier.
>>
>> It seems like this does not only break hdparm but also hddtemp, which
>> does not use hdparm as dep as far as I can tell:
>
> As someone on the same thread has pointed out, this also seems to affect
> udiskd:
>
> https://github.com/storaged-project/udisks/issues/732
For the record, three more people reported similar symptoms in the past
few days:
https://lore.kernel.org/all/e620f887-a674-f007-c17b-dc16f9a0a588@web.de/
https://bugzilla.kernel.org/show_bug.cgi?id=219144
Ciao, Thorsten
P.S.: I for the tracking for now assume those are indeed the same problem:
#regzbot dup:
https://lore.kernel.org/all/e620f887-a674-f007-c17b-dc16f9a0a588@web.de/
#regzbot dup: https://bugzilla.kernel.org/show_bug.cgi?id=219144
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [REGRESSION][BISECTED][STABLE] hdparm errors since 28ab9769117c
2024-08-12 9:26 ` Linux regression tracking (Thorsten Leemhuis)
@ 2024-08-19 18:45 ` Salvatore Bonaccorso
0 siblings, 0 replies; 10+ messages in thread
From: Salvatore Bonaccorso @ 2024-08-19 18:45 UTC (permalink / raw)
To: Linux regressions mailing list
Cc: Christian Heusel, Damien Le Moal, Niklas Cassel, Igor Pylypiv,
linux-ide, Hannes Reinecke, stable, linux-kernel
Hi,
On Mon, Aug 12, 2024 at 11:26:53AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 09.08.24 22:13, Christian Heusel wrote:
> > On 24/08/09 08:42PM, Christian Heusel wrote:
> >> On 24/08/09 08:34AM, Damien Le Moal wrote:
> >>> On 2024/08/07 15:10, Niklas Cassel wrote:
> >>>> On Wed, Aug 07, 2024 at 11:26:46AM -0700, Damien Le Moal wrote:
> >>>>> On 2024/08/07 10:23, Christian Heusel wrote:
> >>>>>>
> >>>>>> on my NAS I am encountering the following issue since v6.6.44 (LTS),
> >>>>>> when executing the hdparm command for my WD-WCC7K4NLX884 drives to get
> >>>>>> the active or standby state:
> >>> [...]
> >>> Yes, indeed. I do not want to revert any of these recent patches, because as you
> >>> rightly summarize here, these fix something that has been broken for a long
> >>> time. We were just lucky that we did not see more application failures until
> >>> now, or rather unlucky that we did not as that would have revealed these
> >>> problems earlier.
> >>
> >> It seems like this does not only break hdparm but also hddtemp, which
> >> does not use hdparm as dep as far as I can tell:
> >
> > As someone on the same thread has pointed out, this also seems to affect
> > udiskd:
> >
> > https://github.com/storaged-project/udisks/issues/732
>
> For the record, three more people reported similar symptoms in the past
> few days:
>
> https://lore.kernel.org/all/e620f887-a674-f007-c17b-dc16f9a0a588@web.de/
> https://bugzilla.kernel.org/show_bug.cgi?id=219144
>
> Ciao, Thorsten
>
> P.S.: I for the tracking for now assume those are indeed the same problem:
>
> #regzbot dup:
> https://lore.kernel.org/all/e620f887-a674-f007-c17b-dc16f9a0a588@web.de/
> #regzbot dup: https://bugzilla.kernel.org/show_bug.cgi?id=219144
AFAICS, this has now been reverted upstream with fa0db8e56878 ("Revert
"ata: libata-scsi: Honor the D_SENSE bit for CK_COND=1 and no error"")
in 6.11-rc4 and the revert was as well backported to stable releases
(5.15.165, 6.1.106, 6.6.47 and 6.10.6)
Regards,
Salvatore
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2024-08-19 18:46 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-07 17:23 [REGRESSION][BISECTED][STABLE] hdparm errors since 28ab9769117c Christian Heusel
2024-08-07 18:26 ` Damien Le Moal
2024-08-07 22:10 ` Niklas Cassel
2024-08-09 14:30 ` Niklas Cassel
2024-08-09 15:21 ` Damien Le Moal
2024-08-09 15:34 ` Damien Le Moal
2024-08-09 18:42 ` Christian Heusel
2024-08-09 20:13 ` Christian Heusel
2024-08-12 9:26 ` Linux regression tracking (Thorsten Leemhuis)
2024-08-19 18:45 ` Salvatore Bonaccorso
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox