* acpi_nfit_query_poison() broken on non-ARS systems
@ 2016-04-29 23:36 Linda Knippers
2016-04-30 0:03 ` Vishal Verma
2016-04-30 0:12 ` Dan Williams
0 siblings, 2 replies; 12+ messages in thread
From: Linda Knippers @ 2016-04-29 23:36 UTC (permalink / raw)
To: Williams, Dan J, Vishal L Verma; +Cc: linux-nvdimm@lists.01.org
I tested Vishal's original ARS patches and they worked correctly on my test
system, which doesn't support ARS. By worked correctly, I mean that they
properly looked at the status of the Query ARS Capabilities function and saw
that it indicated that the function is not supported. The original code
skipped the rest of the ARS processing.
The code I tested has been re-worked a number of times and now it no longer
looks at the status of the Query function. It assumes that if the DSM worked
at all, it must be ok and happily uses fields that is shouldn't be looking at.
If you look here:
> static int acpi_nfit_query_poison(struct acpi_nfit_desc *acpi_desc,
> struct nfit_spa *nfit_spa)
> {
> struct acpi_nfit_system_address *spa = nfit_spa->spa;
> int rc;
>
> if (!nfit_spa->max_ars) {
On a platform that doesn't support ARS, max_ars will always be zero
so you'll keep executing this code when it seems like you're trying
to avoid that.
> struct nd_cmd_ars_cap ars_cap;
>
> memset(&ars_cap, 0, sizeof(ars_cap));
> rc = ars_get_cap(acpi_desc, &ars_cap, nfit_spa);
> if (rc < 0)
> return rc;
The call succeeds so this return isn't taken, but then the code assumes
everything is good. It should check ars_cap.status so see if the function
is supported or if there was an error and return something appropriate.
In previous version of the code, that's what acpi_nfit_find_poison() did.
Instead, this function continues, using data that's not right and making
more calls that also aren't supported.
> nfit_spa->max_ars = ars_cap.max_ars_out;
> nfit_spa->clear_err_unit = ars_cap.clear_err_unit;
Since we don't bail out, the above values are zero but not actually checked
by subsequent users.
> /* check that the supported scrub types match the spa type */
> if (nfit_spa_type(spa) == NFIT_SPA_VOLATILE &&
> ((ars_cap.status >> 16) & ND_ARS_VOLATILE) == 0)
> return -ENOTTY;
> else if (nfit_spa_type(spa) == NFIT_SPA_PM &&
> ((ars_cap.status >> 16) & ND_ARS_PERSISTENT) == 0)
> return -ENOTTY;
> }
>
> if (ars_status_alloc(acpi_desc, nfit_spa->max_ars))
> return -ENOMEM;
>
> rc = ars_get_status(acpi_desc);
> if (rc < 0 && rc != -ENOSPC)
> return rc;
>
> if (ars_status_process_records(acpi_desc->nvdimm_bus,
> acpi_desc->ars_status))
> return -ENOMEM;
>
> return 0;
> }
I don't know if you want to change how this function works or change how
ars_get_cap works but something needs to change. You actually do check the
status in xlat_status() if the call was initiated from user space but you don't
check when it's initiated from within the kernel.
For reference, below is the dmesg output from my system running the latest
upstream kernel.
-- ljk
[ 36.807155] nfit ACPI0012:00: acpi_nfit_ctl:bus cmd: ars_cap input length: 16
[ 36.807159] ars_cap00000000: 80000000 00000004 00000000 00000002
................
[ 36.807336] nfit ACPI0012:00: acpi_nfit_ctl:bus cmd: ars_cap output length: 16
[ 36.807338] ars_cap00000000: 00000001 00000000 00000000 00000000
................
[ 36.807427] nfit ACPI0012:00: acpi_nfit_ctl:bus cmd: ars_start input length: 24
[ 36.807430] ars_start00000000: 80000000 00000004 00000000 00000002
................
[ 36.807431] ars_start00000010: 00000002 00000000 ........
[ 36.807513] nfit ACPI0012:00: acpi_nfit_ctl:bus cmd: ars_start output length: 4
[ 36.807515] ars_start00000000: 00000001 ....
[ 36.807516] nfit ACPI0012:00: acpi_nfit_ctl:bus output object underflow cmd:
ars_start field: 1
[ 36.807518] nfit ACPI0012:00: acpi_nfit_ctl:bus cmd: ars_start input length: 24
[ 36.807520] ars_start00000000: 80000000 00000006 00000000 00000002
................
[ 36.807521] ars_start00000010: 00000002 00000000 ........
[ 36.807593] nfit ACPI0012:00: acpi_nfit_ctl:bus cmd: ars_start output length: 4
[ 36.807594] ars_start00000000: 00000001 ....
[ 36.807596] nfit ACPI0012:00: acpi_nfit_ctl:bus output object underflow cmd:
ars_start field: 1
[ 36.807597] nfit ACPI0012:00: acpi_nfit_ctl:bus cmd: ars_start input length: 24
[ 36.807599] ars_start00000000: 80000000 0000000c 00000000 00000002
................
[ 36.807600] ars_start00000010: 00000002 00000000 ........
[ 36.807670] nfit ACPI0012:00: acpi_nfit_ctl:bus cmd: ars_start output length: 4
[ 36.807671] ars_start00000000: 00000001 ....
[ 36.807673] nfit ACPI0012:00: acpi_nfit_ctl:bus output object underflow cmd:
ars_start field: 1
[ 36.807674] nfit ACPI0012:00: acpi_nfit_ctl:bus cmd: ars_start input length: 24
[ 36.807676] ars_start00000000: 80000000 0000000e 00000000 00000002
................
[ 36.807677] ars_start00000010: 00000002 00000000 ........
[ 36.807747] nfit ACPI0012:00: acpi_nfit_ctl:bus cmd: ars_start output length: 4
[ 36.807748] ars_start00000000: 00000001 ....
[ 36.807749] nfit ACPI0012:00: acpi_nfit_ctl:bus output object underflow cmd:
ars_start field: 1
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: acpi_nfit_query_poison() broken on non-ARS systems
2016-04-29 23:36 acpi_nfit_query_poison() broken on non-ARS systems Linda Knippers
@ 2016-04-30 0:03 ` Vishal Verma
2016-04-30 0:04 ` Dan Williams
2016-04-30 0:12 ` Dan Williams
1 sibling, 1 reply; 12+ messages in thread
From: Vishal Verma @ 2016-04-30 0:03 UTC (permalink / raw)
To: Linda Knippers, Williams, Dan J, Vishal L Verma; +Cc: linux-nvdimm@lists.01.org
On Fri, 2016-04-29 at 19:36 -0400, Linda Knippers wrote:
[snip]
>
> > static int acpi_nfit_query_poison(struct acpi_nfit_desc *acpi_desc,
> > struct nfit_spa *nfit_spa)
> > {
> > struct acpi_nfit_system_address *spa = nfit_spa->spa;
> > int rc;
> >
> > if (!nfit_spa->max_ars) {
> On a platform that doesn't support ARS, max_ars will always be zero
> so you'll keep executing this code when it seems like you're trying
> to avoid that.
>
> >
> > struct nd_cmd_ars_cap ars_cap;
> >
> > memset(&ars_cap, 0, sizeof(ars_cap));
> > rc = ars_get_cap(acpi_desc, &ars_cap, nfit_spa);
> > if (rc < 0)
> > return rc;
> The call succeeds so this return isn't taken, but then the code
> assumes
> everything is good. It should check ars_cap.status so see if the
> function
> is supported or if there was an error and return something
> appropriate.
> In previous version of the code, that's what acpi_nfit_find_poison()
> did.
> Instead, this function continues, using data that's not right and
> making
> more calls that also aren't supported.
Good point - I think we should add a check here and make sure ARS is
supported using the status field. I can work on a patch. Thanks for the
report!
-Vishal
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: acpi_nfit_query_poison() broken on non-ARS systems
2016-04-30 0:03 ` Vishal Verma
@ 2016-04-30 0:04 ` Dan Williams
2016-04-30 0:15 ` Vishal Verma
0 siblings, 1 reply; 12+ messages in thread
From: Dan Williams @ 2016-04-30 0:04 UTC (permalink / raw)
To: Vishal Verma; +Cc: linux-nvdimm@lists.01.org
On Fri, Apr 29, 2016 at 5:03 PM, Vishal Verma <vishal@kernel.org> wrote:
> On Fri, 2016-04-29 at 19:36 -0400, Linda Knippers wrote:
>
> [snip]
>>
>> > static int acpi_nfit_query_poison(struct acpi_nfit_desc *acpi_desc,
>> > struct nfit_spa *nfit_spa)
>> > {
>> > struct acpi_nfit_system_address *spa = nfit_spa->spa;
>> > int rc;
>> >
>> > if (!nfit_spa->max_ars) {
>> On a platform that doesn't support ARS, max_ars will always be zero
>> so you'll keep executing this code when it seems like you're trying
>> to avoid that.
>>
>> >
>> > struct nd_cmd_ars_cap ars_cap;
>> >
>> > memset(&ars_cap, 0, sizeof(ars_cap));
>> > rc = ars_get_cap(acpi_desc, &ars_cap, nfit_spa);
>> > if (rc < 0)
>> > return rc;
>> The call succeeds so this return isn't taken, but then the code
>> assumes
>> everything is good. It should check ars_cap.status so see if the
>> function
>> is supported or if there was an error and return something
>> appropriate.
>> In previous version of the code, that's what acpi_nfit_find_poison()
>> did.
>> Instead, this function continues, using data that's not right and
>> making
>> more calls that also aren't supported.
>
> Good point - I think we should add a check here and make sure ARS is
> supported using the status field. I can work on a patch. Thanks for the
> report!
...but we do, or are supposed to, fail on any non-zero status:
/* Command failed */
if (ars_cap->status & 0xffff)
return -EIO;
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: acpi_nfit_query_poison() broken on non-ARS systems
2016-04-30 0:04 ` Dan Williams
@ 2016-04-30 0:15 ` Vishal Verma
2016-04-30 0:15 ` Dan Williams
0 siblings, 1 reply; 12+ messages in thread
From: Vishal Verma @ 2016-04-30 0:15 UTC (permalink / raw)
To: Dan Williams; +Cc: linux-nvdimm@lists.01.org
On Fri, 2016-04-29 at 17:04 -0700, Dan Williams wrote:
> On Fri, Apr 29, 2016 at 5:03 PM, Vishal Verma <vishal@kernel.org>
> wrote:
> >
> > On Fri, 2016-04-29 at 19:36 -0400, Linda Knippers wrote:
> >
> > [snip]
> > >
> > >
> > > >
> > > > static int acpi_nfit_query_poison(struct acpi_nfit_desc
> > > > *acpi_desc,
> > > > struct nfit_spa *nfit_spa)
> > > > {
> > > > struct acpi_nfit_system_address *spa = nfit_spa->spa;
> > > > int rc;
> > > >
> > > > if (!nfit_spa->max_ars) {
> > > On a platform that doesn't support ARS, max_ars will always be
> > > zero
> > > so you'll keep executing this code when it seems like you're
> > > trying
> > > to avoid that.
> > >
> > > >
> > > >
> > > > struct nd_cmd_ars_cap ars_cap;
> > > >
> > > > memset(&ars_cap, 0, sizeof(ars_cap));
> > > > rc = ars_get_cap(acpi_desc, &ars_cap,
> > > > nfit_spa);
> > > > if (rc < 0)
> > > > return rc;
> > > The call succeeds so this return isn't taken, but then the code
> > > assumes
> > > everything is good. It should check ars_cap.status so see if the
> > > function
> > > is supported or if there was an error and return something
> > > appropriate.
> > > In previous version of the code, that's what
> > > acpi_nfit_find_poison()
> > > did.
> > > Instead, this function continues, using data that's not right and
> > > making
> > > more calls that also aren't supported.
> > Good point - I think we should add a check here and make sure ARS
> > is
> > supported using the status field. I can work on a patch. Thanks for
> > the
> > report!
> ...but we do, or are supposed to, fail on any non-zero status:
>
> /* Command failed */
> if (ars_cap->status & 0xffff)
> return -EIO;
This doesn't check the extended status, does it..
I think we'd also want something like:
/* ARS not supported */>
if (ars_cap->status & 0xffff0000 == 0)
return -EOPNOTSUPP;
From ACPI 6.1:
Extended Status (Len 2) (Off 2)
Bit[0] – If set to 1, indicates Volatile Memory scrub is supported
Bit[1] – If set to 1, indicates Persistent Memory Scrub is supported
Bits[15:2] – Reserved
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: acpi_nfit_query_poison() broken on non-ARS systems
2016-04-30 0:15 ` Vishal Verma
@ 2016-04-30 0:15 ` Dan Williams
2016-04-30 0:18 ` Vishal Verma
0 siblings, 1 reply; 12+ messages in thread
From: Dan Williams @ 2016-04-30 0:15 UTC (permalink / raw)
To: Vishal Verma; +Cc: linux-nvdimm@lists.01.org
On Fri, Apr 29, 2016 at 5:15 PM, Vishal Verma <vishal@kernel.org> wrote:
> On Fri, 2016-04-29 at 17:04 -0700, Dan Williams wrote:
>> On Fri, Apr 29, 2016 at 5:03 PM, Vishal Verma <vishal@kernel.org>
>> wrote:
>> >
>> > On Fri, 2016-04-29 at 19:36 -0400, Linda Knippers wrote:
>> >
>> > [snip]
>> > >
>> > >
>> > > >
>> > > > static int acpi_nfit_query_poison(struct acpi_nfit_desc
>> > > > *acpi_desc,
>> > > > struct nfit_spa *nfit_spa)
>> > > > {
>> > > > struct acpi_nfit_system_address *spa = nfit_spa->spa;
>> > > > int rc;
>> > > >
>> > > > if (!nfit_spa->max_ars) {
>> > > On a platform that doesn't support ARS, max_ars will always be
>> > > zero
>> > > so you'll keep executing this code when it seems like you're
>> > > trying
>> > > to avoid that.
>> > >
>> > > >
>> > > >
>> > > > struct nd_cmd_ars_cap ars_cap;
>> > > >
>> > > > memset(&ars_cap, 0, sizeof(ars_cap));
>> > > > rc = ars_get_cap(acpi_desc, &ars_cap,
>> > > > nfit_spa);
>> > > > if (rc < 0)
>> > > > return rc;
>> > > The call succeeds so this return isn't taken, but then the code
>> > > assumes
>> > > everything is good. It should check ars_cap.status so see if the
>> > > function
>> > > is supported or if there was an error and return something
>> > > appropriate.
>> > > In previous version of the code, that's what
>> > > acpi_nfit_find_poison()
>> > > did.
>> > > Instead, this function continues, using data that's not right and
>> > > making
>> > > more calls that also aren't supported.
>> > Good point - I think we should add a check here and make sure ARS
>> > is
>> > supported using the status field. I can work on a patch. Thanks for
>> > the
>> > report!
>> ...but we do, or are supposed to, fail on any non-zero status:
>>
>> /* Command failed */
>> if (ars_cap->status & 0xffff)
>> return -EIO;
>
> This doesn't check the extended status, does it..
> I think we'd also want something like:
>
> /* ARS not supported */>
> if (ars_cap->status & 0xffff0000 == 0)
> return -EOPNOTSUPP;
>
> From ACPI 6.1:
> Extended Status (Len 2) (Off 2)
> Bit[0] – If set to 1, indicates Volatile Memory scrub is supported
> Bit[1] – If set to 1, indicates Persistent Memory Scrub is supported
> Bits[15:2] – Reserved
See xlat_status():
/* No supported scan types for this range */
flags = ND_ARS_PERSISTENT | ND_ARS_VOLATILE;
if ((ars_cap->status >> 16 & flags) == 0)
return -ENOTTY;
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: acpi_nfit_query_poison() broken on non-ARS systems
2016-04-30 0:15 ` Dan Williams
@ 2016-04-30 0:18 ` Vishal Verma
2016-05-01 17:27 ` Linda Knippers
0 siblings, 1 reply; 12+ messages in thread
From: Vishal Verma @ 2016-04-30 0:18 UTC (permalink / raw)
To: Dan Williams; +Cc: linux-nvdimm@lists.01.org
On Fri, 2016-04-29 at 17:15 -0700, Dan Williams wrote:
> On Fri, Apr 29, 2016 at 5:15 PM, Vishal Verma <vishal@kernel.org>
> wrote:
> >
> > On Fri, 2016-04-29 at 17:04 -0700, Dan Williams wrote:
> > >
> > > On Fri, Apr 29, 2016 at 5:03 PM, Vishal Verma <vishal@kernel.org>
> > > wrote:
> > > >
> > > >
> > > > On Fri, 2016-04-29 at 19:36 -0400, Linda Knippers wrote:
> > > >
> > > > [snip]
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > static int acpi_nfit_query_poison(struct acpi_nfit_desc
> > > > > > *acpi_desc,
> > > > > > struct nfit_spa *nfit_spa)
> > > > > > {
> > > > > > struct acpi_nfit_system_address *spa = nfit_spa-
> > > > > > >spa;
> > > > > > int rc;
> > > > > >
> > > > > > if (!nfit_spa->max_ars) {
> > > > > On a platform that doesn't support ARS, max_ars will always
> > > > > be
> > > > > zero
> > > > > so you'll keep executing this code when it seems like you're
> > > > > trying
> > > > > to avoid that.
> > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > struct nd_cmd_ars_cap ars_cap;
> > > > > >
> > > > > > memset(&ars_cap, 0, sizeof(ars_cap));
> > > > > > rc = ars_get_cap(acpi_desc, &ars_cap,
> > > > > > nfit_spa);
> > > > > > if (rc < 0)
> > > > > > return rc;
> > > > > The call succeeds so this return isn't taken, but then the
> > > > > code
> > > > > assumes
> > > > > everything is good. It should check ars_cap.status so see if
> > > > > the
> > > > > function
> > > > > is supported or if there was an error and return something
> > > > > appropriate.
> > > > > In previous version of the code, that's what
> > > > > acpi_nfit_find_poison()
> > > > > did.
> > > > > Instead, this function continues, using data that's not right
> > > > > and
> > > > > making
> > > > > more calls that also aren't supported.
> > > > Good point - I think we should add a check here and make sure
> > > > ARS
> > > > is
> > > > supported using the status field. I can work on a patch. Thanks
> > > > for
> > > > the
> > > > report!
> > > ...but we do, or are supposed to, fail on any non-zero status:
> > >
> > > /* Command failed */
> > > if (ars_cap->status & 0xffff)
> > > return -EIO;
> > This doesn't check the extended status, does it..
> > I think we'd also want something like:
> >
> > /* ARS not supported */>
> > if (ars_cap->status & 0xffff0000 == 0)
> > return -EOPNOTSUPP;
> >
> > From ACPI 6.1:
> > Extended Status (Len 2) (Off 2)
> > Bit[0] – If set to 1, indicates Volatile Memory scrub is supported
> > Bit[1] – If set to 1, indicates Persistent Memory Scrub is
> > supported
> > Bits[15:2] – Reserved
> See xlat_status():
>
> /* No supported scan types for this range */
> flags = ND_ARS_PERSISTENT | ND_ARS_VOLATILE;
> if ((ars_cap->status >> 16 & flags) == 0)
> return -ENOTTY;\
Ah missed that - thanks. Yep in that case something is weird :)
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: acpi_nfit_query_poison() broken on non-ARS systems
2016-04-30 0:18 ` Vishal Verma
@ 2016-05-01 17:27 ` Linda Knippers
0 siblings, 0 replies; 12+ messages in thread
From: Linda Knippers @ 2016-05-01 17:27 UTC (permalink / raw)
To: Vishal Verma, Dan Williams; +Cc: linux-nvdimm@lists.01.org
On 4/29/2016 8:18 PM, Vishal Verma wrote:
> On Fri, 2016-04-29 at 17:15 -0700, Dan Williams wrote:
>> On Fri, Apr 29, 2016 at 5:15 PM, Vishal Verma <vishal@kernel.org>
>> wrote:
>>>
>>> On Fri, 2016-04-29 at 17:04 -0700, Dan Williams wrote:
>>>>
>>>> On Fri, Apr 29, 2016 at 5:03 PM, Vishal Verma <vishal@kernel.org>
>>>> wrote:
>>>>>
>>>>>
>>>>> On Fri, 2016-04-29 at 19:36 -0400, Linda Knippers wrote:
>>>>>
>>>>> [snip]
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> static int acpi_nfit_query_poison(struct acpi_nfit_desc
>>>>>>> *acpi_desc,
>>>>>>> struct nfit_spa *nfit_spa)
>>>>>>> {
>>>>>>> struct acpi_nfit_system_address *spa = nfit_spa-
>>>>>>>> spa;
>>>>>>> int rc;
>>>>>>>
>>>>>>> if (!nfit_spa->max_ars) {
>>>>>> On a platform that doesn't support ARS, max_ars will always
>>>>>> be
>>>>>> zero
>>>>>> so you'll keep executing this code when it seems like you're
>>>>>> trying
>>>>>> to avoid that.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> struct nd_cmd_ars_cap ars_cap;
>>>>>>>
>>>>>>> memset(&ars_cap, 0, sizeof(ars_cap));
>>>>>>> rc = ars_get_cap(acpi_desc, &ars_cap,
>>>>>>> nfit_spa);
>>>>>>> if (rc < 0)
>>>>>>> return rc;
>>>>>> The call succeeds so this return isn't taken, but then the
>>>>>> code
>>>>>> assumes
>>>>>> everything is good. It should check ars_cap.status so see if
>>>>>> the
>>>>>> function
>>>>>> is supported or if there was an error and return something
>>>>>> appropriate.
>>>>>> In previous version of the code, that's what
>>>>>> acpi_nfit_find_poison()
>>>>>> did.
>>>>>> Instead, this function continues, using data that's not right
>>>>>> and
>>>>>> making
>>>>>> more calls that also aren't supported.
>>>>> Good point - I think we should add a check here and make sure
>>>>> ARS
>>>>> is
>>>>> supported using the status field. I can work on a patch. Thanks
>>>>> for
>>>>> the
>>>>> report!
>>>> ...but we do, or are supposed to, fail on any non-zero status:
>>>>
>>>> /* Command failed */
>>>> if (ars_cap->status & 0xffff)
>>>> return -EIO;
>>> This doesn't check the extended status, does it..
>>> I think we'd also want something like:
>>>
>>> /* ARS not supported */>
>>> if (ars_cap->status & 0xffff0000 == 0)
>>> return -EOPNOTSUPP;
>>>
>>> From ACPI 6.1:
>>> Extended Status (Len 2) (Off 2)
>>> Bit[0] – If set to 1, indicates Volatile Memory scrub is supported
>>> Bit[1] – If set to 1, indicates Persistent Memory Scrub is
>>> supported
>>> Bits[15:2] – Reserved
>> See xlat_status():
>>
>> /* No supported scan types for this range */
>> flags = ND_ARS_PERSISTENT | ND_ARS_VOLATILE;
>> if ((ars_cap->status >> 16 & flags) == 0)
>> return -ENOTTY;\
>
> Ah missed that - thanks. Yep in that case something is weird :)
The problem is that xlat_status() isn't called when you're initiating the
sequence from acpi_nfit_query_poison(). I think it's only called when
the DSM is initiated from user space.
If you go back to my original message, you'll see that.
-- ljk
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: acpi_nfit_query_poison() broken on non-ARS systems
2016-04-29 23:36 acpi_nfit_query_poison() broken on non-ARS systems Linda Knippers
2016-04-30 0:03 ` Vishal Verma
@ 2016-04-30 0:12 ` Dan Williams
2016-05-01 17:42 ` Linda Knippers
1 sibling, 1 reply; 12+ messages in thread
From: Dan Williams @ 2016-04-30 0:12 UTC (permalink / raw)
To: Linda Knippers; +Cc: linux-nvdimm@lists.01.org
On Fri, Apr 29, 2016 at 4:36 PM, Linda Knippers <linda.knippers@hpe.com> wrote:
> I tested Vishal's original ARS patches and they worked correctly on my test
> system, which doesn't support ARS. By worked correctly, I mean that they
> properly looked at the status of the Query ARS Capabilities function and saw
> that it indicated that the function is not supported. The original code
> skipped the rest of the ARS processing.
Why doesn't the firmware just not export the ARS function in the
supported functions list?
We should still fix this bug, but the whole "implement a function to
say this function is not implemented" seems a bit of a waste.
> The code I tested has been re-worked a number of times and now it no longer
> looks at the status of the Query function. It assumes that if the DSM worked
> at all, it must be ok and happily uses fields that is shouldn't be looking at.
>
> If you look here:
>
>> static int acpi_nfit_query_poison(struct acpi_nfit_desc *acpi_desc,
>> struct nfit_spa *nfit_spa)
>> {
>> struct acpi_nfit_system_address *spa = nfit_spa->spa;
>> int rc;
>>
>> if (!nfit_spa->max_ars) {
>
> On a platform that doesn't support ARS, max_ars will always be zero
> so you'll keep executing this code when it seems like you're trying
> to avoid that.
>
>> struct nd_cmd_ars_cap ars_cap;
>>
>> memset(&ars_cap, 0, sizeof(ars_cap));
>> rc = ars_get_cap(acpi_desc, &ars_cap, nfit_spa);
>> if (rc < 0)
>> return rc;
>
> The call succeeds so this return isn't taken, but then the code assumes
> everything is good. It should check ars_cap.status so see if the function
> is supported or if there was an error and return something appropriate.
> In previous version of the code, that's what acpi_nfit_find_poison() did.
> Instead, this function continues, using data that's not right and making
> more calls that also aren't supported.
Hmm, we should be bailing on any non-zero status, why is that getting skipped?
/* Command failed */
if (ars_cap->status & 0xffff)
return -EIO;
>
>> nfit_spa->max_ars = ars_cap.max_ars_out;
>> nfit_spa->clear_err_unit = ars_cap.clear_err_unit;
>
> Since we don't bail out, the above values are zero but not actually checked
> by subsequent users.
>
>> /* check that the supported scrub types match the spa type */
>> if (nfit_spa_type(spa) == NFIT_SPA_VOLATILE &&
>> ((ars_cap.status >> 16) & ND_ARS_VOLATILE) == 0)
>> return -ENOTTY;
>> else if (nfit_spa_type(spa) == NFIT_SPA_PM &&
>> ((ars_cap.status >> 16) & ND_ARS_PERSISTENT) == 0)
>> return -ENOTTY;
>> }
>>
>> if (ars_status_alloc(acpi_desc, nfit_spa->max_ars))
>> return -ENOMEM;
>>
>> rc = ars_get_status(acpi_desc);
>> if (rc < 0 && rc != -ENOSPC)
>> return rc;
>>
>> if (ars_status_process_records(acpi_desc->nvdimm_bus,
>> acpi_desc->ars_status))
>> return -ENOMEM;
>>
>> return 0;
>> }
>
> I don't know if you want to change how this function works or change how
> ars_get_cap works but something needs to change. You actually do check the
> status in xlat_status() if the call was initiated from user space but you don't
> check when it's initiated from within the kernel.
Hmm, why do you say that? xlat_status() only gets skipped if the
output buffer is underrun and in that case the underrun should trigger
-ENXIO to be returned.
> For reference, below is the dmesg output from my system running the latest
> upstream kernel.
Can you add some debug tracing to xlat_status()? Something is not lining up.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: acpi_nfit_query_poison() broken on non-ARS systems
2016-04-30 0:12 ` Dan Williams
@ 2016-05-01 17:42 ` Linda Knippers
2016-05-01 19:44 ` Linda Knippers
0 siblings, 1 reply; 12+ messages in thread
From: Linda Knippers @ 2016-05-01 17:42 UTC (permalink / raw)
To: Dan Williams; +Cc: linux-nvdimm@lists.01.org
On 4/29/2016 8:12 PM, Dan Williams wrote:
> On Fri, Apr 29, 2016 at 4:36 PM, Linda Knippers <linda.knippers@hpe.com> wrote:
>> I tested Vishal's original ARS patches and they worked correctly on my test
>> system, which doesn't support ARS. By worked correctly, I mean that they
>> properly looked at the status of the Query ARS Capabilities function and saw
>> that it indicated that the function is not supported. The original code
>> skipped the rest of the ARS processing.
>
> Why doesn't the firmware just not export the ARS function in the
> supported functions list?
It's a root device function. In theory a platform might support ARS for some
SPA ranges but not for others so even if the function is supported in general,
it might not be supported for all ranges. An SPA range might also support ARS
but not the ability to clear an error. I think you have to check all these
things.
> We should still fix this bug, but the whole "implement a function to
> say this function is not implemented" seems a bit of a waste.
Even if all the functions are supported, you still have to check the status
because something could fail.
>
>> The code I tested has been re-worked a number of times and now it no longer
>> looks at the status of the Query function. It assumes that if the DSM worked
>> at all, it must be ok and happily uses fields that is shouldn't be looking at.
>>
>> If you look here:
>>
>>> static int acpi_nfit_query_poison(struct acpi_nfit_desc *acpi_desc,
>>> struct nfit_spa *nfit_spa)
>>> {
>>> struct acpi_nfit_system_address *spa = nfit_spa->spa;
>>> int rc;
>>>
>>> if (!nfit_spa->max_ars) {
>>
>> On a platform that doesn't support ARS, max_ars will always be zero
>> so you'll keep executing this code when it seems like you're trying
>> to avoid that.
>>
>>> struct nd_cmd_ars_cap ars_cap;
>>>
>>> memset(&ars_cap, 0, sizeof(ars_cap));
>>> rc = ars_get_cap(acpi_desc, &ars_cap, nfit_spa);
>>> if (rc < 0)
>>> return rc;
>>
>> The call succeeds so this return isn't taken, but then the code assumes
>> everything is good. It should check ars_cap.status so see if the function
>> is supported or if there was an error and return something appropriate.
>> In previous version of the code, that's what acpi_nfit_find_poison() did.
>> Instead, this function continues, using data that's not right and making
>> more calls that also aren't supported.
>
> Hmm, we should be bailing on any non-zero status, why is that getting skipped?
>
> /* Command failed */
> if (ars_cap->status & 0xffff)
> return -EIO;
>
>>
>>> nfit_spa->max_ars = ars_cap.max_ars_out;
>>> nfit_spa->clear_err_unit = ars_cap.clear_err_unit;
>>
>> Since we don't bail out, the above values are zero but not actually checked
>> by subsequent users.
>>
>>> /* check that the supported scrub types match the spa type */
>>> if (nfit_spa_type(spa) == NFIT_SPA_VOLATILE &&
>>> ((ars_cap.status >> 16) & ND_ARS_VOLATILE) == 0)
>>> return -ENOTTY;
>>> else if (nfit_spa_type(spa) == NFIT_SPA_PM &&
>>> ((ars_cap.status >> 16) & ND_ARS_PERSISTENT) == 0)
>>> return -ENOTTY;
>>> }
>>>
>>> if (ars_status_alloc(acpi_desc, nfit_spa->max_ars))
>>> return -ENOMEM;
>>>
>>> rc = ars_get_status(acpi_desc);
>>> if (rc < 0 && rc != -ENOSPC)
>>> return rc;
>>>
>>> if (ars_status_process_records(acpi_desc->nvdimm_bus,
>>> acpi_desc->ars_status))
>>> return -ENOMEM;
>>>
>>> return 0;
>>> }
>>
>> I don't know if you want to change how this function works or change how
>> ars_get_cap works but something needs to change. You actually do check the
>> status in xlat_status() if the call was initiated from user space but you don't
>> check when it's initiated from within the kernel.
>
> Hmm, why do you say that? xlat_status() only gets skipped if the
> output buffer is underrun and in that case the underrun should trigger
> -ENXIO to be returned.
>
>> For reference, below is the dmesg output from my system running the latest
>> upstream kernel.
>
> Can you add some debug tracing to xlat_status()? Something is not lining up.
xlat_status() is not called from acpi_nfit_query_poison(), which is called at
boot time. Not sure if we're getting there from acpi_nfit_scrub() or
acpi_nfit_async_scrub().
-- ljk
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: acpi_nfit_query_poison() broken on non-ARS systems
2016-05-01 17:42 ` Linda Knippers
@ 2016-05-01 19:44 ` Linda Knippers
2016-05-02 0:33 ` Dan Williams
0 siblings, 1 reply; 12+ messages in thread
From: Linda Knippers @ 2016-05-01 19:44 UTC (permalink / raw)
To: Dan Williams; +Cc: linux-nvdimm@lists.01.org
[-- Attachment #1: Type: text/plain, Size: 874 bytes --]
On 5/1/2016 1:42 PM, Linda Knippers wrote:
<snip>
>>
>> Can you add some debug tracing to xlat_status()? Something is not lining up.
I put a printk at the top of xlat_status() as well as in each case statement.
>
> xlat_status() is not called from acpi_nfit_query_poison(), which is called at
> boot time. Not sure if we're getting there from acpi_nfit_scrub() or
> acpi_nfit_async_scrub().
Ok, it actually is being called, but only for the start command. See attached.
BTW, this is with your rewrite of Jerry's patches but I saw the same
behavior before I applied them as that's what the previous dmesg output
was from.
One thing you'll notice in this new dmesg output is that unlike before,
the acpi_nfit_ctl() function prints the DSM number, not the DSM name,
when displaying the input. It prints the DSM name when displaying the
output.
-- ljk
>
> -- ljk
>
[-- Attachment #2: ars_debug.txt --]
[-- Type: text/plain, Size: 6085 bytes --]
[ 33.770104] nfit ACPI0012:00: add_spa: spa index: 1 type: pmem
[ 33.770107] nfit ACPI0012:00: add_memdev: memdev handle: 0x10 spa: 1 dcr: 1
[ 33.770109] nfit ACPI0012:00: add_dcr: dcr index: 1 windows: 0
[ 33.770111] nfit ACPI0012:00: add_spa: spa index: 2 type: pmem
[ 33.770113] nfit ACPI0012:00: add_memdev: memdev handle: 0x120 spa: 2 dcr: 2
[ 33.770115] nfit ACPI0012:00: add_dcr: dcr index: 2 windows: 0
[ 33.770116] nfit ACPI0012:00: add_spa: spa index: 3 type: pmem
[ 33.770118] nfit ACPI0012:00: add_memdev: memdev handle: 0x1010 spa: 3 dcr: 3
[ 33.770120] nfit ACPI0012:00: add_dcr: dcr index: 3 windows: 0
[ 33.770122] nfit ACPI0012:00: add_spa: spa index: 4 type: pmem
[ 33.770123] nfit ACPI0012:00: add_memdev: memdev handle: 0x1120 spa: 4 dcr: 4
[ 33.770125] nfit ACPI0012:00: add_dcr: dcr index: 4 windows: 0
[ 33.771719] nvdimm nmem0: nvdimm_init_nsarea [libnvdimm]: validate_dimm error: -6
[ 33.771721] nvdimm nmem0: nvdimm_drvdata_release
[ 33.771724] ndbus0: nvdimm.probe(nmem0) = -6
[ 33.772229] nvdimm nmem1: nvdimm_init_nsarea [libnvdimm]: validate_dimm error: -6
[ 33.772230] nvdimm nmem1: nvdimm_drvdata_release
[ 33.772232] ndbus0: nvdimm.probe(nmem1) = -6
[ 33.773135] nvdimm nmem2: nvdimm_init_nsarea [libnvdimm]: validate_dimm error: -6
[ 33.773136] nvdimm nmem2: nvdimm_drvdata_release
[ 33.773138] ndbus0: nvdimm.probe(nmem2) = -6
[ 33.774084] nvdimm nmem3: nvdimm_init_nsarea [libnvdimm]: validate_dimm error: -6
[ 33.774086] nvdimm nmem3: nvdimm_drvdata_release
[ 33.774087] ndbus0: nvdimm.probe(nmem3) = -6
[ 33.774101] ndbus0: nvdimm_bus_check_dimm_count: count: 4
[ 33.774115] nfit ACPI0012:00: acpi_nfit_ctl:bus cmd: 1: func: 1 input length: 16
[ 33.774119] nvdimm in 00000000: 80000000 00000004 00000000 00000002 ................
[ 33.774306] nfit ACPI0012:00: acpi_nfit_ctl:bus cmd: ars_cap output length: 16
[ 33.774309] ars_cap00000000: 00000001 00000000 00000000 00000000 ................
[ 33.774347] nfit ACPI0012:00: acpi_nfit_ctl:bus cmd: 1: func: 1 input length: 16
[ 33.774350] nvdimm in 00000000: 80000000 00000006 00000000 00000002 ................
[ 33.774416] ndbus0: nd_region.probe(region0) = 0
[ 33.774501] nfit ACPI0012:00: acpi_nfit_ctl:bus cmd: ars_cap output length: 16
[ 33.774503] ars_cap00000000: 00000001 00000000 00000000 00000000 ................
[ 33.774545] nfit ACPI0012:00: acpi_nfit_ctl:bus cmd: 2: func: 2 input length: 24
[ 33.774547] nvdimm in 00000000: 80000000 00000006 00000000 00000002 ................
[ 33.774549] nvdimm in 00000010: 00000002 00000000 ........
[ 33.774602] ndbus0: nd_region.probe(region1) = 0
[ 33.774662] nfit ACPI0012:00: acpi_nfit_ctl:bus cmd: ars_start output length: 4
[ 33.774664] ars_start00000000: 00000001 ....
[ 33.774668] nfit ACPI0012:00: acpi_nfit_ctl:bus output object underflow cmd: ars_start field: 1
[ 33.774670] ndbus0: nd_region.probe(region2) = 0
[ 33.774671] xlat_status: called with cmd 2
[ 33.774672] xlat_status: ND_CMD_ARS_START status 0x1
[ 33.774676] nfit ACPI0012:00: acpi_nfit_ctl:bus cmd: 2: func: 2 input length: 24
[ 33.774678] ndbus0: nd_region.probe(region3) = 0
[ 33.774680] nvdimm in 00000000: 80000000 0000000c 00000000 00000002 ................
[ 33.774683] nvdimm in 00000010: 00000002 00000000 ........
[ 33.774783] nfit ACPI0012:00: acpi_nfit_ctl:bus cmd: ars_start output length: 4
[ 33.774784] ars_start00000000: 00000001 ....
[ 33.774786] nfit ACPI0012:00: acpi_nfit_ctl:bus output object underflow cmd: ars_start field: 1
[ 33.774787] xlat_status: called with cmd 2
[ 33.774787] xlat_status: ND_CMD_ARS_START status 0x1
[ 33.774789] nfit ACPI0012:00: acpi_nfit_ctl:bus cmd: 2: func: 2 input length: 24
[ 33.774791] nvdimm in 00000000: 80000000 0000000e 00000000 00000002 ................
[ 33.774793] nvdimm in 00000010: 00000002 00000000 ........
[ 33.774875] nfit ACPI0012:00: acpi_nfit_ctl:bus cmd: ars_start output length: 4
[ 33.774876] ars_start00000000: 00000001 ....
[ 33.774878] nfit ACPI0012:00: acpi_nfit_ctl:bus output object underflow cmd: ars_start field: 1
[ 33.774879] xlat_status: called with cmd 2
[ 33.774879] xlat_status: ND_CMD_ARS_START status 0x1
[ 34.230367] nd_pmem namespace0.0: unable to guarantee persistence of writes
[ 34.264312] nd_pmem namespace0.0: nd_btt_probe: btt: btt0.1
[ 34.278381] ndbus0: nd_pmem.probe(namespace0.0) = -6
[ 34.278459] nd_pmem btt0.1: unable to guarantee persistence of writes
[ 34.278464] ndbus0: nd_pmem.probe(btt0.0) = -19
[ 34.278490] nd_pmem namespace1.0: unable to guarantee persistence of writes
[ 34.278664] nd_pmem namespace1.0: nd_btt_probe: btt: <none>
[ 34.278665] btt1.1: nd_btt_release
[ 34.279153] pmem1: detected capacity change from 0 to 8589934592
[ 34.279157] ndbus0: nd_pmem.probe(namespace1.0) = 0
[ 34.279164] ndbus0: nd_pmem.probe(btt1.0) = -19
[ 34.279173] nd_pmem namespace2.0: unable to guarantee persistence of writes
[ 34.279325] nd_pmem namespace2.0: nd_btt_probe: btt: btt2.1
[ 34.404932] ndbus0: nd_pmem.probe(namespace2.0) = -6
[ 34.404950] nd_pmem btt2.1: unable to guarantee persistence of writes
[ 34.404954] ndbus0: nd_pmem.probe(btt2.0) = -19
[ 34.404960] ndbus0: nd_pmem.probe(btt3.0) = -19
[ 34.404969] nd_pmem namespace3.0: unable to guarantee persistence of writes
[ 34.405125] nd_pmem namespace3.0: nd_btt_probe: btt: btt3.1
[ 34.468103] ndbus0: nd_pmem.probe(namespace3.0) = -6
[ 34.468130] nd_pmem btt3.1: unable to guarantee persistence of writes
[ 34.468431] pmem0s: detected capacity change from 0 to 8523182080
[ 34.468486] ndbus0: nd_pmem.probe(btt0.1) = 0
[ 34.528239] pmem3s: detected capacity change from 0 to 8580472832
[ 34.529055] pmem2s: detected capacity change from 0 to 8580472832
[ 34.529126] ndbus0: nd_pmem.probe(btt2.1) = 0
[ 34.684833] ndbus0: nd_pmem.probe(btt3.1) = 0
[-- Attachment #3: Type: text/plain, Size: 151 bytes --]
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: acpi_nfit_query_poison() broken on non-ARS systems
2016-05-01 19:44 ` Linda Knippers
@ 2016-05-02 0:33 ` Dan Williams
2016-05-02 15:05 ` Linda Knippers
0 siblings, 1 reply; 12+ messages in thread
From: Dan Williams @ 2016-05-02 0:33 UTC (permalink / raw)
To: Linda Knippers; +Cc: linux-nvdimm@lists.01.org
On Sun, May 1, 2016 at 12:44 PM, Linda Knippers <linda.knippers@hpe.com> wrote:
>
>
> On 5/1/2016 1:42 PM, Linda Knippers wrote:
> <snip>
>>>
>>> Can you add some debug tracing to xlat_status()? Something is not lining up.
>
> I put a printk at the top of xlat_status() as well as in each case statement.
>>
>> xlat_status() is not called from acpi_nfit_query_poison(), which is called at
>> boot time. Not sure if we're getting there from acpi_nfit_scrub() or
>> acpi_nfit_async_scrub().
>
> Ok, it actually is being called, but only for the start command. See attached.
>
> BTW, this is with your rewrite of Jerry's patches but I saw the same
> behavior before I applied them as that's what the previous dmesg output
> was from.
>
> One thing you'll notice in this new dmesg output is that unlike before,
> the acpi_nfit_ctl() function prints the DSM number, not the DSM name,
> when displaying the input. It prints the DSM name when displaying the
> output.
>
Found it. This is one of those brown-paper-bag bugs that tells me I
need to figure out a way to unit/regression test the interface between
acpi_nfit_ctl() and acpi_evalaute_dsm(). In this case I injected a
__wrap_acpi_evaulate_dsm(), but I can't ship it as a new test in
nfit_test because it relies on an enabled BIOS for all the other
cases.
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index d0f35e63640b..24ef312fdba3 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -287,8 +288,11 @@ static int acpi_nfit_ctl(struct
nvdimm_bus_descriptor *nd_desc,
offset);
rc = -ENXIO;
}
- } else
+ } else {
rc = 0;
+ if (cmd_rc)
+ *cmd_rc = xlat_status(buf, cmd);
+ }
out:
ACPI_FREE(out_obj);
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply related [flat|nested] 12+ messages in thread* Re: acpi_nfit_query_poison() broken on non-ARS systems
2016-05-02 0:33 ` Dan Williams
@ 2016-05-02 15:05 ` Linda Knippers
0 siblings, 0 replies; 12+ messages in thread
From: Linda Knippers @ 2016-05-02 15:05 UTC (permalink / raw)
To: Dan Williams; +Cc: linux-nvdimm@lists.01.org
On 5/1/2016 8:33 PM, Dan Williams wrote:
> Found it. This is one of those brown-paper-bag bugs that tells me I
> need to figure out a way to unit/regression test the interface between
> acpi_nfit_ctl() and acpi_evalaute_dsm(). In this case I injected a
> __wrap_acpi_evaulate_dsm(), but I can't ship it as a new test in
> nfit_test because it relies on an enabled BIOS for all the other
> cases.
Thanks. I'll give it a try.
-- ljk
>
> diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
> index d0f35e63640b..24ef312fdba3 100644
> --- a/drivers/acpi/nfit.c
> +++ b/drivers/acpi/nfit.c
> @@ -287,8 +288,11 @@ static int acpi_nfit_ctl(struct
> nvdimm_bus_descriptor *nd_desc,
> offset);
> rc = -ENXIO;
> }
> - } else
> + } else {
> rc = 0;
> + if (cmd_rc)
> + *cmd_rc = xlat_status(buf, cmd);
> + }
>
> out:
> ACPI_FREE(out_obj);
>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2016-05-02 15:05 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-04-29 23:36 acpi_nfit_query_poison() broken on non-ARS systems Linda Knippers
2016-04-30 0:03 ` Vishal Verma
2016-04-30 0:04 ` Dan Williams
2016-04-30 0:15 ` Vishal Verma
2016-04-30 0:15 ` Dan Williams
2016-04-30 0:18 ` Vishal Verma
2016-05-01 17:27 ` Linda Knippers
2016-04-30 0:12 ` Dan Williams
2016-05-01 17:42 ` Linda Knippers
2016-05-01 19:44 ` Linda Knippers
2016-05-02 0:33 ` Dan Williams
2016-05-02 15:05 ` Linda Knippers
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.