Linux Documentation
* Re: [PATCH v9 3/7] acpi: apei: Add SEI notification type support for ARMv8
From: James Morse @ 2018-04-12 16:14 UTC (permalink / raw)
  To: gengdongjiu
  Cc: lishuo1, merry.libing, gengdongjiu,
	linux-arm-kernel@lists.infradead.org, Liujun (Jun Liu),
	linux-kernel@vger.kernel.org, corbet@lwn.net,
	marc.zyngier@arm.com, catalin.marinas@arm.com,
	linux-doc@vger.kernel.org, rjw@rjwysocki.net,
	linux@armlinux.org.uk, will.deacon@arm.com,
	robert.moore@intel.com, linux-acpi@vger.kernel.org, bp@alien8.de,
	lv.zheng@intel.com, Huangshaoyu, kvmarm@lists.cs.columbia.edu,
	devel@acpica.org
In-Reply-To: <CAMj-D2CEcNjdi8VkSMw0aTqeb678nFnBKiq5ggix3gJhzkgSEA@mail.gmail.com>

Hi gengdongjiu,

On 12/04/18 06:00, gengdongjiu wrote:
> 2018-02-16 1:55 GMT+08:00 James Morse <james.morse@arm.com>:
>> On 05/02/18 11:24, gengdongjiu wrote:
>>>> Is the emulated SError routed following the routing rules for HCR_EL2.{AMO,
>>>> TGE}?
>>>
>>> Yes, it is.
>>
>> ... and yet ...
>>
>>
>>>> What does your firmware do when it wants to emulate SError but its masked?
>>>> (e.g.1: The physical-SError interrupted EL2 and the SPSR shows EL2 had
>>>> PSTATE.A  set.
>>>>  e.g.2: The physical-SError interrupted EL2 but HCR_EL2 indicates the
>>>> emulated  SError should go to EL1. This effectively masks SError.)
>>>
>>> Currently we does not consider much about the mask status(SPSR).
>>
>> .. this is a problem.
>>
>> If you ignore SPSR_EL3 you may deliver an SError to EL1 when the exception
>> interrupted EL2. Even if you setup the EL1 register correctly, EL1 can't eret to
>> EL2. This should never happen, SError is effectively masked if you are running
>> at an EL higher than the one its routed to.
>>
>> More obviously: if the exception came from the EL that SError should be routed
>> to, but PSTATE.A was set, you can't deliver SError. Masking SError is the only

> James, I  summarized the masking and routing rules for SError to
> confirm with you for the firmware first solution,

You also said "Currently we does not consider much about the mask status(SPSR)."


> 1. If the HCR_EL2.{AMO,TGE} is set,

If one or the other of these bits is set: (AMO==1 || TGE==1)

> which means the SError should route to EL2,
> When system happens SError and trap to EL3,   If EL3 find
> HCR_EL2.{AMO,TGE} and SPSR_EL3.A are both set,
> and find this SError come from EL2, it will not deliver an SError:
> store the RAS error in the BERT and 'reboot'; but if
> it find that this SError come from EL1 or EL0, it also need to deliver
> an SError, right?

Yes.


> 2. If the HCR_EL2.{AMO,TGE} is not set,

If neither of these bits is set: (AMO==0 && TGE == 0)

> which means the SError should route to EL1,
> When system happens SError and trap to EL3, If EL3 find
> HCR_EL2.{AMO,TGE} and SPSR_EL3.A are both not set,

(I'm reading this as all three of these bits are clear)

> and find this SError come from EL1, it will not deliver an SError:
> store the RAS error in the BERT and 'reboot'; 

No, (AMO==0 && TGE == 0) means SError is routed to EL1, this exception
interrupted EL1 and the A bit was clear, so EL1 can take an SError.

The two cases here are:
AMO==0,TGE==0 means SError should be routed to EL1. If SPSR_EL3 says the
exception interrupted EL1 and the A bit was set, you need to do the BERT trick.

If SPSR_EL3 says the exception interrupted EL2, you need to do the BERT trick
regardless of the A bit, as SError is implicitly masked by running at a higher
exception level than it was routed to.


From your v11 reply:
> 2. The exception came from the EL that SError should not be routed
> to(according to hcr_EL2.{AMO, TGE}),even though the PSTATE.A was set,EL3
> firmware still deliver SError

(this re-iterates the two cases above:)
'not be routed to' is one of two things: Route-to-EL2+interrupted-EL1, or
Route-to-EL1+interrupted-EL2.

Route-to-EL2+interrupted-EL1 is fine, regardless of SPSR_EL3.A the emulated
SError can be delivered to EL2, as EL2 can't mask SError when executing at a
lower EL.

Route-to-EL1+interrupted-EL2 is the problem. SError is implicitly masked by
running at a higher EL. Regardless of SPSR_EL3.A, the emulated SError can not be
delivered.
KVM does this on the way out of a guest, if an SError occurs during this time
the CPU will wait until execution returns to EL1 before delivering the SError.
Your firmware has to do the same.

Table D1-15 in "D1.14.2 Asynchronous exception masking" lists all the
combinations. The ARM-ARM is what this behaviour needs to match.


> but if it find that this SError come from EL0, it also need to deliver an
> SError, right?

I thought interrupted-EL0 could always be delivered: but re-reading the
ARM-ARM's "D1.14.2 Asynchronous exception masking", if asynchronous exceptions
are routed to EL1 then EL0&EL1 are treated the same.
So if SError is routed to EL1, the exception interrupted EL0, and SPSR_EL3.A was
set, you still can't deliver the emulated SError: you have to do the BERT trick.
Linux doesn't do this today, but another OS might (e.g. UEFI), and we might do
this in the future.
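
As a rough summary of the rules discussed above, the firmware's decision
could be sketched in C as below. This is only an illustration: the enum, the
function name and its parameters are made up for this example and are not
taken from any firmware or from the kernel. Table D1-15 in the ARM-ARM
remains the authority.

```c
#include <stdbool.h>

/* Exception level the physical SError interrupted, as read from SPSR_EL3. */
enum interrupted_el { FROM_EL0 = 0, FROM_EL1 = 1, FROM_EL2 = 2 };

/*
 * Hypothetical helper: returns true if EL3 firmware may deliver an emulated
 * SError, false if it has to fall back to the BERT trick instead.
 */
static bool can_deliver_emulated_serror(bool hcr_amo, bool hcr_tge,
					enum interrupted_el from, bool spsr_a)
{
	/* Either bit being set routes SError to EL2. */
	bool routed_to_el2 = hcr_amo || hcr_tge;

	if (routed_to_el2) {
		/* EL2 can't mask SError while executing at a lower EL. */
		if (from != FROM_EL2)
			return true;
		/* Interrupted EL2 itself: honour the saved PSTATE.A. */
		return !spsr_a;
	}

	/* Routed to EL1: implicitly masked while running at EL2. */
	if (from == FROM_EL2)
		return false;

	/* Routed to EL1: EL0 and EL1 are treated the same. */
	return !spsr_a;
}
```

A 'false' result does not mean the error can be dropped, only that it cannot
be delivered as an SError at this point.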

This is really tricky for firmware to get right. Another alternative would be to
put the CPER records in a Polled buffer, unless something needs doing right now,
in which case a BERT-reboot is probably best.


Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH v11 2/4] arm/arm64: KVM: Add KVM_GET/SET_VCPU_EVENTS
From: gengdongjiu @ 2018-04-12 15:11 UTC (permalink / raw)
  To: James Morse, cdall
  Cc: Dongjiu Geng, devel, linux-doc, kvm, corbet, marc.zyngier,
	catalin.marinas, rjw, linux, linux-kernel, linux-acpi,
	zhengxiang9, bp, linux-arm-kernel, huangshaoyu, kvmarm,
	christoffer.dall, lenb
In-Reply-To: <d1d1c262-41ae-1725-7ae0-16d63ac3d372@arm.com>

Hi James,
  Thanks for the comments.

2018-04-10 22:15 GMT+08:00, James Morse <james.morse@arm.com>:
> Hi Dongjiu Geng,
>
> On 09/04/18 22:36, Dongjiu Geng wrote:
>> This new IOCTL exports user-invisible states related to SError.
>> Together with appropriate user space changes, it can inject
>> SError with specified syndrome to guest by setup kvm_vcpu_events
>> value.
>
>> Also it can support live migration.
>
> Could you explain what user-space is expected to do for this?
> (this is also relevant for snapshot-ing/suspending VMs)
Ok.

>
> It's probably worth noting that this solves an existing problem: KVM may
> make an
> SError pending, but user-space has no way to discover/migrate this.

If KVM makes an SError pending, then when user space does a migration it gets
the kvm_vcpu_events through KVM_GET_VCPU_EVENTS and can find that pending
status. What are the things you're worried about?

>
>
>> diff --git a/Documentation/virtual/kvm/api.txt
>> b/Documentation/virtual/kvm/api.txt
>> index 8a3d708..45719b4 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -819,11 +819,13 @@ struct kvm_clock_data {
>>
>>  Capability: KVM_CAP_VCPU_EVENTS
>>  Extended by: KVM_CAP_INTR_SHADOW
>> -Architectures: x86
>> +Architectures: x86, arm, arm64
>>  Type: vm ioctl
>>  Parameters: struct kvm_vcpu_event (out)
>>  Returns: 0 on success, -1 on error
>>
>> +X86:
>> +
>>  Gets currently pending exceptions, interrupts, and NMIs as well as
>> related
>>  states of the vcpu.
>>
>> @@ -865,15 +867,31 @@ Only two fields are defined in the flags field:
>>  - KVM_VCPUEVENT_VALID_SMM may be set in the flags field to signal that
>>    smi contains a valid state.
>>
>> +ARM, ARM64:
>> +
>> +Gets currently pending SError exceptions as well as related states of the
>> vcpu.
>> +
>> +struct kvm_vcpu_events {
>> +	struct {
>> +		__u8 serror_pending;
>> +		__u8 serror_has_esr;
>> +		/* Align it to 4 bytes */
>> +		__u8 pad[2];
>> +		__u64 serror_esr;
>> +	} exception;
>> +};
>> +
>
> I'm not convinced we should change this struct from the layout/size x86 has.
> Its
> confusing for the documentation, is this API call really the same on all
> architectures?
>
> What if we want to add some future interrupt, NMI or related state? We've
> found
> ourselves needing to add this API, it seems odd to remove its other uses on
> x86.
> We can't put them back in the future.
>
> Having a different layout would force user-space to ifdef/duplicate any
> code
> that accesses this between architectures.
In x86 and arm64 user-space code, the handling logic for
KVM_GET/SET_VCPU_EVENTS lives in different ARCH folders, so it may not be
necessary to share the handling code in user space.

>
>
>
> The compiler will want that __u64 to be naturally aligned to 8-bytes, so
> your
> 4-byte padding still causes some secret compiler-padding to be inserted.
> Different versions of the compiler may put it in different places.
>
>
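
To illustrate the padding point, here is a sketch using standard C types in
place of the __u8/__u64 kernel types. The struct name is made up for this
example, and padding out to 8 bytes explicitly is just one possible way to
leave nothing to the compiler. It is not something this patch does.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative layout only: explicit padding up to the 8-byte boundary. */
struct serror_events_explicit {
	struct {
		uint8_t serror_pending;
		uint8_t serror_has_esr;
		uint8_t pad[6];		/* 2 + 6 = 8 bytes before serror_esr */
		uint64_t serror_esr;
	} exception;
};

/* With explicit padding the layout no longer depends on implicit padding. */
_Static_assert(offsetof(struct serror_events_explicit,
			exception.serror_esr) == 8,
	       "serror_esr must start at offset 8");
_Static_assert(sizeof(struct serror_events_explicit) == 16,
	       "struct must be exactly 16 bytes");
```

With only pad[2], an ABI that aligns 64-bit integers to 8 bytes has to add
four more hidden bytes before serror_esr to reach the same offset.
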
>>  4.32 KVM_SET_VCPU_EVENTS
>>
>>  Capability: KVM_CAP_VCPU_EVENTS
>>  Extended by: KVM_CAP_INTR_SHADOW
>> -Architectures: x86
>> +Architectures: x86, arm, arm64
>>  Type: vm ioctl
>>  Parameters: struct kvm_vcpu_event (in)
>>  Returns: 0 on success, -1 on error
>>
>> +X86:
>> +
>>  Set pending exceptions, interrupts, and NMIs as well as related states of
>> the
>>  vcpu.
>>
>> @@ -894,6 +912,12 @@ shall be written into the VCPU.
>>
>>  KVM_VCPUEVENT_VALID_SMM can only be set if KVM_CAP_X86_SMM is available.
>>
>> +ARM, ARM64:
>> +
>> +Set pending SError exceptions as well as related states of the vcpu.
>> +
>> +See KVM_GET_VCPU_EVENTS for the data structure.
>> +
>>
>>  4.33 KVM_GET_DEBUGREGS
>>
>
>
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h
>> b/arch/arm64/include/uapi/asm/kvm.h
>> index 9abbf30..855cc9a 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -39,6 +39,7 @@
>>  #define __KVM_HAVE_GUEST_DEBUG
>>  #define __KVM_HAVE_IRQ_LINE
>>  #define __KVM_HAVE_READONLY_MEM
>> +#define __KVM_HAVE_VCPU_EVENTS
>>
>>  #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
>>
>> @@ -153,6 +154,17 @@ struct kvm_sync_regs {
>>  struct kvm_arch_memory_slot {
>>  };
>>
>> +/* for KVM_GET/SET_VCPU_EVENTS */
>> +struct kvm_vcpu_events {
>> +	struct {
>> +		__u8 serror_pending;
>> +		__u8 serror_has_esr;
>
>> +		/* Align it to 4 bytes */
>> +		__u8 pad[2];
>
> (padding noted above)
>
>
>> +		__u64 serror_esr;
>> +	} exception;
>> +};
>> +
>>  /* If you need to interpret the index values, here is the key: */
>>  #define KVM_REG_ARM_COPROC_MASK		0x000000000FFF0000
>>  #define KVM_REG_ARM_COPROC_SHIFT	16
>
>
>> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
>> index 5c7f657..42e1222 100644
>> --- a/arch/arm64/kvm/guest.c
>> +++ b/arch/arm64/kvm/guest.c
>> @@ -277,6 +277,37 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu
>> *vcpu,
>>  	return -EINVAL;
>>  }
>>
>> +int kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
>> +			struct kvm_vcpu_events *events)
>> +{
>> +	events->exception.serror_pending = (vcpu_get_hcr(vcpu) & HCR_VSE);
>> +	events->exception.serror_has_esr =
>> +			cpus_have_const_cap(ARM64_HAS_RAS_EXTN) &&
>> +					(!!vcpu_get_vsesr(vcpu));
>
>> +	events->exception.serror_esr = vcpu_get_vsesr(vcpu);
>
> This will return a stale ESR even if nothing is pending. On systems without
> the
> RAS extensions it will return 'ESR_ELx_ISV' if kvm_inject_vabt() has ever
> been
> called for this CPU.
>
> Could we hide this behind (pending && has_esr), setting it to 0 otherwise.
> This
> is just to avoid exposing the stale value.
Exactly, it is indeed.
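
A minimal sketch of what hiding the ESR behind (pending && has_esr) could
look like. It is self-contained, so the struct, the function and its
parameters are stand-ins invented for this example; the real code would read
the values through vcpu_get_hcr(), cpus_have_const_cap() and
vcpu_get_vsesr().

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the 'exception' part of struct kvm_vcpu_events. */
struct serror_state {
	uint8_t pending;
	uint8_t has_esr;
	uint64_t esr;
};

/*
 * hcr_vse_bits:  result of masking HCR_EL2 with the VSE bit
 * has_ras_extn:  whether the CPU has the RAS extensions
 * vsesr:         the (possibly stale) virtual SError syndrome
 */
static void fill_serror_state(struct serror_state *out, uint64_t hcr_vse_bits,
			      bool has_ras_extn, uint64_t vsesr)
{
	/* Normalise to 0/1 before storing the flag in a single byte. */
	bool pending = hcr_vse_bits != 0;
	bool has_esr = pending && has_ras_extn && vsesr != 0;

	out->pending = pending;
	out->has_esr = has_esr;
	/* Only expose the syndrome when it belongs to a pending SError. */
	out->esr = has_esr ? vsesr : 0;
}
```

The sketch also turns the masked HCR value into 0/1 before storing it in a
byte. A mask result above bit 7 would otherwise be truncated to 0; whether
that applies to HCR_VSE depends on its bit position, which is not shown in
this thread.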

>
>
>> +
>> +	return 0;
>> +}
>
>> +int kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
>> +			struct kvm_vcpu_events *events)
>> +{
>> +	bool injected = events->exception.serror_pending;
>> +	bool has_esr = events->exception.serror_has_esr;
>> +
>> +	if (injected && has_esr) {
>> +		if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN))
>> +			return -EINVAL;
>> +
>> +		kvm_set_sei_esr(vcpu, events->exception.serror_esr);
>> +
>> +	} else if (injected) {
>> +		kvm_inject_vabt(vcpu);
>
> Nit: looks like 'injected' is misnamed.

"injected" change to "pending"?

>
>
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>  int __attribute_const__ kvm_target_cpu(void)
>>  {
>>  	unsigned long implementor = read_cpuid_implementor();
>
>
>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>> index 38c8a64..20e919a 100644
>> --- a/arch/arm64/kvm/reset.c
>> +++ b/arch/arm64/kvm/reset.c
>> @@ -82,6 +82,7 @@ int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm,
>> long ext)
>>  		break;
>>  	case KVM_CAP_SET_GUEST_DEBUG:
>>  	case KVM_CAP_VCPU_ATTRIBUTES:
>> +	case KVM_CAP_VCPU_EVENTS:
>>  		r = 1;
>>  		break;
>>  	default:
>
>> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
>> index 7e3941f..945655d 100644
>> --- a/virt/kvm/arm/arm.c
>> +++ b/virt/kvm/arm/arm.c
>> @@ -1051,6 +1051,27 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>>  			return -EFAULT;
>>  		return kvm_arm_vcpu_has_attr(vcpu, &attr);
>>  	}
>> +	case KVM_GET_VCPU_EVENTS: {
>> +		struct kvm_vcpu_events events;
>> +
>> +		memset(&events, 0, sizeof(struct kvm_vcpu_events));
>
> sizeof(events) is the normal style here, it means if someone changes
> event's
> type we don't get any surprises...

Ok, thanks
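
For reference, the sizeof(variable) style looks like this in a standalone
sketch. The struct and function names here are invented for the example and
are not part of the patch.

```c
#include <stdint.h>
#include <string.h>

/* Made-up struct standing in for struct kvm_vcpu_events. */
struct demo_events {
	uint8_t pending;
	uint8_t has_esr;
	uint8_t pad[6];
	uint64_t esr;
};

static struct demo_events zeroed_demo_events(void)
{
	struct demo_events events;

	/*
	 * sizeof(events) tracks the variable: if its type is changed later,
	 * the size passed to memset() changes with it.
	 */
	memset(&events, 0, sizeof(events));
	return events;
}
```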

>
>
>> +		if (kvm_arm_vcpu_get_events(vcpu, &events))
>> +			return -EINVAL;
>> +
>> +		if (copy_to_user(argp, &events, sizeof(struct kvm_vcpu_events)))
>
> sizeof(events)
thanks

>
>
>> +			return -EFAULT;
>> +
>> +		return 0;
>> +	}
>> +	case KVM_SET_VCPU_EVENTS: {
>> +		struct kvm_vcpu_events events;
>> +
>> +		if (copy_from_user(&events, argp,
>> +				sizeof(struct kvm_vcpu_events)))
>> +			return -EFAULT;
>> +
>> +		return kvm_arm_vcpu_set_events(vcpu, &events);
>> +	}
>>  	default:
>>  		return -EINVAL;
>>  	}
>>
>
> Despite KVM_CAP_VCPU_EVENTS not being advertised on 32bit, any attempt to
> call
> it will still end up in here, but will always fail as the {g,s}et_events()
> calls
> always return -EINVAL. I don't think this will cause us any problems.
What are the things you're worried about?

>
>
> Thanks,
>
> James
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
>

* [PATCH 0/6] KASan for arm
From: Abbott Liu @ 2018-04-12 12:54 UTC (permalink / raw)
  To: aryabinin, glider, dvyukov, corbet, linux, cdall, marc.zyngier,
	arnd, nicolas.pitre, vladimir.murzin, mingo, viro,
	alexandre.belloni, ard.biesheuvel, daniel.lezcano, pombredanne,
	kstewart, liuwenliang, gregkh, f.fainelli, mawilcox, akpm,
	mark.rutland, afzal.mohd.ma, alexander.levin, tglx, thgarnie,
	dhowells, keescook, geert, tixy, julien.thierry, james.morse,
	zhichao.huang, jinb.park7, labbott, philip, grygorii.strashko,
	opendmb, mhocko, kirill.shutemov, kasan-dev, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm

From: Andrey Ryabinin <a.ryabinin@samsung.com>

Changelog:
v4 - v3
- Remove the fix of type conversion in kasan_cache_create because it has
  been fixed in the latest version in:
  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
- Change some Reviewed-by tags into Reported-by tags to avoid being misleading.
  ---Reported by: Marc Zyngier <marc.zyngier@arm.com>
                  Russell King - ARM Linux <linux@armlinux.org.uk>
- Disable instrumentation for arch/arm/mm/physaddr.c

v3 - v2
- Remove the patch "2 1-byte checks more safer for memory_is_poisoned_16",
  because an unaligned load/store of 16 bytes is rare on arm and the
  patch is very likely to hurt performance on modern CPUs.
  ---Acked by: Russell King - ARM Linux <linux@armlinux.org.uk>
- Fixed a link error: kasan_pmd_populate, kasan_pte_populate and
  kasan_pud_populate were in section .meminit.text, but the function
  kasan_alloc_block, which they all call, is in section .init.text. So
  kasan_pmd_populate, kasan_pte_populate and kasan_pud_populate are
  moved into section .init.text.
  ---Reported by: Florian Fainelli <f.fainelli@gmail.com>
- Fixed a compile error caused by a wrong access instruction in
  arch/arm/kernel/entry-common.S.
  ---Reported by: kbuild test robot <lkp@intel.com>
- Disable instrumentation for arch/arm/kvm/hyp/*.
  ---Acked by: Marc Zyngier <marc.zyngier@arm.com>
- Update the set of supported architectures in
  Documentation/dev-tools/kasan.rst.
  ---Acked by:Dmitry Vyukov <dvyukov@google.com>
- The version 2 is tested by:
  Florian Fainelli <f.fainelli@gmail.com> (compile test)
  kbuild test robot <lkp@intel.com>       (compile test)
  Joel Stanley <joel@jms.id.au>           (on ASPEED ast2500(ARMv5))

v2 - v1
- Fixed a compile error that occurs when changing the kernel compression
  mode to lzma/xz/lzo/lz4.
  ---Reported by: Florian Fainelli <f.fainelli@gmail.com>,
             Russell King - ARM Linux <linux@armlinux.org.uk>
- Fixed a compile error, reported by kbuild, caused by older arm
  instruction sets (armv4t) not supporting movw/movt.
- Changed the pte flag from _L_PTE_DEFAULT | L_PTE_DIRTY | L_PTE_XN to
  pgprot_val(PAGE_KERNEL).
  ---Reported by: Russell King - ARM Linux <linux@armlinux.org.uk>
- Moved Enable KASan patch as the last one.
  ---Reported by: Florian Fainelli <f.fainelli@gmail.com>,
     Russell King - ARM Linux <linux@armlinux.org.uk>
- Moved the definitions of cp15 registers from
  arch/arm/include/asm/kvm_hyp.h to arch/arm/include/asm/cp15.h.
  ---Asked by: Mark Rutland <mark.rutland@arm.com>
- Merge the following commits into the commit
  Define the virtual space of KASan's shadow region:
  1) Define the virtual space of KASan's shadow region;
  2) Avoid cleaning the KASan shadow area's mapping table;
  3) Add KASan layout;
- Merge the following commits into the commit
  Initialize the mapping of KASan shadow memory:
  1) Initialize the mapping of KASan shadow memory;
  2) Add support arm LPAE;
  3) Don't need to map the shadow of KASan's shadow memory;
     ---Reported by: Russell King - ARM Linux <linux@armlinux.org.uk>
  4) Change mapping of kasan_zero_page into readonly.
- The version 1 is tested by Florian Fainelli <f.fainelli@gmail.com>
  on a Cortex-A5 (no LPAE).

Hi all,
   These patches add arch-specific code for the kernel address sanitizer
(see Documentation/dev-tools/kasan.rst).

   1/8 of the kernel address space is reserved for shadow memory. There was
no big enough hole for this, so the virtual addresses for the shadow were
stolen from user space.

   At the early boot stage the whole shadow region is populated with just
one physical page (kasan_zero_page). Later, this page is reused
as a read-only zero shadow for memory that KASan currently
doesn't track (vmalloc).

  After mapping the physical memory, pages for shadow memory are
allocated and mapped.

  KASan's stack instrumentation significantly increases stack
consumption, so CONFIG_KASAN doubles THREAD_SIZE.

  Functions like memset/memmove/memcpy do a lot of memory accesses.
If a bad pointer is passed to one of these functions it is important
to catch this. Compiler instrumentation cannot do this since
these functions are written in assembly.

  KASan replaces the memory functions with manually instrumented variants.
The original functions are declared as weak symbols so that the strong
definitions in mm/kasan/kasan.c can replace them. The original functions
also have aliases with a '__' prefix in the name, so the non-instrumented
variant can be called if needed.

  Some files are built without kasan instrumentation (e.g. mm/slub.c).
For such files the original mem* functions are replaced (via #define) with
the prefixed variants to disable memory access checks.
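
  The split between an instrumented and a non-instrumented variant can be
pictured with the user-space sketch below. Everything in it is invented for
illustration: the names, the report counter, and the toy range_is_poisoned()
check that stands in for a real shadow-memory lookup. Unlike this toy, the
kernel's instrumented functions report the bad access rather than skipping
the operation.

```c
#include <stddef.h>
#include <string.h>

/* Counts detected bad accesses; a real sanitizer would print a report. */
static unsigned long bad_access_reports;

/* Toy stand-in for the shadow lookup: only a NULL range counts as bad. */
static int range_is_poisoned(const void *p, size_t len)
{
	return len != 0 && p == NULL;
}

/* Plays the role of the '__'-prefixed, non-instrumented variant. */
static void *raw_memset(void *dst, int c, size_t len)
{
	return memset(dst, c, len);
}

/* Plays the role of the manually instrumented variant. */
static void *checked_memset(void *dst, int c, size_t len)
{
	if (range_is_poisoned(dst, len)) {
		bad_access_reports++;
		return dst;	/* toy behaviour: skip the bad write */
	}
	return raw_memset(dst, c, len);
}
```

  A file that must not be checked would call raw_memset() directly, which is
what the #define onto the prefixed variants achieves in the patches.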

  On arm LPAE, the mapping table of the KASan shadow memory (if
PAGE_OFFSET is 0xc0000000, the KASan shadow memory's virtual space is
0xb6e00000~0xbf000000) can't be filled in by the do_translation_fault
function, because kasan instrumentation may cause do_translation_fault
to access KASan shadow memory itself. Accessing KASan shadow memory in
do_translation_fault may cause an endless loop. So the mapping table
of the KASan shadow memory needs to be copied in the pgd_alloc function.
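
  The 1/8 ratio and the shadow range quoted above for PAGE_OFFSET
0xc0000000 can be cross-checked with a small sketch. The offset value below
is not taken from the patches; it is derived here on the assumption that the
shadow starts at 0xb6e00000, ends at 0xbf000000, and covers the addresses
from 0xbf000000 up to the top of the 32-bit address space. The macro and
function names are likewise made up for this example.

```c
#include <stdint.h>

/* One shadow byte describes 8 bytes of memory. */
#define DEMO_SHADOW_SCALE_SHIFT	3

/* Assumed offset: 0xb6e00000 - (0xbf000000 >> 3) = 0x9f000000. */
#define DEMO_SHADOW_OFFSET	0x9f000000u

/* Map a 32-bit kernel virtual address to the address of its shadow byte. */
static uint32_t demo_mem_to_shadow(uint32_t addr)
{
	return (addr >> DEMO_SHADOW_SCALE_SHIFT) + DEMO_SHADOW_OFFSET;
}
```

  Under these assumptions 0xbf000000 maps to 0xb6e00000 and 0xffffffff maps
to 0xbeffffff, so the shadow occupies 0x08200000 bytes (130 MiB) for
0x41000000 bytes (1040 MiB) of tracked address space.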


Most of the code comes from:
https://github.com/aryabinin/linux/commit/0b54f17e70ff50a902c4af05bb92716eb95acefe

These patches are tested on vexpress-ca15, vexpress-ca9


Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Joel Stanley <joel@jms.id.au>
Tested-by: Abbott Liu <liuwenliang@huawei.com>
Signed-off-by: Abbott Liu <liuwenliang@huawei.com>

Abbott Liu (2):
  Add TTBR operator for kasan_init
  Define the virtual space of KASan's shadow region

Andrey Ryabinin (4):
  Disable instrumentation for some code
  Replace memory function for kasan
  Initialize the mapping of KASan shadow memory
  Enable KASan for arm

 Documentation/dev-tools/kasan.rst     |   2 +-
 arch/arm/Kconfig                      |   1 +
 arch/arm/boot/compressed/Makefile     |   1 +
 arch/arm/boot/compressed/decompress.c |   2 +
 arch/arm/boot/compressed/libfdt_env.h |   2 +
 arch/arm/include/asm/cp15.h           | 104 ++++++++++++
 arch/arm/include/asm/kasan.h          |  35 ++++
 arch/arm/include/asm/kasan_def.h      |  64 +++++++
 arch/arm/include/asm/kvm_hyp.h        |  52 ------
 arch/arm/include/asm/memory.h         |   5 +
 arch/arm/include/asm/pgalloc.h        |   7 +-
 arch/arm/include/asm/string.h         |  17 ++
 arch/arm/include/asm/thread_info.h    |   4 +
 arch/arm/kernel/entry-armv.S          |   5 +-
 arch/arm/kernel/entry-common.S        |   9 +-
 arch/arm/kernel/head-common.S         |   7 +-
 arch/arm/kernel/setup.c               |   2 +
 arch/arm/kernel/unwind.c              |   3 +-
 arch/arm/kvm/hyp/Makefile             |   4 +
 arch/arm/kvm/hyp/cp15-sr.c            |  12 +-
 arch/arm/kvm/hyp/switch.c             |   6 +-
 arch/arm/lib/memcpy.S                 |   3 +
 arch/arm/lib/memmove.S                |   5 +-
 arch/arm/lib/memset.S                 |   3 +
 arch/arm/mm/Makefile                  |   4 +
 arch/arm/mm/init.c                    |   6 +
 arch/arm/mm/kasan_init.c              | 302 ++++++++++++++++++++++++++++++++++
 arch/arm/mm/mmu.c                     |   7 +-
 arch/arm/mm/pgd.c                     |  14 ++
 arch/arm/vdso/Makefile                |   2 +
 30 files changed, 616 insertions(+), 74 deletions(-)
 create mode 100644 arch/arm/include/asm/kasan.h
 create mode 100644 arch/arm/include/asm/kasan_def.h
 create mode 100644 arch/arm/mm/kasan_init.c

-- 
2.9.0


* [PATCH 1/6] Add TTBR operator for kasan_init
From: Abbott Liu @ 2018-04-12 12:54 UTC (permalink / raw)
  To: aryabinin, glider, dvyukov, corbet, linux, cdall, marc.zyngier,
	arnd, nicolas.pitre, vladimir.murzin, mingo, viro,
	alexandre.belloni, ard.biesheuvel, daniel.lezcano, pombredanne,
	kstewart, liuwenliang, gregkh, f.fainelli, mawilcox, akpm,
	mark.rutland, afzal.mohd.ma, alexander.levin, tglx, thgarnie,
	dhowells, keescook, geert, tixy, julien.thierry, james.morse,
	zhichao.huang, jinb.park7, labbott, philip, grygorii.strashko,
	opendmb, mhocko, kirill.shutemov, kasan-dev, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm
In-Reply-To: <20180412125411.326-1-liuwenliang@huawei.com>

The purpose of this patch is to provide set_ttbr0/get_ttbr0
to kasan_init function. The definitions of cp15 registers
should be in arch/arm/include/asm/cp15.h rather than
arch/arm/include/asm/kvm_hyp.h, so move them.

Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Reported-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Joel Stanley <joel@jms.id.au>
Tested-by: Abbott Liu <liuwenliang@huawei.com>
Signed-off-by: Abbott Liu <liuwenliang@huawei.com>
---
 arch/arm/include/asm/cp15.h    | 104 +++++++++++++++++++++++++++++++++++++++++
 arch/arm/include/asm/kvm_hyp.h |  52 ---------------------
 arch/arm/kvm/hyp/cp15-sr.c     |  12 ++---
 arch/arm/kvm/hyp/switch.c      |   6 +--
 4 files changed, 113 insertions(+), 61 deletions(-)

diff --git a/arch/arm/include/asm/cp15.h b/arch/arm/include/asm/cp15.h
index 4c9fa72..99ebb31 100644
--- a/arch/arm/include/asm/cp15.h
+++ b/arch/arm/include/asm/cp15.h
@@ -3,6 +3,7 @@
 #define __ASM_ARM_CP15_H
 
 #include <asm/barrier.h>
+#include <linux/stringify.h>
 
 /*
  * CR1 bits (CP#15 CR1)
@@ -65,8 +66,111 @@
 #define __write_sysreg(v, r, w, c, t)	asm volatile(w " " c : : "r" ((t)(v)))
 #define write_sysreg(v, ...)		__write_sysreg(v, __VA_ARGS__)
 
+#define TTBR0_32	__ACCESS_CP15(c2, 0, c0, 0)
+#define TTBR1_32	__ACCESS_CP15(c2, 0, c0, 1)
+#define PAR_32		__ACCESS_CP15(c7, 0, c4, 0)
+#define TTBR0_64	__ACCESS_CP15_64(0, c2)
+#define TTBR1_64	__ACCESS_CP15_64(1, c2)
+#define PAR_64		__ACCESS_CP15_64(0, c7)
+#define VTTBR		__ACCESS_CP15_64(6, c2)
+#define CNTV_CVAL	__ACCESS_CP15_64(3, c14)
+#define CNTVOFF		__ACCESS_CP15_64(4, c14)
+
+#define MIDR		__ACCESS_CP15(c0, 0, c0, 0)
+#define CSSELR		__ACCESS_CP15(c0, 2, c0, 0)
+#define VPIDR		__ACCESS_CP15(c0, 4, c0, 0)
+#define VMPIDR		__ACCESS_CP15(c0, 4, c0, 5)
+#define SCTLR		__ACCESS_CP15(c1, 0, c0, 0)
+#define CPACR		__ACCESS_CP15(c1, 0, c0, 2)
+#define HCR		__ACCESS_CP15(c1, 4, c1, 0)
+#define HDCR		__ACCESS_CP15(c1, 4, c1, 1)
+#define HCPTR		__ACCESS_CP15(c1, 4, c1, 2)
+#define HSTR		__ACCESS_CP15(c1, 4, c1, 3)
+#define TTBCR		__ACCESS_CP15(c2, 0, c0, 2)
+#define HTCR		__ACCESS_CP15(c2, 4, c0, 2)
+#define VTCR		__ACCESS_CP15(c2, 4, c1, 2)
+#define DACR		__ACCESS_CP15(c3, 0, c0, 0)
+#define DFSR		__ACCESS_CP15(c5, 0, c0, 0)
+#define IFSR		__ACCESS_CP15(c5, 0, c0, 1)
+#define ADFSR		__ACCESS_CP15(c5, 0, c1, 0)
+#define AIFSR		__ACCESS_CP15(c5, 0, c1, 1)
+#define HSR		__ACCESS_CP15(c5, 4, c2, 0)
+#define DFAR		__ACCESS_CP15(c6, 0, c0, 0)
+#define IFAR		__ACCESS_CP15(c6, 0, c0, 2)
+#define HDFAR		__ACCESS_CP15(c6, 4, c0, 0)
+#define HIFAR		__ACCESS_CP15(c6, 4, c0, 2)
+#define HPFAR		__ACCESS_CP15(c6, 4, c0, 4)
+#define ICIALLUIS	__ACCESS_CP15(c7, 0, c1, 0)
+#define BPIALLIS	__ACCESS_CP15(c7, 0, c1, 6)
+#define ICIMVAU		__ACCESS_CP15(c7, 0, c5, 1)
+#define ATS1CPR		__ACCESS_CP15(c7, 0, c8, 0)
+#define TLBIALLIS	__ACCESS_CP15(c8, 0, c3, 0)
+#define TLBIALL		__ACCESS_CP15(c8, 0, c7, 0)
+#define TLBIALLNSNHIS	__ACCESS_CP15(c8, 4, c3, 4)
+#define PRRR		__ACCESS_CP15(c10, 0, c2, 0)
+#define NMRR		__ACCESS_CP15(c10, 0, c2, 1)
+#define AMAIR0		__ACCESS_CP15(c10, 0, c3, 0)
+#define AMAIR1		__ACCESS_CP15(c10, 0, c3, 1)
+#define VBAR		__ACCESS_CP15(c12, 0, c0, 0)
+#define CID		__ACCESS_CP15(c13, 0, c0, 1)
+#define TID_URW		__ACCESS_CP15(c13, 0, c0, 2)
+#define TID_URO		__ACCESS_CP15(c13, 0, c0, 3)
+#define TID_PRIV	__ACCESS_CP15(c13, 0, c0, 4)
+#define HTPIDR		__ACCESS_CP15(c13, 4, c0, 2)
+#define CNTKCTL		__ACCESS_CP15(c14, 0, c1, 0)
+#define CNTV_CTL	__ACCESS_CP15(c14, 0, c3, 1)
+#define CNTHCTL		__ACCESS_CP15(c14, 4, c1, 0)
+
 extern unsigned long cr_alignment;	/* defined in entry-armv.S */
 
+static inline void set_par(u64 val)
+{
+	if (IS_ENABLED(CONFIG_ARM_LPAE))
+		write_sysreg(val, PAR_64);
+	else
+		write_sysreg(val, PAR_32);
+}
+
+static inline u64 get_par(void)
+{
+	if (IS_ENABLED(CONFIG_ARM_LPAE))
+		return read_sysreg(PAR_64);
+	else
+		return read_sysreg(PAR_32);
+}
+
+static inline void set_ttbr0(u64 val)
+{
+	if (IS_ENABLED(CONFIG_ARM_LPAE))
+		write_sysreg(val, TTBR0_64);
+	else
+		write_sysreg(val, TTBR0_32);
+}
+
+static inline u64 get_ttbr0(void)
+{
+	if (IS_ENABLED(CONFIG_ARM_LPAE))
+		return read_sysreg(TTBR0_64);
+	else
+		return read_sysreg(TTBR0_32);
+}
+
+static inline void set_ttbr1(u64 val)
+{
+	if (IS_ENABLED(CONFIG_ARM_LPAE))
+		write_sysreg(val, TTBR1_64);
+	else
+		write_sysreg(val, TTBR1_32);
+}
+
+static inline u64 get_ttbr1(void)
+{
+	if (IS_ENABLED(CONFIG_ARM_LPAE))
+		return read_sysreg(TTBR1_64);
+	else
+		return read_sysreg(TTBR1_32);
+}
+
 static inline unsigned long get_cr(void)
 {
 	unsigned long val;
diff --git a/arch/arm/include/asm/kvm_hyp.h b/arch/arm/include/asm/kvm_hyp.h
index 1ab8329..8e8592e 100644
--- a/arch/arm/include/asm/kvm_hyp.h
+++ b/arch/arm/include/asm/kvm_hyp.h
@@ -36,58 +36,6 @@
 	__val;							\
 })
 
-#define TTBR0		__ACCESS_CP15_64(0, c2)
-#define TTBR1		__ACCESS_CP15_64(1, c2)
-#define VTTBR		__ACCESS_CP15_64(6, c2)
-#define PAR		__ACCESS_CP15_64(0, c7)
-#define CNTV_CVAL	__ACCESS_CP15_64(3, c14)
-#define CNTVOFF		__ACCESS_CP15_64(4, c14)
-
-#define MIDR		__ACCESS_CP15(c0, 0, c0, 0)
-#define CSSELR		__ACCESS_CP15(c0, 2, c0, 0)
-#define VPIDR		__ACCESS_CP15(c0, 4, c0, 0)
-#define VMPIDR		__ACCESS_CP15(c0, 4, c0, 5)
-#define SCTLR		__ACCESS_CP15(c1, 0, c0, 0)
-#define CPACR		__ACCESS_CP15(c1, 0, c0, 2)
-#define HCR		__ACCESS_CP15(c1, 4, c1, 0)
-#define HDCR		__ACCESS_CP15(c1, 4, c1, 1)
-#define HCPTR		__ACCESS_CP15(c1, 4, c1, 2)
-#define HSTR		__ACCESS_CP15(c1, 4, c1, 3)
-#define TTBCR		__ACCESS_CP15(c2, 0, c0, 2)
-#define HTCR		__ACCESS_CP15(c2, 4, c0, 2)
-#define VTCR		__ACCESS_CP15(c2, 4, c1, 2)
-#define DACR		__ACCESS_CP15(c3, 0, c0, 0)
-#define DFSR		__ACCESS_CP15(c5, 0, c0, 0)
-#define IFSR		__ACCESS_CP15(c5, 0, c0, 1)
-#define ADFSR		__ACCESS_CP15(c5, 0, c1, 0)
-#define AIFSR		__ACCESS_CP15(c5, 0, c1, 1)
-#define HSR		__ACCESS_CP15(c5, 4, c2, 0)
-#define DFAR		__ACCESS_CP15(c6, 0, c0, 0)
-#define IFAR		__ACCESS_CP15(c6, 0, c0, 2)
-#define HDFAR		__ACCESS_CP15(c6, 4, c0, 0)
-#define HIFAR		__ACCESS_CP15(c6, 4, c0, 2)
-#define HPFAR		__ACCESS_CP15(c6, 4, c0, 4)
-#define ICIALLUIS	__ACCESS_CP15(c7, 0, c1, 0)
-#define BPIALLIS	__ACCESS_CP15(c7, 0, c1, 6)
-#define ICIMVAU		__ACCESS_CP15(c7, 0, c5, 1)
-#define ATS1CPR		__ACCESS_CP15(c7, 0, c8, 0)
-#define TLBIALLIS	__ACCESS_CP15(c8, 0, c3, 0)
-#define TLBIALL		__ACCESS_CP15(c8, 0, c7, 0)
-#define TLBIALLNSNHIS	__ACCESS_CP15(c8, 4, c3, 4)
-#define PRRR		__ACCESS_CP15(c10, 0, c2, 0)
-#define NMRR		__ACCESS_CP15(c10, 0, c2, 1)
-#define AMAIR0		__ACCESS_CP15(c10, 0, c3, 0)
-#define AMAIR1		__ACCESS_CP15(c10, 0, c3, 1)
-#define VBAR		__ACCESS_CP15(c12, 0, c0, 0)
-#define CID		__ACCESS_CP15(c13, 0, c0, 1)
-#define TID_URW		__ACCESS_CP15(c13, 0, c0, 2)
-#define TID_URO		__ACCESS_CP15(c13, 0, c0, 3)
-#define TID_PRIV	__ACCESS_CP15(c13, 0, c0, 4)
-#define HTPIDR		__ACCESS_CP15(c13, 4, c0, 2)
-#define CNTKCTL		__ACCESS_CP15(c14, 0, c1, 0)
-#define CNTV_CTL	__ACCESS_CP15(c14, 0, c3, 1)
-#define CNTHCTL		__ACCESS_CP15(c14, 4, c1, 0)
-
 #define VFP_FPEXC	__ACCESS_VFP(FPEXC)
 
 /* AArch64 compatibility macros, only for the timer so far */
diff --git a/arch/arm/kvm/hyp/cp15-sr.c b/arch/arm/kvm/hyp/cp15-sr.c
index c478281..d365e3c 100644
--- a/arch/arm/kvm/hyp/cp15-sr.c
+++ b/arch/arm/kvm/hyp/cp15-sr.c
@@ -31,8 +31,8 @@ void __hyp_text __sysreg_save_state(struct kvm_cpu_context *ctxt)
 	ctxt->cp15[c0_CSSELR]		= read_sysreg(CSSELR);
 	ctxt->cp15[c1_SCTLR]		= read_sysreg(SCTLR);
 	ctxt->cp15[c1_CPACR]		= read_sysreg(CPACR);
-	*cp15_64(ctxt, c2_TTBR0)	= read_sysreg(TTBR0);
-	*cp15_64(ctxt, c2_TTBR1)	= read_sysreg(TTBR1);
+	*cp15_64(ctxt, c2_TTBR0)	= read_sysreg(TTBR0_64);
+	*cp15_64(ctxt, c2_TTBR1)	= read_sysreg(TTBR1_64);
 	ctxt->cp15[c2_TTBCR]		= read_sysreg(TTBCR);
 	ctxt->cp15[c3_DACR]		= read_sysreg(DACR);
 	ctxt->cp15[c5_DFSR]		= read_sysreg(DFSR);
@@ -41,7 +41,7 @@ void __hyp_text __sysreg_save_state(struct kvm_cpu_context *ctxt)
 	ctxt->cp15[c5_AIFSR]		= read_sysreg(AIFSR);
 	ctxt->cp15[c6_DFAR]		= read_sysreg(DFAR);
 	ctxt->cp15[c6_IFAR]		= read_sysreg(IFAR);
-	*cp15_64(ctxt, c7_PAR)		= read_sysreg(PAR);
+	*cp15_64(ctxt, c7_PAR)		= read_sysreg(PAR_64);
 	ctxt->cp15[c10_PRRR]		= read_sysreg(PRRR);
 	ctxt->cp15[c10_NMRR]		= read_sysreg(NMRR);
 	ctxt->cp15[c10_AMAIR0]		= read_sysreg(AMAIR0);
@@ -60,8 +60,8 @@ void __hyp_text __sysreg_restore_state(struct kvm_cpu_context *ctxt)
 	write_sysreg(ctxt->cp15[c0_CSSELR],	CSSELR);
 	write_sysreg(ctxt->cp15[c1_SCTLR],	SCTLR);
 	write_sysreg(ctxt->cp15[c1_CPACR],	CPACR);
-	write_sysreg(*cp15_64(ctxt, c2_TTBR0),	TTBR0);
-	write_sysreg(*cp15_64(ctxt, c2_TTBR1),	TTBR1);
+	write_sysreg(*cp15_64(ctxt, c2_TTBR0),	TTBR0_64);
+	write_sysreg(*cp15_64(ctxt, c2_TTBR1),	TTBR1_64);
 	write_sysreg(ctxt->cp15[c2_TTBCR],	TTBCR);
 	write_sysreg(ctxt->cp15[c3_DACR],	DACR);
 	write_sysreg(ctxt->cp15[c5_DFSR],	DFSR);
@@ -70,7 +70,7 @@ void __hyp_text __sysreg_restore_state(struct kvm_cpu_context *ctxt)
 	write_sysreg(ctxt->cp15[c5_AIFSR],	AIFSR);
 	write_sysreg(ctxt->cp15[c6_DFAR],	DFAR);
 	write_sysreg(ctxt->cp15[c6_IFAR],	IFAR);
-	write_sysreg(*cp15_64(ctxt, c7_PAR),	PAR);
+	write_sysreg(*cp15_64(ctxt, c7_PAR),	PAR_64);
 	write_sysreg(ctxt->cp15[c10_PRRR],	PRRR);
 	write_sysreg(ctxt->cp15[c10_NMRR],	NMRR);
 	write_sysreg(ctxt->cp15[c10_AMAIR0],	AMAIR0);
diff --git a/arch/arm/kvm/hyp/switch.c b/arch/arm/kvm/hyp/switch.c
index ae45ae9..94d5bb9 100644
--- a/arch/arm/kvm/hyp/switch.c
+++ b/arch/arm/kvm/hyp/switch.c
@@ -134,12 +134,12 @@ static bool __hyp_text __populate_fault_info(struct kvm_vcpu *vcpu)
 	if (!(hsr & HSR_DABT_S1PTW) && (hsr & HSR_FSC_TYPE) == FSC_PERM) {
 		u64 par, tmp;
 
-		par = read_sysreg(PAR);
+		par = read_sysreg(PAR_64);
 		write_sysreg(far, ATS1CPR);
 		isb();
 
-		tmp = read_sysreg(PAR);
-		write_sysreg(par, PAR);
+		tmp = read_sysreg(PAR_64);
+		write_sysreg(par, PAR_64);
 
 		if (unlikely(tmp & 1))
 			return false; /* Translation failed, back to guest */
-- 
2.9.0

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [PATCH 6/6] Enable KASan for arm
From: Abbott Liu @ 2018-04-12 12:54 UTC (permalink / raw)
  To: aryabinin, glider, dvyukov, corbet, linux, cdall, marc.zyngier,
	arnd, nicolas.pitre, vladimir.murzin, mingo, viro,
	alexandre.belloni, ard.biesheuvel, daniel.lezcano, pombredanne,
	kstewart, liuwenliang, gregkh, f.fainelli, mawilcox, akpm,
	mark.rutland, afzal.mohd.ma, alexander.levin, tglx, thgarnie,
	dhowells, keescook, geert, tixy, julien.thierry, james.morse,
	zhichao.huang, jinb.park7, labbott, philip, grygorii.strashko,
	opendmb, mhocko, kirill.shutemov, kasan-dev, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm
In-Reply-To: <20180412125411.326-1-liuwenliang@huawei.com>

From: Andrey Ryabinin <a.ryabinin@samsung.com>

This patch enables the kernel address sanitizer (KASan) for arm.

Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Joel Stanley <joel@jms.id.au>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Abbott Liu <liuwenliang@huawei.com>
Signed-off-by: Abbott Liu <liuwenliang@huawei.com>
---
 Documentation/dev-tools/kasan.rst | 2 +-
 arch/arm/Kconfig                  | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst
index f7a18f2..d92120d 100644
--- a/Documentation/dev-tools/kasan.rst
+++ b/Documentation/dev-tools/kasan.rst
@@ -12,7 +12,7 @@ KASAN uses compile-time instrumentation for checking every memory access,
 therefore you will need a GCC version 4.9.2 or later. GCC 5.0 or later is
 required for detection of out-of-bounds accesses to stack or global variables.
 
-Currently KASAN is supported only for the x86_64 and arm64 architectures.
+Currently KASAN is supported only for the x86_64, arm64 and arm architectures.
 
 Usage
 -----
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 1878083..cd71bea 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -49,6 +49,7 @@ config ARM
 	select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6
 	select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
+	select HAVE_ARCH_KASAN if MMU
 	select HAVE_ARCH_MMAP_RND_BITS if MMU
 	select HAVE_ARCH_SECCOMP_FILTER if (AEABI && !OABI_COMPAT)
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
-- 
2.9.0


* [PATCH 5/6] Initialize the mapping of KASan shadow memory
From: Abbott Liu @ 2018-04-12 12:54 UTC (permalink / raw)
  To: aryabinin, glider, dvyukov, corbet, linux, cdall, marc.zyngier,
	arnd, nicolas.pitre, vladimir.murzin, mingo, viro,
	alexandre.belloni, ard.biesheuvel, daniel.lezcano, pombredanne,
	kstewart, liuwenliang, gregkh, f.fainelli, mawilcox, akpm,
	mark.rutland, afzal.mohd.ma, alexander.levin, tglx, thgarnie,
	dhowells, keescook, geert, tixy, julien.thierry, james.morse,
	zhichao.huang, jinb.park7, labbott, philip, grygorii.strashko,
	opendmb, mhocko, kirill.shutemov, kasan-dev, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm
In-Reply-To: <20180412125411.326-1-liuwenliang@huawei.com>

From: Andrey Ryabinin <a.ryabinin@samsung.com>

This patch initializes the page tables and memory of the KASan shadow
region. KASan initialization has two stages:
1. At the early boot stage the whole shadow region is mapped to just
   one physical page (kasan_zero_page). This is done by the function
   kasan_early_init, which is called from __mmap_switched
   (arch/arm/kernel/head-common.S).
             ---Andrey Ryabinin <a.ryabinin@samsung.com>

2. After paging_init has been called, kasan_zero_page is used as the
   zero shadow for memory that KASan does not need to track, and new
   shadow space is allocated for the memory that KASan does need to
   track. This is done by the function kasan_init, which is called from
   setup_arch.
            ---Andrey Ryabinin <a.ryabinin@samsung.com>

This patch also makes the following changes:

3. Add support for ARM LPAE.
   If LPAE is enabled, the mapping table of the KASan shadow region
   needs to be copied in the pgd_alloc function.
            ---Abbott Liu <liuwenliang@huawei.com>

4. Move kasan_pte_populate, kasan_pmd_populate, kasan_pud_populate and
   kasan_pgd_populate from the .meminit.text section to the .init.text
   section.
           ---Reported by: Florian Fainelli <f.fainelli@gmail.com>
           ---Signed off by: Abbott Liu <liuwenliang@huawei.com>
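
As a rough illustration of the two stages above, here is a small
user-space sketch. It is not the kernel code: the page tables are
replaced by a plain pointer array, and every demo_* name and DEMO_*
constant is hypothetical. It only shows the idea that all shadow pages
first share one zero page, and that selected pages later get their own
backing memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_SHADOW_PAGES 8	/* hypothetical size of the toy shadow region */
#define DEMO_PAGE_SIZE    64	/* hypothetical page size, kept tiny on purpose */

/* Stand-in for kasan_zero_page: one page shared by every early mapping. */
unsigned char demo_zero_page[DEMO_PAGE_SIZE];

/* Stand-in for the shadow page tables: one backing page per shadow page. */
unsigned char *demo_shadow_map[DEMO_SHADOW_PAGES];

/* Stage 1: point the whole shadow region at the single zero page. */
void demo_early_init(void)
{
	size_t i;

	for (i = 0; i < DEMO_SHADOW_PAGES; i++)
		demo_shadow_map[i] = demo_zero_page;
}

/*
 * Stage 2: give the shadow pages in [first, last] their own zeroed
 * backing page. Pages outside that range keep sharing the zero page.
 * Returns 0 on success and -1 if an allocation fails.
 */
int demo_late_init(size_t first, size_t last)
{
	size_t i;

	assert(first <= last && last < DEMO_SHADOW_PAGES);

	for (i = first; i <= last; i++) {
		unsigned char *p = calloc(1, DEMO_PAGE_SIZE);

		if (!p)
			return -1;
		demo_shadow_map[i] = p;
	}
	return 0;
}
```

Calling demo_early_init() and then demo_late_init(2, 4) leaves entries
2 to 4 with private pages while all other entries still alias
demo_zero_page.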

Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Co-Developed-by: Abbott Liu <liuwenliang@huawei.com>
Reported-by: Russell King - ARM Linux <linux@armlinux.org.uk>
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Joel Stanley <joel@jms.id.au>
Tested-by: Abbott Liu <liuwenliang@huawei.com>
Signed-off-by: Abbott Liu <liuwenliang@huawei.com>
---
 arch/arm/include/asm/kasan.h       |  35 +++++
 arch/arm/include/asm/pgalloc.h     |   7 +-
 arch/arm/include/asm/thread_info.h |   4 +
 arch/arm/kernel/head-common.S      |   3 +
 arch/arm/kernel/setup.c            |   2 +
 arch/arm/mm/Makefile               |   3 +
 arch/arm/mm/kasan_init.c           | 302 +++++++++++++++++++++++++++++++++++++
 arch/arm/mm/pgd.c                  |  14 ++
 8 files changed, 368 insertions(+), 2 deletions(-)
 create mode 100644 arch/arm/include/asm/kasan.h
 create mode 100644 arch/arm/mm/kasan_init.c

diff --git a/arch/arm/include/asm/kasan.h b/arch/arm/include/asm/kasan.h
new file mode 100644
index 0000000..1801f4d
--- /dev/null
+++ b/arch/arm/include/asm/kasan.h
@@ -0,0 +1,35 @@
+/*
+ * arch/arm/include/asm/kasan.h
+ *
+ * Copyright (c) 2015 Samsung Electronics Co., Ltd.
+ * Author: Andrey Ryabinin <ryabinin.a.a@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ */
+
+#ifndef __ASM_KASAN_H
+#define __ASM_KASAN_H
+
+#ifdef CONFIG_KASAN
+
+#include <asm/kasan_def.h>
+
+#define KASAN_SHADOW_SCALE_SHIFT 3
+
+/*
+ * Compiler uses shadow offset assuming that addresses start
+ * from 0. Kernel addresses don't start from 0, so shadow
+ * for kernel really starts from 'compiler's shadow offset' +
+ * ('kernel address space start' >> KASAN_SHADOW_SCALE_SHIFT)
+ */
+
+extern void kasan_init(void);
+
+#else
+static inline void kasan_init(void) { }
+#endif
+
+#endif
diff --git a/arch/arm/include/asm/pgalloc.h b/arch/arm/include/asm/pgalloc.h
index 2d7344f..f170659 100644
--- a/arch/arm/include/asm/pgalloc.h
+++ b/arch/arm/include/asm/pgalloc.h
@@ -50,8 +50,11 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
  */
 #define pmd_alloc_one(mm,addr)		({ BUG(); ((pmd_t *)2); })
 #define pmd_free(mm, pmd)		do { } while (0)
-#define pud_populate(mm,pmd,pte)	BUG()
-
+#ifndef CONFIG_KASAN
+#define pud_populate(mm, pmd, pte)	BUG()
+#else
+#define pud_populate(mm, pmd, pte)	do { } while (0)
+#endif
 #endif	/* CONFIG_ARM_LPAE */
 
 extern pgd_t *pgd_alloc(struct mm_struct *mm);
diff --git a/arch/arm/include/asm/thread_info.h b/arch/arm/include/asm/thread_info.h
index e71cc35..bc681a0 100644
--- a/arch/arm/include/asm/thread_info.h
+++ b/arch/arm/include/asm/thread_info.h
@@ -16,7 +16,11 @@
 #include <asm/fpstate.h>
 #include <asm/page.h>
 
+#ifdef CONFIG_KASAN
+#define THREAD_SIZE_ORDER	2
+#else
 #define THREAD_SIZE_ORDER	1
+#endif
 #define THREAD_SIZE		(PAGE_SIZE << THREAD_SIZE_ORDER)
 #define THREAD_START_SP		(THREAD_SIZE - 8)
 
diff --git a/arch/arm/kernel/head-common.S b/arch/arm/kernel/head-common.S
index c79b829..20161e2 100644
--- a/arch/arm/kernel/head-common.S
+++ b/arch/arm/kernel/head-common.S
@@ -115,6 +115,9 @@ __mmap_switched:
 	str	r8, [r2]			@ Save atags pointer
 	cmp	r3, #0
 	strne	r10, [r3]			@ Save control register values
+#ifdef CONFIG_KASAN
+	bl	kasan_early_init
+#endif
 	mov	lr, #0
 	b	start_kernel
 ENDPROC(__mmap_switched)
diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index fc40a2b..81c3e9df 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -62,6 +62,7 @@
 #include <asm/unwind.h>
 #include <asm/memblock.h>
 #include <asm/virt.h>
+#include <asm/kasan.h>
 
 #include "atags.h"
 
@@ -1118,6 +1119,7 @@ void __init setup_arch(char **cmdline_p)
 	early_ioremap_reset();
 
 	paging_init(mdesc);
+	kasan_init();
 	request_standard_resources(mdesc);
 
 	if (mdesc->restart)
diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile
index c056e17..46db240 100644
--- a/arch/arm/mm/Makefile
+++ b/arch/arm/mm/Makefile
@@ -112,3 +112,6 @@ obj-$(CONFIG_CACHE_L2X0_PMU)	+= cache-l2x0-pmu.o
 obj-$(CONFIG_CACHE_XSC3L2)	+= cache-xsc3l2.o
 obj-$(CONFIG_CACHE_TAUROS2)	+= cache-tauros2.o
 obj-$(CONFIG_CACHE_UNIPHIER)	+= cache-uniphier.o
+
+KASAN_SANITIZE_kasan_init.o    := n
+obj-$(CONFIG_KASAN)            += kasan_init.o
diff --git a/arch/arm/mm/kasan_init.c b/arch/arm/mm/kasan_init.c
new file mode 100644
index 0000000..461cc85
--- /dev/null
+++ b/arch/arm/mm/kasan_init.c
@@ -0,0 +1,302 @@
+/*
+ * This file contains kasan initialization code for ARM.
+ *
+ * Copyright (c) 2018 Samsung Electronics Co., Ltd.
+ * Author: Andrey Ryabinin <ryabinin.a.a@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ */
+
+#include <linux/bootmem.h>
+#include <linux/kasan.h>
+#include <linux/kernel.h>
+#include <linux/memblock.h>
+#include <linux/start_kernel.h>
+#include <asm/cputype.h>
+#include <asm/highmem.h>
+#include <asm/mach/map.h>
+#include <asm/memory.h>
+#include <asm/page.h>
+#include <asm/pgalloc.h>
+#include <asm/pgtable.h>
+#include <asm/procinfo.h>
+#include <asm/proc-fns.h>
+#include <asm/tlbflush.h>
+#include <asm/cp15.h>
+#include <linux/sched/task.h>
+
+#include "mm.h"
+
+static pgd_t tmp_pgd_table[PTRS_PER_PGD] __initdata __aligned(1ULL << 14);
+
+pmd_t tmp_pmd_table[PTRS_PER_PMD] __page_aligned_bss;
+
+static __init void *kasan_alloc_block(size_t size, int node)
+{
+	return memblock_virt_alloc_try_nid(size, size, __pa(MAX_DMA_ADDRESS),
+					BOOTMEM_ALLOC_ACCESSIBLE, node);
+}
+
+static void __init kasan_early_pmd_populate(unsigned long start,
+					unsigned long end, pud_t *pud)
+{
+	unsigned long addr;
+	unsigned long next;
+	pmd_t *pmd;
+
+	pmd = pmd_offset(pud, start);
+	for (addr = start; addr < end;) {
+		pmd_populate_kernel(&init_mm, pmd, kasan_zero_pte);
+		next = pmd_addr_end(addr, end);
+		addr = next;
+		flush_pmd_entry(pmd);
+		pmd++;
+	}
+}
+
+static void __init kasan_early_pud_populate(unsigned long start,
+				unsigned long end, pgd_t *pgd)
+{
+	unsigned long addr;
+	unsigned long next;
+	pud_t *pud;
+
+	pud = pud_offset(pgd, start);
+	for (addr = start; addr < end;) {
+		next = pud_addr_end(addr, end);
+		kasan_early_pmd_populate(addr, next, pud);
+		addr = next;
+		pud++;
+	}
+}
+
+void __init kasan_map_early_shadow(pgd_t *pgdp)
+{
+	int i;
+	unsigned long start = KASAN_SHADOW_START;
+	unsigned long end = KASAN_SHADOW_END;
+	unsigned long addr;
+	unsigned long next;
+	pgd_t *pgd;
+
+	for (i = 0; i < PTRS_PER_PTE; i++)
+		set_pte_at(&init_mm, KASAN_SHADOW_START + i*PAGE_SIZE,
+			&kasan_zero_pte[i], pfn_pte(
+				virt_to_pfn(kasan_zero_page),
+				__pgprot(_L_PTE_DEFAULT | L_PTE_DIRTY
+					| L_PTE_XN)));
+
+	pgd = pgd_offset_k(start);
+	for (addr = start; addr < end;) {
+		next = pgd_addr_end(addr, end);
+		kasan_early_pud_populate(addr, next, pgd);
+		addr = next;
+		pgd++;
+	}
+}
+
+extern struct proc_info_list *lookup_processor_type(unsigned int);
+
+void __init kasan_early_init(void)
+{
+	struct proc_info_list *list;
+
+	/*
+	 * locate processor in the list of supported processor
+	 * types.  The linker builds this table for us from the
+	 * entries in arch/arm/mm/proc-*.S
+	 */
+	list = lookup_processor_type(read_cpuid_id());
+	if (list) {
+#ifdef MULTI_CPU
+		processor = *list->proc;
+#endif
+	}
+
+	BUILD_BUG_ON((KASAN_SHADOW_END - (1UL << 29)) != KASAN_SHADOW_OFFSET);
+	kasan_map_early_shadow(swapper_pg_dir);
+}
+
+static void __init clear_pgds(unsigned long start,
+			unsigned long end)
+{
+	for (; start && start < end; start += PMD_SIZE)
+		pmd_clear(pmd_off_k(start));
+}
+
+pte_t * __init kasan_pte_populate(pmd_t *pmd, unsigned long addr, int node)
+{
+	pte_t *pte = pte_offset_kernel(pmd, addr);
+
+	if (pte_none(*pte)) {
+		pte_t entry;
+		void *p = kasan_alloc_block(PAGE_SIZE, node);
+
+		if (!p)
+			return NULL;
+		entry = pfn_pte(virt_to_pfn(p),
+			__pgprot(pgprot_val(PAGE_KERNEL)));
+		set_pte_at(&init_mm, addr, pte, entry);
+	}
+	return pte;
+}
+
+pmd_t * __init kasan_pmd_populate(pud_t *pud, unsigned long addr, int node)
+{
+	pmd_t *pmd = pmd_offset(pud, addr);
+
+	if (pmd_none(*pmd)) {
+		void *p = kasan_alloc_block(PAGE_SIZE, node);
+
+		if (!p)
+			return NULL;
+		pmd_populate_kernel(&init_mm, pmd, p);
+	}
+	return pmd;
+}
+
+pud_t * __init kasan_pud_populate(pgd_t *pgd, unsigned long addr, int node)
+{
+	pud_t *pud = pud_offset(pgd, addr);
+
+	if (pud_none(*pud)) {
+		void *p = kasan_alloc_block(PAGE_SIZE, node);
+
+		if (!p)
+			return NULL;
+		pr_err("populating pud addr %lx\n", addr);
+		pud_populate(&init_mm, pud, p);
+	}
+	return pud;
+}
+
+pgd_t * __init kasan_pgd_populate(unsigned long addr, int node)
+{
+	pgd_t *pgd = pgd_offset_k(addr);
+
+	if (pgd_none(*pgd)) {
+		void *p = kasan_alloc_block(PAGE_SIZE, node);
+
+		if (!p)
+			return NULL;
+		pgd_populate(&init_mm, pgd, p);
+	}
+	return pgd;
+}
+
+static int __init create_mapping(unsigned long start, unsigned long end,
+				int node)
+{
+	unsigned long addr = start;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pr_info("populating shadow for %lx, %lx\n", start, end);
+
+	for (; addr < end; addr += PAGE_SIZE) {
+		pgd = kasan_pgd_populate(addr, node);
+		if (!pgd)
+			return -ENOMEM;
+
+		pud = kasan_pud_populate(pgd, addr, node);
+		if (!pud)
+			return -ENOMEM;
+
+		pmd = kasan_pmd_populate(pud, addr, node);
+		if (!pmd)
+			return -ENOMEM;
+
+		pte = kasan_pte_populate(pmd, addr, node);
+		if (!pte)
+			return -ENOMEM;
+	}
+	return 0;
+}
+
+
+void __init kasan_init(void)
+{
+	struct memblock_region *reg;
+	u64 orig_ttbr0;
+	int i;
+
+	/*
+	 * We are going to perform proper setup of shadow memory.
+	 * First we must unmap the early shadow (clear_pgds() call below).
+	 * However, instrumented code cannot execute without shadow memory,
+	 * so tmp_pgd_table and tmp_pmd_table are used to keep the early
+	 * shadow mapped until the full shadow setup is finished.
+	 */
+	orig_ttbr0 = get_ttbr0();
+
+#ifdef CONFIG_ARM_LPAE
+	memcpy(tmp_pmd_table,
+		pgd_page_vaddr(*pgd_offset_k(KASAN_SHADOW_START)),
+		sizeof(tmp_pmd_table));
+	memcpy(tmp_pgd_table, swapper_pg_dir, sizeof(tmp_pgd_table));
+	set_pgd(&tmp_pgd_table[pgd_index(KASAN_SHADOW_START)],
+		__pgd(__pa(tmp_pmd_table) | PMD_TYPE_TABLE | L_PGD_SWAPPER));
+	set_ttbr0(__pa(tmp_pgd_table));
+#else
+	memcpy(tmp_pgd_table, swapper_pg_dir, sizeof(tmp_pgd_table));
+	set_ttbr0((u64)__pa(tmp_pgd_table));
+#endif
+	flush_cache_all();
+	local_flush_bp_all();
+	local_flush_tlb_all();
+
+	clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
+
+	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)VMALLOC_START),
+				kasan_mem_to_shadow((void *)-1UL) + 1);
+
+	for_each_memblock(memory, reg) {
+		void *start = __va(reg->base);
+		void *end = __va(reg->base + reg->size);
+
+		if (reg->base + reg->size > arm_lowmem_limit)
+			end = __va(arm_lowmem_limit);
+		if (start >= end)
+			break;
+
+		create_mapping((unsigned long)kasan_mem_to_shadow(start),
+			(unsigned long)kasan_mem_to_shadow(end),
+			NUMA_NO_NODE);
+	}
+
+	/* 1. Module global variables live in MODULES_VADDR ~ MODULES_END,
+	 *    so that range needs a real shadow mapping.
+	 * 2. The shadows of PKMAP_BASE ~ PKMAP_BASE+PMD_SIZE and of
+	 *    MODULES_VADDR ~ MODULES_END share one PMD_SIZE block, so we
+	 *    can't use kasan_populate_zero_shadow here.
+	 */
+	create_mapping(
+		(unsigned long)kasan_mem_to_shadow((void *)MODULES_VADDR),
+
+		(unsigned long)kasan_mem_to_shadow((void *)(PKMAP_BASE +
+							PMD_SIZE)),
+		NUMA_NO_NODE);
+
+	/*
+	 * KAsan may reuse the contents of kasan_zero_pte directly, so we
+	 * should make sure that it maps the zero page read-only.
+	 */
+	for (i = 0; i < PTRS_PER_PTE; i++)
+		set_pte_at(&init_mm, KASAN_SHADOW_START + i*PAGE_SIZE,
+			&kasan_zero_pte[i],
+			pfn_pte(virt_to_pfn(kasan_zero_page),
+				__pgprot(pgprot_val(PAGE_KERNEL)
+					| L_PTE_RDONLY)));
+	memset(kasan_zero_page, 0, PAGE_SIZE);
+	set_ttbr0(orig_ttbr0);
+	flush_cache_all();
+	local_flush_bp_all();
+	local_flush_tlb_all();
+	pr_info("Kernel address sanitizer initialized\n");
+	init_task.kasan_depth = 0;
+}
diff --git a/arch/arm/mm/pgd.c b/arch/arm/mm/pgd.c
index 61e281c..4644a21 100644
--- a/arch/arm/mm/pgd.c
+++ b/arch/arm/mm/pgd.c
@@ -64,6 +64,20 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 	new_pmd = pmd_alloc(mm, new_pud, 0);
 	if (!new_pmd)
 		goto no_pmd;
+#ifdef CONFIG_KASAN
+	/*
+	 * Copy PMD table for KASAN shadow mappings.
+	 */
+	init_pgd = pgd_offset_k(TASK_SIZE);
+	init_pud = pud_offset(init_pgd, TASK_SIZE);
+	init_pmd = pmd_offset(init_pud, TASK_SIZE);
+	new_pmd = pmd_offset(new_pud, TASK_SIZE);
+	memcpy(new_pmd, init_pmd,
+		(pmd_index(MODULES_VADDR)-pmd_index(TASK_SIZE))
+		* sizeof(pmd_t));
+	clean_dcache_area(new_pmd, PTRS_PER_PMD*sizeof(pmd_t));
+#endif
+
 #endif
 
 	if (!vectors_high()) {
-- 
2.9.0


* [PATCH 4/6] Define the virtual space of KASan's shadow region
From: Abbott Liu @ 2018-04-12 12:54 UTC (permalink / raw)
  To: aryabinin, glider, dvyukov, corbet, linux, cdall, marc.zyngier,
	arnd, nicolas.pitre, vladimir.murzin, mingo, viro,
	alexandre.belloni, ard.biesheuvel, daniel.lezcano, pombredanne,
	kstewart, liuwenliang, gregkh, f.fainelli, mawilcox, akpm,
	mark.rutland, afzal.mohd.ma, alexander.levin, tglx, thgarnie,
	dhowells, keescook, geert, tixy, julien.thierry, james.morse,
	zhichao.huang, jinb.park7, labbott, philip, grygorii.strashko,
	opendmb, mhocko, kirill.shutemov, kasan-dev, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm
In-Reply-To: <20180412125411.326-1-liuwenliang@huawei.com>

Define KASAN_SHADOW_OFFSET, KASAN_SHADOW_START and KASAN_SHADOW_END for
the arm kernel address sanitizer.

     +----+ 0xffffffff
     |    |
     |    |
     |    |
     +----+ CONFIG_PAGE_OFFSET
     |    |\
     |    | |->  module virtual address space area.
     |    |/
     +----+ MODULES_VADDR = KASAN_SHADOW_END
     |    |\
     |    | |-> the shadow area of kernel virtual address.
     |    |/
     +----+ TASK_SIZE(start of kernel space) = KASAN_SHADOW_START  the
     |    |\  shadow address of MODULES_VADDR
     |    | ---------------------+
     |    |                      |
     +    + KASAN_SHADOW_OFFSET  |-> the user space area. Kernel address
     |    |                      |    sanitizer does not use this space.
     |    | ---------------------+
     |    |/
     ------ 0

1) KASAN_SHADOW_OFFSET
   This value is used to map an address to the corresponding shadow
   address by the following formula:
   shadow_addr = (address >> 3) + KASAN_SHADOW_OFFSET;

2) KASAN_SHADOW_START
   This value is the shadow address of MODULES_VADDR. It is the start
   of kernel virtual space.

3) KASAN_SHADOW_END
   This value is the shadow address of 0x100000000. It is the end of
   the kernel address sanitizer's shadow area. It is also the start of
   the module area.
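
To make the three definitions concrete, the following user-space sketch
evaluates them for one example layout. It assumes a 3G/1G split
(CONFIG_PAGE_OFFSET = 0xC0000000), which is an assumption made only for
illustration; the demo_* names and DEMO_* constants are hypothetical
and merely mirror the macros defined in kasan_def.h.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 3G/1G split, i.e. CONFIG_PAGE_OFFSET = 0xC0000000. */
#define DEMO_PAGE_OFFSET	UINT32_C(0xC0000000)
#define DEMO_SZ_16M		UINT32_C(0x01000000)
#define DEMO_SCALE_SHIFT	3

/* Mirrors KASAN_SHADOW_END = CONFIG_PAGE_OFFSET - SZ_16M. */
uint32_t demo_shadow_end(void)
{
	return DEMO_PAGE_OFFSET - DEMO_SZ_16M;
}

/* Mirrors KASAN_SHADOW_OFFSET = KASAN_SHADOW_END - (1 << 29). */
uint32_t demo_shadow_offset(void)
{
	return demo_shadow_end() - (UINT32_C(1) << 29);
}

/* shadow_addr = (address >> 3) + KASAN_SHADOW_OFFSET */
uint32_t demo_mem_to_shadow(uint32_t addr)
{
	return (addr >> DEMO_SCALE_SHIFT) + demo_shadow_offset();
}

/* Mirrors KASAN_SHADOW_START = shadow address of KASAN_SHADOW_END. */
uint32_t demo_shadow_start(void)
{
	return demo_mem_to_shadow(demo_shadow_end());
}

/* Sanity checks on the resulting layout. */
void demo_check_layout(void)
{
	/* The shadow region is non-empty and ends where modules begin. */
	assert(demo_shadow_start() < demo_shadow_end());
	/* The last byte of the address space maps just below the end. */
	assert(demo_mem_to_shadow(UINT32_C(0xFFFFFFFF)) ==
	       demo_shadow_end() - 1);
}
```

Under this assumed split the values come out as KASAN_SHADOW_END =
0xBF000000, KASAN_SHADOW_OFFSET = 0x9F000000 and KASAN_SHADOW_START =
0xB6E00000. Other splits give different numbers.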

When KASan is enabled, TASK_SIZE is no longer an 8-bit rotated
constant, so the code that accesses TASK_SIZE in the *.S files needs
to be modified.
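
The constraint can be checked with a short user-space sketch. An ARM
(A32) data-processing immediate is an 8-bit value rotated right by an
even amount, so "mov r1, #TASK_SIZE" only assembles when TASK_SIZE has
that shape. The function names below are hypothetical, and the two
example values assume CONFIG_PAGE_OFFSET = 0xC0000000.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value left by r bits (r is taken modulo 32). */
static uint32_t demo_rotl32(uint32_t v, unsigned int r)
{
	r &= 31;
	return r ? (v << r) | (v >> (32 - r)) : v;
}

/*
 * Return 1 if v can be written as an 8-bit value rotated right by an
 * even amount, 0 otherwise. Undoing each even rotation must leave a
 * value that fits in 8 bits.
 */
int demo_arm_imm_encodable(uint32_t v)
{
	unsigned int r;

	for (r = 0; r < 32; r += 2)
		if (demo_rotl32(v, r) <= UINT32_C(0xFF))
			return 1;
	return 0;
}

/* Example values, assuming CONFIG_PAGE_OFFSET = 0xC0000000. */
void demo_task_size_examples(void)
{
	/* Without KASan: TASK_SIZE = 0xBF000000 is 0xBF rotated. */
	assert(demo_arm_imm_encodable(UINT32_C(0xBF000000)));
	/* With KASan: TASK_SIZE = 0xB6E00000 spans 11 significant bits. */
	assert(!demo_arm_imm_encodable(UINT32_C(0xB6E00000)));
}
```

This is why the patch loads the constant from a literal pool with
"ldr rX, =TASK_SIZE" instead of using it as an immediate operand.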

Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Joel Stanley <joel@jms.id.au>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Abbott Liu <liuwenliang@huawei.com>
Signed-off-by: Abbott Liu <liuwenliang@huawei.com>
---
 arch/arm/include/asm/kasan_def.h | 64 ++++++++++++++++++++++++++++++++++++++++
 arch/arm/include/asm/memory.h    |  5 ++++
 arch/arm/kernel/entry-armv.S     |  5 ++--
 arch/arm/kernel/entry-common.S   |  9 ++++--
 arch/arm/mm/init.c               |  6 ++++
 arch/arm/mm/mmu.c                |  7 ++++-
 6 files changed, 90 insertions(+), 6 deletions(-)
 create mode 100644 arch/arm/include/asm/kasan_def.h

diff --git a/arch/arm/include/asm/kasan_def.h b/arch/arm/include/asm/kasan_def.h
new file mode 100644
index 0000000..7b7f424
--- /dev/null
+++ b/arch/arm/include/asm/kasan_def.h
@@ -0,0 +1,64 @@
+/*
+ *  arch/arm/include/asm/kasan_def.h
+ *
+ *  Copyright (c) 2018 Huawei Technologies Co., Ltd.
+ *
+ *  Author: Abbott Liu <liuwenliang@huawei.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef __ASM_KASAN_DEF_H
+#define __ASM_KASAN_DEF_H
+
+#ifdef CONFIG_KASAN
+
+/*
+ *    +----+ 0xffffffff
+ *    |    |
+ *    |    |
+ *    |    |
+ *    +----+ CONFIG_PAGE_OFFSET
+ *    |    |\
+ *    |    | |->  module virtual address space area.
+ *    |    |/
+ *    +----+ MODULES_VADDR = KASAN_SHADOW_END
+ *    |    |\
+ *    |    | |-> the shadow area of kernel virtual address.
+ *    |    |/
+ *    +----+ TASK_SIZE(start of kernel space) = KASAN_SHADOW_START  the
+ *    |    |\  shadow address of MODULES_VADDR
+ *    |    | ---------------------+
+ *    |    |                      |
+ *    +    + KASAN_SHADOW_OFFSET  |-> the user space area. Kernel address
+ *    |    |                      |    sanitizer do not use this space.
+ *    |    | ---------------------+
+ *    |    |/
+ *    ------ 0
+ *
+ * 1) KASAN_SHADOW_OFFSET
+ *    This value is used to map an address to the corresponding shadow
+ *    address by the following formula:
+ *    shadow_addr = (address >> 3) + KASAN_SHADOW_OFFSET;
+ *
+ * 2) KASAN_SHADOW_START
+ *    This value is the shadow address of MODULES_VADDR. It is the start
+ *    of kernel virtual space.
+ *
+ * 3) KASAN_SHADOW_END
+ *    This value is the shadow address of 0x100000000. It is the end of
+ *    the kernel address sanitizer's shadow area. It is also the start
+ *    of the module area.
+ *
+ */
+
+#define KASAN_SHADOW_OFFSET     (KASAN_SHADOW_END - (1<<29))
+
+#define KASAN_SHADOW_START      ((KASAN_SHADOW_END >> 3) + KASAN_SHADOW_OFFSET)
+
+#define KASAN_SHADOW_END        (UL(CONFIG_PAGE_OFFSET) - UL(SZ_16M))
+
+#endif
+#endif
diff --git a/arch/arm/include/asm/memory.h b/arch/arm/include/asm/memory.h
index 4966677..3ce1a9a 100644
--- a/arch/arm/include/asm/memory.h
+++ b/arch/arm/include/asm/memory.h
@@ -21,6 +21,7 @@
 #ifdef CONFIG_NEED_MACH_MEMORY_H
 #include <mach/memory.h>
 #endif
+#include <asm/kasan_def.h>
 
 /*
  * Allow for constants defined here to be used from assembly code
@@ -37,7 +38,11 @@
  * TASK_SIZE - the maximum size of a user space task.
  * TASK_UNMAPPED_BASE - the lower boundary of the mmap VM area
  */
+#ifndef CONFIG_KASAN
 #define TASK_SIZE		(UL(CONFIG_PAGE_OFFSET) - UL(SZ_16M))
+#else
+#define TASK_SIZE		(KASAN_SHADOW_START)
+#endif
 #define TASK_UNMAPPED_BASE	ALIGN(TASK_SIZE / 3, SZ_16M)
 
 /*
diff --git a/arch/arm/kernel/entry-armv.S b/arch/arm/kernel/entry-armv.S
index 1752033..b4de9e4 100644
--- a/arch/arm/kernel/entry-armv.S
+++ b/arch/arm/kernel/entry-armv.S
@@ -183,7 +183,7 @@ ENDPROC(__und_invalid)
 
 	get_thread_info tsk
 	ldr	r0, [tsk, #TI_ADDR_LIMIT]
-	mov	r1, #TASK_SIZE
+	ldr	r1, =TASK_SIZE
 	str	r1, [tsk, #TI_ADDR_LIMIT]
 	str	r0, [sp, #SVC_ADDR_LIMIT]
 
@@ -437,7 +437,8 @@ ENDPROC(__fiq_abt)
 	@ if it was interrupted in a critical region.  Here we
 	@ perform a quick test inline since it should be false
 	@ 99.9999% of the time.  The rest is done out of line.
-	cmp	r4, #TASK_SIZE
+	ldr	r0, =TASK_SIZE
+	cmp	r4, r0
 	blhs	kuser_cmpxchg64_fixup
 #endif
 #endif
diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
index 3c4f887..78046de 100644
--- a/arch/arm/kernel/entry-common.S
+++ b/arch/arm/kernel/entry-common.S
@@ -51,7 +51,8 @@ ret_fast_syscall:
  UNWIND(.cantunwind	)
 	disable_irq_notrace			@ disable interrupts
 	ldr	r2, [tsk, #TI_ADDR_LIMIT]
-	cmp	r2, #TASK_SIZE
+	ldr	r1, =TASK_SIZE
+	cmp	r2, r1
 	blne	addr_limit_check_failed
 	ldr	r1, [tsk, #TI_FLAGS]		@ re-check for syscall tracing
 	tst	r1, #_TIF_SYSCALL_WORK | _TIF_WORK_MASK
@@ -81,7 +82,8 @@ ret_fast_syscall:
 	str	r0, [sp, #S_R0 + S_OFF]!	@ save returned r0
 	disable_irq_notrace			@ disable interrupts
 	ldr	r2, [tsk, #TI_ADDR_LIMIT]
-	cmp	r2, #TASK_SIZE
+	ldr     r1, =TASK_SIZE
+	cmp     r2, r1
 	blne	addr_limit_check_failed
 	ldr	r1, [tsk, #TI_FLAGS]		@ re-check for syscall tracing
 	tst	r1, #_TIF_SYSCALL_WORK | _TIF_WORK_MASK
@@ -116,7 +118,8 @@ ret_slow_syscall:
 	disable_irq_notrace			@ disable interrupts
 ENTRY(ret_to_user_from_irq)
 	ldr	r2, [tsk, #TI_ADDR_LIMIT]
-	cmp	r2, #TASK_SIZE
+	ldr     r1, =TASK_SIZE
+	cmp	r2, r1
 	blne	addr_limit_check_failed
 	ldr	r1, [tsk, #TI_FLAGS]
 	tst	r1, #_TIF_WORK_MASK
diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index bd6f451..da11f61 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -538,6 +538,9 @@ void __init mem_init(void)
 #ifdef CONFIG_MODULES
 			"    modules : 0x%08lx - 0x%08lx   (%4ld MB)\n"
 #endif
+#ifdef CONFIG_KASAN
+			"    kasan   : 0x%08lx - 0x%08lx   (%4ld MB)\n"
+#endif
 			"      .text : 0x%p" " - 0x%p" "   (%4td kB)\n"
 			"      .init : 0x%p" " - 0x%p" "   (%4td kB)\n"
 			"      .data : 0x%p" " - 0x%p" "   (%4td kB)\n"
@@ -558,6 +561,9 @@ void __init mem_init(void)
 #ifdef CONFIG_MODULES
 			MLM(MODULES_VADDR, MODULES_END),
 #endif
+#ifdef CONFIG_KASAN
+			MLM(KASAN_SHADOW_START, KASAN_SHADOW_END),
+#endif
 
 			MLK_ROUNDUP(_text, _etext),
 			MLK_ROUNDUP(__init_begin, __init_end),
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index e46a6a4..f5aa1de 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -1251,9 +1251,14 @@ static inline void prepare_page_table(void)
 	/*
 	 * Clear out all the mappings below the kernel image.
 	 */
-	for (addr = 0; addr < MODULES_VADDR; addr += PMD_SIZE)
+	for (addr = 0; addr < TASK_SIZE; addr += PMD_SIZE)
 		pmd_clear(pmd_off_k(addr));
 
+#ifdef CONFIG_KASAN
+	/* TASK_SIZE ~ MODULES_VADDR is the KASAN shadow area -- skip over it */
+	addr = MODULES_VADDR;
+#endif
+
 #ifdef CONFIG_XIP_KERNEL
 	/* The XIP kernel is mapped in the module area -- skip over it */
 	addr = ((unsigned long)_exiprom + PMD_SIZE - 1) & PMD_MASK;
-- 
2.9.0


* [PATCH 3/6] Replace memory function for kasan
From: Abbott Liu @ 2018-04-12 12:54 UTC (permalink / raw)
  To: aryabinin, glider, dvyukov, corbet, linux, cdall, marc.zyngier,
	arnd, nicolas.pitre, vladimir.murzin, mingo, viro,
	alexandre.belloni, ard.biesheuvel, daniel.lezcano, pombredanne,
	kstewart, liuwenliang, gregkh, f.fainelli, mawilcox, akpm,
	mark.rutland, afzal.mohd.ma, alexander.levin, tglx, thgarnie,
	dhowells, keescook, geert, tixy, julien.thierry, james.morse,
	zhichao.huang, jinb.park7, labbott, philip, grygorii.strashko,
	opendmb, mhocko, kirill.shutemov, kasan-dev, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm
In-Reply-To: <20180412125411.326-1-liuwenliang@huawei.com>

From: Andrey Ryabinin <a.ryabinin@samsung.com>

Functions like memset/memmove/memcpy do a lot of memory accesses.
If a bad pointer is passed to one of these functions, it is important
to catch it. The compiler's instrumentation cannot do this because
these functions are written in assembly.

KASan replaces the memory functions with manually instrumented variants.
The original functions are declared as weak symbols so that the strong
definitions in mm/kasan/kasan.c can replace them. The original functions
also have aliases with a '__' prefix in their names, so the
non-instrumented variants can still be called when needed.

We must use __memcpy/__memset instead of memcpy/memset when we copy
.data to RAM and when we clear .bss, because kasan_early_init can't
be called before .data and .bss have been initialized.
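
The split between an instrumented entry point and a raw variant can be
sketched in user space as follows. This is only an illustration: all
demo_* names are hypothetical, demo_raw_memset stands in for __memset,
and demo_region_check is a placeholder that merely counts calls where
KASan would consult shadow memory.

```c
#include <assert.h>
#include <stddef.h>

/* Number of times the (placeholder) access check has run. */
int demo_checks_performed;

/* Stand-in for __memset: the plain, non-instrumented implementation. */
void *demo_raw_memset(void *s, int c, size_t n)
{
	unsigned char *p = s;

	while (n--)
		*p++ = (unsigned char)c;
	return s;
}

/* Placeholder for the sanitizer's check; it only counts invocations. */
static void demo_region_check(const void *addr, size_t n)
{
	(void)addr;
	(void)n;
	demo_checks_performed++;
}

/* Stand-in for the instrumented memset: check first, then do the work. */
void *demo_memset(void *s, int c, size_t n)
{
	demo_region_check(s, n);
	return demo_raw_memset(s, c, n);
}

/* Usage: the instrumented variant fills the buffer like the raw one. */
void demo_usage(void)
{
	char buf[4] = { 1, 1, 1, 1 };

	demo_memset(buf, 0, sizeof(buf));
	assert(buf[0] == 0 && buf[3] == 0);
}
```

Code that must run before the shadow exists, such as the .data copy and
the .bss clear mentioned above, corresponds to calling demo_raw_memset
directly so that no check is attempted.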

Reported-by: Russell King - ARM Linux <linux@armlinux.org.uk>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Joel Stanley <joel@jms.id.au>
Tested-by: Abbott Liu <liuwenliang@huawei.com>
Signed-off-by: Abbott Liu <liuwenliang@huawei.com>
---
 arch/arm/boot/compressed/decompress.c |  2 ++
 arch/arm/boot/compressed/libfdt_env.h |  2 ++
 arch/arm/include/asm/string.h         | 17 +++++++++++++++++
 arch/arm/kernel/head-common.S         |  4 ++--
 arch/arm/lib/memcpy.S                 |  3 +++
 arch/arm/lib/memmove.S                |  5 ++++-
 arch/arm/lib/memset.S                 |  3 +++
 7 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/arch/arm/boot/compressed/decompress.c b/arch/arm/boot/compressed/decompress.c
index a2ac3fe..0596077 100644
--- a/arch/arm/boot/compressed/decompress.c
+++ b/arch/arm/boot/compressed/decompress.c
@@ -49,8 +49,10 @@ extern int memcmp(const void *cs, const void *ct, size_t count);
 #endif
 
 #ifdef CONFIG_KERNEL_XZ
+#ifndef CONFIG_KASAN
 #define memmove memmove
 #define memcpy memcpy
+#endif
 #include "../../../../lib/decompress_unxz.c"
 #endif
 
diff --git a/arch/arm/boot/compressed/libfdt_env.h b/arch/arm/boot/compressed/libfdt_env.h
index 0743781..736ed36 100644
--- a/arch/arm/boot/compressed/libfdt_env.h
+++ b/arch/arm/boot/compressed/libfdt_env.h
@@ -17,4 +17,6 @@ typedef __be64 fdt64_t;
 #define fdt64_to_cpu(x)		be64_to_cpu(x)
 #define cpu_to_fdt64(x)		cpu_to_be64(x)
 
+#undef memset
+
 #endif
diff --git a/arch/arm/include/asm/string.h b/arch/arm/include/asm/string.h
index 111a1d8..1f9016b 100644
--- a/arch/arm/include/asm/string.h
+++ b/arch/arm/include/asm/string.h
@@ -15,15 +15,18 @@ extern char * strchr(const char * s, int c);
 
 #define __HAVE_ARCH_MEMCPY
 extern void * memcpy(void *, const void *, __kernel_size_t);
+extern void *__memcpy(void *dest, const void *src, __kernel_size_t n);
 
 #define __HAVE_ARCH_MEMMOVE
 extern void * memmove(void *, const void *, __kernel_size_t);
+extern void *__memmove(void *dest, const void *src, __kernel_size_t n);
 
 #define __HAVE_ARCH_MEMCHR
 extern void * memchr(const void *, int, __kernel_size_t);
 
 #define __HAVE_ARCH_MEMSET
 extern void * memset(void *, int, __kernel_size_t);
+extern void *__memset(void *s, int c, __kernel_size_t n);
 
 #define __HAVE_ARCH_MEMSET32
 extern void *__memset32(uint32_t *, uint32_t v, __kernel_size_t);
@@ -39,4 +42,18 @@ static inline void *memset64(uint64_t *p, uint64_t v, __kernel_size_t n)
 	return __memset64(p, v, n * 8, v >> 32);
 }
 
+
+
+#if defined(CONFIG_KASAN) && !defined(__SANITIZE_ADDRESS__)
+
+/*
+ * For files that are not instrumented (e.g. mm/slub.c) we
+ * should use the non-instrumented version of mem* functions.
+ */
+
+#define memcpy(dst, src, len) __memcpy(dst, src, len)
+#define memmove(dst, src, len) __memmove(dst, src, len)
+#define memset(s, c, n) __memset(s, c, n)
+#endif
+
 #endif
diff --git a/arch/arm/kernel/head-common.S b/arch/arm/kernel/head-common.S
index 6e0375e..c79b829 100644
--- a/arch/arm/kernel/head-common.S
+++ b/arch/arm/kernel/head-common.S
@@ -99,7 +99,7 @@ __mmap_switched:
  THUMB(	ldmia	r4!, {r0, r1, r2, r3} )
  THUMB(	mov	sp, r3 )
 	sub	r2, r2, r1
-	bl	memcpy				@ copy .data to RAM
+	bl	__memcpy			@ copy .data to RAM
 #endif
 
    ARM(	ldmia	r4!, {r0, r1, sp} )
@@ -107,7 +107,7 @@ __mmap_switched:
  THUMB(	mov	sp, r3 )
 	sub	r2, r1, r0
 	mov	r1, #0
-	bl	memset				@ clear .bss
+	bl	__memset			@ clear .bss
 
 	ldmia	r4, {r0, r1, r2, r3}
 	str	r9, [r0]			@ Save processor ID
diff --git a/arch/arm/lib/memcpy.S b/arch/arm/lib/memcpy.S
index 64111bd..79a83f8 100644
--- a/arch/arm/lib/memcpy.S
+++ b/arch/arm/lib/memcpy.S
@@ -61,6 +61,8 @@
 
 /* Prototype: void *memcpy(void *dest, const void *src, size_t n); */
 
+.weak memcpy
+ENTRY(__memcpy)
 ENTRY(mmiocpy)
 ENTRY(memcpy)
 
@@ -68,3 +70,4 @@ ENTRY(memcpy)
 
 ENDPROC(memcpy)
 ENDPROC(mmiocpy)
+ENDPROC(__memcpy)
diff --git a/arch/arm/lib/memmove.S b/arch/arm/lib/memmove.S
index 69a9d47..313db6c 100644
--- a/arch/arm/lib/memmove.S
+++ b/arch/arm/lib/memmove.S
@@ -27,12 +27,14 @@
  * occurring in the opposite direction.
  */
 
+.weak memmove
+ENTRY(__memmove)
 ENTRY(memmove)
 	UNWIND(	.fnstart			)
 
 		subs	ip, r0, r1
 		cmphi	r2, ip
-		bls	memcpy
+		bls	__memcpy
 
 		stmfd	sp!, {r0, r4, lr}
 	UNWIND(	.fnend				)
@@ -225,3 +227,4 @@ ENTRY(memmove)
 18:		backward_copy_shift	push=24	pull=8
 
 ENDPROC(memmove)
+ENDPROC(__memmove)
diff --git a/arch/arm/lib/memset.S b/arch/arm/lib/memset.S
index ed6d35d..64aa06a 100644
--- a/arch/arm/lib/memset.S
+++ b/arch/arm/lib/memset.S
@@ -16,6 +16,8 @@
 	.text
 	.align	5
 
+.weak memset
+ENTRY(__memset)
 ENTRY(mmioset)
 ENTRY(memset)
 UNWIND( .fnstart         )
@@ -135,6 +137,7 @@ UNWIND( .fnstart            )
 UNWIND( .fnend   )
 ENDPROC(memset)
 ENDPROC(mmioset)
+ENDPROC(__memset)
 
 ENTRY(__memset32)
 UNWIND( .fnstart         )
-- 
2.9.0

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [PATCH 2/6] Disable instrumentation for some code
From: Abbott Liu @ 2018-04-12 12:54 UTC (permalink / raw)
  To: aryabinin, glider, dvyukov, corbet, linux, cdall, marc.zyngier,
	arnd, nicolas.pitre, vladimir.murzin, mingo, viro,
	alexandre.belloni, ard.biesheuvel, daniel.lezcano, pombredanne,
	kstewart, liuwenliang, gregkh, f.fainelli, mawilcox, akpm,
	mark.rutland, afzal.mohd.ma, alexander.levin, tglx, thgarnie,
	dhowells, keescook, geert, tixy, julien.thierry, james.morse,
	zhichao.huang, jinb.park7, labbott, philip, grygorii.strashko,
	opendmb, mhocko, kirill.shutemov, kasan-dev, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm
In-Reply-To: <20180412125411.326-1-liuwenliang@huawei.com>

From: Andrey Ryabinin <a.ryabinin@samsung.com>

Disable instrumentation for arch/arm/boot/compressed/*
and arch/arm/vdso/* because that code is not linked into the
kernel image.

Disable instrumentation for arch/arm/kvm/hyp/*. See commit a6cdf1c08cbf
("kvm: arm64: Disable compiler instrumentation for hypervisor code")
for more details.

Disable instrumentation for arch/arm/mm/physaddr.c. See
commit ec6d06efb0ba ("arm64: Add support for CONFIG_DEBUG_VIRTUAL")
for more details.

Disable the KASAN check in the function unwind_pop_register,
because it does not matter if a KASAN check fails when
unwind_pop_register reads the stack memory of a task.
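
The unchecked-read pattern can be sketched in user space as follows. This
is only an approximation of what READ_ONCE_NOCHECK does in the kernel: the
demo_* names are hypothetical, and the sketch assumes a GCC or Clang
toolchain that honours the no_sanitize_address function attribute. The
read and the pointer increment are kept as separate statements, as in the
patch, so the helper's argument has no side effects.

```c
#include <assert.h>

/*
 * Hypothetical helper: the attribute asks the compiler not to instrument
 * this function when building with -fsanitize=address.
 */
__attribute__((no_sanitize_address))
static unsigned long demo_read_word_nocheck(const unsigned long *p)
{
	/* volatile access so the compiler performs exactly one load */
	return *(const volatile unsigned long *)p;
}

/* Pop one word from a virtual stack pointer, bounded by 'limit'. */
static int demo_pop_word(unsigned long **vsp, const unsigned long *limit,
			 unsigned long *out)
{
	if (*vsp >= limit)
		return -1;			/* nothing left to pop */

	*out = demo_read_word_nocheck(*vsp);	/* unchecked read */
	(*vsp)++;				/* advance separately */
	return 0;
}
```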

Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Reported-by: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Joel Stanley <joel@jms.id.au>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Abbott Liu <liuwenliang@huawei.com>
Signed-off-by: Abbott Liu <liuwenliang@huawei.com>
---
 arch/arm/boot/compressed/Makefile | 1 +
 arch/arm/kernel/unwind.c          | 3 ++-
 arch/arm/kvm/hyp/Makefile         | 4 ++++
 arch/arm/mm/Makefile              | 1 +
 arch/arm/vdso/Makefile            | 2 ++
 5 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/arm/boot/compressed/Makefile b/arch/arm/boot/compressed/Makefile
index 45a6b9b..966103e 100644
--- a/arch/arm/boot/compressed/Makefile
+++ b/arch/arm/boot/compressed/Makefile
@@ -24,6 +24,7 @@ OBJS		+= hyp-stub.o
 endif
 
 GCOV_PROFILE		:= n
+KASAN_SANITIZE		:= n
 
 #
 # Architecture dependencies
diff --git a/arch/arm/kernel/unwind.c b/arch/arm/kernel/unwind.c
index 0bee233..2e55c7d 100644
--- a/arch/arm/kernel/unwind.c
+++ b/arch/arm/kernel/unwind.c
@@ -249,7 +249,8 @@ static int unwind_pop_register(struct unwind_ctrl_block *ctrl,
 		if (*vsp >= (unsigned long *)ctrl->sp_high)
 			return -URC_FAILURE;
 
-	ctrl->vrs[reg] = *(*vsp)++;
+	ctrl->vrs[reg] = READ_ONCE_NOCHECK(*(*vsp));
+	(*vsp)++;
 	return URC_OK;
 }
 
diff --git a/arch/arm/kvm/hyp/Makefile b/arch/arm/kvm/hyp/Makefile
index 63d6b40..0a8b500 100644
--- a/arch/arm/kvm/hyp/Makefile
+++ b/arch/arm/kvm/hyp/Makefile
@@ -24,3 +24,7 @@ obj-$(CONFIG_KVM_ARM_HOST) += hyp-entry.o
 obj-$(CONFIG_KVM_ARM_HOST) += switch.o
 CFLAGS_switch.o		   += $(CFLAGS_ARMV7VE)
 obj-$(CONFIG_KVM_ARM_HOST) += s2-setup.o
+
+GCOV_PROFILE	:= n
+KASAN_SANITIZE	:= n
+UBSAN_SANITIZE	:= n
diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile
index 9dbb849..c056e17 100644
--- a/arch/arm/mm/Makefile
+++ b/arch/arm/mm/Makefile
@@ -16,6 +16,7 @@ endif
 obj-$(CONFIG_ARM_PTDUMP_CORE)	+= dump.o
 obj-$(CONFIG_ARM_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
 obj-$(CONFIG_MODULES)		+= proc-syms.o
+KASAN_SANITIZE_physaddr.o	:= n
 obj-$(CONFIG_DEBUG_VIRTUAL)	+= physaddr.o
 
 obj-$(CONFIG_ALIGNMENT_TRAP)	+= alignment.o
diff --git a/arch/arm/vdso/Makefile b/arch/arm/vdso/Makefile
index bb411821..87abbb7 100644
--- a/arch/arm/vdso/Makefile
+++ b/arch/arm/vdso/Makefile
@@ -30,6 +30,8 @@ CFLAGS_vgettimeofday.o = -O2
 # Disable gcov profiling for VDSO code
 GCOV_PROFILE := n
 
+KASAN_SANITIZE := n
+
 # Force dependency
 $(obj)/vdso.o : $(obj)/vdso.so
 
-- 
2.9.0


* Re: [PATCH v11 1/4] arm64: KVM: export the capability to set guest SError syndrome
From: gengdongjiu @ 2018-04-12 12:55 UTC (permalink / raw)
  To: James Morse
  Cc: Dongjiu Geng, linux-doc, kvm, corbet, marc.zyngier,
	catalin.marinas, rjw, linux, linux-kernel, linux-acpi,
	zhengxiang9, bp, linux-arm-kernel, huangshaoyu, devel, kvmarm,
	christoffer.dall, lenb
In-Reply-To: <98360e36-5fb7-aea4-52c3-21595d09419f@arm.com>

Hi James,
  Thanks for the review.


2018-04-10 22:15 GMT+08:00, James Morse <james.morse@arm.com>:
> Hi Dongjiu Geng,
>
> On 09/04/18 22:36, Dongjiu Geng wrote:
>> Before user space injects a SError, it needs to know whether it can
>> specify the guest Exception Syndrome, so KVM should tell user space
>> whether it has such capability.
>
> (you could improve the commit message by briefly explaining how/why
> user-space would want to do this. As this is patch 1, you don't have the
> context of the previous patch to say that some systems can provide an ESR
> with virtual-SError)
Exactly, thanks for the good comments.

>
>
>> diff --git a/Documentation/virtual/kvm/api.txt
>> b/Documentation/virtual/kvm/api.txt
>> index fc3ae95..8a3d708 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -4415,3 +4415,14 @@ Parameters: none
>>  This capability indicates if the flic device will be able to get/set the
>>  AIS states for migration via the KVM_DEV_FLIC_AISM_ALL attribute and
>> allows
>>  to discover this without having to create a flic device.
>> +
>> +8.14 KVM_CAP_ARM_SET_SERROR_ESR
>> +
>> +Architectures: arm, arm64
>> +
>> +This capability indicates that userspace can specify syndrome value
>> reported to
>
> (Nit: 'the syndrome value')
will fix it.

>
>> +guest OS when guest takes a virtual SError interrupt exception.
>
> (Nit: 'the guest')
will fix it.

>
>> +If KVM has this capability, userspace can only specify the ISS field for
>> the ESR
>> +syndrome, can not specify the EC field which is not under control by
>> KVM.
>
> (Nit: 'it can not specify...')
will fix it.

>
>> +If this virtual SError is taken to EL1 using AArch64, this value will be
>> reported
>> +into ISS filed of ESR_EL1.
>
> (Nit: 'in the ISS field')
will fix it.

>
>
>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>> index 3256b92..38c8a64 100644
>> --- a/arch/arm64/kvm/reset.c
>> +++ b/arch/arm64/kvm/reset.c
>> @@ -77,6 +77,9 @@ int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm,
>> long ext)
>>  	case KVM_CAP_ARM_PMU_V3:
>>  		r = kvm_arm_support_pmu_v3();
>>  		break;
>> +	case KVM_CAP_ARM_INJECT_SERROR_ESR:
>> +		r = cpus_have_const_cap(ARM64_HAS_RAS_EXTN);
>> +		break;
>>  	case KVM_CAP_SET_GUEST_DEBUG:
>>  	case KVM_CAP_VCPU_ATTRIBUTES:
>>  		r = 1;
>
> 'dev_ioctl' feels a bit weird, but we already have cpu_has_32bit_el1() in
> here.

Yes, although the name is "dev_ioctl", it has nothing to do with the
device. Here it mainly checks vcpu capabilities, such as PMU, 32-bit EL1,
etc.

>
>
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 8fb90a0..3587b33 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -934,6 +934,7 @@ struct kvm_ppc_resize_hpt {
>>  #define KVM_CAP_S390_AIS_MIGRATION 150
>>  #define KVM_CAP_PPC_GET_CPU_CHAR 151
>>  #define KVM_CAP_S390_BPB 152
>> +#define KVM_CAP_ARM_INJECT_SERROR_ESR 153
>>
>>  #ifdef KVM_CAP_IRQ_ROUTING
>
> (patch 1&2 should probably be swapped around, as on its own this does
> nothing).
ok, I will do it.

>
> Reviewed-by: James Morse <james.morse@arm.com>
Thanks for the Reviewed-by.

>
>
> Thanks,
>
> James

* [PATCH v4] mm: remove odd HAVE_PTE_SPECIAL
From: Laurent Dufour @ 2018-04-12 11:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm, mhocko, akpm
  Cc: linuxppc-dev, x86, linux-doc, linux-snps-arc, linux-arm-kernel,
	linux-riscv, linux-s390, linux-sh, sparclinux, Jerome Glisse,
	aneesh.kumar, benh, paulus, Jonathan Corbet, Catalin Marinas,
	Will Deacon, Yoshinori Sato, Rich Felker, David S . Miller,
	Thomas Gleixner, Ingo Molnar, Vineet Gupta, Palmer Dabbelt,
	Albert Ou, Martin Schwidefsky, Heiko Carstens, David Rientjes,
	Robin Murphy, Christophe LEROY
In-Reply-To: <20180411110936.GG23400@dhcp22.suse.cz>

Remove the additional define HAVE_PTE_SPECIAL and rely directly on
CONFIG_ARCH_HAS_PTE_SPECIAL.

There is no functional change introduced by this patch.
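
For readers unfamiliar with the idiom, the sketch below shows how a
preprocessor test in the style of IS_ENABLED() can evaluate to 1 or 0
inside an ordinary C condition, which is what makes a separate
HAVE_PTE_SPECIAL define unnecessary. It is a simplified illustration and
not the kernel's actual macro: every DEMO_* and CONFIG_DEMO_* name is
hypothetical, and only options defined to 1 are recognised.

```c
#include <assert.h>

/* If 'option' is defined as 1, the pasted token expands to "0," (two args). */
#define DEMO_ARG_PLACEHOLDER_1 0,
#define DEMO_TAKE_SECOND(ignored, val, ...) val

#define DEMO_IS_ENABLED(option)		DEMO_IS_ENABLED_1(option)
#define DEMO_IS_ENABLED_1(val)		DEMO_IS_ENABLED_2(DEMO_ARG_PLACEHOLDER_##val)
#define DEMO_IS_ENABLED_2(arg1_or_junk)	DEMO_TAKE_SECOND(arg1_or_junk 1, 0, 0)

#define CONFIG_DEMO_HAS_SPECIAL 1
/* CONFIG_DEMO_MISSING is deliberately left undefined. */

/* Both branches are compiled and type-checked; the dead one is optimised away. */
static int demo_classify(int is_special)
{
	if (DEMO_IS_ENABLED(CONFIG_DEMO_HAS_SPECIAL)) {
		if (!is_special)
			return 1;	/* normal entry on a "special"-capable build */
		return 2;		/* special entry */
	}
	return 0;			/* fallback path for builds without the option */
}
```

Compared with #ifdef blocks, this form keeps the code for both
configurations visible to the compiler, so a build error in the rarely
used branch is caught on every architecture.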

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memory.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 96910c625daa..345e562a138d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -817,17 +817,12 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
  * PFNMAP mappings in order to support COWable mappings.
  *
  */
-#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
-# define HAVE_PTE_SPECIAL 1
-#else
-# define HAVE_PTE_SPECIAL 0
-#endif
 struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte, bool with_public_device)
 {
 	unsigned long pfn = pte_pfn(pte);
 
-	if (HAVE_PTE_SPECIAL) {
+	if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
 		if (likely(!pte_special(pte)))
 			goto check_pfn;
 		if (vma->vm_ops && vma->vm_ops->find_special_page)
@@ -862,7 +857,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 		return NULL;
 	}
 
-	/* !HAVE_PTE_SPECIAL case follows: */
+	/* !CONFIG_ARCH_HAS_PTE_SPECIAL case follows: */
 
 	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
 		if (vma->vm_flags & VM_MIXEDMAP) {
@@ -881,6 +876,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 
 	if (is_zero_pfn(pfn))
 		return NULL;
+
 check_pfn:
 	if (unlikely(pfn > highest_memmap_pfn)) {
 		print_bad_pte(vma, addr, pte, NULL);
@@ -904,7 +900,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 	/*
 	 * There is no pmd_special() but there may be special pmds, e.g.
 	 * in a direct-access (dax) mapping, so let's just replicate the
-	 * !HAVE_PTE_SPECIAL case from vm_normal_page() here.
+	 * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here.
 	 */
 	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
 		if (vma->vm_flags & VM_MIXEDMAP) {
@@ -1933,7 +1929,8 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 	 * than insert_pfn).  If a zero_pfn were inserted into a VM_MIXEDMAP
 	 * without pte special, it would there be refcounted as a normal page.
 	 */
-	if (!HAVE_PTE_SPECIAL && !pfn_t_devmap(pfn) && pfn_t_valid(pfn)) {
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) &&
+	    !pfn_t_devmap(pfn) && pfn_t_valid(pfn)) {
 		struct page *page;
 
 		/*
-- 
2.7.4


* Re: [PATCH v11 0/4] set VSESR_EL2 by user space and support NOTIFY_SEI notification
From: gengdongjiu @ 2018-04-12  6:09 UTC (permalink / raw)
  To: James Morse
  Cc: rkrcmar, corbet, christoffer.dall, marc.zyngier, linux,
	catalin.marinas, rjw, bp, lenb, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm, linux-acpi, devel, huangshaoyu,
	zhengxiang9
In-Reply-To: <d94439b6-9b9a-9bd0-ddec-3bb63ede9461@arm.com>

Hi James,
   Thanks for this mail.

On 2018/4/10 22:15, James Morse wrote:
> Hi Dongjiu Geng,
> 
> On 09/04/18 22:36, Dongjiu Geng wrote:
>> 1. Detect whether KVM can set the guest SError syndrome.
>> 2. Support setting VSESR_EL2 and injecting SError from user space.
>> 3. Support live migration by keeping the SError pending state and the VSESR_EL2 value.
>> 4. ACPI 6.1 adds support for NOTIFY_SEI as a GHES notification mechanism, so support this
>>    notification in software: KVM or kernel ARCH code calls handle_guest_sei() to let the ACPI
>>    driver handle this notification.
> 
> Please don't post code during the merge-window, will this apply to v4.17-rc1? We
> can't know until its tagged.
I did not know when the merge window was. As for the version it applies to, there is no restriction.

> 
> 
> This series is doing two separate things, please split it into two series.
OK, thanks!

> 
> But on the ACPI front: I don't see how any OS can support your NOTIFY_SEI when
> firmware is ignoring the normal world's PSTATE.A.
> 
> The latest lobe of that discussion was on the list here:
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1611496.html
I have replied to that mail.
I still have some questions that I need to clarify with you.
After clarification, we will follow that.
The questions are in my reply to this mail: "https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1611496.html"

> 
> 
> As it is, we would need to spot SError being delivered while SError is masked,
> spray nasty messages about firmware being horrifically buggy, then panic(). For
> a corrected error, this looks bad, but its preferable to letting firmware
> silently overwrite the exception registers, causing linux to spin through the
> vectors 'eret' with all exceptions masked.
> I still think its best to wait for firmware that does the right thing.
Let us discuss that in another mail.
In summary, I think firmware that follows the rules below would be OK, right?
1. If the exception came from the EL that SError should be routed to (according to HCR_EL2.{AMO, TGE}) but PSTATE.A was set, EL3 firmware can't deliver the SError.
2. If the exception came from an EL that SError should not be routed to (according to HCR_EL2.{AMO, TGE}), EL3 firmware still delivers the SError even though PSTATE.A was set.

> 
> 
> Thanks,
> 
> James
> 
> .
> 


* Re: [PATCH v9 3/7] acpi: apei: Add SEI notification type support for ARMv8
From: gengdongjiu @ 2018-04-12  5:00 UTC (permalink / raw)
  To: James Morse, lishuo1, merry.libing
  Cc: gengdongjiu, linux-arm-kernel@lists.infradead.org,
	Liujun (Jun Liu), linux-kernel@vger.kernel.org, corbet@lwn.net,
	marc.zyngier@arm.com, catalin.marinas@arm.com,
	linux-doc@vger.kernel.org, rjw@rjwysocki.net,
	linux@armlinux.org.uk, will.deacon@arm.com,
	robert.moore@intel.com, linux-acpi@vger.kernel.org, bp@alien8.de,
	lv.zheng@intel.com, Huangshaoyu, kvmarm@lists.cs.columbia.edu,
	devel@acpica.org
In-Reply-To: <5A85C97C.5080605@arm.com>

Dear James,
        Thanks for this mail and sorry for my late response.


2018-02-16 1:55 GMT+08:00 James Morse <james.morse@arm.com>:
> Hi gengdongjiu, liu jun
>
> On 05/02/18 11:24, gengdongjiu wrote:
[....]
>>
>>> Is the emulated SError routed following the routing rules for HCR_EL2.{AMO,
>>> TGE}?
>>
>> Yes, it is.
>
> ... and yet ...
>
>
>>> What does your firmware do when it wants to emulate SError but its masked?
>>> (e.g.1: The physical-SError interrupted EL2 and the SPSR shows EL2 had
>>> PSTATE.A  set.
>>>  e.g.2: The physical-SError interrupted EL2 but HCR_EL2 indicates the
>>> emulated  SError should go to EL1. This effectively masks SError.)
>>
>> Currently we does not consider much about the mask status(SPSR).
>
> .. this is a problem.
>
> If you ignore SPSR_EL3 you may deliver an SError to EL1 when the exception
> interrupted EL2. Even if you setup the EL1 register correctly, EL1 can't eret to
> EL2. This should never happen, SError is effectively masked if you are running
> at an EL higher than the one its routed to.
>
> More obviously: if the exception came from the EL that SError should be routed
> to, but PSTATE.A was set, you can't deliver SError. Masking SError is the only

James, I have summarized the masking and routing rules for SError below, to
confirm them with you for the firmware-first solution.

1. If HCR_EL2.{AMO,TGE} is set, the SError should be routed to EL2.
When an SError occurs and traps to EL3, if EL3 finds that
HCR_EL2.{AMO,TGE} and SPSR_EL3.A are both set,
and finds that this SError came from EL2, it will not deliver an SError:
it stores the RAS error in the BERT and 'reboots'; but if
it finds that this SError came from EL1 or EL0, it still needs to deliver
an SError, right?


2. If HCR_EL2.{AMO,TGE} is not set, the SError should be routed to EL1.
When an SError occurs and traps to EL3, if EL3 finds that
HCR_EL2.{AMO,TGE} and SPSR_EL3.A are both not set,
and finds that this SError came from EL1, it will not deliver an SError:
it stores the RAS error in the BERT and 'reboots'; but if
it finds that this SError came from EL0, it still needs to deliver an
SError, right?
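
To make the cases easier to compare, here is a minimal sketch of the
decision as James states it in the quoted text: an SError is effectively
masked when the interrupted EL is higher than the EL it is routed to, and
it cannot be delivered when the interrupted EL is the target EL and
PSTATE.A (as saved in SPSR_EL3) was set. This is an illustration for
discussion only, not firmware code; the demo_* names are hypothetical, and
the routing helper assumes that either HCR_EL2.AMO or HCR_EL2.TGE being set
routes the SError to EL2.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_action {
	DEMO_DELIVER,		/* emulate the SError at the target EL */
	DEMO_CANNOT_DELIVER,	/* e.g. record the error in the BERT and reboot */
};

/* Assumed routing: EL2 if HCR_EL2.AMO or HCR_EL2.TGE is set, otherwise EL1. */
static int demo_serror_target_el(bool hcr_amo, bool hcr_tge)
{
	return (hcr_amo || hcr_tge) ? 2 : 1;
}

static enum demo_action demo_serror_decide(int source_el, bool spsr_a,
					   bool hcr_amo, bool hcr_tge)
{
	int target_el = demo_serror_target_el(hcr_amo, hcr_tge);

	/* Running above the target EL: the SError is effectively masked. */
	if (source_el > target_el)
		return DEMO_CANNOT_DELIVER;

	/* At the target EL with PSTATE.A set: the OS asked not to be interrupted. */
	if (source_el == target_el && spsr_a)
		return DEMO_CANNOT_DELIVER;

	/* Below the target EL: PSTATE.A of the lower EL does not mask it. */
	return DEMO_DELIVER;
}
```

For example, an SError that interrupts EL2 while HCR_EL2.{AMO,TGE} are
both clear gives DEMO_CANNOT_DELIVER here, because the target would be EL1
and EL1 cannot eret to EL2.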


> way the OS has to indicate it can't take an exception right now. VBAR_EL1 may be
> 'wrong' if we're doing some power-management, the FAR/ELR/ESR/SPSR registers may
> contain live values that the OS would lose if you deliver another exception over
> the top.
>
> If you deliver an emulated-SError as the OS eret's, your new ELR will point at
> the eret instruction and the CPU will spin on this instruction forever.
>
> You have to honour the masking and routing rules for SError, otherwise no OS can
> run safely with this firmware.
>
>
>> I remember that you ever suggested firmware should reboot if the mask status
>> is set(SPSR), right?
>
> Yes, this is my suggestion of what to do if you can't deliver an SError: store
> the RAS error in the BERT and 'reboot'.
>
>
>> I CC "liu jun" <liujun88@hisilicon.com> who is our UEFI firmware Architect,
>> if you have firmware requirements, you can raise again.
>
> (UEFI? I didn't think there was any of that at EL3, but I'm not familiar with
> all the 'PI' bits).
>
> The requirement is your emulated-SError from EL3 looks exactly like a
> physical-SError as if EL3 wasn't implemented.
> Your CPU has to handle cases where it can't deliver an SError, your emulation
> has to do the same.
>
> This is not something any OS can work around.
>
>
>>> Answers to these let us determine whether a bug is in the firmware or the
>>> kernel. If firmware is expecting the OS to do something special, I'd like to know
>>> about it from the beginning!
>>
>> I know your meaning, thanks for raising it again.
>
>
> Happy new year,
>
> James

* Re: [PATCH v3 09/10] drivers/hwmon: Add PECI hwmon client drivers
From: Guenter Roeck @ 2018-04-12  3:40 UTC (permalink / raw)
  To: Jae Hyun Yoo
  Cc: Alan Cox, Andrew Jeffery, Andrew Lunn, Andy Shevchenko,
	Arnd Bergmann, Benjamin Herrenschmidt, Fengguang Wu, Greg KH,
	Haiyue Wang, James Feist, Jason M Biils, Jean Delvare,
	Joel Stanley, Julia Cartwright, Miguel Ojeda, Milton Miller II,
	Pavel Machek, Randy Dunlap, Stef van Os, Sumeet R Pawnikar,
	Vernon Mauery, linux-kernel, linux-doc, devicetree, linux-hwmon,
	linux-arm-kernel, openbmc
In-Reply-To: <7b378954-1ce8-a2d2-5d09-8b6b36c048be@linux.intel.com>

On 04/11/2018 07:51 PM, Jae Hyun Yoo wrote:
> On 4/11/2018 5:34 PM, Guenter Roeck wrote:
>> On 04/11/2018 02:59 PM, Jae Hyun Yoo wrote:
>>> Hi Guenter,
>>>
>>> Thanks a lot for sharing your time. Please see my inline answers.
>>>
>>> On 4/10/2018 3:28 PM, Guenter Roeck wrote:
>>>> On Tue, Apr 10, 2018 at 11:32:11AM -0700, Jae Hyun Yoo wrote:
>>>>> This commit adds PECI cputemp and dimmtemp hwmon drivers.
>>>>>
>>>>> Signed-off-by: Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>
>>>>> Reviewed-by: Haiyue Wang <haiyue.wang@linux.intel.com>
>>>>> Reviewed-by: James Feist <james.feist@linux.intel.com>
>>>>> Reviewed-by: Vernon Mauery <vernon.mauery@linux.intel.com>
>>>>> Cc: Alan Cox <alan@linux.intel.com>
>>>>> Cc: Andrew Jeffery <andrew@aj.id.au>
>>>>> Cc: Andrew Lunn <andrew@lunn.ch>
>>>>> Cc: Andy Shevchenko <andriy.shevchenko@intel.com>
>>>>> Cc: Arnd Bergmann <arnd@arndb.de>
>>>>> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>>>>> Cc: Fengguang Wu <fengguang.wu@intel.com>
>>>>> Cc: Greg KH <gregkh@linuxfoundation.org>
>>>>> Cc: Guenter Roeck <linux@roeck-us.net>
>>>>> Cc: Jason M Biils <jason.m.bills@linux.intel.com>
>>>>> Cc: Jean Delvare <jdelvare@suse.com>
>>>>> Cc: Joel Stanley <joel@jms.id.au>
>>>>> Cc: Julia Cartwright <juliac@eso.teric.us>
>>>>> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
>>>>> Cc: Milton Miller II <miltonm@us.ibm.com>
>>>>> Cc: Pavel Machek <pavel@ucw.cz>
>>>>> Cc: Randy Dunlap <rdunlap@infradead.org>
>>>>> Cc: Stef van Os <stef.van.os@prodrive-technologies.com>
>>>>> Cc: Sumeet R Pawnikar <sumeet.r.pawnikar@intel.com>
>>>>> ---
>>>>>   drivers/hwmon/Kconfig         |  28 ++
>>>>>   drivers/hwmon/Makefile        |   2 +
>>>>>   drivers/hwmon/peci-cputemp.c  | 783 ++++++++++++++++++++++++++++++++++++++++++
>>>>>   drivers/hwmon/peci-dimmtemp.c | 432 +++++++++++++++++++++++
>>>>>   4 files changed, 1245 insertions(+)
>>>>>   create mode 100644 drivers/hwmon/peci-cputemp.c
>>>>>   create mode 100644 drivers/hwmon/peci-dimmtemp.c
>>>>>
>>>>> diff --git a/drivers/hwmon/Kconfig b/drivers/hwmon/Kconfig
>>>>> index f249a4428458..c52f610f81d0 100644
>>>>> --- a/drivers/hwmon/Kconfig
>>>>> +++ b/drivers/hwmon/Kconfig
>>>>> @@ -1259,6 +1259,34 @@ config SENSORS_NCT7904
>>>>>         This driver can also be built as a module.  If so, the module
>>>>>         will be called nct7904.
>>>>> +config SENSORS_PECI_CPUTEMP
>>>>> +    tristate "PECI CPU temperature monitoring support"
>>>>> +    depends on OF
>>>>> +    depends on PECI
>>>>> +    help
>>>>> +      If you say yes here you get support for the generic Intel PECI
>>>>> +      cputemp driver which provides Digital Thermal Sensor (DTS) thermal
>>>>> +      readings of the CPU package and CPU cores that are accessible using
>>>>> +      the PECI Client Command Suite via the processor PECI client.
>>>>> +      Check Documentation/hwmon/peci-cputemp for details.
>>>>> +
>>>>> +      This driver can also be built as a module.  If so, the module
>>>>> +      will be called peci-cputemp.
>>>>> +
>>>>> +config SENSORS_PECI_DIMMTEMP
>>>>> +    tristate "PECI DIMM temperature monitoring support"
>>>>> +    depends on OF
>>>>> +    depends on PECI
>>>>> +    help
>>>>> +      If you say yes here you get support for the generic Intel PECI hwmon
>>>>> +      driver which provides Digital Thermal Sensor (DTS) thermal readings of
>>>>> +      DIMM components that are accessible using the PECI Client Command
>>>>> +      Suite via the processor PECI client.
>>>>> +      Check Documentation/hwmon/peci-dimmtemp for details.
>>>>> +
>>>>> +      This driver can also be built as a module.  If so, the module
>>>>> +      will be called peci-dimmtemp.
>>>>> +
>>>>>   config SENSORS_NSA320
>>>>>       tristate "ZyXEL NSA320 and compatible fan speed and temperature sensors"
>>>>>       depends on GPIOLIB && OF
>>>>> diff --git a/drivers/hwmon/Makefile b/drivers/hwmon/Makefile
>>>>> index e7d52a36e6c4..48d9598fcd3a 100644
>>>>> --- a/drivers/hwmon/Makefile
>>>>> +++ b/drivers/hwmon/Makefile
>>>>> @@ -136,6 +136,8 @@ obj-$(CONFIG_SENSORS_NCT7802)    += nct7802.o
>>>>>   obj-$(CONFIG_SENSORS_NCT7904)    += nct7904.o
>>>>>   obj-$(CONFIG_SENSORS_NSA320)    += nsa320-hwmon.o
>>>>>   obj-$(CONFIG_SENSORS_NTC_THERMISTOR)    += ntc_thermistor.o
>>>>> +obj-$(CONFIG_SENSORS_PECI_CPUTEMP)    += peci-cputemp.o
>>>>> +obj-$(CONFIG_SENSORS_PECI_DIMMTEMP)    += peci-dimmtemp.o
>>>>>   obj-$(CONFIG_SENSORS_PC87360)    += pc87360.o
>>>>>   obj-$(CONFIG_SENSORS_PC87427)    += pc87427.o
>>>>>   obj-$(CONFIG_SENSORS_PCF8591)    += pcf8591.o
>>>>> diff --git a/drivers/hwmon/peci-cputemp.c b/drivers/hwmon/peci-cputemp.c
>>>>> new file mode 100644
>>>>> index 000000000000..f0bc92687512
>>>>> --- /dev/null
>>>>> +++ b/drivers/hwmon/peci-cputemp.c
>>>>> @@ -0,0 +1,783 @@
>>>>> +// SPDX-License-Identifier: GPL-2.0
>>>>> +// Copyright (c) 2018 Intel Corporation
>>>>> +
>>>>> +#include <linux/delay.h>
>>>>> +#include <linux/hwmon.h>
>>>>> +#include <linux/hwmon-sysfs.h>
>>>>
>>>> Is this include needed ?
>>>>
>>>
>>> No it isn't. Will drop the line.
>>>
>>>>> +#include <linux/jiffies.h>
>>>>> +#include <linux/module.h>
>>>>> +#include <linux/of_device.h>
>>>>> +#include <linux/peci.h>
>>>>> +
>>>>> +#define TEMP_TYPE_PECI        6  /* Sensor type 6: Intel PECI */
>>>>> +
>>>>> +#define CORE_MAX_ON_HSX       18 /* Max number of cores on Haswell */
>>>>> +#define CORE_MAX_ON_BDX       24 /* Max number of cores on Broadwell */
>>>>> +#define CORE_MAX_ON_SKX       28 /* Max number of cores on Skylake */
>>>>> +
>>>>> +#define DEFAULT_CHANNEL_NUMS  5
>>>>> +#define CORETEMP_CHANNEL_NUMS CORE_MAX_ON_SKX
>>>>> +#define CPUTEMP_CHANNEL_NUMS  (DEFAULT_CHANNEL_NUMS + CORETEMP_CHANNEL_NUMS)
>>>>> +
>>>>> +#define CLIENT_CPU_ID_MASK    0xf0ff0  /* Mask for Family / Model info */
>>>>> +
>>>>> +#define UPDATE_INTERVAL_MIN   HZ
>>>>> +
>>>>> +enum cpu_gens {
>>>>> +    CPU_GEN_HSX, /* Haswell Xeon */
>>>>> +    CPU_GEN_BRX, /* Broadwell Xeon */
>>>>> +    CPU_GEN_SKX, /* Skylake Xeon */
>>>>> +    CPU_GEN_MAX
>>>>> +};
>>>>> +
>>>>> +struct cpu_gen_info {
>>>>> +    u32 type;
>>>>> +    u32 cpu_id;
>>>>> +    u32 core_max;
>>>>> +};
>>>>> +
>>>>> +struct temp_data {
>>>>> +    bool valid;
>>>>> +    s32  value;
>>>>> +    unsigned long last_updated;
>>>>> +};
>>>>> +
>>>>> +struct temp_group {
>>>>> +    struct temp_data die;
>>>>> +    struct temp_data dts_margin;
>>>>> +    struct temp_data tcontrol;
>>>>> +    struct temp_data tthrottle;
>>>>> +    struct temp_data tjmax;
>>>>> +    struct temp_data core[CORETEMP_CHANNEL_NUMS];
>>>>> +};
>>>>> +
>>>>> +struct peci_cputemp {
>>>>> +    struct peci_client *client;
>>>>> +    struct device *dev;
>>>>> +    char name[PECI_NAME_SIZE];
>>>>> +    struct temp_group temp;
>>>>> +    u8 addr;
>>>>> +    uint cpu_no;
>>>>> +    const struct cpu_gen_info *gen_info;
>>>>> +    u32 core_mask;
>>>>> +    u32 temp_config[CPUTEMP_CHANNEL_NUMS + 1];
>>>>> +    uint config_idx;
>>>>> +    struct hwmon_channel_info temp_info;
>>>>> +    const struct hwmon_channel_info *info[2];
>>>>> +    struct hwmon_chip_info chip;
>>>>> +};
>>>>> +
>>>>> +enum cputemp_channels {
>>>>> +    channel_die,
>>>>> +    channel_dts_mrgn,
>>>>> +    channel_tcontrol,
>>>>> +    channel_tthrottle,
>>>>> +    channel_tjmax,
>>>>> +    channel_core,
>>>>> +};
>>>>> +
>>>>> +static const struct cpu_gen_info cpu_gen_info_table[] = {
>>>>> +    { .type = CPU_GEN_HSX,
>>>>> +      .cpu_id = 0x306f0, /* Family code: 6, Model number: 63 (0x3f) */
>>>>> +      .core_max = CORE_MAX_ON_HSX },
>>>>> +    { .type = CPU_GEN_BRX,
>>>>> +      .cpu_id = 0x406f0, /* Family code: 6, Model number: 79 (0x4f) */
>>>>> +      .core_max = CORE_MAX_ON_BDX },
>>>>> +    { .type = CPU_GEN_SKX,
>>>>> +      .cpu_id = 0x50650, /* Family code: 6, Model number: 85 (0x55) */
>>>>> +      .core_max = CORE_MAX_ON_SKX },
>>>>> +};
>>>>> +
>>>>> +static const u32 config_table[DEFAULT_CHANNEL_NUMS + 1] = {
>>>>> +    /* Die temperature */
>>>>> +    HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_CRIT |
>>>>> +    HWMON_T_CRIT_HYST,
>>>>> +
>>>>> +    /* DTS margin temperature */
>>>>> +    HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_MIN | HWMON_T_LCRIT,
>>>>> +
>>>>> +    /* Tcontrol temperature */
>>>>> +    HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_CRIT,
>>>>> +
>>>>> +    /* Tthrottle temperature */
>>>>> +    HWMON_T_LABEL | HWMON_T_INPUT,
>>>>> +
>>>>> +    /* Tjmax temperature */
>>>>> +    HWMON_T_LABEL | HWMON_T_INPUT,
>>>>> +
>>>>> +    /* Core temperature - for all core channels */
>>>>> +    HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_CRIT |
>>>>> +    HWMON_T_CRIT_HYST,
>>>>> +};
>>>>> +
>>>>> +static const char *cputemp_label[CPUTEMP_CHANNEL_NUMS] = {
>>>>> +    "Die",
>>>>> +    "DTS margin",
>>>>> +    "Tcontrol",
>>>>> +    "Tthrottle",
>>>>> +    "Tjmax",
>>>>> +    "Core 0", "Core 1", "Core 2", "Core 3",
>>>>> +    "Core 4", "Core 5", "Core 6", "Core 7",
>>>>> +    "Core 8", "Core 9", "Core 10", "Core 11",
>>>>> +    "Core 12", "Core 13", "Core 14", "Core 15",
>>>>> +    "Core 16", "Core 17", "Core 18", "Core 19",
>>>>> +    "Core 20", "Core 21", "Core 22", "Core 23",
>>>>> +};
>>>>> +
>>>>> +static int send_peci_cmd(struct peci_cputemp *priv,
>>>>> +             enum peci_cmd cmd,
>>>>> +             void *msg)
>>>>> +{
>>>>> +    return peci_command(priv->client->adapter, cmd, msg);
>>>>> +}
>>>>> +
>>>>> +static int need_update(struct temp_data *temp)
>>>>
>>>> Please use bool.
>>>>
>>>
>>> Okay. I'll use bool instead of int.
>>>
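
For reference, a minimal userspace sketch of a bool-returning need_update().
The tick_before() helper and the explicit now/min_interval parameters are
assumptions made so the example stands alone; in the driver these would be
jiffies, time_before() and UPDATE_INTERVAL_MIN. The signed-difference
comparison mirrors the idea behind time_before(), so the check keeps working
when the tick counter wraps.

```c
#include <stdbool.h>

/* Simplified stand-in for the driver's struct temp_data. */
struct temp_data {
	bool valid;
	int value;
	unsigned long last_updated;
};

/*
 * Wraparound-safe "a is before b" for a free-running tick counter.
 * Hypothetical helper; it relies on the usual two's-complement
 * conversion from unsigned long to long.
 */
static bool tick_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* True when the cached value is missing or older than min_interval. */
static bool need_update(const struct temp_data *temp, unsigned long now,
			unsigned long min_interval)
{
	return !(temp->valid &&
		 tick_before(now, temp->last_updated + min_interval));
}
```
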
>>>>> +{
>>>>> +    if (temp->valid &&
>>>>> +        time_before(jiffies, temp->last_updated + UPDATE_INTERVAL_MIN))
>>>>> +        return 0;
>>>>> +
>>>>> +    return 1;
>>>>> +}
>>>>> +
>>>>> +static void mark_updated(struct temp_data *temp)
>>>>> +{
>>>>> +    temp->valid = true;
>>>>> +    temp->last_updated = jiffies;
>>>>> +}
>>>>> +
>>>>> +static s32 ten_dot_six_to_millidegree(s32 val)
>>>>> +{
>>>>> +    return ((val ^ 0x8000) - 0x8000) * 1000 / 64;
>>>>> +}
>>>>> +
>>>>> +static int get_tjmax(struct peci_cputemp *priv)
>>>>> +{
>>>>> +    struct peci_rd_pkg_cfg_msg msg;
>>>>> +    int rc;
>>>>> +
>>>>> +    if (!priv->temp.tjmax.valid) {
>>>>> +        msg.addr = priv->addr;
>>>>> +        msg.index = MBX_INDEX_TEMP_TARGET;
>>>>> +        msg.param = 0;
>>>>> +        msg.rx_len = 4;
>>>>> +
>>>>> +        rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>>>> +        if (rc)
>>>>> +            return rc;
>>>>> +
>>>>> +        priv->temp.tjmax.value = (s32)msg.pkg_config[2] * 1000;
>>>>> +        priv->temp.tjmax.valid = true;
>>>>> +    }
>>>>> +
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +static int get_tcontrol(struct peci_cputemp *priv)
>>>>> +{
>>>>> +    struct peci_rd_pkg_cfg_msg msg;
>>>>> +    s32 tcontrol_margin;
>>>>> +    s32 tthrottle_offset;
>>>>> +    int rc;
>>>>> +
>>>>> +    if (!need_update(&priv->temp.tcontrol))
>>>>> +        return 0;
>>>>> +
>>>>> +    rc = get_tjmax(priv);
>>>>> +    if (rc)
>>>>> +        return rc;
>>>>> +
>>>>> +    msg.addr = priv->addr;
>>>>> +    msg.index = MBX_INDEX_TEMP_TARGET;
>>>>> +    msg.param = 0;
>>>>> +    msg.rx_len = 4;
>>>>> +
>>>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>>>> +    if (rc)
>>>>> +        return rc;
>>>>> +
>>>>> +    tcontrol_margin = msg.pkg_config[1];
>>>>> +    tcontrol_margin = ((tcontrol_margin ^ 0x80) - 0x80) * 1000;
>>>>> +    priv->temp.tcontrol.value = priv->temp.tjmax.value - tcontrol_margin;
>>>>> +
>>>>> +    tthrottle_offset = (msg.pkg_config[3] & 0x2f) * 1000;
>>>>> +    priv->temp.tthrottle.value = priv->temp.tjmax.value - tthrottle_offset;
>>>>> +
>>>>> +    mark_updated(&priv->temp.tcontrol);
>>>>> +    mark_updated(&priv->temp.tthrottle);
>>>>> +
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +static int get_tthrottle(struct peci_cputemp *priv)
>>>>> +{
>>>>> +    struct peci_rd_pkg_cfg_msg msg;
>>>>> +    s32 tcontrol_margin;
>>>>> +    s32 tthrottle_offset;
>>>>> +    int rc;
>>>>> +
>>>>> +    if (!need_update(&priv->temp.tthrottle))
>>>>> +        return 0;
>>>>> +
>>>>> +    rc = get_tjmax(priv);
>>>>> +    if (rc)
>>>>> +        return rc;
>>>>> +
>>>>> +    msg.addr = priv->addr;
>>>>> +    msg.index = MBX_INDEX_TEMP_TARGET;
>>>>> +    msg.param = 0;
>>>>> +    msg.rx_len = 4;
>>>>> +
>>>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>>>> +    if (rc)
>>>>> +        return rc;
>>>>> +
>>>>> +    tthrottle_offset = (msg.pkg_config[3] & 0x2f) * 1000;
>>>>> +    priv->temp.tthrottle.value = priv->temp.tjmax.value - tthrottle_offset;
>>>>> +
>>>>> +    tcontrol_margin = msg.pkg_config[1];
>>>>> +    tcontrol_margin = ((tcontrol_margin ^ 0x80) - 0x80) * 1000;
>>>>> +    priv->temp.tcontrol.value = priv->temp.tjmax.value - tcontrol_margin;
>>>>> +
>>>>> +    mark_updated(&priv->temp.tthrottle);
>>>>> +    mark_updated(&priv->temp.tcontrol);
>>>>> +
>>>>> +    return 0;
>>>>> +}
>>>>
>>>> I am quite completely missing how the two functions above are different.
>>>>
>>>
>>> The two functions above differ slightly, but both use the same PECI command, which returns the Tthrottle and Tcontrol values together in the pkg_config array, so each function updates both values to avoid duplicate PECI transactions. Combining them into a single get_tthrottle_and_tcontrol() would probably look better. I'll rewrite it.
>>>
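
A minimal sketch of what such a combined decode could look like, written as
standalone C so it can be compiled outside the kernel. The names
struct temp_target, sign_extend8() and decode_temp_target() are hypothetical.
The byte positions and the 0x2f mask are copied from the quoted patch as-is;
whether that mask is the intended one is not established here.

```c
#include <stdint.h>

/* Both values derived from one RdPkgConfig(TEMP_TARGET) response. */
struct temp_target {
	int32_t tcontrol_mdeg;
	int32_t tthrottle_mdeg;
};

/* Sign-extend an 8-bit two's-complement value. */
static int32_t sign_extend8(int32_t v)
{
	return ((v & 0xff) ^ 0x80) - 0x80;
}

/*
 * Decode Tcontrol and Tthrottle in a single pass so that one PECI
 * transaction refreshes both cached values.
 */
static struct temp_target decode_temp_target(const uint8_t pkg_config[4],
					     int32_t tjmax_mdeg)
{
	struct temp_target t;

	/* Byte 1: signed Tcontrol margin below Tjmax, in degrees C. */
	t.tcontrol_mdeg = tjmax_mdeg - sign_extend8(pkg_config[1]) * 1000;

	/* Byte 3: Tthrottle offset, masked exactly as in the patch. */
	t.tthrottle_mdeg = tjmax_mdeg - (pkg_config[3] & 0x2f) * 1000;

	return t;
}
```

With this shape, a single getter can send the command once, call the decoder,
and mark both cached entries as updated.
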
>>>>> +
>>>>> +static int get_die_temp(struct peci_cputemp *priv)
>>>>> +{
>>>>> +    struct peci_get_temp_msg msg;
>>>>> +    int rc;
>>>>> +
>>>>> +    if (!need_update(&priv->temp.die))
>>>>> +        return 0;
>>>>> +
>>>>> +    rc = get_tjmax(priv);
>>>>> +    if (rc)
>>>>> +        return rc;
>>>>> +
>>>>> +    msg.addr = priv->addr;
>>>>> +
>>>>> +    rc = send_peci_cmd(priv, PECI_CMD_GET_TEMP, &msg);
>>>>> +    if (rc)
>>>>> +        return rc;
>>>>> +
>>>>> +    priv->temp.die.value = priv->temp.tjmax.value +
>>>>> +                   ((s32)msg.temp_raw * 1000 / 64);
>>>>> +
>>>>> +    mark_updated(&priv->temp.die);
>>>>> +
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +static int get_dts_margin(struct peci_cputemp *priv)
>>>>> +{
>>>>> +    struct peci_rd_pkg_cfg_msg msg;
>>>>> +    s32 dts_margin;
>>>>> +    int rc;
>>>>> +
>>>>> +    if (!need_update(&priv->temp.dts_margin))
>>>>> +        return 0;
>>>>> +
>>>>> +    msg.addr = priv->addr;
>>>>> +    msg.index = MBX_INDEX_DTS_MARGIN;
>>>>> +    msg.param = 0;
>>>>> +    msg.rx_len = 4;
>>>>> +
>>>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>>>> +    if (rc)
>>>>> +        return rc;
>>>>> +
>>>>> +    dts_margin = (msg.pkg_config[1] << 8) | msg.pkg_config[0];
>>>>> +
>>>>> +    /**
>>>>> +     * Processors return a value of DTS reading in 10.6 format
>>>>> +     * (10 bits signed decimal, 6 bits fractional).
>>>>> +     * Error codes:
>>>>> +     *   0x8000: General sensor error
>>>>> +     *   0x8001: Reserved
>>>>> +     *   0x8002: Underflow on reading value
>>>>> +     *   0x8003-0x81ff: Reserved
>>>>> +     */
>>>>> +    if (dts_margin >= 0x8000 && dts_margin <= 0x81ff)
>>>>> +        return -EIO;
>>>>> +
>>>>> +    dts_margin = ten_dot_six_to_millidegree(dts_margin);
>>>>> +
>>>>> +    priv->temp.dts_margin.value = dts_margin;
>>>>> +
>>>>> +    mark_updated(&priv->temp.dts_margin);
>>>>> +
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +static int get_core_temp(struct peci_cputemp *priv, int core_index)
>>>>> +{
>>>>> +    struct peci_rd_pkg_cfg_msg msg;
>>>>> +    s32 core_dts_margin;
>>>>> +    int rc;
>>>>> +
>>>>> +    if (!need_update(&priv->temp.core[core_index]))
>>>>> +        return 0;
>>>>> +
>>>>> +    rc = get_tjmax(priv);
>>>>> +    if (rc)
>>>>> +        return rc;
>>>>> +
>>>>> +    msg.addr = priv->addr;
>>>>> +    msg.index = MBX_INDEX_PER_CORE_DTS_TEMP;
>>>>> +    msg.param = core_index;
>>>>> +    msg.rx_len = 4;
>>>>> +
>>>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>>>> +    if (rc)
>>>>> +        return rc;
>>>>> +
>>>>> +    core_dts_margin = (msg.pkg_config[1] << 8) | msg.pkg_config[0];
>>>>> +
>>>>> +    /**
>>>>> +     * Processors return a value of the core DTS reading in 10.6 format
>>>>> +     * (10 bits signed decimal, 6 bits fractional).
>>>>> +     * Error codes:
>>>>> +     *   0x8000: General sensor error
>>>>> +     *   0x8001: Reserved
>>>>> +     *   0x8002: Underflow on reading value
>>>>> +     *   0x8003-0x81ff: Reserved
>>>>> +     */
>>>>> +    if (core_dts_margin >= 0x8000 && core_dts_margin <= 0x81ff)
>>>>> +        return -EIO;
>>>>> +
>>>>> +    core_dts_margin = ten_dot_six_to_millidegree(core_dts_margin);
>>>>> +
>>>>> +    priv->temp.core[core_index].value = priv->temp.tjmax.value +
>>>>> +                        core_dts_margin;
>>>>> +
>>>>> +    mark_updated(&priv->temp.core[core_index]);
>>>>> +
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>
>>>> There is a lot of duplication in those functions. Would it be possible
>>>> to find common code and use functions for it instead of duplicating
>>>> everything several times ?
>>>>
>>>
>>> Are you pointing out this code?
>>> /**
>>>   * Processors return a value of the core DTS reading in 10.6 format
>>>   * (10 bits signed decimal, 6 bits fractional).
>>>   * Error codes:
>>>   *   0x8000: General sensor error
>>>   *   0x8001: Reserved
>>>   *   0x8002: Underflow on reading value
>>>   *   0x8003-0x81ff: Reserved
>>>   */
>>> if (core_dts_margin >= 0x8000 && core_dts_margin <= 0x81ff)
>>>      return -EIO;
>>>
>>> Then I'll rewrite it as a function. If not, please point out the duplication.
>>>
>>
>> There is lots of other duplication.
>>
> 
> Sorry but can you point out the duplication?
> 
Write a Python script to do a semantic comparison.
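
However the duplication is found, one instance is the 10.6 decode plus the
error-range check that appears in both get_dts_margin() and get_core_temp().
A minimal standalone sketch of a shared helper follows. dts_is_error() and
dts_to_millidegree() are hypothetical names; the arithmetic and the
0x8000-0x81ff error range are taken from the quoted patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Raw values 0x8000-0x81ff are error codes, not temperatures. */
static bool dts_is_error(uint16_t raw)
{
	return raw >= 0x8000 && raw <= 0x81ff;
}

/* Convert a signed 10.6 fixed-point value to millidegrees Celsius. */
static int32_t ten_dot_six_to_millidegree(int32_t val)
{
	/* XOR/subtract sign-extends bit 15; 64 counts equal one degree. */
	return ((val ^ 0x8000) - 0x8000) * 1000 / 64;
}

/*
 * Shared decode step: returns 0 and stores the result in *mdeg,
 * or -EIO when the raw reading is one of the error codes.
 */
static int dts_to_millidegree(uint16_t raw, int32_t *mdeg)
{
	if (dts_is_error(raw))
		return -EIO;

	*mdeg = ten_dot_six_to_millidegree(raw);
	return 0;
}
```

For example, a raw value of 0x0040 decodes to 1000 (1.0 degree C) and 0xffc0
decodes to -1000.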

>>>>> +static int find_core_index(struct peci_cputemp *priv, int channel)
>>>>> +{
>>>>> +    int core_channel = channel - DEFAULT_CHANNEL_NUMS;
>>>>> +    int idx, found = 0;
>>>>> +
>>>>> +    for (idx = 0; idx < priv->gen_info->core_max; idx++) {
>>>>> +        if (priv->core_mask & BIT(idx)) {
>>>>> +            if (core_channel == found)
>>>>> +                break;
>>>>> +
>>>>> +            found++;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    return idx;
>>>>
>>>> What if nothing is found ?
>>>>
>>>
>>> The core temperature group is registered only when check_resolved_cores() detects at least one core, so find_core_index() can be called only when priv->core_mask is non-zero. The 'nothing is found' case will not happen.
>>>
>> That doesn't guarantee a match. If what you are saying is correct there should always be
>> a well defined match of channel -> idx, and the search should be unnecessary.
>>
> 
> The resolved core mask can contain disabled cores, and the channel numbering must not have gaps, which is why this search function is needed. A well-defined channel -> idx match would not always be satisfied.
> 
Are you saying that each call to the function, with the same parameters,
can return a different result ?
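
To make the question concrete: for a fixed core mask, the mapping from a
gap-free channel number to a core bit index is a pure function of its inputs,
and it can also report the no-match case explicitly. A minimal standalone
sketch, where nth_set_core() is a hypothetical name and -ENOENT is just one
possible error value:

```c
#include <errno.h>
#include <stdint.h>

/*
 * Return the bit index of the n-th enabled core (n is 0-based) in
 * core_mask, considering only bits below core_max. Returns -ENOENT
 * when fewer than n + 1 cores are enabled.
 */
static int nth_set_core(uint32_t core_mask, unsigned int core_max,
			unsigned int n)
{
	unsigned int idx;

	for (idx = 0; idx < core_max && idx < 32; idx++) {
		if (!(core_mask & (UINT32_C(1) << idx)))
			continue;	/* disabled core: skip, no channel */
		if (n == 0)
			return (int)idx;
		n--;
	}

	return -ENOENT;
}
```

With core_mask 0x2d (cores 0, 2, 3 and 5 enabled), channels 0..3 map to
indexes 0, 2, 3 and 5 on every call, and channel 4 yields an error instead of
core_max.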

>>>>> +}
>>>>> +
>>>>> +static int cputemp_read_string(struct device *dev,
>>>>> +                   enum hwmon_sensor_types type,
>>>>> +                   u32 attr, int channel, const char **str)
>>>>> +{
>>>>> +    struct peci_cputemp *priv = dev_get_drvdata(dev);
>>>>> +    int core_index;
>>>>> +
>>>>> +    switch (attr) {
>>>>> +    case hwmon_temp_label:
>>>>> +        if (channel < DEFAULT_CHANNEL_NUMS) {
>>>>> +            *str = cputemp_label[channel];
>>>>> +        } else {
>>>>> +            core_index = find_core_index(priv, channel);
>>>>
>>>> FWIW, it might be better to pass channel - DEFAULT_CHANNEL_NUMS
>>>> as parameter.
>>>>
>>>
>>> cputemp_read_string() is mapped to the read_string member of struct hwmon_ops, so the hwmon subsystem passes the channel parameter based on the registered channel order. Should I modify the hwmon subsystem code?
>>>
>>
>> Huh ? Changing
>>      f(x) { y = x - const; }
>> ...
>>      f(x);
>>
>> to
>>      f(y) { }
>> ...
>>      f(x - const);
>>
>> requires a hwmon core change ? Really ?
>>
> 
> Sorry for my misunderstanding. You are right. I'll change the parameter passing of find_core_index() from 'channel' to 'channel - DEFAULT_CHANNEL_NUMS'.
> 
>>>> What if find_core_index() returns priv->gen_info->core_max, ie
>>>> if it didn't find a core ?
>>>>
>>>
>>> As explained above, find_core_index() always returns a correct index.
>>>
>>>>> +            *str = cputemp_label[DEFAULT_CHANNEL_NUMS + core_index];
>>>>> +        }
>>>>> +        return 0;
>>>>> +    default:
>>>>> +        return -EOPNOTSUPP;
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +static int cputemp_read_die(struct device *dev,
>>>>> +                enum hwmon_sensor_types type,
>>>>> +                u32 attr, int channel, long *val)
>>>>> +{
>>>>> +    struct peci_cputemp *priv = dev_get_drvdata(dev);
>>>>> +    int rc;
>>>>> +
>>>>> +    switch (attr) {
>>>>> +    case hwmon_temp_input:
>>>>> +        rc = get_die_temp(priv);
>>>>> +        if (rc)
>>>>> +            return rc;
>>>>> +
>>>>> +        *val = priv->temp.die.value;
>>>>> +        return 0;
>>>>> +    case hwmon_temp_max:
>>>>> +        rc = get_tcontrol(priv);
>>>>> +        if (rc)
>>>>> +            return rc;
>>>>> +
>>>>> +        *val = priv->temp.tcontrol.value;
>>>>> +        return 0;
>>>>> +    case hwmon_temp_crit:
>>>>> +        rc = get_tjmax(priv);
>>>>> +        if (rc)
>>>>> +            return rc;
>>>>> +
>>>>> +        *val = priv->temp.tjmax.value;
>>>>> +        return 0;
>>>>> +    case hwmon_temp_crit_hyst:
>>>>> +        rc = get_tcontrol(priv);
>>>>> +        if (rc)
>>>>> +            return rc;
>>>>> +
>>>>> +        *val = priv->temp.tjmax.value - priv->temp.tcontrol.value;
>>>>> +        return 0;
>>>>> +    default:
>>>>> +        return -EOPNOTSUPP;
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +static int cputemp_read_dts_margin(struct device *dev,
>>>>> +                   enum hwmon_sensor_types type,
>>>>> +                   u32 attr, int channel, long *val)
>>>>> +{
>>>>> +    struct peci_cputemp *priv = dev_get_drvdata(dev);
>>>>> +    int rc;
>>>>> +
>>>>> +    switch (attr) {
>>>>> +    case hwmon_temp_input:
>>>>> +        rc = get_dts_margin(priv);
>>>>> +        if (rc)
>>>>> +            return rc;
>>>>> +
>>>>> +        *val = priv->temp.dts_margin.value;
>>>>> +        return 0;
>>>>> +    case hwmon_temp_min:
>>>>> +        *val = 0;
>>>>> +        return 0;
>>>>
>>>> This attribute should not exist.
>>>>
>>>
>>> This is an attribute of the DTS margin temperature, which reflects the thermal margin to Tcontrol of the CPU package. A value of '0' means the package has reached Tcontrol, the first level of thermal warning. If the CPU keeps getting hotter, the DTS margin goes negative until the temperature reaches Tjmax. At Tjmax it shows the lower critical value that lcrit indicates, the second level of thermal warning.
>>>
>>
>> The hwmon ABI reports chip values, not constants. Even though some drivers do
>> it, reporting a constant is always wrong.
>>
>>>>> +    case hwmon_temp_lcrit:
>>>>> +        rc = get_tcontrol(priv);
>>>>> +        if (rc)
>>>>> +            return rc;
>>>>> +
>>>>> +        *val = priv->temp.tcontrol.value - priv->temp.tjmax.value;
>>>>
>>>> lcrit is tcontrol - tjmax, and crit_hyst above is
>>>> tjmax - tcontrol ? How does this make sense ?
>>>>
>>>
>>> Both Tjmax and Tcontrol have positive values, and Tjmax is always greater than Tcontrol. As explained above, lcrit of the DTS margin should be negative, meaning the margin has dropped below '0'. On the other hand, crit_hyst of the Die temperature should show the absolute hysteresis value between Tcontrol and Tjmax.
>>>
>> The hwmon ABI requires reporting of absolute temperatures in milli-degrees C.
>> Your statements make it very clear that this driver does not report
>> absolute temperatures. This is not acceptable.
>>
> 
> Okay. I'll remove the 'DTS margin' temperature. All others are reporting absolute temperatures.
> 
>>>>> +        return 0;
>>>>> +    default:
>>>>> +        return -EOPNOTSUPP;
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +static int cputemp_read_tcontrol(struct device *dev,
>>>>> +                 enum hwmon_sensor_types type,
>>>>> +                 u32 attr, int channel, long *val)
>>>>> +{
>>>>> +    struct peci_cputemp *priv = dev_get_drvdata(dev);
>>>>> +    int rc;
>>>>> +
>>>>> +    switch (attr) {
>>>>> +    case hwmon_temp_input:
>>>>> +        rc = get_tcontrol(priv);
>>>>> +        if (rc)
>>>>> +            return rc;
>>>>> +
>>>>> +        *val = priv->temp.tcontrol.value;
>>>>> +        return 0;
>>>>> +    case hwmon_temp_crit:
>>>>> +        rc = get_tjmax(priv);
>>>>> +        if (rc)
>>>>> +            return rc;
>>>>> +
>>>>> +        *val = priv->temp.tjmax.value;
>>>>> +        return 0;
>>>>
>>>> Am I missing something, or is the same temperature reported several times ?
>>>> tjmax is also reported as temp_crit cputemp_read_die(), for example.
>>>>
>>>
>>> This driver provides multiple channels, and each channel has its own supplementary attributes. As you mentioned, the Die temperature channel and the Core temperature channel each have their own crit attribute, and both reflect the same value, Tjmax. The temperature is not reported several times; several attributes report the same value.
>>>
>> Then maybe fold the functions accordingly ?
>>
> 
> I'll use a single function for 'Die temperature' and 'Core temperature', which have the same attribute set. It would simplify this code a bit.
> 
>>>>> +    default:
>>>>> +        return -EOPNOTSUPP;
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +static int cputemp_read_tthrottle(struct device *dev,
>>>>> +                  enum hwmon_sensor_types type,
>>>>> +                  u32 attr, int channel, long *val)
>>>>> +{
>>>>> +    struct peci_cputemp *priv = dev_get_drvdata(dev);
>>>>> +    int rc;
>>>>> +
>>>>> +    switch (attr) {
>>>>> +    case hwmon_temp_input:
>>>>> +        rc = get_tthrottle(priv);
>>>>> +        if (rc)
>>>>> +            return rc;
>>>>> +
>>>>> +        *val = priv->temp.tthrottle.value;
>>>>> +        return 0;
>>>>> +    default:
>>>>> +        return -EOPNOTSUPP;
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +static int cputemp_read_tjmax(struct device *dev,
>>>>> +                  enum hwmon_sensor_types type,
>>>>> +                  u32 attr, int channel, long *val)
>>>>> +{
>>>>> +    struct peci_cputemp *priv = dev_get_drvdata(dev);
>>>>> +    int rc;
>>>>> +
>>>>> +    switch (attr) {
>>>>> +    case hwmon_temp_input:
>>>>> +        rc = get_tjmax(priv);
>>>>> +        if (rc)
>>>>> +            return rc;
>>>>> +
>>>>> +        *val = priv->temp.tjmax.value;
>>>>> +        return 0;
>>>>> +    default:
>>>>> +        return -EOPNOTSUPP;
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +static int cputemp_read_core(struct device *dev,
>>>>> +                 enum hwmon_sensor_types type,
>>>>> +                 u32 attr, int channel, long *val)
>>>>> +{
>>>>> +    struct peci_cputemp *priv = dev_get_drvdata(dev);
>>>>> +    int core_index = find_core_index(priv, channel);
>>>>> +    int rc;
>>>>> +
>>>>> +    switch (attr) {
>>>>> +    case hwmon_temp_input:
>>>>> +        rc = get_core_temp(priv, core_index);
>>>>> +        if (rc)
>>>>> +            return rc;
>>>>> +
>>>>> +        *val = priv->temp.core[core_index].value;
>>>>> +        return 0;
>>>>> +    case hwmon_temp_max:
>>>>> +        rc = get_tcontrol(priv);
>>>>> +        if (rc)
>>>>> +            return rc;
>>>>> +
>>>>> +        *val = priv->temp.tcontrol.value;
>>>>> +        return 0;
>>>>> +    case hwmon_temp_crit:
>>>>> +        rc = get_tjmax(priv);
>>>>> +        if (rc)
>>>>> +            return rc;
>>>>> +
>>>>> +        *val = priv->temp.tjmax.value;
>>>>> +        return 0;
>>>>> +    case hwmon_temp_crit_hyst:
>>>>> +        rc = get_tcontrol(priv);
>>>>> +        if (rc)
>>>>> +            return rc;
>>>>> +
>>>>> +        *val = priv->temp.tjmax.value - priv->temp.tcontrol.value;
>>>>> +        return 0;
>>>>> +    default:
>>>>> +        return -EOPNOTSUPP;
>>>>> +    }
>>>>> +}
>>>>
>>>> There is again a lot of duplication in those functions.
>>>>
>>>
>>> Each function is called from cputemp_read(), which is mapped to the read function pointer of struct hwmon_ops. Since each channel has a different set of attributes, cputemp_read() calls an individual channel handler after checking the channel type. Of course, we could handle all attributes of all channels in a single function, but that would also need a channel type check for each attribute.
>>>
>>>>> +
>>>>> +static int cputemp_read(struct device *dev,
>>>>> +            enum hwmon_sensor_types type,
>>>>> +            u32 attr, int channel, long *val)
>>>>> +{
>>>>> +    switch (channel) {
>>>>> +    case channel_die:
>>>>> +        return cputemp_read_die(dev, type, attr, channel, val);
>>>>> +    case channel_dts_mrgn:
>>>>> +        return cputemp_read_dts_margin(dev, type, attr, channel, val);
>>>>> +    case channel_tcontrol:
>>>>> +        return cputemp_read_tcontrol(dev, type, attr, channel, val);
>>>>> +    case channel_tthrottle:
>>>>> +        return cputemp_read_tthrottle(dev, type, attr, channel, val);
>>>>> +    case channel_tjmax:
>>>>> +        return cputemp_read_tjmax(dev, type, attr, channel, val);
>>>>> +    default:
>>>>> +        if (channel < CPUTEMP_CHANNEL_NUMS)
>>>>> +            return cputemp_read_core(dev, type, attr, channel, val);
>>>>> +
>>>>> +        return -EOPNOTSUPP;
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +static umode_t cputemp_is_visible(const void *data,
>>>>> +                  enum hwmon_sensor_types type,
>>>>> +                  u32 attr, int channel)
>>>>> +{
>>>>> +    const struct peci_cputemp *priv = data;
>>>>> +
>>>>> +    if (priv->temp_config[channel] & BIT(attr))
>>>>> +        return 0444;
>>>>> +
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +static const struct hwmon_ops cputemp_ops = {
>>>>> +    .is_visible = cputemp_is_visible,
>>>>> +    .read_string = cputemp_read_string,
>>>>> +    .read = cputemp_read,
>>>>> +};
>>>>> +
>>>>> +static int check_resolved_cores(struct peci_cputemp *priv)
>>>>> +{
>>>>> +    struct peci_rd_pci_cfg_local_msg msg;
>>>>> +    int rc;
>>>>> +
>>>>> +    if (!(priv->client->adapter->cmd_mask & BIT(PECI_CMD_RD_PCI_CFG_LOCAL)))
>>>>> +        return -EINVAL;
>>>>> +
>>>>> +    /* Get the RESOLVED_CORES register value */
>>>>> +    msg.addr = priv->addr;
>>>>> +    msg.bus = 1;
>>>>> +    msg.device = 30;
>>>>> +    msg.function = 3;
>>>>> +    msg.reg = 0xB4;
>>>>
>>>> Can this be made less magic with some defines ?
>>>>
>>>
>>> Sure, will use defines instead.
>>>
>>>>> +    msg.rx_len = 4;
>>>>> +
>>>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PCI_CFG_LOCAL, &msg);
>>>>> +    if (rc)
>>>>> +        return rc;
>>>>> +
>>>>> +    priv->core_mask = msg.pci_config[3] << 24 |
>>>>> +              msg.pci_config[2] << 16 |
>>>>> +              msg.pci_config[1] << 8 |
>>>>> +              msg.pci_config[0];
>>>>> +
>>>>> +    if (!priv->core_mask)
>>>>> +        return -EAGAIN;
>>>>> +
>>>>> +    dev_dbg(priv->dev, "Scanned resolved cores: 0x%x\n", priv->core_mask);
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +static int create_core_temp_info(struct peci_cputemp *priv)
>>>>> +{
>>>>> +    int rc, i;
>>>>> +
>>>>> +    rc = check_resolved_cores(priv);
>>>>> +    if (!rc) {
>>>>> +        for (i = 0; i < priv->gen_info->core_max; i++) {
>>>>> +            if (priv->core_mask & BIT(i)) {
>>>>> +                priv->temp_config[priv->config_idx++] =
>>>>> +                             config_table[channel_core];
>>>>> +            }
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    return rc;
>>>>> +}
>>>>> +
>>>>> +static int check_cpu_id(struct peci_cputemp *priv)
>>>>> +{
>>>>> +    struct peci_rd_pkg_cfg_msg msg;
>>>>> +    u32 cpu_id;
>>>>> +    int i, rc;
>>>>> +
>>>>> +    msg.addr = priv->addr;
>>>>> +    msg.index = MBX_INDEX_CPU_ID;
>>>>> +    msg.param = PKG_ID_CPU_ID;
>>>>> +    msg.rx_len = 4;
>>>>> +
>>>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>>>> +    if (rc)
>>>>> +        return rc;
>>>>> +
>>>>> +    cpu_id = ((msg.pkg_config[2] << 16) | (msg.pkg_config[1] << 8) |
>>>>> +          msg.pkg_config[0]) & CLIENT_CPU_ID_MASK;
>>>>> +
>>>>> +    for (i = 0; i < CPU_GEN_MAX; i++) {
>>>>> +        if (cpu_id == cpu_gen_info_table[i].cpu_id) {
>>>>> +            priv->gen_info = &cpu_gen_info_table[i];
>>>>> +            break;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    if (!priv->gen_info)
>>>>> +        return -ENODEV;
>>>>> +
>>>>> +    dev_dbg(priv->dev, "CPU_ID: 0x%x\n", cpu_id);
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +static int peci_cputemp_probe(struct peci_client *client)
>>>>> +{
>>>>> +    struct device *dev = &client->dev;
>>>>> +    struct peci_cputemp *priv;
>>>>> +    struct device *hwmon_dev;
>>>>> +    int rc;
>>>>> +
>>>>> +    if ((client->adapter->cmd_mask &
>>>>> +        (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG))) !=
>>>>> +        (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG))) {
>>>>> +        dev_err(dev, "Client doesn't support temperature monitoring\n");
>>>>> +        return -EINVAL;
>>>>
>>>> Does this mean there will be an error message for each non-supported CPU ?
>>>> Why ?
>>>>
>>>
>>> For proper operation of this driver, PECI_CMD_GET_TEMP and PECI_CMD_RD_PKG_CFG have to be supported by the client CPU. PECI_CMD_GET_TEMP is provided as a default command, but PECI_CMD_RD_PKG_CFG depends on the PECI minor revision of the CPU package, so this check is needed.
>>>
>>
>> I do not question the check. I question the error message and error return value.
>> Why is it an _error_ if the CPU does not support the functionality, and why does
>> it have to be reported in the kernel log ?
>>
> 
> Got it. I'll change that to dev_dbg.
> 
>>>>> +    }
>>>>> +
>>>>> +    priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
>>>>> +    if (!priv)
>>>>> +        return -ENOMEM;
>>>>> +
>>>>> +    dev_set_drvdata(dev, priv);
>>>>> +    priv->client = client;
>>>>> +    priv->dev = dev;
>>>>> +    priv->addr = client->addr;
>>>>> +    priv->cpu_no = priv->addr - PECI_BASE_ADDR;
>>>>> +
>>>>> +    snprintf(priv->name, PECI_NAME_SIZE, "peci_cputemp.cpu%d",
>>>>> +         priv->cpu_no);
>>>>> +
>>>>> +    rc = check_cpu_id(priv);
>>>>> +    if (rc) {
>>>>> +        dev_err(dev, "Client CPU is not supported\n");
>>>>
>>>> -ENODEV is not an error, and should not result in an error message.
>>>> Besides, the error can also be propagated from peci core code,
>>>> and may well be something else.
>>>>
>>>
>>> Got it. I'll remove the error message and add proper handling code to the PECI core.
>>>
>>>>> +        return rc;
>>>>> +    }
>>>>> +
>>>>> +    priv->temp_config[priv->config_idx++] = config_table[channel_die];
>>>>> +    priv->temp_config[priv->config_idx++] = config_table[channel_dts_mrgn];
>>>>> +    priv->temp_config[priv->config_idx++] = config_table[channel_tcontrol];
>>>>> +    priv->temp_config[priv->config_idx++] = config_table[channel_tthrottle];
>>>>> +    priv->temp_config[priv->config_idx++] = config_table[channel_tjmax];
>>>>> +
>>>>> +    rc = create_core_temp_info(priv);
>>>>> +    if (rc)
>>>>> +        dev_dbg(dev, "Failed to create core temp info\n");
>>>>
>>>> Then what ? Shouldn't this result in probe deferral or something more useful
>>>> instead of just being ignored ?
>>>>
>>>
>>> This driver can't support core temperature monitoring if the CPU doesn't support the PECI_CMD_RD_PCI_CFG_LOCAL command. In that case, it skips core temperature group creation and supports only basic temperature monitoring of Die, DTS margin, etc. I'll add this description as a comment.
>>>
>>
>> The message says "Failed to ...". It does not say "This CPU does not support ...".
>>
> 
> Got it. Will correct the message.
> 
>>>>> +
>>>>> +    priv->chip.ops = &cputemp_ops;
>>>>> +    priv->chip.info = priv->info;
>>>>> +
>>>>> +    priv->info[0] = &priv->temp_info;
>>>>> +
>>>>> +    priv->temp_info.type = hwmon_temp;
>>>>> +    priv->temp_info.config = priv->temp_config;
>>>>> +
>>>>> +    hwmon_dev = devm_hwmon_device_register_with_info(priv->dev,
>>>>> +                             priv->name,
>>>>> +                             priv,
>>>>> +                             &priv->chip,
>>>>> +                             NULL);
>>>>> +
>>>>> +    if (IS_ERR(hwmon_dev))
>>>>> +        return PTR_ERR(hwmon_dev);
>>>>> +
>>>>> +    dev_dbg(dev, "%s: sensor '%s'\n", dev_name(hwmon_dev), priv->name);
>>>>> +
>>
>> Why does this message display the device name twice ?
>>
> 
> For example, dev_name(hwmon_dev) shows 'hwmon5' and priv->name shows 'peci-cputemp0'.
> 
And dev_dbg() shows another device name. So you'll have something like

peci-cputemp0: hwmon5: sensor 'peci-cputemp0'

>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +static const struct of_device_id peci_cputemp_of_table[] = {
>>>>> +    { .compatible = "intel,peci-cputemp" },
>>>>> +    { }
>>>>> +};
>>>>> +MODULE_DEVICE_TABLE(of, peci_cputemp_of_table);
>>>>> +
>>>>> +static struct peci_driver peci_cputemp_driver = {
>>>>> +    .probe  = peci_cputemp_probe,
>>>>> +    .driver = {
>>>>> +        .name           = "peci-cputemp",
>>>>> +        .of_match_table = of_match_ptr(peci_cputemp_of_table),
>>>>> +    },
>>>>> +};
>>>>> +module_peci_driver(peci_cputemp_driver);
>>>>> +
>>>>> +MODULE_AUTHOR("Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>");
>>>>> +MODULE_DESCRIPTION("PECI cputemp driver");
>>>>> +MODULE_LICENSE("GPL v2");
>>>>> diff --git a/drivers/hwmon/peci-dimmtemp.c b/drivers/hwmon/peci-dimmtemp.c
>>>>> new file mode 100644
>>>>> index 000000000000..78bf29cb2c4c
>>>>> --- /dev/null
>>>>> +++ b/drivers/hwmon/peci-dimmtemp.c
>>>>
>>>> FWIW, this should be two separate patches.
>>>>
>>>
>>> Should I split out hwmon documents and dt bindings too?
>>>
>>>>> @@ -0,0 +1,432 @@
>>>>> +// SPDX-License-Identifier: GPL-2.0
>>>>> +// Copyright (c) 2018 Intel Corporation
>>>>> +
>>>>> +#include <linux/delay.h>
>>>>> +#include <linux/hwmon.h>
>>>>> +#include <linux/hwmon-sysfs.h>
>>>>
>>>> Needed ?
>>>>
>>>
>>> No. Will drop the line.
>>>
>>>>> +#include <linux/jiffies.h>
>>>>> +#include <linux/module.h>
>>>>> +#include <linux/of_device.h>
>>>>> +#include <linux/peci.h>
>>>>> +#include <linux/workqueue.h>
>>>>> +
>>>>> +#define TEMP_TYPE_PECI       6  /* Sensor type 6: Intel PECI */
>>>>> +
>>>>> +#define CHAN_RANK_MAX_ON_HSX 8  /* Max number of channel ranks on Haswell */
>>>>> +#define DIMM_IDX_MAX_ON_HSX  3  /* Max DIMM index per channel on Haswell */
>>>>> +
>>>>> +#define CHAN_RANK_MAX_ON_BDX 4  /* Max number of channel ranks on Broadwell */
>>>>> +#define DIMM_IDX_MAX_ON_BDX  3  /* Max DIMM index per channel on Broadwell */
>>>>> +
>>>>> +#define CHAN_RANK_MAX_ON_SKX 6  /* Max number of channel ranks on Skylake */
>>>>> +#define DIMM_IDX_MAX_ON_SKX  2  /* Max DIMM index per channel on Skylake */
>>>>> +
>>>>> +#define CHAN_RANK_MAX        CHAN_RANK_MAX_ON_HSX
>>>>> +#define DIMM_IDX_MAX         DIMM_IDX_MAX_ON_HSX
>>>>> +
>>>>> +#define DIMM_NUMS_MAX        (CHAN_RANK_MAX * DIMM_IDX_MAX)
>>>>> +
>>>>> +#define CLIENT_CPU_ID_MASK   0xf0ff0  /* Mask for Family / Model info */
>>>>> +
>>>>> +#define UPDATE_INTERVAL_MIN  HZ
>>>>> +
>>>>> +#define DIMM_MASK_CHECK_DELAY_JIFFIES msecs_to_jiffies(5000)
>>>>> +#define DIMM_MASK_CHECK_RETRY_MAX     60 /* 60 x 5 secs = 5 minutes */
>>>>> +
>>>>> +enum cpu_gens {
>>>>> +    CPU_GEN_HSX, /* Haswell Xeon */
>>>>> +    CPU_GEN_BRX, /* Broadwell Xeon */
>>>>> +    CPU_GEN_SKX, /* Skylake Xeon */
>>>>> +    CPU_GEN_MAX
>>>>> +};
>>>>> +
>>>>> +struct cpu_gen_info {
>>>>> +    u32 type;
>>>>> +    u32 cpu_id;
>>>>> +    u32 chan_rank_max;
>>>>> +    u32 dimm_idx_max;
>>>>> +};
>>>>> +
>>>>> +struct temp_data {
>>>>> +    bool valid;
>>>>> +    s32  value;
>>>>> +    unsigned long last_updated;
>>>>> +};
>>>>> +
>>>>> +struct peci_dimmtemp {
>>>>> +    struct peci_client *client;
>>>>> +    struct device *dev;
>>>>> +    struct workqueue_struct *work_queue;
>>>>> +    struct delayed_work work_handler;
>>>>> +    char name[PECI_NAME_SIZE];
>>>>> +    struct temp_data temp[DIMM_NUMS_MAX];
>>>>> +    u8 addr;
>>>>> +    uint cpu_no;
>>>>> +    const struct cpu_gen_info *gen_info;
>>>>> +    u32 dimm_mask;
>>>>> +    int retry_count;
>>>>> +    int channels;
>>>>> +    u32 temp_config[DIMM_NUMS_MAX + 1];
>>>>> +    struct hwmon_channel_info temp_info;
>>>>> +    const struct hwmon_channel_info *info[2];
>>>>> +    struct hwmon_chip_info chip;
>>>>> +};
>>>>> +
>>>>> +static const struct cpu_gen_info cpu_gen_info_table[] = {
>>>>> +    { .type  = CPU_GEN_HSX,
>>>>> +      .cpu_id = 0x306f0, /* Family code: 6, Model number: 63 (0x3f) */
>>>>> +      .chan_rank_max = CHAN_RANK_MAX_ON_HSX,
>>>>> +      .dimm_idx_max  = DIMM_IDX_MAX_ON_HSX },
>>>>> +    { .type  = CPU_GEN_BRX,
>>>>> +      .cpu_id = 0x406f0, /* Family code: 6, Model number: 79 (0x4f) */
>>>>> +      .chan_rank_max = CHAN_RANK_MAX_ON_BDX,
>>>>> +      .dimm_idx_max  = DIMM_IDX_MAX_ON_BDX },
>>>>> +    { .type  = CPU_GEN_SKX,
>>>>> +      .cpu_id = 0x50650, /* Family code: 6, Model number: 85 (0x55) */
>>>>> +      .chan_rank_max = CHAN_RANK_MAX_ON_SKX,
>>>>> +      .dimm_idx_max  = DIMM_IDX_MAX_ON_SKX },
>>>>> +};
>>>>> +
>>>>> +static const char *dimmtemp_label[CHAN_RANK_MAX][DIMM_IDX_MAX] = {
>>>>> +    { "DIMM A0", "DIMM A1", "DIMM A2" },
>>>>> +    { "DIMM B0", "DIMM B1", "DIMM B2" },
>>>>> +    { "DIMM C0", "DIMM C1", "DIMM C2" },
>>>>> +    { "DIMM D0", "DIMM D1", "DIMM D2" },
>>>>> +    { "DIMM E0", "DIMM E1", "DIMM E2" },
>>>>> +    { "DIMM F0", "DIMM F1", "DIMM F2" },
>>>>> +    { "DIMM G0", "DIMM G1", "DIMM G2" },
>>>>> +    { "DIMM H0", "DIMM H1", "DIMM H2" },
>>>>> +};
>>>>> +
>>>>> +static int send_peci_cmd(struct peci_dimmtemp *priv, enum peci_cmd cmd,
>>>>> +             void *msg)
>>>>> +{
>>>>> +    return peci_command(priv->client->adapter, cmd, msg);
>>>>> +}
>>>>> +
>>>>> +static int need_update(struct temp_data *temp)
>>>>> +{
>>>>> +    if (temp->valid &&
>>>>> +        time_before(jiffies, temp->last_updated + UPDATE_INTERVAL_MIN))
>>>>> +        return 0;
>>>>> +
>>>>> +    return 1;
>>>>> +}
>>>>> +
>>>>> +static void mark_updated(struct temp_data *temp)
>>>>> +{
>>>>> +    temp->valid = true;
>>>>> +    temp->last_updated = jiffies;
>>>>> +}
>>>>
>>>> It might make sense to provide the duplicate functions in a core file.
>>>>
>>>
>>> It is a temperature-monitoring-specific function and it touches module-specific variables. Do you really think that this non-generic function should be moved to PECI core?
>>>
>>>>> +
>>>>> +static int get_dimm_temp(struct peci_dimmtemp *priv, int dimm_no)
>>>>> +{
>>>>> +    int dimm_order = dimm_no % priv->gen_info->dimm_idx_max;
>>>>> +    int chan_rank = dimm_no / priv->gen_info->dimm_idx_max;
>>>>> +    struct peci_rd_pkg_cfg_msg msg;
>>>>> +    int rc;
>>>>> +
>>>>> +    if (!need_update(&priv->temp[dimm_no]))
>>>>> +        return 0;
>>>>> +
>>>>> +    msg.addr = priv->addr;
>>>>> +    msg.index = MBX_INDEX_DDR_DIMM_TEMP;
>>>>> +    msg.param = chan_rank;
>>>>> +    msg.rx_len = 4;
>>>>> +
>>>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>>>> +    if (rc)
>>>>> +        return rc;
>>>>> +
>>>>> +    priv->temp[dimm_no].value = msg.pkg_config[dimm_order] * 1000;
>>>>> +
>>>>> +    mark_updated(&priv->temp[dimm_no]);
>>>>> +
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +static int find_dimm_number(struct peci_dimmtemp *priv, int channel)
>>>>> +{
>>>>> +    int dimm_nums_max = priv->gen_info->chan_rank_max *
>>>>> +                priv->gen_info->dimm_idx_max;
>>>>> +    int idx, found = 0;
>>>>> +
>>>>> +    for (idx = 0; idx < dimm_nums_max; idx++) {
>>>>> +        if (priv->dimm_mask & BIT(idx)) {
>>>>> +            if (channel == found)
>>>>> +                break;
>>>>> +
>>>>> +            found++;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    return idx;
>>>>> +}
>>>>
>>>> This again looks like duplicate code.
>>>>
>>>
>>> find_dimm_number()? I'm sure it isn't.
>>>
>>>>> +
>>>>> +static int dimmtemp_read_string(struct device *dev,
>>>>> +                enum hwmon_sensor_types type,
>>>>> +                u32 attr, int channel, const char **str)
>>>>> +{
>>>>> +    struct peci_dimmtemp *priv = dev_get_drvdata(dev);
>>>>> +    u32 dimm_idx_max = priv->gen_info->dimm_idx_max;
>>>>> +    int dimm_no, chan_rank, dimm_idx;
>>>>> +
>>>>> +    switch (attr) {
>>>>> +    case hwmon_temp_label:
>>>>> +        dimm_no = find_dimm_number(priv, channel);
>>>>> +        chan_rank = dimm_no / dimm_idx_max;
>>>>> +        dimm_idx = dimm_no % dimm_idx_max;
>>>>> +        *str = dimmtemp_label[chan_rank][dimm_idx];
>>>>> +        return 0;
>>>>> +    default:
>>>>> +        return -EOPNOTSUPP;
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +static int dimmtemp_read(struct device *dev, enum hwmon_sensor_types type,
>>>>> +             u32 attr, int channel, long *val)
>>>>> +{
>>>>> +    struct peci_dimmtemp *priv = dev_get_drvdata(dev);
>>>>> +    int dimm_no = find_dimm_number(priv, channel);
>>>>> +    int rc;
>>>>> +
>>>>> +    switch (attr) {
>>>>> +    case hwmon_temp_input:
>>>>> +        rc = get_dimm_temp(priv, dimm_no);
>>>>> +        if (rc)
>>>>> +            return rc;
>>>>> +
>>>>> +        *val = priv->temp[dimm_no].value;
>>>>> +        return 0;
>>>>> +    default:
>>>>> +        return -EOPNOTSUPP;
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +static umode_t dimmtemp_is_visible(const void *data,
>>>>> +                   enum hwmon_sensor_types type,
>>>>> +                   u32 attr, int channel)
>>>>> +{
>>>>> +    switch (attr) {
>>>>> +    case hwmon_temp_label:
>>>>> +    case hwmon_temp_input:
>>>>> +        return 0444;
>>>>> +    default:
>>>>> +        return 0;
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +static const struct hwmon_ops dimmtemp_ops = {
>>>>> +    .is_visible = dimmtemp_is_visible,
>>>>> +    .read_string = dimmtemp_read_string,
>>>>> +    .read = dimmtemp_read,
>>>>> +};
>>>>> +
>>>>> +static int check_populated_dimms(struct peci_dimmtemp *priv)
>>>>> +{
>>>>> +    u32 chan_rank_max = priv->gen_info->chan_rank_max;
>>>>> +    u32 dimm_idx_max = priv->gen_info->dimm_idx_max;
>>>>> +    struct peci_rd_pkg_cfg_msg msg;
>>>>> +    int chan_rank, dimm_idx;
>>>>> +    int rc, channels = 0;
>>>>> +
>>>>> +    for (chan_rank = 0; chan_rank < chan_rank_max; chan_rank++) {
>>>>> +        msg.addr = priv->addr;
>>>>> +        msg.index = MBX_INDEX_DDR_DIMM_TEMP;
>>>>> +        msg.param = chan_rank;
>>>>> +        msg.rx_len = 4;
>>>>> +
>>>>> +        rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>>>> +        if (rc) {
>>>>> +            priv->dimm_mask = 0;
>>>>> +            return rc;
>>>>> +        }
>>>>> +
>>>>> +        for (dimm_idx = 0; dimm_idx < dimm_idx_max; dimm_idx++) {
>>>>> +            if (msg.pkg_config[dimm_idx]) {
>>>>> +                priv->dimm_mask |= BIT(chan_rank *
>>>>> +                               chan_rank_max +
>>>>> +                               dimm_idx);
>>>>> +                channels++;
>>>>> +            }
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    if (!priv->dimm_mask)
>>>>> +        return -EAGAIN;
>>>>> +
>>>>> +    priv->channels = channels;
>>>>> +
>>>>> +    dev_dbg(priv->dev, "Scanned populated DIMMs: 0x%x\n", priv->dimm_mask);
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +static int create_dimm_temp_info(struct peci_dimmtemp *priv)
>>>>> +{
>>>>> +    struct device *hwmon_dev;
>>>>> +    int rc, i;
>>>>> +
>>>>> +    rc = check_populated_dimms(priv);
>>>>> +    if (!rc) {
>>>>
>>>> Please handle error cases first.
>>>>
>>>
>>> Sure, I'll rewrite it.
>>>
>>>>> +        for (i = 0; i < priv->channels; i++)
>>>>> +            priv->temp_config[i] = HWMON_T_LABEL | HWMON_T_INPUT;
>>>>> +
>>>>> +        priv->chip.ops = &dimmtemp_ops;
>>>>> +        priv->chip.info = priv->info;
>>>>> +
>>>>> +        priv->info[0] = &priv->temp_info;
>>>>> +
>>>>> +        priv->temp_info.type = hwmon_temp;
>>>>> +        priv->temp_info.config = priv->temp_config;
>>>>> +
>>>>> +        hwmon_dev = devm_hwmon_device_register_with_info(priv->dev,
>>>>> +                                 priv->name,
>>>>> +                                 priv,
>>>>> +                                 &priv->chip,
>>>>> +                                 NULL);
>>>>> +        rc = PTR_ERR_OR_ZERO(hwmon_dev);
>>>>> +        if (!rc)
>>>>> +            dev_dbg(priv->dev, "%s: sensor '%s'\n",
>>>>> +                dev_name(hwmon_dev), priv->name);
>>>>> +    } else if (rc == -EAGAIN) {
>>>>> +        if (priv->retry_count < DIMM_MASK_CHECK_RETRY_MAX) {
>>>>> +            queue_delayed_work(priv->work_queue,
>>>>> +                       &priv->work_handler,
>>>>> +                       DIMM_MASK_CHECK_DELAY_JIFFIES);
>>>>> +            priv->retry_count++;
>>>>> +            dev_dbg(priv->dev,
>>>>> +                "Deferred DIMM temp info creation\n");
>>>>> +        } else {
>>>>> +            rc = -ETIMEDOUT;
>>>>> +            dev_err(priv->dev,
>>>>> +                "Timeout retrying DIMM temp info creation\n");
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    return rc;
>>>>> +}
>>>>> +
>>>>> +static void create_dimm_temp_info_delayed(struct work_struct *work)
>>>>> +{
>>>>> +    struct delayed_work *dwork = to_delayed_work(work);
>>>>> +    struct peci_dimmtemp *priv = container_of(dwork, struct peci_dimmtemp,
>>>>> +                          work_handler);
>>>>> +    int rc;
>>>>> +
>>>>> +    rc = create_dimm_temp_info(priv);
>>>>> +    if (rc && rc != -EAGAIN)
>>>>> +        dev_dbg(priv->dev, "Failed to create DIMM temp info\n");
>>>>> +}
>>>>> +
>>>>> +static int check_cpu_id(struct peci_dimmtemp *priv)
>>>>> +{
>>>>> +    struct peci_rd_pkg_cfg_msg msg;
>>>>> +    u32 cpu_id;
>>>>> +    int i, rc;
>>>>> +
>>>>> +    msg.addr = priv->addr;
>>>>> +    msg.index = MBX_INDEX_CPU_ID;
>>>>> +    msg.param = PKG_ID_CPU_ID;
>>>>> +    msg.rx_len = 4;
>>>>> +
>>>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>>>> +    if (rc)
>>>>> +        return rc;
>>>>> +
>>>>> +    cpu_id = ((msg.pkg_config[2] << 16) | (msg.pkg_config[1] << 8) |
>>>>> +          msg.pkg_config[0]) & CLIENT_CPU_ID_MASK;
>>>>> +
>>>>> +    for (i = 0; i < CPU_GEN_MAX; i++) {
>>>>> +        if (cpu_id == cpu_gen_info_table[i].cpu_id) {
>>>>> +            priv->gen_info = &cpu_gen_info_table[i];
>>>>> +            break;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    if (!priv->gen_info)
>>>>> +        return -ENODEV;
>>>>> +
>>>>> +    dev_dbg(priv->dev, "CPU_ID: 0x%x\n", cpu_id);
>>>>> +    return 0;
>>>>> +}
>>>>
>>>> More duplicate code.
>>>>
>>>
>>> Okay. In case of check_cpu_id(), it could be used as a generic PECI function. I'll move it into PECI core.
>>>
>>>>> +
>>>>> +static int peci_dimmtemp_probe(struct peci_client *client)
>>>>> +{
>>>>> +    struct device *dev = &client->dev;
>>>>> +    struct peci_dimmtemp *priv;
>>>>> +    int rc;
>>>>> +
>>>>> +    if ((client->adapter->cmd_mask &
>>>>> +        (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG))) !=
>>>>> +        (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG))) {
>>>>
>>>> One set of ( ) is unnecessary on each side of the expression.
>>>>
>>>
>>> '&' has a precedence over '!=' but '|' doesn't. I'll rewrite it to:
>>>
>>
>> Actually, that is wrong. You refer to address-of. Bit operations do have lower
>> precedence than comparisons. I stand corrected.
>>
>>>      if (client->adapter->cmd_mask &
>>>          (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG)) !=
>>>          (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG)))
>>>
>>>>> +        dev_err(dev, "Client doesn't support temperature monitoring\n");
>>>>> +        return -EINVAL;
>>>>
>>>> Why is this "invalid", and why does it warrant an error message ?
>>>>
>>>
>>> Should I use -EPERM? Any suggestion?
>>>
>>
>> Is it an _error_ if the CPU does not support this functionality ?
>>
> 
> Actually, it returns from this probe() function without creating any hwmon info, so I intended to handle this case as an error. Am I wrong?
> 

If the functionality or HW supported by the driver isn't available, it is customary
to return -ENODEV and no error message. Otherwise the kernel log would drown in
"not supported" error messages. I don't see where it would add any value to handle
this driver differently.

EINVAL	Invalid argument
EPERM	Operation not permitted

You'll have to work hard to convince me that any of those makes sense, and that

ENODEV	No such device

doesn't. More specifically, if EINVAL makes sense, the caller did something wrong,
meaning there is a problem in the infrastructure which should get fixed.
The same is true for EPERM.
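
As a rough illustration of that convention, here is a minimal userspace sketch (not the driver code). The names sketch_check_cmd_support(), SKETCH_BIT() and the command indices are hypothetical stand-ins for the cmd_mask check quoted above; the only points shown are the silent -ENODEV return and the parentheses around the mask test.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical command indices standing in for the PECI command enum. */
enum { SKETCH_CMD_GET_TEMP = 2, SKETCH_CMD_RD_PKG_CFG = 4 };

#define SKETCH_BIT(n) (UINT32_C(1) << (n))
#define SKETCH_REQUIRED_CMDS \
    (SKETCH_BIT(SKETCH_CMD_GET_TEMP) | SKETCH_BIT(SKETCH_CMD_RD_PKG_CFG))

/* Guard against an accidentally empty requirement mask. */
static_assert(SKETCH_REQUIRED_CMDS != 0, "required command mask is empty");

/*
 * Return 0 when every required command is advertised in cmd_mask,
 * -ENODEV otherwise. Nothing is logged on the -ENODEV path.
 */
static int sketch_check_cmd_support(uint32_t cmd_mask)
{
    /* The parentheses are needed: '!=' binds tighter than '&'. */
    if ((cmd_mask & SKETCH_REQUIRED_CMDS) != SKETCH_REQUIRED_CMDS)
        return -ENODEV;

    return 0;
}
```

A probe function built this way would simply propagate the -ENODEV result to its caller without printing anything.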

>>>>> +    }
>>>>> +
>>>>> +    priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
>>>>> +    if (!priv)
>>>>> +        return -ENOMEM;
>>>>> +
>>>>> +    dev_set_drvdata(dev, priv);
>>>>> +    priv->client = client;
>>>>> +    priv->dev = dev;
>>>>> +    priv->addr = client->addr;
>>>>> +    priv->cpu_no = priv->addr - PECI_BASE_ADDR;
>>>>
>>>> Is priv->addr guaranteed to be >= PECI_BASE_ADDR ?
>>>
>>> Client address range validation will be done in peci_check_addr_validity() in PECI core before probing a device driver.
>>>
>>>>> +
>>>>> +    snprintf(priv->name, PECI_NAME_SIZE, "peci_dimmtemp.cpu%d",
>>>>> +         priv->cpu_no);
>>>>> +
>>>>> +    rc = check_cpu_id(priv);
>>>>> +    if (rc) {
>>>>> +        dev_err(dev, "Client CPU is not supported\n");
>>>>
>>>> Or the peci command failed.
>>>>
>>>
>>> I'll remove the error message and add proper handling code to the PECI core for each error type.
>>>
>>>>> +        return rc;
>>>>> +    }
>>>>> +
>>>>> +    priv->work_queue = alloc_ordered_workqueue(priv->name, 0);
>>>>> +    if (!priv->work_queue)
>>>>> +        return -ENOMEM;
>>>>> +
>>>>> +    INIT_DELAYED_WORK(&priv->work_handler, create_dimm_temp_info_delayed);
>>>>> +
>>>>> +    rc = create_dimm_temp_info(priv);
>>>>> +    if (rc && rc != -EAGAIN) {
>>>>> +        dev_err(dev, "Failed to create DIMM temp info\n");
>>>>> +        goto err_free_wq;
>>>>> +    }
>>>>> +
>>>>> +    return 0;
>>>>> +
>>>>> +err_free_wq:
>>>>> +    destroy_workqueue(priv->work_queue);
>>>>> +    return rc;
>>>>> +}
>>>>> +
>>>>> +static int peci_dimmtemp_remove(struct peci_client *client)
>>>>> +{
>>>>> +    struct peci_dimmtemp *priv = dev_get_drvdata(&client->dev);
>>>>> +
>>>>> +    cancel_delayed_work(&priv->work_handler);
>>>>
>>>> cancel_delayed_work_sync() ?
>>>>
>>>
>>> Yes, it would be safer. Will fix it.
>>>
>>>>> +    destroy_workqueue(priv->work_queue);
>>>>> +
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +static const struct of_device_id peci_dimmtemp_of_table[] = {
>>>>> +    { .compatible = "intel,peci-dimmtemp" },
>>>>> +    { }
>>>>> +};
>>>>> +MODULE_DEVICE_TABLE(of, peci_dimmtemp_of_table);
>>>>> +
>>>>> +static struct peci_driver peci_dimmtemp_driver = {
>>>>> +    .probe  = peci_dimmtemp_probe,
>>>>> +    .remove = peci_dimmtemp_remove,
>>>>> +    .driver = {
>>>>> +        .name           = "peci-dimmtemp",
>>>>> +        .of_match_table = of_match_ptr(peci_dimmtemp_of_table),
>>>>> +    },
>>>>> +};
>>>>> +module_peci_driver(peci_dimmtemp_driver);
>>>>> +
>>>>> +MODULE_AUTHOR("Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>");
>>>>> +MODULE_DESCRIPTION("PECI dimmtemp driver");
>>>>> +MODULE_LICENSE("GPL v2");
>>>>> -- 
>>>>> 2.16.2
>>>>>
>>>
>>
> 



* Re: [PATCH v3 09/10] drivers/hwmon: Add PECI hwmon client drivers
From: Jae Hyun Yoo @ 2018-04-12  2:51 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Alan Cox, Andrew Jeffery, Andrew Lunn, Andy Shevchenko,
	Arnd Bergmann, Benjamin Herrenschmidt, Fengguang Wu, Greg KH,
	Haiyue Wang, James Feist, Jason M Biils, Jean Delvare,
	Joel Stanley, Julia Cartwright, Miguel Ojeda, Milton Miller II,
	Pavel Machek, Randy Dunlap, Stef van Os, Sumeet R Pawnikar,
	Vernon Mauery, linux-kernel, linux-doc, devicetree, linux-hwmon,
	linux-arm-kernel, openbmc
In-Reply-To: <292be2d9-e572-2e06-9899-f6c2c53fdefc@roeck-us.net>

On 4/11/2018 5:34 PM, Guenter Roeck wrote:
> On 04/11/2018 02:59 PM, Jae Hyun Yoo wrote:
>> Hi Guenter,
>>
>> Thanks a lot for sharing your time. Please see my inline answers.
>>
>> On 4/10/2018 3:28 PM, Guenter Roeck wrote:
>>> On Tue, Apr 10, 2018 at 11:32:11AM -0700, Jae Hyun Yoo wrote:
>>>> This commit adds PECI cputemp and dimmtemp hwmon drivers.
>>>>
>>>> Signed-off-by: Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>
>>>> Reviewed-by: Haiyue Wang <haiyue.wang@linux.intel.com>
>>>> Reviewed-by: James Feist <james.feist@linux.intel.com>
>>>> Reviewed-by: Vernon Mauery <vernon.mauery@linux.intel.com>
>>>> Cc: Alan Cox <alan@linux.intel.com>
>>>> Cc: Andrew Jeffery <andrew@aj.id.au>
>>>> Cc: Andrew Lunn <andrew@lunn.ch>
>>>> Cc: Andy Shevchenko <andriy.shevchenko@intel.com>
>>>> Cc: Arnd Bergmann <arnd@arndb.de>
>>>> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>>>> Cc: Fengguang Wu <fengguang.wu@intel.com>
>>>> Cc: Greg KH <gregkh@linuxfoundation.org>
>>>> Cc: Guenter Roeck <linux@roeck-us.net>
>>>> Cc: Jason M Biils <jason.m.bills@linux.intel.com>
>>>> Cc: Jean Delvare <jdelvare@suse.com>
>>>> Cc: Joel Stanley <joel@jms.id.au>
>>>> Cc: Julia Cartwright <juliac@eso.teric.us>
>>>> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
>>>> Cc: Milton Miller II <miltonm@us.ibm.com>
>>>> Cc: Pavel Machek <pavel@ucw.cz>
>>>> Cc: Randy Dunlap <rdunlap@infradead.org>
>>>> Cc: Stef van Os <stef.van.os@prodrive-technologies.com>
>>>> Cc: Sumeet R Pawnikar <sumeet.r.pawnikar@intel.com>
>>>> ---
>>>>   drivers/hwmon/Kconfig         |  28 ++
>>>>   drivers/hwmon/Makefile        |   2 +
>>>>   drivers/hwmon/peci-cputemp.c  | 783 ++++++++++++++++++++++++++++++++++++++++++
>>>>   drivers/hwmon/peci-dimmtemp.c | 432 +++++++++++++++++++++++
>>>>   4 files changed, 1245 insertions(+)
>>>>   create mode 100644 drivers/hwmon/peci-cputemp.c
>>>>   create mode 100644 drivers/hwmon/peci-dimmtemp.c
>>>>
>>>> diff --git a/drivers/hwmon/Kconfig b/drivers/hwmon/Kconfig
>>>> index f249a4428458..c52f610f81d0 100644
>>>> --- a/drivers/hwmon/Kconfig
>>>> +++ b/drivers/hwmon/Kconfig
>>>> @@ -1259,6 +1259,34 @@ config SENSORS_NCT7904
>>>>         This driver can also be built as a module.  If so, the module
>>>>         will be called nct7904.
>>>> +config SENSORS_PECI_CPUTEMP
>>>> +    tristate "PECI CPU temperature monitoring support"
>>>> +    depends on OF
>>>> +    depends on PECI
>>>> +    help
>>>> +      If you say yes here you get support for the generic Intel PECI
>>>> +      cputemp driver which provides Digital Thermal Sensor (DTS) thermal
>>>> +      readings of the CPU package and CPU cores that are accessible using
>>>> +      the PECI Client Command Suite via the processor PECI client.
>>>> +      Check Documentation/hwmon/peci-cputemp for details.
>>>> +
>>>> +      This driver can also be built as a module.  If so, the module
>>>> +      will be called peci-cputemp.
>>>> +
>>>> +config SENSORS_PECI_DIMMTEMP
>>>> +    tristate "PECI DIMM temperature monitoring support"
>>>> +    depends on OF
>>>> +    depends on PECI
>>>> +    help
>>>> +      If you say yes here you get support for the generic Intel PECI hwmon
>>>> +      driver which provides Digital Thermal Sensor (DTS) thermal readings of
>>>> +      DIMM components that are accessible using the PECI Client Command
>>>> +      Suite via the processor PECI client.
>>>> +      Check Documentation/hwmon/peci-dimmtemp for details.
>>>> +
>>>> +      This driver can also be built as a module.  If so, the module
>>>> +      will be called peci-dimmtemp.
>>>> +
>>>>   config SENSORS_NSA320
>>>>       tristate "ZyXEL NSA320 and compatible fan speed and temperature sensors"
>>>>       depends on GPIOLIB && OF
>>>> diff --git a/drivers/hwmon/Makefile b/drivers/hwmon/Makefile
>>>> index e7d52a36e6c4..48d9598fcd3a 100644
>>>> --- a/drivers/hwmon/Makefile
>>>> +++ b/drivers/hwmon/Makefile
>>>> @@ -136,6 +136,8 @@ obj-$(CONFIG_SENSORS_NCT7802)    += nct7802.o
>>>>   obj-$(CONFIG_SENSORS_NCT7904)    += nct7904.o
>>>>   obj-$(CONFIG_SENSORS_NSA320)    += nsa320-hwmon.o
>>>>   obj-$(CONFIG_SENSORS_NTC_THERMISTOR)    += ntc_thermistor.o
>>>> +obj-$(CONFIG_SENSORS_PECI_CPUTEMP)    += peci-cputemp.o
>>>> +obj-$(CONFIG_SENSORS_PECI_DIMMTEMP)    += peci-dimmtemp.o
>>>>   obj-$(CONFIG_SENSORS_PC87360)    += pc87360.o
>>>>   obj-$(CONFIG_SENSORS_PC87427)    += pc87427.o
>>>>   obj-$(CONFIG_SENSORS_PCF8591)    += pcf8591.o
>>>> diff --git a/drivers/hwmon/peci-cputemp.c b/drivers/hwmon/peci-cputemp.c
>>>> new file mode 100644
>>>> index 000000000000..f0bc92687512
>>>> --- /dev/null
>>>> +++ b/drivers/hwmon/peci-cputemp.c
>>>> @@ -0,0 +1,783 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +// Copyright (c) 2018 Intel Corporation
>>>> +
>>>> +#include <linux/delay.h>
>>>> +#include <linux/hwmon.h>
>>>> +#include <linux/hwmon-sysfs.h>
>>>
>>> Is this include needed ?
>>>
>>
>> No it isn't. Will drop the line.
>>
>>>> +#include <linux/jiffies.h>
>>>> +#include <linux/module.h>
>>>> +#include <linux/of_device.h>
>>>> +#include <linux/peci.h>
>>>> +
>>>> +#define TEMP_TYPE_PECI        6  /* Sensor type 6: Intel PECI */
>>>> +
>>>> +#define CORE_MAX_ON_HSX       18 /* Max number of cores on Haswell */
>>>> +#define CORE_MAX_ON_BDX       24 /* Max number of cores on Broadwell */
>>>> +#define CORE_MAX_ON_SKX       28 /* Max number of cores on Skylake */
>>>> +
>>>> +#define DEFAULT_CHANNEL_NUMS  5
>>>> +#define CORETEMP_CHANNEL_NUMS CORE_MAX_ON_SKX
>>>> +#define CPUTEMP_CHANNEL_NUMS  (DEFAULT_CHANNEL_NUMS + CORETEMP_CHANNEL_NUMS)
>>>> +
>>>> +#define CLIENT_CPU_ID_MASK    0xf0ff0  /* Mask for Family / Model info */
>>>> +
>>>> +#define UPDATE_INTERVAL_MIN   HZ
>>>> +
>>>> +enum cpu_gens {
>>>> +    CPU_GEN_HSX, /* Haswell Xeon */
>>>> +    CPU_GEN_BRX, /* Broadwell Xeon */
>>>> +    CPU_GEN_SKX, /* Skylake Xeon */
>>>> +    CPU_GEN_MAX
>>>> +};
>>>> +
>>>> +struct cpu_gen_info {
>>>> +    u32 type;
>>>> +    u32 cpu_id;
>>>> +    u32 core_max;
>>>> +};
>>>> +
>>>> +struct temp_data {
>>>> +    bool valid;
>>>> +    s32  value;
>>>> +    unsigned long last_updated;
>>>> +};
>>>> +
>>>> +struct temp_group {
>>>> +    struct temp_data die;
>>>> +    struct temp_data dts_margin;
>>>> +    struct temp_data tcontrol;
>>>> +    struct temp_data tthrottle;
>>>> +    struct temp_data tjmax;
>>>> +    struct temp_data core[CORETEMP_CHANNEL_NUMS];
>>>> +};
>>>> +
>>>> +struct peci_cputemp {
>>>> +    struct peci_client *client;
>>>> +    struct device *dev;
>>>> +    char name[PECI_NAME_SIZE];
>>>> +    struct temp_group temp;
>>>> +    u8 addr;
>>>> +    uint cpu_no;
>>>> +    const struct cpu_gen_info *gen_info;
>>>> +    u32 core_mask;
>>>> +    u32 temp_config[CPUTEMP_CHANNEL_NUMS + 1];
>>>> +    uint config_idx;
>>>> +    struct hwmon_channel_info temp_info;
>>>> +    const struct hwmon_channel_info *info[2];
>>>> +    struct hwmon_chip_info chip;
>>>> +};
>>>> +
>>>> +enum cputemp_channels {
>>>> +    channel_die,
>>>> +    channel_dts_mrgn,
>>>> +    channel_tcontrol,
>>>> +    channel_tthrottle,
>>>> +    channel_tjmax,
>>>> +    channel_core,
>>>> +};
>>>> +
>>>> +static const struct cpu_gen_info cpu_gen_info_table[] = {
>>>> +    { .type = CPU_GEN_HSX,
>>>> +      .cpu_id = 0x306f0, /* Family code: 6, Model number: 63 (0x3f) */
>>>> +      .core_max = CORE_MAX_ON_HSX },
>>>> +    { .type = CPU_GEN_BRX,
>>>> +      .cpu_id = 0x406f0, /* Family code: 6, Model number: 79 (0x4f) */
>>>> +      .core_max = CORE_MAX_ON_BDX },
>>>> +    { .type = CPU_GEN_SKX,
>>>> +      .cpu_id = 0x50650, /* Family code: 6, Model number: 85 (0x55) */
>>>> +      .core_max = CORE_MAX_ON_SKX },
>>>> +};
>>>> +
>>>> +static const u32 config_table[DEFAULT_CHANNEL_NUMS + 1] = {
>>>> +    /* Die temperature */
>>>> +    HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_CRIT |
>>>> +    HWMON_T_CRIT_HYST,
>>>> +
>>>> +    /* DTS margin temperature */
>>>> +    HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_MIN | HWMON_T_LCRIT,
>>>> +
>>>> +    /* Tcontrol temperature */
>>>> +    HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_CRIT,
>>>> +
>>>> +    /* Tthrottle temperature */
>>>> +    HWMON_T_LABEL | HWMON_T_INPUT,
>>>> +
>>>> +    /* Tjmax temperature */
>>>> +    HWMON_T_LABEL | HWMON_T_INPUT,
>>>> +
>>>> +    /* Core temperature - for all core channels */
>>>> +    HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_CRIT |
>>>> +    HWMON_T_CRIT_HYST,
>>>> +};
>>>> +
>>>> +static const char *cputemp_label[CPUTEMP_CHANNEL_NUMS] = {
>>>> +    "Die",
>>>> +    "DTS margin",
>>>> +    "Tcontrol",
>>>> +    "Tthrottle",
>>>> +    "Tjmax",
>>>> +    "Core 0", "Core 1", "Core 2", "Core 3",
>>>> +    "Core 4", "Core 5", "Core 6", "Core 7",
>>>> +    "Core 8", "Core 9", "Core 10", "Core 11",
>>>> +    "Core 12", "Core 13", "Core 14", "Core 15",
>>>> +    "Core 16", "Core 17", "Core 18", "Core 19",
>>>> +    "Core 20", "Core 21", "Core 22", "Core 23",
>>>> +};
>>>> +
>>>> +static int send_peci_cmd(struct peci_cputemp *priv,
>>>> +             enum peci_cmd cmd,
>>>> +             void *msg)
>>>> +{
>>>> +    return peci_command(priv->client->adapter, cmd, msg);
>>>> +}
>>>> +
>>>> +static int need_update(struct temp_data *temp)
>>>
>>> Please use bool.
>>>
>>
>> Okay. I'll use bool instead of int.
>>
>>>> +{
>>>> +    if (temp->valid &&
>>>> +        time_before(jiffies, temp->last_updated + UPDATE_INTERVAL_MIN))
>>>> +        return 0;
>>>> +
>>>> +    return 1;
>>>> +}
>>>> +
>>>> +static void mark_updated(struct temp_data *temp)
>>>> +{
>>>> +    temp->valid = true;
>>>> +    temp->last_updated = jiffies;
>>>> +}
>>>> +
>>>> +static s32 ten_dot_six_to_millidegree(s32 val)
>>>> +{
>>>> +    return ((val ^ 0x8000) - 0x8000) * 1000 / 64;
>>>> +}
>>>> +
>>>> +static int get_tjmax(struct peci_cputemp *priv)
>>>> +{
>>>> +    struct peci_rd_pkg_cfg_msg msg;
>>>> +    int rc;
>>>> +
>>>> +    if (!priv->temp.tjmax.valid) {
>>>> +        msg.addr = priv->addr;
>>>> +        msg.index = MBX_INDEX_TEMP_TARGET;
>>>> +        msg.param = 0;
>>>> +        msg.rx_len = 4;
>>>> +
>>>> +        rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>>> +        if (rc)
>>>> +            return rc;
>>>> +
>>>> +        priv->temp.tjmax.value = (s32)msg.pkg_config[2] * 1000;
>>>> +        priv->temp.tjmax.valid = true;
>>>> +    }
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int get_tcontrol(struct peci_cputemp *priv)
>>>> +{
>>>> +    struct peci_rd_pkg_cfg_msg msg;
>>>> +    s32 tcontrol_margin;
>>>> +    s32 tthrottle_offset;
>>>> +    int rc;
>>>> +
>>>> +    if (!need_update(&priv->temp.tcontrol))
>>>> +        return 0;
>>>> +
>>>> +    rc = get_tjmax(priv);
>>>> +    if (rc)
>>>> +        return rc;
>>>> +
>>>> +    msg.addr = priv->addr;
>>>> +    msg.index = MBX_INDEX_TEMP_TARGET;
>>>> +    msg.param = 0;
>>>> +    msg.rx_len = 4;
>>>> +
>>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>>> +    if (rc)
>>>> +        return rc;
>>>> +
>>>> +    tcontrol_margin = msg.pkg_config[1];
>>>> +    tcontrol_margin = ((tcontrol_margin ^ 0x80) - 0x80) * 1000;
>>>> +    priv->temp.tcontrol.value = priv->temp.tjmax.value - tcontrol_margin;
>>>> +
>>>> +    tthrottle_offset = (msg.pkg_config[3] & 0x2f) * 1000;
>>>> +    priv->temp.tthrottle.value = priv->temp.tjmax.value - tthrottle_offset;
>>>> +
>>>> +    mark_updated(&priv->temp.tcontrol);
>>>> +    mark_updated(&priv->temp.tthrottle);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int get_tthrottle(struct peci_cputemp *priv)
>>>> +{
>>>> +    struct peci_rd_pkg_cfg_msg msg;
>>>> +    s32 tcontrol_margin;
>>>> +    s32 tthrottle_offset;
>>>> +    int rc;
>>>> +
>>>> +    if (!need_update(&priv->temp.tthrottle))
>>>> +        return 0;
>>>> +
>>>> +    rc = get_tjmax(priv);
>>>> +    if (rc)
>>>> +        return rc;
>>>> +
>>>> +    msg.addr = priv->addr;
>>>> +    msg.index = MBX_INDEX_TEMP_TARGET;
>>>> +    msg.param = 0;
>>>> +    msg.rx_len = 4;
>>>> +
>>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>>> +    if (rc)
>>>> +        return rc;
>>>> +
>>>> +    tthrottle_offset = (msg.pkg_config[3] & 0x2f) * 1000;
>>>> +    priv->temp.tthrottle.value = priv->temp.tjmax.value - tthrottle_offset;
>>>> +
>>>> +    tcontrol_margin = msg.pkg_config[1];
>>>> +    tcontrol_margin = ((tcontrol_margin ^ 0x80) - 0x80) * 1000;
>>>> +    priv->temp.tcontrol.value = priv->temp.tjmax.value - tcontrol_margin;
>>>> +
>>>> +    mark_updated(&priv->temp.tthrottle);
>>>> +    mark_updated(&priv->temp.tcontrol);
>>>> +
>>>> +    return 0;
>>>> +}
>>>
>>> I am quite completely missing how the two functions above are different.
>>>
>>
>> The two functions above are slightly different but use the same PECI
>> command, which provides both Tthrottle and Tcontrol values in the
>> pkg_config array, so each one updates both values to avoid duplicate PECI
>> transactions. Combining these two functions into
>> get_tthrottle_and_tcontrol() would probably look better. I'll rewrite it.
>>
>>>> +
>>>> +static int get_die_temp(struct peci_cputemp *priv)
>>>> +{
>>>> +    struct peci_get_temp_msg msg;
>>>> +    int rc;
>>>> +
>>>> +    if (!need_update(&priv->temp.die))
>>>> +        return 0;
>>>> +
>>>> +    rc = get_tjmax(priv);
>>>> +    if (rc)
>>>> +        return rc;
>>>> +
>>>> +    msg.addr = priv->addr;
>>>> +
>>>> +    rc = send_peci_cmd(priv, PECI_CMD_GET_TEMP, &msg);
>>>> +    if (rc)
>>>> +        return rc;
>>>> +
>>>> +    priv->temp.die.value = priv->temp.tjmax.value +
>>>> +                   ((s32)msg.temp_raw * 1000 / 64);
>>>> +
>>>> +    mark_updated(&priv->temp.die);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int get_dts_margin(struct peci_cputemp *priv)
>>>> +{
>>>> +    struct peci_rd_pkg_cfg_msg msg;
>>>> +    s32 dts_margin;
>>>> +    int rc;
>>>> +
>>>> +    if (!need_update(&priv->temp.dts_margin))
>>>> +        return 0;
>>>> +
>>>> +    msg.addr = priv->addr;
>>>> +    msg.index = MBX_INDEX_DTS_MARGIN;
>>>> +    msg.param = 0;
>>>> +    msg.rx_len = 4;
>>>> +
>>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>>> +    if (rc)
>>>> +        return rc;
>>>> +
>>>> +    dts_margin = (msg.pkg_config[1] << 8) | msg.pkg_config[0];
>>>> +
>>>> +    /**
>>>> +     * Processors return a value of DTS reading in 10.6 format
>>>> +     * (10 bits signed decimal, 6 bits fractional).
>>>> +     * Error codes:
>>>> +     *   0x8000: General sensor error
>>>> +     *   0x8001: Reserved
>>>> +     *   0x8002: Underflow on reading value
>>>> +     *   0x8003-0x81ff: Reserved
>>>> +     */
>>>> +    if (dts_margin >= 0x8000 && dts_margin <= 0x81ff)
>>>> +        return -EIO;
>>>> +
>>>> +    dts_margin = ten_dot_six_to_millidegree(dts_margin);
>>>> +
>>>> +    priv->temp.dts_margin.value = dts_margin;
>>>> +
>>>> +    mark_updated(&priv->temp.dts_margin);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int get_core_temp(struct peci_cputemp *priv, int core_index)
>>>> +{
>>>> +    struct peci_rd_pkg_cfg_msg msg;
>>>> +    s32 core_dts_margin;
>>>> +    int rc;
>>>> +
>>>> +    if (!need_update(&priv->temp.core[core_index]))
>>>> +        return 0;
>>>> +
>>>> +    rc = get_tjmax(priv);
>>>> +    if (rc)
>>>> +        return rc;
>>>> +
>>>> +    msg.addr = priv->addr;
>>>> +    msg.index = MBX_INDEX_PER_CORE_DTS_TEMP;
>>>> +    msg.param = core_index;
>>>> +    msg.rx_len = 4;
>>>> +
>>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>>> +    if (rc)
>>>> +        return rc;
>>>> +
>>>> +    core_dts_margin = (msg.pkg_config[1] << 8) | msg.pkg_config[0];
>>>> +
>>>> +    /**
>>>> +     * Processors return a value of the core DTS reading in 10.6 format
>>>> +     * (10 bits signed decimal, 6 bits fractional).
>>>> +     * Error codes:
>>>> +     *   0x8000: General sensor error
>>>> +     *   0x8001: Reserved
>>>> +     *   0x8002: Underflow on reading value
>>>> +     *   0x8003-0x81ff: Reserved
>>>> +     */
>>>> +    if (core_dts_margin >= 0x8000 && core_dts_margin <= 0x81ff)
>>>> +        return -EIO;
>>>> +
>>>> +    core_dts_margin = ten_dot_six_to_millidegree(core_dts_margin);
>>>> +
>>>> +    priv->temp.core[core_index].value = priv->temp.tjmax.value +
>>>> +                        core_dts_margin;
>>>> +
>>>> +    mark_updated(&priv->temp.core[core_index]);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>
>>> There is a lot of duplication in those functions. Would it be possible
>>> to find common code and use functions for it instead of duplicating
>>> everything several times ?
>>>
>>
>> Are you pointing out this code?
>> /**
>>   * Processors return a value of the core DTS reading in 10.6 format
>>   * (10 bits signed decimal, 6 bits fractional).
>>   * Error codes:
>>   *   0x8000: General sensor error
>>   *   0x8001: Reserved
>>   *   0x8002: Underflow on reading value
>>   *   0x8003-0x81ff: Reserved
>>   */
>> if (core_dts_margin >= 0x8000 && core_dts_margin <= 0x81ff)
>>      return -EIO;
>>
>> Then I'll rewrite it as a function. If not, please point out the 
>> duplication.
>>
> 
> There is lots of other duplication.
> 

Sorry but can you point out the duplication?
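
For the range check and conversion quoted above, a shared helper could look roughly like the sketch below. It is only an illustration: dts_raw_to_millidegree() and the DTS_ERR_* names are hypothetical, and the scaling (1000 / 2^6) is an assumption based on the "10.6" comment, since ten_dot_six_to_millidegree() itself is not shown here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Error code range taken from the comment in the patch. */
#define DTS_ERR_FIRST 0x8000u
#define DTS_ERR_LAST  0x81ffu

/*
 * Hypothetical shared helper: validate a raw 16-bit 10.6 reading and
 * convert it to millidegrees Celsius.
 * Returns 0 on success or -EIO when the reading is an error code.
 */
static int dts_raw_to_millidegree(uint16_t raw, int32_t *millideg)
{
    int32_t v = raw;

    assert(millideg != NULL);

    if (raw >= DTS_ERR_FIRST && raw <= DTS_ERR_LAST)
        return -EIO;

    /* Sign-extend the 16-bit two's complement value. */
    if (v & 0x8000)
        v -= 0x10000;

    /* 6 fractional bits: one degree is 64 raw units (assumed scaling). */
    *millideg = v * 1000 / 64;
    return 0;
}
```

Both get_dts_margin() and get_core_temp() could then call one function instead of repeating the comment and the range check.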

>>>> +static int find_core_index(struct peci_cputemp *priv, int channel)
>>>> +{
>>>> +    int core_channel = channel - DEFAULT_CHANNEL_NUMS;
>>>> +    int idx, found = 0;
>>>> +
>>>> +    for (idx = 0; idx < priv->gen_info->core_max; idx++) {
>>>> +        if (priv->core_mask & BIT(idx)) {
>>>> +            if (core_channel == found)
>>>> +                break;
>>>> +
>>>> +            found++;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return idx;
>>>
>>> What if nothing is found ?
>>>
>>
>> The core temperature group is registered only when
>> check_resolved_cores() detects at least one core, so find_core_index()
>> can be called only when priv->core_mask has a non-zero value. The
>> 'nothing is found' case will not happen.
>>
> That doesn't guarantee a match. If what you are saying is correct, there
> should always be a well-defined match of channel -> idx, and the search
> should be unnecessary.
> 

There could be some disabled cores in the resolved core mask bit
sequence, and the channel numbering should not have indexing gaps. That
is the reason why this search function is needed. A well-defined match of
channel -> idx would not always be satisfied.
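
To illustrate the gap removal, here is a minimal userspace sketch of the search. nth_set_bit() is a hypothetical name, not part of the patch, and the explicit -1 return for the "nothing found" case is an assumption added for the sketch; the posted find_core_index() returns the loop index instead.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a zero-based channel number to the bit position of the matching
 * enabled core, skipping disabled cores (cleared bits) in the mask.
 * Returns -1 when the mask has fewer set bits than the channel needs.
 */
static int nth_set_bit(uint32_t mask, int n, int nbits)
{
    int idx, found = 0;

    assert(nbits >= 0 && nbits <= 32);

    for (idx = 0; idx < nbits; idx++) {
        if (!(mask & (UINT32_C(1) << idx)))
            continue;
        if (found == n)
            return idx;
        found++;
    }

    return -1;
}
```

With a mask of 0x0b (cores 0, 1 and 3 enabled), channels 0, 1 and 2 map to cores 0, 1 and 3, so the channel numbering stays contiguous while core 2 is skipped.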

>>>> +}
>>>> +
>>>> +static int cputemp_read_string(struct device *dev,
>>>> +                   enum hwmon_sensor_types type,
>>>> +                   u32 attr, int channel, const char **str)
>>>> +{
>>>> +    struct peci_cputemp *priv = dev_get_drvdata(dev);
>>>> +    int core_index;
>>>> +
>>>> +    switch (attr) {
>>>> +    case hwmon_temp_label:
>>>> +        if (channel < DEFAULT_CHANNEL_NUMS) {
>>>> +            *str = cputemp_label[channel];
>>>> +        } else {
>>>> +            core_index = find_core_index(priv, channel);
>>>
>>> FWIW, it might be better to pass channel - DEFAULT_CHANNEL_NUMS
>>> as parameter.
>>>
>>
>> cputemp_read_string() is mapped to the read_string member of the
>> hwmon_ops struct, so the hwmon subsystem passes the channel parameter
>> based on the registered channel order. Should I modify hwmon subsystem
>> code?
>>
> 
> Huh ? Changing
>      f(x) { y = x - const; }
> ...
>      f(x);
> 
> to
>      f(y) { }
> ...
>      f(x - const);
> 
> requires a hwmon core change ? Really ?
> 

Sorry for my misunderstanding. You are right. I'll change the parameter 
passing of find_core_index() from 'channel' to 'channel - 
DEFAULT_CHANNEL_NUMS'.

>>> What if find_core_index() returns priv->gen_info->core_max, ie
>>> if it didn't find a core ?
>>>
>>
>> As explained above, find_core_index() always returns a correct index.
>>
>>>> +            *str = cputemp_label[DEFAULT_CHANNEL_NUMS + core_index];
>>>> +        }
>>>> +        return 0;
>>>> +    default:
>>>> +        return -EOPNOTSUPP;
>>>> +    }
>>>> +}
>>>> +
>>>> +static int cputemp_read_die(struct device *dev,
>>>> +                enum hwmon_sensor_types type,
>>>> +                u32 attr, int channel, long *val)
>>>> +{
>>>> +    struct peci_cputemp *priv = dev_get_drvdata(dev);
>>>> +    int rc;
>>>> +
>>>> +    switch (attr) {
>>>> +    case hwmon_temp_input:
>>>> +        rc = get_die_temp(priv);
>>>> +        if (rc)
>>>> +            return rc;
>>>> +
>>>> +        *val = priv->temp.die.value;
>>>> +        return 0;
>>>> +    case hwmon_temp_max:
>>>> +        rc = get_tcontrol(priv);
>>>> +        if (rc)
>>>> +            return rc;
>>>> +
>>>> +        *val = priv->temp.tcontrol.value;
>>>> +        return 0;
>>>> +    case hwmon_temp_crit:
>>>> +        rc = get_tjmax(priv);
>>>> +        if (rc)
>>>> +            return rc;
>>>> +
>>>> +        *val = priv->temp.tjmax.value;
>>>> +        return 0;
>>>> +    case hwmon_temp_crit_hyst:
>>>> +        rc = get_tcontrol(priv);
>>>> +        if (rc)
>>>> +            return rc;
>>>> +
>>>> +        *val = priv->temp.tjmax.value - priv->temp.tcontrol.value;
>>>> +        return 0;
>>>> +    default:
>>>> +        return -EOPNOTSUPP;
>>>> +    }
>>>> +}
>>>> +
>>>> +static int cputemp_read_dts_margin(struct device *dev,
>>>> +                   enum hwmon_sensor_types type,
>>>> +                   u32 attr, int channel, long *val)
>>>> +{
>>>> +    struct peci_cputemp *priv = dev_get_drvdata(dev);
>>>> +    int rc;
>>>> +
>>>> +    switch (attr) {
>>>> +    case hwmon_temp_input:
>>>> +        rc = get_dts_margin(priv);
>>>> +        if (rc)
>>>> +            return rc;
>>>> +
>>>> +        *val = priv->temp.dts_margin.value;
>>>> +        return 0;
>>>> +    case hwmon_temp_min:
>>>> +        *val = 0;
>>>> +        return 0;
>>>
>>> This attribute should not exist.
>>>
>>
>> This is an attribute of the DTS margin temperature, which reflects the
>> thermal margin to Tcontrol of the CPU package. A value of '0' means the
>> temperature has reached Tcontrol, the first level of thermal warning.
>> If the CPU keeps getting hotter, the DTS margin shows a negative value
>> until it reaches Tjmax. When the temperature finally reaches Tjmax, the
>> margin shows the lower critical value that lcrit indicates, which is
>> the second level of thermal warning.
>>
> 
> The hwmon ABI reports chip values, not constants. Even though some
> drivers do it, reporting a constant is always wrong.
> 
>>>> +    case hwmon_temp_lcrit:
>>>> +        rc = get_tcontrol(priv);
>>>> +        if (rc)
>>>> +            return rc;
>>>> +
>>>> +        *val = priv->temp.tcontrol.value - priv->temp.tjmax.value;
>>>
>>> lcrit is tcontrol - tjmax, and crit_hyst above is
>>> tjmax - tcontrol ? How does this make sense ?
>>>
>>
>> Both Tjmax and Tcontrol have positive values, and Tjmax is always
>> greater than Tcontrol. As explained above, lcrit of the DTS margin
>> should show a negative value, meaning the margin has gone below '0'.
>> On the other hand, crit_hyst of the Die temperature should show the
>> absolute hysteresis value between Tcontrol and Tjmax.
>>
> The hwmon ABI requires reporting of absolute temperatures in
> milli-degrees C. Your statements make it very clear that this driver
> does not report absolute temperatures. This is not acceptable.
> 

Okay. I'll remove the 'DTS margin' temperature. All others are reporting 
absolute temperatures.

>>>> +        return 0;
>>>> +    default:
>>>> +        return -EOPNOTSUPP;
>>>> +    }
>>>> +}
>>>> +
>>>> +static int cputemp_read_tcontrol(struct device *dev,
>>>> +                 enum hwmon_sensor_types type,
>>>> +                 u32 attr, int channel, long *val)
>>>> +{
>>>> +    struct peci_cputemp *priv = dev_get_drvdata(dev);
>>>> +    int rc;
>>>> +
>>>> +    switch (attr) {
>>>> +    case hwmon_temp_input:
>>>> +        rc = get_tcontrol(priv);
>>>> +        if (rc)
>>>> +            return rc;
>>>> +
>>>> +        *val = priv->temp.tcontrol.value;
>>>> +        return 0;
>>>> +    case hwmon_temp_crit:
>>>> +        rc = get_tjmax(priv);
>>>> +        if (rc)
>>>> +            return rc;
>>>> +
>>>> +        *val = priv->temp.tjmax.value;
>>>> +        return 0;
>>>
>>> Am I missing something, or is the same temperature reported several
>>> times ? tjmax is also reported as temp_crit in cputemp_read_die(),
>>> for example.
>>>
>>
>> This driver provides multiple channels, and each channel has its own
>> supplementary attributes. As you mentioned, the Die temperature channel
>> and the Core temperature channel have their individual crit attributes,
>> and both reflect the same value, Tjmax. It is not one temperature
>> reported several times but separate attributes reporting the same
>> value.
>>
> Then maybe fold the functions accordingly ?
> 

I'll use a single function for 'Die temperature' and 'Core temperature' 
that have the same attributes set. It would simplify this code a bit.
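
One possible shape for the folded function, as a userspace sketch only. The enum, struct and function names are hypothetical stand-ins for the hwmon attribute ids and the driver's private data, and crit_hyst is left out on purpose because its semantics are still under discussion above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the hwmon attribute ids used by the driver. */
enum temp_attr { ATTR_INPUT, ATTR_MAX, ATTR_CRIT, ATTR_OTHER };

struct thresholds {
    long tcontrol; /* millidegrees C */
    long tjmax;    /* millidegrees C */
};

/*
 * Shared reader for channels that expose the same attribute set.
 * Only the 'input' value differs per channel, so the caller passes it in.
 */
static int read_common_temp(enum temp_attr attr, long input,
                            const struct thresholds *th, long *val)
{
    assert(th != NULL && val != NULL);

    switch (attr) {
    case ATTR_INPUT:
        *val = input;
        return 0;
    case ATTR_MAX:
        *val = th->tcontrol;
        return 0;
    case ATTR_CRIT:
        *val = th->tjmax;
        return 0;
    default:
        return -EOPNOTSUPP;
    }
}
```

The Die and Core handlers would then differ only in how they obtain the input value before calling the shared reader.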

>>>> +    default:
>>>> +        return -EOPNOTSUPP;
>>>> +    }
>>>> +}
>>>> +
>>>> +static int cputemp_read_tthrottle(struct device *dev,
>>>> +                  enum hwmon_sensor_types type,
>>>> +                  u32 attr, int channel, long *val)
>>>> +{
>>>> +    struct peci_cputemp *priv = dev_get_drvdata(dev);
>>>> +    int rc;
>>>> +
>>>> +    switch (attr) {
>>>> +    case hwmon_temp_input:
>>>> +        rc = get_tthrottle(priv);
>>>> +        if (rc)
>>>> +            return rc;
>>>> +
>>>> +        *val = priv->temp.tthrottle.value;
>>>> +        return 0;
>>>> +    default:
>>>> +        return -EOPNOTSUPP;
>>>> +    }
>>>> +}
>>>> +
>>>> +static int cputemp_read_tjmax(struct device *dev,
>>>> +                  enum hwmon_sensor_types type,
>>>> +                  u32 attr, int channel, long *val)
>>>> +{
>>>> +    struct peci_cputemp *priv = dev_get_drvdata(dev);
>>>> +    int rc;
>>>> +
>>>> +    switch (attr) {
>>>> +    case hwmon_temp_input:
>>>> +        rc = get_tjmax(priv);
>>>> +        if (rc)
>>>> +            return rc;
>>>> +
>>>> +        *val = priv->temp.tjmax.value;
>>>> +        return 0;
>>>> +    default:
>>>> +        return -EOPNOTSUPP;
>>>> +    }
>>>> +}
>>>> +
>>>> +static int cputemp_read_core(struct device *dev,
>>>> +                 enum hwmon_sensor_types type,
>>>> +                 u32 attr, int channel, long *val)
>>>> +{
>>>> +    struct peci_cputemp *priv = dev_get_drvdata(dev);
>>>> +    int core_index = find_core_index(priv, channel);
>>>> +    int rc;
>>>> +
>>>> +    switch (attr) {
>>>> +    case hwmon_temp_input:
>>>> +        rc = get_core_temp(priv, core_index);
>>>> +        if (rc)
>>>> +            return rc;
>>>> +
>>>> +        *val = priv->temp.core[core_index].value;
>>>> +        return 0;
>>>> +    case hwmon_temp_max:
>>>> +        rc = get_tcontrol(priv);
>>>> +        if (rc)
>>>> +            return rc;
>>>> +
>>>> +        *val = priv->temp.tcontrol.value;
>>>> +        return 0;
>>>> +    case hwmon_temp_crit:
>>>> +        rc = get_tjmax(priv);
>>>> +        if (rc)
>>>> +            return rc;
>>>> +
>>>> +        *val = priv->temp.tjmax.value;
>>>> +        return 0;
>>>> +    case hwmon_temp_crit_hyst:
>>>> +        rc = get_tcontrol(priv);
>>>> +        if (rc)
>>>> +            return rc;
>>>> +
>>>> +        *val = priv->temp.tjmax.value - priv->temp.tcontrol.value;
>>>> +        return 0;
>>>> +    default:
>>>> +        return -EOPNOTSUPP;
>>>> +    }
>>>> +}
>>>
>>> There is again a lot of duplication in those functions.
>>>
>>
>> Each function is called from cputemp_read(), which is mapped to the
>> read function pointer of the hwmon_ops struct. Since each channel has a
>> different set of attributes, cputemp_read() calls an individual channel
>> handler after checking the channel type. Of course, we could handle all
>> attributes of all channels in a single function, but that approach
>> would also need channel type checking code on each attribute.
>>
>>>> +
>>>> +static int cputemp_read(struct device *dev,
>>>> +            enum hwmon_sensor_types type,
>>>> +            u32 attr, int channel, long *val)
>>>> +{
>>>> +    switch (channel) {
>>>> +    case channel_die:
>>>> +        return cputemp_read_die(dev, type, attr, channel, val);
>>>> +    case channel_dts_mrgn:
>>>> +        return cputemp_read_dts_margin(dev, type, attr, channel, val);
>>>> +    case channel_tcontrol:
>>>> +        return cputemp_read_tcontrol(dev, type, attr, channel, val);
>>>> +    case channel_tthrottle:
>>>> +        return cputemp_read_tthrottle(dev, type, attr, channel, val);
>>>> +    case channel_tjmax:
>>>> +        return cputemp_read_tjmax(dev, type, attr, channel, val);
>>>> +    default:
>>>> +        if (channel < CPUTEMP_CHANNEL_NUMS)
>>>> +            return cputemp_read_core(dev, type, attr, channel, val);
>>>> +
>>>> +        return -EOPNOTSUPP;
>>>> +    }
>>>> +}
>>>> +
>>>> +static umode_t cputemp_is_visible(const void *data,
>>>> +                  enum hwmon_sensor_types type,
>>>> +                  u32 attr, int channel)
>>>> +{
>>>> +    const struct peci_cputemp *priv = data;
>>>> +
>>>> +    if (priv->temp_config[channel] & BIT(attr))
>>>> +        return 0444;
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static const struct hwmon_ops cputemp_ops = {
>>>> +    .is_visible = cputemp_is_visible,
>>>> +    .read_string = cputemp_read_string,
>>>> +    .read = cputemp_read,
>>>> +};
>>>> +
>>>> +static int check_resolved_cores(struct peci_cputemp *priv)
>>>> +{
>>>> +    struct peci_rd_pci_cfg_local_msg msg;
>>>> +    int rc;
>>>> +
>>>> +    if (!(priv->client->adapter->cmd_mask & BIT(PECI_CMD_RD_PCI_CFG_LOCAL)))
>>>> +        return -EINVAL;
>>>> +
>>>> +    /* Get the RESOLVED_CORES register value */
>>>> +    msg.addr = priv->addr;
>>>> +    msg.bus = 1;
>>>> +    msg.device = 30;
>>>> +    msg.function = 3;
>>>> +    msg.reg = 0xB4;
>>>
>>> Can this be made less magic with some defines ?
>>>
>>
>> Sure, will use defines instead.
>>
>>>> +    msg.rx_len = 4;
>>>> +
>>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PCI_CFG_LOCAL, &msg);
>>>> +    if (rc)
>>>> +        return rc;
>>>> +
>>>> +    priv->core_mask = msg.pci_config[3] << 24 |
>>>> +              msg.pci_config[2] << 16 |
>>>> +              msg.pci_config[1] << 8 |
>>>> +              msg.pci_config[0];
>>>> +
>>>> +    if (!priv->core_mask)
>>>> +        return -EAGAIN;
>>>> +
>>>> +    dev_dbg(priv->dev, "Scanned resolved cores: 0x%x\n", priv->core_mask);
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int create_core_temp_info(struct peci_cputemp *priv)
>>>> +{
>>>> +    int rc, i;
>>>> +
>>>> +    rc = check_resolved_cores(priv);
>>>> +    if (!rc) {
>>>> +        for (i = 0; i < priv->gen_info->core_max; i++) {
>>>> +            if (priv->core_mask & BIT(i)) {
>>>> +                priv->temp_config[priv->config_idx++] =
>>>> +                             config_table[channel_core];
>>>> +            }
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return rc;
>>>> +}
>>>> +
>>>> +static int check_cpu_id(struct peci_cputemp *priv)
>>>> +{
>>>> +    struct peci_rd_pkg_cfg_msg msg;
>>>> +    u32 cpu_id;
>>>> +    int i, rc;
>>>> +
>>>> +    msg.addr = priv->addr;
>>>> +    msg.index = MBX_INDEX_CPU_ID;
>>>> +    msg.param = PKG_ID_CPU_ID;
>>>> +    msg.rx_len = 4;
>>>> +
>>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>>> +    if (rc)
>>>> +        return rc;
>>>> +
>>>> +    cpu_id = ((msg.pkg_config[2] << 16) | (msg.pkg_config[1] << 8) |
>>>> +          msg.pkg_config[0]) & CLIENT_CPU_ID_MASK;
>>>> +
>>>> +    for (i = 0; i < CPU_GEN_MAX; i++) {
>>>> +        if (cpu_id == cpu_gen_info_table[i].cpu_id) {
>>>> +            priv->gen_info = &cpu_gen_info_table[i];
>>>> +            break;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (!priv->gen_info)
>>>> +        return -ENODEV;
>>>> +
>>>> +    dev_dbg(priv->dev, "CPU_ID: 0x%x\n", cpu_id);
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int peci_cputemp_probe(struct peci_client *client)
>>>> +{
>>>> +    struct device *dev = &client->dev;
>>>> +    struct peci_cputemp *priv;
>>>> +    struct device *hwmon_dev;
>>>> +    int rc;
>>>> +
>>>> +    if ((client->adapter->cmd_mask &
>>>> +        (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG))) !=
>>>> +        (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG))) {
>>>> +        dev_err(dev, "Client doesn't support temperature monitoring\n");
>>>> +        return -EINVAL;
>>>
>>> Does this mean there will be an error message for each non-supported
>>> CPU ? Why ?
>>>
>>
>> For proper operation of this driver, PECI_CMD_GET_TEMP and
>> PECI_CMD_RD_PKG_CFG have to be supported by the client CPU.
>> PECI_CMD_GET_TEMP is provided as a default command, but
>> PECI_CMD_RD_PKG_CFG depends on the PECI minor revision of the CPU
>> package, so this check is needed.
>>
> 
> I do not question the check. I question the error message and error
> return value. Why is it an _error_ if the CPU does not support the
> functionality, and why does it have to be reported in the kernel log ?
> 

Got it. I'll change that to dev_dbg.

>>>> +    }
>>>> +
>>>> +    priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
>>>> +    if (!priv)
>>>> +        return -ENOMEM;
>>>> +
>>>> +    dev_set_drvdata(dev, priv);
>>>> +    priv->client = client;
>>>> +    priv->dev = dev;
>>>> +    priv->addr = client->addr;
>>>> +    priv->cpu_no = priv->addr - PECI_BASE_ADDR;
>>>> +
>>>> +    snprintf(priv->name, PECI_NAME_SIZE, "peci_cputemp.cpu%d",
>>>> +         priv->cpu_no);
>>>> +
>>>> +    rc = check_cpu_id(priv);
>>>> +    if (rc) {
>>>> +        dev_err(dev, "Client CPU is not supported\n");
>>>
>>> -ENODEV is not an error, and should not result in an error message.
>>> Besides, the error can also be propagated from peci core code,
>>> and may well be something else.
>>>
>>
>> Got it. I'll remove the error message and will add a proper handling 
>> code into PECI core.
>>
>>>> +        return rc;
>>>> +    }
>>>> +
>>>> +    priv->temp_config[priv->config_idx++] = config_table[channel_die];
>>>> +    priv->temp_config[priv->config_idx++] = config_table[channel_dts_mrgn];
>>>> +    priv->temp_config[priv->config_idx++] = config_table[channel_tcontrol];
>>>> +    priv->temp_config[priv->config_idx++] = config_table[channel_tthrottle];
>>>> +    priv->temp_config[priv->config_idx++] = config_table[channel_tjmax];
>>>> +
>>>> +    rc = create_core_temp_info(priv);
>>>> +    if (rc)
>>>> +        dev_dbg(dev, "Failed to create core temp info\n");
>>>
>>> Then what ? Shouldn't this result in probe deferral or something more
>>> useful instead of just being ignored ?
>>>
>>
>> This driver can't support core temperature monitoring if a CPU doesn't
>> support the PECI_CMD_RD_PCI_CFG_LOCAL command. In that case, it skips
>> core temperature group creation and supports only basic temperature
>> monitoring of Die, DTS margin, etc. I'll add this description as a
>> comment.
>>
> 
> The message says "Failed to ...". It does not say "This CPU does not
> support ...".
> 

Got it. Will correct the message.

>>>> +
>>>> +    priv->chip.ops = &cputemp_ops;
>>>> +    priv->chip.info = priv->info;
>>>> +
>>>> +    priv->info[0] = &priv->temp_info;
>>>> +
>>>> +    priv->temp_info.type = hwmon_temp;
>>>> +    priv->temp_info.config = priv->temp_config;
>>>> +
>>>> +    hwmon_dev = devm_hwmon_device_register_with_info(priv->dev,
>>>> +                             priv->name,
>>>> +                             priv,
>>>> +                             &priv->chip,
>>>> +                             NULL);
>>>> +
>>>> +    if (IS_ERR(hwmon_dev))
>>>> +        return PTR_ERR(hwmon_dev);
>>>> +
>>>> +    dev_dbg(dev, "%s: sensor '%s'\n", dev_name(hwmon_dev), priv->name);
>>>> +
> 
> Why does this message display the device name twice ?
> 

For example, dev_name(hwmon_dev) shows 'hwmon5' and priv->name shows
'peci-cputemp0'.

>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static const struct of_device_id peci_cputemp_of_table[] = {
>>>> +    { .compatible = "intel,peci-cputemp" },
>>>> +    { }
>>>> +};
>>>> +MODULE_DEVICE_TABLE(of, peci_cputemp_of_table);
>>>> +
>>>> +static struct peci_driver peci_cputemp_driver = {
>>>> +    .probe  = peci_cputemp_probe,
>>>> +    .driver = {
>>>> +        .name           = "peci-cputemp",
>>>> +        .of_match_table = of_match_ptr(peci_cputemp_of_table),
>>>> +    },
>>>> +};
>>>> +module_peci_driver(peci_cputemp_driver);
>>>> +
>>>> +MODULE_AUTHOR("Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>");
>>>> +MODULE_DESCRIPTION("PECI cputemp driver");
>>>> +MODULE_LICENSE("GPL v2");
>>>> diff --git a/drivers/hwmon/peci-dimmtemp.c b/drivers/hwmon/peci-dimmtemp.c
>>>> new file mode 100644
>>>> index 000000000000..78bf29cb2c4c
>>>> --- /dev/null
>>>> +++ b/drivers/hwmon/peci-dimmtemp.c
>>>
>>> FWIW, this should be two separate patches.
>>>
>>
>> Should I split out hwmon documents and dt bindings too?
>>
>>>> @@ -0,0 +1,432 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +// Copyright (c) 2018 Intel Corporation
>>>> +
>>>> +#include <linux/delay.h>
>>>> +#include <linux/hwmon.h>
>>>> +#include <linux/hwmon-sysfs.h>
>>>
>>> Needed ?
>>>
>>
>> No. Will drop the line.
>>
>>>> +#include <linux/jiffies.h>
>>>> +#include <linux/module.h>
>>>> +#include <linux/of_device.h>
>>>> +#include <linux/peci.h>
>>>> +#include <linux/workqueue.h>
>>>> +
>>>> +#define TEMP_TYPE_PECI       6  /* Sensor type 6: Intel PECI */
>>>> +
>>>> +#define CHAN_RANK_MAX_ON_HSX 8  /* Max number of channel ranks on Haswell */
>>>> +#define DIMM_IDX_MAX_ON_HSX  3  /* Max DIMM index per channel on Haswell */
>>>> +
>>>> +#define CHAN_RANK_MAX_ON_BDX 4  /* Max number of channel ranks on Broadwell */
>>>> +#define DIMM_IDX_MAX_ON_BDX  3  /* Max DIMM index per channel on Broadwell */
>>>> +
>>>> +#define CHAN_RANK_MAX_ON_SKX 6  /* Max number of channel ranks on Skylake */
>>>> +#define DIMM_IDX_MAX_ON_SKX  2  /* Max DIMM index per channel on Skylake */
>>>> +
>>>> +#define CHAN_RANK_MAX        CHAN_RANK_MAX_ON_HSX
>>>> +#define DIMM_IDX_MAX         DIMM_IDX_MAX_ON_HSX
>>>> +
>>>> +#define DIMM_NUMS_MAX        (CHAN_RANK_MAX * DIMM_IDX_MAX)
>>>> +
>>>> +#define CLIENT_CPU_ID_MASK   0xf0ff0  /* Mask for Family / Model info */
>>>> +
>>>> +#define UPDATE_INTERVAL_MIN  HZ
>>>> +
>>>> +#define DIMM_MASK_CHECK_DELAY_JIFFIES msecs_to_jiffies(5000)
>>>> +#define DIMM_MASK_CHECK_RETRY_MAX     60 /* 60 x 5 secs = 5 minutes */
>>>> +
>>>> +enum cpu_gens {
>>>> +    CPU_GEN_HSX, /* Haswell Xeon */
>>>> +    CPU_GEN_BRX, /* Broadwell Xeon */
>>>> +    CPU_GEN_SKX, /* Skylake Xeon */
>>>> +    CPU_GEN_MAX
>>>> +};
>>>> +
>>>> +struct cpu_gen_info {
>>>> +    u32 type;
>>>> +    u32 cpu_id;
>>>> +    u32 chan_rank_max;
>>>> +    u32 dimm_idx_max;
>>>> +};
>>>> +
>>>> +struct temp_data {
>>>> +    bool valid;
>>>> +    s32  value;
>>>> +    unsigned long last_updated;
>>>> +};
>>>> +
>>>> +struct peci_dimmtemp {
>>>> +    struct peci_client *client;
>>>> +    struct device *dev;
>>>> +    struct workqueue_struct *work_queue;
>>>> +    struct delayed_work work_handler;
>>>> +    char name[PECI_NAME_SIZE];
>>>> +    struct temp_data temp[DIMM_NUMS_MAX];
>>>> +    u8 addr;
>>>> +    uint cpu_no;
>>>> +    const struct cpu_gen_info *gen_info;
>>>> +    u32 dimm_mask;
>>>> +    int retry_count;
>>>> +    int channels;
>>>> +    u32 temp_config[DIMM_NUMS_MAX + 1];
>>>> +    struct hwmon_channel_info temp_info;
>>>> +    const struct hwmon_channel_info *info[2];
>>>> +    struct hwmon_chip_info chip;
>>>> +};
>>>> +
>>>> +static const struct cpu_gen_info cpu_gen_info_table[] = {
>>>> +    { .type  = CPU_GEN_HSX,
>>>> +      .cpu_id = 0x306f0, /* Family code: 6, Model number: 63 (0x3f) */
>>>> +      .chan_rank_max = CHAN_RANK_MAX_ON_HSX,
>>>> +      .dimm_idx_max  = DIMM_IDX_MAX_ON_HSX },
>>>> +    { .type  = CPU_GEN_BRX,
>>>> +      .cpu_id = 0x406f0, /* Family code: 6, Model number: 79 (0x4f) */
>>>> +      .chan_rank_max = CHAN_RANK_MAX_ON_BDX,
>>>> +      .dimm_idx_max  = DIMM_IDX_MAX_ON_BDX },
>>>> +    { .type  = CPU_GEN_SKX,
>>>> +      .cpu_id = 0x50650, /* Family code: 6, Model number: 85 (0x55) */
>>>> +      .chan_rank_max = CHAN_RANK_MAX_ON_SKX,
>>>> +      .dimm_idx_max  = DIMM_IDX_MAX_ON_SKX },
>>>> +};
>>>> +
>>>> +static const char *dimmtemp_label[CHAN_RANK_MAX][DIMM_IDX_MAX] = {
>>>> +    { "DIMM A0", "DIMM A1", "DIMM A2" },
>>>> +    { "DIMM B0", "DIMM B1", "DIMM B2" },
>>>> +    { "DIMM C0", "DIMM C1", "DIMM C2" },
>>>> +    { "DIMM D0", "DIMM D1", "DIMM D2" },
>>>> +    { "DIMM E0", "DIMM E1", "DIMM E2" },
>>>> +    { "DIMM F0", "DIMM F1", "DIMM F2" },
>>>> +    { "DIMM G0", "DIMM G1", "DIMM G2" },
>>>> +    { "DIMM H0", "DIMM H1", "DIMM H2" },
>>>> +};
>>>> +
>>>> +static int send_peci_cmd(struct peci_dimmtemp *priv, enum peci_cmd cmd,
>>>> +             void *msg)
>>>> +{
>>>> +    return peci_command(priv->client->adapter, cmd, msg);
>>>> +}
>>>> +
>>>> +static int need_update(struct temp_data *temp)
>>>> +{
>>>> +    if (temp->valid &&
>>>> +        time_before(jiffies, temp->last_updated + UPDATE_INTERVAL_MIN))
>>>> +        return 0;
>>>> +
>>>> +    return 1;
>>>> +}
>>>> +
>>>> +static void mark_updated(struct temp_data *temp)
>>>> +{
>>>> +    temp->valid = true;
>>>> +    temp->last_updated = jiffies;
>>>> +}
>>>
>>> It might make sense to provide the duplicate functions in a core file.
>>>
>>
>> These are temperature-monitoring-specific functions, and they touch
>> module-specific variables. Do you really think that these non-generic
>> functions should be moved to PECI core?
>>
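
If the helpers were shared, one possible shape is sketched below in plain userspace C. All names here (temp_cache, tick_before(), cache_need_update(), cache_mark_updated()) are hypothetical; the caller passes the current tick count in, which stands in for jiffies, and tick_before() only mirrors the idea of the kernel's time_before().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct temp_cache {
    bool valid;
    long value;
    unsigned long last_updated; /* tick count at the last refresh */
};

/* Wrap-safe "a is earlier than b", the same idea as time_before(). */
static bool tick_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

/* True when the cached value is missing or older than min_interval ticks. */
static bool cache_need_update(const struct temp_cache *c, unsigned long now,
                              unsigned long min_interval)
{
    assert(c != NULL);
    return !(c->valid && tick_before(now, c->last_updated + min_interval));
}

static void cache_mark_updated(struct temp_cache *c, unsigned long now)
{
    assert(c != NULL);
    c->valid = true;
    c->last_updated = now;
}
```

Because the helpers take only the cache entry and a tick count, they do not depend on either driver's private struct, which is what would allow cputemp and dimmtemp to share them.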
>>>> +
>>>> +static int get_dimm_temp(struct peci_dimmtemp *priv, int dimm_no)
>>>> +{
>>>> +    int dimm_order = dimm_no % priv->gen_info->dimm_idx_max;
>>>> +    int chan_rank = dimm_no / priv->gen_info->dimm_idx_max;
>>>> +    struct peci_rd_pkg_cfg_msg msg;
>>>> +    int rc;
>>>> +
>>>> +    if (!need_update(&priv->temp[dimm_no]))
>>>> +        return 0;
>>>> +
>>>> +    msg.addr = priv->addr;
>>>> +    msg.index = MBX_INDEX_DDR_DIMM_TEMP;
>>>> +    msg.param = chan_rank;
>>>> +    msg.rx_len = 4;
>>>> +
>>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>>> +    if (rc)
>>>> +        return rc;
>>>> +
>>>> +    priv->temp[dimm_no].value = msg.pkg_config[dimm_order] * 1000;
>>>> +
>>>> +    mark_updated(&priv->temp[dimm_no]);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int find_dimm_number(struct peci_dimmtemp *priv, int channel)
>>>> +{
>>>> +    int dimm_nums_max = priv->gen_info->chan_rank_max *
>>>> +                priv->gen_info->dimm_idx_max;
>>>> +    int idx, found = 0;
>>>> +
>>>> +    for (idx = 0; idx < dimm_nums_max; idx++) {
>>>> +        if (priv->dimm_mask & BIT(idx)) {
>>>> +            if (channel == found)
>>>> +                break;
>>>> +
>>>> +            found++;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return idx;
>>>> +}
>>>
>>> This again looks like duplicate code.
>>>
>>
>> find_dimm_number()? I'm sure it isn't.
>>
>>>> +
>>>> +static int dimmtemp_read_string(struct device *dev,
>>>> +                enum hwmon_sensor_types type,
>>>> +                u32 attr, int channel, const char **str)
>>>> +{
>>>> +    struct peci_dimmtemp *priv = dev_get_drvdata(dev);
>>>> +    u32 dimm_idx_max = priv->gen_info->dimm_idx_max;
>>>> +    int dimm_no, chan_rank, dimm_idx;
>>>> +
>>>> +    switch (attr) {
>>>> +    case hwmon_temp_label:
>>>> +        dimm_no = find_dimm_number(priv, channel);
>>>> +        chan_rank = dimm_no / dimm_idx_max;
>>>> +        dimm_idx = dimm_no % dimm_idx_max;
>>>> +        *str = dimmtemp_label[chan_rank][dimm_idx];
>>>> +        return 0;
>>>> +    default:
>>>> +        return -EOPNOTSUPP;
>>>> +    }
>>>> +}
>>>> +
>>>> +static int dimmtemp_read(struct device *dev, enum hwmon_sensor_types type,
>>>> +             u32 attr, int channel, long *val)
>>>> +{
>>>> +    struct peci_dimmtemp *priv = dev_get_drvdata(dev);
>>>> +    int dimm_no = find_dimm_number(priv, channel);
>>>> +    int rc;
>>>> +
>>>> +    switch (attr) {
>>>> +    case hwmon_temp_input:
>>>> +        rc = get_dimm_temp(priv, dimm_no);
>>>> +        if (rc)
>>>> +            return rc;
>>>> +
>>>> +        *val = priv->temp[dimm_no].value;
>>>> +        return 0;
>>>> +    default:
>>>> +        return -EOPNOTSUPP;
>>>> +    }
>>>> +}
>>>> +
>>>> +static umode_t dimmtemp_is_visible(const void *data,
>>>> +                   enum hwmon_sensor_types type,
>>>> +                   u32 attr, int channel)
>>>> +{
>>>> +    switch (attr) {
>>>> +    case hwmon_temp_label:
>>>> +    case hwmon_temp_input:
>>>> +        return 0444;
>>>> +    default:
>>>> +        return 0;
>>>> +    }
>>>> +}
>>>> +
>>>> +static const struct hwmon_ops dimmtemp_ops = {
>>>> +    .is_visible = dimmtemp_is_visible,
>>>> +    .read_string = dimmtemp_read_string,
>>>> +    .read = dimmtemp_read,
>>>> +};
>>>> +
>>>> +static int check_populated_dimms(struct peci_dimmtemp *priv)
>>>> +{
>>>> +    u32 chan_rank_max = priv->gen_info->chan_rank_max;
>>>> +    u32 dimm_idx_max = priv->gen_info->dimm_idx_max;
>>>> +    struct peci_rd_pkg_cfg_msg msg;
>>>> +    int chan_rank, dimm_idx;
>>>> +    int rc, channels = 0;
>>>> +
>>>> +    for (chan_rank = 0; chan_rank < chan_rank_max; chan_rank++) {
>>>> +        msg.addr = priv->addr;
>>>> +        msg.index = MBX_INDEX_DDR_DIMM_TEMP;
>>>> +        msg.param = chan_rank;
>>>> +        msg.rx_len = 4;
>>>> +
>>>> +        rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>>> +        if (rc) {
>>>> +            priv->dimm_mask = 0;
>>>> +            return rc;
>>>> +        }
>>>> +
>>>> +        for (dimm_idx = 0; dimm_idx < dimm_idx_max; dimm_idx++) {
>>>> +            if (msg.pkg_config[dimm_idx]) {
>>>> +                priv->dimm_mask |= BIT(chan_rank *
>>>> +                               chan_rank_max +
>>>> +                               dimm_idx);
>>>> +                channels++;
>>>> +            }
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (!priv->dimm_mask)
>>>> +        return -EAGAIN;
>>>> +
>>>> +    priv->channels = channels;
>>>> +
>>>> +    dev_dbg(priv->dev, "Scanned populated DIMMs: 0x%x\n", priv->dimm_mask);
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int create_dimm_temp_info(struct peci_dimmtemp *priv)
>>>> +{
>>>> +    struct device *hwmon_dev;
>>>> +    int rc, i;
>>>> +
>>>> +    rc = check_populated_dimms(priv);
>>>> +    if (!rc) {
>>>
>>> Please handle error cases first.
>>>
>>
>> Sure, I'll rewrite it.
>>
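
A rough sketch of the error-first structure, reduced to the control flow only. The dimm_state struct and create_info_step() are hypothetical; the "registered" and "work_queued" flags stand in for the hwmon registration and queue_delayed_work() calls, and scan_rc stands in for the return value of check_populated_dimms().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define RETRY_MAX 60 /* same count as DIMM_MASK_CHECK_RETRY_MAX */

struct dimm_state {
    int retry_count;
    int registered;  /* stands in for hwmon registration */
    int work_queued; /* stands in for queue_delayed_work() */
};

static int create_info_step(struct dimm_state *st, int scan_rc)
{
    assert(st != NULL);

    /* Error cases first: retry on -EAGAIN until the budget runs out. */
    if (scan_rc == -EAGAIN) {
        if (st->retry_count >= RETRY_MAX)
            return -ETIMEDOUT;

        st->retry_count++;
        st->work_queued = 1;
        return -EAGAIN;
    }

    /* Any other failure is passed straight back to the caller. */
    if (scan_rc)
        return scan_rc;

    /* Success path, no longer nested inside an if block. */
    st->registered = 1;
    return 0;
}
```

The return values match the posted code (-EAGAIN while retrying, -ETIMEDOUT once the retries are used up); only the nesting changes.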
>>>> +        for (i = 0; i < priv->channels; i++)
>>>> +            priv->temp_config[i] = HWMON_T_LABEL | HWMON_T_INPUT;
>>>> +
>>>> +        priv->chip.ops = &dimmtemp_ops;
>>>> +        priv->chip.info = priv->info;
>>>> +
>>>> +        priv->info[0] = &priv->temp_info;
>>>> +
>>>> +        priv->temp_info.type = hwmon_temp;
>>>> +        priv->temp_info.config = priv->temp_config;
>>>> +
>>>> +        hwmon_dev = devm_hwmon_device_register_with_info(priv->dev,
>>>> +                                 priv->name,
>>>> +                                 priv,
>>>> +                                 &priv->chip,
>>>> +                                 NULL);
>>>> +        rc = PTR_ERR_OR_ZERO(hwmon_dev);
>>>> +        if (!rc)
>>>> +            dev_dbg(priv->dev, "%s: sensor '%s'\n",
>>>> +                dev_name(hwmon_dev), priv->name);
>>>> +    } else if (rc == -EAGAIN) {
>>>> +        if (priv->retry_count < DIMM_MASK_CHECK_RETRY_MAX) {
>>>> +            queue_delayed_work(priv->work_queue,
>>>> +                       &priv->work_handler,
>>>> +                       DIMM_MASK_CHECK_DELAY_JIFFIES);
>>>> +            priv->retry_count++;
>>>> +            dev_dbg(priv->dev,
>>>> +                "Deferred DIMM temp info creation\n");
>>>> +        } else {
>>>> +            rc = -ETIMEDOUT;
>>>> +            dev_err(priv->dev,
>>>> +                "Timeout retrying DIMM temp info creation\n");
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return rc;
>>>> +}
>>>> +
>>>> +static void create_dimm_temp_info_delayed(struct work_struct *work)
>>>> +{
>>>> +    struct delayed_work *dwork = to_delayed_work(work);
>>>> +    struct peci_dimmtemp *priv = container_of(dwork, struct 
>>>> peci_dimmtemp,
>>>> +                          work_handler);
>>>> +    int rc;
>>>> +
>>>> +    rc = create_dimm_temp_info(priv);
>>>> +    if (rc && rc != -EAGAIN)
>>>> +        dev_dbg(priv->dev, "Failed to create DIMM temp info\n");
>>>> +}
>>>> +
>>>> +static int check_cpu_id(struct peci_dimmtemp *priv)
>>>> +{
>>>> +    struct peci_rd_pkg_cfg_msg msg;
>>>> +    u32 cpu_id;
>>>> +    int i, rc;
>>>> +
>>>> +    msg.addr = priv->addr;
>>>> +    msg.index = MBX_INDEX_CPU_ID;
>>>> +    msg.param = PKG_ID_CPU_ID;
>>>> +    msg.rx_len = 4;
>>>> +
>>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>>> +    if (rc)
>>>> +        return rc;
>>>> +
>>>> +    cpu_id = ((msg.pkg_config[2] << 16) | (msg.pkg_config[1] << 8) |
>>>> +          msg.pkg_config[0]) & CLIENT_CPU_ID_MASK;
>>>> +
>>>> +    for (i = 0; i < CPU_GEN_MAX; i++) {
>>>> +        if (cpu_id == cpu_gen_info_table[i].cpu_id) {
>>>> +            priv->gen_info = &cpu_gen_info_table[i];
>>>> +            break;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (!priv->gen_info)
>>>> +        return -ENODEV;
>>>> +
>>>> +    dev_dbg(priv->dev, "CPU_ID: 0x%x\n", cpu_id);
>>>> +    return 0;
>>>> +}
>>>
>>> More duplicate code.
>>>
>>
>> Okay. In case of check_cpu_id(), it could be used as a generic PECI 
>> function. I'll move it into PECI core.
>>
>>>> +
>>>> +static int peci_dimmtemp_probe(struct peci_client *client)
>>>> +{
>>>> +    struct device *dev = &client->dev;
>>>> +    struct peci_dimmtemp *priv;
>>>> +    int rc;
>>>> +
>>>> +    if ((client->adapter->cmd_mask &
>>>> +        (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG))) !=
>>>> +        (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG))) {
>>>
>>> One set of ( ) is unnecessary on each side of the expression.
>>>
>>
>> '&' has a precedence over '!=' but '|' doesn't. I'll rewrite it to:
>>
> 
> Actually, that is wrong. You refer to address-of. Bit operations do have
> lower precedence than comparisons. I stand corrected.
> 
>>      if (client->adapter->cmd_mask &
>>          (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG)) !=
>>          (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG)))
>>
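
On the precedence point above: in C, '!=' binds tighter than binary '&',
so without the inner parentheses the two constant expressions are
compared first and the 0/1 result is then ANDed with the mask. The
rewritten check would therefore always evaluate to 0. A minimal userspace
sketch, assuming hypothetical bit positions (the real PECI_CMD_* values
are not shown in this thread) and hypothetical helper names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BIT(n) (1u << (n))

/* Hypothetical bit positions, for illustration only. */
#define CMD_GET_TEMP   2
#define CMD_RD_PKG_CFG 3

/* True only when every bit in 'required' is also set in 'mask'. */
static bool has_all_cmds(uint32_t mask, uint32_t required)
{
	/* The inner parentheses are needed: '==' binds tighter than '&'. */
	return (mask & required) == required;
}

/* How the compiler groups "mask & a != b": the comparison happens first. */
static uint32_t as_parsed_without_parens(uint32_t mask, uint32_t a, uint32_t b)
{
	return mask & (uint32_t)(a != b);
}
```

A small helper like has_all_cmds() keeps the required parentheses in one
place instead of repeating the long expression in each probe().
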
>>>> +        dev_err(dev, "Client doesn't support temperature monitoring\n");
>>>> +        return -EINVAL;
>>>
>>> Why is this "invalid", and why does it warrant an error message ?
>>>
>>
>> Should I use -EPERM? Any suggestion?
>>
> 
> Is it an _error_ if the CPU does not support this functionality ?
> 

Actually, it returns from this probe() function without creating any
hwmon info, so I intended to handle this case as an error. Am I wrong?

>>>> +    }
>>>> +
>>>> +    priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
>>>> +    if (!priv)
>>>> +        return -ENOMEM;
>>>> +
>>>> +    dev_set_drvdata(dev, priv);
>>>> +    priv->client = client;
>>>> +    priv->dev = dev;
>>>> +    priv->addr = client->addr;
>>>> +    priv->cpu_no = priv->addr - PECI_BASE_ADDR;
>>>
>>> Is priv->addr guaranteed to be >= PECI_BASE_ADDR ?
>>
>> Client address range validation will be done in 
>> peci_check_addr_validity() in PECI core before probing a device driver.
>>
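
For illustration, the kind of range check meant here could look like the
sketch below. It is not the body of peci_check_addr_validity(), which is
not shown in this thread. The helper name is hypothetical, and the values
assume the 0x30..0x37 client range given in the peci-client binding, so
PECI_OFFSET_MAX is taken to be 8:

```c
#include <assert.h>
#include <stdint.h>

#define PECI_BASE_ADDR  0x30 /* first CPU client address per the binding */
#define PECI_OFFSET_MAX 8    /* assumed: 0x30..0x37 gives eight clients */

/* Returns the CPU number for a client address, or -1 if out of range. */
static int peci_addr_to_cpu_no(uint8_t addr)
{
	if (addr < PECI_BASE_ADDR || addr >= PECI_BASE_ADDR + PECI_OFFSET_MAX)
		return -1;

	return addr - PECI_BASE_ADDR;
}
```

With a check of this shape done before probe(), the subtraction in the
client driver cannot produce a negative CPU number.
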
>>>> +
>>>> +    snprintf(priv->name, PECI_NAME_SIZE, "peci_dimmtemp.cpu%d",
>>>> +         priv->cpu_no);
>>>> +
>>>> +    rc = check_cpu_id(priv);
>>>> +    if (rc) {
>>>> +        dev_err(dev, "Client CPU is not supported\n");
>>>
>>> Or the peci command failed.
>>>
>>
>> I'll remove the error message and will add proper handling for each
>> error type to the PECI core.
>>
>>>> +        return rc;
>>>> +    }
>>>> +
>>>> +    priv->work_queue = alloc_ordered_workqueue(priv->name, 0);
>>>> +    if (!priv->work_queue)
>>>> +        return -ENOMEM;
>>>> +
>>>> +    INIT_DELAYED_WORK(&priv->work_handler, create_dimm_temp_info_delayed);
>>>> +
>>>> +    rc = create_dimm_temp_info(priv);
>>>> +    if (rc && rc != -EAGAIN) {
>>>> +        dev_err(dev, "Failed to create DIMM temp info\n");
>>>> +        goto err_free_wq;
>>>> +    }
>>>> +
>>>> +    return 0;
>>>> +
>>>> +err_free_wq:
>>>> +    destroy_workqueue(priv->work_queue);
>>>> +    return rc;
>>>> +}
>>>> +
>>>> +static int peci_dimmtemp_remove(struct peci_client *client)
>>>> +{
>>>> +    struct peci_dimmtemp *priv = dev_get_drvdata(&client->dev);
>>>> +
>>>> +    cancel_delayed_work(&priv->work_handler);
>>>
>>> cancel_delayed_work_sync() ?
>>>
>>
>> Yes, it would be safer. Will fix it.
>>
>>>> +    destroy_workqueue(priv->work_queue);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static const struct of_device_id peci_dimmtemp_of_table[] = {
>>>> +    { .compatible = "intel,peci-dimmtemp" },
>>>> +    { }
>>>> +};
>>>> +MODULE_DEVICE_TABLE(of, peci_dimmtemp_of_table);
>>>> +
>>>> +static struct peci_driver peci_dimmtemp_driver = {
>>>> +    .probe  = peci_dimmtemp_probe,
>>>> +    .remove = peci_dimmtemp_remove,
>>>> +    .driver = {
>>>> +        .name           = "peci-dimmtemp",
>>>> +        .of_match_table = of_match_ptr(peci_dimmtemp_of_table),
>>>> +    },
>>>> +};
>>>> +module_peci_driver(peci_dimmtemp_driver);
>>>> +
>>>> +MODULE_AUTHOR("Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>");
>>>> +MODULE_DESCRIPTION("PECI dimmtemp driver");
>>>> +MODULE_LICENSE("GPL v2");
>>>> -- 
>>>> 2.16.2
>>>>
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe linux-hwmon" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: [PATCH v3 05/10] ARM: dts: aspeed: peci: Add PECI node
From: Jae Hyun Yoo @ 2018-04-12  2:20 UTC (permalink / raw)
  To: Joel Stanley
  Cc: Alan Cox, Andrew Jeffery, Andrew Lunn, Andy Shevchenko,
	Arnd Bergmann, Benjamin Herrenschmidt, Fengguang Wu, Greg KH,
	Guenter Roeck, Haiyue Wang, James Feist, Jason M Biils,
	Jean Delvare, Julia Cartwright, Miguel Ojeda, Milton Miller II,
	Pavel Machek, Randy Dunlap, Stef van Os, Sumeet R Pawnikar,
	Vernon Mauery, Linux Kernel Mailing List, linux-doc, devicetree,
	linux-hwmon, Linux ARM, OpenBMC Maillist
In-Reply-To: <CACPK8Xf7ta43K=POx8J6bAdjvX=wCzy-+HWqHsm+RstQxeetCQ@mail.gmail.com>

On 4/11/2018 4:52 AM, Joel Stanley wrote:
> On 11 April 2018 at 04:02, Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com> wrote:
>> This commit adds PECI bus/adapter node of AST24xx/AST25xx into
>> aspeed-g4 and aspeed-g5.
>>
> 
> The patches to the device trees get merged by the ASPEED maintainer
> (me). Once you have the bindings reviewed you can send the patches to
> me and the linux-aspeed list (I've got a pending patch to maintainers
> that will ensure get_maintainer.pl does the right thing as far as
> email addresses go).
> 
> I'd suggest dropping it from your series and re-sending once the
> bindings and driver are reviewed.
> 
> Cheers,
> 
> Joel
> 

Do you mean the bindings and the driver for the ASPEED PECI adapter,
including the documents?

Thanks,
-Jae

* Re: [PATCH v3 04/10] Documentations: dt-bindings: Add a document of PECI adapter driver for Aspeed AST24xx/25xx SoCs
From: Jae Hyun Yoo @ 2018-04-12  2:11 UTC (permalink / raw)
  To: Joel Stanley, Rob Herring, linux-aspeed, Ryan Chen
  Cc: Alan Cox, Andrew Jeffery, Andrew Lunn, Andy Shevchenko,
	Arnd Bergmann, Benjamin Herrenschmidt, Fengguang Wu, Greg KH,
	Guenter Roeck, Haiyue Wang, James Feist, Jason M Biils,
	Jean Delvare, Julia Cartwright, Miguel Ojeda, Milton Miller II,
	Pavel Machek, Randy Dunlap, Stef van Os, Sumeet R Pawnikar,
	Vernon Mauery, Linux Kernel Mailing List, linux-doc, devicetree,
	linux-hwmon, Linux ARM, OpenBMC Maillist
In-Reply-To: <CACPK8XcfrheC+UDB7mLPkgPRCuk8f+P830NG+cy2v6F_PJvUTw@mail.gmail.com>

Hi Joel,

On 4/11/2018 4:52 AM, Joel Stanley wrote:
> On 11 April 2018 at 04:02, Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com> wrote:
>> This commit adds a dt-bindings document of PECI adapter driver for Aspeed
> 
> We try to capitalise ASPEED.
> 

Got it. Will capitalize all ASPEED occurrences.

>> AST24xx/25xx SoCs.
>> ---
>>   .../devicetree/bindings/peci/peci-aspeed.txt       | 60 ++++++++++++++++++++++
>>   1 file changed, 60 insertions(+)
>>   create mode 100644 Documentation/devicetree/bindings/peci/peci-aspeed.txt
>>
>> diff --git a/Documentation/devicetree/bindings/peci/peci-aspeed.txt b/Documentation/devicetree/bindings/peci/peci-aspeed.txt
>> new file mode 100644
>> index 000000000000..4598bb8c20fa
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/peci/peci-aspeed.txt
>> @@ -0,0 +1,60 @@
>> +Device tree configuration for PECI buses on the AST24XX and AST25XX SoCs.
>> +
>> +Required properties:
>> +- compatible        : Should be "aspeed,ast2400-peci" or "aspeed,ast2500-peci"
>> +                     - aspeed,ast2400-peci: Aspeed AST2400 family PECI
>> +                                            controller
>> +                     - aspeed,ast2500-peci: Aspeed AST2500 family PECI
>> +                                            controller
>> +- reg               : Should contain PECI controller registers location and
>> +                     length.
>> +- #address-cells    : Should be <1>.
>> +- #size-cells       : Should be <0>.
>> +- interrupts        : Should contain PECI controller interrupt.
>> +- clocks            : Should contain clock source for PECI controller.
>> +                     Should reference clkin.
> 
> Are you sure that this is driven by clkin? Most peripherals on the
> Aspeed are attached to the apb, so should reference that clock.
> 

According to the datasheet, the PECI controller module is attached to the
APB, but its clock source is the 24 MHz external clock.

>> +- clock-frequency   : Should contain the operation frequency of PECI controller
>> +                     in units of Hz.
>> +                     187500 ~ 24000000
> 
> Can you explain why you need both the parent clock and this frequency
> to be specified?
> 

The driver uses this setting to compute the clock divisor value that sets
the operating clock of the PECI controller, which is adjustable.
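
As a rough illustration of that calculation, the sketch below picks a
divider for a requested frequency. It assumes the divider is a power of
two selected by a 3-bit exponent. That is only inferred from the
documented range (24000000 / 2^7 = 187500) and from the 3-bit CLK_DIV
field in the driver, so treat it as an assumption. The function name is
hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define PECI_CLK_DIV_MAX 7 /* largest value of the 3-bit CLK_DIV field */

/*
 * Returns the smallest exponent 'exp' (0..7) for which
 * parent_hz / 2^exp does not exceed target_hz. Requests below the
 * slowest achievable rate are clamped to PECI_CLK_DIV_MAX.
 */
static unsigned int peci_clk_div_exponent(uint32_t parent_hz,
					  uint32_t target_hz)
{
	unsigned int exp = 0;

	assert(target_hz != 0);

	while (exp < PECI_CLK_DIV_MAX && (parent_hz >> exp) > target_hz)
		exp++;

	return exp;
}
```

With a 24 MHz parent, a request of 24000000 maps to exponent 0 and a
request of 187500 maps to exponent 7, which matches the two ends of the
range in the binding text.
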

>> +
>> +Optional properties:
>> +- msg-timing-nego   : Message timing negotiation period. This value will
> 
> Perhaps msg-timing-period? Or just msg-timing?
> 

Will use msg-timing instead.

>> +                     determine the period of message timing negotiation to be
>> +                     issued by PECI controller. The unit of the programmed
>> +                     value is four times of PECI clock period.
>> +                     0 ~ 255 (default: 1)
>> +- addr-timing-nego  : Address timing negotiation period. This value will
>> +                     determine the period of address timing negotiation to be
>> +                     issued by PECI controller. The unit of the programmed
>> +                     value is four times of PECI clock period.
>> +                     0 ~ 255 (default: 1)
>> +- rd-sampling-point : Read sampling point selection. The whole period of a bit
>> +                     time will be divided into 16 time frames. This value will
>> +                     determine the time frame in which the controller will
>> +                     sample PECI signal for data read back. Usually in the
>> +                     middle of a bit time is the best.
>> +                     0 ~ 15 (default: 8)
>> +- cmd-timeout-ms    : Command timeout in units of ms.
>> +                     1 ~ 60000 (default: 1000)
>> +
>> +Example:
>> +       peci: peci@1e78b000 {
>> +               compatible = "simple-bus";
>> +               #address-cells = <1>;
>> +               #size-cells = <1>;
>> +               ranges = <0x0 0x1e78b000 0x60>;
>> +
>> +               peci0: peci-bus@0 {
>> +                       compatible = "aspeed,ast2500-peci";
>> +                       reg = <0x0 0x60>;
>> +                       #address-cells = <1>;
>> +                       #size-cells = <0>;
>> +                       interrupts = <15>;
>> +                       clocks = <&clk_clkin>;
>> +                       clock-frequency = <24000000>;
>> +                       msg-timing-nego = <1>;
>> +                       addr-timing-nego = <1>;
>> +                       rd-sampling-point = <8>;
>> +                       cmd-timeout-ms = <1000>;
>> +               };
>> +       };
>> --
>> 2.16.2
>>

* Re: [PATCH v3 01/10] Documentations: dt-bindings: Add documents of generic PECI bus, adapter and client drivers
From: Jae Hyun Yoo @ 2018-04-12  2:06 UTC (permalink / raw)
  To: Joel Stanley, Rob Herring
  Cc: Alan Cox, Andrew Jeffery, Andrew Lunn, Andy Shevchenko,
	Arnd Bergmann, Benjamin Herrenschmidt, Fengguang Wu, Greg KH,
	Guenter Roeck, Haiyue Wang, James Feist, Jason M Biils,
	Jean Delvare, Julia Cartwright, Miguel Ojeda, Milton Miller II,
	Pavel Machek, Randy Dunlap, Stef van Os, Sumeet R Pawnikar,
	Vernon Mauery, Linux Kernel Mailing List, linux-doc, devicetree,
	linux-hwmon, Linux ARM, OpenBMC Maillist
In-Reply-To: <CACPK8XfwVTKfOQSteY8-Dy2=YiC6qjd1qZ2f8n1hTv8w06DWRg@mail.gmail.com>

Hi Joel,

On 4/11/2018 4:52 AM, Joel Stanley wrote:
> Hi Jae,
> 
> On 11 April 2018 at 04:02, Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com> wrote:
>> This commit adds documents of generic PECI bus, adapter and client drivers.
>>
>> Signed-off-by: Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>
>> Reviewed-by: Haiyue Wang <haiyue.wang@linux.intel.com>
>> Reviewed-by: James Feist <james.feist@linux.intel.com>
>> Reviewed-by: Vernon Mauery <vernon.mauery@linux.intel.com>
>> Cc: Alan Cox <alan@linux.intel.com>
>> Cc: Andrew Jeffery <andrew@aj.id.au>
>> Cc: Andrew Lunn <andrew@lunn.ch>
>> Cc: Andy Shevchenko <andriy.shevchenko@intel.com>
>> Cc: Arnd Bergmann <arnd@arndb.de>
>> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>> Cc: Fengguang Wu <fengguang.wu@intel.com>
>> Cc: Greg KH <gregkh@linuxfoundation.org>
>> Cc: Guenter Roeck <linux@roeck-us.net>
>> Cc: Jason M Biils <jason.m.bills@linux.intel.com>
>> Cc: Jean Delvare <jdelvare@suse.com>
>> Cc: Joel Stanley <joel@jms.id.au>
>> Cc: Julia Cartwright <juliac@eso.teric.us>
>> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
>> Cc: Milton Miller II <miltonm@us.ibm.com>
>> Cc: Pavel Machek <pavel@ucw.cz>
>> Cc: Randy Dunlap <rdunlap@infradead.org>
>> Cc: Stef van Os <stef.van.os@prodrive-technologies.com>
>> Cc: Sumeet R Pawnikar <sumeet.r.pawnikar@intel.com>
> 
> That's a hefty cc list. I can't see Rob Herring though, and he's
> usually the person who you need to convince to get your bindings
> accepted.
> 
> I recommend using ./scripts/get_maintainer.pl to build your CC list,
> and then add others you think are relevant.
> 
> I'm not sure what the guidelines are for generic bindings, so I'll
> defer to Rob for this patch.
> 
> Cheers,
> 
> Joel
> 

Thanks a lot for letting me know that. I'll do as you suggested.

-Jae

>> ---
>>   .../devicetree/bindings/peci/peci-adapter.txt      | 23 ++++++++++++++++++++
>>   .../devicetree/bindings/peci/peci-bus.txt          | 15 +++++++++++++
>>   .../devicetree/bindings/peci/peci-client.txt       | 25 ++++++++++++++++++++++
>>   3 files changed, 63 insertions(+)
>>   create mode 100644 Documentation/devicetree/bindings/peci/peci-adapter.txt
>>   create mode 100644 Documentation/devicetree/bindings/peci/peci-bus.txt
>>   create mode 100644 Documentation/devicetree/bindings/peci/peci-client.txt
>>
>> diff --git a/Documentation/devicetree/bindings/peci/peci-adapter.txt b/Documentation/devicetree/bindings/peci/peci-adapter.txt
>> new file mode 100644
>> index 000000000000..9221374f6b11
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/peci/peci-adapter.txt
>> @@ -0,0 +1,23 @@
>> +Generic device tree configuration for PECI adapters.
>> +
>> +Required properties:
>> +- compatible     : Should contain hardware specific definition strings that can
>> +                  match an adapter driver implementation.
>> +- reg            : Should contain PECI controller registers location and length.
>> +- #address-cells : Should be <1>.
>> +- #size-cells    : Should be <0>.
>> +
>> +Example:
>> +       peci: peci@10000000 {
>> +               compatible = "simple-bus";
>> +               #address-cells = <1>;
>> +               #size-cells = <1>;
>> +               ranges = <0x0 0x10000000 0x1000>;
>> +
>> +               peci0: peci-bus@0 {
>> +                       compatible = "soc,soc-peci";
>> +                       reg = <0x0 0x1000>;
>> +                       #address-cells = <1>;
>> +                       #size-cells = <0>;
>> +               };
>> +       };
>> diff --git a/Documentation/devicetree/bindings/peci/peci-bus.txt b/Documentation/devicetree/bindings/peci/peci-bus.txt
>> new file mode 100644
>> index 000000000000..90bcc791ccb0
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/peci/peci-bus.txt
>> @@ -0,0 +1,15 @@
>> +Generic device tree configuration for PECI buses.
>> +
>> +Required properties:
>> +- compatible     : Should be "simple-bus".
>> +- #address-cells : Should be <1>.
>> +- #size-cells    : Should be <1>.
>> +- ranges         : Should contain PECI controller registers ranges.
>> +
>> +Example:
>> +       peci: peci@10000000 {
>> +               compatible = "simple-bus";
>> +               #address-cells = <1>;
>> +               #size-cells = <1>;
>> +               ranges = <0x0 0x10000000 0x1000>;
>> +       };
>> diff --git a/Documentation/devicetree/bindings/peci/peci-client.txt b/Documentation/devicetree/bindings/peci/peci-client.txt
>> new file mode 100644
>> index 000000000000..8e2bfd8532f6
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/peci/peci-client.txt
>> @@ -0,0 +1,25 @@
>> +Generic device tree configuration for PECI clients.
>> +
>> +Required properties:
>> +- compatible : Should contain target device specific definition strings that can
>> +              match a client driver implementation.
>> +- reg        : Should contain address of a client CPU. Address range of CPU
>> +              clients is starting from 0x30 based on PECI specification.
>> +              <0x30> .. <0x37> (depends on the PECI_OFFSET_MAX definition)
>> +
>> +Example:
>> +       peci-bus@0 {
>> +               #address-cells = <1>;
>> +               #size-cells = <0>;
>> +               < more properties >
>> +
>> +               function@cpu0 {
>> +                       compatible = "device,function";
>> +                       reg = <0x30>;
>> +               };
>> +
>> +               function@cpu1 {
>> +                       compatible = "device,function";
>> +                       reg = <0x31>;
>> +               };
>> +       };
>> --
>> 2.16.2
>>

* Re: [PATCH v3 06/10] drivers/peci: Add a PECI adapter driver for Aspeed AST24xx/AST25xx
From: Jae Hyun Yoo @ 2018-04-12  2:03 UTC (permalink / raw)
  To: Joel Stanley, Ryan Chen
  Cc: Alan Cox, Andrew Jeffery, Andrew Lunn, Andy Shevchenko,
	Arnd Bergmann, Benjamin Herrenschmidt, Fengguang Wu, Greg KH,
	Guenter Roeck, Haiyue Wang, James Feist, Jason M Biils,
	Jean Delvare, Julia Cartwright, Miguel Ojeda, Milton Miller II,
	Pavel Machek, Randy Dunlap, Stef van Os, Sumeet R Pawnikar,
	Vernon Mauery, Linux Kernel Mailing List, linux-doc, devicetree,
	linux-hwmon, Linux ARM, OpenBMC Maillist
In-Reply-To: <CACPK8XdODUEZNZCkSe+Aq4xdptfhfE=ureUL_v7TgR7w+rMznw@mail.gmail.com>

Hello Joel,

Thanks for sharing your time. Please see my answers inline.

On 4/11/2018 4:51 AM, Joel Stanley wrote:
> Hello Jae,
> 
> On 11 April 2018 at 04:02, Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com> wrote:
>> This commit adds PECI adapter driver implementation for Aspeed
>> AST24xx/AST25xx.
> 
> The driver is looking good!
> 
> It looks like you've done some kind of review that we weren't allowed
> to see, which is a double edged sword - I might be asking about things
> that you've already spoken about with someone else.
> 
> I'm only just learning about PECI, but I do have some general comments below.
> 

Yes, there was a non-public review between v2 and v3. I know it's an
unusual process, but it was requested. Hopefully the change logs in the
cover letter give a rough idea of the details. Thanks for your comments.

>> ---
>>   drivers/peci/Kconfig       |  28 +++
>>   drivers/peci/Makefile      |   3 +
>>   drivers/peci/peci-aspeed.c | 504 +++++++++++++++++++++++++++++++++++++++++++++
>>   3 files changed, 535 insertions(+)
>>   create mode 100644 drivers/peci/peci-aspeed.c
>>
>> diff --git a/drivers/peci/Kconfig b/drivers/peci/Kconfig
>> index 1fbc13f9e6c2..0e33420365de 100644
>> --- a/drivers/peci/Kconfig
>> +++ b/drivers/peci/Kconfig
>> @@ -14,4 +14,32 @@ config PECI
>>            processors and chipset components to external monitoring or control
>>            devices.
>>
>> +         If you want PECI support, you should say Y here and also to the
>> +         specific driver for your bus adapter(s) below.
>> +
>> +if PECI
>> +
>> +#
>> +# PECI hardware bus configuration
>> +#
>> +
>> +menu "PECI Hardware Bus support"
>> +
>> +config PECI_ASPEED
>> +       tristate "Aspeed AST24xx/AST25xx PECI support"
> 
> I think just saying ASPEED PECI support is enough. That way if the
> next ASPEED SoC happens to have PECI we don't need to update all of
> the help text :)
> 

Agreed. I'll change the description.

>> +       select REGMAP_MMIO
>> +       depends on OF
>> +       depends on ARCH_ASPEED || COMPILE_TEST
>> +       help
>> +         Say Y here if you want support for the Platform Environment Control
>> +         Interface (PECI) bus adapter driver on the Aspeed AST24XX and AST25XX
>> +         SoCs.
>> +
>> +         This support is also available as a module.  If so, the module
>> +         will be called peci-aspeed.
>> +
>> +endmenu
>> +
>> +endif # PECI
>> +
>>   endmenu
>> diff --git a/drivers/peci/Makefile b/drivers/peci/Makefile
>> index 9e8615e0d3ff..886285e69765 100644
>> --- a/drivers/peci/Makefile
>> +++ b/drivers/peci/Makefile
>> @@ -4,3 +4,6 @@
>>
>>   # Core functionality
>>   obj-$(CONFIG_PECI)             += peci-core.o
>> +
>> +# Hardware specific bus drivers
>> +obj-$(CONFIG_PECI_ASPEED)      += peci-aspeed.o
>> diff --git a/drivers/peci/peci-aspeed.c b/drivers/peci/peci-aspeed.c
>> new file mode 100644
>> index 000000000000..be2a1f327eb1
>> --- /dev/null
>> +++ b/drivers/peci/peci-aspeed.c
>> @@ -0,0 +1,504 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +// Copyright (C) 2012-2017 ASPEED Technology Inc.
>> +// Copyright (c) 2018 Intel Corporation
>> +
>> +#include <linux/clk.h>
>> +#include <linux/delay.h>
>> +#include <linux/interrupt.h>
>> +#include <linux/jiffies.h>
>> +#include <linux/module.h>
>> +#include <linux/of.h>
>> +#include <linux/peci.h>
>> +#include <linux/platform_device.h>
>> +#include <linux/regmap.h>
>> +
>> +#define DUMP_DEBUG 0
>> +
>> +/* Aspeed PECI Registers */
>> +#define AST_PECI_CTRL     0x00
> 
> Nit: we use ASPEED instead of AST in the upstream kernel to distinguish
> from the aspeed sdk drivers. If you feel strongly about this then I
> won't insist you change.
> 

Okay then, better change it now than later. Will change all defines.

>> +#define AST_PECI_TIMING   0x04
>> +#define AST_PECI_CMD      0x08
>> +#define AST_PECI_CMD_CTRL 0x0c
>> +#define AST_PECI_EXP_FCS  0x10
>> +#define AST_PECI_CAP_FCS  0x14
>> +#define AST_PECI_INT_CTRL 0x18
>> +#define AST_PECI_INT_STS  0x1c
>> +#define AST_PECI_W_DATA0  0x20
>> +#define AST_PECI_W_DATA1  0x24
>> +#define AST_PECI_W_DATA2  0x28
>> +#define AST_PECI_W_DATA3  0x2c
>> +#define AST_PECI_R_DATA0  0x30
>> +#define AST_PECI_R_DATA1  0x34
>> +#define AST_PECI_R_DATA2  0x38
>> +#define AST_PECI_R_DATA3  0x3c
>> +#define AST_PECI_W_DATA4  0x40
>> +#define AST_PECI_W_DATA5  0x44
>> +#define AST_PECI_W_DATA6  0x48
>> +#define AST_PECI_W_DATA7  0x4c
>> +#define AST_PECI_R_DATA4  0x50
>> +#define AST_PECI_R_DATA5  0x54
>> +#define AST_PECI_R_DATA6  0x58
>> +#define AST_PECI_R_DATA7  0x5c
>> +
>> +/* AST_PECI_CTRL - 0x00 : Control Register */
>> +#define PECI_CTRL_SAMPLING_MASK     GENMASK(19, 16)
>> +#define PECI_CTRL_SAMPLING(x)       (((x) << 16) & PECI_CTRL_SAMPLING_MASK)
>> +#define PECI_CTRL_SAMPLING_GET(x)   (((x) & PECI_CTRL_SAMPLING_MASK) >> 16)
>> +#define PECI_CTRL_READ_MODE_MASK    GENMASK(13, 12)
>> +#define PECI_CTRL_READ_MODE(x)      (((x) << 12) & PECI_CTRL_READ_MODE_MASK)
>> +#define PECI_CTRL_READ_MODE_GET(x)  (((x) & PECI_CTRL_READ_MODE_MASK) >> 12)
>> +#define PECI_CTRL_READ_MODE_COUNT   BIT(12)
>> +#define PECI_CTRL_READ_MODE_DBG     BIT(13)
>> +#define PECI_CTRL_CLK_SOURCE_MASK   BIT(11)
>> +#define PECI_CTRL_CLK_SOURCE(x)     (((x) << 11) & PECI_CTRL_CLK_SOURCE_MASK)
>> +#define PECI_CTRL_CLK_SOURCE_GET(x) (((x) & PECI_CTRL_CLK_SOURCE_MASK) >> 11)
>> +#define PECI_CTRL_CLK_DIV_MASK      GENMASK(10, 8)
>> +#define PECI_CTRL_CLK_DIV(x)        (((x) << 8) & PECI_CTRL_CLK_DIV_MASK)
>> +#define PECI_CTRL_CLK_DIV_GET(x)    (((x) & PECI_CTRL_CLK_DIV_MASK) >> 8)
>> +#define PECI_CTRL_INVERT_OUT        BIT(7)
>> +#define PECI_CTRL_INVERT_IN         BIT(6)
>> +#define PECI_CTRL_BUS_CONTENT_EN    BIT(5)
>> +#define PECI_CTRL_PECI_EN           BIT(4)
>> +#define PECI_CTRL_PECI_CLK_EN       BIT(0)
> 
> I know these come from the ASPEED sdk driver. Do we need them all?
> 

Not all of them are used, but I think it's better to keep them for future
bug fixes or improvements.
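
For readers unfamiliar with this macro pattern, here is a small
userspace sketch of how one of these fields is set and read back.
GENMASK32() is a stand-in written for this example (the kernel provides
GENMASK()), and ctrl_set_sampling() is a hypothetical helper. In kernel
code, FIELD_PREP() and FIELD_GET() from <linux/bitfield.h> may be an
alternative to per-field shift macros:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for GENMASK(), valid for 0 <= l <= h <= 31. */
#define GENMASK32(h, l) \
	((~UINT32_C(0) << (l)) & (~UINT32_C(0) >> (31 - (h))))

#define PECI_CTRL_SAMPLING_MASK   GENMASK32(19, 16)
#define PECI_CTRL_SAMPLING(x) \
	(((uint32_t)(x) << 16) & PECI_CTRL_SAMPLING_MASK)
#define PECI_CTRL_SAMPLING_GET(x) \
	(((x) & PECI_CTRL_SAMPLING_MASK) >> 16)

/* Replace only the sampling field, leaving the other control bits alone. */
static uint32_t ctrl_set_sampling(uint32_t ctrl, uint32_t point)
{
	return (ctrl & ~PECI_CTRL_SAMPLING_MASK) | PECI_CTRL_SAMPLING(point);
}
```

The mask in the setter macro also truncates out-of-range values, so a
sampling point above 15 cannot spill into neighbouring fields.
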

>> +
>> +/* AST_PECI_TIMING - 0x04 : Timing Negotiation Register */
>> +#define PECI_TIMING_MESSAGE_MASK   GENMASK(15, 8)
>> +#define PECI_TIMING_MESSAGE(x)     (((x) << 8) & PECI_TIMING_MESSAGE_MASK)
>> +#define PECI_TIMING_MESSAGE_GET(x) (((x) & PECI_TIMING_MESSAGE_MASK) >> 8)
>> +#define PECI_TIMING_ADDRESS_MASK   GENMASK(7, 0)
>> +#define PECI_TIMING_ADDRESS(x)     ((x) & PECI_TIMING_ADDRESS_MASK)
>> +#define PECI_TIMING_ADDRESS_GET(x) ((x) & PECI_TIMING_ADDRESS_MASK)
>> +
>> +/* AST_PECI_CMD - 0x08 : Command Register */
>> +#define PECI_CMD_PIN_MON    BIT(31)
>> +#define PECI_CMD_STS_MASK   GENMASK(27, 24)
>> +#define PECI_CMD_STS_GET(x) (((x) & PECI_CMD_STS_MASK) >> 24)
>> +#define PECI_CMD_FIRE       BIT(0)
>> +
>> +/* AST_PECI_CMD_CTRL - 0x0C : Read/Write Length Register */
>> +#define PECI_AW_FCS_EN       BIT(31)
>> +#define PECI_READ_LEN_MASK   GENMASK(23, 16)
>> +#define PECI_READ_LEN(x)     (((x) << 16) & PECI_READ_LEN_MASK)
>> +#define PECI_WRITE_LEN_MASK  GENMASK(15, 8)
>> +#define PECI_WRITE_LEN(x)    (((x) << 8) & PECI_WRITE_LEN_MASK)
>> +#define PECI_TAGET_ADDR_MASK GENMASK(7, 0)
>> +#define PECI_TAGET_ADDR(x)   ((x) & PECI_TAGET_ADDR_MASK)
>> +
>> +/* AST_PECI_EXP_FCS - 0x10 : Expected FCS Data Register */
>> +#define PECI_EXPECT_READ_FCS_MASK      GENMASK(23, 16)
>> +#define PECI_EXPECT_READ_FCS_GET(x)    (((x) & PECI_EXPECT_READ_FCS_MASK) >> 16)
>> +#define PECI_EXPECT_AW_FCS_AUTO_MASK   GENMASK(15, 8)
>> +#define PECI_EXPECT_AW_FCS_AUTO_GET(x) (((x) & PECI_EXPECT_AW_FCS_AUTO_MASK) \
>> +                                       >> 8)
>> +#define PECI_EXPECT_WRITE_FCS_MASK     GENMASK(7, 0)
>> +#define PECI_EXPECT_WRITE_FCS_GET(x)   ((x) & PECI_EXPECT_WRITE_FCS_MASK)
>> +
>> +/* AST_PECI_CAP_FCS - 0x14 : Captured FCS Data Register */
>> +#define PECI_CAPTURE_READ_FCS_MASK    GENMASK(23, 16)
>> +#define PECI_CAPTURE_READ_FCS_GET(x)  (((x) & PECI_CAPTURE_READ_FCS_MASK) >> 16)
>> +#define PECI_CAPTURE_WRITE_FCS_MASK   GENMASK(7, 0)
>> +#define PECI_CAPTURE_WRITE_FCS_GET(x) ((x) & PECI_CAPTURE_WRITE_FCS_MASK)
>> +
>> +/* AST_PECI_INT_CTRL/STS - 0x18/0x1c : Interrupt Register */
>> +#define PECI_INT_TIMING_RESULT_MASK GENMASK(31, 30)
>> +#define PECI_INT_TIMEOUT            BIT(4)
>> +#define PECI_INT_CONNECT            BIT(3)
>> +#define PECI_INT_W_FCS_BAD          BIT(2)
>> +#define PECI_INT_W_FCS_ABORT        BIT(1)
>> +#define PECI_INT_CMD_DONE           BIT(0)
>> +
>> +struct aspeed_peci {
>> +       struct peci_adapter     adaper;
>> +       struct device           *dev;
>> +       struct regmap           *regmap;
>> +       int                     irq;
>> +       struct completion       xfer_complete;
>> +       u32                     status;
>> +       u32                     cmd_timeout_ms;
>> +};
>> +
>> +#define PECI_INT_MASK  (PECI_INT_TIMEOUT | PECI_INT_CONNECT | \
>> +                       PECI_INT_W_FCS_BAD | PECI_INT_W_FCS_ABORT | \
>> +                       PECI_INT_CMD_DONE)
>> +
>> +#define PECI_IDLE_CHECK_TIMEOUT_MS      50
>> +#define PECI_IDLE_CHECK_INTERVAL_MS     10
>> +
>> +#define PECI_RD_SAMPLING_POINT_DEFAULT  8
>> +#define PECI_RD_SAMPLING_POINT_MAX      15
>> +#define PECI_CLK_DIV_DEFAULT            0
>> +#define PECI_CLK_DIV_MAX                7
>> +#define PECI_MSG_TIMING_NEGO_DEFAULT    1
>> +#define PECI_MSG_TIMING_NEGO_MAX        255
>> +#define PECI_ADDR_TIMING_NEGO_DEFAULT   1
>> +#define PECI_ADDR_TIMING_NEGO_MAX       255
>> +#define PECI_CMD_TIMEOUT_MS_DEFAULT     1000
>> +#define PECI_CMD_TIMEOUT_MS_MAX         60000
>> +
>> +static int aspeed_peci_xfer_native(struct aspeed_peci *priv,
>> +                                  struct peci_xfer_msg *msg)
>> +{
>> +       long err, timeout = msecs_to_jiffies(priv->cmd_timeout_ms);
>> +       u32 peci_head, peci_state, rx_data, cmd_sts;
>> +       ktime_t start, end;
>> +       s64 elapsed_ms;
>> +       int i, rc = 0;
>> +       uint reg;
>> +
>> +       start = ktime_get();
>> +
>> +       /* Check command sts and bus idle state */
>> +       while (!regmap_read(priv->regmap, AST_PECI_CMD, &cmd_sts) &&
>> +              (cmd_sts & (PECI_CMD_STS_MASK | PECI_CMD_PIN_MON))) {
>> +               end = ktime_get();
>> +               elapsed_ms = ktime_to_ms(ktime_sub(end, start));
>> +               if (elapsed_ms >= PECI_IDLE_CHECK_TIMEOUT_MS) {
>> +                       dev_dbg(priv->dev, "Timeout waiting for idle state!\n");
>> +                       return -ETIMEDOUT;
>> +               }
>> +
>> +               usleep_range(PECI_IDLE_CHECK_INTERVAL_MS * 1000,
>> +                            (PECI_IDLE_CHECK_INTERVAL_MS * 1000) + 1000);
>> +       };
> 
> Could the above use regmap_read_poll_timeout instead?
> 

Yes, that would be better. I'll rewrite it.
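
The kernel macro takes the form regmap_read_poll_timeout(map, addr, val,
cond, sleep_us, timeout_us). Since that cannot be run outside the kernel,
the sketch below only models the same shape in userspace: read, test a
condition, give up after a bound. It counts polls instead of measuring
time, and all names in it are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register accessor: returns 0 on success and fills *val. */
typedef int (*reg_read_fn)(void *ctx, uint32_t *val);

/*
 * Poll until (val & busy_mask) == 0, giving up after max_polls reads.
 * Returns 0 when idle, the accessor's error code if a read fails,
 * or -1 when the poll budget is exhausted.
 */
static int poll_until_idle(reg_read_fn read, void *ctx, uint32_t busy_mask,
			   unsigned int max_polls, uint32_t *val)
{
	unsigned int n;
	int rc;

	for (n = 0; n < max_polls; n++) {
		rc = read(ctx, val);
		if (rc)
			return rc;
		if (!(*val & busy_mask))
			return 0;
		/* A real driver would sleep here between reads. */
	}

	return -1;
}

/* Test double: reports "busy" for the first *ctx reads, then idle. */
static int fake_read(void *ctx, uint32_t *val)
{
	unsigned int *busy_reads = ctx;

	if (*busy_reads > 0) {
		(*busy_reads)--;
		*val = 0x80000000u;
	} else {
		*val = 0;
	}

	return 0;
}
```

One difference from the open-coded loop worth noting: here a failed read
is reported to the caller, whereas the original while condition simply
ends the loop when regmap_read() fails and then carries on.
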

>> +
>> +       reinit_completion(&priv->xfer_complete);
>> +
>> +       peci_head = PECI_TAGET_ADDR(msg->addr) |
>> +                                   PECI_WRITE_LEN(msg->tx_len) |
>> +                                   PECI_READ_LEN(msg->rx_len);
>> +
>> +       rc = regmap_write(priv->regmap, AST_PECI_CMD_CTRL, peci_head);
>> +       if (rc)
>> +               return rc;
>> +
>> +       for (i = 0; i < msg->tx_len; i += 4) {
>> +               reg = i < 16 ? AST_PECI_W_DATA0 + i % 16 :
>> +                              AST_PECI_W_DATA4 + i % 16;
>> +               rc = regmap_write(priv->regmap, reg,
>> +                                 (msg->tx_buf[i + 3] << 24) |
>> +                                 (msg->tx_buf[i + 2] << 16) |
>> +                                 (msg->tx_buf[i + 1] << 8) |
>> +                                 msg->tx_buf[i + 0]);
> 
> That looks like an endian swap. Can we do something like this?
> 
>   regmap_write(map, reg, cpu_to_be32p((void *)msg->tx_buff))
> 

Yes, it could be simplified like you pointed out. Will change it.
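
One detail to check when simplifying: the shifts in the patch place
tx_buf[i + 0] in the least significant byte, which is a little-endian
load. If the current behaviour is to be kept, a little-endian helper such
as get_unaligned_le32() may be the closer match than a big-endian
conversion. This should be verified on hardware. A userspace sketch of
the packing and of the register selection, with hypothetical helper
names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define W_DATA0 0x20 /* AST_PECI_W_DATA0 */
#define W_DATA4 0x40 /* AST_PECI_W_DATA4 */

/* Byte 0 lands in bits 7:0, so this is a little-endian 32-bit load. */
static uint32_t pack_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Register offset for the word that starts at byte index i (i % 4 == 0). */
static unsigned int tx_reg_for_byte(size_t i)
{
	return (i < 16 ? W_DATA0 : W_DATA4) + (unsigned int)(i % 16);
}
```

For the bytes 0x11 0x22 0x33 0x44 this yields 0x44332211, whereas a
big-endian load of the same bytes would give 0x11223344.
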

>> +               if (rc)
>> +                       return rc;
>> +       }
>> +
>> +       dev_dbg(priv->dev, "HEAD : 0x%08x\n", peci_head);
>> +#if DUMP_DEBUG
> 
> Having #defines is frowned upon. I think print_hex_dump_debug will do
> what you want here.
> 

Got it. I'll replace it with print_hex_dump_debug() after removing the 
define.

>> +       print_hex_dump(KERN_DEBUG, "TX : ", DUMP_PREFIX_NONE, 16, 1,
>> +                      msg->tx_buf, msg->tx_len, true);
>> +#endif
>> +
>> +       rc = regmap_write(priv->regmap, AST_PECI_CMD, PECI_CMD_FIRE);
>> +       if (rc)
>> +               return rc;
>> +
>> +       err = wait_for_completion_interruptible_timeout(&priv->xfer_complete,
>> +                                                       timeout);
>> +
>> +       dev_dbg(priv->dev, "INT_STS : 0x%08x\n", priv->status);
>> +       if (!regmap_read(priv->regmap, AST_PECI_CMD, &peci_state))
>> +               dev_dbg(priv->dev, "PECI_STATE : 0x%lx\n",
>> +                       PECI_CMD_STS_GET(peci_state));
>> +       else
>> +               dev_dbg(priv->dev, "PECI_STATE : read error\n");
>> +
>> +       rc = regmap_write(priv->regmap, AST_PECI_CMD, 0);
>> +       if (rc)
>> +               return rc;
>> +
>> +       if (err <= 0 || !(priv->status & PECI_INT_CMD_DONE)) {
>> +               if (err < 0) { /* -ERESTARTSYS */
>> +                       return (int)err;
>> +               } else if (err == 0) {
>> +                       dev_dbg(priv->dev, "Timeout waiting for a response!\n");
>> +                       return -ETIMEDOUT;
>> +               }
>> +
>> +               dev_dbg(priv->dev, "No valid response!\n");
>> +               return -EIO;
>> +       }
>> +
>> +       for (i = 0; i < msg->rx_len; i++) {
>> +               u8 byte_offset = i % 4;
>> +
>> +               if (byte_offset == 0) {
>> +                       reg = i < 16 ? AST_PECI_R_DATA0 + i % 16 :
>> +                                      AST_PECI_R_DATA4 + i % 16;
> 
> I find this hard to read. Use a few more lines to make it clear what
> your code is doing.
> 
> Actually, the entire for loop is cryptic. I understand what it's doing
> now. Can you rework it to make it more readable? You follow a similar
> pattern above in the write case.
> 

The intention was to make the loop run only up to rx_len, but it's not 
efficient. I'll rewrite it as you suggested.
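
One possible shape for a more readable read loop, as a userspace sketch
only: regs[] stands in for the values already read from the read-data
registers, and peci_unpack_rx() is a hypothetical name. It separates "which
register" from "which byte within it" instead of folding both into one
expression.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy rx_len bytes out of an array of 32-bit register values, least
 * significant byte first. Hypothetical helper, for illustration only.
 */
static void peci_unpack_rx(const uint32_t *regs, uint8_t *rx_buf,
                           size_t rx_len)
{
    size_t i;

    assert(regs != NULL && rx_buf != NULL);

    for (i = 0; i < rx_len; i++) {
        size_t word = i / 4;              /* which 32-bit register */
        unsigned int shift = (i % 4) * 8; /* which byte within it */

        rx_buf[i] = (uint8_t)(regs[word] >> shift);
    }
}
```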

>> +                       rc = regmap_read(priv->regmap, reg, &rx_data);
>> +                       if (rc)
>> +                               return rc;
>> +               }
>> +
>> +               msg->rx_buf[i] = (u8)(rx_data >> (byte_offset << 3));
>> +       }
>> +
>> +#if DUMP_DEBUG
>> +       print_hex_dump(KERN_DEBUG, "RX : ", DUMP_PREFIX_NONE, 16, 1,
>> +                      msg->rx_buf, msg->rx_len, true);
>> +#endif
>> +       if (!regmap_read(priv->regmap, AST_PECI_CMD, &peci_state))
>> +               dev_dbg(priv->dev, "PECI_STATE : 0x%lx\n",
>> +                       PECI_CMD_STS_GET(peci_state));
>> +       else
>> +               dev_dbg(priv->dev, "PECI_STATE : read error\n");
> 
> Given the regmap_read is always going to be a memory read on the
> aspeed, I can't think of a situation where the read will fail.
> 
> On that note, is there a reason you are using regmap and not just
> accessing the hardware directly? regmap imposes a number of pointer
> lookups and tests each time you do a read or write.
> 

No specific reason. regmap adds some overhead as you mentioned, but it 
also provides some advantages: simpler access, endianness handling and 
register dumps at run time. I won't insist on using regmap if you prefer 
raw readl and writel. Do you?

>> +       dev_dbg(priv->dev, "------------------------\n");
>> +
>> +       return rc;
>> +}
>> +
>> +static irqreturn_t aspeed_peci_irq_handler(int irq, void *arg)
>> +{
>> +       struct aspeed_peci *priv = arg;
>> +       u32 status_ack = 0;
>> +
>> +       if (regmap_read(priv->regmap, AST_PECI_INT_STS, &priv->status))
>> +               return IRQ_NONE;
> 
> Again, a memory mapped read won't fail. How about we check that the
> regmap is working once in your _probe() function, and assume it will
> continue working from there (or remove the regmap abstraction all
> together).
> 

You are right. I'll keep this check only in the _probe() function and 
remove all redundant error checking code on memory mapped IO.

>> +
>> +       /* Be noted that multiple interrupt bits can be set at the same time */
>> +       if (priv->status & PECI_INT_TIMEOUT) {
>> +               dev_dbg(priv->dev, "PECI_INT_TIMEOUT\n");
>> +               status_ack |= PECI_INT_TIMEOUT;
>> +       }
>> +
>> +       if (priv->status & PECI_INT_CONNECT) {
>> +               dev_dbg(priv->dev, "PECI_INT_CONNECT\n");
>> +               status_ack |= PECI_INT_CONNECT;
>> +       }
>> +
>> +       if (priv->status & PECI_INT_W_FCS_BAD) {
>> +               dev_dbg(priv->dev, "PECI_INT_W_FCS_BAD\n");
>> +               status_ack |= PECI_INT_W_FCS_BAD;
>> +       }
>> +
>> +       if (priv->status & PECI_INT_W_FCS_ABORT) {
>> +               dev_dbg(priv->dev, "PECI_INT_W_FCS_ABORT\n");
>> +               status_ack |= PECI_INT_W_FCS_ABORT;
>> +       }
> 
> All of this code is for debugging only. Do you want to put it behind
> some kind of conditional?
> 

This code also updates the status_ack variable, which is written back to 
acknowledge each interrupt bit.

>> +
>> +       /**
>> +        * All commands should be ended up with a PECI_INT_CMD_DONE bit set
>> +        * even in an error case.
>> +        */
>> +       if (priv->status & PECI_INT_CMD_DONE) {
>> +               dev_dbg(priv->dev, "PECI_INT_CMD_DONE\n");
>> +               status_ack |= PECI_INT_CMD_DONE;
>> +               complete(&priv->xfer_complete);
>> +       }
>> +
>> +       if (regmap_write(priv->regmap, AST_PECI_INT_STS, status_ack))
>> +               return IRQ_NONE;
>> +
>> +       return IRQ_HANDLED;
>> +}
>> +
>> +static int aspeed_peci_init_ctrl(struct aspeed_peci *priv)
>> +{
>> +       u32 msg_timing_nego, addr_timing_nego, rd_sampling_point;
>> +       u32 clk_freq, clk_divisor, clk_div_val = 0;
>> +       struct clk *clkin;
>> +       int ret;
>> +
>> +       clkin = devm_clk_get(priv->dev, NULL);
>> +       if (IS_ERR(clkin)) {
>> +               dev_err(priv->dev, "Failed to get clk source.\n");
>> +               return PTR_ERR(clkin);
>> +       }
>> +
>> +       ret = of_property_read_u32(priv->dev->of_node, "clock-frequency",
>> +                                  &clk_freq);
>> +       if (ret < 0) {
>> +               dev_err(priv->dev,
>> +                       "Could not read clock-frequency property.\n");
>> +               return ret;
>> +       }
>> +
>> +       clk_divisor = clk_get_rate(clkin) / clk_freq;
>> +       devm_clk_put(priv->dev, clkin);
>> +
>> +       while ((clk_divisor >> 1) && (clk_div_val < PECI_CLK_DIV_MAX))
>> +               clk_div_val++;
> 
> We have a framework for doing clocks in the kernel. Would it make
> sense to write a driver for this clock and add it to
> drivers/clk/clk-aspeed.c?
> 

Unlike other HW modules, PECI uses the 24MHz external clock as its clock 
source. Should it use clk-aspeed.c in this case?
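
A side note on the divisor loop quoted above: as written it never changes
clk_divisor, so whenever clk_divisor is 2 or more it runs until clk_div_val
reaches PECI_CLK_DIV_MAX. If the intent is a capped floor(log2(divisor)),
which is an assumption on my part, a minimal sketch could look like this.
peci_clk_div_val() and its div_max parameter are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return floor(log2(clk_divisor)), capped at div_max. A divisor of 0 or 1
 * yields 0. Hypothetical helper, for illustration only.
 */
static uint32_t peci_clk_div_val(uint32_t clk_divisor, uint32_t div_max)
{
    uint32_t val = 0;

    while ((clk_divisor >> 1) && val < div_max) {
        clk_divisor >>= 1; /* the step missing from the quoted loop */
        val++;
    }

    return val;
}
```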

>> +
>> +       ret = of_property_read_u32(priv->dev->of_node, "msg-timing-nego",
>> +                                  &msg_timing_nego);
>> +       if (ret || msg_timing_nego > PECI_MSG_TIMING_NEGO_MAX) {
>> +               dev_warn(priv->dev,
>> +                        "Invalid msg-timing-nego : %u, Use default : %u\n",
>> +                        msg_timing_nego, PECI_MSG_TIMING_NEGO_DEFAULT);
> 
> The property is optional so I suggest we don't print a message if it's
> not present. We certainly don't want to print a message saying
> "invalid".
> 
> The same comment applies to the other optional properties below.
> 

Agreed. I'll make it print out the message only when ret == 0 and 
msg_timing_nego > PECI_MSG_TIMING_NEGO_MAX.
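
The rule agreed on here (stay silent when the property is absent, warn only
when it is present but out of range) can be sketched in userspace as below.
The ret argument mimics the return value of a property read such as
of_property_read_u32(): 0 when present, non-zero when absent.
peci_optional_prop() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Resolve an optional property to its final value. *warn is set only when
 * a value was present but exceeded max. Hypothetical helper, for
 * illustration only.
 */
static uint32_t peci_optional_prop(int ret, uint32_t val, uint32_t max,
                                   uint32_t def, bool *warn)
{
    assert(warn != NULL);

    *warn = false;
    if (ret)
        return def;   /* absent: silently use the default */

    if (val > max) {
        *warn = true; /* present but invalid: worth a message */
        return def;
    }

    return val;
}
```

The caller would print the warning only when *warn is true, so a missing
optional property produces no log output at all.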

>> +               msg_timing_nego = PECI_MSG_TIMING_NEGO_DEFAULT;
>> +       }
>> +
>> +       ret = of_property_read_u32(priv->dev->of_node, "addr-timing-nego",
>> +                                  &addr_timing_nego);
>> +       if (ret || addr_timing_nego > PECI_ADDR_TIMING_NEGO_MAX) {
>> +               dev_warn(priv->dev,
>> +                        "Invalid addr-timing-nego : %u, Use default : %u\n",
>> +                        addr_timing_nego, PECI_ADDR_TIMING_NEGO_DEFAULT);
>> +               addr_timing_nego = PECI_ADDR_TIMING_NEGO_DEFAULT;
>> +       }
>> +
>> +       ret = of_property_read_u32(priv->dev->of_node, "rd-sampling-point",
>> +                                  &rd_sampling_point);
>> +       if (ret || rd_sampling_point > PECI_RD_SAMPLING_POINT_MAX) {
>> +               dev_warn(priv->dev,
>> +                        "Invalid rd-sampling-point : %u. Use default : %u\n",
>> +                        rd_sampling_point,
>> +                        PECI_RD_SAMPLING_POINT_DEFAULT);
>> +               rd_sampling_point = PECI_RD_SAMPLING_POINT_DEFAULT;
>> +       }
>> +
>> +       ret = of_property_read_u32(priv->dev->of_node, "cmd-timeout-ms",
>> +                                  &priv->cmd_timeout_ms);
>> +       if (ret || priv->cmd_timeout_ms > PECI_CMD_TIMEOUT_MS_MAX ||
>> +           priv->cmd_timeout_ms == 0) {
>> +               dev_warn(priv->dev,
>> +                        "Invalid cmd-timeout-ms : %u. Use default : %u\n",
>> +                        priv->cmd_timeout_ms,
>> +                        PECI_CMD_TIMEOUT_MS_DEFAULT);
>> +               priv->cmd_timeout_ms = PECI_CMD_TIMEOUT_MS_DEFAULT;
>> +       }
>> +
>> +       ret = regmap_write(priv->regmap, AST_PECI_CTRL,
>> +                          PECI_CTRL_CLK_DIV(PECI_CLK_DIV_DEFAULT) |
>> +                          PECI_CTRL_PECI_CLK_EN);
>> +       if (ret)
>> +               return ret;
>> +
>> +       usleep_range(1000, 5000);
> 
> Can we probe in parallel? If not, putting a sleep in the _probe will
> hold up the rest of drivers from being able to do anything, and hold
> up boot.
> 
> If you decide that you do need to probe here, please add a comment.
> (This is the wait for the clock to be stable?)
> 

I'll test it again and will remove it if it is not necessary.

>> +
>> +       /**
>> +        * Timing negotiation period setting.
>> +        * The unit of the programmed value is 4 times of PECI clock period.
>> +        */
>> +       ret = regmap_write(priv->regmap, AST_PECI_TIMING,
>> +                          PECI_TIMING_MESSAGE(msg_timing_nego) |
>> +                          PECI_TIMING_ADDRESS(addr_timing_nego));
>> +       if (ret)
>> +               return ret;
>> +
>> +       /* Clear interrupts */
>> +       ret = regmap_write(priv->regmap, AST_PECI_INT_STS, PECI_INT_MASK);
>> +       if (ret)
>> +               return ret;
>> +
>> +       /* Enable interrupts */
>> +       ret = regmap_write(priv->regmap, AST_PECI_INT_CTRL, PECI_INT_MASK);
>> +       if (ret)
>> +               return ret;
>> +
>> +       /* Read sampling point and clock speed setting */
>> +       ret = regmap_write(priv->regmap, AST_PECI_CTRL,
>> +                          PECI_CTRL_SAMPLING(rd_sampling_point) |
>> +                          PECI_CTRL_CLK_DIV(clk_div_val) |
>> +                          PECI_CTRL_PECI_EN | PECI_CTRL_PECI_CLK_EN);
>> +       if (ret)
>> +               return ret;
>> +
>> +       return 0;
>> +}
>> +
>> +static const struct regmap_config aspeed_peci_regmap_config = {
>> +       .reg_bits = 32,
>> +       .val_bits = 32,
>> +       .reg_stride = 4,
>> +       .max_register = AST_PECI_R_DATA7,
>> +       .val_format_endian = REGMAP_ENDIAN_LITTLE,
>> +       .fast_io = true,
>> +};
>> +
>> +static int aspeed_peci_xfer(struct peci_adapter *adaper,
>> +                           struct peci_xfer_msg *msg)
>> +{
>> +       struct aspeed_peci *priv = peci_get_adapdata(adaper);
>> +
>> +       return aspeed_peci_xfer_native(priv, msg);
>> +}
>> +
>> +static int aspeed_peci_probe(struct platform_device *pdev)
>> +{
>> +       struct aspeed_peci *priv;
>> +       struct resource *res;
>> +       void __iomem *base;
>> +       int ret = 0;
>> +
>> +       priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
>> +       if (!priv)
>> +               return -ENOMEM;
>> +
>> +       dev_set_drvdata(&pdev->dev, priv);
>> +       priv->dev = &pdev->dev;
>> +
>> +       res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
>> +       base = devm_ioremap_resource(&pdev->dev, res);
>> +       if (IS_ERR(base))
>> +               return PTR_ERR(base);
>> +
>> +       priv->regmap = devm_regmap_init_mmio(&pdev->dev, base,
>> +                                            &aspeed_peci_regmap_config);
>> +       if (IS_ERR(priv->regmap))
>> +               return PTR_ERR(priv->regmap);
>> +
>> +       priv->irq = platform_get_irq(pdev, 0);
>> +       if (!priv->irq)
>> +               return -ENODEV;
>> +
>> +       ret = devm_request_irq(&pdev->dev, priv->irq, aspeed_peci_irq_handler,
>> +                              IRQF_SHARED,
> 
> This interrupt is only for the peci device. Why is it marked as shared?
> 

You are right. I'll remove the flag.

>> +                              "peci-aspeed-irq",
>> +                              priv);
>> +       if (ret < 0)
>> +               return ret;
>> +
>> +       init_completion(&priv->xfer_complete);
>> +
>> +       priv->adaper.dev.parent = priv->dev;
>> +       priv->adaper.dev.of_node = of_node_get(dev_of_node(priv->dev));
>> +       strlcpy(priv->adaper.name, pdev->name, sizeof(priv->adaper.name));
>> +       priv->adaper.xfer = aspeed_peci_xfer;
>> +       peci_set_adapdata(&priv->adaper, priv);
>> +
>> +       ret = aspeed_peci_init_ctrl(priv);
>> +       if (ret < 0)
>> +               return ret;
>> +
>> +       ret = peci_add_adapter(&priv->adaper);
>> +       if (ret < 0)
>> +               return ret;
>> +
>> +       dev_info(&pdev->dev, "peci bus %d registered, irq %d\n",
>> +                priv->adaper.nr, priv->irq);
>> +
>> +       return 0;
>> +}
>> +
>> +static int aspeed_peci_remove(struct platform_device *pdev)
>> +{
>> +       struct aspeed_peci *priv = dev_get_drvdata(&pdev->dev);
>> +
>> +       peci_del_adapter(&priv->adaper);
>> +       of_node_put(priv->adaper.dev.of_node);
>> +
>> +       return 0;
>> +}
>> +
>> +static const struct of_device_id aspeed_peci_of_table[] = {
>> +       { .compatible = "aspeed,ast2400-peci", },
>> +       { .compatible = "aspeed,ast2500-peci", },
>> +       { }
>> +};
>> +MODULE_DEVICE_TABLE(of, aspeed_peci_of_table);
>> +
>> +static struct platform_driver aspeed_peci_driver = {
>> +       .probe  = aspeed_peci_probe,
>> +       .remove = aspeed_peci_remove,
>> +       .driver = {
>> +               .name           = "peci-aspeed",
>> +               .of_match_table = of_match_ptr(aspeed_peci_of_table),
>> +       },
>> +};
>> +module_platform_driver(aspeed_peci_driver);
>> +
>> +MODULE_AUTHOR("Ryan Chen <ryan_chen@aspeedtech.com>");
>> +MODULE_AUTHOR("Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>");
>> +MODULE_DESCRIPTION("Aspeed PECI driver");
>> +MODULE_LICENSE("GPL v2");
>> --
>> 2.16.2
>>


* Re: [PATCH v3 09/10] drivers/hwmon: Add PECI hwmon client drivers
From: Guenter Roeck @ 2018-04-12  0:34 UTC (permalink / raw)
  To: Jae Hyun Yoo
  Cc: Alan Cox, Andrew Jeffery, Andrew Lunn, Andy Shevchenko,
	Arnd Bergmann, Benjamin Herrenschmidt, Fengguang Wu, Greg KH,
	Haiyue Wang, James Feist, Jason M Biils, Jean Delvare,
	Joel Stanley, Julia Cartwright, Miguel Ojeda, Milton Miller II,
	Pavel Machek, Randy Dunlap, Stef van Os, Sumeet R Pawnikar,
	Vernon Mauery, linux-kernel, linux-doc, devicetree, linux-hwmon,
	linux-arm-kernel, openbmc
In-Reply-To: <637d7812-e567-0bc2-4f08-fbdca0ee85d8@linux.intel.com>

On 04/11/2018 02:59 PM, Jae Hyun Yoo wrote:
> Hi Guenter,
> 
> Thanks a lot for sharing your time. Please see my inline answers.
> 
> On 4/10/2018 3:28 PM, Guenter Roeck wrote:
>> On Tue, Apr 10, 2018 at 11:32:11AM -0700, Jae Hyun Yoo wrote:
>>> This commit adds PECI cputemp and dimmtemp hwmon drivers.
>>>
>>> Signed-off-by: Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>
>>> Reviewed-by: Haiyue Wang <haiyue.wang@linux.intel.com>
>>> Reviewed-by: James Feist <james.feist@linux.intel.com>
>>> Reviewed-by: Vernon Mauery <vernon.mauery@linux.intel.com>
>>> Cc: Alan Cox <alan@linux.intel.com>
>>> Cc: Andrew Jeffery <andrew@aj.id.au>
>>> Cc: Andrew Lunn <andrew@lunn.ch>
>>> Cc: Andy Shevchenko <andriy.shevchenko@intel.com>
>>> Cc: Arnd Bergmann <arnd@arndb.de>
>>> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>>> Cc: Fengguang Wu <fengguang.wu@intel.com>
>>> Cc: Greg KH <gregkh@linuxfoundation.org>
>>> Cc: Guenter Roeck <linux@roeck-us.net>
>>> Cc: Jason M Biils <jason.m.bills@linux.intel.com>
>>> Cc: Jean Delvare <jdelvare@suse.com>
>>> Cc: Joel Stanley <joel@jms.id.au>
>>> Cc: Julia Cartwright <juliac@eso.teric.us>
>>> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
>>> Cc: Milton Miller II <miltonm@us.ibm.com>
>>> Cc: Pavel Machek <pavel@ucw.cz>
>>> Cc: Randy Dunlap <rdunlap@infradead.org>
>>> Cc: Stef van Os <stef.van.os@prodrive-technologies.com>
>>> Cc: Sumeet R Pawnikar <sumeet.r.pawnikar@intel.com>
>>> ---
>>>   drivers/hwmon/Kconfig         |  28 ++
>>>   drivers/hwmon/Makefile        |   2 +
>>>   drivers/hwmon/peci-cputemp.c  | 783 ++++++++++++++++++++++++++++++++++++++++++
>>>   drivers/hwmon/peci-dimmtemp.c | 432 +++++++++++++++++++++++
>>>   4 files changed, 1245 insertions(+)
>>>   create mode 100644 drivers/hwmon/peci-cputemp.c
>>>   create mode 100644 drivers/hwmon/peci-dimmtemp.c
>>>
>>> diff --git a/drivers/hwmon/Kconfig b/drivers/hwmon/Kconfig
>>> index f249a4428458..c52f610f81d0 100644
>>> --- a/drivers/hwmon/Kconfig
>>> +++ b/drivers/hwmon/Kconfig
>>> @@ -1259,6 +1259,34 @@ config SENSORS_NCT7904
>>>         This driver can also be built as a module.  If so, the module
>>>         will be called nct7904.
>>> +config SENSORS_PECI_CPUTEMP
>>> +    tristate "PECI CPU temperature monitoring support"
>>> +    depends on OF
>>> +    depends on PECI
>>> +    help
>>> +      If you say yes here you get support for the generic Intel PECI
>>> +      cputemp driver which provides Digital Thermal Sensor (DTS) thermal
>>> +      readings of the CPU package and CPU cores that are accessible using
>>> +      the PECI Client Command Suite via the processor PECI client.
>>> +      Check Documentation/hwmon/peci-cputemp for details.
>>> +
>>> +      This driver can also be built as a module.  If so, the module
>>> +      will be called peci-cputemp.
>>> +
>>> +config SENSORS_PECI_DIMMTEMP
>>> +    tristate "PECI DIMM temperature monitoring support"
>>> +    depends on OF
>>> +    depends on PECI
>>> +    help
>>> +      If you say yes here you get support for the generic Intel PECI hwmon
>>> +      driver which provides Digital Thermal Sensor (DTS) thermal readings of
>>> +      DIMM components that are accessible using the PECI Client Command
>>> +      Suite via the processor PECI client.
>>> +      Check Documentation/hwmon/peci-dimmtemp for details.
>>> +
>>> +      This driver can also be built as a module.  If so, the module
>>> +      will be called peci-dimmtemp.
>>> +
>>>   config SENSORS_NSA320
>>>       tristate "ZyXEL NSA320 and compatible fan speed and temperature sensors"
>>>       depends on GPIOLIB && OF
>>> diff --git a/drivers/hwmon/Makefile b/drivers/hwmon/Makefile
>>> index e7d52a36e6c4..48d9598fcd3a 100644
>>> --- a/drivers/hwmon/Makefile
>>> +++ b/drivers/hwmon/Makefile
>>> @@ -136,6 +136,8 @@ obj-$(CONFIG_SENSORS_NCT7802)    += nct7802.o
>>>   obj-$(CONFIG_SENSORS_NCT7904)    += nct7904.o
>>>   obj-$(CONFIG_SENSORS_NSA320)    += nsa320-hwmon.o
>>>   obj-$(CONFIG_SENSORS_NTC_THERMISTOR)    += ntc_thermistor.o
>>> +obj-$(CONFIG_SENSORS_PECI_CPUTEMP)    += peci-cputemp.o
>>> +obj-$(CONFIG_SENSORS_PECI_DIMMTEMP)    += peci-dimmtemp.o
>>>   obj-$(CONFIG_SENSORS_PC87360)    += pc87360.o
>>>   obj-$(CONFIG_SENSORS_PC87427)    += pc87427.o
>>>   obj-$(CONFIG_SENSORS_PCF8591)    += pcf8591.o
>>> diff --git a/drivers/hwmon/peci-cputemp.c b/drivers/hwmon/peci-cputemp.c
>>> new file mode 100644
>>> index 000000000000..f0bc92687512
>>> --- /dev/null
>>> +++ b/drivers/hwmon/peci-cputemp.c
>>> @@ -0,0 +1,783 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +// Copyright (c) 2018 Intel Corporation
>>> +
>>> +#include <linux/delay.h>
>>> +#include <linux/hwmon.h>
>>> +#include <linux/hwmon-sysfs.h>
>>
>> Is this include needed ?
>>
> 
> No it isn't. Will drop the line.
> 
>>> +#include <linux/jiffies.h>
>>> +#include <linux/module.h>
>>> +#include <linux/of_device.h>
>>> +#include <linux/peci.h>
>>> +
>>> +#define TEMP_TYPE_PECI        6  /* Sensor type 6: Intel PECI */
>>> +
>>> +#define CORE_MAX_ON_HSX       18 /* Max number of cores on Haswell */
>>> +#define CORE_MAX_ON_BDX       24 /* Max number of cores on Broadwell */
>>> +#define CORE_MAX_ON_SKX       28 /* Max number of cores on Skylake */
>>> +
>>> +#define DEFAULT_CHANNEL_NUMS  5
>>> +#define CORETEMP_CHANNEL_NUMS CORE_MAX_ON_SKX
>>> +#define CPUTEMP_CHANNEL_NUMS  (DEFAULT_CHANNEL_NUMS + CORETEMP_CHANNEL_NUMS)
>>> +
>>> +#define CLIENT_CPU_ID_MASK    0xf0ff0  /* Mask for Family / Model info */
>>> +
>>> +#define UPDATE_INTERVAL_MIN   HZ
>>> +
>>> +enum cpu_gens {
>>> +    CPU_GEN_HSX, /* Haswell Xeon */
>>> +    CPU_GEN_BRX, /* Broadwell Xeon */
>>> +    CPU_GEN_SKX, /* Skylake Xeon */
>>> +    CPU_GEN_MAX
>>> +};
>>> +
>>> +struct cpu_gen_info {
>>> +    u32 type;
>>> +    u32 cpu_id;
>>> +    u32 core_max;
>>> +};
>>> +
>>> +struct temp_data {
>>> +    bool valid;
>>> +    s32  value;
>>> +    unsigned long last_updated;
>>> +};
>>> +
>>> +struct temp_group {
>>> +    struct temp_data die;
>>> +    struct temp_data dts_margin;
>>> +    struct temp_data tcontrol;
>>> +    struct temp_data tthrottle;
>>> +    struct temp_data tjmax;
>>> +    struct temp_data core[CORETEMP_CHANNEL_NUMS];
>>> +};
>>> +
>>> +struct peci_cputemp {
>>> +    struct peci_client *client;
>>> +    struct device *dev;
>>> +    char name[PECI_NAME_SIZE];
>>> +    struct temp_group temp;
>>> +    u8 addr;
>>> +    uint cpu_no;
>>> +    const struct cpu_gen_info *gen_info;
>>> +    u32 core_mask;
>>> +    u32 temp_config[CPUTEMP_CHANNEL_NUMS + 1];
>>> +    uint config_idx;
>>> +    struct hwmon_channel_info temp_info;
>>> +    const struct hwmon_channel_info *info[2];
>>> +    struct hwmon_chip_info chip;
>>> +};
>>> +
>>> +enum cputemp_channels {
>>> +    channel_die,
>>> +    channel_dts_mrgn,
>>> +    channel_tcontrol,
>>> +    channel_tthrottle,
>>> +    channel_tjmax,
>>> +    channel_core,
>>> +};
>>> +
>>> +static const struct cpu_gen_info cpu_gen_info_table[] = {
>>> +    { .type = CPU_GEN_HSX,
>>> +      .cpu_id = 0x306f0, /* Family code: 6, Model number: 63 (0x3f) */
>>> +      .core_max = CORE_MAX_ON_HSX },
>>> +    { .type = CPU_GEN_BRX,
>>> +      .cpu_id = 0x406f0, /* Family code: 6, Model number: 79 (0x4f) */
>>> +      .core_max = CORE_MAX_ON_BDX },
>>> +    { .type = CPU_GEN_SKX,
>>> +      .cpu_id = 0x50650, /* Family code: 6, Model number: 85 (0x55) */
>>> +      .core_max = CORE_MAX_ON_SKX },
>>> +};
>>> +
>>> +static const u32 config_table[DEFAULT_CHANNEL_NUMS + 1] = {
>>> +    /* Die temperature */
>>> +    HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_CRIT |
>>> +    HWMON_T_CRIT_HYST,
>>> +
>>> +    /* DTS margin temperature */
>>> +    HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_MIN | HWMON_T_LCRIT,
>>> +
>>> +    /* Tcontrol temperature */
>>> +    HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_CRIT,
>>> +
>>> +    /* Tthrottle temperature */
>>> +    HWMON_T_LABEL | HWMON_T_INPUT,
>>> +
>>> +    /* Tjmax temperature */
>>> +    HWMON_T_LABEL | HWMON_T_INPUT,
>>> +
>>> +    /* Core temperature - for all core channels */
>>> +    HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_CRIT |
>>> +    HWMON_T_CRIT_HYST,
>>> +};
>>> +
>>> +static const char *cputemp_label[CPUTEMP_CHANNEL_NUMS] = {
>>> +    "Die",
>>> +    "DTS margin",
>>> +    "Tcontrol",
>>> +    "Tthrottle",
>>> +    "Tjmax",
>>> +    "Core 0", "Core 1", "Core 2", "Core 3",
>>> +    "Core 4", "Core 5", "Core 6", "Core 7",
>>> +    "Core 8", "Core 9", "Core 10", "Core 11",
>>> +    "Core 12", "Core 13", "Core 14", "Core 15",
>>> +    "Core 16", "Core 17", "Core 18", "Core 19",
>>> +    "Core 20", "Core 21", "Core 22", "Core 23",
>>> +};
>>> +
>>> +static int send_peci_cmd(struct peci_cputemp *priv,
>>> +             enum peci_cmd cmd,
>>> +             void *msg)
>>> +{
>>> +    return peci_command(priv->client->adapter, cmd, msg);
>>> +}
>>> +
>>> +static int need_update(struct temp_data *temp)
>>
>> Please use bool.
>>
> 
> Okay. I'll use bool instead of int.
> 
>>> +{
>>> +    if (temp->valid &&
>>> +        time_before(jiffies, temp->last_updated + UPDATE_INTERVAL_MIN))
>>> +        return 0;
>>> +
>>> +    return 1;
>>> +}
>>> +
>>> +static void mark_updated(struct temp_data *temp)
>>> +{
>>> +    temp->valid = true;
>>> +    temp->last_updated = jiffies;
>>> +}
>>> +
>>> +static s32 ten_dot_six_to_millidegree(s32 val)
>>> +{
>>> +    return ((val ^ 0x8000) - 0x8000) * 1000 / 64;
>>> +}
>>> +
>>> +static int get_tjmax(struct peci_cputemp *priv)
>>> +{
>>> +    struct peci_rd_pkg_cfg_msg msg;
>>> +    int rc;
>>> +
>>> +    if (!priv->temp.tjmax.valid) {
>>> +        msg.addr = priv->addr;
>>> +        msg.index = MBX_INDEX_TEMP_TARGET;
>>> +        msg.param = 0;
>>> +        msg.rx_len = 4;
>>> +
>>> +        rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>> +        if (rc)
>>> +            return rc;
>>> +
>>> +        priv->temp.tjmax.value = (s32)msg.pkg_config[2] * 1000;
>>> +        priv->temp.tjmax.valid = true;
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static int get_tcontrol(struct peci_cputemp *priv)
>>> +{
>>> +    struct peci_rd_pkg_cfg_msg msg;
>>> +    s32 tcontrol_margin;
>>> +    s32 tthrottle_offset;
>>> +    int rc;
>>> +
>>> +    if (!need_update(&priv->temp.tcontrol))
>>> +        return 0;
>>> +
>>> +    rc = get_tjmax(priv);
>>> +    if (rc)
>>> +        return rc;
>>> +
>>> +    msg.addr = priv->addr;
>>> +    msg.index = MBX_INDEX_TEMP_TARGET;
>>> +    msg.param = 0;
>>> +    msg.rx_len = 4;
>>> +
>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>> +    if (rc)
>>> +        return rc;
>>> +
>>> +    tcontrol_margin = msg.pkg_config[1];
>>> +    tcontrol_margin = ((tcontrol_margin ^ 0x80) - 0x80) * 1000;
>>> +    priv->temp.tcontrol.value = priv->temp.tjmax.value - tcontrol_margin;
>>> +
>>> +    tthrottle_offset = (msg.pkg_config[3] & 0x2f) * 1000;
>>> +    priv->temp.tthrottle.value = priv->temp.tjmax.value - tthrottle_offset;
>>> +
>>> +    mark_updated(&priv->temp.tcontrol);
>>> +    mark_updated(&priv->temp.tthrottle);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static int get_tthrottle(struct peci_cputemp *priv)
>>> +{
>>> +    struct peci_rd_pkg_cfg_msg msg;
>>> +    s32 tcontrol_margin;
>>> +    s32 tthrottle_offset;
>>> +    int rc;
>>> +
>>> +    if (!need_update(&priv->temp.tthrottle))
>>> +        return 0;
>>> +
>>> +    rc = get_tjmax(priv);
>>> +    if (rc)
>>> +        return rc;
>>> +
>>> +    msg.addr = priv->addr;
>>> +    msg.index = MBX_INDEX_TEMP_TARGET;
>>> +    msg.param = 0;
>>> +    msg.rx_len = 4;
>>> +
>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>> +    if (rc)
>>> +        return rc;
>>> +
>>> +    tthrottle_offset = (msg.pkg_config[3] & 0x2f) * 1000;
>>> +    priv->temp.tthrottle.value = priv->temp.tjmax.value - tthrottle_offset;
>>> +
>>> +    tcontrol_margin = msg.pkg_config[1];
>>> +    tcontrol_margin = ((tcontrol_margin ^ 0x80) - 0x80) * 1000;
>>> +    priv->temp.tcontrol.value = priv->temp.tjmax.value - tcontrol_margin;
>>> +
>>> +    mark_updated(&priv->temp.tthrottle);
>>> +    mark_updated(&priv->temp.tcontrol);
>>> +
>>> +    return 0;
>>> +}
>>
>> I am quite completely missing how the two functions above are different.
>>
> 
> The two functions above differ slightly, but they use the same PECI
> command, which provides both the Tthrottle and Tcontrol values in the
> pkg_config array, so each one updates both values to reduce duplicate
> PECI transactions. Combining these two functions into
> get_tthrottle_and_tcontrol() would probably look better. I'll rewrite it.
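
A sketch of what the combined decode step could look like, in userspace C.
It only covers the arithmetic on the four pkg_config bytes; the PECI
transaction and the caching are left out. struct temp_target and
decode_temp_target() are hypothetical names, and the 0x2f mask is copied
unchanged from the quoted patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All values are in millidegrees Celsius. Hypothetical type. */
struct temp_target {
    int32_t tjmax;
    int32_t tcontrol;
    int32_t tthrottle;
};

/* Decode Tjmax, Tcontrol and Tthrottle from one pkg_config response. */
static void decode_temp_target(const uint8_t pkg_config[4],
                               struct temp_target *out)
{
    int32_t tcontrol_margin = pkg_config[1];
    int32_t tthrottle_offset;

    assert(pkg_config != NULL && out != NULL);

    out->tjmax = (int32_t)pkg_config[2] * 1000;

    /* Sign-extend the 8-bit margin, as the quoted code does. */
    tcontrol_margin = ((tcontrol_margin ^ 0x80) - 0x80) * 1000;
    out->tcontrol = out->tjmax - tcontrol_margin;

    /* Mask value taken as-is from the quoted patch. */
    tthrottle_offset = (pkg_config[3] & 0x2f) * 1000;
    out->tthrottle = out->tjmax - tthrottle_offset;
}
```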
> 
>>> +
>>> +static int get_die_temp(struct peci_cputemp *priv)
>>> +{
>>> +    struct peci_get_temp_msg msg;
>>> +    int rc;
>>> +
>>> +    if (!need_update(&priv->temp.die))
>>> +        return 0;
>>> +
>>> +    rc = get_tjmax(priv);
>>> +    if (rc)
>>> +        return rc;
>>> +
>>> +    msg.addr = priv->addr;
>>> +
>>> +    rc = send_peci_cmd(priv, PECI_CMD_GET_TEMP, &msg);
>>> +    if (rc)
>>> +        return rc;
>>> +
>>> +    priv->temp.die.value = priv->temp.tjmax.value +
>>> +                   ((s32)msg.temp_raw * 1000 / 64);
>>> +
>>> +    mark_updated(&priv->temp.die);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static int get_dts_margin(struct peci_cputemp *priv)
>>> +{
>>> +    struct peci_rd_pkg_cfg_msg msg;
>>> +    s32 dts_margin;
>>> +    int rc;
>>> +
>>> +    if (!need_update(&priv->temp.dts_margin))
>>> +        return 0;
>>> +
>>> +    msg.addr = priv->addr;
>>> +    msg.index = MBX_INDEX_DTS_MARGIN;
>>> +    msg.param = 0;
>>> +    msg.rx_len = 4;
>>> +
>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>> +    if (rc)
>>> +        return rc;
>>> +
>>> +    dts_margin = (msg.pkg_config[1] << 8) | msg.pkg_config[0];
>>> +
>>> +    /**
>>> +     * Processors return a value of DTS reading in 10.6 format
>>> +     * (10 bits signed decimal, 6 bits fractional).
>>> +     * Error codes:
>>> +     *   0x8000: General sensor error
>>> +     *   0x8001: Reserved
>>> +     *   0x8002: Underflow on reading value
>>> +     *   0x8003-0x81ff: Reserved
>>> +     */
>>> +    if (dts_margin >= 0x8000 && dts_margin <= 0x81ff)
>>> +        return -EIO;
>>> +
>>> +    dts_margin = ten_dot_six_to_millidegree(dts_margin);
>>> +
>>> +    priv->temp.dts_margin.value = dts_margin;
>>> +
>>> +    mark_updated(&priv->temp.dts_margin);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static int get_core_temp(struct peci_cputemp *priv, int core_index)
>>> +{
>>> +    struct peci_rd_pkg_cfg_msg msg;
>>> +    s32 core_dts_margin;
>>> +    int rc;
>>> +
>>> +    if (!need_update(&priv->temp.core[core_index]))
>>> +        return 0;
>>> +
>>> +    rc = get_tjmax(priv);
>>> +    if (rc)
>>> +        return rc;
>>> +
>>> +    msg.addr = priv->addr;
>>> +    msg.index = MBX_INDEX_PER_CORE_DTS_TEMP;
>>> +    msg.param = core_index;
>>> +    msg.rx_len = 4;
>>> +
>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>> +    if (rc)
>>> +        return rc;
>>> +
>>> +    core_dts_margin = (msg.pkg_config[1] << 8) | msg.pkg_config[0];
>>> +
>>> +    /**
>>> +     * Processors return a value of the core DTS reading in 10.6 format
>>> +     * (10 bits signed decimal, 6 bits fractional).
>>> +     * Error codes:
>>> +     *   0x8000: General sensor error
>>> +     *   0x8001: Reserved
>>> +     *   0x8002: Underflow on reading value
>>> +     *   0x8003-0x81ff: Reserved
>>> +     */
>>> +    if (core_dts_margin >= 0x8000 && core_dts_margin <= 0x81ff)
>>> +        return -EIO;
>>> +
>>> +    core_dts_margin = ten_dot_six_to_millidegree(core_dts_margin);
>>> +
>>> +    priv->temp.core[core_index].value = priv->temp.tjmax.value +
>>> +                        core_dts_margin;
>>> +
>>> +    mark_updated(&priv->temp.core[core_index]);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>
>> There is a lot of duplication in those functions. Would it be possible
>> to find common code and use functions for it instead of duplicating
>> everything several times ?
>>
> 
> Are you pointing out this code?
> /**
>   * Processors return a value of the core DTS reading in 10.6 format
>   * (10 bits signed decimal, 6 bits fractional).
>   * Error codes:
>   *   0x8000: General sensor error
>   *   0x8001: Reserved
>   *   0x8002: Underflow on reading value
>   *   0x8003-0x81ff: Reserved
>   */
> if (core_dts_margin >= 0x8000 && core_dts_margin <= 0x81ff)
>      return -EIO;
> 
> Then I'll rewrite it as a function. If not, please point out the duplication.
> 

There is lots of other duplication.
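
As one example of what could be factored out, the error-range check and the
10.6 conversion appear together in both get_dts_margin() and
get_core_temp(). A minimal userspace sketch of a shared helper follows;
dts_raw_to_millidegree() is a hypothetical name, and the error range is the
one given in the quoted comment.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Convert a raw 16-bit DTS reading in 10.6 format to millidegrees.
 * Returns 0 on success or -EIO if raw falls in the error-code range
 * 0x8000-0x81ff. Hypothetical helper, for illustration only.
 */
static int dts_raw_to_millidegree(uint16_t raw, int32_t *out)
{
    int32_t val = raw;

    assert(out != NULL);

    if (val >= 0x8000 && val <= 0x81ff)
        return -EIO;

    /* Sign-extend from 16 bits, then scale by 1000 / 64. */
    *out = ((val ^ 0x8000) - 0x8000) * 1000 / 64;

    return 0;
}
```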

>>> +static int find_core_index(struct peci_cputemp *priv, int channel)
>>> +{
>>> +    int core_channel = channel - DEFAULT_CHANNEL_NUMS;
>>> +    int idx, found = 0;
>>> +
>>> +    for (idx = 0; idx < priv->gen_info->core_max; idx++) {
>>> +        if (priv->core_mask & BIT(idx)) {
>>> +            if (core_channel == found)
>>> +                break;
>>> +
>>> +            found++;
>>> +        }
>>> +    }
>>> +
>>> +    return idx;
>>
>> What if nothing is found ?
>>
> 
> The core temperature group is registered only when check_resolved_cores() detects at least one core, so find_core_index() can be called only when priv->core_mask is non-zero. The 'nothing is found' case will not happen.
> 
That doesn't guarantee a match. If what you are saying is correct there should always be
a well defined match of channel -> idx, and the search should be unnecessary.
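
If the search stays, the no-match case can at least be made explicit. A
minimal standalone sketch, assuming a 32-bit mask as in the patch; the
function name is hypothetical and not part of the driver:

```c
#include <errno.h>
#include <stdint.h>

/*
 * Return the position of the n-th set bit in mask (n counts from 0),
 * looking only at the lowest nbits bits. Returns -ENOENT when the mask
 * has fewer than n + 1 bits set, so callers cannot mistake "not found"
 * for a valid index.
 */
static int nth_set_bit(uint32_t mask, unsigned int nbits, unsigned int n)
{
	unsigned int idx;

	for (idx = 0; idx < nbits && idx < 32; idx++) {
		if (!(mask & (UINT32_C(1) << idx)))
			continue;
		if (n == 0)
			return (int)idx;
		n--;
	}

	return -ENOENT;
}
```

A caller would check for a negative return value before using the result as
an array index. The same helper would also cover the DIMM lookup, which
walks a mask in the same way.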

>>> +}
>>> +
>>> +static int cputemp_read_string(struct device *dev,
>>> +                   enum hwmon_sensor_types type,
>>> +                   u32 attr, int channel, const char **str)
>>> +{
>>> +    struct peci_cputemp *priv = dev_get_drvdata(dev);
>>> +    int core_index;
>>> +
>>> +    switch (attr) {
>>> +    case hwmon_temp_label:
>>> +        if (channel < DEFAULT_CHANNEL_NUMS) {
>>> +            *str = cputemp_label[channel];
>>> +        } else {
>>> +            core_index = find_core_index(priv, channel);
>>
>> FWIW, it might be better to pass channel - DEFAULT_CHANNEL_NUMS
>> as parameter.
>>
> 
> cputemp_read_string() is mapped to the read_string member of struct hwmon_ops, so the hwmon subsystem passes the channel parameter based on the registered channel order. Should I modify the hwmon subsystem code?
> 

Huh ? Changing
	f(x) { y = x - const; }
...
	f(x);

to
	f(y) { }
...
	f(x - const);

requires a hwmon core change ? Really ?

>> What if find_core_index() returns priv->gen_info->core_max, ie
>> if it didn't find a core ?
>>
> 
> As explained above, find_core_index() always returns a correct index.
> 
>>> +            *str = cputemp_label[DEFAULT_CHANNEL_NUMS + core_index];
>>> +        }
>>> +        return 0;
>>> +    default:
>>> +        return -EOPNOTSUPP;
>>> +    }
>>> +}
>>> +
>>> +static int cputemp_read_die(struct device *dev,
>>> +                enum hwmon_sensor_types type,
>>> +                u32 attr, int channel, long *val)
>>> +{
>>> +    struct peci_cputemp *priv = dev_get_drvdata(dev);
>>> +    int rc;
>>> +
>>> +    switch (attr) {
>>> +    case hwmon_temp_input:
>>> +        rc = get_die_temp(priv);
>>> +        if (rc)
>>> +            return rc;
>>> +
>>> +        *val = priv->temp.die.value;
>>> +        return 0;
>>> +    case hwmon_temp_max:
>>> +        rc = get_tcontrol(priv);
>>> +        if (rc)
>>> +            return rc;
>>> +
>>> +        *val = priv->temp.tcontrol.value;
>>> +        return 0;
>>> +    case hwmon_temp_crit:
>>> +        rc = get_tjmax(priv);
>>> +        if (rc)
>>> +            return rc;
>>> +
>>> +        *val = priv->temp.tjmax.value;
>>> +        return 0;
>>> +    case hwmon_temp_crit_hyst:
>>> +        rc = get_tcontrol(priv);
>>> +        if (rc)
>>> +            return rc;
>>> +
>>> +        *val = priv->temp.tjmax.value - priv->temp.tcontrol.value;
>>> +        return 0;
>>> +    default:
>>> +        return -EOPNOTSUPP;
>>> +    }
>>> +}
>>> +
>>> +static int cputemp_read_dts_margin(struct device *dev,
>>> +                   enum hwmon_sensor_types type,
>>> +                   u32 attr, int channel, long *val)
>>> +{
>>> +    struct peci_cputemp *priv = dev_get_drvdata(dev);
>>> +    int rc;
>>> +
>>> +    switch (attr) {
>>> +    case hwmon_temp_input:
>>> +        rc = get_dts_margin(priv);
>>> +        if (rc)
>>> +            return rc;
>>> +
>>> +        *val = priv->temp.dts_margin.value;
>>> +        return 0;
>>> +    case hwmon_temp_min:
>>> +        *val = 0;
>>> +        return 0;
>>
>> This attribute should not exist.
>>
> 
> This is an attribute of the DTS margin temperature, which reflects the thermal margin to Tcontrol of the CPU package. A value of '0' means the temperature has reached Tcontrol, the first level of thermal warning. If the CPU keeps getting hotter, the DTS margin shows a negative value until it reaches Tjmax. When the temperature finally reaches Tjmax, it shows the lower critical value that lcrit indicates, the second level of thermal warning.
> 

The hwmon ABI reports chip values, not constants. Even though some drivers do
it, reporting a constant is always wrong.

>>> +    case hwmon_temp_lcrit:
>>> +        rc = get_tcontrol(priv);
>>> +        if (rc)
>>> +            return rc;
>>> +
>>> +        *val = priv->temp.tcontrol.value - priv->temp.tjmax.value;
>>
>> lcrit is tcontrol - tjmax, and crit_hyst above is
>> tjmax - tcontrol ? How does this make sense ?
>>
> 
> Both Tjmax and Tcontrol have positive values and Tjmax is always greater than Tcontrol. As explained above, lcrit of the DTS margin should show a negative value, meaning that the margin has dropped below '0'. On the other hand, crit_hyst of the Die temperature should show the absolute hysteresis value between Tcontrol and Tjmax.
> 
The hwmon ABI requires reporting of absolute temperatures in milli-degrees C.
Your statements make it very clear that this driver does not report
absolute temperatures. This is not acceptable.

>>> +        return 0;
>>> +    default:
>>> +        return -EOPNOTSUPP;
>>> +    }
>>> +}
>>> +
>>> +static int cputemp_read_tcontrol(struct device *dev,
>>> +                 enum hwmon_sensor_types type,
>>> +                 u32 attr, int channel, long *val)
>>> +{
>>> +    struct peci_cputemp *priv = dev_get_drvdata(dev);
>>> +    int rc;
>>> +
>>> +    switch (attr) {
>>> +    case hwmon_temp_input:
>>> +        rc = get_tcontrol(priv);
>>> +        if (rc)
>>> +            return rc;
>>> +
>>> +        *val = priv->temp.tcontrol.value;
>>> +        return 0;
>>> +    case hwmon_temp_crit:
>>> +        rc = get_tjmax(priv);
>>> +        if (rc)
>>> +            return rc;
>>> +
>>> +        *val = priv->temp.tjmax.value;
>>> +        return 0;
>>
>> Am I missing something, or is the same temperature reported several times ?
>> tjmax is also reported as temp_crit cputemp_read_die(), for example.
>>
> 
> This driver provides multiple channels and each channel has its own supplementary attributes. As you mentioned, the Die temperature channel and the Core temperature channel have their own crit attributes and they reflect the same value, Tjmax. It is not reporting several times but reporting the same value.
> 
Then maybe fold the functions accordingly ?
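
A rough sketch of what folding could look like, as standalone C rather than
driver code. The enum values and the struct are stand-ins invented for this
illustration, not the hwmon definitions:

```c
#include <errno.h>

/* Stand-ins for the hwmon attribute IDs; values are illustrative only. */
enum temp_attr { TEMP_INPUT, TEMP_MAX, TEMP_CRIT };

/* Cached package limits in millidegrees C (hypothetical layout). */
struct pkg_limits {
	long tcontrol;
	long tjmax;
};

/* Handle the attributes that report the same value on every channel. */
static int read_common_attr(const struct pkg_limits *lim,
			    enum temp_attr attr, long *val)
{
	switch (attr) {
	case TEMP_MAX:
		*val = lim->tcontrol;
		return 0;
	case TEMP_CRIT:
		*val = lim->tjmax;
		return 0;
	default:
		return -EOPNOTSUPP;
	}
}
```

Each channel handler would then deal only with its own input attribute and
hand everything else to the common function.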

>>> +    default:
>>> +        return -EOPNOTSUPP;
>>> +    }
>>> +}
>>> +
>>> +static int cputemp_read_tthrottle(struct device *dev,
>>> +                  enum hwmon_sensor_types type,
>>> +                  u32 attr, int channel, long *val)
>>> +{
>>> +    struct peci_cputemp *priv = dev_get_drvdata(dev);
>>> +    int rc;
>>> +
>>> +    switch (attr) {
>>> +    case hwmon_temp_input:
>>> +        rc = get_tthrottle(priv);
>>> +        if (rc)
>>> +            return rc;
>>> +
>>> +        *val = priv->temp.tthrottle.value;
>>> +        return 0;
>>> +    default:
>>> +        return -EOPNOTSUPP;
>>> +    }
>>> +}
>>> +
>>> +static int cputemp_read_tjmax(struct device *dev,
>>> +                  enum hwmon_sensor_types type,
>>> +                  u32 attr, int channel, long *val)
>>> +{
>>> +    struct peci_cputemp *priv = dev_get_drvdata(dev);
>>> +    int rc;
>>> +
>>> +    switch (attr) {
>>> +    case hwmon_temp_input:
>>> +        rc = get_tjmax(priv);
>>> +        if (rc)
>>> +            return rc;
>>> +
>>> +        *val = priv->temp.tjmax.value;
>>> +        return 0;
>>> +    default:
>>> +        return -EOPNOTSUPP;
>>> +    }
>>> +}
>>> +
>>> +static int cputemp_read_core(struct device *dev,
>>> +                 enum hwmon_sensor_types type,
>>> +                 u32 attr, int channel, long *val)
>>> +{
>>> +    struct peci_cputemp *priv = dev_get_drvdata(dev);
>>> +    int core_index = find_core_index(priv, channel);
>>> +    int rc;
>>> +
>>> +    switch (attr) {
>>> +    case hwmon_temp_input:
>>> +        rc = get_core_temp(priv, core_index);
>>> +        if (rc)
>>> +            return rc;
>>> +
>>> +        *val = priv->temp.core[core_index].value;
>>> +        return 0;
>>> +    case hwmon_temp_max:
>>> +        rc = get_tcontrol(priv);
>>> +        if (rc)
>>> +            return rc;
>>> +
>>> +        *val = priv->temp.tcontrol.value;
>>> +        return 0;
>>> +    case hwmon_temp_crit:
>>> +        rc = get_tjmax(priv);
>>> +        if (rc)
>>> +            return rc;
>>> +
>>> +        *val = priv->temp.tjmax.value;
>>> +        return 0;
>>> +    case hwmon_temp_crit_hyst:
>>> +        rc = get_tcontrol(priv);
>>> +        if (rc)
>>> +            return rc;
>>> +
>>> +        *val = priv->temp.tjmax.value - priv->temp.tcontrol.value;
>>> +        return 0;
>>> +    default:
>>> +        return -EOPNOTSUPP;
>>> +    }
>>> +}
>>
>> There is again a lot of duplication in those functions.
>>
> 
> Each function is called from cputemp_read(), which is mapped to the read function pointer of struct hwmon_ops. Since each channel has a different set of attributes, cputemp_read() calls an individual channel handler after checking the channel type. Of course, we could handle all attributes of all channels in a single function, but that approach also needs channel type checking code for each attribute.
> 
>>> +
>>> +static int cputemp_read(struct device *dev,
>>> +            enum hwmon_sensor_types type,
>>> +            u32 attr, int channel, long *val)
>>> +{
>>> +    switch (channel) {
>>> +    case channel_die:
>>> +        return cputemp_read_die(dev, type, attr, channel, val);
>>> +    case channel_dts_mrgn:
>>> +        return cputemp_read_dts_margin(dev, type, attr, channel, val);
>>> +    case channel_tcontrol:
>>> +        return cputemp_read_tcontrol(dev, type, attr, channel, val);
>>> +    case channel_tthrottle:
>>> +        return cputemp_read_tthrottle(dev, type, attr, channel, val);
>>> +    case channel_tjmax:
>>> +        return cputemp_read_tjmax(dev, type, attr, channel, val);
>>> +    default:
>>> +        if (channel < CPUTEMP_CHANNEL_NUMS)
>>> +            return cputemp_read_core(dev, type, attr, channel, val);
>>> +
>>> +        return -EOPNOTSUPP;
>>> +    }
>>> +}
>>> +
>>> +static umode_t cputemp_is_visible(const void *data,
>>> +                  enum hwmon_sensor_types type,
>>> +                  u32 attr, int channel)
>>> +{
>>> +    const struct peci_cputemp *priv = data;
>>> +
>>> +    if (priv->temp_config[channel] & BIT(attr))
>>> +        return 0444;
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static const struct hwmon_ops cputemp_ops = {
>>> +    .is_visible = cputemp_is_visible,
>>> +    .read_string = cputemp_read_string,
>>> +    .read = cputemp_read,
>>> +};
>>> +
>>> +static int check_resolved_cores(struct peci_cputemp *priv)
>>> +{
>>> +    struct peci_rd_pci_cfg_local_msg msg;
>>> +    int rc;
>>> +
>>> +    if (!(priv->client->adapter->cmd_mask & BIT(PECI_CMD_RD_PCI_CFG_LOCAL)))
>>> +        return -EINVAL;
>>> +
>>> +    /* Get the RESOLVED_CORES register value */
>>> +    msg.addr = priv->addr;
>>> +    msg.bus = 1;
>>> +    msg.device = 30;
>>> +    msg.function = 3;
>>> +    msg.reg = 0xB4;
>>
>> Can this be made less magic with some defines ?
>>
> 
> Sure, will use defines instead.
> 
>>> +    msg.rx_len = 4;
>>> +
>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PCI_CFG_LOCAL, &msg);
>>> +    if (rc)
>>> +        return rc;
>>> +
>>> +    priv->core_mask = msg.pci_config[3] << 24 |
>>> +              msg.pci_config[2] << 16 |
>>> +              msg.pci_config[1] << 8 |
>>> +              msg.pci_config[0];
>>> +
>>> +    if (!priv->core_mask)
>>> +        return -EAGAIN;
>>> +
>>> +    dev_dbg(priv->dev, "Scanned resolved cores: 0x%x\n", priv->core_mask);
>>> +    return 0;
>>> +}
>>> +
>>> +static int create_core_temp_info(struct peci_cputemp *priv)
>>> +{
>>> +    int rc, i;
>>> +
>>> +    rc = check_resolved_cores(priv);
>>> +    if (!rc) {
>>> +        for (i = 0; i < priv->gen_info->core_max; i++) {
>>> +            if (priv->core_mask & BIT(i)) {
>>> +                priv->temp_config[priv->config_idx++] =
>>> +                             config_table[channel_core];
>>> +            }
>>> +        }
>>> +    }
>>> +
>>> +    return rc;
>>> +}
>>> +
>>> +static int check_cpu_id(struct peci_cputemp *priv)
>>> +{
>>> +    struct peci_rd_pkg_cfg_msg msg;
>>> +    u32 cpu_id;
>>> +    int i, rc;
>>> +
>>> +    msg.addr = priv->addr;
>>> +    msg.index = MBX_INDEX_CPU_ID;
>>> +    msg.param = PKG_ID_CPU_ID;
>>> +    msg.rx_len = 4;
>>> +
>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>> +    if (rc)
>>> +        return rc;
>>> +
>>> +    cpu_id = ((msg.pkg_config[2] << 16) | (msg.pkg_config[1] << 8) |
>>> +          msg.pkg_config[0]) & CLIENT_CPU_ID_MASK;
>>> +
>>> +    for (i = 0; i < CPU_GEN_MAX; i++) {
>>> +        if (cpu_id == cpu_gen_info_table[i].cpu_id) {
>>> +            priv->gen_info = &cpu_gen_info_table[i];
>>> +            break;
>>> +        }
>>> +    }
>>> +
>>> +    if (!priv->gen_info)
>>> +        return -ENODEV;
>>> +
>>> +    dev_dbg(priv->dev, "CPU_ID: 0x%x\n", cpu_id);
>>> +    return 0;
>>> +}
>>> +
>>> +static int peci_cputemp_probe(struct peci_client *client)
>>> +{
>>> +    struct device *dev = &client->dev;
>>> +    struct peci_cputemp *priv;
>>> +    struct device *hwmon_dev;
>>> +    int rc;
>>> +
>>> +    if ((client->adapter->cmd_mask &
>>> +        (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG))) !=
>>> +        (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG))) {
>>> +        dev_err(dev, "Client doesn't support temperature monitoring\n");
>>> +        return -EINVAL;
>>
>> Does this mean there will be an error message for each non-supported CPU ?
>> Why ?
>>
> 
> For proper operation of this driver, PECI_CMD_GET_TEMP and PECI_CMD_RD_PKG_CFG have to be supported by the client CPU. PECI_CMD_GET_TEMP is provided as a default command, but PECI_CMD_RD_PKG_CFG depends on the PECI minor revision of the CPU package, so this check is needed.
> 

I do not question the check. I question the error message and error return value.
Why is it an _error_ if the CPU does not support the functionality, and why does
it have to be reported in the kernel log ?
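
In other words, something along these lines: keep the check, return -ENODEV,
and log nothing. This is a standalone sketch; the command numbers are made
up for the example and are not the PECI definitions:

```c
#include <errno.h>
#include <stdint.h>

#define BIT(n) (UINT32_C(1) << (n))

/* Hypothetical command numbers, for illustration only. */
enum { CMD_GET_TEMP = 2, CMD_RD_PKG_CFG = 4 };

/*
 * Return 0 when every required command is advertised in cmd_mask.
 * A missing command means "this driver does not apply to the device",
 * so the result is a quiet -ENODEV with no log message.
 */
static int check_required_cmds(uint32_t cmd_mask)
{
	const uint32_t required = BIT(CMD_GET_TEMP) | BIT(CMD_RD_PKG_CFG);

	if ((cmd_mask & required) != required)
		return -ENODEV;

	return 0;
}
```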

>>> +    }
>>> +
>>> +    priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
>>> +    if (!priv)
>>> +        return -ENOMEM;
>>> +
>>> +    dev_set_drvdata(dev, priv);
>>> +    priv->client = client;
>>> +    priv->dev = dev;
>>> +    priv->addr = client->addr;
>>> +    priv->cpu_no = priv->addr - PECI_BASE_ADDR;
>>> +
>>> +    snprintf(priv->name, PECI_NAME_SIZE, "peci_cputemp.cpu%d",
>>> +         priv->cpu_no);
>>> +
>>> +    rc = check_cpu_id(priv);
>>> +    if (rc) {
>>> +        dev_err(dev, "Client CPU is not supported\n");
>>
>> -ENODEV is not an error, and should not result in an error message.
>> Besides, the error can also be propagated from peci core code,
>> and may well be something else.
>>
> 
> Got it. I'll remove the error message and will add proper handling code to PECI core.
> 
>>> +        return rc;
>>> +    }
>>> +
>>> +    priv->temp_config[priv->config_idx++] = config_table[channel_die];
>>> +    priv->temp_config[priv->config_idx++] = config_table[channel_dts_mrgn];
>>> +    priv->temp_config[priv->config_idx++] = config_table[channel_tcontrol];
>>> +    priv->temp_config[priv->config_idx++] = config_table[channel_tthrottle];
>>> +    priv->temp_config[priv->config_idx++] = config_table[channel_tjmax];
>>> +
>>> +    rc = create_core_temp_info(priv);
>>> +    if (rc)
>>> +        dev_dbg(dev, "Failed to create core temp info\n");
>>
>> Then what ? Shouldn't this result in probe deferral or something more useful
>> instead of just being ignored ?
>>
> 
> This driver can't support core temperature monitoring if a CPU doesn't support the PECI_CMD_RD_PCI_CFG_LOCAL command. In that case, it skips core temperature group creation and supports only basic temperature monitoring of Die, DTS margin, etc. I'll add this description as a comment.
> 

The message says "Failed to ...". It does not say "This CPU does not support ...".

>>> +
>>> +    priv->chip.ops = &cputemp_ops;
>>> +    priv->chip.info = priv->info;
>>> +
>>> +    priv->info[0] = &priv->temp_info;
>>> +
>>> +    priv->temp_info.type = hwmon_temp;
>>> +    priv->temp_info.config = priv->temp_config;
>>> +
>>> +    hwmon_dev = devm_hwmon_device_register_with_info(priv->dev,
>>> +                             priv->name,
>>> +                             priv,
>>> +                             &priv->chip,
>>> +                             NULL);
>>> +
>>> +    if (IS_ERR(hwmon_dev))
>>> +        return PTR_ERR(hwmon_dev);
>>> +
>>> +    dev_dbg(dev, "%s: sensor '%s'\n", dev_name(hwmon_dev), priv->name);
>>> +

Why does this message display the device name twice ?

>>> +    return 0;
>>> +}
>>> +
>>> +static const struct of_device_id peci_cputemp_of_table[] = {
>>> +    { .compatible = "intel,peci-cputemp" },
>>> +    { }
>>> +};
>>> +MODULE_DEVICE_TABLE(of, peci_cputemp_of_table);
>>> +
>>> +static struct peci_driver peci_cputemp_driver = {
>>> +    .probe  = peci_cputemp_probe,
>>> +    .driver = {
>>> +        .name           = "peci-cputemp",
>>> +        .of_match_table = of_match_ptr(peci_cputemp_of_table),
>>> +    },
>>> +};
>>> +module_peci_driver(peci_cputemp_driver);
>>> +
>>> +MODULE_AUTHOR("Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>");
>>> +MODULE_DESCRIPTION("PECI cputemp driver");
>>> +MODULE_LICENSE("GPL v2");
>>> diff --git a/drivers/hwmon/peci-dimmtemp.c b/drivers/hwmon/peci-dimmtemp.c
>>> new file mode 100644
>>> index 000000000000..78bf29cb2c4c
>>> --- /dev/null
>>> +++ b/drivers/hwmon/peci-dimmtemp.c
>>
>> FWIW, this should be two separate patches.
>>
> 
> Should I split out hwmon documents and dt bindings too?
> 
>>> @@ -0,0 +1,432 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +// Copyright (c) 2018 Intel Corporation
>>> +
>>> +#include <linux/delay.h>
>>> +#include <linux/hwmon.h>
>>> +#include <linux/hwmon-sysfs.h>
>>
>> Needed ?
>>
> 
> No. Will drop the line.
> 
>>> +#include <linux/jiffies.h>
>>> +#include <linux/module.h>
>>> +#include <linux/of_device.h>
>>> +#include <linux/peci.h>
>>> +#include <linux/workqueue.h>
>>> +
>>> +#define TEMP_TYPE_PECI       6  /* Sensor type 6: Intel PECI */
>>> +
>>> +#define CHAN_RANK_MAX_ON_HSX 8  /* Max number of channel ranks on Haswell */
>>> +#define DIMM_IDX_MAX_ON_HSX  3  /* Max DIMM index per channel on Haswell */
>>> +
>>> +#define CHAN_RANK_MAX_ON_BDX 4  /* Max number of channel ranks on Broadwell */
>>> +#define DIMM_IDX_MAX_ON_BDX  3  /* Max DIMM index per channel on Broadwell */
>>> +
>>> +#define CHAN_RANK_MAX_ON_SKX 6  /* Max number of channel ranks on Skylake */
>>> +#define DIMM_IDX_MAX_ON_SKX  2  /* Max DIMM index per channel on Skylake */
>>> +
>>> +#define CHAN_RANK_MAX        CHAN_RANK_MAX_ON_HSX
>>> +#define DIMM_IDX_MAX         DIMM_IDX_MAX_ON_HSX
>>> +
>>> +#define DIMM_NUMS_MAX        (CHAN_RANK_MAX * DIMM_IDX_MAX)
>>> +
>>> +#define CLIENT_CPU_ID_MASK   0xf0ff0  /* Mask for Family / Model info */
>>> +
>>> +#define UPDATE_INTERVAL_MIN  HZ
>>> +
>>> +#define DIMM_MASK_CHECK_DELAY_JIFFIES msecs_to_jiffies(5000)
>>> +#define DIMM_MASK_CHECK_RETRY_MAX     60 /* 60 x 5 secs = 5 minutes */
>>> +
>>> +enum cpu_gens {
>>> +    CPU_GEN_HSX, /* Haswell Xeon */
>>> +    CPU_GEN_BRX, /* Broadwell Xeon */
>>> +    CPU_GEN_SKX, /* Skylake Xeon */
>>> +    CPU_GEN_MAX
>>> +};
>>> +
>>> +struct cpu_gen_info {
>>> +    u32 type;
>>> +    u32 cpu_id;
>>> +    u32 chan_rank_max;
>>> +    u32 dimm_idx_max;
>>> +};
>>> +
>>> +struct temp_data {
>>> +    bool valid;
>>> +    s32  value;
>>> +    unsigned long last_updated;
>>> +};
>>> +
>>> +struct peci_dimmtemp {
>>> +    struct peci_client *client;
>>> +    struct device *dev;
>>> +    struct workqueue_struct *work_queue;
>>> +    struct delayed_work work_handler;
>>> +    char name[PECI_NAME_SIZE];
>>> +    struct temp_data temp[DIMM_NUMS_MAX];
>>> +    u8 addr;
>>> +    uint cpu_no;
>>> +    const struct cpu_gen_info *gen_info;
>>> +    u32 dimm_mask;
>>> +    int retry_count;
>>> +    int channels;
>>> +    u32 temp_config[DIMM_NUMS_MAX + 1];
>>> +    struct hwmon_channel_info temp_info;
>>> +    const struct hwmon_channel_info *info[2];
>>> +    struct hwmon_chip_info chip;
>>> +};
>>> +
>>> +static const struct cpu_gen_info cpu_gen_info_table[] = {
>>> +    { .type  = CPU_GEN_HSX,
>>> +      .cpu_id = 0x306f0, /* Family code: 6, Model number: 63 (0x3f) */
>>> +      .chan_rank_max = CHAN_RANK_MAX_ON_HSX,
>>> +      .dimm_idx_max  = DIMM_IDX_MAX_ON_HSX },
>>> +    { .type  = CPU_GEN_BRX,
>>> +      .cpu_id = 0x406f0, /* Family code: 6, Model number: 79 (0x4f) */
>>> +      .chan_rank_max = CHAN_RANK_MAX_ON_BDX,
>>> +      .dimm_idx_max  = DIMM_IDX_MAX_ON_BDX },
>>> +    { .type  = CPU_GEN_SKX,
>>> +      .cpu_id = 0x50650, /* Family code: 6, Model number: 85 (0x55) */
>>> +      .chan_rank_max = CHAN_RANK_MAX_ON_SKX,
>>> +      .dimm_idx_max  = DIMM_IDX_MAX_ON_SKX },
>>> +};
>>> +
>>> +static const char *dimmtemp_label[CHAN_RANK_MAX][DIMM_IDX_MAX] = {
>>> +    { "DIMM A0", "DIMM A1", "DIMM A2" },
>>> +    { "DIMM B0", "DIMM B1", "DIMM B2" },
>>> +    { "DIMM C0", "DIMM C1", "DIMM C2" },
>>> +    { "DIMM D0", "DIMM D1", "DIMM D2" },
>>> +    { "DIMM E0", "DIMM E1", "DIMM E2" },
>>> +    { "DIMM F0", "DIMM F1", "DIMM F2" },
>>> +    { "DIMM G0", "DIMM G1", "DIMM G2" },
>>> +    { "DIMM H0", "DIMM H1", "DIMM H2" },
>>> +};
>>> +
>>> +static int send_peci_cmd(struct peci_dimmtemp *priv, enum peci_cmd cmd,
>>> +             void *msg)
>>> +{
>>> +    return peci_command(priv->client->adapter, cmd, msg);
>>> +}
>>> +
>>> +static int need_update(struct temp_data *temp)
>>> +{
>>> +    if (temp->valid &&
>>> +        time_before(jiffies, temp->last_updated + UPDATE_INTERVAL_MIN))
>>> +        return 0;
>>> +
>>> +    return 1;
>>> +}
>>> +
>>> +static void mark_updated(struct temp_data *temp)
>>> +{
>>> +    temp->valid = true;
>>> +    temp->last_updated = jiffies;
>>> +}
>>
>> It might make sense to provide the duplicate functions in a core file.
>>
> 
> It is a temperature-monitoring-specific function and it touches module-specific variables. Do you really think that this non-generic function should be moved to PECI core?
> 
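For illustration, the cache bookkeeping does not need anything
module-specific if the timestamp is passed in. A standalone sketch with
hypothetical names; in the drivers the "now" argument would be jiffies and
the interval UPDATE_INTERVAL_MIN:

```c
#include <stdbool.h>

/* Cached reading plus the bookkeeping both drivers duplicate. */
struct temp_cache {
	bool valid;
	long value;
	unsigned long last_updated;
};

/*
 * True when the cached value must be refreshed. The unsigned subtraction
 * keeps the elapsed-time calculation correct when the tick counter wraps.
 */
static bool temp_cache_stale(const struct temp_cache *c, unsigned long now,
			     unsigned long min_interval)
{
	return !c->valid || (now - c->last_updated) >= min_interval;
}

/* Record that the cached value was refreshed at time "now". */
static void temp_cache_mark(struct temp_cache *c, unsigned long now)
{
	c->valid = true;
	c->last_updated = now;
}
```

Both files could then include one shared header instead of carrying
identical copies of struct temp_data, need_update() and mark_updated().
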
>>> +
>>> +static int get_dimm_temp(struct peci_dimmtemp *priv, int dimm_no)
>>> +{
>>> +    int dimm_order = dimm_no % priv->gen_info->dimm_idx_max;
>>> +    int chan_rank = dimm_no / priv->gen_info->dimm_idx_max;
>>> +    struct peci_rd_pkg_cfg_msg msg;
>>> +    int rc;
>>> +
>>> +    if (!need_update(&priv->temp[dimm_no]))
>>> +        return 0;
>>> +
>>> +    msg.addr = priv->addr;
>>> +    msg.index = MBX_INDEX_DDR_DIMM_TEMP;
>>> +    msg.param = chan_rank;
>>> +    msg.rx_len = 4;
>>> +
>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>> +    if (rc)
>>> +        return rc;
>>> +
>>> +    priv->temp[dimm_no].value = msg.pkg_config[dimm_order] * 1000;
>>> +
>>> +    mark_updated(&priv->temp[dimm_no]);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static int find_dimm_number(struct peci_dimmtemp *priv, int channel)
>>> +{
>>> +    int dimm_nums_max = priv->gen_info->chan_rank_max *
>>> +                priv->gen_info->dimm_idx_max;
>>> +    int idx, found = 0;
>>> +
>>> +    for (idx = 0; idx < dimm_nums_max; idx++) {
>>> +        if (priv->dimm_mask & BIT(idx)) {
>>> +            if (channel == found)
>>> +                break;
>>> +
>>> +            found++;
>>> +        }
>>> +    }
>>> +
>>> +    return idx;
>>> +}
>>
>> This again looks like duplicate code.
>>
> 
> find_dimm_number()? I'm sure it isn't.
> 
>>> +
>>> +static int dimmtemp_read_string(struct device *dev,
>>> +                enum hwmon_sensor_types type,
>>> +                u32 attr, int channel, const char **str)
>>> +{
>>> +    struct peci_dimmtemp *priv = dev_get_drvdata(dev);
>>> +    u32 dimm_idx_max = priv->gen_info->dimm_idx_max;
>>> +    int dimm_no, chan_rank, dimm_idx;
>>> +
>>> +    switch (attr) {
>>> +    case hwmon_temp_label:
>>> +        dimm_no = find_dimm_number(priv, channel);
>>> +        chan_rank = dimm_no / dimm_idx_max;
>>> +        dimm_idx = dimm_no % dimm_idx_max;
>>> +        *str = dimmtemp_label[chan_rank][dimm_idx];
>>> +        return 0;
>>> +    default:
>>> +        return -EOPNOTSUPP;
>>> +    }
>>> +}
>>> +
>>> +static int dimmtemp_read(struct device *dev, enum hwmon_sensor_types type,
>>> +             u32 attr, int channel, long *val)
>>> +{
>>> +    struct peci_dimmtemp *priv = dev_get_drvdata(dev);
>>> +    int dimm_no = find_dimm_number(priv, channel);
>>> +    int rc;
>>> +
>>> +    switch (attr) {
>>> +    case hwmon_temp_input:
>>> +        rc = get_dimm_temp(priv, dimm_no);
>>> +        if (rc)
>>> +            return rc;
>>> +
>>> +        *val = priv->temp[dimm_no].value;
>>> +        return 0;
>>> +    default:
>>> +        return -EOPNOTSUPP;
>>> +    }
>>> +}
>>> +
>>> +static umode_t dimmtemp_is_visible(const void *data,
>>> +                   enum hwmon_sensor_types type,
>>> +                   u32 attr, int channel)
>>> +{
>>> +    switch (attr) {
>>> +    case hwmon_temp_label:
>>> +    case hwmon_temp_input:
>>> +        return 0444;
>>> +    default:
>>> +        return 0;
>>> +    }
>>> +}
>>> +
>>> +static const struct hwmon_ops dimmtemp_ops = {
>>> +    .is_visible = dimmtemp_is_visible,
>>> +    .read_string = dimmtemp_read_string,
>>> +    .read = dimmtemp_read,
>>> +};
>>> +
>>> +static int check_populated_dimms(struct peci_dimmtemp *priv)
>>> +{
>>> +    u32 chan_rank_max = priv->gen_info->chan_rank_max;
>>> +    u32 dimm_idx_max = priv->gen_info->dimm_idx_max;
>>> +    struct peci_rd_pkg_cfg_msg msg;
>>> +    int chan_rank, dimm_idx;
>>> +    int rc, channels = 0;
>>> +
>>> +    for (chan_rank = 0; chan_rank < chan_rank_max; chan_rank++) {
>>> +        msg.addr = priv->addr;
>>> +        msg.index = MBX_INDEX_DDR_DIMM_TEMP;
>>> +        msg.param = chan_rank;
>>> +        msg.rx_len = 4;
>>> +
>>> +        rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>> +        if (rc) {
>>> +            priv->dimm_mask = 0;
>>> +            return rc;
>>> +        }
>>> +
>>> +        for (dimm_idx = 0; dimm_idx < dimm_idx_max; dimm_idx++) {
>>> +            if (msg.pkg_config[dimm_idx]) {
>>> +                priv->dimm_mask |= BIT(chan_rank *
>>> +                               dimm_idx_max +
>>> +                               dimm_idx);
>>> +                channels++;
>>> +            }
>>> +        }
>>> +    }
>>> +
>>> +    if (!priv->dimm_mask)
>>> +        return -EAGAIN;
>>> +
>>> +    priv->channels = channels;
>>> +
>>> +    dev_dbg(priv->dev, "Scanned populated DIMMs: 0x%x\n", priv->dimm_mask);
>>> +    return 0;
>>> +}
>>> +
>>> +static int create_dimm_temp_info(struct peci_dimmtemp *priv)
>>> +{
>>> +    struct device *hwmon_dev;
>>> +    int rc, i;
>>> +
>>> +    rc = check_populated_dimms(priv);
>>> +    if (!rc) {
>>
>> Please handle error cases first.
>>
> 
> Sure, I'll rewrite it.
> 
>>> +        for (i = 0; i < priv->channels; i++)
>>> +            priv->temp_config[i] = HWMON_T_LABEL | HWMON_T_INPUT;
>>> +
>>> +        priv->chip.ops = &dimmtemp_ops;
>>> +        priv->chip.info = priv->info;
>>> +
>>> +        priv->info[0] = &priv->temp_info;
>>> +
>>> +        priv->temp_info.type = hwmon_temp;
>>> +        priv->temp_info.config = priv->temp_config;
>>> +
>>> +        hwmon_dev = devm_hwmon_device_register_with_info(priv->dev,
>>> +                                 priv->name,
>>> +                                 priv,
>>> +                                 &priv->chip,
>>> +                                 NULL);
>>> +        rc = PTR_ERR_OR_ZERO(hwmon_dev);
>>> +        if (!rc)
>>> +            dev_dbg(priv->dev, "%s: sensor '%s'\n",
>>> +                dev_name(hwmon_dev), priv->name);
>>> +    } else if (rc == -EAGAIN) {
>>> +        if (priv->retry_count < DIMM_MASK_CHECK_RETRY_MAX) {
>>> +            queue_delayed_work(priv->work_queue,
>>> +                       &priv->work_handler,
>>> +                       DIMM_MASK_CHECK_DELAY_JIFFIES);
>>> +            priv->retry_count++;
>>> +            dev_dbg(priv->dev,
>>> +                "Deferred DIMM temp info creation\n");
>>> +        } else {
>>> +            rc = -ETIMEDOUT;
>>> +            dev_err(priv->dev,
>>> +                "Timeout retrying DIMM temp info creation\n");
>>> +        }
>>> +    }
>>> +
>>> +    return rc;
>>> +}
>>> +
>>> +static void create_dimm_temp_info_delayed(struct work_struct *work)
>>> +{
>>> +    struct delayed_work *dwork = to_delayed_work(work);
>>> +    struct peci_dimmtemp *priv = container_of(dwork, struct peci_dimmtemp,
>>> +                          work_handler);
>>> +    int rc;
>>> +
>>> +    rc = create_dimm_temp_info(priv);
>>> +    if (rc && rc != -EAGAIN)
>>> +        dev_dbg(priv->dev, "Failed to create DIMM temp info\n");
>>> +}
>>> +
>>> +static int check_cpu_id(struct peci_dimmtemp *priv)
>>> +{
>>> +    struct peci_rd_pkg_cfg_msg msg;
>>> +    u32 cpu_id;
>>> +    int i, rc;
>>> +
>>> +    msg.addr = priv->addr;
>>> +    msg.index = MBX_INDEX_CPU_ID;
>>> +    msg.param = PKG_ID_CPU_ID;
>>> +    msg.rx_len = 4;
>>> +
>>> +    rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>>> +    if (rc)
>>> +        return rc;
>>> +
>>> +    cpu_id = ((msg.pkg_config[2] << 16) | (msg.pkg_config[1] << 8) |
>>> +          msg.pkg_config[0]) & CLIENT_CPU_ID_MASK;
>>> +
>>> +    for (i = 0; i < CPU_GEN_MAX; i++) {
>>> +        if (cpu_id == cpu_gen_info_table[i].cpu_id) {
>>> +            priv->gen_info = &cpu_gen_info_table[i];
>>> +            break;
>>> +        }
>>> +    }
>>> +
>>> +    if (!priv->gen_info)
>>> +        return -ENODEV;
>>> +
>>> +    dev_dbg(priv->dev, "CPU_ID: 0x%x\n", cpu_id);
>>> +    return 0;
>>> +}
>>
>> More duplicate code.
>>
> 
> Okay. In case of check_cpu_id(), it could be used as a generic PECI function. I'll move it into PECI core.
> 
>>> +
>>> +static int peci_dimmtemp_probe(struct peci_client *client)
>>> +{
>>> +    struct device *dev = &client->dev;
>>> +    struct peci_dimmtemp *priv;
>>> +    int rc;
>>> +
>>> +    if ((client->adapter->cmd_mask &
>>> +        (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG))) !=
>>> +        (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG))) {
>>
>> One set of ( ) is unnecessary on each side of the expression.
>>
> 
> '&' has a precedence over '!=' but '|' doesn't. I'll rewrite it to:
> 

Actually, that is wrong; you are thinking of the unary address-of operator.
Bitwise operators do have lower precedence than comparisons, so the original
parentheses are needed. I stand corrected.

>      if (client->adapter->cmd_mask &
>          (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG)) !=
>          (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG)))
> 
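To make the precedence point concrete, here is a small standalone sketch
(the function names are invented for the example). In C, != binds tighter
than binary &, so "mask & a != b" is parsed as "mask & (a != b)":

```c
#include <stdbool.h>
#include <stdint.h>

/* Intended check: true only when every bit in required is set in mask. */
static bool has_all_cmds(uint32_t mask, uint32_t required)
{
	return (mask & required) == required;
}

/*
 * How the compiler groups "mask & a != b" when the outer parentheses are
 * dropped: the comparison is evaluated first and yields 0 or 1, which is
 * then ANDed with mask.
 */
static bool grouped_without_parens(uint32_t mask, uint32_t a, uint32_t b)
{
	return mask & (uint32_t)(a != b);
}
```

With a and b being the same constant, as in the rewritten condition, the
comparison is always 0, so the whole expression is always false and the
"not supported" branch can never run.
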
>>> +        dev_err(dev, "Client doesn't support temperature monitoring\n");
>>> +        return -EINVAL;
>>
>> Why is this "invalid", and why does it warrant an error message ?
>>
> 
> Should I use -EPERM? Any suggestion?
> 

Is it an _error_ if the CPU does not support this functionality ?

>>> +    }
>>> +
>>> +    priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
>>> +    if (!priv)
>>> +        return -ENOMEM;
>>> +
>>> +    dev_set_drvdata(dev, priv);
>>> +    priv->client = client;
>>> +    priv->dev = dev;
>>> +    priv->addr = client->addr;
>>> +    priv->cpu_no = priv->addr - PECI_BASE_ADDR;
>>
>> Is priv->addr guaranteed to be >= PECI_BASE_ADDR ?
> 
> Client address range validation will be done in peci_check_addr_validity() in PECI core before probing a device driver.
> 
>>> +
>>> +    snprintf(priv->name, PECI_NAME_SIZE, "peci_dimmtemp.cpu%d",
>>> +         priv->cpu_no);
>>> +
>>> +    rc = check_cpu_id(priv);
>>> +    if (rc) {
>>> +        dev_err(dev, "Client CPU is not supported\n");
>>
>> Or the peci command failed.
>>
> 
> I'll remove the error message and will add proper handling code to PECI core for each error type.
> 
>>> +        return rc;
>>> +    }
>>> +
>>> +    priv->work_queue = alloc_ordered_workqueue(priv->name, 0);
>>> +    if (!priv->work_queue)
>>> +        return -ENOMEM;
>>> +
>>> +    INIT_DELAYED_WORK(&priv->work_handler, create_dimm_temp_info_delayed);
>>> +
>>> +    rc = create_dimm_temp_info(priv);
>>> +    if (rc && rc != -EAGAIN) {
>>> +        dev_err(dev, "Failed to create DIMM temp info\n");
>>> +        goto err_free_wq;
>>> +    }
>>> +
>>> +    return 0;
>>> +
>>> +err_free_wq:
>>> +    destroy_workqueue(priv->work_queue);
>>> +    return rc;
>>> +}
>>> +
>>> +static int peci_dimmtemp_remove(struct peci_client *client)
>>> +{
>>> +    struct peci_dimmtemp *priv = dev_get_drvdata(&client->dev);
>>> +
>>> +    cancel_delayed_work(&priv->work_handler);
>>
>> cancel_delayed_work_sync() ?
>>
> 
> Yes, it would be safer. Will fix it.
> 
>>> +    destroy_workqueue(priv->work_queue);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static const struct of_device_id peci_dimmtemp_of_table[] = {
>>> +    { .compatible = "intel,peci-dimmtemp" },
>>> +    { }
>>> +};
>>> +MODULE_DEVICE_TABLE(of, peci_dimmtemp_of_table);
>>> +
>>> +static struct peci_driver peci_dimmtemp_driver = {
>>> +    .probe  = peci_dimmtemp_probe,
>>> +    .remove = peci_dimmtemp_remove,
>>> +    .driver = {
>>> +        .name           = "peci-dimmtemp",
>>> +        .of_match_table = of_match_ptr(peci_dimmtemp_of_table),
>>> +    },
>>> +};
>>> +module_peci_driver(peci_dimmtemp_driver);
>>> +
>>> +MODULE_AUTHOR("Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>");
>>> +MODULE_DESCRIPTION("PECI dimmtemp driver");
>>> +MODULE_LICENSE("GPL v2");
>>> -- 
>>> 2.16.2
>>>


* Re: [PATCH v3 09/10] drivers/hwmon: Add PECI hwmon client drivers
From: Jae Hyun Yoo @ 2018-04-11 21:59 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Alan Cox, Andrew Jeffery, Andrew Lunn, Andy Shevchenko,
	Arnd Bergmann, Benjamin Herrenschmidt, Fengguang Wu, Greg KH,
	Haiyue Wang, James Feist, Jason M Biils, Jean Delvare,
	Joel Stanley, Julia Cartwright, Miguel Ojeda, Milton Miller II,
	Pavel Machek, Randy Dunlap, Stef van Os, Sumeet R Pawnikar,
	Vernon Mauery, linux-kernel, linux-doc, devicetree, linux-hwmon,
	linux-arm-kernel, openbmc
In-Reply-To: <20180410222800.GA8484@roeck-us.net>

Hi Guenter,

Thanks a lot for sharing your time. Please see my inline answers.

On 4/10/2018 3:28 PM, Guenter Roeck wrote:
> On Tue, Apr 10, 2018 at 11:32:11AM -0700, Jae Hyun Yoo wrote:
>> This commit adds PECI cputemp and dimmtemp hwmon drivers.
>>
>> Signed-off-by: Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>
>> Reviewed-by: Haiyue Wang <haiyue.wang@linux.intel.com>
>> Reviewed-by: James Feist <james.feist@linux.intel.com>
>> Reviewed-by: Vernon Mauery <vernon.mauery@linux.intel.com>
>> Cc: Alan Cox <alan@linux.intel.com>
>> Cc: Andrew Jeffery <andrew@aj.id.au>
>> Cc: Andrew Lunn <andrew@lunn.ch>
>> Cc: Andy Shevchenko <andriy.shevchenko@intel.com>
>> Cc: Arnd Bergmann <arnd@arndb.de>
>> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>> Cc: Fengguang Wu <fengguang.wu@intel.com>
>> Cc: Greg KH <gregkh@linuxfoundation.org>
>> Cc: Guenter Roeck <linux@roeck-us.net>
>> Cc: Jason M Biils <jason.m.bills@linux.intel.com>
>> Cc: Jean Delvare <jdelvare@suse.com>
>> Cc: Joel Stanley <joel@jms.id.au>
>> Cc: Julia Cartwright <juliac@eso.teric.us>
>> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
>> Cc: Milton Miller II <miltonm@us.ibm.com>
>> Cc: Pavel Machek <pavel@ucw.cz>
>> Cc: Randy Dunlap <rdunlap@infradead.org>
>> Cc: Stef van Os <stef.van.os@prodrive-technologies.com>
>> Cc: Sumeet R Pawnikar <sumeet.r.pawnikar@intel.com>
>> ---
>>   drivers/hwmon/Kconfig         |  28 ++
>>   drivers/hwmon/Makefile        |   2 +
>>   drivers/hwmon/peci-cputemp.c  | 783 ++++++++++++++++++++++++++++++++++++++++++
>>   drivers/hwmon/peci-dimmtemp.c | 432 +++++++++++++++++++++++
>>   4 files changed, 1245 insertions(+)
>>   create mode 100644 drivers/hwmon/peci-cputemp.c
>>   create mode 100644 drivers/hwmon/peci-dimmtemp.c
>>
>> diff --git a/drivers/hwmon/Kconfig b/drivers/hwmon/Kconfig
>> index f249a4428458..c52f610f81d0 100644
>> --- a/drivers/hwmon/Kconfig
>> +++ b/drivers/hwmon/Kconfig
>> @@ -1259,6 +1259,34 @@ config SENSORS_NCT7904
>>   	  This driver can also be built as a module.  If so, the module
>>   	  will be called nct7904.
>>   
>> +config SENSORS_PECI_CPUTEMP
>> +	tristate "PECI CPU temperature monitoring support"
>> +	depends on OF
>> +	depends on PECI
>> +	help
>> +	  If you say yes here you get support for the generic Intel PECI
>> +	  cputemp driver which provides Digital Thermal Sensor (DTS) thermal
>> +	  readings of the CPU package and CPU cores that are accessible using
>> +	  the PECI Client Command Suite via the processor PECI client.
>> +	  Check Documentation/hwmon/peci-cputemp for details.
>> +
>> +	  This driver can also be built as a module.  If so, the module
>> +	  will be called peci-cputemp.
>> +
>> +config SENSORS_PECI_DIMMTEMP
>> +	tristate "PECI DIMM temperature monitoring support"
>> +	depends on OF
>> +	depends on PECI
>> +	help
>> +	  If you say yes here you get support for the generic Intel PECI hwmon
>> +	  driver which provides Digital Thermal Sensor (DTS) thermal readings of
>> +	  DIMM components that are accessible using the PECI Client Command
>> +	  Suite via the processor PECI client.
>> +	  Check Documentation/hwmon/peci-dimmtemp for details.
>> +
>> +	  This driver can also be built as a module.  If so, the module
>> +	  will be called peci-dimmtemp.
>> +
>>   config SENSORS_NSA320
>>   	tristate "ZyXEL NSA320 and compatible fan speed and temperature sensors"
>>   	depends on GPIOLIB && OF
>> diff --git a/drivers/hwmon/Makefile b/drivers/hwmon/Makefile
>> index e7d52a36e6c4..48d9598fcd3a 100644
>> --- a/drivers/hwmon/Makefile
>> +++ b/drivers/hwmon/Makefile
>> @@ -136,6 +136,8 @@ obj-$(CONFIG_SENSORS_NCT7802)	+= nct7802.o
>>   obj-$(CONFIG_SENSORS_NCT7904)	+= nct7904.o
>>   obj-$(CONFIG_SENSORS_NSA320)	+= nsa320-hwmon.o
>>   obj-$(CONFIG_SENSORS_NTC_THERMISTOR)	+= ntc_thermistor.o
>> +obj-$(CONFIG_SENSORS_PECI_CPUTEMP)	+= peci-cputemp.o
>> +obj-$(CONFIG_SENSORS_PECI_DIMMTEMP)	+= peci-dimmtemp.o
>>   obj-$(CONFIG_SENSORS_PC87360)	+= pc87360.o
>>   obj-$(CONFIG_SENSORS_PC87427)	+= pc87427.o
>>   obj-$(CONFIG_SENSORS_PCF8591)	+= pcf8591.o
>> diff --git a/drivers/hwmon/peci-cputemp.c b/drivers/hwmon/peci-cputemp.c
>> new file mode 100644
>> index 000000000000..f0bc92687512
>> --- /dev/null
>> +++ b/drivers/hwmon/peci-cputemp.c
>> @@ -0,0 +1,783 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +// Copyright (c) 2018 Intel Corporation
>> +
>> +#include <linux/delay.h>
>> +#include <linux/hwmon.h>
>> +#include <linux/hwmon-sysfs.h>
> 
> Is this include needed ?
> 

No, it isn't. Will drop the line.

>> +#include <linux/jiffies.h>
>> +#include <linux/module.h>
>> +#include <linux/of_device.h>
>> +#include <linux/peci.h>
>> +
>> +#define TEMP_TYPE_PECI        6  /* Sensor type 6: Intel PECI */
>> +
>> +#define CORE_MAX_ON_HSX       18 /* Max number of cores on Haswell */
>> +#define CORE_MAX_ON_BDX       24 /* Max number of cores on Broadwell */
>> +#define CORE_MAX_ON_SKX       28 /* Max number of cores on Skylake */
>> +
>> +#define DEFAULT_CHANNEL_NUMS  5
>> +#define CORETEMP_CHANNEL_NUMS CORE_MAX_ON_SKX
>> +#define CPUTEMP_CHANNEL_NUMS  (DEFAULT_CHANNEL_NUMS + CORETEMP_CHANNEL_NUMS)
>> +
>> +#define CLIENT_CPU_ID_MASK    0xf0ff0  /* Mask for Family / Model info */
>> +
>> +#define UPDATE_INTERVAL_MIN   HZ
>> +
>> +enum cpu_gens {
>> +	CPU_GEN_HSX, /* Haswell Xeon */
>> +	CPU_GEN_BRX, /* Broadwell Xeon */
>> +	CPU_GEN_SKX, /* Skylake Xeon */
>> +	CPU_GEN_MAX
>> +};
>> +
>> +struct cpu_gen_info {
>> +	u32 type;
>> +	u32 cpu_id;
>> +	u32 core_max;
>> +};
>> +
>> +struct temp_data {
>> +	bool valid;
>> +	s32  value;
>> +	unsigned long last_updated;
>> +};
>> +
>> +struct temp_group {
>> +	struct temp_data die;
>> +	struct temp_data dts_margin;
>> +	struct temp_data tcontrol;
>> +	struct temp_data tthrottle;
>> +	struct temp_data tjmax;
>> +	struct temp_data core[CORETEMP_CHANNEL_NUMS];
>> +};
>> +
>> +struct peci_cputemp {
>> +	struct peci_client *client;
>> +	struct device *dev;
>> +	char name[PECI_NAME_SIZE];
>> +	struct temp_group temp;
>> +	u8 addr;
>> +	uint cpu_no;
>> +	const struct cpu_gen_info *gen_info;
>> +	u32 core_mask;
>> +	u32 temp_config[CPUTEMP_CHANNEL_NUMS + 1];
>> +	uint config_idx;
>> +	struct hwmon_channel_info temp_info;
>> +	const struct hwmon_channel_info *info[2];
>> +	struct hwmon_chip_info chip;
>> +};
>> +
>> +enum cputemp_channels {
>> +	channel_die,
>> +	channel_dts_mrgn,
>> +	channel_tcontrol,
>> +	channel_tthrottle,
>> +	channel_tjmax,
>> +	channel_core,
>> +};
>> +
>> +static const struct cpu_gen_info cpu_gen_info_table[] = {
>> +	{ .type = CPU_GEN_HSX,
>> +	  .cpu_id = 0x306f0, /* Family code: 6, Model number: 63 (0x3f) */
>> +	  .core_max = CORE_MAX_ON_HSX },
>> +	{ .type = CPU_GEN_BRX,
>> +	  .cpu_id = 0x406f0, /* Family code: 6, Model number: 79 (0x4f) */
>> +	  .core_max = CORE_MAX_ON_BDX },
>> +	{ .type = CPU_GEN_SKX,
>> +	  .cpu_id = 0x50650, /* Family code: 6, Model number: 85 (0x55) */
>> +	  .core_max = CORE_MAX_ON_SKX },
>> +};
>> +
>> +static const u32 config_table[DEFAULT_CHANNEL_NUMS + 1] = {
>> +	/* Die temperature */
>> +	HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_CRIT |
>> +	HWMON_T_CRIT_HYST,
>> +
>> +	/* DTS margin temperature */
>> +	HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_MIN | HWMON_T_LCRIT,
>> +
>> +	/* Tcontrol temperature */
>> +	HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_CRIT,
>> +
>> +	/* Tthrottle temperature */
>> +	HWMON_T_LABEL | HWMON_T_INPUT,
>> +
>> +	/* Tjmax temperature */
>> +	HWMON_T_LABEL | HWMON_T_INPUT,
>> +
>> +	/* Core temperature - for all core channels */
>> +	HWMON_T_LABEL | HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_CRIT |
>> +	HWMON_T_CRIT_HYST,
>> +};
>> +
>> +static const char *cputemp_label[CPUTEMP_CHANNEL_NUMS] = {
>> +	"Die",
>> +	"DTS margin",
>> +	"Tcontrol",
>> +	"Tthrottle",
>> +	"Tjmax",
>> +	"Core 0", "Core 1", "Core 2", "Core 3",
>> +	"Core 4", "Core 5", "Core 6", "Core 7",
>> +	"Core 8", "Core 9", "Core 10", "Core 11",
>> +	"Core 12", "Core 13", "Core 14", "Core 15",
>> +	"Core 16", "Core 17", "Core 18", "Core 19",
>> +	"Core 20", "Core 21", "Core 22", "Core 23",
>> +};
>> +
>> +static int send_peci_cmd(struct peci_cputemp *priv,
>> +			 enum peci_cmd cmd,
>> +			 void *msg)
>> +{
>> +	return peci_command(priv->client->adapter, cmd, msg);
>> +}
>> +
>> +static int need_update(struct temp_data *temp)
> 
> Please use bool.
> 

Okay. I'll use bool instead of int.
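
A minimal user-space sketch of what the bool version could look like. The
tick_before() helper and the UPDATE_INTERVAL value are assumptions made so
the snippet builds outside the kernel; in the driver, time_before() and
jiffies would be used instead.

```c
#include <stdbool.h>

#define UPDATE_INTERVAL 100UL /* hypothetical tick count standing in for HZ */

struct temp_data {
	bool valid;
	unsigned long last_updated;
};

/* Wraparound-safe "a is before b", the same idea as the kernel's time_before(). */
static bool tick_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* 'now' is passed in explicitly so the helper can be exercised without jiffies. */
static bool need_update(const struct temp_data *temp, unsigned long now)
{
	return !(temp->valid &&
		 tick_before(now, temp->last_updated + UPDATE_INTERVAL));
}
```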

>> +{
>> +	if (temp->valid &&
>> +	    time_before(jiffies, temp->last_updated + UPDATE_INTERVAL_MIN))
>> +		return 0;
>> +
>> +	return 1;
>> +}
>> +
>> +static void mark_updated(struct temp_data *temp)
>> +{
>> +	temp->valid = true;
>> +	temp->last_updated = jiffies;
>> +}
>> +
>> +static s32 ten_dot_six_to_millidegree(s32 val)
>> +{
>> +	return ((val ^ 0x8000) - 0x8000) * 1000 / 64;
>> +}
>> +
>> +static int get_tjmax(struct peci_cputemp *priv)
>> +{
>> +	struct peci_rd_pkg_cfg_msg msg;
>> +	int rc;
>> +
>> +	if (!priv->temp.tjmax.valid) {
>> +		msg.addr = priv->addr;
>> +		msg.index = MBX_INDEX_TEMP_TARGET;
>> +		msg.param = 0;
>> +		msg.rx_len = 4;
>> +
>> +		rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>> +		if (rc)
>> +			return rc;
>> +
>> +		priv->temp.tjmax.value = (s32)msg.pkg_config[2] * 1000;
>> +		priv->temp.tjmax.valid = true;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int get_tcontrol(struct peci_cputemp *priv)
>> +{
>> +	struct peci_rd_pkg_cfg_msg msg;
>> +	s32 tcontrol_margin;
>> +	s32 tthrottle_offset;
>> +	int rc;
>> +
>> +	if (!need_update(&priv->temp.tcontrol))
>> +		return 0;
>> +
>> +	rc = get_tjmax(priv);
>> +	if (rc)
>> +		return rc;
>> +
>> +	msg.addr = priv->addr;
>> +	msg.index = MBX_INDEX_TEMP_TARGET;
>> +	msg.param = 0;
>> +	msg.rx_len = 4;
>> +
>> +	rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>> +	if (rc)
>> +		return rc;
>> +
>> +	tcontrol_margin = msg.pkg_config[1];
>> +	tcontrol_margin = ((tcontrol_margin ^ 0x80) - 0x80) * 1000;
>> +	priv->temp.tcontrol.value = priv->temp.tjmax.value - tcontrol_margin;
>> +
>> +	tthrottle_offset = (msg.pkg_config[3] & 0x2f) * 1000;
>> +	priv->temp.tthrottle.value = priv->temp.tjmax.value - tthrottle_offset;
>> +
>> +	mark_updated(&priv->temp.tcontrol);
>> +	mark_updated(&priv->temp.tthrottle);
>> +
>> +	return 0;
>> +}
>> +
>> +static int get_tthrottle(struct peci_cputemp *priv)
>> +{
>> +	struct peci_rd_pkg_cfg_msg msg;
>> +	s32 tcontrol_margin;
>> +	s32 tthrottle_offset;
>> +	int rc;
>> +
>> +	if (!need_update(&priv->temp.tthrottle))
>> +		return 0;
>> +
>> +	rc = get_tjmax(priv);
>> +	if (rc)
>> +		return rc;
>> +
>> +	msg.addr = priv->addr;
>> +	msg.index = MBX_INDEX_TEMP_TARGET;
>> +	msg.param = 0;
>> +	msg.rx_len = 4;
>> +
>> +	rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>> +	if (rc)
>> +		return rc;
>> +
>> +	tthrottle_offset = (msg.pkg_config[3] & 0x2f) * 1000;
>> +	priv->temp.tthrottle.value = priv->temp.tjmax.value - tthrottle_offset;
>> +
>> +	tcontrol_margin = msg.pkg_config[1];
>> +	tcontrol_margin = ((tcontrol_margin ^ 0x80) - 0x80) * 1000;
>> +	priv->temp.tcontrol.value = priv->temp.tjmax.value - tcontrol_margin;
>> +
>> +	mark_updated(&priv->temp.tthrottle);
>> +	mark_updated(&priv->temp.tcontrol);
>> +
>> +	return 0;
>> +}
> 
> I am quite completely missing how the two functions above are different.
> 

The two functions above differ slightly, but both use the same PECI
command, which returns both the Tthrottle and Tcontrol values in the
pkg_config array, so each function updates both values to avoid duplicate
PECI transactions. Combining the two functions into a single
get_tthrottle_and_tcontrol() would probably look better. I'll rewrite it.
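
A rough sketch of the shared decode step such a combined function could
use. It is plain user-space C; struct temp_target and decode_temp_target()
are hypothetical names, and the arithmetic (including the 0x2f mask) is
copied from the patch as posted rather than checked against a datasheet.

```c
#include <stdint.h>

struct temp_target {
	int32_t tcontrol;  /* millidegrees Celsius */
	int32_t tthrottle; /* millidegrees Celsius */
};

/*
 * Decode both values from one 4-byte RdPkgConfig reply.
 * tjmax is given in millidegrees Celsius.
 */
static struct temp_target decode_temp_target(const uint8_t pkg_config[4],
					     int32_t tjmax)
{
	struct temp_target t;
	int32_t tcontrol_margin = pkg_config[1];
	int32_t tthrottle_offset;

	/* Sign-extend the 8-bit margin, then scale to millidegrees. */
	tcontrol_margin = ((tcontrol_margin ^ 0x80) - 0x80) * 1000;

	/* Mask value taken verbatim from the patch; not verified here. */
	tthrottle_offset = (pkg_config[3] & 0x2f) * 1000;

	t.tcontrol = tjmax - tcontrol_margin;
	t.tthrottle = tjmax - tthrottle_offset;
	return t;
}
```

One PECI transaction then feeds both cached values, and the caller can mark
both entries as updated in one place.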

>> +
>> +static int get_die_temp(struct peci_cputemp *priv)
>> +{
>> +	struct peci_get_temp_msg msg;
>> +	int rc;
>> +
>> +	if (!need_update(&priv->temp.die))
>> +		return 0;
>> +
>> +	rc = get_tjmax(priv);
>> +	if (rc)
>> +		return rc;
>> +
>> +	msg.addr = priv->addr;
>> +
>> +	rc = send_peci_cmd(priv, PECI_CMD_GET_TEMP, &msg);
>> +	if (rc)
>> +		return rc;
>> +
>> +	priv->temp.die.value = priv->temp.tjmax.value +
>> +			       ((s32)msg.temp_raw * 1000 / 64);
>> +
>> +	mark_updated(&priv->temp.die);
>> +
>> +	return 0;
>> +}
>> +
>> +static int get_dts_margin(struct peci_cputemp *priv)
>> +{
>> +	struct peci_rd_pkg_cfg_msg msg;
>> +	s32 dts_margin;
>> +	int rc;
>> +
>> +	if (!need_update(&priv->temp.dts_margin))
>> +		return 0;
>> +
>> +	msg.addr = priv->addr;
>> +	msg.index = MBX_INDEX_DTS_MARGIN;
>> +	msg.param = 0;
>> +	msg.rx_len = 4;
>> +
>> +	rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>> +	if (rc)
>> +		return rc;
>> +
>> +	dts_margin = (msg.pkg_config[1] << 8) | msg.pkg_config[0];
>> +
>> +	/**
>> +	 * Processors return a value of DTS reading in 10.6 format
>> +	 * (10 bits signed decimal, 6 bits fractional).
>> +	 * Error codes:
>> +	 *   0x8000: General sensor error
>> +	 *   0x8001: Reserved
>> +	 *   0x8002: Underflow on reading value
>> +	 *   0x8003-0x81ff: Reserved
>> +	 */
>> +	if (dts_margin >= 0x8000 && dts_margin <= 0x81ff)
>> +		return -EIO;
>> +
>> +	dts_margin = ten_dot_six_to_millidegree(dts_margin);
>> +
>> +	priv->temp.dts_margin.value = dts_margin;
>> +
>> +	mark_updated(&priv->temp.dts_margin);
>> +
>> +	return 0;
>> +}
>> +
>> +static int get_core_temp(struct peci_cputemp *priv, int core_index)
>> +{
>> +	struct peci_rd_pkg_cfg_msg msg;
>> +	s32 core_dts_margin;
>> +	int rc;
>> +
>> +	if (!need_update(&priv->temp.core[core_index]))
>> +		return 0;
>> +
>> +	rc = get_tjmax(priv);
>> +	if (rc)
>> +		return rc;
>> +
>> +	msg.addr = priv->addr;
>> +	msg.index = MBX_INDEX_PER_CORE_DTS_TEMP;
>> +	msg.param = core_index;
>> +	msg.rx_len = 4;
>> +
>> +	rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>> +	if (rc)
>> +		return rc;
>> +
>> +	core_dts_margin = (msg.pkg_config[1] << 8) | msg.pkg_config[0];
>> +
>> +	/**
>> +	 * Processors return a value of the core DTS reading in 10.6 format
>> +	 * (10 bits signed decimal, 6 bits fractional).
>> +	 * Error codes:
>> +	 *   0x8000: General sensor error
>> +	 *   0x8001: Reserved
>> +	 *   0x8002: Underflow on reading value
>> +	 *   0x8003-0x81ff: Reserved
>> +	 */
>> +	if (core_dts_margin >= 0x8000 && core_dts_margin <= 0x81ff)
>> +		return -EIO;
>> +
>> +	core_dts_margin = ten_dot_six_to_millidegree(core_dts_margin);
>> +
>> +	priv->temp.core[core_index].value = priv->temp.tjmax.value +
>> +					    core_dts_margin;
>> +
>> +	mark_updated(&priv->temp.core[core_index]);
>> +
>> +	return 0;
>> +}
>> +
> 
> There is a lot of duplication in those functions. Would it be possible
> to find common code and use functions for it instead of duplicating
> everything several times ?
> 

Are you pointing out this code?
/**
  * Processors return a value of the core DTS reading in 10.6 format
  * (10 bits signed decimal, 6 bits fractional).
  * Error codes:
  *   0x8000: General sensor error
  *   0x8001: Reserved
  *   0x8002: Underflow on reading value
  *   0x8003-0x81ff: Reserved
  */
if (core_dts_margin >= 0x8000 && core_dts_margin <= 0x81ff)
	return -EIO;

Then I'll rewrite it as a function. If not, please point out the 
duplication.
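
For reference, a minimal sketch of such a helper in user-space C. The name
dts_raw_to_millidegree() is hypothetical; the error range and the 10.6
conversion follow the comment and ten_dot_six_to_millidegree() in the patch.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Combine the two reply bytes, reject the documented error codes
 * (0x8000-0x81ff), and convert the signed 10.6 fixed-point value to
 * millidegrees Celsius. Returns 0 on success or -EIO on a sensor error.
 */
static int dts_raw_to_millidegree(uint8_t lo, uint8_t hi, int32_t *out)
{
	int32_t raw = (hi << 8) | lo;

	if (raw >= 0x8000 && raw <= 0x81ff)
		return -EIO;

	/* Sign-extend from 16 bits, then scale by 1000/64. */
	*out = ((raw ^ 0x8000) - 0x8000) * 1000 / 64;
	return 0;
}
```

Both get_dts_margin() and get_core_temp() could call one helper like this
instead of repeating the range check and the conversion.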

>> +static int find_core_index(struct peci_cputemp *priv, int channel)
>> +{
>> +	int core_channel = channel - DEFAULT_CHANNEL_NUMS;
>> +	int idx, found = 0;
>> +
>> +	for (idx = 0; idx < priv->gen_info->core_max; idx++) {
>> +		if (priv->core_mask & BIT(idx)) {
>> +			if (core_channel == found)
>> +				break;
>> +
>> +			found++;
>> +		}
>> +	}
>> +
>> +	return idx;
> 
> What if nothing is found ?
> 

The core temperature group is registered only when check_resolved_cores()
detects at least one core, so find_core_index() can be called only when
priv->core_mask has a non-zero value. The 'nothing is found' case will
not happen.
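
If an explicit guard is still preferred, a sketch along these lines would
make the 'not found' case visible to the caller. This is a user-space
illustration, not the driver code: the function takes the mask and limits
as plain parameters, and the -1 return value is an assumption.

```c
#include <stdint.h>

/*
 * Return the bit position of the core_channel-th set bit in core_mask,
 * or -1 if fewer than core_channel + 1 cores are present.
 * core_max must not exceed 32.
 */
static int find_core_index(uint32_t core_mask, int core_max, int core_channel)
{
	int idx, found = 0;

	for (idx = 0; idx < core_max; idx++) {
		if (!(core_mask & (UINT32_C(1) << idx)))
			continue;
		if (found == core_channel)
			return idx;
		found++;
	}

	return -1;
}
```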

>> +}
>> +
>> +static int cputemp_read_string(struct device *dev,
>> +			       enum hwmon_sensor_types type,
>> +			       u32 attr, int channel, const char **str)
>> +{
>> +	struct peci_cputemp *priv = dev_get_drvdata(dev);
>> +	int core_index;
>> +
>> +	switch (attr) {
>> +	case hwmon_temp_label:
>> +		if (channel < DEFAULT_CHANNEL_NUMS) {
>> +			*str = cputemp_label[channel];
>> +		} else {
>> +			core_index = find_core_index(priv, channel);
> 
> FWIW, it might be better to pass channel - DEFAULT_CHANNEL_NUMS
> as parameter.
> 

cputemp_read_string() is mapped to the read_string member of the hwmon_ops
struct, so the hwmon subsystem passes the channel parameter based on the
registered channel order. Should I modify the hwmon subsystem code?

> What if find_core_index() returns priv->gen_info->core_max, ie
> if it didn't find a core ?
> 

As explained above, find_core_index() always returns a valid index.

>> +			*str = cputemp_label[DEFAULT_CHANNEL_NUMS + core_index];
>> +		}
>> +		return 0;
>> +	default:
>> +		return -EOPNOTSUPP;
>> +	}
>> +}
>> +
>> +static int cputemp_read_die(struct device *dev,
>> +			    enum hwmon_sensor_types type,
>> +			    u32 attr, int channel, long *val)
>> +{
>> +	struct peci_cputemp *priv = dev_get_drvdata(dev);
>> +	int rc;
>> +
>> +	switch (attr) {
>> +	case hwmon_temp_input:
>> +		rc = get_die_temp(priv);
>> +		if (rc)
>> +			return rc;
>> +
>> +		*val = priv->temp.die.value;
>> +		return 0;
>> +	case hwmon_temp_max:
>> +		rc = get_tcontrol(priv);
>> +		if (rc)
>> +			return rc;
>> +
>> +		*val = priv->temp.tcontrol.value;
>> +		return 0;
>> +	case hwmon_temp_crit:
>> +		rc = get_tjmax(priv);
>> +		if (rc)
>> +			return rc;
>> +
>> +		*val = priv->temp.tjmax.value;
>> +		return 0;
>> +	case hwmon_temp_crit_hyst:
>> +		rc = get_tcontrol(priv);
>> +		if (rc)
>> +			return rc;
>> +
>> +		*val = priv->temp.tjmax.value - priv->temp.tcontrol.value;
>> +		return 0;
>> +	default:
>> +		return -EOPNOTSUPP;
>> +	}
>> +}
>> +
>> +static int cputemp_read_dts_margin(struct device *dev,
>> +				   enum hwmon_sensor_types type,
>> +				   u32 attr, int channel, long *val)
>> +{
>> +	struct peci_cputemp *priv = dev_get_drvdata(dev);
>> +	int rc;
>> +
>> +	switch (attr) {
>> +	case hwmon_temp_input:
>> +		rc = get_dts_margin(priv);
>> +		if (rc)
>> +			return rc;
>> +
>> +		*val = priv->temp.dts_margin.value;
>> +		return 0;
>> +	case hwmon_temp_min:
>> +		*val = 0;
>> +		return 0;
> 
> This attribute should not exist.
> 

This is an attribute of the DTS margin temperature, which reflects the
thermal margin to Tcontrol of the CPU package. A value of '0' means the
temperature has reached Tcontrol, the first level of thermal warning. If
the CPU keeps getting hotter, the DTS margin shows a negative value until
the temperature reaches Tjmax. When the temperature finally reaches Tjmax,
the margin equals the lower critical value that lcrit indicates, which is
the second level of thermal warning.

>> +	case hwmon_temp_lcrit:
>> +		rc = get_tcontrol(priv);
>> +		if (rc)
>> +			return rc;
>> +
>> +		*val = priv->temp.tcontrol.value - priv->temp.tjmax.value;
> 
> lcrit is tcontrol - tjmax, and crit_hyst above is
> tjmax - tcontrol ? How does this make sense ?
> 

Both Tjmax and Tcontrol have positive values, and Tjmax is always greater
than Tcontrol. As explained above, lcrit of the DTS margin should be a
negative value, which means the margin has dropped below '0'. On the other
hand, crit_hyst of the Die temperature should show the absolute hysteresis
value between Tcontrol and Tjmax.
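
A small worked example of the two expressions, using made-up numbers
(Tjmax = 100000 and Tcontrol = 90000 millidegrees are illustrative only,
and the helper names are hypothetical):

```c
#include <stdint.h>

/* crit_hyst of the Die channel as computed in the patch: Tjmax - Tcontrol. */
static int32_t die_crit_hyst(int32_t tjmax, int32_t tcontrol)
{
	return tjmax - tcontrol;
}

/* lcrit of the DTS margin channel as computed in the patch: Tcontrol - Tjmax. */
static int32_t dts_margin_lcrit(int32_t tjmax, int32_t tcontrol)
{
	return tcontrol - tjmax;
}
```

With those inputs crit_hyst comes out as 10000 and lcrit as -10000: the same
distance, expressed once as a positive gap and once as a negative margin.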

>> +		return 0;
>> +	default:
>> +		return -EOPNOTSUPP;
>> +	}
>> +}
>> +
>> +static int cputemp_read_tcontrol(struct device *dev,
>> +				 enum hwmon_sensor_types type,
>> +				 u32 attr, int channel, long *val)
>> +{
>> +	struct peci_cputemp *priv = dev_get_drvdata(dev);
>> +	int rc;
>> +
>> +	switch (attr) {
>> +	case hwmon_temp_input:
>> +		rc = get_tcontrol(priv);
>> +		if (rc)
>> +			return rc;
>> +
>> +		*val = priv->temp.tcontrol.value;
>> +		return 0;
>> +	case hwmon_temp_crit:
>> +		rc = get_tjmax(priv);
>> +		if (rc)
>> +			return rc;
>> +
>> +		*val = priv->temp.tjmax.value;
>> +		return 0;
> 
> Am I missing something, or is the same temperature reported several times ?
> tjmax is also reported as temp_crit cputemp_read_die(), for example.
> 

This driver provides multiple channels, and each channel has its own
supplementary attributes. As you mentioned, the Die temperature channel
and the Core temperature channels have their own crit attributes, and
they reflect the same value, Tjmax. The driver is not reporting one
temperature several times; separate attributes report the same value.

>> +	default:
>> +		return -EOPNOTSUPP;
>> +	}
>> +}
>> +
>> +static int cputemp_read_tthrottle(struct device *dev,
>> +				  enum hwmon_sensor_types type,
>> +				  u32 attr, int channel, long *val)
>> +{
>> +	struct peci_cputemp *priv = dev_get_drvdata(dev);
>> +	int rc;
>> +
>> +	switch (attr) {
>> +	case hwmon_temp_input:
>> +		rc = get_tthrottle(priv);
>> +		if (rc)
>> +			return rc;
>> +
>> +		*val = priv->temp.tthrottle.value;
>> +		return 0;
>> +	default:
>> +		return -EOPNOTSUPP;
>> +	}
>> +}
>> +
>> +static int cputemp_read_tjmax(struct device *dev,
>> +			      enum hwmon_sensor_types type,
>> +			      u32 attr, int channel, long *val)
>> +{
>> +	struct peci_cputemp *priv = dev_get_drvdata(dev);
>> +	int rc;
>> +
>> +	switch (attr) {
>> +	case hwmon_temp_input:
>> +		rc = get_tjmax(priv);
>> +		if (rc)
>> +			return rc;
>> +
>> +		*val = priv->temp.tjmax.value;
>> +		return 0;
>> +	default:
>> +		return -EOPNOTSUPP;
>> +	}
>> +}
>> +
>> +static int cputemp_read_core(struct device *dev,
>> +			     enum hwmon_sensor_types type,
>> +			     u32 attr, int channel, long *val)
>> +{
>> +	struct peci_cputemp *priv = dev_get_drvdata(dev);
>> +	int core_index = find_core_index(priv, channel);
>> +	int rc;
>> +
>> +	switch (attr) {
>> +	case hwmon_temp_input:
>> +		rc = get_core_temp(priv, core_index);
>> +		if (rc)
>> +			return rc;
>> +
>> +		*val = priv->temp.core[core_index].value;
>> +		return 0;
>> +	case hwmon_temp_max:
>> +		rc = get_tcontrol(priv);
>> +		if (rc)
>> +			return rc;
>> +
>> +		*val = priv->temp.tcontrol.value;
>> +		return 0;
>> +	case hwmon_temp_crit:
>> +		rc = get_tjmax(priv);
>> +		if (rc)
>> +			return rc;
>> +
>> +		*val = priv->temp.tjmax.value;
>> +		return 0;
>> +	case hwmon_temp_crit_hyst:
>> +		rc = get_tcontrol(priv);
>> +		if (rc)
>> +			return rc;
>> +
>> +		*val = priv->temp.tjmax.value - priv->temp.tcontrol.value;
>> +		return 0;
>> +	default:
>> +		return -EOPNOTSUPP;
>> +	}
>> +}
> 
> There is again a lot of duplication in those functions.
> 

Each function is called from cputemp_read(), which is mapped to the read
function pointer of the hwmon_ops struct. Because each channel has a
different set of attributes, cputemp_read() calls an individual channel
handler after checking the channel type. Of course, we could handle all
attributes of all channels in a single function, but that approach would
also need channel type checks for each attribute.

>> +
>> +static int cputemp_read(struct device *dev,
>> +			enum hwmon_sensor_types type,
>> +			u32 attr, int channel, long *val)
>> +{
>> +	switch (channel) {
>> +	case channel_die:
>> +		return cputemp_read_die(dev, type, attr, channel, val);
>> +	case channel_dts_mrgn:
>> +		return cputemp_read_dts_margin(dev, type, attr, channel, val);
>> +	case channel_tcontrol:
>> +		return cputemp_read_tcontrol(dev, type, attr, channel, val);
>> +	case channel_tthrottle:
>> +		return cputemp_read_tthrottle(dev, type, attr, channel, val);
>> +	case channel_tjmax:
>> +		return cputemp_read_tjmax(dev, type, attr, channel, val);
>> +	default:
>> +		if (channel < CPUTEMP_CHANNEL_NUMS)
>> +			return cputemp_read_core(dev, type, attr, channel, val);
>> +
>> +		return -EOPNOTSUPP;
>> +	}
>> +}
>> +
>> +static umode_t cputemp_is_visible(const void *data,
>> +				  enum hwmon_sensor_types type,
>> +				  u32 attr, int channel)
>> +{
>> +	const struct peci_cputemp *priv = data;
>> +
>> +	if (priv->temp_config[channel] & BIT(attr))
>> +		return 0444;
>> +
>> +	return 0;
>> +}
>> +
>> +static const struct hwmon_ops cputemp_ops = {
>> +	.is_visible = cputemp_is_visible,
>> +	.read_string = cputemp_read_string,
>> +	.read = cputemp_read,
>> +};
>> +
>> +static int check_resolved_cores(struct peci_cputemp *priv)
>> +{
>> +	struct peci_rd_pci_cfg_local_msg msg;
>> +	int rc;
>> +
>> +	if (!(priv->client->adapter->cmd_mask & BIT(PECI_CMD_RD_PCI_CFG_LOCAL)))
>> +		return -EINVAL;
>> +
>> +	/* Get the RESOLVED_CORES register value */
>> +	msg.addr = priv->addr;
>> +	msg.bus = 1;
>> +	msg.device = 30;
>> +	msg.function = 3;
>> +	msg.reg = 0xB4;
> 
> Can this be made less magic with some defines ?
> 

Sure, will use defines instead.
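
Something along these lines, for example. The macro names and the small
struct are hypothetical and only show the shape; the values are the ones
currently hard-coded in check_resolved_cores().

```c
#include <stdint.h>

/* Hypothetical names; values are taken from the patch as posted. */
#define REG_RESOLVED_CORES_BUS      1
#define REG_RESOLVED_CORES_DEVICE   30
#define REG_RESOLVED_CORES_FUNCTION 3
#define REG_RESOLVED_CORES_OFFSET   0xB4

/* Stand-in for the address fields of the PECI RdPCIConfigLocal message. */
struct pci_cfg_addr {
	uint8_t bus;
	uint8_t device;
	uint8_t function;
	uint16_t reg;
};

/* Build the address of the RESOLVED_CORES register from the named constants. */
static struct pci_cfg_addr resolved_cores_addr(void)
{
	struct pci_cfg_addr a = {
		.bus = REG_RESOLVED_CORES_BUS,
		.device = REG_RESOLVED_CORES_DEVICE,
		.function = REG_RESOLVED_CORES_FUNCTION,
		.reg = REG_RESOLVED_CORES_OFFSET,
	};

	return a;
}
```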

>> +	msg.rx_len = 4;
>> +
>> +	rc = send_peci_cmd(priv, PECI_CMD_RD_PCI_CFG_LOCAL, &msg);
>> +	if (rc)
>> +		return rc;
>> +
>> +	priv->core_mask = msg.pci_config[3] << 24 |
>> +			  msg.pci_config[2] << 16 |
>> +			  msg.pci_config[1] << 8 |
>> +			  msg.pci_config[0];
>> +
>> +	if (!priv->core_mask)
>> +		return -EAGAIN;
>> +
>> +	dev_dbg(priv->dev, "Scanned resolved cores: 0x%x\n", priv->core_mask);
>> +	return 0;
>> +}
>> +
>> +static int create_core_temp_info(struct peci_cputemp *priv)
>> +{
>> +	int rc, i;
>> +
>> +	rc = check_resolved_cores(priv);
>> +	if (!rc) {
>> +		for (i = 0; i < priv->gen_info->core_max; i++) {
>> +			if (priv->core_mask & BIT(i)) {
>> +				priv->temp_config[priv->config_idx++] =
>> +						     config_table[channel_core];
>> +			}
>> +		}
>> +	}
>> +
>> +	return rc;
>> +}
>> +
>> +static int check_cpu_id(struct peci_cputemp *priv)
>> +{
>> +	struct peci_rd_pkg_cfg_msg msg;
>> +	u32 cpu_id;
>> +	int i, rc;
>> +
>> +	msg.addr = priv->addr;
>> +	msg.index = MBX_INDEX_CPU_ID;
>> +	msg.param = PKG_ID_CPU_ID;
>> +	msg.rx_len = 4;
>> +
>> +	rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>> +	if (rc)
>> +		return rc;
>> +
>> +	cpu_id = ((msg.pkg_config[2] << 16) | (msg.pkg_config[1] << 8) |
>> +		  msg.pkg_config[0]) & CLIENT_CPU_ID_MASK;
>> +
>> +	for (i = 0; i < CPU_GEN_MAX; i++) {
>> +		if (cpu_id == cpu_gen_info_table[i].cpu_id) {
>> +			priv->gen_info = &cpu_gen_info_table[i];
>> +			break;
>> +		}
>> +	}
>> +
>> +	if (!priv->gen_info)
>> +		return -ENODEV;
>> +
>> +	dev_dbg(priv->dev, "CPU_ID: 0x%x\n", cpu_id);
>> +	return 0;
>> +}
>> +
>> +static int peci_cputemp_probe(struct peci_client *client)
>> +{
>> +	struct device *dev = &client->dev;
>> +	struct peci_cputemp *priv;
>> +	struct device *hwmon_dev;
>> +	int rc;
>> +
>> +	if ((client->adapter->cmd_mask &
>> +	    (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG))) !=
>> +	    (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG))) {
>> +		dev_err(dev, "Client doesn't support temperature monitoring\n");
>> +		return -EINVAL;
> 
> Does this mean there will be an error message for each non-supported CPU ?
> Why ?
> 

For this driver to operate properly, a client CPU has to support both
PECI_CMD_GET_TEMP and PECI_CMD_RD_PKG_CFG. PECI_CMD_GET_TEMP is provided
as a default command, but PECI_CMD_RD_PKG_CFG depends on the PECI minor
revision of the CPU package, so this check is needed.

>> +	}
>> +
>> +	priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
>> +	if (!priv)
>> +		return -ENOMEM;
>> +
>> +	dev_set_drvdata(dev, priv);
>> +	priv->client = client;
>> +	priv->dev = dev;
>> +	priv->addr = client->addr;
>> +	priv->cpu_no = priv->addr - PECI_BASE_ADDR;
>> +
>> +	snprintf(priv->name, PECI_NAME_SIZE, "peci_cputemp.cpu%d",
>> +		 priv->cpu_no);
>> +
>> +	rc = check_cpu_id(priv);
>> +	if (rc) {
>> +		dev_err(dev, "Client CPU is not supported\n");
> 
> -ENODEV is not an error, and should not result in an error message.
> Besides, the error can also be propagated from peci core code,
> and may well be something else.
> 

Got it. I'll remove the error message and add proper handling code to
the PECI core.

>> +		return rc;
>> +	}
>> +
>> +	priv->temp_config[priv->config_idx++] = config_table[channel_die];
>> +	priv->temp_config[priv->config_idx++] = config_table[channel_dts_mrgn];
>> +	priv->temp_config[priv->config_idx++] = config_table[channel_tcontrol];
>> +	priv->temp_config[priv->config_idx++] = config_table[channel_tthrottle];
>> +	priv->temp_config[priv->config_idx++] = config_table[channel_tjmax];
>> +
>> +	rc = create_core_temp_info(priv);
>> +	if (rc)
>> +		dev_dbg(dev, "Failed to create core temp info\n");
> 
> Then what ? Shouldn't this result in probe deferral or something more useful
> instead of just being ignored ?
> 

This driver can't support core temperature monitoring if a CPU doesn't
support the PECI_CMD_RD_PCI_CFG_LOCAL command. In that case, it skips core
temperature group creation and supports only basic temperature
monitoring of Die, DTS margin, etc. I'll add this description as a
comment.

>> +
>> +	priv->chip.ops = &cputemp_ops;
>> +	priv->chip.info = priv->info;
>> +
>> +	priv->info[0] = &priv->temp_info;
>> +
>> +	priv->temp_info.type = hwmon_temp;
>> +	priv->temp_info.config = priv->temp_config;
>> +
>> +	hwmon_dev = devm_hwmon_device_register_with_info(priv->dev,
>> +							 priv->name,
>> +							 priv,
>> +							 &priv->chip,
>> +							 NULL);
>> +
>> +	if (IS_ERR(hwmon_dev))
>> +		return PTR_ERR(hwmon_dev);
>> +
>> +	dev_dbg(dev, "%s: sensor '%s'\n", dev_name(hwmon_dev), priv->name);
>> +
>> +	return 0;
>> +}
>> +
>> +static const struct of_device_id peci_cputemp_of_table[] = {
>> +	{ .compatible = "intel,peci-cputemp" },
>> +	{ }
>> +};
>> +MODULE_DEVICE_TABLE(of, peci_cputemp_of_table);
>> +
>> +static struct peci_driver peci_cputemp_driver = {
>> +	.probe  = peci_cputemp_probe,
>> +	.driver = {
>> +		.name           = "peci-cputemp",
>> +		.of_match_table = of_match_ptr(peci_cputemp_of_table),
>> +	},
>> +};
>> +module_peci_driver(peci_cputemp_driver);
>> +
>> +MODULE_AUTHOR("Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>");
>> +MODULE_DESCRIPTION("PECI cputemp driver");
>> +MODULE_LICENSE("GPL v2");
>> diff --git a/drivers/hwmon/peci-dimmtemp.c b/drivers/hwmon/peci-dimmtemp.c
>> new file mode 100644
>> index 000000000000..78bf29cb2c4c
>> --- /dev/null
>> +++ b/drivers/hwmon/peci-dimmtemp.c
> 
> FWIW, this should be two separate patches.
> 

Should I split out the hwmon documentation and DT bindings too?

>> @@ -0,0 +1,432 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +// Copyright (c) 2018 Intel Corporation
>> +
>> +#include <linux/delay.h>
>> +#include <linux/hwmon.h>
>> +#include <linux/hwmon-sysfs.h>
> 
> Needed ?
> 

No. Will drop the line.

>> +#include <linux/jiffies.h>
>> +#include <linux/module.h>
>> +#include <linux/of_device.h>
>> +#include <linux/peci.h>
>> +#include <linux/workqueue.h>
>> +
>> +#define TEMP_TYPE_PECI       6  /* Sensor type 6: Intel PECI */
>> +
>> +#define CHAN_RANK_MAX_ON_HSX 8  /* Max number of channel ranks on Haswell */
>> +#define DIMM_IDX_MAX_ON_HSX  3  /* Max DIMM index per channel on Haswell */
>> +
>> +#define CHAN_RANK_MAX_ON_BDX 4  /* Max number of channel ranks on Broadwell */
>> +#define DIMM_IDX_MAX_ON_BDX  3  /* Max DIMM index per channel on Broadwell */
>> +
>> +#define CHAN_RANK_MAX_ON_SKX 6  /* Max number of channel ranks on Skylake */
>> +#define DIMM_IDX_MAX_ON_SKX  2  /* Max DIMM index per channel on Skylake */
>> +
>> +#define CHAN_RANK_MAX        CHAN_RANK_MAX_ON_HSX
>> +#define DIMM_IDX_MAX         DIMM_IDX_MAX_ON_HSX
>> +
>> +#define DIMM_NUMS_MAX        (CHAN_RANK_MAX * DIMM_IDX_MAX)
>> +
>> +#define CLIENT_CPU_ID_MASK   0xf0ff0  /* Mask for Family / Model info */
>> +
>> +#define UPDATE_INTERVAL_MIN  HZ
>> +
>> +#define DIMM_MASK_CHECK_DELAY_JIFFIES msecs_to_jiffies(5000)
>> +#define DIMM_MASK_CHECK_RETRY_MAX     60 /* 60 x 5 secs = 5 minutes */
>> +
>> +enum cpu_gens {
>> +	CPU_GEN_HSX, /* Haswell Xeon */
>> +	CPU_GEN_BRX, /* Broadwell Xeon */
>> +	CPU_GEN_SKX, /* Skylake Xeon */
>> +	CPU_GEN_MAX
>> +};
>> +
>> +struct cpu_gen_info {
>> +	u32 type;
>> +	u32 cpu_id;
>> +	u32 chan_rank_max;
>> +	u32 dimm_idx_max;
>> +};
>> +
>> +struct temp_data {
>> +	bool valid;
>> +	s32  value;
>> +	unsigned long last_updated;
>> +};
>> +
>> +struct peci_dimmtemp {
>> +	struct peci_client *client;
>> +	struct device *dev;
>> +	struct workqueue_struct *work_queue;
>> +	struct delayed_work work_handler;
>> +	char name[PECI_NAME_SIZE];
>> +	struct temp_data temp[DIMM_NUMS_MAX];
>> +	u8 addr;
>> +	uint cpu_no;
>> +	const struct cpu_gen_info *gen_info;
>> +	u32 dimm_mask;
>> +	int retry_count;
>> +	int channels;
>> +	u32 temp_config[DIMM_NUMS_MAX + 1];
>> +	struct hwmon_channel_info temp_info;
>> +	const struct hwmon_channel_info *info[2];
>> +	struct hwmon_chip_info chip;
>> +};
>> +
>> +static const struct cpu_gen_info cpu_gen_info_table[] = {
>> +	{ .type  = CPU_GEN_HSX,
>> +	  .cpu_id = 0x306f0, /* Family code: 6, Model number: 63 (0x3f) */
>> +	  .chan_rank_max = CHAN_RANK_MAX_ON_HSX,
>> +	  .dimm_idx_max  = DIMM_IDX_MAX_ON_HSX },
>> +	{ .type  = CPU_GEN_BRX,
>> +	  .cpu_id = 0x406f0, /* Family code: 6, Model number: 79 (0x4f) */
>> +	  .chan_rank_max = CHAN_RANK_MAX_ON_BDX,
>> +	  .dimm_idx_max  = DIMM_IDX_MAX_ON_BDX },
>> +	{ .type  = CPU_GEN_SKX,
>> +	  .cpu_id = 0x50650, /* Family code: 6, Model number: 85 (0x55) */
>> +	  .chan_rank_max = CHAN_RANK_MAX_ON_SKX,
>> +	  .dimm_idx_max  = DIMM_IDX_MAX_ON_SKX },
>> +};
>> +
>> +static const char *dimmtemp_label[CHAN_RANK_MAX][DIMM_IDX_MAX] = {
>> +	{ "DIMM A0", "DIMM A1", "DIMM A2" },
>> +	{ "DIMM B0", "DIMM B1", "DIMM B2" },
>> +	{ "DIMM C0", "DIMM C1", "DIMM C2" },
>> +	{ "DIMM D0", "DIMM D1", "DIMM D2" },
>> +	{ "DIMM E0", "DIMM E1", "DIMM E2" },
>> +	{ "DIMM F0", "DIMM F1", "DIMM F2" },
>> +	{ "DIMM G0", "DIMM G1", "DIMM G2" },
>> +	{ "DIMM H0", "DIMM H1", "DIMM H2" },
>> +};
>> +
>> +static int send_peci_cmd(struct peci_dimmtemp *priv, enum peci_cmd cmd,
>> +			 void *msg)
>> +{
>> +	return peci_command(priv->client->adapter, cmd, msg);
>> +}
>> +
>> +static int need_update(struct temp_data *temp)
>> +{
>> +	if (temp->valid &&
>> +	    time_before(jiffies, temp->last_updated + UPDATE_INTERVAL_MIN))
>> +		return 0;
>> +
>> +	return 1;
>> +}
>> +
>> +static void mark_updated(struct temp_data *temp)
>> +{
>> +	temp->valid = true;
>> +	temp->last_updated = jiffies;
>> +}
> 
> It might make sense to provide the duplicate functions in a core file.
> 

These are temperature-monitoring-specific functions and they touch
module-specific variables. Do you really think these non-generic
functions should be moved to the PECI core?
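
For reference, the caching pattern under discussion can be modelled in
user space. This is only a sketch: tick_t, tick_before() and the explicit
'now' and 'interval' parameters are stand-ins invented here for jiffies,
time_before() and UPDATE_INTERVAL_MIN, so that the wraparound-safe
comparison is visible.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for jiffies: a free-running 32-bit tick counter (assumption). */
typedef uint32_t tick_t;

struct temp_data {
	bool    valid;
	int32_t value;
	tick_t  last_updated;
};

/* True if a is earlier than b, even if the counter wrapped in between. */
static bool tick_before(tick_t a, tick_t b)
{
	return (int32_t)(tick_t)(a - b) < 0;
}

/* A cached value is reused until 'interval' ticks have passed. */
static bool need_update(const struct temp_data *temp, tick_t now,
			tick_t interval)
{
	if (temp->valid &&
	    tick_before(now, (tick_t)(temp->last_updated + interval)))
		return false;

	return true;
}

/* Record that the cached value was refreshed at tick 'now'. */
static void mark_updated(struct temp_data *temp, tick_t now)
{
	temp->valid = true;
	temp->last_updated = now;
}
```

The subtraction-then-sign-test form keeps working when the tick counter
wraps, which a plain 'now < deadline' comparison would not.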

>> +
>> +static int get_dimm_temp(struct peci_dimmtemp *priv, int dimm_no)
>> +{
>> +	int dimm_order = dimm_no % priv->gen_info->dimm_idx_max;
>> +	int chan_rank = dimm_no / priv->gen_info->dimm_idx_max;
>> +	struct peci_rd_pkg_cfg_msg msg;
>> +	int rc;
>> +
>> +	if (!need_update(&priv->temp[dimm_no]))
>> +		return 0;
>> +
>> +	msg.addr = priv->addr;
>> +	msg.index = MBX_INDEX_DDR_DIMM_TEMP;
>> +	msg.param = chan_rank;
>> +	msg.rx_len = 4;
>> +
>> +	rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>> +	if (rc)
>> +		return rc;
>> +
>> +	priv->temp[dimm_no].value = msg.pkg_config[dimm_order] * 1000;
>> +
>> +	mark_updated(&priv->temp[dimm_no]);
>> +
>> +	return 0;
>> +}
>> +
>> +static int find_dimm_number(struct peci_dimmtemp *priv, int channel)
>> +{
>> +	int dimm_nums_max = priv->gen_info->chan_rank_max *
>> +			    priv->gen_info->dimm_idx_max;
>> +	int idx, found = 0;
>> +
>> +	for (idx = 0; idx < dimm_nums_max; idx++) {
>> +		if (priv->dimm_mask & BIT(idx)) {
>> +			if (channel == found)
>> +				break;
>> +
>> +			found++;
>> +		}
>> +	}
>> +
>> +	return idx;
>> +}
> 
> This again looks like duplicate code.
> 

find_dimm_number()? I'm sure it isn't.
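
To make the mapping concrete, here is a small user-space sketch of what
the lookup does: the hwmon channel number is treated as "the n-th
populated DIMM", and the resulting bit position is split back into a
channel rank and a DIMM index. nth_set_bit() and dimm_bit() are names
made up for this illustration, and 'nbits' is assumed to be at most 32.
The round trip only holds if the encode and decode sides use the same
stride (the number of DIMMs per channel).

```c
#include <stdint.h>

/*
 * Return the position of the n-th set bit (n counted from 0) among the
 * low 'nbits' bits of 'mask', or 'nbits' if fewer than n + 1 bits are set.
 */
static int nth_set_bit(uint32_t mask, int n, int nbits)
{
	int idx, found = 0;

	for (idx = 0; idx < nbits; idx++) {
		if (mask & (UINT32_C(1) << idx)) {
			if (found == n)
				break;
			found++;
		}
	}

	return idx;
}

/* Flatten (chan_rank, dimm_idx) into a bit position; stride = DIMMs per channel. */
static int dimm_bit(int chan_rank, int dimm_idx, int dimm_idx_max)
{
	return chan_rank * dimm_idx_max + dimm_idx;
}
```

With three DIMMs per channel, bit 6 decodes as channel rank 6 / 3 = 2 and
DIMM index 6 % 3 = 0, which is the same pair dimm_bit(2, 0, 3) encodes.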

>> +
>> +static int dimmtemp_read_string(struct device *dev,
>> +				enum hwmon_sensor_types type,
>> +				u32 attr, int channel, const char **str)
>> +{
>> +	struct peci_dimmtemp *priv = dev_get_drvdata(dev);
>> +	u32 dimm_idx_max = priv->gen_info->dimm_idx_max;
>> +	int dimm_no, chan_rank, dimm_idx;
>> +
>> +	switch (attr) {
>> +	case hwmon_temp_label:
>> +		dimm_no = find_dimm_number(priv, channel);
>> +		chan_rank = dimm_no / dimm_idx_max;
>> +		dimm_idx = dimm_no % dimm_idx_max;
>> +		*str = dimmtemp_label[chan_rank][dimm_idx];
>> +		return 0;
>> +	default:
>> +		return -EOPNOTSUPP;
>> +	}
>> +}
>> +
>> +static int dimmtemp_read(struct device *dev, enum hwmon_sensor_types type,
>> +			 u32 attr, int channel, long *val)
>> +{
>> +	struct peci_dimmtemp *priv = dev_get_drvdata(dev);
>> +	int dimm_no = find_dimm_number(priv, channel);
>> +	int rc;
>> +
>> +	switch (attr) {
>> +	case hwmon_temp_input:
>> +		rc = get_dimm_temp(priv, dimm_no);
>> +		if (rc)
>> +			return rc;
>> +
>> +		*val = priv->temp[dimm_no].value;
>> +		return 0;
>> +	default:
>> +		return -EOPNOTSUPP;
>> +	}
>> +}
>> +
>> +static umode_t dimmtemp_is_visible(const void *data,
>> +				   enum hwmon_sensor_types type,
>> +				   u32 attr, int channel)
>> +{
>> +	switch (attr) {
>> +	case hwmon_temp_label:
>> +	case hwmon_temp_input:
>> +		return 0444;
>> +	default:
>> +		return 0;
>> +	}
>> +}
>> +
>> +static const struct hwmon_ops dimmtemp_ops = {
>> +	.is_visible = dimmtemp_is_visible,
>> +	.read_string = dimmtemp_read_string,
>> +	.read = dimmtemp_read,
>> +};
>> +
>> +static int check_populated_dimms(struct peci_dimmtemp *priv)
>> +{
>> +	u32 chan_rank_max = priv->gen_info->chan_rank_max;
>> +	u32 dimm_idx_max = priv->gen_info->dimm_idx_max;
>> +	struct peci_rd_pkg_cfg_msg msg;
>> +	int chan_rank, dimm_idx;
>> +	int rc, channels = 0;
>> +
>> +	for (chan_rank = 0; chan_rank < chan_rank_max; chan_rank++) {
>> +		msg.addr = priv->addr;
>> +		msg.index = MBX_INDEX_DDR_DIMM_TEMP;
>> +		msg.param = chan_rank;
>> +		msg.rx_len = 4;
>> +
>> +		rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>> +		if (rc) {
>> +			priv->dimm_mask = 0;
>> +			return rc;
>> +		}
>> +
>> +		for (dimm_idx = 0; dimm_idx < dimm_idx_max; dimm_idx++) {
>> +			if (msg.pkg_config[dimm_idx]) {
>> +				priv->dimm_mask |= BIT(chan_rank *
>> +						       dimm_idx_max +
>> +						       dimm_idx);
>> +				channels++;
>> +			}
>> +		}
>> +	}
>> +
>> +	if (!priv->dimm_mask)
>> +		return -EAGAIN;
>> +
>> +	priv->channels = channels;
>> +
>> +	dev_dbg(priv->dev, "Scanned populated DIMMs: 0x%x\n", priv->dimm_mask);
>> +	return 0;
>> +}
>> +
>> +static int create_dimm_temp_info(struct peci_dimmtemp *priv)
>> +{
>> +	struct device *hwmon_dev;
>> +	int rc, i;
>> +
>> +	rc = check_populated_dimms(priv);
>> +	if (!rc) {
> 
> Please handle error cases first.
> 

Sure, I'll rewrite it.
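
A rough user-space sketch of the error-first shape, to show the control
flow only. handle_scan_result() and RETRY_MAX are hypothetical names;
'scan_rc' stands in for the return value of check_populated_dimms(), and
the place where the driver would queue the delayed work is marked with a
comment.

```c
#include <errno.h>

#define RETRY_MAX 60	/* mirrors DIMM_MASK_CHECK_RETRY_MAX in the patch */

/*
 * Returns 0 when registration may proceed, -EAGAIN when a retry should
 * be scheduled, -ETIMEDOUT once the retry budget is used up, or the
 * original error otherwise.
 */
static int handle_scan_result(int scan_rc, int *retry_count)
{
	if (scan_rc && scan_rc != -EAGAIN)
		return scan_rc;		/* hard failure: give up now */

	if (scan_rc == -EAGAIN) {
		if (*retry_count >= RETRY_MAX)
			return -ETIMEDOUT;

		(*retry_count)++;	/* the caller would queue delayed work here */
		return -EAGAIN;
	}

	/* Success path stays at the lowest indentation level. */
	return 0;
}
```

With the failures returned early, the hwmon registration code no longer
needs to sit inside an 'if (!rc)' block.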

>> +		for (i = 0; i < priv->channels; i++)
>> +			priv->temp_config[i] = HWMON_T_LABEL | HWMON_T_INPUT;
>> +
>> +		priv->chip.ops = &dimmtemp_ops;
>> +		priv->chip.info = priv->info;
>> +
>> +		priv->info[0] = &priv->temp_info;
>> +
>> +		priv->temp_info.type = hwmon_temp;
>> +		priv->temp_info.config = priv->temp_config;
>> +
>> +		hwmon_dev = devm_hwmon_device_register_with_info(priv->dev,
>> +								 priv->name,
>> +								 priv,
>> +								 &priv->chip,
>> +								 NULL);
>> +		rc = PTR_ERR_OR_ZERO(hwmon_dev);
>> +		if (!rc)
>> +			dev_dbg(priv->dev, "%s: sensor '%s'\n",
>> +				dev_name(hwmon_dev), priv->name);
>> +	} else if (rc == -EAGAIN) {
>> +		if (priv->retry_count < DIMM_MASK_CHECK_RETRY_MAX) {
>> +			queue_delayed_work(priv->work_queue,
>> +					   &priv->work_handler,
>> +					   DIMM_MASK_CHECK_DELAY_JIFFIES);
>> +			priv->retry_count++;
>> +			dev_dbg(priv->dev,
>> +				"Deferred DIMM temp info creation\n");
>> +		} else {
>> +			rc = -ETIMEDOUT;
>> +			dev_err(priv->dev,
>> +				"Timeout retrying DIMM temp info creation\n");
>> +		}
>> +	}
>> +
>> +	return rc;
>> +}
>> +
>> +static void create_dimm_temp_info_delayed(struct work_struct *work)
>> +{
>> +	struct delayed_work *dwork = to_delayed_work(work);
>> +	struct peci_dimmtemp *priv = container_of(dwork, struct peci_dimmtemp,
>> +						  work_handler);
>> +	int rc;
>> +
>> +	rc = create_dimm_temp_info(priv);
>> +	if (rc && rc != -EAGAIN)
>> +		dev_dbg(priv->dev, "Failed to create DIMM temp info\n");
>> +}
>> +
>> +static int check_cpu_id(struct peci_dimmtemp *priv)
>> +{
>> +	struct peci_rd_pkg_cfg_msg msg;
>> +	u32 cpu_id;
>> +	int i, rc;
>> +
>> +	msg.addr = priv->addr;
>> +	msg.index = MBX_INDEX_CPU_ID;
>> +	msg.param = PKG_ID_CPU_ID;
>> +	msg.rx_len = 4;
>> +
>> +	rc = send_peci_cmd(priv, PECI_CMD_RD_PKG_CFG, &msg);
>> +	if (rc)
>> +		return rc;
>> +
>> +	cpu_id = ((msg.pkg_config[2] << 16) | (msg.pkg_config[1] << 8) |
>> +		  msg.pkg_config[0]) & CLIENT_CPU_ID_MASK;
>> +
>> +	for (i = 0; i < CPU_GEN_MAX; i++) {
>> +		if (cpu_id == cpu_gen_info_table[i].cpu_id) {
>> +			priv->gen_info = &cpu_gen_info_table[i];
>> +			break;
>> +		}
>> +	}
>> +
>> +	if (!priv->gen_info)
>> +		return -ENODEV;
>> +
>> +	dev_dbg(priv->dev, "CPU_ID: 0x%x\n", cpu_id);
>> +	return 0;
>> +}
> 
> More duplicate code.
> 

Okay. check_cpu_id() could be used as a generic PECI function. I'll
move it into the PECI core.
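
As a sketch of what such a shared helper might look like, modelled in
user space: cpu_id_from_bytes() and find_cpu_gen() are hypothetical
names, and the table below only copies the IDs and mask quoted above.
How the real core function would be structured is an open question.

```c
#include <stddef.h>
#include <stdint.h>

#define CPU_ID_MASK 0xf0ff0u	/* family/model bits, as in the patch */

struct cpu_gen_info {
	uint32_t cpu_id;
	const char *name;
};

static const struct cpu_gen_info cpu_gen_info_table[] = {
	{ 0x306f0, "Haswell Xeon" },
	{ 0x406f0, "Broadwell Xeon" },
	{ 0x50650, "Skylake Xeon" },
};

/* Assemble the little-endian response bytes and keep family/model only. */
static uint32_t cpu_id_from_bytes(const uint8_t b[4])
{
	uint32_t raw = ((uint32_t)b[2] << 16) | ((uint32_t)b[1] << 8) | b[0];

	return raw & CPU_ID_MASK;
}

/* Return the matching table entry, or NULL for an unsupported CPU. */
static const struct cpu_gen_info *find_cpu_gen(uint32_t cpu_id)
{
	size_t n = sizeof(cpu_gen_info_table) / sizeof(cpu_gen_info_table[0]);
	size_t i;

	for (i = 0; i < n; i++) {
		if (cpu_gen_info_table[i].cpu_id == cpu_id)
			return &cpu_gen_info_table[i];
	}

	return NULL;
}
```

Returning NULL for an unknown ID keeps "command failed" and "CPU not
supported" as separate cases for the caller to report.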

>> +
>> +static int peci_dimmtemp_probe(struct peci_client *client)
>> +{
>> +	struct device *dev = &client->dev;
>> +	struct peci_dimmtemp *priv;
>> +	int rc;
>> +
>> +	if ((client->adapter->cmd_mask &
>> +	    (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG))) !=
>> +	    (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG))) {
> 
> One set of ( ) is unnecessary on each side of the expression.
> 

In C, '!=' has higher precedence than '&' (and '&' higher than '|'), so
the parentheses around the '&' expression can't be dropped. The check
has to stay as:

	if ((client->adapter->cmd_mask &
	     (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG))) !=
	    (BIT(PECI_CMD_GET_TEMP) | BIT(PECI_CMD_RD_PKG_CFG)))
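
The precedence question can be checked outside the kernel with a small
sketch. In C the equality operators bind more tightly than bitwise AND,
so "a & b != c" groups as "a & (b != c)". The function names below are
made up for this illustration, and plain integers stand in for the
command mask and the BIT() values.

```c
#include <stdbool.h>
#include <stdint.h>

/* The intended check: is any bit of 'needed' missing from 'mask'? */
static bool missing_any(uint32_t mask, uint32_t needed)
{
	return (mask & needed) != needed;
}

/*
 * How C groups "a & b != c" when the inner parentheses are dropped:
 * '!=' is applied first, then '&' with its 0-or-1 result.
 */
static bool grouped_without_parens(uint32_t a, uint32_t b, uint32_t c)
{
	return a & (uint32_t)(b != c);
}
```

For a client with mask 0x1 and required bits 0x3, missing_any() reports
the missing command, while the unparenthesized grouping evaluates to
0x1 & 0 and lets the client through.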

>> +		dev_err(dev, "Client doesn't support temperature monitoring\n");
>> +		return -EINVAL;
> 
> Why is this "invalid", and why does it warrant an error message ?
> 

Should I use -EPERM? Any suggestion?

>> +	}
>> +
>> +	priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
>> +	if (!priv)
>> +		return -ENOMEM;
>> +
>> +	dev_set_drvdata(dev, priv);
>> +	priv->client = client;
>> +	priv->dev = dev;
>> +	priv->addr = client->addr;
>> +	priv->cpu_no = priv->addr - PECI_BASE_ADDR;
> 
> Is priv->addr guaranteed to be >= PECI_BASE_ADDR ?

Client address range validation will be done by
peci_check_addr_validity() in the PECI core before a device driver is
probed.
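
For illustration, the kind of range check meant here might look like the
sketch below. The values 0x30 and 8 are assumptions for this example, and
addr_to_cpu_no() is a hypothetical helper, not the actual
peci_check_addr_validity() implementation.

```c
#include <stdint.h>

/* Assumed values: client addresses start at 0x30, one per CPU socket. */
#define PECI_BASE_ADDR 0x30
#define PECI_CPU_MAX   8

/*
 * Return the CPU number for a client address, or -1 if the address is
 * outside [PECI_BASE_ADDR, PECI_BASE_ADDR + PECI_CPU_MAX).
 */
static int addr_to_cpu_no(uint8_t addr)
{
	if (addr < PECI_BASE_ADDR || addr >= PECI_BASE_ADDR + PECI_CPU_MAX)
		return -1;

	return addr - PECI_BASE_ADDR;
}
```

Checking the lower bound first means the subtraction can never produce a
wrapped-around CPU number.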

>> +
>> +	snprintf(priv->name, PECI_NAME_SIZE, "peci_dimmtemp.cpu%d",
>> +		 priv->cpu_no);
>> +
>> +	rc = check_cpu_id(priv);
>> +	if (rc) {
>> +		dev_err(dev, "Client CPU is not supported\n");
> 
> Or the peci command failed.
> 

I'll remove the error message and add proper handling code to the PECI
core for each error type.

>> +		return rc;
>> +	}
>> +
>> +	priv->work_queue = alloc_ordered_workqueue(priv->name, 0);
>> +	if (!priv->work_queue)
>> +		return -ENOMEM;
>> +
>> +	INIT_DELAYED_WORK(&priv->work_handler, create_dimm_temp_info_delayed);
>> +
>> +	rc = create_dimm_temp_info(priv);
>> +	if (rc && rc != -EAGAIN) {
>> +		dev_err(dev, "Failed to create DIMM temp info\n");
>> +		goto err_free_wq;
>> +	}
>> +
>> +	return 0;
>> +
>> +err_free_wq:
>> +	destroy_workqueue(priv->work_queue);
>> +	return rc;
>> +}
>> +
>> +static int peci_dimmtemp_remove(struct peci_client *client)
>> +{
>> +	struct peci_dimmtemp *priv = dev_get_drvdata(&client->dev);
>> +
>> +	cancel_delayed_work(&priv->work_handler);
> 
> cancel_delayed_work_sync() ?
> 

Yes, it would be safer. Will fix it.

>> +	destroy_workqueue(priv->work_queue);
>> +
>> +	return 0;
>> +}
>> +
>> +static const struct of_device_id peci_dimmtemp_of_table[] = {
>> +	{ .compatible = "intel,peci-dimmtemp" },
>> +	{ }
>> +};
>> +MODULE_DEVICE_TABLE(of, peci_dimmtemp_of_table);
>> +
>> +static struct peci_driver peci_dimmtemp_driver = {
>> +	.probe  = peci_dimmtemp_probe,
>> +	.remove = peci_dimmtemp_remove,
>> +	.driver = {
>> +		.name           = "peci-dimmtemp",
>> +		.of_match_table = of_match_ptr(peci_dimmtemp_of_table),
>> +	},
>> +};
>> +module_peci_driver(peci_dimmtemp_driver);
>> +
>> +MODULE_AUTHOR("Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>");
>> +MODULE_DESCRIPTION("PECI dimmtemp driver");
>> +MODULE_LICENSE("GPL v2");
>> -- 
>> 2.16.2
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [RFC bpf-next v2 7/8] bpf: add documentation for eBPF helpers (51-57)
From: Quentin Monnet @ 2018-04-11 18:01 UTC (permalink / raw)
  To: Yonghong Song, daniel, ast
  Cc: netdev, oss-drivers, linux-doc, linux-man, Lawrence Brakmo,
	Josef Bacik, Andrey Ignatov
In-Reply-To: <7e388b10-ccea-a2b0-e776-5420c8e7f521@netronome.com>

2018-04-11 16:44 UTC+0100 ~ Quentin Monnet <quentin.monnet@netronome.com>
> 2018-04-10 09:58 UTC-0700 ~ Yonghong Song <yhs@fb.com>
>> On 4/10/18 7:41 AM, Quentin Monnet wrote:
>>> Add documentation for eBPF helper functions to bpf.h user header file.
>>> This documentation can be parsed with the Python script provided in
>>> another commit of the patch series, in order to provide a RST document
>>> that can later be converted into a man page.
>>>
>>> The objective is to make the documentation easily understandable and
>>> accessible to all eBPF developers, including beginners.
>>>
>>> This patch contains descriptions for the following helper functions:
>>>
>>> Helpers from Lawrence:
>>> - bpf_setsockopt()
>>> - bpf_getsockopt()
>>> - bpf_sock_ops_cb_flags_set()
>>>
>>> Helpers from Yonghong:
>>> - bpf_perf_event_read_value()
>>> - bpf_perf_prog_read_value()
>>>
>>> Helper from Josef:
>>> - bpf_override_return()
>>>
>>> Helper from Andrey:
>>> - bpf_bind()
>>>
>>> Cc: Lawrence Brakmo <brakmo@fb.com>
>>> Cc: Yonghong Song <yhs@fb.com>
>>> Cc: Josef Bacik <jbacik@fb.com>
>>> Cc: Andrey Ignatov <rdna@fb.com>
>>> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
>>> ---

> [...]
> 
> Thanks Yonghong for the review!
> 
> I have a favor to ask of you. I got a bounce for Kaixu Xia's email
> address, and I don't know what alternative email address I could use. I
> CC-ed to have a review for helper bpf_perf_event_read() (in patch 6 of
> this series), which is rather close to bpf_perf_event_read_value().
> Would you mind having a look at that one too, please? The description is
> not long.

Well, I read the description I wrote again, and the one for
bpf_perf_event_read() is actually nearly a subset of the one for
bpf_perf_event_read_value(). So the same comments that you raised earlier
apply, and there's probably nothing more to review. But if you notice that
some important info is missing for bpf_perf_event_read(), I'm interested
too!

Quentin

* [RFC bpf-next v2 8/8] bpf: add documentation for eBPF helpers (58-64)
From: Quentin Monnet @ 2018-04-11 15:45 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man,
	John Fastabend
In-Reply-To: <20180411121759.4191e267@redhat.com>

2018-04-11 12:17 UTC+0200 ~ Jesper Dangaard Brouer <brouer@redhat.com>
> On Tue, 10 Apr 2018 15:41:57 +0100
> Quentin Monnet <quentin.monnet@netronome.com> wrote:
> 
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 7343af4196c8..db090ad03626 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -1250,6 +1250,51 @@ union bpf_attr {
>>   * 	Return
>>   * 		0 on success, or a negative error in case of failure.
>>   *
>> + * int bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
>> + * 	Description
>> + * 		Redirect the packet to the endpoint referenced by *map* at
>> + * 		index *key*. Depending on its type, this *map* can contain
>> + * 		references to net devices (for forwarding packets through other
>> + * 		ports), or to CPUs (for redirecting XDP frames to another CPU;
>> + * 		but this is not fully implemented as of this writing).
> 
> Stating that CPUMAP redirect "is not fully implemented" is confusing.
> The issue is that CPUMAP only works for "native" XDP.
> 
> What about saying:
> 
> "[...] or to CPUs (for redirecting XDP frames to another CPU;
>  but this is only implemented for native XDP as of this writing)"
> 

Fine by me, I will change it. Thank you Jesper for the review!

Quentin



* [RFC bpf-next v2 7/8] bpf: add documentation for eBPF helpers (51-57)
From: Quentin Monnet @ 2018-04-11 15:45 UTC (permalink / raw)
  To: Andrey Ignatov
  Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man,
	Lawrence Brakmo, Yonghong Song, Josef Bacik
In-Reply-To: <20180410175015.GA6762@rdna-mbp.dhcp.thefacebook.com>

2018-04-10 10:50 UTC-0700 ~ Andrey Ignatov <rdna@fb.com>
> Quentin Monnet <quentin.monnet@netronome.com> [Tue, 2018-04-10 07:43 -0700]:
>> + * int bpf_bind(struct bpf_sock_addr_kern *ctx, struct sockaddr *addr, int addr_len)
>> + * 	Description
>> + * 		Bind the socket associated to *ctx* to the address pointed by
>> + * 		*addr*, of length *addr_len*. This allows for making outgoing
>> + * 		connection from the desired IP address, which can be useful for
>> + * 		example when all processes inside a cgroup should use one
>> + * 		single IP address on a host that has multiple IPs configured.
>> + *
>> + * 		This helper works for IPv4 and IPv6, TCP and UDP sockets. The
>> + * 		domain (*addr*\ **->sa_family**) must be **AF_INET** (or
>> + * 		**AF_INET6**). Looking for a free port to bind to can be
>> + * 		expensive, therefore binding to port is not permitted by the
>> + * 		helper: *addr*\ **->sin_port** (or **sin6_port**, respectively)
>> + * 		must be set to zero.
>> + *
>> + * 		As for the remote end, both parts of it can be overridden,
>> + * 		remote IP and remote port. This can be useful if an application
>> + * 		inside a cgroup wants to connect to another application inside
>> + * 		the same cgroup or to itself, but knows nothing about the IP
>> + * 		address assigned to the cgroup.
> 
> The last paragraph ("As for the remote end ...") is not relevant to
> bpf_bind() and should be removed. It's about sys_connect hook itself
> that can call to bpf_bind() but also has other functionality (and that
> other functionality is described by this paragraph).

Thanks Andrey, I will remove this paragraph.

Quentin




* [RFC bpf-next v2 7/8] bpf: add documentation for eBPF helpers (51-57)
From: Quentin Monnet @ 2018-04-11 15:44 UTC (permalink / raw)
  To: Yonghong Song, daniel, ast
  Cc: netdev, oss-drivers, linux-doc, linux-man, Lawrence Brakmo,
	Josef Bacik, Andrey Ignatov
In-Reply-To: <cc54b41e-3f2f-e87f-042f-842c96308626@fb.com>

2018-04-10 09:58 UTC-0700 ~ Yonghong Song <yhs@fb.com>
> On 4/10/18 7:41 AM, Quentin Monnet wrote:
>> Add documentation for eBPF helper functions to bpf.h user header file.
>> This documentation can be parsed with the Python script provided in
>> another commit of the patch series, in order to provide a RST document
>> that can later be converted into a man page.
>>
>> The objective is to make the documentation easily understandable and
>> accessible to all eBPF developers, including beginners.
>>
>> This patch contains descriptions for the following helper functions:
>>
>> Helpers from Lawrence:
>> - bpf_setsockopt()
>> - bpf_getsockopt()
>> - bpf_sock_ops_cb_flags_set()
>>
>> Helpers from Yonghong:
>> - bpf_perf_event_read_value()
>> - bpf_perf_prog_read_value()
>>
>> Helper from Josef:
>> - bpf_override_return()
>>
>> Helper from Andrey:
>> - bpf_bind()
>>
>> Cc: Lawrence Brakmo <brakmo@fb.com>
>> Cc: Yonghong Song <yhs@fb.com>
>> Cc: Josef Bacik <jbacik@fb.com>
>> Cc: Andrey Ignatov <rdna@fb.com>
>> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
>> ---
>>   include/uapi/linux/bpf.h | 184
>> +++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 184 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 15d9ccafebbe..7343af4196c8 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h

[...]

>> @@ -1255,6 +1277,168 @@ union bpf_attr {
>>    *         performed again.
>>    *     Return
>>    *         0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_perf_event_read_value(struct bpf_map *map, u64 flags,
>> struct bpf_perf_event_value *buf, u32 buf_size)
>> + *     Description
>> + *         Read the value of a perf event counter, and store it into
>> *buf*
>> + *         of size *buf_size*. This helper relies on a *map* of type
>> + *         **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The nature of the perf
>> + *         event counter is selected at the creation of the *map*. The
> 
> The nature of the perf event counter is selected when *map* is updated
> with perf_event fd's.
> 

Thanks, I will fix it.

>> + *         *map* is an array whose size is the number of available CPU
>> + *         cores, and each cell contains a value relative to one
>> core. The
> 
> It is confusing to mix core/cpu here. Maybe just use perf_event
> convention, always using cpu?
> 

Right, I'll remove occurrences of "core".

>> + *         value to retrieve is indicated by *flags*, that contains the
>> + *         index of the core to look up, masked with
>> + *         **BPF_F_INDEX_MASK**. Alternatively, *flags* can be set to
>> + *         **BPF_F_CURRENT_CPU** to indicate that the value for the
>> + *         current CPU core should be retrieved.
>> + *
>> + *         This helper behaves in a way close to
>> + *         **bpf_perf_event_read**\ () helper, save that instead of
>> + *         just returning the value observed, it fills the *buf*
>> + *         structure. This allows for additional data to be
>> retrieved: in
>> + *         particular, the enabled and running times (in *buf*\
>> + *         **->enabled** and *buf*\ **->running**, respectively) are
>> + *         copied.
>> + *
>> + *         These values are interesting, because hardware PMU
>> (Performance
>> + *         Monitoring Unit) counters are limited resources. When
>> there are
>> + *         more PMU based perf events opened than available counters,
>> + *         kernel will multiplex these events so each event gets certain
>> + *         percentage (but not all) of the PMU time. In case that
>> + *         multiplexing happens, the number of samples or counter value
>> + *         will not reflect the case compared to when no multiplexing
>> + *         occurs. This makes comparison between different runs
>> difficult.
>> + *         Typically, the counter value should be normalized before
>> + *         comparing to other experiments. The usual normalization is
>> done
>> + *         as follows.
>> + *
>> + *         ::
>> + *
>> + *             normalized_counter = counter * t_enabled / t_running
>> + *
>> + *         Where t_enabled is the time enabled for event and
>> t_running is
>> + *         the time running for event since last normalization. The
>> + *         enabled and running times are accumulated since the perf
>> event
>> + *         open. To achieve scaling factor between two invocations of an
>> + *         eBPF program, users can use CPU id as the key (which is
>> + *         typical for perf array usage model) to remember the previous
>> + *         value and do the calculation inside the eBPF program.
>> + *     Return
>> + *         0 on success, or a negative error in case of failure.
>> + *

[...]
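
The normalization described in the quoted text can be sketched in plain
C. This is a user-space model only: struct perf_value is a local
stand-in carrying the three fields the description mentions, and
normalize() and normalize_delta() are names invented for this example.

```c
#include <stdint.h>

/* Local stand-in with the three fields the description mentions. */
struct perf_value {
	uint64_t counter;
	uint64_t enabled;	/* time the event was enabled */
	uint64_t running;	/* time the event was actually counting */
};

/* Scale a raw count up to what it would be had the event run all the time. */
static double normalize(uint64_t counter, uint64_t t_enabled,
			uint64_t t_running)
{
	if (t_running == 0)
		return 0.0;	/* never scheduled: nothing to scale */

	return (double)counter * (double)t_enabled / (double)t_running;
}

/* Normalized count for the interval between two readings. */
static double normalize_delta(const struct perf_value *prev,
			      const struct perf_value *cur)
{
	return normalize(cur->counter - prev->counter,
			 cur->enabled - prev->enabled,
			 cur->running - prev->running);
}
```

For example, a counter that advanced by 2000 while the event was enabled
for 400 time units but running for only 200 normalizes to
2000 * 400 / 200 = 4000.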

Thanks Yonghong for the review!

I have a favor to ask of you. I got a bounce for Kaixu Xia's email
address, and I don't know what alternative email address I could use. I
CC-ed to have a review for helper bpf_perf_event_read() (in patch 6 of
this series), which is rather close to bpf_perf_event_read_value().
Would you mind having a look at that one too, please? The description is
not long.

Quentin

