* Re: [PATCH 18/26] KVM: PPC: KVM PV guest stubs
From: Alexander Graf @ 2010-06-28 8:23 UTC (permalink / raw)
To: Avi Kivity; +Cc: linuxppc-dev, kvm-ppc, Matt Evans, KVM list
In-Reply-To: <4C285A13.8070208@redhat.com>
On 28.06.2010, at 10:15, Avi Kivity wrote:
> On 06/28/2010 09:33 AM, Alexander Graf wrote:
>>=20
>>> Could you do something similar in module_finalize() to patch loaded =
modules' .text sections?
>>> =20
>> I could, but do we need it? I objdump -d | grep'ed all my modules and =
didn't find any need to do so.
>> =20
>=20
> You mean even kvm.ko doesn't use privileged instructions?
It does, but I don't think it's worth speeding those up. There are only =
a couple. Most of the privileged instructions in PPC KVM are statically =
compiled into the kernel because we need to guarantee they're in the RMO =
(first 8MB for the PS3).
Even with the magic page in use, trapping instructions still works =
exactly as before, so we're only talking about a speed difference.
Alex
^ permalink raw reply
* Re: [PATCH 26/26] KVM: PPC: Add Documentation about PV interface
From: Alexander Graf @ 2010-06-28 8:21 UTC (permalink / raw)
To: Avi Kivity; +Cc: linuxppc-dev, kvm-ppc, Milton Miller, KVM list
In-Reply-To: <4C285991.1050303@redhat.com>
On 28.06.2010, at 10:13, Avi Kivity wrote:
> On 06/28/2010 10:49 AM, Alexander Graf wrote:
>>=20
>>> I don't believe we support the kernel actually doing a syscall to =
itself
>>> anymore, at least on powerpc. The callers call the underlying =
system
>>> call function, or kernel_thread.
>>>=20
>>> That said, I would suggest we allocate a syscall number for this, as =
it
>>> would document the usage. (In additon to 0..nr_syscalls - 1 we have
>>> 0x1ebe in use).
>>> =20
>> That's actually a pretty good idea.
>> =20
>=20
> Since the syscall register is not architectual (or rather it is =
architectural but Linux ignores it) I don't see the point. It would =
work for Linux but may alias some random parameter for a different =
guest. We need a reliable method of distinguishing between syscalls and =
hypercalls. Matching pc would work (but is defeated by inlining) so =
long as we find some other way of identifying the hc pc to the =
hypervisor.
The other alternative I'd see is to reuse an instruction that is not sc. =
We could for example pull the mfpvr trick again, but pass a different =
magic value in the register this time that tells the hypervisor "this is =
a hypercall".
Or we could reserve a different SPR. But from what I've seen there are =
already quite a lot of SPRs out there. More than available numbers :).
The hypercall technique I used here is actually inspired by MOL. They =
use magic constants in r3 and r4 for their "OSI" identification. I'm =
frankly not sure what the best approach is, but considering that =
syscalls from the kernel lie in the guest kernel's hand, we could just =
declare any breakage a guest kernel bug.
Alex
^ permalink raw reply
* Re: [PATCH 18/26] KVM: PPC: KVM PV guest stubs
From: Avi Kivity @ 2010-06-28 8:15 UTC (permalink / raw)
To: Alexander Graf; +Cc: linuxppc-dev, kvm-ppc, Matt Evans, KVM list
In-Reply-To: <168E1B5F-44F7-4FF5-80A5-64B0E2E94D68@suse.de>
On 06/28/2010 09:33 AM, Alexander Graf wrote:
>
>> Could you do something similar in module_finalize() to patch loaded modules' .text sections?
>>
> I could, but do we need it? I objdump -d | grep'ed all my modules and didn't find any need to do so.
>
You mean even kvm.ko doesn't use privileged instructions?
--
error compiling committee.c: too many arguments to function
^ permalink raw reply
* Re: [PATCH 26/26] KVM: PPC: Add Documentation about PV interface
From: Avi Kivity @ 2010-06-28 8:13 UTC (permalink / raw)
To: Alexander Graf; +Cc: linuxppc-dev, kvm-ppc, Milton Miller, KVM list
In-Reply-To: <92F4A3F3-A89F-418D-BD4D-66E2489F2E42@suse.de>
On 06/28/2010 10:49 AM, Alexander Graf wrote:
>
>> I don't believe we support the kernel actually doing a syscall to itself
>> anymore, at least on powerpc. The callers call the underlying system
>> call function, or kernel_thread.
>>
>> That said, I would suggest we allocate a syscall number for this, as it
>> would document the usage. (In additon to 0..nr_syscalls - 1 we have
>> 0x1ebe in use).
>>
> That's actually a pretty good idea.
>
Since the syscall register is not architectual (or rather it is
architectural but Linux ignores it) I don't see the point. It would
work for Linux but may alias some random parameter for a different
guest. We need a reliable method of distinguishing between syscalls and
hypercalls. Matching pc would work (but is defeated by inlining) so
long as we find some other way of identifying the hc pc to the hypervisor.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply
* Re: [PATCH 26/26] KVM: PPC: Add Documentation about PV interface
From: Alexander Graf @ 2010-06-28 7:49 UTC (permalink / raw)
To: Milton Miller; +Cc: linuxppc-dev, Avi Kivity, kvm-ppc, KVM list
In-Reply-To: <1277709531_13308@mail4.comsite.net>
On 28.06.2010, at 09:18, Milton Miller wrote:
> On Sun Jun 27 around 19:33:52 EST 2010 Alexander Graf wrote:
>> Am 27.06.2010 um 10:14 schrieb Avi Kivity <avi at redhat.com>:
>>> On 06/26/2010 02:25 AM, Alexander Graf wrote:
>=20
>>>> +
>>>> +PPC hypercalls
>>>> +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
>>>> +
>>>> +The only viable ways to reliably get from guest context to host =20=
>>>> context are:
>>>> +
>>>> + 1) Call an invalid instruction
>>>> + 2) Call the "sc" instruction with a parameter to "sc"
>>>> + 3) Call the "sc" instruction with parameters in GPRs
>>>> +
>>>> +Method 1 is always a bad idea. Invalid instructions can be =20
>>>> replaced later on
>>>> +by valid instructions, rendering the interface broken.
>>>> +
>>>> +Method 2 also has downfalls. If the parameter to "sc" is !=3D 0 =
the =20
>>>> spec is
>>>> +rather unclear if the sc is targeted directly for the hypervisor =20=
>>>> or the
>>>> +supervisor. It would also require that we read the syscall issuing =
=20
>>>> instruction
>>>> +every time a syscall is issued, slowing down guest syscalls.
>>>> +
>=20
> It goes to the hypervisor, and it would require the hypervisor to
> return to the supervisor, but I believe it just returns to the user =
with
> permission denied.
That's what I assumed, yeah :(.
>=20
>>>> +Method 3 is what KVM uses. We pass magic constants =20
>>>> (KVM_SC_MAGIC_R3 and
>>>> +KVM_SC_MAGIC_R4) in r3 and r4 respectively. If a syscall =20
>>>> instruction with these
>>>> +magic values arrives from the guest's kernel mode, we take the =20
>>>> syscall as a
>>>> +hypercall.
>>>>=20
>>>=20
>>> Is there any chance a normal syscall will have those values in r3 =20=
>>> and r4?
>>=20
>> r3 is the syscall number. So as long as the guest doesn't reuse that =20=
>> value, we're safe. Since in general syscall numbers are not randomly =20=
>> scattered throughout the number range, we should be ok here.
>>=20
>=20
> No, r0 has the system call number. Registers 3 and 4 are the first
> 2 args in c abi (or first 64 bit arg in 32 bit c abi), but the linux
> syscall abi special. (In addition, it returns success or failure in
> cr0).
Oh. Ahem :)
>=20
>>>=20
>>> If so, maybe it's better to use pc as they key for hypercalls. Let =20=
>>> the guest designate one instruction address as the hypercall call =20=
>>> point; kvm can easily check it and reflect it back to the guest if =20=
>>> it doesn't match.
>>>=20
>>=20
>> You mean the guest would tell the hv where the hypercall lies? That =20=
>> would require a hypercall, no? Defining it statically is tricky. I =20=
>> want to PV'nize osx using a kernel module later, so I don't have =20
>> control over the physical layout.
>>=20
>>> Is it valid and useful to issue sc from privileged mode anyway, =20
>>> except for calling the hypervisor?
>>=20
>> Same as a syscall on x86 really. The kernel can and does issue =20
>> syscalls within itself.
>>=20
>>=20
>=20
> I don't believe we support the kernel actually doing a syscall to =
itself
> anymore, at least on powerpc. The callers call the underlying system
> call function, or kernel_thread.
>=20
> That said, I would suggest we allocate a syscall number for this, as =
it
> would document the usage. (In additon to 0..nr_syscalls - 1 we have
> 0x1ebe in use).
That's actually a pretty good idea.
>=20
> Also, is there any desire to nest such emulation?
Nesting should just work, right? Since we only accept hypercalls from =
PR=3D0 and guests run in PR=3D1, we get the sc interrupt in the l1 guest =
by then.
The only issue I'm aware of that completely breaks when using nested KVM =
on PPC is the MSR_IR !=3D MSR_DR logic. We fetch the instruction we got =
an interrupt on for certain interrupts in the world switch handler by =
keeping MSR_IR=3D0, but setting MSR_DR=3D1. And KVM speeds up MSR_DR !=3D =
MSR_IR by mapping both of them lazily in a special address space. So if =
you access the same page as instruction and as data, you get an invalid =
result.
Alex
^ permalink raw reply
* Re: [PATCH 26/26] KVM: PPC: Add Documentation about PV interface
From: Milton Miller @ 2010-06-28 7:18 UTC (permalink / raw)
To: Alexander Graf; +Cc: linuxppc-dev, Avi Kivity, kvm-ppc, KVM list
In-Reply-To: <07C9A4B8-881A-438C-AA99-AEC23887C6B8@suse.de>
On Sun Jun 27 around 19:33:52 EST 2010 Alexander Graf wrote:
> Am 27.06.2010 um 10:14 schrieb Avi Kivity <avi at redhat.com>:
> > On 06/26/2010 02:25 AM, Alexander Graf wrote:
> > > +
> > > +PPC hypercalls
> > > +==============
> > > +
> > > +The only viable ways to reliably get from guest context to host
> > > context are:
> > > +
> > > + 1) Call an invalid instruction
> > > + 2) Call the "sc" instruction with a parameter to "sc"
> > > + 3) Call the "sc" instruction with parameters in GPRs
> > > +
> > > +Method 1 is always a bad idea. Invalid instructions can be
> > > replaced later on
> > > +by valid instructions, rendering the interface broken.
> > > +
> > > +Method 2 also has downfalls. If the parameter to "sc" is != 0 the
> > > spec is
> > > +rather unclear if the sc is targeted directly for the hypervisor
> > > or the
> > > +supervisor. It would also require that we read the syscall issuing
> > > instruction
> > > +every time a syscall is issued, slowing down guest syscalls.
> > > +
It goes to the hypervisor, and it would require the hypervisor to
return to the supervisor, but I believe it just returns to the user with
permission denied.
> > > +Method 3 is what KVM uses. We pass magic constants
> > > (KVM_SC_MAGIC_R3 and
> > > +KVM_SC_MAGIC_R4) in r3 and r4 respectively. If a syscall
> > > instruction with these
> > > +magic values arrives from the guest's kernel mode, we take the
> > > syscall as a
> > > +hypercall.
> > >
> >
> > Is there any chance a normal syscall will have those values in r3
> > and r4?
>
> r3 is the syscall number. So as long as the guest doesn't reuse that
> value, we're safe. Since in general syscall numbers are not randomly
> scattered throughout the number range, we should be ok here.
>
No, r0 has the system call number. Registers 3 and 4 are the first
2 args in c abi (or first 64 bit arg in 32 bit c abi), but the linux
syscall abi special. (In addition, it returns success or failure in
cr0).
> >
> > If so, maybe it's better to use pc as they key for hypercalls. Let
> > the guest designate one instruction address as the hypercall call
> > point; kvm can easily check it and reflect it back to the guest if
> > it doesn't match.
> >
>
> You mean the guest would tell the hv where the hypercall lies? That
> would require a hypercall, no? Defining it statically is tricky. I
> want to PV'nize osx using a kernel module later, so I don't have
> control over the physical layout.
>
> > Is it valid and useful to issue sc from privileged mode anyway,
> > except for calling the hypervisor?
>
> Same as a syscall on x86 really. The kernel can and does issue
> syscalls within itself.
>
>
I don't believe we support the kernel actually doing a syscall to itself
anymore, at least on powerpc. The callers call the underlying system
call function, or kernel_thread.
That said, I would suggest we allocate a syscall number for this, as it
would document the usage. (In additon to 0..nr_syscalls - 1 we have
0x1ebe in use).
Also, is there any desire to nest such emulation?
milton
^ permalink raw reply
* Re: of-flash: Unable to ioremap() both 128MB NOR flashes on 32-bit system with 2GB+ RAM
From: Milton Miller @ 2010-06-28 7:18 UTC (permalink / raw)
To: Kyle Moffett; +Cc: linux-mtd, linuxppc-dev
In-Reply-To: <AANLkTikMtZsg5ameUY9dnZoEqV-hfa7tkfGIO1hBtDq1@mail.gmail.com>
On Fri Jun 25 around 14:01:51 EST 2010 Kyle Moffett wrote:
> Oops... put the old linuxppc list on the CC, sorry!
>
> On Thu, Jun 24, 2010 at 23:45, Kyle Moffett <kyle at moffetthome.net> wrote:
> > Hello,
> >
> > I've got a new P2020 (32bit mpc85xx family) board I'm working on a
> > port for that includes 2 NOR flashes (128MB each) and a removable
> > SO-RDIMM of 2GB or 4GB. Unfortunately when I configure both flashes
> > in the device-tree off my elbc, Linux is completely unable to access
> > the second one because it attempts to ioremap() the entire virtual
> > address space of both FLASH chips.
> >
> > Even with only one flash chip enabled, there's a bit of a noticeable
> > performance degradation because the mapping consumes almost all of my
> > available vmalloc space and forces bounce-buffering for all my
> > HIGHMEM.
> >
> > It looks like the "of-flash" driver currently requires that the whole
> > chip be mapped in the kernel at once. I would much rather have a 50%
> > performance penalty on flash accesses (which are already very slow)
> > and regain most of the vmap space.
> >
> > So the question is, is there a way to convince the MTD layer to
> > iomap() only what it needs to access to do reads and writes? If not,
> > what changes would need to be made to MTD and/or "of-flash" to create
> > such functionality?
> >
> > Cheers,
> > Kyle Moffett
> >
I believe the MTD layer would be happy, but it is beyond the scope of
the physmap_of driver. A look at drivers/mtd/maps/pcmciamtd.c shows
the concept of paging in a section of flash, although it has the advantage
of hardware to move the window instead of calling ioremap or swapping
translations. Another example is drivers/mtd/maps/pci.c, also with
hardware assist.
milton
^ permalink raw reply
* Re: [PATCH 18/26] KVM: PPC: KVM PV guest stubs
From: Alexander Graf @ 2010-06-28 6:33 UTC (permalink / raw)
To: Matt Evans; +Cc: linuxppc-dev, KVM list, kvm-ppc
In-Reply-To: <4C282794.1040209@ozlabs.org>
On 28.06.2010, at 06:39, Matt Evans wrote:
> Howdy Alex!
>=20
> Alexander Graf wrote:
>> We will soon start and replace instructions from the text section =
with
>> other, paravirtualized versions. To ease the readability of those =
patches
>> I split out the generic looping and magic page mapping code out.
>>=20
>> This patch still only contains stubs. But at least it loops through =
the
>> text section :).
>>=20
>> Signed-off-by: Alexander Graf <agraf@suse.de>
>> ---
>> arch/powerpc/kernel/kvm.c | 59 =
+++++++++++++++++++++++++++++++++++++++++++++
>> 1 files changed, 59 insertions(+), 0 deletions(-)
>>=20
>> diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
>> index 2d8dd73..d873bc6 100644
>> --- a/arch/powerpc/kernel/kvm.c
>> +++ b/arch/powerpc/kernel/kvm.c
>> @@ -32,3 +32,62 @@
>> #define KVM_MAGIC_PAGE (-4096L)
>> #define magic_var(x) KVM_MAGIC_PAGE + offsetof(struct =
kvm_vcpu_arch_shared, x)
>>=20
>> +static bool kvm_patching_worked =3D true;
>> +
>> +static void kvm_map_magic_page(void *data)
>> +{
>> + kvm_hypercall2(KVM_HC_PPC_MAP_MAGIC_PAGE,
>> + KVM_MAGIC_PAGE, /* Physical Address */
>> + KVM_MAGIC_PAGE); /* Effective Address */
>> +}
>> +
>> +static void kvm_check_ins(u32 *inst)
>> +{
>> + u32 _inst =3D *inst;
>> + u32 inst_no_rt =3D _inst & ~KVM_MASK_RT;
>> + u32 inst_rt =3D _inst & KVM_MASK_RT;
>> +
>> + switch (inst_no_rt) {
>> + }
>> +
>> + switch (_inst) {
>> + }
>> +
>> + flush_icache_range((ulong)inst, (ulong)inst + 4);
>> +}
>> +
>> +static void kvm_use_magic_page(void)
>> +{
>> + u32 *p;
>> + u32 *start, *end;
>> +
>> + /* Tell the host to map the magic page to -4096 on all CPUs */
>> +
>> + on_each_cpu(kvm_map_magic_page, NULL, 1);
>> +
>> + /* Now loop through all code and find instructions */
>> +
>> + start =3D (void*)_stext;
>> + end =3D (void*)_etext;
>> +
>> + for (p =3D start; p < end; p++)
>> + kvm_check_ins(p);
>> +}
>=20
> Could you do something similar in module_finalize() to patch loaded =
modules' .text sections?
I could, but do we need it? I objdump -d | grep'ed all my modules and =
didn't find any need to do so.
Alex
^ permalink raw reply
* Re: [PATCH v3 2/2] powerpc: add support for new hcall H_BEST_ENERGY
From: Michael Neuling @ 2010-06-28 6:11 UTC (permalink / raw)
To: svaidy; +Cc: Paul Mackerras, Anton Blanchard, linuxppc-dev
In-Reply-To: <20100628053252.GA12751@dirshya.in.ibm.com>
In message <20100628053252.GA12751@dirshya.in.ibm.com> you wrote:
> * Michael Neuling <mikey@neuling.org> [2010-06-28 11:44:31]:
>
> > Vaidy,
> >
> > > Create sysfs interface to export data from H_BEST_ENERGY hcall
> > > that can be used by administrative tools on supported pseries
> > > platforms for energy management optimizations.
> > >
> > > /sys/device/system/cpu/pseries_(de)activate_hint_list and
> > > /sys/device/system/cpu/cpuN/pseries_(de)activate_hint will provide
> > > hints for activation and deactivation of cpus respectively.
> > >
> > > Added new driver module
> > > arch/powerpc/platforms/pseries/pseries_energy.c
> > > under new config option CONFIG_PSERIES_ENERGY
> >
> > Can you provide some documentation on how to use these hints and what
> > format they are provided from sysfs. Looks like two separate interfaces
> > two the same thing (one a comma sep list and 1 per cpu, why do need
> > both?). What is the difference between activate and deactivate, with
> > out me having to read PAPR :-) ??
>
> Hi Mike,
>
> Thanks for reviewing this patch series. Sure, I can provide
> additional information.
>
> These hints are abstract number given by the hypervisor based on
> the extended knowledge the hypervisor has regarding the current system
> topology and resource mappings.
>
> The activate and the deactivate part is for the two distinct
> operations that we could do for energy savings. When we have more
> capacity than required, we could deactivate few core to save energy.
> The choice of the core to deactivate will be based on
> /sys/devices/system/cpu/deactivate_hint_list. The comma separated
> list of cpus (cores) will be the preferred choice.
>
> Once the system has few deactivated cores, based on workload demand we
> may have to activate them to meet the demand. In that case the
> /sys/devices/system/cpu/activate_hint_list will be used to prefer the
> core in-order among the deactivated cores.
>
> In simple terms, activate_hint_list will be null until we deactivate
> few cores. Then we could look at the corresponding list for
> activation or deactivation.
Can you put these details in the code and in the check-in comments.
>
> Regarding your second point, there is a reason for both a list and
> per-cpu interface. The list gives us a system wide list of cores in
> one shot for userspace to base their decision. This will be the
> preferred interface for most cases. On the other hand, per-cpu file
> /sys/device/system/cpu/cpuN/pseries_(de)activate_hint provide more
> information since it exports the hint value as such.
>
> The idea is that the list interface will be used to get a suggested
> list of cores to manage, while the per-cpu value can be used to
> further get fine grain information on a per-core bases from the
> hypervisor. This allows Linux to have access to all information that
> the hypervisor has offered through this hcall interface.
OK, I didn't realise that they contained different info. Just more
reasons that this interface needs better documentation :-)
Overall, I'm mostly happy with the interface. It's pretty light weight.
> > Other comments below.
> >
> > >
> > > Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> > > ---
> > > arch/powerpc/include/asm/hvcall.h | 3
> > > arch/powerpc/platforms/pseries/Kconfig | 10 +
> > > arch/powerpc/platforms/pseries/Makefile | 1
> > > arch/powerpc/platforms/pseries/pseries_energy.c | 258 +++++++++++++++++
++++
> > ++
> > > 4 files changed, 271 insertions(+), 1 deletions(-)
> > > create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c
> > >
> > > diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm
/hvc
> > all.h
> > > index 5119b7d..34b66e0 100644
> > > --- a/arch/powerpc/include/asm/hvcall.h
> > > +++ b/arch/powerpc/include/asm/hvcall.h
> > > @@ -231,7 +231,8 @@
> > > #define H_GET_EM_PARMS 0x2B8
> > > #define H_SET_MPP 0x2D0
> > > #define H_GET_MPP 0x2D4
> > > -#define MAX_HCALL_OPCODE H_GET_MPP
> > > +#define H_BEST_ENERGY 0x2F4
> > > +#define MAX_HCALL_OPCODE H_BEST_ENERGY
> > >
> > > #ifndef __ASSEMBLY__
> > >
> > > diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platfo
rms/
> > pseries/Kconfig
> > > index c667f0f..b3dd108 100644
> > > --- a/arch/powerpc/platforms/pseries/Kconfig
> > > +++ b/arch/powerpc/platforms/pseries/Kconfig
> > > @@ -33,6 +33,16 @@ config PSERIES_MSI
> > > depends on PCI_MSI && EEH
> > > default y
> > >
> > > +config PSERIES_ENERGY
> >
> > Probably need a less generic name. PSERIES_ENERGY_MANAGEMENT?
> > PSERIES_ENERGY_HOTPLUG_HINTS?
>
> PSERIES_ENERGY_MANAGEMENT may be good but too long for a config
> option.
>
> The idea is to collect all energy management functions in this module
> as and when new features are introduced in the pseries platform. This
> hcall interface is the first to be included, but going forward in
> future I do not propose to have different modules for other energy
> management related features.
>
> The name is specific enough for IBM pseries platform and energy
> management functions and enablements. Having less generic name below
> this level will make it difficult to add all varieties of energy
> management functions in future.
OK, I thought this might be the case but you never said. Please say
something like "This adds CONFIG_PSERIES_ENERGY which will be used for
future power saving code" or some such.
>
> > > + tristate "pseries energy management capabilities driver"
> > > + depends on PPC_PSERIES
> > > + default y
> > > + help
> > > + Provides interface to platform energy management capabilities
> > > + on supported PSERIES platforms.
> > > + Provides: /sys/devices/system/cpu/pseries_(de)activation_hint_list
> > > + and /sys/devices/system/cpu/cpuN/pseries_(de)activation_hint
> > > +
> > > config SCANLOG
> > > tristate "Scanlog dump interface"
> > > depends on RTAS_PROC && PPC_PSERIES
> > > diff --git a/arch/powerpc/platforms/pseries/Makefile b/arch/powerpc/platf
orms
> > /pseries/Makefile
> > > index 3dbef30..32ae72e 100644
> > > --- a/arch/powerpc/platforms/pseries/Makefile
> > > +++ b/arch/powerpc/platforms/pseries/Makefile
> > > @@ -16,6 +16,7 @@ obj-$(CONFIG_EEH) += eeh.o eeh_cache.o eeh_driver
.o eeh_e
> > vent.o eeh_sysfs.o
> > > obj-$(CONFIG_KEXEC) += kexec.o
> > > obj-$(CONFIG_PCI) += pci.o pci_dlpar.o
> > > obj-$(CONFIG_PSERIES_MSI) += msi.o
> > > +obj-$(CONFIG_PSERIES_ENERGY) += pseries_energy.o
> > >
> > > obj-$(CONFIG_HOTPLUG_CPU) += hotplug-cpu.o
> > > obj-$(CONFIG_MEMORY_HOTPLUG) += hotplug-memory.o
> > > diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c b/arch/power
pc/p
> > latforms/pseries/pseries_energy.c
> > > new file mode 100644
> > > index 0000000..9a936b1
> > > --- /dev/null
> > > +++ b/arch/powerpc/platforms/pseries/pseries_energy.c
> > > @@ -0,0 +1,258 @@
> > > +/*
> > > + * POWER platform energy management driver
> > > + * Copyright (C) 2010 IBM Corporation
> > > + *
> > > + * This program is free software; you can redistribute it and/or
> > > + * modify it under the terms of the GNU General Public License
> > > + * version 2 as published by the Free Software Foundation.
> > > + *
> > > + * This pseries platform device driver provides access to
> > > + * platform energy management capabilities.
> > > + */
> > > +
> > > +#include <linux/module.h>
> > > +#include <linux/types.h>
> > > +#include <linux/errno.h>
> > > +#include <linux/init.h>
> > > +#include <linux/seq_file.h>
> > > +#include <linux/sysdev.h>
> > > +#include <linux/cpu.h>
> > > +#include <linux/of.h>
> > > +#include <asm/cputhreads.h>
> > > +#include <asm/page.h>
> > > +#include <asm/hvcall.h>
> > > +
> > > +
> > > +#define MODULE_VERS "1.0"
> >
> > Argh, I hate module versions... but this one is less of an issue since
> > it doesn't seem to be being used anyway :-)
> >
> > > +#define MODULE_NAME "pseries_energy"
> >
> > Unused too.
>
> Yes. This keep the module template complete. No overhead as yet :)
> We will certainly need the MODULE_VERS in future if we add/change
> sysfs interfaces.
Argh, change sysfs interfaces?!? We don't even have the first one and
now we are going to change it? :-)
> > > +
> > > +/* Helper Routines to convert between drc_index to cpu numbers */
> > > +
> > > +static u32 cpu_to_drc_index(int cpu)
> > > +{
> > > + struct device_node *dn = NULL;
> > > + const int *indexes;
> > > + int i;
> > > + dn = of_find_node_by_path("/cpus");
> > > + if (dn == NULL)
> > > + goto err;
> >
> > Humm, I not sure this is really needed. If you don't have /cpus you are
> > probably not going to boot.
>
> Good suggestion. I could add all these checks in module_init. I was
> think if any of the functions being called is allocating memory and in
> case they fail, we need to abort.
>
> I just reviewed and look like of_find_node_by_path() will not sleep or
> allocate any memory. So if it succeeds once in module_init(), then it
> will never fail!
>
>
> > > + indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
> > > + if (indexes == NULL)
> > > + goto err;
> >
> > These checks should probably be moved to module init rather than /sfs
> > read time. If they fail, don't load the module and print a warning.
> >
> > These HCALLS and device-tree entire aren't going to be dynamic.
>
> Agreed. Only cause of runtime failure is OOM. If none of these
> allocate memory, moving these checks once at module_init() will be
> a good optimization.
Cool, thanks.
I also noticed you are doing this per cpu, so it's got the potential to
really suck on our big machines.
> But still I am wondering if it is worth the risk. These are not in
> hot paths and these are just quick null comparisons. Also in most other
> call sites, we do check for return values.
I found a /proc file the other day that took 60sec to read on a big
machine. The file wasn't a hot path, but 60sec was way too long.
Things get bad quickly on our big machines. I'd prefer it in module
init.
>
> > > + /* Convert logical cpu number to core number */
> > > + i = cpu_core_of_thread(cpu);
> > > + /*
> > > + * The first element indexes[0] is the number of drc_indexes
> > > + * returned in the list. Hence i+1 will get the drc_index
> > > + * corresponding to core number i.
> > > + */
> > > + WARN_ON(i > indexes[0]);
> > > + return indexes[i + 1];
> > > +err:
> > > + printk(KERN_WARNING "cpu_to_drc_index(%d) failed", cpu);
> > > + return 0;
> > > +}
> > > +
> > > +static int drc_index_to_cpu(u32 drc_index)
> > > +{
> > > + struct device_node *dn = NULL;
> > > + const int *indexes;
> > > + int i, cpu;
> > > + dn = of_find_node_by_path("/cpus");
> > > + if (dn == NULL)
> > > + goto err;
> >
> > same here
>
> agreed, comments mentioned above.
>
> > > + indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
> > > + if (indexes == NULL)
> > > + goto err;
> > > + /*
> > > + * First element in the array is the number of drc_indexes
> > > + * returned. Search through the list to find the matching
> > > + * drc_index and get the core number
> > > + */
> > > + for (i = 0; i < indexes[0]; i++) {
> > > + if (indexes[i + 1] == drc_index)
> > > + break;
> > > + }
> > > + /* Convert core number to logical cpu number */
> > > + cpu = cpu_first_thread_of_core(i);
> > > + return cpu;
> > > +err:
> > > + printk(KERN_WARNING "drc_index_to_cpu(%d) failed", drc_index);
> > > + return 0;
> > > +}
> > > +
> > > +/*
> > > + * pseries hypervisor call H_BEST_ENERGY provides hints to OS on
> > > + * preferred logical cpus to activate or deactivate for optimized
> > > + * energy consumption.
> > > + */
> > > +
> > > +#define FLAGS_MODE1 0x004E200000080E01
> > > +#define FLAGS_MODE2 0x004E200000080401
> > > +#define FLAGS_ACTIVATE 0x100
> > > +
> > > +static ssize_t get_best_energy_list(char *page, int activate)
> > > +{
> > > + int rc, cnt, i, cpu;
> > > + unsigned long retbuf[PLPAR_HCALL9_BUFSIZE];
> > > + unsigned long flags = 0;
> > > + u32 *buf_page;
> > > + char *s = page;
> > > +
> > > + buf_page = (u32 *) get_zeroed_page(GFP_KERNEL);
> > > + if (!buf_page)
> > > + return -ENOMEM;
> > > +
> > > + flags = FLAGS_MODE1;
> > > + if (activate)
> > > + flags |= FLAGS_ACTIVATE;
> > > +
> > > + rc = plpar_hcall9(H_BEST_ENERGY, retbuf, flags, 0, __pa(buf_page),
> > > + 0, 0, 0, 0, 0, 0);
> > > + if (rc != H_SUCCESS) {
> > > + free_page((unsigned long) buf_page);
> > > + return -EINVAL;
> > > + }
> > > +
> > > + cnt = retbuf[0];
> > > + for (i = 0; i < cnt; i++) {
> > > + cpu = drc_index_to_cpu(buf_page[2*i+1]);
> > > + if ((cpu_online(cpu) && !activate) ||
> > > + (!cpu_online(cpu) && activate))
> > > + s += sprintf(s, "%d,", cpu);
> > > + }
> > > + if (s > page) { /* Something to show */
> > > + s--; /* Suppress last comma */
> > > + s += sprintf(s, "\n");
> > > + }
> > > +
> > > + free_page((unsigned long) buf_page);
> > > + return s-page;
> > > +}
> > > +
> > > +static ssize_t get_best_energy_data(struct sys_device *dev,
> > > + char *page, int activate)
> > > +{
> > > + int rc;
> > > + unsigned long retbuf[PLPAR_HCALL9_BUFSIZE];
> > > + unsigned long flags = 0;
> > > +
> > > + flags = FLAGS_MODE2;
> > > + if (activate)
> > > + flags |= FLAGS_ACTIVATE;
> > > +
> > > + rc = plpar_hcall9(H_BEST_ENERGY, retbuf, flags,
> > > + cpu_to_drc_index(dev->id),
> > > + 0, 0, 0, 0, 0, 0, 0);
> > > +
> > > + if (rc != H_SUCCESS)
> > > + return -EINVAL;
> > > +
> > > + return sprintf(page, "%lu\n", retbuf[1] >> 32);
> > > +}
> > > +
> > > +/* Wrapper functions */
> > > +
> > > +static ssize_t cpu_activate_hint_list_show(struct sysdev_class *class,
> > > + struct sysdev_class_attribute *attr, char *page)
> > > +{
> > > + return get_best_energy_list(page, 1);
> > > +}
> > > +
> > > +static ssize_t cpu_deactivate_hint_list_show(struct sysdev_class *class,
> > > + struct sysdev_class_attribute *attr, char *page)
> > > +{
> > > + return get_best_energy_list(page, 0);
> > > +}
> > > +
> > > +static ssize_t percpu_activate_hint_show(struct sys_device *dev,
> > > + struct sysdev_attribute *attr, char *page)
> > > +{
> > > + return get_best_energy_data(dev, page, 1);
> > > +}
> > > +
> > > +static ssize_t percpu_deactivate_hint_show(struct sys_device *dev,
> > > + struct sysdev_attribute *attr, char *page)
> > > +{
> > > + return get_best_energy_data(dev, page, 0);
> > > +}
> > > +
> > > +/*
> > > + * Create sysfs interface:
> > > + * /sys/devices/system/cpu/pseries_activate_hint_list
> > > + * /sys/devices/system/cpu/pseries_deactivate_hint_list
> > > + * Comma separated list of cpus to activate or deactivate
> > > + * /sys/devices/system/cpu/cpuN/pseries_activate_hint
> > > + * /sys/devices/system/cpu/cpuN/pseries_deactivate_hint
> > > + * Per-cpu value of the hint
> >
> > Do we really need both interfaces? Seems like awk could generate one
> > from the other in userspace?
>
> Yes, it is possible, but will not scale. Generating a list from set
> of per-cpu values is possible but will be too much overhead to build
> the list. Having the list interface and deleting the per-cpu ones
> will reduce information available from hypervisor. Having both the
> interface is a good balance between amount of information exported and
> quick access to a consolidated view.
OK.
"pseries_activate_hint" doesn't say what it's hinting about. Can we
change the name to add it being a power save hint? We might need to add
other hints later. Performance hint etc..
>
> > > + */
> > > +
> > > +struct sysdev_class_attribute attr_cpu_activate_hint_list =
> > > + _SYSDEV_CLASS_ATTR(pseries_activate_hint_list, 0444,
> > > + cpu_activate_hint_list_show, NULL);
> > > +
> > > +struct sysdev_class_attribute attr_cpu_deactivate_hint_list =
> > > + _SYSDEV_CLASS_ATTR(pseries_deactivate_hint_list, 0444,
> > > + cpu_deactivate_hint_list_show, NULL);
> > > +
> > > +struct sysdev_attribute attr_percpu_activate_hint =
> > > + _SYSDEV_ATTR(pseries_activate_hint, 0444,
> > > + percpu_activate_hint_show, NULL);
> > > +
> > > +struct sysdev_attribute attr_percpu_deactivate_hint =
> > > + _SYSDEV_ATTR(pseries_deactivate_hint, 0444,
> > > + percpu_deactivate_hint_show, NULL);
> >
> > > +
> > > +static int __init pseries_energy_init(void)
> > > +{
> > > + int cpu, err;
> > > + struct sys_device *cpu_sys_dev;
> > > +
> > > + /* Create the sysfs files */
> > > + err = sysfs_create_file(&cpu_sysdev_class.kset.kobj,
> > > + &attr_cpu_activate_hint_list.attr);
> > > + if (!err)
> > > + err = sysfs_create_file(&cpu_sysdev_class.kset.kobj,
> > > + &attr_cpu_deactivate_hint_list.attr);
> > > +
> > > + for_each_possible_cpu(cpu) {
> > > + cpu_sys_dev = get_cpu_sysdev(cpu);
> > > + err = sysfs_create_file(&cpu_sys_dev->kobj,
> > > + &attr_percpu_activate_hint.attr);
> > > + if (err)
> > > + break;
> > > + err = sysfs_create_file(&cpu_sys_dev->kobj,
> > > + &attr_percpu_deactivate_hint.attr);
> > > + if (err)
> > > + break;
> > > + }
> > > + return err;
> > > +
> > > +}
> > > +
> > > +static void __exit pseries_energy_cleanup(void)
> > > +{
> > > + int cpu;
> > > + struct sys_device *cpu_sys_dev;
> > > +
> > > + /* Remove the sysfs files */
> > > + sysfs_remove_file(&cpu_sysdev_class.kset.kobj,
> > > + &attr_cpu_activate_hint_list.attr);
> > > +
> > > + sysfs_remove_file(&cpu_sysdev_class.kset.kobj,
> > > + &attr_cpu_deactivate_hint_list.attr);
> > > +
> > > + for_each_possible_cpu(cpu) {
> > > + cpu_sys_dev = get_cpu_sysdev(cpu);
> > > + sysfs_remove_file(&cpu_sys_dev->kobj,
> > > + &attr_percpu_activate_hint.attr);
> > > + sysfs_remove_file(&cpu_sys_dev->kobj,
> > > + &attr_percpu_deactivate_hint.attr);
> > > + }
> > > +}
> > > +
> > > +module_init(pseries_energy_init);
> > > +module_exit(pseries_energy_cleanup);
> > > +MODULE_DESCRIPTION("Driver for pseries platform energy management");
> >
> > Needs a less generic description.
>
> Explained above.
>
> Thanks,
> Vaidy
>
^ permalink raw reply
* Re: [PATCH v3 2/2] powerpc: add support for new hcall H_BEST_ENERGY
From: Vaidyanathan Srinivasan @ 2010-06-28 5:33 UTC (permalink / raw)
To: Michael Neuling; +Cc: Paul Mackerras, Anton Blanchard, linuxppc-dev
In-Reply-To: <25049.1277689471@neuling.org>
* Michael Neuling <mikey@neuling.org> [2010-06-28 11:44:31]:
> Vaidy,
>
> > Create sysfs interface to export data from H_BEST_ENERGY hcall
> > that can be used by administrative tools on supported pseries
> > platforms for energy management optimizations.
> >
> > /sys/device/system/cpu/pseries_(de)activate_hint_list and
> > /sys/device/system/cpu/cpuN/pseries_(de)activate_hint will provide
> > hints for activation and deactivation of cpus respectively.
> >
> > Added new driver module
> > arch/powerpc/platforms/pseries/pseries_energy.c
> > under new config option CONFIG_PSERIES_ENERGY
>
> Can you provide some documentation on how to use these hints and what
> format they are provided from sysfs. Looks like two separate interfaces
> two the same thing (one a comma sep list and 1 per cpu, why do need
> both?). What is the difference between activate and deactivate, with
> out me having to read PAPR :-) ??
Hi Mike,
Thanks for reviewing this patch series. Sure, I can provide
additional information.
These hints are abstract number given by the hypervisor based on
the extended knowledge the hypervisor has regarding the current system
topology and resource mappings.
The activate and the deactivate part is for the two distinct
operations that we could do for energy savings. When we have more
capacity than required, we could deactivate few core to save energy.
The choice of the core to deactivate will be based on
/sys/devices/system/cpu/deactivate_hint_list. The comma separated
list of cpus (cores) will be the preferred choice.
Once the system has few deactivated cores, based on workload demand we
may have to activate them to meet the demand. In that case the
/sys/devices/system/cpu/activate_hint_list will be used to prefer the
core in-order among the deactivated cores.
In simple terms, activate_hint_list will be null until we deactivate
few cores. Then we could look at the corresponding list for
activation or deactivation.
Regarding your second point, there is a reason for both a list and
per-cpu interface. The list gives us a system wide list of cores in
one shot for userspace to base their decision. This will be the
preferred interface for most cases. On the other hand, per-cpu file
/sys/device/system/cpu/cpuN/pseries_(de)activate_hint provide more
information since it exports the hint value as such.
The idea is that the list interface will be used to get a suggested
list of cores to manage, while the per-cpu value can be used to
further get fine grain information on a per-core bases from the
hypervisor. This allows Linux to have access to all information that
the hypervisor has offered through this hcall interface.
> Other comments below.
>
> >
> > Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> > ---
> > arch/powerpc/include/asm/hvcall.h | 3
> > arch/powerpc/platforms/pseries/Kconfig | 10 +
> > arch/powerpc/platforms/pseries/Makefile | 1
> > arch/powerpc/platforms/pseries/pseries_energy.c | 258 +++++++++++++++++++++
> ++
> > 4 files changed, 271 insertions(+), 1 deletions(-)
> > create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c
> >
> > diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvc
> all.h
> > index 5119b7d..34b66e0 100644
> > --- a/arch/powerpc/include/asm/hvcall.h
> > +++ b/arch/powerpc/include/asm/hvcall.h
> > @@ -231,7 +231,8 @@
> > #define H_GET_EM_PARMS 0x2B8
> > #define H_SET_MPP 0x2D0
> > #define H_GET_MPP 0x2D4
> > -#define MAX_HCALL_OPCODE H_GET_MPP
> > +#define H_BEST_ENERGY 0x2F4
> > +#define MAX_HCALL_OPCODE H_BEST_ENERGY
> >
> > #ifndef __ASSEMBLY__
> >
> > diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/
> pseries/Kconfig
> > index c667f0f..b3dd108 100644
> > --- a/arch/powerpc/platforms/pseries/Kconfig
> > +++ b/arch/powerpc/platforms/pseries/Kconfig
> > @@ -33,6 +33,16 @@ config PSERIES_MSI
> > depends on PCI_MSI && EEH
> > default y
> >
> > +config PSERIES_ENERGY
>
> Probably need a less generic name. PSERIES_ENERGY_MANAGEMENT?
> PSERIES_ENERGY_HOTPLUG_HINTS?
PSERIES_ENERGY_MANAGEMENT may be good but too long for a config
option.
The idea is to collect all energy management functions in this module
as and when new features are introduced in the pseries platform. This
hcall interface is the first to be included, but going forward in
future I do not propose to have different modules for other energy
management related features.
The name is specific enough for IBM pseries platform and energy
management functions and enablements. Having less generic name below
this level will make it difficult to add all varieties of energy
management functions in future.
> > + tristate "pseries energy management capabilities driver"
> > + depends on PPC_PSERIES
> > + default y
> > + help
> > + Provides interface to platform energy management capabilities
> > + on supported PSERIES platforms.
> > + Provides: /sys/devices/system/cpu/pseries_(de)activation_hint_list
> > + and /sys/devices/system/cpu/cpuN/pseries_(de)activation_hint
> > +
> > config SCANLOG
> > tristate "Scanlog dump interface"
> > depends on RTAS_PROC && PPC_PSERIES
> > diff --git a/arch/powerpc/platforms/pseries/Makefile b/arch/powerpc/platforms
> /pseries/Makefile
> > index 3dbef30..32ae72e 100644
> > --- a/arch/powerpc/platforms/pseries/Makefile
> > +++ b/arch/powerpc/platforms/pseries/Makefile
> > @@ -16,6 +16,7 @@ obj-$(CONFIG_EEH) += eeh.o eeh_cache.o eeh_driver.o eeh_e
> vent.o eeh_sysfs.o
> > obj-$(CONFIG_KEXEC) += kexec.o
> > obj-$(CONFIG_PCI) += pci.o pci_dlpar.o
> > obj-$(CONFIG_PSERIES_MSI) += msi.o
> > +obj-$(CONFIG_PSERIES_ENERGY) += pseries_energy.o
> >
> > obj-$(CONFIG_HOTPLUG_CPU) += hotplug-cpu.o
> > obj-$(CONFIG_MEMORY_HOTPLUG) += hotplug-memory.o
> > diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c b/arch/powerpc/p
> latforms/pseries/pseries_energy.c
> > new file mode 100644
> > index 0000000..9a936b1
> > --- /dev/null
> > +++ b/arch/powerpc/platforms/pseries/pseries_energy.c
> > @@ -0,0 +1,258 @@
> > +/*
> > + * POWER platform energy management driver
> > + * Copyright (C) 2010 IBM Corporation
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * version 2 as published by the Free Software Foundation.
> > + *
> > + * This pseries platform device driver provides access to
> > + * platform energy management capabilities.
> > + */
> > +
> > +#include <linux/module.h>
> > +#include <linux/types.h>
> > +#include <linux/errno.h>
> > +#include <linux/init.h>
> > +#include <linux/seq_file.h>
> > +#include <linux/sysdev.h>
> > +#include <linux/cpu.h>
> > +#include <linux/of.h>
> > +#include <asm/cputhreads.h>
> > +#include <asm/page.h>
> > +#include <asm/hvcall.h>
> > +
> > +
> > +#define MODULE_VERS "1.0"
>
> Argh, I hate module versions... but this one is less of an issue since
> it doesn't seem to be being used anyway :-)
>
> > +#define MODULE_NAME "pseries_energy"
>
> Unused too.
Yes. This keep the module template complete. No overhead as yet :)
We will certainly need the MODULE_VERS in future if we add/change
sysfs interfaces.
> > +
> > +/* Helper Routines to convert between drc_index to cpu numbers */
> > +
> > +static u32 cpu_to_drc_index(int cpu)
> > +{
> > + struct device_node *dn = NULL;
> > + const int *indexes;
> > + int i;
> > + dn = of_find_node_by_path("/cpus");
> > + if (dn == NULL)
> > + goto err;
>
> Humm, I not sure this is really needed. If you don't have /cpus you are
> probably not going to boot.
Good suggestion. I could add all these checks in module_init. I was
think if any of the functions being called is allocating memory and in
case they fail, we need to abort.
I just reviewed and look like of_find_node_by_path() will not sleep or
allocate any memory. So if it succeeds once in module_init(), then it
will never fail!
> > + indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
> > + if (indexes == NULL)
> > + goto err;
>
> These checks should probably be moved to module init rather than /sfs
> read time. If they fail, don't load the module and print a warning.
>
> These HCALLS and device-tree entire aren't going to be dynamic.
Agreed. Only cause of runtime failure is OOM. If none of these
allocate memory, moving these checks once at module_init() will be
a good optimization.
But still I am wondering if it is worth the risk. These are not in
hot paths and these are just quick null comparisons. Also in most other
call sites, we do check for return values.
> > + /* Convert logical cpu number to core number */
> > + i = cpu_core_of_thread(cpu);
> > + /*
> > + * The first element indexes[0] is the number of drc_indexes
> > + * returned in the list. Hence i+1 will get the drc_index
> > + * corresponding to core number i.
> > + */
> > + WARN_ON(i > indexes[0]);
> > + return indexes[i + 1];
> > +err:
> > + printk(KERN_WARNING "cpu_to_drc_index(%d) failed", cpu);
> > + return 0;
> > +}
> > +
> > +static int drc_index_to_cpu(u32 drc_index)
> > +{
> > + struct device_node *dn = NULL;
> > + const int *indexes;
> > + int i, cpu;
> > + dn = of_find_node_by_path("/cpus");
> > + if (dn == NULL)
> > + goto err;
>
> same here
agreed, comments mentioned above.
> > + indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
> > + if (indexes == NULL)
> > + goto err;
> > + /*
> > + * First element in the array is the number of drc_indexes
> > + * returned. Search through the list to find the matching
> > + * drc_index and get the core number
> > + */
> > + for (i = 0; i < indexes[0]; i++) {
> > + if (indexes[i + 1] == drc_index)
> > + break;
> > + }
> > + /* Convert core number to logical cpu number */
> > + cpu = cpu_first_thread_of_core(i);
> > + return cpu;
> > +err:
> > + printk(KERN_WARNING "drc_index_to_cpu(%d) failed", drc_index);
> > + return 0;
> > +}
> > +
> > +/*
> > + * pseries hypervisor call H_BEST_ENERGY provides hints to OS on
> > + * preferred logical cpus to activate or deactivate for optimized
> > + * energy consumption.
> > + */
> > +
> > +#define FLAGS_MODE1 0x004E200000080E01
> > +#define FLAGS_MODE2 0x004E200000080401
> > +#define FLAGS_ACTIVATE 0x100
> > +
> > +static ssize_t get_best_energy_list(char *page, int activate)
> > +{
> > + int rc, cnt, i, cpu;
> > + unsigned long retbuf[PLPAR_HCALL9_BUFSIZE];
> > + unsigned long flags = 0;
> > + u32 *buf_page;
> > + char *s = page;
> > +
> > + buf_page = (u32 *) get_zeroed_page(GFP_KERNEL);
> > + if (!buf_page)
> > + return -ENOMEM;
> > +
> > + flags = FLAGS_MODE1;
> > + if (activate)
> > + flags |= FLAGS_ACTIVATE;
> > +
> > + rc = plpar_hcall9(H_BEST_ENERGY, retbuf, flags, 0, __pa(buf_page),
> > + 0, 0, 0, 0, 0, 0);
> > + if (rc != H_SUCCESS) {
> > + free_page((unsigned long) buf_page);
> > + return -EINVAL;
> > + }
> > +
> > + cnt = retbuf[0];
> > + for (i = 0; i < cnt; i++) {
> > + cpu = drc_index_to_cpu(buf_page[2*i+1]);
> > + if ((cpu_online(cpu) && !activate) ||
> > + (!cpu_online(cpu) && activate))
> > + s += sprintf(s, "%d,", cpu);
> > + }
> > + if (s > page) { /* Something to show */
> > + s--; /* Suppress last comma */
> > + s += sprintf(s, "\n");
> > + }
> > +
> > + free_page((unsigned long) buf_page);
> > + return s-page;
> > +}
> > +
> > +static ssize_t get_best_energy_data(struct sys_device *dev,
> > + char *page, int activate)
> > +{
> > + int rc;
> > + unsigned long retbuf[PLPAR_HCALL9_BUFSIZE];
> > + unsigned long flags = 0;
> > +
> > + flags = FLAGS_MODE2;
> > + if (activate)
> > + flags |= FLAGS_ACTIVATE;
> > +
> > + rc = plpar_hcall9(H_BEST_ENERGY, retbuf, flags,
> > + cpu_to_drc_index(dev->id),
> > + 0, 0, 0, 0, 0, 0, 0);
> > +
> > + if (rc != H_SUCCESS)
> > + return -EINVAL;
> > +
> > + return sprintf(page, "%lu\n", retbuf[1] >> 32);
> > +}
> > +
> > +/* Wrapper functions */
> > +
> > +static ssize_t cpu_activate_hint_list_show(struct sysdev_class *class,
> > + struct sysdev_class_attribute *attr, char *page)
> > +{
> > + return get_best_energy_list(page, 1);
> > +}
> > +
> > +static ssize_t cpu_deactivate_hint_list_show(struct sysdev_class *class,
> > + struct sysdev_class_attribute *attr, char *page)
> > +{
> > + return get_best_energy_list(page, 0);
> > +}
> > +
> > +static ssize_t percpu_activate_hint_show(struct sys_device *dev,
> > + struct sysdev_attribute *attr, char *page)
> > +{
> > + return get_best_energy_data(dev, page, 1);
> > +}
> > +
> > +static ssize_t percpu_deactivate_hint_show(struct sys_device *dev,
> > + struct sysdev_attribute *attr, char *page)
> > +{
> > + return get_best_energy_data(dev, page, 0);
> > +}
> > +
> > +/*
> > + * Create sysfs interface:
> > + * /sys/devices/system/cpu/pseries_activate_hint_list
> > + * /sys/devices/system/cpu/pseries_deactivate_hint_list
> > + * Comma separated list of cpus to activate or deactivate
> > + * /sys/devices/system/cpu/cpuN/pseries_activate_hint
> > + * /sys/devices/system/cpu/cpuN/pseries_deactivate_hint
> > + * Per-cpu value of the hint
>
> Do we really need both interfaces? Seems like awk could generate one
> from the other in userspace?
Yes, it is possible, but will not scale. Generating a list from set
of per-cpu values is possible but will be too much overhead to build
the list. Having the list interface and deleting the per-cpu ones
will reduce information available from hypervisor. Having both the
interface is a good balance between amount of information exported and
quick access to a consolidated view.
> > + */
> > +
> > +struct sysdev_class_attribute attr_cpu_activate_hint_list =
> > + _SYSDEV_CLASS_ATTR(pseries_activate_hint_list, 0444,
> > + cpu_activate_hint_list_show, NULL);
> > +
> > +struct sysdev_class_attribute attr_cpu_deactivate_hint_list =
> > + _SYSDEV_CLASS_ATTR(pseries_deactivate_hint_list, 0444,
> > + cpu_deactivate_hint_list_show, NULL);
> > +
> > +struct sysdev_attribute attr_percpu_activate_hint =
> > + _SYSDEV_ATTR(pseries_activate_hint, 0444,
> > + percpu_activate_hint_show, NULL);
> > +
> > +struct sysdev_attribute attr_percpu_deactivate_hint =
> > + _SYSDEV_ATTR(pseries_deactivate_hint, 0444,
> > + percpu_deactivate_hint_show, NULL);
>
> > +
> > +static int __init pseries_energy_init(void)
> > +{
> > + int cpu, err;
> > + struct sys_device *cpu_sys_dev;
> > +
> > + /* Create the sysfs files */
> > + err = sysfs_create_file(&cpu_sysdev_class.kset.kobj,
> > + &attr_cpu_activate_hint_list.attr);
> > + if (!err)
> > + err = sysfs_create_file(&cpu_sysdev_class.kset.kobj,
> > + &attr_cpu_deactivate_hint_list.attr);
> > +
> > + for_each_possible_cpu(cpu) {
> > + cpu_sys_dev = get_cpu_sysdev(cpu);
> > + err = sysfs_create_file(&cpu_sys_dev->kobj,
> > + &attr_percpu_activate_hint.attr);
> > + if (err)
> > + break;
> > + err = sysfs_create_file(&cpu_sys_dev->kobj,
> > + &attr_percpu_deactivate_hint.attr);
> > + if (err)
> > + break;
> > + }
> > + return err;
> > +
> > +}
> > +
> > +static void __exit pseries_energy_cleanup(void)
> > +{
> > + int cpu;
> > + struct sys_device *cpu_sys_dev;
> > +
> > + /* Remove the sysfs files */
> > + sysfs_remove_file(&cpu_sysdev_class.kset.kobj,
> > + &attr_cpu_activate_hint_list.attr);
> > +
> > + sysfs_remove_file(&cpu_sysdev_class.kset.kobj,
> > + &attr_cpu_deactivate_hint_list.attr);
> > +
> > + for_each_possible_cpu(cpu) {
> > + cpu_sys_dev = get_cpu_sysdev(cpu);
> > + sysfs_remove_file(&cpu_sys_dev->kobj,
> > + &attr_percpu_activate_hint.attr);
> > + sysfs_remove_file(&cpu_sys_dev->kobj,
> > + &attr_percpu_deactivate_hint.attr);
> > + }
> > +}
> > +
> > +module_init(pseries_energy_init);
> > +module_exit(pseries_energy_cleanup);
> > +MODULE_DESCRIPTION("Driver for pseries platform energy management");
>
> Needs a less generic description.
Explained above.
Thanks,
Vaidy
^ permalink raw reply
* Re: [PATCH v3 2/2] powerpc: add support for new hcall H_BEST_ENERGY
From: Michael Neuling @ 2010-06-28 1:44 UTC (permalink / raw)
To: Vaidyanathan Srinivasan; +Cc: Paul Mackerras, Anton Blanchard, linuxppc-dev
In-Reply-To: <20100623060415.4957.24478.stgit@drishya.in.ibm.com>
Vaidy,
> Create sysfs interface to export data from H_BEST_ENERGY hcall
> that can be used by administrative tools on supported pseries
> platforms for energy management optimizations.
>
> /sys/device/system/cpu/pseries_(de)activate_hint_list and
> /sys/device/system/cpu/cpuN/pseries_(de)activate_hint will provide
> hints for activation and deactivation of cpus respectively.
>
> Added new driver module
> arch/powerpc/platforms/pseries/pseries_energy.c
> under new config option CONFIG_PSERIES_ENERGY
Can you provide some documentation on how to use these hints and what
format they are provided from sysfs. Looks like two separate interfaces
two the same thing (one a comma sep list and 1 per cpu, why do need
both?). What is the difference between activate and deactivate, with
out me having to read PAPR :-) ??
Other comments below.
>
> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> ---
> arch/powerpc/include/asm/hvcall.h | 3
> arch/powerpc/platforms/pseries/Kconfig | 10 +
> arch/powerpc/platforms/pseries/Makefile | 1
> arch/powerpc/platforms/pseries/pseries_energy.c | 258 +++++++++++++++++++++
++
> 4 files changed, 271 insertions(+), 1 deletions(-)
> create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c
>
> diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvc
all.h
> index 5119b7d..34b66e0 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -231,7 +231,8 @@
> #define H_GET_EM_PARMS 0x2B8
> #define H_SET_MPP 0x2D0
> #define H_GET_MPP 0x2D4
> -#define MAX_HCALL_OPCODE H_GET_MPP
> +#define H_BEST_ENERGY 0x2F4
> +#define MAX_HCALL_OPCODE H_BEST_ENERGY
>
> #ifndef __ASSEMBLY__
>
> diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/
pseries/Kconfig
> index c667f0f..b3dd108 100644
> --- a/arch/powerpc/platforms/pseries/Kconfig
> +++ b/arch/powerpc/platforms/pseries/Kconfig
> @@ -33,6 +33,16 @@ config PSERIES_MSI
> depends on PCI_MSI && EEH
> default y
>
> +config PSERIES_ENERGY
Probably need a less generic name. PSERIES_ENERGY_MANAGEMENT?
PSERIES_ENERGY_HOTPLUG_HINTS?
> + tristate "pseries energy management capabilities driver"
> + depends on PPC_PSERIES
> + default y
> + help
> + Provides interface to platform energy management capabilities
> + on supported PSERIES platforms.
> + Provides: /sys/devices/system/cpu/pseries_(de)activation_hint_list
> + and /sys/devices/system/cpu/cpuN/pseries_(de)activation_hint
> +
> config SCANLOG
> tristate "Scanlog dump interface"
> depends on RTAS_PROC && PPC_PSERIES
> diff --git a/arch/powerpc/platforms/pseries/Makefile b/arch/powerpc/platforms
/pseries/Makefile
> index 3dbef30..32ae72e 100644
> --- a/arch/powerpc/platforms/pseries/Makefile
> +++ b/arch/powerpc/platforms/pseries/Makefile
> @@ -16,6 +16,7 @@ obj-$(CONFIG_EEH) += eeh.o eeh_cache.o eeh_driver.o eeh_e
vent.o eeh_sysfs.o
> obj-$(CONFIG_KEXEC) += kexec.o
> obj-$(CONFIG_PCI) += pci.o pci_dlpar.o
> obj-$(CONFIG_PSERIES_MSI) += msi.o
> +obj-$(CONFIG_PSERIES_ENERGY) += pseries_energy.o
>
> obj-$(CONFIG_HOTPLUG_CPU) += hotplug-cpu.o
> obj-$(CONFIG_MEMORY_HOTPLUG) += hotplug-memory.o
> diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c b/arch/powerpc/p
latforms/pseries/pseries_energy.c
> new file mode 100644
> index 0000000..9a936b1
> --- /dev/null
> +++ b/arch/powerpc/platforms/pseries/pseries_energy.c
> @@ -0,0 +1,258 @@
> +/*
> + * POWER platform energy management driver
> + * Copyright (C) 2010 IBM Corporation
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * version 2 as published by the Free Software Foundation.
> + *
> + * This pseries platform device driver provides access to
> + * platform energy management capabilities.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/types.h>
> +#include <linux/errno.h>
> +#include <linux/init.h>
> +#include <linux/seq_file.h>
> +#include <linux/sysdev.h>
> +#include <linux/cpu.h>
> +#include <linux/of.h>
> +#include <asm/cputhreads.h>
> +#include <asm/page.h>
> +#include <asm/hvcall.h>
> +
> +
> +#define MODULE_VERS "1.0"
Argh, I hate module versions... but this one is less of an issue since
it doesn't seem to be being used anyway :-)
> +#define MODULE_NAME "pseries_energy"
Unused too.
> +
> +/* Helper Routines to convert between drc_index to cpu numbers */
> +
> +static u32 cpu_to_drc_index(int cpu)
> +{
> + struct device_node *dn = NULL;
> + const int *indexes;
> + int i;
> + dn = of_find_node_by_path("/cpus");
> + if (dn == NULL)
> + goto err;
Humm, I not sure this is really needed. If you don't have /cpus you are
probably not going to boot.
> + indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
> + if (indexes == NULL)
> + goto err;
These checks should probably be moved to module init rather than /sfs
read time. If they fail, don't load the module and print a warning.
These HCALLS and device-tree entire aren't going to be dynamic.
> + /* Convert logical cpu number to core number */
> + i = cpu_core_of_thread(cpu);
> + /*
> + * The first element indexes[0] is the number of drc_indexes
> + * returned in the list. Hence i+1 will get the drc_index
> + * corresponding to core number i.
> + */
> + WARN_ON(i > indexes[0]);
> + return indexes[i + 1];
> +err:
> + printk(KERN_WARNING "cpu_to_drc_index(%d) failed", cpu);
> + return 0;
> +}
> +
> +static int drc_index_to_cpu(u32 drc_index)
> +{
> + struct device_node *dn = NULL;
> + const int *indexes;
> + int i, cpu;
> + dn = of_find_node_by_path("/cpus");
> + if (dn == NULL)
> + goto err;
same here
> + indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
> + if (indexes == NULL)
> + goto err;
> + /*
> + * First element in the array is the number of drc_indexes
> + * returned. Search through the list to find the matching
> + * drc_index and get the core number
> + */
> + for (i = 0; i < indexes[0]; i++) {
> + if (indexes[i + 1] == drc_index)
> + break;
> + }
> + /* Convert core number to logical cpu number */
> + cpu = cpu_first_thread_of_core(i);
> + return cpu;
> +err:
> + printk(KERN_WARNING "drc_index_to_cpu(%d) failed", drc_index);
> + return 0;
> +}
> +
> +/*
> + * pseries hypervisor call H_BEST_ENERGY provides hints to OS on
> + * preferred logical cpus to activate or deactivate for optimized
> + * energy consumption.
> + */
> +
> +#define FLAGS_MODE1 0x004E200000080E01
> +#define FLAGS_MODE2 0x004E200000080401
> +#define FLAGS_ACTIVATE 0x100
> +
> +static ssize_t get_best_energy_list(char *page, int activate)
> +{
> + int rc, cnt, i, cpu;
> + unsigned long retbuf[PLPAR_HCALL9_BUFSIZE];
> + unsigned long flags = 0;
> + u32 *buf_page;
> + char *s = page;
> +
> + buf_page = (u32 *) get_zeroed_page(GFP_KERNEL);
> + if (!buf_page)
> + return -ENOMEM;
> +
> + flags = FLAGS_MODE1;
> + if (activate)
> + flags |= FLAGS_ACTIVATE;
> +
> + rc = plpar_hcall9(H_BEST_ENERGY, retbuf, flags, 0, __pa(buf_page),
> + 0, 0, 0, 0, 0, 0);
> + if (rc != H_SUCCESS) {
> + free_page((unsigned long) buf_page);
> + return -EINVAL;
> + }
> +
> + cnt = retbuf[0];
> + for (i = 0; i < cnt; i++) {
> + cpu = drc_index_to_cpu(buf_page[2*i+1]);
> + if ((cpu_online(cpu) && !activate) ||
> + (!cpu_online(cpu) && activate))
> + s += sprintf(s, "%d,", cpu);
> + }
> + if (s > page) { /* Something to show */
> + s--; /* Suppress last comma */
> + s += sprintf(s, "\n");
> + }
> +
> + free_page((unsigned long) buf_page);
> + return s-page;
> +}
> +
> +static ssize_t get_best_energy_data(struct sys_device *dev,
> + char *page, int activate)
> +{
> + int rc;
> + unsigned long retbuf[PLPAR_HCALL9_BUFSIZE];
> + unsigned long flags = 0;
> +
> + flags = FLAGS_MODE2;
> + if (activate)
> + flags |= FLAGS_ACTIVATE;
> +
> + rc = plpar_hcall9(H_BEST_ENERGY, retbuf, flags,
> + cpu_to_drc_index(dev->id),
> + 0, 0, 0, 0, 0, 0, 0);
> +
> + if (rc != H_SUCCESS)
> + return -EINVAL;
> +
> + return sprintf(page, "%lu\n", retbuf[1] >> 32);
> +}
> +
> +/* Wrapper functions */
> +
> +static ssize_t cpu_activate_hint_list_show(struct sysdev_class *class,
> + struct sysdev_class_attribute *attr, char *page)
> +{
> + return get_best_energy_list(page, 1);
> +}
> +
> +static ssize_t cpu_deactivate_hint_list_show(struct sysdev_class *class,
> + struct sysdev_class_attribute *attr, char *page)
> +{
> + return get_best_energy_list(page, 0);
> +}
> +
> +static ssize_t percpu_activate_hint_show(struct sys_device *dev,
> + struct sysdev_attribute *attr, char *page)
> +{
> + return get_best_energy_data(dev, page, 1);
> +}
> +
> +static ssize_t percpu_deactivate_hint_show(struct sys_device *dev,
> + struct sysdev_attribute *attr, char *page)
> +{
> + return get_best_energy_data(dev, page, 0);
> +}
> +
> +/*
> + * Create sysfs interface:
> + * /sys/devices/system/cpu/pseries_activate_hint_list
> + * /sys/devices/system/cpu/pseries_deactivate_hint_list
> + * Comma separated list of cpus to activate or deactivate
> + * /sys/devices/system/cpu/cpuN/pseries_activate_hint
> + * /sys/devices/system/cpu/cpuN/pseries_deactivate_hint
> + * Per-cpu value of the hint
Do we really need both interfaces? Seems like awk could generate one
from the other in userspace?
> + */
> +
> +struct sysdev_class_attribute attr_cpu_activate_hint_list =
> + _SYSDEV_CLASS_ATTR(pseries_activate_hint_list, 0444,
> + cpu_activate_hint_list_show, NULL);
> +
> +struct sysdev_class_attribute attr_cpu_deactivate_hint_list =
> + _SYSDEV_CLASS_ATTR(pseries_deactivate_hint_list, 0444,
> + cpu_deactivate_hint_list_show, NULL);
> +
> +struct sysdev_attribute attr_percpu_activate_hint =
> + _SYSDEV_ATTR(pseries_activate_hint, 0444,
> + percpu_activate_hint_show, NULL);
> +
> +struct sysdev_attribute attr_percpu_deactivate_hint =
> + _SYSDEV_ATTR(pseries_deactivate_hint, 0444,
> + percpu_deactivate_hint_show, NULL);
> +
> +static int __init pseries_energy_init(void)
> +{
> + int cpu, err;
> + struct sys_device *cpu_sys_dev;
> +
> + /* Create the sysfs files */
> + err = sysfs_create_file(&cpu_sysdev_class.kset.kobj,
> + &attr_cpu_activate_hint_list.attr);
> + if (!err)
> + err = sysfs_create_file(&cpu_sysdev_class.kset.kobj,
> + &attr_cpu_deactivate_hint_list.attr);
> +
> + for_each_possible_cpu(cpu) {
> + cpu_sys_dev = get_cpu_sysdev(cpu);
> + err = sysfs_create_file(&cpu_sys_dev->kobj,
> + &attr_percpu_activate_hint.attr);
> + if (err)
> + break;
> + err = sysfs_create_file(&cpu_sys_dev->kobj,
> + &attr_percpu_deactivate_hint.attr);
> + if (err)
> + break;
> + }
> + return err;
> +
> +}
> +
> +static void __exit pseries_energy_cleanup(void)
> +{
> + int cpu;
> + struct sys_device *cpu_sys_dev;
> +
> + /* Remove the sysfs files */
> + sysfs_remove_file(&cpu_sysdev_class.kset.kobj,
> + &attr_cpu_activate_hint_list.attr);
> +
> + sysfs_remove_file(&cpu_sysdev_class.kset.kobj,
> + &attr_cpu_deactivate_hint_list.attr);
> +
> + for_each_possible_cpu(cpu) {
> + cpu_sys_dev = get_cpu_sysdev(cpu);
> + sysfs_remove_file(&cpu_sys_dev->kobj,
> + &attr_percpu_activate_hint.attr);
> + sysfs_remove_file(&cpu_sys_dev->kobj,
> + &attr_percpu_deactivate_hint.attr);
> + }
> +}
> +
> +module_init(pseries_energy_init);
> +module_exit(pseries_energy_cleanup);
> +MODULE_DESCRIPTION("Driver for pseries platform energy management");
Needs a less generic description.
> +MODULE_AUTHOR("Vaidyanathan Srinivasan");
> +MODULE_LICENSE("GPL");
>
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
>
^ permalink raw reply
* section .data..init_task
From: Sean MacLennan @ 2010-06-28 4:59 UTC (permalink / raw)
To: linuxppc-dev
Anybody else seeing these messages?
ppc_4xxFP-ld: .tmp_vmlinux1: section .data..init_task lma 0xc0374000 overlaps previous sections
ppc_4xxFP-ld: .tmp_vmlinux2: section .data..init_task lma 0xc03a2000 overlaps previous sections
ppc_4xxFP-ld: vmlinux: section .data..init_task lma 0xc03a2000 overlaps previous sections
Or does anybody know what they mean? They started showing up in 2.6.35.
Very easy to reproduce, so don't hesitate to ask for more info.
Cheers,
Sean
^ permalink raw reply
* Re: [PATCH 18/26] KVM: PPC: KVM PV guest stubs
From: Matt Evans @ 2010-06-28 4:39 UTC (permalink / raw)
To: Alexander Graf; +Cc: linuxppc-dev, KVM list, kvm-ppc
In-Reply-To: <1277508314-915-19-git-send-email-agraf@suse.de>
Howdy Alex!
Alexander Graf wrote:
> We will soon start and replace instructions from the text section with
> other, paravirtualized versions. To ease the readability of those patches
> I split out the generic looping and magic page mapping code out.
>
> This patch still only contains stubs. But at least it loops through the
> text section :).
>
> Signed-off-by: Alexander Graf <agraf@suse.de>
> ---
> arch/powerpc/kernel/kvm.c | 59 +++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 59 insertions(+), 0 deletions(-)
>
> diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
> index 2d8dd73..d873bc6 100644
> --- a/arch/powerpc/kernel/kvm.c
> +++ b/arch/powerpc/kernel/kvm.c
> @@ -32,3 +32,62 @@
> #define KVM_MAGIC_PAGE (-4096L)
> #define magic_var(x) KVM_MAGIC_PAGE + offsetof(struct kvm_vcpu_arch_shared, x)
>
> +static bool kvm_patching_worked = true;
> +
> +static void kvm_map_magic_page(void *data)
> +{
> + kvm_hypercall2(KVM_HC_PPC_MAP_MAGIC_PAGE,
> + KVM_MAGIC_PAGE, /* Physical Address */
> + KVM_MAGIC_PAGE); /* Effective Address */
> +}
> +
> +static void kvm_check_ins(u32 *inst)
> +{
> + u32 _inst = *inst;
> + u32 inst_no_rt = _inst & ~KVM_MASK_RT;
> + u32 inst_rt = _inst & KVM_MASK_RT;
> +
> + switch (inst_no_rt) {
> + }
> +
> + switch (_inst) {
> + }
> +
> + flush_icache_range((ulong)inst, (ulong)inst + 4);
> +}
> +
> +static void kvm_use_magic_page(void)
> +{
> + u32 *p;
> + u32 *start, *end;
> +
> + /* Tell the host to map the magic page to -4096 on all CPUs */
> +
> + on_each_cpu(kvm_map_magic_page, NULL, 1);
> +
> + /* Now loop through all code and find instructions */
> +
> + start = (void*)_stext;
> + end = (void*)_etext;
> +
> + for (p = start; p < end; p++)
> + kvm_check_ins(p);
> +}
Could you do something similar in module_finalize() to patch loaded modules' .text sections?
> +
> +static int __init kvm_guest_init(void)
> +{
> + char *p;
> +
> + if (!kvm_para_available())
> + return 0;
> +
> + if (kvm_para_has_feature(KVM_FEATURE_MAGIC_PAGE))
> + kvm_use_magic_page();
> +
> + printk(KERN_INFO "KVM: Live patching for a fast VM %s\n",
> + kvm_patching_worked ? "worked" : "failed");
> +
> + return 0;
> +}
> +
> +postcore_initcall(kvm_guest_init);
Cheers,
Matt
^ permalink raw reply
* Re: [PATCH 1/2] KVM: PPC: Add generic hpte management functions
From: Benjamin Herrenschmidt @ 2010-06-27 22:10 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-ppc, linuxppc-dev, Alexander Graf, kvm
In-Reply-To: <4C27035D.1010604@redhat.com>
On Sun, 2010-06-27 at 10:53 +0300, Avi Kivity wrote:
> On 06/27/2010 01:58 AM, Benjamin Herrenschmidt wrote:
> >
> >> Then mmu intensive loads can expect to be slow.
> >>
> > Well, depends. ppc64 indeed requires the hash to be managed by the
> > hypervisor, so inserting or invalidating translations will mean a
> > roundtrip to the hypervisor, though there are ways at least the
> > insertion could be alleviated (for example, the HV could service the
> > hash misses directly walking the guest page tables).
> >
>
> But the guest page tables are software defined, no? That means the
> interface will break if the page table format changes.
Yes. Unless the hypervisor or architecture defines the format to be
used :-) IE. That's what Niagara 1 did. But we don't do that indeed
currently.
> > But that's due in part to a design choice (whether it's a good one or
> > not I'm not going to argue here) which favors huge reasonably static
> > workloads where the hash is expected to contain all translations for
> > everything.
> >
>
> What about when you have memory pressure? The hash will have to reflect
> those pte_clear_flush_young(), no?
Well, our architects would argue that the kind of workloads we target
don't have memory pressure :-)
But yes, I agree, harvesting of dirty and young bits is going to force a
hash flush which can be pretty expensive. Heh, we've been trying to
convince our own architects at designers that the MMU sucks for long
enough...
> It seems horribly expensive.
>
> > However, note that BookE (the embedded variant of the architecture) uses
> > a different model for virtualization, including options in its latest
> > variant for a HW logical->real translation (via a small dedicated TLB)
> > and direct access to some TLB ops from the guest.
> >
>
> I'm somewhat familiar with it, yes.
Cheers,
Ben.
^ permalink raw reply
* Re: [PATCH 18/26] KVM: PPC: KVM PV guest stubs
From: Benjamin Herrenschmidt @ 2010-06-27 22:04 UTC (permalink / raw)
To: Alexander Graf
Cc: kvm-ppc@vger.kernel.org, linuxppc-dev, Avi Kivity, KVM list
In-Reply-To: <0E529B3E-541C-4E3B-81E7-AACCD96CBF2C@suse.de>
On Sun, 2010-06-27 at 11:47 +0200, Alexander Graf wrote:
> I did that at first. It breaks. During the patching we may take
> interrupts (pahe faults for example) that contain just patched
> instructions. And really, hell breaks loose if we don't flush it
> immediately :). I was hoping at first a 32 bit replace would be
> atomic
> in cache, but the cpu tried to execute invalid instructions, so it
> must have gotten some intermediate state.
A 32-bit aligned store -is- atomic. The other threads/cpu will see
either the old or the new instruction, nothing in between.
Cheers,
Ben.
^ permalink raw reply
* Re: [PATCH 08/26] KVM: PPC: Add PV guest critical sections
From: Benjamin Herrenschmidt @ 2010-06-27 22:03 UTC (permalink / raw)
To: Alexander Graf
Cc: kvm-ppc@vger.kernel.org, linuxppc-dev, Avi Kivity, KVM list
In-Reply-To: <CF78C6B8-F600-4781-82CA-27865F745A60@suse.de>
On Sun, 2010-06-27 at 14:06 +0200, Alexander Graf wrote:
> > Right, I was agreeing with you. I meant there was no inc/dec mem
> > insns.
>
> Ah, I see :). I think that's really the only point where I deem the
> x86 isa superior :).
Heh, well, that's a typical RISC thing tho. I think ARM and MIPS are
like PowerPC here at least :-)
Cheers,
Ben.
^ permalink raw reply
* Re: [PATCH 0/2 v3] mpc5200 ac97 gpio reset
From: Mark Brown @ 2010-06-27 22:01 UTC (permalink / raw)
To: Grant Likely; +Cc: linuxppc-dev@lists.ozlabs.org, Eric Millbrandt
In-Reply-To: <AANLkTiluyeGSwVSXgTeRxS3hVWTe3Gkn8qHrqbHJFMun@mail.gmail.com>
On 26 Jun 2010, at 00:04, Grant Likely <grant.likely@secretlab.ca> =
wrote:
> On Tue, Jun 22, 2010 at 5:00 PM, Mark Brown
>> On Tue, Jun 15, 2010 at 12:05:05PM -0400, Eric Millbrandt wrote:
>>> These patches reimplement the reset fuction in the ac97 to use gpio =
pins
>>> instead of using the mpc5200 ac97 reset functionality in the psc. =
This
>>> avoids a problem in which attached ac97 devices go into "test" mode =
appear
>>> unresponsive.
>>>=20
>>> These patches were tested on a pcm030 baseboard and on custom =
hardware with
>>> a wm9715 audio codec/touchscreen controller.
>>=20
>> Grant, are you OK with this series?
>=20
> Yes, I'm going to pick it up.
I'm a little concerned with a collision with multi codec here. It'd
be handy if you could keep it separate in case it needs merging
into both trees (or we could merge via ASoC only).=
^ permalink raw reply
* Re: [PATCH 01/26] KVM: PPC: Introduce shared page
From: Avi Kivity @ 2010-06-27 12:12 UTC (permalink / raw)
To: Alexander Graf; +Cc: linuxppc-dev, KVM list, kvm-ppc
In-Reply-To: <1277508314-915-2-git-send-email-agraf@suse.de>
On 06/26/2010 02:24 AM, Alexander Graf wrote:
> For transparent variable sharing between the hypervisor and guest, I introduce
> a shared page. This shared page will contain all the registers the guest can
> read and write safely without exiting guest context.
>
> This patch only implements the stubs required for the basic structure of the
> shared page. The actual register moving follows.
>
>
> @@ -123,8 +123,14 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
> if (err)
> goto free_vcpu;
>
> + vcpu->arch.shared = (void*)__get_free_page(GFP_KERNEL|__GFP_ZERO);
> + if (!vcpu->arch.shared)
> + goto uninit_vcpu;
> +
> return vcpu;
>
> +uninit_vcpu:
> + kvm_vcpu_uninit(vcpu);
> free_vcpu:
> kmem_cache_free(kvm_vcpu_cache, vcpu_44x);
> out:
> @@ -135,6 +141,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
> {
> struct kvmppc_vcpu_44x *vcpu_44x = to_44x(vcpu);
>
> + free_page((unsigned long)vcpu->arch.shared);
> kvm_vcpu_uninit(vcpu);
> kmem_cache_free(kvm_vcpu_cache, vcpu_44x);
> }
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index 884d4a5..ba79b35 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -1247,6 +1247,10 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
> if (err)
> goto free_shadow_vcpu;
>
> + vcpu->arch.shared = (void*)__get_free_page(GFP_KERNEL|__GFP_ZERO);
> + if (!vcpu->arch.shared)
> + goto uninit_vcpu;
> +
> vcpu->arch.host_retip = kvm_return_point;
> vcpu->arch.host_msr = mfmsr();
> #ifdef CONFIG_PPC_BOOK3S_64
> @@ -1277,6 +1281,8 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
>
> return vcpu;
>
> +uninit_vcpu:
> + kvm_vcpu_uninit(vcpu);
> free_shadow_vcpu:
> kfree(vcpu_book3s->shadow_vcpu);
> free_vcpu:
> @@ -1289,6 +1295,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
> {
> struct kvmppc_vcpu_book3s *vcpu_book3s = to_book3s(vcpu);
>
> + free_page((unsigned long)vcpu->arch.shared);
> kvm_vcpu_uninit(vcpu);
> kfree(vcpu_book3s->shadow_vcpu);
> vfree(vcpu_book3s);
> diff --git a/arch/powerpc/kvm/e500.c b/arch/powerpc/kvm/e500.c
> index e8a00b0..71750f2 100644
> --- a/arch/powerpc/kvm/e500.c
> +++ b/arch/powerpc/kvm/e500.c
> @@ -117,8 +117,14 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
> if (err)
> goto uninit_vcpu;
>
> + vcpu->arch.shared = (void*)__get_free_page(GFP_KERNEL|__GFP_ZERO);
> + if (!vcpu->arch.shared)
> + goto uninit_tlb;
> +
> return vcpu;
>
> +uninit_tlb:
> + kvmppc_e500_tlb_uninit(vcpu_e500);
> uninit_vcpu:
> kvm_vcpu_uninit(vcpu);
> free_vcpu:
> @@ -131,6 +137,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
> {
> struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
>
> + free_page((unsigned long)vcpu->arch.shared);
> kvmppc_e500_tlb_uninit(vcpu_e500);
> kvm_vcpu_uninit(vcpu);
> kmem_cache_free(kvm_vcpu_cache, vcpu_e500);
>
Code repeats 3x. Share please.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply
* Re: [PATCH 08/26] KVM: PPC: Add PV guest critical sections
From: Alexander Graf @ 2010-06-27 12:06 UTC (permalink / raw)
To: Avi Kivity; +Cc: linuxppc-dev, KVM list, kvm-ppc@vger.kernel.org
In-Reply-To: <4C273BAD.2090305@redhat.com>
Am 27.06.2010 um 13:53 schrieb Avi Kivity <avi@redhat.com>:
> On 06/27/2010 02:49 PM, Alexander Graf wrote:
>>
>> Am 27.06.2010 um 12:59 schrieb Avi Kivity <avi@redhat.com>:
>>
>>> On 06/27/2010 01:33 PM, Alexander Graf wrote:
>>>>> Or inc/dec...
>>>>
>>>>
>>>> Uh - huh? How would that help?
>>>
>>> inc shared->critical # disable interrupts
>>
>> There is no opcode in the ppc isa that acts on momery without
>> involving a register.
>>
>> An inc would be:
>>
>> ld rX, A(0)
>> addi rX, rX, 1
>> std rX, A(0)
>
> Right, I was agreeing with you. I meant there was no inc/dec mem
> insns.
Ah, I see :). I think that's really the only point where I deem the
x86 isa superior :).
Alex
^ permalink raw reply
* Re: [PATCH 08/26] KVM: PPC: Add PV guest critical sections
From: Avi Kivity @ 2010-06-27 11:53 UTC (permalink / raw)
To: Alexander Graf; +Cc: linuxppc-dev, KVM list, kvm-ppc@vger.kernel.org
In-Reply-To: <C90A3ABA-60E4-48D4-8332-3B49F3554447@suse.de>
On 06/27/2010 02:49 PM, Alexander Graf wrote:
>
> Am 27.06.2010 um 12:59 schrieb Avi Kivity <avi@redhat.com>:
>
>> On 06/27/2010 01:33 PM, Alexander Graf wrote:
>>>> Or inc/dec...
>>>
>>>
>>> Uh - huh? How would that help?
>>
>> inc shared->critical # disable interrupts
>
> There is no opcode in the ppc isa that acts on momery without
> involving a register.
>
> An inc would be:
>
> ld rX, A(0)
> addi rX, rX, 1
> std rX, A(0)
Right, I was agreeing with you. I meant there was no inc/dec mem insns.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply
* Re: [PATCH 08/26] KVM: PPC: Add PV guest critical sections
From: Alexander Graf @ 2010-06-27 11:49 UTC (permalink / raw)
To: Avi Kivity; +Cc: linuxppc-dev, KVM list, kvm-ppc@vger.kernel.org
In-Reply-To: <4C272F08.8090709@redhat.com>
Am 27.06.2010 um 12:59 schrieb Avi Kivity <avi@redhat.com>:
> On 06/27/2010 01:33 PM, Alexander Graf wrote:
>>> Or inc/dec...
>>
>>
>> Uh - huh? How would that help?
>
> inc shared->critical # disable interrupts
There is no opcode in the ppc isa that acts on momery without
involving a register.
An inc would be:
ld rX, A(0)
addi rX, rX, 1
std rX, A(0)
Alex
>
^ permalink raw reply
* Re: [PATCH 08/26] KVM: PPC: Add PV guest critical sections
From: Avi Kivity @ 2010-06-27 10:59 UTC (permalink / raw)
To: Alexander Graf; +Cc: linuxppc-dev, KVM list, kvm-ppc@vger.kernel.org
In-Reply-To: <DFC77851-3BE7-4746-93DE-287D5E27EF7D@suse.de>
On 06/27/2010 01:33 PM, Alexander Graf wrote:
>> Or inc/dec...
>
>
> Uh - huh? How would that help?
inc shared->critical # disable interrupts
--
error compiling committee.c: too many arguments to function
^ permalink raw reply
* Re: [PATCH 02/26] KVM: PPC: Convert MSR to shared page
From: Alexander Graf @ 2010-06-27 10:40 UTC (permalink / raw)
To: Avi Kivity; +Cc: linuxppc-dev, KVM list, kvm-ppc@vger.kernel.org
In-Reply-To: <4C271EE5.1060401@redhat.com>
Am 27.06.2010 um 11:50 schrieb Avi Kivity <avi@redhat.com>:
> On 06/27/2010 12:38 PM, Alexander Graf wrote:
>>
>> Am 27.06.2010 um 10:16 schrieb Avi Kivity <avi@redhat.com>:
>>
>>> On 06/26/2010 02:24 AM, Alexander Graf wrote:
>>>> One of the most obvious registers to share with the guest
>>>> directly is the
>>>> MSR. The MSR contains the "interrupts enabled" flag which the
>>>> guest has to
>>>> toggle in critical sections.
>>>>
>>>> So in order to bring the overhead of interrupt en- and disabling
>>>> down, let's
>>>> put msr into the shared page. Keep in mind that even though you
>>>> can fully read
>>>> its contents, writing to it doesn't always update all state.
>>>> There are a few
>>>> safe fields that don't require hypervisor interaction. See the
>>>> guest
>>>> implementation that follows later for reference.
>>>>
>>>
>>>
>>> You mean, see the documentation for reference.
>>>
>>> It should be possible to write the guest code looking only at the
>>> documentation.
>>
>> *shrug* since we're writing open source I don't mind telling people
>> to read code for a reference implemenration.
>
> It's impossible to infer from the source what's a guaranteed part of
> the interface and what is just an implementation artifact. So
> people rely on implementation artifacts (or even bugs) and that
> reduces our ability to change things.
>
>> If well written, that's more comprehensible than documentation
>> anyways :).
>
> If the documentation is poorly written, yes.
I think I start to agree. I guess i should just list all fields of the
MSR that are ok to modify inside the guest context.
Alex
^ permalink raw reply
* Re: [PATCH 18/26] KVM: PPC: KVM PV guest stubs
From: Alexander Graf @ 2010-06-27 10:38 UTC (permalink / raw)
To: Avi Kivity; +Cc: linuxppc-dev, KVM list, kvm-ppc@vger.kernel.org
In-Reply-To: <4C272503.7030605@redhat.com>
Am 27.06.2010 um 12:16 schrieb Avi Kivity <avi@redhat.com>:
> On 06/27/2010 12:47 PM, Alexander Graf wrote:
>>
>> Am 27.06.2010 um 10:28 schrieb Avi Kivity <avi@redhat.com>:
>>
>>> On 06/26/2010 02:25 AM, Alexander Graf wrote:
>>>> We will soon start and replace instructions from the text section
>>>> with
>>>> other, paravirtualized versions. To ease the readability of those
>>>> patches
>>>> I split out the generic looping and magic page mapping code out.
>>>>
>>>> This patch still only contains stubs. But at least it loops
>>>> through the
>>>> text section :).
>>>>
>>>>
>>>> +
>>>> +static void kvm_check_ins(u32 *inst)
>>>> +{
>>>> + u32 _inst = *inst;
>>>> + u32 inst_no_rt = _inst& ~KVM_MASK_RT;
>>>> + u32 inst_rt = _inst& KVM_MASK_RT;
>>>> +
>>>> + switch (inst_no_rt) {
>>>> + }
>>>> +
>>>> + switch (_inst) {
>>>> + }
>>>> +
>>>> + flush_icache_range((ulong)inst, (ulong)inst + 4);
>>>> +}
>>>>
>>>
>>> Shouldn't we flush only if we patched something?
>>
>> We introduce the patching in the next patches. This is only a
>> preparation stub.
>
> Well, unless I missed something, this remains unconditional after
> all the patches.
>
> A helper patch(pc, replacement) could patch and flush in one go.
Oh I see what you mean. While not necessary, it would save a few
cycles on guest bootup.
>
>>
>>>
>>>> +
>>>> +static void kvm_use_magic_page(void)
>>>> +{
>>>> + u32 *p;
>>>> + u32 *start, *end;
>>>> +
>>>> + /* Tell the host to map the magic page to -4096 on all CPUs */
>>>> +
>>>> + on_each_cpu(kvm_map_magic_page, NULL, 1);
>>>> +
>>>> + /* Now loop through all code and find instructions */
>>>> +
>>>> + start = (void*)_stext;
>>>> + end = (void*)_etext;
>>>> +
>>>> + for (p = start; p< end; p++)
>>>> + kvm_check_ins(p);
>>>> +}
>>>> +
>>>>
>>>
>>> Or, flush the entire thing here.
>>
>> I did that at first. It breaks. During the patching we may take
>> interrupts (pahe faults for example) that contain just patched
>> instructions. And really, hell breaks loose if we don't flush it
>> immediately :). I was hoping at first a 32 bit replace would be
>> atomic in cache, but the cpu tried to execute invalid instructions,
>> so it must have gotten some intermediate state.
>
> Surprising. Maybe you need a flush after writing to the out-of-line
> code?
I do that too now :). Better flush too often that too rarely. It's not
_that_ expensive after all.
Alex
^ permalink raw reply
* Re: [PATCH 08/26] KVM: PPC: Add PV guest critical sections
From: Alexander Graf @ 2010-06-27 10:35 UTC (permalink / raw)
To: Avi Kivity; +Cc: linuxppc-dev, KVM list, kvm-ppc@vger.kernel.org
In-Reply-To: <4C27220D.7090508@redhat.com>
Am 27.06.2010 um 12:03 schrieb Avi Kivity <avi@redhat.com>:
> On 06/26/2010 02:24 AM, Alexander Graf wrote:
>> When running in hooked code we need a way to disable interrupts
>> without
>> clobbering any interrupts or exiting out to the hypervisor.
>>
>> To achieve this, we have an additional critical field in the shared
>> page. If
>> that field is equal to the r1 register of the guest, it tells the
>> hypervisor
>> that we're in such a critical section and thus may not receive any
>> interrupts.
>>
>>
>> --- a/arch/powerpc/kvm/book3s.c
>> +++ b/arch/powerpc/kvm/book3s.c
>> @@ -251,14 +251,25 @@ int kvmppc_book3s_irqprio_deliver(struct
>> kvm_vcpu *vcpu, unsigned int priority)
>> int deliver = 1;
>> int vec = 0;
>> ulong flags = 0ULL;
>> + ulong crit_raw = vcpu->arch.shared->critical;
>> + ulong crit_r1 = kvmppc_get_gpr(vcpu, 1);
>> + bool crit;
>> +
>> + /* Truncate crit indicators in 32 bit mode */
>> + if (!(vcpu->arch.shared->msr& MSR_SF)) {
>> + crit_raw&= 0xffffffff;
>> + crit_r1&= 0xffffffff;
>> + }
>> +
>> + crit = (crit_raw == crit_r1);
>>
>
> I think you need to qualify that for supervisor mode only.
> Otherwise guest userspace can guess the value of shared->critical
> and disable interrupts.
Yes, you're right. Good catch!
Alex
>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox