LinuxPPC-Dev Archive on lore.kernel.org
* RE: [rtc-linux] [PATCH] rtc/ds3232: Enable ds3232 to work as wakeup source
From: Dongsheng.Wang @ 2014-02-26  3:26 UTC
  To: Scott Wood
  Cc: a.zummo@towertech.it, chenhui.zhao@freescale.com,
	rtc-linux@googlegroups.com, Andrew Morton,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <1393384830.6733.987.camel@snotra.buserror.net>

> -----Original Message-----
> From: Wood Scott-B07421
> Sent: Wednesday, February 26, 2014 11:21 AM
> To: Wang Dongsheng-B40534
> Cc: Andrew Morton; rtc-linux@googlegroups.com; benh@kernel.crashing.org;
> a.zummo@towertech.it; Zhao Chenhui-B35336; linuxppc-dev@lists.ozlabs.org
> Subject: Re: [rtc-linux] [PATCH] rtc/ds3232: Enable ds3232 to work as wakeup
> source
> 
> On Tue, 2014-02-25 at 21:09 -0600, Wang Dongsheng-B40534 wrote:
> >
> > > -----Original Message-----
> > > From: Andrew Morton [mailto:akpm@linux-foundation.org]
> > > Sent: Wednesday, February 26, 2014 6:07 AM
> > > To: rtc-linux@googlegroups.com
> > > Cc: Wang Dongsheng-B40534; a.zummo@towertech.it; Zhao Chenhui-B35336;
> linuxppc-
> > > dev@lists.ozlabs.org
> > > Subject: Re: [rtc-linux] [PATCH] rtc/ds3232: Enable ds3232 to work as wakeup
> > > source
> > >
> > > On Tue, 21 Jan 2014 13:24:51 +0800 Dongsheng Wang
> <dongsheng.wang@freescale.com>
> > > wrote:
> > >
> > > > +	if (client->irq != NO_IRQ) {
> > >
> > > x86_64 allmodconfig:
> > >
> > > drivers/rtc/rtc-ds3232.c: In function 'ds3232_probe':
> > > drivers/rtc/rtc-ds3232.c:427: error: 'NO_IRQ' undeclared (first use in this
> > > function)
> > > drivers/rtc/rtc-ds3232.c:427: error: (Each undeclared identifier is reported
> > > only once
> > > drivers/rtc/rtc-ds3232.c:427: error: for each function it appears in.)
> > >
> > > Not all architectures implement NO_IRQ.
> > >
> > > I think this should be
> > >
> > > 	if (client->irq > 0) {
> > >
> > > but I'm not sure - iirc, x86 (at least) treats zero as "not an IRQ".
> > > But I think some architectures permit IRQ 0.  There was discussion many
> > > years ago but I don't think anything got resolved.
> > >
> > I think this is why NO_IRQ is defined in kernel, that should be resolved this
> issue.
> >
> > Sorry, I don't know why some architectures didn't define this macro?
> 
> NO_IRQ is deprecated (see "git log -SNO_IRQ" for the trend of removing
> uses of it, as well as situations where it gives the wrong results).
> "if (client->irq > 0)" is correct.
> 
Thanks.

-Dongsheng

> -Scott
> 


* Re: [rtc-linux] [PATCH] rtc/ds3232: Enable ds3232 to work as wakeup source
From: Scott Wood @ 2014-02-26  3:20 UTC
  To: Wang Dongsheng-B40534
  Cc: a.zummo@towertech.it, Zhao Chenhui-B35336,
	rtc-linux@googlegroups.com, Andrew Morton,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <3eba8b896e624b698ab8291399c5a1b0@BN1PR03MB188.namprd03.prod.outlook.com>

On Tue, 2014-02-25 at 21:09 -0600, Wang Dongsheng-B40534 wrote:
> 
> > -----Original Message-----
> > From: Andrew Morton [mailto:akpm@linux-foundation.org]
> > Sent: Wednesday, February 26, 2014 6:07 AM
> > To: rtc-linux@googlegroups.com
> > Cc: Wang Dongsheng-B40534; a.zummo@towertech.it; Zhao Chenhui-B35336; linuxppc-
> > dev@lists.ozlabs.org
> > Subject: Re: [rtc-linux] [PATCH] rtc/ds3232: Enable ds3232 to work as wakeup
> > source
> > 
> > On Tue, 21 Jan 2014 13:24:51 +0800 Dongsheng Wang <dongsheng.wang@freescale.com>
> > wrote:
> > 
> > > +	if (client->irq != NO_IRQ) {
> > 
> > x86_64 allmodconfig:
> > 
> > drivers/rtc/rtc-ds3232.c: In function 'ds3232_probe':
> > drivers/rtc/rtc-ds3232.c:427: error: 'NO_IRQ' undeclared (first use in this
> > function)
> > drivers/rtc/rtc-ds3232.c:427: error: (Each undeclared identifier is reported
> > only once
> > drivers/rtc/rtc-ds3232.c:427: error: for each function it appears in.)
> > 
> > Not all architectures implement NO_IRQ.
> > 
> > I think this should be
> > 
> > 	if (client->irq > 0) {
> > 
> > but I'm not sure - iirc, x86 (at least) treats zero as "not an IRQ".
> > But I think some architectures permit IRQ 0.  There was discussion many
> > years ago but I don't think anything got resolved.
> > 
> I think this is why NO_IRQ is defined in kernel, that should be resolved this issue.
> 
> Sorry, I don't know why some architectures didn't define this macro?

NO_IRQ is deprecated (see "git log -SNO_IRQ" for the trend of removing
uses of it, as well as situations where it gives the wrong results).
"if (client->irq > 0)" is correct.

-Scott
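
A minimal sketch of the check recommended above, written as a standalone
helper so it can be compiled outside the kernel. `struct sketch_client` and
`sketch_client_has_irq()` are hypothetical stand-ins for `struct i2c_client`
and the test in `ds3232_probe()`; they are not kernel APIs. The point is that
`irq > 0` rejects both 0 and negative values, so it does not depend on how
(or whether) an architecture defines NO_IRQ.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the one field of struct i2c_client used here. */
struct sketch_client {
	int irq;
};

/*
 * Treat only strictly positive numbers as usable interrupts. This rejects
 * both common "no IRQ" conventions (0 and -1) without referencing NO_IRQ.
 */
static bool sketch_client_has_irq(const struct sketch_client *client)
{
	return client->irq > 0;
}
```

A probe routine would then guard its interrupt setup with this predicate
instead of comparing against an architecture-specific macro.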


* Re: [PATCH 0/4] powernv: kvm: numa fault improvement
From: Liu ping fan @ 2014-02-26  3:09 UTC
  To: Alexander Graf; +Cc: Paul Mackerras, linuxppc-dev, Aneesh Kumar K.V, kvm-ppc
In-Reply-To: <62DB3340-5AF7-4DA4-A790-77EE00696F57@suse.de>

Sorry for the late update. It took a long time to get access to a test
machine, and then I hit a series of other bugs which I could not resolve
easily. For now I have some higher-priority tasks, and I will come
back to this topic when time is available.
Besides this, I did some basic tests of an HV guest with and without
numa-fault. They show a 10% drop in performance
when numa-fault is on. (Tested with $pg_access_random 60 4 200; the
guest has 10GB of mlocked pages.)
I think this is caused by the following factors: cache misses,
TLB misses, guest->host exits, and hw-threads cooperating to exit from guest
state.  I hope my patches help to reduce the cost of
guest->host exits and of hw-threads cooperating to exit.

My test case launches 4 threads on the guest (as 4 hw-threads), and each
of them randomly accesses a PAGE_ALIGN area.
I would welcome suggestions about the test case, so that when I have time, I
can improve and finish the test.

Thanks,
Fan

--- test case: usage: pg_random_access  secs  fork_num  mem_size---
#include <ctype.h>
#include <errno.h>
#include <libgen.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/timerfd.h>
#include <time.h>
#include <stdint.h>        /* Definition of uint64_t */
#include <poll.h>


/* */
#define CMD_STOP 0x1234
#define SHM_FNAME "/numafault_shm"
#define PAGE_SIZE (1<<12)

/* the protocol defined on the shm */
#define SHM_CMD_OFF 0x0
#define SHM_CNT_OFF 0x1
#define SHM_MESSAGE_OFF 0x2

#define handle_error(msg) \
        do { perror(msg); exit(EXIT_FAILURE); } while (0)


static inline void random_access(void *region_start, int len)
{
        int *p;
        int num;

        num = random();
        num &= ~(PAGE_SIZE - 1);        /* round down to a page boundary */
        num &= (len - 1);               /* keep the offset inside the region */
        p = (int *)((char *)region_start + num);
        *p = 0x654321;
}

static int numafault_body(int size_MB)
{
        /* since MB is always align on PAGE_SIZE, so it is ok to test
fault on page */
        int size = size_MB*1024*1024;
        void *region_start = malloc(size);
        unsigned long *pmap;
        int shm_fid;
        unsigned long cnt = 0;
        pid_t pid = getpid();
        char *dst;
        char buf[128];

        shm_fid = shm_open(SHM_FNAME, O_RDWR, S_IRUSR | S_IWUSR);
        ftruncate(shm_fid, 2*sizeof(long));
        pmap = mmap(NULL, 2*sizeof(long), PROT_WRITE | PROT_READ,
MAP_SHARED, shm_fid, 0);
        if (pmap == MAP_FAILED) {
                printf("child fail to setup mmap of shm\n");
                return -1;
        }

        while (*(pmap+SHM_CMD_OFF) != CMD_STOP){
                random_access(region_start, size);
                cnt++;
        }

        __atomic_fetch_add((pmap+SHM_CNT_OFF), cnt, __ATOMIC_SEQ_CST);
        dst = (char *)(pmap+SHM_MESSAGE_OFF);
        //tofix, need lock
        snprintf(buf, sizeof(buf), "child [%i] cnt=%lu\n", pid, cnt);
        strcat(dst, buf);

        munmap(pmap, 2*sizeof(long));
        shm_unlink(SHM_FNAME);
        fprintf(stdout, "[%i] cnt=%lu\n", pid, cnt);
        fflush(stdout);
        exit(0);

}

int main(int argc, char **argv)
{
        int i;
        pid_t pid;
        int shm_fid;
        unsigned long *pmap;
        int fork_num;
        int size;
        char *dst_info;

        struct itimerspec new_value;
        int fd;
        struct timespec now;
        uint64_t exp, tot_exp;
        ssize_t s;
        struct pollfd pfd;
        int elapsed;

        if (argc != 4){
            fprintf(stderr, "%s wait-secs [secs elapsed before parent asks the children to exit]\n"
                    "                    fork-num [child num]\n"
                    "                    size [memory region covered by each child in MB]\n",
                    argv[0]);
            exit(EXIT_FAILURE);
        }
        elapsed = atoi(argv[1]);
        fork_num = atoi(argv[2]);
        size = atoi(argv[3]);
        printf("fork %i child process to test mem %i MB for a period: %i sec\n",
                fork_num, size, elapsed);

        fd = timerfd_create(CLOCK_REALTIME, 0);
        if (fd == -1)
            handle_error("timerfd_create");


        shm_fid = shm_open(SHM_FNAME, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
        ftruncate(shm_fid, PAGE_SIZE);
        pmap = mmap(NULL, PAGE_SIZE, PROT_WRITE | PROT_READ,
MAP_SHARED, shm_fid, 0);
        if (pmap == MAP_FAILED) {
                printf("fail to setup mmap of shm\n");
                return -1;
        }
        memset(pmap, 0, 2*sizeof(long));
        //wmb();

        for (i = 0; i < fork_num; i++){
                switch (pid = fork())
                {
                case 0:            /* child */
                        numafault_body(size);
                        exit(0);
                case -1:           /* error */
                        fprintf(stderr, "fork failed: %s\n", strerror(errno));
                        break;
                default:           /* parent */
                        printf("fork child [%i]\n", pid);
                }
        }

        if (clock_gettime(CLOCK_REALTIME, &now) == -1)
                handle_error("clock_gettime");

        /* Create a CLOCK_REALTIME absolute timer with initial
expiration and interval as specified in command line */

        new_value.it_value.tv_sec = now.tv_sec + elapsed;
        new_value.it_value.tv_nsec = now.tv_nsec;
        new_value.it_interval.tv_sec = 0;
        new_value.it_interval.tv_nsec = 0;

        if (timerfd_settime(fd, TFD_TIMER_ABSTIME, &new_value, NULL) == -1)
                handle_error("timerfd_settime");

        pfd.fd = fd;
        pfd.events = POLLIN;
        pfd.revents = 0;
        /* -1: infinite wait */
        poll(&pfd, 1, -1);



        /* ask children to stop and get back cnt */

        *(pmap + SHM_CMD_OFF) = CMD_STOP;

        /* reap every child, not just the first one to exit */
        for (i = 0; i < fork_num; i++)
                wait(NULL);
        dst_info = (char *)(pmap + SHM_MESSAGE_OFF);
        printf("%s", dst_info);
        printf("total cnt:%lu\n", *(pmap + SHM_CNT_OFF));

        munmap(pmap, PAGE_SIZE);
        shm_unlink(SHM_FNAME);
}
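
One detail of random_access() worth noting: masking with `len - 1` keeps the
offset inside the region, but it only reaches every page when `len` is a power
of two, and the example run uses 200 MB. A hedged sketch of an alternative
that picks a page index by modulo instead; `sketch_page_offset()` and
`SKETCH_PAGE_SIZE` are hypothetical names and are not part of the posted test.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE ((size_t)1 << 12)

/*
 * Map an arbitrary random number to a page-aligned byte offset inside a
 * region of len bytes. Works for any len, not only powers of two.
 * Returns 0 when the region is smaller than one page.
 */
static size_t sketch_page_offset(unsigned long rnd, size_t len)
{
	size_t pages = len / SKETCH_PAGE_SIZE;

	if (pages == 0)
		return 0;
	return (size_t)(rnd % pages) * SKETCH_PAGE_SIZE;
}
```

The result could be added to the start of the region in place of the two
masking steps; whether the extra division matters for the measurement is a
separate question.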




On Mon, Jan 20, 2014 at 10:48 PM, Alexander Graf <agraf@suse.de> wrote:
>
> On 15.01.2014, at 07:36, Liu ping fan <kernelfans@gmail.com> wrote:
>
>> On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf <agraf@suse.de> wrote:
>>>
>>> On 11.12.2013, at 09:47, Liu Ping Fan <kernelfans@gmail.com> wrote:
>>>
>>>> This series is based on Aneesh's series  "[PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64"
>>>>
>>>> For this series, I apply the same idea from the previous thread "[PATCH 0/3] optimize for powerpc _PAGE_NUMA"
>>>> (for which, I still try to get a machine to show nums)
>>>>
>>>> But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host,
>>>> which is  well known.
>>>
>>> This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it.
>>>
>> Sorry for the unclear message. After introducing the _PAGE_NUMA,
>> kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it
>> should rely on host's kvmppc_book3s_hv_page_fault() to call
>> do_numa_page() to do the numa fault check. This incurs the overhead
>> when exiting from rmode to vmode.  My idea is that in
>> kvmppc_do_h_enter(), we do a quick check, if the page is right placed,
>> there is no need to exit to vmode (i.e saving htab, slab switching)
>>
>>>> If my suppose is correct, will CCing kvm@vger.kernel.org from next version.
>>>
>>> This translates to me as "This is an RFC"?
>>>
>> Yes, I am not quite sure about it. I have no bare-metal to verify it.
>> So I hope at least, from the theory, it is correct.
>
> Paul, could you please give this some thought and maybe benchmark it?
>
>
> Alex
>


* RE: [rtc-linux] [PATCH] rtc/ds3232: Enable ds3232 to work as wakeup source
From: Dongsheng.Wang @ 2014-02-26  3:09 UTC
  To: Andrew Morton, rtc-linux@googlegroups.com,
	benh@kernel.crashing.org
  Cc: Scott Wood, a.zummo@towertech.it, linuxppc-dev@lists.ozlabs.org,
	chenhui.zhao@freescale.com
In-Reply-To: <20140225140705.fe9f7038bbffbfbde899e0f7@linux-foundation.org>



> -----Original Message-----
> From: Andrew Morton [mailto:akpm@linux-foundation.org]
> Sent: Wednesday, February 26, 2014 6:07 AM
> To: rtc-linux@googlegroups.com
> Cc: Wang Dongsheng-B40534; a.zummo@towertech.it; Zhao Chenhui-B35336; linuxppc-
> dev@lists.ozlabs.org
> Subject: Re: [rtc-linux] [PATCH] rtc/ds3232: Enable ds3232 to work as wakeup
> source
> 
> On Tue, 21 Jan 2014 13:24:51 +0800 Dongsheng Wang <dongsheng.wang@freescale.com>
> wrote:
> 
> > From: Wang Dongsheng <dongsheng.wang@freescale.com>
> >
> > Add suspend/resume and device_init_wakeup to enable ds3232 as
> > wakeup source, /sys/class/rtc/rtcX/wakealarm for set wakeup alarm.
> >
> > ...
> >
> > @@ -411,23 +424,21 @@ static int ds3232_probe(struct i2c_client *client,
> >  	if (ret)
> >  		return ret;
> >
> > -	ds3232->rtc = devm_rtc_device_register(&client->dev, client->name,
> > -					  &ds3232_rtc_ops, THIS_MODULE);
> > -	if (IS_ERR(ds3232->rtc)) {
> > -		dev_err(&client->dev, "unable to register the class device\n");
> > -		return PTR_ERR(ds3232->rtc);
> > -	}
> > -
> > -	if (client->irq >= 0) {
> > +	if (client->irq != NO_IRQ) {
> 
> x86_64 allmodconfig:
> 
> drivers/rtc/rtc-ds3232.c: In function 'ds3232_probe':
> drivers/rtc/rtc-ds3232.c:427: error: 'NO_IRQ' undeclared (first use in this
> function)
> drivers/rtc/rtc-ds3232.c:427: error: (Each undeclared identifier is reported
> only once
> drivers/rtc/rtc-ds3232.c:427: error: for each function it appears in.)
> 
> Not all architectures implement NO_IRQ.
> 
> I think this should be
> 
> 	if (client->irq > 0) {
> 
> but I'm not sure - iirc, x86 (at least) treats zero as "not an IRQ".
> But I think some architectures permit IRQ 0.  There was discussion many
> years ago but I don't think anything got resolved.
> 
I think this is why NO_IRQ is defined in kernel, that should be resolved this issue.

Sorry, I don't know why some architectures didn't define this macro?


Hi Ben,

Did you have some suggestion?

Thanks,
-Dongsheng

> 
> Help!  I think some ppc people will know what to do here?
> 


* [PATCH] powerpc: ftrace: bugfix for test_24bit_addr
From: Liu Ping Fan @ 2014-02-26  2:23 UTC
  To: linuxppc-dev; +Cc: Paul Mackerras

The branch target should be the func addr, not the addr of func_descr_t.
So using ppc_function_entry() to generate the right target addr.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
This bug will make ftrace fail to work. It can be triggered when the kernel
size grows up.
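
To illustrate the distinction, here is a user-space sketch under stated
assumptions. On the 64-bit PowerPC ABI in use at the time (ELFv1), a C
function pointer refers to a function descriptor whose first word is the code
address. `struct sketch_func_descr`, `sketch_function_entry()` and
`sketch_branch_in_range()` are hypothetical names that model `func_descr_t`,
`ppc_function_entry()` and the range check applied to a relative branch; the
+/-32 MB limit reflects the 24-bit word displacement of the `b`/`bl`
instruction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an ELFv1 function descriptor. */
struct sketch_func_descr {
	uintptr_t entry;	/* address of the first instruction */
	uintptr_t toc;		/* TOC base for the function */
	uintptr_t env;		/* environment pointer */
};

/* Return the code address a descriptor points at. */
static uintptr_t sketch_function_entry(const void *func)
{
	return ((const struct sketch_func_descr *)func)->entry;
}

/*
 * A relative branch encodes a signed, word-aligned displacement, so the
 * target must lie within [-0x2000000, 0x1fffffc] of the branch itself.
 */
static bool sketch_branch_in_range(uintptr_t ip, uintptr_t target)
{
	intptr_t offset = (intptr_t)(target - ip);

	return offset >= -0x2000000 && offset <= 0x1fffffc && !(offset & 3);
}
```

Checking the range against the descriptor address instead of the entry
address can give the wrong answer once the two are far enough apart, which is
the failure the patch addresses.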
---
 arch/powerpc/kernel/ftrace.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/ftrace.c b/arch/powerpc/kernel/ftrace.c
index 9b27b29..b0ded97 100644
--- a/arch/powerpc/kernel/ftrace.c
+++ b/arch/powerpc/kernel/ftrace.c
@@ -74,6 +74,7 @@ ftrace_modify_code(unsigned long ip, unsigned int old, unsigned int new)
  */
 static int test_24bit_addr(unsigned long ip, unsigned long addr)
 {
+	addr = ppc_function_entry((void *)addr);
 
 	/* use the create_branch to verify that this offset can be branched */
 	return create_branch((unsigned int *)ip, addr, 0);
-- 
1.8.1.4


* [PATCH 7/7] cpuidle/powernv: Parse device tree to setup idle states
From: Preeti U Murthy @ 2014-02-26  0:09 UTC
  To: linux-pm, geoff, fweisbec, daniel.lezcano, srivatsa.bhat, benh,
	tglx, svaidy, linuxppc-dev, mingo
  Cc: paulmck, rafael.j.wysocki
In-Reply-To: <20140226000310.17879.67295.stgit@preeti>

Add deep idle states such as nap and fast sleep to the cpuidle state table
only if they are discovered from the device tree during cpuidle initialization.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
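
A minimal user-space sketch of the table-building loop, assuming the flag
values and latencies from the patch below. `struct sketch_state`,
`sketch_append()` and `sketch_build_states()` are hypothetical. Unlike the
posted code, this version also stops once the table is full; that bound is an
addition for illustration, not something the patch does.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_MAX_STATES      8
#define SKETCH_USE_INST_NAP    0x00010000u /* value taken from the patch */
#define SKETCH_USE_INST_SLEEP  0x00020000u /* value taken from the patch */

struct sketch_state {
	char name[16];
	unsigned int exit_latency;
	unsigned int target_residency;
};

/* Append one state if there is room; return the new number of states. */
static int sketch_append(struct sketch_state *tbl, int n, const char *name,
			 unsigned int latency, unsigned int residency)
{
	if (n >= SKETCH_MAX_STATES)
		return n;
	snprintf(tbl[n].name, sizeof(tbl[n].name), "%s", name);
	tbl[n].exit_latency = latency;
	tbl[n].target_residency = residency;
	return n + 1;
}

/* Slot 0 is assumed to hold the statically defined snooze state. */
static int sketch_build_states(struct sketch_state *tbl,
			       const uint32_t *flags, size_t nr_flags)
{
	int n = 1;
	size_t i;

	for (i = 0; i < nr_flags; i++) {
		if (flags[i] & SKETCH_USE_INST_NAP)
			n = sketch_append(tbl, n, "Nap", 10, 100);
		if (flags[i] & SKETCH_USE_INST_SLEEP)
			n = sketch_append(tbl, n, "FastSleep", 300, 1000000);
	}
	return n;
}
```

The return value plays the role of max_idle_state: the number of valid
entries at the front of the table.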

 drivers/cpuidle/cpuidle-powernv.c |   82 +++++++++++++++++++++++++++++--------
 1 file changed, 65 insertions(+), 17 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 4fb97ce..fdae1c4 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -12,10 +12,17 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/clockchips.h>
+#include <linux/of.h>
 
 #include <asm/machdep.h>
 #include <asm/firmware.h>
 
+/* Flags and constants used in PowerNV platform */
+
+#define MAX_POWERNV_IDLE_STATES	8
+#define IDLE_USE_INST_NAP	0x00010000 /* Use nap instruction */
+#define IDLE_USE_INST_SLEEP	0x00020000 /* Use sleep instruction */
+
 struct cpuidle_driver powernv_idle_driver = {
 	.name             = "powernv_idle",
 	.owner            = THIS_MODULE,
@@ -79,7 +86,7 @@ static int fastsleep_loop(struct cpuidle_device *dev,
 /*
  * States for dedicated partition case.
  */
-static struct cpuidle_state powernv_states[] = {
+static struct cpuidle_state powernv_states[MAX_POWERNV_IDLE_STATES] = {
 	{ /* Snooze */
 		.name = "snooze",
 		.desc = "snooze",
@@ -87,20 +94,6 @@ static struct cpuidle_state powernv_states[] = {
 		.exit_latency = 0,
 		.target_residency = 0,
 		.enter = &snooze_loop },
-	{ /* NAP */
-		.name = "NAP",
-		.desc = "NAP",
-		.flags = CPUIDLE_FLAG_TIME_VALID,
-		.exit_latency = 10,
-		.target_residency = 100,
-		.enter = &nap_loop },
-	 { /* Fastsleep */
-		.name = "fastsleep",
-		.desc = "fastsleep",
-		.flags = CPUIDLE_FLAG_TIME_VALID,
-		.exit_latency = 10,
-		.target_residency = 100,
-		.enter = &fastsleep_loop },
 };
 
 static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n,
@@ -161,19 +154,74 @@ static int powernv_cpuidle_driver_init(void)
 	return 0;
 }
 
+static int powernv_add_idle_states(void)
+{
+	struct device_node *power_mgt;
+	struct property *prop;
+	int nr_idle_states = 1; /* Snooze */
+	int dt_idle_states;
+	u32 *flags;
+	int i;
+
+	/* Currently we have snooze statically defined */
+
+	power_mgt = of_find_node_by_path("/ibm,opal/power-mgt");
+	if (!power_mgt) {
+		pr_warn("opal: PowerMgmt Node not found\n");
+		return nr_idle_states;
+	}
+
+	prop = of_find_property(power_mgt, "ibm,cpu-idle-state-flags", NULL);
+	if (!prop) {
+		pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-flags\n");
+		return nr_idle_states;
+	}
+
+	dt_idle_states = prop->length / sizeof(u32);
+	flags = (u32 *) prop->value;
+
+	for (i = 0; i < dt_idle_states; i++) {
+
+		if (flags[i] & IDLE_USE_INST_NAP) {
+			/* Add NAP state */
+			strcpy(powernv_states[nr_idle_states].name, "Nap");
+			strcpy(powernv_states[nr_idle_states].desc, "Nap");
+			powernv_states[nr_idle_states].flags = CPUIDLE_FLAG_TIME_VALID;
+			powernv_states[nr_idle_states].exit_latency = 10;
+			powernv_states[nr_idle_states].target_residency = 100;
+			powernv_states[nr_idle_states].enter = &nap_loop;
+			nr_idle_states++;
+		}
+
+		if (flags[i] & IDLE_USE_INST_SLEEP) {
+			/* Add FASTSLEEP state */
+			strcpy(powernv_states[nr_idle_states].name, "FastSleep");
+			strcpy(powernv_states[nr_idle_states].desc, "FastSleep");
+			powernv_states[nr_idle_states].flags =
+				CPUIDLE_FLAG_TIME_VALID | CPUIDLE_FLAG_TIMER_STOP;
+			powernv_states[nr_idle_states].exit_latency = 300;
+			powernv_states[nr_idle_states].target_residency = 1000000;
+			powernv_states[nr_idle_states].enter = &fastsleep_loop;
+			nr_idle_states++;
+		}
+	}
+
+	return nr_idle_states;
+}
+
 /*
  * powernv_idle_probe()
  * Choose state table for shared versus dedicated partition
  */
 static int powernv_idle_probe(void)
 {
-
 	if (cpuidle_disable != IDLE_NO_OVERRIDE)
 		return -ENODEV;
 
 	if (firmware_has_feature(FW_FEATURE_OPALv3)) {
 		cpuidle_state_table = powernv_states;
-		max_idle_state = ARRAY_SIZE(powernv_states);
+		/* Device tree can indicate more idle states */
+		max_idle_state = powernv_add_idle_states();
  	} else
  		return -ENODEV;
 


* [PATCH 6/7] cpuidle/powernv: Add "Fast-Sleep" CPU idle state
From: Preeti U Murthy @ 2014-02-26  0:09 UTC
  To: linux-pm, geoff, fweisbec, daniel.lezcano, srivatsa.bhat, benh,
	tglx, svaidy, linuxppc-dev, mingo
  Cc: paulmck, rafael.j.wysocki
In-Reply-To: <20140226000310.17879.67295.stgit@preeti>

Fast sleep is one of the deep idle states on Power8 in which local timers of
CPUs stop. On PowerPC we do not have an external clock device which can
handle wakeup of such CPUs. Now that we have the support in the tick broadcast
framework for archs that do not sport such a device and the low level support
for fast sleep, enable it in the cpuidle framework on PowerNV.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
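
The LPCR handling in fastsleep_loop() below can be sketched as a pure
function. The bit values here are assumptions, believed to mirror the LPCR
definitions in arch/powerpc/include/asm/reg.h around this kernel version;
check them against the tree before relying on them.
`sketch_fastsleep_lpcr()` is a hypothetical helper and is not part of the
patch.

```c
#include <stdint.h>

/* Assumed bit layout; verify against asm/reg.h for the target kernel. */
#define SKETCH_LPCR_PECE   0x00007000ULL /* powersave exit cause enable mask */
#define SKETCH_LPCR_PECE0  0x00004000ULL /* external interrupts cause exit */
#define SKETCH_LPCR_MER    0x00000800ULL /* mediated external exception */

/*
 * Compute the LPCR value to use while in fast sleep: clear MER and all
 * exit-cause bits, then allow wakeup on external interrupts only. The
 * decrementer bit stays clear because the local timer stops in this state.
 */
static uint64_t sketch_fastsleep_lpcr(uint64_t old_lpcr)
{
	uint64_t new_lpcr = old_lpcr;

	new_lpcr &= ~(SKETCH_LPCR_MER | SKETCH_LPCR_PECE);
	new_lpcr |= SKETCH_LPCR_PECE0;
	return new_lpcr;
}
```

The caller is expected to write the original value back after wakeup, as the
patch does with old_lpcr.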

 arch/powerpc/Kconfig              |    2 ++
 arch/powerpc/kernel/time.c        |    4 +++-
 drivers/cpuidle/cpuidle-powernv.c |   34 ++++++++++++++++++++++++++++++++++
 3 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 957bf34..b841420 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -130,6 +130,8 @@ config PPC
 	select GENERIC_CMOS_UPDATE
 	select GENERIC_TIME_VSYSCALL_OLD
 	select GENERIC_CLOCKEVENTS
+	select GENERIC_CLOCKEVENTS_BROADCAST if SMP
+	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
 	select HAVE_MOD_ARCH_SPECIFIC
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index df2989b..122a580 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -42,6 +42,7 @@
 #include <linux/timex.h>
 #include <linux/kernel_stat.h>
 #include <linux/time.h>
+#include <linux/clockchips.h>
 #include <linux/init.h>
 #include <linux/profile.h>
 #include <linux/cpu.h>
@@ -106,7 +107,7 @@ struct clock_event_device decrementer_clockevent = {
 	.irq            = 0,
 	.set_next_event = decrementer_set_next_event,
 	.set_mode       = decrementer_set_mode,
-	.features       = CLOCK_EVT_FEAT_ONESHOT,
+	.features       = CLOCK_EVT_FEAT_ONESHOT | CLOCK_EVT_FEAT_C3STOP,
 };
 EXPORT_SYMBOL(decrementer_clockevent);
 
@@ -944,6 +945,7 @@ void __init time_init(void)
 	clocksource_init();
 
 	init_decrementer_clockevent();
+	tick_setup_hrtimer_broadcast();
 }
 
 
diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 78fd174..4fb97ce 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -11,6 +11,7 @@
 #include <linux/cpuidle.h>
 #include <linux/cpu.h>
 #include <linux/notifier.h>
+#include <linux/clockchips.h>
 
 #include <asm/machdep.h>
 #include <asm/firmware.h>
@@ -49,6 +50,32 @@ static int nap_loop(struct cpuidle_device *dev,
 	return index;
 }
 
+static int fastsleep_loop(struct cpuidle_device *dev,
+				struct cpuidle_driver *drv,
+				int index)
+{
+	unsigned long old_lpcr = mfspr(SPRN_LPCR);
+	unsigned long new_lpcr;
+
+	if (unlikely(system_state < SYSTEM_RUNNING))
+		return index;
+
+	new_lpcr = old_lpcr;
+	new_lpcr &= ~(LPCR_MER | LPCR_PECE); /* lpcr[mer] must be 0 */
+
+	/* exit powersave upon external interrupt, but not decrementer
+	 * interrupt.
+	 */
+	new_lpcr |= LPCR_PECE0;
+
+	mtspr(SPRN_LPCR, new_lpcr);
+	power7_sleep();
+
+	mtspr(SPRN_LPCR, old_lpcr);
+
+	return index;
+}
+
 /*
  * States for dedicated partition case.
  */
@@ -67,6 +94,13 @@ static struct cpuidle_state powernv_states[] = {
 		.exit_latency = 10,
 		.target_residency = 100,
 		.enter = &nap_loop },
+	 { /* Fastsleep */
+		.name = "fastsleep",
+		.desc = "fastsleep",
+		.flags = CPUIDLE_FLAG_TIME_VALID,
+		.exit_latency = 10,
+		.target_residency = 100,
+		.enter = &fastsleep_loop },
 };
 
 static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n,


* [PATCH 5/7] powermgt: Add OPAL call to resync timebase on wakeup
From: Preeti U Murthy @ 2014-02-26  0:08 UTC
  To: linux-pm, geoff, fweisbec, daniel.lezcano, srivatsa.bhat, benh,
	tglx, svaidy, linuxppc-dev, mingo
  Cc: paulmck, rafael.j.wysocki
In-Reply-To: <20140226000310.17879.67295.stgit@preeti>

From: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>

During "Fast-sleep" and deeper power savings state, decrementer and
timebase could be stopped making it out of sync with rest
of the cores in the system.

Add a firmware call to request platform to resync timebase
using low level platform methods.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
---

 arch/powerpc/include/asm/opal.h                |    2 ++
 arch/powerpc/kernel/exceptions-64s.S           |    2 +-
 arch/powerpc/kernel/idle_power7.S              |   27 ++++++++++++++++++++++++
 arch/powerpc/platforms/powernv/opal-wrappers.S |    1 +
 4 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 40157e2..c71c72e 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -154,6 +154,7 @@ extern int opal_enter_rtas(struct rtas_args *args,
 #define OPAL_FLASH_VALIDATE			76
 #define OPAL_FLASH_MANAGE			77
 #define OPAL_FLASH_UPDATE			78
+#define OPAL_RESYNC_TIMEBASE			79
 #define OPAL_GET_MSG				85
 #define OPAL_CHECK_ASYNC_COMPLETION		86
 #define OPAL_SYNC_HOST_REBOOT			87
@@ -865,6 +866,7 @@ extern void opal_flash_init(void);
 extern int opal_machine_check(struct pt_regs *regs);
 
 extern void opal_shutdown(void);
+extern int opal_resync_timebase(void);
 
 extern void opal_lpc_init(void);
 
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index b01a9cb..9533d7a 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -145,7 +145,7 @@ BEGIN_FTR_SECTION
 
 	/* Fast Sleep wakeup on PowerNV */
 8:	GET_PACA(r13)
-	b 	.power7_wakeup_loss
+	b 	.power7_wakeup_tb_loss
 
 9:
 END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
index 14f78be..c3ab869 100644
--- a/arch/powerpc/kernel/idle_power7.S
+++ b/arch/powerpc/kernel/idle_power7.S
@@ -17,6 +17,7 @@
 #include <asm/ppc-opcode.h>
 #include <asm/hw_irq.h>
 #include <asm/kvm_book3s_asm.h>
+#include <asm/opal.h>
 
 #undef DEBUG
 
@@ -125,6 +126,32 @@ _GLOBAL(power7_sleep)
 	b	power7_powersave_common
 	/* No return */
 
+_GLOBAL(power7_wakeup_tb_loss)
+	ld	r2,PACATOC(r13);
+	ld	r1,PACAR1(r13)
+
+	/* Time base re-sync */
+	li	r0,OPAL_RESYNC_TIMEBASE
+	LOAD_REG_ADDR(r11,opal);
+	ld	r12,8(r11);
+	ld	r2,0(r11);
+	mtctr	r12
+	bctrl
+
+	/* TODO: Check r3 for failure */
+
+	REST_NVGPRS(r1)
+	REST_GPR(2, r1)
+	ld	r3,_CCR(r1)
+	ld	r4,_MSR(r1)
+	ld	r5,_NIP(r1)
+	addi	r1,r1,INT_FRAME_SIZE
+	mtcr	r3
+	mfspr	r3,SPRN_SRR1		/* Return SRR1 */
+	mtspr	SPRN_SRR1,r4
+	mtspr	SPRN_SRR0,r5
+	rfid
+
 _GLOBAL(power7_wakeup_loss)
 	ld	r1,PACAR1(r13)
 	REST_NVGPRS(r1)
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index 3e8829c..aab54b6 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -126,6 +126,7 @@ OPAL_CALL(opal_return_cpu,			OPAL_RETURN_CPU);
 OPAL_CALL(opal_validate_flash,			OPAL_FLASH_VALIDATE);
 OPAL_CALL(opal_manage_flash,			OPAL_FLASH_MANAGE);
 OPAL_CALL(opal_update_flash,			OPAL_FLASH_UPDATE);
+OPAL_CALL(opal_resync_timebase,			OPAL_RESYNC_TIMEBASE);
 OPAL_CALL(opal_get_msg,				OPAL_GET_MSG);
 OPAL_CALL(opal_check_completion,		OPAL_CHECK_ASYNC_COMPLETION);
 OPAL_CALL(opal_sync_host_reboot,		OPAL_SYNC_HOST_REBOOT);


* [PATCH 4/7] powernv/cpuidle: Add context management for Fast Sleep
From: Preeti U Murthy @ 2014-02-26  0:08 UTC
  To: linux-pm, geoff, fweisbec, daniel.lezcano, srivatsa.bhat, benh,
	tglx, svaidy, linuxppc-dev, mingo
  Cc: paulmck, rafael.j.wysocki
In-Reply-To: <20140226000310.17879.67295.stgit@preeti>

From: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>

Before adding Fast-Sleep into the cpuidle framework, some low level
support needs to be added to enable it. This includes saving and
restoring of certain registers at entry and exit time of this state
respectively just like we do in the NAP idle state.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
[Changelog modified by Preeti U. Murthy <preeti@linux.vnet.ibm.com>]
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
---
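
The common entry point in the patch below takes the requested state in r3 and
branches on it. A hedged C rendering of that dispatch; the enum and function
names are hypothetical, and the strings only name the instruction that the
assembly would execute.

```c
enum sketch_idle_state {
	SKETCH_IDLE_NAP = 0,
	SKETCH_IDLE_SLEEP = 1,
};

/*
 * Mirrors "cmpwi cr0,r3,1; beq 2f": only the value 1 selects sleep,
 * every other value falls through to the nap sequence.
 */
static const char *sketch_idle_instruction(int state)
{
	if (state == SKETCH_IDLE_SLEEP)
		return "sleep";
	return "nap";
}
```

In this model power7_nap() and power7_sleep() reduce to loading 0 or 1 and
jumping to the shared save-state path.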

 arch/powerpc/include/asm/processor.h |    1 +
 arch/powerpc/kernel/exceptions-64s.S |   10 ++++-
 arch/powerpc/kernel/idle_power7.S    |   63 ++++++++++++++++++++++++----------
 3 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index b62de43..d660dc3 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -450,6 +450,7 @@ enum idle_boot_override {IDLE_NO_OVERRIDE = 0, IDLE_POWERSAVE_OFF};
 
 extern int powersave_nap;	/* set if nap mode can be used in idle loop */
 extern void power7_nap(void);
+extern void power7_sleep(void);
 extern void flush_instruction_cache(void);
 extern void hard_reset_now(void);
 extern void poweroff_now(void);
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 38d5073..b01a9cb 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -121,9 +121,10 @@ BEGIN_FTR_SECTION
 	cmpwi	cr1,r13,2
 	/* Total loss of HV state is fatal, we could try to use the
 	 * PIR to locate a PACA, then use an emergency stack etc...
-	 * but for now, let's just stay stuck here
+	 * OPAL v3 based powernv platforms have new idle states
+	 * which fall in this category.
 	 */
-	bgt	cr1,.
+	bgt	cr1,8f
 	GET_PACA(r13)
 
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
@@ -141,6 +142,11 @@ BEGIN_FTR_SECTION
 	beq	cr1,2f
 	b	.power7_wakeup_noloss
 2:	b	.power7_wakeup_loss
+
+	/* Fast Sleep wakeup on PowerNV */
+8:	GET_PACA(r13)
+	b 	.power7_wakeup_loss
+
 9:
 END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
 #endif /* CONFIG_PPC_P7_NAP */
diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
index 3fdef0f..14f78be 100644
--- a/arch/powerpc/kernel/idle_power7.S
+++ b/arch/powerpc/kernel/idle_power7.S
@@ -20,17 +20,27 @@
 
 #undef DEBUG
 
-	.text
+/* Idle state entry routines */
 
-_GLOBAL(power7_idle)
-	/* Now check if user or arch enabled NAP mode */
-	LOAD_REG_ADDRBASE(r3,powersave_nap)
-	lwz	r4,ADDROFF(powersave_nap)(r3)
-	cmpwi	0,r4,0
-	beqlr
-	/* fall through */
+#define	IDLE_STATE_ENTER_SEQ(IDLE_INST)				\
+	/* Magic NAP/SLEEP/WINKLE mode enter sequence */	\
+	std	r0,0(r1);					\
+	ptesync;						\
+	ld	r0,0(r1);					\
+1:	cmp	cr0,r0,r0;					\
+	bne	1b;						\
+	IDLE_INST;						\
+	b	.
 
-_GLOBAL(power7_nap)
+	.text
+
+/*
+ * Pass requested state in r3:
+ * 	0 - nap
+ * 	1 - sleep
+ */
+_GLOBAL(power7_powersave_common)
+	/* Use r3 to pass state nap/sleep/winkle */
 	/* NAP is a state loss, we create a regs frame on the
 	 * stack, fill it up with the state we care about and
 	 * stick a pointer to it in PACAR1. We really only
@@ -79,8 +89,8 @@ _GLOBAL(power7_nap)
 	/* Continue saving state */
 	SAVE_GPR(2, r1)
 	SAVE_NVGPRS(r1)
-	mfcr	r3
-	std	r3,_CCR(r1)
+	mfcr	r4
+	std	r4,_CCR(r1)
 	std	r9,_MSR(r1)
 	std	r1,PACAR1(r13)
 
@@ -90,15 +100,30 @@ _GLOBAL(power7_enter_nap_mode)
 	li	r4,KVM_HWTHREAD_IN_NAP
 	stb	r4,HSTATE_HWTHREAD_STATE(r13)
 #endif
+	cmpwi	cr0,r3,1
+	beq	2f
+	IDLE_STATE_ENTER_SEQ(PPC_NAP)
+	/* No return */
+2:	IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
+	/* No return */
 
-	/* Magic NAP mode enter sequence */
-	std	r0,0(r1)
-	ptesync
-	ld	r0,0(r1)
-1:	cmp	cr0,r0,r0
-	bne	1b
-	PPC_NAP
-	b	.
+_GLOBAL(power7_idle)
+	/* Now check if user or arch enabled NAP mode */
+	LOAD_REG_ADDRBASE(r3,powersave_nap)
+	lwz	r4,ADDROFF(powersave_nap)(r3)
+	cmpwi	0,r4,0
+	beqlr
+	/* fall through */
+
+_GLOBAL(power7_nap)
+	li	r3,0
+	b	power7_powersave_common
+	/* No return */
+
+_GLOBAL(power7_sleep)
+	li	r3,1
+	b	power7_powersave_common
+	/* No return */
 
 _GLOBAL(power7_wakeup_loss)
 	ld	r1,PACAR1(r13)

^ permalink raw reply related

* [PATCH 3/7] cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines
From: Preeti U Murthy @ 2014-02-26  0:08 UTC (permalink / raw)
  To: linux-pm, geoff, fweisbec, daniel.lezcano, srivatsa.bhat, benh,
	tglx, svaidy, linuxppc-dev, mingo
  Cc: paulmck, rafael.j.wysocki
In-Reply-To: <20140226000310.17879.67295.stgit@preeti>

Split timer_interrupt(), the local timer interrupt handler on ppc, into two
parts: the routines called during regular interrupt handling, and
__timer_interrupt(), which takes care of running local timers and collecting
time related stats.

This will enable callers interested only in running expired local timers to
call directly into __timer_interrupt(). One use case is the tick broadcast IPI
handling, in which the sleeping CPUs need to handle the local timers that have
expired.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 arch/powerpc/kernel/time.c |   81 +++++++++++++++++++++++++-------------------
 1 file changed, 46 insertions(+), 35 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 3ff97db..df2989b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -478,6 +478,47 @@ void arch_irq_work_raise(void)
 
 #endif /* CONFIG_IRQ_WORK */
 
+void __timer_interrupt(void)
+{
+	struct pt_regs *regs = get_irq_regs();
+	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
+	struct clock_event_device *evt = &__get_cpu_var(decrementers);
+	u64 now;
+
+	trace_timer_interrupt_entry(regs);
+
+	if (test_irq_work_pending()) {
+		clear_irq_work_pending();
+		irq_work_run();
+	}
+
+	now = get_tb_or_rtc();
+	if (now >= *next_tb) {
+		*next_tb = ~(u64)0;
+		if (evt->event_handler)
+			evt->event_handler(evt);
+		__get_cpu_var(irq_stat).timer_irqs_event++;
+	} else {
+		now = *next_tb - now;
+		if (now <= DECREMENTER_MAX)
+			set_dec((int)now);
+		/* We may have raced with new irq work */
+		if (test_irq_work_pending())
+			set_dec(1);
+		__get_cpu_var(irq_stat).timer_irqs_others++;
+	}
+
+#ifdef CONFIG_PPC64
+	/* collect purr register values often, for accurate calculations */
+	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
+		struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array);
+		cu->current_tb = mfspr(SPRN_PURR);
+	}
+#endif
+
+	trace_timer_interrupt_exit(regs);
+}
+
 /*
  * timer_interrupt - gets called when the decrementer overflows,
  * with interrupts disabled.
@@ -486,8 +527,6 @@ void timer_interrupt(struct pt_regs * regs)
 {
 	struct pt_regs *old_regs;
 	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
-	struct clock_event_device *evt = &__get_cpu_var(decrementers);
-	u64 now;
 
 	/* Ensure a positive value is written to the decrementer, or else
 	 * some CPUs will continue to take decrementer exceptions.
@@ -519,39 +558,7 @@ void timer_interrupt(struct pt_regs * regs)
 	old_regs = set_irq_regs(regs);
 	irq_enter();
 
-	trace_timer_interrupt_entry(regs);
-
-	if (test_irq_work_pending()) {
-		clear_irq_work_pending();
-		irq_work_run();
-	}
-
-	now = get_tb_or_rtc();
-	if (now >= *next_tb) {
-		*next_tb = ~(u64)0;
-		if (evt->event_handler)
-			evt->event_handler(evt);
-		__get_cpu_var(irq_stat).timer_irqs_event++;
-	} else {
-		now = *next_tb - now;
-		if (now <= DECREMENTER_MAX)
-			set_dec((int)now);
-		/* We may have raced with new irq work */
-		if (test_irq_work_pending())
-			set_dec(1);
-		__get_cpu_var(irq_stat).timer_irqs_others++;
-	}
-
-#ifdef CONFIG_PPC64
-	/* collect purr register values often, for accurate calculations */
-	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
-		struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array);
-		cu->current_tb = mfspr(SPRN_PURR);
-	}
-#endif
-
-	trace_timer_interrupt_exit(regs);
-
+	__timer_interrupt();
 	irq_exit();
 	set_irq_regs(old_regs);
 }
@@ -828,6 +835,10 @@ static void decrementer_set_mode(enum clock_event_mode mode,
 /* Interrupt handler for the timer broadcast IPI */
 void tick_broadcast_ipi_handler(void)
 {
+	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
+
+	*next_tb = get_tb_or_rtc();
+	__timer_interrupt();
 }
 
 static void register_decrementer_clockevent(int cpu)

^ permalink raw reply related
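
As a rough illustration of the split described in the patch above, here is a
minimal user-space C model. It is a sketch under stated assumptions, not kernel
code: every model_* name, fake_tb, events_run and dec_value are hypothetical
stand-ins for get_tb_or_rtc(), the per-CPU decrementers_next_tb, the clock
event handler and set_dec(). irq_work, tracing and the PURR accounting are
left out.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_DECREMENTER_MAX 0x7fffffffULL

/* Hypothetical stand-ins for per-CPU state and hardware access. */
static uint64_t fake_tb;           /* models the value of get_tb_or_rtc() */
static uint64_t next_tb = ~0ULL;   /* models decrementers_next_tb */
static int events_run;             /* counts simulated event_handler() calls */
static int dec_value = -1;         /* last value passed to the set_dec() model */

static uint64_t model_get_tb(void)
{
	return fake_tb;
}

static void model_set_dec(int val)
{
	assert(val > 0);           /* only positive values are programmed here */
	dec_value = val;
}

/* Timer handling only: no irq_enter()/irq_exit(), no pt_regs juggling. */
static void model_timer_handling(void)
{
	uint64_t now = model_get_tb();

	if (now >= next_tb) {
		/* Timer expired: disarm, then run the event handler. */
		next_tb = ~0ULL;
		events_run++;
	} else {
		/* Early wakeup: reprogram the decrementer for the remainder. */
		uint64_t delta = next_tb - now;

		if (delta <= MODEL_DECREMENTER_MAX)
			model_set_dec((int)delta);
	}
}

/* Broadcast IPI path: treat the local timer as expired now, then handle it. */
static void model_tick_broadcast_ipi(void)
{
	next_tb = model_get_tb();
	model_timer_handling();
}
```

With fake_tb = 100 and next_tb = 150, model_timer_handling() only reprograms
the decrementer model with 50, while model_tick_broadcast_ipi() forces the
expiry path and runs the handler once. That mirrors why the IPI handler sets
*next_tb to the current timebase before calling __timer_interrupt().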

* [PATCH 2/7] powerpc: Implement tick broadcast IPI as a fixed IPI message
From: Preeti U Murthy @ 2014-02-26  0:07 UTC (permalink / raw)
  To: linux-pm, geoff, fweisbec, daniel.lezcano, srivatsa.bhat, benh,
	tglx, svaidy, linuxppc-dev, mingo
  Cc: paulmck, rafael.j.wysocki
In-Reply-To: <20140226000310.17879.67295.stgit@preeti>

From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

For scalability and performance reasons, we want the tick broadcast IPIs
to be handled as efficiently as possible. Fixed IPI messages
are one of the most efficient mechanisms available - they are faster than
the smp_call_function mechanism because the IPI handlers are fixed and hence
they don't involve costly operations such as adding IPI handlers to the target
CPU's function queue, acquiring locks for synchronization etc.

Luckily we have an unused IPI message slot, so use that to implement
tick broadcast IPIs efficiently.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
[Functions renamed to tick_broadcast* and Changelog modified by
 Preeti U. Murthy <preeti@linux.vnet.ibm.com>]
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Geoff Levand <geoff@infradead.org> [For the PS3 part]
---

 arch/powerpc/include/asm/smp.h          |    2 +-
 arch/powerpc/include/asm/time.h         |    1 +
 arch/powerpc/kernel/smp.c               |   21 +++++++++++++++++----
 arch/powerpc/kernel/time.c              |    5 +++++
 arch/powerpc/platforms/cell/interrupt.c |    2 +-
 arch/powerpc/platforms/ps3/smp.c        |    2 +-
 6 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 9f7356b..ff51046 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu);
  * in /proc/interrupts will be wrong!!! --Troy */
 #define PPC_MSG_CALL_FUNCTION   0
 #define PPC_MSG_RESCHEDULE      1
-#define PPC_MSG_UNUSED		2
+#define PPC_MSG_TICK_BROADCAST	2
 #define PPC_MSG_DEBUGGER_BREAK  3
 
 /* for irq controllers that have dedicated ipis per message (4) */
diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index c1f2676..1d428e6 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -28,6 +28,7 @@ extern struct clock_event_device decrementer_clockevent;
 struct rtc_time;
 extern void to_tm(int tim, struct rtc_time * tm);
 extern void GregorianDay(struct rtc_time *tm);
+extern void tick_broadcast_ipi_handler(void);
 
 extern void generic_calibrate_decr(void);
 
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ee7d76b..e2a4232 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -35,6 +35,7 @@
 #include <asm/ptrace.h>
 #include <linux/atomic.h>
 #include <asm/irq.h>
+#include <asm/hw_irq.h>
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/prom.h>
@@ -145,9 +146,9 @@ static irqreturn_t reschedule_action(int irq, void *data)
 	return IRQ_HANDLED;
 }
 
-static irqreturn_t unused_action(int irq, void *data)
+static irqreturn_t tick_broadcast_ipi_action(int irq, void *data)
 {
-	/* This slot is unused and hence available for use, if needed */
+	tick_broadcast_ipi_handler();
 	return IRQ_HANDLED;
 }
 
@@ -168,14 +169,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
 static irq_handler_t smp_ipi_action[] = {
 	[PPC_MSG_CALL_FUNCTION] =  call_function_action,
 	[PPC_MSG_RESCHEDULE] = reschedule_action,
-	[PPC_MSG_UNUSED] = unused_action,
+	[PPC_MSG_TICK_BROADCAST] = tick_broadcast_ipi_action,
 	[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
 };
 
 const char *smp_ipi_name[] = {
 	[PPC_MSG_CALL_FUNCTION] =  "ipi call function",
 	[PPC_MSG_RESCHEDULE] = "ipi reschedule",
-	[PPC_MSG_UNUSED] = "ipi unused",
+	[PPC_MSG_TICK_BROADCAST] = "ipi tick-broadcast",
 	[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
 };
 
@@ -251,6 +252,8 @@ irqreturn_t smp_ipi_demux(void)
 			generic_smp_call_function_interrupt();
 		if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE))
 			scheduler_ipi();
+		if (all & IPI_MESSAGE(PPC_MSG_TICK_BROADCAST))
+			tick_broadcast_ipi_handler();
 		if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK))
 			debug_ipi_action(0, NULL);
 	} while (info->messages);
@@ -289,6 +292,16 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask)
 		do_message_pass(cpu, PPC_MSG_CALL_FUNCTION);
 }
 
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
+void tick_broadcast(const struct cpumask *mask)
+{
+	unsigned int cpu;
+
+	for_each_cpu(cpu, mask)
+		do_message_pass(cpu, PPC_MSG_TICK_BROADCAST);
+}
+#endif
+
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
 void smp_send_debugger_break(void)
 {
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index b3dab20..3ff97db 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -825,6 +825,11 @@ static void decrementer_set_mode(enum clock_event_mode mode,
 		decrementer_set_next_event(DECREMENTER_MAX, dev);
 }
 
+/* Interrupt handler for the timer broadcast IPI */
+void tick_broadcast_ipi_handler(void)
+{
+}
+
 static void register_decrementer_clockevent(int cpu)
 {
 	struct clock_event_device *dec = &per_cpu(decrementers, cpu);
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c
index adf3726..8a106b4 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -215,7 +215,7 @@ void iic_request_IPIs(void)
 {
 	iic_request_ipi(PPC_MSG_CALL_FUNCTION);
 	iic_request_ipi(PPC_MSG_RESCHEDULE);
-	iic_request_ipi(PPC_MSG_UNUSED);
+	iic_request_ipi(PPC_MSG_TICK_BROADCAST);
 	iic_request_ipi(PPC_MSG_DEBUGGER_BREAK);
 }
 
diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c
index 00d1a7c..b358bec 100644
--- a/arch/powerpc/platforms/ps3/smp.c
+++ b/arch/powerpc/platforms/ps3/smp.c
@@ -76,7 +76,7 @@ static int __init ps3_smp_probe(void)
 
 		BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION    != 0);
 		BUILD_BUG_ON(PPC_MSG_RESCHEDULE       != 1);
-		BUILD_BUG_ON(PPC_MSG_UNUSED	      != 2);
+		BUILD_BUG_ON(PPC_MSG_TICK_BROADCAST   != 2);
 		BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK   != 3);
 
 		for (i = 0; i < MSG_COUNT; i++) {

^ permalink raw reply related
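
To make the "fixed IPI message" idea in the patch above concrete, here is a
small user-space C model. It is only a sketch: the MSG_* enum mirrors the four
PPC_MSG_* slots from the patch, but pending, handled[], model_message_pass()
and model_ipi_demux() are hypothetical names, and one bit per message is a
simplification of the real per-CPU message storage.

```c
#include <assert.h>

/* The four fixed slots, numbered as in the patch. */
enum {
	MSG_CALL_FUNCTION  = 0,
	MSG_RESCHEDULE     = 1,
	MSG_TICK_BROADCAST = 2,
	MSG_DEBUGGER_BREAK = 3,
	MSG_COUNT          = 4,
};

#define MSG_BIT(m) (1u << (m))

static unsigned int pending;     /* one bit per message, models info->messages */
static int handled[MSG_COUNT];   /* how often each fixed handler ran */

/* Sender side: no queueing, no locks, just mark the slot as pending. */
static void model_message_pass(int msg)
{
	assert(msg >= 0 && msg < MSG_COUNT);
	pending |= MSG_BIT(msg);
}

/* Receiver side: drain all pending bits and call the fixed handlers. */
static void model_ipi_demux(void)
{
	while (pending) {
		unsigned int all = pending;

		pending = 0;
		for (int m = 0; m < MSG_COUNT; m++)
			if (all & MSG_BIT(m))
				handled[m]++;
	}
}
```

Because a fixed message carries no payload, two tick broadcast requests sent
before the receiver runs collapse into a single handler call in this model.
That is harmless for a "run your expired timers" notification, and it is part
of what makes the mechanism cheaper than queueing a function call.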

* [PATCH 1/7] powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message
From: Preeti U Murthy @ 2014-02-26  0:07 UTC (permalink / raw)
  To: linux-pm, geoff, fweisbec, daniel.lezcano, srivatsa.bhat, benh,
	tglx, svaidy, linuxppc-dev, mingo
  Cc: paulmck, rafael.j.wysocki
In-Reply-To: <20140226000310.17879.67295.stgit@preeti>

From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

The IPI handlers for both PPC_MSG_CALL_FUNCTION and PPC_MSG_CALL_FUNC_SINGLE map
to a common implementation - generic_smp_call_function_single_interrupt(). So,
we can consolidate them and save one of the IPI message slots (which are
precious on powerpc, since only 4 of those slots are available).

So, implement the functionality of PPC_MSG_CALL_FUNC_SINGLE using
PPC_MSG_CALL_FUNCTION itself and release its IPI message slot, so that it can be
used for something else in the future, if desired.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Geoff Levand <geoff@infradead.org> [For the PS3 part]
---

 arch/powerpc/include/asm/smp.h          |    2 +-
 arch/powerpc/kernel/smp.c               |   12 +++++-------
 arch/powerpc/platforms/cell/interrupt.c |    2 +-
 arch/powerpc/platforms/ps3/smp.c        |    2 +-
 4 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 084e080..9f7356b 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu);
  * in /proc/interrupts will be wrong!!! --Troy */
 #define PPC_MSG_CALL_FUNCTION   0
 #define PPC_MSG_RESCHEDULE      1
-#define PPC_MSG_CALL_FUNC_SINGLE	2
+#define PPC_MSG_UNUSED		2
 #define PPC_MSG_DEBUGGER_BREAK  3
 
 /* for irq controllers that have dedicated ipis per message (4) */
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ac2621a..ee7d76b 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -145,9 +145,9 @@ static irqreturn_t reschedule_action(int irq, void *data)
 	return IRQ_HANDLED;
 }
 
-static irqreturn_t call_function_single_action(int irq, void *data)
+static irqreturn_t unused_action(int irq, void *data)
 {
-	generic_smp_call_function_single_interrupt();
+	/* This slot is unused and hence available for use, if needed */
 	return IRQ_HANDLED;
 }
 
@@ -168,14 +168,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
 static irq_handler_t smp_ipi_action[] = {
 	[PPC_MSG_CALL_FUNCTION] =  call_function_action,
 	[PPC_MSG_RESCHEDULE] = reschedule_action,
-	[PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action,
+	[PPC_MSG_UNUSED] = unused_action,
 	[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
 };
 
 const char *smp_ipi_name[] = {
 	[PPC_MSG_CALL_FUNCTION] =  "ipi call function",
 	[PPC_MSG_RESCHEDULE] = "ipi reschedule",
-	[PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single",
+	[PPC_MSG_UNUSED] = "ipi unused",
 	[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
 };
 
@@ -251,8 +251,6 @@ irqreturn_t smp_ipi_demux(void)
 			generic_smp_call_function_interrupt();
 		if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE))
 			scheduler_ipi();
-		if (all & IPI_MESSAGE(PPC_MSG_CALL_FUNC_SINGLE))
-			generic_smp_call_function_single_interrupt();
 		if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK))
 			debug_ipi_action(0, NULL);
 	} while (info->messages);
@@ -280,7 +278,7 @@ EXPORT_SYMBOL_GPL(smp_send_reschedule);
 
 void arch_send_call_function_single_ipi(int cpu)
 {
-	do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE);
+	do_message_pass(cpu, PPC_MSG_CALL_FUNCTION);
 }
 
 void arch_send_call_function_ipi_mask(const struct cpumask *mask)
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c
index 2d42f3b..adf3726 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -215,7 +215,7 @@ void iic_request_IPIs(void)
 {
 	iic_request_ipi(PPC_MSG_CALL_FUNCTION);
 	iic_request_ipi(PPC_MSG_RESCHEDULE);
-	iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE);
+	iic_request_ipi(PPC_MSG_UNUSED);
 	iic_request_ipi(PPC_MSG_DEBUGGER_BREAK);
 }
 
diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c
index 4b35166..00d1a7c 100644
--- a/arch/powerpc/platforms/ps3/smp.c
+++ b/arch/powerpc/platforms/ps3/smp.c
@@ -76,7 +76,7 @@ static int __init ps3_smp_probe(void)
 
 		BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION    != 0);
 		BUILD_BUG_ON(PPC_MSG_RESCHEDULE       != 1);
-		BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2);
+		BUILD_BUG_ON(PPC_MSG_UNUSED	      != 2);
 		BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK   != 3);
 
 		for (i = 0; i < MSG_COUNT; i++) {

^ permalink raw reply related

* [PATCH 0/7] cpuidle/powernv: Enable Fast-Sleep on PowerNV
From: Preeti U Murthy @ 2014-02-26  0:07 UTC (permalink / raw)
  To: linux-pm, geoff, fweisbec, daniel.lezcano, srivatsa.bhat, benh,
	tglx, svaidy, linuxppc-dev, mingo
  Cc: paulmck, rafael.j.wysocki

This series is based on tip/timers/core on top of commit
849401b66d305:tick: Fixup more fallout from hrtimer broadcast mode.

Fast sleep is one of the deep idle states on Power8 in which local timers of
CPUs stop. On PowerPC we do not have an external clock device which can
handle wakeup of such CPUs. Now that support in the tick broadcast framework
for archs that lack such a device is about to go upstream, add fast sleep as
one of the idle states on PowerNV along with the related arch specific support.

The earlier versions of this patchset included support in the tick broadcast
framework for such idle states. Now that the support in the broadcast
framework has been pulled into tip separately, this series is posted
independently and as a new patchset altogether. This series depends in
particular on the following commits in tip/timers/core:

1. da7e6f45c3: time: Change the return type of clockevents_notify() to integer
2. ba8f20c2eb: cpuidle: Handle clockevents_notify(BROADCAST_ENTER) failure
3. 5d1638acb9f62fa: tick: Introduce hrtimer based broadcast
4. f1689bb7abec8e2e6: time: Fixup fallout from recent clockevent/tick changes
5. 849401b66d305f3feb75: tick: Fixup more fallout from hrtimer broadcast mode

---

Preeti U Murthy (3):
      cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines
      cpuidle/powernv: Add "Fast-Sleep" CPU idle state
      cpuidle/powernv: Parse device tree to setup idle states

Srivatsa S. Bhat (2):
      powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message
      powerpc: Implement tick broadcast IPI as a fixed IPI message

Vaidyanathan Srinivasan (2):
      powernv/cpuidle: Add context management for Fast Sleep
      powermgt: Add OPAL call to resync timebase on wakeup


 arch/powerpc/Kconfig                           |    2 
 arch/powerpc/include/asm/opal.h                |    2 
 arch/powerpc/include/asm/processor.h           |    1 
 arch/powerpc/include/asm/smp.h                 |    2 
 arch/powerpc/include/asm/time.h                |    1 
 arch/powerpc/kernel/exceptions-64s.S           |   10 ++
 arch/powerpc/kernel/idle_power7.S              |   90 +++++++++++++++++----
 arch/powerpc/kernel/smp.c                      |   25 ++++--
 arch/powerpc/kernel/time.c                     |   90 +++++++++++++--------
 arch/powerpc/platforms/cell/interrupt.c        |    2 
 arch/powerpc/platforms/powernv/opal-wrappers.S |    1 
 arch/powerpc/platforms/ps3/smp.c               |    2 
 drivers/cpuidle/cpuidle-powernv.c              |  102 ++++++++++++++++++++++--
 13 files changed, 253 insertions(+), 77 deletions(-)

-- 
Signature

^ permalink raw reply

* Re: [PATCH] powerpc/powernv: Read opal error log and export it through sysfs interface.
From: Stewart Smith @ 2014-02-25 23:19 UTC (permalink / raw)
  To: Mahesh Jagannath Salgaonkar, linuxppc-dev
In-Reply-To: <530C23D3.6020203@linux.vnet.ibm.com>

Mahesh Jagannath Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
>>  I think we could provide a better interface with instead having a file
>>  per log message appear in sysfs. We're never going to have more than 128
>>  of these at any one time on the Linux side, so it's not going to be too
>>  many files.
>
> It is not just about 128 files, we may be adding/removing sysfs node for
> every new log id that gets informed to kernel and ack-ed. In worst case,
> when we have flood of elog errors with user daemon consuming it and
> ack-ing back to get ready for next log in a tight poll, we may
> continuously add/remove the sysfs node for each new <id>.

Do we ever get a storm of hundreds/thousands of them though? If many
come in at once, userspace may just be woken up one or two times, as it
would just select() and wait for events.

>>  I've seen some conflicting things on this - is it 2kb or 16kb?
>
> We choose 16kb because we want to pull all the log data and not
> partial.

So the max log size for any one entry is in fact 16kb?

>>  This means we constantly use 128 * sizeof(struct opal_err_log) which
>>  equates to somewhere north of 2MB of memory (due to list overhead).
>> 
>>  I don't think we need to statically allocate this, we can probably just
>>  allocate on-demand as in a typical system you're probably quite
>>  unlikely to have too many of these sitting around (besides, if for
>>  whatever reason we cannot allocate memory at some point, that's okay
>>  because we can read it again later).
>
> The reason we choose to go for static allocation is, we can not afford
> to drop or delay a critical error log due to memory allocation failure.
> OR we can keep static allocations for critical errors and follow dynamic
> allocation for informative error logs.  What do you say?

Userspace is probably going to have to do IO to get the log and ack it,
so it's probably not a huge problem - if we can't allocate a few kb in a
couple of attempts then we likely have bigger problems.

If we were going to have a sustained amount of hundreds/thousands of
these per second then perhaps we'd have other issues, but from what I
understand we're probably only going to have a handful per year on a
typical system? (I am, of course, not talking about our dev systems,
which are rather atypical :)

I'll likely have a patch today that shows kind of what I mean.

^ permalink raw reply
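
The memory figures discussed in the reply above can be checked with a short C
sketch. The entry count (128) and the 16kb buffer size come from the thread;
everything else is an assumption for illustration: struct model_elog is a
hypothetical layout, not the struct opal_err_log from the patch under review,
and 16kb is taken to mean 16 KiB.

```c
#include <assert.h>
#include <stddef.h>

#define ELOG_MAX_ENTRIES 128            /* upper bound quoted in the thread */
#define ELOG_ENTRY_BYTES (16 * 1024)    /* assumed: "16kb" means 16 KiB */

/* Hypothetical entry layout: list linkage plus a full-size data buffer. */
struct model_elog {
	struct model_elog *next, *prev;
	unsigned long id;
	size_t size;
	unsigned char data[ELOG_ENTRY_BYTES];
};

/* Static scheme: every entry is reserved up front. */
static size_t static_pool_bytes(void)
{
	return ELOG_MAX_ENTRIES * sizeof(struct model_elog);
}

/* On-demand scheme: only entries currently held cost memory. */
static size_t on_demand_bytes(size_t live_entries)
{
	assert(live_entries <= ELOG_MAX_ENTRIES);
	return live_entries * sizeof(struct model_elog);
}
```

In this model the static pool is 128 buffers of 16 KiB plus per-entry
overhead, which is slightly more than 2 MiB, consistent with the "somewhere
north of 2MB" estimate. With only a handful of logs outstanding, the
on-demand variant stays in the tens of KiB.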

* Re: [PATCH] powerpc: warn users of smt-snooze-delay that the API isn't there anymore
From: Cody P Schafer @ 2014-02-25 22:47 UTC (permalink / raw)
  To: Madhavan Srinivasan, Benjamin Herrenschmidt, Olof Johansson,
	Paul Gortmaker, Wang Dongsheng
  Cc: linuxppc-dev, Paul Mackerras, linux-kernel
In-Reply-To: <530C21E6.5020106@linux.vnet.ibm.com>

On 02/24/2014 08:53 PM, Madhavan Srinivasan wrote:
> On Saturday 22 February 2014 05:44 AM, Cody P Schafer wrote:
>> /sys/devices/system/cpu/cpu*/smt-snooze-delay was converted into a NOP
>> in commit 3fa8cad82b94d0bed002571bd246f2299ffc876b, and now does
>> nothing. Add a pr_warn() to convince any users that they should stop
>> using it.
>>
>> The commit message from the removing commit notes that this
>> functionality should move into the cpuidle driver, essentially by
>
> Would prefer to cleanup the code since the functionality is moved,
> instead of adding to it.

We'd still want users of the interface to use an attribute wired up 
under the cpuidle/ dir, so a warning (to update their software) is still 
needed. As deepthi has noted, cpuidle right now doesn't support changing 
this on a per-cpu basis, so a "cleanup" isn't a simple matter.

^ permalink raw reply

* Re: [PATCH] powerpc: warn users of smt-snooze-delay that the API isn't there anymore
From: Benjamin Herrenschmidt @ 2014-02-25 22:40 UTC (permalink / raw)
  To: Deepthi Dharwar
  Cc: Madhavan Srinivasan, linuxppc-dev, Wang Dongsheng, linux-kernel,
	Paul Gortmaker, Paul Mackerras, Olof Johansson, Cody P Schafer
In-Reply-To: <530C4D57.1030905@linux.vnet.ibm.com>

On Tue, 2014-02-25 at 13:29 +0530, Deepthi Dharwar wrote:
> We currently do not use smt-snooze-delay in the kernel.
> The sysfs entries need to be retained until we clean up the ppc64_cpu
> util that uses these entries to determine SMT;
> a clean up patch for this has already been posted by Prerna.
> Once we have the ppc64_cpu changes in, we can look to clean up these
> parts from the kernel.

We generally shouldn't change user visible interfaces.

People still have old versions of ppc64_cpu, we must not break them.

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH v2 01/11] perf: add PMU_RANGE_ATTR() helper for use by sw-like pmus
From: Cody P Schafer @ 2014-02-25 22:19 UTC (permalink / raw)
  To: Michael Ellerman, Linux PPC, Arnaldo Carvalho de Melo,
	Ingo Molnar, Paul Mackerras, Peter Zijlstra
  Cc: LKML
In-Reply-To: <530CFE30.70803@linux.vnet.ibm.com>

On 02/25/2014 12:33 PM, Cody P Schafer wrote:
> On 02/24/2014 07:33 PM, Michael Ellerman wrote:
>> On Fri, 2014-14-02 at 22:02:05 UTC, Cody P Schafer wrote:
>>> Add PMU_RANGE_ATTR() and PMU_RANGE_RESV() (for reserved areas) which
>>> generate functions to extract the relevant bits from
>>> event->attr.config{,1,2} for use by sw-like pmus where the
>>> 'config{,1,2}' values don't map directly to hardware registers.
>>>
>>> Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
>>> ---
>>>   include/linux/perf_event.h | 17 +++++++++++++++++
>>>   1 file changed, 17 insertions(+)
>>>
>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>>> index e56b07f..2702e91 100644
>>> --- a/include/linux/perf_event.h
>>> +++ b/include/linux/perf_event.h
>>> @@ -871,4 +871,21 @@ _name##_show(struct device *dev,                    \
>>>                                       \
>>>   static struct device_attribute format_attr_##_name = __ATTR_RO(_name)
>>>
>>> +#define PMU_RANGE_ATTR(name, attr_var, bit_start, bit_end)        \
>>> +PMU_FORMAT_ATTR(name, #attr_var ":" #bit_start "-" #bit_end);        \
>>> +PMU_RANGE_RESV(name, attr_var, bit_start, bit_end)
>>> +
>>> +#define PMU_RANGE_RESV(name, attr_var, bit_start, bit_end)        \
>>> +static u64 event_get_##name##_max(void)                    \
>>> +{                                    \
>>> +    int bits = (bit_end) - (bit_start) + 1;                \
>>> +    return ((0x1ULL << (bits - 1ULL)) - 1ULL) |            \
>>> +        (0xFULL << (bits - 4ULL));                \
>>> +}                                    \
>>> +static u64 event_get_##name(struct perf_event *event)            \
>>> +{                                    \
>>> +    return (event->attr.attr_var >> (bit_start)) &            \
>>> +        event_get_##name##_max();                \
>>> +}
>>
>> I still don't like the names.
>>
>> EVENT_GETTER_AND_FORMAT()
>
> EVENT_RANGE()
>
> I'd prefer to describe the intended usage rather than what is generated
> both in case we change some of the specifics later, and to provide
> additional information to the developers beyond what a simple code
> reading gives.
>
>> EVENT_RESERVED()
>
> Sure. The PMU_* naming was just based on the PMU_FORMAT_ATTR() naming,
> so I kept it for continuity with the existing API. Maybe
> EVENT_RANGE_RESERVED() would be more appropriate?
>

Thinking about this a bit more, EVENT_RANGE() and EVENT_RANGE_RESERVED()
aren't quite ideal either. The "EVENT" name collides with the files we
put in the event/ dir, whereas these macros generate files for the format/
dir. Maybe:

FORMAT_RANGE() and FORMAT_RANGE_RESERVED()
or
PMU_FORMAT_RANGE(), PMU_FORMAT_RANGE_RESERVED()

^ permalink raw reply
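
Whatever the macros end up being called, the getter they generate boils down
to "shift the config word right by bit_start, then mask to the width of the
range". A minimal stand-alone C sketch of that operation follows. Note the
assumptions: range_max() and range_get() are hypothetical names, they take a
plain 64-bit value instead of a struct perf_event, and the mask uses the
conventional (1 << bits) - 1 form rather than the expression in the quoted
patch.

```c
#include <assert.h>
#include <stdint.h>

/* Largest value that fits in the inclusive bit range [bit_start, bit_end]. */
static uint64_t range_max(unsigned int bit_start, unsigned int bit_end)
{
	unsigned int bits;

	assert(bit_start <= bit_end && bit_end < 64);
	bits = bit_end - bit_start + 1;

	/* Shifting a 64-bit value by 64 is undefined, so special-case it. */
	return bits == 64 ? ~0ULL : (1ULL << bits) - 1;
}

/* Extract the field stored in bits [bit_start, bit_end] of config. */
static uint64_t range_get(uint64_t config, unsigned int bit_start,
			  unsigned int bit_end)
{
	return (config >> bit_start) & range_max(bit_start, bit_end);
}
```

For example, range_get(0xabcd1234, 16, 31) yields 0xabcd and
range_get(0xabcd1234, 0, 15) yields 0x1234, which is the behaviour a format
string such as "config:16-31" advertises to userspace.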

* Re: [rtc-linux] [PATCH] rtc/ds3232: Enable ds3232 to work as wakeup source
From: Andrew Morton @ 2014-02-25 22:07 UTC (permalink / raw)
  To: rtc-linux; +Cc: a.zummo, linuxppc-dev, Dongsheng Wang, chenhui.zhao
In-Reply-To: <1390281891-9632-1-git-send-email-dongsheng.wang@freescale.com>

On Tue, 21 Jan 2014 13:24:51 +0800 Dongsheng Wang <dongsheng.wang@freescale.com> wrote:

> From: Wang Dongsheng <dongsheng.wang@freescale.com>
> 
> Add suspend/resume and device_init_wakeup to enable ds3232 as
> wakeup source, /sys/class/rtc/rtcX/wakealarm for set wakeup alarm.
> 
> ...
> 
> @@ -411,23 +424,21 @@ static int ds3232_probe(struct i2c_client *client,
>  	if (ret)
>  		return ret;
>  
> -	ds3232->rtc = devm_rtc_device_register(&client->dev, client->name,
> -					  &ds3232_rtc_ops, THIS_MODULE);
> -	if (IS_ERR(ds3232->rtc)) {
> -		dev_err(&client->dev, "unable to register the class device\n");
> -		return PTR_ERR(ds3232->rtc);
> -	}
> -
> -	if (client->irq >= 0) {
> +	if (client->irq != NO_IRQ) {

x86_64 allmodconfig:

drivers/rtc/rtc-ds3232.c: In function 'ds3232_probe':
drivers/rtc/rtc-ds3232.c:427: error: 'NO_IRQ' undeclared (first use in this function)
drivers/rtc/rtc-ds3232.c:427: error: (Each undeclared identifier is reported only once
drivers/rtc/rtc-ds3232.c:427: error: for each function it appears in.)

Not all architectures implement NO_IRQ.

I think this should be 

	if (client->irq > 0) {

but I'm not sure - iirc, x86 (at least) treats zero as "not an IRQ". 
But I think some architectures permit IRQ 0.  There was discussion many
years ago but I don't think anything got resolved.


Help!  I think some ppc people will know what to do here?

^ permalink raw reply

* [PATCH] rapidio: rework device hierarchy and introduce mport class of devices
From: Alexandre Bounine @ 2014-02-25 21:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Arno Tiemersma, linux-kernel, Andre van Herk, Jerry Jacobs,
	Alexandre Bounine, Rob Landley, Stef van Os, linuxppc-dev

This patch removes an artificial RapidIO bus root device and establishes actual
device hierarchy by providing reference to real parent devices.
It also introduces a device class for RapidIO controller devices (on-chip or
an external bridge, known as "mport").

The existing implementation was sufficient for SoC-based platforms that have
a single RapidIO controller. With the introduction of devices using multiple
RapidIO controllers and PCIe-to-RapidIO bridges, the old scheme is very
limiting or does not work at all. The implemented changes make it possible to
properly reference the platform's local RapidIO mport devices and provide
device details needed for upper layers.

This change to RapidIO device hierarchy does not break any known existing kernel
or user space interfaces.

Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com>
Cc: Stef van Os <stef.van.os@prodrive-technologies.com>
Cc: Jerry Jacobs <jerry.jacobs@prodrive-technologies.com>
Cc: Arno Tiemersma <arno.tiemersma@prodrive-technologies.com>
Cc: Rob Landley <rob@landley.net>
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
---
 Documentation/rapidio/sysfs.txt  |   66 +++++++++++++++++++++++++++++++++----
 arch/powerpc/sysdev/fsl_rio.c    |    1 +
 drivers/net/rionet.c             |    1 +
 drivers/rapidio/devices/tsi721.c |    1 +
 drivers/rapidio/rio-driver.c     |   22 ++++++++----
 drivers/rapidio/rio-scan.c       |    1 +
 drivers/rapidio/rio-sysfs.c      |   40 +++++++++++++++++++++++
 drivers/rapidio/rio.c            |   11 ++++++
 drivers/rapidio/rio.h            |    1 +
 include/linux/rio.h              |    5 ++-
 10 files changed, 133 insertions(+), 16 deletions(-)

diff --git a/Documentation/rapidio/sysfs.txt b/Documentation/rapidio/sysfs.txt
index 271438c..47ce9a5 100644
--- a/Documentation/rapidio/sysfs.txt
+++ b/Documentation/rapidio/sysfs.txt
@@ -2,8 +2,8 @@
 
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-1. Device Subdirectories
-------------------------
+1. RapidIO Device Subdirectories
+--------------------------------
 
 For each RapidIO device, the RapidIO subsystem creates files in an individual
 subdirectory with the following name, /sys/bus/rapidio/devices/<device_name>.
@@ -25,8 +25,8 @@ seen by the enumerating host (destID = 1):
 NOTE: An enumerating or discovering endpoint does not create a sysfs entry for
 itself, this is why an endpoint with destID=1 is not shown in the list.
 
-2. Attributes Common for All Devices
-------------------------------------
+2. Attributes Common for All RapidIO Devices
+--------------------------------------------
 
 Each device subdirectory contains the following informational read-only files:
 
@@ -52,16 +52,16 @@ This attribute is similar in behavior to the "config" attribute of PCI devices
 and provides an access to the RapidIO device registers using standard file read
 and write operations.
 
-3. Endpoint Device Attributes
------------------------------
+3. RapidIO Endpoint Device Attributes
+-------------------------------------
 
 Currently Linux RapidIO subsystem does not create any endpoint specific sysfs
 attributes. It is possible that RapidIO master port drivers and endpoint device
 drivers will add their device-specific sysfs attributes but such attributes are
 outside the scope of this document.
 
-4. Switch Device Attributes
----------------------------
+4. RapidIO Switch Device Attributes
+-----------------------------------
 
 RapidIO switches have additional attributes in sysfs. RapidIO subsystem supports
 common and device-specific sysfs attributes for switches. Because switches are
@@ -106,3 +106,53 @@ attribute:
 	 for that controller always will be 0.
 	 To initiate RapidIO enumeration/discovery on all available mports
 	 a user must write '-1' (or RIO_MPORT_ANY) into this attribute file.
+
+
+6. RapidIO Bus Controllers/Ports
+--------------------------------
+
+On-chip RapidIO controllers and PCIe-to-RapidIO bridges (referenced as
+"Master Port" or "mport") are presented in sysfs as the special class of
+devices: "rapidio_port".
+
+The /sys/class/rapidio_port subdirectory contains individual subdirectories
+named as "rapidioN" where N = mport ID registered with RapidIO subsystem.
+
+NOTE: An mport ID is not a RapidIO destination ID assigned to a given local
+mport device.
+
+Each mport device subdirectory in addition to standard entries contains the
+following device-specific attributes:
+
+   port_destid - reports RapidIO destination ID assigned to the given RapidIO
+                 mport device. If value 0xFFFFFFFF is returned this means that
+                 no valid destination ID has been assigned to the mport (yet).
+                 Normally, before enumeration/discovery has been executed only
+                 fabric enumerating mports have a valid destination ID assigned
+                 to them using "hdid=..." rapidio module parameter.
+      sys_size - reports RapidIO common transport system size:
+                   0 = small (8-bit destination ID, max. 256 devices),
+                   1 = large (16-bit destination ID, max. 65536 devices).
+
+After enumeration or discovery was performed for a given mport device,
+the corresponding subdirectory will also contain subdirectories for each
+child RapidIO device connected to the mport. Naming conventions for RapidIO
+devices are described in Section 1 above.
+
+The example below shows mport device subdirectory with several child RapidIO
+devices attached to it.
+
+[rio@rapidio ~]$ ls /sys/class/rapidio_port/rapidio0/ -l
+total 0
+drwxr-xr-x 3 root root    0 Feb 11 15:10 00:e:0001
+drwxr-xr-x 3 root root    0 Feb 11 15:10 00:e:0004
+drwxr-xr-x 3 root root    0 Feb 11 15:10 00:e:0007
+drwxr-xr-x 3 root root    0 Feb 11 15:10 00:s:0002
+drwxr-xr-x 3 root root    0 Feb 11 15:10 00:s:0003
+drwxr-xr-x 3 root root    0 Feb 11 15:10 00:s:0005
+lrwxrwxrwx 1 root root    0 Feb 11 15:11 device -> ../../../0000:01:00.0
+-r--r--r-- 1 root root 4096 Feb 11 15:11 port_destid
+drwxr-xr-x 2 root root    0 Feb 11 15:11 power
+lrwxrwxrwx 1 root root    0 Feb 11 15:04 subsystem -> ../../../../../../class/rapidio_port
+-r--r--r-- 1 root root 4096 Feb 11 15:11 sys_size
+-rw-r--r-- 1 root root 4096 Feb 11 15:04 uevent
diff --git a/arch/powerpc/sysdev/fsl_rio.c b/arch/powerpc/sysdev/fsl_rio.c
index 95dd892..cf2b084 100644
--- a/arch/powerpc/sysdev/fsl_rio.c
+++ b/arch/powerpc/sysdev/fsl_rio.c
@@ -531,6 +531,7 @@ int fsl_rio_setup(struct platform_device *dev)
 		sprintf(port->name, "RIO mport %d", i);
 
 		priv->dev = &dev->dev;
+		port->dev.parent = &dev->dev;
 		port->ops = ops;
 		port->priv = priv;
 		port->phys_efptr = 0x100;
diff --git a/drivers/net/rionet.c b/drivers/net/rionet.c
index 6d1f6ed..a849718 100644
--- a/drivers/net/rionet.c
+++ b/drivers/net/rionet.c
@@ -493,6 +493,7 @@ static int rionet_setup_netdev(struct rio_mport *mport, struct net_device *ndev)
 	ndev->netdev_ops = &rionet_netdev_ops;
 	ndev->mtu = RIO_MAX_MSG_SIZE - 14;
 	ndev->features = NETIF_F_LLTX;
+	SET_NETDEV_DEV(ndev, &mport->dev);
 	SET_ETHTOOL_OPS(ndev, &rionet_ethtool_ops);
 
 	spin_lock_init(&rnet->lock);
diff --git a/drivers/rapidio/devices/tsi721.c b/drivers/rapidio/devices/tsi721.c
index ff7cbf2..1753dc6 100644
--- a/drivers/rapidio/devices/tsi721.c
+++ b/drivers/rapidio/devices/tsi721.c
@@ -2256,6 +2256,7 @@ static int tsi721_setup_mport(struct tsi721_device *priv)
 	mport->phy_type = RIO_PHY_SERIAL;
 	mport->priv = (void *)priv;
 	mport->phys_efptr = 0x100;
+	mport->dev.parent = &pdev->dev;
 	priv->mport = mport;
 
 	INIT_LIST_HEAD(&mport->dbells);
diff --git a/drivers/rapidio/rio-driver.c b/drivers/rapidio/rio-driver.c
index c9ae692..f301f05 100644
--- a/drivers/rapidio/rio-driver.c
+++ b/drivers/rapidio/rio-driver.c
@@ -167,7 +167,6 @@ void rio_unregister_driver(struct rio_driver *rdrv)
 void rio_attach_device(struct rio_dev *rdev)
 {
 	rdev->dev.bus = &rio_bus_type;
-	rdev->dev.parent = &rio_bus;
 }
 EXPORT_SYMBOL_GPL(rio_attach_device);
 
@@ -216,9 +215,12 @@ static int rio_uevent(struct device *dev, struct kobj_uevent_env *env)
 	return 0;
 }
 
-struct device rio_bus = {
-	.init_name = "rapidio",
+struct class rio_mport_class = {
+	.name		= "rapidio_port",
+	.owner		= THIS_MODULE,
+	.dev_groups	= rio_mport_groups,
 };
+EXPORT_SYMBOL_GPL(rio_mport_class);
 
 struct bus_type rio_bus_type = {
 	.name = "rapidio",
@@ -233,14 +235,20 @@ struct bus_type rio_bus_type = {
 /**
  *  rio_bus_init - Register the RapidIO bus with the device model
  *
- *  Registers the RIO bus device and RIO bus type with the Linux
+ *  Registers the RIO mport device class and RIO bus type with the Linux
  *  device model.
  */
 static int __init rio_bus_init(void)
 {
-	if (device_register(&rio_bus) < 0)
-		printk("RIO: failed to register RIO bus device\n");
-	return bus_register(&rio_bus_type);
+	int ret;
+
+	ret = class_register(&rio_mport_class);
+	if (!ret) {
+		ret = bus_register(&rio_bus_type);
+		if (ret)
+			class_unregister(&rio_mport_class);
+	}
+	return ret;
 }
 
 postcore_initcall(rio_bus_init);
diff --git a/drivers/rapidio/rio-scan.c b/drivers/rapidio/rio-scan.c
index d3a6539..47a1b2e 100644
--- a/drivers/rapidio/rio-scan.c
+++ b/drivers/rapidio/rio-scan.c
@@ -461,6 +461,7 @@ static struct rio_dev *rio_setup_device(struct rio_net *net,
 			     rdev->comp_tag & RIO_CTAG_UDEVID);
 	}
 
+	rdev->dev.parent = &port->dev;
 	rio_attach_device(rdev);
 
 	device_initialize(&rdev->dev);
diff --git a/drivers/rapidio/rio-sysfs.c b/drivers/rapidio/rio-sysfs.c
index e0221c6..cdb005c 100644
--- a/drivers/rapidio/rio-sysfs.c
+++ b/drivers/rapidio/rio-sysfs.c
@@ -341,3 +341,43 @@ const struct attribute_group *rio_bus_groups[] = {
 	&rio_bus_group,
 	NULL,
 };
+
+static ssize_t
+port_destid_show(struct device *dev, struct device_attribute *attr,
+		 char *buf)
+{
+	struct rio_mport *mport = to_rio_mport(dev);
+
+	if (mport)
+		return sprintf(buf, "0x%04x\n", mport->host_deviceid);
+	else
+		return -ENODEV;
+}
+static DEVICE_ATTR_RO(port_destid);
+
+static ssize_t sys_size_show(struct device *dev, struct device_attribute *attr,
+			   char *buf)
+{
+	struct rio_mport *mport = to_rio_mport(dev);
+
+	if (mport)
+		return sprintf(buf, "%u\n", mport->sys_size);
+	else
+		return -ENODEV;
+}
+static DEVICE_ATTR_RO(sys_size);
+
+static struct attribute *rio_mport_attrs[] = {
+	&dev_attr_port_destid.attr,
+	&dev_attr_sys_size.attr,
+	NULL,
+};
+
+static const struct attribute_group rio_mport_group = {
+	.attrs = rio_mport_attrs,
+};
+
+const struct attribute_group *rio_mport_groups[] = {
+	&rio_mport_group,
+	NULL,
+};
diff --git a/drivers/rapidio/rio.c b/drivers/rapidio/rio.c
index 2e8a20c..a54ba04 100644
--- a/drivers/rapidio/rio.c
+++ b/drivers/rapidio/rio.c
@@ -1884,6 +1884,7 @@ static int rio_get_hdid(int index)
 int rio_register_mport(struct rio_mport *port)
 {
 	struct rio_scan_node *scan = NULL;
+	int res = 0;
 
 	if (next_portid >= RIO_MAX_MPORTS) {
 		pr_err("RIO: reached specified max number of mports\n");
@@ -1894,6 +1895,16 @@ int rio_register_mport(struct rio_mport *port)
 	port->host_deviceid = rio_get_hdid(port->id);
 	port->nscan = NULL;
 
+	dev_set_name(&port->dev, "rapidio%d", port->id);
+	port->dev.class = &rio_mport_class;
+
+	res = device_register(&port->dev);
+	if (res)
+		dev_err(&port->dev, "RIO: mport%d registration failed ERR=%d\n",
+			port->id, res);
+	else
+		dev_dbg(&port->dev, "RIO: mport%d registered\n", port->id);
+
 	mutex_lock(&rio_mport_list_lock);
 	list_add_tail(&port->node, &rio_mports);
 
diff --git a/drivers/rapidio/rio.h b/drivers/rapidio/rio.h
index 5f99d22..2d0550e 100644
--- a/drivers/rapidio/rio.h
+++ b/drivers/rapidio/rio.h
@@ -50,6 +50,7 @@ extern int rio_mport_scan(int mport_id);
 /* Structures internal to the RIO core code */
 extern const struct attribute_group *rio_dev_groups[];
 extern const struct attribute_group *rio_bus_groups[];
+extern const struct attribute_group *rio_mport_groups[];
 
 #define RIO_GET_DID(size, x)	(size ? (x & 0xffff) : ((x & 0x00ff0000) >> 16))
 #define RIO_SET_DID(size, x)	(size ? (x & 0xffff) : ((x & 0x000000ff) << 16))
diff --git a/include/linux/rio.h b/include/linux/rio.h
index b71d573..6bda06f 100644
--- a/include/linux/rio.h
+++ b/include/linux/rio.h
@@ -83,7 +83,7 @@
 #define RIO_CTAG_UDEVID	0x0001ffff /* Unique device identifier */
 
 extern struct bus_type rio_bus_type;
-extern struct device rio_bus;
+extern struct class rio_mport_class;
 
 struct rio_mport;
 struct rio_dev;
@@ -201,6 +201,7 @@ struct rio_dev {
 #define rio_dev_f(n) list_entry(n, struct rio_dev, net_list)
 #define	to_rio_dev(n) container_of(n, struct rio_dev, dev)
 #define sw_to_rio_dev(n) container_of(n, struct rio_dev, rswitch[0])
+#define	to_rio_mport(n) container_of(n, struct rio_mport, dev)
 
 /**
  * struct rio_msg - RIO message event
@@ -248,6 +249,7 @@ enum rio_phy_type {
  * @phy_type: RapidIO phy type
  * @phys_efptr: RIO port extended features pointer
  * @name: Port name string
+ * @dev: device structure associated with an mport
  * @priv: Master port private data
  * @dma: DMA device associated with mport
  * @nscan: RapidIO network enumeration/discovery operations
@@ -272,6 +274,7 @@ struct rio_mport {
 	enum rio_phy_type phy_type;	/* RapidIO phy type */
 	u32 phys_efptr;
 	unsigned char name[RIO_MAX_MPORT_NAME];
+	struct device dev;
 	void *priv;		/* Master port private data */
 #ifdef CONFIG_RAPIDIO_DMA_ENGINE
 	struct dma_device	dma;
-- 
1.7.8.4


* Re: [PATCH v2 02/11] perf core: export swevent hrtimer helpers
From: Cody P Schafer @ 2014-02-25 21:38 UTC (permalink / raw)
  To: Peter Zijlstra, Michael Ellerman
  Cc: Paul Mackerras, Ingo Molnar, Linux PPC, LKML,
	Arnaldo Carvalho de Melo
In-Reply-To: <20140225102008.GI9987@twins.programming.kicks-ass.net>

On 02/25/2014 02:20 AM, Peter Zijlstra wrote:
> On Tue, Feb 25, 2014 at 02:33:26PM +1100, Michael Ellerman wrote:
>> On Fri, 2014-14-02 at 22:02:06 UTC, Cody P Schafer wrote:
>>> Export the swevent hrtimer helpers currently only used in events/core.c
>>> to allow the addition of architecture specific sw-like pmus.
>>
>> Peter, Ingo, can we get your ACK on this please?
>
> How are they used? I saw some usage in patch 9 or so; but it's not
> explained anywhere. All patches have non-existent Changelogs and the few
> comments that are there are pretty hardware specific.
>
> So please do tell; what do you need this for?

From this patch's change log:

> Export the swevent hrtimer helpers currently only used in events/core.c to allow the addition of architecture specific sw-like pmus.

The key part here is "architecture specific sw-like pmus"; the
announcement explains why these pmus are sw-like:

> The counters supplied by these interfaces are continually counting and never
> need to be (and cannot be) disabled or enabled. They additionally do not
> generate any interrupts. This makes them in some regards similar to software
> counters, and as a result their implementation shares some common code (which
> an initial patch exposes) with the sw counters.

Essentially, these pmus just provide access to a big array of counters
which don't generate interrupts, and are all 64-bit (and assumed to never
overflow). Rather than duplicate the code that we already have for
managing timing when reading from counters that don't have interrupts
(the functions that are exposed by this patch), I've reused it.


* Re: [PATCH v2 10/11] powerpc/perf: add kconfig option for hypervisor provided counters
From: Cody P Schafer @ 2014-02-25 21:31 UTC (permalink / raw)
  To: Michael Ellerman, Linux PPC, Aneesh Kumar K.V, Anshuman Khandual,
	Anton Blanchard, Benjamin Herrenschmidt, Kumar Gala, Lijun Pan,
	Li Yang, Paul Bolle, Priyanka Jain, Scott Wood, Tang Yuantian
  Cc: Ingo Molnar, Paul Mackerras, LKML, Arnaldo Carvalho de Melo,
	Peter Zijlstra
In-Reply-To: <20140225033330.54F332C02FB@ozlabs.org>

On 02/24/2014 07:33 PM, Michael Ellerman wrote:
> On Fri, 2014-14-02 at 22:02:14 UTC, Cody P Schafer wrote:
>> Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
>> ---
>>   arch/powerpc/perf/Makefile             | 2 ++
>>   arch/powerpc/platforms/Kconfig.cputype | 6 ++++++
>>   2 files changed, 8 insertions(+)
>>
>> diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
>> index 60d71ee..f9c083a 100644
>> --- a/arch/powerpc/perf/Makefile
>> +++ b/arch/powerpc/perf/Makefile
>> @@ -11,5 +11,7 @@ obj32-$(CONFIG_PPC_PERF_CTRS)	+= mpc7450-pmu.o
>>   obj-$(CONFIG_FSL_EMB_PERF_EVENT) += core-fsl-emb.o
>>   obj-$(CONFIG_FSL_EMB_PERF_EVENT_E500) += e500-pmu.o e6500-pmu.o
>>
>> +obj-$(CONFIG_HV_PERF_CTRS) += hv-24x7.o hv-gpci.o hv-common.o
>> +
>>   obj-$(CONFIG_PPC64)		+= $(obj64-y)
>>   obj-$(CONFIG_PPC32)		+= $(obj32-y)
>> diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
>> index 434fda3..dcc67cd 100644
>> --- a/arch/powerpc/platforms/Kconfig.cputype
>> +++ b/arch/powerpc/platforms/Kconfig.cputype
>> @@ -364,6 +364,12 @@ config PPC_PERF_CTRS
>>          help
>>            This enables the powerpc-specific perf_event back-end.
>>
>> +config HV_PERF_CTRS
>> +       def_bool y
>
> This was bool, why did you change it?

No, it wasn't. v1 also had def_bool. https://lkml.org/lkml/2014/1/16/518
Maybe you're confusing v2.1 and v2 of this patch?

>
>> +       depends on PERF_EVENTS && PPC_HAVE_PMU_SUPPORT
>
> Should be:
>
> 	depends on PERF_EVENTS && PPC_PSERIES
>
>> +       help
>> +         Enable access to perf counters provided by the hypervisor
>> +

Yep, the v2.1 patch (which I bungled and labeled as 9/11) already 
changes both of these.
It'll end up rolled into v3.


* Re: [PATCH v2 08/11] powerpc/perf: add support for the hv gpci (get performance counter info) interface
From: Cody P Schafer @ 2014-02-25 21:25 UTC (permalink / raw)
  To: Michael Ellerman, Linux PPC
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo
In-Reply-To: <20140225033329.400E22C0331@ozlabs.org>

On 02/24/2014 07:33 PM, Michael Ellerman wrote:
> On Fri, 2014-14-02 at 22:02:12 UTC, Cody P Schafer wrote:
>> This provides a basic link between perf and hv_gpci. Notably, it does
>> not yet support transactions and does not list any events (they can
>> still be manually composed).
>
> Can you explain how the HV_CAPS stuff ends up looking.
>
> I'm not against adding it, but I'd like to understand how we expect it to be
> used a bit better.

It's just a quick mechanism for me to expose some relevant information 
to userspace via sysfs using the hv_perf_caps_get() function's returned 
data. Documentation for this sysfs interface (and the rest) is in a 
later patch.
I don't expect any more uses to show up unless the firmware decides to 
add another capability bit (in which case I'll want to expose it as well).
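
As a rough illustration of that mechanism, the userspace sketch below
decodes a capability mask into the per-capability flags that would each
end up in their own sysfs file. The EX_CM_* bit values and all ex_ names
are hypothetical placeholders: the real CV_CM_* definitions presumably
come from hv-gpci.h and are not reproduced in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values, for illustration only. */
#define EX_CM_GA	0x80u
#define EX_CM_EXPANDED	0x40u
#define EX_CM_LAB	0x20u

/* Modelled on struct hv_perf_caps from the quoted patch. */
struct ex_perf_caps {
	uint16_t version;
	unsigned int collect_privileged:1,
		     ga:1,
		     expanded:1,
		     lab:1;
};

/* Turn the raw firmware fields into one boolean flag per capability. */
static void ex_decode_caps(struct ex_perf_caps *caps, uint16_t version,
			   uint8_t privileged, uint8_t mask)
{
	assert(caps != NULL);

	caps->version = version;
	caps->collect_privileged = !!privileged;
	caps->ga = !!(mask & EX_CM_GA);
	caps->expanded = !!(mask & EX_CM_EXPANDED);
	caps->lab = !!(mask & EX_CM_LAB);
}
```

Each decoded flag is a single 0/1 value, which is what makes it easy to
show as one ASCII value per sysfs attribute.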

>> diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c
>> new file mode 100644
>> index 0000000..1f5d96d
>> --- /dev/null
>> +++ b/arch/powerpc/perf/hv-gpci.c
>> +
>> +static struct pmu h_gpci_pmu = {
>> +	.task_ctx_nr = perf_invalid_context,
>> +
>> +	.name = "hv_gpci",
>> +	.attr_groups = attr_groups,
>> +	.event_init  = h_gpci_event_init,
>> +	.add         = h_gpci_event_add,
>> +	.del         = h_gpci_event_del,
> 		     = h_gpci_event_stop,
>
>> +	.start       = h_gpci_event_start,
>> +	.stop        = h_gpci_event_stop,
>> +	.read        = h_gpci_event_read,
> 		     = h_gpci_event_update
>
>> +	.event_idx = perf_swevent_event_idx,
>> +};

Whoops, thought I had fixed those 2 already.


* Re: [PATCH v2 07/11] powerpc: add a shared interface to get gpci version and capabilities
From: Cody P Schafer @ 2014-02-25 21:20 UTC (permalink / raw)
  To: Michael Ellerman, Linux PPC
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo
In-Reply-To: <20140225033328.C5A9E2C0324@ozlabs.org>

On 02/24/2014 07:33 PM, Michael Ellerman wrote:
> [PATCH v2 07/11] powerpc: add a shared interface to get gpci version and capabilities
>
> All the patches that touch perf should be "powerpc/perf: foo"

Ok.

> On Fri, 2014-14-02 at 22:02:11 UTC, Cody P Schafer wrote:
>> ...
>
> I realise this is a fairly small patch but a changelog is still nice. You could
> for example mention that we don't currently use .ga, .expanded or .lab but
> we're adding the logic anyway because ...
>

Well, we do use them to expose some more information to the user (via 
sysfs attributes). Always nice to know what capabilities are enabled.

But sure, I can explain why each bit in that structure is a good idea.

>
>> Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
>> ---
>>   arch/powerpc/perf/hv-common.c | 39 +++++++++++++++++++++++++++++++++++++++
>>   arch/powerpc/perf/hv-common.h | 17 +++++++++++++++++
>>   2 files changed, 56 insertions(+)
>>   create mode 100644 arch/powerpc/perf/hv-common.c
>>   create mode 100644 arch/powerpc/perf/hv-common.h
>>
>> diff --git a/arch/powerpc/perf/hv-common.c b/arch/powerpc/perf/hv-common.c
>> new file mode 100644
>> index 0000000..47e02b3
>> --- /dev/null
>> +++ b/arch/powerpc/perf/hv-common.c
>> @@ -0,0 +1,39 @@
>> +#include <asm/io.h>
>> +#include <asm/hvcall.h>
>> +
>> +#include "hv-gpci.h"
>> +#include "hv-common.h"
>> +
>> +unsigned long hv_perf_caps_get(struct hv_perf_caps *caps)
>> +{
>> +	unsigned long r;
>> +	struct p {
>> +		struct hv_get_perf_counter_info_params params;
>> +		struct cv_system_performance_capabilities caps;
>> +	} __packed __aligned(sizeof(uint64_t));
>> +
>> +	struct p arg = {
>> +		.params = {
>> +			.counter_request = cpu_to_be32(
>> +					CIR_SYSTEM_PERFORMANCE_CAPABILITIES),
>> +			.starting_index = cpu_to_be32(-1),
>> +			.counter_info_version_in = 0,
>> +		}
>> +	};
>> +
>> +	r = plpar_hcall_norets(H_GET_PERF_COUNTER_INFO,
>> +			       virt_to_phys(&arg), sizeof(arg));
>> +
>> +	if (r)
>> +		return r;
>> +
>> +	pr_devel("capability_mask: 0x%x\n", arg.caps.capability_mask);
>> +
>> +	caps->version = arg.params.counter_info_version_out;
>> +	caps->collect_privileged = !!arg.caps.perf_collect_privileged;
>> +	caps->ga = !!(arg.caps.capability_mask & CV_CM_GA);
>> +	caps->expanded = !!(arg.caps.capability_mask & CV_CM_EXPANDED);
>> +	caps->lab = !!(arg.caps.capability_mask & CV_CM_LAB);
>> +
>> +	return r;
>> +}
>> diff --git a/arch/powerpc/perf/hv-common.h b/arch/powerpc/perf/hv-common.h
>> new file mode 100644
>> index 0000000..7e615bd
>> --- /dev/null
>> +++ b/arch/powerpc/perf/hv-common.h
>> @@ -0,0 +1,17 @@
>> +#ifndef LINUX_POWERPC_PERF_HV_COMMON_H_
>> +#define LINUX_POWERPC_PERF_HV_COMMON_H_
>> +
>> +#include <linux/types.h>
>> +
>> +struct hv_perf_caps {
>> +	u16 version;
>> +	u16 collect_privileged:1,
>> +	    ga:1,
>> +	    expanded:1,
>> +	    lab:1,
>> +	    unused:12;
>> +};
>> +
>> +unsigned long hv_perf_caps_get(struct hv_perf_caps *caps);
>> +
>> +#endif
>> --
>> 1.8.5.4
>>
>>
>


* Re: [PATCH v2 04/11] powerpc: add hvcalls for 24x7 and gpci (get performance counter info)
From: Cody P Schafer @ 2014-02-25 21:13 UTC (permalink / raw)
  To: Michael Ellerman, Linux PPC, Alexander Graf, Anton Blanchard,
	Benjamin Herrenschmidt, Paul Mackerras
  Cc: Ingo Molnar, LKML, Arnaldo Carvalho de Melo, Peter Zijlstra
In-Reply-To: <20140225033327.878F52C0256@ozlabs.org>

On 02/24/2014 07:33 PM, Michael Ellerman wrote:
> On Fri, 2014-14-02 at 22:02:08 UTC, Cody P Schafer wrote:
>> Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
>> ---
>>   arch/powerpc/include/asm/hvcall.h | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
>> index d8b600b..652f7e4 100644
>> --- a/arch/powerpc/include/asm/hvcall.h
>> +++ b/arch/powerpc/include/asm/hvcall.h
>> @@ -274,6 +274,11 @@
>>   /* Platform specific hcalls, used by KVM */
>>   #define H_RTAS			0xf000
>>
>> +/* "Platform specific hcalls", provided by PHYP */
>> +#define H_GET_24X7_CATALOG_PAGE 0xF078
>> +#define H_GET_24X7_DATA		0xF07C
>> +#define H_GET_PERF_COUNTER_INFO 0xF080
>
> Some tabs some spaces, use tabs.

Ack.


* Re: [PATCH v2 09/11] powerpc/perf: add support for the hv 24x7 interface
From: Cody P Schafer @ 2014-02-25 20:55 UTC (permalink / raw)
  To: Michael Ellerman, Linux PPC
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo
In-Reply-To: <20140225033329.BBB492C033B@ozlabs.org>

On 02/24/2014 07:33 PM, Michael Ellerman wrote:
> On Fri, 2014-14-02 at 22:02:13 UTC, Cody P Schafer wrote:
>> This provides a basic interface between hv_24x7 and perf. Similar to
>> the one provided for gpci, it lacks transaction support and does not
>> list any events.
>>
>> Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
>> ---
>>   arch/powerpc/perf/hv-24x7.c | 491 ++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 491 insertions(+)
>>   create mode 100644 arch/powerpc/perf/hv-24x7.c
>>
>> diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
>> new file mode 100644
>> index 0000000..13de140
>> --- /dev/null
>> +++ b/arch/powerpc/perf/hv-24x7.c
> ...
>> +
>> +/*
>> + * read_offset_data - copy data from one buffer to another while treating the
>> + *                    source buffer as a small view on the total available
>> + *                    source data.
>> + *
>> + * @dest: buffer to copy into
>> + * @dest_len: length of @dest in bytes
>> + * @requested_offset: the offset within the source data we want. Must be >= 0
>> + * @src: buffer to copy data from
>> + * @src_len: length of @src in bytes
>> + * @source_offset: the offset in the source data that (src,src_len) refers to.
>> + *                 Must be >= 0
>> + *
>> + * returns the number of bytes copied.
>> + *
>> + * '.' areas in d are written to.
>> + *
>> + *                       u
>> + *   x         w	 v  z
>> + * d           |.........|
>> + * s |----------------------|
>> + *
>> + *                      u
>> + *   x         w	z     v
>> + * d           |........------|
>> + * s |------------------|
>> + *
>> + *   x         w        u,z,v
>> + * d           |........|
>> + * s |------------------|
>> + *
>> + *   x,w                u,v,z
>> + * d |------------------|
>> + * s |------------------|
>> + *
>> + *   x        u
>> + *   w        v		z
>> + * d |........|
>> + * s |------------------|
>> + *
>> + *   x      z   w      v
>> + * d            |------|
>> + * s |------|
>> + *
>> + * x = source_offset
>> + * w = requested_offset
>> + * z = source_offset + src_len
>> + * v = requested_offset + dest_len
>> + *
>> + * w_offset_in_s = w - x = requested_offset - source_offset
>> + * z_offset_in_s = z - x = src_len
>> + * v_offset_in_s = v - x = requested_offset + dest_len - source_offset
>> + * u_offset_in_s = min(z_offset_in_s, v_offset_in_s)
>> + *
>> + * copy_len = u_offset_in_s - w_offset_in_s = min(z_offset_in_s, v_offset_in_s)
>> + *						- w_offset_in_s
>
> Comments are great, especially for complicated code like this. But at a glance
> I don't actually understand what this comment is trying to tell me.

The function was composed via some number line logic. The comment tries
to explain what that logic is. The ascii art is various overlapping
buffers that we're copying between (the '+'s from the patch are messing
with the indentation of some of the labels). The only major omission I'm
seeing is that I failed to note that d=dest and s=src (though this could
be inferred from the comment about '.' indicating a write).

Is there anything specific that doesn't make sense in the comment? (It
may not be a comment that can really be read at a glance.)
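
To make the number line concrete, here is a minimal userspace sketch of
the same overlap copy. It is not the kernel function: copy_window is a
hypothetical name, offsets are unsigned so the negative-offset check is
dropped, and v - x is computed as w_in_s + dest_len, i.e.
requested_offset + dest_len - source_offset, following the definitions
of v and x in the comment. It also assumes w_in_s + dest_len does not
overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * The source buffer holds bytes [x, z) of some larger data set and the
 * caller wants bytes [w, v). Copy the part of [w, v) that [x, z) covers,
 * starting at w.
 *
 *   x = source_offset        z = source_offset + src_len
 *   w = requested_offset     v = requested_offset + dest_len
 */
static size_t copy_window(void *dest, size_t dest_len, size_t requested_offset,
			  const void *src, size_t src_len, size_t source_offset)
{
	size_t w_in_s, z_in_s, v_in_s, u_in_s, copy_len;

	assert(dest != NULL && src != NULL);

	/* w must fall inside [x, z), otherwise there is nothing to copy. */
	if (requested_offset < source_offset)
		return 0;
	w_in_s = requested_offset - source_offset;	/* w - x */
	z_in_s = src_len;				/* z - x */
	if (w_in_s >= z_in_s)
		return 0;

	v_in_s = w_in_s + dest_len;			/* v - x */
	u_in_s = v_in_s < z_in_s ? v_in_s : z_in_s;	/* min(z, v) - x */
	copy_len = u_in_s - w_in_s;

	memcpy(dest, (const char *)src + w_in_s, copy_len);
	return copy_len;
}
```

With an 8-byte source window starting at offset 100, a 4-byte request at
offset 106 copies the final 2 bytes, and a request at offset 108 copies
nothing.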

>
>> + */
>> +static ssize_t read_offset_data(void *dest, size_t dest_len,
>> +				loff_t requested_offset, void *src,
>> +				size_t src_len, loff_t source_offset)
>> +{
>> +	size_t w_offset_in_s = requested_offset - source_offset;
>> +	size_t z_offset_in_s = src_len;
>> +	size_t v_offset_in_s = requested_offset + dest_len - source_offset;
>> +	size_t u_offset_in_s = min(z_offset_in_s, v_offset_in_s);
>> +	size_t copy_len = u_offset_in_s - w_offset_in_s;
>> +
>> +	if (requested_offset < 0 || source_offset < 0)
>> +		return -EINVAL;
>> +
>> +	if (z_offset_in_s <= w_offset_in_s)
>> +		return 0;
>> +
>> +	memcpy(dest, src + w_offset_in_s, copy_len);
>> +	return copy_len;
>> +}
>> +
>> +static unsigned long h_get_24x7_catalog_page(char page[static 4096],
>> +					     u32 version, u32 index)
>> +{
>> +	WARN_ON(!IS_ALIGNED((unsigned long)page, 4096));
>> +	return plpar_hcall_norets(H_GET_24X7_CATALOG_PAGE,
>> +			virt_to_phys(page),
>> +			version,
>> +			index);
>> +}
>> +
>> +static ssize_t catalog_read(struct file *filp, struct kobject *kobj,
>> +			    struct bin_attribute *bin_attr, char *buf,
>> +			    loff_t offset, size_t count)
>> +{
>> +	unsigned long hret;
>> +	ssize_t ret = 0;
>> +	size_t catalog_len = 0, catalog_page_len = 0, page_count = 0;
>> +	loff_t page_offset = 0;
>> +	uint32_t catalog_version_num = 0;
>> +	void *page = kmalloc(4096, GFP_USER);
>> +	struct hv_24x7_catalog_page_0 *page_0 = page;
>> +	if (!page)
>> +		return -ENOMEM;
>> +
>> +
>> +	hret = h_get_24x7_catalog_page(page, 0, 0);
>> +	if (hret) {
>> +		ret = -EIO;
>> +		goto e_free;
>> +	}
>> +
>> +	catalog_version_num = be32_to_cpu(page_0->version);
>> +	catalog_page_len = be32_to_cpu(page_0->length);
>> +	catalog_len = catalog_page_len * 4096;
>> +
>> +	page_offset = offset / 4096;
>> +	page_count  = count  / 4096;
>> +
>> +	if (page_offset >= catalog_page_len)
>> +		goto e_free;
>> +
>> +	if (page_offset != 0) {
>> +		hret = h_get_24x7_catalog_page(page, catalog_version_num,
>> +					       page_offset);
>> +		if (hret) {
>> +			ret = -EIO;
>> +			goto e_free;
>> +		}
>> +	}
>> +
>> +	ret = read_offset_data(buf, count, offset,
>> +				page, 4096, page_offset * 4096);
>> +e_free:
>> +	if (hret)
>> +		pr_err("h_get_24x7_catalog_page(ver=%d, page=%lld) failed: rc=%ld\n",
>> +				catalog_version_num, page_offset, hret);
>> +	kfree(page);
>> +
>> +	pr_devel("catalog_read: offset=%lld(%lld) count=%zu(%zu) catalog_len=%zu(%zu) => %zd\n",
>> +			offset, page_offset, count, page_count, catalog_len,
>> +			catalog_page_len, ret);
>> +
>> +	return ret;
>> +}
>> +
>> +#define PAGE_0_ATTR(_name, _fmt, _expr)				\
>> +static ssize_t _name##_show(struct device *dev,			\
>> +			    struct device_attribute *dev_attr,	\
>> +			    char *buf)				\
>> +{								\
>> +	unsigned long hret;					\
>> +	ssize_t ret = 0;					\
>> +	void *page = kmalloc(4096, GFP_USER);			\
>> +	struct hv_24x7_catalog_page_0 *page_0 = page;		\
>> +	if (!page)						\
>> +		return -ENOMEM;					\
>> +	hret = h_get_24x7_catalog_page(page, 0, 0);		\
>> +	if (hret) {						\
>> +		ret = -EIO;					\
>> +		goto e_free;					\
>> +	}							\
>> +	ret = sprintf(buf, _fmt, _expr);			\
>> +e_free:								\
>> +	kfree(page);						\
>> +	return ret;						\
>> +}								\
>> +static DEVICE_ATTR_RO(_name)
>> +
>> +PAGE_0_ATTR(catalog_version, "%lld\n",
>> +		(unsigned long long)be32_to_cpu(page_0->version));
>> +PAGE_0_ATTR(catalog_len, "%lld\n",
>> +		(unsigned long long)be32_to_cpu(page_0->length) * 4096);
>> +static BIN_ATTR_RO(catalog, 0/* real length varies */);
>
> So we're dumping the catalog out as a binary blob.

Yep

> Why do we want to do that?

Right now it's the only way to know what events are available.
Additionally, even when the kernel starts parsing events out (and
exposing them via sysfs), there is some additional powerpc specific
structuring ("groups" and "schemas") that some userspace applications may
want to take advantage of.

> It clearly violates the sysfs rule-of-sorts of ASCII and one value per file.
> Obviously there can be exceptions, but what's our justification?

Actual justification is above, but additionally:
I was actually looking at the ACPI code that provides (among other
binary tables) the DSDT as a binary blob in sysfs when I was putting
this code together. The 24x7 catalog is, in the same manner, a binary
blob provided by firmware.
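
From userspace, consuming such a blob amounts to reading the attribute
until EOF, one page at a time. A minimal sketch in standard C follows;
ex_read_blob is a hypothetical helper, and the catalog path mentioned
afterwards is an assumption built from the names in this patch (pmu
"hv_24x7", group "interface", attribute "catalog"), not something
confirmed in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define EX_PAGE_SIZE 4096

/*
 * Read a whole binary file in page-sized chunks.
 * Returns the number of bytes read and stores a malloc'd buffer in *out
 * (caller frees), or returns -1 on open, read or allocation failure.
 */
static long ex_read_blob(const char *path, unsigned char **out)
{
	FILE *f;
	unsigned char *buf = NULL;
	size_t len = 0, n;

	assert(path != NULL && out != NULL);

	f = fopen(path, "rb");
	if (!f)
		return -1;

	for (;;) {
		/* Grow the buffer by one page before each read. */
		unsigned char *tmp = realloc(buf, len + EX_PAGE_SIZE);

		if (!tmp) {
			free(buf);
			fclose(f);
			return -1;
		}
		buf = tmp;

		n = fread(buf + len, 1, EX_PAGE_SIZE, f);
		len += n;
		if (n < EX_PAGE_SIZE)
			break;	/* short read: EOF or error */
	}

	if (ferror(f)) {
		free(buf);
		fclose(f);
		return -1;
	}

	fclose(f);
	*out = buf;
	return (long)len;
}
```

Something like
ex_read_blob("/sys/bus/event_source/devices/hv_24x7/interface/catalog", &buf)
would be the expected call, if the PMU device shows up at that location.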

>> +static struct bin_attribute *if_bin_attrs[] = {
>> +	&bin_attr_catalog,
>> +	NULL,
>> +};
>> +
>> +static struct attribute *if_attrs[] = {
>> +	&dev_attr_catalog_len.attr,
>> +	&dev_attr_catalog_version.attr,
>> +	NULL,
>> +};
>> +
>> +static struct attribute_group if_group = {
>> +	.name = "interface",
>> +	.bin_attrs = if_bin_attrs,
>> +	.attrs = if_attrs,
>> +};
>
> Both pmus have an "interface" directory, but they don't seem to have anything
> in common? It feels a little ad-hoc.

It is absolutely ad-hoc. The only similarity is that both groups named 
"interface" provide some additional details about the firmware interface 
they're using to provide the perf data. We could easily call them both 
"misc", "details", put all the attributes in the device root, or call 
them some other generic name. I ended up choosing "interface" because 
we're providing details on the firmware interface, and it feels just a
bit less generic. Having device specific names for the attribute group 
("24x7" and "gpci", for example) doesn't get us anything because the 
devices themselves already have those names ("hv_24x7" and "hv_gpci"). I 
don't see any reason to make them different.
