LinuxPPC-Dev Archive on lore.kernel.org
* Re: [PATCH 1/5] ptp: Added a brand new class driver for ptp clocks.
From: Richard Cochran @ 2010-08-27 14:34 UTC
  To: Alan Cox
  Cc: Rodolfo Giometti, Arnd Bergmann, john stultz, devicetree-discuss,
	linux-kernel, netdev, linuxppc-dev, linux-arm-kernel,
	Krzysztof Halasa
In-Reply-To: <20100827143844.646eccf6@lxorguk.ukuu.org.uk>

On Fri, Aug 27, 2010 at 02:38:44PM +0100, Alan Cox wrote:
> > 2007. If we can justify adding a clock id in this case, surely we can
> > add one for PTP as well!
> 
> But PTP isn't really a clock - it's a time sync protocol. You can (and may
> need to) have multiple clocks of this form on the same host because it's
> master based and you may have to deal with multiple masters who disagree.

Okay, I really meant "for PTP hardware clocks". In general the
discussion is about supporting a kind of hardware and not about the
PTP network protocol. In fact, the hardware clocks and clock servo
loops are not at all part of the IEEE 1588 standard.

Sorry for causing confusion, but please understand "a hardware clock
with timestamping capabilities that can be used for PTP support"
whenever I wrote "PTP" or "PTP clock."

Well, what I just said is not entirely true.

In fact, most of the current crop of PTP hardware clocks operate by
recognizing PTP network frames and timestamping them. One clock I know
of can timestamp every frame, but that seems to be the exception
rather than the rule. So, in theory they are just clocks, but actually
most are bound to the PTP protocol. That may change in the future...

> > viable approach. After the lkml discussion, I think it is even cleaner
> > and nicer to just offer a new clock id.
> 
> PTP is not a clock, it's many clocks so a clock id doesn't really work.
> You could assume a single time domain and add a CLOCK_TAI plus then use
> PTP to track it I guess ?
> 
> The question then is who would consume it and how ?
> 
> Generic applications want POSIX time, which is managed by NTP but could
> in userspace also be slewed via the existing API to track a PTP source if
> someone wanted and if there is a GPS source around they can compute UTC
> from it.

Yes, and even without a GPS, the PTP protocol (this time I really do
mean the protocol!) does provide the UTC offset whenever it is known.

> Specialist applications will presumably need to know which time source or
> sources they are tracking and synchronizing to out of multiple potential
> PTP sources.

Yes, the PTPd will have some special knowledge.

> Kernel stuff is more of a problem.
> 
> I'm not sure shoehorning a source of many clocks and time sync bases into
> a jump up and down and make it fit single time assumption is wise. Making
> system time bimble track a source makes sense just as with NTP but making
> it a new clock seems the wrong model extending a none-too-bright API when
> you can just put the time sources in a file tree.

I don't get your meaning here. What did you mean by "file tree"?

Thanks,
Richard

* Re: [PATCH 1/5] ptp: Added a brand new class driver for ptp clocks.
From: Richard Cochran @ 2010-08-27 14:02 UTC
  To: Alan Cox
  Cc: Rodolfo Giometti, Arnd Bergmann, john stultz, devicetree-discuss,
	linux-kernel, Christian Riesch, netdev, linuxppc-dev,
	linux-arm-kernel, Krzysztof Halasa
In-Reply-To: <20100827134154.50eef56c@lxorguk.ukuu.org.uk>

On Fri, Aug 27, 2010 at 01:41:54PM +0100, Alan Cox wrote:
> > The master node in a PTP network probably takes its time from a
> > precise external time source, like GPS. The GPS provides a 1 PPS
> > directly to the PTP clock hardware, which latches the PTP hardware
> > clock time on the PPS edge. This provides one sample as input to a
> > clock servo (in the PTPd) that, in turn, regulates the PTP clock
> > hardware.
> 
> A PTP clock is TAI, Unix time is UTC.

But TAI and UTC progress at the same rate, and UTC differs from TAI
only by a whole number of seconds that changes at leap seconds. In
fact, the needed conversion is provided by the protocol, so it is not
hard to take a 1 PPS from GPS and set the PTP clock to TAI.
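
As a rough illustration of that conversion, here is a minimal sketch in
C. It assumes the TAI-UTC difference is known in whole seconds, as
announced by the time source. The function name is hypothetical and not
part of the patch set.

```c
#include <stdint.h>

/* Hypothetical helper: convert a TAI seconds count to UTC seconds.
 * tai_utc_offset is the TAI-UTC difference in whole seconds, as
 * announced by the time source (assumed to be known and valid). */
int64_t tai_to_utc(int64_t tai_sec, int tai_utc_offset)
{
	return tai_sec - tai_utc_offset;
}
```

With the offset of 34 seconds that applies in 2010, tai_to_utc(1000, 34)
gives 966.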

> > This is the core issue and source of misunderstanding, in my view. The
> > fact of the matter is, the current generation of computers has
> > multiple clocks, and these are usually unsynchronized. I think we
> > should not try too hard to cover up or work around this. It is a fact
> > of life.
> 
> In this case I don't think you can. Their divergence is rather difficult
> to handle unless you have a GPS to hand.
> 
> But all this talk of "PTP this" and "PTP that" is not helpful. Any
> interface for additional time sources should be generic with PTP being
> one use case.

To tell the truth, my original motivation for the patch set was to
support PTP clocks and applications. I don't think that is such a bad
idea. After all, the adjtimex interface was added just to support NTP.

At the same time, I can understand the desire to have a generic
hardware clock adjustment API. Let me see if I can understand and
summarize what people are asking for:

	  clock_adjtime(clockid_t id, struct timex *t);

and struct timex gets some new fields at the end.

Using the call, NTPd can call clock_adjtime(CLOCK_REALTIME) and PTPd
can call clock_adjtime(CLOCK_PTP) and everyone is happy, no?
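
To make the idea concrete, here is a minimal sketch of how a caller
might prepare the struct timex for such a call. It assumes the existing
freq field is reused rather than a new field at the end, and the helper
name is hypothetical. The only fixed fact used is that freq is expressed
in ppm with a 16-bit binary fraction.

```c
#include <string.h>
#include <sys/timex.h>

/* Hypothetical helper: prepare a struct timex asking for a frequency
 * adjustment given in parts per billion. The existing freq field is
 * in ppm with a 16-bit binary fraction, so 1 ppb = 65.536 units. */
void timex_set_ppb(struct timex *tx, long ppb)
{
	memset(tx, 0, sizeof(*tx));
	tx->modes = ADJ_FREQUENCY;
	tx->freq = (long)(((long long)ppb * 65536) / 1000);
}
```

The filled-in structure would then be handed to the proposed call
together with the clock id of interest.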

Thanks,
Richard

* [PATCH] Use is_32bit_task() helper to test 32-bit binary
From: Denis Kirjanov @ 2010-08-27 13:49 UTC
  To: benh; +Cc: linuxppc-dev, paulus

This patch replaces all explicit tests of the TIF_32BIT flag with the
is_32bit_task() helper.

Signed-off-by: Denis Kirjanov <dkirjanov@kernel.org>
---
 arch/powerpc/include/asm/compat.h    |    4 ++--
 arch/powerpc/include/asm/elf.h       |    2 +-
 arch/powerpc/include/asm/page_64.h   |    4 ++--
 arch/powerpc/include/asm/processor.h |    4 ++--
 arch/powerpc/kernel/ptrace.c         |    2 +-
 arch/powerpc/kernel/vdso.c           |    6 +++---
 arch/powerpc/oprofile/backtrace.c    |    2 +-
 7 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/compat.h b/arch/powerpc/include/asm/compat.h
index 396d21a..3369e2c 100644
--- a/arch/powerpc/include/asm/compat.h
+++ b/arch/powerpc/include/asm/compat.h
@@ -143,7 +143,7 @@ static inline void __user *compat_alloc_user_space(long len)
 	 * We cant access below the stack pointer in the 32bit ABI and
 	 * can access 288 bytes in the 64bit ABI
 	 */
-	if (!(test_thread_flag(TIF_32BIT)))
+	if (!is_32bit_task())
 		usp -= 288;
 
 	return (void __user *) (usp - len);
@@ -213,7 +213,7 @@ struct compat_shmid64_ds {
 
 static inline int is_compat_task(void)
 {
-	return test_thread_flag(TIF_32BIT);
+	return is_32bit_task();
 }
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
index c376eda..2b917c6 100644
--- a/arch/powerpc/include/asm/elf.h
+++ b/arch/powerpc/include/asm/elf.h
@@ -250,7 +250,7 @@ do {								\
  * the 64bit ABI has never had these issues dont enable the workaround
  * even if we have an executable stack.
  */
-# define elf_read_implies_exec(ex, exec_stk) (test_thread_flag(TIF_32BIT) ? \
+# define elf_read_implies_exec(ex, exec_stk) (is_32bit_task() ? \
 		(exec_stk == EXSTACK_DEFAULT) : 0)
 #else 
 # define SET_PERSONALITY(ex) \
diff --git a/arch/powerpc/include/asm/page_64.h b/arch/powerpc/include/asm/page_64.h
index 358ff14..932f88d 100644
--- a/arch/powerpc/include/asm/page_64.h
+++ b/arch/powerpc/include/asm/page_64.h
@@ -163,7 +163,7 @@ do {						\
 #endif /* !CONFIG_HUGETLB_PAGE */
 
 #define VM_DATA_DEFAULT_FLAGS \
-	(test_thread_flag(TIF_32BIT) ? \
+	(is_32bit_task() ? \
 	 VM_DATA_DEFAULT_FLAGS32 : VM_DATA_DEFAULT_FLAGS64)
 
 /*
@@ -179,7 +179,7 @@ do {						\
 					 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
 
 #define VM_STACK_DEFAULT_FLAGS \
-	(test_thread_flag(TIF_32BIT) ? \
+	(is_32bit_task() ? \
 	 VM_STACK_DEFAULT_FLAGS32 : VM_STACK_DEFAULT_FLAGS64)
 
 #include <asm-generic/getorder.h>
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index 19c05b0..4c14187 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -118,7 +118,7 @@ extern struct task_struct *last_task_used_spe;
 #define TASK_UNMAPPED_BASE_USER32 (PAGE_ALIGN(TASK_SIZE_USER32 / 4))
 #define TASK_UNMAPPED_BASE_USER64 (PAGE_ALIGN(TASK_SIZE_USER64 / 4))
 
-#define TASK_UNMAPPED_BASE ((test_thread_flag(TIF_32BIT)) ? \
+#define TASK_UNMAPPED_BASE ((is_32bit_task()) ? \
 		TASK_UNMAPPED_BASE_USER32 : TASK_UNMAPPED_BASE_USER64 )
 #endif
 
@@ -128,7 +128,7 @@ extern struct task_struct *last_task_used_spe;
 #define STACK_TOP_USER64 TASK_SIZE_USER64
 #define STACK_TOP_USER32 TASK_SIZE_USER32
 
-#define STACK_TOP (test_thread_flag(TIF_32BIT) ? \
+#define STACK_TOP (is_32bit_task() ? \
 		   STACK_TOP_USER32 : STACK_TOP_USER64)
 
 #define STACK_TOP_MAX STACK_TOP_USER64
diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 11f3cd9..286d978 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -1681,7 +1681,7 @@ long do_syscall_trace_enter(struct pt_regs *regs)
 
 	if (unlikely(current->audit_context)) {
 #ifdef CONFIG_PPC64
-		if (!test_thread_flag(TIF_32BIT))
+		if (!is_32bit_task())
 			audit_syscall_entry(AUDIT_ARCH_PPC64,
 					    regs->gpr[0],
 					    regs->gpr[3], regs->gpr[4],
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index 13002fe..fd87287 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -159,7 +159,7 @@ static void dump_vdso_pages(struct vm_area_struct * vma)
 {
 	int i;
 
-	if (!vma || test_thread_flag(TIF_32BIT)) {
+	if (!vma || is_32bit_task()) {
 		printk("vDSO32 @ %016lx:\n", (unsigned long)vdso32_kbase);
 		for (i=0; i<vdso32_pages; i++) {
 			struct page *pg = virt_to_page(vdso32_kbase +
@@ -170,7 +170,7 @@ static void dump_vdso_pages(struct vm_area_struct * vma)
 			dump_one_vdso_page(pg, upg);
 		}
 	}
-	if (!vma || !test_thread_flag(TIF_32BIT)) {
+	if (!vma || !is_32bit_task()) {
 		printk("vDSO64 @ %016lx:\n", (unsigned long)vdso64_kbase);
 		for (i=0; i<vdso64_pages; i++) {
 			struct page *pg = virt_to_page(vdso64_kbase +
@@ -200,7 +200,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 		return 0;
 
 #ifdef CONFIG_PPC64
-	if (test_thread_flag(TIF_32BIT)) {
+	if (is_32bit_task()) {
 		vdso_pagelist = vdso32_pagelist;
 		vdso_pages = vdso32_pages;
 		vdso_base = VDSO32_MBASE;
diff --git a/arch/powerpc/oprofile/backtrace.c b/arch/powerpc/oprofile/backtrace.c
index b4278cf..f75301f 100644
--- a/arch/powerpc/oprofile/backtrace.c
+++ b/arch/powerpc/oprofile/backtrace.c
@@ -105,7 +105,7 @@ void op_powerpc_backtrace(struct pt_regs * const regs, unsigned int depth)
 		}
 	} else {
 #ifdef CONFIG_PPC64
-		if (!test_thread_flag(TIF_32BIT)) {
+		if (!is_32bit_task()) {
 			while (depth--) {
 				sp = user_getsp64(sp, first_frame);
 				if (!sp)
-- 
1.7.0

* Re: [PATCH 1/5] ptp: Added a brand new class driver for ptp clocks.
From: Alan Cox @ 2010-08-27 13:38 UTC
  To: Richard Cochran
  Cc: Rodolfo Giometti, Arnd Bergmann, john stultz, devicetree-discuss,
	linux-kernel, netdev, linuxppc-dev, linux-arm-kernel,
	Krzysztof Halasa
In-Reply-To: <20100827123849.GC11657@riccoc20.at.omicron.at>

> 2007. If we can justify adding a clock id in this case, surely we can
> add one for PTP as well!

But PTP isn't really a clock - it's a time sync protocol. You can (and may
need to) have multiple clocks of this form on the same host because it's
master based and you may have to deal with multiple masters who disagree.

> > Further, if we're using PTP to synchronize the system time, then there
> > shouldn't be any measurable difference between CLOCK_PTP and
> > CLOCK_REALTIME, no?
> 
> When using software timestamping, then the clocks are one and the same.

Technically the POSIX clock is UTC, IEEE1588v2 is TAI.

> It would be possible, but not too nice, IMHO. In contrast to NTP,
> there is no real need to place the servo in the kernel. Having the
> protocol code and servo in user space makes life much easier.

We can't currently put it in the kernel anyway for other good reasons.

> viable approach. After the lkml discussion, I think it is even cleaner
> and nicer to just offer a new clock id.

PTP is not a clock, it's many clocks so a clock id doesn't really work.
You could assume a single time domain and add a CLOCK_TAI plus then use
PTP to track it I guess ?

The question then is who would consume it and how ?

Generic applications want POSIX time, which is managed by NTP but could
in userspace also be slewed via the existing API to track a PTP source if
someone wanted and if there is a GPS source around they can compute UTC
from it.

Specialist applications will presumably need to know which time source or
sources they are tracking and synchronizing to out of multiple potential
PTP sources.

Kernel stuff is more of a problem.

I'm not sure shoehorning a source of many clocks and time sync bases into
a jump up and down and make it fit single time assumption is wise. Making
system time bimble track a source makes sense just as with NTP but making
it a new clock seems the wrong model extending a none-too-bright API when
you can just put the time sources in a file tree.

Alan

* Re: [PATCH 1/5] ptp: Added a brand new class driver for ptp clocks.
From: Alan Cox @ 2010-08-27 12:45 UTC
  To: Richard Cochran
  Cc: Rodolfo Giometti, Arnd Bergmann, john stultz, devicetree-discuss,
	linux-kernel, netdev, linuxppc-dev, linux-arm-kernel,
	Krzysztof Halasa
In-Reply-To: <20100827110855.GA11657@riccoc20.at.omicron.at>

> > So if the clock_adjtime interface is needed, it would seem best for it
> > to be generic enough to support not only PTP, but also the NTP kernel
> > PLL.
> 
> For the proposed clock_adjtime, what else is needed to support clock
> adjustment in general?

Multiple PLLs, at least with containers and certain classes of system you
want different containers in different timespaces, especially when doing
high precision stuff where you need your system tracking say a local
master clock for syncing musical instruments and sound events while
tracking other clocks like NTP for general system time.

> I don't mind making the interface generic enough to support any
> (realistic) conceivable clock adjustment scheme, but beyond the
> present PTP hardware clocks, I don't know what else might be needed.

Put the clock type in the new fields. It becomes

	u16 clock_type;
	[clock type specific data]

saves having to guess.
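
A minimal sketch of that layout in C follows. Only the leading u16 type
tag followed by type-specific data comes from the suggestion above. The
enum values, the union members and the helper are hypothetical and serve
only to show how a consumer would dispatch on the tag.

```c
#include <stdint.h>

/* All names below are hypothetical; only the leading 16-bit type tag
 * followed by type-specific data comes from the suggestion above. */
enum clock_adj_type {
	CLOCK_ADJ_NTP = 0,	/* assumed: NTP kernel PLL */
	CLOCK_ADJ_PTP = 1,	/* assumed: PTP hardware clock */
};

struct clock_adj_ext {
	uint16_t clock_type;	/* selects the union member below */
	union {
		struct { int32_t ppb; } ptp;	/* frequency in ppb */
		struct { int32_t shift; } ntp;	/* placeholder field */
	} u;
};

/* Return the requested ppb for PTP data, or 0 for other types. */
int32_t clock_adj_ppb(const struct clock_adj_ext *ext)
{
	return ext->clock_type == CLOCK_ADJ_PTP ? ext->u.ptp.ppb : 0;
}
```

The point of the tag is that the kernel never has to infer from the
other fields which kind of clock the caller meant.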

Alan

* Re: [PATCH 1/5] ptp: Added a brand new class driver for ptp clocks.
From: Alan Cox @ 2010-08-27 12:41 UTC
  To: Richard Cochran
  Cc: Rodolfo Giometti, Arnd Bergmann, john stultz, devicetree-discuss,
	linux-kernel, Christian Riesch, netdev, linuxppc-dev,
	linux-arm-kernel, Krzysztof Halasa
In-Reply-To: <20100827075727.GA3818@riccoc20.at.omicron.at>

> The master node in a PTP network probably takes its time from a
> precise external time source, like GPS. The GPS provides a 1 PPS
> directly to the PTP clock hardware, which latches the PTP hardware
> clock time on the PPS edge. This provides one sample as input to a
> clock servo (in the PTPd) that, in turn, regulates the PTP clock
> hardware.

A PTP clock is TAI, Unix time is UTC.

> This is the core issue and source of misunderstanding, in my view. The
> fact of the matter is, the current generation of computers has
> multiple clocks, and these are usually unsynchronized. I think we
> should not try too hard to cover up or work around this. It is a fact
> of life.

In this case I don't think you can. Their divergence is rather difficult
to handle unless you have a GPS to hand.

But all this talk of "PTP this" and "PTP that" is not helpful. Any
interface for additional time sources should be generic with PTP being
one use case.

Alan

* Re: [PATCH 1/5] ptp: Added a brand new class driver for ptp clocks.
From: Richard Cochran @ 2010-08-27 12:38 UTC
  To: john stultz
  Cc: Rodolfo Giometti, Arnd Bergmann, netdev, devicetree-discuss,
	linux-kernel, linuxppc-dev, linux-arm-kernel, Krzysztof Halasa
In-Reply-To: <1282594125.3111.344.camel@localhost.localdomain>

On Mon, Aug 23, 2010 at 01:08:45PM -0700, john stultz wrote:
> On Thu, 2010-08-19 at 07:55 +0200, Richard Cochran wrote:
> > The clockid_t CLOCK_PTP will be arch-neutral.
> 
> Sure, but are they conceptually neutral? There are other clock
> synchronization algorithms out there. Will they need their own
> similar-but-different clock_ids?
> 
> Look at the other clock ids and what they represent:

IMHO, the presently offered clock ids are a mixed bag...
 
> CLOCK_REALTIME : Wall time (possibly freq/offset corrected)
> CLOCK_MONOTONIC: Monotonic time (possibly freq corrected).
> CLOCK_PROCESS_CPUTIME_ID: Process cpu time.
> CLOCK_THREAD_CPUTIME_ID: Thread cpu time.

The amount of time a thread has been granted by the kernel is really
not connected to the real passage of time, at least not in a direct
way.

> CLOCK_MONOTONIC_RAW: Non freq corrected monotonic time.

This one comes from commit 2d42244ae71d6c7b0884b5664cf2eda30fb2ae68
and is surely a special case, unrelated to the other clock ids. The
commit message mentions that this was added to help the btime.sf.net
project. That project does not seem to have had any activity since
2007. If we can justify adding a clock id in this case, surely we can
add one for PTP as well!

> CLOCK_REALTIME_COARSE: Tick granular wall time (filesystem timestamp)
> CLOCK_MONOTONIC_COARSE: Tick granular monotonic time.

These were added in commit da15cfdae03351c689736f8d142618592e3cebc3
in order to fulfill needs of special applications.

> CLOCK_PTP that you're proposing doesn't seem to be at the same level of
> abstraction. I'm not saying that this isn't the right place for it, but
> can we take a step back from PTP and consider what you're exposing in more
> generic terms. In other words, could someone use the same
> packet-timestamping hardware to implement a different non-PTP time
> synchronization algorithm?

Yes, of course. There is nothing at all in the patch set about the PTP
protocol itself. It just lets you access the hardware. In short:

1. SO_TIMESTAMPING delivers timestamped packets
2. the PTP API lets you tune the clock.

"That's all, folks."

> Further, if we're using PTP to synchronize the system time, then there
> shouldn't be any measurable difference between CLOCK_PTP and
> CLOCK_REALTIME, no?

When using software timestamping, then the clocks are one and the same.

When using PTP, with the PPS hook to synchronize the Linux system
time, the clocks will be as close as the servo algorithm allows. I
have not measured this yet, but it cannot be much different from using
any other PPS source.
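
For readers unfamiliar with clock servos, here is a minimal sketch of
one common choice, a proportional-integral (PI) controller. The thread
does not say which servo the PTPd uses, so the structure, the function
name and the gains are all assumptions made for illustration.

```c
/* Hypothetical PI servo state; the gains are illustrative only. */
struct pi_servo {
	double kp;		/* proportional gain */
	double ki;		/* integral gain */
	double integral;	/* accumulated frequency correction, ppb */
};

/* Feed one offset sample (local minus reference, in ns, taken once
 * per second) and return the frequency correction in ppb. */
double pi_servo_sample(struct pi_servo *s, double offset_ns)
{
	s->integral += s->ki * offset_ns;
	return -(s->kp * offset_ns + s->integral);
}
```

Each PPS edge would yield one offset sample. A positive offset means the
local clock is ahead, so the returned correction is negative and slows
the clock down.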

> > SYSCALL_DEFINE3(clock_adjtime, const clockid_t, clkid,
> > 		int, ppb, struct timespec __user *, ts)
> > 
> > ppb - desired frequency adjustment in parts per billion
> > ts  - desired time step (or jump) in <sec,nsec> to correct
> >       a measured offset
> > 
> > Arguably, this syscall might be useful for other clocks, too.
> 
> So yea, obviously the syscall should not be CLOCK_PTP specific, so we
> would want it to be usable against CLOCK_REALTIME.
> 
> That said, the clock_adjtime you're proposing does not seem to be
> sufficient for usage by NTPd. So this suggests that it is not generic
> enough.

I don't think we need to support ntpd. It already has adjtimex, and it
won't get any better by using another interface.

> > I think the ancillary features from PTP hardware clocks should be made
> > available through sysfs. A syscall for these would end up very
> > ugly, looking like an ioctl. Also, it is hard to see how these
> > features relate to the more general idea of the clockid.
> 
> This may be a good approach, but be aware that adding stuff to sysfs
> requires similar scrutiny as adding a syscall.  

Yes, it will be properly documented and maintained. I have already
implemented the ancillary stuff in two ways, via sysfs and with a
character device. The next patch set will include them both, and you
all can just choose which one to delete (or leave them both).

> > In contrast, sysfs attributes will fit the need nicely:
> > 
> > 1. enable or disable pps
> > 2. enable or disable external timestamps
> > 3. read out external timestamp
> > 4. configure period for periodic output
> 
> Things to consider here:
> Does having these options really make sense?

Yes, since they represent the PTP clock's hardware features. As I
explained previously, if you don't have any hardware interfaces, then
having your clocks synchronized to under 100 nanoseconds does not
help you more than having them to within 1 microsecond.

> Why would we want pps disabled?

If you are a master clock, then you want to take your PPS from an
external time source, like GPS.  If you leave the PTP PPS events on,
then they will occur close in time to the GPS PPS events and may add
unwanted latency to the interrupt handler.

> And if that does make sense, would it
> be better to do so via the existing pps interface instead of adding a
> new ptp specific one? 

We have not introduced a new PPS interface. We use the existing PPS
subsystem.

> Same for the timestamps and periodic output (ie: and how do they differ
> from reading or setting a timer on CLOCK_PTP?)

The posix timer calls won't work:

I have a PTP hardware clock with multiple external timestamp
channels. Using timer_gettime, how can I specify (or decode) the
channel of interest to me?

> > This is a good example of the poverty (in regards to time
> > synchronization) of our current systems.
> > 
> > Lets say I want to build a surround sound audio system, using a set of
> > distributed computers, each host connected to one speaker. How I can
> > be sure that the samples in one channel (ie one host) pass through the
> > DA converter at exactly the same time?
> 
> They won't be exactly the same, but to minimize any noticeable
> difference we'd need each speaker/client-system that have their system
> time closely synced. Then the server-system would need to send the
> channel stream and frame times to each client. The clients would then
> feed the audio frames to the audio card at the designated times.
> 
> This is a little high level and generic and of course, the devil's in
> the details:
> 
> 1) How is the system time synchronized across systems?
> 
> 2) How is the error between the system time freq and the audio cards
> rate addressed?
> 
> These are things that need to be addressed, but the high-level design is
> what the applications should target, because it doesn't limit them to
> the specifics of the details.
> 
> By suggesting the application be designed to use CLOCK_PTP, it limits
> itself to systems with CLOCK_PTP hardware, and should the application be
> ported to a different distributed system that's using RADclocks or some
> other synchronization method, it won't function.
> 
> What the kernel needs to provide are ways to address #1 and #2 above,
> but what the kernel needs to expose to userland should be minimal and
> generic.

My point was this:

The application requires that the soundcard DA clocks (*not* the CPU
clocks) be synchronized. Currently the Linux kernel offers no way at
all to do this.

> > The clock and its adjustment have nothing to do with a network
> > socket. The current PTP hacks floating around all add private ioctls
> > to the MAC driver. That is the *wrong* way to do it.
> 
> Could you clarify on *why* that is the wrong approach?

Christian explained this pretty well.

> Questions:
> 1) When the PTP hardware is doing the timestamping, what API/interface
> does PTPd use to get and send the Sync/Delay_req/Delay_Resp messages?
>
> SO_TIMESTAMPed packets from a network device seems the obvious answer,
> but your comments above with regard to my SO_TIMESTAMP_ADJ idea
> suggest there's something more subtle here.

Nope, no magic here, just a plain old UDP socket.

> 2) You've mentioned multiple PTP hardware clocks are possible, but maybe
> not practically useful. How does PTPd enumerate the existing clocks, and
> know which devices to listen to for Sync/Delay_Resp messages?
> 
> The issue I'm trying to address here is the interface inconsistency
> between the message timestamping interface (ie: likely from a packet,
> possibly multiple sources) and the proposed CLOCK_PTP interface (with
> only a single clock being exposed at a time, and that being controlled
> by a sysfs interface).

The sysfs will include one class device for each PTP clock. Each clock
has a sysfs attribute with the corresponding clock id.

For the network timestamps, they need to be enabled using the
SIOCSHWTSTAMP ioctl. We can easily extend that ioctl to include the
desired clock id.

> My concerns: 
> 1) Again, I'm not totally comfortable exposing the PTP hardware via the
> posix-clocks/timers interface. I'm not dead set against it, but it just
> doesn't seem right as a top-level abstraction.

I would also be happy with the character device idea already
posted. Just pick one of the two, and I'll resubmit the patch set...

> I'm curious if it's possible to do the PTP hardware offset/adjustment
> calculation in a module internally to the kernel? That would allow the
> PPS interface to still be used to sync the system time, and not expose
> additional interfaces.

It would be possible, but not too nice, IMHO. In contrast to NTP,
there is no real need to place the servo in the kernel. Having the
protocol code and servo in user space makes life much easier.

> 2) If this is a top-level interface, I'd prefer the inconsistency
> between how the timestamped messages are received and the proposed
> posix_clocks/timer interface be clarified. 
> 
> For example: does the networking stack need to have the source clock_id
> to use for SO_TIMESTAMPing be specified?

We could do it that way. I never liked the <SW,SYS,RAW> tuple in the
SO_TIMESTAMPING control message in the first place. Sending three
fields with two blank seems wasteful. Instead, <clockid,timestamp>
would be sufficient. Maybe it is too late to change this.
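
To show what handling the three-field tuple means for a receiver, here
is a minimal sketch. It assumes the three timespecs have already been
copied out of the control message in the order software, hardware
converted to system time, raw hardware. The function name is
hypothetical.

```c
#include <stddef.h>
#include <time.h>

/* The control message carries three timespecs: [0] software,
 * [1] hardware converted to system time, [2] raw hardware.
 * Return the raw hardware stamp if present, else the software
 * one, else NULL when neither field is filled in. */
const struct timespec *pick_timestamp(const struct timespec ts[3])
{
	if (ts[2].tv_sec || ts[2].tv_nsec)
		return &ts[2];
	if (ts[0].tv_sec || ts[0].tv_nsec)
		return &ts[0];
	return NULL;
}
```

With a <clockid,timestamp> pair, the receiver would not need to probe
for the non-zero field at all.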

Even without altering the timestamp, we can simply augment
SIOCSHWTSTAMP with a clock id.

At this point I would just like to go forward with one of the two
proposed APIs. I had modelled the character device on the posix clock
calls in order to make it immediately familiar, and I think it is a
viable approach. After the lkml discussion, I think it is even cleaner
and nicer to just offer a new clock id.

Thanks for all the feedback and comments,

Richard

* Re: [PATCH 1/5] ptp: Added a brand new class driver for ptp clocks.
From: Arnd Bergmann @ 2010-08-27 12:03 UTC
  To: Richard Cochran
  Cc: Rodolfo Giometti, john stultz, devicetree-discuss, linux-kernel,
	netdev, linuxppc-dev, linux-arm-kernel, Krzysztof Halasa
In-Reply-To: <20100827110855.GA11657@riccoc20.at.omicron.at>

On Friday 27 August 2010, Richard Cochran wrote:
> On Mon, Aug 23, 2010 at 01:21:39PM -0700, john stultz wrote:
> > On Thu, 2010-08-19 at 17:38 +0200, Richard Cochran wrote:
> > > On Thu, Aug 19, 2010 at 02:28:04PM +0200, Arnd Bergmann wrote:
> > > > Have you considered passing a struct timex instead of ppb and ts?
> > > 
> > > Yes, but the timex is not suitable, IMHO.
> > 
> > Could you expand on this?
> 
> We need to be able to specify that the call is for a PTP clock. We could
> add that to the modes flag, like this:
> 
> /*timex.h*/
> #define ADJ_PTP_0 0x10000
> #define ADJ_PTP_1 0x20000
> #define ADJ_PTP_2 0x30000
> #define ADJ_PTP_3 0x40000
>
> I can live with this, if everyone else can, too.

My suggestion was actually to have a new syscall with the existing
structure, and pass a clockid_t value to it, similar to your
sys_clock_adjtime(), not change the actual sys_adjtime syscall.
  
> > Could we not add an adjustment mode ADJ_SETOFFSET or something that would
> > provide the instantaneous offset correction?
> 
> Yes, but we would also need to add a struct timespec to the struct
> timex, in order to get nanosecond resolution. I think it would be
> possible to do in the padding at the end?

Yes, that's exactly what the padding is for. Instead of a timespec, you
can probably add extra values that replace the existing ppm values with
ppb values.

	Arnd

* Re: [PATCH 1/5] ptp: Added a brand new class driver for ptp clocks.
From: Richard Cochran @ 2010-08-27 11:08 UTC
  To: john stultz
  Cc: Rodolfo Giometti, Arnd Bergmann, netdev, devicetree-discuss,
	linux-kernel, linuxppc-dev, linux-arm-kernel, Krzysztof Halasa
In-Reply-To: <1282594899.3111.358.camel@localhost.localdomain>

On Mon, Aug 23, 2010 at 01:21:39PM -0700, john stultz wrote:
> On Thu, 2010-08-19 at 17:38 +0200, Richard Cochran wrote:
> > On Thu, Aug 19, 2010 at 02:28:04PM +0200, Arnd Bergmann wrote:
> > > My point was that a syscall is better than an ioctl based interface here,
> > > which I definitely still think. Given that John knows much more about
> > > clocks than I do, we still need to get agreement on the question that
> > > he raised, which is whether we actually need to expose this clock to the
> > > user or not.
> > > 
> > > If we can find a way to sync system time accurate enough with PTP and
> > > PPS, user applications may not need to see two separate clocks at all.
> > 
> > At the very least, one user application (the PTPd) needs to see the
> > PTP clock.
> > 
> > > > SYSCALL_DEFINE3(clock_adjtime, const clockid_t, clkid,
> > > > 		int, ppb, struct timespec __user *, ts)
> > > > 
> > > > ppb - desired frequency adjustment in parts per billion
> > > > ts  - desired time step (or jump) in <sec,nsec> to correct
> > > >       a measured offset
> > > > 
> > > > Arguably, this syscall might be useful for other clocks, too.
> > > 
> > > This is a mix of adjtime and adjtimex with the addition of
> > > the clkid parameter, right?
> > 
> > Sort of, but not really. ADJTIME(3) takes an offset and slowly
> > corrects the clock using a servo in the kernel, over hours.
> > 
> > For this function, the offset passed in the 'ts' parameter will be
> > immediately corrected, by jumping to the new time. This reflects the
> > way that PTP works. After the first few samples, the PTPd has an
> > estimate of the offset to the master and the rate difference. The PTPd
> > can proceed in one of two ways.
> > 
> > 1. If resetting the clock is not desired, then the clock is set to the
> >    maximum adjustment (in the right direction) until the clock time is
> >    close to the master's time.
> > 
> > 2. The estimated offset is added to the current time, resulting in a
> >    jump in time.
> > 
> > We need clock_adjtime(id, 0, ts) for the second case.
> >
> > > Have you considered passing a struct timex instead of ppb and ts?
> > 
> > Yes, but the timex is not suitable, IMHO.
> 
> Could you expand on this?

We need to be able to specify that the call is for a PTP clock. We could
add that to the modes flag, like this:

/*timex.h*/
#define ADJ_PTP_0 0x10000
#define ADJ_PTP_1 0x20000
#define ADJ_PTP_2 0x30000
#define ADJ_PTP_3 0x40000

I can live with this, if everyone else can, too.
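
A minimal sketch of how the receiving side could decode such a modes
value follows. The four ADJ_PTP_* values are taken from the defines
above. The mask and the function name are hypothetical additions.

```c
#define ADJ_PTP_0	0x10000
#define ADJ_PTP_1	0x20000
#define ADJ_PTP_2	0x30000
#define ADJ_PTP_3	0x40000
#define ADJ_PTP_MASK	0x70000	/* hypothetical: covers all four values */

/* Return the PTP clock index (0..3) selected in modes, or -1 if the
 * call does not address a PTP clock. The values above form a small
 * number field rather than independent bit flags. */
int adj_ptp_index(unsigned int modes)
{
	unsigned int sel = (modes & ADJ_PTP_MASK) >> 16;

	return (sel >= 1 && sel <= 4) ? (int)sel - 1 : -1;
}
```

Because the values are not single bits, a mask and shift is needed
rather than a simple bit test.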
 
> Could we not add an adjustment mode ADJ_SETOFFSET or something that would
> provide the instantaneous offset correction?

Yes, but we would also need to add a struct timespec to the struct
timex, in order to get nanosecond resolution. I think it would be
possible to do in the padding at the end?
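
The arithmetic behind such an instantaneous correction is sketched
below, assuming the offset is passed as a signed <sec,nsec> pair and the
current time is already normalized. The function name is hypothetical.

```c
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Add a signed <sec,nsec> offset to a time value and normalize the
 * result so that 0 <= tv_nsec < NSEC_PER_SEC. Assumes the offset's
 * tv_nsec magnitude is below one second. */
struct timespec timespec_step(struct timespec now, struct timespec offset)
{
	struct timespec r;

	r.tv_sec = now.tv_sec + offset.tv_sec;
	r.tv_nsec = now.tv_nsec + offset.tv_nsec;
	if (r.tv_nsec >= NSEC_PER_SEC) {
		r.tv_sec++;
		r.tv_nsec -= NSEC_PER_SEC;
	} else if (r.tv_nsec < 0) {
		r.tv_sec--;
		r.tv_nsec += NSEC_PER_SEC;
	}
	return r;
}
```

The existing offset field in struct timex cannot express this, which is
why a timespec in the padding comes up.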

> You're right that the timex is a little crufty. But it's legacy that we
> will support indefinitely. So following the established interface helps
> maintainability.

We can use it for PTP, with the modifications suggested above. Or we
can just introduce the clock_adjtime method, instead.
 
> So if the clock_adjtime interface is needed, it would seem best for it
> to be generic enough to support not only PTP, but also the NTP kernel
> PLL.

For the proposed clock_adjtime, what else is needed to support clock
adjustment in general?

I don't mind making the interface generic enough to support any
(realistic) conceivable clock adjustment scheme, but beyond the
present PTP hardware clocks, I don't know what else might be needed.

Richard


* [PATCH] mpc8308: fix USB DR controller initialization
From: Ilya Yanok @ 2010-08-27 10:57 UTC (permalink / raw)
  To: linuxppc-dev, wd, dzu, vlad; +Cc: Ilya Yanok

MPC8308 has ULPI pin muxing settings in SICRH register, bits 17-18
which is different from both MPC8313 and MPC8315.
Also MPC8308 doesn't have REFSEL, UTMI_PHY_EN and OTG_PORT fields
in the USB DR controller CONTROL register.

Signed-off-by: Ilya Yanok <yanok@emcraft.com>
---
 arch/powerpc/boot/dts/mpc8308rdb.dts  |    2 +-
 arch/powerpc/platforms/83xx/mpc83xx.h |    2 ++
 arch/powerpc/platforms/83xx/usb.c     |   21 ++++++++++++++++-----
 3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/boot/dts/mpc8308rdb.dts b/arch/powerpc/boot/dts/mpc8308rdb.dts
index a97eb2d..1e2b888 100644
--- a/arch/powerpc/boot/dts/mpc8308rdb.dts
+++ b/arch/powerpc/boot/dts/mpc8308rdb.dts
@@ -109,7 +109,7 @@
 		#address-cells = <1>;
 		#size-cells = <1>;
 		device_type = "soc";
-		compatible = "fsl,mpc8315-immr", "simple-bus";
+		compatible = "fsl,mpc8308-immr", "simple-bus";
 		ranges = <0 0xe0000000 0x00100000>;
 		reg = <0xe0000000 0x00000200>;
 		bus-frequency = <0>;
diff --git a/arch/powerpc/platforms/83xx/mpc83xx.h b/arch/powerpc/platforms/83xx/mpc83xx.h
index 0fea881..82a4345 100644
--- a/arch/powerpc/platforms/83xx/mpc83xx.h
+++ b/arch/powerpc/platforms/83xx/mpc83xx.h
@@ -35,6 +35,8 @@
 
 /* system i/o configuration register high */
 #define MPC83XX_SICRH_OFFS         0x118
+#define MPC8308_SICRH_USB_MASK     0x000c0000
+#define MPC8308_SICRH_USB_ULPI     0x00040000
 #define MPC834X_SICRH_USB_UTMI     0x00020000
 #define MPC831X_SICRH_USB_MASK     0x000000e0
 #define MPC831X_SICRH_USB_ULPI     0x000000a0
diff --git a/arch/powerpc/platforms/83xx/usb.c b/arch/powerpc/platforms/83xx/usb.c
index 3ba4bb7..2c64164 100644
--- a/arch/powerpc/platforms/83xx/usb.c
+++ b/arch/powerpc/platforms/83xx/usb.c
@@ -127,7 +127,8 @@ int mpc831x_usb_cfg(void)
 
 	/* Configure clock */
 	immr_node = of_get_parent(np);
-	if (immr_node && of_device_is_compatible(immr_node, "fsl,mpc8315-immr"))
+	if (immr_node && (of_device_is_compatible(immr_node, "fsl,mpc8315-immr") ||
+			of_device_is_compatible(immr_node, "fsl,mpc8308-immr")))
 		clrsetbits_be32(immap + MPC83XX_SCCR_OFFS,
 		                MPC8315_SCCR_USB_MASK,
 		                MPC8315_SCCR_USB_DRCM_01);
@@ -138,7 +139,11 @@ int mpc831x_usb_cfg(void)
 
 	/* Configure pin mux for ULPI.  There is no pin mux for UTMI */
 	if (prop && !strcmp(prop, "ulpi")) {
-		if (of_device_is_compatible(immr_node, "fsl,mpc8315-immr")) {
+		if (of_device_is_compatible(immr_node, "fsl,mpc8308-immr")) {
+			clrsetbits_be32(immap + MPC83XX_SICRH_OFFS,
+					MPC8308_SICRH_USB_MASK,
+					MPC8308_SICRH_USB_ULPI);
+		} else if (of_device_is_compatible(immr_node, "fsl,mpc8315-immr")) {
 			clrsetbits_be32(immap + MPC83XX_SICRL_OFFS,
 					MPC8315_SICRL_USB_MASK,
 					MPC8315_SICRL_USB_ULPI);
@@ -173,6 +178,9 @@ int mpc831x_usb_cfg(void)
 		     !strcmp(prop, "utmi"))) {
 		u32 refsel;
 
+		if (of_device_is_compatible(immr_node, "fsl,mpc8308-immr"))
+			goto out;
+
 		if (of_device_is_compatible(immr_node, "fsl,mpc8315-immr"))
 			refsel = CONTROL_REFSEL_24MHZ;
 		else
@@ -186,9 +194,11 @@ int mpc831x_usb_cfg(void)
 		temp = CONTROL_PHY_CLK_SEL_ULPI;
 #ifdef CONFIG_USB_OTG
 		/* Set OTG_PORT */
-		dr_mode = of_get_property(np, "dr_mode", NULL);
-		if (dr_mode && !strcmp(dr_mode, "otg"))
-			temp |= CONTROL_OTG_PORT;
+		if (!of_device_is_compatible(immr_node, "fsl,mpc8308-immr")) {
+			dr_mode = of_get_property(np, "dr_mode", NULL);
+			if (dr_mode && !strcmp(dr_mode, "otg"))
+				temp |= CONTROL_OTG_PORT;
+		}
 #endif /* CONFIG_USB_OTG */
 		out_be32(usb_regs + FSL_USB2_CONTROL_OFFS, temp);
 	} else {
@@ -196,6 +206,7 @@ int mpc831x_usb_cfg(void)
 		ret = -EINVAL;
 	}
 
+out:
 	iounmap(usb_regs);
 	of_node_put(np);
 	return ret;
-- 
1.6.2.5


* Re: [PATCH 1/5] ptp: Added a brand new class driver for ptp clocks.
From: Richard Cochran @ 2010-08-27  7:57 UTC (permalink / raw)
  To: john stultz
  Cc: Rodolfo Giometti, Arnd Bergmann, netdev, devicetree-discuss,
	linux-kernel, Christian Riesch, linuxppc-dev, linux-arm-kernel,
	Krzysztof Halasa
In-Reply-To: <1282874269.4371.74.camel@localhost.localdomain>

On Thu, Aug 26, 2010 at 06:57:49PM -0700, john stultz wrote:
> On Wed, 2010-08-25 at 11:40 +0200, Christian Riesch wrote:
> > 2) Master clock:
> > We have one or more network ports. Our system has a really good clock
> > (ovenized quartz crystal, an atomic clock, a GPS timing receiver...)
> > and it distributes this time on the network. In such a case we do not
> > steer our clock based on the (packet) timestamps we get from our
> > timestamping unit. Instead, we directly drive our clock hardware with
> > a very stable frequency that we get from the OCXO or the atomic
> > clock...
>
> Ok. Following you here...
>
> > or we use one of the ancillary features of the PTP clock that
> > Richard mentioned to timestamp not network packets but a 1pps signal
> > and use these timestamps to steer the clock.
>
> Wait.. I thought we weren't using PTP to steer the clock? But now we're
> using the pps signal from it to do so? Do I misunderstand you? Or did
> you just not mean this?

The master node in a PTP network probably takes its time from a
precise external time source, like GPS. The GPS provides a 1 PPS
directly to the PTP clock hardware, which latches the PTP hardware
clock time on the PPS edge. This provides one sample as input to a
clock servo (in the PTPd) that, in turn, regulates the PTP clock
hardware.
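
To make the control loop concrete, here is a minimal sketch of a
proportional-integral servo of the kind a PTPd might run on each
latched sample. Everything in it is an assumption made for
illustration: the struct, the function name and the gains are
hypothetical, and a real daemon would choose and tune its own servo.

```c
/* Hypothetical PI servo state; the gains are illustrative, not tuned. */
struct pi_servo {
	double kp;       /* proportional gain, ppb per ns of offset */
	double ki;       /* integral gain, ppb per ns of offset */
	double integral; /* accumulated frequency correction, in ppb */
	double max_ppb;  /* largest adjustment the clock hardware accepts */
};

/*
 * offset_ns is the latched PTP clock time minus the reference time
 * (for example the PPS edge), in nanoseconds. The return value is the
 * frequency adjustment to apply to the PTP clock, in ppb. A clock that
 * is ahead (positive offset) gets a negative adjustment.
 */
static double pi_servo_sample(struct pi_servo *s, double offset_ns)
{
	double ppb;

	s->integral += s->ki * -offset_ns;
	ppb = s->kp * -offset_ns + s->integral;

	/* never ask for more than the hardware can deliver */
	if (ppb > s->max_ppb)
		ppb = s->max_ppb;
	else if (ppb < -s->max_ppb)
		ppb = -s->max_ppb;

	return ppb;
}
```

Each PPS edge yields one offset sample, and the returned value is what
would be handed to the clock's frequency adjustment.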

> > Packet time stamping is
> > used to distribute the time to the slaves, but it is not part of the
> > control loop in this case.
>
> I assume here you mean PTPd is steering the PTP clock according to the
> system time (which is NTP/GPS/whatever sourced)? And then the PTP clock
> distributes that time through the network?

Yes, but in this case, "system time" has nothing to do with the Linux
system time. For a PTP master clock, it really doesn't matter whether
the Linux time is correct, or not. It doesn't hurt either (but see
below about chaining servos).

> So first of all, thanks for the extra explanation and context here! I
> really appreciate it, as I'm not familiar with all the hardware details
> and possible use cases, but I'm trying to learn.
>
> So in the two cases you mention, the time "flow" is something like:
>
> #1) [Master Clock on Network1] => [PTP Clock] => [PTPd] =>
> 	[PTP Clock] => [PTP Clients on Network2]

I would only draw the PTP clock once, perhaps like this:

                                +------------------------------+
                                |                              ^
                                V                              |
[Master Clock on NW 1]--->[HW timestamp]--->[PTPd]--adj-->[PTP Clock]
[Slaves on NW 2,3,...]<---[HW timestamp]<---[    ]

>
> #2) [GPS] => [NTPd] => [System Time] => [PTPd] => [PTP clock] =>
> 	[PTP clients on Network]

Nope. More like this:

                                      +------------------------------+
                                      |                              ^
                                      V                              |
[GPS]----------PPS--------->[Latch  timestamp]--->[PTPd]--adj-->[PTP Clock]
[Slaves on NW 1,2,3,...]<---[HW pkt timestamp]<---[    ]

> And the original case:
> #3) [Master Clock on Network] => [PTP clock] => [PTPd] => [PTP clock]

More like:

[Master Clock on NW 1]--->[HW timestamp]--->[PTPd]--adj-->[PTP Clock]

> With a secondary control flow:
> 	[PPS signal from PTP clock] => [NTPd] => [System Time]

Yes.


> So, just brainstorming here, I guess the question I'm trying to figure
> out here, is can the "System Time" and "PTP clock" be merged/globbed
> into a single "Time" interface from the userspace point of view?

This is the core issue and source of misunderstanding, in my view. The
fact of the matter is, the current generation of computers has
multiple clocks, and these are usually unsynchronized. I think we
should not try too hard to cover up or work around this. It is a fact
of life.

It would be nice if there were only one clock, and that clock could do
everything that we need. Indeed, the next generation of SoC computers may
all have PTP built in to the main CPU. Well, one can always wish.

If we can make it appear that multiple clocks are just one clock, then
I agree that we should do it. But I would not want to sacrifice
synchronization accuracy or application features just to keep that
illusion.

> In other words, if internal to the kernel, the PTP clock was always
> synced to the system time, couldn't the flow look something like:
>
> #3') [Master clock on network] => [PTP clock] => [PTPd] =>
> 	 [System Time] => [in-kernel sync thread] => [PTP clock]
>
> So PTPd sees the offset adjustment from the PTP clock, and then feeds
> that offset correction right into (a possibly enhanced) adjtimex. The
> kernel would then immediately steer the PTP clock by the same amount to
> keep it in sync with system time (along with a periodic offset/freq
> correction step to deal with crystal drift).
>
> Similarly:
>
> #2') [GPS] => [NTPd] => [System Time] => [in-kernel sync thread] =>
> 		[PTP clock] => [PTP clients on Network]
>
> and
>
> #1') [Master Clock on Network1] => [PTP Clock] => [PTPd] =>
> 	[System Time] => [in-kernel sync thread] => [PTP Clock] =>
> 	[PTP Clients on Network2]
>
> Now, I realize PTP purists probably won't like this, because it
> effectively makes the in-kernel sync thread similar to a PTP boundary
> clock (or worse, since the control loop isn't exactly direct).

I don't like it. The experience with PTP boundary clocks already shows
that the errors in servo loops cascade. It worsens the PTP performance.

However, I think it would be fine just to synch the Linux system time to
the PTP clock using the PPS interface, which is what my patch set
implements. Linux user applications would not be able to detect the
difference.

Richard


* [PATCH 2/2] powerpc, pseries: Re-enable dispatch trace log userspace interface
From: Paul Mackerras @ 2010-08-27  5:57 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <20100827055643.GA19033@brick.ozlabs.ibm.com>

Since the cpu accounting code uses the hypervisor dispatch trace log
now when CONFIG_VIRT_CPU_ACCOUNTING = y, the previous commit disabled
access to it via files in the /sys/kernel/debug/powerpc/dtl/ directory
in that case.  This restores those files.

To do this, we now have a hook that the cpu accounting code will call
as it processes each entry from the hypervisor dispatch trace log.
The code in dtl.c now uses that to fill up its ring buffer, rather
than having the hypervisor fill the ring buffer directly.

This also fixes dtl_file_read() to handle overflow conditions a bit
better and adds a spinlock to ensure that race conditions (multiple
processes opening or reading the file concurrently) are handled
correctly.
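
The overflow handling in dtl_file_read() can be sketched in isolation
as follows. This is a standalone userspace illustration that mirrors
the index arithmetic in the diff below; the struct and function names
are hypothetical and do not appear in the patch. It assumes that
cur_idx never runs behind last_idx.

```c
#include <stdint.h>

/* Hypothetical result type: which entries a read may copy out. */
struct read_window {
	uint64_t start; /* first log index to copy */
	uint64_t count; /* number of entries to copy */
};

static struct read_window dtl_read_window(uint64_t last_idx,
					  uint64_t cur_idx,
					  uint64_t buf_entries,
					  uint64_t n_req)
{
	struct read_window w;

	/*
	 * If the writer has lapped the reader, the oldest entries are
	 * gone; skip forward to the oldest index still in the ring.
	 */
	if (last_idx + buf_entries <= cur_idx)
		last_idx = cur_idx - buf_entries + 1;

	/* never read past what has been written so far */
	if (last_idx + n_req > cur_idx)
		n_req = cur_idx - last_idx;

	w.start = last_idx;
	w.count = n_req;
	return w;
}
```

The position in the ring is then start % buf_entries, and a copy that
would run past the end of the buffer is split into a tail part and a
head part.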

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/lppaca.h    |    8 ++
 arch/powerpc/kernel/time.c           |    6 +-
 arch/powerpc/platforms/pseries/dtl.c |  207 +++++++++++++++++++++++++++-------
 3 files changed, 180 insertions(+), 41 deletions(-)

diff --git a/arch/powerpc/include/asm/lppaca.h b/arch/powerpc/include/asm/lppaca.h
index cfb85ec..7f5e0fe 100644
--- a/arch/powerpc/include/asm/lppaca.h
+++ b/arch/powerpc/include/asm/lppaca.h
@@ -191,6 +191,14 @@ struct dtl_entry {
 #define DISPATCH_LOG_BYTES	4096	/* bytes per cpu */
 #define N_DISPATCH_LOG		(DISPATCH_LOG_BYTES / sizeof(struct dtl_entry))
 
+/*
+ * When CONFIG_VIRT_CPU_ACCOUNTING = y, the cpu accounting code controls
+ * reading from the dispatch trace log.  If other code wants to consume
+ * DTL entries, it can set this pointer to a function that will get
+ * called once for each DTL entry that gets processed.
+ */
+extern void (*dtl_consumer)(struct dtl_entry *entry, u64 index);
+
 #endif /* CONFIG_PPC_BOOK3S */
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_LPPACA_H */
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index fca2064..bcb738b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -183,6 +183,8 @@ DEFINE_PER_CPU(unsigned long, cputime_scaled_last_delta);
 
 cputime_t cputime_one_jiffy;
 
+void (*dtl_consumer)(struct dtl_entry *, u64);
+
 static void calc_cputime_factors(void)
 {
 	struct div_result res;
@@ -218,7 +220,7 @@ static u64 read_spurr(u64 tb)
  */
 static u64 scan_dispatch_log(u64 stop_tb)
 {
-	unsigned long i = local_paca->dtl_ridx;
+	u64 i = local_paca->dtl_ridx;
 	struct dtl_entry *dtl = local_paca->dtl_curr;
 	struct dtl_entry *dtl_end = local_paca->dispatch_log_end;
 	struct lppaca *vpa = local_paca->lppaca_ptr;
@@ -229,6 +231,8 @@ static u64 scan_dispatch_log(u64 stop_tb)
 	if (i == vpa->dtl_idx)
 		return 0;
 	while (i < vpa->dtl_idx) {
+		if (dtl_consumer)
+			dtl_consumer(dtl, i);
 		dtb = dtl->timebase;
 		tb_delta = dtl->enqueue_to_dispatch_time +
 			dtl->ready_to_enqueue_time;
diff --git a/arch/powerpc/platforms/pseries/dtl.c b/arch/powerpc/platforms/pseries/dtl.c
index 0357655..68cb2f2 100644
--- a/arch/powerpc/platforms/pseries/dtl.c
+++ b/arch/powerpc/platforms/pseries/dtl.c
@@ -23,6 +23,7 @@
 #include <linux/init.h>
 #include <linux/slab.h>
 #include <linux/debugfs.h>
+#include <linux/spinlock.h>
 #include <asm/smp.h>
 #include <asm/system.h>
 #include <asm/uaccess.h>
@@ -37,6 +38,7 @@ struct dtl {
 	int			cpu;
 	int			buf_entries;
 	u64			last_idx;
+	spinlock_t		lock;
 };
 static DEFINE_PER_CPU(struct dtl, cpu_dtl);
 
@@ -55,25 +57,97 @@ static u8 dtl_event_mask = 0x7;
 static int dtl_buf_entries = (16 * 85);
 
 
-static int dtl_enable(struct dtl *dtl)
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+struct dtl_ring {
+	u64	write_index;
+	struct dtl_entry *write_ptr;
+	struct dtl_entry *buf;
+	struct dtl_entry *buf_end;
+	u8	saved_dtl_mask;
+};
+
+static DEFINE_PER_CPU(struct dtl_ring, dtl_rings);
+
+static atomic_t dtl_count;
+
+/*
+ * The cpu accounting code controls the DTL ring buffer, and we get
+ * given entries as they are processed.
+ */
+static void consume_dtle(struct dtl_entry *dtle, u64 index)
 {
-	unsigned long addr;
-	int ret, hwcpu;
+	struct dtl_ring *dtlr = &__get_cpu_var(dtl_rings);
+	struct dtl_entry *wp = dtlr->write_ptr;
+	struct lppaca *vpa = local_paca->lppaca_ptr;
 
-	/* only allow one reader */
-	if (dtl->buf)
-		return -EBUSY;
+	if (!wp)
+		return;
 
-	/* we need to store the original allocation size for use during read */
-	dtl->buf_entries = dtl_buf_entries;
+	*wp = *dtle;
+	barrier();
 
-	dtl->buf = kmalloc_node(dtl->buf_entries * sizeof(struct dtl_entry),
-			GFP_KERNEL, cpu_to_node(dtl->cpu));
-	if (!dtl->buf) {
-		printk(KERN_WARNING "%s: buffer alloc failed for cpu %d\n",
-				__func__, dtl->cpu);
-		return -ENOMEM;
-	}
+	/* check for hypervisor ring buffer overflow, ignore this entry if so */
+	if (index + N_DISPATCH_LOG < vpa->dtl_idx)
+		return;
+
+	++wp;
+	if (wp == dtlr->buf_end)
+		wp = dtlr->buf;
+	dtlr->write_ptr = wp;
+
+	/* incrementing write_index makes the new entry visible */
+	smp_wmb();
+	++dtlr->write_index;
+}
+
+static int dtl_start(struct dtl *dtl)
+{
+	struct dtl_ring *dtlr = &per_cpu(dtl_rings, dtl->cpu);
+
+	dtlr->buf = dtl->buf;
+	dtlr->buf_end = dtl->buf + dtl->buf_entries;
+	dtlr->write_index = 0;
+
+	/* setting write_ptr enables logging into our buffer */
+	smp_wmb();
+	dtlr->write_ptr = dtl->buf;
+
+	/* enable event logging */
+	dtlr->saved_dtl_mask = lppaca_of(dtl->cpu).dtl_enable_mask;
+	lppaca_of(dtl->cpu).dtl_enable_mask |= dtl_event_mask;
+
+	dtl_consumer = consume_dtle;
+	atomic_inc(&dtl_count);
+	return 0;
+}
+
+static void dtl_stop(struct dtl *dtl)
+{
+	struct dtl_ring *dtlr = &per_cpu(dtl_rings, dtl->cpu);
+
+	dtlr->write_ptr = NULL;
+	smp_wmb();
+
+	dtlr->buf = NULL;
+
+	/* restore dtl_enable_mask */
+	lppaca_of(dtl->cpu).dtl_enable_mask = dtlr->saved_dtl_mask;
+
+	if (atomic_dec_and_test(&dtl_count))
+		dtl_consumer = NULL;
+}
+
+static u64 dtl_current_index(struct dtl *dtl)
+{
+	return per_cpu(dtl_rings, dtl->cpu).write_index;
+}
+
+#else /* CONFIG_VIRT_CPU_ACCOUNTING */
+
+static int dtl_start(struct dtl *dtl)
+{
+	unsigned long addr;
+	int ret, hwcpu;
 
 	/* Register our dtl buffer with the hypervisor. The HV expects the
 	 * buffer size to be passed in the second word of the buffer */
@@ -85,12 +159,11 @@ static int dtl_enable(struct dtl *dtl)
 	if (ret) {
 		printk(KERN_WARNING "%s: DTL registration for cpu %d (hw %d) "
 		       "failed with %d\n", __func__, dtl->cpu, hwcpu, ret);
-		kfree(dtl->buf);
 		return -EIO;
 	}
 
 	/* set our initial buffer indices */
-	dtl->last_idx = lppaca_of(dtl->cpu).dtl_idx = 0;
+	lppaca_of(dtl->cpu).dtl_idx = 0;
 
 	/* ensure that our updates to the lppaca fields have occurred before
 	 * we actually enable the logging */
@@ -102,17 +175,66 @@ static int dtl_enable(struct dtl *dtl)
 	return 0;
 }
 
-static void dtl_disable(struct dtl *dtl)
+static void dtl_stop(struct dtl *dtl)
 {
 	int hwcpu = get_hard_smp_processor_id(dtl->cpu);
 
 	lppaca_of(dtl->cpu).dtl_enable_mask = 0x0;
 
 	unregister_dtl(hwcpu, __pa(dtl->buf));
+}
+
+static u64 dtl_current_index(struct dtl *dtl)
+{
+	return lppaca_of(dtl->cpu).dtl_idx;
+}
+#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
 
+static int dtl_enable(struct dtl *dtl)
+{
+	long int n_entries;
+	long int rc;
+	struct dtl_entry *buf = NULL;
+
+	/* only allow one reader */
+	if (dtl->buf)
+		return -EBUSY;
+
+	n_entries = dtl_buf_entries;
+	buf = kmalloc_node(n_entries * sizeof(struct dtl_entry),
+			GFP_KERNEL, cpu_to_node(dtl->cpu));
+	if (!buf) {
+		printk(KERN_WARNING "%s: buffer alloc failed for cpu %d\n",
+				__func__, dtl->cpu);
+		return -ENOMEM;
+	}
+
+	spin_lock(&dtl->lock);
+	rc = -EBUSY;
+	if (!dtl->buf) {
+		/* store the original allocation size for use during read */
+		dtl->buf_entries = n_entries;
+		dtl->buf = buf;
+		dtl->last_idx = 0;
+		rc = dtl_start(dtl);
+		if (rc)
+			dtl->buf = NULL;
+	}
+	spin_unlock(&dtl->lock);
+
+	if (rc)
+		kfree(buf);
+	return rc;
+}
+
+static void dtl_disable(struct dtl *dtl)
+{
+	spin_lock(&dtl->lock);
+	dtl_stop(dtl);
 	kfree(dtl->buf);
 	dtl->buf = NULL;
 	dtl->buf_entries = 0;
+	spin_unlock(&dtl->lock);
 }
 
 /* file interface */
@@ -140,8 +262,9 @@ static int dtl_file_release(struct inode *inode, struct file *filp)
 static ssize_t dtl_file_read(struct file *filp, char __user *buf, size_t len,
 		loff_t *pos)
 {
-	int rc, cur_idx, last_idx, n_read, n_req, read_size;
+	long int rc, n_read, n_req, read_size;
 	struct dtl *dtl;
+	u64 cur_idx, last_idx, i;
 
 	if ((len % sizeof(struct dtl_entry)) != 0)
 		return -EINVAL;
@@ -154,41 +277,49 @@ static ssize_t dtl_file_read(struct file *filp, char __user *buf, size_t len,
 	/* actual number of entries read */
 	n_read = 0;
 
-	cur_idx = lppaca_of(dtl->cpu).dtl_idx;
+	spin_lock(&dtl->lock);
+
+	cur_idx = dtl_current_index(dtl);
 	last_idx = dtl->last_idx;
 
-	if (cur_idx - last_idx > dtl->buf_entries) {
-		pr_debug("%s: hv buffer overflow for cpu %d, samples lost\n",
-				__func__, dtl->cpu);
-	}
+	if (last_idx + dtl->buf_entries <= cur_idx)
+		last_idx = cur_idx - dtl->buf_entries + 1;
+
+	if (last_idx + n_req > cur_idx)
+		n_req = cur_idx - last_idx;
 
-	cur_idx  %= dtl->buf_entries;
-	last_idx %= dtl->buf_entries;
+	if (n_req > 0)
+		dtl->last_idx = last_idx + n_req;
+
+	spin_unlock(&dtl->lock);
+
+	if (n_req <= 0)
+		return 0;
+
+	i = last_idx % dtl->buf_entries;
 
 	/* read the tail of the buffer if we've wrapped */
-	if (last_idx > cur_idx) {
-		read_size = min(n_req, dtl->buf_entries - last_idx);
+	if (i + n_req > dtl->buf_entries) {
+		read_size = dtl->buf_entries - i;
 
-		rc = copy_to_user(buf, &dtl->buf[last_idx],
+		rc = copy_to_user(buf, &dtl->buf[i],
 				read_size * sizeof(struct dtl_entry));
 		if (rc)
 			return -EFAULT;
 
-		last_idx = 0;
+		i = 0;
 		n_req -= read_size;
 		n_read += read_size;
 		buf += read_size * sizeof(struct dtl_entry);
 	}
 
 	/* .. and now the head */
-	read_size = min(n_req, cur_idx - last_idx);
-	rc = copy_to_user(buf, &dtl->buf[last_idx],
-			read_size * sizeof(struct dtl_entry));
+	rc = copy_to_user(buf, &dtl->buf[i], n_req * sizeof(struct dtl_entry));
 	if (rc)
 		return -EFAULT;
 
-	n_read += read_size;
-	dtl->last_idx += n_read;
+	n_read += n_req;
+	dtl->last_idx = last_idx + n_read;
 
 	return n_read * sizeof(struct dtl_entry);
 }
@@ -220,11 +351,6 @@ static int dtl_init(void)
 	struct dentry *event_mask_file, *buf_entries_file;
 	int rc, i;
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
-	/* disable this for now */
-	return -ENODEV;
-#endif
-
 	if (!firmware_has_feature(FW_FEATURE_SPLPAR))
 		return -ENODEV;
 
@@ -251,6 +377,7 @@ static int dtl_init(void)
 	/* set up the per-cpu log structures */
 	for_each_possible_cpu(i) {
 		struct dtl *dtl = &per_cpu(cpu_dtl, i);
+		spin_lock_init(&dtl->lock);
 		dtl->cpu = i;
 
 		rc = dtl_setup_file(dtl);
-- 
1.7.1


* [PATCH 1/2] powerpc: Account time using timebase rather than PURR
From: Paul Mackerras @ 2010-08-27  5:56 UTC (permalink / raw)
  To: linuxppc-dev

Currently, when CONFIG_VIRT_CPU_ACCOUNTING is enabled, we use the
PURR register for measuring the user and system time used by
processes, as well as other related times such as hardirq and
softirq times.  This turns out to be quite confusing for users
because it means that a program will often be measured as taking
less time when run on a multi-threaded processor (SMT2 or SMT4 mode)
than it does when run on a single-threaded processor (ST mode), even
though the program takes longer to finish.  The discrepancy is
accounted for as stolen time, which is also confusing, particularly
when there are no other partitions running.

This changes the accounting to use the timebase instead, meaning that
the reported user and system times are the actual number of real-time
seconds that the program was executing on the processor thread,
regardless of which SMT mode the processor is in.  Thus a program will
generally show greater user and system times when run on a
multi-threaded processor than on a single-threaded processor.

On pSeries systems on POWER5 or later processors, we measure the
stolen time (time when this partition wasn't running) using the
hypervisor dispatch trace log.  We check for new entries in the
log on every entry from user mode and on every transition from
kernel process context to soft or hard IRQ context (i.e. when
account_system_vtime() gets called).  So that we can correctly
distinguish time stolen from user time and time stolen from system
time, without having to check the log on every exit to user mode,
we store separate timestamps for exit to user mode and entry from
user mode.

On systems that have a SPURR (POWER6 and POWER7), we read the SPURR
in account_system_vtime() (as before), and then apportion the SPURR
ticks since the last time we read it between scaled user time and
scaled system time according to the relative proportions of user
time and system time over the same interval.  This avoids having to
read the SPURR on every kernel entry and exit.  On systems that have
PURR but not SPURR (i.e., POWER5), we do the same using the PURR
rather than the SPURR.
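
The apportioning step can be illustrated with a small standalone
sketch. It is not the code from this patch: the names are hypothetical,
and it assumes that the product of the SPURR delta and the user
timebase delta fits in 64 bits, which should hold for the short
intervals between reads.

```c
#include <stdint.h>

/* Hypothetical result type: scaled ticks credited to each mode. */
struct scaled_split {
	uint64_t user;
	uint64_t system;
};

/*
 * Split spurr_delta between user and system time in proportion to the
 * timebase ticks spent in each mode over the same interval.
 */
static struct scaled_split apportion_spurr(uint64_t spurr_delta,
					   uint64_t user_tb,
					   uint64_t sys_tb)
{
	struct scaled_split s = { 0, 0 };
	uint64_t total = user_tb + sys_tb;

	if (total == 0) {
		/* nothing to apportion against; credit system time */
		s.system = spurr_delta;
		return s;
	}

	s.user = spurr_delta * user_tb / total;
	/* give the remainder to system time so no ticks are lost */
	s.system = spurr_delta - s.user;
	return s;
}
```

Computing the system share as the remainder keeps the two parts summing
to exactly the number of SPURR ticks that were read.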

This disables the DTL user interface in /sys/kernel/debug/powerpc/dtl
for now since it conflicts with the use of the dispatch trace log
by the time accounting code.

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
This series goes on top of my "powerpc: Dynamically allocate most
lppaca structs" patch.

 arch/powerpc/include/asm/exception-64s.h |    3 +-
 arch/powerpc/include/asm/lppaca.h        |   19 ++
 arch/powerpc/include/asm/paca.h          |   10 +-
 arch/powerpc/include/asm/ppc_asm.h       |   50 ++++--
 arch/powerpc/include/asm/time.h          |    5 -
 arch/powerpc/kernel/asm-offsets.c        |    8 +-
 arch/powerpc/kernel/entry_64.S           |   18 ++
 arch/powerpc/kernel/process.c            |    1 -
 arch/powerpc/kernel/smp.c                |    5 -
 arch/powerpc/kernel/time.c               |  268 ++++++++++++++----------------
 arch/powerpc/platforms/pseries/dtl.c     |   24 +--
 arch/powerpc/platforms/pseries/lpar.c    |   21 +++
 arch/powerpc/platforms/pseries/setup.c   |   52 ++++++
 13 files changed, 290 insertions(+), 194 deletions(-)

diff --git a/arch/powerpc/include/asm/exception-64s.h b/arch/powerpc/include/asm/exception-64s.h
index 57c4000..7778d6f 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -137,7 +137,8 @@
 	li	r10,0;							   \
 	ld	r11,exception_marker@toc(r2);				   \
 	std	r10,RESULT(r1);		/* clear regs->result		*/ \
-	std	r11,STACK_FRAME_OVERHEAD-16(r1); /* mark the frame	*/
+	std	r11,STACK_FRAME_OVERHEAD-16(r1); /* mark the frame	*/ \
+	ACCOUNT_STOLEN_TIME
 
 /*
  * Exception vectors.
diff --git a/arch/powerpc/include/asm/lppaca.h b/arch/powerpc/include/asm/lppaca.h
index 6d02624..cfb85ec 100644
--- a/arch/powerpc/include/asm/lppaca.h
+++ b/arch/powerpc/include/asm/lppaca.h
@@ -172,6 +172,25 @@ struct slb_shadow {
 
 extern struct slb_shadow slb_shadow[];
 
+/*
+ * Layout of entries in the hypervisor's dispatch trace log buffer.
+ */
+struct dtl_entry {
+	u8	dispatch_reason;
+	u8	preempt_reason;
+	u16	processor_id;
+	u32	enqueue_to_dispatch_time;
+	u32	ready_to_enqueue_time;
+	u32	waiting_to_ready_time;
+	u64	timebase;
+	u64	fault_addr;
+	u64	srr0;
+	u64	srr1;
+};
+
+#define DISPATCH_LOG_BYTES	4096	/* bytes per cpu */
+#define N_DISPATCH_LOG		(DISPATCH_LOG_BYTES / sizeof(struct dtl_entry))
+
 #endif /* CONFIG_PPC_BOOK3S */
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_LPPACA_H */
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 1ff6662..6af6c16 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -85,6 +85,8 @@ struct paca_struct {
 	u8 kexec_state;		/* set when kexec down has irqs off */
 #ifdef CONFIG_PPC_STD_MMU_64
 	struct slb_shadow *slb_shadow_ptr;
+	struct dtl_entry *dispatch_log;
+	struct dtl_entry *dispatch_log_end;
 
 	/*
 	 * Now, starting in cacheline 2, the exception save areas
@@ -134,8 +136,14 @@ struct paca_struct {
 	/* Stuff for accurate time accounting */
 	u64 user_time;			/* accumulated usermode TB ticks */
 	u64 system_time;		/* accumulated system TB ticks */
-	u64 startpurr;			/* PURR/TB value snapshot */
+	u64 user_time_scaled;		/* accumulated usermode SPURR ticks */
+	u64 starttime;			/* TB value snapshot */
+	u64 starttime_user;		/* TB value on exit to usermode */
 	u64 startspurr;			/* SPURR value snapshot */
+	u64 utime_sspurr;		/* ->user_time when ->startspurr set */
+	u64 stolen_time;		/* TB ticks taken by hypervisor */
+	u64 dtl_ridx;			/* read index in dispatch log */
+	struct dtl_entry *dtl_curr;	/* pointer corresponding to dtl_ridx */
 
 #ifdef CONFIG_KVM_BOOK3S_HANDLER
 	/* We use this to store guest state in */
diff --git a/arch/powerpc/include/asm/ppc_asm.h b/arch/powerpc/include/asm/ppc_asm.h
index 498fe09..9821006 100644
--- a/arch/powerpc/include/asm/ppc_asm.h
+++ b/arch/powerpc/include/asm/ppc_asm.h
@@ -9,6 +9,7 @@
 #include <asm/asm-compat.h>
 #include <asm/processor.h>
 #include <asm/ppc-opcode.h>
+#include <asm/firmware.h>
 
 #ifndef __ASSEMBLY__
 #error __FILE__ should only be used in assembler files
@@ -26,17 +27,13 @@
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
 #define ACCOUNT_CPU_USER_ENTRY(ra, rb)
 #define ACCOUNT_CPU_USER_EXIT(ra, rb)
+#define ACCOUNT_STOLEN_TIME
 #else
 #define ACCOUNT_CPU_USER_ENTRY(ra, rb)					\
 	beq	2f;			/* if from kernel mode */	\
-BEGIN_FTR_SECTION;							\
-	mfspr	ra,SPRN_PURR;		/* get processor util. reg */	\
-END_FTR_SECTION_IFSET(CPU_FTR_PURR);					\
-BEGIN_FTR_SECTION;							\
-	MFTB(ra);			/* or get TB if no PURR */	\
-END_FTR_SECTION_IFCLR(CPU_FTR_PURR);					\
-	ld	rb,PACA_STARTPURR(r13);					\
-	std	ra,PACA_STARTPURR(r13);					\
+	MFTB(ra);			/* get timebase */		\
+	ld	rb,PACA_STARTTIME_USER(r13);				\
+	std	ra,PACA_STARTTIME(r13);					\
 	subf	rb,rb,ra;		/* subtract start value */	\
 	ld	ra,PACA_USER_TIME(r13);					\
 	add	ra,ra,rb;		/* add on to user time */	\
@@ -44,19 +41,34 @@ END_FTR_SECTION_IFCLR(CPU_FTR_PURR);					\
 2:
 
 #define ACCOUNT_CPU_USER_EXIT(ra, rb)					\
-BEGIN_FTR_SECTION;							\
-	mfspr	ra,SPRN_PURR;		/* get processor util. reg */	\
-END_FTR_SECTION_IFSET(CPU_FTR_PURR);					\
-BEGIN_FTR_SECTION;							\
-	MFTB(ra);			/* or get TB if no PURR */	\
-END_FTR_SECTION_IFCLR(CPU_FTR_PURR);					\
-	ld	rb,PACA_STARTPURR(r13);					\
-	std	ra,PACA_STARTPURR(r13);					\
+	MFTB(ra);			/* get timebase */		\
+	ld	rb,PACA_STARTTIME(r13);					\
+	std	ra,PACA_STARTTIME_USER(r13);				\
 	subf	rb,rb,ra;		/* subtract start value */	\
 	ld	ra,PACA_SYSTEM_TIME(r13);				\
-	add	ra,ra,rb;		/* add on to user time */	\
-	std	ra,PACA_SYSTEM_TIME(r13);
-#endif
+	add	ra,ra,rb;		/* add on to system time */	\
+	std	ra,PACA_SYSTEM_TIME(r13)
+
+#ifdef CONFIG_PPC_SPLPAR
+#define ACCOUNT_STOLEN_TIME						\
+BEGIN_FW_FTR_SECTION;							\
+	beq	33f;							\
+	/* from user - see if there are any DTL entries to process */	\
+	ld	r10,PACALPPACAPTR(r13);	/* get ptr to VPA */		\
+	ld	r11,PACA_DTL_RIDX(r13);	/* get log read index */	\
+	ld	r10,LPPACA_DTLIDX(r10);	/* get log write index */	\
+	cmpd	cr1,r11,r10;						\
+	beq+	cr1,33f;						\
+	bl	.accumulate_stolen_time;				\
+33:									\
+END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
+
+#else  /* CONFIG_PPC_SPLPAR */
+#define ACCOUNT_STOLEN_TIME
+
+#endif /* CONFIG_PPC_SPLPAR */
+
+#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
 
 /*
  * Macros for storing registers into and loading registers from
diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index dc779df..fe6f7c2 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -34,7 +34,6 @@ extern void to_tm(int tim, struct rtc_time * tm);
 extern void GregorianDay(struct rtc_time *tm);
 
 extern void generic_calibrate_decr(void);
-extern void snapshot_timebase(void);
 
 extern void set_dec_cpu6(unsigned int val);
 
@@ -212,12 +211,8 @@ struct cpu_usage {
 DECLARE_PER_CPU(struct cpu_usage, cpu_usage_array);
 
 #if defined(CONFIG_VIRT_CPU_ACCOUNTING)
-extern void calculate_steal_time(void);
-extern void snapshot_timebases(void);
 #define account_process_vtime(tsk)		account_process_tick(tsk, 0)
 #else
-#define calculate_steal_time()			do { } while (0)
-#define snapshot_timebases()			do { } while (0)
 #define account_process_vtime(tsk)		do { } while (0)
 #endif
 
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 1c0607d..c634940 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -181,17 +181,19 @@ int main(void)
 	       offsetof(struct slb_shadow, save_area[SLB_NUM_BOLTED - 1].vsid));
 	DEFINE(SLBSHADOW_STACKESID,
 	       offsetof(struct slb_shadow, save_area[SLB_NUM_BOLTED - 1].esid));
+	DEFINE(SLBSHADOW_SAVEAREA, offsetof(struct slb_shadow, save_area));
 	DEFINE(LPPACASRR0, offsetof(struct lppaca, saved_srr0));
 	DEFINE(LPPACASRR1, offsetof(struct lppaca, saved_srr1));
 	DEFINE(LPPACAANYINT, offsetof(struct lppaca, int_dword.any_int));
 	DEFINE(LPPACADECRINT, offsetof(struct lppaca, int_dword.fields.decr_int));
-	DEFINE(SLBSHADOW_SAVEAREA, offsetof(struct slb_shadow, save_area));
+	DEFINE(LPPACA_DTLIDX, offsetof(struct lppaca, dtl_idx));
+	DEFINE(PACA_DTL_RIDX, offsetof(struct paca_struct, dtl_ridx));
 #endif /* CONFIG_PPC_STD_MMU_64 */
 	DEFINE(PACAEMERGSP, offsetof(struct paca_struct, emergency_sp));
 	DEFINE(PACAHWCPUID, offsetof(struct paca_struct, hw_cpu_id));
 	DEFINE(PACAKEXECSTATE, offsetof(struct paca_struct, kexec_state));
-	DEFINE(PACA_STARTPURR, offsetof(struct paca_struct, startpurr));
-	DEFINE(PACA_STARTSPURR, offsetof(struct paca_struct, startspurr));
+	DEFINE(PACA_STARTTIME, offsetof(struct paca_struct, starttime));
+	DEFINE(PACA_STARTTIME_USER, offsetof(struct paca_struct, starttime_user));
 	DEFINE(PACA_USER_TIME, offsetof(struct paca_struct, user_time));
 	DEFINE(PACA_SYSTEM_TIME, offsetof(struct paca_struct, system_time));
 	DEFINE(PACA_TRAP_SAVE, offsetof(struct paca_struct, trap_save));
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 2f1a6be..eb02f24 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -97,6 +97,24 @@ system_call_common:
 	addi	r9,r1,STACK_FRAME_OVERHEAD
 	ld	r11,exception_marker@toc(r2)
 	std	r11,-16(r9)		/* "regshere" marker */
+#if defined(CONFIG_VIRT_CPU_ACCOUNTING) && defined(CONFIG_PPC_SPLPAR)
+BEGIN_FW_FTR_SECTION
+	beq	33f
+	/* if from user, see if there are any DTL entries to process */
+	ld	r10,PACALPPACAPTR(r13)	/* get ptr to VPA */
+	ld	r11,PACA_DTL_RIDX(r13)	/* get log read index */
+	ld	r10,LPPACA_DTLIDX(r10)	/* get log write index */
+	cmpd	cr1,r11,r10
+	beq+	cr1,33f
+	bl	.accumulate_stolen_time
+	REST_GPR(0,r1)
+	REST_4GPRS(3,r1)
+	REST_2GPRS(7,r1)
+	addi	r9,r1,STACK_FRAME_OVERHEAD
+33:
+END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
+#endif /* CONFIG_VIRT_CPU_ACCOUNTING && CONFIG_PPC_SPLPAR */
+
 #ifdef CONFIG_TRACE_IRQFLAGS
 	bl	.trace_hardirqs_on
 	REST_GPR(0,r1)
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index feacfb7..806fa8a 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -517,7 +517,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
 
 	account_system_vtime(current);
 	account_process_vtime(current);
-	calculate_steal_time();
 
 	/*
 	 * We can't take a PMU exception inside _switch() since there is a
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index a61b3dd..af31c3f 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -508,9 +508,6 @@ int __devinit start_secondary(void *unused)
 	if (smp_ops->take_timebase)
 		smp_ops->take_timebase();
 
-	if (system_state > SYSTEM_BOOTING)
-		snapshot_timebase();
-
 	secondary_cpu_time_init();
 
 	ipi_call_lock();
@@ -575,8 +572,6 @@ void __init smp_cpus_done(unsigned int max_cpus)
 
 	free_cpumask_var(old_mask);
 
-	snapshot_timebases();
-
 	dump_numa_cpu_topology();
 }
 
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 8533b3b..fca2064 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -164,8 +164,6 @@ unsigned long ppc_proc_freq;
 EXPORT_SYMBOL(ppc_proc_freq);
 unsigned long ppc_tb_freq;
 
-static DEFINE_PER_CPU(u64, last_jiffy);
-
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING
 /*
  * Factors for converting from cputime_t (timebase ticks) to
@@ -200,62 +198,151 @@ static void calc_cputime_factors(void)
 }
 
 /*
- * Read the PURR on systems that have it, otherwise the timebase.
+ * Read the SPURR on systems that have it, otherwise the PURR,
+ * or if that doesn't exist return the timebase value passed in.
  */
-static u64 read_purr(void)
+static u64 read_spurr(u64 tb)
 {
+	if (cpu_has_feature(CPU_FTR_SPURR))
+		return mfspr(SPRN_SPURR);
 	if (cpu_has_feature(CPU_FTR_PURR))
 		return mfspr(SPRN_PURR);
-	return mftb();
+	return tb;
 }
 
+#ifdef CONFIG_PPC_SPLPAR
+
 /*
- * Read the SPURR on systems that have it, otherwise the purr
+ * Scan the dispatch trace log and count up the stolen time.
+ * Should be called with interrupts disabled.
  */
-static u64 read_spurr(u64 purr)
+static u64 scan_dispatch_log(u64 stop_tb)
 {
-	/*
-	 * cpus without PURR won't have a SPURR
-	 * We already know the former when we use this, so tell gcc
-	 */
-	if (cpu_has_feature(CPU_FTR_PURR) && cpu_has_feature(CPU_FTR_SPURR))
-		return mfspr(SPRN_SPURR);
-	return purr;
+	unsigned long i = local_paca->dtl_ridx;
+	struct dtl_entry *dtl = local_paca->dtl_curr;
+	struct dtl_entry *dtl_end = local_paca->dispatch_log_end;
+	struct lppaca *vpa = local_paca->lppaca_ptr;
+	u64 tb_delta;
+	u64 stolen = 0;
+	u64 dtb;
+
+	if (i == vpa->dtl_idx)
+		return 0;
+	while (i < vpa->dtl_idx) {
+		dtb = dtl->timebase;
+		tb_delta = dtl->enqueue_to_dispatch_time +
+			dtl->ready_to_enqueue_time;
+		barrier();
+		if (i + N_DISPATCH_LOG < vpa->dtl_idx) {
+			/* buffer has overflowed */
+			i = vpa->dtl_idx - N_DISPATCH_LOG;
+			dtl = local_paca->dispatch_log + (i % N_DISPATCH_LOG);
+			continue;
+		}
+		if (dtb > stop_tb)
+			break;
+		stolen += tb_delta;
+		++i;
+		++dtl;
+		if (dtl == dtl_end)
+			dtl = local_paca->dispatch_log;
+	}
+	local_paca->dtl_ridx = i;
+	local_paca->dtl_curr = dtl;
+	return stolen;
 }
 
 /*
+ * Accumulate stolen time by scanning the dispatch trace log.
+ * Called on entry from user mode.
+ */
+void accumulate_stolen_time(void)
+{
+	u64 sst, ust;
+
+	sst = scan_dispatch_log(get_paca()->starttime_user);
+	ust = scan_dispatch_log(get_paca()->starttime);
+	get_paca()->system_time -= sst;
+	get_paca()->user_time -= ust;
+	get_paca()->stolen_time += ust + sst;
+}
+
+static inline u64 calculate_stolen_time(u64 stop_tb)
+{
+	u64 stolen = 0;
+
+	if (get_paca()->dtl_ridx != get_paca()->lppaca_ptr->dtl_idx) {
+		stolen = scan_dispatch_log(stop_tb);
+		get_paca()->system_time -= stolen;
+	}
+
+	stolen += get_paca()->stolen_time;
+	get_paca()->stolen_time = 0;
+	return stolen;
+}
+
+#else /* CONFIG_PPC_SPLPAR */
+static inline u64 calculate_stolen_time(u64 stop_tb)
+{
+	return 0;
+}
+
+#endif /* CONFIG_PPC_SPLPAR */
+
+/*
  * Account time for a transition between system, hard irq
  * or soft irq state.
  */
 void account_system_vtime(struct task_struct *tsk)
 {
-	u64 now, nowscaled, delta, deltascaled, sys_time;
+	u64 now, nowscaled, delta, deltascaled;
 	unsigned long flags;
+	u64 stolen, udelta, sys_scaled, user_scaled;
 
 	local_irq_save(flags);
-	now = read_purr();
+	now = mftb();
 	nowscaled = read_spurr(now);
-	delta = now - get_paca()->startpurr;
+	get_paca()->system_time += now - get_paca()->starttime;
+	get_paca()->starttime = now;
 	deltascaled = nowscaled - get_paca()->startspurr;
-	get_paca()->startpurr = now;
 	get_paca()->startspurr = nowscaled;
-	if (!in_interrupt()) {
-		/* deltascaled includes both user and system time.
-		 * Hence scale it based on the purr ratio to estimate
-		 * the system time */
-		sys_time = get_paca()->system_time;
-		if (get_paca()->user_time)
-			deltascaled = deltascaled * sys_time /
-			     (sys_time + get_paca()->user_time);
-		delta += sys_time;
-		get_paca()->system_time = 0;
+
+	stolen = calculate_stolen_time(now);
+
+	delta = get_paca()->system_time;
+	get_paca()->system_time = 0;
+	udelta = get_paca()->user_time - get_paca()->utime_sspurr;
+	get_paca()->utime_sspurr = get_paca()->user_time;
+
+	/*
+	 * Because we don't read the SPURR on every kernel entry/exit,
+	 * deltascaled includes both user and system SPURR ticks.
+	 * Apportion these ticks to system SPURR ticks and user
+	 * SPURR ticks in the same ratio as the system time (delta)
+	 * and user time (udelta) values obtained from the timebase
+	 * over the same interval.  The system ticks get accounted here;
+	 * the user ticks get saved up in paca->user_time_scaled to be
+	 * used by account_process_tick.
+	 */
+	sys_scaled = delta;
+	user_scaled = udelta;
+	if (deltascaled != delta + udelta) {
+		if (udelta) {
+			sys_scaled = deltascaled * delta / (delta + udelta);
+			user_scaled = deltascaled - sys_scaled;
+		} else {
+			sys_scaled = deltascaled;
+		}
+	}
+	get_paca()->user_time_scaled += user_scaled;
+
+	if (in_irq() || idle_task(smp_processor_id()) != tsk) {
+		account_system_time(tsk, 0, delta, sys_scaled);
+		if (stolen)
+			account_steal_time(stolen);
+	} else {
+		account_idle_time(delta + stolen);
 	}
-	if (in_irq() || idle_task(smp_processor_id()) != tsk)
-		account_system_time(tsk, 0, delta, deltascaled);
-	else
-		account_idle_time(delta);
-	__get_cpu_var(cputime_last_delta) = delta;
-	__get_cpu_var(cputime_scaled_last_delta) = deltascaled;
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(account_system_vtime);
@@ -265,125 +352,26 @@ EXPORT_SYMBOL_GPL(account_system_vtime);
  * by the exception entry and exit code to the generic process
  * user and system time records.
  * Must be called with interrupts disabled.
+ * Assumes that account_system_vtime() has been called recently
+ * (i.e. since the last entry from usermode) so that
+ * get_paca()->user_time_scaled is up to date.
  */
 void account_process_tick(struct task_struct *tsk, int user_tick)
 {
 	cputime_t utime, utimescaled;
 
 	utime = get_paca()->user_time;
+	utimescaled = get_paca()->user_time_scaled;
 	get_paca()->user_time = 0;
-	utimescaled = cputime_to_scaled(utime);
+	get_paca()->user_time_scaled = 0;
+	get_paca()->utime_sspurr = 0;
 	account_user_time(tsk, utime, utimescaled);
 }
 
-/*
- * Stuff for accounting stolen time.
- */
-struct cpu_purr_data {
-	int	initialized;			/* thread is running */
-	u64	tb;			/* last TB value read */
-	u64	purr;			/* last PURR value read */
-	u64	spurr;			/* last SPURR value read */
-};
-
-/*
- * Each entry in the cpu_purr_data array is manipulated only by its
- * "owner" cpu -- usually in the timer interrupt but also occasionally
- * in process context for cpu online.  As long as cpus do not touch
- * each others' cpu_purr_data, disabling local interrupts is
- * sufficient to serialize accesses.
- */
-static DEFINE_PER_CPU(struct cpu_purr_data, cpu_purr_data);
-
-static void snapshot_tb_and_purr(void *data)
-{
-	unsigned long flags;
-	struct cpu_purr_data *p = &__get_cpu_var(cpu_purr_data);
-
-	local_irq_save(flags);
-	p->tb = get_tb_or_rtc();
-	p->purr = mfspr(SPRN_PURR);
-	wmb();
-	p->initialized = 1;
-	local_irq_restore(flags);
-}
-
-/*
- * Called during boot when all cpus have come up.
- */
-void snapshot_timebases(void)
-{
-	if (!cpu_has_feature(CPU_FTR_PURR))
-		return;
-	on_each_cpu(snapshot_tb_and_purr, NULL, 1);
-}
-
-/*
- * Must be called with interrupts disabled.
- */
-void calculate_steal_time(void)
-{
-	u64 tb, purr;
-	s64 stolen;
-	struct cpu_purr_data *pme;
-
-	pme = &__get_cpu_var(cpu_purr_data);
-	if (!pme->initialized)
-		return;		/* !CPU_FTR_PURR or early in early boot */
-	tb = mftb();
-	purr = mfspr(SPRN_PURR);
-	stolen = (tb - pme->tb) - (purr - pme->purr);
-	if (stolen > 0) {
-		if (idle_task(smp_processor_id()) != current)
-			account_steal_time(stolen);
-		else
-			account_idle_time(stolen);
-	}
-	pme->tb = tb;
-	pme->purr = purr;
-}
-
-#ifdef CONFIG_PPC_SPLPAR
-/*
- * Must be called before the cpu is added to the online map when
- * a cpu is being brought up at runtime.
- */
-static void snapshot_purr(void)
-{
-	struct cpu_purr_data *pme;
-	unsigned long flags;
-
-	if (!cpu_has_feature(CPU_FTR_PURR))
-		return;
-	local_irq_save(flags);
-	pme = &__get_cpu_var(cpu_purr_data);
-	pme->tb = mftb();
-	pme->purr = mfspr(SPRN_PURR);
-	pme->initialized = 1;
-	local_irq_restore(flags);
-}
-
-#endif /* CONFIG_PPC_SPLPAR */
-
 #else /* ! CONFIG_VIRT_CPU_ACCOUNTING */
 #define calc_cputime_factors()
-#define calculate_steal_time()		do { } while (0)
 #endif
 
-#if !(defined(CONFIG_VIRT_CPU_ACCOUNTING) && defined(CONFIG_PPC_SPLPAR))
-#define snapshot_purr()			do { } while (0)
-#endif
-
-/*
- * Called when a cpu comes up after the system has finished booting,
- * i.e. as a result of a hotplug cpu action.
- */
-void snapshot_timebase(void)
-{
-	__get_cpu_var(last_jiffy) = get_tb_or_rtc();
-	snapshot_purr();
-}
-
 void __delay(unsigned long loops)
 {
 	unsigned long start;
@@ -585,8 +573,6 @@ void timer_interrupt(struct pt_regs * regs)
 	old_regs = set_irq_regs(regs);
 	irq_enter();
 
-	calculate_steal_time();
-
 	if (test_perf_event_pending()) {
 		clear_perf_event_pending();
 		perf_event_do_pending();
diff --git a/arch/powerpc/platforms/pseries/dtl.c b/arch/powerpc/platforms/pseries/dtl.c
index adfd544..0357655 100644
--- a/arch/powerpc/platforms/pseries/dtl.c
+++ b/arch/powerpc/platforms/pseries/dtl.c
@@ -27,27 +27,10 @@
 #include <asm/system.h>
 #include <asm/uaccess.h>
 #include <asm/firmware.h>
+#include <asm/lppaca.h>
 
 #include "plpar_wrappers.h"
 
-/*
- * Layout of entries in the hypervisor's DTL buffer. Although we don't
- * actually access the internals of an entry (we only need to know the size),
- * we might as well define it here for reference.
- */
-struct dtl_entry {
-	u8	dispatch_reason;
-	u8	preempt_reason;
-	u16	processor_id;
-	u32	enqueue_to_dispatch_time;
-	u32	ready_to_enqueue_time;
-	u32	waiting_to_ready_time;
-	u64	timebase;
-	u64	fault_addr;
-	u64	srr0;
-	u64	srr1;
-};
-
 struct dtl {
 	struct dtl_entry	*buf;
 	struct dentry		*file;
@@ -237,6 +220,11 @@ static int dtl_init(void)
 	struct dentry *event_mask_file, *buf_entries_file;
 	int rc, i;
 
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+	/* disable this for now */
+	return -ENODEV;
+#endif
+
 	if (!firmware_has_feature(FW_FEATURE_SPLPAR))
 		return -ENODEV;
 
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index a17fe4a..f129040 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -248,6 +248,8 @@ void vpa_init(int cpu)
 	int hwcpu = get_hard_smp_processor_id(cpu);
 	unsigned long addr;
 	long ret;
+	struct paca_struct *pp;
+	struct dtl_entry *dtl;
 
 	if (cpu_has_feature(CPU_FTR_ALTIVEC))
 		lppaca_of(cpu).vmxregs_in_use = 1;
@@ -274,6 +276,25 @@ void vpa_init(int cpu)
 			       "registration for cpu %d (hw %d) of area %lx "
 			       "returns %ld\n", cpu, hwcpu, addr, ret);
 	}
+
+	/*
+	 * Register dispatch trace log, if one has been allocated.
+	 */
+	pp = &paca[cpu];
+	dtl = pp->dispatch_log;
+	if (dtl) {
+		pp->dtl_ridx = 0;
+		pp->dtl_curr = dtl;
+		lppaca_of(cpu).dtl_idx = 0;
+
+		/* hypervisor reads buffer length from this field */
+		dtl->enqueue_to_dispatch_time = DISPATCH_LOG_BYTES;
+		ret = register_dtl(hwcpu, __pa(dtl));
+		if (ret)
+			pr_warn("DTL registration failed for cpu %d (%ld)\n",
+				cpu, ret);
+		lppaca_of(cpu).dtl_enable_mask = 2;
+	}
 }
 
 static long pSeries_lpar_hpte_insert(unsigned long hpte_group,
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index a6d19e3..d345bfd 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -273,6 +273,58 @@ static struct notifier_block pci_dn_reconfig_nb = {
 	.notifier_call = pci_dn_reconfig_notifier,
 };
 
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+/*
+ * Allocate space for the dispatch trace log for all possible cpus
+ * and register the buffers with the hypervisor.  This is used for
+ * computing time stolen by the hypervisor.
+ */
+static int alloc_dispatch_logs(void)
+{
+	int cpu, ret;
+	struct paca_struct *pp;
+	struct dtl_entry *dtl;
+
+	if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+		return 0;
+
+	for_each_possible_cpu(cpu) {
+		pp = &paca[cpu];
+		dtl = kmalloc_node(DISPATCH_LOG_BYTES, GFP_KERNEL,
+				   cpu_to_node(cpu));
+		if (!dtl) {
+			pr_warn("Failed to allocate dispatch trace log for cpu %d\n",
+				cpu);
+			pr_warn("Stolen time statistics will be unreliable\n");
+			break;
+		}
+
+		pp->dtl_ridx = 0;
+		pp->dispatch_log = dtl;
+		pp->dispatch_log_end = dtl + N_DISPATCH_LOG;
+		pp->dtl_curr = dtl;
+	}
+
+	/* Register the DTL for the current (boot) cpu */
+	dtl = get_paca()->dispatch_log;
+	get_paca()->dtl_ridx = 0;
+	get_paca()->dtl_curr = dtl;
+	get_paca()->lppaca_ptr->dtl_idx = 0;
+
+	/* hypervisor reads buffer length from this field */
+	dtl->enqueue_to_dispatch_time = DISPATCH_LOG_BYTES;
+	ret = register_dtl(hard_smp_processor_id(), __pa(dtl));
+	if (ret)
+		pr_warn("DTL registration failed for boot cpu %d (%d)\n",
+			smp_processor_id(), ret);
+	get_paca()->lppaca_ptr->dtl_enable_mask = 2;
+
+	return 0;
+}
+
+early_initcall(alloc_dispatch_logs);
+#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
+
 static void __init pSeries_setup_arch(void)
 {
 	/* Discover PIC type and setup ppc_md accordingly */
-- 
1.7.1
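
As a reading aid for the scan_dispatch_log() hunk above, the following is a minimal userspace sketch of the same ring-buffer scan. It is not the kernel code: struct log_entry, struct log_reader, N_LOG, scan_log() and log_append() are hypothetical names invented for illustration, the ring is tiny on purpose, and there is no concurrent producer, so the barrier() used in the patch is only mentioned in a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_LOG 4			/* hypothetical ring size, tiny on purpose */

struct log_entry {
	uint64_t timebase;	/* when the dispatch happened */
	uint32_t wait_ticks;	/* time spent waiting before the dispatch */
};

struct log_reader {
	struct log_entry ring[N_LOG];
	uint64_t write_idx;	/* producer index, counts up without wrapping */
	uint64_t read_idx;	/* consumer index, counts up without wrapping */
};

/* Producer side of the model: store an entry and advance the write index. */
static void log_append(struct log_reader *r, uint64_t tb, uint32_t wait)
{
	assert(r != NULL);
	r->ring[r->write_idx % N_LOG].timebase = tb;
	r->ring[r->write_idx % N_LOG].wait_ticks = wait;
	r->write_idx++;
}

/* Sum wait_ticks of unread entries stamped at or before stop_tb. */
static uint64_t scan_log(struct log_reader *r, uint64_t stop_tb)
{
	uint64_t i = r->read_idx;
	uint64_t stolen = 0;

	while (i < r->write_idx) {
		/* Read first; the patch places a barrier() after these loads. */
		uint64_t tb = r->ring[i % N_LOG].timebase;
		uint64_t wait = r->ring[i % N_LOG].wait_ticks;

		if (i + N_LOG < r->write_idx) {
			/* Producer lapped us: restart at the oldest intact entry. */
			i = r->write_idx - N_LOG;
			continue;
		}
		if (tb > stop_tb)
			break;	/* entry belongs to a later interval */
		stolen += wait;
		++i;
	}
	r->read_idx = i;
	return stolen;
}
```

Calling scan_log() twice with increasing stop_tb values splits the unread entries between two intervals, which is the pattern accumulate_stolen_time() relies on in the patch. After an overflow only the newest N_LOG entries are counted; the overwritten ones are lost in this model.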

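The comment in account_system_vtime() above describes splitting the SPURR delta between system and user time in the same ratio as the timebase deltas. A small standalone sketch of that arithmetic follows. The struct scaled_split type and the apportion_scaled() name are hypothetical, and the sketch assumes the intermediate product fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

struct scaled_split {
	uint64_t sys_scaled;	/* scaled ticks attributed to system time */
	uint64_t user_scaled;	/* scaled ticks attributed to user time */
};

/*
 * Split deltascaled between system and user in the ratio delta:udelta.
 * Mirrors the branch structure of the patch; not the kernel function.
 */
static struct scaled_split apportion_scaled(uint64_t deltascaled,
					    uint64_t delta, uint64_t udelta)
{
	struct scaled_split s = { delta, udelta };

	/* Assumption of this sketch: the sum does not wrap around. */
	assert(delta + udelta >= delta);

	if (deltascaled != delta + udelta) {
		if (udelta) {
			s.sys_scaled = deltascaled * delta / (delta + udelta);
			s.user_scaled = deltascaled - s.sys_scaled;
		} else {
			/* No user time in the interval: all of it is system. */
			s.sys_scaled = deltascaled;
		}
	}
	return s;
}
```

For example, with 600 system ticks and 400 user ticks on the timebase but only 800 scaled ticks over the same interval, the split is 480 system and 320 user. Computing the user share by subtraction keeps the two parts summing exactly to deltascaled despite the integer division.
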
^ permalink raw reply related

* Re: Power machines fail to boot after build being successful
From: Stephen Rothwell @ 2010-08-27  2:01 UTC (permalink / raw)
  To: Michael Neuling; +Cc: linuxppc-dev, linux-next, LKML, divya
In-Reply-To: <2311.1282871746@neuling.org>

[-- Attachment #1: Type: text/plain, Size: 725 bytes --]

Hi Mikey,

On Fri, 27 Aug 2010 11:15:46 +1000 Michael Neuling <mikey@neuling.org> wrote:
>
> > After successfully building the kernel version
> > 2.6.36-rc2-git4(commitid d4348c678977c) with the config file
> > attached(used make oldconfig), P5 and P6 power machines fails to
> > reboot with the following logs
> > 
> > Logs collected while rebooting into today next, same occurs with the
> > upstream kernel too.
> 
> This will fix your problem:
> http://patchwork.ozlabs.org/patch/62757/ 

I have seen something similar in my linux-next boot tests, so I will add
this patch to linux-next for today.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply

* Re: [PATCH 1/5] ptp: Added a brand new class driver for ptp clocks.
From: john stultz @ 2010-08-27  1:57 UTC (permalink / raw)
  To: Christian Riesch
  Cc: Rodolfo Giometti, Arnd Bergmann, netdev, devicetree-discuss,
	linux-kernel, linuxppc-dev, Richard Cochran, linux-arm-kernel,
	Krzysztof Halasa
In-Reply-To: <AANLkTi=nKrRHH6+9mvny718gov8Q+u7DHFikdgNY8YdX@mail.gmail.com>

On Wed, 2010-08-25 at 11:40 +0200, Christian Riesch wrote:
> What you describe here is only one of the use cases. If the hardware
> has a single network port and operates as a PTP slave, it timestamps
> the PTP packets that are sent and received and subsequently uses these
> timestamps and the information it received from the master in the
> packets to steer its clock to align it with the master clock. In such
> a case the timestamping hardware and the clock hardware work together
> closely and it seems to be okay to use the same interface to control
> both the timestamping and the PTP clock.
> 
> But we have to consider other use cases, e.g.,
> 
> 1) Boundary clocks:
> We have more than one network port. One port operates as a slave
> clock, our system gets time information via this port and steers its
> PTP clock to align with the master clock. The other network ports of
> our system operate as master clocks and redistribute the time
> information we got from the master to other clocks on these networks.
> In such a case we do timestamping on each of the network ports, but we
> only have a single PTP clock. Each network port's timestamping
> hardware uses the same hardware clock to generate time stamps.
> 
> 2) Master clock:
> We have one or more network ports. Our system has a really good clock
> (ovenized quartz crystal, an atomic clock, a GPS timing receiver...)
> and it distributes this time on the network. In such a case we do not
> steer our clock based on the (packet) timestamps we get from our
> timestamping unit. Instead, we directly drive our clock hardware with
> a very stable frequency that we get from the OCXO or the atomic
> clock... 

Ok. Following you here...

> or we use one of the ancillary features of the PTP clock that
> Richard mentioned to timestamp not network packets but a 1pps signal
> and use these timestamps to steer the clock. 

Wait.. I thought we weren't using PTP to steer the clock? But now we're
using the pps signal from it to do so? Do I misunderstand you? Or did
you just not mean this?

> Packet time stamping is
> used to distribute the time to the slaves, but it is not part of the
> control loop in this case.

I assume here you mean PTPd is steering the PTP clock according to the
system time (which is NTP/GPS/whatever sourced)? And then the PTP clock
distributes that time through the network?

> So in the first case we have one PTP clock but several network packet
> timestamping units, whereas in the second case the packet timestamping
> is done but it is not part of the control loop that steers the clock.
> Of course in most hardware implementations both the PTP clock and the
> timestamping unit sit on the same chip and often use the same means of
> communication to the cpu, e.g., the MDIO bus, but I think we need some
> logical separation here.


So first of all, thanks for the extra explanation and context here! I
really appreciate it, as I'm not familiar with all the hardware details
and possible use cases, but I'm trying to learn.

So in the two cases you mention, the time "flow" is something like:

#1) [Master Clock on Network1] => [PTP Clock] => [PTPd] =>
	[PTP Clock] => [PTP Clients on Network2]

#2) [GPS] => [NTPd] => [System Time] => [PTPd] => [PTP clock] =>
	[PTP clients on Network]

And the original case:
#3) [Master Clock on Network] => [PTP clock] => [PTPd] => [PTP clock]

With a secondary control flow:
	[PPS signal from PTP clock] => [NTPd] => [System Time]


Right?


So, just brainstorming here, I guess the question I'm trying to figure
out is: can the "System Time" and "PTP clock" be merged/globbed
into a single "Time" interface from the userspace point of view?

In other words, if internal to the kernel, the PTP clock was always
synced to the system time, couldn't the flow look something like:

#3') [Master clock on network] => [PTP clock] => [PTPd] =>
	 [System Time] => [in-kernel sync thread] => [PTP clock]

So PTPd sees the offset adjustment from the PTP clock, and then feeds
that offset correction right into (a possibly enhanced) adjtimex. The
kernel would then immediately steer the PTP clock by the same amount to
keep it in sync with system time (along with a periodic offset/freq
correction step to deal with crystal drift).

Similarly:

#2') [GPS] => [NTPd] => [System Time] => [in-kernel sync thread] => 
		[PTP clock] => [PTP clients on Network]

and 

#1') [Master Clock on Network1] => [PTP Clock] => [PTPd] =>
	[System Time] => [in-kernel sync thread] => [PTP Clock] => 
	[PTP Clients on Network2]

Now, I realize PTP purists probably won't like this, because it
effectively makes the in-kernel sync thread similar to a PTP boundary
clock (or worse, since the control loop isn't exactly direct).

But consider that the kernel (internally) allows for *very*
fine-grained adjustments: we keep our long-term offset error in
(nanoseconds << 32), i.e. in units of roughly a quarter of a *billion*th
of a nanosecond (I think that's sub-attosecond, if I recall the unit).
Even the existing external adjtimex interface allows for adjustments in
units of (ppm << 16), which is a granularity of ~15 parts-per-trillion
(assuming I'm doing the math right).

These are all much finer than the parts-per-billion adjustment
granularity proposed for the direct PTP clock steering, so I suspect any
error caused by the indirection in the control flow could be minimized
significantly.
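
To make the arithmetic above easy to check, here is a small C sketch that takes the two fixed-point formats mentioned (nanoseconds with 32 fractional bits, ppm with 16 fractional bits) at face value and computes their smallest step. The function names are hypothetical, and ppb_to_scaled_ppm() is only an illustration of how a parts-per-billion value would map onto the finer ppm-with-16-bit-fraction unit; it is not an existing kernel or libc helper.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest step, in seconds, of a value stored as (nanoseconds << shift). */
static double shifted_ns_step_seconds(unsigned int shift)
{
	assert(shift < 64);
	return 1e-9 / (double)(UINT64_C(1) << shift);
}

/* Smallest step, as a plain frequency ratio, of a value stored as (ppm << shift). */
static double shifted_ppm_step_ratio(unsigned int shift)
{
	assert(shift < 64);
	return 1e-6 / (double)(UINT64_C(1) << shift);
}

/*
 * Hypothetical helper: convert parts-per-billion to ppm with a 16-bit
 * binary fraction. Integer division truncates toward zero.
 */
static int64_t ppb_to_scaled_ppm(int64_t ppb)
{
	return ppb * 65536 / 1000;
}
```

With shift = 32 the nanosecond step comes out at about 0.23 attoseconds, and with shift = 16 the ppm step is about 15.3 parts per trillion, so a single ppb corresponds to roughly 65 of the finer units.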

Additionally my suggestion here has the benefit of:
A: Avoiding the fragmented time domains (ie CLOCK_REALTIME vs CLOCK_PTP)
caused by adding a new clock_id.

B: Avoiding the indirect system time sync through the PPS interface,
which isn't completely terrible, but just feels a little ugly
configuration-wise from a user's perspective.

I'm sure I still have lots to learn about PTP, so please let me know
where I'm off-base.

thanks
-john

^ permalink raw reply

* Re: Power machines fail to boot after build being successful
From: Michael Neuling @ 2010-08-27  1:15 UTC (permalink / raw)
  To: divya; +Cc: LKML, linuxppc-dev
In-Reply-To: <4C764F52.5050707@linux.vnet.ibm.com>

> After successfully building the kernel version
> 2.6.36-rc2-git4(commitid d4348c678977c) with the config file
> attached(used make oldconfig), P5 and P6 power machines fails to
> reboot with the following logs
> 
> Logs collected while rebooting into today next, same occurs with the
> upstream kernel too.

This will fix your problem:
http://patchwork.ozlabs.org/patch/62757/ 

Please CC linuxppc-dev@ozlabs.org for powerpc bug reports.

Mikey


> Please wait, loading kernel...
> Allocated 01a00000 bytes for kernel @ 03500000
>     Elf64 kernel loaded...
> Loading ramdisk...
> ramdisk loaded 00841ae7 @ 04f00000
> OF stdout device is: /vdevice/vty@30000000
> Preparing to boot Linux version 2.6.36-rc2-autotest-next-20100826 (root@llm62) (gcc version 4.3.2 [gcc-4_3-branch revision 141291] (SUSE Linux) ) #1 SMP Thu Aug 26 10:55:01 IST 2010
> Max number of cores passed to firmware: 512 (NR_CPUS = 1024)
> Calling ibm,client-architecture-support... done
> command line: root=/dev/sda5 IDENT=1282801372 xmon=early
> memory layout at init:
>    memory_limit : 0000000000000000 (16 MB aligned)
>    alloc_bottom : 0000000005750000
>    alloc_top    : 0000000008000000
>    alloc_top_hi : 0000000008000000
>    rmo_top      : 0000000008000000
>    ram_top      : 0000000008000000
> instantiating rtas at 0x00000000074e0000... done
> boot cpu hw idx 0
> starting cpu hw idx 2... done
> copying OF device tree...
> Building dt strings...
> Building dt structure...
> Device tree strings 0x0000000005760000 ->  0x00000000057615fa
> Device tree struct  0x0000000005770000 ->  0x0000000005790000
> Calling quiesce...
> returning from prom_init
> Using pSeries machine description
> Using 1TB segments
> Found initrd at 0xc000000004f00000:0xc000000005741ae7
> Partition configured for 4 cpus.
> CPU maps initialized for 2 threads per core
> Starting Linux PPC64 #1 SMP Thu Aug 26 10:55:01 IST 2010
> -----------------------------------------------------
> ppc64_pft_size                = 0x1a
> physicalMemorySize            = 0x100000000
> htab_hash_mask                = 0x7ffff
> -----------------------------------------------------
> Initializing cgroup subsys cpuset
> Initializing cgroup subsys cpu
> Linux version 2.6.36-rc2-autotest-next-20100826 (root@llm62) (gcc version 4.3.2 [gcc-4_3-branch revision 141291] (SUSE Linux) ) #1 SMP Thu Aug 26 10:55:01 IST 2010
> [boot]0012 Setup Arch
> EEH: No capable adapters found
> PPC64 nvram contains 15360 bytes
> Zone PFN ranges:
>    DMA      0x00000000 ->  0x00010000
>    Normal   empty
> Movable zone start PFN for each node
> early_node_map[1] active PFN ranges
>      1: 0x00000000 ->  0x00010000
> Could not find start_pfn for node 0
> [boot]0015 Setup Done
> PERCPU: Embedded 29 pages/cpu @c000000002100000 s1859840 r0 d40704 u2097152
> pcpu-alloc: s1859840 r0 d40704 u2097152 alloc=2*1048576
> pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3
> Built 2 zonelists in Node order, mobility grouping on.  Total pages: 65480
> Policy zone: DMA
> Kernel command line: root=/dev/sda5 IDENT=1282801372 xmon=early
> PID hash table entries: 4096 (order: -1, 32768 bytes)
> freeing bootmem node 1
> Memory: 4124352k/4194304k available (10880k kernel code, 69952k reserved, 2880k data, 10947k bss, 2432k init)
> Hierarchical RCU implementation.
>          Verbose stalled-CPUs detection is disabled.
> NR_IRQS:512
> [boot]0020 XICS Init
> [boot]0021 XICS Done
> clocksource: timebase mult[7d0000] shift[22] registered
> Console: colour dummy device 80x25
> console [hvc0] enabled, bootconsole disabled
> console [hvc0] enabled, bootconsole disabled
> Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
> ... MAX_LOCKDEP_SUBCLASSES:  8
> ... MAX_LOCK_DEPTH:          48
> ... MAX_LOCKDEP_KEYS:        8191
> ... CLASSHASH_SIZE:          4096
> ... MAX_LOCKDEP_ENTRIES:     16384
> ... MAX_LOCKDEP_CHAINS:      32768
> ... CHAINHASH_SIZE:          16384
>   memory used by lock dependency info: 6335 kB
>   per task-struct memory footprint: 2688 bytes
> allocated 2621440 bytes of page_cgroup
> please try 'cgroup_disable=memory' option if you don't want memory cgroups
> pid_max: default: 32768 minimum: 301
> Security Framework initialized
> SELinux:  Disabled at boot.
> AppArmor: AppArmor disabled by boot time parameter
> Dentry cache hash table entries: 524288 (order: 6, 4194304 bytes)
> Inode-cache hash table entries: 262144 (order: 5, 2097152 bytes)
> Mount-cache hash table entries: 4096
> Initializing cgroup subsys ns
> Initializing cgroup subsys cpuacct
> Initializing cgroup subsys memory
> Initializing cgroup subsys devices
> Initializing cgroup subsys freezer
> cpu 0x1: Vector: 300 (Data Access) at [c0000000fedcbd10]
>      pc: 0000000000017f00
>      lr: 0000000000008290
>      sp: c0000000fedcbf90
>     msr: 8000000000001000
>     dar: c0000000fedcbfa0
>   dsisr: 42000000
>    current = 0xc0000000fedb4e70
>    paca    = 0xc000000007440280
>      pid   = 0, comm = swapper
> WARNING: exception is not recoverable, can't continue
> enter ? for help
> 1:mon>  Unable to handle kernel paging request for data at address 0xc0000000fedcbfa0
> Faulting instruction address: 0x00017f00
> Processor 1 is stuck.
> cpu 0x2: Vector: 300 (Data Access) at [c0000000fedcfd10]
>      pc: 0000000000017f00
>      lr: 0000000000008290
>      sp: c0000000fedcff90
>     msr: 8000000000001002
>     dar: c0000000fedcffa0
>   dsisr: 42000000
>    current = 0xc0000000fedb7580
>    paca    = 0xc000000007440500
>      pid   = 0, comm = swapper
> Unable to handle kernel paging request for data at address 0xc0000000fedcffa0
> Faulting instruction address: 0x00017f00
> Processor 2 is stuck.
> cpu 0x3: Vector: 300 (Data Access) at [c0000000fee33d10]
>      pc: 0000000000017f00
>      lr: 0000000000008290
>      sp: c0000000fee33f90
>     msr: 8000000000001000
>     dar: c0000000fee33fa0
>   dsisr: 42000000
>    current = 0xc0000000fedb9c90
>    paca    = 0xc000000007440780
>      pid   = 0, comm = swapper
> WARNING: exception is not recoverable, can't continue
> 
> Thanks
> Divya
> 
> 
> --------------060402090502050601040804
> Content-Type: text/plain;
>  name="config_ppc_slub"
> Content-Transfer-Encoding: 7bit
> Content-Disposition: attachment;
>  filename="config_ppc_slub"
> 
> CONFIG_ALTIVEC=y
> # CONFIG_WIRELESS is not set
> CONFIG_VSX=y
> CONFIG_TASKSTATS=y
> CONFIG_TASK_DELAY_ACCT=y
> CONFIG_TASK_XACCT=y
> CONFIG_TASK_IO_ACCOUNTING=y
> CONFIG_HAVE_KPROBES=y
> CONFIG_HAVE_KRETPROBES=y
> CONFIG_TREE_RCU=y
> CONFIG_RCU_FANOUT=64
> CONFIG_CGROUP_SCHED=y
> CONFIG_CGROUPS=y
> CONFIG_CGROUP_NS=y
> CONFIG_CGROUP_FREEZER=y
> CONFIG_CGROUP_DEVICE=y
> CONFIG_CPUSETS=y
> CONFIG_PROC_PID_CPUSET=y
> CONFIG_CGROUP_CPUACCT=y
> CONFIG_RESOURCE_COUNTERS=y
> CONFIG_CGROUP_MEM_RES_CTLR=y
> CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
> CONFIG_NAMESPACES=y
> CONFIG_UTS_NS=y
> CONFIG_IPC_NS=y
> CONFIG_USER_NS=y
> CONFIG_PID_NS=y
> CONFIG_NET_NS=y
> CONFIG_KALLSYMS=y
> CONFIG_KALLSYMS_ALL=y
> CONFIG_KPROBES=y
> CONFIG_CMM=y
> CONFIG_PPC_HAS_HASH_64K=y
> CONFIG_PPC_64K_PAGES=y
> CONFIG_FORCE_MAX_ZONEORDER=9
> CONFIG_PPC_SUBPAGE_PROT=y
> CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
> CONFIG_ARCH_HAS_WALK_MEMORY=y
> CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
> CONFIG_KEXEC=y
> CONFIG_MEMORY_HOTPLUG=y
> CONFIG_MEMORY_HOTPLUG_SPARSE=y
> CONFIG_MEMORY_HOTREMOVE=y
> CONFIG_IBMEBUS=y
> CONFIG_EHEA=y
> CONFIG_SCSI_IBMVSCSI=y
> CONFIG_SCSI_IBMVFC=m
> CONFIG_SCSI_IPR=y
> CONFIG_DEBUG_KERNEL=y
> CONFIG_DETECT_SOFTLOCKUP=y
> CONFIG_DETECT_HUNG_TASK=y
> CONFIG_DEBUG_SPINLOCK=y
> CONFIG_DEBUG_SPINLOCK_SLEEP=y
> CONFIG_DEBUG_BUGVERBOSE=y
> CONFIG_DEBUG_INFO=y
> CONFIG_KPROBES_SANITY_TEST=y
> CONFIG_LATENCYTOP=y
> CONFIG_SYSCTL_SYSCALL_CHECK=y
> CONFIG_DEBUG_PAGEALLOC=y
> CONFIG_NOP_TRACER=y
> CONFIG_HAVE_FUNCTION_TRACER=y
> CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
> CONFIG_HAVE_DYNAMIC_FTRACE=y
> CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
> CONFIG_RING_BUFFER=y
> CONFIG_TRACING=y
> CONFIG_TRACING_SUPPORT=y
> ONFIG_BLK_DEV_IO_TRACE=y
> CONFIG_FTRACE_SELFTEST=y
> CONFIG_FTRACE_STARTUP_TEST=y
> CONFIG_KEYS=y
> CONFIG_KEYS_DEBUG_PROC_KEYS=y
> CONFIG_SECURITY=y
> CONFIG_SECURITYFS=y
> CONFIG_SECURITY_NETWORK=y
> CONFIG_RCU_CPU_STALL_DETECTOR=y
> CONFIG_EXT2_FS=y
> CONFIG_EXT2_FS_XATTR=y
> CONFIG_EXT2_FS_POSIX_ACL=y
> CONFIG_EXT2_FS_SECURITY=y
> CONFIG_EXT3_FS=y
> CONFIG_EXT3_FS_XATTR=y
> CONFIG_EXT3_FS_POSIX_ACL=y
> CONFIG_EXT3_FS_SECURITY=y
> CONFIG_EXT4_FS=y
> CONFIG_EXT4DEV_COMPAT=y
> CONFIG_EXT4_FS_XATTR=y
> CONFIG_EXT4_FS_POSIX_ACL=y
> CONFIG_EXT4_FS_SECURITY=y
> CONFIG_JBD=y
> CONFIG_JBD2=y
> CONFIG_REISERFS_PROC_INFO=y
> CONFIG_REISERFS_FS_XATTR=y
> CONFIG_REISERFS_FS_POSIX_ACL=y
> CONFIG_REISERFS_FS_SECURITY=y
> CONFIG_JFS_FS=m
> CONFIG_JFS_POSIX_ACL=y
> CONFIG_JFS_SECURITY=y
> CONFIG_JFS_STATISTICS=y
> CONFIG_FS_POSIX_ACL=y
> CONFIG_FILE_LOCKING=y
> CONFIG_XFS_FS=m
> CONFIG_XFS_QUOTA=y
> CONFIG_XFS_POSIX_ACL=y
> CONFIG_XFS_RT=y
> CONFIG_GFS2_FS=m
> CONFIG_GFS2_FS_LOCKING_DLM=m
> CONFIG_OCFS2_FS=m
> CONFIG_OCFS2_FS_O2CB=m
> CONFIG_OCFS2_FS_USERSPACE_CLUSTER=m
> CONFIG_OCFS2_FS_STATS=y
> CONFIG_BTRFS_FS=y
> CONFIG_BTRFS_FS_POSIX_ACL=y
> CONFIG_LOCALVERSION_AUTO=n
> CONFIG_LOCALVERSION=""
> CONFIG_SLUB_DEBUG=y
> CONFIG_SLUB=y
> CONFIG_DEVPTS_MULTIPLE_INSTANCES=y
> CONFIG_HAVE_PERF_COUNTERS=y
> CONFIG_PERF_COUNTERS=y
> CONFIG_DEFAULT_MMAP_MIN_ADDR=65536
> CONFIG_FSNOTIFY=y
> CONFIG_RCU_TORTURE_TEST=y
> CONFIG_RCU_TORTURE_TEST_RUNNABLE=y
> CONFIG_RCU_CPU_STALL_DETECTOR=y
> CONFIG_PERF_EVENTS=y
> CONFIG_PPC_OF_BOOT_TRAMPOLINE=y
> CONFIG_BLOCK=y
> CONFIG_BLK_DEV_BSG=y
> CONFIG_BLK_DEV_INTEGRITY=y
> CONFIG_BLK_CGROUP=y
> CONFIG_DEBUG_BLK_CGROUP=y
> CONFIG_BLOCK_COMPAT=y
> CONFIG_IOSCHED_NOOP=y
> CONFIG_IOSCHED_CFQ=y
> CONFIG_CFQ_GROUP_IOSCHED=y
> CONFIG_DEBUG_CFQ_IOSCHED=y
> CONFIG_DEFAULT_CFQ=y
> CONFIG_DEFAULT_IOSCHED="cfq"
> CONFIG_VIRTUALIZATION=y
> CONFIG_VIRTIO_PCI=y
> CONFIG_VIRTIO_BALLOON=y
> CONFIG_NR_IRQS=512
> CONFIG_MIGRATION=y
> CONFIG_KSM=y
> CONFIG_PM=y
> CONFIG_SUSPEND=y
> CONFIG_SUSPEND_FREEZER=y
> CONFIG_HIBERNATION=y
> CONFIG_SYSFS_DEPRECATED_V2=y
> CONFIG_LOCK_STAT=y
> CONFIG_LOCKDEP_SUPPORT=y
> CONFIG_LOCKDEP=y
> CONFIG_IXGBE_DCB=y
> CONFIG_DCB=y
> 
> 

^ permalink raw reply

* Re: Power machines fail to boot after build being successful
From: Tony Breeds @ 2010-08-27  0:46 UTC (permalink / raw)
  To: divya; +Cc: linux-next, LinuxPPC-dev, LKML
In-Reply-To: <4C764F52.5050707@linux.vnet.ibm.com>

On Thu, Aug 26, 2010 at 04:56:10PM +0530, divya wrote:
> Hi,
> 
> After successfully building the kernel version 2.6.36-rc2-git4(commitid
> d4348c678977c) with the config file attached(used make oldconfig),
> P5 and P6 power machines fails to reboot with the following logs

<snip>

> Preparing to boot Linux version 2.6.36-rc2-autotest-next-20100826 (root@llm62) (gcc version 4.3.2 [gcc-4_3-branch revision 141291] (SUSE Linux) ) #1 SMP Thu Aug 26 10:55:01 IST 2010

You say you're booting -git4, but your running kernel is linux-next from Aug.
26th.  Can you confirm you see this boot failure with Linus's tree, NOT
linux-next?

Adding linux-ppc and linux-next lists.

Yours Tony

^ permalink raw reply

* SPI bitbanging driver
From: Ravi Gupta @ 2010-08-26  9:39 UTC (permalink / raw)
  To: linuxppc-dev, MJ embd, Ira W. Snyder, Anton Vorontsov

Hi,

I am new to Linux device driver development. I have to develop an SPI
bitbanging driver for the MPC837xERDB board. So far I have found that the
Linux kernel already has support for bitbanging over GPIO (I am using Linux
kernel 2.6.35). Now I am unsure how to configure the SPI pins as GPIO and
which API to use to write my driver. I searched the internet a lot for a
sample SPI bitbang driver but could not find one. Can anyone provide a
sample bitbang driver explaining how to use spi_bitbang.c and spi_gpio.c?
Do I need to change the mpc8377_rdb.dts file to register the SPI pins as GPIO?

Thanks in advance
Ravi Gupta

^ permalink raw reply

* [PATCH] powerpc: don't use kernel stack with translation off
From: Michael Neuling @ 2010-08-26  7:04 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev, stable, Matt Evans

In f761622e59433130bc33ad086ce219feee9eb961 we changed
early_setup_secondary so it's called using the proper kernel stack
rather than the emergency one.

Unfortunately, this stack pointer can't be used when translation is off
on PHYP as this stack pointer might be outside the RMO.  This results in
the following on all non zero cpus:
  cpu 0x1: Vector: 300 (Data Access) at [c00000001639fd10]
      pc: 000000000001c50c
      lr: 000000000000821c
      sp: c00000001639ff90
     msr: 8000000000001000
     dar: c00000001639ffa0
   dsisr: 42000000
    current = 0xc000000016393540
    paca    = 0xc000000006e00200
      pid   = 0, comm = swapper

The original patch was only tested on a bare metal system, so it never
caught this problem.

This changes __secondary_start so that we calculate the new stack
pointer but only start using it after we've called early_setup_secondary.

With this patch, the above problem goes away.

Signed-off-by: Michael Neuling <mikey@neuling.org>
---
benh: can you pick this up for 2.6.36 as we are bust currently in Linus' tree.

diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
index 4d6681d..c571cd3 100644
--- a/arch/powerpc/kernel/head_64.S
+++ b/arch/powerpc/kernel/head_64.S
@@ -575,13 +575,19 @@ __secondary_start:
 	/* Initialize the kernel stack.  Just a repeat for iSeries.	 */
 	LOAD_REG_ADDR(r3, current_set)
 	sldi	r28,r24,3		/* get current_set[cpu#]	 */
-	ldx	r1,r3,r28
-	addi	r1,r1,THREAD_SIZE-STACK_FRAME_OVERHEAD
-	std	r1,PACAKSAVE(r13)
+	ldx	r14,r3,r28
+	addi	r14,r14,THREAD_SIZE-STACK_FRAME_OVERHEAD
+	std	r14,PACAKSAVE(r13)
 
 	/* Do early setup for that CPU (stab, slb, hash table pointer) */
 	bl	.early_setup_secondary
 
+	/*
+	 * setup the new stack pointer, but *don't* use this until
+	 * translation is on.
+	 */
+	mr	r1, r14
+
 	/* Clear backchain so we get nice backtraces */
 	li	r7,0
 	mtlr	r7

^ permalink raw reply related

* Re: [PATCH] powerpc: Wire up direct socket system calls
From: Ian Munsie @ 2010-08-26  5:56 UTC (permalink / raw)
  To: linux-kernel, Benjamin Herrenschmidt, Paul Mackerras,
	Andrew Morton, Andreas Schwab, Christoph Hellwig,
	Arjan van de Ven, Jesper Nilsson, linuxppc-dev
In-Reply-To: <1282798228-340-1-git-send-email-imunsie@au1.ibm.com>

Excerpts from Ian Munsie's message of Thu Aug 26 14:50:28 +1000 2010:
> This patch wires up the various socket system calls on PowerPC so that
> userspace can call them directly, rather than by going through the
> multiplexed socketcall system call.

I should have mentioned that the base is ppc/next

Also, I've included a simple library below suitable for use with
LD_PRELOAD to allow this to be tested on existing programs.


Cheers,
-Ian


#include <unistd.h>
#include <sys/syscall.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>

/* PPC syscall numbers from /arch/powerpc/include/asm/unistd.h */
#define __NR_socket		326
#define __NR_bind		327
#define __NR_connect		328
#define __NR_listen		329
#define __NR_accept		330
#define __NR_getsockname	331
#define __NR_getpeername	332
#define __NR_socketpair		333
#define __NR_send		334
#define __NR_sendto		335
#define __NR_recv		336
#define __NR_recvfrom		337
#define __NR_shutdown		338
#define __NR_setsockopt		339
#define __NR_getsockopt		340
#define __NR_sendmsg		341
#define __NR_recvmsg		342
#define __NR_recvmmsg		343
#define __NR_accept4		344


#define DEBUG 0

#if DEBUG
#define DEBUGsyscall(name, ...)						\
	int __ret;							\
	__ret = syscall(__NR_##name, __VA_ARGS__);			\
	fprintf(stderr, "--"#name": %i", __ret);			\
	if (__ret == -1) {						\
		fprintf(stderr, ", %s (%i)", strerror(errno), errno);	\
	}								\
	fprintf(stderr, "--\n");					\
	return __ret;
#else
#define DEBUGsyscall(name, ...)						\
	return syscall(__NR_##name, __VA_ARGS__);
#endif


int socket(int domain, int type, int protocol)
{
	DEBUGsyscall(socket, domain, type, protocol);
}

int bind(int sockfd, const struct sockaddr *addr, socklen_t addrlen)
{
	DEBUGsyscall(bind, sockfd, addr, addrlen);
}

int connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen)
{
	DEBUGsyscall(connect, sockfd, addr, addrlen);
}

int listen(int sockfd, int backlog)
{
	DEBUGsyscall(listen, sockfd, backlog);
}

int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen)
{
	DEBUGsyscall(accept, sockfd, addr, addrlen);
}

int getsockname(int sockfd, struct sockaddr *addr, socklen_t *addrlen)
{
	DEBUGsyscall(getsockname, sockfd, addr, addrlen);
}

int getpeername(int sockfd, struct sockaddr *addr, socklen_t *addrlen)
{
	DEBUGsyscall(getpeername, sockfd, addr, addrlen);
}

int socketpair(int domain, int type, int protocol, int sv[2])
{
	DEBUGsyscall(socketpair, domain, type, protocol, sv);
}

ssize_t send(int sockfd, const void *buf, size_t len, int flags)
{
	DEBUGsyscall(send, sockfd, buf, len, flags);
}

ssize_t sendto(int sockfd, const void *buf, size_t len, int flags, const struct sockaddr *dest_addr, socklen_t addrlen)
{
	DEBUGsyscall(sendto, sockfd, buf, len, flags, dest_addr, addrlen);
}

ssize_t recv(int sockfd, void *buf, size_t len, int flags)
{
	DEBUGsyscall(recv, sockfd, buf, len, flags);
}

ssize_t recvfrom(int sockfd, void *buf, size_t len, int flags, struct sockaddr *src_addr, socklen_t *addrlen)
{
	DEBUGsyscall(recvfrom, sockfd, buf, len, flags, src_addr, addrlen);
}

int shutdown(int sockfd, int how)
{
	DEBUGsyscall(shutdown, sockfd, how);
}

int setsockopt(int sockfd, int level, int optname, const void *optval, socklen_t optlen)
{
	DEBUGsyscall(setsockopt, sockfd, level, optname, optval, optlen);
}

int getsockopt(int sockfd, int level, int optname, void *optval, socklen_t *optlen)
{
	DEBUGsyscall(getsockopt, sockfd, level, optname, optval, optlen);
}

ssize_t sendmsg(int sockfd, const struct msghdr *msg, int flags)
{
	DEBUGsyscall(sendmsg, sockfd, msg, flags);
}

ssize_t recvmsg(int sockfd, struct msghdr *msg, int flags)
{
	DEBUGsyscall(recvmsg, sockfd, msg, flags);
}

/* Debian squeeze libc doesn't support recvmmsg yet, not much point intercepting it */

int accept4(int sockfd, struct sockaddr *addr, socklen_t *addrlen, int flags)
{
	DEBUGsyscall(accept4, sockfd, addr, addrlen, flags);
}
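
One way to exercise the wrappers above is a tiny round-trip helper like the
sketch below. It is not part of the library; the name socket_roundtrip() is
made up for illustration. It only uses the normal libc entry points, so when
the preload library is active the calls should end up in the direct syscalls.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the preload library): push one short
 * message through an AF_UNIX socketpair and read it back.
 * Returns 0 on success, -1 on any failure.
 */
int socket_roundtrip(void)
{
	static const char msg[] = "ping";
	char buf[sizeof(msg)];
	int sv[2];
	int ret = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
		return -1;

	/* MSG_WAITALL so a short read cannot be mistaken for a failure */
	if (send(sv[0], msg, sizeof(msg), 0) == (ssize_t)sizeof(msg) &&
	    recv(sv[1], buf, sizeof(buf), MSG_WAITALL) == (ssize_t)sizeof(msg) &&
	    memcmp(msg, buf, sizeof(msg)) == 0)
		ret = 0;

	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

Calling this from a small main() and running the binary with LD_PRELOAD
pointing at the shared object built from the library (and DEBUG set to 1)
should print one line per intercepted call on a kernel with the patch applied.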

^ permalink raw reply

* [PATCH] powerpc: Wire up direct socket system calls
From: Ian Munsie @ 2010-08-26  4:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jesper Nilsson, linuxppc-dev, Andreas Schwab, Ian Munsie,
	Paul Mackerras, Andrew Morton, Arjan van de Ven,
	Christoph Hellwig

From: Ian Munsie <imunsie@au1.ibm.com>

This patch wires up the various socket system calls on PowerPC so that
userspace can call them directly, rather than by going through the
multiplexed socketcall system call.

Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
---
 arch/powerpc/include/asm/systbl.h |   19 +++++++++++++++++++
 arch/powerpc/include/asm/unistd.h |   21 ++++++++++++++++++++-
 2 files changed, 39 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 3d21266..aa0f1eb 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -329,3 +329,22 @@ COMPAT_SYS(rt_tgsigqueueinfo)
 SYSCALL(fanotify_init)
 COMPAT_SYS(fanotify_mark)
 SYSCALL_SPU(prlimit64)
+SYSCALL_SPU(socket)
+SYSCALL_SPU(bind)
+SYSCALL_SPU(connect)
+SYSCALL_SPU(listen)
+SYSCALL_SPU(accept)
+SYSCALL_SPU(getsockname)
+SYSCALL_SPU(getpeername)
+SYSCALL_SPU(socketpair)
+SYSCALL_SPU(send)
+SYSCALL_SPU(sendto)
+COMPAT_SYS_SPU(recv)
+COMPAT_SYS_SPU(recvfrom)
+SYSCALL_SPU(shutdown)
+COMPAT_SYS_SPU(setsockopt)
+COMPAT_SYS_SPU(getsockopt)
+COMPAT_SYS_SPU(sendmsg)
+COMPAT_SYS_SPU(recvmsg)
+COMPAT_SYS_SPU(recvmmsg)
+SYSCALL_SPU(accept4)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index 597e6f9..6151937 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -348,10 +348,29 @@
 #define __NR_fanotify_init	323
 #define __NR_fanotify_mark	324
 #define __NR_prlimit64		325
+#define __NR_socket		326
+#define __NR_bind		327
+#define __NR_connect		328
+#define __NR_listen		329
+#define __NR_accept		330
+#define __NR_getsockname	331
+#define __NR_getpeername	332
+#define __NR_socketpair		333
+#define __NR_send		334
+#define __NR_sendto		335
+#define __NR_recv		336
+#define __NR_recvfrom		337
+#define __NR_shutdown		338
+#define __NR_setsockopt		339
+#define __NR_getsockopt		340
+#define __NR_sendmsg		341
+#define __NR_recvmsg		342
+#define __NR_recvmmsg		343
+#define __NR_accept4		344
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls		326
+#define __NR_syscalls		345
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
-- 
1.7.1
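
For readers unfamiliar with the multiplexer, the difference between the two
entry paths can be sketched from userspace roughly as follows. This is only an
illustration: raw_socket() and SOCKETCALL_SOCKET are names made up here, and
the fallback logic is an assumption about how a caller might cope with headers
that are newer than the running kernel.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sub-call number for socket(); same value as SYS_SOCKET in <linux/net.h> */
#define SOCKETCALL_SOCKET 1

/*
 * Hypothetical helper: create a socket without going through libc's
 * socket() wrapper.  Prefer the direct syscall when the headers define
 * it, and fall back to the multiplexed socketcall if the kernel
 * answers ENOSYS.
 */
int raw_socket(int domain, int type, int protocol)
{
#ifdef __NR_socket
	long fd = syscall(__NR_socket, domain, type, protocol);

	if (fd != -1 || errno != ENOSYS)
		return (int)fd;
#endif
#ifdef __NR_socketcall
	{
		/* socketcall takes the sub-call number and an argument array */
		unsigned long args[3] = {
			(unsigned long)domain,
			(unsigned long)type,
			(unsigned long)protocol,
		};

		return (int)syscall(__NR_socketcall, SOCKETCALL_SOCKET, args);
	}
#else
	errno = ENOSYS;
	return -1;
#endif
}
```

The direct path passes the arguments in registers like any other system call,
while the multiplexed path has the kernel copy them out of a userspace array.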

^ permalink raw reply related

* Re: [PATCH 1/5] ptp: Added a brand new class driver for ptp clocks.
From: Christian Riesch @ 2010-08-25  9:40 UTC (permalink / raw)
  To: john stultz
  Cc: Rodolfo Giometti, Arnd Bergmann, netdev, devicetree-discuss,
	linux-kernel, linuxppc-dev, Richard Cochran, linux-arm-kernel,
	Krzysztof Halasa
In-Reply-To: <1282594125.3111.344.camel@localhost.localdomain>

On Mon, Aug 23, 2010 at 10:08 PM, john stultz <johnstul@us.ibm.com> wrote:
> On Thu, 2010-08-19 at 07:55 +0200, Richard Cochran wrote:
>> On Wed, Aug 18, 2010 at 05:12:56PM -0700, john stultz wrote:
>> > Again, my knowledge in the networking stack is pretty limited. But it
>> > would seem that having an interface that does something to the effect of
>> > "adjust the timestamp clock on the hardware that generated it from this
>> > packet by Xppb" would feel like the right level of abstraction. Its
>> > closely related to SO_TIMESTAMP, option right? Would something like
>> > using the setsockopt/getsockopt interface with
>> > SO_TIMESTAMP_ADJUST/OFFSET/SET/etc be reasonable?
>>
>> The clock and its adjustment have nothing to do with a network
>> socket. The current PTP hacks floating around all add private ioctls
>> to the MAC driver. That is the *wrong* way to do it.
>
> Could you clarify on *why* that is the wrong approach?
>
> Maybe this is where some of the confusion is coming from? The subtleties
> of the more generic PTP algorithm and how the existence of PTP hardware
> clocks change things are not clear to me. My understanding of ptp and
> the networking details around it is limited, so your expertise is
> appreciated.  Might you consider covering some of this via a
> Documentation/ptp/overview.txt file in a future version of your patch?
>
> Here's a summary of what I understand:
> So from:
> http://en.wikipedia.org/wiki/Precision_Time_Protocol#Synchronization
>
> We see the message exchange of Sync/Delay_Req/Delay_Resp, and the
> calculation of the local offset from the server (and then a frequency
> adjustment over time as offsets values are accumulated).
>
> Without the hardware clock, all of these messages and their
> corresponding timestamps are likely created by PTPd, using clock_gettime
> and then adjtimex() to correct for the calculated offset or freq
> adjustment. No extra interfaces are necessary, and PTPd is syncing the
> system time as accurately as it can. This is how the existing ptpd
> projects on the web seem to function.
>
> Now, with PTP hardware on the system, my understanding of what you're
> trying to enable with your patches is that the PTP hardware does the
> timestamping on both incoming and outgoing messages. PTPd then reads the
> pre-timestamped messages, calculates the offset and freq correction, and
> then feeds that back into the PTP hardware via your interface. No time
> correction is done at all by PTPd.
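
For reference, the offset calculation in the Sync/Delay_Req/Delay_Resp
exchange summarized above can be sketched like this. It is the textbook
delay request-response arithmetic and assumes a symmetric path delay; the
struct and function names are made up for illustration and are not taken
from the patch set or from ptpd.

```c
#include <stdint.h>

/*
 * Hypothetical container for the four timestamps of one exchange,
 * all in nanoseconds:
 *   t1: master sends Sync          (master clock)
 *   t2: slave receives Sync        (slave clock)
 *   t3: slave sends Delay_Req      (slave clock)
 *   t4: master receives Delay_Req  (master clock)
 */
struct ptp_exchange {
	int64_t t1, t2, t3, t4;
};

/* Slave clock minus master clock, assuming equal delay in both directions */
int64_t ptp_offset_ns(const struct ptp_exchange *x)
{
	return ((x->t2 - x->t1) - (x->t4 - x->t3)) / 2;
}

/* One-way path delay under the same symmetry assumption */
int64_t ptp_delay_ns(const struct ptp_exchange *x)
{
	return ((x->t2 - x->t1) + (x->t4 - x->t3)) / 2;
}
```

Whether the four timestamps come from clock_gettime() or from the
timestamping hardware does not change this arithmetic, only its accuracy.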

John,
What you describe here is only one of the use cases. If the hardware
has a single network port and operates as a PTP slave, it timestamps
the PTP packets that are sent and received and subsequently uses these
timestamps and the information it received from the master in the
packets to steer its clock to align it with the master clock. In such
a case the timestamping hardware and the clock hardware work together
closely and it seems to be okay to use the same interface to control
both the timestamping and the PTP clock.

But we have to consider other use cases, e.g.,

1) Boundary clocks:
We have more than one network port. One port operates as a slave
clock, our system gets time information via this port and steers its
PTP clock to align with the master clock. The other network ports of
our system operate as master clocks and redistribute the time
information we got from the master to other clocks on these networks.
In such a case we do timestamping on each of the network ports, but we
only have a single PTP clock. Each network port's timestamping
hardware uses the same hardware clock to generate time stamps.

2) Master clock:
We have one or more network ports. Our system has a really good clock
(ovenized quartz crystal, an atomic clock, a GPS timing receiver...)
and it distributes this time on the network. In such a case we do not
steer our clock based on the (packet) timestamps we get from our
timestamping unit. Instead, we directly drive our clock hardware with
a very stable frequency that we get from the OCXO or the atomic
clock... or we use one of the ancillary features of the PTP clock that
Richard mentioned to timestamp not network packets but a 1pps signal
and use these timestamps to steer the clock. Packet time stamping is
used to distribute the time to the slaves, but it is not part of the
control loop in this case.

So in the first case we have one PTP clock but several network packet
timestamping units, whereas in the second case the packet timestamping
is done but it is not part of the control loop that steers the clock.
Of course in most hardware implementations both the PTP clock and the
timestamping unit sit on the same chip and often use the same means of
communication to the cpu, e.g., the MDIO bus, but I think we need some
logical separation here.
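
As a rough sketch of that logical separation (all names here are made up for
illustration and are not a proposed kernel interface), one clock object can be
shared by several timestamping ports, and the clock can be steered without
touching any port:

```c
#include <stdint.h>

/* Hypothetical model of the PTP hardware clock itself */
struct hw_clock {
	int64_t time_ns;	/* current time of the clock */
	int32_t freq_adj_ppb;	/* last frequency adjustment applied */
};

/* Hypothetical model of one port's packet timestamping unit */
struct ts_port {
	const char *name;
	struct hw_clock *clock;	/* several ports may share one clock */
};

/* A port only reads the clock when it stamps a packet */
int64_t port_timestamp(const struct ts_port *p)
{
	return p->clock->time_ns;
}

/* Steering works on the clock alone, whatever the source of the correction */
void clock_adjust_freq(struct hw_clock *c, int32_t ppb)
{
	c->freq_adj_ppb = ppb;
}

void clock_step(struct hw_clock *c, int64_t delta_ns)
{
	c->time_ns += delta_ns;
}
```

In the boundary clock case several ts_port instances point at the same
hw_clock; in the master clock case the corrections fed to clock_adjust_freq()
would come from the 1pps input rather than from packet timestamps.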

Christian

^ permalink raw reply

* Re: [PATCH 1/2] kdump: Allow shrinking of kdump region to be overridden
From: Eric W. Biederman @ 2010-08-25  0:37 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: akpm, kexec, linux-kernel, linuxppc-dev
In-Reply-To: <20100825002258.GD28360@kryten>

Anton Blanchard <anton@samba.org> writes:

> On ppc64 the crashkernel region almost always overlaps an area of firmware.
> This works fine except when using the sysfs interface to reduce the kdump
> region. If we free the firmware area we are guaranteed to crash.

That is a ppc64 bug.  Firmware should not be in the reserved region.  Any
random kernel-like thing can be put into that region at any valid
address and the fact that shrinking the region frees your firmware means
that using that region could also stomp your firmware (which I assume
would be a bad thing).

So please fix the ppc64 reservation.

Eric

^ permalink raw reply

* Re: [PATCH] powerpc: Check end of stack canary at oops time
From: Anton Blanchard @ 2010-08-25  1:51 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev
In-Reply-To: <1282699778.21145.54.camel@concordia>

 
Hi,

> The check for init is just because we haven't set the magic value for
> init's stack right? But we could.

Yeah, it's similar to what x86 are doing now:


commit 0e7810be30f66e9f430c4ce2cd3b14634211690f
Author: Jan Beulich <JBeulich@novell.com>
Date:   Fri Nov 20 14:00:14 2009 +0000

    x86: Suppress stack overrun message for init_task
    
    init_task doesn't get its stack end location set to
    STACK_END_MAGIC, and hence the message is confusing
    rather than helpful in this case.


Adding it directly to init_task would be nice but I suspect we'd
either have to make assumptions about end_of_stack in our code or move the
canary into the thread_info (so we can statically allocate it via
INIT_THREAD_INFO()) or do it at runtime somewhere, hopefully early enough that
we couldn't take an oops.
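
The "do it at runtime" option can be illustrated with a small userspace
sketch. This is not kernel code: the fake_stack type and helpers are made up,
and the real end_of_stack() sits just above thread_info rather than at word 0
of a plain array. Only the STACK_END_MAGIC value is the one the kernel uses.

```c
#include <string.h>

#define STACK_END_MAGIC		0x57AC6E9DUL	/* same value as the kernel's */
#define FAKE_STACK_WORDS	64

/* Hypothetical stand-in for a task stack that grows towards words[0] */
struct fake_stack {
	unsigned long words[FAKE_STACK_WORDS];
};

/* Last word the stack may legitimately reach */
unsigned long *fake_end_of_stack(struct fake_stack *s)
{
	return &s->words[0];
}

/* Plant the canary once, before the stack is used */
void fake_stack_init(struct fake_stack *s)
{
	memset(s, 0, sizeof(*s));
	*fake_end_of_stack(s) = STACK_END_MAGIC;
}

/* The check an oops handler would make: non-zero means the canary is gone */
int fake_stack_overrun(struct fake_stack *s)
{
	return *fake_end_of_stack(s) != STACK_END_MAGIC;
}
```

A stack that never had fake_stack_init() called on it would always look
overrun, which is the init_task situation described in the x86 commit above.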

Anton

^ permalink raw reply

* Re: [PATCH] powerpc: remove fpscr use from [kvm_]cvt_{fd,df}
From: Michael Neuling @ 2010-08-25  1:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linuxppc-dev, Andreas Schwab, kvm-ppc, Paul Mackerras
In-Reply-To: <1282699836.22370.566.camel@pasglop>

In message <1282699836.22370.566.camel@pasglop> you wrote:
> On Tue, 2010-08-24 at 15:15 +1000, Michael Neuling wrote:
> > > > Do some 32 bit processors need this? 
> > > > 
> > > > In 32 bit before the merge, we used to have code that did:
> > > > 
> > > >   #if defined(CONFIG_4xx) || defined(CONFIG_E500)
> > > >    #define cvt_fd without save/restore fpscr
> > > >   #else
> > > >    #define cvt_fd with save/restore fpscr
> > > >   #endif
> > > > 
> > > > Kumar; does this ring any bells?
> > > 
> > > I don't see anything in the various 440 docs I have at hand that would
> > > hint at lfd/stfs affecting FPSCR.
> > 
> > The way the ifdefs are, it's the other way around.  4xx procs don't need
> > to save/restore fpscr and others do.
> 
> Right, my bad. In any case, Paulus reckons it's all his mistake and we
> really never need to save/restore fpscr.

ACK :-P

Mikey

^ permalink raw reply

