LinuxPPC-Dev Archive on lore.kernel.org
* Re: RFC: issues concerning the next NAPI interface
From: Rick Jones @ 2007-08-24 17:07 UTC (permalink / raw)
  To: Linas Vepstas
  Cc: Thomas Klein, Jan-Bernd Themann, Stefan Roscher, netdev,
	linux-kernel, Christoph Raisch, linux-ppc, Jan-Bernd Themann,
	Eder, akepner, Stephen Hemminger, Marcus
In-Reply-To: <20070824165110.GH4282@austin.ibm.com>

> Just to be clear, in the previous email I posted on this thread, I
> described a worst-case network ping-pong test case (send a packet, wait
> for reply), and found out that a deffered interrupt scheme just damaged
> the performance of the test case.  Since the folks who came up with the
> test case were adamant, I turned off the defferred interrupts.  
> While defferred interrupts are an "obvious" solution, I decided that 
> they weren't a good solution. (And I have no other solution to offer).

Sounds exactly like the default netperf TCP_RR test and any number of other 
benchmarks.  The "send a request, wait for reply, send next request, etc." 
pattern is a rather common application behaviour after all.

rick jones


* Re: RFC: issues concerning the next NAPI interface
From: Shirley Ma @ 2007-08-24 17:45 UTC (permalink / raw)
  To: Linas Vepstas
  Cc: Thomas Klein, Jan-Bernd Themann, Stefan Roscher, netdev,
	linux-kernel, Christoph Raisch, netdev-owner, linux-ppc, akepner,
	Eder, Jan-Bernd Themann, Stephen Hemminger, Marcus
In-Reply-To: <20070824165110.GH4282@austin.ibm.com>

> Just to be clear, in the previous email I posted on this thread, I
> described a worst-case network ping-pong test case (send a packet, wait
> for reply), and found out that a deffered interrupt scheme just damaged
> the performance of the test case. 

When splitting the rx and tx handlers, I found some performance gain from 
using a deferred interrupt scheme for tx, not rx, in the IPoIB driver.

Shirley


* Re: RFC: issues concerning the next NAPI interface
From: Jan-Bernd Themann @ 2007-08-24 18:11 UTC (permalink / raw)
  To: James Chapman
  Cc: Thomas Klein, Jan-Bernd Themann, Stefan Roscher, netdev,
	linux-kernel, Christoph Raisch, linux-ppc, akepner, Marcus Eder,
	Stephen Hemminger
In-Reply-To: <46CF127D.1090609@katalix.com>

James Chapman wrote:
> Stephen Hemminger wrote:
>> On Fri, 24 Aug 2007 17:47:15 +0200
>> Jan-Bernd Themann <ossthema@de.ibm.com> wrote:
>>
>>> Hi,
>>>
>>> On Friday 24 August 2007 17:37, akepner@sgi.com wrote:
>>>> On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:
>>>>> .......
>>>>> 3) On modern systems the incoming packets are processed very fast. 
>>>>> Especially
>>>>>    on SMP systems when we use multiple queues we process only a 
>>>>> few packets
>>>>>    per napi poll cycle. So NAPI does not work very well here and 
>>>>> the interrupt    rate is still high. What we need would be some 
>>>>> sort of timer polling mode    which will schedule a device after a 
>>>>> certain amount of time for high load    situations. With high 
>>>>> precision timers this could work well. Current
>>>>>    usual timers are too slow. A finer granularity would be needed 
>>>>> to keep the
>>>>>    latency down (and queue length moderate).
>>>>>
>>>> We found the same on ia64-sn systems with tg3 a couple of years 
>>>> ago. Using simple interrupt coalescing ("don't interrupt until 
>>>> you've received N packets or M usecs have elapsed") worked 
>>>> reasonably well in practice. If your h/w supports that (and I'd 
>>>> guess it does, since it's such a simple thing), you might try it.
>>>>
>>> I don't see how this should work. Our latest machines are fast 
>>> enough that they
>>> simply empty the queue during the first poll iteration (in most cases).
>>> Even if you wait until X packets have been received, it does not 
>>> help for
>>> the next poll cycle. The average number of packets we process per 
>>> poll queue
>>> is low. So a timer would be preferable that periodically polls the 
>>> queue, without the need of generating a HW interrupt. This would 
>>> allow us
>>> to wait until a reasonable amount of packets have been received in 
>>> the meantime
>>> to keep the poll overhead low. This would also be useful in combination
>>> with LRO.
>>>
>>
>> You need hardware support for deferred interrupts. Most devices have 
>> it (e1000, sky2, tg3)
>> and it interacts well with NAPI. It is not a generic thing you want 
>> done by the stack,
>> you want the hardware to hold off interrupts until X packets or Y 
>> usecs have expired.
>
> Does hardware interrupt mitigation really interact well with NAPI? In 
> my experience, holding off interrupts for X packets or Y usecs does 
> more harm than good; such hardware features are useful only when the 
> OS has no NAPI-like mechanism.
>
> When tuning NAPI drivers for packets/sec performance (which is a good 
> indicator of driver performance), I make sure that the driver stays in 
> NAPI polled mode while it has any rx or tx work to do. If the CPU is 
> fast enough that all work is always completed on each poll, I have the 
> driver stay in polled mode until dev->poll() is called N times with no 
> work being done. This keeps interrupts disabled for reasonable traffic 
> levels, while minimizing packet processing latency. No need for 
> hardware interrupt mitigation.
Yes, that was one idea as well. But the problem with that approach is that
net_rx_action will call the same poll function over and over again in a row
if there are no further network devices, so you always poll just a very few
packets each time. This does not work well with LRO, as there are no packets
to aggregate... So it would make more sense to wait for a certain time before
trying again.

Second problem: once jiffies has incremented by one in net_rx_action (after
some poll rounds), net_rx_action will quit and return control to the softIRQ
handler. The poll function is then called again, as the softIRQ handler thinks
there is more work to be done. So even then we do not wait... Only after some
rounds in the softIRQ handler do we finally wait some time.
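
As a rough illustration of the scheme James describes (stay in polled mode
until the poll routine has found no work N times in a row), here is a minimal
user-space sketch in C. The names poll_state, on_interrupt, on_poll and the
threshold IDLE_POLLS_BEFORE_IRQ are made up for this example and are not
kernel APIs. Note that it does nothing about the objection above: consecutive
polls still run back to back, with no delay in which packets could accumulate.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical threshold: consecutive empty polls tolerated before the
 * driver re-enables interrupts. Not taken from any real driver. */
#define IDLE_POLLS_BEFORE_IRQ 4

struct poll_state {
    int  idle_polls;   /* consecutive polls that found no work */
    bool irq_enabled;  /* true: interrupt-driven, false: polled mode */
};

/* Simulated interrupt: mask further interrupts and enter polled mode. */
static void on_interrupt(struct poll_state *s)
{
    s->irq_enabled = false;
    s->idle_polls = 0;
}

/* One poll pass; work_done is the number of packets handled in this pass.
 * Returns true if the device should stay on the poll list. */
static bool on_poll(struct poll_state *s, int work_done)
{
    assert(work_done >= 0);

    if (s->irq_enabled)
        return false;               /* not in polled mode at all */

    if (work_done > 0) {
        s->idle_polls = 0;          /* any work resets the idle counter */
        return true;
    }

    if (++s->idle_polls < IDLE_POLLS_BEFORE_IRQ)
        return true;                /* idle, but not for long enough yet */

    s->irq_enabled = true;          /* idle N times: back to interrupts */
    s->idle_polls = 0;
    return false;
}
```

A caller would invoke on_interrupt() from the interrupt path and on_poll()
from the poll routine, removing the device from the poll list when on_poll()
returns false.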

>
>> The parameters for controlling it are already in ethtool, the issue 
>> is finding a good
>> default set of values for a wide range of applications and 
>> architectures. Maybe some
>> heuristic based on processor speed would be a good starting point. 
>> The dynamic irq
>> moderation stuff is not widely used because it is too hard to get right.
>
> I agree. It would be nice to find a way for the typical user to derive 
> best values for these knobs for his/her particular system. Perhaps a 
> tool using pktgen and network device phy internal loopback could be 
> developed?
>


* Re: Problems on porting linux 2.6 to xilinx ML410
From: Grant Likely @ 2007-08-24 18:21 UTC (permalink / raw)
  To: woyuzhilei; +Cc: linuxppc-embedded
In-Reply-To: <200708241111513591137@163.com>

On 8/23/07, woyuzhilei <woyuzhilei@163.com> wrote:
>
>
> Hey:
>     Recently I'm doing some project on Xilinx Ml410 evaluation board.The
> first step is porting linux 2.6 to ml410,but I got some problems on this,and
> my project cann't proceed,so I come to you for some help.
>     I use the linux kernel source tree download from
> http://git.secretlab.ca/git/linux-2.6-virtex.git  (The
> latest one).Add the file the xparameters.h,xparameter_ml40x.h.  Then add
> arch/ppc/platforms/4xx/xilinx_generic_ppc.c(Use the patch I
> get here),and change it's name to ml40x.c,then I make some necessay change
> of the configuration files to accept selecting a Ml40x type board.

You shouldn't need to do all of this; only the xparameters_ml40x.h
file is needed.  To get started, I'd just replace
xparameters_ml403.h with your custom xparameters_ml40x.h and use the
ml403 board port.  Once you've got it booting, you can get more fancy.

> Then
> compile the kernel,and get the image file from arch/ppc/boot/images, (On
> kernel compiling,the only device driver  I sellect is " 8250/16550 and
> compatible serial support ").After that  I download the zImage.elf file to
> the target board,and run it.But there is no output from the serial port at
> all.Am I doing somthing wrong?I really don't know goes wrong.
>     Can anyone here help me with this?Any help from you is appreciated.Thank
> you very much!

Make sure you've got 16550 console support enabled and
'console=/dev/ttyS0' in your kernel command line.

Cheers,
g.

-- 
Grant Likely, B.Sc., P.Eng.
Secret Lab Technologies Ltd.
grant.likely@secretlab.ca
(403) 399-0195


* Re: 8555CDS BSP on 8548CDS board
From: Andy Fleming @ 2007-08-24 18:38 UTC (permalink / raw)
  To: mike zheng; +Cc: linuxppc-embedded
In-Reply-To: <5c9cd53b0708240900y6537e42cn5ee59b2a2a707768@mail.gmail.com>


On Aug 24, 2007, at 11:00, mike zheng wrote:

> Hi,
>
> I was told Freescale's 8555CDS board is very similar to 8548CDS  
> board. I just wonder what exactly the differences are. can I just  
> put the 8555CDS BSP onto the 8548CDS board?
>
> Thanks  in advance,


The 8555 u-boot is different from the 8548 u-boot.  There are also  
differences in the device-tree (I'm not sure what version of the  
kernel is in the BSP, so I can't say for sure).  Recent versions of  
the Linux kernel merged all of the CDS systems into one kernel.

As for the differences, off the top of my head:

* 8555 vs 8548 chip
* PCI slot on the carrier card is PCI on 8555, PCIe on 8548.
* 8548 has 4 eTSECs, 8555 has 2 TSECs (and the # of ethernet ports  
reflects this).

Andy


* Re: [PATCH v2] [02/10] pasemi_mac: Stop using the pci config space accessors for register read/writes
From: Olof Johansson @ 2007-08-24 18:11 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: netdev, jgarzik, linuxppc-dev
In-Reply-To: <20070824140531.ff7d66bf.sfr@canb.auug.org.au>

On Fri, Aug 24, 2007 at 02:05:31PM +1000, Stephen Rothwell wrote:
> On Thu, 23 Aug 2007 13:13:10 -0500 Olof Johansson <olof@lixom.net> wrote:
> >
> >  out:
> > -	pci_dev_put(mac->iob_pdev);
> > -out_put_dma_pdev:
> > -	pci_dev_put(mac->dma_pdev);
> > -out_free_netdev:
> > +	if (mac->iob_pdev)
> > +		pci_dev_put(mac->iob_pdev);
> > +	if (mac->dma_pdev)
> > +		pci_dev_put(mac->dma_pdev);
> 
> It is not documented as such (as far as I can see), but pci_dev_put is
> safe to call with NULL. And there are other places in the kernel that
> explicitly use that fact.

Some places check, others do not. I'll leave it be for now but might take
care of it during some future cleanup. Thanks for pointing it out though.
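
For readers unfamiliar with the convention Stephen mentions, here is a small
user-space sketch of a NULL-tolerant "put" and the unconditional cleanup path
it allows. Everything in it (toy_dev, toy_dev_get, toy_dev_put, toy_mac,
toy_mac_release) is a hypothetical stand-in invented for this example; it is
not the pasemi_mac code and not the PCI core implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Toy reference-counted object standing in for a device handle. */
struct toy_dev {
    int refcount;
};

/* Take a reference; passing NULL is allowed and returns NULL. */
static struct toy_dev *toy_dev_get(struct toy_dev *d)
{
    if (d)
        d->refcount++;
    return d;
}

/* Drop a reference; passing NULL is a harmless no-op. */
static void toy_dev_put(struct toy_dev *d)
{
    if (d)
        d->refcount--;
}

/* Toy driver state holding two optional device references. */
struct toy_mac {
    struct toy_dev *iob;
    struct toy_dev *dma;
};

/* Single cleanup path: no "if (ptr)" guards are needed at the call sites
 * because toy_dev_put() already tolerates NULL. */
static void toy_mac_release(struct toy_mac *mac)
{
    toy_dev_put(mac->iob);
    toy_dev_put(mac->dma);
    mac->iob = NULL;
    mac->dma = NULL;
}
```

With a put routine of this shape, one error label can release whatever was
acquired so far, whether or not each pointer was ever set.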


-Olof


* Re: [PATCH 2/6] PowerPC 440EPx: Sequoia DTS
From: Sergei Shtylyov @ 2007-08-24 19:10 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: linuxppc-dev, David Gibson
In-Reply-To: <b055a84cbb23ef7287c1d585c1810a74@kernel.crashing.org>

Segher Boessenkool wrote:

>>>address-permutation = <0 1 3 2 4 5 7 6 e f d c a b 9 8>;

>>Yes, I was contemplating something like that.

> Let's not define this until we need it though :-)

    Let's not even think of it, since this will end up in a "catch all" driver, 
and yet this may not be enough when the flash doesn't support 8-bit R/W, for 
example (I've already quoted it)...

>>>I haven't heard or thought of anything better either.  Using "ranges"
>>>is conceptually wrong, even ignoring the technical problems that come
>>>with it.
>>Why is "ranges" conceptually wrong?

> The flash partitions aren't separate devices sitting on a

    Yeah, that's why I decided not to go that way from the very start... though 
wait: I didn't do this simply because they're not devices.
That leads me to an interesting question: does the device tree have something 
for disk partitions?

> "flash bus", they are "sub-devices" of their parent.

    They're quite an abstraction of a device -- although Linux treats them as 
separate devices indeed.

>>To be honest this looks rather to me like another case where having
>>overlapping 'reg' and 'ranges' would actually make sense.

> It never makes sense.  You should give the "master" device
> the full "reg" range it covers, and have it define its own
> address space; "sub-devices" can carve out their little hunk
> from that.  You don't want more than one device owning the
> same address range in the same address space.

    So, no "ranges" prop in MTD node is necessary? Phew... :-)

> Segher

WBR, Sergei


* Re: RFC: issues concerning the next NAPI interface
From: Bodo Eggert @ 2007-08-24 19:04 UTC (permalink / raw)
  To: Linas Vepstas, Jan-Bernd Themann, netdev, Thomas Klein,
	Jan-Bernd Themann, linux-kernel, linux-ppc, Christoph Raisch,
	Marcus Eder, Stefan Roscher
In-Reply-To: <8VKwj-8ke-27@gated-at.bofh.it>

Linas Vepstas <linas@austin.ibm.com> wrote:
> On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:

>> 3) On modern systems the incoming packets are processed very fast. Especially
>> on SMP systems when we use multiple queues we process only a few packets
>> per napi poll cycle. So NAPI does not work very well here and the interrupt
>> rate is still high.
> 
> I saw this too, on a system that is "modern" but not terribly fast, and
> only slightly (2-way) smp. (the spidernet)
> 
> I experimented wih various solutions, none were terribly exciting.  The
> thing that killed all of them was a crazy test case that someone sprung on
> me:  They had written a worst-case network ping-pong app: send one
> packet, wait for reply, send one packet, etc.
> 
> If I waited (indefinitely) for a second packet to show up, the test case
> completely stalled (since no second packet would ever arrive).  And if I
> introduced a timer to wait for a second packet, then I just increased
> the latency in the response to the first packet, and this was noticed,
> and folks complained.

Possible solution / possible brainfart:

Introduce a timer, but don't start to use it to combine packets unless you
receive n packets within the timeframe. If you receive fewer than m packets
within one timeframe, stop using the timer. The system should now have a
decent response time when the network is idle, and when the network is
busy, nobody will complain about the latency.-)
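
A minimal sketch of that idea in user-space C, assuming fixed-length
timeframes and two thresholds n (start batching) and m (stop batching) with
m <= n. The names coalesce_ctl, ctl_init, ctl_packet and ctl_window_end are
invented for this example; nothing here is an existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>

struct coalesce_ctl {
    unsigned int on_threshold;   /* "n": packets per window that start batching */
    unsigned int off_threshold;  /* "m": fewer than this per window stops it */
    unsigned int pkts_in_window; /* packets counted in the current window */
    bool batching;               /* true while the batching timer is in use */
};

static void ctl_init(struct coalesce_ctl *c, unsigned int n, unsigned int m)
{
    assert(m <= n);              /* otherwise the two rules would fight */
    c->on_threshold = n;
    c->off_threshold = m;
    c->pkts_in_window = 0;
    c->batching = false;         /* idle network: start in low-latency mode */
}

/* Called once per received packet. */
static void ctl_packet(struct coalesce_ctl *c)
{
    c->pkts_in_window++;
}

/* Called when a timeframe ends; returns the mode for the next timeframe. */
static bool ctl_window_end(struct coalesce_ctl *c)
{
    if (!c->batching && c->pkts_in_window >= c->on_threshold)
        c->batching = true;      /* busy enough: start combining packets */
    else if (c->batching && c->pkts_in_window < c->off_threshold)
        c->batching = false;     /* quiet again: stop using the timer */

    c->pkts_in_window = 0;
    return c->batching;
}
```

The gap between m and n gives hysteresis: a one-packet-per-window ping-pong
load never switches batching on, while a load hovering between m and n keeps
whatever mode it is already in.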
-- 
Funny quotes:
22. When everything's going your way, you're in the wrong lane and and going
    the wrong way.
Friß, Spammer: rsRxhvmk@CaR.7eggert.dyndns.org m@z3T.7eggert.dyndns.org


* [PATCH] fix undefined reference to device_power_up/resume
From: Olaf Hering @ 2007-08-24 19:42 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev


Current Linus tree fails to link on pmac32:

drivers/built-in.o: In function `pmac_wakeup_devices':
via-pmu.c:(.text+0x5bab4): undefined reference to `device_power_up'
via-pmu.c:(.text+0x5bb08): undefined reference to `device_resume'
drivers/built-in.o: In function `pmac_suspend_devices':
via-pmu.c:(.text+0x5c260): undefined reference to `device_power_down'
via-pmu.c:(.text+0x5c27c): undefined reference to `device_resume'
make[1]: *** [.tmp_vmlinux1] Error 1

Changing CONFIG_PM to CONFIG_PM_SLEEP leads to:

drivers/built-in.o: In function `pmu_led_set':
via-pmu-led.c:(.text+0x5cdca): undefined reference to `pmu_sys_suspended'
via-pmu-led.c:(.text+0x5cdce): undefined reference to `pmu_sys_suspended'
drivers/built-in.o: In function `pmu_req_done':
via-pmu-led.c:(.text+0x5ce3e): undefined reference to `pmu_sys_suspended'
via-pmu-led.c:(.text+0x5ce42): undefined reference to `pmu_sys_suspended'
drivers/built-in.o: In function `adb_init':
(.init.text+0x4c5c): undefined reference to `pmu_register_sleep_notifier'
make[1]: *** [.tmp_vmlinux1] Error 1

So change even more places from PM to PM_SLEEP to allow linking.

Signed-off-by: Olaf Hering <olaf@aepfle.de>

---
 drivers/macintosh/adb.c     |    4 ++--
 drivers/macintosh/via-pmu.c |   34 +++++++++++++++++-----------------
 include/linux/pmu.h         |    2 +-
 3 files changed, 20 insertions(+), 20 deletions(-)

--- a/drivers/macintosh/adb.c
+++ b/drivers/macintosh/adb.c
@@ -89,7 +89,7 @@ static int sleepy_trackpad;
 static int autopoll_devs;
 int __adb_probe_sync;
 
-#ifdef CONFIG_PM
+#ifdef CONFIG_PM_SLEEP
 static void adb_notify_sleep(struct pmu_sleep_notifier *self, int when);
 static struct pmu_sleep_notifier adb_sleep_notifier = {
 	adb_notify_sleep,
@@ -313,7 +313,7 @@ int __init adb_init(void)
 		printk(KERN_WARNING "Warning: no ADB interface detected\n");
 		adb_controller = NULL;
 	} else {
-#ifdef CONFIG_PM
+#ifdef CONFIG_PM_SLEEP
 		pmu_register_sleep_notifier(&adb_sleep_notifier);
 #endif /* CONFIG_PM */
 #ifdef CONFIG_PPC
--- a/drivers/macintosh/via-pmu.c
+++ b/drivers/macintosh/via-pmu.c
@@ -152,10 +152,10 @@ static spinlock_t pmu_lock;
 static u8 pmu_intr_mask;
 static int pmu_version;
 static int drop_interrupts;
-#if defined(CONFIG_PM) && defined(CONFIG_PPC32)
+#if defined(CONFIG_PM_SLEEP) && defined(CONFIG_PPC32)
 static int option_lid_wakeup = 1;
-#endif /* CONFIG_PM && CONFIG_PPC32 */
-#if (defined(CONFIG_PM)&&defined(CONFIG_PPC32))||defined(CONFIG_PMAC_BACKLIGHT_LEGACY)
+#endif /* CONFIG_PM_SLEEP && CONFIG_PPC32 */
+#if (defined(CONFIG_PM_SLEEP)&&defined(CONFIG_PPC32))||defined(CONFIG_PMAC_BACKLIGHT_LEGACY)
 static int sleep_in_progress;
 #endif
 static unsigned long async_req_locks;
@@ -875,7 +875,7 @@ proc_read_options(char *page, char **sta
 {
 	char *p = page;
 
-#if defined(CONFIG_PM) && defined(CONFIG_PPC32)
+#if defined(CONFIG_PM_SLEEP) && defined(CONFIG_PPC32)
 	if (pmu_kind == PMU_KEYLARGO_BASED &&
 	    pmac_call_feature(PMAC_FTR_SLEEP_STATE,NULL,0,-1) >= 0)
 		p += sprintf(p, "lid_wakeup=%d\n", option_lid_wakeup);
@@ -916,7 +916,7 @@ proc_write_options(struct file *file, co
 	*(val++) = 0;
 	while(*val == ' ')
 		val++;
-#if defined(CONFIG_PM) && defined(CONFIG_PPC32)
+#if defined(CONFIG_PM_SLEEP) && defined(CONFIG_PPC32)
 	if (pmu_kind == PMU_KEYLARGO_BASED &&
 	    pmac_call_feature(PMAC_FTR_SLEEP_STATE,NULL,0,-1) >= 0)
 		if (!strcmp(label, "lid_wakeup"))
@@ -1738,7 +1738,7 @@ pmu_present(void)
 	return via != 0;
 }
 
-#ifdef CONFIG_PM
+#ifdef CONFIG_PM_SLEEP
 
 static LIST_HEAD(sleep_notifiers);
 
@@ -1769,9 +1769,9 @@ pmu_unregister_sleep_notifier(struct pmu
 	return 0;
 }
 EXPORT_SYMBOL(pmu_unregister_sleep_notifier);
-#endif /* CONFIG_PM */
+#endif /* CONFIG_PM_SLEEP */
 
-#if defined(CONFIG_PM) && defined(CONFIG_PPC32)
+#if defined(CONFIG_PM_SLEEP) && defined(CONFIG_PPC32)
 
 /* Sleep is broadcast last-to-first */
 static void broadcast_sleep(int when)
@@ -2390,7 +2390,7 @@ powerbook_sleep_3400(void)
 	return 0;
 }
 
-#endif /* CONFIG_PM && CONFIG_PPC32 */
+#endif /* CONFIG_PM_SLEEP && CONFIG_PPC32 */
 
 /*
  * Support for /dev/pmu device
@@ -2573,7 +2573,7 @@ pmu_ioctl(struct inode * inode, struct f
 	int error = -EINVAL;
 
 	switch (cmd) {
-#if defined(CONFIG_PM) && defined(CONFIG_PPC32)
+#if defined(CONFIG_PM_SLEEP) && defined(CONFIG_PPC32)
 	case PMU_IOC_SLEEP:
 		if (!capable(CAP_SYS_ADMIN))
 			return -EACCES;
@@ -2601,7 +2601,7 @@ pmu_ioctl(struct inode * inode, struct f
 			return put_user(0, argp);
 		else
 			return put_user(1, argp);
-#endif /* CONFIG_PM && CONFIG_PPC32 */
+#endif /* CONFIG_PM_SLEEP && CONFIG_PPC32 */
 
 #ifdef CONFIG_PMAC_BACKLIGHT_LEGACY
 	/* Compatibility ioctl's for backlight */
@@ -2757,7 +2757,7 @@ pmu_polled_request(struct adb_request *r
  * to do suspend-to-disk.
  */
 
-#if defined(CONFIG_PM) && defined(CONFIG_PPC32)
+#if defined(CONFIG_PM_SLEEP) && defined(CONFIG_PPC32)
 
 int pmu_sys_suspended;
 
@@ -2792,7 +2792,7 @@ static int pmu_sys_resume(struct sys_dev
 	return 0;
 }
 
-#endif /* CONFIG_PM && CONFIG_PPC32 */
+#endif /* CONFIG_PM_SLEEP && CONFIG_PPC32 */
 
 static struct sysdev_class pmu_sysclass = {
 	set_kset_name("pmu"),
@@ -2803,10 +2803,10 @@ static struct sys_device device_pmu = {
 };
 
 static struct sysdev_driver driver_pmu = {
-#if defined(CONFIG_PM) && defined(CONFIG_PPC32)
+#if defined(CONFIG_PM_SLEEP) && defined(CONFIG_PPC32)
 	.suspend	= &pmu_sys_suspend,
 	.resume		= &pmu_sys_resume,
-#endif /* CONFIG_PM && CONFIG_PPC32 */
+#endif /* CONFIG_PM_SLEEP && CONFIG_PPC32 */
 };
 
 static int __init init_pmu_sysfs(void)
@@ -2841,10 +2841,10 @@ EXPORT_SYMBOL(pmu_wait_complete);
 EXPORT_SYMBOL(pmu_suspend);
 EXPORT_SYMBOL(pmu_resume);
 EXPORT_SYMBOL(pmu_unlock);
-#if defined(CONFIG_PM) && defined(CONFIG_PPC32)
+#if defined(CONFIG_PM_SLEEP) && defined(CONFIG_PPC32)
 EXPORT_SYMBOL(pmu_enable_irled);
 EXPORT_SYMBOL(pmu_battery_count);
 EXPORT_SYMBOL(pmu_batteries);
 EXPORT_SYMBOL(pmu_power_flags);
-#endif /* CONFIG_PM && CONFIG_PPC32 */
+#endif /* CONFIG_PM_SLEEP && CONFIG_PPC32 */
 
--- a/include/linux/pmu.h
+++ b/include/linux/pmu.h
@@ -226,7 +226,7 @@ extern unsigned int pmu_power_flags;
 extern void pmu_backlight_init(void);
 
 /* some code needs to know if the PMU was suspended for hibernation */
-#if defined(CONFIG_PM) && defined(CONFIG_PPC32)
+#if defined(CONFIG_PM_SLEEP) && defined(CONFIG_PPC32)
 extern int pmu_sys_suspended;
 #else
 /* if power management is not configured it can't be suspended */


* Re: RFC: issues concerning the next NAPI interface
From: Linas Vepstas @ 2007-08-24 20:42 UTC (permalink / raw)
  To: Bodo Eggert
  Cc: Thomas Klein, Jan-Bernd Themann, netdev, linux-kernel,
	Christoph Raisch, linux-ppc, Jan-Bernd Themann, Marcus Eder,
	Stefan Roscher
In-Reply-To: <E1IOeSm-0000bm-Jo@be1.lrz>

On Fri, Aug 24, 2007 at 09:04:56PM +0200, Bodo Eggert wrote:
> Linas Vepstas <linas@austin.ibm.com> wrote:
> > On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:
> >> 3) On modern systems the incoming packets are processed very fast. Especially
> >> on SMP systems when we use multiple queues we process only a few packets
> >> per napi poll cycle. So NAPI does not work very well here and the interrupt
> >> rate is still high.
> > 
> > worst-case network ping-pong app: send one
> > packet, wait for reply, send one packet, etc.
> 
> Possible solution / possible brainfart:
> 
> Introduce a timer, but don't start to use it to combine packets unless you
> receive n packets within the timeframe. If you receive less than m packets
> within one timeframe, stop using the timer. The system should now have a
> decent response time when the network is idle, and when the network is
> busy, nobody will complain about the latency.-)

Ohh, that was inspirational. Let me free-associate some wild ideas.

Suppose we keep a running average of the recent packet arrival rate.
Let's say it's 10 per millisecond ("typical" for a gigabit eth running
flat-out).  If we could poll the driver at a rate of 10-20 per
millisecond (i.e. letting the OS do other useful work for 0.05 millisec),
then we could potentially service the card without ever having to enable 
interrupts on the card, and without hurting latency.

If the packet arrival rate becomes slow enough, we go back to an
interrupt-driven scheme (to keep latency down).
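
To make the running-average idea concrete, here is a small user-space sketch
in C. It assumes the packet count is sampled once per millisecond and uses an
exponentially weighted moving average with weight 1/8; that weighting, the
thresholds POLL_ABOVE and IRQ_BELOW, and the names rate_est, rate_avg and
rate_update are all choices made for this example only.

```c
#include <assert.h>
#include <stdbool.h>

#define POLL_ABOVE 8u  /* hypothetical: average pkts/ms at which polling starts */
#define IRQ_BELOW  2u  /* hypothetical: average pkts/ms below which we revert */

struct rate_est {
    unsigned int avg_x8;  /* running average of pkts/ms, scaled by 8 */
    bool polling;         /* true: timer-polled, false: interrupt-driven */
};

/* Current average in packets per millisecond (rounded down). */
static unsigned int rate_avg(const struct rate_est *r)
{
    return r->avg_x8 / 8u;
}

/* Fold in the packet count seen during the last millisecond, then pick
 * the mode: avg_new = avg_old * 7/8 + sample * 1/8, in fixed point. */
static void rate_update(struct rate_est *r, unsigned int pkts_last_ms)
{
    assert(pkts_last_ms <= 1000000u);  /* keeps the scaled value in range */

    r->avg_x8 = r->avg_x8 - r->avg_x8 / 8u + pkts_last_ms;

    if (!r->polling && rate_avg(r) >= POLL_ABOVE)
        r->polling = true;     /* sustained load: poll, leave interrupts off */
    else if (r->polling && rate_avg(r) < IRQ_BELOW)
        r->polling = false;    /* traffic has died down: back to interrupts */
}
```

Because the average moves slowly, a single busy millisecond does not flip the
mode, which is the point of averaging rather than reacting to each sample.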

The main problem here is that, even for HZ=1000 machines, this amounts 
to 10-20 polls per jiffy.  Which, if implemented in kernel, requires 
using the high-resolution timers. And, umm, don't the HR timers require
a cpu timer interrupt to make them go? So it's not clear that this is much
of a win.
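
The arithmetic behind those figures, as a tiny C sketch. The helper names
poll_interval_us and polls_per_jiffy are made up for this example; the only
inputs are the poll rate in polls per millisecond and the kernel tick rate HZ.

```c
#include <assert.h>

/* Microseconds between polls at a given poll rate (polls per millisecond). */
static unsigned int poll_interval_us(unsigned int polls_per_ms)
{
    assert(polls_per_ms > 0);
    return 1000u / polls_per_ms;
}

/* Number of polls that fall inside one jiffy at tick rate hz.
 * One jiffy lasts 1000/hz milliseconds. */
static unsigned int polls_per_jiffy(unsigned int polls_per_ms, unsigned int hz)
{
    assert(hz > 0);
    return polls_per_ms * 1000u / hz;
}
```

At 20 polls per millisecond the interval is 50 microseconds (the 0.05
millisec above), and with HZ=1000 that is 20 polls per jiffy; a lower HZ makes
the mismatch with jiffy-based timers larger still.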

The eHEA is a 10 gigabit device, so it can expect 80-100 packets per
millisecond for large packets, and even more, say 1K packets per
millisec, for small packets. (Even the spec for my 1Gb spidernet card
claims its internal rate is 1M packets/sec.) 

Another possibility is to set HZ to 5000 or 20000 or something humongous
... after all CPUs are now faster! But, since this might be wasteful,
maybe we could make HZ be dynamically variable: have high HZ rates when
there's lots of network/disk activity, and low HZ rates when not. That
means a non-constant jiffy.

If all drivers used interrupt mitigation, then the variable-high
frequency jiffy could take their place, and be more "fair" to everyone.
Most drivers would be polled most of the time when they're busy, and 
only use interrupts when they're not.
 
--linas


* Re: [PATCH 2/6] PowerPC 440EPx: Sequoia DTS
From: Segher Boessenkool @ 2007-08-24 20:43 UTC (permalink / raw)
  To: Sergei Shtylyov; +Cc: linuxppc-dev, David Gibson
In-Reply-To: <46CF2D2C.2050306@ru.mvista.com>

>>>> address-permutation = <0 1 3 2 4 5 7 6 e f d c a b 9 8>;
>
>>> Yes, I was contemplating something like that.
>
>> Let's not define this until we need it though :-)
>
>    Let's ot even think of it,

It is good to think about it, for the simple reason that it
validates whether the current design is future-proof or not.

> since this will end up in a "catch all" driver,

Yeah, we shouldn't _define_ anything like this, not until
it is needed anyway.

> and yet this may be not enough when the flash doesn't support 8-but 
> R/W, for example (I've already quoted it...

Yeah.  There is no need to future-proof to insane designs anyway;
whatever can not fit in the "generic" framework can bloody well
just do its own binding, no need to pollute the generic thing.

>>>> I haven't heard or thought of anything better either.  Using 
>>>> "ranges"
>>>> is conceptually wrong, even ignoring the technical problems that 
>>>> come
>>>> with it.
>>> Why is "ranges" conceptually wrong?
>
>> The flash partitions aren't separate devices sitting on a
>
>    Yeah, that's why I decided not to go that from the very start... 
> though wait: I didn't do this simply because they'renot devices.
> That lead me to interesting question: do device tree have something 
> for the disk partitions?

Some do.  Most don't.  There is no standardised binding I know of.

The big huge difference here is that disks typically do contain
partitioning information on the disk itself, and flash doesn't.

>> "flash bus", they are "sub-devices" of their parent.
>
>    They're quite an abstaction of a device -- althogh Linux treats 
> them as separate devices indeed.

Sure, it's a pseudo-device.  Nothing new there.

>>> To be honest this looks rather to me like another case where having
>>> overlapping 'reg' and 'ranges' would actually make sense.
>
>> It never makes sense.  You should give the "master" device
>> the full "reg" range it covers, and have it define its own
>> address space; "sub-devices" can carve out their little hunk
>> from that.  You don't want more than one device owning the
>> same address range in the same address space.
>
>    So, no "ranges" prop in MTD node is necessary? Phew... :-)

Yeah, it would be positively harmful.  They are pseudo-devices
only, the kernel device driver needs to always access the real
device.


Segher


* Re: 8555CDS BSP on 8548CDS board
From: mike zheng @ 2007-08-24 20:47 UTC (permalink / raw)
  To: Andy Fleming; +Cc: linuxppc-embedded
In-Reply-To: <59B03E9A-3FE0-4F11-AEB5-CEAE6AB38F92@freescale.com>

[-- Attachment #1: Type: text/plain, Size: 1451 bytes --]

Hi Andy,

U-Boot is fine; I will re-use the current U-Boot on the 8548CDS board. But the
BSP Linux kernel is 2.6, and I have to use a 2.4 kernel.

I merged E500 support (8548CDS BSP and kernel) from 2.6 back into my
2.4 source tree. I managed to get the serial port driver (poll mode)
working. The code already gets past "console_init()". However, the console
output does NOT appear. What should I check: exceptions, interrupts, or the
decrementer?

So I am thinking of restarting from an 8555CDS BSP on the 2.4 kernel and
modifying it for the 8548CDS board. I am not sure which approach is easier.
Maybe both are difficult for me. :-(

Thanks,


On 8/24/07, Andy Fleming <afleming@freescale.com> wrote:
>
>
> On Aug 24, 2007, at 11:00, mike zheng wrote:
>
> > Hi,
> >
> > I was told Freescale's 8555CDS board is very similar to 8548CDS
> > board. I just wonder what exactly the differences are. can I just
> > put the 8555CDS BSP onto the 8548CDS board?
> >
> > Thanks  in advance,
>
>
> The 8555 u-boot is different from the 8548 u-boot.  There are also
> differences in the device-tree (I'm not sure what version of the
> kernel is in the BSP, so I can't say for sure).  Recent versions of
> the Linux kernel merged all of the CDS systems into one kernel.
>
> As for the differences, off the top of my head:
>
> * 8555 vs 8548 chip
> * PCI slot on the carrier card is PCI on 8555, PCIe on 8548.
> * 8548 has 4 eTSECs, 8555 has 2 TSECs (and the # of ethernet ports
> reflects this).
>
> Andy
>

[-- Attachment #2: Type: text/html, Size: 2008 bytes --]


* Re: RFC: issues concerning the next NAPI interface
From: Jan-Bernd Themann @ 2007-08-24 21:11 UTC (permalink / raw)
  To: Linas Vepstas
  Cc: Thomas Klein, Jan-Bernd Themann, netdev, linux-kernel, linux-ppc,
	Bodo Eggert, Christoph Raisch, Marcus Eder, Stefan Roscher
In-Reply-To: <20070824204243.GI4282@austin.ibm.com>

Linas Vepstas wrote:
> On Fri, Aug 24, 2007 at 09:04:56PM +0200, Bodo Eggert wrote:
>   
>> Linas Vepstas <linas@austin.ibm.com> wrote:
>>     
>>> On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:
>>>       
>>>> 3) On modern systems the incoming packets are processed very fast. Especially
>>>> on SMP systems when we use multiple queues we process only a few packets
>>>> per napi poll cycle. So NAPI does not work very well here and the interrupt
>>>> rate is still high.
>>>>         
>>> worst-case network ping-pong app: send one
>>> packet, wait for reply, send one packet, etc.
>>>       
>> Possible solution / possible brainfart:
>>
>> Introduce a timer, but don't start to use it to combine packets unless you
>> receive n packets within the timeframe. If you receive less than m packets
>> within one timeframe, stop using the timer. The system should now have a
>> decent response time when the network is idle, and when the network is
>> busy, nobody will complain about the latency.-)
>>     
>
> Ohh, that was inspirational. Let me free-associate some wild ideas.
>
> Suppose we keep a running average of the recent packet arrival rate,
> Lets say its 10 per millisecond ("typical" for a gigabit eth runnning
> flat-out).  If we could poll the driver at a rate of 10-20 per
> millisecond (i.e. letting the OS do other useful work for 0.05 millisec),
> then we could potentially service the card without ever having to enable 
> interrupts on the card, and without hurting latency.
>
> If the packet arrival rate becomes slow enough, we go back to an
> interrupt-driven scheme (to keep latency down).
>
> The main problem here is that, even for HZ=1000 machines, this amounts 
> to 10-20 polls per jiffy.  Which, if implemented in kernel, requires 
> using the high-resolution timers. And, umm, don't the HR timers require
> a cpu timer interrupt to make them go? So its not clear that this is much
> of a win.
>   
That is indeed a good question. At least for 10G eHEA we see
that the average number of packets/poll cycle is very low.
With high precision timers we could control the poll interval
better and thus make sure we get enough packets on the queue in
high load situations to benefit from LRO while keeping the
latency moderate. When the traffic load is low we could just
stick to plain NAPI. I don't know how expensive hp timers are,
we probably just have to test it (when they are available for
POWER in our case). However, having more packets
per poll run would make LRO more efficient and thus the total
CPU utilization would decrease.

I guess on most systems there are not many different network
cards working in parallel. So if the driver could set the poll
interval for its devices, it could be well optimized depending
on the NICs characteristics.

Maybe it would be good enough to have a timer that schedules
the device for NAPI (and thus triggers SoftIRQs, which will
trigger NAPI). Whether this timer would be used via a generic
interface or would be implemented as a proprietary solution
would depend on whether other drivers want / need this feature
as well. Drivers / NICs that work fine with plain NAPI don't
have to use the timer :-)

I tried to implement something with "normal" timers, but the result
was anything but great. The timers seem to be far too slow.
I'm not sure if it helps to increase it from 1000HZ to 2500HZ
or more.

Regards,
Jan-Bernd

^ permalink raw reply

* Re: RFC: issues concerning the next NAPI interface
From: James Chapman @ 2007-08-24 17:16 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Thomas Klein, Jan-Bernd Themann, netdev, linux-kernel,
	Christoph Raisch, linux-ppc, akepner, Marcus Eder,
	Jan-Bernd Themann, Stefan Roscher
In-Reply-To: <20070824085203.42f4305c@freepuppy.rosehill.hemminger.net>

Stephen Hemminger wrote:
> On Fri, 24 Aug 2007 17:47:15 +0200
> Jan-Bernd Themann <ossthema@de.ibm.com> wrote:
> 
>> Hi,
>>
>> On Friday 24 August 2007 17:37, akepner@sgi.com wrote:
>>> On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:
>>>> .......
>>>> 3) On modern systems the incoming packets are processed very fast. Especially
>>>>    on SMP systems when we use multiple queues we process only a few packets
>>>>    per napi poll cycle. So NAPI does not work very well here and the interrupt 
>>>>    rate is still high. What we need would be some sort of timer polling mode 
>>>>    which will schedule a device after a certain amount of time for high load 
>>>>    situations. With high precision timers this could work well. Current
>>>>    usual timers are too slow. A finer granularity would be needed to keep the
>>>>    latency down (and queue length moderate).
>>>>
>>> We found the same on ia64-sn systems with tg3 a couple of years 
>>> ago. Using simple interrupt coalescing ("don't interrupt until 
>>> you've received N packets or M usecs have elapsed") worked 
>>> reasonably well in practice. If your h/w supports that (and I'd 
>>> guess it does, since it's such a simple thing), you might try 
>>> it.
>>>
>> I don't see how this should work. Our latest machines are fast enough that they
>> simply empty the queue during the first poll iteration (in most cases).
>> Even if you wait until X packets have been received, it does not help for
>> the next poll cycle. The average number of packets we process per poll queue
>> is low. So a timer would be preferable that periodically polls the 
>> queue, without the need of generating a HW interrupt. This would allow us
>> to wait until a reasonable amount of packets have been received in the meantime
>> to keep the poll overhead low. This would also be useful in combination
>> with LRO.
>>
> 
> You need hardware support for deferred interrupts. Most devices have it (e1000, sky2, tg3)
> and it interacts well with NAPI. It is not a generic thing you want done by the stack,
> you want the hardware to hold off interrupts until X packets or Y usecs have expired.

Does hardware interrupt mitigation really interact well with NAPI? In my 
experience, holding off interrupts for X packets or Y usecs does more 
harm than good; such hardware features are useful only when the OS has 
no NAPI-like mechanism.

When tuning NAPI drivers for packets/sec performance (which is a good 
indicator of driver performance), I make sure that the driver stays in 
NAPI polled mode while it has any rx or tx work to do. If the CPU is 
fast enough that all work is always completed on each poll, I have the 
driver stay in polled mode until dev->poll() is called N times with no 
work being done. This keeps interrupts disabled for reasonable traffic 
levels, while minimizing packet processing latency. No need for hardware 
interrupt mitigation.
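
The "stay in polled mode until N idle polls" rule described above can be
sketched as a small state machine. This is only an illustration under
stated assumptions: struct poll_state, idle_limit and
stay_in_polled_mode() are hypothetical names, not kernel interfaces, and
a real driver would make this decision inside its dev->poll() routine.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device polling state; not a kernel structure. */
struct poll_state {
	unsigned int idle_polls;	/* consecutive polls that found no work */
	unsigned int idle_limit;	/* N: idle polls tolerated before leaving polled mode */
};

/*
 * Record the result of one poll and decide what to do next.
 * Returns true to stay in polled mode (interrupts stay disabled),
 * false to leave polled mode and re-enable interrupts.
 */
static bool stay_in_polled_mode(struct poll_state *st, unsigned int work_done)
{
	assert(st->idle_limit > 0);

	if (work_done > 0) {
		st->idle_polls = 0;	/* any rx/tx work resets the idle count */
		return true;
	}

	if (++st->idle_polls < st->idle_limit)
		return true;		/* idle, but not for long enough yet */

	st->idle_polls = 0;		/* start fresh after the next interrupt */
	return false;
}
```

With idle_limit set to 3, a burst followed by three empty polls leaves
polled mode on the third empty poll; any work in between resets the
count.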

> The parameters for controlling it are already in ethtool, the issue is finding a good
> default set of values for a wide range of applications and architectures. Maybe some
> heuristic based on processor speed would be a good starting point. The dynamic irq
> moderation stuff is not widely used because it is too hard to get right.

I agree. It would be nice to find a way for the typical user to derive 
best values for these knobs for his/her particular system. Perhaps a 
tool using pktgen and network device phy internal loopback could be 
developed?

-- 
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development

^ permalink raw reply

* Re: RFC: issues concerning the next NAPI interface
From: David Miller @ 2007-08-24 21:32 UTC (permalink / raw)
  To: ossthema
  Cc: tklein, themann, netdev, linux-kernel, linuxppc-dev, raisch,
	meder, stefan.roscher
In-Reply-To: <200708241559.17055.ossthema@de.ibm.com>

From: Jan-Bernd Themann <ossthema@de.ibm.com>
Date: Fri, 24 Aug 2007 15:59:16 +0200

>    It would be nice if it is possible to schedule queues to other CPU's, or
>    at least to use interrupts to put the queue to another cpu (not nice for
>    as you never know which one you will hit).
>    I'm not sure how bad the tradeoff would be.

Once the per-cpu NAPI poll queues start needing locks, much of the
gain will be lost.  This is strictly what we want to avoid.

We need real facilities for IRQ distribution policies.  With that none
of this is an issue.

This is also a platform-specific problem with IRQ behavior; the IRQ
distribution scheme you mention would never occur on sparc64, for
example.  We use a fixed round-robin distribution of interrupts to
CPUs there; they don't move.

Each scheme has its advantages, but you want a different scheme here
than what is implemented and the fix is therefore not in the
networking :-)

Furthermore, most cards that will be using multi-queue will be
using hashes on the packet headers to choose the MSI-X interrupt
and thus the cpu to be targeted.  Those cards will want fixed
instead of dynamic interrupt to cpu distribution schemes as well,
so your problem is not unique and they'll need the same fix as
you do.
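
As a rough illustration of why header hashing wants a fixed
interrupt-to-cpu mapping: if the queue is a pure function of the flow's
header fields, every packet of a flow lands on the same queue, and the
locality is kept only if that queue's interrupt stays on one cpu. The
sketch below is an assumption-laden stand-in: struct flow_key and
flow_to_queue() are hypothetical, and FNV-1a is used only because it is
short; real cards compute their own hash in hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flow key built from packet header fields. */
struct flow_key {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* Mix the low nbytes bytes of v into an FNV-1a hash (stand-in hash only). */
static uint32_t fnv1a_step(uint32_t h, uint32_t v, int nbytes)
{
	for (int i = 0; i < nbytes; i++) {
		h ^= (v >> (8 * i)) & 0xff;
		h *= 16777619u;		/* 32-bit FNV prime */
	}
	return h;
}

/* Map a flow to one of nqueues receive queues. */
static unsigned int flow_to_queue(const struct flow_key *k, unsigned int nqueues)
{
	uint32_t h = 2166136261u;	/* 32-bit FNV offset basis */

	assert(nqueues > 0);
	h = fnv1a_step(h, k->saddr, 4);
	h = fnv1a_step(h, k->daddr, 4);
	h = fnv1a_step(h, k->sport, 2);
	h = fnv1a_step(h, k->dport, 2);
	return h % nqueues;
}
```

Two packets carrying the same four header fields always yield the same
queue index, whichever cpu happens to be busy at the time.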

^ permalink raw reply

* Re: RFC: issues concerning the next NAPI interface
From: Linas Vepstas @ 2007-08-24 21:35 UTC (permalink / raw)
  To: Jan-Bernd Themann
  Cc: Thomas Klein, Jan-Bernd Themann, netdev, linux-kernel, linux-ppc,
	Bodo Eggert, Christoph Raisch, Marcus Eder, Stefan Roscher
In-Reply-To: <46CF499C.60009@de.ibm.com>

On Fri, Aug 24, 2007 at 11:11:56PM +0200, Jan-Bernd Themann wrote:
> (when they are available for
> POWER in our case). 

hrtimer worked fine on the powerpc cell arch last summer.
I assume they work on p5 and p6 too, no ??

> I tried to implement something with "normal" timers, but the result
> was anything but great. The timers seem to be far too slow.
> I'm not sure if it helps to increase it from 1000HZ to 2500HZ
> or more.

Heh. Do the math. Even on 1gigabit cards, that's not enough:

(1gigabit/sec) x (byte/8 bits) x (packet/1500bytes) x (sec/1000 jiffy) 

is 83 packets a jiffy (for big packets, even more for small packets, 
and more again for 10 gigabit cards). So polling once per jiffy is a 
latency disaster.
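
The estimate above can be reproduced with integer arithmetic. This is a
minimal sketch; packets_per_jiffy() is a made-up helper, and like the
estimate it ignores Ethernet framing overhead.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Packets arriving during one timer tick at full line rate.
 *   link_bps:  link speed in bits per second
 *   pkt_bytes: packet size in bytes (framing overhead ignored)
 *   hz:        timer ticks per second
 */
static uint64_t packets_per_jiffy(uint64_t link_bps, uint64_t pkt_bytes,
				  uint64_t hz)
{
	assert(pkt_bytes > 0 && hz > 0);
	return link_bps / 8 / pkt_bytes / hz;
}
```

For 1 gigabit, 1500-byte packets and HZ=1000 this gives 83. The same
link with 64-byte packets gives 1953, a 10 gigabit link with 1500-byte
packets gives 833, and raising HZ to 2500 on the 1 gigabit link still
leaves 33 packets per tick.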

--linas  

^ permalink raw reply

* Re: RFC: issues concerning the next NAPI interface
From: David Miller @ 2007-08-24 21:37 UTC (permalink / raw)
  To: ossthema
  Cc: tklein, themann, netdev, linux-kernel, linuxppc-dev, raisch,
	meder, stefan.roscher
In-Reply-To: <200708241559.17055.ossthema@de.ibm.com>

From: Jan-Bernd Themann <ossthema@de.ibm.com>
Date: Fri, 24 Aug 2007 15:59:16 +0200

> 1) The current implementation of netif_rx_schedule, netif_rx_complete
>    and the net_rx_action have the following problem: netif_rx_schedule
>    sets the NAPI_STATE_SCHED flag and adds the NAPI instance to the poll_list.
>    netif_rx_action checks NAPI_STATE_SCHED, if set it will add the device
>    to the poll_list again (as well). netif_rx_complete clears the NAPI_STATE_SCHED.
>    If an interrupt handler calls netif_rx_schedule on CPU 2
>    after netif_rx_complete has been called on CPU 1 (and the poll function
>    has not returned yet), the NAPI instance will be added twice to the
>    poll_list (by netif_rx_schedule and net_rx_action). Problems occur when
>    netif_rx_complete is called twice for the device (BUG() called)

Indeed, this is the "who should manage the list" problem.
Probably the answer is that whoever transitions the NAPI_STATE_SCHED
bit from cleared to set should do the list addition.

Patches welcome :-)
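
One way to picture the "whoever transitions the bit does the list
addition" rule is an atomic test-and-set whose return value picks the
owner. The sketch below uses C11 atomics and is not the kernel's
implementation: struct poll_instance, SCHED_BIT, schedule_instance() and
complete_instance() are hypothetical stand-ins, and the poll_list is
reduced to a counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SCHED_BIT 0x1u	/* stand-in for NAPI_STATE_SCHED */

/* Hypothetical, simplified instance; not the kernel's NAPI structure. */
struct poll_instance {
	atomic_uint state;
	unsigned int list_adds;	/* counts additions to an imaginary poll_list */
};

/*
 * Atomically set SCHED_BIT. Only the caller that sees the bit go from
 * cleared to set performs the list addition.
 * Returns true if this caller did the addition.
 */
static bool schedule_instance(struct poll_instance *p)
{
	unsigned int old = atomic_fetch_or(&p->state, SCHED_BIT);

	if (old & SCHED_BIT)
		return false;	/* already scheduled by someone else */

	p->list_adds++;		/* the single owner adds to the list */
	return true;
}

/* Clear SCHED_BIT once polling is finished; completing twice is a bug. */
static void complete_instance(struct poll_instance *p)
{
	assert(atomic_load(&p->state) & SCHED_BIT);
	atomic_fetch_and(&p->state, ~SCHED_BIT);
}
```

A second schedule_instance() call made while the bit is still set
returns false and leaves the list alone, so the instance cannot end up
on the list twice.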

> 3) On modern systems the incoming packets are processed very fast. Especially
>    on SMP systems when we use multiple queues we process only a few packets
>    per napi poll cycle. So NAPI does not work very well here and the interrupt
>    rate is still high. What we need would be some sort of timer polling mode
>    which will schedule a device after a certain amount of time for high load
>    situations. With high precision timers this could work well. Current
>    usual timers are too slow. A finer granularity would be needed to keep the
>    latency down (and queue length moderate).

This is why minimal levels of HW interrupt mitigation should be enabled
in your chip.  If it does not support this, you will indeed need to look
into using high resolution timers or other schemes to alleviate this.

I do not think it deserves a generic core networking helper facility,
the chips that can't mitigate interrupts are few and obscure.

^ permalink raw reply

* Re: RFC: issues concerning the next NAPI interface
From: David Miller @ 2007-08-24 21:43 UTC (permalink / raw)
  To: linas
  Cc: tklein, themann, netdev, linux-kernel, raisch, linuxppc-dev,
	ossthema, meder, stefan.roscher
In-Reply-To: <20070824164541.GG4282@austin.ibm.com>

From: linas@austin.ibm.com (Linas Vepstas)
Date: Fri, 24 Aug 2007 11:45:41 -0500

> In the end, I just let it be, and let the system work as a
> busy-beaver, with the high interrupt rate. Is this a wise thing to
> do?

The tradeoff is always going to be latency vs. throughput.

A sane default should defer enough to catch multiple packets coming in
at something close to line rate, but not so much that latency unduly
suffers.
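
The "N packets or M usecs" coalescing rule mentioned earlier in the
thread makes this tradeoff explicit. Below is a minimal sketch of the
decision the hardware makes; struct coalesce_cfg and should_interrupt()
are hypothetical names, and actual devices implement this in silicon
with their own parameters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical coalescing parameters. */
struct coalesce_cfg {
	unsigned int max_frames;	/* N: interrupt after this many packets */
	uint64_t max_usecs;		/* M: ... or after this much time */
};

/*
 * Decide whether to raise the receive interrupt now.
 *   pending_frames:    packets received since the last interrupt
 *   usecs_since_first: time since the first of those packets arrived
 */
static bool should_interrupt(const struct coalesce_cfg *cfg,
			     unsigned int pending_frames,
			     uint64_t usecs_since_first)
{
	assert(cfg->max_frames > 0);

	if (pending_frames == 0)
		return false;		/* nothing to report */
	return pending_frames >= cfg->max_frames ||
	       usecs_since_first >= cfg->max_usecs;
}
```

Near line rate the frame count trips first, so several packets share one
interrupt. In a ping-pong workload only one packet is ever pending, so
the time bound is what fires, and max_usecs is the latency added to each
round trip.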

^ permalink raw reply

* Re: RFC: issues concerning the next NAPI interface
From: David Miller @ 2007-08-24 21:44 UTC (permalink / raw)
  To: dlstevens
  Cc: tklein, themann, stefan.roscher, netdev, linux-kernel, raisch,
	netdev-owner, linuxppc-dev, akepner, meder, ossthema, shemminger
In-Reply-To: <OF0002C3DB.2E4C4A38-ON88257341.005B1253-88257341.005C7C2A@us.ibm.com>

From: David Stevens <dlstevens@us.ibm.com>
Date: Fri, 24 Aug 2007 09:50:58 -0700

>         Problem is if it increases rapidly, you may drop packets
> before you notice that the ring is full in the current estimated
> interval.

This is one of many reasons why hardware interrupt mitigation
is really needed for this.

^ permalink raw reply

* Re: RFC: issues concerning the next NAPI interface
From: David Miller @ 2007-08-24 21:47 UTC (permalink / raw)
  To: jchapman
  Cc: tklein, themann, stefan.roscher, netdev, linux-kernel, raisch,
	linuxppc-dev, akepner, meder, ossthema, shemminger
In-Reply-To: <46CF127D.1090609@katalix.com>

From: James Chapman <jchapman@katalix.com>
Date: Fri, 24 Aug 2007 18:16:45 +0100

> Does hardware interrupt mitigation really interact well with NAPI?

It interacts quite excellently.

There was a long saga about this with tg3 and huge SGI numa
systems with large costs for interrupt processing, and the
fix was to do a minimal amount of interrupt mitigation and
this basically cleared up all the problems.

Someone should reference that thread _now_ before this discussion goes
too far and we repeat a lot of information and people like myself have
to stay up all night correcting the misinformation and
misunderstandings that are basically guaranteed for this topic :)

^ permalink raw reply

* [PATCH] Handle alignment faults on SPE load/store instructions
From: Kumar Gala @ 2007-08-24 21:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: fray

This adds code to handle alignment traps generated by the following
SPE (signal processing engine) load/store instructions, by emulating
the instruction in the kernel (as is done for other instructions that
generate alignment traps):

evldd[x]         Vector Load Double Word into Double Word [Indexed]
evldw[x]         Vector Load Double into Two Words [Indexed]
evldh[x]         Vector Load Double into Four Half Words [Indexed]
evlhhesplat[x]   Vector Load Half Word into Half Words Even and Splat [Indexed]
evlhhousplat[x]  Vector Load Half Word into Half Word Odd Unsigned and Splat [Indexed]
evlhhossplat[x]  Vector Load Half Word into Half Word Odd Signed and Splat [Indexed]
evlwhe[x]        Vector Load Word into Two Half Words Even [Indexed]
evlwhou[x]       Vector Load Word into Two Half Words Odd Unsigned (zero-extended) [Indexed]
evlwhos[x]       Vector Load Word into Two Half Words Odd Signed (with sign extension) [Indexed]
evlwwsplat[x]    Vector Load Word into Word and Splat [Indexed]
evlwhsplat[x]    Vector Load Word into Two Half Words and Splat [Indexed]
evstdd[x]        Vector Store Double of Double [Indexed]
evstdw[x]        Vector Store Double of Two Words [Indexed]
evstdh[x]        Vector Store Double of Four Half Words [Indexed]
evstwhe[x]       Vector Store Word of Two Half Words from Even [Indexed]
evstwho[x]       Vector Store Word of Two Half Words from Odd [Indexed]
evstwwe[x]       Vector Store Word of Word from Even [Indexed]
evstwwo[x]       Vector Store Word of Word from Odd [Indexed]

---

Exists in my git tree, posted here for review.

 arch/powerpc/kernel/align.c |  250 +++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 250 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kernel/align.c b/arch/powerpc/kernel/align.c
index 4c47f9c..e06f75d 100644
--- a/arch/powerpc/kernel/align.c
+++ b/arch/powerpc/kernel/align.c
@@ -46,6 +46,8 @@ struct aligninfo {
 #define S	0x40	/* single-precision fp or... */
 #define SX	0x40	/* ... byte count in XER */
 #define HARD	0x80	/* string, stwcx. */
+#define E4	0x40	/* SPE endianness is word */
+#define E8	0x80	/* SPE endianness is double word */

 /* DSISR bits reported for a DCBZ instruction: */
 #define DCBZ	0x5f	/* 8xx/82xx dcbz faults when cache not enabled */
@@ -392,6 +394,248 @@ static int emulate_fp_pair(struct pt_regs *regs, unsigned char __user *addr,
 	return 1;	/* exception handled and fixed up */
 }

+#ifdef CONFIG_SPE
+
+static struct aligninfo spe_aligninfo[32] = {
+	{ 8, LD+E8 },		/* 0 00 00: evldd[x] */
+	{ 8, LD+E4 },		/* 0 00 01: evldw[x] */
+	{ 8, LD },		/* 0 00 10: evldh[x] */
+	INVALID,		/* 0 00 11 */
+	{ 2, LD },		/* 0 01 00: evlhhesplat[x] */
+	INVALID,		/* 0 01 01 */
+	{ 2, LD },		/* 0 01 10: evlhhousplat[x] */
+	{ 2, LD+SE },		/* 0 01 11: evlhhossplat[x] */
+	{ 4, LD },		/* 0 10 00: evlwhe[x] */
+	INVALID,		/* 0 10 01 */
+	{ 4, LD },		/* 0 10 10: evlwhou[x] */
+	{ 4, LD+SE },		/* 0 10 11: evlwhos[x] */
+	{ 4, LD+E4 },		/* 0 11 00: evlwwsplat[x] */
+	INVALID,		/* 0 11 01 */
+	{ 4, LD },		/* 0 11 10: evlwhsplat[x] */
+	INVALID,		/* 0 11 11 */
+
+	{ 8, ST+E8 },		/* 1 00 00: evstdd[x] */
+	{ 8, ST+E4 },		/* 1 00 01: evstdw[x] */
+	{ 8, ST },		/* 1 00 10: evstdh[x] */
+	INVALID,		/* 1 00 11 */
+	INVALID,		/* 1 01 00 */
+	INVALID,		/* 1 01 01 */
+	INVALID,		/* 1 01 10 */
+	INVALID,		/* 1 01 11 */
+	{ 4, ST },		/* 1 10 00: evstwhe[x] */
+	INVALID,		/* 1 10 01 */
+	{ 4, ST },		/* 1 10 10: evstwho[x] */
+	INVALID,		/* 1 10 11 */
+	{ 4, ST+E4 },		/* 1 11 00: evstwwe[x] */
+	INVALID,		/* 1 11 01 */
+	{ 4, ST+E4 },		/* 1 11 10: evstwwo[x] */
+	INVALID,		/* 1 11 11 */
+};
+
+#define	EVLDD		0x00
+#define	EVLDW		0x01
+#define	EVLDH		0x02
+#define	EVLHHESPLAT	0x04
+#define	EVLHHOUSPLAT	0x06
+#define	EVLHHOSSPLAT	0x07
+#define	EVLWHE		0x08
+#define	EVLWHOU		0x0A
+#define	EVLWHOS		0x0B
+#define	EVLWWSPLAT	0x0C
+#define	EVLWHSPLAT	0x0E
+#define	EVSTDD		0x10
+#define	EVSTDW		0x11
+#define	EVSTDH		0x12
+#define	EVSTWHE		0x18
+#define	EVSTWHO		0x1A
+#define	EVSTWWE		0x1C
+#define	EVSTWWO		0x1E
+
+/*
+ * Emulate SPE loads and stores.
+ * Only Book-E has these instructions, and it does true little-endian,
+ * so we don't need the address swizzling.
+ */
+static int emulate_spe(struct pt_regs *regs, unsigned int reg,
+		       unsigned int instr)
+{
+	int t, ret;
+	union {
+		u64 ll;
+		u32 w[2];
+		u16 h[4];
+		u8 v[8];
+	} data, temp;
+	unsigned char __user *p, *addr;
+	unsigned long *evr = &current->thread.evr[reg];
+	unsigned int nb, flags;
+
+	instr = (instr >> 1) & 0x1f;
+
+	/* DAR has the operand effective address */
+	addr = (unsigned char __user *)regs->dar;
+
+	nb = spe_aligninfo[instr].len;
+	flags = spe_aligninfo[instr].flags;
+
+	/* Verify the address of the operand */
+	if (unlikely(user_mode(regs) &&
+		     !access_ok((flags & ST ? VERIFY_WRITE : VERIFY_READ),
+				addr, nb)))
+		return -EFAULT;
+
+	/* userland only */
+	if (unlikely(!user_mode(regs)))
+		return 0;
+
+	flush_spe_to_thread(current);
+
+	/* If we are loading, get the data from user space, else
+	 * get it from register values
+	 */
+	if (flags & ST) {
+		data.ll = 0;
+		switch (instr) {
+		case EVSTDD:
+		case EVSTDW:
+		case EVSTDH:
+			data.w[0] = *evr;
+			data.w[1] = regs->gpr[reg];
+			break;
+		case EVSTWHE:
+			data.h[2] = *evr >> 16;
+			data.h[3] = regs->gpr[reg] >> 16;
+			break;
+		case EVSTWHO:
+			data.h[2] = *evr & 0xffff;
+			data.h[3] = regs->gpr[reg] & 0xffff;
+			break;
+		case EVSTWWE:
+			data.w[1] = *evr;
+			break;
+		case EVSTWWO:
+			data.w[1] = regs->gpr[reg];
+			break;
+		default:
+			return -EINVAL;
+		}
+	} else {
+		temp.ll = data.ll = 0;
+		ret = 0;
+		p = addr;
+
+		switch (nb) {
+		case 8:
+			ret |= __get_user_inatomic(temp.v[0], p++);
+			ret |= __get_user_inatomic(temp.v[1], p++);
+			ret |= __get_user_inatomic(temp.v[2], p++);
+			ret |= __get_user_inatomic(temp.v[3], p++);
+		case 4:
+			ret |= __get_user_inatomic(temp.v[4], p++);
+			ret |= __get_user_inatomic(temp.v[5], p++);
+		case 2:
+			ret |= __get_user_inatomic(temp.v[6], p++);
+			ret |= __get_user_inatomic(temp.v[7], p++);
+			if (unlikely(ret))
+				return -EFAULT;
+		}
+
+		switch (instr) {
+		case EVLDD:
+		case EVLDW:
+		case EVLDH:
+			data.ll = temp.ll;
+			break;
+		case EVLHHESPLAT:
+			data.h[0] = temp.h[3];
+			data.h[2] = temp.h[3];
+			break;
+		case EVLHHOUSPLAT:
+		case EVLHHOSSPLAT:
+			data.h[1] = temp.h[3];
+			data.h[3] = temp.h[3];
+			break;
+		case EVLWHE:
+			data.h[0] = temp.h[2];
+			data.h[2] = temp.h[3];
+			break;
+		case EVLWHOU:
+		case EVLWHOS:
+			data.h[1] = temp.h[2];
+			data.h[3] = temp.h[3];
+			break;
+		case EVLWWSPLAT:
+			data.w[0] = temp.w[1];
+			data.w[1] = temp.w[1];
+			break;
+		case EVLWHSPLAT:
+			data.h[0] = temp.h[2];
+			data.h[1] = temp.h[2];
+			data.h[2] = temp.h[3];
+			data.h[3] = temp.h[3];
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+
+	if (flags & SW) {
+		switch (flags & 0xf0) {
+		case E8:
+			SWAP(data.v[0], data.v[7]);
+			SWAP(data.v[1], data.v[6]);
+			SWAP(data.v[2], data.v[5]);
+			SWAP(data.v[3], data.v[4]);
+			break;
+		case E4:
+
+			SWAP(data.v[0], data.v[3]);
+			SWAP(data.v[1], data.v[2]);
+			SWAP(data.v[4], data.v[7]);
+			SWAP(data.v[5], data.v[6]);
+			break;
+		/* Its half word endian */
+		default:
+			SWAP(data.v[0], data.v[1]);
+			SWAP(data.v[2], data.v[3]);
+			SWAP(data.v[4], data.v[5]);
+			SWAP(data.v[6], data.v[7]);
+			break;
+		}
+	}
+
+	if (flags & SE) {
+		data.w[0] = (s16)data.h[1];
+		data.w[1] = (s16)data.h[3];
+	}
+
+	/* Store result to memory or update registers */
+	if (flags & ST) {
+		ret = 0;
+		p = addr;
+		switch (nb) {
+		case 8:
+			ret |= __put_user_inatomic(data.v[0], p++);
+			ret |= __put_user_inatomic(data.v[1], p++);
+			ret |= __put_user_inatomic(data.v[2], p++);
+			ret |= __put_user_inatomic(data.v[3], p++);
+		case 4:
+			ret |= __put_user_inatomic(data.v[4], p++);
+			ret |= __put_user_inatomic(data.v[5], p++);
+		case 2:
+			ret |= __put_user_inatomic(data.v[6], p++);
+			ret |= __put_user_inatomic(data.v[7], p++);
+		}
+		if (unlikely(ret))
+			return -EFAULT;
+	} else {
+		*evr = data.w[0];
+		regs->gpr[reg] = data.w[1];
+	}
+
+	return 1;
+}
+#endif /* CONFIG_SPE */

 /*
  * Called on alignment exception. Attempts to fixup
@@ -450,6 +694,12 @@ int fix_alignment(struct pt_regs *regs)
 	/* extract the operation and registers from the dsisr */
 	reg = (dsisr >> 5) & 0x1f;	/* source/dest register */
 	areg = dsisr & 0x1f;		/* register to update */
+
+#ifdef CONFIG_SPE
+	if ((instr >> 26) == 0x4)
+		return emulate_spe(regs, reg, instr);
+#endif
+
 	instr = (dsisr >> 10) & 0x7f;
 	instr |= (dsisr >> 13) & 0x60;

-- 
1.5.2.4
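
For readers following the load path in the patch above, here is a
byte-level sketch of two of its steps: gathering the operand into the
low-order end of an 8-byte buffer, and the evlwhsplat shuffle. It is a
userspace illustration only. gather_operand() and splat_evlwhsplat() are
hypothetical helpers, and plain byte arrays stand in for the patch's
union so the result does not depend on the host's byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy an nb-byte operand (2, 4 or 8 bytes) into the low-order end of an
 * 8-byte buffer, matching how the load path fills temp.v[8 - nb .. 7].
 */
static void gather_operand(uint8_t temp[8], const uint8_t *src, unsigned int nb)
{
	assert(nb == 2 || nb == 4 || nb == 8);
	memset(temp, 0, 8);
	memcpy(temp + 8 - nb, src, nb);
}

/*
 * evlwhsplat: each loaded half word is duplicated within its 32-bit half.
 * In the patch's terms: h[0] = h[1] = temp.h[2], h[2] = h[3] = temp.h[3].
 */
static void splat_evlwhsplat(uint8_t data[8], const uint8_t temp[8])
{
	memcpy(data + 0, temp + 4, 2);
	memcpy(data + 2, temp + 4, 2);
	memcpy(data + 4, temp + 6, 2);
	memcpy(data + 6, temp + 6, 2);
}
```

Loading the four bytes 12 34 56 78 this way produces the register image
12 34 12 34 56 78 56 78.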

^ permalink raw reply related

* Re: RFC: issues concerning the next NAPI interface
From: Linas Vepstas @ 2007-08-24 21:51 UTC (permalink / raw)
  To: David Miller
  Cc: tklein, netdev-owner, themann, netdev, shemminger, dlstevens,
	linux-kernel, linuxppc-dev, raisch, meder, akepner, ossthema,
	stefan.roscher
In-Reply-To: <20070824.144436.59664160.davem@davemloft.net>

On Fri, Aug 24, 2007 at 02:44:36PM -0700, David Miller wrote:
> From: David Stevens <dlstevens@us.ibm.com>
> Date: Fri, 24 Aug 2007 09:50:58 -0700
> 
> >         Problem is if it increases rapidly, you may drop packets
> > before you notice that the ring is full in the current estimated
> > interval.
> 
> This is one of many reasons why hardware interrupt mitigation
> is really needed for this.

When turning off interrupts, don't turn them *all* off.
Leave the queue-full interrupt always on.

--linas

^ permalink raw reply

* Re: RFC: issues concerning the next NAPI interface
From: akepner @ 2007-08-24 22:06 UTC (permalink / raw)
  To: David Miller
  Cc: tklein, themann, stefan.roscher, netdev, jchapman, raisch,
	linux-kernel, linuxppc-dev, ossthema, meder, shemminger
In-Reply-To: <20070824.144711.18301866.davem@davemloft.net>

On Fri, Aug 24, 2007 at 02:47:11PM -0700, David Miller wrote:

> ....
> Someone should reference that thread _now_ before this discussion goes
> too far and we repeat a lot of information ......

Here's part of the thread:
http://marc.info/?t=111595306000001&r=1&w=2

Also, Jamal's paper may be of interest - Google for "when napi comes
to town".

-- 
Arthur

^ permalink raw reply

* Re: [PATCH 05/20] bootwrapper: flatdevtree fixes
From: David Gibson @ 2007-08-24 22:17 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev
In-Reply-To: <20070824144837.GA18753@ld0162-tx32.am.freescale.net>

On Fri, Aug 24, 2007 at 09:48:37AM -0500, Scott Wood wrote:
> On Fri, Aug 24, 2007 at 11:01:22AM +1000, David Gibson wrote:
> > On Thu, Aug 23, 2007 at 12:48:30PM -0500, Scott Wood wrote:
> > > It's likely to be ugly no matter what, though I'll try to come up with 
> > > something slightly nicer.  If I were doing this code from scratch, I'd 
> > > probably liven the tree first and reflatten it to pass to the kernel.
> > 
> > Eh, probably not worth bothering doing an actual implementation at
> > this stage - I'll have to redo it for libfdt anyway.
> 
> Too late, I already wrote it -- it wasn't as bad as I thought it would
> be.

Well, there you go.

> > flatdevtree uses some of the information it caches in the phandle
> > context stuff to remember who's the parent of a node.  libfdt uses raw
> > offsets into the structure, so the *only* way to implement
> > get_parent() is to rescan the dt from the beginning, keeping track of
> > parents until reaching the given node.
> 
> What is the benefit of doing it that way?

Most other operations are simpler like this - no more futzing around
converting between phandles and offsets and back again at the
beginning and end of most functions.

More importantly, it allows libfdt to be "stateless" in the sense that
you can manipulate the device tree without having to maintain any
context or state structure apart from the device tree blob itself.
That's particularly handy for doing read-only accesses really early
with a minimum of fuss.

In particular, it means libfdt does not need malloc().  That can be
rather useful for something that's supposed to be embeddable in a variety
of strange, constrained environments such as bootloaders and
firmwares.
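
To make the rescan idea concrete, here is a small sketch of finding a
node's parent from offsets alone, with no saved context and no malloc().
It is not libfdt code: the token stream is a deliberately simplified,
hypothetical one (just begin-node and end-node markers), and
parent_offset() and MAX_DEPTH are made-up names.

```c
#include <assert.h>
#include <stddef.h>

enum tok { TOK_BEGIN_NODE, TOK_END_NODE };	/* simplified; not the real tags */

#define MAX_DEPTH 32

/*
 * Return the offset of the parent of the node that begins at `target`,
 * or -1 if that node is the root, is not found, or the tree is too deep.
 * Every call rescans from the start; nothing is cached or allocated.
 */
static int parent_offset(const enum tok *toks, size_t ntoks, size_t target)
{
	size_t stack[MAX_DEPTH];	/* offsets of the currently open nodes */
	size_t depth = 0;

	assert(toks != NULL);
	for (size_t off = 0; off < ntoks; off++) {
		if (toks[off] == TOK_BEGIN_NODE) {
			if (off == target)
				return depth ? (int)stack[depth - 1] : -1;
			if (depth == MAX_DEPTH)
				return -1;
			stack[depth++] = off;
		} else if (depth > 0) {
			depth--;	/* end of the innermost open node */
		}
	}
	return -1;
}
```

The cost is a linear scan per lookup, which is the price paid for
keeping the blob itself as the only state.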

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply

* Re: [PATCH] fix undefined reference to device_power_up/resume
From: Paul Mackerras @ 2007-08-25  1:10 UTC (permalink / raw)
  To: Olaf Hering; +Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20070824194201.GA8737@aepfle.de>

Olaf Hering writes:

> So change even more places from PM to PM_SLEEP to allow linking.

What config shows these errors?  I presume you need to have CONFIG_PM
but not CONFIG_PM_SLEEP in order to see them?

Paul.

^ permalink raw reply

