* Re: [PATCH][v2] powerpc/4xx: Add 16K FIFO size DTS entries on supported platforms
From: Benjamin Herrenschmidt @ 2009-10-09 2:15 UTC (permalink / raw)
To: Dave Mitchell; +Cc: linuxppc-dev
In-Reply-To: <1255019609-2033-1-git-send-email-dmitchell@appliedmicro.com>
On Thu, 2009-10-08 at 11:33 -0500, Dave Mitchell wrote:
> Adding tx/rx-fifo-size-gige to EMAC fields for evaluation kit DTS
> files where appropriate.
>
> Signed-off-by: Dave Mitchell <dmitchell@appliedmicro.com>
> Acked-by: Prodyut Hazarika <phazarika@appliedmicro.com>
> Acked-by: Victor Gallardo <vgallardo@appliedmicro.com>
> Acked-by: Loc Ho <lho@appliedmicro.com>
> ---
> v1->v2: local date/time was out-of-sync and thus mail was as well
I'll have to wait for the EMAC patch to go in, so ping me if you don't
see me take that one after that happens.
Cheers,
Ben.
> arch/powerpc/boot/dts/canyonlands.dts | 2 ++
> arch/powerpc/boot/dts/eiger.dts | 6 ++++++
> arch/powerpc/boot/dts/glacier.dts | 6 ++++++
> arch/powerpc/boot/dts/haleakala.dts | 2 ++
> arch/powerpc/boot/dts/kilauea.dts | 4 ++++
> arch/powerpc/boot/dts/makalu.dts | 4 ++++
> arch/powerpc/boot/dts/redwood.dts | 1 +
> 7 files changed, 25 insertions(+), 0 deletions(-)
>
> diff --git a/arch/powerpc/boot/dts/canyonlands.dts b/arch/powerpc/boot/dts/canyonlands.dts
> index c920170..cd56bb5 100644
> --- a/arch/powerpc/boot/dts/canyonlands.dts
> +++ b/arch/powerpc/boot/dts/canyonlands.dts
> @@ -352,6 +352,7 @@
> max-frame-size = <9000>;
> rx-fifo-size = <4096>;
> tx-fifo-size = <2048>;
> + rx-fifo-size-gige = <16384>;
> phy-mode = "rgmii";
> phy-map = <0x00000000>;
> rgmii-device = <&RGMII0>;
> @@ -381,6 +382,7 @@
> max-frame-size = <9000>;
> rx-fifo-size = <4096>;
> tx-fifo-size = <2048>;
> + rx-fifo-size-gige = <16384>;
> phy-mode = "rgmii";
> phy-map = <0x00000000>;
> rgmii-device = <&RGMII0>;
> diff --git a/arch/powerpc/boot/dts/eiger.dts b/arch/powerpc/boot/dts/eiger.dts
> index c4a934f..48bcf71 100644
> --- a/arch/powerpc/boot/dts/eiger.dts
> +++ b/arch/powerpc/boot/dts/eiger.dts
> @@ -316,6 +316,7 @@
> max-frame-size = <9000>;
> rx-fifo-size = <4096>;
> tx-fifo-size = <2048>;
> + rx-fifo-size-gige = <16384>;
> phy-mode = "rgmii";
> phy-map = <0x00000000>;
> rgmii-device = <&RGMII0>;
> @@ -345,6 +346,7 @@
> max-frame-size = <9000>;
> rx-fifo-size = <4096>;
> tx-fifo-size = <2048>;
> + rx-fifo-size-gige = <16384>;
> phy-mode = "rgmii";
> phy-map = <0x00000000>;
> rgmii-device = <&RGMII0>;
> @@ -375,6 +377,8 @@
> max-frame-size = <9000>;
> rx-fifo-size = <4096>;
> tx-fifo-size = <2048>;
> + rx-fifo-size-gige = <16384>;
> + tx-fifo-size-gige = <16384>; /* emac2&3 only */
> phy-mode = "rgmii";
> phy-map = <0x00000000>;
> rgmii-device = <&RGMII1>;
> @@ -403,6 +407,8 @@
> max-frame-size = <9000>;
> rx-fifo-size = <4096>;
> tx-fifo-size = <2048>;
> + rx-fifo-size-gige = <16384>;
> + tx-fifo-size-gige = <16384>; /* emac2&3 only */
> phy-mode = "rgmii";
> phy-map = <0x00000000>;
> rgmii-device = <&RGMII1>;
> diff --git a/arch/powerpc/boot/dts/glacier.dts b/arch/powerpc/boot/dts/glacier.dts
> index f3787a2..f6f6189 100644
> --- a/arch/powerpc/boot/dts/glacier.dts
> +++ b/arch/powerpc/boot/dts/glacier.dts
> @@ -292,6 +292,7 @@
> max-frame-size = <9000>;
> rx-fifo-size = <4096>;
> tx-fifo-size = <2048>;
> + rx-fifo-size-gige = <16384>;
> phy-mode = "rgmii";
> phy-map = <0x00000000>;
> rgmii-device = <&RGMII0>;
> @@ -321,6 +322,7 @@
> max-frame-size = <9000>;
> rx-fifo-size = <4096>;
> tx-fifo-size = <2048>;
> + rx-fifo-size-gige = <16384>;
> phy-mode = "rgmii";
> phy-map = <0x00000000>;
> rgmii-device = <&RGMII0>;
> @@ -351,6 +353,8 @@
> max-frame-size = <9000>;
> rx-fifo-size = <4096>;
> tx-fifo-size = <2048>;
> + rx-fifo-size-gige = <16384>;
> + tx-fifo-size-gige = <16384>; /* emac2&3 only */
> phy-mode = "rgmii";
> phy-map = <0x00000000>;
> rgmii-device = <&RGMII1>;
> @@ -379,6 +383,8 @@
> max-frame-size = <9000>;
> rx-fifo-size = <4096>;
> tx-fifo-size = <2048>;
> + rx-fifo-size-gige = <16384>;
> + tx-fifo-size-gige = <16384>; /* emac2&3 only */
> phy-mode = "rgmii";
> phy-map = <0x00000000>;
> rgmii-device = <&RGMII1>;
> diff --git a/arch/powerpc/boot/dts/haleakala.dts b/arch/powerpc/boot/dts/haleakala.dts
> index 5b2a494..2b25669 100644
> --- a/arch/powerpc/boot/dts/haleakala.dts
> +++ b/arch/powerpc/boot/dts/haleakala.dts
> @@ -226,6 +226,8 @@
> max-frame-size = <9000>;
> rx-fifo-size = <4096>;
> tx-fifo-size = <2048>;
> + rx-fifo-size-gige = <16384>;
> + tx-fifo-size-gige = <16384>;
> phy-mode = "rgmii";
> phy-map = <0x00000000>;
> rgmii-device = <&RGMII0>;
> diff --git a/arch/powerpc/boot/dts/kilauea.dts b/arch/powerpc/boot/dts/kilauea.dts
> index c465614..083e68e 100644
> --- a/arch/powerpc/boot/dts/kilauea.dts
> +++ b/arch/powerpc/boot/dts/kilauea.dts
> @@ -272,6 +272,8 @@
> max-frame-size = <9000>;
> rx-fifo-size = <4096>;
> tx-fifo-size = <2048>;
> + rx-fifo-size-gige = <16384>;
> + tx-fifo-size-gige = <16384>;
> phy-mode = "rgmii";
> phy-map = <0x00000000>;
> rgmii-device = <&RGMII0>;
> @@ -300,6 +302,8 @@
> max-frame-size = <9000>;
> rx-fifo-size = <4096>;
> tx-fifo-size = <2048>;
> + rx-fifo-size-gige = <16384>;
> + tx-fifo-size-gige = <16384>;
> phy-mode = "rgmii";
> phy-map = <0x00000000>;
> rgmii-device = <&RGMII0>;
> diff --git a/arch/powerpc/boot/dts/makalu.dts b/arch/powerpc/boot/dts/makalu.dts
> index ffc246e..63d48b6 100644
> --- a/arch/powerpc/boot/dts/makalu.dts
> +++ b/arch/powerpc/boot/dts/makalu.dts
> @@ -227,6 +227,8 @@
> max-frame-size = <9000>;
> rx-fifo-size = <4096>;
> tx-fifo-size = <2048>;
> + rx-fifo-size-gige = <16384>;
> + tx-fifo-size-gige = <16384>;
> phy-mode = "rgmii";
> phy-map = <0x0000003f>; /* Start at 6 */
> rgmii-device = <&RGMII0>;
> @@ -255,6 +257,8 @@
> max-frame-size = <9000>;
> rx-fifo-size = <4096>;
> tx-fifo-size = <2048>;
> + rx-fifo-size-gige = <16384>;
> + tx-fifo-size-gige = <16384>;
> phy-mode = "rgmii";
> phy-map = <0x00000000>;
> rgmii-device = <&RGMII0>;
> diff --git a/arch/powerpc/boot/dts/redwood.dts b/arch/powerpc/boot/dts/redwood.dts
> index ad402c4..d2af32e 100644
> --- a/arch/powerpc/boot/dts/redwood.dts
> +++ b/arch/powerpc/boot/dts/redwood.dts
> @@ -226,6 +226,7 @@
> max-frame-size = <9000>;
> rx-fifo-size = <4096>;
> tx-fifo-size = <2048>;
> + rx-fifo-size-gige = <16384>;
> phy-mode = "rgmii";
> phy-map = <0x00000000>;
> rgmii-device = <&RGMII0>;
^ permalink raw reply
* Re: [PATCH][v2] ibm_newemac: Added 16K Tx FIFO size support for EMAC4
From: Benjamin Herrenschmidt @ 2009-10-09 2:15 UTC (permalink / raw)
To: Dave Mitchell; +Cc: netdev, linuxppc-dev
In-Reply-To: <1255019541-1974-1-git-send-email-dmitchell@appliedmicro.com>
On Thu, 2009-10-08 at 11:32 -0500, Dave Mitchell wrote:
> Some of the EMAC V4 implementations support 16K Tx FIFOs. This
> patch adds support for this functionality and fixes typos in the
> Tx FIFO size error messages.
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
> Signed-off-by: Dave Mitchell <dmitchell@appliedmicro.com>
> Acked-by: Prodyut Hazarika <phazarika@appliedmicro.com>
> Acked-by: Victor Gallardo <vgallardo@appliedmicro.com>
> Acked-by: Loc Ho <lho@appliedmicro.com>
> ---
> v1->v2: local date/time was out-of-sync and thus mail was as well
>
> drivers/net/ibm_newemac/core.c | 7 +++++--
> drivers/net/ibm_newemac/emac.h | 1 +
> 2 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ibm_newemac/core.c b/drivers/net/ibm_newemac/core.c
> index 89c82c5..c6591cb 100644
> --- a/drivers/net/ibm_newemac/core.c
> +++ b/drivers/net/ibm_newemac/core.c
> @@ -443,7 +443,7 @@ static u32 __emac_calc_base_mr1(struct emac_instance *dev, int tx_size, int rx_s
> ret |= EMAC_MR1_TFS_2K;
> break;
> default:
> - printk(KERN_WARNING "%s: Unknown Rx FIFO size %d\n",
> + printk(KERN_WARNING "%s: Unknown Tx FIFO size %d\n",
> dev->ndev->name, tx_size);
> }
>
> @@ -470,6 +470,9 @@ static u32 __emac4_calc_base_mr1(struct emac_instance *dev, int tx_size, int rx_
> DBG2(dev, "__emac4_calc_base_mr1" NL);
>
> switch(tx_size) {
> + case 16384:
> + ret |= EMAC4_MR1_TFS_16K;
> + break;
> case 4096:
> ret |= EMAC4_MR1_TFS_4K;
> break;
> @@ -477,7 +480,7 @@ static u32 __emac4_calc_base_mr1(struct emac_instance *dev, int tx_size, int rx_
> ret |= EMAC4_MR1_TFS_2K;
> break;
> default:
> - printk(KERN_WARNING "%s: Unknown Rx FIFO size %d\n",
> + printk(KERN_WARNING "%s: Unknown Tx FIFO size %d\n",
> dev->ndev->name, tx_size);
> }
>
> diff --git a/drivers/net/ibm_newemac/emac.h b/drivers/net/ibm_newemac/emac.h
> index 0afc2cf..d34adf9 100644
> --- a/drivers/net/ibm_newemac/emac.h
> +++ b/drivers/net/ibm_newemac/emac.h
> @@ -153,6 +153,7 @@ struct emac_regs {
> #define EMAC4_MR1_RFS_16K 0x00280000
> #define EMAC4_MR1_TFS_2K 0x00020000
> #define EMAC4_MR1_TFS_4K 0x00030000
> +#define EMAC4_MR1_TFS_16K 0x00050000
> #define EMAC4_MR1_TR 0x00008000
> #define EMAC4_MR1_MWSW_001 0x00001000
> #define EMAC4_MR1_JPSM 0x00000800
^ permalink raw reply
* Re: [PATCH 1/2][v2] mm: add notifier in pageblock isolation for balloon drivers
From: KAMEZAWA Hiroyuki @ 2009-10-09 0:47 UTC (permalink / raw)
To: Robert Jennings
Cc: linux-mm, Mel Gorman, linux-kernel, linuxppc-dev,
Martin Schwidefsky, Badari Pulavarty, Brian King, Paul Mackerras,
Ingo Molnar
In-Reply-To: <20091002184458.GC4908@austin.ibm.com>
On Fri, 2 Oct 2009 13:44:58 -0500
Robert Jennings <rcj@linux.vnet.ibm.com> wrote:
> Memory balloon drivers can allocate a large amount of memory which
> is not movable but could be freed to accomodate memory hotplug remove.
>
> Prior to calling the memory hotplug notifier chain the memory in the
> pageblock is isolated. If the migrate type is not MIGRATE_MOVABLE the
> isolation will not proceed, causing the memory removal for that page
> range to fail.
>
> Rather than failing pageblock isolation if the the migrateteype is not
> MIGRATE_MOVABLE, this patch checks if all of the pages in the pageblock
> are owned by a registered balloon driver (or other entity) using a
> notifier chain. If all of the non-movable pages are owned by a balloon,
> they can be freed later through the memory notifier chain and the range
> can still be isolated in set_migratetype_isolate().
>
> Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
>
> ---
> drivers/base/memory.c | 19 +++++++++++++++++++
> include/linux/memory.h | 26 ++++++++++++++++++++++++++
> mm/page_alloc.c | 45 ++++++++++++++++++++++++++++++++++++++-------
> 3 files changed, 83 insertions(+), 7 deletions(-)
>
> Index: b/drivers/base/memory.c
> ===================================================================
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -63,6 +63,20 @@ void unregister_memory_notifier(struct n
> }
> EXPORT_SYMBOL(unregister_memory_notifier);
>
> +static BLOCKING_NOTIFIER_HEAD(memory_isolate_chain);
> +
IIUC, this notifier is called under zone->lock.
please ATOMIC_NOTIFIER_HEAD().
> +int register_memory_isolate_notifier(struct notifier_block *nb)
> +{
> + return blocking_notifier_chain_register(&memory_isolate_chain, nb);
> +}
> +EXPORT_SYMBOL(register_memory_isolate_notifier);
> +
> +void unregister_memory_isolate_notifier(struct notifier_block *nb)
> +{
> + blocking_notifier_chain_unregister(&memory_isolate_chain, nb);
> +}
> +EXPORT_SYMBOL(unregister_memory_isolate_notifier);
> +
> /*
> * register_memory - Setup a sysfs device for a memory block
> */
> @@ -157,6 +171,11 @@ int memory_notify(unsigned long val, voi
> return blocking_notifier_call_chain(&memory_chain, val, v);
> }
>
> +int memory_isolate_notify(unsigned long val, void *v)
> +{
> + return blocking_notifier_call_chain(&memory_isolate_chain, val, v);
> +}
> +
> /*
> * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
> * OK to have direct references to sparsemem variables in here.
> Index: b/include/linux/memory.h
> ===================================================================
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -50,6 +50,18 @@ struct memory_notify {
> int status_change_nid;
> };
>
> +/*
> + * During pageblock isolation, count the number of pages in the
> + * range [start_pfn, start_pfn + nr_pages)
> + */
> +#define MEM_ISOLATE_COUNT (1<<0)
> +
> +struct memory_isolate_notify {
> + unsigned long start_pfn;
> + unsigned int nr_pages;
> + unsigned int pages_found;
> +};
Could you add commentary for each field ?
> +
> struct notifier_block;
> struct mem_section;
>
> @@ -76,14 +88,28 @@ static inline int memory_notify(unsigned
> {
> return 0;
> }
> +static inline int register_memory_isolate_notifier(struct notifier_block *nb)
> +{
> + return 0;
> +}
> +static inline void unregister_memory_isolate_notifier(struct notifier_block *nb)
> +{
> +}
> +static inline int memory_isolate_notify(unsigned long val, void *v)
> +{
> + return 0;
> +}
> #else
> extern int register_memory_notifier(struct notifier_block *nb);
> extern void unregister_memory_notifier(struct notifier_block *nb);
> +extern int register_memory_isolate_notifier(struct notifier_block *nb);
> +extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
> extern int register_new_memory(int, struct mem_section *);
> extern int unregister_memory_section(struct mem_section *);
> extern int memory_dev_init(void);
> extern int remove_memory_block(unsigned long, struct mem_section *, int);
> extern int memory_notify(unsigned long val, void *v);
> +extern int memory_isolate_notify(unsigned long val, void *v);
> extern struct memory_block *find_memory_block(struct mem_section *);
> #define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTION<<PAGE_SHIFT)
> enum mem_add_context { BOOT, HOTPLUG };
> Index: b/mm/page_alloc.c
> ===================================================================
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -48,6 +48,7 @@
> #include <linux/page_cgroup.h>
> #include <linux/debugobjects.h>
> #include <linux/kmemleak.h>
> +#include <linux/memory.h>
> #include <trace/events/kmem.h>
>
> #include <asm/tlbflush.h>
> @@ -4985,23 +4986,53 @@ void set_pageblock_flags_group(struct pa
> int set_migratetype_isolate(struct page *page)
> {
> struct zone *zone;
> - unsigned long flags;
> + unsigned long flags, pfn, iter;
> + unsigned long immobile = 0;
> + struct memory_isolate_notify arg;
> + int notifier_ret;
> int ret = -EBUSY;
> int zone_idx;
>
> zone = page_zone(page);
> zone_idx = zone_idx(zone);
> +
> spin_lock_irqsave(&zone->lock, flags);
> + if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE ||
> + zone_idx == ZONE_MOVABLE) {
> + ret = 0;
> + goto out;
> + }
> +
> + pfn = page_to_pfn(page);
> + arg.start_pfn = pfn;
> + arg.nr_pages = pageblock_nr_pages;
> + arg.pages_found = 0;
> +
> /*
> - * In future, more migrate types will be able to be isolation target.
> + * The pageblock can be isolated even if the migrate type is
> + * not *_MOVABLE. The memory isolation notifier chain counts
> + * the number of pages in this pageblock that can be freed later
> + * through the memory notifier chain. If all of the pages are
> + * accounted for, isolation can continue.
> */
Could add explanation like this ?
==
Later, for example, when memory hotplug notifier runs, these pages reported as
"can be isoalted" should be isolated(freed) by callbacks.
==
> - if (get_pageblock_migratetype(page) != MIGRATE_MOVABLE &&
> - zone_idx != ZONE_MOVABLE)
> + notifier_ret = memory_isolate_notify(MEM_ISOLATE_COUNT, &arg);
> + notifier_ret = notifier_to_errno(notifier_ret);
> + if (notifier_ret || !arg.pages_found)
> goto out;
> - set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> - move_freepages_block(zone, page, MIGRATE_ISOLATE);
> - ret = 0;
> +
> + for (iter = pfn; iter < (pfn + pageblock_nr_pages); iter++)
> + if (page_count(pfn_to_page(iter)))
> + immobile++;
> +
> + if (arg.pages_found == immobile)
> + ret = 0;
> +
I can't understand this part. Does this mean all pages under this pageblock
are used by balloon driver ?
IOW, memory is hotpluggable only when all pages under this pageblock is used
by balloon ?
Hmm. Can't we do this kind of check..?
==
for (iter = pfn; iter < (pfn + pageblock_nr_pages); iter++) {
page = pfn_to_page(iter);
if (!page_count(page) || PageLRU(page)) // This page is movable.
continue;
immobile++
}
==
Then, if a page is luckyly on LRU, we have more chances.
(This check can fail if a page is on percpu pagevec etc...)
Thanks,
-Kame
> out:
> + if (!ret) {
> + set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> + move_freepages_block(zone, page, MIGRATE_ISOLATE);
> + }
> +
> spin_unlock_irqrestore(&zone->lock, flags);
> if (!ret)
> drain_all_pages();
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
^ permalink raw reply
* Re: [PATCH 3/6] 8xx: invalidate non present TLBs
From: Benjamin Herrenschmidt @ 2009-10-09 0:57 UTC (permalink / raw)
To: Dan Malek; +Cc: Scott Wood, linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <3DFBC735-4084-4A0C-BF8D-C9B33142FE69@embeddedalley.com>
On Thu, 2009-10-08 at 17:36 -0700, Dan Malek wrote:
> On Oct 8, 2009, at 3:23 PM, Benjamin Herrenschmidt wrote:
>
> > <<
> > • Reference and change bit updates—The MPC850 does not generate an
> > exception for
> > an R (reference) bit update. In fact, there is no entry for an R
> > bit in the TLB.
> > The change bit (C) is bit 23 in the level-two descriptor,
> > described in Table 8-4.
> > Software updates C (changed) bits, but hardware treats the C bit
> > (negated) as a
> > write-protect attribute. Therefore, attempting to write to a page
> > marked unmodified
> > invalidates that entry and causes an implementation-specific DTLB
> > error exception.
> > ^^^^^^^^^^^^^^^^^^^^^^
> > If change bits are not needed, set the C bit to one by default in
> > the PTEs.
>
> How interesting....
>
> I've looked at many 8xx docs and they all have the same text
> (probably cut/paste :-)) I'd place some debug code in the C functions
> to print out a few of the TLB Entry for various errors to see if this
> really
> happens, and for other errors, too. I guess I never stumbled into
> this because I always thought I had to do everything from software,
> so just made sure I did.
I'm not sure it's worth bothering :-) I'm happy to continue assuming we
need to tlbie and always make sure we do so.
Cheers,
Ben.
^ permalink raw reply
* Re: [PATCH 3/6] 8xx: invalidate non present TLBs
From: Benjamin Herrenschmidt @ 2009-10-09 0:56 UTC (permalink / raw)
To: Joakim Tjernlund; +Cc: Scott Wood, linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <OFD13ED50D.BE2734E7-ONC1257649.007DB1F8-C1257649.007E7CB6@transmode.se>
On Fri, 2009-10-09 at 01:01 +0200, Joakim Tjernlund wrote:
> Ok, that's my understanding too and I think we had the tlbie in
> > update_mmu_cache to do the trick, though the comment is misleading
> > making it think that the only reason it's there is for the dcbst
> > problem. At least that's my understanding. That was lost recently in
> 2.6
> > so I'll have to put it back properly.
>
> So you don't think my invalidate "only !present pages" patch in
> do_page_fault is enough?
It might well be the right solution, I was talking about the code as we
have upstream today.
> I don't think we do the pre-load to avoid the second fault, but we
> It won't get much faster than my current patch. Trapping all DTLB
> Errors to C won't make it faster, only more correct should there be
> a bug in the asm version. Actually there is one that has been there
> all the time, guarded flag is not set by DTLB Error.
There's other areas of improvements I suggested that can make it faster
such as avoiding the whole kernel/user test in the TLB misses.
Removing the stuff in DataTLBError can potentially make normal page
faults faster too by avoiding going through a bunch of useless code
before going to do the real thing in C :-)
As I said, the case of ACCESSED or DIRTY updates are rare enough to not
warrant code in the main page fault hot path.
Cheers,
Ben.
^ permalink raw reply
* Re: [PATCH 2/6] 8xx: Update TLB asm so it behaves as linux mm expects.
From: Benjamin Herrenschmidt @ 2009-10-09 0:53 UTC (permalink / raw)
To: Joakim Tjernlund; +Cc: Scott Wood, linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <OF0E7CD530.5EA71A5E-ONC1257649.00797FD2-C1257649.007CEF55@transmode.se>
On Fri, 2009-10-09 at 00:44 +0200, Joakim Tjernlund wrote:
> accessed == 1 and present = 0 is impossible, right?
> So basically just copy over accessed to present and
> linux mm set both when trapping to C.
No, when present = 0, then the rest of the PTE can contain unrelated
things, you can't trust ACCESSED.
> What about the execute perms in Level 2 descriptor, page 247?
Not useful, not fine grained enough.
> > You still need to massage the PP bits into place. I don't see that
> > happening.
>
> Not at the moment, later.
>
> >
> > As it is, your PTE contains for bit 20 and 21, which translates to:
> >
> > PTE: Translates to PP bits:
> > RW: 0 USER: 0 00 supervisor RW (ok)
> > RW: 0 USER: 1 01 supervisor RW user RO (WRONG)
> > RW: 1 USER: 0 10 supervisor RW user RW (WRONG)
> > RW: 1 USER: 1 11 supervisor RO user RO (WRONG)
>
> You got USER and RW swapped and the table is different
> for exec.
Hrm, let me see... yes. You are right, I mixed RW and USER. However,
I don't think the PP bits change do they ? IE. Basically, Read == Exec
at the page level. So the table isn't really different between I and D.
However, indeed, since you don't have a unified TLB, the case can be
made that we can ignore R vs. W in the iTLB case. In which case, you get
for iTLB:
PTE: Translates to PP bits:
RW: 0 USER: 0 00 supervisor X only (ok)
RW: 0 USER: 1 10 supervisor X user X (ok)
RW: 1 USER: 0 01 supervisor X user X (WRONG)
RW: 1 USER: 1 11 supervisor X user X (ok)
So a page with _PAGE_RW and not _PAGE_USER would still be executable
from user... oops :-)
I think what you want for iTLB is just basically have a base of 00
and or-in _PAGE_USER only (ie, keep _PAGE_RW clear with a rlwinm)
so that you basically get supervisor X only if _PAGE_USER is 0 and
both X if _PAGE_USER is 1
For the dTLB, the table becomes (including your inversion of _PAGE_RW)
PTE: Translates to PP bits:
RW: 0 USER: 0 01 supervisor RW user RO (WRONG)
RW: 0 USER: 1 11 supervisor RO user RO (ok)
RW: 1 USER: 0 00 supervisor RW only (ok)
RW: 1 USER: 1 10 supervisor RW user RW (ok)
So it's -almost- right :-) You still got the RW:0 USER:0 case wrong,
ie a read-only kernel page would be user readable.
You can work around that by never setting kernel pages read-only (which
we do mostly), but in the grand scheme of things, my trick I proposed
initially would sort it out all including support for kernel RO :-)
In any case, the above, while wrong, wouldn't cause crashes or issues
for well behaved userspace so it's a step forward.
> Same here as for ITLB.
And still not right :-) ie. you cannot rely on the value of
_PAGE_ACCESSED if _PAGE_PRESENT is not set.
> Nope, no xori needed for exec perms
Right, thanks to having a split TLB, but you still need to mask
out one of the bits afaik.
> I don't think user space would boot if I got it wrong.
True. I think it's more correct than I initially thought but still
subtely wrong :-)
Cheers,
Ben.
^ permalink raw reply
* Re: [PATCH 3/6] 8xx: invalidate non present TLBs
From: Dan Malek @ 2009-10-09 0:36 UTC (permalink / raw)
To: Benjamin Herrenschmidt; +Cc: Scott Wood, linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <1255040628.2146.54.camel@pasglop>
On Oct 8, 2009, at 3:23 PM, Benjamin Herrenschmidt wrote:
> <<
> =95 Reference and change bit updates=97The MPC850 does not generate an =
=20
> exception for
> an R (reference) bit update. In fact, there is no entry for an R =20
> bit in the TLB.
> The change bit (C) is bit 23 in the level-two descriptor, =20
> described in Table 8-4.
> Software updates C (changed) bits, but hardware treats the C bit =20
> (negated) as a
> write-protect attribute. Therefore, attempting to write to a page =20=
> marked unmodified
> invalidates that entry and causes an implementation-specific DTLB =20=
> error exception.
> ^^^^^^^^^^^^^^^^^^^^^^
> If change bits are not needed, set the C bit to one by default in =20=
> the PTEs.
How interesting....
I've looked at many 8xx docs and they all have the same text
(probably cut/paste :-)) I'd place some debug code in the C functions
to print out a few of the TLB Entry for various errors to see if this =20=
really
happens, and for other errors, too. I guess I never stumbled into
this because I always thought I had to do everything from software,
so just made sure I did.
-- Dan
^ permalink raw reply
* Re: [PATCH 0/6] 8xx MMU fixes
From: Rex Feany @ 2009-10-09 0:15 UTC (permalink / raw)
To: Joakim Tjernlund; +Cc: Scott Wood, linuxppc-dev@ozlabs.org
In-Reply-To: <1255008298-19949-1-git-send-email-Joakim.Tjernlund@transmode.se>
Thus spake Joakim Tjernlund (Joakim.Tjernlund@transmode.se):
> So here we go again. This time I am
> fairly confindent I got most things correct :)
> Also manged to use even less instructions in the
> TLB Miss handlers.
>
> Scott and Rex, forget previous versions and
> try this one out.
This patch set is better, userspace seems more stable but I am still
seeing some odd problems. This is from straceing mount, which displays
nothing even though /proc/mounts shows everything mounted properly:
open("/proc/mounts", O_RDONLY) = 3
fstat64(0x3, 0x7fe7e2a8) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x3001f000
read(3, 0x3001f000, 1024) = -1 EFAULT (Bad address)
exit_group(0) = ?
it looks like the memory allocated using mmap is bad?
take care!
/rex.
^ permalink raw reply
* Re: [PATCH 3/6] 8xx: invalidate non present TLBs
From: Dan Malek @ 2009-10-09 0:05 UTC (permalink / raw)
To: Joakim Tjernlund; +Cc: Scott Wood, Rex Feany, linuxppc-dev@ozlabs.org
In-Reply-To: <OFB570D3CC.F2AE189F-ONC1257649.0070C808-C1257649.00714D8A@transmode.se>
On Oct 8, 2009, at 1:37 PM, Joakim Tjernlund wrote:
> Hi, been a long time since I heard from you :)
Yeah, hiding among other projects :-)
> Anyhow, you are welcome to have a look at the patches I have been
> tossing out.
I've been looking, but since I'm not familiar with the current
VM implementation, there isn't much I can contribute. If
I see something where I can be useful, I'll speak up.
Thanks.
-- Dan
^ permalink raw reply
* Re: [PATCH 1/2][v2] mm: add notifier in pageblock isolation for balloon drivers
From: Andrew Morton @ 2009-10-08 23:34 UTC (permalink / raw)
To: Robert Jennings
Cc: linux-mm, Mel Gorman, linux-kernel, linuxppc-dev,
Martin Schwidefsky, Badari Pulavarty, Brian King, Paul Mackerras,
Ingo Molnar, KAMEZAWA Hiroyuki
In-Reply-To: <20091002184458.GC4908@austin.ibm.com>
On Fri, 2 Oct 2009 13:44:58 -0500
Robert Jennings <rcj@linux.vnet.ibm.com> wrote:
> Memory balloon drivers can allocate a large amount of memory which
> is not movable but could be freed to accomodate memory hotplug remove.
>
> Prior to calling the memory hotplug notifier chain the memory in the
> pageblock is isolated. If the migrate type is not MIGRATE_MOVABLE the
> isolation will not proceed, causing the memory removal for that page
> range to fail.
>
> Rather than failing pageblock isolation if the the migrateteype is not
> MIGRATE_MOVABLE, this patch checks if all of the pages in the pageblock
> are owned by a registered balloon driver (or other entity) using a
> notifier chain. If all of the non-movable pages are owned by a balloon,
> they can be freed later through the memory notifier chain and the range
> can still be isolated in set_migratetype_isolate().
The patch looks sane enough to me.
I expect that if the powerpc and s390 guys want to work on CMM over the
next couple of months, they'd like this patch merged into 2.6.32. It's
a bit larger and more involved than one would like, but I guess we can
do that if suitable people (Mel? Kamezawa?) have had a close look and
are OK with it.
What do people think?
Has it been carefully compile- and run-time tested with
CONFIG_MEMORY_HOTPLUG_SPARSE=n?
^ permalink raw reply
* Re: [PATCH 3/6] 8xx: invalidate non present TLBs
From: Joakim Tjernlund @ 2009-10-08 23:01 UTC (permalink / raw)
To: Benjamin Herrenschmidt; +Cc: Scott Wood, linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <1255040628.2146.54.camel@pasglop>
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote on 09/10/2009 00:23:48:
>
> On Thu, 2009-10-08 at 15:08 -0700, Dan Malek wrote:
> > Hi Ben.
> >
> > On Oct 8, 2009, at 1:28 PM, Benjamin Herrenschmidt wrote:
> >
> > > While you are around ... I have a question :-)
> >
> > I'll try. Many brain cells have been replaced or lost
> > over the years :-)
>
> Replaced ? You lucky ! I only lose mines :-)
>
> > I thought we did a tlbie() (or whatever the equivalent is today)
> > when the PTE was updated in the table. An optimization was to
> > load the TLB with the entry at that time to avoid a subsequent miss.
> > In any case, the TLB entry has to be modified by the software.
>
> Ok, that's my understanding too and I think we had the tlbie in
> update_mmu_cache to do the trick, though the comment is misleading
> making it think that the only reason it's there is for the dcbst
> problem. At least that's my understanding. That was lost recently in 2.6
> so I'll have to put it back properly.
So you don't think my invalidate "only !present pages" patch in do_page_fault
is enough?
>
> I don't think we do the pre-load to avoid the second fault, but we
> certainly could and should.
>
> > I don't remember how C was used in the past, but I suspect
> > it just mirrored the Linux VM state. I'm quite certain the MMU
> > HW would never change a TLB entry. Where did you read this?
>
> MPC855UM, chapter 8.6 "Memory attributes":
>
> <<
> • Reference and change bit updates—The MPC850 does not generate an exception for
> an R (reference) bit update. In fact, there is no entry for an R bit in the TLB.
> The change bit (C) is bit 23 in the level-two descriptor, described in Table 8-4.
> Software updates C (changed) bits, but hardware treats the C bit (negated) as a
> write-protect attribute. Therefore, attempting to write to a page marked unmodified
> invalidates that entry and causes an implementation-specific DTLB error exception.
> ^^^^^^^^^^^^^^^^^^^^^^
> If change bits are not needed, set the C bit to one by default in the PTEs.
> >>
>
> And yes, it's weird and that's the only place I think where this is
> mentioned which makes me think it could well be a doco bug.
>
> > For most of the 8xx "early days," I used to just mark all write
> > pages as dirty. For some reason I just overloaded the write/changed
> > into one bit, it avoided TLB Error overhead and I think even some
> > silicon bugs. Since they were tiny and fairly static embedded
> > systems, it didn't have any effect on the way VM was managed.
>
> Well, nowadays at least, most of the time, a writeable page is also
> dirty unless it's a writeable shared mapping, and in that later case
> you really want to do proper dirty tracking. So I suspect we can
> simplify some of that and let the generic code handle dirty by mapping
> _PAGE_DIRTY to C and removing _PAGE_HWWRITE. We can also remove all
> of the asm munging from DataTLBError, and let the generic C code fixup
> DIRTY and ACCESSED when needed, since those should only rarely need a
> fixup.
>
> > The MMU HW on the 8xx is just a translator. I'm now really
> > certain it won't ever change a TLB entry. It's completely up to
> > software to make all TLB changes.
>
> That makes sense.
>
> > Just make it simple :-)
>
> Yeah. I think we can simplify the current code a lot, which will speed
> up TLB misses (well, nothing much you can do about the infamous errata
> #6 but that's another story). It won't give 2.6 back the 2.4 speed due
> to sheer bloat of the generic code but it might at least offset some of
> the loss by improving the overall TLB miss performance.
It won't get much faster than my current patch. Trapping all DTLB
Errors to C won't make it faster, only more correct should there be
a bug in the asm version. Actually there is one that has been there
all the time, guarded flag is not set by DTLB Error.
Jocke
>
> Now, I don't have any 8xx gear, so it will be up to Joakim, Scott etc...
> to get that stuff right :-)
Waiting for Rex and Scott to comment/test.
^ permalink raw reply
* Re: [RFC] misc/at24: add experimental OF support for the generic eeprom driver
From: Grant Likely @ 2009-10-08 22:59 UTC (permalink / raw)
To: Wolfram Sang; +Cc: linuxppc-dev, devicetree-discuss, linux-i2c
In-Reply-To: <20091008204808.GB8116@pengutronix.de>
On Thu, Oct 8, 2009 at 2:48 PM, Wolfram Sang <w.sang@pengutronix.de> wrote:
>
>> I think in general, this is the right direction; but I'm not convinced
>> that the right pattern or form has been found yet. =A0What I don't like
>> on this particular patch is that it still hooks of-specific stuff into
>> an arbitrary point in the probe routine.
>>
>> I'd like to see some pattern for retrieving or populating a
>> platform_data structure when one isn't already provided, and
>> regardless of the data source.
>
> I thought about this, too. I just wondered how many drivers would actuall=
y need
> a 'pdata'-preparation routine before probe, and if this would not cause t=
oo
> much overhead for those who don't. But as you say OF might not the only c=
ase
> where this is needed, then it might be a reasonable choice to have an ext=
ra
> call fot setting up pdata. Then again, if we have preparation routines fo=
r
> OF,UEFI,... for each and every driver, uh, the bloat :(
It shouldn't be too bad.... I'm toying with the idea of a callable
library function which is passed in a data table of the available
translators. Can be very simple, and the library function would take
care of pdata allocation and management. Then the translators can be
kept down to the bare minimum of populating the structure. If it is
done right, then only the needed translators get compiled in and it
won't add any significant size at all to the drivers.
>> will take a bit of experimentation to come up with the best form for
>> the pdata fetching function, but it will be better contained if it is
>> all at a single place.
>
> I might have a try :)
:-)
--=20
Grant Likely, B.Sc., P.Eng.
Secret Lab Technologies Ltd.
^ permalink raw reply
* Re: [PATCH 2/6] 8xx: Update TLB asm so it behaves as linux mm expects.
From: Joakim Tjernlund @ 2009-10-08 22:44 UTC (permalink / raw)
To: Benjamin Herrenschmidt; +Cc: Scott Wood, linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <1255035843.2146.39.camel@pasglop>
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote on 08/10/2009 23:04:03:
>
> On Thu, 2009-10-08 at 15:24 +0200, Joakim Tjernlund wrote:
>
> > +#define _PAGE_RW 0x0400 /* lsb PP bits, inverted in HW */
> > +#define _PAGE_USER 0x0800 /* msb PP bits */
> >
> > + /* r10=(r10&~_PAGE_PRESENT)|((r10&_PAGE_ACCESSED)>>5) */
> > + rlwimi. r10, r10, 27, 31, 31
> > + beq- cr0, 2f /* Can be removed, costs a ITLB Err */
>
> Did you mean
(counting bits) ... No, it should be >> 5
>
> + /* r10=(r10&~_PAGE_PRESENT)|((r10&_PAGE_ACCESSED)>>4) */
> + rlwimi. r10, r10, 28, 30, 31
> + beq- cr0, 2f /* Can be removed, costs a ITLB Err */
>
> Which would still be incorrect. You want -both- to be set,
> and your code will move on if -any- is set. Might want to add
> a ~ of the whole thing maybe and invert the condition ?
I want:
if (!accessed)
present = 0;
accessed == 1 and present = 0 is impossible, right?
So basically just copy over accessed to present and
linux mm set both when trapping to C.
>
> I find it easier to just do li rX, requested_bits, and then, andc.
> rscratch, rX, rPTE
>
> > +#if 0 /* Dont' bother with PP lsb, bit 21 for now */
> > + /* r10 = (r10 & ~0x0400) | ((r10 & _PAGE_EXEC) << 7) */
> > + rlwimi r10, r10, 7, 21, 21 /* Set _PAGE_EXEC << 7 */
> > +#endif
>
> I don't get that one. Don't bother with _PAGE_EXEC, there's no
> control of execute permission that works on 8xx. It's #if 0 anyway.
What about the execute perms in Level 2 descriptor, page 247?
>
> You still need to massage the PP bits into place. I don't see that
> happening.
Not at the moment, later.
>
> As it is, your PTE contains for bit 20 and 21, which translates to:
>
> PTE: Translates to PP bits:
> RW: 0 USER: 0 00 supervisor RW (ok)
> RW: 0 USER: 1 01 supervisor RW user RO (WRONG)
> RW: 1 USER: 0 10 supervisor RW user RW (WRONG)
> RW: 1 USER: 1 11 supervisor RO user RO (WRONG)
You got USER and RW swapped and the table is different
for exec.
>
> I would suggest you do the stuff I suggested initially with RW and USER
> being an "index" into a pre-cooked immediate value with all the encodings
> which also gives you the extended encoding for supervisor RO for free.
>
> > + /* Need to know if load/store -> force a TLB Error
> > + * by copying ACCESSED to PRESENT.
> > + */
> > + /* r10=(r10&~_PAGE_PRESENT)|((r10&_PAGE_ACCESSED)>>5) */
> > + rlwimi r10, r10, 27, 31, 31
>
> That doesn't sound right, just like ITLB... you need to AND those two
> bits in a way or another, or basically test that they are both set
> (or neither is set)
Same here as for ITLB.
>
> > +#if 0 /* Not yet */
> > + /* Honour kernel RO, User NA */
> > + andi. r11, r10, _PAGE_USER | _PAGE_RW
> > + bne- cr0, 5f
> > + ori r10,r10, 0x200 /* Extended encoding, bit 22 */
> > #endif
> > - mfspr r11, SPRN_MD_TWC /* get the pte address again */
> > - stw r10, 0(r11)
> > +5: xori r10, r10, _PAGE_RW /* invert RW bit */
>
> Well, you are missing that xori from the ITLB miss I think. Also, that
Nope, no xori needed for exec perms
> changes the table above to:
>
> PTE: Translates to PP bits:
> RW: 0 USER: 0 10 supervisor RW user RW (WRONG)
> RW: 0 USER: 1 11 supervisor RO user RO (ok)
> RW: 1 USER: 0 00 supervisor RW (ok)
> RW: 1 USER: 1 01 supervisor RW user RO (WRONG)
>
> So it's still not right :-)
User and RW swapped here too, as I read the table.
I don't think user space would boot if I got it wrong.
^ permalink raw reply
* Re: [PATCH 3/6] 8xx: invalidate non present TLBs
From: Benjamin Herrenschmidt @ 2009-10-08 22:23 UTC (permalink / raw)
To: Dan Malek; +Cc: Scott Wood, linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <34A4641D-8121-4EBC-9880-26CB70DD3339@embeddedalley.com>
On Thu, 2009-10-08 at 15:08 -0700, Dan Malek wrote:
> Hi Ben.
>
> On Oct 8, 2009, at 1:28 PM, Benjamin Herrenschmidt wrote:
>
> > While you are around ... I have a question :-)
>
> I'll try. Many brain cells have been replaced or lost
> over the years :-)
Replaced ? You lucky ! I only lose mines :-)
> I thought we did a tlbie() (or whatever the equivalent is today)
> when the PTE was updated in the table. An optimization was to
> load the TLB with the entry at that time to avoid a subsequent miss.
> In any case, the TLB entry has to be modified by the software.
Ok, that's my understanding too and I think we had the tlbie in
update_mmu_cache to do the trick, though the comment is misleading
making it think that the only reason it's there is for the dcbst
problem. At least that's my understanding. That was lost recently in 2.6
so I'll have to put it back properly.
I don't think we do the pre-load to avoid the second fault, but we
certainly could and should.
> I don't remember how C was used in the past, but I suspect
> it just mirrored the Linux VM state. I'm quite certain the MMU
> HW would never change a TLB entry. Where did you read this?
MPC855UM, chapter 8.6 "Memory attributes":
<<
• Reference and change bit updates—The MPC850 does not generate an exception for
an R (reference) bit update. In fact, there is no entry for an R bit in the TLB.
The change bit (C) is bit 23 in the level-two descriptor, described in Table 8-4.
Software updates C (changed) bits, but hardware treats the C bit (negated) as a
write-protect attribute. Therefore, attempting to write to a page marked unmodified
invalidates that entry and causes an implementation-specific DTLB error exception.
^^^^^^^^^^^^^^^^^^^^^^
If change bits are not needed, set the C bit to one by default in the PTEs.
>>
And yes, it's weird and that's the only place I think where this is
mentioned which makes me think it could well be a doco bug.
> For most of the 8xx "early days," I used to just mark all write
> pages as dirty. For some reason I just overloaded the write/changed
> into one bit, it avoided TLB Error overhead and I think even some
> silicon bugs. Since they were tiny and fairly static embedded
> systems, it didn't have any effect on the way VM was managed.
Well, nowadays at least, most of the time, a writeable page is also
dirty unless it's a writeable shared mapping, and in that later case
you really want to do proper dirty tracking. So I suspect we can
simplify some of that and let the generic code handle dirty by mapping
_PAGE_DIRTY to C and removing _PAGE_HWWRITE. We can also remove all
of the asm munging from DataTLBError, and let the generic C code fixup
DIRTY and ACCESSED when needed, since those should only rarely need a
fixup.
> The MMU HW on the 8xx is just a translator. I'm now really
> certain it won't ever change a TLB entry. It's completely up to
> software to make all TLB changes.
That makes sense.
> Just make it simple :-)
Yeah. I think we can simplify the current code a lot, which will speed
up TLB misses (well, nothing much you can do about the infamous errata
#6 but that's another story). It won't give 2.6 back the 2.4 speed due
to sheer bloat of the generic code but it might at least offset some of
the loss by improving the overall TLB miss performance.
Now, I don't have any 8xx gear, so it will be up to Joakim, Scott etc...
to get that stuff right :-)
Thanks for your feedback.
Cheers,
Ben.
> Thanks.
>
> -- Dan
^ permalink raw reply
* Re: [RFC] misc/at24: add experimental OF support for the generic eeprom driver
From: Anton Vorontsov @ 2009-10-08 22:20 UTC (permalink / raw)
To: Grant Likely; +Cc: linuxppc-dev, devicetree-discuss, linux-embedded, linux-i2c
In-Reply-To: <fa686aa40910080848r459c47baob73fc70a95a08604@mail.gmail.com>
On Thu, Oct 08, 2009 at 09:48:50AM -0600, Grant Likely wrote:
> On Thu, Oct 8, 2009 at 9:10 AM, Anton Vorontsov
[...]
> > It's *always* a small amound of code, at a start. Then we get
> > floppy disk drivers and the tty layer. ;-)
>
> Holy straw man argument Batman!
>
> But the focus is still on creating pdata. If a translator gets too
> big, then sure, split it into a separate file. Until then, there I
> see no good reason to do so now.
Luckily, I'm not at24 driver maintainer (Wolfram himself is ;-),
but as a maintainer of driver "Foo", I would not want to see
completely unfamiliar "Bar" in my shiny driver.
Another plus is that you can bypass (or almost bypass) subsystem
maintainers when merging OF-specific patches (since he/she couldn't
possibly care less about all these weird arch internals. But again,
this doesn't work for this particular driver since Wolfram is the
maintainer :-).
> > If I wasn't a PPC/OF guy to some degree, I'd hate PPC/OF people
> > for bringing arch-specific details into a generic code... :-P
>
> No, this goes beyond PPC/OF. The real issue is that it is no longer a
> safe assumption that pdata will be a static data structure in platform
> code. The number of possible data sources is going to get larger, not
> smaller. OF is just one. UEFI is another. Translating that data
> into pdata will be the problem that comes up over and over again.
> However, translation code is still driver specific,
> so it belongs with the driver that it translates code for.
Wait... The translation code depends on a platform, and on a
platform_data structure, the same as non-OF arch-specific code
depends on it. How is it different from a static platform data
in the arch/ code? We don't put static platform data into the
drivers and surround them with ugly #ifdefs+machine_is()...
> So, in my opinion, translation code must:
> 1. be *tiny*
Yeah, dream on. ;-) It's tiny when all you have is of_get_property(),
I'd like to see the code when you'll have GPIOs, IRQs, and platform-
specific fixups.
You might say that at24 doesn't need that stuff, but it does.
Suppose AT24's WP pin is connected to a GPIO, and without
'read-only' property I'd like the driver to pull the pin low,
and vice versa: with 'read-only' specifier, WP should be tied
high. Or if WP is controlled by a switch/jumper, GPIO can be
used to read current WP state.
> -- should be trivial to add to a driver without impacting
> common code
This is doable, yes.
> > No matter how small the OF code is, I believe we shouldn't put it
> > into the generic code. Take a look at mmc_spi case again, it can be
> > easily extended to any arch, because there is no arch-specific stuff,
> > but a "get/put" pattern for platform data.
>
> I'm not disagreeing with you that the arch specific stuff should be
> logically separated from the generic code. But I don't agree that it
> belongs in a separate file. And I also think that the mmc_spi
> implementation uses too much code. There must be a better way.
I wonder how you'd shrink the mmc_spi bindings, can you elaborate?
Thanks,
--
Anton Vorontsov
email: cbouatmailru@gmail.com
irc://irc.freenode.net/bd2
^ permalink raw reply
* Re: [PATCH 3/6] 8xx: invalidate non present TLBs
From: Dan Malek @ 2009-10-08 22:08 UTC (permalink / raw)
To: Benjamin Herrenschmidt; +Cc: Scott Wood, linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <1255033705.2146.17.camel@pasglop>
Hi Ben.
On Oct 8, 2009, at 1:28 PM, Benjamin Herrenschmidt wrote:
> While you are around ... I have a question :-)
I'll try. Many brain cells have been replaced or lost
over the years :-)
> Do you happen to remember what the story is with the invalidation of
> "unpopulated" (aka invalid) entries ?
>
> IE. We create those in the 8xx TLB miss when the PTE is !present
> (or the
> PMD is absent). Those then cause a TLB error on the next access which
> allows us to process the page fault. But when/how are those invalid
> entries supposed to be invalidated ?
I thought we did a tlbie() (or whatever the equivalent is today)
when the PTE was updated in the table. An optimization was to
load the TLB with the entry at that time to avoid a subsequent miss.
In any case, the TLB entry has to be modified by the software.
> The doco seems to hint that at least in the case of an entry with the
> wrong C bit (store to an entry with C=0), the HW automatically
> invalidates it before taking the TLB Error but that's all I found.
I don't remember how C was used in the past, but I suspect
it just mirrored the Linux VM state. I'm quite certain the MMU
HW would never change a TLB entry. Where did you read this?
For most of the 8xx "early days," I used to just mark all write
pages as dirty. For some reason I just overloaded the write/changed
into one bit, it avoided TLB Error overhead and I think even some
silicon bugs. Since they were tiny and fairly static embedded
systems, it didn't have any effect on the way VM was managed.
> Is there a general HW policy on 8xx to invalidate TLB entries that
> cause
> TLB errors ? Or is the kernel expected to do it most of the time ?
The MMU HW on the 8xx is just a translator. I'm now really
certain it won't ever change a TLB entry. It's completely up to
software to make all TLB changes.
Just make it simple :-)
Thanks.
-- Dan
^ permalink raw reply
* Re: [PATCH 2/6] 8xx: Update TLB asm so it behaves as linux mm expects.
From: Benjamin Herrenschmidt @ 2009-10-08 21:04 UTC (permalink / raw)
To: Joakim Tjernlund; +Cc: Scott Wood, linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <1255008298-19949-3-git-send-email-Joakim.Tjernlund@transmode.se>
On Thu, 2009-10-08 at 15:24 +0200, Joakim Tjernlund wrote:
> +#define _PAGE_RW 0x0400 /* lsb PP bits, inverted in HW */
> +#define _PAGE_USER 0x0800 /* msb PP bits */
>
> + /* r10=(r10&~_PAGE_PRESENT)|((r10&_PAGE_ACCESSED)>>5) */
> + rlwimi. r10, r10, 27, 31, 31
> + beq- cr0, 2f /* Can be removed, costs a ITLB Err */
Did you mean
+ /* r10=(r10&~_PAGE_PRESENT)|((r10&_PAGE_ACCESSED)>>4) */
+ rlwimi. r10, r10, 28, 30, 31
+ beq- cr0, 2f /* Can be removed, costs a ITLB Err */
Which would still be incorrect. You want -both- to be set,
and your code will move on if -any- is set. Might want to add
a ~ of the whole thing maybe and invert the condition ?
I find it easier to just do li rX, requested_bits, and then, andc.
rscratch, rX, rPTE
> +#if 0 /* Dont' bother with PP lsb, bit 21 for now */
> + /* r10 = (r10 & ~0x0400) | ((r10 & _PAGE_EXEC) << 7) */
> + rlwimi r10, r10, 7, 21, 21 /* Set _PAGE_EXEC << 7 */
> +#endif
I don't get that one. Don't bother with _PAGE_EXEC, there's no
control of execute permission that works on 8xx. It's #if 0 anyway.
You still need to massage the PP bits into place. I don't see that
happening.
As it is, your PTE contains for bit 20 and 21, which translates to:
PTE: Translates to PP bits:
RW: 0 USER: 0 00 supervisor RW (ok)
RW: 0 USER: 1 01 supervisor RW user RO (WRONG)
RW: 1 USER: 0 10 supervisor RW user RW (WRONG)
RW: 1 USER: 1 11 supervisor RO user RO (WRONG)
I would suggest you do the stuff I suggested initially with RW and USER
being an "index" into a pre-cooked immediate value with all the encodings
which also gives you the extended encoding for supervisor RO for free.
> + /* Need to know if load/store -> force a TLB Error
> + * by copying ACCESSED to PRESENT.
> + */
> + /* r10=(r10&~_PAGE_PRESENT)|((r10&_PAGE_ACCESSED)>>5) */
> + rlwimi r10, r10, 27, 31, 31
That doesn't sound right, just like ITLB... you need to AND those two
bits in a way or another, or basically test that they are both set
(or neither is set)
> +#if 0 /* Not yet */
> + /* Honour kernel RO, User NA */
> + andi. r11, r10, _PAGE_USER | _PAGE_RW
> + bne- cr0, 5f
> + ori r10,r10, 0x200 /* Extended encoding, bit 22 */
> #endif
> - mfspr r11, SPRN_MD_TWC /* get the pte address again */
> - stw r10, 0(r11)
> +5: xori r10, r10, _PAGE_RW /* invert RW bit */
Well, you are missing that xori from the ITLB miss I think. Also, that
changes the table above to:
PTE: Translates to PP bits:
RW: 0 USER: 0 10 supervisor RW user RW (WRONG)
RW: 0 USER: 1 11 supervisor RO user RO (ok)
RW: 1 USER: 0 00 supervisor RW (ok)
RW: 1 USER: 1 01 supervisor RW user RO (WRONG)
So it's still not right :-)
Cheers,
Ben.
^ permalink raw reply
* Re: [PATCH 1/6] 8xx: DTLB Error must check for more errors.
From: Benjamin Herrenschmidt @ 2009-10-08 20:48 UTC (permalink / raw)
To: Joakim Tjernlund; +Cc: Scott Wood, linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <1255008298-19949-2-git-send-email-Joakim.Tjernlund@transmode.se>
On Thu, 2009-10-08 at 15:24 +0200, Joakim Tjernlund wrote:
> DataTLBError currently does:
> if ((err & 0x02000000) == 0)
> DSI();
> This won't handle a store with no valid translation.
> Change this to
> if ((err & 0x48000000) != 0)
> DSI();
> that is, branch to DSI if either !permission or
> !translation.
> ---
As I said earlier, I don't think this is necessary, just get rid of the
whole bunch of code in DataTLBError :-)
Ben.
> arch/powerpc/kernel/head_8xx.S | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
> index 52ff8c5..118bb05 100644
> --- a/arch/powerpc/kernel/head_8xx.S
> +++ b/arch/powerpc/kernel/head_8xx.S
> @@ -472,8 +472,8 @@ DataTLBError:
> /* First, make sure this was a store operation.
> */
> mfspr r10, SPRN_DSISR
> - andis. r11, r10, 0x0200 /* If set, indicates store op */
> - beq 2f
> + andis. r11, r10, 0x4800 /* no translation, no permission. */
> + bne 2f /* branch if either is set */
>
> /* The EA of a data TLB miss is automatically stored in the MD_EPN
> * register. The EA of a data TLB error is automatically stored in
^ permalink raw reply
* Re: [RFC] misc/at24: add experimental OF support for the generic eeprom driver
From: Wolfram Sang @ 2009-10-08 20:48 UTC (permalink / raw)
To: Grant Likely; +Cc: linuxppc-dev, devicetree-discuss, linux-i2c
In-Reply-To: <fa686aa40910080737l5b21654cw9c8f7426f9faf7c3@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 1237 bytes --]
> I think in general, this is the right direction; but I'm not convinced
> that the right pattern or form has been found yet. What I don't like
> on this particular patch is that it still hooks of-specific stuff into
> an arbitrary point in the probe routine.
>
> I'd like to see some pattern for retrieving or populating a
> platform_data structure when one isn't already provided, and
> regardless of the data source.
I thought about this, too. I just wondered how many drivers would actually need
a 'pdata'-preparation routine before probe, and if this would not cause too
much overhead for those who don't. But as you say OF might not the only case
where this is needed, then it might be a reasonable choice to have an extra
call fot setting up pdata. Then again, if we have preparation routines for
OF,UEFI,... for each and every driver, uh, the bloat :(
> will take a bit of experimentation to come up with the best form for
> the pdata fetching function, but it will be better contained if it is
> all at a single place.
I might have a try :)
--
Pengutronix e.K. | Wolfram Sang |
Industrial Linux Solutions | http://www.pengutronix.de/ |
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]
^ permalink raw reply
* Re: [PATCH 3/6] 8xx: invalidate non present TLBs
From: Benjamin Herrenschmidt @ 2009-10-08 20:44 UTC (permalink / raw)
To: Joakim Tjernlund; +Cc: Scott Wood, linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <OFB570D3CC.F2AE189F-ONC1257649.0070C808-C1257649.00714D8A@transmode.se>
> One could try clearing the store bit in the page fault handler, but then
> that might cause a loop.
> Not sure it has any practical meaning though.
>
> Anyhow, you are welcome to have a look at the patches I have been tossing out.
The store bit in do_page_fault() is -very- important (and the only DSISR
bit that is as I wrote earlier). The generic code must know if a fault
is caused by a load or a store since the later is what will cause
copy_on_write to happen for example, or dirty to be set, etc..
Ben.
^ permalink raw reply
* Re: [PATCH 3/6] 8xx: invalidate non present TLBs
From: Benjamin Herrenschmidt @ 2009-10-08 20:42 UTC (permalink / raw)
To: Joakim Tjernlund; +Cc: Scott Wood, linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <OF44C1983F.1C42E290-ONC1257649.0069CAF3-C1257649.006A78D1@transmode.se>
On Thu, 2009-10-08 at 21:22 +0200, Joakim Tjernlund wrote:
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote on 07/10/2009 23:18:05:
> >
> > On Wed, 2009-10-07 at 22:46 +0200, Joakim Tjernlund wrote:
> > > 8xx sometimes need to load a invalid/non-present TLBs in
> > > it DTLB asm handler.
> > > These must be invalidated separaly as linux mm don't.
> >
> > not sure about the dsisr test here, what is the point ?
>
> Without this patch I get about twice as many DTLB errors( on 2.4)
Ok, so it is useful... I would have thought that invalidating a TLB
entry that just caused a fault mostly be a nop.. well, the tlbie on 8xx
ignores the ASID so maybe it's invalidating next door process entries
but even that doesn't sound right. The TLB is so tiny on these things...
Oh well, as I said, something else to look at more closely.
> I have also noted that all my dcbst DTLB has the store bit set:
> trap:300 address:10030b8c, dar:10030b8c,err:42000000 dcbst
>
> Thare are comments in the kernel that dcbst wrongly
> generates TLB Errors with store set on 8xx. Is this really so?
> Should dcbst always trap as a load?
Architecturally it should, that's a known 8xx core bug.
Cheers,
Ben.
> Jocke
>
> >
> > Cheers,
> > Ben.
> >
> > > ---
> > > arch/powerpc/mm/fault.c | 8 +++++++-
> > > 1 files changed, 7 insertions(+), 1 deletions(-)
> > >
> > > diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> > > index 7699394..72941c7 100644
> > > --- a/arch/powerpc/mm/fault.c
> > > +++ b/arch/powerpc/mm/fault.c
> > > @@ -39,7 +39,7 @@
> > > #include <asm/uaccess.h>
> > > #include <asm/tlbflush.h>
> > > #include <asm/siginfo.h>
> > > -
> > > +#include <mm/mmu_decl.h>
> > >
> > > #ifdef CONFIG_KPROBES
> > > static inline int notify_page_fault(struct pt_regs *regs)
> > > @@ -243,6 +243,12 @@ good_area:
> > > goto bad_area;
> > > #endif /* CONFIG_6xx */
> > > #if defined(CONFIG_8xx)
> > > + /* 8xx sometimes need to load a invalid/non-present TLBs.
> > > + * These must be invalidated separately as linux mm don't.
> > > + */
> > > + if (error_code & 0x40000000) /* no translation? */
> > > + _tlbil_va(address);
> > > +
> > > /* The MPC8xx seems to always set 0x80000000, which is
> > > * "undefined". Of those that can be set, this is the only
> > > * one which seems bad.
> >
> >
> >
> >
^ permalink raw reply
* Re: [PATCH 3/6] 8xx: invalidate non present TLBs
From: Joakim Tjernlund @ 2009-10-08 20:37 UTC (permalink / raw)
To: Dan Malek; +Cc: Scott Wood, Rex Feany, linuxppc-dev@ozlabs.org
In-Reply-To: <7A3C6D4C-E92B-434D-AF68-7AEEDE6DAD45@embeddedalley.com>
Dan Malek <dan@embeddedalley.com> wrote on 08/10/2009 22:11:07:
>
>
> On Oct 8, 2009, at 12:22 PM, Joakim Tjernlund wrote:
>
> > hare are comments in the kernel that dcbst wrongly
> > generates TLB Errors with store set on 8xx. Is this really so?
> > Should dcbst always trap as a load?
Hi, been a long time since I heard from you :)
>
> There are many comments written about 8xx as various
> behavior was discovered. Worse, some of these details
> would be different among the different processor versions.
> You need to be careful and test as many different part
> versions as possible to ensure you have everything
> covered..... then someone will find a part that doesn't
> quite work, "fix" it, and break others :-)
>
> In this particular case, the PEM does state dcbst is treated
> as a load, but from experience we know 8xx doesn't work
> that way. Of course, since dcbst is a store operation,
> you could argue that 8xx got it correct :-)
One could try clearing the store bit in the page fault handler, but then
that might cause a loop.
Not sure it has any practical meaning though.
Anyhow, you are welcome to have a look at the patches I have been tossing out.
Jocke
^ permalink raw reply
* Re: [PATCH 3/6] 8xx: invalidate non present TLBs
From: Benjamin Herrenschmidt @ 2009-10-08 20:28 UTC (permalink / raw)
To: Dan Malek; +Cc: Scott Wood, linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <7A3C6D4C-E92B-434D-AF68-7AEEDE6DAD45@embeddedalley.com>
Hoy Dan !
While you are around ... I have a question :-)
Do you happen to remember what the story is with the invalidation of
"unpopulated" (aka invalid) entries ?
IE. We create those in the 8xx TLB miss when the PTE is !present (or the
PMD is absent). Those then cause a TLB error on the next access which
allows us to process the page fault. But when/how are those invalid
entries supposed to be invalidated ?
The doco seems to hint that at least in the case of an entry with the
wrong C bit (store to an entry with C=0), the HW automatically
invalidates it before taking the TLB Error but that's all I found.
Is there a general HW policy on 8xx to invalidate TLB entries that cause
TLB errors ? Or is the kernel expected to do it most of the time ?
Cheers,
Ben.
^ permalink raw reply
* Re: [RFC] misc/at24: add experimental OF support for the generic eeprom driver
From: Wolfram Sang @ 2009-10-08 20:27 UTC (permalink / raw)
To: Grant Likely; +Cc: linux-embedded, linuxppc-dev, linux-i2c, devicetree-discuss
In-Reply-To: <fa686aa40910080848r459c47baob73fc70a95a08604@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 1250 bytes --]
> No, this goes beyond PPC/OF. The real issue is that it is no longer a
> safe assumption that pdata will be a static data structure in platform
> code. The number of possible data sources is going to get larger, not
> smaller. OF is just one. UEFI is another. Translating that data
> into pdata will be the problem that comes up over and over again.
> However, translation code is still driver specific, so it belongs with
> the driver that it translates code for.
>
> So, in my opinion, translation code must:
> 1. be *tiny* -- should be trivial to add to a driver without impacting
> common code
> 2. live with the driver that it translates data for; ideally in the
> same .c file for drivers that are small.
I am with Grant on these points. It is more than just PPC.
> > No matter how small the OF code is, I believe we shouldn't put it
> > into the generic code. Take a look at mmc_spi case again, it can be
> > easily extended to any arch, because there is no arch-specific stuff,
> > but a "get/put" pattern for platform data.
Will check this tomorrow.
--
Pengutronix e.K. | Wolfram Sang |
Industrial Linux Solutions | http://www.pengutronix.de/ |
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]
^ permalink raw reply
* Re: [PATCH 2/6] 8xx: get rid of _PAGE_HWWRITE dependency in MMU.
From: Benjamin Herrenschmidt @ 2009-10-08 20:21 UTC (permalink / raw)
To: Joakim Tjernlund; +Cc: Scott Wood, linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <OFB3DEEF6E.1F3F6AC5-ONC1257649.00246C73-C1257649.002523BD@transmode.se>
On Thu, 2009-10-08 at 08:45 +0200, Joakim Tjernlund wrote:
> > Generic code should sort it out in handle_mm_fault() (or earlier if it
> > can't find a VMA at all).
>
> How can it? You need to know more that just read and write.
It does. It's going to look for the VMA, which will tell it what is
allowed or not. You'll notice 4xx/BookE doesn't use DSISR (except the
ESR bit we pass to separate loads from stores).
If the region has no access, the kernel will know it (no VMA for
example) and will trigger a SEGV.
Really, the DSISR stuff is not as necessary as you think it is :-) You
should be able to jump to C code straight from both TLB error
interrupts.
> >
> > But that's a slow path anyways.
>
> How so? You take a TLB Error for the first write to
> every page.
Compared to the TLB miss that is :-) But my main point is that a TLB
error caused by a lack of DIRTY or ACCESSED will be rare.
Ben.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox