LinuxPPC-Dev Archive on lore.kernel.org

* [git pull] Please pull powerpc.git next branch
From: Benjamin Herrenschmidt @ 2011-11-06 23:35 UTC
  To: Linus Torvalds; +Cc: linuxppc-dev list, Andrew Morton, Linux Kernel list

Hi Linus !

Here's (finally) the powerpc stuff for this merge window. It's late: as
I warned you during KS, I was on vacation & travelling around and really
couldn't get to it earlier than today. Everything in there has been
in linux-next for a while anyway; the only difference from what was on
github a month ago is that I merged a few more freescale bits from
Kumar.

As for the highlights, you get the new "powernv" platform which allows
booting under the new "OPAL" firmware. This will allow booting without a
hypervisor on future IBM POWER machines, in order to be able to run KVM.
There's still one missing component to support the latest PCI Express
bridges, but it's a drop-in addition, so I might still merge it after
-rc1 (or not .. I haven't decided yet, I held on to it for a bit as it
was depending on some PCI changes that went upstream separately via
Jesse and dealing with the dependency while travelling was deemed too
annoying).

We also have a bunch of NUMA fixes from Anton, some DMA code cleanup
from Milton and the usual batch of embedded bits and pieces.

Cheers,
Ben.
 
The following changes since commit d6748066ad0e8b2514545998f8367ebb3906f299:

  Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus (2011-11-03 13:28:14 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc.git next

Anatolij Gustschin (5):
      powerpc/5200: mpc5200b.dtsi: add spi node address- and size-cells properties
      powerpc/5200: dts: digsy_mtc.dts: update to add can, pci, serial and spi
      powerpc/5200: dts: digsy_mtc.dts: add timer0 and timer1 gpio properties
      powerpc/5200: dts: digsy_mtc.dts: enable both MSCAN nodes
      powerpc/85xx: fix PHYS_64BIT selection for P1022DS

Anshuman Khandual (1):
      perf events, powerpc: Add POWER7 stalled-cycles-frontend/backend events

Anton Blanchard (11):
      powerpc/pseries: Avoid spurious error during hotplug CPU add
      powerpc/numa: Enable SD_WAKE_AFFINE in node definition
      sched: Allow SD_NODES_PER_DOMAIN to be overridden
      powerpc/numa: Increase SD_NODES_PER_DOMAIN to 32.
      powerpc/numa: Disable NEWIDLE balancing at node level
      powerpc/numa: Remove duplicate RECLAIM_DISTANCE definition
      powerpc/numa: Remove double of_node_put in hot_add_node_scn_to_nid
      powerpc: Use for_each_node_by_type instead of open coding it
      powerpc: Coding style cleanups
      powerpc: Fix oops when echoing bad values to /sys/devices/system/memory/probe
      powerpc: Fix deadlock in icswx code

Arnaud Lacombe (1):
      powerpc/xics: Add __init to marker icp_native_init()

Arnd Bergmann (1):
      serial/8250: Move UPIO_TSI to powerpc

Ayman El-Khashab (1):
      powerpc/4xx: enable and fix pcie gen1/gen2 on the 460sx

Becky Bruce (4):
      powerpc: Hugetlb for BookE
      powerpc: Update mpc85xx/corenet 32-bit defconfigs
      powerpc: Update corenet64_smp_defconfig
      powerpc/fsl-booke: Fix settlbcam for 64-bit

Benjamin Herrenschmidt (27):
      Merge remote-tracking branch 'jwb/next' into next
      Merge remote-tracking branch 'origin/master' into next
      powerpc/wsp: Add PCIe Root support to PowerEN/WSP
      Merge remote-tracking branch 'origin/master' into next
      powerpc/udbg: Fix Kconfig entry for avoiding 44x early debug with KVM
      powerpc/smp: More generic support for "soft hotplug"
      powerpc/pci: Call pcie_bus_configure_settings()
      powerpc/powernv: Don't clobber r9 in relative_toc()
      powerpc: Add skeleton PowerNV platform
      of: Change logic to overwrite cmd_line with CONFIG_CMDLINE
      powerpc/powernv: Add CPU hotplug support
      powerpc/powernv: Add OPAL takeover from PowerVM
      powerpc/powernv: Get kernel command line accross OPAL takeover
      powerpc/powernv: Basic support for OPAL
      powerpc/powernv: Add support for instanciating OPAL v2 from Open Firmware
      powerpc/powernv: Support for OPAL console
      powerpc/powernv: Hookup reboot and poweroff functions
      powerpc/powernv: Add RTC and NVRAM support plus RTAS fallbacks
      powerpc/powernv: Add OPAL ICS backend
      powerpc/powernv: Register and handle OPAL interrupts
      powerpc/powernv: Machine check and other system interrupts
      powerpc/powernv: Add support for p5ioc2 PCI-X and PCIe
      powerpc/powernv: Implement MSI support for p5ioc2 PCIe
      powerpc/powernv: Handle PCI-X/PCIe reset delay
      powerpc/pci: Don't configure PCIe settings when PCI_PROBE_ONLY is set
      powerpc/ptrace: Fix build with gcc 4.6
      powerpc: Don't try OPAL takeover on old 970 blades

Bharat Bhushan (1):
      powerpc: e500mc: Fix: use CONFIG_PPC_E500MC in idle_e500.S

Brian King (1):
      hvcs: Ensure page aligned partner info buffer

Carl E. Love (1):
      powerpc/perf_event: Fix Power6 L1 cache read & write event codes]

Dmitry Eremin-Solenikov (5):
      cpc925_edac: Support single-processor configurations
      powerpc/85xx: sbc8560 - correct compilation if CONFIG_PHYS_ADDR_T_64BIT is set
      powerpc/85xx: ksi8560 - declare that localbus is compatbile with simple-bus
      powerpc/85xx: sbc8560 - declare that localbus is compatbile with simple-bus
      powerpc/mpc8349emitx: mark localbus as compatible with simple-bus

Fabio Baltieri (1):
      powerpc/83xx: Add shutdown request support to MCU handling on MPC8349 MITX

Felix Radensky (1):
      powerpc/8xxx: Fix interrupt handling in MPC8xxx GPIO driver

Hector Martin (1):
      powerpc/ps3: Add gelic udbg driver

Holger Brunck (1):
      powerpc/82xx: updates for mgcoge

Hongjun Chen (1):
      powerpc/cpm: Clear muram before it is in use.

Jim Keniston (1):
      powerpc/nvram: Add compression to fit more oops output into NVRAM

Jimi Xenidis (2):
      powerpc/wsp: Fix Wire Speed Processor platform configs
      powerpc: Fix xmon for systems without MSR[RI]

Josh Boyer (1):
      powerpc/40x: Remove obsolete HCU4 board

Julia Lawall (1):
      pseries/iommu: Add missing kfree

Kumar Gala (6):
      powerpc/85xx: Rename PowerPC core nodes to match other e500mc based .dts
      powerpc/fsl-booke: Handle L1 D-cache parity error correctly on e500mc
      powerpc: respect mem= setting for early memory limit setup
      powerpc/fsl-booke: Fix setup_initial_memory_limit to not blindly map
      powerpc/85xx: Setup secondary cores PIR with hard SMP id
      powerpc/85xx: Add 'fsl,pq3-gpio' compatiable for GPIO driver

Liu Yu (3):
      powerpc/math_emu/efp: Use pr_debug instead of printk
      powerpc/math_emu/efp: No need to round if the result is exact
      powerpc/math_emu/efp: Look for errata handler when type mismatches

Martyn Welch (1):
      powerpc/86xx: Correct Gianfar support for GE boards

Matthew McClintock (5):
      powerpc: Fix build dependencies for epapr.c which needs libfdt.h
      powerpc/85xx: Fix support for enabling doorbells for IPIs
      powerpc/85xx: issue 15 EOI after core reset for FSL CoreNet devices
      powerpc/fsl_booke: Fix comment in head_fsl_booke.S
      powerpc/85xx: Make kexec to interate over online cpus

Michael Ellerman (1):
      powerpc/wsp: Add MSI support for PCI on PowerEN

Mihai Caraman (1):
      drivers/virt: add ioctl for 32-bit compat on 64-bit to fsl-hv-manager

Mike Williams (1):
      powerpc/4xx: edac: Add comma to fix build error

Milton Miller (4):
      powerpc: Override dma_get_required_mask by platform hook and ops
      dma-mapping: Add get_required_mask if arch overrides default
      powerpc: Use the newly added get_required_mask dma_map_ops hook
      powerpc: Tidy up dma_map_ops after adding new hook

Mingkai Hu (1):
      powerpc/85xx: Rename p2040_rdb.c to p2041_rdb.c

Paul Mackerras (1):
      powerpc: Fix hugetlb with CONFIG_PPC_MM_SLICES=y

Scott Wood (1):
      powerpc/32: Pass device tree address as u64 to machine_init

Shengzhou Liu (1):
      powerpc/p3060qds: Add support for P3060QDS board

Stefan Roese (1):
      powerpc/44x: Add NOR flash device to Yosemite dts

Stephen George (1):
      powerpc/85xx: Adding DCSR node to dtsi device trees

Suzuki Poulose (1):
      powerpc/44x: Kexec support for PPC440X chipsets

Tang Yuantian (1):
      powerpc/mm: Fix the call trace when resumed from hibernation

Thadeu Lima de Souza Cascardo (2):
      powerpc/eeh: Fix /proc/ppc64/eeh creation
      powerpc: Reserve iommu page 0

Timur Tabi (5):
      powerpc/mpic: Add support for discontiguous cores
      powerpc/5200: enable audio in the defconfig
      powerpc/fsl_msi: fix support for multiple MSI ranges
      powerpc/85xx: clean up FPGA device tree nodes for Freecsale QorIQ boards
      powerpc/fsl_msi: add support for "msi-address-64" property

Tony Breeds (1):
      powerpc/4xx/pci: Add __init annotations for *init_port_hw() functions.

Wolfram Sang (2):
      gpio: move mpc8xxx/512x gpio driver to drivers/gpio
      powerpc: update 512x-defconfig

 .../devicetree/bindings/powerpc/fsl/board.txt      |   30 +-
 .../devicetree/bindings/powerpc/fsl/dcsr.txt       |  395 +++++++
 .../devicetree/bindings/powerpc/fsl/msi-pic.txt    |   42 +
 arch/powerpc/Kconfig                               |    7 +-
 arch/powerpc/Kconfig.debug                         |   46 +-
 arch/powerpc/boot/Makefile                         |    3 +-
 arch/powerpc/boot/dts/digsy_mtc.dts                |   59 +-
 arch/powerpc/boot/dts/gef_ppc9a.dts                |   33 +-
 arch/powerpc/boot/dts/gef_sbc310.dts               |   33 +-
 arch/powerpc/boot/dts/gef_sbc610.dts               |   33 +-
 arch/powerpc/boot/dts/hcu4.dts                     |  168 ---
 arch/powerpc/boot/dts/ksi8560.dts                  |    2 +-
 arch/powerpc/boot/dts/mgcoge.dts                   |    9 +
 arch/powerpc/boot/dts/mpc5200b.dtsi                |    2 +
 arch/powerpc/boot/dts/mpc8349emitx.dts             |    3 +-
 arch/powerpc/boot/dts/p1022ds.dts                  |    2 +-
 arch/powerpc/boot/dts/p2020ds.dts                  |    5 +
 .../boot/dts/{p2040rdb.dts => p2041rdb.dts}        |   17 +-
 .../boot/dts/{p2040si.dtsi => p2041si.dtsi}        |  135 ++-
 arch/powerpc/boot/dts/p3041ds.dts                  |    8 +-
 arch/powerpc/boot/dts/p3041si.dtsi                 |   71 ++-
 arch/powerpc/boot/dts/p3060qds.dts                 |  238 ++++
 arch/powerpc/boot/dts/p3060si.dtsi                 |  719 +++++++++++++
 arch/powerpc/boot/dts/p4080ds.dts                  |   12 +-
 arch/powerpc/boot/dts/p4080si.dtsi                 |  114 ++-
 arch/powerpc/boot/dts/p5020ds.dts                  |    8 +-
 arch/powerpc/boot/dts/p5020si.dtsi                 |   68 ++-
 arch/powerpc/boot/dts/sbc8560.dts                  |    2 +-
 arch/powerpc/boot/dts/yosemite.dts                 |   36 +
 arch/powerpc/configs/40x/hcu4_defconfig            |   80 --
 arch/powerpc/configs/85xx/p1023rds_defconfig       |    2 +-
 arch/powerpc/configs/85xx/xes_mpc85xx_defconfig    |    2 +-
 arch/powerpc/configs/corenet32_smp_defconfig       |   11 +-
 arch/powerpc/configs/corenet64_smp_defconfig       |    5 -
 arch/powerpc/configs/mgcoge_defconfig              |   27 +-
 arch/powerpc/configs/mpc512x_defconfig             |   19 +-
 arch/powerpc/configs/mpc5200_defconfig             |   12 +
 arch/powerpc/configs/mpc85xx_defconfig             |    5 +-
 arch/powerpc/configs/mpc85xx_smp_defconfig         |    6 +-
 arch/powerpc/configs/ppc40x_defconfig              |    1 -
 arch/powerpc/configs/ppc6xx_defconfig              |    2 +-
 arch/powerpc/include/asm/device.h                  |    2 +
 arch/powerpc/include/asm/firmware.h                |   10 +
 arch/powerpc/include/asm/hugetlb.h                 |   63 ++-
 arch/powerpc/include/asm/kexec.h                   |    2 +-
 arch/powerpc/include/asm/machdep.h                 |    3 +-
 arch/powerpc/include/asm/mmu-book3e.h              |    7 +
 arch/powerpc/include/asm/mmu-hash64.h              |    3 +-
 arch/powerpc/include/asm/mmu.h                     |   18 +-
 arch/powerpc/include/asm/mpic.h                    |    2 -
 arch/powerpc/include/asm/opal.h                    |  443 ++++++++
 arch/powerpc/include/asm/paca.h                    |    8 +
 arch/powerpc/include/asm/page.h                    |   31 +-
 arch/powerpc/include/asm/page_64.h                 |   11 -
 arch/powerpc/include/asm/pte-book3e.h              |    3 +
 arch/powerpc/include/asm/reg_booke.h               |    3 +
 arch/powerpc/include/asm/rtas.h                    |    6 +-
 arch/powerpc/include/asm/smp.h                     |    1 +
 arch/powerpc/include/asm/sparsemem.h               |    2 +-
 arch/powerpc/include/asm/topology.h                |   14 +-
 arch/powerpc/include/asm/udbg.h                    |    3 +
 arch/powerpc/include/asm/xics.h                    |   19 +
 arch/powerpc/kernel/asm-offsets.c                  |   10 +
 arch/powerpc/kernel/dma-iommu.c                    |   28 +-
 arch/powerpc/kernel/dma-swiotlb.c                  |   16 +
 arch/powerpc/kernel/dma.c                          |   44 +-
 arch/powerpc/kernel/exceptions-64s.S               |   27 +-
 arch/powerpc/kernel/head_32.S                      |    7 +-
 arch/powerpc/kernel/head_40x.S                     |   15 +-
 arch/powerpc/kernel/head_44x.S                     |   16 +-
 arch/powerpc/kernel/head_64.S                      |   22 +-
 arch/powerpc/kernel/head_8xx.S                     |   13 +-
 arch/powerpc/kernel/head_fsl_booke.S               |  175 +++-
 arch/powerpc/kernel/ibmebus.c                      |   22 +-
 arch/powerpc/kernel/idle_e500.S                    |    2 +-
 arch/powerpc/kernel/iommu.c                        |    8 +
 arch/powerpc/kernel/legacy_serial.c                |   25 +
 arch/powerpc/kernel/machine_kexec_64.c             |    3 +-
 arch/powerpc/kernel/misc_32.S                      |  171 +++
 arch/powerpc/kernel/pci-common.c                   |   11 +
 arch/powerpc/kernel/power6-pmu.c                   |    4 +-
 arch/powerpc/kernel/power7-pmu.c                   |    2 +
 arch/powerpc/kernel/prom.c                         |   19 +-
 arch/powerpc/kernel/prom_init.c                    |  383 ++++++-
 arch/powerpc/kernel/prom_init_check.sh             |    4 +-
 arch/powerpc/kernel/ptrace.c                       |   18 +-
 arch/powerpc/kernel/setup_32.c                     |    2 +-
 arch/powerpc/kernel/setup_64.c                     |   22 +-
 arch/powerpc/kernel/smp.c                          |   30 +-
 arch/powerpc/kernel/swsusp.c                       |    2 +-
 arch/powerpc/kernel/traps.c                        |    9 +-
 arch/powerpc/kernel/udbg.c                         |    6 +
 arch/powerpc/kernel/vio.c                          |   21 +-
 arch/powerpc/math-emu/math_efp.c                   |  100 +-
 arch/powerpc/mm/Makefile                           |    1 +
 arch/powerpc/mm/fsl_booke_mmu.c                    |   43 +-
 arch/powerpc/mm/hash_utils_64.c                    |    9 +-
 arch/powerpc/mm/hugetlbpage-book3e.c               |  121 +++
 arch/powerpc/mm/hugetlbpage.c                      |  379 ++++++-
 arch/powerpc/mm/init_32.c                          |    9 +
 arch/powerpc/mm/mem.c                              |    8 +-
 arch/powerpc/mm/mmu_context_hash64.c               |   12 +-
 arch/powerpc/mm/mmu_context_nohash.c               |    5 +
 arch/powerpc/mm/mmu_decl.h                         |    2 +
 arch/powerpc/mm/numa.c                             |   20 +-
 arch/powerpc/mm/pgtable.c                          |    3 +-
 arch/powerpc/mm/tlb_low_64e.S                      |   24 +-
 arch/powerpc/mm/tlb_nohash.c                       |   67 ++-
 arch/powerpc/platforms/40x/Kconfig                 |    8 -
 arch/powerpc/platforms/40x/Makefile                |    1 -
 arch/powerpc/platforms/40x/hcu4.c                  |   61 --
 arch/powerpc/platforms/512x/Kconfig                |    1 +
 arch/powerpc/platforms/82xx/km82xx.c               |    4 +
 arch/powerpc/platforms/83xx/Kconfig                |    9 +-
 arch/powerpc/platforms/83xx/mcu_mpc8349emitx.c     |   58 +-
 arch/powerpc/platforms/85xx/Kconfig                |   32 +-
 arch/powerpc/platforms/85xx/Makefile               |    3 +-
 arch/powerpc/platforms/85xx/p1022_ds.c             |   11 +-
 .../platforms/85xx/{p2040_rdb.c => p2041_rdb.c}    |   18 +-
 arch/powerpc/platforms/85xx/p3060_qds.c            |   77 ++
 arch/powerpc/platforms/85xx/sbc8560.c              |    2 +-
 arch/powerpc/platforms/85xx/smp.c                  |   12 +-
 arch/powerpc/platforms/86xx/Kconfig                |    1 +
 arch/powerpc/platforms/Kconfig                     |   13 +-
 arch/powerpc/platforms/Kconfig.cputype             |    4 +-
 arch/powerpc/platforms/Makefile                    |    1 +
 arch/powerpc/platforms/cell/iommu.c                |   21 +
 arch/powerpc/platforms/powernv/Kconfig             |   16 +
 arch/powerpc/platforms/powernv/Makefile            |    5 +
 arch/powerpc/platforms/powernv/opal-nvram.c        |   88 ++
 arch/powerpc/platforms/powernv/opal-rtc.c          |   97 ++
 arch/powerpc/platforms/powernv/opal-takeover.S     |  140 +++
 arch/powerpc/platforms/powernv/opal-wrappers.S     |  101 ++
 arch/powerpc/platforms/powernv/opal.c              |  322 ++++++
 arch/powerpc/platforms/powernv/pci-p5ioc2.c        |  234 ++++
 arch/powerpc/platforms/powernv/pci.c               |  427 ++++++++
 arch/powerpc/platforms/powernv/pci.h               |   48 +
 arch/powerpc/platforms/powernv/powernv.h           |   16 +
 arch/powerpc/platforms/powernv/setup.c             |  196 ++++
 arch/powerpc/platforms/powernv/smp.c               |  182 ++++
 arch/powerpc/platforms/ps3/Kconfig                 |   12 +
 arch/powerpc/platforms/ps3/Makefile                |    1 +
 arch/powerpc/platforms/ps3/gelic_udbg.c            |  273 +++++
 arch/powerpc/platforms/ps3/system-bus.c            |    7 +
 arch/powerpc/platforms/pseries/Kconfig             |    1 +
 arch/powerpc/platforms/pseries/dlpar.c             |    4 +
 arch/powerpc/platforms/pseries/eeh.c               |    2 +-
 arch/powerpc/platforms/pseries/iommu.c             |   34 +-
 arch/powerpc/platforms/pseries/nvram.c             |  171 +++-
 arch/powerpc/platforms/wsp/Kconfig                 |   11 +-
 arch/powerpc/platforms/wsp/Makefile                |    2 +
 arch/powerpc/platforms/wsp/ics.c                   |   48 +
 arch/powerpc/platforms/wsp/ics.h                   |    5 +
 arch/powerpc/platforms/wsp/msi.c                   |  102 ++
 arch/powerpc/platforms/wsp/msi.h                   |   19 +
 arch/powerpc/platforms/wsp/psr2.c                  |    4 +
 arch/powerpc/platforms/wsp/wsp.h                   |    3 +
 arch/powerpc/platforms/wsp/wsp_pci.c               | 1133 ++++++++++++++++++++
 arch/powerpc/platforms/wsp/wsp_pci.h               |  268 +++++
 arch/powerpc/sysdev/Makefile                       |    1 -
 arch/powerpc/sysdev/cpm_common.c                   |    3 +-
 arch/powerpc/sysdev/fsl_msi.c                      |   28 +-
 arch/powerpc/sysdev/fsl_msi.h                      |    3 +-
 arch/powerpc/sysdev/mpic.c                         |   34 +-
 arch/powerpc/sysdev/ppc4xx_pci.c                   |  101 ++-
 arch/powerpc/sysdev/ppc4xx_pci.h                   |   12 +
 arch/powerpc/sysdev/xics/Makefile                  |    1 +
 arch/powerpc/sysdev/xics/icp-native.c              |    2 +-
 arch/powerpc/sysdev/xics/ics-opal.c                |  244 +++++
 arch/powerpc/sysdev/xics/xics-common.c             |    8 +-
 arch/powerpc/xmon/xmon.c                           |    4 +-
 drivers/edac/cpc925_edac.c                         |   67 ++-
 drivers/edac/ppc4xx_edac.c                         |    2 +-
 drivers/gpio/Kconfig                               |    8 +
 drivers/gpio/Makefile                              |    1 +
 .../mpc8xxx_gpio.c => drivers/gpio/gpio-mpc8xxx.c  |    3 +
 drivers/net/ps3_gelic_net.c                        |    3 +
 drivers/net/ps3_gelic_net.h                        |    6 +
 drivers/of/fdt.c                                   |    7 +-
 drivers/tty/hvc/Kconfig                            |    9 +
 drivers/tty/hvc/Makefile                           |    1 +
 drivers/tty/hvc/hvc_opal.c                         |  424 ++++++++
 drivers/tty/hvc/hvcs.c                             |    6 +-
 drivers/tty/hvc/hvsi_lib.c                         |    4 +-
 drivers/tty/serial/8250.c                          |   23 -
 drivers/virt/fsl_hypervisor.c                      |    1 +
 include/linux/dma-mapping.h                        |    3 +
 include/linux/topology.h                           |    4 +
 kernel/sched.c                                     |    2 -
 189 files changed, 9411 insertions(+), 979 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/powerpc/fsl/dcsr.txt
 delete mode 100644 arch/powerpc/boot/dts/hcu4.dts
 rename arch/powerpc/boot/dts/{p2040rdb.dts => p2041rdb.dts} (95%)
 rename arch/powerpc/boot/dts/{p2040si.dtsi => p2041si.dtsi} (80%)
 create mode 100644 arch/powerpc/boot/dts/p3060qds.dts
 create mode 100644 arch/powerpc/boot/dts/p3060si.dtsi
 delete mode 100644 arch/powerpc/configs/40x/hcu4_defconfig
 create mode 100644 arch/powerpc/include/asm/opal.h
 create mode 100644 arch/powerpc/mm/hugetlbpage-book3e.c
 delete mode 100644 arch/powerpc/platforms/40x/hcu4.c
 rename arch/powerpc/platforms/85xx/{p2040_rdb.c => p2041_rdb.c} (82%)
 create mode 100644 arch/powerpc/platforms/85xx/p3060_qds.c
 create mode 100644 arch/powerpc/platforms/powernv/Kconfig
 create mode 100644 arch/powerpc/platforms/powernv/Makefile
 create mode 100644 arch/powerpc/platforms/powernv/opal-nvram.c
 create mode 100644 arch/powerpc/platforms/powernv/opal-rtc.c
 create mode 100644 arch/powerpc/platforms/powernv/opal-takeover.S
 create mode 100644 arch/powerpc/platforms/powernv/opal-wrappers.S
 create mode 100644 arch/powerpc/platforms/powernv/opal.c
 create mode 100644 arch/powerpc/platforms/powernv/pci-p5ioc2.c
 create mode 100644 arch/powerpc/platforms/powernv/pci.c
 create mode 100644 arch/powerpc/platforms/powernv/pci.h
 create mode 100644 arch/powerpc/platforms/powernv/powernv.h
 create mode 100644 arch/powerpc/platforms/powernv/setup.c
 create mode 100644 arch/powerpc/platforms/powernv/smp.c
 create mode 100644 arch/powerpc/platforms/ps3/gelic_udbg.c
 create mode 100644 arch/powerpc/platforms/wsp/msi.c
 create mode 100644 arch/powerpc/platforms/wsp/msi.h
 create mode 100644 arch/powerpc/platforms/wsp/wsp_pci.c
 create mode 100644 arch/powerpc/platforms/wsp/wsp_pci.h
 create mode 100644 arch/powerpc/sysdev/xics/ics-opal.c
 rename arch/powerpc/sysdev/mpc8xxx_gpio.c => drivers/gpio/gpio-mpc8xxx.c (98%)
 create mode 100644 drivers/tty/hvc/hvc_opal.c

* [PATCH] powerpc: Export PIR data through sysfs
From: Ananth N Mavinakayanahalli @ 2011-11-07  4:47 UTC
  To: linuxppc-dev; +Cc: Anton Blanchard, mahesh

The Processor Identification Register (PIR) on powerpc provides
information to decode the processor identification tag. Decoding
this information is platform specific.

Export PIR data via sysfs.

(Powerpc manuals state this register is 'optional'. I am not sure
though if there are any Linux supported powerpc platforms that
don't have it. Code in the kernel referencing PIR isn't under
a platform ifdef).
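
For anyone wanting to consume the new attribute from user space, here is a
minimal sketch. It assumes the file holds a single hexadecimal value followed
by a newline (as the other SYSFS_PMCSETUP attributes appear to), and that it
shows up per CPU under a path such as /sys/devices/system/cpu/cpu0/pir. Both
points are assumptions, not something this patch states. The helper names
parse_pir() and read_pir() are made up for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse the text of a sysfs attribute assumed to hold one hex value
 * followed by an optional newline. Returns 0 on success, -1 on error.
 * (Hypothetical helper, not part of the patch.)
 */
static int parse_pir(const char *text, unsigned long *out)
{
	char *end;
	unsigned long val = strtoul(text, &end, 16);

	/* Reject empty input and trailing garbage other than a newline */
	if (end == text || (*end != '\0' && *end != '\n'))
		return -1;
	*out = val;
	return 0;
}

/* Read and parse the attribute at 'path'. Returns 0 on success, -1 if
 * the file cannot be opened, read or parsed.
 */
static int read_pir(const char *path, unsigned long *out)
{
	char buf[32];
	FILE *f = fopen(path, "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		ret = parse_pir(buf, out);
	fclose(f);
	return ret;
}
```

Since the attribute is created with mode 0400, a call such as
read_pir("/sys/devices/system/cpu/cpu0/pir", &val) would presumably need to
run as root.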

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
---
 arch/powerpc/kernel/sysfs.c |    6 ++++++
 1 file changed, 6 insertions(+)

Index: linux-3.1/arch/powerpc/kernel/sysfs.c
===================================================================
--- linux-3.1.orig/arch/powerpc/kernel/sysfs.c
+++ linux-3.1/arch/powerpc/kernel/sysfs.c
@@ -177,11 +177,13 @@ SYSFS_PMCSETUP(mmcra, SPRN_MMCRA);
 SYSFS_PMCSETUP(purr, SPRN_PURR);
 SYSFS_PMCSETUP(spurr, SPRN_SPURR);
 SYSFS_PMCSETUP(dscr, SPRN_DSCR);
+SYSFS_PMCSETUP(pir, SPRN_PIR);
 
 static SYSDEV_ATTR(mmcra, 0600, show_mmcra, store_mmcra);
 static SYSDEV_ATTR(spurr, 0600, show_spurr, NULL);
 static SYSDEV_ATTR(dscr, 0600, show_dscr, store_dscr);
 static SYSDEV_ATTR(purr, 0600, show_purr, store_purr);
+static SYSDEV_ATTR(pir, 0400, show_pir, NULL);
 
 unsigned long dscr_default = 0;
 EXPORT_SYMBOL(dscr_default);
@@ -394,6 +396,8 @@ static void __cpuinit register_cpu_onlin
 		sysdev_create_file(s, &attr_dscr);
 #endif /* CONFIG_PPC64 */
 
+	sysdev_create_file(s, &attr_pir);
+
 	cacheinfo_cpu_online(cpu);
 }
 
@@ -464,6 +468,8 @@ static void unregister_cpu_online(unsign
 		sysdev_remove_file(s, &attr_dscr);
 #endif /* CONFIG_PPC64 */
 
+	sysdev_remove_file(s, &attr_pir);
+
 	cacheinfo_cpu_offline(cpu);
 }
 

* [PATCH 1/5] powerpc/pci: Add a platform hook after probe and before resource survey
From: Benjamin Herrenschmidt @ 2011-11-07  4:55 UTC
  To: linuxppc-dev

Some platforms need to perform resource allocation using a custom algorithm
due to HW constraints, or may want to tweak things globally below a host
bridge. For example OPAL support for IODA will need to perform a
resource allocation pass that applies IODA specific segmentation
constraints to MMIO which cannot be done simply using the kernel generic
resource management code.
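
The pattern is the usual optional ppc_md callback: the hook is only called
when a platform has filled it in. A self-contained sketch of that pattern
follows. The struct definitions are cut-down stand-ins for the kernel ones,
and scan_phb(), example_fixup_phb() and the fixed_up field are hypothetical
names used only for illustration.

```c
#include <stddef.h>

/* Stand-in for the kernel's struct pci_controller (illustrative only) */
struct pci_controller {
	int global_number;
	int fixed_up;	/* hypothetical field, set by the example hook */
};

/* Cut-down mock of struct machdep_calls holding only the new hook */
struct machdep_calls {
	/* Called after scan and before resource survey */
	void (*pcibios_fixup_phb)(struct pci_controller *hose);
};

static struct machdep_calls ppc_md;

/* Mock of the call site: the hook is optional, so test it for NULL */
static void scan_phb(struct pci_controller *hose)
{
	/* ... bus scan would happen here ... */

	if (ppc_md.pcibios_fixup_phb)
		ppc_md.pcibios_fixup_phb(hose);

	/* ... resource survey happens later ... */
}

/* Hypothetical platform implementation of the hook */
static void example_fixup_phb(struct pci_controller *hose)
{
	hose->fixed_up = 1;
}
```

A platform would assign ppc_md.pcibios_fixup_phb = example_fixup_phb in its
setup code; platforms that leave the pointer NULL see no change in behaviour.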

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 arch/powerpc/include/asm/machdep.h |    3 +++
 arch/powerpc/kernel/pci-common.c   |    6 ++++++
 2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 58fc216..505bc21 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -212,6 +212,9 @@ struct machdep_calls {
 	 * allow assignment/enabling of the device. */
 	int  (*pcibios_enable_device_hook)(struct pci_dev *);
 
+	/* Called after scan and before resource survey */
+	void (*pcibios_fixup_phb)(struct pci_controller *hose);
+
 	/* Called to shutdown machine specific hardware not already controlled
 	 * by other drivers.
 	 */
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 677eccc..855969f 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1731,6 +1731,12 @@ void __devinit pcibios_scan_phb(struct pci_controller *hose)
 	if (mode == PCI_PROBE_NORMAL)
 		hose->last_busno = bus->subordinate = pci_scan_child_bus(bus);
 
+	/* Platform gets a chance to do some global fixups before
+	 * we proceed to resource allocation
+	 */
+	if (ppc_md.pcibios_fixup_phb)
+		ppc_md.pcibios_fixup_phb(hose);
+
 	/* Configure PCI Express settings */
 	if (bus && !pci_has_flag(PCI_PROBE_ONLY)) {
 		struct pci_bus *child;
-- 
1.7.7.1

* [PATCH 2/5] powerpc/pci: Change how re-assigning resources works
From: Benjamin Herrenschmidt @ 2011-11-07  4:55 UTC
  To: linuxppc-dev
In-Reply-To: <1320641761-4028-1-git-send-email-benh@kernel.crashing.org>

When PCI_REASSIGN_ALL_RSRC is set, we used to clear all bus resources
at the beginning of survey and re-allocate them later.

This patch changes that: during early fixup, we instead mark all
resources as IORESOURCE_UNSET and move them down to be 0-based.

Later, if bus resources are still unset at the beginning of the survey,
then we clear them.

This shouldn't impact the re-assignment case on 4xx, but will enable
us to have the platform do some custom resource assignment before the
survey, by clearing the IORESOURCE_UNSET bit of individual resources.

This also limits the clutter in the kernel log from fixup when
re-assigning, since we don't care about the offset applied to the BAR
values in this case.
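
The two steps can be sketched in isolation as below. This is a user-space
mock, not the kernel code: struct resource is reduced to three fields, the
IORESOURCE_UNSET value is only meant to be illustrative, and mark_unset() and
clear_if_unset() are hypothetical helper names.

```c
#include <stdint.h>

/* Illustrative flag value; the real definition lives in linux/ioport.h */
#define IORESOURCE_UNSET 0x20000000UL

/* Reduced stand-in for the kernel's struct resource */
struct resource {
	uint64_t start;
	uint64_t end;
	unsigned long flags;
};

/* Early fixup: keep the size, move the range down to 0, flag it unset */
static void mark_unset(struct resource *res)
{
	res->end -= res->start;
	res->start = 0;
	res->flags |= IORESOURCE_UNSET;
}

/* Start of survey: anything nobody assigned in the meantime is cleared,
 * so that it gets re-allocated later.
 */
static void clear_if_unset(struct resource *res)
{
	if (res->flags & IORESOURCE_UNSET) {
		res->start = res->end = 0;
		res->flags = 0;
	}
}
```

A platform that assigns a resource itself between the two steps would set
start/end and drop the IORESOURCE_UNSET bit, in which case clear_if_unset()
leaves the resource alone.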

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 arch/powerpc/kernel/pci-common.c |   66 ++++++++++++++++++++-----------------
 1 files changed, 36 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 855969f..d34ba7e 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -920,18 +920,22 @@ static void __devinit pcibios_fixup_resources(struct pci_dev *dev)
 		struct resource *res = dev->resource + i;
 		if (!res->flags)
 			continue;
-		/* On platforms that have PCI_PROBE_ONLY set, we don't
-		 * consider 0 as an unassigned BAR value. It's technically
-		 * a valid value, but linux doesn't like it... so when we can
-		 * re-assign things, we do so, but if we can't, we keep it
-		 * around and hope for the best...
+
+		/* If we're going to re-assign everything, we mark all resources
+		 * as unset (and 0-base them). In addition, we mark BARs starting
+		 * at 0 as unset as well, except if PCI_PROBE_ONLY is also set
+		 * since in that case, we don't want to re-assign anything
 		 */
-		if (res->start == 0 && !pci_has_flag(PCI_PROBE_ONLY)) {
-			pr_debug("PCI:%s Resource %d %016llx-%016llx [%x] is unassigned\n",
-				 pci_name(dev), i,
-				 (unsigned long long)res->start,
-				 (unsigned long long)res->end,
-				 (unsigned int)res->flags);
+		if (pci_has_flag(PCI_REASSIGN_ALL_RSRC) ||
+		    (res->start == 0 && !pci_has_flag(PCI_PROBE_ONLY))) {
+			/* Only print message if not re-assigning */
+			if (!pci_has_flag(PCI_REASSIGN_ALL_RSRC))
+				pr_debug("PCI:%s Resource %d %016llx-%016llx [%x] "
+					 "is unassigned\n",
+					 pci_name(dev), i,
+					 (unsigned long long)res->start,
+					 (unsigned long long)res->end,
+					 (unsigned int)res->flags);
 			res->end -= res->start;
 			res->start = 0;
 			res->flags |= IORESOURCE_UNSET;
@@ -1041,6 +1045,16 @@ static void __devinit pcibios_fixup_bridge(struct pci_bus *bus)
 		if (i >= 3 && bus->self->transparent)
 			continue;
 
+		/* If we are going to re-assign everything, mark the resource
+		 * as unset and move it down to 0
+		 */
+		if (pci_has_flag(PCI_REASSIGN_ALL_RSRC)) {
+			res->flags |= IORESOURCE_UNSET;
+			res->end -= res->start;
+			res->start = 0;
+			continue;
+		}
+
 		pr_debug("PCI:%s Bus rsrc %d %016llx-%016llx [%x] fixup...\n",
 			 pci_name(dev), i,
 			 (unsigned long long)res->start,\
@@ -1261,18 +1275,15 @@ void pcibios_allocate_bus_resources(struct pci_bus *bus)
 	pci_bus_for_each_resource(bus, res, i) {
 		if (!res || !res->flags || res->start > res->end || res->parent)
 			continue;
+
+		/* If the resource was left unset at this point, we clear it */
+		if (res->flags & IORESOURCE_UNSET)
+			goto clear_resource;
+
 		if (bus->parent == NULL)
 			pr = (res->flags & IORESOURCE_IO) ?
 				&ioport_resource : &iomem_resource;
 		else {
-			/* Don't bother with non-root busses when
-			 * re-assigning all resources. We clear the
-			 * resource flags as if they were colliding
-			 * and as such ensure proper re-allocation
-			 * later.
-			 */
-			if (pci_has_flag(PCI_REASSIGN_ALL_RSRC))
-				goto clear_resource;
 			pr = pci_find_parent_resource(bus->self, res);
 			if (pr == res) {
 				/* this happens when the generic PCI
@@ -1303,9 +1314,9 @@ void pcibios_allocate_bus_resources(struct pci_bus *bus)
 			if (reparent_resources(pr, res) == 0)
 				continue;
 		}
-		printk(KERN_WARNING "PCI: Cannot allocate resource region "
-		       "%d of PCI bridge %d, will remap\n", i, bus->number);
-clear_resource:
+		pr_warning("PCI: Cannot allocate resource region "
+			   "%d of PCI bridge %d, will remap\n", i, bus->number);
+	clear_resource:
 		res->start = res->end = 0;
 		res->flags = 0;
 	}
@@ -1450,16 +1461,11 @@ void __init pcibios_resource_survey(void)
 {
 	struct pci_bus *b;
 
-	/* Allocate and assign resources. If we re-assign everything, then
-	 * we skip the allocate phase
-	 */
+	/* Allocate and assign resources */
 	list_for_each_entry(b, &pci_root_buses, node)
 		pcibios_allocate_bus_resources(b);
-
-	if (!pci_has_flag(PCI_REASSIGN_ALL_RSRC)) {
-		pcibios_allocate_resources(0);
-		pcibios_allocate_resources(1);
-	}
+	pcibios_allocate_resources(0);
+	pcibios_allocate_resources(1);
 
 	/* Before we start assigning unassigned resource, we try to reserve
 	 * the low IO area and the VGA memory area if they intersect the
-- 
1.7.7.1

^ permalink raw reply related

* [PATCH 3/5] powerpc/powernv: Add TCE SW invalidation support
From: Benjamin Herrenschmidt @ 2011-11-07  4:55 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <1320641761-4028-1-git-send-email-benh@kernel.crashing.org>

This is used for newer IO Hubs such as p7IOC.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
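As a rough illustration of the p7IOC-style branch of pnv_tce_invalidate()
below, here is a standalone userspace sketch. It is not part of the patch:
tce_inval_pair() and its out/max parameters are made-up names, and the
values are stored in an array instead of being written to the invalidate
register. It only shows how the range is walked, assuming 8-byte TCEs and
one write per pair of TCEs.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only. "start" and "end" are the
 * physical addresses of the first and last TCE that were modified.
 * Returns the number of invalidate values generated (at most "max").
 */
static size_t tce_inval_pair(uint64_t start, uint64_t end,
			     uint64_t *out, size_t max)
{
	const uint64_t inc = 16;	/* two 8-byte TCEs per write */
	size_t n = 0;

	/* Same marker bit the patch sets on p7IOC-style invalidations */
	start |= 1ull << 63;
	end |= 1ull << 63;

	/* Round end up so the pair holding the last TCE is covered */
	end |= inc - 1;

	while (start <= end && n < max) {
		out[n++] = start;	/* the patch does __raw_writeq() here */
		start += inc;
	}
	return n;
}
```

With four consecutive TCEs starting at 0x1000 (last TCE at 0x1018), the
sketch produces two values, one per pair of TCEs.
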
 arch/powerpc/include/asm/tce.h       |   10 +++-
 arch/powerpc/platforms/powernv/pci.c |   84 +++++++++++++++++++++++++++++-----
 2 files changed, 79 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index f663634..e01907d 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -26,10 +26,14 @@
 
 /*
  * Tces come in two formats, one for the virtual bus and a different
- * format for PCI
+ * format for PCI. We also use a separate value for SW invalidated
+ * PCI
  */
-#define TCE_VB  0
-#define TCE_PCI 1
+#define TCE_VB  		0
+#define TCE_PCI 		1
+#define TCE_PCI_SWINV_CREATE	2
+#define TCE_PCI_SWINV_FREE	4
+#define TCE_PCI_SWINV_PAIR	8
 
 /* TCE page size is 4096 bytes (1 << 12) */
 
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 85bb66d..8b90d94 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -257,12 +257,54 @@ struct pci_ops pnv_pci_ops = {
 	.write = pnv_pci_write_config,
 };
 
+
+static void pnv_tce_invalidate(struct iommu_table *tbl,
+			       u64 *startp, u64 *endp)
+{
+	u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
+	unsigned long start, end, inc;
+
+	start = __pa(startp);
+	end = __pa(endp);
+
+
+	/* BML uses this case for p6/p7/galaxy2: Shift addr and put in node */
+	if (tbl->it_busno) {
+		start <<= 12;
+		end <<= 12;
+		inc = 128 << 12;
+		start |= tbl->it_busno;
+		end |= tbl->it_busno;
+	}
+	/* p7ioc-style invalidation, 2 TCEs per write */
+	else if (tbl->it_type & TCE_PCI_SWINV_PAIR) {
+		start |= (1ull << 63);
+		end |= (1ull << 63);
+		inc = 16;
+	}
+	/* Default (older HW) */
+	else
+		inc = 128;
+
+	end |= inc - 1;		/* round up end to be different than start */
+
+	mb(); /* Ensure above stores are visible */
+	while (start <= end) {
+		__raw_writeq(start, invalidate);
+		start += inc;
+	}
+	/* The iommu layer will do another mb() for us on build() and
+	 * we don't care on free()
+	 */
+}
+
+
 static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 			 unsigned long uaddr, enum dma_data_direction direction,
 			 struct dma_attrs *attrs)
 {
 	u64 proto_tce;
-	u64 *tcep;
+	u64 *tcep, *tces;
 	u64 rpn;
 
 	proto_tce = TCE_PCI_READ; // Read allowed
@@ -270,25 +312,33 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 	if (direction != DMA_TO_DEVICE)
 		proto_tce |= TCE_PCI_WRITE;
 
-	tcep = ((u64 *)tbl->it_base) + index;
+	tces = tcep = ((u64 *)tbl->it_base) + index - tbl->it_offset;
+	rpn = __pa(uaddr) >> TCE_SHIFT;
 
-	while (npages--) {
-		/* can't move this out since we might cross LMB boundary */
-		rpn = (virt_to_abs(uaddr)) >> TCE_SHIFT;
-		*tcep = proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT;
+	while (npages--)
+		*(tcep++) = proto_tce | (rpn++ << TCE_RPN_SHIFT);
+
+	/* Some implementations won't cache invalid TCEs and thus may not
+	 * need that flush. We'll probably turn it_type into a bit mask
+	 * of flags if that becomes the case
+	 */
+	if (tbl->it_type & TCE_PCI_SWINV_CREATE)
+		pnv_tce_invalidate(tbl, tces, tcep - 1);
 
-		uaddr += TCE_PAGE_SIZE;
-		tcep++;
-	}
 	return 0;
 }
 
 static void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
 {
-	u64 *tcep = ((u64 *)tbl->it_base) + index;
+	u64 *tcep, *tces;
+
+	tces = tcep = ((u64 *)tbl->it_base) + index - tbl->it_offset;
 
 	while (npages--)
 		*(tcep++) = 0;
+
+	if (tbl->it_type & TCE_PCI_SWINV_FREE)
+		pnv_tce_invalidate(tbl, tces, tcep - 1);
 }
 
 void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
@@ -308,13 +358,14 @@ static struct iommu_table * __devinit
 pnv_pci_setup_bml_iommu(struct pci_controller *hose)
 {
 	struct iommu_table *tbl;
-	const __be64 *basep;
+	const __be64 *basep, *swinvp;
 	const __be32 *sizep;
 
 	basep = of_get_property(hose->dn, "linux,tce-base", NULL);
 	sizep = of_get_property(hose->dn, "linux,tce-size", NULL);
 	if (basep == NULL || sizep == NULL) {
-		pr_err("PCI: %s has missing tce entries !\n", hose->dn->full_name);
+		pr_err("PCI: %s has missing tce entries !\n",
+		       hose->dn->full_name);
 		return NULL;
 	}
 	tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, hose->node);
@@ -323,6 +374,15 @@ pnv_pci_setup_bml_iommu(struct pci_controller *hose)
 	pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)),
 				  be32_to_cpup(sizep), 0);
 	iommu_init_table(tbl, hose->node);
+
+	/* Deal with SW invalidated TCEs when needed (BML way) */
+	swinvp = of_get_property(hose->dn, "linux,tce-sw-invalidate-info",
+				 NULL);
+	if (swinvp) {
+		tbl->it_busno = swinvp[1];
+		tbl->it_index = (unsigned long)ioremap(swinvp[0], 8);
+		tbl->it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE;
+	}
 	return tbl;
 }
 
-- 
1.7.7.1

^ permalink raw reply related

* [PATCH 5/5] powerpc/powernv: PCI support for p7IOC under OPAL v2
From: Benjamin Herrenschmidt @ 2011-11-07  4:56 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <1320641761-4028-1-git-send-email-benh@kernel.crashing.org>

This adds support for p7IOC (and possibly other IODA v1 IO Hubs)
using OPAL v2 interfaces.

We completely take over resource assignment and assign them using an
algorithm that hands out device BARs in a way that makes them fit in
individual segments of the M32 window of the bridge, which enables us
to assign individual PEs to devices and functions.

The current implementation gives out a PE per function on PCIe, and a
PE for the entire bridge for PCIe to PCI-X bridges.

This can be adjusted / fine tuned later.

We also set up DMA resources (32-bit only for now) and MSIs (both 32-bit
and 64-bit MSI are supported).

The DMA allocation tries to divide the available 256M segments of the
32-bit DMA address space "fairly" among PEs. This is done using a
"weight" heuristic which assigns less value to things like OHCI USB
controllers than, for example, SCSI RAID controllers. This algorithm
will probably want some fine tuning for specific devices or device
types.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
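To make the "weight" idea above a bit more concrete, here is a small
standalone sketch of one way to split a fixed number of 256M segments
among PEs. It is not lifted from the patch and may differ from it in the
details: split_dma_segments() and its parameters are hypothetical names,
and the policy is an assumption (one segment per DMA-capable PE, the
leftover handed out in proportion to weight with rounding, capped by what
is still available). The weights are expected in decreasing order, which
is how the PE list is kept sorted.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only. weight[i] == 0 means the
 * PE does no DMA and gets no segment. segs[i] receives the number of
 * segments granted to PE i.
 */
static void split_dma_segments(const unsigned int *weight, size_t npe,
			       unsigned int total_segs, unsigned int *segs)
{
	unsigned int total_weight = 0, dma_pes = 0;
	unsigned int residual, remaining;
	size_t i;

	for (i = 0; i < npe; i++) {
		segs[i] = 0;
		if (weight[i]) {
			total_weight += weight[i];
			dma_pes++;
		}
	}

	/* Segments left over once every DMA PE has its base segment */
	residual = total_segs > dma_pes ? total_segs - dma_pes : 0;
	remaining = total_segs;

	for (i = 0; i < npe && remaining; i++) {
		unsigned int want = 1;

		if (!weight[i])
			continue;
		/* Proportional share of the leftover, rounded to nearest */
		if (residual)
			want += (weight[i] * residual + total_weight / 2) /
				total_weight;
		if (want > remaining)
			want = remaining;
		segs[i] = want;
		remaining -= want;
	}
}
```

For example, with weights 15, 10 and 3 (RAID, default, slow USB) and 8
segments, this hands out 4, 3 and 1 segments respectively.
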
 arch/powerpc/include/asm/pci-bridge.h     |    6 +-
 arch/powerpc/kernel/pci_dn.c              |    3 +
 arch/powerpc/platforms/powernv/Makefile   |    2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c | 1319 +++++++++++++++++++++++++++++
 arch/powerpc/platforms/powernv/pci.c      |   20 +-
 arch/powerpc/platforms/powernv/pci.h      |   84 ++
 6 files changed, 1428 insertions(+), 6 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/pci-ioda.c

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 56b879a..882b6aa 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -153,8 +153,8 @@ struct pci_dn {
 
 	int	pci_ext_config_space;	/* for pci devices */
 
-#ifdef CONFIG_EEH
 	struct	pci_dev *pcidev;	/* back-pointer to the pci device */
+#ifdef CONFIG_EEH
 	int	class_code;		/* pci device class */
 	int	eeh_mode;		/* See eeh.h for possible EEH_MODEs */
 	int	eeh_config_addr;
@@ -164,6 +164,10 @@ struct pci_dn {
 	int	eeh_false_positives;	/* # times this device reported #ff's */
 	u32	config_space[16];	/* saved PCI config space */
 #endif
+#define IODA_INVALID_PE		(-1)
+#ifdef CONFIG_PPC_POWERNV
+	int	pe_number;
+#endif
 };
 
 /* Get the pointer to a device_node's pci_dn */
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index 478f8d78..245f68ad 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -49,6 +49,9 @@ void * __devinit update_dn_pci_info(struct device_node *dn, void *data)
 	dn->data = pdn;
 	pdn->node = dn;
 	pdn->phb = phb;
+#ifdef CONFIG_PPC_POWERNV
+	pdn->pe_number = IODA_INVALID_PE;
+#endif
 	regs = of_get_property(dn, "reg", NULL);
 	if (regs) {
 		/* First register entry is addr (00BBSS00)  */
diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index 3185300..bcc3cb48 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -2,4 +2,4 @@ obj-y			+= setup.o opal-takeover.o opal-wrappers.o opal.o
 obj-y			+= opal-rtc.o opal-nvram.o
 
 obj-$(CONFIG_SMP)	+= smp.o
-obj-$(CONFIG_PCI)	+= pci.o pci-p5ioc2.o
+obj-$(CONFIG_PCI)	+= pci.o pci-p5ioc2.o pci-ioda.o
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
new file mode 100644
index 0000000..3532e0a
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -0,0 +1,1319 @@
+/*
+ * Support PCI/PCIe on PowerNV platforms
+ *
+ * Copyright 2011 Benjamin Herrenschmidt, IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define DEBUG
+
+#include <linux/kernel.h>
+#include <linux/pci.h>
+#include <linux/delay.h>
+#include <linux/string.h>
+#include <linux/init.h>
+#include <linux/bootmem.h>
+#include <linux/irq.h>
+#include <linux/io.h>
+#include <linux/msi.h>
+
+#include <asm/sections.h>
+#include <asm/io.h>
+#include <asm/prom.h>
+#include <asm/pci-bridge.h>
+#include <asm/machdep.h>
+#include <asm/ppc-pci.h>
+#include <asm/opal.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
+#include <asm/abs_addr.h>
+
+#include "powernv.h"
+#include "pci.h"
+
+struct resource_wrap {
+	struct list_head	link;
+	resource_size_t		size;
+	resource_size_t		align;
+	struct pci_dev		*dev;	/* Set if it's a device */
+	struct pci_bus		*bus;	/* Set if it's a bridge */
+};
+
+static int __pe_printk(const char *level, const struct pnv_ioda_pe *pe,
+		       struct va_format *vaf)
+{
+	char pfix[32];
+
+	if (pe->pdev)
+		strlcpy(pfix, dev_name(&pe->pdev->dev), sizeof(pfix));
+	else
+		sprintf(pfix, "%04x:%02x     ",
+			pci_domain_nr(pe->pbus), pe->pbus->number);
+	return printk("pci %s%s: [PE# %.3d] %pV", level, pfix, pe->pe_number, vaf);
+}
+
+#define define_pe_printk_level(func, kern_level)		\
+static int func(const struct pnv_ioda_pe *pe, const char *fmt, ...)	\
+{								\
+	struct va_format vaf;					\
+	va_list args;						\
+	int r;							\
+								\
+	va_start(args, fmt);					\
+								\
+	vaf.fmt = fmt;						\
+	vaf.va = &args;						\
+								\
+	r = __pe_printk(kern_level, pe, &vaf);			\
+	va_end(args);						\
+								\
+	return r;						\
+}								\
+
+define_pe_printk_level(pe_err, KERN_ERR);
+define_pe_printk_level(pe_warn, KERN_WARNING);
+define_pe_printk_level(pe_info, KERN_INFO);
+
+
+/* Calculate resource usage & alignment requirement of a single
+ * device. This will also assign all resources within the device
+ * for a given type starting at 0 for the biggest one and then
+ * assigning in decreasing order of size.
+ */
+static void __devinit pnv_ioda_calc_dev(struct pci_dev *dev, unsigned int flags,
+					resource_size_t *size,
+					resource_size_t *align)
+{
+	resource_size_t start;
+	struct resource *r;
+	int i;
+
+	pr_devel("  -> CDR %s\n", pci_name(dev));
+
+	*size = *align = 0;
+
+	/* Clear the resources out and mark them all unset */
+	for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
+		r = &dev->resource[i];
+		if (!(r->flags & flags))
+		    continue;
+		if (r->start) {
+			r->end -= r->start;
+			r->start = 0;
+		}
+		r->flags |= IORESOURCE_UNSET;
+	}
+
+	/* We currently keep all memory resources together, we
+	 * will handle prefetch & 64-bit separately in the future
+	 * but for now we stick everybody in M32
+	 */
+	start = 0;
+	for (;;) {
+		resource_size_t max_size = 0;
+		int max_no = -1;
+
+		/* Find next biggest resource */
+		for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
+			r = &dev->resource[i];
+			if (!(r->flags & IORESOURCE_UNSET) ||
+			    !(r->flags & flags))
+				continue;
+			if (resource_size(r) > max_size) {
+				max_size = resource_size(r);
+				max_no = i;
+			}
+		}
+		if (max_no < 0)
+			break;
+		r = &dev->resource[max_no];
+		if (max_size > *align)
+			*align = max_size;
+		*size += max_size;
+		r->start = start;
+		start += max_size;
+		r->end = r->start + max_size - 1;
+		r->flags &= ~IORESOURCE_UNSET;
+		pr_devel("  ->     R%d %016llx..%016llx\n",
+			 max_no, r->start, r->end);
+	}
+	pr_devel("  <- CDR %s size=%llx align=%llx\n",
+		 pci_name(dev), *size, *align);
+}
+
+/* Allocate a resource "wrap" for a given device or bridge and
+ * insert it at the right position in the sorted list
+ */
+static void __devinit pnv_ioda_add_wrap(struct list_head *list,
+					struct pci_bus *bus,
+					struct pci_dev *dev,
+					resource_size_t size,
+					resource_size_t align)
+{
+	struct resource_wrap *w1, *w = kzalloc(sizeof(*w), GFP_KERNEL);
+
+	w->size = size;
+	w->align = align;
+	w->dev = dev;
+	w->bus = bus;
+
+	list_for_each_entry(w1, list, link) {
+		if (w1->align < align) {
+			list_add_tail(&w->link, &w1->link);
+			return;
+		}
+	}
+	list_add_tail(&w->link, list);
+}
+
+/* Offset device resources of a given type */
+static void __devinit pnv_ioda_offset_dev(struct pci_dev *dev,
+					  unsigned int flags,
+					  resource_size_t offset)
+{
+	struct resource *r;
+	int i;
+
+	pr_devel("  -> ODR %s [%x] +%016llx\n", pci_name(dev), flags, offset);
+
+	for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
+		r = &dev->resource[i];
+		if (r->flags & flags) {
+			dev->resource[i].start += offset;
+			dev->resource[i].end += offset;
+		}
+	}
+
+	pr_devel("  <- ODR %s [%x] +%016llx\n", pci_name(dev), flags, offset);
+}
+
+/* Offset bus resources (& all children) of a given type */
+static void __devinit pnv_ioda_offset_bus(struct pci_bus *bus,
+					  unsigned int flags,
+					  resource_size_t offset)
+{
+	struct resource *r;
+	struct pci_dev *dev;
+	struct pci_bus *cbus;
+	int i;
+
+	pr_devel("  -> OBR %s [%x] +%016llx\n",
+		 bus->self ? pci_name(bus->self) : "root", flags, offset);
+
+	for (i = 0; i < 2; i++) {
+		r = bus->resource[i];
+		if (r && (r->flags & flags)) { 
+			bus->resource[i]->start += offset;
+			bus->resource[i]->end += offset;
+		}
+	}
+	list_for_each_entry(dev, &bus->devices, bus_list)
+		pnv_ioda_offset_dev(dev, flags, offset);
+	list_for_each_entry(cbus, &bus->children, node)
+		pnv_ioda_offset_bus(cbus, flags, offset);
+
+	pr_devel("  <- OBR %s [%x]\n",
+		 bus->self ? pci_name(bus->self) : "root", flags);
+}
+
+/* This is the guts of our IODA resource allocation. This is called
+ * recursively for each bus in the system. It calculates all the
+ * necessary sizes and requirements for children and assigns them
+ * resources such that:
+ *
+ *   - Each function fits in its own contiguous set of IO/M32
+ *     segment
+ *
+ *   - All segments behind a P2P bridge are contiguous and obey
+ *     alignment constraints of those bridges
+ */
+static void __devinit pnv_ioda_calc_bus(struct pci_bus *bus, unsigned int flags,
+					resource_size_t *size,
+					resource_size_t *align)
+{
+	struct pci_controller *hose = pci_bus_to_host(bus);
+	struct pnv_phb *phb = hose->private_data;
+	resource_size_t dev_size, dev_align, start;
+	resource_size_t min_align, min_balign;
+	struct pci_dev *cdev;
+	struct pci_bus *cbus;
+	struct list_head head;
+	struct resource_wrap *w;
+	unsigned int bres;
+
+	*size = *align = 0;
+
+	pr_devel("-> CBR %s [%x]\n",
+		 bus->self ? pci_name(bus->self) : "root", flags);
+
+	/* Calculate alignment requirements based on the type
+	 * of resource we are working on
+	 */
+	if (flags & IORESOURCE_IO) {
+		bres = 0;
+		min_align = phb->ioda.io_segsize;
+		min_balign = 0x1000;
+	} else {
+		bres = 1;
+		min_align = phb->ioda.m32_segsize;
+		min_balign = 0x100000;
+	}
+
+	/* Gather all our children resources ordered by alignment */
+	INIT_LIST_HEAD(&head);
+
+	/*   - Busses */
+	list_for_each_entry(cbus, &bus->children, node) {
+		pnv_ioda_calc_bus(cbus, flags, &dev_size, &dev_align);
+		pnv_ioda_add_wrap(&head, cbus, NULL, dev_size, dev_align);
+	}
+
+	/*   - Devices */
+	list_for_each_entry(cdev, &bus->devices, bus_list) {
+		pnv_ioda_calc_dev(cdev, flags, &dev_size, &dev_align);
+		/* Align them to segment size */
+		if (dev_align < min_align)
+			dev_align = min_align;
+		pnv_ioda_add_wrap(&head, NULL, cdev, dev_size, dev_align);
+	}
+	if (list_empty(&head))
+		goto empty;
+
+	/* Now we can do two things: assign offsets to them within that
+	 * level and get our total alignment & size requirements. The
+	 * assignment algorithm is going to be uber-trivial for now, we
+	 * can try to be smarter later at filling out holes.
+	 */
+	start = bus->self ? 0 : bus->resource[bres]->start;
+
+	/* Don't hand out IO 0 */
+	if ((flags & IORESOURCE_IO) && !bus->self)
+		start += 0x1000;
+
+	while(!list_empty(&head)) {
+		w = list_first_entry(&head, struct resource_wrap, link);
+		list_del(&w->link);
+		if (w->size) {
+			if (start) {
+				start = ALIGN(start, w->align);
+				if (w->dev)
+					pnv_ioda_offset_dev(w->dev,flags,start);
+				else if (w->bus)
+					pnv_ioda_offset_bus(w->bus,flags,start);
+			}
+			if (w->align > *align)
+				*align = w->align;
+		}
+		start += w->size;
+		kfree(w);
+	}
+	*size = start;
+
+	/* Align and setup bridge resources */
+	*align = max_t(resource_size_t, *align,
+		       max_t(resource_size_t, min_align, min_balign));
+	*size = ALIGN(*size,
+		      max_t(resource_size_t, min_align, min_balign));
+ empty:
+	/* Only setup P2P's, not the PHB itself */
+	if (bus->self) {
+		WARN_ON(bus->resource[bres] == NULL);
+		bus->resource[bres]->start = 0;
+		bus->resource[bres]->flags = (*size) ? flags : 0;
+		bus->resource[bres]->end = (*size) ? (*size - 1) : 0;
+
+		/* Clear prefetch bus resources for now */
+		bus->resource[2]->flags = 0;
+	}
+
+	pr_devel("<- CBR %s [%x] *size=%016llx *align=%016llx\n",
+		 bus->self ? pci_name(bus->self) : "root", flags,*size,*align);
+}
+
+static struct pci_dn *pnv_ioda_get_pdn(struct pci_dev *dev)
+{
+	struct device_node *np;
+
+	np = pci_device_to_OF_node(dev);
+	if (!np)
+		return NULL;
+	return PCI_DN(np);
+}
+
+static void __devinit pnv_ioda_setup_pe_segments(struct pci_dev *dev)
+{
+	struct pci_controller *hose = pci_bus_to_host(dev->bus);
+	struct pnv_phb *phb = hose->private_data;
+	struct pci_dn *pdn = pnv_ioda_get_pdn(dev);
+	unsigned int pe, i;
+	resource_size_t pos;
+	struct resource io_res;
+	struct resource m32_res;
+	struct pci_bus_region region;
+	int rc;
+
+	/* Anything not referenced in the device-tree gets PE#0 */
+	pe = pdn ? pdn->pe_number : 0;
+
+	/* Calculate the device min/max */
+	io_res.start = m32_res.start = (resource_size_t)-1;
+	io_res.end = m32_res.end = 0;
+	io_res.flags = IORESOURCE_IO;
+	m32_res.flags = IORESOURCE_MEM;
+
+	for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
+		struct resource *r = NULL;
+		if (dev->resource[i].flags & IORESOURCE_IO)
+			r = &io_res;
+		if (dev->resource[i].flags & IORESOURCE_MEM)
+			r = &m32_res;
+		if (!r)
+			continue;
+		if (dev->resource[i].start < r->start)
+			r->start = dev->resource[i].start;
+		if (dev->resource[i].end > r->end)
+			r->end = dev->resource[i].end;
+	}
+
+	/* Setup IO segments */
+	if (io_res.start < io_res.end) {
+		pcibios_resource_to_bus(dev, &region, &io_res);
+		pos = region.start;
+		i = pos / phb->ioda.io_segsize;
+		while(i < phb->ioda.total_pe && pos <= region.end) {
+			if (phb->ioda.io_segmap[i]) {
+				pr_err("%s: Trying to use IO seg #%d which is"
+				       " already used by PE# %d\n",
+				       pci_name(dev), i,
+				       phb->ioda.io_segmap[i]);
+				/* XXX DO SOMETHING TO DISABLE DEVICE ? */
+				break;
+			}
+			phb->ioda.io_segmap[i] = pe;
+			rc = opal_pci_map_pe_mmio_window(phb->opal_id, pe,
+							 OPAL_IO_WINDOW_TYPE,
+							 0, i);
+			if (rc != OPAL_SUCCESS) {
+				pr_err("%s: OPAL error %d setting up mapping"
+				       " for IO seg# %d\n",
+				       pci_name(dev), rc, i);
+				/* XXX DO SOMETHING TO DISABLE DEVICE ? */
+				break;
+			}
+			pos += phb->ioda.io_segsize;
+			i++;
+		};
+	}
+
+	/* Setup M32 segments */
+	if (m32_res.start < m32_res.end) {
+		pcibios_resource_to_bus(dev, &region, &m32_res);
+		pos = region.start;
+		i = pos / phb->ioda.m32_segsize;
+		while(i < phb->ioda.total_pe && pos <= region.end) {
+			if (phb->ioda.m32_segmap[i]) {
+				pr_err("%s: Trying to use M32 seg #%d which is"
+				       " already used by PE# %d\n",
+				       pci_name(dev), i,
+				       phb->ioda.m32_segmap[i]);
+				/* XXX DO SOMETHING TO DISABLE DEVICE ? */
+				break;
+			}
+			phb->ioda.m32_segmap[i] = pe;
+			rc = opal_pci_map_pe_mmio_window(phb->opal_id, pe,
+							 OPAL_M32_WINDOW_TYPE,
+							 0, i);
+			if (rc != OPAL_SUCCESS) {
+				pr_err("%s: OPAL error %d setting up mapping"
+				       " for M32 seg# %d\n",
+				       pci_name(dev), rc, i);
+				/* XXX DO SOMETHING TO DISABLE DEVICE ? */
+				break;
+			}
+			pos += phb->ioda.m32_segsize;
+		}
+	}
+}
+
+/* Check if a resource still fits in the total IO or M32 range
+ * for a given PHB
+ */
+static int __devinit pnv_ioda_resource_fit(struct pci_controller *hose,
+					   struct resource *r)
+{
+	struct resource *bounds;
+
+	if (r->flags & IORESOURCE_IO)
+		bounds = &hose->io_resource;
+	else if (r->flags & IORESOURCE_MEM)
+		bounds = &hose->mem_resources[0];
+	else
+		return 1;
+
+	if (r->start >= bounds->start && r->end <= bounds->end)
+		return 1;
+	r->flags = 0;
+	return 0;
+}
+
+static void __devinit pnv_ioda_update_resources(struct pci_bus *bus)
+{
+	struct pci_controller *hose = pci_bus_to_host(bus);
+	struct pci_bus *cbus;
+	struct pci_dev *cdev;
+	unsigned int i;
+	u16 cmd;
+
+	/* Clear all device enables  */
+	list_for_each_entry(cdev, &bus->devices, bus_list) {
+		pci_read_config_word(cdev, PCI_COMMAND, &cmd);
+		cmd &= ~(PCI_COMMAND_IO|PCI_COMMAND_MEMORY|PCI_COMMAND_MASTER);
+		pci_write_config_word(cdev, PCI_COMMAND, cmd);
+	}
+
+	/* Check if bus resources fit in our IO or M32 range */
+	for (i = 0; bus->self && (i < 2); i++) {
+		struct resource *r = bus->resource[i];
+		if (r && !pnv_ioda_resource_fit(hose, r))
+			pr_err("%s: Bus %d resource %d disabled, no room\n",
+			       pci_name(bus->self), bus->number, i);
+	}
+
+	/* Update self if it's not a PHB */
+	if (bus->self)
+		pci_setup_bridge(bus);
+
+	/* Update child devices */
+	list_for_each_entry(cdev, &bus->devices, bus_list) {
+		/* Check if resource fits, if not, disable it */
+		for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
+			struct resource *r = &cdev->resource[i];
+			if (!pnv_ioda_resource_fit(hose, r))
+				pr_err("%s: Resource %d disabled, no room\n",
+				       pci_name(cdev), i);
+		}
+
+		/* Assign segments */
+		pnv_ioda_setup_pe_segments(cdev);
+
+		/* Update HW BARs */
+		for (i = 0; i <= PCI_ROM_RESOURCE; i++)
+			pci_update_resource(cdev, i);
+	}
+
+	/* Update child busses */
+	list_for_each_entry(cbus, &bus->children, node)
+		pnv_ioda_update_resources(cbus);
+}
+
+static int __devinit pnv_ioda_alloc_pe(struct pnv_phb *phb)
+{
+	unsigned long pe;
+
+	do {
+		pe = find_next_zero_bit(phb->ioda.pe_alloc,
+					phb->ioda.total_pe, 0);
+		if (pe >= phb->ioda.total_pe)
+			return IODA_INVALID_PE;
+	} while(test_and_set_bit(pe, phb->ioda.pe_alloc));
+
+	phb->ioda.pe_array[pe].pe_number = pe;
+	return pe;
+}
+
+static void __devinit pnv_ioda_free_pe(struct pnv_phb *phb, int pe)
+{
+	WARN_ON(phb->ioda.pe_array[pe].pdev);
+
+	memset(&phb->ioda.pe_array[pe], 0, sizeof(struct pnv_ioda_pe));
+	clear_bit(pe, phb->ioda.pe_alloc);
+}
+
+static struct pnv_ioda_pe * __devinit __pnv_ioda_get_one_pe(struct pci_dev *dev)
+{
+	struct pci_controller *hose = pci_bus_to_host(dev->bus);
+	struct pnv_phb *phb = hose->private_data;
+	struct pci_dn *pdn = pnv_ioda_get_pdn(dev);
+
+	if (!pdn)
+		return NULL;
+	if (pdn->pe_number == IODA_INVALID_PE)
+		return NULL;
+	return &phb->ioda.pe_array[pdn->pe_number];
+}
+
+static struct pnv_ioda_pe * __devinit pnv_ioda_get_pe(struct pci_dev *dev)
+{
+	struct pnv_ioda_pe *pe = __pnv_ioda_get_one_pe(dev);
+
+	while (!pe && dev->bus->self) {
+		dev = dev->bus->self;
+		pe = __pnv_ioda_get_one_pe(dev);
+		if (pe)
+			pe = pe->bus_pe;
+	}
+	return pe;
+}
+
+static int __devinit pnv_ioda_configure_pe(struct pnv_phb *phb,
+					   struct pnv_ioda_pe *pe)
+{
+	struct pci_dev *parent;
+	uint8_t bcomp, dcomp, fcomp;
+	long rc, rid_end, rid;
+
+	/* Bus validation ? */
+	if (pe->pbus) {
+		int count;
+
+		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
+		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
+		parent = pe->pbus->self;
+		count = pe->pbus->subordinate - pe->pbus->secondary + 1;
+		switch(count) {
+		case  1: bcomp = OpalPciBusAll;		break;
+		case  2: bcomp = OpalPciBus7Bits;	break;
+		case  4: bcomp = OpalPciBus6Bits;	break;
+		case  8: bcomp = OpalPciBus5Bits;	break;
+		case 16: bcomp = OpalPciBus4Bits;	break;
+		case 32: bcomp = OpalPciBus3Bits;	break;
+		default:
+			pr_err("%s: Number of subordinate busses %d"
+			       " unsupported\n",
+			       pci_name(pe->pbus->self), count);
+			/* Do an exact match only */
+			bcomp = OpalPciBusAll;
+		}
+		rid_end = pe->rid + (count << 8);
+	} else {
+		parent = pe->pdev->bus->self;
+		bcomp = OpalPciBusAll;
+		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
+		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
+		rid_end = pe->rid + 1;
+	}
+
+	/* Associate PE in PELT */
+	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
+			     bcomp, dcomp, fcomp, OPAL_MAP_PE);
+	if (rc) {
+		pe_err(pe, "OPAL error %ld trying to setup PELT table\n", rc);
+		return -ENXIO;
+	}
+	opal_pci_eeh_freeze_clear(phb->opal_id, pe->pe_number,
+				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
+
+	/* Add to all parents PELT-V */
+	while (parent) {
+		struct pci_dn *pdn = pnv_ioda_get_pdn(parent);
+		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
+			rc = opal_pci_set_peltv(phb->opal_id, pdn->pe_number,
+						pe->pe_number, 1);
+			/* XXX What to do in case of error ? */
+		}
+		parent = parent->bus->self;
+	}
+	/* Setup reverse map */
+	for (rid = pe->rid; rid < rid_end; rid++)
+		phb->ioda.pe_rmap[rid] = pe->pe_number;
+
+	/* Setup one MVTs on IODA1 */
+	if (phb->type == PNV_PHB_IODA1) {
+		pe->mve_number = pe->pe_number;
+		rc = opal_pci_set_mve(phb->opal_id, pe->mve_number,
+				      pe->pe_number);
+		if (rc) {
+			pe_err(pe, "OPAL error %ld setting up MVE %d\n",
+			       rc, pe->mve_number);
+			pe->mve_number = -1;
+		} else {
+			rc = opal_pci_set_mve_enable(phb->opal_id,
+						     pe->mve_number, 1);
+			if (rc) {
+				pe_err(pe, "OPAL error %ld enabling MVE %d\n",
+				       rc, pe->mve_number);
+				pe->mve_number = -1;
+			}
+		}
+	} else if (phb->type == PNV_PHB_IODA2)
+		pe->mve_number = 0;
+
+	return 0;
+}
+
+static void __devinit pnv_ioda_link_pe_by_weight(struct pnv_phb *phb,
+						 struct pnv_ioda_pe *pe)
+{
+	struct pnv_ioda_pe *lpe;
+
+	list_for_each_entry(lpe, &phb->ioda.pe_list, link) {
+		if (lpe->dma_weight < pe->dma_weight) {
+			list_add_tail(&pe->link, &lpe->link);
+			return;
+		}
+	}
+	list_add_tail(&pe->link, &phb->ioda.pe_list);
+}
+
+static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
+{
+	/* This is quite simplistic. The "base" weight of a device
+	 * is 10. 0 means no DMA is to be accounted for it.
+	 */
+
+	/* If it's a bridge, no DMA */
+	if (dev->hdr_type != PCI_HEADER_TYPE_NORMAL)
+		return 0;
+
+	/* Reduce the weight of slow USB controllers */
+	if (dev->class == PCI_CLASS_SERIAL_USB_UHCI ||
+	    dev->class == PCI_CLASS_SERIAL_USB_OHCI ||
+	    dev->class == PCI_CLASS_SERIAL_USB_EHCI)
+		return 3;
+
+	/* Increase the weight of RAID (includes Obsidian) */
+	if ((dev->class >> 8) == PCI_CLASS_STORAGE_RAID)
+		return 15;
+
+	/* Default */
+	return 10;
+}
+
+static struct pnv_ioda_pe * __devinit pnv_ioda_setup_dev_PE(struct pci_dev *dev)
+{
+	struct pci_controller *hose = pci_bus_to_host(dev->bus);
+	struct pnv_phb *phb = hose->private_data;
+	struct pci_dn *pdn = pnv_ioda_get_pdn(dev);
+	struct pnv_ioda_pe *pe;
+	int pe_num;
+
+	if (!pdn) {
+		pr_err("%s: Device tree node not associated properly\n",
+			   pci_name(dev));
+		return NULL;
+	}
+	if (pdn->pe_number != IODA_INVALID_PE)
+		return NULL;
+
+	/* PE#0 has been pre-set */
+	if (dev->bus->number == 0)
+		pe_num = 0;
+	else
+		pe_num = pnv_ioda_alloc_pe(phb);
+	if (pe_num == IODA_INVALID_PE) {
+		pr_warning("%s: Not enough PE# available, disabling device\n",
+			   pci_name(dev));
+		return NULL;
+	}
+
+	/* NOTE: We get only one ref to the pci_dev for the pdn, not for the
+	 * pointer in the PE data structure, both should be destroyed at the
+	 * same time. However, this needs to be looked at more closely again
+	 * once we actually start removing things (Hotplug, SR-IOV, ...)
+	 *
+	 * At some point we want to remove the PDN completely anyways
+	 */
+	pe = &phb->ioda.pe_array[pe_num];
+	pci_dev_get(dev);
+	pdn->pcidev = dev;
+	pdn->pe_number = pe_num;
+	pe->pdev = dev;
+	pe->pbus = NULL;
+	pe->tce32_seg = -1;
+	pe->mve_number = -1;
+	pe->rid = dev->bus->number << 8 | pdn->devfn;
+
+	pe_info(pe, "Associated device to PE\n");
+
+	if (pnv_ioda_configure_pe(phb, pe)) {
+		/* XXX What do we do here ? */
+		if (pe_num)
+			pnv_ioda_free_pe(phb, pe_num);
+		pdn->pe_number = IODA_INVALID_PE;
+		pe->pdev = NULL;
+		pci_dev_put(dev);
+		return NULL;
+	}
+
+	/* Assign a DMA weight to the device */
+	pe->dma_weight = pnv_ioda_dma_weight(dev);
+	if (pe->dma_weight != 0) {
+		phb->ioda.dma_weight += pe->dma_weight;
+		phb->ioda.dma_pe_count++;
+	}
+
+	/* Link the PE */
+	pnv_ioda_link_pe_by_weight(phb, pe);
+
+	return pe;
+}
+
+static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
+{
+	struct pci_dev *dev;
+
+	list_for_each_entry(dev, &bus->devices, bus_list) {
+		struct pci_dn *pdn = pnv_ioda_get_pdn(dev);
+
+		if (pdn == NULL) {
+			pr_warn("%s: No device node associated with device !\n",
+				pci_name(dev));
+			continue;
+		}
+		pci_dev_get(dev);
+		pdn->pcidev = dev;
+		pdn->pe_number = pe->pe_number;
+		pe->dma_weight += pnv_ioda_dma_weight(dev);
+		if (dev->subordinate)
+			pnv_ioda_setup_same_PE(dev->subordinate, pe);
+	}
+}
+
+static void __devinit pnv_ioda_setup_bus_PE(struct pci_dev *dev,
+					    struct pnv_ioda_pe *ppe)
+{
+	struct pci_controller *hose = pci_bus_to_host(dev->bus);
+	struct pnv_phb *phb = hose->private_data;
+	struct pci_bus *bus = dev->subordinate;
+	struct pnv_ioda_pe *pe;
+	int pe_num;
+
+	if (!bus) {
+		pr_warning("%s: Bridge without a subordinate bus !\n",
+			   pci_name(dev));
+		return;
+	}
+	pe_num = pnv_ioda_alloc_pe(phb);
+	if (pe_num == IODA_INVALID_PE) {
+		pr_warning("%s: Not enough PE# available, disabling bus\n",
+			   pci_name(dev));
+		return;
+	}
+
+	pe = &phb->ioda.pe_array[pe_num];
+	ppe->bus_pe = pe;
+	pe->pbus = bus;
+	pe->pdev = NULL;
+	pe->tce32_seg = -1;
+	pe->mve_number = -1;
+	pe->rid = bus->secondary << 8;
+	pe->dma_weight = 0;
+
+	pe_info(pe, "Secondary busses %d..%d associated with PE\n",
+		bus->secondary, bus->subordinate);
+
+	if (pnv_ioda_configure_pe(phb, pe)) {
+		/* XXX What do we do here ? */
+		if (pe_num)
+			pnv_ioda_free_pe(phb, pe_num);
+		pe->pbus = NULL;
+		return;
+	}
+
+	/* Associate it with all child devices */
+	pnv_ioda_setup_same_PE(bus, pe);
+
+	/* Account for one DMA PE if at least one DMA capable device exists
+	 * below the bridge
+	 */
+	if (pe->dma_weight != 0) {
+		phb->ioda.dma_weight += pe->dma_weight;
+		phb->ioda.dma_pe_count++;
+	}
+
+	/* Link the PE */
+	pnv_ioda_link_pe_by_weight(phb, pe);
+}
+
+static void __devinit pnv_ioda_setup_PEs(struct pci_bus *bus)
+{
+	struct pci_dev *dev;
+	struct pnv_ioda_pe *pe;
+
+	list_for_each_entry(dev, &bus->devices, bus_list) {
+		pe = pnv_ioda_setup_dev_PE(dev);
+		if (pe == NULL)
+			continue;
+		/* Leaving the PCIe domain ... single PE# */
+		if (dev->pcie_type == PCI_EXP_TYPE_PCI_BRIDGE)
+			pnv_ioda_setup_bus_PE(dev, pe);
+		else if (dev->subordinate)
+			pnv_ioda_setup_PEs(dev->subordinate);
+	}
+}
+
+static void __devinit pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb,
+						 struct pci_dev *dev)
+{
+	/* We delay DMA setup after we have assigned all PE# */
+}
+
+static void __devinit pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
+					     struct pci_bus *bus)
+{
+	struct pci_dev *dev;
+
+	list_for_each_entry(dev, &bus->devices, bus_list) {
+		set_iommu_table_base(&dev->dev, &pe->tce32_table);
+		if (dev->subordinate)
+			pnv_ioda_setup_bus_dma(pe, dev->subordinate);
+	}
+}
+
+static void __devinit pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
+						struct pnv_ioda_pe *pe,
+						unsigned int base,
+						unsigned int segs)
+{
+
+	struct page *tce_mem = NULL;
+	const __be64 *swinvp;
+	struct iommu_table *tbl;
+	unsigned int i;
+	int64_t rc;
+	void *addr;
+
+	/* 256M DMA window, 4K TCE pages, 8 bytes TCE */
+#define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
+
+	/* XXX FIXME: Handle 64-bit only DMA devices */
+	/* XXX FIXME: Provide 64-bit DMA facilities & non-4K TCE tables etc.. */
+	/* XXX FIXME: Allocate multi-level tables on PHB3 */
+
+	/* We shouldn't already have a 32-bit DMA associated */
+	if (WARN_ON(pe->tce32_seg >= 0))
+		return;
+
+	/* Grab a 32-bit TCE table */
+	pe->tce32_seg = base;
+	pe_info(pe, " Setting up 32-bit TCE table at %08x..%08x\n",
+		(base << 28), ((base + segs) << 28) - 1);
+
+	/* XXX Currently, we allocate one big contiguous table for the
+	 * TCEs. We only really need one chunk per 256M of TCE space
+	 * (ie per segment) but that's an optimization for later, it
+	 * requires some added smarts with our get/put_tce implementation
+	 */
+	tce_mem = alloc_pages_node(phb->hose->node, GFP_KERNEL,
+				   get_order(TCE32_TABLE_SIZE * segs));
+	if (!tce_mem) {
+		pe_err(pe, " Failed to allocate a 32-bit TCE memory\n");
+		goto fail;
+	}
+	addr = page_address(tce_mem);
+	memset(addr, 0, TCE32_TABLE_SIZE * segs);
+
+	/* Configure HW */
+	for (i = 0; i < segs; i++) {
+		rc = opal_pci_map_pe_dma_window(phb->opal_id,
+					      pe->pe_number,
+					      base + i, 1,
+					      __pa(addr) + TCE32_TABLE_SIZE * i,
+					      TCE32_TABLE_SIZE, 0x1000);
+		if (rc) {
+			pe_err(pe, " Failed to configure 32-bit TCE table,"
+			       " err %ld\n", rc);
+			goto fail;
+		}
+	}
+
+	/* Setup linux iommu table */
+	tbl = &pe->tce32_table;
+	pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
+				  base << 28);
+
+	/* OPAL variant of P7IOC SW invalidated TCEs */
+	swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
+	if (swinvp) {
+		/* We need a couple more fields -- an address and a data
+		 * to or.  Since the bus is only printed out on table free
+		 * errors, and on the first pass the data will be a relative
+		 * bus number, print that out instead.
+		 */
+		tbl->it_busno = 0;
+		tbl->it_index = (unsigned long)ioremap(be64_to_cpup(swinvp), 8);
+		tbl->it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE
+			| TCE_PCI_SWINV_PAIR;
+	}
+	iommu_init_table(tbl, phb->hose->node);
+
+	if (pe->pdev)
+		set_iommu_table_base(&pe->pdev->dev, tbl);
+	else
+		pnv_ioda_setup_bus_dma(pe, pe->pbus);
+
+	return;
+ fail:
+	/* XXX Failure: Try to fallback to 64-bit only ? */
+	if (pe->tce32_seg >= 0)
+		pe->tce32_seg = -1;
+	if (tce_mem)
+		__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
+}
+
+static void __devinit pnv_ioda_setup_dma(struct pnv_phb *phb)
+{
+	struct pci_controller *hose = phb->hose;
+	unsigned int residual, remaining, segs, tw, base;
+	struct pnv_ioda_pe *pe;
+
+	/* If we have more PE# than segments available, hand out one
+	 * per PE until we run out and let the rest fail. If not,
+	 * then we assign at least one segment per PE, plus more based
+	 * on the amount of devices under that PE
+	 */
+	if (phb->ioda.dma_pe_count > phb->ioda.tce32_count)
+		residual = 0;
+	else
+		residual = phb->ioda.tce32_count -
+			phb->ioda.dma_pe_count;
+
+	pr_info("PCI: Domain %04x has %ld available 32-bit DMA segments\n",
+		hose->global_number, phb->ioda.tce32_count);
+	pr_info("PCI: %d PE# for a total weight of %d\n",
+		phb->ioda.dma_pe_count, phb->ioda.dma_weight);
+
+	/* Walk our PE list and configure their DMA segments, hand them
+	 * out one base segment plus any residual segments based on
+	 * weight
+	 */
+	remaining = phb->ioda.tce32_count;
+	tw = phb->ioda.dma_weight;
+	base = 0;
+	list_for_each_entry(pe, &phb->ioda.pe_list, link) {
+		if (!pe->dma_weight)
+			continue;
+		if (!remaining) {
+			pe_warn(pe, "No DMA32 resources available\n");
+			continue;
+		}
+		segs = 1;
+		if (residual) {
+			segs += ((pe->dma_weight * residual)  + (tw / 2)) / tw;
+			if (segs > remaining)
+				segs = remaining;
+		}
+		pe_info(pe, "DMA weight %d, assigned %d DMA32 segments\n",
+			pe->dma_weight, segs);
+		pnv_pci_ioda_setup_dma_pe(phb, pe, base, segs);
+		remaining -= segs;
+		base += segs;
+	}
+}
+
+#ifdef CONFIG_PCI_MSI
+static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
+				  unsigned int hwirq, unsigned int is_64,
+				  struct msi_msg *msg)
+{
+	struct pnv_ioda_pe *pe = pnv_ioda_get_pe(dev);
+	unsigned int xive_num = hwirq - phb->msi_base;
+	uint64_t addr64;
+	uint32_t addr32, data;
+	int rc;
+
+	/* No PE assigned ? bail out ... no MSI for you ! */
+	if (pe == NULL)
+		return -ENXIO;
+
+	/* Check if we have an MVE */
+	if (pe->mve_number < 0)
+		return -ENXIO;
+
+	/* Assign XIVE to PE */
+	rc = opal_pci_set_xive_pe(phb->opal_id, pe->pe_number, xive_num);
+	if (rc) {
+		pr_warn("%s: OPAL error %d setting XIVE %d PE\n",
+			pci_name(dev), rc, xive_num);
+		return -EIO;
+	}
+
+	if (is_64) {
+		rc = opal_get_msi_64(phb->opal_id, pe->mve_number, xive_num, 1,
+				     &addr64, &data);
+		if (rc) {
+			pr_warn("%s: OPAL error %d getting 64-bit MSI data\n",
+				pci_name(dev), rc);
+			return -EIO;
+		}
+		msg->address_hi = addr64 >> 32;
+		msg->address_lo = addr64 & 0xfffffffful;
+	} else {
+		rc = opal_get_msi_32(phb->opal_id, pe->mve_number, xive_num, 1,
+				     &addr32, &data);
+		if (rc) {
+			pr_warn("%s: OPAL error %d getting 32-bit MSI data\n",
+				pci_name(dev), rc);
+			return -EIO;
+		}
+		msg->address_hi = 0;
+		msg->address_lo = addr32;
+	}
+	msg->data = data;
+
+	pr_devel("%s: %s-bit MSI on hwirq %x (xive #%d),"
+		 " address=%x_%08x data=%x PE# %d\n",
+		 pci_name(dev), is_64 ? "64" : "32", hwirq, xive_num,
+		 msg->address_hi, msg->address_lo, data, pe->pe_number);
+
+	return 0;
+}
+
+static void pnv_pci_init_ioda_msis(struct pnv_phb *phb)
+{
+	unsigned int bmap_size;
+	const __be32 *prop = of_get_property(phb->hose->dn,
+					     "ibm,opal-msi-ranges", NULL);
+	if (!prop) {
+		/* BML Fallback */
+		prop = of_get_property(phb->hose->dn, "msi-ranges", NULL);
+	}
+	if (!prop)
+		return;
+
+	phb->msi_base = be32_to_cpup(prop);
+	phb->msi_count = be32_to_cpup(prop + 1);
+	bmap_size = BITS_TO_LONGS(phb->msi_count) * sizeof(unsigned long);
+	phb->msi_map = zalloc_maybe_bootmem(bmap_size, GFP_KERNEL);
+	if (!phb->msi_map) {
+		pr_err("PCI %d: Failed to allocate MSI bitmap !\n",
+		       phb->hose->global_number);
+		return;
+	}
+	phb->msi_setup = pnv_pci_ioda_msi_setup;
+	phb->msi32_support = 1;
+	pr_info("  Allocated bitmap for %d MSIs (base IRQ 0x%x)\n",
+		phb->msi_count, phb->msi_base);
+}
+#else
+static void pnv_pci_init_ioda_msis(struct pnv_phb *phb) { }
+#endif /* CONFIG_PCI_MSI */
+
+/* This is the starting point of our IODA specific resource
+ * allocation process
+ */
+static void __devinit pnv_pci_ioda_fixup_phb(struct pci_controller *hose)
+{
+	resource_size_t size, align;
+	struct pci_bus *child;
+
+	/* Associate PEs per functions */
+	pnv_ioda_setup_PEs(hose->bus);
+
+	/* Calculate all resources */
+	pnv_ioda_calc_bus(hose->bus, IORESOURCE_IO, &size, &align);
+	pnv_ioda_calc_bus(hose->bus, IORESOURCE_MEM, &size, &align);
+
+	/* Apply them to HW */
+	pnv_ioda_update_resources(hose->bus);
+
+	/* Setup DMA */
+	pnv_ioda_setup_dma(hose->private_data);
+
+	/* Configure PCI Express settings */
+	list_for_each_entry(child, &hose->bus->children, node) {
+		struct pci_dev *self = child->self;
+		if (!self)
+			continue;
+		pcie_bus_configure_settings(child, self->pcie_mpss);
+	}
+}
+
+/* Prevent enabling devices for which we couldn't properly
+ * assign a PE
+ */
+static int __devinit pnv_pci_enable_device_hook(struct pci_dev *dev)
+{
+	struct pci_dn *pdn = pnv_ioda_get_pdn(dev);
+
+	if (!pdn || pdn->pe_number == IODA_INVALID_PE)
+		return -EINVAL;
+	return 0;
+}
+
+static u32 pnv_ioda_bdfn_to_pe(struct pnv_phb *phb, struct pci_bus *bus,
+			       u32 devfn)
+{
+	return phb->ioda.pe_rmap[(bus->number << 8) | devfn];
+}
+
+void __init pnv_pci_init_ioda1_phb(struct device_node *np)
+{
+	struct pci_controller *hose;
+	static int primary = 1;
+	struct pnv_phb *phb;
+	unsigned long size, m32map_off, iomap_off, pemap_off;
+	const u64 *prop64;
+	u64 phb_id;
+	void *aux;
+	long rc;
+
+	pr_info(" Initializing IODA OPAL PHB %s\n", np->full_name);
+
+	prop64 = of_get_property(np, "ibm,opal-phbid", NULL);
+	if (!prop64) {
+		pr_err("  Missing \"ibm,opal-phbid\" property !\n");
+		return;
+	}
+	phb_id = be64_to_cpup(prop64);
+	pr_debug("  PHB-ID  : 0x%016llx\n", phb_id);
+
+	phb = alloc_bootmem(sizeof(struct pnv_phb));
+	if (phb) {
+		memset(phb, 0, sizeof(struct pnv_phb));
+		phb->hose = hose = pcibios_alloc_controller(np);
+	}
+	if (!phb || !phb->hose) {
+		pr_err("PCI: Failed to allocate PCI controller for %s\n",
+		       np->full_name);
+		return;
+	}
+
+	spin_lock_init(&phb->lock);
+	/* XXX Use device-tree */
+	hose->first_busno = 0;
+	hose->last_busno = 0xff;
+	hose->private_data = phb;
+	phb->opal_id = phb_id;
+	phb->type = PNV_PHB_IODA1;
+
+	/* We parse "ranges" now since we need to deduce the register base
+	 * from the IO base
+	 */
+	pci_process_bridge_OF_ranges(phb->hose, np, primary);
+	primary = 0;
+
+	/* Magic formula from Milton */
+	phb->regs = of_iomap(np, 0);
+	if (phb->regs == NULL)
+		pr_err("  Failed to map registers !\n");
+
+
+	/* XXX This is hack-a-thon. This needs to be changed so that:
+	 *  - we obtain stuff like PE# etc... from device-tree
+	 *  - we properly re-allocate M32 ourselves
+	 *    (the OFW one isn't very good)
+	 */
+
+	/* Initialize more IODA stuff */
+	phb->ioda.total_pe = 128;
+
+	phb->ioda.m32_size = resource_size(&hose->mem_resources[0]);
+	/* OFW has already chopped off the top 64k of M32 space (MSI space) */
+	phb->ioda.m32_size += 0x10000;
+
+	phb->ioda.m32_segsize = phb->ioda.m32_size / phb->ioda.total_pe;
+	phb->ioda.m32_pci_base = hose->mem_resources[0].start -
+		hose->pci_mem_offset;
+	phb->ioda.io_size = hose->pci_io_size;
+	phb->ioda.io_segsize = phb->ioda.io_size / phb->ioda.total_pe;
+	phb->ioda.io_pci_base = 0; /* XXX calculate this ? */
+
+	/* Allocate aux data & arrays */
+	size = _ALIGN_UP(phb->ioda.total_pe / 8, sizeof(unsigned long));
+	m32map_off = size;
+	size += phb->ioda.total_pe;
+	iomap_off = size;
+	size += phb->ioda.total_pe;
+	pemap_off = size;
+	size += phb->ioda.total_pe * sizeof(struct pnv_ioda_pe);
+	aux = alloc_bootmem(size);
+	memset(aux, 0, size);
+	phb->ioda.pe_alloc = aux;
+	phb->ioda.m32_segmap = aux + m32map_off;
+	phb->ioda.io_segmap = aux + iomap_off;
+	phb->ioda.pe_array = aux + pemap_off;
+	set_bit(0, phb->ioda.pe_alloc);
+
+	INIT_LIST_HEAD(&phb->ioda.pe_list);
+
+	/* Calculate how many 32-bit TCE segments we have */
+	phb->ioda.tce32_count = phb->ioda.m32_pci_base >> 28;
+
+	/* Clear unusable m64 */
+	hose->mem_resources[1].flags = 0;
+	hose->mem_resources[1].start = 0;
+	hose->mem_resources[1].end = 0;
+	hose->mem_resources[2].flags = 0;
+	hose->mem_resources[2].start = 0;
+	hose->mem_resources[2].end = 0;
+
+#if 0
+	rc = opal_pci_set_phb_mem_window(opal->phb_id,
+					 window_type,
+					 window_num,
+					 starting_real_address,
+					 starting_pci_address,
+					 segment_size);
+#endif
+
+	pr_info("  %d PE's M32: 0x%x [segment=0x%x] IO: 0x%x [segment=0x%x]\n",
+		phb->ioda.total_pe,
+		phb->ioda.m32_size, phb->ioda.m32_segsize,
+		phb->ioda.io_size, phb->ioda.io_segsize);
+
+	if (phb->regs)  {
+		pr_devel(" BUID     = 0x%016llx\n", in_be64(phb->regs + 0x100));
+		pr_devel(" PHB2_CR  = 0x%016llx\n", in_be64(phb->regs + 0x160));
+		pr_devel(" IO_BAR   = 0x%016llx\n", in_be64(phb->regs + 0x170));
+		pr_devel(" IO_BAMR  = 0x%016llx\n", in_be64(phb->regs + 0x178));
+		pr_devel(" IO_SAR   = 0x%016llx\n", in_be64(phb->regs + 0x180));
+		pr_devel(" M32_BAR  = 0x%016llx\n", in_be64(phb->regs + 0x190));
+		pr_devel(" M32_BAMR = 0x%016llx\n", in_be64(phb->regs + 0x198));
+		pr_devel(" M32_SAR  = 0x%016llx\n", in_be64(phb->regs + 0x1a0));
+	}
+	phb->hose->ops = &pnv_pci_ops;
+
+	/* Setup RID -> PE mapping function */
+	phb->bdfn_to_pe = pnv_ioda_bdfn_to_pe;
+
+	/* Setup TCEs */
+	phb->dma_dev_setup = pnv_pci_ioda_dma_dev_setup;
+
+	/* Setup MSI support */
+	pnv_pci_init_ioda_msis(phb);
+
+	/* We set both probe_only and PCI_REASSIGN_ALL_RSRC. This is an
+	 * odd combination which essentially means that we skip all resource
+	 * fixups and assignments in the generic code, and do it all
+	 * ourselves here
+	 */
+	pci_probe_only = 1;
+	ppc_md.pcibios_fixup_phb = pnv_pci_ioda_fixup_phb;
+	ppc_md.pcibios_enable_device_hook = pnv_pci_enable_device_hook;
+	pci_add_flags(PCI_REASSIGN_ALL_RSRC);
+
+	/* Reset IODA tables to a clean state */
+	rc = opal_pci_reset(phb_id, OPAL_PCI_IODA_RESET, OPAL_ASSERT_RESET);
+	if (rc)
+		pr_warning("  OPAL Error %ld performing IODA reset !\n", rc);
+	opal_pci_set_pe(phb_id, 0, 0, 7, 1, 1 , OPAL_MAP_PE);
+}
+
+void __init pnv_pci_init_ioda_hub(struct device_node *np)
+{
+	struct device_node *phbn;
+	const u64 *prop64;
+	u64 hub_id;
+
+	pr_info("Probing IODA IO-Hub %s\n", np->full_name);
+
+	prop64 = of_get_property(np, "ibm,opal-hubid", NULL);
+	if (!prop64) {
+		pr_err(" Missing \"ibm,opal-hubid\" property !\n");
+		return;
+	}
+	hub_id = be64_to_cpup(prop64);
+	pr_devel(" HUB-ID : 0x%016llx\n", hub_id);
+
+	/* Count child PHBs */
+	for_each_child_of_node(np, phbn) {
+		/* Look for IODA1 PHBs */
+		if (of_device_is_compatible(phbn, "ibm,ioda-phb"))
+			pnv_pci_init_ioda1_phb(phbn);
+	}
+}
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index baef772..c0ed379 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -467,12 +467,24 @@ void __init pnv_pci_init(void)
 		init_pci_config_tokens();
 		find_and_init_phbs();
 #endif /* CONFIG_PPC_POWERNV_RTAS */
-	} else {
-		/* OPAL is here, do our normal stuff */
+	}
+	/* OPAL is here, do our normal stuff */
+	else {
+		int found_ioda = 0;
+
+		/* Look for IODA IO-Hubs. We don't support mixing IODA
+		 * and p5ioc2 due to the need to change some global
+		 * probing flags
+		 */
+		for_each_compatible_node(np, NULL, "ibm,ioda-hub") {
+			pnv_pci_init_ioda_hub(np);
+			found_ioda = 1;
+		}
 
 		/* Look for p5ioc2 IO-Hubs */
-		for_each_compatible_node(np, NULL, "ibm,p5ioc2")
-			pnv_pci_init_p5ioc2_hub(np);
+		if (!found_ioda)
+			for_each_compatible_node(np, NULL, "ibm,p5ioc2")
+				pnv_pci_init_p5ioc2_hub(np);
 	}
 
 	/* Setup the linkage between OF nodes and PHBs */
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index d4dbc49..28ae4ca 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -9,6 +9,50 @@ enum pnv_phb_type {
 	PNV_PHB_IODA2,
 };
 
+/* Data associated with a PE, including IOMMU tracking etc.. */
+struct pnv_ioda_pe {
+	/* A PE can be associated with a single device or an
+	 * entire bus (& children). In the former case, pdev
+	 * is populated, in the latter case, pbus is.
+	 */
+	struct pci_dev		*pdev;
+	struct pci_bus		*pbus;
+
+	/* Effective RID (device RID for a device PE and base bus
+	 * RID with devfn 0 for a bus PE)
+	 */
+	unsigned int		rid;
+
+	/* PE number */
+	unsigned int		pe_number;
+
+	/* "Weight" assigned to the PE for the sake of DMA resource
+	 * allocations
+	 */
+	unsigned int		dma_weight;
+
+	/* This is a PCI-E -> PCI-X bridge, this points to the
+	 * corresponding bus PE
+	 */
+	struct pnv_ioda_pe	*bus_pe;
+
+	/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
+	int			tce32_seg;
+	int			tce32_segcount;
+	struct iommu_table	tce32_table;
+
+	/* XXX TODO: Add support for additional 64-bit iommus */
+
+	/* MSIs. MVE index is identical for 32 and 64 bit MSI
+	 * and -1 if not supported. (It's actually identical to the
+	 * PE number)
+	 */
+	int			mve_number;
+
+	/* Link in list of PE#s */
+	struct list_head	link;
+};
+
 struct pnv_phb {
 	struct pci_controller	*hose;
 	enum pnv_phb_type	type;
@@ -34,6 +78,45 @@ struct pnv_phb {
 		struct {
 			struct iommu_table iommu_table;
 		} p5ioc2;
+
+		struct {
+			/* Global bridge info */
+			unsigned int		total_pe;
+			unsigned int		m32_size;
+			unsigned int		m32_segsize;
+			unsigned int		m32_pci_base;
+			unsigned int		io_size;
+			unsigned int		io_segsize;
+			unsigned int		io_pci_base;
+
+			/* PE allocation bitmap */
+			unsigned long		*pe_alloc;
+
+			/* M32 & IO segment maps */
+			unsigned int		*m32_segmap;
+			unsigned int		*io_segmap;
+			struct pnv_ioda_pe	*pe_array;
+
+			/* Reverse map of PEs, will have to extend if
+			 * we are to support more than 256 PEs, indexed
+			 * by { bus, devfn }
+			 */
+			unsigned char		pe_rmap[0x10000];
+
+			/* 32-bit TCE tables allocation */
+			unsigned long		tce32_count;
+
+			/* Total "weight" for the sake of DMA resources
+			 * allocation
+			 */
+			unsigned int		dma_weight;
+			unsigned int		dma_pe_count;
+
+			/* Sorted list of used PE's, sorted at
+			 * boot for resource allocation purposes
+			 */
+			struct list_head	pe_list;
+		} ioda;
 	};
 };
 
@@ -43,6 +126,7 @@ extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
 				      void *tce_mem, u64 tce_size,
 				      u64 dma_offset);
 extern void pnv_pci_init_p5ioc2_hub(struct device_node *np);
+extern void pnv_pci_init_ioda_hub(struct device_node *np);
 
 
 #endif /* __POWERNV_PCI_H */
-- 
1.7.7.1

^ permalink raw reply related
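
The DMA segment hand-out in pnv_ioda_setup_dma() above can be read as a small
standalone calculation. The following is only a sketch of that arithmetic under
stated assumptions: struct pe_sketch and distribute_segments() are hypothetical
names invented for illustration, a plain array stands in for the kernel's PE
list, and no OPAL or IOMMU setup is performed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-PE data used by the patch. */
struct pe_sketch {
	unsigned int dma_weight;	/* 0: no DMA capable device below */
	unsigned int base;		/* first 256M segment assigned */
	unsigned int segs;		/* number of segments assigned */
};

/*
 * Give every weighted PE one base segment, plus a share of the residual
 * segments proportional to its weight (rounded to nearest), capped by
 * what is still available. Returns the number of unassigned segments.
 */
static unsigned int distribute_segments(struct pe_sketch *pes, size_t n,
					unsigned int total_segs)
{
	unsigned int tw = 0, dma_pes = 0, residual, remaining, base = 0;
	size_t i;

	assert(pes != NULL);

	/* Total weight and count of PEs that need DMA at all */
	for (i = 0; i < n; i++) {
		pes[i].base = 0;
		pes[i].segs = 0;
		if (pes[i].dma_weight) {
			tw += pes[i].dma_weight;
			dma_pes++;
		}
	}

	/* Segments left over once each DMA PE has its base segment */
	residual = dma_pes > total_segs ? 0 : total_segs - dma_pes;
	remaining = total_segs;

	for (i = 0; i < n; i++) {
		unsigned int segs;

		if (!pes[i].dma_weight || !remaining)
			continue;
		segs = 1;
		if (residual) {
			segs += (pes[i].dma_weight * residual + tw / 2) / tw;
			if (segs > remaining)
				segs = remaining;
		}
		pes[i].base = base;
		pes[i].segs = segs;
		remaining -= segs;
		base += segs;
	}
	return remaining;
}
```

With weights 10, 30 and 0 and 8 segments available, this sketch assigns 3
segments to the first PE, caps the second at the 5 that remain, and gives the
weightless PE nothing. The cap matters because rounding to nearest can promise
more residual segments than are left.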

* [PATCH 4/5] powerpc/powernv: Fixup p7ioc PCIe root complex class code
From: Benjamin Herrenschmidt @ 2011-11-07  4:56 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <1320641761-4028-1-git-send-email-benh@kernel.crashing.org>

It advertises "host bridge" instead of "PCI to PCI bridge" which confuses
the Linux probe code.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 arch/powerpc/platforms/powernv/pci.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 8b90d94..baef772 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -416,6 +416,13 @@ static void __devinit pnv_pci_dma_dev_setup(struct pci_dev *pdev)
 		pnv_pci_dma_fallback_setup(hose, pdev);
 }
 
+/* Fixup wrong class code in p7ioc root complex */
+static void __devinit pnv_p7ioc_rc_quirk(struct pci_dev *dev)
+{
+	dev->class = PCI_CLASS_BRIDGE_PCI << 8;
+}
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_IBM, 0x3b9, pnv_p7ioc_rc_quirk);
+
 static int pnv_pci_probe_mode(struct pci_bus *bus)
 {
 	struct pci_controller *hose = pci_bus_to_host(bus);
-- 
1.7.7.1

^ permalink raw reply related
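
For readers unfamiliar with the class code layout, here is a minimal sketch of
what the quirk above changes. It assumes the usual 24-bit encoding (base class,
sub-class, programming interface) and copies the two class constants from the
kernel's pci_ids.h; the helper names are hypothetical and nothing here touches
real hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Values as defined in the kernel's include/linux/pci_ids.h */
#define PCI_CLASS_BRIDGE_HOST	0x0600
#define PCI_CLASS_BRIDGE_PCI	0x0604

/* Hypothetical helper: drop the prog-if byte, keep base class + sub-class */
static uint32_t class_base_sub(uint32_t class24)
{
	return class24 >> 8;
}

/* Hypothetical helper: would probe code treat this as a PCI-to-PCI bridge? */
static int is_pci_to_pci_bridge(uint32_t class24)
{
	return class_base_sub(class24) == PCI_CLASS_BRIDGE_PCI;
}

/* Same assignment as the quirk: force "PCI to PCI bridge", prog-if 0 */
static uint32_t fixup_class(uint32_t class24)
{
	assert((class24 >> 24) == 0);	/* class codes are 24 bits wide */
	return PCI_CLASS_BRIDGE_PCI << 8;
}
```

A device advertising 0x060000 (host bridge) fails the bridge test in this
sketch; after the fixup it reads 0x060400 and passes.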

* Re: [PATCH 2/5] powerpc/pci: Change how re-assigning resouces work
From: Benjamin Herrenschmidt @ 2011-11-07  5:01 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <1320641761-4028-2-git-send-email-benh@kernel.crashing.org>

On Mon, 2011-11-07 at 15:55 +1100, Benjamin Herrenschmidt wrote:
> When PCI_REASSIGN_ALL_RSRC is set, we used to clear all bus resources
> at the beginning of survey and re-allocate them later.
> 
> This changes it so instead, during early fixup, we mark all resources
> as IORESOURCE_UNSET and move them down to be 0-based.
> 
> Later, if bus resources are still unset at the beginning of the survey,
> then we clear them.
> 
> This shouldn't impact the re-assignment case on 4xx, but will enable
> us to have the platform do some custom resource assignment before the
> survey, by clearing individual resources IORESOURCE_UNSET bit.
> 
> Also limits the clutter in the kernel log from fixup when re-assigning
> since we don't care about the offset applied to the BAR values in this
> case.

Hi guys !

This one could be invasive to FSL and 4xx if you use
PCI_REASSIGN_ALL_RSRC. I very much want to merge it for 3.3, so any
chance you can give it a beating and see if everything is still
all right ?

Cheers,
Ben.

> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> ---
>  arch/powerpc/kernel/pci-common.c |   66 ++++++++++++++++++++-----------------
>  1 files changed, 36 insertions(+), 30 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
> index 855969f..d34ba7e 100644
> --- a/arch/powerpc/kernel/pci-common.c
> +++ b/arch/powerpc/kernel/pci-common.c
> @@ -920,18 +920,22 @@ static void __devinit pcibios_fixup_resources(struct pci_dev *dev)
>  		struct resource *res = dev->resource + i;
>  		if (!res->flags)
>  			continue;
> -		/* On platforms that have PCI_PROBE_ONLY set, we don't
> -		 * consider 0 as an unassigned BAR value. It's technically
> -		 * a valid value, but linux doesn't like it... so when we can
> -		 * re-assign things, we do so, but if we can't, we keep it
> -		 * around and hope for the best...
> +
> +		/* If we're going to re-assign everything, we mark all resources
> +		 * as unset (and 0-base them). In addition, we mark BARs starting
> +		 * at 0 as unset as well, except if PCI_PROBE_ONLY is also set
> +		 * since in that case, we don't want to re-assign anything
>  		 */
> -		if (res->start == 0 && !pci_has_flag(PCI_PROBE_ONLY)) {
> -			pr_debug("PCI:%s Resource %d %016llx-%016llx [%x] is unassigned\n",
> -				 pci_name(dev), i,
> -				 (unsigned long long)res->start,
> -				 (unsigned long long)res->end,
> -				 (unsigned int)res->flags);
> +		if (pci_has_flag(PCI_REASSIGN_ALL_RSRC) ||
> +		    (res->start == 0 && !pci_has_flag(PCI_PROBE_ONLY))) {
> +			/* Only print message if not re-assigning */
> +			if (!pci_has_flag(PCI_REASSIGN_ALL_RSRC))
> +				pr_debug("PCI:%s Resource %d %016llx-%016llx [%x] "
> +					 "is unassigned\n",
> +					 pci_name(dev), i,
> +					 (unsigned long long)res->start,
> +					 (unsigned long long)res->end,
> +					 (unsigned int)res->flags);
>  			res->end -= res->start;
>  			res->start = 0;
>  			res->flags |= IORESOURCE_UNSET;
> @@ -1041,6 +1045,16 @@ static void __devinit pcibios_fixup_bridge(struct pci_bus *bus)
>  		if (i >= 3 && bus->self->transparent)
>  			continue;
>  
> +		/* If we are going to re-assign everything, mark the resource
> +		 * as unset and move it down to 0
> +		 */
> +		if (pci_has_flag(PCI_REASSIGN_ALL_RSRC)) {
> +			res->flags |= IORESOURCE_UNSET;
> +			res->end -= res->start;
> +			res->start = 0;
> +			continue;
> +		}
> +
>  		pr_debug("PCI:%s Bus rsrc %d %016llx-%016llx [%x] fixup...\n",
>  			 pci_name(dev), i,
>  			 (unsigned long long)res->start,\
> @@ -1261,18 +1275,15 @@ void pcibios_allocate_bus_resources(struct pci_bus *bus)
>  	pci_bus_for_each_resource(bus, res, i) {
>  		if (!res || !res->flags || res->start > res->end || res->parent)
>  			continue;
> +
> +		/* If the resource was left unset at this point, we clear it */
> +		if (res->flags & IORESOURCE_UNSET)
> +			goto clear_resource;
> +
>  		if (bus->parent == NULL)
>  			pr = (res->flags & IORESOURCE_IO) ?
>  				&ioport_resource : &iomem_resource;
>  		else {
> -			/* Don't bother with non-root busses when
> -			 * re-assigning all resources. We clear the
> -			 * resource flags as if they were colliding
> -			 * and as such ensure proper re-allocation
> -			 * later.
> -			 */
> -			if (pci_has_flag(PCI_REASSIGN_ALL_RSRC))
> -				goto clear_resource;
>  			pr = pci_find_parent_resource(bus->self, res);
>  			if (pr == res) {
>  				/* this happens when the generic PCI
> @@ -1303,9 +1314,9 @@ void pcibios_allocate_bus_resources(struct pci_bus *bus)
>  			if (reparent_resources(pr, res) == 0)
>  				continue;
>  		}
> -		printk(KERN_WARNING "PCI: Cannot allocate resource region "
> -		       "%d of PCI bridge %d, will remap\n", i, bus->number);
> -clear_resource:
> +		pr_warning("PCI: Cannot allocate resource region "
> +			   "%d of PCI bridge %d, will remap\n", i, bus->number);
> +	clear_resource:
>  		res->start = res->end = 0;
>  		res->flags = 0;
>  	}
> @@ -1450,16 +1461,11 @@ void __init pcibios_resource_survey(void)
>  {
>  	struct pci_bus *b;
>  
> -	/* Allocate and assign resources. If we re-assign everything, then
> -	 * we skip the allocate phase
> -	 */
> +	/* Allocate and assign resources */
>  	list_for_each_entry(b, &pci_root_buses, node)
>  		pcibios_allocate_bus_resources(b);
> -
> -	if (!pci_has_flag(PCI_REASSIGN_ALL_RSRC)) {
> -		pcibios_allocate_resources(0);
> -		pcibios_allocate_resources(1);
> -	}
> +	pcibios_allocate_resources(0);
> +	pcibios_allocate_resources(1);
>  
>  	/* Before we start assigning unassigned resource, we try to reserve
>  	 * the low IO area and the VGA memory area if they intersect the

^ permalink raw reply
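
The life cycle described in the quoted commit message (mark unset and 0-base
during early fixup, let the platform claim individual resources, clear whatever
is still unset at survey time) can be sketched in isolation. This is an
illustration only: struct res_sketch, SKETCH_RES_UNSET and the three helpers
are hypothetical stand-ins, not the kernel's struct resource API.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag; the kernel's IORESOURCE_UNSET lives in linux/ioport.h */
#define SKETCH_RES_UNSET	0x20000000UL

struct res_sketch {
	uint64_t start, end;
	unsigned long flags;
};

/* Early fixup: keep the size, move the range down to 0, flag it unset */
static void mark_unset(struct res_sketch *res)
{
	assert(res->start <= res->end);
	res->end -= res->start;
	res->start = 0;
	res->flags |= SKETCH_RES_UNSET;
}

/* Platform hook: place the resource and drop the unset bit */
static void platform_assign(struct res_sketch *res, uint64_t new_start)
{
	uint64_t size = res->end - res->start;

	res->start = new_start;
	res->end = new_start + size;
	res->flags &= ~SKETCH_RES_UNSET;
}

/* Survey: a bus resource nobody claimed is cleared for re-allocation */
static void clear_if_unset(struct res_sketch *res)
{
	if (res->flags & SKETCH_RES_UNSET) {
		res->start = res->end = 0;
		res->flags = 0;
	}
}
```

In this sketch a 1M window at 0xa0000000 becomes 0..0xfffff with the unset bit
after the fixup. If the platform assigns it before the survey it survives at
its new address; otherwise the survey clears it.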

* [PATCH] powerpc: fix building hvc_opal.c
From: Michael Neuling @ 2011-11-07  6:05 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev, Paul Mackerras, sfr

Fix the following build error:
drivers/tty/hvc/hvc_opal.c:244:12: error: 'THIS_MODULE' undeclared here (not in a function)

Signed-off-by: Michael Neuling <mikey@neuling.org>

diff --git a/drivers/tty/hvc/hvc_opal.c b/drivers/tty/hvc/hvc_opal.c
index 7b38512..cb3938f 100644
--- a/drivers/tty/hvc/hvc_opal.c
+++ b/drivers/tty/hvc/hvc_opal.c
@@ -28,6 +28,7 @@
 #include <linux/console.h>
 #include <linux/of.h>
 #include <linux/of_platform.h>
+#include <linux/module.h>
 
 #include <asm/hvconsole.h>
 #include <asm/prom.h>

^ permalink raw reply related

* [PATCH] powerpc: fix building hvc_opal.c
From: Michael Neuling @ 2011-11-07  6:12 UTC (permalink / raw)
  To: benh, Paul Mackerras, linuxppc-dev, sfr, Paul Gortmaker, torvalds
In-Reply-To: <26327.1320645941@neuling.org>

Fix the following build error:
drivers/tty/hvc/hvc_opal.c:244:12: error: 'THIS_MODULE' undeclared here (not in a function)

Signed-off-by: Michael Neuling <mikey@neuling.org>
--
Actually, this is the right fix.

sfr says this was a merge conflict between the module.h split up and the
powerpc tree, which were both merged by Linus today.

diff --git a/drivers/tty/hvc/hvc_opal.c b/drivers/tty/hvc/hvc_opal.c
index 7b38512..ced26c8 100644
--- a/drivers/tty/hvc/hvc_opal.c
+++ b/drivers/tty/hvc/hvc_opal.c
@@ -28,6 +28,7 @@
 #include <linux/console.h>
 #include <linux/of.h>
 #include <linux/of_platform.h>
+#include <linux/export.h>
 
 #include <asm/hvconsole.h>
 #include <asm/prom.h>

^ permalink raw reply related

* Re: Regression: patch " hvc_console: display printk messages on console." causing infinite loop with 3.2-rc0 + Xen.
From: Stephen Rothwell @ 2011-11-07  6:19 UTC (permalink / raw)
  To: Linus
  Cc: xen-devel, Konrad Rzeszutek Wilk, Rusty Russell, miche,
	linux-kernel, virtualization, Anton Blanchard, Amit Shah, ppc-dev,
	Greg KH
In-Reply-To: <20111103013012.GB3449@suse.de>

[-- Attachment #1: Type: text/plain, Size: 1664 bytes --]

Hi Greg,

On Wed, 2 Nov 2011 18:30:12 -0700 Greg KH <gregkh@suse.de> wrote:
>
> On Wed, Nov 02, 2011 at 12:13:09PM +1100, Stephen Rothwell wrote:
> > 
> > On Thu, 27 Oct 2011 07:48:06 +0200 Greg KH <gregkh@suse.de> wrote:
> > >
> > > On Thu, Oct 27, 2011 at 01:30:08AM -0400, Konrad Rzeszutek Wilk wrote:
> > > > Hey Miche.
> > > > 
> > > > The git commit 361162459f62dc0826b82c9690a741a940f457f0:
> > > > 
> > > >     hvc_console: display printk messages on console.
> > > > 
> > > > is causing an infinite loop when booting Linux under Xen, as so:
> > > 
> > > Ick, not good, thanks for letting us know.
> > 
> > Indeed. I am wondering why it was put in a tree and sent to Linus without
> > any Acks or even being replied to by anyone.  It appeared in the tty tree
> > between Oct 14 and Oct 25 (while I was unfortunately on vacation).  If
> > anyone had tried to boot this on any PowerPC server, it would have been
> > immediately obvious (as it was when I booted Linus' tree last night).
> > 
> > And the original author expressed doubts as to his understanding of how
> > it should all work anyway.
> > 
> > Just a little more care, please.
> > 
> > I would vote for reverting the original and having it resubmitted with
> > corrections at some later date.
> 
> You are right, I will go do that, sorry for the problems.

Ping ...

Linus can you please just revert 361162459f62dc0826b82c9690a741a940f457f0
"hvc_console: display printk messages on console" as it breaks consoles
for all PowerPC server machines.

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* [PATCH] powerpc/p1023: set IRQ[4:6, 11] to high level sensitive for PCIe
From: Roy Zang @ 2011-11-07  8:32 UTC (permalink / raw)
  To: linuxppc-dev

The P1023 external IRQ[4:6, 11] are not pinned out, but the interrupts are
shared with the PCIe controller.
The silicon internally ties these interrupt lines low, so change
IRQ[4:6, 11] to high level sensitive for PCIe.

Signed-off-by: Roy Zang <tie-fei.zang@freescale.com>
---
 arch/powerpc/boot/dts/p1023rds.dts |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/boot/dts/p1023rds.dts b/arch/powerpc/boot/dts/p1023rds.dts
index d9b7767..66bf804 100644
--- a/arch/powerpc/boot/dts/p1023rds.dts
+++ b/arch/powerpc/boot/dts/p1023rds.dts
@@ -490,9 +490,9 @@
 			interrupt-map-mask = <0xf800 0 0 7>;
 			interrupt-map = <
 				/* IDSEL 0x0 */
-				0000 0 0 1 &mpic 4 1
-				0000 0 0 2 &mpic 5 1
-				0000 0 0 3 &mpic 6 1
+				0000 0 0 1 &mpic 4 2
+				0000 0 0 2 &mpic 5 2
+				0000 0 0 3 &mpic 6 2
 				0000 0 0 4 &mpic 7 1
 				>;
 			ranges = <0x2000000 0x0 0xa0000000
@@ -532,7 +532,7 @@
 				0000 0 0 1 &mpic 8 1
 				0000 0 0 2 &mpic 9 1
 				0000 0 0 3 &mpic 10 1
-				0000 0 0 4 &mpic 11 1
+				0000 0 0 4 &mpic 11 2
 				>;
 			ranges = <0x2000000 0x0 0x80000000
 				  0x2000000 0x0 0x80000000
-- 
1.6.0.6

^ permalink raw reply related

* [RFC PATCH v4 05/10] fadump: Convert firmware-assisted cpu state dump data into elf notes.
From: Mahesh J Salgaonkar @ 2011-11-07  9:55 UTC (permalink / raw)
  To: linuxppc-dev, Linux Kernel, Benjamin Herrenschmidt
  Cc: Amerigo Wang, Milton Miller, Eric W. Biederman, Anton Blanchard
In-Reply-To: <20111107095215.1997.14866.stgit@mars.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

When registered for firmware assisted dump on powerpc, firmware preserves
the registers for the active CPUs during a system crash. This patch reads
the cpu register data stored in Firmware-assisted dump format (except for
crashing cpu) and converts it into elf notes and updates the PT_NOTE program
header accordingly. The exact register state for crashing cpu is saved to
fadump crash info structure in scratch area during crash_fadump() and read
during second kernel boot.

Change in v4:
- Fixes an issue where memblock_free() is invoked from build_cpu_notes()
  function during error_out path. Invoke cpu_notes_buf_free() in error_out
  path instead of memblock_free().

Change in v2:
- Moved the crash_fadump() invocation from generic code to panic notifier.
- Introduced cpu_notes_buf_alloc() function to allocate cpu notes buffer
  using get_free_pages(). The reason is, with the use of subsys_initcall
  the setup_fadump() is now called after mem_init(). Hence use of
  get_free_pages() to allocate memory is more appropriate than using
  memblock_alloc().

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/fadump.h  |   43 +++++
 arch/powerpc/kernel/fadump.c       |  312 ++++++++++++++++++++++++++++++++++++
 arch/powerpc/kernel/setup-common.c |    8 +
 arch/powerpc/kernel/traps.c        |    5 +
 4 files changed, 366 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump.h b/arch/powerpc/include/asm/fadump.h
index ced923a..0c14097 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -59,6 +59,18 @@
 /* Dump status flag */
 #define FADUMP_ERROR_FLAG	0x2000
 
+#define FADUMP_CPU_ID_MASK	((1UL << 32) - 1)
+
+#define CPU_UNKNOWN		(~((u32)0))
+
+/* Utility macros */
+#define SKIP_TO_NEXT_CPU(reg_entry)			\
+({							\
+	while (reg_entry->reg_id != REG_ID("CPUEND"))	\
+		reg_entry++;				\
+	reg_entry++;					\
+})
+
 /* Kernel Dump section info */
 struct fadump_section {
 	u32	request_flag;
@@ -113,6 +125,9 @@ struct fw_dump {
 	unsigned long	reserve_bootvar;
 
 	unsigned long	fadumphdr_addr;
+	unsigned long	cpu_notes_buf;
+	unsigned long	cpu_notes_buf_size;
+
 	int		ibm_configure_kernel_dump;
 
 	unsigned long	fadump_enabled:1;
@@ -137,13 +152,40 @@ static inline u64 str_to_u64(const char *str)
 	return val;
 }
 #define STR_TO_HEX(x)	str_to_u64(x)
+#define REG_ID(x)	str_to_u64(x)
 
 #define FADUMP_CRASH_INFO_MAGIC		STR_TO_HEX("FADMPINF")
+#define REGSAVE_AREA_MAGIC		STR_TO_HEX("REGSAVE")
+
+/* The firmware-assisted dump format.
+ *
+ * The register save area is an area in the partition's memory used to preserve
+ * the register contents (CPU state data) for the active CPUs during a firmware
+ * assisted dump. The dump format contains register save area header followed
+ * by register entries. Each list of registers for a CPU starts with
+ * "CPUSTRT" and ends with "CPUEND".
+ */
+
+/* Register save area header. */
+struct fadump_reg_save_area_header {
+	u64		magic_number;
+	u32		version;
+	u32		num_cpu_offset;
+};
+
+/* Register entry. */
+struct fadump_reg_entry {
+	u64		reg_id;
+	u64		reg_value;
+};
 
 /* fadump crash info structure */
 struct fadump_crash_info_header {
 	u64		magic_number;
 	u64		elfcorehdr_addr;
+	u32		crashing_cpu;
+	struct pt_regs	regs;
+	struct cpumask	cpu_online_mask;
 };
 
 /* Crash memory ranges */
@@ -159,6 +201,7 @@ extern int early_init_dt_scan_fw_dump(unsigned long node,
 extern int fadump_reserve_mem(void);
 extern int setup_fadump(void);
 extern int is_fadump_active(void);
+extern void crash_fadump(struct pt_regs *, const char *);
 #else	/* CONFIG_FA_DUMP */
 static inline int is_fadump_active(void) { return 0; }
 #endif
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index bbbda82..70d6287 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -244,6 +244,7 @@ static unsigned long get_dump_area_size(void)
 	size += fw_dump.boot_memory_size;
 	size += sizeof(struct fadump_crash_info_header);
 	size += sizeof(struct elfhdr); /* ELF core header.*/
+	size += sizeof(struct elf_phdr); /* place holder for cpu notes */
 	/* Program headers for crash memory regions. */
 	size += sizeof(struct elf_phdr) * (memblock_num_regions(memory) + 2);
 
@@ -397,6 +398,283 @@ static void register_fw_dump(struct fadump_mem_struct *fdm)
 	}
 }
 
+void crash_fadump(struct pt_regs *regs, const char *str)
+{
+	struct fadump_crash_info_header *fdh = NULL;
+
+	if (!fw_dump.dump_registered || !fw_dump.fadumphdr_addr)
+		return;
+
+	fdh = __va(fw_dump.fadumphdr_addr);
+	crashing_cpu = smp_processor_id();
+	fdh->crashing_cpu = crashing_cpu;
+	crash_save_vmcoreinfo();
+
+	if (regs)
+		fdh->regs = *regs;
+	else
+		ppc_save_regs(&fdh->regs);
+
+	fdh->cpu_online_mask = *cpu_online_mask;
+
+	/* Call ibm,os-term rtas call to trigger firmware assisted dump */
+	rtas_os_term((char *)str);
+}
+
+#define GPR_MASK	0xffffff0000000000
+static inline int gpr_index(u64 id)
+{
+	int i = -1;
+	char str[3];
+
+	if ((id & GPR_MASK) == REG_ID("GPR")) {
+		/* get the digits at the end */
+		id &= ~GPR_MASK;
+		id >>= 24;
+		str[2] = '\0';
+		str[1] = id & 0xff;
+		str[0] = (id >> 8) & 0xff;
+		sscanf(str, "%d", &i);
+		if (i > 31)
+			i = -1;
+	}
+	return i;
+}
+
+static inline void set_regval(struct pt_regs *regs, u64 reg_id, u64 reg_val)
+{
+	int i;
+
+	i = gpr_index(reg_id);
+	if (i >= 0)
+		regs->gpr[i] = (unsigned long)reg_val;
+	else if (reg_id == REG_ID("NIA"))
+		regs->nip = (unsigned long)reg_val;
+	else if (reg_id == REG_ID("MSR"))
+		regs->msr = (unsigned long)reg_val;
+	else if (reg_id == REG_ID("CTR"))
+		regs->ctr = (unsigned long)reg_val;
+	else if (reg_id == REG_ID("LR"))
+		regs->link = (unsigned long)reg_val;
+	else if (reg_id == REG_ID("XER"))
+		regs->xer = (unsigned long)reg_val;
+	else if (reg_id == REG_ID("CR"))
+		regs->ccr = (unsigned long)reg_val;
+	else if (reg_id == REG_ID("DAR"))
+		regs->dar = (unsigned long)reg_val;
+	else if (reg_id == REG_ID("DSISR"))
+		regs->dsisr = (unsigned long)reg_val;
+}
+
+static struct fadump_reg_entry*
+read_registers(struct fadump_reg_entry *reg_entry, struct pt_regs *regs)
+{
+	memset(regs, 0, sizeof(struct pt_regs));
+
+	while (reg_entry->reg_id != REG_ID("CPUEND")) {
+		set_regval(regs, reg_entry->reg_id, reg_entry->reg_value);
+		reg_entry++;
+	}
+	reg_entry++;
+	return reg_entry;
+}
+
+static u32 *append_elf_note(u32 *buf, char *name, unsigned type, void *data,
+			    size_t data_len)
+{
+	struct elf_note note;
+
+	note.n_namesz = strlen(name) + 1;
+	note.n_descsz = data_len;
+	note.n_type   = type;
+	memcpy(buf, &note, sizeof(note));
+	buf += (sizeof(note) + 3)/4;
+	memcpy(buf, name, note.n_namesz);
+	buf += (note.n_namesz + 3)/4;
+	memcpy(buf, data, note.n_descsz);
+	buf += (note.n_descsz + 3)/4;
+
+	return buf;
+}
+
+static void final_note(u32 *buf)
+{
+	struct elf_note note;
+
+	note.n_namesz = 0;
+	note.n_descsz = 0;
+	note.n_type   = 0;
+	memcpy(buf, &note, sizeof(note));
+}
+
+static u32 *regs_to_elf_notes(u32 *buf, struct pt_regs *regs)
+{
+	struct elf_prstatus prstatus;
+
+	memset(&prstatus, 0, sizeof(prstatus));
+	/*
+	 * FIXME: How do i get PID? Do I really need it?
+	 * prstatus.pr_pid = ????
+	 */
+	elf_core_copy_kernel_regs(&prstatus.pr_reg, regs);
+	buf = append_elf_note(buf, KEXEC_CORE_NOTE_NAME, NT_PRSTATUS,
+				&prstatus, sizeof(prstatus));
+	return buf;
+}
+
+static void update_elfcore_header(char *bufp)
+{
+	struct elfhdr *elf;
+	struct elf_phdr *phdr;
+
+	elf = (struct elfhdr *)bufp;
+	bufp += sizeof(struct elfhdr);
+
+	/* First note is a place holder for cpu notes info. */
+	phdr = (struct elf_phdr *)bufp;
+
+	if (phdr->p_type == PT_NOTE) {
+		phdr->p_paddr = fw_dump.cpu_notes_buf;
+		phdr->p_offset	= phdr->p_paddr;
+		phdr->p_filesz	= fw_dump.cpu_notes_buf_size;
+		phdr->p_memsz = fw_dump.cpu_notes_buf_size;
+	}
+	return;
+}
+
+static void *cpu_notes_buf_alloc(unsigned long size)
+{
+	void *vaddr;
+	struct page *page;
+	unsigned long order, count, i;
+
+	order = get_order(size);
+	vaddr = (void *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, order);
+	if (!vaddr)
+		return NULL;
+
+	count = 1 << order;
+	page = virt_to_page(vaddr);
+	for (i = 0; i < count; i++)
+		SetPageReserved(page + i);
+	return vaddr;
+}
+
+static void cpu_notes_buf_free(unsigned long vaddr, unsigned long size)
+{
+	struct page *page;
+	unsigned long order, count, i;
+
+	order = get_order(size);
+	count = 1 << order;
+	page = virt_to_page(vaddr);
+	for (i = 0; i < count; i++)
+		ClearPageReserved(page + i);
+	__free_pages(page, order);
+}
+
+/*
+ * Read CPU state dump data and convert it into ELF notes.
+ * The CPU dump starts with magic number "REGSAVE". NumCpusOffset should be
+ * used to access the data to allow for additional fields to be added without
+ * affecting compatibility. Each list of registers for a CPU starts with
+ * "CPUSTRT" and ends with "CPUEND". Each register entry is of 16 bytes,
+ * 8 Byte ASCII identifier and 8 Byte register value. The register entry
+ * with identifier "CPUSTRT" and "CPUEND" contains 4 byte cpu id as part
+ * of register value. For more details refer to PAPR document.
+ *
+ * Only for the crashing cpu we ignore the CPU dump data and get exact
+ * state from fadump crash info structure populated by first kernel at the
+ * time of crash.
+ */
+static int __init build_cpu_notes(const struct fadump_mem_struct *fdm)
+{
+	struct fadump_reg_save_area_header *reg_header;
+	struct fadump_reg_entry *reg_entry;
+	struct fadump_crash_info_header *fdh = NULL;
+	void *vaddr;
+	unsigned long addr;
+	u32 num_cpus, *note_buf;
+	struct pt_regs regs;
+	int i, rc = 0, cpu = 0;
+
+	if (!fdm->cpu_state_data.bytes_dumped)
+		return -EINVAL;
+
+	addr = fdm->cpu_state_data.destination_address;
+	vaddr = __va(addr);
+
+	reg_header = vaddr;
+	if (reg_header->magic_number != REGSAVE_AREA_MAGIC) {
+		printk(KERN_ERR "Unable to read register save area.\n");
+		return -ENOENT;
+	}
+	pr_debug("--------CPU State Data------------\n");
+	pr_debug("Magic Number: %llx\n", reg_header->magic_number);
+	pr_debug("NumCpuOffset: %x\n", reg_header->num_cpu_offset);
+
+	vaddr += reg_header->num_cpu_offset;
+	num_cpus = *((u32 *)(vaddr));
+	pr_debug("NumCpus     : %u\n", num_cpus);
+	vaddr += sizeof(u32);
+	reg_entry = (struct fadump_reg_entry *)vaddr;
+
+	/* Allocate buffer to hold cpu crash notes. */
+	fw_dump.cpu_notes_buf_size = num_cpus * sizeof(note_buf_t);
+	fw_dump.cpu_notes_buf_size = PAGE_ALIGN(fw_dump.cpu_notes_buf_size);
+	note_buf = cpu_notes_buf_alloc(fw_dump.cpu_notes_buf_size);
+	if (!note_buf) {
+		printk(KERN_ERR "Failed to allocate 0x%lx bytes for "
+			"cpu notes buffer\n", fw_dump.cpu_notes_buf_size);
+		return -ENOMEM;
+	}
+	fw_dump.cpu_notes_buf = __pa(note_buf);
+
+	pr_debug("Allocated buffer for cpu notes of size %ld at %p\n",
+			(num_cpus * sizeof(note_buf_t)), note_buf);
+
+	if (fw_dump.fadumphdr_addr)
+		fdh = __va(fw_dump.fadumphdr_addr);
+
+	for (i = 0; i < num_cpus; i++) {
+		if (reg_entry->reg_id != REG_ID("CPUSTRT")) {
+			printk(KERN_ERR "Unable to read CPU state data\n");
+			rc = -ENOENT;
+			goto error_out;
+		}
+		/* Lower 4 bytes of reg_value contains logical cpu id */
+		cpu = reg_entry->reg_value & FADUMP_CPU_ID_MASK;
+		if (!cpumask_test_cpu(cpu, &fdh->cpu_online_mask)) {
+			SKIP_TO_NEXT_CPU(reg_entry);
+			continue;
+		}
+		pr_debug("Reading register data for cpu %d...\n", cpu);
+		if (fdh && fdh->crashing_cpu == cpu) {
+			regs = fdh->regs;
+			note_buf = regs_to_elf_notes(note_buf, &regs);
+			SKIP_TO_NEXT_CPU(reg_entry);
+		} else {
+			reg_entry++;
+			reg_entry = read_registers(reg_entry, &regs);
+			note_buf = regs_to_elf_notes(note_buf, &regs);
+		}
+	}
+	final_note(note_buf);
+
+	pr_debug("Updating elfcore header (%llx) with cpu notes\n",
+							fdh->elfcorehdr_addr);
+	update_elfcore_header((char *)__va(fdh->elfcorehdr_addr));
+	return 0;
+
+error_out:
+	cpu_notes_buf_free((unsigned long)__va(fw_dump.cpu_notes_buf),
+					fw_dump.cpu_notes_buf_size);
+	fw_dump.cpu_notes_buf = 0;
+	fw_dump.cpu_notes_buf_size = 0;
+	return rc;
+
+}
+
 /*
  * Validate and process the dump data stored by firmware before exporting
  * it through '/proc/vmcore'.
@@ -404,18 +682,21 @@ static void register_fw_dump(struct fadump_mem_struct *fdm)
 static int __init process_fadump(const struct fadump_mem_struct *fdm_active)
 {
 	struct fadump_crash_info_header *fdh;
+	int rc = 0;
 
 	if (!fdm_active || !fw_dump.fadumphdr_addr)
 		return -EINVAL;
 
 	/* Check if the dump data is valid. */
 	if ((fdm_active->header.dump_status_flag == FADUMP_ERROR_FLAG) ||
+			(fdm_active->cpu_state_data.error_flags != 0) ||
 			(fdm_active->rmr_region.error_flags != 0)) {
 		printk(KERN_ERR "Dump taken by platform is not valid\n");
 		return -EINVAL;
 	}
-	if (fdm_active->rmr_region.bytes_dumped !=
-			fdm_active->rmr_region.source_len) {
+	if ((fdm_active->rmr_region.bytes_dumped !=
+			fdm_active->rmr_region.source_len) ||
+			!fdm_active->cpu_state_data.bytes_dumped) {
 		printk(KERN_ERR "Dump taken by platform is incomplete\n");
 		return -EINVAL;
 	}
@@ -427,6 +708,10 @@ static int __init process_fadump(const struct fadump_mem_struct *fdm_active)
 		return -EINVAL;
 	}
 
+	rc = build_cpu_notes(fdm_active);
+	if (rc)
+		return rc;
+
 	/*
 	 * We are done validating dump info and elfcore header is now ready
 	 * to be exported. set elfcorehdr_addr so that vmcore module will
@@ -541,6 +826,27 @@ static int create_elfcore_headers(char *bufp)
 	elf = (struct elfhdr *)bufp;
 	bufp += sizeof(struct elfhdr);
 
+	/*
+	 * setup ELF PT_NOTE, place holder for cpu notes info. The notes info
+	 * will be populated during second kernel boot after crash. Hence
+	 * this PT_NOTE will always be the first elf note.
+	 *
+	 * NOTE: Any new ELF note addition should be placed after this note.
+	 */
+	phdr = (struct elf_phdr *)bufp;
+	bufp += sizeof(struct elf_phdr);
+	phdr->p_type = PT_NOTE;
+	phdr->p_flags = 0;
+	phdr->p_vaddr = 0;
+	phdr->p_align = 0;
+
+	phdr->p_offset = 0;
+	phdr->p_paddr = 0;
+	phdr->p_filesz = 0;
+	phdr->p_memsz = 0;
+
+	(elf->e_phnum)++;
+
 	/* setup PT_LOAD sections. */
 
 	for (i = 0; i < crash_mem_ranges; i++) {
@@ -592,6 +898,8 @@ static unsigned long init_fadump_header(unsigned long addr)
 	memset(fdh, 0, sizeof(struct fadump_crash_info_header));
 	fdh->magic_number = FADUMP_CRASH_INFO_MAGIC;
 	fdh->elfcorehdr_addr = addr;
+	/* We will set the crashing cpu id in crash_fadump() during crash. */
+	fdh->crashing_cpu = CPU_UNKNOWN;
 
 	return addr;
 }
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index b1d738d..ce35aaf 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -61,6 +61,7 @@
 #include <asm/xmon.h>
 #include <asm/cputhreads.h>
 #include <mm/mmu_decl.h>
+#include <asm/fadump.h>
 
 #include "setup.h"
 
@@ -639,6 +640,13 @@ EXPORT_SYMBOL(check_legacy_ioport);
 static int ppc_panic_event(struct notifier_block *this,
                              unsigned long event, void *ptr)
 {
+#ifdef CONFIG_FA_DUMP
+	/*
+	 * If firmware-assisted dump has been registered then trigger
+	 * firmware-assisted dump and let firmware handle everything else.
+	 */
+	crash_fadump(NULL, ptr);
+#endif
 	ppc_md.panic(ptr);  /* May not return */
 	return NOTIFY_DONE;
 }
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index f19d977..1508532 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -57,6 +57,7 @@
 #include <asm/kexec.h>
 #include <asm/ppc-opcode.h>
 #include <asm/rio.h>
+#include <asm/fadump.h>
 
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
 int (*__debugger)(struct pt_regs *regs) __read_mostly;
@@ -160,6 +161,10 @@ int die(const char *str, struct pt_regs *regs, long err)
 	add_taint(TAINT_DIE);
 	raw_spin_unlock_irqrestore(&die.lock, flags);
 
+#ifdef CONFIG_FA_DUMP
+	crash_fadump(regs, str);
+#endif
+
 	if (kexec_should_crash(current) ||
 		kexec_sr_activated(smp_processor_id()))
 		crash_kexec(regs);

* [RFC PATCH v4 04/10] fadump: Initialize elfcore header and add PT_LOAD program headers.
From: Mahesh J Salgaonkar @ 2011-11-07  9:55 UTC (permalink / raw)
  To: linuxppc-dev, Linux Kernel, Benjamin Herrenschmidt
  Cc: Amerigo Wang, Milton Miller, Eric W. Biederman, Anton Blanchard
In-Reply-To: <20111107095215.1997.14866.stgit@mars.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

Build the crash memory range list by traversing system memory in the first
kernel, before we register for firmware-assisted dump. After the successful
dump registration, initialize the elfcore header and populate the PT_LOAD
program headers with the crash memory ranges. The elfcore header is saved in
the scratch area within the reserved memory. The scratch area starts at the
end of the memory reserved for saving the RMR region contents. It contains
the fadump crash info structure, which holds a magic number for fadump
validation and the physical address where the elfcore header can be found.
This structure will also be used to pass important crash info data to the
second kernel, which helps the second kernel populate the ELF core header
with correct data before it is exported through /proc/vmcore. Since the
firmware preserves the entire partition memory at the time of the crash, the
contents of the scratch area are preserved until the second kernel boots.
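
As an illustration of how a memory region is turned into crash memory ranges
while leaving out the reserved dump area, here is a minimal userspace sketch.
The names crash_range, add_range() and exclude_reserved() are hypothetical
stand-ins rather than the functions in this patch, and the caller is assumed
to provide an output array with room for two more entries.

```c
#include <stddef.h>

struct crash_range {
	unsigned long long base;
	unsigned long long size;
};

/* Append [base, end) to out[] unless it is empty; returns the new count. */
static size_t add_range(struct crash_range *out, size_t n,
			unsigned long long base, unsigned long long end)
{
	if (base >= end)
		return n;
	out[n].base = base;
	out[n].size = end - base;
	return n + 1;
}

/*
 * Add [start, end) minus the reserved area [ra_start, ra_end).
 * Depending on the overlap this appends zero, one or two ranges.
 */
static size_t exclude_reserved(struct crash_range *out, size_t n,
			       unsigned long long start,
			       unsigned long long end,
			       unsigned long long ra_start,
			       unsigned long long ra_end)
{
	if (ra_start >= end || ra_end <= start)
		return add_range(out, n, start, end);	/* no overlap */
	if (start < ra_start)
		n = add_range(out, n, start, ra_start);	/* part below */
	if (ra_end < end)
		n = add_range(out, n, ra_end, end);	/* part above */
	return n;
}
```

Each range produced this way would then become one PT_LOAD program header,
with p_paddr set to the base and p_filesz/p_memsz set to the size.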

NOTE: The current design does not allow additional fields to be added to
this structure in the future without affecting compatibility. Coming up
with a better approach to address this is on the TODO list.

Reserved dump area start => +-------------------------------------+
                            |  CPU state dump data                |
                            +-------------------------------------+
                            |  HPTE region data                   |
                            +-------------------------------------+
                            |  RMR region data                    |
Scratch area start       => +-------------------------------------+
                            |  fadump crash info structure {      |
                            |     magic number                    |
                     +------|---- elfcorehdr_addr                 |
                     |      |  }                                  |
                     +----> +-------------------------------------+
                            |  ELF core header                    |
Reserved dump area end   => +-------------------------------------+
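
The address arithmetic implied by the layout above can be sketched as
follows. This is a userspace illustration only: crash_info_header mirrors the
two-field structure introduced by this patch, while scratch_layout and
compute_scratch_layout() are hypothetical names used for the example.

```c
#include <stdint.h>

/* Two-field crash info structure as introduced by this patch. */
struct crash_info_header {
	uint64_t magic_number;
	uint64_t elfcorehdr_addr;
};

/* ASCII "FADMPINF" packed into a u64, first character in the top byte. */
#define CRASH_INFO_MAGIC 0x4641444d50494e46ULL

/* Hypothetical helper type: where the scratch area pieces end up. */
struct scratch_layout {
	uint64_t fadumphdr_addr;	/* start of the scratch area */
	uint64_t elfcorehdr_addr;	/* ELF core header follows the struct */
};

/*
 * The scratch area starts right after the space reserved for the RMR
 * region contents; the ELF core header follows the crash info structure.
 */
static struct scratch_layout compute_scratch_layout(uint64_t rmr_dest,
						    uint64_t rmr_len)
{
	struct scratch_layout l;

	l.fadumphdr_addr = rmr_dest + rmr_len;
	l.elfcorehdr_addr = l.fadumphdr_addr +
			    sizeof(struct crash_info_header);
	return l;
}

/* The second kernel trusts the structure only if the magic matches. */
static int crash_info_valid(const struct crash_info_header *h)
{
	return h->magic_number == CRASH_INFO_MAGIC;
}
```

Because the first kernel writes elfcorehdr_addr into the structure, the
second kernel only needs the scratch area start and a magic number check to
locate the ELF core header.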

Change in v4:
- Move the init_elfcore_header() function and 'memblock_num_regions' macro
  from generic code to power-specific code, as these are used only by the
  firmware-assisted dump implementation, which is a power-specific feature.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/fadump.h |   43 +++++++
 arch/powerpc/kernel/fadump.c      |  235 +++++++++++++++++++++++++++++++++++++
 2 files changed, 276 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump.h b/arch/powerpc/include/asm/fadump.h
index 3b2f8cc..ced923a 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -37,6 +37,12 @@
  */
 #define MIN_BOOT_MEM	((RMR_END < (0x1UL << 28)) ? (0x1UL << 28) : RMR_END)
 
+#define memblock_num_regions(memblock_type)	(memblock.memblock_type.cnt)
+
+#ifndef ELF_CORE_EFLAGS
+#define ELF_CORE_EFLAGS 0
+#endif
+
 /* Firmware provided dump sections */
 #define FADUMP_CPU_STATE_DATA	0x0001
 #define FADUMP_HPTE_REGION	0x0002
@@ -50,6 +56,9 @@
 #define FADUMP_UNREGISTER	2
 #define FADUMP_INVALIDATE	3
 
+/* Dump status flag */
+#define FADUMP_ERROR_FLAG	0x2000
+
 /* Kernel Dump section info */
 struct fadump_section {
 	u32	request_flag;
@@ -103,6 +112,7 @@ struct fw_dump {
 	/* cmd line option during boot */
 	unsigned long	reserve_bootvar;
 
+	unsigned long	fadumphdr_addr;
 	int		ibm_configure_kernel_dump;
 
 	unsigned long	fadump_enabled:1;
@@ -111,6 +121,39 @@ struct fw_dump {
 	unsigned long	dump_registered:1;
 };
 
+/*
+ * Copy the ascii values for first 8 characters from a string into u64
+ * variable at their respective indexes.
+ * e.g.
+ *  The string "FADMPINF" will be converted into 0x4641444d50494e46
+ */
+static inline u64 str_to_u64(const char *str)
+{
+	u64 val = 0;
+	int i;
+
+	for (i = 0; i < sizeof(val); i++)
+		val = (*str) ? (val << 8) | *str++ : val << 8;
+	return val;
+}
+#define STR_TO_HEX(x)	str_to_u64(x)
+
+#define FADUMP_CRASH_INFO_MAGIC		STR_TO_HEX("FADMPINF")
+
+/* fadump crash info structure */
+struct fadump_crash_info_header {
+	u64		magic_number;
+	u64		elfcorehdr_addr;
+};
+
+/* Crash memory ranges */
+#define INIT_CRASHMEM_RANGES	(INIT_MEMBLOCK_REGIONS + 2)
+
+struct fad_crash_memory_ranges {
+	unsigned long long	base;
+	unsigned long long	size;
+};
+
 extern int early_init_dt_scan_fw_dump(unsigned long node,
 		const char *uname, int depth, void *data);
 extern int fadump_reserve_mem(void);
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index ed38f86..bbbda82 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -32,6 +32,7 @@
 #include <linux/delay.h>
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
+#include <linux/crash_dump.h>
 
 #include <asm/page.h>
 #include <asm/prom.h>
@@ -53,6 +54,8 @@ static struct fadump_mem_struct fdm;
 static const struct fadump_mem_struct *fdm_active;
 
 static DEFINE_MUTEX(fadump_mutex);
+struct fad_crash_memory_ranges crash_memory_ranges[INIT_CRASHMEM_RANGES];
+int crash_mem_ranges;
 
 /* Scan the Firmware Assisted dump configuration details. */
 int __init early_init_dt_scan_fw_dump(unsigned long node,
@@ -239,6 +242,10 @@ static unsigned long get_dump_area_size(void)
 	size += fw_dump.cpu_state_data_size;
 	size += fw_dump.hpte_region_size;
 	size += fw_dump.boot_memory_size;
+	size += sizeof(struct fadump_crash_info_header);
+	size += sizeof(struct elfhdr); /* ELF core header.*/
+	/* Program headers for crash memory regions. */
+	size += sizeof(struct elf_phdr) * (memblock_num_regions(memory) + 2);
 
 	size = PAGE_ALIGN(size);
 	return size;
@@ -304,6 +311,12 @@ int __init fadump_reserve_mem(void)
 				"for saving crash dump\n",
 				(unsigned long)(size >> 20),
 				(unsigned long)(base >> 20));
+
+		fw_dump.fadumphdr_addr =
+				fdm_active->rmr_region.destination_address +
+				fdm_active->rmr_region.source_len;
+		pr_debug("fadumphdr_addr = %p\n",
+				(void *) fw_dump.fadumphdr_addr);
 	} else {
 		/* Reserve the memory at the top of memory. */
 		size = get_dump_area_size();
@@ -384,8 +397,210 @@ static void register_fw_dump(struct fadump_mem_struct *fdm)
 	}
 }
 
+/*
+ * Validate and process the dump data stored by firmware before exporting
+ * it through '/proc/vmcore'.
+ */
+static int __init process_fadump(const struct fadump_mem_struct *fdm_active)
+{
+	struct fadump_crash_info_header *fdh;
+
+	if (!fdm_active || !fw_dump.fadumphdr_addr)
+		return -EINVAL;
+
+	/* Check if the dump data is valid. */
+	if ((fdm_active->header.dump_status_flag == FADUMP_ERROR_FLAG) ||
+			(fdm_active->rmr_region.error_flags != 0)) {
+		printk(KERN_ERR "Dump taken by platform is not valid\n");
+		return -EINVAL;
+	}
+	if (fdm_active->rmr_region.bytes_dumped !=
+			fdm_active->rmr_region.source_len) {
+		printk(KERN_ERR "Dump taken by platform is incomplete\n");
+		return -EINVAL;
+	}
+
+	/* Validate the fadump crash info header */
+	fdh = __va(fw_dump.fadumphdr_addr);
+	if (fdh->magic_number != FADUMP_CRASH_INFO_MAGIC) {
+		printk(KERN_ERR "Crash info header is not valid.\n");
+		return -EINVAL;
+	}
+
+	/*
+	 * We are done validating dump info and elfcore header is now ready
+	 * to be exported. set elfcorehdr_addr so that vmcore module will
+	 * export the elfcore header through '/proc/vmcore'.
+	 */
+	elfcorehdr_addr = fdh->elfcorehdr_addr;
+
+	return 0;
+}
+
+static inline void add_crash_memory(unsigned long long base,
+					unsigned long long end)
+{
+	if (base == end)
+		return;
+
+	pr_debug("crash_memory_range[%d] [%#016llx-%#016llx], %#llx bytes\n",
+		crash_mem_ranges, base, end - 1, (end - base));
+	crash_memory_ranges[crash_mem_ranges].base = base;
+	crash_memory_ranges[crash_mem_ranges].size = end - base;
+	crash_mem_ranges++;
+}
+
+static void exclude_reserved_area(unsigned long long start,
+					unsigned long long end)
+{
+	unsigned long long ra_start, ra_end;
+
+	ra_start = fw_dump.reserve_dump_area_start;
+	ra_end = ra_start + fw_dump.reserve_dump_area_size;
+
+	if ((ra_start < end) && (ra_end > start)) {
+		if ((start < ra_start) && (end > ra_end)) {
+			add_crash_memory(start, ra_start);
+			add_crash_memory(ra_end, end);
+		} else if (start < ra_start) {
+			add_crash_memory(start, ra_start);
+		} else if (ra_end < end) {
+			add_crash_memory(ra_end, end);
+		}
+	} else
+		add_crash_memory(start, end);
+}
+
+static int init_elfcore_header(char *bufp)
+{
+	struct elfhdr *elf;
+
+	elf = (struct elfhdr *) bufp;
+	bufp += sizeof(struct elfhdr);
+	memcpy(elf->e_ident, ELFMAG, SELFMAG);
+	elf->e_ident[EI_CLASS] = ELF_CLASS;
+	elf->e_ident[EI_DATA] = ELF_DATA;
+	elf->e_ident[EI_VERSION] = EV_CURRENT;
+	elf->e_ident[EI_OSABI] = ELF_OSABI;
+	memset(elf->e_ident+EI_PAD, 0, EI_NIDENT-EI_PAD);
+	elf->e_type = ET_CORE;
+	elf->e_machine = ELF_ARCH;
+	elf->e_version = EV_CURRENT;
+	elf->e_entry = 0;
+	elf->e_phoff = sizeof(struct elfhdr);
+	elf->e_shoff = 0;
+	elf->e_flags = ELF_CORE_EFLAGS;
+	elf->e_ehsize = sizeof(struct elfhdr);
+	elf->e_phentsize = sizeof(struct elf_phdr);
+	elf->e_phnum = 0;
+	elf->e_shentsize = 0;
+	elf->e_shnum = 0;
+	elf->e_shstrndx = 0;
+
+	return 0;
+}
+
+/*
+ * Traverse through memblock structure and setup crash memory ranges. These
+ * ranges will be used to create PT_LOAD program headers in elfcore header.
+ */
+static void setup_crash_memory_ranges(void)
+{
+	struct memblock_region *reg;
+	unsigned long long start, end;
+
+	pr_debug("Setup crash memory ranges.\n");
+	crash_mem_ranges = 0;
+	/*
+	 * add the first memory chunk (RMR_START through boot_memory_size) as
+	 * a separate memory chunk. The reason is, at the time crash firmware
+	 * will move the content of this memory chunk to different location
+	 * specified during fadump registration. We need to create a separate
+	 * program header for this chunk with the correct offset.
+	 */
+	add_crash_memory(RMR_START, fw_dump.boot_memory_size);
+
+	for_each_memblock(memory, reg) {
+		start = (unsigned long long)reg->base;
+		end = start + (unsigned long long)reg->size;
+		if (start == RMR_START && end >= fw_dump.boot_memory_size)
+			start = fw_dump.boot_memory_size;
+
+		/* add this range excluding the reserved dump area. */
+		exclude_reserved_area(start, end);
+	}
+}
+
+static int create_elfcore_headers(char *bufp)
+{
+	struct elfhdr *elf;
+	struct elf_phdr *phdr;
+	int i;
+
+	init_elfcore_header(bufp);
+	elf = (struct elfhdr *)bufp;
+	bufp += sizeof(struct elfhdr);
+
+	/* setup PT_LOAD sections. */
+
+	for (i = 0; i < crash_mem_ranges; i++) {
+		unsigned long long mbase, msize;
+		mbase = crash_memory_ranges[i].base;
+		msize = crash_memory_ranges[i].size;
+
+		if (!msize)
+			continue;
+
+		phdr = (struct elf_phdr *)bufp;
+		bufp += sizeof(struct elf_phdr);
+		phdr->p_type	= PT_LOAD;
+		phdr->p_flags	= PF_R|PF_W|PF_X;
+		phdr->p_offset	= mbase;
+
+		if (mbase == RMR_START) {
+			/*
+			 * The entire RMR region will be moved by firmware
+			 * to the specified destination_address. Hence set
+			 * the correct offset.
+			 */
+			phdr->p_offset = fdm.rmr_region.destination_address;
+		}
+
+		phdr->p_paddr = mbase;
+		phdr->p_vaddr = (unsigned long)__va(mbase);
+		phdr->p_filesz = msize;
+		phdr->p_memsz = msize;
+		phdr->p_align = 0;
+
+		/* Increment number of program headers. */
+		(elf->e_phnum)++;
+	}
+	return 0;
+}
+
+static unsigned long init_fadump_header(unsigned long addr)
+{
+	struct fadump_crash_info_header *fdh;
+
+	if (!addr)
+		return 0;
+
+	fw_dump.fadumphdr_addr = addr;
+	fdh = __va(addr);
+	addr += sizeof(struct fadump_crash_info_header);
+
+	memset(fdh, 0, sizeof(struct fadump_crash_info_header));
+	fdh->magic_number = FADUMP_CRASH_INFO_MAGIC;
+	fdh->elfcorehdr_addr = addr;
+
+	return addr;
+}
+
 static void register_fadump(void)
 {
+	unsigned long addr;
+	void *vaddr;
+
 	/*
 	 * If no memory is reserved then we can not register for firmware-
 	 * assisted dump.
@@ -393,6 +608,16 @@ static void register_fadump(void)
 	if (!fw_dump.reserve_dump_area_size)
 		return;
 
+	setup_crash_memory_ranges();
+
+	addr = fdm.rmr_region.destination_address + fdm.rmr_region.source_len;
+	/* Initialize fadump crash info header. */
+	addr = init_fadump_header(addr);
+	vaddr = __va(addr);
+
+	pr_debug("Creating ELF core headers at %#016lx\n", addr);
+	create_elfcore_headers(vaddr);
+
 	/* register the future kernel dump with firmware. */
 	register_fw_dump(&fdm);
 }
@@ -586,11 +811,17 @@ int __init setup_fadump(void)
 	}
 
 	fadump_show_config();
+	/*
+	 * If dump data is available then see if it is valid and prepare for
+	 * saving it to the disk.
+	 */
+	if (fw_dump.dump_active)
+		process_fadump(fdm_active);
 	/* Initialize the kernel dump memory structure for FAD registration. */
-	if (fw_dump.reserve_dump_area_size)
+	else if (fw_dump.reserve_dump_area_size)
 		init_fadump_mem_struct(&fdm, fw_dump.reserve_dump_area_start);
-	fadump_init_files();
 
+	fadump_init_files();
 	return 1;
 }
 subsys_initcall(setup_fadump);

* [RFC PATCH v4 03/10] fadump: Register for firmware assisted dump.
From: Mahesh J Salgaonkar @ 2011-11-07  9:55 UTC (permalink / raw)
  To: linuxppc-dev, Linux Kernel, Benjamin Herrenschmidt
  Cc: Amerigo Wang, Milton Miller, Eric W. Biederman, Anton Blanchard
In-Reply-To: <20111107095215.1997.14866.stgit@mars.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

This patch registers for firmware-assisted dump using the rtas token
ibm,configure-kernel-dump. During registration, the firmware is informed
about the reserved area where it saves the CPU state data, the HPTE table
and the contents of the RMR region at the time of a kernel crash. Apart from
this, the firmware also preserves the contents of the entire partition
memory even if it is not specified during registration.

This patch also populates sysfs files under /sys/kernel to display
fadump status and reserved memory regions.

Change in v3:
- Re-factored the implementation to work with kdump service start/stop.
  Introduce fadump_registered sysfs control file which will be used by
  kdump init scripts to start/stop firmware assisted dump. echo 1 to
  /sys/kernel/fadump_registered file for fadump registration and
  echo 0 to /sys/kernel/fadump_registered file for fadump un-registration.
- Introduced the locking mechanism to handle simultaneous writes to
  /sys/kernel/fadump_registered file.
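
For tooling that prefers C over shell, the echo 1 / echo 0 interface described
above could be driven with a small helper along these lines. This is a sketch
only: fadump_set_registered() is a hypothetical name, and the path is passed
in by the caller rather than hard-coded.

```c
#include <stdio.h>

/*
 * Write "1" (register) or "0" (un-register) to a control file such as
 * /sys/kernel/fadump_registered. Returns 0 on success, -1 on failure.
 */
static int fadump_set_registered(const char *path, int enable)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs(enable ? "1\n" : "0\n", f) >= 0;
	/* fclose() flushes the write; a failure here means it did not land. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A kdump init script equivalent would call
fadump_set_registered("/sys/kernel/fadump_registered", 1) on start and pass 0
on stop; writing to the file typically requires root.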

Change in v2:
- Removed a few debug print statements.
- Moved the setup_fadump() call out of setup_system(); it is now invoked
  via subsys_initcall.
- Moved fadump_region attribute under debugfs.
- Clear the TCE entries if firmware assisted dump is active.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/fadump.h |   57 ++++++
 arch/powerpc/kernel/fadump.c      |  352 +++++++++++++++++++++++++++++++++++++
 arch/powerpc/kernel/iommu.c       |    8 +
 arch/powerpc/mm/hash_utils_64.c   |   11 +
 4 files changed, 424 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump.h b/arch/powerpc/include/asm/fadump.h
index 0b040c1..3b2f8cc 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -42,6 +42,58 @@
 #define FADUMP_HPTE_REGION	0x0002
 #define FADUMP_REAL_MODE_REGION	0x0011
 
+/* Dump request flag */
+#define FADUMP_REQUEST_FLAG	0x00000001
+
+/* FAD commands */
+#define FADUMP_REGISTER		1
+#define FADUMP_UNREGISTER	2
+#define FADUMP_INVALIDATE	3
+
+/* Kernel Dump section info */
+struct fadump_section {
+	u32	request_flag;
+	u16	source_data_type;
+	u16	error_flags;
+	u64	source_address;
+	u64	source_len;
+	u64	bytes_dumped;
+	u64	destination_address;
+};
+
+/* ibm,configure-kernel-dump header. */
+struct fadump_section_header {
+	u32	dump_format_version;
+	u16	dump_num_sections;
+	u16	dump_status_flag;
+	u32	offset_first_dump_section;
+
+	/* Fields for disk dump option. */
+	u32	dd_block_size;
+	u64	dd_block_offset;
+	u64	dd_num_blocks;
+	u32	dd_offset_disk_path;
+
+	/* Maximum time allowed to prevent an automatic dump-reboot. */
+	u32	max_time_auto;
+};
+
+/*
+ * Firmware Assisted dump memory structure. This structure is required for
+ * registering future kernel dump with power firmware through rtas call.
+ *
+ * No disk dump option. Hence disk dump path string section is not included.
+ */
+struct fadump_mem_struct {
+	struct fadump_section_header	header;
+
+	/* Kernel dump sections */
+	struct fadump_section		cpu_state_data;
+	struct fadump_section		hpte_region;
+	struct fadump_section		rmr_region;
+};
+
+/* Firmware-assisted dump configuration details. */
 struct fw_dump {
 	unsigned long	cpu_state_data_size;
 	unsigned long	hpte_region_size;
@@ -56,10 +108,15 @@ struct fw_dump {
 	unsigned long	fadump_enabled:1;
 	unsigned long	fadump_supported:1;
 	unsigned long	dump_active:1;
+	unsigned long	dump_registered:1;
 };
 
 extern int early_init_dt_scan_fw_dump(unsigned long node,
 		const char *uname, int depth, void *data);
 extern int fadump_reserve_mem(void);
+extern int setup_fadump(void);
+extern int is_fadump_active(void);
+#else	/* CONFIG_FA_DUMP */
+static inline int is_fadump_active(void) { return 0; }
 #endif
 #endif
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 05dffc0..ed38f86 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -29,6 +29,9 @@
 
 #include <linux/string.h>
 #include <linux/memblock.h>
+#include <linux/delay.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
 
 #include <asm/page.h>
 #include <asm/prom.h>
@@ -46,6 +49,10 @@ struct dump_section {
 } __packed;
 
 static struct fw_dump fw_dump;
+static struct fadump_mem_struct fdm;
+static const struct fadump_mem_struct *fdm_active;
+
+static DEFINE_MUTEX(fadump_mutex);
 
 /* Scan the Firmware Assisted dump configuration details. */
 int __init early_init_dt_scan_fw_dump(unsigned long node,
@@ -74,7 +81,8 @@ int __init early_init_dt_scan_fw_dump(unsigned long node,
 	 * The 'ibm,kernel-dump' rtas node is present only if there is
 	 * dump data waiting for us.
 	 */
-	if (of_get_flat_dt_prop(node, "ibm,kernel-dump", NULL))
+	fdm_active = of_get_flat_dt_prop(node, "ibm,kernel-dump", NULL);
+	if (fdm_active)
 		fw_dump.dump_active = 1;
 
 	/* Get the sizes required to store dump data for the firmware provided
@@ -101,6 +109,85 @@ int __init early_init_dt_scan_fw_dump(unsigned long node,
 	return 1;
 }
 
+int is_fadump_active(void)
+{
+	return fw_dump.dump_active;
+}
+
+/* Print firmware assisted dump configurations for debugging purpose. */
+static void fadump_show_config(void)
+{
+	pr_debug("Support for firmware-assisted dump (fadump): %s\n",
+			(fw_dump.fadump_supported ? "present" : "no support"));
+
+	if (!fw_dump.fadump_supported)
+		return;
+
+	pr_debug("Fadump enabled    : %s\n",
+				(fw_dump.fadump_enabled ? "yes" : "no"));
+	pr_debug("Dump Active       : %s\n",
+				(fw_dump.dump_active ? "yes" : "no"));
+	pr_debug("Dump section sizes:\n");
+	pr_debug("    CPU state data size: %lx\n", fw_dump.cpu_state_data_size);
+	pr_debug("    HPTE region size   : %lx\n", fw_dump.hpte_region_size);
+	pr_debug("Boot memory size  : %lx\n", fw_dump.boot_memory_size);
+}
+
+static unsigned long init_fadump_mem_struct(struct fadump_mem_struct *fdm,
+				unsigned long addr)
+{
+	if (!fdm)
+		return 0;
+
+	memset(fdm, 0, sizeof(struct fadump_mem_struct));
+	addr = addr & PAGE_MASK;
+
+	fdm->header.dump_format_version = 0x00000001;
+	fdm->header.dump_num_sections = 3;
+	fdm->header.dump_status_flag = 0;
+	fdm->header.offset_first_dump_section =
+		(u32)offsetof(struct fadump_mem_struct, cpu_state_data);
+
+	/*
+	 * Fields for disk dump option.
+	 * We are not using disk dump option, hence set these fields to 0.
+	 */
+	fdm->header.dd_block_size = 0;
+	fdm->header.dd_block_offset = 0;
+	fdm->header.dd_num_blocks = 0;
+	fdm->header.dd_offset_disk_path = 0;
+
+	/* set 0 to disable an automatic dump-reboot. */
+	fdm->header.max_time_auto = 0;
+
+	/* Kernel dump sections */
+	/* cpu state data section. */
+	fdm->cpu_state_data.request_flag = FADUMP_REQUEST_FLAG;
+	fdm->cpu_state_data.source_data_type = FADUMP_CPU_STATE_DATA;
+	fdm->cpu_state_data.source_address = 0;
+	fdm->cpu_state_data.source_len = fw_dump.cpu_state_data_size;
+	fdm->cpu_state_data.destination_address = addr;
+	addr += fw_dump.cpu_state_data_size;
+
+	/* hpte region section */
+	fdm->hpte_region.request_flag = FADUMP_REQUEST_FLAG;
+	fdm->hpte_region.source_data_type = FADUMP_HPTE_REGION;
+	fdm->hpte_region.source_address = 0;
+	fdm->hpte_region.source_len = fw_dump.hpte_region_size;
+	fdm->hpte_region.destination_address = addr;
+	addr += fw_dump.hpte_region_size;
+
+	/* RMR region section */
+	fdm->rmr_region.request_flag = FADUMP_REQUEST_FLAG;
+	fdm->rmr_region.source_data_type = FADUMP_REAL_MODE_REGION;
+	fdm->rmr_region.source_address = RMR_START;
+	fdm->rmr_region.source_len = fw_dump.boot_memory_size;
+	fdm->rmr_region.destination_address = addr;
+	addr += fw_dump.boot_memory_size;
+
+	return addr;
+}
+
 /**
  * calculate_reserve_size() - reserve variable boot area 5% of System RAM
  *
@@ -170,8 +257,15 @@ int __init fadump_reserve_mem(void)
 		fw_dump.fadump_enabled = 0;
 		return 0;
 	}
-	/* Initialize boot memory size */
-	fw_dump.boot_memory_size = calculate_reserve_size();
+	/*
+	 * Initialize boot memory size
+	 * If dump is active then we have already calculated the size during
+	 * first kernel.
+	 */
+	if (fdm_active)
+		fw_dump.boot_memory_size = fdm_active->rmr_region.source_len;
+	else
+		fw_dump.boot_memory_size = calculate_reserve_size();
 
 	/*
 	 * Calculate the memory boundary.
@@ -248,3 +342,255 @@ static int __init early_fadump_reserve_mem(char *p)
 	return 0;
 }
 early_param("fadump_reserve_mem", early_fadump_reserve_mem);
+
+static void register_fw_dump(struct fadump_mem_struct *fdm)
+{
+	int rc;
+	unsigned int wait_time;
+
+	pr_debug("Registering for firmware-assisted kernel dump...\n");
+
+	/* TODO: Add upper time limit for the delay */
+	do {
+		rc = rtas_call(fw_dump.ibm_configure_kernel_dump, 3, 1, NULL,
+			FADUMP_REGISTER, fdm,
+			sizeof(struct fadump_mem_struct));
+
+		wait_time = rtas_busy_delay_time(rc);
+		if (wait_time)
+			mdelay(wait_time);
+
+	} while (wait_time);
+
+	switch (rc) {
+	case -1:
+		printk(KERN_ERR "Failed to register firmware-assisted kernel"
+			" dump. Hardware Error(%d).\n", rc);
+		break;
+	case -3:
+		printk(KERN_ERR "Failed to register firmware-assisted kernel"
+			" dump. Parameter Error(%d).\n", rc);
+		break;
+	case -9:
+		printk(KERN_ERR "firmware-assisted kernel dump is already"
+			" registered.\n");
+		fw_dump.dump_registered = 1;
+		break;
+	case 0:
+		printk(KERN_INFO "firmware-assisted kernel dump registration"
+			" is successful\n");
+		fw_dump.dump_registered = 1;
+		break;
+	}
+}
+
+static void register_fadump(void)
+{
+	/*
+	 * If no memory is reserved then we can not register for firmware-
+	 * assisted dump.
+	 */
+	if (!fw_dump.reserve_dump_area_size)
+		return;
+
+	/* register the future kernel dump with firmware. */
+	register_fw_dump(&fdm);
+}
+
+static int fadump_unregister_dump(struct fadump_mem_struct *fdm)
+{
+	int rc = 0;
+	unsigned int wait_time;
+
+	pr_debug("Un-register firmware-assisted dump\n");
+
+	/* TODO: Add upper time limit for the delay */
+	do {
+		rc = rtas_call(fw_dump.ibm_configure_kernel_dump, 3, 1, NULL,
+			FADUMP_UNREGISTER, fdm,
+			sizeof(struct fadump_mem_struct));
+
+		wait_time = rtas_busy_delay_time(rc);
+		if (wait_time)
+			mdelay(wait_time);
+	} while (wait_time);
+
+	if (rc) {
+		printk(KERN_ERR "Failed to un-register firmware-assisted dump."
+			" unexpected error(%d).\n", rc);
+		return rc;
+	}
+	fw_dump.dump_registered = 0;
+	return 0;
+}
+
+static ssize_t fadump_enabled_show(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					char *buf)
+{
+	return sprintf(buf, "%d\n", fw_dump.fadump_enabled);
+}
+
+static ssize_t fadump_register_show(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					char *buf)
+{
+	return sprintf(buf, "%d\n", fw_dump.dump_registered);
+}
+
+static ssize_t fadump_register_store(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					const char *buf, size_t count)
+{
+	int ret = 0;
+
+	if (!fw_dump.fadump_enabled || fdm_active)
+		return -EPERM;
+
+	mutex_lock(&fadump_mutex);
+
+	switch (buf[0]) {
+	case '0':
+		if (fw_dump.dump_registered == 0) {
+			ret = -EINVAL;
+			goto unlock_out;
+		}
+		/* Un-register Firmware-assisted dump */
+		fadump_unregister_dump(&fdm);
+		break;
+	case '1':
+		if (fw_dump.dump_registered == 1) {
+			ret = -EINVAL;
+			goto unlock_out;
+		}
+		/* Register Firmware-assisted dump */
+		register_fadump();
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+unlock_out:
+	mutex_unlock(&fadump_mutex);
+	return ret < 0 ? ret : count;
+}
+
+static int fadump_region_show(struct seq_file *m, void *private)
+{
+	const struct fadump_mem_struct *fdm_ptr;
+
+	if (!fw_dump.fadump_enabled)
+		return 0;
+
+	if (fdm_active)
+		fdm_ptr = fdm_active;
+	else
+		fdm_ptr = &fdm;
+
+	seq_printf(m,
+			"CPU : [%#016llx-%#016llx] %#llx bytes, "
+			"Dumped: %#llx\n",
+			fdm_ptr->cpu_state_data.destination_address,
+			fdm_ptr->cpu_state_data.destination_address +
+			fdm_ptr->cpu_state_data.source_len - 1,
+			fdm_ptr->cpu_state_data.source_len,
+			fdm_ptr->cpu_state_data.bytes_dumped);
+	seq_printf(m,
+			"HPTE: [%#016llx-%#016llx] %#llx bytes, "
+			"Dumped: %#llx\n",
+			fdm_ptr->hpte_region.destination_address,
+			fdm_ptr->hpte_region.destination_address +
+			fdm_ptr->hpte_region.source_len - 1,
+			fdm_ptr->hpte_region.source_len,
+			fdm_ptr->hpte_region.bytes_dumped);
+	seq_printf(m,
+			"DUMP: [%#016llx-%#016llx] %#llx bytes, "
+			"Dumped: %#llx\n",
+			fdm_ptr->rmr_region.destination_address,
+			fdm_ptr->rmr_region.destination_address +
+			fdm_ptr->rmr_region.source_len - 1,
+			fdm_ptr->rmr_region.source_len,
+			fdm_ptr->rmr_region.bytes_dumped);
+
+	if (!fdm_active ||
+		(fw_dump.reserve_dump_area_start ==
+		fdm_ptr->cpu_state_data.destination_address))
+		return 0;
+
+	/* Dump is active. Show reserved memory region. */
+	seq_printf(m,
+			"    : [%#016llx-%#016llx] %#llx bytes, "
+			"Dumped: %#llx\n",
+			(unsigned long long)fw_dump.reserve_dump_area_start,
+			fdm_ptr->cpu_state_data.destination_address - 1,
+			fdm_ptr->cpu_state_data.destination_address -
+			fw_dump.reserve_dump_area_start,
+			fdm_ptr->cpu_state_data.destination_address -
+			fw_dump.reserve_dump_area_start);
+	return 0;
+}
+
+static struct kobj_attribute fadump_attr = __ATTR(fadump_enabled,
+						0444, fadump_enabled_show,
+						NULL);
+static struct kobj_attribute fadump_register_attr = __ATTR(fadump_registered,
+						0644, fadump_register_show,
+						fadump_register_store);
+
+static int fadump_region_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, fadump_region_show, inode->i_private);
+}
+
+static const struct file_operations fadump_region_fops = {
+	.open    = fadump_region_open,
+	.read    = seq_read,
+	.llseek  = seq_lseek,
+	.release = single_release,
+};
+
+static void fadump_init_files(void)
+{
+	struct dentry *debugfs_file;
+	int rc = 0;
+
+	rc = sysfs_create_file(kernel_kobj, &fadump_attr.attr);
+	if (rc)
+		printk(KERN_ERR "fadump: unable to create sysfs file"
+			" fadump_enabled (%d)\n", rc);
+
+	rc = sysfs_create_file(kernel_kobj, &fadump_register_attr.attr);
+	if (rc)
+		printk(KERN_ERR "fadump: unable to create sysfs file"
+			" fadump_registered (%d)\n", rc);
+
+	debugfs_file = debugfs_create_file("fadump_region", 0444,
+					powerpc_debugfs_root, NULL,
+					&fadump_region_fops);
+	if (!debugfs_file)
+		printk(KERN_ERR "fadump: unable to create debugfs file"
+				" fadump_region\n");
+	return;
+}
+
+/*
+ * Prepare for firmware-assisted dump.
+ */
+int __init setup_fadump(void)
+{
+	if (!fw_dump.fadump_supported) {
+		printk(KERN_ERR "Firmware-assisted dump is not supported on"
+			" this hardware\n");
+		return 0;
+	}
+
+	fadump_show_config();
+	/* Initialize the kernel dump memory structure for FAD registration. */
+	if (fw_dump.reserve_dump_area_size)
+		init_fadump_mem_struct(&fdm, fw_dump.reserve_dump_area_start);
+	fadump_init_files();
+
+	return 1;
+}
+subsys_initcall(setup_fadump);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 961bb03..2549b53 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -39,6 +39,7 @@
 #include <asm/pci-bridge.h>
 #include <asm/machdep.h>
 #include <asm/kdump.h>
+#include <asm/fadump.h>
 
 #define DBG(...)
 
@@ -445,7 +446,12 @@ void iommu_unmap_sg(struct iommu_table *tbl, struct scatterlist *sglist,
 
 static void iommu_table_clear(struct iommu_table *tbl)
 {
-	if (!is_kdump_kernel()) {
+	/*
+	 * In case of firmware-assisted dump, the system goes through a clean
+	 * reboot process at the time of the crash. Hence it is safe to
+	 * clear the TCE entries if firmware-assisted dump is active.
+	 */
+	if (!is_kdump_kernel() || is_fadump_active()) {
 		/* Clear the table in case firmware left allocations in it */
 		ppc_md.tce_free(tbl, tbl->it_offset, tbl->it_size);
 		return;
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 26b2872..ba64f1a 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -54,6 +54,7 @@
 #include <asm/spu.h>
 #include <asm/udbg.h>
 #include <asm/code-patching.h>
+#include <asm/fadump.h>
 
 #ifdef DEBUG
 #define DBG(fmt...) udbg_printf(fmt)
@@ -627,6 +628,16 @@ static void __init htab_initialize(void)
 		/* Using a hypervisor which owns the htab */
 		htab_address = NULL;
 		_SDR1 = 0; 
+#ifdef CONFIG_FA_DUMP
+		/*
+		 * If firmware-assisted dump is active, firmware preserves
+		 * the contents of htab along with entire partition memory.
+		 * Clear the htab if firmware-assisted dump is active so
+		 * that we don't end up using old mappings.
+		 */
+		if (is_fadump_active() && ppc_md.hpte_clear_all)
+			ppc_md.hpte_clear_all();
+#endif
 	} else {
 		/* Find storage for the HPT.  Must be contiguous in
 		 * the absolute address space. On cell we want it to be

* [RFC PATCH v4 06/10] fadump: Add PT_NOTE program header for vmcoreinfo
From: Mahesh J Salgaonkar @ 2011-11-07  9:56 UTC (permalink / raw)
  To: linuxppc-dev, Linux Kernel, Benjamin Herrenschmidt
  Cc: Amerigo Wang, Milton Miller, Eric W. Biederman, Anton Blanchard
In-Reply-To: <20111107095215.1997.14866.stgit@mars.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

Introduce a PT_NOTE program header that points to physical address of
vmcoreinfo_note buffer declared in kernel/kexec.c. The vmcoreinfo
note buffer is populated during crash_fadump() at the time of system
crash.
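
As a rough user-space sketch of the idea, the relocation check and the PT_NOTE setup look like this. The sketch_* names are hypothetical, and struct sketch_phdr is a local stand-in with the same field order as Elf64_Phdr so that the example does not depend on <elf.h>. The boundary test mirrors the relocate() helper in the diff, so address 0 itself is left unrelocated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PT_NOTE   4	/* same value as PT_NOTE in the ELF spec */
#define SKETCH_RMR_START 0ULL	/* boot memory is assumed to start at 0 */

/* Local stand-in mirroring the field order of Elf64_Phdr. */
struct sketch_phdr {
	uint32_t p_type;
	uint32_t p_flags;
	uint64_t p_offset;
	uint64_t p_vaddr;
	uint64_t p_paddr;
	uint64_t p_filesz;
	uint64_t p_memsz;
	uint64_t p_align;
};

/*
 * If paddr lies inside the boot memory region, return the address of its
 * preserved copy inside the dump area; otherwise return paddr unchanged.
 */
static uint64_t sketch_relocate(uint64_t paddr, uint64_t boot_mem_size,
				uint64_t dump_dest)
{
	if (paddr > SKETCH_RMR_START && paddr < boot_mem_size)
		return dump_dest + paddr;
	return paddr;
}

/* Fill a PT_NOTE program header that describes the vmcoreinfo note. */
static void sketch_fill_note_phdr(struct sketch_phdr *phdr,
				  uint64_t note_paddr, uint64_t note_size,
				  uint64_t boot_mem_size, uint64_t dump_dest)
{
	assert(phdr != NULL);

	memset(phdr, 0, sizeof(*phdr));
	phdr->p_type = SKETCH_PT_NOTE;
	phdr->p_paddr = sketch_relocate(note_paddr, boot_mem_size, dump_dest);
	phdr->p_offset = phdr->p_paddr;
	phdr->p_filesz = note_size;
	phdr->p_memsz = note_size;
}
```

The relocation matters because the note buffer normally lives in low memory, and the second kernel reuses that memory to boot; the preserved copy is the one inside the dump area.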

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/fadump.c |   29 +++++++++++++++++++++++++++++
 1 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 70d6287..e68ee3a 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -816,6 +816,19 @@ static void setup_crash_memory_ranges(void)
 	}
 }
 
+/*
+ * If the given physical address falls within the boot memory region then
+ * return the relocated address that points to the dump region reserved
+ * for saving initial boot memory contents.
+ */
+static inline unsigned long relocate(unsigned long paddr)
+{
+	if (paddr > RMR_START && paddr < fw_dump.boot_memory_size)
+		return fdm.rmr_region.destination_address + paddr;
+	else
+		return paddr;
+}
+
 static int create_elfcore_headers(char *bufp)
 {
 	struct elfhdr *elf;
@@ -847,6 +860,22 @@ static int create_elfcore_headers(char *bufp)
 
 	(elf->e_phnum)++;
 
+	/* setup ELF PT_NOTE for vmcoreinfo */
+	phdr = (struct elf_phdr *)bufp;
+	bufp += sizeof(struct elf_phdr);
+	phdr->p_type	= PT_NOTE;
+	phdr->p_flags	= 0;
+	phdr->p_vaddr	= 0;
+	phdr->p_align	= 0;
+
+	phdr->p_paddr	= relocate(paddr_vmcoreinfo_note());
+	phdr->p_offset	= phdr->p_paddr;
+	phdr->p_memsz	= vmcoreinfo_max_size;
+	phdr->p_filesz	= vmcoreinfo_max_size;
+
+	/* Increment number of program headers. */
+	(elf->e_phnum)++;
+
 	/* setup PT_LOAD sections. */
 
 	for (i = 0; i < crash_mem_ranges; i++) {

* [RFC PATCH v4 08/10] fadump: Invalidate registration and release reserved memory for general use.
From: Mahesh J Salgaonkar @ 2011-11-07  9:56 UTC (permalink / raw)
  To: linuxppc-dev, Linux Kernel, Benjamin Herrenschmidt
  Cc: Amerigo Wang, Milton Miller, Eric W. Biederman, Anton Blanchard
In-Reply-To: <20111107095215.1997.14866.stgit@mars.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

This patch introduces a sysfs interface '/sys/kernel/fadump_release_mem' to
invalidate the last fadump registration, invalidate '/proc/vmcore', release
the reserved memory for general use and re-register for future kernel dump.
Once the dump is copied to the disk, the userspace tool will echo 1 to
'/sys/kernel/fadump_release_mem'.

Release the reserved memory region excluding the size of the memory required
for future kernel dump registration.
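
The exclusion rule can be sketched in user space as below. This is a simplified model, not the kernel code: the sketch_* names and the 4 KiB page size are assumptions, and the function only counts the pages that would be handed back instead of freeing them. The overlap test is copied from fadump_release_memory() in the diff; because it compares with `<=`, the page that starts exactly at the end of the reserve area is kept as well.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumed page size, illustration only */

/*
 * Walk [begin, end) one page at a time and count the pages that would be
 * released. Pages overlapping the dump reserve area
 * [ra_start, ra_start + ra_size] are skipped so that the area can be
 * reused for the next registration.
 */
static unsigned long sketch_count_released(uint64_t begin, uint64_t end,
					   uint64_t ra_start, uint64_t ra_size)
{
	uint64_t ra_end = ra_start + ra_size;
	uint64_t addr;
	unsigned long released = 0;

	assert(begin <= end);

	for (addr = begin; addr < end; addr += SKETCH_PAGE_SIZE) {
		/* Same overlap test as in the patch. */
		if (addr <= ra_end && (addr + SKETCH_PAGE_SIZE) > ra_start)
			continue;
		released++;
	}
	return released;
}
```

For a 16-page range with a 2-page reserve area in the middle, this model releases 13 pages: the two reserved pages and the page starting at the area's end are all skipped.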

Change in v3:
- Synchronize the fadump invalidation step to handle simultaneous writes to
  /sys/kernel/fadump_release_mem.

Change in v2:
- Introduced cpu_notes_buf_free() function to free memory allocated for
  cpu notes buffer.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/fadump.h |    3 +
 arch/powerpc/kernel/fadump.c      |  157 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 156 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump.h b/arch/powerpc/include/asm/fadump.h
index 0c14097..8ddfbc7 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -202,6 +202,9 @@ extern int fadump_reserve_mem(void);
 extern int setup_fadump(void);
 extern int is_fadump_active(void);
 extern void crash_fadump(struct pt_regs *, const char *);
+extern void fadump_cleanup(void);
+
+extern void vmcore_cleanup(void);
 #else	/* CONFIG_FA_DUMP */
 static inline int is_fadump_active(void) { return 0; }
 #endif
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index e68ee3a..b449b55 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -33,6 +33,8 @@
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
 #include <linux/crash_dump.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
 
 #include <asm/page.h>
 #include <asm/prom.h>
@@ -986,6 +988,131 @@ static int fadump_unregister_dump(struct fadump_mem_struct *fdm)
 	return 0;
 }
 
+static int fadump_invalidate_dump(struct fadump_mem_struct *fdm)
+{
+	int rc = 0;
+	unsigned int wait_time;
+
+	pr_debug("Invalidating firmware-assisted dump registration\n");
+
+	/* TODO: Add upper time limit for the delay */
+	do {
+		rc = rtas_call(fw_dump.ibm_configure_kernel_dump, 3, 1, NULL,
+			FADUMP_INVALIDATE, fdm,
+			sizeof(struct fadump_mem_struct));
+
+		wait_time = rtas_busy_delay_time(rc);
+		if (wait_time)
+			mdelay(wait_time);
+	} while (wait_time);
+
+	if (rc) {
+		printk(KERN_ERR "Failed to invalidate firmware-assisted dump "
+			"registration. unexpected error(%d).\n", rc);
+		return rc;
+	}
+	fw_dump.dump_active = 0;
+	fdm_active = NULL;
+	return 0;
+}
+
+void fadump_cleanup(void)
+{
+	/* Invalidate the registration only if dump is active. */
+	if (fw_dump.dump_active) {
+		init_fadump_mem_struct(&fdm,
+			fdm_active->cpu_state_data.destination_address);
+		fadump_invalidate_dump(&fdm);
+	}
+}
+
+/*
+ * Release the memory that was reserved in early boot to preserve the memory
+ * contents. The released memory will be available for general use.
+ */
+static void fadump_release_memory(unsigned long begin, unsigned long end)
+{
+	unsigned long addr;
+	unsigned long ra_start, ra_end;
+
+	ra_start = fw_dump.reserve_dump_area_start;
+	ra_end = ra_start + fw_dump.reserve_dump_area_size;
+
+	for (addr = begin; addr < end; addr += PAGE_SIZE) {
+		/*
+		 * exclude the dump reserve area. Will reuse it for next
+		 * fadump registration.
+		 */
+		if (addr <= ra_end && ((addr + PAGE_SIZE) > ra_start))
+			continue;
+
+		ClearPageReserved(pfn_to_page(addr >> PAGE_SHIFT));
+		init_page_count(pfn_to_page(addr >> PAGE_SHIFT));
+		free_page((unsigned long)__va(addr));
+		totalram_pages++;
+	}
+}
+
+static void fadump_invalidate_release_mem(void)
+{
+	unsigned long reserved_area_start, reserved_area_end;
+	unsigned long destination_address;
+
+	mutex_lock(&fadump_mutex);
+	if (!fw_dump.dump_active) {
+		mutex_unlock(&fadump_mutex);
+		return;
+	}
+
+	destination_address = fdm_active->cpu_state_data.destination_address;
+	fadump_cleanup();
+	mutex_unlock(&fadump_mutex);
+
+	/*
+	 * Save the current reserved memory bounds; we will require them
+	 * later for releasing the memory for general use.
+	 */
+	reserved_area_start = fw_dump.reserve_dump_area_start;
+	reserved_area_end = reserved_area_start +
+			fw_dump.reserve_dump_area_size;
+	/*
+	 * Setup reserve_dump_area_start and its size so that we can
+	 * reuse this reserved memory for Re-registration.
+	 */
+	fw_dump.reserve_dump_area_start = destination_address;
+	fw_dump.reserve_dump_area_size = get_dump_area_size();
+
+	fadump_release_memory(reserved_area_start, reserved_area_end);
+	if (fw_dump.cpu_notes_buf) {
+		cpu_notes_buf_free((unsigned long)__va(fw_dump.cpu_notes_buf),
+					fw_dump.cpu_notes_buf_size);
+		fw_dump.cpu_notes_buf = 0;
+		fw_dump.cpu_notes_buf_size = 0;
+	}
+	/* Initialize the kernel dump memory structure for FAD registration. */
+	init_fadump_mem_struct(&fdm, fw_dump.reserve_dump_area_start);
+}
+
+static ssize_t fadump_release_memory_store(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					const char *buf, size_t count)
+{
+	if (!fw_dump.dump_active)
+		return -EPERM;
+
+	if (buf[0] == '1') {
+		/*
+		 * Take away the '/proc/vmcore'. We are releasing the dump
+		 * memory, hence it will not be valid anymore.
+		 */
+		vmcore_cleanup();
+		fadump_invalidate_release_mem();
+
+	} else
+		return -EINVAL;
+	return count;
+}
+
 static ssize_t fadump_enabled_show(struct kobject *kobj,
 					struct kobj_attribute *attr,
 					char *buf)
@@ -1045,10 +1172,13 @@ static int fadump_region_show(struct seq_file *m, void *private)
 	if (!fw_dump.fadump_enabled)
 		return 0;
 
+	mutex_lock(&fadump_mutex);
 	if (fdm_active)
 		fdm_ptr = fdm_active;
-	else
+	else {
+		mutex_unlock(&fadump_mutex);
 		fdm_ptr = &fdm;
+	}
 
 	seq_printf(m,
 			"CPU : [%#016llx-%#016llx] %#llx bytes, "
@@ -1078,7 +1208,7 @@ static int fadump_region_show(struct seq_file *m, void *private)
 	if (!fdm_active ||
 		(fw_dump.reserve_dump_area_start ==
 		fdm_ptr->cpu_state_data.destination_address))
-		return 0;
+		goto out;
 
 	/* Dump is active. Show reserved memory region. */
 	seq_printf(m,
@@ -1090,9 +1220,15 @@ static int fadump_region_show(struct seq_file *m, void *private)
 			fw_dump.reserve_dump_area_start,
 			fdm_ptr->cpu_state_data.destination_address -
 			fw_dump.reserve_dump_area_start);
+out:
+	if (fdm_active)
+		mutex_unlock(&fadump_mutex);
 	return 0;
 }
 
+static struct kobj_attribute fadump_release_attr = __ATTR(fadump_release_mem,
+						0200, NULL,
+						fadump_release_memory_store);
 static struct kobj_attribute fadump_attr = __ATTR(fadump_enabled,
 						0444, fadump_enabled_show,
 						NULL);
@@ -1133,6 +1269,13 @@ static void fadump_init_files(void)
 	if (!debugfs_file)
 		printk(KERN_ERR "fadump: unable to create debugfs file"
 				" fadump_region\n");
+
+	if (fw_dump.dump_active) {
+		rc = sysfs_create_file(kernel_kobj, &fadump_release_attr.attr);
+		if (rc)
+			printk(KERN_ERR "fadump: unable to create sysfs file"
+				" fadump_release_mem (%d)\n", rc);
+	}
 	return;
 }
 
@@ -1152,8 +1295,14 @@ int __init setup_fadump(void)
 	 * If dump data is available then see if it is valid and prepare for
 	 * saving it to the disk.
 	 */
-	if (fw_dump.dump_active)
-		process_fadump(fdm_active);
+	if (fw_dump.dump_active) {
+		/*
+		 * if dump process fails then invalidate the registration
+		 * and release memory before proceeding for re-registration.
+		 */
+		if (process_fadump(fdm_active) < 0)
+			fadump_invalidate_release_mem();
+	}
 	/* Initialize the kernel dump memory structure for FAD registration. */
 	else if (fw_dump.reserve_dump_area_size)
 		init_fadump_mem_struct(&fdm, fw_dump.reserve_dump_area_start);

* [RFC PATCH v4 09/10] fadump: Invalidate the fadump registration during machine shutdown.
From: Mahesh J Salgaonkar @ 2011-11-07  9:56 UTC (permalink / raw)
  To: linuxppc-dev, Linux Kernel, Benjamin Herrenschmidt
  Cc: Amerigo Wang, Milton Miller, Eric W. Biederman, Anton Blanchard
In-Reply-To: <20111107095215.1997.14866.stgit@mars.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

If dump is active during system reboot, shutdown or halt then invalidate
the fadump registration as it does not get invalidated automatically.
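
The invalidation goes through the same retry-while-busy RTAS call pattern used elsewhere in this series, where each loop carries a TODO about an upper time limit. One possible shape for a bounded retry is sketched below in user space. Everything here is an assumption made for illustration: the sketch_* names, the fake firmware stub, and the simplified busy-status model (status -2 means a 1 ms wait, 9900 to 9905 mean 10^n ms). It is not the kernel's rtas API.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_BUSY          (-2)	/* assumed "busy" status */
#define SKETCH_EXT_DELAY_MIN 9900	/* assumed extended-delay range */
#define SKETCH_EXT_DELAY_MAX 9905

/* Map a status code to a wait in ms; 0 means "not busy". */
static unsigned int sketch_busy_delay_ms(int status)
{
	unsigned int ms = 0;
	int order;

	if (status == SKETCH_BUSY) {
		ms = 1;
	} else if (status >= SKETCH_EXT_DELAY_MIN &&
		   status <= SKETCH_EXT_DELAY_MAX) {
		order = status - SKETCH_EXT_DELAY_MIN;
		for (ms = 1; order > 0; order--)
			ms *= 10;
	}
	return ms;
}

typedef int (*sketch_fw_call_t)(void *ctx);

/*
 * Repeat the call while it reports busy, but stop once the accumulated
 * wait would exceed budget_ms. Returns the last status seen.
 */
static int sketch_call_with_retry(sketch_fw_call_t call, void *ctx,
				  unsigned int budget_ms,
				  unsigned int *waited_ms)
{
	unsigned int waited = 0, delay;
	int rc;

	assert(call != NULL);

	for (;;) {
		rc = call(ctx);
		delay = sketch_busy_delay_ms(rc);
		if (!delay || waited + delay > budget_ms)
			break;
		/* A kernel caller would sleep or mdelay(delay) here. */
		waited += delay;
	}
	if (waited_ms)
		*waited_ms = waited;
	return rc;
}

/* Fake firmware for the example: busy N times, then success (0). */
struct sketch_fake_fw {
	int busy_replies;
	int calls;
};

static int sketch_fake_invalidate(void *ctx)
{
	struct sketch_fake_fw *fw = ctx;

	fw->calls++;
	return fw->calls <= fw->busy_replies ? SKETCH_BUSY : 0;
}
```

With a budget in place, a caller on the shutdown path can tell "firmware still busy" apart from success by checking the returned status, instead of spinning indefinitely.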

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/setup-common.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index ce35aaf..67e5caa 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -110,6 +110,14 @@ EXPORT_SYMBOL(ppc_do_canonicalize_irqs);
 /* also used by kexec */
 void machine_shutdown(void)
 {
+#ifdef CONFIG_FA_DUMP
+	/*
+	 * if fadump is active, cleanup the fadump registration before we
+	 * shutdown.
+	 */
+	fadump_cleanup();
+#endif
+
 	if (ppc_md.machine_shutdown)
 		ppc_md.machine_shutdown();
 }

* [RFC PATCH v4 10/10] fadump: Introduce config option for firmware assisted dump feature
From: Mahesh J Salgaonkar @ 2011-11-07  9:56 UTC (permalink / raw)
  To: linuxppc-dev, Linux Kernel, Benjamin Herrenschmidt
  Cc: Amerigo Wang, Milton Miller, Eric W. Biederman, Anton Blanchard
In-Reply-To: <20111107095215.1997.14866.stgit@mars.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

This patch introduces a new config option CONFIG_FA_DUMP for the
firmware-assisted dump feature on the PowerPC (ppc64) architecture.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/Kconfig |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 6926b61..7ce773c 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -379,6 +379,19 @@ config PHYP_DUMP

 	  If unsure, say "N"

+config FA_DUMP
+	bool "Firmware-assisted dump"
+	depends on PPC64 && PPC_RTAS && CRASH_DUMP
+	help
+	  A robust mechanism to get a reliable kernel crash dump with
+	  assistance from firmware. This approach does not use kexec;
+	  instead, firmware assists in booting the kdump kernel
+	  while preserving memory contents. Firmware-assisted dump
+	  is meant to be a kdump replacement offering robustness and
+	  speed not possible without system firmware assistance.
+
+	  If unsure, say "N"
+
 config PPCBUG_NVRAM
 	bool "Enable reading PPCBUG NVRAM during boot" if PPLUS || LOPEC
 	default y if PPC_PREP

* [RFC PATCH v4 01/10] fadump: Add documentation for firmware-assisted dump.
From: Mahesh J Salgaonkar @ 2011-11-07  9:55 UTC (permalink / raw)
  To: linuxppc-dev, Linux Kernel, Benjamin Herrenschmidt
  Cc: Amerigo Wang, Milton Miller, Eric W. Biederman, Anton Blanchard
In-Reply-To: <20111107095215.1997.14866.stgit@mars.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

Documentation for firmware-assisted dump. This document is based on the
original documentation written for phyp assisted dump by Linas Vepstas
and Manish Ahuja, with a few changes to reflect the current implementation.

Change in v3:
- Modified the documentation to reflect the introduction of the
  fadump_registered sysfs file, plus a few minor changes.

Change in v2:
- Modified the documentation to reflect the change of fadump_region
  file under debugfs filesystem.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 Documentation/powerpc/firmware-assisted-dump.txt |  262 ++++++++++++++++++++++
 1 files changed, 262 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/powerpc/firmware-assisted-dump.txt

diff --git a/Documentation/powerpc/firmware-assisted-dump.txt b/Documentation/powerpc/firmware-assisted-dump.txt
new file mode 100644
index 0000000..ba6724a
--- /dev/null
+++ b/Documentation/powerpc/firmware-assisted-dump.txt
@@ -0,0 +1,262 @@
+
+                   Firmware-Assisted Dump
+                   ------------------------
+                       July 2011
+
+The goal of firmware-assisted dump is to enable the dump of
+a crashed system, and to do so from a fully-reset system, and
+to minimize the total elapsed time until the system is back
+in production use.
+
+As compared to kdump or other strategies, firmware-assisted
+dump offers several strong, practical advantages:
+
+-- Unlike kdump, the system has been reset, and loaded
+   with a fresh copy of the kernel.  In particular,
+   PCI and I/O devices have been reinitialized and are
+   in a clean, consistent state.
+-- Once the dump is copied out, the memory that held the dump
+   is immediately available to the running kernel. A further
+   reboot isn't required.
+
+The above can only be accomplished by coordination with,
+and assistance from the Power firmware. The procedure is
+as follows:
+
+-- The first kernel registers the sections of memory with the
+   Power firmware for dump preservation during OS initialization.
+   These registered sections of memory are reserved by the first
+   kernel during early boot.
+
+-- When a system crashes, the Power firmware will save
+   the low memory (boot memory, sized as the larger of 5% of
+   system RAM or 256MB) to a previously registered save region.
+   It will also save system registers and hardware PTEs.
+
+   NOTE: The term 'boot memory' means the size of the low memory chunk
+         that is required for a kernel to boot successfully when
+         booted with restricted memory. By default, the boot memory
+         size will be calculated as the larger of 5% of system RAM or
+         256MB. Alternatively, the user can also specify boot memory
+         size through boot parameter 'fadump_reserve_mem=' which
+         will override the default calculated size.
+
+-- After the low memory (boot memory) area has been saved, the
+   firmware will reset PCI and other hardware state.  It will
+   *not* clear the RAM. It will then launch the bootloader, as
+   normal.
+
+-- The freshly booted kernel will notice that there is a new
+   node (ibm,kernel-dump) in the device tree, indicating that
+   there is crash data available from a previous boot. During
+   early boot the OS will reserve the rest of the memory above
+   boot memory size, effectively booting with restricted memory
+   size. This will make sure that the second kernel will not
+   touch any of the dump memory area.
+
+-- Userspace tools will read /proc/vmcore to obtain the contents
+   of memory, which holds the previous crashed kernel dump in ELF
+   format. The userspace tools may copy this info to disk, or
+   network, nas, san, iscsi, etc. as desired.
+
+-- Once the userspace tool is done saving the dump, it will echo
+   '1' to /sys/kernel/fadump_release_mem to release the reserved
+   memory back to general use, except the memory required for the
+   next firmware-assisted dump registration.
+
+   e.g.
+     # echo 1 > /sys/kernel/fadump_release_mem
+
+Please note that the firmware-assisted dump feature
+is only available on Power6 and above systems with recent
+firmware versions.
+
+Implementation details:
+----------------------
+
+During boot, a check is made to see if firmware supports
+this feature on that particular machine. If it does, then
+we check to see if an active dump is waiting for us. If yes
+then everything but boot memory size of RAM is reserved during
+early boot (See Fig. 2). This area is released once we collect a
+dump from user land scripts (kdump scripts) that are run. If
+there is dump data, then the /sys/kernel/fadump_release_mem
+file is created, and the reserved memory is held.
+
+If there is no waiting dump data, then only the memory required
+to hold CPU state, HPTE region, boot memory dump and elfcore
+header is reserved at the top of memory (see Fig. 1). This area
+is *not* released: this region will be kept permanently reserved,
+so that it can act as a receptacle for a copy of the boot memory
+content in addition to CPU state and HPTE region, in case a
+crash does occur.
+
+  o Memory Reservation during first kernel
+
+  Low memory                                        Top of memory
+  0      boot memory size                                       |
+  |           |                       |<--Reserved dump area -->|
+  V           V                       |   Permanent Reservation V
+  +-----------+----------/ /----------+---+----+-----------+----+
+  |           |                       |CPU|HPTE|  DUMP     |ELF |
+  +-----------+----------/ /----------+---+----+-----------+----+
+        |                                           ^
+        |                                           |
+        \                                           /
+         -------------------------------------------
+          Boot memory content gets transferred to
+          reserved area by firmware at the time of
+          crash
+                   Fig. 1
+
+  o Memory Reservation during second kernel after crash
+
+  Low memory                                        Top of memory
+  0      boot memory size                                       |
+  |           |<------------- Reserved dump area -------------->|
+  V           V                                                 V
+  +-----------+----------/ /----------+---+----+-----------+----+
+  |           |                       |CPU|HPTE|  DUMP     |ELF |
+  +-----------+----------/ /----------+---+----+-----------+----+
+        |                                                    |
+        V                                                    V
+   Used by second                                    /proc/vmcore
+   kernel to boot
+                   Fig. 2
+
+Currently the dump will be copied from /proc/vmcore to a
+new file upon user intervention. The dump data available through
+/proc/vmcore will be in ELF format. Hence the existing kdump
+infrastructure (kdump scripts) to save the dump works fine
+with minor modifications. The kdump script requires the following
+modifications:
+-- During service kdump start, if the /proc/vmcore entry is not
+   present, look for /sys/kernel/fadump_enabled and read the value
+   exported by it. If the value is '0', fall back to the existing
+   kexec-based kdump. If the value is '1', check the value exported
+   by /sys/kernel/fadump_registered. If that value is '1', print
+   success; otherwise register for fadump by echoing 1 to the
+   /sys/kernel/fadump_registered file.
+
+-- During service kdump start if /proc/vmcore entry is present,
+   execute the existing routine to save the dump. Once the dump
+   is saved, echo 1 > /sys/kernel/fadump_release_mem (if the
+   file exists) to release the reserved memory for general use
+   and continue without rebooting. At this point the memory
+   reservation map will look as shown in Fig. 1. If the file
+   /sys/kernel/fadump_release_mem is not present then follow
+   the existing routine to reboot into a new kernel.
+
+-- During service kdump stop echo 0 > /sys/kernel/fadump_registered
+   to un-register the fadump.
+
+The tools to examine the dump will be the same as the ones
+used for kdump.
+
+How to enable firmware-assisted dump (fadump):
+-------------------------------------
+
+1. Set config option CONFIG_FA_DUMP=y and build kernel.
+2. Boot into linux kernel with 'fadump=1' kernel cmdline option.
+3. Optionally, user can also set 'fadump_reserve_mem=' kernel cmdline
+   to specify size of the memory to reserve for boot memory dump
+   preservation.
+
+NOTE: If firmware-assisted dump fails to reserve memory then it will
+   fall back to the existing kdump mechanism if the 'crashkernel='
+   option is set on the kernel cmdline.
+
+Sysfs/debugfs files:
+------------
+
+The firmware-assisted dump feature uses the sysfs file system to hold
+the control files and a debugfs file to display the reserved memory regions.
+
+Here is the list of files under kernel sysfs:
+
+ /sys/kernel/fadump_enabled
+
+    This is used to display the fadump status.
+    0 = fadump is disabled
+    1 = fadump is enabled
+
+ /sys/kernel/fadump_registered
+
+    This is used to display the fadump registration status as well
+    as to control (start/stop) the fadump registration.
+    0 = fadump is not registered.
+    1 = fadump is registered and ready to handle system crash.
+
+    To register fadump, echo 1 > /sys/kernel/fadump_registered; to
+    un-register and stop fadump, echo 0 > /sys/kernel/fadump_registered.
+    Once fadump is un-registered, a system crash will not be handled
+    and the vmcore will not be captured.
+
+ /sys/kernel/fadump_release_mem
+
+    This file is available only when fadump is active during the
+    second kernel. It is used to release the reserved memory
+    regions that are held for saving the crash dump. To release the
+    reserved memory, echo 1 to it:
+
+    echo 1  > /sys/kernel/fadump_release_mem
+
+    After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region
+    file will change to reflect the new memory reservations.
+
+Here is the list of files under powerpc debugfs:
+(Assuming debugfs is mounted on /sys/kernel/debug directory.)
+
+ /sys/kernel/debug/powerpc/fadump_region
+
+    This file shows the reserved memory regions if fadump is
+    enabled; otherwise this file is empty. The output format
+    is:
+    <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>
+
+    e.g.
+    Contents when fadump is registered during first kernel
+
+    # cat /sys/kernel/debug/powerpc/fadump_region
+    CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0
+    HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0
+    DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0
+
+    Contents when fadump is active during second kernel
+
+    # cat /sys/kernel/debug/powerpc/fadump_region
+    CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020
+    HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000
+    DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000
+        : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000
+
+NOTE: Please refer to debugfs documentation on how to mount the debugfs
+      filesystem.
+
+
+TODO:
+-----
+ o Need to come up with a better approach to find a more
+   accurate boot memory size that is required for a kernel to
+   boot successfully when booted with restricted memory.
+ o The fadump implementation introduces a fadump crash info structure
+   in the scratch area before the ELF core header. The idea of introducing
+   this structure is to pass some important crash info data to the second
+   kernel, which will help the second kernel populate the ELF core header
+   with correct data before it gets exported through /proc/vmcore. The
+   current design does not address the possibility of introducing
+   additional fields (in future) to this structure without affecting
+   compatibility. Need to come up with a better approach to address this.
+   The possible approaches are:
+	1. Introduce version field for version tracking, bump up the version
+	whenever a new field is added to the structure in future. The version
+	field can be used to find out what fields are valid for the current
+	version of the structure.
+	2. Reserve the area of predefined size (say PAGE_SIZE) for this
+	structure and have unused area as reserved (initialized to zero)
+	for future field additions.
+   The advantage of approach 1 over 2 is we don't need to reserve extra space.
+---
+Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
+This document is based on the original documentation written for phyp
+assisted dump by Linas Vepstas and Manish Ahuja.

^ permalink raw reply related
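
The fadump_region output format documented above is regular enough to be checked mechanically. The following is a minimal user-space sketch, not part of the patch: the struct and function names are hypothetical, and it assumes each line has exactly the "<region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>" layout shown in the examples, with an inclusive <end>.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One parsed line of fadump_region output (illustrative name). */
struct fadump_region_line {
	uint64_t start;		/* first byte of the region */
	uint64_t end;		/* last byte of the region (inclusive) */
	uint64_t size;		/* reserved size in bytes */
	uint64_t dumped;	/* bytes reported as dumped */
};

/*
 * Parse one line of the documented format. Returns 0 on success and
 * -1 if the line does not match or the range and size disagree.
 */
static int parse_fadump_region_line(const char *line,
				    struct fadump_region_line *out)
{
	const char *bracket;

	assert(line != NULL && out != NULL);

	/* The region name may be blank, so start parsing at the '['. */
	bracket = strchr(line, '[');
	if (!bracket)
		return -1;

	if (sscanf(bracket, "[%" SCNx64 "-%" SCNx64 "] %" SCNx64
		   " bytes, Dumped: %" SCNx64,
		   &out->start, &out->end, &out->size, &out->dumped) != 4)
		return -1;

	/* <end> is inclusive, so the span is end - start + 1. */
	if (out->end < out->start || out->end - out->start + 1 != out->size)
		return -1;

	return 0;
}
```

Fed the example lines from the documentation, the consistency check passes for all four regions, including the unnamed one that covers the memory above the boot memory size in the second kernel.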

* [PATCH][v2] powerpc/usb: fix type cast for address of ioremap to compatible with 64-bit
From: Shaohui Xie @ 2011-11-07  8:58 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-usb, Shaohui Xie

The driver currently computes the address of the USB sysif_regs as follows:

usb_sys_regs = (struct usb_sys_interface *)
	((u32)dr_regs + USB_DR_SYS_OFFSET);

This works on 32-bit, but on 64-bit casting the address returned by ioremap
to u32 truncates it, and accessing members of 'usb_sys_regs' then causes a
call trace. Use (void *) instead, which is correct on both 32-bit and 64-bit.

Signed-off-by: Shaohui Xie <Shaohui.Xie@freescale.com>
---
changes for v2:
1. use (void *) instead of unsigned long and the double cast according
to Timur's comment.

 drivers/usb/gadget/fsl_udc_core.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/drivers/usb/gadget/fsl_udc_core.c b/drivers/usb/gadget/fsl_udc_core.c
index c81fbad..398c5e6 100644
--- a/drivers/usb/gadget/fsl_udc_core.c
+++ b/drivers/usb/gadget/fsl_udc_core.c
@@ -2497,8 +2497,7 @@ static int __init fsl_udc_probe(struct platform_device *pdev)
 
 #ifndef CONFIG_ARCH_MXC
 	if (pdata->have_sysif_regs)
-		usb_sys_regs = (struct usb_sys_interface *)
-				((u32)dr_regs + USB_DR_SYS_OFFSET);
+		usb_sys_regs = (void *)dr_regs + USB_DR_SYS_OFFSET;
 #endif
 
 	/* Initialize USB clocks */
-- 
1.6.4

^ permalink raw reply related
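
The truncation described in the patch above can be reproduced in user space. The sketch below is not driver code: it models addresses as plain 64-bit integers, and EXAMPLE_SYS_OFFSET and both function names are made up for illustration.

```c
#include <assert.h>	/* assert() is only needed by callers checking results */
#include <stdint.h>

/* Illustrative offset only; the driver uses its own USB_DR_SYS_OFFSET. */
#define EXAMPLE_SYS_OFFSET 0x400u

/* Models the old expression: the base is squeezed through 32 bits first. */
static uint64_t sys_regs_addr_via_u32(uint64_t base)
{
	return (uint64_t)((uint32_t)base + EXAMPLE_SYS_OFFSET);
}

/* Models the fixed expression: arithmetic at full pointer width. */
static uint64_t sys_regs_addr_full_width(uint64_t base)
{
	return base + EXAMPLE_SYS_OFFSET;
}
```

For a base that fits in 32 bits, such as 0xf0001000, both functions agree. For a 64-bit base such as 0x8000080000100000 the first one drops the upper half and yields 0x00100400, which is the kind of bogus address that leads to the call trace mentioned in the changelog.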

* [RFC PATCH v4 02/10] fadump: Reserve the memory for firmware assisted dump.
From: Mahesh J Salgaonkar @ 2011-11-07  9:55 UTC (permalink / raw)
  To: linuxppc-dev, Linux Kernel, Benjamin Herrenschmidt
  Cc: Amerigo Wang, Milton Miller, Eric W. Biederman, Anton Blanchard
In-Reply-To: <20111107095215.1997.14866.stgit@mars.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

Reserve the memory during early boot to preserve CPU state data, HPTE region
and RMR region data in case of kernel crash. At the time of crash, powerpc
firmware will store CPU state data, HPTE region data and move RMR region
data to the reserved memory area.

If the firmware-assisted dump fails to reserve the memory, then fall back
to the existing kexec-based kdump.

Most of the code to reserve memory has been adapted from the phyp
assisted dump implementation written by Linas Vepstas and
Manish Ahuja.

Change in v2:
- Modified to use standard pr_debug() macro.
- Modified early_init_dt_scan_fw_dump() to get the size of
  "ibm,configure-kernel-dump-sizes" property and use it to iterate through
  an array of dump sections.
- Introduced boot option 'fadump_reserve_mem=' to let user specify the
  fadump boot memory to be reserved.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/fadump.h |   65 ++++++++++
 arch/powerpc/kernel/Makefile      |    1 
 arch/powerpc/kernel/fadump.c      |  250 +++++++++++++++++++++++++++++++++++++
 arch/powerpc/kernel/prom.c        |   15 ++
 4 files changed, 330 insertions(+), 1 deletions(-)
 create mode 100644 arch/powerpc/include/asm/fadump.h
 create mode 100644 arch/powerpc/kernel/fadump.c

diff --git a/arch/powerpc/include/asm/fadump.h b/arch/powerpc/include/asm/fadump.h
new file mode 100644
index 0000000..0b040c1
--- /dev/null
+++ b/arch/powerpc/include/asm/fadump.h
@@ -0,0 +1,65 @@
+/*
+ * Firmware Assisted dump header file.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright 2011 IBM Corporation
+ * Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
+ */
+
+#ifndef __PPC64_FA_DUMP_H__
+#define __PPC64_FA_DUMP_H__
+
+#ifdef CONFIG_FA_DUMP
+
+/*
+ * The RMR region will be saved for later dumping when kernel crashes.
+ * Its end follows ppc64_rma_size rather than a fixed 256MB.
+ */
+#define RMR_START	0x0
+#define RMR_END		(ppc64_rma_size)
+
+/*
+ * On some Power systems where RMO is 128MB, it still requires minimum of
+ * 256MB for kernel to boot successfully.
+ */
+#define MIN_BOOT_MEM	((RMR_END < (0x1UL << 28)) ? (0x1UL << 28) : RMR_END)
+
+/* Firmware provided dump sections */
+#define FADUMP_CPU_STATE_DATA	0x0001
+#define FADUMP_HPTE_REGION	0x0002
+#define FADUMP_REAL_MODE_REGION	0x0011
+
+struct fw_dump {
+	unsigned long	cpu_state_data_size;
+	unsigned long	hpte_region_size;
+	unsigned long	boot_memory_size;
+	unsigned long	reserve_dump_area_start;
+	unsigned long	reserve_dump_area_size;
+	/* cmd line option during boot */
+	unsigned long	reserve_bootvar;
+
+	int		ibm_configure_kernel_dump;
+
+	unsigned long	fadump_enabled:1;
+	unsigned long	fadump_supported:1;
+	unsigned long	dump_active:1;
+};
+
+extern int early_init_dt_scan_fw_dump(unsigned long node,
+		const char *uname, int depth, void *data);
+extern int fadump_reserve_mem(void);
+#endif
+#endif
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index ce4f7f1..59b549c 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -60,6 +60,7 @@ obj-$(CONFIG_IBMVIO)		+= vio.o
 obj-$(CONFIG_IBMEBUS)           += ibmebus.o
 obj-$(CONFIG_GENERIC_TBSYNC)	+= smp-tbsync.o
 obj-$(CONFIG_CRASH_DUMP)	+= crash_dump.o
+obj-$(CONFIG_FA_DUMP)		+= fadump.o
 ifeq ($(CONFIG_PPC32),y)
 obj-$(CONFIG_E500)		+= idle_e500.o
 endif
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
new file mode 100644
index 0000000..05dffc0
--- /dev/null
+++ b/arch/powerpc/kernel/fadump.c
@@ -0,0 +1,250 @@
+/*
+ * Firmware Assisted dump: A robust mechanism to get reliable kernel crash
+ * dump with assistance from firmware. This approach does not use kexec,
+ * instead firmware assists in booting the kdump kernel while preserving
+ * memory contents. Most of the code has been adapted from the phyp
+ * assisted dump implementation written by Linas Vepstas and
+ * Manish Ahuja.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright 2011 IBM Corporation
+ * Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
+ */
+
+#undef DEBUG
+#define pr_fmt(fmt) "fadump: " fmt
+
+#include <linux/string.h>
+#include <linux/memblock.h>
+
+#include <asm/page.h>
+#include <asm/prom.h>
+#include <asm/rtas.h>
+#include <asm/fadump.h>
+
+/*
+ * The RTAS property "ibm,configure-kernel-dump-sizes" returns dump
+ * sizes for the firmware provided dump sections (cpu state data
+ * and hpte region).
+ */
+struct dump_section {
+	u32		dump_section;
+	unsigned long	section_size;
+} __packed;
+
+static struct fw_dump fw_dump;
+
+/* Scan the Firmware Assisted dump configuration details. */
+int __init early_init_dt_scan_fw_dump(unsigned long node,
+			const char *uname, int depth, void *data)
+{
+	const struct dump_section *sections;
+	int i, num_sections;
+	unsigned long size;
+	const int *token;
+
+	if (depth != 1 || strcmp(uname, "rtas") != 0)
+		return 0;
+
+	/*
+	 * Check if Firmware Assisted dump is supported. if yes, check
+	 * if dump has been initiated on last reboot.
+	 */
+	token = of_get_flat_dt_prop(node, "ibm,configure-kernel-dump", NULL);
+	if (!token)
+		return 0;
+
+	fw_dump.fadump_supported = 1;
+	fw_dump.ibm_configure_kernel_dump = *token;
+
+	/*
+	 * The 'ibm,kernel-dump' property is present under the rtas node
+	 * only if there is dump data waiting for us.
+	 */
+	if (of_get_flat_dt_prop(node, "ibm,kernel-dump", NULL))
+		fw_dump.dump_active = 1;
+
+	/* Get the sizes required to store dump data for the firmware provided
+	 * dump sections.
+	 */
+	sections = of_get_flat_dt_prop(node, "ibm,configure-kernel-dump-sizes",
+					&size);
+
+	if (!sections)
+		return 0;
+
+	num_sections = size / sizeof(struct dump_section);
+
+	for (i = 0; i < num_sections; i++) {
+		switch (sections[i].dump_section) {
+		case FADUMP_CPU_STATE_DATA:
+			fw_dump.cpu_state_data_size = sections[i].section_size;
+			break;
+		case FADUMP_HPTE_REGION:
+			fw_dump.hpte_region_size = sections[i].section_size;
+			break;
+		}
+	}
+	return 1;
+}
+
+/**
+ * calculate_reserve_size() - reserve variable boot area 5% of System RAM
+ *
+ * Function to find the largest memory size we need to reserve during early
+ * boot process. This will be the size of the memory that is required for a
+ * kernel to boot successfully.
+ *
+ * This function has been taken from phyp-assisted dump feature implementation.
+ *
+ * returns larger of 256MB or 5% rounded down to multiples of 256MB.
+ *
+ * TODO: Come up with better approach to find out more accurate memory size
+ * that is required for a kernel to boot successfully.
+ *
+ */
+static inline unsigned long calculate_reserve_size(void)
+{
+	unsigned long size;
+
+	/*
+	 * Check if the size is specified through fadump_reserve_mem= cmdline
+	 * option. If yes, then use that.
+	 */
+	if (fw_dump.reserve_bootvar)
+		return fw_dump.reserve_bootvar;
+
+	/* divide by 20 to get 5% of value */
+	size = memblock_end_of_DRAM();
+	do_div(size, 20);
+
+	/* round it down to a multiple of 256MB */
+	size = size & ~0x0FFFFFFFUL;
+
+	/* Truncate to memory_limit. We don't want to over reserve the memory.*/
+	if (memory_limit && size > memory_limit)
+		size = memory_limit;
+
+	return (size > MIN_BOOT_MEM ? size : MIN_BOOT_MEM);
+}
+
+/*
+ * Calculate the total memory size required to be reserved for
+ * firmware-assisted dump registration.
+ */
+static unsigned long get_dump_area_size(void)
+{
+	unsigned long size = 0;
+
+	size += fw_dump.cpu_state_data_size;
+	size += fw_dump.hpte_region_size;
+	size += fw_dump.boot_memory_size;
+
+	size = PAGE_ALIGN(size);
+	return size;
+}
+
+int __init fadump_reserve_mem(void)
+{
+	unsigned long base, size, memory_boundary;
+
+	if (!fw_dump.fadump_enabled)
+		return 0;
+
+	if (!fw_dump.fadump_supported) {
+		printk(KERN_ERR "Firmware-assisted dump is not supported on"
+				" this hardware\n");
+		fw_dump.fadump_enabled = 0;
+		return 0;
+	}
+	/* Initialize boot memory size */
+	fw_dump.boot_memory_size = calculate_reserve_size();
+
+	/*
+	 * Calculate the memory boundary.
+	 * If memory_limit is less than actual memory boundary then reserve
+	 * the memory for fadump beyond the memory_limit and adjust the
+	 * memory_limit accordingly, so that the running kernel can run with
+	 * specified memory_limit.
+	 */
+	if (memory_limit && memory_limit < memblock_end_of_DRAM()) {
+		size = get_dump_area_size();
+		if ((memory_limit + size) < memblock_end_of_DRAM())
+			memory_limit += size;
+		else
+			memory_limit = memblock_end_of_DRAM();
+		printk(KERN_INFO "Adjusted memory_limit for firmware-assisted"
+				" dump, now %#016llx\n",
+				(unsigned long long)memory_limit);
+	}
+	if (memory_limit)
+		memory_boundary = memory_limit;
+	else
+		memory_boundary = memblock_end_of_DRAM();
+
+	if (fw_dump.dump_active) {
+		printk(KERN_INFO "Firmware-assisted dump is active.\n");
+		/*
+		 * If last boot has crashed then reserve all the memory
+		 * above boot_memory_size so that we don't touch it until
+		 * dump is written to disk by userspace tool. This memory
+		 * will be released for general use once the dump is saved.
+		 */
+		base = fw_dump.boot_memory_size;
+		size = memory_boundary - base;
+		memblock_reserve(base, size);
+		printk(KERN_INFO "Reserved %ldMB of memory at %ldMB "
+				"for saving crash dump\n",
+				(unsigned long)(size >> 20),
+				(unsigned long)(base >> 20));
+	} else {
+		/* Reserve the memory at the top of memory. */
+		size = get_dump_area_size();
+		base = memory_boundary - size;
+		memblock_reserve(base, size);
+		printk(KERN_INFO "Reserved %ldMB of memory at %ldMB "
+				"for firmware-assisted dump\n",
+				(unsigned long)(size >> 20),
+				(unsigned long)(base >> 20));
+	}
+	fw_dump.reserve_dump_area_start = base;
+	fw_dump.reserve_dump_area_size = size;
+	return 1;
+}
+
+/* Look for fadump= cmdline option. */
+static int __init early_fadump_param(char *p)
+{
+	if (!p)
+		return 1;
+
+	if (p[0] == '1')
+		fw_dump.fadump_enabled = 1;
+	else if (p[0] == '0')
+		fw_dump.fadump_enabled = 0;
+
+	return 0;
+}
+early_param("fadump", early_fadump_param);
+
+/* Look for fadump_reserve_mem= cmdline option */
+static int __init early_fadump_reserve_mem(char *p)
+{
+	if (p)
+		fw_dump.reserve_bootvar = memparse(p, &p);
+	return 0;
+}
+early_param("fadump_reserve_mem", early_fadump_reserve_mem);
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 174e1e9..3fe75eb 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -54,6 +54,7 @@
 #include <asm/pci-bridge.h>
 #include <asm/phyp_dump.h>
 #include <asm/kexec.h>
+#include <asm/fadump.h>
 #include <mm/mmu_decl.h>
 
 #ifdef DEBUG
@@ -712,6 +713,11 @@ void __init early_init_devtree(void *params)
 	of_scan_flat_dt(early_init_dt_scan_phyp_dump, NULL);
 #endif
 
+#ifdef CONFIG_FA_DUMP
+	/* scan tree to see if dump is active during last boot */
+	of_scan_flat_dt(early_init_dt_scan_fw_dump, NULL);
+#endif
+
 	/* Retrieve various informations from the /chosen node of the
 	 * device-tree, including the platform type, initrd location and
 	 * size, TCE reserve, and more ...
@@ -735,7 +741,14 @@ void __init early_init_devtree(void *params)
 	if (PHYSICAL_START > MEMORY_START)
 		memblock_reserve(MEMORY_START, 0x8000);
 	reserve_kdump_trampoline();
-	reserve_crashkernel();
+#ifdef CONFIG_FA_DUMP
+	/*
+	 * If we fail to reserve memory for firmware-assisted dump then
+	 * fall back to kexec-based kdump.
+	 */
+	if (fadump_reserve_mem() == 0)
+#endif
+		reserve_crashkernel();
 	early_reserve_mem();
 	phyp_dump_reserve_mem();
 

^ permalink raw reply related
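
The sizing rule in calculate_reserve_size() above is easy to misread, so here is a user-space sketch of the same arithmetic. It is an illustration under assumptions, not the kernel function: the DRAM end, RMA size, memory limit and fadump_reserve_mem= value are passed as parameters instead of being read from kernel state, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_256M (UINT64_C(1) << 28)

/*
 * Boot memory size as the patch computes it: an explicit
 * fadump_reserve_mem= value wins; otherwise take 5% of RAM, round it
 * down to a multiple of 256MB, cap it at memory_limit (if set), and
 * never go below max(256MB, RMA size).
 */
static uint64_t sketch_reserve_size(uint64_t dram_end, uint64_t rma_size,
				    uint64_t memory_limit, uint64_t bootvar)
{
	uint64_t min_boot_mem = rma_size < SZ_256M ? SZ_256M : rma_size;
	uint64_t size;

	assert(dram_end > 0);

	if (bootvar)
		return bootvar;

	size = dram_end / 20;		/* 5% of RAM */
	size &= ~(SZ_256M - 1);		/* round down to a 256MB multiple */

	if (memory_limit && size > memory_limit)
		size = memory_limit;

	return size > min_boot_mem ? size : min_boot_mem;
}
```

With 64GB of RAM and a 128MB RMA this gives 3GB (5% is about 3.2GB, rounded down). With 2GB of RAM the 5% figure rounds down to zero, so the 256MB minimum applies.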

* [RFC PATCH v4 07/10] fadump: Introduce cleanup routine to invalidate /proc/vmcore.
From: Mahesh J Salgaonkar @ 2011-11-07  9:56 UTC (permalink / raw)
  To: linuxppc-dev, Linux Kernel, Benjamin Herrenschmidt
  Cc: Amerigo Wang, Milton Miller, Eric W. Biederman, Anton Blanchard
In-Reply-To: <20111107095215.1997.14866.stgit@mars.in.ibm.com>

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

With firmware-assisted dump support we don't require a reboot when we
are in the second kernel after a crash. The second kernel is a normal
kernel boot: it knows about the entire system RAM and has page tables
initialized for all of it. Hence, once the dump is saved to disk, we
can just release the reserved memory area for general use and continue
with the second kernel as the production kernel.

Once we release the reserved memory that contains the dump data,
'/proc/vmcore' is no longer valid. This patch therefore introduces
a cleanup routine that invalidates and removes the /proc/vmcore file. This
routine will be invoked before we release the reserved dump memory area.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 fs/proc/vmcore.c |   23 +++++++++++++++++++++++
 1 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index cd99bf5..fae5526 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -699,3 +699,26 @@ static int __init vmcore_init(void)
 	return 0;
 }
 module_init(vmcore_init)
+
+/* Cleanup function for vmcore module. */
+void vmcore_cleanup(void)
+{
+	struct list_head *pos, *next;
+
+	if (proc_vmcore) {
+		remove_proc_entry(proc_vmcore->name, proc_vmcore->parent);
+		proc_vmcore = NULL;
+	}
+
+	/* clear the vmcore list. */
+	list_for_each_safe(pos, next, &vmcore_list) {
+		struct vmcore *m;
+
+		m = list_entry(pos, struct vmcore, list);
+		list_del(&m->list);
+		kfree(m);
+	}
+	kfree(elfcorebuf);
+	elfcorebuf = NULL;
+}
+EXPORT_SYMBOL_GPL(vmcore_cleanup);

^ permalink raw reply related
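
vmcore_cleanup() above relies on two habits: fetching the next element before freeing the current one (what list_for_each_safe() provides) and resetting pointers so a second call is harmless. The user-space sketch below shows the same pattern on a simple singly linked list. It does not use the kernel's list.h, and all names in it are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct chunk {
	struct chunk *next;
	int id;
};

/* Prepend a node; returns the new head, or the old head if malloc fails. */
static struct chunk *chunk_push(struct chunk *head, int id)
{
	struct chunk *c = malloc(sizeof(*c));

	if (!c)
		return head;
	c->id = id;
	c->next = head;
	return c;
}

/* Free every node and reset the head; safe to call on an empty list. */
static size_t chunk_free_all(struct chunk **head)
{
	struct chunk *pos, *next;
	size_t freed = 0;

	assert(head != NULL);

	for (pos = *head; pos; pos = next) {
		next = pos->next;	/* read before free(), as the _safe iterator does */
		free(pos);
		freed++;
	}
	*head = NULL;	/* same idea as proc_vmcore = NULL and elfcorebuf = NULL */
	return freed;
}
```

Calling chunk_free_all() twice in a row frees the nodes once and then does nothing, which is the property the NULL resets in the patch are there to provide.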

* Re: New location of powerpc git tree
From: Stephen Rothwell @ 2011-11-07 10:26 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev
In-Reply-To: <1320622150.2779.6.camel@pasglop>

Hi Ben,

On Mon, 07 Nov 2011 10:29:10 +1100 Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>
> I've moved the powerpc git tree back to kernel.org. The URL should be
> back to normal for users:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc.git

OK, I have switched back to that, now.

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

^ permalink raw reply

* Re: [PATCH 4/7] powerpc/85xx: add support to JOG feature using cpufreq interface
From: Zhao Chenhui @ 2011-11-07 10:27 UTC (permalink / raw)
  To: Scott Wood; +Cc: Jerry Huang, linuxppc-dev
In-Reply-To: <4EB4403E.3040700@freescale.com>

On Fri, Nov 04, 2011 at 02:42:54PM -0500, Scott Wood wrote:
> On 11/04/2011 07:36 AM, Zhao Chenhui wrote:
> > From: Li Yang <leoli@freescale.com>
> > 
> > Some 85xx silicons like MPC8536 and P1022 has the JOG PM feature.
> > 
> > The patch adds the support to change CPU frequency using the standard
> > cpufreq interface. Add the all PLL ratio core support. The ratio CORE
> > to CCB can 1:1, 1.5, 2:1, 2.5:1, 3:1, 3.5:1 and 4:1
> > 
> > Signed-off-by: Dave Liu <daveliu@freescale.com>
> > Signed-off-by: Li Yang <leoli@freescale.com>
> > Signed-off-by: Jerry Huang <Chang-Ming.Huang@freescale.com>
> > Signed-off-by: Zhao Chenhui <chenhui.zhao@freescale.com>
> > ---
> >  arch/powerpc/platforms/85xx/Makefile  |    1 +
> >  arch/powerpc/platforms/85xx/cpufreq.c |  255 +++++++++++++++++++++++++++++++++
> >  arch/powerpc/platforms/Kconfig        |    8 +
> >  3 files changed, 264 insertions(+), 0 deletions(-)
> >  create mode 100644 arch/powerpc/platforms/85xx/cpufreq.c
> 
> Please name this something more specific, such as 85xx/cpufreq-jog.c
> 
> Other 85xx/qoriq chips, such as p4080, have different mechanisms for
> updating CPU frequency.
> 
> > +static struct cpufreq_frequency_table mpc85xx_freqs[] = {
> > +	{2,	0},
> > +	{3,	0},
> > +	{4,	0},
> > +	{5,	0},
> > +	{6,	0},
> > +	{7,	0},
> > +	{8,	0},
> > +	{0,	CPUFREQ_TABLE_END},
> > +};
> 
> Only p1022 can handle 1:1 (index 2).
> 
> > +static void set_pll(unsigned int pll, int cpu)
> > +{
> > +	int shift;
> > +	u32 busfreq, corefreq, val;
> > +	u32 core_spd, mask, tmp;
> > +
> > +	tmp = in_be32(guts + PMJCR);
> > +	shift = (cpu == 1) ? CORE1_RATIO_SHIFT : CORE0_RATIO_SHIFT;
> > +	busfreq = fsl_get_sys_freq();
> > +	val = (pll & CORE_RATIO_MASK) << shift;
> > +
> > +	corefreq = ((busfreq * pll) >> 1);
> 
> Use "/ 2", not ">> 1".  Same asm code, more readable.
> 
> > +	/* must set the bit[18/19] if the requested core freq > 533 MHz */
> > +	core_spd = (cpu == 1) ? PMJCR_CORE1_SPD_MASK : PMJCR_CORE0_SPD_MASK;
> > +	if (corefreq > FREQ_533MHz)
> > +		val |= core_spd;
> 
> this is the cutoff for p1022 -- on mpc8536 the manual says the cutoff is
> 800 MHz.
> 
> > +	mask = (cpu == 1) ? (PMJCR_CORE1_RATIO_MASK | PMJCR_CORE1_SPD_MASK) :
> > +		(PMJCR_CORE0_RATIO_MASK | PMJCR_CORE0_SPD_MASK);
> > +	tmp &= ~mask;
> > +	tmp |= val;
> > +	out_be32(guts + PMJCR, tmp);
> 
> clrsetbits_be32()
> 
> > +	val = in_be32(guts + PMJCR);
> > +	out_be32(guts + POWMGTCSR,
> > +			POWMGTCSR_LOSSLESS_MASK | POWMGTCSR_JOG_MASK);
> 
> setbits32()
> 
> > +	pr_debug("PMJCR request %08x at CPU %d\n", tmp, cpu);
> > +}
> > +
> > +static void verify_pll(int cpu)
> > +{
> > +	int shift;
> > +	u32 busfreq, pll, corefreq;
> > +
> > +	shift = (cpu == 1) ? CORE1_RATIO_SHIFT : CORE0_RATIO_SHIFT;
> > +	busfreq = fsl_get_sys_freq();
> > +	pll = (in_be32(guts + PORPLLSR) >> shift) & CORE_RATIO_MASK;
> > +
> > +	corefreq = (busfreq * pll) >> 1;
> > +	corefreq /= 1000000;
> > +	pr_debug("PORPLLSR core freq %dMHz at CPU %d\n", corefreq, cpu);
> > +}
> 
> It looks like the entire point of this function is to make a debug
> print...  #ifdef DEBUG the contents?  Or if we mark fsl_get_sys_freq()
> as __pure (or better, read this once at init, since it involves
> searching the device tree), will it all get optimized away?
> 
> 
> > +	/* initialize frequency table */
> > +	pr_info("core %d frequency table:\n", policy->cpu);
> > +	for (i = 0; mpc85xx_freqs[i].frequency != CPUFREQ_TABLE_END; i++) {
> > +		mpc85xx_freqs[i].frequency =
> > +				(busfreq * mpc85xx_freqs[i].index) >> 1;
> > +		pr_info("%d: %dkHz\n", i, mpc85xx_freqs[i].frequency);
> > +	}
> 
> This should be pr_debug.
> 
> > +	/* the latency of a transition, the unit is ns */
> > +	policy->cpuinfo.transition_latency = 2000;
> > +
> > +	cur_pll = get_pll(policy->cpu);
> > +	pr_debug("current pll is at %d\n", cur_pll);
> > +
> > +	for (i = 0; mpc85xx_freqs[i].frequency != CPUFREQ_TABLE_END; i++) {
> > +		if (mpc85xx_freqs[i].index == cur_pll)
> > +			policy->cur = mpc85xx_freqs[i].frequency;
> > +	}
> 
> You could combine these loops.
> 
> > +	/* this ensures that policy->cpuinfo_min
> > +	 * and policy->cpuinfo_max are set correctly */
> 
> comment style
> 
> > +static int mpc85xx_cpufreq_target(struct cpufreq_policy *policy,
> > +			      unsigned int target_freq,
> > +			      unsigned int relation)
> > +{
> > +	struct cpufreq_freqs freqs;
> > +	unsigned int new;
> > +
> > +	cpufreq_frequency_table_target(policy,
> > +				       mpc85xx_freqs,
> > +				       target_freq,
> > +				       relation,
> > +				       &new);
> > +
> > +	freqs.old = policy->cur;
> > +	freqs.new = mpc85xx_freqs[new].frequency;
> > +	freqs.cpu = policy->cpu;
> > +
> > +	mutex_lock(&mpc85xx_switch_mutex);
> > +	cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
> > +
> > +	pr_info("Setting frequency for core %d to %d kHz, " \
> > +		 "PLL ratio is %d/2\n",
> > +		 policy->cpu,
> > +		 mpc85xx_freqs[new].frequency,
> > +		 mpc85xx_freqs[new].index);
> > +
> > +	set_pll(mpc85xx_freqs[new].index, policy->cpu);
> > +
> > +	cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
> > +	mutex_unlock(&mpc85xx_switch_mutex);
> > +
> > +	ppc_proc_freq = freqs.new * 1000ul;
> 
> ppc_proc_freq is global -- can CPUs not have their frequencies adjusted
> separately?
> 
> It should be under the lock, if the lock is needed at all.
> 

There is only one ppc_proc_freq. no lock.

> > +/*
> > + * module init and destroy
> > + */
> > +static struct of_device_id mpc85xx_jog_ids[] __initdata = {
> > +	{ .compatible = "fsl,mpc8536-guts", },
> > +	{ .compatible = "fsl,p1022-guts", },
> > +	{}
> > +};
> > +
> > +static int __init mpc85xx_cpufreq_init(void)
> > +{
> > +	struct device_node *np;
> > +
> > +	pr_info("Freescale MPC85xx CPU frequency switching driver\n");
> 
> If you're going to print something here, print it after you find a node
> you can work with -- not on all 85xx/qoriq that have this driver enabled.
> 
> -Scott

Thanks. I will fix them all.

-chenhui

^ permalink raw reply
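
Two of the review points in the message above, the "/ 2" ratio arithmetic and the clrsetbits_be32() suggestion, can be sketched in plain C. This is a user-space illustration only: the function names are hypothetical, no registers are touched, and the 64-bit widening is a choice made for this sketch rather than something taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Core clock for a JOG ratio index. The index counts half-steps of the
 * bus (CCB) clock, so index 2 is 1:1 and index 8 is 4:1.
 */
static uint64_t jog_core_freq_hz(uint32_t busfreq_hz, unsigned int pll_index)
{
	assert(pll_index >= 2 && pll_index <= 8);

	/* Widen first so busfreq * index cannot wrap in 32 bits. */
	return ((uint64_t)busfreq_hz * pll_index) / 2;
}

/* The value a clear-then-set register update would write back. */
static uint32_t clrsetbits32_value(uint32_t old, uint32_t clear, uint32_t set)
{
	return (old & ~clear) | set;
}
```

For a 400 MHz bus, index 3 gives 600 MHz. With a 600 MHz bus and index 8 the intermediate product is 4.8e9, which no longer fits in 32 bits, hence the widening in this sketch.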
