* Re: [PATCH v4 7/8] lockdep: Change hardirq{s_enabled,_context} to per-cpu variables
From: Peter Zijlstra @ 2020-06-24 9:00 UTC (permalink / raw)
To: Marco Elver
Cc: linux-s390, linuxppc-dev, bigeasy, x86, heiko.carstens,
linux-kernel, rostedt, davem, Ahmed S. Darwish, sparclinux, linux,
tglx, will, mingo
In-Reply-To: <20200623202404.GE2483@worktop.programming.kicks-ass.net>
On Tue, Jun 23, 2020 at 10:24:04PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 08:12:32PM +0200, Peter Zijlstra wrote:
> > Fair enough; I'll rip it all up and boot a KCSAN kernel, see what if
> > anything happens.
>
> OK, so the below patch doesn't seem to have any nasty recursion issues
> here. The only 'problem' is that lockdep now sees report_lock can cause
> deadlocks.
>
> It is completely right about it too, but I don't suspect there's much we
> can do about it, it's pretty much the standard printk() with scheduler
> locks held report.
So I've been getting tons and tons of this:
[ 60.471348] ==================================================================
[ 60.479427] BUG: KCSAN: data-race in __rcu_read_lock / __rcu_read_unlock
[ 60.486909]
[ 60.488572] write (marked) to 0xffff88840fff1cf0 of 4 bytes by interrupt on cpu 1:
[ 60.497026] __rcu_read_lock+0x37/0x60
[ 60.501214] cpuacct_account_field+0x1b/0x170
[ 60.506081] task_group_account_field+0x32/0x160
[ 60.511238] account_system_time+0xe6/0x110
[ 60.515912] update_process_times+0x1d/0xd0
[ 60.520585] tick_sched_timer+0xfc/0x180
[ 60.524967] __hrtimer_run_queues+0x271/0x440
[ 60.529832] hrtimer_interrupt+0x222/0x670
[ 60.534409] __sysvec_apic_timer_interrupt+0xb3/0x1a0
[ 60.540052] asm_call_on_stack+0x12/0x20
[ 60.544434] sysvec_apic_timer_interrupt+0xba/0x130
[ 60.549882] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 60.555621] delay_tsc+0x7d/0xe0
[ 60.559226] kcsan_setup_watchpoint+0x292/0x4e0
[ 60.564284] __rcu_read_unlock+0x73/0x2c0
[ 60.568763] __unlock_page_memcg+0xda/0xf0
[ 60.573338] unlock_page_memcg+0x32/0x40
[ 60.577721] page_remove_rmap+0x5c/0x200
[ 60.582104] unmap_page_range+0x83c/0xc10
[ 60.586582] unmap_single_vma+0xb0/0x150
[ 60.590963] unmap_vmas+0x81/0xe0
[ 60.594663] exit_mmap+0x135/0x2b0
[ 60.598464] __mmput+0x21/0x150
[ 60.601970] mmput+0x2a/0x30
[ 60.605176] exit_mm+0x2fc/0x350
[ 60.608780] do_exit+0x372/0xff0
[ 60.612385] do_group_exit+0x139/0x140
[ 60.616571] __do_sys_exit_group+0xb/0x10
[ 60.621048] __se_sys_exit_group+0xa/0x10
[ 60.625524] __x64_sys_exit_group+0x1b/0x20
[ 60.630189] do_syscall_64+0x6c/0xe0
[ 60.634182] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 60.639820]
[ 60.641485] read to 0xffff88840fff1cf0 of 4 bytes by task 2430 on cpu 1:
[ 60.648969] __rcu_read_unlock+0x73/0x2c0
[ 60.653446] __unlock_page_memcg+0xda/0xf0
[ 60.658019] unlock_page_memcg+0x32/0x40
[ 60.662400] page_remove_rmap+0x5c/0x200
[ 60.666782] unmap_page_range+0x83c/0xc10
[ 60.671259] unmap_single_vma+0xb0/0x150
[ 60.675641] unmap_vmas+0x81/0xe0
[ 60.679341] exit_mmap+0x135/0x2b0
[ 60.683141] __mmput+0x21/0x150
[ 60.686647] mmput+0x2a/0x30
[ 60.689853] exit_mm+0x2fc/0x350
[ 60.693458] do_exit+0x372/0xff0
[ 60.697062] do_group_exit+0x139/0x140
[ 60.701248] __do_sys_exit_group+0xb/0x10
[ 60.705724] __se_sys_exit_group+0xa/0x10
[ 60.710201] __x64_sys_exit_group+0x1b/0x20
[ 60.714872] do_syscall_64+0x6c/0xe0
[ 60.718864] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 60.724503]
[ 60.726156] Reported by Kernel Concurrency Sanitizer on:
[ 60.732089] CPU: 1 PID: 2430 Comm: sshd Not tainted 5.8.0-rc2-00186-gb4ee11fe08b3-dirty #303
[ 60.741510] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[ 60.752957] ==================================================================
And I figured a quick way to get rid of that would be something like the
below, seeing how volatile gets auto annotated... but that doesn't seem
to actually work.
What am I missing?
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 352223664ebd..b08861118e1a 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -351,17 +351,17 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
static void rcu_preempt_read_enter(void)
{
- current->rcu_read_lock_nesting++;
+ (*(volatile int *)¤t->rcu_read_lock_nesting)++;
}
static int rcu_preempt_read_exit(void)
{
- return --current->rcu_read_lock_nesting;
+ return --(*(volatile int *)¤t->rcu_read_lock_nesting);
}
static void rcu_preempt_depth_set(int val)
{
- current->rcu_read_lock_nesting = val;
+ WRITE_ONCE(current->rcu_read_lock_nesting, val);
}
/*
^ permalink raw reply related
* [PATCH v2 6/6] powerpc/pseries/iommu: Avoid errors when DDW starts at 0x00
From: Leonardo Bras @ 2020-06-24 6:24 UTC (permalink / raw)
To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
Leonardo Bras, Alexey Kardashevskiy, Thiago Jung Bauermann,
Ram Pai
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200624062411.367796-1-leobras.c@gmail.com>
As of today, enable_ddw() will return a non-null DMA address if the
created DDW maps the whole partition. If the address is valid,
iommu_bypass_supported_pSeriesLP() will consider iommu bypass enabled.
This can cause some trouble if the DDW happens to start at 0x00.
Instead if checking if the address is non-null, check directly if
the DDW maps the whole partition, so it can bypass iommu.
Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
---
arch/powerpc/platforms/pseries/iommu.c | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 2d217cda4075..967634a379b0 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1078,7 +1078,8 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
*
* returns the dma offset for use by the direct mapped DMA code.
*/
-static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn,
+ bool *maps_partition)
{
int len, ret;
struct ddw_query_response query;
@@ -1237,9 +1238,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
list_add(&window->list, &direct_window_list);
spin_unlock(&direct_window_list_lock);
- /* Only returns the dma_addr if DDW maps the whole partition */
if (len == order_base_2(max_addr))
- dma_addr = be64_to_cpu(ddwprop->dma_base);
+ *maps_partition = true;
+ dma_addr = be64_to_cpu(ddwprop->dma_base);
goto out_unlock;
out_free_window:
@@ -1324,6 +1325,7 @@ static bool iommu_bypass_supported_pSeriesLP(struct pci_dev *pdev, u64 dma_mask)
{
struct device_node *dn = pci_device_to_OF_node(pdev), *pdn;
const __be32 *dma_window = NULL;
+ bool ret = false;
/* only attempt to use a new window if 64-bit DMA is requested */
if (dma_mask < DMA_BIT_MASK(64))
@@ -1344,13 +1346,10 @@ static bool iommu_bypass_supported_pSeriesLP(struct pci_dev *pdev, u64 dma_mask)
break;
}
- if (pdn && PCI_DN(pdn)) {
- pdev->dev.archdata.dma_offset = enable_ddw(pdev, pdn);
- if (pdev->dev.archdata.dma_offset)
- return true;
- }
+ if (pdn && PCI_DN(pdn))
+ pdev->dev.archdata.dma_offset = enable_ddw(pdev, pdn, &ret);
- return false;
+ return ret;
}
static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
--
2.25.4
^ permalink raw reply related
* [PATCH v2 5/6] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition
From: Leonardo Bras @ 2020-06-24 6:24 UTC (permalink / raw)
To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
Leonardo Bras, Alexey Kardashevskiy, Thiago Jung Bauermann,
Ram Pai
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200624062411.367796-1-leobras.c@gmail.com>
As of today, if a DDW is created and can't map the whole partition, it's
removed and the default DMA window "ibm,dma-window" is used instead.
Usually this DDW is bigger than the default DMA window, so it would be
better to make use of it instead.
Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
---
arch/powerpc/platforms/pseries/iommu.c | 28 +++++++++++++++++---------
1 file changed, 19 insertions(+), 9 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 4fcf00016fb1..2d217cda4075 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -685,7 +685,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
struct iommu_table *tbl;
struct device_node *dn, *pdn;
struct pci_dn *ppci;
- const __be32 *dma_window = NULL;
+ const __be32 *dma_window = NULL, *alt_dma_window = NULL;
dn = pci_bus_to_OF_node(bus);
@@ -699,8 +699,13 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
break;
}
+ /* If there is a DDW available, use it instead */
+ alt_dma_window = of_get_property(pdn, DIRECT64_PROPNAME, NULL);
+ if (alt_dma_window)
+ dma_window = alt_dma_window;
+
if (dma_window == NULL) {
- pr_debug(" no ibm,dma-window property !\n");
+ pr_debug(" no ibm,dma-window nor linux,direct64-ddr-window-info property !\n");
return;
}
@@ -1166,16 +1171,19 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
query.page_size);
goto out_failed;
}
+
/* verify the window * number of ptes will map the partition */
- /* check largest block * page size > max memory hotplug addr */
max_addr = ddw_memory_hotplug_max();
if (query.largest_available_block < (max_addr >> page_shift)) {
- dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu "
- "%llu-sized pages\n", max_addr, query.largest_available_block,
- 1ULL << page_shift);
- goto out_failed;
+ dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu %llu-sized pages\n",
+ max_addr, query.largest_available_block,
+ 1ULL << page_shift);
+
+ len = order_base_2(query.largest_available_block << page_shift);
+ } else {
+ len = order_base_2(max_addr);
}
- len = order_base_2(max_addr);
+
win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
if (!win64) {
dev_info(&dev->dev,
@@ -1229,7 +1237,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
list_add(&window->list, &direct_window_list);
spin_unlock(&direct_window_list_lock);
- dma_addr = be64_to_cpu(ddwprop->dma_base);
+ /* Only returns the dma_addr if DDW maps the whole partition */
+ if (len == order_base_2(max_addr))
+ dma_addr = be64_to_cpu(ddwprop->dma_base);
goto out_unlock;
out_free_window:
--
2.25.4
^ permalink raw reply related
* [PATCH v2 4/6] powerpc/pseries/iommu: Remove default DMA window before creating DDW
From: Leonardo Bras @ 2020-06-24 6:24 UTC (permalink / raw)
To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
Leonardo Bras, Alexey Kardashevskiy, Thiago Jung Bauermann,
Ram Pai
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200624062411.367796-1-leobras.c@gmail.com>
On LoPAR "DMA Window Manipulation Calls", it's recommended to remove the
default DMA window for the device, before attempting to configure a DDW,
in order to make the maximum resources available for the next DDW to be
created.
This is a requirement for some devices to use DDW, given they only
allow one DMA window.
If setting up a new DDW fails anywhere after the removal of this
default DMA window, it's needed to restore the default DMA window.
For this, an implementation of ibm,reset-pe-dma-windows rtas call is
needed:
Platforms supporting the DDW option starting with LoPAR level 2.7 implement
ibm,ddw-extensions. The first extension available (index 2) carries the
token for ibm,reset-pe-dma-windows rtas call, which is used to restore
the default DMA window for a device, if it has been deleted.
It does so by resetting the TCE table allocation for the PE to it's
boot time value, available in "ibm,dma-window" device tree node.
Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
---
arch/powerpc/platforms/pseries/iommu.c | 70 ++++++++++++++++++++++----
1 file changed, 61 insertions(+), 9 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index a8840d9e1c35..4fcf00016fb1 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1029,6 +1029,39 @@ static phys_addr_t ddw_memory_hotplug_max(void)
return max_addr;
}
+/*
+ * Platforms supporting the DDW option starting with LoPAR level 2.7 implement
+ * ibm,ddw-extensions, which carries the rtas token for
+ * ibm,reset-pe-dma-windows.
+ * That rtas-call can be used to restore the default DMA window for the device.
+ */
+static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
+{
+ int ret;
+ u32 cfg_addr, ddw_ext[DDW_EXT_RESET_DMA_WIN + 1];
+ u64 buid;
+ struct device_node *dn;
+ struct pci_dn *pdn;
+
+ ret = of_property_read_u32_array(par_dn, "ibm,ddw-extensions",
+ &ddw_ext[0], DDW_EXT_RESET_DMA_WIN + 1);
+ if (ret)
+ return;
+
+ dn = pci_device_to_OF_node(dev);
+ pdn = PCI_DN(dn);
+ buid = pdn->phb->buid;
+ cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8));
+
+ ret = rtas_call(ddw_ext[DDW_EXT_RESET_DMA_WIN], 3, 1, NULL, cfg_addr,
+ BUID_HI(buid), BUID_LO(buid));
+ if (ret)
+ dev_info(&dev->dev,
+ "ibm,reset-pe-dma-windows(%x) %x %x %x returned %d ",
+ ddw_ext[1], cfg_addr, BUID_HI(buid), BUID_LO(buid),
+ ret);
+}
+
/*
* If the PE supports dynamic dma windows, and there is space for a table
* that can map all pages in a linear offset, then setup such a table,
@@ -1049,8 +1082,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
u64 dma_addr, max_addr;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
+
struct direct_window *window;
- struct property *win64;
+ struct property *win64, *default_win = NULL, *ddw_ext = NULL;
struct dynamic_dma_window_prop *ddwprop;
struct failed_ddw_pdn *fpdn;
@@ -1085,7 +1119,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
if (ret)
goto out_failed;
- /*
+ /*
* Query if there is a second window of size to map the
* whole partition. Query returns number of windows, largest
* block assigned to PE (partition endpoint), and two bitmasks
@@ -1096,15 +1130,31 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
if (ret != 0)
goto out_failed;
+ /*
+ * If there is no window available, remove the default DMA window,
+ * if it's present. This will make all the resources available to the
+ * new DDW window.
+ * If anything fails after this, we need to restore it, so also check
+ * for extensions presence.
+ */
if (query.windows_available == 0) {
- /*
- * no additional windows are available for this device.
- * We might be able to reallocate the existing window,
- * trading in for a larger page size.
- */
- dev_dbg(&dev->dev, "no free dynamic windows");
- goto out_failed;
+ default_win = of_find_property(pdn, "ibm,dma-window", NULL);
+ ddw_ext = of_find_property(pdn, "ibm,ddw-extensions", NULL);
+ if (default_win && ddw_ext)
+ remove_dma_window(pdn, ddw_avail, default_win);
+
+ /* Query again, to check if the window is available */
+ ret = query_ddw(dev, ddw_avail, &query, pdn);
+ if (ret != 0)
+ goto out_failed;
+
+ if (query.windows_available == 0) {
+ /* no windows are available for this device. */
+ dev_dbg(&dev->dev, "no free dynamic windows");
+ goto out_failed;
+ }
}
+
if (query.page_size & 4) {
page_shift = 24; /* 16MB */
} else if (query.page_size & 2) {
@@ -1194,6 +1244,8 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
kfree(win64);
out_failed:
+ if (default_win && ddw_ext)
+ reset_dma_window(dev, pdn);
fpdn = kzalloc(sizeof(*fpdn), GFP_KERNEL);
if (!fpdn)
--
2.25.4
^ permalink raw reply related
* [PATCH v2 3/6] powerpc/pseries/iommu: Move window-removing part of remove_ddw into remove_dma_window
From: Leonardo Bras @ 2020-06-24 6:24 UTC (permalink / raw)
To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
Leonardo Bras, Alexey Kardashevskiy, Thiago Jung Bauermann,
Ram Pai
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200624062411.367796-1-leobras.c@gmail.com>
Move the window-removing part of remove_ddw into a new function
(remove_dma_window), so it can be used to remove other DMA windows.
It's useful for removing DMA windows that don't create DIRECT64_PROPNAME
property, like the default DMA window from the device, which uses
"ibm,dma-window".
Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
---
arch/powerpc/platforms/pseries/iommu.c | 45 +++++++++++++++-----------
1 file changed, 27 insertions(+), 18 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 558e5441c355..a8840d9e1c35 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -776,25 +776,14 @@ static int __init disable_ddw_setup(char *str)
early_param("disable_ddw", disable_ddw_setup);
-static void remove_ddw(struct device_node *np, bool remove_prop)
+static void remove_dma_window(struct device_node *np, u32 *ddw_avail,
+ struct property *win)
{
struct dynamic_dma_window_prop *dwp;
- struct property *win64;
- u32 ddw_avail[DDW_APPLICABLE_SIZE];
u64 liobn;
- int ret = 0;
-
- ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
- &ddw_avail[0], DDW_APPLICABLE_SIZE);
-
- win64 = of_find_property(np, DIRECT64_PROPNAME, NULL);
- if (!win64)
- return;
-
- if (ret || win64->length < sizeof(*dwp))
- goto delprop;
+ int ret;
- dwp = win64->value;
+ dwp = win->value;
liobn = (u64)be32_to_cpu(dwp->liobn);
/* clear the whole window, note the arg is in kernel pages */
@@ -816,10 +805,30 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
pr_debug("%pOF: successfully removed direct window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
+}
+
+static void remove_ddw(struct device_node *np, bool remove_prop)
+{
+ struct property *win;
+ u32 ddw_avail[DDW_APPLICABLE_SIZE];
+ int ret = 0;
+
+ ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
+ &ddw_avail[0], DDW_APPLICABLE_SIZE);
+ if (ret)
+ return;
+
+ win = of_find_property(np, DIRECT64_PROPNAME, NULL);
+ if (!win)
+ return;
+
+ if (win->length >= sizeof(struct dynamic_dma_window_prop))
+ remove_dma_window(np, ddw_avail, win);
+
+ if (!remove_prop)
+ return;
-delprop:
- if (remove_prop)
- ret = of_remove_property(np, win64);
+ ret = of_remove_property(np, win);
if (ret)
pr_warn("%pOF: failed to remove direct window property: %d\n",
np, ret);
--
2.25.4
^ permalink raw reply related
* [PATCH v2 2/6] powerpc/pseries/iommu: Update call to ibm, query-pe-dma-windows
From: Leonardo Bras @ 2020-06-24 6:24 UTC (permalink / raw)
To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
Leonardo Bras, Alexey Kardashevskiy, Thiago Jung Bauermann,
Ram Pai
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200624062411.367796-1-leobras.c@gmail.com>
From LoPAR level 2.8, "ibm,ddw-extensions" index 3 can make the number of
outputs from "ibm,query-pe-dma-windows" go from 5 to 6.
This change of output size is meant to expand the address size of
largest_available_block PE TCE from 32-bit to 64-bit, which ends up
shifting page_size and migration_capable.
This ends up requiring the update of
ddw_query_response->largest_available_block from u32 to u64, and manually
assigning the values from the buffer into this struct, according to
output size.
Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
---
arch/powerpc/platforms/pseries/iommu.c | 57 +++++++++++++++++++++-----
1 file changed, 47 insertions(+), 10 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 68d2aa9c71a8..558e5441c355 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -44,6 +44,10 @@
#define DDW_REMOVE_PE_DMA_WIN 2
#define DDW_APPLICABLE_SIZE 3
+#define DDW_EXT_SIZE 0
+#define DDW_EXT_RESET_DMA_WIN 1
+#define DDW_EXT_QUERY_OUT_SIZE 2
+
static struct iommu_table_group *iommu_pseries_alloc_group(int node)
{
struct iommu_table_group *table_group;
@@ -339,7 +343,7 @@ struct direct_window {
/* Dynamic DMA Window support */
struct ddw_query_response {
u32 windows_available;
- u32 largest_available_block;
+ u64 largest_available_block;
u32 page_size;
u32 migration_capable;
};
@@ -875,13 +879,29 @@ static int find_existing_ddw_windows(void)
machine_arch_initcall(pseries, find_existing_ddw_windows);
static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail,
- struct ddw_query_response *query)
+ struct ddw_query_response *query,
+ struct device_node *parent)
{
struct device_node *dn;
struct pci_dn *pdn;
- u32 cfg_addr;
+ u32 cfg_addr, query_out[5], ddw_ext[DDW_EXT_QUERY_OUT_SIZE + 1];
u64 buid;
- int ret;
+ int ret, out_sz;
+
+ /*
+ * From LoPAR level 2.8, "ibm,ddw-extensions" index 3 can rule how many
+ * output parameters ibm,query-pe-dma-windows will have, ranging from
+ * 5 to 6.
+ */
+
+ ret = of_property_read_u32_array(parent, "ibm,ddw-extensions",
+ &ddw_ext[0],
+ DDW_EXT_QUERY_OUT_SIZE + 1);
+ if (ret && ddw_ext[DDW_EXT_SIZE] > 1 &&
+ ddw_ext[DDW_EXT_QUERY_OUT_SIZE] == 1)
+ out_sz = 6;
+ else
+ out_sz = 5;
/*
* Get the config address and phb buid of the PE window.
@@ -894,11 +914,28 @@ static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail,
buid = pdn->phb->buid;
cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8));
- ret = rtas_call(ddw_avail[DDW_QUERY_PE_DMA_WIN], 3, 5, (u32 *)query,
+ ret = rtas_call(ddw_avail[DDW_QUERY_PE_DMA_WIN], 3, out_sz, query_out,
cfg_addr, BUID_HI(buid), BUID_LO(buid));
- dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x"
- " returned %d\n", ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr,
- BUID_HI(buid), BUID_LO(buid), ret);
+ dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x returned %d\n",
+ ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr, BUID_HI(buid),
+ BUID_LO(buid), ret);
+
+ switch (out_sz) {
+ case 5:
+ query->windows_available = query_out[0];
+ query->largest_available_block = query_out[1];
+ query->page_size = query_out[2];
+ query->migration_capable = query_out[3];
+ break;
+ case 6:
+ query->windows_available = query_out[0];
+ query->largest_available_block = ((u64)query_out[1] << 32) |
+ query_out[2];
+ query->page_size = query_out[3];
+ query->migration_capable = query_out[4];
+ break;
+ }
+
return ret;
}
@@ -1046,7 +1083,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
* of page sizes: supported and supported for migrate-dma.
*/
dn = pci_device_to_OF_node(dev);
- ret = query_ddw(dev, ddw_avail, &query);
+ ret = query_ddw(dev, ddw_avail, &query, pdn);
if (ret != 0)
goto out_failed;
@@ -1074,7 +1111,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
/* check largest block * page size > max memory hotplug addr */
max_addr = ddw_memory_hotplug_max();
if (query.largest_available_block < (max_addr >> page_shift)) {
- dev_dbg(&dev->dev, "can't map partition max 0x%llx with %u "
+ dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu "
"%llu-sized pages\n", max_addr, query.largest_available_block,
1ULL << page_shift);
goto out_failed;
--
2.25.4
^ permalink raw reply related
* [PATCH v2 1/6] powerpc/pseries/iommu: Create defines for operations in ibm, ddw-applicable
From: Leonardo Bras @ 2020-06-24 6:24 UTC (permalink / raw)
To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
Leonardo Bras, Alexey Kardashevskiy, Thiago Jung Bauermann,
Ram Pai
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200624062411.367796-1-leobras.c@gmail.com>
Create defines to help handling ibm,ddw-applicable values, avoiding
confusion about the index of given operations.
Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
---
arch/powerpc/platforms/pseries/iommu.c | 40 +++++++++++++++-----------
1 file changed, 23 insertions(+), 17 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 6d47b4a3ce39..68d2aa9c71a8 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -39,6 +39,11 @@
#include "pseries.h"
+#define DDW_QUERY_PE_DMA_WIN 0
+#define DDW_CREATE_PE_DMA_WIN 1
+#define DDW_REMOVE_PE_DMA_WIN 2
+#define DDW_APPLICABLE_SIZE 3
+
static struct iommu_table_group *iommu_pseries_alloc_group(int node)
{
struct iommu_table_group *table_group;
@@ -771,12 +776,12 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
{
struct dynamic_dma_window_prop *dwp;
struct property *win64;
- u32 ddw_avail[3];
+ u32 ddw_avail[DDW_APPLICABLE_SIZE];
u64 liobn;
int ret = 0;
ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
- &ddw_avail[0], 3);
+ &ddw_avail[0], DDW_APPLICABLE_SIZE);
win64 = of_find_property(np, DIRECT64_PROPNAME, NULL);
if (!win64)
@@ -798,15 +803,15 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
pr_debug("%pOF successfully cleared tces in window.\n",
np);
- ret = rtas_call(ddw_avail[2], 1, 1, NULL, liobn);
+ ret = rtas_call(ddw_avail[DDW_REMOVE_PE_DMA_WIN], 1, 1, NULL, liobn);
if (ret)
pr_warn("%pOF: failed to remove direct window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
- np, ret, ddw_avail[2], liobn);
+ np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
else
pr_debug("%pOF: successfully removed direct window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
- np, ret, ddw_avail[2], liobn);
+ np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
delprop:
if (remove_prop)
@@ -889,11 +894,11 @@ static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail,
buid = pdn->phb->buid;
cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8));
- ret = rtas_call(ddw_avail[0], 3, 5, (u32 *)query,
- cfg_addr, BUID_HI(buid), BUID_LO(buid));
+ ret = rtas_call(ddw_avail[DDW_QUERY_PE_DMA_WIN], 3, 5, (u32 *)query,
+ cfg_addr, BUID_HI(buid), BUID_LO(buid));
dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x"
- " returned %d\n", ddw_avail[0], cfg_addr, BUID_HI(buid),
- BUID_LO(buid), ret);
+ " returned %d\n", ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr,
+ BUID_HI(buid), BUID_LO(buid), ret);
return ret;
}
@@ -920,15 +925,16 @@ static int create_ddw(struct pci_dev *dev, const u32 *ddw_avail,
do {
/* extra outputs are LIOBN and dma-addr (hi, lo) */
- ret = rtas_call(ddw_avail[1], 5, 4, (u32 *)create,
- cfg_addr, BUID_HI(buid), BUID_LO(buid),
- page_shift, window_shift);
+ ret = rtas_call(ddw_avail[DDW_CREATE_PE_DMA_WIN], 5, 4,
+ (u32 *)create, cfg_addr, BUID_HI(buid),
+ BUID_LO(buid), page_shift, window_shift);
} while (rtas_busy_delay(ret));
dev_info(&dev->dev,
"ibm,create-pe-dma-window(%x) %x %x %x %x %x returned %d "
- "(liobn = 0x%x starting addr = %x %x)\n", ddw_avail[1],
- cfg_addr, BUID_HI(buid), BUID_LO(buid), page_shift,
- window_shift, ret, create->liobn, create->addr_hi, create->addr_lo);
+ "(liobn = 0x%x starting addr = %x %x)\n",
+ ddw_avail[DDW_CREATE_PE_DMA_WIN], cfg_addr, BUID_HI(buid),
+ BUID_LO(buid), page_shift, window_shift, ret, create->liobn,
+ create->addr_hi, create->addr_lo);
return ret;
}
@@ -996,7 +1002,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
int page_shift;
u64 dma_addr, max_addr;
struct device_node *dn;
- u32 ddw_avail[3];
+ u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
struct property *win64;
struct dynamic_dma_window_prop *ddwprop;
@@ -1029,7 +1035,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
* the property is actually in the parent, not the PE
*/
ret = of_property_read_u32_array(pdn, "ibm,ddw-applicable",
- &ddw_avail[0], 3);
+ &ddw_avail[0], DDW_APPLICABLE_SIZE);
if (ret)
goto out_failed;
--
2.25.4
^ permalink raw reply related
* [PATCH v2 0/6] Remove default DMA window before creating DDW
From: Leonardo Bras @ 2020-06-24 6:24 UTC (permalink / raw)
To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
Alexey Kardashevskiy, Leonardo Bras, Thiago Jung Bauermann,
Ram Pai
Cc: linuxppc-dev, linux-kernel
There are some devices that only allow 1 DMA window to exist at a time,
and in those cases, a DDW is never created to them, since the default DMA
window keeps using this resource.
LoPAR recommends this procedure:
1. Remove the default DMA window,
2. Query for which configs the DDW can be created,
3. Create a DDW.
Patch #1:
Create defines for outputs of ibm,ddw-applicable, so it's easier to
identify them.
Patch #2:
- After LoPAR level 2.8, there is an extension that can make
ibm,query-pe-dma-windows to have 6 outputs instead of 5. This changes the
order of the outputs, and that can cause some trouble.
- query_ddw() was updated to check how many outputs the
ibm,query-pe-dma-windows is supposed to have, update the rtas_call() and
deal correctly with the outputs in both cases.
- This patch looks somehow unrelated to the series, but it can avoid future
problems on DDW creation.
Patch #3 moves the window-removing code from remove_ddw() to
remove_dma_window(), creating a way to delete any DMA window, so it can be
used to delete the default DMA window.
Patch #4 makes use of the remove_dma_window() from patch #3 to remove the
default DMA window before query_ddw(). It also implements a new rtas call
to recover the default DMA window, in case anything fails after it was
removed, and a DDW couldn't be created.
Patch #5:
Instead of destroying the created DDW if it doesn't map the whole
partition, make use of it instead of the default DMA window.
Patch #6:
Changes the way iommu_bypass_supported_pSeriesLP() check for
iommu_bypass: instead of checking the address returned by enable_ddw(),
it checks a new output value that reflects if the DDW created maps
the whole partition.
All patches were tested into an LPAR with an Ethernet VF:
4005:01:00.0 Ethernet controller: Mellanox Technologies MT27700 Family
[ConnectX-4 Virtual Function]
---
Changes since v1:
- Add defines for ibm,ddw-applicable and ibm,ddw-extensions outputs
- Merge aux function query_ddw_out_sz() into query_ddw()
- Merge reset_dma_window() patch (prev. #2) into remove default DMA
window patch (#4).
- Keep device_node *np name instead of using pdn in remove_*()
- Rename 'device_node *pdn' into 'parent' in new functions
- Rename dfl_win to default_win
- Only remove the default DMA window if there is no window available
in first query.
- Check if default DMA window can be restored before removing it.
- Fix 'unitialized use' (found by travis mpe:ci-test)
- New patches #5 and #6
Special thanks for Alexey Kardashevskiy and Oliver O'Halloran for
the feedback provided!
Leonardo Bras (6):
powerpc/pseries/iommu: Create defines for operations in
ibm,ddw-applicable
powerpc/pseries/iommu: Update call to ibm,query-pe-dma-windows
powerpc/pseries/iommu: Move window-removing part of remove_ddw into
remove_dma_window
powerpc/pseries/iommu: Remove default DMA window before creating DDW
powerpc/pseries/iommu: Make use of DDW even if it does not map the
partition
powerpc/pseries/iommu: Avoid errors when DDW starts at 0x00
arch/powerpc/platforms/pseries/iommu.c | 239 ++++++++++++++++++-------
1 file changed, 176 insertions(+), 63 deletions(-)
--
2.25.4
^ permalink raw reply
* [PATCH RFC 1/1] powerpc/eeh: PE info tree via debugfs and syslog
From: Sam Bobroff @ 2020-06-24 5:57 UTC (permalink / raw)
To: linuxppc-dev
Provide an ASCII art tree display of the PEs affected by recovery,
with as much state as possible, at the start and end of recovery.
Some platform specific information is provided via a new eeh_ops
member.
The same information is made available from debugfs at:
/sys/kernel/debug/powerpc/PCIXXXX/eeh_pe_tree
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
---
Here's some debug code I've been using for a long time while working on EEH. I
haven't posted it before because it wasn't possible to make the code safe
enough (to avoid either NULL or LIST_POISON), but with the recent safety work
done it's become possible.
It should be applied on top of:
"powerpc/eeh: Synchronization for safety"
"powerpc/eeh: Provide a unique ID for each EEH recovery"
"powerpc/eeh: Asynchronous recovery"
arch/powerpc/include/asm/eeh.h | 3 +
arch/powerpc/kernel/eeh.c | 90 ++++++++++++++++++++
arch/powerpc/kernel/eeh_driver.c | 28 ++++++
arch/powerpc/platforms/powernv/eeh-powernv.c | 63 +++++++++++++-
arch/powerpc/platforms/pseries/eeh_pseries.c | 21 ++++-
5 files changed, 203 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index dd55d1bf1cfd..46dec5b2482e 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -230,6 +230,7 @@ struct eeh_ops {
int (*next_error)(struct eeh_pe **pe);
int (*restore_config)(struct pci_dn *pdn);
int (*notify_resume)(struct pci_dn *pdn);
+ void (*pe_plat_state_dump)(char *buf, size_t len, struct eeh_pe *pe);
};
extern int eeh_subsystem_flags;
@@ -324,6 +325,8 @@ int eeh_pe_configure(struct eeh_pe *pe);
int eeh_pe_inject_err(struct eeh_pe *pe, int type, int func,
unsigned long addr, unsigned long mask);
int eeh_restore_vf_config(struct pci_dn *pdn);
+void eeh_tree_state_dump(void (*pfn)(void *, const char *, ...),
+ void *s, struct eeh_pe *pe, int level, int xlevel);
/**
* EEH_POSSIBLE_ERROR() -- test for possible MMIO failure.
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 54f921ff7621..6f675f277d26 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -839,6 +839,96 @@ int eeh_restore_vf_config(struct pci_dn *pdn)
return 0;
}
+static void eeh_tree_state_indent(void (*pfn)(void *, const char *, ...),
+ void *s, int level, int xlevel, bool node)
+{
+ int i;
+
+ for (i = 0; i < level; i++)
+ pfn(s, "%c ", ((xlevel & (1 << i)) ? '|' : ' '));
+ if (node)
+ pfn(s, "%c---", ((xlevel & (1 << level)) ? '+' : '\\'));
+ else
+ pfn(s, "%c...", ((xlevel & (1 << level)) ? '|' : ' '));
+}
+
+void eeh_tree_state_dump(void (*pfn)(void *, const char *, ...),
+ void *s, struct eeh_pe *pe, int level, int xlevel)
+{
+ struct eeh_dev *edev;
+ struct eeh_pe *child_pe;
+ int slevel, sxlevel;
+ char buf[1024];
+
+ eeh_recovery_must_be_locked();
+ eeh_tree_state_indent(pfn, s, level, xlevel, true);
+ pfn(s, "* [PE#0x%x] type=%s%s%s%s%s config_addr=0x%x pass_count=%d\n",
+ pe->addr,
+ ((pe->type & EEH_PE_INVALID) ? "INVALID " : ""),
+ ((pe->type & EEH_PE_PHB) ? "PHB " : ""),
+ ((pe->type & EEH_PE_DEVICE) ? "DEVICE " : ""),
+ ((pe->type & EEH_PE_BUS) ? "BUS " : ""),
+ ((pe->type & EEH_PE_VF) ? "VF " : ""),
+ pe->config_addr,
+ atomic_read(&pe->pass_dev_cnt));
+
+ slevel = level + 1;
+ sxlevel = xlevel;
+ if (!list_empty(&pe->edevs) || !list_empty(&pe->child_list))
+ sxlevel |= (1 << slevel);
+ eeh_tree_state_indent(pfn, s, slevel, sxlevel, false);
+ pfn(s, " check_count=%d freeze_count=%d false_positives=%d first_freeze=%llu\n",
+ pe->check_count, pe->freeze_count, pe->false_positives,
+ pe->tstamp);
+ eeh_tree_state_indent(pfn, s, slevel, sxlevel, false);
+ pfn(s, " kernel state=0x%x %s%s%s%s%s%s%s%s\n", pe->state,
+ ((pe->state & EEH_PE_ISOLATED) ? "ISOLATED " : ""),
+ ((pe->state & EEH_PE_RECOVERING) ? "RECOVERING " : ""),
+ ((pe->state & EEH_PE_CFG_BLOCKED) ? "CFG_BLOCKED " : ""),
+ ((pe->state & EEH_PE_RESET) ? "RESET " : ""),
+ ((pe->state & EEH_PE_KEEP) ? "KEEP " : ""),
+ ((pe->state & EEH_PE_CFG_RESTRICTED) ? "CFG_RESTRICTED " : ""),
+ ((pe->state & EEH_PE_REMOVED) ? "REMOVED" : ""),
+ ((pe->state & EEH_PE_PRI_BUS) ? "PRI_BUS" : ""));
+ if (eeh_ops->pe_plat_state_dump) {
+ eeh_tree_state_indent(pfn, s, slevel, sxlevel, false);
+ eeh_ops->pe_plat_state_dump(buf, sizeof(buf), pe);
+ pfn(s, " %.*s\n", sizeof(buf), buf);
+ }
+
+ slevel = level + 1;
+ list_for_each_entry(edev, &pe->edevs, entry) {
+ struct pci_dev *pdev = edev->pdev;
+
+ sxlevel = xlevel;
+ if ((edev != list_last_entry(&pe->edevs, struct eeh_dev, entry)) ||
+ !list_empty(&pe->child_list))
+ sxlevel |= (1 << slevel);
+ eeh_tree_state_indent(pfn, s, slevel, sxlevel, true);
+ pfn(s, "* [DEV %s] pe_config_addr=0x%x driver=%s\n",
+ pci_name(pdev), edev->pe_config_addr, eeh_driver_name(pdev));
+ eeh_tree_state_indent(pfn, s, slevel + 1, sxlevel, false);
+ pfn(s, "mode=0x%x %s%s%s%s%s%s%s%s\n", edev->mode,
+ ((edev->mode & EEH_DEV_BRIDGE) ? "BRIDGE " : ""),
+ ((edev->mode & EEH_DEV_ROOT_PORT) ? "ROOT_PORT " : ""),
+ ((edev->mode & EEH_DEV_DS_PORT) ? "DS_PORT " : ""),
+ ((edev->mode & EEH_DEV_IRQ_DISABLED) ? "IRQ_DISABLED " : ""),
+ ((edev->mode & EEH_DEV_DISCONNECTED) ? "DISCONNECTED " : ""),
+ ((edev->mode & EEH_DEV_NO_HANDLER) ? "NO_HANDLER " : ""),
+ ((edev->mode & EEH_DEV_SYSFS) ? "SYSFS " : ""),
+ ((edev->mode & EEH_DEV_REMOVED) ? "REMOVED " : ""));
+ }
+
+ slevel = level + 1;
+ list_for_each_entry(child_pe, &pe->child_list, child) {
+ sxlevel = xlevel;
+ if (child_pe != list_last_entry(&pe->child_list, struct eeh_pe, child))
+ sxlevel |= (1 << slevel);
+ eeh_tree_state_dump(pfn, s, child_pe, slevel, sxlevel);
+ }
+}
+
+
/**
* pcibios_set_pcie_reset_state - Set PCI-E reset state
* @dev: pci device struct
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 9d03292f66a7..7b077599c567 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -15,6 +15,7 @@
#include <linux/kthread.h>
#include <linux/workqueue.h>
#include <linux/completion.h>
+#include <linux/seq_buf.h>
#include <asm/eeh.h>
#include <asm/eeh_event.h>
#include <asm/ppc-pci.h>
@@ -996,6 +997,26 @@ static int eeh_reset_device(unsigned int event_id,
return 0;
}
+static void eeh_tree_state_dump_kprintf(struct eeh_pe *root_pe)
+{
+ struct seq_buf s;
+ char *buf, *p, *q;
+
+ buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
+ seq_buf_init(&s, buf, PAGE_SIZE);
+
+ eeh_tree_state_dump((void (*)(void *, const char *, ...))seq_buf_printf, &s, root_pe, 0, 0);
+
+ /* printk's output is limited to LOG_LINE_MAX, so split into lines: */
+ /* buf must end with a \n */
+ p = buf;
+ while ((q = strstr(p, "\n"))) {
+ printk(KERN_INFO "%.*s", (int)(q - p + 1), p);
+ p = q + 1;
+ }
+ kfree(buf);
+}
+
/* The longest amount of time to wait for a pci device
* to come back on line, in seconds.
*/
@@ -1129,6 +1150,9 @@ void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
int devices = 0;
eeh_recovery_lock();
+ pr_info("EEH(%u): PE state before recovery:\n", event_id);
+ eeh_tree_state_dump_kprintf(pe);
+
bus = eeh_pe_bus_get(pe);
if (!bus) {
pr_err("EEH(%u): %s: Cannot find PCI bus for PHB#%x-PE#%x\n",
@@ -1381,6 +1405,8 @@ void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
eeh_pe_state_clear(pe, EEH_PE_PRI_BUS, true);
eeh_pe_dev_mode_mark(pe, EEH_DEV_REMOVED);
+ pr_info("PE state after recovery (before final device removal):\n");
+ eeh_tree_state_dump_kprintf(pe);
eeh_recovery_unlock();
pci_lock_rescan_remove();
pci_hp_remove_devices(bus);
@@ -1405,6 +1431,8 @@ void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
eeh_clear_slot_attention(edev->pdev);
eeh_pe_state_clear(pe, EEH_PE_RECOVERING, true);
+ pr_info("EEH(%u): PE state after recovery:\n", event_id);
+ eeh_tree_state_dump_kprintf(pe);
eeh_recovery_unlock();
}
diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 1f6ec78ad88b..82b705e3f1e7 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -206,6 +206,30 @@ PNV_EEH_DBGFS_ENTRY(outb, 0xD10);
PNV_EEH_DBGFS_ENTRY(inbA, 0xD90);
PNV_EEH_DBGFS_ENTRY(inbB, 0xE10);
+static int eeh_tree_state_dump_seq_file(struct seq_file *s, void *unused)
+{
+ struct eeh_pe *root_pe = s->private;
+
+ eeh_recovery_lock();
+ eeh_tree_state_dump((void (*)(void *, const char *, ...))seq_printf,
+ s, root_pe, 0, 0);
+ eeh_recovery_unlock();
+ return 0;
+}
+
+static int eeh_tree_state_dbgfs_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, eeh_tree_state_dump_seq_file, eeh_phb_pe_get(inode->i_private));
+}
+
+static const struct file_operations eeh_tree_state_debugfs_ops = {
+ .owner = THIS_MODULE,
+ .open = eeh_tree_state_dbgfs_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
#endif /* CONFIG_DEBUG_FS */
void pnv_eeh_enable_phbs(void)
@@ -287,6 +311,9 @@ int pnv_eeh_post_init(void)
debugfs_create_file("err_injct_inboundB", 0600,
phb->dbgfs, hose,
&pnv_eeh_dbgfs_ops_inbB);
+ debugfs_create_file("eeh_pe_tree", 0400,
+ phb->dbgfs, hose,
+ &eeh_tree_state_debugfs_ops);
#endif /* CONFIG_DEBUG_FS */
}
@@ -1685,6 +1712,39 @@ static int pnv_eeh_restore_config(struct pci_dn *pdn)
return ret;
}
+void pnv_eeh_pe_plat_state_dump(char *buf, size_t len, struct eeh_pe *pe)
+{
+ struct pnv_phb *phb = pe->phb->private_data;
+ s64 rc;
+ u8 state = 0;
+ __be16 pcierr = 0;
+
+ /* Collect state directly from OPAL to avoid side effects: */
+ rc = opal_pci_eeh_freeze_status(phb->opal_id, pe->addr, &state,
+ &pcierr, NULL);
+ pcierr = be16_to_cpu(pcierr);
+
+ if (rc == OPAL_SUCCESS)
+ snprintf(buf, len, "OPAL state=0x%x %s%s%s%s%s%s%s pcierr=0x%x %s%s%s%s%s%s",
+ state,
+ ((state == OPAL_EEH_STOPPED_NOT_FROZEN) ? "OK" : ""),
+ ((state == OPAL_EEH_STOPPED_MMIO_FREEZE) ? "MMIO_FREEZE" : ""),
+ ((state == OPAL_EEH_STOPPED_DMA_FREEZE) ? "DMA_FREEZE" : ""),
+ ((state == OPAL_EEH_STOPPED_MMIO_DMA_FREEZE) ? "MMIO_DMA_FREEZE" : ""),
+ ((state == OPAL_EEH_STOPPED_RESET) ? "STOPPED_RESET" : ""),
+ ((state == OPAL_EEH_STOPPED_TEMP_UNAVAIL) ? "STOPPED_TEMP_UNAVAIL" : ""),
+ ((state == OPAL_EEH_STOPPED_PERM_UNAVAIL) ? "STOPPED_PERM_UNAVAIL" : ""),
+ pcierr,
+ ((pcierr == OPAL_EEH_NO_ERROR) ? "OK" : ""),
+ ((pcierr == OPAL_EEH_IOC_ERROR) ? "IOC_ERROR" : ""),
+ ((pcierr == OPAL_EEH_PHB_ERROR) ? "PHB_ERROR" : ""),
+ ((pcierr == OPAL_EEH_PE_ERROR) ? "PE_ERROR" : ""),
+ ((pcierr == OPAL_EEH_PE_MMIO_ERROR) ? "MMIO_ERROR" : ""),
+ ((pcierr == OPAL_EEH_PE_DMA_ERROR) ? "DMA_ERROR" : ""));
+ else
+ snprintf(buf, len, "OPAL error=0x%llx", rc);
+}
+
static struct eeh_ops pnv_eeh_ops = {
.name = "powernv",
.init = pnv_eeh_init,
@@ -1700,7 +1760,8 @@ static struct eeh_ops pnv_eeh_ops = {
.write_config = pnv_eeh_write_config,
.next_error = pnv_eeh_next_error,
.restore_config = pnv_eeh_restore_config,
- .notify_resume = NULL
+ .notify_resume = NULL,
+ .pe_plat_state_dump = pnv_eeh_pe_plat_state_dump,
};
#ifdef CONFIG_PCI_IOV
diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c b/arch/powerpc/platforms/pseries/eeh_pseries.c
index c49fab23f1c9..3fbbd8583350 100644
--- a/arch/powerpc/platforms/pseries/eeh_pseries.c
+++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
@@ -782,6 +782,24 @@ static int pseries_notify_resume(struct pci_dn *pdn)
}
#endif
+void pseries_eeh_pe_plat_state_dump(char *buf, size_t len, struct eeh_pe *pe)
+{
+ int state, delay;
+
+ /* pseries_eeh_get_state() doesn't have side-effects, so we can use it here: */
+ state = pseries_eeh_get_state(pe, &delay);
+
+ snprintf(buf, len, "RTAS state=0x%x %s%s%s%s%s%s%s",
+ state,
+ ((state & EEH_STATE_UNAVAILABLE) ? "UNAVAILABLE " : ""),
+ ((state & EEH_STATE_NOT_SUPPORT) ? "NOT_SUPPORT " : ""),
+ ((state & EEH_STATE_RESET_ACTIVE) ? "RESET_ACTIVE " : ""),
+ ((state & EEH_STATE_MMIO_ACTIVE) ? "MMIO_ACTIVE " : ""),
+ ((state & EEH_STATE_DMA_ACTIVE) ? "DMA_ACTIVE " : ""),
+ ((state & EEH_STATE_MMIO_ENABLED) ? "MMIO_ENABLED " : ""),
+ ((state & EEH_STATE_DMA_ENABLED) ? "DMA_ENABLED " : ""));
+}
+
static struct eeh_ops pseries_eeh_ops = {
.name = "pseries",
.init = pseries_eeh_init,
@@ -798,8 +816,9 @@ static struct eeh_ops pseries_eeh_ops = {
.next_error = NULL,
.restore_config = pseries_eeh_restore_config,
#ifdef CONFIG_PCI_IOV
- .notify_resume = pseries_notify_resume
+ .notify_resume = pseries_notify_resume,
#endif
+ .pe_plat_state_dump = pseries_eeh_pe_plat_state_dump,
};
/**
--
2.22.0.216.g00a2a96fc9
^ permalink raw reply related
* [PATCH RFC 1/1] powerpc/eeh: Asynchronous recovery
From: Sam Bobroff @ 2020-06-24 5:54 UTC (permalink / raw)
To: linuxppc-dev
Currently, EEH recovery is entirely serialized and takes place within
a single kernel thread. This can cause recovery to take a long time
when there are many devices.
To shorten recovery time, this change allows recovery to proceed in
parallel in two ways:
- Each PHB is given it's own recovery event queue and can be recovered
independently from other PHBs.
- Driver handlers are called in parallel, but with the constraint that
handlers higher up (closer to the PHB) in the PE hierarchy must be
called before those lower down.
To maintain the constraint, above, the driver handlers are called by
traversing the tree of affected PEs from the top, stopping to call
handlers (in parallel) when a PE with devices is discovered. When the
calls for that PE are complete, traversal continues at each child PE.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
---
This patch should be applied on top of both:
"powerpc/eeh: Synchronization for safety"
"powerpc/eeh: Provide a unique ID for each EEH recovery"
arch/powerpc/include/asm/eeh.h | 1 +
arch/powerpc/include/asm/eeh_event.h | 7 +
arch/powerpc/include/asm/pci-bridge.h | 2 +
arch/powerpc/kernel/eeh_dev.c | 2 +
arch/powerpc/kernel/eeh_driver.c | 313 ++++++++++++++++++--------
arch/powerpc/kernel/eeh_event.c | 65 +++---
6 files changed, 276 insertions(+), 114 deletions(-)
diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 1d4c0b19a63c..dd55d1bf1cfd 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -130,6 +130,7 @@ static inline bool eeh_pe_passed(struct eeh_pe *pe)
#define EEH_DEV_NO_HANDLER (1 << 8) /* No error handler */
#define EEH_DEV_SYSFS (1 << 9) /* Sysfs created */
#define EEH_DEV_REMOVED (1 << 10) /* Removed permanently */
+#define EEH_DEV_RECOVERING (1 << 11) /* Recovering */
struct eeh_dev {
int mode; /* EEH mode */
diff --git a/arch/powerpc/include/asm/eeh_event.h b/arch/powerpc/include/asm/eeh_event.h
index a1fe736bc4cf..b21f49e87b7b 100644
--- a/arch/powerpc/include/asm/eeh_event.h
+++ b/arch/powerpc/include/asm/eeh_event.h
@@ -8,6 +8,8 @@
#define ASM_POWERPC_EEH_EVENT_H
#ifdef __KERNEL__
+#include <linux/workqueue.h>
+
/*
* structure holding pci controller data that describes a
* change in the isolation status of a PCI slot. A pointer
@@ -15,16 +17,21 @@
* callback.
*/
struct eeh_event {
+ struct work_struct work;
struct list_head list; /* to form event queue */
struct eeh_pe *pe; /* EEH PE */
unsigned int id; /* Event ID */
};
+extern spinlock_t eeh_eventlist_lock;
+
int eeh_event_init(void);
+int eeh_phb_event(struct eeh_pe *pe);
int eeh_send_failure_event(struct eeh_pe *pe);
int __eeh_send_failure_event(struct eeh_pe *pe);
void eeh_remove_event(struct eeh_pe *pe, bool force);
void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe);
+void eeh_handle_normal_event_work(struct work_struct *work);
void eeh_handle_special_event(void);
#endif /* __KERNEL__ */
diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 69f4cb3b7c56..2a9d639b18d1 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -127,6 +127,8 @@ struct pci_controller {
void *private_data;
struct npu *npu;
+ bool eeh_in_progress;
+ struct list_head eeh_eventlist;
};
/* These are used for config access before all the PCI probing
diff --git a/arch/powerpc/kernel/eeh_dev.c b/arch/powerpc/kernel/eeh_dev.c
index 7370185c7a05..2e48a1e142a9 100644
--- a/arch/powerpc/kernel/eeh_dev.c
+++ b/arch/powerpc/kernel/eeh_dev.c
@@ -62,6 +62,8 @@ struct eeh_dev *eeh_dev_init(struct pci_dn *pdn)
*/
void eeh_dev_phb_init_dynamic(struct pci_controller *phb)
{
+ phb->eeh_in_progress = false; /* TODO: Necessary? */
+ INIT_LIST_HEAD(&phb->eeh_eventlist);
/* EEH PE for PHB */
eeh_phb_pe_create(phb);
}
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 0dbc218597e3..9d03292f66a7 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -12,6 +12,9 @@
#include <linux/module.h>
#include <linux/pci.h>
#include <linux/pci_hotplug.h>
+#include <linux/kthread.h>
+#include <linux/workqueue.h>
+#include <linux/completion.h>
#include <asm/eeh.h>
#include <asm/eeh_event.h>
#include <asm/ppc-pci.h>
@@ -19,6 +22,8 @@
#include <asm/prom.h>
#include <asm/rtas.h>
+static atomic_t eeh_wu_id = ATOMIC_INIT(0);
+
struct eeh_rmv_data {
struct list_head removed_vf_list;
int removed_dev_count;
@@ -249,73 +254,58 @@ static void eeh_set_irq_state(struct eeh_pe *root, bool enable)
}
typedef enum pci_ers_result (*eeh_report_fn)(unsigned int event_id,
+ unsigned int id,
struct pci_dev *,
struct pci_driver *);
static void eeh_pe_report_pdev(unsigned int event_id,
- struct pci_dev *pdev, eeh_report_fn fn,
+ unsigned int id,
+ struct pci_dev *pdev,
+ const char *fn_name, eeh_report_fn fn,
enum pci_ers_result *result,
- const char *handler_name)
+ bool late, bool removed, bool passed)
{
- struct eeh_dev *edev;
struct pci_driver *driver;
- bool actionable, late, removed, passed;
enum pci_ers_result new_result;
- edev = pci_dev_to_eeh_dev(pdev);
- if (!edev) {
- pci_info(pdev, "EEH(%u): no EEH state for device", event_id);
- return;
- }
- /* Cache some useful values before releasing the lock: */
- actionable = eeh_edev_actionable(edev);
- late = edev->mode & EEH_DEV_NO_HANDLER;
- removed = eeh_dev_removed(edev);
- passed = eeh_pe_passed(edev->pe);
- if (actionable) {
+ /*
+ * Driver callbacks may end up calling back into EEH functions
+ * (for example by removing a PCI device) which will deadlock
+ * unless the EEH locks are released first. Note that it may be
+ * re-acquired by the report functions, if necessary.
+ */
+ device_lock(&pdev->dev);
+ driver = eeh_pcid_get(pdev);
+
+ if (!driver)
+ pci_info(pdev, "EEH(%u): W%u: no driver", event_id, id);
+ else if (!driver->err_handler)
+ pci_info(pdev, "EEH(%u): W%u: driver not EEH aware", event_id, id);
+ else if (late)
+ pci_info(pdev, "EEH(%u): W%u: driver bound too late", event_id, id);
+ else {
+ pci_info(pdev, "EEH(%u): EVENT=HANDLER_CALL HANDLER='%s'\n",
+ event_id, fn_name);
+
+ new_result = fn(event_id, id, pdev, driver);
/*
- * Driver callbacks may end up calling back into EEH functions
- * (for example by removing a PCI device) which will deadlock
- * unless the EEH locks are released first. Note that it may be
- * re-acquired by the report functions, if necessary.
+ * It's not safe to use edev here, because the locks
+ * have been released and devices could have changed.
*/
- eeh_recovery_unlock();
- device_lock(&pdev->dev);
- driver = eeh_pcid_get(pdev);
-
- if (!driver)
- pci_info(pdev, "EEH(%u): no driver", event_id);
- else if (!driver->err_handler)
- pci_info(pdev, "EEH(%u): driver not EEH aware", event_id);
- else if (late)
- pci_info(pdev, "EEH(%u): driver bound too late", event_id);
- else {
- pr_warn("EEH(%u): EVENT=HANDLER_CALL DEVICE=%04x:%02x:%02x.%x DRIVER='%s' HANDLER='%s'\n",
- event_id, edev->controller->global_number,
- PCI_BUSNO(edev->bdfn), PCI_SLOT(edev->bdfn),
- PCI_FUNC(edev->bdfn), driver->name, handler_name);
-
- new_result = fn(event_id, pdev, driver);
- /*
- * It's not safe to use edev here, because the locks
- * have been released and devices could have changed.
- */
- pr_warn("EEH(%u): EVENT=HANDLER_RETURN RESULT='%s'\n",
- event_id, pci_ers_result_name(new_result));
- pci_info(pdev, "EEH(%u): %s driver reports: %s",
- event_id, driver->name,
- pci_ers_result_name(new_result));
- if (result)
- *result = pci_ers_merge_result(*result,
- new_result);
+ pr_warn("EEH(%u): EVENT=HANDLER_RETURN RESULT='%s'\n",
+ event_id, pci_ers_result_name(new_result));
+ pci_info(pdev, "EEH(%u): W%u: %s driver reports: '%s'",
+ event_id, id, driver->name,
+ pci_ers_result_name(new_result));
+ if (result) {
+ eeh_recovery_lock();
+ *result = pci_ers_merge_result(*result,
+ new_result);
+ eeh_recovery_unlock();
}
- if (driver)
- eeh_pcid_put(pdev);
- device_unlock(&pdev->dev);
- eeh_recovery_lock();
- } else {
- pci_info(pdev, "EEH(%u): not actionable (%d,%d,%d)",
- event_id, !!pdev, !removed, !passed);
}
+ if (driver)
+ eeh_pcid_put(pdev);
+ device_unlock(&pdev->dev);
}
struct pci_dev **pdev_cache_list_create(struct eeh_pe *root)
@@ -355,27 +345,142 @@ static void pdev_cache_list_destroy(struct pci_dev **pdevs)
kfree(pdevs);
}
-static void eeh_pe_report(unsigned int event_id,
- const char *name, struct eeh_pe *root,
- eeh_report_fn fn, enum pci_ers_result *result)
+struct work_unit {
+ unsigned int id;
+ struct work_struct work;
+ unsigned int event_id;
+ struct pci_dev *pdev;
+ struct eeh_pe *pe;
+ const char *fn_name;
+ eeh_report_fn fn;
+ enum pci_ers_result *result;
+ atomic_t *count;
+ struct completion *done;
+};
+
+static void eeh_pe_report_pdev_thread(struct work_struct *work);
+/*
+ * Traverse down from a PE through it's children, to find devices and enqueue
+ * jobs to call the handler (fn) on them. But do not traverse below a PE that
+ * has devices, so that devices are always handled strictly before their
+ * children. (Traversal is continued by the jobs after handlers are called.)
+ * The recovery lock must be held.
+ * TODO: Convert away from recursive descent traversal?
+ */
+static bool enqueue_pe_work(struct eeh_pe *root, unsigned int event_id,
+ const char *fn_name, eeh_report_fn fn,
+ enum pci_ers_result *result, atomic_t *count,
+ struct completion *done)
{
- struct pci_dev **pdevs, **pdevp;
+ struct eeh_pe *pe;
+ struct eeh_dev *edev, *tmp;
+ struct work_unit *wu;
+ bool work_added = false;
+
+ if (list_empty(&root->edevs)) {
+ list_for_each_entry(pe, &root->child_list, child)
+ work_added |= enqueue_pe_work(pe, event_id, fn_name, fn, result, count, done);
+ } else {
+ eeh_pe_for_each_dev(root, edev, tmp) {
+ work_added = true;
+ edev->mode |= EEH_DEV_RECOVERING;
+ atomic_inc(count);
+ WARN_ON(!(edev->mode & EEH_DEV_RECOVERING));
+ wu = kmalloc(sizeof(*wu), GFP_KERNEL);
+ wu->id = (unsigned int)atomic_inc_return(&eeh_wu_id);
+ wu->event_id = event_id;
+ get_device(&edev->pdev->dev);
+ wu->pdev = edev->pdev;
+ wu->pe = root;
+ wu->fn_name = fn_name;
+ wu->fn = fn;
+ wu->result = result;
+ wu->count = count;
+ wu->done = done;
+ INIT_WORK(&wu->work, eeh_pe_report_pdev_thread);
+ pr_debug("EEH(%u): Queue work unit W%u for device %s (count ~ %d)\n", event_id, wu->id, pci_name(edev->pdev), atomic_read(count));
+ queue_work(system_unbound_wq, &wu->work);
+ }
+ /* This PE has devices, so don't traverse further now */
+ }
+ return work_added;
+}
+
+static void eeh_pe_report_pdev_thread(struct work_struct *work)
+{
+ struct work_unit *wu = container_of(work, struct work_unit, work);
+ struct eeh_dev *edev, *oedev, *tmp;
+ struct eeh_pe *pe;
+ int todo;
- pr_info("EEH(%u): Beginning: '%s'\n", event_id, name);
/*
* It would be convenient to continue to hold the recovery lock here
* but driver callbacks can take a very long time or never return at
* all.
*/
- pdevs = pdev_cache_list_create(root);
- for (pdevp = pdevs; pdevp && *pdevp; pdevp++) {
- /*
- * NOTE! eeh_recovery_lock() is released briefly
- * in eeh_pe_report_pdev()
- */
- eeh_pe_report_pdev(event_id, *pdevp, fn, result, name);
+ pr_debug("EEH(%u): W%u: start (device: %s)\n", wu->event_id, wu->id, pci_name(wu->pdev));
+ eeh_recovery_lock();
+ edev = pci_dev_to_eeh_dev(wu->pdev);
+ if (edev) {
+ bool late, removed, passed;
+
+ WARN_ON(!(edev->mode & EEH_DEV_RECOVERING));
+ removed = eeh_dev_removed(edev);
+ passed = eeh_pe_passed(edev->pe);
+ late = edev->mode & EEH_DEV_NO_HANDLER;
+ if (eeh_edev_actionable(edev)) {
+ eeh_recovery_unlock();
+ eeh_pe_report_pdev(wu->event_id, wu->id, wu->pdev, wu->fn_name, wu->fn, wu->result, late, removed, passed);
+ eeh_recovery_lock();
+ } else {
+ pci_info(wu->pdev, "EEH(%u): W%u: Not actionable (%d,%d,%d)\n",
+ wu->event_id, wu->id, !!wu->pdev, !removed, !passed);
+ }
+ edev = pci_dev_to_eeh_dev(wu->pdev); // Re-acquire after lock release
+ if (edev)
+ edev->mode &= ~EEH_DEV_RECOVERING;
+ /* The edev may be lost, but not moved to a different PE! */
+ WARN_ON(eeh_dev_to_pe(edev) && (eeh_dev_to_pe(edev) != wu->pe));
+ todo = 0;
+ eeh_pe_for_each_dev(wu->pe, oedev, tmp)
+ if (oedev->mode & EEH_DEV_RECOVERING)
+ todo++;
+ pci_dbg(wu->pdev, "EEH(%u): W%u: Remaining devices in this PE: %d\n", wu->event_id, wu->id, todo);
+ if (todo) {
+ pr_debug("EEH(%u): W%u: Remaining work units at this PE: %d\n", wu->event_id, wu->id, todo);
+ } else {
+ pr_debug("EEH(%u): W%u: All work for this PE complete, continuing traversal:\n", wu->event_id, wu->id);
+ list_for_each_entry(pe, &wu->pe->child_list, child)
+ enqueue_pe_work(pe, wu->event_id, wu->fn_name, wu->fn, wu->result, wu->count, wu->done);
+ }
+ } else {
+ pr_warn("EEH(%u): W%u: Device removed.\n", wu->event_id, wu->id);
+ }
+ eeh_recovery_unlock();
+ if (atomic_dec_and_test(wu->count)) {
+ pr_debug("EEH(%u): W%u: done\n", wu->event_id, wu->id);
+ complete(wu->done);
+ }
+ put_device(&wu->pdev->dev);
+ kfree(wu);
+}
+
+static void eeh_pe_report(unsigned int event_id, const char *name, struct eeh_pe *root,
+ eeh_report_fn fn, enum pci_ers_result *result)
+{
+ atomic_t count = ATOMIC_INIT(0);
+ DECLARE_COMPLETION_ONSTACK(done);
+
+ pr_info("EEH(%u): Beginning: '%s'\n", event_id, name);
+ if (enqueue_pe_work(root, event_id, name, fn, result, &count, &done)) {
+ pr_info("EEH(%u): Waiting for asynchronous recovery work to complete...\n", event_id);
+ eeh_recovery_unlock();
+ wait_for_completion_interruptible(&done);
+ pr_info("EEH(%u): Asynchronous recovery work is complete.\n", event_id);
+ eeh_recovery_lock();
+ } else {
+ pr_info("EEH(%u): No recovery work do.\n", event_id);
}
- pdev_cache_list_destroy(pdevs);
if (result)
pr_info("EEH(%u): Finished:'%s' with aggregate recovery state:'%s'\n",
@@ -392,6 +497,7 @@ static void eeh_pe_report(unsigned int event_id,
* Report an EEH error to each device driver.
*/
static enum pci_ers_result eeh_report_error(unsigned int event_id,
+ unsigned int id,
struct pci_dev *pdev,
struct pci_driver *driver)
{
@@ -402,7 +508,8 @@ static enum pci_ers_result eeh_report_error(unsigned int event_id,
if (!driver->err_handler->error_detected)
return PCI_ERS_RESULT_NONE;
- pci_info(pdev, "Invoking %s->error_detected(IO frozen)", driver->name);
+ pci_info(pdev, "EEH(%u): W%u: Invoking %s->error_detected(IO frozen)",
+ event_id, id, driver->name);
rc = driver->err_handler->error_detected(pdev, pci_channel_io_frozen);
eeh_serialize_lock(&flags);
@@ -423,13 +530,14 @@ static enum pci_ers_result eeh_report_error(unsigned int event_id,
* are now enabled.
*/
static enum pci_ers_result eeh_report_mmio_enabled(unsigned int event_id,
+ unsigned int id,
struct pci_dev *pdev,
struct pci_driver *driver)
{
if (!driver->err_handler->mmio_enabled)
return PCI_ERS_RESULT_NONE;
- pci_info(pdev, "EEH(%u): Invoking %s->mmio_enabled()",
- event_id, driver->name);
+ pci_info(pdev, "EEH(%u): W%u: Invoking %s->mmio_enabled()",
+ event_id, id, driver->name);
return driver->err_handler->mmio_enabled(pdev);
}
@@ -444,6 +552,7 @@ static enum pci_ers_result eeh_report_mmio_enabled(unsigned int event_id,
* driver can work again while the device is recovered.
*/
static enum pci_ers_result eeh_report_reset(unsigned int event_id,
+ unsigned int id,
struct pci_dev *pdev,
struct pci_driver *driver)
{
@@ -457,8 +566,8 @@ static enum pci_ers_result eeh_report_reset(unsigned int event_id,
return PCI_ERS_RESULT_NONE;
}
eeh_serialize_unlock(flags);
- pci_info(pdev, "EEH(%u): Invoking %s->slot_reset()",
- event_id, driver->name);
+ pci_info(pdev, "EEH(%u): W%u: Invoking %s->slot_reset()",
+ event_id, id, driver->name);
return driver->err_handler->slot_reset(pdev);
}
@@ -499,6 +608,7 @@ static void eeh_dev_restore_state(struct eeh_dev *edev, void *userdata)
* to make the recovered device work again.
*/
static enum pci_ers_result eeh_report_resume(unsigned int event_id,
+ unsigned int id,
struct pci_dev *pdev,
struct pci_driver *driver)
{
@@ -513,8 +623,8 @@ static enum pci_ers_result eeh_report_resume(unsigned int event_id,
}
eeh_serialize_unlock(flags);
- pci_info(pdev, "EEH(%u): Invoking %s->resume()",
- event_id, driver->name);
+ pci_info(pdev, "EEH(%u): W%u Invoking %s->resume()",
+ event_id, id, driver->name);
driver->err_handler->resume(pdev);
pci_uevent_ers(pdev, PCI_ERS_RESULT_RECOVERED);
@@ -536,6 +646,7 @@ static enum pci_ers_result eeh_report_resume(unsigned int event_id,
* dead, and that no further recovery attempts will be made on it.
*/
static enum pci_ers_result eeh_report_failure(unsigned int event_id,
+ unsigned int id,
struct pci_dev *pdev,
struct pci_driver *driver)
{
@@ -544,8 +655,8 @@ static enum pci_ers_result eeh_report_failure(unsigned int event_id,
if (!driver->err_handler->error_detected)
return PCI_ERS_RESULT_NONE;
- pci_info(pdev, "EEH(%u): Invoking %s->error_detected(permanent failure)",
- event_id, driver->name);
+ pci_info(pdev, "EEH(%u): W%u: Invoking %s->error_detected(permanent failure)",
+ event_id, id, driver->name);
rc = driver->err_handler->error_detected(pdev,
pci_channel_io_perm_failure);
@@ -1006,9 +1117,10 @@ static void eeh_clear_slot_attention(struct pci_dev *pdev)
*/
void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
{
+ struct eeh_pe *tmp_pe;
+ struct pci_controller *phb = pe->phb;
struct pci_bus *bus;
struct eeh_dev *edev, *tmp;
- struct eeh_pe *tmp_pe;
struct pci_dev **pdevs, **pdevp;
int rc = 0;
enum pci_ers_result result = PCI_ERS_RESULT_NONE;
@@ -1020,7 +1132,7 @@ void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
bus = eeh_pe_bus_get(pe);
if (!bus) {
pr_err("EEH(%u): %s: Cannot find PCI bus for PHB#%x-PE#%x\n",
- event_id, __func__, pe->phb->global_number, pe->addr);
+ event_id, __func__, phb->global_number, pe->addr);
eeh_recovery_unlock();
return;
}
@@ -1041,7 +1153,7 @@ void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
if (!devices) {
pr_debug("EEH(%u): Frozen PHB#%x-PE#%x is empty!\n",
- event_id, pe->phb->global_number, pe->addr);
+ event_id, phb->global_number, pe->addr);
goto out; /* nothing to recover */
}
@@ -1053,12 +1165,12 @@ void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
/* Log the event */
if (pe->type & EEH_PE_PHB) {
pr_err("EEH(%u): Recovering PHB#%x, location: %s\n",
- event_id, pe->phb->global_number, eeh_pe_loc_get(pe));
+ event_id, phb->global_number, eeh_pe_loc_get(pe));
} else {
- struct eeh_pe *phb_pe = eeh_phb_pe_get(pe->phb);
+ struct eeh_pe *phb_pe = eeh_phb_pe_get(phb);
pr_err("EEH(%u): Recovering PHB#%x-PE#%x\n",
- event_id, pe->phb->global_number, pe->addr);
+ event_id, phb->global_number, pe->addr);
pr_err("EEH(%u): PE location: %s, PHB location: %s\n",
event_id, eeh_pe_loc_get(pe), eeh_pe_loc_get(phb_pe));
}
@@ -1073,7 +1185,7 @@ void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
int i;
pr_err("EEH(%u): Frozen PHB#%x-PE#%x detected\n",
- event_id, pe->phb->global_number, pe->addr);
+ event_id, phb->global_number, pe->addr);
/* FIXME: Use the same format as dump_stack() */
pr_err("EEH(%u): Call Trace:\n", event_id);
@@ -1087,7 +1199,7 @@ void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
eeh_pe_update_time_stamp(pe);
if (pe->freeze_count > eeh_max_freezes) {
pr_err("EEH(%u): PHB#%x-PE#%x has failed %d times in the last hour and has been permanently disabled.\n",
- event_id, pe->phb->global_number, pe->addr,
+ event_id, phb->global_number, pe->addr,
pe->freeze_count);
result = PCI_ERS_RESULT_DISCONNECT;
}
@@ -1240,7 +1352,7 @@ void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
*/
pr_err("EEH(%u): Unable to recover from failure from PHB#%x-PE#%x.\n"
"Please try reseating or replacing it\n",
- event_id, pe->phb->global_number, pe->addr);
+ event_id, phb->global_number, pe->addr);
eeh_slot_error_detail(event_id, pe, EEH_LOG_PERM);
@@ -1294,8 +1406,34 @@ void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
eeh_pe_state_clear(pe, EEH_PE_RECOVERING, true);
eeh_recovery_unlock();
+
}
+void eeh_handle_normal_event_work(struct work_struct *work)
+{
+ unsigned long flags;
+ struct eeh_event *event = container_of(work, struct eeh_event, work);
+ struct pci_controller *phb = event->pe->phb;
+
+ eeh_handle_normal_event(event->id, event->pe);
+
+ kfree(event);
+ spin_lock_irqsave(&eeh_eventlist_lock, flags);
+ WARN_ON_ONCE(!phb->eeh_in_progress);
+ if (list_empty(&phb->eeh_eventlist)) {
+ phb->eeh_in_progress = false;
+ pr_debug("EEH(%u): No more work to do\n", event->id);
+ } else {
+ pr_warn("EEH(%u): More work to do\n", event->id);
+ event = list_entry(phb->eeh_eventlist.next,
+ struct eeh_event, list);
+ list_del(&event->list);
+ queue_work(system_unbound_wq, &event->work);
+ }
+ spin_unlock_irqrestore(&eeh_eventlist_lock, flags);
+}
+
+
/**
* eeh_handle_special_event - Handle EEH events without a specific failing PE
*
@@ -1366,8 +1504,7 @@ void eeh_handle_special_event(void)
*/
if (rc == EEH_NEXT_ERR_FROZEN_PE ||
rc == EEH_NEXT_ERR_FENCED_PHB) {
- eeh_pe_state_mark(pe, EEH_PE_RECOVERING);
- eeh_handle_normal_event(0, pe);
+ eeh_phb_event(pe);
} else {
eeh_for_each_pe(pe, tmp_pe)
eeh_pe_for_each_dev(tmp_pe, edev, tmp_edev)
diff --git a/arch/powerpc/kernel/eeh_event.c b/arch/powerpc/kernel/eeh_event.c
index bd38d6fe5449..81dc4f924324 100644
--- a/arch/powerpc/kernel/eeh_event.c
+++ b/arch/powerpc/kernel/eeh_event.c
@@ -22,7 +22,7 @@
* work-queue, where a worker thread can drive recovery.
*/
-static DEFINE_SPINLOCK(eeh_eventlist_lock);
+DEFINE_SPINLOCK(eeh_eventlist_lock);
static DECLARE_COMPLETION(eeh_eventlist_event);
static LIST_HEAD(eeh_eventlist);
@@ -61,7 +61,7 @@ static int eeh_event_handler(void * dummy)
continue;
/* We might have event without binding PE */
- if (event->pe)
+ if (event->pe) /* TODO: Unused now? */
eeh_handle_normal_event(event->id, event->pe);
else
eeh_handle_special_event();
@@ -94,33 +94,56 @@ int eeh_event_init(void)
return 0;
}
-/**
- * eeh_send_failure_event - Generate a PCI error event
- * @pe: EEH PE
- *
- * This routine can be called within an interrupt context;
- * the actual event will be delivered in a normal context
- * (from a workqueue).
- */
-int __eeh_send_failure_event(struct eeh_pe *pe)
+int eeh_phb_event(struct eeh_pe *pe)
{
- unsigned long flags;
struct eeh_event *event;
+ unsigned long flags;
event = kzalloc(sizeof(*event), GFP_ATOMIC);
if (!event) {
pr_err("EEH: out of memory, event not handled\n");
return -ENOMEM;
}
- event->pe = pe;
+
do {
/* Skip over the special value (0) */
event->id = (unsigned int)atomic_inc_return(&eeh_event_id);
} while (!event->id);
- pr_err("EEH(%u): EVENT=ERROR_DETECTED PHB=%#x PE=%#x\n",
- event->id, pe->phb->global_number, pe->addr);
+ spin_lock_irqsave(&eeh_eventlist_lock, flags);
+ INIT_WORK(&event->work, eeh_handle_normal_event_work);
+ if (pe) {
+ event->pe = pe;
+ eeh_pe_state_mark(pe, EEH_PE_RECOVERING);
+ pr_err("EEH(%u): EVENT=ERROR_DETECTED PHB=%#x PE=%#x\n",
+ event->id, pe->phb->global_number, pe->addr);
+ if (event->pe->phb->eeh_in_progress) {
+ pr_info("EEH: EEH already in progress on this PHB, queueing.\n");
+ list_add(&event->list, &event->pe->phb->eeh_eventlist);
+ } else {
+ pr_info("EEH: Beginning recovery on this PHB.\n");
+ WARN_ON_ONCE(!list_empty(&event->pe->phb->eeh_eventlist));
+ event->pe->phb->eeh_in_progress = true;
+ queue_work(system_unbound_wq, &event->work);
+ }
+ } else {
+ list_add(&event->list, &eeh_eventlist);
+ complete(&eeh_eventlist_event);
+ }
+ spin_unlock_irqrestore(&eeh_eventlist_lock, flags);
+ return 0;
+}
+/**
+ * eeh_send_failure_event - Generate a PCI error event
+ * @pe: EEH PE
+ *
+ * This routine can be called within an interrupt context;
+ * the actual event will be delivered in a normal context
+ * (from a workqueue).
+ */
+int __eeh_send_failure_event(struct eeh_pe *pe)
+{
/*
* Mark the PE as recovering before inserting it in the queue.
* This prevents the PE from being free()ed by a hotplug driver
@@ -136,18 +159,8 @@ int __eeh_send_failure_event(struct eeh_pe *pe)
ARRAY_SIZE(pe->stack_trace), 0);
#endif /* CONFIG_STACKTRACE */
- eeh_pe_state_mark(pe, EEH_PE_RECOVERING);
}
-
- /* We may or may not be called in an interrupt context */
- spin_lock_irqsave(&eeh_eventlist_lock, flags);
- list_add(&event->list, &eeh_eventlist);
- spin_unlock_irqrestore(&eeh_eventlist_lock, flags);
-
- /* For EEH deamon to knick in */
- complete(&eeh_eventlist_event);
-
- return 0;
+ return eeh_phb_event(pe);
}
int eeh_send_failure_event(struct eeh_pe *pe)
--
2.22.0.216.g00a2a96fc9
^ permalink raw reply related
* Re: [PATCH v2 2/2] cpufreq: Specify default governor on command line
From: Viresh Kumar @ 2020-06-24 5:50 UTC (permalink / raw)
To: Quentin Perret
Cc: juri.lelli, kernel-team, vincent.guittot, arnd, rafael, peterz,
adharmap, linux-pm, rjw, linux-kernel, mingo, paulus,
linuxppc-dev, tkjos
In-Reply-To: <20200623142138.209513-3-qperret@google.com>
On 23-06-20, 15:21, Quentin Perret wrote:
> Currently, the only way to specify the default CPUfreq governor is via
> Kconfig options, which suits users who can build the kernel themselves
> perfectly.
>
> However, for those who use a distro-like kernel (such as Android, with
> the Generic Kernel Image project), the only way to use a different
> default is to boot to userspace, and to then switch using the sysfs
> interface. Being able to specify the default governor on the command
> line, like is the case for cpuidle, would enable those users to specify
> their governor of choice earlier on, and to simplify slighlty the
> userspace boot procedure.
>
> To support this use-case, add a kernel command line parameter enabling
> to specify a default governor for CPUfreq, which takes precedence over
> the builtin default.
>
> This implementation has one notable limitation: the default governor
> must be registered before the driver. This is solved for builtin
> governors and drivers using appropriate *_initcall() functions. And in
> the modular case, this must be reflected as a constraint on the module
> loading order.
>
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
> .../admin-guide/kernel-parameters.txt | 5 ++++
> Documentation/admin-guide/pm/cpufreq.rst | 6 ++---
> drivers/cpufreq/cpufreq.c | 23 +++++++++++++++----
> 3 files changed, 26 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index fb95fad81c79..5fd3c9f187eb 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -703,6 +703,11 @@
> cpufreq.off=1 [CPU_FREQ]
> disable the cpufreq sub-system
>
> + cpufreq.default_governor=
> + [CPU_FREQ] Name of the default cpufreq governor to use.
> + This governor must be registered in the kernel before
> + the cpufreq driver probes.
> +
> cpu_init_udelay=N
> [X86] Delay for N microsec between assert and de-assert
> of APIC INIT to start processors. This delay occurs
> diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
> index 0c74a7784964..368e612145d2 100644
> --- a/Documentation/admin-guide/pm/cpufreq.rst
> +++ b/Documentation/admin-guide/pm/cpufreq.rst
> @@ -147,9 +147,9 @@ CPUs in it.
>
> The next major initialization step for a new policy object is to attach a
> scaling governor to it (to begin with, that is the default scaling governor
> -determined by the kernel configuration, but it may be changed later
> -via ``sysfs``). First, a pointer to the new policy object is passed to the
> -governor's ``->init()`` callback which is expected to initialize all of the
> +determined by the kernel command line or configuration, but it may be changed
> +later via ``sysfs``). First, a pointer to the new policy object is passed to
> +the governor's ``->init()`` callback which is expected to initialize all of the
> data structures necessary to handle the given policy and, possibly, to add
> a governor ``sysfs`` interface to it. Next, the governor is started by
> invoking its ``->start()`` callback.
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 0128de3603df..4b1a5c0173cf 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -50,6 +50,9 @@ static LIST_HEAD(cpufreq_governor_list);
> #define for_each_governor(__governor) \
> list_for_each_entry(__governor, &cpufreq_governor_list, governor_list)
>
> +static char cpufreq_param_governor[CPUFREQ_NAME_LEN];
> +static struct cpufreq_governor *default_governor;
> +
> /**
> * The "cpufreq driver" - the arch- or hardware-dependent low
> * level driver of CPUFreq support, and its spinlock. This lock
> @@ -1055,7 +1058,6 @@ __weak struct cpufreq_governor *cpufreq_default_governor(void)
>
> static int cpufreq_init_policy(struct cpufreq_policy *policy)
> {
> - struct cpufreq_governor *def_gov = cpufreq_default_governor();
> struct cpufreq_governor *gov = NULL;
> unsigned int pol = CPUFREQ_POLICY_UNKNOWN;
>
> @@ -1065,8 +1067,8 @@ static int cpufreq_init_policy(struct cpufreq_policy *policy)
> if (gov) {
> pr_debug("Restoring governor %s for cpu %d\n",
> policy->governor->name, policy->cpu);
> - } else if (def_gov) {
> - gov = def_gov;
> + } else if (default_governor) {
> + gov = default_governor;
> } else {
> return -ENODATA;
> }
> @@ -1074,8 +1076,8 @@ static int cpufreq_init_policy(struct cpufreq_policy *policy)
> /* Use the default policy if there is no last_policy. */
> if (policy->last_policy) {
> pol = policy->last_policy;
> - } else if (def_gov) {
> - pol = cpufreq_parse_policy(def_gov->name);
> + } else if (default_governor) {
> + pol = cpufreq_parse_policy(default_governor->name);
> /*
> * In case the default governor is neiter "performance"
> * nor "powersave", fall back to the initial policy
> @@ -2320,6 +2322,9 @@ int cpufreq_register_governor(struct cpufreq_governor *governor)
> list_add(&governor->governor_list, &cpufreq_governor_list);
> }
>
> + if (!strncasecmp(cpufreq_param_governor, governor->name, CPUFREQ_NAME_LEN))
> + default_governor = governor;
> +
> mutex_unlock(&cpufreq_governor_mutex);
> return err;
> }
> @@ -2348,6 +2353,8 @@ void cpufreq_unregister_governor(struct cpufreq_governor *governor)
>
> mutex_lock(&cpufreq_governor_mutex);
> list_del(&governor->governor_list);
> + if (governor == default_governor)
> + default_governor = cpufreq_default_governor();
> mutex_unlock(&cpufreq_governor_mutex);
> }
> EXPORT_SYMBOL_GPL(cpufreq_unregister_governor);
> @@ -2789,7 +2796,13 @@ static int __init cpufreq_core_init(void)
> cpufreq_global_kobject = kobject_create_and_add("cpufreq", &cpu_subsys.dev_root->kobj);
> BUG_ON(!cpufreq_global_kobject);
>
> + mutex_lock(&cpufreq_governor_mutex);
> + if (!default_governor)
> + default_governor = cpufreq_default_governor();
> + mutex_unlock(&cpufreq_governor_mutex);
I don't think locking is required here at core-initcall level. Apart
from that:
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
> +
> return 0;
> }
> module_param(off, int, 0444);
> +module_param_string(default_governor, cpufreq_param_governor, CPUFREQ_NAME_LEN, 0444);
> core_initcall(cpufreq_core_init);
> --
> 2.27.0.111.gc72c7da667-goog
--
viresh
^ permalink raw reply
* [PATCH RFC 1/1] powerpc/eeh: Provide a unique ID for each EEH recovery
From: Sam Bobroff @ 2020-06-24 5:19 UTC (permalink / raw)
To: linuxppc-dev
Give a unique ID to each recovery event, to ease log parsing and
prepare for parallel recovery.
Also add some new messages with a very simple format that may be
useful to log-parsers.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
---
This patch should be applied on top of my recent(ish) set:
"powerpc/eeh: Synchronization for safety".
arch/powerpc/include/asm/eeh_event.h | 3 +-
arch/powerpc/include/asm/ppc-pci.h | 2 +-
arch/powerpc/kernel/eeh.c | 45 ++++---
arch/powerpc/kernel/eeh_driver.c | 189 +++++++++++++++------------
arch/powerpc/kernel/eeh_event.c | 12 +-
5 files changed, 148 insertions(+), 103 deletions(-)
diff --git a/arch/powerpc/include/asm/eeh_event.h b/arch/powerpc/include/asm/eeh_event.h
index dadde7d52f46..a1fe736bc4cf 100644
--- a/arch/powerpc/include/asm/eeh_event.h
+++ b/arch/powerpc/include/asm/eeh_event.h
@@ -17,13 +17,14 @@
struct eeh_event {
struct list_head list; /* to form event queue */
struct eeh_pe *pe; /* EEH PE */
+ unsigned int id; /* Event ID */
};
int eeh_event_init(void);
int eeh_send_failure_event(struct eeh_pe *pe);
int __eeh_send_failure_event(struct eeh_pe *pe);
void eeh_remove_event(struct eeh_pe *pe, bool force);
-void eeh_handle_normal_event(struct eeh_pe *pe);
+void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe);
void eeh_handle_special_event(void);
#endif /* __KERNEL__ */
diff --git a/arch/powerpc/include/asm/ppc-pci.h b/arch/powerpc/include/asm/ppc-pci.h
index 7f4be5a05eb3..ec62b491ff97 100644
--- a/arch/powerpc/include/asm/ppc-pci.h
+++ b/arch/powerpc/include/asm/ppc-pci.h
@@ -47,7 +47,7 @@ extern int rtas_setup_phb(struct pci_controller *phb);
void eeh_addr_cache_insert_dev(struct pci_dev *dev);
void eeh_addr_cache_rmv_dev(struct pci_dev *dev);
struct eeh_dev *eeh_addr_cache_get_dev(unsigned long addr);
-void eeh_slot_error_detail(struct eeh_pe *pe, int severity);
+void eeh_slot_error_detail(unsigned int event_id, struct eeh_pe *pe, int severity);
int eeh_pci_enable(struct eeh_pe *pe, int function);
int eeh_pe_reset_full(struct eeh_pe *pe, bool include_passed);
void eeh_save_bars(struct eeh_dev *edev);
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 68e6dfa526a5..54f921ff7621 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -197,7 +197,8 @@ EXPORT_SYMBOL_GPL(eeh_recovery_must_be_locked);
* for the indicated PCI device, and puts them into a buffer
* for RTAS error logging.
*/
-static size_t eeh_dump_dev_log(struct eeh_dev *edev, char *buf, size_t len)
+static size_t eeh_dump_dev_log(unsigned int event_id, struct eeh_dev *edev,
+ char *buf, size_t len)
{
struct pci_dn *pdn = eeh_dev_to_pdn(edev);
u32 cfg;
@@ -206,34 +207,37 @@ static size_t eeh_dump_dev_log(struct eeh_dev *edev, char *buf, size_t len)
char buffer[128];
if (!pdn) {
- pr_warn("EEH: Note: No error log for absent device.\n");
+ pr_warn("EEH(%u): Note: No error log for absent device.\n",
+ event_id);
return 0;
}
n += scnprintf(buf+n, len-n, "%04x:%02x:%02x.%01x\n",
pdn->phb->global_number, pdn->busno,
PCI_SLOT(pdn->devfn), PCI_FUNC(pdn->devfn));
- pr_warn("EEH: of node=%04x:%02x:%02x.%01x\n",
+ pr_warn("EEH(%u): of node=%04x:%02x:%02x.%01x\n",
+ event_id,
pdn->phb->global_number, pdn->busno,
PCI_SLOT(pdn->devfn), PCI_FUNC(pdn->devfn));
eeh_ops->read_config(pdn, PCI_VENDOR_ID, 4, &cfg);
n += scnprintf(buf+n, len-n, "dev/vend:%08x\n", cfg);
- pr_warn("EEH: PCI device/vendor: %08x\n", cfg);
+ pr_warn("EEH(%u): PCI device/vendor: %08x\n", event_id, cfg);
eeh_ops->read_config(pdn, PCI_COMMAND, 4, &cfg);
n += scnprintf(buf+n, len-n, "cmd/stat:%x\n", cfg);
- pr_warn("EEH: PCI cmd/status register: %08x\n", cfg);
+ pr_warn("EEH(%u): PCI cmd/status register: %08x\n", event_id, cfg);
/* Gather bridge-specific registers */
if (edev->mode & EEH_DEV_BRIDGE) {
eeh_ops->read_config(pdn, PCI_SEC_STATUS, 2, &cfg);
n += scnprintf(buf+n, len-n, "sec stat:%x\n", cfg);
- pr_warn("EEH: Bridge secondary status: %04x\n", cfg);
+ pr_warn("EEH(%u): Bridge secondary status: %04x\n",
+ event_id, cfg);
eeh_ops->read_config(pdn, PCI_BRIDGE_CONTROL, 2, &cfg);
n += scnprintf(buf+n, len-n, "brdg ctl:%x\n", cfg);
- pr_warn("EEH: Bridge control: %04x\n", cfg);
+ pr_warn("EEH(%u): Bridge control: %04x\n", event_id, cfg);
}
/* Dump out the PCI-X command and status regs */
@@ -241,18 +245,19 @@ static size_t eeh_dump_dev_log(struct eeh_dev *edev, char *buf, size_t len)
if (cap) {
eeh_ops->read_config(pdn, cap, 4, &cfg);
n += scnprintf(buf+n, len-n, "pcix-cmd:%x\n", cfg);
- pr_warn("EEH: PCI-X cmd: %08x\n", cfg);
+ pr_warn("EEH(%u): PCI-X cmd: %08x\n", event_id, cfg);
eeh_ops->read_config(pdn, cap+4, 4, &cfg);
n += scnprintf(buf+n, len-n, "pcix-stat:%x\n", cfg);
- pr_warn("EEH: PCI-X status: %08x\n", cfg);
+ pr_warn("EEH(%u): PCI-X status: %08x\n", event_id, cfg);
}
/* If PCI-E capable, dump PCI-E cap 10 */
cap = edev->pcie_cap;
if (cap) {
n += scnprintf(buf+n, len-n, "pci-e cap10:\n");
- pr_warn("EEH: PCI-E capabilities and status follow:\n");
+ pr_warn("EEH(%u): PCI-E capabilities and status follow:\n",
+ event_id);
for (i=0; i<=8; i++) {
eeh_ops->read_config(pdn, cap+4*i, 4, &cfg);
@@ -263,8 +268,8 @@ static size_t eeh_dump_dev_log(struct eeh_dev *edev, char *buf, size_t len)
pr_warn("%s\n", buffer);
l = scnprintf(buffer, sizeof(buffer),
- "EEH: PCI-E %02x: %08x ",
- 4*i, cfg);
+ "EEH(%u): PCI-E %02x: %08x ",
+ event_id, 4*i, cfg);
} else {
l += scnprintf(buffer+l, sizeof(buffer)-l,
"%08x ", cfg);
@@ -279,7 +284,8 @@ static size_t eeh_dump_dev_log(struct eeh_dev *edev, char *buf, size_t len)
cap = edev->aer_cap;
if (cap) {
n += scnprintf(buf+n, len-n, "pci-e AER:\n");
- pr_warn("EEH: PCI-E AER capability register set follows:\n");
+ pr_warn("EEH(%u): PCI-E AER capability register set follows:\n",
+ event_id);
for (i=0; i<=13; i++) {
eeh_ops->read_config(pdn, cap+4*i, 4, &cfg);
@@ -304,16 +310,13 @@ static size_t eeh_dump_dev_log(struct eeh_dev *edev, char *buf, size_t len)
return n;
}
-static void *eeh_dump_pe_log(struct eeh_pe *pe, void *flag)
+static void eeh_dump_pe_log(unsigned int event_id, struct eeh_pe *pe, size_t *plen)
{
struct eeh_dev *edev, *tmp;
- size_t *plen = flag;
eeh_pe_for_each_dev(pe, edev, tmp)
- *plen += eeh_dump_dev_log(edev, pci_regs_buf + *plen,
+ *plen += eeh_dump_dev_log(event_id, edev, pci_regs_buf + *plen,
EEH_PCI_REGS_LOG_LEN - *plen);
-
- return NULL;
}
/**
@@ -326,9 +329,10 @@ static void *eeh_dump_pe_log(struct eeh_pe *pe, void *flag)
* out from the config space of the corresponding PCI device, while
* the error log is fetched through platform dependent function call.
*/
-void eeh_slot_error_detail(struct eeh_pe *pe, int severity)
+void eeh_slot_error_detail(unsigned int event_id, struct eeh_pe *pe, int severity)
{
size_t loglen = 0;
+ struct eeh_pe *tmp_pe;
/*
* When the PHB is fenced or dead, it's pointless to collect
@@ -368,7 +372,8 @@ void eeh_slot_error_detail(struct eeh_pe *pe, int severity)
eeh_pe_restore_bars(pe);
pci_regs_buf[0] = 0;
- eeh_pe_traverse(pe, eeh_dump_pe_log, &loglen);
+ eeh_for_each_pe(pe, tmp_pe)
+ eeh_dump_pe_log(event_id, tmp_pe, &loglen);
}
}
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index f21e0910f80a..0dbc218597e3 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -248,10 +248,13 @@ static void eeh_set_irq_state(struct eeh_pe *root, bool enable)
}
}
-typedef enum pci_ers_result (*eeh_report_fn)(struct pci_dev *,
+typedef enum pci_ers_result (*eeh_report_fn)(unsigned int event_id,
+ struct pci_dev *,
struct pci_driver *);
-static void eeh_pe_report_pdev(struct pci_dev *pdev, eeh_report_fn fn,
- enum pci_ers_result *result)
+static void eeh_pe_report_pdev(unsigned int event_id,
+ struct pci_dev *pdev, eeh_report_fn fn,
+ enum pci_ers_result *result,
+ const char *handler_name)
{
struct eeh_dev *edev;
struct pci_driver *driver;
@@ -260,7 +263,7 @@ static void eeh_pe_report_pdev(struct pci_dev *pdev, eeh_report_fn fn,
edev = pci_dev_to_eeh_dev(pdev);
if (!edev) {
- pci_info(pdev, "no EEH state for device");
+ pci_info(pdev, "EEH(%u): no EEH state for device", event_id);
return;
}
/* Cache some useful values before releasing the lock: */
@@ -280,19 +283,26 @@ static void eeh_pe_report_pdev(struct pci_dev *pdev, eeh_report_fn fn,
driver = eeh_pcid_get(pdev);
if (!driver)
- pci_info(pdev, "no driver");
+ pci_info(pdev, "EEH(%u): no driver", event_id);
else if (!driver->err_handler)
- pci_info(pdev, "driver not EEH aware");
+ pci_info(pdev, "EEH(%u): driver not EEH aware", event_id);
else if (late)
- pci_info(pdev, "driver bound too late");
+ pci_info(pdev, "EEH(%u): driver bound too late", event_id);
else {
- new_result = fn(pdev, driver);
+ pr_warn("EEH(%u): EVENT=HANDLER_CALL DEVICE=%04x:%02x:%02x.%x DRIVER='%s' HANDLER='%s'\n",
+ event_id, edev->controller->global_number,
+ PCI_BUSNO(edev->bdfn), PCI_SLOT(edev->bdfn),
+ PCI_FUNC(edev->bdfn), driver->name, handler_name);
+
+ new_result = fn(event_id, pdev, driver);
/*
* It's not safe to use edev here, because the locks
* have been released and devices could have changed.
*/
- pci_info(pdev, "%s driver reports: '%s'",
- driver->name,
+ pr_warn("EEH(%u): EVENT=HANDLER_RETURN RESULT='%s'\n",
+ event_id, pci_ers_result_name(new_result));
+ pci_info(pdev, "EEH(%u): %s driver reports: %s",
+ event_id, driver->name,
pci_ers_result_name(new_result));
if (result)
*result = pci_ers_merge_result(*result,
@@ -303,8 +313,8 @@ static void eeh_pe_report_pdev(struct pci_dev *pdev, eeh_report_fn fn,
device_unlock(&pdev->dev);
eeh_recovery_lock();
} else {
- pci_info(pdev, "not actionable (%d,%d,%d)", !!pdev,
- !removed, !passed);
+ pci_info(pdev, "EEH(%u): not actionable (%d,%d,%d)",
+ event_id, !!pdev, !removed, !passed);
}
}
@@ -345,12 +355,13 @@ static void pdev_cache_list_destroy(struct pci_dev **pdevs)
kfree(pdevs);
}
-static void eeh_pe_report(const char *name, struct eeh_pe *root,
+static void eeh_pe_report(unsigned int event_id,
+ const char *name, struct eeh_pe *root,
eeh_report_fn fn, enum pci_ers_result *result)
{
struct pci_dev **pdevs, **pdevp;
- pr_info("EEH: Beginning: '%s'\n", name);
+ pr_info("EEH(%u): Beginning: '%s'\n", event_id, name);
/*
* It would be convenient to continue to hold the recovery lock here
* but driver callbacks can take a very long time or never return at
@@ -362,15 +373,15 @@ static void eeh_pe_report(const char *name, struct eeh_pe *root,
* NOTE! eeh_recovery_lock() is released briefly
* in eeh_pe_report_pdev()
*/
- eeh_pe_report_pdev(*pdevp, fn, result);
+ eeh_pe_report_pdev(event_id, *pdevp, fn, result, name);
}
pdev_cache_list_destroy(pdevs);
if (result)
- pr_info("EEH: Finished:'%s' with aggregate recovery state:'%s'\n",
- name, pci_ers_result_name(*result));
+ pr_info("EEH(%u): Finished:'%s' with aggregate recovery state:'%s'\n",
+ event_id, name, pci_ers_result_name(*result));
else
- pr_info("EEH: Finished:'%s'", name);
+ pr_info("EEH(%u): Finished:'%s'", event_id, name);
}
/**
@@ -380,7 +391,8 @@ static void eeh_pe_report(const char *name, struct eeh_pe *root,
*
* Report an EEH error to each device driver.
*/
-static enum pci_ers_result eeh_report_error(struct pci_dev *pdev,
+static enum pci_ers_result eeh_report_error(unsigned int event_id,
+ struct pci_dev *pdev,
struct pci_driver *driver)
{
enum pci_ers_result rc;
@@ -410,12 +422,14 @@ static enum pci_ers_result eeh_report_error(struct pci_dev *pdev,
* Tells each device driver that IO ports, MMIO and config space I/O
* are now enabled.
*/
-static enum pci_ers_result eeh_report_mmio_enabled(struct pci_dev *pdev,
+static enum pci_ers_result eeh_report_mmio_enabled(unsigned int event_id,
+ struct pci_dev *pdev,
struct pci_driver *driver)
{
if (!driver->err_handler->mmio_enabled)
return PCI_ERS_RESULT_NONE;
- pci_info(pdev, "Invoking %s->mmio_enabled()", driver->name);
+ pci_info(pdev, "EEH(%u): Invoking %s->mmio_enabled()",
+ event_id, driver->name);
return driver->err_handler->mmio_enabled(pdev);
}
@@ -429,7 +443,8 @@ static enum pci_ers_result eeh_report_mmio_enabled(struct pci_dev *pdev,
* some actions, usually to save data the driver needs so that the
* driver can work again while the device is recovered.
*/
-static enum pci_ers_result eeh_report_reset(struct pci_dev *pdev,
+static enum pci_ers_result eeh_report_reset(unsigned int event_id,
+ struct pci_dev *pdev,
struct pci_driver *driver)
{
struct eeh_dev *edev;
@@ -442,7 +457,8 @@ static enum pci_ers_result eeh_report_reset(struct pci_dev *pdev,
return PCI_ERS_RESULT_NONE;
}
eeh_serialize_unlock(flags);
- pci_info(pdev, "Invoking %s->slot_reset()", driver->name);
+ pci_info(pdev, "EEH(%u): Invoking %s->slot_reset()",
+ event_id, driver->name);
return driver->err_handler->slot_reset(pdev);
}
@@ -482,7 +498,8 @@ static void eeh_dev_restore_state(struct eeh_dev *edev, void *userdata)
* could resume so that the device driver can do some initialization
* to make the recovered device work again.
*/
-static enum pci_ers_result eeh_report_resume(struct pci_dev *pdev,
+static enum pci_ers_result eeh_report_resume(unsigned int event_id,
+ struct pci_dev *pdev,
struct pci_driver *driver)
{
struct eeh_dev *edev;
@@ -496,7 +513,8 @@ static enum pci_ers_result eeh_report_resume(struct pci_dev *pdev,
}
eeh_serialize_unlock(flags);
- pci_info(pdev, "Invoking %s->resume()", driver->name);
+ pci_info(pdev, "EEH(%u): Invoking %s->resume()",
+ event_id, driver->name);
driver->err_handler->resume(pdev);
pci_uevent_ers(pdev, PCI_ERS_RESULT_RECOVERED);
@@ -517,7 +535,8 @@ static enum pci_ers_result eeh_report_resume(struct pci_dev *pdev,
* This informs the device driver that the device is permanently
* dead, and that no further recovery attempts will be made on it.
*/
-static enum pci_ers_result eeh_report_failure(struct pci_dev *pdev,
+static enum pci_ers_result eeh_report_failure(unsigned int event_id,
+ struct pci_dev *pdev,
struct pci_driver *driver)
{
enum pci_ers_result rc;
@@ -525,8 +544,8 @@ static enum pci_ers_result eeh_report_failure(struct pci_dev *pdev,
if (!driver->err_handler->error_detected)
return PCI_ERS_RESULT_NONE;
- pci_info(pdev, "Invoking %s->error_detected(permanent failure)",
- driver->name);
+ pci_info(pdev, "EEH(%u): Invoking %s->error_detected(permanent failure)",
+ event_id, driver->name);
rc = driver->err_handler->error_detected(pdev,
pci_channel_io_perm_failure);
@@ -579,7 +598,8 @@ static void *eeh_add_virt_device(struct eeh_dev *edev)
* lock is dropped (which it must be in order to take the PCI rescan/remove
* lock without risking a deadlock).
*/
-static void eeh_rmv_device(struct pci_dev *pdev, void *userdata)
+static void eeh_rmv_device(unsigned int event_id,
+ struct pci_dev *pdev, void *userdata)
{
struct eeh_dev *edev;
struct pci_driver *driver;
@@ -588,8 +608,8 @@ static void eeh_rmv_device(struct pci_dev *pdev, void *userdata)
edev = pci_dev_to_eeh_dev(pdev);
if (!edev) {
- pci_warn(pdev, "EEH: Device removed during processing (#%d)\n",
- __LINE__);
+ pci_warn(pdev, "EEH(%u): Device removed during processing (#%d)\n",
+ event_id, __LINE__);
return;
}
/*
@@ -617,7 +637,8 @@ static void eeh_rmv_device(struct pci_dev *pdev, void *userdata)
}
/* Remove it from PCI subsystem */
- pci_info(pdev, "EEH: Removing device without EEH sensitive driver\n");
+ pci_info(pdev, "EEH(%u): Removing device without EEH sensitive driver\n",
+ event_id);
edev->mode |= EEH_DEV_DISCONNECTED;
if (rmv_data)
rmv_data->removed_dev_count++;
@@ -743,7 +764,8 @@ int eeh_pe_reset_and_recover(struct eeh_pe *pe)
* During the reset, udev might be invoked because those affected
* PCI devices will be removed and then added.
*/
-static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus,
+static int eeh_reset_device(unsigned int event_id,
+ struct eeh_pe *pe, struct pci_bus *bus,
struct eeh_rmv_data *rmv_data,
bool driver_eeh_aware)
{
@@ -777,7 +799,7 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus,
pdevs = pdev_cache_list_create(pe);
/* eeh_rmv_device() may re-acquire the recovery lock */
for (pdevp = pdevs; pdevp && *pdevp; pdevp++)
- eeh_rmv_device(*pdevp, rmv_data);
+ eeh_rmv_device(event_id, *pdevp, rmv_data);
pdev_cache_list_destroy(pdevs);
} else {
eeh_recovery_unlock();
@@ -825,8 +847,8 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus,
* potentially weird things happen.
*/
if (!driver_eeh_aware || rmv_data->removed_dev_count) {
- pr_info("EEH: Sleep 5s ahead of %s hotplug\n",
- (driver_eeh_aware ? "partial" : "complete"));
+ pr_info("EEH(%u): Sleep 5s ahead of %s hotplug\n",
+ event_id, (driver_eeh_aware ? "partial" : "complete"));
eeh_recovery_unlock();
ssleep(5);
eeh_recovery_lock();
@@ -982,7 +1004,7 @@ static void eeh_clear_slot_attention(struct pci_dev *pdev)
* drivers (which cause a second set of hotplug events to go out to
* userspace).
*/
-void eeh_handle_normal_event(struct eeh_pe *pe)
+void eeh_handle_normal_event(unsigned int event_id, struct eeh_pe *pe)
{
struct pci_bus *bus;
struct eeh_dev *edev, *tmp;
@@ -997,8 +1019,8 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
eeh_recovery_lock();
bus = eeh_pe_bus_get(pe);
if (!bus) {
- pr_err("%s: Cannot find PCI bus for PHB#%x-PE#%x\n",
- __func__, pe->phb->global_number, pe->addr);
+ pr_err("EEH(%u): %s: Cannot find PCI bus for PHB#%x-PE#%x\n",
+ event_id, __func__, pe->phb->global_number, pe->addr);
eeh_recovery_unlock();
return;
}
@@ -1018,22 +1040,27 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
devices++;
if (!devices) {
- pr_debug("EEH: Frozen PHB#%x-PE#%x is empty!\n",
- pe->phb->global_number, pe->addr);
+ pr_debug("EEH(%u): Frozen PHB#%x-PE#%x is empty!\n",
+ event_id, pe->phb->global_number, pe->addr);
goto out; /* nothing to recover */
}
+ pe->freeze_count++;
+ pr_warn("EEH(%u): EVENT=RECOVERY_START TYPE=%s PHB=%#x PE=%#x COUNT=%d\n",
+ event_id, ((pe->type & EEH_PE_PHB) ? "PHB" : "PE"),
+ pe->phb->global_number, pe->addr, pe->freeze_count);
+
/* Log the event */
if (pe->type & EEH_PE_PHB) {
- pr_err("EEH: Recovering PHB#%x, location: %s\n",
- pe->phb->global_number, eeh_pe_loc_get(pe));
+ pr_err("EEH(%u): Recovering PHB#%x, location: %s\n",
+ event_id, pe->phb->global_number, eeh_pe_loc_get(pe));
} else {
struct eeh_pe *phb_pe = eeh_phb_pe_get(pe->phb);
- pr_err("EEH: Recovering PHB#%x-PE#%x\n",
- pe->phb->global_number, pe->addr);
- pr_err("EEH: PE location: %s, PHB location: %s\n",
- eeh_pe_loc_get(pe), eeh_pe_loc_get(phb_pe));
+ pr_err("EEH(%u): Recovering PHB#%x-PE#%x\n",
+ event_id, pe->phb->global_number, pe->addr);
+ pr_err("EEH(%u): PE location: %s, PHB location: %s\n",
+ event_id, eeh_pe_loc_get(pe), eeh_pe_loc_get(phb_pe));
}
#ifdef CONFIG_STACKTRACE
@@ -1045,23 +1072,22 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
void **ptrs = (void **) pe->stack_trace;
int i;
- pr_err("EEH: Frozen PHB#%x-PE#%x detected\n",
- pe->phb->global_number, pe->addr);
+ pr_err("EEH(%u): Frozen PHB#%x-PE#%x detected\n",
+ event_id, pe->phb->global_number, pe->addr);
/* FIXME: Use the same format as dump_stack() */
- pr_err("EEH: Call Trace:\n");
+ pr_err("EEH(%u): Call Trace:\n", event_id);
for (i = 0; i < pe->trace_entries; i++)
- pr_err("EEH: [%pK] %pS\n", ptrs[i], ptrs[i]);
+ pr_err("EEH(%u): [%pK] %pS\n", event_id, ptrs[i], ptrs[i]);
pe->trace_entries = 0;
}
#endif /* CONFIG_STACKTRACE */
eeh_pe_update_time_stamp(pe);
- pe->freeze_count++;
if (pe->freeze_count > eeh_max_freezes) {
- pr_err("EEH: PHB#%x-PE#%x has failed %d times in the last hour and has been permanently disabled.\n",
- pe->phb->global_number, pe->addr,
+ pr_err("EEH(%u): PHB#%x-PE#%x has failed %d times in the last hour and has been permanently disabled.\n",
+ event_id, pe->phb->global_number, pe->addr,
pe->freeze_count);
result = PCI_ERS_RESULT_DISCONNECT;
}
@@ -1081,12 +1107,12 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
* hotplug for this case.
*/
if (result != PCI_ERS_RESULT_DISCONNECT) {
- pr_warn("EEH: This PCI device has failed %d times in the last hour and will be permanently disabled after %d failures.\n",
- pe->freeze_count, eeh_max_freezes);
- pr_info("EEH: Notify device drivers to shutdown\n");
+ pr_warn("EEH(%u): This PCI device has failed %d times in the last hour and will be permanently disabled after %d failures.\n",
+ event_id, pe->freeze_count, eeh_max_freezes);
+ pr_info("EEH(%u): Notify device drivers to shutdown\n", event_id);
eeh_set_channel_state(pe, pci_channel_io_frozen);
eeh_set_irq_state(pe, false);
- eeh_pe_report("error_detected(IO frozen)", pe,
+ eeh_pe_report(event_id, "error_detected(IO frozen)", pe,
eeh_report_error, &result);
if ((pe->type & EEH_PE_PHB) &&
result != PCI_ERS_RESULT_NONE &&
@@ -1100,7 +1126,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
if (result != PCI_ERS_RESULT_DISCONNECT) {
rc = eeh_wait_state(pe, MAX_WAIT_FOR_RECOVERY*1000, true);
if (rc < 0 || rc == EEH_STATE_NOT_SUPPORT) {
- pr_warn("EEH: Permanent failure\n");
+ pr_warn("EEH(%u): Permanent failure\n", event_id);
result = PCI_ERS_RESULT_DISCONNECT;
}
}
@@ -1110,8 +1136,8 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
* have been informed.
*/
if (result != PCI_ERS_RESULT_DISCONNECT) {
- pr_info("EEH: Collect temporary log\n");
- eeh_slot_error_detail(pe, EEH_LOG_TEMP);
+ pr_info("EEH(%u): Collect temporary log\n", event_id);
+ eeh_slot_error_detail(event_id, pe, EEH_LOG_TEMP);
}
/* If all device drivers were EEH-unaware, then shut
@@ -1119,8 +1145,8 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
* go down willingly, without panicing the system.
*/
if (result == PCI_ERS_RESULT_NONE) {
- pr_info("EEH: Reset with hotplug activity\n");
- rc = eeh_reset_device(pe, bus, NULL, false);
+ pr_info("EEH(%u): Reset with hotplug activity\n", event_id);
+ rc = eeh_reset_device(event_id, pe, bus, NULL, false);
if (rc) {
pr_warn("%s: Unable to reset, err=%d\n",
__func__, rc);
@@ -1130,7 +1156,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
/* If all devices reported they can proceed, then re-enable MMIO */
if (result == PCI_ERS_RESULT_CAN_RECOVER) {
- pr_info("EEH: Enable I/O for affected devices\n");
+ pr_info("EEH(%u): Enable I/O for affected devices\n", event_id);
rc = eeh_pci_enable(pe, EEH_OPT_THAW_MMIO);
if (rc < 0) {
@@ -1138,15 +1164,15 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
} else if (rc) {
result = PCI_ERS_RESULT_NEED_RESET;
} else {
- pr_info("EEH: Notify device drivers to resume I/O\n");
- eeh_pe_report("mmio_enabled", pe,
+ pr_info("EEH(%u): Notify device drivers to resume I/O\n", event_id);
+ eeh_pe_report(event_id, "mmio_enabled", pe,
eeh_report_mmio_enabled, &result);
}
}
/* If all devices reported they can proceed, then re-enable DMA */
if (result == PCI_ERS_RESULT_CAN_RECOVER) {
- pr_info("EEH: Enabled DMA for affected devices\n");
+ pr_info("EEH(%u): Enabled DMA for affected devices\n", event_id);
rc = eeh_pci_enable(pe, EEH_OPT_THAW_DMA);
if (rc < 0) {
@@ -1166,8 +1192,8 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
/* If any device called out for a reset, then reset the slot */
if (result == PCI_ERS_RESULT_NEED_RESET) {
- pr_info("EEH: Reset without hotplug activity\n");
- rc = eeh_reset_device(pe, bus, &rmv_data, true);
+ pr_info("EEH(%u): Reset without hotplug activity\n", event_id);
+ rc = eeh_reset_device(event_id, pe, bus, &rmv_data, true);
if (rc) {
pr_warn("%s: Cannot reset, err=%d\n",
__func__, rc);
@@ -1176,7 +1202,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
result = PCI_ERS_RESULT_NONE;
eeh_set_channel_state(pe, pci_channel_io_normal);
eeh_set_irq_state(pe, true);
- eeh_pe_report("slot_reset", pe, eeh_report_reset,
+ eeh_pe_report(event_id, "slot_reset", pe, eeh_report_reset,
&result);
}
}
@@ -1194,10 +1220,10 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
}
/* Tell all device drivers that they can resume operations */
- pr_info("EEH: Notify device driver to resume\n");
+ pr_info("EEH(%u): Notify device driver to resume\n", event_id);
eeh_set_channel_state(pe, pci_channel_io_normal);
eeh_set_irq_state(pe, true);
- eeh_pe_report("resume", pe, eeh_report_resume, NULL);
+ eeh_pe_report(event_id, "resume", pe, eeh_report_resume, NULL);
eeh_for_each_pe(pe, tmp_pe) {
eeh_pe_for_each_dev(tmp_pe, edev, tmp) {
edev->mode &= ~EEH_DEV_NO_HANDLER;
@@ -1205,24 +1231,25 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
}
}
- pr_info("EEH: Recovery successful.\n");
+ pr_info("EEH(%u): Recovery successful.\n", event_id);
} else {
/*
* About 90% of all real-life EEH failures in the field
* are due to poorly seated PCI cards. Only 10% or so are
* due to actual, failed cards.
*/
- pr_err("EEH: Unable to recover from failure from PHB#%x-PE#%x.\n"
+ pr_err("EEH(%u): Unable to recover from failure from PHB#%x-PE#%x.\n"
"Please try reseating or replacing it\n",
- pe->phb->global_number, pe->addr);
+ event_id, pe->phb->global_number, pe->addr);
- eeh_slot_error_detail(pe, EEH_LOG_PERM);
+ eeh_slot_error_detail(event_id, pe, EEH_LOG_PERM);
/* Notify all devices that they're about to go down. */
eeh_set_channel_state(pe, pci_channel_io_perm_failure);
eeh_set_irq_state(pe, false);
- eeh_pe_report("error_detected(permanent failure)", pe,
+ eeh_pe_report(event_id, "error_detected(permanent failure)", pe,
eeh_report_failure, NULL);
+ pr_crit("EEH(%u): EVENT=RECOVERY_END RESULT=failure\n", event_id);
/* Mark the PE to be removed permanently */
eeh_pe_state_mark(pe, EEH_PE_REMOVED);
@@ -1235,7 +1262,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
if (pe->type & EEH_PE_VF) {
pdevs = pdev_cache_list_create(pe);
for (pdevp = pdevs; pdevp && *pdevp; pdevp++)
- eeh_rmv_device(*pdevp, NULL);
+ eeh_rmv_device(event_id, *pdevp, NULL);
pdev_cache_list_destroy(pdevs);
eeh_pe_dev_mode_mark(pe, EEH_DEV_REMOVED);
} else {
@@ -1251,6 +1278,8 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
}
}
+ pr_info("EEH(%u): EVENT=RECOVERY_END RESULT=success\n", event_id);
+
out:
/*
* Clean up any PEs without devices. While marked as EEH_PE_RECOVERYING
@@ -1338,7 +1367,7 @@ void eeh_handle_special_event(void)
if (rc == EEH_NEXT_ERR_FROZEN_PE ||
rc == EEH_NEXT_ERR_FENCED_PHB) {
eeh_pe_state_mark(pe, EEH_PE_RECOVERING);
- eeh_handle_normal_event(pe);
+ eeh_handle_normal_event(0, pe);
} else {
eeh_for_each_pe(pe, tmp_pe)
eeh_pe_for_each_dev(tmp_pe, edev, tmp_edev)
@@ -1347,7 +1376,7 @@ void eeh_handle_special_event(void)
/* Notify all devices to be down */
eeh_pe_state_clear(pe, EEH_PE_PRI_BUS, true);
eeh_set_channel_state(pe, pci_channel_io_perm_failure);
- eeh_pe_report(
+ eeh_pe_report(0,
"error_detected(permanent failure)", pe,
eeh_report_failure, NULL);
diff --git a/arch/powerpc/kernel/eeh_event.c b/arch/powerpc/kernel/eeh_event.c
index a7a8dc182efb..bd38d6fe5449 100644
--- a/arch/powerpc/kernel/eeh_event.c
+++ b/arch/powerpc/kernel/eeh_event.c
@@ -26,6 +26,9 @@ static DEFINE_SPINLOCK(eeh_eventlist_lock);
static DECLARE_COMPLETION(eeh_eventlist_event);
static LIST_HEAD(eeh_eventlist);
+/* Event ID 0 is reserved for special events */
+static atomic_t eeh_event_id = ATOMIC_INIT(1);
+
/**
* eeh_event_handler - Dispatch EEH events.
* @dummy - unused
@@ -59,7 +62,7 @@ static int eeh_event_handler(void * dummy)
/* We might have event without binding PE */
if (event->pe)
- eeh_handle_normal_event(event->pe);
+ eeh_handle_normal_event(event->id, event->pe);
else
eeh_handle_special_event();
@@ -110,6 +113,13 @@ int __eeh_send_failure_event(struct eeh_pe *pe)
return -ENOMEM;
}
event->pe = pe;
+ do {
+ /* Skip over the special value (0) */
+ event->id = (unsigned int)atomic_inc_return(&eeh_event_id);
+ } while (!event->id);
+
+ pr_err("EEH(%u): EVENT=ERROR_DETECTED PHB=%#x PE=%#x\n",
+ event->id, pe->phb->global_number, pe->addr);
/*
* Mark the PE as recovering before inserting it in the queue.
--
2.22.0.216.g00a2a96fc9
^ permalink raw reply related
* [PATCH] powerpc/boot: Use address-of operator on section symbols
From: Nathan Chancellor @ 2020-06-24 3:59 UTC (permalink / raw)
To: Michael Ellerman
Cc: Geoff Levand, linux-kernel, clang-built-linux, Paul Mackerras,
Joel Stanley, Nathan Chancellor, linuxppc-dev
Clang warns:
arch/powerpc/boot/main.c:107:18: warning: array comparison always
evaluates to a constant [-Wtautological-compare]
if (_initrd_end > _initrd_start) {
^
arch/powerpc/boot/main.c:155:20: warning: array comparison always
evaluates to a constant [-Wtautological-compare]
if (_esm_blob_end <= _esm_blob_start)
^
2 warnings generated.
These are not true arrays, they are linker defined symbols, which are
just addresses. Using the address of operator silences the warning
and does not change the resulting assembly with either clang/ld.lld
or gcc/ld (tested with diff + objdump -Dr).
Link: https://github.com/ClangBuiltLinux/linux/issues/212
Reported-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
---
arch/powerpc/boot/main.c | 4 ++--
arch/powerpc/boot/ps3.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/powerpc/boot/main.c b/arch/powerpc/boot/main.c
index a9d209135975..cae31a6e8f02 100644
--- a/arch/powerpc/boot/main.c
+++ b/arch/powerpc/boot/main.c
@@ -104,7 +104,7 @@ static struct addr_range prep_initrd(struct addr_range vmlinux, void *chosen,
{
/* If we have an image attached to us, it overrides anything
* supplied by the loader. */
- if (_initrd_end > _initrd_start) {
+ if (&_initrd_end > &_initrd_start) {
printf("Attached initrd image at 0x%p-0x%p\n\r",
_initrd_start, _initrd_end);
initrd_addr = (unsigned long)_initrd_start;
@@ -152,7 +152,7 @@ static void prep_esm_blob(struct addr_range vmlinux, void *chosen)
unsigned long esm_blob_addr, esm_blob_size;
/* Do we have an ESM (Enter Secure Mode) blob? */
- if (_esm_blob_end <= _esm_blob_start)
+ if (&_esm_blob_end <= &_esm_blob_start)
return;
printf("Attached ESM blob at 0x%p-0x%p\n\r",
diff --git a/arch/powerpc/boot/ps3.c b/arch/powerpc/boot/ps3.c
index c52552a681c5..6e4efbdb6b7c 100644
--- a/arch/powerpc/boot/ps3.c
+++ b/arch/powerpc/boot/ps3.c
@@ -127,7 +127,7 @@ void platform_init(void)
ps3_repository_read_rm_size(&rm_size);
dt_fixup_memory(0, rm_size);
- if (_initrd_end > _initrd_start) {
+ if (&_initrd_end > &_initrd_start) {
setprop_val(chosen, "linux,initrd-start", (u32)(_initrd_start));
setprop_val(chosen, "linux,initrd-end", (u32)(_initrd_end));
}
base-commit: 3e08a95294a4fb3702bb3d35ed08028433c37fe6
--
2.27.0
^ permalink raw reply related
* Re: [PATCH V3 0/4] mm/debug_vm_pgtable: Add some more tests
From: Anshuman Khandual @ 2020-06-24 3:13 UTC (permalink / raw)
To: linux-mm
Cc: linux-doc, Heiko Carstens, Paul Mackerras, H. Peter Anvin,
linux-riscv, Will Deacon, linux-arch, linux-s390, Jonathan Corbet,
x86, Mike Rapoport, Christian Borntraeger, Ingo Molnar,
linux-arm-kernel, ziy, Catalin Marinas, linux-snps-arc,
Vasily Gorbik, Borislav Petkov, Paul Walmsley,
Kirill A . Shutemov, Thomas Gleixner, gerald.schaefer,
christophe.leroy, Vineet Gupta, linux-kernel, Palmer Dabbelt,
Andrew Morton, linuxppc-dev
In-Reply-To: <1592192277-8421-1-git-send-email-anshuman.khandual@arm.com>
On 06/15/2020 09:07 AM, Anshuman Khandual wrote:
> This series adds some more arch page table helper validation tests which
> are related to core and advanced memory functions. This also creates a
> documentation, enlisting expected semantics for all page table helpers as
> suggested by Mike Rapoport previously (https://lkml.org/lkml/2020/1/30/40).
>
> There are many TRANSPARENT_HUGEPAGE and ARCH_HAS_TRANSPARENT_HUGEPAGE_PUD
> ifdefs scattered across the test. But consolidating all the fallback stubs
> is not very straight forward because ARCH_HAS_TRANSPARENT_HUGEPAGE_PUD is
> not explicitly dependent on ARCH_HAS_TRANSPARENT_HUGEPAGE.
>
> Tested on arm64, x86 platforms but only build tested on all other enabled
> platforms through ARCH_HAS_DEBUG_VM_PGTABLE i.e powerpc, arc, s390. The
> following failure on arm64 still exists which was mentioned previously. It
> will be fixed with the upcoming THP migration on arm64 enablement series.
>
> WARNING .... mm/debug_vm_pgtable.c:860 debug_vm_pgtable+0x940/0xa54
> WARN_ON(!pmd_present(pmd_mkinvalid(pmd_mkhuge(pmd))))
>
> This series is based on v5.8-rc1.
>
> Changes in V3:
>
> - Replaced HAVE_ARCH_SOFT_DIRTY with MEM_SOFT_DIRTY
> - Added HAVE_ARCH_HUGE_VMAP checks in pxx_huge_tests() per Gerald
> - Updated documentation for pmd_thp_tests() per Zi Yan
> - Replaced READ_ONCE() with huge_ptep_get() per Gerald
> - Added pte_mkhuge() and masking with PMD_MASK per Gerald
> - Replaced pte_same() with holding pfn check in pxx_swap_tests()
> - Added documentation for all (#ifdef #else #endif) per Gerald
> - Updated pmd_protnone_tests() per Gerald
> - Updated HugeTLB PTE creation in hugetlb_advanced_tests() per Gerald
> - Replaced [pmd|pud]_mknotpresent() with [pmd|pud]_mkinvalid()
> - Added has_transparent_hugepage() check for PMD and PUD tests
> - Added a patch which debug prints all individual tests being executed
> - Updated documentation for renamed [pmd|pud]_mkinvalid() helpers
Hello Gerald/Christophe/Vineet,
It would be really great if you could give this series a quick test
on s390/ppc/arc platforms respectively. Thank you.
- Anshuman
^ permalink raw reply
* Re: [PATCH v6 1/5] KVM: s390: clean up redundant 'kvm_run' parameters
From: Tianjia Zhang @ 2020-06-24 2:39 UTC (permalink / raw)
To: Christian Borntraeger, pbonzini, tsbogend, paulus, mpe, benh,
frankja, david, cohuck, heiko.carstens, gor,
sean.j.christopherson, vkuznets, wanpengli, jmattson, joro, tglx,
mingo, bp, x86, hpa, maz, james.morse, julien.thierry.kdev,
suzuki.poulose, christoffer.dall, peterx, thuth, chenhuacai
Cc: linux-s390, kvm, linux-mips, kvm-ppc, linux-kernel, linuxppc-dev,
kvmarm, linux-arm-kernel
In-Reply-To: <c49f8814-c7ea-6884-91c5-3dcd40c6509f@de.ibm.com>
On 2020/6/23 23:31, Christian Borntraeger wrote:
>
>
> On 23.06.20 15:14, Tianjia Zhang wrote:
>> In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
>> structure. For historical reasons, many kvm-related function parameters
>> retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time. This
>> patch does a unified cleanup of these remaining redundant parameters.
>>
>> Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
>> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>> arch/s390/kvm/kvm-s390.c | 23 +++++++++++++++--------
>> 1 file changed, 15 insertions(+), 8 deletions(-)
>
> Tinajia,
>
> I have trouble seeing value in this particular patch. We add LOCs
> without providing any noticable benefit. All other patches in this series at
> least reduce the amount of code. So I would defer this to Paolo if he prefers
> to have this way across all architectures.
Yes, this is a full architecture optimization. Some of the architecture
optimization has been merged into the mainline. I think it is necessary
to unify this optimization. This is also the meaning of Paolo.
You can refer to the email of the previous version:
https://lkml.org/lkml/2020/4/27/16
Thanks,
Tianjia
^ permalink raw reply
* RE: [PATCH v2 0/2] cpufreq: Specify the default governor on command line
From: Doug Smythies @ 2020-06-24 0:07 UTC (permalink / raw)
To: 'Quentin Perret'
Cc: juri.lelli, kernel-team, vincent.guittot, arnd, linux-pm, peterz,
adharmap, rafael, rjw, linux-kernel, viresh.kumar, mingo, paulus,
linuxppc-dev, tkjos
In-Reply-To: <20200623180437.GA248517@google.com>
[-- Attachment #1: Type: text/plain, Size: 2109 bytes --]
Hi Quentin,
Thanks for your quick reply.
On 2020.06.23 11:05 Quentin Perret wrote:
> Hi Doug,
>
> On Tuesday 23 Jun 2020 at 10:54:33 (-0700), Doug Smythies wrote:
> > Hi Quentin,
> >
> > Because I am lazy and sometimes do not want to recompile
> > the distro source, I have a need/desire for this.
>
> Good to know I'm not the only one ;-)
>
> > Tested these two grub command lines:
> >
> > GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=300 intel_pstate=disable
> cpufreq.default_governor=schedutil cpuidle_sysfs_switch cpuidle.governor=teo"
> >
> > And
> >
> > #GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=450 intel_pstate=passive
> cpufreq.default_governor=schedutil cpuidle_sysfs_switch cpuidle.governor=teo"
> >
> > And all worked as expected. I use Ubuntu as my distro, and also had to disable a startup script that
> switches to "ondemand", or similar, after 1 minute.
>
> Good, thanks for giving it a try.
>
> > As a side note (separate subject, but is one reason I tried it):
> > My i5-9600K based computer seems to hit a power limit during boot approximately 3 seconds after
> kernel selection on grub.
> > This had no effect on that issue (even when selecting powersave governor).
>
> Interesting ... Could you confirm that compiling with powersave as
> default doesn't fix the issue either?
No, it doesn't (good idea for a test though).
However, the big mains spike is also gone. So, I no longer know why those power
limit log bits are always set after boot.
>
> Other question, when does the intel_pstate driver start on your device?
> Before or after that 3 seconds boot time?
Before, if I understand correctly (from dmesg):
[ 0.468969] intel_pstate: Intel P-state driver initializing
I'll attach a couple of annotated mains power graphs.
(which will likely get stripped from the on-list version of this e-mail).
Currently, I am drowning in stuff that doesn't work, and will put
this aside for now. I'll revive this as a new thread or a bugzilla
eventually.
I also tried booting with turbo disabled, no difference.
Thanks for this patch set.
... Doug
[-- Attachment #2: reboot-mains-power.png --]
[-- Type: image/png, Size: 35002 bytes --]
[-- Attachment #3: reboot-mains-power4.png --]
[-- Type: image/png, Size: 34687 bytes --]
^ permalink raw reply
* [PATCH 3/3] powerpc: re-initialise lazy FPU/VEC counters on every fault
From: Nicholas Piggin @ 2020-06-23 23:41 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Anton Blanchard, Nicholas Piggin, Gustavo Romero
In-Reply-To: <20200623234139.2262227-1-npiggin@gmail.com>
When a FP/VEC/VSX unavailable fault loads registers and enables the
facility in the MSR, re-set the lazy restore counters to 1 rather
than incrementing them so every fault gets the same number of
restores before the next fault.
This probably shouldn't be a practical change because if a lazy counter
was non-zero then it should have been restored and would not cause a
fault when userspace tries to access it. However the code and comment
implies otherwise so that's misleading and unnecessary.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/kernel/fpu.S | 4 +---
arch/powerpc/kernel/vector.S | 4 +---
2 files changed, 2 insertions(+), 6 deletions(-)
diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S
index cac22cb97a8c..4ae39db70044 100644
--- a/arch/powerpc/kernel/fpu.S
+++ b/arch/powerpc/kernel/fpu.S
@@ -107,9 +107,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX)
or r12,r12,r4
std r12,_MSR(r1)
#endif
- /* Don't care if r4 overflows, this is desired behaviour */
- lbz r4,THREAD_LOAD_FP(r5)
- addi r4,r4,1
+ li r4,1
stb r4,THREAD_LOAD_FP(r5)
addi r10,r5,THREAD_FPSTATE
lfd fr0,FPSTATE_FPSCR(r10)
diff --git a/arch/powerpc/kernel/vector.S b/arch/powerpc/kernel/vector.S
index efc5b52f95d2..801dc28fdcca 100644
--- a/arch/powerpc/kernel/vector.S
+++ b/arch/powerpc/kernel/vector.S
@@ -76,9 +76,7 @@ _GLOBAL(load_up_altivec)
oris r12,r12,MSR_VEC@h
std r12,_MSR(r1)
#endif
- /* Don't care if r4 overflows, this is desired behaviour */
- lbz r4,THREAD_LOAD_VEC(r5)
- addi r4,r4,1
+ li r4,1
stb r4,THREAD_LOAD_VEC(r5)
addi r6,r5,THREAD_VRSTATE
li r4,1
--
2.23.0
^ permalink raw reply related
* [PATCH 2/3] powerpc/64s: Fix restore_math unnecessarily changing MSR
From: Nicholas Piggin @ 2020-06-23 23:41 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Anton Blanchard, Nicholas Piggin, Gustavo Romero
In-Reply-To: <20200623234139.2262227-1-npiggin@gmail.com>
Before returning to user, if there are missing FP/VEC/VSX bits from the
user MSR then those registers had been saved and must be restored again
before use. restore_math will decide whether to restore immediately, or
skip the restore and let fp/vec/vsx unavailable faults demand load the
registers.
Each time restore_math restores one of the FP/VSX or VEC register sets
is loaded, an 8-bit counter is incremented (load_fp and load_vec). When
these wrap to zero, restore_math no longer restores that register set
until after they are next demand faulted.
It's quite usual for those counters to have different values, so if one
wraps to zero and restore_math no longer restores its registers or user
MSR bit but the other is not zero yet does not need to be restored
(because the kernel is not frequently using the FPU), then restore_math
will be called and it will also not return in the early exit check.
This causes msr_check_and_set to test and set the MSR at every kernel
exit despite having no work to do.
This can cause workloads (e.g., a NULL syscall microbenchmark) to run
fast for a time while both counters are non-zero, then slow down when
one of the counters reaches zero, then speed up again after the second
counter reaches zero. The cost is significant, about 10% slowdown on a
NULL syscall benchmark, and the jittery behaviour is very undesirable.
Fix this by having restore_math test all conditions first, and only
update MSR if we will be loading registers.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/kernel/process.c | 100 ++++++++++++++++++-------------
arch/powerpc/kernel/syscall_64.c | 8 +++
2 files changed, 68 insertions(+), 40 deletions(-)
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index c6c1add91bf3..4f72aa4ed717 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -471,49 +471,58 @@ EXPORT_SYMBOL(giveup_all);
#ifdef CONFIG_PPC_BOOK3S_64
#ifdef CONFIG_PPC_FPU
-static int restore_fp(struct task_struct *tsk)
+static bool should_restore_fp(void)
{
- if (tsk->thread.load_fp) {
- load_fp_state(¤t->thread.fp_state);
+ if (current->thread.load_fp) {
current->thread.load_fp++;
- return 1;
+ return true;
}
- return 0;
+ return false;
+}
+
+static void do_restore_fp(void)
+{
+ load_fp_state(¤t->thread.fp_state);
}
#else
-static int restore_fp(struct task_struct *tsk) { return 0; }
+static bool should_restore_fp(void) { return false; }
+static void do_restore_fp(void) { }
#endif /* CONFIG_PPC_FPU */
#ifdef CONFIG_ALTIVEC
-#define loadvec(thr) ((thr).load_vec)
-static int restore_altivec(struct task_struct *tsk)
+static bool should_restore_altivec(void)
{
- if (cpu_has_feature(CPU_FTR_ALTIVEC) && (tsk->thread.load_vec)) {
- load_vr_state(&tsk->thread.vr_state);
- tsk->thread.used_vr = 1;
- tsk->thread.load_vec++;
-
- return 1;
+ if (cpu_has_feature(CPU_FTR_ALTIVEC) && (current->thread.load_vec)) {
+ current->thread.load_vec++;
+ return true;
}
- return 0;
+ return false;
+}
+
+static void do_restore_altivec(void)
+{
+ load_vr_state(¤t->thread.vr_state);
+ current->thread.used_vr = 1;
}
#else
-#define loadvec(thr) 0
-static inline int restore_altivec(struct task_struct *tsk) { return 0; }
+static bool should_restore_altivec(void) { return false; }
+static void do_restore_altivec(void) { }
#endif /* CONFIG_ALTIVEC */
#ifdef CONFIG_VSX
-static int restore_vsx(struct task_struct *tsk)
+static bool should_restore_vsx(void)
{
- if (cpu_has_feature(CPU_FTR_VSX)) {
- tsk->thread.used_vsr = 1;
- return 1;
- }
-
- return 0;
+ if (cpu_has_feature(CPU_FTR_VSX))
+ return true;
+ return false;
+}
+static void do_restore_vsx(void)
+{
+ current->thread.used_vsr = 1;
}
#else
-static inline int restore_vsx(struct task_struct *tsk) { return 0; }
+static bool should_restore_vsx(void) { return false; }
+static void do_restore_vsx(void) { }
#endif /* CONFIG_VSX */
/*
@@ -529,31 +538,42 @@ static inline int restore_vsx(struct task_struct *tsk) { return 0; }
void notrace restore_math(struct pt_regs *regs)
{
unsigned long msr;
-
- if (!current->thread.load_fp && !loadvec(current->thread))
- return;
+ unsigned long new_msr = 0;
msr = regs->msr;
- msr_check_and_set(msr_all_available);
/*
- * Only reload if the bit is not set in the user MSR, the bit BEING set
- * indicates that the registers are hot
+ * new_msr tracks the facilities that are to be restored. Only reload
+ * if the bit is not set in the user MSR (if it is set, the registers
+ * are live for the user thread).
*/
- if ((!(msr & MSR_FP)) && restore_fp(current))
- msr |= MSR_FP | current->thread.fpexc_mode;
+ if ((!(msr & MSR_FP)) && should_restore_fp())
+ new_msr |= MSR_FP | current->thread.fpexc_mode;
- if ((!(msr & MSR_VEC)) && restore_altivec(current))
- msr |= MSR_VEC;
+ if ((!(msr & MSR_VEC)) && should_restore_altivec())
+ new_msr |= MSR_VEC;
- if ((msr & (MSR_FP | MSR_VEC)) == (MSR_FP | MSR_VEC) &&
- restore_vsx(current)) {
- msr |= MSR_VSX;
+ if ((!(msr & MSR_VSX)) && should_restore_vsx()) {
+ if (((msr | new_msr) & (MSR_FP | MSR_VEC)) == (MSR_FP | MSR_VEC))
+ new_msr |= MSR_VSX;
}
- msr_check_and_clear(msr_all_available);
+ if (new_msr) {
+ msr_check_and_set(new_msr);
+
+ if (new_msr & MSR_FP)
+ do_restore_fp();
+
+ if (new_msr & MSR_VEC)
+ do_restore_altivec();
- regs->msr = msr;
+ if (new_msr & MSR_VSX)
+ do_restore_vsx();
+
+ msr_check_and_clear(new_msr);
+
+ regs->msr |= new_msr;
+ }
}
#endif
diff --git a/arch/powerpc/kernel/syscall_64.c b/arch/powerpc/kernel/syscall_64.c
index 79edba3ab312..5126f1d3184a 100644
--- a/arch/powerpc/kernel/syscall_64.c
+++ b/arch/powerpc/kernel/syscall_64.c
@@ -206,6 +206,13 @@ notrace unsigned long syscall_exit_prepare(unsigned long r3,
else if (cpu_has_feature(CPU_FTR_ALTIVEC))
mathflags |= MSR_VEC;
+ /*
+ * If userspace MSR has all available FP bits set,
+ * then they are live and no need to restore. If not,
+ * it means the regs were given up and restore_math
+ * may decide to restore them (to avoid taking an FP
+ * fault).
+ */
if ((regs->msr & mathflags) != mathflags)
restore_math(regs);
}
@@ -277,6 +284,7 @@ notrace unsigned long interrupt_exit_user_prepare(struct pt_regs *regs, unsigned
else if (cpu_has_feature(CPU_FTR_ALTIVEC))
mathflags |= MSR_VEC;
+ /* See above restore_math comment */
if ((regs->msr & mathflags) != mathflags)
restore_math(regs);
}
--
2.23.0
^ permalink raw reply related
* [PATCH 1/3] powerpc/64s: restore_math remove TM test
From: Nicholas Piggin @ 2020-06-23 23:41 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Anton Blanchard, Nicholas Piggin, Gustavo Romero
The TM test in restore_math added by commit dc16b553c949e ("powerpc:
Always restore FPU/VEC/VSX if hardware transactional memory in use") is
no longer necessary after commit a8318c13e79ba ("powerpc/tm: Fix
restoring FP/VMX facility incorrectly on interrupts"), which removed
the cases where restore_math has to restore if TM is active.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/kernel/process.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 7bb7faf84490..c6c1add91bf3 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -530,8 +530,7 @@ void notrace restore_math(struct pt_regs *regs)
{
unsigned long msr;
- if (!MSR_TM_ACTIVE(regs->msr) &&
- !current->thread.load_fp && !loadvec(current->thread))
+ if (!current->thread.load_fp && !loadvec(current->thread))
return;
msr = regs->msr;
--
2.23.0
^ permalink raw reply related
* Re: [PATCH 2/2] ASoC: bindings: fsl-asoc-card: Add compatible string for wm8524
From: Nicolin Chen @ 2020-06-23 23:30 UTC (permalink / raw)
To: Shengjiu Wang
Cc: alsa-devel, timur, Xiubo.Lee, linuxppc-dev, tiwai, lgirdwood,
perex, broonie, festevam, linux-kernel
In-Reply-To: <1592895167-30483-2-git-send-email-shengjiu.wang@nxp.com>
On Tue, Jun 23, 2020 at 02:52:47PM +0800, Shengjiu Wang wrote:
> In order to support wm8524 codec with fsl-asoc-card machine
> driver, add compatible string "fsl,imx-audio-wm8524".
>
> Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com>
Acked-by: Nicolin Chen <nicoleotsuka@gmail.com>
^ permalink raw reply
* Re: [PATCH 1/2] ASoC: fsl-asoc-card: Add WM8524 support
From: Nicolin Chen @ 2020-06-23 23:30 UTC (permalink / raw)
To: Shengjiu Wang
Cc: alsa-devel, timur, Xiubo.Lee, linuxppc-dev, tiwai, lgirdwood,
perex, broonie, festevam, linux-kernel
In-Reply-To: <1592895167-30483-1-git-send-email-shengjiu.wang@nxp.com>
On Tue, Jun 23, 2020 at 02:52:46PM +0800, Shengjiu Wang wrote:
> WM8524 only supports playback mode, and only works at
> slave mode.
>
> Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com>
Acked-by: Nicolin Chen <nicoleotsuka@gmail.com>
^ permalink raw reply
* Re: [PATCH] powerpc/boot/dts: Fix dtc "pciex" warnings
From: Stephen Rothwell @ 2020-06-23 23:06 UTC (permalink / raw)
To: Michael Ellerman; +Cc: linuxppc-dev
In-Reply-To: <20200623130320.405852-1-mpe@ellerman.id.au>
[-- Attachment #1: Type: text/plain, Size: 902 bytes --]
Hi Michael,
On Tue, 23 Jun 2020 23:03:20 +1000 Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> With CONFIG_OF_ALL_DTBS=y, as set by eg. allmodconfig, we see lots of
> warnings about our dts files, such as:
>
> arch/powerpc/boot/dts/glacier.dts:492.26-532.5:
> Warning (pci_bridge): /plb/pciex@d00000000: node name is not "pci"
> or "pcie"
>
> The node name should not particularly matter, it's just a name, and
> AFAICS there's no kernel code that cares whether nodes are *named*
> "pciex" or "pcie". So shutup these warnings by converting to the name
> dtc wants.
>
> As always there's some risk this could break something obscure that
> does rely on the name, in which case we can revert.
>
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Thanks for that. I have applied it to my "fixes" tree until it turns
up elsewhere.
--
Cheers,
Stephen Rothwell
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* RE: [PATCH v2 0/2] cpufreq: Specify the default governor on command line
From: Doug Smythies @ 2020-06-23 17:54 UTC (permalink / raw)
To: 'Quentin Perret'
Cc: juri.lelli, kernel-team, vincent.guittot, arnd, linux-pm, peterz,
adharmap, rafael, rjw, linux-kernel, viresh.kumar, mingo, paulus,
linuxppc-dev, tkjos
In-Reply-To: <20200623142138.209513-1-qperret@google.com>
On 2020.06.23 07:22 Quentin Perret wrote:
>
> This series enables users of prebuilt kernels (e.g. distro kernels) to
> specify their CPUfreq governor of choice using the kernel command line,
> instead of having to wait for the system to fully boot to userspace to
> switch using the sysfs interface. This is helpful for 2 reasons:
> 1. users get to choose the governor that runs during the actual boot;
> 2. it simplifies the userspace boot procedure a bit (one less thing to
> worry about).
>
> To enable this, the first patch moves all governor init calls to
> core_initcall, to make sure they are registered by the time the drivers
> probe. This should be relatively low impact as registering a governor
> is a simple procedure (it gets added to a llist), and all governors
> already load at core_initcall anyway when they're set as the default
> in Kconfig. This also allows to clean-up the governors' init/exit code,
> and reduces boilerplate.
>
> The second patch introduces the new command line parameter, inspired by
> its cpuidle counterpart. More details can be found in the respective
> patch headers.
>
> Changes in v2:
> - added Viresh's ack to patch 01
> - moved the assignment of 'default_governor' in patch 02 to the governor
> registration path instead of the driver registration (Viresh)
>
> Thanks,
> Quentin
>
> Quentin Perret (2):
> cpufreq: Register governors at core_initcall
> cpufreq: Specify default governor on command line
>
> .../admin-guide/kernel-parameters.txt | 5 ++++
> Documentation/admin-guide/pm/cpufreq.rst | 6 ++---
> .../platforms/cell/cpufreq_spudemand.c | 26 ++-----------------
> drivers/cpufreq/cpufreq.c | 23 ++++++++++++----
> drivers/cpufreq/cpufreq_conservative.c | 22 ++++------------
> drivers/cpufreq/cpufreq_ondemand.c | 24 +++++------------
> drivers/cpufreq/cpufreq_performance.c | 14 ++--------
> drivers/cpufreq/cpufreq_powersave.c | 18 +++----------
> drivers/cpufreq/cpufreq_userspace.c | 18 +++----------
> include/linux/cpufreq.h | 14 ++++++++++
> kernel/sched/cpufreq_schedutil.c | 6 +----
> 11 files changed, 62 insertions(+), 114 deletions(-)
>
> --
> 2.27.0.111.gc72c7da667-goog
Hi Quentin,
Because I am lazy and sometimes do not want to recompile
the distro source, I have a need/desire for this.
Tested these two grub command lines:
GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=300 intel_pstate=disable cpufreq.default_governor=schedutil cpuidle_sysfs_switch cpuidle.governor=teo"
And
#GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=450 intel_pstate=passive cpufreq.default_governor=schedutil cpuidle_sysfs_switch cpuidle.governor=teo"
And all worked as expected. I use Ubuntu as my distro, and also had to disable a startup script that switches to "ondemand", or similar, after 1 minute.
As a side note (separate subject, but is one reason I tried it):
My i5-9600K based computer seems to hit a power limit during boot approximately 3 seconds after kernel selection on grub.
This had no effect on that issue (even when selecting powersave governor).
... Doug
^ permalink raw reply
* Re: [PATCH v4 7/8] lockdep: Change hardirq{s_enabled,_context} to per-cpu variables
From: Ahmed S. Darwish @ 2020-06-23 16:13 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-s390, linuxppc-dev, bigeasy, x86, heiko.carstens,
linux-kernel, rostedt, davem, sparclinux, linux, tglx, will,
mingo
In-Reply-To: <20200623152450.GM4817@hirez.programming.kicks-ass.net>
On Tue, Jun 23, 2020 at 05:24:50PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 05:00:31PM +0200, Ahmed S. Darwish wrote:
> > On Tue, Jun 23, 2020 at 10:36:52AM +0200, Peter Zijlstra wrote:
> > ...
> > > -#define lockdep_assert_irqs_disabled() do { \
> > > - WARN_ONCE(debug_locks && !current->lockdep_recursion && \
> > > - current->hardirqs_enabled, \
> > > - "IRQs not disabled as expected\n"); \
> > > - } while (0)
> > > +#define lockdep_assert_irqs_enabled() \
> > > +do { \
> > > + WARN_ON_ONCE(debug_locks && !this_cpu_read(hardirqs_enabled)); \
> > > +} while (0)
> > >
> >
> > Can we add a small comment on top of lockdep_off(), stating that lockdep
> > IRQ tracking will still be kept after a lockdep_off call?
>
> That would only legitimize lockdep_off(). The only comment I want to put
> on that is: "if you use this, you're doing it wrong'.
>
Well, freshly merged code is using it. For example, KCSAN:
=> f1bc96210c6a ("kcsan: Make KCSAN compatible with lockdep")
=> kernel/kcsan/report.c:
void kcsan_report(...)
{
...
/*
* With TRACE_IRQFLAGS, lockdep's IRQ trace state becomes corrupted if
* we do not turn off lockdep here; this could happen due to recursion
* into lockdep via KCSAN if we detect a race in utilities used by
* lockdep.
*/
lockdep_off();
...
}
thanks,
--
Ahmed S. Darwish
Linutronix GmbH
^ permalink raw reply
* Re: [PATCH v4 7/8] lockdep: Change hardirq{s_enabled,_context} to per-cpu variables
From: Ahmed S. Darwish @ 2020-06-23 15:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-s390, linuxppc-dev, bigeasy, x86, heiko.carstens,
linux-kernel, rostedt, davem, sparclinux, linux, tglx, will,
mingo
In-Reply-To: <20200623083721.512673481@infradead.org>
On Tue, Jun 23, 2020 at 10:36:52AM +0200, Peter Zijlstra wrote:
...
> -#define lockdep_assert_irqs_disabled() do { \
> - WARN_ONCE(debug_locks && !current->lockdep_recursion && \
> - current->hardirqs_enabled, \
> - "IRQs not disabled as expected\n"); \
> - } while (0)
> +#define lockdep_assert_irqs_enabled() \
> +do { \
> + WARN_ON_ONCE(debug_locks && !this_cpu_read(hardirqs_enabled)); \
> +} while (0)
>
Can we add a small comment on top of lockdep_off(), stating that lockdep
IRQ tracking will still be kept after a lockdep_off call?
thanks,
--
Ahmed S. Darwish
Linutronix GmbH
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox