* Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86
From: Bjorn Helgaas @ 2020-07-15 22:12 UTC (permalink / raw)
To: David Laight
Cc: Greg Kroah-Hartman, linux-pci, bjorn@helgaas.com, Paul Mackerras,
sparclinux, Toan Le, Christoph Hellwig, Marek Vasut, Rob Herring,
Lorenzo Pieralisi, Sagi Grimberg, Kevin Hilman, Russell King,
Ley Foon Tan, Greg Ungerer, Geert Uytterhoeven, Jakub Kicinski,
Matt Turner, linux-kernel-mentees@lists.linuxfoundation.org,
Guenter Roeck, Arnd Bergmann, Ray Jui, linuxppc-dev, Jens Axboe,
Ivan Kokshaysky, Shuah Khan, Keith Busch, Boris Ostrovsky,
Richard Henderson, Juergen Gross, Thomas Bogendoerfer,
Scott Branden, Jingoo Han, linux-kernel@vger.kernel.org,
Philipp Zabel, Saheed O. Bolarinwa,
'Oliver O'Halloran', Gustavo Pimentel, Bjorn Helgaas,
David S. Miller, Heiner Kallweit
In-Reply-To: <1e2ae69a55f542faa18988a49e9b9491@AcuMS.aculab.com>
On Wed, Jul 15, 2020 at 02:38:29PM +0000, David Laight wrote:
> From: Oliver O'Halloran
> > Sent: 15 July 2020 05:19
> >
> > On Wed, Jul 15, 2020 at 8:03 AM Arnd Bergmann <arnd@arndb.de> wrote:
> ...
> > > - config space accesses are very rare compared to memory
> > > space access and on the hardware side the error handling
> > > would be similar, but readl/writel don't return errors, they just
> > > access wrong registers or return 0xffffffff.
> > > arch/powerpc/kernel/eeh.c has a ton extra code written to
> > > deal with it, but no other architectures do.
> >
> > TBH the EEH MMIO hooks were probably a mistake to begin with. Errors
> > detected via MMIO are almost always asynchronous to the error itself
> > so you usually just wind up with a misleading stack trace rather than
> > any kind of useful synchronous error reporting. It seems like most
> > drivers don't bother checking for 0xFFs either and rely on the
> > asynchronous reporting via .error_detected() instead, so I have to
> > wonder what the point is. I've been thinking of removing the MMIO
> > hooks and using a background poller to check for errors on each PHB
> > periodically (assuming we don't have an EEH interrupt) instead. That
> > would remove the requirement for eeh_dev_check_failure() to be
> > interrupt safe too, so it might even let us fix all the godawful races
> > in EEH.
>
> I've 'played' with PCIe error handling - without much success.
> What might be useful is for a driver that has just read ~0u to
> be able to ask 'has there been an error signalled for this device?'.
In many cases a driver will know that ~0 is not a valid value for the
register it's reading. But if ~0 *could* be valid, an interface like
you suggest could be useful. I don't think we have anything like that
today, but maybe we could. It would certainly be nice if the PCI core
noticed, logged, and cleared errors. We have some of that for AER,
but that's an optional feature, and support for the error bits in the
garden-variety PCI_STATUS register is pretty haphazard. As you note
below, this sort of SERR/PERR reporting is frequently hard-wired in
ways that takes it out of our purview.
Bjorn
^ permalink raw reply
* Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86
From: Benjamin Herrenschmidt @ 2020-07-15 22:26 UTC (permalink / raw)
To: Arnd Bergmann, Bjorn Helgaas
Cc: linux-pci, Keith Busch, Paul Mackerras, sparclinux, Toan Le,
Kjetil Oftedal, Greg Ungerer, Marek Vasut, Rob Herring,
Lorenzo Pieralisi, Sagi Grimberg, Russell King, Ley Foon Tan,
Christoph Hellwig, Geert Uytterhoeven, Kevin Hilman,
Jakub Kicinski, Matt Turner, linux-kernel-mentees, Guenter Roeck,
Ray Jui, Jens Axboe, Ivan Kokshaysky, Shuah Khan, bjorn,
Boris Ostrovsky, Richard Henderson, Juergen Gross, Bjorn Helgaas,
Thomas Bogendoerfer, Scott Branden, Jingoo Han,
Saheed O. Bolarinwa, linux-kernel@vger.kernel.org, Philipp Zabel,
Greg Kroah-Hartman, Gustavo Pimentel, linuxppc-dev,
David S. Miller, Heiner Kallweit
In-Reply-To: <CAK8P3a3OEOosC2VEHb3z7hTA=BjVp0nv9bNxBWzEnmgSDDTX3Q@mail.gmail.com>
On Wed, 2020-07-15 at 08:47 +0200, Arnd Bergmann wrote:
> drivers/misc/cardreader/rts5261.c: retval =
> rtsx_pci_write_config_dword(pcr, PCR_SETTING_REG2, lval);
>
> That last one is interesting because I think this is a case in which
> we
> actually want to check for errors, as the driver seems to use it
> to ensure that accessing extended config space at offset 0x814
> works before relying on the value. Unfortunately the implementation
> seems buggy as it a) keeps using the possibly uninitialized value
> after
> printing a warning and b) returns the PCIBIOS_* value in place of a
> negative errno and then ignores it in the caller.
In cases like this, usually checking against ~0 is sufficient
Cheers,
Ben.
^ permalink raw reply
* Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86
From: Benjamin Herrenschmidt @ 2020-07-15 22:49 UTC (permalink / raw)
To: Bjorn Helgaas, David Laight
Cc: Greg Kroah-Hartman, linux-pci, bjorn@helgaas.com, Paul Mackerras,
sparclinux, Toan Le, Christoph Hellwig, Marek Vasut, Rob Herring,
Lorenzo Pieralisi, Sagi Grimberg, Kevin Hilman, Russell King,
Ley Foon Tan, Greg Ungerer, Geert Uytterhoeven, Jakub Kicinski,
Matt Turner, linux-kernel-mentees@lists.linuxfoundation.org,
Guenter Roeck, Arnd Bergmann, Ray Jui, linuxppc-dev, Jens Axboe,
Ivan Kokshaysky, Shuah Khan, Keith Busch, Boris Ostrovsky,
Richard Henderson, Juergen Gross, Thomas Bogendoerfer,
Scott Branden, Jingoo Han, linux-kernel@vger.kernel.org,
Philipp Zabel, Saheed O. Bolarinwa,
'Oliver O'Halloran', Gustavo Pimentel, Bjorn Helgaas,
David S. Miller, Heiner Kallweit
In-Reply-To: <20200715221230.GA563957@bjorn-Precision-5520>
On Wed, 2020-07-15 at 17:12 -0500, Bjorn Helgaas wrote:
> > I've 'played' with PCIe error handling - without much success.
> > What might be useful is for a driver that has just read ~0u to
> > be able to ask 'has there been an error signalled for this device?'.
>
> In many cases a driver will know that ~0 is not a valid value for the
> register it's reading. But if ~0 *could* be valid, an interface like
> you suggest could be useful. I don't think we have anything like that
> today, but maybe we could. It would certainly be nice if the PCI core
> noticed, logged, and cleared errors. We have some of that for AER,
> but that's an optional feature, and support for the error bits in the
> garden-variety PCI_STATUS register is pretty haphazard. As you note
> below, this sort of SERR/PERR reporting is frequently hard-wired in
> ways that takes it out of our purview.
We do have pci_channel_state (via pci_channel_offline()) which covers
the cases where the underlying error handling (such as EEH or unplug)
results in the device being offlined though this tend to be
asynchronous so it might take a few ~0's before you get it.
It's typically used to break potentially infinite loops in some
drivers.
There is no interface to check whether *an* error happened though for
the most cases it will be captured in the status register, which is
harvested (and cleared ?) by some EDAC drivers iirc...
All this lacks coordination, I agree.
Cheers,
Ben.
^ permalink raw reply
* Re: [PATCH v3 06/12] ppc64/kexec_file: restrict memory usage of kdump kernel
From: Thiago Jung Bauermann @ 2020-07-15 22:52 UTC (permalink / raw)
To: Hari Bathini
Cc: Pingfan Liu, Nayna Jain, Kexec-ml, Mahesh J Salgaonkar,
Mimi Zohar, lkml, linuxppc-dev, Sourabh Jain, Petr Tesarik,
Andrew Morton, Dave Young, Vivek Goyal, Eric Biederman
In-Reply-To: <159466091925.24747.6840028682768745598.stgit@hbathini.in.ibm.com>
Hari Bathini <hbathini@linux.ibm.com> writes:
> /**
> + * get_usable_memory_ranges - Get usable memory ranges. This list includes
> + * regions like crashkernel, opal/rtas & tce-table,
> + * that kdump kernel could use.
> + * @mem_ranges: Range list to add the memory ranges to.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +static int get_usable_memory_ranges(struct crash_mem **mem_ranges)
> +{
> + int ret;
> +
> + /* First memory block & crashkernel region */
> + ret = add_mem_range(mem_ranges, 0, crashk_res.end + 1);
This is a bit surprising. I guess I don't have a complete big picture of
the patch series yet. What prevents the crashkernel from using memory at
the [0, _end] range and overwriting the crashed kernel's memory?
Shouldn't the above range start at crashk_res.start?
> + if (ret)
> + goto out;
> +
> + ret = add_rtas_mem_range(mem_ranges);
> + if (ret)
> + goto out;
> +
> + ret = add_opal_mem_range(mem_ranges);
> + if (ret)
> + goto out;
> +
> + ret = add_tce_mem_ranges(mem_ranges);
> +out:
> + if (ret)
> + pr_err("Failed to setup usable memory ranges\n");
> + return ret;
> +}
> +
> +/**
> * __locate_mem_hole_top_down - Looks top down for a large enough memory hole
> * in the memory regions between buf_min & buf_max
> * for the buffer. If found, sets kbuf->mem.
> @@ -261,6 +305,322 @@ static int locate_mem_hole_bottom_up_ppc64(struct kexec_buf *kbuf,
> }
>
> /**
> + * check_realloc_usable_mem - Reallocate buffer if it can't accommodate entries
> + * @um_info: Usable memory buffer and ranges info.
> + * @cnt: No. of entries to accommodate.
> + *
> + * Returns 0 on success, negative errno on error.
It actually returns the buffer on success, and NULL on error.
> + */
> +static uint64_t *check_realloc_usable_mem(struct umem_info *um_info, int cnt)
> +{
> + void *tbuf;
> +
> + if (um_info->size >=
> + ((um_info->idx + cnt) * sizeof(*(um_info->buf))))
> + return um_info->buf;
> +
> + um_info->size += MEM_RANGE_CHUNK_SZ;
> + tbuf = krealloc(um_info->buf, um_info->size, GFP_KERNEL);
> + if (!tbuf) {
> + um_info->size -= MEM_RANGE_CHUNK_SZ;
> + return NULL;
> + }
> +
> + memset(tbuf + um_info->idx, 0, MEM_RANGE_CHUNK_SZ);
> + return tbuf;
> +}
<snip>
> +/**
> + * get_node_path - Get the full path of the given node.
> + * @dn: Node.
> + * @path: Updated with the full path of the node.
> + *
> + * Returns nothing.
> + */
> +static void get_node_path(struct device_node *dn, char *path)
> +{
> + if (!dn)
> + return;
> +
> + get_node_path(dn->parent, path);
Is it ok to do recursion in the kernel? In this case I believe it's not
problematic since the maximum call depth will be the maximum depth of a
device tree node which shouldn't be too much. Also, there are no local
variables in this function. But I thought it was worth mentioning.
> + sprintf(path, "/%s", dn->full_name);
> +}
> +
> +/**
> + * get_node_pathlen - Get the full path length of the given node.
> + * @dn: Node.
> + *
> + * Returns the length of the full path of the node.
> + */
> +static int get_node_pathlen(struct device_node *dn)
> +{
> + int len = 0;
> +
> + while (dn) {
> + len += strlen(dn->full_name) + 1;
> + dn = dn->parent;
> + }
> + len++;
> +
> + return len;
> +}
> +
> +/**
> + * add_usable_mem_property - Add usable memory property for the given
> + * memory node.
> + * @fdt: Flattened device tree for the kdump kernel.
> + * @dn: Memory node.
> + * @um_info: Usable memory buffer and ranges info.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +static int add_usable_mem_property(void *fdt, struct device_node *dn,
> + struct umem_info *um_info)
> +{
> + int n_mem_addr_cells, n_mem_size_cells, node;
> + int i, len, ranges, cnt, ret;
> + uint64_t base, end, *buf;
> + const __be32 *prop;
> + char *pathname;
> +
> + /* Allocate memory for node path */
> + pathname = kzalloc(ALIGN(get_node_pathlen(dn), 8), GFP_KERNEL);
> + if (!pathname)
> + return -ENOMEM;
> +
> + /* Get the full path of the memory node */
> + get_node_path(dn, pathname);
> + pr_debug("Memory node path: %s\n", pathname);
> +
> + /* Now that we know the path, find its offset in kdump kernel's fdt */
> + node = fdt_path_offset(fdt, pathname);
> + if (node < 0) {
> + pr_err("Malformed device tree: error reading %s\n",
> + pathname);
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + /* Get the address & size cells */
> + n_mem_addr_cells = of_n_addr_cells(dn);
> + n_mem_size_cells = of_n_size_cells(dn);
> + pr_debug("address cells: %d, size cells: %d\n", n_mem_addr_cells,
> + n_mem_size_cells);
> +
> + um_info->idx = 0;
> + buf = check_realloc_usable_mem(um_info, 2);
> + if (!buf) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + um_info->buf = buf;
> +
> + prop = of_get_property(dn, "reg", &len);
> + if (!prop || len <= 0) {
> + ret = 0;
> + goto out;
> + }
> +
> + /*
> + * "reg" property represents sequence of (addr,size) duples
s/duples/tuples/ ?
> + * each representing a memory range.
> + */
> + ranges = (len >> 2) / (n_mem_addr_cells + n_mem_size_cells);
> +
> + for (i = 0; i < ranges; i++) {
> + base = of_read_number(prop, n_mem_addr_cells);
> + prop += n_mem_addr_cells;
> + end = base + of_read_number(prop, n_mem_size_cells) - 1;
You need to `prop += n_mem_size_cells` here.
> +
> + ret = add_usable_mem(um_info, base, end, &cnt);
> + if (ret) {
> + ret = ret;
> + goto out;
> + }
> + }
> +
> + /*
> + * No kdump kernel usable memory found in this memory node.
> + * Write (0,0) duple in linux,usable-memory property for
s/duple/tuple/ ?
> + * this region to be ignored.
> + */
> + if (um_info->idx == 0) {
> + um_info->buf[0] = 0;
> + um_info->buf[1] = 0;
> + um_info->idx = 2;
> + }
> +
> + ret = fdt_setprop(fdt, node, "linux,usable-memory", um_info->buf,
> + (um_info->idx * sizeof(*(um_info->buf))));
> +
> +out:
> + kfree(pathname);
> + return ret;
> +}
--
Thiago Jung Bauermann
IBM Linux Technology Center
^ permalink raw reply
* [PATCH] powerpc/vdso: Fix vdso cpu truncation
From: Anton Blanchard @ 2020-07-15 23:37 UTC (permalink / raw)
To: benh, paulus, mpe, miltonm; +Cc: linuxppc-dev
From: Milton Miller <miltonm@us.ibm.com>
The code in vdso_cpu_init that exposes the cpu and numa node to
userspace via SPRG_VDSO incorrctly masks the cpu to 12 bits. This means
that any kernel running on a box with more than 4096 threads (NR_CPUS
advertises a limit of of 8192 cpus) would expose userspace to two cpu
contexts running at the same time with the same cpu number.
Note: I'm not aware of any distro shipping a kernel with support for more
than 4096 threads today, nor of any system image that currently exceeds
4096 threads. Found via code browsing.
Fixes: 18ad51dd342a7eb09dbcd059d0b451b616d4dafc ("powerpc: Add VDSO version of getcpu")
Signed-off-by: Milton Miller <miltonm@us.ibm.com>
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
---
arch/powerpc/kernel/vdso.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index e0f4ba45b6cc..8dad44262e75 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -677,7 +677,7 @@ int vdso_getcpu_init(void)
node = cpu_to_node(cpu);
WARN_ON_ONCE(node > 0xffff);
- val = (cpu & 0xfff) | ((node & 0xffff) << 16);
+ val = (cpu & 0xffff) | ((node & 0xffff) << 16);
mtspr(SPRN_SPRG_VDSO_WRITE, val);
get_paca()->sprg_vdso = val;
--
2.26.2
^ permalink raw reply related
* [PATCH net-next] ibmvnic: Increase driver logging
From: Thomas Falcon @ 2020-07-15 23:51 UTC (permalink / raw)
To: netdev; +Cc: drt, Thomas Falcon, linuxppc-dev
Improve the ibmvnic driver's logging capabilities by providing
more informational messages to track driver operations, facilitating
first-pass debug.
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
---
drivers/net/ethernet/ibm/ibmvnic.c | 76 ++++++++++++++++++++++++------
1 file changed, 62 insertions(+), 14 deletions(-)
diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 0fd7eae25fe9..7382f11872fc 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -561,6 +561,7 @@ static int init_rx_pools(struct net_device *netdev)
u64 *size_array;
int i, j;
+ netdev_info(netdev, "Initializing RX queue buffer pools.\n");
rxadd_subcrqs =
be32_to_cpu(adapter->login_rsp_buf->num_rxadd_subcrqs);
size_array = (u64 *)((u8 *)(adapter->login_rsp_buf) +
@@ -618,6 +619,7 @@ static int init_rx_pools(struct net_device *netdev)
rx_pool->next_alloc = 0;
rx_pool->next_free = 0;
}
+ netdev_info(netdev, "RX queue buffer pools allocated successfully.\n");
return 0;
}
@@ -738,6 +740,7 @@ static int init_tx_pools(struct net_device *netdev)
int tx_subcrqs;
int i, rc;
+ netdev_info(netdev, "Initializing TX queue buffer pools.\n");
tx_subcrqs = be32_to_cpu(adapter->login_rsp_buf->num_txsubm_subcrqs);
adapter->tx_pool = kcalloc(tx_subcrqs,
sizeof(struct ibmvnic_tx_pool), GFP_KERNEL);
@@ -768,7 +771,7 @@ static int init_tx_pools(struct net_device *netdev)
return rc;
}
}
-
+ netdev_info(netdev, "TX queue buffer pools allocated successfully.\n");
return 0;
}
@@ -792,6 +795,7 @@ static void ibmvnic_napi_disable(struct ibmvnic_adapter *adapter)
if (!adapter->napi_enabled)
return;
+ netdev_info(adapter->netdev, "Disable all NAPI threads.\n");
for (i = 0; i < adapter->req_rx_queues; i++) {
netdev_dbg(adapter->netdev, "Disabling napi[%d]\n", i);
napi_disable(&adapter->napi[i]);
@@ -910,12 +914,14 @@ static int ibmvnic_login(struct net_device *netdev)
return -1;
}
} else if (adapter->init_done_rc) {
- netdev_warn(netdev, "Adapter login failed\n");
+ netdev_warn(netdev, "Adapter login failed, rc = %d\n",
+ adapter->init_done_rc);
return -1;
}
} while (retry);
__ibmvnic_set_mac(netdev, adapter->mac_addr);
+ netdev_info(netdev, "Adapter login success!\n");
return 0;
}
@@ -934,6 +940,7 @@ static void release_login_rsp_buffer(struct ibmvnic_adapter *adapter)
static void release_resources(struct ibmvnic_adapter *adapter)
{
+ netdev_info(adapter->netdev, "Releasing VNIC client device memory structures.\n");
release_vpd_data(adapter);
release_tx_pools(adapter);
@@ -941,6 +948,7 @@ static void release_resources(struct ibmvnic_adapter *adapter)
release_napi(adapter);
release_login_rsp_buffer(adapter);
+ netdev_info(adapter->netdev, "VNIC client device memory released.\n");
}
static int set_link_state(struct ibmvnic_adapter *adapter, u8 link_state)
@@ -964,7 +972,7 @@ static int set_link_state(struct ibmvnic_adapter *adapter, u8 link_state)
reinit_completion(&adapter->init_done);
rc = ibmvnic_send_crq(adapter, &crq);
if (rc) {
- netdev_err(netdev, "Failed to set link state\n");
+ netdev_err(netdev, "Failed to set link state, rc = %d\n", rc);
return rc;
}
@@ -1098,6 +1106,8 @@ static int init_resources(struct ibmvnic_adapter *adapter)
struct net_device *netdev = adapter->netdev;
int rc;
+ netdev_info(netdev, "Allocate device resources.\n");
+
rc = set_real_num_queues(netdev);
if (rc)
return rc;
@@ -1126,7 +1136,11 @@ static int init_resources(struct ibmvnic_adapter *adapter)
return rc;
rc = init_tx_pools(netdev);
- return rc;
+ if (rc)
+ return rc;
+
+ netdev_info(netdev, "Device resources allocated.\n");
+ return 0;
}
static int __ibmvnic_open(struct net_device *netdev)
@@ -1136,9 +1150,10 @@ static int __ibmvnic_open(struct net_device *netdev)
int i, rc;
adapter->state = VNIC_OPENING;
+ netdev_info(netdev, "Allocating RX buffer pools and enable NAPI structures.\n");
replenish_pools(adapter);
ibmvnic_napi_enable(adapter);
-
+ netdev_info(netdev, "Activating RX and TX interrupt lines.\n");
/* We're ready to receive frames, enable the sub-crq interrupts and
* set the logical link state to up
*/
@@ -1155,7 +1170,7 @@ static int __ibmvnic_open(struct net_device *netdev)
enable_irq(adapter->tx_scrq[i]->irq);
enable_scrq_irq(adapter, adapter->tx_scrq[i]);
}
-
+ netdev_info(netdev, "Inform system firmware that device is ready for packet transmission.\n");
rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_UP);
if (rc) {
for (i = 0; i < adapter->req_rx_queues; i++)
@@ -1163,7 +1178,7 @@ static int __ibmvnic_open(struct net_device *netdev)
release_resources(adapter);
return rc;
}
-
+ netdev_info(netdev, "Activate net device TX queues.\n");
netif_tx_start_all_queues(netdev);
if (prev_state == VNIC_CLOSED) {
@@ -1180,6 +1195,7 @@ static int ibmvnic_open(struct net_device *netdev)
struct ibmvnic_adapter *adapter = netdev_priv(netdev);
int rc;
+ netdev_info(netdev, "Opening VNIC client device.\n");
/* If device failover is pending, just set device state and return.
* Device operation will be handled by reset routine.
*/
@@ -1202,7 +1218,7 @@ static int ibmvnic_open(struct net_device *netdev)
}
rc = __ibmvnic_open(netdev);
-
+ netdev_info(netdev, "VNIC client device opened.\n");
return rc;
}
@@ -1216,7 +1232,7 @@ static void clean_rx_pools(struct ibmvnic_adapter *adapter)
if (!adapter->rx_pool)
return;
-
+ netdev_info(adapter->netdev, "Freeing all existing RX buffer pool memory.\n");
rx_scrqs = adapter->num_active_rx_pools;
rx_entries = adapter->req_rx_add_entries_per_subcrq;
@@ -1265,7 +1281,7 @@ static void clean_tx_pools(struct ibmvnic_adapter *adapter)
if (!adapter->tx_pool || !adapter->tso_pool)
return;
-
+ netdev_info(adapter->netdev, "Freeing all outstanding TX buffers awaiting completion.\n");
tx_scrqs = adapter->num_active_tx_pools;
/* Free any remaining skbs in the tx buffer pools */
@@ -1282,6 +1298,7 @@ static void ibmvnic_disable_irqs(struct ibmvnic_adapter *adapter)
int i;
if (adapter->tx_scrq) {
+ netdev_info(netdev, "Disabling all TX interrupt lines.\n");
for (i = 0; i < adapter->req_tx_queues; i++)
if (adapter->tx_scrq[i]->irq) {
netdev_dbg(netdev,
@@ -1292,6 +1309,7 @@ static void ibmvnic_disable_irqs(struct ibmvnic_adapter *adapter)
}
if (adapter->rx_scrq) {
+ netdev_info(netdev, "Disabling all RX interrupt lines.\n");
for (i = 0; i < adapter->req_rx_queues; i++) {
if (adapter->rx_scrq[i]->irq) {
netdev_dbg(netdev,
@@ -1307,6 +1325,7 @@ static void ibmvnic_cleanup(struct net_device *netdev)
{
struct ibmvnic_adapter *adapter = netdev_priv(netdev);
+ netdev_info(netdev, "Halt net device TX queues.\n");
/* ensure that transmissions are stopped if called by do_reset */
if (test_bit(0, &adapter->resetting))
netif_tx_disable(netdev);
@@ -1326,6 +1345,7 @@ static int __ibmvnic_close(struct net_device *netdev)
int rc = 0;
adapter->state = VNIC_CLOSING;
+ netdev_info(netdev, "Halt incoming packets from system firmware.\n");
rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_DN);
if (rc)
return rc;
@@ -1338,6 +1358,7 @@ static int ibmvnic_close(struct net_device *netdev)
struct ibmvnic_adapter *adapter = netdev_priv(netdev);
int rc;
+ netdev_info(netdev, "Stopping VNIC client device.\n");
/* If device failover is pending, just set device state and return.
* Device operation will be handled by reset routine.
*/
@@ -1348,7 +1369,7 @@ static int ibmvnic_close(struct net_device *netdev)
rc = __ibmvnic_close(netdev);
ibmvnic_cleanup(netdev);
-
+ netdev_info(netdev, "VNIC client device stopped.\n");
return rc;
}
@@ -2941,9 +2962,11 @@ static struct ibmvnic_sub_crq_queue *init_sub_crq_queue(struct ibmvnic_adapter
static void release_sub_crqs(struct ibmvnic_adapter *adapter, bool do_h_free)
{
+ struct net_device *netdev = adapter->netdev;
int i;
if (adapter->tx_scrq) {
+ netdev_info(netdev, "Releasing TX subordinate Command-Response Queues.\n");
for (i = 0; i < adapter->num_active_tx_scrqs; i++) {
if (!adapter->tx_scrq[i])
continue;
@@ -2964,9 +2987,11 @@ static void release_sub_crqs(struct ibmvnic_adapter *adapter, bool do_h_free)
kfree(adapter->tx_scrq);
adapter->tx_scrq = NULL;
adapter->num_active_tx_scrqs = 0;
+ netdev_info(netdev, "TX subordinate Command-Response Queues released.\n");
}
if (adapter->rx_scrq) {
+ netdev_info(netdev, "Releasing RX subordinate Command-Response Queues.\n");
for (i = 0; i < adapter->num_active_rx_scrqs; i++) {
if (!adapter->rx_scrq[i])
continue;
@@ -2987,6 +3012,7 @@ static void release_sub_crqs(struct ibmvnic_adapter *adapter, bool do_h_free)
kfree(adapter->rx_scrq);
adapter->rx_scrq = NULL;
adapter->num_active_rx_scrqs = 0;
+ netdev_info(netdev, "RX subordinate Command-Response Queues released.\n");
}
}
@@ -3149,6 +3175,7 @@ static int init_sub_crq_irqs(struct ibmvnic_adapter *adapter)
int i = 0, j = 0;
int rc = 0;
+ netdev_info(adapter->netdev, "Request Subordinate Command-Response Queue interrupts.\n");
for (i = 0; i < adapter->req_tx_queues; i++) {
netdev_dbg(adapter->netdev, "Initializing tx_scrq[%d] irq\n",
i);
@@ -3195,6 +3222,9 @@ static int init_sub_crq_irqs(struct ibmvnic_adapter *adapter)
goto req_rx_irq_failed;
}
}
+
+ netdev_info(adapter->netdev, "Subordinate Command-Response Queue interrupts created.\n");
+
return rc;
req_rx_irq_failed:
@@ -3221,6 +3251,8 @@ static int init_sub_crqs(struct ibmvnic_adapter *adapter)
int more = 0;
int i;
+ netdev_info(adapter->netdev, "Creating Command-Response Queues.\n");
+
total_queues = adapter->req_tx_queues + adapter->req_rx_queues;
allqueues = kcalloc(total_queues, sizeof(*allqueues), GFP_KERNEL);
@@ -3285,6 +3317,8 @@ static int init_sub_crqs(struct ibmvnic_adapter *adapter)
}
kfree(allqueues);
+
+ netdev_info(adapter->netdev, "Command-Response Queue creation complete.\n");
return 0;
rx_failed:
@@ -3303,6 +3337,8 @@ static void ibmvnic_send_req_caps(struct ibmvnic_adapter *adapter, int retry)
union ibmvnic_crq crq;
int max_entries;
+ netdev_info(adapter->netdev, "Negotiate device capabilities.\n");
+
if (!retry) {
/* Sub-CRQ entries are 32 byte long */
int entries_page = 4 * PAGE_SIZE / (sizeof(u64) * 4);
@@ -3822,6 +3858,8 @@ static void send_cap_queries(struct ibmvnic_adapter *adapter)
{
union ibmvnic_crq crq;
+ netdev_info(adapter->netdev, "Requesting device capabilities.\n");
+
atomic_set(&adapter->running_cap_crqs, 0);
memset(&crq, 0, sizeof(crq));
crq.query_capability.first = IBMVNIC_CRQ_CMD;
@@ -4115,7 +4153,8 @@ static void handle_query_ip_offload_rsp(struct ibmvnic_adapter *adapter)
adapter->netdev->features |=
tmp & adapter->netdev->wanted_features;
}
-
+ netdev_info(adapter->netdev,
+ "Confirming device offload capabilities.\n");
memset(&crq, 0, sizeof(crq));
crq.control_ip_offload.first = IBMVNIC_CRQ_CMD;
crq.control_ip_offload.cmd = CONTROL_IP_OFFLOAD;
@@ -4263,6 +4302,9 @@ static void handle_request_cap_rsp(union ibmvnic_crq *crq,
struct ibmvnic_query_ip_offload_buffer *ip_offload_buf =
&adapter->ip_offload_buf;
+ netdev_info(adapter->netdev,
+ "Query device offload features.\n");
+
adapter->wait_capability = false;
adapter->ip_offload_tok = dma_map_single(dev, ip_offload_buf,
buf_sz,
@@ -4881,7 +4923,7 @@ static void release_crq_queue(struct ibmvnic_adapter *adapter)
if (!crq->msgs)
return;
- netdev_dbg(adapter->netdev, "Releasing CRQ\n");
+ netdev_info(adapter->netdev, "Releasing VNIC Command-Response Queue.\n");
free_irq(vdev->irq, adapter);
tasklet_kill(&adapter->tasklet);
do {
@@ -4893,6 +4935,7 @@ static void release_crq_queue(struct ibmvnic_adapter *adapter)
free_page((unsigned long)crq->msgs);
crq->msgs = NULL;
crq->active = false;
+ netdev_info(adapter->netdev, "VNIC Command-Response Queue released.\n");
}
static int init_crq_queue(struct ibmvnic_adapter *adapter)
@@ -5193,6 +5236,7 @@ static int ibmvnic_remove(struct vio_dev *dev)
struct ibmvnic_adapter *adapter = netdev_priv(netdev);
unsigned long flags;
+ netdev_info(netdev, "Attempting to remove VNIC client device.\n");
spin_lock_irqsave(&adapter->state_lock, flags);
if (adapter->state == VNIC_RESETTING) {
spin_unlock_irqrestore(&adapter->state_lock, flags);
@@ -5206,22 +5250,26 @@ static int ibmvnic_remove(struct vio_dev *dev)
flush_delayed_work(&adapter->ibmvnic_delayed_reset);
rtnl_lock();
+ netdev_info(netdev, "Unregistering net device.\n");
unregister_netdevice(netdev);
+ netdev_info(netdev, "Releasing VNIC client device resources.\n");
release_resources(adapter);
release_sub_crqs(adapter, 1);
release_crq_queue(adapter);
release_stats_token(adapter);
release_stats_buffers(adapter);
-
+ netdev_info(netdev, "VNIC client device resources released.\n");
adapter->state = VNIC_REMOVED;
rtnl_unlock();
+ netdev_info(netdev, "Freeing net device and VNIC client driver data.\n");
mutex_destroy(&adapter->fw_lock);
device_remove_file(&dev->dev, &dev_attr_failover);
free_netdev(netdev);
dev_set_drvdata(&dev->dev, NULL);
+ netdev_info(netdev, "VNIC client device has been successfully removed.\n");
return 0;
}
--
2.26.2
^ permalink raw reply related
* Re: [PATCH net-next] ibmvnic: Increase driver logging
From: Jakub Kicinski @ 2020-07-16 0:06 UTC (permalink / raw)
To: Thomas Falcon; +Cc: drt, netdev, linuxppc-dev
In-Reply-To: <1594857115-22380-1-git-send-email-tlfalcon@linux.ibm.com>
On Wed, 15 Jul 2020 18:51:55 -0500 Thomas Falcon wrote:
> free_netdev(netdev);
> dev_set_drvdata(&dev->dev, NULL);
> + netdev_info(netdev, "VNIC client device has been successfully removed.\n");
A step too far, perhaps.
In general this patch looks a little questionable IMHO, this amount of
logging output is not commonly seen in drivers. All the the info
messages are just static text, not even carrying any extra information.
In an era of ftrace, and bpftrace, do we really need this?
^ permalink raw reply
* Re: [PATCH v3 07/12] ppc64/kexec_file: add support to relocate purgatory
From: Thiago Jung Bauermann @ 2020-07-16 0:20 UTC (permalink / raw)
To: Hari Bathini
Cc: kernel test robot, Pingfan Liu, Nayna Jain, Kexec-ml,
Mahesh J Salgaonkar, Mimi Zohar, lkml, linuxppc-dev, Sourabh Jain,
Petr Tesarik, Andrew Morton, Dave Young, Vivek Goyal,
Eric Biederman
In-Reply-To: <159466093748.24747.4655547403463921814.stgit@hbathini.in.ibm.com>
Hari Bathini <hbathini@linux.ibm.com> writes:
> Right now purgatory implementation is only minimal. But if purgatory
> code is to be enhanced to copy memory to the backup region and verify
Can't the memcpy be done in asm? We have arch/powerpc/lib/memcpy_64.S
for example, perhaps it could be linked in with the purgatory?
> sha256 digest, relocations may have to be applied to the purgatory.
Do we want to do the sha256 verification? My original patch series for
kexec_file_load() had a purgatory in C from kexec-tools which did the
sha256 verification but Michael Ellerman thought it was unnecessary and
decided to use the simpler purgatory in asm from kexec-lite.
As a result, this relocation processing became unnecessary.
> So, add support to relocate purgatory in kexec_file_load system call
> by setting up TOC pointer and applying RELA relocations as needed.
If we do want to use a C purgatory, Michael Ellerman had suggested
building it as a Position Independent Executable, which greatly reduces
the number and types of relocations that are needed. See patches 4 and 9
here:
https://lore.kernel.org/linuxppc-dev/1478748449-3894-1-git-send-email-bauerman@linux.vnet.ibm.com/
In the series above I hadn't converted x86 to PIE. If I had done that,
possibly Dave Young's opinion would have been different. :-)
If that's still not desirable, he suggested in that discussion lifting
some code from x86 to generic code, which I implemented and would
simplify this patch as well:
https://lore.kernel.org/linuxppc-dev/5009580.5GxAkTrMYA@morokweng/
> Reported-by: kernel test robot <lkp@intel.com>
> [lkp: In v1, 'struct mem_sym' was declared in parameter list]
> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
> ---
>
> v2 -> v3:
> * Fixed get_toc_section() to return the section info that had relocations
> applied, to calculate the correct toc pointer.
> * Fixed how relocation value is converted to relative while applying
> R_PPC64_REL64 & R_PPC64_REL32 relocations.
>
> v1 -> v2:
> * Fixed wrong use of 'struct mem_sym' in local_entry_offset() as
> reported by lkp. lkp report for reference:
> - https://lore.kernel.org/patchwork/patch/1264421/
>
>
> arch/powerpc/kexec/file_load_64.c | 337 ++++++++++++++++++++++++++++++++
> arch/powerpc/purgatory/trampoline_64.S | 8 +
> 2 files changed, 345 insertions(+)
--
Thiago Jung Bauermann
IBM Linux Technology Center
^ permalink raw reply
* Re: [PATCH 1/1] powerpc: Fix incorrect stw{, ux, u, x} instructions in __set_pte_at
From: Sasha Levin @ 2020-07-16 0:27 UTC (permalink / raw)
To: Sasha Levin, Mathieu Desnoyers, Christophe Leroy
Cc: linux-kernel, Paul Mackerras, stable, linuxppc-dev
In-Reply-To: <20200708175423.28442-1-mathieu.desnoyers@efficios.com>
Hi
[This is an automated email]
This commit has been processed because it contains a "Fixes:" tag
fixing commit: 9bf2b5cdc5fe ("powerpc: Fixes for CONFIG_PTE_64BIT for SMP support").
The bot has tested the following trees: v5.7.8, v5.4.51, v4.19.132, v4.14.188, v4.9.230, v4.4.230.
v5.7.8: Build OK!
v5.4.51: Build OK!
v4.19.132: Build OK!
v4.14.188: Failed to apply! Possible dependencies:
45201c8794693 ("powerpc/nohash: Remove hash related code from nohash headers.")
4e003747043d5 ("powerpc/64s: Replace CONFIG_PPC_STD_MMU_64 with CONFIG_PPC_BOOK3S_64")
d5808ffaec817 ("powerpc/nohash: Use IS_ENABLED() to simplify __set_pte_at()")
v4.9.230: Failed to apply! Possible dependencies:
45201c8794693 ("powerpc/nohash: Remove hash related code from nohash headers.")
4546561551106 ("powerpc/asm: Use OFFSET macro in asm-offsets.c")
4e003747043d5 ("powerpc/64s: Replace CONFIG_PPC_STD_MMU_64 with CONFIG_PPC_BOOK3S_64")
5d451a87e5ebb ("powerpc/64: Retrieve number of L1 cache sets from device-tree")
7c5b06cadf274 ("KVM: PPC: Book3S HV: Adapt TLB invalidations to work on POWER9")
83677f551e0a6 ("KVM: PPC: Book3S HV: Adjust host/guest context switch for POWER9")
902e06eb86cd6 ("powerpc/32: Change the stack protector canary value per task")
a3d96f70c1477 ("powerpc/64s: Fix system reset vs general interrupt reentrancy")
bd067f83b0840 ("powerpc/64: Fix naming of cache block vs. cache line")
d5808ffaec817 ("powerpc/nohash: Use IS_ENABLED() to simplify __set_pte_at()")
e2827fe5c1566 ("powerpc/64: Clean up ppc64_caches using a struct per cache")
e9cf1e085647b ("KVM: PPC: Book3S HV: Add new POWER9 guest-accessible SPRs")
f4c51f841d2ac ("KVM: PPC: Book3S HV: Modify guest entry/exit paths to handle radix guests")
v4.4.230: Failed to apply! Possible dependencies:
1ca7212932862 ("powerpc/mm: Move PTE bits from generic functions to hash64 functions.")
26b6a3d9bb48f ("powerpc/mm: move pte headers to book3s directory")
371352ca0e7f3 ("powerpc/mm: Move hash64 PTE bits from book3s/64/pgtable.h to hash.h")
3dfcb315d81e6 ("powerpc/mm: make a separate copy for book3s")
ab537dca2f330 ("powerpc/mm: Move hash specific pte width and other defines to book3s")
b0412ea94bcbd ("powerpc/mm: Drop pte-common.h from BOOK3S 64")
cbbb8683fb632 ("powerpc/mm: Delete booke bits from book3s")
NOTE: The patch will not be queued to stable trees until it is upstream.
How should we proceed with this patch?
--
Thanks
Sasha
^ permalink raw reply
* Re: [PATCH v3 08/12] ppc64/kexec_file: setup the stack for purgatory
From: Thiago Jung Bauermann @ 2020-07-16 0:35 UTC (permalink / raw)
To: Hari Bathini
Cc: Pingfan Liu, Nayna Jain, Kexec-ml, Mahesh J Salgaonkar,
Mimi Zohar, lkml, linuxppc-dev, Sourabh Jain, Petr Tesarik,
Andrew Morton, Dave Young, Vivek Goyal, Eric Biederman
In-Reply-To: <159466095278.24747.9161591016931052627.stgit@hbathini.in.ibm.com>
Hari Bathini <hbathini@linux.ibm.com> writes:
> To avoid any weird errors, the purgatory should run with its own
> stack. Set one up by adding the stack buffer to .data section of
> the purgatory. Also, setup opal base & entry values in r8 & r9
> registers to help early OPAL debugging.
>
> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
> Tested-by: Pingfan Liu <piliu@redhat.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
> ---
>
> v2 -> v3:
> * Unchanged. Added Tested-by tag from Pingfan.
>
> v1 -> v2:
> * Setting up opal base & entry values in r8 & r9 for early OPAL debug.
>
>
> arch/powerpc/include/asm/kexec.h | 4 ++++
> arch/powerpc/kexec/file_load_64.c | 29 +++++++++++++++++++++++++++++
> arch/powerpc/purgatory/trampoline_64.S | 32 ++++++++++++++++++++++++++++++++
> 3 files changed, 65 insertions(+)
>
--
Thiago Jung Bauermann
IBM Linux Technology Center
^ permalink raw reply
* Re: [PATCH v3] powerpc/pseries: detect secure and trusted boot state of the system.
From: Daniel Axtens @ 2020-07-16 0:49 UTC (permalink / raw)
To: Nayna Jain, linuxppc-dev; +Cc: Nayna Jain, linux-kernel, Mimi Zohar
In-Reply-To: <1594813921-12425-1-git-send-email-nayna@linux.ibm.com>
Hi Nayna,
Looks good to me.
Sorry for not noticing this before, but I think
> +#include <asm/machdep.h>
is now superfluous (I think it's leftover from the machine_is
version?). Maybe mpe will take pity on you and remove it when he picks
up your patch.
Kind regards,
Daniel
>
> static struct device_node *get_ppc_fw_sb_node(void)
> {
> @@ -23,12 +24,19 @@ bool is_ppc_secureboot_enabled(void)
> {
> struct device_node *node;
> bool enabled = false;
> + u32 secureboot;
>
> node = get_ppc_fw_sb_node();
> enabled = of_property_read_bool(node, "os-secureboot-enforcing");
> -
> of_node_put(node);
>
> + if (enabled)
> + goto out;
> +
> + if (!of_property_read_u32(of_root, "ibm,secure-boot", &secureboot))
> + enabled = (secureboot > 1);
> +
> +out:
> pr_info("Secure boot mode %s\n", enabled ? "enabled" : "disabled");
>
> return enabled;
> @@ -38,12 +46,19 @@ bool is_ppc_trustedboot_enabled(void)
> {
> struct device_node *node;
> bool enabled = false;
> + u32 trustedboot;
>
> node = get_ppc_fw_sb_node();
> enabled = of_property_read_bool(node, "trusted-enabled");
> -
> of_node_put(node);
>
> + if (enabled)
> + goto out;
> +
> + if (!of_property_read_u32(of_root, "ibm,trusted-boot", &trustedboot))
> + enabled = (trustedboot > 0);
> +
> +out:
> pr_info("Trusted boot mode %s\n", enabled ? "enabled" : "disabled");
>
> return enabled;
> --
> 2.26.2
^ permalink raw reply
* Re: [PATCH v2] powerpc: select ARCH_HAS_MEMBARRIER_SYNC_CORE
From: Nicholas Piggin @ 2020-07-16 0:56 UTC (permalink / raw)
To: Andreas Schwab; +Cc: linux-arch, Mathieu Desnoyers, linuxppc-dev
In-Reply-To: <87zh816qgv.fsf@igel.home>
Excerpts from Andreas Schwab's message of July 15, 2020 8:08 pm:
> On Jul 15 2020, Nicholas Piggin wrote:
>
>> diff --git a/arch/powerpc/include/asm/exception-64e.h b/arch/powerpc/include/asm/exception-64e.h
>> index 54a98ef7d7fe..071d7ccb830f 100644
>> --- a/arch/powerpc/include/asm/exception-64e.h
>> +++ b/arch/powerpc/include/asm/exception-64e.h
>> @@ -204,7 +204,11 @@ exc_##label##_book3e:
>> LOAD_REG_ADDR(r3,interrupt_base_book3e);\
>> ori r3,r3,vector_offset@l; \
>> mtspr SPRN_IVOR##vector_number,r3;
>> -
>> +/*
>> + * powerpc relies on return from interrupt/syscall being context synchronising
>> + * (which rfi is) to support ARCH_HAS_MEMBARRIER_SYNC_CORE without additional
>> + * additional synchronisation instructions.
>
> s/additonal//
Will fix in a respin.
Thanks,
Nick
^ permalink raw reply
* Re: [PATCH v2] powerpc: select ARCH_HAS_MEMBARRIER_SYNC_CORE
From: Nicholas Piggin @ 2020-07-16 0:57 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: linux-arch, Peter Zijlstra, linuxppc-dev, Andy Lutomirski
In-Reply-To: <849841781.14062.1594816035327.JavaMail.zimbra@efficios.com>
Excerpts from Mathieu Desnoyers's message of July 15, 2020 10:27 pm:
> ----- On Jul 15, 2020, at 5:48 AM, Nicholas Piggin npiggin@gmail.com wrote:
> [...]
>> index 47bd4ea0837d..a4704f405e8d 100644
>> --- a/arch/powerpc/include/asm/exception-64s.h
>> +++ b/arch/powerpc/include/asm/exception-64s.h
>> @@ -68,6 +68,13 @@
>> *
>> * The nop instructions allow us to insert one or more instructions to flush the
>> * L1-D cache when returning to userspace or a guest.
>> + *
>> + * powerpc relies on return from interrupt/syscall being context synchronising
>> + * (which hrfid, rfid, and rfscv are) to support ARCH_HAS_MEMBARRIER_SYNC_CORE
>> + * without additional additional synchronisation instructions. soft-masked
>> + * interrupt replay does not include a context-synchronising rfid, but those
>> + * always return to kernel, the context sync is only required for IPIs which
>> + * return to user.
>> */
>> #define RFI_FLUSH_SLOT \
>> RFI_FLUSH_FIXUP_SECTION; \
>
> I suspect the statement "the context sync is only required for IPIs which return to
> user." is misleading.
>
> As I recall that we need more than just context sync after IPI. We need context sync
> in return path of any trap/interrupt/system call which returns to user-space, else
> we'd need to add the proper core serializing barriers in the scheduler, as we had
> to do for lazy tlb on x86.
>
> Or am I missing something ?
Maybe ambiguous wording. For IPIs, the context synch is only required
for those which return to user. Other things also require context sync.
I will try to improve it.
Thanks,
Nick
^ permalink raw reply
* Re: [PATCH net-next] ibmvnic: Increase driver logging
From: David Miller @ 2020-07-16 1:29 UTC (permalink / raw)
To: kuba; +Cc: drt, netdev, tlfalcon, linuxppc-dev
In-Reply-To: <20200715170632.11f0bf19@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
From: Jakub Kicinski <kuba@kernel.org>
Date: Wed, 15 Jul 2020 17:06:32 -0700
> On Wed, 15 Jul 2020 18:51:55 -0500 Thomas Falcon wrote:
>> free_netdev(netdev);
>> dev_set_drvdata(&dev->dev, NULL);
>> + netdev_info(netdev, "VNIC client device has been successfully removed.\n");
>
> A step too far, perhaps.
>
> In general this patch looks a little questionable IMHO, this amount of
> logging output is not commonly seen in drivers. All the the info
> messages are just static text, not even carrying any extra information.
> In an era of ftrace, and bpftrace, do we really need this?
Agreed, this is too much. This is debugging, and thus suitable for tracing
facilities, at best.
^ permalink raw reply
* [PATCH v3] powerpc: select ARCH_HAS_MEMBARRIER_SYNC_CORE
From: Nicholas Piggin @ 2020-07-16 1:35 UTC (permalink / raw)
To: linuxppc-dev
Cc: linux-arch, Andreas Schwab, Mathieu Desnoyers, Nicholas Piggin
powerpc return from interrupt and return from system call sequences are
context synchronising.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
v3: more comment fixes
.../features/sched/membarrier-sync-core/arch-support.txt | 4 ++--
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/exception-64e.h | 6 +++++-
arch/powerpc/include/asm/exception-64s.h | 8 ++++++++
arch/powerpc/kernel/entry_32.S | 6 ++++++
5 files changed, 22 insertions(+), 3 deletions(-)
diff --git a/Documentation/features/sched/membarrier-sync-core/arch-support.txt b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
index 8a521a622966..52ad74a25f54 100644
--- a/Documentation/features/sched/membarrier-sync-core/arch-support.txt
+++ b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
@@ -5,7 +5,7 @@
#
# Architecture requirements
#
-# * arm/arm64
+# * arm/arm64/powerpc
#
# Rely on implicit context synchronization as a result of exception return
# when returning from IPI handler, and when returning to user-space.
@@ -45,7 +45,7 @@
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
- | powerpc: | TODO |
+ | powerpc: | ok |
| riscv: | TODO |
| s390: | TODO |
| sh: | TODO |
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 9fa23eb320ff..920c4e3ca4ef 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -131,6 +131,7 @@ config PPC
select ARCH_HAS_PTE_DEVMAP if PPC_BOOK3S_64
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_MEMBARRIER_CALLBACKS
+ select ARCH_HAS_MEMBARRIER_SYNC_CORE
select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64
select ARCH_HAS_STRICT_KERNEL_RWX if (PPC32 && !HIBERNATION)
select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
diff --git a/arch/powerpc/include/asm/exception-64e.h b/arch/powerpc/include/asm/exception-64e.h
index 54a98ef7d7fe..72b6657acd2d 100644
--- a/arch/powerpc/include/asm/exception-64e.h
+++ b/arch/powerpc/include/asm/exception-64e.h
@@ -204,7 +204,11 @@ exc_##label##_book3e:
LOAD_REG_ADDR(r3,interrupt_base_book3e);\
ori r3,r3,vector_offset@l; \
mtspr SPRN_IVOR##vector_number,r3;
-
+/*
+ * powerpc relies on return from interrupt/syscall being context synchronising
+ * (which rfi is) to support ARCH_HAS_MEMBARRIER_SYNC_CORE without additional
+ * synchronisation instructions.
+ */
#define RFI_TO_KERNEL \
rfi
diff --git a/arch/powerpc/include/asm/exception-64s.h b/arch/powerpc/include/asm/exception-64s.h
index 47bd4ea0837d..d7a1a427a690 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -68,6 +68,14 @@
*
* The nop instructions allow us to insert one or more instructions to flush the
* L1-D cache when returning to userspace or a guest.
+ *
+ * powerpc relies on return from interrupt/syscall being context synchronising
+ * (which hrfid, rfid, and rfscv are) to support ARCH_HAS_MEMBARRIER_SYNC_CORE
+ * without additional synchronisation instructions.
+ *
+ * soft-masked interrupt replay does not include a context-synchronising rfid,
+ * but those always return to kernel, the sync is only required when returning
+ * to user.
*/
#define RFI_FLUSH_SLOT \
RFI_FLUSH_FIXUP_SECTION; \
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 217ebdf5b00b..f4d0af8e1136 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -35,6 +35,12 @@
#include "head_32.h"
+/*
+ * powerpc relies on return from interrupt/syscall being context synchronising
+ * (which rfi is) to support ARCH_HAS_MEMBARRIER_SYNC_CORE without additional
+ * synchronisation instructions.
+ */
+
/*
* Align to 4k in order to ensure that all functions modyfing srr0/srr1
* fit into one page in order to not encounter a TLB miss between the
--
2.23.0
^ permalink raw reply related
* Re: [PATCH v3 09/12] ppc64/kexec_file: setup backup region for kdump kernel
From: Thiago Jung Bauermann @ 2020-07-16 1:38 UTC (permalink / raw)
To: Hari Bathini
Cc: kernel test robot, Pingfan Liu, Nayna Jain, Kexec-ml,
Mahesh J Salgaonkar, Mimi Zohar, lkml, linuxppc-dev, Sourabh Jain,
Petr Tesarik, Andrew Morton, Dave Young, Vivek Goyal,
Eric Biederman
In-Reply-To: <159466096898.24747.16701009925943468066.stgit@hbathini.in.ibm.com>
Hari Bathini <hbathini@linux.ibm.com> writes:
> @@ -968,7 +1040,7 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
>
> /*
> * Restrict memory usage for kdump kernel by setting up
> - * usable memory ranges.
> + * usable memory ranges and memory reserve map.
> */
> if (image->type == KEXEC_TYPE_CRASH) {
> ret = get_usable_memory_ranges(&umem);
> @@ -980,6 +1052,24 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
> pr_err("Error setting up usable-memory property for kdump kernel\n");
> goto out;
> }
> +
> + ret = fdt_add_mem_rsv(fdt, BACKUP_SRC_START + BACKUP_SRC_SIZE,
> + crashk_res.start - BACKUP_SRC_SIZE);
I believe this answers my question from the other email about how the
crashkernel is prevented from stomping in the crashed kernel's memory,
right? I needed to think for a bit to understand what the above
reservation was protecting. I think it's worth adding a comment.
> + if (ret) {
> + pr_err("Error reserving crash memory: %s\n",
> + fdt_strerror(ret));
> + goto out;
> + }
> + }
> +
> + if (image->arch.backup_start) {
> + ret = fdt_add_mem_rsv(fdt, image->arch.backup_start,
> + BACKUP_SRC_SIZE);
> + if (ret) {
> + pr_err("Error reserving memory for backup: %s\n",
> + fdt_strerror(ret));
> + goto out;
> + }
> }
This is only true for KEXEC_TYPE_CRASH, if I'm following the code
correctly. I think it would be clearer to put the if above inside the if
for KEXEC_TYPE_CRASH to make it clearer.
>
> ret = setup_new_fdt(image, fdt, initrd_load_addr, initrd_len,
<snip>
> diff --git a/arch/powerpc/purgatory/purgatory_64.c b/arch/powerpc/purgatory/purgatory_64.c
> new file mode 100644
> index 0000000..1eca74c
> --- /dev/null
> +++ b/arch/powerpc/purgatory/purgatory_64.c
> @@ -0,0 +1,36 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * purgatory: Runs between two kernels
> + *
> + * Copyright 2020, Hari Bathini, IBM Corporation.
> + */
> +
> +#include <asm/purgatory.h>
> +#include <asm/crashdump-ppc64.h>
> +
> +extern unsigned long backup_start;
> +
> +static void *__memcpy(void *dest, const void *src, unsigned long n)
> +{
> + unsigned long i;
> + unsigned char *d;
> + const unsigned char *s;
> +
> + d = dest;
> + s = src;
> + for (i = 0; i < n; i++)
> + d[i] = s[i];
> +
> + return dest;
> +}
> +
> +void purgatory(void)
> +{
> + void *dest, *src;
> +
> + src = (void *)BACKUP_SRC_START;
> + if (backup_start) {
> + dest = (void *)backup_start;
> + __memcpy(dest, src, BACKUP_SRC_SIZE);
> + }
> +}
In general I'm in favor of using C code over assembly, but having to
bring in that relocation support just for the above makes me wonder if
it's worth it in this case.
--
Thiago Jung Bauermann
IBM Linux Technology Center
^ permalink raw reply
* Re: [PATCH v3 08/12] ppc64/kexec_file: setup the stack for purgatory
From: Thiago Jung Bauermann @ 2020-07-16 1:40 UTC (permalink / raw)
To: Hari Bathini
Cc: Pingfan Liu, Nayna Jain, Kexec-ml, Mahesh J Salgaonkar,
Mimi Zohar, lkml, linuxppc-dev, Sourabh Jain, Petr Tesarik,
Andrew Morton, Dave Young, Vivek Goyal, Eric Biederman
In-Reply-To: <159466095278.24747.9161591016931052627.stgit@hbathini.in.ibm.com>
Sorry, forgot to send one comment for this patch:
Hari Bathini <hbathini@linux.ibm.com> writes:
> @@ -898,10 +900,37 @@ int setup_purgatory_ppc64(struct kimage *image, const void *slave_code,
> goto out;
> }
>
> + /* Setup the stack top */
> + stack_buf = kexec_purgatory_get_symbol_addr(image, "stack_buf");
> + if (!stack_buf)
> + goto out;
> +
> + val = (u64)stack_buf + KEXEC_PURGATORY_STACK_SIZE;
> + ret = kexec_purgatory_get_set_symbol(image, "stack", &val, sizeof(val),
> + false);
> + if (ret)
> + goto out;
> +
> /* Setup the TOC pointer */
> val = get_toc_ptr(&(image->purgatory_info));
> ret = kexec_purgatory_get_set_symbol(image, "my_toc", &val, sizeof(val),
> false);
> + if (ret)
> + goto out;
> +
> + /* Setup OPAL base & entry values */
> + dn = of_find_node_by_path("/ibm,opal");
> + if (dn) {
> + of_property_read_u64(dn, "opal-base-address", &val);
> + ret = kexec_purgatory_get_set_symbol(image, "opal_base", &val,
> + sizeof(val), false);
> + if (ret)
> + goto out;
> +
> + of_property_read_u64(dn, "opal-entry-address", &val);
> + ret = kexec_purgatory_get_set_symbol(image, "opal_entry", &val,
> + sizeof(val), false);
You need to call of_node_put(dn) here and in the if (ret) case above.
> + }
> out:
> if (ret)
> pr_err("Failed to setup purgatory symbols");
--
Thiago Jung Bauermann
IBM Linux Technology Center
^ permalink raw reply
* Re: [PATCH v3 02/12] powerpc/kexec_file: mark PPC64 specific code
From: Thiago Jung Bauermann @ 2020-07-16 1:49 UTC (permalink / raw)
To: Hari Bathini
Cc: Pingfan Liu, Nayna Jain, Kexec-ml, Mahesh J Salgaonkar,
Mimi Zohar, lkml, linuxppc-dev, Sourabh Jain, Petr Tesarik,
Andrew Morton, Dave Young, Vivek Goyal, Eric Biederman
In-Reply-To: <159466085652.24747.2414199807974963385.stgit@hbathini.in.ibm.com>
I didn't forget about this patch. I just wanted to see more of the
changes before comenting on it.
Hari Bathini <hbathini@linux.ibm.com> writes:
> Some of the kexec_file_load code isn't PPC64 specific. Move PPC64
> specific code from kexec/file_load.c to kexec/file_load_64.c. Also,
> rename purgatory/trampoline.S to purgatory/trampoline_64.S in the
> same spirit.
There's only a 64 bit implementation of kexec_file_load() so this is a
somewhat theoretical exercise, but there's no harm in getting the code
organized, so:
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
I have just one question below.
> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
> Tested-by: Pingfan Liu <piliu@redhat.com>
> ---
>
> v2 -> v3:
> * Unchanged. Added Tested-by tag from Pingfan.
>
> v1 -> v2:
> * No changes.
>
>
> arch/powerpc/include/asm/kexec.h | 11 +++
> arch/powerpc/kexec/Makefile | 2 -
> arch/powerpc/kexec/elf_64.c | 7 +-
> arch/powerpc/kexec/file_load.c | 37 ++--------
> arch/powerpc/kexec/file_load_64.c | 108 ++++++++++++++++++++++++++++++
> arch/powerpc/purgatory/Makefile | 4 +
> arch/powerpc/purgatory/trampoline.S | 117 --------------------------------
> arch/powerpc/purgatory/trampoline_64.S | 117 ++++++++++++++++++++++++++++++++
> 8 files changed, 248 insertions(+), 155 deletions(-)
> create mode 100644 arch/powerpc/kexec/file_load_64.c
> delete mode 100644 arch/powerpc/purgatory/trampoline.S
> create mode 100644 arch/powerpc/purgatory/trampoline_64.S
<snip>
> diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
> new file mode 100644
> index 0000000..e6bff960
> --- /dev/null
> +++ b/arch/powerpc/kexec/file_load_64.c
> @@ -0,0 +1,108 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * ppc64 code to implement the kexec_file_load syscall
> + *
> + * Copyright (C) 2004 Adam Litke (agl@us.ibm.com)
> + * Copyright (C) 2004 IBM Corp.
> + * Copyright (C) 2004,2005 Milton D Miller II, IBM Corporation
> + * Copyright (C) 2005 R Sharada (sharada@in.ibm.com)
> + * Copyright (C) 2006 Mohan Kumar M (mohan@in.ibm.com)
> + * Copyright (C) 2020 IBM Corporation
> + *
> + * Based on kexec-tools' kexec-ppc64.c, kexec-elf-rel-ppc64.c, fs2dt.c.
> + * Heavily modified for the kernel by
> + * Hari Bathini <hbathini@linux.ibm.com>.
> + */
> +
> +#include <linux/kexec.h>
> +#include <linux/of_fdt.h>
> +#include <linux/libfdt.h>
> +
> +const struct kexec_file_ops * const kexec_file_loaders[] = {
> + &kexec_elf64_ops,
> + NULL
> +};
> +
> +/**
> + * setup_purgatory_ppc64 - initialize PPC64 specific purgatory's global
> + * variables and call setup_purgatory() to initialize
> + * common global variable.
> + * @image: kexec image.
> + * @slave_code: Slave code for the purgatory.
> + * @fdt: Flattened device tree for the next kernel.
> + * @kernel_load_addr: Address where the kernel is loaded.
> + * @fdt_load_addr: Address where the flattened device tree is loaded.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +int setup_purgatory_ppc64(struct kimage *image, const void *slave_code,
> + const void *fdt, unsigned long kernel_load_addr,
> + unsigned long fdt_load_addr)
> +{
> + int ret;
> +
> + ret = setup_purgatory(image, slave_code, fdt, kernel_load_addr,
> + fdt_load_addr);
> + if (ret)
> + pr_err("Failed to setup purgatory symbols");
> + return ret;
> +}
> +
> +/**
> + * setup_new_fdt_ppc64 - Update the flattend device-tree of the kernel
> + * being loaded.
> + * @image: kexec image being loaded.
> + * @fdt: Flattened device tree for the next kernel.
> + * @initrd_load_addr: Address where the next initrd will be loaded.
> + * @initrd_len: Size of the next initrd, or 0 if there will be none.
> + * @cmdline: Command line for the next kernel, or NULL if there will
> + * be none.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
> + unsigned long initrd_load_addr,
> + unsigned long initrd_len, const char *cmdline)
> +{
> + int chosen_node, ret;
> +
> + /* Remove memory reservation for the current device tree. */
> + ret = delete_fdt_mem_rsv(fdt, __pa(initial_boot_params),
> + fdt_totalsize(initial_boot_params));
> + if (ret == 0)
> + pr_debug("Removed old device tree reservation.\n");
> + else if (ret != -ENOENT) {
> + pr_err("Failed to remove old device-tree reservation.\n");
> + return ret;
> + }
> +
> + ret = setup_new_fdt(image, fdt, initrd_load_addr, initrd_len,
> + cmdline, &chosen_node);
> + if (ret)
> + return ret;
> +
> + ret = fdt_setprop(fdt, chosen_node, "linux,booted-from-kexec", NULL, 0);
> + if (ret)
> + pr_err("Failed to update device-tree with linux,booted-from-kexec\n");
> +
> + return ret;
> +}
For setup_purgatory_ppc64() you start with an empty function and build
from there, but for setup_new_fdt_ppc64() you moved some code here. Is
the code above 64 bit specific?
--
Thiago Jung Bauermann
IBM Linux Technology Center
^ permalink raw reply
* Re: [PATCH v3 10/12] ppc64/kexec_file: prepare elfcore header for crashing kernel
From: Thiago Jung Bauermann @ 2020-07-16 2:22 UTC (permalink / raw)
To: Hari Bathini
Cc: Pingfan Liu, Nayna Jain, Kexec-ml, Mahesh J Salgaonkar,
Mimi Zohar, lkml, linuxppc-dev, Sourabh Jain, Petr Tesarik,
Andrew Morton, Dave Young, Vivek Goyal, Eric Biederman
In-Reply-To: <159466098739.24747.5860501703617893464.stgit@hbathini.in.ibm.com>
Hari Bathini <hbathini@linux.ibm.com> writes:
> /**
> + * get_crash_memory_ranges - Get crash memory ranges. This list includes
> + * first/crashing kernel's memory regions that
> + * would be exported via an elfcore.
> + * @mem_ranges: Range list to add the memory ranges to.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +static int get_crash_memory_ranges(struct crash_mem **mem_ranges)
> +{
> + struct memblock_region *reg;
> + struct crash_mem *tmem;
> + int ret;
> +
> + for_each_memblock(memory, reg) {
> + u64 base, size;
> +
> + base = (u64)reg->base;
> + size = (u64)reg->size;
> +
> + /* Skip backup memory region, which needs a separate entry */
> + if (base == BACKUP_SRC_START) {
> + if (size > BACKUP_SRC_SIZE) {
> + base = BACKUP_SRC_END + 1;
> + size -= BACKUP_SRC_SIZE;
> + } else
> + continue;
> + }
> +
> + ret = add_mem_range(mem_ranges, base, size);
> + if (ret)
> + goto out;
> +
> + /* Try merging adjacent ranges before reallocation attempt */
> + if ((*mem_ranges)->nr_ranges == (*mem_ranges)->max_nr_ranges)
> + sort_memory_ranges(*mem_ranges, true);
> + }
> +
> + /* Reallocate memory ranges if there is no space to split ranges */
> + tmem = *mem_ranges;
> + if (tmem && (tmem->nr_ranges == tmem->max_nr_ranges)) {
> + tmem = realloc_mem_ranges(mem_ranges);
> + if (!tmem)
> + goto out;
> + }
> +
> + /* Exclude crashkernel region */
> + ret = crash_exclude_mem_range(tmem, crashk_res.start, crashk_res.end);
> + if (ret)
> + goto out;
> +
> + ret = add_rtas_mem_range(mem_ranges);
> + if (ret)
> + goto out;
> +
> + ret = add_opal_mem_range(mem_ranges);
> + if (ret)
> + goto out;
Maybe I'm confused, but don't you add the RTAS and OPAL regions as
usable memory for the crashkernel? In that case they shouldn't show up
in the core file.
> +
> + /* create a separate program header for the backup region */
> + ret = add_mem_range(mem_ranges, BACKUP_SRC_START, BACKUP_SRC_SIZE);
> + if (ret)
> + goto out;
> +
> + sort_memory_ranges(*mem_ranges, false);
> +out:
> + if (ret)
> + pr_err("Failed to setup crash memory ranges\n");
> + return ret;
> +}
<snip>
> +/**
> + * prepare_elf_headers - Prepare headers for the elfcore to be exported as
> + * /proc/vmcore by the kdump kernel.
> + * @image: Kexec image.
> + * @cmem: Crash memory ranges to be exported via elfcore.
> + * @addr: Vmalloc'd memory allocated by crash_prepare_elf64_headers
> + * to prepare the elf headers.
> + * @sz: Size of the vmalloc'd memory allocated.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +static int prepare_elf_headers(struct kimage *image, struct crash_mem *cmem,
> + void **addr, unsigned long *sz)
> +{
> + int ret;
> +
> + ret = crash_prepare_elf64_headers(cmem, false, addr, sz);
> +
> + /* Fix the offset for backup region in the ELF header */
> + if (!ret)
> + update_backup_region_phdr(image, *addr);
> +
> + return ret;
> +}
The code above can be inlined into its caller, I don't see a need to
have a separate function.
> +
> +/**
> + * load_elfcorehdr_segment - Setup crash memory ranges and initialize elfcorehdr
> + * segment needed to load kdump kernel.
> + * @image: Kexec image.
> + * @kbuf: Buffer contents and memory parameters.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +static int load_elfcorehdr_segment(struct kimage *image, struct kexec_buf *kbuf)
> +{
> + struct crash_mem *cmem = NULL;
> + unsigned long headers_sz;
> + void *headers = NULL;
> + int ret;
> +
> + ret = get_crash_memory_ranges(&cmem);
> + if (ret)
> + goto out;
> +
> + /* Setup elfcorehdr segment */
> + ret = prepare_elf_headers(image, cmem, &headers, &headers_sz);
> + if (ret) {
> + pr_err("Failed to prepare elf headers for the core\n");
> + goto out;
> + }
> +
> + kbuf->buffer = headers;
> + kbuf->mem = KEXEC_BUF_MEM_UNKNOWN;
> + kbuf->bufsz = kbuf->memsz = headers_sz;
> + kbuf->top_down = false;
> +
> + ret = kexec_add_buffer(kbuf);
> + if (ret) {
> + vfree(headers);
> + goto out;
> + }
> +
> + image->arch.elfcorehdr_addr = kbuf->mem;
> + image->arch.elf_headers_sz = headers_sz;
> + image->arch.elf_headers = headers;
> +out:
> + kfree(cmem);
> + return ret;
> +}
--
Thiago Jung Bauermann
IBM Linux Technology Center
^ permalink raw reply
* Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option
From: Nicholas Piggin @ 2020-07-16 2:26 UTC (permalink / raw)
To: Andy Lutomirski
Cc: linux-arch, Arnd Bergmann, Peter Zijlstra, X86 ML, LKML, Linux-MM,
Mathieu Desnoyers, Andy Lutomirski, linuxppc-dev
In-Reply-To: <6D3D1346-DB1E-43EB-812A-184918CCC16A@amacapital.net>
Excerpts from Andy Lutomirski's message of July 14, 2020 10:46 pm:
>
>
>> On Jul 13, 2020, at 11:31 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
>>
>> Excerpts from Nicholas Piggin's message of July 14, 2020 3:04 pm:
>>> Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am:
>>>>
>>>>> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
>>>>>
>>>>> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am:
>>>>>>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>>>>>>>
>>>>>>> On big systems, the mm refcount can become highly contented when doing
>>>>>>> a lot of context switching with threaded applications (particularly
>>>>>>> switching between the idle thread and an application thread).
>>>>>>>
>>>>>>> Abandoning lazy tlb slows switching down quite a bit in the important
>>>>>>> user->idle->user cases, so so instead implement a non-refcounted scheme
>>>>>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>>>>>>> any remaining lazy ones.
>>>>>>>
>>>>>>> On a 16-socket 192-core POWER8 system, a context switching benchmark
>>>>>>> with as many software threads as CPUs (so each switch will go in and
>>>>>>> out of idle), upstream can achieve a rate of about 1 million context
>>>>>>> switches per second. After this patch it goes up to 118 million.
>>>>>>>
>>>>>>
>>>>>> I read the patch a couple of times, and I have a suggestion that could
>>>>>> be nonsense. You are, effectively, using mm_cpumask() as a sort of
>>>>>> refcount. You're saying "hey, this mm has no more references, but it
>>>>>> still has nonempty mm_cpumask(), so let's send an IPI and shoot down
>>>>>> those references too." I'm wondering whether you actually need the
>>>>>> IPI. What if, instead, you actually treated mm_cpumask as a refcount
>>>>>> for real? Roughly, in __mmdrop(), you would only free the page tables
>>>>>> if mm_cpumask() is empty. And, in the code that removes a CPU from
>>>>>> mm_cpumask(), you would check if mm_users == 0 and, if so, check if
>>>>>> you just removed the last bit from mm_cpumask and potentially free the
>>>>>> mm.
>>>>>>
>>>>>> Getting the locking right here could be a bit tricky -- you need to
>>>>>> avoid two CPUs simultaneously exiting lazy TLB and thinking they
>>>>>> should free the mm, and you also need to avoid an mm with mm_users
>>>>>> hitting zero concurrently with the last remote CPU using it lazily
>>>>>> exiting lazy TLB. Perhaps this could be resolved by having mm_count
>>>>>> == 1 mean "mm_cpumask() is might contain bits and, if so, it owns the
>>>>>> mm" and mm_count == 0 meaning "now it's dead" and using some careful
>>>>>> cmpxchg or dec_return to make sure that only one CPU frees it.
>>>>>>
>>>>>> Or maybe you'd need a lock or RCU for this, but the idea would be to
>>>>>> only ever take the lock after mm_users goes to zero.
>>>>>
>>>>> I don't think it's nonsense, it could be a good way to avoid IPIs.
>>>>>
>>>>> I haven't seen much problem here that made me too concerned about IPIs
>>>>> yet, so I think the simple patch may be good enough to start with
>>>>> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the
>>>>> unlazying with the exit TLB flush without doing anything fancy with
>>>>> ref counting, but we'll see.
>>>>
>>>> I would be cautious with benchmarking here. I would expect that the
>>>> nasty cases may affect power consumption more than performance — the
>>>> specific issue is IPIs hitting idle cores, and the main effects are to
>>>> slow down exit() a bit but also to kick the idle core out of idle.
>>>> Although, if the idle core is in a deep sleep, that IPI could be
>>>> *very* slow.
>>>
>>> It will tend to be self-limiting to some degree (deeper idle cores
>>> would tend to have less chance of IPI) but we have bigger issues on
>>> powerpc with that, like broadcast IPIs to the mm cpumask for THP
>>> management. Power hasn't really shown up as an issue but powerpc
>>> CPUs may have their own requirements and issues there, shall we say.
>>>
>>>> So I think it’s worth at least giving this a try.
>>>
>>> To be clear it's not a complete solution itself. The problem is of
>>> course that mm cpumask gives you false negatives, so the bits
>>> won't always clean up after themselves as CPUs switch away from their
>>> lazy tlb mms.
>>
>> ^^
>>
>> False positives: CPU is in the mm_cpumask, but is not using the mm
>> as a lazy tlb. So there can be bits left and never freed.
>>
>> If you closed the false positives, you're back to a shared mm cache
>> line on lazy mm context switches.
>
> x86 has this exact problem. At least no more than 64*8 CPUs share the cache line :)
>
> Can your share your benchmark?
It's just context_switch1_threads from will-it-scale, running with 1/2
the number of CPUs, and core affinity with an SMT processor (so each
thread from the switching pairs gets spread to their own CPU and so you
get the task->idle->task switching.
It's really just about the worst case, so I wouldn't say it's something
to panic about.
Thanks,
Nick
^ permalink raw reply
* Re: [PATCH v3 11/12] ppc64/kexec_file: add appropriate regions for memory reserve map
From: Thiago Jung Bauermann @ 2020-07-16 2:27 UTC (permalink / raw)
To: Hari Bathini
Cc: Pingfan Liu, Nayna Jain, Kexec-ml, Mahesh J Salgaonkar,
Mimi Zohar, lkml, linuxppc-dev, Sourabh Jain, Petr Tesarik,
Andrew Morton, Dave Young, Vivek Goyal, Eric Biederman
In-Reply-To: <159466100206.24747.3782313430191321863.stgit@hbathini.in.ibm.com>
Hari Bathini <hbathini@linux.ibm.com> writes:
> While initrd, elfcorehdr and backup regions are already added to the
> reserve map, there are a few missing regions that need to be added to
> the memory reserve map. Add them here. And now that all the changes
> to load panic kernel are in place, claim likewise.
>
> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
> Tested-by: Pingfan Liu <piliu@redhat.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Just one oinor nit below.
> ---
>
> v2 -> v3:
> * Unchanged. Added Tested-by tag from Pingfan.
>
> v1 -> v2:
> * Updated add_rtas_mem_range() & add_opal_mem_range() callsites based on
> the new prototype for these functions.
>
>
> arch/powerpc/kexec/file_load_64.c | 58 ++++++++++++++++++++++++++++++++++---
> 1 file changed, 53 insertions(+), 5 deletions(-)
>
> diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
> index 2531bb5..29e5d11 100644
> --- a/arch/powerpc/kexec/file_load_64.c
> +++ b/arch/powerpc/kexec/file_load_64.c
> @@ -193,6 +193,34 @@ static int get_crash_memory_ranges(struct crash_mem **mem_ranges)
> }
>
> /**
> + * get_reserved_memory_ranges - Get reserve memory ranges. This list includes
> + * memory regions that should be added to the
> + * memory reserve map to ensure the region is
> + * protected from any mischeif.
s/mischeif/mischief/
> + * @mem_ranges: Range list to add the memory ranges to.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +static int get_reserved_memory_ranges(struct crash_mem **mem_ranges)
> +{
> + int ret;
> +
> + ret = add_rtas_mem_range(mem_ranges);
> + if (ret)
> + goto out;
> +
> + ret = add_tce_mem_ranges(mem_ranges);
> + if (ret)
> + goto out;
> +
> + ret = add_reserved_ranges(mem_ranges);
> +out:
> + if (ret)
> + pr_err("Failed to setup reserved memory ranges\n");
> + return ret;
> +}
--
Thiago Jung Bauermann
IBM Linux Technology Center
^ permalink raw reply
* Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option
From: Nicholas Piggin @ 2020-07-16 2:35 UTC (permalink / raw)
To: Andy Lutomirski
Cc: linux-arch, Arnd Bergmann, Peter Zijlstra, X86 ML, LKML, Linux-MM,
Mathieu Desnoyers, Andy Lutomirski, linuxppc-dev
In-Reply-To: <6D3D1346-DB1E-43EB-812A-184918CCC16A@amacapital.net>
Excerpts from Andy Lutomirski's message of July 14, 2020 10:46 pm:
>
>
>> On Jul 13, 2020, at 11:31 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
>>
>> Excerpts from Nicholas Piggin's message of July 14, 2020 3:04 pm:
>>> Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am:
>>>>
>>>>> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
>>>>>
>>>>> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am:
>>>>>>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>>>>>>>
>>>>>>> On big systems, the mm refcount can become highly contented when doing
>>>>>>> a lot of context switching with threaded applications (particularly
>>>>>>> switching between the idle thread and an application thread).
>>>>>>>
>>>>>>> Abandoning lazy tlb slows switching down quite a bit in the important
>>>>>>> user->idle->user cases, so so instead implement a non-refcounted scheme
>>>>>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>>>>>>> any remaining lazy ones.
>>>>>>>
>>>>>>> On a 16-socket 192-core POWER8 system, a context switching benchmark
>>>>>>> with as many software threads as CPUs (so each switch will go in and
>>>>>>> out of idle), upstream can achieve a rate of about 1 million context
>>>>>>> switches per second. After this patch it goes up to 118 million.
>>>>>>>
>>>>>>
>>>>>> I read the patch a couple of times, and I have a suggestion that could
>>>>>> be nonsense. You are, effectively, using mm_cpumask() as a sort of
>>>>>> refcount. You're saying "hey, this mm has no more references, but it
>>>>>> still has nonempty mm_cpumask(), so let's send an IPI and shoot down
>>>>>> those references too." I'm wondering whether you actually need the
>>>>>> IPI. What if, instead, you actually treated mm_cpumask as a refcount
>>>>>> for real? Roughly, in __mmdrop(), you would only free the page tables
>>>>>> if mm_cpumask() is empty. And, in the code that removes a CPU from
>>>>>> mm_cpumask(), you would check if mm_users == 0 and, if so, check if
>>>>>> you just removed the last bit from mm_cpumask and potentially free the
>>>>>> mm.
>>>>>>
>>>>>> Getting the locking right here could be a bit tricky -- you need to
>>>>>> avoid two CPUs simultaneously exiting lazy TLB and thinking they
>>>>>> should free the mm, and you also need to avoid an mm with mm_users
>>>>>> hitting zero concurrently with the last remote CPU using it lazily
>>>>>> exiting lazy TLB. Perhaps this could be resolved by having mm_count
>>>>>> == 1 mean "mm_cpumask() is might contain bits and, if so, it owns the
>>>>>> mm" and mm_count == 0 meaning "now it's dead" and using some careful
>>>>>> cmpxchg or dec_return to make sure that only one CPU frees it.
>>>>>>
>>>>>> Or maybe you'd need a lock or RCU for this, but the idea would be to
>>>>>> only ever take the lock after mm_users goes to zero.
>>>>>
>>>>> I don't think it's nonsense, it could be a good way to avoid IPIs.
>>>>>
>>>>> I haven't seen much problem here that made me too concerned about IPIs
>>>>> yet, so I think the simple patch may be good enough to start with
>>>>> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the
>>>>> unlazying with the exit TLB flush without doing anything fancy with
>>>>> ref counting, but we'll see.
>>>>
>>>> I would be cautious with benchmarking here. I would expect that the
>>>> nasty cases may affect power consumption more than performance — the
>>>> specific issue is IPIs hitting idle cores, and the main effects are to
>>>> slow down exit() a bit but also to kick the idle core out of idle.
>>>> Although, if the idle core is in a deep sleep, that IPI could be
>>>> *very* slow.
>>>
>>> It will tend to be self-limiting to some degree (deeper idle cores
>>> would tend to have less chance of IPI) but we have bigger issues on
>>> powerpc with that, like broadcast IPIs to the mm cpumask for THP
>>> management. Power hasn't really shown up as an issue but powerpc
>>> CPUs may have their own requirements and issues there, shall we say.
>>>
>>>> So I think it’s worth at least giving this a try.
>>>
>>> To be clear it's not a complete solution itself. The problem is of
>>> course that mm cpumask gives you false negatives, so the bits
>>> won't always clean up after themselves as CPUs switch away from their
>>> lazy tlb mms.
>>
>> ^^
>>
>> False positives: CPU is in the mm_cpumask, but is not using the mm
>> as a lazy tlb. So there can be bits left and never freed.
>>
>> If you closed the false positives, you're back to a shared mm cache
>> line on lazy mm context switches.
>
> x86 has this exact problem. At least no more than 64*8 CPUs share the cache line :)
>
> Can your share your benchmark?
Just testing the IPI rates (on a smaller 176 CPU system), on a
kernel compile, it causes about 300 shootdown interrupts (not
300 broadcasts but total interrupts).
And very short lived fork;exec;exit things like typical scripting
commands doesn't typically generate any.
So yeah the really high exit rate things self-limit pretty well.
I documented the concern and added a few of the possible ways to
further reduce IPIs in the comments though.
Thanks,
Nick
^ permalink raw reply
* [powerpc:next-test] BUILD SUCCESS 085563a9059b42c635151e970fd3af941f46a0fd
From: kernel test robot @ 2020-07-16 2:52 UTC (permalink / raw)
To: Michael Ellerman; +Cc: linuxppc-dev
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next-test
branch HEAD: 085563a9059b42c635151e970fd3af941f46a0fd powerpc/perf: Add kernel support for new MSR[HV PR] bits in trace-imc
elapsed time: 735m
configs tested: 88
configs skipped: 2
The following configs have been built successfully.
More configs may be tested in the coming days.
arm defconfig
arm allyesconfig
arm allmodconfig
arm allnoconfig
arm64 allyesconfig
arm64 defconfig
arm64 allmodconfig
arm64 allnoconfig
arc axs101_defconfig
c6x evmc6457_defconfig
sh kfr2r09-romimage_defconfig
powerpc gamecube_defconfig
arm lpd270_defconfig
arm clps711x_defconfig
arm corgi_defconfig
riscv allyesconfig
arm orion5x_defconfig
i386 allnoconfig
i386 allyesconfig
i386 defconfig
i386 debian-10.3
ia64 allmodconfig
ia64 defconfig
ia64 allnoconfig
ia64 allyesconfig
m68k allmodconfig
m68k allnoconfig
m68k sun3_defconfig
m68k defconfig
m68k allyesconfig
nios2 defconfig
nios2 allyesconfig
openrisc defconfig
c6x allyesconfig
c6x allnoconfig
openrisc allyesconfig
nds32 defconfig
nds32 allnoconfig
csky allyesconfig
csky defconfig
alpha defconfig
alpha allyesconfig
xtensa allyesconfig
h8300 allyesconfig
h8300 allmodconfig
xtensa defconfig
arc defconfig
arc allyesconfig
sh allmodconfig
sh allnoconfig
microblaze allnoconfig
mips allyesconfig
mips allnoconfig
mips allmodconfig
parisc allnoconfig
parisc defconfig
parisc allyesconfig
parisc allmodconfig
powerpc allyesconfig
powerpc rhel-kconfig
powerpc allmodconfig
powerpc allnoconfig
powerpc defconfig
i386 randconfig-a016-20200715
i386 randconfig-a011-20200715
i386 randconfig-a015-20200715
i386 randconfig-a012-20200715
i386 randconfig-a013-20200715
i386 randconfig-a014-20200715
riscv allnoconfig
riscv defconfig
riscv allmodconfig
s390 allyesconfig
s390 allnoconfig
s390 allmodconfig
s390 defconfig
sparc allyesconfig
sparc defconfig
sparc64 defconfig
sparc64 allnoconfig
sparc64 allyesconfig
sparc64 allmodconfig
x86_64 rhel-7.6-kselftests
x86_64 rhel-8.3
x86_64 kexec
x86_64 rhel
x86_64 lkp
x86_64 fedora-25
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
^ permalink raw reply
* [powerpc:next] BUILD SUCCESS f88223979044bfae32cb16f814c43739e5ba1301
From: kernel test robot @ 2020-07-16 2:52 UTC (permalink / raw)
To: Michael Ellerman; +Cc: linuxppc-dev
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
branch HEAD: f88223979044bfae32cb16f814c43739e5ba1301 powerpc: re-initialise lazy FPU/VEC counters on every fault
elapsed time: 724m
configs tested: 100
configs skipped: 2
The following configs have been built successfully.
More configs may be tested in the coming days.
arm defconfig
arm allyesconfig
arm allmodconfig
arm allnoconfig
arm64 allyesconfig
arm64 defconfig
arm64 allmodconfig
arm64 allnoconfig
arc axs101_defconfig
c6x evmc6457_defconfig
sh kfr2r09-romimage_defconfig
powerpc gamecube_defconfig
arm lpd270_defconfig
arm clps711x_defconfig
arm corgi_defconfig
riscv allyesconfig
arm orion5x_defconfig
i386 allnoconfig
i386 allyesconfig
i386 defconfig
i386 debian-10.3
ia64 allmodconfig
ia64 defconfig
ia64 allnoconfig
ia64 allyesconfig
m68k allmodconfig
m68k allnoconfig
m68k sun3_defconfig
m68k defconfig
m68k allyesconfig
nios2 defconfig
nios2 allyesconfig
openrisc defconfig
c6x allyesconfig
c6x allnoconfig
openrisc allyesconfig
nds32 defconfig
nds32 allnoconfig
csky allyesconfig
csky defconfig
alpha defconfig
alpha allyesconfig
xtensa allyesconfig
h8300 allyesconfig
h8300 allmodconfig
xtensa defconfig
arc defconfig
arc allyesconfig
sh allmodconfig
sh allnoconfig
microblaze allnoconfig
mips allyesconfig
mips allnoconfig
mips allmodconfig
parisc allnoconfig
parisc defconfig
parisc allyesconfig
parisc allmodconfig
powerpc defconfig
powerpc allyesconfig
powerpc rhel-kconfig
powerpc allmodconfig
powerpc allnoconfig
i386 randconfig-a001-20200715
i386 randconfig-a005-20200715
i386 randconfig-a002-20200715
i386 randconfig-a006-20200715
i386 randconfig-a003-20200715
i386 randconfig-a004-20200715
x86_64 randconfig-a005-20200715
x86_64 randconfig-a006-20200715
x86_64 randconfig-a002-20200715
x86_64 randconfig-a001-20200715
x86_64 randconfig-a003-20200715
x86_64 randconfig-a004-20200715
i386 randconfig-a016-20200715
i386 randconfig-a011-20200715
i386 randconfig-a015-20200715
i386 randconfig-a012-20200715
i386 randconfig-a013-20200715
i386 randconfig-a014-20200715
riscv allnoconfig
riscv defconfig
riscv allmodconfig
s390 allyesconfig
s390 allnoconfig
s390 allmodconfig
s390 defconfig
sparc allyesconfig
sparc defconfig
sparc64 defconfig
sparc64 allnoconfig
sparc64 allyesconfig
sparc64 allmodconfig
x86_64 rhel-7.6-kselftests
x86_64 rhel-8.3
x86_64 kexec
x86_64 rhel
x86_64 lkp
x86_64 fedora-25
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
^ permalink raw reply
* [powerpc:fixes-test] BUILD SUCCESS f0479c4bcbd92d1a457d4a43bcab79f29d11334a
From: kernel test robot @ 2020-07-16 2:52 UTC (permalink / raw)
To: Michael Ellerman; +Cc: linuxppc-dev
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git fixes-test
branch HEAD: f0479c4bcbd92d1a457d4a43bcab79f29d11334a selftests/powerpc: Use proper error code to check fault address
elapsed time: 737m
configs tested: 100
configs skipped: 2
The following configs have been built successfully.
More configs may be tested in the coming days.
arm defconfig
arm allyesconfig
arm allmodconfig
arm allnoconfig
arm64 allyesconfig
arm64 defconfig
arm64 allmodconfig
arm64 allnoconfig
arc axs101_defconfig
c6x evmc6457_defconfig
sh kfr2r09-romimage_defconfig
powerpc gamecube_defconfig
arm lpd270_defconfig
arm clps711x_defconfig
arm corgi_defconfig
riscv allyesconfig
arm orion5x_defconfig
i386 allnoconfig
i386 allyesconfig
i386 defconfig
i386 debian-10.3
ia64 allmodconfig
ia64 defconfig
ia64 allnoconfig
ia64 allyesconfig
m68k allmodconfig
m68k allnoconfig
m68k sun3_defconfig
m68k defconfig
m68k allyesconfig
nios2 defconfig
nios2 allyesconfig
openrisc defconfig
c6x allyesconfig
c6x allnoconfig
openrisc allyesconfig
nds32 defconfig
nds32 allnoconfig
csky allyesconfig
csky defconfig
alpha defconfig
alpha allyesconfig
xtensa allyesconfig
h8300 allyesconfig
h8300 allmodconfig
xtensa defconfig
arc defconfig
arc allyesconfig
sh allmodconfig
sh allnoconfig
microblaze allnoconfig
mips allyesconfig
mips allnoconfig
mips allmodconfig
parisc allnoconfig
parisc defconfig
parisc allyesconfig
parisc allmodconfig
powerpc allyesconfig
powerpc rhel-kconfig
powerpc allmodconfig
powerpc allnoconfig
powerpc defconfig
i386 randconfig-a001-20200715
i386 randconfig-a005-20200715
i386 randconfig-a002-20200715
i386 randconfig-a006-20200715
i386 randconfig-a003-20200715
i386 randconfig-a004-20200715
x86_64 randconfig-a005-20200715
x86_64 randconfig-a006-20200715
x86_64 randconfig-a002-20200715
x86_64 randconfig-a001-20200715
x86_64 randconfig-a003-20200715
x86_64 randconfig-a004-20200715
i386 randconfig-a016-20200715
i386 randconfig-a011-20200715
i386 randconfig-a015-20200715
i386 randconfig-a012-20200715
i386 randconfig-a013-20200715
i386 randconfig-a014-20200715
riscv allnoconfig
riscv defconfig
riscv allmodconfig
s390 allyesconfig
s390 allnoconfig
s390 allmodconfig
s390 defconfig
sparc allyesconfig
sparc defconfig
sparc64 defconfig
sparc64 allnoconfig
sparc64 allyesconfig
sparc64 allmodconfig
x86_64 rhel-7.6-kselftests
x86_64 rhel-8.3
x86_64 kexec
x86_64 rhel
x86_64 lkp
x86_64 fedora-25
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox