LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 1/2] ASoC: fsl_rpmsg: add soc specific data structure
From: Shengjiu Wang @ 2021-08-26 10:57 UTC (permalink / raw)
  To: timur, nicoleotsuka, Xiubo.Lee, festevam, broonie, perex, tiwai,
	alsa-devel
  Cc: linuxppc-dev, linux-kernel

Each platform has different supported rates and
formats, so add soc specific data for each platform.
This soc specific data is attached with compatible string.

Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com>
---
 sound/soc/fsl/fsl_rpmsg.c | 47 +++++++++++++++++++++++++++++++++++----
 sound/soc/fsl/fsl_rpmsg.h | 12 ++++++++++
 2 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/sound/soc/fsl/fsl_rpmsg.c b/sound/soc/fsl/fsl_rpmsg.c
index d60f4dac6c1b..bda6cc96071d 100644
--- a/sound/soc/fsl/fsl_rpmsg.c
+++ b/sound/soc/fsl/fsl_rpmsg.c
@@ -138,11 +138,42 @@ static const struct snd_soc_component_driver fsl_component = {
 	.name           = "fsl-rpmsg",
 };
 
+static const struct fsl_rpmsg_soc_data imx7ulp_data = {
+	.rates = SNDRV_PCM_RATE_8000 | SNDRV_PCM_RATE_16000 |
+		 SNDRV_PCM_RATE_48000,
+	.formats = SNDRV_PCM_FMTBIT_S16_LE,
+};
+
+static const struct fsl_rpmsg_soc_data imx8mm_data = {
+	.rates = SNDRV_PCM_RATE_KNOT,
+	.formats = SNDRV_PCM_FMTBIT_S16_LE | SNDRV_PCM_FMTBIT_S24_LE |
+		   SNDRV_PCM_FMTBIT_S32_LE | SNDRV_PCM_FMTBIT_DSD_U8 |
+		   SNDRV_PCM_FMTBIT_DSD_U16_LE | SNDRV_PCM_FMTBIT_DSD_U32_LE,
+};
+
+static const struct fsl_rpmsg_soc_data imx8mn_data = {
+	.rates = SNDRV_PCM_RATE_32000 | SNDRV_PCM_RATE_44100 |
+		 SNDRV_PCM_RATE_48000 | SNDRV_PCM_RATE_88200 |
+		 SNDRV_PCM_RATE_96000 | SNDRV_PCM_RATE_176400 |
+		 SNDRV_PCM_RATE_192000,
+	.formats = SNDRV_PCM_FMTBIT_S16_LE | SNDRV_PCM_FMTBIT_S24_LE |
+		   SNDRV_PCM_FMTBIT_S32_LE,
+};
+
+static const struct fsl_rpmsg_soc_data imx8mp_data = {
+	.rates = SNDRV_PCM_RATE_32000 | SNDRV_PCM_RATE_44100 |
+		 SNDRV_PCM_RATE_48000 | SNDRV_PCM_RATE_88200 |
+		 SNDRV_PCM_RATE_96000 | SNDRV_PCM_RATE_176400 |
+		 SNDRV_PCM_RATE_192000,
+	.formats = SNDRV_PCM_FMTBIT_S16_LE | SNDRV_PCM_FMTBIT_S24_LE |
+		   SNDRV_PCM_FMTBIT_S32_LE,
+};
+
 static const struct of_device_id fsl_rpmsg_ids[] = {
-	{ .compatible = "fsl,imx7ulp-rpmsg-audio"},
-	{ .compatible = "fsl,imx8mm-rpmsg-audio"},
-	{ .compatible = "fsl,imx8mn-rpmsg-audio"},
-	{ .compatible = "fsl,imx8mp-rpmsg-audio"},
+	{ .compatible = "fsl,imx7ulp-rpmsg-audio", .data = &imx7ulp_data},
+	{ .compatible = "fsl,imx8mm-rpmsg-audio", .data = &imx8mm_data},
+	{ .compatible = "fsl,imx8mn-rpmsg-audio", .data = &imx8mn_data},
+	{ .compatible = "fsl,imx8mp-rpmsg-audio", .data = &imx8mp_data},
 	{ /* sentinel */ }
 };
 MODULE_DEVICE_TABLE(of, fsl_rpmsg_ids);
@@ -157,6 +188,14 @@ static int fsl_rpmsg_probe(struct platform_device *pdev)
 	if (!rpmsg)
 		return -ENOMEM;
 
+	rpmsg->soc_data = of_device_get_match_data(&pdev->dev);
+	if (rpmsg->soc_data) {
+		fsl_rpmsg_dai.playback.rates = rpmsg->soc_data->rates;
+		fsl_rpmsg_dai.capture.rates = rpmsg->soc_data->rates;
+		fsl_rpmsg_dai.playback.formats = rpmsg->soc_data->formats;
+		fsl_rpmsg_dai.capture.formats = rpmsg->soc_data->formats;
+	}
+
 	if (of_property_read_bool(np, "fsl,enable-lpa")) {
 		rpmsg->enable_lpa = 1;
 		rpmsg->buffer_size = LPA_LARGE_BUFFER_SIZE;
diff --git a/sound/soc/fsl/fsl_rpmsg.h b/sound/soc/fsl/fsl_rpmsg.h
index 4f5b49eb18d8..b04086fbf828 100644
--- a/sound/soc/fsl/fsl_rpmsg.h
+++ b/sound/soc/fsl/fsl_rpmsg.h
@@ -6,6 +6,16 @@
 #ifndef __FSL_RPMSG_H
 #define __FSL_RPMSG_H
 
+/*
+ * struct fsl_rpmsg_soc_data
+ * @rates: supported rates
+ * @formats: supported formats
+ */
+struct fsl_rpmsg_soc_data {
+	int rates;
+	u64 formats;
+};
+
 /*
  * struct fsl_rpmsg - rpmsg private data
  *
@@ -15,6 +25,7 @@
  * @pll8k: parent clock for multiple of 8kHz frequency
  * @pll11k: parent clock for multiple of 11kHz frequency
  * @card_pdev: Platform_device pointer to register a sound card
+ * @soc_data: soc specific data
  * @mclk_streams: Active streams that are using baudclk
  * @force_lpa: force enable low power audio routine if condition satisfy
  * @enable_lpa: enable low power audio routine according to dts setting
@@ -27,6 +38,7 @@ struct fsl_rpmsg {
 	struct clk *pll8k;
 	struct clk *pll11k;
 	struct platform_device *card_pdev;
+	const struct fsl_rpmsg_soc_data *soc_data;
 	unsigned int mclk_streams;
 	int force_lpa;
 	int enable_lpa;
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH 1/2] ASoC: fsl_rpmsg: add soc specific data structure
From: Fabio Estevam @ 2021-08-26 11:52 UTC (permalink / raw)
  To: Shengjiu Wang
  Cc: Linux-ALSA, Timur Tabi, Xiubo Li, linux-kernel, Takashi Iwai,
	Jaroslav Kysela, Nicolin Chen, Mark Brown, linuxppc-dev
In-Reply-To: <1629975460-17990-1-git-send-email-shengjiu.wang@nxp.com>

Hi Shengjiu,

On Thu, Aug 26, 2021 at 8:19 AM Shengjiu Wang <shengjiu.wang@nxp.com> wrote:

> +       rpmsg->soc_data = of_device_get_match_data(&pdev->dev);
> +       if (rpmsg->soc_data) {

This check is not necessary, because rpmsg->soc_data is always non-NULL.

Other than that:

Reviewed-by: Fabio Estevam <festevam@gmail.com>

^ permalink raw reply

* [PATCH 0/3] powerpc/microwatt: Device tree and config updates
From: Joel Stanley @ 2021-08-26 12:26 UTC (permalink / raw)
  To: Michael Ellerman, Paul Mackerras, Anton Blanchard,
	Michael Neuling
  Cc: linuxppc-dev

This enables the liteeth network device for microwatt which will be
merged in v5.15.

It also turns on some options so the microwatt defconfig can be used to
boot a userspace with systemd.

Joel Stanley (3):
  powerpc/microwatt: Add Ethernet to device tree
  powerpc/configs/microwattt: Enable Liteeth
  powerpc/configs/microwatt: Enable options for systemd

 arch/powerpc/boot/dts/microwatt.dts      | 12 ++++++++++++
 arch/powerpc/configs/microwatt_defconfig |  6 +++++-
 2 files changed, 17 insertions(+), 1 deletion(-)

-- 
2.33.0


^ permalink raw reply

* [PATCH 1/3] powerpc/microwatt: Add Ethernet to device tree
From: Joel Stanley @ 2021-08-26 12:26 UTC (permalink / raw)
  To: Michael Ellerman, Paul Mackerras, Anton Blanchard,
	Michael Neuling
  Cc: linuxppc-dev
In-Reply-To: <20210826122653.3236867-1-joel@jms.id.au>

The liteeth network device is used in the Microwatt soc.

Signed-off-by: Joel Stanley <joel@jms.id.au>
---
 arch/powerpc/boot/dts/microwatt.dts | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/powerpc/boot/dts/microwatt.dts b/arch/powerpc/boot/dts/microwatt.dts
index 974abbdda249..65b270a90f94 100644
--- a/arch/powerpc/boot/dts/microwatt.dts
+++ b/arch/powerpc/boot/dts/microwatt.dts
@@ -127,6 +127,18 @@ UART0: serial@2000 {
 			fifo-size = <16>;
 			interrupts = <0x10 0x1>;
 		};
+
+		ethernet@8020000 {
+			compatible = "litex,liteeth";
+			reg = <0x8021000 0x100
+				0x8020800 0x100
+				0x8030000 0x2000>;
+			reg-names = "mac", "mido", "buffer";
+			litex,rx-slots = <2>;
+			litex,tx-slots = <2>;
+			litex,slot-size = <0x800>;
+			interrupts = <0x11 0x1>;
+		};
 	};
 
 	chosen {
-- 
2.33.0


^ permalink raw reply related

* [PATCH 2/3] powerpc/configs/microwattt: Enable Liteeth
From: Joel Stanley @ 2021-08-26 12:26 UTC (permalink / raw)
  To: Michael Ellerman, Paul Mackerras, Anton Blanchard,
	Michael Neuling
  Cc: linuxppc-dev
In-Reply-To: <20210826122653.3236867-1-joel@jms.id.au>

Liteeth is the network device used by Microwatt.

Signed-off-by: Joel Stanley <joel@jms.id.au>
---
 arch/powerpc/configs/microwatt_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/configs/microwatt_defconfig b/arch/powerpc/configs/microwatt_defconfig
index a08b739123da..2f8b1fec9a16 100644
--- a/arch/powerpc/configs/microwatt_defconfig
+++ b/arch/powerpc/configs/microwatt_defconfig
@@ -53,6 +53,7 @@ CONFIG_MTD_SPI_NOR=y
 CONFIG_BLK_DEV_LOOP=y
 CONFIG_BLK_DEV_RAM=y
 CONFIG_NETDEVICES=y
+CONFIG_LITEX_LITEETH=y
 # CONFIG_WLAN is not set
 # CONFIG_INPUT is not set
 # CONFIG_SERIO is not set
-- 
2.33.0


^ permalink raw reply related

* [PATCH 3/3] powerpc/configs/microwatt: Enable options for systemd
From: Joel Stanley @ 2021-08-26 12:26 UTC (permalink / raw)
  To: Michael Ellerman, Paul Mackerras, Anton Blanchard,
	Michael Neuling
  Cc: linuxppc-dev
In-Reply-To: <20210826122653.3236867-1-joel@jms.id.au>

When booting with systemd these options are required.

This increases the image by about 50KB, or 2%.

Signed-off-by: Joel Stanley <joel@jms.id.au>
---
 arch/powerpc/configs/microwatt_defconfig | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/configs/microwatt_defconfig b/arch/powerpc/configs/microwatt_defconfig
index 2f8b1fec9a16..4a4924cd056e 100644
--- a/arch/powerpc/configs/microwatt_defconfig
+++ b/arch/powerpc/configs/microwatt_defconfig
@@ -5,6 +5,7 @@ CONFIG_PREEMPT_VOLUNTARY=y
 CONFIG_TICK_CPU_ACCOUNTING=y
 CONFIG_LOG_BUF_SHIFT=16
 CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=12
+CONFIG_CGROUPS=y
 CONFIG_BLK_DEV_INITRD=y
 CONFIG_CC_OPTIMIZE_FOR_SIZE=y
 CONFIG_KALLSYMS_ALL=y
@@ -77,8 +78,10 @@ CONFIG_SPI_SPIDEV=y
 CONFIG_EXT4_FS=y
 # CONFIG_FILE_LOCKING is not set
 # CONFIG_DNOTIFY is not set
-# CONFIG_INOTIFY_USER is not set
+CONFIG_AUTOFS_FS=y
+CONFIG_TMPFS=y
 # CONFIG_MISC_FILESYSTEMS is not set
+CONFIG_CRYPTO_SHA256=y
 # CONFIG_CRYPTO_HW is not set
 # CONFIG_XZ_DEC_X86 is not set
 # CONFIG_XZ_DEC_IA64 is not set
-- 
2.33.0


^ permalink raw reply related

* Re: [PATCH v3] PCI: Move pci_dev_is/assign_added() to pci.h
From: Niklas Schnelle @ 2021-08-26 12:36 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-arch, linux-s390, linux-kernel, Paul Mackerras, linux-pci,
	Bjorn Helgaas, linuxppc-dev
In-Reply-To: <20210825190444.GA3593752@bjorn-Precision-5520>

On Wed, 2021-08-25 at 14:04 -0500, Bjorn Helgaas wrote:
> On Mon, Aug 23, 2021 at 12:53:39PM +0200, Niklas Schnelle wrote:
> > On Fri, 2021-08-20 at 17:37 -0500, Bjorn Helgaas wrote:
> > > On Tue, Jul 20, 2021 at 05:01:45PM +0200, Niklas Schnelle wrote:
> > > > The helper function pci_dev_is_added() from drivers/pci/pci.h is used in
> > > > PCI arch code of both s390 and powerpc leading to awkward relative
> > > > includes. Move it to the global include/linux/pci.h and get rid of these
> > > > includes just for that one function.
> > > 
> > > I agree the includes are awkward.
> > > 
> > > But the arch code *using* pci_dev_is_added() seems awkward, too.
> > 
> > See below for my interpretation why s390 has some driver like
> > functionality in its arch code which isn't necessarily awkward.
> > 
> > Independent from that I have found pci_dev_is_added() as the only way
> > deal with the case that one might be looking at a struct pci_dev
> > reference that has been removed via pci_stop_and_remove_bus_device() or
> > has never been fully scanned. This is quite useful when handling error
> > events which on s390 are part of the adapter event mechanism shared
> > with channel I/O devices.
> > 
> > > AFAICS, in powerpc, pci_dev_is_added() is only used by
> > > pnv_pci_ioda_fixup_iov() and pseries_pci_fixup_iov_resources().  Those
> > > are only called from pcibios_add_device(), which is only called from
> > > pci_device_add().
> > > 
> > > Is it even possible for pci_dev_is_added() to be true in that path?
> 
> If the pci_dev_is_added() in powerpc is unreachable, we can remove it
> and at least reduce this to an s390-only problem.

Ok. I might be missing something but I agree it does look these are
called from within pcibios_add_device() only so pci_dev_is_added() can
never be true. This looks pretty clear as pci_dev_assign_added() is
called after pcibios_add_device() in pci_bus_add_device().

> 
> > > s390 uses pci_dev_is_added() in recover_store()
> > 
> > I'm actually looking into this as I'm working on an s390 implementation
> > of the PCI recovery flow described in Documentation/PCI/pci-error-
> > recovery.rst that would also call pci_dev_is_added() because when we
> > get a platform notification of a PCI reset done by firmware it may be
> > that the struct pci_dev is going away i.e. we still have a ref count
> > but it is not added to the PCI bus anymore. And pci_dev_is_added() is
> > the only way I've found to check for this state.
> > 
> > > , but I don't know what
> > > that is (looks like a sysfs file, but it's not documented) or why s390
> > > is the only arch that does this.
> > 
> > Good point about this not being documented, I'll look into adding docs.
> > 
> > This is a sysfs attribute that basically removes the pci_dev and re-
> > adds it. This has the complication that since the attribute sits at
> > /sys/bus/pci/devices/<dev>/recover it deletes its own parent directory
> > which requires extra caution and means concurrent accesses block on
> > pci_lock_rescan_remove() instead of a kernfs lock.
> > Long story short when concurrently triggering the attribute one thread
> > proceeds into the pci_lock_rescan_remove() section and does the
> > removal, while others would block on pci_lock_rescan_remove(). Now when
> > the threads unblock the removal is done. In this case there is a new
> > struct pci_dev found in the rescan but the previously blocked threads
> > still have references to the old struct pci_dev which was removed and
> > as far as I could tell can only be distinguished by checking
> > pci_dev_is_added().
> 
> Is this locking issue different from concurrently writing to
> /sys/.../remove on other architectures?

In principle it is very similar except that we re-scan and thus the
removed pdev may co-exist with a new pdev for the same actual device if
there are other references to the pdev.

There is however also a significant difference in locking that fixes a
possible deadlock that I just confirmed also affects /sys/../remove
where it is hidden by a lockdep ignore, see lockdep splash below. 

For /sys/../recover this was fixed by my commit dd712f0a53ca
("s390/pci: Fix possible deadlock in  recover_store()") which also
introduced the need for pci_dev_is_added().

The change in the above commit is very similar to what is done in
drivers/scsi/scsi_sysfs.c:sdev_store_delete() which also has a self
deletion and in commit 0ee223b2e1f6 ("scsi: core: Avoid that SCSI
device removal through sysfs triggers a deadlock") fixed a very very
similar lock order problem.

Like the SCSI code I use sysfs_break_active_protection() which allows a
concurrent access to /sys/../recover to advance into recover_store().
It may then block on pci_lock_rescan_remove() instead of the kernfs
lock it would block on with just delete_remove_file_self(). Thus we
have to handle unblocking from pci_lock_rescan_remove() while holding
the stale struct pci_dev of the removed device. Together with the re-
scan this means I have to be able to determine after unblocking if I'm
looking at the old pdev.

I just tested that when removing the lockdep ignore from remove_store()
and do /sys/../power and /sys/../remove I do hit the same lock order
inversion issue there for remove. Note that unlike in the commit
introducing the ignore this is a completely flat hiearchy:

[  234.196509] ======================================================
[  234.196511] WARNING: possible circular locking dependency detected
[  234.196512] 5.14.0-rc7-08025-gf6d7568b37df-dirty #18 Not tainted
[  234.196514] ------------------------------------------------------
[  234.196516] zsh/1214 is trying to acquire lock:
[  234.196518] 000000013f296e30 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pci_stop_and_remove_bus_device_locked+0x26/0x48
[  234.196529]
               but task is already holding lock:
[  234.196530] 000000009b6c3a10 (kn->active#363){++++}-{0:0}, at: sysfs_remove_file_self+0x42/0x78
[  234.196538]
               which lock already depends on the new lock.

[  234.196539]
               the existing dependency chain (in reverse order) is:
[  234.196541]
               -> #1 (kn->active#363){++++}-{0:0}:
[  234.196545]        validate_chain+0x9ca/0xde8
[  234.196548]        __lock_acquire+0x64c/0xc40
[  234.196550]        lock_acquire.part.0+0xec/0x258
[  234.196552]        lock_acquire+0xb0/0x200
[  234.196554]        kernfs_drain+0x17a/0x1c8
[  234.196556]        __kernfs_remove+0x1f4/0x230
[  234.196558]        kernfs_remove_by_name_ns+0x5c/0xa8
[  234.196560]        remove_files+0x46/0x90
[  234.196563]        sysfs_remove_group+0x5a/0xb8
[  234.196565]        sysfs_remove_groups+0x46/0x68
[  234.196567]        device_remove_attrs+0x90/0xb8
[  234.196598]        device_del+0x192/0x3f8
[  234.196600]        pci_remove_bus_device+0x8a/0x128
[  234.196602]        pci_stop_and_remove_bus_device_locked+0x3a/0x48
[  234.196604]        zpci_bus_remove_device+0x68/0xa8
[  234.196607]        zpci_deconfigure_device+0x3a/0xe0
[  234.196610]        power_write_file+0x7c/0x130
[  234.196615]        kernfs_fop_write_iter+0x13e/0x1e0
[  234.196617]        new_sync_write+0x100/0x190
[  234.196620]        vfs_write+0x21e/0x2d0
[  234.196622]        ksys_write+0x6c/0xf8
[  234.196624]        __do_syscall+0x1c2/0x1f0
[  234.196628]        system_call+0x78/0xa0
[  234.196632]
               -> #0 (pci_rescan_remove_lock){+.+.}-{3:3}:
[  234.196636]        check_noncircular+0x168/0x188
[  234.196637]        check_prev_add+0xe0/0xed8
[  234.196639]        validate_chain+0x9ca/0xde8
[  234.196641]        __lock_acquire+0x64c/0xc40
[  234.196642]        lock_acquire.part.0+0xec/0x258
[  234.196644]        lock_acquire+0xb0/0x200
[  234.196646]        __mutex_lock+0xa2/0x8d8
[  234.196648]        mutex_lock_nested+0x32/0x40
[  234.196649]        pci_stop_and_remove_bus_device_locked+0x26/0x48
[  234.196652]        remove_store+0x7a/0x88
[  234.196654]        kernfs_fop_write_iter+0x13e/0x1e0
[  234.196656]        new_sync_write+0x100/0x190
[  234.196657]        vfs_write+0x21e/0x2d0
[  234.196659]        ksys_write+0x6c/0xf8
[  234.196661]        __do_syscall+0x1c2/0x1f0
[  234.196663]        system_call+0x78/0xa0
[  234.196665]
               other info that might help us debug this:

[  234.196666]  Possible unsafe locking scenario:

[  234.196668]        CPU0                    CPU1
[  234.196669]        ----                    ----
[  234.196670]   lock(kn->active#363);
[  234.196672]                                lock(pci_rescan_remove_lock);
[  234.196674]                                lock(kn->active#363);
[  234.196677]   lock(pci_rescan_remove_lock);
[  234.196679]
                *** DEADLOCK ***

[  234.196680] 3 locks held by zsh/1214:
[  234.196682]  #0: 0000000099d44498 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x6c/0xf8
[  234.196688]  #1: 00000000b214ee90 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x102/0x1e0
[  234.196693]  #2: 000000009b6c3a10 (kn->active#363){++++}-{0:0}, at: sysfs_remove_file_self+0x42/0x78
[  234.196699]
               stack backtrace:
[  234.196701] CPU: 6 PID: 1214 Comm: zsh Not tainted 5.14.0-rc7-08025-gf6d7568b37df-dirty #18
[  234.196703] Hardware name: IBM 8561 T01 703 (LPAR)
[  234.196705] Call Trace:
[  234.196706]  [<000000013ebba5d0>] show_stack+0x90/0xf8
[  234.196711]  [<000000013ebcc28e>] dump_stack_lvl+0x8e/0xc8
[  234.196714]  [<000000013dfbe908>] check_noncircular+0x168/0x188
[  234.196716]  [<000000013dfbf9c8>] check_prev_add+0xe0/0xed8
[  234.196718]  [<000000013dfc118a>] validate_chain+0x9ca/0xde8
[  234.196720]  [<000000013dfc4184>] __lock_acquire+0x64c/0xc40
[  234.196721]  [<000000013dfc2d8c>] lock_acquire.part.0+0xec/0x258
[  234.196724]  [<000000013dfc2fa8>] lock_acquire+0xb0/0x200
[  234.196725]  [<000000013ebdac7a>] __mutex_lock+0xa2/0x8d8
[  234.196727]  [<000000013ebdb4e2>] mutex_lock_nested+0x32/0x40
[  234.196729]  [<000000013e7daaf6>] pci_stop_and_remove_bus_device_locked+0x26/0x48
[  234.196732]  [<000000013e7e753a>] remove_store+0x7a/0x88
[  234.196734]  [<000000013e35b4be>] kernfs_fop_write_iter+0x13e/0x1e0
[  234.196736]  [<000000013e264098>] new_sync_write+0x100/0x190
[  234.196738]  [<000000013e266f06>] vfs_write+0x21e/0x2d0
[  234.196740]  [<000000013e267224>] ksys_write+0x6c/0xf8
[  234.196742]  [<000000013ebcf9da>] __do_syscall+0x1c2/0x1f0
[  234.196744]  [<000000013ebe2238>] system_call+0x78/0xa0


> 
> > > Maybe we should make powerpc and s390 less special?
> > 
> > On s390, as I see it, the reason for this is that all of the PCI
> > functionality is directly defined in the Architecture as special CPU
> > instructions which are kind of hypercalls but also an ISA extension.
> > 
> > These instructions range from the basic PCI memory accesses (no real
> > MMIO) to enumeration of the devices and on to reporting of hot-plug and
> > and resets/recovery events. Importantly we do not have any kind of
> > direct access to a real or virtual PCI controller and the architecture
> > has no concept of a comparable entity.
> > 
> > So in my opinion while there is some of the functionality of a PCI
> > controller in arch/s390/pci the cut off between controller
> > functionality and arch support isn't clear at all and exposing PCI
> > support as CPU instructions doesn't map well to the controller concept.
> > 
> > That said, in principle I'm open to moving some of that into
> > drivers/pci/controller/ if you think that would improve things and we
> > can find a good argument what should go where. One possible cut off
> > would be to have arch/s390/pci/ provide wrappers to the PCI
> > instructions but move all their uses to  e.g.
> > drivers/pci/controller/s390/. This would of course be a major
> > refactoring and none of that code would be useful on any other
> > architecture but it would move a lot the accesses to PCI common code
> > functionality out of the arch code.
> 
> Looks like hotplug is already in drivers/pci/hotplug/s390_pci_hpc.c.
> 
> Might be worth considering putting the other PCI core-ish code in
> drivers/pci as well, though it doesn't feel urgent to me.  Maybe a
> good internship or mentoring project.

I agree

> 
> I'm not sure this juggling around is worth it basically to just clean
> up the include path.  The downside to me is exposing
> pci_dev_is_added() to outside the PCI core, because I don't want
> to encourage any other users.

Hmm, for me the big question how to then handle the need for
pci_dev_is_added() in my upcoming recovery code, that would be a new
user outside the PCI core unless I move arch/s390/pci/pci_event.c to
drivers/pci.

That said I'm not sure I understand yet what makes pci_dev_is_added()
unsuitable for outside the PCI core. I do understand that struct
pci_dev::priv_flags is supposed to be private to PCI drivers but on the
other hand the functionality of checking whether a PCI device has been
added to a PCI bus seems rather basic and useful whenever working with
a PCI device.

> 
> > > > Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
> > > > ---
> > > > Since v1 (and bad v2):
> > > > - Fixed accidental removal of PCI_DPC_RECOVERED, PCI_DPC_RECOVERING
> > > >   defines and also move these to include/linux/pci.h
> > > > 
> > > >  arch/powerpc/platforms/powernv/pci-sriov.c |  3 ---
> > > >  arch/powerpc/platforms/pseries/setup.c     |  1 -
> > > >  arch/s390/pci/pci_sysfs.c                  |  2 --
> > > >  drivers/pci/hotplug/acpiphp_glue.c         |  1 -
> > > >  drivers/pci/pci.h                          | 15 ---------------
> > > >  include/linux/pci.h                        | 15 +++++++++++++++
> > > >  6 files changed, 15 insertions(+), 22 deletions(-)
> > > > 
> > > > diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c b/arch/powerpc/platforms/powernv/pci-sriov.c
> > > > index 28aac933a439..2e0ca5451e85 100644
> > > > --- a/arch/powerpc/platforms/powernv/pci-sriov.c
> > > > +++ b/arch/powerpc/platforms/powernv/pci-sriov.c
> > > > @@ -9,9 +9,6 @@
> > > >  
> > > >  #include "pci.h"
> > > >  
> > > > -/* for pci_dev_is_added() */
> > > > -#include "../../../../drivers/pci/pci.h"
> > > > 
> > .. snip ..
> > 


^ permalink raw reply

* Re: [PATCH v2 1/2] powerpc/bug: Remove specific powerpc BUG_ON() and WARN_ON() on PPC32
From: Segher Boessenkool @ 2021-08-26 12:49 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: linux-kernel, Paul Mackerras, linuxppc-dev
In-Reply-To: <1629946707.f6ptz0tgle.astroid@bobo.none>

Hi!

On Thu, Aug 26, 2021 at 01:26:14PM +1000, Nicholas Piggin wrote:
> Excerpts from Segher Boessenkool's message of August 19, 2021 1:06 am:
> > On Fri, Aug 13, 2021 at 04:08:13PM +1000, Nicholas Piggin wrote:
> >> This one possibly the branches end up in predictors, whereas conditional 
> >> trap is always just speculated not to hit. Branches may also have a
> >> throughput limit on execution whereas trap could be more (1 per cycle
> >> vs 4 per cycle on POWER9).
> > 
> > I thought only *taken* branches are just one per cycle?
> 
> Taken branches are fetched by the front end at one per cycle (assuming 
> they hit the BTAC), but all branches have to be executed by BR at one 
> per cycle

This is not true.  (Simple) predicted not-taken conditional branches are
just folded out, never hit the issue queues.  And they are fetched as
many together as fit in a fetch group, can complete without limits as
well.

The BTAC is a frontend thing, used for target address prediction.  It
does not limit execution.

Correctly predicted simple conditional branches just get their prediction
validated (and that is not done in the execution units).  Incorrectly
predicted branches the same, but those cause a redirect and refetch.

> > Internally *all* traps are conditional, in GCC.  It also can optimise
> > them quite well.  There must be something in the kernel macros that
> > prevents good optimisation.
> 
> I did take a look at it at one point.
> 
> One problem is that the kernel needs the address of the trap instruction 
> to create the entry for it. The other problem is that __builtin_trap 
> does not return so it can't be used for WARN. LLVM at least seems to 
> have a __builtin_debugtrap which does return.

This is <https://gcc.gnu.org/PR99299>.

> The first problem seems like the show stopper though. AFAIKS it would 
> need a special builtin support that does something to create the table
> entry, or a guarantee that we could put an inline asm right after the
> builtin as a recognized pattern and that would give us the instruction
> following the trap.

I'm not quite sure what this means.  Can't you always just put a

bla:	asm("");

in there, and use the address of "bla"?  If not, you need to say a lot
more about what you actually want to do :-/


Segher

^ permalink raw reply

* Re: [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes
From: Michael Ellerman @ 2021-08-26 13:36 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Vincent Guittot,
	Srikar Dronamraju, Peter Zijlstra, Geetika Moolchandani,
	Ingo Molnar, Laurent Dufour, linuxppc-dev, Valentin Schneider,
	kernel test robot
In-Reply-To: <20210821102535.169643-4-srikar@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> Scheduler expects unique number of node distances to be available at
> boot.

I think it needs "the number of unique node distances" ?

> It uses node distance to calculate this unique node distances.

It iterates over all pairs of nodes and records node_distance() for that
pair, and then calculates the set of unique distances.

> On POWER, node distances for offline nodes is not available. However,
> POWER already knows unique possible node distances.

I think it would be more accurate to say PAPR rather than POWER there.
It's PAPR that defines the way we determine distances and imposes that
limitation.

> Fake the offline node's distance_lookup_table entries so that all
> possible node distances are updated.

Does this work if we have a single node offline at boot?

Say we start with:

node distances:
node   0   1
  0:  10  20
  1:  20  10

And node 2 is offline at boot. We can only initialise that nodes entries
in the distance_lookup_table:

		while (i--)
			distance_lookup_table[node][i] = node;

By filling them all with 2 that causes node_distance(2, X) to return the
maximum distance for all other nodes X, because we won't break out of
the loop in __node_distance():

	for (i = 0; i < distance_ref_points_depth; i++) {
		if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
			break;

		/* Double the distance for each NUMA level */
		distance *= 2;
	}

If distance_ref_points_depth was 4 we'd return 160.

That'd leave us with 3 unique distances at boot, 10, 20, 160.

But when node 2 comes online it might introduce more than 1 new distance
value, eg. it could be that the actual distances are:

node distances:
node   0   1   2
  0:  10  20  40
  1:  20  10  80
  2:  40  80  10

ie. we now have 4 distances, 10, 20, 40, 80.

What am I missing?

> However this only needs to be done if the number of unique node
> distances that can be computed for online nodes is less than the
> number of possible unique node distances as represented by
> distance_ref_points_depth.

Looking at a few machines they all have distance_ref_points_depth = 2.

So maybe that explains it, in practice we only see 10, 20, 40.


> When the node is actually onlined, distance_lookup_table will be
> updated with actual entries.

> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
> Cc: Laurent Dufour <ldufour@linux.ibm.com>
> Cc: kernel test robot <lkp@intel.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/mm/numa.c | 70 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 70 insertions(+)
>
> Changelog:
> v1: https://lore.kernel.org/linuxppc-dev/20210701041552.112072-3-srikar@linux.vnet.ibm.com/t/#u
> [ Fixed a missing prototype warning Reported-by: kernel test robot <lkp@intel.com>]
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 3c124928a16d..0ee79a08c9e1 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -856,6 +856,75 @@ void __init dump_numa_cpu_topology(void)
>  	}
>  }
>  
> +/*
> + * Scheduler expects unique number of node distances to be available at
> + * boot. It uses node distance to calculate this unique node distances. On
> + * POWER, node distances for offline nodes is not available. However, POWER
> + * already knows unique possible node distances. Fake the offline node's
> + * distance_lookup_table entries so that all possible node distances are
> + * updated.
> + */

> +static void __init fake_update_distance_lookup_table(void)
> +{
> +	unsigned long distance_map;
> +	int i, nr_levels, nr_depth, node;

Are they distances, depths, or levels? :)

Bit more consistency in the variable names might make the code easier to
follow.

> +
> +	if (!numa_enabled)
> +		return;
> +
> +	if (!form1_affinity)
> +		return;

That doesn't exist since Aneesh's FORM2 series, so that will need a
rebase, and possibly some more rework to interact with that series.

> +	/*
> +	 * distance_ref_points_depth lists the unique numa domains
> +	 * available. However it ignore LOCAL_DISTANCE. So add +1
> +	 * to get the actual number of unique distances.
> +	 */
> +	nr_depth = distance_ref_points_depth + 1;

num_depths would be a better name IMHO.

> +
> +	WARN_ON(nr_depth > sizeof(distance_map));

Warn but then continue, and corrupt something on the stack? Seems like a
bad idea :)

I guess it's too early to use bitmap_alloc(). But can we at least return
if nr_depth is too big.

> +
> +	bitmap_zero(&distance_map, nr_depth);
> +	bitmap_set(&distance_map, 0, 1);
> +
> +	for_each_online_node(node) {
> +		int nd, distance = LOCAL_DISTANCE;
> +
> +		if (node == first_online_node)
> +			continue;
> +
> +		nd = __node_distance(node, first_online_node);
> +		for (i = 0; i < nr_depth; i++, distance *= 2) {

		for (i = 0, distance = LOCAL_DISTANCE; i < nr_depth; i++, distance *= 2) {

Could make it clearer what the for loop is doing I think.

> +			if (distance == nd) {
> +				bitmap_set(&distance_map, i, 1);
> +				break;
> +			}
> +		}
> +		nr_levels = bitmap_weight(&distance_map, nr_depth);
> +		if (nr_levels == nr_depth)
> +			return;
> +	}
> +
> +	for_each_node(node) {
> +		if (node_online(node))
> +			continue;
> +
> +		i = find_first_zero_bit(&distance_map, nr_depth);
> +		if (i >= nr_depth || i == 0) {

Neither of those can happen can they?

We checked the bitmap weight in the previous for loop, or at the bottom
of this one, and returned if we'd filled the map already.

And we set bit zero explicitly with bitmap_set().

> +			pr_warn("Levels(%d) not matching levels(%d)", nr_levels, nr_depth);
> +			return;
> +		}
> +
> +		bitmap_set(&distance_map, i, 1);
> +		while (i--)
> +			distance_lookup_table[node][i] = node;

That leaves distance_lookup_table[node][i+1] and so on uninitialised, or
initialised to zero because it's static, is that OK?

> +		nr_levels = bitmap_weight(&distance_map, nr_depth);
> +		if (nr_levels == nr_depth)
> +			return;
> +	}
> +}
> +
>  /* Initialize NODE_DATA for a node on the local memory */
>  static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
>  {
> @@ -971,6 +1040,7 @@ void __init mem_topology_setup(void)
>  		 */
>  		numa_setup_cpu(cpu);
>  	}
> +	fake_update_distance_lookup_table();
>  }
>  
>  void __init initmem_init(void)


cheers

^ permalink raw reply

* Re: [PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto
From: Segher Boessenkool @ 2021-08-26 13:47 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: llvm, linux-kernel, Nathan Chancellor, clang-built-linux,
	Paul Mackerras, linuxppc-dev
In-Reply-To: <3fad8702-278a-d9f9-1882-6958ce570bcc@csgroup.eu>

On Thu, Aug 26, 2021 at 08:37:09AM +0200, Christophe Leroy wrote:
> Le 26/08/2021 à 05:21, Michael Ellerman a écrit :
> >This patch seems to fix it. Not sure if that's just papering over it 
> >though.
> >
> >diff --git a/arch/powerpc/include/asm/bug.h 
> >b/arch/powerpc/include/asm/bug.h
> >index 1ee0f22313ee..75fcb4370d96 100644
> >--- a/arch/powerpc/include/asm/bug.h
> >+++ b/arch/powerpc/include/asm/bug.h
> >@@ -119,7 +119,7 @@ __label_warn_on:					 \
> >  								\
> >  			WARN_ENTRY(PPC_TLNEI " %4, 0",		\
> >  				   BUGFLAG_WARNING | 
> >  				   BUGFLAG_TAINT(TAINT_WARN),	\
> >-				   __label_warn_on, "r" (x));	\
> >+				   __label_warn_on, "r" (!!(x))); \
> >  			break;					\
> >  __label_warn_on:						\
> >  			__ret_warn_on = true;			\
> 
> But for a simple WARN_ON() call:
> 
> void test(unsigned long b)
> {
> 	WARN_ON(b);
> }
> 
> Without your change with GCC you get:
> 
> 00000000000012d0 <.test>:
>     12d0:	0b 03 00 00 	tdnei   r3,0
>     12d4:	4e 80 00 20 	blr
> 
> 
> With the !! change you get:
> 
> 00000000000012d0 <.test>:
>     12d0:	31 23 ff ff 	addic   r9,r3,-1
>     12d4:	7d 29 19 10 	subfe   r9,r9,r3
>     12d8:	0b 09 00 00 	tdnei   r9,0
>     12dc:	4e 80 00 20 	blr

That is because the asm (unlike the builtin) cannot be optimised by the
compiler.


Segher

^ permalink raw reply

* Re: [PATCH v2 1/2] powerpc/bug: Remove specific powerpc BUG_ON() and WARN_ON() on PPC32
From: Nicholas Piggin @ 2021-08-26 13:57 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: linux-kernel, Paul Mackerras, linuxppc-dev
In-Reply-To: <20210826124901.GY1583@gate.crashing.org>

Excerpts from Segher Boessenkool's message of August 26, 2021 10:49 pm:
> Hi!
> 
> On Thu, Aug 26, 2021 at 01:26:14PM +1000, Nicholas Piggin wrote:
>> Excerpts from Segher Boessenkool's message of August 19, 2021 1:06 am:
>> > On Fri, Aug 13, 2021 at 04:08:13PM +1000, Nicholas Piggin wrote:
>> >> This one possibly the branches end up in predictors, whereas conditional 
>> >> trap is always just speculated not to hit. Branches may also have a
>> >> throughput limit on execution whereas trap could be more (1 per cycle
>> >> vs 4 per cycle on POWER9).
>> > 
>> > I thought only *taken* branches are just one per cycle?
>> 
>> Taken branches are fetched by the front end at one per cycle (assuming 
>> they hit the BTAC), but all branches have to be executed by BR at one 
>> per cycle
> 
> This is not true.  (Simple) predicted not-taken conditional branches are
> just folded out, never hit the issue queues.  And they are fetched as
> many together as fit in a fetch group, can complete without limits as
> well.

No, they are all dispatched and issue to the BRU for execution. It's 
trivial to construct a test of a lot of not taken branches in a row
and time a loop of it to see it executes at 1 cycle per branch.

> The BTAC is a frontend thing, used for target address prediction.  It
> does not limit execution.

I didn't say it did.

> Correctly predicted simple conditional branches just get their prediction
> validated (and that is not done in the execution units).  Incorrectly
> predicted branches the same, but those cause a redirect and refetch.

How could it validate prediction without issuing? It wouldn't know when
sources are ready.

> 
>> > Internally *all* traps are conditional, in GCC.  It also can optimise
>> > them quite well.  There must be something in the kernel macros that
>> > prevents good optimisation.
>> 
>> I did take a look at it at one point.
>> 
>> One problem is that the kernel needs the address of the trap instruction 
>> to create the entry for it. The other problem is that __builtin_trap 
>> does not return so it can't be used for WARN. LLVM at least seems to 
>> have a __builtin_debugtrap which does return.
> 
> This is <https://gcc.gnu.org/PR99299>.

Aha.

>> The first problem seems like the show stopper though. AFAIKS it would 
>> need a special builtin support that does something to create the table
>> entry, or a guarantee that we could put an inline asm right after the
>> builtin as a recognized pattern and that would give us the instruction
>> following the trap.
> 
> I'm not quite sure what this means.  Can't you always just put a
> 
> bla:	asm("");
> 
> in there, and use the address of "bla"?

Not AFAIKS. Put it where?

> If not, you need to say a lot
> more about what you actually want to do :-/

We need to put the address (or relative offset) of the trap instruction
into an entry in a different section. Basically this:

   __builtin_trap();
   asm ("1:                               \n\t"
        "    .section __bug_table,\"aw\"  \n\t"
        "2:  .4byte 1b - 2b - 4           \n\t"
        "    .previous");

Where the 1: label must follow immediately after the trap instruction, 
and where the asm must be emitted even for the case of no-return traps 
(but anything following the asm could be omitted).

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto
From: Segher Boessenkool @ 2021-08-26 14:12 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: llvm, linux-kernel, Nathan Chancellor, clang-built-linux,
	Paul Mackerras, linuxppc-dev
In-Reply-To: <87h7fcc2m4.fsf@mpe.ellerman.id.au>

On Thu, Aug 26, 2021 at 01:21:39PM +1000, Michael Ellerman wrote:
> So there seems to be some misunderstanding with clang, it doesn't like
> us passing an expression to the inline asm.
> 
> AFAIK it is legal to pass expressions as inputs to inline asm, ie. it
> doesn't have to just be a variable name.

It certainly is.  That is the whole point of inline asm!  This way, all
of the C code "around" the asm can be optimised.

> This patch seems to fix it. Not sure if that's just papering over it though.

It is, and it makes less optimised code (also on GCC), as Christophe
points out.


Segher

^ permalink raw reply

* Re: [PATCH] powerpc: Make set_endian() return EINVAL when not supporting little endian
From: Michael Ellerman @ 2021-08-26 14:23 UTC (permalink / raw)
  To: Christophe Leroy, Benjamin Herrenschmidt, Paul Mackerras
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <df2e9b65-9303-070e-b803-c64e20e2620d@csgroup.eu>

Christophe Leroy <christophe.leroy@csgroup.eu> writes:
> Le 26/08/2021 à 05:41, Michael Ellerman a écrit :
>> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
>>> There is no point in modifying MSR_LE bit on CPUs not supporting
>>> little endian.
>> 
>> Isn't that an ABI break?
>
> Or an ABI fix ? I don't know.

It could break existing applications, even if the new semantics make
more sense. So that's a break IMHO :)

> My first thought was that all other 32 bits architectures were returning -EINVAL, but looking at the 
> man page of prctl, it is explicit that this is powerpc only.

It could be generic, but yeah seems we're the only arch that implements
it.

>> set_endian(PR_ENDIAN_BIG) should work on a big endian CPU, even if it
>> does nothing useful.
>
> Fair enough. But shouldn't in that case get_endian() return PR_ENDIAN_BIG instead of returning EINVAL ?

> We can do one or the other, but I think it should at least be consistant between them, shouldn't it ?

It should be consistent, but it isn't, and if we change it now we
potentially break existing userspace, which is bad.

I don't think it's widely used, and the risk of breakage would be
minimal, but it's not zero.

So I'm not sure it's worth changing it just for the sake of consistency.

cheers

^ permalink raw reply

* Re: [PATCH linux-next] powerpc/tm: remove duplicate include in tm-poison.c
From: Michael Ellerman @ 2021-08-26 14:26 UTC (permalink / raw)
  To: Christophe Leroy, Shuah Khan, cgel.zte
  Cc: yong.yiran, shuah, Zeal Robot, linux-kernel, paulus,
	linux-kselftest, linuxppc-dev
In-Reply-To: <4bc97c33-7fc0-ff9d-041b-e773f682c5d2@csgroup.eu>

Christophe Leroy <christophe.leroy@csgroup.eu> writes:
> Le 24/08/2021 à 16:40, Shuah Khan a écrit :
>> On 8/5/21 12:52 AM, cgel.zte@gmail.com wrote:
>>> From: yong yiran <yong.yiran@zte.com.cn>
>>>
>>> 'inttypes.h' included in 'tm-poison.c' is duplicated.
>>> Remove all but the first include of inttypes.h from tm-poison.c.
>>>
>>> Reported-by: Zeal Robot <zealci@zte.com.cn>
>>> Signed-off-by: yong yiran <yong.yiran@zte.com.cn>
>>> ---
>>>   tools/testing/selftests/powerpc/tm/tm-poison.c | 1 -
>>>   1 file changed, 1 deletion(-)
>>>
>>> diff --git a/tools/testing/selftests/powerpc/tm/tm-poison.c 
>>> b/tools/testing/selftests/powerpc/tm/tm-poison.c
>>> index 29e5f26af7b9..27c083a03d1f 100644
>>> --- a/tools/testing/selftests/powerpc/tm/tm-poison.c
>>> +++ b/tools/testing/selftests/powerpc/tm/tm-poison.c
>>> @@ -20,7 +20,6 @@
>>>   #include <sched.h>
>>>   #include <sys/types.h>
>>>   #include <signal.h>
>>> -#include <inttypes.h>
>>>   #include "tm.h"
>>>
>> 
>> We can't accept this patch. The from and Signed-off-by don't match.
>> 
>
> As far as I can see they match. You have:
>
> From: yong yiran <yong.yiran@zte.com.cn>
> Signed-off-by: yong yiran <yong.yiran@zte.com.cn>

Regardless I already have a patch queued to fix this, from a different
CI bot.

cheers

^ permalink raw reply

* Re: [PATCH v2 RESEND] powerpc/audit: Convert powerpc to AUDIT_ARCH_COMPAT_GENERIC
From: Michael Ellerman @ 2021-08-26 14:37 UTC (permalink / raw)
  To: Paul Moore, Christophe Leroy
  Cc: linux-kernel, Eric Paris, linux-audit, Paul Mackerras,
	linuxppc-dev
In-Reply-To: <CAHC9VhSG8tPAkAAz5Z77HDMKXLAiaEOanxR+oY5c1E_Xoiso9Q@mail.gmail.com>

Paul Moore <paul@paul-moore.com> writes:
> On Tue, Aug 24, 2021 at 1:11 PM Christophe Leroy
> <christophe.leroy@csgroup.eu> wrote:
>> Le 24/08/2021 à 16:47, Paul Moore a écrit :
>> > On Tue, Aug 24, 2021 at 9:36 AM Christophe Leroy
>> > <christophe.leroy@csgroup.eu> wrote:
>> >>
>> >> Commit e65e1fc2d24b ("[PATCH] syscall class hookup for all normal
>> >> targets") added generic support for AUDIT but that didn't include
>> >> support for bi-arch like powerpc.
>> >>
>> >> Commit 4b58841149dc ("audit: Add generic compat syscall support")
>> >> added generic support for bi-arch.
>> >>
>> >> Convert powerpc to that bi-arch generic audit support.
>> >>
>> >> Cc: Paul Moore <paul@paul-moore.com>
>> >> Cc: Eric Paris <eparis@redhat.com>
>> >> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>> >> ---
>> >> Resending v2 with Audit people in Cc
>> >>
>> >> v2:
>> >> - Missing 'git add' for arch/powerpc/include/asm/unistd32.h
>> >> - Finalised commit description
>> >> ---
>> >>   arch/powerpc/Kconfig                |  5 +-
>> >>   arch/powerpc/include/asm/unistd32.h |  7 +++
>> >>   arch/powerpc/kernel/Makefile        |  3 --
>> >>   arch/powerpc/kernel/audit.c         | 84 -----------------------------
>> >>   arch/powerpc/kernel/compat_audit.c  | 44 ---------------
>> >>   5 files changed, 8 insertions(+), 135 deletions(-)
>> >>   create mode 100644 arch/powerpc/include/asm/unistd32.h
>> >>   delete mode 100644 arch/powerpc/kernel/audit.c
>> >>   delete mode 100644 arch/powerpc/kernel/compat_audit.c
>> >
>> > Can you explain, in detail please, the testing you have done to verify
>> > this patch?
>> >
>>
>> I built ppc64_defconfig and checked that the generated code is functionnaly equivalent.
>>
>> ppc32_classify_syscall() is exactly the same as audit_classify_compat_syscall() except that the
>> later takes the syscall as second argument (ie in r4) whereas the former takes it as first argument
>> (ie in r3).
>>
>> audit_classify_arch() and powerpc audit_classify_syscall() are slightly different between the
>> powerpc version and the generic version because the powerpc version checks whether it is
>> AUDIT_ARCH_PPC or not (ie value 20), while the generic one checks whether it has bit
>> __AUDIT_ARCH_64BIT set or not (__AUDIT_ARCH_64BIT is the sign bit of a word), but taking into
>> account that the abi is either AUDIT_ARCH_PPC, AUDIT_ARCH_PPC64 or AUDIT_ARCH_PPC64LE, the result is
>> the same.
>>
>> If you are asking I guess you saw something wrong ?
>
> I was asking because I didn't see any mention of testing, and when you
> are enabling something significant like this it is nice to see that it
> has been verified to work :)
>
> While binary dumps and comparisons are nice, it is always good to see
> verification from a test suite.  I don't have access to the necessary
> hardware to test this, but could you verify that the audit-testsuite
> passes on your test system with your patches applied?
>
>  * https://github.com/linux-audit/audit-testsuite

I tested on ppc64le. Both before and after the patch I get the result
below.

So I guess the patch is OK, but maybe we have some existing issue.

I had a bit of a look at the test code, but my perl is limited. I think
it was running the command below, and it returned "<no matches>", but
not really sure what that means.

  $ sudo ausearch -i -m SYSCALL -p 216440 -ui 0 -gi 0 -ul 0 -su unconfined _u:unconfined_r:unconfined_t:s0-s0:c0.c1023 -ts recent
  <no matches>

cheers



Running as   user    root
        with context unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
        on   system  Fedora

backlog_wait_time_actual_reset/test .. ok
exec_execve/test ..................... ok
exec_name/test ....................... ok
file_create/test ..................... ok
file_delete/test ..................... ok
file_rename/test ..................... ok
filter_exclude/test .................. 1/21
# Test 20 got: "256" (filter_exclude/test at line 167)
#    Expected: "0"
#  filter_exclude/test line 167 is: ok( $result, 0 );
# Test 21 got: "0" (filter_exclude/test at line 179)
#    Expected: "1"
#  filter_exclude/test line 179 is: ok( $found_msg, 1 );
filter_exclude/test .................. Failed 2/21 subtests
filter_saddr_fam/test ................ ok
filter_sessionid/test ................ ok
login_tty/test ....................... ok
lost_reset/test ...................... ok
netfilter_pkt/test ................... ok
syscalls_file/test ................... ok
syscall_module/test .................. ok
time_change/test ..................... ok
user_msg/test ........................ ok
fanotify/test ........................ ok
bpf/test ............................. ok

Test Summary Report
-------------------
filter_exclude/test                (Wstat: 0 Tests: 21 Failed: 2)
  Failed tests:  20-21
Files=18, Tests=202, 45 wallclock secs ( 0.18 usr  0.03 sys + 20.15 cusr  0.92 csys = 21.28 CPU)
Result: FAIL
Failed 1/18 test programs. 2/202 subtests failed.

^ permalink raw reply

* Re: [PATCH v2 1/2] powerpc/bug: Remove specific powerpc BUG_ON() and WARN_ON() on PPC32
From: Segher Boessenkool @ 2021-08-26 14:37 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: linux-kernel, Paul Mackerras, linuxppc-dev
In-Reply-To: <1629983260.5jkx2jf0y8.astroid@bobo.none>

On Thu, Aug 26, 2021 at 11:57:52PM +1000, Nicholas Piggin wrote:
> Excerpts from Segher Boessenkool's message of August 26, 2021 10:49 pm:
> > On Thu, Aug 26, 2021 at 01:26:14PM +1000, Nicholas Piggin wrote:
> >> Excerpts from Segher Boessenkool's message of August 19, 2021 1:06 am:
> >> > On Fri, Aug 13, 2021 at 04:08:13PM +1000, Nicholas Piggin wrote:
> >> >> This one possibly the branches end up in predictors, whereas conditional 
> >> >> trap is always just speculated not to hit. Branches may also have a
> >> >> throughput limit on execution whereas trap could be more (1 per cycle
> >> >> vs 4 per cycle on POWER9).
> >> > 
> >> > I thought only *taken* branches are just one per cycle?
> >> 
> >> Taken branches are fetched by the front end at one per cycle (assuming 
> >> they hit the BTAC), but all branches have to be executed by BR at one 
> >> per cycle
> > 
> > This is not true.  (Simple) predicted not-taken conditional branches are
> > just folded out, never hit the issue queues.  And they are fetched as
> > many together as fit in a fetch group, can complete without limits as
> > well.
> 
> No, they are all dispatched and issue to the BRU for execution. It's 
> trivial to construct a test of a lot of not taken branches in a row
> and time a loop of it to see it executes at 1 cycle per branch.

(s/dispatched/issued/)

Huh, this was true on p8 already.  Sorry for my confusion.  In my
defence, this doesn't matter for performance on "real code".

> > Correctly predicted simple conditional branches just get their prediction
> > validated (and that is not done in the execution units).  Incorrectly
> > predicted branches the same, but those cause a redirect and refetch.
> 
> How could it validate prediction without issuing? It wouldn't know when
> sources are ready.

In the backend.  But that is just how it worked on older cores :-/

> >> The first problem seems like the show stopper though. AFAIKS it would 
> >> need a special builtin support that does something to create the table
> >> entry, or a guarantee that we could put an inline asm right after the
> >> builtin as a recognized pattern and that would give us the instruction
> >> following the trap.
> > 
> > I'm not quite sure what this means.  Can't you always just put a
> > 
> > bla:	asm("");
> > 
> > in there, and use the address of "bla"?
> 
> Not AFAIKS. Put it where?

After wherever you want to know the address after.  You will have to
make sure they stay together somehow.

It is much easier to get the address of something, not the address after
it.  If you know it is just one insn anyway, that isn't hard to handle
either (even if prefixed insns exist).

> > If not, you need to say a lot
> > more about what you actually want to do :-/
> 
> We need to put the address (or relative offset) of the trap instruction
> into an entry in a different section. Basically this:
> 
>    __builtin_trap();
>    asm ("1:                               \n\t"
>         "    .section __bug_table,\"aw\"  \n\t"
>         "2:  .4byte 1b - 2b - 4           \n\t"
>         "    .previous");
> 
> Where the 1: label must follow immediately after the trap instruction, 
> and where the asm must be emitted even for the case of no-return traps 
> (but anything following the asm could be omitted).

The address *after* the trap insn?  That is much much harder than the
address *of* the trap insn.


Segher

^ permalink raw reply

* [next-20210820][ppc][nvme/raid] pci unbind WARNS at fs/kernfs/dir.c:1524 kernfs_remove_by_name_ns+
From: Abdul Haleem @ 2021-08-26 14:41 UTC (permalink / raw)
  To: linux-next; +Cc: Brian King, linuxppc-dev, linux-kernel, linux-scsi

Greeting's

Today's linux-next kernel WARN's while unbind of pci ssd flash device on 
my powerpc box

# lspci -nn
001d:80:00.0 Non-Volatile memory controller [0108]: Seagate Technology 
PLC Nytro Flash Storage [1bb1:0100]
001e:90:00.0 Non-Volatile memory controller [0108]: Seagate Technology 
PLC Nytro Flash Storage [1bb1:0100]

$ echo -n 001d:80:00.0 > /sys/bus/pci/drivers/nvme/unbind

md: md127: raid0 array has a missing/failed member
Buffer I/O error on dev md127, logical block 1498959, async page read
Buffer I/O error on dev md127, logical block 1498959, async page read
md127: detected capacity change from 191866880 to 0
md: md127 stopped.
------------[ cut here ]------------
kernfs: can not remove 'md127', no directory
WARNING: CPU: 21 PID: 11006 at fs/kernfs/dir.c:1524 
kernfs_remove_by_name_ns+0xc0/0x110

Modules linked in: dm_mod rpadlpar_io rpaphp kvm_pr kvm nf_tables 
libcrc32c nfnetlink tcp_diag udp_diag inet_diag unix_diag af_packet_diag 
netlink_diag bridge stp llc rfkill sg pseries_rng raid0 xts vmx_crypto 
uio_pdrv_genirq gf128mul uio nfsd auth_rpcgss nfs_acl lockd grace sunrpc 
binfmt_misc sch_fq_codel ip_tables ext4 mbcache jbd2 sd_mod ibmvscsi 
ibmvnic ibmveth scsi_transport_srp nvme nvme_core t10_piCPU: 21 PID: 
11006 Comm: mdadm Not tainted 5.14.0-rc6-next-20210820-autotest #1

NIP:  c00000000056dda0 LR: c00000000056dd9c CTR: 00000000007088ec
REGS: c000000004593680 TRAP: 0700   Not tainted 
(5.14.0-rc6-next-20210820-autotest)
MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 48044224  
XER: 0000000d
CFAR: c000000000145f00 IRQMASK: 0
GPR00: c00000000056dd9c c000000004593920 c0000000019b3200 000000000000002c
GPR04: 00000000ffff7fff c0000000045935f0 0000000000000027 c00000077cd07e08
GPR08: 0000000000000023 0000000000000001 0000000000000027 c0000000018694b0
GPR12: 0000000000004000 c00000001ec40680 00007fff84875040 ffffffffffffffff
GPR16: 0000000000000000 0000000000000001 0000000000000000 000001001e990280
GPR20: 0000000000000000 0000000000000000 0000000000000066 0000000000030d40
GPR24: c00000002bc591e8 0000000000000000 c00000002bc591c8 c000000025765c00
GPR28: c000000025765c70 c000000025765c00 0000000000000000 c000000025790640
NIP [c00000000056dda0] kernfs_remove_by_name_ns+0xc0/0x110
LR [c00000000056dd9c] kernfs_remove_by_name_ns+0xbc/0x110
Call Trace:
[c000000004593920] [c00000000056dd9c] 
kernfs_remove_by_name_ns+0xbc/0x110 (unreliable)
[c0000000045939b0] [c000000000572368] sysfs_remove_link+0x28/0x70
[c0000000045939d0] [c000000000678bd4] bd_unlink_disk_holder+0xc4/0x130
[c000000004593a10] [c00000000098e4e0] unbind_rdev_from_array+0x40/0x1b0
[c000000004593ac0] [c0000000009921e0] do_md_stop+0x410/0x5a0
[c000000004593bc0] [c0000000009966ac] md_ioctl+0xc6c/0x1c60
[c000000004593cd0] [c000000000654ebc] blkdev_ioctl+0x2fc/0x740
[c000000004593d40] [c0000000004ce284] block_ioctl+0x74/0x90
[c000000004593d60] [c00000000047b6e8] sys_ioctl+0xf8/0x150
[c000000004593db0] [c00000000002ff28] system_call_exception+0x158/0x2c0
[c000000004593e10] [c00000000000c764] system_call_common+0xf4/0x258
--- interrupt: c00 at 0x7fff84790290
NIP:  00007fff84790290 LR: 000000013d5c2ef0 CTR: 0000000000000000
REGS: c000000004593e80 TRAP: 0c00   Not tainted 
(5.14.0-rc6-next-20210820-autotest)
MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 
28044288  XER: 00000000
IRQMASK: 0
GPR00: 0000000000000036 00007fffd0148020 00007fff84877300 0000000000000003
GPR04: 0000000020000932 0000000000000000 0000000000030d40 00007fffd0148028
GPR08: 0000000000000003 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fff8494a780 00007fff84875040 ffffffffffffffff
GPR16: 0000000000000000 0000000000000001 0000000000000000 000001001e990280
GPR20: 0000000000000000 0000000000000000 0000000000000066 0000000000030d40
GPR24: 0000000020000932 000001001e992ac0 ffffffffffffffff 000001001e990a40
GPR28: ffffffffffffffff 0000000000000018 00007fffd0148100 0000000000000003
NIP [00007fff84790290] 0x7fff84790290
LR [000000013d5c2ef0] 0x13d5c2ef0
--- interrupt: c00
Instruction dump:
ebe10088 38210090 e8010010 ebc1fff0 7c0803a6 4e800020 60000000 60000000
3c62ff5c 3863d168 4bbd8109 60000000 <0fe00000> fba10078 fbe10088 60000000
---[ end trace 634fa04d6dac7dfd ]---

-- 
Regard's

Abdul Haleem
IBM Linux Technology Center


^ permalink raw reply

* Re: [PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto
From: Michael Ellerman @ 2021-08-26 14:45 UTC (permalink / raw)
  To: Christophe Leroy, Nathan Chancellor
  Cc: llvm, linux-kernel, clang-built-linux, Paul Mackerras,
	linuxppc-dev
In-Reply-To: <3fad8702-278a-d9f9-1882-6958ce570bcc@csgroup.eu>

Christophe Leroy <christophe.leroy@csgroup.eu> writes:
> Le 26/08/2021 à 05:21, Michael Ellerman a écrit :
>> Nathan Chancellor <nathan@kernel.org> writes:
>>> On Tue, Apr 13, 2021 at 04:38:10PM +0000, Christophe Leroy wrote:
>>>> Using asm goto in __WARN_FLAGS() and WARN_ON() allows more
>>>> flexibility to GCC.
>> ...
>>>
>>> This patch as commit 1e688dd2a3d6 ("powerpc/bug: Provide better
>>> flexibility to WARN_ON/__WARN_FLAGS() with asm goto") cause a WARN_ON in
>>> klist_add_tail to trigger over and over on boot when compiling with
>>> clang:
>
> ...
>
>> 
>> This patch seems to fix it. Not sure if that's just papering over it though.
>> 
>> diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
>> index 1ee0f22313ee..75fcb4370d96 100644
>> --- a/arch/powerpc/include/asm/bug.h
>> +++ b/arch/powerpc/include/asm/bug.h
>> @@ -119,7 +119,7 @@ __label_warn_on:						\
>>   								\
>>   			WARN_ENTRY(PPC_TLNEI " %4, 0",		\
>>   				   BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN),	\
>> -				   __label_warn_on, "r" (x));	\
>> +				   __label_warn_on, "r" (!!(x))); \
>>   			break;					\
>>   __label_warn_on:						\
>>   			__ret_warn_on = true;			\
>
> But for a simple WARN_ON() call:
>
> void test(unsigned long b)
> {
> 	WARN_ON(b);
> }
>
> Without your change with GCC you get:
>
> 00000000000012d0 <.test>:
>      12d0:	0b 03 00 00 	tdnei   r3,0
>      12d4:	4e 80 00 20 	blr
>
>
> With the !! change you get:
>
> 00000000000012d0 <.test>:
>      12d0:	31 23 ff ff 	addic   r9,r3,-1
>      12d4:	7d 29 19 10 	subfe   r9,r9,r3
>      12d8:	0b 09 00 00 	tdnei   r9,0
>      12dc:	4e 80 00 20 	blr

Yeah that's a pity.

We could do something like below, which is ugly, but would be better
than having to revert the whole thing.

Although this doesn't fix the strange warning in drivers/net/ethernet/sfc.

So possibly we need a CLANG ifdef around the whole thing, and use the
old style warn for clang.

cheers


diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
index 1ee0f22313ee..d978d9004d0d 100644
--- a/arch/powerpc/include/asm/bug.h
+++ b/arch/powerpc/include/asm/bug.h
@@ -106,6 +106,12 @@ __label_warn_on:						\
 	}							\
 } while (0)
 
+#ifdef CONFIG_CC_IS_CLANG
+#define __clang_warn_hack(x)	(!!(x))
+#else
+#define __clang_warn_hack(x)	(x)
+#endif
+
 #define WARN_ON(x) ({						\
 	bool __ret_warn_on = false;				\
 	do {							\
@@ -119,7 +125,8 @@ __label_warn_on:						\
 								\
 			WARN_ENTRY(PPC_TLNEI " %4, 0",		\
 				   BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN),	\
-				   __label_warn_on, "r" (x));	\
+				   __label_warn_on,		\
+				   "r" __clang_warn_hack(x));	\
 			break;					\
 __label_warn_on:						\
 			__ret_warn_on = true;			\



^ permalink raw reply related

* Re: [PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto
From: Christophe Leroy @ 2021-08-26 14:53 UTC (permalink / raw)
  To: Michael Ellerman, Nathan Chancellor
  Cc: llvm, linux-kernel, clang-built-linux, Paul Mackerras,
	linuxppc-dev
In-Reply-To: <87sfyw9sel.fsf@mpe.ellerman.id.au>



Le 26/08/2021 à 16:45, Michael Ellerman a écrit :
> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
>> Le 26/08/2021 à 05:21, Michael Ellerman a écrit :
>>> Nathan Chancellor <nathan@kernel.org> writes:
>>>> On Tue, Apr 13, 2021 at 04:38:10PM +0000, Christophe Leroy wrote:
>>>>> Using asm goto in __WARN_FLAGS() and WARN_ON() allows more
>>>>> flexibility to GCC.
>>> ...
>>>>
>>>> This patch as commit 1e688dd2a3d6 ("powerpc/bug: Provide better
>>>> flexibility to WARN_ON/__WARN_FLAGS() with asm goto") cause a WARN_ON in
>>>> klist_add_tail to trigger over and over on boot when compiling with
>>>> clang:
>>
>> ...
>>
>>>
>>> This patch seems to fix it. Not sure if that's just papering over it though.
>>>
>>> diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
>>> index 1ee0f22313ee..75fcb4370d96 100644
>>> --- a/arch/powerpc/include/asm/bug.h
>>> +++ b/arch/powerpc/include/asm/bug.h
>>> @@ -119,7 +119,7 @@ __label_warn_on:						\
>>>    								\
>>>    			WARN_ENTRY(PPC_TLNEI " %4, 0",		\
>>>    				   BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN),	\
>>> -				   __label_warn_on, "r" (x));	\
>>> +				   __label_warn_on, "r" (!!(x))); \
>>>    			break;					\
>>>    __label_warn_on:						\
>>>    			__ret_warn_on = true;			\
>>
>> But for a simple WARN_ON() call:
>>
>> void test(unsigned long b)
>> {
>> 	WARN_ON(b);
>> }
>>
>> Without your change with GCC you get:
>>
>> 00000000000012d0 <.test>:
>>       12d0:	0b 03 00 00 	tdnei   r3,0
>>       12d4:	4e 80 00 20 	blr
>>
>>
>> With the !! change you get:
>>
>> 00000000000012d0 <.test>:
>>       12d0:	31 23 ff ff 	addic   r9,r3,-1
>>       12d4:	7d 29 19 10 	subfe   r9,r9,r3
>>       12d8:	0b 09 00 00 	tdnei   r9,0
>>       12dc:	4e 80 00 20 	blr
> 
> Yeah that's a pity.
> 
> We could do something like below, which is ugly, but would be better
> than having to revert the whole thing.

Yes looks nice, we already had that kind of stuff in the past, even more ugly.

> 
> Although this doesn't fix the strange warning in drivers/net/ethernet/sfc.

What's the warning ?


> 
> So possibly we need a CLANG ifdef around the whole thing, and use the
> old style warn for clang.
> 

Christophe

^ permalink raw reply

* Re: [PATCH v2 1/2] powerpc/bug: Remove specific powerpc BUG_ON() and WARN_ON() on PPC32
From: Nicholas Piggin @ 2021-08-26 15:04 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: linux-kernel, Paul Mackerras, linuxppc-dev
In-Reply-To: <20210826143708.GC1583@gate.crashing.org>

Excerpts from Segher Boessenkool's message of August 27, 2021 12:37 am:
> On Thu, Aug 26, 2021 at 11:57:52PM +1000, Nicholas Piggin wrote:
>> Excerpts from Segher Boessenkool's message of August 26, 2021 10:49 pm:
>> > On Thu, Aug 26, 2021 at 01:26:14PM +1000, Nicholas Piggin wrote:
>> >> Excerpts from Segher Boessenkool's message of August 19, 2021 1:06 am:
>> >> > On Fri, Aug 13, 2021 at 04:08:13PM +1000, Nicholas Piggin wrote:
>> >> >> This one possibly the branches end up in predictors, whereas conditional 
>> >> >> trap is always just speculated not to hit. Branches may also have a
>> >> >> throughput limit on execution whereas trap could be more (1 per cycle
>> >> >> vs 4 per cycle on POWER9).
>> >> > 
>> >> > I thought only *taken* branches are just one per cycle?
>> >> 
>> >> Taken branches are fetched by the front end at one per cycle (assuming 
>> >> they hit the BTAC), but all branches have to be executed by BR at one 
>> >> per cycle
>> > 
>> > This is not true.  (Simple) predicted not-taken conditional branches are
>> > just folded out, never hit the issue queues.  And they are fetched as
>> > many together as fit in a fetch group, can complete without limits as
>> > well.
>> 
>> No, they are all dispatched and issue to the BRU for execution. It's 
>> trivial to construct a test of a lot of not taken branches in a row
>> and time a loop of it to see it executes at 1 cycle per branch.
> 
> (s/dispatched/issued/)

?

> 
> Huh, this was true on p8 already.  Sorry for my confusion.  In my
> defence, this doesn't matter for performance on "real code".
> 
>> > Correctly predicted simple conditional branches just get their prediction
>> > validated (and that is not done in the execution units).  Incorrectly
>> > predicted branches the same, but those cause a redirect and refetch.
>> 
>> How could it validate prediction without issuing? It wouldn't know when
>> sources are ready.
> 
> In the backend.  But that is just how it worked on older cores :-/

Okay. I don't know about older cores than POWER9. Backend would normally 
include execution though. Only other place you could do it if you don't
issue/exec would be after it goes back in order, like completion. But
that would be horrible for mispredict penalty.

>> >> The first problem seems like the show stopper though. AFAIKS it would 
>> >> need a special builtin support that does something to create the table
>> >> entry, or a guarantee that we could put an inline asm right after the
>> >> builtin as a recognized pattern and that would give us the instruction
>> >> following the trap.
>> > 
>> > I'm not quite sure what this means.  Can't you always just put a
>> > 
>> > bla:	asm("");
>> > 
>> > in there, and use the address of "bla"?
>> 
>> Not AFAIKS. Put it where?
> 
> After wherever you want to know the address after.  You will have to
> make sure they stay together somehow.

I still don't follow.

> It is much easier to get the address of something, not the address after
> it.  If you know it is just one insn anyway, that isn't hard to handle
> either (even if prefixed insns exist).
> 
>> > If not, you need to say a lot
>> > more about what you actually want to do :-/
>> 
>> We need to put the address (or relative offset) of the trap instruction
>> into an entry in a different section. Basically this:
>> 
>>    __builtin_trap();
>>    asm ("1:                               \n\t"
>>         "    .section __bug_table,\"aw\"  \n\t"
>>         "2:  .4byte 1b - 2b - 4           \n\t"
>>         "    .previous");
>> 
>> Where the 1: label must follow immediately after the trap instruction, 
>> and where the asm must be emitted even for the case of no-return traps 
>> (but anything following the asm could be omitted).
> 
> The address *after* the trap insn?  That is much much harder than the
> address *of* the trap insn.

No the address of the trap instruction, hence the -4. I have the label
after because that is the semantics I have from inline asm.

If you could give a built in that put a label at the address of the trap 
instruction that could be used later by inline asm then that could work
too:

    __builtin_labeled_trap("1:");
    asm ("    .section __bug_table,\"aw\"  \n\t"
         "2:  .4byte 1b - 2b               \n\t"
         "    .previous");

^ permalink raw reply

* Re: [GIT PULL] retire legacy WR sbc8548 and sbc8641 platforms
From: Michael Ellerman @ 2021-08-26 15:05 UTC (permalink / raw)
  To: Paul Gortmaker; +Cc: Scott Wood, Paul Mackerras, linuxppc-dev, linux-kernel
In-Reply-To: <20210824174209.GB160508@windriver.com>

Paul Gortmaker <paul.gortmaker@windriver.com> writes:
> This is unchanged from the original wr_sbc-delete branch sent in January,
> other than to add the Acks from Scott in July, and update the baseline.
>
> Built with ppc64 defconfig and mpc85xx_cds_defconfig and mpc86xx_defconfig
> just to make sure I didn't fat finger anything in the baseline update.

Thanks for following up on this.

I ended up cherry-picking the patches into my branch. I like to keep my
next based on rc2, and merging this would have pulled in everything up
to rc7 into my branch.

I don't think you were planning to merge this branch anywhere else, so
it shouldn't make any difference, but let me know if it's a problem.

It should appear here soon:

  https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/log/?h=next


cheers
 

> Original v1 text follows below, from:
>
> https://lore.kernel.org/lkml/20210111082823.99562-1-paul.gortmaker@windriver.com
>
> It would be nice to get this in and off our collective to-do list.
>
> Thanks,
> Paul.
>
>   ---
>
> In v2.6.27 (2008, 917f0af9e5a9) the sbc8260 support was implicitly
> retired by not being carried forward through the ppc --> powerpc
> device tree transition.
>
> Then, in v3.6 (2012, b048b4e17cbb) we retired the support for the
> sbc8560 boards.
>
> Next, in v4.18 (2017, 3bc6cf5a86e5) we retired the support for the
> 2006 vintage sbc834x boards.
>
> The sbc8548 and sbc8641d boards were maybe 1-2 years newer than the
> sbc834x boards, but it is also 3+ years later, so it makes sense to
> now retire them as well - which is what is done here.
>
> These two remaining WR boards were based on the Freescale MPC8548-CDS
> and the MPC8641D-HPCN reference board implementations.  Having had the
> chance to use these and many other Fsl ref boards, I know this:  The
> Freescale reference boards were typically produced in limited quantity
> and primarily available to BSP developers and hardware designers, and
> not likely to have found a 2nd life with hobbyists and/or collectors.
>
> It was good to have that BSP code subjected to mainline review and
> hence also widely available back in the day. But given the above, we
> should probably also be giving serious consideration to retiring
> additional similar age/type reference board platforms as well.
>
> I've always felt it is important for us to be proactive in retiring
> old code, since it has a genuine non-zero carrying cost, as described
> in the 930d52c012b8 merge log.  But for the here and now, we just
> clean up the remaining BSP code that I had added for SBC platforms.
>
> --- 
>
> The following changes since commit e22ce8eb631bdc47a4a4ea7ecf4e4ba499db4f93:
>
>   Linux 5.14-rc7 (2021-08-22 14:24:56 -0700)
>
> are available in the Git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux.git wr_sbc-delete-v2
>
> for you to fetch changes up to d44e2dc12ea2112e74cdd25090eeda2727ed09cc:
>
>   MAINTAINERS: update for Paul Gortmaker (2021-08-24 08:19:01 -0400)
>
> ----------------------------------------------------------------
> Paul Gortmaker (3):
>       powerpc: retire sbc8548 board support
>       powerpc: retire sbc8641d board support
>       MAINTAINERS: update for Paul Gortmaker
>
>  MAINTAINERS                                 |   1 -
>  arch/powerpc/boot/Makefile                  |   1 -
>  arch/powerpc/boot/dts/fsl/sbc8641d.dts      | 176 -----------------
>  arch/powerpc/boot/dts/sbc8548-altflash.dts  | 111 -----------
>  arch/powerpc/boot/dts/sbc8548-post.dtsi     | 289 ----------------------------
>  arch/powerpc/boot/dts/sbc8548-pre.dtsi      |  48 -----
>  arch/powerpc/boot/dts/sbc8548.dts           | 106 ----------
>  arch/powerpc/boot/wrapper                   |   2 +-
>  arch/powerpc/configs/85xx/sbc8548_defconfig |  50 -----
>  arch/powerpc/configs/mpc85xx_base.config    |   1 -
>  arch/powerpc/configs/mpc86xx_base.config    |   1 -
>  arch/powerpc/configs/ppc6xx_defconfig       |   1 -
>  arch/powerpc/platforms/85xx/Kconfig         |   6 -
>  arch/powerpc/platforms/85xx/Makefile        |   1 -
>  arch/powerpc/platforms/85xx/sbc8548.c       | 134 -------------
>  arch/powerpc/platforms/86xx/Kconfig         |   8 +-
>  arch/powerpc/platforms/86xx/Makefile        |   1 -
>  arch/powerpc/platforms/86xx/sbc8641d.c      |  87 ---------
>  18 files changed, 2 insertions(+), 1022 deletions(-)
>  delete mode 100644 arch/powerpc/boot/dts/fsl/sbc8641d.dts
>  delete mode 100644 arch/powerpc/boot/dts/sbc8548-altflash.dts
>  delete mode 100644 arch/powerpc/boot/dts/sbc8548-post.dtsi
>  delete mode 100644 arch/powerpc/boot/dts/sbc8548-pre.dtsi
>  delete mode 100644 arch/powerpc/boot/dts/sbc8548.dts
>  delete mode 100644 arch/powerpc/configs/85xx/sbc8548_defconfig
>  delete mode 100644 arch/powerpc/platforms/85xx/sbc8548.c
>  delete mode 100644 arch/powerpc/platforms/86xx/sbc8641d.c

^ permalink raw reply

* Re: [PATCH v2 3/4] powerpc/64s/interrupt: Don't enable MSR[EE] in irq handlers unless perf is in use
From: kernel test robot @ 2021-08-26 15:04 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: kbuild-all, Nicholas Piggin
In-Reply-To: <20210825123714.706201-4-npiggin@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 6119 bytes --]

Hi Nicholas,

I love your patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on v5.14-rc7 next-20210826]
[cannot apply to scottwood/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Nicholas-Piggin/powerpc-64s-interrupt-speedups/20210825-204209
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-allnoconfig (attached as .config)
compiler: powerpc-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/a0eb195f49a01ed45b3f97815470f9c8acaa4991
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Nicholas-Piggin/powerpc-64s-interrupt-speedups/20210825-204209
        git checkout a0eb195f49a01ed45b3f97815470f9c8acaa4991
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross ARCH=powerpc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   arch/powerpc/kernel/irq.c: In function '__do_irq':
>> arch/powerpc/kernel/irq.c:742:13: error: implicit declaration of function 'should_hard_irq_enable'; did you mean 'do_hard_irq_enable'? [-Werror=implicit-function-declaration]
     742 |         if (should_hard_irq_enable())
         |             ^~~~~~~~~~~~~~~~~~~~~~
         |             do_hard_irq_enable
   In file included from <command-line>:
   In function 'do_hard_irq_enable',
       inlined from '__do_irq' at arch/powerpc/kernel/irq.c:743:3:
>> include/linux/compiler_types.h:328:45: error: call to '__compiletime_assert_34' declared with attribute error: BUILD_BUG failed
     328 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:309:25: note: in definition of macro '__compiletime_assert'
     309 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:328:9: note: in expansion of macro '_compiletime_assert'
     328 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
         |                     ^~~~~~~~~~~~~~~~
   arch/powerpc/include/asm/hw_irq.h:447:9: note: in expansion of macro 'BUILD_BUG'
     447 |         BUILD_BUG();
         |         ^~~~~~~~~
   cc1: all warnings being treated as errors
--
   arch/powerpc/kernel/time.c: In function '____timer_interrupt':
>> arch/powerpc/kernel/time.c:570:13: error: implicit declaration of function 'should_hard_irq_enable'; did you mean 'do_hard_irq_enable'? [-Werror=implicit-function-declaration]
     570 |         if (should_hard_irq_enable()) {
         |             ^~~~~~~~~~~~~~~~~~~~~~
         |             do_hard_irq_enable
   In file included from <command-line>:
   In function 'do_hard_irq_enable',
       inlined from '____timer_interrupt' at arch/powerpc/kernel/time.c:584:3,
       inlined from 'timer_interrupt' at arch/powerpc/kernel/time.c:553:1:
>> include/linux/compiler_types.h:328:45: error: call to '__compiletime_assert_34' declared with attribute error: BUILD_BUG failed
     328 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:309:25: note: in definition of macro '__compiletime_assert'
     309 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:328:9: note: in expansion of macro '_compiletime_assert'
     328 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
         |                     ^~~~~~~~~~~~~~~~
   arch/powerpc/include/asm/hw_irq.h:447:9: note: in expansion of macro 'BUILD_BUG'
     447 |         BUILD_BUG();
         |         ^~~~~~~~~
   cc1: all warnings being treated as errors


vim +742 arch/powerpc/kernel/irq.c

   727	
   728	void __do_irq(struct pt_regs *regs)
   729	{
   730		unsigned int irq;
   731	
   732		trace_irq_entry(regs);
   733	
   734		/*
   735		 * Query the platform PIC for the interrupt & ack it.
   736		 *
   737		 * This will typically lower the interrupt line to the CPU
   738		 */
   739		irq = ppc_md.get_irq();
   740	
   741		/* We can hard enable interrupts now to allow perf interrupts */
 > 742		if (should_hard_irq_enable())
   743			do_hard_irq_enable();
   744	
   745		/* And finally process it */
   746		if (unlikely(!irq))
   747			__this_cpu_inc(irq_stat.spurious_irqs);
   748		else
   749			generic_handle_irq(irq);
   750	
   751		trace_irq_exit(regs);
   752	}
   753	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 7152 bytes --]

^ permalink raw reply

* Re: [PATCH v2 1/2] powerpc/bug: Remove specific powerpc BUG_ON() and WARN_ON() on PPC32
From: Segher Boessenkool @ 2021-08-26 15:30 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: linux-kernel, Paul Mackerras, linuxppc-dev
In-Reply-To: <1629989540.drlhb24t2w.astroid@bobo.none>

On Fri, Aug 27, 2021 at 01:04:36AM +1000, Nicholas Piggin wrote:
> Excerpts from Segher Boessenkool's message of August 27, 2021 12:37 am:
> >> No, they are all dispatched and issue to the BRU for execution. It's 
> >> trivial to construct a test of a lot of not taken branches in a row
> >> and time a loop of it to see it executes at 1 cycle per branch.
> > 
> > (s/dispatched/issued/)
> 
> ?

Dispatch is from decode to the issue queues.  Issue is from there to
execution units.  Dispatch is in-order, issue is not.

> >> How could it validate prediction without issuing? It wouldn't know when
> >> sources are ready.
> > 
> > In the backend.  But that is just how it worked on older cores :-/
> 
> Okay. I don't know about older cores than POWER9. Backend would normally 
> include execution though.
> Only other place you could do it if you don't
> issue/exec would be after it goes back in order, like completion.

You do not have to do the verification in-order: the insn cannot finish
until it is no longer speculative, that takes care of all ordering
needed.

> But that would be horrible for mispredict penalty.

See the previous point.  Also, any insn known to be mispredicted can be
flushed immediately anyway.

> >> >> The first problem seems like the show stopper though. AFAIKS it would 
> >> >> need a special builtin support that does something to create the table
> >> >> entry, or a guarantee that we could put an inline asm right after the
> >> >> builtin as a recognized pattern and that would give us the instruction
> >> >> following the trap.
> >> > 
> >> > I'm not quite sure what this means.  Can't you always just put a
> >> > 
> >> > bla:	asm("");
> >> > 
> >> > in there, and use the address of "bla"?
> >> 
> >> Not AFAIKS. Put it where?
> > 
> > After wherever you want to know the address after.  You will have to
> > make sure they stay together somehow.
> 
> I still don't follow.

some_thing_you_want_to_know_the_address_after_let_us_call_it_A;
empty_asm_that_we_can_take_the_address_of_known_as_B;

You have to make sure the compiler keeps A and B together, does not
insert anything between them, does put them in the assembler output in
the same fragment, etc.

> If you could give a built in that put a label at the address of the trap 
> instruction that could be used later by inline asm then that could work
> too:
> 
>     __builtin_labeled_trap("1:");
>     asm ("    .section __bug_table,\"aw\"  \n\t"
>          "2:  .4byte 1b - 2b               \n\t"
>          "    .previous");

How could a compiler do anything like that?!


Segher

^ permalink raw reply

* Re: [next-20210820][ppc][nvme/raid] pci unbind WARNS at fs/kernfs/dir.c:1524 kernfs_remove_by_name_ns+
From: Christoph Hellwig @ 2021-08-26 16:30 UTC (permalink / raw)
  To: Abdul Haleem
  Cc: Brian King, linux-next, linuxppc-dev, linux-kernel, linux-scsi
In-Reply-To: <063e6cf0-94ab-26f2-4fed-aebf1499127c@linux.vnet.ibm.com>

Are you sure this is the very latest linux-next?  This should hav been
fixed by:

    "block: add back the bd_holder_dir reference in bd_link_disk_holder"

^ permalink raw reply

* Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
From: Mathieu Desnoyers @ 2021-08-26 18:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: KVM list, Peter Zijlstra, Catalin Marinas, linux-kernel,
	Will Deacon, Guo Ren, linux-kselftest, Ben Gardon, shuah,
	Paul Mackerras, linux-s390, gor, Russell King, ARM Linux,
	linux-csky, Christian Borntraeger, Ingo Molnar, dvhart,
	linux-mips, Boqun Feng, paulmck, Heiko Carstens, rostedt,
	Shakeel Butt, Andy Lutomirski, Thomas Gleixner, Peter Foley,
	linux-arm-kernel, Thomas Bogendoerfer, Oleg Nesterov,
	Paolo Bonzini, linuxppc-dev
In-Reply-To: <YSblqrrpKcORzilX@google.com>

----- On Aug 25, 2021, at 8:51 PM, Sean Christopherson seanjc@google.com wrote:

> On Mon, Aug 23, 2021, Mathieu Desnoyers wrote:
>> [ re-send to Darren Hart ]
>> 
>> ----- On Aug 23, 2021, at 11:18 AM, Mathieu Desnoyers
>> mathieu.desnoyers@efficios.com wrote:
>> 
>> > ----- On Aug 20, 2021, at 6:50 PM, Sean Christopherson seanjc@google.com wrote:
>> > 
>> >> Add a test to verify an rseq's CPU ID is updated correctly if the task is
>> >> migrated while the kernel is handling KVM_RUN.  This is a regression test
>> >> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
>> >> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
>> >> without updating rseq, leading to a stale CPU ID and other badness.
>> >> 
>> > 
>> > [...]
>> > 
>> > +#define RSEQ_SIG 0xdeadbeef
>> > 
>> > Is there any reason for defining a custom signature rather than including
>> > tools/testing/selftests/rseq/rseq.h ? This should take care of including
>> > the proper architecture header which will define the appropriate signature.
>> > 
>> > Arguably you don't define rseq critical sections in this test per se, but
>> > I'm wondering why the custom signature here.
> 
> Partly to avoid taking a dependency on rseq.h, and partly to try to call out
> that
> the test doesn't actually do any rseq critical sections.

It might be good to add a comment near this define stating this then, so nobody
attempts to copy this as an example.

> 
>> > [...]
>> > 
>> >> +
>> >> +static void *migration_worker(void *ign)
>> >> +{
>> >> +	cpu_set_t allowed_mask;
>> >> +	int r, i, nr_cpus, cpu;
>> >> +
>> >> +	CPU_ZERO(&allowed_mask);
>> >> +
>> >> +	nr_cpus = CPU_COUNT(&possible_mask);
>> >> +
>> >> +	for (i = 0; i < 20000; i++) {
>> >> +		cpu = i % nr_cpus;
>> >> +		if (!CPU_ISSET(cpu, &possible_mask))
>> >> +			continue;
>> >> +
>> >> +		CPU_SET(cpu, &allowed_mask);
>> >> +
>> >> +		/*
>> >> +		 * Bump the sequence count twice to allow the reader to detect
>> >> +		 * that a migration may have occurred in between rseq and sched
>> >> +		 * CPU ID reads.  An odd sequence count indicates a migration
>> >> +		 * is in-progress, while a completely different count indicates
>> >> +		 * a migration occurred since the count was last read.
>> >> +		 */
>> >> +		atomic_inc(&seq_cnt);
>> > 
>> > So technically this atomic_inc contains the required barriers because the
>> > selftests implementation uses "__sync_add_and_fetch(&addr->val, 1)". But
>> > it's rather odd that the semantic differs from the kernel implementation in
>> > terms of memory barriers: the kernel implementation of atomic_inc
>> > guarantees no memory barriers, but this one happens to provide full
>> > barriers pretty much by accident (selftests futex/include/atomic.h
>> > documents no such guarantee).
> 
> Yeah, I got quite lost trying to figure out what atomics the test would actually
> end up with.

At the very least, until things are clarified in the selftests atomic header, I would
recommend adding a comment stating which memory barriers are required alongside each
use of atomic_inc here. I would even prefer that we add extra (currently unneeded)
write barriers to make extra sure that this stays documented. Performance of the write-side
does not matter much here.

> 
>> > If this full barrier guarantee is indeed provided by the selftests atomic.h
>> > header, I would really like a comment stating that in the atomic.h header
>> > so the carpet is not pulled from under our feet by a future optimization.
>> > 
>> > 
>> >> +		r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
>> >> +		TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
>> >> +			    errno, strerror(errno));
>> >> +		atomic_inc(&seq_cnt);
>> >> +
>> >> +		CPU_CLR(cpu, &allowed_mask);
>> >> +
>> >> +		/*
>> >> +		 * Let the read-side get back into KVM_RUN to improve the odds
>> >> +		 * of task migration coinciding with KVM's run loop.
>> > 
>> > This comment should be about increasing the odds of letting the seqlock
>> > read-side complete. Otherwise, the delay between the two back-to-back
>> > atomic_inc is so small that the seqlock read-side may never have time to
>> > complete the reading the rseq cpu id and the sched_getcpu() call, and can
>> > retry forever.
> 
> Hmm, but that's not why there's a delay.  I'm not arguing that a livelock isn't
> possible (though that syscall would have to be screaming fast),

I don't think we have the same understanding of the livelock scenario. AFAIU the livelock
would be caused by a too-small delay between the two consecutive atomic_inc() from one
loop iteration to the next compared to the time it takes to perform a read-side critical
section of the seqlock. Back-to-back atomic_inc can be performed very quickly, so I
doubt that the sched_getcpu implementation have good odds to be fast enough to complete
in that narrow window, leading to lots of read seqlock retry.

> but the primary
> motivation is very much to allow the read-side enough time to get back into KVM
> proper.

I'm puzzled by your statement. AFAIU, let's say we don't have the delay, then we
have:

migration thread                             KVM_RUN/read-side thread
-----------------------------------------------------------------------------------
                                             - ioctl(KVM_RUN)
- atomic_inc_seq_cst(&seqcnt)
- sched_setaffinity
- atomic_inc_seq_cst(&seqcnt)
                                             - a = atomic_load(&seqcnt) & ~1
                                             - smp_rmb()
                                             - b = LOAD_ONCE(__rseq_abi->cpu_id);
                                             - sched_getcpu()
                                             - smp_rmb()
                                             - re-load seqcnt/compare (succeeds)
                                               - Can only succeed if entire read-side happens while the seqcnt
                                                 is in an even numbered state.
                                             - if (a != b) abort()
  /* no delay. Even counter state is very
     short. */
- atomic_inc_seq_cst(&seqcnt)
  /* Let's suppose the lack of delay causes the
     setaffinity to complete too early compared
     with KVM_RUN ioctl */
- sched_setaffinity
- atomic_inc_seq_cst(&seqcnt)

  /* no delay. Even counter state is very
     short. */
- atomic_inc_seq_cst(&seqcnt)
  /* Then a setaffinity from a following
     migration thread loop will run
     concurrently with KVM_RUN */
                                             - ioctl(KVM_RUN)
- sched_setaffinity
- atomic_inc_seq_cst(&seqcnt)

As pointed out here, if the first setaffinity runs too early compared with KVM_RUN,
a following setaffinity will run concurrently with it. However, the fact that 
the even counter state is very short may very well hurt progress of the read seqlock.

> 
> To encounter the bug, TIF_NOTIFY_RESUME has to be recognized by KVM in its run
> loop, i.e. sched_setaffinity() must induce task migration after the read-side
> has
> invoked ioctl(KVM_RUN).

No argument here.

> 
>> > I'm wondering if 1 microsecond is sufficient on other architectures as
>> > well.
> 
> I'm definitely wondering that as well :-)
> 
>> > One alternative way to make this depend less on the architecture's
>> > implementation of sched_getcpu (whether it's a vDSO, or goes through a
>> > syscall) would be to read the rseq cpu id and call sched_getcpu a few times
>> > (e.g. 3 times) in the migration thread rather than use usleep, and throw
>> > away the value read. This would ensure the delay is appropriate on all
>> > architectures.
> 
> As above, I think an arbitrary delay is required regardless of how fast
> sched_getcpu() can execute.  One thought would be to do sched_getcpu() _and_
> usleep() to account for sched_getcpu() overhead and to satisfy the KVM_RUN part,
> but I don't know that that adds meaningful value.
> 
> The real test is if someone could see if the bug repros on non-x86 hardware...

As I explain in the scenario above, I don't think we agree on the reason why the
delay is required. One way to confirm this would be to run the code without delay,
and count how many seqcnt read-side succeed vs fail. We can then compare that with
a run with a 1us delay between the migration thread loops. This data should help us
come to a common understanding of the delay's role.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox