Status of FLR in Xen 4.4

* Status of FLR in Xen 4.4
@ 2013-09-26 16:05 Matthias
  2013-09-26 16:16 ` Ian Campbell
  2013-09-26 16:20 ` David Vrabel
  0 siblings, 2 replies; 23+ messages in thread
From: Matthias @ 2013-09-26 16:05 UTC (permalink / raw)
  To: xen-devel@lists.xen.org

Hi everyone,

I would like to ask what the current status of FLR, or better of FLR
emulation is in latest Xen and if we can expect better support in the
future.

I'm asking because with xl (latest build and traditional qemu, not
upstream), I have always had problems rebooting domUs that have VGA cards
passed through to them. Apparently the cards don't get reinitialized and
then cause either bluescreens (Windows), black screens (Linux) or a
complete freeze of dom0. As far as I understand, this is because the VGA
cards lack FLR capability (lspci -vvv shows FLReset-). Rebooting has lately
started to work sometimes on Windows, but it never works on Linux domUs. It
appears that xl simply cannot deal with reboots of domUs that have non-FLR
VGA cards passed through, and I have to reboot dom0 to get the cards
running again.
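
For reference, the FLReset+/FLReset- flag that lspci prints reflects a single
bit in the PCIe Device Capabilities register. A minimal sketch of that check,
assuming the 32-bit register value has already been read from config space
(the helper name devcap_has_flr is made up for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 28 of the PCIe Device Capabilities register; the same value as
 * PCI_EXP_DEVCAP_FLR in Linux's pci_regs.h. */
#define DEVCAP_FLR (UINT32_C(1) << 28)

/* Returns true when the function advertises Function Level Reset,
 * i.e. when lspci would print "FLReset+" rather than "FLReset-". */
static bool devcap_has_flr(uint32_t devcap)
{
	return (devcap & DEVCAP_FLR) != 0;
}
```

A card that reports FLReset- has this bit clear, so any reset has to come
from one of the fallback mechanisms instead.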

Is this the current status or is this supposed to work and I only have a
problem on my setup?

Also, I'm specifically referring to xl because back when I used xm with
Xen 4.0 and 4.1, this was never an issue and I could reboot both Linux and
Windows domUs as often as I wanted (with the same hardware setup I now use
with xl). So it seems to me that it is possible to handle non-FLR VGA cards
gracefully, but xl simply doesn't do that.

It would be great to have a quick roundup of the current situation and
future plans, because I'm planning a project that uses Xen's VGA passthrough
in a cloud / big data setup, and the unreliable reboot behaviour is
currently a deal breaker for me.

Thanks in advance!

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

* Re: Status of FLR in Xen 4.4
  2013-09-26 16:05 Status of FLR in Xen 4.4 Matthias
@ 2013-09-26 16:16 ` Ian Campbell
  2013-09-26 17:59   ` Matthias
  2013-09-26 16:20 ` David Vrabel
  1 sibling, 1 reply; 23+ messages in thread
From: Ian Campbell @ 2013-09-26 16:16 UTC (permalink / raw)
  To: Matthias; +Cc: xen-devel@lists.xen.org

On Thu, 2013-09-26 at 18:05 +0200, Matthias wrote:

> I would like to ask what the current status of FLR, or better of FLR
> emulation is in latest Xen and if we can expect better support in the
> future.
> 
> Is this the current status or is this supposed to work and I only have
> a problem on my setup?

xl simply asks the dom0 kernel to reset the card, so this is entirely
dependent on the functionality of your dom0 kernel and/or the features
of the particular hardware WRT allowing things to be reset.

> Also, I'm specifically referring to xl because back in the day when I
> used xm with xen 4.0 and 4.1, this never was an issue and I could
> reboot both linux and windows domUs without issues as often as I
> wanted (with the same hardware setup I now use with xl). So to me it
> seems that there is a possibility to handle non-FLR'ed vga cards
> gracefully, but xl simply isn't capable of that / does not do that.

This I'm afraid I don't know enough about to comment much.

tools/python/xen/util/pci.py appears to implement various FLR quirks for
bits of hardware, including some GFX from the looks of things.

These all belong in the upstream Linux kernel these days. You don't say
which kernel you are using but you could try updating it.

You could also check the kernel source for a quirk for your particular
hardware.

Ian.

* Re: Status of FLR in Xen 4.4
  2013-09-26 16:05 Status of FLR in Xen 4.4 Matthias
  2013-09-26 16:16 ` Ian Campbell
@ 2013-09-26 16:20 ` David Vrabel
  2013-09-26 17:48   ` Ross Philipson
  1 sibling, 1 reply; 23+ messages in thread
From: David Vrabel @ 2013-09-26 16:20 UTC (permalink / raw)
  To: Matthias; +Cc: xen-devel@lists.xen.org

On 26/09/13 17:05, Matthias wrote:
> Hi everyone,
> 
> I would like to ask what the current status of FLR, or better of FLR
> emulation is in latest Xen and if we can expect better support in the
> future.

What are these cards, are they multi-function and do they actually
support FLR?  Many graphics cards do not.

I have the following hack to pciback to fall back to a bus reset for
multi-function devices without FLR.  Does it help for your use case?
You will need to ensure that all functions are co-assigned to the same
domain.

David

8<---------------------------------------
diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
index 4e8ba38..5a03e63 100644
--- a/drivers/xen/xen-pciback/pci_stub.c
+++ b/drivers/xen/xen-pciback/pci_stub.c
@@ -14,6 +14,7 @@
 #include <linux/wait.h>
 #include <linux/sched.h>
 #include <linux/atomic.h>
+#include <linux/delay.h>
 #include <xen/events.h>
 #include <asm/xen/pci.h>
 #include <asm/xen/hypervisor.h>
@@ -43,6 +44,7 @@ struct pcistub_device {
 	struct kref kref;
 	struct list_head dev_list;
 	spinlock_t lock;
+	bool created_reset_file;

 	struct pci_dev *dev;
 	struct xen_pcibk_device *pdev;/* non-NULL if struct pci_dev is in use */
@@ -60,6 +62,114 @@ static LIST_HEAD(pcistub_devices);
 static int initialize_devices;
 static LIST_HEAD(seized_devices);

+/*
+ * pci_reset_function() will only work if there is a mechanism to
+ * reset that single function (e.g., FLR or a D-state transition).
+ * For PCI hardware that has two or more functions but no per-function
+ * reset, we can do a bus reset iff all the functions are co-assigned
+ * to the same domain.
+ *
+ * If a function has no per-function reset mechanism the 'reset' sysfs
+ * file that the toolstack uses to reset a function prior to assigning
+ * the device will be missing.  In this case, pciback adds its own
+ * which will try a bus reset.
+ *
+ * Note: pciback does not check for co-assignment before doing a bus
+ * reset, only that the devices are bound to pciback.  The toolstack
+ * is assumed to have done the right thing.
+ */
+static int __pcistub_reset_function(struct pci_dev *dev)
+{
+	struct pci_dev *pdev;
+	u16 ctrl;
+	int ret;
+
+	ret = __pci_reset_function_locked(dev);
+	if (ret == 0)
+		return 0;
+
+	if (pci_is_root_bus(dev->bus) || dev->subordinate || !dev->bus->self)
+		return -ENOTTY;
+
+	list_for_each_entry(pdev, &dev->bus->devices, bus_list) {
+		if (pdev != dev && (!pdev->driver
+				    || strcmp(pdev->driver->name, "pciback")))
+			return -ENOTTY;
+		pci_save_state(pdev);
+	}
+
+	pci_read_config_word(dev->bus->self, PCI_BRIDGE_CONTROL, &ctrl);
+	ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
+	pci_write_config_word(dev->bus->self, PCI_BRIDGE_CONTROL, ctrl);
+	msleep(200);
+
+	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
+	pci_write_config_word(dev->bus->self, PCI_BRIDGE_CONTROL, ctrl);
+	msleep(200);
+
+	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
+		pci_restore_state(pdev);
+
+	return 0;
+}
+
+static int pcistub_reset_function(struct pci_dev *dev)
+{
+	int ret;
+
+	device_lock(&dev->dev);
+	ret = __pcistub_reset_function(dev);
+	device_unlock(&dev->dev);
+
+	return ret;
+}
+
+static ssize_t pcistub_reset_store(struct device *dev,
+				   struct device_attribute *attr,
+				   const char *buf, size_t count)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	unsigned long val;
+	ssize_t result = strict_strtoul(buf, 0, &val);
+
+	if (result < 0)
+		return result;
+
+	if (val != 1)
+		return -EINVAL;
+
+	result = pcistub_reset_function(pdev);
+	if (result < 0)
+		return result;
+	return count;
+}
+static DEVICE_ATTR(reset, 0200, NULL, pcistub_reset_store);
+
+static int pcistub_try_create_reset_file(struct pcistub_device *psdev)
+{
+	struct device *dev = &psdev->dev->dev;
+	struct sysfs_dirent *reset_dirent;
+	int ret;
+
+	reset_dirent = sysfs_get_dirent(dev->kobj.sd, NULL, "reset");
+	if (reset_dirent) {
+		sysfs_put(reset_dirent);
+		return 0;
+	}
+
+	ret = device_create_file(dev, &dev_attr_reset);
+	if (ret < 0)
+		return ret;
+	psdev->created_reset_file = true;
+	return 0;
+}
+
+static void pcistub_remove_reset_file(struct pcistub_device *psdev)
+{
+	if (psdev && psdev->created_reset_file)
+		device_remove_file(&psdev->dev->dev, &dev_attr_reset);
+}
+
 static struct pcistub_device *pcistub_device_alloc(struct pci_dev *dev)
 {
 	struct pcistub_device *psdev;
@@ -95,12 +205,15 @@ static void pcistub_device_release(struct kref *kref)

 	dev_dbg(&dev->dev, "pcistub_device_release\n");

+	pcistub_remove_reset_file(psdev);
+
 	xen_unregister_device_domain_owner(dev);

 	/* Call the reset function which does not take lock as this
 	 * is called from "unbind" which takes a device_lock mutex.
 	 */
-	__pci_reset_function_locked(dev);
+	__pcistub_reset_function(psdev->dev);
+
 	if (pci_load_and_free_saved_state(dev, &dev_data->pci_saved_state))
 		dev_dbg(&dev->dev, "Could not reload PCI state\n");
 	else
@@ -268,7 +381,7 @@ void pcistub_put_pci_dev(struct pci_dev *dev)
 	/* This is OK - we are running from workqueue context
 	 * and want to inhibit the user from fiddling with 'reset'
 	 */
-	pci_reset_function(dev);
+	pcistub_reset_function(psdev->dev);
 	pci_restore_state(psdev->dev);

 	/* This disables the device. */
@@ -392,7 +505,7 @@ static int pcistub_init_device(struct pci_dev *dev)
 		dev_err(&dev->dev, "Could not store PCI conf saved state!\n");
 	else {
 		dev_dbg(&dev->dev, "resetting (FLR, D3, etc) the device\n");
-		__pci_reset_function_locked(dev);
+		__pcistub_reset_function(dev);
 		pci_restore_state(dev);
 	}
 	/* Now disable the device (this also ensures some private device
@@ -467,6 +580,10 @@ static int pcistub_seize(struct pci_dev *dev)
 	if (!psdev)
 		return -ENOMEM;

+	err = pcistub_try_create_reset_file(psdev);
+	if (err < 0)
+		goto out;
+
 	spin_lock_irqsave(&pcistub_devices_lock, flags);

 	if (initialize_devices) {
@@ -485,10 +602,9 @@ static int pcistub_seize(struct pci_dev *dev)
 	}

 	spin_unlock_irqrestore(&pcistub_devices_lock, flags);
-
+out:
 	if (err)
 		pcistub_device_put(psdev);
-
 	return err;
 }
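
The eligibility check at the top of __pcistub_reset_function() can be modelled
in user space to see which configurations pass. This is only a sketch: struct
fake_dev and bus_reset_allowed() are hypothetical stand-ins for the kernel
structures, keeping just the driver name that the patch compares.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the one struct pci_dev field the check uses. */
struct fake_dev {
	const char *driver; /* bound driver name, or NULL if unbound */
};

/*
 * Mirrors the list_for_each_entry() loop in the patch: every device on
 * the bus other than the one being reset must be bound to "pciback",
 * otherwise the bus reset is refused.
 */
static bool bus_reset_allowed(const struct fake_dev *devs, size_t n,
			      size_t target)
{
	for (size_t i = 0; i < n; i++) {
		if (i == target)
			continue; /* the patch skips the device being reset */
		if (!devs[i].driver || strcmp(devs[i].driver, "pciback") != 0)
			return false;
	}
	return true;
}
```

For a GPU with an HDMI audio function, this means the audio function must be
bound to pciback as well; if it is unbound or still held by a dom0 driver,
the check fails and the patch returns -ENOTTY.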

* Re: Status of FLR in Xen 4.4
  2013-09-26 16:20 ` David Vrabel
@ 2013-09-26 17:48   ` Ross Philipson
  2013-09-26 18:01     ` David Vrabel
  0 siblings, 1 reply; 23+ messages in thread
From: Ross Philipson @ 2013-09-26 17:48 UTC (permalink / raw)
  To: xen-devel@lists.xen.org; +Cc: Matthias, David Vrabel

On 09/26/2013 12:20 PM, David Vrabel wrote:
> On 26/09/13 17:05, Matthias wrote:
>> Hi everyone,
>>
>> I would like to ask what the current status of FLR, or better of FLR
>> emulation is in latest Xen and if we can expect better support in the
>> future.
>
> What are these cards, are they multi-function and do they actually
> support FLR?  Many graphics cards do not.
>
> I have the following hack to pciback to fall back to a bus reset for
> multi-function devices without FLR.  Does it help for your use case?
> You will need to ensure that all functions are co-assigned to the same
> domain.

New kernels (e.g. 3.8) have full support for PCI-e and PCI AF FLRs as
well as fallback support for D0-D3 and secondary bus resets. This
functionality is also in some of the last 2.6 kernels, like 2.6.39.
If you are using an older kernel I guess you might need to patch it.

Also depending on your hw there might be a specific quirk you need (e.g. 
the 82599 quirk in pci/quirks.c).

Ross

* Re: Status of FLR in Xen 4.4
  2013-09-26 16:16 ` Ian Campbell
@ 2013-09-26 17:59   ` Matthias
  2013-09-27 13:34     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 23+ messages in thread
From: Matthias @ 2013-09-26 17:59 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel@lists.xen.org

I'm currently on a vanilla 3.8.2 kernel because this is the only >3.4
kernel I found which doesn't give me this issue:
http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html

So I would assume that the kernel is new enough to handle that. On the
other hand, as far as I understand the whole process, the kernel itself
only deals with the VGA card if the card is actually bound to its dom0
driver, which it is not. Is there any way either to check whether the reset
request from xl is really executed in dom0, or to issue that request
manually?

Btw: Hardware is a Radeon HD 5750 and a Radeon HD 5400.

* Re: Status of FLR in Xen 4.4
  2013-09-26 17:48   ` Ross Philipson
@ 2013-09-26 18:01     ` David Vrabel
  2013-09-26 18:41       ` Matthias
  2013-10-03 22:20       ` Matthias
  0 siblings, 2 replies; 23+ messages in thread
From: David Vrabel @ 2013-09-26 18:01 UTC (permalink / raw)
  To: Ross Philipson; +Cc: Matthias, xen-devel@lists.xen.org

On 26/09/13 18:48, Ross Philipson wrote:
> On 09/26/2013 12:20 PM, David Vrabel wrote:
>> On 26/09/13 17:05, Matthias wrote:
>>> Hi everyone,
>>>
>>> I would like to ask what the current status of FLR, or better of FLR
>>> emulation is in latest Xen and if we can expect better support in the
>>> future.
>>
>> What are these cards, are they multi-function and do they actually
>> support FLR?  Many graphics cards do not.
>>
>> I have the following hack to pciback to fall back to a bus reset for
>> multi-function devices without FLR.  Does it help for your use case?
>> You will need to ensure that all functions are co-assigned to the same
>> domain.
> 
> New kernels (e.g. 3.8) have full support for PCI-e and PCI AF FLRs as
> well as fallback support for D0-D3 and secondary bus resets. This
> functionality is also in some of the last 2.6 kernels like 2.6.39.
> If you are using an older kernel I guess you might need to patch it.

It will do a secondary bus reset only if the function to be reset is
the only function on that bus.  For a multi-function device, a
secondary bus reset is not tried.
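
The fallback order discussed in this subthread can be summarised as a small
decision function. This is a simplified user-space model and not the kernel's
code: the struct, enum and function names are hypothetical, and the real
pci_reset_function() path also tries device-specific quirks and checks further
conditions (for example, whether the device sits on a root bus).

```c
#include <stdbool.h>

enum reset_method {
	RESET_PCIE_FLR, /* PCIe Function Level Reset */
	RESET_AF_FLR,   /* PCI Advanced Features FLR */
	RESET_PM,       /* D0 -> D3hot -> D0 power state cycle */
	RESET_BUS,      /* secondary bus reset via the parent bridge */
	RESET_NONE      /* nothing usable: the reset attempt fails */
};

/* Hypothetical summary of what one function offers. */
struct reset_caps {
	bool pcie_flr;
	bool af_flr;
	bool pm_reset;     /* power cycling really resets the function */
	int  funcs_on_bus; /* functions sharing the secondary bus */
};

/* Pick the first mechanism that applies, in the order described above. */
static enum reset_method pick_reset(const struct reset_caps *c)
{
	if (c->pcie_flr)
		return RESET_PCIE_FLR;
	if (c->af_flr)
		return RESET_AF_FLR;
	if (c->pm_reset)
		return RESET_PM;
	if (c->funcs_on_bus == 1)
		return RESET_BUS; /* only tried for a lone function */
	return RESET_NONE;
}
```

A dual-function graphics card without FLR and without a usable power-state
reset ends up at RESET_NONE in this model, which is the case the pciback
patch earlier in the thread tries to cover.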

David

* Re: Status of FLR in Xen 4.4
  2013-09-26 18:01     ` David Vrabel
@ 2013-09-26 18:41       ` Matthias
  2013-09-26 19:13         ` Gordan Bobic
  2013-10-03 22:20       ` Matthias
  1 sibling, 1 reply; 23+ messages in thread
From: Matthias @ 2013-09-26 18:41 UTC (permalink / raw)
  To: David Vrabel; +Cc: Ross Philipson, xen-devel@lists.xen.org


Hi,

thanks for your answers. The cards are an AMD HD 5750 and an HD 5400, both
with two functions (due to audio capabilities), both co-assigned to their
respective domU, and both not capable of FLR according to lspci -vvv.

Also, @Ross, I'm running a 3.8.2 kernel, so this should be fine. But I
assume that the 'official' path, where xl asks dom0 to do the reset, does
not work (if I have understood David correctly): since the cards are
dual-function, no bus reset is actually executed, which causes the
misbehaviour. I also assume that xm, on the other hand, did a bus reset, so
it worked in this specific case.

I'm currently recompiling the kernel to see if your patch works, David.

Also, just to understand it better, is the secondary bus reset the thing
which you can manually invoke via /sys/bus/pci/devices/.../reset ?

So as a workaround, would the following work in principle?

xl pci-assignable-remove 0X:00.0
xl pci-assignable-remove 0X:00.1
echo "1" > /sys/bus/pci/devices/0X:00.0/reset
echo "1" > /sys/bus/pci/devices/0X:00.1/reset
xl pci-assignable-add 0X:00.0
xl pci-assignable-add 0X:00.1
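
A note on the paths in the commands above: sysfs lists PCI devices under
their full address including the domain (for example 0000:01:00.0), whereas
the short bus:device.function form is often what gets typed. A small sketch
of building the reset path from either form; the helper name reset_path and
the address 01:00.0 used in the comments are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * Builds "/sys/bus/pci/devices/<domain:bus:dev.fn>/reset" into buf.
 * An address given without a domain (e.g. "01:00.0") gets the default
 * domain 0000 prepended. Returns 0 on success, -1 if buf is too small.
 */
static int reset_path(const char *bdf, char *buf, size_t len)
{
	/* A full address has two colons (dddd:bb:dd.f), a short one has one. */
	const char *first = strchr(bdf, ':');
	int has_domain = first != NULL && strchr(first + 1, ':') != NULL;
	int n = snprintf(buf, len, "/sys/bus/pci/devices/%s%s/reset",
			 has_domain ? "" : "0000:", bdf);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

As the comment in David's patch points out, the reset file is missing when
the function has no per-function reset mechanism, so checking whether the
path exists is a useful first diagnostic before writing to it.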

Anyway, thanks for your answers and I will report if the patch works!


* Re: Status of FLR in Xen 4.4
  2013-09-26 18:41       ` Matthias
@ 2013-09-26 19:13         ` Gordan Bobic
  2013-09-27 12:26           ` Matthias
  0 siblings, 1 reply; 23+ messages in thread
From: Gordan Bobic @ 2013-09-26 19:13 UTC (permalink / raw)
  To: Matthias; +Cc: David Vrabel, Ross Philipson, xen-devel@lists.xen.org

On 09/26/2013 07:41 PM, Matthias wrote:
> Hi,
>
> thanks for your answers. The cards are an AMD HD 5750 and an HD 5400, both
> with two functions (due to audio capabilities), both co-assigned to
> their respective domU, and both not capable of FLR according to lspci -vvv.
>
> Also, @Ross, I'm running a 3.8.2 kernel, so this should be fine. But I
> assume that the 'official' path, where xl asks dom0 to do the reset,
> does not work (if I have understood David correctly): since the cards are
> dual-function, no bus reset is actually executed, which causes the
> misbehaviour. I also assume that xm, on the other hand, did a bus reset,
> so it worked in this specific case.
>
> I'm currently recompiling the kernel to see if your patch works, David.
>
> Also, just to understand it better, is the secondary bus reset the thing
> which you can manually invoke via /sys/bus/pci/devices/.../reset ?
>
> So as a workaround, would the following work in principle?
>
> xl pci-assignable-remove 0X:00.0
> xl pci-assignable-remove 0X:00.1
> echo "1" > /sys/bus/pci/devices/0X:00.0/reset
> echo "1" > /sys/bus/pci/devices/0X:00.1/reset

This bit is up to the driver to implement. Since pciback is a 
placeholder rather than a driver that knows about the hardware the reset 
node won't be there.

You could try to do something with setpci to force the registers between 
D0 and D3 power states in a vague hope that might do something, but I 
doubt it.
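
For anyone trying the setpci route: the power state lives in the low two bits
of the PMCSR register in the PCI Power Management capability, and bit 3
(No_Soft_Reset) says whether a D3hot to D0 transition preserves device state,
in which case cycling the power state resets nothing. A sketch of the bit
manipulation, with hypothetical helper names and the register value assumed
to have been read beforehand:

```c
#include <stdint.h>

#define PM_CTRL_STATE_MASK    0x0003u /* PowerState field, PMCSR bits 1:0 */
#define PM_CTRL_NO_SOFT_RESET 0x0008u /* set: D3hot -> D0 keeps device state */

enum { PM_D0 = 0, PM_D3HOT = 3 };

/* Returns the PMCSR value to write in order to request the given D-state,
 * leaving all bits outside the PowerState field unchanged. */
static uint16_t pmcsr_with_state(uint16_t pmcsr, unsigned state)
{
	return (uint16_t)((pmcsr & ~PM_CTRL_STATE_MASK) |
			  (state & PM_CTRL_STATE_MASK));
}

/* Nonzero if cycling D0 -> D3hot -> D0 is expected to reset the function. */
static int pm_cycle_resets(uint16_t pmcsr)
{
	return (pmcsr & PM_CTRL_NO_SOFT_RESET) == 0;
}
```

If pm_cycle_resets() is zero for a card, forcing the power states by hand is
unlikely to achieve anything.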

The reason nvidia cards work OK is because the domU driver knows how to 
reinitialize the hardware and acts accordingly. If the manufacturer 
won't implement a standard function to reset the hardware, then it is up 
to their drivers to handle the situation.

As a workaround, if (on Windows domUs) ejecting the card before
shutdown/reboot of the domU works, you could probably write some
PowerShell magic that does that on shutdown/reboot.

Gordan

* Re: Status of FLR in Xen 4.4
  2013-09-26 19:13         ` Gordan Bobic
@ 2013-09-27 12:26           ` Matthias
  2013-09-27 13:27             ` Gordan Bobic
  0 siblings, 1 reply; 23+ messages in thread
From: Matthias @ 2013-09-27 12:26 UTC (permalink / raw)
  To: Gordan Bobic
  Cc: Ian Campbell, David Vrabel, Ross Philipson,
	xen-devel@lists.xen.org


Hi Gordan,

I tried your patch on my dom0 kernel and I think it helped somewhat, in the
sense that I can now reboot the domUs without crashing the whole host. But
the Linux domU still gets a black screen, and the Windows 7 domU only boots
as far as a black screen with a (movable) cursor, but no further. This might
only be a coincidence, though; I have to double-check.

I tried some other stuff, too:

1) after domU shutdown, rebind both functions to the dom0 drivers, do a
sysfs reset and re-add to assignable devices -> crashes dom0
2) after domU shutdown, rebind both functions to the dom0 drivers and re-add
to assignable devices -> dom0 sometimes crashes when the domU using the
devices comes up, sometimes not, but no success either way
3) sysfs reset of the devices within domU seems to be passed through dom0
(see commands in qemu-log) but has no effect

Also, I analysed your code and compared it to the stuff in the Python tools
of xm; it is the same approach and I don't see any obvious differences.
Then I tried to replicate the secondary bus reset on the command line for
testing purposes via

printf '\x40' | dd of=/sys/devices/pci0000\:00/0000\:00\:0b.0/config \
    bs=1 seek=$((0x3e)) count=1 conv=notrunc

but I think I got the endianness or the offset slightly wrong, because after
that xl refuses to give the device (00:0b.0 is the bus of my 2-function VGA
card I have assigned to my domU) to the domU and later crashes dom0.
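
For comparison with the dd attempt: offset 0x3e and value 0x40 do match
PCI_BRIDGE_CONTROL and PCI_BRIDGE_CTL_BUS_RESET as used in David's patch, so
the offset itself looks right. One possible difference, offered as a guess
and not a diagnosis, is that the patch does a read-modify-write, clears the
bit again after a delay, and saves and restores the config space of the
devices behind the bridge, while a one-byte write does none of that. The bit
handling, with hypothetical helper names:

```c
#include <stdint.h>

#define BRIDGE_CONTROL       0x3e /* offset of the 16-bit Bridge Control register */
#define BRIDGE_CTL_BUS_RESET 0x40 /* Secondary Bus Reset bit */

/* Read-modify-write: set only the reset bit, keep every other bit. */
static uint16_t bridge_ctl_assert_reset(uint16_t ctrl)
{
	return (uint16_t)(ctrl | BRIDGE_CTL_BUS_RESET);
}

/* Clear the reset bit again; the bus stays in reset until this is written. */
static uint16_t bridge_ctl_release_reset(uint16_t ctrl)
{
	return (uint16_t)(ctrl & ~BRIDGE_CTL_BUS_RESET);
}

/* What a blind one-byte write at offset 0x3e does instead: the low byte
 * is replaced wholesale, so previously set control bits are lost. */
static uint16_t bridge_ctl_overwrite_low_byte(uint16_t ctrl, uint8_t value)
{
	return (uint16_t)((ctrl & 0xff00) | value);
}
```

With a starting value of 0x0003, the read-modify-write gives 0x0043 and then
0x0003 again after release, whereas the blind write leaves 0x0040 in place
with the secondary bus still held in reset.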

So I'm a little lost at that point and would welcome some suggestions.

Does FLR work for any of you with VGA cards?


* Re: Status of FLR in Xen 4.4
  2013-09-27 12:26           ` Matthias
@ 2013-09-27 13:27             ` Gordan Bobic
  2013-09-27 13:48               ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 23+ messages in thread
From: Gordan Bobic @ 2013-09-27 13:27 UTC (permalink / raw)
  To: Matthias; +Cc: xen-devel, Ian Campbell, Ross Philipson, David Vrabel

 On Fri, 27 Sep 2013 14:26:31 +0200, Matthias 
 <matthias.kannenberg@googlemail.com> wrote:
> Hi Gordan,
>
> I tried your patch on my dom0 kernel and I think it helped somewhat, in
> the sense that I can now reboot the domUs without crashing the whole
> host. But the Linux domU still gets a black screen, and the Windows 7
> domU only boots as far as a black screen with a (movable) cursor, but
> no further. This might only be a coincidence, though; I have to
> double-check.

 What patch? Nothing I posted to the list is fit for public
 consumption yet. You shouldn't be using it unless you really,
 REALLY know exactly what it does and know exactly what you
 are trying to achieve.

> I tried some other stuff, too:
>
> 1) after domU shutdown, rebind both functions to the dom0 drivers, do a
> sysfs reset and re-add to assignable devices -> crashes dom0

 My experience shows that letting dom0 drivers ever touch the hardware
 is a recipe for disaster.

> 2) after domU shutdown, rebind both functions to the dom0 drivers and
> re-add to assignable devices -> dom0 sometimes crashes when the domU
> using the devices comes up, sometimes not, but no success either way
> 3) sysfs reset of the devices within domU seems to be passed through
> dom0 (see commands in qemu-log) but has no effect

 It's up to the drivers to do the sensible thing. Nvidia drivers
 handle this a little more sanely, but if the drivers cannot handle
 clobbering the device's state into a known state, you are pretty
 much fighting a losing battle.

> Also, I analysed your code and compared it to the stuff in the python
> tools of xm; it is the same approach and I don't see any obvious
> differences.

 I am starting to suspect you aren't actually talking about my code
 but somebody else's...

> Then I tried to replicate the secondary bus reset on the
> command line for testing purposes via
>
>  printf 'x40' | dd of=/sys/devices/pci0000:00/0000:00:0b.0/config bs=1 seek=$((0x3e)) count=1 conv=notrunc
>
> but I think I got the endianness or the offset slightly wrong, because
> after that xl refuses to give the device (00:0b.0 is the bus of my
> 2-function vga card I have assigned to my domU) to the domU and later
> crashes dom0.
>
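
 A note on the quoted dd command: as written, printf 'x40' emits the
 three literal characters x, 4, 0, and with bs=1 count=1 dd writes only
 the first one, 'x' (0x78), to offset 0x3e. If a backslash was lost
 somewhere and the intent was printf '\x40', the write would set bit 6
 (Secondary Bus Reset) of the Bridge Control register, but it would
 still overwrite the other bits in that byte and never clear the reset
 bit again. Below is a minimal Python sketch of a read-modify-write
 pulse instead. The function name and the hold/settle delays are
 illustrative assumptions, not values taken from xl, xm or pciback, and
 devices behind the bridge would still need their configuration
 restored afterwards.

```python
import time

BRIDGE_CONTROL = 0x3E        # Bridge Control register offset (type 1 header)
SECONDARY_BUS_RESET = 0x40   # bit 6 of the low byte of Bridge Control


def pulse_secondary_bus_reset(config_path, hold_s=0.01, settle_s=1.0):
    """Set, then clear, the Secondary Bus Reset bit; keep all other bits.

    config_path is the bridge's sysfs 'config' file. The delays are
    assumed defaults for illustration only. Returns the original byte.
    """
    with open(config_path, "r+b", buffering=0) as cfg:
        # Read the current low byte of Bridge Control.
        cfg.seek(BRIDGE_CONTROL)
        original = cfg.read(1)[0]

        # Assert the reset while preserving the other bits.
        cfg.seek(BRIDGE_CONTROL)
        cfg.write(bytes([original | SECONDARY_BUS_RESET]))
        time.sleep(hold_s)

        # Release the reset, again preserving the other bits.
        cfg.seek(BRIDGE_CONTROL)
        cfg.write(bytes([original & ~SECONDARY_BUS_RESET & 0xFF]))
        time.sleep(settle_s)
    return original
```

 Calling this on a real bridge needs root and resets everything behind
 that bridge, so it is only a test aid, not a fix.
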
> So I'm a little lost at this point and would welcome some
> suggestions.
>
> Does FLR work for any of you with VGA cards?

 If you are talking about VGA cards with _proper_ FLR implementations
 on PCI level - there is no such thing. In all cases it is down to
 the domU driver to handle the card in whatever state it is. This
 works reasonably well with supported Nvidia cards (i.e.
 Quadro [K][2456]000 and Grid K[12] and equivalent modified GeForce
 cards (Fermi 4xx and Kepler 6xx/7xx series)). I never managed to
 get it working properly on any other GPUs.

 Even with Nvidia cards rebooting can lead to issues. For example,
 I have two GPUs passed to two different domUs. One is a GTX470
 modified to Q5000. The other is a GTX480 modified to Q6000. The
 domU with Q5000 always handled reboots reasonably reliably. The
 one with a Q6000 did not. I since switched the one with a Q6000
 to a QK5000 (modified GTX680), and now the reboots seem to work
 reasonably reliably, but I have found that there is still a
 crash if the monitor on the card changes between shutdown and
 restart - I'm guessing the card remembers its state and if it
 isn't consistent when it returns, the driver gets confused. I have
 other issues (see recent thread about Nvidia passthrough from
 David), but they seem to be specific to my setup.

 It's not perfect, but it's the only workable solution I have
 found.

 Gordan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Status of FLR in Xen 4.4
  2013-09-26 17:59   ` Matthias
@ 2013-09-27 13:34     ` Konrad Rzeszutek Wilk
  2013-09-27 17:07       ` Matthias
  0 siblings, 1 reply; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-09-27 13:34 UTC (permalink / raw)
  To: Matthias; +Cc: Ian Campbell, xen-devel@lists.xen.org

On Thu, Sep 26, 2013 at 07:59:40PM +0200, Matthias wrote:
> I'm currently on a vanilla 3.8.2 kernel because this is the only >3.4
> kernel I found which doesn't give me this issue:
> http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html

So v3.12 (or rather the latest and greatest from Linus's tree) has the NMI
backtrace mechanism, so you can actually see what is causing the stall.

> 
> So I would assume that the kernel should be new enough to handle that. On
> the other hand, as far as I understand the whole process, the kernel itself
> will only deal with the VGA card if it is actually bound to dom0 / to
> its driver, which it is not. Is there any way to test either whether the
> ask-command from xl is really executed on dom0, or to test this command
> manually?
> 
> Btw: Hardware is a Radeon HD 5750 and a Radeon HD 5400..


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Status of FLR in Xen 4.4
  2013-09-27 13:27             ` Gordan Bobic
@ 2013-09-27 13:48               ` Konrad Rzeszutek Wilk
  2013-09-27 14:00                 ` Gordan Bobic
  0 siblings, 1 reply; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-09-27 13:48 UTC (permalink / raw)
  To: Gordan Bobic
  Cc: David Vrabel, Matthias, Ian Campbell, Ross Philipson, xen-devel

On Fri, Sep 27, 2013 at 02:27:46PM +0100, Gordan Bobic wrote:
> Even with Nvidia cards rebooting can lead to issues. For example,
> I have two GPUs passed to two different domUs. One is a GTX470
> modified to Q5000. The other is a GTX480 modified to Q6000. The
> domU with Q5000 always handled reboots reasonably reliably. The
> one with a Q6000 did not. I since switched the one with a Q6000
> to a QK5000 (modified GTX680), and now the reboots seem to work
> reasonably reliably, but I have found that there is still a
> crash if the monitor on the card changes between shutdown and
> restart - I'm guessing the card remembers it's state and if it
> isn't consistent when it returns, driver gets confused. I have
> other issues (see recent thread about Nvidia passthrough from
> David), but they seem to be specific to my setup.

This state thing. If one were to capture the card's state before
doing any PCI passthrough and then tried to write it back exactly,
would that eliminate some of these issues?

I know that pciback does that for the PCI configuration values
(or at least it should) whenever a device has been de-assigned
from a guest - or unplugged.

But I presume that the rest (the BAR contents) is not saved/restored
in any way. What is the worst that could happen if one wrote
all of the MMIO values back exactly as they were?

(Probably a recipe for disaster, but who knows).
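
Before writing anything back, it may help to see what actually differs.
The following is a minimal Python sketch that snapshots a device's
config space from sysfs and diffs two snapshots. The function names
are hypothetical, and this only covers PCI configuration space, not
the BAR/MMIO contents asked about above.

```python
def read_config(config_path, length=256):
    """Return up to `length` bytes of a device's sysfs 'config' file."""
    with open(config_path, "rb") as cfg:
        return cfg.read(length)


def diff_config(before, after):
    """Return {offset: (old, new)} for every byte that differs.

    Only the overlapping prefix of the two snapshots is compared.
    """
    return {
        offset: (old, new)
        for offset, (old, new) in enumerate(zip(before, after))
        if old != new
    }
```

Taking one snapshot before the first passthrough and another after the
domU shuts down would show which configuration registers the guest left
changed.
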
> 
> It's not perfect, but it's the only workable solution I have
> found.
> 
> Gordan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Status of FLR in Xen 4.4
  2013-09-27 13:48               ` Konrad Rzeszutek Wilk
@ 2013-09-27 14:00                 ` Gordan Bobic
  0 siblings, 0 replies; 23+ messages in thread
From: Gordan Bobic @ 2013-09-27 14:00 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: David Vrabel, Matthias, Ian Campbell, Ross Philipson, xen-devel

 On Fri, 27 Sep 2013 09:48:34 -0400, Konrad Rzeszutek Wilk 
 <konrad.wilk@oracle.com> wrote:
> This state thing. If one were to capture the card's state before
> doing any PCI passthrough and then tried to write it back exactly,
> would that eliminate some of these issues?
>
> I know that pciback does that for the PCI configuration values
> (or at least it should) whenever a device has been de-assigned
> from a guest - or unplugged.
>
> But I presume that the rest (the BAR contents) is not saved/restored
> in any way. What is the worst that could happen if one wrote
> all of the MMIO values back exactly as they were?
>
> (Probably a recipe for disaster, but who knows).
>>
>> It's not perfect, but it's the only workable solution I have
>> found.

 That doesn't cover the entire state of the device.
 What about the rest of the device memory and states of all
 the proprietary registers?

 Since there are open source FB and accelerated drivers
 available for Radeon cards, enough is publicly known about
 them to be able to achieve suitable resetting. How
 difficult that might be to achieve, I have no idea. I
 have seen the open source Radeon Xorg driver successfully
 reset the GPU when the GPU stopped responding without
 taking Xorg or any of the running apps down in the process,
 so something similar to what it does might just be good
 enough.

 Whether it is a good idea to adopt anything but a fully
 hands-off approach to any passthrough hardware is a
 different question entirely.

 Gordan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Status of FLR in Xen 4.4
  2013-09-27 13:34     ` Konrad Rzeszutek Wilk
@ 2013-09-27 17:07       ` Matthias
  2013-09-27 17:28         ` Sander Eikelenboom
  2013-09-27 17:53         ` Is: RCU callback detects an RCU hang with Linux 3.12+ Was: " Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 23+ messages in thread
From: Matthias @ 2013-09-27 17:07 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Ian Campbell, xen-devel@lists.xen.org


[-- Attachment #1.1: Type: text/plain, Size: 1006 bytes --]

Hi Konrad,

Good call! I was able to reproduce the error with the 3.12-rc2 kernel and got
a lot of information from the new NMI traces (log attached), but since I'm
not a Xen hacker I don't really know how to continue from here. So I might
add this to the original post and maybe someone can help me. After all, the
error has persisted for half a year now, and apart from two kernel version /
.config combinations (a 3.8.2 and a 3.6.something), I could never track this
issue down (even by bisecting the .config, because at some point it seemed
random).
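
For anyone skimming the attached log, here is a minimal Python sketch
that pulls out which function each CPU was executing when the NMI
backtrace fired. The regular expressions assume the line layout seen in
this particular log ("NMI backtrace for cpu N" followed by a "RIP:"
line), and the function name is hypothetical.

```python
import re

# Patterns assume the format of the attached rcu_stall.log.
CPU_RE = re.compile(r"NMI backtrace for cpu (\d+)")
RIP_RE = re.compile(r"RIP: \S+\s+\[<[0-9a-f]+>\]\s+(\S+)\+0x")


def rip_by_cpu(log_text):
    """Map each CPU number to the symbol on its first RIP line."""
    result = {}
    cpu = None
    for line in log_text.splitlines():
        match = CPU_RE.search(line)
        if match:
            cpu = int(match.group(1))
            continue
        match = RIP_RE.search(line)
        if match and cpu is not None:
            # Keep only the first RIP seen for each backtrace.
            result.setdefault(cpu, match.group(1))
    return result
```

Feeding it the contents of rcu_stall.log gives a one-line-per-CPU
summary that is easier to compare across kernel versions than the raw
dump.
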


2013/9/27 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

> On Thu, Sep 26, 2013 at 07:59:40PM +0200, Matthias wrote:
> > I'm currently on a vanilla 3.8.2 kernel because this is the only >3.4
> > kernel I found which doesn't give me this issue:
> > http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html
>
> So v3.12 (or rather the latest and greatest from Linus's tree) has the NMI
> backtrace mechanism, so you can actually see what is causing the stall.
>

[-- Attachment #2: rcu_stall.log --]
[-- Type: application/octet-stream, Size: 40990 bytes --]

Sep 27 20:44:04 Server kernel: [  110.626714] 
Sep 27 20:44:04 Server kernel: [  110.626750]  (t=27864 jiffies g=5332 c=5331 q=2)
Sep 27 20:44:04 Server kernel: [  110.626754] sending NMI to all CPUs:
Sep 27 20:44:04 Server kernel: [  110.626769] NMI backtrace for cpu 0
Sep 27 20:44:04 Server kernel: [  110.626776] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.12.0-rc2 #2
Sep 27 20:44:04 Server kernel: [  110.626781] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
Sep 27 20:44:04 Server kernel: [  110.626787] task: ffffffff81613430 ti: ffffffff81600000 task.ti: ffffffff81600000
Sep 27 20:44:04 Server kernel: [  110.626791] RIP: e030:[<ffffffff8100130a>]  [<ffffffff8100130a>] xen_hypercall_vcpu_op+0xa/0x20
Sep 27 20:44:04 Server kernel: [  110.626808] RSP: e02b:ffff88007de03cb8  EFLAGS: 00000046
Sep 27 20:44:04 Server kernel: [  110.626812] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff8100130a
Sep 27 20:44:04 Server kernel: [  110.626817] RDX: 00000000deadbeef RSI: 00000000deadbeef RDI: 00000000deadbeef
Sep 27 20:44:04 Server kernel: [  110.626822] RBP: ffffffff816a3c50 R08: ffffffff816a3c58 R09: 0000000000000000
Sep 27 20:44:04 Server kernel: [  110.626826] R10: 000000000001128c R11: 0000000000000246 R12: 0000000000000005
Sep 27 20:44:04 Server kernel: [  110.626831] R13: ffffffff81642080 R14: ffffffff81600000 R15: 0000000000000000
Sep 27 20:44:04 Server kernel: [  110.626840] FS:  00007fb294ab4900(0000) GS:ffff88007de00000(0000) knlGS:0000000000000000
Sep 27 20:44:04 Server kernel: [  110.626844] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Sep 27 20:44:04 Server kernel: [  110.626847] CR2: 00007fb291aa8630 CR3: 0000000064f6b000 CR4: 0000000000000660
Sep 27 20:44:04 Server kernel: [  110.626853] Stack:
Sep 27 20:44:04 Server kernel: [  110.626857]  00000018405b4164 0000000000000000 ffffffff8129623c ffffffff810420c3
Sep 27 20:44:04 Server kernel: [  110.626869]  0000000000002710 ffff88007de0eaf0 0000000000000000 ffffffff81067ee0
Sep 27 20:44:04 Server kernel: [  110.626880]  ffffffff81642080 ffffffff810e44cd 0000000000000002 0000000000000000
Sep 27 20:44:04 Server kernel: [  110.626893] Call Trace:
Sep 27 20:44:04 Server kernel: [  110.626897]  <IRQ>  [<ffffffff8129623c>] ? xen_send_IPI_one+0x16/0x4f
Sep 27 20:44:04 Server kernel: [  110.626912]  [<ffffffff810420c3>] ? __xen_send_IPI_mask+0x32/0x39
Sep 27 20:44:04 Server kernel: [  110.626923]  [<ffffffff81067ee0>] ? arch_trigger_all_cpu_backtrace+0x4d/0x7e
Sep 27 20:44:04 Server kernel: [  110.626934]  [<ffffffff810e44cd>] ? rcu_check_callbacks+0x22f/0x598
Sep 27 20:44:04 Server kernel: [  110.626947]  [<ffffffff810c232a>] ? tick_sched_do_timer+0x2e/0x2e
Sep 27 20:44:04 Server kernel: [  110.626955]  [<ffffffff81084c35>] ? update_process_times+0x30/0x5b
Sep 27 20:44:04 Server kernel: [  110.626964]  [<ffffffff810c2237>] ? tick_sched_handle+0x3e/0x4a
Sep 27 20:44:04 Server kernel: [  110.626972]  [<ffffffff810c235a>] ? tick_sched_timer+0x30/0x4c
Sep 27 20:44:04 Server kernel: [  110.626978]  [<ffffffff81098355>] ? __run_hrtimer+0x93/0x159
Sep 27 20:44:04 Server kernel: [  110.626988]  [<ffffffff81098b72>] ? hrtimer_interrupt+0xe3/0x1ca
Sep 27 20:44:04 Server kernel: [  110.626998]  [<ffffffff8103d8e4>] ? xen_timer_interrupt+0x31/0x13b
Sep 27 20:44:04 Server kernel: [  110.627008]  [<ffffffff810b5b7a>] ? handle_irq_event_percpu+0x4d/0x1c5
Sep 27 20:44:04 Server kernel: [  110.627017]  [<ffffffff810b8287>] ? handle_percpu_irq+0x39/0x4c
Sep 27 20:44:04 Server kernel: [  110.627028]  [<ffffffff812951c0>] ? __xen_evtchn_do_upcall+0x107/0x2cb
Sep 27 20:44:04 Server kernel: [  110.627038]  [<ffffffff81219936>] ? delay_tsc+0x9c/0xc6
Sep 27 20:44:04 Server kernel: [  110.627048]  [<ffffffff81093ba0>] ? __rcu_read_unlock+0x33/0x51
Sep 27 20:44:04 Server kernel: [  110.627057]  [<ffffffff8129663a>] ? xen_evtchn_do_upcall+0x22/0x32
Sep 27 20:44:04 Server kernel: [  110.627066]  [<ffffffff813e897e>] ? xen_do_hypervisor_callback+0x1e/0x30
Sep 27 20:44:04 Server kernel: [  110.627077]  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Sep 27 20:44:04 Server kernel: [  110.627090]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Sep 27 20:44:04 Server kernel: [  110.627098]  [<ffffffff8103d768>] ? xen_safe_halt+0xc/0x13
Sep 27 20:44:04 Server kernel: [  110.627105]  [<ffffffff8104ae0b>] ? default_idle+0x14/0x3e
Sep 27 20:44:04 Server kernel: [  110.627114]  [<ffffffff810b53ee>] ? cpu_startup_entry+0x107/0x160
Sep 27 20:44:04 Server kernel: [  110.627122]  [<ffffffff816c3d3c>] ? start_kernel+0x3f1/0x3fc
Sep 27 20:44:04 Server kernel: [  110.627132]  [<ffffffff816c376e>] ? repair_env_string+0x54/0x54
Sep 27 20:44:04 Server kernel: [  110.627140]  [<ffffffff816c87f9>] ? xen_start_kernel+0x4bb/0x4c5
Sep 27 20:44:04 Server kernel: [  110.627150] Code: cc 51 41 53 50 b8 17 00 00 00 0f 05 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 18 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 
Sep 27 20:44:04 Server kernel: [  110.627344] NMI backtrace for cpu 1
Sep 27 20:44:04 Server kernel: [  110.627352] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.12.0-rc2 #2
Sep 27 20:44:04 Server kernel: [  110.627356] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
Sep 27 20:44:04 Server kernel: [  110.627361] task: ffff8800658da080 ti: ffff880065900000 task.ti: ffff880065900000
Sep 27 20:44:04 Server kernel: [  110.627364] RIP: e030:[<ffffffff8125b2b2>]  [<ffffffff8125b2b2>] cfb_imageblit+0x1b3/0x411
Sep 27 20:44:04 Server kernel: [  110.627380] RSP: e02b:ffff88007de439f0  EFLAGS: 00000046
Sep 27 20:44:04 Server kernel: [  110.627383] RAX: 0000000000000000 RBX: ffff88001e1c2800 RCX: 0000000000000003
Sep 27 20:44:04 Server kernel: [  110.627386] RDX: 000000000000003b RSI: ffff88001e00614e RDI: 0000000000000000
Sep 27 20:44:04 Server kernel: [  110.627389] RBP: 0000000000000013 R08: 0000000000000001 R09: ffffffff814655f0
Sep 27 20:44:04 Server kernel: [  110.627392] R10: ffff88001e006116 R11: ffffc90014875710 R12: 000000000000000d
Sep 27 20:44:04 Server kernel: [  110.627394] R13: 0000000000000000 R14: ffffc90014875714 R15: ffffc90014875000
Sep 27 20:44:04 Server kernel: [  110.627403] FS:  00007fb294ab4900(0000) GS:ffff88007de40000(0000) knlGS:0000000000000000
Sep 27 20:44:04 Server kernel: [  110.627407] CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
Sep 27 20:44:04 Server kernel: [  110.627410] CR2: 00007fb29177a9a0 CR3: 000000000160c000 CR4: 0000000000000660
Sep 27 20:44:04 Server kernel: [  110.627415] Stack:
Sep 27 20:44:04 Server kernel: [  110.627417]  0000000100aaaaaa 00000000000001d8 0000000000000000 0000000000aaaaaa
Sep 27 20:44:04 Server kernel: [  110.627429]  ffff8800532f0a40 ffff88001e1c2800 0000000000000001 ffff88001e1c2800
Sep 27 20:44:04 Server kernel: [  110.627439]  0000000000000000 ffff88007d424400 00000000ffff00ff 000000000000003b
Sep 27 20:44:04 Server kernel: [  110.627448] Call Trace:
Sep 27 20:44:04 Server kernel: [  110.627451]  <IRQ>  [<ffffffff81256ac4>] ? bit_putcs+0x352/0x39d
Sep 27 20:44:04 Server kernel: [  110.627466]  [<ffffffff81219825>] ? paravirt_read_tsc+0x5/0x8
Sep 27 20:44:04 Server kernel: [  110.627475]  [<ffffffff81256772>] ? bit_cursor+0x45d/0x45d
Sep 27 20:44:04 Server kernel: [  110.627482]  [<ffffffff812523a8>] ? fbcon_putcs+0xbd/0xcc
Sep 27 20:44:04 Server kernel: [  110.627490]  [<ffffffff812bc6b6>] ? vt_console_print+0x234/0x290
Sep 27 20:44:04 Server kernel: [  110.627499]  [<ffffffff810b336f>] ? call_console_drivers.constprop.18+0xb3/0xfc
Sep 27 20:44:04 Server kernel: [  110.627509]  [<ffffffff810b3c7d>] ? console_unlock+0x131/0x306
Sep 27 20:44:04 Server kernel: [  110.627517]  [<ffffffff810b420e>] ? vprintk_emit+0x3bc/0x3eb
Sep 27 20:44:04 Server kernel: [  110.627523]  [<ffffffff812c92f5>] ? paravirt_read_tsc+0x5/0x8
Sep 27 20:44:04 Server kernel: [  110.627530]  [<ffffffff812cae43>] ? add_interrupt_randomness+0x3f/0x15d
Sep 27 20:44:04 Server kernel: [  110.627538]  [<ffffffff813db9c8>] ? printk+0x4f/0x51
Sep 27 20:44:04 Server kernel: [  110.627548]  [<ffffffff810e4433>] ? rcu_check_callbacks+0x195/0x598
Sep 27 20:44:04 Server kernel: [  110.627557]  [<ffffffff810a3b50>] ? irqtime_account_process_tick.isra.2+0xd6/0x239
Sep 27 20:44:04 Server kernel: [  110.627567]  [<ffffffff810c232a>] ? tick_sched_do_timer+0x2e/0x2e
Sep 27 20:44:04 Server kernel: [  110.627574]  [<ffffffff81084c35>] ? update_process_times+0x30/0x5b
Sep 27 20:44:04 Server kernel: [  110.627582]  [<ffffffff810c2237>] ? tick_sched_handle+0x3e/0x4a
Sep 27 20:44:04 Server kernel: [  110.627588]  [<ffffffff810c235a>] ? tick_sched_timer+0x30/0x4c
Sep 27 20:44:04 Server kernel: [  110.627595]  [<ffffffff81098355>] ? __run_hrtimer+0x93/0x159
Sep 27 20:44:04 Server kernel: [  110.627602]  [<ffffffff81098b72>] ? hrtimer_interrupt+0xe3/0x1ca
Sep 27 20:44:04 Server kernel: [  110.627609]  [<ffffffff8103d8e4>] ? xen_timer_interrupt+0x31/0x13b
Sep 27 20:44:04 Server kernel: [  110.627618]  [<ffffffff81294c4c>] ? HYPERVISOR_event_channel_op+0xd/0x1d
Sep 27 20:44:04 Server kernel: [  110.627627]  [<ffffffff8103d79b>] ? xen_force_evtchn_callback+0x9/0xa
Sep 27 20:44:04 Server kernel: [  110.627635]  [<ffffffff8103df22>] ? check_events+0x12/0x20
Sep 27 20:44:04 Server kernel: [  110.627643]  [<ffffffff810b5b7a>] ? handle_irq_event_percpu+0x4d/0x1c5
Sep 27 20:44:04 Server kernel: [  110.627651]  [<ffffffff813e556e>] ? notifier_call_chain+0x32/0x52
Sep 27 20:44:04 Server kernel: [  110.627659]  [<ffffffff810b8287>] ? handle_percpu_irq+0x39/0x4c
Sep 27 20:44:04 Server kernel: [  110.627668]  [<ffffffff812951c0>] ? __xen_evtchn_do_upcall+0x107/0x2cb
Sep 27 20:44:04 Server kernel: [  110.627675]  [<ffffffff81219936>] ? delay_tsc+0x9c/0xc6
Sep 27 20:44:04 Server kernel: [  110.627682]  [<ffffffff81093ba0>] ? __rcu_read_unlock+0x33/0x51
Sep 27 20:44:04 Server kernel: [  110.627691]  [<ffffffff8129663a>] ? xen_evtchn_do_upcall+0x22/0x32
Sep 27 20:44:04 Server kernel: [  110.627698]  [<ffffffff813e897e>] ? xen_do_hypervisor_callback+0x1e/0x30
Sep 27 20:44:04 Server kernel: [  110.627707]  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Sep 27 20:44:04 Server kernel: [  110.627718]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Sep 27 20:44:04 Server kernel: [  110.627725]  [<ffffffff8103d768>] ? xen_safe_halt+0xc/0x13
Sep 27 20:44:04 Server kernel: [  110.627731]  [<ffffffff8104ae0b>] ? default_idle+0x14/0x3e
Sep 27 20:44:04 Server kernel: [  110.627740]  [<ffffffff810b53ee>] ? cpu_startup_entry+0x107/0x160
Sep 27 20:44:04 Server kernel: [  110.627749] Code: fb 4c 89 d6 b9 08 00 00 00 ff cd 83 fd ff 74 32 44 0f be 2e 44 29 c1 8b 44 24 18 4d 8d 73 04 41 d3 fd 44 23 6c 24 04 43 23 04 a9 <41> 89 c5 41 31 fd 45 89 2b 85 c9 75 05 48 ff c6 b1 08 4d 89 f3 
Sep 27 20:44:04 Server kernel: [  110.627876] NMI backtrace for cpu 2
Sep 27 20:44:04 Server kernel: [  110.627885] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.12.0-rc2 #2
Sep 27 20:44:04 Server kernel: [  110.627889] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
Sep 27 20:44:04 Server kernel: [  110.627893] task: ffff8800659027d0 ti: ffff880065920000 task.ti: ffff880065920000
Sep 27 20:44:04 Server kernel: [  110.627897] RIP: e030:[<ffffffff810a3b9b>]  [<ffffffff810a3b9b>] irqtime_account_process_tick.isra.2+0x121/0x239
Sep 27 20:44:04 Server kernel: [  110.627916] RSP: e02b:ffff880065921ea8  EFLAGS: 00000046
Sep 27 20:44:04 Server kernel: [  110.627919] RAX: ffff88006591b080 RBX: ffff8800659027d0 RCX: 0000000000000003
Sep 27 20:44:04 Server kernel: [  110.627922] RDX: 00000000000b61e4 RSI: 0000000000000000 RDI: 0000000000000002
Sep 27 20:44:04 Server kernel: [  110.627925] RBP: 000000000000002c R08: 000000009f754700 R09: 00000000fffd00e6
Sep 27 20:44:04 Server kernel: [  110.627928] R10: 0000000000006cd6 R11: 0000000000000020 R12: ffff88007de8e2d0
Sep 27 20:44:04 Server kernel: [  110.627931] R13: 0000000000000000 R14: ffff88007de94b50 R15: 0000000000000002
Sep 27 20:44:04 Server kernel: [  110.627941] FS:  00007fb294ab4900(0000) GS:ffff88007de80000(0000) knlGS:0000000000000000
Sep 27 20:44:04 Server kernel: [  110.627944] CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
Sep 27 20:44:04 Server kernel: [  110.627947] CR2: 00007fb2942a1450 CR3: 0000000064faa000 CR4: 0000000000000660
Sep 27 20:44:04 Server kernel: [  110.627952] Stack:
Sep 27 20:44:04 Server kernel: [  110.627954]  000000000000002c 0000000000000002 00000000000000dc ffff88007de94340
Sep 27 20:44:04 Server kernel: [  110.627967]  00000000000008e6 0000000000006cd7 ffff8800659027d0 0000000000000000
Sep 27 20:44:04 Server kernel: [  110.627976]  0000000000000000 ffffffff810a3efc ffff88007de8e900 00000018405ba35a
Sep 27 20:44:04 Server kernel: [  110.627986] Call Trace:
Sep 27 20:44:04 Server kernel: [  110.627991]  [<ffffffff810a3efc>] ? account_idle_ticks+0x4b/0x59
Sep 27 20:44:04 Server kernel: [  110.628001]  [<ffffffff810c24a4>] ? tick_nohz_idle_exit+0x12e/0x137
Sep 27 20:44:04 Server kernel: [  110.628016]  [<ffffffff810b543d>] ? cpu_startup_entry+0x156/0x160
Sep 27 20:44:04 Server kernel: [  110.628027] Code: 03 00 00 00 74 6d 45 85 ed 74 20 48 83 c4 18 48 89 df ba 01 00 00 00 5b 5d 41 5c 41 5d 41 5e 41 5f be 01 00 00 00 e9 d1 fd ff ff <49> 3b 1e 75 18 48 83 c4 18 bf 01 00 00 00 5b 5d 41 5c 41 5d 41 
Sep 27 20:44:04 Server kernel: [  110.628169] NMI backtrace for cpu 3
Sep 27 20:44:04 Server kernel: [  110.628179] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 3.12.0-rc2 #2
Sep 27 20:44:04 Server kernel: [  110.628183] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
Sep 27 20:44:04 Server kernel: [  110.628188] task: ffff880065902040 ti: ffff880065922000 task.ti: ffff880065922000
Sep 27 20:44:04 Server kernel: [  110.628192] RIP: e030:[<ffffffff812210bd>]  [<ffffffff812210bd>] debug_smp_processor_id+0xa/0xf3
Sep 27 20:44:04 Server kernel: [  110.628209] RSP: e02b:ffff880065923ec8  EFLAGS: 00000096
Sep 27 20:44:04 Server kernel: [  110.628212] RAX: ffff880065923fd8 RBX: 000000000000e2d0 RCX: 0000000000000003
Sep 27 20:44:04 Server kernel: [  110.628215] RDX: 00000000000caf42 RSI: 0000000000000000 RDI: 0000000000000001
Sep 27 20:44:04 Server kernel: [  110.628218] RBP: 0000000000014340 R08: 000000009f754700 R09: 00000000fffd00e6
Sep 27 20:44:04 Server kernel: [  110.628221] R10: 0000000000006cd6 R11: 0000000000000020 R12: 0000000000000001
Sep 27 20:44:04 Server kernel: [  110.628224] R13: ffff880065902040 R14: 0000000000000000 R15: 0000000000000000
Sep 27 20:44:04 Server kernel: [  110.628237] FS:  00007fb294bb7700(0000) GS:ffff88007dec0000(0000) knlGS:0000000000000000
Sep 27 20:44:04 Server kernel: [  110.628240] CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
Sep 27 20:44:04 Server kernel: [  110.628244] CR2: 00007fb284000010 CR3: 0000000064faa000 CR4: 0000000000000660
Sep 27 20:44:04 Server kernel: [  110.628249] Stack:
Sep 27 20:44:04 Server kernel: [  110.628252]  000000000000e2d0 ffffffff810a3a48 ffff88007ded4340 000000000000092a
Sep 27 20:44:04 Server kernel: [  110.628265]  0000000000006cd7 ffffffff810a3efc ffff88007dece900 00000018405b6767
Sep 27 20:44:04 Server kernel: [  110.628275]  0000000000000003 0000000000000000 ffffffff810c24a4 ffff880065923fd8
Sep 27 20:44:04 Server kernel: [  110.628286] Call Trace:
Sep 27 20:44:04 Server kernel: [  110.628290]  [<ffffffff810a3a48>] ? account_idle_time+0x1a/0x4c
Sep 27 20:44:04 Server kernel: [  110.628302]  [<ffffffff810a3efc>] ? account_idle_ticks+0x4b/0x59
Sep 27 20:44:04 Server kernel: [  110.628311]  [<ffffffff810c24a4>] ? tick_nohz_idle_exit+0x12e/0x137
Sep 27 20:44:04 Server kernel: [  110.628319]  [<ffffffff810b543d>] ? cpu_startup_entry+0x156/0x160
Sep 27 20:44:04 Server kernel: [  110.628330] Code: 48 89 de 4c 89 ff e8 e7 f1 ff ff 48 85 c0 75 aa 31 c0 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 53 65 48 8b 04 25 60 c8 00 00 <65> 8b 1c 25 c4 b0 00 00 83 b8 44 e0 ff ff 00 0f 85 d0 00 00 00 
Sep 27 20:44:04 Server kernel: [  110.628465] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.115 msecs
Sep 27 20:44:04 Server kernel: [  110.628485] NMI backtrace for cpu 4
Sep 27 20:44:04 Server kernel: [  110.628496] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 3.12.0-rc2 #2
Sep 27 20:44:04 Server kernel: [  110.628501] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
Sep 27 20:44:04 Server kernel: [  110.628506] task: ffff880065903810 ti: ffff880065926000 task.ti: ffff880065926000
Sep 27 20:44:04 Server kernel: [  110.628510] RIP: e030:[<ffffffff812210bd>]  [<ffffffff812210bd>] debug_smp_processor_id+0xa/0xf3
Sep 27 20:44:04 Server kernel: [  110.628528] RSP: e02b:ffff880065927ec8  EFLAGS: 00000082
Sep 27 20:44:04 Server kernel: [  110.628531] RAX: ffff880065927fd8 RBX: ffff88007df0e2d0 RCX: 0000000000000003
Sep 27 20:44:04 Server kernel: [  110.628534] RDX: 000000000006e825 RSI: 0000000000000000 RDI: 0000000000000001
Sep 27 20:44:04 Server kernel: [  110.628538] RBP: 0000000000014340 R08: 000000009f754700 R09: 00000000fffd00e6
Sep 27 20:44:04 Server kernel: [  110.628541] R10: 0000000000006cd6 R11: 0000000000000020 R12: 0000000000000001
Sep 27 20:44:04 Server kernel: [  110.628544] R13: ffff880065903810 R14: 0000000000000000 R15: 0000000000000000
Sep 27 20:44:04 Server kernel: [  110.628556] FS:  00007f3d4ba6f700(0000) GS:ffff88007df00000(0000) knlGS:0000000000000000
Sep 27 20:44:04 Server kernel: [  110.628560] CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
Sep 27 20:44:04 Server kernel: [  110.628563] CR2: 00007f3d4a021fc8 CR3: 000000001d62a000 CR4: 0000000000000660
Sep 27 20:44:04 Server kernel: [  110.628569] Stack:
Sep 27 20:44:04 Server kernel: [  110.628571]  ffff88007df0e2d0 ffffffff810a3a57 ffff88007df14340 0000000000000902
Sep 27 20:44:04 Server kernel: [  110.628585]  0000000000006cd7 ffffffff810a3efc ffff88007df0e900 00000018405b965e
Sep 27 20:44:04 Server kernel: [  110.628598]  0000000000000004 0000000000000000 ffffffff810c24a4 ffff880065927fd8
Sep 27 20:44:04 Server kernel: [  110.628609] Call Trace:
Sep 27 20:44:04 Server kernel: [  110.628613]  [<ffffffff810a3a57>] ? account_idle_time+0x29/0x4c
Sep 27 20:44:04 Server kernel: [  110.628625]  [<ffffffff810a3efc>] ? account_idle_ticks+0x4b/0x59
Sep 27 20:44:04 Server kernel: [  110.628635]  [<ffffffff810c24a4>] ? tick_nohz_idle_exit+0x12e/0x137
Sep 27 20:44:04 Server kernel: [  110.628644]  [<ffffffff810b543d>] ? cpu_startup_entry+0x156/0x160
Sep 27 20:44:04 Server kernel: [  110.628655] Code: 48 89 de 4c 89 ff e8 e7 f1 ff ff 48 85 c0 75 aa 31 c0 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 53 65 48 8b 04 25 60 c8 00 00 <65> 8b 1c 25 c4 b0 00 00 83 b8 44 e0 ff ff 00 0f 85 d0 00 00 00 
Sep 27 20:44:04 Server kernel: [  110.628811] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.457 msecs
Sep 27 20:44:04 Server kernel: [  110.628830] NMI backtrace for cpu 5
Sep 27 20:44:04 Server kernel: [  110.628840] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 3.12.0-rc2 #2
Sep 27 20:44:04 Server kernel: [  110.628843] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
Sep 27 20:44:04 Server kernel: [  110.628849] task: ffff880065903080 ti: ffff88006592a000 task.ti: ffff88006592a000
Sep 27 20:44:04 Server kernel: [  110.628854] RIP: e030:[<ffffffff810845d8>]  [<ffffffff810845d8>] run_timer_softirq+0xe9/0x1da
Sep 27 20:44:04 Server kernel: [  110.628870] RSP: e02b:ffff88007df43f00  EFLAGS: 00000082
Sep 27 20:44:04 Server kernel: [  110.628875] RAX: ffff880065a4c3f8 RBX: ffff880065a4c000 RCX: ffff88007df43f00
Sep 27 20:44:04 Server kernel: [  110.628878] RDX: ffff880065a4c3d0 RSI: ffff88007df43ec8 RDI: ffff880065a4c000
Sep 27 20:44:04 Server kernel: [  110.628882] RBP: 000000000000003d R08: ffff88001e96e200 R09: 0000000000000000
Sep 27 20:44:04 Server kernel: [  110.628886] R10: 0000000002000000 R11: 0000000000000000 R12: ffff88007df43f00
Sep 27 20:44:04 Server kernel: [  110.628890] R13: ffffffff813a214b R14: ffff880061324000 R15: 0000000000000000
Sep 27 20:44:04 Server kernel: [  110.628903] FS:  00007ff37913a700(0000) GS:ffff88007df40000(0000) knlGS:0000000000000000
Sep 27 20:44:04 Server kernel: [  110.628907] CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
Sep 27 20:44:04 Server kernel: [  110.628911] CR2: 00007fb2937b37d6 CR3: 0000000064f6b000 CR4: 0000000000000660
Sep 27 20:44:04 Server kernel: [  110.628916] Stack:
Sep 27 20:44:04 Server kernel: [  110.628918]  ffff880065a4c3f8 ffff88007df43f00 ffffffff81605088 0000000000000141
Sep 27 20:44:04 Server kernel: [  110.628931]  0000000000000001 0000000000000101 ffff88006592bfd8 ffff88006592bfd8
Sep 27 20:44:04 Server kernel: [  110.628944]  ffffffff8107e3c4 0000000a00000001 00000000fffd0300 0000000510200040
Sep 27 20:44:04 Server kernel: [  110.628956] Call Trace:
Sep 27 20:44:04 Server kernel: [  110.628959]  <IRQ>  [<ffffffff8107e3c4>] ? __do_softirq+0xd1/0x210
Sep 27 20:44:04 Server kernel: [  110.628973]  [<ffffffff8103dafa>] ? xen_clocksource_read+0x61/0x68
Sep 27 20:44:04 Server kernel: [  110.628984]  [<ffffffff8107e5de>] ? irq_exit+0x4c/0x8b
Sep 27 20:44:04 Server kernel: [  110.628992]  [<ffffffff8129663f>] ? xen_evtchn_do_upcall+0x27/0x32
Sep 27 20:44:04 Server kernel: [  110.629002]  [<ffffffff813e897e>] ? xen_do_hypervisor_callback+0x1e/0x30
Sep 27 20:44:04 Server kernel: [  110.629012]  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Sep 27 20:44:04 Server kernel: [  110.629024]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Sep 27 20:44:04 Server kernel: [  110.629031]  [<ffffffff8103d768>] ? xen_safe_halt+0xc/0x13
Sep 27 20:44:04 Server kernel: [  110.629039]  [<ffffffff8104ae0b>] ? default_idle+0x14/0x3e
Sep 27 20:44:04 Server kernel: [  110.629048]  [<ffffffff810b53ee>] ? cpu_startup_entry+0x107/0x160
Sep 27 20:44:04 Server kernel: [  110.629060] Code: c2 83 e2 3f e8 c4 fe ff ff 48 63 d5 48 ff 43 10 48 c1 e2 04 48 01 da 48 8b 4a 28 48 8d 42 28 4c 89 61 08 48 89 0c 24 48 8b 4a 30 <48> 89 4c 24 08 4c 89 21 48 89 42 28 48 89 40 08 48 8b 2c 24 4c 
Sep 27 20:44:04 Server kernel: [  110.629231] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.877 msecs
Sep 27 20:44:04 Server kernel: [  111.582647] 
Sep 27 20:44:04 Server kernel: [  111.607328] sending NMI to all CPUs:
Sep 27 20:44:04 Server kernel: [  111.613457] NMI backtrace for cpu 1
Sep 27 20:44:04 Server kernel: [  111.619175] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.12.0-rc2 #2
Sep 27 20:44:04 Server kernel: [  111.624954] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
Sep 27 20:44:04 Server kernel: [  111.630937] task: ffff8800658da080 ti: ffff880065900000 task.ti: ffff880065900000
Sep 27 20:44:04 Server kernel: [  111.636741] RIP: e030:[<ffffffff8100130a>]  [<ffffffff8100130a>] xen_hypercall_vcpu_op+0xa/0x20
Sep 27 20:44:04 Server kernel: [  111.642471] RSP: e02b:ffff88007de43cb8  EFLAGS: 00000046
Sep 27 20:44:04 Server kernel: [  111.648139] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff8100130a
Sep 27 20:44:04 Server kernel: [  111.653837] RDX: 00000000deadbeef RSI: 00000000deadbeef RDI: 00000000deadbeef
Sep 27 20:44:04 Server kernel: [  111.659470] RBP: ffffffff816a3c50 R08: ffffffff816a3c50 R09: 0000000000000000
Sep 27 20:44:04 Server kernel: [  111.665045] R10: ffff88001e007520 R11: 0000000000000246 R12: 0000000000000005
Sep 27 20:44:04 Server kernel: [  111.670561] R13: ffffffff81642080 R14: ffff880065900000 R15: 0000000000000001
Sep 27 20:44:04 Server kernel: [  111.676022] FS:  00007fb294ab4900(0000) GS:ffff88007de40000(0000) knlGS:0000000000000000
Sep 27 20:44:04 Server kernel: [  111.681508] CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
Sep 27 20:44:04 Server kernel: [  111.687002] CR2: 00007fb29177a9a0 CR3: 000000000160c000 CR4: 0000000000000660
Sep 27 20:44:04 Server kernel: [  111.692559] Stack:
Sep 27 20:44:04 Server kernel: [  111.698091]  0001779000000018 0000000000000001 ffffffff8129623c ffffffff810420c3
Sep 27 20:44:04 Server kernel: [  111.703783]  0000000000002710 ffff88007de4eaf0 0000000000000001 ffffffff81067ee0
Sep 27 20:44:04 Server kernel: [  111.709454]  ffffffff81642080 ffffffff810e44cd 0000000000000003 0000000000000001
Sep 27 20:44:04 Server kernel: [  111.715101] Call Trace:
Sep 27 20:44:04 Server kernel: [  111.720680]  <IRQ> 
Sep 27 20:44:04 Server kernel: [  111.720713]  [<ffffffff8129623c>] ? xen_send_IPI_one+0x16/0x4f
Sep 27 20:44:04 Server kernel: [  111.731750]  [<ffffffff810420c3>] ? __xen_send_IPI_mask+0x32/0x39
Sep 27 20:44:04 Server kernel: [  111.737315]  [<ffffffff81067ee0>] ? arch_trigger_all_cpu_backtrace+0x4d/0x7e
Sep 27 20:44:04 Server kernel: [  111.742894]  [<ffffffff810e44cd>] ? rcu_check_callbacks+0x22f/0x598
Sep 27 20:44:04 Server kernel: [  111.748439]  [<ffffffff810c232a>] ? tick_sched_do_timer+0x2e/0x2e
Sep 27 20:44:04 Server kernel: [  111.753959]  [<ffffffff81084c35>] ? update_process_times+0x30/0x5b
Sep 27 20:44:04 Server kernel: [  111.759509]  [<ffffffff810c2237>] ? tick_sched_handle+0x3e/0x4a
Sep 27 20:44:04 Server kernel: [  111.765057]  [<ffffffff810c235a>] ? tick_sched_timer+0x30/0x4c
Sep 27 20:44:04 Server kernel: [  111.770589]  [<ffffffff81098355>] ? __run_hrtimer+0x93/0x159
Sep 27 20:44:04 Server kernel: [  111.775965]  [<ffffffff81098b72>] ? hrtimer_interrupt+0xe3/0x1ca
Sep 27 20:44:04 Server kernel: [  111.781187]  [<ffffffff8103d8e4>] ? xen_timer_interrupt+0x31/0x13b
Sep 27 20:44:04 Server kernel: [  111.786351]  [<ffffffff81294c4c>] ? HYPERVISOR_event_channel_op+0xd/0x1d
Sep 27 20:44:04 Server kernel: [  111.791518]  [<ffffffff8103d79b>] ? xen_force_evtchn_callback+0x9/0xa
Sep 27 20:44:04 Server kernel: [  111.796621]  [<ffffffff8103df22>] ? check_events+0x12/0x20
Sep 27 20:44:04 Server kernel: [  111.801642]  [<ffffffff810b5b7a>] ? handle_irq_event_percpu+0x4d/0x1c5
Sep 27 20:44:04 Server kernel: [  111.806627]  [<ffffffff813e556e>] ? notifier_call_chain+0x32/0x52
Sep 27 20:44:04 Server kernel: [  111.811569]  [<ffffffff810b8287>] ? handle_percpu_irq+0x39/0x4c
Sep 27 20:44:04 Server kernel: [  111.816480]  [<ffffffff812951c0>] ? __xen_evtchn_do_upcall+0x107/0x2cb
Sep 27 20:44:04 Server kernel: [  111.821431]  [<ffffffff81219936>] ? delay_tsc+0x9c/0xc6
Sep 27 20:44:04 Server kernel: [  111.826362]  [<ffffffff81093ba0>] ? __rcu_read_unlock+0x33/0x51
Sep 27 20:44:04 Server kernel: [  111.831310]  [<ffffffff8129663a>] ? xen_evtchn_do_upcall+0x22/0x32
Sep 27 20:44:04 Server kernel: [  111.836250]  [<ffffffff813e897e>] ? xen_do_hypervisor_callback+0x1e/0x30
Sep 27 20:44:04 Server kernel: [  111.841164]  <EOI> 
Sep 27 20:44:04 Server kernel: [  111.841198]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Sep 27 20:44:04 Server kernel: [  111.850915]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Sep 27 20:44:04 Server kernel: [  111.855742]  [<ffffffff8103d768>] ? xen_safe_halt+0xc/0x13
Sep 27 20:44:04 Server kernel: [  111.860517]  [<ffffffff8104ae0b>] ? default_idle+0x14/0x3e
Sep 27 20:44:04 Server kernel: [  111.865255]  [<ffffffff810b53ee>] ? cpu_startup_entry+0x107/0x160
Sep 27 20:44:04 Server kernel: [  111.869988] Code: cc 51 41 53 50 b8 17 00 00 00 0f 05 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 18 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 
Sep 27 20:44:04 Server kernel: [  111.875231] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 261.772 msecs
Sep 27 20:44:04 Server kernel: [  111.875238] NMI backtrace for cpu 0
Sep 27 20:44:04 Server kernel: [  111.875241] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.12.0-rc2 #2
Sep 27 20:44:04 Server kernel: [  111.875242] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
Sep 27 20:44:04 Server kernel: [  111.875243] task: ffffffff81613430 ti: ffffffff81600000 task.ti: ffffffff81600000
Sep 27 20:44:04 Server kernel: [  111.875247] RIP: e030:[<ffffffff810013aa>]  [<ffffffff810013aa>] xen_hypercall_sched_op+0xa/0x20
Sep 27 20:44:04 Server kernel: [  111.875248] RSP: e02b:ffffffff81601ed8  EFLAGS: 00000246
Sep 27 20:44:04 Server kernel: [  111.875248] RAX: 0000000000000000 RBX: ffffffff81601fd8 RCX: ffffffff810013aa
Sep 27 20:44:04 Server kernel: [  111.875249] RDX: 0000000000000000 RSI: 00000000deadbeef RDI: 00000000deadbeef
Sep 27 20:44:04 Server kernel: [  111.875250] RBP: ffffffff81601fd8 R08: 0000000000000000 R09: 0000000000000000
Sep 27 20:44:04 Server kernel: [  111.875251] R10: 0000000000000001 R11: 0000000000000246 R12: ffffffff817572c0
Sep 27 20:44:04 Server kernel: [  111.875251] R13: ffff88007e102540 R14: 0000000000000000 R15: 0000000000000000
Sep 27 20:44:04 Server kernel: [  111.875254] FS:  00007fb294ab4900(0000) GS:ffff88007de00000(0000) knlGS:0000000000000000
Sep 27 20:44:04 Server kernel: [  111.875255] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Sep 27 20:44:04 Server kernel: [  111.875256] CR2: 00007fb291aa8630 CR3: 0000000064f6b000 CR4: 0000000000000660
Sep 27 20:44:04 Server kernel: [  111.875257] Stack:
Sep 27 20:44:04 Server kernel: [  111.875260]  ffff88007de0ec90 00000000ffffffff ffffffff8103d768 ffffffff8104ae0b
Sep 27 20:44:04 Server kernel: [  111.875262]  ffffffff810b53ee 080fc07cf5ca8820 ffffffffffffffff ffffffff8174e8d0
Sep 27 20:44:04 Server kernel: [  111.875264]  ffffffff816c3d3c ffffffff816c376e ffffffff817572c0 ffff88001f2fc000
Sep 27 20:44:04 Server kernel: [  111.875264] Call Trace:
Sep 27 20:44:04 Server kernel: [  111.875267]  [<ffffffff8103d768>] ? xen_safe_halt+0xc/0x13
Sep 27 20:44:04 Server kernel: [  111.875269]  [<ffffffff8104ae0b>] ? default_idle+0x14/0x3e
Sep 27 20:44:04 Server kernel: [  111.875271]  [<ffffffff810b53ee>] ? cpu_startup_entry+0x107/0x160
Sep 27 20:44:04 Server kernel: [  111.875274]  [<ffffffff816c3d3c>] ? start_kernel+0x3f1/0x3fc
Sep 27 20:44:04 Server kernel: [  111.875276]  [<ffffffff816c376e>] ? repair_env_string+0x54/0x54
Sep 27 20:44:04 Server kernel: [  111.875277]  [<ffffffff816c87f9>] ? xen_start_kernel+0x4bb/0x4c5
Sep 27 20:44:04 Server kernel: [  111.875295] Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 
Sep 27 20:44:04 Server kernel: [  112.009969] NMI backtrace for cpu 2
Sep 27 20:44:04 Server kernel: [  112.014884] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.12.0-rc2 #2
Sep 27 20:44:04 Server kernel: [  112.019793] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
Sep 27 20:44:04 Server kernel: [  112.024801] task: ffff8800659027d0 ti: ffff880065920000 task.ti: ffff880065920000
Sep 27 20:44:04 Server kernel: [  112.029813] RIP: e030:[<ffffffff810013aa>]  [<ffffffff810013aa>] xen_hypercall_sched_op+0xa/0x20
Sep 27 20:44:04 Server kernel: [  112.034879] RSP: e02b:ffff880065921f18  EFLAGS: 00000246
Sep 27 20:44:04 Server kernel: [  112.039895] RAX: 0000000000000000 RBX: ffff880065921fd8 RCX: ffffffff810013aa
Sep 27 20:44:04 Server kernel: [  112.044957] RDX: 0000000000000000 RSI: 00000000deadbeef RDI: 00000000deadbeef
Sep 27 20:44:04 Server kernel: [  112.050007] RBP: ffff880065921fd8 R08: 0000000000000000 R09: 0000000000000000
Sep 27 20:44:04 Server kernel: [  112.055021] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
Sep 27 20:44:04 Server kernel: [  112.060045] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Sep 27 20:44:04 Server kernel: [  112.065022] FS:  00007fb294ab4900(0000) GS:ffff88007de80000(0000) knlGS:0000000000000000
Sep 27 20:44:04 Server kernel: [  112.070042] CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
Sep 27 20:44:04 Server kernel: [  112.075049] CR2: 00007fb2942a1450 CR3: 0000000064faa000 CR4: 0000000000000660
Sep 27 20:44:04 Server kernel: [  112.080086] Stack:
Sep 27 20:44:04 Server kernel: [  112.085084]  ffff88007de8ec90 00000000ffffffff ffffffff8103d768 ffffffff8104ae0b
Sep 27 20:44:04 Server kernel: [  112.090211]  ffffffff810b53ee d34efe2df012d167 0000000000000000 0000000000000000
Sep 27 20:44:04 Server kernel: [  112.095371]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
Sep 27 20:44:04 Server kernel: [  112.100543] Call Trace:
Sep 27 20:44:04 Server kernel: [  112.105675]  [<ffffffff8103d768>] ? xen_safe_halt+0xc/0x13
Sep 27 20:44:04 Server kernel: [  112.110874]  [<ffffffff8104ae0b>] ? default_idle+0x14/0x3e
Sep 27 20:44:04 Server kernel: [  112.116055]  [<ffffffff810b53ee>] ? cpu_startup_entry+0x107/0x160
Sep 27 20:44:04 Server kernel: [  112.121232] Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 
Sep 27 20:44:04 Server kernel: [  112.126961] NMI backtrace for cpu 3
Sep 27 20:44:04 Server kernel: [  112.132408] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 3.12.0-rc2 #2
Sep 27 20:44:04 Server kernel: [  112.137869] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
Sep 27 20:44:04 Server kernel: [  112.143459] task: ffff880065902040 ti: ffff880065922000 task.ti: ffff880065922000
Sep 27 20:44:04 Server kernel: [  112.149079] RIP: e030:[<ffffffff810013aa>]  [<ffffffff810013aa>] xen_hypercall_sched_op+0xa/0x20
Sep 27 20:44:04 Server kernel: [  112.154783] RSP: e02b:ffff880065923f18  EFLAGS: 00000246
Sep 27 20:44:04 Server kernel: [  112.160476] RAX: 0000000000000000 RBX: ffff880065923fd8 RCX: ffffffff810013aa
Sep 27 20:44:04 Server kernel: [  112.166229] RDX: 0000000000000000 RSI: 00000000deadbeef RDI: 00000000deadbeef
Sep 27 20:44:04 Server kernel: [  112.171984] RBP: ffff880065923fd8 R08: 0000000000000000 R09: 0000000000000000
Sep 27 20:44:04 Server kernel: [  112.177736] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
Sep 27 20:44:04 Server kernel: [  112.183445] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Sep 27 20:44:04 Server kernel: [  112.189101] FS:  00007fb294bb7700(0000) GS:ffff88007dec0000(0000) knlGS:0000000000000000
Sep 27 20:44:04 Server kernel: [  112.194845] CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
Sep 27 20:44:04 Server kernel: [  112.200596] CR2: 00007fb284000010 CR3: 0000000064faa000 CR4: 0000000000000660
Sep 27 20:44:04 Server kernel: [  112.206414] Stack:
Sep 27 20:44:04 Server kernel: [  112.212190]  ffff88007decec90 00000000ffffffff ffffffff8103d768 ffffffff8104ae0b
Sep 27 20:44:04 Server kernel: [  112.217958]  ffffffff810b53ee 420b348785d48aaa 0000000000000000 0000000000000000
Sep 27 20:44:04 Server kernel: [  112.223591]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
Sep 27 20:44:04 Server kernel: [  112.229156] Call Trace:
Sep 27 20:44:04 Server kernel: [  112.234662]  [<ffffffff8103d768>] ? xen_safe_halt+0xc/0x13
Sep 27 20:44:04 Server kernel: [  112.240158]  [<ffffffff8104ae0b>] ? default_idle+0x14/0x3e
Sep 27 20:44:04 Server kernel: [  112.245570]  [<ffffffff810b53ee>] ? cpu_startup_entry+0x107/0x160
Sep 27 20:44:04 Server kernel: [  112.250932] Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 
Sep 27 20:44:04 Server kernel: [  112.256812] NMI backtrace for cpu 4
Sep 27 20:44:04 Server kernel: [  112.262371] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 3.12.0-rc2 #2
Sep 27 20:44:04 Server kernel: [  112.267966] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
Sep 27 20:44:04 Server kernel: [  112.273692] task: ffff880065903810 ti: ffff880065926000 task.ti: ffff880065926000
Sep 27 20:44:04 Server kernel: [  112.279450] RIP: e030:[<ffffffff810013aa>]  [<ffffffff810013aa>] xen_hypercall_sched_op+0xa/0x20
Sep 27 20:44:04 Server kernel: [  112.285302] RSP: e02b:ffff880065927f18  EFLAGS: 00000246
Sep 27 20:44:04 Server kernel: [  112.291115] RAX: 0000000000000000 RBX: ffff880065927fd8 RCX: ffffffff810013aa
Sep 27 20:44:04 Server kernel: [  112.296957] RDX: 0000000000000000 RSI: 00000000deadbeef RDI: 00000000deadbeef
Sep 27 20:44:04 Server kernel: [  112.302774] RBP: ffff880065927fd8 R08: 0000000000000000 R09: 0000000000000000
Sep 27 20:44:04 Server kernel: [  112.308563] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
Sep 27 20:44:04 Server kernel: [  112.314302] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Sep 27 20:44:04 Server kernel: [  112.320010] FS:  00007f3d4ba6f700(0000) GS:ffff88007df00000(0000) knlGS:0000000000000000
Sep 27 20:44:04 Server kernel: [  112.325758] CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
Sep 27 20:44:04 Server kernel: [  112.331469] CR2: 00007f3d4a01dff0 CR3: 000000001d62a000 CR4: 0000000000000660
Sep 27 20:44:04 Server kernel: [  112.337224] Stack:
Sep 27 20:44:04 Server kernel: [  112.342954]  ffff88007df0ec90 00000000ffffffff ffffffff8103d768 ffffffff8104ae0b
Sep 27 20:44:04 Server kernel: [  112.348823]  ffffffff810b53ee 4d70db5bbfed8e0e 0000000000000000 0000000000000000
Sep 27 20:44:04 Server kernel: [  112.354722]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
Sep 27 20:44:04 Server kernel: [  112.360623] Call Trace:
Sep 27 20:44:04 Server kernel: [  112.366494]  [<ffffffff8103d768>] ? xen_safe_halt+0xc/0x13
Sep 27 20:44:04 Server kernel: [  112.372276]  [<ffffffff8104ae0b>] ? default_idle+0x14/0x3e
Sep 27 20:44:04 Server kernel: [  112.377911]  [<ffffffff810b53ee>] ? cpu_startup_entry+0x107/0x160
Sep 27 20:44:04 Server kernel: [  112.383530] Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 
Sep 27 20:44:04 Server kernel: [  112.389630] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 379.658 msecs
Sep 27 20:44:04 Server kernel: [  112.389638] NMI backtrace for cpu 5
Sep 27 20:44:04 Server kernel: [  112.389640] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 3.12.0-rc2 #2
Sep 27 20:44:04 Server kernel: [  112.389641] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
Sep 27 20:44:04 Server kernel: [  112.389642] task: ffff880065903080 ti: ffff88006592a000 task.ti: ffff88006592a000
Sep 27 20:44:04 Server kernel: [  112.389646] RIP: e030:[<ffffffff810013aa>]  [<ffffffff810013aa>] xen_hypercall_sched_op+0xa/0x20
Sep 27 20:44:04 Server kernel: [  112.389646] RSP: e02b:ffff88006592bf18  EFLAGS: 00000246
Sep 27 20:44:04 Server kernel: [  112.389647] RAX: 0000000000000000 RBX: ffff88006592bfd8 RCX: ffffffff810013aa
Sep 27 20:44:04 Server kernel: [  112.389648] RDX: 0000000000000000 RSI: 00000000deadbeef RDI: 00000000deadbeef
Sep 27 20:44:04 Server kernel: [  112.389649] RBP: ffff88006592bfd8 R08: 0000000000000000 R09: 0000000000000000
Sep 27 20:44:04 Server kernel: [  112.389649] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
Sep 27 20:44:04 Server kernel: [  112.389650] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Sep 27 20:44:04 Server kernel: [  112.389653] FS:  00007fb294ab4900(0000) GS:ffff88007df40000(0000) knlGS:0000000000000000
Sep 27 20:44:04 Server kernel: [  112.389654] CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
Sep 27 20:44:04 Server kernel: [  112.389655] CR2: 00007fb2937b37d6 CR3: 0000000064faa000 CR4: 0000000000000660
Sep 27 20:44:04 Server kernel: [  112.389656] Stack:
Sep 27 20:44:04 Server kernel: [  112.389659]  ffff88007df4ec90 00000000ffffffff ffffffff8103d768 ffffffff8104ae0b
Sep 27 20:44:04 Server kernel: [  112.389661]  ffffffff810b53ee e37f0facc99e13fc 0000000000000000 0000000000000000
Sep 27 20:44:04 Server kernel: [  112.389662]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
Sep 27 20:44:04 Server kernel: [  112.389663] Call Trace:
Sep 27 20:44:04 Server kernel: [  112.389666]  [<ffffffff8103d768>] ? xen_safe_halt+0xc/0x13
Sep 27 20:44:04 Server kernel: [  112.389668]  [<ffffffff8104ae0b>] ? default_idle+0x14/0x3e
Sep 27 20:44:04 Server kernel: [  112.389670]  [<ffffffff810b53ee>] ? cpu_startup_entry+0x107/0x160
Sep 27 20:44:04 Server kernel: [  112.389687] Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 
Sep 27 20:44:04 Server kernel: [  112.389690] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 379.716 msecs

* Re: Status of FLR in Xen 4.4
  2013-09-27 17:07       ` Matthias
@ 2013-09-27 17:28         ` Sander Eikelenboom
  2013-09-27 19:19           ` Matthias
  2013-09-27 17:53         ` Is: RCU callback detects an RCU hang with Linux 3.12+ Was: " Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 23+ messages in thread
From: Sander Eikelenboom @ 2013-09-27 17:28 UTC (permalink / raw)
  To: Matthias; +Cc: xen-devel@lists.xen.org, Ian Campbell

Hi Matthias,

Have you tried adding "no-cpuidle" to the Xen (hypervisor) command line in grub?

--
Sander

Friday, September 27, 2013, 7:07:33 PM, you wrote:

> Hi Konrad,

> good call! I was able to reproduce the error with the 3.12-rc2 kernel, got
> a lot of information with the new NMI traces (log attached), but since I'm
> not a xen hacker I don't really know how to continue from here. So I might
> add this to the original post and maybe someone can help me. After all the
> error persists for half a year now and besides 2 kernel version / .config
> Combinations (a 3.8.2 and a 3.6.something) I could never trace this issue
> back (even with bisecting the .config because at some point it seemed
> random).


> 2013/9/27 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

>> On Thu, Sep 26, 2013 at 07:59:40PM +0200, Matthias wrote:
>> > I'm currently on a vanilla 3.8.2 kernel because this is the only >3.4
>> > kernel I found which doesn't give me this issue:
>> > http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html
>>
>> So v3.12 (or rather the latest and greaters of the Linus) has the mechanism
>> for the NMI - so you can actually see what is causing the stall.
>>

* Is: RCU callback detects an RCU hang with Linux 3.12+ Was: Re: Status of FLR in Xen 4.4
  2013-09-27 17:07       ` Matthias
  2013-09-27 17:28         ` Sander Eikelenboom
@ 2013-09-27 17:53         ` Konrad Rzeszutek Wilk
  2013-10-03 22:34           ` Matthias
  1 sibling, 1 reply; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-09-27 17:53 UTC (permalink / raw)
  To: Matthias; +Cc: Ian Campbell, xen-devel@lists.xen.org

On Fri, Sep 27, 2013 at 07:07:33PM +0200, Matthias wrote:
> Hi Konrad,
> 
> good call! I was able to reproduce the error with the 3.12-rc2 kernel, got
> a lot of information with the new NMI traces (log attached), but since I'm
> not a xen hacker I don't really know how to continue from here. So I might
> add this to the original post and maybe someone can help me. After all the
> error persists for half a year now and besides 2 kernel version / .config
> Combinations (a 3.8.2 and a 3.6.something) I could never trace this issue
> back (even with bisecting the .config because at some point it seemed
> random).

Can you tell me a bit about how this happens? Is it happening after you
boot the machine? Does it happen after a specific workload?


It looks like something in RCU is taking far too long and
the RCU callback mechanism starts complaining. CPU0 is where the
RCU mechanism detects that something is off and starts sending NMIs to
all CPUs. CPU1 is the only one that appears to be running the RCU callback:


NMI backtrace for cpu 1
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.12.0-rc2 #2
Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
task: ffff8800658da080 ti: ffff880065900000 task.ti: ffff880065900000
RIP: e030:[<ffffffff8125b2b2>]  [<ffffffff8125b2b2>] cfb_imageblit+0x1b3/0x411
RSP: e02b:ffff88007de439f0  EFLAGS: 00000046
RAX: 0000000000000000 RBX: ffff88001e1c2800 RCX: 0000000000000003
RDX: 000000000000003b RSI: ffff88001e00614e RDI: 0000000000000000
RBP: 0000000000000013 R08: 0000000000000001 R09: ffffffff814655f0
R10: ffff88001e006116 R11: ffffc90014875710 R12: 000000000000000d
R13: 0000000000000000 R14: ffffc90014875714 R15: ffffc90014875000
FS:  00007fb294ab4900(0000) GS:ffff88007de40000(0000) knlGS:0000000000000000
CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00007fb29177a9a0 CR3: 000000000160c000 CR4: 0000000000000660
Stack:
 0000000100aaaaaa 00000000000001d8 0000000000000000 0000000000aaaaaa
 ffff8800532f0a40 ffff88001e1c2800 0000000000000001 ffff88001e1c2800
 0000000000000000 ffff88007d424400 00000000ffff00ff 000000000000003b
Call Trace:
 <IRQ>  [<ffffffff81256ac4>] ? bit_putcs+0x352/0x39d
 [<ffffffff81219825>] ? paravirt_read_tsc+0x5/0x8
 [<ffffffff81256772>] ? bit_cursor+0x45d/0x45d
 [<ffffffff812523a8>] ? fbcon_putcs+0xbd/0xcc
 [<ffffffff812bc6b6>] ? vt_console_print+0x234/0x290
 [<ffffffff810b336f>] ? call_console_drivers.constprop.18+0xb3/0xfc
 [<ffffffff810b3c7d>] ? console_unlock+0x131/0x306
 [<ffffffff810b420e>] ? vprintk_emit+0x3bc/0x3eb
 [<ffffffff812c92f5>] ? paravirt_read_tsc+0x5/0x8
 [<ffffffff812cae43>] ? add_interrupt_randomness+0x3f/0x15d
 [<ffffffff813db9c8>] ? printk+0x4f/0x51
 [<ffffffff810e4433>] ? rcu_check_callbacks+0x195/0x598                          <==================
 [<ffffffff810a3b50>] ? irqtime_account_process_tick.isra.2+0xd6/0x239
 [<ffffffff810c232a>] ? tick_sched_do_timer+0x2e/0x2e
 [<ffffffff81084c35>] ? update_process_times+0x30/0x5b
 [<ffffffff810c2237>] ? tick_sched_handle+0x3e/0x4a
 [<ffffffff810c235a>] ? tick_sched_timer+0x30/0x4c
 [<ffffffff81098355>] ? __run_hrtimer+0x93/0x159
 [<ffffffff81098b72>] ? hrtimer_interrupt+0xe3/0x1ca
 [<ffffffff8103d8e4>] ? xen_timer_interrupt+0x31/0x13b
 [<ffffffff81294c4c>] ? HYPERVISOR_event_channel_op+0xd/0x1d
 [<ffffffff8103d79b>] ? xen_force_evtchn_callback+0x9/0xa
 [<ffffffff8103df22>] ? check_events+0x12/0x20
 [<ffffffff810b5b7a>] ? handle_irq_event_percpu+0x4d/0x1c5
 [<ffffffff813e556e>] ? notifier_call_chain+0x32/0x52
 [<ffffffff810b8287>] ? handle_percpu_irq+0x39/0x4c
 [<ffffffff812951c0>] ? __xen_evtchn_do_upcall+0x107/0x2cb
 [<ffffffff81219936>] ? delay_tsc+0x9c/0xc6
 [<ffffffff81093ba0>] ? __rcu_read_unlock+0x33/0x51
 [<ffffffff8129663a>] ? xen_evtchn_do_upcall+0x22/0x32
 [<ffffffff813e897e>] ? xen_do_hypervisor_callback+0x1e/0x30
 <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
 [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
 [<ffffffff8103d768>] ? xen_safe_halt+0xc/0x13
 [<ffffffff8104ae0b>] ? default_idle+0x14/0x3e
 [<ffffffff810b53ee>] ? cpu_startup_entry+0x107/0x160
Code: fb 4c 89 d6 b9 08 00 00 00 ff cd 83 fd ff 74 32 44 0f be 2e 44 29 c1 8b 44 24 18 4d 8d 73 04 41 d3 fd 44 23 6c 24 04 43 23 04 a9 <41> 89 c5 41 31 fd 45 89 2b 85 c9 75 05 48 ff c6 b1 08 4d 89 f3 


That looks to be printing something on the VT console (which is running
in KMS mode, as it uses framebuffer calls). So is there something on the
screen scrolling wildly in a loop?
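
For anyone who would rather not read every trace by hand, here is a
minimal sketch (Python) that lists which CPUs have a given symbol in
their NMI backtrace. The function name cpus_calling and the SAMPLE
excerpt are made up for illustration, and the regular expressions only
assume the log layout seen in this thread, so treat it as a starting
point rather than a finished tool.

```python
import re
from collections import defaultdict

# Start of a per-CPU section, e.g. "NMI backtrace for cpu 1".
CPU_HEADER = re.compile(r"NMI backtrace for cpu (\d+)")
# A stack frame, e.g. "[<ffffffff810e4433>] ? rcu_check_callbacks+0x195/0x598".
FRAME = re.compile(
    r"\[<[0-9a-f]+>\] (?:\? )?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def cpus_calling(log_text, symbol):
    """Return the sorted CPU numbers whose backtrace contains `symbol`."""
    frames_by_cpu = defaultdict(set)
    current_cpu = None
    for line in log_text.splitlines():
        header = CPU_HEADER.search(line)
        if header:
            # Every frame seen from here on belongs to this CPU.
            current_cpu = int(header.group(1))
            continue
        if current_cpu is None:
            continue  # Ignore anything before the first backtrace header.
        for match in FRAME.finditer(line):
            frames_by_cpu[current_cpu].add(match.group(1))
    return sorted(cpu for cpu, frames in frames_by_cpu.items() if symbol in frames)


# Hypothetical excerpt in the same layout as the attached log.
SAMPLE = """\
NMI backtrace for cpu 1
 [<ffffffff810e4433>] ? rcu_check_callbacks+0x195/0x598
 [<ffffffff810c232a>] ? tick_sched_do_timer+0x2e/0x2e
NMI backtrace for cpu 2
 [<ffffffff8103d768>] ? xen_safe_halt+0xc/0x13
 [<ffffffff8104ae0b>] ? default_idle+0x14/0x3e
"""

if __name__ == "__main__":
    print(cpus_calling(SAMPLE, "rcu_check_callbacks"))
```

Fed the full attached log instead of SAMPLE, the same call should show
which CPUs were inside rcu_check_callbacks when the NMI arrived; the
syslog prefix on each line does not matter because the patterns are
searched anywhere in the line.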

But then there are also complains about 

INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.115 msecs

this taking too long. I am wondering if there is some time issue
on your box.

What version of Xen do you have?
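
The reading of the backtrace above can be reproduced mechanically. The following is a minimal sketch, assuming the trace is available as plain text in the format shown; the prefix list in CONSOLE_PREFIXES is a hypothetical choice made for this illustration, not a kernel-defined set.

```python
import re

# Matches frames such as " [<ffffffff810b3c7d>] ? console_unlock+0x131/0x306"
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

# Hypothetical prefixes for console / framebuffer code paths (illustration only).
CONSOLE_PREFIXES = ("cfb_", "bit_", "fbcon_", "vt_console_",
                    "call_console_", "console_")


def frame_symbols(trace):
    """Return the symbol names of all stack frames, in the order they appear."""
    return FRAME_RE.findall(trace)


def console_frames(trace):
    """Return only the frames that look like console or framebuffer output."""
    return [s for s in frame_symbols(trace) if s.startswith(CONSOLE_PREFIXES)]
```

If console_frames() returns entries that appear above rcu_check_callbacks in the trace, the CPU was busy printing to the framebuffer console from inside the timer interrupt, which matches the interpretation given above.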
> 
> 
> 2013/9/27 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> 
> > On Thu, Sep 26, 2013 at 07:59:40PM +0200, Matthias wrote:
> > > I'm currently on a vanilla 3.8.2 kernel because this is the only >3.4
> > > kernel I found which doesn't give me this issue:
> > > http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html
> >
> > So v3.12 (or rather the latest and greaters of the Linus) has the mechanism
> > for the NMI - so you can actually see what is causing the stall.
> >

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Status of FLR in Xen 4.4
  2013-09-27 17:28         ` Sander Eikelenboom
@ 2013-09-27 19:19           ` Matthias
  2013-09-27 19:33             ` Sander Eikelenboom
  0 siblings, 1 reply; 23+ messages in thread
From: Matthias @ 2013-09-27 19:19 UTC (permalink / raw)
  To: Sander Eikelenboom; +Cc: xen-devel@lists.xen.org, Ian Campbell


[-- Attachment #1.1: Type: text/plain, Size: 1812 bytes --]

Hi Sander,

thanks for the advice. I actually get no RCU stalls when I use the
no-cpuidle option. Do you have a little more insight into what is actually
causing this behaviour, and whether there is a better solution than this
option? I don't want to sacrifice my C-states (I would assume this makes the
overall server more power hungry?).

Does this have something to do with the new tickless-kernel options in the
newer kernels, or is this really only an ACPI incompatibility with Xen?

Thanks!


2013/9/27 Sander Eikelenboom <linux@eikelenboom.it>

> Hi Matthias,
>
> Have you tried adding "no-cpuidle" on the xen/hypervisor commandline in
> grub ?
>
> --
> Sander
>
> Friday, September 27, 2013, 7:07:33 PM, you wrote:
>
> > Hi Konrad,
>
> > good call! I was able to reproduce the error with the 3.12-rc2 kernel,
> got
> > a lot of information with the new NMI traces (log attached), but since
> I'm
> > not a xen hacker I don't really know how to continue from here. So I
> might
> > add this to the original post and maybe someone can help me. After all
> the
> > error persists for half a year now and besides 2 kernel version / .config
> > Combinations (a 3.8.2 and a 3.6.something) I could never trace this issue
> > back (even with bisecting the .config because at some point it seemed
> > random).
>
>
> > 2013/9/27 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>
> >> On Thu, Sep 26, 2013 at 07:59:40PM +0200, Matthias wrote:
> >> > I'm currently on a vanilla 3.8.2 kernel because this is the only >3.4
> >> > kernel I found which doesn't give me this issue:
> >> > http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html
> >>
> >> So v3.12 (or rather the latest and greaters of the Linus) has the
> mechanism
> >> for the NMI - so you can actually see what is causing the stall.
> >>
>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Status of FLR in Xen 4.4
  2013-09-27 19:19           ` Matthias
@ 2013-09-27 19:33             ` Sander Eikelenboom
  2013-09-27 19:48               ` Matthias
  0 siblings, 1 reply; 23+ messages in thread
From: Sander Eikelenboom @ 2013-09-27 19:33 UTC (permalink / raw)
  To: Matthias
  Cc: Jan Beulich, xen-devel@lists.xen.org, Ian Campbell,
	Suravee Suthikulanit


Friday, September 27, 2013, 9:19:14 PM, you wrote:

> Hi Sander,

> thanks for the advice, I have actually no rcu stalls when i use the no-cpuidle function. Do you have a little more insight on what is actually causing this behaviour and if there is a better solution then this option, cause I don't want to sacrifice my C-states (I would assume this makes the overall server more power hungry?).

> Does this has something to do with the new tickless-kernel options in the newer kernel, or is this really only an apci incompatibility with xen?

> Thanks!

 Are you running xen-unstable?
 Some patches went in lately.

 You also seem to have a motherboard with an AMD 890FX chipset; I suspect your BIOS also has issues around the HPET, as mine had.
 I was also seeing RCU stalls on boot (and only on boot). Hitting any key on the console when it appeared to stall during boot made it continue in my case (this happened several times).
 It took a while to find the problems; Jan Beulich has made and committed some patches that went into xen-unstable recently.

 If you are not running xen-unstable, could you give it a try and provide the xl dmesg / serial log?

 --
 Sander








>   2013/9/27 Sander Eikelenboom <linux@eikelenboom.it>

>   Hi Matthias,
>  
>  Have you tried adding "no-cpuidle" on the xen/hypervisor commandline in grub ?
>  
>  --
>  Sander
>  

>  Friday, September 27, 2013, 7:07:33 PM, you wrote:
>  
 >> Hi Konrad,
>  
 >> good call! I was able to reproduce the error with the 3.12-rc2 kernel, got
 >> a lot of information with the new NMI traces (log attached), but since I'm
 >> not a xen hacker I don't really know how to continue from here. So I might
 >> add this to the original post and maybe someone can help me. After all the
 >> error persists for half a year now and besides 2 kernel version / .config
 >> Combinations (a 3.8.2 and a 3.6.something) I could never trace this issue
 >> back (even with bisecting the .config because at some point it seemed
 >> random).
>  
>  
 >> 2013/9/27 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>  
 >>> On Thu, Sep 26, 2013 at 07:59:40PM +0200, Matthias wrote:
 >>> > I'm currently on a vanilla 3.8.2 kernel because this is the only >3.4
 >>> > kernel I found which doesn't give me this issue:
 >>> > http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html
 >>>
 >>> So v3.12 (or rather the latest and greaters of the Linus) has the mechanism
 >>> for the NMI - so you can actually see what is causing the stall.
 >>>
>  
>  

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Status of FLR in Xen 4.4
  2013-09-27 19:33             ` Sander Eikelenboom
@ 2013-09-27 19:48               ` Matthias
  2013-09-27 20:06                 ` Sander Eikelenboom
  0 siblings, 1 reply; 23+ messages in thread
From: Matthias @ 2013-09-27 19:48 UTC (permalink / raw)
  To: Sander Eikelenboom
  Cc: Jan Beulich, Suravee Suthikulanit, xen-devel@lists.xen.org


[-- Attachment #1.1: Type: text/plain, Size: 3053 bytes --]

Yes, I'm running the most recent xen-unstable staging tree, but I have had
these issues with xen-unstable since at least February, so I don't suspect
recent changes to be the cause in my case.

I will do some testing with switching from tickless-idle to non-tickless
and, since you mentioned HPET issues, maybe changing the clocksource. I will
see what happens.
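
Before changing the clocksource it helps to record which one the dom0 kernel is currently using. A minimal sketch, assuming the standard Linux sysfs clocksource directory; the function name read_clocksources is hypothetical, and this only shows the dom0 kernel's view, not the hypervisor's own time source.

```python
from pathlib import Path

# Standard Linux sysfs location for the kernel's clocksource selection.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources as seen by the running kernel."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available

# Example use on the dom0 (reads the real sysfs files):
#   current, available = read_clocksources()
#   print("current:", current, "available:", " ".join(available))
```

Writing one of the available names to current_clocksource as root switches the kernel's clocksource at runtime, so the test can be done without a reboot.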




2013/9/27 Sander Eikelenboom <linux@eikelenboom.it>

>
> Friday, September 27, 2013, 9:19:14 PM, you wrote:
>
> > Hi Sander,
>
> > thanks for the advice, I have actually no rcu stalls when i use the
> no-cpuidle function. Do you have a little more insight on what is actually
> causing this behaviour and if there is a better solution then this option,
> cause I don't want to sacrifice my C-states (I would assume this makes the
> overall server more power hungry?).
>
> > Does this has something to do with the new tickless-kernel options in
> the newer kernel, or is this really only an apci incompatibility with xen?
>
> > Thanks!
>
>  Are you running xen-unstable ?
>  Some patches went in lately
>
>  You also seem to have a motherboard with a AMD 890fx chipset, i suspect
> your bios also has issues around the HPET as mine had.
>  I was also seeing RCU stalls on boot (and only on boot) .. hitting any
> key on the console when it appears to stall during boot made it continue in
> my case (happens several times).
>  Took a while to find the problems, Jan Beulich has made and commited some
> patches that went in xen-unstable recently.
>
>  Are you running xen-unstable ?
>  If not, could you give it a try and provide the xl dmesg / serial log ?
>
>  --
>  Sander
>
>
>
>
>
>
>
>
> >   2013/9/27 Sander Eikelenboom <linux@eikelenboom.it>
>
> >   Hi Matthias,
> >
> >  Have you tried adding "no-cpuidle" on the xen/hypervisor commandline in
> grub ?
> >
> >  --
> >  Sander
> >
>
> >  Friday, September 27, 2013, 7:07:33 PM, you wrote:
> >
>  >> Hi Konrad,
> >
>  >> good call! I was able to reproduce the error with the 3.12-rc2 kernel,
> got
>  >> a lot of information with the new NMI traces (log attached), but since
> I'm
>  >> not a xen hacker I don't really know how to continue from here. So I
> might
>  >> add this to the original post and maybe someone can help me. After all
> the
>  >> error persists for half a year now and besides 2 kernel version /
> .config
>  >> Combinations (a 3.8.2 and a 3.6.something) I could never trace this
> issue
>  >> back (even with bisecting the .config because at some point it seemed
>  >> random).
> >
> >
>  >> 2013/9/27 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> >
>  >>> On Thu, Sep 26, 2013 at 07:59:40PM +0200, Matthias wrote:
>  >>> > I'm currently on a vanilla 3.8.2 kernel because this is the only
> >3.4
>  >>> > kernel I found which doesn't give me this issue:
>  >>> > http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html
>  >>>
>  >>> So v3.12 (or rather the latest and greaters of the Linus) has the
> mechanism
>  >>> for the NMI - so you can actually see what is causing the stall.
>  >>>
> >
> >
>
>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Status of FLR in Xen 4.4
  2013-09-27 19:48               ` Matthias
@ 2013-09-27 20:06                 ` Sander Eikelenboom
  0 siblings, 0 replies; 23+ messages in thread
From: Sander Eikelenboom @ 2013-09-27 20:06 UTC (permalink / raw)
  To: Matthias; +Cc: Jan Beulich, Suravee Suthikulanit, xen-devel@lists.xen.org


Friday, September 27, 2013, 9:48:39 PM, you wrote:

> Yes, running the most recent xen-unstable-staging tree, but I have these issues at least since february with xen-unstable, so I don't suspect recent changes to be the issue in my case.

> I will do some testing with switching from tickless-idle to non-tickless and after you mentioned hpet issues maybe changing the clocksource, will see what happens..

I'm now running with tickless-idle, so I suspect it will make no difference.
So I think the best shot is to try to make it boot by pressing a key on the keyboard when it makes no progress during boot (to see if that works) and, if it does, to provide the output of "xl dmesg".

(BTW, there were 2 separate issues; see these threads:
http://lists.xen.org/archives/html/xen-devel/2013-03/msg01796.html
http://lists.xen.org/archives/html/xen-devel/2013-08/msg00201.html
)



> 2013/9/27 Sander Eikelenboom <linux@eikelenboom.it>

>   

>  Friday, September 27, 2013, 9:19:14 PM, you wrote:
>  
 >> Hi Sander,
>  
 >> thanks for the advice, I have actually no rcu stalls when i use the no-cpuidle function. Do you have a little more insight on what is actually causing this behaviour and if there is a better solution then this option, cause I don't want to sacrifice my C-states (I would assume this makes the overall server more power hungry?).
>    
 >> Does this has something to do with the new tickless-kernel options in the newer kernel, or is this really only an apci incompatibility with xen?
>  
 >> Thanks!
>  
>   Are you running xen-unstable ?
>   Some patches went in lately
>  
>   You also seem to have a motherboard with a AMD 890fx chipset, i suspect your bios also has issues around the HPET as mine had.
>   I was also seeing RCU stalls on boot (and only on boot) .. hitting any key on the console when it appears to stall during boot made it continue in my case (happens several times).
>   Took a while to find the problems, Jan Beulich has made and commited some patches that went in xen-unstable recently.
>  
>   Are you running xen-unstable ?
>   If not, could you give it a try and provide the xl dmesg / serial log ?
>  
>   --
>   Sander
>  

>  
>  
>  
>  
>  
>  
>  
 >>   2013/9/27 Sander Eikelenboom <linux@eikelenboom.it>
>  
 >>   Hi Matthias,
 >>
 >>  Have you tried adding "no-cpuidle" on the xen/hypervisor commandline in grub ?
 >>
 >>  --
 >>  Sander
 >>
>  
 >>  Friday, September 27, 2013, 7:07:33 PM, you wrote:
 >>
  >>> Hi Konrad,
 >>
  >>> good call! I was able to reproduce the error with the 3.12-rc2 kernel, got
  >>> a lot of information with the new NMI traces (log attached), but since I'm
  >>> not a xen hacker I don't really know how to continue from here. So I might
  >>> add this to the original post and maybe someone can help me. After all the
  >>> error persists for half a year now and besides 2 kernel version / .config
  >>> Combinations (a 3.8.2 and a 3.6.something) I could never trace this issue
  >>> back (even with bisecting the .config because at some point it seemed
  >>> random).
 >>
 >>
  >>> 2013/9/27 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
 >>
  >>>> On Thu, Sep 26, 2013 at 07:59:40PM +0200, Matthias wrote:
  >>>> > I'm currently on a vanilla 3.8.2 kernel because this is the only >3.4
  >>>> > kernel I found which doesn't give me this issue:
  >>>> > http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html
  >>>>
  >>>> So v3.12 (or rather the latest and greaters of the Linus) has the mechanism
  >>>> for the NMI - so you can actually see what is causing the stall.
  >>>>
 >>
 >>
>  
>  
>  

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Status of FLR in Xen 4.4
  2013-09-26 18:01     ` David Vrabel
  2013-09-26 18:41       ` Matthias
@ 2013-10-03 22:20       ` Matthias
  1 sibling, 0 replies; 23+ messages in thread
From: Matthias @ 2013-10-03 22:20 UTC (permalink / raw)
  To: David Vrabel; +Cc: Ian Campbell, xen-devel@lists.xen.org

[-- Attachment #1.1: Type: text/plain, Size: 1710 bytes --]

Hi David,

with your patch as inspiration, I did various tests over the past days but
didn't manage to reset my VGA card the right way.

With your patch, and later mine, the secondary bus reset is executed, but
after that I can't boot the VM because I get a 'device model is not ready' /
'refused to pass the pci device' error.
I also tried not resetting the second function of the VGA card after a
secondary bus reset had already been executed for the first function's
reset, but with the same result.
Do you have any idea whether I am missing anything? I tried it both with the
load/restore of the configuration and without it, but it seems xenstore
can't handle the VGA card after the parent bus reset.

Something else that is odd is that my VGA card does in fact have a sysfs
reset file (both functions have a separate one), but neither the normal reset
nor triggering it by hand makes any difference. I think it is not executed,
because when I commented out the reset completely, the VM showed the same
behaviour on the second boot as when doing a normal reset. BTW: I get the
same result when doing a D0->D3 transition via the kernel. FLR and AR_FLR do
not work anyway because the card lacks the capability.

So I compared the xen-pciback reset method with both the pci/pci.c method
and what was done in python/xen/util/pci.py, and the actions are basically
the same (the quirks in pci.py are only for some NVIDIA and integrated
VGA devices), so I don't see what I am missing. Could it be that after the
parent bus reset the VGA card somehow loses its entry in xenstore or
something?

Can you elaborate a bit more on what hardware you have and whether your
patch works fine for you? I'm currently testing with an AMD HD5400.

Thanks in advance!
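
The two checks mentioned above (the FLReset flag in lspci -vvv and the per-function sysfs reset file) can be scripted. This is a minimal sketch, assuming the usual /sys/bus/pci/devices/<BDF>/reset layout; the helper names and the example address 0000:01:00.0 are hypothetical, and it is not what xen-pciback or xl do internally.

```python
import re
from pathlib import Path


def has_flr(lspci_vvv_text):
    """Return True for 'FLReset+', False for 'FLReset-', None if the flag is absent."""
    match = re.search(r"FLReset([+-])", lspci_vvv_text)
    return None if match is None else match.group(1) == "+"


def sysfs_reset_path(bdf, sysfs_root="/sys"):
    """Path of the reset attribute for a function such as '0000:01:00.0'."""
    return Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "reset"


def can_sysfs_reset(bdf, sysfs_root="/sys"):
    """True if the kernel exposes a reset file for this function."""
    return sysfs_reset_path(bdf, sysfs_root).exists()


def trigger_sysfs_reset(bdf, sysfs_root="/sys"):
    """Ask the kernel to reset the function; needs root and an idle device."""
    sysfs_reset_path(bdf, sysfs_root).write_text("1")
```

The presence of the reset file only indicates that the kernel found some reset method for the function. It does not guarantee that the method actually reinitialises the card, which would be consistent with the behaviour described above.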

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Is: RCU callback detects an RCU hang with Linux 3.12+ Was: Re: Status of FLR in Xen 4.4
  2013-09-27 17:53         ` Is: RCU callback detects an RCU hang with Linux 3.12+ Was: " Konrad Rzeszutek Wilk
@ 2013-10-03 22:34           ` Matthias
  2013-10-04  6:07             ` Pasi Kärkkäinen
  0 siblings, 1 reply; 23+ messages in thread
From: Matthias @ 2013-10-03 22:34 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Ian Campbell, xen-devel@lists.xen.org


[-- Attachment #1.1: Type: text/plain, Size: 6926 bytes --]

Hi Konrad,

sorry, I missed your message; Google Mail might not be the best software for
viewing mailing lists ;)

The RCU stall happens roughly 2 minutes after the machine is fully booted,
and I'm usually working via SSH by then.

I basically have two cases where the stall happens:

1) Without the no-cpuidle option, it happens when I start xencommons.
2) With or without no-cpuidle, it happens sometimes and arbitrarily. I
have the feeling that logging in via SSH (or network traffic in general?)
increases the chance of the RCU stall, and (this is only a guess) in
most cases it actually happens when I enter a command of more than 16
characters at the SSH command prompt. (I don't really think that this is
causing the issue; I just noticed that when entering the usual commands to
start all the Xen stuff / boot the domUs, it stalls mostly on the same
commands, i.e. when SSH freezes I have reached the same part of the command.)
But more SSH-intensive commands like 'dmesg' or 'htop' don't cause it.


Also, I can't really say what is on the screen, because my dom0 does not
have a VGA card: both VGA cards in the server are passed through to different
domUs. When I don't hide the VGA cards on boot via xen-pciback.hide, the
RCU usually does not stall and everything is fine.





2013/9/27 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

> On Fri, Sep 27, 2013 at 07:07:33PM +0200, Matthias wrote:
> > Hi Konrad,
> >
> > good call! I was able to reproduce the error with the 3.12-rc2 kernel,
> got
> > a lot of information with the new NMI traces (log attached), but since
> I'm
> > not a xen hacker I don't really know how to continue from here. So I
> might
> > add this to the original post and maybe someone can help me. After all
> the
> > error persists for half a year now and besides 2 kernel version / .config
> > Combinations (a 3.8.2 and a 3.6.something) I could never trace this issue
> > back (even with bisecting the .config because at some point it seemed
> > random).
>
> Can you tell me a bit on how this happens? Is it happening after you
> boot the machine? Does it happen after a specific workload?
>
>
> It looks like something in the RCU is taking far too long and
> the RCU callback mechanism starts complaining. The CPU0 is when the
> RCU mechanism detects that something is off and starts sending NMI to
> all CPUs. CPU2 is the only one that looks to be doing RCU callback:
>
>
> NMI backtrace for cpu 1
> CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.12.0-rc2 #2
> Hardware name: System manufacturer System Product Name/Crosshair IV
> Formula, BIOS 3029    10/09/2012
> task: ffff8800658da080 ti: ffff880065900000 task.ti: ffff880065900000
> RIP: e030:[<ffffffff8125b2b2>]  [<ffffffff8125b2b2>]
> cfb_imageblit+0x1b3/0x411
> RSP: e02b:ffff88007de439f0  EFLAGS: 00000046
> RAX: 0000000000000000 RBX: ffff88001e1c2800 RCX: 0000000000000003
> RDX: 000000000000003b RSI: ffff88001e00614e RDI: 0000000000000000
> RBP: 0000000000000013 R08: 0000000000000001 R09: ffffffff814655f0
> R10: ffff88001e006116 R11: ffffc90014875710 R12: 000000000000000d
> R13: 0000000000000000 R14: ffffc90014875714 R15: ffffc90014875000
> FS:  00007fb294ab4900(0000) GS:ffff88007de40000(0000)
> knlGS:0000000000000000
> CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
> CR2: 00007fb29177a9a0 CR3: 000000000160c000 CR4: 0000000000000660
> Stack:
>  0000000100aaaaaa 00000000000001d8 0000000000000000 0000000000aaaaaa
>  ffff8800532f0a40 ffff88001e1c2800 0000000000000001 ffff88001e1c2800
>  0000000000000000 ffff88007d424400 00000000ffff00ff 000000000000003b
> Call Trace:
>  <IRQ>  [<ffffffff81256ac4>] ? bit_putcs+0x352/0x39d
>  [<ffffffff81219825>] ? paravirt_read_tsc+0x5/0x8
>  [<ffffffff81256772>] ? bit_cursor+0x45d/0x45d
>  [<ffffffff812523a8>] ? fbcon_putcs+0xbd/0xcc
>  [<ffffffff812bc6b6>] ? vt_console_print+0x234/0x290
>  [<ffffffff810b336f>] ? call_console_drivers.constprop.18+0xb3/0xfc
>  [<ffffffff810b3c7d>] ? console_unlock+0x131/0x306
>  [<ffffffff810b420e>] ? vprintk_emit+0x3bc/0x3eb
>  [<ffffffff812c92f5>] ? paravirt_read_tsc+0x5/0x8
>  [<ffffffff812cae43>] ? add_interrupt_randomness+0x3f/0x15d
>  [<ffffffff813db9c8>] ? printk+0x4f/0x51
>  [<ffffffff810e4433>] ? rcu_check_callbacks+0x195/0x598
>        <==================
>  [<ffffffff810a3b50>] ? irqtime_account_process_tick.isra.2+0xd6/0x239
>  [<ffffffff810c232a>] ? tick_sched_do_timer+0x2e/0x2e
>  [<ffffffff81084c35>] ? update_process_times+0x30/0x5b
>  [<ffffffff810c2237>] ? tick_sched_handle+0x3e/0x4a
>  [<ffffffff810c235a>] ? tick_sched_timer+0x30/0x4c
>  [<ffffffff81098355>] ? __run_hrtimer+0x93/0x159
>  [<ffffffff81098b72>] ? hrtimer_interrupt+0xe3/0x1ca
>  [<ffffffff8103d8e4>] ? xen_timer_interrupt+0x31/0x13b
>  [<ffffffff81294c4c>] ? HYPERVISOR_event_channel_op+0xd/0x1d
>  [<ffffffff8103d79b>] ? xen_force_evtchn_callback+0x9/0xa
>  [<ffffffff8103df22>] ? check_events+0x12/0x20
>  [<ffffffff810b5b7a>] ? handle_irq_event_percpu+0x4d/0x1c5
>  [<ffffffff813e556e>] ? notifier_call_chain+0x32/0x52
>  [<ffffffff810b8287>] ? handle_percpu_irq+0x39/0x4c
>  [<ffffffff812951c0>] ? __xen_evtchn_do_upcall+0x107/0x2cb
>  [<ffffffff81219936>] ? delay_tsc+0x9c/0xc6
>  [<ffffffff81093ba0>] ? __rcu_read_unlock+0x33/0x51
>  [<ffffffff8129663a>] ? xen_evtchn_do_upcall+0x22/0x32
>  [<ffffffff813e897e>] ? xen_do_hypervisor_callback+0x1e/0x30
>  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>  [<ffffffff8103d768>] ? xen_safe_halt+0xc/0x13
>  [<ffffffff8104ae0b>] ? default_idle+0x14/0x3e
>  [<ffffffff810b53ee>] ? cpu_startup_entry+0x107/0x160
> Code: fb 4c 89 d6 b9 08 00 00 00 ff cd 83 fd ff 74 32 44 0f be 2e 44 29 c1
> 8b 44 24 18 4d 8d 73 04 41 d3 fd 44 23 6c 24 04 43 23 04 a9 <41> 89 c5 41
> 31 fd 45 89 2b 85 c9 75 05 48 ff c6 b1 08 4d 89 f3
>
>
> Which looks to be printing something on the VT console (which is running
> in KMS mode as it uses framebuffer calls). So is there something on the
> screen scrolling widly in a loop?
>
> But then there are also complains about
>
> INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long
> to run: 1.115 msecs
>
> this taking too long. I am wondering if there is some time issue
> on your box.
>
> What version of Xen do you have?
> >
> >
> > 2013/9/27 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> >
> > > On Thu, Sep 26, 2013 at 07:59:40PM +0200, Matthias wrote:
> > > > I'm currently on a vanilla 3.8.2 kernel because this is the only >3.4
> > > > kernel I found which doesn't give me this issue:
> > > > http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html
> > >
> > > So v3.12 (or rather the latest and greaters of the Linus) has the
> mechanism
> > > for the NMI - so you can actually see what is causing the stall.
> > >
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Is: RCU callback detects an RCU hang with Linux 3.12+ Was: Re: Status of FLR in Xen 4.4
  2013-10-03 22:34           ` Matthias
@ 2013-10-04  6:07             ` Pasi Kärkkäinen
  0 siblings, 0 replies; 23+ messages in thread
From: Pasi Kärkkäinen @ 2013-10-04  6:07 UTC (permalink / raw)
  To: Matthias; +Cc: xen-devel@lists.xen.org, Ian Campbell

On Fri, Oct 04, 2013 at 12:34:56AM +0200, Matthias wrote:
>    Hi Konrad,
> 
>    sorry I missed your entry, google mail might not be the best software to
>    view mailing lists ;)
> 
>    The RCU stall happens roughly 2 minutes after the machine is fully booted,
>    and I'm usually working via SSH by then..
> 
>    I basically have two cases where the stall happens:
> 
>    1) Without the no-cpuidle function, It happens when I start xencommons
>    2) With or without no-cpuidle, this happens sometimes and arbitrary and I
>    have the feeling that logging in via SSH (or network traffic in general?)
>    will increase the chance of the rcu stall and (and this is only a guess)
>    in most cases this actually happens when I enter a command of more then 16
>    chars in the ssh command prompt. (I don't really think that this is really
>    causing the issue, I just noticed that when entering the usual commands to
>    start all the xen stuff / boot the domUs, it stalls mostly on the same
>    commands / when ssh freezes I came to the same part of the command). But
>    more ssh-intensive commands like 'dmesg' or 'htop' don't cause it..
> 
>    Also, I can't really say what is on the screen because my dom0 does not
>    have a vga card / both vga cards in the server are passed to different
>    domUs and when I don't hide the vga cards on boot via xen-pciback.hide,
>    the rcu usually does not stall and everything is fine..
> 

For debugging you should have a serial console, so maybe get a PCI serial card
if you don't have a management processor offering SOL (Serial over LAN)?

-- Pasi

>    2013/9/27 Konrad Rzeszutek Wilk <[1]konrad.wilk@oracle.com>
> 
>      On Fri, Sep 27, 2013 at 07:07:33PM +0200, Matthias wrote:
>      > Hi Konrad,
>      >
>      > good call! I was able to reproduce the error with the 3.12-rc2 kernel,
>      got
>      > a lot of information with the new NMI traces (log attached), but since
>      I'm
>      > not a xen hacker I don't really know how to continue from here. So I
>      might
>      > add this to the original post and maybe someone can help me. After all
>      the
>      > error persists for half a year now and besides 2 kernel version /
>      .config
>      > Combinations (a 3.8.2 and a 3.6.something) I could never trace this
>      issue
>      > back (even with bisecting the .config because at some point it seemed
>      > random).
> 
>      Can you tell me a bit on how this happens? Is it happening after you
>      boot the machine? Does it happen after a specific workload?
> 
>      It looks like something in the RCU is taking far too long and
>      the RCU callback mechanism starts complaining. The CPU0 is when the
>      RCU mechanism detects that something is off and starts sending NMI to
>      all CPUs. CPU2 is the only one that looks to be doing RCU callback:
> 
>      NMI backtrace for cpu 1
>      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.12.0-rc2 #2
>      Hardware name: System manufacturer System Product Name/Crosshair IV
>      Formula, BIOS 3029    10/09/2012
>      task: ffff8800658da080 ti: ffff880065900000 task.ti: ffff880065900000
>      RIP: e030:[<ffffffff8125b2b2>]  [<ffffffff8125b2b2>]
>      cfb_imageblit+0x1b3/0x411
>      RSP: e02b:ffff88007de439f0  EFLAGS: 00000046
>      RAX: 0000000000000000 RBX: ffff88001e1c2800 RCX: 0000000000000003
>      RDX: 000000000000003b RSI: ffff88001e00614e RDI: 0000000000000000
>      RBP: 0000000000000013 R08: 0000000000000001 R09: ffffffff814655f0
>      R10: ffff88001e006116 R11: ffffc90014875710 R12: 000000000000000d
>      R13: 0000000000000000 R14: ffffc90014875714 R15: ffffc90014875000
>      FS:  00007fb294ab4900(0000) GS:ffff88007de40000(0000)
>      knlGS:0000000000000000
>      CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
>      CR2: 00007fb29177a9a0 CR3: 000000000160c000 CR4: 0000000000000660
>      Stack:
>       0000000100aaaaaa 00000000000001d8 0000000000000000 0000000000aaaaaa
>       ffff8800532f0a40 ffff88001e1c2800 0000000000000001 ffff88001e1c2800
>       0000000000000000 ffff88007d424400 00000000ffff00ff 000000000000003b
>      Call Trace:
>       <IRQ>  [<ffffffff81256ac4>] ? bit_putcs+0x352/0x39d
>       [<ffffffff81219825>] ? paravirt_read_tsc+0x5/0x8
>       [<ffffffff81256772>] ? bit_cursor+0x45d/0x45d
>       [<ffffffff812523a8>] ? fbcon_putcs+0xbd/0xcc
>       [<ffffffff812bc6b6>] ? vt_console_print+0x234/0x290
>       [<ffffffff810b336f>] ? call_console_drivers.constprop.18+0xb3/0xfc
>       [<ffffffff810b3c7d>] ? console_unlock+0x131/0x306
>       [<ffffffff810b420e>] ? vprintk_emit+0x3bc/0x3eb
>       [<ffffffff812c92f5>] ? paravirt_read_tsc+0x5/0x8
>       [<ffffffff812cae43>] ? add_interrupt_randomness+0x3f/0x15d
>       [<ffffffff813db9c8>] ? printk+0x4f/0x51
>       [<ffffffff810e4433>] ? rcu_check_callbacks+0x195/0x598
>               <==================
>       [<ffffffff810a3b50>] ? irqtime_account_process_tick.isra.2+0xd6/0x239
>       [<ffffffff810c232a>] ? tick_sched_do_timer+0x2e/0x2e
>       [<ffffffff81084c35>] ? update_process_times+0x30/0x5b
>       [<ffffffff810c2237>] ? tick_sched_handle+0x3e/0x4a
>       [<ffffffff810c235a>] ? tick_sched_timer+0x30/0x4c
>       [<ffffffff81098355>] ? __run_hrtimer+0x93/0x159
>       [<ffffffff81098b72>] ? hrtimer_interrupt+0xe3/0x1ca
>       [<ffffffff8103d8e4>] ? xen_timer_interrupt+0x31/0x13b
>       [<ffffffff81294c4c>] ? HYPERVISOR_event_channel_op+0xd/0x1d
>       [<ffffffff8103d79b>] ? xen_force_evtchn_callback+0x9/0xa
>       [<ffffffff8103df22>] ? check_events+0x12/0x20
>       [<ffffffff810b5b7a>] ? handle_irq_event_percpu+0x4d/0x1c5
>       [<ffffffff813e556e>] ? notifier_call_chain+0x32/0x52
>       [<ffffffff810b8287>] ? handle_percpu_irq+0x39/0x4c
>       [<ffffffff812951c0>] ? __xen_evtchn_do_upcall+0x107/0x2cb
>       [<ffffffff81219936>] ? delay_tsc+0x9c/0xc6
>       [<ffffffff81093ba0>] ? __rcu_read_unlock+0x33/0x51
>       [<ffffffff8129663a>] ? xen_evtchn_do_upcall+0x22/0x32
>       [<ffffffff813e897e>] ? xen_do_hypervisor_callback+0x1e/0x30
>       <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>       [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>       [<ffffffff8103d768>] ? xen_safe_halt+0xc/0x13
>       [<ffffffff8104ae0b>] ? default_idle+0x14/0x3e
>       [<ffffffff810b53ee>] ? cpu_startup_entry+0x107/0x160
>      Code: fb 4c 89 d6 b9 08 00 00 00 ff cd 83 fd ff 74 32 44 0f be 2e 44 29
>      c1 8b 44 24 18 4d 8d 73 04 41 d3 fd 44 23 6c 24 04 43 23 04 a9 <41> 89
>      c5 41 31 fd 45 89 2b 85 c9 75 05 48 ff c6 b1 08 4d 89 f3
> 
>      Which looks to be printing something on the VT console (which is running
>      in KMS mode as it uses framebuffer calls). So is there something on the
>      screen scrolling widly in a loop?
> 
>      But then there are also complains about
> 
>      INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long
>      to run: 1.115 msecs
> 
>      this taking too long. I am wondering if there is some time issue
>      on your box.
> 
>      What version of Xen do you have?
>      >
>      >
>      > 2013/9/27 Konrad Rzeszutek Wilk <[2]konrad.wilk@oracle.com>
>      >
>      > > On Thu, Sep 26, 2013 at 07:59:40PM +0200, Matthias wrote:
>      > > > I'm currently on a vanilla 3.8.2 kernel because this is the only
>      >3.4
>      > > > kernel I found which doesn't give me this issue:
>      > > >
>      [3]http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html
>      > >
>      > > So v3.12 (or rather the latest and greatest from Linus) has the
>      > > mechanism for the NMI - so you can actually see what is causing
>      > > the stall.
>      > >
> 
>      _______________________________________________
>      Xen-devel mailing list
>      [4]Xen-devel@lists.xen.org
>      [5]http://lists.xen.org/xen-devel
> 
> References
> 
>    Visible links
>    1. mailto:konrad.wilk@oracle.com
>    2. mailto:konrad.wilk@oracle.com
>    3. http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html
>    4. mailto:Xen-devel@lists.xen.org
>    5. http://lists.xen.org/xen-devel

end of thread, other threads:[~2013-10-04  6:07 UTC | newest]

Thread overview: 23+ messages
2013-09-26 16:05 Status of FLR in Xen 4.4 Matthias
2013-09-26 16:16 ` Ian Campbell
2013-09-26 17:59   ` Matthias
2013-09-27 13:34     ` Konrad Rzeszutek Wilk
2013-09-27 17:07       ` Matthias
2013-09-27 17:28         ` Sander Eikelenboom
2013-09-27 19:19           ` Matthias
2013-09-27 19:33             ` Sander Eikelenboom
2013-09-27 19:48               ` Matthias
2013-09-27 20:06                 ` Sander Eikelenboom
2013-09-27 17:53         ` Is: RCU callback detects an RCU hang with Linux 3.12+ Was: " Konrad Rzeszutek Wilk
2013-10-03 22:34           ` Matthias
2013-10-04  6:07             ` Pasi Kärkkäinen
2013-09-26 16:20 ` David Vrabel
2013-09-26 17:48   ` Ross Philipson
2013-09-26 18:01     ` David Vrabel
2013-09-26 18:41       ` Matthias
2013-09-26 19:13         ` Gordan Bobic
2013-09-27 12:26           ` Matthias
2013-09-27 13:27             ` Gordan Bobic
2013-09-27 13:48               ` Konrad Rzeszutek Wilk
2013-09-27 14:00                 ` Gordan Bobic
2013-10-03 22:20       ` Matthias
