Linux PCI subsystem development
 help / color / mirror / Atom feed
* Re: [PATCH v2 RESEND] Add NumaChip remote PCI support
From: Daniel J Blueman @ 2012-11-30  5:28 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, linux-kernel, Steffen Persvold
In-Reply-To: <CAErSpo67Brp8H=Jw77NzA4xL52aRNi+YDw+gM++3y8XqfTWyzA@mail.gmail.com>

Hi Bjorn,

On 29/11/2012 07:08, Bjorn Helgaas wrote:
> On Wed, Nov 21, 2012 at 1:39 AM, Daniel J Blueman
> <daniel@numascale-asia.com> wrote:
>> Add NumaChip-specific PCI access mechanism via MMCONFIG cycles, but
>> preventing access to AMD Northbridges which shouldn't respond.
>>
>> v2: Use PCI_DEVFN in precomputed constant limit; drop unneeded includes
>>
>> Signed-off-by: Daniel J Blueman <daniel@numascale-asia.com>
>> ---
>>   arch/x86/include/asm/numachip/numachip.h |   20 +++++
>>   arch/x86/kernel/apic/apic_numachip.c     |    2 +
>>   arch/x86/pci/Makefile                    |    1 +
>>   arch/x86/pci/numachip.c                  |  134 ++++++++++++++++++++++++++++++
>>   4 files changed, 157 insertions(+)
>>   create mode 100644 arch/x86/include/asm/numachip/numachip.h
>>   create mode 100644 arch/x86/pci/numachip.c
>>
>> diff --git a/arch/x86/include/asm/numachip/numachip.h b/arch/x86/include/asm/numachip/numachip.h
>> new file mode 100644
>> index 0000000..d35e71a
>> --- /dev/null
>> +++ b/arch/x86/include/asm/numachip/numachip.h
>> @@ -0,0 +1,20 @@
>> +/*
>> + * This file is subject to the terms and conditions of the GNU General Public
>> + * License.  See the file "COPYING" in the main directory of this archive
>> + * for more details.
>> + *
>> + * Numascale NumaConnect-specific header file
>> + *
>> + * Copyright (C) 2012 Numascale AS. All rights reserved.
>> + *
>> + * Send feedback to <support@numascale.com>
>> + *
>> + */
>> +
>> +#ifndef _ASM_X86_NUMACHIP_NUMACHIP_H
>> +#define _ASM_X86_NUMACHIP_NUMACHIP_H
>> +
>> +extern int __init pci_numachip_init(void);
>> +
>> +#endif /* _ASM_X86_NUMACHIP_NUMACHIP_H */
>> +
>> diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c
>> index a65829a..9c2aa89 100644
>> --- a/arch/x86/kernel/apic/apic_numachip.c
>> +++ b/arch/x86/kernel/apic/apic_numachip.c
>> @@ -22,6 +22,7 @@
>>   #include <linux/hardirq.h>
>>   #include <linux/delay.h>
>>
>> +#include <asm/numachip/numachip.h>
>>   #include <asm/numachip/numachip_csr.h>
>>   #include <asm/smp.h>
>>   #include <asm/apic.h>
>> @@ -179,6 +180,7 @@ static int __init numachip_system_init(void)
>>                  return 0;
>>
>>          x86_cpuinit.fixup_cpu_id = fixup_cpu_id;
>> +       x86_init.pci.arch_init = pci_numachip_init;
>>
>>          map_csrs();
>>
>> diff --git a/arch/x86/pci/Makefile b/arch/x86/pci/Makefile
>> index 3af5a1e..ee0af58 100644
>> --- a/arch/x86/pci/Makefile
>> +++ b/arch/x86/pci/Makefile
>> @@ -16,6 +16,7 @@ obj-$(CONFIG_STA2X11)           += sta2x11-fixup.o
>>   obj-$(CONFIG_X86_VISWS)                += visws.o
>>
>>   obj-$(CONFIG_X86_NUMAQ)                += numaq_32.o
>> +obj-$(CONFIG_X86_NUMACHIP)     += numachip.o
>
> It looks like this depends on CONFIG_PCI_MMCONFIG for
> pci_mmconfig_lookup().  Are there config constraints that force
> CONFIG_PCI_MMCONFIG=y when CONFIG_X86_NUMACHIP=y?

I'll revise the patch with this constraint after we work out the best 
approach for below.

>>   obj-$(CONFIG_X86_INTEL_MID)    += mrst.o
>>
>> diff --git a/arch/x86/pci/numachip.c b/arch/x86/pci/numachip.c
>> new file mode 100644
>> index 0000000..3773e05
>> --- /dev/null
>> +++ b/arch/x86/pci/numachip.c
>> @@ -0,0 +1,129 @@
>> +/*
>> + * This file is subject to the terms and conditions of the GNU General Public
>> + * License.  See the file "COPYING" in the main directory of this archive
>> + * for more details.
>> + *
>> + * Numascale NumaConnect-specific PCI code
>> + *
>> + * Copyright (C) 2012 Numascale AS. All rights reserved.
>> + *
>> + * Send feedback to <support@numascale.com>
>> + *
>> + * PCI accessor functions derived from mmconfig_64.c
>> + *
>> + */
>> +
>> +#include <linux/pci.h>
>> +#include <asm/pci_x86.h>
>> +
>> +static u8 limit __read_mostly;
>> +
>> +static inline char __iomem *pci_dev_base(unsigned int seg, unsigned int bus, unsigned int devfn)
>> +{
>> +       struct pci_mmcfg_region *cfg = pci_mmconfig_lookup(seg, bus);
>> +
>> +       if (cfg && cfg->virt)
>> +               return cfg->virt + (PCI_MMCFG_BUS_OFFSET(bus) | (devfn << 12));
>> +       return NULL;
>> +}
>
> Most of this file is copied directly from mmconfig_64.c (as you
> mentioned above).  I wonder if we could avoid the code duplication by
> making the pci_dev_base() implementation in mmconfig_64.c a weak
> definition.  Then you could just supply a non-weak pci_dev_base() here
> that would override that default version.  Your version would look
> something like:
>
>    char __iomem *pci_dev_base(unsigned int seg, unsigned int bus,
> unsigned int devfn)
>    {
>        struct pci_mmcfg_region *cfg = pci_mmconfig_lookup(seg, bus);
>
>        if (cfg && cfg->virt && devfn < limit)
>            return cfg->virt + (PCI_MMCFG_BUS_OFFSET(bus) | (devfn << 12));
>        return NULL;
>    }
>
> That would be different from what you have in this patch because reads
> & writes to devices above "limit" would return -EINVAL rather than 0
> as you do here.  Would that be a problem?

That would work nicely (pointer lookup and inlining etc aside) if there 
was the runtime ability to override pci_dev_base only if the NumaChip 
signature was detected.

We could expose pci_dev_base via struct x86_init_pci; the extra 
complexity and performance tradeoff may not be worth it for a single 
case perhaps?

Thanks,
   Daniel
-- 
Daniel J Blueman
Principal Software Engineer, Numascale Asia

^ permalink raw reply

* Re: [PATCH 2/3] aerdrv: Enhanced AER logging
From: Steven Rostedt @ 2012-11-30  1:53 UTC (permalink / raw)
  To: Lance Ortiz
  Cc: bhelgaas, lance_ortiz, jiang.liu, tony.luck, bp, mchehab,
	linux-acpi, linux-pci, linux-kernel
In-Reply-To: <20121129215450.5483.26562.stgit@grignak.americas.hpqcorp.net>

On Thu, 2012-11-29 at 14:54 -0700, Lance Ortiz wrote:

> --- a/drivers/pci/pcie/aer/aerdrv_errprint.c
> +++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
> @@ -23,6 +23,10 @@
>  
>  #include "aerdrv.h"
>  
> +#define CREATE_TRACE_POINTS
> +#define TRACE_INCLUDE_PATH ../../../../include/ras
> +#include <ras/aer_event.h>
> +

Yuck yuck yuck!

This header should be in the same directory, and you should have in that
same header:

#undef TRACE_INCLUDE_PATH
#define TRACE_INCLUDE_PATH .

and remove the definition here.

-- Steve

>  #define AER_AGENT_RECEIVER		0
>  #define AER_AGENT_REQUESTER		1
>  #define AER_AGENT_COMPLETER		2
> @@ -194,6 +198,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  	if (info->id && info->error_dev_num > 1 && info->id == id)
>  		printk("%s""  Error of this Agent(%04x) is reported first\n",
>  			prefix, id);
> +	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
> +			info->severity);
>  }
>  



^ permalink raw reply

* Re: [PATCH 1/3] aerdrv: Trace Event for AER
From: Steven Rostedt @ 2012-11-30  1:51 UTC (permalink / raw)
  To: Lance Ortiz
  Cc: bhelgaas, lance_ortiz, jiang.liu, tony.luck, bp, mchehab,
	linux-acpi, linux-pci, linux-kernel
In-Reply-To: <20121129215443.5483.43364.stgit@grignak.americas.hpqcorp.net>

On Thu, 2012-11-29 at 14:54 -0700, Lance Ortiz wrote:
> This header file will define a new trace event that will be triggered when
> a AER event occurs.  The following data will be provided to the trace 
> event.
> 
> char * name -	String containing the device path
> 
> u32 status - 	Either the correctable or uncorrectable register 
> 		indicating what error or errors have been see.
> 
> u8 severity - 	error severity 0:NONFATAL 1:FATAL 2:CORRECTED
> 
> The trace event will also provide a trace string that may look like:
> 
> "0000:05:00.0 PCIe Bus Error:severity=Uncorrected (Non-Fatal), Poisoned
> TLP"
> 
> Signed-off-by: Lance Ortiz <lance.ortiz@hp.com>
> ---
> 
>  include/ras/aer_event.h |   77 +++++++++++++++++++++++++++++++++++++++++++++++

Is there a reason this header is here? Egad, I never noticed the
ras_event.h that is there. This include/ras directory was created for
the sole purpose of trace events! This is not the way to do this.

Please look at the sample in samples/trace_events/

The proper way is to keep the header by the driver. Then you can simply
include the header with "aer_event.h".

But to have the macro magic work, you need to modify the Makefile to
have something like:

CFLAGS_aerdrv_errprint.o = -I$(src)

and it will be able to find your headers without a problem.
The ras_event.h needs to be fixed too. I may just send a patch myself.

-- Steve


>  1 files changed, 77 insertions(+), 0 deletions(-)
>  create mode 100644 include/ras/aer_event.h
> 
> diff --git a/include/ras/aer_event.h b/include/ras/aer_event.h
> new file mode 100644
> index 0000000..735c973
> --- /dev/null
> +++ b/include/ras/aer_event.h
> @@ -0,0 +1,77 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM aer
> +#define TRACE_INCLUDE_FILE aer_event
> +
> +#if !defined(_TRACE_AER_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_AER_H
> +
> +#include <linux/tracepoint.h>
> +#include <linux/edac.h>
> +
> +
> +/*
> + * Anhance Error Reporting (AER) PCIE Report Error
> + *
> + * These events are generated when hardware detects a corrected or
> + * uncorrected event on a pci express device and reports
> + * errors.  The event reports the following data.
> + *
> + * char * dev_name -	String containing the device identification
> + * u32 status -		Either the correctable or uncorrectable register
> + *			indicating what error or errors have been seen
> + * u8 severity -	error severity 0:NONFATAL 1:FATAL 2:CORRECTED
> + */
> +
> +#define correctable_error_string			\
> +	{BIT(0),	"Receiver Error"},		\
> +	{BIT(6),	"Bad TLP"},			\
> +	{BIT(7),	"Bad DLLP"},			\
> +	{BIT(8),	"RELAY_NUM Rollover"},		\
> +	{BIT(12),	"Replay Timer Timeout"},	\
> +	{BIT(13),	"Advisory Non-Fatal"}
> +
> +#define uncorrectable_error_string			\
> +	{BIT(4),	"Data Link Protocol"},		\
> +	{BIT(12),	"Poisoned TLP"},		\
> +	{BIT(13),	"Flow Control Protocol"},	\
> +	{BIT(14),	"Completion Timeout"},		\
> +	{BIT(15),	"Completer Abort"},		\
> +	{BIT(16),	"Unexpected Completion"},	\
> +	{BIT(17),	"Receiver Overflow"},		\
> +	{BIT(18),	"Malformed TLP"},		\
> +	{BIT(19),	"ECRC"},			\
> +	{BIT(20),	"Unsupported Request"}
> +
> +TRACE_EVENT(aer_event,
> +	TP_PROTO(const char *dev_name,
> +		 const u32 status,
> +		 const u8 severity),
> +
> +	TP_ARGS(dev_name, status, severity),
> +
> +	TP_STRUCT__entry(
> +		__string(	dev_name,	dev_name	)
> +		__field(	u32,		status		)
> +		__field(	u8,		severity	)
> +	),
> +
> +	TP_fast_assign(
> +		__assign_str(dev_name, dev_name);
> +		__entry->status		= status;
> +		__entry->severity	= severity;
> +	),
> +
> +	TP_printk("%s PCIe Bus Error: severity=%s, %s\n",
> +		__get_str(dev_name),
> +		(__entry->severity == HW_EVENT_ERR_CORRECTED) ? "Corrected" :
> +			((__entry->severity == HW_EVENT_ERR_FATAL) ?
> +			"Fatal" : "Uncorrected"),
> +		__entry->severity == HW_EVENT_ERR_CORRECTED ?
> +		__print_flags(__entry->status, "|", correctable_error_string) :
> +		__print_flags(__entry->status, "|", uncorrectable_error_string))
> +);
> +
> +#endif /* _TRACE_AER_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>



^ permalink raw reply

* Re: [PATCH v2 2/3] aerdrv: Enhanced AER logging
From: Joe Perches @ 2012-11-30  1:12 UTC (permalink / raw)
  To: Lance Ortiz
  Cc: bhelgaas, lance_ortiz, jiang.liu, tony.luck, bp, rostedt, mchehab,
	linux-acpi, linux-pci, linux-kernel
In-Reply-To: <20121129225800.8405.83305.stgit@grignak.americas.hpqcorp.net>

On Thu, 2012-11-29 at 15:58 -0700, Lance Ortiz wrote:
> This patch will provide a more reliable and easy way for user-space
> applications to have access to AER logs rather than reading them from the
> message buffer. It also provides a way to notify user-space when an AER
> event occurs.
[]
>  0 files changed, 0 insertions(+), 0 deletions(-)

Those are odd patch statistics

> diff --git a/drivers/acpi/apei/cper.c b/drivers/acpi/apei/cper.c
[]
> @@ -249,6 +250,10 @@ static const char *cper_pcie_port_type_strs[] = {
>  static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
>  			    const struct acpi_hest_generic_data *gdata)
>  {
> +#ifdef CONFIG_ACPI_APEI_PCIEAER
> +	struct pci_dev *dev;
> +#endif
> +
>  	if (pcie->validation_bits & CPER_PCIE_VALID_PORT_TYPE)
>  		printk("%s""port_type: %d, %s\n", pfx, pcie->port_type,
>  		       pcie->port_type < ARRAY_SIZE(cper_pcie_port_type_strs) ?
> @@ -281,9 +286,17 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
>  	"%s""bridge: secondary_status: 0x%04x, control: 0x%04x\n",
>  	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
>  #ifdef CONFIG_ACPI_APEI_PCIEAER
> -	if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO) {
> +	dev = pci_get_domain_bus_and_slot(pcie->device_id.segment,
> +			pcie->device_id.bus, pcie->device_id.function);
> +	if (!dev)
> +		printk("PCI AER Cannot get PCI device %04x:%02x:%02x.%d\n",
> +			pcie->device_id.segment, pcie->device_id.bus,
> +			pcie->device_id.slot, pcie->device_id.function);
> +
> +	if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO && dev) {
>  		struct aer_capability_regs *aer_regs = (void *)pcie->aer_info;
> -		cper_print_aer(pfx, gdata->error_severity, aer_regs);
> +		cper_print_aer(dev, gdata->error_severity, aer_regs);
> +		pci_dev_put(dev);
>  	}

This could be written something like:
	dev = etc..
	if (!dev) {
		pr_err(etc...)
	} else if {pcie->validation_bits & ...) {
		etc...



^ permalink raw reply

* [PATCH v2 2/3] aerdrv: Enhanced AER logging
From: Lance Ortiz @ 2012-11-29 22:58 UTC (permalink / raw)
  To: bhelgaas, lance_ortiz, jiang.liu, tony.luck, bp, rostedt, mchehab,
	linux-acpi, linux-pci, linux-kernel

This patch will provide a more reliable and easy way for user-space
applications to have access to AER logs rather than reading them from the
message buffer. It also provides a way to notify user-space when an AER
event occurs.

The aer driver is updated to generate a trace event of function 'aer_event'
when an AER occurs.  The trace event was added to both the interrupt
based aer path and the firmware first path

Signed-off-by: Lance Ortiz <lance.ortiz@hp.com>
---

 0 files changed, 0 insertions(+), 0 deletions(-)

diff --git a/drivers/acpi/apei/cper.c b/drivers/acpi/apei/cper.c
index e6defd8..87345d4 100644
--- a/drivers/acpi/apei/cper.c
+++ b/drivers/acpi/apei/cper.c
@@ -29,6 +29,7 @@
 #include <linux/time.h>
 #include <linux/cper.h>
 #include <linux/acpi.h>
+#include <linux/pci.h>
 #include <linux/aer.h>
 
 /*
@@ -249,6 +250,10 @@ static const char *cper_pcie_port_type_strs[] = {
 static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
 			    const struct acpi_hest_generic_data *gdata)
 {
+#ifdef CONFIG_ACPI_APEI_PCIEAER
+	struct pci_dev *dev;
+#endif
+
 	if (pcie->validation_bits & CPER_PCIE_VALID_PORT_TYPE)
 		printk("%s""port_type: %d, %s\n", pfx, pcie->port_type,
 		       pcie->port_type < ARRAY_SIZE(cper_pcie_port_type_strs) ?
@@ -281,9 +286,17 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
 	"%s""bridge: secondary_status: 0x%04x, control: 0x%04x\n",
 	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
 #ifdef CONFIG_ACPI_APEI_PCIEAER
-	if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO) {
+	dev = pci_get_domain_bus_and_slot(pcie->device_id.segment,
+			pcie->device_id.bus, pcie->device_id.function);
+	if (!dev)
+		printk("PCI AER Cannot get PCI device %04x:%02x:%02x.%d\n",
+			pcie->device_id.segment, pcie->device_id.bus,
+			pcie->device_id.slot, pcie->device_id.function);
+
+	if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO && dev) {
 		struct aer_capability_regs *aer_regs = (void *)pcie->aer_info;
-		cper_print_aer(pfx, gdata->error_severity, aer_regs);
+		cper_print_aer(dev, gdata->error_severity, aer_regs);
+		pci_dev_put(dev);
 	}
 #endif
 }
diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 3ea5173..6354e50 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -23,6 +23,10 @@
 
 #include "aerdrv.h"
 
+#define CREATE_TRACE_POINTS
+#define TRACE_INCLUDE_PATH ../../../../include/ras
+#include <ras/aer_event.h>
+
 #define AER_AGENT_RECEIVER		0
 #define AER_AGENT_REQUESTER		1
 #define AER_AGENT_COMPLETER		2
@@ -194,6 +198,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 	if (info->id && info->error_dev_num > 1 && info->id == id)
 		printk("%s""  Error of this Agent(%04x) is reported first\n",
 			prefix, id);
+	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
+			info->severity);
 }
 
 void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
@@ -217,12 +223,13 @@ int cper_severity_to_aer(int cper_severity)
 }
 EXPORT_SYMBOL_GPL(cper_severity_to_aer);
 
-void cper_print_aer(const char *prefix, int cper_severity,
+void cper_print_aer(struct pci_dev *dev, int cper_severity,
 		    struct aer_capability_regs *aer)
 {
 	int aer_severity, layer, agent, status_strs_size, tlp_header_valid = 0;
 	u32 status, mask;
 	const char **status_strs;
+	char *prefix = NULL;
 
 	aer_severity = cper_severity_to_aer(cper_severity);
 	if (aer_severity == AER_CORRECTABLE) {
@@ -259,5 +266,7 @@ void cper_print_aer(const char *prefix, int cper_severity,
 			*(tlp + 8), *(tlp + 15), *(tlp + 14),
 			*(tlp + 13), *(tlp + 12));
 	}
+	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
+			aer_severity);
 }
 #endif
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 544abdb..7b86dc6 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -49,7 +49,7 @@ static inline int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
 }
 #endif
 
-extern void cper_print_aer(const char *prefix, int cper_severity,
+extern void cper_print_aer(struct pci_dev *dev, int cper_severity,
 			   struct aer_capability_regs *aer);
 extern int cper_severity_to_aer(int cper_severity);
 extern void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,


^ permalink raw reply related

* RE: [PATCH 2/3] aerdrv: Enhanced AER logging
From: Ortiz, Lance E @ 2012-11-29 22:21 UTC (permalink / raw)
  To: Joe Perches
  Cc: bhelgaas@google.com, lance_ortiz@hotmail.com,
	jiang.liu@huawei.com, tony.luck@intel.com, bp@alien8.de,
	rostedt@goodmis.org, mchehab@redhat.com,
	linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <1354227105.1700.15.camel@joe-AO722>

Yup.  You are right.  I thought I had it enabled, I will send out the new patch soon.

Lance

> -----Original Message-----
> From: Joe Perches [mailto:joe@perches.com]
> Sent: Thursday, November 29, 2012 3:12 PM
> To: Ortiz, Lance E
> Cc: bhelgaas@google.com; lance_ortiz@hotmail.com; jiang.liu@huawei.com;
> tony.luck@intel.com; bp@alien8.de; rostedt@goodmis.org;
> mchehab@redhat.com; linux-acpi@vger.kernel.org; linux-
> pci@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH 2/3] aerdrv: Enhanced AER logging
> 
> On Thu, 2012-11-29 at 14:54 -0700, Lance Ortiz wrote:
> > This patch will provide a more reliable and easy way for user-space
> > applications to have access to AER logs rather than reading them from
> the
> > message buffer. It also provides a way to notify user-space when an
> AER
> > event occurs.
> []
> > diff --git a/drivers/acpi/apei/cper.c b/drivers/acpi/apei/cper.c
> > index e6defd8..ef1e1c0 100644
> > --- a/drivers/acpi/apei/cper.c
> > +++ b/drivers/acpi/apei/cper.c
> > @@ -281,9 +281,16 @@ static void cper_print_pcie(const char *pfx,
> const struct cper_sec_pcie *pcie,
> >  	"%s""bridge: secondary_status: 0x%04x, control: 0x%04x\n",
> >  	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
> >  #ifdef CONFIG_ACPI_APEI_PCIEAER
> > -	if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO) {
> > +	dev = pci_get_domain_bus_and_slot(pcie->device_id.segment,
> > +			pcie->device_id.bus, pcie->device_id.function);
> > +	if (!dev)
> > +		pr_info(KERN_INFO, "PCI AER Cannot get PCI device
> %04x:%02x:%02x.%d\n",
> > +			domain, bus, slot, func);
> 
> You've not tried this with CONFIG_ACPI_APEI_PCIEAER enabled.
> 
> 		pr_info("PCI AER Cannot get PCI device
> %04x:%02x:%02x.%d\n",
> 			domain, bus, slot, func);
> 


^ permalink raw reply

* Re: [PATCH 2/3] aerdrv: Enhanced AER logging
From: Joe Perches @ 2012-11-29 22:11 UTC (permalink / raw)
  To: Lance Ortiz
  Cc: bhelgaas, lance_ortiz, jiang.liu, tony.luck, bp, rostedt, mchehab,
	linux-acpi, linux-pci, linux-kernel
In-Reply-To: <20121129215450.5483.26562.stgit@grignak.americas.hpqcorp.net>

On Thu, 2012-11-29 at 14:54 -0700, Lance Ortiz wrote:
> This patch will provide a more reliable and easy way for user-space
> applications to have access to AER logs rather than reading them from the
> message buffer. It also provides a way to notify user-space when an AER
> event occurs.
[]
> diff --git a/drivers/acpi/apei/cper.c b/drivers/acpi/apei/cper.c
> index e6defd8..ef1e1c0 100644
> --- a/drivers/acpi/apei/cper.c
> +++ b/drivers/acpi/apei/cper.c
> @@ -281,9 +281,16 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
>  	"%s""bridge: secondary_status: 0x%04x, control: 0x%04x\n",
>  	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
>  #ifdef CONFIG_ACPI_APEI_PCIEAER
> -	if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO) {
> +	dev = pci_get_domain_bus_and_slot(pcie->device_id.segment,
> +			pcie->device_id.bus, pcie->device_id.function);
> +	if (!dev)
> +		pr_info(KERN_INFO, "PCI AER Cannot get PCI device %04x:%02x:%02x.%d\n",
> +			domain, bus, slot, func);

You've not tried this with CONFIG_ACPI_APEI_PCIEAER enabled.

		pr_info("PCI AER Cannot get PCI device %04x:%02x:%02x.%d\n",
			domain, bus, slot, func);



^ permalink raw reply

* [PATCH 2/3] aerdrv: Enhanced AER logging
From: Lance Ortiz @ 2012-11-29 21:54 UTC (permalink / raw)
  To: bhelgaas, lance_ortiz, jiang.liu, tony.luck, bp, rostedt, mchehab,
	linux-acpi, linux-pci, linux-kernel
In-Reply-To: <20121129215443.5483.43364.stgit@grignak.americas.hpqcorp.net>

This patch will provide a more reliable and easy way for user-space
applications to have access to AER logs rather than reading them from the
message buffer. It also provides a way to notify user-space when an AER
event occurs.

The aer driver is updated to generate a trace event of function 'aer_event'
when an AER occurs.  The trace event was added to both the interrupt
based aer path and the firmware first path

Signed-off-by: Lance Ortiz <lance.ortiz@hp.com>
---

 drivers/acpi/apei/cper.c               |   11 +++++++++--
 drivers/pci/pcie/aer/aerdrv_errprint.c |   11 ++++++++++-
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/apei/cper.c b/drivers/acpi/apei/cper.c
index e6defd8..ef1e1c0 100644
--- a/drivers/acpi/apei/cper.c
+++ b/drivers/acpi/apei/cper.c
@@ -281,9 +281,16 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
 	"%s""bridge: secondary_status: 0x%04x, control: 0x%04x\n",
 	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
 #ifdef CONFIG_ACPI_APEI_PCIEAER
-	if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO) {
+	dev = pci_get_domain_bus_and_slot(pcie->device_id.segment,
+			pcie->device_id.bus, pcie->device_id.function);
+	if (!dev)
+		pr_info(KERN_INFO, "PCI AER Cannot get PCI device %04x:%02x:%02x.%d\n",
+			domain, bus, slot, func);
+
+	if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO && dev) {
 		struct aer_capability_regs *aer_regs = (void *)pcie->aer_info;
-		cper_print_aer(pfx, gdata->error_severity, aer_regs);
+		cper_print_aer(dev, gdata->error_severity, aer_regs);
+		pci_dev_put(dev);
 	}
 #endif
 }
diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 3ea5173..6354e50 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -23,6 +23,10 @@
 
 #include "aerdrv.h"
 
+#define CREATE_TRACE_POINTS
+#define TRACE_INCLUDE_PATH ../../../../include/ras
+#include <ras/aer_event.h>
+
 #define AER_AGENT_RECEIVER		0
 #define AER_AGENT_REQUESTER		1
 #define AER_AGENT_COMPLETER		2
@@ -194,6 +198,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 	if (info->id && info->error_dev_num > 1 && info->id == id)
 		printk("%s""  Error of this Agent(%04x) is reported first\n",
 			prefix, id);
+	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
+			info->severity);
 }
 
 void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
@@ -217,12 +223,13 @@ int cper_severity_to_aer(int cper_severity)
 }
 EXPORT_SYMBOL_GPL(cper_severity_to_aer);
 
-void cper_print_aer(const char *prefix, int cper_severity,
+void cper_print_aer(struct pci_dev *dev, int cper_severity,
 		    struct aer_capability_regs *aer)
 {
 	int aer_severity, layer, agent, status_strs_size, tlp_header_valid = 0;
 	u32 status, mask;
 	const char **status_strs;
+	char *prefix = NULL;
 
 	aer_severity = cper_severity_to_aer(cper_severity);
 	if (aer_severity == AER_CORRECTABLE) {
@@ -259,5 +266,7 @@ void cper_print_aer(const char *prefix, int cper_severity,
 			*(tlp + 8), *(tlp + 15), *(tlp + 14),
 			*(tlp + 13), *(tlp + 12));
 	}
+	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
+			aer_severity);
 }
 #endif


^ permalink raw reply related

* [PATCH 3/3] aerdrv: Cleanup log output for CPER based AER
From: Lance Ortiz @ 2012-11-29 21:54 UTC (permalink / raw)
  To: bhelgaas, lance_ortiz, jiang.liu, tony.luck, bp, rostedt, mchehab,
	linux-acpi, linux-pci, linux-kernel
In-Reply-To: <20121129215443.5483.43364.stgit@grignak.americas.hpqcorp.net>

These changes make cper_print_aer more consistent with aer_print_error
which is called in the AER interrupt case. The string in the variable
'prefix' is printed at the beginning of each print statement in
cper_print_aer(). The prefix is a string containing the driver name
and the device's path.  Looking up the call path, the value of prefix
is never assigned and is NULL, so when cper_print_aer prints data the
initial string does not get printed.  This string is important because
it identifies the device that the error is on. This patch adds code to
create the prefix and print it in the cper_print_aer function. This
patch also calculates the device's agent id so it can be printed.

Signed-off-by: Lance Ortiz <lance.ortiz@hp.com>
---

 drivers/pci/pcie/aer/aerdrv_errprint.c |   12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 6354e50..bc0a091 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -229,7 +229,13 @@ void cper_print_aer(struct pci_dev *dev, int cper_severity,
 	int aer_severity, layer, agent, status_strs_size, tlp_header_valid = 0;
 	u32 status, mask;
 	const char **status_strs;
-	char *prefix = NULL;
+	int id = ((dev->bus->number << 8) | dev->devfn);
+	char prefix[44];
+
+	snprintf(prefix, sizeof(prefix), "%s%s %s: ",
+		(info->severity == AER_CORRECTABLE) ?
+		KERN_WARNING : KERN_ERR,
+		dev_driver_string(&dev->dev), dev_name(&dev->dev));
 
 	aer_severity = cper_severity_to_aer(cper_severity);
 	if (aer_severity == AER_CORRECTABLE) {
@@ -249,8 +255,8 @@ void cper_print_aer(struct pci_dev *dev, int cper_severity,
 	printk("%s""aer_status: 0x%08x, aer_mask: 0x%08x\n",
 	       prefix, status, mask);
 	cper_print_bits(prefix, status, status_strs, status_strs_size);
-	printk("%s""aer_layer=%s, aer_agent=%s\n", prefix,
-	       aer_error_layer[layer], aer_agent_string[agent]);
+	printk("%s""aer_layer=%s, %d=%04x(%s)\n", prefix,
+	       aer_error_layer[layer], id, aer_agent_string[agent]);
 	if (aer_severity != AER_CORRECTABLE)
 		printk("%s""aer_uncor_severity: 0x%08x\n",
 		       prefix, aer->uncor_severity);


^ permalink raw reply related

* [PATCH 1/3] aerdrv: Trace Event for AER
From: Lance Ortiz @ 2012-11-29 21:54 UTC (permalink / raw)
  To: bhelgaas, lance_ortiz, jiang.liu, tony.luck, bp, rostedt, mchehab,
	linux-acpi, linux-pci, linux-kernel

This header file will define a new trace event that will be triggered when
a AER event occurs.  The following data will be provided to the trace 
event.

char * name -	String containing the device path

u32 status - 	Either the correctable or uncorrectable register 
		indicating what error or errors have been see.

u8 severity - 	error severity 0:NONFATAL 1:FATAL 2:CORRECTED

The trace event will also provide a trace string that may look like:

"0000:05:00.0 PCIe Bus Error:severity=Uncorrected (Non-Fatal), Poisoned
TLP"

Signed-off-by: Lance Ortiz <lance.ortiz@hp.com>
---

 include/ras/aer_event.h |   77 +++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 77 insertions(+), 0 deletions(-)
 create mode 100644 include/ras/aer_event.h

diff --git a/include/ras/aer_event.h b/include/ras/aer_event.h
new file mode 100644
index 0000000..735c973
--- /dev/null
+++ b/include/ras/aer_event.h
@@ -0,0 +1,77 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM aer
+#define TRACE_INCLUDE_FILE aer_event
+
+#if !defined(_TRACE_AER_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_AER_H
+
+#include <linux/tracepoint.h>
+#include <linux/edac.h>
+
+
+/*
+ * Anhance Error Reporting (AER) PCIE Report Error
+ *
+ * These events are generated when hardware detects a corrected or
+ * uncorrected event on a pci express device and reports
+ * errors.  The event reports the following data.
+ *
+ * char * dev_name -	String containing the device identification
+ * u32 status -		Either the correctable or uncorrectable register
+ *			indicating what error or errors have been seen
+ * u8 severity -	error severity 0:NONFATAL 1:FATAL 2:CORRECTED
+ */
+
+#define correctable_error_string			\
+	{BIT(0),	"Receiver Error"},		\
+	{BIT(6),	"Bad TLP"},			\
+	{BIT(7),	"Bad DLLP"},			\
+	{BIT(8),	"RELAY_NUM Rollover"},		\
+	{BIT(12),	"Replay Timer Timeout"},	\
+	{BIT(13),	"Advisory Non-Fatal"}
+
+#define uncorrectable_error_string			\
+	{BIT(4),	"Data Link Protocol"},		\
+	{BIT(12),	"Poisoned TLP"},		\
+	{BIT(13),	"Flow Control Protocol"},	\
+	{BIT(14),	"Completion Timeout"},		\
+	{BIT(15),	"Completer Abort"},		\
+	{BIT(16),	"Unexpected Completion"},	\
+	{BIT(17),	"Receiver Overflow"},		\
+	{BIT(18),	"Malformed TLP"},		\
+	{BIT(19),	"ECRC"},			\
+	{BIT(20),	"Unsupported Request"}
+
+TRACE_EVENT(aer_event,
+	TP_PROTO(const char *dev_name,
+		 const u32 status,
+		 const u8 severity),
+
+	TP_ARGS(dev_name, status, severity),
+
+	TP_STRUCT__entry(
+		__string(	dev_name,	dev_name	)
+		__field(	u32,		status		)
+		__field(	u8,		severity	)
+	),
+
+	TP_fast_assign(
+		__assign_str(dev_name, dev_name);
+		__entry->status		= status;
+		__entry->severity	= severity;
+	),
+
+	TP_printk("%s PCIe Bus Error: severity=%s, %s\n",
+		__get_str(dev_name),
+		(__entry->severity == HW_EVENT_ERR_CORRECTED) ? "Corrected" :
+			((__entry->severity == HW_EVENT_ERR_FATAL) ?
+			"Fatal" : "Uncorrected"),
+		__entry->severity == HW_EVENT_ERR_CORRECTED ?
+		__print_flags(__entry->status, "|", correctable_error_string) :
+		__print_flags(__entry->status, "|", uncorrectable_error_string))
+);
+
+#endif /* _TRACE_AER_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>


^ permalink raw reply related

* Re: pci_enable_msix() fails with ENOMEM/EINVAL
From: Alex Williamson @ 2012-11-29 15:56 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: kvm, linux-pci, Yair Hershko, Shyam Kaushik
In-Reply-To: <A5D6C5E3D7774E62AA482C886483DA6D@alyakaslap>

On Thu, 2012-11-29 at 10:42 +0200, Alex Lyakas wrote:
> Hi Alex,
> did I understand correctly that "vector" value with which the call to 
> request_threaded_irq() is made is *not* supposed to be zero? Because in my 
> case, it is always zero, and still the failure I observe is not happening 
> always. Usually, 3 unique (non-zero) IRQ numbers are assigned to each 
> attached PCI device of each KVM VM.
> I will try to repro this like you suggested and let you know.

pci_enable_msix should be setting vectors for all the host_msix_entries.
That non-zero vector is the first parameter to request_threaded_irq.  If
I put a printk in the loop installing the interrupt handler, I get:

assigned_device_enable_host_msix entry 0, vector 44
assigned_device_enable_host_msix entry 0, vector 44
assigned_device_enable_host_msix entry 1, vector 45
assigned_device_enable_host_msix entry 0, vector 44
assigned_device_enable_host_msix entry 1, vector 45
assigned_device_enable_host_msix entry 2, vector 103

So we enable MSI-X with only one vector enabled, then a second vector
gets enabled and we disable and re-enable with two, and so on with
three.  This is for an assigned e1000e device.  Thanks,

Alex


^ permalink raw reply

* RE: [E1000-devel] 82571EB: Detected Hardware Unit Hang
From: Fujinaka, Todd @ 2012-11-29 15:52 UTC (permalink / raw)
  To: Ethan Zhao
  Cc: Joe Jin, Ben Hutchings, Mary Mcgrath, netdev@vger.kernel.org,
	e1000-devel@lists.sf.net, linux-kernel@vger.kernel.org, linux-pci
In-Reply-To: <CABawtvPrz=z_CLCvPeXkky7COyhjPyfBu3QCTbiiPpvr+5bicg@mail.gmail.com>

Someone else pointed this out to me locally. If you have a non-client BIOS, you should be able to set the MaxPayloadSize using setpci. You have to make sure that you're being consistent throughout all the associated links.

Todd Fujinaka
Technical Marketing Engineer
LAN Access Division (LAD)
Intel Corporation
todd.fujinaka@intel.com
(503) 712-4565


-----Original Message-----
From: Ethan Zhao [mailto:ethan.kernel@gmail.com] 
Sent: Wednesday, November 28, 2012 7:10 PM
To: Fujinaka, Todd
Cc: Joe Jin; Ben Hutchings; Mary Mcgrath; netdev@vger.kernel.org; e1000-devel@lists.sf.net; linux-kernel@vger.kernel.org; linux-pci
Subject: Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

Joe,
    Possibly your customer is running a kernel without source code on a platform whose vendor wouldn't like to fix BIOS issue( Is that a HP/Dell server ?).
    Anyway, to see if is a payload issue or,  you could change the payload size with setpci tool to those devices and set the link retrain bit to trigger the link retraining to debug the issue and identity the root cause.  I thinks it is much easier than modify the BIOS or  eeprom of NIC.

    e.g.
   set device control register to 0f 00   (128 bytes payload size)
   #   setpci -v -s 00:02.0 98.w=000f
   set device link control register to 60h (retrain the link)
   #  setpci -v -s 00:02.0 a0.b=60

  Hope it works,  Just my 2 cents.

Ethan.zhao@oracle.com

On Wed, Nov 28, 2012 at 11:53 PM, Fujinaka, Todd <todd.fujinaka@intel.com> wrote:
> The only EEPROM I know about or can speak to is the one attached to the 82571 and it doesn't set the MaxPayloadSize. That's done by the BIOS.
>
> Todd Fujinaka
> Technical Marketing Engineer
> LAN Access Division (LAD)
> Intel Corporation
> todd.fujinaka@intel.com
> (503) 712-4565
>
>
> -----Original Message-----
> From: Joe Jin [mailto:joe.jin@oracle.com]
> Sent: Wednesday, November 28, 2012 12:31 AM
> To: Ben Hutchings
> Cc: Fujinaka, Todd; Mary Mcgrath; netdev@vger.kernel.org; 
> e1000-devel@lists.sf.net; linux-kernel@vger.kernel.org; linux-pci
> Subject: Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
>
> On 11/28/12 02:10, Ben Hutchings wrote:
>> On Tue, 2012-11-27 at 17:32 +0000, Fujinaka, Todd wrote:
>>> Forgive me if I'm being too repetitious as I think some of this has 
>>> been mentioned in the past.
>>>
>>> We (and by we I mean the Ethernet part and driver) can only change 
>>> the advertised availability of a larger MaxPayloadSize. The size is 
>>> negotiated by both sides of the link when the link is established.
>>> The driver should not change the size of the link as it would be 
>>> poking at registers outside of its scope and is controlled by the 
>>> upstream bridge (not us).
>> [...]
>>
>> MaxPayloadSize (MPS) is not negotiated between devices but is 
>> programmed by the system firmware (at least for devices present at 
>> boot - the kernel may be responsible in case of hotplug).  You can 
>> use the kernel parameter 'pci=pcie_bus_perf' (or one of several 
>> others) to set a policy that overrides this, but no policy will allow 
>> setting MPS above the device's MaxPayloadSizeSupported (MPSS).
>>
>
> Ben,
>
> Unfortunately I'm using 3.0.x kernel and this is not included in the kernel.
> So I'm trying to use ethtool modify it from eeprom to see if help or no.
>
>
> Todd, I'll review all MaxPayload for all devices, but need to say if it mismatch, customer could not modify it from BIOS for there was not entry at there, to test it, we have to find how to verify if this is the root cause, so still need to find the offset in eeprom.
>
> Thanks in advance,
> Joe
>

^ permalink raw reply

* Re: pci_enable_msix() fails with ENOMEM/EINVAL
From: Alex Lyakas @ 2012-11-29  8:42 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-pci, Yair Hershko, Shyam Kaushik
In-Reply-To: <1353960253.1809.128.camel@bling.home>

Hi Alex,
did I understand correctly that "vector" value with which the call to 
request_threaded_irq() is made is *not* supposed to be zero? Because in my 
case, it is always zero, and still the failure I observe is not happening 
always. Usually, 3 unique (non-zero) IRQ numbers are assigned to each 
attached PCI device of each KVM VM.
I will try to repro this like you suggested and let you know.

Thanks for your help,
Alex.


-----Original Message----- 
From: Alex Williamson
Sent: 26 November, 2012 10:04 PM
To: Alex Lyakas
Cc: kvm@vger.kernel.org
Subject: Re: pci_enable_msix() fails with ENOMEM/EINVAL

On Thu, 2012-11-22 at 10:52 +0200, Alex Lyakas wrote:
> Hi Alex,
> thanks for your response.
>
> I printed out the "vector" and "entry" values of dev->host_msix_entries[i]
> within assigned_device_enable_host_msix() before call to
> request_threaded_irq(). I see that they are all 0s:
> kernel: [ 3332.610980] kvm-8095: KVM_ASSIGN_DEV_IRQ assigned_dev_id=924
> kernel: [ 3332.610985] kvm-8095: assigned_device_enable_host_msix()
> assigned_dev_id=924 #0: [v=0 e=0]
> kernel: [ 3332.610989] kvm-8095: assigned_device_enable_host_msix()
> assigned_dev_id=924 #1: [v=0 e=1]
> kernel: [ 3332.610992] kvm-8095: assigned_device_enable_host_msix()
> assigned_dev_id=924 #2: [v=0 e=2]
>
> So I don't really understand how they all ask for irq=0; I must be missing
> something. Is there any other explanation of request_threaded_irq() to
> return EBUSY? From the code I don't see that there is.

The vectors all being zero sounds like an indication that
pci_enable_msix() didn't actually work.  Each of those should be a
unique vector.   Does booting the host with "nointremap" perhaps make a
difference?  Maybe we can isolate the problem to the interrupt remapper
code.

> This issue is reproducible and is not going to go away by itself. Working
> around it is also problematic. We thought to check whether all IRQs are
> properly attached after QEMU sets the vm state to "running". However, vm
> state is set to "running" before IRQ attachments are performed; we 
> debugged
> this and found out that they are done from a different thread, from a 
> stack
> trace like this:
> kvm_assign_irq()
> assigned_dev_update_msix()
> assigned_dev_pci_write_config()
> pci_host_config_write_common()
> pci_data_write()
> pci_host_data_write()
> memory_region_write_accessor()
> access_with_adjusted_size()
> memory_region_iorange_write()
> ioport_writew_thunk()
> ioport_write()
> cpu_outw()
> kvm_handle_io()
> kvm_cpu_exec()
> qemu_kvm_cpu_thread_fn()
>
> So looks like this is performed on-demand (on first IO), so no reliable
> point to check that IRQs are attached properly.

Correct, MSI-X is setup when the guest enables MSI-X on the device,
which is likely a long way into guest boot.  There's no guarantee that
the guest ever enables MSI-X, so there's no association to whether the
guest is "running".

>  Another issue that in KVM
> code the return value of pci_host_config_write_common() is not checked, so
> there is no way to report a failure.

A common problem in qemu, imho

> Is there any way you think you can help me debug this further?

It seems like pci_enable_msix is still failing, but perhaps silently
without irqbalance.  We need to figure out where and why.  Isolating it
to the interrupt remapper with "nointremap" might give us some clues
(this is an Intel VT-d system, right?).  Thanks,

Alex


> -----Original Message----- 
> From: Alex Williamson
> Sent: 22 November, 2012 12:25 AM
> To: Alex Lyakas
> Cc: kvm@vger.kernel.org
> Subject: Re: pci_enable_msix() fails with ENOMEM/EINVAL
>
> On Wed, 2012-11-21 at 16:19 +0200, Alex Lyakas wrote:
> > Hi,
> > I was advised to turn off irqbalance and reproduced this issue, but
> > the failure is in a different place now. Now request_threaded_irq()
> > fails with EBUSY.
> > According to the code, this can only happen on the path:
> > request_threaded_irq() -> __setup_irq()
> > Now in setup irq, the only place where EBUSY can show up for us is here:
> > ...
> > raw_spin_lock_irqsave(&desc->lock, flags);
> > old_ptr = &desc->action;
> > old = *old_ptr;
> > if (old) {
> > /*
> > * Can't share interrupts unless both agree to and are
> > * the same type (level, edge, polarity). So both flag
> > * fields must have IRQF_SHARED set and the bits which
> > * set the trigger type must match. Also all must
> > * agree on ONESHOT.
> > */
> > if (!((old->flags & new->flags) & IRQF_SHARED) ||
> >     ((old->flags ^ new->flags) & IRQF_TRIGGER_MASK) ||
> >     ((old->flags ^ new->flags) & IRQF_ONESHOT)) {
> > old_name = old->name;
> > goto mismatch;
> > }
> >
> > /* All handlers must agree on per-cpuness */
> > if ((old->flags & IRQF_PERCPU) !=
> >     (new->flags & IRQF_PERCPU))
> > goto mismatch;
> >
> > KVM calls request_threaded_irq() with flags==0, so can it be that
> > different KVM processes request the same IRQ?
>
> Shouldn't be possible, irqs are allocated from a bitmap protected by a
> mutex, see __irq_alloc_descs
>
> >  How different KVM
> > processes spawned simultaneously agree between them on IRQ numbers?
>
> They don't, MSI/X vectors are not currently share-able.  Can you show
> that you're actually getting duplicate irq vectors?  Thanks,
>
> Alex
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html




^ permalink raw reply

* Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
From: Ethan Zhao @ 2012-11-29  3:10 UTC (permalink / raw)
  To: Fujinaka, Todd
  Cc: Joe Jin, Ben Hutchings, Mary Mcgrath, netdev@vger.kernel.org,
	e1000-devel@lists.sf.net, linux-kernel@vger.kernel.org, linux-pci
In-Reply-To: <9B4A1B1917080E46B64F07F2989DADD62F2D8AAC@ORSMSX102.amr.corp.intel.com>

Joe,
    Possibly your customer is running a kernel without source code on
a platform whose vendor wouldn't like to fix BIOS issue( Is that a
HP/Dell server ?).
    Anyway, to see if is a payload issue or,  you could change the
payload size with setpci tool to those devices and set the link
retrain bit to trigger the link retraining to debug the issue and
identity the root cause.  I thinks it is much easier than modify the
BIOS or  eeprom of NIC.

    e.g.
   set device control register to 0f 00   (128 bytes payload size)
   #   setpci -v -s 00:02.0 98.w=000f
   set device link control register to 60h (retrain the link)
   #  setpci -v -s 00:02.0 a0.b=60

  Hope it works,  Just my 2 cents.

Ethan.zhao@oracle.com

On Wed, Nov 28, 2012 at 11:53 PM, Fujinaka, Todd
<todd.fujinaka@intel.com> wrote:
> The only EEPROM I know about or can speak to is the one attached to the 82571 and it doesn't set the MaxPayloadSize. That's done by the BIOS.
>
> Todd Fujinaka
> Technical Marketing Engineer
> LAN Access Division (LAD)
> Intel Corporation
> todd.fujinaka@intel.com
> (503) 712-4565
>
>
> -----Original Message-----
> From: Joe Jin [mailto:joe.jin@oracle.com]
> Sent: Wednesday, November 28, 2012 12:31 AM
> To: Ben Hutchings
> Cc: Fujinaka, Todd; Mary Mcgrath; netdev@vger.kernel.org; e1000-devel@lists.sf.net; linux-kernel@vger.kernel.org; linux-pci
> Subject: Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
>
> On 11/28/12 02:10, Ben Hutchings wrote:
>> On Tue, 2012-11-27 at 17:32 +0000, Fujinaka, Todd wrote:
>>> Forgive me if I'm being too repetitious as I think some of this has
>>> been mentioned in the past.
>>>
>>> We (and by we I mean the Ethernet part and driver) can only change
>>> the advertised availability of a larger MaxPayloadSize. The size is
>>> negotiated by both sides of the link when the link is established.
>>> The driver should not change the size of the link as it would be
>>> poking at registers outside of its scope and is controlled by the
>>> upstream bridge (not us).
>> [...]
>>
>> MaxPayloadSize (MPS) is not negotiated between devices but is
>> programmed by the system firmware (at least for devices present at
>> boot - the kernel may be responsible in case of hotplug).  You can use
>> the kernel parameter 'pci=pcie_bus_perf' (or one of several others) to
>> set a policy that overrides this, but no policy will allow setting MPS
>> above the device's MaxPayloadSizeSupported (MPSS).
>>
>
> Ben,
>
> Unfortunately I'm using 3.0.x kernel and this is not included in the kernel.
> So I'm trying to use ethtool modify it from eeprom to see if help or no.
>
>
> Todd, I'll review all MaxPayload for all devices, but need to say if it mismatch, customer could not modify it from BIOS for there was not entry at there, to test it, we have to find how to verify if this is the root cause, so still need to find the offset in eeprom.
>
> Thanks in advance,
> Joe
>

^ permalink raw reply

* Re: [PATCH 5/9] unicore32/PCI: Remove CONFIG_HOTPLUG ifdefs
From: guanxuetao @ 2012-11-29  2:02 UTC (permalink / raw)
  To: Bill Pemberton; +Cc: gregkh, linux-pci, Guan Xuetao, Bjorn Helgaas
In-Reply-To: <1353530100-728-6-git-send-email-wfp5p@virginia.edu>

> Remove conditional code based on CONFIG_HOTPLUG being false.  It's
> always on now in preparation of it going away as an option.
>
> Signed-off-by: Bill Pemberton <wfp5p@virginia.edu>
> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
> Cc: Bjorn Helgaas <bhelgaas@google.com>

Acked-by: Guan Xuetao <gxt@mprc.pku.edu.cn>

> ---
>  arch/unicore32/kernel/pci.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/arch/unicore32/kernel/pci.c b/arch/unicore32/kernel/pci.c
> index b0056f6..7c43592 100644
> --- a/arch/unicore32/kernel/pci.c
> +++ b/arch/unicore32/kernel/pci.c
> @@ -250,9 +250,7 @@ void __devinit pcibios_fixup_bus(struct pci_bus *bus)
>  	printk(KERN_INFO "PCI: bus%d: Fast back to back transfers %sabled\n",
>  		bus->number, (features & PCI_COMMAND_FAST_BACK) ? "en" : "dis");
>  }
> -#ifdef CONFIG_HOTPLUG
>  EXPORT_SYMBOL(pcibios_fixup_bus);
> -#endif
>
>  static int __init pci_common_init(void)
>  {
> --
> 1.8.0
>


^ permalink raw reply

* Re: pci_enable_msix() failure
From: Bjorn Helgaas @ 2012-11-29  0:18 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-pci
In-Reply-To: <0B81514A792D4D9FAB292ED315D48BE2@alyakaslap>

On Tue, Nov 20, 2012 at 11:04 AM, Alex Lyakas <alex@zadarastorage.com> wrote:
> Hello list, I was advised to post this question by Jesse Barnes; I also
> posted to KVM list.

When you post a question to more than one list, you should always send
a *single* message with all the lists being copied.  That way
everybody can tell what progress is being made, and we can avoid
duplicating effort.

I see via a Google search that you've gotten responses on the KVM
list, e.g., http://www.spinics.net/lists/kvm/msg82997.html, so I
assume you don't need any more help from linux-pci.

If that's not the case, please add linux-pci to the CC: list of the
thread where you're working on this.

> I am running Ubuntu-Precise 3.2.0-29-generic #46, with stock KVM ("QEMU
> emulator version 1.0 (qemu-kvm-1.0)") on a Dell R510 server. I have one
> dual-port Intel's NIC 82599, of which I spawn 32 VFs from each port. I spawn
> virtual machines with KVM, each VM has 4 VFs attached (two from each PF).
>
> Once in a while, in particular when I spawn multiple VMs in parallel, I hit
> an issue that one of the VFs does not have an IRQ assigned to it. I am
> checking this in /proc/interrupts, looking for entries like
> "kvm:0000:03:14.6". In some cases, an entry is missing for a particular VF.
> As a result, the VF within the VM is non-functional.
>
> I debugged this issue further, by adding prints to kvm.ko code. I see that
> the failure happens in kvm_vm_ioctl_assigned_device/KVM_ASSIGN_DEV_IRQ path,
> which calls assigned_device_enable_host_msix() function, which calls
> pci_enable_msix(), which fails with EINVAL or with ENOMEM. This path is
> called twice for each VF.
>
> I see that first pci_enable_msix() returns -12/-22, and when
> kvm_vm_ioctl_set_msix_nr() is called again, it sees that adev->entries_nr !=
> 0 and fails the call with EINVAL. I can repro it only when spawning like 8
> or 10 VMs in parallel, but it doesn't happen every time. So it seems like
> this is not a resource shortage problem, but some race somewhere.
>
> I tested this with several version of ixgbe drivers, including the in-tree
> version that comes with Precise. It reproduces with all the versions.
>
> Can anybody pls advise on how to debug this issue further?
>
> Thanks,
> Alex.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] PCI,sriov: add documentation on sysfs-based VF control
From: Don Dutile @ 2012-11-29  0:09 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci
In-Reply-To: <CAErSpo4mso26k5pwpBs9ujrX7dZtvd-fxYnU+m712xycwngCvg@mail.gmail.com>

On 11/28/2012 03:18 PM, Bjorn Helgaas wrote:
> On Tue, Nov 27, 2012 at 8:31 PM, Donald Dutile<ddutile@redhat.com>  wrote:
>>   Signed-off: Donald Dutile<ddutile@redhat.com>
>>
>
> Thanks, Don.  I added this to my -next branch, for merging in the v3.8
> merge window.
>
> I removed the trailing underscore from sriov_numvfs_.
>
>
Thanks for pulling it into your 3.8-next branch and the correction!
My apologies for the delay in getting it out to the list
after posting the related code.
- Don

>>   Documentation/ABI/testing/sysfs-bus-pci | 34 +++++++++++++++++++++++
>>   Documentation/PCI/pci-iov-howto.txt     | 48 ++++++++++++++++++++++++++++++---
>>   2 files changed, 78 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
>> index dff1f48..1cb389d 100644
>> --- a/Documentation/ABI/testing/sysfs-bus-pci
>> +++ b/Documentation/ABI/testing/sysfs-bus-pci
>> @@ -222,3 +222,37 @@ Description:
>>                  satisfied too.  Reading this attribute will show the current
>>                  value of d3cold_allowed bit.  Writing this attribute will set
>>                  the value of d3cold_allowed bit.
>> +
>> +What:          /sys/bus/pci/devices/.../sriov_totalvfs
>> +Date:          November 2012
>> +Contact:       Donald Dutile<ddutile@redhat.com>
>> +Description:
>> +               This file appears when a physical PCIe device supports SR-IOV.
>> +               Userspace applications can read this file to determine the
>> +               maximum number of Virtual Functions (VFs) a PCIe physical
>> +               function (PF) can support. Typically, this is the value reported
>> +               in the PF's SR-IOV extended capability structure's TotalVFs
>> +               element.  Drivers have the ability at probe time to reduce the
>> +               value read from this file via the pci_sriov_set_totalvfs()
>> +               function.
>> +
>> +What:          /sys/bus/pci/devices/.../sriov_numvfs_
>> +Date:          November 2012
>> +Contact:       Donald Dutile<ddutile@redhat.com>
>> +Description:
>> +               This file appears when a physical PCIe device supports SR-IOV.
>> +               Userspace applications can read and write to this file to
>> +               determine and control the enablement or disablement of Virtual
>> +               Functions (VFs) on the physical function (PF). A read of this
>> +               file will return the number of VFs that are enabled on this PF.
>> +               A number written to this file will enable the specified
>> +               number of VFs. A userspace application would typically read the
>> +               file and check that the value is zero, and then write the number
>> +               of VFs that should be enabled on the PF; the value written
>> +               should be less than or equal to the value in the sriov_totalvfs
>> +               file. A userspace application wanting to disable the VFs would
>> +               write a zero to this file. The core ensures that valid values
>> +               are written to this file, and returns errors when values are not
>> +               valid.  For example, writing a 2 to this file when sriov_numvfs
>> +               is not 0 and not 2 already will return an error. Writing a 10
>> +               when the value of sriov_totalvfs is 8 will return an error.
>> diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt
>> index fc73ef5..c41cf95 100644
>> --- a/Documentation/PCI/pci-iov-howto.txt
>> +++ b/Documentation/PCI/pci-iov-howto.txt
>> @@ -2,6 +2,9 @@
>>                  Copyright (C) 2009 Intel Corporation
>>                      Yu Zhao<yu.zhao@intel.com>
>>
>> +               Update: November 2012
>> +                       -- sysfs-based SRIOV enable-/disable-ment
>> +               Donald Dutile<ddutile@redhat.com>
>>
>>   1. Overview
>>
>> @@ -24,10 +27,21 @@ real existing PCI device.
>>
>>   2.1 How can I enable SR-IOV capability
>>
>> -The device driver (PF driver) will control the enabling and disabling
>> -of the capability via API provided by SR-IOV core. If the hardware
>> -has SR-IOV capability, loading its PF driver would enable it and all
>> -VFs associated with the PF.
>> +Multiple methods are available for SR-IOV enablement.
>> +In the first method, the device driver (PF driver) will control the
>> +enabling and disabling of the capability via API provided by SR-IOV core.
>> +If the hardware has SR-IOV capability, loading its PF driver would
>> +enable it and all VFs associated with the PF.  Some PF drivers require
>> +a module parameter to be set to determine the number of VFs to enable.
>> +In the second method, a write to the sysfs file sriov_numvfs will
>> +enable and disable the VFs associated with a PCIe PF.  This method
>> +enables per-PF, VF enable/disable values versus the first method,
>> +which applies to all PFs of the same device.  Additionally, the
>> +PCI SRIOV core support ensures that enable/disable operations are
>> +valid to reduce duplication in multiple drivers for the same
>> +checks, e.g., check numvfs == 0 if enabling VFs, ensure
>> +numvfs<= totalvfs.
>> +The second method is the recommended method for new/future VF devices.
>>
>>   2.2 How can I use the Virtual Functions
>>
>> @@ -40,13 +54,22 @@ requires device driver that is same as a normal PCI device's.
>>   3.1 SR-IOV API
>>
>>   To enable SR-IOV capability:
>> +(a) For the first method, in the driver:
>>          int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
>>          'nr_virtfn' is number of VFs to be enabled.
>> +(b) For the second method, from sysfs:
>> +       echo 'nr_virtfn'>  \
>> +        /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
>>
>>   To disable SR-IOV capability:
>> +(a) For the first method, in the driver:
>>          void pci_disable_sriov(struct pci_dev *dev);
>> +(b) For the second method, from sysfs:
>> +       echo  0>  \
>> +        /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
>>
>>   To notify SR-IOV core of Virtual Function Migration:
>> +(a) In the driver:
>>          irqreturn_t pci_sriov_migration(struct pci_dev *dev);
>>
>>   3.2 Usage example
>> @@ -88,6 +111,22 @@ static void dev_shutdown(struct pci_dev *dev)
>>          ...
>>   }
>>
>> +static int dev_sriov_configure(struct pci_dev *dev, int numvfs)
>> +{
>> +       if (numvfs>  0) {
>> +               ...
>> +               pci_enable_sriov(dev, numvfs);
>> +               ...
>> +               return numvfs;
>> +       }
>> +       if (numvfs == 0) {
>> +               ....
>> +               pci_disable_sriov(dev);
>> +               ...
>> +               return 0;
>> +       }
>> +}
>> +
>>   static struct pci_driver dev_driver = {
>>          .name =         "SR-IOV Physical Function driver",
>>          .id_table =     dev_id_table,
>> @@ -96,4 +135,5 @@ static struct pci_driver dev_driver = {
>>          .suspend =      dev_suspend,
>>          .resume =       dev_resume,
>>          .shutdown =     dev_shutdown,
>> +       .sriov_configure = dev_sriov_configure,
>>   };
>> --
>> 1.7.10.2.552.gaa3bb87
>>


^ permalink raw reply

* Re: [PATCH v2 RESEND] Add NumaChip remote PCI support
From: Bjorn Helgaas @ 2012-11-28 23:08 UTC (permalink / raw)
  To: Daniel J Blueman; +Cc: linux-pci, linux-kernel, Steffen Persvold
In-Reply-To: <1353487196-10204-1-git-send-email-daniel@numascale-asia.com>

On Wed, Nov 21, 2012 at 1:39 AM, Daniel J Blueman
<daniel@numascale-asia.com> wrote:
> Add NumaChip-specific PCI access mechanism via MMCONFIG cycles, but
> preventing access to AMD Northbridges which shouldn't respond.
>
> v2: Use PCI_DEVFN in precomputed constant limit; drop unneeded includes
>
> Signed-off-by: Daniel J Blueman <daniel@numascale-asia.com>
> ---
>  arch/x86/include/asm/numachip/numachip.h |   20 +++++
>  arch/x86/kernel/apic/apic_numachip.c     |    2 +
>  arch/x86/pci/Makefile                    |    1 +
>  arch/x86/pci/numachip.c                  |  134 ++++++++++++++++++++++++++++++
>  4 files changed, 157 insertions(+)
>  create mode 100644 arch/x86/include/asm/numachip/numachip.h
>  create mode 100644 arch/x86/pci/numachip.c
>
> diff --git a/arch/x86/include/asm/numachip/numachip.h b/arch/x86/include/asm/numachip/numachip.h
> new file mode 100644
> index 0000000..d35e71a
> --- /dev/null
> +++ b/arch/x86/include/asm/numachip/numachip.h
> @@ -0,0 +1,20 @@
> +/*
> + * This file is subject to the terms and conditions of the GNU General Public
> + * License.  See the file "COPYING" in the main directory of this archive
> + * for more details.
> + *
> + * Numascale NumaConnect-specific header file
> + *
> + * Copyright (C) 2012 Numascale AS. All rights reserved.
> + *
> + * Send feedback to <support@numascale.com>
> + *
> + */
> +
> +#ifndef _ASM_X86_NUMACHIP_NUMACHIP_H
> +#define _ASM_X86_NUMACHIP_NUMACHIP_H
> +
> +extern int __init pci_numachip_init(void);
> +
> +#endif /* _ASM_X86_NUMACHIP_NUMACHIP_H */
> +
> diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c
> index a65829a..9c2aa89 100644
> --- a/arch/x86/kernel/apic/apic_numachip.c
> +++ b/arch/x86/kernel/apic/apic_numachip.c
> @@ -22,6 +22,7 @@
>  #include <linux/hardirq.h>
>  #include <linux/delay.h>
>
> +#include <asm/numachip/numachip.h>
>  #include <asm/numachip/numachip_csr.h>
>  #include <asm/smp.h>
>  #include <asm/apic.h>
> @@ -179,6 +180,7 @@ static int __init numachip_system_init(void)
>                 return 0;
>
>         x86_cpuinit.fixup_cpu_id = fixup_cpu_id;
> +       x86_init.pci.arch_init = pci_numachip_init;
>
>         map_csrs();
>
> diff --git a/arch/x86/pci/Makefile b/arch/x86/pci/Makefile
> index 3af5a1e..ee0af58 100644
> --- a/arch/x86/pci/Makefile
> +++ b/arch/x86/pci/Makefile
> @@ -16,6 +16,7 @@ obj-$(CONFIG_STA2X11)           += sta2x11-fixup.o
>  obj-$(CONFIG_X86_VISWS)                += visws.o
>
>  obj-$(CONFIG_X86_NUMAQ)                += numaq_32.o
> +obj-$(CONFIG_X86_NUMACHIP)     += numachip.o

It looks like this depends on CONFIG_PCI_MMCONFIG for
pci_mmconfig_lookup().  Are there config constraints that force
CONFIG_PCI_MMCONFIG=y when CONFIG_X86_NUMACHIP=y?

>  obj-$(CONFIG_X86_INTEL_MID)    += mrst.o
>
> diff --git a/arch/x86/pci/numachip.c b/arch/x86/pci/numachip.c
> new file mode 100644
> index 0000000..3773e05
> --- /dev/null
> +++ b/arch/x86/pci/numachip.c
> @@ -0,0 +1,129 @@
> +/*
> + * This file is subject to the terms and conditions of the GNU General Public
> + * License.  See the file "COPYING" in the main directory of this archive
> + * for more details.
> + *
> + * Numascale NumaConnect-specific PCI code
> + *
> + * Copyright (C) 2012 Numascale AS. All rights reserved.
> + *
> + * Send feedback to <support@numascale.com>
> + *
> + * PCI accessor functions derived from mmconfig_64.c
> + *
> + */
> +
> +#include <linux/pci.h>
> +#include <asm/pci_x86.h>
> +
> +static u8 limit __read_mostly;
> +
> +static inline char __iomem *pci_dev_base(unsigned int seg, unsigned int bus, unsigned int devfn)
> +{
> +       struct pci_mmcfg_region *cfg = pci_mmconfig_lookup(seg, bus);
> +
> +       if (cfg && cfg->virt)
> +               return cfg->virt + (PCI_MMCFG_BUS_OFFSET(bus) | (devfn << 12));
> +       return NULL;
> +}

Most of this file is copied directly from mmconfig_64.c (as you
mentioned above).  I wonder if we could avoid the code duplication by
making the pci_dev_base() implementation in mmconfig_64.c a weak
definition.  Then you could just supply a non-weak pci_dev_base() here
that would override that default version.  Your version would look
something like:

  char __iomem *pci_dev_base(unsigned int seg, unsigned int bus,
unsigned int devfn)
  {
      struct pci_mmcfg_region *cfg = pci_mmconfig_lookup(seg, bus);

      if (cfg && cfg->virt && devfn < limit)
          return cfg->virt + (PCI_MMCFG_BUS_OFFSET(bus) | (devfn << 12));
      return NULL;
  }

That would be different from what you have in this patch because reads
& writes to devices above "limit" would return -EINVAL rather than 0
as you do here.  Would that be a problem?

> +static int pci_mmcfg_read_numachip(unsigned int seg, unsigned int bus,
> +                         unsigned int devfn, int reg, int len, u32 *value)
> +{
> +       char __iomem *addr;
> +
> +       /* Why do we have this when nobody checks it. How about a BUG()!? -AK */
> +       if (unlikely((bus > 255) || (devfn > 255) || (reg > 4095))) {
> +err:           *value = -1;
> +               return -EINVAL;
> +       }
> +
> +       /* Ensure AMD Northbridges don't decode reads to other devices */
> +       if (unlikely(bus == 0 && devfn >= limit)) {
> +               *value = -1;
> +               return 0;
> +       }
> +
> +       rcu_read_lock();
> +       addr = pci_dev_base(seg, bus, devfn);
> +       if (!addr) {
> +               rcu_read_unlock();
> +               goto err;
> +       }
> +
> +       switch (len) {
> +       case 1:
> +               *value = mmio_config_readb(addr + reg);
> +               break;
> +       case 2:
> +               *value = mmio_config_readw(addr + reg);
> +               break;
> +       case 4:
> +               *value = mmio_config_readl(addr + reg);
> +               break;
> +       }
> +       rcu_read_unlock();
> +
> +       return 0;
> +}
> +
> +static int pci_mmcfg_write_numachip(unsigned int seg, unsigned int bus,
> +                          unsigned int devfn, int reg, int len, u32 value)
> +{
> +       char __iomem *addr;
> +
> +       /* Why do we have this when nobody checks it. How about a BUG()!? -AK */
> +       if (unlikely((bus > 255) || (devfn > 255) || (reg > 4095)))
> +               return -EINVAL;
> +
> +       /* Ensure AMD Northbridges don't decode writes to other devices */
> +       if (unlikely(bus == 0 && devfn >= limit))
> +               return 0;
> +
> +       rcu_read_lock();
> +       addr = pci_dev_base(seg, bus, devfn);
> +       if (!addr) {
> +               rcu_read_unlock();
> +               return -EINVAL;
> +       }
> +
> +       switch (len) {
> +       case 1:
> +               mmio_config_writeb(addr + reg, value);
> +               break;
> +       case 2:
> +               mmio_config_writew(addr + reg, value);
> +               break;
> +       case 4:
> +               mmio_config_writel(addr + reg, value);
> +               break;
> +       }
> +       rcu_read_unlock();
> +
> +       return 0;
> +}
> +
> +const struct pci_raw_ops pci_mmcfg_numachip = {
> +       .read = pci_mmcfg_read_numachip,
> +       .write = pci_mmcfg_write_numachip,
> +};
> +
> +int __init pci_numachip_init(void)
> +{
> +       int ret = 0;
> +       u32 val;
> +
> +       /* For remote I/O, restrict bus 0 access to the actual number of AMD
> +          Northbridges, which starts at device number 0x18 */
> +       ret = raw_pci_read(0, 0, PCI_DEVFN(0x18, 0), 0x60, sizeof(val), &val);
> +       if (ret)
> +               goto out;
> +
> +       /* HyperTransport fabric size in bits 6:4 */
> +       limit = PCI_DEVFN(0x18 + ((val >> 4) & 7) + 1, 0);
> +
> +       /* Use NumaChip PCI accessors for non-extended and extended access */
> +       raw_pci_ops = raw_pci_ext_ops = &pci_mmcfg_numachip;
> +out:
> +       return ret;
> +}
> --
> 1.7.9.5
>

^ permalink raw reply

* Re: Unreliable USB3 with NEC uPD720200 and Delock Cardreader
From: Sarah Sharp @ 2012-11-28 22:54 UTC (permalink / raw)
  To: Bjørn Mork
  Cc: Ulrich Eckhardt, Andrew Lutomirski, linux-usb@vger.kernel.org,
	Alan Stern, Ming Lei, Rafael J. Wysocki, Bjorn Helgaas, linux-pci,
	Huang Ying
In-Reply-To: <87boekave4.fsf@nemi.mork.no>

On Mon, Nov 26, 2012 at 10:48:03PM +0100, Bjørn Mork wrote:
> Sarah Sharp <sarah.a.sharp@linux.intel.com> writes:
> 
> > It looks like both Ulrich and Andrew have the same issue.  I also have a
> > Lenovo x220, and I confirmed that when I turn on PCI runtime suspend,
> > the NEC host controller does not report port status changes when a new
> > USB device is plugged in.
> >
> > I'm running 3.6.7, and I'm pretty sure that runtime suspend worked for
> > the NEC host on some older kernel.  I don't think the NEC host went into
> > D3cold on that kernel, though.  Is there a way to disable D3cold and
> > just use D3hot instead?
> 
> Yes, you have /sys/bus/pci/devices/.../d3cold_allowed
> See Documentation/ABI/testing/sysfs-bus-pci
> 
> If this really is a problem with the D3cold support that went into 3.6
> then I guess you should include Huang Ying in the discussions as well
> (CCed).

Turning off D3 cold didn't help.  Once the PCI device is suspended,
connect events do not generate an interrupt.

I'll go see if I can figure out which kernel this worked on and bisect.

Sarah Sharp

^ permalink raw reply

* Re: [PATCH] PCI/PM: Keep runtime PM enabled for unbound PCI devices
From: Rafael J. Wysocki @ 2012-11-28 22:43 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Huang Ying, linux-kernel, linux-pci, linux-pm, Rafael J. Wysocki,
	Alan Stern
In-Reply-To: <CAErSpo6CPSMXjq1eSTX41vtwOcd5rQOdC0-9ye5NZhds9U1B8A@mail.gmail.com>

On Wednesday, November 28, 2012 03:25:59 PM Bjorn Helgaas wrote:
> On Tue, Nov 20, 2012 at 1:08 AM, Huang Ying <ying.huang@intel.com> wrote:
> > For unbound PCI devices, what we need is:
> >
> >  - Always in D0 state, because some devices does not work again after
> >    being put into D3 by the PCI bus.
> >
> >  - In SUSPENDED state if allowed, so that the parent devices can still
> >    be put into low power state.
> >
> > To satisfy these requirement, the runtime PM for the unbound PCI
> > devices are disabled and set to SUSPENDED state.  One issue of this
> > solution is that the PCI devices will be put into SUSPENDED state even
> > if the SUSPENDED state is forbidden via the sysfs interface
> > (.../power/control) of the device.  This is not an issue for most
> > devices, because most PCI devices are not used at all if unbounded.
> > But there are exceptions.  For example, unbound VGA card can be used
> > for display, but suspend its parents make it stop working.
> >
> > To fix the issue, we keep the runtime PM enabled when the PCI devices
> > are unbound.  But the runtime PM callbacks will do nothing if the PCI
> > devices are unbound.  This way, we can put the PCI devices into
> > SUSPENDED state without put the PCI devices into D3 state.
> 
> Does this fix a reported problem?  Is there a bug report URL?  What
> problem would a user notice?

There is a BZ: https://bugzilla.kernel.org/show_bug.cgi?id=48201

Unfortunately, the reporter hasn't confirmed that the bug is fixed,
although we're quite confident that it will be.

> This doesn't look critical enough to try to put in v3.7 (correct me if
> I'm wrong).  It's getting pretty late for the v3.8 merge window, but
> it looks like it qualifies as a bug fix that we would merge even after
> the merge window, so I'll put it in my -next branch headed for v3.8.
> Then it can be backported to v3.6 and v3.7.

I think that's OK.

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [PATCH] PCI/PM: Keep runtime PM enabled for unbound PCI devices
From: Bjorn Helgaas @ 2012-11-28 22:25 UTC (permalink / raw)
  To: Huang Ying
  Cc: linux-kernel, linux-pci, linux-pm, Rafael J. Wysocki,
	Rafael J. Wysocki, Alan Stern
In-Reply-To: <1353398902-21253-1-git-send-email-ying.huang@intel.com>

On Tue, Nov 20, 2012 at 1:08 AM, Huang Ying <ying.huang@intel.com> wrote:
> For unbound PCI devices, what we need is:
>
>  - Always in D0 state, because some devices does not work again after
>    being put into D3 by the PCI bus.
>
>  - In SUSPENDED state if allowed, so that the parent devices can still
>    be put into low power state.
>
> To satisfy these requirement, the runtime PM for the unbound PCI
> devices are disabled and set to SUSPENDED state.  One issue of this
> solution is that the PCI devices will be put into SUSPENDED state even
> if the SUSPENDED state is forbidden via the sysfs interface
> (.../power/control) of the device.  This is not an issue for most
> devices, because most PCI devices are not used at all if unbounded.
> But there are exceptions.  For example, unbound VGA card can be used
> for display, but suspend its parents make it stop working.
>
> To fix the issue, we keep the runtime PM enabled when the PCI devices
> are unbound.  But the runtime PM callbacks will do nothing if the PCI
> devices are unbound.  This way, we can put the PCI devices into
> SUSPENDED state without put the PCI devices into D3 state.

Does this fix a reported problem?  Is there a bug report URL?  What
problem would a user notice?

This doesn't look critical enough to try to put in v3.7 (correct me if
I'm wrong).  It's getting pretty late for the v3.8 merge window, but
it looks like it qualifies as a bug fix that we would merge even after
the merge window, so I'll put it in my -next branch headed for v3.8.
Then it can be backported to v3.6 and v3.7.

> Signed-off-by: Huang Ying <ying.huang@intel.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> CC: stable@vger.kernel.org          # v3.6+
> ---
>  drivers/pci/pci-driver.c |   69 +++++++++++++++++++++++++++--------------------
>  drivers/pci/pci.c        |    2 +
>  2 files changed, 43 insertions(+), 28 deletions(-)
>
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1900,6 +1900,8 @@ void pci_pm_init(struct pci_dev *dev)
>         u16 pmc;
>
>         pm_runtime_forbid(&dev->dev);
> +       pm_runtime_set_active(&dev->dev);
> +       pm_runtime_enable(&dev->dev);
>         device_enable_async_suspend(&dev->dev);
>         dev->wakeup_prepared = false;
>
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -256,31 +256,26 @@ struct drv_dev_and_id {
>  static long local_pci_probe(void *_ddi)
>  {
>         struct drv_dev_and_id *ddi = _ddi;
> -       struct device *dev = &ddi->dev->dev;
> -       struct device *parent = dev->parent;
> +       struct pci_dev *pci_dev = ddi->dev;
> +       struct pci_driver *pci_drv = ddi->drv;
> +       struct device *dev = &pci_dev->dev;
>         int rc;
>
> -       /* The parent bridge must be in active state when probing */
> -       if (parent)
> -               pm_runtime_get_sync(parent);
> -       /* Unbound PCI devices are always set to disabled and suspended.
> -        * During probe, the device is set to enabled and active and the
> -        * usage count is incremented.  If the driver supports runtime PM,
> -        * it should call pm_runtime_put_noidle() in its probe routine and
> -        * pm_runtime_get_noresume() in its remove routine.
> -        */
> -       pm_runtime_get_noresume(dev);
> -       pm_runtime_set_active(dev);
> -       pm_runtime_enable(dev);
> -
> -       rc = ddi->drv->probe(ddi->dev, ddi->id);
> +       /*
> +        * Unbound PCI devices are always put in D0, regardless of
> +        * runtime PM status.  During probe, the device is set to
> +        * active and the usage count is incremented.  If the driver
> +        * supports runtime PM, it should call pm_runtime_put_noidle()
> +        * in its probe routine and pm_runtime_get_noresume() in its
> +        * remove routine.
> +        */
> +       pm_runtime_get_sync(dev);
> +       pci_dev->driver = pci_drv;
> +       rc = pci_drv->probe(pci_dev, ddi->id);
>         if (rc) {
> -               pm_runtime_disable(dev);
> -               pm_runtime_set_suspended(dev);
> -               pm_runtime_put_noidle(dev);
> +               pci_dev->driver = NULL;
> +               pm_runtime_put_sync(dev);
>         }
> -       if (parent)
> -               pm_runtime_put(parent);
>         return rc;
>  }
>
> @@ -330,10 +325,8 @@ __pci_device_probe(struct pci_driver *dr
>                 id = pci_match_device(drv, pci_dev);
>                 if (id)
>                         error = pci_call_probe(drv, pci_dev, id);
> -               if (error >= 0) {
> -                       pci_dev->driver = drv;
> +               if (error >= 0)
>                         error = 0;
> -               }
>         }
>         return error;
>  }
> @@ -369,9 +362,7 @@ static int pci_device_remove(struct devi
>         }
>
>         /* Undo the runtime PM settings in local_pci_probe() */
> -       pm_runtime_disable(dev);
> -       pm_runtime_set_suspended(dev);
> -       pm_runtime_put_noidle(dev);
> +       pm_runtime_put_sync(dev);
>
>         /*
>          * If the device is still on, set the power state as "unknown",
> @@ -994,6 +985,13 @@ static int pci_pm_runtime_suspend(struct
>         pci_power_t prev = pci_dev->current_state;
>         int error;
>
> +       /*
> +        * If pci_dev->driver is not set (unbound), the device should
> +        * always remain in D0 regardless of the runtime PM status
> +        */
> +       if (!pci_dev->driver)
> +               return 0;
> +
>         if (!pm || !pm->runtime_suspend)
>                 return -ENOSYS;
>
> @@ -1029,6 +1027,13 @@ static int pci_pm_runtime_resume(struct
>         struct pci_dev *pci_dev = to_pci_dev(dev);
>         const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
>
> +       /*
> +        * If pci_dev->driver is not set (unbound), the device should
> +        * always remain in D0 regardless of the runtime PM status
> +        */
> +       if (!pci_dev->driver)
> +               return 0;
> +
>         if (!pm || !pm->runtime_resume)
>                 return -ENOSYS;
>
> @@ -1046,8 +1051,16 @@ static int pci_pm_runtime_resume(struct
>
>  static int pci_pm_runtime_idle(struct device *dev)
>  {
> +       struct pci_dev *pci_dev = to_pci_dev(dev);
>         const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
>
> +       /*
> +        * If pci_dev->driver is not set (unbound), the device should
> +        * always remain in D0 regardless of the runtime PM status
> +        */
> +       if (!pci_dev->driver)
> +               goto out;
> +
>         if (!pm)
>                 return -ENOSYS;
>
> @@ -1057,8 +1070,8 @@ static int pci_pm_runtime_idle(struct de
>                         return ret;
>         }
>
> +out:
>         pm_runtime_suspend(dev);
> -
>         return 0;
>  }
>

^ permalink raw reply

* RE: [PATCH v3] PCI-AER: Do not report successful error recovery for devices with AER-unaware drivers
From: Pandarathil, Vijaymohan R @ 2012-11-28 20:30 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
	linasvepstas@gmail.com, Myron Stowe, Ortiz, Lance E, Huang Ying,
	Hidetoshi Seto, Patterson, Andrew D (LeftHand Networks),
	Zhang Yanmin
In-Reply-To: <CAErSpo7VvWreoTj7MUE3cMGaM1dVMfZymO=1P-m8E-joRSgE7g@mail.gmail.com>

Looks correct.

Vijay

> -----Original Message-----
> From: Bjorn Helgaas [mailto:bhelgaas@google.com]
> Sent: Wednesday, November 28, 2012 12:15 PM
> To: Pandarathil, Vijaymohan R
> Cc: linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org;
> linasvepstas@gmail.com; Myron Stowe; Ortiz, Lance E; Huang Ying; Hidetoshi
> Seto; Patterson, Andrew D (LeftHand Networks); Zhang Yanmin
> Subject: Re: [PATCH v3] PCI-AER: Do not report successful error recovery
> for devices with AER-unaware drivers
> 
> On Sat, Nov 17, 2012 at 4:47 AM, Pandarathil, Vijaymohan R
> <vijaymohan.pandarathil@hp.com> wrote:
> > When an error is detected on a PCIe device which does not have an
> > AER-aware driver, prevent AER infrastructure from reporting
> > successful error recovery.
> >
> > This is because the report_error_detected() function that gets
> > called in the first phase of recovery process allows forward
> > progress even when the driver for the device does not have AER
> > capabilities. It seems that all callbacks (in pci_error_handlers
> > structure) registered by drivers that gets called during error
> > recovery are not mandatory. So the intention of the infrastructure
> > design seems to be to allow forward progress even when a specific
> > callback has not been registered by a driver. However, if error
> > handler structure itself has not been registered, it doesn't make
> > sense to allow forward progress.
> >
> > As a result of the current design, in the case of a single device
> > having an AER-unaware driver or in the case of any function in a
> > multi-function card having an AER-unaware driver, a successful
> > recovery is reported.
> >
> > Typical scenario this happens is when a PCI device is detached
> > from a KVM host and the pci-stub driver on the host claims the
> > device. The pci-stub driver does not have error handling capabilities
> > but the AER infrastructure still reports that the device recovered
> > successfully.
> >
> > The changes proposed here leaves the device(s)in an unrecovered state
> > if the driver for the device or for any device in the subtree
> > does not have error handler structure registered. This reflects
> > the true state of the device and prevents any partial recovery (or no
> > recovery at all) reported as successful.
> >
> >
> > v3:
> >   - Fixed a checkpatch/build issue
> >
> > v2:
> >   - Made changes so that all devices in the subtree have the error
> >     state set correctly.
> >
> >
> > Reviewed-by: Linas Vepstas <linasvepstas <at> gmail.com>
> > Reviewed-by: Myron Stowe <mstowe <at> redhat.com>
> > Reviewed-by: Bjorn Helgaas <bhelgaas <at> google.com>
> > Signed-off-by: Vijay Mohan Pandarathil <vijaymohan.pandarathil@hp.com>
> 
> I added this to my -next branch, which will be merged for v3.8.
> 
> Please double-check that I resolved the merge conflict correctly.  It
> should show up here in the next few minutes:
> 
> http://git.kernel.org/?p=linux/kernel/git/helgaas/pci.git;a=shortlog;h=refs
> /heads/next
> 
> >  drivers/pci/pcie/aer/aerdrv.h      |  5 ++++-
> >  drivers/pci/pcie/aer/aerdrv_core.c | 21 ++++++++++++++++++---
> >  include/linux/pci.h                |  3 +++
> >  3 files changed, 25 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/pci/pcie/aer/aerdrv.h
> b/drivers/pci/pcie/aer/aerdrv.h
> > index 94a7598..22f840f 100644
> > --- a/drivers/pci/pcie/aer/aerdrv.h
> > +++ b/drivers/pci/pcie/aer/aerdrv.h
> > @@ -87,6 +87,9 @@ struct aer_broadcast_data {
> >  static inline pci_ers_result_t merge_result(enum pci_ers_result orig,
> >                 enum pci_ers_result new)
> >  {
> > +       if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
> > +               return PCI_ERS_RESULT_NO_AER_DRIVER;
> > +
> >         if (new == PCI_ERS_RESULT_NONE)
> >                 return orig;
> >
> > @@ -97,7 +100,7 @@ static inline pci_ers_result_t merge_result(enum
> pci_ers_result orig,
> >                 break;
> >         case PCI_ERS_RESULT_DISCONNECT:
> >                 if (new == PCI_ERS_RESULT_NEED_RESET)
> > -                       orig = new;
> > +                       orig = PCI_ERS_RESULT_NEED_RESET;
> >                 break;
> >         default:
> >                 break;
> > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c
> b/drivers/pci/pcie/aer/aerdrv_core.c
> > index 9f80cb0..b67f91a 100644
> > --- a/drivers/pci/pcie/aer/aerdrv_core.c
> > +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> > @@ -232,11 +232,26 @@ static int report_error_detected(struct pci_dev
> *dev, void *data)
> >                                    dev->driver ?
> >                                    "no AER-aware driver" : "no driver");
> >                 }
> > -               return 0;
> > +
> > +               /*
> > +                * If there's any device in the subtree that does not
> > +                * have an error_detected callback, returning
> > +                * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
> > +                * the subsequent mmio_enabled/slot_reset/resume
> > +                * callbacks of "any" device in the subtree. All the
> > +                * devices in the subtree are left in the error state
> > +                * without recovery.
> > +                */
> > +
> > +               if (!(dev->hdr_type & PCI_HEADER_TYPE_BRIDGE))
> > +                       vote = PCI_ERS_RESULT_NO_AER_DRIVER;
> > +               else
> > +                       vote = PCI_ERS_RESULT_NONE;
> > +       } else {
> > +               err_handler = dev->driver->err_handler;
> > +               vote = err_handler->error_detected(dev, result_data-
> >state);
> >         }
> >
> > -       err_handler = dev->driver->err_handler;
> > -       vote = err_handler->error_detected(dev, result_data->state);
> >         result_data->result = merge_result(result_data->result, vote);
> >         return 0;
> >  }
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index ab17a08..c458602 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -540,6 +540,9 @@ enum pci_ers_result {
> >
> >         /* Device driver is fully recovered and operational */
> >         PCI_ERS_RESULT_RECOVERED = (__force pci_ers_result_t) 5,
> > +
> > +       /* No AER capabilities registered for the driver */
> > +       PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6,
> >  };
> >
> >  /* PCI bus error event callbacks */
> > --
> > 1.7.11.3
> >

^ permalink raw reply

* Re: [PATCH] PCI,sriov: add documentation on sysfs-based VF control
From: Bjorn Helgaas @ 2012-11-28 20:18 UTC (permalink / raw)
  To: Donald Dutile; +Cc: linux-pci
In-Reply-To: <1354073497-6702-1-git-send-email-ddutile@redhat.com>

On Tue, Nov 27, 2012 at 8:31 PM, Donald Dutile <ddutile@redhat.com> wrote:
>  Signed-off: Donald Dutile <ddutile@redhat.com>
>

Thanks, Don.  I added this to my -next branch, for merging in the v3.8
merge window.

I removed the trailing underscore from sriov_numvfs_.


>  Documentation/ABI/testing/sysfs-bus-pci | 34 +++++++++++++++++++++++
>  Documentation/PCI/pci-iov-howto.txt     | 48 ++++++++++++++++++++++++++++++---
>  2 files changed, 78 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index dff1f48..1cb389d 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -222,3 +222,37 @@ Description:
>                 satisfied too.  Reading this attribute will show the current
>                 value of d3cold_allowed bit.  Writing this attribute will set
>                 the value of d3cold_allowed bit.
> +
> +What:          /sys/bus/pci/devices/.../sriov_totalvfs
> +Date:          November 2012
> +Contact:       Donald Dutile <ddutile@redhat.com>
> +Description:
> +               This file appears when a physical PCIe device supports SR-IOV.
> +               Userspace applications can read this file to determine the
> +               maximum number of Virtual Functions (VFs) a PCIe physical
> +               function (PF) can support. Typically, this is the value reported
> +               in the PF's SR-IOV extended capability structure's TotalVFs
> +               element.  Drivers have the ability at probe time to reduce the
> +               value read from this file via the pci_sriov_set_totalvfs()
> +               function.
> +
> +What:          /sys/bus/pci/devices/.../sriov_numvfs_
> +Date:          November 2012
> +Contact:       Donald Dutile <ddutile@redhat.com>
> +Description:
> +               This file appears when a physical PCIe device supports SR-IOV.
> +               Userspace applications can read and write to this file to
> +               determine and control the enablement or disablement of Virtual
> +               Functions (VFs) on the physical function (PF). A read of this
> +               file will return the number of VFs that are enabled on this PF.
> +               A number written to this file will enable the specified
> +               number of VFs. A userspace application would typically read the
> +               file and check that the value is zero, and then write the number
> +               of VFs that should be enabled on the PF; the value written
> +               should be less than or equal to the value in the sriov_totalvfs
> +               file. A userspace application wanting to disable the VFs would
> +               write a zero to this file. The core ensures that valid values
> +               are written to this file, and returns errors when values are not
> +               valid.  For example, writing a 2 to this file when sriov_numvfs
> +               is not 0 and not 2 already will return an error. Writing a 10
> +               when the value of sriov_totalvfs is 8 will return an error.
> diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt
> index fc73ef5..c41cf95 100644
> --- a/Documentation/PCI/pci-iov-howto.txt
> +++ b/Documentation/PCI/pci-iov-howto.txt
> @@ -2,6 +2,9 @@
>                 Copyright (C) 2009 Intel Corporation
>                     Yu Zhao <yu.zhao@intel.com>
>
> +               Update: November 2012
> +                       -- sysfs-based SRIOV enable-/disable-ment
> +               Donald Dutile <ddutile@redhat.com>
>
>  1. Overview
>
> @@ -24,10 +27,21 @@ real existing PCI device.
>
>  2.1 How can I enable SR-IOV capability
>
> -The device driver (PF driver) will control the enabling and disabling
> -of the capability via API provided by SR-IOV core. If the hardware
> -has SR-IOV capability, loading its PF driver would enable it and all
> -VFs associated with the PF.
> +Multiple methods are available for SR-IOV enablement.
> +In the first method, the device driver (PF driver) will control the
> +enabling and disabling of the capability via API provided by SR-IOV core.
> +If the hardware has SR-IOV capability, loading its PF driver would
> +enable it and all VFs associated with the PF.  Some PF drivers require
> +a module parameter to be set to determine the number of VFs to enable.
> +In the second method, a write to the sysfs file sriov_numvfs will
> +enable and disable the VFs associated with a PCIe PF.  This method
> +enables per-PF, VF enable/disable values versus the first method,
> +which applies to all PFs of the same device.  Additionally, the
> +PCI SRIOV core support ensures that enable/disable operations are
> +valid to reduce duplication in multiple drivers for the same
> +checks, e.g., check numvfs == 0 if enabling VFs, ensure
> +numvfs <= totalvfs.
> +The second method is the recommended method for new/future VF devices.
>
>  2.2 How can I use the Virtual Functions
>
> @@ -40,13 +54,22 @@ requires device driver that is same as a normal PCI device's.
>  3.1 SR-IOV API
>
>  To enable SR-IOV capability:
> +(a) For the first method, in the driver:
>         int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
>         'nr_virtfn' is number of VFs to be enabled.
> +(b) For the second method, from sysfs:
> +       echo 'nr_virtfn' > \
> +        /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
>
>  To disable SR-IOV capability:
> +(a) For the first method, in the driver:
>         void pci_disable_sriov(struct pci_dev *dev);
> +(b) For the second method, from sysfs:
> +       echo  0 > \
> +        /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
>
>  To notify SR-IOV core of Virtual Function Migration:
> +(a) In the driver:
>         irqreturn_t pci_sriov_migration(struct pci_dev *dev);
>
>  3.2 Usage example
> @@ -88,6 +111,22 @@ static void dev_shutdown(struct pci_dev *dev)
>         ...
>  }
>
> +static int dev_sriov_configure(struct pci_dev *dev, int numvfs)
> +{
> +       if (numvfs > 0) {
> +               ...
> +               pci_enable_sriov(dev, numvfs);
> +               ...
> +               return numvfs;
> +       }
> +       if (numvfs == 0) {
> +               ....
> +               pci_disable_sriov(dev);
> +               ...
> +               return 0;
> +       }
> +}
> +
>  static struct pci_driver dev_driver = {
>         .name =         "SR-IOV Physical Function driver",
>         .id_table =     dev_id_table,
> @@ -96,4 +135,5 @@ static struct pci_driver dev_driver = {
>         .suspend =      dev_suspend,
>         .resume =       dev_resume,
>         .shutdown =     dev_shutdown,
> +       .sriov_configure = dev_sriov_configure,
>  };
> --
> 1.7.10.2.552.gaa3bb87
>

^ permalink raw reply

* Re: [PATCH v3] PCI-AER: Do not report successful error recovery for devices with AER-unaware drivers
From: Bjorn Helgaas @ 2012-11-28 20:15 UTC (permalink / raw)
  To: Pandarathil, Vijaymohan R
  Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
	linasvepstas@gmail.com, Myron Stowe, Ortiz, Lance E, Huang Ying,
	Hidetoshi Seto, Patterson, Andrew D (LeftHand Networks),
	Zhang Yanmin
In-Reply-To: <F9E001219150CB45BEDC82A650F360C90146A2F8@G9W0717.americas.hpqcorp.net>

On Sat, Nov 17, 2012 at 4:47 AM, Pandarathil, Vijaymohan R
<vijaymohan.pandarathil@hp.com> wrote:
> When an error is detected on a PCIe device which does not have an
> AER-aware driver, prevent AER infrastructure from reporting
> successful error recovery.
>
> This is because the report_error_detected() function that gets
> called in the first phase of recovery process allows forward
> progress even when the driver for the device does not have AER
> capabilities. It seems that all callbacks (in pci_error_handlers
> structure) registered by drivers that gets called during error
> recovery are not mandatory. So the intention of the infrastructure
> design seems to be to allow forward progress even when a specific
> callback has not been registered by a driver. However, if error
> handler structure itself has not been registered, it doesn't make
> sense to allow forward progress.
>
> As a result of the current design, in the case of a single device
> having an AER-unaware driver or in the case of any function in a
> multi-function card having an AER-unaware driver, a successful
> recovery is reported.
>
> Typical scenario this happens is when a PCI device is detached
> from a KVM host and the pci-stub driver on the host claims the
> device. The pci-stub driver does not have error handling capabilities
> but the AER infrastructure still reports that the device recovered
> successfully.
>
> The changes proposed here leaves the device(s)in an unrecovered state
> if the driver for the device or for any device in the subtree
> does not have error handler structure registered. This reflects
> the true state of the device and prevents any partial recovery (or no
> recovery at all) reported as successful.
>
>
> v3:
>   - Fixed a checkpatch/build issue
>
> v2:
>   - Made changes so that all devices in the subtree have the error
>     state set correctly.
>
>
> Reviewed-by: Linas Vepstas <linasvepstas <at> gmail.com>
> Reviewed-by: Myron Stowe <mstowe <at> redhat.com>
> Reviewed-by: Bjorn Helgaas <bhelgaas <at> google.com>
> Signed-off-by: Vijay Mohan Pandarathil <vijaymohan.pandarathil@hp.com>

I added this to my -next branch, which will be merged for v3.8.

Please double-check that I resolved the merge conflict correctly.  It
should show up here in the next few minutes:

http://git.kernel.org/?p=linux/kernel/git/helgaas/pci.git;a=shortlog;h=refs/heads/next

>  drivers/pci/pcie/aer/aerdrv.h      |  5 ++++-
>  drivers/pci/pcie/aer/aerdrv_core.c | 21 ++++++++++++++++++---
>  include/linux/pci.h                |  3 +++
>  3 files changed, 25 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
> index 94a7598..22f840f 100644
> --- a/drivers/pci/pcie/aer/aerdrv.h
> +++ b/drivers/pci/pcie/aer/aerdrv.h
> @@ -87,6 +87,9 @@ struct aer_broadcast_data {
>  static inline pci_ers_result_t merge_result(enum pci_ers_result orig,
>                 enum pci_ers_result new)
>  {
> +       if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
> +               return PCI_ERS_RESULT_NO_AER_DRIVER;
> +
>         if (new == PCI_ERS_RESULT_NONE)
>                 return orig;
>
> @@ -97,7 +100,7 @@ static inline pci_ers_result_t merge_result(enum pci_ers_result orig,
>                 break;
>         case PCI_ERS_RESULT_DISCONNECT:
>                 if (new == PCI_ERS_RESULT_NEED_RESET)
> -                       orig = new;
> +                       orig = PCI_ERS_RESULT_NEED_RESET;
>                 break;
>         default:
>                 break;
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> index 9f80cb0..b67f91a 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -232,11 +232,26 @@ static int report_error_detected(struct pci_dev *dev, void *data)
>                                    dev->driver ?
>                                    "no AER-aware driver" : "no driver");
>                 }
> -               return 0;
> +
> +               /*
> +                * If there's any device in the subtree that does not
> +                * have an error_detected callback, returning
> +                * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
> +                * the subsequent mmio_enabled/slot_reset/resume
> +                * callbacks of "any" device in the subtree. All the
> +                * devices in the subtree are left in the error state
> +                * without recovery.
> +                */
> +
> +               if (!(dev->hdr_type & PCI_HEADER_TYPE_BRIDGE))
> +                       vote = PCI_ERS_RESULT_NO_AER_DRIVER;
> +               else
> +                       vote = PCI_ERS_RESULT_NONE;
> +       } else {
> +               err_handler = dev->driver->err_handler;
> +               vote = err_handler->error_detected(dev, result_data->state);
>         }
>
> -       err_handler = dev->driver->err_handler;
> -       vote = err_handler->error_detected(dev, result_data->state);
>         result_data->result = merge_result(result_data->result, vote);
>         return 0;
>  }
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index ab17a08..c458602 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -540,6 +540,9 @@ enum pci_ers_result {
>
>         /* Device driver is fully recovered and operational */
>         PCI_ERS_RESULT_RECOVERED = (__force pci_ers_result_t) 5,
> +
> +       /* No AER capabilities registered for the driver */
> +       PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6,
>  };
>
>  /* PCI bus error event callbacks */
> --
> 1.7.11.3
>

^ permalink raw reply

* Re: [PATCH 0/9] PCI: Remove use of CONFIG_HOTPLUG
From: Bjorn Helgaas @ 2012-11-28 20:04 UTC (permalink / raw)
  To: Greg KH; +Cc: Bill Pemberton, linux-pci
In-Reply-To: <20121128190106.GA30376@kroah.com>

On Wed, Nov 28, 2012 at 12:01 PM, Greg KH <gregkh@linuxfoundation.org> wrote:
> On Wed, Nov 21, 2012 at 03:34:51PM -0500, Bill Pemberton wrote:
>> This is a respin of the patches to remove CONFIG_HOTPLUG for the PCI
>> subsystem.  The end result is the same as the patches in the big
>> patchset I sent out earlier this week but the changelog entries have
>> been changed to be consistent with the capitalization of PCI.  The
>> removal of the __dev* entries has also been squashed into a single
>> patch.
>
> Bjorn, any objection for me taking these patches through my driver-core
> tree?  If not, can I get your ack?

No objection.

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox