Linux PCI subsystem development
 help / color / mirror / Atom feed
* Re: [PATCH v2 RESEND] Add NumaChip remote PCI support
From: Steffen Persvold @ 2012-11-30 17:41 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Daniel J Blueman, linux-pci, linux-kernel
In-Reply-To: <CAErSpo5veyd4NhayHkoB1bLJhU7pkHBhTDNyk-1vWupEcv6K3A@mail.gmail.com>

Hi Bjorn,

On 11/30/2012 17:45, Bjorn Helgaas wrote:
> On Thu, Nov 29, 2012 at 10:28 PM, Daniel J Blueman
[]
>> We could expose pci_dev_base via struct x86_init_pci; the extra complexity
>> and performance tradeoff may not be worth it for a single case perhaps?
>
> Oh, right, I forgot that you can't decide this at build-time.  This is
> PCI config access, which is not a performance path, so I'm not really
> concerned about it from that angle, but you make a good point about
> the complexity.
>
> The reason I'm interested in this is because MMCONFIG is a generic
> PCIe feature but is currently done via several arch-specific
> implementations, so I'm starting to think about how we can make parts
> of it more generic.  From that perspective, it's nicer to parameterize
> an existing implementation than to clone it because it makes
> refactoring opportunities more obvious.
>
> Backing up a bit, I'm curious about exactly why you need to check for
> the limit to begin with.  The comment says "Ensure AMD Northbridges
> don't decode reads to other devices," but that doesn't seem strictly
> accurate.  You're not changing anything in the hardware to prevent it
> from *decoding* a read, so it seems like you're actually just
> preventing the read in the first place.
>
> What happens without the limit check?  Do you get a response timeout
> and a machine check?  Read from the wrong device?'

The latter. I'm not sure how familiar you are with how pci config reads 
are decoded and handled on coherent hypertransport fabrics; The way it 
works *within* one coherent HT fabric is that the CPU will redirect all 
config space access above a configured max HT node (a setting in the AMD 
northbridge) to a specific I/O link (non-coherent link) which usually 
links up with a "southbridge" device that responds with a target abort 
(non-existing device).

However, this only works when a CPU core is accessing local HT devices. 
In our architecture, we "glue" together multiple HT fabrics and when a 
CPU core sends a pci config space request (mmconfig) to a remote machine 
(via our hardware) this re-direction is not applied anymore. The result 
is that when a mmconfig read comes in to a coherent HT device on bus00 
which is non-existent, one of the other HT nodes on that remote node 
will respond to the read, leading to "phantom" devices (i.e lspci will 
show more HT northbridges than what's really physically present) *or* 
worst case scenario will be that the transaction hangs (alternatively 
times out, leading to MCE and other bad things).

This is why we're checking accesses to bus0, device24-31 and returning a 
"fake" target abort scenario if the access was to a non-existing HT 
device. In other words, we're doing in software what a "normal" HT based 
platform would do in hardware.

>
> As far as I can tell, you still describe your MMCONFIG area with an
> MCFG table (since you use pci_mmconfig_lookup() to find the region).
> That table only includes the starting and ending bus numbers, so the
> assumption is that the MMCONFIG space is valid for every possible
> device on those buses.  So it seems like your system is not really
> compatible with the spec here.
>
> Because the MCFG table can't describe finer granularity than start/end
> bus numbers, we manage MMCONFIG regions as (segment, start_bus,
> end_bus, address) tuples.  Maybe if we tracked it with slightly finer
> granularity, e.g., (segment, start_bus, end_bus, end_bus_device,
> address), you could have some sort of MCFG-parsing quirk that reduces
> the size of the MMCONFIG region you register for bus 0.
>
> Just brainstorming here; it's not obvious to me yet what the best solution is.
>
> Bjorn
>

Kind regards,
-- 
Steffen Persvold, Chief Architect NumaChip
Numascale AS - www.numascale.com
Tel: +47 92 49 25 54 Skype: spersvold

^ permalink raw reply

* Re: [PATCH 1/3] aerdrv: Trace Event for AER
From: Borislav Petkov @ 2012-11-30 17:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Lance Ortiz, bhelgaas, lance_ortiz, jiang.liu, tony.luck, mchehab,
	linux-acpi, linux-pci, linux-kernel
In-Reply-To: <1354283631.6276.143.camel@gandalf.local.home>

On Fri, Nov 30, 2012 at 08:53:51AM -0500, Steven Rostedt wrote:
> On Fri, 2012-11-30 at 14:42 +0100, Borislav Petkov wrote:
> 
> > So, it sounds to me like we should we move all RAS-specific tracepoints
> > to <trace/events/ras.h> and then in each usage site do:
> 
> Note, the CREATE_TRACE_POINTS must only be done in one location. Not
> every place. It creates the code that does the work to make the
> tracepoints show up in /debug/tracing/events/* as well as the callback
> code and other such things. If you define it in more than one .c file,
> then you will have linker issues due to the functions being created more
> than once.
> 
> > 
> > #define CREATE_TRACE_POINTS
> > #include <trace/events/ras.h>
> > 
> > Correct?
> 
> That's the default way to do things.

Ok, cool.

Lance, care to move the TP to a new header called
include/trace/events/ras.h in your next iteration of the patches?

I'll later move the mc_event TP there too and drop the RAS-specific TP
header.

Thanks.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply

* Re: [PATCH v2 RESEND] Add NumaChip remote PCI support
From: Bjorn Helgaas @ 2012-11-30 16:45 UTC (permalink / raw)
  To: Daniel J Blueman; +Cc: linux-pci, linux-kernel, Steffen Persvold
In-Reply-To: <50B84418.5070804@numascale-asia.com>

On Thu, Nov 29, 2012 at 10:28 PM, Daniel J Blueman
<daniel@numascale-asia.com> wrote:
> Hi Bjorn,
>
>
> On 29/11/2012 07:08, Bjorn Helgaas wrote:
>>
>> On Wed, Nov 21, 2012 at 1:39 AM, Daniel J Blueman
>> <daniel@numascale-asia.com> wrote:
>>>
>>> Add NumaChip-specific PCI access mechanism via MMCONFIG cycles, but
>>> preventing access to AMD Northbridges which shouldn't respond.
>>>
>>> v2: Use PCI_DEVFN in precomputed constant limit; drop unneeded includes
>>>
>>> Signed-off-by: Daniel J Blueman <daniel@numascale-asia.com>
>>> ---
>>>   arch/x86/include/asm/numachip/numachip.h |   20 +++++
>>>   arch/x86/kernel/apic/apic_numachip.c     |    2 +
>>>   arch/x86/pci/Makefile                    |    1 +
>>>   arch/x86/pci/numachip.c                  |  134
>>> ++++++++++++++++++++++++++++++
>>>   4 files changed, 157 insertions(+)
>>>   create mode 100644 arch/x86/include/asm/numachip/numachip.h
>>>   create mode 100644 arch/x86/pci/numachip.c
>>>
>>> diff --git a/arch/x86/include/asm/numachip/numachip.h
>>> b/arch/x86/include/asm/numachip/numachip.h
>>> new file mode 100644
>>> index 0000000..d35e71a
>>> --- /dev/null
>>> +++ b/arch/x86/include/asm/numachip/numachip.h
>>> @@ -0,0 +1,20 @@
>>> +/*
>>> + * This file is subject to the terms and conditions of the GNU General
>>> Public
>>> + * License.  See the file "COPYING" in the main directory of this
>>> archive
>>> + * for more details.
>>> + *
>>> + * Numascale NumaConnect-specific header file
>>> + *
>>> + * Copyright (C) 2012 Numascale AS. All rights reserved.
>>> + *
>>> + * Send feedback to <support@numascale.com>
>>> + *
>>> + */
>>> +
>>> +#ifndef _ASM_X86_NUMACHIP_NUMACHIP_H
>>> +#define _ASM_X86_NUMACHIP_NUMACHIP_H
>>> +
>>> +extern int __init pci_numachip_init(void);
>>> +
>>> +#endif /* _ASM_X86_NUMACHIP_NUMACHIP_H */
>>> +
>>> diff --git a/arch/x86/kernel/apic/apic_numachip.c
>>> b/arch/x86/kernel/apic/apic_numachip.c
>>> index a65829a..9c2aa89 100644
>>> --- a/arch/x86/kernel/apic/apic_numachip.c
>>> +++ b/arch/x86/kernel/apic/apic_numachip.c
>>> @@ -22,6 +22,7 @@
>>>   #include <linux/hardirq.h>
>>>   #include <linux/delay.h>
>>>
>>> +#include <asm/numachip/numachip.h>
>>>   #include <asm/numachip/numachip_csr.h>
>>>   #include <asm/smp.h>
>>>   #include <asm/apic.h>
>>> @@ -179,6 +180,7 @@ static int __init numachip_system_init(void)
>>>                  return 0;
>>>
>>>          x86_cpuinit.fixup_cpu_id = fixup_cpu_id;
>>> +       x86_init.pci.arch_init = pci_numachip_init;
>>>
>>>          map_csrs();
>>>
>>> diff --git a/arch/x86/pci/Makefile b/arch/x86/pci/Makefile
>>> index 3af5a1e..ee0af58 100644
>>> --- a/arch/x86/pci/Makefile
>>> +++ b/arch/x86/pci/Makefile
>>> @@ -16,6 +16,7 @@ obj-$(CONFIG_STA2X11)           += sta2x11-fixup.o
>>>   obj-$(CONFIG_X86_VISWS)                += visws.o
>>>
>>>   obj-$(CONFIG_X86_NUMAQ)                += numaq_32.o
>>> +obj-$(CONFIG_X86_NUMACHIP)     += numachip.o
>>
>>
>> It looks like this depends on CONFIG_PCI_MMCONFIG for
>> pci_mmconfig_lookup().  Are there config constraints that force
>> CONFIG_PCI_MMCONFIG=y when CONFIG_X86_NUMACHIP=y?
>
>
> I'll revise the patch with this constraint after we work out the best
> approach for below.
>
>
>>>   obj-$(CONFIG_X86_INTEL_MID)    += mrst.o
>>>
>>> diff --git a/arch/x86/pci/numachip.c b/arch/x86/pci/numachip.c
>>> new file mode 100644
>>> index 0000000..3773e05
>>> --- /dev/null
>>> +++ b/arch/x86/pci/numachip.c
>>> @@ -0,0 +1,129 @@
>>> +/*
>>> + * This file is subject to the terms and conditions of the GNU General
>>> Public
>>> + * License.  See the file "COPYING" in the main directory of this
>>> archive
>>> + * for more details.
>>> + *
>>> + * Numascale NumaConnect-specific PCI code
>>> + *
>>> + * Copyright (C) 2012 Numascale AS. All rights reserved.
>>> + *
>>> + * Send feedback to <support@numascale.com>
>>> + *
>>> + * PCI accessor functions derived from mmconfig_64.c
>>> + *
>>> + */
>>> +
>>> +#include <linux/pci.h>
>>> +#include <asm/pci_x86.h>
>>> +
>>> +static u8 limit __read_mostly;
>>> +
>>> +static inline char __iomem *pci_dev_base(unsigned int seg, unsigned int
>>> bus, unsigned int devfn)
>>> +{
>>> +       struct pci_mmcfg_region *cfg = pci_mmconfig_lookup(seg, bus);
>>> +
>>> +       if (cfg && cfg->virt)
>>> +               return cfg->virt + (PCI_MMCFG_BUS_OFFSET(bus) | (devfn <<
>>> 12));
>>> +       return NULL;
>>> +}
>>
>>
>> Most of this file is copied directly from mmconfig_64.c (as you
>> mentioned above).  I wonder if we could avoid the code duplication by
>> making the pci_dev_base() implementation in mmconfig_64.c a weak
>> definition.  Then you could just supply a non-weak pci_dev_base() here
>> that would override that default version.  Your version would look
>> something like:
>>
>>    char __iomem *pci_dev_base(unsigned int seg, unsigned int bus,
>> unsigned int devfn)
>>    {
>>        struct pci_mmcfg_region *cfg = pci_mmconfig_lookup(seg, bus);
>>
>>        if (cfg && cfg->virt && devfn < limit)
>>            return cfg->virt + (PCI_MMCFG_BUS_OFFSET(bus) | (devfn << 12));
>>        return NULL;
>>    }
>>
>> That would be different from what you have in this patch because reads
>> & writes to devices above "limit" would return -EINVAL rather than 0
>> as you do here.  Would that be a problem?
>
>
> That would work nicely (pointer lookup and inlining etc aside) if there was
> the runtime ability to override pci_dev_base only if the NumaChip signature
> was detected.
>
> We could expose pci_dev_base via struct x86_init_pci; the extra complexity
> and performance tradeoff may not be worth it for a single case perhaps?

Oh, right, I forgot that you can't decide this at build-time.  This is
PCI config access, which is not a performance path, so I'm not really
concerned about it from that angle, but you make a good point about
the complexity.

The reason I'm interested in this is because MMCONFIG is a generic
PCIe feature but is currently done via several arch-specific
implementations, so I'm starting to think about how we can make parts
of it more generic.  From that perspective, it's nicer to parameterize
an existing implementation than to clone it because it makes
refactoring opportunities more obvious.

Backing up a bit, I'm curious about exactly why you need to check for
the limit to begin with.  The comment says "Ensure AMD Northbridges
don't decode reads to other devices," but that doesn't seem strictly
accurate.  You're not changing anything in the hardware to prevent it
from *decoding* a read, so it seems like you're actually just
preventing the read in the first place.

What happens without the limit check?  Do you get a response timeout
and a machine check?  Read from the wrong device?

As far as I can tell, you still describe your MMCONFIG area with an
MCFG table (since you use pci_mmconfig_lookup() to find the region).
That table only includes the starting and ending bus numbers, so the
assumption is that the MMCONFIG space is valid for every possible
device on those buses.  So it seems like your system is not really
compatible with the spec here.

Because the MCFG table can't describe finer granularity than start/end
bus numbers, we manage MMCONFIG regions as (segment, start_bus,
end_bus, address) tuples.  Maybe if we tracked it with slightly finer
granularity, e.g., (segment, start_bus, end_bus, end_bus_device,
address), you could have some sort of MCFG-parsing quirk that reduces
the size of the MMCONFIG region you register for bus 0.

Just brainstorming here; it's not obvious to me yet what the best solution is.

Bjorn

^ permalink raw reply

* Re: [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump with iommu
From: MUNEDA Takahiro @ 2012-11-30 15:49 UTC (permalink / raw)
  To: Takao Indoh
  Cc: linux-pci, x86, linux-kernel, andi, tokunaga.keiich, kexec, hbabu,
	mingo, ddutile, vgoyal, ishii.hironobu, hpa, bhelgaas, tglx,
	yinghai, khalid, muneda.takahiro
In-Reply-To: <20121127004144.3604.61708.sendpatchset@tindoh.g01.fujitsu.local>

On Tue, 27 Nov 2012 09:42:20 +0900 (JST),
Takao Indoh <indou.takao@jp.fujitsu.com> wrote:

> These patches reset PCIe devices at boot time to address DMA problem on
> kdump with iommu. When "reset_devices" is specified, a hot reset is
> triggered on each PCIe root port and downstream port to reset its
> downstream endpoint.
> 
> Background:
> A kdump problem about DMA has been discussed for a long time. That is,
> when a kernel is switched to the kdump kernel, DMA derived from first
> kernel affects second kernel. Especially this problem surfaces when
> iommu is used for PCI passthrough on KVM guest. In the case of the
> machine I use, when intel_iommu=on is specified, DMAR error is detected
> in kdump kernel and PCI SERR is also detected. Finally kdump fails
> because some devices does not work correctly.
> 
> The root cause is that ongoing DMA from first kernel causes DMAR fault
> because page table of DMAR is initialized while kdump kernel is booting
> up. Therefore to solve this problem DMA needs to be stopped before DMAR
> is initialized at kdump kernel boot time. By these patches, PCIe devices
> are reset by hot reset and its DMA is stopped when reset_devices is
> specified. One problem of this solution is that the monitor blacks out
> when VGA controller is reset. So this patch does not reset the port
> whose child endpoint is VGA device.
> 
> What I tried:
> - Clearing bus master bit and INTx disable bit at boot time
>      This did not solve this problem. I still got DMAR error on devices.
> - Resetting devices in fixup_final(v1 patch)
>      DMAR error disappeared, but sometimes PCI SERR was detected. This
>      is well explained here.
>      https://lkml.org/lkml/2012/9/9/245
>      This PCI SERR seems to be related to interrupt remapping.
> - Clearing bus master in setup_arch() and resetting devices in
>    fixup_final
>      Neither DMAR error nor PCI SERR occurred. But on certain machine
>      kdump kernel hung up when resetting devices. It seems to be a
>      problem specific to the platform.
> - Resetting devices in setup_arch() (v2 and later patch)
>      This solution solves all problems I found so far.

Thank you for updating a patchset.
I have a server which raises PCI Error while system is rebooting when
I set intel_iommu=on.  With v7 on top of 3.7-rc7, I don't see any PCI
Errors or other hardware related errors. So,

Tested-by: MUNEDA Takahiro <muneda.takahiro@jp.fujitsu.com>

Thanks,
Takahiro

> 
> Changelog:
> v7:
> Update Yinghai's dummy-pci patch with macros in linux/pci.h, and fix
> some bugs
> 
> v6:
> Rewrite using Yinghai's dummy-pci patch
> https://lkml.org/lkml/2012/11/13/118
> 
> v5:
> Do bus reset after all devices are scanned and its config registers are
> saved. This fixes a bug that config register is accessed without delay
> after reset.
> https://lkml.org/lkml/2012/10/17/47
> 
> v4:
> Reduce waiting time after resetting devices. A previous patch does reset
> like this:
>    for (each device) {
>      save config registers
>      reset
>      wait for 500 ms
>      restore config registers
>    }
> 
> If there are N devices to be reset, it takes N*500 ms. On the other
> hand, the v4 patch does:
>    for (each device) {
>      save config registers
>      reset
>    }
>    wait 500 ms
>    for (each device) {
>      restore config registers
>    }
> Though it needs more memory space to save config registers, the waiting
> time is always 500ms.
> https://lkml.org/lkml/2012/10/15/49
> 
> v3:
> Move alloc_bootmem and free_bootmem to early_reset_pcie_devices so that
> they are called only once.
> https://lkml.org/lkml/2012/10/10/57
> 
> v2:
> Reset devices in setup_arch() because reset need to be done before
> interrupt remapping is initialized.
> https://lkml.org/lkml/2012/10/2/54
> 
> v1:
> Add fixup_final quirk to reset PCIe devices
> https://lkml.org/lkml/2012/8/3/160
> 
> Takao Indoh (5):
>    x86, pci: add dummy pci device for early stage
>    PCI: Define the maximum number of PCI function
>    Make reset_devices available at early stage
>    x86, pci: Reset PCIe devices at boot time
>    x86, pci: Enable PCI INTx when MSI is disabled
> 
>   arch/x86/include/asm/pci-direct.h |    3 +
>   arch/x86/kernel/setup.c           |    3 +
>   arch/x86/pci/common.c             |    4 +-
>   arch/x86/pci/early.c              |  315 +++++++++++++++++++++++++++++++++++++
>   include/linux/pci.h               |    2 +
>   init/main.c                       |    4 +-
>   6 files changed, 328 insertions(+), 3 deletions(-)
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH 1/3] aerdrv: Trace Event for AER
From: Steven Rostedt @ 2012-11-30 13:53 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Lance Ortiz, bhelgaas, lance_ortiz, jiang.liu, tony.luck, mchehab,
	linux-acpi, linux-pci, linux-kernel
In-Reply-To: <20121130134232.GD6869@liondog.tnic>

On Fri, 2012-11-30 at 14:42 +0100, Borislav Petkov wrote:

> So, it sounds to me like we should we move all RAS-specific tracepoints
> to <trace/events/ras.h> and then in each usage site do:

Note, the CREATE_TRACE_POINTS must only be done in one location. Not
every place. It creates the code that does the work to make the
tracepoints show up in /debug/tracing/events/* as well as the callback
code and other such things. If you define it in more than one .c file,
then you will have linker issues due to the functions being created more
than once.

> 
> #define CREATE_TRACE_POINTS
> #include <trace/events/ras.h>
> 
> Correct?

That's the default way to do things.

> 
> FWIW, it looks neat and clean to me that way.

Yep, that's why it's default ;-)

-- Steve



^ permalink raw reply

* Re: [PATCH 1/3] aerdrv: Trace Event for AER
From: Borislav Petkov @ 2012-11-30 13:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Lance Ortiz, bhelgaas, lance_ortiz, jiang.liu, tony.luck, mchehab,
	linux-acpi, linux-pci, linux-kernel
In-Reply-To: <1354275551.6276.140.camel@gandalf.local.home>

On Fri, Nov 30, 2012 at 06:39:11AM -0500, Steven Rostedt wrote:
> On Fri, 2012-11-30 at 11:53 +0100, Borislav Petkov wrote:
> > On Thu, Nov 29, 2012 at 08:51:19PM -0500, Steven Rostedt wrote:
> > > >  include/ras/aer_event.h |   77 +++++++++++++++++++++++++++++++++++++++++++++++
> > > 
> > > Is there a reason this header is here? Egad, I never noticed the
> > > ras_event.h that is there. This include/ras directory was created for
> > > the sole purpose of trace events! This is not the way to do this.
> > 
> > Well, the idea for the ras event was to be able to use it in multiple
> > places. It is currently used only by EDAC but it could be that memory
> > errors could be reported by other agents which would reuse that TP.
> > 
> 
> If it's generic, then place it into the include/trace/events directory,
> the you don't need to have the TRACE_INCLUDE_PATH as that is the default
> path the macros will use.

Hmm, so I'm thinking maybe we should add a ras.h header there which
contains all RAS TPs.

> > > Please look at the sample in samples/trace_events/
> > > 
> > > The proper way is to keep the header by the driver. Then you can simply
> > > include the header with "aer_event.h".
> > > 
> > > But to have the macro magic work, you need to modify the Makefile to
> > > have something like:
> > > 
> > > CFLAGS_aerdrv_errprint.o = -I$(src)
> > 
> > So I'm guessing that every .c file including the TP should also -I
> > include the TP definition header wherever it is. Is that agreeable?
> 
> You only need the -I$(src) for the .c file that uses the
> CREATE_TRACE_POINTS macro, as the trace point macro magic headers
> require it to find the TP header.

This is done like this from EDAC's single usage site:

#define CREATE_TRACE_POINTS
#define TRACE_INCLUDE_PATH ../../include/ras
#include <ras/ras_event.h>

This is in <drivers/edac/edac_mc.c> and it doesn't to "-I$(src)" in the
edac Makefile.

> Other files just need "foo.h", or <trace/events/foo.h> if it's in the
> generic location.

So, it sounds to me like we should we move all RAS-specific tracepoints
to <trace/events/ras.h> and then in each usage site do:

#define CREATE_TRACE_POINTS
#include <trace/events/ras.h>

Correct?

FWIW, it looks neat and clean to me that way.

Thanks.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply

* Re: [PATCH 1/3] aerdrv: Trace Event for AER
From: Steven Rostedt @ 2012-11-30 11:39 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Lance Ortiz, bhelgaas, lance_ortiz, jiang.liu, tony.luck, mchehab,
	linux-acpi, linux-pci, linux-kernel
In-Reply-To: <20121130105311.GA6869@liondog.tnic>

On Fri, 2012-11-30 at 11:53 +0100, Borislav Petkov wrote:
> On Thu, Nov 29, 2012 at 08:51:19PM -0500, Steven Rostedt wrote:
> > >  include/ras/aer_event.h |   77 +++++++++++++++++++++++++++++++++++++++++++++++
> > 
> > Is there a reason this header is here? Egad, I never noticed the
> > ras_event.h that is there. This include/ras directory was created for
> > the sole purpose of trace events! This is not the way to do this.
> 
> Well, the idea for the ras event was to be able to use it in multiple
> places. It is currently used only by EDAC but it could be that memory
> errors could be reported by other agents which would reuse that TP.
> 

If it's generic, then place it into the include/trace/events directory,
the you don't need to have the TRACE_INCLUDE_PATH as that is the default
path the macros will use.

> > Please look at the sample in samples/trace_events/
> > 
> > The proper way is to keep the header by the driver. Then you can simply
> > include the header with "aer_event.h".
> > 
> > But to have the macro magic work, you need to modify the Makefile to
> > have something like:
> > 
> > CFLAGS_aerdrv_errprint.o = -I$(src)
> 
> So I'm guessing that every .c file including the TP should also -I
> include the TP definition header wherever it is. Is that agreeable?

You only need the -I$(src) for the .c file that uses the
CREATE_TRACE_POINTS macro, as the trace point macro magic headers
require it to find the TP header. Other files just need "foo.h", or
<trace/events/foo.h> if it's in the generic location.

-- Steve



^ permalink raw reply

* Re: [PATCH 1/3] aerdrv: Trace Event for AER
From: Borislav Petkov @ 2012-11-30 10:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Lance Ortiz, bhelgaas, lance_ortiz, jiang.liu, tony.luck, mchehab,
	linux-acpi, linux-pci, linux-kernel
In-Reply-To: <1354240279.6276.131.camel@gandalf.local.home>

On Thu, Nov 29, 2012 at 08:51:19PM -0500, Steven Rostedt wrote:
> >  include/ras/aer_event.h |   77 +++++++++++++++++++++++++++++++++++++++++++++++
> 
> Is there a reason this header is here? Egad, I never noticed the
> ras_event.h that is there. This include/ras directory was created for
> the sole purpose of trace events! This is not the way to do this.

Well, the idea for the ras event was to be able to use it in multiple
places. It is currently used only by EDAC but it could be that memory
errors could be reported by other agents which would reuse that TP.

> Please look at the sample in samples/trace_events/
> 
> The proper way is to keep the header by the driver. Then you can simply
> include the header with "aer_event.h".
> 
> But to have the macro magic work, you need to modify the Makefile to
> have something like:
> 
> CFLAGS_aerdrv_errprint.o = -I$(src)

So I'm guessing that every .c file including the TP should also -I
include the TP definition header wherever it is. Is that agreeable?

Thanks.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply

* Re: [PATCH v2 RESEND] Add NumaChip remote PCI support
From: Daniel J Blueman @ 2012-11-30  5:28 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, linux-kernel, Steffen Persvold
In-Reply-To: <CAErSpo67Brp8H=Jw77NzA4xL52aRNi+YDw+gM++3y8XqfTWyzA@mail.gmail.com>

Hi Bjorn,

On 29/11/2012 07:08, Bjorn Helgaas wrote:
> On Wed, Nov 21, 2012 at 1:39 AM, Daniel J Blueman
> <daniel@numascale-asia.com> wrote:
>> Add NumaChip-specific PCI access mechanism via MMCONFIG cycles, but
>> preventing access to AMD Northbridges which shouldn't respond.
>>
>> v2: Use PCI_DEVFN in precomputed constant limit; drop unneeded includes
>>
>> Signed-off-by: Daniel J Blueman <daniel@numascale-asia.com>
>> ---
>>   arch/x86/include/asm/numachip/numachip.h |   20 +++++
>>   arch/x86/kernel/apic/apic_numachip.c     |    2 +
>>   arch/x86/pci/Makefile                    |    1 +
>>   arch/x86/pci/numachip.c                  |  134 ++++++++++++++++++++++++++++++
>>   4 files changed, 157 insertions(+)
>>   create mode 100644 arch/x86/include/asm/numachip/numachip.h
>>   create mode 100644 arch/x86/pci/numachip.c
>>
>> diff --git a/arch/x86/include/asm/numachip/numachip.h b/arch/x86/include/asm/numachip/numachip.h
>> new file mode 100644
>> index 0000000..d35e71a
>> --- /dev/null
>> +++ b/arch/x86/include/asm/numachip/numachip.h
>> @@ -0,0 +1,20 @@
>> +/*
>> + * This file is subject to the terms and conditions of the GNU General Public
>> + * License.  See the file "COPYING" in the main directory of this archive
>> + * for more details.
>> + *
>> + * Numascale NumaConnect-specific header file
>> + *
>> + * Copyright (C) 2012 Numascale AS. All rights reserved.
>> + *
>> + * Send feedback to <support@numascale.com>
>> + *
>> + */
>> +
>> +#ifndef _ASM_X86_NUMACHIP_NUMACHIP_H
>> +#define _ASM_X86_NUMACHIP_NUMACHIP_H
>> +
>> +extern int __init pci_numachip_init(void);
>> +
>> +#endif /* _ASM_X86_NUMACHIP_NUMACHIP_H */
>> +
>> diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c
>> index a65829a..9c2aa89 100644
>> --- a/arch/x86/kernel/apic/apic_numachip.c
>> +++ b/arch/x86/kernel/apic/apic_numachip.c
>> @@ -22,6 +22,7 @@
>>   #include <linux/hardirq.h>
>>   #include <linux/delay.h>
>>
>> +#include <asm/numachip/numachip.h>
>>   #include <asm/numachip/numachip_csr.h>
>>   #include <asm/smp.h>
>>   #include <asm/apic.h>
>> @@ -179,6 +180,7 @@ static int __init numachip_system_init(void)
>>                  return 0;
>>
>>          x86_cpuinit.fixup_cpu_id = fixup_cpu_id;
>> +       x86_init.pci.arch_init = pci_numachip_init;
>>
>>          map_csrs();
>>
>> diff --git a/arch/x86/pci/Makefile b/arch/x86/pci/Makefile
>> index 3af5a1e..ee0af58 100644
>> --- a/arch/x86/pci/Makefile
>> +++ b/arch/x86/pci/Makefile
>> @@ -16,6 +16,7 @@ obj-$(CONFIG_STA2X11)           += sta2x11-fixup.o
>>   obj-$(CONFIG_X86_VISWS)                += visws.o
>>
>>   obj-$(CONFIG_X86_NUMAQ)                += numaq_32.o
>> +obj-$(CONFIG_X86_NUMACHIP)     += numachip.o
>
> It looks like this depends on CONFIG_PCI_MMCONFIG for
> pci_mmconfig_lookup().  Are there config constraints that force
> CONFIG_PCI_MMCONFIG=y when CONFIG_X86_NUMACHIP=y?

I'll revise the patch with this constraint after we work out the best 
approach for below.

>>   obj-$(CONFIG_X86_INTEL_MID)    += mrst.o
>>
>> diff --git a/arch/x86/pci/numachip.c b/arch/x86/pci/numachip.c
>> new file mode 100644
>> index 0000000..3773e05
>> --- /dev/null
>> +++ b/arch/x86/pci/numachip.c
>> @@ -0,0 +1,129 @@
>> +/*
>> + * This file is subject to the terms and conditions of the GNU General Public
>> + * License.  See the file "COPYING" in the main directory of this archive
>> + * for more details.
>> + *
>> + * Numascale NumaConnect-specific PCI code
>> + *
>> + * Copyright (C) 2012 Numascale AS. All rights reserved.
>> + *
>> + * Send feedback to <support@numascale.com>
>> + *
>> + * PCI accessor functions derived from mmconfig_64.c
>> + *
>> + */
>> +
>> +#include <linux/pci.h>
>> +#include <asm/pci_x86.h>
>> +
>> +static u8 limit __read_mostly;
>> +
>> +static inline char __iomem *pci_dev_base(unsigned int seg, unsigned int bus, unsigned int devfn)
>> +{
>> +       struct pci_mmcfg_region *cfg = pci_mmconfig_lookup(seg, bus);
>> +
>> +       if (cfg && cfg->virt)
>> +               return cfg->virt + (PCI_MMCFG_BUS_OFFSET(bus) | (devfn << 12));
>> +       return NULL;
>> +}
>
> Most of this file is copied directly from mmconfig_64.c (as you
> mentioned above).  I wonder if we could avoid the code duplication by
> making the pci_dev_base() implementation in mmconfig_64.c a weak
> definition.  Then you could just supply a non-weak pci_dev_base() here
> that would override that default version.  Your version would look
> something like:
>
>    char __iomem *pci_dev_base(unsigned int seg, unsigned int bus,
> unsigned int devfn)
>    {
>        struct pci_mmcfg_region *cfg = pci_mmconfig_lookup(seg, bus);
>
>        if (cfg && cfg->virt && devfn < limit)
>            return cfg->virt + (PCI_MMCFG_BUS_OFFSET(bus) | (devfn << 12));
>        return NULL;
>    }
>
> That would be different from what you have in this patch because reads
> & writes to devices above "limit" would return -EINVAL rather than 0
> as you do here.  Would that be a problem?

That would work nicely (pointer lookup and inlining etc aside) if there 
was the runtime ability to override pci_dev_base only if the NumaChip 
signature was detected.

We could expose pci_dev_base via struct x86_init_pci; the extra 
complexity and performance tradeoff may not be worth it for a single 
case perhaps?

Thanks,
   Daniel
-- 
Daniel J Blueman
Principal Software Engineer, Numascale Asia

^ permalink raw reply

* Re: [PATCH 2/3] aerdrv: Enhanced AER logging
From: Steven Rostedt @ 2012-11-30  1:53 UTC (permalink / raw)
  To: Lance Ortiz
  Cc: bhelgaas, lance_ortiz, jiang.liu, tony.luck, bp, mchehab,
	linux-acpi, linux-pci, linux-kernel
In-Reply-To: <20121129215450.5483.26562.stgit@grignak.americas.hpqcorp.net>

On Thu, 2012-11-29 at 14:54 -0700, Lance Ortiz wrote:

> --- a/drivers/pci/pcie/aer/aerdrv_errprint.c
> +++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
> @@ -23,6 +23,10 @@
>  
>  #include "aerdrv.h"
>  
> +#define CREATE_TRACE_POINTS
> +#define TRACE_INCLUDE_PATH ../../../../include/ras
> +#include <ras/aer_event.h>
> +

Yuck yuck yuck!

This header should be in the same directory, and you should have in that
same header:

#undef TRACE_INCLUDE_PATH
#define TRACE_INCLUDE_PATH .

and remove the definition here.

-- Steve

>  #define AER_AGENT_RECEIVER		0
>  #define AER_AGENT_REQUESTER		1
>  #define AER_AGENT_COMPLETER		2
> @@ -194,6 +198,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  	if (info->id && info->error_dev_num > 1 && info->id == id)
>  		printk("%s""  Error of this Agent(%04x) is reported first\n",
>  			prefix, id);
> +	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
> +			info->severity);
>  }
>  



^ permalink raw reply

* Re: [PATCH 1/3] aerdrv: Trace Event for AER
From: Steven Rostedt @ 2012-11-30  1:51 UTC (permalink / raw)
  To: Lance Ortiz
  Cc: bhelgaas, lance_ortiz, jiang.liu, tony.luck, bp, mchehab,
	linux-acpi, linux-pci, linux-kernel
In-Reply-To: <20121129215443.5483.43364.stgit@grignak.americas.hpqcorp.net>

On Thu, 2012-11-29 at 14:54 -0700, Lance Ortiz wrote:
> This header file will define a new trace event that will be triggered when
> a AER event occurs.  The following data will be provided to the trace 
> event.
> 
> char * name -	String containing the device path
> 
> u32 status - 	Either the correctable or uncorrectable register 
> 		indicating what error or errors have been see.
> 
> u8 severity - 	error severity 0:NONFATAL 1:FATAL 2:CORRECTED
> 
> The trace event will also provide a trace string that may look like:
> 
> "0000:05:00.0 PCIe Bus Error:severity=Uncorrected (Non-Fatal), Poisoned
> TLP"
> 
> Signed-off-by: Lance Ortiz <lance.ortiz@hp.com>
> ---
> 
>  include/ras/aer_event.h |   77 +++++++++++++++++++++++++++++++++++++++++++++++

Is there a reason this header is here? Egad, I never noticed the
ras_event.h that is there. This include/ras directory was created for
the sole purpose of trace events! This is not the way to do this.

Please look at the sample in samples/trace_events/

The proper way is to keep the header by the driver. Then you can simply
include the header with "aer_event.h".

But to have the macro magic work, you need to modify the Makefile to
have something like:

CFLAGS_aerdrv_errprint.o = -I$(src)

and it will be able to find your headers without a problem.
The ras_event.h needs to be fixed too. I may just send a patch myself.

-- Steve


>  1 files changed, 77 insertions(+), 0 deletions(-)
>  create mode 100644 include/ras/aer_event.h
> 
> diff --git a/include/ras/aer_event.h b/include/ras/aer_event.h
> new file mode 100644
> index 0000000..735c973
> --- /dev/null
> +++ b/include/ras/aer_event.h
> @@ -0,0 +1,77 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM aer
> +#define TRACE_INCLUDE_FILE aer_event
> +
> +#if !defined(_TRACE_AER_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_AER_H
> +
> +#include <linux/tracepoint.h>
> +#include <linux/edac.h>
> +
> +
> +/*
> + * Anhance Error Reporting (AER) PCIE Report Error
> + *
> + * These events are generated when hardware detects a corrected or
> + * uncorrected event on a pci express device and reports
> + * errors.  The event reports the following data.
> + *
> + * char * dev_name -	String containing the device identification
> + * u32 status -		Either the correctable or uncorrectable register
> + *			indicating what error or errors have been seen
> + * u8 severity -	error severity 0:NONFATAL 1:FATAL 2:CORRECTED
> + */
> +
> +#define correctable_error_string			\
> +	{BIT(0),	"Receiver Error"},		\
> +	{BIT(6),	"Bad TLP"},			\
> +	{BIT(7),	"Bad DLLP"},			\
> +	{BIT(8),	"RELAY_NUM Rollover"},		\
> +	{BIT(12),	"Replay Timer Timeout"},	\
> +	{BIT(13),	"Advisory Non-Fatal"}
> +
> +#define uncorrectable_error_string			\
> +	{BIT(4),	"Data Link Protocol"},		\
> +	{BIT(12),	"Poisoned TLP"},		\
> +	{BIT(13),	"Flow Control Protocol"},	\
> +	{BIT(14),	"Completion Timeout"},		\
> +	{BIT(15),	"Completer Abort"},		\
> +	{BIT(16),	"Unexpected Completion"},	\
> +	{BIT(17),	"Receiver Overflow"},		\
> +	{BIT(18),	"Malformed TLP"},		\
> +	{BIT(19),	"ECRC"},			\
> +	{BIT(20),	"Unsupported Request"}
> +
> +TRACE_EVENT(aer_event,
> +	TP_PROTO(const char *dev_name,
> +		 const u32 status,
> +		 const u8 severity),
> +
> +	TP_ARGS(dev_name, status, severity),
> +
> +	TP_STRUCT__entry(
> +		__string(	dev_name,	dev_name	)
> +		__field(	u32,		status		)
> +		__field(	u8,		severity	)
> +	),
> +
> +	TP_fast_assign(
> +		__assign_str(dev_name, dev_name);
> +		__entry->status		= status;
> +		__entry->severity	= severity;
> +	),
> +
> +	TP_printk("%s PCIe Bus Error: severity=%s, %s\n",
> +		__get_str(dev_name),
> +		(__entry->severity == HW_EVENT_ERR_CORRECTED) ? "Corrected" :
> +			((__entry->severity == HW_EVENT_ERR_FATAL) ?
> +			"Fatal" : "Uncorrected"),
> +		__entry->severity == HW_EVENT_ERR_CORRECTED ?
> +		__print_flags(__entry->status, "|", correctable_error_string) :
> +		__print_flags(__entry->status, "|", uncorrectable_error_string))
> +);
> +
> +#endif /* _TRACE_AER_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>



^ permalink raw reply

* Re: [PATCH v2 2/3] aerdrv: Enhanced AER logging
From: Joe Perches @ 2012-11-30  1:12 UTC (permalink / raw)
  To: Lance Ortiz
  Cc: bhelgaas, lance_ortiz, jiang.liu, tony.luck, bp, rostedt, mchehab,
	linux-acpi, linux-pci, linux-kernel
In-Reply-To: <20121129225800.8405.83305.stgit@grignak.americas.hpqcorp.net>

On Thu, 2012-11-29 at 15:58 -0700, Lance Ortiz wrote:
> This patch will provide a more reliable and easy way for user-space
> applications to have access to AER logs rather than reading them from the
> message buffer. It also provides a way to notify user-space when an AER
> event occurs.
[]
>  0 files changed, 0 insertions(+), 0 deletions(-)

Those are odd patch statistics

> diff --git a/drivers/acpi/apei/cper.c b/drivers/acpi/apei/cper.c
[]
> @@ -249,6 +250,10 @@ static const char *cper_pcie_port_type_strs[] = {
>  static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
>  			    const struct acpi_hest_generic_data *gdata)
>  {
> +#ifdef CONFIG_ACPI_APEI_PCIEAER
> +	struct pci_dev *dev;
> +#endif
> +
>  	if (pcie->validation_bits & CPER_PCIE_VALID_PORT_TYPE)
>  		printk("%s""port_type: %d, %s\n", pfx, pcie->port_type,
>  		       pcie->port_type < ARRAY_SIZE(cper_pcie_port_type_strs) ?
> @@ -281,9 +286,17 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
>  	"%s""bridge: secondary_status: 0x%04x, control: 0x%04x\n",
>  	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
>  #ifdef CONFIG_ACPI_APEI_PCIEAER
> -	if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO) {
> +	dev = pci_get_domain_bus_and_slot(pcie->device_id.segment,
> +			pcie->device_id.bus, pcie->device_id.function);
> +	if (!dev)
> +		printk("PCI AER Cannot get PCI device %04x:%02x:%02x.%d\n",
> +			pcie->device_id.segment, pcie->device_id.bus,
> +			pcie->device_id.slot, pcie->device_id.function);
> +
> +	if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO && dev) {
>  		struct aer_capability_regs *aer_regs = (void *)pcie->aer_info;
> -		cper_print_aer(pfx, gdata->error_severity, aer_regs);
> +		cper_print_aer(dev, gdata->error_severity, aer_regs);
> +		pci_dev_put(dev);
>  	}

This could be written something like:
	dev = etc..
	if (!dev) {
		pr_err(etc...)
	} else if {pcie->validation_bits & ...) {
		etc...



^ permalink raw reply

* [PATCH v2 2/3] aerdrv: Enhanced AER logging
From: Lance Ortiz @ 2012-11-29 22:58 UTC (permalink / raw)
  To: bhelgaas, lance_ortiz, jiang.liu, tony.luck, bp, rostedt, mchehab,
	linux-acpi, linux-pci, linux-kernel

This patch will provide a more reliable and easy way for user-space
applications to have access to AER logs rather than reading them from the
message buffer. It also provides a way to notify user-space when an AER
event occurs.

The aer driver is updated to generate a trace event of function 'aer_event'
when an AER occurs.  The trace event was added to both the interrupt
based aer path and the firmware first path

Signed-off-by: Lance Ortiz <lance.ortiz@hp.com>
---

 0 files changed, 0 insertions(+), 0 deletions(-)

diff --git a/drivers/acpi/apei/cper.c b/drivers/acpi/apei/cper.c
index e6defd8..87345d4 100644
--- a/drivers/acpi/apei/cper.c
+++ b/drivers/acpi/apei/cper.c
@@ -29,6 +29,7 @@
 #include <linux/time.h>
 #include <linux/cper.h>
 #include <linux/acpi.h>
+#include <linux/pci.h>
 #include <linux/aer.h>
 
 /*
@@ -249,6 +250,10 @@ static const char *cper_pcie_port_type_strs[] = {
 static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
 			    const struct acpi_hest_generic_data *gdata)
 {
+#ifdef CONFIG_ACPI_APEI_PCIEAER
+	struct pci_dev *dev;
+#endif
+
 	if (pcie->validation_bits & CPER_PCIE_VALID_PORT_TYPE)
 		printk("%s""port_type: %d, %s\n", pfx, pcie->port_type,
 		       pcie->port_type < ARRAY_SIZE(cper_pcie_port_type_strs) ?
@@ -281,9 +286,17 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
 	"%s""bridge: secondary_status: 0x%04x, control: 0x%04x\n",
 	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
 #ifdef CONFIG_ACPI_APEI_PCIEAER
-	if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO) {
+	dev = pci_get_domain_bus_and_slot(pcie->device_id.segment,
+			pcie->device_id.bus, pcie->device_id.function);
+	if (!dev)
+		printk("PCI AER Cannot get PCI device %04x:%02x:%02x.%d\n",
+			pcie->device_id.segment, pcie->device_id.bus,
+			pcie->device_id.slot, pcie->device_id.function);
+
+	if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO && dev) {
 		struct aer_capability_regs *aer_regs = (void *)pcie->aer_info;
-		cper_print_aer(pfx, gdata->error_severity, aer_regs);
+		cper_print_aer(dev, gdata->error_severity, aer_regs);
+		pci_dev_put(dev);
 	}
 #endif
 }
diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 3ea5173..6354e50 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -23,6 +23,10 @@
 
 #include "aerdrv.h"
 
+#define CREATE_TRACE_POINTS
+#define TRACE_INCLUDE_PATH ../../../../include/ras
+#include <ras/aer_event.h>
+
 #define AER_AGENT_RECEIVER		0
 #define AER_AGENT_REQUESTER		1
 #define AER_AGENT_COMPLETER		2
@@ -194,6 +198,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 	if (info->id && info->error_dev_num > 1 && info->id == id)
 		printk("%s""  Error of this Agent(%04x) is reported first\n",
 			prefix, id);
+	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
+			info->severity);
 }
 
 void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
@@ -217,12 +223,13 @@ int cper_severity_to_aer(int cper_severity)
 }
 EXPORT_SYMBOL_GPL(cper_severity_to_aer);
 
-void cper_print_aer(const char *prefix, int cper_severity,
+void cper_print_aer(struct pci_dev *dev, int cper_severity,
 		    struct aer_capability_regs *aer)
 {
 	int aer_severity, layer, agent, status_strs_size, tlp_header_valid = 0;
 	u32 status, mask;
 	const char **status_strs;
+	char *prefix = NULL;
 
 	aer_severity = cper_severity_to_aer(cper_severity);
 	if (aer_severity == AER_CORRECTABLE) {
@@ -259,5 +266,7 @@ void cper_print_aer(const char *prefix, int cper_severity,
 			*(tlp + 8), *(tlp + 15), *(tlp + 14),
 			*(tlp + 13), *(tlp + 12));
 	}
+	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
+			aer_severity);
 }
 #endif
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 544abdb..7b86dc6 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -49,7 +49,7 @@ static inline int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
 }
 #endif
 
-extern void cper_print_aer(const char *prefix, int cper_severity,
+extern void cper_print_aer(struct pci_dev *dev, int cper_severity,
 			   struct aer_capability_regs *aer);
 extern int cper_severity_to_aer(int cper_severity);
 extern void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,


^ permalink raw reply related

* RE: [PATCH 2/3] aerdrv: Enhanced AER logging
From: Ortiz, Lance E @ 2012-11-29 22:21 UTC (permalink / raw)
  To: Joe Perches
  Cc: bhelgaas@google.com, lance_ortiz@hotmail.com,
	jiang.liu@huawei.com, tony.luck@intel.com, bp@alien8.de,
	rostedt@goodmis.org, mchehab@redhat.com,
	linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <1354227105.1700.15.camel@joe-AO722>

Yup.  You are right.  I thought I had it enabled, I will send out the new patch soon.

Lance

> -----Original Message-----
> From: Joe Perches [mailto:joe@perches.com]
> Sent: Thursday, November 29, 2012 3:12 PM
> To: Ortiz, Lance E
> Cc: bhelgaas@google.com; lance_ortiz@hotmail.com; jiang.liu@huawei.com;
> tony.luck@intel.com; bp@alien8.de; rostedt@goodmis.org;
> mchehab@redhat.com; linux-acpi@vger.kernel.org; linux-
> pci@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH 2/3] aerdrv: Enhanced AER logging
> 
> On Thu, 2012-11-29 at 14:54 -0700, Lance Ortiz wrote:
> > This patch will provide a more reliable and easy way for user-space
> > applications to have access to AER logs rather than reading them from
> the
> > message buffer. It also provides a way to notify user-space when an
> AER
> > event occurs.
> []
> > diff --git a/drivers/acpi/apei/cper.c b/drivers/acpi/apei/cper.c
> > index e6defd8..ef1e1c0 100644
> > --- a/drivers/acpi/apei/cper.c
> > +++ b/drivers/acpi/apei/cper.c
> > @@ -281,9 +281,16 @@ static void cper_print_pcie(const char *pfx,
> const struct cper_sec_pcie *pcie,
> >  	"%s""bridge: secondary_status: 0x%04x, control: 0x%04x\n",
> >  	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
> >  #ifdef CONFIG_ACPI_APEI_PCIEAER
> > -	if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO) {
> > +	dev = pci_get_domain_bus_and_slot(pcie->device_id.segment,
> > +			pcie->device_id.bus, pcie->device_id.function);
> > +	if (!dev)
> > +		pr_info(KERN_INFO, "PCI AER Cannot get PCI device
> %04x:%02x:%02x.%d\n",
> > +			domain, bus, slot, func);
> 
> You've not tried this with CONFIG_ACPI_APEI_PCIEAER enabled.
> 
> 		pr_info("PCI AER Cannot get PCI device
> %04x:%02x:%02x.%d\n",
> 			domain, bus, slot, func);
> 


^ permalink raw reply

* Re: [PATCH 2/3] aerdrv: Enhanced AER logging
From: Joe Perches @ 2012-11-29 22:11 UTC (permalink / raw)
  To: Lance Ortiz
  Cc: bhelgaas, lance_ortiz, jiang.liu, tony.luck, bp, rostedt, mchehab,
	linux-acpi, linux-pci, linux-kernel
In-Reply-To: <20121129215450.5483.26562.stgit@grignak.americas.hpqcorp.net>

On Thu, 2012-11-29 at 14:54 -0700, Lance Ortiz wrote:
> This patch will provide a more reliable and easy way for user-space
> applications to have access to AER logs rather than reading them from the
> message buffer. It also provides a way to notify user-space when an AER
> event occurs.
[]
> diff --git a/drivers/acpi/apei/cper.c b/drivers/acpi/apei/cper.c
> index e6defd8..ef1e1c0 100644
> --- a/drivers/acpi/apei/cper.c
> +++ b/drivers/acpi/apei/cper.c
> @@ -281,9 +281,16 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
>  	"%s""bridge: secondary_status: 0x%04x, control: 0x%04x\n",
>  	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
>  #ifdef CONFIG_ACPI_APEI_PCIEAER
> -	if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO) {
> +	dev = pci_get_domain_bus_and_slot(pcie->device_id.segment,
> +			pcie->device_id.bus, pcie->device_id.function);
> +	if (!dev)
> +		pr_info(KERN_INFO, "PCI AER Cannot get PCI device %04x:%02x:%02x.%d\n",
> +			domain, bus, slot, func);

You've not tried this with CONFIG_ACPI_APEI_PCIEAER enabled.

		pr_info("PCI AER Cannot get PCI device %04x:%02x:%02x.%d\n",
			domain, bus, slot, func);



^ permalink raw reply

* [PATCH 2/3] aerdrv: Enhanced AER logging
From: Lance Ortiz @ 2012-11-29 21:54 UTC (permalink / raw)
  To: bhelgaas, lance_ortiz, jiang.liu, tony.luck, bp, rostedt, mchehab,
	linux-acpi, linux-pci, linux-kernel
In-Reply-To: <20121129215443.5483.43364.stgit@grignak.americas.hpqcorp.net>

This patch will provide a more reliable and easy way for user-space
applications to have access to AER logs rather than reading them from the
message buffer. It also provides a way to notify user-space when an AER
event occurs.

The aer driver is updated to generate a trace event of function 'aer_event'
when an AER occurs.  The trace event was added to both the interrupt
based aer path and the firmware first path

Signed-off-by: Lance Ortiz <lance.ortiz@hp.com>
---

 drivers/acpi/apei/cper.c               |   11 +++++++++--
 drivers/pci/pcie/aer/aerdrv_errprint.c |   11 ++++++++++-
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/apei/cper.c b/drivers/acpi/apei/cper.c
index e6defd8..ef1e1c0 100644
--- a/drivers/acpi/apei/cper.c
+++ b/drivers/acpi/apei/cper.c
@@ -281,9 +281,16 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
 	"%s""bridge: secondary_status: 0x%04x, control: 0x%04x\n",
 	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
 #ifdef CONFIG_ACPI_APEI_PCIEAER
-	if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO) {
+	dev = pci_get_domain_bus_and_slot(pcie->device_id.segment,
+			pcie->device_id.bus, pcie->device_id.function);
+	if (!dev)
+		pr_info(KERN_INFO, "PCI AER Cannot get PCI device %04x:%02x:%02x.%d\n",
+			domain, bus, slot, func);
+
+	if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO && dev) {
 		struct aer_capability_regs *aer_regs = (void *)pcie->aer_info;
-		cper_print_aer(pfx, gdata->error_severity, aer_regs);
+		cper_print_aer(dev, gdata->error_severity, aer_regs);
+		pci_dev_put(dev);
 	}
 #endif
 }
diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 3ea5173..6354e50 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -23,6 +23,10 @@
 
 #include "aerdrv.h"
 
+#define CREATE_TRACE_POINTS
+#define TRACE_INCLUDE_PATH ../../../../include/ras
+#include <ras/aer_event.h>
+
 #define AER_AGENT_RECEIVER		0
 #define AER_AGENT_REQUESTER		1
 #define AER_AGENT_COMPLETER		2
@@ -194,6 +198,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 	if (info->id && info->error_dev_num > 1 && info->id == id)
 		printk("%s""  Error of this Agent(%04x) is reported first\n",
 			prefix, id);
+	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
+			info->severity);
 }
 
 void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
@@ -217,12 +223,13 @@ int cper_severity_to_aer(int cper_severity)
 }
 EXPORT_SYMBOL_GPL(cper_severity_to_aer);
 
-void cper_print_aer(const char *prefix, int cper_severity,
+void cper_print_aer(struct pci_dev *dev, int cper_severity,
 		    struct aer_capability_regs *aer)
 {
 	int aer_severity, layer, agent, status_strs_size, tlp_header_valid = 0;
 	u32 status, mask;
 	const char **status_strs;
+	char *prefix = NULL;
 
 	aer_severity = cper_severity_to_aer(cper_severity);
 	if (aer_severity == AER_CORRECTABLE) {
@@ -259,5 +266,7 @@ void cper_print_aer(const char *prefix, int cper_severity,
 			*(tlp + 8), *(tlp + 15), *(tlp + 14),
 			*(tlp + 13), *(tlp + 12));
 	}
+	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
+			aer_severity);
 }
 #endif


^ permalink raw reply related

* [PATCH 3/3] aerdrv: Cleanup log output for CPER based AER
From: Lance Ortiz @ 2012-11-29 21:54 UTC (permalink / raw)
  To: bhelgaas, lance_ortiz, jiang.liu, tony.luck, bp, rostedt, mchehab,
	linux-acpi, linux-pci, linux-kernel
In-Reply-To: <20121129215443.5483.43364.stgit@grignak.americas.hpqcorp.net>

These changes make cper_print_aer more consistent with aer_print_error
which is called in the AER interrupt case. The string in the variable
'prefix' is printed at the beginning of each print statement in
cper_print_aer(). The prefix is a string containing the driver name
and the device's path.  Looking up the call path, the value of prefix
is never assigned and is NULL, so when cper_print_aer prints data the
initial string does not get printed.  This string is important because
it identifies the device that the error is on. This patch adds code to
create the prefix and print it in the cper_print_aer function. This
patch also calculates the device's agent id so it can be printed.

Signed-off-by: Lance Ortiz <lance.ortiz@hp.com>
---

 drivers/pci/pcie/aer/aerdrv_errprint.c |   12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 6354e50..bc0a091 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -229,7 +229,13 @@ void cper_print_aer(struct pci_dev *dev, int cper_severity,
 	int aer_severity, layer, agent, status_strs_size, tlp_header_valid = 0;
 	u32 status, mask;
 	const char **status_strs;
-	char *prefix = NULL;
+	int id = ((dev->bus->number << 8) | dev->devfn);
+	char prefix[44];
+
+	snprintf(prefix, sizeof(prefix), "%s%s %s: ",
+		(info->severity == AER_CORRECTABLE) ?
+		KERN_WARNING : KERN_ERR,
+		dev_driver_string(&dev->dev), dev_name(&dev->dev));
 
 	aer_severity = cper_severity_to_aer(cper_severity);
 	if (aer_severity == AER_CORRECTABLE) {
@@ -249,8 +255,8 @@ void cper_print_aer(struct pci_dev *dev, int cper_severity,
 	printk("%s""aer_status: 0x%08x, aer_mask: 0x%08x\n",
 	       prefix, status, mask);
 	cper_print_bits(prefix, status, status_strs, status_strs_size);
-	printk("%s""aer_layer=%s, aer_agent=%s\n", prefix,
-	       aer_error_layer[layer], aer_agent_string[agent]);
+	printk("%s""aer_layer=%s, %d=%04x(%s)\n", prefix,
+	       aer_error_layer[layer], id, aer_agent_string[agent]);
 	if (aer_severity != AER_CORRECTABLE)
 		printk("%s""aer_uncor_severity: 0x%08x\n",
 		       prefix, aer->uncor_severity);


^ permalink raw reply related

* [PATCH 1/3] aerdrv: Trace Event for AER
From: Lance Ortiz @ 2012-11-29 21:54 UTC (permalink / raw)
  To: bhelgaas, lance_ortiz, jiang.liu, tony.luck, bp, rostedt, mchehab,
	linux-acpi, linux-pci, linux-kernel

This header file will define a new trace event that will be triggered when
a AER event occurs.  The following data will be provided to the trace 
event.

char * name -	String containing the device path

u32 status - 	Either the correctable or uncorrectable register 
		indicating what error or errors have been see.

u8 severity - 	error severity 0:NONFATAL 1:FATAL 2:CORRECTED

The trace event will also provide a trace string that may look like:

"0000:05:00.0 PCIe Bus Error:severity=Uncorrected (Non-Fatal), Poisoned
TLP"

Signed-off-by: Lance Ortiz <lance.ortiz@hp.com>
---

 include/ras/aer_event.h |   77 +++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 77 insertions(+), 0 deletions(-)
 create mode 100644 include/ras/aer_event.h

diff --git a/include/ras/aer_event.h b/include/ras/aer_event.h
new file mode 100644
index 0000000..735c973
--- /dev/null
+++ b/include/ras/aer_event.h
@@ -0,0 +1,77 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM aer
+#define TRACE_INCLUDE_FILE aer_event
+
+#if !defined(_TRACE_AER_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_AER_H
+
+#include <linux/tracepoint.h>
+#include <linux/edac.h>
+
+
+/*
+ * Anhance Error Reporting (AER) PCIE Report Error
+ *
+ * These events are generated when hardware detects a corrected or
+ * uncorrected event on a pci express device and reports
+ * errors.  The event reports the following data.
+ *
+ * char * dev_name -	String containing the device identification
+ * u32 status -		Either the correctable or uncorrectable register
+ *			indicating what error or errors have been seen
+ * u8 severity -	error severity 0:NONFATAL 1:FATAL 2:CORRECTED
+ */
+
+#define correctable_error_string			\
+	{BIT(0),	"Receiver Error"},		\
+	{BIT(6),	"Bad TLP"},			\
+	{BIT(7),	"Bad DLLP"},			\
+	{BIT(8),	"RELAY_NUM Rollover"},		\
+	{BIT(12),	"Replay Timer Timeout"},	\
+	{BIT(13),	"Advisory Non-Fatal"}
+
+#define uncorrectable_error_string			\
+	{BIT(4),	"Data Link Protocol"},		\
+	{BIT(12),	"Poisoned TLP"},		\
+	{BIT(13),	"Flow Control Protocol"},	\
+	{BIT(14),	"Completion Timeout"},		\
+	{BIT(15),	"Completer Abort"},		\
+	{BIT(16),	"Unexpected Completion"},	\
+	{BIT(17),	"Receiver Overflow"},		\
+	{BIT(18),	"Malformed TLP"},		\
+	{BIT(19),	"ECRC"},			\
+	{BIT(20),	"Unsupported Request"}
+
+TRACE_EVENT(aer_event,
+	TP_PROTO(const char *dev_name,
+		 const u32 status,
+		 const u8 severity),
+
+	TP_ARGS(dev_name, status, severity),
+
+	TP_STRUCT__entry(
+		__string(	dev_name,	dev_name	)
+		__field(	u32,		status		)
+		__field(	u8,		severity	)
+	),
+
+	TP_fast_assign(
+		__assign_str(dev_name, dev_name);
+		__entry->status		= status;
+		__entry->severity	= severity;
+	),
+
+	TP_printk("%s PCIe Bus Error: severity=%s, %s\n",
+		__get_str(dev_name),
+		(__entry->severity == HW_EVENT_ERR_CORRECTED) ? "Corrected" :
+			((__entry->severity == HW_EVENT_ERR_FATAL) ?
+			"Fatal" : "Uncorrected"),
+		__entry->severity == HW_EVENT_ERR_CORRECTED ?
+		__print_flags(__entry->status, "|", correctable_error_string) :
+		__print_flags(__entry->status, "|", uncorrectable_error_string))
+);
+
+#endif /* _TRACE_AER_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>


^ permalink raw reply related

* Re: pci_enable_msix() fails with ENOMEM/EINVAL
From: Alex Williamson @ 2012-11-29 15:56 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: kvm, linux-pci, Yair Hershko, Shyam Kaushik
In-Reply-To: <A5D6C5E3D7774E62AA482C886483DA6D@alyakaslap>

On Thu, 2012-11-29 at 10:42 +0200, Alex Lyakas wrote:
> Hi Alex,
> did I understand correctly that "vector" value with which the call to 
> request_threaded_irq() is made is *not* supposed to be zero? Because in my 
> case, it is always zero, and still the failure I observe is not happening 
> always. Usually, 3 unique (non-zero) IRQ numbers are assigned to each 
> attached PCI device of each KVM VM.
> I will try to repro this like you suggested and let you know.

pci_enable_msix should be setting vectors for all the host_msix_entries.
That non-zero vector is the first parameter to request_threaded_irq.  If
I put a printk in the loop installing the interrupt handler, I get:

assigned_device_enable_host_msix entry 0, vector 44
assigned_device_enable_host_msix entry 0, vector 44
assigned_device_enable_host_msix entry 1, vector 45
assigned_device_enable_host_msix entry 0, vector 44
assigned_device_enable_host_msix entry 1, vector 45
assigned_device_enable_host_msix entry 2, vector 103

So we enable MSI-X with only one vector enabled, then a second vector
gets enabled and we disable and re-enable with two, and so on with
three.  This is for an assigned e1000e device.  Thanks,

Alex


^ permalink raw reply

* RE: [E1000-devel] 82571EB: Detected Hardware Unit Hang
From: Fujinaka, Todd @ 2012-11-29 15:52 UTC (permalink / raw)
  To: Ethan Zhao
  Cc: Joe Jin, Ben Hutchings, Mary Mcgrath, netdev@vger.kernel.org,
	e1000-devel@lists.sf.net, linux-kernel@vger.kernel.org, linux-pci
In-Reply-To: <CABawtvPrz=z_CLCvPeXkky7COyhjPyfBu3QCTbiiPpvr+5bicg@mail.gmail.com>

Someone else pointed this out to me locally. If you have a non-client BIOS, you should be able to set the MaxPayloadSize using setpci. You have to make sure that you're being consistent throughout all the associated links.

Todd Fujinaka
Technical Marketing Engineer
LAN Access Division (LAD)
Intel Corporation
todd.fujinaka@intel.com
(503) 712-4565


-----Original Message-----
From: Ethan Zhao [mailto:ethan.kernel@gmail.com] 
Sent: Wednesday, November 28, 2012 7:10 PM
To: Fujinaka, Todd
Cc: Joe Jin; Ben Hutchings; Mary Mcgrath; netdev@vger.kernel.org; e1000-devel@lists.sf.net; linux-kernel@vger.kernel.org; linux-pci
Subject: Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

Joe,
    Possibly your customer is running a kernel without source code on a platform whose vendor wouldn't like to fix BIOS issue( Is that a HP/Dell server ?).
    Anyway, to see if is a payload issue or,  you could change the payload size with setpci tool to those devices and set the link retrain bit to trigger the link retraining to debug the issue and identity the root cause.  I thinks it is much easier than modify the BIOS or  eeprom of NIC.

    e.g.
   set device control register to 0f 00   (128 bytes payload size)
   #   setpci -v -s 00:02.0 98.w=000f
   set device link control register to 60h (retrain the link)
   #  setpci -v -s 00:02.0 a0.b=60

  Hope it works,  Just my 2 cents.

Ethan.zhao@oracle.com

On Wed, Nov 28, 2012 at 11:53 PM, Fujinaka, Todd <todd.fujinaka@intel.com> wrote:
> The only EEPROM I know about or can speak to is the one attached to the 82571 and it doesn't set the MaxPayloadSize. That's done by the BIOS.
>
> Todd Fujinaka
> Technical Marketing Engineer
> LAN Access Division (LAD)
> Intel Corporation
> todd.fujinaka@intel.com
> (503) 712-4565
>
>
> -----Original Message-----
> From: Joe Jin [mailto:joe.jin@oracle.com]
> Sent: Wednesday, November 28, 2012 12:31 AM
> To: Ben Hutchings
> Cc: Fujinaka, Todd; Mary Mcgrath; netdev@vger.kernel.org; 
> e1000-devel@lists.sf.net; linux-kernel@vger.kernel.org; linux-pci
> Subject: Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
>
> On 11/28/12 02:10, Ben Hutchings wrote:
>> On Tue, 2012-11-27 at 17:32 +0000, Fujinaka, Todd wrote:
>>> Forgive me if I'm being too repetitious as I think some of this has 
>>> been mentioned in the past.
>>>
>>> We (and by we I mean the Ethernet part and driver) can only change 
>>> the advertised availability of a larger MaxPayloadSize. The size is 
>>> negotiated by both sides of the link when the link is established.
>>> The driver should not change the size of the link as it would be 
>>> poking at registers outside of its scope and is controlled by the 
>>> upstream bridge (not us).
>> [...]
>>
>> MaxPayloadSize (MPS) is not negotiated between devices but is 
>> programmed by the system firmware (at least for devices present at 
>> boot - the kernel may be responsible in case of hotplug).  You can 
>> use the kernel parameter 'pci=pcie_bus_perf' (or one of several 
>> others) to set a policy that overrides this, but no policy will allow 
>> setting MPS above the device's MaxPayloadSizeSupported (MPSS).
>>
>
> Ben,
>
> Unfortunately I'm using 3.0.x kernel and this is not included in the kernel.
> So I'm trying to use ethtool modify it from eeprom to see if help or no.
>
>
> Todd, I'll review all MaxPayload for all devices, but need to say if it mismatch, customer could not modify it from BIOS for there was not entry at there, to test it, we have to find how to verify if this is the root cause, so still need to find the offset in eeprom.
>
> Thanks in advance,
> Joe
>

^ permalink raw reply

* Re: pci_enable_msix() fails with ENOMEM/EINVAL
From: Alex Lyakas @ 2012-11-29  8:42 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-pci, Yair Hershko, Shyam Kaushik
In-Reply-To: <1353960253.1809.128.camel@bling.home>

Hi Alex,
did I understand correctly that "vector" value with which the call to 
request_threaded_irq() is made is *not* supposed to be zero? Because in my 
case, it is always zero, and still the failure I observe is not happening 
always. Usually, 3 unique (non-zero) IRQ numbers are assigned to each 
attached PCI device of each KVM VM.
I will try to repro this like you suggested and let you know.

Thanks for your help,
Alex.


-----Original Message----- 
From: Alex Williamson
Sent: 26 November, 2012 10:04 PM
To: Alex Lyakas
Cc: kvm@vger.kernel.org
Subject: Re: pci_enable_msix() fails with ENOMEM/EINVAL

On Thu, 2012-11-22 at 10:52 +0200, Alex Lyakas wrote:
> Hi Alex,
> thanks for your response.
>
> I printed out the "vector" and "entry" values of dev->host_msix_entries[i]
> within assigned_device_enable_host_msix() before call to
> request_threaded_irq(). I see that they are all 0s:
> kernel: [ 3332.610980] kvm-8095: KVM_ASSIGN_DEV_IRQ assigned_dev_id=924
> kernel: [ 3332.610985] kvm-8095: assigned_device_enable_host_msix()
> assigned_dev_id=924 #0: [v=0 e=0]
> kernel: [ 3332.610989] kvm-8095: assigned_device_enable_host_msix()
> assigned_dev_id=924 #1: [v=0 e=1]
> kernel: [ 3332.610992] kvm-8095: assigned_device_enable_host_msix()
> assigned_dev_id=924 #2: [v=0 e=2]
>
> So I don't really understand how they all ask for irq=0; I must be missing
> something. Is there any other explanation of request_threaded_irq() to
> return EBUSY? From the code I don't see that there is.

The vectors all being zero sounds like an indication that
pci_enable_msix() didn't actually work.  Each of those should be a
unique vector.   Does booting the host with "nointremap" perhaps make a
difference?  Maybe we can isolate the problem to the interrupt remapper
code.

> This issue is reproducible and is not going to go away by itself. Working
> around it is also problematic. We thought to check whether all IRQs are
> properly attached after QEMU sets the vm state to "running". However, vm
> state is set to "running" before IRQ attachments are performed; we 
> debugged
> this and found out that they are done from a different thread, from a 
> stack
> trace like this:
> kvm_assign_irq()
> assigned_dev_update_msix()
> assigned_dev_pci_write_config()
> pci_host_config_write_common()
> pci_data_write()
> pci_host_data_write()
> memory_region_write_accessor()
> access_with_adjusted_size()
> memory_region_iorange_write()
> ioport_writew_thunk()
> ioport_write()
> cpu_outw()
> kvm_handle_io()
> kvm_cpu_exec()
> qemu_kvm_cpu_thread_fn()
>
> So looks like this is performed on-demand (on first IO), so no reliable
> point to check that IRQs are attached properly.

Correct, MSI-X is setup when the guest enables MSI-X on the device,
which is likely a long way into guest boot.  There's no guarantee that
the guest ever enables MSI-X, so there's no association to whether the
guest is "running".

>  Another issue that in KVM
> code the return value of pci_host_config_write_common() is not checked, so
> there is no way to report a failure.

A common problem in qemu, imho

> Is there any way you think you can help me debug this further?

It seems like pci_enable_msix is still failing, but perhaps silently
without irqbalance.  We need to figure out where and why.  Isolating it
to the interrupt remapper with "nointremap" might give us some clues
(this is an Intel VT-d system, right?).  Thanks,

Alex


> -----Original Message----- 
> From: Alex Williamson
> Sent: 22 November, 2012 12:25 AM
> To: Alex Lyakas
> Cc: kvm@vger.kernel.org
> Subject: Re: pci_enable_msix() fails with ENOMEM/EINVAL
>
> On Wed, 2012-11-21 at 16:19 +0200, Alex Lyakas wrote:
> > Hi,
> > I was advised to turn off irqbalance and reproduced this issue, but
> > the failure is in a different place now. Now request_threaded_irq()
> > fails with EBUSY.
> > According to the code, this can only happen on the path:
> > request_threaded_irq() -> __setup_irq()
> > Now in setup irq, the only place where EBUSY can show up for us is here:
> > ...
> > raw_spin_lock_irqsave(&desc->lock, flags);
> > old_ptr = &desc->action;
> > old = *old_ptr;
> > if (old) {
> > /*
> > * Can't share interrupts unless both agree to and are
> > * the same type (level, edge, polarity). So both flag
> > * fields must have IRQF_SHARED set and the bits which
> > * set the trigger type must match. Also all must
> > * agree on ONESHOT.
> > */
> > if (!((old->flags & new->flags) & IRQF_SHARED) ||
> >     ((old->flags ^ new->flags) & IRQF_TRIGGER_MASK) ||
> >     ((old->flags ^ new->flags) & IRQF_ONESHOT)) {
> > old_name = old->name;
> > goto mismatch;
> > }
> >
> > /* All handlers must agree on per-cpuness */
> > if ((old->flags & IRQF_PERCPU) !=
> >     (new->flags & IRQF_PERCPU))
> > goto mismatch;
> >
> > KVM calls request_threaded_irq() with flags==0, so can it be that
> > different KVM processes request the same IRQ?
>
> Shouldn't be possible, irqs are allocated from a bitmap protected by a
> mutex, see __irq_alloc_descs
>
> >  How different KVM
> > processes spawned simultaneously agree between them on IRQ numbers?
>
> They don't, MSI/X vectors are not currently share-able.  Can you show
> that you're actually getting duplicate irq vectors?  Thanks,
>
> Alex
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html




^ permalink raw reply

* Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
From: Ethan Zhao @ 2012-11-29  3:10 UTC (permalink / raw)
  To: Fujinaka, Todd
  Cc: Joe Jin, Ben Hutchings, Mary Mcgrath, netdev@vger.kernel.org,
	e1000-devel@lists.sf.net, linux-kernel@vger.kernel.org, linux-pci
In-Reply-To: <9B4A1B1917080E46B64F07F2989DADD62F2D8AAC@ORSMSX102.amr.corp.intel.com>

Joe,
    Possibly your customer is running a kernel without source code on
a platform whose vendor wouldn't like to fix BIOS issue( Is that a
HP/Dell server ?).
    Anyway, to see if is a payload issue or,  you could change the
payload size with setpci tool to those devices and set the link
retrain bit to trigger the link retraining to debug the issue and
identity the root cause.  I thinks it is much easier than modify the
BIOS or  eeprom of NIC.

    e.g.
   set device control register to 0f 00   (128 bytes payload size)
   #   setpci -v -s 00:02.0 98.w=000f
   set device link control register to 60h (retrain the link)
   #  setpci -v -s 00:02.0 a0.b=60

  Hope it works,  Just my 2 cents.

Ethan.zhao@oracle.com

On Wed, Nov 28, 2012 at 11:53 PM, Fujinaka, Todd
<todd.fujinaka@intel.com> wrote:
> The only EEPROM I know about or can speak to is the one attached to the 82571 and it doesn't set the MaxPayloadSize. That's done by the BIOS.
>
> Todd Fujinaka
> Technical Marketing Engineer
> LAN Access Division (LAD)
> Intel Corporation
> todd.fujinaka@intel.com
> (503) 712-4565
>
>
> -----Original Message-----
> From: Joe Jin [mailto:joe.jin@oracle.com]
> Sent: Wednesday, November 28, 2012 12:31 AM
> To: Ben Hutchings
> Cc: Fujinaka, Todd; Mary Mcgrath; netdev@vger.kernel.org; e1000-devel@lists.sf.net; linux-kernel@vger.kernel.org; linux-pci
> Subject: Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
>
> On 11/28/12 02:10, Ben Hutchings wrote:
>> On Tue, 2012-11-27 at 17:32 +0000, Fujinaka, Todd wrote:
>>> Forgive me if I'm being too repetitious as I think some of this has
>>> been mentioned in the past.
>>>
>>> We (and by we I mean the Ethernet part and driver) can only change
>>> the advertised availability of a larger MaxPayloadSize. The size is
>>> negotiated by both sides of the link when the link is established.
>>> The driver should not change the size of the link as it would be
>>> poking at registers outside of its scope and is controlled by the
>>> upstream bridge (not us).
>> [...]
>>
>> MaxPayloadSize (MPS) is not negotiated between devices but is
>> programmed by the system firmware (at least for devices present at
>> boot - the kernel may be responsible in case of hotplug).  You can use
>> the kernel parameter 'pci=pcie_bus_perf' (or one of several others) to
>> set a policy that overrides this, but no policy will allow setting MPS
>> above the device's MaxPayloadSizeSupported (MPSS).
>>
>
> Ben,
>
> Unfortunately I'm using 3.0.x kernel and this is not included in the kernel.
> So I'm trying to use ethtool modify it from eeprom to see if help or no.
>
>
> Todd, I'll review all MaxPayload for all devices, but need to say if it mismatch, customer could not modify it from BIOS for there was not entry at there, to test it, we have to find how to verify if this is the root cause, so still need to find the offset in eeprom.
>
> Thanks in advance,
> Joe
>

^ permalink raw reply

* Re: [PATCH 5/9] unicore32/PCI: Remove CONFIG_HOTPLUG ifdefs
From: guanxuetao @ 2012-11-29  2:02 UTC (permalink / raw)
  To: Bill Pemberton; +Cc: gregkh, linux-pci, Guan Xuetao, Bjorn Helgaas
In-Reply-To: <1353530100-728-6-git-send-email-wfp5p@virginia.edu>

> Remove conditional code based on CONFIG_HOTPLUG being false.  It's
> always on now in preparation of it going away as an option.
>
> Signed-off-by: Bill Pemberton <wfp5p@virginia.edu>
> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
> Cc: Bjorn Helgaas <bhelgaas@google.com>

Acked-by: Guan Xuetao <gxt@mprc.pku.edu.cn>

> ---
>  arch/unicore32/kernel/pci.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/arch/unicore32/kernel/pci.c b/arch/unicore32/kernel/pci.c
> index b0056f6..7c43592 100644
> --- a/arch/unicore32/kernel/pci.c
> +++ b/arch/unicore32/kernel/pci.c
> @@ -250,9 +250,7 @@ void __devinit pcibios_fixup_bus(struct pci_bus *bus)
>  	printk(KERN_INFO "PCI: bus%d: Fast back to back transfers %sabled\n",
>  		bus->number, (features & PCI_COMMAND_FAST_BACK) ? "en" : "dis");
>  }
> -#ifdef CONFIG_HOTPLUG
>  EXPORT_SYMBOL(pcibios_fixup_bus);
> -#endif
>
>  static int __init pci_common_init(void)
>  {
> --
> 1.8.0
>


^ permalink raw reply

* Re: pci_enable_msix() failure
From: Bjorn Helgaas @ 2012-11-29  0:18 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-pci
In-Reply-To: <0B81514A792D4D9FAB292ED315D48BE2@alyakaslap>

On Tue, Nov 20, 2012 at 11:04 AM, Alex Lyakas <alex@zadarastorage.com> wrote:
> Hello list, I was advised to post this question by Jesse Barnes; I also
> posted to KVM list.

When you post a question to more than one list, you should always send
a *single* message with all the lists being copied.  That way
everybody can tell what progress is being made, and we can avoid
duplicating effort.

I see via a Google search that you've gotten responses on the KVM
list, e.g., http://www.spinics.net/lists/kvm/msg82997.html, so I
assume you don't need any more help from linux-pci.

If that's not the case, please add linux-pci to the CC: list of the
thread where you're working on this.

> I am running Ubuntu-Precise 3.2.0-29-generic #46, with stock KVM ("QEMU
> emulator version 1.0 (qemu-kvm-1.0)") on a Dell R510 server. I have one
> dual-port Intel's NIC 82599, of which I spawn 32 VFs from each port. I spawn
> virtual machines with KVM, each VM has 4 VFs attached (two from each PF).
>
> Once in a while, in particular when I spawn multiple VMs in parallel, I hit
> an issue that one of the VFs does not have an IRQ assigned to it. I am
> checking this in /proc/interrupts, looking for entries like
> "kvm:0000:03:14.6". In some cases, an entry is missing for a particular VF.
> As a result, the VF within the VM is non-functional.
>
> I debugged this issue further, by adding prints to kvm.ko code. I see that
> the failure happens in kvm_vm_ioctl_assigned_device/KVM_ASSIGN_DEV_IRQ path,
> which calls assigned_device_enable_host_msix() function, which calls
> pci_enable_msix(), which fails with EINVAL or with ENOMEM. This path is
> called twice for each VF.
>
> I see that first pci_enable_msix() returns -12/-22, and when
> kvm_vm_ioctl_set_msix_nr() is called again, it sees that adev->entries_nr !=
> 0 and fails the call with EINVAL. I can repro it only when spawning like 8
> or 10 VMs in parallel, but it doesn't happen every time. So it seems like
> this is not a resource shortage problem, but some race somewhere.
>
> I tested this with several version of ixgbe drivers, including the in-tree
> version that comes with Precise. It reproduces with all the versions.
>
> Can anybody pls advise on how to debug this issue further?
>
> Thanks,
> Alex.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] PCI,sriov: add documentation on sysfs-based VF control
From: Don Dutile @ 2012-11-29  0:09 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci
In-Reply-To: <CAErSpo4mso26k5pwpBs9ujrX7dZtvd-fxYnU+m712xycwngCvg@mail.gmail.com>

On 11/28/2012 03:18 PM, Bjorn Helgaas wrote:
> On Tue, Nov 27, 2012 at 8:31 PM, Donald Dutile<ddutile@redhat.com>  wrote:
>>   Signed-off: Donald Dutile<ddutile@redhat.com>
>>
>
> Thanks, Don.  I added this to my -next branch, for merging in the v3.8
> merge window.
>
> I removed the trailing underscore from sriov_numvfs_.
>
>
Thanks for pulling it into your 3.8-next branch and the correction!
My apologies for the delay in getting it out to the list
after posting the related code.
- Don

>>   Documentation/ABI/testing/sysfs-bus-pci | 34 +++++++++++++++++++++++
>>   Documentation/PCI/pci-iov-howto.txt     | 48 ++++++++++++++++++++++++++++++---
>>   2 files changed, 78 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
>> index dff1f48..1cb389d 100644
>> --- a/Documentation/ABI/testing/sysfs-bus-pci
>> +++ b/Documentation/ABI/testing/sysfs-bus-pci
>> @@ -222,3 +222,37 @@ Description:
>>                  satisfied too.  Reading this attribute will show the current
>>                  value of d3cold_allowed bit.  Writing this attribute will set
>>                  the value of d3cold_allowed bit.
>> +
>> +What:          /sys/bus/pci/devices/.../sriov_totalvfs
>> +Date:          November 2012
>> +Contact:       Donald Dutile<ddutile@redhat.com>
>> +Description:
>> +               This file appears when a physical PCIe device supports SR-IOV.
>> +               Userspace applications can read this file to determine the
>> +               maximum number of Virtual Functions (VFs) a PCIe physical
>> +               function (PF) can support. Typically, this is the value reported
>> +               in the PF's SR-IOV extended capability structure's TotalVFs
>> +               element.  Drivers have the ability at probe time to reduce the
>> +               value read from this file via the pci_sriov_set_totalvfs()
>> +               function.
>> +
>> +What:          /sys/bus/pci/devices/.../sriov_numvfs_
>> +Date:          November 2012
>> +Contact:       Donald Dutile<ddutile@redhat.com>
>> +Description:
>> +               This file appears when a physical PCIe device supports SR-IOV.
>> +               Userspace applications can read and write to this file to
>> +               determine and control the enablement or disablement of Virtual
>> +               Functions (VFs) on the physical function (PF). A read of this
>> +               file will return the number of VFs that are enabled on this PF.
>> +               A number written to this file will enable the specified
>> +               number of VFs. A userspace application would typically read the
>> +               file and check that the value is zero, and then write the number
>> +               of VFs that should be enabled on the PF; the value written
>> +               should be less than or equal to the value in the sriov_totalvfs
>> +               file. A userspace application wanting to disable the VFs would
>> +               write a zero to this file. The core ensures that valid values
>> +               are written to this file, and returns errors when values are not
>> +               valid.  For example, writing a 2 to this file when sriov_numvfs
>> +               is not 0 and not 2 already will return an error. Writing a 10
>> +               when the value of sriov_totalvfs is 8 will return an error.
>> diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt
>> index fc73ef5..c41cf95 100644
>> --- a/Documentation/PCI/pci-iov-howto.txt
>> +++ b/Documentation/PCI/pci-iov-howto.txt
>> @@ -2,6 +2,9 @@
>>                  Copyright (C) 2009 Intel Corporation
>>                      Yu Zhao<yu.zhao@intel.com>
>>
>> +               Update: November 2012
>> +                       -- sysfs-based SRIOV enable-/disable-ment
>> +               Donald Dutile<ddutile@redhat.com>
>>
>>   1. Overview
>>
>> @@ -24,10 +27,21 @@ real existing PCI device.
>>
>>   2.1 How can I enable SR-IOV capability
>>
>> -The device driver (PF driver) will control the enabling and disabling
>> -of the capability via API provided by SR-IOV core. If the hardware
>> -has SR-IOV capability, loading its PF driver would enable it and all
>> -VFs associated with the PF.
>> +Multiple methods are available for SR-IOV enablement.
>> +In the first method, the device driver (PF driver) will control the
>> +enabling and disabling of the capability via API provided by SR-IOV core.
>> +If the hardware has SR-IOV capability, loading its PF driver would
>> +enable it and all VFs associated with the PF.  Some PF drivers require
>> +a module parameter to be set to determine the number of VFs to enable.
>> +In the second method, a write to the sysfs file sriov_numvfs will
>> +enable and disable the VFs associated with a PCIe PF.  This method
>> +enables per-PF, VF enable/disable values versus the first method,
>> +which applies to all PFs of the same device.  Additionally, the
>> +PCI SRIOV core support ensures that enable/disable operations are
>> +valid to reduce duplication in multiple drivers for the same
>> +checks, e.g., check numvfs == 0 if enabling VFs, ensure
>> +numvfs<= totalvfs.
>> +The second method is the recommended method for new/future VF devices.
>>
>>   2.2 How can I use the Virtual Functions
>>
>> @@ -40,13 +54,22 @@ requires device driver that is same as a normal PCI device's.
>>   3.1 SR-IOV API
>>
>>   To enable SR-IOV capability:
>> +(a) For the first method, in the driver:
>>          int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
>>          'nr_virtfn' is number of VFs to be enabled.
>> +(b) For the second method, from sysfs:
>> +       echo 'nr_virtfn'>  \
>> +        /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
>>
>>   To disable SR-IOV capability:
>> +(a) For the first method, in the driver:
>>          void pci_disable_sriov(struct pci_dev *dev);
>> +(b) For the second method, from sysfs:
>> +       echo  0>  \
>> +        /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
>>
>>   To notify SR-IOV core of Virtual Function Migration:
>> +(a) In the driver:
>>          irqreturn_t pci_sriov_migration(struct pci_dev *dev);
>>
>>   3.2 Usage example
>> @@ -88,6 +111,22 @@ static void dev_shutdown(struct pci_dev *dev)
>>          ...
>>   }
>>
>> +static int dev_sriov_configure(struct pci_dev *dev, int numvfs)
>> +{
>> +       if (numvfs>  0) {
>> +               ...
>> +               pci_enable_sriov(dev, numvfs);
>> +               ...
>> +               return numvfs;
>> +       }
>> +       if (numvfs == 0) {
>> +               ....
>> +               pci_disable_sriov(dev);
>> +               ...
>> +               return 0;
>> +       }
>> +}
>> +
>>   static struct pci_driver dev_driver = {
>>          .name =         "SR-IOV Physical Function driver",
>>          .id_table =     dev_id_table,
>> @@ -96,4 +135,5 @@ static struct pci_driver dev_driver = {
>>          .suspend =      dev_suspend,
>>          .resume =       dev_resume,
>>          .shutdown =     dev_shutdown,
>> +       .sriov_configure = dev_sriov_configure,
>>   };
>> --
>> 1.7.10.2.552.gaa3bb87
>>


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox