* Re: [PATCH v4 02/63] Documentation: ACPI: move namespace.txt to firmware-guide/acpi and convert to reST
From: Changbin Du @ 2019-04-24 16:09 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: fenghua.yu, x86, Jonathan Corbet, linux-pci, linux-gpio,
linux-doc, rjw, linux-kernel, linux-acpi, mingo, Bjorn Helgaas,
tglx, linuxppc-dev, Changbin Du
In-Reply-To: <20190423173840.2e450b34@coco.lan>
On Tue, Apr 23, 2019 at 05:38:40PM -0300, Mauro Carvalho Chehab wrote:
> Em Wed, 24 Apr 2019 00:28:31 +0800
> Changbin Du <changbin.du@gmail.com> escreveu:
>
> > This converts the plain text documentation to reStructuredText format and
> > add it to Sphinx TOC tree. No essential content change.
> >
> > Signed-off-by: Changbin Du <changbin.du@gmail.com>
> > ---
> > Documentation/firmware-guide/acpi/index.rst | 1 +
> > .../acpi/namespace.rst} | 310 +++++++++---------
> > 2 files changed, 161 insertions(+), 150 deletions(-)
> > rename Documentation/{acpi/namespace.txt => firmware-guide/acpi/namespace.rst} (54%)
> >
> > diff --git a/Documentation/firmware-guide/acpi/index.rst b/Documentation/firmware-guide/acpi/index.rst
> > index 0ec7d072ba22..210ad8acd6df 100644
> > --- a/Documentation/firmware-guide/acpi/index.rst
> > +++ b/Documentation/firmware-guide/acpi/index.rst
> > @@ -7,3 +7,4 @@ ACPI Support
> > .. toctree::
> > :maxdepth: 1
> >
> > + namespace
> > diff --git a/Documentation/acpi/namespace.txt b/Documentation/firmware-guide/acpi/namespace.rst
> > similarity index 54%
> > rename from Documentation/acpi/namespace.txt
> > rename to Documentation/firmware-guide/acpi/namespace.rst
> > index 1860cb3865c6..443f0e5d0617 100644
> > --- a/Documentation/acpi/namespace.txt
> > +++ b/Documentation/firmware-guide/acpi/namespace.rst
> > @@ -1,85 +1,88 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +.. include:: <isonum.txt>
> > +
> > +===================================================
> > ACPI Device Tree - Representation of ACPI Namespace
> > +===================================================
> > +
> > +:Copyright: |copy| 2013, Intel Corporation
> > +
> > +:Author: Lv Zheng <lv.zheng@intel.com>
> > +
> > +:Abstract: The Linux ACPI subsystem converts ACPI namespace objects into a Linux
> > + device tree under the /sys/devices/LNXSYSTEM:00 and updates it upon
> > + receiving ACPI hotplug notification events. For each device object
> > + in this hierarchy there is a corresponding symbolic link in the
> > + /sys/bus/acpi/devices.
> > + This document illustrates the structure of the ACPI device tree.
>
> Well, this is a matter of preference. I would add Abstract as a chapter,
> as this would make it part of the top index, with can be useful.
>
Now it becomes a chapter. Thanks.
> In any case:
>
> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
>
> > +
> > +:Credit: Thanks for the help from Zhang Rui <rui.zhang@intel.com> and
> > + Rafael J.Wysocki <rafael.j.wysocki@intel.com>.
> > +
> > +
> > +ACPI Definition Blocks
> > +======================
> > +
> > +The ACPI firmware sets up RSDP (Root System Description Pointer) in the
> > +system memory address space pointing to the XSDT (Extended System
> > +Description Table). The XSDT always points to the FADT (Fixed ACPI
> > +Description Table) using its first entry, the data within the FADT
> > +includes various fixed-length entries that describe fixed ACPI features
> > +of the hardware. The FADT contains a pointer to the DSDT
> > +(Differentiated System Descripition Table). The XSDT also contains
> > +entries pointing to possibly multiple SSDTs (Secondary System
> > +Description Table).
> > +
> > +The DSDT and SSDT data is organized in data structures called definition
> > +blocks that contain definitions of various objects, including ACPI
> > +control methods, encoded in AML (ACPI Machine Language). The data block
> > +of the DSDT along with the contents of SSDTs represents a hierarchical
> > +data structure called the ACPI namespace whose topology reflects the
> > +structure of the underlying hardware platform.
> > +
> > +The relationships between ACPI System Definition Tables described above
> > +are illustrated in the following diagram::
> > +
> > + +---------+ +-------+ +--------+ +------------------------+
> > + | RSDP | +->| XSDT | +->| FADT | | +-------------------+ |
> > + +---------+ | +-------+ | +--------+ +-|->| DSDT | |
> > + | Pointer | | | Entry |-+ | ...... | | | +-------------------+ |
> > + +---------+ | +-------+ | X_DSDT |--+ | | Definition Blocks | |
> > + | Pointer |-+ | ..... | | ...... | | +-------------------+ |
> > + +---------+ +-------+ +--------+ | +-------------------+ |
> > + | Entry |------------------|->| SSDT | |
> > + +- - - -+ | +-------------------| |
> > + | Entry | - - - - - - - -+ | | Definition Blocks | |
> > + +- - - -+ | | +-------------------+ |
> > + | | +- - - - - - - - - -+ |
> > + +-|->| SSDT | |
> > + | +-------------------+ |
> > + | | Definition Blocks | |
> > + | +- - - - - - - - - -+ |
> > + +------------------------+
> > + |
> > + OSPM Loading |
> > + \|/
> > + +----------------+
> > + | ACPI Namespace |
> > + +----------------+
> > +
> > + Figure 1. ACPI Definition Blocks
> > +
> > +.. note:: RSDP can also contain a pointer to the RSDT (Root System
> > + Description Table). Platforms provide RSDT to enable
> > + compatibility with ACPI 1.0 operating systems. The OS is expected
> > + to use XSDT, if present.
> > +
> > +
> > +Example ACPI Namespace
> > +======================
> > +
> > +All definition blocks are loaded into a single namespace. The namespace
> > +is a hierarchy of objects identified by names and paths.
> > +The following naming conventions apply to object names in the ACPI
> > +namespace:
> >
> > -Copyright (C) 2013, Intel Corporation
> > -Author: Lv Zheng <lv.zheng@intel.com>
> > -
> > -
> > -Abstract:
> > -
> > -The Linux ACPI subsystem converts ACPI namespace objects into a Linux
> > -device tree under the /sys/devices/LNXSYSTEM:00 and updates it upon
> > -receiving ACPI hotplug notification events. For each device object in this
> > -hierarchy there is a corresponding symbolic link in the
> > -/sys/bus/acpi/devices.
> > -This document illustrates the structure of the ACPI device tree.
> > -
> > -
> > -Credit:
> > -
> > -Thanks for the help from Zhang Rui <rui.zhang@intel.com> and Rafael J.
> > -Wysocki <rafael.j.wysocki@intel.com>.
> > -
> > -
> > -1. ACPI Definition Blocks
> > -
> > - The ACPI firmware sets up RSDP (Root System Description Pointer) in the
> > - system memory address space pointing to the XSDT (Extended System
> > - Description Table). The XSDT always points to the FADT (Fixed ACPI
> > - Description Table) using its first entry, the data within the FADT
> > - includes various fixed-length entries that describe fixed ACPI features
> > - of the hardware. The FADT contains a pointer to the DSDT
> > - (Differentiated System Descripition Table). The XSDT also contains
> > - entries pointing to possibly multiple SSDTs (Secondary System
> > - Description Table).
> > -
> > - The DSDT and SSDT data is organized in data structures called definition
> > - blocks that contain definitions of various objects, including ACPI
> > - control methods, encoded in AML (ACPI Machine Language). The data block
> > - of the DSDT along with the contents of SSDTs represents a hierarchical
> > - data structure called the ACPI namespace whose topology reflects the
> > - structure of the underlying hardware platform.
> > -
> > - The relationships between ACPI System Definition Tables described above
> > - are illustrated in the following diagram.
> > -
> > - +---------+ +-------+ +--------+ +------------------------+
> > - | RSDP | +->| XSDT | +->| FADT | | +-------------------+ |
> > - +---------+ | +-------+ | +--------+ +-|->| DSDT | |
> > - | Pointer | | | Entry |-+ | ...... | | | +-------------------+ |
> > - +---------+ | +-------+ | X_DSDT |--+ | | Definition Blocks | |
> > - | Pointer |-+ | ..... | | ...... | | +-------------------+ |
> > - +---------+ +-------+ +--------+ | +-------------------+ |
> > - | Entry |------------------|->| SSDT | |
> > - +- - - -+ | +-------------------| |
> > - | Entry | - - - - - - - -+ | | Definition Blocks | |
> > - +- - - -+ | | +-------------------+ |
> > - | | +- - - - - - - - - -+ |
> > - +-|->| SSDT | |
> > - | +-------------------+ |
> > - | | Definition Blocks | |
> > - | +- - - - - - - - - -+ |
> > - +------------------------+
> > - |
> > - OSPM Loading |
> > - \|/
> > - +----------------+
> > - | ACPI Namespace |
> > - +----------------+
> > -
> > - Figure 1. ACPI Definition Blocks
> > -
> > - NOTE: RSDP can also contain a pointer to the RSDT (Root System
> > - Description Table). Platforms provide RSDT to enable
> > - compatibility with ACPI 1.0 operating systems. The OS is expected
> > - to use XSDT, if present.
> > -
> > -
> > -2. Example ACPI Namespace
> > -
> > - All definition blocks are loaded into a single namespace. The namespace
> > - is a hierarchy of objects identified by names and paths.
> > - The following naming conventions apply to object names in the ACPI
> > - namespace:
> > 1. All names are 32 bits long.
> > 2. The first byte of a name must be one of 'A' - 'Z', '_'.
> > 3. Each of the remaining bytes of a name must be one of 'A' - 'Z', '0'
> > @@ -91,7 +94,7 @@ Wysocki <rafael.j.wysocki@intel.com>.
> > (i.e. names prepended with '^' are relative to the parent of the
> > current namespace node).
> >
> > - The figure below shows an example ACPI namespace.
> > +The figure below shows an example ACPI namespace::
> >
> > +------+
> > | \ | Root
> > @@ -184,19 +187,20 @@ Wysocki <rafael.j.wysocki@intel.com>.
> > Figure 2. Example ACPI Namespace
> >
> >
> > -3. Linux ACPI Device Objects
> > +Linux ACPI Device Objects
> > +=========================
> >
> > - The Linux kernel's core ACPI subsystem creates struct acpi_device
> > - objects for ACPI namespace objects representing devices, power resources
> > - processors, thermal zones. Those objects are exported to user space via
> > - sysfs as directories in the subtree under /sys/devices/LNXSYSTM:00. The
> > - format of their names is <bus_id:instance>, where 'bus_id' refers to the
> > - ACPI namespace representation of the given object and 'instance' is used
> > - for distinguishing different object of the same 'bus_id' (it is
> > - two-digit decimal representation of an unsigned integer).
> > +The Linux kernel's core ACPI subsystem creates struct acpi_device
> > +objects for ACPI namespace objects representing devices, power resources
> > +processors, thermal zones. Those objects are exported to user space via
> > +sysfs as directories in the subtree under /sys/devices/LNXSYSTM:00. The
> > +format of their names is <bus_id:instance>, where 'bus_id' refers to the
> > +ACPI namespace representation of the given object and 'instance' is used
> > +for distinguishing different object of the same 'bus_id' (it is
> > +two-digit decimal representation of an unsigned integer).
> >
> > - The value of 'bus_id' depends on the type of the object whose name it is
> > - part of as listed in the table below.
> > +The value of 'bus_id' depends on the type of the object whose name it is
> > +part of as listed in the table below::
> >
> > +---+-----------------+-------+----------+
> > | | Object/Feature | Table | bus_id |
> > @@ -226,10 +230,11 @@ Wysocki <rafael.j.wysocki@intel.com>.
> >
> > Table 1. ACPI Namespace Objects Mapping
> >
> > - The following rules apply when creating struct acpi_device objects on
> > - the basis of the contents of ACPI System Description Tables (as
> > - indicated by the letter in the first column and the notation in the
> > - second column of the table above):
> > +The following rules apply when creating struct acpi_device objects on
> > +the basis of the contents of ACPI System Description Tables (as
> > +indicated by the letter in the first column and the notation in the
> > +second column of the table above):
> > +
> > N:
> > The object's source is an ACPI namespace node (as indicated by the
> > named object's type in the second column). In that case the object's
> > @@ -249,13 +254,14 @@ Wysocki <rafael.j.wysocki@intel.com>.
> > struct acpi_device object with LNXVIDEO 'bus_id' will be created for
> > it.
> >
> > - The third column of the above table indicates which ACPI System
> > - Description Tables contain information used for the creation of the
> > - struct acpi_device objects represented by the given row (xSDT means DSDT
> > - or SSDT).
> > +The third column of the above table indicates which ACPI System
> > +Description Tables contain information used for the creation of the
> > +struct acpi_device objects represented by the given row (xSDT means DSDT
> > +or SSDT).
> > +
> > +The forth column of the above table indicates the 'bus_id' generation
> > +rule of the struct acpi_device object:
> >
> > - The forth column of the above table indicates the 'bus_id' generation
> > - rule of the struct acpi_device object:
> > _HID:
> > _HID in the last column of the table means that the object's bus_id
> > is derived from the _HID/_CID identification objects present under
> > @@ -275,45 +281,47 @@ Wysocki <rafael.j.wysocki@intel.com>.
> > object's bus_id.
> >
> >
> > -4. Linux ACPI Physical Device Glue
> > -
> > - ACPI device (i.e. struct acpi_device) objects may be linked to other
> > - objects in the Linux' device hierarchy that represent "physical" devices
> > - (for example, devices on the PCI bus). If that happens, it means that
> > - the ACPI device object is a "companion" of a device otherwise
> > - represented in a different way and is used (1) to provide configuration
> > - information on that device which cannot be obtained by other means and
> > - (2) to do specific things to the device with the help of its ACPI
> > - control methods. One ACPI device object may be linked this way to
> > - multiple "physical" devices.
> > -
> > - If an ACPI device object is linked to a "physical" device, its sysfs
> > - directory contains the "physical_node" symbolic link to the sysfs
> > - directory of the target device object. In turn, the target device's
> > - sysfs directory will then contain the "firmware_node" symbolic link to
> > - the sysfs directory of the companion ACPI device object.
> > - The linking mechanism relies on device identification provided by the
> > - ACPI namespace. For example, if there's an ACPI namespace object
> > - representing a PCI device (i.e. a device object under an ACPI namespace
> > - object representing a PCI bridge) whose _ADR returns 0x00020000 and the
> > - bus number of the parent PCI bridge is 0, the sysfs directory
> > - representing the struct acpi_device object created for that ACPI
> > - namespace object will contain the 'physical_node' symbolic link to the
> > - /sys/devices/pci0000:00/0000:00:02:0/ sysfs directory of the
> > - corresponding PCI device.
> > -
> > - The linking mechanism is generally bus-specific. The core of its
> > - implementation is located in the drivers/acpi/glue.c file, but there are
> > - complementary parts depending on the bus types in question located
> > - elsewhere. For example, the PCI-specific part of it is located in
> > - drivers/pci/pci-acpi.c.
> > -
> > -
> > -5. Example Linux ACPI Device Tree
> > -
> > - The sysfs hierarchy of struct acpi_device objects corresponding to the
> > - example ACPI namespace illustrated in Figure 2 with the addition of
> > - fixed PWR_BUTTON/SLP_BUTTON devices is shown below.
> > +Linux ACPI Physical Device Glue
> > +===============================
> > +
> > +ACPI device (i.e. struct acpi_device) objects may be linked to other
> > +objects in the Linux' device hierarchy that represent "physical" devices
> > +(for example, devices on the PCI bus). If that happens, it means that
> > +the ACPI device object is a "companion" of a device otherwise
> > +represented in a different way and is used (1) to provide configuration
> > +information on that device which cannot be obtained by other means and
> > +(2) to do specific things to the device with the help of its ACPI
> > +control methods. One ACPI device object may be linked this way to
> > +multiple "physical" devices.
> > +
> > +If an ACPI device object is linked to a "physical" device, its sysfs
> > +directory contains the "physical_node" symbolic link to the sysfs
> > +directory of the target device object. In turn, the target device's
> > +sysfs directory will then contain the "firmware_node" symbolic link to
> > +the sysfs directory of the companion ACPI device object.
> > +The linking mechanism relies on device identification provided by the
> > +ACPI namespace. For example, if there's an ACPI namespace object
> > +representing a PCI device (i.e. a device object under an ACPI namespace
> > +object representing a PCI bridge) whose _ADR returns 0x00020000 and the
> > +bus number of the parent PCI bridge is 0, the sysfs directory
> > +representing the struct acpi_device object created for that ACPI
> > +namespace object will contain the 'physical_node' symbolic link to the
> > +/sys/devices/pci0000:00/0000:00:02:0/ sysfs directory of the
> > +corresponding PCI device.
> > +
> > +The linking mechanism is generally bus-specific. The core of its
> > +implementation is located in the drivers/acpi/glue.c file, but there are
> > +complementary parts depending on the bus types in question located
> > +elsewhere. For example, the PCI-specific part of it is located in
> > +drivers/pci/pci-acpi.c.
> > +
> > +
> > +Example Linux ACPI Device Tree
> > +=================================
> > +
> > +The sysfs hierarchy of struct acpi_device objects corresponding to the
> > +example ACPI namespace illustrated in Figure 2 with the addition of
> > +fixed PWR_BUTTON/SLP_BUTTON devices is shown below::
> >
> > +--------------+---+-----------------+
> > | LNXSYSTEM:00 | \ | acpi:LNXSYSTEM: |
> > @@ -377,12 +385,14 @@ Wysocki <rafael.j.wysocki@intel.com>.
> >
> > Figure 3. Example Linux ACPI Device Tree
> >
> > - NOTE: Each node is represented as "object/path/modalias", where:
> > - 1. 'object' is the name of the object's directory in sysfs.
> > - 2. 'path' is the ACPI namespace path of the corresponding
> > - ACPI namespace object, as returned by the object's 'path'
> > - sysfs attribute.
> > - 3. 'modalias' is the value of the object's 'modalias' sysfs
> > - attribute (as described earlier in this document).
> > - NOTE: N/A indicates the device object does not have the 'path' or the
> > - 'modalias' attribute.
> > +.. note:: Each node is represented as "object/path/modalias", where:
> > +
> > + 1. 'object' is the name of the object's directory in sysfs.
> > + 2. 'path' is the ACPI namespace path of the corresponding
> > + ACPI namespace object, as returned by the object's 'path'
> > + sysfs attribute.
> > + 3. 'modalias' is the value of the object's 'modalias' sysfs
> > + attribute (as described earlier in this document).
> > +
> > +.. note:: N/A indicates the device object does not have the 'path' or the
> > + 'modalias' attribute.
>
>
>
> Thanks,
> Mauro
--
Cheers,
Changbin Du
^ permalink raw reply
* Re: [PATCH v4 33/63] Documentation: PCI: convert endpoint/pci-endpoint.txt to reST
From: Mauro Carvalho Chehab @ 2019-04-24 15:55 UTC (permalink / raw)
To: Changbin Du
Cc: fenghua.yu, x86, linux-doc, linux-pci, linux-gpio,
Jonathan Corbet, rjw, linux-kernel, linux-acpi, mingo,
Bjorn Helgaas, tglx, linuxppc-dev
In-Reply-To: <20190423162932.21428-34-changbin.du@gmail.com>
Em Wed, 24 Apr 2019 00:29:02 +0800
Changbin Du <changbin.du@gmail.com> escreveu:
> This converts the plain text documentation to reStructuredText format and
> add it to Sphinx TOC tree. No essential content change.
>
> Signed-off-by: Changbin Du <changbin.du@gmail.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
> Documentation/PCI/endpoint/index.rst | 10 ++
> .../{pci-endpoint.txt => pci-endpoint.rst} | 95 +++++++++++--------
> Documentation/PCI/index.rst | 1 +
> 3 files changed, 68 insertions(+), 38 deletions(-)
> create mode 100644 Documentation/PCI/endpoint/index.rst
> rename Documentation/PCI/endpoint/{pci-endpoint.txt => pci-endpoint.rst} (82%)
>
> diff --git a/Documentation/PCI/endpoint/index.rst b/Documentation/PCI/endpoint/index.rst
> new file mode 100644
> index 000000000000..0db4f2fcd7f0
> --- /dev/null
> +++ b/Documentation/PCI/endpoint/index.rst
> @@ -0,0 +1,10 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +======================
> +PCI Endpoint Framework
> +======================
> +
> +.. toctree::
> + :maxdepth: 2
> +
> + pci-endpoint
> diff --git a/Documentation/PCI/endpoint/pci-endpoint.txt b/Documentation/PCI/endpoint/pci-endpoint.rst
> similarity index 82%
> rename from Documentation/PCI/endpoint/pci-endpoint.txt
> rename to Documentation/PCI/endpoint/pci-endpoint.rst
> index e86a96b66a6a..6674ce5425bf 100644
> --- a/Documentation/PCI/endpoint/pci-endpoint.txt
> +++ b/Documentation/PCI/endpoint/pci-endpoint.rst
> @@ -1,11 +1,17 @@
> - PCI ENDPOINT FRAMEWORK
> - Kishon Vijay Abraham I <kishon@ti.com>
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +======================
> +PCI Endpoint Framework
> +======================
Hmm... considering that you decided to create an index file for the
endpoint, with the same title, I would just remove this from here.
> +
> +:Author: Kishon Vijay Abraham I <kishon@ti.com>
>
> This document is a guide to use the PCI Endpoint Framework in order to create
> endpoint controller driver, endpoint function driver, and using configfs
> interface to bind the function driver to the controller driver.
>
> -1. Introduction
> +Introduction
> +============
>
> Linux has a comprehensive PCI subsystem to support PCI controllers that
> operates in Root Complex mode. The subsystem has capability to scan PCI bus,
> @@ -19,24 +25,27 @@ add endpoint mode support in Linux. This will help to run Linux in an
> EP system which can have a wide variety of use cases from testing or
> validation, co-processor accelerator, etc.
>
> -2. PCI Endpoint Core
> +PCI Endpoint Core
> +=================
>
> The PCI Endpoint Core layer comprises 3 components: the Endpoint Controller
> library, the Endpoint Function library, and the configfs layer to bind the
> endpoint function with the endpoint controller.
>
> -2.1 PCI Endpoint Controller(EPC) Library
> +PCI Endpoint Controller(EPC) Library
> +------------------------------------
>
> The EPC library provides APIs to be used by the controller that can operate
> in endpoint mode. It also provides APIs to be used by function driver/library
> in order to implement a particular endpoint function.
>
> -2.1.1 APIs for the PCI controller Driver
> +APIs for the PCI controller Driver
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> This section lists the APIs that the PCI Endpoint core provides to be used
> by the PCI controller driver.
>
> -*) devm_pci_epc_create()/pci_epc_create()
> +* devm_pci_epc_create()/pci_epc_create()
I would, instead, promote this as a sub-level. E. g. something like:
devm_pci_epc_create()/pci_epc_create()
......................................
(if you do that, you'll need to also promote some similar function
documentation within this doc)
>
> The PCI controller driver should implement the following ops:
> * write_header: ops to populate configuration space header
Better to add a blank line between those two lines. There's no sense
(IMHO) on using bold font to the first line here.
> @@ -51,110 +60,116 @@ by the PCI controller driver.
> The PCI controller driver can then create a new EPC device by invoking
> devm_pci_epc_create()/pci_epc_create().
>
> -*) devm_pci_epc_destroy()/pci_epc_destroy()
> +* devm_pci_epc_destroy()/pci_epc_destroy()
>
> The PCI controller driver can destroy the EPC device created by either
> devm_pci_epc_create() or pci_epc_create() using devm_pci_epc_destroy() or
> pci_epc_destroy().
>
> -*) pci_epc_linkup()
> +* pci_epc_linkup()
>
> In order to notify all the function devices that the EPC device to which
> they are linked has established a link with the host, the PCI controller
> driver should invoke pci_epc_linkup().
>
> -*) pci_epc_mem_init()
> +* pci_epc_mem_init()
>
> Initialize the pci_epc_mem structure used for allocating EPC addr space.
>
> -*) pci_epc_mem_exit()
> +* pci_epc_mem_exit()
>
> Cleanup the pci_epc_mem structure allocated during pci_epc_mem_init().
>
> -2.1.2 APIs for the PCI Endpoint Function Driver
> +
> +APIs for the PCI Endpoint Function Driver
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> This section lists the APIs that the PCI Endpoint core provides to be used
> by the PCI endpoint function driver.
>
> -*) pci_epc_write_header()
> +* pci_epc_write_header()
>
> The PCI endpoint function driver should use pci_epc_write_header() to
> write the standard configuration header to the endpoint controller.
>
> -*) pci_epc_set_bar()
> +* pci_epc_set_bar()
>
> The PCI endpoint function driver should use pci_epc_set_bar() to configure
> the Base Address Register in order for the host to assign PCI addr space.
> Register space of the function driver is usually configured
> using this API.
>
> -*) pci_epc_clear_bar()
> +* pci_epc_clear_bar()
>
> The PCI endpoint function driver should use pci_epc_clear_bar() to reset
> the BAR.
>
> -*) pci_epc_raise_irq()
> +* pci_epc_raise_irq()
>
> The PCI endpoint function driver should use pci_epc_raise_irq() to raise
> Legacy Interrupt, MSI or MSI-X Interrupt.
>
> -*) pci_epc_mem_alloc_addr()
> +* pci_epc_mem_alloc_addr()
>
> The PCI endpoint function driver should use pci_epc_mem_alloc_addr(), to
> allocate memory address from EPC addr space which is required to access
> RC's buffer
>
> -*) pci_epc_mem_free_addr()
> +* pci_epc_mem_free_addr()
>
> The PCI endpoint function driver should use pci_epc_mem_free_addr() to
> free the memory space allocated using pci_epc_mem_alloc_addr().
>
> -2.1.3 Other APIs
> +Other APIs
> +~~~~~~~~~~
>
> There are other APIs provided by the EPC library. These are used for binding
> the EPF device with EPC device. pci-ep-cfs.c can be used as reference for
> using these APIs.
>
> -*) pci_epc_get()
> +* pci_epc_get()
>
> Get a reference to the PCI endpoint controller based on the device name of
> the controller.
>
> -*) pci_epc_put()
> +* pci_epc_put()
>
> Release the reference to the PCI endpoint controller obtained using
> pci_epc_get()
>
> -*) pci_epc_add_epf()
> +* pci_epc_add_epf()
>
> Add a PCI endpoint function to a PCI endpoint controller. A PCIe device
> can have up to 8 functions according to the specification.
>
> -*) pci_epc_remove_epf()
> +* pci_epc_remove_epf()
>
> Remove the PCI endpoint function from PCI endpoint controller.
>
> -*) pci_epc_start()
> +* pci_epc_start()
>
> The PCI endpoint function driver should invoke pci_epc_start() once it
> has configured the endpoint function and wants to start the PCI link.
>
> -*) pci_epc_stop()
> +* pci_epc_stop()
>
> The PCI endpoint function driver should invoke pci_epc_stop() to stop
> the PCI LINK.
>
> -2.2 PCI Endpoint Function(EPF) Library
> +
> +PCI Endpoint Function(EPF) Library
> +----------------------------------
>
> The EPF library provides APIs to be used by the function driver and the EPC
> library to provide endpoint mode functionality.
>
> -2.2.1 APIs for the PCI Endpoint Function Driver
> +APIs for the PCI Endpoint Function Driver
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> This section lists the APIs that the PCI Endpoint core provides to be used
> by the PCI endpoint function driver.
>
> -*) pci_epf_register_driver()
> +* pci_epf_register_driver()
>
> The PCI Endpoint Function driver should implement the following ops:
> * bind: ops to perform when a EPC device has been bound to EPF device
> @@ -166,50 +181,54 @@ by the PCI endpoint function driver.
> The PCI Function driver can then register the PCI EPF driver by using
> pci_epf_register_driver().
>
> -*) pci_epf_unregister_driver()
> +* pci_epf_unregister_driver()
>
> The PCI Function driver can unregister the PCI EPF driver by using
> pci_epf_unregister_driver().
>
> -*) pci_epf_alloc_space()
> +* pci_epf_alloc_space()
>
> The PCI Function driver can allocate space for a particular BAR using
> pci_epf_alloc_space().
>
> -*) pci_epf_free_space()
> +* pci_epf_free_space()
>
> The PCI Function driver can free the allocated space
> (using pci_epf_alloc_space) by invoking pci_epf_free_space().
>
> -2.2.2 APIs for the PCI Endpoint Controller Library
> +APIs for the PCI Endpoint Controller Library
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> This section lists the APIs that the PCI Endpoint core provides to be used
> by the PCI endpoint controller library.
>
> -*) pci_epf_linkup()
> +* pci_epf_linkup()
>
> The PCI endpoint controller library invokes pci_epf_linkup() when the
> EPC device has established the connection to the host.
>
> -2.2.2 Other APIs
> +Other APIs
> +~~~~~~~~~~
> +
> There are other APIs provided by the EPF library. These are used to notify
> the function driver when the EPF device is bound to the EPC device.
> pci-ep-cfs.c can be used as reference for using these APIs.
>
> -*) pci_epf_create()
> +* pci_epf_create()
>
> Create a new PCI EPF device by passing the name of the PCI EPF device.
> This name will be used to bind the the EPF device to a EPF driver.
>
> -*) pci_epf_destroy()
> +* pci_epf_destroy()
>
> Destroy the created PCI EPF device.
>
> -*) pci_epf_bind()
> +* pci_epf_bind()
>
> pci_epf_bind() should be invoked when the EPF device has been bound to
> a EPC device.
>
> -*) pci_epf_unbind()
> +* pci_epf_unbind()
>
> pci_epf_unbind() should be invoked when the binding between EPC device
> and EPF device is lost.
> diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
> index 86c76c22810b..c8ea2e626c20 100644
> --- a/Documentation/PCI/index.rst
> +++ b/Documentation/PCI/index.rst
> @@ -15,3 +15,4 @@ Linux PCI Bus Subsystem
> acpi-info
> pci-error-recovery
> pcieaer-howto
> + endpoint/index
Thanks,
Mauro
^ permalink raw reply
* Re: [PATCH v4 32/63] Documentation: PCI: convert pcieaer-howto.txt to reST
From: Mauro Carvalho Chehab @ 2019-04-24 15:49 UTC (permalink / raw)
To: Changbin Du
Cc: fenghua.yu, x86, linux-doc, linux-pci, linux-gpio,
Jonathan Corbet, rjw, linux-kernel, linux-acpi, mingo,
Bjorn Helgaas, tglx, linuxppc-dev
In-Reply-To: <20190423162932.21428-33-changbin.du@gmail.com>
Em Wed, 24 Apr 2019 00:29:01 +0800
Changbin Du <changbin.du@gmail.com> escreveu:
> This converts the plain text documentation to reStructuredText format and
> add it to Sphinx TOC tree. No essential content change.
>
> Signed-off-by: Changbin Du <changbin.du@gmail.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
> Documentation/PCI/index.rst | 1 +
> .../{pcieaer-howto.txt => pcieaer-howto.rst} | 110 ++++++++++++------
> 2 files changed, 74 insertions(+), 37 deletions(-)
> rename Documentation/PCI/{pcieaer-howto.txt => pcieaer-howto.rst} (81%)
>
> diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
> index 5ee4dba07116..86c76c22810b 100644
> --- a/Documentation/PCI/index.rst
> +++ b/Documentation/PCI/index.rst
> @@ -14,3 +14,4 @@ Linux PCI Bus Subsystem
> MSI-HOWTO
> acpi-info
> pci-error-recovery
> + pcieaer-howto
> diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.rst
> similarity index 81%
> rename from Documentation/PCI/pcieaer-howto.txt
> rename to Documentation/PCI/pcieaer-howto.rst
> index 48ce7903e3c6..67f77ff76865 100644
> --- a/Documentation/PCI/pcieaer-howto.txt
> +++ b/Documentation/PCI/pcieaer-howto.rst
> @@ -1,21 +1,29 @@
> - The PCI Express Advanced Error Reporting Driver Guide HOWTO
> - T. Long Nguyen <tom.l.nguyen@intel.com>
> - Yanmin Zhang <yanmin.zhang@intel.com>
> - 07/29/2006
> +.. SPDX-License-Identifier: GPL-2.0
> +.. include:: <isonum.txt>
>
> +===========================================================
> +The PCI Express Advanced Error Reporting Driver Guide HOWTO
> +===========================================================
>
> -1. Overview
> +:Authors: - T. Long Nguyen <tom.l.nguyen@intel.com>
> + - Yanmin Zhang <yanmin.zhang@intel.com>
>
> -1.1 About this guide
> +:Copyright: |copy| 2006 Intel Corporation
> +
> +Overview
> +===========
> +
> +About this guide
> +----------------
>
> This guide describes the basics of the PCI Express Advanced Error
> Reporting (AER) driver and provides information on how to use it, as
> well as how to enable the drivers of endpoint devices to conform with
> PCI Express AER driver.
>
> -1.2 Copyright (C) Intel Corporation 2006.
>
> -1.3 What is the PCI Express AER Driver?
> +What is the PCI Express AER Driver?
> +-----------------------------------
>
> PCI Express error signaling can occur on the PCI Express link itself
> or on behalf of transactions initiated on the link. PCI Express
> @@ -30,17 +38,19 @@ The PCI Express AER driver provides the infrastructure to support PCI
> Express Advanced Error Reporting capability. The PCI Express AER
> driver provides three basic functions:
>
> -- Gathers the comprehensive error information if errors occurred.
> -- Reports error to the users.
> -- Performs error recovery actions.
> + - Gathers the comprehensive error information if errors occurred.
> + - Reports error to the users.
> + - Performs error recovery actions.
>
> AER driver only attaches root ports which support PCI-Express AER
> capability.
>
>
> -2. User Guide
> +User Guide
> +==========
>
> -2.1 Include the PCI Express AER Root Driver into the Linux Kernel
> +Include the PCI Express AER Root Driver into the Linux Kernel
> +-------------------------------------------------------------
>
> The PCI Express AER Root driver is a Root Port service driver attached
> to the PCI Express Port Bus driver. If a user wants to use it, the driver
> @@ -48,7 +58,8 @@ has to be compiled. Option CONFIG_PCIEAER supports this capability. It
> depends on CONFIG_PCIEPORTBUS, so pls. set CONFIG_PCIEPORTBUS=y and
> CONFIG_PCIEAER = y.
>
> -2.2 Load PCI Express AER Root Driver
> +Load PCI Express AER Root Driver
> +--------------------------------
>
> Some systems have AER support in firmware. Enabling Linux AER support at
> the same time the firmware handles AER may result in unpredictable
> @@ -56,30 +67,34 @@ behavior. Therefore, Linux does not handle AER events unless the firmware
> grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0
> Specification for details regarding _OSC usage.
>
> -2.3 AER error output
> +AER error output
> +----------------
>
> When a PCIe AER error is captured, an error message will be output to
> console. If it's a correctable error, it is output as a warning.
> Otherwise, it is printed as an error. So users could choose different
> log level to filter out correctable error messages.
>
> -Below shows an example:
> -0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
> -0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
> -0000:50:00.0: [20] Unsupported Request (First)
> -0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100
> +Below shows an example::
> +
> + 0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
> + 0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
> + 0000:50:00.0: [20] Unsupported Request (First)
> + 0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100
>
> In the example, 'Requester ID' means the ID of the device who sends
> the error message to root port. Pls. refer to pci express specs for
> other fields.
>
> -2.4 AER Statistics / Counters
> +AER Statistics / Counters
> +-------------------------
>
> When PCIe AER errors are captured, the counters / statistics are also exposed
> in the form of sysfs attributes which are documented at
> Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>
> -3. Developer Guide
> +Developer Guide
> +===============
>
> To enable AER aware support requires a software driver to configure
> the AER capability structure within its device and to provide callbacks.
> @@ -120,7 +135,8 @@ hierarchy and links. These errors do not include any device specific
> errors because device specific errors will still get sent directly to
> the device driver.
>
> -3.1 Configure the AER capability structure
> +Configure the AER capability structure
> +--------------------------------------
>
> AER aware drivers of PCI Express component need change the device
> control registers to enable AER. They also could change AER registers,
> @@ -128,9 +144,11 @@ including mask and severity registers. Helper function
> pci_enable_pcie_error_reporting could be used to enable AER. See
> section 3.3.
>
> -3.2. Provide callbacks
> +Provide callbacks
> +-----------------
>
> -3.2.1 callback reset_link to reset pci express link
> +callback reset_link to reset pci express link
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> This callback is used to reset the pci express physical link when a
> fatal error happens. The root port aer service driver provides a
> @@ -140,13 +158,15 @@ upstream ports should provide their own reset_link functions.
>
> In struct pcie_port_service_driver, a new pointer, reset_link, is
> added.
> +::
>
> -pci_ers_result_t (*reset_link) (struct pci_dev *dev);
> + pci_ers_result_t (*reset_link) (struct pci_dev *dev);
>
> Section 3.2.2.2 provides more detailed info on when to call
> reset_link.
>
> -3.2.2 PCI error-recovery callbacks
> +PCI error-recovery callbacks
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> The PCI Express AER Root driver uses error callbacks to coordinate
> with downstream device drivers associated with a hierarchy in question
> @@ -161,7 +181,8 @@ definitions of the callbacks.
>
> Below sections specify when to call the error callback functions.
>
> -3.2.2.1 Correctable errors
> +Correctable errors
> +~~~~~~~~~~~~~~~~~~
>
> Correctable errors pose no impacts on the functionality of
> the interface. The PCI Express protocol can recover without any
> @@ -169,13 +190,16 @@ software intervention or any loss of data. These errors do not
> require any recovery actions. The AER driver clears the device's
> correctable error status register accordingly and logs these errors.
>
> -3.2.2.2 Non-correctable (non-fatal and fatal) errors
> +Non-correctable (non-fatal and fatal) errors
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> If an error message indicates a non-fatal error, performing link reset
> at upstream is not required. The AER driver calls error_detected(dev,
> pci_channel_io_normal) to all drivers associated within a hierarchy in
> -question. for example,
> -EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort.
> +question. for example::
> +
> + EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort
> +
> If Upstream port A captures an AER error, the hierarchy consists of
> Downstream port B and EndPoint.
>
> @@ -199,23 +223,33 @@ function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and
> reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes
> to mmio_enabled.
>
> -3.3 helper functions
> +helper functions
> +----------------
> +::
> +
> + int pci_enable_pcie_error_reporting(struct pci_dev *dev);
>
> -3.3.1 int pci_enable_pcie_error_reporting(struct pci_dev *dev);
> pci_enable_pcie_error_reporting enables the device to send error
> messages to root port when an error is detected. Note that devices
> don't enable the error reporting by default, so device drivers need
> call this function to enable it.
>
> -3.3.2 int pci_disable_pcie_error_reporting(struct pci_dev *dev);
> +::
> +
> + int pci_disable_pcie_error_reporting(struct pci_dev *dev);
> +
> pci_disable_pcie_error_reporting disables the device to send error
> messages to root port when an error is detected.
>
> -3.3.3 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
> +::
> +
> + int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);`
> +
> pci_cleanup_aer_uncorrect_error_status cleanups the uncorrectable
> error status register.
>
> -3.4 Frequent Asked Questions
> +Frequent Asked Questions
> +------------------------
>
> Q: What happens if a PCI Express device driver does not provide an
> error recovery handler (pci_driver->err_handler is equal to NULL)?
I strongly suspect that you also need to touch the Q/A, as otherwise
you'll have either bad output and/or sphinx warnings. Something like:
Q:
What happens if a PCI Express device driver does not provide an
error recovery handler (pci_driver->err_handler is equal to NULL)?
A:
The devices attached with the driver won't be recovered. If the
error is fatal, kernel will print out warning messages. Please refer
to section 3 for more information.
Q:
What happens if an upstream port service driver does not provide
callback reset_link?
A:
Fatal error recovery will fail if the errors are reported by the
upstream ports who are attached by the service driver.
Q:
How does this infrastructure deal with driver that is not PCI
Express aware?
A:
This infrastructure calls the error callback functions of the
driver when an error happens. But if the driver is not aware of
PCI Express, the device might not report its own errors to root
port.
Q:
What modifications will that driver need to make it compatible
with the PCI Express AER Root driver?
A:
It could call the helper functions to enable AER in devices and
cleanup uncorrectable status register. Pls. refer to section 3.3.
> @@ -245,7 +279,8 @@ A: It could call the helper functions to enable AER in devices and
> cleanup uncorrectable status register. Pls. refer to section 3.3.
>
>
> -4. Software error injection
> +Software error injection
> +========================
>
> Debugging PCIe AER error recovery code is quite difficult because it
> is hard to trigger real hardware errors. Software based error
> @@ -261,6 +296,7 @@ After reboot with new kernel or insert the module, a device file named
>
> Then, you need a user space tool named aer-inject, which can be gotten
> from:
> +
> https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/
>
> More information about aer-inject can be found in the document comes
Thanks,
Mauro
^ permalink raw reply
* Re: [PATCH v4 00/63] Include linux ACPI/PCI/X86 docs into Sphinx TOC tree
From: Changbin Du @ 2019-04-24 15:46 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Fenghua Yu, the arch/x86 maintainers, Mauro Carvalho Chehab,
Jonathan Corbet, Linux PCI, linux-gpio, open list:DOCUMENTATION,
Rafael J. Wysocki, Linux Kernel Mailing List,
ACPI Devel Maling List, Ingo Molnar, Rafael J. Wysocki,
Thomas Gleixner, linuxppc-dev, Changbin Du
In-Reply-To: <20190423173644.GA14616@google.com>
On Tue, Apr 23, 2019 at 12:36:44PM -0500, Bjorn Helgaas wrote:
> On Tue, Apr 23, 2019 at 06:39:47PM +0200, Rafael J. Wysocki wrote:
> > On Tue, Apr 23, 2019 at 6:30 PM Changbin Du <changbin.du@gmail.com> wrote:
> > > Hi Corbet and All,
> > > The kernel now uses Sphinx to generate intelligent and beautiful
> > > documentation from reStructuredText files. I converted all of the Linux
> > > ACPI/PCI/X86 docs to reST format in this serias.
> > >
> > > In this version I combined ACPI and PCI docs, and added new x86 docs
> > > conversion.
> >
> > I'm not sure if combining all three into one big patch series has been
> > a good idea, honestly.
>
> Yeah, if you post this again, I would find it easier to deal with if
> linux-pci only got the PCI-related things. 63 patches is a little too
> much for one series.
>
sure, so I will resend them respectively.
> Bjorn
--
Cheers,
Changbin Du
^ permalink raw reply
* Re: [PATCH v4 31/63] Documentation: PCI: convert pci-error-recovery.txt to reST
From: Mauro Carvalho Chehab @ 2019-04-24 15:45 UTC (permalink / raw)
To: Changbin Du
Cc: fenghua.yu, x86, linux-doc, linux-pci, linux-gpio,
Jonathan Corbet, rjw, linux-kernel, linux-acpi, mingo,
Bjorn Helgaas, tglx, linuxppc-dev
In-Reply-To: <20190423162932.21428-32-changbin.du@gmail.com>
Em Wed, 24 Apr 2019 00:29:00 +0800
Changbin Du <changbin.du@gmail.com> escreveu:
> This converts the plain text documentation to reStructuredText format and
> add it to Sphinx TOC tree. No essential content change.
>
> Signed-off-by: Changbin Du <changbin.du@gmail.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
> Documentation/PCI/index.rst | 1 +
> ...or-recovery.txt => pci-error-recovery.rst} | 178 +++++++++---------
> MAINTAINERS | 2 +-
> 3 files changed, 94 insertions(+), 87 deletions(-)
> rename Documentation/PCI/{pci-error-recovery.txt => pci-error-recovery.rst} (80%)
>
> diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
> index c877a369481d..5ee4dba07116 100644
> --- a/Documentation/PCI/index.rst
> +++ b/Documentation/PCI/index.rst
> @@ -13,3 +13,4 @@ Linux PCI Bus Subsystem
> pci-iov-howto
> MSI-HOWTO
> acpi-info
> + pci-error-recovery
> diff --git a/Documentation/PCI/pci-error-recovery.txt b/Documentation/PCI/pci-error-recovery.rst
> similarity index 80%
> rename from Documentation/PCI/pci-error-recovery.txt
> rename to Documentation/PCI/pci-error-recovery.rst
> index 0b6bb3ef449e..533ec4035bf5 100644
> --- a/Documentation/PCI/pci-error-recovery.txt
> +++ b/Documentation/PCI/pci-error-recovery.rst
> @@ -1,12 +1,13 @@
> +.. SPDX-License-Identifier: GPL-2.0
>
> - PCI Error Recovery
> - ------------------
> - February 2, 2006
> +==================
> +PCI Error Recovery
> +==================
>
> - Current document maintainer:
> - Linas Vepstas <linasvepstas@gmail.com>
> - updated by Richard Lary <rlary@us.ibm.com>
> - and Mike Mason <mmlnx@us.ibm.com> on 27-Jul-2009
Just wondering: wouldn't be good to preserve the date here?
> +
> +:Authors: - Linas Vepstas <linasvepstas@gmail.com>
> + - Richard Lary <rlary@us.ibm.com>
> + - Mike Mason <mmlnx@us.ibm.com>
>
>
> Many PCI bus controllers are able to detect a variety of hardware
> @@ -63,7 +64,8 @@ mechanisms for dealing with SCSI bus errors and SCSI bus resets.
>
>
> Detailed Design
> ----------------
> +===============
> +
> Design and implementation details below, based on a chain of
> public email discussions with Ben Herrenschmidt, circa 5 April 2005.
>
> @@ -73,30 +75,33 @@ pci_driver. A driver that fails to provide the structure is "non-aware",
> and the actual recovery steps taken are platform dependent. The
> arch/powerpc implementation will simulate a PCI hotplug remove/add.
>
> -This structure has the form:
> -struct pci_error_handlers
> -{
> - int (*error_detected)(struct pci_dev *dev, enum pci_channel_state);
> - int (*mmio_enabled)(struct pci_dev *dev);
> - int (*slot_reset)(struct pci_dev *dev);
> - void (*resume)(struct pci_dev *dev);
> -};
> -
> -The possible channel states are:
> -enum pci_channel_state {
> - pci_channel_io_normal, /* I/O channel is in normal state */
> - pci_channel_io_frozen, /* I/O to channel is blocked */
> - pci_channel_io_perm_failure, /* PCI card is dead */
> -};
> -
> -Possible return values are:
> -enum pci_ers_result {
> - PCI_ERS_RESULT_NONE, /* no result/none/not supported in device driver */
> - PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */
> - PCI_ERS_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */
> - PCI_ERS_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */
> - PCI_ERS_RESULT_RECOVERED, /* Device driver is fully recovered and operational */
> -};
> +This structure has the form::
> +
> + struct pci_error_handlers
> + {
> + int (*error_detected)(struct pci_dev *dev, enum pci_channel_state);
> + int (*mmio_enabled)(struct pci_dev *dev);
> + int (*slot_reset)(struct pci_dev *dev);
> + void (*resume)(struct pci_dev *dev);
> + };
> +
> +The possible channel states are::
> +
> + enum pci_channel_state {
> + pci_channel_io_normal, /* I/O channel is in normal state */
> + pci_channel_io_frozen, /* I/O to channel is blocked */
> + pci_channel_io_perm_failure, /* PCI card is dead */
> + };
> +
> +Possible return values are::
> +
> + enum pci_ers_result {
> + PCI_ERS_RESULT_NONE, /* no result/none/not supported in device driver */
> + PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */
> + PCI_ERS_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */
> + PCI_ERS_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */
> + PCI_ERS_RESULT_RECOVERED, /* Device driver is fully recovered and operational */
> + };
>
> A driver does not have to implement all of these callbacks; however,
> if it implements any, it must implement error_detected(). If a callback
> @@ -134,16 +139,17 @@ shouldn't do any new IOs. Called in task context. This is sort of a
>
> All drivers participating in this system must implement this call.
> The driver must return one of the following result codes:
> - - PCI_ERS_RESULT_CAN_RECOVER:
> - Driver returns this if it thinks it might be able to recover
> - the HW by just banging IOs or if it wants to be given
> - a chance to extract some diagnostic information (see
> - mmio_enable, below).
> - - PCI_ERS_RESULT_NEED_RESET:
> - Driver returns this if it can't recover without a
> - slot reset.
> - - PCI_ERS_RESULT_DISCONNECT:
> - Driver returns this if it doesn't want to recover at all.
> +
> + - PCI_ERS_RESULT_CAN_RECOVER:
> + Driver returns this if it thinks it might be able to recover
> + the HW by just banging IOs or if it wants to be given
> + a chance to extract some diagnostic information (see
> + mmio_enable, below).
> + - PCI_ERS_RESULT_NEED_RESET:
> + Driver returns this if it can't recover without a
> + slot reset.
> + - PCI_ERS_RESULT_DISCONNECT:
> + Driver returns this if it doesn't want to recover at all.
This would look better on both text and html if you format it as:
- PCI_ERS_RESULT_CAN_RECOVER:
Driver returns this if it thinks it might be able to recover
the HW by just banging IOs or if it wants to be given
a chance to extract some diagnostic information (see
mmio_enable, below).
- PCI_ERS_RESULT_NEED_RESET:
Driver returns this if it can't recover without a
slot reset.
- PCI_ERS_RESULT_DISCONNECT:
Driver returns this if it doesn't want to recover at all.
>
> The next step taken will depend on the result codes returned by the
> drivers.
> @@ -177,7 +183,7 @@ is STEP 6 (Permanent Failure).
> >>> get the device working again.
>
> STEP 2: MMIO Enabled
> --------------------
> +--------------------
> The platform re-enables MMIO to the device (but typically not the
> DMA), and then calls the mmio_enabled() callback on all affected
> device drivers.
> @@ -203,23 +209,23 @@ instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
> >>> into one of the next states, that is, link reset or slot reset.
It sounds you forgot to reformat the note with ">>>", e. g.
something like:
.. note::
The current powerpc implementation assumes that a device driver will
*not* schedule or semaphore in this routine; the current powerpc
implementation uses one kernel thread to notify all devices;
thus, if one device sleeps/schedules, all devices are affected.
Doing better requires complex multi-threaded logic in the error
recovery implementation (e.g. waiting for all notification threads
to "join" before proceeding with recovery.) This seems excessively
complex and not worth implementing.
The current powerpc implementation doesn't much care if the device
attempts I/O at this point, or not. I/O's will fail, returning
a value of 0xff on read, and writes will be dropped. If more than
EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH
assumes that the device driver has gone into an infinite loop
and prints an error to syslog. A reboot is then required to
get the device working again.
>
> The driver should return one of the following result codes:
> - - PCI_ERS_RESULT_RECOVERED
> - Driver returns this if it thinks the device is fully
> - functional and thinks it is ready to start
> - normal driver operations again. There is no
> - guarantee that the driver will actually be
> - allowed to proceed, as another driver on the
> - same segment might have failed and thus triggered a
> - slot reset on platforms that support it.
> -
> - - PCI_ERS_RESULT_NEED_RESET
> - Driver returns this if it thinks the device is not
> - recoverable in its current state and it needs a slot
> - reset to proceed.
> -
> - - PCI_ERS_RESULT_DISCONNECT
> - Same as above. Total failure, no recovery even after
> - reset driver dead. (To be defined more precisely)
> + - PCI_ERS_RESULT_RECOVERED
> + Driver returns this if it thinks the device is fully
> + functional and thinks it is ready to start
> + normal driver operations again. There is no
> + guarantee that the driver will actually be
> + allowed to proceed, as another driver on the
> + same segment might have failed and thus triggered a
> + slot reset on platforms that support it.
> +
> + - PCI_ERS_RESULT_NEED_RESET
> + Driver returns this if it thinks the device is not
> + recoverable in its current state and it needs a slot
> + reset to proceed.
> +
> + - PCI_ERS_RESULT_DISCONNECT
> + Same as above. Total failure, no recovery even after
> + reset driver dead. (To be defined more precisely)
Same as above, this would look a way better if you format it as:
- PCI_ERS_RESULT_RECOVERED
Driver returns this if it thinks the device is fully
functional and thinks it is ready to start
normal driver operations again. There is no
guarantee that the driver will actually be
allowed to proceed, as another driver on the
same segment might have failed and thus triggered a
slot reset on platforms that support it.
- PCI_ERS_RESULT_NEED_RESET
Driver returns this if it thinks the device is not
recoverable in its current state and it needs a slot
reset to proceed.
- PCI_ERS_RESULT_DISCONNECT
Same as above. Total failure, no recovery even after
reset driver dead. (To be defined more precisely)
> The next step taken depends on the results returned by the drivers.
> If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
> @@ -293,24 +299,24 @@ device will be considered "dead" in this case.
> Drivers for multi-function cards will need to coordinate among
> themselves as to which driver instance will perform any "one-shot"
> or global device initialization. For example, the Symbios sym53cxx2
> -driver performs device init only from PCI function 0:
> +driver performs device init only from PCI function 0::
>
> -+ if (PCI_FUNC(pdev->devfn) == 0)
> -+ sym_reset_scsi_bus(np, 0);
> + + if (PCI_FUNC(pdev->devfn) == 0)
> + + sym_reset_scsi_bus(np, 0);
>
> - Result codes:
> - - PCI_ERS_RESULT_DISCONNECT
> - Same as above.
> +Result codes:
> + - PCI_ERS_RESULT_DISCONNECT
> + Same as above.
>
> Drivers for PCI Express cards that require a fundamental reset must
> set the needs_freset bit in the pci_dev structure in their probe function.
> For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
> -PCI card types:
> +PCI card types::
>
> -+ /* Set EEH reset type to fundamental if required by hba */
> -+ if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha))
> -+ pdev->needs_freset = 1;
> -+
> + + /* Set EEH reset type to fundamental if required by hba */
> + + if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha))
> + + pdev->needs_freset = 1;
> + +
>
> Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent
> Failure).
Again, you forgot to convert one note there, just after the above
lines:
.. note::
The current powerpc implementation does not try a power-cycle
reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
However, it probably should.
> @@ -370,23 +376,23 @@ The current policy is to turn this into a platform policy.
> That is, the recovery API only requires that:
>
> - There is no guarantee that interrupt delivery can proceed from any
> -device on the segment starting from the error detection and until the
> -slot_reset callback is called, at which point interrupts are expected
> -to be fully operational.
> + device on the segment starting from the error detection and until the
> + slot_reset callback is called, at which point interrupts are expected
> + to be fully operational.
>
> - There is no guarantee that interrupt delivery is stopped, that is,
> -a driver that gets an interrupt after detecting an error, or that detects
> -an error within the interrupt handler such that it prevents proper
> -ack'ing of the interrupt (and thus removal of the source) should just
> -return IRQ_NOTHANDLED. It's up to the platform to deal with that
> -condition, typically by masking the IRQ source during the duration of
> -the error handling. It is expected that the platform "knows" which
> -interrupts are routed to error-management capable slots and can deal
> -with temporarily disabling that IRQ number during error processing (this
> -isn't terribly complex). That means some IRQ latency for other devices
> -sharing the interrupt, but there is simply no other way. High end
> -platforms aren't supposed to share interrupts between many devices
> -anyway :)
> + a driver that gets an interrupt after detecting an error, or that detects
> + an error within the interrupt handler such that it prevents proper
> + ack'ing of the interrupt (and thus removal of the source) should just
> + return IRQ_NOTHANDLED. It's up to the platform to deal with that
> + condition, typically by masking the IRQ source during the duration of
> + the error handling. It is expected that the platform "knows" which
> + interrupts are routed to error-management capable slots and can deal
> + with temporarily disabling that IRQ number during error processing (this
> + isn't terribly complex). That means some IRQ latency for other devices
> + sharing the interrupt, but there is simply no other way. High end
> + platforms aren't supposed to share interrupts between many devices
> + anyway :)
>
> >>> Implementation details for the powerpc platform are discussed in
> >>> the file Documentation/powerpc/eeh-pci-error-recovery.txt
Another note to be converted:
.. note::
Implementation details for the powerpc platform are discussed in
the file Documentation/powerpc/eeh-pci-error-recovery.txt
As of this writing, there is a growing list of device drivers with
patches implementing error recovery. Not all of these patches are in
mainline yet. These may be used as "examples"::
drivers/scsi/ipr
drivers/scsi/sym53c8xx_2
...
drivers/net/qlge
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 87f930bf32ad..403178958b05 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -11965,7 +11965,7 @@ M: Sam Bobroff <sbobroff@linux.ibm.com>
> M: Oliver O'Halloran <oohall@gmail.com>
> L: linuxppc-dev@lists.ozlabs.org
> S: Supported
> -F: Documentation/PCI/pci-error-recovery.txt
> +F: Documentation/PCI/pci-error-recovery.rst
> F: drivers/pci/pcie/aer.c
> F: drivers/pci/pcie/dpc.c
> F: drivers/pci/pcie/err.c
Thanks,
Mauro
^ permalink raw reply
* [RFC PATCH for 5.2 09/10] rseq/selftests: powerpc code signature: generate valid instructions
From: Mathieu Desnoyers @ 2019-04-24 15:25 UTC (permalink / raw)
To: Peter Zijlstra, Paul E . McKenney, Boqun Feng
Cc: Joel Fernandes, Catalin Marinas, Dave Watson, Will Deacon,
Andi Kleen, Paul Mackerras, H . Peter Anvin, Chris Lameter,
Shuah Khan, Russell King, Ingo Molnar, Michael Kerrisk,
Paul Turner, Alan Modra, Josh Triplett, Steven Rostedt,
Ben Maurer, Mathieu Desnoyers, Thomas Gleixner, linux-api,
linuxppc-dev, linux-kernel, Andy Lutomirski, Andrew Morton,
Linus Torvalds
In-Reply-To: <20190424152502.14246-1-mathieu.desnoyers@efficios.com>
Use "twui" as the guard instruction for the restartable sequence abort
handler.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Alan Modra <amodra@gmail.com>
CC: linuxppc-dev@lists.ozlabs.org
---
tools/testing/selftests/rseq/rseq-ppc.h | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/rseq/rseq-ppc.h b/tools/testing/selftests/rseq/rseq-ppc.h
index 9df18487fa9f..76be90196fe4 100644
--- a/tools/testing/selftests/rseq/rseq-ppc.h
+++ b/tools/testing/selftests/rseq/rseq-ppc.h
@@ -6,7 +6,15 @@
* (C) Copyright 2016-2018 - Boqun Feng <boqun.feng@gmail.com>
*/
-#define RSEQ_SIG 0x53053053
+/*
+ * RSEQ_SIG is used with the following trap instruction:
+ *
+ * powerpc-be: 0f e5 00 0b twui r5,11
+ * powerpc64-le: 0b 00 e5 0f twui r5,11
+ * powerpc64-be: 0f e5 00 0b twui r5,11
+ */
+
+#define RSEQ_SIG 0x0fe5000b
#define rseq_smp_mb() __asm__ __volatile__ ("sync" ::: "memory", "cc")
#define rseq_smp_lwsync() __asm__ __volatile__ ("lwsync" ::: "memory", "cc")
--
2.11.0
^ permalink raw reply related
* Re: [PATCH v4 30/63] Documentation: PCI: convert acpi-info.txt to reST
From: Mauro Carvalho Chehab @ 2019-04-24 15:34 UTC (permalink / raw)
To: Changbin Du
Cc: fenghua.yu, x86, linux-doc, linux-pci, linux-gpio,
Jonathan Corbet, rjw, linux-kernel, linux-acpi, mingo,
Bjorn Helgaas, tglx, linuxppc-dev
In-Reply-To: <20190423162932.21428-31-changbin.du@gmail.com>
Em Wed, 24 Apr 2019 00:28:59 +0800
Changbin Du <changbin.du@gmail.com> escreveu:
> This converts the plain text documentation to reStructuredText format and
> add it to Sphinx TOC tree. No essential content change.
>
> Signed-off-by: Changbin Du <changbin.du@gmail.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
> Documentation/PCI/{acpi-info.txt => acpi-info.rst} | 11 ++++++++---
> Documentation/PCI/index.rst | 1 +
> 2 files changed, 9 insertions(+), 3 deletions(-)
> rename Documentation/PCI/{acpi-info.txt => acpi-info.rst} (97%)
>
> diff --git a/Documentation/PCI/acpi-info.txt b/Documentation/PCI/acpi-info.rst
> similarity index 97%
> rename from Documentation/PCI/acpi-info.txt
> rename to Documentation/PCI/acpi-info.rst
> index 3ffa3b03970e..f7dabb7ca255 100644
> --- a/Documentation/PCI/acpi-info.txt
> +++ b/Documentation/PCI/acpi-info.rst
> @@ -1,4 +1,8 @@
> - ACPI considerations for PCI host bridges
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +========================================
> +ACPI considerations for PCI host bridges
> +========================================
>
> The general rule is that the ACPI namespace should describe everything the
> OS might use unless there's another way for the OS to find it [1, 2].
> @@ -135,8 +139,9 @@ address always corresponds to bus 0, even if the bus range below the bridge
>
> Extended Address Space Descriptor (.4)
> General Flags: Bit [0] Consumer/Producer:
> - 1–This device consumes this resource
> - 0–This device produces and consumes this resource
> +
> + * 1 – This device consumes this resource
> + * 0 – This device produces and consumes this resource
Hmm.. I think that you would need to add some extra blank lines before
the above, e. g., something like:
[4] ACPI 6.2, sec 6.4.3.5.1, 2, 3, 4:
QWord/DWord/Word Address Space Descriptor (.1, .2, .3)
General Flags: Bit [0] Ignored
Extended Address Space Descriptor (.4)
General Flags: Bit [0] Consumer/Producer:
* 1 – This device consumes this resource
* 0 – This device produces and consumes this resource
>
> [5] ACPI 6.2, sec 19.6.43:
> ResourceUsage specifies whether the Memory range is consumed by
> diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
> index 1b25bcc1edca..c877a369481d 100644
> --- a/Documentation/PCI/index.rst
> +++ b/Documentation/PCI/index.rst
> @@ -12,3 +12,4 @@ Linux PCI Bus Subsystem
> PCIEBUS-HOWTO
> pci-iov-howto
> MSI-HOWTO
> + acpi-info
Thanks,
Mauro
^ permalink raw reply
* Re: [PATCH v4 29/63] Documentation: PCI: convert MSI-HOWTO.txt to reST
From: Mauro Carvalho Chehab @ 2019-04-24 15:29 UTC (permalink / raw)
To: Changbin Du
Cc: fenghua.yu, x86, linux-doc, linux-pci, linux-gpio,
Jonathan Corbet, rjw, linux-kernel, linux-acpi, mingo,
Bjorn Helgaas, tglx, linuxppc-dev
In-Reply-To: <20190423162932.21428-30-changbin.du@gmail.com>
Em Wed, 24 Apr 2019 00:28:58 +0800
Changbin Du <changbin.du@gmail.com> escreveu:
> This converts the plain text documentation to reStructuredText format and
> add it to Sphinx TOC tree. No essential content change.
>
> Signed-off-by: Changbin Du <changbin.du@gmail.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
>
> ---
> v2:
> o drop numbering.
> o simplify author list
> ---
> .../PCI/{MSI-HOWTO.txt => MSI-HOWTO.rst} | 83 +++++++++++--------
> Documentation/PCI/index.rst | 1 +
> 2 files changed, 50 insertions(+), 34 deletions(-)
> rename Documentation/PCI/{MSI-HOWTO.txt => MSI-HOWTO.rst} (88%)
Renamed names in lowercase, please.
>
> diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.rst
> similarity index 88%
> rename from Documentation/PCI/MSI-HOWTO.txt
> rename to Documentation/PCI/MSI-HOWTO.rst
> index 618e13d5e276..18cc3700489b 100644
> --- a/Documentation/PCI/MSI-HOWTO.txt
> +++ b/Documentation/PCI/MSI-HOWTO.rst
> @@ -1,13 +1,14 @@
> - The MSI Driver Guide HOWTO
> - Tom L Nguyen tom.l.nguyen@intel.com
> - 10/03/2003
> - Revised Feb 12, 2004 by Martine Silbermann
> - email: Martine.Silbermann@hp.com
> - Revised Jun 25, 2004 by Tom L Nguyen
> - Revised Jul 9, 2008 by Matthew Wilcox <willy@linux.intel.com>
> - Copyright 2003, 2008 Intel Corporation
> +.. SPDX-License-Identifier: GPL-2.0
> +.. include:: <isonum.txt>
>
> -1. About this guide
> +==========================
> +The MSI Driver Guide HOWTO
> +==========================
> +
> +:Authors: Tom L Nguyen; Martine Silbermann; Matthew Wilcox
Not so sure about this, as you removed the author emails.
It seems you missed to keep:
Copyright 2003, 2008 Intel Corporation
After re-adding the missing copyright:
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
> +
> +About this guide
> +================
>
> This guide describes the basics of Message Signaled Interrupts (MSIs),
> the advantages of using MSI over traditional interrupt mechanisms, how
> @@ -15,7 +16,8 @@ to change your driver to use MSI or MSI-X and some basic diagnostics to
> try if a device doesn't support MSIs.
>
>
> -2. What are MSIs?
> +What are MSIs?
> +==============
>
> A Message Signaled Interrupt is a write from the device to a special
> address which causes an interrupt to be received by the CPU.
> @@ -29,7 +31,8 @@ Devices may support both MSI and MSI-X, but only one can be enabled at
> a time.
>
>
> -3. Why use MSIs?
> +Why use MSIs?
> +=============
>
> There are three reasons why using MSIs can give an advantage over
> traditional pin-based interrupts.
> @@ -61,14 +64,16 @@ Other possible designs include giving one interrupt to each packet queue
> in a network card or each port in a storage controller.
>
>
> -4. How to use MSIs
> +How to use MSIs
> +===============
>
> PCI devices are initialised to use pin-based interrupts. The device
> driver has to set up the device to use MSI or MSI-X. Not all machines
> support MSIs correctly, and for those machines, the APIs described below
> will simply fail and the device will continue to use pin-based interrupts.
>
> -4.1 Include kernel support for MSIs
> +Include kernel support for MSIs
> +-------------------------------
>
> To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI
> option enabled. This option is only available on some architectures,
> @@ -76,14 +81,15 @@ and it may depend on some other options also being set. For example,
> on x86, you must also enable X86_UP_APIC or SMP in order to see the
> CONFIG_PCI_MSI option.
>
> -4.2 Using MSI
> +Using MSI
> +---------
>
> Most of the hard work is done for the driver in the PCI layer. The driver
> simply has to request that the PCI layer set up the MSI capability for this
> device.
>
> To automatically use MSI or MSI-X interrupt vectors, use the following
> -function:
> +function::
>
> int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
> unsigned int max_vecs, unsigned int flags);
> @@ -101,12 +107,12 @@ any possible kind of interrupt. If the PCI_IRQ_AFFINITY flag is set,
> pci_alloc_irq_vectors() will spread the interrupts around the available CPUs.
>
> To get the Linux IRQ numbers passed to request_irq() and free_irq() and the
> -vectors, use the following function:
> +vectors, use the following function::
>
> int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
>
> Any allocated resources should be freed before removing the device using
> -the following function:
> +the following function::
>
> void pci_free_irq_vectors(struct pci_dev *dev);
>
> @@ -126,7 +132,7 @@ The typical usage of MSI or MSI-X interrupts is to allocate as many vectors
> as possible, likely up to the limit supported by the device. If nvec is
> larger than the number supported by the device it will automatically be
> capped to the supported limit, so there is no need to query the number of
> -vectors supported beforehand:
> +vectors supported beforehand::
>
> nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_ALL_TYPES)
> if (nvec < 0)
> @@ -135,7 +141,7 @@ vectors supported beforehand:
> If a driver is unable or unwilling to deal with a variable number of MSI
> interrupts it can request a particular number of interrupts by passing that
> number to pci_alloc_irq_vectors() function as both 'min_vecs' and
> -'max_vecs' parameters:
> +'max_vecs' parameters::
>
> ret = pci_alloc_irq_vectors(pdev, nvec, nvec, PCI_IRQ_ALL_TYPES);
> if (ret < 0)
> @@ -143,23 +149,24 @@ number to pci_alloc_irq_vectors() function as both 'min_vecs' and
>
> The most notorious example of the request type described above is enabling
> the single MSI mode for a device. It could be done by passing two 1s as
> -'min_vecs' and 'max_vecs':
> +'min_vecs' and 'max_vecs'::
>
> ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
> if (ret < 0)
> goto out_err;
>
> Some devices might not support using legacy line interrupts, in which case
> -the driver can specify that only MSI or MSI-X is acceptable:
> +the driver can specify that only MSI or MSI-X is acceptable::
>
> nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_MSI | PCI_IRQ_MSIX);
> if (nvec < 0)
> goto out_err;
>
> -4.3 Legacy APIs
> +Legacy APIs
> +-----------
>
> The following old APIs to enable and disable MSI or MSI-X interrupts should
> -not be used in new code:
> +not be used in new code::
>
> pci_enable_msi() /* deprecated */
> pci_disable_msi() /* deprecated */
> @@ -174,9 +181,11 @@ number of vectors. If you have a legitimate special use case for the count
> of vectors we might have to revisit that decision and add a
> pci_nr_irq_vectors() helper that handles MSI and MSI-X transparently.
>
> -4.4 Considerations when using MSIs
> +Considerations when using MSIs
> +------------------------------
>
> -4.4.1 Spinlocks
> +Spinlocks
> +~~~~~~~~~
>
> Most device drivers have a per-device spinlock which is taken in the
> interrupt handler. With pin-based interrupts or a single MSI, it is not
> @@ -188,7 +197,8 @@ acquire the spinlock. Such deadlocks can be avoided by using
> spin_lock_irqsave() or spin_lock_irq() which disable local interrupts
> and acquire the lock (see Documentation/kernel-hacking/locking.rst).
>
> -4.5 How to tell whether MSI/MSI-X is enabled on a device
> +How to tell whether MSI/MSI-X is enabled on a device
> +----------------------------------------------------
>
> Using 'lspci -v' (as root) may show some devices with "MSI", "Message
> Signalled Interrupts" or "MSI-X" capabilities. Each of these capabilities
> @@ -196,7 +206,8 @@ has an 'Enable' flag which is followed with either "+" (enabled)
> or "-" (disabled).
>
>
> -5. MSI quirks
> +MSI quirks
> +==========
>
> Several PCI chipsets or devices are known not to support MSIs.
> The PCI stack provides three ways to disable MSIs:
> @@ -205,7 +216,8 @@ The PCI stack provides three ways to disable MSIs:
> 2. on all devices behind a specific bridge
> 3. on a single device
>
> -5.1. Disabling MSIs globally
> +Disabling MSIs globally
> +-----------------------
>
> Some host chipsets simply don't support MSIs properly. If we're
> lucky, the manufacturer knows this and has indicated it in the ACPI
> @@ -219,7 +231,8 @@ on the kernel command line to disable MSIs on all devices. It would be
> in your best interests to report the problem to linux-pci@vger.kernel.org
> including a full 'lspci -v' so we can add the quirks to the kernel.
>
> -5.2. Disabling MSIs below a bridge
> +Disabling MSIs below a bridge
> +-----------------------------
>
> Some PCI bridges are not able to route MSIs between busses properly.
> In this case, MSIs must be disabled on all devices behind the bridge.
> @@ -230,7 +243,7 @@ as the nVidia nForce and Serverworks HT2000). As with host chipsets,
> Linux mostly knows about them and automatically enables MSIs if it can.
> If you have a bridge unknown to Linux, you can enable
> MSIs in configuration space using whatever method you know works, then
> -enable MSIs on that bridge by doing:
> +enable MSIs on that bridge by doing::
>
> echo 1 > /sys/bus/pci/devices/$bridge/msi_bus
>
> @@ -244,7 +257,8 @@ below this bridge.
> Again, please notify linux-pci@vger.kernel.org of any bridges that need
> special handling.
>
> -5.3. Disabling MSIs on a single device
> +Disabling MSIs on a single device
> +---------------------------------
>
> Some devices are known to have faulty MSI implementations. Usually this
> is handled in the individual device driver, but occasionally it's necessary
> @@ -252,7 +266,8 @@ to handle this with a quirk. Some drivers have an option to disable use
> of MSI. While this is a convenient workaround for the driver author,
> it is not good practice, and should not be emulated.
>
> -5.4. Finding why MSIs are disabled on a device
> +Finding why MSIs are disabled on a device
> +-----------------------------------------
>
> From the above three sections, you can see that there are many reasons
> why MSIs may not be enabled for a given device. Your first step should
> @@ -260,8 +275,8 @@ be to examine your dmesg carefully to determine whether MSIs are enabled
> for your machine. You should also check your .config to be sure you
> have enabled CONFIG_PCI_MSI.
>
> -Then, 'lspci -t' gives the list of bridges above a device. Reading
> -/sys/bus/pci/devices/*/msi_bus will tell you whether MSIs are enabled (1)
> +Then, 'lspci -t' gives the list of bridges above a device. Reading
> +`/sys/bus/pci/devices/*/msi_bus` will tell you whether MSIs are enabled (1)
> or disabled (0). If 0 is found in any of the msi_bus files belonging
> to bridges between the PCI root and the device, MSIs are disabled.
>
> diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
> index e1c19962a7f8..1b25bcc1edca 100644
> --- a/Documentation/PCI/index.rst
> +++ b/Documentation/PCI/index.rst
> @@ -11,3 +11,4 @@ Linux PCI Bus Subsystem
> pci
> PCIEBUS-HOWTO
> pci-iov-howto
> + MSI-HOWTO
Thanks,
Mauro
^ permalink raw reply
* Re: [PATCH v4 28/63] Documentation: PCI: convert pci-iov-howto.txt to reST
From: Mauro Carvalho Chehab @ 2019-04-24 15:25 UTC (permalink / raw)
To: Changbin Du
Cc: fenghua.yu, x86, linux-doc, linux-pci, linux-gpio,
Jonathan Corbet, rjw, linux-kernel, linux-acpi, mingo,
Bjorn Helgaas, tglx, linuxppc-dev
In-Reply-To: <20190423162932.21428-29-changbin.du@gmail.com>
Em Wed, 24 Apr 2019 00:28:57 +0800
Changbin Du <changbin.du@gmail.com> escreveu:
> This converts the plain text documentation to reStructuredText format and
> add it to Sphinx TOC tree. No essential content change.
>
> Signed-off-by: Changbin Du <changbin.du@gmail.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
> Documentation/PCI/index.rst | 1 +
> .../{pci-iov-howto.txt => pci-iov-howto.rst} | 161 ++++++++++--------
> 2 files changed, 94 insertions(+), 68 deletions(-)
> rename Documentation/PCI/{pci-iov-howto.txt => pci-iov-howto.rst} (63%)
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
>
> diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
> index 452723318405..e1c19962a7f8 100644
> --- a/Documentation/PCI/index.rst
> +++ b/Documentation/PCI/index.rst
> @@ -10,3 +10,4 @@ Linux PCI Bus Subsystem
>
> pci
> PCIEBUS-HOWTO
> + pci-iov-howto
> diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.rst
> similarity index 63%
> rename from Documentation/PCI/pci-iov-howto.txt
> rename to Documentation/PCI/pci-iov-howto.rst
> index d2a84151e99c..b9fd003206f1 100644
> --- a/Documentation/PCI/pci-iov-howto.txt
> +++ b/Documentation/PCI/pci-iov-howto.rst
> @@ -1,14 +1,19 @@
> - PCI Express I/O Virtualization Howto
> - Copyright (C) 2009 Intel Corporation
> - Yu Zhao <yu.zhao@intel.com>
> +.. SPDX-License-Identifier: GPL-2.0
> +.. include:: <isonum.txt>
>
> - Update: November 2012
> - -- sysfs-based SRIOV enable-/disable-ment
> - Donald Dutile <ddutile@redhat.com>
> +====================================
> +PCI Express I/O Virtualization Howto
> +====================================
>
> -1. Overview
> +:Copyright: |copy| 2009 Intel Corporation
> +:Authors: - Yu Zhao <yu.zhao@intel.com>
> + - Donald Dutile <ddutile@redhat.com>
>
> -1.1 What is SR-IOV
> +Overview
> +========
> +
> +What is SR-IOV
> +--------------
>
> Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
> capability which makes one physical device appear as multiple virtual
> @@ -23,9 +28,11 @@ Memory Space, which is used to map its register set. VF device driver
> operates on the register set so it can be functional and appear as a
> real existing PCI device.
>
> -2. User Guide
> +User Guide
> +==========
>
> -2.1 How can I enable SR-IOV capability
> +How can I enable SR-IOV capability
> +----------------------------------
>
> Multiple methods are available for SR-IOV enablement.
> In the first method, the device driver (PF driver) will control the
> @@ -43,105 +50,123 @@ checks, e.g., check numvfs == 0 if enabling VFs, ensure
> numvfs <= totalvfs.
> The second method is the recommended method for new/future VF devices.
>
> -2.2 How can I use the Virtual Functions
> +How can I use the Virtual Functions
> +-----------------------------------
>
> The VF is treated as hot-plugged PCI devices in the kernel, so they
> should be able to work in the same way as real PCI devices. The VF
> requires device driver that is same as a normal PCI device's.
>
> -3. Developer Guide
> +Developer Guide
> +===============
>
> -3.1 SR-IOV API
> +SR-IOV API
> +----------
>
> To enable SR-IOV capability:
> -(a) For the first method, in the driver:
> +
> +(a) For the first method, in the driver::
> +
> int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
> - 'nr_virtfn' is number of VFs to be enabled.
> -(b) For the second method, from sysfs:
> +
> +'nr_virtfn' is number of VFs to be enabled.
> +
> +(b) For the second method, from sysfs::
> +
> echo 'nr_virtfn' > \
> /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
>
> To disable SR-IOV capability:
> -(a) For the first method, in the driver:
> +
> +(a) For the first method, in the driver::
> +
> void pci_disable_sriov(struct pci_dev *dev);
> -(b) For the second method, from sysfs:
> +
> +(b) For the second method, from sysfs::
> +
> echo 0 > \
> /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
>
> To enable auto probing VFs by a compatible driver on the host, run
> command below before enabling SR-IOV capabilities. This is the
> default behavior.
> +::
> +
> echo 1 > \
> /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_drivers_autoprobe
>
> To disable auto probing VFs by a compatible driver on the host, run
> command below before enabling SR-IOV capabilities. Updating this
> entry will not affect VFs which are already probed.
> +::
> +
> echo 0 > \
> /sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_drivers_autoprobe
>
> -3.2 Usage example
> +Usage example
> +-------------
>
> Following piece of code illustrates the usage of the SR-IOV API.
> +::
>
> -static int dev_probe(struct pci_dev *dev, const struct pci_device_id *id)
> -{
> - pci_enable_sriov(dev, NR_VIRTFN);
> + static int dev_probe(struct pci_dev *dev, const struct pci_device_id *id)
> + {
> + pci_enable_sriov(dev, NR_VIRTFN);
>
> - ...
> -
> - return 0;
> -}
> + ...
>
> -static void dev_remove(struct pci_dev *dev)
> -{
> - pci_disable_sriov(dev);
> + return 0;
> + }
>
> - ...
> -}
> + static void dev_remove(struct pci_dev *dev)
> + {
> + pci_disable_sriov(dev);
>
> -static int dev_suspend(struct pci_dev *dev, pm_message_t state)
> -{
> - ...
> + ...
> + }
>
> - return 0;
> -}
> + static int dev_suspend(struct pci_dev *dev, pm_message_t state)
> + {
> + ...
>
> -static int dev_resume(struct pci_dev *dev)
> -{
> - ...
> + return 0;
> + }
>
> - return 0;
> -}
> + static int dev_resume(struct pci_dev *dev)
> + {
> + ...
>
> -static void dev_shutdown(struct pci_dev *dev)
> -{
> - ...
> -}
> + return 0;
> + }
>
> -static int dev_sriov_configure(struct pci_dev *dev, int numvfs)
> -{
> - if (numvfs > 0) {
> - ...
> - pci_enable_sriov(dev, numvfs);
> + static void dev_shutdown(struct pci_dev *dev)
> + {
> ...
> - return numvfs;
> }
> - if (numvfs == 0) {
> - ....
> - pci_disable_sriov(dev);
> - ...
> - return 0;
> +
> + static int dev_sriov_configure(struct pci_dev *dev, int numvfs)
> + {
> + if (numvfs > 0) {
> + ...
> + pci_enable_sriov(dev, numvfs);
> + ...
> + return numvfs;
> + }
> + if (numvfs == 0) {
> + ....
> + pci_disable_sriov(dev);
> + ...
> + return 0;
> + }
> }
> -}
> -
> -static struct pci_driver dev_driver = {
> - .name = "SR-IOV Physical Function driver",
> - .id_table = dev_id_table,
> - .probe = dev_probe,
> - .remove = dev_remove,
> - .suspend = dev_suspend,
> - .resume = dev_resume,
> - .shutdown = dev_shutdown,
> - .sriov_configure = dev_sriov_configure,
> -};
> +
> + static struct pci_driver dev_driver = {
> + .name = "SR-IOV Physical Function driver",
> + .id_table = dev_id_table,
> + .probe = dev_probe,
> + .remove = dev_remove,
> + .suspend = dev_suspend,
> + .resume = dev_resume,
> + .shutdown = dev_shutdown,
> + .sriov_configure = dev_sriov_configure,
> + };
Thanks,
Mauro
^ permalink raw reply
* Re: [PATCH v4 27/63] Documentation: PCI: convert PCIEBUS-HOWTO.txt to reST
From: Mauro Carvalho Chehab @ 2019-04-24 15:23 UTC (permalink / raw)
To: Changbin Du
Cc: fenghua.yu, x86, linux-doc, linux-pci, linux-gpio,
Jonathan Corbet, rjw, linux-kernel, linux-acpi, mingo,
Bjorn Helgaas, tglx, linuxppc-dev
In-Reply-To: <20190423162932.21428-28-changbin.du@gmail.com>
Em Wed, 24 Apr 2019 00:28:56 +0800
Changbin Du <changbin.du@gmail.com> escreveu:
> This converts the plain text documentation to reStructuredText format and
> add it to Sphinx TOC tree. No essential content change.
>
> Signed-off-by: Changbin Du <changbin.du@gmail.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
> .../{PCIEBUS-HOWTO.txt => PCIEBUS-HOWTO.rst} | 140 ++++++++++--------
> Documentation/PCI/index.rst | 1 +
> 2 files changed, 82 insertions(+), 59 deletions(-)
> rename Documentation/PCI/{PCIEBUS-HOWTO.txt => PCIEBUS-HOWTO.rst} (70%)
Names in lowercase after rename, please.
For the changes itself at the txt file:
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
>
> diff --git a/Documentation/PCI/PCIEBUS-HOWTO.txt b/Documentation/PCI/PCIEBUS-HOWTO.rst
> similarity index 70%
> rename from Documentation/PCI/PCIEBUS-HOWTO.txt
> rename to Documentation/PCI/PCIEBUS-HOWTO.rst
> index 15f0bb3b5045..f882ff62c51f 100644
> --- a/Documentation/PCI/PCIEBUS-HOWTO.txt
> +++ b/Documentation/PCI/PCIEBUS-HOWTO.rst
> @@ -1,16 +1,23 @@
> - The PCI Express Port Bus Driver Guide HOWTO
> - Tom L Nguyen tom.l.nguyen@intel.com
> - 11/03/2004
> +.. SPDX-License-Identifier: GPL-2.0
> +.. include:: <isonum.txt>
>
> -1. About this guide
> +===========================================
> +The PCI Express Port Bus Driver Guide HOWTO
> +===========================================
> +
> +:Author: Tom L Nguyen tom.l.nguyen@intel.com 11/03/2004
> +:Copyright: |copy| 2004 Intel Corporation
> +
> +About this guide
> +================
>
> This guide describes the basics of the PCI Express Port Bus driver
> and provides information on how to enable the service drivers to
> register/unregister with the PCI Express Port Bus Driver.
>
> -2. Copyright 2004 Intel Corporation
>
> -3. What is the PCI Express Port Bus Driver
> +What is the PCI Express Port Bus Driver
> +=======================================
>
> A PCI Express Port is a logical PCI-PCI Bridge structure. There
> are two types of PCI Express Port: the Root Port and the Switch
> @@ -30,7 +37,8 @@ support (AER), and virtual channel support (VC). These services may
> be handled by a single complex driver or be individually distributed
> and handled by corresponding service drivers.
>
> -4. Why use the PCI Express Port Bus Driver?
> +Why use the PCI Express Port Bus Driver?
> +========================================
>
> In existing Linux kernels, the Linux Device Driver Model allows a
> physical device to be handled by only a single driver. The PCI
> @@ -51,28 +59,31 @@ PCI Express Ports and distributes all provided service requests
> to the corresponding service drivers as required. Some key
> advantages of using the PCI Express Port Bus driver are listed below:
>
> - - Allow multiple service drivers to run simultaneously on
> - a PCI-PCI Bridge Port device.
> + - Allow multiple service drivers to run simultaneously on
> + a PCI-PCI Bridge Port device.
>
> - - Allow service drivers implemented in an independent
> - staged approach.
> + - Allow service drivers implemented in an independent
> + staged approach.
>
> - - Allow one service driver to run on multiple PCI-PCI Bridge
> - Port devices.
> + - Allow one service driver to run on multiple PCI-PCI Bridge
> + Port devices.
>
> - - Manage and distribute resources of a PCI-PCI Bridge Port
> - device to requested service drivers.
> + - Manage and distribute resources of a PCI-PCI Bridge Port
> + device to requested service drivers.
>
> -5. Configuring the PCI Express Port Bus Driver vs. Service Drivers
> +Configuring the PCI Express Port Bus Driver vs. Service Drivers
> +===============================================================
>
> -5.1 Including the PCI Express Port Bus Driver Support into the Kernel
> +Including the PCI Express Port Bus Driver Support into the Kernel
> +-----------------------------------------------------------------
>
> Including the PCI Express Port Bus driver depends on whether the PCI
> Express support is included in the kernel config. The kernel will
> automatically include the PCI Express Port Bus driver as a kernel
> driver when the PCI Express support is enabled in the kernel.
>
> -5.2 Enabling Service Driver Support
> +Enabling Service Driver Support
> +-------------------------------
>
> PCI device drivers are implemented based on Linux Device Driver Model.
> All service drivers are PCI device drivers. As discussed above, it is
> @@ -89,9 +100,11 @@ header file /include/linux/pcieport_if.h, before calling these APIs.
> Failure to do so will result an identity mismatch, which prevents
> the PCI Express Port Bus driver from loading a service driver.
>
> -5.2.1 pcie_port_service_register
> +pcie_port_service_register
> +~~~~~~~~~~~~~~~~~~~~~~~~~~
> +::
>
> -int pcie_port_service_register(struct pcie_port_service_driver *new)
> + int pcie_port_service_register(struct pcie_port_service_driver *new)
>
> This API replaces the Linux Driver Model's pci_register_driver API. A
> service driver should always calls pcie_port_service_register at
> @@ -99,69 +112,76 @@ module init. Note that after service driver being loaded, calls
> such as pci_enable_device(dev) and pci_set_master(dev) are no longer
> necessary since these calls are executed by the PCI Port Bus driver.
>
> -5.2.2 pcie_port_service_unregister
> +pcie_port_service_unregister
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +::
>
> -void pcie_port_service_unregister(struct pcie_port_service_driver *new)
> + void pcie_port_service_unregister(struct pcie_port_service_driver *new)
>
> pcie_port_service_unregister replaces the Linux Driver Model's
> pci_unregister_driver. It's always called by service driver when a
> module exits.
>
> -5.2.3 Sample Code
> +Sample Code
> +~~~~~~~~~~~
>
> Below is sample service driver code to initialize the port service
> driver data structure.
> +::
>
> -static struct pcie_port_service_id service_id[] = { {
> - .vendor = PCI_ANY_ID,
> - .device = PCI_ANY_ID,
> - .port_type = PCIE_RC_PORT,
> - .service_type = PCIE_PORT_SERVICE_AER,
> - }, { /* end: all zeroes */ }
> -};
> + static struct pcie_port_service_id service_id[] = { {
> + .vendor = PCI_ANY_ID,
> + .device = PCI_ANY_ID,
> + .port_type = PCIE_RC_PORT,
> + .service_type = PCIE_PORT_SERVICE_AER,
> + }, { /* end: all zeroes */ }
> + };
>
> -static struct pcie_port_service_driver root_aerdrv = {
> - .name = (char *)device_name,
> - .id_table = &service_id[0],
> + static struct pcie_port_service_driver root_aerdrv = {
> + .name = (char *)device_name,
> + .id_table = &service_id[0],
>
> - .probe = aerdrv_load,
> - .remove = aerdrv_unload,
> + .probe = aerdrv_load,
> + .remove = aerdrv_unload,
>
> - .suspend = aerdrv_suspend,
> - .resume = aerdrv_resume,
> -};
> + .suspend = aerdrv_suspend,
> + .resume = aerdrv_resume,
> + };
>
> Below is a sample code for registering/unregistering a service
> driver.
> +::
>
> -static int __init aerdrv_service_init(void)
> -{
> - int retval = 0;
> + static int __init aerdrv_service_init(void)
> + {
> + int retval = 0;
>
> - retval = pcie_port_service_register(&root_aerdrv);
> - if (!retval) {
> - /*
> - * FIX ME
> - */
> - }
> - return retval;
> -}
> + retval = pcie_port_service_register(&root_aerdrv);
> + if (!retval) {
> + /*
> + * FIX ME
> + */
> + }
> + return retval;
> + }
>
> -static void __exit aerdrv_service_exit(void)
> -{
> - pcie_port_service_unregister(&root_aerdrv);
> -}
> + static void __exit aerdrv_service_exit(void)
> + {
> + pcie_port_service_unregister(&root_aerdrv);
> + }
>
> -module_init(aerdrv_service_init);
> -module_exit(aerdrv_service_exit);
> + module_init(aerdrv_service_init);
> + module_exit(aerdrv_service_exit);
>
> -6. Possible Resource Conflicts
> +Possible Resource Conflicts
> +===========================
>
> Since all service drivers of a PCI-PCI Bridge Port device are
> allowed to run simultaneously, below lists a few of possible resource
> conflicts with proposed solutions.
>
> -6.1 MSI and MSI-X Vector Resource
> +MSI and MSI-X Vector Resource
> +-----------------------------
>
> Once MSI or MSI-X interrupts are enabled on a device, it stays in this
> mode until they are disabled again. Since service drivers of the same
> @@ -179,7 +199,8 @@ driver. Service drivers should use (struct pcie_device*)dev->irq to
> call request_irq/free_irq. In addition, the interrupt mode is stored
> in the field interrupt_mode of struct pcie_device.
>
> -6.3 PCI Memory/IO Mapped Regions
> +PCI Memory/IO Mapped Regions
> +----------------------------
>
> Service drivers for PCI Express Power Management (PME), Advanced
> Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access
> @@ -188,7 +209,8 @@ registers accessed are independent of each other. This patch assumes
> that all service drivers will be well behaved and not overwrite
> other service driver's configuration settings.
>
> -6.4 PCI Config Registers
> +PCI Config Registers
> +--------------------
>
> Each service driver runs its PCI config operations on its own
> capability structure except the PCI Express capability structure, in
> diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
> index 7babf43709b0..452723318405 100644
> --- a/Documentation/PCI/index.rst
> +++ b/Documentation/PCI/index.rst
> @@ -9,3 +9,4 @@ Linux PCI Bus Subsystem
> :numbered:
>
> pci
> + PCIEBUS-HOWTO
Thanks,
Mauro
^ permalink raw reply
* Re: [PATCH v4 26/63] Documentation: PCI: convert pci.txt to reST
From: Mauro Carvalho Chehab @ 2019-04-24 15:20 UTC (permalink / raw)
To: Changbin Du
Cc: fenghua.yu, x86, linux-doc, linux-pci, linux-gpio,
Jonathan Corbet, rjw, linux-kernel, linux-acpi, mingo,
Bjorn Helgaas, tglx, linuxppc-dev
In-Reply-To: <20190423162932.21428-27-changbin.du@gmail.com>
Em Wed, 24 Apr 2019 00:28:55 +0800
Changbin Du <changbin.du@gmail.com> escreveu:
> This converts the plain text documentation to reStructuredText format and
> add it to Sphinx TOC tree. No essential content change.
>
> Signed-off-by: Changbin Du <changbin.du@gmail.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
> Documentation/PCI/index.rst | 2 +
> Documentation/PCI/{pci.txt => pci.rst} | 267 +++++++++++++------------
> 2 files changed, 140 insertions(+), 129 deletions(-)
> rename Documentation/PCI/{pci.txt => pci.rst} (78%)
>
> diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
> index c2f8728d11cf..7babf43709b0 100644
> --- a/Documentation/PCI/index.rst
> +++ b/Documentation/PCI/index.rst
> @@ -7,3 +7,5 @@ Linux PCI Bus Subsystem
> .. toctree::
> :maxdepth: 2
> :numbered:
> +
> + pci
See my comments to patch 25/63. It applies to all PCI stuff,
so I won't keep repeating it. Anyway, the final decision with
regards to file naming belongs to the docs maintainer and to
the PCI maintainer.
> diff --git a/Documentation/PCI/pci.txt b/Documentation/PCI/pci.rst
> similarity index 78%
> rename from Documentation/PCI/pci.txt
> rename to Documentation/PCI/pci.rst
> index badb26ac33dc..29ddd2e9177a 100644
> --- a/Documentation/PCI/pci.txt
> +++ b/Documentation/PCI/pci.rst
I would either rename this file or Documentation/driver-api/pci/pci.rst.
Even if the decision is to keep those on different directories, it
sounds a very bad idea on my eyes to keep two files with different
content and identical names on different directories that belong to
the same subsystem.
@PCI maintainers:
The MAINTAINERS file, at the PCI SUBSYSTEM part is missing
an entry for Documentation/driver-api/pci/.
> @@ -1,10 +1,12 @@
> +.. SPDX-License-Identifier: GPL-2.0
>
> - How To Write Linux PCI Drivers
> +==============================
> +How To Write Linux PCI Drivers
> +==============================
>
> - by Martin Mares <mj@ucw.cz> on 07-Feb-2000
> - updated by Grant Grundler <grundler@parisc-linux.org> on 23-Dec-2006
> +:Authors: - Martin Mares <mj@ucw.cz>
> + - Grant Grundler <grundler@parisc-linux.org>
>
> -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> The world of PCI is vast and full of (mostly unpleasant) surprises.
> Since each CPU architecture implements different chip-sets and PCI devices
> have different requirements (erm, "features"), the result is the PCI support
> @@ -26,8 +28,8 @@ Please send questions/comments/patches about Linux PCI API to the
>
>
>
> -0. Structure of PCI drivers
> -~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +Structure of PCI drivers
> +========================
> PCI drivers "discover" PCI devices in a system via pci_register_driver().
> Actually, it's the other way around. When the PCI generic code discovers
> a new device, the driver with a matching "description" will be notified.
> @@ -42,24 +44,25 @@ pointers and thus dictates the high level structure of a driver.
> Once the driver knows about a PCI device and takes ownership, the
> driver generally needs to perform the following initialization:
>
> - Enable the device
> - Request MMIO/IOP resources
> - Set the DMA mask size (for both coherent and streaming DMA)
> - Allocate and initialize shared control data (pci_allocate_coherent())
> - Access device configuration space (if needed)
> - Register IRQ handler (request_irq())
> - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip)
> - Enable DMA/processing engines
> + - Enable the device
> + - Request MMIO/IOP resources
> + - Set the DMA mask size (for both coherent and streaming DMA)
> + - Allocate and initialize shared control data (pci_allocate_coherent())
> + - Access device configuration space (if needed)
> + - Register IRQ handler (request_irq())
> + - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip)
> + - Enable DMA/processing engines
>
> When done using the device, and perhaps the module needs to be unloaded,
> the driver needs to take the follow steps:
> - Disable the device from generating IRQs
> - Release the IRQ (free_irq())
> - Stop all DMA activity
> - Release DMA buffers (both streaming and coherent)
> - Unregister from other subsystems (e.g. scsi or netdev)
> - Release MMIO/IOP resources
> - Disable the device
> +
> + - Disable the device from generating IRQs
> + - Release the IRQ (free_irq())
> + - Stop all DMA activity
> + - Release DMA buffers (both streaming and coherent)
> + - Unregister from other subsystems (e.g. scsi or netdev)
> + - Release MMIO/IOP resources
> + - Disable the device
>
> Most of these topics are covered in the following sections.
> For the rest look at LDD3 or <linux/pci.h> .
> @@ -70,13 +73,12 @@ completely empty or just returning an appropriate error codes to avoid
> lots of ifdefs in the drivers.
>
>
> -
> -1. pci_register_driver() call
> -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +pci_register_driver() call
> +==========================
>
> PCI device drivers call pci_register_driver() during their
> initialization with a pointer to a structure describing the driver
> -(struct pci_driver):
> +(struct pci_driver)::
>
> field name Description
> ---------- ------------------------------------------------------
> @@ -125,7 +127,7 @@ initialization with a pointer to a structure describing the driver
> The ID table is an array of struct pci_device_id entries ending with an
> all-zero entry. Definitions with static const are generally preferred.
Better to format this as a table, e. g.:
=============== =======================================================
field name Description
=============== =======================================================
id_table Pointer to table of device ID's the driver is
interested in. Most drivers should export this
table using MODULE_DEVICE_TABLE(pci,...).
probe This probing function gets called (during execution
of pci_register_driver() for already existing
devices or later if a new device gets inserted) for
all PCI devices which match the ID table and are not
"owned" by the other drivers yet. This function gets
passed a `struct pci_dev *` for each device whose
entry in the ID table matches the device. The probe
function returns zero when the driver chooses to
take "ownership" of the device or an error code
(negative number) otherwise.
The probe function always gets called from process
context, so it can sleep.
remove The remove() function gets called whenever a device
being handled by this driver is removed (either during
deregistration of the driver or when it's manually
pulled out of a hot-pluggable slot).
The remove function always gets called from process
context, so it can sleep.
suspend Put device into low power state.
suspend_late Put device into low power state.
resume_early Wake device from low power state.
resume Wake device from low power state.
(Please see Documentation/power/pci.rst for
descriptions of PCI Power Management and the
related functions.)
shutdown Hook into reboot_notifier_list (kernel/sys.c).
Intended to stop any idling DMA operations.
Useful for enabling wake-on-lan (NIC) or changing
the power state of a device before reboot.
e.g. drivers/net/e100.c.
err_handler See Documentation/PCI/pci-error-recovery.rst
=============== =======================================================
>
> -Each entry consists of:
> +Each entry consists of::
>
> vendor,device Vendor and device ID to match (or PCI_ANY_ID)
Same here:
Each entry consists of:
==================== =======================================================
vendor, device Vendor and device ID to match (or PCI_ANY_ID)
subvendor, subdevice Subsystem vendor and device ID to match (or PCI_ANY_ID)
class Device class, subclass, and "interface" to match.
See Appendix D of the PCI Local Bus Spec or
include/linux/pci_ids.h for a full list of classes.
Most drivers do not need to specify class/class_mask
as vendor/device is normally sufficient.
class_mask limit which sub-fields of the class field are compared.
See drivers/scsi/sym53c8xx_2/ for example of usage.
driver_data Data private to the driver.
Most drivers don't need to use driver_data field.
Best practice is to use driver_data as an index
into a static list of equivalent device types,
instead of using it as a pointer.
==================== =======================================================
>
> @@ -160,9 +162,10 @@ echo "vendor device subvendor subdevice class class_mask driver_data" > \
> All fields are passed in as hexadecimal values (no leading 0x).
> The vendor and device fields are mandatory, the others are optional. Users
> need pass only as many optional fields as necessary:
> - o subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF)
> - o class and classmask fields default to 0
> - o driver_data defaults to 0UL.
> +
> + - subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF)
> + - class and classmask fields default to 0
> + - driver_data defaults to 0UL.
>
> Note that driver_data must match the value used by any of the pci_device_id
> entries defined in the driver. This makes the driver_data field mandatory
> @@ -175,29 +178,30 @@ When the driver exits, it just calls pci_unregister_driver() and the PCI layer
> automatically calls the remove hook for all devices handled by the driver.
>
>
> -1.1 "Attributes" for driver functions/data
> +"Attributes" for driver functions/data
> +--------------------------------------
>
> Please mark the initialization and cleanup functions where appropriate
> -(the corresponding macros are defined in <linux/init.h>):
> +(the corresponding macros are defined in <linux/init.h>)::
>
> __init Initialization code. Thrown away after the driver
> initializes.
> __exit Exit code. Ignored for non-modular drivers.
Same here:
Please mark the initialization and cleanup functions where appropriate
(the corresponding macros are defined in <linux/init.h>):
=============== =================================================
__init Initialization code. Thrown away after the driver
initializes.
__exit Exit code. Ignored for non-modular drivers.
=============== =================================================
>
> Tips on when/where to use the above attributes:
> - o The module_init()/module_exit() functions (and all
> + - The module_init()/module_exit() functions (and all
> initialization functions called _only_ from these)
> should be marked __init/__exit.
>
> - o Do not mark the struct pci_driver.
> + - Do not mark the struct pci_driver.
>
> - o Do NOT mark a function if you are not sure which mark to use.
> + - Do NOT mark a function if you are not sure which mark to use.
> Better to not mark the function than mark the function wrong.
>
>
>
> -2. How to find PCI devices manually
> -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +How to find PCI devices manually
> +================================
>
> PCI drivers should have a really good reason for not using the
> pci_register_driver() interface to search for PCI devices.
> @@ -207,17 +211,17 @@ E.g. combined serial/parallel port/floppy controller.
>
> A manual search may be performed using the following constructs:
>
> -Searching by vendor and device ID:
> +Searching by vendor and device ID::
>
> struct pci_dev *dev = NULL;
> while (dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev))
> configure_device(dev);
>
> -Searching by class ID (iterate in a similar way):
> +Searching by class ID (iterate in a similar way)::
>
> pci_get_class(CLASS_ID, dev)
>
> -Searching by both vendor/device and subsystem vendor/device ID:
> +Searching by both vendor/device and subsystem vendor/device ID::
>
> pci_get_subsys(VENDOR_ID,DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev).
>
> @@ -231,20 +235,20 @@ decrement the reference count on these devices by calling pci_dev_put().
>
>
>
> -3. Device Initialization Steps
> -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +Device Initialization Steps
> +===========================
>
> As noted in the introduction, most PCI drivers need the following steps
> for device initialization:
>
> - Enable the device
> - Request MMIO/IOP resources
> - Set the DMA mask size (for both coherent and streaming DMA)
> - Allocate and initialize shared control data (pci_allocate_coherent())
> - Access device configuration space (if needed)
> - Register IRQ handler (request_irq())
> - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip)
> - Enable DMA/processing engines.
> + - Enable the device
> + - Request MMIO/IOP resources
> + - Set the DMA mask size (for both coherent and streaming DMA)
> + - Allocate and initialize shared control data (pci_allocate_coherent())
> + - Access device configuration space (if needed)
> + - Register IRQ handler (request_irq())
> + - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip)
> + - Enable DMA/processing engines.
>
> The driver can access PCI config space registers at any time.
> (Well, almost. When running BIST, config space can go away...but
> @@ -252,17 +256,18 @@ that will just result in a PCI Bus Master Abort and config reads
> will return garbage).
>
>
> -3.1 Enable the PCI device
> -~~~~~~~~~~~~~~~~~~~~~~~~~
> +Enable the PCI device
> +---------------------
> Before touching any device registers, the driver needs to enable
> the PCI device by calling pci_enable_device(). This will:
> - o wake up the device if it was in suspended state,
> - o allocate I/O and memory regions of the device (if BIOS did not),
> - o allocate an IRQ (if BIOS did not).
>
> -NOTE: pci_enable_device() can fail! Check the return value.
> + - wake up the device if it was in suspended state,
> + - allocate I/O and memory regions of the device (if BIOS did not),
> + - allocate an IRQ (if BIOS did not).
> +
> +.. note:: pci_enable_device() can fail! Check the return value.
>
> -[ OS BUG: we don't check resource allocations before enabling those
> +.. warning:: OS BUG: we don't check resource allocations before enabling those
> resources. The sequence would make more sense if we called
> pci_request_resources() before calling pci_enable_device().
> Currently, the device drivers can't detect the bug when when two
> @@ -271,7 +276,7 @@ NOTE: pci_enable_device() can fail! Check the return value.
>
> This has been discussed before but not changed as of 2.6.19:
> http://lkml.org/lkml/2006/3/2/194
> -]
> +
>
> pci_set_master() will enable DMA by setting the bus master bit
> in the PCI_COMMAND register. It also fixes the latency timer value if
> @@ -288,8 +293,8 @@ pci_try_set_mwi() to have the system do its best effort at enabling
> Mem-Wr-Inval.
>
>
> -3.2 Request MMIO/IOP resources
> -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +Request MMIO/IOP resources
> +--------------------------
> Memory (MMIO), and I/O port addresses should NOT be read directly
> from the PCI device config space. Use the values in the pci_dev structure
> as the PCI "bus address" might have been remapped to a "host physical"
> @@ -304,9 +309,9 @@ Conversely, drivers should call pci_release_region() AFTER
> calling pci_disable_device().
> The idea is to prevent two devices colliding on the same address range.
>
> -[ See OS BUG comment above. Currently (2.6.19), The driver can only
> +.. tip:: See OS BUG comment above. Currently (2.6.19), The driver can only
> determine MMIO and IO Port resource availability _after_ calling
> - pci_enable_device(). ]
> + pci_enable_device().
Hmm... indentation seems to be wrong here
>
> Generic flavors of pci_request_region() are request_mem_region()
> (for MMIO ranges) and request_region() (for IO Port ranges).
> @@ -316,12 +321,12 @@ BARs.
> Also see pci_request_selected_regions() below.
>
>
> -3.3 Set the DMA mask size
> -~~~~~~~~~~~~~~~~~~~~~~~~~
> -[ If anything below doesn't make sense, please refer to
> +Set the DMA mask size
> +---------------------
> +.. note:: If anything below doesn't make sense, please refer to
> Documentation/DMA-API.txt. This section is just a reminder that
> drivers need to indicate DMA capabilities of the device and is not
> - an authoritative source for DMA interfaces. ]
> + an authoritative source for DMA interfaces.
and here.
To be frank, when handling note/tip/warning/..., I prefer to indent
it like:
.. note::
If anything below doesn't make sense, please refer to
Documentation/DMA-API.txt. This section is just a reminder that
drivers need to indicate DMA capabilities of the device and is not
an authoritative source for DMA interfaces.
As this makes it more visible when reading as a plain text file.
>
> While all drivers should explicitly indicate the DMA capability
> (e.g. 32 or 64 bit) of the PCI bus master, devices with more than
> @@ -342,23 +347,23 @@ Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are
> ("consistent") data.
>
>
> -3.4 Setup shared control data
> -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +Setup shared control data
> +-------------------------
> Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared)
> memory. See Documentation/DMA-API.txt for a full description of
> the DMA APIs. This section is just a reminder that it needs to be done
> before enabling DMA on the device.
>
>
> -3.5 Initialize device registers
> -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +Initialize device registers
> +---------------------------
> Some drivers will need specific "capability" fields programmed
> or other "vendor specific" register initialized or reset.
> E.g. clearing pending interrupts.
>
>
> -3.6 Register IRQ handler
> -~~~~~~~~~~~~~~~~~~~~~~~~
> +Register IRQ handler
> +--------------------
> While calling request_irq() is the last step described here,
> this is often just another intermediate step to initialize a device.
> This step can often be deferred until the device is opened for use.
> @@ -396,6 +401,7 @@ and msix_enabled flags in the pci_dev structure after calling
> pci_alloc_irq_vectors.
>
> There are (at least) two really good reasons for using MSI:
> +
> 1) MSI is an exclusive interrupt vector by definition.
> This means the interrupt handler doesn't have to verify
> its device caused the interrupt.
> @@ -411,23 +417,23 @@ of MSI/MSI-X usage.
>
>
>
> -4. PCI device shutdown
> -~~~~~~~~~~~~~~~~~~~~~~~
> +PCI device shutdown
> +===================
>
> When a PCI device driver is being unloaded, most of the following
> steps need to be performed:
>
> - Disable the device from generating IRQs
> - Release the IRQ (free_irq())
> - Stop all DMA activity
> - Release DMA buffers (both streaming and consistent)
> - Unregister from other subsystems (e.g. scsi or netdev)
> - Disable device from responding to MMIO/IO Port addresses
> - Release MMIO/IO Port resource(s)
> + - Disable the device from generating IRQs
> + - Release the IRQ (free_irq())
> + - Stop all DMA activity
> + - Release DMA buffers (both streaming and consistent)
> + - Unregister from other subsystems (e.g. scsi or netdev)
> + - Disable device from responding to MMIO/IO Port addresses
> + - Release MMIO/IO Port resource(s)
>
>
> -4.1 Stop IRQs on the device
> -~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +Stop IRQs on the device
> +-----------------------
> How to do this is chip/device specific. If it's not done, it opens
> the possibility of a "screaming interrupt" if (and only if)
> the IRQ is shared with another device.
> @@ -446,16 +452,16 @@ MSI and MSI-X are defined to be exclusive interrupts and thus
> are not susceptible to the "screaming interrupt" problem.
>
>
> -4.2 Release the IRQ
> -~~~~~~~~~~~~~~~~~~~
> +Release the IRQ
> +---------------
> Once the device is quiesced (no more IRQs), one can call free_irq().
> This function will return control once any pending IRQs are handled,
> "unhook" the drivers IRQ handler from that IRQ, and finally release
> the IRQ if no one else is using it.
>
>
> -4.3 Stop all DMA activity
> -~~~~~~~~~~~~~~~~~~~~~~~~~
> +Stop all DMA activity
> +---------------------
> It's extremely important to stop all DMA operations BEFORE attempting
> to deallocate DMA control data. Failure to do so can result in memory
> corruption, hangs, and on some chip-sets a hard crash.
> @@ -467,8 +473,8 @@ While this step sounds obvious and trivial, several "mature" drivers
> didn't get this step right in the past.
>
>
> -4.4 Release DMA buffers
> -~~~~~~~~~~~~~~~~~~~~~~~
> +Release DMA buffers
> +-------------------
> Once DMA is stopped, clean up streaming DMA first.
> I.e. unmap data buffers and return buffers to "upstream"
> owners if there is one.
> @@ -478,8 +484,8 @@ Then clean up "consistent" buffers which contain the control data.
> See Documentation/DMA-API.txt for details on unmapping interfaces.
>
>
> -4.5 Unregister from other subsystems
> -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +Unregister from other subsystems
> +--------------------------------
> Most low level PCI device drivers support some other subsystem
> like USB, ALSA, SCSI, NetDev, Infiniband, etc. Make sure your
> driver isn't losing resources from that other subsystem.
> @@ -487,31 +493,31 @@ If this happens, typically the symptom is an Oops (panic) when
> the subsystem attempts to call into a driver that has been unloaded.
>
>
> -4.6 Disable Device from responding to MMIO/IO Port addresses
> -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +Disable Device from responding to MMIO/IO Port addresses
> +--------------------------------------------------------
> io_unmap() MMIO or IO Port resources and then call pci_disable_device().
> This is the symmetric opposite of pci_enable_device().
> Do not access device registers after calling pci_disable_device().
>
>
> -4.7 Release MMIO/IO Port Resource(s)
> -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +Release MMIO/IO Port Resource(s)
> +--------------------------------
> Call pci_release_region() to mark the MMIO or IO Port range as available.
> Failure to do so usually results in the inability to reload the driver.
>
>
>
> -5. How to access PCI config space
> -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +How to access PCI config space
> +==============================
>
> -You can use pci_(read|write)_config_(byte|word|dword) to access the config
> -space of a device represented by struct pci_dev *. All these functions return 0
> -when successful or an error code (PCIBIOS_...) which can be translated to a text
> -string by pcibios_strerror. Most drivers expect that accesses to valid PCI
> +You can use `pci_(read|write)_config_(byte|word|dword)` to access the config
> +space of a device represented by `struct pci_dev *`. All these functions return
> +0 when successful or an error code (`PCIBIOS_...`) which can be translated to a
> +text string by pcibios_strerror. Most drivers expect that accesses to valid PCI
> devices don't fail.
>
> If you don't have a struct pci_dev available, you can call
> -pci_bus_(read|write)_config_(byte|word|dword) to access a given device
> +`pci_bus_(read|write)_config_(byte|word|dword)` to access a given device
> and function on that bus.
>
> If you access fields in the standard portion of the config header, please
> @@ -522,28 +528,29 @@ pci_find_capability() for the particular capability and it will find the
> corresponding register block for you.
>
>
> +Other interesting functions
> +===========================
>
> -6. Other interesting functions
> -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +::
>
> -pci_get_domain_bus_and_slot() Find pci_dev corresponding to given domain,
> - bus and slot and number. If the device is
> - found, its reference count is increased.
> -pci_set_power_state() Set PCI Power Management state (0=D0 ... 3=D3)
> -pci_find_capability() Find specified capability in device's capability
> + pci_get_domain_bus_and_slot() Find pci_dev corresponding to given domain,
> + bus and slot and number. If the device is
> + found, its reference count is increased.
> + pci_set_power_state() Set PCI Power Management state (0=D0 ... 3=D3)
> + pci_find_capability() Find specified capability in device's capability
> list.
> -pci_resource_start() Returns bus start address for a given PCI region
> -pci_resource_end() Returns bus end address for a given PCI region
> -pci_resource_len() Returns the byte length of a PCI region
> -pci_set_drvdata() Set private driver data pointer for a pci_dev
> -pci_get_drvdata() Return private driver data pointer for a pci_dev
> -pci_set_mwi() Enable Memory-Write-Invalidate transactions.
> -pci_clear_mwi() Disable Memory-Write-Invalidate transactions.
> + pci_resource_start() Returns bus start address for a given PCI region
> + pci_resource_end() Returns bus end address for a given PCI region
> + pci_resource_len() Returns the byte length of a PCI region
> + pci_set_drvdata() Set private driver data pointer for a pci_dev
> + pci_get_drvdata() Return private driver data pointer for a pci_dev
> + pci_set_mwi() Enable Memory-Write-Invalidate transactions.
> + pci_clear_mwi() Disable Memory-Write-Invalidate transactions.
Better to use a list here:
=============================== ================================================
pci_get_domain_bus_and_slot() Find pci_dev corresponding to given domain,
bus and slot and number. If the device is
found, its reference count is increased.
pci_set_power_state() Set PCI Power Management state (0=D0 ... 3=D3)
pci_find_capability() Find specified capability in device's capability
list.
pci_resource_start() Returns bus start address for a given PCI region
pci_resource_end() Returns bus end address for a given PCI region
pci_resource_len() Returns the byte length of a PCI region
pci_set_drvdata() Set private driver data pointer for a pci_dev
pci_get_drvdata() Return private driver data pointer for a pci_dev
pci_set_mwi() Enable Memory-Write-Invalidate transactions.
pci_clear_mwi() Disable Memory-Write-Invalidate transactions.
=============================== ================================================
>
>
>
> -7. Miscellaneous hints
> -~~~~~~~~~~~~~~~~~~~~~~
> +Miscellaneous hints
> +===================
>
> When displaying PCI device names to the user (for example when a driver wants
> to tell the user what card has it found), please use pci_name(pci_dev).
> @@ -560,8 +567,8 @@ to be handled by platform and generic code, not individual drivers.
>
>
>
> -8. Vendor and device identifications
> -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +Vendor and device identifications
> +=================================
>
> Do not add new device or vendor IDs to include/linux/pci_ids.h unless they
> are shared across multiple drivers. You can add private definitions in
> @@ -576,18 +583,20 @@ and https://github.com/pciutils/pciids.
>
>
>
> -9. Obsolete functions
> -~~~~~~~~~~~~~~~~~~~~~
> +Obsolete functions
> +==================
>
> There are several functions which you might come across when trying to
> port an old driver to the new PCI interface. They are no longer present
> in the kernel as they aren't compatible with hotplug or PCI domains or
> having sane locking.
>
> -pci_find_device() Superseded by pci_get_device()
> -pci_find_subsys() Superseded by pci_get_subsys()
> -pci_find_slot() Superseded by pci_get_domain_bus_and_slot()
> -pci_get_slot() Superseded by pci_get_domain_bus_and_slot()
> +::
> +
> + pci_find_device() Superseded by pci_get_device()
> + pci_find_subsys() Superseded by pci_get_subsys()
> + pci_find_slot() Superseded by pci_get_domain_bus_and_slot()
> + pci_get_slot() Superseded by pci_get_domain_bus_and_slot()
A list works better here:
======================= ===========================================
pci_find_device() Superseded by pci_get_device()
pci_find_subsys() Superseded by pci_get_subsys()
pci_find_slot() Superseded by pci_get_domain_bus_and_slot()
pci_get_slot() Superseded by pci_get_domain_bus_and_slot()
======================= ===========================================
>
>
> The alternative is the traditional PCI device driver that walks PCI
> @@ -595,8 +604,8 @@ device lists. This is still possible but discouraged.
>
>
>
> -10. MMIO Space and "Write Posting"
> -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +MMIO Space and "Write Posting"
> +==============================
>
> Converting a driver from using I/O Port space to using MMIO space
> often requires some additional changes. Specifically, "write posting"
> @@ -609,14 +618,14 @@ the CPU before the transaction has reached its destination.
>
> Thus, timing sensitive code should add readl() where the CPU is
> expected to wait before doing other work. The classic "bit banging"
> -sequence works fine for I/O Port space:
> +sequence works fine for I/O Port space::
>
> for (i = 8; --i; val >>= 1) {
> outb(val & 1, ioport_reg); /* write bit */
> udelay(10);
> }
>
> -The same sequence for MMIO space should be:
> +The same sequence for MMIO space should be::
>
> for (i = 8; --i; val >>= 1) {
> writeb(val & 1, mmio_reg); /* write bit */
Thanks,
Mauro
^ permalink raw reply
* Re: [PATCH v12 22/31] mm: provide speculative fault infrastructure
From: Jerome Glisse @ 2019-04-24 15:13 UTC (permalink / raw)
To: Laurent Dufour
Cc: jack, sergey.senozhatsky.work, peterz, Will Deacon, mhocko,
linux-mm, paulus, Punit Agrawal, hpa, Michel Lespinasse,
Alexei Starovoitov, Andrea Arcangeli, ak, Minchan Kim,
aneesh.kumar, x86, Matthew Wilcox, Daniel Jordan, Ingo Molnar,
David Rientjes, paulmck, Haiyan Song, npiggin, sj38.park, dave,
kemi.wang, kirill, Thomas Gleixner, zhong jiang, Ganesh Mahendran,
Yang Shi, Mike Rapoport, linuxppc-dev, linux-kernel,
Sergey Senozhatsky, vinayak menon, akpm, Tim Chen, haren
In-Reply-To: <a1e27d15-2890-28fc-d350-ca62fb77f508@linux.ibm.com>
On Wed, Apr 24, 2019 at 04:56:14PM +0200, Laurent Dufour wrote:
> Le 22/04/2019 à 23:26, Jerome Glisse a écrit :
> > On Tue, Apr 16, 2019 at 03:45:13PM +0200, Laurent Dufour wrote:
> > > From: Peter Zijlstra <peterz@infradead.org>
> > >
> > > Provide infrastructure to do a speculative fault (not holding
> > > mmap_sem).
> > >
> > > The not holding of mmap_sem means we can race against VMA
> > > change/removal and page-table destruction. We use the SRCU VMA freeing
> > > to keep the VMA around. We use the VMA seqcount to detect change
> > > (including umapping / page-table deletion) and we use gup_fast() style
> > > page-table walking to deal with page-table races.
> > >
> > > Once we've obtained the page and are ready to update the PTE, we
> > > validate if the state we started the fault with is still valid, if
> > > not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
> > > PTE and we're done.
> > >
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > >
> > > [Manage the newly introduced pte_spinlock() for speculative page
> > > fault to fail if the VMA is touched in our back]
> > > [Rename vma_is_dead() to vma_has_changed() and declare it here]
> > > [Fetch p4d and pud]
> > > [Set vmd.sequence in __handle_mm_fault()]
> > > [Abort speculative path when handle_userfault() has to be called]
> > > [Add additional VMA's flags checks in handle_speculative_fault()]
> > > [Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
> > > [Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
> > > [Remove warning comment about waiting for !seq&1 since we don't want
> > > to wait]
> > > [Remove warning about no huge page support, mention it explictly]
> > > [Don't call do_fault() in the speculative path as __do_fault() calls
> > > vma->vm_ops->fault() which may want to release mmap_sem]
> > > [Only vm_fault pointer argument for vma_has_changed()]
> > > [Fix check against huge page, calling pmd_trans_huge()]
> > > [Use READ_ONCE() when reading VMA's fields in the speculative path]
> > > [Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support for
> > > processing done in vm_normal_page()]
> > > [Check that vma->anon_vma is already set when starting the speculative
> > > path]
> > > [Check for memory policy as we can't support MPOL_INTERLEAVE case due to
> > > the processing done in mpol_misplaced()]
> > > [Don't support VMA growing up or down]
> > > [Move check on vm_sequence just before calling handle_pte_fault()]
> > > [Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT]
> > > [Add mem cgroup oom check]
> > > [Use READ_ONCE to access p*d entries]
> > > [Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()]
> > > [Don't fetch pte again in handle_pte_fault() when running the speculative
> > > path]
> > > [Check PMD against concurrent collapsing operation]
> > > [Try spin lock the pte during the speculative path to avoid deadlock with
> > > other CPU's invalidating the TLB and requiring this CPU to catch the
> > > inter processor's interrupt]
> > > [Move define of FAULT_FLAG_SPECULATIVE here]
> > > [Introduce __handle_speculative_fault() and add a check against
> > > mm->mm_users in handle_speculative_fault() defined in mm.h]
> > > [Abort if vm_ops->fault is set instead of checking only vm_ops]
> > > [Use find_vma_rcu() and call put_vma() when we are done with the VMA]
> > > Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
> >
> >
> > Few comments and questions for this one see below.
> >
> >
> > > ---
> > > include/linux/hugetlb_inline.h | 2 +-
> > > include/linux/mm.h | 30 +++
> > > include/linux/pagemap.h | 4 +-
> > > mm/internal.h | 15 ++
> > > mm/memory.c | 344 ++++++++++++++++++++++++++++++++-
> > > 5 files changed, 389 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
> > > index 0660a03d37d9..9e25283d6fc9 100644
> > > --- a/include/linux/hugetlb_inline.h
> > > +++ b/include/linux/hugetlb_inline.h
> > > @@ -8,7 +8,7 @@
> > > static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
> > > {
> > > - return !!(vma->vm_flags & VM_HUGETLB);
> > > + return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB);
> > > }
> > > #else
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index f761a9c65c74..ec609cbad25a 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -381,6 +381,7 @@ extern pgprot_t protection_map[16];
> > > #define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */
> > > #define FAULT_FLAG_REMOTE 0x80 /* faulting for non current tsk/mm */
> > > #define FAULT_FLAG_INSTRUCTION 0x100 /* The fault was during an instruction fetch */
> > > +#define FAULT_FLAG_SPECULATIVE 0x200 /* Speculative fault, not holding mmap_sem */
> > > #define FAULT_FLAG_TRACE \
> > > { FAULT_FLAG_WRITE, "WRITE" }, \
> > > @@ -409,6 +410,10 @@ struct vm_fault {
> > > gfp_t gfp_mask; /* gfp mask to be used for allocations */
> > > pgoff_t pgoff; /* Logical page offset based on vma */
> > > unsigned long address; /* Faulting virtual address */
> > > +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> > > + unsigned int sequence;
> > > + pmd_t orig_pmd; /* value of PMD at the time of fault */
> > > +#endif
> > > pmd_t *pmd; /* Pointer to pmd entry matching
> > > * the 'address' */
> > > pud_t *pud; /* Pointer to pud entry matching
> > > @@ -1524,6 +1529,31 @@ int invalidate_inode_page(struct page *page);
> > > #ifdef CONFIG_MMU
> > > extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
> > > unsigned long address, unsigned int flags);
> > > +
> > > +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> > > +extern vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
> > > + unsigned long address,
> > > + unsigned int flags);
> > > +static inline vm_fault_t handle_speculative_fault(struct mm_struct *mm,
> > > + unsigned long address,
> > > + unsigned int flags)
> > > +{
> > > + /*
> > > + * Try speculative page fault for multithreaded user space task only.
> > > + */
> > > + if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
> > > + return VM_FAULT_RETRY;
> > > + return __handle_speculative_fault(mm, address, flags);
> > > +}
> > > +#else
> > > +static inline vm_fault_t handle_speculative_fault(struct mm_struct *mm,
> > > + unsigned long address,
> > > + unsigned int flags)
> > > +{
> > > + return VM_FAULT_RETRY;
> > > +}
> > > +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> > > +
> > > extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
> > > unsigned long address, unsigned int fault_flags,
> > > bool *unlocked);
> > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > > index 2e8438a1216a..2fcfaa910007 100644
> > > --- a/include/linux/pagemap.h
> > > +++ b/include/linux/pagemap.h
> > > @@ -457,8 +457,8 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
> > > pgoff_t pgoff;
> > > if (unlikely(is_vm_hugetlb_page(vma)))
> > > return linear_hugepage_index(vma, address);
> > > - pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
> > > - pgoff += vma->vm_pgoff;
> > > + pgoff = (address - READ_ONCE(vma->vm_start)) >> PAGE_SHIFT;
> > > + pgoff += READ_ONCE(vma->vm_pgoff);
> > > return pgoff;
> > > }
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index 1e368e4afe3c..ed91b199cb8c 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -58,6 +58,21 @@ static inline void put_vma(struct vm_area_struct *vma)
> > > extern struct vm_area_struct *find_vma_rcu(struct mm_struct *mm,
> > > unsigned long addr);
> > > +
> > > +static inline bool vma_has_changed(struct vm_fault *vmf)
> > > +{
> > > + int ret = RB_EMPTY_NODE(&vmf->vma->vm_rb);
> > > + unsigned int seq = READ_ONCE(vmf->vma->vm_sequence.sequence);
> > > +
> > > + /*
> > > + * Matches both the wmb in write_seqlock_{begin,end}() and
> > > + * the wmb in vma_rb_erase().
> > > + */
> > > + smp_rmb();
> > > +
> > > + return ret || seq != vmf->sequence;
> > > +}
> > > +
> > > #else /* CONFIG_SPECULATIVE_PAGE_FAULT */
> > > static inline void get_vma(struct vm_area_struct *vma)
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 46f877b6abea..6e6bf61c0e5c 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -522,7 +522,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > > if (page)
> > > dump_page(page, "bad pte");
> > > pr_alert("addr:%p vm_flags:%08lx anon_vma:%p mapping:%p index:%lx\n",
> > > - (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
> > > + (void *)addr, READ_ONCE(vma->vm_flags), vma->anon_vma,
> > > + mapping, index);
> > > pr_alert("file:%pD fault:%pf mmap:%pf readpage:%pf\n",
> > > vma->vm_file,
> > > vma->vm_ops ? vma->vm_ops->fault : NULL,
> > > @@ -2082,6 +2083,118 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
> > > }
> > > EXPORT_SYMBOL_GPL(apply_to_page_range);
> > > +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> > > +static bool pte_spinlock(struct vm_fault *vmf)
> > > +{
> > > + bool ret = false;
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > + pmd_t pmdval;
> > > +#endif
> > > +
> > > + /* Check if vma is still valid */
> > > + if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
> > > + vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> > > + spin_lock(vmf->ptl);
> > > + return true;
> > > + }
> > > +
> > > +again:
> > > + local_irq_disable();
> > > + if (vma_has_changed(vmf))
> > > + goto out;
> > > +
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > + /*
> > > + * We check if the pmd value is still the same to ensure that there
> > > + * is not a huge collapse operation in progress in our back.
> > > + */
> > > + pmdval = READ_ONCE(*vmf->pmd);
> > > + if (!pmd_same(pmdval, vmf->orig_pmd))
> > > + goto out;
> > > +#endif
> > > +
> > > + vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> > > + if (unlikely(!spin_trylock(vmf->ptl))) {
> > > + local_irq_enable();
> > > + goto again;
> > > + }
> >
> > Do we want to constantly retry taking the spinlock ? Shouldn't it
> > be limited ? If we fail few times it is probably better to give
> > up on that speculative page fault.
> >
> > So maybe putting everything within a for(i; i < MAX_TRY; ++i) loop
> > would be cleaner.
>
> I did tried that by the past when I added this loop but I never reach the
> limit I set. By the way what should be the MAX_TRY value? ;)
A power of 2 :) Like 4, something small.
>
> The loop was introduced to fix a race between CPU, this is explained in the
> patch description, but a comment is clearly missing here:
>
> /*
> * A spin_trylock() of the ptl is done to avoid a deadlock with other
> * CPU invalidating the TLB and requiring this CPU to catch the IPI.
> * As the interrupt are disabled during this operation we need to relax
> * them and try again locking the PTL.
> */
>
> I don't think that retrying the page fault would help, since the regular
> page fault handler will also spin here if there is a massive contention on
> the PTL.
My main fear is the loop will hammer a CPU if another CPU is holding
the same spinlock. In most places the page table lock should be held
only for short period of time so it should never last long. So while i
can not think of any reasons it would loop forever i fear i might have
a lack of imagination here.
>
> >
> >
> > > +
> > > + if (vma_has_changed(vmf)) {
> > > + spin_unlock(vmf->ptl);
> > > + goto out;
> > > + }
> > > +
> > > + ret = true;
> > > +out:
> > > + local_irq_enable();
> > > + return ret;
> > > +}
> > > +
> > > +static bool pte_map_lock(struct vm_fault *vmf)
> > > +{
> > > + bool ret = false;
> > > + pte_t *pte;
> > > + spinlock_t *ptl;
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > + pmd_t pmdval;
> > > +#endif
> > > +
> > > + if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
> > > + vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> > > + vmf->address, &vmf->ptl);
> > > + return true;
> > > + }
> > > +
> > > + /*
> > > + * The first vma_has_changed() guarantees the page-tables are still
> > > + * valid, having IRQs disabled ensures they stay around, hence the
> > > + * second vma_has_changed() to make sure they are still valid once
> > > + * we've got the lock. After that a concurrent zap_pte_range() will
> > > + * block on the PTL and thus we're safe.
> > > + */
> > > +again:
> > > + local_irq_disable();
> > > + if (vma_has_changed(vmf))
> > > + goto out;
> > > +
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > + /*
> > > + * We check if the pmd value is still the same to ensure that there
> > > + * is not a huge collapse operation in progress in our back.
> > > + */
> > > + pmdval = READ_ONCE(*vmf->pmd);
> > > + if (!pmd_same(pmdval, vmf->orig_pmd))
> > > + goto out;
> > > +#endif
> > > +
> > > + /*
> > > + * Same as pte_offset_map_lock() except that we call
> > > + * spin_trylock() in place of spin_lock() to avoid race with
> > > + * unmap path which may have the lock and wait for this CPU
> > > + * to invalidate TLB but this CPU has irq disabled.
> > > + * Since we are in a speculative patch, accept it could fail
> > > + */
> > > + ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> > > + pte = pte_offset_map(vmf->pmd, vmf->address);
> > > + if (unlikely(!spin_trylock(ptl))) {
> > > + pte_unmap(pte);
> > > + local_irq_enable();
> > > + goto again;
> > > + }
> >
> > Same comment as above shouldn't be limited to a maximum number of retry ?
>
> Same answer ;)
>
> >
> > > +
> > > + if (vma_has_changed(vmf)) {
> > > + pte_unmap_unlock(pte, ptl);
> > > + goto out;
> > > + }
> > > +
> > > + vmf->pte = pte;
> > > + vmf->ptl = ptl;
> > > + ret = true;
> > > +out:
> > > + local_irq_enable();
> > > + return ret;
> > > +}
> > > +#else
> > > static inline bool pte_spinlock(struct vm_fault *vmf)
> > > {
> > > vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> > > @@ -2095,6 +2208,7 @@ static inline bool pte_map_lock(struct vm_fault *vmf)
> > > vmf->address, &vmf->ptl);
> > > return true;
> > > }
> > > +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> > > /*
> > > * handle_pte_fault chooses page fault handler according to an entry which was
> > > @@ -2999,6 +3113,14 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> > > ret = check_stable_address_space(vma->vm_mm);
> > > if (ret)
> > > goto unlock;
> > > + /*
> > > + * Don't call the userfaultfd during the speculative path.
> > > + * We already checked for the VMA to not be managed through
> > > + * userfaultfd, but it may be set in our back once we have lock
> > > + * the pte. In such a case we can ignore it this time.
> > > + */
> > > + if (vmf->flags & FAULT_FLAG_SPECULATIVE)
> > > + goto setpte;
> >
> > Bit confuse by the comment above, if userfaultfd is set in the back
> > then shouldn't the speculative fault abort ? So wouldn't the following
> > be correct:
> >
> > if (userfaultfd_missing(vma)) {
> > pte_unmap_unlock(vmf->pte, vmf->ptl);
> > if (vmf->flags & FAULT_FLAG_SPECULATIVE)
> > return VM_FAULT_RETRY;
> > ...
>
> Well here we are racing with the user space action setting the userfaultfd,
> we may have go through this page fault seeing the userfaultfd or not. But I
> can't imagine that the user process will rely on that to happen. If there is
> such a race, it would be up to the user space process to ensure that no page
> fault are triggered while it is setting up the userfaultfd.
> Since a check on the userfaultfd is done at the beginning of the SPF
> handler, I made the choice to ignore this later and not trigger the
> userfault this time.
>
> Obviously we may abort the SPF handling but what is the benefit ?
Yeah probably no benefit one way or the other, backing of when a vma
change in anyway seems to be more consistent to me but i am fine either
way.
>
> >
> > > /* Deliver the page fault to userland, check inside PT lock */
> > > if (userfaultfd_missing(vma)) {
> > > pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > @@ -3041,7 +3163,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> > > goto unlock_and_release;
> > > /* Deliver the page fault to userland, check inside PT lock */
> > > - if (userfaultfd_missing(vma)) {
> > > + if (!(vmf->flags & FAULT_FLAG_SPECULATIVE) &&
> > > + userfaultfd_missing(vma)) {
> >
> > Same comment as above but this also seems more wrong then above. What
> > i propose above would look more correct in both cases ie we still want
> > to check for userfaultfd but if we are in speculative fault then we
> > just want to abort the speculative fault.
>
> Why is more wrong here ? Indeed this is consistent with the previous action,
> ignore the userfault event if it has been set while the SPF handler is in
> progress. IMHO this is up to the user space to serialize the userfaultfd
> setting against the ongoing page fault in that case.
Adding a comment saying that SPF would have back off if userfaulfd
was arm at begining of SPF and that we want to ignore race with
userfaultfd enablement.
^ permalink raw reply
* Re: [PATCH v4 25/63] Documentation: add Linux PCI to Sphinx TOC tree
From: Mauro Carvalho Chehab @ 2019-04-24 15:03 UTC (permalink / raw)
To: Changbin Du
Cc: fenghua.yu, x86, linux-doc, linux-pci, linux-gpio,
Jonathan Corbet, rjw, linux-kernel, linux-acpi, mingo,
Bjorn Helgaas, tglx, linuxppc-dev
In-Reply-To: <20190423162932.21428-26-changbin.du@gmail.com>
Em Wed, 24 Apr 2019 00:28:54 +0800
Changbin Du <changbin.du@gmail.com> escreveu:
> Add a index.rst for PCI subsystem. More docs will be added later.
>
> Signed-off-by: Changbin Du <changbin.du@gmail.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
> Documentation/PCI/index.rst | 9 +++++++++
On a past discussion at docs ML, we've agreed to use lowercase for new
stuff. My suggestion here would be to use lowercase for "pci".
Also, there's already a pci directory under driver-api, added on this
commit:
commit fcc78f9c22474d60c65d522e50ea07006ec1b9fc
Author: Logan Gunthorpe <logang@deltatee.com>
Date: Thu Oct 4 15:27:39 2018 -0600
docs-rst: Add a new directory for PCI documentation
I would just add a new section at Documentation/driver-api/pci/index.rst
with something like:
Legacy PCI documentation
========================
.. note::
The files here were written a long time ago and need some serious
work. Use their contents with caution.
.. toctree::
:maxdepth: 1
<files converted from Documentation/PCI>
And add those documents from Documentation/PCI into it.
> Documentation/index.rst | 1 +
> 2 files changed, 10 insertions(+)
> create mode 100644 Documentation/PCI/index.rst
>
> diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
> new file mode 100644
> index 000000000000..c2f8728d11cf
> --- /dev/null
> +++ b/Documentation/PCI/index.rst
> @@ -0,0 +1,9 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=======================
> +Linux PCI Bus Subsystem
> +=======================
> +
> +.. toctree::
> + :maxdepth: 2
> + :numbered:
> diff --git a/Documentation/index.rst b/Documentation/index.rst
> index fdfa85c56a50..d80138284e0f 100644
> --- a/Documentation/index.rst
> +++ b/Documentation/index.rst
> @@ -100,6 +100,7 @@ needed).
> filesystems/index
> vm/index
> bpf/index
> + PCI/index
> misc-devices/index
>
> Architecture-specific documentation
Thanks,
Mauro
^ permalink raw reply
* Re: [PATCH v12 23/31] mm: don't do swap readahead during speculative page fault
From: Laurent Dufour @ 2019-04-24 14:57 UTC (permalink / raw)
To: Jerome Glisse
Cc: jack, sergey.senozhatsky.work, peterz, Will Deacon, mhocko,
linux-mm, paulus, Punit Agrawal, hpa, Michel Lespinasse,
Alexei Starovoitov, Andrea Arcangeli, ak, Minchan Kim,
aneesh.kumar, x86, Matthew Wilcox, Daniel Jordan, Ingo Molnar,
David Rientjes, paulmck, Haiyan Song, npiggin, sj38.park, dave,
kemi.wang, kirill, Thomas Gleixner, zhong jiang, Ganesh Mahendran,
Yang Shi, Mike Rapoport, linuxppc-dev, linux-kernel,
Sergey Senozhatsky, Vinayak Menon, vinayak menon, akpm, Tim Chen,
haren
In-Reply-To: <20190422213611.GN14666@redhat.com>
Le 22/04/2019 à 23:36, Jerome Glisse a écrit :
> On Tue, Apr 16, 2019 at 03:45:14PM +0200, Laurent Dufour wrote:
>> Vinayak Menon faced a panic because one thread was page faulting a page in
>> swap, while another one was mprotecting a part of the VMA leading to a VMA
>> split.
>> This raise a panic in swap_vma_readahead() because the VMA's boundaries
>> were not more matching the faulting address.
>>
>> To avoid this, if the page is not found in the swap, the speculative page
>> fault is aborted to retry a regular page fault.
>>
>> Reported-by: Vinayak Menon <vinmenon@codeaurora.org>
>> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
>
> Note that you should also skip non swap entry in do_swap_page() when doing
> speculative page fault at very least you need to is_device_private_entry()
> case.
>
> But this should either be part of patch 22 or another patch to fix swap
> case.
Thanks Jérôme,
Yes I missed that, I guess the best option would be to abort on non swap
entry. I'll add that in the patch 22.
>> ---
>> mm/memory.c | 11 +++++++++++
>> 1 file changed, 11 insertions(+)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 6e6bf61c0e5c..1991da97e2db 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2900,6 +2900,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> lru_cache_add_anon(page);
>> swap_readpage(page, true);
>> }
>> + } else if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
>> + /*
>> + * Don't try readahead during a speculative page fault
>> + * as the VMA's boundaries may change in our back.
>> + * If the page is not in the swap cache and synchronous
>> + * read is disabled, fall back to the regular page
>> + * fault mechanism.
>> + */
>> + delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
>> + ret = VM_FAULT_RETRY;
>> + goto out;
>> } else {
>> page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
>> vmf);
>> --
>> 2.21.0
>>
>
^ permalink raw reply
* Re: [PATCH v4 24/63] Documentation: ACPI: move video_extension.txt to firmware-guide/acpi and convert to reST
From: Mauro Carvalho Chehab @ 2019-04-24 14:56 UTC (permalink / raw)
To: Changbin Du, mingo
Cc: fenghua.yu, x86, linux-doc, linux-pci, linux-gpio,
Jonathan Corbet, rjw, linux-kernel, linux-acpi, Bjorn Helgaas,
tglx, linuxppc-dev
In-Reply-To: <20190423162932.21428-25-changbin.du@gmail.com>
Em Wed, 24 Apr 2019 00:28:53 +0800
Changbin Du <changbin.du@gmail.com> escreveu:
> This converts the plain text documentation to reStructuredText format and
> add it to Sphinx TOC tree. No essential content change.
>
> Signed-off-by: Changbin Du <changbin.du@gmail.com>
> ---
> Documentation/firmware-guide/acpi/index.rst | 1 +
> .../acpi/video_extension.rst} | 63 ++++++++++---------
> 2 files changed, 36 insertions(+), 28 deletions(-)
> rename Documentation/{acpi/video_extension.txt => firmware-guide/acpi/video_extension.rst} (79%)
>
> diff --git a/Documentation/firmware-guide/acpi/index.rst b/Documentation/firmware-guide/acpi/index.rst
> index 0e60f4b7129a..ae609eec4679 100644
> --- a/Documentation/firmware-guide/acpi/index.rst
> +++ b/Documentation/firmware-guide/acpi/index.rst
> @@ -23,3 +23,4 @@ ACPI Support
> i2c-muxes
> acpi-lid
> lpit
> + video_extension
> diff --git a/Documentation/acpi/video_extension.txt b/Documentation/firmware-guide/acpi/video_extension.rst
> similarity index 79%
> rename from Documentation/acpi/video_extension.txt
> rename to Documentation/firmware-guide/acpi/video_extension.rst
> index 79bf6a4921be..06f7e3230b6e 100644
> --- a/Documentation/acpi/video_extension.txt
> +++ b/Documentation/firmware-guide/acpi/video_extension.rst
> @@ -1,5 +1,8 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> ACPI video extensions
> -~~~~~~~~~~~~~~~~~~~~~
> +=====================
>
> This driver implement the ACPI Extensions For Display Adapters for
> integrated graphics devices on motherboard, as specified in ACPI 2.0
> @@ -8,9 +11,10 @@ defining the video POST device, retrieving EDID information or to
> setup a video output, etc. Note that this is an ref. implementation
> only. It may or may not work for your integrated video device.
>
> -The ACPI video driver does 3 things regarding backlight control:
> +The ACPI video driver does 3 things regarding backlight control.
>
> -1 Export a sysfs interface for user space to control backlight level
> +1. Export a sysfs interface for user space to control backlight level
> +=====================================================================
>
> If the ACPI table has a video device, and acpi_backlight=vendor kernel
> command line is not present, the driver will register a backlight device
Hmm... you didn't touch on this part of the document:
And what ACPI video driver does is:
actual_brightness: on read, control method _BQC will be evaluated to
get the brightness level the firmware thinks it is at;
bl_power: not implemented, will set the current brightness instead;
brightness: on write, control method _BCM will run to set the requested
brightness level;
max_brightness: Derived from the _BCL package(see below);
type: firmware
You should touch it. My suggestion here is:
And what ACPI video driver does is:
actual_brightness:
on read, control method _BQC will be evaluated to
get the brightness level the firmware thinks it is at;
bl_power:
not implemented, will set the current brightness instead;
brightness:
on write, control method _BCM will run to set the requested
brightness level;
max_brightness:
Derived from the _BCL package(see below);
type:
firmware
> @@ -32,26 +36,26 @@ type: firmware
>
> Note that ACPI video backlight driver will always use index for
> brightness, actual_brightness and max_brightness. So if we have
> -the following _BCL package:
> +the following _BCL package::
>
> -Method (_BCL, 0, NotSerialized)
> -{
> - Return (Package (0x0C)
> + Method (_BCL, 0, NotSerialized)
> {
> - 0x64,
> - 0x32,
> - 0x0A,
> - 0x14,
> - 0x1E,
> - 0x28,
> - 0x32,
> - 0x3C,
> - 0x46,
> - 0x50,
> - 0x5A,
> - 0x64
> - })
> -}
> + Return (Package (0x0C)
> + {
> + 0x64,
> + 0x32,
> + 0x0A,
> + 0x14,
> + 0x1E,
> + 0x28,
> + 0x32,
> + 0x3C,
> + 0x46,
> + 0x50,
> + 0x5A,
> + 0x64
> + })
> + }
>
> The first two levels are for when laptop are on AC or on battery and are
> not used by Linux currently. The remaining 10 levels are supported levels
> @@ -62,13 +66,15 @@ as a "brightness level" indicator. Thus from the user space perspective
> the range of available brightness levels is from 0 to 9 (max_brightness)
> inclusive.
>
> -2 Notify user space about hotkey event
> +2. Notify user space about hotkey event
> +=======================================
>
> There are generally two cases for hotkey event reporting:
> +
> i) For some laptops, when user presses the hotkey, a scancode will be
> generated and sent to user space through the input device created by
> the keyboard driver as a key type input event, with proper remap, the
> - following key code will appear to user space:
> + following key code will appear to user space::
>
> EV_KEY, KEY_BRIGHTNESSUP
> EV_KEY, KEY_BRIGHTNESSDOWN
> @@ -82,7 +88,7 @@ ii) For some laptops, the press of the hotkey will not generate the
> about the event. The event value is defined in the ACPI spec. ACPI
> video driver will generate an key type input event according to the
> notify value it received and send the event to user space through the
> - input device it created:
> + input device it created::
>
> event keycode
> 0x86 KEY_BRIGHTNESSUP
Perhaps making this as a table would work better:
input device it created:
===== ===================
event keycode
===== ===================
0x86 KEY_BRIGHTNESSUP
0x87 KEY_BRIGHTNESSDOWN
etc.
===== ===================
> @@ -94,13 +100,14 @@ so this would lead to the same effect as case i) now.
> Once user space tool receives this event, it can modify the backlight
> level through the sysfs interface.
>
> -3 Change backlight level in the kernel
> +3. Change backlight level in the kernel
> +=======================================
>
> This works for machines covered by case ii) in Section 2. Once the driver
> received a notification, it will set the backlight level accordingly. This does
> not affect the sending of event to user space, they are always sent to user
> space regardless of whether or not the video module controls the backlight level
> directly. This behaviour can be controlled through the brightness_switch_enabled
> -module parameter as documented in admin-guide/kernel-parameters.rst. It is recommended to
> -disable this behaviour once a GUI environment starts up and wants to have full
> -control of the backlight level.
> +module parameter as documented in admin-guide/kernel-parameters.rst. It is
> +recommended to disable this behaviour once a GUI environment starts up and
> +wants to have full control of the backlight level.
Thanks,
Mauro
^ permalink raw reply
* Re: [PATCH v12 22/31] mm: provide speculative fault infrastructure
From: Laurent Dufour @ 2019-04-24 14:56 UTC (permalink / raw)
To: Jerome Glisse
Cc: jack, sergey.senozhatsky.work, peterz, Will Deacon, mhocko,
linux-mm, paulus, Punit Agrawal, hpa, Michel Lespinasse,
Alexei Starovoitov, Andrea Arcangeli, ak, Minchan Kim,
aneesh.kumar, x86, Matthew Wilcox, Daniel Jordan, Ingo Molnar,
David Rientjes, paulmck, Haiyan Song, npiggin, sj38.park, dave,
kemi.wang, kirill, Thomas Gleixner, zhong jiang, Ganesh Mahendran,
Yang Shi, Mike Rapoport, linuxppc-dev, linux-kernel,
Sergey Senozhatsky, vinayak menon, akpm, Tim Chen, haren
In-Reply-To: <20190422212623.GM14666@redhat.com>
Le 22/04/2019 à 23:26, Jerome Glisse a écrit :
> On Tue, Apr 16, 2019 at 03:45:13PM +0200, Laurent Dufour wrote:
>> From: Peter Zijlstra <peterz@infradead.org>
>>
>> Provide infrastructure to do a speculative fault (not holding
>> mmap_sem).
>>
>> The not holding of mmap_sem means we can race against VMA
>> change/removal and page-table destruction. We use the SRCU VMA freeing
>> to keep the VMA around. We use the VMA seqcount to detect change
>> (including umapping / page-table deletion) and we use gup_fast() style
>> page-table walking to deal with page-table races.
>>
>> Once we've obtained the page and are ready to update the PTE, we
>> validate if the state we started the fault with is still valid, if
>> not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
>> PTE and we're done.
>>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>
>> [Manage the newly introduced pte_spinlock() for speculative page
>> fault to fail if the VMA is touched in our back]
>> [Rename vma_is_dead() to vma_has_changed() and declare it here]
>> [Fetch p4d and pud]
>> [Set vmd.sequence in __handle_mm_fault()]
>> [Abort speculative path when handle_userfault() has to be called]
>> [Add additional VMA's flags checks in handle_speculative_fault()]
>> [Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
>> [Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
>> [Remove warning comment about waiting for !seq&1 since we don't want
>> to wait]
>> [Remove warning about no huge page support, mention it explictly]
>> [Don't call do_fault() in the speculative path as __do_fault() calls
>> vma->vm_ops->fault() which may want to release mmap_sem]
>> [Only vm_fault pointer argument for vma_has_changed()]
>> [Fix check against huge page, calling pmd_trans_huge()]
>> [Use READ_ONCE() when reading VMA's fields in the speculative path]
>> [Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support for
>> processing done in vm_normal_page()]
>> [Check that vma->anon_vma is already set when starting the speculative
>> path]
>> [Check for memory policy as we can't support MPOL_INTERLEAVE case due to
>> the processing done in mpol_misplaced()]
>> [Don't support VMA growing up or down]
>> [Move check on vm_sequence just before calling handle_pte_fault()]
>> [Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT]
>> [Add mem cgroup oom check]
>> [Use READ_ONCE to access p*d entries]
>> [Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()]
>> [Don't fetch pte again in handle_pte_fault() when running the speculative
>> path]
>> [Check PMD against concurrent collapsing operation]
>> [Try spin lock the pte during the speculative path to avoid deadlock with
>> other CPU's invalidating the TLB and requiring this CPU to catch the
>> inter processor's interrupt]
>> [Move define of FAULT_FLAG_SPECULATIVE here]
>> [Introduce __handle_speculative_fault() and add a check against
>> mm->mm_users in handle_speculative_fault() defined in mm.h]
>> [Abort if vm_ops->fault is set instead of checking only vm_ops]
>> [Use find_vma_rcu() and call put_vma() when we are done with the VMA]
>> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
>
>
> Few comments and questions for this one see below.
>
>
>> ---
>> include/linux/hugetlb_inline.h | 2 +-
>> include/linux/mm.h | 30 +++
>> include/linux/pagemap.h | 4 +-
>> mm/internal.h | 15 ++
>> mm/memory.c | 344 ++++++++++++++++++++++++++++++++-
>> 5 files changed, 389 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
>> index 0660a03d37d9..9e25283d6fc9 100644
>> --- a/include/linux/hugetlb_inline.h
>> +++ b/include/linux/hugetlb_inline.h
>> @@ -8,7 +8,7 @@
>>
>> static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
>> {
>> - return !!(vma->vm_flags & VM_HUGETLB);
>> + return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB);
>> }
>>
>> #else
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index f761a9c65c74..ec609cbad25a 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -381,6 +381,7 @@ extern pgprot_t protection_map[16];
>> #define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */
>> #define FAULT_FLAG_REMOTE 0x80 /* faulting for non current tsk/mm */
>> #define FAULT_FLAG_INSTRUCTION 0x100 /* The fault was during an instruction fetch */
>> +#define FAULT_FLAG_SPECULATIVE 0x200 /* Speculative fault, not holding mmap_sem */
>>
>> #define FAULT_FLAG_TRACE \
>> { FAULT_FLAG_WRITE, "WRITE" }, \
>> @@ -409,6 +410,10 @@ struct vm_fault {
>> gfp_t gfp_mask; /* gfp mask to be used for allocations */
>> pgoff_t pgoff; /* Logical page offset based on vma */
>> unsigned long address; /* Faulting virtual address */
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> + unsigned int sequence;
>> + pmd_t orig_pmd; /* value of PMD at the time of fault */
>> +#endif
>> pmd_t *pmd; /* Pointer to pmd entry matching
>> * the 'address' */
>> pud_t *pud; /* Pointer to pud entry matching
>> @@ -1524,6 +1529,31 @@ int invalidate_inode_page(struct page *page);
>> #ifdef CONFIG_MMU
>> extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
>> unsigned long address, unsigned int flags);
>> +
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> +extern vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
>> + unsigned long address,
>> + unsigned int flags);
>> +static inline vm_fault_t handle_speculative_fault(struct mm_struct *mm,
>> + unsigned long address,
>> + unsigned int flags)
>> +{
>> + /*
>> + * Try speculative page fault for multithreaded user space task only.
>> + */
>> + if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
>> + return VM_FAULT_RETRY;
>> + return __handle_speculative_fault(mm, address, flags);
>> +}
>> +#else
>> +static inline vm_fault_t handle_speculative_fault(struct mm_struct *mm,
>> + unsigned long address,
>> + unsigned int flags)
>> +{
>> + return VM_FAULT_RETRY;
>> +}
>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>> +
>> extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
>> unsigned long address, unsigned int fault_flags,
>> bool *unlocked);
>> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
>> index 2e8438a1216a..2fcfaa910007 100644
>> --- a/include/linux/pagemap.h
>> +++ b/include/linux/pagemap.h
>> @@ -457,8 +457,8 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
>> pgoff_t pgoff;
>> if (unlikely(is_vm_hugetlb_page(vma)))
>> return linear_hugepage_index(vma, address);
>> - pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
>> - pgoff += vma->vm_pgoff;
>> + pgoff = (address - READ_ONCE(vma->vm_start)) >> PAGE_SHIFT;
>> + pgoff += READ_ONCE(vma->vm_pgoff);
>> return pgoff;
>> }
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 1e368e4afe3c..ed91b199cb8c 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -58,6 +58,21 @@ static inline void put_vma(struct vm_area_struct *vma)
>> extern struct vm_area_struct *find_vma_rcu(struct mm_struct *mm,
>> unsigned long addr);
>>
>> +
>> +static inline bool vma_has_changed(struct vm_fault *vmf)
>> +{
>> + int ret = RB_EMPTY_NODE(&vmf->vma->vm_rb);
>> + unsigned int seq = READ_ONCE(vmf->vma->vm_sequence.sequence);
>> +
>> + /*
>> + * Matches both the wmb in write_seqlock_{begin,end}() and
>> + * the wmb in vma_rb_erase().
>> + */
>> + smp_rmb();
>> +
>> + return ret || seq != vmf->sequence;
>> +}
>> +
>> #else /* CONFIG_SPECULATIVE_PAGE_FAULT */
>>
>> static inline void get_vma(struct vm_area_struct *vma)
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 46f877b6abea..6e6bf61c0e5c 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -522,7 +522,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>> if (page)
>> dump_page(page, "bad pte");
>> pr_alert("addr:%p vm_flags:%08lx anon_vma:%p mapping:%p index:%lx\n",
>> - (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
>> + (void *)addr, READ_ONCE(vma->vm_flags), vma->anon_vma,
>> + mapping, index);
>> pr_alert("file:%pD fault:%pf mmap:%pf readpage:%pf\n",
>> vma->vm_file,
>> vma->vm_ops ? vma->vm_ops->fault : NULL,
>> @@ -2082,6 +2083,118 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>> }
>> EXPORT_SYMBOL_GPL(apply_to_page_range);
>>
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> +static bool pte_spinlock(struct vm_fault *vmf)
>> +{
>> + bool ret = false;
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> + pmd_t pmdval;
>> +#endif
>> +
>> + /* Check if vma is still valid */
>> + if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
>> + vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>> + spin_lock(vmf->ptl);
>> + return true;
>> + }
>> +
>> +again:
>> + local_irq_disable();
>> + if (vma_has_changed(vmf))
>> + goto out;
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> + /*
>> + * We check if the pmd value is still the same to ensure that there
>> + * is not a huge collapse operation in progress in our back.
>> + */
>> + pmdval = READ_ONCE(*vmf->pmd);
>> + if (!pmd_same(pmdval, vmf->orig_pmd))
>> + goto out;
>> +#endif
>> +
>> + vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>> + if (unlikely(!spin_trylock(vmf->ptl))) {
>> + local_irq_enable();
>> + goto again;
>> + }
>
> Do we want to constantly retry taking the spinlock ? Shouldn't it
> be limited ? If we fail few times it is probably better to give
> up on that speculative page fault.
>
> So maybe putting everything within a for(i; i < MAX_TRY; ++i) loop
> would be cleaner.
I did tried that by the past when I added this loop but I never reach
the limit I set. By the way what should be the MAX_TRY value? ;)
The loop was introduced to fix a race between CPU, this is explained in
the patch description, but a comment is clearly missing here:
/*
* A spin_trylock() of the ptl is done to avoid a deadlock with other
* CPU invalidating the TLB and requiring this CPU to catch the IPI.
* As the interrupt are disabled during this operation we need to relax
* them and try again locking the PTL.
*/
I don't think that retrying the page fault would help, since the regular
page fault handler will also spin here if there is a massive contention
on the PTL.
>
>
>> +
>> + if (vma_has_changed(vmf)) {
>> + spin_unlock(vmf->ptl);
>> + goto out;
>> + }
>> +
>> + ret = true;
>> +out:
>> + local_irq_enable();
>> + return ret;
>> +}
>> +
>> +static bool pte_map_lock(struct vm_fault *vmf)
>> +{
>> + bool ret = false;
>> + pte_t *pte;
>> + spinlock_t *ptl;
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> + pmd_t pmdval;
>> +#endif
>> +
>> + if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
>> + vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>> + vmf->address, &vmf->ptl);
>> + return true;
>> + }
>> +
>> + /*
>> + * The first vma_has_changed() guarantees the page-tables are still
>> + * valid, having IRQs disabled ensures they stay around, hence the
>> + * second vma_has_changed() to make sure they are still valid once
>> + * we've got the lock. After that a concurrent zap_pte_range() will
>> + * block on the PTL and thus we're safe.
>> + */
>> +again:
>> + local_irq_disable();
>> + if (vma_has_changed(vmf))
>> + goto out;
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> + /*
>> + * We check if the pmd value is still the same to ensure that there
>> + * is not a huge collapse operation in progress in our back.
>> + */
>> + pmdval = READ_ONCE(*vmf->pmd);
>> + if (!pmd_same(pmdval, vmf->orig_pmd))
>> + goto out;
>> +#endif
>> +
>> + /*
>> + * Same as pte_offset_map_lock() except that we call
>> + * spin_trylock() in place of spin_lock() to avoid race with
>> + * unmap path which may have the lock and wait for this CPU
>> + * to invalidate TLB but this CPU has irq disabled.
>> + * Since we are in a speculative patch, accept it could fail
>> + */
>> + ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>> + pte = pte_offset_map(vmf->pmd, vmf->address);
>> + if (unlikely(!spin_trylock(ptl))) {
>> + pte_unmap(pte);
>> + local_irq_enable();
>> + goto again;
>> + }
>
> Same comment as above shouldn't be limited to a maximum number of retry ?
Same answer ;)
>
>> +
>> + if (vma_has_changed(vmf)) {
>> + pte_unmap_unlock(pte, ptl);
>> + goto out;
>> + }
>> +
>> + vmf->pte = pte;
>> + vmf->ptl = ptl;
>> + ret = true;
>> +out:
>> + local_irq_enable();
>> + return ret;
>> +}
>> +#else
>> static inline bool pte_spinlock(struct vm_fault *vmf)
>> {
>> vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>> @@ -2095,6 +2208,7 @@ static inline bool pte_map_lock(struct vm_fault *vmf)
>> vmf->address, &vmf->ptl);
>> return true;
>> }
>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>>
>> /*
>> * handle_pte_fault chooses page fault handler according to an entry which was
>> @@ -2999,6 +3113,14 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>> ret = check_stable_address_space(vma->vm_mm);
>> if (ret)
>> goto unlock;
>> + /*
>> + * Don't call the userfaultfd during the speculative path.
>> + * We already checked for the VMA to not be managed through
>> + * userfaultfd, but it may be set in our back once we have lock
>> + * the pte. In such a case we can ignore it this time.
>> + */
>> + if (vmf->flags & FAULT_FLAG_SPECULATIVE)
>> + goto setpte;
>
> Bit confuse by the comment above, if userfaultfd is set in the back
> then shouldn't the speculative fault abort ? So wouldn't the following
> be correct:
>
> if (userfaultfd_missing(vma)) {
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> if (vmf->flags & FAULT_FLAG_SPECULATIVE)
> return VM_FAULT_RETRY;
> ...
Well here we are racing with the user space action setting the
userfaultfd, we may have go through this page fault seeing the
userfaultfd or not. But I can't imagine that the user process will rely
on that to happen. If there is such a race, it would be up to the user
space process to ensure that no page fault are triggered while it is
setting up the userfaultfd.
Since a check on the userfaultfd is done at the beginning of the SPF
handler, I made the choice to ignore this later and not trigger the
userfault this time.
Obviously we may abort the SPF handling but what is the benefit ?
>
>> /* Deliver the page fault to userland, check inside PT lock */
>> if (userfaultfd_missing(vma)) {
>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>> @@ -3041,7 +3163,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>> goto unlock_and_release;
>>
>> /* Deliver the page fault to userland, check inside PT lock */
>> - if (userfaultfd_missing(vma)) {
>> + if (!(vmf->flags & FAULT_FLAG_SPECULATIVE) &&
>> + userfaultfd_missing(vma)) {
>
> Same comment as above but this also seems more wrong then above. What
> i propose above would look more correct in both cases ie we still want
> to check for userfaultfd but if we are in speculative fault then we
> just want to abort the speculative fault.
Why is more wrong here ? Indeed this is consistent with the previous
action, ignore the userfault event if it has been set while the SPF
handler is in progress. IMHO this is up to the user space to serialize
the userfaultfd setting against the ongoing page fault in that case.
>
>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>> mem_cgroup_cancel_charge(page, memcg, false);
>> put_page(page);
>> @@ -3836,6 +3959,15 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>> pte_t entry;
>>
>> if (unlikely(pmd_none(*vmf->pmd))) {
>> + /*
>> + * In the case of the speculative page fault handler we abort
>> + * the speculative path immediately as the pmd is probably
>> + * in the way to be converted in a huge one. We will try
>> + * again holding the mmap_sem (which implies that the collapse
>> + * operation is done).
>> + */
>> + if (vmf->flags & FAULT_FLAG_SPECULATIVE)
>> + return VM_FAULT_RETRY;
>> /*
>> * Leave __pte_alloc() until later: because vm_ops->fault may
>> * want to allocate huge page, and if we expose page table
>> @@ -3843,7 +3975,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>> * concurrent faults and from rmap lookups.
>> */
>> vmf->pte = NULL;
>> - } else {
>> + } else if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
>> /* See comment in pte_alloc_one_map() */
>> if (pmd_devmap_trans_unstable(vmf->pmd))
>> return 0;
>> @@ -3852,6 +3984,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>> * pmd from under us anymore at this point because we hold the
>> * mmap_sem read mode and khugepaged takes it in write mode.
>> * So now it's safe to run pte_offset_map().
>> + * This is not applicable to the speculative page fault handler
>> + * but in that case, the pte is fetched earlier in
>> + * handle_speculative_fault().
>> */
>> vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>> vmf->orig_pte = *vmf->pte;
>> @@ -3874,6 +4009,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>> if (!vmf->pte) {
>> if (vma_is_anonymous(vmf->vma))
>> return do_anonymous_page(vmf);
>> + else if (vmf->flags & FAULT_FLAG_SPECULATIVE)
>> + return VM_FAULT_RETRY;
>
> Maybe a small comment about speculative page fault not applying to
> file back vma.
Sure.
>
>> else
>> return do_fault(vmf);
>> }
>> @@ -3971,6 +4108,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>> vmf.pmd = pmd_alloc(mm, vmf.pud, address);
>> if (!vmf.pmd)
>> return VM_FAULT_OOM;
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> + vmf.sequence = raw_read_seqcount(&vma->vm_sequence);
>> +#endif
>> if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) {
>> ret = create_huge_pmd(&vmf);
>> if (!(ret & VM_FAULT_FALLBACK))
>> @@ -4004,6 +4144,204 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>> return handle_pte_fault(&vmf);
>> }
>>
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> +/*
>> + * Tries to handle the page fault in a speculative way, without grabbing the
>> + * mmap_sem.
>> + */
>> +vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
>> + unsigned long address,
>> + unsigned int flags)
>> +{
>> + struct vm_fault vmf = {
>> + .address = address,
>> + };
>> + pgd_t *pgd, pgdval;
>> + p4d_t *p4d, p4dval;
>> + pud_t pudval;
>> + int seq;
>> + vm_fault_t ret = VM_FAULT_RETRY;
>> + struct vm_area_struct *vma;
>> +#ifdef CONFIG_NUMA
>> + struct mempolicy *pol;
>> +#endif
>> +
>> + /* Clear flags that may lead to release the mmap_sem to retry */
>> + flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
>> + flags |= FAULT_FLAG_SPECULATIVE;
>> +
>> + vma = find_vma_rcu(mm, address);
>> + if (!vma)
>> + return ret;
>> +
>> + /* rmb <-> seqlock,vma_rb_erase() */
>> + seq = raw_read_seqcount(&vma->vm_sequence);
>> + if (seq & 1)
>> + goto out_put;
>
> A comment explaining that odd sequence number means that we are racing
> with a write_begin and write_end would be welcome above.
Yes that would be welcome.
>> +
>> + /*
>> + * Can't call vm_ops service has we don't know what they would do
>> + * with the VMA.
>> + * This include huge page from hugetlbfs.
>> + */
>> + if (vma->vm_ops && vma->vm_ops->fault)
>> + goto out_put;
>> +
>> + /*
>> + * __anon_vma_prepare() requires the mmap_sem to be held
>> + * because vm_next and vm_prev must be safe. This can't be guaranteed
>> + * in the speculative path.
>> + */
>> + if (unlikely(!vma->anon_vma))
>> + goto out_put;
>
> Maybe also remind people that once the vma->anon_vma is set then its
> value will not change and thus we do not need to protect against such
> thing (unlike vm_flags or other vma field below and above).
Will do, thanks.
>> +
>> + vmf.vma_flags = READ_ONCE(vma->vm_flags);
>> + vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
>> +
>> + /* Can't call userland page fault handler in the speculative path */
>> + if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
>> + goto out_put;
>> +
>> + if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
>> + /*
>> + * This could be detected by the check address against VMA's
>> + * boundaries but we want to trace it as not supported instead
>> + * of changed.
>> + */
>> + goto out_put;
>> +
>> + if (address < READ_ONCE(vma->vm_start)
>> + || READ_ONCE(vma->vm_end) <= address)
>> + goto out_put;
>> +
>> + if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>> + flags & FAULT_FLAG_INSTRUCTION,
>> + flags & FAULT_FLAG_REMOTE)) {
>> + ret = VM_FAULT_SIGSEGV;
>> + goto out_put;
>> + }
>> +
>> + /* This is one is required to check that the VMA has write access set */
>> + if (flags & FAULT_FLAG_WRITE) {
>> + if (unlikely(!(vmf.vma_flags & VM_WRITE))) {
>> + ret = VM_FAULT_SIGSEGV;
>> + goto out_put;
>> + }
>> + } else if (unlikely(!(vmf.vma_flags & (VM_READ|VM_EXEC|VM_WRITE)))) {
>> + ret = VM_FAULT_SIGSEGV;
>> + goto out_put;
>> + }
>> +
>> +#ifdef CONFIG_NUMA
>> + /*
>> + * MPOL_INTERLEAVE implies additional checks in
>> + * mpol_misplaced() which are not compatible with the
>> + *speculative page fault processing.
>> + */
>> + pol = __get_vma_policy(vma, address);
>> + if (!pol)
>> + pol = get_task_policy(current);
>> + if (pol && pol->mode == MPOL_INTERLEAVE)
>> + goto out_put;
>> +#endif
>> +
>> + /*
>> + * Do a speculative lookup of the PTE entry.
>> + */
>> + local_irq_disable();
>> + pgd = pgd_offset(mm, address);
>> + pgdval = READ_ONCE(*pgd);
>> + if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval)))
>> + goto out_walk;
>> +
>> + p4d = p4d_offset(pgd, address);
>> + p4dval = READ_ONCE(*p4d);
>> + if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval)))
>> + goto out_walk;
>> +
>> + vmf.pud = pud_offset(p4d, address);
>> + pudval = READ_ONCE(*vmf.pud);
>> + if (pud_none(pudval) || unlikely(pud_bad(pudval)))
>> + goto out_walk;
>> +
>> + /* Huge pages at PUD level are not supported. */
>> + if (unlikely(pud_trans_huge(pudval)))
>> + goto out_walk;
>> +
>> + vmf.pmd = pmd_offset(vmf.pud, address);
>> + vmf.orig_pmd = READ_ONCE(*vmf.pmd);
>> + /*
>> + * pmd_none could mean that a hugepage collapse is in progress
>> + * in our back as collapse_huge_page() mark it before
>> + * invalidating the pte (which is done once the IPI is catched
>> + * by all CPU and we have interrupt disabled).
>> + * For this reason we cannot handle THP in a speculative way since we
>> + * can't safely identify an in progress collapse operation done in our
>> + * back on that PMD.
>> + * Regarding the order of the following checks, see comment in
>> + * pmd_devmap_trans_unstable()
>> + */
>> + if (unlikely(pmd_devmap(vmf.orig_pmd) ||
>> + pmd_none(vmf.orig_pmd) || pmd_trans_huge(vmf.orig_pmd) ||
>> + is_swap_pmd(vmf.orig_pmd)))
>> + goto out_walk;
>> +
>> + /*
>> + * The above does not allocate/instantiate page-tables because doing so
>> + * would lead to the possibility of instantiating page-tables after
>> + * free_pgtables() -- and consequently leaking them.
>> + *
>> + * The result is that we take at least one !speculative fault per PMD
>> + * in order to instantiate it.
>> + */
>> +
>> + vmf.pte = pte_offset_map(vmf.pmd, address);
>> + vmf.orig_pte = READ_ONCE(*vmf.pte);
>> + barrier(); /* See comment in handle_pte_fault() */
>> + if (pte_none(vmf.orig_pte)) {
>> + pte_unmap(vmf.pte);
>> + vmf.pte = NULL;
>> + }
>> +
>> + vmf.vma = vma;
>> + vmf.pgoff = linear_page_index(vma, address);
>> + vmf.gfp_mask = __get_fault_gfp_mask(vma);
>> + vmf.sequence = seq;
>> + vmf.flags = flags;
>> +
>> + local_irq_enable();
>> +
>> + /*
>> + * We need to re-validate the VMA after checking the bounds, otherwise
>> + * we might have a false positive on the bounds.
>> + */
>> + if (read_seqcount_retry(&vma->vm_sequence, seq))
>> + goto out_put;
>> +
>> + mem_cgroup_enter_user_fault();
>> + ret = handle_pte_fault(&vmf);
>> + mem_cgroup_exit_user_fault();
>> +
>> + put_vma(vma);
>> +
>> + /*
>> + * The task may have entered a memcg OOM situation but
>> + * if the allocation error was handled gracefully (no
>> + * VM_FAULT_OOM), there is no need to kill anything.
>> + * Just clean up the OOM state peacefully.
>> + */
>> + if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
>> + mem_cgroup_oom_synchronize(false);
>> + return ret;
>> +
>> +out_walk:
>> + local_irq_enable();
>> +out_put:
>> + put_vma(vma);
>> + return ret;
>> +}
>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>> +
>> /*
>> * By the time we get here, we already hold the mm semaphore
>> *
>> --
>> 2.21.0
>>
>
^ permalink raw reply
* Re: [PATCH v4 23/63] Documentation: ACPI: move ssdt-overlays.txt to admin-guide/acpi and convert to reST
From: Mauro Carvalho Chehab @ 2019-04-24 14:51 UTC (permalink / raw)
To: Changbin Du
Cc: fenghua.yu, x86, linux-doc, linux-pci, linux-gpio,
Jonathan Corbet, rjw, linux-kernel, linux-acpi, mingo,
Bjorn Helgaas, tglx, linuxppc-dev
In-Reply-To: <20190423162932.21428-24-changbin.du@gmail.com>
Em Wed, 24 Apr 2019 00:28:52 +0800
Changbin Du <changbin.du@gmail.com> escreveu:
> This converts the plain text documentation to reStructuredText format and
> add it to Sphinx TOC tree. No essential content change.
>
> Signed-off-by: Changbin Du <changbin.du@gmail.com>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
> ---
> Documentation/acpi/ssdt-overlays.txt | 172 -----------------
> Documentation/admin-guide/acpi/index.rst | 1 +
> .../admin-guide/acpi/ssdt-overlays.rst | 180 ++++++++++++++++++
> 3 files changed, 181 insertions(+), 172 deletions(-)
> delete mode 100644 Documentation/acpi/ssdt-overlays.txt
> create mode 100644 Documentation/admin-guide/acpi/ssdt-overlays.rst
>
> diff --git a/Documentation/acpi/ssdt-overlays.txt b/Documentation/acpi/ssdt-overlays.txt
> deleted file mode 100644
> index 5ae13f161ea2..000000000000
> --- a/Documentation/acpi/ssdt-overlays.txt
> +++ /dev/null
> @@ -1,172 +0,0 @@
> -
> -In order to support ACPI open-ended hardware configurations (e.g. development
> -boards) we need a way to augment the ACPI configuration provided by the firmware
> -image. A common example is connecting sensors on I2C / SPI buses on development
> -boards.
> -
> -Although this can be accomplished by creating a kernel platform driver or
> -recompiling the firmware image with updated ACPI tables, neither is practical:
> -the former proliferates board specific kernel code while the latter requires
> -access to firmware tools which are often not publicly available.
> -
> -Because ACPI supports external references in AML code a more practical
> -way to augment firmware ACPI configuration is by dynamically loading
> -user defined SSDT tables that contain the board specific information.
> -
> -For example, to enumerate a Bosch BMA222E accelerometer on the I2C bus of the
> -Minnowboard MAX development board exposed via the LSE connector [1], the
> -following ASL code can be used:
> -
> -DefinitionBlock ("minnowmax.aml", "SSDT", 1, "Vendor", "Accel", 0x00000003)
> -{
> - External (\_SB.I2C6, DeviceObj)
> -
> - Scope (\_SB.I2C6)
> - {
> - Device (STAC)
> - {
> - Name (_ADR, Zero)
> - Name (_HID, "BMA222E")
> -
> - Method (_CRS, 0, Serialized)
> - {
> - Name (RBUF, ResourceTemplate ()
> - {
> - I2cSerialBus (0x0018, ControllerInitiated, 0x00061A80,
> - AddressingMode7Bit, "\\_SB.I2C6", 0x00,
> - ResourceConsumer, ,)
> - GpioInt (Edge, ActiveHigh, Exclusive, PullDown, 0x0000,
> - "\\_SB.GPO2", 0x00, ResourceConsumer, , )
> - { // Pin list
> - 0
> - }
> - })
> - Return (RBUF)
> - }
> - }
> - }
> -}
> -
> -which can then be compiled to AML binary format:
> -
> -$ iasl minnowmax.asl
> -
> -Intel ACPI Component Architecture
> -ASL Optimizing Compiler version 20140214-64 [Mar 29 2014]
> -Copyright (c) 2000 - 2014 Intel Corporation
> -
> -ASL Input: minnomax.asl - 30 lines, 614 bytes, 7 keywords
> -AML Output: minnowmax.aml - 165 bytes, 6 named objects, 1 executable opcodes
> -
> -[1] http://wiki.minnowboard.org/MinnowBoard_MAX#Low_Speed_Expansion_Connector_.28Top.29
> -
> -The resulting AML code can then be loaded by the kernel using one of the methods
> -below.
> -
> -== Loading ACPI SSDTs from initrd ==
> -
> -This option allows loading of user defined SSDTs from initrd and it is useful
> -when the system does not support EFI or when there is not enough EFI storage.
> -
> -It works in a similar way with initrd based ACPI tables override/upgrade: SSDT
> -aml code must be placed in the first, uncompressed, initrd under the
> -"kernel/firmware/acpi" path. Multiple files can be used and this will translate
> -in loading multiple tables. Only SSDT and OEM tables are allowed. See
> -initrd_table_override.txt for more details.
> -
> -Here is an example:
> -
> -# Add the raw ACPI tables to an uncompressed cpio archive.
> -# They must be put into a /kernel/firmware/acpi directory inside the
> -# cpio archive.
> -# The uncompressed cpio archive must be the first.
> -# Other, typically compressed cpio archives, must be
> -# concatenated on top of the uncompressed one.
> -mkdir -p kernel/firmware/acpi
> -cp ssdt.aml kernel/firmware/acpi
> -
> -# Create the uncompressed cpio archive and concatenate the original initrd
> -# on top:
> -find kernel | cpio -H newc --create > /boot/instrumented_initrd
> -cat /boot/initrd >>/boot/instrumented_initrd
> -
> -== Loading ACPI SSDTs from EFI variables ==
> -
> -This is the preferred method, when EFI is supported on the platform, because it
> -allows a persistent, OS independent way of storing the user defined SSDTs. There
> -is also work underway to implement EFI support for loading user defined SSDTs
> -and using this method will make it easier to convert to the EFI loading
> -mechanism when that will arrive.
> -
> -In order to load SSDTs from an EFI variable the efivar_ssdt kernel command line
> -parameter can be used. The argument for the option is the variable name to
> -use. If there are multiple variables with the same name but with different
> -vendor GUIDs, all of them will be loaded.
> -
> -In order to store the AML code in an EFI variable the efivarfs filesystem can be
> -used. It is enabled and mounted by default in /sys/firmware/efi/efivars in all
> -recent distribution.
> -
> -Creating a new file in /sys/firmware/efi/efivars will automatically create a new
> -EFI variable. Updating a file in /sys/firmware/efi/efivars will update the EFI
> -variable. Please note that the file name needs to be specially formatted as
> -"Name-GUID" and that the first 4 bytes in the file (little-endian format)
> -represent the attributes of the EFI variable (see EFI_VARIABLE_MASK in
> -include/linux/efi.h). Writing to the file must also be done with one write
> -operation.
> -
> -For example, you can use the following bash script to create/update an EFI
> -variable with the content from a given file:
> -
> -#!/bin/sh -e
> -
> -while ! [ -z "$1" ]; do
> - case "$1" in
> - "-f") filename="$2"; shift;;
> - "-g") guid="$2"; shift;;
> - *) name="$1";;
> - esac
> - shift
> -done
> -
> -usage()
> -{
> - echo "Syntax: ${0##*/} -f filename [ -g guid ] name"
> - exit 1
> -}
> -
> -[ -n "$name" -a -f "$filename" ] || usage
> -
> -EFIVARFS="/sys/firmware/efi/efivars"
> -
> -[ -d "$EFIVARFS" ] || exit 2
> -
> -if stat -tf $EFIVARFS | grep -q -v de5e81e4; then
> - mount -t efivarfs none $EFIVARFS
> -fi
> -
> -# try to pick up an existing GUID
> -[ -n "$guid" ] || guid=$(find "$EFIVARFS" -name "$name-*" | head -n1 | cut -f2- -d-)
> -
> -# use a randomly generated GUID
> -[ -n "$guid" ] || guid="$(cat /proc/sys/kernel/random/uuid)"
> -
> -# efivarfs expects all of the data in one write
> -tmp=$(mktemp)
> -/bin/echo -ne "\007\000\000\000" | cat - $filename > $tmp
> -dd if=$tmp of="$EFIVARFS/$name-$guid" bs=$(stat -c %s $tmp)
> -rm $tmp
> -
> -== Loading ACPI SSDTs from configfs ==
> -
> -This option allows loading of user defined SSDTs from userspace via the configfs
> -interface. The CONFIG_ACPI_CONFIGFS option must be select and configfs must be
> -mounted. In the following examples, we assume that configfs has been mounted in
> -/config.
> -
> -New tables can be loading by creating new directories in /config/acpi/table/ and
> -writing the SSDT aml code in the aml attribute:
> -
> -cd /config/acpi/table
> -mkdir my_ssdt
> -cat ~/ssdt.aml > my_ssdt/aml
> diff --git a/Documentation/admin-guide/acpi/index.rst b/Documentation/admin-guide/acpi/index.rst
> index 9049a7b9f065..4d13eeea1eca 100644
> --- a/Documentation/admin-guide/acpi/index.rst
> +++ b/Documentation/admin-guide/acpi/index.rst
> @@ -10,4 +10,5 @@ the Linux ACPI support.
>
> initrd_table_override
> dsdt-override
> + ssdt-overlays
> cppc_sysfs
> diff --git a/Documentation/admin-guide/acpi/ssdt-overlays.rst b/Documentation/admin-guide/acpi/ssdt-overlays.rst
> new file mode 100644
> index 000000000000..da37455f96c9
> --- /dev/null
> +++ b/Documentation/admin-guide/acpi/ssdt-overlays.rst
> @@ -0,0 +1,180 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=============
> +SSDT Overlays
> +=============
> +
> +In order to support ACPI open-ended hardware configurations (e.g. development
> +boards) we need a way to augment the ACPI configuration provided by the firmware
> +image. A common example is connecting sensors on I2C / SPI buses on development
> +boards.
> +
> +Although this can be accomplished by creating a kernel platform driver or
> +recompiling the firmware image with updated ACPI tables, neither is practical:
> +the former proliferates board specific kernel code while the latter requires
> +access to firmware tools which are often not publicly available.
> +
> +Because ACPI supports external references in AML code a more practical
> +way to augment firmware ACPI configuration is by dynamically loading
> +user defined SSDT tables that contain the board specific information.
> +
> +For example, to enumerate a Bosch BMA222E accelerometer on the I2C bus of the
> +Minnowboard MAX development board exposed via the LSE connector [1], the
> +following ASL code can be used::
> +
> + DefinitionBlock ("minnowmax.aml", "SSDT", 1, "Vendor", "Accel", 0x00000003)
> + {
> + External (\_SB.I2C6, DeviceObj)
> +
> + Scope (\_SB.I2C6)
> + {
> + Device (STAC)
> + {
> + Name (_ADR, Zero)
> + Name (_HID, "BMA222E")
> +
> + Method (_CRS, 0, Serialized)
> + {
> + Name (RBUF, ResourceTemplate ()
> + {
> + I2cSerialBus (0x0018, ControllerInitiated, 0x00061A80,
> + AddressingMode7Bit, "\\_SB.I2C6", 0x00,
> + ResourceConsumer, ,)
> + GpioInt (Edge, ActiveHigh, Exclusive, PullDown, 0x0000,
> + "\\_SB.GPO2", 0x00, ResourceConsumer, , )
> + { // Pin list
> + 0
> + }
> + })
> + Return (RBUF)
> + }
> + }
> + }
> + }
> +
> +which can then be compiled to AML binary format::
> +
> + $ iasl minnowmax.asl
> +
> + Intel ACPI Component Architecture
> + ASL Optimizing Compiler version 20140214-64 [Mar 29 2014]
> + Copyright (c) 2000 - 2014 Intel Corporation
> +
> + ASL Input: minnomax.asl - 30 lines, 614 bytes, 7 keywords
> + AML Output: minnowmax.aml - 165 bytes, 6 named objects, 1 executable opcodes
> +
> +[1] http://wiki.minnowboard.org/MinnowBoard_MAX#Low_Speed_Expansion_Connector_.28Top.29
> +
> +The resulting AML code can then be loaded by the kernel using one of the methods
> +below.
> +
> +Loading ACPI SSDTs from initrd
> +==============================
> +
> +This option allows loading of user defined SSDTs from initrd and it is useful
> +when the system does not support EFI or when there is not enough EFI storage.
> +
> +It works in a similar way with initrd based ACPI tables override/upgrade: SSDT
> +aml code must be placed in the first, uncompressed, initrd under the
> +"kernel/firmware/acpi" path. Multiple files can be used and this will translate
> +in loading multiple tables. Only SSDT and OEM tables are allowed. See
> +initrd_table_override.txt for more details.
> +
> +Here is an example::
> +
> + # Add the raw ACPI tables to an uncompressed cpio archive.
> + # They must be put into a /kernel/firmware/acpi directory inside the
> + # cpio archive.
> + # The uncompressed cpio archive must be the first.
> + # Other, typically compressed cpio archives, must be
> + # concatenated on top of the uncompressed one.
> + mkdir -p kernel/firmware/acpi
> + cp ssdt.aml kernel/firmware/acpi
> +
> + # Create the uncompressed cpio archive and concatenate the original initrd
> + # on top:
> + find kernel | cpio -H newc --create > /boot/instrumented_initrd
> + cat /boot/initrd >>/boot/instrumented_initrd
> +
> +Loading ACPI SSDTs from EFI variables
> +=====================================
> +
> +This is the preferred method, when EFI is supported on the platform, because it
> +allows a persistent, OS independent way of storing the user defined SSDTs. There
> +is also work underway to implement EFI support for loading user defined SSDTs
> +and using this method will make it easier to convert to the EFI loading
> +mechanism when that will arrive.
> +
> +In order to load SSDTs from an EFI variable the efivar_ssdt kernel command line
> +parameter can be used. The argument for the option is the variable name to
> +use. If there are multiple variables with the same name but with different
> +vendor GUIDs, all of them will be loaded.
> +
> +In order to store the AML code in an EFI variable the efivarfs filesystem can be
> +used. It is enabled and mounted by default in /sys/firmware/efi/efivars in all
> +recent distribution.
> +
> +Creating a new file in /sys/firmware/efi/efivars will automatically create a new
> +EFI variable. Updating a file in /sys/firmware/efi/efivars will update the EFI
> +variable. Please note that the file name needs to be specially formatted as
> +"Name-GUID" and that the first 4 bytes in the file (little-endian format)
> +represent the attributes of the EFI variable (see EFI_VARIABLE_MASK in
> +include/linux/efi.h). Writing to the file must also be done with one write
> +operation.
> +
> +For example, you can use the following bash script to create/update an EFI
> +variable with the content from a given file::
> +
> + #!/bin/sh -e
> +
> + while ! [ -z "$1" ]; do
> + case "$1" in
> + "-f") filename="$2"; shift;;
> + "-g") guid="$2"; shift;;
> + *) name="$1";;
> + esac
> + shift
> + done
> +
> + usage()
> + {
> + echo "Syntax: ${0##*/} -f filename [ -g guid ] name"
> + exit 1
> + }
> +
> + [ -n "$name" -a -f "$filename" ] || usage
> +
> + EFIVARFS="/sys/firmware/efi/efivars"
> +
> + [ -d "$EFIVARFS" ] || exit 2
> +
> + if stat -tf $EFIVARFS | grep -q -v de5e81e4; then
> + mount -t efivarfs none $EFIVARFS
> + fi
> +
> + # try to pick up an existing GUID
> + [ -n "$guid" ] || guid=$(find "$EFIVARFS" -name "$name-*" | head -n1 | cut -f2- -d-)
> +
> + # use a randomly generated GUID
> + [ -n "$guid" ] || guid="$(cat /proc/sys/kernel/random/uuid)"
> +
> + # efivarfs expects all of the data in one write
> + tmp=$(mktemp)
> + /bin/echo -ne "\007\000\000\000" | cat - $filename > $tmp
> + dd if=$tmp of="$EFIVARFS/$name-$guid" bs=$(stat -c %s $tmp)
> + rm $tmp
> +
> +Loading ACPI SSDTs from configfs
> +================================
> +
> +This option allows loading of user defined SSDTs from userspace via the configfs
> +interface. The CONFIG_ACPI_CONFIGFS option must be select and configfs must be
> +mounted. In the following examples, we assume that configfs has been mounted in
> +/config.
> +
> +New tables can be loading by creating new directories in /config/acpi/table/ and
> +writing the SSDT aml code in the aml attribute::
> +
> + cd /config/acpi/table
> + mkdir my_ssdt
> + cat ~/ssdt.aml > my_ssdt/aml
Thanks,
Mauro
^ permalink raw reply
* Re: [PATCH v4 22/63] Documentation: ACPI: move lpit.txt to firmware-guide/acpi and convert to reST
From: Mauro Carvalho Chehab @ 2019-04-24 14:49 UTC (permalink / raw)
To: Changbin Du
Cc: fenghua.yu, x86, linux-doc, linux-pci, linux-gpio,
Jonathan Corbet, rjw, linux-kernel, linux-acpi, mingo,
Bjorn Helgaas, tglx, linuxppc-dev
In-Reply-To: <20190423162932.21428-23-changbin.du@gmail.com>
Em Wed, 24 Apr 2019 00:28:51 +0800
Changbin Du <changbin.du@gmail.com> escreveu:
> This converts the plain text documentation to reStructuredText format and
> add it to Sphinx TOC tree. No essential content change.
>
> Signed-off-by: Changbin Du <changbin.du@gmail.com>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
> ---
> Documentation/firmware-guide/acpi/index.rst | 1 +
> .../lpit.txt => firmware-guide/acpi/lpit.rst} | 18 +++++++++++++-----
> 2 files changed, 14 insertions(+), 5 deletions(-)
> rename Documentation/{acpi/lpit.txt => firmware-guide/acpi/lpit.rst} (68%)
>
> diff --git a/Documentation/firmware-guide/acpi/index.rst b/Documentation/firmware-guide/acpi/index.rst
> index fca854f017d8..0e60f4b7129a 100644
> --- a/Documentation/firmware-guide/acpi/index.rst
> +++ b/Documentation/firmware-guide/acpi/index.rst
> @@ -22,3 +22,4 @@ ACPI Support
> gpio-properties
> i2c-muxes
> acpi-lid
> + lpit
> diff --git a/Documentation/acpi/lpit.txt b/Documentation/firmware-guide/acpi/lpit.rst
> similarity index 68%
> rename from Documentation/acpi/lpit.txt
> rename to Documentation/firmware-guide/acpi/lpit.rst
> index b426398d2e97..aca928fab027 100644
> --- a/Documentation/acpi/lpit.txt
> +++ b/Documentation/firmware-guide/acpi/lpit.rst
> @@ -1,3 +1,9 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===========================
> +Low Power Idle Table (LPIT)
> +===========================
> +
> To enumerate platform Low Power Idle states, Intel platforms are using
> “Low Power Idle Table” (LPIT). More details about this table can be
> downloaded from:
> @@ -8,13 +14,15 @@ Residencies for each low power state can be read via FFH
>
> On platforms supporting S0ix sleep states, there can be two types of
> residencies:
> -- CPU PKG C10 (Read via FFH interface)
> -- Platform Controller Hub (PCH) SLP_S0 (Read via memory mapped interface)
> +
> + - CPU PKG C10 (Read via FFH interface)
> + - Platform Controller Hub (PCH) SLP_S0 (Read via memory mapped interface)
>
> The following attributes are added dynamically to the cpuidle
> -sysfs attribute group:
> - /sys/devices/system/cpu/cpuidle/low_power_idle_cpu_residency_us
> - /sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us
> +sysfs attribute group::
> +
> + /sys/devices/system/cpu/cpuidle/low_power_idle_cpu_residency_us
> + /sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us
>
> The "low_power_idle_cpu_residency_us" attribute shows time spent
> by the CPU package in PKG C10
Thanks,
Mauro
^ permalink raw reply
* Re: [PATCH v4 21/63] Documentation: ACPI: move cppc_sysfs.txt to admin-guide/acpi and convert to reST
From: Mauro Carvalho Chehab @ 2019-04-24 14:48 UTC (permalink / raw)
To: Changbin Du
Cc: fenghua.yu, x86, linux-doc, linux-pci, linux-gpio,
Jonathan Corbet, rjw, linux-kernel, linux-acpi, mingo,
Bjorn Helgaas, tglx, linuxppc-dev
In-Reply-To: <20190423162932.21428-22-changbin.du@gmail.com>
Em Wed, 24 Apr 2019 00:28:50 +0800
Changbin Du <changbin.du@gmail.com> escreveu:
> This converts the plain text documentation to reStructuredText format and
> add it to Sphinx TOC tree. No essential content change.
>
> Signed-off-by: Changbin Du <changbin.du@gmail.com>
> ---
> .../acpi/cppc_sysfs.rst} | 71 ++++++++++---------
> Documentation/admin-guide/acpi/index.rst | 1 +
> 2 files changed, 40 insertions(+), 32 deletions(-)
> rename Documentation/{acpi/cppc_sysfs.txt => admin-guide/acpi/cppc_sysfs.rst} (51%)
>
> diff --git a/Documentation/acpi/cppc_sysfs.txt b/Documentation/admin-guide/acpi/cppc_sysfs.rst
> similarity index 51%
> rename from Documentation/acpi/cppc_sysfs.txt
> rename to Documentation/admin-guide/acpi/cppc_sysfs.rst
> index f20fb445135d..a4b99afbe331 100644
> --- a/Documentation/acpi/cppc_sysfs.txt
> +++ b/Documentation/admin-guide/acpi/cppc_sysfs.rst
> @@ -1,5 +1,11 @@
> +.. SPDX-License-Identifier: GPL-2.0
>
> - Collaborative Processor Performance Control (CPPC)
> +==================================================
> +Collaborative Processor Performance Control (CPPC)
> +==================================================
> +
> +CPPC
> +====
>
> CPPC defined in the ACPI spec describes a mechanism for the OS to manage the
> performance of a logical processor on a contigious and abstract performance
> @@ -10,31 +16,28 @@ For more details on CPPC please refer to the ACPI specification at:
>
> http://uefi.org/specifications
>
> -Some of the CPPC registers are exposed via sysfs under:
> -
> -/sys/devices/system/cpu/cpuX/acpi_cppc/
> -
> -for each cpu X
Hmm... removed by mistake?
> +Some of the CPPC registers are exposed via sysfs under::
>
> ---------------------------------------------------------------------------------
> + /sys/devices/system/cpu/cpuX/acpi_cppc/
Did you parse this with Sphinx? It doesn't sound a valid ReST construction
to my eyes, as:
1) I've seen some versions of Sphinx to abort with severe errors when
there's no blank line after the horizontal bar markup;
2) It will very likely ignore the "::" (I didn't test it myself), as you're
not indenting the horizontal bar. End of indentation will mean the end
of an (empty) literal block.
So, I would stick with:
Some of the CPPC registers are exposed via sysfs under:
/sys/devices/system/cpu/cpuX/acpi_cppc/
---------------------------------------------------------------------------------
for each cpu X::
or:
Some of the CPPC registers are exposed via sysfs under:
/sys/devices/system/cpu/cpuX/acpi_cppc/
for each cpu X
--------------------------------------------------------------------------------
::
(with is closer to the original author's intent)
Same applies to the other similar changes on this document.
>
> -$ ls -lR /sys/devices/system/cpu/cpu0/acpi_cppc/
> -/sys/devices/system/cpu/cpu0/acpi_cppc/:
> -total 0
> --r--r--r-- 1 root root 65536 Mar 5 19:38 feedback_ctrs
> --r--r--r-- 1 root root 65536 Mar 5 19:38 highest_perf
> --r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_freq
> --r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_nonlinear_perf
> --r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_perf
> --r--r--r-- 1 root root 65536 Mar 5 19:38 nominal_freq
> --r--r--r-- 1 root root 65536 Mar 5 19:38 nominal_perf
> --r--r--r-- 1 root root 65536 Mar 5 19:38 reference_perf
> --r--r--r-- 1 root root 65536 Mar 5 19:38 wraparound_time
> +for each cpu X::
>
> ---------------------------------------------------------------------------------
> + $ ls -lR /sys/devices/system/cpu/cpu0/acpi_cppc/
> + /sys/devices/system/cpu/cpu0/acpi_cppc/:
> + total 0
> + -r--r--r-- 1 root root 65536 Mar 5 19:38 feedback_ctrs
> + -r--r--r-- 1 root root 65536 Mar 5 19:38 highest_perf
> + -r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_freq
> + -r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_nonlinear_perf
> + -r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_perf
> + -r--r--r-- 1 root root 65536 Mar 5 19:38 nominal_freq
> + -r--r--r-- 1 root root 65536 Mar 5 19:38 nominal_perf
> + -r--r--r-- 1 root root 65536 Mar 5 19:38 reference_perf
> + -r--r--r-- 1 root root 65536 Mar 5 19:38 wraparound_time
>
> * highest_perf : Highest performance of this processor (abstract scale).
> -* nominal_perf : Highest sustained performance of this processor (abstract scale).
> +* nominal_perf : Highest sustained performance of this processor
> + (abstract scale).
> * lowest_nonlinear_perf : Lowest performance of this processor with nonlinear
> power savings (abstract scale).
> * lowest_perf : Lowest performance of this processor (abstract scale).
> @@ -48,22 +51,26 @@ total 0
> * feedback_ctrs : Includes both Reference and delivered performance counter.
> Reference counter ticks up proportional to processor's reference performance.
> Delivered counter ticks up proportional to processor's delivered performance.
> -* wraparound_time: Minimum time for the feedback counters to wraparound (seconds).
> +* wraparound_time: Minimum time for the feedback counters to wraparound
> + (seconds).
> * reference_perf : Performance level at which reference performance counter
> accumulates (abstract scale).
>
> ---------------------------------------------------------------------------------
>
> - Computing Average Delivered Performance
> +Computing Average Delivered Performance
> +=======================================
> +
> +Below describes the steps to compute the average performance delivered by
> +taking two different snapshots of feedback counters at time T1 and T2.
> +
> + T1: Read feedback_ctrs as fbc_t1
> + Wait or run some workload
>
> -Below describes the steps to compute the average performance delivered by taking
> -two different snapshots of feedback counters at time T1 and T2.
> + T2: Read feedback_ctrs as fbc_t2
>
> -T1: Read feedback_ctrs as fbc_t1
> - Wait or run some workload
> -T2: Read feedback_ctrs as fbc_t2
> +::
>
> -delivered_counter_delta = fbc_t2[del] - fbc_t1[del]
> -reference_counter_delta = fbc_t2[ref] - fbc_t1[ref]
> + delivered_counter_delta = fbc_t2[del] - fbc_t1[del]
> + reference_counter_delta = fbc_t2[ref] - fbc_t1[ref]
>
> -delivered_perf = (refernce_perf x delivered_counter_delta) / reference_counter_delta
> + delivered_perf = (refernce_perf x delivered_counter_delta) / reference_counter_delta
> diff --git a/Documentation/admin-guide/acpi/index.rst b/Documentation/admin-guide/acpi/index.rst
> index d68e9914c5ff..9049a7b9f065 100644
> --- a/Documentation/admin-guide/acpi/index.rst
> +++ b/Documentation/admin-guide/acpi/index.rst
> @@ -10,3 +10,4 @@ the Linux ACPI support.
>
> initrd_table_override
> dsdt-override
> + cppc_sysfs
Thanks,
Mauro
^ permalink raw reply
* [PATCH AUTOSEL 4.14 29/35] kmemleak: powerpc: skip scanning holes in the .bss section
From: Sasha Levin @ 2019-04-24 14:47 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Sasha Levin, Radim Krcmar, Catalin Marinas, linuxppc-dev, kvm-ppc,
linux-mm, Paul Mackerras, Avi Kivity, Paolo Bonzini,
Andrew Morton, Linus Torvalds
In-Reply-To: <20190424144709.30215-1-sashal@kernel.org>
From: Catalin Marinas <catalin.marinas@arm.com>
[ Upstream commit 298a32b132087550d3fa80641ca58323c5dfd4d9 ]
Commit 2d4f567103ff ("KVM: PPC: Introduce kvm_tmp framework") adds
kvm_tmp[] into the .bss section and then free the rest of unused spaces
back to the page allocator.
kernel_init
kvm_guest_init
kvm_free_tmp
free_reserved_area
free_unref_page
free_unref_page_prepare
With DEBUG_PAGEALLOC=y, it will unmap those pages from kernel. As the
result, kmemleak scan will trigger a panic when it scans the .bss
section with unmapped pages.
This patch creates dedicated kmemleak objects for the .data, .bss and
potentially .data..ro_after_init sections to allow partial freeing via
the kmemleak_free_part() in the powerpc kvm_free_tmp() function.
Link: http://lkml.kernel.org/r/20190321171917.62049-1-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Qian Cai <cai@lca.pw>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Tested-by: Qian Cai <cai@lca.pw>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
arch/powerpc/kernel/kvm.c | 7 +++++++
mm/kmemleak.c | 16 +++++++++++-----
2 files changed, 18 insertions(+), 5 deletions(-)
diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
index 9ad37f827a97..7b59cc853abf 100644
--- a/arch/powerpc/kernel/kvm.c
+++ b/arch/powerpc/kernel/kvm.c
@@ -22,6 +22,7 @@
#include <linux/kvm_host.h>
#include <linux/init.h>
#include <linux/export.h>
+#include <linux/kmemleak.h>
#include <linux/kvm_para.h>
#include <linux/slab.h>
#include <linux/of.h>
@@ -712,6 +713,12 @@ static void kvm_use_magic_page(void)
static __init void kvm_free_tmp(void)
{
+ /*
+ * Inform kmemleak about the hole in the .bss section since the
+ * corresponding pages will be unmapped with DEBUG_PAGEALLOC=y.
+ */
+ kmemleak_free_part(&kvm_tmp[kvm_tmp_index],
+ ARRAY_SIZE(kvm_tmp) - kvm_tmp_index);
free_reserved_area(&kvm_tmp[kvm_tmp_index],
&kvm_tmp[ARRAY_SIZE(kvm_tmp)], -1, NULL);
}
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index d9e0be2a8189..337be9aacb7a 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1492,11 +1492,6 @@ static void kmemleak_scan(void)
}
rcu_read_unlock();
- /* data/bss scanning */
- scan_large_block(_sdata, _edata);
- scan_large_block(__bss_start, __bss_stop);
- scan_large_block(__start_ro_after_init, __end_ro_after_init);
-
#ifdef CONFIG_SMP
/* per-cpu sections scanning */
for_each_possible_cpu(i)
@@ -2027,6 +2022,17 @@ void __init kmemleak_init(void)
}
local_irq_restore(flags);
+ /* register the data/bss sections */
+ create_object((unsigned long)_sdata, _edata - _sdata,
+ KMEMLEAK_GREY, GFP_ATOMIC);
+ create_object((unsigned long)__bss_start, __bss_stop - __bss_start,
+ KMEMLEAK_GREY, GFP_ATOMIC);
+ /* only register .data..ro_after_init if not within .data */
+ if (__start_ro_after_init < _sdata || __end_ro_after_init > _edata)
+ create_object((unsigned long)__start_ro_after_init,
+ __end_ro_after_init - __start_ro_after_init,
+ KMEMLEAK_GREY, GFP_ATOMIC);
+
/*
* This is the point where tracking allocations is safe. Automatic
* scanning is started during the late initcall. Add the early logged
--
2.19.1
^ permalink raw reply related
* Re: [PATCH v12 21/31] mm: Introduce find_vma_rcu()
From: Laurent Dufour @ 2019-04-24 14:39 UTC (permalink / raw)
To: Jerome Glisse
Cc: jack, sergey.senozhatsky.work, peterz, Will Deacon, mhocko,
linux-mm, paulus, Punit Agrawal, hpa, Michel Lespinasse,
Alexei Starovoitov, Andrea Arcangeli, ak, Minchan Kim,
aneesh.kumar, x86, Matthew Wilcox, Daniel Jordan, Ingo Molnar,
David Rientjes, paulmck, Haiyan Song, npiggin, sj38.park, dave,
kirill, Thomas Gleixner, zhong jiang, Ganesh Mahendran, Yang Shi,
Mike Rapoport, linuxppc-dev, linux-kernel, Sergey Senozhatsky,
vinayak menon, akpm, Tim Chen, haren
In-Reply-To: <20190422205721.GL14666@redhat.com>
Le 22/04/2019 à 22:57, Jerome Glisse a écrit :
> On Tue, Apr 16, 2019 at 03:45:12PM +0200, Laurent Dufour wrote:
>> This allows to search for a VMA structure without holding the mmap_sem.
>>
>> The search is repeated while the mm seqlock is changing and until we found
>> a valid VMA.
>>
>> While under the RCU protection, a reference is taken on the VMA, so the
>> caller must call put_vma() once it not more need the VMA structure.
>>
>> At the time a VMA is inserted in the MM RB tree, in vma_rb_insert(), a
>> reference is taken to the VMA by calling get_vma().
>>
>> When removing a VMA from the MM RB tree, the VMA is not release immediately
>> but at the end of the RCU grace period through vm_rcu_put(). This ensures
>> that the VMA remains allocated until the end the RCU grace period.
>>
>> Since the vm_file pointer, if valid, is released in put_vma(), there is no
>> guarantee that the file pointer will be valid on the returned VMA.
>>
>> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
>
> Minor comments about comment (i love recursion :)) see below.
>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Thanks Jérôme, see my comments to your comments on my comments below ;)
>> ---
>> include/linux/mm_types.h | 1 +
>> mm/internal.h | 5 ++-
>> mm/mmap.c | 76 ++++++++++++++++++++++++++++++++++++++--
>> 3 files changed, 78 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 6a6159e11a3f..9af6694cb95d 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -287,6 +287,7 @@ struct vm_area_struct {
>>
>> #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> atomic_t vm_ref_count;
>> + struct rcu_head vm_rcu;
>> #endif
>> struct rb_node vm_rb;
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 302382bed406..1e368e4afe3c 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -55,7 +55,10 @@ static inline void put_vma(struct vm_area_struct *vma)
>> __free_vma(vma);
>> }
>>
>> -#else
>> +extern struct vm_area_struct *find_vma_rcu(struct mm_struct *mm,
>> + unsigned long addr);
>> +
>> +#else /* CONFIG_SPECULATIVE_PAGE_FAULT */
>>
>> static inline void get_vma(struct vm_area_struct *vma)
>> {
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index c106440dcae7..34bf261dc2c8 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -179,6 +179,18 @@ static inline void mm_write_sequnlock(struct mm_struct *mm)
>> {
>> write_sequnlock(&mm->mm_seq);
>> }
>> +
>> +static void __vm_rcu_put(struct rcu_head *head)
>> +{
>> + struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
>> + vm_rcu);
>> + put_vma(vma);
>> +}
>> +static void vm_rcu_put(struct vm_area_struct *vma)
>> +{
>> + VM_BUG_ON_VMA(!RB_EMPTY_NODE(&vma->vm_rb), vma);
>> + call_rcu(&vma->vm_rcu, __vm_rcu_put);
>> +}
>> #else
>> static inline void mm_write_seqlock(struct mm_struct *mm)
>> {
>> @@ -190,6 +202,8 @@ static inline void mm_write_sequnlock(struct mm_struct *mm)
>>
>> void __free_vma(struct vm_area_struct *vma)
>> {
>> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT))
>> + VM_BUG_ON_VMA(!RB_EMPTY_NODE(&vma->vm_rb), vma);
>> mpol_put(vma_policy(vma));
>> vm_area_free(vma);
>> }
>> @@ -197,11 +211,24 @@ void __free_vma(struct vm_area_struct *vma)
>> /*
>> * Close a vm structure and free it, returning the next.
>> */
>> -static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
>> +static struct vm_area_struct *__remove_vma(struct vm_area_struct *vma)
>> {
>> struct vm_area_struct *next = vma->vm_next;
>>
>> might_sleep();
>> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT) &&
>> + !RB_EMPTY_NODE(&vma->vm_rb)) {
>> + /*
>> + * If the VMA is still linked in the RB tree, we must release
>> + * that reference by calling put_vma().
>> + * This should only happen when called from exit_mmap().
>> + * We forcely clear the node to satisfy the chec in
> ^
> Typo: chec -> check
Yep
>
>> + * __free_vma(). This is safe since the RB tree is not walked
>> + * anymore.
>> + */
>> + RB_CLEAR_NODE(&vma->vm_rb);
>> + put_vma(vma);
>> + }
>> if (vma->vm_ops && vma->vm_ops->close)
>> vma->vm_ops->close(vma);
>> if (vma->vm_file)
>> @@ -211,6 +238,13 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
>> return next;
>> }
>>
>> +static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
>> +{
>> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT))
>> + VM_BUG_ON_VMA(!RB_EMPTY_NODE(&vma->vm_rb), vma);
>
> Adding a comment here explaining the BUG_ON so people can understand
> what is wrong if that happens. For instance:
>
> /*
> * remove_vma() should be call only once a vma have been remove from the rbtree
> * at which point the vma->vm_rb is an empty node. The exception is when vmas
> * are destroy through exit_mmap() in which case we do not bother updating the
> * rbtree (see comment in __remove_vma()).
> */
I agree !
>> + return __remove_vma(vma);
>> +}
>> +
>> static int do_brk_flags(unsigned long addr, unsigned long request, unsigned long flags,
>> struct list_head *uf);
>> SYSCALL_DEFINE1(brk, unsigned long, brk)
>> @@ -475,7 +509,7 @@ static inline void vma_rb_insert(struct vm_area_struct *vma,
>>
>> /* All rb_subtree_gap values must be consistent prior to insertion */
>> validate_mm_rb(root, NULL);
>> -
>> + get_vma(vma);
>> rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
>> }
>>
>> @@ -491,6 +525,14 @@ static void __vma_rb_erase(struct vm_area_struct *vma, struct mm_struct *mm)
>> mm_write_seqlock(mm);
>> rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
>> mm_write_sequnlock(mm); /* wmb */
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> + /*
>> + * Ensure the removal is complete before clearing the node.
>> + * Matched by vma_has_changed()/handle_speculative_fault().
>> + */
>> + RB_CLEAR_NODE(&vma->vm_rb);
>> + vm_rcu_put(vma);
>> +#endif
>> }
>>
>> static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
>> @@ -2331,6 +2373,34 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
>>
>> EXPORT_SYMBOL(find_vma);
>>
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> +/*
>> + * Like find_vma() but under the protection of RCU and the mm sequence counter.
>> + * The vma returned has to be relaesed by the caller through the call to
>> + * put_vma()
>> + */
>> +struct vm_area_struct *find_vma_rcu(struct mm_struct *mm, unsigned long addr)
>> +{
>> + struct vm_area_struct *vma = NULL;
>> + unsigned int seq;
>> +
>> + do {
>> + if (vma)
>> + put_vma(vma);
>> +
>> + seq = read_seqbegin(&mm->mm_seq);
>> +
>> + rcu_read_lock();
>> + vma = find_vma(mm, addr);
>> + if (vma)
>> + get_vma(vma);
>> + rcu_read_unlock();
>> + } while (read_seqretry(&mm->mm_seq, seq));
>> +
>> + return vma;
>> +}
>> +#endif
>> +
>> /*
>> * Same as find_vma, but also return a pointer to the previous VMA in *pprev.
>> */
>> @@ -3231,7 +3301,7 @@ void exit_mmap(struct mm_struct *mm)
>> while (vma) {
>> if (vma->vm_flags & VM_ACCOUNT)
>> nr_accounted += vma_pages(vma);
>> - vma = remove_vma(vma);
>> + vma = __remove_vma(vma);
>> }
>> vm_unacct_memory(nr_accounted);
>> }
>> --
>> 2.21.0
>>
>
^ permalink raw reply
* [PATCH AUTOSEL 4.19 44/52] kmemleak: powerpc: skip scanning holes in the .bss section
From: Sasha Levin @ 2019-04-24 14:39 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Sasha Levin, Radim Krcmar, Catalin Marinas, linuxppc-dev, kvm-ppc,
linux-mm, Paul Mackerras, Avi Kivity, Paolo Bonzini,
Andrew Morton, Linus Torvalds
In-Reply-To: <20190424143911.28890-1-sashal@kernel.org>
From: Catalin Marinas <catalin.marinas@arm.com>
[ Upstream commit 298a32b132087550d3fa80641ca58323c5dfd4d9 ]
Commit 2d4f567103ff ("KVM: PPC: Introduce kvm_tmp framework") adds
kvm_tmp[] into the .bss section and then free the rest of unused spaces
back to the page allocator.
kernel_init
kvm_guest_init
kvm_free_tmp
free_reserved_area
free_unref_page
free_unref_page_prepare
With DEBUG_PAGEALLOC=y, it will unmap those pages from kernel. As the
result, kmemleak scan will trigger a panic when it scans the .bss
section with unmapped pages.
This patch creates dedicated kmemleak objects for the .data, .bss and
potentially .data..ro_after_init sections to allow partial freeing via
the kmemleak_free_part() in the powerpc kvm_free_tmp() function.
Link: http://lkml.kernel.org/r/20190321171917.62049-1-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Qian Cai <cai@lca.pw>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Tested-by: Qian Cai <cai@lca.pw>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
arch/powerpc/kernel/kvm.c | 7 +++++++
mm/kmemleak.c | 16 +++++++++++-----
2 files changed, 18 insertions(+), 5 deletions(-)
diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
index 683b5b3805bd..cd381e2291df 100644
--- a/arch/powerpc/kernel/kvm.c
+++ b/arch/powerpc/kernel/kvm.c
@@ -22,6 +22,7 @@
#include <linux/kvm_host.h>
#include <linux/init.h>
#include <linux/export.h>
+#include <linux/kmemleak.h>
#include <linux/kvm_para.h>
#include <linux/slab.h>
#include <linux/of.h>
@@ -712,6 +713,12 @@ static void kvm_use_magic_page(void)
static __init void kvm_free_tmp(void)
{
+ /*
+ * Inform kmemleak about the hole in the .bss section since the
+ * corresponding pages will be unmapped with DEBUG_PAGEALLOC=y.
+ */
+ kmemleak_free_part(&kvm_tmp[kvm_tmp_index],
+ ARRAY_SIZE(kvm_tmp) - kvm_tmp_index);
free_reserved_area(&kvm_tmp[kvm_tmp_index],
&kvm_tmp[ARRAY_SIZE(kvm_tmp)], -1, NULL);
}
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 17dd883198ae..5912a26e041c 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1501,11 +1501,6 @@ static void kmemleak_scan(void)
}
rcu_read_unlock();
- /* data/bss scanning */
- scan_large_block(_sdata, _edata);
- scan_large_block(__bss_start, __bss_stop);
- scan_large_block(__start_ro_after_init, __end_ro_after_init);
-
#ifdef CONFIG_SMP
/* per-cpu sections scanning */
for_each_possible_cpu(i)
@@ -2036,6 +2031,17 @@ void __init kmemleak_init(void)
}
local_irq_restore(flags);
+ /* register the data/bss sections */
+ create_object((unsigned long)_sdata, _edata - _sdata,
+ KMEMLEAK_GREY, GFP_ATOMIC);
+ create_object((unsigned long)__bss_start, __bss_stop - __bss_start,
+ KMEMLEAK_GREY, GFP_ATOMIC);
+ /* only register .data..ro_after_init if not within .data */
+ if (__start_ro_after_init < _sdata || __end_ro_after_init > _edata)
+ create_object((unsigned long)__start_ro_after_init,
+ __end_ro_after_init - __start_ro_after_init,
+ KMEMLEAK_GREY, GFP_ATOMIC);
+
/*
* This is the point where tracking allocations is safe. Automatic
* scanning is started during the late initcall. Add the early logged
--
2.19.1
^ permalink raw reply related
* [PATCH AUTOSEL 5.0 53/66] kmemleak: powerpc: skip scanning holes in the .bss section
From: Sasha Levin @ 2019-04-24 14:33 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Sasha Levin, Radim Krcmar, Catalin Marinas, linuxppc-dev, kvm-ppc,
linux-mm, Paul Mackerras, Avi Kivity, Paolo Bonzini,
Andrew Morton, Linus Torvalds
In-Reply-To: <20190424143341.27665-1-sashal@kernel.org>
From: Catalin Marinas <catalin.marinas@arm.com>
[ Upstream commit 298a32b132087550d3fa80641ca58323c5dfd4d9 ]
Commit 2d4f567103ff ("KVM: PPC: Introduce kvm_tmp framework") adds
kvm_tmp[] into the .bss section and then free the rest of unused spaces
back to the page allocator.
kernel_init
kvm_guest_init
kvm_free_tmp
free_reserved_area
free_unref_page
free_unref_page_prepare
With DEBUG_PAGEALLOC=y, it will unmap those pages from kernel. As the
result, kmemleak scan will trigger a panic when it scans the .bss
section with unmapped pages.
This patch creates dedicated kmemleak objects for the .data, .bss and
potentially .data..ro_after_init sections to allow partial freeing via
the kmemleak_free_part() in the powerpc kvm_free_tmp() function.
Link: http://lkml.kernel.org/r/20190321171917.62049-1-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Qian Cai <cai@lca.pw>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Tested-by: Qian Cai <cai@lca.pw>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
---
arch/powerpc/kernel/kvm.c | 7 +++++++
mm/kmemleak.c | 16 +++++++++++-----
2 files changed, 18 insertions(+), 5 deletions(-)
diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
index 683b5b3805bd..cd381e2291df 100644
--- a/arch/powerpc/kernel/kvm.c
+++ b/arch/powerpc/kernel/kvm.c
@@ -22,6 +22,7 @@
#include <linux/kvm_host.h>
#include <linux/init.h>
#include <linux/export.h>
+#include <linux/kmemleak.h>
#include <linux/kvm_para.h>
#include <linux/slab.h>
#include <linux/of.h>
@@ -712,6 +713,12 @@ static void kvm_use_magic_page(void)
static __init void kvm_free_tmp(void)
{
+ /*
+ * Inform kmemleak about the hole in the .bss section since the
+ * corresponding pages will be unmapped with DEBUG_PAGEALLOC=y.
+ */
+ kmemleak_free_part(&kvm_tmp[kvm_tmp_index],
+ ARRAY_SIZE(kvm_tmp) - kvm_tmp_index);
free_reserved_area(&kvm_tmp[kvm_tmp_index],
&kvm_tmp[ARRAY_SIZE(kvm_tmp)], -1, NULL);
}
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 707fa5579f66..6c318f5ac234 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1529,11 +1529,6 @@ static void kmemleak_scan(void)
}
rcu_read_unlock();
- /* data/bss scanning */
- scan_large_block(_sdata, _edata);
- scan_large_block(__bss_start, __bss_stop);
- scan_large_block(__start_ro_after_init, __end_ro_after_init);
-
#ifdef CONFIG_SMP
/* per-cpu sections scanning */
for_each_possible_cpu(i)
@@ -2071,6 +2066,17 @@ void __init kmemleak_init(void)
}
local_irq_restore(flags);
+ /* register the data/bss sections */
+ create_object((unsigned long)_sdata, _edata - _sdata,
+ KMEMLEAK_GREY, GFP_ATOMIC);
+ create_object((unsigned long)__bss_start, __bss_stop - __bss_start,
+ KMEMLEAK_GREY, GFP_ATOMIC);
+ /* only register .data..ro_after_init if not within .data */
+ if (__start_ro_after_init < _sdata || __end_ro_after_init > _edata)
+ create_object((unsigned long)__start_ro_after_init,
+ __end_ro_after_init - __start_ro_after_init,
+ KMEMLEAK_GREY, GFP_ATOMIC);
+
/*
* This is the point where tracking allocations is safe. Automatic
* scanning is started during the late initcall. Add the early logged
--
2.19.1
^ permalink raw reply related
* Re: [PATCH v4 20/63] Documentation: ACPI: move apei/einj.txt to firmware-guide/acpi and convert to reST
From: Mauro Carvalho Chehab @ 2019-04-24 14:33 UTC (permalink / raw)
To: Changbin Du
Cc: fenghua.yu, x86, linux-doc, linux-pci, linux-gpio,
Jonathan Corbet, rjw, linux-kernel, linux-acpi, mingo,
Bjorn Helgaas, tglx, linuxppc-dev
In-Reply-To: <20190423162932.21428-21-changbin.du@gmail.com>
Em Wed, 24 Apr 2019 00:28:49 +0800
Changbin Du <changbin.du@gmail.com> escreveu:
> This converts the plain text documentation to reStructuredText format and
> add it to Sphinx TOC tree. No essential content change.
>
> Signed-off-by: Changbin Du <changbin.du@gmail.com>
> ---
> .../acpi/apei/einj.rst} | 98 ++++++++++---------
> Documentation/firmware-guide/acpi/index.rst | 1 +
> 2 files changed, 53 insertions(+), 46 deletions(-)
> rename Documentation/{acpi/apei/einj.txt => firmware-guide/acpi/apei/einj.rst} (67%)
>
> diff --git a/Documentation/acpi/apei/einj.txt b/Documentation/firmware-guide/acpi/apei/einj.rst
> similarity index 67%
> rename from Documentation/acpi/apei/einj.txt
> rename to Documentation/firmware-guide/acpi/apei/einj.rst
> index e550c8b98139..d85e2667155c 100644
> --- a/Documentation/acpi/apei/einj.txt
> +++ b/Documentation/firmware-guide/acpi/apei/einj.rst
> @@ -1,13 +1,16 @@
> - APEI Error INJection
> - ~~~~~~~~~~~~~~~~~~~~
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +====================
> +APEI Error INJection
> +====================
>
> EINJ provides a hardware error injection mechanism. It is very useful
> for debugging and testing APEI and RAS features in general.
>
> You need to check whether your BIOS supports EINJ first. For that, look
> -for early boot messages similar to this one:
> +for early boot messages similar to this one::
>
> -ACPI: EINJ 0x000000007370A000 000150 (v01 INTEL 00000001 INTL 00000001)
> + ACPI: EINJ 0x000000007370A000 000150 (v01 INTEL 00000001 INTL 00000001)
>
> which shows that the BIOS is exposing an EINJ table - it is the
> mechanism through which the injection is done.
> @@ -23,11 +26,11 @@ order to see the APEI,EINJ,... functionality supported and exposed by
> the BIOS menu.
>
> To use EINJ, make sure the following are options enabled in your kernel
> -configuration:
> +configuration::
>
> -CONFIG_DEBUG_FS
> -CONFIG_ACPI_APEI
> -CONFIG_ACPI_APEI_EINJ
> + CONFIG_DEBUG_FS
> + CONFIG_ACPI_APEI
> + CONFIG_ACPI_APEI_EINJ
>
> The EINJ user interface is in <debugfs mount point>/apei/einj.
>
> @@ -35,22 +38,22 @@ The following files belong to it:
>
> - available_error_type
>
> - This file shows which error types are supported:
> -
> - Error Type Value Error Description
> - ================ =================
> - 0x00000001 Processor Correctable
> - 0x00000002 Processor Uncorrectable non-fatal
> - 0x00000004 Processor Uncorrectable fatal
> - 0x00000008 Memory Correctable
> - 0x00000010 Memory Uncorrectable non-fatal
> - 0x00000020 Memory Uncorrectable fatal
> - 0x00000040 PCI Express Correctable
> - 0x00000080 PCI Express Uncorrectable fatal
> - 0x00000100 PCI Express Uncorrectable non-fatal
> - 0x00000200 Platform Correctable
> - 0x00000400 Platform Uncorrectable non-fatal
> - 0x00000800 Platform Uncorrectable fatal
> + This file shows which error types are supported::
> +
> + Error Type Value Error Description
> + ================ =================
> + 0x00000001 Processor Correctable
> + 0x00000002 Processor Uncorrectable non-fatal
> + 0x00000004 Processor Uncorrectable fatal
> + 0x00000008 Memory Correctable
> + 0x00000010 Memory Uncorrectable non-fatal
> + 0x00000020 Memory Uncorrectable fatal
> + 0x00000040 PCI Express Correctable
> + 0x00000080 PCI Express Uncorrectable fatal
> + 0x00000100 PCI Express Uncorrectable non-fatal
> + 0x00000200 Platform Correctable
> + 0x00000400 Platform Uncorrectable non-fatal
> + 0x00000800 Platform Uncorrectable fatal
This is a table and not a literal block.
The best here to preserve the author's intent is to just adjust the table
markups in order to make it parseable, e. g.:
This file shows which error types are supported:
================ ===================================
Error Type Value Error Description
================ ===================================
0x00000001 Processor Correctable
0x00000002 Processor Uncorrectable non-fatal
0x00000004 Processor Uncorrectable fatal
0x00000008 Memory Correctable
0x00000010 Memory Uncorrectable non-fatal
0x00000020 Memory Uncorrectable fatal
0x00000040 PCI Express Correctable
0x00000080 PCI Express Uncorrectable fatal
0x00000100 PCI Express Uncorrectable non-fatal
0x00000200 Platform Correctable
0x00000400 Platform Uncorrectable non-fatal
0x00000800 Platform Uncorrectable fatal
================ ===================================
After such change:
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
>
> The format of the file contents are as above, except present are only
> the available error types.
> @@ -73,9 +76,12 @@ The following files belong to it:
> injection. Value is a bitmask as specified in ACPI5.0 spec for the
> SET_ERROR_TYPE_WITH_ADDRESS data structure:
>
> - Bit 0 - Processor APIC field valid (see param3 below).
> - Bit 1 - Memory address and mask valid (param1 and param2).
> - Bit 2 - PCIe (seg,bus,dev,fn) valid (see param4 below).
> + Bit 0
> + Processor APIC field valid (see param3 below).
> + Bit 1
> + Memory address and mask valid (param1 and param2).
> + Bit 2
> + PCIe (seg,bus,dev,fn) valid (see param4 below).
>
> If set to zero, legacy behavior is mimicked where the type of
> injection specifies just one bit set, and param1 is multiplexed.
> @@ -121,7 +127,7 @@ BIOS versions based on the ACPI 5.0 specification have more control over
> the target of the injection. For processor-related errors (type 0x1, 0x2
> and 0x4), you can set flags to 0x3 (param3 for bit 0, and param1 and
> param2 for bit 1) so that you have more information added to the error
> -signature being injected. The actual data passed is this:
> +signature being injected. The actual data passed is this::
>
> memory_address = param1;
> memory_address_range = param2;
> @@ -131,7 +137,7 @@ signature being injected. The actual data passed is this:
> For memory errors (type 0x8, 0x10 and 0x20) the address is set using
> param1 with a mask in param2 (0x0 is equivalent to all ones). For PCI
> express errors (type 0x40, 0x80 and 0x100) the segment, bus, device and
> -function are specified using param1:
> +function are specified using param1::
>
> 31 24 23 16 15 11 10 8 7 0
> +-------------------------------------------------+
> @@ -152,26 +158,26 @@ documentation for details (and expect changes to this API if vendors
> creativity in using this feature expands beyond our expectations).
>
>
> -An error injection example:
> +An error injection example::
>
> -# cd /sys/kernel/debug/apei/einj
> -# cat available_error_type # See which errors can be injected
> -0x00000002 Processor Uncorrectable non-fatal
> -0x00000008 Memory Correctable
> -0x00000010 Memory Uncorrectable non-fatal
> -# echo 0x12345000 > param1 # Set memory address for injection
> -# echo $((-1 << 12)) > param2 # Mask 0xfffffffffffff000 - anywhere in this page
> -# echo 0x8 > error_type # Choose correctable memory error
> -# echo 1 > error_inject # Inject now
> + # cd /sys/kernel/debug/apei/einj
> + # cat available_error_type # See which errors can be injected
> + 0x00000002 Processor Uncorrectable non-fatal
> + 0x00000008 Memory Correctable
> + 0x00000010 Memory Uncorrectable non-fatal
> + # echo 0x12345000 > param1 # Set memory address for injection
> + # echo $((-1 << 12)) > param2 # Mask 0xfffffffffffff000 - anywhere in this page
> + # echo 0x8 > error_type # Choose correctable memory error
> + # echo 1 > error_inject # Inject now
>
> -You should see something like this in dmesg:
> +You should see something like this in dmesg::
>
> -[22715.830801] EDAC sbridge MC3: HANDLING MCE MEMORY ERROR
> -[22715.834759] EDAC sbridge MC3: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090
> -[22715.834759] EDAC sbridge MC3: TSC 0
> -[22715.834759] EDAC sbridge MC3: ADDR 12345000 EDAC sbridge MC3: MISC 144780c86
> -[22715.834759] EDAC sbridge MC3: PROCESSOR 0:306e7 TIME 1422553404 SOCKET 0 APIC 0
> -[22716.616173] EDAC MC3: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x12345 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
> + [22715.830801] EDAC sbridge MC3: HANDLING MCE MEMORY ERROR
> + [22715.834759] EDAC sbridge MC3: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090
> + [22715.834759] EDAC sbridge MC3: TSC 0
> + [22715.834759] EDAC sbridge MC3: ADDR 12345000 EDAC sbridge MC3: MISC 144780c86
> + [22715.834759] EDAC sbridge MC3: PROCESSOR 0:306e7 TIME 1422553404 SOCKET 0 APIC 0
> + [22716.616173] EDAC MC3: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x12345 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
>
> For more information about EINJ, please refer to ACPI specification
> version 4.0, section 17.5 and ACPI 5.0, section 18.6.
> diff --git a/Documentation/firmware-guide/acpi/index.rst b/Documentation/firmware-guide/acpi/index.rst
> index 869badba6d7a..fca854f017d8 100644
> --- a/Documentation/firmware-guide/acpi/index.rst
> +++ b/Documentation/firmware-guide/acpi/index.rst
> @@ -18,6 +18,7 @@ ACPI Support
> debug
> aml-debugger
> apei/output_format
> + apei/einj
> gpio-properties
> i2c-muxes
> acpi-lid
Thanks,
Mauro
^ permalink raw reply
* Re: [PATCH v4 19/63] Documentation: ACPI: move apei/output_format.txt to firmware-guide/acpi and convert to reST
From: Mauro Carvalho Chehab @ 2019-04-24 14:29 UTC (permalink / raw)
To: Changbin Du
Cc: fenghua.yu, x86, linux-doc, linux-pci, linux-gpio,
Jonathan Corbet, rjw, linux-kernel, linux-acpi, mingo,
Bjorn Helgaas, tglx, linuxppc-dev
In-Reply-To: <20190423162932.21428-20-changbin.du@gmail.com>
Em Wed, 24 Apr 2019 00:28:48 +0800
Changbin Du <changbin.du@gmail.com> escreveu:
> This converts the plain text documentation to reStructuredText format and
> add it to Sphinx TOC tree. No essential content change.
>
> Signed-off-by: Changbin Du <changbin.du@gmail.com>
For the conversion changes:
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
> ---
> Documentation/acpi/apei/output_format.txt | 147 -----------------
> .../acpi/apei/output_format.rst | 150 ++++++++++++++++++
> Documentation/firmware-guide/acpi/index.rst | 1 +
> 3 files changed, 151 insertions(+), 147 deletions(-)
> delete mode 100644 Documentation/acpi/apei/output_format.txt
> create mode 100644 Documentation/firmware-guide/acpi/apei/output_format.rst
>
> diff --git a/Documentation/acpi/apei/output_format.txt b/Documentation/acpi/apei/output_format.txt
> deleted file mode 100644
> index 0c49c197c47a..000000000000
> --- a/Documentation/acpi/apei/output_format.txt
> +++ /dev/null
> @@ -1,147 +0,0 @@
> - APEI output format
> - ~~~~~~~~~~~~~~~~~~
> -
> -APEI uses printk as hardware error reporting interface, the output
> -format is as follow.
> -
> -<error record> :=
> -APEI generic hardware error status
> -severity: <integer>, <severity string>
> -section: <integer>, severity: <integer>, <severity string>
> -flags: <integer>
> -<section flags strings>
> -fru_id: <uuid string>
> -fru_text: <string>
> -section_type: <section type string>
> -<section data>
> -
> -<severity string>* := recoverable | fatal | corrected | info
> -
> -<section flags strings># :=
> -[primary][, containment warning][, reset][, threshold exceeded]\
> -[, resource not accessible][, latent error]
> -
> -<section type string> := generic processor error | memory error | \
> -PCIe error | unknown, <uuid string>
> -
> -<section data> :=
> -<generic processor section data> | <memory section data> | \
> -<pcie section data> | <null>
> -
> -<generic processor section data> :=
> -[processor_type: <integer>, <proc type string>]
> -[processor_isa: <integer>, <proc isa string>]
> -[error_type: <integer>
> -<proc error type strings>]
> -[operation: <integer>, <proc operation string>]
> -[flags: <integer>
> -<proc flags strings>]
> -[level: <integer>]
> -[version_info: <integer>]
> -[processor_id: <integer>]
> -[target_address: <integer>]
> -[requestor_id: <integer>]
> -[responder_id: <integer>]
> -[IP: <integer>]
> -
> -<proc type string>* := IA32/X64 | IA64
> -
> -<proc isa string>* := IA32 | IA64 | X64
> -
> -<processor error type strings># :=
> -[cache error][, TLB error][, bus error][, micro-architectural error]
> -
> -<proc operation string>* := unknown or generic | data read | data write | \
> -instruction execution
> -
> -<proc flags strings># :=
> -[restartable][, precise IP][, overflow][, corrected]
> -
> -<memory section data> :=
> -[error_status: <integer>]
> -[physical_address: <integer>]
> -[physical_address_mask: <integer>]
> -[node: <integer>]
> -[card: <integer>]
> -[module: <integer>]
> -[bank: <integer>]
> -[device: <integer>]
> -[row: <integer>]
> -[column: <integer>]
> -[bit_position: <integer>]
> -[requestor_id: <integer>]
> -[responder_id: <integer>]
> -[target_id: <integer>]
> -[error_type: <integer>, <mem error type string>]
> -
> -<mem error type string>* :=
> -unknown | no error | single-bit ECC | multi-bit ECC | \
> -single-symbol chipkill ECC | multi-symbol chipkill ECC | master abort | \
> -target abort | parity error | watchdog timeout | invalid address | \
> -mirror Broken | memory sparing | scrub corrected error | \
> -scrub uncorrected error
> -
> -<pcie section data> :=
> -[port_type: <integer>, <pcie port type string>]
> -[version: <integer>.<integer>]
> -[command: <integer>, status: <integer>]
> -[device_id: <integer>:<integer>:<integer>.<integer>
> -slot: <integer>
> -secondary_bus: <integer>
> -vendor_id: <integer>, device_id: <integer>
> -class_code: <integer>]
> -[serial number: <integer>, <integer>]
> -[bridge: secondary_status: <integer>, control: <integer>]
> -[aer_status: <integer>, aer_mask: <integer>
> -<aer status string>
> -[aer_uncor_severity: <integer>]
> -aer_layer=<aer layer string>, aer_agent=<aer agent string>
> -aer_tlp_header: <integer> <integer> <integer> <integer>]
> -
> -<pcie port type string>* := PCIe end point | legacy PCI end point | \
> -unknown | unknown | root port | upstream switch port | \
> -downstream switch port | PCIe to PCI/PCI-X bridge | \
> -PCI/PCI-X to PCIe bridge | root complex integrated endpoint device | \
> -root complex event collector
> -
> -if section severity is fatal or recoverable
> -<aer status string># :=
> -unknown | unknown | unknown | unknown | Data Link Protocol | \
> -unknown | unknown | unknown | unknown | unknown | unknown | unknown | \
> -Poisoned TLP | Flow Control Protocol | Completion Timeout | \
> -Completer Abort | Unexpected Completion | Receiver Overflow | \
> -Malformed TLP | ECRC | Unsupported Request
> -else
> -<aer status string># :=
> -Receiver Error | unknown | unknown | unknown | unknown | unknown | \
> -Bad TLP | Bad DLLP | RELAY_NUM Rollover | unknown | unknown | unknown | \
> -Replay Timer Timeout | Advisory Non-Fatal
> -fi
> -
> -<aer layer string> :=
> -Physical Layer | Data Link Layer | Transaction Layer
> -
> -<aer agent string> :=
> -Receiver ID | Requester ID | Completer ID | Transmitter ID
> -
> -Where, [] designate corresponding content is optional
> -
> -All <field string> description with * has the following format:
> -
> -field: <integer>, <field string>
> -
> -Where value of <integer> should be the position of "string" in <field
> -string> description. Otherwise, <field string> will be "unknown".
> -
> -All <field strings> description with # has the following format:
> -
> -field: <integer>
> -<field strings>
> -
> -Where each string in <fields strings> corresponding to one set bit of
> -<integer>. The bit position is the position of "string" in <field
> -strings> description.
> -
> -For more detailed explanation of every field, please refer to UEFI
> -specification version 2.3 or later, section Appendix N: Common
> -Platform Error Record.
> diff --git a/Documentation/firmware-guide/acpi/apei/output_format.rst b/Documentation/firmware-guide/acpi/apei/output_format.rst
> new file mode 100644
> index 000000000000..c2e7ebddb529
> --- /dev/null
> +++ b/Documentation/firmware-guide/acpi/apei/output_format.rst
> @@ -0,0 +1,150 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==================
> +APEI output format
> +==================
> +
> +APEI uses printk as hardware error reporting interface, the output
> +format is as follow::
> +
> + <error record> :=
> + APEI generic hardware error status
> + severity: <integer>, <severity string>
> + section: <integer>, severity: <integer>, <severity string>
> + flags: <integer>
> + <section flags strings>
> + fru_id: <uuid string>
> + fru_text: <string>
> + section_type: <section type string>
> + <section data>
> +
> + <severity string>* := recoverable | fatal | corrected | info
> +
> + <section flags strings># :=
> + [primary][, containment warning][, reset][, threshold exceeded]\
> + [, resource not accessible][, latent error]
> +
> + <section type string> := generic processor error | memory error | \
> + PCIe error | unknown, <uuid string>
> +
> + <section data> :=
> + <generic processor section data> | <memory section data> | \
> + <pcie section data> | <null>
> +
> + <generic processor section data> :=
> + [processor_type: <integer>, <proc type string>]
> + [processor_isa: <integer>, <proc isa string>]
> + [error_type: <integer>
> + <proc error type strings>]
> + [operation: <integer>, <proc operation string>]
> + [flags: <integer>
> + <proc flags strings>]
> + [level: <integer>]
> + [version_info: <integer>]
> + [processor_id: <integer>]
> + [target_address: <integer>]
> + [requestor_id: <integer>]
> + [responder_id: <integer>]
> + [IP: <integer>]
> +
> + <proc type string>* := IA32/X64 | IA64
> +
> + <proc isa string>* := IA32 | IA64 | X64
> +
> + <processor error type strings># :=
> + [cache error][, TLB error][, bus error][, micro-architectural error]
> +
> + <proc operation string>* := unknown or generic | data read | data write | \
> + instruction execution
> +
> + <proc flags strings># :=
> + [restartable][, precise IP][, overflow][, corrected]
> +
> + <memory section data> :=
> + [error_status: <integer>]
> + [physical_address: <integer>]
> + [physical_address_mask: <integer>]
> + [node: <integer>]
> + [card: <integer>]
> + [module: <integer>]
> + [bank: <integer>]
> + [device: <integer>]
> + [row: <integer>]
> + [column: <integer>]
> + [bit_position: <integer>]
> + [requestor_id: <integer>]
> + [responder_id: <integer>]
> + [target_id: <integer>]
> + [error_type: <integer>, <mem error type string>]
> +
> + <mem error type string>* :=
> + unknown | no error | single-bit ECC | multi-bit ECC | \
> + single-symbol chipkill ECC | multi-symbol chipkill ECC | master abort | \
> + target abort | parity error | watchdog timeout | invalid address | \
> + mirror Broken | memory sparing | scrub corrected error | \
> + scrub uncorrected error
> +
> + <pcie section data> :=
> + [port_type: <integer>, <pcie port type string>]
> + [version: <integer>.<integer>]
> + [command: <integer>, status: <integer>]
> + [device_id: <integer>:<integer>:<integer>.<integer>
> + slot: <integer>
> + secondary_bus: <integer>
> + vendor_id: <integer>, device_id: <integer>
> + class_code: <integer>]
> + [serial number: <integer>, <integer>]
> + [bridge: secondary_status: <integer>, control: <integer>]
> + [aer_status: <integer>, aer_mask: <integer>
> + <aer status string>
> + [aer_uncor_severity: <integer>]
> + aer_layer=<aer layer string>, aer_agent=<aer agent string>
> + aer_tlp_header: <integer> <integer> <integer> <integer>]
> +
> + <pcie port type string>* := PCIe end point | legacy PCI end point | \
> + unknown | unknown | root port | upstream switch port | \
> + downstream switch port | PCIe to PCI/PCI-X bridge | \
> + PCI/PCI-X to PCIe bridge | root complex integrated endpoint device | \
> + root complex event collector
> +
> + if section severity is fatal or recoverable
> + <aer status string># :=
> + unknown | unknown | unknown | unknown | Data Link Protocol | \
> + unknown | unknown | unknown | unknown | unknown | unknown | unknown | \
> + Poisoned TLP | Flow Control Protocol | Completion Timeout | \
> + Completer Abort | Unexpected Completion | Receiver Overflow | \
> + Malformed TLP | ECRC | Unsupported Request
> + else
> + <aer status string># :=
> + Receiver Error | unknown | unknown | unknown | unknown | unknown | \
> + Bad TLP | Bad DLLP | RELAY_NUM Rollover | unknown | unknown | unknown | \
> + Replay Timer Timeout | Advisory Non-Fatal
> + fi
> +
> + <aer layer string> :=
> + Physical Layer | Data Link Layer | Transaction Layer
> +
> + <aer agent string> :=
> + Receiver ID | Requester ID | Completer ID | Transmitter ID
> +
> +Where, [] designate corresponding content is optional
> +
> +All <field string> description with * has the following format::
> +
> + field: <integer>, <field string>
> +
> +Where value of <integer> should be the position of "string" in <field
> +string> description. Otherwise, <field string> will be "unknown".
> +
> +All <field strings> description with # has the following format::
> +
> + field: <integer>
> + <field strings>
> +
> +Where each string in <fields strings> corresponding to one set bit of
> +<integer>. The bit position is the position of "string" in <field
> +strings> description.
> +
> +For more detailed explanation of every field, please refer to UEFI
> +specification version 2.3 or later, section Appendix N: Common
> +Platform Error Record.
> diff --git a/Documentation/firmware-guide/acpi/index.rst b/Documentation/firmware-guide/acpi/index.rst
> index e9f253d54897..869badba6d7a 100644
> --- a/Documentation/firmware-guide/acpi/index.rst
> +++ b/Documentation/firmware-guide/acpi/index.rst
> @@ -17,6 +17,7 @@ ACPI Support
> DSD-properties-rules
> debug
> aml-debugger
> + apei/output_format
> gpio-properties
> i2c-muxes
> acpi-lid
Thanks,
Mauro
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox