Linux Documentation
 help / color / mirror / Atom feed
* [PATCH 3/7] docs/vm: ksm: add "Design" section
From: Mike Rapoport @ 2018-04-24  6:40 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Andrew Morton, Andrea Arcangeli, linux-doc, linux-mm, lkml,
	Mike Rapoport
In-Reply-To: <1524552028-7017-1-git-send-email-rppt@linux.vnet.ibm.com>

Include the KSM description from the source code comment, add a subsection
about reverse mapping and include kernel-doc references for KSM data
structures.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
---
 Documentation/vm/ksm.rst | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/Documentation/vm/ksm.rst b/Documentation/vm/ksm.rst
index 786d460..0e5a085 100644
--- a/Documentation/vm/ksm.rst
+++ b/Documentation/vm/ksm.rst
@@ -206,6 +206,45 @@ stable_node "dups" with few rmap_items in them, but that may increase
 the ksmd CPU usage and possibly slowdown the readonly computations on
 the KSM pages of the applications.
 
+Design
+======
+
+Overview
+--------
+
+.. kernel-doc:: mm/ksm.c
+   :DOC: Overview
+
+Reverse mapping
+---------------
+KSM maintains reverse mapping information for KSM pages in the stable
+tree.
+
+If a KSM page is shared between less than ``max_page_sharing`` VMAs,
+the node of the stable tree that represents such KSM page points to a
+list of :c:type:`struct rmap_item` and the ``page->mapping`` of the
+KSM page points to the stable tree node.
+
+When the sharing passes this threshold, KSM adds a second dimension to
+the stable tree. The tree node becomes a "chain" that links one or
+more "dups". Each "dup" keeps reverse mapping information for a KSM
+page with ``page->mapping`` pointing to that "dup".
+
+Every "chain" and all "dups" linked into a "chain" enforce the
+invariant that they represent the same write protected memory content,
+even if each "dup" will be pointed by a different KSM page copy of
+that content.
+
+This way the stable tree lookup computational complexity is unaffected
+if compared to an unlimited list of reverse mappings. It is still
+enforced that there cannot be KSM page content duplicates in the
+stable tree itself.
+
+Reference
+---------
+.. kernel-doc:: mm/ksm.c
+   :functions: mm_slot ksm_scan stable_node rmap_item
+
 --
 Izik Eidus,
 Hugh Dickins, 17 Nov 2009
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH 4/7] docs/vm: ksm: reshuffle text between "sysfs" and "design" sections
From: Mike Rapoport @ 2018-04-24  6:40 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Andrew Morton, Andrea Arcangeli, linux-doc, linux-mm, lkml,
	Mike Rapoport
In-Reply-To: <1524552028-7017-1-git-send-email-rppt@linux.vnet.ibm.com>

The description of "max_page_sharing" sysfs attribute includes lots of
implementation details that more naturally belong in the "Design"
section.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
---
 Documentation/vm/ksm.rst | 51 ++++++++++++++++++++++++++++--------------------
 1 file changed, 30 insertions(+), 21 deletions(-)

diff --git a/Documentation/vm/ksm.rst b/Documentation/vm/ksm.rst
index 0e5a085..00961b8 100644
--- a/Documentation/vm/ksm.rst
+++ b/Documentation/vm/ksm.rst
@@ -133,31 +133,21 @@ use_zero_pages
 
 max_page_sharing
         Maximum sharing allowed for each KSM page. This enforces a
-        deduplication limit to avoid the virtual memory rmap lists to
-        grow too large. The minimum value is 2 as a newly created KSM
-        page will have at least two sharers. The rmap walk has O(N)
-        complexity where N is the number of rmap_items (i.e. virtual
-        mappings) that are sharing the page, which is in turn capped
-        by ``max_page_sharing``. So this effectively spreads the linear
-        O(N) computational complexity from rmap walk context over
-        different KSM pages. The ksmd walk over the stable_node
-        "chains" is also O(N), but N is the number of stable_node
-        "dups", not the number of rmap_items, so it has not a
-        significant impact on ksmd performance. In practice the best
-        stable_node "dup" candidate will be kept and found at the head
-        of the "dups" list. The higher this value the faster KSM will
-        merge the memory (because there will be fewer stable_node dups
-        queued into the stable_node chain->hlist to check for pruning)
-        and the higher the deduplication factor will be, but the
-        slowest the worst case rmap walk could be for any given KSM
-        page. Slowing down the rmap_walk means there will be higher
+        deduplication limit to avoid high latency for virtual memory
+        operations that involve traversal of the virtual mappings that
+        share the KSM page. The minimum value is 2 as a newly created
+        KSM page will have at least two sharers. The higher this value
+        the faster KSM will merge the memory and the higher the
+        deduplication factor will be, but the slower the worst case
+        virtual mappings traversal could be for any given KSM
+        page. Slowing down this traversal means there will be higher
         latency for certain virtual memory operations happening during
         swapping, compaction, NUMA balancing and page migration, in
         turn decreasing responsiveness for the caller of those virtual
         memory operations. The scheduler latency of other tasks not
-        involved with the VM operations doing the rmap walk is not
-        affected by this parameter as the rmap walks are always
-        schedule friendly themselves.
+        involved with the VM operations doing the virtual mappings
+        traversal is not affected by this parameter as these
+        traversals are always schedule friendly themselves.
 
 stable_node_chains_prune_millisecs
         How frequently to walk the whole list of stable_node "dups"
@@ -240,6 +230,25 @@ if compared to an unlimited list of reverse mappings. It is still
 enforced that there cannot be KSM page content duplicates in the
 stable tree itself.
 
+The deduplication limit enforced by ``max_page_sharing`` is required
+to avoid the virtual memory rmap lists to grow too large. The rmap
+walk has O(N) complexity where N is the number of rmap_items
+(i.e. virtual mappings) that are sharing the page, which is in turn
+capped by ``max_page_sharing``. So this effectively spreads the linear
+O(N) computational complexity from rmap walk context over different
+KSM pages. The ksmd walk over the stable_node "chains" is also O(N),
+but N is the number of stable_node "dups", not the number of
+rmap_items, so it has not a significant impact on ksmd performance. In
+practice the best stable_node "dup" candidate will be kept and found
+at the head of the "dups" list.
+
+High values of ``max_page_sharing`` result in faster memory merging
+(because there will be fewer stable_node dups queued into the
+stable_node chain->hlist to check for pruning) and higher
+deduplication factor at the expense of slower worst case for rmap
+walks for any KSM page which can happen during swapping, compaction,
+NUMA balancing and page migration.
+
 Reference
 ---------
 .. kernel-doc:: mm/ksm.c
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH 1/7] mm/ksm: docs: extend overview comment and make it "DOC:"
From: Mike Rapoport @ 2018-04-24  6:40 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Andrew Morton, Andrea Arcangeli, linux-doc, linux-mm, lkml,
	Mike Rapoport
In-Reply-To: <1524552028-7017-1-git-send-email-rppt@linux.vnet.ibm.com>

The existing comment provides a good overview of KSM implementation. Let's
update it to reflect recent additions of "chain" and "dup" variants of the
stable tree nodes and mark it as "DOC:" for inclusion into the KSM
documentation.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
---
 mm/ksm.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 58c2741..54155b1 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -51,7 +51,9 @@
 #define DO_NUMA(x)	do { } while (0)
 #endif
 
-/*
+/**
+ * DOC: Overview
+ *
  * A few notes about the KSM scanning process,
  * to make it easier to understand the data structures below:
  *
@@ -67,6 +69,21 @@
  * this tree is fully assured to be working (except when pages are unmapped),
  * and therefore this tree is called the stable tree.
  *
+ * The stable tree node includes information required for reverse
+ * mapping from a KSM page to virtual addresses that map this page.
+ *
+ * In order to avoid large latencies of the rmap walks on KSM pages,
+ * KSM maintains two types of nodes in the stable tree:
+ *
+ * * the regular nodes that keep the reverse mapping structures in a
+ *   linked list
+ * * the "chains" that link nodes ("dups") that represent the same
+ *   write protected memory content, but each "dup" corresponds to a
+ *   different KSM page copy of that content
+ *
+ * Internally, the regular nodes, "dups" and "chains" are represented
+ * using the same :c:type:`struct stable_node` structure.
+ *
  * In addition to the stable tree, KSM uses a second data structure called the
  * unstable tree: this tree holds pointers to pages which have been found to
  * be "unchanged for a period of time".  The unstable tree sorts these pages
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH 0/7] docs/vm: update KSM documentation
From: Mike Rapoport @ 2018-04-24  6:40 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Andrew Morton, Andrea Arcangeli, linux-doc, linux-mm, lkml,
	Mike Rapoport

Hi,

These patches extend KSM documentation with high level design overview and
some details about reverse mappings and split the userspace interface
description to Documentation/admin-guide/mm.

The description of some KSM sysfs attributes is changed so that it won't
include implementation detail. The description of these implementation
details are moved to the new "Design" section.

The last patch in the series depends on the patchset that create
Documentation/admin-guide/mm [1], all the rest applies cleanly to the
current docs-next.

[1] https://lkml.org/lkml/2018/4/18/110

Mike Rapoport (7):
  mm/ksm: docs: extend overview comment and make it "DOC:"
  docs/vm: ksm: (mostly) formatting updates
  docs/vm: ksm: add "Design" section
  docs/vm: ksm: reshuffle text between "sysfs" and "design" sections
  docs/vm: ksm: update stable_node_chains_prune_millisecs description
  docs/vm: ksm: udpate description of stable_node_{dups,chains}
  docs/vm: ksm: split userspace interface to admin-guide/mm/ksm.rst

 Documentation/admin-guide/mm/index.rst |   1 +
 Documentation/admin-guide/mm/ksm.rst   | 189 ++++++++++++++++++++++++++
 Documentation/vm/ksm.rst               | 234 ++++++++++-----------------------
 mm/ksm.c                               |  19 ++-
 4 files changed, 277 insertions(+), 166 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/ksm.rst

-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v1 1/1] misc: IBM Virtual Management Channel Driver
From: Randy Dunlap @ 2018-04-23 21:17 UTC (permalink / raw)
  To: Greg KH
  Cc: Bryant G. Ly, benh, mpe, arnd, corbet, seroyer, mrochs, adreznec,
	fbarrat, davem, linus.walleij, akpm, mikey, pombredanne, tlfalcon,
	msuchanek, linux-doc, linuxppc-dev
In-Reply-To: <20180423195310.GA10693@kroah.com>

On 04/23/18 12:53, Greg KH wrote:
> On Mon, Apr 23, 2018 at 11:38:18AM -0700, Randy Dunlap wrote:
>> On 04/23/18 07:46, Bryant G. Ly wrote:
>>> This driver is a logical device which provides an
>>> interface between the hypervisor and a management
>>> partition.
>>>
>>> This driver is to be used for the POWER Virtual
>>> Management Channel Virtual Adapter on the PowerVM
>>> platform. It provides both request/response and
>>> async message support through the /dev/ibmvmc node.
>>>
>>> Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
>>> Reviewed-by: Steven Royer <seroyer@linux.vnet.ibm.com>
>>> Reviewed-by: Adam Reznechek <adreznec@linux.vnet.ibm.com>
>>> Tested-by: Taylor Jakobson <tjakobs@us.ibm.com>
>>> Tested-by: Brad Warrum <bwarrum@us.ibm.com>
>>> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>> Cc: Arnd Bergmann <arnd@arndb.de>
>>> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>>> ---
>>>  Documentation/ioctl/ioctl-number.txt  |    1 +
>>>  Documentation/misc-devices/ibmvmc.txt |  161 +++
>>>  MAINTAINERS                           |    6 +
>>>  arch/powerpc/include/asm/hvcall.h     |    1 +
>>>  drivers/misc/Kconfig                  |   14 +
>>>  drivers/misc/Makefile                 |    1 +
>>>  drivers/misc/ibmvmc.c                 | 2415 +++++++++++++++++++++++++++++++++
>>>  drivers/misc/ibmvmc.h                 |  209 +++
>>>  8 files changed, 2808 insertions(+)
>>>  create mode 100644 Documentation/misc-devices/ibmvmc.txt
>>>  create mode 100644 drivers/misc/ibmvmc.c
>>>  create mode 100644 drivers/misc/ibmvmc.h
>>
>>> diff --git a/Documentation/misc-devices/ibmvmc.txt b/Documentation/misc-devices/ibmvmc.txt
>>> new file mode 100644
>>> index 0000000..bae1064
>>> --- /dev/null
>>> +++ b/Documentation/misc-devices/ibmvmc.txt
> 
> Aren't we doing new documentation in .rst format instead of .txt?

I am not aware that .rst format is a *requirement* for new documentation.

??

-- 
~Randy
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v1 1/1] misc: IBM Virtual Management Channel Driver
From: Bryant G. Ly @ 2018-04-23 21:06 UTC (permalink / raw)
  To: Greg KH, Randy Dunlap
  Cc: benh, mpe, arnd, corbet, seroyer, mrochs, adreznec, fbarrat,
	davem, linus.walleij, akpm, mikey, pombredanne, tlfalcon,
	msuchanek, linux-doc, linuxppc-dev
In-Reply-To: <20180423195310.GA10693@kroah.com>

On 4/23/18 2:53 PM, Greg KH wrote:

> On Mon, Apr 23, 2018 at 11:38:18AM -0700, Randy Dunlap wrote:
>> On 04/23/18 07:46, Bryant G. Ly wrote:
>>> This driver is a logical device which provides an
>>> interface between the hypervisor and a management
>>> partition.
>>>
>>> This driver is to be used for the POWER Virtual
>>> Management Channel Virtual Adapter on the PowerVM
>>> platform. It provides both request/response and
>>> async message support through the /dev/ibmvmc node.
>>>
>>> Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
>>> Reviewed-by: Steven Royer <seroyer@linux.vnet.ibm.com>
>>> Reviewed-by: Adam Reznechek <adreznec@linux.vnet.ibm.com>
>>> Tested-by: Taylor Jakobson <tjakobs@us.ibm.com>
>>> Tested-by: Brad Warrum <bwarrum@us.ibm.com>
>>> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>> Cc: Arnd Bergmann <arnd@arndb.de>
>>> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>>> ---
>>>  Documentation/ioctl/ioctl-number.txt  |    1 +
>>>  Documentation/misc-devices/ibmvmc.txt |  161 +++
>>>  MAINTAINERS                           |    6 +
>>>  arch/powerpc/include/asm/hvcall.h     |    1 +
>>>  drivers/misc/Kconfig                  |   14 +
>>>  drivers/misc/Makefile                 |    1 +
>>>  drivers/misc/ibmvmc.c                 | 2415 +++++++++++++++++++++++++++++++++
>>>  drivers/misc/ibmvmc.h                 |  209 +++
>>>  8 files changed, 2808 insertions(+)
>>>  create mode 100644 Documentation/misc-devices/ibmvmc.txt
>>>  create mode 100644 drivers/misc/ibmvmc.c
>>>  create mode 100644 drivers/misc/ibmvmc.h
>>> diff --git a/Documentation/misc-devices/ibmvmc.txt b/Documentation/misc-devices/ibmvmc.txt
>>> new file mode 100644
>>> index 0000000..bae1064
>>> --- /dev/null
>>> +++ b/Documentation/misc-devices/ibmvmc.txt
> Aren't we doing new documentation in .rst format instead of .txt?
>
> thanks,
>
> greg k-h
>
I will convert to .rst and fix the comments from Randy in the next version.

Thanks, 

Bryant

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v1 1/1] misc: IBM Virtual Management Channel Driver
From: Greg KH @ 2018-04-23 19:53 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Bryant G. Ly, benh, mpe, arnd, corbet, seroyer, mrochs, adreznec,
	fbarrat, davem, linus.walleij, akpm, mikey, pombredanne, tlfalcon,
	msuchanek, linux-doc, linuxppc-dev
In-Reply-To: <c07b14b7-8f34-2ba5-536f-85896ff21661@infradead.org>

On Mon, Apr 23, 2018 at 11:38:18AM -0700, Randy Dunlap wrote:
> On 04/23/18 07:46, Bryant G. Ly wrote:
> > This driver is a logical device which provides an
> > interface between the hypervisor and a management
> > partition.
> > 
> > This driver is to be used for the POWER Virtual
> > Management Channel Virtual Adapter on the PowerVM
> > platform. It provides both request/response and
> > async message support through the /dev/ibmvmc node.
> > 
> > Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
> > Reviewed-by: Steven Royer <seroyer@linux.vnet.ibm.com>
> > Reviewed-by: Adam Reznechek <adreznec@linux.vnet.ibm.com>
> > Tested-by: Taylor Jakobson <tjakobs@us.ibm.com>
> > Tested-by: Brad Warrum <bwarrum@us.ibm.com>
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Cc: Arnd Bergmann <arnd@arndb.de>
> > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > Cc: Michael Ellerman <mpe@ellerman.id.au>
> > ---
> >  Documentation/ioctl/ioctl-number.txt  |    1 +
> >  Documentation/misc-devices/ibmvmc.txt |  161 +++
> >  MAINTAINERS                           |    6 +
> >  arch/powerpc/include/asm/hvcall.h     |    1 +
> >  drivers/misc/Kconfig                  |   14 +
> >  drivers/misc/Makefile                 |    1 +
> >  drivers/misc/ibmvmc.c                 | 2415 +++++++++++++++++++++++++++++++++
> >  drivers/misc/ibmvmc.h                 |  209 +++
> >  8 files changed, 2808 insertions(+)
> >  create mode 100644 Documentation/misc-devices/ibmvmc.txt
> >  create mode 100644 drivers/misc/ibmvmc.c
> >  create mode 100644 drivers/misc/ibmvmc.h
> 
> > diff --git a/Documentation/misc-devices/ibmvmc.txt b/Documentation/misc-devices/ibmvmc.txt
> > new file mode 100644
> > index 0000000..bae1064
> > --- /dev/null
> > +++ b/Documentation/misc-devices/ibmvmc.txt

Aren't we doing new documentation in .rst format instead of .txt?

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v1 1/1] misc: IBM Virtual Management Channel Driver
From: Randy Dunlap @ 2018-04-23 18:38 UTC (permalink / raw)
  To: Bryant G. Ly, benh, mpe, gregkh, arnd
  Cc: corbet, seroyer, mrochs, adreznec, fbarrat, davem, linus.walleij,
	akpm, mikey, pombredanne, tlfalcon, msuchanek, linux-doc,
	linuxppc-dev
In-Reply-To: <1524494812-60150-2-git-send-email-bryantly@linux.vnet.ibm.com>

On 04/23/18 07:46, Bryant G. Ly wrote:
> This driver is a logical device which provides an
> interface between the hypervisor and a management
> partition.
> 
> This driver is to be used for the POWER Virtual
> Management Channel Virtual Adapter on the PowerVM
> platform. It provides both request/response and
> async message support through the /dev/ibmvmc node.
> 
> Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
> Reviewed-by: Steven Royer <seroyer@linux.vnet.ibm.com>
> Reviewed-by: Adam Reznechek <adreznec@linux.vnet.ibm.com>
> Tested-by: Taylor Jakobson <tjakobs@us.ibm.com>
> Tested-by: Brad Warrum <bwarrum@us.ibm.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> ---
>  Documentation/ioctl/ioctl-number.txt  |    1 +
>  Documentation/misc-devices/ibmvmc.txt |  161 +++
>  MAINTAINERS                           |    6 +
>  arch/powerpc/include/asm/hvcall.h     |    1 +
>  drivers/misc/Kconfig                  |   14 +
>  drivers/misc/Makefile                 |    1 +
>  drivers/misc/ibmvmc.c                 | 2415 +++++++++++++++++++++++++++++++++
>  drivers/misc/ibmvmc.h                 |  209 +++
>  8 files changed, 2808 insertions(+)
>  create mode 100644 Documentation/misc-devices/ibmvmc.txt
>  create mode 100644 drivers/misc/ibmvmc.c
>  create mode 100644 drivers/misc/ibmvmc.h

> diff --git a/Documentation/misc-devices/ibmvmc.txt b/Documentation/misc-devices/ibmvmc.txt
> new file mode 100644
> index 0000000..bae1064
> --- /dev/null
> +++ b/Documentation/misc-devices/ibmvmc.txt
> @@ -0,0 +1,161 @@
> +Kernel Driver ibmvmc
> +====================
> +
> +Authors:
> +	Dave Engebretsen <engebret@us.ibm.com>
> +	Adam Reznechek <adreznec@linux.vnet.ibm.com>
> +	Steven Royer <seroyer@linux.vnet.ibm.com>
> +	Bryant G. Ly <bryantly@linux.vnet.ibm.com>
> +
> +Description
> +===========
> +

...

> +
> +Virtual Management Channel (VMC)
> +A logical device, called the virtual management channel (VMC), is defined
> +for communicating between the Novalink application and the hypervisor.
> +This device, similar to a VSCSI server device, is presented to a designated
> +management partition as a virtual device and is only presented when the
> +system is not HMC managed.
> +This communication device borrows aspects from both VSCSI and ILLAN devices
> +and is implemented using the CRQ and the RDMA interfaces. A three-way
> +handshake is defined that must take place to establish that both the
> +hypervisor and management partition sides of the channel are running prior
> +to sending/receiving any of the protocol messages.
> +This driver also utilizes Transport Event CRQs. CRQ messages that are sent

                                                 Drop           "that" ?
otherwise the sentence construction is ... odd.

> +when the hypervisor detects one of the peer partitions has abnormally
> +terminated, or one side has called H_FREE_CRQ to close their CRQ.
> +Two new classes of CRQ messages are introduced for the VMC device. VMC
> +Administrative messages are used for each partition using the VMC to
> +communicate capabilities to their partner. HMC Interface messages are used
> +for the actual flow of HMC messages between the management partition and
> +the hypervisor. As most HMC messages far exceed the size of a CRQ bugger,

what is a bugger?
[reads more]
oh, buffer?

> +a virtual DMA (RMDA) of the HMC message data is done prior to each HMC
> +Interface CRQ message. Only the management partition drives RDMA
> +operations; hypervisors never directly causes the movement of message data.
> +
> +Example Management Partition VMC Driver Interface
> +=================================================

...


Looks pretty good.  In general (IMO), it could use more white space (like blank
lines after section names.)

If you fix those small items (buggers) above, you can add:
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>

thanks,
-- 
~Randy
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v4 00/10] Add the I3C subsystem
From: Greg Kroah-Hartman @ 2018-04-23 17:56 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Wolfram Sang, linux-i2c, Jonathan Corbet, linux-doc,
	Arnd Bergmann, Przemyslaw Sroka, Arkadiusz Golec, Alan Douglas,
	Bartosz Folta, Damian Kos, Alicja Jurasik-Urbaniak,
	Cyprian Wronka, Suresh Punnoose, Rafal Ciepiela, Thomas Petazzoni,
	Nishanth Menon, Rob Herring, Pawel Moll, Mark Rutland,
	Ian Campbell, Kumar Gala, devicetree, linux-kernel, Vitor Soares,
	Geert Uytterhoeven, Linus Walleij, Xiang Lin, linux-gpio
In-Reply-To: <20180423193814.5c8ca75a@bbrezillon>

On Mon, Apr 23, 2018 at 07:38:14PM +0200, Boris Brezillon wrote:
> Hi,
> 
> On Fri, 30 Mar 2018 09:47:41 +0200
> Boris Brezillon <boris.brezillon@bootlin.com> wrote:
> 
> > This patch series is a proposal for a new I3C subsystem.
> 
> This v4 has been sent almost a month ago and I didn't get any feedback
> so far apart from Rob's R-b. Greg, is there any chance we can get these
> patches merged in 4.18? If not, could you tell me what should be
> addressed/improved/reworked?

I'll look over it later this week, thanks.

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH] thermal: core: fix emulation of subzero temps
From: Kevin DuBois @ 2018-04-23 17:53 UTC (permalink / raw)
  To: linux-pm, linux-doc, corbet, rui.zhang, edubezval, kevindubois

The current implementation casted away its sign.
It was also using '0' to disable emulation, but 0C is a valid
thermal reading. A large negative value (below absolute zero) now
disables the emulation.

Test: Build kernel with CONFIG_THERMAL_EMULATION=y, then write negative
values to the emul_temp sysfs entry. Check the temp entry, and see the
negative value. Write -INT_MAX to emul_temp, and see the proper sensor
reading appear.
Bug: 77599739
Signed-off-by: Kevin DuBois <kevindubois@google.com>

Change-Id: Iee1a4f0aacf7b4299b67c7f8f4effedf2d7a8425
---
 Documentation/thermal/sysfs-api.txt |  2 +-
 drivers/thermal/thermal_core.c      | 13 ++++++++-----
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/Documentation/thermal/sysfs-api.txt b/Documentation/thermal/sysfs-api.txt
index 10f062ea6bc2..e240d9e165a0 100644
--- a/Documentation/thermal/sysfs-api.txt
+++ b/Documentation/thermal/sysfs-api.txt
@@ -312,7 +312,7 @@ emul_temp
 	this temperature to platform emulation function if registered or
 	cache it locally. This is useful in debugging different temperature
 	threshold and its associated cooling action. This is write only node
-	and writing 0 on this node should disable emulation.
+	and writing -INT_MAX on this node should disable emulation.
 	Unit: millidegree Celsius
 	WO, Optional
 
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 768783ab3dac..650a6b124fc7 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -636,8 +636,8 @@ int get_tz_trend(struct thermal_zone_device *tz, int trip)
 {
 	enum thermal_trend trend;
 
-	if (tz->emul_temperature || !tz->ops->get_trend ||
-	    tz->ops->get_trend(tz, trip, &trend)) {
+	if ((tz->emul_temperature > THERMAL_TEMP_INVALID) ||
+		!tz->ops->get_trend || tz->ops->get_trend(tz, trip, &trend)) {
 		if (tz->temperature > tz->last_temperature)
 			trend = THERMAL_TREND_RAISING;
 		else if (tz->temperature < tz->last_temperature)
@@ -904,7 +904,8 @@ int thermal_zone_get_temp(struct thermal_zone_device *tz, int *temp)
 
 	ret = tz->ops->get_temp(tz, temp);
 
-	if (IS_ENABLED(CONFIG_THERMAL_EMULATION) && tz->emul_temperature) {
+	if (IS_ENABLED(CONFIG_THERMAL_EMULATION) &&
+		(tz->emul_temperature > THERMAL_TEMP_INVALID)) {
 		for (count = 0; count < tz->trips; count++) {
 			ret = tz->ops->get_trip_type(tz, count, &type);
 			if (!ret && type == THERMAL_TRIP_CRITICAL) {
@@ -961,6 +962,7 @@ static void thermal_zone_device_reset(struct thermal_zone_device *tz)
 	struct thermal_instance *pos;
 
 	tz->temperature = THERMAL_TEMP_INVALID;
+	tz->emul_temperature = THERMAL_TEMP_INVALID;
 	tz->passive = 0;
 	list_for_each_entry(pos, &tz->thermal_instances, tz_node)
 		pos->initialized = false;
@@ -1351,9 +1353,9 @@ emul_temp_store(struct device *dev, struct device_attribute *attr,
 {
 	struct thermal_zone_device *tz = to_thermal_zone(dev);
 	int ret = 0;
-	unsigned long temperature;
+	int temperature;
 
-	if (kstrtoul(buf, 10, &temperature))
+	if (kstrtoint(buf, 10, &temperature))
 		return -EINVAL;
 
 	if (!tz->ops->set_emul_temp) {
@@ -2296,6 +2298,7 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
 	tz->trips = trips;
 	tz->passive_delay = passive_delay;
 	tz->polling_delay = polling_delay;
+	tz->emul_temperature = THERMAL_TEMP_INVALID;
 	/* A new thermal zone needs to be updated anyway. */
 	atomic_set(&tz->need_update, 1);
 
-- 
2.17.0.484.g0c8726318c-goog

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH v3 03/10] drivers/peci: Add support for PECI bus driver core
From: Jae Hyun Yoo @ 2018-04-23 17:40 UTC (permalink / raw)
  To: Greg KH
  Cc: Alan Cox, Andrew Jeffery, Andrew Lunn, Andy Shevchenko,
	Arnd Bergmann, Benjamin Herrenschmidt, Fengguang Wu,
	Guenter Roeck, Haiyue Wang, James Feist, Jason M Biils,
	Jean Delvare, Joel Stanley, Julia Cartwright, Miguel Ojeda,
	Milton Miller II, Pavel Machek, Randy Dunlap, Stef van Os,
	Sumeet R Pawnikar, Vernon Mauery, linux-kernel, linux-doc,
	devicetree, linux-hwmon, linux-arm-kernel, openbmc
In-Reply-To: <20180423105224.GA1444@kroah.com>

On 4/23/2018 3:52 AM, Greg KH wrote:
> On Tue, Apr 10, 2018 at 11:32:05AM -0700, Jae Hyun Yoo wrote:
>> +static void peci_adapter_dev_release(struct device *dev)
>> +{
>> +	/* do nothing */
>> +}
> 
> As per the in-kernel documentation, I am now allowed to make fun of you.
> 
> You are trying to "out smart" the kernel by getting rid of a warning
> message that was explicitly put there for you to do something.  To think
> that by just providing an "empty" function you are somehow fulfilling
> the API requirement is quite bold, don't you think?
> 
> This has to be fixed.  I didn't put that warning in there for no good
> reason.  Please go read the documentation again...
> 
> greg k-h
> 

Hi Greg,

Thanks a lot for your review.

I think, it should contain actual device resource release code which is
being done by peci_del_adapter(), or a coupling logic should be added
between peci_adapter_dev_release() and peci_del_adapter().

As you suggested, I'll check it again after reading documentation and
understanding core.c code more deeply.

Jae
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v4 00/10] Add the I3C subsystem
From: Boris Brezillon @ 2018-04-23 17:38 UTC (permalink / raw)
  To: Wolfram Sang, linux-i2c, Jonathan Corbet, linux-doc,
	Greg Kroah-Hartman, Arnd Bergmann
  Cc: Przemyslaw Sroka, Arkadiusz Golec, Alan Douglas, Bartosz Folta,
	Damian Kos, Alicja Jurasik-Urbaniak, Cyprian Wronka,
	Suresh Punnoose, Rafal Ciepiela, Thomas Petazzoni, Nishanth Menon,
	Rob Herring, Pawel Moll, Mark Rutland, Ian Campbell, Kumar Gala,
	devicetree, linux-kernel, Vitor Soares, Geert Uytterhoeven,
	Linus Walleij, Xiang Lin, linux-gpio
In-Reply-To: <20180330074751.25987-1-boris.brezillon@bootlin.com>

Hi,

On Fri, 30 Mar 2018 09:47:41 +0200
Boris Brezillon <boris.brezillon@bootlin.com> wrote:

> This patch series is a proposal for a new I3C subsystem.

This v4 has been sent almost a month ago and I didn't get any feedback
so far apart from Rob's R-b. Greg, is there any chance we can get these
patches merged in 4.18? If not, could you tell me what should be
addressed/improved/reworked?

Thanks,

Boris

> 
> This infrastructure is not complete yet and will be extended over
> time.
> 
> There are a few design choices that are worth mentioning because they
> impact the way I3C device drivers can interact with their devices:
> 
> - all functions used to send I3C/I2C frames must be called in
>   non-atomic context. Mainly done this way to ease implementation, but
>   this is still open to discussion. Please let me know if you think it's
>   worth considering an asynchronous model here
> - the bus element is a separate object and is not implicitly described
>   by the master (as done in I2C). The reason is that I want to be able
>   to handle multiple master connected to the same bus and visible to
>   Linux.
>   In this situation, we should only have one instance of the device and
>   not one per master, and sharing the bus object would be part of the
>   solution to gracefully handle this case.
>   I'm not sure if we will ever need to deal with multiple masters
>   controlling the same bus and exposed under Linux, but separating the
>   bus and master concept is pretty easy, hence the decision to do it
>   now, just in case we need it some day.
>   The other benefit of separating the bus and master concepts is that
>   master devices appear under the bus directory in sysfs.
> - I2C backward compatibility has been designed to be transparent to I2C
>   drivers and the I2C subsystem. The I3C master just registers an I2C
>   adapter which creates a new I2C bus. I'd say that, from a
>   representation PoV it's not ideal because what should appear as a
>   single I3C bus exposing I3C and I2C devices here appears as 2
>   different busses connected to each other through the parenting (the
>   I3C master is the parent of the I2C and I3C busses).
>   On the other hand, I don't see a better solution if we want something
>   that is not invasive.
> 
> Missing features in this preliminary version:
> - support for HDR modes (has been removed because of lack of real users)
> - no support for multi-master and the associated concepts (mastership
>   handover, support for secondary masters, ...)
> - I2C devices can only be described using DT because this is the only
>   use case I have. However, the framework can easily be extended with
>   ACPI and board info support
> - I3C slave framework. This has been completely omitted, but shouldn't
>   have a huge impact on the I3C framework because I3C slaves don't see
>   the whole bus, it's only about handling master requests and generating
>   IBIs. Some of the struct, constant and enum definitions could be
>   shared, but most of the I3C slave framework logic will be different
> 
> Main changes between v2 and v3 are:
> - Reworked the DT bindings as suggested by Rob
> - Reworked the bus initialization step as suggested by Vitor
> - Added a driver for an I3C GPIO expander
> 
> Main changes between the initial RFC and this v2 are:
> - Add a generic infrastructure to support IBIs. It's worth mentioning
>   that I tried exposing IBIs as a regular IRQs, but after several
>   attempts and a discussion with Mark Zyngier, it appeared that it was
>   not really fitting in the Linux IRQ model (the fact that you have
>   payload attached to IBIs, the fact that most of the time an IBI will
>   generate a transfer on the bus which has to be done in an atomic
>   context, ...)
>   The counterpart of this decision is the latency induced by the
>   workqueue approach, but since I don't have real use cases, I don't
>   know if this can be a problem or not. 
> - Add helpers to support Hot Join
> - Add support for IBIs and Hot Join in Cadence I3C master driver
> - Address several issues in how I was using the device model
> 
> Note that I2C changes have been sent separately [1] this time. Other
> than that, no big changes in this version, I just addressed the
> comments I received from Thomas, Peter, Geert and Rob.
> 
> Thanks,
> 
> Boris
> 
> [1]https://patchwork.ozlabs.org/project/linux-i2c/list/?series=35687
> 
> Boris Brezillon (10):
>   i3c: Add core I3C infrastructure
>   docs: driver-api: Add I3C documentation
>   i3c: Add sysfs ABI spec
>   dt-bindings: i3c: Document core bindings
>   dt-bindings: i3c: Add macros to help fill I3C/I2C device's reg
>     property
>   MAINTAINERS: Add myself as the I3C subsystem maintainer
>   i3c: master: Add driver for Cadence IP
>   dt-bindings: i3c: Document Cadence I3C master bindings
>   gpio: Add a driver for Cadence I3C GPIO expander
>   dt-bindings: gpio: Add bindings for Cadence I3C gpio expander
> 
>  Documentation/ABI/testing/sysfs-bus-i3c            |   95 ++
>  .../devicetree/bindings/gpio/gpio-cdns-i3c.txt     |   39 +
>  .../devicetree/bindings/i3c/cdns,i3c-master.txt    |   44 +
>  Documentation/devicetree/bindings/i3c/i3c.txt      |  140 ++
>  Documentation/driver-api/i3c/conf.py               |   10 +
>  Documentation/driver-api/i3c/device-driver-api.rst |    7 +
>  Documentation/driver-api/i3c/index.rst             |    9 +
>  Documentation/driver-api/i3c/master-driver-api.rst |    8 +
>  Documentation/driver-api/i3c/protocol.rst          |  201 +++
>  Documentation/driver-api/index.rst                 |    1 +
>  MAINTAINERS                                        |    9 +
>  drivers/Kconfig                                    |    2 +
>  drivers/Makefile                                   |    2 +-
>  drivers/gpio/Kconfig                               |   11 +
>  drivers/gpio/Makefile                              |    1 +
>  drivers/gpio/gpio-cdns-i3c.c                       |  380 +++++
>  drivers/i3c/Kconfig                                |   24 +
>  drivers/i3c/Makefile                               |    4 +
>  drivers/i3c/core.c                                 |  620 +++++++
>  drivers/i3c/device.c                               |  294 ++++
>  drivers/i3c/internals.h                            |   28 +
>  drivers/i3c/master.c                               | 1723 ++++++++++++++++++++
>  drivers/i3c/master/Kconfig                         |    5 +
>  drivers/i3c/master/Makefile                        |    1 +
>  drivers/i3c/master/i3c-master-cdns.c               | 1650 +++++++++++++++++++
>  include/dt-bindings/i3c/i3c.h                      |   28 +
>  include/linux/i3c/ccc.h                            |  382 +++++
>  include/linux/i3c/device.h                         |  305 ++++
>  include/linux/i3c/master.h                         |  587 +++++++
>  include/linux/mod_devicetable.h                    |   17 +
>  30 files changed, 6626 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-bus-i3c
>  create mode 100644 Documentation/devicetree/bindings/gpio/gpio-cdns-i3c.txt
>  create mode 100644 Documentation/devicetree/bindings/i3c/cdns,i3c-master.txt
>  create mode 100644 Documentation/devicetree/bindings/i3c/i3c.txt
>  create mode 100644 Documentation/driver-api/i3c/conf.py
>  create mode 100644 Documentation/driver-api/i3c/device-driver-api.rst
>  create mode 100644 Documentation/driver-api/i3c/index.rst
>  create mode 100644 Documentation/driver-api/i3c/master-driver-api.rst
>  create mode 100644 Documentation/driver-api/i3c/protocol.rst
>  create mode 100644 drivers/gpio/gpio-cdns-i3c.c
>  create mode 100644 drivers/i3c/Kconfig
>  create mode 100644 drivers/i3c/Makefile
>  create mode 100644 drivers/i3c/core.c
>  create mode 100644 drivers/i3c/device.c
>  create mode 100644 drivers/i3c/internals.h
>  create mode 100644 drivers/i3c/master.c
>  create mode 100644 drivers/i3c/master/Kconfig
>  create mode 100644 drivers/i3c/master/Makefile
>  create mode 100644 drivers/i3c/master/i3c-master-cdns.c
>  create mode 100644 include/dt-bindings/i3c/i3c.h
>  create mode 100644 include/linux/i3c/ccc.h
>  create mode 100644 include/linux/i3c/device.h
>  create mode 100644 include/linux/i3c/master.h
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v7 0/5] cpuset: Enable cpuset controller in default hierarchy
From: Waiman Long @ 2018-04-23 16:32 UTC (permalink / raw)
  To: Mike Galbraith, Tejun Heo, Li Zefan, Johannes Weiner,
	Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	torvalds, Roman Gushchin, Juri Lelli
In-Reply-To: <1524212601.11030.2.camel@gmx.de>

On 04/20/2018 04:23 AM, Mike Galbraith wrote:
> On Thu, 2018-04-19 at 09:46 -0400, Waiman Long wrote:
>> v7:
>>  - Add a root-only cpuset.cpus.isolated control file for CPU isolation.
>>  - Enforce that load_balancing can only be turned off on cpusets with
>>    CPUs from the isolated list.
>>  - Update sched domain generation to allow cpusets with CPUs only
>>    from the isolated CPU list to be in separate root domains.
> I haven't done much, but was able to do a q/d manual build, populate
> and teardown of system/critical sets on my desktop box, and it looked
> ok.  Thanks for getting this aboard the v2 boat.
>
> 	-Mike

Thank for the testing.

Cheers,
Longman

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v7 3/5] cpuset: Add a root-only cpus.isolated v2 control file
From: Juri Lelli @ 2018-04-23 15:56 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
	cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin
In-Reply-To: <1524145624-23655-4-git-send-email-longman@redhat.com>

On 19/04/18 09:47, Waiman Long wrote:

[...]

> +  cpuset.cpus.isolated
> +	A read-write multiple values file which exists on root cgroup
> +	only.
> +
> +	It lists the CPUs that have been withdrawn from the root cgroup
> +	for load balancing.  These CPUs can still be allocated to child
> +	cpusets with load balancing enabled, if necessary.
> +
> +	If a child cpuset contains only an exclusive set of CPUs that are
> +	a subset of the isolated CPUs and with load balancing enabled,
> +	these CPUs will be load balanced on a separate root domain from
> +	the one in the root cgroup.
> +
> +	Just putting the CPUs into "cpuset.cpus.isolated" will be
> +	enough to disable load balancing on those CPUs as long as they
> +	do not appear in a child cpuset with load balancing enabled.

Tasks that were on those CPUs when they got isolated will stay there
(unless forcibly moved somewhere else). They will also "automatically"
belong to default root domain (or potentially to a new root domain
created for a group using those CPUs). Both things are maybe unavoidable
(as discussed in previous versions some tasks cannot be migrated at
all), but such "side effects" should probably be documented. What do you
think?
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH v1 1/1] misc: IBM Virtual Management Channel Driver
From: Bryant G. Ly @ 2018-04-23 14:46 UTC (permalink / raw)
  To: benh, mpe, gregkh, arnd
  Cc: corbet, seroyer, mrochs, adreznec, fbarrat, davem, linus.walleij,
	akpm, rdunlap, mikey, pombredanne, tlfalcon, msuchanek, linux-doc,
	linuxppc-dev, Bryant G. Ly
In-Reply-To: <1524494812-60150-1-git-send-email-bryantly@linux.vnet.ibm.com>

This driver is a logical device which provides an
interface between the hypervisor and a management
partition.

This driver is to be used for the POWER Virtual
Management Channel Virtual Adapter on the PowerVM
platform. It provides both request/response and
async message support through the /dev/ibmvmc node.

Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Reviewed-by: Steven Royer <seroyer@linux.vnet.ibm.com>
Reviewed-by: Adam Reznechek <adreznec@linux.vnet.ibm.com>
Tested-by: Taylor Jakobson <tjakobs@us.ibm.com>
Tested-by: Brad Warrum <bwarrum@us.ibm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
---
 Documentation/ioctl/ioctl-number.txt  |    1 +
 Documentation/misc-devices/ibmvmc.txt |  161 +++
 MAINTAINERS                           |    6 +
 arch/powerpc/include/asm/hvcall.h     |    1 +
 drivers/misc/Kconfig                  |   14 +
 drivers/misc/Makefile                 |    1 +
 drivers/misc/ibmvmc.c                 | 2415 +++++++++++++++++++++++++++++++++
 drivers/misc/ibmvmc.h                 |  209 +++
 8 files changed, 2808 insertions(+)
 create mode 100644 Documentation/misc-devices/ibmvmc.txt
 create mode 100644 drivers/misc/ibmvmc.c
 create mode 100644 drivers/misc/ibmvmc.h

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index 84bb74d..9851cee 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -329,6 +329,7 @@ Code  Seq#(hex)	Include File		Comments
 0xCA	80-BF	uapi/scsi/cxlflash_ioctl.h
 0xCB	00-1F	CBM serial IEC bus	in development:
 					<mailto:michael.klein@puffin.lb.shuttle.de>
+0xCC	00-0F	drivers/misc/ibmvmc.h    pseries VMC driver
 0xCD	01	linux/reiserfs_fs.h
 0xCF	02	fs/cifs/ioctl.c
 0xDB	00-0F	drivers/char/mwave/mwavepub.h
diff --git a/Documentation/misc-devices/ibmvmc.txt b/Documentation/misc-devices/ibmvmc.txt
new file mode 100644
index 0000000..bae1064
--- /dev/null
+++ b/Documentation/misc-devices/ibmvmc.txt
@@ -0,0 +1,161 @@
+Kernel Driver ibmvmc
+====================
+
+Authors:
+	Dave Engebretsen <engebret@us.ibm.com>
+	Adam Reznechek <adreznec@linux.vnet.ibm.com>
+	Steven Royer <seroyer@linux.vnet.ibm.com>
+	Bryant G. Ly <bryantly@linux.vnet.ibm.com>
+
+Description
+===========
+
+The Virtual Management Channel (VMC) is a logical device which provides an
+interface between the hypervisor and a management partition. This
+management partition is intended to provide an alternative to HMC-based
+system management. In the management partition, a Novalink application
+exists which enables a system administrator to configure the system’s
+partitioning characteristics via a command line interface (CLI) or
+Representational State Transfer Application (REST). You can also manage the
+server by using PowerVC or other OpenStack solution. Support for
+conventional HMC management of the system may still be provided on a
+system; however, when an HMC is attached to the system, the VMC
+interface is disabled by the hypervisor.
+
+NovaLink runs on a Linux logical partition on a POWER8 or newer processor-
+based server that is virtualized by PowerVM. System configuration,
+maintenance, and control functions which traditionally require an HMC can
+be implemented in the Novalink using a combination of HMC to hypervisor
+interfaces and existing operating system methods. This tool provides a
+subset of the functions implemented by the HMC and enables basic partition
+configuration. The set of HMC to hypervisor messages supported by the
+Novalink component are passed to the hypervisor over a VMC interface, which
+is defined below.
+
+Virtual Management Channel (VMC)
+A logical device, called the virtual management channel (VMC), is defined
+for communicating between the Novalink application and the hypervisor.
+This device, similar to a VSCSI server device, is presented to a designated
+management partition as a virtual device and is only presented when the
+system is not HMC managed.
+This communication device borrows aspects from both VSCSI and ILLAN devices
+and is implemented using the CRQ and the RDMA interfaces. A three-way
+handshake is defined that must take place to establish that both the
+hypervisor and management partition sides of the channel are running prior
+to sending/receiving any of the protocol messages.
+This driver also utilizes Transport Event CRQs. CRQ messages that are sent
+when the hypervisor detects one of the peer partitions has abnormally
+terminated, or one side has called H_FREE_CRQ to close their CRQ.
+Two new classes of CRQ messages are introduced for the VMC device. VMC
+Administrative messages are used for each partition using the VMC to
+communicate capabilities to their partner. HMC Interface messages are used
+for the actual flow of HMC messages between the management partition and
+the hypervisor. As most HMC messages far exceed the size of a CRQ bugger,
+a virtual DMA (RMDA) of the HMC message data is done prior to each HMC
+Interface CRQ message. Only the management partition drives RDMA
+operations; hypervisors never directly causes the movement of message data.
+
+Example Management Partition VMC Driver Interface
+=================================================
+This section provides an example for the Novalink implementation where a
+device driver is used to interface to the VMC device. This driver
+consists of a new device, for example /dev/lparvmc, which provides
+interfaces to open, close, read, write, and perform ioctl’s against the
+VMC device.
+
+VMC Interface Initialization
+The device driver is responsible for initializing the VMC when the driver
+is loaded. It first creates and initializes the CRQ. Next, an exchange of
+VMC capabilities is performed to indicate the code version and number of
+resources available in both the management partition and the hypervisor.
+Finally, the hypervisor requests that the management partition create an
+initial pool of VMC buffers, one buffer for each possible HMC connection,
+which will be used for Novalink session initialization. Prior to completion
+of this initialization sequence, the device returns EBUSY to open() calls.
+EIO is returned for all open() failures.
+
+Management Partition		Hypervisor
+		CRQ INIT
+---------------------------------------->
+	   CRQ INIT COMPLETE
+<----------------------------------------
+	      CAPABILITIES
+---------------------------------------->
+	 CAPABILITIES RESPONSE
+<----------------------------------------
+      ADD BUFFER (MHC IDX=0,1,..)	  _
+<----------------------------------------  |
+	  ADD BUFFER RESPONSE		   | - Perform # HMCs Iterations
+----------------------------------------> -
+
+VMC Interface Open
+After the basic VMC channel has been initialized, an HMC session level
+connection can be established. The application layer performs an open() to
+the VMC device and executes an ioctl() against it, indicating the HMC ID
+(32 bytes of data) for this session. If the VMC device is in an invalid
+state, EIO will be returned for the ioctl(). The device driver creates a
+new HMC session value (ranging from 1 to 255) and HMC index value (starting
+at index 0 and potentially ranging to 254 in future releases) for this HMC
+ID. The driver then does an RDMA of the HMC ID to the hypervisor, and then
+sends an Interface Open message to the hypervisor to establish the session
+over the VMC. After the hypervisor receives this information, it sends Add
+Buffer messages to the management partition to seed an initial pool of
+buffers for the new HMC connection. Finally, the hypervisor sends an
+Interface Open Response message, to indicate that it is ready for normal
+runtime messaging. The following illustrates this VMC flow:
+
+Management Partition             Hypervisor
+	      RDMA HMC ID
+---------------------------------------->
+	    Interface Open
+---------------------------------------->
+	      Add Buffer		  _
+<----------------------------------------  |
+	  Add Buffer Response		   | - Perform N Iterations
+----------------------------------------> -
+	Interface Open Response
+<----------------------------------------
+
+VMC Interface Runtime
+During normal runtime, the Novalink application and the hypervisor
+exchange HMC messages via the Signal VMC message and RDMA operations. When
+sending data to the hypervisor, the Novalink application performs a
+write() to the VMC device, and the driver RDMA’s the data to the hypervisor
+and then sends a Signal Message. If a write() is attempted before VMC
+device buffers have been made available by the hypervisor, or no buffers
+are currently available, EBUSY is returned in response to the write(). A
+write() will return EIO for all other errors, such as an invalid device
+state. When the hypervisor sends a message to the Novalink, the data is
+put into a VMC buffer and an Signal Message is sent to the VMC driver in
+the management partition. The driver RDMA’s the buffer into the partition
+and passes the data up to the appropriate Novalink application via a
+read() to the VMC device. The read() request blocks if there is no buffer
+available to read. The Novalink application may use select() to wait for
+the VMC device to become ready with data to read.
+
+Management Partition             Hypervisor
+		MSG RDMA
+---------------------------------------->
+		SIGNAL MSG
+---------------------------------------->
+		SIGNAL MSG
+<----------------------------------------
+		MSG RDMA
+<----------------------------------------
+
+VMC Interface Close
+HMC session level connections are closed by the management partition when
+the application layer performs a close() against the device. This action
+results in an Interface Close message flowing to the hypervisor, which
+causes the session to be terminated. The device driver must free any
+storage allocated for buffers for this HMC connection.
+
+Management Partition             Hypervisor
+	     INTERFACE CLOSE
+---------------------------------------->
+        INTERFACE CLOSE RESPONSE
+<----------------------------------------
+
+For more information on the documentation for CRQ Messages, VMC Messages,
+HMC interface Buffers, and signal messages please refer to the Linux on
+Power Architecture Platform Reference. Section F.
diff --git a/MAINTAINERS b/MAINTAINERS
index c7182d2..79969f9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6721,6 +6721,12 @@ L:	linux-scsi@vger.kernel.org
 S:	Supported
 F:	drivers/scsi/ibmvscsi/ibmvfc*
 
+IBM Power Virtual Management Channel Driver
+M:	Bryant G. Ly <bryantly@linux.vnet.ibm.com>
+M:	Steven Royer <seroyer@linux.vnet.ibm.com>
+S:	Supported
+F:	drivers/misc/ibmvmc.*
+
 IBM Power Virtual SCSI Device Drivers
 M:	Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
 L:	linux-scsi@vger.kernel.org
diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index 2e2ddda..662c834 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -279,6 +279,7 @@
 #define H_GET_MPP_X		0x314
 #define H_SET_MODE		0x31C
 #define H_CLEAR_HPT		0x358
+#define H_REQUEST_VMC		0x360
 #define H_RESIZE_HPT_PREPARE	0x36C
 #define H_RESIZE_HPT_COMMIT	0x370
 #define H_REGISTER_PROC_TBL	0x37C
diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 5d71300..3726eac 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -113,6 +113,20 @@ config IBM_ASM
 	  for information on the specific driver level and support statement
 	  for your IBM server.
 
+config IBMVMC
+	tristate "IBM Virtual Management Channel support"
+	depends on PPC_PSERIES
+	help
+	  This is the IBM POWER Virtual Management Channel
+
+	  This driver is to be used for the POWER Virtual
+	  Management Channel virtual adapter on the PowerVM
+	  platform. It provides both request/response and
+	  async message support through the /dev/ibmvmc node.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called ibmvmc.
+
 config PHANTOM
 	tristate "Sensable PHANToM (PCI)"
 	depends on PCI
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index 20be70c..af22bbc 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -4,6 +4,7 @@
 #
 
 obj-$(CONFIG_IBM_ASM)		+= ibmasm/
+obj-$(CONFIG_IBMVMC)		+= ibmvmc.o
 obj-$(CONFIG_AD525X_DPOT)	+= ad525x_dpot.o
 obj-$(CONFIG_AD525X_DPOT_I2C)	+= ad525x_dpot-i2c.o
 obj-$(CONFIG_AD525X_DPOT_SPI)	+= ad525x_dpot-spi.o
diff --git a/drivers/misc/ibmvmc.c b/drivers/misc/ibmvmc.c
new file mode 100644
index 0000000..ae9cea3
--- /dev/null
+++ b/drivers/misc/ibmvmc.c
@@ -0,0 +1,2415 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * IBM Power Systems Virtual Management Channel Support.
+ *
+ * Copyright (c) 2004, 2018 IBM Corp.
+ *   Dave Engebretsen engebret@us.ibm.com
+ *   Steven Royer seroyer@linux.vnet.ibm.com
+ *   Adam Reznechek adreznec@linux.vnet.ibm.com
+ *   Bryant G. Ly <bryantly@linux.vnet.ibm.com>
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/kthread.h>
+#include <linux/major.h>
+#include <linux/string.h>
+#include <linux/fcntl.h>
+#include <linux/slab.h>
+#include <linux/poll.h>
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/interrupt.h>
+#include <linux/spinlock.h>
+#include <linux/percpu.h>
+#include <linux/delay.h>
+#include <linux/uaccess.h>
+#include <linux/io.h>
+#include <linux/miscdevice.h>
+#include <linux/sched/signal.h>
+
+#include <asm/byteorder.h>
+#include <asm/irq.h>
+#include <asm/vio.h>
+
+#include "ibmvmc.h"
+
+#define IBMVMC_DRIVER_VERSION "1.0"
+
+/*
+ * Static global variables
+ */
+static DECLARE_WAIT_QUEUE_HEAD(ibmvmc_read_wait);
+
+static const char ibmvmc_driver_name[] = "ibmvmc";
+
+static struct ibmvmc_struct ibmvmc;
+static struct ibmvmc_hmc hmcs[MAX_HMCS];
+static struct crq_server_adapter ibmvmc_adapter;
+
+static int ibmvmc_max_buf_pool_size = DEFAULT_BUF_POOL_SIZE;
+static int ibmvmc_max_hmcs = DEFAULT_HMCS;
+static int ibmvmc_max_mtu = DEFAULT_MTU;
+
+static inline long h_copy_rdma(s64 length, u64 sliobn, u64 slioba,
+			       u64 dliobn, u64 dlioba)
+{
+	long rc = 0;
+
+	/* Ensure all writes to source memory are visible before hcall */
+	dma_wmb();
+	pr_debug("ibmvmc: h_copy_rdma(0x%llx, 0x%llx, 0x%llx, 0x%llx, 0x%llx\n",
+		 length, sliobn, slioba, dliobn, dlioba);
+	rc = plpar_hcall_norets(H_COPY_RDMA, length, sliobn, slioba,
+				dliobn, dlioba);
+	pr_debug("ibmvmc: h_copy_rdma rc = 0x%lx\n", rc);
+
+	return rc;
+}
+
+static inline void h_free_crq(uint32_t unit_address)
+{
+	long rc = 0;
+
+	do {
+		if (H_IS_LONG_BUSY(rc))
+			msleep(get_longbusy_msecs(rc));
+
+		rc = plpar_hcall_norets(H_FREE_CRQ, unit_address);
+	} while ((rc == H_BUSY) || (H_IS_LONG_BUSY(rc)));
+}
+
+/**
+ * h_request_vmc: - request a hypervisor virtual management channel device
+ * @vmc_index: drc index of the vmc device created
+ *
+ * Requests the hypervisor create a new virtual management channel device,
+ * allowing this partition to send hypervisor virtualization control
+ * commands.
+ *
+ * Return:
+ *	0 - Success
+ *	Non-zero - Failure
+ */
+static inline long h_request_vmc(u32 *vmc_index)
+{
+	long rc = 0;
+	unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
+
+	do {
+		if (H_IS_LONG_BUSY(rc))
+			msleep(get_longbusy_msecs(rc));
+
+		/* Call to request the VMC device from phyp */
+		rc = plpar_hcall(H_REQUEST_VMC, retbuf);
+		pr_debug("ibmvmc: %s rc = 0x%lx\n", __func__, rc);
+		*vmc_index = retbuf[0];
+	} while ((rc == H_BUSY) || (H_IS_LONG_BUSY(rc)));
+
+	return rc;
+}
+
+/* routines for managing a command/response queue */
+/**
+ * ibmvmc_handle_event: - Interrupt handler for crq events
+ * @irq:        number of irq to handle, not used
+ * @dev_instance: crq_server_adapter that received interrupt
+ *
+ * Disables interrupts and schedules ibmvmc_task
+ *
+ * Always returns IRQ_HANDLED
+ */
+static irqreturn_t ibmvmc_handle_event(int irq, void *dev_instance)
+{
+	struct crq_server_adapter *adapter =
+		(struct crq_server_adapter *)dev_instance;
+
+	vio_disable_interrupts(to_vio_dev(adapter->dev));
+	tasklet_schedule(&adapter->work_task);
+
+	return IRQ_HANDLED;
+}
+
+/**
+ * ibmvmc_release_crq_queue - Release CRQ Queue
+ *
+ * @adapter:	crq_server_adapter struct
+ *
+ * Return:
+ *	0 - Success
+ *	Non-Zero - Failure
+ */
+static void ibmvmc_release_crq_queue(struct crq_server_adapter *adapter)
+{
+	struct vio_dev *vdev = to_vio_dev(adapter->dev);
+	struct crq_queue *queue = &adapter->queue;
+
+	free_irq(vdev->irq, (void *)adapter);
+	tasklet_kill(&adapter->work_task);
+
+	if (adapter->reset_task)
+		kthread_stop(adapter->reset_task);
+
+	h_free_crq(vdev->unit_address);
+	dma_unmap_single(adapter->dev,
+			 queue->msg_token,
+			 queue->size * sizeof(*queue->msgs), DMA_BIDIRECTIONAL);
+	free_page((unsigned long)queue->msgs);
+}
+
+/**
+ * ibmvmc_reset_crq_queue - Reset CRQ Queue
+ *
+ * @adapter:	crq_server_adapter struct
+ *
+ * This function calls h_free_crq and then calls H_REG_CRQ and does all the
+ * bookkeeping to get us back to where we can communicate.
+ *
+ * Return:
+ *	0 - Success
+ *	Non-Zero - Failure
+ */
+static int ibmvmc_reset_crq_queue(struct crq_server_adapter *adapter)
+{
+	struct vio_dev *vdev = to_vio_dev(adapter->dev);
+	struct crq_queue *queue = &adapter->queue;
+	int rc = 0;
+
+	/* Close the CRQ */
+	h_free_crq(vdev->unit_address);
+
+	/* Clean out the queue */
+	memset(queue->msgs, 0x00, PAGE_SIZE);
+	queue->cur = 0;
+
+	/* And re-open it again */
+	rc = plpar_hcall_norets(H_REG_CRQ,
+				vdev->unit_address,
+				queue->msg_token, PAGE_SIZE);
+	if (rc == 2)
+		/* Adapter is good, but other end is not ready */
+		dev_warn(adapter->dev, "Partner adapter not ready\n");
+	else if (rc != 0)
+		dev_err(adapter->dev, "couldn't register crq--rc 0x%x\n", rc);
+
+	return rc;
+}
+
+/**
+ * crq_queue_next_crq: - Returns the next entry in message queue
+ * @queue:      crq_queue to use
+ *
+ * Returns pointer to next entry in queue, or NULL if there are no new
+ * entried in the CRQ.
+ */
+static struct ibmvmc_crq_msg *crq_queue_next_crq(struct crq_queue *queue)
+{
+	struct ibmvmc_crq_msg *crq;
+	unsigned long flags;
+
+	spin_lock_irqsave(&queue->lock, flags);
+	crq = &queue->msgs[queue->cur];
+	if (crq->valid & 0x80) {
+		if (++queue->cur == queue->size)
+			queue->cur = 0;
+
+		/* Ensure the read of the valid bit occurs before reading any
+		 * other bits of the CRQ entry
+		 */
+		dma_rmb();
+	} else {
+		crq = NULL;
+	}
+
+	spin_unlock_irqrestore(&queue->lock, flags);
+
+	return crq;
+}
+
+/**
+ * ibmvmc_send_crq - Send CRQ
+ *
+ * @adapter:	crq_server_adapter struct
+ * @word1:	Word1 Data field
+ * @word2:	Word2 Data field
+ *
+ * Return:
+ *	0 - Success
+ *	Non-Zero - Failure
+ */
+static long ibmvmc_send_crq(struct crq_server_adapter *adapter,
+			    u64 word1, u64 word2)
+{
+	struct vio_dev *vdev = to_vio_dev(adapter->dev);
+	long rc;
+
+	dev_dbg(adapter->dev, "(0x%x, 0x%016llx, 0x%016llx)\n",
+		vdev->unit_address, word1, word2);
+
+	/*
+	 * Ensure the command buffer is flushed to memory before handing it
+	 * over to the other side to prevent it from fetching any stale data.
+	 */
+	dma_wmb();
+	rc = plpar_hcall_norets(H_SEND_CRQ, vdev->unit_address, word1, word2);
+	dev_dbg(adapter->dev, "rc = 0x%lx\n", rc);
+
+	return rc;
+}
+
+/**
+ * alloc_dma_buffer - Create DMA Buffer
+ *
+ * @vdev:	vio_dev struct
+ * @size:	Size field
+ * @dma_handle:	DMA address field
+ *
+ * Allocates memory for the command queue and maps remote memory into an
+ * ioba.
+ *
+ * Returns a pointer to the buffer
+ */
+static void *alloc_dma_buffer(struct vio_dev *vdev, size_t size,
+			      dma_addr_t *dma_handle)
+{
+	/* allocate memory */
+	void *buffer = kzalloc(size, GFP_KERNEL);
+
+	if (!buffer) {
+		*dma_handle = 0;
+		return NULL;
+	}
+
+	/* DMA map */
+	*dma_handle = dma_map_single(&vdev->dev, buffer, size,
+				     DMA_BIDIRECTIONAL);
+
+	if (dma_mapping_error(&vdev->dev, *dma_handle)) {
+		*dma_handle = 0;
+		kzfree(buffer);
+		return NULL;
+	}
+
+	return buffer;
+}
+
+/**
+ * free_dma_buffer - Free DMA Buffer
+ *
+ * @vdev:	vio_dev struct
+ * @size:	Size field
+ * @vaddr:	Address field
+ * @dma_handle:	DMA address field
+ *
+ * Releases memory for a command queue and unmaps mapped remote memory.
+ */
+static void free_dma_buffer(struct vio_dev *vdev, size_t size, void *vaddr,
+			    dma_addr_t dma_handle)
+{
+	/* DMA unmap */
+	dma_unmap_single(&vdev->dev, dma_handle, size, DMA_BIDIRECTIONAL);
+
+	/* deallocate memory */
+	kzfree(vaddr);
+}
+
+/**
+ * ibmvmc_get_valid_hmc_buffer - Retrieve Valid HMC Buffer
+ *
+ * @hmc_index:	HMC Index Field
+ *
+ * Return:
+ *	Pointer to ibmvmc_buffer
+ */
+static struct ibmvmc_buffer *ibmvmc_get_valid_hmc_buffer(u8 hmc_index)
+{
+	struct ibmvmc_buffer *buffer;
+	struct ibmvmc_buffer *ret_buf = NULL;
+	unsigned long i;
+
+	if (hmc_index > ibmvmc.max_hmc_index)
+		return NULL;
+
+	buffer = hmcs[hmc_index].buffer;
+
+	for (i = 0; i < ibmvmc_max_buf_pool_size; i++) {
+		if (buffer[i].valid && buffer[i].free &&
+		    buffer[i].owner == VMC_BUF_OWNER_ALPHA) {
+			buffer[i].free = 0;
+			ret_buf = &buffer[i];
+			break;
+		}
+	}
+
+	return ret_buf;
+}
+
+/**
+ * ibmvmc_get_free_hmc_buffer - Get Free HMC Buffer
+ *
+ * @adapter:	crq_server_adapter struct
+ * @hmc_index:	Hmc Index field
+ *
+ * Return:
+ *	Pointer to ibmvmc_buffer
+ */
+static struct ibmvmc_buffer *ibmvmc_get_free_hmc_buffer(struct crq_server_adapter *adapter,
+							u8 hmc_index)
+{
+	struct ibmvmc_buffer *buffer;
+	struct ibmvmc_buffer *ret_buf = NULL;
+	unsigned long i;
+
+	if (hmc_index > ibmvmc.max_hmc_index) {
+		dev_info(adapter->dev, "get_free_hmc_buffer: invalid hmc_index=0x%x\n",
+			 hmc_index);
+		return NULL;
+	}
+
+	buffer = hmcs[hmc_index].buffer;
+
+	for (i = 0; i < ibmvmc_max_buf_pool_size; i++) {
+		if (buffer[i].free &&
+		    buffer[i].owner == VMC_BUF_OWNER_ALPHA) {
+			buffer[i].free = 0;
+			ret_buf = &buffer[i];
+			break;
+		}
+	}
+
+	return ret_buf;
+}
+
+/**
+ * ibmvmc_free_hmc_buffer - Free an HMC Buffer
+ *
+ * @hmc:	ibmvmc_hmc struct
+ * @buffer:	ibmvmc_buffer struct
+ *
+ */
+static void ibmvmc_free_hmc_buffer(struct ibmvmc_hmc *hmc,
+				   struct ibmvmc_buffer *buffer)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&hmc->lock, flags);
+	buffer->free = 1;
+	spin_unlock_irqrestore(&hmc->lock, flags);
+}
+
+/**
+ * ibmvmc_count_hmc_buffers - Count HMC Buffers
+ *
+ * @hmc_index:	HMC Index field
+ * @valid:	Valid number of buffers field
+ * @free:	Free number of buffers field
+ *
+ */
+static void ibmvmc_count_hmc_buffers(u8 hmc_index, unsigned int *valid,
+				     unsigned int *free)
+{
+	struct ibmvmc_buffer *buffer;
+	unsigned long i;
+	unsigned long flags;
+
+	if (hmc_index > ibmvmc.max_hmc_index)
+		return;
+
+	if (!valid || !free)
+		return;
+
+	*valid = 0; *free = 0;
+
+	buffer = hmcs[hmc_index].buffer;
+	spin_lock_irqsave(&hmcs[hmc_index].lock, flags);
+
+	for (i = 0; i < ibmvmc_max_buf_pool_size; i++) {
+		if (buffer[i].valid) {
+			*valid = *valid + 1;
+			if (buffer[i].free)
+				*free = *free + 1;
+		}
+	}
+
+	spin_unlock_irqrestore(&hmcs[hmc_index].lock, flags);
+}
+
+/**
+ * ibmvmc_get_free_hmc - Get Free HMC
+ *
+ * Return:
+ *	Pointer to an available HMC Connection
+ *	Null otherwise
+ */
+static struct ibmvmc_hmc *ibmvmc_get_free_hmc(void)
+{
+	unsigned long i;
+	unsigned long flags;
+
+	/*
+	 * Find an available HMC connection.
+	 */
+	for (i = 0; i <= ibmvmc.max_hmc_index; i++) {
+		spin_lock_irqsave(&hmcs[i].lock, flags);
+		if (hmcs[i].state == ibmhmc_state_free) {
+			hmcs[i].index = i;
+			hmcs[i].state = ibmhmc_state_initial;
+			spin_unlock_irqrestore(&hmcs[i].lock, flags);
+			return &hmcs[i];
+		}
+		spin_unlock_irqrestore(&hmcs[i].lock, flags);
+	}
+
+	return NULL;
+}
+
+/**
+ * ibmvmc_return_hmc - Return an HMC Connection
+ *
+ * @hmc:		ibmvmc_hmc struct
+ * @release_readers:	Number of readers connected to session
+ *
+ * This function releases the HMC connections back into the pool.
+ *
+ * Return:
+ *	0 - Success
+ *	Non-zero - Failure
+ */
+static int ibmvmc_return_hmc(struct ibmvmc_hmc *hmc, bool release_readers)
+{
+	struct ibmvmc_buffer *buffer;
+	struct crq_server_adapter *adapter;
+	struct vio_dev *vdev;
+	unsigned long i;
+	unsigned long flags;
+
+	if (!hmc || !hmc->adapter)
+		return -EIO;
+
+	if (release_readers) {
+		if (hmc->file_session) {
+			struct ibmvmc_file_session *session = hmc->file_session;
+
+			session->valid = 0;
+			wake_up_interruptible(&ibmvmc_read_wait);
+		}
+	}
+
+	adapter = hmc->adapter;
+	vdev = to_vio_dev(adapter->dev);
+
+	spin_lock_irqsave(&hmc->lock, flags);
+	hmc->index = 0;
+	hmc->state = ibmhmc_state_free;
+	hmc->queue_head = 0;
+	hmc->queue_tail = 0;
+	buffer = hmc->buffer;
+	for (i = 0; i < ibmvmc_max_buf_pool_size; i++) {
+		if (buffer[i].valid) {
+			free_dma_buffer(vdev,
+					ibmvmc.max_mtu,
+					buffer[i].real_addr_local,
+					buffer[i].dma_addr_local);
+			dev_dbg(adapter->dev, "Forgot buffer id 0x%lx\n", i);
+		}
+		memset(&buffer[i], 0, sizeof(struct ibmvmc_buffer));
+
+		hmc->queue_outbound_msgs[i] = VMC_INVALID_BUFFER_ID;
+	}
+
+	spin_unlock_irqrestore(&hmc->lock, flags);
+
+	return 0;
+}
+
+/**
+ * ibmvmc_send_open - Interface Open
+ * @buffer: Pointer to ibmvmc_buffer struct
+ * @hmc: Pointer to ibmvmc_hmc struct
+ *
+ * This command is sent by the management partition as the result of a
+ * management partition device request. It causes the hypervisor to
+ * prepare a set of data buffers for the Novalink connection indicated HMC idx.
+ * A unique HMC Idx would be used if multiple management applications running
+ * concurrently were desired. Before responding to this command, the
+ * hypervisor must provide the management partition with at least one of these
+ * new buffers via the Add Buffer. This indicates whether the messages are
+ * inbound or outbound from the hypervisor.
+ *
+ * Return:
+ *	0 - Success
+ *	Non-zero - Failure
+ */
+static int ibmvmc_send_open(struct ibmvmc_buffer *buffer,
+			    struct ibmvmc_hmc *hmc)
+{
+	struct ibmvmc_crq_msg crq_msg;
+	struct crq_server_adapter *adapter;
+	__be64 *crq_as_u64 = (__be64 *)&crq_msg;
+	int rc = 0;
+
+	if (!hmc || !hmc->adapter)
+		return -EIO;
+
+	adapter = hmc->adapter;
+
+	dev_dbg(adapter->dev, "send_open: 0x%lx 0x%lx 0x%lx 0x%lx 0x%lx\n",
+		(unsigned long)buffer->size, (unsigned long)adapter->liobn,
+		(unsigned long)buffer->dma_addr_local,
+		(unsigned long)adapter->riobn,
+		(unsigned long)buffer->dma_addr_remote);
+
+	rc = h_copy_rdma(buffer->size,
+			 adapter->liobn,
+			 buffer->dma_addr_local,
+			 adapter->riobn,
+			 buffer->dma_addr_remote);
+	if (rc) {
+		dev_err(adapter->dev, "Error: In send_open, h_copy_rdma rc 0x%x\n",
+			rc);
+		return -EIO;
+	}
+
+	hmc->state = ibmhmc_state_opening;
+
+	crq_msg.valid = 0x80;
+	crq_msg.type = VMC_MSG_OPEN;
+	crq_msg.status = 0;
+	crq_msg.var1.rsvd = 0;
+	crq_msg.hmc_session = hmc->session;
+	crq_msg.hmc_index = hmc->index;
+	crq_msg.var2.buffer_id = cpu_to_be16(buffer->id);
+	crq_msg.rsvd = 0;
+	crq_msg.var3.rsvd = 0;
+
+	ibmvmc_send_crq(adapter, be64_to_cpu(crq_as_u64[0]),
+			be64_to_cpu(crq_as_u64[1]));
+
+	return rc;
+}
+
+/**
+ * ibmvmc_send_close - Interface Close
+ * @hmc: Pointer to ibmvmc_hmc struct
+ *
+ * This command is sent by the management partition to terminate an Novalink to
+ * hypervisor connection. When this command is sent, the management
+ * partition has quiesced all I/O operations to all buffers associated with
+ * this Novalink connection, and has freed any storage for these buffers.
+ *
+ * Return:
+ *	0 - Success
+ *	Non-zero - Failure
+ */
+static int ibmvmc_send_close(struct ibmvmc_hmc *hmc)
+{
+	struct ibmvmc_crq_msg crq_msg;
+	struct crq_server_adapter *adapter;
+	__be64 *crq_as_u64 = (__be64 *)&crq_msg;
+	int rc = 0;
+
+	if (!hmc || !hmc->adapter)
+		return -EIO;
+
+	adapter = hmc->adapter;
+
+	dev_info(adapter->dev, "CRQ send: close\n");
+
+	crq_msg.valid = 0x80;
+	crq_msg.type = VMC_MSG_CLOSE;
+	crq_msg.status = 0;
+	crq_msg.var1.rsvd = 0;
+	crq_msg.hmc_session = hmc->session;
+	crq_msg.hmc_index = hmc->index;
+	crq_msg.var2.rsvd = 0;
+	crq_msg.rsvd = 0;
+	crq_msg.var3.rsvd = 0;
+
+	ibmvmc_send_crq(adapter, be64_to_cpu(crq_as_u64[0]),
+			be64_to_cpu(crq_as_u64[1]));
+
+	return rc;
+}
+
+/**
+ * ibmvmc_send_capabilities - Send VMC Capabilities
+ *
+ * @adapter:	crq_server_adapter struct
+ *
+ * The capabilities message is an administrative message sent after the CRQ
+ * initialization sequence of messages and is used to exchange VMC capabilities
+ * between the management partition and the hypervisor. The management
+ * partition must send this message and the hypervisor must respond with VMC
+ * capabilities Response message before HMC interface message can begin. Any
+ * HMC interface messages received before the exchange of capabilities has
+ * complete are dropped.
+ *
+ * Return:
+ *	0 - Success
+ */
+static int ibmvmc_send_capabilities(struct crq_server_adapter *adapter)
+{
+	struct ibmvmc_admin_crq_msg crq_msg;
+	__be64 *crq_as_u64 = (__be64 *)&crq_msg;
+
+	dev_dbg(adapter->dev, "ibmvmc: CRQ send: capabilities\n");
+	crq_msg.valid = 0x80;
+	crq_msg.type = VMC_MSG_CAP;
+	crq_msg.status = 0;
+	crq_msg.rsvd[0] = 0;
+	crq_msg.rsvd[1] = 0;
+	crq_msg.max_hmc = ibmvmc_max_hmcs;
+	crq_msg.max_mtu = cpu_to_be32(ibmvmc_max_mtu);
+	crq_msg.pool_size = cpu_to_be16(ibmvmc_max_buf_pool_size);
+	crq_msg.crq_size = cpu_to_be16(adapter->queue.size);
+	crq_msg.version = cpu_to_be16(IBMVMC_PROTOCOL_VERSION);
+
+	ibmvmc_send_crq(adapter, be64_to_cpu(crq_as_u64[0]),
+			be64_to_cpu(crq_as_u64[1]));
+
+	ibmvmc.state = ibmvmc_state_capabilities;
+
+	return 0;
+}
+
+/**
+ * ibmvmc_send_add_buffer_resp - Add Buffer Response
+ *
+ * @adapter:	crq_server_adapter struct
+ * @status:	Status field
+ * @hmc_session: HMC Session field
+ * @hmc_index:	HMC Index field
+ * @buffer_id:	Buffer Id field
+ *
+ * This command is sent by the management partition to the hypervisor in
+ * response to the Add Buffer message. The Status field indicates the result of
+ * the command.
+ *
+ * Return:
+ *	0 - Success
+ */
+static int ibmvmc_send_add_buffer_resp(struct crq_server_adapter *adapter,
+				       u8 status, u8 hmc_session,
+				       u8 hmc_index, u16 buffer_id)
+{
+	struct ibmvmc_crq_msg crq_msg;
+	__be64 *crq_as_u64 = (__be64 *)&crq_msg;
+
+	dev_dbg(adapter->dev, "CRQ send: add_buffer_resp\n");
+	crq_msg.valid = 0x80;
+	crq_msg.type = VMC_MSG_ADD_BUF_RESP;
+	crq_msg.status = status;
+	crq_msg.var1.rsvd = 0;
+	crq_msg.hmc_session = hmc_session;
+	crq_msg.hmc_index = hmc_index;
+	crq_msg.var2.buffer_id = cpu_to_be16(buffer_id);
+	crq_msg.rsvd = 0;
+	crq_msg.var3.rsvd = 0;
+
+	ibmvmc_send_crq(adapter, be64_to_cpu(crq_as_u64[0]),
+			be64_to_cpu(crq_as_u64[1]));
+
+	return 0;
+}
+
+/**
+ * ibmvmc_send_rem_buffer_resp - Remove Buffer Response
+ *
+ * @adapter:	crq_server_adapter struct
+ * @status:	Status field
+ * @hmc_session: HMC Session field
+ * @hmc_index:	HMC Index field
+ * @buffer_id:	Buffer Id field
+ *
+ * This command is sent by the management partition to the hypervisor in
+ * response to the Remove Buffer message. The Buffer ID field indicates
+ * which buffer the management partition selected to remove. The Status
+ * field indicates the result of the command.
+ *
+ * Return:
+ *	0 - Success
+ */
+static int ibmvmc_send_rem_buffer_resp(struct crq_server_adapter *adapter,
+				       u8 status, u8 hmc_session,
+				       u8 hmc_index, u16 buffer_id)
+{
+	struct ibmvmc_crq_msg crq_msg;
+	__be64 *crq_as_u64 = (__be64 *)&crq_msg;
+
+	dev_dbg(adapter->dev, "CRQ send: rem_buffer_resp\n");
+	crq_msg.valid = 0x80;
+	crq_msg.type = VMC_MSG_REM_BUF_RESP;
+	crq_msg.status = status;
+	crq_msg.var1.rsvd = 0;
+	crq_msg.hmc_session = hmc_session;
+	crq_msg.hmc_index = hmc_index;
+	crq_msg.var2.buffer_id = cpu_to_be16(buffer_id);
+	crq_msg.rsvd = 0;
+	crq_msg.var3.rsvd = 0;
+
+	ibmvmc_send_crq(adapter, be64_to_cpu(crq_as_u64[0]),
+			be64_to_cpu(crq_as_u64[1]));
+
+	return 0;
+}
+
+/**
+ * ibmvmc_send_msg - Signal Message
+ *
+ * @adapter:	crq_server_adapter struct
+ * @buffer:	ibmvmc_buffer struct
+ * @hmc:	ibmvmc_hmc struct
+ * @msg_length:	message length field
+ *
+ * This command is sent between the management partition and the hypervisor
+ * in order to signal the arrival of an HMC protocol message. The command
+ * can be sent by both the management partition and the hypervisor. It is
+ * used for all traffic between the Novalink application and the hypervisor,
+ * regardless of who initiated the communication.
+ *
+ * There is no response to this message.
+ *
+ * Return:
+ *	0 - Success
+ *	Non-zero - Failure
+ */
+static int ibmvmc_send_msg(struct crq_server_adapter *adapter,
+			   struct ibmvmc_buffer *buffer,
+			   struct ibmvmc_hmc *hmc, int msg_len)
+{
+	struct ibmvmc_crq_msg crq_msg;
+	__be64 *crq_as_u64 = (__be64 *)&crq_msg;
+	int rc = 0;
+
+	dev_dbg(adapter->dev, "CRQ send: rdma to HV\n");
+	rc = h_copy_rdma(msg_len,
+			 adapter->liobn,
+			 buffer->dma_addr_local,
+			 adapter->riobn,
+			 buffer->dma_addr_remote);
+	if (rc) {
+		dev_err(adapter->dev, "Error in send_msg, h_copy_rdma rc 0x%x\n",
+			rc);
+		return rc;
+	}
+
+	crq_msg.valid = 0x80;
+	crq_msg.type = VMC_MSG_SIGNAL;
+	crq_msg.status = 0;
+	crq_msg.var1.rsvd = 0;
+	crq_msg.hmc_session = hmc->session;
+	crq_msg.hmc_index = hmc->index;
+	crq_msg.var2.buffer_id = cpu_to_be16(buffer->id);
+	crq_msg.var3.msg_len = cpu_to_be32(msg_len);
+	dev_dbg(adapter->dev, "CRQ send: msg to HV 0x%llx 0x%llx\n",
+		be64_to_cpu(crq_as_u64[0]), be64_to_cpu(crq_as_u64[1]));
+
+	buffer->owner = VMC_BUF_OWNER_HV;
+	ibmvmc_send_crq(adapter, be64_to_cpu(crq_as_u64[0]),
+			be64_to_cpu(crq_as_u64[1]));
+
+	return rc;
+}
+
+/**
+ * ibmvmc_open - Open Session
+ *
+ * @inode:	inode struct
+ * @file:	file struct
+ *
+ * Return:
+ *	0 - Success
+ */
+static int ibmvmc_open(struct inode *inode, struct file *file)
+{
+	struct ibmvmc_file_session *session;
+	int rc = 0;
+
+	pr_debug("%s: inode = 0x%lx, file = 0x%lx, state = 0x%x\n", __func__,
+		 (unsigned long)inode, (unsigned long)file,
+		 ibmvmc.state);
+
+	session = kzalloc(sizeof(*session), GFP_KERNEL);
+	session->file = file;
+	file->private_data = session;
+
+	return rc;
+}
+
+/**
+ * ibmvmc_close - Close Session
+ *
+ * @inode:	inode struct
+ * @file:	file struct
+ *
+ * Return:
+ *	0 - Success
+ *	Non-zero - Failure
+ */
+static int ibmvmc_close(struct inode *inode, struct file *file)
+{
+	struct ibmvmc_file_session *session;
+	struct ibmvmc_hmc *hmc;
+	int rc = 0;
+	unsigned long flags;
+
+	pr_debug("%s: file = 0x%lx, state = 0x%x\n", __func__,
+		 (unsigned long)file, ibmvmc.state);
+
+	session = file->private_data;
+	if (!session)
+		return -EIO;
+
+	hmc = session->hmc;
+	if (hmc) {
+		if (!hmc->adapter)
+			return -EIO;
+
+		if (ibmvmc.state == ibmvmc_state_failed) {
+			dev_warn(hmc->adapter->dev, "close: state_failed\n");
+			return -EIO;
+		}
+
+		spin_lock_irqsave(&hmc->lock, flags);
+		if (hmc->state >= ibmhmc_state_opening) {
+			rc = ibmvmc_send_close(hmc);
+			if (rc)
+				dev_warn(hmc->adapter->dev, "close: send_close failed.\n");
+		}
+		spin_unlock_irqrestore(&hmc->lock, flags);
+	}
+
+	kzfree(session);
+
+	return rc;
+}
+
+/**
+ * ibmvmc_read - Read
+ *
+ * @file:	file struct
+ * @buf:	Character buffer
+ * @nbytes:	Size in bytes
+ * @ppos:	Offset
+ *
+ * Return:
+ *	0 - Success
+ *	Non-zero - Failure
+ */
+static ssize_t ibmvmc_read(struct file *file, char *buf, size_t nbytes,
+			   loff_t *ppos)
+{
+	struct ibmvmc_file_session *session;
+	struct ibmvmc_hmc *hmc;
+	struct crq_server_adapter *adapter;
+	struct ibmvmc_buffer *buffer;
+	ssize_t n;
+	ssize_t retval = 0;
+	unsigned long flags;
+	DEFINE_WAIT(wait);
+
+	pr_debug("ibmvmc: read: file = 0x%lx, buf = 0x%lx, nbytes = 0x%lx\n",
+		 (unsigned long)file, (unsigned long)buf,
+		 (unsigned long)nbytes);
+
+	if (nbytes == 0)
+		return 0;
+
+	if (nbytes > ibmvmc.max_mtu) {
+		pr_warn("ibmvmc: read: nbytes invalid 0x%x\n",
+			(unsigned int)nbytes);
+		return -EINVAL;
+	}
+
+	session = file->private_data;
+	if (!session) {
+		pr_warn("ibmvmc: read: no session\n");
+		return -EIO;
+	}
+
+	hmc = session->hmc;
+	if (!hmc) {
+		pr_warn("ibmvmc: read: no hmc\n");
+		return -EIO;
+	}
+
+	adapter = hmc->adapter;
+	if (!adapter) {
+		pr_warn("ibmvmc: read: no adapter\n");
+		return -EIO;
+	}
+
+	do {
+		prepare_to_wait(&ibmvmc_read_wait, &wait, TASK_INTERRUPTIBLE);
+
+		spin_lock_irqsave(&hmc->lock, flags);
+		if (hmc->queue_tail != hmc->queue_head)
+			/* Data is available */
+			break;
+
+		spin_unlock_irqrestore(&hmc->lock, flags);
+
+		if (!session->valid) {
+			retval = -EBADFD;
+			goto out;
+		}
+		if (file->f_flags & O_NONBLOCK) {
+			retval = -EAGAIN;
+			goto out;
+		}
+
+		schedule();
+
+		if (signal_pending(current)) {
+			retval = -ERESTARTSYS;
+			goto out;
+		}
+	} while (1);
+
+	buffer = &(hmc->buffer[hmc->queue_outbound_msgs[hmc->queue_tail]]);
+	hmc->queue_tail++;
+	if (hmc->queue_tail == ibmvmc_max_buf_pool_size)
+		hmc->queue_tail = 0;
+	spin_unlock_irqrestore(&hmc->lock, flags);
+
+	nbytes = min_t(size_t, nbytes, buffer->msg_len);
+	n = copy_to_user((void *)buf, buffer->real_addr_local, nbytes);
+	dev_dbg(adapter->dev, "read: copy to user nbytes = 0x%lx.\n", nbytes);
+	ibmvmc_free_hmc_buffer(hmc, buffer);
+	retval = nbytes;
+
+	if (n) {
+		dev_warn(adapter->dev, "read: copy to user failed.\n");
+		retval = -EFAULT;
+	}
+
+ out:
+	finish_wait(&ibmvmc_read_wait, &wait);
+	dev_dbg(adapter->dev, "read: out %ld\n", retval);
+	return retval;
+}
+
+/**
+ * ibmvmc_poll - Poll
+ *
+ * @file:	file struct
+ * @wait:	Poll Table
+ *
+ * Return:
+ *	poll.h return values
+ */
+static unsigned int ibmvmc_poll(struct file *file, poll_table *wait)
+{
+	struct ibmvmc_file_session *session;
+	struct ibmvmc_hmc *hmc;
+	unsigned int mask = 0;
+
+	session = file->private_data;
+	if (!session)
+		return 0;
+
+	hmc = session->hmc;
+	if (!hmc)
+		return 0;
+
+	poll_wait(file, &ibmvmc_read_wait, wait);
+
+	if (hmc->queue_head != hmc->queue_tail)
+		mask |= POLLIN | POLLRDNORM;
+
+	return mask;
+}
+
+/**
+ * ibmvmc_write - Write
+ *
+ * @file:	file struct
+ * @buf:	Character buffer
+ * @count:	Count field
+ * @ppos:	Offset
+ *
+ * Return:
+ *	0 - Success
+ *	Non-zero - Failure
+ */
+static ssize_t ibmvmc_write(struct file *file, const char *buffer,
+			    size_t count, loff_t *ppos)
+{
+	struct ibmvmc_buffer *vmc_buffer;
+	struct ibmvmc_file_session *session;
+	struct crq_server_adapter *adapter;
+	struct ibmvmc_hmc *hmc;
+	unsigned char *buf;
+	unsigned long flags;
+	size_t bytes;
+	const char *p = buffer;
+	size_t c = count;
+	int ret = 0;
+
+	session = file->private_data;
+	if (!session)
+		return -EIO;
+
+	hmc = session->hmc;
+	if (!hmc)
+		return -EIO;
+
+	spin_lock_irqsave(&hmc->lock, flags);
+	if (hmc->state == ibmhmc_state_free) {
+		/* HMC connection is not valid (possibly was reset under us). */
+		ret = -EIO;
+		goto out;
+	}
+
+	adapter = hmc->adapter;
+	if (!adapter) {
+		ret = -EIO;
+		goto out;
+	}
+
+	if (count > ibmvmc.max_mtu) {
+		dev_warn(adapter->dev, "invalid buffer size 0x%lx\n",
+			 (unsigned long)count);
+		ret = -EIO;
+		goto out;
+	}
+
+	/* Waiting for the open resp message to the ioctl(1) - retry */
+	if (hmc->state == ibmhmc_state_opening) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	/* Make sure the ioctl() was called & the open msg sent, and that
+	 * the HMC connection has not failed.
+	 */
+	if (hmc->state != ibmhmc_state_ready) {
+		ret = -EIO;
+		goto out;
+	}
+
+	vmc_buffer = ibmvmc_get_valid_hmc_buffer(hmc->index);
+	if (!vmc_buffer) {
+		/* No buffer available for the msg send, or we have not yet
+		 * completed the open/open_resp sequence.  Retry until this is
+		 * complete.
+		 */
+		ret = -EBUSY;
+		goto out;
+	}
+	if (!vmc_buffer->real_addr_local) {
+		dev_err(adapter->dev, "no buffer storage assigned\n");
+		ret = -EIO;
+		goto out;
+	}
+	buf = vmc_buffer->real_addr_local;
+
+	while (c > 0) {
+		bytes = min_t(size_t, c, vmc_buffer->size);
+
+		bytes -= copy_from_user(buf, p, bytes);
+		if (!bytes) {
+			ret = -EFAULT;
+			goto out;
+		}
+		c -= bytes;
+		p += bytes;
+	}
+	if (p == buffer)
+		goto out;
+
+	file->f_path.dentry->d_inode->i_mtime = current_time(file_inode(file));
+	mark_inode_dirty(file->f_path.dentry->d_inode);
+
+	dev_dbg(adapter->dev, "write: file = 0x%lx, count = 0x%lx\n",
+		(unsigned long)file, (unsigned long)count);
+
+	ibmvmc_send_msg(adapter, vmc_buffer, hmc, count);
+	ret = p - buffer;
+ out:
+	spin_unlock_irqrestore(&hmc->lock, flags);
+	return (ssize_t)(ret);
+}
+
+/**
+ * ibmvmc_setup_hmc - Setup the HMC
+ *
+ * @session:	ibmvmc_file_session struct
+ *
+ * Return:
+ *	0 - Success
+ *	Non-zero - Failure
+ */
+static long ibmvmc_setup_hmc(struct ibmvmc_file_session *session)
+{
+	struct ibmvmc_hmc *hmc;
+	unsigned int valid, free, index;
+
+	if (ibmvmc.state == ibmvmc_state_failed) {
+		pr_warn("ibmvmc: Reserve HMC: state_failed\n");
+		return -EIO;
+	}
+
+	if (ibmvmc.state < ibmvmc_state_ready) {
+		pr_warn("ibmvmc: Reserve HMC: not state_ready\n");
+		return -EAGAIN;
+	}
+
+	/* Device is busy until capabilities have been exchanged and we
+	 * have a generic buffer for each possible HMC connection.
+	 */
+	for (index = 0; index <= ibmvmc.max_hmc_index; index++) {
+		valid = 0;
+		ibmvmc_count_hmc_buffers(index, &valid, &free);
+		if (valid == 0) {
+			pr_warn("ibmvmc: buffers not ready for index %d\n",
+				index);
+			return -ENOBUFS;
+		}
+	}
+
+	/* Get an hmc object, and transition to ibmhmc_state_initial */
+	hmc = ibmvmc_get_free_hmc();
+	if (!hmc) {
+		pr_warn("%s: free hmc not found\n", __func__);
+		return -EBUSY;
+	}
+
+	hmc->session = hmc->session + 1;
+	if (hmc->session == 0xff)
+		hmc->session = 1;
+
+	session->hmc = hmc;
+	hmc->adapter = &ibmvmc_adapter;
+	hmc->file_session = session;
+	session->valid = 1;
+
+	return 0;
+}
+
+/**
+ * ibmvmc_ioctl_sethmcid - IOCTL Set HMC ID
+ *
+ * @session:	ibmvmc_file_session struct
+ * @new_hmc_id:	HMC id field
+ *
+ * IOCTL command to setup the hmc id
+ *
+ * Return:
+ *	0 - Success
+ *	Non-zero - Failure
+ */
+static long ibmvmc_ioctl_sethmcid(struct ibmvmc_file_session *session,
+				  unsigned char __user *new_hmc_id)
+{
+	struct ibmvmc_hmc *hmc;
+	struct ibmvmc_buffer *buffer;
+	size_t bytes;
+	char print_buffer[HMC_ID_LEN + 1];
+	unsigned long flags;
+	long rc = 0;
+
+	/* Reserve HMC session */
+	hmc = session->hmc;
+	if (!hmc) {
+		rc = ibmvmc_setup_hmc(session);
+		if (rc)
+			return rc;
+
+		hmc = session->hmc;
+		if (!hmc) {
+			pr_err("ibmvmc: setup_hmc success but no hmc\n");
+			return -EIO;
+		}
+	}
+
+	if (hmc->state != ibmhmc_state_initial) {
+		pr_warn("ibmvmc: sethmcid: invalid state to send open 0x%x\n",
+			hmc->state);
+		return -EIO;
+	}
+
+	bytes = copy_from_user(hmc->hmc_id, new_hmc_id, HMC_ID_LEN);
+	if (bytes)
+		return -EFAULT;
+
+	/* Send Open Session command */
+	spin_lock_irqsave(&hmc->lock, flags);
+	buffer = ibmvmc_get_valid_hmc_buffer(hmc->index);
+	spin_unlock_irqrestore(&hmc->lock, flags);
+
+	if (!buffer || !buffer->real_addr_local) {
+		pr_warn("ibmvmc: sethmcid: no buffer available\n");
+		return -EIO;
+	}
+
+	/* Make sure buffer is NULL terminated before trying to print it */
+	memset(print_buffer, 0, HMC_ID_LEN + 1);
+	strncpy(print_buffer, hmc->hmc_id, HMC_ID_LEN);
+	pr_info("ibmvmc: sethmcid: Set HMC ID: \"%s\"\n", print_buffer);
+
+	memcpy(buffer->real_addr_local, hmc->hmc_id, HMC_ID_LEN);
+	/* RDMA over ID, send open msg, change state to ibmhmc_state_opening */
+	rc = ibmvmc_send_open(buffer, hmc);
+
+	return rc;
+}
+
+/**
+ * ibmvmc_ioctl_query - IOCTL Query
+ *
+ * @session:	ibmvmc_file_session struct
+ * @ret_struct:	ibmvmc_query_struct
+ *
+ * Return:
+ *	0 - Success
+ *	Non-zero - Failure
+ */
+static long ibmvmc_ioctl_query(struct ibmvmc_file_session *session,
+			       struct ibmvmc_query_struct __user *ret_struct)
+{
+	struct ibmvmc_query_struct query_struct;
+	size_t bytes;
+
+	memset(&query_struct, 0, sizeof(query_struct));
+	query_struct.have_vmc = (ibmvmc.state > ibmvmc_state_initial);
+	query_struct.state = ibmvmc.state;
+	query_struct.vmc_drc_index = ibmvmc.vmc_drc_index;
+
+	bytes = copy_to_user(ret_struct, &query_struct,
+			     sizeof(query_struct));
+	if (bytes)
+		return -EFAULT;
+
+	return 0;
+}
+
+/**
+ * ibmvmc_ioctl_requestvmc - IOCTL Request VMC
+ *
+ * @session:	ibmvmc_file_session struct
+ * @ret_vmc_index:	VMC Index
+ *
+ * Return:
+ *	0 - Success
+ *	Non-zero - Failure
+ */
+static long ibmvmc_ioctl_requestvmc(struct ibmvmc_file_session *session,
+				    u32 __user *ret_vmc_index)
+{
+	/* TODO: (adreznec) Add locking to control multiple process access */
+	size_t bytes;
+	long rc;
+	u32 vmc_drc_index;
+
+	/* Call to request the VMC device from phyp*/
+	rc = h_request_vmc(&vmc_drc_index);
+	pr_debug("ibmvmc: requestvmc: H_REQUEST_VMC rc = 0x%lx\n", rc);
+
+	if (rc == H_SUCCESS) {
+		rc = 0;
+	} else if (rc == H_FUNCTION) {
+		pr_err("ibmvmc: requestvmc: h_request_vmc not supported\n");
+		return -EPERM;
+	} else if (rc == H_AUTHORITY) {
+		pr_err("ibmvmc: requestvmc: hypervisor denied vmc request\n");
+		return -EPERM;
+	} else if (rc == H_HARDWARE) {
+		pr_err("ibmvmc: requestvmc: hypervisor hardware fault\n");
+		return -EIO;
+	} else if (rc == H_RESOURCE) {
+		pr_err("ibmvmc: requestvmc: vmc resource unavailable\n");
+		return -ENODEV;
+	} else if (rc == H_NOT_AVAILABLE) {
+		pr_err("ibmvmc: requestvmc: system cannot be vmc managed\n");
+		return -EPERM;
+	} else if (rc == H_PARAMETER) {
+		pr_err("ibmvmc: requestvmc: invalid parameter\n");
+		return -EINVAL;
+	}
+
+	/* Success, set the vmc index in global struct */
+	ibmvmc.vmc_drc_index = vmc_drc_index;
+
+	bytes = copy_to_user(ret_vmc_index, &vmc_drc_index,
+			     sizeof(*ret_vmc_index));
+	if (bytes) {
+		pr_warn("ibmvmc: requestvmc: copy to user failed.\n");
+		return -EFAULT;
+	}
+	return rc;
+}
+
+/**
+ * ibmvmc_ioctl - IOCTL
+ *
+ * @session:	ibmvmc_file_session struct
+ * @cmd:	cmd field
+ * @arg:	Argument field
+ *
+ * Return:
+ *	0 - Success
+ *	Non-zero - Failure
+ */
+static long ibmvmc_ioctl(struct file *file,
+			 unsigned int cmd, unsigned long arg)
+{
+	struct ibmvmc_file_session *session = file->private_data;
+
+	pr_debug("ibmvmc: ioctl file=0x%lx, cmd=0x%x, arg=0x%lx, ses=0x%lx\n",
+		 (unsigned long)file, cmd, arg,
+		 (unsigned long)session);
+
+	if (!session) {
+		pr_warn("ibmvmc: ioctl: no session\n");
+		return -EIO;
+	}
+
+	switch (cmd) {
+	case VMC_IOCTL_SETHMCID:
+		return ibmvmc_ioctl_sethmcid(session,
+			(unsigned char __user *)arg);
+	case VMC_IOCTL_QUERY:
+		return ibmvmc_ioctl_query(session,
+			(struct ibmvmc_query_struct __user *)arg);
+	case VMC_IOCTL_REQUESTVMC:
+		return ibmvmc_ioctl_requestvmc(session,
+			(unsigned int __user *)arg);
+	default:
+		pr_warn("ibmvmc: unknown ioctl 0x%x\n", cmd);
+		return -EINVAL;
+	}
+}
+
+static const struct file_operations ibmvmc_fops = {
+	.owner		= THIS_MODULE,
+	.read		= ibmvmc_read,
+	.write		= ibmvmc_write,
+	.poll		= ibmvmc_poll,
+	.unlocked_ioctl	= ibmvmc_ioctl,
+	.open           = ibmvmc_open,
+	.release        = ibmvmc_close,
+};
+
+/**
+ * ibmvmc_add_buffer - Add Buffer
+ *
+ * @adapter: crq_server_adapter struct
+ * @crq:	ibmvmc_crq_msg struct
+ *
+ * This message transfers a buffer from hypervisor ownership to management
+ * partition ownership. The LIOBA is obtained from the virtual TCE table
+ * associated with the hypervisor side of the VMC device, and points to a
+ * buffer of size MTU (as established in the capabilities exchange).
+ *
+ * Typical flow for ading buffers:
+ * 1. A new Novalink connection is opened by the management partition.
+ * 2. The hypervisor assigns new buffers for the traffic associated with
+ *	that connection.
+ * 3. The hypervisor sends VMC Add Buffer messages to the management
+ *	partition, informing it of the new buffers.
+ * 4. The hypervisor sends an HMC protocol message (to the Novalink application)
+ *	notifying it of the new buffers. This informs the application that
+ *	it has buffers available for sending HMC commands.
+ *
+ * Return:
+ *	0 - Success
+ *	Non-zero - Failure
+ */
+static int ibmvmc_add_buffer(struct crq_server_adapter *adapter,
+			     struct ibmvmc_crq_msg *crq)
+{
+	struct ibmvmc_buffer *buffer;
+	u8 hmc_index;
+	u8 hmc_session;
+	u16 buffer_id;
+	unsigned long flags;
+	int rc = 0;
+
+	if (!crq)
+		return -1;
+
+	hmc_session = crq->hmc_session;
+	hmc_index = crq->hmc_index;
+	buffer_id = be16_to_cpu(crq->var2.buffer_id);
+
+	if (hmc_index > ibmvmc.max_hmc_index) {
+		dev_err(adapter->dev, "add_buffer: invalid hmc_index = 0x%x\n",
+			hmc_index);
+		ibmvmc_send_add_buffer_resp(adapter, VMC_MSG_INVALID_HMC_INDEX,
+					    hmc_session, hmc_index, buffer_id);
+		return -1;
+	}
+
+	if (buffer_id >= ibmvmc.max_buffer_pool_size) {
+		dev_err(adapter->dev, "add_buffer: invalid buffer_id = 0x%x\n",
+			buffer_id);
+		ibmvmc_send_add_buffer_resp(adapter, VMC_MSG_INVALID_BUFFER_ID,
+					    hmc_session, hmc_index, buffer_id);
+		return -1;
+	}
+
+	spin_lock_irqsave(&hmcs[hmc_index].lock, flags);
+	buffer = &hmcs[hmc_index].buffer[buffer_id];
+
+	if (buffer->real_addr_local || buffer->dma_addr_local) {
+		dev_warn(adapter->dev, "add_buffer: already allocated id = 0x%lx\n",
+			 (unsigned long)buffer_id);
+		spin_unlock_irqrestore(&hmcs[hmc_index].lock, flags);
+		ibmvmc_send_add_buffer_resp(adapter, VMC_MSG_INVALID_BUFFER_ID,
+					    hmc_session, hmc_index, buffer_id);
+		return -1;
+	}
+
+	buffer->real_addr_local = alloc_dma_buffer(to_vio_dev(adapter->dev),
+						   ibmvmc.max_mtu,
+						   &buffer->dma_addr_local);
+
+	if (!buffer->real_addr_local) {
+		dev_err(adapter->dev, "add_buffer: alloc_dma_buffer failed.\n");
+		spin_unlock_irqrestore(&hmcs[hmc_index].lock, flags);
+		ibmvmc_send_add_buffer_resp(adapter, VMC_MSG_INTERFACE_FAILURE,
+					    hmc_session, hmc_index, buffer_id);
+		return -1;
+	}
+
+	buffer->dma_addr_remote = be32_to_cpu(crq->var3.lioba);
+	buffer->size = ibmvmc.max_mtu;
+	buffer->owner = crq->var1.owner;
+	buffer->free = 1;
+	/* Must ensure valid==1 is observable only after all other fields are */
+	dma_wmb();
+	buffer->valid = 1;
+	buffer->id = buffer_id;
+
+	dev_dbg(adapter->dev, "add_buffer: successfully added a buffer:\n");
+	dev_dbg(adapter->dev, "   index: %d, session: %d, buffer: 0x%x, owner: %d\n",
+		hmc_index, hmc_session, buffer_id, buffer->owner);
+	dev_dbg(adapter->dev, "   local: 0x%x, remote: 0x%x\n",
+		(u32)buffer->dma_addr_local,
+		(u32)buffer->dma_addr_remote);
+	spin_unlock_irqrestore(&hmcs[hmc_index].lock, flags);
+
+	ibmvmc_send_add_buffer_resp(adapter, VMC_MSG_SUCCESS, hmc_session,
+				    hmc_index, buffer_id);
+
+	return rc;
+}
+
+/**
+ * ibmvmc_rem_buffer - Remove Buffer
+ *
+ * @adapter: crq_server_adapter struct
+ * @crq:	ibmvmc_crq_msg struct
+ *
+ * This message requests an HMC buffer to be transferred from management
+ * partition ownership to hypervisor ownership. The management partition may
+ * not be able to satisfy the request at a particular point in time if all its
+ * buffers are in use. The management partition requires a depth of at least
+ * one inbound buffer to allow Novalink commands to flow to the hypervisor. It
+ * is, therefore, an interface error for the hypervisor to attempt to remove the
+ * management partition's last buffer.
+ *
+ * The hypervisor is expected to manage buffer usage with the Novalink
+ * application directly and inform the management partition when buffers may be
+ * removed. The typical flow for removing buffers:
+ *
+ * 1. The Novalink application no longer needs a communication path to a
+ *	particular hypervisor function. That function is closed.
+ * 2. The hypervisor and the Novalink application quiesce all traffic to that
+ *	function. The hypervisor requests a reduction in buffer pool size.
+ * 3. The Novalink application acknowledges the reduction in buffer pool size.
+ * 4. The hypervisor sends a Remove Buffer message to the management partition,
+ *	informing it of the reduction in buffers.
+ * 5. The management partition verifies it can remove the buffer. This is
+ *	possible if buffers have been quiesced.
+ *
+ * Return:
+ *	0 - Success
+ *	Non-zero - Failure
+ */
+/*
+ * The hypervisor requested that we pick an unused buffer, and return it.
+ * Before sending the buffer back, we free any storage associated with the
+ * buffer.
+ */
+static int ibmvmc_rem_buffer(struct crq_server_adapter *adapter,
+			     struct ibmvmc_crq_msg *crq)
+{
+	struct ibmvmc_buffer *buffer;
+	u8 hmc_index;
+	u8 hmc_session;
+	u16 buffer_id = 0;
+	unsigned long flags;
+	int rc = 0;
+
+	if (!crq)
+		return -1;
+
+	hmc_session = crq->hmc_session;
+	hmc_index = crq->hmc_index;
+
+	if (hmc_index > ibmvmc.max_hmc_index) {
+		dev_warn(adapter->dev, "rem_buffer: invalid hmc_index = 0x%x\n",
+			 hmc_index);
+		ibmvmc_send_rem_buffer_resp(adapter, VMC_MSG_INVALID_HMC_INDEX,
+					    hmc_session, hmc_index, buffer_id);
+		return -1;
+	}
+
+	spin_lock_irqsave(&hmcs[hmc_index].lock, flags);
+	buffer = ibmvmc_get_free_hmc_buffer(adapter, hmc_index);
+	if (!buffer) {
+		dev_info(adapter->dev, "rem_buffer: no buffer to remove\n");
+		spin_unlock_irqrestore(&hmcs[hmc_index].lock, flags);
+		ibmvmc_send_rem_buffer_resp(adapter, VMC_MSG_NO_BUFFER,
+					    hmc_session, hmc_index,
+					    VMC_INVALID_BUFFER_ID);
+		return -1;
+	}
+
+	buffer_id = buffer->id;
+
+	if (buffer->valid)
+		free_dma_buffer(to_vio_dev(adapter->dev),
+				ibmvmc.max_mtu,
+				buffer->real_addr_local,
+				buffer->dma_addr_local);
+
+	memset(buffer, 0, sizeof(struct ibmvmc_buffer));
+	spin_unlock_irqrestore(&hmcs[hmc_index].lock, flags);
+
+	dev_dbg(adapter->dev, "rem_buffer: removed buffer 0x%x.\n", buffer_id);
+	ibmvmc_send_rem_buffer_resp(adapter, VMC_MSG_SUCCESS, hmc_session,
+				    hmc_index, buffer_id);
+
+	return rc;
+}
+
+static int ibmvmc_recv_msg(struct crq_server_adapter *adapter,
+			   struct ibmvmc_crq_msg *crq)
+{
+	struct ibmvmc_buffer *buffer;
+	struct ibmvmc_hmc *hmc;
+	unsigned long msg_len;
+	u8 hmc_index;
+	u8 hmc_session;
+	u16 buffer_id;
+	unsigned long flags;
+	int rc = 0;
+
+	if (!crq)
+		return -1;
+
+	/* Hypervisor writes CRQs directly into our memory in big endian */
+	dev_dbg(adapter->dev, "Recv_msg: msg from HV 0x%016llx 0x%016llx\n",
+		be64_to_cpu(*((unsigned long *)crq)),
+		be64_to_cpu(*(((unsigned long *)crq) + 1)));
+
+	hmc_session = crq->hmc_session;
+	hmc_index = crq->hmc_index;
+	buffer_id = be16_to_cpu(crq->var2.buffer_id);
+	msg_len = be32_to_cpu(crq->var3.msg_len);
+
+	if (hmc_index > ibmvmc.max_hmc_index) {
+		dev_err(adapter->dev, "Recv_msg: invalid hmc_index = 0x%x\n",
+			hmc_index);
+		ibmvmc_send_add_buffer_resp(adapter, VMC_MSG_INVALID_HMC_INDEX,
+					    hmc_session, hmc_index, buffer_id);
+		return -1;
+	}
+
+	if (buffer_id >= ibmvmc.max_buffer_pool_size) {
+		dev_err(adapter->dev, "Recv_msg: invalid buffer_id = 0x%x\n",
+			buffer_id);
+		ibmvmc_send_add_buffer_resp(adapter, VMC_MSG_INVALID_BUFFER_ID,
+					    hmc_session, hmc_index, buffer_id);
+		return -1;
+	}
+
+	hmc = &hmcs[hmc_index];
+	spin_lock_irqsave(&hmc->lock, flags);
+
+	if (hmc->state == ibmhmc_state_free) {
+		dev_err(adapter->dev, "Recv_msg: invalid hmc state = 0x%x\n",
+			hmc->state);
+		/* HMC connection is not valid (possibly was reset under us). */
+		spin_unlock_irqrestore(&hmc->lock, flags);
+		return -1;
+	}
+
+	buffer = &hmc->buffer[buffer_id];
+
+	if (buffer->valid == 0 || buffer->owner == VMC_BUF_OWNER_ALPHA) {
+		dev_err(adapter->dev, "Recv_msg: not valid, or not HV.  0x%x 0x%x\n",
+			buffer->valid, buffer->owner);
+		spin_unlock_irqrestore(&hmc->lock, flags);
+		return -1;
+	}
+
+	/* RDMA the data into the partition. */
+	rc = h_copy_rdma(msg_len,
+			 adapter->riobn,
+			 buffer->dma_addr_remote,
+			 adapter->liobn,
+			 buffer->dma_addr_local);
+
+	dev_dbg(adapter->dev, "Recv_msg: msg_len = 0x%x, buffer_id = 0x%x, queue_head = 0x%x, hmc_idx = 0x%x\n",
+		(unsigned int)msg_len, (unsigned int)buffer_id,
+		(unsigned int)hmc->queue_head, (unsigned int)hmc_index);
+	buffer->msg_len = msg_len;
+	buffer->free = 0;
+	buffer->owner = VMC_BUF_OWNER_ALPHA;
+
+	if (rc) {
+		dev_err(adapter->dev, "Failure in recv_msg: h_copy_rdma = 0x%x\n",
+			rc);
+		spin_unlock_irqrestore(&hmc->lock, flags);
+		return -1;
+	}
+
+	/* Must be locked because read operates on the same data */
+	hmc->queue_outbound_msgs[hmc->queue_head] = buffer_id;
+	hmc->queue_head++;
+	if (hmc->queue_head == ibmvmc_max_buf_pool_size)
+		hmc->queue_head = 0;
+
+	if (hmc->queue_head == hmc->queue_tail)
+		dev_err(adapter->dev, "outbound buffer queue wrapped.\n");
+
+	spin_unlock_irqrestore(&hmc->lock, flags);
+
+	wake_up_interruptible(&ibmvmc_read_wait);
+
+	return 0;
+}
+
+/**
+ * ibmvmc_process_capabilities - Process Capabilities
+ *
+ * @adapter:	crq_server_adapter struct
+ * @crqp:	ibmvmc_crq_msg struct
+ *
+ */
+static void ibmvmc_process_capabilities(struct crq_server_adapter *adapter,
+					struct ibmvmc_crq_msg *crqp)
+{
+	struct ibmvmc_admin_crq_msg *crq = (struct ibmvmc_admin_crq_msg *)crqp;
+
+	if ((be16_to_cpu(crq->version) >> 8) !=
+			(IBMVMC_PROTOCOL_VERSION >> 8)) {
+		dev_err(adapter->dev, "init failed, incompatible versions 0x%x 0x%x\n",
+			be16_to_cpu(crq->version),
+			IBMVMC_PROTOCOL_VERSION);
+		ibmvmc.state = ibmvmc_state_failed;
+		return;
+	}
+
+	ibmvmc.max_mtu = min_t(u32, ibmvmc_max_mtu, be32_to_cpu(crq->max_mtu));
+	ibmvmc.max_buffer_pool_size = min_t(u16, ibmvmc_max_buf_pool_size,
+					    be16_to_cpu(crq->pool_size));
+	ibmvmc.max_hmc_index = min_t(u8, ibmvmc_max_hmcs, crq->max_hmc) - 1;
+	ibmvmc.state = ibmvmc_state_ready;
+
+	dev_info(adapter->dev, "Capabilities: mtu=0x%x, pool_size=0x%x, max_hmc=0x%x\n",
+		 ibmvmc.max_mtu, ibmvmc.max_buffer_pool_size,
+		 ibmvmc.max_hmc_index);
+}
+
+/**
+ * ibmvmc_validate_hmc_session - Validate HMC Session
+ *
+ * @adapter:	crq_server_adapter struct
+ * @crq:	ibmvmc_crq_msg struct
+ *
+ * Return:
+ *	0 - Success
+ *	Non-zero - Failure
+ */
+static int ibmvmc_validate_hmc_session(struct crq_server_adapter *adapter,
+				       struct ibmvmc_crq_msg *crq)
+{
+	unsigned char hmc_index;
+
+	hmc_index = crq->hmc_index;
+
+	if (crq->hmc_session == 0)
+		return 0;
+
+	if (hmc_index > ibmvmc.max_hmc_index)
+		return -1;
+
+	if (hmcs[hmc_index].session != crq->hmc_session) {
+		dev_warn(adapter->dev, "Drop, bad session: expected 0x%x, recv 0x%x\n",
+			 hmcs[hmc_index].session, crq->hmc_session);
+		return -1;
+	}
+
+	return 0;
+}
+
+/**
+ * ibmvmc_reset - Reset
+ *
+ * @adapter:	crq_server_adapter struct
+ * @xport_event:	export_event field
+ *
+ * Closes all HMC sessions and conditionally schedules a CRQ reset.
+ * @xport_event: If true, the partner closed their CRQ; we don't need to reset.
+ *               If false, we need to schedule a CRQ reset.
+ */
+static void ibmvmc_reset(struct crq_server_adapter *adapter, bool xport_event)
+{
+	int i;
+
+	if (ibmvmc.state != ibmvmc_state_sched_reset) {
+		dev_info(adapter->dev, "*** Reset to initial state.\n");
+		for (i = 0; i < ibmvmc_max_hmcs; i++)
+			ibmvmc_return_hmc(&hmcs[i], xport_event);
+
+		if (xport_event) {
+			/* CRQ was closed by the partner.  We don't need to do
+			 * anything except set ourself to the correct state to
+			 * handle init msgs.
+			 */
+			ibmvmc.state = ibmvmc_state_crqinit;
+		} else {
+			/* The partner did not close their CRQ - instead, we're
+			 * closing the CRQ on our end. Need to schedule this
+			 * for process context, because CRQ reset may require a
+			 * sleep.
+			 *
+			 * Setting ibmvmc.state here immediately prevents
+			 * ibmvmc_open from completing until the reset
+			 * completes in process context.
+			 */
+			ibmvmc.state = ibmvmc_state_sched_reset;
+			dev_dbg(adapter->dev, "Device reset scheduled");
+			wake_up_interruptible(&adapter->reset_wait_queue);
+		}
+	}
+}
+
+/**
+ * ibmvmc_reset_task - Reset Task
+ *
+ * @data:	Data field
+ *
+ * Performs a CRQ reset of the VMC device in process context.
+ * NOTE: This function should not be called directly, use ibmvmc_reset.
+ */
+static int ibmvmc_reset_task(void *data)
+{
+	struct crq_server_adapter *adapter = data;
+	int rc;
+
+	set_user_nice(current, -20);
+
+	while (!kthread_should_stop()) {
+		wait_event_interruptible(adapter->reset_wait_queue,
+			(ibmvmc.state == ibmvmc_state_sched_reset) ||
+			kthread_should_stop());
+
+		if (kthread_should_stop())
+			break;
+
+		dev_dbg(adapter->dev, "CRQ resetting in process context");
+		tasklet_disable(&adapter->work_task);
+
+		rc = ibmvmc_reset_crq_queue(adapter);
+
+		if (rc != H_SUCCESS && rc != H_RESOURCE) {
+			dev_err(adapter->dev, "Error initializing CRQ.  rc = 0x%x\n",
+				rc);
+			ibmvmc.state = ibmvmc_state_failed;
+		} else {
+			ibmvmc.state = ibmvmc_state_crqinit;
+
+			if (ibmvmc_send_crq(adapter, 0xC001000000000000LL, 0)
+			    != 0 && rc != H_RESOURCE)
+				dev_warn(adapter->dev, "Failed to send initialize CRQ message\n");
+		}
+
+		vio_enable_interrupts(to_vio_dev(adapter->dev));
+		tasklet_enable(&adapter->work_task);
+	}
+
+	return 0;
+}
+
+/**
+ * ibmvmc_process_open_resp - Process Open Response
+ *
+ * @crq: ibmvmc_crq_msg struct
+ * @adapter:    crq_server_adapter struct
+ *
+ * This command is sent by the hypervisor in response to the Interface
+ * Open message. When this message is received, the indicated buffer is
+ * again available for management partition use.
+ */
+static void ibmvmc_process_open_resp(struct ibmvmc_crq_msg *crq,
+				     struct crq_server_adapter *adapter)
+{
+	unsigned char hmc_index;
+	unsigned short buffer_id;
+
+	hmc_index = crq->hmc_index;
+	if (hmc_index > ibmvmc.max_hmc_index) {
+		/* Why would PHYP give an index > max negotiated? */
+		ibmvmc_reset(adapter, false);
+		return;
+	}
+
+	if (crq->status) {
+		dev_warn(adapter->dev, "open_resp: failed - status 0x%x\n",
+			 crq->status);
+		ibmvmc_return_hmc(&hmcs[hmc_index], false);
+		return;
+	}
+
+	if (hmcs[hmc_index].state == ibmhmc_state_opening) {
+		buffer_id = be16_to_cpu(crq->var2.buffer_id);
+		if (buffer_id >= ibmvmc.max_buffer_pool_size) {
+			dev_err(adapter->dev, "open_resp: invalid buffer_id = 0x%x\n",
+				buffer_id);
+			hmcs[hmc_index].state = ibmhmc_state_failed;
+		} else {
+			ibmvmc_free_hmc_buffer(&hmcs[hmc_index],
+					       &hmcs[hmc_index].buffer[buffer_id]);
+			hmcs[hmc_index].state = ibmhmc_state_ready;
+			dev_dbg(adapter->dev, "open_resp: set hmc state = ready\n");
+		}
+	} else {
+		dev_warn(adapter->dev, "open_resp: invalid hmc state (0x%x)\n",
+			 hmcs[hmc_index].state);
+	}
+}
+
+/**
+ * ibmvmc_process_close_resp - Process Close Response
+ *
+ * @crq: ibmvmc_crq_msg struct
+ * @adapter:    crq_server_adapter struct
+ *
+ * This command is sent by the hypervisor in response to the Novalink Interface
+ * Close message.
+ *
+ * If the close fails, simply reset the entire driver as the state of the VMC
+ * must be in tough shape.
+ */
+static void ibmvmc_process_close_resp(struct ibmvmc_crq_msg *crq,
+				      struct crq_server_adapter *adapter)
+{
+	unsigned char hmc_index;
+
+	hmc_index = crq->hmc_index;
+	if (hmc_index > ibmvmc.max_hmc_index) {
+		ibmvmc_reset(adapter, false);
+		return;
+	}
+
+	if (crq->status) {
+		dev_warn(adapter->dev, "close_resp: failed - status 0x%x\n",
+			 crq->status);
+		ibmvmc_reset(adapter, false);
+		return;
+	}
+
+	ibmvmc_return_hmc(&hmcs[hmc_index], false);
+}
+
+/**
+ * ibmvmc_crq_process - Process CRQ
+ *
+ * @adapter:    crq_server_adapter struct
+ * @crq:	ibmvmc_crq_msg struct
+ *
+ * Process the CRQ message based upon the type of message received.
+ *
+ */
+static void ibmvmc_crq_process(struct crq_server_adapter *adapter,
+			       struct ibmvmc_crq_msg *crq)
+{
+	switch (crq->type) {
+	case VMC_MSG_CAP_RESP:
+		dev_dbg(adapter->dev, "CRQ recv: capabilities resp (0x%x)\n",
+			crq->type);
+		if (ibmvmc.state == ibmvmc_state_capabilities)
+			ibmvmc_process_capabilities(adapter, crq);
+		else
+			dev_warn(adapter->dev, "caps msg invalid in state 0x%x\n",
+				 ibmvmc.state);
+		break;
+	case VMC_MSG_OPEN_RESP:
+		dev_dbg(adapter->dev, "CRQ recv: open resp (0x%x)\n",
+			crq->type);
+		if (ibmvmc_validate_hmc_session(adapter, crq) == 0)
+			ibmvmc_process_open_resp(crq, adapter);
+		break;
+	case VMC_MSG_ADD_BUF:
+		dev_dbg(adapter->dev, "CRQ recv: add buf (0x%x)\n",
+			crq->type);
+		if (ibmvmc_validate_hmc_session(adapter, crq) == 0)
+			ibmvmc_add_buffer(adapter, crq);
+		break;
+	case VMC_MSG_REM_BUF:
+		dev_dbg(adapter->dev, "CRQ recv: rem buf (0x%x)\n",
+			crq->type);
+		if (ibmvmc_validate_hmc_session(adapter, crq) == 0)
+			ibmvmc_rem_buffer(adapter, crq);
+		break;
+	case VMC_MSG_SIGNAL:
+		dev_dbg(adapter->dev, "CRQ recv: signal msg (0x%x)\n",
+			crq->type);
+		if (ibmvmc_validate_hmc_session(adapter, crq) == 0)
+			ibmvmc_recv_msg(adapter, crq);
+		break;
+	case VMC_MSG_CLOSE_RESP:
+		dev_dbg(adapter->dev, "CRQ recv: close resp (0x%x)\n",
+			crq->type);
+		if (ibmvmc_validate_hmc_session(adapter, crq) == 0)
+			ibmvmc_process_close_resp(crq, adapter);
+		break;
+	case VMC_MSG_CAP:
+	case VMC_MSG_OPEN:
+	case VMC_MSG_CLOSE:
+	case VMC_MSG_ADD_BUF_RESP:
+	case VMC_MSG_REM_BUF_RESP:
+		dev_warn(adapter->dev, "CRQ recv: unexpected msg (0x%x)\n",
+			 crq->type);
+		break;
+	default:
+		dev_warn(adapter->dev, "CRQ recv: unknown msg (0x%x)\n",
+			 crq->type);
+		break;
+	}
+}
+
+/**
+ * ibmvmc_handle_crq_init - Handle CRQ Init
+ *
+ * @crq:	ibmvmc_crq_msg struct
+ * @adapter:	crq_server_adapter struct
+ *
+ * Handle the type of crq initialization based on whether
+ * it is a message or a response.
+ *
+ */
+static void ibmvmc_handle_crq_init(struct ibmvmc_crq_msg *crq,
+				   struct crq_server_adapter *adapter)
+{
+	switch (crq->type) {
+	case 0x01:	/* Initialization message */
+		dev_dbg(adapter->dev, "CRQ recv: CRQ init msg - state 0x%x\n",
+			ibmvmc.state);
+		if (ibmvmc.state == ibmvmc_state_crqinit) {
+			/* Send back a response */
+			if (ibmvmc_send_crq(adapter, 0xC002000000000000,
+					    0) == 0)
+				ibmvmc_send_capabilities(adapter);
+			else
+				dev_err(adapter->dev, " Unable to send init rsp\n");
+		} else {
+			dev_err(adapter->dev, "Invalid state 0x%x mtu = 0x%x\n",
+				ibmvmc.state, ibmvmc.max_mtu);
+		}
+
+		break;
+	case 0x02:	/* Initialization response */
+		dev_dbg(adapter->dev, "CRQ recv: initialization resp msg - state 0x%x\n",
+			ibmvmc.state);
+		if (ibmvmc.state == ibmvmc_state_crqinit)
+			ibmvmc_send_capabilities(adapter);
+		break;
+	default:
+		dev_warn(adapter->dev, "Unknown crq message type 0x%lx\n",
+			 (unsigned long)crq->type);
+	}
+}
+
+/**
+ * ibmvmc_handle_crq - Handle CRQ
+ *
+ * @crq:	ibmvmc_crq_msg struct
+ * @adapter:	crq_server_adapter struct
+ *
+ * Read the command elements from the command queue and execute the
+ * requests based upon the type of crq message.
+ *
+ */
+static void ibmvmc_handle_crq(struct ibmvmc_crq_msg *crq,
+			      struct crq_server_adapter *adapter)
+{
+	switch (crq->valid) {
+	case 0xC0:		/* initialization */
+		ibmvmc_handle_crq_init(crq, adapter);
+		break;
+	case 0xFF:	/* Hypervisor telling us the connection is closed */
+		dev_warn(adapter->dev, "CRQ recv: virtual adapter failed - resetting.\n");
+		ibmvmc_reset(adapter, true);
+		break;
+	case 0x80:	/* real payload */
+		ibmvmc_crq_process(adapter, crq);
+		break;
+	default:
+		dev_warn(adapter->dev, "CRQ recv: unknown msg 0x%02x.\n",
+			 crq->valid);
+		break;
+	}
+}
+
+static void ibmvmc_task(unsigned long data)
+{
+	struct crq_server_adapter *adapter =
+		(struct crq_server_adapter *)data;
+	struct vio_dev *vdev = to_vio_dev(adapter->dev);
+	struct ibmvmc_crq_msg *crq;
+	int done = 0;
+
+	while (!done) {
+		/* Pull all the valid messages off the CRQ */
+		while ((crq = crq_queue_next_crq(&adapter->queue)) != NULL) {
+			ibmvmc_handle_crq(crq, adapter);
+			crq->valid = 0x00;
+			/* CRQ reset was requested, stop processing CRQs.
+			 * Interrupts will be re-enabled by the reset task.
+			 */
+			if (ibmvmc.state == ibmvmc_state_sched_reset)
+				return;
+		}
+
+		vio_enable_interrupts(vdev);
+		crq = crq_queue_next_crq(&adapter->queue);
+		if (crq) {
+			vio_disable_interrupts(vdev);
+			ibmvmc_handle_crq(crq, adapter);
+			crq->valid = 0x00;
+			/* CRQ reset was requested, stop processing CRQs.
+			 * Interrupts will be re-enabled by the reset task.
+			 */
+			if (ibmvmc.state == ibmvmc_state_sched_reset)
+				return;
+		} else {
+			done = 1;
+		}
+	}
+}
+
+/**
+ * ibmvmc_init_crq_queue - Init CRQ Queue
+ *
+ * @adapter:	crq_server_adapter struct
+ *
+ * Return:
+ *	0 - Success
+ *	Non-zero - Failure
+ */
+static int ibmvmc_init_crq_queue(struct crq_server_adapter *adapter)
+{
+	struct vio_dev *vdev = to_vio_dev(adapter->dev);
+	struct crq_queue *queue = &adapter->queue;
+	int rc;
+	int retrc;
+
+	queue->msgs = (struct ibmvmc_crq_msg *)get_zeroed_page(GFP_KERNEL);
+
+	if (!queue->msgs)
+		goto malloc_failed;
+
+	queue->size = PAGE_SIZE / sizeof(*queue->msgs);
+
+	queue->msg_token = dma_map_single(adapter->dev, queue->msgs,
+					  queue->size * sizeof(*queue->msgs),
+					  DMA_BIDIRECTIONAL);
+
+	if (dma_mapping_error(adapter->dev, queue->msg_token))
+		goto map_failed;
+
+	retrc = plpar_hcall_norets(H_REG_CRQ,
+				   vdev->unit_address,
+				   queue->msg_token, PAGE_SIZE);
+	retrc = rc;
+
+	if (rc == H_RESOURCE)
+		rc = ibmvmc_reset_crq_queue(adapter);
+
+	if (rc == 2) {
+		dev_warn(adapter->dev, "Partner adapter not ready\n");
+		retrc = 0;
+	} else if (rc != 0) {
+		dev_err(adapter->dev, "Error %d opening adapter\n", rc);
+		goto reg_crq_failed;
+	}
+
+	queue->cur = 0;
+	spin_lock_init(&queue->lock);
+
+	tasklet_init(&adapter->work_task, ibmvmc_task, (unsigned long)adapter);
+
+	if (request_irq(vdev->irq,
+			ibmvmc_handle_event,
+			0, "ibmvmc", (void *)adapter) != 0) {
+		dev_err(adapter->dev, "couldn't register irq 0x%x\n",
+			vdev->irq);
+		goto req_irq_failed;
+	}
+
+	rc = vio_enable_interrupts(vdev);
+	if (rc != 0) {
+		dev_err(adapter->dev, "Error %d enabling interrupts!!!\n", rc);
+		goto req_irq_failed;
+	}
+
+	return retrc;
+
+req_irq_failed:
+	/* Cannot have any work since we either never got our IRQ registered,
+	 * or never got interrupts enabled
+	 */
+	tasklet_kill(&adapter->work_task);
+	h_free_crq(vdev->unit_address);
+reg_crq_failed:
+	dma_unmap_single(adapter->dev,
+			 queue->msg_token,
+			 queue->size * sizeof(*queue->msgs), DMA_BIDIRECTIONAL);
+map_failed:
+	free_page((unsigned long)queue->msgs);
+malloc_failed:
+	return -ENOMEM;
+}
+
+/* Fill in the liobn and riobn fields on the adapter */
+static int read_dma_window(struct vio_dev *vdev,
+			   struct crq_server_adapter *adapter)
+{
+	const __be32 *dma_window;
+	const __be32 *prop;
+
+	/* TODO Using of_parse_dma_window would be better, but it doesn't give
+	 * a way to read multiple windows without already knowing the size of
+	 * a window or the number of windows
+	 */
+	dma_window =
+		(const __be32 *)vio_get_attribute(vdev, "ibm,my-dma-window",
+						NULL);
+	if (!dma_window) {
+		dev_warn(adapter->dev, "Couldn't find ibm,my-dma-window property\n");
+		return -1;
+	}
+
+	adapter->liobn = be32_to_cpu(*dma_window);
+	dma_window++;
+
+	prop = (const __be32 *)vio_get_attribute(vdev, "ibm,#dma-address-cells",
+						NULL);
+	if (!prop) {
+		dev_warn(adapter->dev, "Couldn't find ibm,#dma-address-cells property\n");
+		dma_window++;
+	} else {
+		dma_window += be32_to_cpu(*prop);
+	}
+
+	prop = (const __be32 *)vio_get_attribute(vdev, "ibm,#dma-size-cells",
+						NULL);
+	if (!prop) {
+		dev_warn(adapter->dev, "Couldn't find ibm,#dma-size-cells property\n");
+		dma_window++;
+	} else {
+		dma_window += be32_to_cpu(*prop);
+	}
+
+	/* dma_window should point to the second window now */
+	adapter->riobn = be32_to_cpu(*dma_window);
+
+	return 0;
+}
+
+static int ibmvmc_probe(struct vio_dev *vdev, const struct vio_device_id *id)
+{
+	struct crq_server_adapter *adapter = &ibmvmc_adapter;
+	int rc;
+
+	dev_set_drvdata(&vdev->dev, NULL);
+	memset(adapter, 0, sizeof(*adapter));
+	adapter->dev = &vdev->dev;
+
+	dev_info(adapter->dev, "Probe for UA 0x%x\n", vdev->unit_address);
+
+	rc = read_dma_window(vdev, adapter);
+	if (rc != 0) {
+		ibmvmc.state = ibmvmc_state_failed;
+		return -1;
+	}
+
+	dev_dbg(adapter->dev, "Probe: liobn 0x%x, riobn 0x%x\n",
+		adapter->liobn, adapter->riobn);
+
+	init_waitqueue_head(&adapter->reset_wait_queue);
+	adapter->reset_task = kthread_run(ibmvmc_reset_task, adapter, "ibmvmc");
+	if (IS_ERR(adapter->reset_task)) {
+		dev_err(adapter->dev, "Failed to start reset thread\n");
+		ibmvmc.state = ibmvmc_state_failed;
+		rc = PTR_ERR(adapter->reset_task);
+		adapter->reset_task = NULL;
+		return rc;
+	}
+
+	rc = ibmvmc_init_crq_queue(adapter);
+	if (rc != 0 && rc != H_RESOURCE) {
+		dev_err(adapter->dev, "Error initializing CRQ.  rc = 0x%x\n",
+			rc);
+		ibmvmc.state = ibmvmc_state_failed;
+		goto crq_failed;
+	}
+
+	ibmvmc.state = ibmvmc_state_crqinit;
+
+	/* Try to send an initialization message.  Note that this is allowed
+	 * to fail if the other end is not acive.  In that case we just wait
+	 * for the other side to initialize.
+	 */
+	if (ibmvmc_send_crq(adapter, 0xC001000000000000LL, 0) != 0 &&
+	    rc != H_RESOURCE)
+		dev_warn(adapter->dev, "Failed to send initialize CRQ message\n");
+
+	dev_set_drvdata(&vdev->dev, adapter);
+
+	return 0;
+
+crq_failed:
+	kthread_stop(adapter->reset_task);
+	adapter->reset_task = NULL;
+	return -EPERM;
+}
+
+static int ibmvmc_remove(struct vio_dev *vdev)
+{
+	struct crq_server_adapter *adapter = dev_get_drvdata(&vdev->dev);
+
+	dev_info(adapter->dev, "Entering remove for UA 0x%x\n",
+		 vdev->unit_address);
+	ibmvmc_release_crq_queue(adapter);
+
+	return 0;
+}
+
+static struct vio_device_id ibmvmc_device_table[] = {
+	{ "ibm,vmc", "IBM,vmc" },
+	{ "", "" }
+};
+MODULE_DEVICE_TABLE(vio, ibmvmc_device_table);
+
+static struct vio_driver ibmvmc_driver = {
+	.name        = ibmvmc_driver_name,
+	.id_table    = ibmvmc_device_table,
+	.probe       = ibmvmc_probe,
+	.remove      = ibmvmc_remove,
+};
+
+static void __init ibmvmc_scrub_module_parms(void)
+{
+	if (ibmvmc_max_mtu > MAX_MTU) {
+		pr_warn("ibmvmc: Max MTU reduced to %d\n", MAX_MTU);
+		ibmvmc_max_mtu = MAX_MTU;
+	} else if (ibmvmc_max_mtu < MIN_MTU) {
+		pr_warn("ibmvmc: Max MTU increased to %d\n", MIN_MTU);
+		ibmvmc_max_mtu = MIN_MTU;
+	}
+
+	if (ibmvmc_max_buf_pool_size > MAX_BUF_POOL_SIZE) {
+		pr_warn("ibmvmc: Max buffer pool size reduced to %d\n",
+			MAX_BUF_POOL_SIZE);
+		ibmvmc_max_buf_pool_size = MAX_BUF_POOL_SIZE;
+	} else if (ibmvmc_max_buf_pool_size < MIN_BUF_POOL_SIZE) {
+		pr_warn("ibmvmc: Max buffer pool size increased to %d\n",
+			MIN_BUF_POOL_SIZE);
+		ibmvmc_max_buf_pool_size = MIN_BUF_POOL_SIZE;
+	}
+
+	if (ibmvmc_max_hmcs > MAX_HMCS) {
+		pr_warn("ibmvmc: Max HMCs reduced to %d\n", MAX_HMCS);
+		ibmvmc_max_hmcs = MAX_HMCS;
+	} else if (ibmvmc_max_hmcs < MIN_HMCS) {
+		pr_warn("ibmvmc: Max HMCs increased to %d\n", MIN_HMCS);
+		ibmvmc_max_hmcs = MIN_HMCS;
+	}
+}
+
+static struct miscdevice ibmvmc_miscdev = {
+	.name = ibmvmc_driver_name,
+	.minor = MISC_DYNAMIC_MINOR,
+	.fops = &ibmvmc_fops,
+};
+
+static int __init ibmvmc_module_init(void)
+{
+	int rc, i, j;
+
+	ibmvmc.state = ibmvmc_state_initial;
+	pr_info("ibmvmc: version %s\n", IBMVMC_DRIVER_VERSION);
+
+	rc = misc_register(&ibmvmc_miscdev);
+	if (rc) {
+		pr_err("ibmvmc: misc registration failed\n");
+		goto misc_register_failed;
+	}
+	pr_info("ibmvmc: node %d:%d\n", MISC_MAJOR,
+		ibmvmc_miscdev.minor);
+
+	/* Initialize data structures */
+	memset(hmcs, 0, sizeof(struct ibmvmc_hmc) * MAX_HMCS);
+	for (i = 0; i < MAX_HMCS; i++) {
+		spin_lock_init(&hmcs[i].lock);
+		hmcs[i].state = ibmhmc_state_free;
+		for (j = 0; j < MAX_BUF_POOL_SIZE; j++)
+			hmcs[i].queue_outbound_msgs[j] = VMC_INVALID_BUFFER_ID;
+	}
+
+	/* Sanity check module parms */
+	ibmvmc_scrub_module_parms();
+
+	/*
+	 * Initialize some reasonable values.  Might be negotiated smaller
+	 * values during the capabilities exchange.
+	 */
+	ibmvmc.max_mtu = ibmvmc_max_mtu;
+	ibmvmc.max_buffer_pool_size = ibmvmc_max_buf_pool_size;
+	ibmvmc.max_hmc_index = ibmvmc_max_hmcs - 1;
+
+	rc = vio_register_driver(&ibmvmc_driver);
+
+	if (rc) {
+		pr_err("ibmvmc: rc %d from vio_register_driver\n", rc);
+		goto vio_reg_failed;
+	}
+
+	return 0;
+
+vio_reg_failed:
+	misc_deregister(&ibmvmc_miscdev);
+misc_register_failed:
+	return rc;
+}
+
+static void __exit ibmvmc_module_exit(void)
+{
+	pr_info("ibmvmc: module exit\n");
+	vio_unregister_driver(&ibmvmc_driver);
+	misc_deregister(&ibmvmc_miscdev);
+}
+
+module_init(ibmvmc_module_init);
+module_exit(ibmvmc_module_exit);
+
+module_param_named(buf_pool_size, ibmvmc_max_buf_pool_size,
+		   int, 0644);
+MODULE_PARM_DESC(buf_pool_size, "Buffer pool size");
+module_param_named(max_hmcs, ibmvmc_max_hmcs, int, 0644);
+MODULE_PARM_DESC(max_hmcs, "Max HMCs");
+module_param_named(max_mtu, ibmvmc_max_mtu, int, 0644);
+MODULE_PARM_DESC(max_mtu, "Max MTU");
+
+MODULE_AUTHOR("Steven Royer <seroyer@linux.vnet.ibm.com>");
+MODULE_DESCRIPTION("IBM VMC");
+MODULE_VERSION(IBMVMC_DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
diff --git a/drivers/misc/ibmvmc.h b/drivers/misc/ibmvmc.h
new file mode 100644
index 0000000..e140ada
--- /dev/null
+++ b/drivers/misc/ibmvmc.h
@@ -0,0 +1,209 @@
+/* SPDX-License-Identifier: GPL-2.0+
+ *
+ * linux/drivers/misc/ibmvmc.h
+ *
+ * IBM Power Systems Virtual Management Channel Support.
+ *
+ * Copyright (c) 2004, 2018 IBM Corp.
+ *   Dave Engebretsen engebret@us.ibm.com
+ *   Steven Royer seroyer@linux.vnet.ibm.com
+ *   Adam Reznechek adreznec@linux.vnet.ibm.com
+ *   Bryant G. Ly <bryantly@linux.vnet.ibm.com>
+ */
+#ifndef IBMVMC_H
+#define IBMVMC_H
+
+#include <linux/types.h>
+#include <linux/cdev.h>
+
+#include <asm/vio.h>
+
+#define IBMVMC_PROTOCOL_VERSION    0x0101
+
+#define MIN_BUF_POOL_SIZE 16
+#define MIN_HMCS          1
+#define MIN_MTU           4096
+#define MAX_BUF_POOL_SIZE 64
+#define MAX_HMCS          2
+#define MAX_MTU           (4 * 4096)
+#define DEFAULT_BUF_POOL_SIZE 32
+#define DEFAULT_HMCS          1
+#define DEFAULT_MTU           4096
+#define HMC_ID_LEN        32
+
+#define VMC_INVALID_BUFFER_ID 0xFFFF
+
+/* ioctl numbers */
+#define VMC_BASE	     0xCC
+#define VMC_IOCTL_SETHMCID   _IOW(VMC_BASE, 0x00, unsigned char *)
+#define VMC_IOCTL_QUERY      _IOR(VMC_BASE, 0x01, struct ibmvmc_query_struct)
+#define VMC_IOCTL_REQUESTVMC _IOR(VMC_BASE, 0x02, u32)
+
+#define VMC_MSG_CAP          0x01
+#define VMC_MSG_CAP_RESP     0x81
+#define VMC_MSG_OPEN         0x02
+#define VMC_MSG_OPEN_RESP    0x82
+#define VMC_MSG_CLOSE        0x03
+#define VMC_MSG_CLOSE_RESP   0x83
+#define VMC_MSG_ADD_BUF      0x04
+#define VMC_MSG_ADD_BUF_RESP 0x84
+#define VMC_MSG_REM_BUF      0x05
+#define VMC_MSG_REM_BUF_RESP 0x85
+#define VMC_MSG_SIGNAL       0x06
+
+#define VMC_MSG_SUCCESS 0
+#define VMC_MSG_INVALID_HMC_INDEX 1
+#define VMC_MSG_INVALID_BUFFER_ID 2
+#define VMC_MSG_CLOSED_HMC        3
+#define VMC_MSG_INTERFACE_FAILURE 4
+#define VMC_MSG_NO_BUFFER         5
+
+#define VMC_BUF_OWNER_ALPHA 0
+#define VMC_BUF_OWNER_HV    1
+
+enum ibmvmc_states {
+	ibmvmc_state_sched_reset  = -1,
+	ibmvmc_state_initial      = 0,
+	ibmvmc_state_crqinit      = 1,
+	ibmvmc_state_capabilities = 2,
+	ibmvmc_state_ready        = 3,
+	ibmvmc_state_failed       = 4,
+};
+
+enum ibmhmc_states {
+	/* HMC connection not established */
+	ibmhmc_state_free    = 0,
+
+	/* HMC connection established (open called) */
+	ibmhmc_state_initial = 1,
+
+	/* open msg sent to HV, due to ioctl(1) call */
+	ibmhmc_state_opening = 2,
+
+	/* HMC connection ready, open resp msg from HV */
+	ibmhmc_state_ready   = 3,
+
+	/* HMC connection failure */
+	ibmhmc_state_failed  = 4,
+};
+
+struct ibmvmc_buffer {
+	u8 valid;	/* 1 when DMA storage allocated to buffer          */
+	u8 free;	/* 1 when buffer available for the Alpha Partition */
+	u8 owner;
+	u16 id;
+	u32 size;
+	u32 msg_len;
+	dma_addr_t dma_addr_local;
+	dma_addr_t dma_addr_remote;
+	void *real_addr_local;
+};
+
+struct ibmvmc_admin_crq_msg {
+	u8 valid;	/* RPA Defined           */
+	u8 type;	/* ibmvmc msg type       */
+	u8 status;	/* Response msg status. Zero is success and on failure,
+			 * either 1 - General Failure, or 2 - Invalid Version is
+			 * returned.
+			 */
+	u8 rsvd[2];
+	u8 max_hmc;	/* Max # of independent HMC connections supported */
+	__be16 pool_size;	/* Maximum number of buffers supported per HMC
+				 * connection
+				 */
+	__be32 max_mtu;		/* Maximum message size supported (bytes) */
+	__be16 crq_size;	/* # of entries available in the CRQ for the
+				 * source partition. The target partition must
+				 * limit the number of outstanding messages to
+				 * one half or less.
+				 */
+	__be16 version;	/* Indicates the code level of the management partition
+			 * or the hypervisor with the high-order byte
+			 * indicating a major version and the low-order byte
+			 * indicating a minor version.
+			 */
+};
+
+struct ibmvmc_crq_msg {
+	u8 valid;     /* RPA Defined           */
+	u8 type;      /* ibmvmc msg type       */
+	u8 status;    /* Response msg status   */
+	union {
+		u8 rsvd;  /* Reserved              */
+		u8 owner;
+	} var1;
+	u8 hmc_session;	/* Session Identifier for the current VMC connection */
+	u8 hmc_index;	/* A unique HMC Idx would be used if multiple management
+			 * applications running concurrently were desired
+			 */
+	union {
+		__be16 rsvd;
+		__be16 buffer_id;
+	} var2;
+	__be32 rsvd;
+	union {
+		__be32 rsvd;
+		__be32 lioba;
+		__be32 msg_len;
+	} var3;
+};
+
+/* an RPA command/response transport queue */
+struct crq_queue {
+	struct ibmvmc_crq_msg *msgs;
+	int size, cur;
+	dma_addr_t msg_token;
+	spinlock_t lock;
+};
+
+/* VMC server adapter settings */
+struct crq_server_adapter {
+	struct device *dev;
+	struct crq_queue queue;
+	u32 liobn;
+	u32 riobn;
+	struct tasklet_struct work_task;
+	wait_queue_head_t reset_wait_queue;
+	struct task_struct *reset_task;
+};
+
+/* Driver wide settings */
+struct ibmvmc_struct {
+	u32 state;
+	u32 max_mtu;
+	u32 max_buffer_pool_size;
+	u32 max_hmc_index;
+	struct crq_server_adapter *adapter;
+	struct cdev cdev;
+	u32 vmc_drc_index;
+};
+
+struct ibmvmc_file_session;
+
+/* Connection specific settings */
+struct ibmvmc_hmc {
+	u8 session;
+	u8 index;
+	u32 state;
+	struct crq_server_adapter *adapter;
+	spinlock_t lock;
+	unsigned char hmc_id[HMC_ID_LEN];
+	struct ibmvmc_buffer buffer[MAX_BUF_POOL_SIZE];
+	unsigned short queue_outbound_msgs[MAX_BUF_POOL_SIZE];
+	int queue_head, queue_tail;
+	struct ibmvmc_file_session *file_session;
+};
+
+struct ibmvmc_file_session {
+	struct file *file;
+	struct ibmvmc_hmc *hmc;
+	bool valid;
+};
+
+struct ibmvmc_query_struct {
+	int have_vmc;
+	int state;
+	int vmc_drc_index;
+};
+
+#endif /* __IBMVMC_H */
-- 
2.7.2

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v1 0/1] misc: IBM Virtual Management Channel Driver
From: Bryant G. Ly @ 2018-04-23 14:46 UTC (permalink / raw)
  To: benh, mpe, gregkh, arnd
  Cc: corbet, seroyer, mrochs, adreznec, fbarrat, davem, linus.walleij,
	akpm, rdunlap, mikey, pombredanne, tlfalcon, msuchanek, linux-doc,
	linuxppc-dev, Bryant G. Ly

Steven Royer had previously attempted to upstream this
driver two years ago, but never got the chance to address
the concerns from Greg Kroah-Hartman.

The thread with the initial upstream is:
https://lkml.org/lkml/2016/2/16/918

I have addressed the following:
- Documentation
- Use of dev_dbg instead of pr_dbg
- Change to misc class
- Fixed memory barrier usages
- Addressed styling, checkpatch, renaming of functions
- General fixes to the driver to make it more inline with
  existing upstream drivers.

Bryant G. Ly (1):
  misc: IBM Virtual Management Channel Driver

 Documentation/ioctl/ioctl-number.txt  |    1 +
 Documentation/misc-devices/ibmvmc.txt |  161 +++
 MAINTAINERS                           |    6 +
 arch/powerpc/include/asm/hvcall.h     |    1 +
 drivers/misc/Kconfig                  |    9 +
 drivers/misc/Makefile                 |    1 +
 drivers/misc/ibmvmc.c                 | 2413 +++++++++++++++++++++++++++++++++
 drivers/misc/ibmvmc.h                 |  215 +++
 8 files changed, 2807 insertions(+)
 create mode 100644 Documentation/misc-devices/ibmvmc.txt
 create mode 100644 drivers/misc/ibmvmc.c
 create mode 100644 drivers/misc/ibmvmc.h

-- 
2.7.2

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v7 0/5] cpuset: Enable cpuset controller in default hierarchy
From: Waiman Long @ 2018-04-23 14:10 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
	cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin
In-Reply-To: <20180423135756.GC32341@localhost.localdomain>

On 04/23/2018 09:57 AM, Juri Lelli wrote:
> On 23/04/18 15:07, Juri Lelli wrote:
>> Hi Waiman,
>>
>> On 19/04/18 09:46, Waiman Long wrote:
>>> v7:
>>>  - Add a root-only cpuset.cpus.isolated control file for CPU isolation.
>>>  - Enforce that load_balancing can only be turned off on cpusets with
>>>    CPUs from the isolated list.
>>>  - Update sched domain generation to allow cpusets with CPUs only
>>>    from the isolated CPU list to be in separate root domains.
> Guess I'll be adding comments as soon as I stumble on something unclear
> (to me :), hope that's OK (shout if I should do it differently).
>
> The below looked unexpected to me:
>
> root@debian-kvm:/sys/fs/cgroup# cat g1/cpuset.cpus
> 2-3
> root@debian-kvm:/sys/fs/cgroup# cat g1/cpuset.mems
>
> root@debian-kvm:~# echo $$ > /sys/fs/cgroup/g1/cgroup.threads
> root@debian-kvm:/sys/fs/cgroup# cat g1/cgroup.threads
> 2312
>
> So I can add tasks to groups with no mems? Or is it this only true in my
> case with a single mem node? Or maybe it's inherited from root group
> (slightly confusing IMHO if that's the case).

No mems mean looking up the parents until we find one with non-empty
mems. The mems.effective will show you the actual memory nodes used.

-Longman

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v7 0/5] cpuset: Enable cpuset controller in default hierarchy
From: Juri Lelli @ 2018-04-23 13:57 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
	cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin
In-Reply-To: <20180423130715.GA32341@localhost.localdomain>

On 23/04/18 15:07, Juri Lelli wrote:
> Hi Waiman,
> 
> On 19/04/18 09:46, Waiman Long wrote:
> > v7:
> >  - Add a root-only cpuset.cpus.isolated control file for CPU isolation.
> >  - Enforce that load_balancing can only be turned off on cpusets with
> >    CPUs from the isolated list.
> >  - Update sched domain generation to allow cpusets with CPUs only
> >    from the isolated CPU list to be in separate root domains.
> 

Guess I'll be adding comments as soon as I stumble on something unclear
(to me :), hope that's OK (shout if I should do it differently).

The below looked unexpected to me:

root@debian-kvm:/sys/fs/cgroup# cat g1/cpuset.cpus
2-3
root@debian-kvm:/sys/fs/cgroup# cat g1/cpuset.mems

root@debian-kvm:~# echo $$ > /sys/fs/cgroup/g1/cgroup.threads
root@debian-kvm:/sys/fs/cgroup# cat g1/cgroup.threads
2312

So I can add tasks to groups with no mems? Or is it this only true in my
case with a single mem node? Or maybe it's inherited from root group
(slightly confusing IMHO if that's the case).
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v7 0/5] cpuset: Enable cpuset controller in default hierarchy
From: Juri Lelli @ 2018-04-23 13:07 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
	cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin
In-Reply-To: <1524145624-23655-1-git-send-email-longman@redhat.com>

Hi Waiman,

On 19/04/18 09:46, Waiman Long wrote:
> v7:
>  - Add a root-only cpuset.cpus.isolated control file for CPU isolation.
>  - Enforce that load_balancing can only be turned off on cpusets with
>    CPUs from the isolated list.
>  - Update sched domain generation to allow cpusets with CPUs only
>    from the isolated CPU list to be in separate root domains.

Just got this while

# echo 2-3 > /sys/fs/cgroup/cpuset.cpus.isolated

[ 6679.177826] =============================
[ 6679.178385] WARNING: suspicious RCU usage
[ 6679.178910] 4.16.0-rc6+ #151 Not tainted
[ 6679.179459] -----------------------------
[ 6679.180082] /home/juri/work/kernel/linux/kernel/cgroup/cgroup.c:3826 cgroup_mutex or RCU read lock required!
[ 6679.181402]
[ 6679.181402] other info that might help us debug this:
[ 6679.181402]
[ 6679.182407]
[ 6679.182407] rcu_scheduler_active = 2, debug_locks = 1
[ 6679.183278] 3 locks held by bash/2205:
[ 6679.183785]  #0:  (sb_writers#10){.+.+}, at: [<000000004e577fb9>] vfs_write+0x18a/0x1b0
[ 6679.184871]  #1:  (&of->mutex){+.+.}, at: [<000000005944c83f>] kernfs_fop_write+0xe2/0x1a0
[ 6679.185987]  #2:  (cpuset_mutex){+.+.}, at: [<00000000879bfba0>] cpuset_write_resmask+0x72/0x1560
[ 6679.187112]
[ 6679.187112] stack backtrace:
[ 6679.187612] CPU: 3 PID: 2205 Comm: bash Not tainted 4.16.0-rc6+ #151
[ 6679.188318] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
[ 6679.189291] Call Trace:
[ 6679.189581]  dump_stack+0x85/0xc5
[ 6679.189963]  css_next_child+0x90/0xd0
[ 6679.190385]  cpuset_write_resmask+0x46f/0x1560
[ 6679.190885]  ? lock_acquire+0x9f/0x210
[ 6679.191315]  cgroup_file_write+0x94/0x230
[ 6679.191768]  kernfs_fop_write+0x113/0x1a0
[ 6679.192223]  __vfs_write+0x36/0x180
[ 6679.192617]  ? rcu_read_lock_sched_held+0x6b/0x80
[ 6679.193139]  ? rcu_sync_lockdep_assert+0x2e/0x60
[ 6679.193654]  ? __sb_start_write+0x154/0x1f0
[ 6679.194118]  ? __sb_start_write+0x16a/0x1f0
[ 6679.194607]  vfs_write+0xc1/0x1b0
[ 6679.194984]  SyS_write+0x55/0xc0
[ 6679.195365]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 6679.195839]  do_syscall_64+0x79/0x220
[ 6679.196212]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
[ 6679.196729] RIP: 0033:0x7f03183ff780
[ 6679.197138] RSP: 002b:00007ffeae336ca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 6679.197866] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f03183ff780
[ 6679.198550] RDX: 0000000000000004 RSI: 0000000000eaf408 RDI: 0000000000000001
[ 6679.199235] RBP: 0000000000eaf408 R08: 000000000000000a R09: 00007f0318cff700
[ 6679.199928] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f03186b57a0
[ 6679.200615] R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000000
[ 6679.201369] rebuild_sched_domains dom 0: 0-1
[ 6679.202196] span: 0-1 (max cpu_capacity = 1024)

Guess we should grab either lock from the writing path.

Best,

- Juri
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v3 03/10] drivers/peci: Add support for PECI bus driver core
From: Greg KH @ 2018-04-23 10:52 UTC (permalink / raw)
  To: Jae Hyun Yoo
  Cc: Alan Cox, Andrew Jeffery, Andrew Lunn, Andy Shevchenko,
	Arnd Bergmann, Benjamin Herrenschmidt, Fengguang Wu,
	Guenter Roeck, Haiyue Wang, James Feist, Jason M Biils,
	Jean Delvare, Joel Stanley, Julia Cartwright, Miguel Ojeda,
	Milton Miller II, Pavel Machek, Randy Dunlap, Stef van Os,
	Sumeet R Pawnikar, Vernon Mauery, linux-kernel, linux-doc,
	devicetree, linux-hwmon, linux-arm-kernel, openbmc
In-Reply-To: <20180410183212.16787-4-jae.hyun.yoo@linux.intel.com>

On Tue, Apr 10, 2018 at 11:32:05AM -0700, Jae Hyun Yoo wrote:
> +static void peci_adapter_dev_release(struct device *dev)
> +{
> +	/* do nothing */
> +}

As per the in-kernel documentation, I am now allowed to make fun of you.

You are trying to "out smart" the kernel by getting rid of a warning
message that was explicitly put there for you to do something.  To think
that by just providing an "empty" function you are somehow fulfilling
the API requirement is quite bold, don't you think?

This has to be fixed.  I didn't put that warning in there for no good
reason.  Please go read the documentation again...

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC 01/10] PCI: dwc: Add MSI-X callbacks handler
From: Gustavo Pimentel @ 2018-04-23  9:36 UTC (permalink / raw)
  To: Kishon Vijay Abraham I, bhelgaas@google.com,
	lorenzo.pieralisi@arm.com, Joao.Pinto@synopsys.com,
	jingoohan1@gmail.com, adouglas@cadence.com,
	niklas.cassel@axis.com, jesper.nilsson@axis.com
  Cc: linux-pci@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <0b7023d9-29c6-0993-07d3-b046d25b67ff@ti.com>

Hi Kishon,

On 16/04/2018 10:29, Kishon Vijay Abraham I wrote:
> Hi Gustavo,
> 
> On Tuesday 10 April 2018 10:44 PM, Gustavo Pimentel wrote:
>> Changes the pcie_raise_irq function signature, namely the interrupt_num
>> variable type from u8 to u16 to accommodate the MSI-X maximum interrupts
>> of 2048.
>>
>> Implements a PCIe config space capability iterator function to search and
>> save the MSI and MSI-X pointers. With this method the code becomes more
>> generic and flexible.
>>
>> Implements MSI-X set/get functions for sysfs interface in order to change
>> the EP entries number.
>>
>> Implements EP MSI-X interface for triggering interruptions.
>>
>> Signed-off-by: Gustavo Pimentel <gustavo.pimentel@synopsys.com>
>> ---
>>  drivers/pci/dwc/pci-dra7xx.c           |   2 +-
>>  drivers/pci/dwc/pcie-artpec6.c         |   2 +-
>>  drivers/pci/dwc/pcie-designware-ep.c   | 145 ++++++++++++++++++++++++++++++++-
>>  drivers/pci/dwc/pcie-designware-plat.c |   6 +-
>>  drivers/pci/dwc/pcie-designware.h      |  23 +++++-
>>  5 files changed, 173 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/pci/dwc/pci-dra7xx.c b/drivers/pci/dwc/pci-dra7xx.c
>> index ed8558d..5265725 100644
>> --- a/drivers/pci/dwc/pci-dra7xx.c
>> +++ b/drivers/pci/dwc/pci-dra7xx.c
>> @@ -369,7 +369,7 @@ static void dra7xx_pcie_raise_msi_irq(struct dra7xx_pcie *dra7xx,
>>  }
>>  
>>  static int dra7xx_pcie_raise_irq(struct dw_pcie_ep *ep, u8 func_no,
>> -				 enum pci_epc_irq_type type, u8 interrupt_num)
>> +				 enum pci_epc_irq_type type, u16 interrupt_num)
>>  {
>>  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
>>  	struct dra7xx_pcie *dra7xx = to_dra7xx_pcie(pci);
>> diff --git a/drivers/pci/dwc/pcie-artpec6.c b/drivers/pci/dwc/pcie-artpec6.c
>> index e66cede..96dc259 100644
>> --- a/drivers/pci/dwc/pcie-artpec6.c
>> +++ b/drivers/pci/dwc/pcie-artpec6.c
>> @@ -428,7 +428,7 @@ static void artpec6_pcie_ep_init(struct dw_pcie_ep *ep)
>>  }
>>  
>>  static int artpec6_pcie_raise_irq(struct dw_pcie_ep *ep, u8 func_no,
>> -				  enum pci_epc_irq_type type, u8 interrupt_num)
>> +				  enum pci_epc_irq_type type, u16 interrupt_num)
>>  {
>>  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
>>  
>> diff --git a/drivers/pci/dwc/pcie-designware-ep.c b/drivers/pci/dwc/pcie-designware-ep.c
>> index 15b22a6..874d4c2 100644
>> --- a/drivers/pci/dwc/pcie-designware-ep.c
>> +++ b/drivers/pci/dwc/pcie-designware-ep.c
>> @@ -40,6 +40,44 @@ void dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum pci_barno bar)
>>  	__dw_pcie_ep_reset_bar(pci, bar, 0);
>>  }
>>  
>> +void dw_pcie_ep_find_cap_addr(struct dw_pcie_ep *ep)
>> +{
> 
> This should be implemented in a generic way similar to pci_find_capability().
> It'll be useful when we try to implement other capabilities as well.

Hum, what you suggest? Something implemented on the pci-epf-core?

> 
> Thanks
> Kishon
> 

Regards,
Gustavo

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH bpf-next v3 4/8] bpf: add documentation for eBPF helpers (23-32)
From: Daniel Borkmann @ 2018-04-23  9:11 UTC (permalink / raw)
  To: Quentin Monnet, ast; +Cc: netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <b7ef2bc0-90f7-8bd9-2dd2-3b5ba2a6d4b0@netronome.com>

On 04/20/2018 08:54 PM, Quentin Monnet wrote:
> 2018-04-19 13:16 UTC+0200 ~ Daniel Borkmann <daniel@iogearbox.net>
>> On 04/17/2018 04:34 PM, Quentin Monnet wrote:
>>> Add documentation for eBPF helper functions to bpf.h user header file.
>>> This documentation can be parsed with the Python script provided in
>>> another commit of the patch series, in order to provide a RST document
>>> that can later be converted into a man page.
>>>
>>> The objective is to make the documentation easily understandable and
>>> accessible to all eBPF developers, including beginners.
>>>
>>> This patch contains descriptions for the following helper functions, all
>>> written by Daniel:
>>>
>>> - bpf_get_prandom_u32()
>>> - bpf_get_smp_processor_id()
>>> - bpf_get_cgroup_classid()
>>> - bpf_get_route_realm()
>>> - bpf_skb_load_bytes()
>>> - bpf_csum_diff()
>>> - bpf_skb_get_tunnel_opt()
>>> - bpf_skb_set_tunnel_opt()
>>> - bpf_skb_change_proto()
>>> - bpf_skb_change_type()
>>>
>>> v3:
>>> - bpf_get_prandom_u32(): Fix helper name :(. Add description, including
>>>   a note on the internal random state.
>>> - bpf_get_smp_processor_id(): Add description, including a note on the
>>>   processor id remaining stable during program run.
>>> - bpf_get_cgroup_classid(): State that CONFIG_CGROUP_NET_CLASSID is
>>>   required to use the helper. Add a reference to related documentation.
>>>   State that placing a task in net_cls controller disables cgroup-bpf.
>>> - bpf_get_route_realm(): State that CONFIG_CGROUP_NET_CLASSID is
>>>   required to use this helper.
>>> - bpf_skb_load_bytes(): Fix comment on current use cases for the helper.
>>>
>>> Cc: Daniel Borkmann <daniel@iogearbox.net>
>>> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
>>> ---
>>>  include/uapi/linux/bpf.h | 152 +++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 152 insertions(+)
>>>
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index c59bf5b28164..d748f65a8f58 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
> 
> [...]
> 
>>> @@ -615,6 +632,27 @@ union bpf_attr {
>>>   * 	Return
>>>   * 		0 on success, or a negative error in case of failure.
>>>   *
>>> + * u32 bpf_get_cgroup_classid(struct sk_buff *skb)
>>> + * 	Description
>>> + * 		Retrieve the classid for the current task, i.e. for the
>>> + * 		net_cls (network classifier) cgroup to which *skb* belongs.
>>> + *
>>> + * 		This helper is only available is the kernel was compiled with
>>> + * 		the **CONFIG_CGROUP_NET_CLASSID** configuration option set to
>>> + * 		"**y**" or to "**m**".
>>> + *
>>> + * 		Note that placing a task into the net_cls controller completely
>>> + * 		disables the execution of eBPF programs with the cgroup.
>>
>> I'm not sure I follow the above sentence, what do you mean by that?
> 
> Please see comment below.
> 
>> I would definitely also add here that this helper is limited to cgroups v1
>> only, and that it works on clsact TC egress hook but not the ingress one.
>>
>>> + * 		Also note that, in the above description, the "network
>>> + * 		classifier" cgroup does not designate a generic classifier, but
>>> + * 		a particular mechanism that provides an interface to tag
>>> + * 		network packets with a specific class identifier. See also the
>>
>> The "generic classifier" part is a bit strange to parse. I would probably
>> leave the first part out and explain that this provides a means to tag
>> packets based on a user-provided ID for all traffic coming from the tasks
>> belonging to the related cgroup.
> 
> In this paragraph and in the one above, I am trying to address Alexei
> comments to the RFC v2:
> 
>     please add that kernel should be configured with
>     CONFIG_NET_CLS_CGROUP=y|m
>     and mention Documentation/cgroup-v1/net_cls.txt
>     Otherwise 'network classifier' is way too generic.
>     I'd also mention that placing a task into net_cls controller
>     disables all of cgroup-bpf.
> 
> But I must admit I am struggling a bit on this helper. If I understand
> correctly, "network classifier" is too generic and could be confused
> with TC classifiers? Hence the attempt to avoid that in the second

I think if you just name it "net_cls cgroup" it should be good enough in case
the concern is that "network classifier" name would be misunderstood.

> paragraph... As for "placing a task into net_cls controller disables all
> of cgroup-bpf", again if I understand correctly, I think it refers to
> cgroup_sk_alloc_disable() being called in write_classid() in file
> net/core/netclassid_cgroup.c. But I am not familiar with it and cannot
> find a nice way to explain that in the doc :/.

Point here should be that there's cgroups v1 and v2 infra. Both are available
and you can use a mixture of them, but net_cls is v1 only whereas bpf progs on
cgroups is v2 only feature. And if you look at struct sock_cgroup_data, it's
a union for the socket, so you can only use one of them with regards to sockets.

>>> + * 		related kernel documentation, available from the Linux sources
>>> + * 		in file *Documentation/cgroup-v1/net_cls.txt*.
>>> + * 	Return
>>> + * 		The classid, or 0 for the default unconfigured classid.
>>> + *
>>>   * int bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci)
>>>   * 	Description
>>>   * 		Push a *vlan_tci* (VLAN tag control information) of protocol
>>> @@ -734,6 +772,16 @@ union bpf_attr {
>>>   * 		are **TC_ACT_REDIRECT** on success or **TC_ACT_SHOT** on
>>>   * 		error.
>>>   *
>>> + * u32 bpf_get_route_realm(struct sk_buff *skb)
>>> + * 	Description
>>> + * 		Retrieve the realm or the route, that is to say the
>>> + * 		**tclassid** field of the destination for the *skb*. This
>>> + * 		helper is available only if the kernel was compiled with
>>> + * 		**CONFIG_IP_ROUTE_CLASSID** configuration option.
>>
>> Could mention that this is a similar user provided tag like in the net_cls
>> case with cgroups only that this applies to routes here (dst entries) which
>> hold this tag.
>>
>> Also, should say that this works with clsact TC egress hook or alternatively
>> conventional classful egress qdiscs, but not on TC ingress. In case of clsact
>> TC egress hook this has the advantage that the dst entry has not been dropped
>> yet in the xmit path. Therefore, the dst entry does not need to be artificially
>> held via netif_keep_dst() for a classful qdisc until the skb is freed.
> 
> Here I am not sure to follow, the advantage is for clsact, but this is
> only for a classful qdisc that we can avoid holding the dst entry? How

No, it's only for sch_clsact where we don't need to hold it. Take a look at
__dev_queue_xmit(), and where sch_handle_egress() is called. It's right before
the dev->priv_flags & IFF_XMIT_DST_RELEASE check where we either hold or drop
the dst from skb.

In cls_bpf_prog_from_efd() the tcf_block_netif_keep_dst() will handle this.
If you call cls_bpf e.g. out of htb via htb_classify() -> tcf_classify()
then you would have to hold it and thus call netif_keep_dst() so that the
test in __dev_queue_xmit() results in holding the dst via skb_dst_force().

Cheers,
Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v2 5/7] ocxl: Expose the thread_id needed for wait on p9
From: Andrew Donnellan @ 2018-04-23  7:16 UTC (permalink / raw)
  To: Alastair D'Silva, linuxppc-dev
  Cc: linux-kernel, linux-doc, mikey, vaibhav, aneesh.kumar, malat,
	felix, pombredanne, sukadev, npiggin, gregkh, arnd, fbarrat,
	corbet, Alastair D'Silva
In-Reply-To: <20180418010810.30937-6-alastair@au1.ibm.com>

On 18/04/18 11:08, Alastair D'Silva wrote:
> From: Alastair D'Silva <alastair@d-silva.org>
> 
> In order to successfully issue as_notify, an AFU needs to know the TID
> to notify, which in turn means that this information should be
> available in userspace so it can be communicated to the AFU.
> 
> Signed-off-by: Alastair D'Silva <alastair@d-silva.org>

nitpicks below

Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

> ---
>   drivers/misc/ocxl/context.c       |  5 +++-
>   drivers/misc/ocxl/file.c          | 53 +++++++++++++++++++++++++++++++++++++++
>   drivers/misc/ocxl/link.c          | 36 ++++++++++++++++++++++++++
>   drivers/misc/ocxl/ocxl_internal.h |  1 +
>   include/misc/ocxl.h               |  9 +++++++
>   include/uapi/misc/ocxl.h          | 10 ++++++++
>   6 files changed, 113 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/misc/ocxl/context.c b/drivers/misc/ocxl/context.c
> index 909e8807824a..95f74623113e 100644
> --- a/drivers/misc/ocxl/context.c
> +++ b/drivers/misc/ocxl/context.c
> @@ -34,6 +34,8 @@ int ocxl_context_init(struct ocxl_context *ctx, struct ocxl_afu *afu,
>   	mutex_init(&ctx->xsl_error_lock);
>   	mutex_init(&ctx->irq_lock);
>   	idr_init(&ctx->irq_idr);
> +	ctx->tidr = 0;
> +
>   	/*
>   	 * Keep a reference on the AFU to make sure it's valid for the
>   	 * duration of the life of the context
> @@ -65,6 +67,7 @@ int ocxl_context_attach(struct ocxl_context *ctx, u64 amr)
>   {
>   	int rc;
>   
> +	// Locks both status & tidr
>   	mutex_lock(&ctx->status_mutex);
>   	if (ctx->status != OPENED) {
>   		rc = -EIO;
> @@ -72,7 +75,7 @@ int ocxl_context_attach(struct ocxl_context *ctx, u64 amr)
>   	}
>   
>   	rc = ocxl_link_add_pe(ctx->afu->fn->link, ctx->pasid,
> -			current->mm->context.id, 0, amr, current->mm,
> +			current->mm->context.id, ctx->tidr, amr, current->mm,
>   			xsl_fault_error, ctx);
>   	if (rc)
>   		goto out;
> diff --git a/drivers/misc/ocxl/file.c b/drivers/misc/ocxl/file.c
> index 038509e5d031..eb409a469f21 100644
> --- a/drivers/misc/ocxl/file.c
> +++ b/drivers/misc/ocxl/file.c
> @@ -5,6 +5,8 @@
>   #include <linux/sched/signal.h>
>   #include <linux/uaccess.h>
>   #include <uapi/misc/ocxl.h>
> +#include <asm/reg.h>
> +#include <asm/switch_to.h>
>   #include "ocxl_internal.h"
>   
>   
> @@ -123,11 +125,55 @@ static long afu_ioctl_get_metadata(struct ocxl_context *ctx,
>   	return 0;
>   }
>   
> +#ifdef CONFIG_PPC64
> +static long afu_ioctl_enable_p9_wait(struct ocxl_context *ctx,
> +		struct ocxl_ioctl_p9_wait __user *uarg)
> +{
> +	struct ocxl_ioctl_p9_wait arg;
> +
> +	memset(&arg, 0, sizeof(arg));
> +
> +	if (cpu_has_feature(CPU_FTR_P9_TIDR)) {
> +		enum ocxl_context_status status;
> +
> +		// Locks both status & tidr
> +		mutex_lock(&ctx->status_mutex);
> +		if (!ctx->tidr) {
> +			if (set_thread_tidr(current))
> +				return -ENOENT;
> +
> +			ctx->tidr = current->thread.tidr;
> +		}
> +
> +		status = ctx->status;
> +		mutex_unlock(&ctx->status_mutex);
> +
> +		if (status == ATTACHED) {
> +			int rc;
> +			struct link *link = ctx->afu->fn->link;

Declarations at the top

> +
> +			rc = ocxl_link_update_pe(link, ctx->pasid, ctx->tidr);
> +			if (rc)
> +				return rc;
> +		}
> +
> +		arg.thread_id = ctx->tidr;
> +	} else
> +		return -ENOENT;
> +
> +	if (copy_to_user(uarg, &arg, sizeof(arg)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +#endif
> +
>   #define CMD_STR(x) (x == OCXL_IOCTL_ATTACH ? "ATTACH" :			\
>   			x == OCXL_IOCTL_IRQ_ALLOC ? "IRQ_ALLOC" :	\
>   			x == OCXL_IOCTL_IRQ_FREE ? "IRQ_FREE" :		\
>   			x == OCXL_IOCTL_IRQ_SET_FD ? "IRQ_SET_FD" :	\
>   			x == OCXL_IOCTL_GET_METADATA ? "GET_METADATA" :	\
> +			x == OCXL_IOCTL_ENABLE_P9_WAIT ? "ENABLE_P9_WAIT" :	\
>   			"UNKNOWN")
>   
>   static long afu_ioctl(struct file *file, unsigned int cmd,
> @@ -186,6 +232,13 @@ static long afu_ioctl(struct file *file, unsigned int cmd,
>   				(struct ocxl_ioctl_metadata __user *) args);
>   		break;
>   
> +#ifdef CONFIG_PPC64
> +	case OCXL_IOCTL_ENABLE_P9_WAIT:
> +		rc = afu_ioctl_enable_p9_wait(ctx,
> +				(struct ocxl_ioctl_p9_wait __user *) args);
> +		break;
> +#endif
> +
>   	default:
>   		rc = -EINVAL;
>   	}
> diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
> index 656e8610eec2..88876ae8f330 100644
> --- a/drivers/misc/ocxl/link.c
> +++ b/drivers/misc/ocxl/link.c
> @@ -544,6 +544,42 @@ int ocxl_link_add_pe(void *link_handle, int pasid, u32 pidr, u32 tidr,
>   }
>   EXPORT_SYMBOL_GPL(ocxl_link_add_pe);
>   
> +int ocxl_link_update_pe(void *link_handle, int pasid, __u16 tid)
> +{
> +	struct link *link = (struct link *) link_handle;
> +	struct spa *spa = link->spa;
> +	struct ocxl_process_element *pe;
> +	int pe_handle, rc;
> +
> +	if (pasid > SPA_PASID_MAX)
> +		return -EINVAL;
> +
> +	pe_handle = pasid & SPA_PE_MASK;
> +	pe = spa->spa_mem + pe_handle;
> +
> +	mutex_lock(&spa->spa_lock);
> +
> +	pe->tid = tid;
> +
> +	/*
> +	 * The barrier makes sure the PE is updated
> +	 * before we clear the NPU context cache below, so that the
> +	 * old PE cannot be reloaded erroneously.
> +	 */
> +	mb();
> +
> +	/*
> +	 * hook to platform code
> +	 * On powerpc, the entry needs to be cleared from the context
> +	 * cache of the NPU.
> +	 */
> +	rc = pnv_ocxl_spa_remove_pe_from_cache(link->platform_data, pe_handle);
> +	WARN_ON(rc);
> +
> +	mutex_unlock(&spa->spa_lock);
> +	return rc;
> +}
> +
>   int ocxl_link_remove_pe(void *link_handle, int pasid)
>   {
>   	struct link *link = (struct link *) link_handle;
> diff --git a/drivers/misc/ocxl/ocxl_internal.h b/drivers/misc/ocxl/ocxl_internal.h
> index 5d421824afd9..6c6d4e61888e 100644
> --- a/drivers/misc/ocxl/ocxl_internal.h
> +++ b/drivers/misc/ocxl/ocxl_internal.h
> @@ -77,6 +77,7 @@ struct ocxl_context {
>   	struct ocxl_xsl_error xsl_error;
>   	struct mutex irq_lock;
>   	struct idr irq_idr;
> +	__u16 tidr; // Thread ID used for P9 wait implementation

What's the difference between u16 and __u16...

>   };
>   
>   struct ocxl_process_element {
> diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h
> index 51ccf76db293..9ff6ddc28e22 100644
> --- a/include/misc/ocxl.h
> +++ b/include/misc/ocxl.h
> @@ -188,6 +188,15 @@ extern int ocxl_link_add_pe(void *link_handle, int pasid, u32 pidr, u32 tidr,
>   		void (*xsl_err_cb)(void *data, u64 addr, u64 dsisr),
>   		void *xsl_err_data);
>   
> +/**
> + * Update values within a Process Element
> + *
> + * link_handle: the link handle associated with the process element
> + * pasid: the PASID for the AFU context
> + * tid: the new thread id for the process element
> + */
> +extern int ocxl_link_update_pe(void *link_handle, int pasid, __u16 tid);
> +
>   /*
>    * Remove a Process Element from the Shared Process Area for a link
>    */
> diff --git a/include/uapi/misc/ocxl.h b/include/uapi/misc/ocxl.h
> index 0af83d80fb3e..8d2748e69c84 100644
> --- a/include/uapi/misc/ocxl.h
> +++ b/include/uapi/misc/ocxl.h
> @@ -48,6 +48,15 @@ struct ocxl_ioctl_metadata {
>   	__u64 reserved[13]; // Total of 16*u64
>   };
>   
> +struct ocxl_ioctl_p9_wait {
> +	__u16 thread_id; // The thread ID required to wake this thread
> +	__u16 reserved1;
> +	__u32 reserved2;
> +	__u64 reserved3[3];
> +};
> +
> +};
> +
>   struct ocxl_ioctl_irq_fd {
>   	__u64 irq_offset;
>   	__s32 eventfd;
> @@ -62,5 +71,6 @@ struct ocxl_ioctl_irq_fd {
>   #define OCXL_IOCTL_IRQ_FREE	_IOW(OCXL_MAGIC, 0x12, __u64)
>   #define OCXL_IOCTL_IRQ_SET_FD	_IOW(OCXL_MAGIC, 0x13, struct ocxl_ioctl_irq_fd)
>   #define OCXL_IOCTL_GET_METADATA _IOR(OCXL_MAGIC, 0x14, struct ocxl_ioctl_metadata)
> +#define OCXL_IOCTL_ENABLE_P9_WAIT	_IOR(OCXL_MAGIC, 0x15, struct ocxl_ioctl_p9_wait)
>   
>   #endif /* _UAPI_MISC_OCXL_H */
> 

-- 
Andrew Donnellan              OzLabs, ADL Canberra
andrew.donnellan@au1.ibm.com  IBM Australia Limited

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] [v2] docs: clarify security-bugs disclosure policy
From: Linus Torvalds @ 2018-04-22 16:08 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan Cox, Dave Hansen, Linux Kernel Mailing List, Dan Williams,
	Thomas Gleixner, Greg Kroah-Hartman, Andrea Arcangeli,
	Andrew Lutomirski, Kees Cook, Tim Chen, Al Viro, Andrew Morton,
	open list:DOCUMENTATION, Jonathan Corbet, Mark Rutland
In-Reply-To: <20180422100032.GA18114@amd>

On Sun, Apr 22, 2018 at 3:00 AM, Pavel Machek <pavel@ucw.cz> wrote:
>
> On Fri 2018-03-09 13:15:31, Linus Torvalds wrote:
>>
>> Oh, I don't want to be taken seriously by people who use gpg
>> encrypted email.
>
> Heh. I see that gpg has some usability problems, but we do encrypt our
> http connections, and email is at least as sensitive.

It's not about sensitivity.

It's about the technology actually *working*.

PGP does not work for email. Never has, never will. It's unusable
crap. Nobody sane uses it, because the cost is just too damn high.

Make it as easy to use as https, and things will change. As it is,
don't even bother.

Thomas helped me get S/MIME working, and even for that I had to fix an
alpine bug that caused alpine to SIGSEGV. And that's _convenient_
compared to pgp email. Whee.

             Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] [v2] docs: clarify security-bugs disclosure policy
From: Pavel Machek @ 2018-04-22 10:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Dave Hansen, Linux Kernel Mailing List, Dan Williams,
	Thomas Gleixner, Greg Kroah-Hartman, Andrea Arcangeli,
	Andrew Lutomirski, Kees Cook, Tim Chen, Al Viro, Andrew Morton,
	open list:DOCUMENTATION, Jonathan Corbet, Mark Rutland
In-Reply-To: <CA+55aFw5+KjVqKUpAMF+-BfE6NuXSNb8qdRtmAWDq6apK2v+sA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1425 bytes --]

Hi!

On Fri 2018-03-09 13:15:31, Linus Torvalds wrote:
> On Fri, Mar 9, 2018 at 12:45 PM, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:
> >
> > If you want to be taken seriously then I think minimum you also need to
> > - Give a GPG key for messages to the list
> 
> Oh, I don't want to be taken seriously by people who use gpg
> encrypted email.

Heh. I see that gpg has some usability problems, but we do encrypt our
http connections, and email is at least as sensitive.

> > - State what security is in place (encryption etc) to protect the list
> >   itself
> 
> That could be stated, but it's worth noting the other rules.
> 
> If you have some long corrupt vendor disclosure period and are worried
> about any good guys finding out (the bad guys probably already have
> it), we're not the list for you anyway.
> 
> Keep your "we'll keep security problems under wraps so that they can
> be exploited for a long time" emails to yourself, or send them to
> /dev/null.

Umm, they will not sent it to /dev/null, as that is not encrypted :-).

I guess I can act as this kind of /dev/null. It might be useful to
note the issues, and for the serious ones notify you few days before
the "long" embargo is going to expire...

Best regards,

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox