Linux Documentation
 help / color / mirror / Atom feed
* Re: [RFT v2 1/4] perf cs-etm: Generate sample for missed packets
From: Robert Walker @ 2018-05-25 13:56 UTC (permalink / raw)
  To: Leo Yan
  Cc: Arnaldo Carvalho de Melo, Mathieu Poirier, Jonathan Corbet,
	Peter Zijlstra, Ingo Molnar, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, linux-arm-kernel, linux-doc, linux-kernel,
	Tor Jeremiassen, mike.leach, kim.phillips, coresight, Mike Leach
In-Reply-To: <20180523132203.GA30299@leoy-ThinkPad-X240s>

Hi Leo,


On 23/05/18 14:22, Leo Yan wrote:
> Hi Rob,
>
> On Wed, May 23, 2018 at 12:21:18PM +0100, Robert Walker wrote:
>> Hi Leo,
>>
>> On 22/05/18 10:52, Leo Yan wrote:
>>> On Tue, May 22, 2018 at 04:39:20PM +0800, Leo Yan wrote:
>>>
>>> [...]
>>>
>>> Rather than the patch I posted in my previous email, I think below new
>>> patch is more reasonable for me.
>>>
>>> In the below change, 'etmq->prev_packet' is only used to store the
>>> previous CS_ETM_RANGE packet, we don't need to save CS_ETM_TRACE_ON
>>> packet into 'etmq->prev_packet'.
>>>
>>> On the other hand, cs_etm__flush() can use 'etmq->period_instructions'
>>> to indicate if need to generate instruction sample or not.  If it's
>>> non-zero, then generate instruction sample and
>>> 'etmq->period_instructions' will be cleared; so next time if there
>>> have more tracing CS_ETM_TRACE_ON packet, it can skip to generate
>>> instruction sample due 'etmq->period_instructions' is zero.
>>>
>>> How about you think for this?
>>>
>>> Thanks,
>>> Leo Yan
>>>
>> I don't think this covers the cases where CS_ETM_TRACE_ON is used to
>> indicate a discontinuity in the trace.  For example, there is work in
>> progress to configure the ETM so that it only traces a few thousand cycles
>> with a gap of many thousands of cycles between each chunk of trace - this
>> can be used to sample program execution in the form of instruction events
>> with branch stacks for feedback directed optimization (AutoFDO).
>>
>> In this case, the raw trace is something like:
>>
>>      ...
>>      I_ADDR_L_64IS0 : Address, Long, 64 bit, IS0.; Addr=0x0000007E7B886908;
>>      I_ATOM_F3 : Atom format 3.; EEN
>>      I_ATOM_F1 : Atom format 1.; E
>> # Trace stops here
>>
>> # Some time passes, and then trace is turned on again
>>      I_TRACE_ON : Trace On.
>>      I_ADDR_CTXT_L_64IS0 : Address & Context, Long, 64 bit, IS0.;
>> Addr=0x00000057224322F4; Ctxt: AArch64,EL0, NS;
>>      I_ATOM_F3 : Atom format 3.; ENN
>>      I_ATOM_F5 : Atom format 5.; ENENE
>>      ...
>>
>> This results in the following packets from the decoder:
>>
>> CS_ETM_RANGE: [0x7e7b886908-0x7e7b886930] br
>> CS_ETM_RANGE: [0x7e7b88699c-0x7e7b8869a4] br
>> CS_ETM_RANGE: [0x7e7b8869d8-0x7e7b8869f0]
>> CS_ETM_RANGE: [0x7e7b8869f0-0x7e7b8869fc] br
>> CS_ETM_TRACE_ON
>> CS_ETM_RANGE: [0x57224322f4-0x5722432304] br
>> CS_ETM_RANGE: [0x57224320e8-0x57224320ec]
>> CS_ETM_RANGE: [0x57224320ec-0x57224320f8]
>> CS_ETM_RANGE: [0x57224320f8-0x572243212c] br
>> CS_ETM_RANGE: [0x5722439b80-0x5722439bec]
>> CS_ETM_RANGE: [0x5722439bec-0x5722439c14] br
>> CS_ETM_RANGE: [0x5722437c30-0x5722437c6c]
>> CS_ETM_RANGE: [0x5722437c6c-0x5722437c7c] br
>>
>> Without handling the CS_ETM_TRACE_ON, this would be interpreted as a branch
>> from 0x7e7b8869f8 to 0x57224322f4, when there is actually a gap of many
>> thousand instructions between these.
>>
>> I think this patch will break the branch stacks - by removing the
>> prev_packet swap from cs_etm__flush(), the next time a CS_ETM_RANGE packet
>> is handled, cs_etm__sample() will see prev_packet contains the last
>> CS_ETM_RANGE from the previous block of trace, causing an erroneous call to
>> cs_etm__update_last_branch_rb().  In the example above, the branch stack
>> will contain an erroneous branch from 0x7e7b8869f8 to 0x57224322f4.
>>
>> I think what you need to do is add a check for the previous packet being a
>> CS_ETM_TRACE_ON when determining the generate_sample value.
> I still can see there have hole for packets handling with your
> suggestion, let's focus on below three packets:
>
> CS_ETM_RANGE:    [0x7e7b8869f0-0x7e7b8869fc] br
> CS_ETM_TRACE_ON: [0xdeadbeefdeadbeef-0xdeadbeefdeadbeef]
> CS_ETM_RANGE:    [0x57224322f4-0x5722432304] br
>
> When the CS_ETM_TRACE_ON packet is coming, cs_etm__flush() doesn't
> handle for 'etmq->prev_packet' to generate branch sample, this results
> in we miss the info for 0x7e7b8869fc, and with packet swapping
> 'etmq->prev_packet' is assigned to CS_ETM_TRACE_ON packet.
>
> When the last CS_ETM_RANGE packet is coming, cs_etm__sample() will
> combine the values from CS_ETM_TRACE_ON packet and the last
> CS_ETM_RANGE packet to generate branch sample packet; at the end
> we get below sample packets:
>
>    packet(n):   sample::addr=0x7e7b8869f0
>    packet(n+1): sample::ip=0xdeadbeefdeadbeeb sample::addr=0x57224322f4
>
> So I think we also need to generate branch sample, and we can get
> below results:
>
>    packet(n):   sample::addr=0x7e7b8869f0
>    packet(n+1): sample::ip=0x7e7b8869f8 sample::addr=0xdeadbeefdeadbeef
>    packet(n+2): sample::ip=0xdeadbeefdeadbeeb sample::addr=0x57224322f4
>
> So we also can rely on this to get to know there have one address
> range is [0xdeadbeefdeadbeef..0xdeadbeefdeadbeeb] to indicate there
> have a discontinuity in the trace.
Yes, I agree you need the extra branch sample from cs_etm__flush().

With a discontinuity in trace, I get output from perf script like this:

branches:u:        59ee6e2e08 sqlite3VdbeExec (speedtest1) =>       
59ee6e2e64 sqlite3VdbeExec (spe
branches:u:        59ee6e2e7c sqlite3VdbeExec (speedtest1) =>       
59ee6e2eec sqlite3VdbeExec (spe
branches:u:        59ee6e2efc sqlite3VdbeExec (speedtest1) =>       
59ee6e2f14 sqlite3VdbeExec (spe
branches:u:        59ee6e2f3c sqlite3VdbeExec (speedtest1) => 
deadbeefdeadbeef [unknown] ([unknown])
branches:u:  deadbeefdeadbeeb [unknown] ([unknown]) => 769949daa0 memcpy 
(/system/lib64/libc.so)
branches:u:        769949dacc memcpy (/system/lib64/libc.so) =>       
59ee6f0664 insertCell (speedtest1)
branches:u:        59ee6f0664 insertCell (speedtest1) => 59ee6f0684 
insertCell (speedtest1)
branches:u:        59ee6f06a4 insertCell (speedtest1) => 59ee6a4d50 
memmove@plt (speedtest1)
branches:u:        59ee6a4d5c memmove@plt (speedtest1) => 769949ebf8 
memmove (/system/lib64/libc.so)

Showing there is a break in trace between 59ee6e2f3c and 769949daa0.  
The deadbeefdeadbeef addresses are a bit ugly - these are just dummy 
values emitted in the decoder layer - maybe these should be changed to 
0.  Or you could add a new sample type (i.e. not branch) to indicate the 
start / end of trace, with only the valid address.

With this change, it becomes the same as the patch from your previous mail.

Regards

Rob

>> Regards
>>
>> Rob
>>
>>> diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
>>> index 822ba91..dd354ad 100644
>>> --- a/tools/perf/util/cs-etm.c
>>> +++ b/tools/perf/util/cs-etm.c
>>> @@ -495,6 +495,13 @@ static inline void cs_etm__reset_last_branch_rb(struct cs_etm_queue *etmq)
>>>   static inline u64 cs_etm__last_executed_instr(struct cs_etm_packet *packet)
>>>   {
>>>   	/*
>>> +	 * The packet is the start tracing packet if the end_addr is zero,
>>> +	 * returns 0 for this case.
>>> +	 */
>>> +	if (!packet->end_addr)
>>> +		return 0;
>>> +
>>> +	/*
>>>   	 * The packet records the execution range with an exclusive end address
>>>   	 *
>>>   	 * A64 instructions are constant size, so the last executed
>>> @@ -897,13 +904,27 @@ static int cs_etm__sample(struct cs_etm_queue *etmq)
>>>   		etmq->period_instructions = instrs_over;
>>>   	}
>>> -	if (etm->sample_branches &&
>>> -	    etmq->prev_packet &&
>>> -	    etmq->prev_packet->sample_type == CS_ETM_RANGE &&
>>> -	    etmq->prev_packet->last_instr_taken_branch) {
>>> -		ret = cs_etm__synth_branch_sample(etmq);
>>> -		if (ret)
>>> -			return ret;
>>> +	if (etm->sample_branches && etmq->prev_packet) {
>>> +		bool generate_sample = false;
>>> +
>>> +		/* Generate sample for start tracing packet */
>>> +		if (etmq->prev_packet->sample_type == 0)
>>> +			generate_sample = true;
>> Also check for etmq->prev_packet->sample_type == CS_ETM_TRACE_ON here and
>> set generate_sample = true.
> Agree, will add this.
>
>>> +
>>> +		/* Generate sample for exception packet */
>>> +		if (etmq->prev_packet->exc == true)
>>> +			generate_sample = true;
>>> +
>>> +		/* Generate sample for normal branch packet */
>>> +		if (etmq->prev_packet->sample_type == CS_ETM_RANGE &&
>>> +		    etmq->prev_packet->last_instr_taken_branch)
>>> +			generate_sample = true;
>>> +
>>> +		if (generate_sample) {
>>> +			ret = cs_etm__synth_branch_sample(etmq);
>>> +			if (ret)
>>> +				return ret;
>>> +		}
>>>   	}
>>>   	if (etm->sample_branches || etm->synth_opts.last_branch) {
>>> @@ -922,11 +943,12 @@ static int cs_etm__sample(struct cs_etm_queue *etmq)
>>>   static int cs_etm__flush(struct cs_etm_queue *etmq)
>>>   {
>>>   	int err = 0;
>>> -	struct cs_etm_packet *tmp;
>>>   	if (etmq->etm->synth_opts.last_branch &&
>>>   	    etmq->prev_packet &&
>>> -	    etmq->prev_packet->sample_type == CS_ETM_RANGE) {
>>> +	    etmq->prev_packet->sample_type == CS_ETM_RANGE &&
>>> +	    etmq->period_instructions) {
>>> +
>> I don't think this is needed.
> Okay, I will keep this.
>
>>>   		/*
>>>   		 * Generate a last branch event for the branches left in the
>>>   		 * circular buffer at the end of the trace.
>>> @@ -940,14 +962,6 @@ static int cs_etm__flush(struct cs_etm_queue *etmq)
>>>   			etmq, addr,
>>>   			etmq->period_instructions);
>>>   		etmq->period_instructions = 0;
>>> -
>>> -		/*
>>> -		 * Swap PACKET with PREV_PACKET: PACKET becomes PREV_PACKET for
>>> -		 * the next incoming packet.
>>> -		 */
>>> -		tmp = etmq->packet;
>>> -		etmq->packet = etmq->prev_packet;
>>> -		etmq->prev_packet = tmp;
>> This should not be changed as discussed above.
> Okay, will keep this.  But I suggest we add some change like below:
>
> +    if (etm->sample_branches) {
> +        err = cs_etm__synth_branch_sample(etmq);
> +        if (err)
> +            return err;
> +    }
>
> If so, could you review my posted another patch for this?
> http://archive.armlinux.org.uk/lurker/message/20180522.083920.184f1f78.en.html
WIll do - with these changes, it is the same as your original patch.
> Thanks,
> Leo Yan
>
>>>   	}
>>>   	return err;
>>>

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v8 4/6] cpuset: Make generate_sched_domains() recognize isolated_cpus
From: Juri Lelli @ 2018-05-25 12:52 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Waiman Long, Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, Mike Galbraith, torvalds, Roman Gushchin
In-Reply-To: <20180525103147.GC30654@e110439-lin>

On 25/05/18 11:31, Patrick Bellasi wrote:

[...]

> Right, so the problem seems to be that we "need" to call
> arch_update_cpu_topology() and we do that by calling
> partition_sched_domains() which was initially introduced by:
> 
>    029190c515f1 ("cpuset sched_load_balance flag")
> 
> back in 2007, where it's also quite well explained the reasons behind
> the sched_load_balance flag and the idea to have "partitioned" SDs.
> 
> I also (hopefully) understood that there are at least two actors involved:
> 
>  - A) arch code
>    which creates SDs and SGs, usually to group CPUs depending on the
>    memory hierarchy, to support different time granularity of load
>    balancing operations
> 
>    Special case here are HP and hibernation which, by on-/off-lining
>    CPUs they directly affect the SDs/SGs definitions.
> 
>  - B) cpusets
>    which expose to userspace the possibility to define,
>    _if possible_, a finer granularity set of SGs to further restrict the
>    scope of load balancing operations
> 
> Since B is a "possible finer granularity" refinement of A, then we
> trigger A's reconfigurations based on B's constraints.
> 
> That's why, for example, in consequence of an HP online event,
> we have:
> 
>    --- core.c -------------------
>     HP[sched:active]
>      | sched_cpu_activate()
>        | cpuset_cpu_active()
>    --- cpuset.c -----------------
>          | cpuset_update_active_cpus()
>            | schedule_work(&cpuset_hotplug_work)
>             \.. System Kworker \
>                 | cpuset_hotplug_workfn()
>                   if (cpus_updated || force_rebuild)
>                     | rebuild_sched_domains()
>                       | rebuild_sched_domains_locked()
>                         | generate_sched_domains()
>    --- topology.c ---------------
>                         | partition_sched_domains()
>                           | arch_update_cpu_topology()
> 
> 
> IOW, we need to pass via cpusets to rebuild the SDs whenever we
> there are HP events or we "need" to do an arch_update_cpu_topology()
> via the arch topology driver (drivers/base/arch_topology.c).

I don't think the arch topology driver is always involved in this (e.g.,
arch/x86/kernel/itmt::sched_itmt_update_handler()).

Still we need to check if topology changed, as you say.

> This last bit is also interesting, whenever we detect arch topology
> information that required an SD rebuild, we need to force a
> partition_sched_domains(). But, for that, in:
> 
>    commit 50e76632339d ("sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs")
> 
> we just introduced the support for the "force_rebuild" flag to be set.
> 
> Thus, potentially we can just extend the check I've proposed to consider the
> force rebuild flag, to be something like:
> 
> ---8<---
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 8f586e8bdc98..1f051fafaa3a 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -874,11 +874,19 @@ static void rebuild_sched_domains_locked(void)
>            !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
>                 goto out;
>  
> +       /* Special case for the 99% of systems with one, full, sched domain */
> +       if (!force_rebuild &&
> +           !top_cpuset.isolation_count &&
> +           is_sched_load_balance(&top_cpuset))
> +               goto out;
> +       force_rebuild = false;
> +
>         /* Generate domain masks and attrs */
>         ndoms = generate_sched_domains(&doms, &attr);
>  
>         /* Have scheduler rebuild the domains */
>         partition_sched_domains(ndoms, doms, attr);
>  out:
>         put_online_cpus();
> ---8<---
> 
> 
> Which would still allow to use something like:
> 
>    cpuset_force_rebuild()
>    rebuild_sched_domains()
> 
> to actually rebuild SD in consequence of arch topology changes.

That might work.

> 
> > 
> > Maybe we could move the check you are proposing in update_cpumasks_
> > hier() ?
> 
> Yes, that's another option... although there we are outside of
> get_online_cpus(). Could be a problem?

Mmm, using force_rebuild flag seems safer indeed.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v8 4/6] cpuset: Make generate_sched_domains() recognize isolated_cpus
From: Patrick Bellasi @ 2018-05-25 10:31 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Waiman Long, Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, Mike Galbraith, torvalds, Roman Gushchin
In-Reply-To: <20180524103938.GB3948@localhost.localdomain>

Hi Juri,
following are some notes I took while trying to understand what's going on...
could be useful to understand if I have a correct view of all the different
components and how they come together.

At the end there are also a couple of possible updates and a question on your
proposal.

Cheers Patrick

On 24-May 12:39, Juri Lelli wrote:
> On 24/05/18 10:04, Patrick Bellasi wrote:
> 
> [...]
> 
> > From 84bb8137ce79f74849d97e30871cf67d06d8d682 Mon Sep 17 00:00:00 2001
> > From: Patrick Bellasi <patrick.bellasi@arm.com>
> > Date: Wed, 23 May 2018 16:33:06 +0100
> > Subject: [PATCH 1/1] cgroup/cpuset: disable sched domain rebuild when not
> >  required
> > 
> > The generate_sched_domains() already addresses the "special case for 99%
> > of systems" which require a single full sched domain at the root,
> > spanning all the CPUs. However, the current support is based on an
> > expensive sequence of operations which destroy and recreate the exact
> > same scheduling domain configuration.
> > 
> > If we notice that:
> > 
> >  1) CPUs in "cpuset.isolcpus" are excluded from load balancing by the
> >     isolcpus= kernel boot option, and will never be load balanced
> >     regardless of the value of "cpuset.sched_load_balance" in any
> >     cpuset.
> > 
> >  2) the root cpuset has load_balance enabled by default at boot and
> >     it's the only parameter which userspace can change at run-time.
> > 
> > we know that, by default, every system comes up with a complete and
> > properly configured set of scheduling domains covering all the CPUs.
> > 
> > Thus, on every system, unless the user explicitly disables load balance
> > for the top_cpuset, the scheduling domains already configured at boot
> > time by the scheduler/topology code and updated in consequence of
> > hotplug events, are already properly configured for cpuset too.
> > 
> > This configuration is the default one for 99% of the systems,
> > and it's also the one used by most of the Android devices which never
> > disable load balance from the top_cpuset.
> > 
> > Thus, while load balance is enabled for the top_cpuset,
> > destroying/rebuilding the scheduling domains at every cpuset.cpus
> > reconfiguration is a useless operation which will always produce the
> > same result.
> > 
> > Let's anticipate the "special" optimization within:
> > 
> >    rebuild_sched_domains_locked()
> > 
> > thus completely skipping the expensive:
> > 
> >    generate_sched_domains()
> >    partition_sched_domains()
> > 
> > for all the cases we know that the scheduling domains already defined
> > will not be affected by whatsoever value of cpuset.cpus.
> 
> [...]
> 
> > +	/* Special case for the 99% of systems with one, full, sched domain */
> > +	if (!top_cpuset.isolation_count &&
> > +	    is_sched_load_balance(&top_cpuset))
> > +		goto out;
> > +
> 
> Mmm, looks like we still need to destroy e recreate if there is a
> new_topology (see arch_update_cpu_topology() in partition_sched_
> domains).

Right, so the problem seems to be that we "need" to call
arch_update_cpu_topology() and we do that by calling
partition_sched_domains() which was initially introduced by:

   029190c515f1 ("cpuset sched_load_balance flag")

back in 2007, where it's also quite well explained the reasons behind
the sched_load_balance flag and the idea to have "partitioned" SDs.

I also (hopefully) understood that there are at least two actors involved:

 - A) arch code
   which creates SDs and SGs, usually to group CPUs depending on the
   memory hierarchy, to support different time granularity of load
   balancing operations

   Special case here are HP and hibernation which, by on-/off-lining
   CPUs they directly affect the SDs/SGs definitions.

 - B) cpusets
   which expose to userspace the possibility to define,
   _if possible_, a finer granularity set of SGs to further restrict the
   scope of load balancing operations

Since B is a "possible finer granularity" refinement of A, then we
trigger A's reconfigurations based on B's constraints.

That's why, for example, in consequence of an HP online event,
we have:

   --- core.c -------------------
    HP[sched:active]
     | sched_cpu_activate()
       | cpuset_cpu_active()
   --- cpuset.c -----------------
         | cpuset_update_active_cpus()
           | schedule_work(&cpuset_hotplug_work)
            \.. System Kworker \
                | cpuset_hotplug_workfn()
                  if (cpus_updated || force_rebuild)
                    | rebuild_sched_domains()
                      | rebuild_sched_domains_locked()
                        | generate_sched_domains()
   --- topology.c ---------------
                        | partition_sched_domains()
                          | arch_update_cpu_topology()


IOW, we need to pass via cpusets to rebuild the SDs whenever we
there are HP events or we "need" to do an arch_update_cpu_topology()
via the arch topology driver (drivers/base/arch_topology.c).

This last bit is also interesting, whenever we detect arch topology
information that required an SD rebuild, we need to force a
partition_sched_domains(). But, for that, in:

   commit 50e76632339d ("sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs")

we just introduced the support for the "force_rebuild" flag to be set.

Thus, potentially we can just extend the check I've proposed to consider the
force rebuild flag, to be something like:

---8<---
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 8f586e8bdc98..1f051fafaa3a 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -874,11 +874,19 @@ static void rebuild_sched_domains_locked(void)
           !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
                goto out;
 
+       /* Special case for the 99% of systems with one, full, sched domain */
+       if (!force_rebuild &&
+           !top_cpuset.isolation_count &&
+           is_sched_load_balance(&top_cpuset))
+               goto out;
+       force_rebuild = false;
+
        /* Generate domain masks and attrs */
        ndoms = generate_sched_domains(&doms, &attr);
 
        /* Have scheduler rebuild the domains */
        partition_sched_domains(ndoms, doms, attr);
 out:
        put_online_cpus();
---8<---


Which would still allow to use something like:

   cpuset_force_rebuild()
   rebuild_sched_domains()

to actually rebuild SD in consequence of arch topology changes.

> 
> Maybe we could move the check you are proposing in update_cpumasks_
> hier() ?

Yes, that's another option... although there we are outside of
get_online_cpus(). Could be a problem?

However, in general, I would say that all around:

   rebuild_sched_domains
   rebuild_sched_domains_locked
   update_cpumask
   update_cpumasks_hier

a nice refactoring would be really deserved :)

-- 
#include <best/regards.h>

Patrick Bellasi
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH v8 3/6] cpuset: Add cpuset.sched.load_balance flag to v2
From: Patrick Bellasi @ 2018-05-25  9:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Juri Lelli, Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, Mike Galbraith, torvalds, Roman Gushchin
In-Reply-To: <5f409ed7-3850-f1ea-58cf-4326605d1570@redhat.com>

On 24-May 11:22, Waiman Long wrote:
> On 05/24/2018 11:16 AM, Juri Lelli wrote:
> > On 24/05/18 11:09, Waiman Long wrote:
> >> On 05/24/2018 10:36 AM, Juri Lelli wrote:
> >>> On 17/05/18 16:55, Waiman Long wrote:
> >>>
> >>> [...]
> >>>
> >>>> +	A parent cgroup cannot distribute all its CPUs to child
> >>>> +	scheduling domain cgroups unless its load balancing flag is
> >>>> +	turned off.
> >>>> +
> >>>> +  cpuset.sched.load_balance
> >>>> +	A read-write single value file which exists on non-root
> >>>> +	cpuset-enabled cgroups.  It is a binary value flag that accepts
> >>>> +	either "0" (off) or a non-zero value (on).  This flag is set
> >>>> +	by the parent and is not delegatable.
> >>>> +
> >>>> +	When it is on, tasks within this cpuset will be load-balanced
> >>>> +	by the kernel scheduler.  Tasks will be moved from CPUs with
> >>>> +	high load to other CPUs within the same cpuset with less load
> >>>> +	periodically.
> >>>> +
> >>>> +	When it is off, there will be no load balancing among CPUs on
> >>>> +	this cgroup.  Tasks will stay in the CPUs they are running on
> >>>> +	and will not be moved to other CPUs.
> >>>> +
> >>>> +	The initial value of this flag is "1".	This flag is then
> >>>> +	inherited by child cgroups with cpuset enabled.  Its state
> >>>> +	can only be changed on a scheduling domain cgroup with no
> >>>> +	cpuset-enabled children.
> >>> [...]
> >>>
> >>>> +	/*
> >>>> +	 * On default hierachy, a load balance flag change is only allowed
> >>>> +	 * in a scheduling domain with no child cpuset.
> >>>> +	 */
> >>>> +	if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) && balance_flag_changed &&
> >>>> +	   (!is_sched_domain(cs) || css_has_online_children(&cs->css))) {
> >>>> +		err = -EINVAL;
> >>>> +		goto out;
> >>>> +	}
> >>> The rule is actually
> >>>
> >>>  - no child cpuset
> >>>  - and it must be a scheduling domain

I always a bit confused by the usage of "scheduling domain", which
overlaps with the SD concept from the scheduler standpoint.

AFAIU a cpuset sched domain is not granted to be turned into an
actual scheduler SD, am I wrong?

If that's the case, why not better disambiguate these two concept by
calling the cpuset one a "cpus partition" or eventually "cpuset domain"?

-- 
#include <best/regards.h>

Patrick Bellasi
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 0/3] Add parameter for disabling ACS redirection for P2P
From: Christian König @ 2018-05-25  8:28 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-pci, linux-doc
  Cc: Stephen Bates, Christoph Hellwig, Bjorn Helgaas, Jonathan Corbet,
	Ingo Molnar, Thomas Gleixner, Christoffer Dall, Paul E. McKenney,
	Marc Zyngier, Kai-Heng Feng, Frederic Weisbecker, Dan Williams,
	Jérôme Glisse, Benjamin Herrenschmidt, Alex Williamson
In-Reply-To: <20180524214816.14485-1-logang@deltatee.com>

Am 24.05.2018 um 23:48 schrieb Logan Gunthorpe:
> Hi,
>
> As discussed in our PCI P2PDMA series, we'd like to add a kernel
> parameter for selectively disabling ACS redirection for select
> bridges. Seeing this turned out to be a small series in itself, we've
> decided to send this separately from the P2P work.
>
> This series generalizes the code already done for the resource_alignment
> option that already exists. The first patch creates a helper function
> to match PCI devices against strings based on the code that already
> existed in pci_specified_resource_alignment().
>
> The second patch expands the new helper to optionally take a path of
> PCI devfns. This is to address Alex's renumbering concern when using
> simple bus-devfns. The implementation is essentially how he described it and
> similar to the Intel VT-d spec (Section 8.3.1).
>
> The final patch adds the disable_acs_redir kernel parameter which takes
> a list of PCI devices and will disable the ACS P2P Request Redirect,
> ACS P2P Completion Redirect and ACS P2P Egress Control bits for the
> selected devices. This allows P2P traffic between selected bridges and
> seeing it's done at boot, before IOMMU group creating the IOMMU groups
> will be created correctly based on the bits.
>
> Thanks,
>
> Logan
>
>
> Logan Gunthorpe (3):
>    PCI: Make specifying PCI devices in kernel parameters reusable
>    PCI: Allow specifying devices using a base bus and path of devfns
>    PCI: Introduce the disable_acs_redir parameter

Thanks a lot of taking care of it like that. It looks much cleaner to me 
than just trying to disable ACS without a parameter.

Series is Acked-by: Christian König <christian.koenig@amd.com>.

Thanks,
Christian.


>
>   Documentation/admin-guide/kernel-parameters.txt |  39 ++-
>   drivers/pci/pci.c                               | 358 ++++++++++++++++++++----
>   2 files changed, 336 insertions(+), 61 deletions(-)
>
> --
> 2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v3 6/9] trace_uprobe: Support SDT markers having reference count (semaphore)
From: Ravi Bangoria @ 2018-05-25  8:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: mhiramat, peterz, srikar, rostedt, acme, ananth, akpm,
	alexander.shishkin, alexis.berlemont, corbet, dan.j.williams,
	jolsa, kan.liang, kjlx, kstewart, linux-doc, linux-kernel,
	linux-mm, milian.wolff, mingo, namhyung, naveen.n.rao, pc, tglx,
	yao.jin, fengguang.wu, jglisse, Ravi Bangoria
In-Reply-To: <20180524162608.GA27082@redhat.com>

Thanks Oleg for the review,

On 05/24/2018 09:56 PM, Oleg Nesterov wrote:
> On 04/17, Ravi Bangoria wrote:
>>
>> @@ -941,6 +1091,9 @@ typedef bool (*filter_func_t)(struct uprobe_consumer *self,
>>  	if (ret)
>>  		goto err_buffer;
>>  
>> +	if (tu->ref_ctr_offset)
>> +		sdt_increment_ref_ctr(tu);
>> +
> 
> iiuc, this is probe_event_enable()...
> 
> Looks racy, but afaics the race with uprobe_mmap() will be closed by the next
> change. However, it seems that probe_event_disable() can race with trace_uprobe_mmap()
> too and the next 7/9 patch won't help,
> 
>> +	if (tu->ref_ctr_offset)
>> +		sdt_decrement_ref_ctr(tu);
>> +
>>  	uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
>>  	tu->tp.flags &= file ? ~TP_FLAG_TRACE : ~TP_FLAG_PROFILE;
> 
> so what if trace_uprobe_mmap() comes right after uprobe_unregister() ?
> Note that trace_probe_is_enabled() is T until we update tp.flags.

Sure, I'll look at your comments.

Apart from these, I've also found a deadlock between uprobe_lock and
mm->mmap_sem. trace_uprobe_mmap() takes these locks in

   mm->mmap_sem
      uprobe_lock

order but some other code path is taking these locks in reverse order. I've
mentioned sample lockdep warning at the end. The issue is, mm->mmap_sem is
not in control of trace_uprobe_mmap() and we have to take uprobe_lock to
loop over all trace_uprobes.

Any idea how this can be resolved?


Sample lockdep warning:

[  499.258006] ======================================================
[  499.258205] WARNING: possible circular locking dependency detected
[  499.258409] 4.17.0-rc3+ #76 Not tainted
[  499.258528] ------------------------------------------------------
[  499.258731] perf/6744 is trying to acquire lock:
[  499.258895] 00000000e4895f49 (uprobe_lock){+.+.}, at: trace_uprobe_mmap+0x78/0x130
[  499.259147]
[  499.259147] but task is already holding lock:
[  499.259349] 000000009ec93a76 (&mm->mmap_sem){++++}, at: vm_mmap_pgoff+0xe0/0x160
[  499.259597]
[  499.259597] which lock already depends on the new lock.
[  499.259597]
[  499.259848]
[  499.259848] the existing dependency chain (in reverse order) is:
[  499.260086]
[  499.260086] -> #4 (&mm->mmap_sem){++++}:
[  499.260277]        __lock_acquire+0x53c/0x910
[  499.260442]        lock_acquire+0xf4/0x2f0
[  499.260595]        down_write_killable+0x6c/0x150
[  499.260764]        copy_process.isra.34.part.35+0x1594/0x1be0
[  499.260967]        _do_fork+0xf8/0x910
[  499.261090]        ppc_clone+0x8/0xc
[  499.261209]
[  499.261209] -> #3 (&dup_mmap_sem){++++}:
[  499.261378]        __lock_acquire+0x53c/0x910
[  499.261540]        lock_acquire+0xf4/0x2f0
[  499.261669]        down_write+0x6c/0x110
[  499.261793]        percpu_down_write+0x48/0x140
[  499.261954]        register_for_each_vma+0x6c/0x2a0
[  499.262116]        uprobe_register+0x230/0x320
[  499.262277]        probe_event_enable+0x1cc/0x540
[  499.262435]        perf_trace_event_init+0x1e0/0x350
[  499.262587]        perf_trace_init+0xb0/0x110
[  499.262750]        perf_tp_event_init+0x38/0x90
[  499.262910]        perf_try_init_event+0x10c/0x150
[  499.263075]        perf_event_alloc+0xbb0/0xf10
[  499.263235]        sys_perf_event_open+0x2a8/0xdd0
[  499.263396]        system_call+0x58/0x6c
[  499.263516]
[  499.263516] -> #2 (&uprobe->register_rwsem){++++}:
[  499.263723]        __lock_acquire+0x53c/0x910
[  499.263884]        lock_acquire+0xf4/0x2f0
[  499.264002]        down_write+0x6c/0x110
[  499.264118]        uprobe_register+0x1ec/0x320
[  499.264283]        probe_event_enable+0x1cc/0x540
[  499.264442]        perf_trace_event_init+0x1e0/0x350
[  499.264603]        perf_trace_init+0xb0/0x110
[  499.264766]        perf_tp_event_init+0x38/0x90
[  499.264930]        perf_try_init_event+0x10c/0x150
[  499.265092]        perf_event_alloc+0xbb0/0xf10
[  499.265261]        sys_perf_event_open+0x2a8/0xdd0
[  499.265424]        system_call+0x58/0x6c
[  499.265542]
[  499.265542] -> #1 (event_mutex){+.+.}:
[  499.265738]        __lock_acquire+0x53c/0x910
[  499.265896]        lock_acquire+0xf4/0x2f0
[  499.266019]        __mutex_lock+0xa0/0xab0
[  499.266142]        trace_add_event_call+0x44/0x100
[  499.266310]        create_trace_uprobe+0x4a0/0x8b0
[  499.266474]        trace_run_command+0xa4/0xc0
[  499.266631]        trace_parse_run_command+0xe4/0x200
[  499.266799]        probes_write+0x20/0x40
[  499.266922]        __vfs_write+0x6c/0x240
[  499.267041]        vfs_write+0xd0/0x240
[  499.267166]        ksys_write+0x6c/0x110
[  499.267295]        system_call+0x58/0x6c
[  499.267413]
[  499.267413] -> #0 (uprobe_lock){+.+.}:
[  499.267591]        validate_chain.isra.34+0xbd0/0x1000
[  499.267747]        __lock_acquire+0x53c/0x910
[  499.267917]        lock_acquire+0xf4/0x2f0
[  499.268048]        __mutex_lock+0xa0/0xab0
[  499.268170]        trace_uprobe_mmap+0x78/0x130
[  499.268335]        uprobe_mmap+0x80/0x3b0
[  499.268464]        mmap_region+0x290/0x660
[  499.268590]        do_mmap+0x40c/0x500
[  499.268718]        vm_mmap_pgoff+0x114/0x160
[  499.268870]        ksys_mmap_pgoff+0xe8/0x2e0
[  499.269034]        sys_mmap+0x84/0xf0
[  499.269161]        system_call+0x58/0x6c
[  499.269279]
[  499.269279] other info that might help us debug this:
[  499.269279]
[  499.269524] Chain exists of:
[  499.269524]   uprobe_lock --> &dup_mmap_sem --> &mm->mmap_sem
[  499.269524]
[  499.269856]  Possible unsafe locking scenario:
[  499.269856]
[  499.270058]        CPU0                    CPU1
[  499.270223]        ----                    ----
[  499.270384]   lock(&mm->mmap_sem);
[  499.270514]                                lock(&dup_mmap_sem);
[  499.270711]                                lock(&mm->mmap_sem);
[  499.270923]   lock(uprobe_lock);
[  499.271046]
[  499.271046]  *** DEADLOCK ***
[  499.271046]
[  499.271256] 1 lock held by perf/6744:
[  499.271377]  #0: 000000009ec93a76 (&mm->mmap_sem){++++}, at: vm_mmap_pgoff+0xe0/0x160
[  499.271628]
[  499.271628] stack backtrace:
[  499.271797] CPU: 25 PID: 6744 Comm: perf Not tainted 4.17.0-rc3+ #76
[  499.272003] Call Trace:
[  499.272094] [c0000000e32d74a0] [c000000000b00174] dump_stack+0xe8/0x164 (unreliable)
[  499.272349] [c0000000e32d74f0] [c0000000001a905c] print_circular_bug.isra.30+0x354/0x388
[  499.272590] [c0000000e32d7590] [c0000000001a3050] check_prev_add.constprop.38+0x8f0/0x910
[  499.272828] [c0000000e32d7690] [c0000000001a3c40] validate_chain.isra.34+0xbd0/0x1000
[  499.273070] [c0000000e32d7780] [c0000000001a57cc] __lock_acquire+0x53c/0x910
[  499.273311] [c0000000e32d7860] [c0000000001a65b4] lock_acquire+0xf4/0x2f0
[  499.273510] [c0000000e32d7930] [c000000000b1d1f0] __mutex_lock+0xa0/0xab0
[  499.273717] [c0000000e32d7a40] [c0000000002b01b8] trace_uprobe_mmap+0x78/0x130
[  499.273952] [c0000000e32d7a90] [c0000000002d7070] uprobe_mmap+0x80/0x3b0
[  499.274153] [c0000000e32d7b20] [c0000000003550a0] mmap_region+0x290/0x660
[  499.274353] [c0000000e32d7c00] [c00000000035587c] do_mmap+0x40c/0x500
[  499.274560] [c0000000e32d7c80] [c00000000031ebc4] vm_mmap_pgoff+0x114/0x160
[  499.274763] [c0000000e32d7d60] [c000000000352818] ksys_mmap_pgoff+0xe8/0x2e0
[  499.275013] [c0000000e32d7de0] [c000000000016864] sys_mmap+0x84/0xf0
[  499.275207] [c0000000e32d7e30] [c00000000000b404] system_call+0x58/0x6c


--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v8 2/6] cpuset: Add new v2 cpuset.sched.domain flag
From: Peter Zijlstra @ 2018-05-25  7:15 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli
In-Reply-To: <675f0f38-9154-4e73-1679-179eefdb7c9f@redhat.com>

On Thu, May 24, 2018 at 02:53:31PM -0400, Waiman Long wrote:
> On 05/24/2018 11:41 AM, Peter Zijlstra wrote:
> > On Thu, May 17, 2018 at 04:55:41PM -0400, Waiman Long wrote:
> >> A new cpuset.sched.domain boolean flag is added to cpuset v2. This new
> >> flag indicates that the CPUs in the current cpuset should be treated
> >> as a separate scheduling domain.
> > The traditional name for this is a partition.
> 
> Do you want to call it cpuset.sched.partition? That name sounds strange
> to me.

Let me explore the whole domain x load-balance space first. I'm thinking
the two parameters are mostly redundant, but I might be overlooking
something (trivial or otherwise).

> >> +  cpuset.sched.domain
> >> +	A read-write single value file which exists on non-root
> >> +	cpuset-enabled cgroups.  It is a binary value flag that accepts
> >> +	either "0" (off) or a non-zero value (on).
> > I would be conservative and only allow 0/1.
> 
> I stated that because echoing other integer value like 2 into the flag
> file won't return any error. I will modify it to say just 0 and 1.

Ah, I would make the file error on >1.

Because then you can always extend the meaning afterwards because you
know it won't be written to with the new value.

> >> +	3) There is no child cgroups with cpuset enabled.
> >> +
> >> +	Setting this flag will take the CPUs away from the effective
> >> +	CPUs of the parent cgroup. Once it is set, this flag cannot be
> >> +	cleared if there are any child cgroups with cpuset enabled.
> > This I'm not clear on. Why?
> >
> That is for pragmatic reason as it is easier to code this way. We could
> remove this restriction but that will make the code more complex.

Best to mention that I think.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg
From: TSUKADA Koutaro @ 2018-05-25  1:55 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Michal Hocko, Johannes Weiner, Vladimir Davydov, Jonathan Corbet,
	Luis R. Rodriguez, Kees Cook, Andrew Morton, Roman Gushchin,
	David Rientjes, Aneesh Kumar K.V, Naoya Horiguchi,
	Anshuman Khandual, Marc-Andre Lureau, Punit Agrawal, Dan Williams,
	Vlastimil Babka, linux-doc, linux-kernel, linux-fsdevel, linux-mm,
	cgroups
In-Reply-To: <4078bc2d-4aaf-cd1b-0145-5915e382852f@oracle.com>

On 2018/05/25 2:45, Mike Kravetz wrote:
[...]
>> THP does not guarantee to use the Huge Page, but may use the normal page.
> 
> Note.  You do not want to use THP because "THP does not guarantee".

[...]
>> One of the answers I have reached is to use HugeTLBfs by overcommitting
>> without creating a pool(this is the surplus hugepage).
> 
> Using hugetlbfs overcommit also does not provide a guarantee.  Without
> doing much research, I would say the failure rate for obtaining a huge
> page via THP and hugetlbfs overcommit is about the same.  The most
> difficult issue in both cases will be obtaining a "huge page" number of
> pages from the buddy allocator.

Yes. If do not support multiple size hugetlb pages such as x86, because
number of pages between THP and hugetlb is same, the failure rate of
obtaining a compound page is same, as you said.

> I really do not think hugetlbfs overcommit will provide any benefit over
> THP for your use case.

I think that what you say is absolutely right.

>  Also, new user space code is required to "fall back"
> to normal pages in the case of hugetlbfs page allocation failure.  This
> is not needed in the THP case.

I understand the superiority of THP, but there are scenes where khugepaged
occupies cpu due to page fragmentation. Instead of overcommit, setup a
persistent pool once, I think that hugetlb can be superior, such as memory
allocation performance exceeding THP. I will try to find a good way to use
hugetlb page.

I sincerely thank you for your help.

-- 
Thanks,
Tsukada

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg
From: TSUKADA Koutaro @ 2018-05-25  1:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, Johannes Weiner, Vladimir Davydov, Jonathan Corbet,
	Luis R. Rodriguez, Kees Cook, Andrew Morton, Roman Gushchin,
	David Rientjes, Aneesh Kumar K.V, Naoya Horiguchi,
	Anshuman Khandual, Marc-Andre Lureau, Punit Agrawal, Dan Williams,
	Vlastimil Babka, linux-doc, linux-kernel, linux-fsdevel, linux-mm,
	cgroups
In-Reply-To: <20180524132414.GI20441@dhcp22.suse.cz>

On 2018/05/24 22:24, Michal Hocko wrote
[...]> I do not see anything like that. adjust_pool_surplus is simply and
> accounting thing. At least the last time I've checked. Maybe your
> patchset handles that?

As you said, my patch did not consider handling when manipulating the
pool. And even if that handling is done well, it will not be a valid
reason to charge surplus hugepage to memcg.

[...]
>> Absolutely you are saying the right thing, but, for example, can mlock(2)ed
>> pages be swapped out by reclaim?(What is the difference between mlock(2)ed
>> pages and hugetlb page?)
> 
> No mlocked pages cannot be reclaimed and that is why we restrict them to
> a relatively small amount.

I understood the concept of memcg.

[...]
> Fatal? Not sure. It simply tries to add an alien memory to the memcg
> concept so I would pressume an unexpected behavior (e.g. not being able
> to reclaim memcg or, over reclaim, trashing etc.).

As you said, it must be an alien. Thanks to the interaction up to here,
I understood that my solution is inappropriate. I will look for another
way.

Thank you for your kind explanation.

-- 
Thanks,
Tsukada


--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH bpf-next v2 0/3] bpf: add boot parameters for sysctl knobs
From: Alexei Starovoitov @ 2018-05-24 23:34 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eugene Syromiatnikov, netdev, linux-kernel, linux-doc, Kees Cook,
	Kai-Heng Feng, Daniel Borkmann, Alexei Starovoitov,
	Jonathan Corbet, Jiri Olsa
In-Reply-To: <20180524094108.066d885a@redhat.com>

On Thu, May 24, 2018 at 09:41:08AM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 23 May 2018 15:02:45 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Wed, May 23, 2018 at 02:18:19PM +0200, Eugene Syromiatnikov wrote:
> > > Some BPF sysctl knobs affect the loading of BPF programs, and during
> > > system boot/init stages these sysctls are not yet configured.
> > > A concrete example is systemd, that has implemented loading of BPF
> > > programs.
> > > 
> > > Thus, to allow controlling these setting at early boot, this patch set
> > > adds the ability to change the default setting of these sysctl knobs
> > > as well as option to override them via a boot-time kernel parameter
> > > (in order to avoid rebuilding kernel each time a need of changing these
> > > defaults arises).
> > > 
> > > The sysctl knobs in question are kernel.unprivileged_bpf_disable,
> > > net.core.bpf_jit_harden, and net.core.bpf_jit_kallsyms.  
> > 
> > - systemd is root. today it only uses cgroup-bpf progs which require root,
> >   so disabling unpriv during boot time makes no difference to systemd.
> >   what is the actual reason to present time?
> > 
> > - say in the future systemd wants to use so_reuseport+bpf for faster
> >   networking. With unpriv disable during boot, it will force systemd
> >   to do such networking from root, which will lower its security barrier.
> >   How that make sense?
> > 
> > - bpf_jit_kallsyms sysctl has immediate effect on loaded programs.
> >   Flipping it during the boot or right after or any time after
> >   is the same thing. Why add such boot flag then?
> > 
> > - jit_harden can be turned on by systemd. so turning it during the boot
> >   will make systemd progs to be constant blinded.
> >   Constant blinding protects kernel from unprivileged JIT spraying.
> >   Are you worried that systemd will attack the kernel with JIT spraying?
> 
> 
> I think you are missing that, we want the ability to change these
> defaults in-order to avoid depending on /etc/sysctl.conf settings, and
> that the these sysctl.conf setting happen too late.

What does it mean 'happens too late' ?
Too late for what?
sysctl.conf has plenty of system critical knobs like
kernel.perf_event_paranoid, kernel.core_pattern, etc
The behavior of the host is drastically different after sysctl config
is applied.

> For example with jit_harden, there will be a difference between the
> loaded BPF program that got loaded at boot-time with systemd (no
> constant blinding) and when someone reloads that systemd service after
> /etc/sysctl.conf have been evaluated and setting bpf_jit_harden (now
> slower due to constant blinding).   This is inconsistent behavior.

net.core.bpf_jit_harden can be flipped back and forth at run-time,
so bpf progs before and after will be either blinded or not.
I don't see any inconsistency.
In general I think bootparams should be used only for things
like kpti=on/off that cannot be set by sysctl.

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH 0/3] Add parameter for disabling ACS redirection for P2P
From: Logan Gunthorpe @ 2018-05-24 21:48 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-doc
  Cc: Stephen Bates, Christoph Hellwig, Bjorn Helgaas, Jonathan Corbet,
	Ingo Molnar, Thomas Gleixner, Christoffer Dall, Paul E. McKenney,
	Marc Zyngier, Kai-Heng Feng, Frederic Weisbecker, Dan Williams,
	Jérôme Glisse, Benjamin Herrenschmidt, Alex Williamson,
	Christian König, Logan Gunthorpe

Hi,

As discussed in our PCI P2PDMA series, we'd like to add a kernel
parameter for selectively disabling ACS redirection for select
bridges. Seeing this turned out to be a small series in itself, we've
decided to send this separately from the P2P work.

This series generalizes the code already done for the resource_alignment
option that already exists. The first patch creates a helper function
to match PCI devices against strings based on the code that already
existed in pci_specified_resource_alignment().

The second patch expands the new helper to optionally take a path of
PCI devfns. This is to address Alex's renumbering concern when using
simple bus-devfns. The implementation is essentially how he described it and
similar to the Intel VT-d spec (Section 8.3.1).

The final patch adds the disable_acs_redir kernel parameter which takes
a list of PCI devices and will disable the ACS P2P Request Redirect,
ACS P2P Completion Redirect and ACS P2P Egress Control bits for the
selected devices. This allows P2P traffic between selected bridges and
seeing it's done at boot, before IOMMU group creating the IOMMU groups
will be created correctly based on the bits.

Thanks,

Logan


Logan Gunthorpe (3):
  PCI: Make specifying PCI devices in kernel parameters reusable
  PCI: Allow specifying devices using a base bus and path of devfns
  PCI: Introduce the disable_acs_redir parameter

 Documentation/admin-guide/kernel-parameters.txt |  39 ++-
 drivers/pci/pci.c                               | 358 ++++++++++++++++++++----
 2 files changed, 336 insertions(+), 61 deletions(-)

--
2.11.0
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH 2/3] PCI: Allow specifying devices using a base bus and path of devfns
From: Logan Gunthorpe @ 2018-05-24 21:48 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-doc
  Cc: Stephen Bates, Christoph Hellwig, Bjorn Helgaas, Jonathan Corbet,
	Ingo Molnar, Thomas Gleixner, Christoffer Dall, Paul E. McKenney,
	Marc Zyngier, Kai-Heng Feng, Frederic Weisbecker, Dan Williams,
	Jérôme Glisse, Benjamin Herrenschmidt, Alex Williamson,
	Christian König, Logan Gunthorpe
In-Reply-To: <20180524214816.14485-1-logang@deltatee.com>

When specifying PCI devices on the kernel command line using a
BDF, the bus numbers can change when adding or replacing a device,
changing motherboard firmware, or applying kernel parameters like
pci=assign-buses. When this happens, it is usually undesirable to
apply whatever command line tweak to the wrong device.

Therefore, it is useful to be able to specify devices with a base
bus number and the path of devfns needed to get to it. (Similar to
the "device scope" structure in the Intel VT-d spec, Section 8.3.1.)

Thus, we add an option to specify devices in the following format:

path:[<domain>:]<bus>:<slot>.<func>/<slot>.<func>[/ ...]

The path can be any segment within the PCI hierarchy of any length and
determined through the use of 'lspci -t'. When specified this way, it is
less likely that a renumbered bus will result in a valid device specification
and the tweak won't be applied to the wrong device.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Stephen Bates <sbates@raithlin.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  12 ++-
 drivers/pci/pci.c                               | 106 +++++++++++++++++++++++-
 2 files changed, 112 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 894aa516ceab..519ab95bb418 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2986,9 +2986,10 @@
 
 				Some options herein operate on a specific device
 				or a set of devices (<pci_dev>). These are
-				specified in one of two formats:
+				specified in one of three formats:
 
 				[<domain>:]<bus>:<slot>.<func>
+				path:[<domain>:]<bus>:<slot>.<func>/<slot>.<func>[/ ...]
 				pci:<vendor>:<device>[:<subvendor>:<subdevice>]
 
 				Note: the first format specifies a PCI
@@ -2996,9 +2997,12 @@
 				if new hardware is inserted, if motherboard
 				firmware changes, or due to changes caused
 				by other kernel parameters. The second format
-				selects devices using IDs from the
-				configuration space which may match multiple
-				devices in the system.
+				specifies a path from a device through
+				a path of multiple slot/function addresses
+				(this is more robust against renumbering
+				issues). The third format selects devices using
+				IDs from the configuration space which may match
+				multiple devices in the system.
 
 		earlydump	[X86] dump PCI config space before the kernel
 			        changes anything
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 85fec5e2640b..53ea0d7b02ce 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -184,22 +184,116 @@ EXPORT_SYMBOL_GPL(pci_ioremap_wc_bar);
 #endif
 
 /**
+ * pci_dev_str_match_path - test if a path string matches a device
+ * @dev:    the PCI device to test
+ * @p:      string to match the device against
+ * @endptr: pointer to the string after the match
+ *
+ * Test if a string (typically from a kernel parameter) formated as a
+ * path of slot/function addresses matches a PCI device. The string must
+ * be of the form:
+ *
+ *   [<domain>:]<bus>:<slot>.<func>/<slot>.<func>[/ ...]
+ *
+ * A path for a device can be obtained using 'lspci -t'. Using a path
+ * is more robust against renumbering of devices than using only
+ * a single bus, slot and function address.
+ *
+ * Returns 1 if the string matches the device, 0 if it does not and
+ * a negative error code if it fails to parse the string.
+ */
+static int pci_dev_str_match_path(struct pci_dev *dev, const char *p,
+				  const char **endptr)
+{
+	int ret;
+	int seg, bus, slot, func, count;
+	u8 *devfn_path;
+	int num_devfn = 0;
+	struct pci_dev *tmp;
+
+	ret = sscanf(p, "%x:%x:%x.%x%n", &seg, &bus, &slot,
+		     &func, &count);
+	if (ret != 4) {
+		seg = 0;
+		ret = sscanf(p, "%x:%x.%x%n", &bus, &slot,
+			     &func, &count);
+		if (ret != 3)
+			return -EINVAL;
+	}
+
+	p += count;
+
+	devfn_path = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	devfn_path[num_devfn++] = PCI_DEVFN(slot, func);
+
+	while (*p && *p != ',' && *p != ';') {
+		ret = sscanf(p, "/%x.%x%n", &slot, &func, &count);
+		if (ret != 2) {
+			ret = -EINVAL;
+			goto free_and_exit;
+		}
+
+		p += count;
+		devfn_path[num_devfn++] = PCI_DEVFN(slot, func);
+		if (num_devfn >= PAGE_SIZE) {
+			ret = -EINVAL;
+			goto free_and_exit;
+		}
+	}
+
+	*endptr = p;
+	ret = 0;
+
+	if (seg != pci_domain_nr(dev->bus))
+		goto free_and_exit;
+
+	pci_dev_get(dev);
+	while (num_devfn > 0 && dev) {
+		num_devfn--;
+
+		if (devfn_path[num_devfn] != dev->devfn)
+			goto put_and_exit;
+
+		if (num_devfn == 0 && bus == dev->bus->number) {
+			ret = 1;
+			goto put_and_exit;
+		}
+
+		tmp = pci_dev_get(pci_upstream_bridge(dev));
+		pci_dev_put(dev);
+		dev = tmp;
+	}
+
+put_and_exit:
+	pci_dev_put(dev);
+free_and_exit:
+	kfree(devfn_path);
+	return ret;
+}
+
+/**
  * pci_dev_str_match - test if a string matches a device
  * @dev:    the PCI device to test
  * @p:      string to match the device against
  * @endptr: pointer to the string after the match
  *
  * Test if a string (typically from a kernel parameter) matches a
- * specified. The string may be of one of two forms formats:
+ * specified. The string may be of one of three formats:
  *
  *   [<domain>:]<bus>:<slot>.<func>
+ *   path:[<domain>:]<bus>:<slot>.<func>/<slot>.<func>[/ ...]
  *   pci:<vendor>:<device>[:<subvendor>:<subdevice>]
  *
  * The first format specifies a PCI bus/slot/function address which
  * may change if new hardware is inserted, if motherboard firmware changes,
  * or due to changes caused in kernel parameters.
  *
- * The second format matches devices using IDs in the configuration
+ * The second format specifies a PCI bus/slot/function root address and
+ * a path of slot/function addresses to the specific device from the root.
+ * The path for a device can be determined through the use of 'lspci -t'.
+ * This format is more robust against renumbering issues than the first format.
+
+ * The third format matches devices using IDs in the configuration
  * space which may match multiple devices in the system. A value of 0
  * for any field will match all devices.
  *
@@ -236,7 +330,15 @@ static int pci_dev_str_match(struct pci_dev *dev, const char *p,
 		    (!subsystem_device ||
 			    subsystem_device == dev->subsystem_device))
 			goto found;
+	} else if (strncmp(p, "path:", 5) == 0) {
+		/* PCI Root Bus and a path of Slot,Function IDs */
+		p += 5;
 
+		ret = pci_dev_str_match_path(dev, p, &p);
+		if (ret < 0)
+			return ret;
+		else if (ret)
+			goto found;
 	} else {
 		/* PCI Bus,Slot,Function ids are specified */
 		ret = sscanf(p, "%x:%x:%x.%x%n", &seg, &bus, &slot,
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH 3/3] PCI: Introduce the disable_acs_redir parameter
From: Logan Gunthorpe @ 2018-05-24 21:48 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-doc
  Cc: Stephen Bates, Christoph Hellwig, Bjorn Helgaas, Jonathan Corbet,
	Ingo Molnar, Thomas Gleixner, Christoffer Dall, Paul E. McKenney,
	Marc Zyngier, Kai-Heng Feng, Frederic Weisbecker, Dan Williams,
	Jérôme Glisse, Benjamin Herrenschmidt, Alex Williamson,
	Christian König, Logan Gunthorpe
In-Reply-To: <20180524214816.14485-1-logang@deltatee.com>

In order to support P2P traffic on a segment of the PCI hierarchy,
we must be able to disable the ACS redirect bits for select
PCI bridges. The bridges must be selected before the devices are
discovered by the kernel and the IOMMU groups created. Therefore,
a kernel command line parameter is created to specify devices
which must have their ACS bits disabled.

The new parameter takes a list of devices separated by a semicolon.
Each device specified will have it's ACS redirect bits disabled.
This is similar to the existing 'resource_alignment' parameter and just
like it we also create a sysfs bus attribute which can be used to
read the parameter. Writing the parameter is not supported
as it would require forcibly hot plugging the affected device as
well as all devices whose IOMMU groups might change.

The ACS Request P2P Request Redirect, P2P Completion Redirect and P2P
Egress Control bits are disabled which is sufficient to always allow
passing P2P traffic uninterrupted. The bits are set after the kernel
(optionally) enables the ACS bits itself. It is also done regardless of
whether the kernel sets the bits or not seeing some BIOS firmware is known
to set the bits on boot.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Stephen Bates <sbates@raithlin.com>
---
 Documentation/admin-guide/kernel-parameters.txt |   9 +++
 drivers/pci/pci.c                               | 103 +++++++++++++++++++++++-
 2 files changed, 110 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 519ab95bb418..215285c4772d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3176,6 +3176,15 @@
 				Adding the window is slightly risky (it may
 				conflict with unreported devices), so this
 				taints the kernel.
+		disable_acs_redir=<pci_dev>[; ...]
+				Specify one or more PCI devices (in the format
+				specified above) separated by semicolons.
+				Each device specified will have the PCI ACS
+				redirect capabilities forced off which will
+				allow P2P traffic between devices through
+				bridges without forcing it upstream. Note:
+				this removes isolation between devices and
+				will make the IOMMU groups less granular.
 
 	pcie_aspm=	[PCIE] Forcibly enable or disable PCIe Active State Power
 			Management.
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 53ea0d7b02ce..3465895a55ab 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2998,6 +2998,92 @@ void pci_request_acs(void)
 	pci_acs_enable = 1;
 }
 
+#define DISABLE_ACS_REDIR_PARAM_SIZE COMMAND_LINE_SIZE
+static char disable_acs_redir_param[DISABLE_ACS_REDIR_PARAM_SIZE] = {0};
+static DEFINE_SPINLOCK(disable_acs_redir_lock);
+
+static ssize_t pci_set_disable_acs_redir_param(const char *buf, size_t count)
+{
+	if (count > DISABLE_ACS_REDIR_PARAM_SIZE - 1)
+		count = DISABLE_ACS_REDIR_PARAM_SIZE - 1;
+	spin_lock(&disable_acs_redir_lock);
+	strncpy(disable_acs_redir_param, buf, count);
+	disable_acs_redir_param[count] = '\0';
+	spin_unlock(&disable_acs_redir_lock);
+	return count;
+}
+
+static ssize_t pci_disable_acs_redir_show(struct bus_type *bus, char *buf)
+{
+	size_t count;
+
+	spin_lock(&disable_acs_redir_lock);
+	count = snprintf(buf, PAGE_SIZE, "%s\n", disable_acs_redir_param);
+	spin_unlock(&disable_acs_redir_lock);
+	return count;
+}
+
+static BUS_ATTR(disable_acs_redir, 0444, pci_disable_acs_redir_show, NULL);
+
+static int __init pci_disable_acs_redir_sysfs_init(void)
+{
+	return bus_create_file(&pci_bus_type, &bus_attr_disable_acs_redir);
+}
+late_initcall(pci_disable_acs_redir_sysfs_init);
+
+/**
+ * pci_disable_acs_redir - disable ACS redirect capabilities
+ * @dev: the PCI device
+ *
+ * For only devices specified in the disable_acs_redir parameter.
+ */
+static void pci_disable_acs_redir(struct pci_dev *dev)
+{
+	int ret = 0;
+	const char *p;
+	int pos;
+	u16 ctrl;
+
+	spin_lock(&disable_acs_redir_lock);
+
+	p = disable_acs_redir_param;
+	while (*p) {
+		ret = pci_dev_str_match(dev, p, &p);
+		if (ret < 0) {
+			pr_info_once("PCI: Can't parse disable_acs_redir parameter: %s\n",
+				     disable_acs_redir_param);
+
+			break;
+		} else if (ret == 1) {
+			/* Found a match */
+			break;
+		}
+
+		if (*p != ';' && *p != ',') {
+			/* End of param or invalid format */
+			break;
+		}
+		p++;
+	}
+	spin_unlock(&disable_acs_redir_lock);
+
+	if (ret != 1)
+		return;
+
+	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS);
+	if (!pos)
+		return;
+
+	pci_read_config_word(dev, pos + PCI_ACS_CTRL, &ctrl);
+
+	/* P2P Request & Completion Redirect */
+	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_EC);
+
+	pci_write_config_word(dev, pos + PCI_ACS_CTRL, ctrl);
+
+	pci_info(dev, "disabled ACS redirect\n");
+}
+
 /**
  * pci_std_enable_acs - enable ACS on devices using standard ACS capabilites
  * @dev: the PCI device
@@ -3037,12 +3123,22 @@ static void pci_std_enable_acs(struct pci_dev *dev)
 void pci_enable_acs(struct pci_dev *dev)
 {
 	if (!pci_acs_enable)
-		return;
+		goto disable_acs_redir;
 
 	if (!pci_dev_specific_enable_acs(dev))
-		return;
+		goto disable_acs_redir;
 
 	pci_std_enable_acs(dev);
+
+disable_acs_redir:
+	/*
+	 * Note: pci_disable_acs_redir() must be called even if
+	 * ACS is not enabled by the kernel because the firmware
+	 * may have unexpectedly set the flags. So if we are told
+	 * to disable it, we should always disable it after setting
+	 * the kernel's default preferences.
+	 */
+	pci_disable_acs_redir(dev);
 }
 
 static bool pci_acs_flags_enabled(struct pci_dev *pdev, u16 acs_flags)
@@ -5995,6 +6091,9 @@ static int __init pci_setup(char *str)
 				pcie_bus_config = PCIE_BUS_PEER2PEER;
 			} else if (!strncmp(str, "pcie_scan_all", 13)) {
 				pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
+			} else if (!strncmp(str, "disable_acs_redir=", 18)) {
+				pci_set_disable_acs_redir_param(str + 18,
+					strlen(str + 18));
 			} else {
 				printk(KERN_ERR "PCI: Unknown option `%s'\n",
 						str);
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH 1/3] PCI: Make specifying PCI devices in kernel parameters reusable
From: Logan Gunthorpe @ 2018-05-24 21:48 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-doc
  Cc: Stephen Bates, Christoph Hellwig, Bjorn Helgaas, Jonathan Corbet,
	Ingo Molnar, Thomas Gleixner, Christoffer Dall, Paul E. McKenney,
	Marc Zyngier, Kai-Heng Feng, Frederic Weisbecker, Dan Williams,
	Jérôme Glisse, Benjamin Herrenschmidt, Alex Williamson,
	Christian König, Logan Gunthorpe
In-Reply-To: <20180524214816.14485-1-logang@deltatee.com>

Separate out the code to match a PCI device with a string (typically
originating from a kernel parameter) from the
pci_specified_resource_alignment() function into its own helper
function.

While we are at it, this change fixes the kernel style of the function
(fixing a number of long lines and extra parentheses).

Additionally, make the analogous change to the kernel parameter
documentation: Separating the description of how to specify a PCI device
into it's own section at the head of the pci= parameter.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Stephen Bates <sbates@raithlin.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  26 +++-
 drivers/pci/pci.c                               | 153 +++++++++++++++---------
 2 files changed, 120 insertions(+), 59 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 11fc28ecdb6d..894aa516ceab 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2982,7 +2982,24 @@
 			See header of drivers/block/paride/pcd.c.
 			See also Documentation/blockdev/paride.txt.
 
-	pci=option[,option...]	[PCI] various PCI subsystem options:
+	pci=option[,option...]	[PCI] various PCI subsystem options.
+
+				Some options herein operate on a specific device
+				or a set of devices (<pci_dev>). These are
+				specified in one of two formats:
+
+				[<domain>:]<bus>:<slot>.<func>
+				pci:<vendor>:<device>[:<subvendor>:<subdevice>]
+
+				Note: the first format specifies a PCI
+				bus/slot/function address which may change
+				if new hardware is inserted, if motherboard
+				firmware changes, or due to changes caused
+				by other kernel parameters. The second format
+				selects devices using IDs from the
+				configuration space which may match multiple
+				devices in the system.
+
 		earlydump	[X86] dump PCI config space before the kernel
 			        changes anything
 		off		[X86] don't probe for the PCI bus
@@ -3111,11 +3128,10 @@
 				window. The default value is 64 megabytes.
 		resource_alignment=
 				Format:
-				[<order of align>@][<domain>:]<bus>:<slot>.<func>[; ...]
-				[<order of align>@]pci:<vendor>:<device>\
-						[:<subvendor>:<subdevice>][; ...]
+				[<order of align>@]<pci_dev>[; ...]
 				Specifies alignment and device to reassign
-				aligned memory resources.
+				aligned memory resources. How to
+				specify the device is described above.
 				If <order of align> is not specified,
 				PAGE_SIZE is used as alignment.
 				PCI-PCI bridge can be specified, if resource
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index dbfe7c4f3776..85fec5e2640b 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -183,6 +183,88 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar)
 EXPORT_SYMBOL_GPL(pci_ioremap_wc_bar);
 #endif
 
+/**
+ * pci_dev_str_match - test if a string matches a device
+ * @dev:    the PCI device to test
+ * @p:      string to match the device against
+ * @endptr: pointer to the string after the match
+ *
+ * Test if a string (typically from a kernel parameter) matches a
+ * specified. The string may be of one of two forms formats:
+ *
+ *   [<domain>:]<bus>:<slot>.<func>
+ *   pci:<vendor>:<device>[:<subvendor>:<subdevice>]
+ *
+ * The first format specifies a PCI bus/slot/function address which
+ * may change if new hardware is inserted, if motherboard firmware changes,
+ * or due to changes caused in kernel parameters.
+ *
+ * The second format matches devices using IDs in the configuration
+ * space which may match multiple devices in the system. A value of 0
+ * for any field will match all devices.
+ *
+ * Returns 1 if the string matches the device, 0 if it does not and
+ * a negative error code if the string cannot be parsed.
+ */
+static int pci_dev_str_match(struct pci_dev *dev, const char *p,
+			     const char **endptr)
+{
+	int ret;
+	int seg, bus, slot, func, count;
+	unsigned short vendor, device, subsystem_vendor, subsystem_device;
+
+	if (strncmp(p, "pci:", 4) == 0) {
+		/* PCI vendor/device (subvendor/subdevice) ids are specified */
+		p += 4;
+		ret = sscanf(p, "%hx:%hx:%hx:%hx%n", &vendor, &device,
+			     &subsystem_vendor, &subsystem_device, &count);
+		if (ret != 4) {
+			ret = sscanf(p, "%hx:%hx%n", &vendor, &device, &count);
+			if (ret != 2)
+				return -EINVAL;
+
+			subsystem_vendor = 0;
+			subsystem_device = 0;
+		}
+
+		p += count;
+
+		if ((!vendor || vendor == dev->vendor) &&
+		    (!device || device == dev->device) &&
+		    (!subsystem_vendor ||
+			    subsystem_vendor == dev->subsystem_vendor) &&
+		    (!subsystem_device ||
+			    subsystem_device == dev->subsystem_device))
+			goto found;
+
+	} else {
+		/* PCI Bus,Slot,Function ids are specified */
+		ret = sscanf(p, "%x:%x:%x.%x%n", &seg, &bus, &slot,
+			     &func, &count);
+		if (ret != 4) {
+			seg = 0;
+			ret = sscanf(p, "%x:%x.%x%n", &bus, &slot,
+				     &func, &count);
+			if (ret != 3)
+				return -EINVAL;
+		}
+
+		p += count;
+
+		if (seg == pci_domain_nr(dev->bus) &&
+		    bus == dev->bus->number &&
+		    slot == PCI_SLOT(dev->devfn) &&
+		    func == PCI_FUNC(dev->devfn))
+			goto found;
+	}
+
+	*endptr = p;
+	return 0;
+
+found:
+	*endptr = p;
+	return 1;
+}
 
 static int __pci_find_next_cap_ttl(struct pci_bus *bus, unsigned int devfn,
 				   u8 pos, int cap, int *ttl)
@@ -5462,10 +5544,10 @@ static DEFINE_SPINLOCK(resource_alignment_lock);
 static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
 							bool *resize)
 {
-	int seg, bus, slot, func, align_order, count;
-	unsigned short vendor, device, subsystem_vendor, subsystem_device;
+	int align_order, count;
 	resource_size_t align = pcibios_default_alignment();
-	char *p;
+	const char *p;
+	int ret;
 
 	spin_lock(&resource_alignment_lock);
 	p = resource_alignment_param;
@@ -5485,58 +5567,21 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
 		} else {
 			align_order = -1;
 		}
-		if (strncmp(p, "pci:", 4) == 0) {
-			/* PCI vendor/device (subvendor/subdevice) ids are specified */
-			p += 4;
-			if (sscanf(p, "%hx:%hx:%hx:%hx%n",
-				&vendor, &device, &subsystem_vendor, &subsystem_device, &count) != 4) {
-				if (sscanf(p, "%hx:%hx%n", &vendor, &device, &count) != 2) {
-					printk(KERN_ERR "PCI: Can't parse resource_alignment parameter: pci:%s\n",
-						p);
-					break;
-				}
-				subsystem_vendor = subsystem_device = 0;
-			}
-			p += count;
-			if ((!vendor || (vendor == dev->vendor)) &&
-				(!device || (device == dev->device)) &&
-				(!subsystem_vendor || (subsystem_vendor == dev->subsystem_vendor)) &&
-				(!subsystem_device || (subsystem_device == dev->subsystem_device))) {
-				*resize = true;
-				if (align_order == -1)
-					align = PAGE_SIZE;
-				else
-					align = 1 << align_order;
-				/* Found */
-				break;
-			}
-		}
-		else {
-			if (sscanf(p, "%x:%x:%x.%x%n",
-				&seg, &bus, &slot, &func, &count) != 4) {
-				seg = 0;
-				if (sscanf(p, "%x:%x.%x%n",
-						&bus, &slot, &func, &count) != 3) {
-					/* Invalid format */
-					printk(KERN_ERR "PCI: Can't parse resource_alignment parameter: %s\n",
-						p);
-					break;
-				}
-			}
-			p += count;
-			if (seg == pci_domain_nr(dev->bus) &&
-				bus == dev->bus->number &&
-				slot == PCI_SLOT(dev->devfn) &&
-				func == PCI_FUNC(dev->devfn)) {
-				*resize = true;
-				if (align_order == -1)
-					align = PAGE_SIZE;
-				else
-					align = 1 << align_order;
-				/* Found */
-				break;
-			}
+
+		ret = pci_dev_str_match(dev, p, &p);
+		if (ret == 1) {
+			*resize = true;
+			if (align_order == -1)
+				align = PAGE_SIZE;
+			else
+				align = 1 << align_order;
+			break;
+		} else if (ret < 0) {
+			pr_info("PCI: Can't parse resource_alignment parameter: pci:%s\n",
+				p);
+			break;
 		}
+
 		if (*p != ';' && *p != ',') {
 			/* End of param or invalid format */
 			break;
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH v8 3/6] cpuset: Add cpuset.sched.load_balance flag to v2
From: Waiman Long @ 2018-05-24 18:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli
In-Reply-To: <20180524154341.GJ12198@hirez.programming.kicks-ass.net>

On 05/24/2018 11:43 AM, Peter Zijlstra wrote:
> On Thu, May 17, 2018 at 04:55:42PM -0400, Waiman Long wrote:
>> The sched.load_balance flag is needed to enable CPU isolation similar to
>> what can be done with the "isolcpus" kernel boot parameter. Its value
>> can only be changed in a scheduling domain with no child cpusets. On
>> a non-scheduling domain cpuset, the value of sched.load_balance is
>> inherited from its parent.
>>
>> This flag is set by the parent and is not delegatable.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  Documentation/cgroup-v2.txt | 24 ++++++++++++++++++++
>>  kernel/cgroup/cpuset.c      | 53 +++++++++++++++++++++++++++++++++++++++++----
>>  2 files changed, 73 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
>> index 54d9e22..071b634d 100644
>> --- a/Documentation/cgroup-v2.txt
>> +++ b/Documentation/cgroup-v2.txt
>> @@ -1536,6 +1536,30 @@ Cpuset Interface Files
>>  	CPUs of the parent cgroup. Once it is set, this flag cannot be
>>  	cleared if there are any child cgroups with cpuset enabled.
>>  
>> +	A parent cgroup cannot distribute all its CPUs to child
>> +	scheduling domain cgroups unless its load balancing flag is
>> +	turned off.
>> +
>> +  cpuset.sched.load_balance
>> +	A read-write single value file which exists on non-root
>> +	cpuset-enabled cgroups.  It is a binary value flag that accepts
>> +	either "0" (off) or a non-zero value (on).  This flag is set
>> +	by the parent and is not delegatable.
>> +
>> +	When it is on, tasks within this cpuset will be load-balanced
>> +	by the kernel scheduler.  Tasks will be moved from CPUs with
>> +	high load to other CPUs within the same cpuset with less load
>> +	periodically.
>> +
>> +	When it is off, there will be no load balancing among CPUs on
>> +	this cgroup.  Tasks will stay in the CPUs they are running on
>> +	and will not be moved to other CPUs.
>> +
>> +	The initial value of this flag is "1".	This flag is then
>> +	inherited by child cgroups with cpuset enabled.  Its state
>> +	can only be changed on a scheduling domain cgroup with no
>> +	cpuset-enabled children.
> I'm confused... why exactly do we have both domain and load_balance ?

The domain is for partitioning the CPUs only. It doesn't change the load
balancing state. So the load_balance flag is still need to turn on and
off load balancing.

Cheers,
Longman

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v8 2/6] cpuset: Add new v2 cpuset.sched.domain flag
From: Waiman Long @ 2018-05-24 18:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli
In-Reply-To: <20180524154156.GI12198@hirez.programming.kicks-ass.net>

On 05/24/2018 11:41 AM, Peter Zijlstra wrote:
> On Thu, May 17, 2018 at 04:55:41PM -0400, Waiman Long wrote:
>> A new cpuset.sched.domain boolean flag is added to cpuset v2. This new
>> flag indicates that the CPUs in the current cpuset should be treated
>> as a separate scheduling domain.
> The traditional name for this is a partition.

Do you want to call it cpuset.sched.partition? That name sounds strange
to me.

>>                                  This new flag is owned by the parent
>> and will cause the CPUs in the cpuset to be removed from the effective
>> CPUs of its parent.
> This is a significant departure from existing behaviour, but one I can
> appreciate. I don't immediately see something terribly wrong with it.
>
>> This is implemented internally by adding a new isolated_cpus mask that
>> holds the CPUs belonging to child scheduling domain cpusets so that:
>>
>> 	isolated_cpus | effective_cpus = cpus_allowed
>> 	isolated_cpus & effective_cpus = 0
>>
>> This new flag can only be turned on in a cpuset if its parent is either
>> root or a scheduling domain itself with non-empty cpu list. The state
>> of this flag cannot be changed if the cpuset has children.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  Documentation/cgroup-v2.txt |  22 ++++
>>  kernel/cgroup/cpuset.c      | 237 +++++++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 256 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
>> index cf7bac6..54d9e22 100644
>> --- a/Documentation/cgroup-v2.txt
>> +++ b/Documentation/cgroup-v2.txt
>> @@ -1514,6 +1514,28 @@ Cpuset Interface Files
>>  	it is a subset of "cpuset.mems".  Its value will be affected
>>  	by memory nodes hotplug events.
>>  
>> +  cpuset.sched.domain
>> +	A read-write single value file which exists on non-root
>> +	cpuset-enabled cgroups.  It is a binary value flag that accepts
>> +	either "0" (off) or a non-zero value (on).
> I would be conservative and only allow 0/1.

I stated that because echoing other integer value like 2 into the flag
file won't return any error. I will modify it to say just 0 and 1.

>>                                                  This flag is set
>> +	by the parent and is not delegatable.
>> +
>> +	If set, it indicates that the CPUs in the current cgroup will
>> +	be the root of a scheduling domain.  The root cgroup is always
>> +	a scheduling domain.  There are constraints on where this flag
>> +	can be set.  It can only be set in a cgroup if all the following
>> +	conditions are true.
>> +
>> +	1) The parent cgroup is also a scheduling domain with a non-empty
>> +	   cpu list.
> Ah, so initially I was confused by the requirement for root to have it
> always set, but you'll allow child domains to steal _all_ CPUs, such
> that root ends up with an empty effective set?
>
> What about the (kernel) threads that cannot be moved out of the root
> group?

Actually, the current code won't allow you to take all the CPUs from a
scheduling domain cpuset with load balancing on. So there must be at
least 1 cpu left. You can take all away if load balancing is off.

>> +	2) The list of CPUs are exclusive, i.e. they are not shared by
>> +	   any of its siblings.
> Right.
>
>> +	3) There is no child cgroups with cpuset enabled.
>> +
>> +	Setting this flag will take the CPUs away from the effective
>> +	CPUs of the parent cgroup. Once it is set, this flag cannot be
>> +	cleared if there are any child cgroups with cpuset enabled.
> This I'm not clear on. Why?
>
That is for pragmatic reason as it is easier to code this way. We could
remove this restriction but that will make the code more complex.

Cheers,
Longman


--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg
From: Mike Kravetz @ 2018-05-24 17:45 UTC (permalink / raw)
  To: TSUKADA Koutaro, Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Jonathan Corbet,
	Luis R. Rodriguez, Kees Cook, Andrew Morton, Roman Gushchin,
	David Rientjes, Aneesh Kumar K.V, Naoya Horiguchi,
	Anshuman Khandual, Marc-Andre Lureau, Punit Agrawal, Dan Williams,
	Vlastimil Babka, linux-doc, linux-kernel, linux-fsdevel, linux-mm,
	cgroups
In-Reply-To: <af1a3050-7365-428a-dfb1-2f3da37dc9ff@ascade.co.jp>

On 05/23/2018 09:26 PM, TSUKADA Koutaro wrote:
> 
> I do not know if it is really a strong use case, but I will explain my
> motive in detail. English is not my native language, so please pardon
> my poor English.
> 
> I am one of the developers for software that managing the resource used
> from user job at HPC-Cluster with Linux. The resource is memory mainly.
> The HPC-Cluster may be shared by multiple people and used. Therefore, the
> memory used by each user must be strictly controlled, otherwise the
> user's job will runaway, not only will it hamper the other users, it will
> crash the entire system in OOM.
> 
> Some users of HPC are very nervous about performance. Jobs are executed
> while synchronizing with MPI communication using multiple compute nodes.
> Since CPU wait time will occur when synchronizing, they want to minimize
> the variation in execution time at each node to reduce waiting times as
> much as possible. We call this variation a noise.
> 
> THP does not guarantee to use the Huge Page, but may use the normal page.

Note.  You do not want to use THP because "THP does not guarantee".

> This mechanism is one cause of variation(noise).
> 
> The users who know this mechanism will be hesitant to use THP. However,
> the users also know the benefits of the Huge Page's TLB hit rate
> performance, and the Huge Page seems to be attractive. It seems natural
> that these users are interested in HugeTLBfs, I do not know at all
> whether it is the right approach or not.
> 
> At the very least, our HPC system is pursuing high versatility and we
> have to consider whether we can provide it if users want to use HugeTLBfs.
> 
> In order to use HugeTLBfs we need to create a persistent pool, but in
> our use case sharing nodes, it would be impossible to create, delete or
> resize the pool.
> 
> One of the answers I have reached is to use HugeTLBfs by overcommitting
> without creating a pool(this is the surplus hugepage).

Using hugetlbfs overcommit also does not provide a guarantee.  Without
doing much research, I would say the failure rate for obtaining a huge
page via THP and hugetlbfs overcommit is about the same.  The most
difficult issue in both cases will be obtaining a "huge page" number of
pages from the buddy allocator.

I really do not think hugetlbfs overcommit will provide any benefit over
THP for your use case.  Also, new user space code is required to "fall back"
to normal pages in the case of hugetlbfs page allocation failure.  This
is not needed in the THP case.
-- 
Mike Kravetz
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCHv5 2/8] arm64: dts: stratix10: add stratix10 service driver binding to base dtsi
From: Moritz Fischer @ 2018-05-24 17:04 UTC (permalink / raw)
  To: richard.gong
  Cc: catalin.marinas, will.deacon, dinguyen, robh+dt, mark.rutland,
	atull, mdf, arnd, gregkh, corbet, linux-arm-kernel, linux-kernel,
	devicetree, linux-fpga, linux-doc, yves.vandervennet,
	richard.gong
In-Reply-To: <1527179600-26441-3-git-send-email-richard.gong@linux.intel.com>

Hi Richard,

On Thu, May 24, 2018 at 11:33:14AM -0500, richard.gong@linux.intel.com wrote:
> From: Richard Gong <richard.gong@intel.com>
> 
> Add Intel Stratix10 service layer to the device tree
> 
> Signed-off-by: Richard Gong <richard.gong@intel.com>
> Signed-off-by: Alan Tull <atull@kernel.org>
Acked-by: Moritz Fischer <mdf@kernel.org>
> ---
> v2: Change to put service layer driver node under the firmware node
>     Change compatible to "intel, stratix10-svc"
> v3: No change
> v4: s/service driver/stratix10 service driver/ in subject line
> v5: No change
> ---
>  arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
> 
> diff --git a/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi b/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
> index d8c94d5..c257287 100644
> --- a/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
> +++ b/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
> @@ -24,6 +24,19 @@
>  	#address-cells = <2>;
>  	#size-cells = <2>;
>  
> +	reserved-memory {
> +		#address-cells = <2>;
> +		#size-cells = <2>;
> +		ranges;
> +
> +		service_reserved: svcbuffer@0 {
> +			compatible = "shared-dma-pool";
> +			reg = <0x0 0x0 0x0 0x1000000>;
> +			alignment = <0x1000>;
> +			no-map;
> +		};
> +	};
> +
>  	cpus {
>  		#address-cells = <1>;
>  		#size-cells = <0>;
> @@ -487,5 +500,13 @@
>  
>  			status = "disabled";
>  		};
> +
> +		firmware {
> +			svc {
> +				compatible = "intel,stratix10-svc";
> +				method = "smc";
> +				memory-region = <&service_reserved>;
> +			};
> +		};
>  	};
>  };
> -- 
> 2.7.4
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCHv5 1/8] dt-bindings, firmware: add Intel Stratix10 service layer binding
From: Moritz Fischer @ 2018-05-24 17:04 UTC (permalink / raw)
  To: richard.gong
  Cc: catalin.marinas, will.deacon, dinguyen, robh+dt, mark.rutland,
	atull, mdf, arnd, gregkh, corbet, linux-arm-kernel, linux-kernel,
	devicetree, linux-fpga, linux-doc, yves.vandervennet,
	richard.gong
In-Reply-To: <1527179600-26441-2-git-send-email-richard.gong@linux.intel.com>

On Thu, May 24, 2018 at 11:33:13AM -0500, richard.gong@linux.intel.com wrote:
> From: Richard Gong <richard.gong@intel.com>
> 
> Add a device tree binding for the Intel Stratix10 service layer driver
> 
> Signed-off-by: Richard Gong <richard.gong@intel.com>
> Signed-off-by: Alan Tull <atull@kernel.org>
> Reviewed-by: Rob Herring <robh@kernel.org>
Acked-by: Moritz Fischer <mdf@kernel.org>
> ---
> v2: Change to put service layer driver node under the firmware node
>     Change compatible to "intel, stratix10-svc"
> v3: No change
> v4: Add Rob's Reviewed-by
> v5: No change
> ---
>  .../bindings/firmware/intel,stratix10-svc.txt      | 57 ++++++++++++++++++++++
>  1 file changed, 57 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/firmware/intel,stratix10-svc.txt
> 
> diff --git a/Documentation/devicetree/bindings/firmware/intel,stratix10-svc.txt b/Documentation/devicetree/bindings/firmware/intel,stratix10-svc.txt
> new file mode 100644
> index 0000000..1fa6606
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/firmware/intel,stratix10-svc.txt
> @@ -0,0 +1,57 @@
> +Intel Service Layer Driver for Stratix10 SoC
> +============================================
> +Intel Stratix10 SoC is composed of a 64 bit quad-core ARM Cortex A53 hard
> +processor system (HPS) and Secure Device Manager (SDM). When the FPGA is
> +configured from HPS, there needs to be a way for HPS to notify SDM the
> +location and size of the configuration data. Then SDM will get the
> +configuration data from that location and perform the FPGA configuration.
> +
> +To meet the whole system security needs and support virtual machine requesting
> +communication with SDM, only the secure world of software (EL3, Exception
> +Layer 3) can interface with SDM. All software entities running on other
> +exception layers must channel through the EL3 software whenever it needs
> +service from SDM.
> +
> +Intel Stratix10 service layer driver, running at privileged exception level
> +(EL1, Exception Layer 1), interfaces with the service providers and provides
> +the services for FPGA configuration, QSPI, Crypto and warm reset. Service layer
> +driver also manages secure monitor call (SMC) to communicate with secure monitor
> +code running in EL3.
> +
> +Required properties:
> +-------------------
> +The svc node has the following mandatory properties, must be located under
> +the firmware node.
> +
> +- compatible: "intel,stratix10-svc"
> +- method: smc or hvc
> +        smc - Secure Monitor Call
> +        hvc - Hypervisor Call
> +- memory-region:
> +	phandle to the reserved memory node. See
> +	Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
> +	for details
> +
> +Example:
> +-------
> +
> +	reserved-memory {
> +                #address-cells = <2>;
> +                #size-cells = <2>;
> +                ranges;
> +
> +                service_reserved: svcbuffer@0 {
> +                        compatible = "shared-dma-pool";
> +                        reg = <0x0 0x0 0x0 0x1000000>;
> +                        alignment = <0x1000>;
> +                        no-map;
> +                };
> +        };
> +
> +	firmware {
> +		svc {
> +			compatible = "intel,stratix10-svc";
> +			method = "smc";
> +			memory-region = <&service_reserved>;
> +		};
> +	};
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fpga" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCHv5 1/8] dt-bindings, firmware: add Intel Stratix10 service layer binding
From: richard.gong @ 2018-05-24 16:33 UTC (permalink / raw)
  To: catalin.marinas, will.deacon, dinguyen, robh+dt, mark.rutland,
	atull, mdf, arnd, gregkh, corbet
  Cc: linux-arm-kernel, linux-kernel, devicetree, linux-fpga, linux-doc,
	yves.vandervennet, richard.gong, richard.gong
In-Reply-To: <1527179600-26441-1-git-send-email-richard.gong@linux.intel.com>

From: Richard Gong <richard.gong@intel.com>

Add a device tree binding for the Intel Stratix10 service layer driver

Signed-off-by: Richard Gong <richard.gong@intel.com>
Signed-off-by: Alan Tull <atull@kernel.org>
Reviewed-by: Rob Herring <robh@kernel.org>
---
v2: Change to put service layer driver node under the firmware node
    Change compatible to "intel, stratix10-svc"
v3: No change
v4: Add Rob's Reviewed-by
v5: No change
---
 .../bindings/firmware/intel,stratix10-svc.txt      | 57 ++++++++++++++++++++++
 1 file changed, 57 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/firmware/intel,stratix10-svc.txt

diff --git a/Documentation/devicetree/bindings/firmware/intel,stratix10-svc.txt b/Documentation/devicetree/bindings/firmware/intel,stratix10-svc.txt
new file mode 100644
index 0000000..1fa6606
--- /dev/null
+++ b/Documentation/devicetree/bindings/firmware/intel,stratix10-svc.txt
@@ -0,0 +1,57 @@
+Intel Service Layer Driver for Stratix10 SoC
+============================================
+Intel Stratix10 SoC is composed of a 64 bit quad-core ARM Cortex A53 hard
+processor system (HPS) and Secure Device Manager (SDM). When the FPGA is
+configured from HPS, there needs to be a way for HPS to notify SDM the
+location and size of the configuration data. Then SDM will get the
+configuration data from that location and perform the FPGA configuration.
+
+To meet the whole system security needs and support virtual machine requesting
+communication with SDM, only the secure world of software (EL3, Exception
+Layer 3) can interface with SDM. All software entities running on other
+exception layers must channel through the EL3 software whenever it needs
+service from SDM.
+
+Intel Stratix10 service layer driver, running at privileged exception level
+(EL1, Exception Layer 1), interfaces with the service providers and provides
+the services for FPGA configuration, QSPI, Crypto and warm reset. Service layer
+driver also manages secure monitor call (SMC) to communicate with secure monitor
+code running in EL3.
+
+Required properties:
+-------------------
+The svc node has the following mandatory properties, must be located under
+the firmware node.
+
+- compatible: "intel,stratix10-svc"
+- method: smc or hvc
+        smc - Secure Monitor Call
+        hvc - Hypervisor Call
+- memory-region:
+	phandle to the reserved memory node. See
+	Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
+	for details
+
+Example:
+-------
+
+	reserved-memory {
+                #address-cells = <2>;
+                #size-cells = <2>;
+                ranges;
+
+                service_reserved: svcbuffer@0 {
+                        compatible = "shared-dma-pool";
+                        reg = <0x0 0x0 0x0 0x1000000>;
+                        alignment = <0x1000>;
+                        no-map;
+                };
+        };
+
+	firmware {
+		svc {
+			compatible = "intel,stratix10-svc";
+			method = "smc";
+			memory-region = <&service_reserved>;
+		};
+	};
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCHv5 2/8] arm64: dts: stratix10: add stratix10 service driver binding to base dtsi
From: richard.gong @ 2018-05-24 16:33 UTC (permalink / raw)
  To: catalin.marinas, will.deacon, dinguyen, robh+dt, mark.rutland,
	atull, mdf, arnd, gregkh, corbet
  Cc: linux-arm-kernel, linux-kernel, devicetree, linux-fpga, linux-doc,
	yves.vandervennet, richard.gong, richard.gong
In-Reply-To: <1527179600-26441-1-git-send-email-richard.gong@linux.intel.com>

From: Richard Gong <richard.gong@intel.com>

Add Intel Stratix10 service layer to the device tree

Signed-off-by: Richard Gong <richard.gong@intel.com>
Signed-off-by: Alan Tull <atull@kernel.org>
---
v2: Change to put service layer driver node under the firmware node
    Change compatible to "intel, stratix10-svc"
v3: No change
v4: s/service driver/stratix10 service driver/ in subject line
v5: No change
---
 arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi b/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
index d8c94d5..c257287 100644
--- a/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
+++ b/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
@@ -24,6 +24,19 @@
 	#address-cells = <2>;
 	#size-cells = <2>;
 
+	reserved-memory {
+		#address-cells = <2>;
+		#size-cells = <2>;
+		ranges;
+
+		service_reserved: svcbuffer@0 {
+			compatible = "shared-dma-pool";
+			reg = <0x0 0x0 0x0 0x1000000>;
+			alignment = <0x1000>;
+			no-map;
+		};
+	};
+
 	cpus {
 		#address-cells = <1>;
 		#size-cells = <0>;
@@ -487,5 +500,13 @@
 
 			status = "disabled";
 		};
+
+		firmware {
+			svc {
+				compatible = "intel,stratix10-svc";
+				method = "smc";
+				memory-region = <&service_reserved>;
+			};
+		};
 	};
 };
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCHv5 4/8] dt-bindings: fpga: add Stratix10 SoC FPGA manager binding
From: richard.gong @ 2018-05-24 16:33 UTC (permalink / raw)
  To: catalin.marinas, will.deacon, dinguyen, robh+dt, mark.rutland,
	atull, mdf, arnd, gregkh, corbet
  Cc: linux-arm-kernel, linux-kernel, devicetree, linux-fpga, linux-doc,
	yves.vandervennet, richard.gong, richard.gong
In-Reply-To: <1527179600-26441-1-git-send-email-richard.gong@linux.intel.com>

From: Alan Tull <atull@kernel.org>

Add a Device Tree binding for the Intel Stratix10 SoC FPGA manager.

Signed-off-by: Alan Tull <atull@kernel.org>
Signed-off-by: Richard Gong <richard.gong@intel.com>
Reviewed-by: Rob Herring <robh@kernel.org>
---
v2: this patch is added in patch set version 2
v3: change to put fpga_mgr node under firmware/svc node
v4: s/fpga-mgr@0/fpga-mgr/ to remove unit_address
    add Richard's signed-off-by
v5: add Reviewed-by Rob Herring
---
 .../bindings/fpga/intel-stratix10-soc-fpga-mgr.txt      | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/fpga/intel-stratix10-soc-fpga-mgr.txt

diff --git a/Documentation/devicetree/bindings/fpga/intel-stratix10-soc-fpga-mgr.txt b/Documentation/devicetree/bindings/fpga/intel-stratix10-soc-fpga-mgr.txt
new file mode 100644
index 0000000..6e03f79
--- /dev/null
+++ b/Documentation/devicetree/bindings/fpga/intel-stratix10-soc-fpga-mgr.txt
@@ -0,0 +1,17 @@
+Intel Stratix10 SoC FPGA Manager
+
+Required properties:
+The fpga_mgr node has the following mandatory property, must be located under
+firmware/svc node.
+
+- compatible : should contain "intel,stratix10-soc-fpga-mgr"
+
+Example:
+
+	firmware {
+		svc {
+			fpga_mgr: fpga-mgr {
+				compatible = "intel,stratix10-soc-fpga-mgr";
+			};
+		};
+	};
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCHv5 3/8] driver, misc: add Intel Stratix10 service layer driver
From: richard.gong @ 2018-05-24 16:33 UTC (permalink / raw)
  To: catalin.marinas, will.deacon, dinguyen, robh+dt, mark.rutland,
	atull, mdf, arnd, gregkh, corbet
  Cc: linux-arm-kernel, linux-kernel, devicetree, linux-fpga, linux-doc,
	yves.vandervennet, richard.gong, richard.gong
In-Reply-To: <1527179600-26441-1-git-send-email-richard.gong@linux.intel.com>

From: Richard Gong <richard.gong@intel.com>

Some features of the Intel Stratix10 SoC require a level of privilege
higher than the kernel is granted. Such secure features include
FPGA programming. In terms of the ARMv8 architecture, the kernel runs
at Exception Level 1 (EL1), access to the features requires
Exception Level 3 (EL3).

The Intel Stratix10 SoC service layer provides an in kernel API for
drivers to request access to the secure features. The requests are queued
and processed one by one. ARM’s SMCCC is used to pass the execution
of the requests on to a secure monitor (EL3).

The header file stratix10-sve-client.h defines the interface between
service providers (FPGA manager is one of them) and service layer.

The header file stratix10-smc.h defines the secure monitor call (SMC)
message protocols used for service layer driver in normal world
(EL1) to communicate with secure monitor SW in secure monitor exception
level 3 (EL3).

Signed-off-by: Richard Gong <richard.gong@intel.com>
Signed-off-by: Alan Tull <atull@kernel.org>
---
v2: Remove intel-service subdirectory and intel-service.h, move
    intel-smc.h and intel-service.c to driver/misc subdirectory
    Correct SPDX markers
    Change service layer driver be 'default n'
    Remove global variables
    Add timeout for do..while() loop
    Add kernel-doc for the functions and structs, correct multiline comments
    Replace kfifo_in/kfifo_out with kfifo_in_spinlocked/kfifo_out_spinlocked
    rename struct intel_svc_data (at client header) to intel_svc_client_msg
    rename struct intel_svc_private_mem to intel_svc_data
    Other corrections/changes from Intel internal code reviews
v3: Change all exported functions with "intel_svc_" as the prefix
    Increase timeout values for claiming back submitted buffer(s)
    Rename struct intel_command_reconfig_payload to
    struct intel_svc_command_reconfig_payload
    Add pr_err() to provide the error return value
    Other corrections/changes
v4: s/intel/stratix10/ on some variables, structs, functions, and file names
    intel-service.c -> stratix10-svc.c
    intel-smc.h -> stratix10-smc.h
    intel-service-client.h -> stratix10-svc-client.h
    Remove non-kernel-doc formatting
v5: add a new API statix10_svc_done() which is called by service client
    when client request is completed or error occurs during request
    process. Which allows service layer to free its resources.
    remove dummy client from service layer client header and service
    layer source file.
    kernel-doc fixes
---
 drivers/misc/Kconfig                 |  12 +
 drivers/misc/Makefile                |   1 +
 drivers/misc/stratix10-smc.h         | 205 ++++++++
 drivers/misc/stratix10-svc.c         | 984 +++++++++++++++++++++++++++++++++++
 include/linux/stratix10-svc-client.h | 199 +++++++
 5 files changed, 1401 insertions(+)
 create mode 100644 drivers/misc/stratix10-smc.h
 create mode 100644 drivers/misc/stratix10-svc.c
 create mode 100644 include/linux/stratix10-svc-client.h

diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 5d71300..5d5b648 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -138,6 +138,18 @@ config INTEL_MID_PTI
 	  an Intel Atom (non-netbook) mobile device containing a MIPI
 	  P1149.7 standard implementation.
 
+config STRATIX10_SERVICE
+	tristate "Stratix10 Service Layer"
+	depends on HAVE_ARM_SMCCC
+	default n
+	help
+	 Stratix10 service layer runs at privileged exception level, interfaces with
+	 the service providers (FPGA manager is one of them) and manages secure
+	 monitor call to communicate with secure monitor software at secure monitor
+	 exception level.
+
+	 Say Y here if you want Stratix10 service layer support.
+
 config SGI_IOC4
 	tristate "SGI IOC4 Base IO support"
 	depends on PCI
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index 20be70c..99fed8b 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_AD525X_DPOT)	+= ad525x_dpot.o
 obj-$(CONFIG_AD525X_DPOT_I2C)	+= ad525x_dpot-i2c.o
 obj-$(CONFIG_AD525X_DPOT_SPI)	+= ad525x_dpot-spi.o
 obj-$(CONFIG_INTEL_MID_PTI)	+= pti.o
+obj-$(CONFIG_STRATIX10_SERVICE) += stratix10-svc.o
 obj-$(CONFIG_ATMEL_SSC)		+= atmel-ssc.o
 obj-$(CONFIG_ATMEL_TCLIB)	+= atmel_tclib.o
 obj-$(CONFIG_DUMMY_IRQ)		+= dummy-irq.o
diff --git a/drivers/misc/stratix10-smc.h b/drivers/misc/stratix10-smc.h
new file mode 100644
index 0000000..94615f4
--- /dev/null
+++ b/drivers/misc/stratix10-smc.h
@@ -0,0 +1,205 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2017-2018, Intel Corporation
+ */
+
+#ifndef __STRATIX10_SMC_H
+#define __STRATIX10_SMC_H
+
+#include <linux/arm-smccc.h>
+#include <linux/bitops.h>
+
+/**
+ * This file defines the Secure Monitor Call (SMC) message protocol used for
+ * service layer driver in normal world (EL1) to communicate with secure
+ * monitor software in Secure Monitor Exception Level 3 (EL3).
+ *
+ * This file is shared with secure firmware (FW) which is out of kernel tree.
+ *
+ * An ARM SMC instruction takes a function identifier and up to 6 64-bit
+ * register values as arguments, and can return up to 4 64-bit register
+ * value. The operation of the secure monitor is determined by the parameter
+ * values passed in through registers.
+ *
+ * EL1 and EL3 communicates pointer as physical address rather than the
+ * virtual address.
+ *
+ * Functions specified by ARM SMC Calling convention:
+ *
+ * FAST call executes atomic operations, returns when the requested operation
+ * has completed.
+ * STD call starts a operation which can be preempted by a non-secure
+ * interrupt. The call can return before the requested operation has
+ * completed.
+ *
+ * a0..a7 is used as register names in the descriptions below, on arm32
+ * that translates to r0..r7 and on arm64 to w0..w7.
+ */
+
+/**
+ * @func_num: function ID
+ */
+#define INTEL_SIP_SMC_STD_CALL_VAL(func_num) \
+	ARM_SMCCC_CALL_VAL(ARM_SMCCC_STD_CALL, ARM_SMCCC_SMC_64, \
+	ARM_SMCCC_OWNER_SIP, (func_num))
+
+#define INTEL_SIP_SMC_FAST_CALL_VAL(func_num) \
+	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, ARM_SMCCC_SMC_64, \
+	ARM_SMCCC_OWNER_SIP, (func_num))
+
+/**
+ * Return values in INTEL_SIP_SMC_* call
+ *
+ * INTEL_SIP_SMC_RETURN_UNKNOWN_FUNCTION:
+ * Secure monitor software doesn't recognize the request.
+ *
+ * INTEL_SIP_SMC_STATUS_OK:
+ * FPGA configuration completed successfully,
+ * In case of FPGA configuration write operation, it means secure monitor
+ * software can accept the next chunk of FPGA configuration data.
+ *
+ * INTEL_SIP_SMC_FPGA_CONFIG_STATUS_BUSY:
+ * In case of FPGA configuration write operation, it means secure monitor
+ * software is still processing previous data & can't accept the next chunk
+ * of data. Service driver needs to issue
+ * INTEL_SIP_SMC_FPGA_CONFIG_COMPLETED_WRITE call to query the
+ * completed block(s).
+ *
+ * INTEL_SIP_SMC_FPGA_CONFIG_STATUS_ERROR:
+ * There is error during the FPGA configuration process.
+ */
+#define INTEL_SIP_SMC_RETURN_UNKNOWN_FUNCTION		0xFFFFFFFF
+#define INTEL_SIP_SMC_STATUS_OK				0x0
+#define INTEL_SIP_SMC_FPGA_CONFIG_STATUS_BUSY		0x1
+#define INTEL_SIP_SMC_FPGA_CONFIG_STATUS_REJECTED       0x2
+#define INTEL_SIP_SMC_FPGA_CONFIG_STATUS_ERROR		0x4
+
+/**
+ * Request INTEL_SIP_SMC_FPGA_CONFIG_START
+ *
+ * Sync call used by service driver at EL1 to request the FPGA in EL3 to
+ * be prepare to receive a new configuration.
+ *
+ * Call register usage:
+ * a0: INTEL_SIP_SMC_FPGA_CONFIG_START.
+ * a1: flag for full or partial configuration. 0 for full and 1 for partial
+ * configuration.
+ * a2-7: not used.
+ *
+ * Return status:
+ * a0: INTEL_SIP_SMC_STATUS_OK, or INTEL_SIP_SMC_FPGA_CONFIG_STATUS_ERROR.
+ * a1-3: not used.
+ */
+#define INTEL_SIP_SMC_FUNCID_FPGA_CONFIG_START 1
+#define INTEL_SIP_SMC_FPGA_CONFIG_START \
+	INTEL_SIP_SMC_FAST_CALL_VAL(INTEL_SIP_SMC_FUNCID_FPGA_CONFIG_START)
+
+/**
+ * Request INTEL_SIP_SMC_FPGA_CONFIG_WRITE
+ *
+ * Async call used by service driver at EL1 to provide FPGA configuration data
+ * to secure world.
+ *
+ * Call register usage:
+ * a0: INTEL_SIP_SMC_FPGA_CONFIG_WRITE.
+ * a1: 64bit physical address of the configuration data memory block
+ * a2: Size of configuration data block.
+ * a3-7: not used.
+ *
+ * Return status:
+ * a0: INTEL_SIP_SMC_STATUS_OK, INTEL_SIP_SMC_FPGA_CONFIG_STATUS_BUSY or
+ * INTEL_SIP_SMC_FPGA_CONFIG_STATUS_ERROR.
+ * a1: 64bit physical address of 1st completed memory block if any completed
+ * block, otherwise zero value.
+ * a2: 64bit physical address of 2nd completed memory block if any completed
+ * block, otherwise zero value.
+ * a3: 64bit physical address of 3rd completed memory block if any completed
+ * block, otherwise zero value.
+ */
+#define INTEL_SIP_SMC_FUNCID_FPGA_CONFIG_WRITE 2
+#define INTEL_SIP_SMC_FPGA_CONFIG_WRITE \
+	INTEL_SIP_SMC_STD_CALL_VAL(INTEL_SIP_SMC_FUNCID_FPGA_CONFIG_WRITE)
+
+/**
+ * Request INTEL_SIP_SMC_FPGA_CONFIG_COMPLETED_WRITE
+ *
+ * Sync call used by service driver at EL1 to track the completed write
+ * transactions. This request is called after INTEL_SIP_SMC_FPGA_CONFIG_WRITE
+ * call returns INTEL_SIP_SMC_FPGA_CONFIG_STATUS_BUSY.
+ *
+ * Call register usage:
+ * a0: INTEL_SIP_SMC_FPGA_CONFIG_COMPLETED_WRITE.
+ * a1-7: not used.
+ *
+ * Return status:
+ * a0: INTEL_SIP_SMC_STATUS_OK, INTEL_SIP_SMC_FPGA_CONFIG_STATUS_BUSY or
+ * INTEL_SIP_SMC_FPGA_CONFIG_STATUS_ERROR.
+ * a1: 64bit physical address of 1st completed memory block.
+ * a2: 64bit physical address of 2nd completed memory block if
+ * any completed block, otherwise zero value.
+ * a3: 64bit physical address of 3rd completed memory block if
+ * any completed block, otherwise zero value.
+ */
+#define INTEL_SIP_SMC_FUNCID_FPGA_CONFIG_COMPLETED_WRITE 3
+#define INTEL_SIP_SMC_FPGA_CONFIG_COMPLETED_WRITE \
+INTEL_SIP_SMC_FAST_CALL_VAL(INTEL_SIP_SMC_FUNCID_FPGA_CONFIG_COMPLETED_WRITE)
+
+/**
+ * Request INTEL_SIP_SMC_FPGA_CONFIG_ISDONE
+ *
+ * Sync call used by service driver at EL1 to inform secure world that all
+ * data are sent, to check whether or not the secure world had completed
+ * the FPGA configuration process.
+ *
+ * Call register usage:
+ * a0: INTEL_SIP_SMC_FPGA_CONFIG_ISDONE.
+ * a1-7: not used.
+ *
+ * Return status:
+ * a0: INTEL_SIP_SMC_STATUS_OK, INTEL_SIP_SMC_FPGA_CONFIG_STATUS_BUSY or
+ * INTEL_SIP_SMC_FPGA_CONFIG_STATUS_ERROR.
+ * a1-3: not used.
+ */
+#define INTEL_SIP_SMC_FUNCID_FPGA_CONFIG_ISDONE 4
+#define INTEL_SIP_SMC_FPGA_CONFIG_ISDONE \
+	INTEL_SIP_SMC_FAST_CALL_VAL(INTEL_SIP_SMC_FUNCID_FPGA_CONFIG_ISDONE)
+
+/**
+ * Request INTEL_SIP_SMC_FPGA_CONFIG_GET_MEM
+ *
+ * Sync call used by service driver at EL1 to query the physical address of
+ * memory block reserved by secure monitor software.
+ *
+ * Call register usage:
+ * a0:INTEL_SIP_SMC_FPGA_CONFIG_GET_MEM.
+ * a1-7: not used.
+ *
+ * Return status:
+ * a0: INTEL_SIP_SMC_STATUS_OK or INTEL_SIP_SMC_FPGA_CONFIG_STATUS_ERROR.
+ * a1: start of physical address of reserved memory block.
+ * a2: size of reserved memory block.
+ * a3: not used.
+ */
+#define INTEL_SIP_SMC_FUNCID_FPGA_CONFIG_GET_MEM 5
+#define INTEL_SIP_SMC_FPGA_CONFIG_GET_MEM \
+	INTEL_SIP_SMC_FAST_CALL_VAL(INTEL_SIP_SMC_FUNCID_FPGA_CONFIG_GET_MEM)
+
+/**
+ * Request INTEL_SIP_SMC_FPGA_CONFIG_LOOPBACK
+ *
+ * For SMC loop-back mode only, used for internal integration, debugging
+ * or troubleshooting.
+ *
+ * Call register usage:
+ * a0: INTEL_SIP_SMC_FPGA_CONFIG_LOOPBACK.
+ * a1-7: not used.
+ *
+ * Return status:
+ * a0: INTEL_SIP_SMC_STATUS_OK or INTEL_SIP_SMC_FPGA_CONFIG_STATUS_ERROR.
+ * a1-3: not used.
+ */
+#define INTEL_SIP_SMC_FUNCID_FPGA_CONFIG_LOOPBACK 6
+#define INTEL_SIP_SMC_FPGA_CONFIG_LOOPBACK \
+	INTEL_SIP_SMC_FAST_CALL_VAL(INTEL_SIP_SMC_FUNCID_FPGA_CONFIG_LOOPBACK)
+
+#endif
diff --git a/drivers/misc/stratix10-svc.c b/drivers/misc/stratix10-svc.c
new file mode 100644
index 0000000..5d07994
--- /dev/null
+++ b/drivers/misc/stratix10-svc.c
@@ -0,0 +1,984 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2017-2018, Intel Corporation
+ */
+
+#include <linux/arm-smccc.h>
+#include <linux/completion.h>
+#include <linux/delay.h>
+#include <linux/genalloc.h>
+#include <linux/stratix10-svc-client.h>
+#include <linux/io.h>
+#include <linux/kfifo.h>
+#include <linux/kthread.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/of.h>
+#include <linux/of_platform.h>
+#include <linux/platform_device.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/types.h>
+
+#include "stratix10-smc.h"
+
+/**
+ * SVC_NUM_DATA_IN_FIFO - number of struct stratix10_svc_data in the FIFO
+ *
+ * SVC_NUM_CHANNEL - number of channel supported by service layer driver
+ *
+ * FPGA_CONFIG_DATA_CLAIM_TIMEOUT_MS - claim back the submitted buffer(s)
+ * from the secure world for FPGA manager to reuse, or to free the buffer(s)
+ * when all bit-stream data had be send.
+ *
+ * FPGA_CONFIG_STATUS_TIMEOUT_SEC - poll the FPGA configuration status,
+ * service layer will return error to FPGA manager when timeout occurs,
+ * timeout is set to 30 seconds (30 * 1000) at Intel Stratix10 SoC.
+ */
+#define SVC_NUM_DATA_IN_FIFO			32
+#define SVC_NUM_CHANNEL				1
+#define FPGA_CONFIG_DATA_CLAIM_TIMEOUT_MS	200
+#define FPGA_CONFIG_STATUS_TIMEOUT_SEC		30
+
+typedef void (svc_invoke_fn)(unsigned long, unsigned long, unsigned long,
+			     unsigned long, unsigned long, unsigned long,
+			     unsigned long, unsigned long,
+			     struct arm_smccc_res *);
+struct stratix10_svc_chan;
+
+/**
+ * struct stratix10_svc_sh_memory - service shared memory structure
+ * @sync_complete: state for a completion
+ * @addr: physical address of shared memory block
+ * @size: size of shared memory block
+ * @invoke_fn: function to issue secure monitor or hypervisor call
+ *
+ * This struct is used to save physical address and size of shared memory
+ * block. The shared memory blocked is allocated by secure monitor software
+ * at secure world.
+ *
+ * Service layer driver uses the physical address and size to create a memory
+ * pool, then allocates data buffer from that memory pool for service client.
+ */
+struct stratix10_svc_sh_memory {
+	struct completion sync_complete;
+	unsigned long addr;
+	unsigned long size;
+	svc_invoke_fn *invoke_fn;
+};
+
+/**
+ * struct stratix10_svc_data_mem - service memory structure
+ * @vaddr: virtual address
+ * @paddr: physical address
+ * @size: size of memory
+ * @node: link list head node
+ *
+ * This struct is used in a list that keeps track of buffers which have
+ * been allocated or freed from the memory pool. Service layer driver also
+ * uses this struct to transfer physical address to virtual address.
+ */
+struct stratix10_svc_data_mem {
+	void *vaddr;
+	phys_addr_t paddr;
+	size_t size;
+	struct list_head node;
+};
+
+/**
+ * struct stratix10_svc_data - service data structure
+ * @chan: service channel
+ * @paddr: playload physical address
+ * @size: playload size
+ * @command: service command requested by client
+ *
+ * This struct is used in service FIFO for inter-process communication.
+ */
+struct stratix10_svc_data {
+	struct stratix10_svc_chan *chan;
+	phys_addr_t paddr;
+	size_t size;
+	u32 command;
+};
+
+/**
+ * struct stratix10_svc_controller - service controller
+ * @dev: device
+ * @chans: array of service channels
+ * @num_chans: number of channels in 'chans' array
+ * @num_active_client: number of active service client
+ * @node: list management
+ * @genpool: memory pool pointing to the memory region
+ * @task: pointer to the thread task which handles SMC or HVC call
+ * @svc_fifo: a queue for storing service message data
+ * @complete_status: state for completion
+ * @svc_fifo_lock: protect access to service message data queue
+ * @invoke_fn: function to issue secure monitor call or hypervisor call
+ *
+ * This struct is used to create communication channels for service clients, to
+ * handle secure monitor or hypervisor call.
+ */
+struct stratix10_svc_controller {
+	struct device *dev;
+	struct stratix10_svc_chan *chans;
+	int num_chans;
+	int num_active_client;
+	struct list_head node;
+	struct gen_pool *genpool;
+	struct task_struct *task;
+	struct kfifo svc_fifo;
+	struct completion complete_status;
+	spinlock_t svc_fifo_lock;
+	svc_invoke_fn *invoke_fn;
+};
+
+/**
+ * struct stratix10_svc_chan - service communication channel
+ * @ctrl: pointer to service controller which is the provider of this channel
+ * @scl: pointer to service client which owns the channel
+ * @name: service client name associated with the channel
+ * @lock: protect access to the channel
+ *
+ * This struct is used by service client to communicate with service layer, each
+ * service client has its own channel created by service controller.
+ */
+struct stratix10_svc_chan {
+	struct stratix10_svc_controller *ctrl;
+	struct stratix10_svc_client *scl;
+	char *name;
+	spinlock_t lock;
+};
+
+static LIST_HEAD(svc_ctrl);
+static LIST_HEAD(svc_data_mem);
+
+/**
+ * svc_pa_to_va() - translate physical address to virtual address
+ * @addr: to be translated physical address
+ *
+ * Return: valid virtual address or NULL if the provided physical
+ * address doesn't exist.
+ */
+static void *svc_pa_to_va(unsigned long addr)
+{
+	struct stratix10_svc_data_mem *pmem;
+
+	pr_debug("claim back P-addr=0x%016x\n", (unsigned int)addr);
+	list_for_each_entry(pmem, &svc_data_mem, node) {
+		if (pmem->paddr == addr)
+			return pmem->vaddr;
+	}
+
+	/* physical address is not found */
+	return NULL;
+}
+
+/**
+ * svc_thread_cmd_data_claim() - claim back buffer from the secure world
+ * @ctrl: pointer to service layer controller
+ * @p_data: pointer to service data structure
+ * @cb_data: pointer to callback data structure to service client
+ *
+ * Claim back the submitted buffers from the secure world and pass buffer
+ * back to service client (FPGA manager, etc) for reuse.
+ */
+static void svc_thread_cmd_data_claim(struct stratix10_svc_controller *ctrl,
+				      struct stratix10_svc_data *p_data,
+				      struct stratix10_svc_cb_data *cb_data)
+{
+	struct arm_smccc_res res;
+	unsigned long timeout;
+
+	reinit_completion(&ctrl->complete_status);
+	timeout = msecs_to_jiffies(FPGA_CONFIG_DATA_CLAIM_TIMEOUT_MS);
+
+	pr_debug("%s: claim back the submitted buffer\n", __func__);
+	do {
+		ctrl->invoke_fn(INTEL_SIP_SMC_FPGA_CONFIG_COMPLETED_WRITE,
+				0, 0, 0, 0, 0, 0, 0, &res);
+
+		if (res.a0 == INTEL_SIP_SMC_STATUS_OK) {
+			if (!res.a1) {
+				complete(&ctrl->complete_status);
+				break;
+			}
+			cb_data->status = BIT(SVC_STATUS_RECONFIG_BUFFER_DONE);
+			cb_data->kaddr1 = svc_pa_to_va(res.a1);
+			cb_data->kaddr2 = (res.a2) ?
+					  svc_pa_to_va(res.a2) : NULL;
+			cb_data->kaddr3 = (res.a3) ?
+					  svc_pa_to_va(res.a3) : NULL;
+			p_data->chan->scl->receive_cb(p_data->chan->scl,
+						      cb_data);
+		} else {
+			pr_debug("%s: secure world busy, polling again\n",
+				 __func__);
+		}
+	} while (res.a0 == INTEL_SIP_SMC_STATUS_OK ||
+		 res.a0 == INTEL_SIP_SMC_FPGA_CONFIG_STATUS_BUSY ||
+		 wait_for_completion_timeout(&ctrl->complete_status, timeout));
+}
+
+/**
+ * svc_thread_cmd_config_status() - check configuration status
+ * @ctrl: pointer to service layer controller
+ * @p_data: pointer to service data structure
+ * @cb_data: pointer to callback data structure to service client
+ *
+ * Check whether the secure firmware at secure world has finished the FPGA
+ * configuration, and then inform FPGA manager the configuration status.
+ */
+static void svc_thread_cmd_config_status(struct stratix10_svc_controller *ctrl,
+					 struct stratix10_svc_data *p_data,
+					 struct stratix10_svc_cb_data *cb_data)
+{
+	struct arm_smccc_res res;
+	int count_in_sec;
+
+	cb_data->kaddr1 = NULL;
+	cb_data->kaddr2 = NULL;
+	cb_data->kaddr3 = NULL;
+	cb_data->status = BIT(SVC_STATUS_RECONFIG_ERROR);
+
+	pr_debug("%s: polling config status\n", __func__);
+
+	count_in_sec = FPGA_CONFIG_STATUS_TIMEOUT_SEC;
+	while (count_in_sec) {
+		ctrl->invoke_fn(INTEL_SIP_SMC_FPGA_CONFIG_ISDONE,
+				0, 0, 0, 0, 0, 0, 0, &res);
+		if ((res.a0 == INTEL_SIP_SMC_STATUS_OK) ||
+		    (res.a0 == INTEL_SIP_SMC_FPGA_CONFIG_STATUS_ERROR))
+			break;
+
+		/*
+		 * configuration is still in progress, wait one second then
+		 * poll again
+		 */
+		msleep(1000);
+		count_in_sec--;
+	};
+
+	if (res.a0 == INTEL_SIP_SMC_STATUS_OK && count_in_sec)
+		cb_data->status = BIT(SVC_STATUS_RECONFIG_COMPLETED);
+
+	p_data->chan->scl->receive_cb(p_data->chan->scl, cb_data);
+}
+
+/**
+ * svc_thread_recv_status_ok() - handle the successful status
+ * @p_data: pointer to service data structure
+ * @cb_data: pointer to callback data structure to service client
+ * @res: result from SMC or HVC call
+ *
+ * Send back the correspond status to the service client (FPGA manager etc).
+ */
+static void svc_thread_recv_status_ok(struct stratix10_svc_data *p_data,
+				      struct stratix10_svc_cb_data *cb_data,
+				      struct arm_smccc_res res)
+{
+	cb_data->kaddr1 = NULL;
+	cb_data->kaddr2 = NULL;
+	cb_data->kaddr3 = NULL;
+
+	switch (p_data->command) {
+	case COMMAND_RECONFIG:
+		cb_data->status = BIT(SVC_STATUS_RECONFIG_REQUEST_OK);
+		break;
+	case COMMAND_RECONFIG_DATA_SUBMIT:
+		cb_data->status = BIT(SVC_STATUS_RECONFIG_BUFFER_SUBMITTED);
+		break;
+	case COMMAND_NOOP:
+		cb_data->status = BIT(SVC_STATUS_RECONFIG_BUFFER_SUBMITTED);
+		cb_data->kaddr1 = svc_pa_to_va(res.a1);
+		break;
+	case COMMAND_RECONFIG_STATUS:
+		cb_data->status = BIT(SVC_STATUS_RECONFIG_COMPLETED);
+		break;
+	default:
+		break;
+	}
+
+	pr_debug("%s: call receive_cb\n", __func__);
+	p_data->chan->scl->receive_cb(p_data->chan->scl, cb_data);
+}
+
+/**
+ * svc_normal_to_secure_thread() - the function to run in the kthread
+ * @data: data pointer for kthread function
+ *
+ * Service layer driver creates stratix10_svc_smc_hvc_call kthread on CPU
+ * node 0, its function stratix10_svc_secure_call_thread is used to handle
+ * SMC or HVC calls between kernel driver and secure monitor software.
+ *
+ * Return: 0 for success or -ENOMEM on error.
+ */
+static int svc_normal_to_secure_thread(void *data)
+{
+	struct stratix10_svc_controller
+			*ctrl = (struct stratix10_svc_controller *)data;
+	struct stratix10_svc_data *pdata;
+	struct stratix10_svc_cb_data *cbdata;
+	struct arm_smccc_res res;
+	unsigned long a0, a1, a2;
+	int ret_fifo = 0;
+
+	pdata =  kmalloc(sizeof(*pdata), GFP_KERNEL);
+	if (!pdata)
+		return -ENOMEM;
+
+	cbdata = kmalloc(sizeof(*cbdata), GFP_KERNEL);
+	if (!cbdata)
+		return -ENOMEM;
+
+	/* default set, to remove build warning */
+	a0 = INTEL_SIP_SMC_FPGA_CONFIG_LOOPBACK;
+	a1 = 0;
+	a2 = 0;
+
+	pr_debug("smc_hvc_shm_thread is running\n");
+
+	while (!kthread_should_stop()) {
+		ret_fifo = kfifo_out_spinlocked(&ctrl->svc_fifo,
+						pdata, sizeof(*pdata),
+						&ctrl->svc_fifo_lock);
+
+		if (!ret_fifo)
+			continue;
+
+		pr_debug("get from FIFO pa=0x%016x, command=%u, size=%u\n",
+			 (unsigned int)pdata->paddr, pdata->command,
+			 (unsigned int)pdata->size);
+
+		switch (pdata->command) {
+		case COMMAND_RECONFIG_DATA_CLAIM:
+			svc_thread_cmd_data_claim(ctrl, pdata, cbdata);
+			continue;
+		case COMMAND_RECONFIG:
+			a0 = INTEL_SIP_SMC_FPGA_CONFIG_START;
+			a1 = 0;
+			a2 = 0;
+			break;
+		case COMMAND_RECONFIG_DATA_SUBMIT:
+			a0 = INTEL_SIP_SMC_FPGA_CONFIG_WRITE;
+			a1 = (unsigned long)pdata->paddr;
+			a2 = (unsigned long)pdata->size;
+			break;
+		case COMMAND_RECONFIG_STATUS:
+			a0 = INTEL_SIP_SMC_FPGA_CONFIG_ISDONE;
+			a1 = 0;
+			a2 = 0;
+			break;
+		default:
+			/* it shouldn't happen */
+			break;
+		}
+		pr_debug("%s: before SMC call -- a0=0x%016x a1=0x%016x",
+			 __func__, (unsigned int)a0, (unsigned int)a1);
+		pr_debug(" a2=0x%016x\n", (unsigned int)a2);
+
+		ctrl->invoke_fn(a0, a1, a2, 0, 0, 0, 0, 0, &res);
+
+		pr_debug("%s: after SMC call -- res.a0=0x%016x",
+			 __func__, (unsigned int)res.a0);
+		pr_debug(" res.a1=0x%016x, res.a2=0x%016x",
+			 (unsigned int)res.a1, (unsigned int)res.a2);
+		pr_debug(" res.a3=0x%016x\n", (unsigned int)res.a3);
+
+		switch (res.a0) {
+		case INTEL_SIP_SMC_STATUS_OK:
+			svc_thread_recv_status_ok(pdata, cbdata, res);
+			break;
+		case INTEL_SIP_SMC_FPGA_CONFIG_STATUS_BUSY:
+			switch (pdata->command) {
+			case COMMAND_RECONFIG_DATA_SUBMIT:
+				svc_thread_cmd_data_claim(ctrl,
+							  pdata, cbdata);
+				break;
+			case COMMAND_RECONFIG_STATUS:
+				svc_thread_cmd_config_status(ctrl,
+							     pdata, cbdata);
+				break;
+			default:
+				break;
+			}
+			break;
+		case INTEL_SIP_SMC_FPGA_CONFIG_STATUS_REJECTED:
+			pr_debug("%s: STATUS_REJECTED\n", __func__);
+			break;
+		case INTEL_SIP_SMC_FPGA_CONFIG_STATUS_ERROR:
+			pr_err("%s: STATUS_ERROR\n", __func__);
+			cbdata->status = BIT(SVC_STATUS_RECONFIG_ERROR);
+			cbdata->kaddr1 = NULL;
+			cbdata->kaddr2 = NULL;
+			cbdata->kaddr3 = NULL;
+			pdata->chan->scl->receive_cb(pdata->chan->scl, cbdata);
+			break;
+		default:
+			break;
+		}
+	};
+
+	kfree(cbdata);
+	kfree(pdata);
+
+	return 0;
+}
+
+/**
+ * svc_normal_to_secure_shm_thread() - the function to run in the kthread
+ * @data: data pointer for kthread function
+ *
+ * Service layer driver creates stratix10_svc_smc_hvc_shm kthread on CPU
+ * node 0, its function stratix10_svc_secure_shm_thread is used to query the
+ * physical address of memory block reserved by secure monitor software at
+ * secure world.
+ *
+ * svc_normal_to_secure_shm_thread() calls do_exit() directly since it is a
+ * standlone thread for which no one will call kthread_stop() or return when
+ * 'kthread_should_stop()' is true.
+ */
+static int svc_normal_to_secure_shm_thread(void *data)
+{
+	struct stratix10_svc_sh_memory
+			*sh_mem = (struct stratix10_svc_sh_memory *)data;
+	struct arm_smccc_res res;
+
+	/* SMC or HVC call to get shared memory info from secure world */
+	sh_mem->invoke_fn(INTEL_SIP_SMC_FPGA_CONFIG_GET_MEM,
+			  0, 0, 0, 0, 0, 0, 0, &res);
+	if (res.a0 == INTEL_SIP_SMC_STATUS_OK) {
+		sh_mem->addr = res.a1;
+		sh_mem->size = res.a2;
+	} else {
+		pr_err("%s: after SMC call -- res.a0=0x%016x",  __func__,
+		       (unsigned int)res.a0);
+		sh_mem->addr = 0;
+		sh_mem->size = 0;
+	}
+
+	complete(&sh_mem->sync_complete);
+	do_exit(0);
+}
+
+/**
+ * svc_get_sh_memory() - get memory block reserved by secure monitor SW
+ * @pdev: pointer to service layer device
+ * @sh_memory: pointer to service shared memory structure
+ *
+ * Return: zero for successfully getting the physical address of memory block
+ * reserved by secure monitor software, or negative value on error.
+ */
+static int svc_get_sh_memory(struct platform_device *pdev,
+				    struct stratix10_svc_sh_memory *sh_memory)
+{
+	struct device *dev = &pdev->dev;
+	struct task_struct *sh_memory_task;
+
+	init_completion(&sh_memory->sync_complete);
+
+	/* smc or hvc call happens on cpu 0 bound kthread */
+	sh_memory_task = kthread_create_on_cpu(svc_normal_to_secure_shm_thread,
+					       (void *)sh_memory,
+						0, "svc_smc_hvc_shm_thread");
+	if (IS_ERR(sh_memory_task)) {
+		dev_err(dev, "fail to create stratix10_svc_smc_shm_thread\n");
+		return -EINVAL;
+	}
+
+	wake_up_process(sh_memory_task);
+
+	if (!wait_for_completion_timeout(&sh_memory->sync_complete, 10 * HZ)) {
+		dev_err(dev,
+			"timeout to get sh-memory paras from secure world\n");
+		return -ETIMEDOUT;
+	}
+
+	if (!sh_memory->addr || !sh_memory->size) {
+		dev_err(dev,
+			"fails to get shared memory info from secure world\n");
+		return -ENOMEM;
+	}
+
+	dev_dbg(dev, "SM software provides paddr: 0x%016x, size: 0x%08x\n",
+		(unsigned int)sh_memory->addr,
+		(unsigned int)sh_memory->size);
+
+	return 0;
+}
+
+/**
+ * svc_create_memory_pool() - create a memory pool from reserved memory block
+ * @pdev: pointer to service layer device
+ * @sh_memory: pointer to service shared memory structure
+ *
+ * Return: pool allocated from reserved memory block or ERR_PTR() on error.
+ */
+static struct gen_pool *
+svc_create_memory_pool(struct platform_device *pdev,
+		       struct stratix10_svc_sh_memory *sh_memory)
+{
+	struct device *dev = &pdev->dev;
+	struct gen_pool *genpool;
+	unsigned long vaddr;
+	phys_addr_t paddr;
+	size_t size;
+	phys_addr_t begin;
+	phys_addr_t end;
+	void *va;
+	size_t page_mask = PAGE_SIZE - 1;
+	int min_alloc_order = 3;
+	int ret;
+
+	begin = roundup(sh_memory->addr, PAGE_SIZE);
+	end = rounddown(sh_memory->addr + sh_memory->size, PAGE_SIZE);
+	paddr = begin;
+	size = end - begin;
+	va = memremap(paddr, size, MEMREMAP_WC);
+	if (!va) {
+		dev_err(dev, "fail to remap shared memory\n");
+		return ERR_PTR(-EINVAL);
+	}
+	vaddr = (unsigned long)va;
+	dev_dbg(dev,
+		"reserved memory vaddr: %p, paddr: 0x%16x size: 0x%8x\n",
+		va, (unsigned int)paddr, (unsigned int)size);
+	if ((vaddr & page_mask) || (paddr & page_mask) ||
+	    (size & page_mask)) {
+		dev_err(dev, "page is not aligned\n");
+		return ERR_PTR(-EINVAL);
+	}
+	genpool = gen_pool_create(min_alloc_order, -1);
+	if (!genpool) {
+		dev_err(dev, "fail to create genpool\n");
+		return ERR_PTR(-ENOMEM);
+	}
+	gen_pool_set_algo(genpool, gen_pool_best_fit, NULL);
+	ret = gen_pool_add_virt(genpool, vaddr, paddr, size, -1);
+	if (ret) {
+		dev_err(dev, "fail to add memory chunk to the pool\n");
+		gen_pool_destroy(genpool);
+		return ERR_PTR(ret);
+	}
+
+	return genpool;
+}
+
+/**
+ * svc_smccc_smc() - secure monitor call between normal and secure world
+ * @a0: argument passed in registers 0
+ * @a1: argument passed in registers 1
+ * @a2: argument passed in registers 2
+ * @a3: argument passed in registers 3
+ * @a4: argument passed in registers 4
+ * @a5: argument passed in registers 5
+ * @a6: argument passed in registers 6
+ * @a7: argument passed in registers 7
+ * @res: result values from register 0 to 3
+ */
+static void svc_smccc_smc(unsigned long a0, unsigned long a1,
+			  unsigned long a2, unsigned long a3,
+			  unsigned long a4, unsigned long a5,
+			  unsigned long a6, unsigned long a7,
+			  struct arm_smccc_res *res)
+{
+	arm_smccc_smc(a0, a1, a2, a3, a4, a5, a6, a7, res);
+}
+
+/**
+ * svc_smccc_hvc() - hypervisor call between normal and secure world
+ * @a0: argument passed in registers 0
+ * @a1: argument passed in registers 1
+ * @a2: argument passed in registers 2
+ * @a3: argument passed in registers 3
+ * @a4: argument passed in registers 4
+ * @a5: argument passed in registers 5
+ * @a6: argument passed in registers 6
+ * @a7: argument passed in registers 7
+ * @res: result values from register 0 to 3
+ */
+static void svc_smccc_hvc(unsigned long a0, unsigned long a1,
+			  unsigned long a2, unsigned long a3,
+			  unsigned long a4, unsigned long a5,
+			  unsigned long a6, unsigned long a7,
+			  struct arm_smccc_res *res)
+{
+	arm_smccc_hvc(a0, a1, a2, a3, a4, a5, a6, a7, res);
+}
+
+/**
+ * get_invoke_func() - invoke SMC or HVC call
+ * @dev: pointer to device
+ *
+ * Return: function pointer to svc_smccc_smc or svc_smccc_hvc.
+ */
+static svc_invoke_fn *get_invoke_func(struct device *dev)
+{
+	const char *method;
+
+	if (of_property_read_string(dev->of_node, "method", &method)) {
+		dev_warn(dev, "missing \"method\" property\n");
+		return ERR_PTR(-ENXIO);
+	}
+
+	if (!strcmp(method, "smc"))
+		return svc_smccc_smc;
+	if (!strcmp(method, "hvc"))
+		return svc_smccc_hvc;
+
+	dev_warn(dev, "invalid \"method\" property: %s\n", method);
+
+	return ERR_PTR(-EINVAL);
+}
+
+/**
+ * stratix10_svc_request_channel_byname() - request a service channel
+ * @client: pointer to service client
+ * @name: service client name
+ *
+ * This function is used by service client to request a service channel.
+ *
+ * Return: a pointer to channel assigned to the client on success,
+ * or ERR_PTR() on error.
+ */
+struct stratix10_svc_chan *stratix10_svc_request_channel_byname(
+	struct stratix10_svc_client *client, const char *name)
+{
+	struct device *dev = client->dev;
+	struct stratix10_svc_controller *controller;
+	struct stratix10_svc_chan *chan;
+	unsigned long flag;
+	int i;
+
+	chan = ERR_PTR(-EPROBE_DEFER);
+	if (list_empty(&svc_ctrl))
+		return ERR_PTR(-ENODEV);
+
+	controller = list_first_entry(&svc_ctrl,
+				      struct stratix10_svc_controller, node);
+	for (i = 0; i < SVC_NUM_CHANNEL; i++) {
+		if (!strcmp(controller->chans[i].name, name)) {
+			chan = &controller->chans[i];
+			break;
+		}
+	}
+
+	if (chan->scl || !try_module_get(controller->dev->driver->owner)) {
+		dev_dbg(dev, "%s: svc not free\n", __func__);
+		return ERR_PTR(-EBUSY);
+	}
+
+	spin_lock_irqsave(&chan->lock, flag);
+	chan->scl = client;
+	chan->ctrl->num_active_client++;
+	spin_unlock_irqrestore(&chan->lock, flag);
+
+	return chan;
+}
+EXPORT_SYMBOL_GPL(stratix10_svc_request_channel_byname);
+
+/**
+ * stratix10_svc_free_channel() - free service channel
+ * @chan: service channel to be freed
+ *
+ * This function is used by service client to free a service channel.
+ */
+void stratix10_svc_free_channel(struct stratix10_svc_chan *chan)
+{
+	unsigned long flag;
+
+	spin_lock_irqsave(&chan->lock, flag);
+	chan->scl = NULL;
+	chan->ctrl->num_active_client--;
+	module_put(chan->ctrl->dev->driver->owner);
+	spin_unlock_irqrestore(&chan->lock, flag);
+}
+EXPORT_SYMBOL_GPL(stratix10_svc_free_channel);
+
+/**
+ * stratix10_svc_send() - send a message data to the remote
+ * @chan: service channel assigned to the client
+ * @msg: message data to be sent, in the format of
+ * "struct stratix10_svc_client_msg"
+ *
+ * This function is used by service client to add a message to the service
+ * layer driver's queue for being sent to the secure world.
+ *
+ * Return: 0 for success, -ENOMEM or -ENOBUFS on error.
+ */
+int stratix10_svc_send(struct stratix10_svc_chan *chan, void *msg)
+{
+	struct stratix10_svc_client_msg
+		*p_msg = (struct stratix10_svc_client_msg *)msg;
+	struct stratix10_svc_data_mem *p_mem;
+	struct stratix10_svc_data *p_data;
+	int ret = 0;
+
+	p_data = kmalloc(sizeof(*p_data), GFP_KERNEL);
+	if (!p_data)
+		return -ENOMEM;
+
+	/* first client will create kernel thread */
+	if (!chan->ctrl->task) {
+		chan->ctrl->task =
+			kthread_create_on_cpu(svc_normal_to_secure_thread,
+					      (void *)chan->ctrl, 0,
+					      "svc_smc_hvc_thread");
+			if (IS_ERR(chan->ctrl->task)) {
+				dev_err(chan->ctrl->dev,
+					"fails to create svc_smc_hvc_thread\n");
+				return -EINVAL;
+			}
+		wake_up_process(chan->ctrl->task);
+	}
+
+	pr_debug("%s: sent P-va=%p, P-com=%x, P-size=%u\n", __func__,
+		 p_msg->payload, p_msg->command,
+		 (unsigned int)p_msg->payload_length);
+
+	p_data->paddr = 0;
+	list_for_each_entry(p_mem, &svc_data_mem, node) {
+		if (p_mem->vaddr == p_msg->payload) {
+			p_data->paddr = p_mem->paddr;
+			break;
+		}
+	}
+
+	p_data->command = p_msg->command;
+	p_data->size = p_msg->payload_length;
+	p_data->chan = chan;
+	pr_debug("%s: put to FIFO pa=0x%016x, cmd=%x, size=%u\n", __func__,
+	       (unsigned int)p_data->paddr, p_data->command,
+	       (unsigned int)p_data->size);
+	ret = kfifo_in_spinlocked(&chan->ctrl->svc_fifo, p_data,
+				  sizeof(*p_data),
+				  &chan->ctrl->svc_fifo_lock);
+
+	kfree(p_data);
+
+	if (!ret)
+		return -ENOBUFS;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(stratix10_svc_send);
+
+/**
+ * stratix10_svc_done() - complete service request transactions
+ * @chan: service channel assigned to the client
+ *
+ * This function should be called when client has finished its request
+ * or there is an error in the request process. It allows the service layer
+ * to stop the running thread to have maximize savings in kernel resources.
+ */
+void stratix10_svc_done(struct stratix10_svc_chan *chan)
+{
+	/* stop thread when thread is running AND only one active client */
+	if (chan->ctrl->task && chan->ctrl->num_active_client <= 1) {
+		pr_debug("svc_smc_hvc_shm_thread is stopped\n");
+		kthread_stop(chan->ctrl->task);
+		chan->ctrl->task = NULL;
+	}
+}
+EXPORT_SYMBOL_GPL(stratix10_svc_done);
+
+/**
+ * stratix10_svc_allocate_memory() - allocate memory
+ * @chan: service channel assigned to the client
+ * @size: memory size requested by a specific service client
+ *
+ * Service layer allocates the requested number of bytes buffer from the
+ * memory pool, service client uses this function to get allocated buffers.
+ *
+ * Return: address of allocated memory on success, or ERR_PTR() on error.
+ */
+void *stratix10_svc_allocate_memory(struct stratix10_svc_chan *chan,
+				    size_t size)
+{
+	struct stratix10_svc_data_mem *pmem;
+	unsigned long va;
+	phys_addr_t pa;
+	struct gen_pool *genpool = chan->ctrl->genpool;
+	size_t s = roundup(size, 1 << genpool->min_alloc_order);
+
+	pmem = devm_kzalloc(chan->ctrl->dev, sizeof(*pmem), GFP_KERNEL);
+	if (!pmem)
+		return ERR_PTR(-ENOMEM);
+
+	va = gen_pool_alloc(genpool, s);
+	if (!va)
+		return ERR_PTR(-ENOMEM);
+
+	memset((void *)va, 0, s);
+	pa = gen_pool_virt_to_phys(genpool, va);
+
+	pmem->vaddr = (void *)va;
+	pmem->paddr = pa;
+	pmem->size = s;
+	list_add_tail(&pmem->node, &svc_data_mem);
+	pr_debug("%s: va=%p, pa=0x%016x\n", __func__,
+		 pmem->vaddr, (unsigned int)pmem->paddr);
+
+	return (void *)va;
+}
+EXPORT_SYMBOL_GPL(stratix10_svc_allocate_memory);
+
+/**
+ * stratix10_svc_free_memory() - free allocated memory
+ * @chan: service channel assigned to the client
+ * @kaddr: memory to be freed
+ *
+ * This function is used by service client to free allocated buffers.
+ */
+void stratix10_svc_free_memory(struct stratix10_svc_chan *chan, void *kaddr)
+{
+	struct stratix10_svc_data_mem *pmem;
+	size_t size = 0;
+
+	list_for_each_entry(pmem, &svc_data_mem, node)
+		if (pmem->vaddr == kaddr) {
+			size = pmem->size;
+			break;
+		}
+
+	gen_pool_free(chan->ctrl->genpool, (unsigned long)kaddr, size);
+	pmem->vaddr = NULL;
+	list_del(&pmem->node);
+}
+EXPORT_SYMBOL_GPL(stratix10_svc_free_memory);
+
+static const struct of_device_id stratix10_svc_drv_match[] = {
+	{.compatible = "intel,stratix10-svc"},
+	{},
+};
+
+static int stratix10_svc_drv_probe(struct platform_device *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct stratix10_svc_controller *controller;
+	struct stratix10_svc_chan *chans;
+	struct gen_pool *genpool;
+	struct stratix10_svc_sh_memory *sh_memory;
+	svc_invoke_fn *invoke_fn;
+	size_t fifo_size;
+	int ret;
+
+	/* get SMC or HVC function */
+	invoke_fn = get_invoke_func(dev);
+	if (IS_ERR(invoke_fn))
+		return -EINVAL;
+
+	sh_memory = devm_kzalloc(dev, sizeof(*sh_memory), GFP_KERNEL);
+	if (!sh_memory)
+		return -ENOMEM;
+
+	sh_memory->invoke_fn = invoke_fn;
+	ret = svc_get_sh_memory(pdev, sh_memory);
+	if (ret)
+		return ret;
+
+	genpool = svc_create_memory_pool(pdev, sh_memory);
+	if (!genpool)
+		return -ENOMEM;
+
+	/* allocate service controller and supporting channel */
+	controller = devm_kzalloc(dev, sizeof(*controller), GFP_KERNEL);
+	if (!controller)
+		return -ENOMEM;
+
+	chans = devm_kmalloc_array(dev, SVC_NUM_CHANNEL,
+				   sizeof(*chans), GFP_KERNEL | __GFP_ZERO);
+	if (!chans)
+		return -ENOMEM;
+
+	controller->dev = dev;
+	controller->num_chans = SVC_NUM_CHANNEL;
+	controller->num_active_client = 0;
+	controller->chans = chans;
+	controller->genpool = genpool;
+	controller->task = NULL;
+	controller->invoke_fn = invoke_fn;
+	init_completion(&controller->complete_status);
+
+	fifo_size = sizeof(struct stratix10_svc_data) * SVC_NUM_DATA_IN_FIFO;
+	ret = kfifo_alloc(&controller->svc_fifo, fifo_size, GFP_KERNEL);
+	if (ret) {
+		dev_err(dev, "fails to allocate FIFO\n");
+		return ret;
+	}
+	spin_lock_init(&controller->svc_fifo_lock);
+
+	chans[0].scl = NULL;
+	chans[0].ctrl = controller;
+	chans[0].name = SVC_CLIENT_FPGA;
+	spin_lock_init(&chans[0].lock);
+
+	list_add_tail(&controller->node, &svc_ctrl);
+	platform_set_drvdata(pdev, controller);
+
+	pr_info("Intel Service Layer Driver Initialized\n");
+
+	return ret;
+}
+
+static int stratix10_svc_drv_remove(struct platform_device *pdev)
+{
+	struct stratix10_svc_controller *ctrl = platform_get_drvdata(pdev);
+
+	kfifo_free(&ctrl->svc_fifo);
+	if (ctrl->task) {
+		kthread_stop(ctrl->task);
+		ctrl->task = NULL;
+	}
+	if (ctrl->genpool)
+		gen_pool_destroy(ctrl->genpool);
+	list_del(&ctrl->node);
+
+	return 0;
+}
+
+static struct platform_driver stratix10_svc_driver = {
+	.probe = stratix10_svc_drv_probe,
+	.remove = stratix10_svc_drv_remove,
+	.driver = {
+		.name = "stratix10-svc",
+		.of_match_table = stratix10_svc_drv_match,
+	},
+};
+
+static int __init stratix10_svc_init(void)
+{
+	struct device_node *fw_np;
+	struct device_node *np;
+	int ret;
+
+	fw_np = of_find_node_by_name(NULL, "firmware");
+	if (!fw_np)
+		return -ENODEV;
+
+	np = of_find_matching_node(fw_np, stratix10_svc_drv_match);
+	if (!np) {
+		of_node_put(fw_np);
+		return -ENODEV;
+	}
+
+	of_node_put(np);
+	ret = of_platform_populate(fw_np, stratix10_svc_drv_match, NULL, NULL);
+	of_node_put(fw_np);
+	if (ret)
+		return ret;
+
+	return platform_driver_register(&stratix10_svc_driver);
+}
+
+static void __exit stratix10_svc_exit(void)
+{
+	return platform_driver_unregister(&stratix10_svc_driver);
+}
+
+subsys_initcall(stratix10_svc_init);
+module_exit(stratix10_svc_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("Intel Stratix10 Service Layer Driver");
+MODULE_AUTHOR("Richard Gong <richard.gong@intel.com>");
+MODULE_ALIAS("platform:stratix10-svc");
diff --git a/include/linux/stratix10-svc-client.h b/include/linux/stratix10-svc-client.h
new file mode 100644
index 0000000..0ac6b2c
--- /dev/null
+++ b/include/linux/stratix10-svc-client.h
@@ -0,0 +1,199 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2017-2018, Intel Corporation
+ */
+
+#ifndef __STRATIX10_SVC_CLIENT_H
+#define __STRATIX10_SVC_CLIENT_H
+
+/**
+ * Service layer driver supports client names
+ *
+ * fpga: for FPGA configuration
+ */
+#define SVC_CLIENT_FPGA			"fpga"
+
+/**
+ * Status of the sent command, in bit number
+ *
+ * SVC_COMMAND_STATUS_RECONFIG_REQUEST_OK:
+ * Secure firmware accepts the request of FPGA reconfiguration.
+ *
+ * SVC_STATUS_RECONFIG_BUFFER_SUBMITTED:
+ * Service client successfully submits FPGA configuration
+ * data buffer to secure firmware.
+ *
+ * SVC_COMMAND_STATUS_RECONFIG_BUFFER_DONE:
+ * Secure firmware completes data process, ready to accept the
+ * next WRITE transaction.
+ *
+ * SVC_COMMAND_STATUS_RECONFIG_COMPLETED:
+ * Secure firmware completes FPGA configuration successfully, FPGA should
+ * be in user mode.
+ *
+ * SVC_COMMAND_STATUS_RECONFIG_BUSY:
+ * FPGA configuration is still in process.
+ *
+ * SVC_COMMAND_STATUS_RECONFIG_ERROR:
+ * Error encountered during FPGA configuration.
+ */
+#define SVC_STATUS_RECONFIG_REQUEST_OK		0
+#define SVC_STATUS_RECONFIG_BUFFER_SUBMITTED	1
+#define SVC_STATUS_RECONFIG_BUFFER_DONE		2
+#define SVC_STATUS_RECONFIG_COMPLETED		3
+#define SVC_STATUS_RECONFIG_BUSY		4
+#define SVC_STATUS_RECONFIG_ERROR		5
+
+/**
+ * Flag bit for COMMAND_RECONFIG
+ *
+ * COMMAND_RECONFIG_FLAG_PARTIAL:
+ * Set to FPGA configuration type (full or partial), the default
+ * is full reconfig.
+ */
+#define COMMAND_RECONFIG_FLAG_PARTIAL   0
+
+/**
+ * Timeout settings for service clients:
+ * timeout value used in Stratix10 FPGA manager driver.
+ */
+#define SVC_RECONFIG_REQUEST_TIMEOUT_MS         100
+#define SVC_RECONFIG_BUFFER_TIMEOUT_MS          240
+
+struct stratix10_svc_chan;
+
+/**
+ * enum stratix10_svc_command_code - supported service commands
+ *
+ * @COMMAND_NOOP: do 'dummy' request for integration/debug/trouble-shooting
+ *
+ * @COMMAND_RECONFIG: ask for FPGA configuration preparation, return status
+ * is SVC_STATUS_RECONFIG_REQUEST_OK
+ *
+ * @COMMAND_RECONFIG_DATA_SUBMIT: submit buffer(s) of bit-stream data for the
+ * FPGA configuration, return status is SVC_STATUS_RECONFIG_BUFFER_SUBMITTED,
+ * or SVC_STATUS_RECONFIG_ERROR
+ *
+ * @COMMAND_RECONFIG_DATA_CLAIM: check the status of the configuration, return
+ * status is SVC_STATUS_RECONFIG_COMPLETED, or SVC_STATUS_RECONFIG_BUSY, or
+ * SVC_STATUS_RECONFIG_ERROR
+ *
+ * @COMMAND_RECONFIG_STATUS: check the status of the configuration, return
+ * status is SVC_STATUS_RECONFIG_COMPLETED, or  SVC_STATUS_RECONFIG_BUSY, or
+ * SVC_STATUS_RECONFIG_ERROR
+ */
+enum stratix10_svc_command_code {
+	COMMAND_NOOP = 0,
+	COMMAND_RECONFIG,
+	COMMAND_RECONFIG_DATA_SUBMIT,
+	COMMAND_RECONFIG_DATA_CLAIM,
+	COMMAND_RECONFIG_STATUS
+};
+
+/**
+ * struct stratix10_svc_client_msg - message sent by client to service
+ * @command: service command
+ * @payload: starting address of data need be processed
+ * @payload_length: data size in bytes
+ */
+struct stratix10_svc_client_msg {
+	void *payload;
+	size_t payload_length;
+	enum stratix10_svc_command_code command;
+};
+
+/**
+ * struct stratix10_svc_command_reconfig_payload - reconfig payload
+ * @flags: flag bit for the type of FPGA configuration
+ */
+struct stratix10_svc_command_reconfig_payload {
+	u32 flags;
+};
+
+/**
+ * struct stratix10_svc_cb_data - callback data structure from service layer
+ * @status: the status of sent command
+ * @kaddr1: address of 1st completed data block
+ * @kaddr2: address of 2nd completed data block
+ * @kaddr3: address of 3rd completed data block
+ */
+struct stratix10_svc_cb_data {
+	u32 status;
+	void *kaddr1;
+	void *kaddr2;
+	void *kaddr3;
+};
+
+/**
+ * struct stratix10_svc_client - service client structure
+ * @dev: the client device
+ * @receive_cb: callback to provide service client the received data
+ * @priv: client private data
+ */
+struct stratix10_svc_client {
+	struct device *dev;
+	void (*receive_cb)(struct stratix10_svc_client *client,
+			   struct stratix10_svc_cb_data *cb_data);
+	void *priv;
+};
+
+/**
+ * stratix10_svc_request_channel_byname() - request service channel
+ * @client: identity of the client requesting the channel
+ * @name: supporting client name defined above
+ *
+ * Return: a pointer to channel assigned to the client on success,
+ * or ERR_PTR() on error.
+ */
+struct stratix10_svc_chan
+*stratix10_svc_request_channel_byname(struct stratix10_svc_client *client,
+	const char *name);
+
+/**
+ * stratix10_svc_free_channel() - free service channel.
+ * @chan: service channel to be freed
+ */
+void stratix10_svc_free_channel(struct stratix10_svc_chan *chan);
+
+/**
+ * stratix10_svc_allocate_memory() - allocate the momory
+ * @chan: service channel assigned to the client
+ * @size: number of bytes client requests
+ *
+ * Service layer allocates the requested number of bytes from the memory
+ * pool for the client.
+ *
+ * Return: the starting address of allocated memory on success, or
+ * ERR_PTR() on error.
+ */
+void *stratix10_svc_allocate_memory(struct stratix10_svc_chan *chan,
+				    size_t size);
+
+/**
+ * stratix10_svc_free_memory() - free allocated memory
+ * @chan: service channel assigned to the client
+ * @kaddr: starting address of memory to be free back to pool
+ */
+void stratix10_svc_free_memory(struct stratix10_svc_chan *chan, void *kaddr);
+
+/**
+ * stratix10_svc_send() - send a message to the remote
+ * @chan: service channel assigned to the client
+ * @msg: message data to be sent, in the format of
+ * struct stratix10_svc_client_msg
+ *
+ * Return: 0 for success, -ENOMEM or -ENOBUFS on error.
+ */
+int stratix10_svc_send(struct stratix10_svc_chan *chan, void *msg);
+
+/**
+ * intel_svc_done() - complete service request
+ * @chan: service channel assigned to the client
+ *
+ * This function is used by service client to inform service layer that
+ * client's service requests are completed, or there is an error in the
+ * request process.
+ */
+void stratix10_svc_done(struct stratix10_svc_chan *chan);
+#endif
+
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCHv5 6/8] fpga: add intel stratix10 soc fpga manager driver
From: richard.gong @ 2018-05-24 16:33 UTC (permalink / raw)
  To: catalin.marinas, will.deacon, dinguyen, robh+dt, mark.rutland,
	atull, mdf, arnd, gregkh, corbet
  Cc: linux-arm-kernel, linux-kernel, devicetree, linux-fpga, linux-doc,
	yves.vandervennet, richard.gong, richard.gong
In-Reply-To: <1527179600-26441-1-git-send-email-richard.gong@linux.intel.com>

From: Alan Tull <atull@kernel.org>

Add driver for reconfiguring Intel Stratix10 SoC FPGA devices.
This driver communicates through the Intel Service Driver which
does communication with privileged hardware (that does the
FPGA programming) through a secure mailbox.

Signed-off-by: Alan Tull <atull@kernel.org>
Signed-off-by: Richard Gong <richard.gong@intel.com>
---
v2: this patch is added in patch set version 2
v3: change to align to the update of service client APIs, and the
    update of fpga_mgr device node
v4: changes to align with stratix10-svc-client API updates
    add Richard's signed-off-by
v5: update to align changes at service layer to minimize service
    layer thread usages
---
 drivers/fpga/Kconfig         |   6 +
 drivers/fpga/Makefile        |   1 +
 drivers/fpga/stratix10-soc.c | 545 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 552 insertions(+)
 create mode 100644 drivers/fpga/stratix10-soc.c

diff --git a/drivers/fpga/Kconfig b/drivers/fpga/Kconfig
index f47ef84..1624a73 100644
--- a/drivers/fpga/Kconfig
+++ b/drivers/fpga/Kconfig
@@ -57,6 +57,12 @@ config FPGA_MGR_ZYNQ_FPGA
 	help
 	  FPGA manager driver support for Xilinx Zynq FPGAs.
 
+config FPGA_MGR_STRATIX10_SOC
+	tristate "Intel Stratix10 SoC FPGA Manager"
+	depends on (ARCH_STRATIX10 && STRATIX10_SERVICE)
+	help
+	  FPGA manager driver support for the Intel Stratix10 SoC.
+
 config FPGA_MGR_XILINX_SPI
 	tristate "Xilinx Configuration over Slave Serial (SPI)"
 	depends on SPI
diff --git a/drivers/fpga/Makefile b/drivers/fpga/Makefile
index 3cb276a..6eef670 100644
--- a/drivers/fpga/Makefile
+++ b/drivers/fpga/Makefile
@@ -12,6 +12,7 @@ obj-$(CONFIG_FPGA_MGR_ALTERA_PS_SPI)	+= altera-ps-spi.o
 obj-$(CONFIG_FPGA_MGR_ICE40_SPI)	+= ice40-spi.o
 obj-$(CONFIG_FPGA_MGR_SOCFPGA)		+= socfpga.o
 obj-$(CONFIG_FPGA_MGR_SOCFPGA_A10)	+= socfpga-a10.o
+obj-$(CONFIG_FPGA_MGR_STRATIX10_SOC)	+= stratix10-soc.o
 obj-$(CONFIG_FPGA_MGR_TS73XX)		+= ts73xx-fpga.o
 obj-$(CONFIG_FPGA_MGR_XILINX_SPI)	+= xilinx-spi.o
 obj-$(CONFIG_FPGA_MGR_ZYNQ_FPGA)	+= zynq-fpga.o
diff --git a/drivers/fpga/stratix10-soc.c b/drivers/fpga/stratix10-soc.c
new file mode 100644
index 0000000..d645ef7
--- /dev/null
+++ b/drivers/fpga/stratix10-soc.c
@@ -0,0 +1,545 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * FPGA Manager Driver for Intel Stratix10 SoC
+ *
+ *  Copyright (C) 2018 Intel Corporation
+ */
+#include <linux/completion.h>
+#include <linux/fpga/fpga-mgr.h>
+#include <linux/stratix10-svc-client.h>
+#include <linux/module.h>
+#include <linux/of.h>
+#include <linux/of_platform.h>
+/*
+ * FPGA programming requires a higher level of privilege (EL3), per the SoC
+ * design.
+ */
+#define NUM_SVC_BUFS	4
+#define SVC_BUF_SIZE	SZ_512K
+
+/* Indicates buffer is in use if set */
+#define SVC_BUF_LOCK	0
+
+/**
+ * struct s10_svc_buf
+ * @buf: virtual address of buf provided by service layer
+ * @lock: locked if buffer is in use
+ */
+struct s10_svc_buf {
+	char *buf;
+	unsigned long lock;
+};
+
+struct s10_priv {
+	struct stratix10_svc_chan *chan;
+	struct stratix10_svc_client client;
+	struct completion status_return_completion;
+	struct s10_svc_buf svc_bufs[NUM_SVC_BUFS];
+	unsigned long status;
+};
+
+static int s10_svc_send_msg(struct s10_priv *priv,
+			    enum stratix10_svc_command_code command,
+			    void *payload, u32 payload_length)
+{
+	struct stratix10_svc_chan *chan = priv->chan;
+	struct stratix10_svc_client_msg msg;
+	int ret;
+
+	pr_debug("%s cmd=%d payload=%p legnth=%d\n",
+		 __func__, command, payload, payload_length);
+
+	msg.command = command;
+	msg.payload = payload;
+	msg.payload_length = payload_length;
+
+	ret = stratix10_svc_send(chan, &msg);
+	pr_debug("stratix10_svc_send returned status %d\n", ret);
+
+	return ret;
+}
+
+/**
+ * s10_free_buffers
+ * Free buffers allocated from the service layer's pool that are not in use.
+ * @mgr: fpga manager struct
+ * Free all buffers that are not in use.
+ * Return true when all buffers are freed.
+ */
+static bool s10_free_buffers(struct fpga_manager *mgr)
+{
+	struct s10_priv *priv = mgr->priv;
+	uint num_free = 0;
+	uint i;
+
+	for (i = 0; i < NUM_SVC_BUFS; i++) {
+		if (!priv->svc_bufs[i].buf) {
+			num_free++;
+			continue;
+		}
+
+		if (!test_and_set_bit_lock(SVC_BUF_LOCK,
+					   &priv->svc_bufs[i].lock)) {
+			stratix10_svc_free_memory(priv->chan,
+					      priv->svc_bufs[i].buf);
+			priv->svc_bufs[i].buf = NULL;
+			num_free++;
+		}
+	}
+
+	return num_free == NUM_SVC_BUFS;
+}
+
+/**
+ * s10_free_buffer_count
+ * Count how many buffers are not in use.
+ * @mgr: fpga manager struct
+ * Return # of buffers that are not in use.
+ */
+static uint s10_free_buffer_count(struct fpga_manager *mgr)
+{
+	struct s10_priv *priv = mgr->priv;
+	uint num_free = 0;
+	uint i;
+
+	for (i = 0; i < NUM_SVC_BUFS; i++)
+		if (!priv->svc_bufs[i].buf)
+			num_free++;
+
+	return num_free;
+}
+
+/**
+ * s10_unlock_bufs
+ * Given the returned buffer address, match that address to our buffer struct
+ * and unlock that buffer.  This marks it as available to be refilled and sent
+ * (or freed).
+ * @priv: private data
+ * @kaddr: kernel address of buffer that was returned from service layer
+ */
+static void s10_unlock_bufs(struct s10_priv *priv, void *kaddr)
+{
+	uint i;
+
+	if (!kaddr)
+		return;
+
+	for (i = 0; i < NUM_SVC_BUFS; i++)
+		if (priv->svc_bufs[i].buf == kaddr) {
+			clear_bit_unlock(SVC_BUF_LOCK,
+					 &priv->svc_bufs[i].lock);
+			return;
+		}
+
+	WARN(1, "Unknown buffer returned from service layer %p\n", kaddr);
+}
+
+/**
+ * s10_receive_callback
+ * Callback for service layer to use to provide client (this driver) messages
+ * received through the mailbox.
+ * @client: service layer client struct
+ * @data: message
+ */
+static void s10_receive_callback(struct stratix10_svc_client *client,
+				 struct stratix10_svc_cb_data *data)
+{
+	struct s10_priv *priv = client->priv;
+	u32 status;
+	int i;
+
+	WARN_ONCE(!data, "%s: stratix10_svc_rc_data = NULL", __func__);
+
+	status = data->status;
+
+	/*
+	 * Here we set status bits as we receive them.  Elsewhere, we always use
+	 * test_and_clear_bit() to check status in priv->status
+	 */
+	for (i = 0; i <= SVC_STATUS_RECONFIG_ERROR; i++)
+		if (status & (1 << i))
+			set_bit(i, &priv->status);
+
+	if (status & BIT(SVC_STATUS_RECONFIG_BUFFER_DONE)) {
+		s10_unlock_bufs(priv, data->kaddr1);
+		s10_unlock_bufs(priv, data->kaddr2);
+		s10_unlock_bufs(priv, data->kaddr3);
+	}
+
+	complete(&priv->status_return_completion);
+}
+
+/**
+ * s10_ops_write_init
+ * Prepare for FPGA reconfiguration by requesting partial reconfig and
+ * allocating buffers from the service layer.
+ * @mgr: fpga manager
+ * @info: fpga image info
+ * @buf: fpga image buffer
+ * @count: size of buf in bytes
+ */
+static int s10_ops_write_init(struct fpga_manager *mgr,
+			      struct fpga_image_info *info,
+			      const char *buf, size_t count)
+{
+	struct s10_priv *priv = mgr->priv;
+	struct device *dev = priv->client.dev;
+	unsigned long timeout;
+	struct stratix10_svc_command_reconfig_payload payload;
+	char *kbuf;
+	uint i;
+	int ret;
+
+	if (info->flags & FPGA_MGR_PARTIAL_RECONFIG) {
+		dev_info(dev, "Requesting partial reconfiguration.\n");
+		payload.flags |= BIT(COMMAND_RECONFIG_FLAG_PARTIAL);
+	} else {
+		dev_info(dev, "Requesting full reconfiguration.\n");
+	}
+
+	reinit_completion(&priv->status_return_completion);
+	ret = s10_svc_send_msg(priv, COMMAND_RECONFIG,
+			       &payload, sizeof(payload));
+	if (ret)
+		goto init_done;
+
+	timeout = msecs_to_jiffies(SVC_RECONFIG_REQUEST_TIMEOUT_MS);
+	ret = wait_for_completion_interruptible_timeout(
+		&priv->status_return_completion, timeout);
+	if (!ret) {
+		dev_err(dev, "timeout waiting for RECONFIG_REQUEST\n");
+		ret = -ETIMEDOUT;
+		goto init_done;
+	}
+	if (ret < 0) {
+		dev_err(dev, "error (%d) waiting for RECONFIG_REQUEST\n", ret);
+		goto init_done;
+	}
+
+	ret = 0;
+	if (!test_and_clear_bit(SVC_STATUS_RECONFIG_REQUEST_OK,
+				&priv->status)) {
+		ret = -ETIMEDOUT;
+		goto init_done;
+	}
+
+	/* Allocate buffers from the service layer's pool. */
+	for (i = 0; i < NUM_SVC_BUFS; i++) {
+		kbuf = stratix10_svc_allocate_memory(priv->chan, SVC_BUF_SIZE);
+		if (!kbuf) {
+			s10_free_buffers(mgr);
+			ret = -ENOMEM;
+			goto init_done;
+		}
+
+		priv->svc_bufs[i].buf = kbuf;
+		priv->svc_bufs[i].lock = 0;
+	}
+
+init_done:
+	stratix10_svc_done(priv->chan);
+	return ret;
+}
+
+/**
+ * s10_send_buf
+ * Send a buffer to the service layer queue
+ * @mgr: fpga manager struct
+ * @buf_num: index of buffer in svc_bufs array
+ * @buf: fpga image buffer
+ * @count: size of buf in bytes
+ * Returns # of bytes transferred or -errno, never 0
+ */
+static int s10_send_buf(struct fpga_manager *mgr, uint buf_num,
+			const char *buf, size_t count)
+
+{
+	struct s10_priv *priv = mgr->priv;
+	struct device *dev = priv->client.dev;
+	void *svc_buf;
+	size_t xfer_sz;
+	int ret;
+
+	xfer_sz = count < SVC_BUF_SIZE ? count : SVC_BUF_SIZE;
+
+	svc_buf = priv->svc_bufs[buf_num].buf;
+	memcpy(svc_buf, buf, xfer_sz);
+	ret = s10_svc_send_msg(priv, COMMAND_RECONFIG_DATA_SUBMIT,
+			       svc_buf, xfer_sz);
+	if (ret) {
+		dev_err(dev,
+			"Error while sending data to service layer (%d)", ret);
+		return ret;
+	}
+
+	return xfer_sz;
+}
+
+/**
+ * s10_ops_write
+ * Send a FPGA image to privileged layers to write to the FPGA.  When done
+ * sending, free all service layer buffers we allocated in write_init.
+ * @mgr: fpga manager
+ * @buf: fpga image buffer
+ * @count: size of buf in bytes
+ * Returns 0 for success or negative errno.
+ */
+static int s10_ops_write(struct fpga_manager *mgr, const char *buf,
+			 size_t count)
+{
+	struct s10_priv *priv = mgr->priv;
+	struct device *dev = priv->client.dev;
+	unsigned long timeout;
+	size_t sent = 0;
+	int ret = 0;
+	uint i;
+
+	timeout = msecs_to_jiffies(SVC_RECONFIG_BUFFER_TIMEOUT_MS);
+
+	/* Buffer loop: either send buffers or free them. */
+	while (1) {
+		reinit_completion(&priv->status_return_completion);
+
+		if (count > 0) {
+			for (i = 0; i < NUM_SVC_BUFS; i++)
+				if (!test_and_set_bit_lock(
+					 SVC_BUF_LOCK, &priv->svc_bufs[i].lock))
+					break;
+
+			if (i == NUM_SVC_BUFS)
+				/* wait for a free buffer */
+				continue;
+
+			sent = s10_send_buf(mgr, i, buf, count);
+			/*
+			 * If service queue was full, we won't get a callback.
+			 * Wait and try again
+			 */
+			if (sent < 0)
+				continue;
+
+			count -= sent;
+			buf += sent;
+		} else {
+			s10_free_buffers(mgr);
+			if (s10_free_buffer_count(mgr) == NUM_SVC_BUFS)
+				return 0;
+
+			ret = s10_svc_send_msg(
+				priv, COMMAND_RECONFIG_DATA_CLAIM,
+				NULL, 0);
+			if (ret)
+				break;
+		}
+
+		/*
+		 * If callback hasn't already happened, wait for buffers to be
+		 * returned from service layer
+		 */
+		if (priv->status)
+			ret = 0;
+		else
+			ret = wait_for_completion_interruptible_timeout(
+				&priv->status_return_completion, timeout);
+
+		if (test_and_clear_bit(
+				SVC_STATUS_RECONFIG_BUFFER_DONE, &priv->status))
+			continue;
+
+		if (test_and_clear_bit(SVC_STATUS_RECONFIG_BUFFER_SUBMITTED,
+				       &priv->status))
+			continue;
+
+		if (test_and_clear_bit(SVC_STATUS_RECONFIG_ERROR,
+				       &priv->status)) {
+			dev_err(dev, "ERROR - giving up - SVC_STATUS_RECONFIG_ERROR\n");
+			ret = -EFAULT;
+			break;
+		}
+
+		if (!ret) {
+			dev_err(dev, "timeout waiting for svc layer buffers\n");
+			ret = -ETIMEDOUT;
+			break;
+		}
+		if (ret < 0) {
+			dev_err(dev,
+				"error (%d) waiting for svc layer buffers\n",
+				ret);
+			break;
+		}
+	}
+
+	s10_free_buffers(mgr);
+	if (s10_free_buffer_count(mgr) != NUM_SVC_BUFS)
+		dev_err(dev, "%s not all buffers were freed\n", __func__);
+
+	return ret;
+}
+
+/**
+ * s10_ops_write_complete
+ * Wait for FPGA configuration to be done
+ * @mgr: fpga manager
+ * @info: fpga image info
+ * Returns 0 for success negative errno.
+ */
+static int s10_ops_write_complete(struct fpga_manager *mgr,
+				  struct fpga_image_info *info)
+{
+	struct s10_priv *priv = mgr->priv;
+	struct device *dev = priv->client.dev;
+	unsigned long timeout;
+	int ret;
+
+	timeout = usecs_to_jiffies(info->config_complete_timeout_us);
+
+	do {
+		reinit_completion(&priv->status_return_completion);
+
+		ret = s10_svc_send_msg(priv, COMMAND_RECONFIG_STATUS, NULL, 0);
+		if (ret)
+			break;
+
+		ret = wait_for_completion_interruptible_timeout(
+			&priv->status_return_completion, timeout);
+		if (!ret) {
+			dev_err(dev,
+				"timeout waiting for RECONFIG_COMPLETED\n");
+			ret = -ETIMEDOUT;
+			break;
+		}
+		if (ret < 0) {
+			dev_err(dev,
+				"error (%d) waiting for RECONFIG_COMPLETED\n",
+				ret);
+			break;
+		}
+		/* Not error or timeout, so ret is # of jiffies until timeout */
+		timeout = ret;
+		ret = 0;
+
+		if (test_and_clear_bit(SVC_STATUS_RECONFIG_COMPLETED,
+				       &priv->status))
+			break;
+
+		if (test_and_clear_bit(SVC_STATUS_RECONFIG_ERROR,
+				       &priv->status)) {
+			dev_err(dev, "ERROR - giving up - SVC_STATUS_RECONFIG_ERROR\n");
+			ret = -EFAULT;
+			break;
+		}
+	} while (1);
+
+	stratix10_svc_done(priv->chan);
+	return ret;
+}
+
+static enum fpga_mgr_states s10_ops_state(struct fpga_manager *mgr)
+{
+	return FPGA_MGR_STATE_UNKNOWN;
+}
+
+static const struct fpga_manager_ops s10_ops = {
+	.state = s10_ops_state,
+	.write_init = s10_ops_write_init,
+	.write = s10_ops_write,
+	.write_complete = s10_ops_write_complete,
+};
+
+static int s10_probe(struct platform_device *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct s10_priv *priv;
+	int ret;
+
+	priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	priv->client.dev = dev;
+	priv->client.receive_cb = s10_receive_callback;
+	priv->client.priv = priv;
+
+	priv->chan = stratix10_svc_request_channel_byname(&priv->client,
+						SVC_CLIENT_FPGA);
+	if (IS_ERR(priv->chan)) {
+		dev_err(dev, "couldn't get service channel (%s)\n",
+			SVC_CLIENT_FPGA);
+		return PTR_ERR(priv->chan);
+	}
+
+	init_completion(&priv->status_return_completion);
+
+	ret = fpga_mgr_register(dev, "Stratix10 SOC FPGA Manager",
+				&s10_ops, priv);
+
+	if (ret)
+		stratix10_svc_free_channel(priv->chan);
+
+	return ret;
+}
+
+static int s10_remove(struct platform_device *pdev)
+{
+	struct fpga_manager *mgr = platform_get_drvdata(pdev);
+	struct s10_priv *priv = mgr->priv;
+
+	fpga_mgr_unregister(&pdev->dev);
+	stratix10_svc_free_channel(priv->chan);
+
+	return 0;
+}
+
+static const struct of_device_id s10_of_match[] = {
+	{ .compatible = "intel,stratix10-soc-fpga-mgr", },
+	{},
+};
+
+MODULE_DEVICE_TABLE(of, s10_of_match);
+
+static struct platform_driver s10_driver = {
+	.probe = s10_probe,
+	.remove = s10_remove,
+	.driver = {
+		.name	= "Stratix10 SoC FPGA manager",
+		.of_match_table = of_match_ptr(s10_of_match),
+	},
+};
+
+static int __init s10_init(void)
+{
+	struct device_node *fw_np;
+	struct device_node *np;
+	int ret;
+
+	fw_np = of_find_node_by_name(NULL, "svc");
+	if (!fw_np)
+		return -ENODEV;
+
+	np = of_find_matching_node(fw_np, s10_of_match);
+	if (!np) {
+		of_node_put(fw_np);
+		return -ENODEV;
+	}
+
+	of_node_put(np);
+	ret = of_platform_populate(fw_np, s10_of_match, NULL, NULL);
+	of_node_put(fw_np);
+	if (ret)
+		return ret;
+
+	return platform_driver_register(&s10_driver);
+}
+
+static void __exit s10_exit(void)
+{
+	return platform_driver_unregister(&s10_driver);
+}
+
+module_init(s10_init);
+module_exit(s10_exit);
+
+MODULE_AUTHOR("Alan Tull <atull@kernel.org>");
+MODULE_DESCRIPTION("Intel Stratix 10 SOC FPGA Manager");
+MODULE_LICENSE("GPL v2");
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCHv5 7/8] defconfig: enable fpga and service layer
From: richard.gong @ 2018-05-24 16:33 UTC (permalink / raw)
  To: catalin.marinas, will.deacon, dinguyen, robh+dt, mark.rutland,
	atull, mdf, arnd, gregkh, corbet
  Cc: linux-arm-kernel, linux-kernel, devicetree, linux-fpga, linux-doc,
	yves.vandervennet, richard.gong, richard.gong
In-Reply-To: <1527179600-26441-1-git-send-email-richard.gong@linux.intel.com>

From: Richard Gong <richard.gong@intel.com>

Enable fpga framework, Stratix 10 SoC FPGA manager and Stratix10
Service Layer

Signed-off-by: Richard Gong <richard.gong@intel.com>
Signed-off-by: Alan Tull <atull@kernel.org>
---
v2: this patch is added in patch set version 2
v3: no change
v4: s/CONFIG_INTEL_SERVICE/CONFIG_STRATIX10_SERVICE/
    add CONFIG_OF_FPGA_REGION=y
    s/Intel/Stratix10/ in subject line
v5: no change
---
 arch/arm64/configs/defconfig | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index ecf6137..5f7a9b7 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -180,6 +180,7 @@ CONFIG_BLK_DEV_LOOP=y
 CONFIG_BLK_DEV_NBD=m
 CONFIG_VIRTIO_BLK=y
 CONFIG_BLK_DEV_NVME=m
+CONFIG_STRATIX10_SERVICE=y
 CONFIG_SRAM=y
 CONFIG_EEPROM_AT25=m
 # CONFIG_SCSI_PROC_FS is not set
@@ -595,6 +596,11 @@ CONFIG_PHY_TEGRA_XUSB=y
 CONFIG_QCOM_L2_PMU=y
 CONFIG_QCOM_L3_PMU=y
 CONFIG_MESON_EFUSE=m
+CONFIG_FPGA=y
+CONFIG_FPGA_MGR_STRATIX10_SOC=y
+CONFIG_FPGA_REGION=y
+CONFIG_FPGA_BRIDGE=y
+CONFIG_OF_FPGA_REGION=y
 CONFIG_QCOM_QFPROM=y
 CONFIG_UNIPHIER_EFUSE=y
 CONFIG_TEE=y
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox