[diamon-discuss] Interested in integration with cluster management software

* [diamon-discuss] Interested in integration with cluster management software
@ 2015-09-21 16:09 Connor Doyle
  2015-09-22 20:42 ` Mathieu Desnoyers
  0 siblings, 1 reply; 3+ messages in thread
From: Connor Doyle @ 2015-09-21 16:09 UTC
  To: diamon-discuss

Hello,

I heard about this workgroup through the LF newsletter this morning.
At Mesosphere we contribute to Apache Mesos, a popular open source
cluster resource manager, and related software.  Much of our work would
benefit from more standardization (even de-facto standardization)
around application-level tracing and monitoring.  For example, Mesos
recently added support for modular oversubscription policies for slack
estimation and QoS control.  We've started discussions about wiring up
something bespoke for use in Mesos, but a standard format for expressing
SLI/SLO could be better.

Anyway, just want to express interest in the outcomes, volunteer to
discuss and help where possible, and say "kudos" for bootstrapping
this workgroup.

Best,
-- 
Connor Doyle

* Re: [diamon-discuss] Interested in integration with cluster management software
  2015-09-21 16:09 [diamon-discuss] Interested in integration with cluster management software Connor Doyle
@ 2015-09-22 20:42 ` Mathieu Desnoyers
  2015-10-12 22:10   ` Connor Doyle
  0 siblings, 1 reply; 3+ messages in thread
From: Mathieu Desnoyers @ 2015-09-22 20:42 UTC
  To: Connor Doyle; +Cc: diamon-discuss

----- On Sep 21, 2015, at 12:09 PM, Connor Doyle connor.p.d@gmail.com wrote:

> Hello,
> 
> I heard about this workgroup through the LF newsletter this morning.
> At Mesosphere we contribute to Apache Mesos, a popular open source
> cluster resource manager, and related software.  Much of our work would
> benefit from more standardization (even de-facto standardization)
> around application-level tracing and monitoring.  For example, Mesos
> recently added support for modular oversubscription policies for slack
> estimation and QoS control.  We've started discussions about wiring up
> something bespoke for use in Mesos, but a standard format for expressing
> SLI/SLO could be better.
> 
> Anyway, just want to express interest in the outcomes, volunteer to
> discuss and help where possible, and say "kudos" for bootstrapping
> this workgroup.

Hi Connor,

Thanks for your interest in the DiaMon Workgroup! Indeed, integrating
tracing/monitoring solutions into a CI resource manager feedback loop
would be an interesting area to tackle. We could then do fine-grained
resource monitoring based on a wide set of metrics, e.g.:

- I/O throughput, max latency,
- Network throughput and max latency,
- CPU utilization,
- Preemption latency,
- Memory usage.

One area that pure sampling approaches (profiling) usually don't handle
well is latency-related metrics. Aggregating tracing data can be a good
way to capture them. You could then express constraints on resources in
different ways: instead of just reserving "capacity", you could also
reserve "latency guarantees".
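
To make the aggregation idea concrete, here is a minimal sketch in
Python. It assumes a hypothetical event layout of (timestamp_ns, task,
kind) tuples, which is not a real trace format, and the function names
are made up for illustration. It pairs begin/end events, keeps the
worst-case latency per task, and compares it with a reserved latency
guarantee. A periodic sampler could easily miss the single slow request
that a full trace records.

```python
from collections import defaultdict

def max_latency_by_task(events):
    """Aggregate begin/end trace events into the worst-case latency per task.

    `events` is an iterable of (timestamp_ns, task, kind) tuples, where kind
    is "begin" or "end". This layout is hypothetical, not a real trace format.
    """
    pending = {}              # task -> timestamp of the currently open "begin"
    worst = defaultdict(int)  # task -> maximum observed latency in ns
    for ts, task, kind in sorted(events):
        if kind == "begin":
            pending[task] = ts
        elif kind == "end" and task in pending:
            worst[task] = max(worst[task], ts - pending.pop(task))
    return dict(worst)

def violations(worst, guarantees_ns):
    """Return tasks whose worst-case latency exceeds their reserved guarantee."""
    return sorted(t for t, lat in worst.items()
                  if t in guarantees_ns and lat > guarantees_ns[t])

events = [
    (0, "db", "begin"), (400, "db", "end"),
    (1_000, "db", "begin"), (9_000, "db", "end"),  # one rare slow request
    (500, "web", "begin"), (700, "web", "end"),
]
worst = max_latency_by_task(events)
late = violations(worst, {"db": 5_000, "web": 5_000})
```

Here the guarantee is expressed per task in nanoseconds; a resource
manager could treat the returned list as the set of tasks whose latency
reservation is not being honored.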

Thoughts?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: [diamon-discuss] Interested in integration with cluster management software
  2015-09-22 20:42 ` Mathieu Desnoyers
@ 2015-10-12 22:10   ` Connor Doyle
  0 siblings, 0 replies; 3+ messages in thread
From: Connor Doyle @ 2015-10-12 22:10 UTC
  To: Mathieu Desnoyers; +Cc: diamon-discuss

Hi Mathieu,

Thanks for your response, and sorry for the tardy reply.

Oversubscription in Apache Mesos is described in this doc:
http://mesos.apache.org/documentation/latest/oversubscription.  The
design borrows from established techniques in virtual machine hypervisors
as well as the recent "Heracles" paper from Google and Stanford:
http://csl.stanford.edu/~christos/publications/2015.heracles.isca.pdf.
Mesos itself includes mainly the plumbing and APIs; monitoring and
interference detection and avoidance policies are left to external
modules.

In the public cloud, where there is limited access to many low-level
performance counters, and even in the private cloud, it is advantageous
to rely on custom application performance metrics as a more reliable
signal.  That is, it's better when the application itself can indicate
that it is suffering instead of making an inference based on drops in
IPS, for example.  We are just starting the discussion in the Mesos
community about promoting SLI/SLO descriptors to first-class status to
drive components like Resource Estimator and QoS controller modules.
In the future, we could also use historical performance data to
establish anti-affinity among workload classes.
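
As a rough illustration of what a first-class SLI/SLO descriptor might
look like, here is a small Python sketch. Everything in it (the SLO
class, its fields, the metric names) is hypothetical and is not an
existing Mesos API; it only shows the shape of the idea, where the
application reports its own indicators and a module checks them against
declared objectives.

```python
from dataclasses import dataclass
import operator

# Supported comparison operators for an objective (illustrative subset).
_COMPARATORS = {"<=": operator.le, ">=": operator.ge}

@dataclass(frozen=True)
class SLO:
    """A service level objective over one application-reported indicator."""
    sli: str         # indicator name, e.g. "p99_latency_ms" (hypothetical)
    comparator: str  # "<=" or ">="
    target: float

    def is_met(self, reported):
        """Check the objective against a dict of SLI name -> reported value.

        A missing indicator counts as not met, because the application
        gave no signal at all.
        """
        if self.sli not in reported:
            return False
        return _COMPARATORS[self.comparator](reported[self.sli], self.target)

def suffering(slos, reported):
    """Return the names of the SLIs whose objectives are currently violated."""
    return [s.sli for s in slos if not s.is_met(reported)]

slos = [SLO("p99_latency_ms", "<=", 50.0), SLO("requests_per_s", ">=", 1000.0)]
violated = suffering(slos, {"p99_latency_ms": 72.5, "requests_per_s": 1200.0})
```

A QoS controller module could poll such reports and act only when
`violated` is non-empty, rather than inferring distress from
hardware counters.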

Even taking this into account, low-level monitoring will be critical
for next steps in mitigating the negative effects of aggressor
workloads.  Currently, the only type of QoS correction we've been
working with is to evict best-effort tasks to protect high-priority
ones.  However, identifying the dominant contested resource(s) is
critical in order to take effective remedial action.  Support from
hardware vendors for more fine-grained monitoring and isolation, such
as L3 cache partitioning or I/O bandwidth, will help.
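
To show why knowing the dominant contested resource matters for
eviction, here is a minimal Python sketch. The task records, resource
names, and the target utilization threshold are all assumptions made
for the example, not Mesos data structures or policy. It picks the
resource with the highest utilization and evicts only the best-effort
tasks that use it most, stopping as soon as utilization drops to the
target.

```python
def dominant_contested_resource(tasks, capacity):
    """Return the resource with the highest total utilization (usage / capacity)."""
    totals = {r: sum(t["usage"].get(r, 0.0) for t in tasks) / capacity[r]
              for r in capacity}
    return max(totals, key=totals.get)

def pick_evictions(tasks, capacity, target_utilization):
    """Choose best-effort tasks to evict until the dominant resource is at target."""
    resource = dominant_contested_resource(tasks, capacity)
    used = sum(t["usage"].get(resource, 0.0) for t in tasks)
    limit = target_utilization * capacity[resource]
    # Consider the heaviest best-effort users of the contested resource first.
    candidates = sorted((t for t in tasks if t["best_effort"]),
                        key=lambda t: t["usage"].get(resource, 0.0),
                        reverse=True)
    evicted = []
    for task in candidates:
        if used <= limit:
            break
        used -= task["usage"].get(resource, 0.0)
        evicted.append(task["name"])
    return resource, evicted

tasks = [
    {"name": "frontend", "best_effort": False,
     "usage": {"cpu": 4.0, "io_mbps": 100.0}},
    {"name": "batch-a", "best_effort": True,
     "usage": {"cpu": 1.0, "io_mbps": 500.0}},
    {"name": "batch-b", "best_effort": True,
     "usage": {"cpu": 2.0, "io_mbps": 300.0}},
]
capacity = {"cpu": 16.0, "io_mbps": 1000.0}
resource, evicted = pick_evictions(tasks, capacity, target_utilization=0.5)
```

In this example I/O bandwidth is the contested resource, so evicting
the single I/O-heavy batch task is enough; evicting by CPU share alone
would have picked the wrong task.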

--
Connor

