Implementing BMC Health Monitoring

* Implementing BMC Health Monitoring
@ 2020-05-21  1:36 Sui Chen
  2020-05-21 10:47 ` Matuszczak, Piotr
  0 siblings, 1 reply; 7+ messages in thread
From: Sui Chen @ 2020-05-21  1:36 UTC (permalink / raw)
  To: openbmc

Hello OpenBMC Mailing List,

It is great to see the proposal on BMC health monitoring! We have a similar
health monitoring effort in progress and have started some implementation
work. We would like to share some thoughts with the mailing list to help get
BMC health monitoring started:

(1) What metrics have we considered so far?

We have considered the following metrics on the BMC:
  - Memory usage
  - Number of open file descriptors
  - Free storage space in the read-write file system
  - List of processes
  - CPU time for a few top processes

  Some of these are inspired by various profilers, and others are expected
to be relevant to the typical workloads running on the BMC.
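
As an illustration only (not the implementation discussed in this message),
several of the metrics listed above can be read from Linux procfs and
statvfs. The function names below are hypothetical, and the sketch assumes
a Linux BMC with /proc mounted:

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_used_fraction(meminfo):
    """Fraction of RAM in use, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    return (total - meminfo["MemAvailable"]) / total


def count_open_fds(pid="self"):
    """Number of open file descriptors of one process (Linux procfs)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def free_bytes(path):
    """Free space available to unprivileged users on the file system holding path."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


def cpu_ticks(stat_line):
    """utime + stime, in clock ticks, from the content of /proc/<pid>/stat."""
    # The command name is parenthesised and may contain spaces,
    # so split on the last ')' before splitting on whitespace.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(after_comm[11]) + int(after_comm[12])
```

Keeping the parsing separate from the file access makes the parsers easy to
exercise off-target with captured procfs text.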

(2) Overall, it appears the design space for health monitoring has the
following dimensions:

a) A method to do the collection, which might be:
  - Running a program like "df" to get free disk space
  - Traversing some folder to compute some statistics
  - Monitoring some bus for some time and generating some result
  - or something else

  The collection process might vary from metric to metric and can take
some time to complete on the BMC. The results therefore need to be staged
somewhere and made accessible once collection completes, so that the
requestor does not have to busy-wait.
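
A minimal sketch of the staging idea, assuming an in-memory stage and a
background thread (both are illustrative choices, and the class and
function names are hypothetical): the slow collection runs on its own
schedule, and a requestor only ever reads the most recent staged result.

```python
import threading
import time


class StagedMetric:
    """Holds the most recent result of a possibly slow collection function."""

    def __init__(self, collect):
        self._collect = collect
        self._lock = threading.Lock()
        self._latest = None  # (timestamp, value), or None before the first run

    def refresh(self):
        value = self._collect()  # may take a while; runs outside the lock
        with self._lock:
            self._latest = (time.time(), value)

    def read(self):
        """Return the staged (timestamp, value) immediately, without collecting."""
        with self._lock:
            return self._latest


def run_periodically(metric, interval_s, stop_event):
    """Refresh the staged value every interval_s seconds until stop_event is set."""
    while not stop_event.is_set():
        metric.refresh()
        stop_event.wait(interval_s)
```

The timestamp stored with each value lets a consumer decide whether the
staged result is too old to be useful.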

b) A way to stage monitoring data on the BMC, which might be:
  - Files or databases in DRAM or some persistent store.
  - D-Bus objects, as described in Vijay's document; this is similar to how
sensors work.
  - IPMI Blobs (this is what we have implemented right now)
  - or something else

c) A way to transfer monitoring data out of the BMC, which might be:
  - scp
  - Redfish
  - IPMI (this is what we're using right now)
  - or something else

d) Format of staged data:
  - Raw bytes
  - Protocol buffers
  - JSON objects
  - or something else
  - The data may be compressed to save transfer time
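
To make the format trade-off concrete, here is a small sketch that encodes
the same sample as raw bytes and as JSON, and compresses a batch of JSON
samples with zlib. The field names and the fixed little-endian layout are
assumptions made for the example, not a proposed wire format:

```python
import json
import struct
import zlib

# Hypothetical sample; the field names are illustrative only.
SAMPLE = {"mem_used_kb": 181234, "open_fds": 412, "rwfs_free_kb": 20480}
FIELDS = ("mem_used_kb", "open_fds", "rwfs_free_kb")
RAW_LAYOUT = "<III"  # three little-endian unsigned 32-bit integers


def encode_raw(sample):
    """Pack the sample into a fixed 12-byte layout."""
    return struct.pack(RAW_LAYOUT, *(sample[name] for name in FIELDS))


def decode_raw(blob):
    """Unpack a 12-byte blob back into a dict."""
    return dict(zip(FIELDS, struct.unpack(RAW_LAYOUT, blob)))


def encode_json(obj):
    """Serialize to compact UTF-8 JSON."""
    return json.dumps(obj, separators=(",", ":")).encode("utf-8")


def encode_json_compressed(obj):
    """Serialize to compact JSON, then compress with zlib."""
    return zlib.compress(encode_json(obj))


def decode_json_compressed(blob):
    """Inverse of encode_json_compressed."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

Raw bytes are the smallest for a single sample but need an agreed layout on
both ends; JSON is self-describing, and compression mostly pays off when
many similar samples are transferred together.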

e) A way to consume the health monitoring data:
  - The BMC might do some pre-processing, such as a windowed average.
  - The BMC may perform certain corrective measures when metrics appear
abnormal.
  - The host may perform certain corrective measures when metrics appear
abnormal.
  - BMC health data might be plugged into some already existing monitoring
framework overseeing a large number of machines, collecting historical
data, and projecting future trends, etc.
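
The windowed-average pre-processing mentioned above could look roughly like
the following sketch. The window size, the threshold check, and all names
are assumptions for illustration; what counts as "abnormal" would depend on
the metric:

```python
from collections import deque


class WindowedAverage:
    """Average over the most recent `size` samples."""

    def __init__(self, size):
        if size <= 0:
            raise ValueError("size must be positive")
        self._window = deque(maxlen=size)  # old samples fall off automatically

    def add(self, value):
        """Record a sample and return the average of the current window."""
        self._window.append(value)
        return sum(self._window) / len(self._window)


def is_abnormal(average, threshold):
    """Hypothetical check: flag the metric when its average exceeds a threshold."""
    return average > threshold
```

Averaging over a window before comparing against a threshold avoids
triggering a corrective measure on a single short spike.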

f) A way to configure the health monitoring system:
  - Configuration for which metrics are collected
  - Configuration for the choice of staging in b), way of transfer in c),
and frequency of collection in a)
  - Some configurations may be build-time and some may be run-time
     - I guess we can draw some inspiration from phosphor-ipmi-blobs

(3) The requirements and performance/storage impacts on the BMC:

a) The collection should not be too taxing on the processing/storage
resources on the BMC

b) The data transfer process should not be too taxing on the link between
the host and BMC
  - For the metrics we have and the IPMI connection we're using so far, it
took around 10 to 100 ms for the host to collect one metric. That time is
dominated by the IPMI transfer. We consider it acceptable if a metric is
collected at a reasonably long interval, say, every 30 minutes.
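
As a back-of-the-envelope check of the figures above: at the worst-case
100 ms per metric and a 30-minute (1800 s) interval, one metric keeps the
link busy for 0.1 / 1800, or roughly 0.0056% of the time. The helper below
(a hypothetical name, assuming metrics are transferred one after another)
generalizes this to several metrics:

```python
def link_busy_fraction(per_metric_s, metric_count, interval_s):
    """Fraction of time the host-BMC link spends transferring metrics.

    Assumes metrics are transferred back to back, once per interval.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return per_metric_s * metric_count / interval_s


# Worst case from the text: 100 ms per metric, collected every 30 minutes.
one_metric = link_busy_fraction(0.1, 1, 30 * 60)
```

The same function shows how quickly the cost grows if the interval is
shortened, for example from 30 minutes to a few seconds.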


We hope the above complements the existing design proposal, and we would
like to help start implementing (and deploying) health monitoring for the
BMC.
Our question is: we are still working on our implementation, so when would
be a good time to send it for review? Do we need to support both what we
have now and what is being proposed?

Thanks!
Sui


* RE: Implementing BMC Health Monitoring
  2020-05-21  1:36 Implementing BMC Health Monitoring Sui Chen
@ 2020-05-21 10:47 ` Matuszczak, Piotr
  2020-05-22  8:43   ` Adrian Ambrożewicz
  0 siblings, 1 reply; 7+ messages in thread
From: Matuszczak, Piotr @ 2020-05-21 10:47 UTC (permalink / raw)
  To: Sui Chen, openbmc@lists.ozlabs.org

Hi, 

The proposal seems interesting. From what I've read in your e-mail, you are looking for the best way to implement BMC health metrics. My proposal would be to expose these metrics as D-Bus sensors, with an option to store the data in the filesystem. Such a solution would ease integration with Redfish and allow these metrics to be supported by the Monitoring Service (https://github.com/openbmc/docs/blob/master/designs/telemetry.md). This way, you get support for collecting metrics into metric reports and for simple operations such as min/max/average/sum. Also, using metric reports, you can store historical readings and stream the metric reports as events.

Piotr Matuszczak
---------------------------------------------------------------------
Intel Technology Poland sp. z o.o. 
ul. Slowackiego 173, 80-298 Gdansk
KRS 101882
NIP 957-07-52-316


* Re: Implementing BMC Health Monitoring
  2020-05-21 10:47 ` Matuszczak, Piotr
@ 2020-05-22  8:43   ` Adrian Ambrożewicz
  2020-05-25 12:32     ` Adrian Ambrożewicz
  2020-05-26 17:36     ` Vijay Khemka
  0 siblings, 2 replies; 7+ messages in thread
From: Adrian Ambrożewicz @ 2020-05-22  8:43 UTC (permalink / raw)
  To: Matuszczak, Piotr, Sui Chen, openbmc@lists.ozlabs.org

I suppose I could back up Piotr here.

I believe that, in general, EntityManager could be leveraged for
configuration (enabling/disabling metrics and configuring them).
The dbus-sensors infrastructure would be beneficial in terms of:
- familiarity (already used for monitoring physical sensors, with new
synthesized sensors to come)
- flexibility (EntityManager could provide runtime configuration of the
metrics in the system)
- availability (both configuration and metrics would be exposed over
D-Bus interfaces, an easy-to-consume and 'standardized' way)

If dbus-sensors were used, the feature mentioned by Piotr
(TelemetryService) could, almost 'out of the box', support storing and
exposing metric snapshots, sending them to external databases
(EventService), etc.

Of course, dbus-sensors (xyz.openbmc_project.Sensor.Value) could be only
one of the interfaces for the data, so it does not limit any of the other
use cases you've mentioned.

Regards,
Adrian


* Re: Implementing BMC Health Monitoring
  2020-05-22  8:43   ` Adrian Ambrożewicz
@ 2020-05-25 12:32     ` Adrian Ambrożewicz
  2020-05-26 13:45       ` Brad Bishop
  2020-05-26 18:50       ` Vijay Khemka
  2020-05-26 17:36     ` Vijay Khemka
  1 sibling, 2 replies; 7+ messages in thread
From: Adrian Ambrożewicz @ 2020-05-25 12:32 UTC (permalink / raw)
  To: Matuszczak, Piotr, Sui Chen, openbmc@lists.ozlabs.org,
	Brad Bishop, Vijay Khemka

@Brad, @Vijay

It seems Sui is proposing something highly related to the already discussed
https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/31957 . As a matter
of fact, the availability of such system metrics is also highly desirable
on our (Intel) side. It seems we need to merge all the requirements to
satisfy the common needs.

Regards,
Adrian


* Re: Implementing BMC Health Monitoring
  2020-05-25 12:32     ` Adrian Ambrożewicz
@ 2020-05-26 13:45       ` Brad Bishop
  2020-05-26 18:50       ` Vijay Khemka
  1 sibling, 0 replies; 7+ messages in thread
From: Brad Bishop @ 2020-05-26 13:45 UTC (permalink / raw)
  To: Adrian Ambrożewicz, Matuszczak, Piotr, Sui Chen,
	openbmc@lists.ozlabs.org, Vijay Khemka

On Mon, 2020-05-25 at 14:32 +0200, Adrian Ambrożewicz wrote:
> @Brad, @Vijay
> 
> It seems Sui is proposing something highly related to the already discussed
> https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/31957 . As a matter
> of fact, the availability of such system metrics is also highly desirable
> on our (Intel) side. It seems we need to merge all the requirements to
> satisfy the common needs.

I was still hoping for a telemetry/metrics solution built around
collectd.  This seems like another place where it might make sense.  IMO
there is too much code being re-written in this area.


* Re: Implementing BMC Health Monitoring
  2020-05-22  8:43   ` Adrian Ambrożewicz
  2020-05-25 12:32     ` Adrian Ambrożewicz
@ 2020-05-26 17:36     ` Vijay Khemka
  1 sibling, 0 replies; 7+ messages in thread
From: Vijay Khemka @ 2020-05-26 17:36 UTC (permalink / raw)
  To: Adrian Ambrożewicz, Matuszczak, Piotr, Sui Chen,
	openbmc@lists.ozlabs.org

I would suggest keeping this implementation independent of entity-manager and
dbus-sensors, so that users who are not currently using entity-manager can
still use it. I would keep it more generic and support all possible features,
with D-Bus-based storage so that the data is accessible to more processes and
can be fetched by Redfish. I agree with Piotr's proposal.

Sui, do you want to combine your work with my design into a single design
document? I would also like to know how far along your implementation is, so
that I can add whatever I am implementing.

Regards
-Vijay


* Re: Implementing BMC Health Monitoring
  2020-05-25 12:32     ` Adrian Ambrożewicz
  2020-05-26 13:45       ` Brad Bishop
@ 2020-05-26 18:50       ` Vijay Khemka
  1 sibling, 0 replies; 7+ messages in thread
From: Vijay Khemka @ 2020-05-26 18:50 UTC (permalink / raw)
  To: Adrian Ambrożewicz, Matuszczak, Piotr, Sui Chen,
	openbmc@lists.ozlabs.org, Brad Bishop

Adrian,
I agree, and I will discuss with Sui how to merge these into one.

Regards
-Vijay

On 5/25/20, 5:32 AM, "Adrian Ambrożewicz" <adrian.ambrozewicz@linux.intel.com> wrote:

    @Brad, @Vijay
    
    It seems Sui is proposing something highly related to the already discussed
    https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/31957 . As a matter
    of fact, the availability of such system metrics is also highly desirable
    on our (Intel) side. It seems we need to merge all the requirements to
    satisfy the common needs.
    
    Regards,
    Adrian
    
    On 5/22/2020 at 10:43, Adrian Ambrożewicz wrote:
    > I suppose I could back up Piotr here.
    > 
    > I believe that in general EntityManager could be leveraged for 
    > configuration (enabling/disabling metrics and configuring them). 
    > dbus-sensors infrastructure would be beneficial in terms of:
    > - familiarity (already used for monitoring physical sensors, new 
    > synthesized sensors to come)
    > - flexibility (EntityManager could provide runtime configuration of the 
    > metrics in the system)
    > - availability (both configuration and metrics would be exposed through 
    > D-Bus interfaces in an easy-to-consume, 'standardized' way).
    > 
    > If dbus-sensors were used, the feature mentioned by Piotr 
    > (TelemetryService) could, almost 'out of the box', support storing and 
    > exposing metric snapshots, sending them to external databases 
    > (EventService), etc.
    > 
    > Of course, dbus-sensors (xyz.openbmc_project.Sensor.Value) could be only 
    > one of the interfaces for the data, so it does not limit any of the other 
    > use cases you've mentioned.
    > 
    > Regards,
    > Adrian
    > 
    > On 5/21/2020 at 12:47, Matuszczak, Piotr wrote:
    >> Hi,
    >>
    >> The proposal seems interesting. From what I've read in your e-mail, 
    >> you are looking for the best way to implement BMC health metrics. My 
    >> proposal would be to expose these metrics as D-Bus sensors with an 
    >> option to store data to the filesystem. Such a solution would ease the 
    >> integration with Redfish and let the Monitoring Service 
    >> (https://github.com/openbmc/docs/blob/master/designs/telemetry.md) 
    >> support these metrics. This way, you get support for collecting metrics 
    >> into a metric report and for simple operations like min/max/average/sum. 
    >> Also, using metric reports, you can store historical readings and stream 
    >> the metric reports as events.
    >> Piotr Matuszczak
    >> ---------------------------------------------------------------------
    >> Intel Technology Poland sp. z o.o.
    >> ul. Slowackiego 173, 80-298 Gdansk
    >> KRS 101882
    >> NIP 957-07-52-316
    >>
    >> From: openbmc 
    >> <openbmc-bounces+piotr.matuszczak=intel.com@lists.ozlabs.org> On 
    >> Behalf Of Sui Chen
    >> Sent: Thursday, May 21, 2020 3:37 AM
    >> To: openbmc@lists.ozlabs.org
    >> Subject: Implementing BMC Health Monitoring
    >>
    >> Hello OpenBMC Mailing List,
    >>
    >> It is great to see the proposal on BMC health monitoring! We have 
    >> similar efforts in health monitoring in progress, started doing some 
    >> implementation, and would like to share some thoughts with the Mailing 
    >> List to help get BMC health monitoring started:
    >>
    >> (1) What metrics have we considered now?
    >>
    >> We have considered the following metrics on the BMC:
    >>    - Memory usage
    >>    - Number of open file descriptors
    >>    - Free storage space in the read-write file system
    >>    - List of processes
    >>    - CPU time for a few top processes
    >>    Some of these are inspired by various profilers, and some others 
    >> are expected to be relevant to the typical workloads running on the BMC.
    >>
    >> (2) Overall, it appears the design space for health monitoring has the 
    >> following dimensions:
    >>
    >> a) A method to do the collection, which might be:
    >>    - Running a program like "df" to get free disk space
    >>    - Traversing some folder to compute some statistics
    >>    - Monitor some bus for some time and generate some result
    >>    - or something else
    >>    The collection process might vary from metric to metric, and can 
    >> take some time to complete on the BMC, and therefore, the results 
    >> need to be staged somewhere and made accessible when it's completed, 
    >> so the requestor won't have to busy-wait.
    >>
    >> b) A way to stage monitoring data on the BMC, which might be:
    >>    - Files or databases in DRAM or some persistent store.
    >>    - DBus objects, as described in Vijay's document; this is similar 
    >> to how sensors work.
    >>    - IPMI Blobs (this is what we have implemented right now)
    >>    - or something else
    >> c) A way to transfer monitoring data out of the BMC, which might be:
    >>    - scp
    >>    - RedFish
    >>    - IPMI (this is what we're using right now)
    >>    - or something else
    >> d) Format of staged data:
    >>    - Raw bytes
    >>    - Protocol buffers
    >>    - JSON objects
    >>    - or something else
    >>    - The data may be compressed to save transfer time
    >> e) A way to consume the health monitoring data:
    >>    - The BMC might do some pre-processing, like windowed average.
    >>    - The BMC may perform certain corrective measures when metrics 
    >> appear abnormal.
    >>    - The host may perform certain corrective measures when metrics 
    >> appear abnormal.
    >>    - BMC health data might be plugged into some already existing 
    >> monitoring framework overseeing a large number of machines, collecting 
    >> historical data, and projecting future trends, etc.
    >>
    >> f) A way to configure the health monitoring system:
    >>    - Configuration for which metrics are collected
    >>    - Configuration for the choice of staging in b), way of transfer in 
    >> c), and frequency of collection in e)
    >>    - Some configurations may be build-time and some may be run-time
    >>       - I guess we can draw some inspirations from phosphor-ipmi-blobs
    >>
    >> (3) The requirements and performance/storage impacts on the BMC:
    >>
    >> a) The collection should not be too taxing on the processing/storage 
    >> resources on the BMC
    >>
    >> b) The data transfer process should not be too taxing on the link 
    >> between the host and BMC
    >>    - For the metrics we have and the IPMI connection we're using so 
    >> far, it took around 10 ~ 100ms for the host to collect a metric. The 
    >> time is dominated by IPMI transfer time. The time is considered 
    >> acceptable if a metric is collected at a reasonably long interval, 
    >> say, every 30 minutes.
    >>
    >> We hope the above contents help complement the existing design 
    >> proposal, and would like to help actually start implementing (and 
    >> deploying) health monitoring for the BMC.
    >> The question is: we're working on our implementation and we're 
    >> wondering what would be a good time for us to send it for review? Do 
    >> we need to support both what we have now and what is being proposed?
    >>
    >> Thanks!
    >> Sui
    >>
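
A minimal sketch of how three of the metrics listed in (1) and the windowed average mentioned in (2) e) might be collected on a Linux-based BMC. It assumes /proc is mounted and that the kernel reports MemAvailable in /proc/meminfo; it reads os.statvfs instead of running "df". All function and class names here are hypothetical and are not taken from any existing OpenBMC daemon.

```python
import os
from collections import deque


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_used_kb(meminfo):
    """Memory usage, taken here as MemTotal minus MemAvailable (an assumption)."""
    return meminfo["MemTotal"] - meminfo["MemAvailable"]


def count_open_fds(pid="self"):
    """Number of open file descriptors of one process, via /proc/<pid>/fd."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def free_bytes(path):
    """Free space available to unprivileged users on the file system at path."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


class WindowedAverage:
    """Average over the most recent `size` samples."""

    def __init__(self, size):
        self._samples = deque(maxlen=size)

    def add(self, value):
        """Record a sample and return the average of the current window."""
        self._samples.append(value)
        return sum(self._samples) / len(self._samples)


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print("memory used (kB):", memory_used_kb(parse_meminfo(f.read())))
    print("open fds (this process):", count_open_fds())
    print("free bytes on /:", free_bytes("/"))
```

Keeping the parser separate from the file read makes the collection step easy to exercise off-target; where the results are staged (D-Bus objects, IPMI blobs, files) is left open, as in the discussion above.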
    



end of thread, other threads:[~2020-05-26 18:50 UTC | newest]

Thread overview: 7+ messages
2020-05-21  1:36 Implementing BMC Health Monitoring Sui Chen
2020-05-21 10:47 ` Matuszczak, Piotr
2020-05-22  8:43   ` Adrian Ambrożewicz
2020-05-25 12:32     ` Adrian Ambrożewicz
2020-05-26 13:45       ` Brad Bishop
2020-05-26 18:50       ` Vijay Khemka
2020-05-26 17:36     ` Vijay Khemka
