* [LSF/MM TOPIC] block level event logging for storage media management
From: Song Liu @ 2017-01-18 23:34 UTC
To: lsf-pc@lists.linux-foundation.org
Cc: Jens Axboe, Kernel Team, linux-block@vger.kernel.org

Media health monitoring is very important for large-scale distributed storage
systems. Traditionally, enterprise storage controllers maintain event logs for
attached storage devices. However, these controller-managed logs do not scale
well for large-scale distributed systems.

While designing a more flexible and scalable event logging system, we think it
is better to build the log in the block layer. Block-level event logging covers
all major storage protocols (SCSI, SATA, NVMe), and thus minimizes redundant
work across protocols.
In this LSF/MM, we would like to discuss the following topics with the community:

 1. Mechanism for drivers to report events (or errors) to the block layer.
    Basically, we will need a traceable function for the drivers to report
    errors (most likely right before calling end_request or bio_endio).
    A strawman sketch follows at the end of this mail.

 2. Which mechanism (ftrace, BPF, etc.) is preferred for the event logging?

 3. How should we categorize different events?
    There is existing code that translates ATA errors (ata_to_sense_error)
    and NVMe errors (nvme_trans_status_code) to SCSI sense codes, so we can
    leverage the SCSI Key Code Qualifier for event categorization (see the
    second sketch below).

 4. Detailed discussion of the data structures for event logging.

We will be able to show a prototype implementation during LSF/MM.
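
As a strawman for topics 1 and 4 (every name below is made up for
illustration; nothing like this exists in the tree today), the reporting
hook could be a tracepoint fed by a small helper that drivers call right
before completing the I/O:

    /* hypothetical include/trace/events/block_event.h */
    TRACE_EVENT(block_media_event,
            TP_PROTO(struct gendisk *disk, sector_t sector,
                     unsigned int nr_sects, int error, u8 category),
            TP_ARGS(disk, sector, nr_sects, error, category),
            TP_STRUCT__entry(
                    __array(char, disk_name, DISK_NAME_LEN)
                    __field(sector_t, sector)
                    __field(unsigned int, nr_sects)
                    __field(int, error)
                    __field(u8, category)
            ),
            TP_fast_assign(
                    memcpy(__entry->disk_name, disk->disk_name, DISK_NAME_LEN);
                    __entry->sector   = sector;
                    __entry->nr_sects = nr_sects;
                    __entry->error    = error;
                    __entry->category = category;
            ),
            TP_printk("%s sector %llu+%u error %d category %u",
                      __entry->disk_name,
                      (unsigned long long)__entry->sector,
                      __entry->nr_sects, __entry->error, __entry->category)
    );

    /* hypothetical helper, called right before bio_endio()/end_request() */
    static inline void blk_report_media_event(struct bio *bio, int error,
                                              u8 category)
    {
            trace_block_media_event(bio->bi_bdev->bd_disk,
                                    bio->bi_iter.bi_sector,
                                    bio_sectors(bio), error, category);
    }

Because it is a plain tracepoint, much of topic 2 comes for free: the same
event can be consumed through ftrace or perf, or a BPF program can attach
to it for in-kernel filtering and aggregation.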
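
For topic 3, since ata_to_sense_error() and nvme_trans_status_code() already
normalize errors to SCSI sense data, one possible categorization (purely
illustrative; the category names are invented) keys off the sense key from
the Key Code Qualifier:

    /* 0x1/0x3/0x4 are the standard RECOVERED/MEDIUM/HARDWARE ERROR
     * sense keys from SPC */
    enum blk_event_category {
            BLK_EV_OK = 0,
            BLK_EV_RECOVERED,       /* soft error, log only */
            BLK_EV_MEDIUM,          /* media failure */
            BLK_EV_HARDWARE,        /* device/controller failure */
            BLK_EV_OTHER,
    };

    static u8 blk_event_categorize(u8 sense_key)
    {
            switch (sense_key) {
            case 0x0: return BLK_EV_OK;
            case 0x1: return BLK_EV_RECOVERED;
            case 0x3: return BLK_EV_MEDIUM;
            case 0x4: return BLK_EV_HARDWARE;
            default:  return BLK_EV_OTHER;
            }
    }
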
Thanks,
Song

* Re: [LSF/MM TOPIC] block level event logging for storage media management
From: Bart Van Assche @ 2017-01-19 0:11 UTC
To: lsf-pc@lists.linux-foundation.org, songliubraving@fb.com
Cc: Kernel-team@fb.com, linux-block@vger.kernel.org, axboe@fb.com

On Wed, 2017-01-18 at 23:34 +0000, Song Liu wrote:
> Media health monitoring is very important for large-scale distributed
> storage systems.
> [...]
> We will be able to show a prototype implementation during LSF/MM.

I'd like to participate in this discussion.

Bart.

* Re: [LSF/MM TOPIC] block level event logging for storage media management
From: Coly Li @ 2017-01-19 6:32 UTC
To: Song Liu, lsf-pc@lists.linux-foundation.org
Cc: Jens Axboe, Kernel Team, linux-block@vger.kernel.org

On 2017/1/19 上午7:34, Song Liu wrote:
> Media health monitoring is very important for large-scale distributed
> storage systems.
> [...]
> We will be able to show a prototype implementation during LSF/MM.

This is an interesting topic. For stacked block devices, all layers higher
than the fault layer will observe the media error; reporting the underlying
failure in every layer may introduce quite a lot of noise.

Yes, I am willing to attend this discussion. Thanks.

Coly Li

* Re: [LSF/MM TOPIC] block level event logging for storage media management
From: Hannes Reinecke @ 2017-01-19 6:48 UTC
To: Song Liu, lsf-pc@lists.linux-foundation.org
Cc: Jens Axboe, Kernel Team, linux-block@vger.kernel.org

On 01/19/2017 12:34 AM, Song Liu wrote:
> Media health monitoring is very important for large-scale distributed
> storage systems.
> [...]
> We will be able to show a prototype implementation during LSF/MM.

Very good topic; I'm very much in favour of it.
That ties in rather nicely with my multipath redesign, where I've added a
notifier chain for block events.

Cheers,

Hannes
--
Dr. Hannes Reinecke                Teamlead Storage & Networking
hare@suse.de                                   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

* Re: [LSF/MM TOPIC] block level event logging for storage media management
From: Dan Williams @ 2017-01-21 5:46 UTC
To: Song Liu
Cc: lsf-pc@lists.linux-foundation.org, Jens Axboe, Kernel Team, linux-block@vger.kernel.org, Verma, Vishal L

On Wed, Jan 18, 2017 at 3:34 PM, Song Liu <songliubraving@fb.com> wrote:
> Media health monitoring is very important for large-scale distributed
> storage systems.
> [...]
> We will be able to show a prototype implementation during LSF/MM.

Hi Song,

How is this distinct from tracking a badblocks list?

I'm interested in this topic since we have both media error reporting /
scrubbing for nvdimms as well as "SMART" media health retrieval commands.

* Re: [LSF/MM TOPIC] block level event logging for storage media management
From: Song Liu @ 2017-01-23 6:00 UTC
To: Dan Williams
Cc: lsf-pc@lists.linux-foundation.org, Jens Axboe, Kernel Team, linux-block@vger.kernel.org, Verma, Vishal L

Hi Dan,

I think the block-level event log is more of a log-only system. When an
event happens, it is not necessary to take immediate action. (I guess this
is different from a bad block list?)

I would hope the event log to track more information. Some of these
individual events may not be very interesting, for example, soft errors or
latency outliers. However, when we gather event logs for a fleet of
devices, these "soft events" may become valuable for health monitoring.

Thanks,
Song

> On Jan 20, 2017, at 9:46 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>
> How is this distinct from tracking a badblocks list?
>
> I'm interested in this topic since we have both media error reporting /
> scrubbing for nvdimms as well as "SMART" media health retrieval commands.
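
To make the contrast concrete: the existing in-kernel badblocks API
(block/badblocks.c, used by md and nvdimm) answers "is this range bad right
now?", i.e. it tracks state, while the log proposed here records what
happened and when. A minimal sketch of the state-query side:

    #include <linux/badblocks.h>

    /* state query; contrast with an event log, which keeps history */
    static bool range_is_bad(struct badblocks *bb, sector_t sector,
                             int nr_sectors)
    {
            sector_t first_bad;
            int bad_sectors;

            /* badblocks_check() returns > 0 when the range intersects
             * a known bad extent */
            return badblocks_check(bb, sector, nr_sectors,
                                   &first_bad, &bad_sectors) > 0;
    }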

* Re: [LSF/MM TOPIC] block level event logging for storage media management
From: Dan Williams @ 2017-01-23 7:27 UTC
To: Song Liu
Cc: lsf-pc@lists.linux-foundation.org, Jens Axboe, Kernel Team, linux-block@vger.kernel.org, Verma, Vishal L, green

[ adding Oleg ]

On Sun, Jan 22, 2017 at 10:00 PM, Song Liu <songliubraving@fb.com> wrote:
> Hi Dan,
>
> I think the block-level event log is more of a log-only system. When an
> event happens, it is not necessary to take immediate action. (I guess this
> is different from a bad block list?)
>
> I would hope the event log to track more information. Some of these
> individual events may not be very interesting, for example, soft errors or
> latency outliers. However, when we gather event logs for a fleet of
> devices, these "soft events" may become valuable for health monitoring.

I'd be interested in this. It sounds like you're trying to fill a gap
between tracing and console log messages, which I believe others have
encountered as well.

* Re: [LSF/MM TOPIC] block level event logging for storage media management
From: Oleg Drokin @ 2017-01-24 20:18 UTC
To: Dan Williams
Cc: Song Liu, lsf-pc, Jens Axboe, Kernel Team, linux-block, Verma, Vishal L, Andreas Dilger, Greg Kroah-Hartman

On Jan 23, 2017, at 2:27 AM, Dan Williams wrote:
> [ adding Oleg ]
>
> On Sun, Jan 22, 2017 at 10:00 PM, Song Liu <songliubraving@fb.com> wrote:
>> I would hope the event log to track more information. Some of these
>> individual events may not be very interesting, for example, soft errors or
>> latency outliers. However, when we gather event logs for a fleet of
>> devices, these "soft events" may become valuable for health monitoring.
>
> I'd be interested in this. It sounds like you're trying to fill a gap
> between tracing and console log messages, which I believe others have
> encountered as well.

We have a somewhat similar problem in Lustre, and I guess it's not just
Lustre. Currently there is all sorts of conditional debug code all over the
place that goes to the console, and when you enable it for anything verbose,
you quickly overflow your dmesg buffer no matter the size. That might be
mostly ok for local "block level" stuff, but once you become distributed it
starts to be a mess, and once you get to be super large it worsens even
more, since you need to somehow coordinate data from multiple nodes, ensure
all of it is not lost, and still you don't end up using a lot of it since
only a few nodes end up being useful. (I don't know how NFS people manage
to debug complicated issues using just this; it can't be super easy.)

Having some sort of buffer of a (potentially very) large size that stores
the data until it's needed, or that is eagerly polled by some daemon for
storage, would help (especially when you expect a lot of data that
definitely won't fit in RAM).

Tracepoints have the buffer and the daemon, but creating new messages is
very cumbersome, so converting every debug message into one does not look
very feasible. Also it's convenient to have "event masks" of what one wants
logged, which I don't think you could do with tracepoints.

I know you were talking about reporting events to the block layer, but
other than plain errors, what would the block layer do with them? Just a
convenient way to map messages to a particular device? You don't plan to
store it on some block device as part of the block layer, right?

With such a buffer, all sorts of additional generic data might be collected
automatically for all events as part of the buffer format: which CPU
emitted it, time, stack usage information, current pid, backtrace
(tracepoint-alike, could be optional), actual source code location of the
message, and so on.

Having something like that be a standard part of {dev,pr}_{dbg,warn,...}
and friends would be super awesome too, I imagine (adding Greg to CC for
that).
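
A sketch of the per-record header such a buffer might carry, with the
generic fields Oleg lists (all names here are hypothetical):

    /* hypothetical record header; a formatted message of msg_len bytes
     * would follow each header in the ring buffer */
    struct dbg_event_hdr {
            u64             ts_ns;          /* monotonic timestamp */
            u32             cpu;            /* CPU that emitted the event */
            u32             pid;            /* current->pid at emit time */
            u64             mask;           /* subsystem/"event mask" bits */
            const char      *file;          /* source code location */
            unsigned int    line;
            u16             msg_len;
    };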

* Re: [LSF/MM TOPIC] block level event logging for storage media management
From: Song Liu @ 2017-01-24 23:17 UTC
To: Oleg Drokin
Cc: Dan Williams, lsf-pc@lists.linux-foundation.org, Jens Axboe, Kernel Team, linux-block@vger.kernel.org, Verma, Vishal L, Andreas Dilger, Greg Kroah-Hartman

> On Jan 24, 2017, at 12:18 PM, Oleg Drokin <green@linuxhacker.ru> wrote:
>
> Tracepoints have the buffer and the daemon, but creating new messages is
> very cumbersome, so converting every debug message into one does not look
> very feasible. Also it's convenient to have "event masks" of what one wants
> logged, which I don't think you could do with tracepoints.
> [...]

Hi Oleg,

Thanks for sharing these insights.

We built an event logger that parses dmesg to get events. For similar
reasons as you described above, it doesn't work well, and one of the
biggest issues is poor "event mask" support. I am hoping to get better
event masks in a newer implementation, for example with kernel tracing
filters, or by implementing customized logic in BPF.

With a relatively mature infrastructure, we don't have much problem storing
logs from the event logger. Specifically, we use a daemon that collects
events and sends them to distributed storage (HDFS+HIVE). It might be
overkill for smaller deployments.

We do use information from similar (not exactly the one above) logs to make
decisions about device handling. For example, if a drive throws too many
medium errors in a short period of time, we will kick it out of production.
I think it is not necessary to include this in the block layer.

Overall, I am hoping the kernel can generate accurate events, with flexible
filter/mask support. There are different ways to store and consume these
data; I guess most of those will be implemented in user space. Let's
discuss potential use cases and requirements. These discussions should help
us build the kernel part of the event log.

Thanks,
Song

* Re: [Lsf-pc] [LSF/MM TOPIC] block level event logging for storage media management
From: Jan Kara @ 2017-01-25 9:56 UTC
To: Oleg Drokin
Cc: Dan Williams, linux-block, Song Liu, Andreas Dilger, Verma, Vishal L, Jens Axboe, Greg Kroah-Hartman, Kernel Team, lsf-pc

On Tue 24-01-17 15:18:57, Oleg Drokin wrote:
> Tracepoints have the buffer and the daemon, but creating new messages is
> very cumbersome, so converting every debug message into one does not look
> very feasible. Also it's convenient to have "event masks" of what one wants
> logged, which I don't think you could do with tracepoints.

So creating tracepoints IMO isn't that cumbersome. I agree that converting
hundreds or thousands of debug printks into tracepoints is a pain in the
ass, but still it is doable. WRT filtering, you can enable each tracepoint
individually. Granted, that is not exactly the 'event mask' feature you
asked about, but that can be easily scripted in userspace if you give some
structure to tracepoint names. Finally, tracepoints provide a fine-grained
control you never get with printk - e.g. you can make a tracepoint trigger
only if a specific inode is involved, using trace filters, which greatly
reduces the amount of output.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
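
For reference, the filters Jan mentions live in tracefs; for example, to
log only failed block request completions (field names vary by kernel
version, e.g. 'errors' vs. 'error', so check the event's format file
first):

    cd /sys/kernel/debug/tracing
    cat events/block/block_rq_complete/format    # lists filterable fields
    echo 'errors != 0' > events/block/block_rq_complete/filter
    echo 1 > events/block/block_rq_complete/enable
    cat trace_pipe                               # stream failed completions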

* Re: [Lsf-pc] [LSF/MM TOPIC] block level event logging for storage media management
From: Oleg Drokin @ 2017-01-25 18:30 UTC
To: Jan Kara
Cc: Dan Williams, linux-block, Song Liu, Andreas Dilger, Verma, Vishal L, Jens Axboe, Greg Kroah-Hartman, Kernel Team, lsf-pc

On Jan 25, 2017, at 4:56 AM, Jan Kara wrote:
> So creating tracepoints IMO isn't that cumbersome. I agree that converting
> hundreds or thousands of debug printks into tracepoints is a pain in the
> ass, but still it is doable. WRT filtering, you can enable each tracepoint
> individually. [...] Finally, tracepoints provide a fine-grained control
> you never get with printk - e.g. you can make a tracepoint trigger only if
> a specific inode is involved, using trace filters, which greatly reduces
> the amount of output.

Oh, I am not dissing tracepoints, don't get me wrong; they add valuable
things at a fine-grained level when you have the necessary details.

The problem is that sometimes there are bugs where you don't have enough
knowledge beforehand, so you cannot do fine-grained debug. Think of a
10,000-node cluster (heck, make it even 100, or probably even 10) with a
report of "when running a moderately sized job, there's a hang/something
weird/some unexpected data corruption" that does not occur when run on a
single node. Often what you resort to is the shotgun approach, where you
enable all the debug you can everywhere (or selectively, like "everything
in ldlm and everything RPC related"), run the job for however long it takes
to reproduce, and then, once reproduced, sift through those logs
reconstructing the picture, only to discover there was this weird race on
one of the clients, only when some lock was contended but then the grant
RPC and some userspace action coincided, or some such.

dev_dbg() and NFS's /proc/sys/sunrpc/*debug are somewhat similar, only they
dump to dmesg, which is quite limited in buffer size, adds huge delays if
it goes out to some slow console, and wipes other potentially useful
messages from the buffer in the process.

I guess you could script enabling tracepoints with a pattern in their name
too, but then there's the pain in the ass of converting:

    $ git grep CERROR drivers/staging/lustre/ | wc -l
    1069
    $ git grep CDEBUG drivers/staging/lustre/ | wc -l
    1140

messages. AND there's also the fact that I do want many of those to go to
the console (because they are important enough) and to the buffer (so I can
see them relative to other debug messages I do not want on the console).
If tracepoints could be extended to enable that much, I'd be a super happy
camper, of course. Sure, you could just make a macro that wraps the whole
print into a tracepoint, but that would be a stupid tracepoint with no
fine-grained control whatsoever; perhaps we can do named arguments or some
such, so that when you do

    TRACEPOINT(someid, priority, "format string", some_value,
               some_other_value, ...);

then if priority includes TPOINT_CONSOLE it would also always go to the
console as well as the tracepoint buffer, and I can use some_value and
some_other_value as actual matches for things (sure, that would limit you
to just variables with no logic done on them, but that's ok, I guess; they
could always be precalculated if really necessary).

Hm, trying to research whether you can extract the tracepoint buffer from a
kernel crashdump (and whether anybody already happened to write a crash
module for it), I also stumbled upon LKST -
http://elinux.org/Linux_Kernel_State_Tracer (no idea how stale that is, but
the page is from 2011 and the last patch is from 2 years ago) - this also
implements a buffer and all sorts of extra event tracing, so it appears to
underscore that the demand for such things is there and existing mechanisms
don't deliver for one reason or another.
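
A minimal sketch of the dual-sink wrapper described above (every name here
is invented, and trace_printk() is a debugging facility, so a real
implementation would want its own ring buffer):

    /* always record to the trace ring buffer; mirror to the console
     * only when the caller flags the event as important enough */
    #define TPOINT_CONSOLE  (1 << 0)

    #define TRACEPOINT_LOG(id, prio, fmt, ...)                           \
    do {                                                                 \
            trace_printk("[" #id "] " fmt, ##__VA_ARGS__);               \
            if ((prio) & TPOINT_CONSOLE)                                 \
                    pr_warn("[" #id "] " fmt, ##__VA_ARGS__);            \
    } while (0)

so that TRACEPOINT_LOG(ldlm_grant, TPOINT_CONSOLE, "lock %p contended\n",
lock) hits both sinks, while dropping the flag keeps the message
buffer-only.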