From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michael Kluge
Date: Mon, 17 May 2010 06:52:01 +0200
Subject: [Lustre-devel] Lustre RPC visualization
In-Reply-To: <009101caf4f9$67e1dd50$37a597f0$@barton@oracle.com>
References: <000c01cae6ee$1d4693d0$57d3bb70$%barton@oracle.com>
 <4BD8E021.7050302@oracle.com>
 <4BD90FB9.5030702@tu-dresden.de>
 <4BD9CF75.8030204@oracle.com>
 <4BDE8C3C.2050505@tu-dresden.de>
 <699F57EF-52E6-41D1-A04B-3C39D469D133@oracle.com>
 <4BDF1199.2030007@tu-dresden.de>
 <4BDF1CC7.5020502@oracle.com>
 <4BDF24BC.9050701@tu-dresden.de>
 <4BDF2999.2000207@oracle.com>
 <4BEFBB07.4030403@tu-dresden.de>
 <009101caf4f9$67e1dd50$37a597f0$@barton@oracle.com>
Message-ID: <4BF0CB71.6030601@tu-dresden.de>
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: lustre-devel@lists.lustre.org

Eric,

Vampir does scale up to this point (almost, if I remember correctly, but
things might have changed recently). If it does not, we'll make it ;)
We have visualized real trace files with up to 20,000 processes and
artificial trace files with many more.

Heat maps can be emulated. We once had something like a color-coded
counter timeline that produces a heat map. Right now it can be done on
the timeline by placing artificial enter/leave events for functions
called "5%", "10%", and so on. I have done this in the past, so it
should not be too difficult.


Michael

On 16.05.2010 15:12, Eric Barton wrote:
> Excellent :)
>
> How do you think measurements taken from 1000 servers with 100,000
> clients can be visualised? We've used heat maps to visualise
> 10s-100s of concurrent measurements (y) over time (x) but I wonder
> if that will scale. Does Vampir support heat maps?
>
> Cheers,
> Eric
>
>> -----Original Message-----
>> From: Michael Kluge [mailto:Michael.Kluge at tu-dresden.de]
>> Sent: 16 May 2010 10:30 AM
>> To: di.wang
>> Cc: Eric Barton; Andreas Dilger; Robert Read; Galen M.
>>     Shipman; lustre-devel
>> Subject: Re: [Lustre-devel] Lustre RPC visualization
>>
>> Hi WangDi,
>>
>> the first version works. A screenshot is attached. I have implemented a
>> couple of counters: RPCs in flight and RPCs completed in total on the
>> client; RPCs enqueued, RPCs in processing, and RPCs completed in total
>> on the server. All these counters can be broken down by the type of RPC
>> (op code). The picture does not yet have the lines that show each single
>> RPC, I still have to add counters like "avg. time to complete an RPC over
>> the last second", and there are some more TODOs, like the timer
>> synchronization. (In the screenshot the first and the last counter show
>> total values while the one in the middle shows a rate.)
>>
>> What I would like to have is a complete set of traces from a small
>> cluster (<100 nodes) including the servers. Would that be possible?
>>
>> Is one of you in Hamburg May 31 - June 3 for ISC'2010? I'll be there and
>> would like to talk about what would be useful for the next steps.
>>
>>
>> Regards, Michael
>>
>> On 03.05.2010 21:52, di.wang wrote:
>>> Michael Kluge wrote:
>>>>>> One more question: RPC 1334380768266400 (in the log WangDi sent me)
>>>>>> has on the client side only a "Sending RPC" message, thus missing the
>>>>>> "Completed RPC". The server has all three (received, start work, done
>>>>>> work). Has this RPC vanished on the way back to the client? There is
>>>>>> no further indication of what happened. The last timestamp in the
>>>>>> client log is:
>>>>>> 1272565368.228628
>>>>>> and the server says it finished the processing of the request at:
>>>>>> 1272565281.379471
>>>>>> So the client log has been recorded long enough to contain the
>>>>>> "Completed RPC" message for this RPC if it ever arrived ...
>>>>> Logically, yes.
>>>>> But in some cases, some debug logs might be dropped for some
>>>>> reason (actually, this happens not rarely), and you probably need to
>>>>> maintain an average time from server "Handled RPC" to client "Completed
>>>>> RPC", then just estimate the client "Completed RPC" time in this case.
>>>>
>>>> Oh my gosh ;) I don't want to start speculating about the helpfulness
>>>> of incomplete debug logs. Anyway, what can get lost? Any kind of
>>>> message on the servers and clients? I think I'd like to know what
>>>> cases have to be handled while I try to track individual RPCs on
>>>> their way.
>>> Any records can get lost here. Unfortunately, there are no messages
>>> indicating that records went missing. :(
>>> (Usually, I would check the time stamps in the log, i.e. no records for a
>>> "long" time, for example several seconds, but this is not an accurate
>>> way).
>>>
>>> I guess you can just ignore these incomplete records in your first
>>> step? Let's see how these incomplete logs will impact the profiling
>>> result, then we will decide how to deal with this.
>>>
>>> Thanks
>>> Wangdi
>>>>
>>>> Regards, Michael
>>>> _______________________________________________
>>>> Lustre-devel mailing list
>>>> Lustre-devel at lists.lustre.org
>>>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>>>
>>>
>>
>>
>> --
>> Michael Kluge, M.Sc.
>>
>> Technische Universität Dresden
>> Center for Information Services and
>> High Performance Computing (ZIH)
>> D-01062 Dresden
>> Germany
>>
>> Contact:
>> Willersbau, Room WIL A 208
>> Phone: (+49) 351 463-34217
>> Fax: (+49) 351 463-37773
>> e-mail: michael.kluge at tu-dresden.de
>> WWW: http://www.tu-dresden.de/zih
>
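
Note on the fallback suggested in the quoted thread (keep an average time
from server "Handled RPC" to client "Completed RPC" and use it to estimate
a missing client record): a minimal sketch of how that could look is given
here. Everything in it is an assumption made for illustration, not code
from Lustre or Vampir: the function name estimate_completions, the
dictionary layout keyed by RPC xid, and the idea that both timestamps have
already been parsed out of the debug logs as seconds. It also assumes the
clock offset between one client and one server is roughly constant, so
that the offset is absorbed into the average.

```python
from statistics import mean


def estimate_completions(handled, completed):
    """Fill in client completion times that are missing from the log.

    handled:   {xid: server "Handled RPC" timestamp in seconds}
    completed: {xid: client "Completed RPC" timestamp in seconds}

    Returns (filled, estimated), where filled maps every xid seen on the
    server to a completion time and estimated is the set of xids whose
    completion time was guessed rather than read from the log.
    """
    # Delay from server "Handled" to client "Completed" for every RPC
    # that has both records.
    delays = [completed[xid] - handled[xid]
              for xid in handled if xid in completed]

    filled = dict(completed)
    estimated = set()
    if not delays:
        # Nothing to average over, so no estimate can be made.
        return filled, estimated

    avg_delay = mean(delays)
    for xid, t_handled in handled.items():
        if xid not in filled:
            filled[xid] = t_handled + avg_delay
            estimated.add(xid)
    return filled, estimated


if __name__ == "__main__":
    # Hypothetical timestamps, not taken from any real log.
    handled = {1: 10.0, 2: 20.0, 3: 30.0}
    completed = {1: 10.5, 2: 20.25}
    filled, estimated = estimate_completions(handled, completed)
    print(filled[3], sorted(estimated))
```

Returning the estimated xids separately lets the visualization mark
guessed completions differently from recorded ones, or skip them
entirely as a first step.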