From mboxrd@z Thu Jan 1 00:00:00 1970
From: Don Slutz
Subject: Re: [PATCH v2 0/2] Add xen-crashd.
Date: Mon, 2 Dec 2013 12:09:44 -0500
Message-ID: <529CBED8.2000103@CloudSwitch.Com>
References: <1384543221-17634-1-git-send-email-dslutz@terremark.com> <1385720807.20209.58.camel@kazak.uk.xensource.com>
In-Reply-To: <1385720807.20209.58.camel@kazak.uk.xensource.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============8441855469695681043=="
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Ian Campbell
Cc: Keir Fraser, Stefano Stabellini, Andrew Cooper, Ian Jackson, Don Slutz, xen-devel@lists.xen.org, David Vrabel
List-Id: xen-devel@lists.xenproject.org

--===============8441855469695681043==
Content-Type: multipart/alternative; boundary="------------010604090609060202070905"

--------------010604090609060202070905
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: 8bit

On 11/29/13 05:26, Ian Campbell wrote:
> On Fri, 2013-11-15 at 14:20 -0500, Don Slutz wrote:
>
>> Ian Campbell:
>>     Add 1st pass on some documention on crash's remote protocol.
> My concern with this was that we were using some sort of internal crash
> protocol which has no ABI stability guarantees etc. Documenting it in
> the Xen tree doesn't really do anything to alleviate that concern. It
> should be a protocol which is published by the crash folks not us.
I have no issues with this.  The only documentation I can find is:

  http://people.redhat.com/anderson/crash_whitepaper/


> Ideally they would agree to some sort of protocol stability level, or
> maybe you can show that the protocol had inbuilt backward and forward
> compatibility capabilities already?
It may not have the best backward and forward compatibility that could be designed.  However, so far I have been able to add features to a newer crash that cause no issues with older "crashd" servers, and older crash code works fine with the newer "crashd" servers.  This is not the first one of these I have coded, just the first that I can release.
> Even more concerning is [0] where one of the crash maintainers says:
>> It's been deprecated for almost 10 years now.  I don't understand how
>> you have been able to even get it to build, never mind work as the mail
>> thread indicates?
> We surely don't want to be adding code which relies on a protocol which
> has been deprecated for 10 years!
The main reason for the deprecation that I know of is that crash in active mode (i.e. running live on machine A) is just so much simpler to use than a remote crash on machine B talking to a crashd on machine A.  The crashd on machine A still runs in "live" mode, which means that slow or unresponsive systems cannot be examined using the remote protocol.  And keeping the right kernel versions on machine B is just overhead.

With all this in mind, I was not surprised that it had been deprecated for 10 years.  However, with Xen in the mix, machine A no longer needs to be active to run "crashd"; in fact the guest can be paused, running, crashed, shut down, etc.
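
For reference, these domain states are what `xl list` reports in its State column (for example `r-----` and `-b----` in the listing further down). A minimal sketch of decoding that column, assuming the six flag letters described in the xl man page (r, b, p, s, c, d); the names `XL_STATE_FLAGS` and `decode_state` are made up for illustration and are not part of xl or xen-crashd:

```python
# Flag letters of the "State" column printed by `xl list`, in column order.
# The meanings follow the xl man page; this table is an assumption of the sketch.
XL_STATE_FLAGS = (
    ("r", "running"),
    ("b", "blocked"),
    ("p", "paused"),
    ("s", "shutdown"),
    ("c", "crashed"),
    ("d", "dying"),
)


def decode_state(state):
    """Return the names of the flags set in a six-character state string."""
    if len(state) != len(XL_STATE_FLAGS):
        raise ValueError("expected %d flag characters: %r"
                         % (len(XL_STATE_FLAGS), state))
    names = []
    for char, (letter, name) in zip(state, XL_STATE_FLAGS):
        if char == letter:
            names.append(name)
        elif char != "-":
            raise ValueError("unexpected character %r in %r" % (char, state))
    return names


if __name__ == "__main__":
    # The two state strings shown by `xl list` in this mail.
    for state in ("r-----", "-b----"):
        print(state, decode_state(state))
```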


> Daniel K asked about gdbsx -- can that not speak to crash somehow?
It is clearly possible to write a server that translates between crash's remote protocol and the remote gdb protocol, but needing to run two servers to connect up crash is, to me, too complex.  I could also embed the xen-crashd code in gdbsx by adding command line options.  However, very little code would be shared: since I based xen-crashd on xenctx, it currently uses libxc calls, while gdbsx uses ioctl() directly to do the hypercalls.  gdbsx also does not appear to support physical addresses or virtual-to-physical address conversion.  Quoting from the crash whitepaper:
Furthermore, to examine the contents of a live system's kernel internals from user space, the only readily available option has been to use gdb on /proc/kcore. While gdb is an incredibly powerful tool, it is designed to debug user programs, and is not at all "kernel-aware". Consequently, using gdb alone has limited usefulness when looking at kernel memory, essentially constrained to the printing of kernel data structures if the vmlinux file was built with the -g C flag, the disassembly of kernel text, and raw data dumps.
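
To make the address-conversion point concrete: for the linearly mapped parts of an x86_64 kernel, virtual-to-physical conversion is plain arithmetic. The sketch below assumes the memory layout of 2.6.18-era x86_64 kernels (kernel text mapped at 0xffffffff80000000, direct mapping at 0xffff810000000000) and the phys_base=0x200000 that is passed to crash in the session below. vmalloc and module addresses need a real page-table walk, which is not shown. The names are illustrative and are not taken from xen-crashd, gdbsx or crash.

```python
# Assumed layout of a 2.6.18-era x86_64 Linux kernel (constants are
# assumptions of this sketch, not values read from the guest).
START_KERNEL_MAP = 0xFFFFFFFF80000000  # base of the kernel text mapping
MODULES_VADDR = 0xFFFFFFFF88000000     # module space starts here
PAGE_OFFSET = 0xFFFF810000000000       # base of the direct (linear) mapping
VMALLOC_START = 0xFFFFC20000000000     # vmalloc/ioremap space starts here


def virt_to_phys(vaddr, phys_base=0):
    """Translate a linearly mapped kernel virtual address to a physical one."""
    if START_KERNEL_MAP <= vaddr < MODULES_VADDR:
        # Kernel text/data: offset into the image plus its physical load base.
        return vaddr - START_KERNEL_MAP + phys_base
    if PAGE_OFFSET <= vaddr < VMALLOC_START:
        # Direct mapping: physical memory mapped linearly from PAGE_OFFSET.
        return vaddr - PAGE_OFFSET
    # vmalloc, module and user addresses would need a page-table walk.
    raise ValueError("not a linearly mapped kernel address: %#x" % vaddr)


if __name__ == "__main__":
    # Addresses taken from the crash session in this mail.
    print(hex(virt_to_phys(0xFFFFFFFF802EEAE0, phys_base=0x200000)))
    print(hex(virt_to_phys(0xFFFF8100BABD9000, phys_base=0x200000)))
```

Under these assumptions, the swapper TASK address ffffffff802eeae0 maps to physical 0x4eeae0, and the eth1 net_device at ffff8100babd9000 maps to physical 0xbabd9000.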

> Or
> run on /proc/vmcore directly, or be extended to do so?
There is no /proc/vmcore in this case.  Extending dom0 Linux to provide /proc/1/vmcore, /proc/2/vmcore, etc. (i.e. /proc/<domid>/vmcore) would be a big change, and designing a security model for these files would not be quick either.

Maybe this will help:

[root@dcs-xen-54 tmp]# xl list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  2048     8     r-----    3928.9
P-1-0                                        1  3080     1     -b----      18.0


[root@dcs-xen-54 tmp]# /usr/lib/xen/bin/xen-crashd 1&
[1] 1447
[root@dcs-xen-54 tmp]#  2 Dec 13 11:38:01.042 socket ready on port 5001 after 1 bind call

[root@dcs-xen-54 tmp]# crash --machdep phys_base=0x200000 localhost:5001 /usr/lib/debug/lib/modules/2.6.18-128.el5/vmlinux

crash 6.1.4
Copyright (C) 2002-2013  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
 2 Dec 13 11:38:08.917 Accepted a connection.
WARNING: daemon cannot access /proc/version

NOTE: setting phys_base to: 0x200000

GNU gdb (GDB) 7.3.1
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: /usr/lib/debug/lib/modules/2.6.18-128.el5/vmlinux
    DUMPFILE: /dev/mem@localhost  (remote live system)
        CPUS: 1
        DATE: Mon Dec  2 11:37:02 2013
      UPTIME: 00:33:11
LOAD AVERAGE: 0.01, 0.00, 0.00
       TASKS: 81
    NODENAME: P-1-0.TC5.CloudSwitch.com
     RELEASE: 2.6.18-128.el5
     VERSION: #1 SMP Wed Jan 21 10:41:14 EST 2009
     MACHINE: x86_64  (2400 Mhz)
      MEMORY: 3 GB
         PID: 0
     COMMAND: "swapper"
        TASK: ffffffff802eeae0  [THREAD_INFO: ffffffff803dc000]
         CPU: 0
       STATE: TASK_RUNNING (ACTIVE)

crash> net
   NET_DEVICE     NAME   IP ADDRESS(ES)
ffffffff80321e80  lo     127.0.0.1
ffff8100babd9000  eth1   172.16.64.65
ffff8100b6c96000  sit0  
crash> q

[1]+  Done                    /usr/lib/xen/bin/xen-crashd 1

This is almost the same as:

[root@dcs-xen-54 tmp]# xl dump-core 1 p-1-0.vmore
[root@dcs-xen-54 tmp]# crash p-1-0.vmore /usr/lib/debug/lib/modules/2.6.18-128.el5/vmlinux      

crash 6.1.4
Copyright (C) 2002-2013  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 7.3.1
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: /usr/lib/debug/lib/modules/2.6.18-128.el5/vmlinux
    DUMPFILE: p-1-0.vmore
        CPUS: 1
        DATE: Mon Dec  2 11:05:09 2013
      UPTIME: 00:01:18
LOAD AVERAGE: 2.00, 0.70, 0.24
       TASKS: 81
    NODENAME: P-1-0.TC5.CloudSwitch.com
     RELEASE: 2.6.18-128.el5
     VERSION: #1 SMP Wed Jan 21 10:41:14 EST 2009
     MACHINE: x86_64  (2400 Mhz)
      MEMORY: 3 GB
       PANIC: ""
         PID: 0
     COMMAND: "swapper"
        TASK: ffffffff802eeae0  [THREAD_INFO: ffffffff803dc000]
         CPU: 0
       STATE: TASK_RUNNING (ACTIVE)
     WARNING: panic task not found

crash> net
   NET_DEVICE     NAME   IP ADDRESS(ES)
ffffffff80321e80  lo     127.0.0.1
ffff8100babd9000  eth1   172.16.64.65
ffff8100b6c96000  sit0  
crash> quit

With the changes in crash 7.0.4 (yet to be released), crash can be invoked in a remote "not live" mode, which is how it runs on a vmcore file.

So if a DomU is paused, "xl dump-core;crash" and "xen-crashd;crash" will give the exact same answers, and the xen-crashd case takes a lot less real time.


   -Don Slutz

> Ian.
>
> [0]
> http://thread.gmane.org/gmane.linux.kernel.crash-dump.crash-utility/4714/focus=4736


--------------010604090609060202070905--

--===============8441855469695681043==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

--===============8441855469695681043==--