From: Mark Nelson <mnelson@redhat.com>
To: Robert LeBlanc <robert@leblancnet.us>,
	Gregory Farnum <gfarnum@redhat.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: Multiple OSDs suicide because of client issues?
Date: Mon, 23 Nov 2015 13:14:55 -0600
Message-ID: <565365AF.20409@redhat.com>
In-Reply-To: <CAANLjFq-eqTy4XN_YTMyS0M2h4Smo1GNSmuqL+FuDAZDG3yXGg@mail.gmail.com>

FWIW, if you've got collectl per-process logs, you might look for major
page faults associated with the OSD processes.  I've seen process
swapping cause heartbeat timeouts in the past.  I'm not saying that's the
issue, but it's worth confirming it isn't happening.
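
If collectl logs aren't available, the same counter can be read straight
from /proc on a live node. The following is only a rough sketch, not a
collectl replacement: it assumes Linux, assumes the daemons show up with
the process name "ceph-osd", and uses helper names (parse_stat_faults,
osd_fault_counts) that are made up for illustration. It reads the majflt
field (field 12 in proc(5)) from /proc/<pid>/stat.

```python
#!/usr/bin/env python3
"""Report major page-fault counts for running ceph-osd processes (Linux only)."""
import os


def parse_stat_faults(stat_line):
    """Return (comm, minflt, majflt) from one /proc/<pid>/stat line.

    comm may itself contain spaces or parentheses, so split on the LAST ')'.
    """
    head, _, tail = stat_line.rpartition(")")
    comm = head.split("(", 1)[1]
    fields = tail.split()
    # fields[0] is field 3 (state) in proc(5) numbering, so
    # minflt (field 10) is fields[7] and majflt (field 12) is fields[9].
    return comm, int(fields[7]), int(fields[9])


def osd_fault_counts(proc_root="/proc", name="ceph-osd"):
    """Map pid -> majflt for every process whose comm equals `name`.

    The default name "ceph-osd" is an assumption about how the daemon appears.
    """
    counts = {}
    if not os.path.isdir(proc_root):
        return counts  # no procfs here (not Linux)
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                comm, _minflt, majflt = parse_stat_faults(f.read())
        except OSError:
            continue  # process exited while we were scanning
        if comm == name:
            counts[int(entry)] = majflt
    return counts


if __name__ == "__main__":
    for pid, majflt in sorted(osd_fault_counts().items()):
        print(f"pid {pid}: {majflt} major faults since process start")
```

The counter is cumulative since process start, so a single reading says
little. Sampling it twice and comparing is more useful: a count that
climbs around the time of the heartbeat timeouts would be consistent with
the OSD being paged out. Major faults also occur when file-backed pages
are read in from disk, so a rising count is a hint rather than proof of
swapping.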

Mark

On 11/23/2015 01:03 PM, Robert LeBlanc wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> We set the debugging to 0/0, but are you talking about lines like:
>
>     -12> 2015-11-20 20:59:47.138746 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.133 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
>     -11> 2015-11-20 20:59:47.138749 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.136 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
>     -10> 2015-11-20 20:59:47.138751 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.139 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
>      -9> 2015-11-20 20:59:47.138758 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.147 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
>      -8> 2015-11-20 20:59:47.138761 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.159 since back 2015-11-20
> 20:58:51.427880 front 2015-11-20 20:58:51.427880 (cutoff 2015-11-20
> 20:59:27.138720)
>      -7> 2015-11-20 20:59:47.138789 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.170 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
>      -6> 2015-11-20 20:59:47.138794 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.175 since back 2015-11-20
> 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutoff 2015-11-20
> 20:59:27.138720)
>
> There are 10,000 of those lines in the OSD log, which shows all the
> log entries up to the crash, unless setting the value to 0/0 is
> eliminating what you are looking for. I've been wondering whether
> setting it to 0/1, 0/5, or even 0/20 has any runtime performance
> penalty. It seems like more detailed info on crashes would be helpful,
> but we don't want to write too much to the SATADOMs.
>
> We do have the NICs bonded all across our environment.
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Mon, Nov 23, 2015 at 11:14 AM, Gregory Farnum wrote:
>> On Mon, Nov 23, 2015 at 12:03 PM, Robert LeBlanc wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>>>
>>> This is one of our production clusters which is dual 40 Gb Ethernet
>>> using VLANs for cluster and public networks. I don't think this is
>>> unusual, not like my dev cluster which runs InfiniBand and IPoIB. The
>>> client nodes are connected at 10 Gb Ethernet.
>>>
>>> I wonder if you are talking about the system logs, not the Ceph OSD
>>> logs. I'm attaching a snippet that includes the hour before and after.
>>
>> Nope, I meant the OSD logs. Whenever they crash, it should dump out
>> the last 10000 in-memory log entries — the one you sent along didn't
>> have a crash included at all. The exact system which timed out will
>> certainly be in those log entries (it's output at level 1, so unless
>> you manually turned everything to 0, it'll show up on a crash.)
>>
>> Anyway, I wouldn't expect that cluster config to have any issues with
>> a client dying since it's TCP over ethernet, but I have seen some
>> weird behaviors out of bonded NICs when one of them dies, so maybe.
>> -Greg
>>
>>> - ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.3
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWU2LkCRDmVDuy+mK58QAA2EUP/22eOBNzAYDV5lGI4J9Z
> wnSZE39UycEfo8e6v8cfikLdAUT7fbY8HBq+VPylLo7OtxA+sGwgjrcz3hzu
> azRi9QuCeWNm+squPQpgISzXWnpDtSjlsA+7iQb+HJGW7/kcR+opixzMX/W5
> AE0Z/hrRwImw3r7Ze3Avl/j+l7iamUznfZAnaBdeWyle7Nge/D8kV+QJSeHe
> /zXDoWW8wPNiRwU/puJrH/GEzyYVZFZ4F9aPUKf9rXsp0chK5k55yysI8ABL
> CfBLtZ1yXPbD20knMdEyuQrDXWMGQplQ+7Z2qFAKsbp+qMFGNqeIbtA6xmbM
> +8RIXT5hTLmgH6lVLYFbk6wgiSphxTVFrkR4Bm6NzFHnloxZ3KuU1pqOZf2k
> iJZ8eDPfUxuforHO2L8TWMDWAsrqTm5A2u0GFtvm7uPWvxWo6sv08sq5IICD
> C75mnCRUIDGl/bQLxt06qvq7WwAtezwnNcwCth3kDFFS85WTgZGEtPgpFizt
> IpBQI4ustiT6lNmYQr6V2cj4HT1G8YBT1ykKwSYmsbRnT2PWGQc7IJ11DxgC
> E7i0c6UYcOMpWT18t+RTOzvv8AZGpna2X/xTJSPL2H10zIkiuXAwO/gZQ5oa
> mgN/3fdhcki8q7uWbZaBCNtv814sZIoTzQy7C7kApQdxFu+kbe5LHRhHZJbZ
> CExf
> =cjG0
> -----END PGP SIGNATURE-----
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
