From: Zenon Panoussis <oracle@provocation.net>
To: ceph-devel@vger.kernel.org
Subject: Re: Logging
Date: Tue, 19 Apr 2011 13:19:41 +0200
Message-ID: <4DAD6FCD.5070105@provocation.net>
In-Reply-To: <Pine.LNX.4.64.1104182154160.13931@cobra.newdream.net>



On 04/19/2011 07:02 AM, Sage Weil wrote:

>> The relation between OSD partitions (/dev/mapper/sda6 in the example above)
>> is another interesting factor. As long as the load is under 100%, the
>> partitions on both nodes grow in almost perfect sync. When the load exceeds
>> 100%, one node starts lagging behind the other. If that continues long enough,
>> the lagging node falls out completely while the other node keeps growing.

> This is really interesting.  This is on the partitions that have _just_ 
> the OSD data? 

Yes, with a couple of extra layers. node01 keeps its OSD data on an ext4
filesystem on top of a dm-crypt encrypted native disk partition. node02
on the other hand has an mdadm RAID0 of two partitions on separate disks
with dm-crypt and ext4 on top of that. This layering, in particular the
encryption, consumes CPU and can slow things down, but otherwise it's
rock-solid; I've been running systems with these setups for years and
have never had a single problem with them.

Here's an example from this morning:

node01:
/dev/mapper/sda6        232003      5914    212830   3% /mnt/osd

node02:
/dev/mapper/md4         225716      5704    207112   3% /mnt/osd

client:
192.168.178.100:6789:/
                        232002      5913    212829   3% /mnt/n01

You can see that the total space on the client corresponds to that of node01,
so the OSD of node02 has gone belly up. The load on node01 is creeping upwards
of 200% while rsync on the client keeps smiling and pushing data.
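
The same comparison can be done mechanically. What follows is only a
minimal sketch, under the assumption that the total the client reports is
roughly the sum of the totals of the OSDs that are still up. The helper
names parse_df_line and closest_subset are made up for this illustration,
and the wrapped client line from df has been joined into one line:

```python
from itertools import combinations

def parse_df_line(line):
    """Split one df output line into its device name and numeric columns."""
    fields = line.split()
    return {
        "device": fields[0],
        "total": int(fields[1]),
        "used": int(fields[2]),
        "available": int(fields[3]),
    }

def closest_subset(client_total, node_totals):
    """Return the node names whose summed totals are closest to client_total."""
    best = None
    for size in range(1, len(node_totals) + 1):
        for names in combinations(sorted(node_totals), size):
            diff = abs(client_total - sum(node_totals[n] for n in names))
            if best is None or diff < best[0]:
                best = (diff, names)
    return best[1]

# df lines copied from the example above.
node_lines = {
    "node01": "/dev/mapper/sda6        232003      5914    212830   3% /mnt/osd",
    "node02": "/dev/mapper/md4         225716      5704    207112   3% /mnt/osd",
}
client_line = "192.168.178.100:6789:/ 232002      5913    212829   3% /mnt/n01"

node_totals = {name: parse_df_line(line)["total"] for name, line in node_lines.items()}
client_total = parse_df_line(client_line)["total"]

live = closest_subset(client_total, node_totals)
print("OSDs that appear to back the client mount:", live)
```

With the numbers above, the best match is node01 on its own, which is
what the eyeball comparison suggests as well.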

node01 top:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24793 root      20   0 1679m 510m 1756 S  9.2 25.7  29:23.11 cosd
30235 root      20   0     0    0    0 S  1.0  0.0   0:01.10 kworker/0:1
  637 root      20   0     0    0    0 S  0.7  0.0   4:56.26 jbd2/sda2-8
30468 root      20   0 14988 1152  864 R  0.7  0.1   0:00.14 top
21748 root      20   0  104m  796  504 S  0.3  0.0   1:04.27 watch
29418 root      20   0     0    0    0 S  0.3  0.0   0:02.12 kworker/0:2

node01 iotop:
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
24933 be/4 root      109.97 K/s    7.12 K/s  0.00 % 95.49 % cosd -i 0 -c ~ceph/ceph.conf
24934 be/4 root       94.15 K/s    7.12 K/s  0.00 % 92.45 % cosd -i 0 -c ~ceph/ceph.conf
24830 be/4 root        0.00 B/s   36.39 K/s  0.00 % 81.10 % cosd -i 0 -c ~ceph/ceph.conf
  637 be/3 root        0.00 B/s    0.00 B/s  0.00 % 80.27 % [jbd2/sda2-8]
  256 be/3 root        0.00 B/s    2.37 K/s  0.00 % 72.93 % [jbd2/sda1-8]
24831 be/4 root        0.00 B/s    0.00 B/s  0.00 % 27.85 % cosd -i 0 -c ~ceph/ceph.conf
24826 be/4 root        0.00 B/s  272.94 K/s  0.00 % 19.28 % cosd -i 0 -c ~ceph/ceph.conf
24829 be/4 root        0.00 B/s   45.89 K/s  0.00 % 18.03 % cosd -i 0 -c ~ceph/ceph.conf
24632 be/4 root        0.00 B/s   26.90 K/s  0.00 %  5.99 % cmon -i 0 -c ~ceph/ceph.conf
24556 be/3 root        0.00 B/s    5.54 K/s  0.00 %  2.95 % [jbd2/dm-0-8]
  639 be/3 root        0.00 B/s    0.00 B/s  0.00 %  2.32 % [jbd2/sda5-8]
24833 be/4 root        0.00 B/s   10.28 K/s  0.00 %  0.00 % cosd -i 0 -c ~ceph/ceph.conf


At this point I unmounted ceph on the client and restarted ceph. A few minutes
later I saw this:

node01:
/dev/mapper/sda6        232003      5907    212837   3% /mnt/osd

node02:
/dev/mapper/md4         225716      5626    207190   3% /mnt/osd

Note how disk usage went down on both nodes, considerably on node02.

Then they started exchanging data, and an hour or so later they were back in sync:

node01:
/dev/mapper/sda6        232003      5906    212838   3% /mnt/osd

node02:
/dev/mapper/md4         225716      5906    206910   3% /mnt/osd
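
To make the drift easier to follow, here is a small sketch that tabulates
the "used" column from the three df snapshots above. The snapshot labels
and variable names are mine; the figures are copied from the df output,
in whatever units df reported:

```python
# (label, node01 used, node02 used), taken from the df output above.
snapshots = [
    ("morning, node02 OSD down", 5914, 5704),
    ("a few minutes after restart", 5907, 5626),
    ("about an hour later", 5906, 5906),
]

# How far node02 lags behind node01 at each snapshot.
gaps = [n1 - n2 for _, n1, n2 in snapshots]

# Change in usage on (node01, node02) between consecutive snapshots.
changes = [
    (later[1] - earlier[1], later[2] - earlier[2])
    for earlier, later in zip(snapshots, snapshots[1:])
]

for (label, n1, n2), gap in zip(snapshots, gaps):
    print(f"{label:30s} node01={n1} node02={n2} gap={gap}")
print("changes between snapshots (node01, node02):", changes)
```

The restart shrinks node01 by 7 and node02 by 78, and the following hour
grows node02 by 280 until the gap is zero.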


> Do you see any OSD flapping (down/up cycles) during this 
> period?

I've been running without logs since yesterday, but my experience is that
they don't flap; once an OSD goes down it stays down until ceph is restarted.

> It's possible that the MDS is getting ahead of the OSDs, as there isn't 
> currently any throttling of metadata request processing when the 
> journaling is slow.  (We should fix this.)  I don't see how that would 
> explain the variance in disk usage, though, unless you are also seeing the 
> difference in disk usage reflected in the cosd memory usage on the 
> less-disk-used node?

I didn't pay attention to memory usage, but I think I can rule this out
anyway. node01 has 2 GB RAM and 2 GB swap, node02 has 4 GB RAM and no
swap. Since I saw 11 GB on the node02 OSD the other day and 4 GB on the
node01 OSD, the 7 GB difference could not have been held in memory.
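
Spelled out as arithmetic, as a rough sketch only: it assumes the whole
difference would have to sit in RAM plus swap on one node, and the
function name is made up for the illustration.

```python
def could_fit_in_memory(disk_gap_gb, ram_gb, swap_gb):
    """True if a disk-usage gap of disk_gap_gb could be held in RAM plus swap."""
    return disk_gap_gb <= ram_gb + swap_gb

# Disk usage seen on the OSD partitions the other day, in GB.
node02_used_gb = 11
node01_used_gb = 4
gap_gb = node02_used_gb - node01_used_gb

# node01: 2 GB RAM + 2 GB swap; node02: 4 GB RAM, no swap.
node01_fits = could_fit_in_memory(gap_gb, ram_gb=2, swap_gb=2)
node02_fits = could_fit_in_memory(gap_gb, ram_gb=4, swap_gb=0)

print(f"gap: {gap_gb} GB, fits on node01: {node01_fits}, fits on node02: {node02_fits}")
```

Neither node has more than 4 GB of RAM and swap combined, so a 7 GB gap
does not fit on either of them.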

Z

