OSD deadlock with cephfs client and OSD on same machine

* OSD deadlock with cephfs client and OSD on same machine
@ 2012-05-29  7:44 Amon Ott
  2012-05-29 15:47 ` Sage Weil
  2012-05-29 16:18 ` Tommi Virtanen
  0 siblings, 2 replies; 10+ messages in thread
From: Amon Ott @ 2012-05-29  7:44 UTC (permalink / raw)
  To: ceph-devel

Hello again!

On Linux, if you run an OSD on an ext4 filesystem, have a cephfs kernel client
mount on the same system, and have no syncfs system call (as is to be expected
with libc6 < 2.14 or kernel < 2.6.39), the OSD deadlocks in sys_sync(). Only a
reboot recovers the system.

After some investigation of the code, this is what I found:
In src/common/sync_filesystem.h, the function sync_filesystem() first tries
syncfs() (not available), then a btrfs ioctl sync (not available on
non-btrfs), then finally sync(). sys_sync tries to sync all filesystems,
including the journal device, the OSD storage area and the cephfs mount.
Under some load, when the OSD calls sync(), the cephfs sync waits for the
local OSD, which is already waiting for its storage to sync, which the kernel
wants to do only after the cephfs sync. Deadlock.
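
For illustration, the fallback order described above could be sketched in C
roughly as follows. This is not the code from src/common/sync_filesystem.h:
the name sync_filesystem_of() and the enum are made up for this example, the
btrfs ioctl step is left out, and the raw syscall is used on the assumption
that libc may lack a syncfs() wrapper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Which mechanism ended up being used (names are hypothetical). */
enum sync_method { SYNC_FAILED = -1, SYNCED_ONE_FS = 0, SYNCED_EVERYTHING = 1 };

/*
 * Hypothetical helper, not the Ceph function: flush the filesystem that
 * contains `path`, preferring syncfs and falling back to the global sync().
 */
enum sync_method sync_filesystem_of(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return SYNC_FAILED;

#ifdef SYS_syncfs
    /* Raw syscall, so this also works when libc has no syncfs() wrapper. */
    if (syscall(SYS_syncfs, fd) == 0) {
        close(fd);
        return SYNCED_ONE_FS;
    }
#endif
    close(fd);

    /* Last resort: sync() flushes every mounted filesystem, which is the
       step that can also reach a local cephfs mount. */
    sync();
    return SYNCED_EVERYTHING;
}
```

A caller would pass the OSD data directory and could log a message whenever
SYNCED_EVERYTHING comes back, since that is the path on which the hang was
observed.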

The function sync_filesystem() is called by FileStore::sync_entry() in
src/os/FileStore.cc, but only on non-btrfs storage and if
filestore_fsync_flushes_journal_data is false. After forcing this to true in
the OSD config, our test cluster survived three days of heavy load (and is
still running fine) instead of deadlocking all nodes within an hour.
Reproduced with 0.47.2 and kernel 3.2.18, but the related code seems
unchanged in current master.

Conclusion: If you want to run an OSD and a cephfs kernel client on the same
Linux server and have a libc6 before 2.14 (e.g. Debian's newest in
experimental is 2.13) or a kernel before 2.6.39, either do not use ext4 (but
btrfs is still unstable) or accept the risk of data loss from missed syncs
that comes with the workaround of forcing
filestore_fsync_flushes_journal_data to true.
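
As a rough sketch, the two version conditions above can be checked from
version strings. The function names are invented for this example, and the
result is only a heuristic: a libc with a backported syncfs() would still
report its old version number.

```c
#include <stdio.h>

/*
 * Compare a dotted version string such as "2.13" or "3.2.18" against
 * a.b.c.  Missing components count as 0.  Returns 1 if the version is at
 * least a.b.c, 0 if it is older, -1 if fewer than two numbers parse.
 */
int version_at_least(const char *s, int a, int b, int c)
{
    int v[3] = {0, 0, 0};
    int want[3];
    int i;

    want[0] = a;
    want[1] = b;
    want[2] = c;
    if (sscanf(s, "%d.%d.%d", &v[0], &v[1], &v[2]) < 2)
        return -1;
    for (i = 0; i < 3; i++)
        if (v[i] != want[i])
            return v[i] > want[i];
    return 1;
}

/* Hypothetical check: syncfs is expected with glibc >= 2.14 and
   Linux >= 2.6.39, the two thresholds named in the text. */
int syncfs_expected(const char *libc_version, const char *kernel_release)
{
    return version_at_least(libc_version, 2, 14, 0) == 1 &&
           version_at_least(kernel_release, 2, 6, 39) == 1;
}
```

At runtime the two strings could come from gnu_get_libc_version() in
<gnu/libc-version.h> and from the release field filled in by uname(2).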

Please consider putting out a prominent warning, at least at build time, if
syncfs() is not available, e.g. "No syncfs() syscall, please expect a deadlock
when running osd on non-btrfs together with a local cephfs mount." Even better
would be a quick runtime test for missing syncfs() and storage on non-btrfs
that prints a warning if a deadlock is possible.
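
A minimal sketch of such a runtime test, assuming Linux and glibc, might look
like this. All function names here are hypothetical and this is not what
Ceph implements; the probe uses the raw syscall and statfs f_type, and the
warning text is only an example.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/vfs.h>
#include <unistd.h>

#define BTRFS_MAGIC 0x9123683EUL  /* BTRFS_SUPER_MAGIC in linux/magic.h */

/* Return 1 only if the syncfs syscall succeeds on fd, else 0.
   Note: a successful probe really does sync that one filesystem. */
int probe_syncfs(int fd)
{
#ifdef SYS_syncfs
    return syscall(SYS_syncfs, fd) == 0;
#else
    (void)fd;
    return 0;
#endif
}

/* Pure decision: the global sync() fallback is reached only when syncfs
   is unusable and the store is not on btrfs. */
int global_sync_fallback(int have_syncfs, unsigned long fs_magic)
{
    return !have_syncfs && fs_magic != BTRFS_MAGIC;
}

/* Returns 1 and prints a warning if the store at `path` would fall back
   to sync(), 0 if not, -1 if the path cannot be inspected. */
int warn_if_global_sync(const char *path, FILE *log)
{
    struct statfs sfs;
    int fd, risky;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstatfs(fd, &sfs) != 0) {
        close(fd);
        return -1;
    }
    risky = global_sync_fallback(probe_syncfs(fd), (unsigned long)sfs.f_type);
    close(fd);
    if (risky)
        fprintf(log, "warning: no syncfs() and %s is not on btrfs; falling "
                     "back to sync(), which may hang with a local cephfs "
                     "mount\n", path);
    return risky;
}
```

Keeping the decision in global_sync_fallback() separate from the probing
makes it easy to test without touching a real filesystem.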

As a side effect, the lockup we experienced seems to be a good way to
reproduce the long-standing bug 1047: when our cluster tried to recover, all
MDS instances died with those symptoms. It seems that a partial sync of the
journal or data partition causes that broken state.

Amon Ott
-- 
Dr. Amon Ott
m-privacy GmbH           Tel: +49 30 24342334
Am Köllnischen Park 1    Fax: +49 30 24342336
10179 Berlin             http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946

Geschäftsführer:
 Dipl.-Kfm. Holger Maczkowsky,
 Roman Maczkowsky

GnuPG-Key-ID: 0x2DD3A649
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: OSD deadlock with cephfs client and OSD on same machine
  2012-05-29  7:44 OSD deadlock with cephfs client and OSD on same machine Amon Ott
@ 2012-05-29 15:47 ` Sage Weil
  2012-05-30  7:08   ` Amon Ott
  2012-05-29 16:18 ` Tommi Virtanen
  1 sibling, 1 reply; 10+ messages in thread
From: Sage Weil @ 2012-05-29 15:47 UTC (permalink / raw)
  To: Amon Ott; +Cc: ceph-devel

On Tue, 29 May 2012, Amon Ott wrote:
> Hello again!
> 
> On Linux, if you run OSD on ext4 filesystem, have a cephfs kernel client mount 
> on the same system and no syncfs system call (as to be expected with libc6 < 
> 2.14 or kernel < 2.6.39), OSD deadlocks in sys_sync(). Only reboot recovers 
> the system.
> 
> After some investigation in the code, this is what I found:
> In src/common/sync_filesystem.h, the function sync_filesystem() first tries a 
> syncfs() (not available), then a btrfs ioctrl sync (not available with 
> non-btrfs), then finally a sync(). sys_sync tries to sync all filesystems, 
> including the journal device, the osd storage area and the cephfs mount. 
> Under some load, when OSD calls sync(), cephfs sync waits for the local osd, 
> which already waits for its storage to sync, which the kernel wants to do 
> after the cephfs sync. Deadlock.
> 
> The function sync_filesystem() is called by FileStore::sync_entry() in 
> src/os/FileStore.cc, but only on non-btrfs storage and if 
> filestore_fsync_flushes_journal_data is false. After forcing this to true in 
> OSD config, our test cluster survived three days of heavy load (and still 
> running fine) instead of deadlocking all nodes within an hour. Reproduced 
> with 0.47.2 and kernel 3.2.18, but the related code seems unchanged in 
> current master.
> 
> Conclusion: If you want to run OSD and cephfs kernel client on the same Linux 
> server and have a libc6 before 2.14 (e.g. Debian's newest in experimental is 
> 2.13) or a kernel before 2.6.39, either do not use ext4 (but btrfs is still 
> unstable) or risk data loss by missing syncs through the workaround of 
> forcing filestore_fsync_flushes_journal_data to true.

Note that filestore_fsync_flushes_journal_data should only be set to true
with ext3 and the 'data=ordered' or 'data=journal' mount option.  It is only
an implementation artifact that fsync() will flush all previous writes.
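
The condition above can be written down as a small check on a filesystem
type and its mount option string (for example, as read from /proc/mounts).
This is only a sketch with invented function names; it is deliberately
conservative and requires the data= option to be listed explicitly.

```c
#include <string.h>

/* Return 1 if the comma-separated list `opts` contains exactly `opt`. */
int has_mount_option(const char *opts, const char *opt)
{
    size_t len = strlen(opt);
    const char *p = opts;

    while (*p) {
        const char *end = strchr(p, ',');
        size_t n = end ? (size_t)(end - p) : strlen(p);

        if (n == len && strncmp(p, opt, len) == 0)
            return 1;
        if (!end)
            break;
        p = end + 1;
    }
    return 0;
}

/* Hypothetical check: 1 only for ext3 mounted with data=ordered or
   data=journal, the one case named in the note above. */
int fsync_flush_setting_ok(const char *fstype, const char *opts)
{
    if (strcmp(fstype, "ext3") != 0)
        return 0;
    return has_mount_option(opts, "data=ordered") ||
           has_mount_option(opts, "data=journal");
}
```

For ext4, or for ext3 with data=writeback, the check returns 0, which matches
the data-loss risk mentioned earlier in the thread.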

> Please consider putting out a fat warning at least at build time, if syncfs() 
> is not available, e.g. "No syncfs() syscall, please expect a deadlock when 
> running osd on non-btrfs together with a local cephfs mount." Even better 
> would be a quick runtime test for missing syncfs() and storage on non-btrfs 
> that spits out a warning, if deadlock is possible.

I think a runtime warning makes more sense; nobody will see the build-time
warning (e.g., those who installed debs).

> As a side effect, the experienced lockup seems to be a good way to reproduce 
> the long standing bug 1047 - when our cluster tried to recover, all MDS 
> instances died with those symptoms. It seems that a partial sync of journal 
> or data partition causes that broken state.

Interesting!  If you could also note on that bug what the metadata 
workload was (what was making hard links?), that would be great!

Thanks-
sage



* Re: OSD deadlock with cephfs client and OSD on same machine
  2012-05-29 15:47 ` Sage Weil
@ 2012-05-30  7:08   ` Amon Ott
  2012-06-01  9:35     ` Amon Ott
  0 siblings, 1 reply; 10+ messages in thread
From: Amon Ott @ 2012-05-30  7:08 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Tuesday 29 May 2012 you wrote:
> On Tue, 29 May 2012, Amon Ott wrote:
> > Conclusion: If you want to run OSD and cephfs kernel client on the same
> > Linux server and have a libc6 before 2.14 (e.g. Debian's newest in
> > experimental is 2.13) or a kernel before 2.6.39, either do not use ext4
> > (but btrfs is still unstable) or risk data loss by missing syncs through
> > the workaround of forcing filestore_fsync_flushes_journal_data to true.
>
> Note that fsync_flushed_journal_data should only be set to true with ext3
> and the 'data=ordered' or 'data=journal' mount option.  It is an
> implementation artifact only that fsync() will flush all previous writes.

I am fully aware of that; this is why I mentioned the risk of data loss.

> > Please consider putting out a fat warning at least at build time, if
> > syncfs() is not available, e.g. "No syncfs() syscall, please expect a
> > deadlock when running osd on non-btrfs together with a local cephfs
> > mount." Even better would be a quick runtime test for missing syncfs()
> > and storage on non-btrfs that spits out a warning, if deadlock is
> > possible.
>
> I think a runtime warning makes more sense; nobody will see the build time
> warning (e.g., those installed debs).

Yes, fully agreed.

> > As a side effect, the experienced lockup seems to be a good way to
> > reproduce the long standing bug 1047 - when our cluster tried to recover,
> > all MDS instances died with those symptoms. It seems that a partial sync
> > of journal or data partition causes that broken state.
>
> Interesting!  If you could also note on that bug what the metadata
> workload was (what was making hard links?), that would be great!

We are automatically creating up to 200 preconfigured home directories on all
four nodes; each home directory consists of about 400 directories and files
with about 16 MB of data. AFAIK, there are no hard links involved. So it is a
massively parallel creation of many small files, probably with lots of
metadata for them.

I will add that as a note to the bug, too.

Amon Ott

* Re: OSD deadlock with cephfs client and OSD on same machine
  2012-05-30  7:08   ` Amon Ott
@ 2012-06-01  9:35     ` Amon Ott
  2012-06-01 21:57       ` Tommi Virtanen
  2012-11-05 20:17       ` Cláudio Martins
  0 siblings, 2 replies; 10+ messages in thread
From: Amon Ott @ 2012-06-01  9:35 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Wednesday 30 May 2012 Amon Ott wrote:
> On Tuesday 29 May 2012 you wrote:
> > On Tue, 29 May 2012, Amon Ott wrote:
> > > Please consider putting out a fat warning at least at build time, if
> > > syncfs() is not available, e.g. "No syncfs() syscall, please expect a
> > > deadlock when running osd on non-btrfs together with a local cephfs
> > > mount." Even better would be a quick runtime test for missing syncfs()
> > > and storage on non-btrfs that spits out a warning, if deadlock is
> > > possible.
> >
> > I think a runtime warning makes more sense; nobody will see the build
> > time warning (e.g., those installed debs).
>
> Yes, fully agreed.

Thanks for the new log lines in master git. The warning about missing syncfs()
support could be a bit clearer, though: the system is not only slower, it
hangs and needs a reset and reboot. This is much worse, especially if cephfs
is permanently broken by bug 1047 afterwards. And I am pretty sure that our
systems were not running out of memory, because during our load tests we
always have several GB of unused memory.

After backporting syncfs() support into Debian stable libc6 2.11 and
recompiling Ceph with it, our test cluster is now running with syncfs().

A first two-hour load test this morning did not produce any problems, so I can
say that syncfs() makes it significantly more stable than sync(). We will
run a load test over several days soon.

Amon Ott

* Re: OSD deadlock with cephfs client and OSD on same machine
  2012-06-01  9:35     ` Amon Ott
@ 2012-06-01 21:57       ` Tommi Virtanen
  2012-11-05 20:17       ` Cláudio Martins
  1 sibling, 0 replies; 10+ messages in thread
From: Tommi Virtanen @ 2012-06-01 21:57 UTC (permalink / raw)
  To: Amon Ott; +Cc: Sage Weil, ceph-devel

[Whoops, resending as plain text to make vger happy.]

On Fri, Jun 1, 2012 at 2:35 AM, Amon Ott <a.ott@m-privacy.de> wrote:
> Thanks for the new log lines in master git. The warning without syncfs()
> support could be a bit more clear though - the system is not only slower, it
> hangs needing a reset and reboot. This is much worse, specially if cephfs is

That warning, introduced in
https://github.com/ceph/ceph/commit/07498d66233f388807a458554640cb77424114c0
, is more about running multiple OSDs on a single server, where without
syncfs(2) one OSD syncing causes all of them to sync. It's not related to
your case of loopback mounting, which has *never* worked well, with the
apparent exception of ceph-fuse.

> say that syncfs() makes it significantly more stable than sync(). We will
> make a several day load test soon.

That still won't make it reliable, just less likely to trigger. Good
luck, you'll need it with loopback mounts.


* Re: OSD deadlock with cephfs client and OSD on same machine
  2012-06-01  9:35     ` Amon Ott
  2012-06-01 21:57       ` Tommi Virtanen
@ 2012-11-05 20:17       ` Cláudio Martins
  2012-11-06  7:54         ` Amon Ott
  1 sibling, 1 reply; 10+ messages in thread
From: Cláudio Martins @ 2012-11-05 20:17 UTC (permalink / raw)
  To: Amon Ott; +Cc: Sage Weil, ceph-devel


On Fri, 1 Jun 2012 11:35:37 +0200 Amon Ott <a.ott@m-privacy.de> wrote:
> 
> After backporting syncfs() support into Debian stable libc6 2.11 and 
> recompiling Ceph with it, our test cluster is now running with syncfs().
> 

 Hi,

 We're running OSDs on top of Debian wheezy, which unfortunately has
libc6 2.13. By chance, do you still have that patch to backport syncfs()?

 Thanks in advance

Best regards

Cláudio


* Re: OSD deadlock with cephfs client and OSD on same machine
  2012-11-05 20:17       ` Cláudio Martins
@ 2012-11-06  7:54         ` Amon Ott
  0 siblings, 0 replies; 10+ messages in thread
From: Amon Ott @ 2012-11-06  7:54 UTC (permalink / raw)
  To: Cláudio Martins; +Cc: Amon Ott, Sage Weil, ceph-devel

[-- Attachment #1: Type: text/plain, Size: 955 bytes --]

On 05.11.2012 21:17, Cláudio Martins wrote:
> 
> On Fri, 1 Jun 2012 11:35:37 +0200 Amon Ott <a.ott@m-privacy.de> wrote:
>>
>> After backporting syncfs() support into Debian stable libc6 2.11 and 
>> recompiling Ceph with it, our test cluster is now running with syncfs().
>>
> 
>  Hi,
> 
>  We're running OSDs on top of Debian wheezy, which unfortunately has
> libc6 2.13. By chance, do you still have that patch to backport syncfs()?

Here is the patch we use for Debian Squeeze; it should be easy to port
to Wheezy. The original patch header is still there, but we made small
changes for the other libc version. If you need help, please tell me.

Amon Ott
-- 
Dr. Amon Ott
m-privacy GmbH           Tel: +49 30 24342334
Am Köllnischen Park 1    Fax: +49 30 99296856
10179 Berlin             http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946

Geschäftsführer:
 Dipl.-Kfm. Holger Maczkowsky,
 Roman Maczkowsky

GnuPG-Key-ID: 0x2DD3A649


[-- Attachment #2: syncfs.diff --]
[-- Type: text/x-patch, Size: 6835 bytes --]

From libc-hacker-return-9689-listarch-libc-hacker=sources dot redhat dot com at sourceware dot org Wed Mar 30 11:44:46 2011
Return-Path: <libc-hacker-return-9689-listarch-libc-hacker=sources dot redhat dot com at sourceware dot org>
Delivered-To: listarch-libc-hacker at sources dot redhat dot com
Received: (qmail 9777 invoked by alias); 30 Mar 2011 11:44:46 -0000
Received: (qmail 9761 invoked by uid 22791); 30 Mar 2011 11:44:45 -0000
X-SWARE-Spam-Status: No, hits=-6.1 required=5.0
	tests=AWL,BAYES_00,RCVD_IN_DNSWL_HI,SPF_HELO_PASS,TW_FX,TW_MK,TW_TD,TW_XM,T_RP_MATCHES_RCVD
X-Spam-Check-By: sourceware.org
Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28)
    by sourceware dot org (qpsmtpd/0 dot 43rc1) with ESMTP; Wed, 30 Mar 2011 11:44:35 +0000
Received: from int-mx10.intmail.prod.int.phx2.redhat.com (int-mx10.intmail.prod.int.phx2.redhat.com [10.5.11.23])
	by mx1 dot redhat dot com (8 dot 14 dot 4/8 dot 14 dot 4) with ESMTP id p2UBiZJp016689
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK)
	for <libc-hacker at sourceware dot org>; Wed, 30 Mar 2011 07:44:35 -0400
Received: from hase (ovpn01.gateway.prod.ext.phx2.redhat.com [10.5.9.1])
	by int-mx10 dot intmail dot prod dot int dot phx2 dot redhat dot com (8 dot 14 dot 4/8 dot 14 dot 4) with ESMTP id p2UBiYG9015254
	for <libc-hacker at sourceware dot org>; Wed, 30 Mar 2011 07:44:34 -0400
From: Andreas Schwab <schwab at redhat dot com>
To: libc-hacker at sourceware dot org
Subject: [PATCH] Add syncfs syscall
X-Yow: Hey, LOOK!!  A pair of SIZE 9 CAPRI PANTS!!  They probably belong to
 SAMMY DAVIS, JR dot !!
Date: Wed, 30 Mar 2011 13:44:34 +0200
Message-ID: <m3ei5osuhp.fsf@redhat.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Mailing-List: contact libc-hacker-help at sourceware dot org; run by ezmlm
Precedence: bulk
List-Id: <libc-hacker.sourceware.org>
List-Subscribe: <mailto:libc-hacker-subscribe at sourceware dot org>
List-Archive: <http://sourceware.org/ml/libc-hacker/>
List-Post: <mailto:libc-hacker at sourceware dot org>
List-Help: <mailto:libc-hacker-help at sourceware dot org>, <http://sourceware dot org/ml/#faqs>
Sender: libc-hacker-owner at sourceware dot org
Delivered-To: mailing list libc-hacker at sourceware dot org

2011-03-30  Andreas Schwab  <schwab@redhat.com>

	* Versions.def (libc): Add GLIBC_2.14.
	* misc/syncfs.c: New file.
	* misc/Makefile (routines): Add syncfs.
	* posix/unistd.h: Declare syncfs.
	* sysdeps/unix/syscalls.list: Add syncfs.
---
 Versions.def               |    1 +
 misc/Makefile              |    4 ++--
 misc/Versions              |    3 +++
 misc/syncfs.c              |   33 +++++++++++++++++++++++++++++++++
 posix/unistd.h             |    9 ++++++++-
 sysdeps/unix/syscalls.list |    1 +
 6 files changed, 48 insertions(+), 3 deletions(-)
 create mode 100644 misc/syncfs.c

diff --git a/Versions.def b/Versions.def
index 0ccda50..e478fdd 100644
--- a/Versions.def
+++ b/Versions.def
@@ -30,5 +30,6 @@ libc {
   GLIBC_2.11
   GLIBC_2.12
+  GLIBC_2.14
 %ifdef USE_IN_LIBIO
   HURD_CTHREADS_0.3
 %endif
diff --git a/misc/Makefile b/misc/Makefile
index ee69361..52b13da 100644
--- a/misc/Makefile
+++ b/misc/Makefile
@@ -1,4 +1,4 @@
-# Copyright (C) 1991-2006, 2007, 2009 Free Software Foundation, Inc.
+# Copyright (C) 1991-2006, 2007, 2009, 2011 Free Software Foundation, Inc.
 # This file is part of the GNU C Library.
 
 # The GNU C Library is free software; you can redistribute it and/or
@@ -45,7 +45,7 @@ routines := brk sbrk sstk ioctl \
 	    getdtsz \
 	    gethostname sethostname getdomain setdomain \
 	    select pselect \
-	    acct chroot fsync sync fdatasync reboot \
+	    acct chroot fsync sync fdatasync syncfs reboot \
 	    gethostid sethostid \
 	    vhangup \
 	    swapon swapoff mktemp mkstemp mkstemp64 mkdtemp \
diff --git a/misc/Versions b/misc/Versions
index 3ffe3d1..3a31c7f 100644
--- a/misc/Versions
+++ b/misc/Versions
@@ -143,4 +143,7 @@ libc {
   GLIBC_2.11 {
     mkstemps; mkstemps64; mkostemps; mkostemps64;
   }
+  GLIBC_2.14 {
+    syncfs;
+  }
 }
diff --git a/misc/syncfs.c b/misc/syncfs.c
new file mode 100644
index 0000000..bd7328c
--- /dev/null
+++ b/misc/syncfs.c
@@ -0,0 +1,33 @@
+/* Copyright (C) 2011 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, write to the Free
+   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
+   02111-1307 USA.  */
+
+#include <errno.h>
+#include <unistd.h>
+
+/* Make all changes done to all files on the file system associated
+   with FD actually appear on disk.  */
+int
+syncfs (int fd)
+{
+  __set_errno (ENOSYS);
+  return -1;
+}
+
+
+stub_warning (syncfs)
+#include <stub-tag.h>
diff --git a/posix/unistd.h b/posix/unistd.h
index 5ebcaf1..aa11860 100644
--- a/posix/unistd.h
+++ b/posix/unistd.h
@@ -1,4 +1,4 @@
-/* Copyright (C) 1991-2006, 2007, 2008, 2009 Free Software Foundation, Inc.
+/* Copyright (C) 1991-2009, 2010, 2011 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -974,6 +974,13 @@ extern int fsync (int __fd);
 #endif /* Use BSD || X/Open || Unix98.  */
 
 
+#ifdef __USE_GNU
+/* Make all changes done to all files on the file system associated
+   with FD actually appear on disk.  */
+extern int syncfs (int __fd) __THROW;
+#endif
+
+
 #if defined __USE_BSD || defined __USE_XOPEN_EXTENDED
 
 /* Return identifier for the current host.  */
diff --git a/sysdeps/unix/syscalls.list b/sysdeps/unix/syscalls.list
index 04ed63c..ad49170 100644
--- a/sysdeps/unix/syscalls.list
+++ b/sysdeps/unix/syscalls.list
@@ -55,6 +55,7 @@ swapoff		-	swapoff		i:s	swapoff
 swapon		-	swapon		i:s	swapon
 symlink		-	symlink		i:ss	__symlink	symlink
 sync		-	sync		i:	sync
+syncfs		-	syncfs		i:i	syncfs
 sys_fstat	fxstat	fstat		i:ip	__syscall_fstat
 sys_mknod	xmknod	mknod		i:sii	__syscall_mknod
 sys_stat	xstat	stat		i:sp	__syscall_stat
-- 
1.7.4


-- 
Andreas Schwab, schwab@redhat.com
GPG Key fingerprint = D4E8 DBE3 3813 BB5D FA84  5EC7 45C6 250E 6F00 984E
"And now for something completely different."



* Re: OSD deadlock with cephfs client and OSD on same machine
  2012-05-29  7:44 OSD deadlock with cephfs client and OSD on same machine Amon Ott
  2012-05-29 15:47 ` Sage Weil
@ 2012-05-29 16:18 ` Tommi Virtanen
  2012-05-30  6:59   ` Amon Ott
  1 sibling, 1 reply; 10+ messages in thread
From: Tommi Virtanen @ 2012-05-29 16:18 UTC (permalink / raw)
  To: Amon Ott; +Cc: ceph-devel

On Tue, May 29, 2012 at 12:44 AM, Amon Ott <a.ott@m-privacy.de> wrote:
> On Linux, if you run OSD on ext4 filesystem, have a cephfs kernel client mount
> on the same system and no syncfs system call (as to be expected with libc6 <
> 2.14 or kernel < 2.6.39), OSD deadlocks in sys_sync(). Only reboot recovers
> the system.

This is the classic issue of memory pressure needing free memory to be
relieved. While syncfs(2) may make the hang less common, I do not
think having syncfs(2) is enough; nothing short of having a reserved
memory pool guaranteed to be big enough to handle the request will,
and maintaining that solution is hideously complex.

Loopback NFS suffers from the exact same thing.

Apparently using ceph-fuse moves enough of the processing to user space
that the pageability of userspace memory allows the system to recover.

Here's a fragment of the earlier conversation on this topic. Apologies
for gmane/mail clients breaking the thread, anything with that subject
line is part of the conversation:

http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/1673


* Re: OSD deadlock with cephfs client and OSD on same machine
  2012-05-29 16:18 ` Tommi Virtanen
@ 2012-05-30  6:59   ` Amon Ott
  2012-05-30 17:02     ` Tommi Virtanen
  0 siblings, 1 reply; 10+ messages in thread
From: Amon Ott @ 2012-05-30  6:59 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: ceph-devel

On Tuesday 29 May 2012 Tommi Virtanen wrote:
> On Tue, May 29, 2012 at 12:44 AM, Amon Ott <a.ott@m-privacy.de> wrote:
> > On Linux, if you run OSD on ext4 filesystem, have a cephfs kernel client
> > mount on the same system and no syncfs system call (as to be expected
> > with libc6 < 2.14 or kernel < 2.6.39), OSD deadlocks in sys_sync(). Only
> > reboot recovers the system.
>
> This is the classic issue of memory pressure needing free memory to be
> relieved. While syncfs(2) may make the hang less common, I do not
> think having syncfs(2) is enough; nothing sort of having a reserved
> memory pool guaranteed to be big enough to handle the request will,
> and maintaining that solution is hideously complex.

AFAIR, when the deadlocks occurred, some GB of the 12 GB RAM were still
unused, not even used for caching. But it might be a problem with low memory,
because we are running in 32-bit mode.

Would it be possible to preallocate a significant amount of RAM for the
purpose of syncing? I would not mind reserving a few hundred MB for that, but
deadlocks must not happen in any case. Can the size of the journal give a
hint on how much is needed?

Amon Ott

* Re: OSD deadlock with cephfs client and OSD on same machine
  2012-05-30  6:59   ` Amon Ott
@ 2012-05-30 17:02     ` Tommi Virtanen
  0 siblings, 0 replies; 10+ messages in thread
From: Tommi Virtanen @ 2012-05-30 17:02 UTC (permalink / raw)
  To: Amon Ott; +Cc: ceph-devel

On Tue, May 29, 2012 at 11:59 PM, Amon Ott <a.ott@m-privacy.de> wrote:
> AFAIR, when the deadlocks came, there were some GB of the 12 GB RAM still
> unused, not even for caching. But it might be a problem with low memory,
> because we are running with 32 Bit.
>
> Would it be possible to preallocate a significant amount of RAM for the
> purpose of syncing? I would not mind reserving a few 100 MB for that, but
> deadlocks must not happen in any case. Can the size of the journal give a
> hint on how much is needed?

The code & complexity overhead of managing that reserved buffer has so
far prevented that approach from being really adopted, anywhere in the
Linux kernel community, as far as I know.
