qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
@ 2009-10-21  5:13 MORITA Kazutaka
  2009-10-21  8:28 ` [Qemu-devel] " Nikolai K. Bochev
                   ` (7 more replies)
  0 siblings, 8 replies; 32+ messages in thread
From: MORITA Kazutaka @ 2009-10-21  5:13 UTC (permalink / raw)
  To: kvm, qemu-devel, linux-fsdevel

Hi everyone,

Sheepdog is a distributed storage system for KVM/QEMU. It provides
highly available block-level storage volumes to VMs, similar to Amazon EBS.
Sheepdog supports advanced volume management features such as snapshots,
cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
of nodes, and the architecture is fully symmetric; there is no central
node such as a metadata server.

The following list describes the features of Sheepdog.

     * Linear scalability in performance and capacity
     * No single point of failure
     * Redundant architecture (data is written to multiple nodes)
     - Tolerance against network failure
     * Zero configuration (newly added machines will join the cluster automatically)
     - Autonomous load balancing
     * Snapshot
     - Online snapshot from qemu-monitor
     * Clone from a snapshot volume
     * Thin provisioning
     - Amazon EBS API support (to use from a Eucalyptus instance)

(* = current features, - = on our todo list)

More details and download links are here:

http://www.osrg.net/sheepdog/

Note that the code is still in an early stage.
There are some critical TODO items:

     - VM image deletion support
     - Support architectures other than x86_64
     - Data recovery
     - Free space management
     - Guarantee reliability and availability under heavy load
     - Performance improvement
     - Reclaim unused blocks
     - More documentation

We hope to find people interested in working together.
Enjoy!


Here are examples:

- create images

$ kvm-img create -f sheepdog "Alice's Disk" 256G
$ kvm-img create -f sheepdog "Bob's Disk" 256G

- list images

$ shepherd info -t vdi
    40000 : Alice's Disk  256 GB (allocated: 0 MB, shared: 0 MB), 2009-10-15 16:17:18, tag:        0, current
    80000 : Bob's Disk    256 GB (allocated: 0 MB, shared: 0 MB), 2009-10-15 16:29:20, tag:        0, current

- start up a virtual machine

$ kvm --drive format=sheepdog,file="Alice's Disk"

- create a snapshot

$ kvm-img snapshot -c name sheepdog:"Alice's Disk"

- clone from a snapshot

$ kvm-img create -b sheepdog:"Alice's Disk":0 -f sheepdog "Charlie's Disk"
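
The thin provisioning and clone-from-snapshot behaviour shown in the examples can be pictured with a small toy model. This is only a sketch under assumptions: the Volume class, its methods, and the reading of the "allocated" and "shared" columns as counts of privately written and inherited objects are all hypothetical and not taken from the Sheepdog code.

```python
class Volume:
    """Toy thin-provisioned volume: an object exists only once it is written."""

    def __init__(self, size_objects, base=None):
        self.size_objects = size_objects
        self.base = base   # frozen snapshot this volume builds on, if any
        self.own = {}      # object index -> data written to this volume

    def read(self, index):
        if index in self.own:
            return self.own[index]
        if self.base is not None:
            return self.base.read(index)
        return b""         # never-written objects read as empty

    def write(self, index, data):
        self.own[index] = data   # copy-on-write: the base is never modified

    def snapshot(self):
        """Freeze the current contents; later writes go to a fresh layer."""
        frozen = Volume(self.size_objects, self.base)
        frozen.own = self.own
        self.base, self.own = frozen, {}
        return frozen

    def allocated(self):
        """Number of objects written privately to this volume."""
        return len(self.own)

    def shared(self):
        """Objects inherited from the base chain and not overwritten here."""
        seen, layer = set(), self.base
        while layer is not None:
            seen.update(layer.own)
            layer = layer.base
        return len(seen - set(self.own))
```

In this model a freshly created 256 GB image allocates nothing, and a clone such as "Charlie's Disk" starts out sharing every object of the snapshot until it writes its own copy.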


Thanks.

-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazutaka@lab.ntt.co.jp


* [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-21  5:13 [Qemu-devel] [ANNOUNCE] Sheepdog: Distributed Storage System for KVM MORITA Kazutaka
@ 2009-10-21  8:28 ` Nikolai K. Bochev
  2009-10-21  8:45 ` Nikolai K. Bochev
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Nikolai K. Bochev @ 2009-10-21  8:28 UTC (permalink / raw)
  To: MORITA Kazutaka; +Cc: linux-fsdevel, qemu-devel, kvm

Hello, 

When I try to compile, I get the following error (using Ubuntu 9.10, x64):

cd shepherd; make 
make[1]: Entering directory `/home/shiny/Packages/sheepdog-2009102101/shepherd' 
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE shepherd.c -o shepherd.o 
shepherd.c: In function ‘main’: 
shepherd.c:300: warning: dereferencing pointer ‘hdr.55’ does break strict-aliasing rules 
shepherd.c:300: note: initialized from here 
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE treeview.c -o treeview.o 
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE ../lib/event.c -o ../lib/event.o 
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE ../lib/net.c -o ../lib/net.o 
../lib/net.c: In function ‘write_object’: 
../lib/net.c:358: warning: ‘vosts’ may be used uninitialized in this function 
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE ../lib/logger.c -o ../lib/logger.o 
cc shepherd.o treeview.o ../lib/event.o ../lib/net.o ../lib/logger.o -o shepherd -lncurses -lcrypto 
make[1]: Leaving directory `/home/shiny/Packages/sheepdog-2009102101/shepherd' 
cd sheep; make 
make[1]: Entering directory `/home/shiny/Packages/sheepdog-2009102101/sheep' 
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE sheep.c -o sheep.o 
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE store.c -o store.o 
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE net.c -o net.o 
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE work.c -o work.o 
In file included from /usr/include/asm/fcntl.h:1,
                 from /usr/include/linux/fcntl.h:4,
                 from /usr/include/linux/signalfd.h:13,
                 from work.c:31:
/usr/include/asm-generic/fcntl.h:117: error: redefinition of ‘struct flock’ 
/usr/include/asm-generic/fcntl.h:140: error: redefinition of ‘struct flock64’ 
make[1]: *** [work.o] Error 1 
make[1]: Leaving directory `/home/shiny/Packages/sheepdog-2009102101/sheep' 
make: *** [all] Error 2 

The qemu-kvm source with patched support for sheepdog compiles fine. 



* [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-21  5:13 [Qemu-devel] [ANNOUNCE] Sheepdog: Distributed Storage System for KVM MORITA Kazutaka
  2009-10-21  8:28 ` [Qemu-devel] " Nikolai K. Bochev
@ 2009-10-21  8:45 ` Nikolai K. Bochev
  2009-10-23  9:59   ` MORITA Kazutaka
  2009-10-21  9:08 ` [Qemu-devel] " Dietmar Maurer
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 32+ messages in thread
From: Nikolai K. Bochev @ 2009-10-21  8:45 UTC (permalink / raw)
  To: MORITA Kazutaka; +Cc: linux-fsdevel, qemu-devel, kvm

Hello,

I am getting the following error when trying to compile Sheepdog on Ubuntu 9.10 (2.6.31-14, x64):

cd shepherd; make
make[1]: Entering directory `/home/shiny/Packages/sheepdog-2009102101/shepherd'
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE shepherd.c -o shepherd.o
shepherd.c: In function ‘main’:
shepherd.c:300: warning: dereferencing pointer ‘hdr.55’ does break strict-aliasing rules
shepherd.c:300: note: initialized from here
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE treeview.c -o treeview.o
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE ../lib/event.c -o ../lib/event.o
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE ../lib/net.c -o ../lib/net.o
../lib/net.c: In function ‘write_object’:
../lib/net.c:358: warning: ‘vosts’ may be used uninitialized in this function
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE ../lib/logger.c -o ../lib/logger.o
cc shepherd.o treeview.o ../lib/event.o ../lib/net.o ../lib/logger.o -o shepherd -lncurses -lcrypto
make[1]: Leaving directory `/home/shiny/Packages/sheepdog-2009102101/shepherd'
cd sheep; make
make[1]: Entering directory `/home/shiny/Packages/sheepdog-2009102101/sheep'
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE sheep.c -o sheep.o
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE store.c -o store.o
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE net.c -o net.o
cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE work.c -o work.o
In file included from /usr/include/asm/fcntl.h:1,
                 from /usr/include/linux/fcntl.h:4,
                 from /usr/include/linux/signalfd.h:13,
                 from work.c:31:
/usr/include/asm-generic/fcntl.h:117: error: redefinition of ‘struct flock’
/usr/include/asm-generic/fcntl.h:140: error: redefinition of ‘struct flock64’
make[1]: *** [work.o] Error 1
make[1]: Leaving directory `/home/shiny/Packages/sheepdog-2009102101/sheep'
make: *** [all] Error 2

I have all the required libs installed. Patching and compiling qemu-kvm went flawlessly.



* [Qemu-devel] RE: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-21  5:13 [Qemu-devel] [ANNOUNCE] Sheepdog: Distributed Storage System for KVM MORITA Kazutaka
  2009-10-21  8:28 ` [Qemu-devel] " Nikolai K. Bochev
  2009-10-21  8:45 ` Nikolai K. Bochev
@ 2009-10-21  9:08 ` Dietmar Maurer
  2009-10-23 10:06   ` [Qemu-devel] " MORITA Kazutaka
  2009-10-22 15:30 ` [Qemu-devel] " Avi Kivity
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 32+ messages in thread
From: Dietmar Maurer @ 2009-10-21  9:08 UTC (permalink / raw)
  To: MORITA Kazutaka, kvm@vger.kernel.org, qemu-devel@nongnu.org,
	linux-fsdevel@vger.kernel.org

Quite interesting. But would it be possible to use corosync for the cluster communication? The point is that we need corosync anyway for pacemaker; it is written in C (high performance) and seems to implement the features you need.

> -----Original Message-----
> From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On
> Behalf Of MORITA Kazutaka
> Sent: Wednesday, 21 October 2009 07:14
> To: kvm@vger.kernel.org; qemu-devel@nongnu.org;
> linux-fsdevel@vger.kernel.org
> Subject: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
> 
> Hi everyone,
> 
> Sheepdog is a distributed storage system for KVM/QEMU. It provides
> highly available block level storage volumes to VMs like Amazon EBS.
> Sheepdog supports advanced volume management features such as snapshot,
> cloning, and thin provisioning. Sheepdog runs on several tens or
> hundreds
> of nodes, and the architecture is fully symmetric; there is no central
> node such as a meta-data server.


* [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-21  5:13 [Qemu-devel] [ANNOUNCE] Sheepdog: Distributed Storage System for KVM MORITA Kazutaka
                   ` (2 preceding siblings ...)
  2009-10-21  9:08 ` [Qemu-devel] " Dietmar Maurer
@ 2009-10-22 15:30 ` Avi Kivity
  2009-10-22 16:28   ` Anthony Liguori
  2009-10-23 10:41   ` MORITA Kazutaka
  2009-10-22 18:46 ` Avishay Traeger
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 32+ messages in thread
From: Avi Kivity @ 2009-10-22 15:30 UTC (permalink / raw)
  To: MORITA Kazutaka; +Cc: linux-fsdevel, qemu-devel, kvm

On 10/21/2009 07:13 AM, MORITA Kazutaka wrote:
> Hi everyone,
>
> Sheepdog is a distributed storage system for KVM/QEMU. It provides
> highly available block level storage volumes to VMs like Amazon EBS.
> Sheepdog supports advanced volume management features such as snapshot,
> cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
> of nodes, and the architecture is fully symmetric; there is no central
> node such as a meta-data server.

Very interesting!  From a very brief look at the code, it looks like the 
sheepdog block format driver is a network client that is able to access 
highly available images, yes?

If so, is it reasonable to compare this to a cluster file system setup 
(like GFS) with images as files on this filesystem?  The difference 
would be that clustering is implemented in userspace in sheepdog, but in 
the kernel for a clustering filesystem.

How is load balancing implemented?  Can you move an image transparently 
while a guest is running?  Will an image be moved closer to its guest?  
Can you stripe an image across nodes?

Do you support multiple guests accessing the same image?

What about fault tolerance - storing an image redundantly on multiple nodes?

-- 
error compiling committee.c: too many arguments to function


* [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-22 15:30 ` [Qemu-devel] " Avi Kivity
@ 2009-10-22 16:28   ` Anthony Liguori
  2009-10-22 22:09     ` Alexander Graf
  2009-10-23 10:41   ` MORITA Kazutaka
  1 sibling, 1 reply; 32+ messages in thread
From: Anthony Liguori @ 2009-10-22 16:28 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-fsdevel, kvm, MORITA Kazutaka, qemu-devel

Avi Kivity wrote:
> On 10/21/2009 07:13 AM, MORITA Kazutaka wrote:
>> Hi everyone,
>>
>> Sheepdog is a distributed storage system for KVM/QEMU. It provides
>> highly available block level storage volumes to VMs like Amazon EBS.
>> Sheepdog supports advanced volume management features such as snapshot,
>> cloning, and thin provisioning. Sheepdog runs on several tens or 
>> hundreds
>> of nodes, and the architecture is fully symmetric; there is no central
>> node such as a meta-data server.
>
> Very interesting!  From a very brief look at the code, it looks like 
> the sheepdog block format driver is a network client that is able to 
> access highly available images, yes?
>
> If so, is it reasonable to compare this to a cluster file system setup 
> (like GFS) with images as files on this filesystem?  The difference 
> would be that clustering is implemented in userspace in sheepdog, but 
> in the kernel for a clustering filesystem.

I'm still in the process of reading the code, but that's the impression 
I got too.  It made me think that the protocol for qemu to communicate 
with sheepdog could be a filesystem protocol (like 9p) and sheepdog 
could expose itself as a synthetic filesystem.  There are some interesting 
ramifications to something like that--namely that you could mount 
sheepdog on localhost and interact with it through the vfs.

Very interesting stuff; I'm looking forward to examining it more closely.

Regards,

Anthony Liguori


* [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-21  5:13 [Qemu-devel] [ANNOUNCE] Sheepdog: Distributed Storage System for KVM MORITA Kazutaka
                   ` (3 preceding siblings ...)
  2009-10-22 15:30 ` [Qemu-devel] " Avi Kivity
@ 2009-10-22 18:46 ` Avishay Traeger
  2009-10-23 11:22 ` [Qemu-devel] " Dietmar Maurer
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Avishay Traeger @ 2009-10-22 18:46 UTC (permalink / raw)
  To: MORITA Kazutaka; +Cc: linux-fsdevel, qemu-devel, kvm

This looks very interesting - how does this compare with Exanodes/Seanodes?

Thanks,
Avishay


* [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-22 16:28   ` Anthony Liguori
@ 2009-10-22 22:09     ` Alexander Graf
  0 siblings, 0 replies; 32+ messages in thread
From: Alexander Graf @ 2009-10-22 22:09 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: qemu-devel@nongnu.org, linux-fsdevel@vger.kernel.org, Avi Kivity,
	MORITA Kazutaka, kvm@vger.kernel.org


On 22.10.2009 at 18:28, Anthony Liguori <anthony@codemonkey.ws> wrote:

> Avi Kivity wrote:
>> On 10/21/2009 07:13 AM, MORITA Kazutaka wrote:
>>> Hi everyone,
>>>
>>> Sheepdog is a distributed storage system for KVM/QEMU. It provides
>>> highly available block level storage volumes to VMs like Amazon EBS.
>>> Sheepdog supports advanced volume management features such as  
>>> snapshot,
>>> cloning, and thin provisioning. Sheepdog runs on several tens or  
>>> hundreds
>>> of nodes, and the architecture is fully symmetric; there is no  
>>> central
>>> node such as a meta-data server.
>>
>> Very interesting!  From a very brief look at the code, it looks  
>> like the sheepdog block format driver is a network client that is  
>> able to access highly available images, yes?
>>
>> If so, is it reasonable to compare this to a cluster file system  
>> setup (like GFS) with images as files on this filesystem?  The  
>> difference would be that clustering is implemented in userspace in  
>> sheepdog, but in the kernel for a clustering filesystem.
>
> I'm still in the process of reading the code, but that's the  
> impression I got too.  It made me think that the protocol for qemu  
> to communicate with sheepdog could be a filesystem protocol (like 9p)

Speaking about 9p, what's the status there?

Alex


* [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-21  8:45 ` Nikolai K. Bochev
@ 2009-10-23  9:59   ` MORITA Kazutaka
  0 siblings, 0 replies; 32+ messages in thread
From: MORITA Kazutaka @ 2009-10-23  9:59 UTC (permalink / raw)
  To: Nikolai K. Bochev; +Cc: linux-fsdevel, qemu-devel, kvm

Hello,

Does the following patch work for you?

diff --git a/sheep/work.c b/sheep/work.c
index 4df8dc0..45f362d 100644
--- a/sheep/work.c
+++ b/sheep/work.c
@@ -28,6 +28,7 @@
 #include <syscall.h>
 #include <sys/types.h>
 #include <linux/types.h>
+#define _LINUX_FCNTL_H
 #include <linux/signalfd.h>

 #include "list.h"


-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazutaka@lab.ntt.co.jp


* [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-21  9:08 ` [Qemu-devel] " Dietmar Maurer
@ 2009-10-23 10:06   ` MORITA Kazutaka
  2009-10-23 10:17     ` Chris Webb
  2009-10-23 11:10     ` [Qemu-devel] " Dietmar Maurer
  0 siblings, 2 replies; 32+ messages in thread
From: MORITA Kazutaka @ 2009-10-23 10:06 UTC (permalink / raw)
  To: Dietmar Maurer
  Cc: linux-fsdevel@vger.kernel.org, qemu-devel@nongnu.org,
	kvm@vger.kernel.org

We use JGroups (Java library) for reliable multicast communication in
our cluster manager daemon. We don't worry about the performance much
since the cluster manager daemon is not involved in the I/O path. We
might think about moving to corosync if it is more stable than
JGroups.

On Wed, Oct 21, 2009 at 6:08 PM, Dietmar Maurer <dietmar@proxmox.com> wrote:
> Quite interesting. But would it be possible to use corosync for the cluster communication? The point is that we need corosync anyway for pacemaker; it is written in C (high performance) and seems to implement the features you need.

-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazutaka@lab.ntt.co.jp


* [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-23 10:06   ` [Qemu-devel] " MORITA Kazutaka
@ 2009-10-23 10:17     ` Chris Webb
  2009-10-23 10:26       ` Chris Webb
  2009-10-23 11:10     ` [Qemu-devel] " Dietmar Maurer
  1 sibling, 1 reply; 32+ messages in thread
From: Chris Webb @ 2009-10-23 10:17 UTC (permalink / raw)
  To: MORITA Kazutaka
  Cc: linux-fsdevel@vger.kernel.org, Dietmar Maurer,
	kvm@vger.kernel.org, qemu-devel@nongnu.org

MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp> writes:

> We use JGroups (Java library) for reliable multicast communication in
> our cluster manager daemon. We don't worry about the performance much
> since the cluster manager daemon is not involved in the I/O path. We
> might think about moving to corosync if it is more stable than
> JGroups.

I'd love to see this running on top of corosync too. Corosync is a well
tested, stable cluster manager, and doesn't have the JVM dependency of
jgroups so feels more suitable for building 'thin virtualisation fabrics'.

Cheers,

Chris.


* [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-23 10:17     ` Chris Webb
@ 2009-10-23 10:26       ` Chris Webb
  0 siblings, 0 replies; 32+ messages in thread
From: Chris Webb @ 2009-10-23 10:26 UTC (permalink / raw)
  To: MORITA Kazutaka
  Cc: linux-fsdevel@vger.kernel.org, Dietmar Maurer,
	kvm@vger.kernel.org, qemu-devel@nongnu.org

Chris Webb <chris@arachsys.com> writes:

> MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp> writes:
> 
> > We use JGroups (Java library) for reliable multicast communication in
> > our cluster manager daemon. We don't worry about the performance much
> > since the cluster manager daemon is not involved in the I/O path. We
> > might think about moving to corosync if it is more stable than
> > JGroups.
> 
> I'd love to see this running on top of corosync too. Corosync is a well
> tested, stable cluster manager, and doesn't have the JVM dependency of
> jgroups so feels more suitable for building 'thin virtualisation fabrics'.

Very exciting project, by the way!

Best wishes,

Chris.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-22 15:30 ` [Qemu-devel] " Avi Kivity
  2009-10-22 16:28   ` Anthony Liguori
@ 2009-10-23 10:41   ` MORITA Kazutaka
  2009-10-23 11:10     ` Alexander Graf
  2009-10-23 14:14     ` Javier Guerra
  1 sibling, 2 replies; 32+ messages in thread
From: MORITA Kazutaka @ 2009-10-23 10:41 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-fsdevel, qemu-devel, kvm

On Fri, Oct 23, 2009 at 12:30 AM, Avi Kivity <avi@redhat.com> wrote:
> On 10/21/2009 07:13 AM, MORITA Kazutaka wrote:
>>
>> Hi everyone,
>>
>> Sheepdog is a distributed storage system for KVM/QEMU. It provides
>> highly available block level storage volumes to VMs like Amazon EBS.
>> Sheepdog supports advanced volume management features such as snapshot,
>> cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
>> of nodes, and the architecture is fully symmetric; there is no central
>> node such as a meta-data server.
>
> Very interesting!  From a very brief look at the code, it looks like the
> sheepdog block format driver is a network client that is able to access
> highly available images, yes?

Yes. Sheepdog is a simple key-value storage system that
consists of multiple nodes (a bit similar to Amazon Dynamo, I guess).

The qemu Sheepdog driver (client) divides a VM image into fixed-size
objects and stores them on the key-value storage system.
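
As a rough illustration of that split (a sketch only: the 4 MB object
size, the key layout, and the names split_request and object_key are
assumptions made for this example, not taken from the Sheepdog code), a
guest I/O request could be mapped onto objects like this:

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size; the thread does not state one


def split_request(offset, length, object_size=OBJECT_SIZE):
    """Yield (object_index, offset_in_object, chunk_length) for a byte range."""
    end = offset + length
    while offset < end:
        index, inner = divmod(offset, object_size)
        # Never read or write past the end of the current object.
        chunk = min(object_size - inner, end - offset)
        yield index, inner, chunk
        offset += chunk


def object_key(image_id, index):
    # Hypothetical key layout: image id in the high bits, object index in the low bits.
    return (image_id << 32) | index
```

A request that crosses an object boundary simply becomes two smaller
requests, each addressed by its own key in the key-value store.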

> If so, is it reasonable to compare this to a cluster file system setup (like
> GFS) with images as files on this filesystem?  The difference would be that
> clustering is implemented in userspace in sheepdog, but in the kernel for a
> clustering filesystem.

I think that the major difference between sheepdog and cluster file
systems such as Google File system, pNFS, etc is the interface between
clients and a storage system.

> How is load balancing implemented?  Can you move an image transparently
> while a guest is running?  Will an image be moved closer to its guest?

Sheepdog uses consistent hashing to decide where objects are stored; I/O
load is balanced across the nodes. When a new node is added or an
existing node is removed, the hash table changes and the data is
moved between nodes automatically and transparently.
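
A minimal sketch of the general technique, not of the Sheepdog
implementation: the Ring class, the use of SHA-1, the 64 points per node,
and the node and object names are all assumptions chosen for
illustration.

```python
import bisect
import hashlib


def ring_position(label):
    # 64-bit position on the ring, derived from SHA-1 (the hash choice is an assumption).
    return int.from_bytes(hashlib.sha1(label.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, points_per_node=64):
        self.points_per_node = points_per_node
        self._points = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node owns several points so that load spreads more evenly.
        for i in range(self.points_per_node):
            bisect.insort(self._points, (ring_position(f"{node}#{i}"), node))

    def remove(self, node):
        self._points = [p for p in self._points if p[1] != node]

    def lookup(self, key):
        # The owner is the first point at or after the key's position, wrapping around.
        i = bisect.bisect_left(self._points, (ring_position(key),))
        return self._points[i % len(self._points)][1]


ring = Ring(["node-a", "node-b", "node-c"])
keys = [f"object-{i}" for i in range(10000)]
before = {k: ring.lookup(k) for k in keys}
ring.add("node-d")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} objects changed owner")
```

The point of the scheme is visible in the last lines: adding a node only
moves the objects that the new node takes over, and everything else
stays where it was.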

We plan to implement a mechanism to distribute the data not randomly
but intelligently; we could use machine load, the locations of VMs, etc.

> Can you stripe an image across nodes?

Yes, a VM image is divided into multiple objects, and they are
stored across the nodes.

> Do you support multiple guests accessing the same image?

A VM image can be attached to any VM, but to only one VM at a time;
multiple running VMs cannot access the same VM image.

> What about fault tolerance - storing an image redundantly on multiple nodes?

Yes, all objects are replicated to multiple nodes.


-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazutaka@lab.ntt.co.jp

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-23 10:41   ` MORITA Kazutaka
@ 2009-10-23 11:10     ` Alexander Graf
  2009-10-23 16:17       ` MORITA Kazutaka
  2009-10-23 14:14     ` Javier Guerra
  1 sibling, 1 reply; 32+ messages in thread
From: Alexander Graf @ 2009-10-23 11:10 UTC (permalink / raw)
  To: MORITA Kazutaka; +Cc: linux-fsdevel, Avi Kivity, kvm, qemu-devel


On 23.10.2009, at 12:41, MORITA Kazutaka wrote:

> On Fri, Oct 23, 2009 at 12:30 AM, Avi Kivity <avi@redhat.com> wrote:
>
>> How is load balancing implemented?  Can you move an image transparently
>> while a guest is running?  Will an image be moved closer to its guest?
>
> Sheepdog uses consistent hashing to decide where objects are stored; I/O
> load is balanced across the nodes. When a new node is added or an
> existing node is removed, the hash table changes and the data is
> moved between nodes automatically and transparently.
>
> We plan to implement a mechanism to distribute the data not randomly
> but intelligently; we could use machine load, the locations of VMs, etc.

What exactly does balanced mean? Can it cope with individual nodes
having more disk space than others?

>> Do you support multiple guests accessing the same image?
>
> A VM image can be attached to any VM, but to only one VM at a time;
> multiple running VMs cannot access the same VM image.

What about read-only access? Imagine you'd have 5 kvm instances each
accessing it using -snapshot.

Great project btw!

Alex

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Qemu-devel] RE: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-23 10:06   ` [Qemu-devel] " MORITA Kazutaka
  2009-10-23 10:17     ` Chris Webb
@ 2009-10-23 11:10     ` Dietmar Maurer
  2009-10-23 11:45       ` Dietmar Maurer
  1 sibling, 1 reply; 32+ messages in thread
From: Dietmar Maurer @ 2009-10-23 11:10 UTC (permalink / raw)
  To: MORITA Kazutaka
  Cc: linux-fsdevel@vger.kernel.org, qemu-devel@nongnu.org,
	kvm@vger.kernel.org

> We use JGroups (Java library) for reliable multicast communication in
> our cluster manager daemon.

I doubt that there is such a thing as 'reliable multicast' - you will run into many problems when you try to handle errors.

> We don't worry about the performance much
> since the cluster manager daemon is not involved in the I/O path. We
> might think about moving to corosync if it is more stable than
> JGroups.

corosync is already quite stable, and it supports virtual synchrony:

http://en.wikipedia.org/wiki/Virtual_synchrony

Anyways, I do not know JGroups - maybe that 'reliable multicast' solves all network problems somehow - Is there any documentation about how they do it?

- Dietmar

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Qemu-devel] RE: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-21  5:13 [Qemu-devel] [ANNOUNCE] Sheepdog: Distributed Storage System for KVM MORITA Kazutaka
                   ` (4 preceding siblings ...)
  2009-10-22 18:46 ` Avishay Traeger
@ 2009-10-23 11:22 ` Dietmar Maurer
  2009-10-23 19:39 ` [Qemu-devel] " MORITA Kazutaka
  2009-10-28  3:53 ` [Qemu-devel] " MORITA Kazutaka
  7 siblings, 0 replies; 32+ messages in thread
From: Dietmar Maurer @ 2009-10-23 11:22 UTC (permalink / raw)
  To: MORITA Kazutaka, kvm@vger.kernel.org, qemu-devel@nongnu.org,
	linux-fsdevel@vger.kernel.org

Another suggestion: use LVM instead of btrfs (to get better performance)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Qemu-devel] RE: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-23 11:10     ` [Qemu-devel] " Dietmar Maurer
@ 2009-10-23 11:45       ` Dietmar Maurer
  0 siblings, 0 replies; 32+ messages in thread
From: Dietmar Maurer @ 2009-10-23 11:45 UTC (permalink / raw)
  To: Dietmar Maurer, MORITA Kazutaka
  Cc: linux-fsdevel@vger.kernel.org, qemu-devel@nongnu.org,
	kvm@vger.kernel.org

> Anyways, I do not know JGroups - maybe that 'reliable multicast' solves
> all network problems somehow - Is there any documentation about how
> they do it?

OK, found the papers on their web site - quite interesting too.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-23 10:41   ` MORITA Kazutaka
  2009-10-23 11:10     ` Alexander Graf
@ 2009-10-23 14:14     ` Javier Guerra
  2009-10-23 14:58       ` Chris Webb
                         ` (2 more replies)
  1 sibling, 3 replies; 32+ messages in thread
From: Javier Guerra @ 2009-10-23 14:14 UTC (permalink / raw)
  To: MORITA Kazutaka; +Cc: linux-fsdevel, Avi Kivity, kvm, qemu-devel

On Fri, Oct 23, 2009 at 5:41 AM, MORITA Kazutaka
<morita.kazutaka@lab.ntt.co.jp> wrote:
> On Fri, Oct 23, 2009 at 12:30 AM, Avi Kivity <avi@redhat.com> wrote:
>> If so, is it reasonable to compare this to a cluster file system setup (like
>> GFS) with images as files on this filesystem?  The difference would be that
>> clustering is implemented in userspace in sheepdog, but in the kernel for a
>> clustering filesystem.
>
> I think that the major difference between sheepdog and cluster file
> systems such as Google File system, pNFS, etc is the interface between
> clients and a storage system.

note that GFS is "Global File System" (written by Sistina (the same
folks from LVM) and bought by RedHat).  Google Filesystem is a
different thing, and ironically the client/storage interface is a
little more like sheepdog and unlike a regular cluster filesystem.

>> How is load balancing implemented?  Can you move an image transparently
>> while a guest is running?  Will an image be moved closer to its guest?
>
> Sheepdog uses consistent hashing to decide where objects are stored; I/O
> load is balanced across the nodes. When a new node is added or an
> existing node is removed, the hash table changes and the data is
> moved between nodes automatically and transparently.
>
> We plan to implement a mechanism to distribute the data not randomly
> but intelligently; we could use machine load, the locations of VMs, etc.

i don't have much hands-on experience on consistent hashing; but it
sounds reasonable to make each node's ring segment proportional to its
storage capacity.  dynamic load balancing seems a tougher nut to
crack, especially while keeping all clients mapping consistent.
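
a quick sketch of the capacity idea, assuming the usual trick of giving
each node a number of ring points proportional to its capacity.  the
node names, the capacities, the 100 points per capacity unit and the
helper names are made up for this example; nothing here is taken from
the sheepdog code.

```python
import bisect
import hashlib
from collections import Counter


def ring_position(label):
    # 64-bit ring position derived from SHA-1 (the hash choice is an assumption).
    return int.from_bytes(hashlib.sha1(label.encode()).digest()[:8], "big")


def build_ring(capacities, points_per_unit=100):
    """Give every node a number of ring points proportional to its capacity."""
    ring = []
    for node, capacity in capacities.items():
        for i in range(capacity * points_per_unit):
            ring.append((ring_position(f"{node}#{i}"), node))
    ring.sort()
    return ring


def lookup(ring, key):
    # First point at or after the key's position, wrapping around the ring.
    i = bisect.bisect_left(ring, (ring_position(key),))
    return ring[i % len(ring)][1]


capacities = {"small": 1, "large": 3}  # hypothetical nodes, capacity in arbitrary units
ring = build_ring(capacities)
counts = Counter(lookup(ring, f"object-{i}") for i in range(20000))
print(counts)
```

with enough points per node the share of objects tracks the share of
capacity fairly closely, and the mapping is still a pure function of the
node list, so all clients agree on it.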

>> Do you support multiple guests accessing the same image?
>
> A VM image can be attached to any VM, but to only one VM at a time;
> multiple running VMs cannot access the same VM image.

this is a must-have safety measure; but a 'manual override' is quite
useful for those that know how to manage a cluster-aware filesystem
inside a VM image, maybe like Xen's "w!" flag does.  just be sure to
avoid distributed caching for a shared image!

in all, great project, and with such a clean patch into KVM/Qemu, high
hopes of making it into regular use.

i'd just want to add my '+1 votes' on both getting rid of JVM
dependency and using block devices (usually LVM) instead of ext3/btrfs

-- 
Javier

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-23 14:14     ` Javier Guerra
@ 2009-10-23 14:58       ` Chris Webb
  2009-10-23 15:10         ` Javier Guerra
  2009-10-23 17:05         ` Tomasz Chmielewski
  2009-10-23 15:40       ` FUJITA Tomonori
  2009-10-25  8:51       ` [Qemu-devel] " Dietmar Maurer
  2 siblings, 2 replies; 32+ messages in thread
From: Chris Webb @ 2009-10-23 14:58 UTC (permalink / raw)
  To: Javier Guerra; +Cc: linux-fsdevel, qemu-devel, Avi Kivity, MORITA Kazutaka, kvm

Javier Guerra <javier@guerrag.com> writes:

> i'd just want to add my '+1 votes' on both getting rid of JVM
> dependency and using block devices (usually LVM) instead of ext3/btrfs

If the chunks into which the virtual drives are split are quite small (say
the 64MB used by Hadoop), LVM may be a less appropriate choice. It doesn't
support very large numbers of very small logical volumes very well.

Best wishes,

Chris.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-23 14:58       ` Chris Webb
@ 2009-10-23 15:10         ` Javier Guerra
  2009-10-23 17:05         ` Tomasz Chmielewski
  1 sibling, 0 replies; 32+ messages in thread
From: Javier Guerra @ 2009-10-23 15:10 UTC (permalink / raw)
  To: Chris Webb; +Cc: linux-fsdevel, qemu-devel, Avi Kivity, MORITA Kazutaka, kvm

On Fri, Oct 23, 2009 at 9:58 AM, Chris Webb <chris@arachsys.com> wrote:
> If the chunks into which the virtual drives are split are quite small (say
> the 64MB used by Hadoop), LVM may be a less appropriate choice. It doesn't
> support very large numbers of very small logical volumes very well.

absolutely.  the 'nicest' way to do it would be to use a single block
device per sheep process, and do the splitting there.

it's an extra layer of code, and once you add non-naïve behavior for
deleting and fragmentation, you quickly approach filesystem-like
complexity.....

unless you can do some very clever mapping that reuses the consistent
hash algorithms to find not only which server(s) you want, but also
which chunk to hit....  the kind of things i'd love to code, but never
found the use for it.

i'll definitely dig deeper in the code.

-- 
Javier

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-23 14:14     ` Javier Guerra
  2009-10-23 14:58       ` Chris Webb
@ 2009-10-23 15:40       ` FUJITA Tomonori
  2009-10-25  5:36         ` Avi Kivity
  2009-10-25  8:51       ` [Qemu-devel] " Dietmar Maurer
  2 siblings, 1 reply; 32+ messages in thread
From: FUJITA Tomonori @ 2009-10-23 15:40 UTC (permalink / raw)
  To: javier; +Cc: qemu-devel, linux-fsdevel, avi, morita.kazutaka, kvm

On Fri, 23 Oct 2009 09:14:29 -0500
Javier Guerra <javier@guerrag.com> wrote:

> > I think that the major difference between sheepdog and cluster file
> > systems such as Google File system, pNFS, etc is the interface between
> > clients and a storage system.
> 
> note that GFS is "Global File System" (written by Sistina (the same
> folks from LVM) and bought by RedHat).  Google Filesystem is a
> different thing, and ironically the client/storage interface is a
> little more like sheepdog and unlike a regular cluster filesystem.

Hmm, Avi referred to Global File System? I wasn't sure. 'GFS' is
ambiguous. Anyway, Global File System is a SAN file system. It's
a completely different architecture from Sheepdog.


> > Sheepdog uses consistent hashing to decide where objects are stored; I/O
> > load is balanced across the nodes. When a new node is added or an
> > existing node is removed, the hash table changes and the data is
> > moved between nodes automatically and transparently.
> >
> > We plan to implement a mechanism to distribute the data not randomly
> > but intelligently; we could use machine load, the locations of VMs, etc.
> 
> i don't have much hands-on experience on consistent hashing; but it
> sounds reasonable to make each node's ring segment proportional to its
> storage capacity.

Yeah, that's one of the techniques, I think.


>  dynamic load balancing seems a tougher nut to
> crack, especially while keeping all clients mapping consistent.

There are some techniques to do that.

We think that there are some existing techniques to distribute data
intelligently. We just have not analyzed the options.


> i'd just want to add my '+1 votes' on both getting rid of JVM
> dependency and using block devices (usually LVM) instead of ext3/btrfs

LVM doesn't fit our requirements well. What we need is to update
some objects atomically. We could implement that ourselves, but
we prefer to keep our code simple by using an existing mechanism.
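
For illustration only: one well-known filesystem mechanism that gives
all-or-nothing replacement of a single object file is writing to a
temporary file and then renaming it over the target. This sketch shows
that general pattern; it is an assumption made for the example, not a
description of the mechanism Sheepdog relies on, and the function name
atomic_write is hypothetical.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at path with data so readers see the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make the new content durable first
        os.replace(tmp_path, path)  # rename(2): atomic on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Persist the rename itself by syncing the containing directory.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A raw block device or logical volume offers no equivalent of rename, so
the same guarantee would need extra code such as a journal.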

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-23 11:10     ` Alexander Graf
@ 2009-10-23 16:17       ` MORITA Kazutaka
  0 siblings, 0 replies; 32+ messages in thread
From: MORITA Kazutaka @ 2009-10-23 16:17 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linux-fsdevel, Avi Kivity, kvm, qemu-devel

On Fri, Oct 23, 2009 at 8:10 PM, Alexander Graf <agraf@suse.de> wrote:
>
> On 23.10.2009, at 12:41, MORITA Kazutaka wrote:
>
>> On Fri, Oct 23, 2009 at 12:30 AM, Avi Kivity <avi@redhat.com> wrote:
>>
>>> How is load balancing implemented?  Can you move an image transparently
>>> while a guest is running?  Will an image be moved closer to its guest?
>>
>> Sheepdog uses consistent hashing to decide where objects are stored; I/O
>> load is balanced across the nodes. When a new node is added or an
>> existing node is removed, the hash table changes and the data is
>> moved between nodes automatically and transparently.
>>
>> We plan to implement a mechanism to distribute the data not randomly
>> but intelligently; we could use machine load, the locations of VMs, etc.
>
> What exactly does balanced mean? Can it cope with individual nodes having
> more disk space than others?

I mean that objects are uniformly distributed over the nodes by the hash function.
Distribution based on free disk space is one of our TODOs.

>>> Do you support multiple guests accessing the same image?
>>
>> A VM image can be attached to any VM, but to only one VM at a time;
>> multiple running VMs cannot access the same VM image.
>
> What about read-only access? Imagine you'd have 5 kvm instances each
> accessing it using -snapshot.

By creating new clone images from an existing snapshot image, you can do
something similar.
Sheepdog can create a clone image instantly.

-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazutaka@lab.ntt.co.jp

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-23 14:58       ` Chris Webb
  2009-10-23 15:10         ` Javier Guerra
@ 2009-10-23 17:05         ` Tomasz Chmielewski
  2009-10-25  8:44           ` Dietmar Maurer
  1 sibling, 1 reply; 32+ messages in thread
From: Tomasz Chmielewski @ 2009-10-23 17:05 UTC (permalink / raw)
  To: Chris Webb
  Cc: kvm, qemu-devel, Avi Kivity, Javier Guerra, linux-fsdevel,
	MORITA Kazutaka

Chris Webb wrote:
> Javier Guerra <javier@guerrag.com> writes:
> 
>> i'd just want to add my '+1 votes' on both getting rid of JVM
>> dependency and using block devices (usually LVM) instead of ext3/btrfs
> 
> If the chunks into which the virtual drives are split are quite small (say
> the 64MB used by Hadoop), LVM may be a less appropriate choice. It doesn't
> support very large numbers of very small logical volumes very well.

Also, on _loaded_ systems, I noticed creating/removing logical volumes 
can take really long (several minutes); where allocating a file of a 
given size would just take a fraction of that.

Not sure how it would matter here, but probably it would.

-- 
Tomasz Chmielewski
http://wpkg.org

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-21  5:13 [Qemu-devel] [ANNOUNCE] Sheepdog: Distributed Storage System for KVM MORITA Kazutaka
                   ` (5 preceding siblings ...)
  2009-10-23 11:22 ` [Qemu-devel] " Dietmar Maurer
@ 2009-10-23 19:39 ` MORITA Kazutaka
  2009-10-23 19:45   ` Javier Guerra
  2009-10-28  3:53 ` [Qemu-devel] " MORITA Kazutaka
  7 siblings, 1 reply; 32+ messages in thread
From: MORITA Kazutaka @ 2009-10-23 19:39 UTC (permalink / raw)
  To: kvm, qemu-devel, linux-fsdevel

Hi,

Thanks for the many comments.

The Sheepdog git trees have been created.

  Sheepdog server
    git://sheepdog.git.sourceforge.net/gitroot/sheepdog/sheepdog

  Sheepdog client
    git://sheepdog.git.sourceforge.net/gitroot/sheepdog/qemu-kvm

Please try!

On Wed, Oct 21, 2009 at 2:13 PM, MORITA Kazutaka
<morita.kazutaka@lab.ntt.co.jp> wrote:
> Hi everyone,
>
> Sheepdog is a distributed storage system for KVM/QEMU. It provides
> highly available block level storage volumes to VMs like Amazon EBS.
> Sheepdog supports advanced volume management features such as snapshot,
> cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
> of nodes, and the architecture is fully symmetric; there is no central
> node such as a meta-data server.
>
> The following list describes the features of Sheepdog.
>
>    * Linear scalability in performance and capacity
>    * No single point of failure
>    * Redundant architecture (data is written to multiple nodes)
>    - Tolerance against network failure
>    * Zero configuration (newly added machines will join the cluster automatically)
>    - Autonomous load balancing
>    * Snapshot
>    - Online snapshot from qemu-monitor
>    * Clone from a snapshot volume
>    * Thin provisioning
>    - Amazon EBS API support (to use from a Eucalyptus instance)
>
> (* = current features, - = on our todo list)
>
> More details and download links are here:
>
> http://www.osrg.net/sheepdog/
>
> Note that the code is still in an early stage.
> There are some critical TODO items:
>
>    - VM image deletion support
>    - Support architectures other than X86_64
>    - Data recoverys
>    - Free space management
>    - Guarantee reliability and availability under heavy load
>    - Performance improvement
>    - Reclaim unused blocks
>    - More documentation
>
> We hope finding people interested in working together.
> Enjoy!
>
>
> Here are examples:
>
> - create images
>
> $ kvm-img create -f sheepdog "Alice's Disk" 256G
> $ kvm-img create -f sheepdog "Bob's Disk" 256G
>
> - list images
>
> $ shepherd info -t vdi
>   40000 : Alice's Disk  256 GB (allocated: 0 MB, shared: 0 MB), 2009-10-15 16:17:18, tag:        0, current
>   80000 : Bob's Disk    256 GB (allocated: 0 MB, shared: 0 MB), 2009-10-15 16:29:20, tag:        0, current
>
> - start up a virtual machine
>
> $ kvm --drive format=sheepdog,file="Alice's Disk"
>
> - create a snapshot
>
> $ kvm-img snapshot -c name sheepdog:"Alice's Disk"
>
> - clone from a snapshot
>
> $ kvm-img create -b sheepdog:"Alice's Disk":0 -f sheepdog "Charlie's Disk"
>
>
> Thanks.
>
> --
> MORITA, Kazutaka
>
> NTT Cyber Space Labs
> OSS Computing Project
> Kernel Group
> E-mail: morita.kazutaka@lab.ntt.co.jp
>
>
>
>



-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazutaka@lab.ntt.co.jp

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-23 19:39 ` [Qemu-devel] " MORITA Kazutaka
@ 2009-10-23 19:45   ` Javier Guerra
  2009-10-24  2:49     ` MORITA Kazutaka
  0 siblings, 1 reply; 32+ messages in thread
From: Javier Guerra @ 2009-10-23 19:45 UTC (permalink / raw)
  To: MORITA Kazutaka; +Cc: linux-fsdevel, qemu-devel, kvm

On Fri, Oct 23, 2009 at 2:39 PM, MORITA Kazutaka
<morita.kazutaka@lab.ntt.co.jp> wrote:
> Thanks for many comments.
>
> Sheepdog git trees are created.

great!

is there any client (no matter how crude) besides the patched
KVM/Qemu?  it would make it far easier to hack around...

-- 
Javier

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-23 19:45   ` Javier Guerra
@ 2009-10-24  2:49     ` MORITA Kazutaka
  0 siblings, 0 replies; 32+ messages in thread
From: MORITA Kazutaka @ 2009-10-24  2:49 UTC (permalink / raw)
  To: Javier Guerra; +Cc: linux-fsdevel, qemu-devel, kvm

On Sat, Oct 24, 2009 at 4:45 AM, Javier Guerra <javier@guerrag.com> wrote:
> On Fri, Oct 23, 2009 at 2:39 PM, MORITA Kazutaka
> <morita.kazutaka@lab.ntt.co.jp> wrote:
>> Thanks for many comments.
>>
>> Sheepdog git trees are created.
>
> great!
>
> is there any client (no matter how crude) besides the patched
> KVM/Qemu?  it would make it far easier to hack around...

No, there isn't. Sorry.
I think we should provide a test client as soon as possible.

-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazutaka@lab.ntt.co.jp

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-23 15:40       ` FUJITA Tomonori
@ 2009-10-25  5:36         ` Avi Kivity
  0 siblings, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2009-10-25  5:36 UTC (permalink / raw)
  To: FUJITA Tomonori; +Cc: qemu-devel, linux-fsdevel, kvm, morita.kazutaka, javier

On 10/23/2009 05:40 PM, FUJITA Tomonori wrote:
> On Fri, 23 Oct 2009 09:14:29 -0500
> Javier Guerra<javier@guerrag.com>  wrote:
>
>    
>>> I think that the major difference between sheepdog and cluster file
>>> systems such as Google File system, pNFS, etc is the interface between
>>> clients and a storage system.
>>>        
>> note that GFS is "Global File System" (written by Sistina (the same
>> folks from LVM) and bought by RedHat).  Google Filesystem is a
>> different thing, and ironically the client/storage interface is a
>> little more like sheepdog and unlike a regular cluster filesystem.
>>      
> Hmm, Avi referred to Global File System? I wasn't sure. 'GFS' is
> ambiguous. Anyway, Global File System is a SAN file system. It's
> a completely different architecture from Sheepdog.
>    

I did, and yes, it is completely different since you don't require 
central storage.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-23 17:05         ` Tomasz Chmielewski
@ 2009-10-25  8:44           ` Dietmar Maurer
  2009-10-25 10:55             ` Tomasz Chmielewski
  0 siblings, 1 reply; 32+ messages in thread
From: Dietmar Maurer @ 2009-10-25  8:44 UTC (permalink / raw)
  To: Tomasz Chmielewski, Chris Webb
  Cc: kvm@vger.kernel.org, qemu-devel@nongnu.org, Avi Kivity,
	Javier Guerra, linux-fsdevel@vger.kernel.org, MORITA Kazutaka

> Also, on _loaded_ systems, I noticed creating/removing logical volumes
> can take really long (several minutes); where allocating a file of a
> given size would just take a fraction of that.

Allocating a file takes much longer, unless you use  a 'sparse' file.

- Dietmar

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Qemu-devel] RE: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-23 14:14     ` Javier Guerra
  2009-10-23 14:58       ` Chris Webb
  2009-10-23 15:40       ` FUJITA Tomonori
@ 2009-10-25  8:51       ` Dietmar Maurer
  2009-10-26  6:53         ` [Qemu-devel] " MORITA Kazutaka
  2 siblings, 1 reply; 32+ messages in thread
From: Dietmar Maurer @ 2009-10-25  8:51 UTC (permalink / raw)
  To: Javier Guerra, MORITA Kazutaka
  Cc: linux-fsdevel@vger.kernel.org, Avi Kivity, kvm@vger.kernel.org,
	qemu-devel@nongnu.org

> >> Do you support multiple guests accessing the same image?
> >
> > A VM image can be attached to any VM, but to only one VM at a time;
> > multiple running VMs cannot access the same VM image.

I guess this is a problem when you want to do live migrations?

- Dietmar

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-25  8:44           ` Dietmar Maurer
@ 2009-10-25 10:55             ` Tomasz Chmielewski
  0 siblings, 0 replies; 32+ messages in thread
From: Tomasz Chmielewski @ 2009-10-25 10:55 UTC (permalink / raw)
  To: Dietmar Maurer
  Cc: Chris Webb, kvm@vger.kernel.org, qemu-devel@nongnu.org,
	Avi Kivity, Javier Guerra, linux-fsdevel@vger.kernel.org,
	MORITA Kazutaka

Dietmar Maurer wrote:
>> Also, on _loaded_ systems, I noticed creating/removing logical volumes
>> can take really long (several minutes); where allocating a file of a
>> given size would just take a fraction of that.
> 
> Allocating a file takes much longer, unless you use  a 'sparse' file.

If you mean "allocating" like with:

    dd if=/dev/zero of=image bs=1G count=50

Then of course, that's a lot of IO.


As you mentioned, you can create a sparse file (but then, you'll end up 
with a lot of fragmentation).

But a better way would be to use persistent preallocation (fallocate), 
instead of "traditional" dd or a sparse file.
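
A small sketch of the two cheap options, for comparison. It assumes a
Unix system where Python's os.posix_fallocate is available; the file
size and the helper names are invented for the example.

```python
import os

SIZE = 16 * 1024 * 1024  # small example size; a real image would be far larger


def make_sparse(path, size):
    # Sets the file length only; blocks are allocated later, on first write.
    with open(path, "wb") as f:
        f.truncate(size)


def make_preallocated(path, size):
    # Asks the filesystem to reserve the space up front without writing zeros
    # (the C library may fall back to writing if the filesystem lacks support).
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)


def allocated_bytes(path):
    # st_blocks is counted in 512-byte units on Linux.
    return os.stat(path).st_blocks * 512
```

Both files report the same length; comparing allocated_bytes() for the
two shows whether the space was actually reserved on a given
filesystem.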


-- 
Tomasz Chmielewski
http://wpkg.org

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-25  8:51       ` [Qemu-devel] " Dietmar Maurer
@ 2009-10-26  6:53         ` MORITA Kazutaka
  0 siblings, 0 replies; 32+ messages in thread
From: MORITA Kazutaka @ 2009-10-26  6:53 UTC (permalink / raw)
  To: Dietmar Maurer
  Cc: qemu-devel@nongnu.org, linux-fsdevel@vger.kernel.org, Avi Kivity,
	kvm@vger.kernel.org, Javier Guerra

On 2009/10/25 17:51, Dietmar Maurer wrote:
>>>> Do you support multiple guests accessing the same image?
>>> A VM image can be attached to any VM, but to only one VM at a time;
>>> multiple running VMs cannot access the same VM image.
> 
> I guess this is a problem when you want to do live migrations?
> 
Yes, because Sheepdog locks a VM image when it is opened.
To avoid this problem, locking must be delayed until migration has completed.
This is also a TODO item.

--
MORITA Kazutaka

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
  2009-10-21  5:13 [Qemu-devel] [ANNOUNCE] Sheepdog: Distributed Storage System for KVM MORITA Kazutaka
                   ` (6 preceding siblings ...)
  2009-10-23 19:39 ` [Qemu-devel] " MORITA Kazutaka
@ 2009-10-28  3:53 ` MORITA Kazutaka
  7 siblings, 0 replies; 32+ messages in thread
From: MORITA Kazutaka @ 2009-10-28  3:53 UTC (permalink / raw)
  To: kvm, qemu-devel, linux-fsdevel

On 2009/10/21 14:13, MORITA Kazutaka wrote:
> Hi everyone,
> 
> Sheepdog is a distributed storage system for KVM/QEMU. It provides
> highly available block level storage volumes to VMs like Amazon EBS.
> Sheepdog supports advanced volume management features such as snapshot,
> cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
> of nodes, and the architecture is fully symmetric; there is no central
> node such as a meta-data server.

We added some pages to Sheepdog website:

  Design: http://www.osrg.net/sheepdog/design.html
  FAQ   : http://www.osrg.net/sheepdog/faq.html

The Sheepdog mailing list is also ready to use (thanks to Tomasz)

  Subscribe/Unsubscribe/Preferences
    http://lists.wpkg.org/mailman/listinfo/sheepdog
  Archive
    http://lists.wpkg.org/pipermail/sheepdog/

We are always looking for developers or users interested in
participating in the Sheepdog project!

Thanks.

MORITA Kazutaka

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2009-10-28  3:53 UTC | newest]

