* jbd2 inside a device mapper module
From: Alberto Bertogli @ 2008-12-24 21:10 UTC
To: linux-mm-cc; +Cc: dm-devel, linux-ext4

Hi!

I'm writing a small device mapper module, and I'm interested in placing
a jbd/jbd2 journal on the backing device.

I started by trying to do a __bread() manually (just for early tests)
inside my map function. But it got stuck, as far as I could see,
waiting for a buffer head in wait_on_buffer() IIRC (I could track it
down again if it's needed). And I couldn't find why it was locked, since
it was an unused loopback device, and my code didn't even deal with
buffer heads.

Then, since I was planning on using jbd/jbd2 anyway, I decided to use it
(and went for jbd2).


Now, I'm having issues with journal creation.

I tried using "mkfs.ext3 -O journal_dev", but jbd2_journal_load()
complains that it can't find the journal superblock.

And if I modify jbd2_journal_create(), removing the 'if
(journal->j_inode == NULL)' check (I imagine it's there for a reason,
but from a quick look at the code couldn't find it and thought it was
worth a try) then when creating it I get a warning (pasted below) and it
gets locked up, which I think may be related to what happened when I did
__bread(), but obviously I'm not sure at all.

And I got stuck there, so I thought it'd be better to ask. Does anyone
have any ideas or suggestions on what I'm doing wrong?


I've not published my code yet because it's really rough, but if anyone
wants to take a look at it, please let me know. I was planning on
posting it when it was at least working.

Thanks a lot,
		Alberto



[42949814.780000] ------------[ cut here ]------------
[42949814.780000] WARNING: at /pub/src/linux/linux-2.6/fs/buffer.c:1186 mark_buffer_dirty+0x77/0xa0()
[42949814.780000] Modules linked in:
[42949814.780000] Call Trace:
[42949814.780000] 678f17d8:  [<6003988b>] warn_on_slowpath+0x5b/0x80
[42949814.780000] 678f1818:  [<600bbe9a>] __find_get_block_slow+0x7a/0x110
[42949814.780000] 678f1858:  [<600bc219>] __find_get_block+0x79/0x180
[42949814.780000] 678f1888:  [<60033b05>] __might_sleep+0x105/0x130
[42949814.780000] 678f18c8:  [<600bc353>] __getblk+0x33/0x270
[42949814.780000] 678f18f8:  [<600bc7e7>] mark_buffer_dirty+0x77/0xa0
[42949814.780000] 678f1918:  [<6012b1a8>] jbd2_journal_create+0x88/0x170
[42949814.780000] 678f1958:  [<601aac70>] csum_ctr+0x1b0/0x240
[42949814.780000] 678f1968:  [<6019b810>] get_target_type+0x60/0xa0
[42949814.780000] 678f19a8:  [<6019b0d4>] dm_table_add_target+0x174/0x3b0
[42949814.780000] 678f1a08:  [<6019d057>] table_load+0xb7/0x200
[42949814.780000] 678f1a68:  [<6019dd98>] dm_ctl_ioctl+0x288/0x300
[42949814.780000] 678f1a98:  [<6019cfa0>] table_load+0x0/0x200
[42949814.780000] 678f1c18:  [<600a82fb>] vfs_ioctl+0x1b/0x70
[42949814.780000] 678f1c28:  [<600a8770>] do_vfs_ioctl+0x400/0x660
[42949814.780000] 678f1ca8:  [<600a8a1a>] sys_ioctl+0x4a/0x80
[42949814.780000] 678f1ce8:  [<6001a310>] handle_syscall+0x50/0x80
[42949814.780000] 678f1d08:  [<6002bf1f>] userspace+0x3ff/0x530
[42949814.780000] 678f1fc8:  [<60017012>] fork_handler+0x62/0x70
[42949814.780000]
[42949814.780000] ---[ end trace ebc125a00ee8f9d2 ]---

^ permalink raw reply [flat|nested] 17+ messages in thread

* Re: jbd2 inside a device mapper module
From: Alberto Bertogli @ 2008-12-24 22:38 UTC
To: linux-kernel; +Cc: dm-devel, linux-ext4

[Adding lkml on the CC list, somehow I managed to screw up the address
and sent it to the mm-cc list instead]

* Re: jbd2 inside a device mapper module
From: Theodore Tso @ 2008-12-24 23:49 UTC
To: Alberto Bertogli; +Cc: linux-ext4, linux-mm-cc, dm-devel

On Wed, Dec 24, 2008 at 07:10:38PM -0200, Alberto Bertogli wrote:
>
> I'm writing a small device mapper module, and I'm interested in placing
> a jbd/jbd2 journal on the backing device.
>
> I started by trying to do a __bread() manually (just for early tests)
> inside my map function. But it got stuck, as far as I could see,
> waiting for a buffer head in wait_on_buffer() IIRC (I could track it
> down again if it's needed). And I couldn't find why it was locked, since
> it was an unused loopback device, and my code didn't even deal with
> buffer heads.

I have no idea why you would need to do manual __breads().  No doubt
I'm missing some context here.

> Then, since I was planning on using jbd/jbd2 anyway, I decided to use it
> (and went for jbd2).
>
> Now, I'm having issues with journal creation.
>
> I tried using "mkfs.ext3 -O journal_dev", but jbd2_journal_load()
> complains that it can't find the journal superblock.

So I'll tell you how to do this via simple hard drives, and you can
figure out how to make it work with dm.  Note that if the journal
device isn't on a stand-alone spindle, it's probably not going to help
you.  The whole point of using an external journal device is to avoid
the seeking on the journal device, or to take advantage of the speed
of a battery-backed NVRAM device.  I'm not sure how much sense it
makes to use a dm-based external journal device.... what exactly do you
hope to achieve?

To create an external journal device on the device /dev/sda:

	mke2fs -O journal_dev /dev/sda

To create a new filesystem on /dev/sdb1 that will use the external
journal found on /dev/sda:

	mke2fs -j -J device=/dev/sda /dev/sdb1

						- Ted

P.S.  All of this is in the mke2fs man page....

* Re: jbd2 inside a device mapper module
From: Alberto Bertogli @ 2008-12-25 14:35 UTC
To: Theodore Tso; +Cc: linux-ext4, linux-kernel, dm-devel

On Wed, Dec 24, 2008 at 06:49:15PM -0500, Theodore Tso wrote:
> On Wed, Dec 24, 2008 at 07:10:38PM -0200, Alberto Bertogli wrote:
> >
> > I'm writing a small device mapper module, and I'm interested in placing
> > a jbd/jbd2 journal on the backing device.
> >
> > I started by trying to do a __bread() manually (just for early tests)
> > inside my map function. But it got stuck, as far as I could see,
> > waiting for a buffer head in wait_on_buffer() IIRC (I could track it
> > down again if it's needed). And I couldn't find why it was locked, since
> > it was an unused loopback device, and my code didn't even deal with
> > buffer heads.
>
> I have no idea why you would need to do manual __breads().  No doubt
> I'm missing some context here.

I'm writing (just for fun and learning purposes) a device mapper module
that stores checksums on writes and verifies them on reads. The
integrity metadata (currently just the checksum) is interleaved in the
backing device: one sector holding the integrity metadata for the
following 64 data sectors.

The reason for the __bread() is explained below.


> > Then, since I was planning on using jbd/jbd2 anyway, I decided to use it
> > (and went for jbd2).
> >
> > Now, I'm having issues with journal creation.
> >
> > I tried using "mkfs.ext3 -O journal_dev", but jbd2_journal_load()
> > complains that it can't find the journal superblock.
>
> So I'll tell you how to do this via simple hard drives, and you can
> figure out how to make it work with dm.  Note that if the journal
> device isn't on a stand-alone spindle, it's probably not going to help
> you.  The whole point of using an external journal device is to avoid
> the seeking on the journal device, or to take advantage of the speed
> of a battery-backed NVRAM device.  I'm not sure how much sense it
> makes to use a dm-based external journal device.... what exactly do you
> hope to achieve?
>
> To create an external journal device on the device /dev/sda:
>
> 	mke2fs -O journal_dev /dev/sda
>
> To create a new filesystem on /dev/sdb1 that will use the external
> journal found on /dev/sda:
>
> 	mke2fs -j -J device=/dev/sda /dev/sdb1
>
> 						- Ted
>
> P.S.  All of this is in the mke2fs man page....

Thanks. I've found and tried that (that's what I meant with the
paragraph you quote), but I couldn't make it work.

I'll try to make my intentions more clear, but please let me know if
I'm not explaining myself.

For each write on the dm device I should not only write the data in the
backing device, but also update the corresponding integrity metadata.

So, to update the metadata, I should first read that sector from the
backing device, then update it, and finally write it back.

As an early experiment I began to do the first part without caring for
the atomicity of the update. I tried __bread() (just as an experiment,
because I've been using dm-io to do the reads so far) without success.

I then thought of giving jbd2 a try, with the final intention of using
it to update the metadata and the data in an atomic way. I'd devote
some space at the beginning of the backing device for the journal, and
use it internally for that purpose (so it has nothing to do with
ext3/4).

The first problem I stumbled upon was that jbd2_journal_create() doesn't
like journals initialized using jbd2_journal_init_dev() (because it has
no j_inode). I had two choices: either try to create the journal some
other way, or remove the j_inode test in jbd2_journal_create().

I suspected the test was there for a reason, but I couldn't find it from
a quick look, so I tried it anyway, which resulted in the warning from
the first email.

Then I tried to create the journal using mke2fs as you described, but
jbd2_journal_load() fails when trying to load it.


To summarize, these are my questions:

 - Why does __bread() get stuck when called from inside a dm map
   function? It looks like it's waiting on a buffer_head, but why?
 - What is the reason behind the j_inode check in jbd2_journal_create()?
 - Does mke2fs -O journal_dev create a journal that jbd2_journal_load()
   is supposed to read without any knowledge of ext2/3/4 stuff? If not,
   how can I create such a journal? I'll be looking at the e2fsprogs
   code for the answer to this question later today (I haven't looked at
   it yet).

Obviously, I'm not expecting long detailed answers; any tip on where I
can find them would be greatly appreciated.

Thanks a lot,
		Alberto

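The interleaved layout described in the message above (one metadata sector followed by the 64 data sectors it covers) can be sketched as a small address mapping. This is only an illustration, not the module's actual code: the 512-byte sector size, the fixed 8-byte slot per data sector, the `first_sector` offset (standing in for space reserved for a journal at the start of the device) and the name `map_sector` are all assumptions made for the example.

```python
SECTOR_SIZE = 512                           # bytes per sector (assumed)
DATA_PER_GROUP = 64                         # from the message: 64 data sectors per metadata sector
GROUP_SIZE = DATA_PER_GROUP + 1             # one metadata sector plus its data sectors
SLOT_SIZE = SECTOR_SIZE // DATA_PER_GROUP   # 8 bytes of metadata per data sector (assumed)


def map_sector(logical, first_sector=0):
    """Map a logical sector of the dm device to the backing device.

    Returns (data_sector, meta_sector, slot_offset), where slot_offset is
    the byte offset of this sector's entry inside the metadata sector.
    first_sector is a hypothetical offset for space reserved up front.
    """
    group, index = divmod(logical, DATA_PER_GROUP)
    meta_sector = first_sector + group * GROUP_SIZE
    data_sector = meta_sector + 1 + index
    slot_offset = index * SLOT_SIZE
    return data_sector, meta_sector, slot_offset


if __name__ == "__main__":
    for logical in (0, 63, 64, 130):
        print(logical, map_sector(logical))
```

Under these assumptions every write to a logical sector touches two backing sectors, the data sector and its group's metadata sector, which is why the read-modify-write of the metadata and its atomicity come up in the thread.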
* Re: jbd2 inside a device mapper module
From: Theodore Tso @ 2008-12-25 15:52 UTC
To: Alberto Bertogli; +Cc: dm-devel, linux-ext4, linux-kernel

On Thu, Dec 25, 2008 at 12:35:35PM -0200, Alberto Bertogli wrote:
>
> Thanks. I've found and tried that (that's what I meant with the
> paragraph you quote), but I couldn't make it work.

See attached transcript.  I did it using lvm/dm just to show it's not
a devicemapper problem.

> The first problem I stumbled upon was that jbd2_journal_create() doesn't
> like journals initialized using jbd2_journal_init_dev() (because it has
> no j_inode). I had two choices: either try to create the journal some
> other way, or remove the j_inode test in jbd2_journal_create().

ext4_journal_create is ancient code dating back to ext3/jbd, and even
there it's code which has been obsolete for about 6-7 years.  In fact,
I plan to remove ext4_journal_create, the journal_inum mount option,
and jbd2_journal_init_dev, because the supported way of creating a
journal is using mke2fs.  I need to double check and make sure ocfs2
isn't using jbd2_journal_init_dev before I remove it from the jbd2
layer, but really, this sort of thing should be done all in userspace.

> Then I tried to create the journal using mke2fs as you described, but
> jbd2_journal_load() fails when trying to load it.

See attached.  Works fine for me.

>  - Why does __bread() get stuck when called from inside a dm map
>    function? It looks like it's waiting on a buffer_head, but why?

I'm not a dm guy, so I can't answer this, but I suspect the issue may
be a lock ordering issue.

>  - What is the reason behind the j_inode check in jbd2_journal_create()?

jbd2_journal_create was only designed for creating inode-based
journals, and it's a deprecated function that will likely be removed
soon.

>  - Does mke2fs -O journal_dev create a journal that jbd2_journal_load()
>    is supposed to read without any knowledge of ext2/3/4 stuff? If not,
>    how can I create such a journal? I'll be looking at the e2fsprogs
>    code for the answer to this question later today (I haven't looked at
>    it yet).

mke2fs -O journal_dev creates an external journal, but when you create
a filesystem, you need to specify the location of the external
journal.  Hence:

	mke2fs -O journal_dev /dev/extern_journal_dev
	mke2fs -t ext4 -J device=/dev/extern_journal_dev /dev/filesystem_dev

As I said in my last message.  I've tested it, and it works Just Fine.

						- Ted

Script started on Thu 25 Dec 2008 10:22:11 AM EST
Top-level shell (parent script)
Using forwarded ssh authentication socket
# lvs
  LV          VG    Attr   LSize   Origin Snap%  Move Log Copy%
  ext3root    thunk -wi-a-  15.00G
  footest     thunk -wi-a-   1.00G
  foresight   thunk -wi-a-   5.00G
  old-root    thunk -wi-a- 128.00G
  rmake       thunk -wi-a-   2.00G
  root        thunk -wi-ao 128.00G
  sff-torrent thunk -wi-a-   7.00G
  testext4    thunk -wi-a-   1.00G
# mke2fs -O journal_dev /dev/thunk/footest
mke2fs 1.41.3 (12-Oct-2008)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
0 inodes, 262144 blocks
0 blocks (0.00%) reserved for the super user
First data block=0
0 block group
32768 blocks per group, 32768 fragments per group
0 inodes per group
Superblock backups stored on blocks:

Zeroing journal device: done
# mke2fs -t ext4 -J device=/dev/thunk/footest /dev/thunk/testext4
mke2fs 1.41.3 (12-Oct-2008)
Using journal device's blocksize: 4096
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
65536 inodes, 262144 blocks
13107 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=268435456
8 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376

Writing inode tables: done
Adding journal to device /dev/thunk/footest: done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 29 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
# dumpe2fs -h /dev/thunk/testext4
dumpe2fs 1.41.3 (12-Oct-2008)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          47b3315f-7b0d-40ab-995e-de1ddaaf3528
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              65536
Block count:              262144
Reserved block count:     13107
Free blocks:              257701
Free inodes:              65525
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      63
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Thu Dec 25 10:23:12 2008
Last mount time:          n/a
Last write time:          Thu Dec 25 10:23:12 2008
Mount count:              0
Maximum mount count:      29
Last checked:             Thu Dec 25 10:23:12 2008
Check interval:           15552000 (6 months)
Next check after:         Tue Jun 23 11:23:12 2009
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal UUID:             484902c6-34a5-4cd2-9f66-02a3251bfc9e
Journal device:           0xfe06
Default directory hash:   half_md4
Directory Hash Seed:      2889d0e3-ca37-443d-b9a3-12e3b0e26d70

# mount /dev/thunk/testext4 /mnt
# df /mnt
Filesystem                 1K-blocks  Used Available Use% Mounted on
/dev/mapper/thunk-testext4   1032088  1284    978376   1% /mnt
# umount /mnt
# exit

Script done on Thu 25 Dec 2008 10:23:37 AM EST

* Re: jbd2 inside a device mapper module
From: Alberto Bertogli @ 2008-12-26 0:00 UTC
To: Theodore Tso, linux-ext4, linux-kernel, dm-devel

On Thu, Dec 25, 2008 at 10:52:48AM -0500, Theodore Tso wrote:
> On Thu, Dec 25, 2008 at 12:35:35PM -0200, Alberto Bertogli wrote:
> >  - What is the reason behind the j_inode check in jbd2_journal_create()?
>
> jbd2_journal_create was only designed for creating inode-based
> journals, and it's a deprecated function that will likely be removed
> soon.

Thanks, I didn't know that!


> >  - Does mke2fs -O journal_dev create a journal that jbd2_journal_load()
> >    is supposed to read without any knowledge of ext2/3/4 stuff? If not,
> >    how can I create such a journal? I'll be looking at the e2fsprogs
> >    code for the answer to this question later today (I haven't looked at
> >    it yet).
>
> mke2fs -O journal_dev creates an external journal, but when you create
> a filesystem, you need to specify the location of the external
> journal.  Hence:
>
> 	mke2fs -O journal_dev /dev/extern_journal_dev
> 	mke2fs -t ext4 -J device=/dev/extern_journal_dev /dev/filesystem_dev
>
> As I said in my last message.  I've tested it, and it works Just Fine.

I think I'm not explaining myself correctly. My code has _nothing_ to do
with ext2/3/4 (or any other filesystem) whatsoever. I'm not using the
journal as an external one for a filesystem. I want to use it to be able
to do atomic writes in my own, filesystem independent, device-mapper
code.

After what you told me (both this and the deprecation of
jbd2_journal_create()), I took a look at e2fsprogs' source. From what I
could see, "mke2fs -O journal_dev" creates the external journal inside
some ext2/3/4 structures, which caused my journal-loading code to fail
(because it doesn't know about ext stuff).

So, I wrote a small "mkjournal" utility that creates a journal on the
block device without any ext2/3/4 stuff. It's based on e2fsprogs'
mkjournal.c, except it doesn't have any ext2 stuff.

And it worked great! I'm now able to load the journal just fine.

Thanks a lot for all the help!
		Alberto

* Re: jbd2 inside a device mapper module
From: Theodore Tso @ 2008-12-26 3:37 UTC
To: Alberto Bertogli; +Cc: dm-devel, linux-ext4, linux-kernel

On Thu, Dec 25, 2008 at 10:00:05PM -0200, Alberto Bertogli wrote:
>
> I think I'm not explaining myself correctly. My code has _nothing_ to do
> with ext2/3/4 (or any other filesystem) whatsoever. I'm not using the
> journal as an external one for a filesystem. I want to use it to be able
> to do atomic writes in my own, filesystem independent, device-mapper
> code.

How many block writes are you batching into a single transaction?  If
you're not careful you may find that the performance overhead will be
quite expensive.

> After what you told me (both this and the deprecation of
> jbd2_journal_create()), I took a look at e2fsprogs' source. From what I
> could see, "mke2fs -O journal_dev" creates the external journal inside
> some ext2/3/4 structures, which caused my journal-loading code to fail
> (because it doesn't know about ext stuff).

Yes, this is necessary because in a production system you need to be
able to identify the external journal by UUID, and the ext2/3/4
superblock makes it easy to add a label, UUID, et al.  It also
significantly lowers the chance that an external journal will get
misidentified as some other filesystem based on the data stored in the
journal.

						- Ted

* Re: jbd2 inside a device mapper module
From: Alberto Bertogli @ 2008-12-26 16:17 UTC
To: Theodore Tso, linux-ext4, linux-kernel, dm-devel

On Thu, Dec 25, 2008 at 10:37:36PM -0500, Theodore Tso wrote:
> On Thu, Dec 25, 2008 at 10:00:05PM -0200, Alberto Bertogli wrote:
> >
> > I think I'm not explaining myself correctly. My code has _nothing_ to do
> > with ext2/3/4 (or any other filesystem) whatsoever. I'm not using the
> > journal as an external one for a filesystem. I want to use it to be able
> > to do atomic writes in my own, filesystem independent, device-mapper
> > code.
>
> How many block writes are you batching into a single transaction?  If
> you're not careful you may find that the performance overhead will be
> quite expensive.

At this moment I'm trying to keep it simple, so I plan to batch two for
each sector written to the device: one for the metadata and one for the
data.


> > After what you told me (both this and the deprecation of
> > jbd2_journal_create()), I took a look at e2fsprogs' source. From what I
> > could see, "mke2fs -O journal_dev" creates the external journal inside
> > some ext2/3/4 structures, which caused my journal-loading code to fail
> > (because it doesn't know about ext stuff).
>
> Yes, this is necessary because in a production system you need to be
> able to identify the external journal by UUID, and the ext2/3/4
> superblock makes it easy to add a label, UUID, et al.  It also
> significantly lowers the chance that an external journal will get
> misidentified as some other filesystem based on the data stored in the
> journal.

Yes, it makes sense. I've reserved the first sector for that purpose.

Thanks a lot,
		Alberto

* Re: jbd2 inside a device mapper module
From: Theodore Tso @ 2008-12-26 18:06 UTC
To: Alberto Bertogli; +Cc: dm-devel, linux-ext4, linux-kernel

On Fri, Dec 26, 2008 at 02:17:08PM -0200, Alberto Bertogli wrote:
>
> At this moment I'm trying to keep it simple, so I plan to batch two for
> each sector written to the device: one for the metadata and one for the
> data.
>

I think I can pretty much guarantee that your performance will be so
horrible that it won't be worth using.

> > Yes, this is necessary because in a production system you need to be
> > able to identify the external journal by UUID, and the ext2/3/4
> > superblock makes it easy to add a label, UUID, et al.  It also
> > significantly lowers the chance that an external journal will get
> > misidentified as some other filesystem based on the data stored in the
> > journal.
>
> Yes, it makes sense. I've reserved the first sector for that purpose.

Why not just use the ext3/4 external journal format?

						- Ted

* Re: jbd2 inside a device mapper module
From: Alberto Bertogli @ 2008-12-27 3:00 UTC
To: Theodore Tso, linux-ext4, linux-kernel, dm-devel

On Fri, Dec 26, 2008 at 01:06:42PM -0500, Theodore Tso wrote:
> On Fri, Dec 26, 2008 at 02:17:08PM -0200, Alberto Bertogli wrote:
> >
> > At this moment I'm trying to keep it simple, so I plan to batch two for
> > each sector written to the device: one for the metadata and one for the
> > data.
> >
>
> I think I can pretty much guarantee that your performance will be so
> horrible that it won't be worth using.

Thanks for the warning.

I have a couple of alternatives in mind, the most decent one at the
moment is having two metadata entries (M1 and M2) for each block, and
updating M1 on the first write to the given block, M2 on the second, M1
on the third, and so on.

So, if a block holds "A" and M1 holds crc("A"), and the user wants to
write "B" to the block, I would first write crc("B") in M2, and then
write "B" to the block.

The biggest problem I can see with this approach is that I require a
timestamp on the metadata so I can determine where to write (M1 or M2).
And I'm not sure if it'd perform better than the journal, though.

Do you have any suggestions as to how I can handle this issue?


> > > Yes, this is necessary because in a production system you need to be
> > > able to identify the external journal by UUID, and the ext2/3/4
> > > superblock makes it easy to add a label, UUID, et al.  It also
> > > significantly lowers the chance that an external journal will get
> > > misidentified as some other filesystem based on the data stored in the
> > > journal.
> >
> > Yes, it makes sense. I've reserved the first sector for that purpose.
>
> Why not just use the ext3/4 external journal format?

Wouldn't that lead to confusion, because people can think the device
holds an ext3/4 external journal, while it actually holds a
device-mapper backing device that happens to contain a journal?

What would be the advantages of using the ext3/4 journal format, over a
simple initial sector and the journal following?

Thanks,
		Alberto

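The M1/M2 alternation proposed in the message above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the author's implementation: a per-slot sequence number stands in for the timestamp mentioned in the message, `zlib.crc32` stands in for whatever checksum the module uses, and the class and method names are hypothetical. It also ignores write ordering on real hardware, which is the concern raised later in the thread.

```python
import zlib


class ChecksummedBlock:
    """One data block with two alternating metadata slots (M1 and M2)."""

    def __init__(self, data):
        self.data = data
        # Each slot is (sequence, crc). M1 starts out describing the
        # initial contents; M2 is unused (sequence 0).
        self.slots = [(1, zlib.crc32(data)), (0, None)]

    def _older_slot(self):
        # The slot with the lower sequence number is the one to overwrite.
        return 0 if self.slots[0][0] < self.slots[1][0] else 1

    def write(self, new_data, crash_before_data=False):
        target = self._older_slot()
        next_seq = max(seq for seq, _ in self.slots) + 1
        # Step 1: record the checksum of the new data in the older slot.
        self.slots[target] = (next_seq, zlib.crc32(new_data))
        if crash_before_data:
            return  # simulate a crash between the two writes
        # Step 2: write the data itself.
        self.data = new_data

    def verify(self):
        # The block is consistent if its contents match either slot.
        crc = zlib.crc32(self.data)
        return any(stored == crc for _, stored in self.slots)


if __name__ == "__main__":
    blk = ChecksummedBlock(b"A")
    blk.write(b"B")                           # goes to M2
    blk.write(b"C", crash_before_data=True)   # M1 updated, data still "B"
    print(blk.verify())                       # old data still matches M2
```

The point of keeping two slots is that an interrupted update never destroys the checksum of the data that is actually on disk: the slot that was not being overwritten still matches it.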
* Re: jbd2 inside a device mapper module
From: Theodore Tso @ 2008-12-27 19:29 UTC
To: Alberto Bertogli; +Cc: dm-devel, linux-ext4, linux-kernel

On Sat, Dec 27, 2008 at 01:00:20AM -0200, Alberto Bertogli wrote:
> I have a couple of alternatives in mind, the most decent one at the
> moment is having two metadata entries (M1 and M2) for each block, and
> updating M1 on the first write to the given block, M2 on the second, M1
> on the third, and so on.

I don't see how this would help.  You still have to do synchronous
writes for safety, which is what is going to kill your performance.

What you want to do is to batch as many writes as possible.  Until the
underlying filesystem requests a flush, you can afford to hold off
writing the block to disk.  Otherwise, you'll end up turning each 4k
write into two 8k synchronous writes, which will be a performance
disaster.  If you hold off, it's much more likely that you'll be able
to batch a large number of blocks into a single transaction.  Also, if
a block gets modified multiple times (for example, with an inode table
block where tar writes one file, and then another), if you hold off
the write as long as possible, you can write the inode table block
only once, instead of multiple times.

Note that this means that you have to wait until the last minute to
calculate the checksum, since the buffer could be modified after the
write request.  OCFS2 does this, by using a commit-time callback to
calculate the checksums used.  The bottom line is that doing something
like this in an efficient way is tricky.

> > Why not just use the ext3/4 external journal format?
>
> Wouldn't that lead to confusion, because people can think the device
> holds an ext3/4 external journal, while it actually holds a
> device-mapper backing device that happens to contain a journal?

Not really; the external journal has a label and uuid, and the journal
superblock has a place to store the uuid of the "client" of the
journal.  So there is plenty of information available to tie an
external journal to some device-mapper backing device.

> What would be the advantages of using the ext3/4 journal format, over a
> simple initial sector and the journal following?

There are already existing tools to find the external journal, using
the blkid library.  So you only have to store the UUID of the journal
in the superblock of the device-mapper backing device, and then you
can easily find the external journal as follows:

	journal_fn = blkid_get_devname(ctx->blkid, "UUID", uuid);

						- Ted

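The batching advice in the message above (hold writes until a flush, let repeated writes to the same block collapse into one, and compute checksums only at commit time) can be sketched as a tiny write-back buffer. This is a user-space illustration with hypothetical names, not jbd2 or device-mapper code; `zlib.crc32` is again a stand-in checksum, and a "transaction" here is just a list appended in memory.

```python
import zlib


class BatchingWriter:
    """Hold writes in memory and commit them as one transaction on flush."""

    def __init__(self):
        self.pending = {}        # block number -> latest data for that block
        self.transactions = []   # committed batches of (block, data, crc)

    def write(self, block, data):
        # A later write to the same block simply replaces the earlier one,
        # so only the final contents are ever committed.
        self.pending[block] = data

    def flush(self):
        """Commit all pending blocks; return how many blocks were written."""
        if not self.pending:
            return 0
        # Checksums are computed here, at commit time, because the data
        # may have changed since the first write request for the block.
        txn = [(blk, data, zlib.crc32(data))
               for blk, data in sorted(self.pending.items())]
        self.transactions.append(txn)
        self.pending.clear()
        return len(txn)


if __name__ == "__main__":
    w = BatchingWriter()
    w.write(7, b"inode table v1")
    w.write(7, b"inode table v2")   # supersedes the first write to block 7
    w.write(9, b"file data")
    print(w.flush(), len(w.transactions))
```

In this model three write requests turn into one transaction containing two blocks, which is the effect the message describes for an inode table block that is modified more than once before the flush.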
* Re: jbd2 inside a device mapper module 2008-12-27 19:29 ` Theodore Tso @ 2008-12-29 21:30 ` Alberto Bertogli 0 siblings, 0 replies; 17+ messages in thread From: Alberto Bertogli @ 2008-12-29 21:30 UTC (permalink / raw) To: Theodore Tso, linux-ext4, linux-kernel, dm-devel On Sat, Dec 27, 2008 at 02:29:50PM -0500, Theodore Tso wrote: > On Sat, Dec 27, 2008 at 01:00:20AM -0200, Alberto Bertogli wrote: > > I have a couple of alternatives in mind, the most decent one at the > > moment is having two metadatas (M1 and M2) for the each block, and > > update M1 on the first write to the given block, M2 on the second, M1 on > > the third, and so on. > > I don't see how this would help. You still have to do synchronous > writes for safety, which is what is going to kill your performance. I was thinking of queueing the writes to the metadata, and then queue the writes of the data marked with bio_barrier(); when the data write completes I end the original bio. Although if they metadata is on a different device, I do have to wait for the metadata to be written because the barrier is useless; but OTOH if I use a journal I can't split my data and metadata in two different devices, can I? (without using two journals or doing more complex stuff). > What you want to do is to batch as many writes as possible. Until the > underlying filesystem requests a flush, you can afford to hold off > writing the block to disk. Otherwise, you'll end up turning each 4k I think I can't do this at the device-mapper layer. There's a .flush function pointer, but I think it's suspend-related; and in any case I gave it a try and it's never called during normal operation. > > > Why not just use the ext3/4 external journal format? > > > > Wouldn't that lead to confusion, because people can think the device > > holds an ext3/4 external journal, while it actually holds a > > device-mapper backing device that happens to contain a journal? 
>
> Not really; the external journal has a label and uuid, and the journal
> superblock has a place to store the uuid of the "client" of the
> journal. So there is plenty of information available to tie an
> external journal to some device-mapper backing device.
>
> > What would be the advantages of using the ext3/4 journal format, over a
> > simple initial sector and the journal following?
>
> There are already existing tools to find the external journal, using the
> blkid library. So you only have to store the UUID of the journal in
> the superblock of the device-mapper backing device, and then you can
> easily find the external journal as follows:
>
> 	journal_fn = blkid_get_devname(ctx->blkid, "UUID", uuid);

Thanks a lot for the suggestions!

As I said in the other email, I'll give the writes a try and see how it
goes. If their performance sucks (which, from what you tell me, is
likely) at least I'll have something that works.

Thanks a lot,
		Alberto
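The batching idea discussed above can be illustrated with a small user-space sketch. It is not kernel code and not taken from the module under discussion: it only models the bookkeeping, assuming the layout mentioned in this thread (one metadata sector per 64 data sectors) and assuming that some flush signal is available, which the reply above suggests is not obviously the case at the device-mapper layer. The names `md_batch`, `md_batch_note_write`, `md_batch_flush` and the `MAX_GROUPS` limit are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption from the thread: one metadata sector covers 64 data sectors. */
#define DATA_SECTORS_PER_GROUP 64u
/* Hypothetical limit, only to keep the sketch free of dynamic allocation. */
#define MAX_GROUPS 1024u

struct md_batch {
	uint8_t dirty[MAX_GROUPS]; /* 1 if that group's metadata sector has queued updates */
	unsigned pending;          /* number of distinct dirty metadata sectors */
};

static inline void md_batch_init(struct md_batch *b)
{
	memset(b, 0, sizeof(*b));
}

/* Record a data write; the metadata update is only queued, not issued. */
static inline void md_batch_note_write(struct md_batch *b, uint64_t logical)
{
	uint64_t group = logical / DATA_SECTORS_PER_GROUP;

	assert(group < MAX_GROUPS);
	if (!b->dirty[group]) {
		b->dirty[group] = 1;
		b->pending++;
	}
}

/*
 * On flush, each dirty metadata sector would be written exactly once.
 * Returns how many metadata writes that amounts to and resets the batch.
 */
static inline unsigned md_batch_flush(struct md_batch *b)
{
	unsigned issued = b->pending;

	memset(b->dirty, 0, sizeof(b->dirty));
	b->pending = 0;
	return issued;
}
```

With this bookkeeping, writes to logical sectors 0, 1 and 63 share one metadata sector, so a flush after them issues a single metadata write instead of three synchronous ones. What happens to data that reaches the disk before its checksum is outside the scope of the sketch.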
* Re: jbd2 inside a device mapper module
From: Andreas Dilger @ 2008-12-27 20:01 UTC
To: Alberto Bertogli, Alex Zhuravlev
Cc: Theodore Tso, linux-ext4, linux-kernel, dm-devel

On Dec 25, 2008 12:35 -0200, Alberto Bertogli wrote:
> On Wed, Dec 24, 2008 at 06:49:15PM -0500, Theodore Tso wrote:
> > I have no idea why you would need to do manual __breads(). No doubt
> > I'm missing some context here.
>
> I'm writing (just for fun and learning purposes) a device mapper module
> that stores checksums on writes and verifies them on reads. The
> integrity metadata (currently just the checksum) is interleaved in the
> backing device: one sector holding the integrity metadata for the
> following 64 data sectors.

Alex and I discussed implementing checksums for ext4 using an external
device like this, and he might have some more design information for you.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
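The interleaved layout quoted above (one integrity-metadata sector followed by 64 data sectors) implies a simple sector mapping. The following is a minimal user-space sketch of that arithmetic, assuming the first group starts at physical sector 0 of the backing device; the function names are hypothetical and are not taken from the actual module.

```c
#include <assert.h>
#include <stdint.h>

/* From the description in the thread: 1 metadata sector, then 64 data sectors. */
#define DATA_SECTORS_PER_GROUP 64u
#define GROUP_SECTORS (DATA_SECTORS_PER_GROUP + 1u)

/* Physical sector holding the integrity metadata for a logical sector. */
static inline uint64_t imd_sector(uint64_t logical)
{
	return (logical / DATA_SECTORS_PER_GROUP) * GROUP_SECTORS;
}

/* Physical sector holding the data of a logical sector. */
static inline uint64_t data_sector(uint64_t logical)
{
	uint64_t group = logical / DATA_SECTORS_PER_GROUP;
	uint64_t offset = logical % DATA_SECTORS_PER_GROUP;

	/* Skip the metadata sector at the start of the group. */
	return group * GROUP_SECTORS + 1 + offset;
}

/* Inverse of data_sector(): logical sector stored at a physical data sector. */
static inline uint64_t logical_sector(uint64_t physical)
{
	/* The first sector of each group holds metadata, not data. */
	assert(physical % GROUP_SECTORS != 0);
	return (physical / GROUP_SECTORS) * DATA_SECTORS_PER_GROUP
		+ physical % GROUP_SECTORS - 1;
}
```

For example, logical sector 1000 falls in group 15, so its metadata lives in physical sector 975 and its data in physical sector 1016. The space overhead of this layout is one sector in every 65.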
* RE: Re: jbd2 inside a device mapper module
From: Shyam_Iyer @ 2008-12-29 6:20 UTC
To: dm-devel, albertito, Alex.Zhuravlev
Cc: linux-ext4, tytso, linux-kernel

Andreas Dilger wrote:
> On Dec 25, 2008 12:35 -0200, Alberto Bertogli wrote:
> > On Wed, Dec 24, 2008 at 06:49:15PM -0500, Theodore Tso wrote:
> > > I have no idea why you would need to do manual __breads(). No doubt
> > > I'm missing some context here.
> >
> > I'm writing (just for fun and learning purposes) a device mapper
> > module that stores checksums on writes and verifies them on reads. The
> > integrity metadata (currently just the checksum) is interleaved in the
> > backing device: one sector holding the integrity metadata for the
> > following 64 data sectors.
>
> Alex and I discussed implementing checksums for ext4 using an external
> device like this, and he might have some more design information for you.

That external device could possibly be a TPM chip that can store
checksums.

Shyam Iyer
Dell Linux Engineering
* Re: [dm-devel] Re: jbd2 inside a device mapper module
From: Alberto Bertogli @ 2008-12-29 21:05 UTC
To: Shyam_Iyer
Cc: dm-devel, Alex.Zhuravlev, linux-ext4, tytso, linux-kernel

On Mon, Dec 29, 2008 at 11:50:14AM +0530, Shyam_Iyer@Dell.com wrote:
> Andreas Dilger wrote:
> > On Dec 25, 2008 12:35 -0200, Alberto Bertogli wrote:
> > > On Wed, Dec 24, 2008 at 06:49:15PM -0500, Theodore Tso wrote:
> > > > I have no idea why you would need to do manual __breads(). No doubt
> > > > I'm missing some context here.
> > >
> > > I'm writing (just for fun and learning purposes) a device mapper
> > > module that stores checksums on writes and verifies them on reads. The
> > > integrity metadata (currently just the checksum) is interleaved in the
> > > backing device: one sector holding the integrity metadata for the
> > > following 64 data sectors.
> >
> > Alex and I discussed implementing checksums for ext4 using an external
> > device like this, and he might have some more design information for
> > you.
>
> That external device could possibly be a TPM chip that can store
> checksums.

Thanks for the suggestion. The code I have at the moment (without the
journal stuff) already has the capability of storing checksums in a
different device. It's one of the reasons why I would prefer to avoid
using jbd.

I think I'll go with the "two metadatas" approach and see how it goes.
Worst case scenario is that I have to drop that code, which means being
back where I am now, only with one less option.

Thanks,
		Alberto
* Re: [dm-devel] Re: jbd2 inside a device mapper module
From: Alex Tomas @ 2008-12-30 6:55 UTC
To: Alberto Bertogli
Cc: Shyam_Iyer, dm-devel, linux-ext4, tytso, linux-kernel

one good thing about JBD is that you can't update target block and csum
atomically. so, either you use some form of COW or you use journalling.
given we already have JBD it'd make sense to use it?

thanks, Alex

Alberto Bertogli wrote:
> I think I'll go with the "two metadatas" approach and see how it goes.
> Worst case scenario is that I have to drop that code, which means to be
> back where I am now, only with one less option.
>
> Thanks,
> 		Alberto
* Re: [dm-devel] Re: jbd2 inside a device mapper module
From: Alberto Bertogli @ 2008-12-30 13:51 UTC
To: Alex Tomas
Cc: Shyam_Iyer, dm-devel, linux-ext4, tytso, linux-kernel

On Tue, Dec 30, 2008 at 09:55:57AM +0300, Alex Tomas wrote:
> one good thing about JBD is that you can't update target block and csum
> atomically. so, either you use some form of COW or you use journalling.
> given we already have JBD it'd make sense to use it?

I'm sorry, but I'm not following. Is that first sentence right?

The main disadvantage I see of using jbd at the moment is that I lose
the possibility of having checksums and data on different devices.

The only alternative to jbd that I have at the moment is the "two
metadatas" approach I explained in another email (but please let me know
if it wasn't clear).

They both provide what I need (atomicity in data and csum writes). One
is easier and more tested, but prevents a feature. The other is a bit
more difficult, untested and written by me, but allows a feature. I have
no idea how they will behave performance-wise (according to the other
emails, both are expected to suck).

At this moment I'm going with the two metadatas approach, because I
think it has fewer limitations and it'd be fun to write. If it then
turns out to be unfit for some reason, I can always go back and use jbd.

But I'm obviously open to suggestions and more alternatives.

Thanks a lot,
		Alberto
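One possible reading of the "two metadatas" approach is sketched below in user-space C. The thread only states that updates alternate between M1 and M2; the generation counter, the rule of overwriting the older slot, and the rule of accepting data that matches either slot are assumptions made for illustration, and all names are hypothetical. Adler-32 stands in for whatever checksum the module really uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One metadata copy. The generation field is an assumption of this sketch. */
struct imd_slot {
	uint32_t generation; /* 0 means the slot was never written */
	uint32_t csum;
};

/* The two copies (M1 and M2 in the thread) kept for one data block. */
struct imd_pair {
	struct imd_slot m[2];
};

/* Adler-32, used here only as a stand-in checksum. */
static inline uint32_t toy_csum(const uint8_t *data, size_t len)
{
	uint32_t a = 1, b = 0;

	for (size_t i = 0; i < len; i++) {
		a = (a + data[i]) % 65521u;
		b = (b + a) % 65521u;
	}
	return (b << 16) | a;
}

/* The slot to overwrite next is the older one, so M1 and M2 alternate. */
static inline int imd_next_slot(const struct imd_pair *p)
{
	return p->m[0].generation <= p->m[1].generation ? 0 : 1;
}

/* Record the checksum of the data about to be written (wraparound ignored). */
static inline void imd_update(struct imd_pair *p, const uint8_t *data, size_t len)
{
	int next = imd_next_slot(p);

	assert(data != NULL);
	p->m[next].csum = toy_csum(data, len);
	p->m[next].generation = p->m[1 - next].generation + 1;
}

/*
 * Data is accepted if it matches either written slot: the newer slot after a
 * completed write, the older slot if a crash hit before the data landed.
 */
static inline int imd_verify(const struct imd_pair *p, const uint8_t *data, size_t len)
{
	uint32_t c = toy_csum(data, len);

	for (int i = 0; i < 2; i++)
		if (p->m[i].generation != 0 && p->m[i].csum == c)
			return 1;
	return 0;
}
```

Under these assumptions, a crash between the metadata write and the data write leaves the old data on disk together with a slot that still matches it, so the block verifies either way. The sketch says nothing about ordering the two writes or about their cost, which is the performance concern raised earlier in the thread.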