* mirrored device with thousands of mapping table entries
From: Eli Malul
Date: 2011-02-28 11:21 UTC
To: dm-devel

Hi,

I am trying to create a mirrored device with thousands of mapping table
entries. Unfortunately, the kernel memory allocated to satisfy this
configuration seems unbearable: about 6G for 10,000 entries, even though
the region size used for each entry is minimal (i.e. 8 blocks).

My questions are as follows:

1. Has anyone tried before to create a mirrored device with thousands
   of targets?
2. What is the estimated total memory allocated per target?
3. Is there something I am missing here?

Thanks a lot,
Eli
* Re: mirrored device with thousands of mapping table entries
From: Alasdair G Kergon
Date: 2011-02-28 11:48 UTC
To: device-mapper development

On Mon, Feb 28, 2011 at 01:21:10PM +0200, Eli Malul wrote:
> 1. Has anyone tried before to create a mirrored device with
>    thousands of targets?
> 2. What is the estimated total memory allocated per target?
> 3. Is there something I am missing here?

Please show us an example, maybe the first few lines of the tables you
are loading.

Alasdair
* Re: mirrored device with thousands of mapping table entries
From: Eli Malul
Date: 2011-02-28 11:59 UTC
To: Alasdair G Kergon, device-mapper development

The first few lines of the table I am loading:

[root@vpc09 ~]# vi /tmp/mirror_table.txt
0 12 km_mirror core 2 8 nosync 2 /dev/loop0 0 /dev/loop1 0
12 12 km_mirror core 2 8 nosync 2 /dev/loop0 12 /dev/loop1 12
24 12 km_mirror core 2 8 nosync 2 /dev/loop0 24 /dev/loop1 24
36 12 km_mirror core 2 8 nosync 2 /dev/loop0 36 /dev/loop1 36
48 12 km_mirror core 2 8 nosync 2 /dev/loop0 48 /dev/loop1 48
60 12 km_mirror core 2 8 nosync 2 /dev/loop0 60 /dev/loop1 60
72 12 km_mirror core 2 8 nosync 2 /dev/loop0 72 /dev/loop1 72
84 12 km_mirror core 2 8 nosync 2 /dev/loop0 84 /dev/loop1 84
96 12 km_mirror core 2 8 nosync 2 /dev/loop0 96 /dev/loop1 96
108 12 km_mirror core 2 8 nosync 2 /dev/loop0 108 /dev/loop1 108
...

I am also adding the output of the 'free' command, before and after
loading the table, so you can see the memory usage increasing:

[root@vpc09 ~]# free -g
             total       used       free     shared    buffers     cached
Mem:             7          0          6          0          0          0
-/+ buffers/cache:          0          7
Swap:            0          0          0

[root@vpc09 ~]# free -g
             total       used       free     shared    buffers     cached
Mem:             7          6          0          0          0          0
-/+ buffers/cache:          6          1
Swap:            0          0          0
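For reference, a synthetic table of this shape (one tiny mirror target per extent) can be generated with a short script. This is a sketch based on the transcript above; `km_mirror` appears to be a locally built variant of the stock dm `mirror` target, and the loop devices are the ones from the example:

```python
def make_mirror_table(num_extents, extent_sectors=12, region_size=8,
                      dev_a="/dev/loop0", dev_b="/dev/loop1"):
    """Emit one dm mirror target line per extent:
    start length target log-type #log-args region-size nosync
    #mirrors dev_a offset dev_b offset."""
    lines = []
    for i in range(num_extents):
        start = i * extent_sectors
        lines.append(f"{start} {extent_sectors} km_mirror core 2 "
                     f"{region_size} nosync 2 {dev_a} {start} {dev_b} {start}")
    return "\n".join(lines)

# The first three lines match the table shown above:
print(make_mirror_table(3))
# 0 12 km_mirror core 2 8 nosync 2 /dev/loop0 0 /dev/loop1 0
# 12 12 km_mirror core 2 8 nosync 2 /dev/loop0 12 /dev/loop1 12
# 24 12 km_mirror core 2 8 nosync 2 /dev/loop0 24 /dev/loop1 24
```

Loading 10,000 such lines as one table is what triggers the memory growth discussed in the rest of the thread.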
* Re: mirrored device with thousands of mapping table entries
From: Alasdair G Kergon
Date: 2011-02-28 12:11 UTC
To: Eli Malul
Cc: device-mapper development

On Mon, Feb 28, 2011 at 01:59:16PM +0200, Eli Malul wrote:
> The first few lines of the table I am loading:
> [root@vpc09 ~]# vi /tmp/mirror_table.txt
> 0 12 km_mirror core 2 8 nosync 2 /dev/loop0 0 /dev/loop1 0
> 12 12 km_mirror core 2 8 nosync 2 /dev/loop0 12 /dev/loop1 12
> 24 12 km_mirror core 2 8 nosync 2 /dev/loop0 24 /dev/loop1 24
> 36 12 km_mirror core 2 8 nosync 2 /dev/loop0 36 /dev/loop1 36
> 48 12 km_mirror core 2 8 nosync 2 /dev/loop0 48 /dev/loop1 48
> 60 12 km_mirror core 2 8 nosync 2 /dev/loop0 60 /dev/loop1 60
> 72 12 km_mirror core 2 8 nosync 2 /dev/loop0 72 /dev/loop1 72
> 84 12 km_mirror core 2 8 nosync 2 /dev/loop0 84 /dev/loop1 84
> 96 12 km_mirror core 2 8 nosync 2 /dev/loop0 96 /dev/loop1 96
> 108 12 km_mirror core 2 8 nosync 2 /dev/loop0 108 /dev/loop1 108

Why do you need this, rather than one mirror target covering the whole
device?

(And if, unlike that example, your extents are not contiguous, create
two new devices that join them together, and mirror those. That's what
LVM does.)

Alasdair
* Re: mirrored device with thousands of mapping table entries
From: Eli Malul
Date: 2011-02-28 12:17 UTC
To: Alasdair G Kergon
Cc: device-mapper development

I expect to have thousands of non-contiguous extents (so I created a
synthetic mapping table to test the device-mapper behavior).

Any suggestions?
* Re: mirrored device with thousands of mapping table entries
From: Alasdair G Kergon
Date: 2011-02-28 13:10 UTC
To: Eli Malul
Cc: device-mapper development

On Mon, Feb 28, 2011 at 02:17:52PM +0200, Eli Malul wrote:
> I expect to have thousands of non-contiguous extents (so I created a
> synthetic mapping table to test the device-mapper behavior).
>
> Any suggestions?

> > (And if, unlike that example, your extents are not contiguous,
> > create two new devices that join them together, and mirror those.
> > That's what LVM does.)

That:

  dm0 = list 0
  dm1 = list 1
  dm2 = mirror of dm0 and dm1

Alasdair
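Concretely, that layout can be sketched as follows. The extent offsets here are hypothetical, the stock `linear` and `mirror` targets are assumed, and each helper just emits the dm table text you would feed to `dmsetup create`:

```python
def linear_join_table(dev, extents):
    """Join non-contiguous (offset, length) extents of `dev`, given in
    512-byte sectors, into one contiguous linear dm device.
    Returns the table text and the total joined size."""
    lines, start = [], 0
    for offset, length in extents:
        lines.append(f"{start} {length} linear {dev} {offset}")
        start += length
    return "\n".join(lines), start

# Hypothetical sparse extents, laid out identically on both legs:
extents = [(0, 12), (100, 12), (250, 12)]

dm0, size = linear_join_table("/dev/loop0", extents)  # dm0 = list 0
dm1, _ = linear_join_table("/dev/loop1", extents)     # dm1 = list 1

# dm2 = one mirror target covering the two joined devices:
dm2 = f"0 {size} mirror core 2 8 nosync 2 /dev/mapper/dm0 0 /dev/mapper/dm1 0"

print(dm0)
print(dm2)
```

Instead of thousands of mirror targets there is now exactly one, so only a single mirror's worth of per-target state gets allocated.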
* Re: mirrored device with thousands of mapping table entries
From: Eli Malul
Date: 2011-02-28 13:13 UTC
To: Alasdair G Kergon
Cc: device-mapper development

OK, I will try that and check its implications, thanks!!

But do you know why having 10,000 extents is causing this overwhelming
memory usage?
* Re: mirrored device with thousands of mapping table entries
From: Alasdair G Kergon
Date: 2011-02-28 13:29 UTC
To: Eli Malul
Cc: device-mapper development

On Mon, Feb 28, 2011 at 03:13:42PM +0200, Eli Malul wrote:
> But do you know why having 10,000 extents is causing this
> overwhelming memory usage?

Nobody considered having a huge number of tiny mirrors reasonable or
necessary, so no attempt has been made to optimise that situation.

Alasdair
* Re: mirrored device with thousands of mapping table entries
From: Eli Malul
Date: 2011-02-28 13:42 UTC
To: Alasdair G Kergon
Cc: device-mapper development

According to your suggestion I need to map the non-contiguous extents
into linear devices and mirror those. But what if I am required to
preserve the original extents' offsets? I need to mirror an existing
device with user data already on it...
* Re: mirrored device with thousands of mapping table entries
From: Eli Malul
Date: 2011-02-28 16:25 UTC
To: Alasdair G Kergon
Cc: device-mapper development

One way to achieve that is to create two linear mapped devices and
mirror them, plus two more linear devices to simulate the original
offsets. The latter shall be mapped onto the first two linear devices,
so the client can continue to read and write the same offsets it is
used to.

It seems like a somewhat complex configuration, but it should do the
trick. I will test its scalability and report back, of course.

BTW - CONFIG_BLK_DEV_INTEGRITY is disabled.
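The remapping layer described above can be sketched like this (simplified to a single top-level device; the device names and extents are hypothetical). Since a dm table may not contain holes, the ranges between mirrored extents have to be filled explicitly, here by passing them straight through to the original device:

```python
def overlay_table(extents, total_sectors, orig_dev="/dev/loop0",
                  mirror_dev="/dev/mapper/dm2"):
    """Top-level linear table that preserves the original offsets:
    each mirrored extent is routed to its packed position on the
    mirror device, and the gaps map through to the original device."""
    lines, cursor, packed = [], 0, 0
    for offset, length in sorted(extents):
        if offset > cursor:  # fill the hole before this extent
            lines.append(f"{cursor} {offset - cursor} linear {orig_dev} {cursor}")
        lines.append(f"{offset} {length} linear {mirror_dev} {packed}")
        cursor, packed = offset + length, packed + length
    if cursor < total_sectors:  # trailing unmirrored tail
        lines.append(f"{cursor} {total_sectors - cursor} linear {orig_dev} {cursor}")
    return "\n".join(lines)

print(overlay_table([(0, 12), (100, 12)], 200))
# 0 12 linear /dev/mapper/dm2 0
# 12 88 linear /dev/loop0 12
# 100 12 linear /dev/mapper/dm2 12
# 112 88 linear /dev/loop0 112
```

The table stays linear-only at the top, so it scales to thousands of entries without the per-mirror overhead discussed in this thread.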
* Re: mirrored device with thousands of mapping table entries
From: Zdenek Kabelac
Date: 2011-02-28 13:38 UTC
To: dm-devel

On 28.2.2011 14:13, Eli Malul wrote:
>
> OK, I will try that and check its implications, thanks!!
>
> But do you know why having 10,000 extents is causing this
> overwhelming memory usage?

If you really need that many devices, also be sure you have disabled
this kernel config option:

  CONFIG_BLK_DEV_INTEGRITY

For some reason it consumes a massive amount of memory. Other than
that, you should usually count on ~64KB per device.

Zdenek
* Re: mirrored device with thousands of mapping table entries
From: Martin K. Petersen
Date: 2011-02-28 15:01 UTC
To: device-mapper development

>>>>> "Zdenek" == Zdenek Kabelac <zkabelac@redhat.com> writes:

Zdenek> If you really need that many devices, also be sure you have
Zdenek> disabled this kernel config option:
Zdenek>
Zdenek>   CONFIG_BLK_DEV_INTEGRITY
Zdenek>
Zdenek> For some reason it consumes a massive amount of memory. Other
Zdenek> than that, you should usually count on ~64KB per device.

Care to qualify that?

Unless your HBA indicates that it supports data integrity, the only
penalty should be that each bio grows a pointer.

-- 
Martin K. Petersen
Oracle Linux Engineering
* Re: mirrored device with thousands of mapping table entries
From: Zdenek Kabelac
Date: 2011-03-06 20:39 UTC
To: dm-devel

On 28.2.2011 16:01, Martin K. Petersen wrote:
>
> Zdenek> If you really need that many devices, also be sure you have
> Zdenek> disabled this kernel config option:
>
> Zdenek>   CONFIG_BLK_DEV_INTEGRITY
>
> Zdenek> For some reason it consumes a massive amount of memory. Other
> Zdenek> than that, you should usually count on ~64KB per device.
>
> Care to qualify that?
>
> Unless your HBA indicates that it supports data integrity, the only
> penalty should be that each bio grows a pointer.

OK, I've taken some new measurements on my machine (possibly not the
best way to configure Linux kernel 2.6.38-rc7). As I'm using some
kernel memory debugging, it might not apply in the same way to a
non-debug kernel.

My findings seem to show that the BIP-256 slabtop segment grows by
~73KB per device (while dm-io is about ~26KB).

That makes the consumption ~730MB when 10,000 devices are in the game.

Of course there are some other big slowdowns, so it's not the biggest
problem. However, if I understand it correctly, when I don't have
hardware with bio integrity support this memory is essentially wasted,
as it will never be used?

Of course, people who plan to use such a massive number of devices
should probably use hardware where such a memory loss isn't a big
problem, but it's quite noticeable on my testing laptop with just
4G :)

Another minor issue is the delay in device creation time (e.g.
activation of ~7000 devices with BIP is delayed by ~30%).

Zdenek
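As a quick sanity check on the arithmetic behind the ~730MB figure (using the per-device slab growth measured above):

```python
bip_kb_per_device = 73    # measured BIP-256 slab growth per device
dmio_kb_per_device = 26   # measured dm-io slab growth per device
devices = 10_000

bip_total_mb = bip_kb_per_device * devices // 1000
dmio_total_mb = dmio_kb_per_device * devices // 1000

print(f"BIP-256: ~{bip_total_mb}MB, dm-io: ~{dmio_total_mb}MB")
# BIP-256: ~730MB, dm-io: ~260MB
```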
* Re: mirrored device with thousands of mapping table entries
From: Martin K. Petersen
Date: 2011-03-07 2:59 UTC
To: device-mapper development

>>>>> "Zdenek" == Zdenek Kabelac <zkabelac@redhat.com> writes:

Zdenek> My findings seem to show that the BIP-256 slabtop segment grows
Zdenek> by ~73KB per device (while dm-io is about ~26KB).

Ok, I see it now that I tried with a bunch of DM devices.

DM allocates a bioset per volume. And since each bioset has an
integrity mempool, you'll end up with a bunch of memory locked down. It
seems like a lot, but it's actually the same amount as we reserve for
the data path (bio-0 + biovec-256).

Since a bioset is not necessarily tied to a single block device, we
can't automatically decide whether to allocate the integrity pool or
not. In the DM case, however, we just set up the integrity profile, so
the information is available.

Can you please try the following patch? This will change things so we
only attach an integrity pool to the bioset if the logical volume is
integrity-capable.

-- 
Martin K. Petersen
Oracle Linux Engineering


diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 38e4eb1..37a1b77 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -55,6 +55,7 @@ struct dm_table {
 	struct dm_target *targets;
 
 	unsigned discards_supported:1;
+	unsigned integrity_supported:1;
 
 	/*
 	 * Indicates the rw permissions for the new logical
@@ -859,7 +860,11 @@ int dm_table_alloc_md_mempools(struct dm_table *t)
 		return -EINVAL;
 	}
 
-	t->mempools = dm_alloc_md_mempools(type);
+	if (t->integrity_supported)
+		t->mempools = dm_alloc_md_mempools(type, 0);
+	else
+		t->mempools = dm_alloc_md_mempools(type, BIOSET_NO_INTEGRITY);
+
 	if (!t->mempools)
 		return -ENOMEM;
 
@@ -935,8 +940,10 @@ static int dm_table_prealloc_integrity(struct dm_table *t, struct mapped_device
 	struct dm_dev_internal *dd;
 
 	list_for_each_entry(dd, devices, list)
-		if (bdev_get_integrity(dd->dm_dev.bdev))
+		if (bdev_get_integrity(dd->dm_dev.bdev)) {
+			t->integrity_supported = 1;
 			return blk_integrity_register(dm_disk(md), NULL);
+		}
 
 	return 0;
 }
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index eaa3af0..f6146b5 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2643,7 +2643,7 @@ int dm_noflush_suspending(struct dm_target *ti)
 }
 EXPORT_SYMBOL_GPL(dm_noflush_suspending);
 
-struct dm_md_mempools *dm_alloc_md_mempools(unsigned type)
+struct dm_md_mempools *dm_alloc_md_mempools(unsigned type, unsigned int flags)
 {
 	struct dm_md_mempools *pools = kmalloc(sizeof(*pools), GFP_KERNEL);
 
@@ -2663,7 +2663,8 @@ struct dm_md_mempools *dm_alloc_md_mempools(unsigned type)
 		goto free_io_pool_and_out;
 
 	pools->bs = (type == DM_TYPE_BIO_BASED) ?
-		bioset_create(16, 0) : bioset_create(MIN_IOS, 0);
+		bioset_create_flags(16, 0, flags) :
+		bioset_create_flags(MIN_IOS, 0, flags);
 	if (!pools->bs)
 		goto free_tio_pool_and_out;
 
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 0c2dd5f..d846ce0 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -149,7 +149,7 @@ void dm_kcopyd_exit(void);
 /*
  * Mempool operations
  */
-struct dm_md_mempools *dm_alloc_md_mempools(unsigned type);
+struct dm_md_mempools *dm_alloc_md_mempools(unsigned type, unsigned int);
 void dm_free_md_mempools(struct dm_md_mempools *pools);
 
 #endif
diff --git a/fs/bio.c b/fs/bio.c
index 4bd454f..6e4a381 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1603,9 +1603,10 @@ void bioset_free(struct bio_set *bs)
 EXPORT_SYMBOL(bioset_free);
 
 /**
- * bioset_create  - Create a bio_set
+ * bioset_create_flags  - Create a bio_set
  * @pool_size:	Number of bio and bio_vecs to cache in the mempool
  * @front_pad:	Number of bytes to allocate in front of the returned bio
+ * @flags:	Flags that affect memory allocation
  *
  * Description:
  *    Set up a bio_set to be used with @bio_alloc_bioset. Allows the caller
@@ -1615,7 +1616,8 @@ EXPORT_SYMBOL(bioset_free);
  *    Note that the bio must be embedded at the END of that structure always,
  *    or things will break badly.
  */
-struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
+struct bio_set *bioset_create_flags(unsigned int pool_size,
+				    unsigned int front_pad, unsigned int flags)
 {
 	unsigned int back_pad = BIO_INLINE_VECS * sizeof(struct bio_vec);
 	struct bio_set *bs;
@@ -1636,7 +1638,8 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
 	if (!bs->bio_pool)
 		goto bad;
 
-	if (bioset_integrity_create(bs, pool_size))
+	if ((flags & BIOSET_NO_INTEGRITY) == 0 &&
+	    bioset_integrity_create(bs, pool_size))
 		goto bad;
 
 	if (!biovec_create_pools(bs, pool_size))
@@ -1646,6 +1649,12 @@ bad:
 	bioset_free(bs);
 	return NULL;
 }
+EXPORT_SYMBOL(bioset_create_flags);
+
+struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
+{
+	return bioset_create_flags(pool_size, front_pad, 0);
+}
 EXPORT_SYMBOL(bioset_create);
 
 static void __init biovec_init_slabs(void)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 35dcdb3..2f758f3 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -208,7 +208,12 @@ struct bio_pair {
 extern struct bio_pair *bio_split(struct bio *bi, int first_sectors);
 extern void bio_pair_release(struct bio_pair *dbio);
 
+enum bioset_flags {
+	BIOSET_NO_INTEGRITY	= (1 << 0),
+};
+
 extern struct bio_set *bioset_create(unsigned int, unsigned int);
+extern struct bio_set *bioset_create_flags(unsigned int, unsigned int, unsigned int);
 extern void bioset_free(struct bio_set *);
 extern struct bio *bio_alloc(gfp_t, int);
* Re: mirrored device with thousands of mapping table entries
From: Zdenek Kabelac
Date: 2011-03-07 14:24 UTC
To: dm-devel

On 7.3.2011 03:59, Martin K. Petersen wrote:
>
> Can you please try the following patch? This will change things so we
> only attach an integrity pool to the bioset if the logical volume is
> integrity-capable.

Yep - the patch seems to fix the problem with wasted memory.

Thanks

Tested-by: Zdenek Kabelac <zkabelac@redhat.com>
* Re: mirrored device with thousands of mapping table entries
From: Mike Snitzer
Date: 2011-03-07 16:09 UTC
To: Martin K. Petersen
Cc: device-mapper development, Zdenek Kabelac

Hi Martin,

On Sun, Mar 06 2011 at 9:59pm -0500,
Martin K. Petersen <martin.petersen@oracle.com> wrote:

> DM allocates a bioset per volume. And since each bioset has an
> integrity mempool you'll end up with a bunch of memory locked down.
> It seems like a lot but it's actually the same amount as we reserve
> for the data path (bio-0 + biovec-256).
>
> Since a bioset is not necessarily tied to a single block device we
> can't automatically decide whether to allocate the integrity pool or
> not. In the DM case, however, we just set up the integrity profile so
> the information is available.
>
> Can you please try the following patch? This will change things so we
> only attach an integrity pool to the bioset if the logical volume is
> integrity-capable.

Thanks for sorting this out. Can you post a patch that has a proper
header with your Signed-off-by? Also including Zdenek's Tested-by, and
my:

Acked-by: Mike Snitzer <snitzer@redhat.com>

Thanks,
Mike

p.s. one small nit:

> diff --git a/fs/bio.c b/fs/bio.c
> index 4bd454f..6e4a381 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -1636,7 +1638,8 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
>  	if (!bs->bio_pool)
>  		goto bad;
>  
> -	if (bioset_integrity_create(bs, pool_size))
> +	if ((flags & BIOSET_NO_INTEGRITY) == 0 &&
> +	    bioset_integrity_create(bs, pool_size))
>  		goto bad;
>  
>  	if (!biovec_create_pools(bs, pool_size))

We'd generally see:

  if (!(flags & BIOSET_NO_INTEGRITY) && ...)

but then the double negative is maybe a bit more challenging to deal
with (for a human). In the end, not a big deal either way ;)
* Re: mirrored device with thousands of mapping table entries
From: Martin K. Petersen
Date: 2011-03-08 6:51 UTC
To: Mike Snitzer
Cc: Jens Axboe, device-mapper development, Martin K. Petersen, Zdenek Kabelac

>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> Thanks for sorting this out. Can you post a patch that has a
Mike> proper header with your Signed-off-by? Also including Zdenek's
Mike> Tested-by, and my:

I contemplated a bit and decided I liked the following approach better.
What do you think?


block: Require subsystems to explicitly allocate bio_set integrity mempool

MD and DM create a new bio_set for every metadevice. Each bio_set has an
integrity mempool attached regardless of whether the metadevice is
capable of passing integrity metadata. This is a waste of memory.

Instead we defer the allocation decision to MD and DM since we know at
metadevice creation time whether integrity passthrough is needed or not.

Automatic integrity mempool allocation can then be removed from
bioset_create() and we make an explicit integrity allocation for the
fs_bio_set.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
---

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 38e4eb1..4e8e314 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -55,6 +55,7 @@ struct dm_table {
 	struct dm_target *targets;
 
 	unsigned discards_supported:1;
+	unsigned integrity_supported:1;
 
 	/*
 	 * Indicates the rw permissions for the new logical
@@ -859,7 +860,7 @@ int dm_table_alloc_md_mempools(struct dm_table *t)
 		return -EINVAL;
 	}
 
-	t->mempools = dm_alloc_md_mempools(type);
+	t->mempools = dm_alloc_md_mempools(type, t->integrity_supported);
 	if (!t->mempools)
 		return -ENOMEM;
 
@@ -935,8 +936,10 @@ static int dm_table_prealloc_integrity(struct dm_table *t, struct mapped_device
 	struct dm_dev_internal *dd;
 
 	list_for_each_entry(dd, devices, list)
-		if (bdev_get_integrity(dd->dm_dev.bdev))
+		if (bdev_get_integrity(dd->dm_dev.bdev)) {
+			t->integrity_supported = 1;
 			return blk_integrity_register(dm_disk(md), NULL);
+		}
 
 	return 0;
 }
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index eaa3af0..b55ef95 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2643,9 +2643,10 @@ int dm_noflush_suspending(struct dm_target *ti)
 }
 EXPORT_SYMBOL_GPL(dm_noflush_suspending);
 
-struct dm_md_mempools *dm_alloc_md_mempools(unsigned type)
+struct dm_md_mempools *dm_alloc_md_mempools(unsigned type, unsigned int integrity)
 {
 	struct dm_md_mempools *pools = kmalloc(sizeof(*pools), GFP_KERNEL);
+	unsigned int pool_size = (type == DM_TYPE_BIO_BASED) ? 16 : MIN_IOS;
 
 	if (!pools)
 		return NULL;
@@ -2662,11 +2663,13 @@ struct dm_md_mempools *dm_alloc_md_mempools(unsigned type)
 	if (!pools->tio_pool)
 		goto free_io_pool_and_out;
 
-	pools->bs = (type == DM_TYPE_BIO_BASED) ?
-		bioset_create(16, 0) : bioset_create(MIN_IOS, 0);
+	pools->bs = bioset_create(pool_size, 0);
 	if (!pools->bs)
 		goto free_tio_pool_and_out;
 
+	if (integrity && bioset_integrity_create(pools->bs, pool_size))
+		goto free_tio_pool_and_out;
+
 	return pools;
 
 free_tio_pool_and_out:
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 0c2dd5f..1aaf167 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -149,7 +149,7 @@ void dm_kcopyd_exit(void);
 /*
  * Mempool operations
  */
-struct dm_md_mempools *dm_alloc_md_mempools(unsigned type);
+struct dm_md_mempools *dm_alloc_md_mempools(unsigned type, unsigned integrity);
 void dm_free_md_mempools(struct dm_md_mempools *pools);
 
 #endif
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 818313e..f11e0bc 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1804,8 +1804,12 @@ int md_integrity_register(mddev_t *mddev)
 		       mdname(mddev));
 		return -EINVAL;
 	}
-	printk(KERN_NOTICE "md: data integrity on %s enabled\n",
-	       mdname(mddev));
+	printk(KERN_NOTICE "md: data integrity enabled on %s\n", mdname(mddev));
+	if (bioset_integrity_create(mddev->bio_set, BIO_POOL_SIZE)) {
+		printk(KERN_ERR "md: failed to create integrity pool for %s\n",
+		       mdname(mddev));
+		return -EINVAL;
+	}
 	return 0;
 }
 EXPORT_SYMBOL(md_integrity_register);
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index e49cce2..9c5e6b2 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -761,6 +761,9 @@ int bioset_integrity_create(struct bio_set *bs, int pool_size)
 {
 	unsigned int max_slab = vecs_to_idx(BIO_MAX_PAGES);
 
+	if (bs->bio_integrity_pool)
+		return 0;
+
 	bs->bio_integrity_pool =
 		mempool_create_slab_pool(pool_size, bip_slab[max_slab].slab);
 
diff --git a/fs/bio.c b/fs/bio.c
index 4bd454f..06aebf9 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1636,9 +1636,6 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
 	if (!bs->bio_pool)
 		goto bad;
 
-	if (bioset_integrity_create(bs, pool_size))
-		goto bad;
-
 	if (!biovec_create_pools(bs, pool_size))
 		return bs;
 
@@ -1684,6 +1681,9 @@ static int __init init_bio(void)
 	if (!fs_bio_set)
 		panic("bio: can't allocate bios\n");
 
+	if (bioset_integrity_create(fs_bio_set, BIO_POOL_SIZE))
+		panic("bio: can't create integrity pool\n");
+
 	bio_split_pool = mempool_create_kmalloc_pool(BIO_SPLIT_ENTRIES,
 						     sizeof(struct bio_pair));
 	if (!bio_split_pool)
* Re: mirrored device with thousand of mappingtableentries
  2011-03-08  6:51 ` Martin K. Petersen
@ 2011-03-08 17:13 ` Mike Snitzer
  2011-03-10 16:11 ` Martin K. Petersen
  0 siblings, 1 reply; 23+ messages in thread
From: Mike Snitzer @ 2011-03-08 17:13 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Jens Axboe, device-mapper development, Martin K. Petersen, Zdenek Kabelac

On Tue, Mar 08 2011 at  1:51am -0500,
Martin K. Petersen <mkp@mkp.net> wrote:

> >>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:
>
> Mike> Thanks for sorting this out.  Can you post a patch that has a
> Mike> proper header with your Signed-off-by?  Also including Zdenek's
> Mike> Tested-by, and my:
>
> I contemplated a bit and decided I liked the following approach
> better.  What do you think?
>
>
> block: Require subsystems to explicitly allocate bio_set integrity mempool
>
> MD and DM create a new bio_set for every metadevice.  Each bio_set has an
> integrity mempool attached regardless of whether the metadevice is
> capable of passing integrity metadata.  This is a waste of memory.
>
> Instead we defer the allocation decision to MD and DM since we know at
> metadevice creation time whether integrity passthrough is needed or not.
>
> Automatic integrity mempool allocation can then be removed from
> bioset_create() and we make an explicit integrity allocation for the
> fs_bio_set.
>
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> Reported-by: Zdenek Kabelac <zkabelac@redhat.com>

In general this approach looks fine too.  Doesn't address the concern I
had yesterday about the blk_integrity block_size relative to DM (load vs
resume) but I'd imagine you'd like to sort that out in a separate patch.

But I found a few DM issues below.

Provided those are sorted out:

Acked-by: Mike Snitzer <snitzer@redhat.com>

(I'll defer to Jens and Neil for the block and MD bits)

> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index eaa3af0..b55ef95 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -2643,9 +2643,10 @@ int dm_noflush_suspending(struct dm_target *ti)
>  }
>  EXPORT_SYMBOL_GPL(dm_noflush_suspending);
>
> -struct dm_md_mempools *dm_alloc_md_mempools(unsigned type)
> +struct dm_md_mempools *dm_alloc_md_mempools(unsigned type, unsigned int integrity)
>  {

Any reason you didn't just use 'unsigned integrity' like you did below
in the dm.h prototype?

>  	struct dm_md_mempools *pools = kmalloc(sizeof(*pools), GFP_KERNEL);
> +	unsigned int pool_size = (type == DM_TYPE_BIO_BASED) ? 16 : MIN_IOS;
>
>  	if (!pools)
>  		return NULL;
> @@ -2662,11 +2663,13 @@ struct dm_md_mempools *dm_alloc_md_mempools(unsigned type)
>  	if (!pools->tio_pool)
>  		goto free_io_pool_and_out;
>
> -	pools->bs = (type == DM_TYPE_BIO_BASED) ?
> -		bioset_create(16, 0) : bioset_create(MIN_IOS, 0);
> +	pools->bs = bioset_create(pool_size, 0);
>  	if (!pools->bs)
>  		goto free_tio_pool_and_out;
>
> +	if (integrity && bioset_integrity_create(pools->bs, pool_size))
> +		goto  free_tio_pool_and_out;
> +
>  	return pools;
>
>  free_tio_pool_and_out:

Don't you need to bioset_free(pools->bs) if bioset_integrity_create()
fails?  So you'd need a new 'free_bioset_and_out' label.

Also seems like maybe you had some extra whitespace on the goto?

> diff --git a/drivers/md/dm.h b/drivers/md/dm.h
> index 0c2dd5f..1aaf167 100644
> --- a/drivers/md/dm.h
> +++ b/drivers/md/dm.h
> @@ -149,7 +149,7 @@ void dm_kcopyd_exit(void);
>  /*
>   * Mempool operations
>   */
> -struct dm_md_mempools *dm_alloc_md_mempools(unsigned type);
> +struct dm_md_mempools *dm_alloc_md_mempools(unsigned type, unsigned integrity);
>  void dm_free_md_mempools(struct dm_md_mempools *pools);
>
>  #endif

^ permalink raw reply	[flat|nested] 23+ messages in thread
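[Editorial note: the error path Mike flags above is the standard kernel layered-unwind idiom: each allocation failure jumps to a label that frees everything allocated before it, so a brand-new allocation step needs a brand-new label above the previously topmost one. The following is a minimal userspace sketch of that pattern only; `struct pools`, `xalloc()` and `alloc_pools()` are hypothetical stand-ins, not the kernel code.]

```c
#include <stdlib.h>

struct pools {
	void *io_pool;   /* stands in for pools->io_pool */
	void *tio_pool;  /* stands in for pools->tio_pool */
	void *bs;        /* stands in for the bioset */
};

/* Simulated allocator that can be forced to fail, so each
 * branch of the unwind path below can be exercised. */
static void *xalloc(int fail)
{
	return fail ? NULL : malloc(16);
}

/* Layered goto unwind: a failure at step N releases steps N-1..1.
 * Adding a new step (the integrity pool) therefore needs a new
 * label placed ABOVE the previously topmost one. */
struct pools *alloc_pools(int fail_at)
{
	struct pools *p = malloc(sizeof(*p));

	if (!p)
		return NULL;

	p->io_pool = xalloc(fail_at == 1);
	if (!p->io_pool)
		goto free_pools_and_out;

	p->tio_pool = xalloc(fail_at == 2);
	if (!p->tio_pool)
		goto free_io_pool_and_out;

	p->bs = xalloc(fail_at == 3);
	if (!p->bs)
		goto free_tio_pool_and_out;

	/* The new step Mike flagged: if this fails, p->bs is already
	 * allocated, so it needs its own label that frees it first. */
	if (fail_at == 4)
		goto free_bs_and_out;

	return p;

free_bs_and_out:
	free(p->bs);
free_tio_pool_and_out:
	free(p->tio_pool);
free_io_pool_and_out:
	free(p->io_pool);
free_pools_and_out:
	free(p);
	return NULL;
}
```

Falling through the labels in reverse allocation order is what makes the cleanup complete; jumping to `free_tio_pool_and_out` after the bioset had already been created, as in the first version of the patch, would leak the bioset.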
* Re: mirrored device with thousand of mappingtableentries
  2011-03-08 17:13 ` Mike Snitzer
@ 2011-03-10 16:11 ` Martin K. Petersen
  2011-03-11 16:53 ` [PATCH] block: Require subsystems to explicitly allocate bio_set integrity mempool (was: Re: mirrored device with thousand of mappingtableentries) Mike Snitzer
  0 siblings, 1 reply; 23+ messages in thread
From: Martin K. Petersen @ 2011-03-10 16:11 UTC (permalink / raw)
To: Mike Snitzer
Cc: Jens Axboe, device-mapper development, Martin K. Petersen, Zdenek Kabelac

>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> In general this approach looks fine too.  Doesn't address the
Mike> concern I had yesterday about the blk_integrity block_size
Mike> relative to DM (load vs resume) but I'd imagine you'd like to sort
Mike> that out in a separate patch.

Yeah, that'll come as part of my DIX1.1 patch set.

I believe I've addressed all your issues below.


block: Require subsystems to explicitly allocate bio_set integrity mempool

MD and DM create a new bio_set for every metadevice.  Each bio_set has an
integrity mempool attached regardless of whether the metadevice is
capable of passing integrity metadata.  This is a waste of memory.

Instead we defer the allocation decision to MD and DM since we know at
metadevice creation time whether integrity passthrough is needed or not.

Automatic integrity mempool allocation can then be removed from
bioset_create() and we make an explicit integrity allocation for the
fs_bio_set.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 38e4eb1..4e8e314 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -55,6 +55,7 @@ struct dm_table {
 	struct dm_target *targets;
 
 	unsigned discards_supported:1;
+	unsigned integrity_supported:1;
 
 	/*
 	 * Indicates the rw permissions for the new logical
@@ -859,7 +860,7 @@ int dm_table_alloc_md_mempools(struct dm_table *t)
 		return -EINVAL;
 	}
 
-	t->mempools = dm_alloc_md_mempools(type);
+	t->mempools = dm_alloc_md_mempools(type, t->integrity_supported);
 	if (!t->mempools)
 		return -ENOMEM;
 
@@ -935,8 +936,10 @@ static int dm_table_prealloc_integrity(struct dm_table *t, struct mapped_device
 	struct dm_dev_internal *dd;
 
 	list_for_each_entry(dd, devices, list)
-		if (bdev_get_integrity(dd->dm_dev.bdev))
+		if (bdev_get_integrity(dd->dm_dev.bdev)) {
+			t->integrity_supported = 1;
 			return blk_integrity_register(dm_disk(md), NULL);
+		}
 
 	return 0;
 }
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index eaa3af0..105963d 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2643,9 +2643,10 @@ int dm_noflush_suspending(struct dm_target *ti)
 }
 EXPORT_SYMBOL_GPL(dm_noflush_suspending);
 
-struct dm_md_mempools *dm_alloc_md_mempools(unsigned type)
+struct dm_md_mempools *dm_alloc_md_mempools(unsigned type, unsigned integrity)
 {
 	struct dm_md_mempools *pools = kmalloc(sizeof(*pools), GFP_KERNEL);
+	unsigned int pool_size = (type == DM_TYPE_BIO_BASED) ? 16 : MIN_IOS;
 
 	if (!pools)
 		return NULL;
@@ -2662,13 +2663,18 @@ struct dm_md_mempools *dm_alloc_md_mempools(unsigned type)
 	if (!pools->tio_pool)
 		goto free_io_pool_and_out;
 
-	pools->bs = (type == DM_TYPE_BIO_BASED) ?
-		bioset_create(16, 0) : bioset_create(MIN_IOS, 0);
+	pools->bs = bioset_create(pool_size, 0);
 	if (!pools->bs)
 		goto free_tio_pool_and_out;
 
+	if (integrity && bioset_integrity_create(pools->bs, pool_size))
+		goto free_bioset_and_out;
+
 	return pools;
 
+free_bioset_and_out:
+	bioset_free(pools->bs);
+
 free_tio_pool_and_out:
 	mempool_destroy(pools->tio_pool);
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 0c2dd5f..1aaf167 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -149,7 +149,7 @@ void dm_kcopyd_exit(void);
 /*
  * Mempool operations
  */
-struct dm_md_mempools *dm_alloc_md_mempools(unsigned type);
+struct dm_md_mempools *dm_alloc_md_mempools(unsigned type, unsigned integrity);
 void dm_free_md_mempools(struct dm_md_mempools *pools);
 
 #endif
diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 0ed7f6b..b317544 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -227,8 +227,7 @@ static int linear_run (mddev_t *mddev)
 	mddev->queue->unplug_fn = linear_unplug;
 	mddev->queue->backing_dev_info.congested_fn = linear_congested;
 	mddev->queue->backing_dev_info.congested_data = mddev;
-	md_integrity_register(mddev);
-	return 0;
+	return md_integrity_register(mddev);
 }
 
 static void free_conf(struct rcu_head *head)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 818313e..f11e0bc 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1804,8 +1804,12 @@ int md_integrity_register(mddev_t *mddev)
 			mdname(mddev));
 		return -EINVAL;
 	}
-	printk(KERN_NOTICE "md: data integrity on %s enabled\n",
-	       mdname(mddev));
+	printk(KERN_NOTICE "md: data integrity enabled on %s\n", mdname(mddev));
+	if (bioset_integrity_create(mddev->bio_set, BIO_POOL_SIZE)) {
+		printk(KERN_ERR "md: failed to create integrity pool for %s\n",
+		       mdname(mddev));
+		return -EINVAL;
+	}
 	return 0;
 }
 EXPORT_SYMBOL(md_integrity_register);
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 3a62d44..54b8d7c 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -345,7 +345,7 @@ static int multipath_remove_disk(mddev_t *mddev, int number)
 			p->rdev = rdev;
 			goto abort;
 		}
-		md_integrity_register(mddev);
+		err = md_integrity_register(mddev);
 	}
 abort:
 
@@ -520,7 +520,10 @@ static int multipath_run (mddev_t *mddev)
 	mddev->queue->unplug_fn = multipath_unplug;
 	mddev->queue->backing_dev_info.congested_fn = multipath_congested;
 	mddev->queue->backing_dev_info.congested_data = mddev;
-	md_integrity_register(mddev);
+
+	if (md_integrity_register(mddev))
+		goto out_free_conf;
+
 	return 0;
 
 out_free_conf:
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index c0ac457..17c001c 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -395,8 +395,7 @@ static int raid0_run(mddev_t *mddev)
 
 	blk_queue_merge_bvec(mddev->queue, raid0_mergeable_bvec);
 	dump_zones(mddev);
-	md_integrity_register(mddev);
-	return 0;
+	return md_integrity_register(mddev);
 }
 
 static int raid0_stop(mddev_t *mddev)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 06cd712..1cca285 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1178,7 +1178,7 @@ static int raid1_remove_disk(mddev_t *mddev, int number)
 			p->rdev = rdev;
 			goto abort;
 		}
-		md_integrity_register(mddev);
+		err = md_integrity_register(mddev);
 	}
 abort:
 
@@ -2069,8 +2069,7 @@ static int run(mddev_t *mddev)
 	mddev->queue->unplug_fn = raid1_unplug;
 	mddev->queue->backing_dev_info.congested_fn = raid1_congested;
 	mddev->queue->backing_dev_info.congested_data = mddev;
-	md_integrity_register(mddev);
-	return 0;
+	return md_integrity_register(mddev);
 }
 
 static int stop(mddev_t *mddev)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 747d061..17a2891 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1233,7 +1233,7 @@ static int raid10_remove_disk(mddev_t *mddev, int number)
 			p->rdev = rdev;
 			goto abort;
 		}
-		md_integrity_register(mddev);
+		err = md_integrity_register(mddev);
 	}
 abort:
 
@@ -2395,7 +2395,10 @@ static int run(mddev_t *mddev)
 
 	if (conf->near_copies < conf->raid_disks)
 		blk_queue_merge_bvec(mddev->queue, raid10_mergeable_bvec);
-	md_integrity_register(mddev);
+
+	if (md_integrity_register(mddev))
+		goto out_free_conf;
+
 	return 0;
 
 out_free_conf:
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index e49cce2..9c5e6b2 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -761,6 +761,9 @@ int bioset_integrity_create(struct bio_set *bs, int pool_size)
 {
 	unsigned int max_slab = vecs_to_idx(BIO_MAX_PAGES);
 
+	if (bs->bio_integrity_pool)
+		return 0;
+
 	bs->bio_integrity_pool =
 		mempool_create_slab_pool(pool_size, bip_slab[max_slab].slab);
 
diff --git a/fs/bio.c b/fs/bio.c
index 5694b75..85e2eab 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1636,9 +1636,6 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
 	if (!bs->bio_pool)
 		goto bad;
 
-	if (bioset_integrity_create(bs, pool_size))
-		goto bad;
-
 	if (!biovec_create_pools(bs, pool_size))
 		return bs;
 
@@ -1682,6 +1679,9 @@ static int __init init_bio(void)
 	if (!fs_bio_set)
 		panic("bio: can't allocate bios\n");
 
+	if (bioset_integrity_create(fs_bio_set, BIO_POOL_SIZE))
+		panic("bio: can't create integrity pool\n");
+
 	bio_split_pool = mempool_create_kmalloc_pool(BIO_SPLIT_ENTRIES,
 						     sizeof(struct bio_pair));
 	if (!bio_split_pool)

^ permalink raw reply related	[flat|nested] 23+ messages in thread
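[Editorial note: the `if (bs->bio_integrity_pool) return 0;` guard added to `bioset_integrity_create()` above makes the call idempotent, which matters once more than one caller may request the pool for the same bio_set. A minimal userspace sketch of that idempotent lazy-creation pattern; `struct bio_set_sketch` and `integrity_create()` are hypothetical stand-ins, not the kernel API.]

```c
#include <stdlib.h>

struct bio_set_sketch {
	void *integrity_pool; /* NULL until created */
};

/* Idempotent lazy creation: a second call is a cheap no-op rather
 * than leaking the first pool or returning an error. */
int integrity_create(struct bio_set_sketch *bs)
{
	if (bs->integrity_pool)
		return 0; /* already created, nothing to do */

	bs->integrity_pool = malloc(64);
	return bs->integrity_pool ? 0 : -1;
}
```

Without the guard, a repeated create call (e.g. MD registering integrity both at run time and again on a disk-removal rebuild) would overwrite and leak the previously created pool.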
* Re: [PATCH] block: Require subsystems to explicitly allocate bio_set integrity mempool (was: Re: mirrored device with thousand of mappingtableentries)
  2011-03-10 16:11 ` Martin K. Petersen
@ 2011-03-11 16:53 ` Mike Snitzer
  2011-03-11 17:11 ` [PATCH] block: Require subsystems to explicitly allocate bio_set integrity mempool Martin K. Petersen
  0 siblings, 1 reply; 23+ messages in thread
From: Mike Snitzer @ 2011-03-11 16:53 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Jens Axboe, device-mapper development, Martin K. Petersen, Zdenek Kabelac

On Thu, Mar 10 2011 at 11:11am -0500,
Martin K. Petersen <mkp@mkp.net> wrote:

> >>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:
>
> Mike> In general this approach looks fine too.  Doesn't address the
> Mike> concern I had yesterday about the blk_integrity block_size
> Mike> relative to DM (load vs resume) but I'd imagine you'd like to sort
> Mike> that out in a separate patch.
>
> Yeah, that'll come as part of my DIX1.1 patch set.

Seems my concern should be fixed independent of a larger update
patchset.  Unless DIX isn't meaningful for "stable" kernels without your
full DIX1.1 update?

> I believe I've addressed all your issues below.

The block and DM changes look good.  I haven't put time to the MD
changes (cc'ing neil).

Thanks,
Mike

> block: Require subsystems to explicitly allocate bio_set integrity mempool
>
> MD and DM create a new bio_set for every metadevice.  Each bio_set has an
> integrity mempool attached regardless of whether the metadevice is
> capable of passing integrity metadata.  This is a waste of memory.
>
> Instead we defer the allocation decision to MD and DM since we know at
> metadevice creation time whether integrity passthrough is needed or not.
>
> Automatic integrity mempool allocation can then be removed from
> bioset_create() and we make an explicit integrity allocation for the
> fs_bio_set.
>
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
> Acked-by: Mike Snitzer <snitzer@redhat.com>
>
> [v2 patch quoted in full, unchanged from the previous message]
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH] block: Require subsystems to explicitly allocate bio_set integrity mempool
  2011-03-11 16:53 ` [PATCH] block: Require subsystems to explicitly allocate bio_set integrity mempool (was: Re: mirrored device with thousand of mappingtableentries) Mike Snitzer
@ 2011-03-11 17:11 ` Martin K. Petersen
  0 siblings, 0 replies; 23+ messages in thread
From: Martin K. Petersen @ 2011-03-11 17:11 UTC (permalink / raw)
To: Mike Snitzer
Cc: Jens Axboe, device-mapper development, Martin K. Petersen, Zdenek Kabelac

>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

>> Yeah, that'll come as part of my DIX1.1 patch set.

Mike> Seems my concern should be fixed independent of a larger update
Mike> patchset.  Unless DIX isn't meaningful for "stable" kernels
Mike> without your full DIX1.1 update?

Well, I'm not aware of any shipping non-512 byte block DIF devices other
than scsi_debug.

And it's not good enough to just fix it in DM since this is a hardware
parameter that needs to be queried and bubbled up.  It can't be scaled
like the rest of the topology block sizes, it's wired in the hardware.

So I just don't see much point in trying to fix it now when any ties to
the logical block size will be severed in my impending patch set.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: mirrored device with thousand of mappingtableentries
  2011-03-07  2:59 ` Martin K. Petersen
  2011-03-07 14:24 ` Zdenek Kabelac
  2011-03-07 16:09 ` Mike Snitzer
@ 2011-03-07 20:10 ` Mike Snitzer
  2011-03-07 20:22 ` Martin K. Petersen
  2 siblings, 1 reply; 23+ messages in thread
From: Mike Snitzer @ 2011-03-07 20:10 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: device-mapper development

On Sun, Mar 06 2011 at  9:59pm -0500,
Martin K. Petersen <martin.petersen@oracle.com> wrote:

> >>>>> "Zdenek" == Zdenek Kabelac <zkabelac@redhat.com> writes:
>
> Zdenek> My finding seems to show that BIP-256 slabtop segment grow by
> Zdenek> ~73KB per each device (while dm-io is about ~26KB)
>
> Ok, I see it now that I tried with a bunch of DM devices.
>
> DM allocates a bioset per volume.  And since each bioset has an integrity
> mempool you'll end up with a bunch of memory locked down.  It seems like
> a lot but it's actually the same amount as we reserve for the data path
> (bio-0 + biovec-256).
>
> Since a bioset is not necessarily tied to a single block device we can't
> automatically decide whether to allocate the integrity pool or not.  In
> the DM case, however, we just set up the integrity profile so the
> information is available.
>
> Can you please try the following patch?  This will change things so we
> only attach an integrity pool to the bioset if the logical volume is
> integrity-capable.

Hey Martin,

I just took the opportunity to review DM's blk_integrity code a bit more
closely -- with an eye towards stacking devices.

I found an issue that I think we need to fix that has to do with a DM
device's limits being established during do_resume() and not during
table_load().  Unfortunately, a DM device's blk_integrity gets
preallocated during table_load().

dm_table_prealloc_integrity()'s call to blk_integrity_register()
establishes the blk_integrity's block_size.  But a DM device's
queue_limits aren't stacked until a DM device is resumed -- via
dm_calculate_queue_limits().

For some background please see the patch header of this commit:
http://git.kernel.org/linus/754c5fc7ebb417

The final blk_integrity for the DM device isn't fully established until
do_resume()'s eventual call to dm_table_set_integrity() -- by passing a
template to blk_integrity_register().

dm_table_set_integrity() does validate the 'block_size' of each DM
devices' blk_integrity to make sure they all match.  So the code would
catch the inconsistency should it arise.

All I'm saying is: it's possible for a table_load() to not have the
awareness that a newly added device's queue_limits will cause the DM
device's final queue_limits to be increased (say a 4K device was added
to dm_device2, and dm_device2 is now being added to another dm_device1).

So it seems we need to establish bi->sector_size during the final stage
of blk_integrity_register(), e.g. when a template is passed.  Not sure
if you'd agree with that change in general but it'll work for DM because
the queue_limits are established before dm_table_set_integrity() is set.
Maybe revalidate/change the 'block_size' during the final stage in case
it changed?

Thanks,
Mike

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: mirrored device with thousand of mappingtableentries
  2011-03-07 20:10 ` mirrored device with thousand of mappingtableentries Mike Snitzer
@ 2011-03-07 20:22 ` Martin K. Petersen
  0 siblings, 0 replies; 23+ messages in thread
From: Martin K. Petersen @ 2011-03-07 20:22 UTC (permalink / raw)
To: Mike Snitzer; +Cc: device-mapper development, Martin K. Petersen

>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> So it seems we need to establish bi->sector_size during the final
Mike> stage of blk_integrity_register(), e.g. when a template is passed.
Mike> Not sure if you'd agree with that change in general but it'll work
Mike> for DM because the queue_limits are established before
Mike> dm_table_set_integrity() is set.

Let me contemplate a bit since I already have some changes brewing in
this area (T10 now permits protection information to be placed at
non-logical block size intervals).

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 23+ messages in thread
end of thread, other threads:[~2011-03-11 17:11 UTC | newest]

Thread overview: 23+ messages:
2011-02-28 11:21 mirrored device with thousand of mapping table entries Eli Malul
2011-02-28 11:48 ` Alasdair G Kergon
2011-02-28 11:59   ` mirrored device with thousand of mapping tableentries Eli Malul
2011-02-28 12:11     ` Alasdair G Kergon
2011-02-28 12:17       ` mirrored device with thousand of mappingtableentries Eli Malul
2011-02-28 13:10         ` Alasdair G Kergon
2011-02-28 13:13           ` Eli Malul
2011-02-28 13:29             ` Alasdair G Kergon
2011-02-28 13:42               ` Eli Malul
2011-02-28 16:25                 ` Eli Malul
2011-02-28 13:38 ` Zdenek Kabelac
2011-02-28 15:01   ` Martin K. Petersen
2011-03-06 20:39     ` Zdenek Kabelac
2011-03-07  2:59       ` Martin K. Petersen
2011-03-07 14:24         ` Zdenek Kabelac
2011-03-07 16:09         ` Mike Snitzer
2011-03-08  6:51           ` Martin K. Petersen
2011-03-08 17:13             ` Mike Snitzer
2011-03-10 16:11               ` Martin K. Petersen
2011-03-11 16:53                 ` [PATCH] block: Require subsystems to explicitly allocate bio_set integrity mempool (was: Re: mirrored device with thousand of mappingtableentries) Mike Snitzer
2011-03-11 17:11                   ` [PATCH] block: Require subsystems to explicitly allocate bio_set integrity mempool Martin K. Petersen
2011-03-07 20:10       ` mirrored device with thousand of mappingtableentries Mike Snitzer
2011-03-07 20:22         ` Martin K. Petersen