From: Boaz Harrosh <bharrosh@panasas.com>
To: Idan Kedar <idank@tonian.com>
Cc: Linux FS Maling List <linux-fsdevel@vger.kernel.org>
Subject: Re: exofs/ore: allocation of _ore_get_io_state()
Date: Thu, 24 May 2012 00:00:46 +0300 [thread overview]
Message-ID: <4FBD4FFE.7020703@panasas.com> (raw)
In-Reply-To: <CABpMAyJQGc+sOYcGdKWuG9b4HTKzDE8tBgyWUqd9P9zEZDTbtA@mail.gmail.com>
On 04/16/2012 05:53 PM, Idan Kedar wrote:
> _ore_get_io_state is supposed to allocate a struct ore_io_state, which is
> variable length. Setting aside the uglification argument which gave birth
> to the patch "pnfs-obj: Uglify objio_segment allocation for the sake of
> the principle :-(", the above function checks if the memory it needs to
> allocate is bigger than a page, and if so, separates it into two
> allocations. why is this done?
>
> from include/linux/slab.h:
>> /*
>> * The largest kmalloc size supported by the slab allocators is
>> * 32 megabyte (2^25) or the maximum allocatable page order if that is
>> * less than 32 MB.
>> *
>> * WARNING: Its not easy to increase this value since the allocators have
>> * to do various tricks to work around compiler limitations in order to
>> * ensure proper constant folding.
>> */
>> #define KMALLOC_SHIFT_HIGH ((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
>> (MAX_ORDER + PAGE_SHIFT - 1) : 25)
>>
>> #define KMALLOC_MAX_SIZE (1UL << KMALLOC_SHIFT_HIGH)
>> #define KMALLOC_MAX_ORDER (KMALLOC_SHIFT_HIGH - PAGE_SHIFT)
>
> If kmalloc can allocate 32MB, why check one page, like so?
>
> from fs/exofs/ore.c:
>> struct __alloc_all_io_state {
>> struct ore_io_state ios;
>> struct ore_per_dev_state per_dev[numdevs];
>> union {
>> struct osd_sg_entry sglist[sgs_per_dev * numdevs];
>> struct page *pages[num_par_pages];
>> };
>> } *_aios;
>>
>> if (likely(sizeof(*_aios) <= PAGE_SIZE)) {
>> _aios = kzalloc(sizeof(*_aios), GFP_KERNEL);
>> if (unlikely(!_aios)) {
>> EXOFS_DBGMSG("Failed kzalloc bytes=%zd\n",
>> sizeof(*_aios));
>> *pios = NULL;
>> return -ENOMEM;
>> }
>> pages = num_par_pages ? _aios->pages : NULL;
>> sgilist = sgs_per_dev ? _aios->sglist : NULL;
>> ios = &_aios->ios;
>> } else {
>> struct __alloc_small_io_state {
>> struct ore_io_state ios;
>> struct ore_per_dev_state per_dev[numdevs];
>> } *_aio_small;
>> union __extra_part {
>> struct osd_sg_entry sglist[sgs_per_dev * numdevs];
>> struct page *pages[num_par_pages];
>> } *extra_part;
>> _aio_small = kzalloc(sizeof(*_aio_small), GFP_KERNEL);
>> if (unlikely(!_aio_small)) {
>> EXOFS_DBGMSG("Failed alloc first part bytes=%zd\n",
>> sizeof(*_aio_small));
>> *pios = NULL;
>> return -ENOMEM;
>> }
>> extra_part = kzalloc(sizeof(*extra_part), GFP_KERNEL);
>> if (unlikely(!extra_part)) {
>> EXOFS_DBGMSG("Failed alloc second part bytes=%zd\n",
>> sizeof(*extra_part));
>> kfree(_aio_small);
>> *pios = NULL;
>> return -ENOMEM;
>> }
>> pages = num_par_pages ? extra_part->pages : NULL;
>> sgilist = sgs_per_dev ? extra_part->sglist : NULL;
>> /* In this case the per_dev[0].sgilist holds the pointer to
>> * be freed
>> */
>> ios = &_aio_small->ios;
>> ios->extra_part_alloc = true;
>> }
>
> Is there any point to check if the memory is greater than 32MB?
>
>
In theory it can allocate 32MB, in slab. I'm not sure about slob and slub.
But in practice contiguous physical pages allocation tends to fail very
fast on a system that was up a couple of hours. So we avoid it as plage.
Past testing with tables bigger than PAGE_SIZE on the IO path gave
catastrophic results. (Again once the system is up for a while and
had a chance to fragment physical address space)
The all Kernel point of the use of sg-lists is so not to allocate
contiguous physical pages and to not have to use virtual-memory.
This is done all over the Kernel. MAX_BIO_SIZE max-sg-table ...
(BTW I saw this mail by chance. If you direct it to me I see it
for sure)
Cheers
Boaz
next prev parent reply other threads:[~2012-05-23 21:01 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-04-16 14:53 exofs/ore: allocation of _ore_get_io_state() Idan Kedar
2012-05-23 21:00 ` Boaz Harrosh [this message]
2012-05-24 11:23 ` Idan Kedar
2012-05-24 14:05 ` Boaz Harrosh
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4FBD4FFE.7020703@panasas.com \
--to=bharrosh@panasas.com \
--cc=idank@tonian.com \
--cc=linux-fsdevel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.