* [Lustre-devel] 2 primitives for the NRS
@ 2008-01-04 13:37 Eric Barton
2008-01-04 17:00 ` Weikuan Yu
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Eric Barton @ 2008-01-04 13:37 UTC (permalink / raw)
To: lustre-devel
What the NRS (network request scheduler) really needs is something
that can order 10**[4-6] RPC requests incredibly efficiently - any
overhead just eats into server efficiency and we already swamp a
server with simple pings at ~30-40,000 RPCs per second. I've
implemented a heap which is already looking good at managing
collections of 1,000,000+ objects with nearly sub-microsecond
insert+ordered_remove overhead.
Here is the API, which also introduces a sparse 1D array suitable for
use in the kernel. The binary heap uses it as space and cache
efficient means of tree walking.
Any suggestions for improving the API appreciated...
#ifndef __LIBCFS_BHEAP_H__
#define __LIBCFS_BHEAP_H__
/* Auto-array
* A sparse 1D contiguous array of pointers which uses single, double and
* triple indirection maps and avoids allocating large contiguous memory.
*
* FIXME: CAA_SHIFT should be defined automatically so that
* (CAA_SIZE == CFS_PAGE_SIZE/sizeof(void *))
*/
#define CAA_SHIFT 10
#define CAA_SIZE (1 << CAA_SHIFT) /* # ptrs per level */
#define CAA_MASK (CAA_SIZE - 1)
#define CAA_NOB (CAA_SIZE * sizeof(void *))
typedef struct
{
void ****aa_3d; /* Triple indirect */
void ***aa_2d; /* double indirect */
void **aa_1d; /* single indirect */
} cfs_autoarray_t;
void cfs_aa_init(cfs_autoarray_t *aa); /* setup */
void cfs_aa_fini(cfs_autoarray_t *aa); /* free all allocated mem */
/* effectively &aa[idx] but you MUST know &aa[idx] exists */
void **cfs_aa_index(cfs_autoarray_t *aa, unsigned int idx);
/* effectively &aa[idx] - return NULL if &aa[idx] doesn't exist and 'grow' is
* FALSE */
void **cfs_aa_lookup(cfs_autoarray_t *aa, unsigned int idx, int grow);
/* Binary heap
*
* Supports insertion and ordered removal of members sorted by the given total
* order ('compare')
*/
typedef struct
{
cfs_autoarray_t cbh_members;
unsigned int cbh_size;
unsigned int cbh_maxsize;
int (*cbh_compare)(void *a, void *b);
} cfs_binheap_t;
void cfs_binheap_init(cfs_binheap_t *h, int (*compare)(void *a, void *b));
void cfs_binheap_fini(cfs_binheap_t *h);
int cfs_binheap_size(cfs_binheap_t *h);
int cfs_binheap_insert(cfs_binheap_t *h, void *e);
void *cfs_binheap_remove_root(cfs_binheap_t *h);
#endif /* __LIBCFS_BHEAP_H__ */
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Lustre-devel] 2 primitives for the NRS
2008-01-04 13:37 [Lustre-devel] 2 primitives for the NRS Eric Barton
@ 2008-01-04 17:00 ` Weikuan Yu
2008-01-04 18:19 ` Eric Barton
2008-01-04 17:40 ` Lee Ward
2008-01-05 12:19 ` Alex Lyashkov
2 siblings, 1 reply; 8+ messages in thread
From: Weikuan Yu @ 2008-01-04 17:00 UTC (permalink / raw)
To: lustre-devel
Hi,
I am trying to plug in an I/O request scheduler into OSS before read/write
requests get dispatched to the obdfilter. What I am using is hashed bins
with a basic rb tree, assuming it would be fairly reasonable to handle the
number of I/O requests that can reach an OSS. My interface calls are very
similar to yours, except the lack of plug-in comparison. I would not have
more to suggest on this. But I do have a couple of questions to check for
possible thoughts if you have.
(1) How are you going to order the requests, say the read/write ones? I
assume you made it flexible with a plug-in compare(). Would the order of the
I/O requests based on object ID have some relevance to their locality on the
disks? I was assuming at least the requests can get smoothed out with the
objID ordering.
(2) Have you checked the overhead when there are many concurrent threads
competing for the locks associated with your heap? The performance impact
thereof?
(3) Do you anticipate to merge the requests in any way, or possibly batch
execute them?
--Weikuan
Eric Barton wrote:
> What the NRS (network request scheduler) really needs is something
> that can order 10**[4-6] RPC requests incredibly efficiently - any
> overhead just eats into server efficiency and we already swamp a
> server with simple pings at ~30-40,000 RPCs per second. I've
> implemented a heap which is already looking good at managing
> collections of 1,000,000+ objects with nearly sub-microsecond
> insert+ordered_remove overhead.
>
> Here is the API, which also introduces a sparse 1D array suitable for
> use in the kernel. The binary heap uses it as space and cache
> efficient means of tree walking.
>
> Any suggestions for improving the API appreciated...
>
> #ifndef __LIBCFS_BHEAP_H__
> #define __LIBCFS_BHEAP_H__
>
> /* Auto-array
> * A sparse 1D contiguous array of pointers which uses single, double and
> * triple indirection maps and avoids allocating large contiguous memory.
> *
> * FIXME: CAA_SHIFT should be defined automatically so that
> * (CAA_SIZE == CFS_PAGE_SIZE/sizeof(void *))
> */
>
> #define CAA_SHIFT 10
> #define CAA_SIZE (1 << CAA_SHIFT) /* # ptrs per level */
> #define CAA_MASK (CAA_SIZE - 1)
> #define CAA_NOB (CAA_SIZE * sizeof(void *))
>
> typedef struct
> {
> void ****aa_3d; /* Triple indirect */
> void ***aa_2d; /* double indirect */
> void **aa_1d; /* single indirect */
> } cfs_autoarray_t;
>
> void cfs_aa_init(cfs_autoarray_t *aa); /* setup */
> void cfs_aa_fini(cfs_autoarray_t *aa); /* free all allocated mem */
>
> /* effectively &aa[idx] but you MUST know &aa[idx] exists */
> void **cfs_aa_index(cfs_autoarray_t *aa, unsigned int idx);
>
> /* effectively &aa[idx] - return NULL if &aa[idx] doesn't exist and 'grow' is
> * FALSE */
> void **cfs_aa_lookup(cfs_autoarray_t *aa, unsigned int idx, int grow);
>
>
> /* Binary heap
> *
> * Supports insertion and ordered removal of members sorted by the given total
> * order ('compare')
> */
>
> typedef struct
> {
> cfs_autoarray_t cbh_members;
> unsigned int cbh_size;
> unsigned int cbh_maxsize;
> int (*cbh_compare)(void *a, void *b);
> } cfs_binheap_t;
>
> void cfs_binheap_init(cfs_binheap_t *h, int (*compare)(void *a, void *b));
> void cfs_binheap_fini(cfs_binheap_t *h);
> int cfs_binheap_size(cfs_binheap_t *h);
> int cfs_binheap_insert(cfs_binheap_t *h, void *e);
> void *cfs_binheap_remove_root(cfs_binheap_t *h);
>
> #endif /* __LIBCFS_BHEAP_H__ */
>
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/lustre-devel
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Lustre-devel] 2 primitives for the NRS
2008-01-04 13:37 [Lustre-devel] 2 primitives for the NRS Eric Barton
2008-01-04 17:00 ` Weikuan Yu
@ 2008-01-04 17:40 ` Lee Ward
2008-01-04 17:51 ` Eric Barton
2008-01-04 18:36 ` Eric Barton
2008-01-05 12:19 ` Alex Lyashkov
2 siblings, 2 replies; 8+ messages in thread
From: Lee Ward @ 2008-01-04 17:40 UTC (permalink / raw)
To: lustre-devel
Hi Eric,
Wow! That is good. I had a similar need for another project and solved
it using a splay tree. Overhead is ~2.5 usec/op for random inserts with
unique keys and ~.04 usec/op for the ordered remove, at 1M records.
Where might I get your implementation? Would like to see if I can adapt
it for my project.
--Lee
On Fri, 2008-01-04 at 06:37 -0700, Eric Barton wrote:
> What the NRS (network request scheduler) really needs is something
> that can order 10**[4-6] RPC requests incredibly efficiently - any
> overhead just eats into server efficiency and we already swamp a
> server with simple pings at ~30-40,000 RPCs per second. I've
> implemented a heap which is already looking good at managing
> collections of 1,000,000+ objects with nearly sub-microsecond
> insert+ordered_remove overhead.
>
> Here is the API, which also introduces a sparse 1D array suitable for
> use in the kernel. The binary heap uses it as space and cache
> efficient means of tree walking.
>
> Any suggestions for improving the API appreciated...
>
> #ifndef __LIBCFS_BHEAP_H__
> #define __LIBCFS_BHEAP_H__
>
> /* Auto-array
> * A sparse 1D contiguous array of pointers which uses single, double and
> * triple indirection maps and avoids allocating large contiguous memory.
> *
> * FIXME: CAA_SHIFT should be defined automatically so that
> * (CAA_SIZE == CFS_PAGE_SIZE/sizeof(void *))
> */
>
> #define CAA_SHIFT 10
> #define CAA_SIZE (1 << CAA_SHIFT) /* # ptrs per level */
> #define CAA_MASK (CAA_SIZE - 1)
> #define CAA_NOB (CAA_SIZE * sizeof(void *))
>
> typedef struct
> {
> void ****aa_3d; /* Triple indirect */
> void ***aa_2d; /* double indirect */
> void **aa_1d; /* single indirect */
> } cfs_autoarray_t;
>
> void cfs_aa_init(cfs_autoarray_t *aa); /* setup */
> void cfs_aa_fini(cfs_autoarray_t *aa); /* free all allocated mem */
>
> /* effectively &aa[idx] but you MUST know &aa[idx] exists */
> void **cfs_aa_index(cfs_autoarray_t *aa, unsigned int idx);
>
> /* effectively &aa[idx] - return NULL if &aa[idx] doesn't exist and 'grow' is
> * FALSE */
> void **cfs_aa_lookup(cfs_autoarray_t *aa, unsigned int idx, int grow);
>
>
> /* Binary heap
> *
> * Supports insertion and ordered removal of members sorted by the given total
> * order ('compare')
> */
>
> typedef struct
> {
> cfs_autoarray_t cbh_members;
> unsigned int cbh_size;
> unsigned int cbh_maxsize;
> int (*cbh_compare)(void *a, void *b);
> } cfs_binheap_t;
>
> void cfs_binheap_init(cfs_binheap_t *h, int (*compare)(void *a, void *b));
> void cfs_binheap_fini(cfs_binheap_t *h);
> int cfs_binheap_size(cfs_binheap_t *h);
> int cfs_binheap_insert(cfs_binheap_t *h, void *e);
> void *cfs_binheap_remove_root(cfs_binheap_t *h);
>
> #endif /* __LIBCFS_BHEAP_H__ */
>
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/lustre-devel
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Lustre-devel] 2 primitives for the NRS
2008-01-04 17:40 ` Lee Ward
@ 2008-01-04 17:51 ` Eric Barton
2008-01-04 18:36 ` Eric Barton
1 sibling, 0 replies; 8+ messages in thread
From: Eric Barton @ 2008-01-04 17:51 UTC (permalink / raw)
To: lustre-devel
Lee,
Here's the prototype + trivial exercising code. I'm trying to run it with
something that should roughly approximate RPC arrival order - if that's
full of "itself" then I really need to know :)
> -----Original Message-----
> From: Lee Ward [mailto:lee at sandia.gov]
> Sent: 04 January 2008 5:40 PM
> To: Eric Barton
> Cc: lustre-devel at clusterfs.com
> Subject: Re: [Lustre-devel] 2 primitives for the NRS
>
> Hi Eric,
>
> Wow! That is good. I had a similar need for another project and solved
> it using a splay tree. Overhead is ~2.5 usec/op for random
> inserts with
> unique keys and ~.04 usec/op for the ordered remove, at 1M records.
> Where might I get your implementation? Would like to see if I
> can adapt
> it for my project.
>
> --Lee
>
> On Fri, 2008-01-04 at 06:37 -0700, Eric Barton wrote:
> > What the NRS (network request scheduler) really needs is something
> > that can order 10**[4-6] RPC requests incredibly efficiently - any
> > overhead just eats into server efficiency and we already swamp a
> > server with simple pings at ~30-40,000 RPCs per second. I've
> > implemented a heap which is already looking good at managing
> > collections of 1,000,000+ objects with nearly sub-microsecond
> > insert+ordered_remove overhead.
> >
> > Here is the API, which also introduces a sparse 1D array
> suitable for
> > use in the kernel. The binary heap uses it as space and cache
> > efficient means of tree walking.
> >
> > Any suggestions for improving the API appreciated...
> >
> > #ifndef __LIBCFS_BHEAP_H__
> > #define __LIBCFS_BHEAP_H__
> >
> > /* Auto-array
> > * A sparse 1D contiguous array of pointers which uses
> single, double and
> > * triple indirection maps and avoids allocating large
> contiguous memory.
> > *
> > * FIXME: CAA_SHIFT should be defined automatically so that
> > * (CAA_SIZE == CFS_PAGE_SIZE/sizeof(void *))
> > */
> >
> > #define CAA_SHIFT 10
> > #define CAA_SIZE (1 << CAA_SHIFT) /* # ptrs
> per level */
> > #define CAA_MASK (CAA_SIZE - 1)
> > #define CAA_NOB (CAA_SIZE * sizeof(void *))
> >
> > typedef struct
> > {
> > void ****aa_3d; /* Triple
> indirect */
> > void ***aa_2d; /* double
> indirect */
> > void **aa_1d; /* single
> indirect */
> > } cfs_autoarray_t;
> >
> > void cfs_aa_init(cfs_autoarray_t *aa); /* setup */
> > void cfs_aa_fini(cfs_autoarray_t *aa); /* free
> all allocated mem */
> >
> > /* effectively &aa[idx] but you MUST know &aa[idx] exists */
> > void **cfs_aa_index(cfs_autoarray_t *aa, unsigned int idx);
> >
> > /* effectively &aa[idx] - return NULL if &aa[idx] doesn't
> exist and 'grow' is
> > * FALSE */
> > void **cfs_aa_lookup(cfs_autoarray_t *aa, unsigned int idx,
> int grow);
> >
> >
> > /* Binary heap
> > *
> > * Supports insertion and ordered removal of members sorted
> by the given total
> > * order ('compare')
> > */
> >
> > typedef struct
> > {
> > cfs_autoarray_t cbh_members;
> > unsigned int cbh_size;
> > unsigned int cbh_maxsize;
> > int (*cbh_compare)(void *a, void *b);
> > } cfs_binheap_t;
> >
> > void cfs_binheap_init(cfs_binheap_t *h, int
> (*compare)(void *a, void *b));
> > void cfs_binheap_fini(cfs_binheap_t *h);
> > int cfs_binheap_size(cfs_binheap_t *h);
> > int cfs_binheap_insert(cfs_binheap_t *h, void *e);
> > void *cfs_binheap_remove_root(cfs_binheap_t *h);
> >
> > #endif /* __LIBCFS_BHEAP_H__ */
> >
> > _______________________________________________
> > Lustre-devel mailing list
> > Lustre-devel at clusterfs.com
> > https://mail.clusterfs.com/mailman/listinfo/lustre-devel
> >
>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: heapnd.c
Type: application/octet-stream
Size: 11486 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20080104/b02040ab/attachment.obj>
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Lustre-devel] 2 primitives for the NRS
2008-01-04 17:00 ` Weikuan Yu
@ 2008-01-04 18:19 ` Eric Barton
0 siblings, 0 replies; 8+ messages in thread
From: Eric Barton @ 2008-01-04 18:19 UTC (permalink / raw)
To: lustre-devel
Wonderful questions!
> I am trying to plug in an I/O request scheduler into OSS before
> read/write requests get dispatched to the obdfilter.
If you base your code off b1_6, you can take a free ride on the initial
request processing done by Nathan's adaptive timeout code.
> What I am using
> is hashed bins with a basic rb tree, assuming it would be fairly
> reasonable to handle the number of I/O requests that can reach an
> OSS. My interface calls are very similar to yours, except the lack
> of plug-in comparison. I would not have more to suggest on this.
> But I do have a couple of questions to check for possible thoughts if
> you have.
>
> (1) How are you going to order the requests, say the read/write
> ones? I assume you made it flexible with a plug-in compare().
Yes - because I don't know yet what's going to work best. And maybe
different services might want different orders. And generalisation
when it doesn't hurt performance is a hard habit to break :)
My first thoughts are about fairness to all clients in the face of
unfairness elsewhere - e.g. a gridlocked network - so I'm thinking of
something that picks buffered RPC requests round-robin on client ID.
This is probably good for workloads of large numbers of single-client
jobs to ensure that no individual client can be starved. However I
also suggest that it's good for I/O performed by a single job spread
over many clients.
I base this on the idea that a good backend filesystem should and can
optimize disk utilisation. When a file is written, the file<->disk
offset mapping is fixed for subsequent reads, so I want the NRS to
make I/O request execution order repeatable in the face of network
"noise" and races between clients. Without this repeatability, we
have to fall back on the disk elevator to re-create the "good" disk
I/O stream on subsequent reads. Surely it cannot do such a good job
as the NRS since it must have orders of magnitude fewer requests to
play with - bulk buffers must be allocated by the time it sees them.
See http://arch.lustre.org/index.php?title=Network_Request_Scheduler
I'm afraid I don't yet have anything even half backed to say on
write v. read order etc. I'd still want some empirical evidence first.
> Would the order of the I/O requests based on object ID have some
> relevance to their locality on the disks?
I thinks it might make more sense for the backend F/S to use a job ID
to help it create sequential disk offsets for the whole I/O pattern
rather than anything coming from one individual client.
> I was assuming at least the requests
> can get smoothed out with the objID ordering.
>
> (2) Have you checked the overhead when there are many concurrent
> threads competing for the locks associated with your heap? The
> performance impact thereof?
I've only done sequential timings so far. NRS ops could be "Amdhal
moments" for the whole server so fat SMPs might require some better
care.
> (3) Do you anticipate to merge the requests in any way, or possibly
> batch execute them?
Yes, but I'm such a lazy sod that I hope the disk elevator will smile
on me. If not and I have to roll up my sleeves - so be it.
Cheers,
Eric
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Lustre-devel] 2 primitives for the NRS
2008-01-04 17:40 ` Lee Ward
2008-01-04 17:51 ` Eric Barton
@ 2008-01-04 18:36 ` Eric Barton
2008-01-05 0:36 ` Lee Ward
1 sibling, 1 reply; 8+ messages in thread
From: Eric Barton @ 2008-01-04 18:36 UTC (permalink / raw)
To: lustre-devel
Oops, I forgot the copyright boilerplate...
Cheers,
Eric
> -----Original Message-----
> From: Lee Ward [mailto:lee at sandia.gov]
> Sent: 04 January 2008 5:40 PM
> To: Eric Barton
> Cc: lustre-devel at clusterfs.com
> Subject: Re: [Lustre-devel] 2 primitives for the NRS
>
> Hi Eric,
>
> Wow! That is good. I had a similar need for another project and solved
> it using a splay tree. Overhead is ~2.5 usec/op for random
> inserts with
> unique keys and ~.04 usec/op for the ordered remove, at 1M records.
> Where might I get your implementation? Would like to see if I
> can adapt
> it for my project.
>
> --Lee
>
> On Fri, 2008-01-04 at 06:37 -0700, Eric Barton wrote:
> > What the NRS (network request scheduler) really needs is something
> > that can order 10**[4-6] RPC requests incredibly efficiently - any
> > overhead just eats into server efficiency and we already swamp a
> > server with simple pings at ~30-40,000 RPCs per second. I've
> > implemented a heap which is already looking good at managing
> > collections of 1,000,000+ objects with nearly sub-microsecond
> > insert+ordered_remove overhead.
> >
> > Here is the API, which also introduces a sparse 1D array
> suitable for
> > use in the kernel. The binary heap uses it as space and cache
> > efficient means of tree walking.
> >
> > Any suggestions for improving the API appreciated...
> >
> > #ifndef __LIBCFS_BHEAP_H__
> > #define __LIBCFS_BHEAP_H__
> >
> > /* Auto-array
> > * A sparse 1D contiguous array of pointers which uses
> single, double and
> > * triple indirection maps and avoids allocating large
> contiguous memory.
> > *
> > * FIXME: CAA_SHIFT should be defined automatically so that
> > * (CAA_SIZE == CFS_PAGE_SIZE/sizeof(void *))
> > */
> >
> > #define CAA_SHIFT 10
> > #define CAA_SIZE (1 << CAA_SHIFT) /* # ptrs
> per level */
> > #define CAA_MASK (CAA_SIZE - 1)
> > #define CAA_NOB (CAA_SIZE * sizeof(void *))
> >
> > typedef struct
> > {
> > void ****aa_3d; /* Triple
> indirect */
> > void ***aa_2d; /* double
> indirect */
> > void **aa_1d; /* single
> indirect */
> > } cfs_autoarray_t;
> >
> > void cfs_aa_init(cfs_autoarray_t *aa); /* setup */
> > void cfs_aa_fini(cfs_autoarray_t *aa); /* free
> all allocated mem */
> >
> > /* effectively &aa[idx] but you MUST know &aa[idx] exists */
> > void **cfs_aa_index(cfs_autoarray_t *aa, unsigned int idx);
> >
> > /* effectively &aa[idx] - return NULL if &aa[idx] doesn't
> exist and 'grow' is
> > * FALSE */
> > void **cfs_aa_lookup(cfs_autoarray_t *aa, unsigned int idx,
> int grow);
> >
> >
> > /* Binary heap
> > *
> > * Supports insertion and ordered removal of members sorted
> by the given total
> > * order ('compare')
> > */
> >
> > typedef struct
> > {
> > cfs_autoarray_t cbh_members;
> > unsigned int cbh_size;
> > unsigned int cbh_maxsize;
> > int (*cbh_compare)(void *a, void *b);
> > } cfs_binheap_t;
> >
> > void cfs_binheap_init(cfs_binheap_t *h, int
> (*compare)(void *a, void *b));
> > void cfs_binheap_fini(cfs_binheap_t *h);
> > int cfs_binheap_size(cfs_binheap_t *h);
> > int cfs_binheap_insert(cfs_binheap_t *h, void *e);
> > void *cfs_binheap_remove_root(cfs_binheap_t *h);
> >
> > #endif /* __LIBCFS_BHEAP_H__ */
> >
> > _______________________________________________
> > Lustre-devel mailing list
> > Lustre-devel at clusterfs.com
> > https://mail.clusterfs.com/mailman/listinfo/lustre-devel
> >
>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: heapnd.c
Type: application/octet-stream
Size: 12237 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20080104/0a654b9a/attachment.obj>
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Lustre-devel] 2 primitives for the NRS
2008-01-04 18:36 ` Eric Barton
@ 2008-01-05 0:36 ` Lee Ward
0 siblings, 0 replies; 8+ messages in thread
From: Lee Ward @ 2008-01-05 0:36 UTC (permalink / raw)
To: lustre-devel
Hi Eric,
Your machine must be faster than mine :) I could never get your heapnd.c
program to demonstrate rates below 2 usec.
I used your test harness as it was easier to mod than mine. Also, I got
better times with your harness exercising the splay tree. Think this is
because my test harness destroys the TLB cache; I build the tree from
scratch, empty it, rebuild it... Just a lousy harness. Yours is much
closer to my use-case, anyway. Don't know why I did that.
The splay tree implementation turned out to be faster when I finally got
an apples-to-apples comparison using your harness. It used about 20%
more memory though. Classic time/space trade-off I guess. My app is not
particularly sensitive to memory usage. Thanks for letting me try out
your code.
Anyway, I owe you the results since you were so kind as to send your
code:
Script started on Fri 04 Jan 2008 05:14:40 PM MST
lee at wheel:/tmp$ cc heapnd.c
lee at wheel:/tmp$ ./a.out 1000000 1000000
Size 1000000: 4.266uS
>>cat /proc/21262/statm: 10420 9894 101 2 0 9823 0
lee at wheel:/tmp$ cc treend.c
lee at wheel:/tmp$ !./a
./a.out 1000000 1000000
Size 1000000: 1.620uS
>>cat /proc/21277/statm: 12375 11839 100 2 0 11778 0
lee at wheel:/tmp$ cc -O3 heapnd.c
lee at wheel:/tmp$ !./
./a.out 1000000 1000000
Size 1000000: 2.455uS
>>cat /proc/21289/statm: 10420 9893 100 2 0 9823 0
lee at wheel:/tmp$ cc -O3 treend.c
lee at wheel:/tmp$ !./
./a.out 1000000 1000000
Size 1000000: 0.969uS
>>cat /proc/21304/statm: 12376 11838 99 3 0 11778 0
lee at wheel:/tmp$ exit
Script done on Fri 04 Jan 2008 05:15:54 PM MST
--Lee
On Fri, 2008-01-04 at 11:36 -0700, Eric Barton wrote:
> Oops, I forgot the copyright boilerplate...
>
> Cheers,
> Eric
>
>
> > -----Original Message-----
> > From: Lee Ward [mailto:lee at sandia.gov]
> > Sent: 04 January 2008 5:40 PM
> > To: Eric Barton
> > Cc: lustre-devel at clusterfs.com
> > Subject: Re: [Lustre-devel] 2 primitives for the NRS
> >
> > Hi Eric,
> >
> > Wow! That is good. I had a similar need for another project and solved
> > it using a splay tree. Overhead is ~2.5 usec/op for random
> > inserts with
> > unique keys and ~.04 usec/op for the ordered remove, at 1M records.
> > Where might I get your implementation? Would like to see if I
> > can adapt
> > it for my project.
> >
> > --Lee
> >
> > On Fri, 2008-01-04 at 06:37 -0700, Eric Barton wrote:
> > > What the NRS (network request scheduler) really needs is something
> > > that can order 10**[4-6] RPC requests incredibly efficiently - any
> > > overhead just eats into server efficiency and we already swamp a
> > > server with simple pings at ~30-40,000 RPCs per second. I've
> > > implemented a heap which is already looking good at managing
> > > collections of 1,000,000+ objects with nearly sub-microsecond
> > > insert+ordered_remove overhead.
> > >
> > > Here is the API, which also introduces a sparse 1D array
> > suitable for
> > > use in the kernel. The binary heap uses it as space and cache
> > > efficient means of tree walking.
> > >
> > > Any suggestions for improving the API appreciated...
> > >
> > > #ifndef __LIBCFS_BHEAP_H__
> > > #define __LIBCFS_BHEAP_H__
> > >
> > > /* Auto-array
> > > * A sparse 1D contiguous array of pointers which uses
> > single, double and
> > > * triple indirection maps and avoids allocating large
> > contiguous memory.
> > > *
> > > * FIXME: CAA_SHIFT should be defined automatically so that
> > > * (CAA_SIZE == CFS_PAGE_SIZE/sizeof(void *))
> > > */
> > >
> > > #define CAA_SHIFT 10
> > > #define CAA_SIZE (1 << CAA_SHIFT) /* # ptrs
> > per level */
> > > #define CAA_MASK (CAA_SIZE - 1)
> > > #define CAA_NOB (CAA_SIZE * sizeof(void *))
> > >
> > > typedef struct
> > > {
> > > void ****aa_3d; /* Triple
> > indirect */
> > > void ***aa_2d; /* double
> > indirect */
> > > void **aa_1d; /* single
> > indirect */
> > > } cfs_autoarray_t;
> > >
> > > void cfs_aa_init(cfs_autoarray_t *aa); /* setup */
> > > void cfs_aa_fini(cfs_autoarray_t *aa); /* free
> > all allocated mem */
> > >
> > > /* effectively &aa[idx] but you MUST know &aa[idx] exists */
> > > void **cfs_aa_index(cfs_autoarray_t *aa, unsigned int idx);
> > >
> > > /* effectively &aa[idx] - return NULL if &aa[idx] doesn't
> > exist and 'grow' is
> > > * FALSE */
> > > void **cfs_aa_lookup(cfs_autoarray_t *aa, unsigned int idx,
> > int grow);
> > >
> > >
> > > /* Binary heap
> > > *
> > > * Supports insertion and ordered removal of members sorted
> > by the given total
> > > * order ('compare')
> > > */
> > >
> > > typedef struct
> > > {
> > > cfs_autoarray_t cbh_members;
> > > unsigned int cbh_size;
> > > unsigned int cbh_maxsize;
> > > int (*cbh_compare)(void *a, void *b);
> > > } cfs_binheap_t;
> > >
> > > void cfs_binheap_init(cfs_binheap_t *h, int
> > (*compare)(void *a, void *b));
> > > void cfs_binheap_fini(cfs_binheap_t *h);
> > > int cfs_binheap_size(cfs_binheap_t *h);
> > > int cfs_binheap_insert(cfs_binheap_t *h, void *e);
> > > void *cfs_binheap_remove_root(cfs_binheap_t *h);
> > >
> > > #endif /* __LIBCFS_BHEAP_H__ */
> > >
> > > _______________________________________________
> > > Lustre-devel mailing list
> > > Lustre-devel at clusterfs.com
> > > https://mail.clusterfs.com/mailman/listinfo/lustre-devel
> > >
> >
> >
> >
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Lustre-devel] 2 primitives for the NRS
2008-01-04 13:37 [Lustre-devel] 2 primitives for the NRS Eric Barton
2008-01-04 17:00 ` Weikuan Yu
2008-01-04 17:40 ` Lee Ward
@ 2008-01-05 12:19 ` Alex Lyashkov
2 siblings, 0 replies; 8+ messages in thread
From: Alex Lyashkov @ 2008-01-05 12:19 UTC (permalink / raw)
To: lustre-devel
On Fri, 2008-01-04 at 13:37 +0000, Eric Barton wrote:
> Any suggestions for improving the API appreciated...
>
> #ifndef __LIBCFS_BHEAP_H__
> #define __LIBCFS_BHEAP_H__
>
> /* Auto-array
> * A sparse 1D contiguous array of pointers which uses single, double and
> * triple indirection maps and avoids allocating large contiguous memory.
> *
> * FIXME: CAA_SHIFT should be defined automatically so that
> */
>
> #define CAA_SHIFT 10
> #define CAA_SIZE (1 << CAA_SHIFT) /* # ptrs per level */
> #define CAA_MASK (CAA_SIZE - 1)
> #define CAA_NOB (CAA_SIZE * sizeof(void *))
>
> typedef struct
> {
> void ****aa_3d; /* Triple indirect */
> void ***aa_2d; /* double indirect */
> void **aa_1d; /* single indirect */
> } cfs_autoarray_t;
>
> void cfs_aa_init(cfs_autoarray_t *aa); /* setup */
> void cfs_aa_fini(cfs_autoarray_t *aa); /* free all allocated mem */
>
> /* effectively &aa[idx] but you MUST know &aa[idx] exists */
> void **cfs_aa_index(cfs_autoarray_t *aa, unsigned int idx);
>
> /* effectively &aa[idx] - return NULL if &aa[idx] doesn't exist and 'grow' is
> * FALSE */
> void **cfs_aa_lookup(cfs_autoarray_t *aa, unsigned int idx, int grow);
>
why not move obdclass/class_hash into libcfs and use for lnet also ?
now it really generic code and can be easy moved info libcfs.
also if lookup algorithm in class_hash not good for work with 10^6
elements, this need to be adjusted for all usages.
--
Alex Lyashkov <Alexey.lyashkov@sun.com>
Lustre Group, Sun Microsystems
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2008-01-05 12:19 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-04 13:37 [Lustre-devel] 2 primitives for the NRS Eric Barton
2008-01-04 17:00 ` Weikuan Yu
2008-01-04 18:19 ` Eric Barton
2008-01-04 17:40 ` Lee Ward
2008-01-04 17:51 ` Eric Barton
2008-01-04 18:36 ` Eric Barton
2008-01-05 0:36 ` Lee Ward
2008-01-05 12:19 ` Alex Lyashkov
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.