Forum Topic - jemalloc: (11 Items)
   
jemalloc  
I have ported FreeBSD's jemalloc to QNX.
So far, with my test case, it performs better on SMP than our stock malloc in libc, but on UP it is worse than the stock one. I still have some options to turn on to optimize it.
 
I wrote a small program, testm, to test malloc and free. It creates as many threads as the command line indicates; with no argument it runs single-threaded (main thread only).
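For reference, a minimal sketch of the kind of testm harness described above; the sizes, iteration counts, and function names here are my assumptions, not the original testm source:

```c
/* Sketch of a malloc/free stress harness: each thread malloc()s and
 * free()s small blocks in a tight loop.  Sizes and counts are
 * illustrative, not taken from the original testm. */
#include <pthread.h>
#include <stdlib.h>

#define NSIZES 4
static const size_t sizes[NSIZES] = { 2, 4, 16, 64 };

struct worker_arg {
    int iters;       /* malloc/free pairs to perform */
    long completed;  /* filled in by the worker */
};

static void *worker(void *v)
{
    struct worker_arg *a = v;
    for (int i = 0; i < a->iters; i++) {
        void *p = malloc(sizes[i % NSIZES]);
        if (p == NULL)
            break;
        free(p);
        a->completed++;
    }
    return NULL;
}

/* Run `nthreads` workers (0 means: run in the main thread only, like
 * testm with no argument); return total malloc/free pairs completed. */
long run_benchmark(int nthreads, int iters)
{
    if (nthreads <= 0) {
        struct worker_arg a = { iters, 0 };
        worker(&a);
        return a.completed;
    }
    pthread_t tid[64];
    struct worker_arg args[64];
    if (nthreads > 64)
        nthreads = 64;
    for (int t = 0; t < nthreads; t++) {
        args[t].iters = iters;
        args[t].completed = 0;
        pthread_create(&tid[t], NULL, worker, &args[t]);
    }
    long total = 0;
    for (int t = 0; t < nthreads; t++) {
        pthread_join(tid[t], NULL);
        total += args[t].completed;
    }
    return total;
}
```

Timing `run_benchmark()` with increasing thread counts reproduces the shape of the `time ./testm N` runs below.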
 
our stock one:
# time ./testm
    0.10s real     0.10s user     0.00s system
# time ./testm 2
    0.76s real     0.14s user     0.41s system
# time ./testm 3
    7.91s real     0.33s user     3.72s system
# time ./testm 4
    6.11s real     0.31s user     5.96s system
# pidin in
CPU:X86 Release:6.3.2  FreeMem:1641Mb/2047Mb BootTime:Mar 11 10:48:32 
EST 2008
Processor1: 686 Intel 686 F6M14S4 1834MHz FPU
Processor2: 686 Intel 686 F6M14S4 1833MHz FPU
Processor3: 686 Intel 686 F6M14S4 1837MHz FPU
Processor4: 686 Intel 686 F6M14S4 1834MHz FPU
 
jemalloc:
# time ./testm 2
    0.48s real     0.69s user     0.15s system
# time ./testm 3
    0.77s real     1.00s user     0.83s system
# time ./testm 4
    0.99s real     1.39s user     1.65s system
# time ./testm 5
    1.34s real     1.74s user     2.80s system
# time ./testm 6
    1.56s real     2.13s user     3.31s system
# time ./testm   
    0.27s real     0.27s user     0.00s system
 
So in this case jemalloc's performance is better on SMP, and it seems most malloc libraries don't behave well under SMP.
(Google the Firefox comparison with others like glibc's malloc, Google's ?malloc, ...)
 
I will see whether I can port some test cases to QNX as well.
Attachment: Text libjemalloc.so 79.11 KB
Re: jemalloc  
On Wed, Apr 30, 2008 at 10:35 PM, Yao Zhao <yzhao@qnx.com> wrote:

> I have ported FreeBSD's jemalloc to QNX.
> So far on SMP it performs better than our stock malloc in libc with my
> test case but on UP it is worse than our stock one. I still have something
> to turn on to optimize.
>
> I wrote a small program testm to test malloc and free, it will create
> threads as command line indicated if no arg then it will run with single
> thread(main).
>

[ Malloc numbers snipped ]

Ahh the joys of malloc optimization =;-)

This is a deep and involved topic ... and not that it isn't worth
investigating, but it has been looked
at lots before and the secret is really in the benchmarks and how you do the
timing.  I'd suggest
re-using one of the hundreds of existing benchmarks but also then comparing
that to what real
world applications actually do ... which is where the end solution to the
'build a better malloc
problem' comes in.  Each application has its own sweet spot in the
configuration of malloc.  This
is why our malloc has the capability of being tuned.  The big problem is
that there is limited (no)
tooling that makes it practical for people to actually do the tuning.

Not to mention the fact that for a 'desktop' type environment, tuning
applications is entirely academic
(neat idea, but no one will do it) and for closed embedded systems, good
developers will already have
mitigated their memory management to tailored memory pools for frequently
used dynamic objects.

In any case, I'm interested to see where this thread goes =;-)

Thomas
_______________________________________________
OSTech
http://community.qnx.com/sf/go/post7539
RE: jemalloc  
> This is why our malloc has the capability of being tuned.
> The big problem is that there is limited (no) tooling that
> makes it practical for people to actually do the tuning.

Any more info on that?

Re: jemalloc  
On Thu, May 1, 2008 at 11:28 PM, Mario Charest <mcharest@zinformatic.com>
wrote:

> > This is why our malloc has the capability of being tuned.
> > The big problem is that there is limited (no) tooling that
> > makes it practical for people to actually do the tuning.
>
> Any more info on that?
>
Well, there is the informal information that is now publicly exposed in the malloc
source (lib/c/alloc), and there is a malloc tuning whitepaper (unpolished,
engineering-level content) that has previously been given to customers.  I did a
quick check to see if it was posted in the documents section, but it isn't/hasn't
been.

Perhaps if we ask Shiv really nicely he will post it to the community for us
=;-)

Thomas
Re: jemalloc  
> On Wed, Apr 30, 2008 at 10:35 PM, Yao Zhao <yzhao@qnx.com> wrote:
> 
> > I have ported FreeBSD's jemalloc to QNX.
> > So far on SMP it performs better than our stock malloc in libc with my
> > test case but on UP it is worse than our stock one. I still have something
> > to turn on to optimize.
> >
> > I wrote a small program testm to test malloc and free, it will create
> > threads as command line indicated if no arg then it will run with single
> > thread(main).
> >
> 
> [ Malloc numbers snipped ]
> 
> Ahh the joys of malloc optimization =;-)
> 
> This is a deep and involved topic ... and not that it isn't worth
> investigating, but it has been looked
> at lots before and the secret is really in the benchmarks and how you do the
> timing.  I'd suggest
> re-using one of the hundreds of existing benchmarks but also then comparing
> that to what real
> world applications actually do ... which is where the end solution to the
> 'build a better malloc
> problem' comes in.  Each application has its own sweet spot in the
> configuration of malloc.  This
> is why our malloc has the capability of being tuned.  The big problem is
> that there is limited (no)
> tooling that makes it practical for people to actually do the tuning.
> 
> Not to mention the fact that for a 'desktop' type environment, tuning
> applications is entirely academic
> (neat idea, but no one will do it) and for closed embedded systems, good
> developers will already have
> mitigated their memory management to tailored memory pools for frequently
> used dynamic objects.
> 
> In any case, I'm interested to see where this thread goes =;-)
> 
> Thomas
> _______________________________________________
> OSTech
> http://community.qnx.com/sf/go/post7539


I totally agree!
I didn't test much; I only ran a program that does this:

struct run {
   int size;
   int times;
} test_run[] = {
   {2, 10000},
   {4, 10000},
  ...
};

for (i = 0; i < sizeof(test_run)/sizeof(test_run[0]); i++) {
   for (j = 0; j < test_run[i].times; j++) {
      p = malloc(test_run[i].size);
      free(p);
   }
}

In the real world, some apps like browsers and databases might malloc and free very frequently, but I agree that is rare.

If you read lib/c/alloc and the jemalloc source code, you will find jemalloc is really SMP-optimized.
For example: if _Multi_threaded, it handles the multithreaded case by calling pthread_mutex_trylock rather than calling pthread_mutex_lock directly. Why? On SMP I won't say this is 100% better, but if the threads are all running on different CPUs it is much better than pthread_mutex_lock. On most OSes trylock is a cmpxchg or a spin, while on QNX mutex_lock becomes a kernel call if the lock can't be taken immediately; on other OSes it should be the same thing, because you have to wait to be scheduled. In that case, spinning is probably better than a kernel call.
What I want to say is: just by reading the code you will feel the difference.

static inline unsigned
malloc_spin_lock(pthread_mutex_t *lock)
{
        unsigned ret = 0;

        if (__isthreaded) {
                if (_pthread_mutex_trylock(lock) != 0) {
                        unsigned i;
                        volatile unsigned j;

                        /* Exponentially back off. */
                        for (i = 1; i <= SPIN_LIMIT_2POW; i++) {
                                for (j = 0; j < (1U << i); j++)
                                        ret++;

                                CPU_SPINWAIT;
                                if (_pthread_mutex_trylock(lock) == 0)
                                        return (ret);
                        }

                        /*
                         * Spinning failed.  Block until the lock becomes
                         * available, in order to avoid indefinite priority
                         * inversion.
                         */
        ...
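The excerpt above is truncated, so here is a complete, self-contained sketch of the same try-then-spin-then-block pattern. SPIN_LIMIT_2POW and the helper names are assumptions; jemalloc's real version also checks __isthreaded and uses CPU_SPINWAIT instead of sched_yield():

```c
/* Try-then-spin-then-block lock, modeled on the jemalloc excerpt. */
#include <pthread.h>
#include <sched.h>

#define SPIN_LIMIT_2POW 6

/* Try the lock; on contention, spin with exponential backoff, retrying
 * trylock between rounds, before falling back to a blocking lock. */
static unsigned spin_lock(pthread_mutex_t *lock)
{
    unsigned ret = 0;

    if (pthread_mutex_trylock(lock) != 0) {
        for (unsigned i = 1; i <= SPIN_LIMIT_2POW; i++) {
            for (volatile unsigned j = 0; j < (1U << i); j++)
                ret++;                      /* busy-wait filler */
            sched_yield();                  /* stand-in for CPU_SPINWAIT */
            if (pthread_mutex_trylock(lock) == 0)
                return ret;
        }
        /* Spinning failed: block, avoiding unbounded priority inversion. */
        pthread_mutex_lock(lock);
    }
    return ret;
}

static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static long g_counter;

struct stress_arg { int iters; };

static void *stress_worker(void *v)
{
    struct stress_arg *a = v;
    for (int i = 0; i < a->iters; i++) {
        spin_lock(&g_lock);
        g_counter++;                        /* protected increment */
        pthread_mutex_unlock(&g_lock);
    }
    return NULL;
}

/* Hammer the lock from `nthreads` threads; return the final counter,
 * which equals nthreads * iters if the lock is correct. */
long stress(int nthreads, int iters)
{
    pthread_t tid[16];
    struct stress_arg a = { iters };
    if (nthreads > 16)
        nthreads = 16;
    g_counter = 0;
    for (int t = 0; t < nthreads; t++)
        pthread_create(&tid[t], NULL, stress_worker, &a);
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    return g_counter;
}
```

On the uncontended path this costs one trylock (typically a single cmpxchg); only after the backoff rounds fail does it pay for the blocking kernel call, which is the trade-off the post describes.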
Re: jemalloc  
I uploaded jemalloc.tgz, which includes the jemalloc source I modified to run on QNX, the test case, and a binary.
Running our stock lib/c/alloc on SP2, on UP (testm 3, i.e. multithreaded on UP), it sometimes takes a very long time to finish. I'm not sure why yet, but on 6.3.2 I don't seem to hit the same situation.
On UP, jemalloc still works well past a couple of threads. (Not sure why.)

I haven't read the stock lib/c/alloc closely, but in a quick search I didn't find it using _Multi_threaded. Obviously you don't need locking when you only have the main thread, right? And it is safe to use.

I haven't finished reading jemalloc either. (Too busy :()

MALLOC_OPTIONS=P will show status; it helps a little.
Attachment: Text jemalloc.tgz 213.54 KB
Re: jemalloc  
http://community.qnx.com/integration/viewcvs/viewcvs.cgi/trunk/lib/c/alloc/dlist.c?root=coreos_pub&rev=165018&system=exsy1001&view=markup

in function void *
_list_memalign(size_t alignment, ssize_t n_bytes)
...

	if ((n_bytes >= __flist_abins[__flist_nbins-1].size))
		split = 0;
	/* findFit will remove it from the list */
	if (split)
		fit = _flist_bin_first_fit(alignment, nbytes);

This is trunk lib/c/malloc: if n_bytes is at least the size of the last size class in the freelist bins, split is set to 0 and the freelist bins are never consulted. This is a bug.
In SP2, and probably in 6.3.2, the behaviour is correct.
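To see the flaw in isolation, here is a toy model of that control flow; the bin sizes and function name are hypothetical, and only the split/skip logic mirrors the trunk dlist.c excerpt:

```c
/* Toy model of the _list_memalign() split logic quoted above. */
#include <stddef.h>

#define NBINS 9
static const size_t flist_abins[NBINS] =
    { 8, 16, 32, 64, 128, 256, 512, 1024, 2048 };  /* hypothetical sizes */

/* Returns 1 if the free-list bins would be consulted for a request of
 * n_bytes, 0 if the lookup is skipped. */
int consults_freelist(size_t n_bytes)
{
    int split = 1;
    /* Trunk dlist.c: requests at or above the largest bin size clear
     * `split`, so the first-fit lookup below never runs for them,
     * even when a suitable free block exists. */
    if (n_bytes >= flist_abins[NBINS - 1])
        split = 0;
    return split;   /* _flist_bin_first_fit() only runs when split != 0 */
}
```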
Re: jemalloc  
On Mon, May 26, 2008 at 01:09:17PM -0400, Yao Zhao wrote:
> http://community.qnx.com/integration/viewcvs/viewcvs.cgi/trunk/lib/c/alloc/dlist.c?root=coreos_pub&rev=165018&system=exsy1001&view=markup
> 
> in function void *
> _list_memalign(size_t alignment, ssize_t n_bytes)
> ...
> 
> 
> 
> 	if ((n_bytes >= __flist_abins[__flist_nbins-1].size))
> 		split=0;
> 	/* findFit will remove it from the list */
> 	if (split)
> 		fit = _flist_bin_first_fit(alignment, nbytes);
> 
> 
> this is trunk lib/c/malloc, if n_bytes > the last one size class in freelist bin, split will be set to 0 and it won't 
look up freelist bin, this is a bug.
> In sp2 and probably 6.3.2 its behaviour is correct.
> 

Did you make a PR?

-seanb
Re: jemalloc  
I haven't.

this is the test case:

#include <stdio.h>
#include <malloc.h>
#include <stdlib.h>

void print_status(struct malloc_stats *ps) {
printf( "\nmemory in free small blocks %d\n"
	"memory in free big blocks   %d\n"
	"space in header block headers %d\n"
	"space used by block headers %d\n"
	"space in small blocks in use %d\n"
	"space in big blocks in use %d\n"
	"number of core allocations performed %d\n"
	"number of core de-allocations performed %d\n"
	"size of the arena %d\n"
	"number of frees performed %d\n"
	"number of allocations performed %d\n"
	"number of realloc functions performed %d\n"
	"number of small blocks %d\n"
	"number of big blocks %d\n"
	"number of header blocks %d\n",
	ps->m_small_freemem,
	ps->m_freemem,
	ps->m_small_overhead,
	ps->m_overhead,
	ps->m_small_allocmem,
	ps->m_allocmem,
	ps->m_coreallocs,
	ps->m_corefrees,
	ps->m_heapsize,
	ps->m_frees,
	ps->m_allocs,
	ps->m_reallocs,
	ps->m_small_blocks,
	ps->m_blocks,
	ps->m_hblocks
	);

}

#define TRUNK_LIBC
void print_env(void) {
 typedef struct __flistbins {
 	size_t size;
	} FlinkBins;
extern FlinkBins __flist_abins[];
extern int _min_free_list_size;
#ifdef TRUNK_LIBC 
extern unsigned __core_cache_max_num;
extern unsigned __core_cache_max_sz;
	printf("lib/c/malloc settings: max_num %u max_sz %u\n",
		__core_cache_max_num, __core_cache_max_sz);
	printf("%4d %4d %4d %4d %4d %4d %5d %5d %8d\n",
		__flist_abins[0].size,
		__flist_abins[1].size,
		__flist_abins[2].size,
		__flist_abins[3].size,
		__flist_abins[4].size,
		__flist_abins[5].size,
		__flist_abins[6].size,
		__flist_abins[7].size,
		__flist_abins[8].size
		);
#else
extern unsigned __ac_max_num;
extern unsigned __ac_max_sz;
extern int __ac_curr_num ;  // current number of entries in arena cache
extern int __ac_curr_sz ;  // current size of arena cache

	printf("lib/c/malloc settings: max_num %u max_sz %u\n",
		__ac_max_num, __ac_max_sz);
	printf("lib/c/malloc settings: cur_num %u cur_sz %u\n",
		__ac_curr_num, __ac_curr_sz);
	printf("%4d %4d %4d %4d %4d %4d %5d %5d\n",
		__flist_abins[0].size,
		__flist_abins[1].size,
		__flist_abins[2].size,
		__flist_abins[3].size,
		__flist_abins[4].size,
		__flist_abins[5].size,
		__flist_abins[6].size,
		__flist_abins[7].size
		);
#endif

}

int main(int argc, char ** argv) {


int size = 100*1024*1024;
void * p,*p1;
struct malloc_stats stat;


	print_env();
	p = malloc(size);
	if (!p) {
		printf("can't malloc %d\n", size);
		exit(1);
	}
	mallopt(MALLOC_STATS, (int)&stat);
	print_status(&stat);
	printf("after %dKB malloc\n", size>>10);
	getchar();
	print_env();

	free(p);
	print_env();

	size = 10*1024*1024;
	p = malloc(size);
	if (!p) {
		printf("can't malloc %d\n", size);
		exit(1);
	}
	mallopt(MALLOC_STATS, (int)&stat);
	print_status(&stat);
	printf("after %dKB malloc\n", size>>10);
	print_env();
	getchar();

	p1 = malloc(size);
	if (!p1) {
		printf("can't malloc %d\n", size);
		exit(1);
	}
	mallopt(MALLOC_STATS, (int)&stat);
	print_status(&stat);
	printf("after %dKB malloc\n", size>>10);
	print_env();
	getchar();


	free(p); free(p1);

	return 0;
}
Re: jemalloc  
http://www.qnx.com/developers/docs/6.3.2/neutrino/lib_ref/m/mallopt.html
The doc for mallopt has another bug: I think it is describing libmalloc's mallopt.

The "number of big blocks" field in malloc_stats is probably wrong too: list_release doesn't do _malloc_stats.m_blocks-- properly when the block is an arena. This happens when you allocate >= 64K.


Re: jemalloc  
Bugs are filed.


1. Changed opt_dss to default to false on QNX, because it is not stable (I didn't debug this). You can change opt_dss back to true; opt_dss performance is better.
2. Updated my test case, which can now run tests with:
   single or multiple threads
   mintest, maxtest
   holdrun
3. Added testlibc.c to display malloc_stat and the list stats. The list stats are just the malloc information shown in the IDE's system information, but it is quite handy: you can add it to your apps to find memory leaks. It uses those globals in libc, so I hope that will not stay true in the future.
   Here jemalloc does very well: if the author doesn't want to expose something, he doesn't, and that saves a lot of space in the libc shared library.
4. Ported the jemalloc author's test programs to QNX (probably some are other people's).
5. Updated my reading notes.
Attachment: Text jemalloc.tbz 1.24 MB