Forum Topic - SMP pthread_mutex performance: (11 Items)
SMP pthread_mutex performance  
Hello,
I am trying to test locking algorithms (pthread_mutex vs. lock-free using CAS), and started with a simple pthread_mutex test program that creates either one or two high priority threads.  There are two counters, each of which may be protected by a pthread_mutex, depending on the #define in the code.  Here are my results.

One Thread, no protection
Total time:             14997705 ns
Two Threads, no protection
Total time:             17997246 ns
One Thread, mutexes
Total time:             14997705 ns
Two Threads, mutexes
Total time:            755884332 ns

As you can see, running two individual threads, each incrementing its own counter protected by its own pthread_mutex, shows an execution time that is an order of magnitude slower than any of the other three configurations.

Here is the system this is being compiled and executed on:

$ pidin in
CPU:X86 Release:6.3.2  FreeMem:801Mb/1021Mb BootTime:Jul 30 00:37:16 UTC 2008 
Processor1: 1586 Intel ?86 F15M4S4 3213MHz FPU 
Processor2: 1586 Intel ?86 F15M4S4 3206MHz FPU 
Processor3: 1586 Intel ?86 F15M4S4 3211MHz FPU 
Processor4: 1586 Intel ?86 F15M4S4 3207MHz FPU 
$ 

Attached you will find the source code and the makefile to compile the program.

Does anyone have any idea why I am seeing these results?

--Chip Bagwell
Attachment: Text main.cpp 3.93 KB
Re: SMP pthread_mutex performance  
Here is the makefile.  
--Chip Bagwell
Attachment: Text makefile 89 bytes
RE: SMP pthread_mutex performance  
Hi Chip:

	Not sure if this is your problem or not, but I can tell you one thing that can cause strange results when you go multi-threaded on SMP, and that is cache thrashing.  If both threads are running on different CPUs and trying to access the same cache line (which would be true in your case, considering how things are declared), then you end up with 1 + 1 = 10 instead of 2 :>.

If you can set things up so that the individual thread counters and mutexes are separated by a cache line (maybe by putting a dummy array in between the declarations), this might change your results.  Not sure what your cache line size is (usually on the order of 32 bytes), so just go for something biggish.
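
For example, a minimal sketch of the idea, using the names from your attached main.cpp (the 64-byte pad size is an assumption, so go bigger if in doubt):

    pthread_mutex_t thread1mutex;
    char pad1[64];    /* dummy array: keeps thread2mutex off thread1mutex's cache line */
    pthread_mutex_t thread2mutex;
    char pad2[64];
    unsigned long counter1;
    char pad3[64];    /* likewise for the counters */
    unsigned long counter2;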

   Robert.

Re: RE: SMP pthread_mutex performance  
Thanks for the feedback, Robert.  So I added some 4-Kbyte char buffers between the pthread_mutex variables, and between the counters as well.  The results are about the same.

Two Threads, mutexes
Total time:            753884638 ns

Other things I have tried are:
1) malloc-ing the pthread_mutexes and counters: no difference.
2) Replacing pthread_mutex_lock with pthread_mutex_trylock to always ensure the fast path in the acquire (there should be no contention; see the sketch after this list): no difference.
3) Performing a kernel trace of this program and the two threads: the only events that show up are the KER_NOP events.  Nothing in there says that something is holding the threads off or causing them to block.
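
For reference, the trylock variant in (2) looked roughly like this (a sketch, not the exact code):

    /* fast path: trylock should always succeed, since each thread has its own mutex */
    if (pthread_mutex_trylock(&thread1mutex) != 0)
        pthread_mutex_lock(&thread1mutex);   /* fall back to the blocking call */
    ++counter1;
    pthread_mutex_unlock(&thread1mutex);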

I am much more comfortable with the PPC architecture than x86, so this may be a silly question, but on a multi-core x86 processor, can multiple CAS instructions be executed, one per core, without causing contention?  I didn't know if one thread's CAS instruction could "invalidate" the other thread's CAS instruction on a different core.
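
To be concrete, the kind of lock-free acquire I have in mind is roughly this (a sketch using the GCC __sync builtins; whether a given toolchain provides them is an assumption):

    static volatile unsigned lock_word = 0;

    /* spin until we atomically swap 0 -> 1; the CAS only contends with a CAS
       on the same cache line, which is the question above */
    while (!__sync_bool_compare_and_swap(&lock_word, 0, 1))
        ;
    ++counter1;
    __sync_lock_release(&lock_word);   /* atomic store of 0 with release semantics */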

--Chip Bagwell
RE: RE: SMP pthread_mutex performance  
Oh well.  It was worth a try.  (I admit to exaggerating a bit with the 1+1=10; I had some idle-loop threads that I was using for CPU usage calculations, and I'd see something like 20% higher counts with a thread running alone than when running in tandem with another thread on a different CPU accessing counters that were "close" together.)

  Robert.

Re: RE: SMP pthread_mutex performance  
I tried this test case, and it seems to hold true on my target.

I also modified the source to put each threadXmutex in a different cache line (the counters as well), and ran main with cpumask=1, thread1 with cpumask=2, and thread2 with cpumask=4.  Still the same thing.

If I change the pthread_mutex_lock to thread2mutex.count++, then it becomes normal.

I replaced the pthread_mutex_lock with a cmpxchg == 0 check followed by thread2mutex.count++, and it still has the problem, even though of course there is no kernel call.  Even smp_xchg plus thread2mutex.count++ still shows a big difference.
Attachment: Text main.cpp 4.99 KB
Re: RE: SMP pthread_mutex performance  
Thanks for all the responses to my original post.  I have figured out my problem.  I have been a victim of one of the oldest programming errors in the history of the C/C++ language: the misplaced semi-colon!  (Insert funny waa-waa-waa-waa-waa sound here! ;-)  The for-loop in thread1() for counter1 has a trailing semi-colon that turns it into a very fast empty loop!  If I had output the final result of the counters in the original code, I would have seen this immediately.  In the end, it was good ole "diff" that showed me the error of my ways, when I diff-ed thread1() and thread2().  Again, thanks for all the insights and questions, as I have learned a lot about the intricacies of SMP programming through debugging this problem.
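
For the record, here is the offending line (preprocessor guards omitted):

    for(unsigned long long i = 0; i < LOOP_ITERATIONS; ++i);   /* <-- the stray ';' */
    {
        /* with the ';' above, this block runs exactly once, not LOOP_ITERATIONS times */
        pthread_mutex_lock(&thread1mutex);
        ++counter1;
        pthread_mutex_unlock(&thread1mutex);
    }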

Here is the output of the original program after fixing the semi-colon error in thread1().  

One Thread, no protection
Total time:             29995410 ns
Two Threads, no protection
Total time:             18997093 ns
One Thread, mutexes
Total time:           1598755353 ns
Two Threads, mutexes
Total time:           7804805682 ns

Note there is still some weird timing with two threads.  But now, after applying all the techniques everyone has mentioned...

1) Make sure the variables that each thread is accessing do not wind up on the same cache line.
2) Pin the threads to a particular CPU to avoid CPU migration.

I get results that make much more sense!!!  By the way, my CPU is an Intel Pentium D 3.2 GHz dual core with HT.  So two physical cores, each with HT, gives 4 logical CPUs.

One Thread, no protection
Total time:                21996634 ns

Two Threads, no protection
Main pinned to cpu     0
Thread 1 pinned to cpu 1
Thread 2 pinned to cpu 3
Thread 1 time:             28995563 ns
Thread 2 time:             25996022 ns
Total time:                28995563 ns

One Thread, mutexes
Total time:              1580758107 ns

Two Threads, mutexes
Main pinned to cpu     0
Thread 1 pinned to cpu 1
Thread 2 pinned to cpu 3
Thread 1 time:            813875458 ns
Thread 2 time:            796878059 ns
Total time:               813875458 ns

And finally, here is the output of the program when the main thread is always pinned to CPU 0 and we iterate the other two threads through all the other combinations.  The worst case is when the two worker threads are pinned to the same physical and HT core.  A slightly better case is when the two threads are on the same physical core, but on a different HT core.  The best case is when the two worker threads are pinned to different physical cores.  It doesn't seem to matter which physical and HT core the main thread is blocked on.  Attached is the final version of the test program, in all its glory!
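
The sweep itself was essentially the following (run_test is a hypothetical stand-in for the timed section, not a function from the attached code):

    /* main stays pinned to CPU 0; walk every pairing of the 4 logical CPUs */
    for (int c1 = 0; c1 < 4; ++c1)
        for (int c2 = 0; c2 < 4; ++c2)
            run_test(1 << c1, 1 << c2);   /* runmasks for thread 1 and thread 2 */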

Two Threads, mutexes
Main pinned to cpu     0
Thread 1 pinned to cpu 0
Thread 2 pinned to cpu 0
Thread 1 time:           1600755047 ns
Thread 2 time:           1602754741 ns
Total time:              1606754129 ns

Two Threads, mutexes
Main pinned to cpu     0
Thread 1 pinned to cpu 0
Thread 2 pinned to cpu 1
Thread 1 time:           1171820684 ns
Thread 2 time:           1163821908 ns
Total time:              1171820684 ns

Two Threads, mutexes
Main pinned to cpu     0
Thread 1 pinned to cpu 0
Thread 2 pinned to cpu 2
Thread 1 time:            810875917 ns
Thread 2 time:            796878059 ns
Total time:               810875917 ns

Two Threads, mutexes
Main pinned to cpu     0
Thread 1 pinned to cpu 0
Thread 2 pinned to cpu 3
Thread 1 time:            819874540 ns
Thread 2 time:            801877294 ns
Total time:               819874540 ns

Two Threads, mutexes
Main pinned to cpu     0
Thread 1 pinned to cpu 1
Thread 2 pinned to cpu 0
Thread 1 time:           1159822520 ns
Thread 2 time:           1167821296 ns
Total time:              1167821296 ns

Two Threads, mutexes
Main pinned to cpu     0
Thread 1 pinned to cpu 1
Thread 2 pinned to cpu 1
Thread 1 time:           1607753976 ns
Thread 2...
Attachment: Text main.cpp 7.25 KB
Re: RE: SMP pthread_mutex performance  
This time we did see that ";" in thread1 :(  That is why it is so fast :)
Re: RE: SMP pthread_mutex performance  
> Hi Chip:
> 
> 	Not sure if this is your problem or not, but I can tell you one thing that can cause strange results when you go multi-threaded on SMP, and that is cache thrashing.  If both threads are running on different CPUs and trying to access the same cache line (which would be true in your case, considering how things are declared), then you end up with 1 + 1 = 10 instead of 2 :>.


This looks like an Intel quad core.  I believe CPU1 and CPU2 share the same L2 cache, as do CPU3 and CPU4.  I would expect major cache thrashing if a thread migrates between the two pairs.  Make any sense?

However, if it's 4 different processors (or 2 processors with 2 cores each), then it can be bad, real bad.
Re: SMP pthread_mutex performance  
Some questions.

1) Did you try the same test on a uniprocessor?
2) Did you know that clock_gettime() actually has an accuracy of the timer tick, i.e. ~1 ms?  (See the sketch after this list.)
3) You say you only saw KER_NOPs - you will see this when a thread is forced to schedule onto another CPU.  Did you not see any SyncMutexLock calls or STATE_MUTEX states?
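
For finer-grained timing on Neutrino you can read the free-running cycle counter instead of clock_gettime(); a minimal sketch:

    #include <stdio.h>
    #include <stdint.h>
    #include <sys/neutrino.h>
    #include <sys/syspage.h>

    uint64_t t0 = ClockCycles();
    /* ... code under test ... */
    uint64_t t1 = ClockCycles();
    /* cycles_per_sec converts a cycle delta into seconds */
    printf("elapsed: %f s\n",
           (double)(t1 - t0) / SYSPAGE_ENTRY(qtime)->cycles_per_sec);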

-- 
cburgess@qnx.com
Re: SMP pthread_mutex performance  
Is it possible that the threads are migrating between the two CPUs?  Can you try to lock each thread to a different CPU by doing a

ThreadCtl(_NTO_TCTL_RUNMASK, ....)

and setting the runmask for each thread to be on a different CPU?
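
i.e. something like this at the top of each thread function (a sketch; bit n of the mask allows the thread to run on CPU n, so 0x2 means CPU 1 only):

    #include <stdio.h>
    #include <sys/neutrino.h>

    /* restrict the calling thread to CPU 1 */
    if (ThreadCtl(_NTO_TCTL_RUNMASK, (void *)0x2) == -1)
        perror("ThreadCtl(_NTO_TCTL_RUNMASK)");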

thanks
shiv
Thu Jul 31 22:22:42 EDT 2008


Content-Description: Sourceforge Attachment
	#include <iostream>
	#include <stdlib.h>
	#include <pthread.h>
	#include <sched.h>
	#include <semaphore.h>
	#include <time.h>
	
	#define LOOP_ITERATIONS 10000000
	#define TWO_THREADS() 1
	#define USE_MUTEX()   1
	
	sem_t thread1sem;
	sem_t thread2sem;
	
	#if USE_MUTEX()
	pthread_mutex_t thread1mutex;
	pthread_mutex_t thread2mutex;
	#endif
	
	unsigned long counter1;
	unsigned long counter2;
	
	unsigned long long startTime1;
	unsigned long long endTime1;
	unsigned long long startTime2;
	unsigned long long endTime2;
	
	extern "C" long long timespec_to_nanosecond(struct timespec const *ts)
	{
	    long long nanosecond = ts->tv_sec;
	    nanosecond *= 1000000000;
	    nanosecond += ts->tv_nsec;
	    return nanosecond;
	}
	
	void * thread1(void *arg)
	{
	    struct timespec currentTime;
	    
	    sem_wait(&thread1sem);  // Wait on signal from parent.
	    
	    sleep(1);  // Sleep to make sure this thread doesn't
	               // finish before the parent has a chance
	               // to create the other thread(s)
	    
	    clock_gettime(CLOCK_REALTIME, &currentTime);
	    startTime1 = timespec_to_nanosecond(&currentTime);
	    
	    for(unsigned long long i = 0; i < LOOP_ITERATIONS; ++i);  // NOTE: this trailing ';' is the bug discussed in the thread - it makes this an empty loop, so the block below runs only once
	    {
	#if USE_MUTEX()
	        pthread_mutex_lock(&thread1mutex);
	#endif
	        ++counter1;
	#if USE_MUTEX()
	        pthread_mutex_unlock(&thread1mutex);
	#endif
	
	#if TWO_THREADS() == 0
	
	#if USE_MUTEX()
	        pthread_mutex_lock(&thread2mutex);
	#endif
	        ++counter2;
	#if USE_MUTEX()
	        pthread_mutex_unlock(&thread2mutex);
	#endif
	
	#endif        
	    }
	
	    clock_gettime(CLOCK_REALTIME, &currentTime);
	    endTime1 = timespec_to_nanosecond(&currentTime);
	
	    return 0;
	}
	
	#if TWO_THREADS()
	void * thread2(void *arg)
	{
	    struct timespec currentTime;
	
	    sem_wait(&thread2sem);  // Wait on signal from parent.
	
	    sleep(1);  // Sleep to make sure this thread doesn't
	               // finish before the parent has a chance
	               // to create the other thread(s)
	
	   ...