Chip Bagwell
07/31/2008 1:24 PM
post11289
SMP pthread_mutex performance
Hello,
I am trying to test locking algorithms (pthread_mutex vs. lock-free using CAS), and started with a simple pthread_mutex
test program that creates either one or two high-priority threads. There are two counters, each of which may be
protected by a pthread_mutex, depending on a #define in the code. Here are my results.
One Thread, no protection
Total time: 14997705 ns
Two Threads, no protection
Total time: 17997246 ns
One Thread, mutexes
Total time: 14997705 ns
Two Threads, mutexes
Total time: 755884332 ns
As you can see, running two individual threads, each incrementing its own counter protected by its own pthread_mutex,
shows an execution time that is an order of magnitude slower than any of the other three configurations.
Here is the system this is being compiled and executed on:
$ pidin in
CPU:X86 Release:6.3.2 FreeMem:801Mb/1021Mb BootTime:Jul 30 00:37:16 UTC 2008
Processor1: 1586 Intel ?86 F15M4S4 3213MHz FPU
Processor2: 1586 Intel ?86 F15M4S4 3206MHz FPU
Processor3: 1586 Intel ?86 F15M4S4 3211MHz FPU
Processor4: 1586 Intel ?86 F15M4S4 3207MHz FPU
$
Attached you will find the source code, and the makefile to compile the program.
Does anyone have any idea why I am seeing these results?
--Chip Bagwell
Chip Bagwell
07/31/2008 1:25 PM
post11290
Re: SMP pthread_mutex performance
Here is the makefile.
--Chip Bagwell
Robert Craig
07/31/2008 1:42 PM
post11291
RE: SMP pthread_mutex performance
Hi Chip:
Not sure if this is your problem or not, but I can tell you one thing that can cause strange results when you go
multi-threaded on SMP, and that is cache thrashing. If both threads are running on different CPUs and trying to
access the same cache line (which would be true in your case, considering how things are declared), then you end up
with 1 + 1 = 10 instead of 2 :>.
If you can set things up so that the individual thread counters and mutexes are separated by a cache line (maybe by
putting a dummy array in between the declarations), this might change your results. Not sure what your cache line
size is (usually on the order of 32 bytes), so just go for something biggish.
Robert.
-----Original Message-----
From: Chip Bagwell [mailto:community-noreply@qnx.com]
Sent: Thursday, July 31, 2008 1:24 PM
To: ostech-core_os
Subject: SMP pthread_mutex performance
<original post snipped>
Chip Bagwell
07/31/2008 1:58 PM
post11293
Re: RE: SMP pthread_mutex performance
Thanks for the feedback, Robert. I added some 4-Kbyte char buffers between the pthread_mutex variables, and between
the counters as well. The results are about the same.
Two Threads, mutexes
Total time: 753884638 ns
Other things I have tried are:
1) malloc-ing the pthread_mutexes and counters; no difference.
2) Replacing pthread_mutex_lock with pthread_mutex_trylock to always ensure the fast path in the acquire (there should
be no contention); no difference.
3) Performed a kernel trace of this program and the two threads: the only events that show up are KER_NOP events.
Nothing in there says that something is holding the threads off or causing them to block.
I am much more comfortable with the PPC architecture than x86, so this may be a silly question, but on a multi-core x86
processor, can multiple CAS instructions be executed, one per core, without causing contention? I didn't know if one
thread's CAS instruction could "invalidate" the other thread's CAS instruction on a different core.
--Chip Bagwell
Robert Craig
07/31/2008 5:29 PM
post11298
RE: RE: SMP pthread_mutex performance
Oh well. It was worth a try. (I admit to exaggerating a bit with the 1+1=10. I had some idle-loop threads that I was
using for CPU usage calculations, and I'd see something like 20% higher counts with a thread running alone than when
running in tandem with another thread on a different CPU accessing counters that were "close" together.)
Robert.
-----Original Message-----
From: Chip Bagwell [mailto:community-noreply@qnx.com]
Sent: Thursday, July 31, 2008 1:58 PM
To: ostech-core_os
Subject: Re: RE: SMP pthread_mutex performance
<quoted text snipped>
Yao Zhao(deleted)
08/05/2008 2:01 PM
post11465
Re: RE: SMP pthread_mutex performance
I tried this test case, and it seems true on my target as well.
I modified the source to put each threadxmutex on a different cache line (the counters as well), and ran main with
cpumask=1, thread1 with cpumask=2, and thread2 with cpumask=4: still the same thing.
If I change the pthread_mutex_lock to a plain thread2mutex.count++, then it becomes normal.
I also replaced pthread_mutex_lock with a cmpxchg==0 check followed by thread2mutex.count++; it still has the problem,
even though of course it makes no kernel call. Even smp_xchg with thread2mutex.count++ still shows a big difference.
Chip Bagwell
08/06/2008 2:16 PM
post11497
Re: RE: SMP pthread_mutex performance
Thanks for all the responses to my original post. I have figured out my problem. I have been a victim of one of the
oldest programming errors in the history of the C/C++ language: the misplaced semicolon! (Insert funny
waa-waa-waa-waa-waa sound here! ;-) The for-loop in thread1() for counter1 has a trailing semicolon that turns it into
a very fast empty loop! If I had output the final values of the counters in the original code, I would have seen this
immediately. In the end, it was good ole "diff" that showed me the error of my ways, when I diff-ed thread1() and
thread2(). Again, thanks for all the insights and questions, as I have learned a lot about the intricacies of SMP
programming through debugging this problem.
Here is the output of the original program after fixing the semi-colon error in thread1().
One Thread, no protection
Total time: 29995410 ns
Two Threads, no protection
Total time: 18997093 ns
One Thread, mutexes
Total time: 1598755353 ns
Two Threads, mutexes
Total time: 7804805682 ns
Note there is still a weird timing difference with two threads. But now, after applying all the techniques everyone has
mentioned...
1) Make sure the variables that each thread is accessing do not wind up on the same cache line.
2) Pin the threads to a particular CPU to avoid cpu migration.
I get results that make much more sense!!! By the way, my CPU is an Intel Pentium D 3.2 GHz dual core with HT. So two
physical cores, each with HT, gives 4 virtual cores.
One Thread, no protection
Total time: 21996634 ns
Two Threads, no protection
Main pinned to cpu 0
Thread 1 pinned to cpu 1
Thread 2 pinned to cpu 3
Thread 1 time: 28995563 ns
Thread 2 time: 25996022 ns
Total time: 28995563 ns
One Thread, mutexes
Total time: 1580758107 ns
Two Threads, mutexes
Main pinned to cpu 0
Thread 1 pinned to cpu 1
Thread 2 pinned to cpu 3
Thread 1 time: 813875458 ns
Thread 2 time: 796878059 ns
Total time: 813875458 ns
And finally, here is the output of the program when the main thread is always pinned to cpu 0, and we iterate the other
two threads through all the other combinations. The worst case is when the two worker threads are pinned to the same
physical and HT core. A slightly better case is when the two threads are on the same physical core, but on different
HT cores. The best case is when the two worker threads are pinned to different physical cores. It doesn't seem to
matter which physical and HT core the main thread is blocked on. Attached is the final version of the test program, in
all its glory!
Two Threads, mutexes
Main pinned to cpu 0
Thread 1 pinned to cpu 0
Thread 2 pinned to cpu 0
Thread 1 time: 1600755047 ns
Thread 2 time: 1602754741 ns
Total time: 1606754129 ns
Two Threads, mutexes
Main pinned to cpu 0
Thread 1 pinned to cpu 0
Thread 2 pinned to cpu 1
Thread 1 time: 1171820684 ns
Thread 2 time: 1163821908 ns
Total time: 1171820684 ns
Two Threads, mutexes
Main pinned to cpu 0
Thread 1 pinned to cpu 0
Thread 2 pinned to cpu 2
Thread 1 time: 810875917 ns
Thread 2 time: 796878059 ns
Total time: 810875917 ns
Two Threads, mutexes
Main pinned to cpu 0
Thread 1 pinned to cpu 0
Thread 2 pinned to cpu 3
Thread 1 time: 819874540 ns
Thread 2 time: 801877294 ns
Total time: 819874540 ns
Two Threads, mutexes
Main pinned to cpu 0
Thread 1 pinned to cpu 1
Thread 2 pinned to cpu 0
Thread 1 time: 1159822520 ns
Thread 2 time: 1167821296 ns
Total time: 1167821296 ns
Two Threads, mutexes
Main pinned to cpu 0
Thread 1 pinned to cpu 1
Thread 2 pinned to cpu 1
Thread 1 time: 1607753976 ns
Thread 2...
Yao Zhao(deleted)
08/06/2008 2:50 PM
post11501
Re: RE: SMP pthread_mutex performance
This time we did see that ";" in thread1 :( That is why it is so fast :)
Mario Charest
08/01/2008 11:09 AM
post11334
Re: RE: SMP pthread_mutex performance
> Hi Chip:
>
> Not sure if this is your problem or not, but I can tell you one
> thing can cause strange results when you go multi-threaded on SMP and
> that is cache thrashing. If both threads are running on different CPUs
> and trying to access the same cache line (which would be true in your
> case considering how things are declared), then you end up with 1 + 1 =
> 10 instead of 2 :>.
This looks like an Intel Quad Core. I believe CPU1 and CPU2 share the same L2 cache, as do CPU3 and CPU4. I would
expect major cache thrashing if threads migrate between the 2 pairs. Make any sense?
However, if it's 4 different processors (or 2 processors with 2 cores each), then it can be bad, real bad.
Colin Burgess(deleted)
07/31/2008 6:45 PM
post11300
Re: SMP pthread_mutex performance
Some questions.
1) Did you try the same test on a uniprocessor?
2) Did you know that clock_gettime() actually has an accuracy of the timer tick, i.e. ~1 ms?
3) You say you only saw KER_NOPs - you will see this when a thread is forced to schedule onto another CPU. Did you not
see any SyncMutexLock calls or STATE_MUTEX states?
Chip Bagwell wrote:
> <original post snipped>
--
cburgess@qnx.com
Shiv Nagarajan(deleted)
07/31/2008 10:25 PM
post11306
Re: SMP pthread_mutex performance
Is it possible that the threads are migrating between the two CPUs? Can you try and lock each thread to a different
CPU, by doing a
ThreadCtl(_NTO_TCTL_RUNMASK, ....)
and setting the runmask for each thread to be on a different CPU?
thanks
shiv
Thu Jul 31 22:22:42 EDT 2008
--> According to Chip Bagwell <--
<original post snipped>
Content-Description: Sourceforge Attachment
#include <iostream>
#include <stdlib.h>
#include <pthread.h>
#include <sched.h>
#include <semaphore.h>
#include <time.h>
#include <unistd.h> // for sleep()
#define LOOP_ITERATIONS 10000000
#define TWO_THREADS() 1
#define USE_MUTEX() 1
sem_t thread1sem;
sem_t thread2sem;
#if USE_MUTEX()
pthread_mutex_t thread1mutex;
pthread_mutex_t thread2mutex;
#endif
unsigned long counter1;
unsigned long counter2;
unsigned long long startTime1;
unsigned long long endTime1;
unsigned long long startTime2;
unsigned long long endTime2;
extern "C" long long timespec_to_nanosecond(struct timespec const *ts)
{
long long nanosecond = ts->tv_sec;
nanosecond *= 1000000000;
nanosecond += ts->tv_nsec;
return nanosecond;
}
void * thread1(void *arg)
{
struct timespec currentTime;
sem_wait(&thread1sem); // Wait on signal from parent.
sleep(1); // Sleep to make sure this thread doesn't
// finish before the parent has a chance
// to create the other thread(s)
clock_gettime(CLOCK_REALTIME, &currentTime);
startTime1 = timespec_to_nanosecond(&currentTime);
for(unsigned long long i = 0; i < LOOP_ITERATIONS; ++i);
{
#if USE_MUTEX()
pthread_mutex_lock(&thread1mutex);
#endif
++counter1;
#if USE_MUTEX()
pthread_mutex_unlock(&thread1mutex);
#endif
#if TWO_THREADS() == 0
#if USE_MUTEX()
pthread_mutex_lock(&thread2mutex);
#endif
++counter2;
#if USE_MUTEX()
pthread_mutex_unlock(&thread2mutex);
#endif
#endif
}
clock_gettime(CLOCK_REALTIME, &currentTime);
endTime1 = timespec_to_nanosecond(&currentTime);
return NULL;
}
#if TWO_THREADS()
void * thread2(void *arg)
{
struct timespec currentTime;
sem_wait(&thread2sem); // Wait on signal from parent.
sleep(1); // Sleep to make sure this thread doesn't
// finish before the parent has a chance
// to create the other thread(s)
...