Forum Topic - openmp poor performance: (3 Items)
   
openmp poor performance  
OpenMP performance is much worse in multi-threaded mode than in single-threaded mode.

code:
```
#include <stdio.h>
#include <omp.h>
#include <glog/logging.h>  // assumed: VLOG comes from glog

#define N 10000

int main(int argc, char** argv) {
  float a[N], b[N], c[N];
  int i;

  // initialize the input arrays
  for (i = 0; i < N; i++) {
    a[i] = i * 2.0;
    b[i] = i * 3.0;
  }

  // one parallel region around the whole job loop: every thread runs
  // the j loop, and the inner "omp for" shares out the i iterations
  #pragma omp parallel num_threads(4) shared(a, b, c) private(i)
  for (size_t j = 0; j < 10000; ++j) {
    #pragma omp for
    for (i = 0; i < N; i++) {
      c[i] = a[i] + b[i];
      VLOG(3) << c[i];
    }
  }
}
```

Changing num_threads(4) to num_threads(1) gives better performance.

Any idea why?
Re: openmp poor performance  
Hi,

note: all I know about OpenMP is what I just found with a quick search ... so take anything I say with caution 😉


In general, please supply as much information as possible when writing things like "better performance", e.g.:

- what are the actual numbers? (or even more generally, what is considered "better performance"?)

- how was this measured?

- is this reproducible?

- what else did you try to understand what's going on, and what were the results?


Looking at your example, and comparing it to what I found, I believe you may have missed segmenting 'c' into per-thread sections, and you may have heavy contention when writing to that array; see the sketch below.
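Something like this (untested, and based only on that quick search; the array names and sizes are taken from your posting) would give each thread its own contiguous slice of 'c':

```
#include <omp.h>
#include <cstdio>

#define N 10000

int main() {
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i * 2.0f; b[i] = i * 3.0f; }

    #pragma omp parallel num_threads(4)
    {
        // each thread computes one contiguous slice of c, so no thread
        // ever writes into another thread's section of the array
        int tid   = omp_get_thread_num();
        int nth   = omp_get_num_threads();
        int chunk = (N + nth - 1) / nth;                // slice size per thread
        int lo    = tid * chunk;
        int hi    = (lo + chunk < N) ? lo + chunk : N;  // clamp the last slice
        for (int i = lo; i < hi; i++)
            c[i] = a[i] + b[i];
    }

    printf("c[N-1] = %f\n", c[N - 1]);
}
```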


HTH

-- Michael


Re: openmp poor performance  
More details:

test code:

```
#include <omp.h>
#include <chrono>
#include <cstdlib>   // std::getenv
#include <iostream>
#include <string>    // std::stoi
#include <vector>


int main(int argc, char** argv) {
    int N = 100000;
    if (auto env = std::getenv("NUM_ITERS")) N = std::stoi(env);

    int jobs = 10000;
    if (auto env = std::getenv("NUM_JOBS")) jobs = std::stoi(env);

    int nthreads = 4;
    if (auto env = std::getenv("NUM_THREADS")) nthreads = std::stoi(env);

    std::vector<float> a(N), b(N), c(N);

    for (int i = 0; i < N; i++) {
        a[i] = i * 2.0;
        b[i] = i * 3.0;
    }

    auto startTime = std::chrono::high_resolution_clock::now();

    // every pass through this loop creates and tears down a parallel region
    for (int j = 0; j < jobs; ++j) {
        #pragma omp parallel for num_threads(nthreads)
        for (int i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
        }
    }

    auto endTime = std::chrono::high_resolution_clock::now();
    float totalTime = std::chrono::duration<float, std::milli>(endTime - startTime).count();
    std::cout << "total time: " << totalTime << "ms" << std::endl;
}
```

Result running on QNX:

```
# export NUM_THREADS=1 
# ./openmp_test        
total time: 633ms
# export NUM_THREADS=4 
# ./openmp_test        
total time: 1497ms
```

Result running on Ubuntu on the same hardware:


```
nvidia@tegra-ubuntu:~$ export NUM_THREADS=1
nvidia@tegra-ubuntu:~$ ./openmp_test
total time: 484.654ms
nvidia@tegra-ubuntu:~$ export NUM_THREADS=4
nvidia@tegra-ubuntu:~$ ./openmp_test
total time: 237.67ms
```

You can see that on QNX, using more threads adds latency, which doesn't make sense.
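One variant that might help narrow this down (just a guess, not a confirmed diagnosis): if creating the thread team 10000 times is what's expensive on QNX, hoisting the parallel region out of the jobs loop, as in the original posting, should create the team only once and shrink the gap:

```
#include <omp.h>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

int main() {
    int N = 100000, jobs = 10000, nthreads = 4;
    if (auto env = std::getenv("NUM_ITERS"))   N        = std::stoi(env);
    if (auto env = std::getenv("NUM_JOBS"))    jobs     = std::stoi(env);
    if (auto env = std::getenv("NUM_THREADS")) nthreads = std::stoi(env);

    std::vector<float> a(N), b(N), c(N);
    for (int i = 0; i < N; i++) { a[i] = i * 2.0f; b[i] = i * 3.0f; }

    auto startTime = std::chrono::high_resolution_clock::now();

    // the thread team is created once; each job pays only the implicit
    // barrier at the end of the "omp for" instead of a full fork/join
    #pragma omp parallel num_threads(nthreads)
    for (int j = 0; j < jobs; ++j) {
        #pragma omp for
        for (int i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
        }
    }

    auto endTime = std::chrono::high_resolution_clock::now();
    std::cout << "total time: "
              << std::chrono::duration<float, std::milli>(endTime - startTime).count()
              << "ms" << std::endl;
}
```

If the single-thread and four-thread times move much closer together with this version, the overhead is in the per-job region setup rather than in the loop body itself.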