Forum Topic - openmp poor performance: (3 Items)
   
openmp poor performance  
OpenMP performance is much worse in multi-threaded mode than in single-threaded mode.

code:
```
#include <stdio.h>
#include <omp.h>
#include <glog/logging.h>  // assumed: VLOG comes from glog

#define N 10000

int main(int argc, char** argv) {
  float a[N], b[N], c[N];
  int i;

  // initialize the input arrays
  for (i = 0; i < N; i++) {
    a[i] = i * 2.0;
    b[i] = i * 3.0;
  }

  // one parallel region around the whole job loop: every thread runs
  // the j loop, and the inner "omp for" shares out the i iterations
  #pragma omp parallel num_threads(4) shared(a, b, c) private(i)
  for (size_t j = 0; j < 10000; ++j) {
    #pragma omp for
    for (i = 0; i < N; i++) {
      c[i] = a[i] + b[i];
      VLOG(3) << c[i];
    }
  }
}
```

Changing num_threads(4) to num_threads(1) gives better performance.

Any idea why?
Re: openmp poor performance  
Hi,

note: all I know about OpenMP is what I just found with a quick search ... so take anything I say with caution 😉


In general, please supply as much information as possible when writing things like "better performance", e.g.:

- what are the actual numbers? (or even more generally, what is considered "better performance"?)

- how was this measured?

- is this reproducible?

- what else did you try to understand what's going on, and what were the results?


Looking at your example, and comparing it to what I found, I believe you may have missed segmenting 'c' into per-thread sections, and you may have heavy contention when writing to that array; see the sketch below.
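Something like this (untested, and based only on that quick search; the array names and sizes are taken from your posting) would give each thread its own contiguous slice of 'c':

```
#include <omp.h>
#include <cstdio>

#define N 10000

int main() {
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i * 2.0f; b[i] = i * 3.0f; }

    #pragma omp parallel num_threads(4)
    {
        // each thread computes one contiguous slice of c, so no thread
        // ever writes into another thread's section of the array
        int tid   = omp_get_thread_num();
        int nth   = omp_get_num_threads();
        int chunk = (N + nth - 1) / nth;                // slice size per thread
        int lo    = tid * chunk;
        int hi    = (lo + chunk < N) ? lo + chunk : N;  // clamp the last slice
        for (int i = lo; i < hi; i++)
            c[i] = a[i] + b[i];
    }

    printf("c[N-1] = %f\n", c[N - 1]);
}
```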


HTH

-- Michael


Re: openmp poor performance  
More details:

test code:

```
#include <omp.h>
#include <chrono>
#include <cstdlib>   // std::getenv
#include <iostream>
#include <string>    // std::stoi
#include <vector>


int main(int argc, char** argv) {
    int N = 100000;
    if (auto env = std::getenv("NUM_ITERS")) N = std::stoi(env);

    int jobs = 10000;
    if (auto env = std::getenv("NUM_JOBS")) jobs = std::stoi(env);

    int nthreads = 4;
    if (auto env = std::getenv("NUM_THREADS")) nthreads = std::stoi(env);

    std::vector<float> a(N), b(N), c(N);

    for (int i = 0; i < N; i++) {
        a[i] = i * 2.0;
        b[i] = i * 3.0;
    }

    auto startTime = std::chrono::high_resolution_clock::now();

    // every pass through this loop creates and tears down a parallel region
    for (int j = 0; j < jobs; ++j) {
        #pragma omp parallel for num_threads(nthreads)
        for (int i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
        }
    }

    auto endTime = std::chrono::high_resolution_clock::now();
    float totalTime = std::chrono::duration<float, std::milli>(endTime - startTime).count();
    std::cout << "total time: " << totalTime << "ms" << std::endl;
}
```

Result running on QNX:

```
# export NUM_THREADS=1 
# ./openmp_test        
total time: 633ms
# export NUM_THREADS=4 
# ./openmp_test        
total time: 1497ms
```

Result running on Ubuntu on the same hardware:


```
nvidia@tegra-ubuntu:~$ export NUM_THREADS=1
nvidia@tegra-ubuntu:~$ ./openmp_test
total time: 484.654ms
nvidia@tegra-ubuntu:~$ export NUM_THREADS=4
nvidia@tegra-ubuntu:~$ ./openmp_test
total time: 237.67ms
```

You can see that on QNX, using more threads adds latency, which doesn't make sense.
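One variant that might help narrow this down (just a guess, not a confirmed diagnosis): if creating the thread team 10000 times is what's expensive on QNX, hoisting the parallel region out of the jobs loop, as in the original posting, should create the team only once and shrink the gap:

```
#include <omp.h>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

int main() {
    int N = 100000, jobs = 10000, nthreads = 4;
    if (auto env = std::getenv("NUM_ITERS"))   N        = std::stoi(env);
    if (auto env = std::getenv("NUM_JOBS"))    jobs     = std::stoi(env);
    if (auto env = std::getenv("NUM_THREADS")) nthreads = std::stoi(env);

    std::vector<float> a(N), b(N), c(N);
    for (int i = 0; i < N; i++) { a[i] = i * 2.0f; b[i] = i * 3.0f; }

    auto startTime = std::chrono::high_resolution_clock::now();

    // the thread team is created once; each job pays only the implicit
    // barrier at the end of the "omp for" instead of a full fork/join
    #pragma omp parallel num_threads(nthreads)
    for (int j = 0; j < jobs; ++j) {
        #pragma omp for
        for (int i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
        }
    }

    auto endTime = std::chrono::high_resolution_clock::now();
    std::cout << "total time: "
              << std::chrono::duration<float, std::milli>(endTime - startTime).count()
              << "ms" << std::endl;
}
```

If the single-thread and four-thread times move much closer together with this version, the overhead is in the per-job region setup rather than in the loop body itself.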