Talk:CFLAGS

Unless a reference is added which states that passing -ftree-parallelize-loops=n together with -fopenmp really enables OpenMP-based parallel loops, I am going to strip the recommendation of enabling OpenMP globally, since this is pure nonsense: if a package does not use OpenMP, there is afaik no way to enable it automatically. What would help generically, on the other hand, is rewriting loops to use vectorization (see -ftree-vectorize, enabled by -O3). Rewriting loops for processor-cache awareness probably only gives a speed-up in HPC code, since modern CPUs have scatter/gather instructions and the like (see also AVX2).

The performance gain by enabling auto parallelization system wide seems to be relatively small:
I've been experimenting with enabling graphite and auto-parallelization for selected packages with gcc-4.8. My findings indicate that auto-parallelization only works in very rare cases, namely when the loops to be parallelized run over arrays rather than pointers plus offsets. Since in the vast majority of cases loops run over pointers with offsets, auto-parallelization does not work. The CFLAGS I used for testing were CFLAGS="-O3 -ftree-parallelize-loops=6 -floop-parallelize-all -march=native -floop-interchange -ftree-loop-distribution -floop-strip-mine -floop-block -pipe". Additional output on parallelization can be generated using "-fdump-tree-parloops-details -fdump-tree-graphite-all".

Let me clarify this: The following function will not be parallelized.

void sum(double *a, double *b, double *result, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        result[i] = a[i] + b[i];
    }
}

Changing the variables to arrays, for instance by making them global arrays, allows graphite to parallelize the function:

#define N 1000

double a[N], b[N], result[N];

void sum(void)
{
    int i;
    for (i = 0; i < N; i++) {
        result[i] = a[i] + b[i];
    }
}

Given the better readability of the first variant, the second is obviously not something one encounters very often.

Another issue is that gcc does not check the profitability of auto-parallelization, so even in cases where it works, the added overhead may decrease overall performance. An example is the dot-product code posted here, which of course has to be changed so that the loops run over arrays rather than pointers, and so that it computes the dot product instead of the trivial calculation.

My conclusion is that the performance gains possible with auto-parallelization are pretty limited, except for packages intentionally written to use this gcc feature. On the other hand, there are a few packages that either do not compile or do not run with auto-parallelization enabled. Therefore, I would not recommend using auto-parallelization system-wide, but rather enabling it selectively for packages known to benefit from it (CFLAGS can be set per package using /etc/portage/env and /etc/portage/package.env).
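For completeness, the per-package mechanism mentioned above works roughly as follows (the file name and the package are illustrative, not a recommendation):

```shell
# /etc/portage/env/parallel.conf  (the file name is arbitrary)
CFLAGS="${CFLAGS} -ftree-parallelize-loops=4 -floop-parallelize-all"

# /etc/portage/package.env -- apply that env file to selected packages
# (sci-libs/fftw is just an example of a numerics-heavy package)
sci-libs/fftw parallel.conf
```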