Bartosz Dziewoski wrote in post #1076335:
> Not really. Compilers are smarter than us these days; they can
> interchange loops and vectorize the instructions to speed them up (on
> platforms which support it). Writing such code by hand would be
> painful, platform-dependent and error-prone.
> -- Matma Rex

Compilers can optimise better, but only if the code is written in such a way as to let them know it's safe. For example:

void add_numbers(int* a, int* b, int* results, unsigned count) {
  unsigned i;
  for (i = 0; i < count; ++i) {
    results[i] = a[i] + b[i];
  }
}

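The compiler can unroll that loop a bit, but it can't simply vectorise the arithmetic. Why? Because the pointers a, b and results could overlap, and if results overlaps a or b then a store in one iteration can change what a later iteration loads. Vectorising would change the result in that case, so at best the compiler has to add a runtime overlap check and keep the scalar loop as a fallback.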
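
To make the overlap problem concrete, here is a minimal sketch. The overlap_demo helper and its test data are made up for illustration; only add_numbers comes from the code above. It passes a results pointer that sits one element into a, so every store feeds the next load:

```c
#include <assert.h>

/* The simple loop from above, unchanged. */
void add_numbers(int* a, int* b, int* results, unsigned count) {
  unsigned i;
  for (i = 0; i < count; ++i) {
    results[i] = a[i] + b[i];
  }
}

/* Hypothetical demo (not from the original code): results overlaps a,
   shifted along by one element. data must have room for 5 ints. */
void overlap_demo(int data[5]) {
  int ones[4] = {1, 1, 1, 1};
  unsigned i;

  for (i = 0; i < 5; ++i) {
    data[i] = 1;
  }

  /* Each store to data[i + 1] is read back by the next iteration. */
  add_numbers(data, ones, data + 1, 4);

  /* Executed in order, that gives a running sum: 1 2 3 4 5. */
  assert(data[1] == 2 && data[2] == 3 && data[3] == 4 && data[4] == 5);
}
```

A 4-wide vector add would load all four elements of a before storing anything, which would leave 1 2 2 2 2 in data instead. The compiler has to preserve the 1 2 3 4 5 answer, whatever it does internally.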

You could use restrict to tell the compiler to assume these don't overlap:

void add_numbers(int* restrict a, int* restrict b, int* restrict results, unsigned count) {
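
Written out in full, a sketch of the restrict version might look like this. It assumes a C99 or later compiler, and the names add_numbers_restrict and restrict_demo are made up for illustration. Note that restrict qualifies the pointer itself, so it goes after the '*':

```c
#include <assert.h>

/* The caller promises that results does not overlap a or b. */
void add_numbers_restrict(int* restrict a, int* restrict b,
                          int* restrict results, unsigned count) {
  unsigned i;
  for (i = 0; i < count; ++i) {
    results[i] = a[i] + b[i];
  }
}

/* Hypothetical caller that keeps the promise: three separate arrays. */
void restrict_demo(int results[4]) {
  int a[4] = {1, 2, 3, 4};
  int b[4] = {10, 20, 30, 40};

  add_numbers_restrict(a, b, results, 4);

  /* Nothing overlaps, so this is just the element-wise sum. */
  assert(results[0] == 11 && results[3] == 44);
}
```

Nothing checks the promise for you. The compiler just takes your word for it and optimises accordingly.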

If results then overlaps a or b the behaviour is undefined though, and nothing will warn you about it - restrict is quite dangerous. A lot of high performance code works by explicitly unrolling:

void add_numbers(int* a, int* b, int* results, unsigned count) {
  unsigned i;

  /* Process in blocks of 4 */
  int r1, r2, r3, r4;
  for (i = 0; i + 3 < count; i += 4) {
    /* Compute first */
    r1 = a[i] + b[i];
    r2 = a[i + 1] + b[i + 1];
    r3 = a[i + 2] + b[i + 2];
    r4 = a[i + 3] + b[i + 3];

    /* Save second */
    results[i] = r1;
    results[i + 1] = r2;
    results[i + 2] = r3;
    results[i + 3] = r4;
  }

  /* Finish portion not divisible by 4 */
  for (; i < count; ++i) {
    results[i] = a[i] + b[i];
  }
}

The second version does all the loads for a block before any of the stores, so it is logically equivalent to a vectorised loop, even if the ranges overlap, and the compiler is entitled to vectorise it if it's worthwhile. The catch is that the compiler no longer has the option of leaving the loop rolled up. Actually testing this case shows the unrolled version as being slower for me :D
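
As a rough check on that "loads before stores" point, here is a sketch that runs the blocked version on overlapping ranges. The function is the same code as above under the made-up name add_numbers_blocked, and blocked_overlap_demo with its test data is hypothetical:

```c
#include <assert.h>

/* The explicitly unrolled version from above, renamed for this sketch. */
void add_numbers_blocked(int* a, int* b, int* results, unsigned count) {
  unsigned i;
  int r1, r2, r3, r4;

  /* Process in blocks of 4: all four loads, then all four stores. */
  for (i = 0; i + 3 < count; i += 4) {
    r1 = a[i] + b[i];
    r2 = a[i + 1] + b[i + 1];
    r3 = a[i + 2] + b[i + 2];
    r4 = a[i + 3] + b[i + 3];

    results[i] = r1;
    results[i + 1] = r2;
    results[i + 2] = r3;
    results[i + 3] = r4;
  }

  /* Finish portion not divisible by 4. */
  for (; i < count; ++i) {
    results[i] = a[i] + b[i];
  }
}

/* Hypothetical demo: results overlaps a, shifted along by one element. */
void blocked_overlap_demo(int data[5]) {
  int ones[4] = {1, 1, 1, 1};
  unsigned i;

  for (i = 0; i < 5; ++i) {
    data[i] = 1;
  }

  add_numbers_blocked(data, ones, data + 1, 4);

  /* All four loads saw the original 1s, so every sum is 2. */
  assert(data[1] == 2 && data[2] == 2 && data[3] == 2 && data[4] == 2);
}
```

On this input the one-element-at-a-time loop would give 1 2 3 4 5 rather than 1 2 2 2 2, so the two functions are not interchangeable when the ranges overlap. The blocked one has the vector semantics written into the source, which is exactly what lets the compiler use vector instructions for it.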

Compilers are pretty smart, but they can't change the behaviour of your code. Nobody should be writing in assembly any more, but to squeeze performance out of those really tight loops you still have to understand what's going on down there.

Cheers,

Tim