
The Volatile Qualifier

Ian McEwan edited this page Jun 7, 2013 · 8 revisions

Volatile means "Don't optimize Load or Store operations".

Its effects can look as though the compiler had been "caching" data, but that is a very poor analogy, and one that causes a lot of confusion and improper use.

Here's a link to a long article about it from the Linux kernel folk: https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt

Really, all the compiler can do is static code analysis, reordering and removal. It never knows the run-time contents of variables, and it certainly doesn't cache them.

For example, when writing to a variable like this: x=0; x=1; x=2; the compiler first generates the three store operations. Then, during optimization, it notices that the store for x=0 is superfluous, because it is overwritten by x=1, and so removes it. It doesn't care what the value is! The same thing happens with x=1, leaving just x=2. Making x volatile disables this store optimization, so all three stores happen, which is very useful when the location is a device register!
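The difference can be sketched in C. The names below (fake_register, write_plain, write_volatile, demo_stores) are hypothetical, and an ordinary variable stands in for a memory-mapped register so that the example runs on a hosted system. On real hardware the address would come from the device's documentation.

```c
#include <stdint.h>

/* Stand-in for a device register (an assumption for illustration);
   real hardware would use a fixed address from the datasheet. */
static uint32_t fake_register;

/* Plain pointer: an optimizing compiler may drop the first two stores,
   because nothing reads *reg between them. */
void write_plain(uint32_t *reg)
{
    *reg = 0;
    *reg = 1;
    *reg = 2;
}

/* Volatile-qualified pointer: all three stores must be emitted, in order. */
void write_volatile(volatile uint32_t *reg)
{
    *reg = 0;
    *reg = 1;
    *reg = 2;
}

/* The final value is 2 either way; only the number of stores that
   actually reach the location differs. */
uint32_t demo_stores(void)
{
    write_plain(&fake_register);
    write_volatile(&fake_register);
    return fake_register;
}
```

Comparing the output of gcc -S -O3 for the two write functions should show one store in the first and three in the second, although the exact instructions depend on the compiler and target.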

This redundant-code removal is why a loop like for (i = 0; i < 100; ++i) {} is completely removed under optimization, leaving i=100. It isn't because the compiler does some magic caching of i, but because its static analysis proceeds something like this: it notices that the first conditional, 0 < 100, is constant, and so attempts to unroll the loop into one hundred nested if (i<100) { i=i+1; if (i<100) { ... operations. It then does constant sub-expression expansion and redundant load removal, which removes the now-constant conditionals and leaves i=0; i=1; i=1+1; ... ; i = 1+1+1+1+…. Finally it folds the constants and removes the redundant store operations, leaving just the last one, i=100;, which can't be removed.
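As a rough illustration of the start and end points of that analysis, the two functions below should behave identically. The names count_loop and count_folded are hypothetical, and the second function is only a hand-written picture of what the optimizer is entitled to produce, not actual compiler output.

```c
/* The loop as written in the source. */
int count_loop(void)
{
    int i;
    for (i = 0; i < 100; ++i) {}
    return i;
}

/* What remains once the conditionals are folded and every store to i
   except the last has been removed as redundant. */
int count_folded(void)
{
    return 100;
}
```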

Additionally, the compiler will:

  • never optimize across translation units (a translation unit is the thing that is actually compiled to an object file, after all macros and other preprocessing have been applied), unless link-time optimization is enabled.
  • only optimize across function boundaries if that function is wholly or partly inlined.
  • sometimes optimize across control block boundaries if it can prove the sub-expression is constant.
  • always try to optimize subexpressions, unless told not to.
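The function-boundary rule can be sketched as follows. All names here (shared_count, touch_shared, read_around_call) are hypothetical. The callee is passed as a function pointer so that, within read_around_call itself, the compiler generally cannot see what the call does and therefore cannot assume the variable is unchanged afterwards.

```c
int shared_count;   /* file-scope variable, visible to other code */

/* In real code this would typically live in another translation unit;
   it is defined here only so the sketch is self-contained. */
void touch_shared(void)
{
    shared_count += 1;
}

/* Reads shared_count before and after an opaque call. Unless the call
   is inlined, the second read cannot reuse the first value. */
int read_around_call(void (*opaque)(void))
{
    int before = shared_count;   /* first load */
    opaque();                    /* may modify shared_count */
    int after = shared_count;    /* loaded again after the call */
    return after - before;
}
```

With touch_shared as the argument, the function returns 1, because the increment made inside the call is seen by the second read.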

To illustrate that this is all about optimization, let's try compiling a simple function under different conditions.

int loop() {
  QUAL int i;
  for (i=0; i < 100 ; ++i ) {}
  return i;
  }

Each version was compiled with -O0 (left column) and -O3 (right column), with QUAL set to nothing, volatile, and extern in turn.

                                                     Panel A |                                                           Panel B
$gcc -S -O0 -DQUAL="" -fomit-frame-pointer loop.c            | $gcc -S -O3 -DQUAL="" -fomit-frame-pointer loop.c
                                                             |
_loop:                                                       | _loop:
        movl    $0, -12(%rsp)                                |         movl    $100, %eax
        jmp     L2                                           |         ret
L3:                                                          |
        incl    -12(%rsp)                                    |
L2:                                                          |
        cmpl    $99, -12(%rsp)                               |
        jle     L3                                           |
        movl    -12(%rsp), %eax                              |
        ret                                                  |

So with no qualifier, the optimized version is reduced to simply returning 100. The unoptimized version keeps the loop, although even there the load-increment-store at label L3 is emitted as a single incl instruction operating directly on memory.

                                                     Panel C |                                                           Panel D
$gcc -S -O0 -DQUAL="volatile" -fomit-frame-pointer loop.c    | $gcc -S -O3 -DQUAL="volatile" -fomit-frame-pointer loop.c
                                                             |
_loop:                                                       | _loop:
        movl    $0, -12(%rsp)                                |         movl    $0, -4(%rsp)
        jmp     L2                                           |         movl    -4(%rsp), %eax
L3:                                                          |         cmpl    $99, %eax
        movl    -12(%rsp), %eax                              |         jg      L6
        incl    %eax                                         | L5:       
        movl    %eax, -12(%rsp)                              |         movl    -4(%rsp), %eax
L2:                                                          |         incl    %eax
        movl    -12(%rsp), %eax                              |         movl    %eax, -4(%rsp)
        cmpl    $99, %eax                                    |         movl    -4(%rsp), %eax
        jle     L3                                           |         cmpl    $99, %eax
        movl    -12(%rsp), %eax                              |         jle     L5
        ret                                                  | L6:         
                                                             |         movl    -4(%rsp), %eax
                                                             |         ret

With volatile, none of the load or store operations are optimized away. In both columns the load-increment-store sequence is emitted as three separate instructions, and both contain a store of %eax to i immediately followed by a reload of i into %eax.

                                                     Panel E |                                                           Panel F
$gcc -S -O0 -DQUAL="extern" -fomit-frame-pointer loop.c      | $gcc -S -O3 -DQUAL="extern" -fomit-frame-pointer loop.c
                                                             |
_loop:                                                       | _loop:
        movq    _i@GOTPCREL(%rip), %rax                      |         movq    _i@GOTPCREL(%rip), %rax
        movl    $0, (%rax)                                   |         movl    $100, (%rax)
        jmp     L2                                           |         movl    $100, %eax
L3:                                                          |         ret
        movq    _i@GOTPCREL(%rip), %rax                      |
        movl    (%rax), %eax                                 |
        leal    1(%rax), %edx                                |
        movq    _i@GOTPCREL(%rip), %rax                      |
        movl    %edx, (%rax)                                 |
L2:                                                          |
        movq    _i@GOTPCREL(%rip), %rax                      |
        movl    (%rax), %eax                                 |
        cmpl    $99, %eax                                    |
        jle     L3                                           |
        movq    _i@GOTPCREL(%rip), %rax                      |
        movl    (%rax), %eax                                 |
        ret                                                  |

Extern declares i as an object with static storage duration that is defined elsewhere, so it lives in the program's data area rather than on the stack (or the heap), and it is visible to other functions. The unoptimized code on the left follows the same form as the unoptimized, unqualified code in Panel A, just with the addition of indirect addressing (through the GOT) for the global variable. The optimized code on the right again follows the same form as in Panel B, but importantly the store to the variable is still present: other code may read i afterwards, so the final value of 100 must reach memory.
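A self-contained variant of the extern case is sketched below. The variable is called counter here rather than i, and its definition is placed in the same file purely so the example links on its own; both choices are assumptions for illustration. Normally the definition would sit in a different source file.

```c
/* Definition with static storage duration (normally in another file). */
int counter;

int loop(void)
{
    extern int counter;   /* declaration only: refers to the object above */
    for (counter = 0; counter < 100; ++counter) {}
    return counter;
}
```

After loop() returns, counter holds 100 and any other function can read it, which is why the optimized code still has to perform the final store.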

Busy looping.

Lastly: it is sometimes tempting to use the volatile form of the loop above as a busy loop, because it prevents the compiler from removing the loop. However, as Panel D shows, most of the loop is memory operations. This is unhelpful if other things (such as DMA or another processor) need the memory bus, or if low-power modes are needed.

An alternative way to prevent the unhelpful loop removal is to call a null 'extern' function. The compiler will not optimize across a translation unit boundary (unless link-time optimization is enabled), and because that function might have side effects, it won't optimize outside the control block either.

extern void null(void);                                      | $gcc -S -O3 -fomit-frame-pointer loop.c
                                                             | _loop:
int loop(void) {                                             |         pushq   %rbx
  int i;                                                     |         call    _null
  for (i=0; i < 100 ; ++i )                                  |         movl    $1, %ebx
    null();                                                  | L4:
  return i;                                                  |         call    _null
  }                                                          |         incl    %ebx
                                                             |         cmpl    $100, %ebx
                                                             |         jne     L4
                                                             |         movl    $100, %eax
                                                             |         popq    %rbx
                                                             |         ret

A similar trick can be used to busy loop on a semaphore.

extern void null(void);                                      | $gcc -S -O3 -fomit-frame-pointer loop.c
extern char semaphore;                                       | _loop:
                                                             |         pushq   %rbx
void loop(void) {                                            |         movq  _semaphore@GOTPCREL(%rip), %rbx
  while ( semaphore )                                        |         cmpb    $0, (%rbx)
    null();                                                  |         je      L2
  }                                                          | L5:
                                                             |         call    _null
                                                             |         cmpb    $0, (%rbx)
                                                             |         jne     L5
                                                             | L2:
                                                             |         movl    %ebx, %eax
                                                             |         popq    %rbx
                                                             |         ret

Notice that semaphore is reloaded from memory and compared on every iteration, without resorting to a volatile qualifier: the call to null() might have changed it, so the compiler cannot reuse the earlier value.
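A runnable version of the same idea is sketched below. In this sketch null() is defined locally and clears the flag on its third call so that the loop terminates; that behaviour, along with the names poll_count, wait_for_semaphore and polls_made, is an assumption made for demonstration. In real use null() would be an empty function in another translation unit and something else (an interrupt handler, for instance) would clear the semaphore. The sketch only shows how the compiler treats the load; it makes no claim about atomicity or ordering between processors.

```c
char semaphore = 1;       /* normally defined and cleared elsewhere */
static int poll_count;    /* hypothetical: counts calls to null() */

/* Stand-in for the external null() function. */
void null(void)
{
    if (++poll_count == 3)
        semaphore = 0;    /* clear the flag so the demo terminates */
}

/* Spins until semaphore becomes zero, calling null() on each pass. */
void wait_for_semaphore(void)
{
    while (semaphore)
        null();
}

/* Reports how many times the loop body ran. */
int polls_made(void)
{
    return poll_count;
}
```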

-- Ian
