The Volatile Qualifier
Volatile means "Don't optimize Load or Store operations".
While its effects might look as if the compiler were "caching" data, that is a poor analogy, and one that causes a lot of confusion and improper use.
Here's a link to a long article about it from the Linux kernel folks: https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt
Really, all the compiler can do is static code analysis, reordering, and removal. It never knows the contents of variables, and it certainly doesn't cache them.
For example, when writing to a variable like this: x=0; x=1; x=2;
the compiler first generates the three store operations. During optimization it notices that the value stored by x=0 is superfluous, because it is immediately overwritten by x=1, and so removes that store. It doesn't care what the value is! The same thing happens with x=1, leaving just x=2. Making x volatile disables this store elimination, so all three stores happen - which is very useful when the location is a device register!
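To make that concrete, here is a minimal sketch of where the preserved stores matter. The register name and address below are invented for illustration, not taken from any real device:

#include <stdint.h>

/* Hypothetical memory-mapped command register; the address is made up. */
#define CMD_REG (*(volatile uint32_t *)0x40000000u)

void reset_device(void)
{
    CMD_REG = 0;  /* all three stores are emitted, in order,    */
    CMD_REG = 1;  /* because the lvalue is volatile-qualified - */
    CMD_REG = 2;  /* none of them can be treated as dead.       */
}

Without the volatile in the cast, an optimizing compiler would be free to keep only the final store of 2.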
This redundant-code removal is why a loop like for (i = 0; i < 100; ++i) {} is completely removed under optimization, leaving just i=100. It isn't because the compiler does some magic caching of i, but because its static analysis proceeds something like this: it notices that the first conditional, 0 < 100, is constant, and so attempts to unroll the loop into one hundred if (i<100) { i=i+1; if (i<100) { ... operations. It then performs constant sub-expression evaluation, which removes the now-constant conditionals and leaves i=0; i=1; i=1+1; ... ; i=1+1+1+1+... Finally it folds the constants and removes the redundant store operations, leaving just the last one, i=100;, which can't be removed.
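Roughly, the intermediate forms the optimizer works through look like this (a conceptual sketch of the reasoning above, not a literal list of gcc passes):

/* original:            for (i = 0; i < 100; ++i) {}                        */
/* after unrolling:     i = 0; if (i < 100) { i = i + 1; if (i < 100) { ... */
/* constants folded:    i = 0; i = 1; i = 2; ... ; i = 100;                 */
/* dead stores removed: i = 100;                                            */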
Additionally, the compiler will:
- never optimize across translation units (a translation unit being the thing that is actually compiled to an object file, after all macro and other preprocessing has been done) - see the sketch after this list.
- only optimize across function boundaries if the function is wholly or partly inlined.
- sometimes optimize across control-block boundaries, if it can prove the sub-expression is constant.
- always try to optimize sub-expressions, unless told not to.
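Here is a small sketch of the first two points. The names g and tick() are made up for illustration; tick() is defined in some other file, so the compiler cannot see into it and has to assume it may modify the global:

/* other.c - a separate translation unit */
int g;
void tick(void) { ++g; }

/* main.c */
extern int g;
extern void tick(void);

int twice(void)
{
    int a = g;  /* first load of g                                   */
    tick();     /* opaque call: the compiler cannot see its body     */
    int b = g;  /* g must be reloaded - the call may have changed it */
    return a + b;
}

If tick() were instead defined in the same file and inlined, the compiler could see exactly what it does to g and keep the value in a register rather than reloading it.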
To illustrate that this is all about optimization, let's try compiling a simple function under different conditions.
int loop() {
    QUAL int i;
    for (i=0; i < 100 ; ++i ) {}
    return i;
}
It is compiled with -O0 (left column) and -O3 (right column), with QUAL set to nothing, to volatile, and to extern.
Panel A | Panel B
$gcc -S -O0 -DQUAL="" -fomit-frame-pointer loop.c | $gcc -S -O3 -DQUAL="" -fomit-frame-pointer loop.c
|
_loop: | _loop:
movl $0, -12(%rsp) | movl $100, %eax
jmp L2 | ret
L3: |
incl -12(%rsp) |
L2: |
cmpl $99, -12(%rsp) |
jle L3 |
movl -12(%rsp), %eax |
ret |
So with no qualifier, the optimized version is reduced to simply returning 100. The unoptimized compile keeps the loop, but the load-increment-store at L3 has been collapsed to a single incl instruction that operates directly on memory.
Panel C | Panel D
$gcc -S -O0 -DQUAL="volatile" -fomit-frame-pointer loop.c | $gcc -S -O3 -DQUAL="volatile" -fomit-frame-pointer loop.c
|
_loop: | _loop:
movl $0, -12(%rsp) | movl $0, -4(%rsp)
jmp L2 | movl -4(%rsp), %eax
L3: | cmpl $99, %eax
movl -12(%rsp), %eax | jg L6
incl %eax | L5:
movl %eax, -12(%rsp) | movl -4(%rsp), %eax
L2: | incl %eax
movl -12(%rsp), %eax | movl %eax, -4(%rsp)
cmpl $99, %eax | movl -4(%rsp), %eax
jle L3 | cmpl $99, %eax
movl -12(%rsp), %eax | jle L5
ret | L6:
| movl -4(%rsp), %eax
| ret
With volatile, none of the load or store operations are optimized away. In both columns the load-increment-store sequence is left intact, and both contain redundant store-to-i / reload-from-i pairs (movl %eax -> i immediately followed by movl i -> %eax) that would normally be eliminated.
Panel E | Panel F
$gcc -S -O0 -DQUAL="extern" -fomit-frame-pointer loop.c | $gcc -S -O3 -DQUAL="extern" -fomit-frame-pointer loop.c
|
_loop: | _loop:
movq _i@GOTPCREL(%rip), %rax | movq _i@GOTPCREL(%rip), %rax
movl $0, (%rax) | movl $100, (%rax)
jmp L2 | movl $100, %eax
L3: | ret
movq _i@GOTPCREL(%rip), %rax |
movl (%rax), %eax |
leal 1(%rax), %edx |
movq _i@GOTPCREL(%rip), %rax |
movl %edx, (%rax) |
L2: |
movq _i@GOTPCREL(%rip), %rax |
movl (%rax), %eax |
cmpl $99, %eax |
jle L3 |
movq _i@GOTPCREL(%rip), %rax |
movl (%rax), %eax |
ret |
Extern causes i to be allocated in static storage rather than on the stack, and to be visible to other translation units. The unoptimized code on the left follows the same form as the unoptimized, unqualified code in Panel A, just with the addition of indirect addressing for the global variable. The optimized code on the right again follows the same form as Panel B, but importantly the store operation to the global is still present.
Lastly: it is sometimes tempting to use the volatile form of the loop above as a busy loop, because it prevents the compiler from removing the loop. However, as Panel D shows, most of the loop is memory operations. That is unhelpful if other things (such as DMA, or another processor) need the memory bus, or if low-power modes are needed.
An alternative way to prevent the unhelpful loop removal is to call a null 'extern' function from inside the loop. The compiler will not optimize across a translation-unit boundary, and because that function might have side effects, it won't optimize across the control block either.
extern void null(void); | $gcc -S -O3 -fomit-frame-pointer loop.c
| _loop:
int loop(void) { | pushq %rbx
int i; | call _null
for (i=0; i < 100 ; ++i ) { null(); } | movl $1, %ebx
| L4:
return i; | call _null
} | incl %ebx
| cmpl $100, %ebx
| jne L4
| movl $100, %eax
| popq %rbx
| ret
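For the example to link, null() only needs a do-nothing definition in some other file, for instance a null.c containing just: void null(void) {}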
A similar trick can be used to busy loop on a semaphore.
extern void null(void); | $gcc -S -O3 -fomit-frame-pointer loop.c
extern char semaphore; | _loop:
| pushq %rbx
void loop(void) { | movq _semaphore@GOTPCREL(%rip), %rbx
while ( semaphore ) | cmpb $0, (%rbx)
null(); | je L2
} | L5:
| call _null
| cmpb $0, (%rbx)
| jne L5
| L2:
| movl %ebx, %eax
| popq %rbx
| ret
Notice that the semaphore variable is re-read from memory and compared on every iteration, without resorting to a volatile qualifier.
-- Ian
-- [email protected]