"post"
Sun, 09 Aug 2009 10:27:45 EST
Factorials, Tail Recursion and CPS ... in C
Recursive algorithms are elegant. However, if the recursion is not a tail call, the growth of the stack eventually leads to a stack overflow.
Tail recursion is a technique whereby the last call in a recursive function does not depend on the values pushed on the stack: the function simply returns the value of its final (recursive) call.
Functional languages like Haskell or Lisp are designed to support tail recursive algorithms. The JVM - although now the target platform of a Lisp like Clojure and a hybrid functional language like Scala - does not support tail recursion at all. In C/C++ the compiler can in fact replace tail recursive calls with a simple loop, thereby eliminating the allocation of additional stack frames altogether. In this post I'll consider various implementations of the humble factorial to illustrate some of these things.
The factorial of an integer n is defined as:
n! = n * (n-1)!
1! = 1
Here's a straightforward implementation using a while loop:
fact = 1;
while (n > 0) {
    fact *= n;
    n--;
}
A naive recursive implementation looks like this:
int fact(int n) {
    if (n == 1) return 1;
    return n * fact(n-1);
}
fact cannot return a value until the recursive call beneath it has returned, so each invocation leaves a frame behind and the stack grows with n. A tail recursive version of the same algorithm uses an accumulator to remove the multiplication from the last line:
int fact(int n, int accum) {
    if (n == 1) return accum;
    return fact(n-1, n*accum);
}
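The accumulator carries the running product, so the initial call supplies 1. For example:
fact(5, 1);   /* fact(5,1) -> fact(4,5) -> fact(3,20) -> fact(2,60) -> fact(1,120) == 120 */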
When discussing recursion the y-combinator cannot be far behind.
This document discusses the reduction of a factorial in Scheme to an application of the y-combinator. When I read it, a third possibility for calculating the factorial sprang to mind: pass the function itself as part of the recursion. This is very much like a CPS transformation: each subsequent call is passed a continuation of the previous calculation.
C doesn't support continuations. The role of the continuation is played by the stack, but the stack is not accessible as a first-class object in C.
A first approach is to rewrite the previous function as:
int fact(Func f, int n, int accum) {
    if (n == 1) return accum;
    return f(f, n-1, n*accum);
}
What's the signature of f? It's a function which takes as its first argument a function, which takes as its first argument a function, which... In other words we now need to explicitly type the recursion, which is not possible in C. Never mind then? Well, we can take advantage of C's lack of typing and change the implementation just a little:
/* Func is assumed here to be declared as a plain function type, e.g.:
   typedef int Func(void *, int, int);  */
int fact(void* f, int val, int acc) {
    if (val == 1) return acc;
    Func* fptr = (Func *) f;
    return fptr((void *) f, val-1, acc*val);
}
This does the trick.
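For the record, the fully typed version would have required a self-referential type, which is exactly what C won't let us declare; the line below is a sketch of the failed attempt, not working code:
typedef int (*Func)(Func f, int n, int accum);   /* won't compile: Func refers to itself */
With the void* workaround in place, a minimal driver might look like this (my own sketch, assuming the definitions above; casting a function pointer to void * isn't strictly portable C, but it is exactly what the function itself relies on):
#include <stdio.h>

int main(void) {
    /* 5! == 120; the accumulator starts at 1 */
    printf("%d\n", fact((void *) fact, 5, 1));
    return 0;
}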
The file factorial.c in c_fact contains all three implementations discussed above, as well as the 'loop' based calculation. I'm compiling on a 64 bit machine (AMD Turion dual core), running Ubuntu.
It's instructive to look at the assembly code generated by the compiler for each implementation, at various levels of optimization.
First, I compile a non-optimized debug version, like so:
gcc -m32 -g -c -o factorial.o factorial.c
gcc -m32 factorial.o -o factorial
objdump -D ./factorial
The -m32 flag forces gcc to generate 32-bit code; the 64-bit code is not materially different. objdump -D disassembles the resulting executable.
Here is the dump for the first variation of the factorial function:
0804845e <factorial>:
804845e: 55 push %ebp
804845f: 89 e5 mov %esp,%ebp
8048461: 83 ec 08 sub $0x8,%esp
8048464: 83 7d 08 00 cmpl $0x0,0x8(%ebp)
8048468: 75 09 jne 8048473 <factorial+0x15>
804846a: c7 45 fc 01 00 00 00 movl $0x1,-0x4(%ebp)
8048471: eb 17 jmp 804848a <factorial+0x2c>
8048473: 8b 45 08 mov 0x8(%ebp),%eax
8048476: 83 e8 01 sub $0x1,%eax
8048479: 89 04 24 mov %eax,(%esp)
804847c: e8 dd ff ff ff call 804845e <factorial>
8048481: 89 c2 mov %eax,%edx
8048483: 0f af 55 08 imul 0x8(%ebp),%edx
8048487: 89 55 fc mov %edx,-0x4(%ebp)
804848a: 8b 45 fc mov -0x4(%ebp),%eax
804848d: c9 leave
804848e: c3 ret
Notice that at address 804847c the factorial function calls itself again.
The growth of the stack is quite obvious in the debugger:
#0 factorial (val=3) at factorial.c:23
#1 0x08048481 in factorial (val=4) at factorial.c:24
#2 0x08048481 in factorial (val=5) at factorial.c:24
#3 0x08048481 in factorial (val=6) at factorial.c:24
#4 0x08048481 in factorial (val=7) at factorial.c:24
#5 0x08048481 in factorial (val=8) at factorial.c:24
#6 0x08048481 in factorial (val=9) at factorial.c:24
#7 0x08048481 in factorial (val=10) at factorial.c:24
#8 0x0804861e in main (argc=2, argv=0xfffe8364) at factorial.c:66
The assembly code generated for the other two implementations is not substantially different.
0804848f <factorial2>:
804848f: 55 push %ebp
8048490: 89 e5 mov %esp,%ebp
8048492: 83 ec 0c sub $0xc,%esp
8048495: 83 7d 08 01 cmpl $0x1,0x8(%ebp)
8048499: 75 08 jne 80484a3 <factorial2+0x14>
804849b: 8b 45 0c mov 0xc(%ebp),%eax
804849e: 89 45 fc mov %eax,-0x4(%ebp)
80484a1: eb 1c jmp 80484bf <factorial2+0x30>
80484a3: 8b 45 08 mov 0x8(%ebp),%eax
80484a6: 0f af 45 0c imul 0xc(%ebp),%eax
80484aa: 8b 55 08 mov 0x8(%ebp),%edx
80484ad: 83 ea 01 sub $0x1,%edx
80484b0: 89 44 24 04 mov %eax,0x4(%esp)
80484b4: 89 14 24 mov %edx,(%esp)
80484b7: e8 d3 ff ff ff call 804848f <factorial2>
80484bc: 89 45 fc mov %eax,-0x4(%ebp)
80484bf: 8b 45 fc mov -0x4(%ebp),%eax
80484c2: c9 leave
80484c3: c3 ret
080484c4 <factorial3>:
80484c4: 55 push %ebp
80484c5: 89 e5 mov %esp,%ebp
80484c7: 83 ec 28 sub $0x28,%esp
80484ca: 83 7d 0c 01 cmpl $0x1,0xc(%ebp)
80484ce: 75 08 jne 80484d8 <factorial3+0x14>
80484d0: 8b 45 10 mov 0x10(%ebp),%eax
80484d3: 89 45 ec mov %eax,-0x14(%ebp)
80484d6: eb 29 jmp 8048501 <factorial3+0x3d>
80484d8: 8b 45 08 mov 0x8(%ebp),%eax
80484db: 89 45 fc mov %eax,-0x4(%ebp)
80484de: 8b 45 10 mov 0x10(%ebp),%eax
80484e1: 0f af 45 0c imul 0xc(%ebp),%eax
80484e5: 8b 55 0c mov 0xc(%ebp),%edx
80484e8: 83 ea 01 sub $0x1,%edx
80484eb: 89 44 24 08 mov %eax,0x8(%esp)
80484ef: 89 54 24 04 mov %edx,0x4(%esp)
80484f3: 8b 45 08 mov 0x8(%ebp),%eax
80484f6: 89 04 24 mov %eax,(%esp)
80484f9: 8b 45 fc mov -0x4(%ebp),%eax
80484fc: ff d0 call *%eax
80484fe: 89 45 ec mov %eax,-0x14(%ebp)
8048501: 8b 45 ec mov -0x14(%ebp),%eax
8048504: c9 leave
8048505: c3 ret
In factorial2 the recursion shows up as a direct call to the function itself; in factorial3 it is the indirect call *%eax through the function pointer that was passed in.
OK, now let's turn up the optimization level a bit:
gcc -m32 -g -O2 -c -o factorial.o factorial.c
gcc -m32 factorial.o -o factorial
Here's the assembly code generated by the compiler for all three versions of the factorial:
08048410 <factorial>:
8048410: 55 push %ebp
8048411: b8 01 00 00 00 mov $0x1,%eax
8048416: 89 e5 mov %esp,%ebp
8048418: 8b 55 08 mov 0x8(%ebp),%edx
804841b: 85 d2 test %edx,%edx
804841d: 74 09 je 8048428 <factorial+0x18>
804841f: 90 nop
8048420: 0f af c2 imul %edx,%eax
8048423: 83 ea 01 sub $0x1,%edx
8048426: 75 f8 jne 8048420 <factorial+0x10>
8048428: 5d pop %ebp
8048429: c3 ret
804842a: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
08048430 <factorial2>:
8048430: 55 push %ebp
8048431: 89 e5 mov %esp,%ebp
8048433: 8b 55 08 mov 0x8(%ebp),%edx
8048436: 8b 45 0c mov 0xc(%ebp),%eax
8048439: 83 fa 01 cmp $0x1,%edx
804843c: 74 0d je 804844b <factorial2+0x1b>
804843e: 66 90 xchg %ax,%ax
8048440: 0f af c2 imul %edx,%eax
8048443: 83 ea 01 sub $0x1,%edx
8048446: 83 fa 01 cmp $0x1,%edx
8048449: 75 f5 jne 8048440 <factorial2+0x10>
804844b: 5d pop %ebp
804844c: c3 ret
804844d: 8d 76 00 lea 0x0(%esi),%esi
08048450 <factorial3>:
8048450: 55 push %ebp
8048451: 89 e5 mov %esp,%ebp
8048453: 8b 45 0c mov 0xc(%ebp),%eax
8048456: 8b 4d 08 mov 0x8(%ebp),%ecx
8048459: 8b 55 10 mov 0x10(%ebp),%edx
804845c: 83 f8 01 cmp $0x1,%eax
804845f: 74 0f je 8048470 <factorial3+0x20>
8048461: 0f af d0 imul %eax,%edx
8048464: 83 e8 01 sub $0x1,%eax
8048467: 89 45 0c mov %eax,0xc(%ebp)
804846a: 89 55 10 mov %edx,0x10(%ebp)
804846d: 5d pop %ebp
804846e: ff e1 jmp *%ecx
8048470: 5d pop %ebp
8048471: 89 d0 mov %edx,%eax
8048473: c3 ret
8048474: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
804847a: 8d bf 00 00 00 00 lea 0x0(%edi),%edi
At this stage, the recursion has been replaced by a loop in the first two factorial algorithms (factorial and factorial2 above).
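Reading the factorial2 listing above back into C, the optimized body is essentially the following loop (my reconstruction from the assembly, not actual compiler output, and the name is made up):
int factorial2_as_loop(int n, int accum) {
    while (n != 1) {       /* cmp $0x1,%edx ; jne */
        accum *= n;        /* imul %edx,%eax */
        n -= 1;            /* sub $0x1,%edx */
    }
    return accum;
}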
In factorial3 the recursion has been replaced by a jmp instruction. This is obviously an improvement on the call instruction, since it skips pushing a return address on the stack (this is gcc's sibling/tail call optimization, -foptimize-sibling-calls, enabled as part of -O2). Note though that, for all intents and purposes, an address of the function is already on the stack, since the function pointer is passed as an argument to the function itself.
It's instructive to see how all three implementations behave in the debugger. If you set a breakpoint at the entry point of either of the first two functions, the breakpoint is respected the first time the function is entered. Since the recursion has been replaced by a loop, that is also the only time the breakpoint is hit.
That's pretty annoying if you want to examine the state of the function body on subsequent iterations by breaking on the function invocation: the compiler has rewritten your function body and essentially eliminated that entry point. Sure, gdb allows you to debug at the assembly level, and with the assembly generated by objdump in hand you're certainly in a position to examine the function body nevertheless. But you have to work a level down (or is it up?) on the abstraction scale. The optimization of the third factorial function does not eliminate the function invocation as a possible breakpoint.
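For example, a gdb session against the -O2 build might go like this (illustrative only; the argument to run depends on how main reads its input):
gdb ./factorial
(gdb) break factorial2
(gdb) run 10
(gdb) continue
With -O2 the program simply runs to completion: the loop never re-enters factorial2, so the breakpoint fires exactly once. Do the same with break factorial3 and the breakpoint should fire on every 'iteration', because each step jumps back to the function's entry point.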
What's the upshot of all of this?
A C/C++ compiler should be able to optimize a simple recursive algorithm, like the factorial, by replacing the recursive call with a loop. The compiler rewrites the function body, in effect changing the algorithm. That's easy to verify by inspecting the assembly code. For functions as simple as the first two factorials this is probably not much of an issue. When the function body is more complex, there is going to be a disconnect between the code generated by the compiler and the original, which complicates reasoning about its correctness.
However, the extent of the rewrite can be controlled. The optimization of the CPS-like algorithm kept the original function body substantially in place, but did remove the build-up of the stack. The benefit is that the non-optimized and optimized functions are not substantially different, as can be verified by inspection in the debugger.