Virtual functions and performance

I posted a simple piece of code earlier to use as a performance example. It compared virtual function performance to inline function performance: one function allocated a string and another performed some simple integer operations. Before I post my own numbers I want to get some bare-metal (non-virtualized) results, but mine were similar to the following, posted to the JoS board:

String:
Policy (inherited, -O0, string temporary)        .52
Policy (member object, -O0, string temporary)    .48
Strategy (-O0, string)                           .48
Policy  (member object, -O2, string temp)        .41
Policy  (inherited, -O2)                         .41
Strategy  (-O2, string)                          .41

Int:
Strategy (-O0, int)                              .015
Policy (member, -O0 int)                         .012
Strategy (-O2, int)                              .008
Policy (inherited, -O0 int)                      .007
Policy (member, -O2, int)                        .001
Policy (inherited, -O2, int)                     .001

It should be noted that the author of the above numbers modified my original program to compare inheriting a policy against holding the policy as a member (which is what my code did). In the string example, the time spent allocating the string dwarfs any time spent setting up the call, to the point that the call overhead is no longer significant.
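
For readers who don't have the program in front of them, here is a rough sketch of the three shapes being compared. It is not the original benchmark: the names IntOp, StringOp, Strategy, HostWithMember, and HostInherits are made up for illustration, the work done per call is a stand-in, and the timing code is left out.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-ins for the two kinds of work measured above.
struct StringOp {
    // Builds a temporary string; the heap allocation dominates the cost.
    std::size_t operator()(int i) const {
        std::string temp(64, static_cast<char>('a' + i % 26));
        return temp.size();
    }
};

struct IntOp {
    // Cheap integer arithmetic, so call overhead is a visible fraction.
    std::size_t operator()(int i) const {
        return static_cast<std::size_t>(i) * 2 + 1;
    }
};

// Strategy: the operation is chosen at run time through a virtual call.
struct Strategy {
    virtual ~Strategy() = default;
    virtual std::size_t apply(int i) const = 0;
};

struct IntStrategy : Strategy {
    std::size_t apply(int i) const override { return IntOp{}(i); }
};

struct StringStrategy : Strategy {
    std::size_t apply(int i) const override { return StringOp{}(i); }
};

std::size_t run_strategy(const Strategy& strategy, int iterations) {
    std::size_t total = 0;
    for (int i = 0; i < iterations; ++i) total += strategy.apply(i);
    return total;
}

// Policy held as a member: the operation is fixed at compile time.
template <typename Policy>
class HostWithMember {
public:
    std::size_t run(int iterations) const {
        std::size_t total = 0;
        for (int i = 0; i < iterations; ++i) total += policy_(i);
        return total;
    }

private:
    Policy policy_;
};

// Policy inherited: same compile-time binding, but the host derives from it.
template <typename Policy>
class HostInherits : private Policy {
public:
    std::size_t run(int iterations) const {
        const Policy& policy = static_cast<const Policy&>(*this);
        std::size_t total = 0;
        for (int i = 0; i < iterations; ++i) total += policy(i);
        return total;
    }
};
```

All three variants compute the same totals for the same operation; what differs is whether the call is resolved through the v-table at run time or bound (and possibly inlined) at compile time.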

When I first learned C++ in *gasp* 1993, much of the discussion from the anti-C++ camp was that virtual functions were too slow. The C++ side countered that the slight decrease in performance was worth the benefits. I agreed with that argument back then, and I agree today. I'm sure they exist, but I've never seen an application limited by virtual function overhead.

What is interesting is that in recent years, after I thought the argument had long been settled, an anti-virtual-function sentiment sprang up from the C++ community itself, to the point where architectural paradigm shifts were justified because they removed virtual functions. But as Duncan Sharpe explains, removing virtual functions might have the opposite effect:

The template code essentially replicates - several times - the code for the loop and the begin and end timer. This won't in itself result in slower code, but do it often enough, and your program grows in size. Less of it fits into the processor's L1 cache, so you might end up spending more time waiting for the code you need to make its way to you from a slower cache memory. In a highly polymorphic situation, this could lead to significant performance loss, which could easily end up costing you more than the v-table lookup, in theory at least. The strategy code shouldn't, in theory, replicate quite so much, and may fit into the cache more easily.
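
To make the replication concrete, here is a small sketch of my own (not Duncan's code, and not the original benchmark). The names AddOne, Twice, Step, timed_loop, and timed_loop_virtual are hypothetical. The template version gives the compiler a separate function, timer and loop included, for every policy type it is used with; the virtual version is a single function no matter how many Step implementations exist. How much this costs in practice depends on the compiler, the optimization level, and the program.

```cpp
#include <chrono>
#include <cstddef>

struct Timed {
    std::size_t result;
    std::chrono::steady_clock::duration elapsed;
};

// Hypothetical policies; any type with a suitable call operator would do.
struct AddOne { std::size_t operator()(std::size_t x) const { return x + 1; } };
struct Twice  { std::size_t operator()(std::size_t x) const { return x * 2; } };

// Template version: each distinct Policy yields its own instantiation,
// so the timer and loop code exist once per policy type used.
template <typename Policy>
Timed timed_loop(std::size_t iterations) {
    Policy policy;
    const auto start = std::chrono::steady_clock::now();  // begin timer
    std::size_t value = 1;
    for (std::size_t i = 0; i < iterations; ++i) value = policy(value);
    const auto stop = std::chrono::steady_clock::now();   // end timer
    return Timed{value, stop - start};
}

// Virtual version: one function body shared by every Step implementation.
struct Step {
    virtual ~Step() = default;
    virtual std::size_t apply(std::size_t x) const = 0;
};

struct AddOneStep : Step {
    std::size_t apply(std::size_t x) const override { return x + 1; }
};

struct TwiceStep : Step {
    std::size_t apply(std::size_t x) const override { return x * 2; }
};

Timed timed_loop_virtual(const Step& step, std::size_t iterations) {
    const auto start = std::chrono::steady_clock::now();  // begin timer
    std::size_t value = 1;
    for (std::size_t i = 0; i < iterations; ++i) value = step.apply(value);
    const auto stop = std::chrono::steady_clock::now();   // end timer
    return Timed{value, stop - start};
}
```

Taking the addresses of timed_loop<AddOne> and timed_loop<Twice> gives two different function pointers, which is the language-level view of the duplication; whether the extra object code actually hurts cache behavior is something only measurement can tell you.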

I'm sure there are cases where virtual function overhead does matter (a tight calculation loop in a game, for instance), just as there are cases where inline assembly can make a difference, but those cases, I believe, are rare. Don't waste time or compromise a design solving problems that aren't really problems. Memory allocation is surprisingly slow considering its frequency in modern applications and libraries such as the STL. If the time spent executing a virtual function that allocates memory is too long, optimize the allocation before the virtual function. Every performance decision is a trade-off. While removing function indirection might seem like a free performance gain, increasing the object code size might have an equally detrimental effect.
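
As one illustration of optimizing the allocation first, here is a minimal sketch that keeps the virtual call but lets the caller own and reuse the buffer. The Source interface and the Dashes class are invented for this example, and the fixed length of 64 is arbitrary; treat it as one possible approach under those assumptions, not a recipe.

```cpp
#include <cstddef>
#include <string>

// Hypothetical interface offering both an allocating and a reusing call.
struct Source {
    virtual ~Source() = default;
    // Returns a fresh string, which may allocate on every call.
    virtual std::string make(std::size_t length) const = 0;
    // Writes into a buffer owned by the caller, so its storage can be reused.
    virtual void fill(std::size_t length, std::string& out) const = 0;
};

struct Dashes : Source {
    std::string make(std::size_t length) const override {
        return std::string(length, '-');
    }
    void fill(std::size_t length, std::string& out) const override {
        out.assign(length, '-');  // no new allocation if capacity suffices
    }
};

// Still one virtual call per iteration, but the buffer is reserved once
// up front instead of being rebuilt inside the loop.
std::size_t total_length_reusing(const Source& source, int iterations) {
    std::string buffer;
    buffer.reserve(64);
    std::size_t total = 0;
    for (int i = 0; i < iterations; ++i) {
        source.fill(64, buffer);
        total += buffer.size();
    }
    return total;
}
```

The indirection is untouched here; if the allocation really was the dominant cost, this kind of change should show up in the timings long before devirtualizing would.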