now this is damn cool:
(next to 'too cool to be true')
That's impressive. Even with all the limitations, and the lack of a network stack (I could see where that would be difficult), that's an awesome page there.
I could imagine using it to help teach people how to write C code, honestly.
But he already had experience in the field, so it's not as if he were one of us just starting out on it.
I wanted to run a hypervisor inside it so I could boot up another copy of Linux, start up a browser in it, and then run his demo emulator inside that.
Ah, now *that* would be pretty nifty. But he needs a network stack that can get to the internet, which he currently lacks.
I just did something that sort of underscored the amazing qualities of an optimizing compiler, and the power of inlining code.
I wanted a base64 encoder/decoder. But, you know, I'm picky. I want it to have this certain sort of interface, with such-and-so features, etc... and I figured I probably wouldn't actually find anything out there that does this for free.
But, I did find something that someone released into the public domain. He wrote it primarily in C, and gave it a C++ wrapper. The thing wasn't quite as flexible as I wanted, and it was slightly dangerous to use (potential for buffer overruns, and permitted using functions that should have been made private), so I rolled up my sleeves and did some hacking.
First, I wanted to only include one header file... no object files, no compiling of .c files, none of that... just include one header and have the whole thing right there for you. I also wanted to be able to choose the alphabet for the encoder, while making commonly-used alphabets easy to use. I wanted it to be fast, and I wanted to be able to either use std::i/ostreams or just plain old std::string objects. And, of course, I wanted it to be safe. Note that I didn't care about maintaining a C interface... just pure C++.
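I don't have fleeb's actual header in front of me, so purely as an illustration of the kind of interface described (the names here are hypothetical, with the standard alphabet as the default and the alphabet selectable at construction), a single-header sketch might look like:

```cpp
// Hypothetical single-header base64 interface; names and layout are
// illustrative only, not the actual API discussed in this thread.
#include <cstddef>
#include <string>

class base64 {
public:
    // Default to the standard RFC 4648 alphabet; callers may supply
    // their own 64-character alphabet instead.
    explicit base64(const std::string& alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/")
        : alphabet_(alphabet) {}

    std::string encode(const std::string& data) const {
        std::string out;
        out.reserve(((data.size() + 2) / 3) * 4);
        std::size_t i = 0;
        // Encode each complete 3-byte group as four alphabet characters.
        while (i + 3 <= data.size()) {
            unsigned v = (static_cast<unsigned char>(data[i]) << 16)
                       | (static_cast<unsigned char>(data[i + 1]) << 8)
                       |  static_cast<unsigned char>(data[i + 2]);
            out += alphabet_[(v >> 18) & 63];
            out += alphabet_[(v >> 12) & 63];
            out += alphabet_[(v >> 6) & 63];
            out += alphabet_[v & 63];
            i += 3;
        }
        // Handle the 1- or 2-byte tail with '=' padding.
        std::size_t rem = data.size() - i;
        if (rem == 1) {
            unsigned v = static_cast<unsigned char>(data[i]) << 16;
            out += alphabet_[(v >> 18) & 63];
            out += alphabet_[(v >> 12) & 63];
            out += "==";
        } else if (rem == 2) {
            unsigned v = (static_cast<unsigned char>(data[i]) << 16)
                       | (static_cast<unsigned char>(data[i + 1]) << 8);
            out += alphabet_[(v >> 18) & 63];
            out += alphabet_[(v >> 12) & 63];
            out += alphabet_[(v >> 6) & 63];
            out += '=';
        }
        return out;
    }

private:
    std::string alphabet_;
};
```

The interface described above also takes std::istream/std::ostream; this sketch shows only the std::string path.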
After making all of this happen and getting it to work correctly, I thought I'd check on the code's performance. I had to resort to a very high resolution timer, since I wasn't working with a huge amount of data. I was pleased with the results... it seemed very fast indeed. But then I wanted to compare it to the original code.
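For anyone reproducing this: the X/2604228ths figures below look like raw Windows performance-counter ticks. A portable stand-in for that kind of high-resolution timing (my sketch, not the original test harness) is just a small wrapper around std::chrono:

```cpp
#include <chrono>

// Time a callable with a monotonic high-resolution clock and
// return the elapsed time in nanoseconds.
template <typename F>
long long time_ns(F&& f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
        stop - start).count();
}
```

Usage would be along the lines of `long long ns = time_ns([&]{ encoder.encode(sample); });`.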
I was stunned. My code wasn't just faster... it blew the original code out of the water. My code, using just strings and no std::i/ostream objects, was so much faster that I implemented the std::i/ostream support just so I could try to compare apples to apples (since I figure the stream objects require more processing power to work). There was negligible difference... it was still blisteringly fast. The original code encoded my sample data in 283162/2604228ths of a second. My equivalent code only took 2866/2604228ths of a second (sorry for the weird numbers, but it was the most accurate clock I could find on this Windows OS). Decoding had similar gains... theirs: 152928/2604228ths of a second, mine: 13331/2604228ths of a second.
The thing is, I'm using the same algorithm he is using, and I'm even using the same technique (a quirk of the language allows you to not close your blocks normally within a switch statement, which makes it kind of nice for working with simplistic state machines). I didn't improve the algorithm in any way whatsoever. Really, the only significant differences I can find are that I'm not compiling the encoder in its own object file, and the code I wrote isn't in C but C++ (which is kind of a semantic difference, really, but maybe the compiler treats it differently, I dunno). Everything else is sorta syntactic sugar to make working with my objects easier and more flexible than his. In fact, theoretically, I traded performance for flexibility in the decoder, because I am using a hash table to look up values instead of a straight map (I needed a way to specify my alphabet, which makes the decoder take a performance hit since I can't just use a static array).
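For comparison, the usual array-based alternative to a hash table (sketched here as a general technique, not as what either implementation in this thread actually does) still lets you pick the alphabet at runtime: build a 256-entry table from the chosen alphabet once, up front, so each decode lookup is a plain array index.

```cpp
#include <array>
#include <string>

// Build a 256-entry decode table from an arbitrary 64-character
// alphabet once, so each lookup is a direct array index rather than
// a hash-table probe. -1 marks bytes that are not in the alphabet.
std::array<signed char, 256> make_decode_table(const std::string& alphabet) {
    std::array<signed char, 256> table;
    table.fill(-1);
    for (int i = 0; i < 64 && i < static_cast<int>(alphabet.size()); ++i)
        table[static_cast<unsigned char>(alphabet[i])] =
            static_cast<signed char>(i);
    return table;
}
```

The table costs 256 bytes per alphabet, which is usually a fine trade for O(1) lookups with no hashing.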
So, either having it compiled inline makes it that much faster, or maybe the optimizer does a nicer job when you work with C++ than C, despite similar code.
Here's a document on that language quirk (which works equally well in C) if you're interested:
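In case that document is hard to find, the quirk in miniature is the classic Duff's device: case labels can land inside an unclosed do-while block, because labels inside a switch don't have to nest conventionally. This is a minimal sketch of the quirk itself, not fleeb's encoder:

```cpp
#include <cstddef>

// Classic Duff's-device byte copy: the switch jumps into the middle
// of an unrolled do-while loop to handle the remainder bytes first,
// then the loop copies 8 bytes per iteration.
void copy_bytes(char* to, const char* from, std::size_t count) {
    if (count == 0) return;
    std::size_t n = (count + 7) / 8;  // number of loop passes
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```

The same label-placement rule is what makes switch statements handy for resumable state machines like a base64 decoder.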
I should figure out how to package this so you guys can look at it yourself if you want.
Oh, that URL doesn't just mention a language quirk... it also points out something kind of cool from Knuth's Art of Computer Programming. If your life seems to revolve around pushing computers around, you might want to look at it. It's kind of interesting.
I can't really speak to the original performance penalty in this article.
However, I strongly prefer single-threaded solutions to problems where I can find them. If something goes wrong in a single-threaded routine, it's much easier to track down. Multiple threads multiply complexity, to me. I have seen multiple threads lead to *more* time spent, not less. In fact, I wrote something to copy a pair of graphics from 2 PNG files to a single interlaced graphics buffer, and discovered my multithreaded approach took dramatically more time than a single-threaded approach I tried (possibly because I had over 100 of these copies to do, and I tried to do them all at the same time instead of perhaps 2 at a time... not sure).
Still, the emphasis here wasn't on performance as much as a form of legibility, I gathered.
I'm not sure there's a good multi-threaded way to implement a base64 encoder/decoder, though. Maybe, if two threads sorta leap-frogged over each other, doing chunks of bytes at a time. Hmm.
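One way that leap-frogging could work, purely as a sketch: since base64 maps each 3-byte input group independently, you can split the input at any 3-byte boundary, encode the halves concurrently, and concatenate the results. Here `encode_fn` is a hypothetical single-threaded encoder passed in by the caller:

```cpp
#include <string>
#include <thread>

// Split the input at a 3-byte-aligned midpoint, encode both halves
// concurrently (one on a worker thread, one on this thread), and
// concatenate. Correct because base64 encodes 3-byte groups
// independently of one another.
std::string encode_parallel(const std::string& data,
                            std::string (*encode_fn)(const std::string&)) {
    std::size_t mid = (data.size() / 2 / 3) * 3;  // 3-byte-aligned split
    std::string left, right;
    std::thread t([&] { left = encode_fn(data.substr(0, mid)); });
    right = encode_fn(data.substr(mid));
    t.join();
    return left + right;
}
```

Whether this ever wins depends on input size; for small buffers the thread launch would likely swamp any gain, which matches the PNG-copying experience above.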
fleeb, I guess the header implementation doesn't create a really static decoding table.
Most probably, if you create a simple stdin/stdout program and compile it on Linux, you can use callgrind/kcachegrind to see what the real difference is.
If you like, please post screenshots of your kcachegrind displays.
I guess having more than one instance of your objects might also make a difference, so try deleting/newing your objects a few times in another test program...
I'd also use an email attachment of several MB to find out who's faster in the end.
You could compare it to some of the test programs in libcitadel, which I've profiled quite a bit here too.
Well, I tried something else that significantly improved the speed of the C code, although the C++ code is still faster.
The metrics I gave before were for a 'debug' compile. I'm guessing the debug build of the C code is not as well optimized as the debug build of the C++ code.
When compiling as release, the C code performed much, much better, to the point where it takes not quite half the time of the C++ code. This test is more fair; most people would actually run in release rather than debug.
I did not do this test in a Linux environment, so I don't have those tools available to me. I am busy wrapping this up in a way that you guys can do what you will with it shortly (just need to put all the files together, add some semi-legal crap to the top of my header, etc). I have no reason to believe this code couldn't compile in pretty much any environment, since I am not using anything very special... just plain old C++ with the standard library headers (although I do #include <unordered_map>, which might not be available in every implementation, so you might need to check on that).
Whoops... I wrote 'not quite half the time of the C++ code' when I meant to say 'not quite double the time of the C++ code'. The C++ code is still faster, at least on my compiler.
well, there are andLinux, VMware, and such around
(though they're not as reliable for profiling as native code)
but if you don't show us the code...
Okay, I have it ready now.
If you're running VC++ 2010 (express or otherwise), you should be able to open the project file and run right away.
If you're running in a Linux environment, you'll need to create your own makefiles, etc. Sorry. But I'll be happy to include your makefiles and any code alterations to the base_xx.cpp file specific to POSIX (or whatever OS you're using) and update the file.
I still want to implement base32 and base16, but if you're as curious about the performance characteristics as I am, give it a shot as-is.
Sorry, dothebart... I just needed to clean things up so it'd be easier to pull into another environment. I had a lot of other crap that didn't need to be there (e.g. precompiled header, junk organized in a win32-specific way, etc).
Mon May 30 2011 10:44:10 EDT from saltine @ Uncensored
Maybe your code is small enough that it remains inside the data and code caches of the CPU, too?
Maybe. If so, that's still kind of cool.
> thought I'd check on the code's performance. I had to resort to a very high resolution timer, since I wasn't working with a huge amount of data. I was pleased with the results... it seemed very fast
when microbenchmarking... if the code is too fast to measure in milliseconds, you may want to just run it through a loop about 10K times...
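A minimal loop-amplification harness along those lines (illustrative only) just runs the operation many times and divides:

```cpp
#include <chrono>

// Amplify an operation that is too fast to measure on its own by
// running it many times and averaging the total elapsed time.
template <typename F>
double avg_ns(F&& f, int iterations = 10000) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        f();  // the operation under test; keep its result observable
              // (e.g. accumulate into a volatile) so it isn't optimized away
    auto stop = std::chrono::steady_clock::now();
    auto total = std::chrono::duration_cast<std::chrono::nanoseconds>(
        stop - start).count();
    return static_cast<double>(total) / iterations;
}
```

In a release build especially, make sure the encoder's output is actually used each iteration, or the optimizer may delete the loop body entirely.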