Language:
switch to room list switch to menu My folders
Go to page: First ... 98 99 100 101 [102] 103 104 105 106 ... Last
[#] Wed May 18 2011 23:18:44 EDT from IGnatius T Foobar @ Uncensored

[Reply] [ReplyQuoted] [Headers] [Print]

Lazy bastard didn't bother to implement Intel VT or AMD-V in his JavaScript.
I wanted to run a hypervisor inside it so I could boot up another copy of Linux, start up a browser in it, and then run his demo emulator inside that.

[#] Thu May 19 2011 10:05:39 EDT from fleeb @ Uncensored

[Reply] [ReplyQuoted] [Headers] [Print]

Ah, now *that* would be pretty nifty.  But he needs a network stack that can get to the internet, which he currently lacks.



[#] Fri May 27 2011 14:54:05 EDT from fleeb @ Uncensored

[Reply] [ReplyQuoted] [Headers] [Print]

I just did something that sort of underscored the amazing qualities of an optimizing compiler, and the power of inlining code.

I wanted a base64 encoder/decoder.  But, you know, I'm picky.  I want it to have this certain sort of interface, with such-and-so features, etc... and I figured I probably wouldn't actually find anything out there that does this for free.

But, I did find something that someone released into the public domain.  He wrote it primarily in C, and gave it a C++ wrapper.  The thing wasn't quite as flexible as I wanted, and it was slightly dangerous to use (potential for buffer overruns, and permitted using functions that should have been made private), so I rolled up my sleeves and did some hacking.

First, I wanted to only include one header file... no object files, no compiling of .c files, none of that... just include one header and have the whole thing right there for you.  I also wanted to be able to choose the alphabet for the encoder, while making commonly-used alphabets easy to use.  I wanted it to be fast, and I wanted to be able to either use std::i/ostreams or just plain old std::string objects.  And, of course, I wanted it to be safe.  Note that I didn't care about maintaining a C interface... just pure C++.

After making all of this happen and getting it to work correctly, I thought I'd check on the code's performance.  I had to resort to a very high resolution timer, since I wasn't working with a huge amount of data.  I was pleased with the results... it seemed very fast indeed.  But then I wanted to compare it to the original code.

I was stunned.  My code wasn't just faster... it blew the original code out of the water.  My original code, using just strings and no std::i/ostream objects, was so much faster that I implemented the std::i/ostream objects just so I could try to compare apples for apples (since I figure the stream objects require more processing power to work).  There was negligable difference... it was still blisteringly fast.  The original code encoded my sample data in 283162/2604228ths of a second.  My equivalent code only took 2866/2604228ths of a second (sorry for the weird numbers, but it was the most accurate clock I could find on this Windows OS).  Decoding had similar gains... theirs: 152928/2604228ths of a second, mine: 13331/2604228ths of a second.

The thing is, I'm using the same algorithm he is using, and I'm even using the same technique (a quirk of the language allows you to not close your blocks normally within a switch statement, which makes it kind of nice for working with simplistic state-machines).  I didn't improve on any algorithm for doing this in any way whatsoever.  Really, the only significant differences I can find is that I'm not compiling the encoder in its own object file, and the code I wrote isn't in C, but C++ (which is kind of a semantic difference, really, but maybe the compiler is different, I dunno).  Everything else is sorta syntactic sugar to make working with my objects easier and more flexible than his. In fact, theoretically, I traded performance for flexibility in the decoder, because I am using a hash table to look up values instead of a straight map (needed a way to specify my alphabet, which makes the decoder take a performance hit since I can't just make a static array).

So, either having it compiled inline makes it that much faster, or maybe the optimizer does a nicer job when you work with C++ than C, despite similar code.

Here's a document on that language quirk (which works equally well in C) if you're interested:

http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html

I should figure out how to package this so you guys can look at it yourself if you want.



[#] Fri May 27 2011 15:44:25 EDT from fleeb @ Uncensored

[Reply] [ReplyQuoted] [Headers] [Print]

Oh, that URL doesn't just mention a language quirk... it also points out something kind of cool from Knuth's Art of Computer Programming.  If your life seems to revolve around pushing computers around, you might want to look at it.  It's kind of interesting.



[#] Sun May 29 2011 16:35:48 EDT from IGnatius T Foobar @ Uncensored

[Reply] [ReplyQuoted] [Headers] [Print]

Where was the original performance penalty? Was it in buffering up a bunch of stuff to pass through what could have been a coroutine? After reading Simon Tatham's article I couldn't help but think that if he really wanted both routines to execute in tandem with each other, it would be a case for using multithreaded code rather than playing silly games with stacks and switches and loops and ponies and rainbows and purple leprechauns and...

[#] Mon May 30 2011 07:34:54 EDT from fleeb @ Uncensored

[Reply] [ReplyQuoted] [Headers] [Print]

I can't really speak to the original performance penalty in this article.

However, I strongly prefer single-threaded solutions to problems where I can find them.  If something goes wrong in a single-threaded routine, it's much easier to track down.  Multiple-threads multiplies complexity, to me.  I have seen multiple-threads lead to *more* time spent than less.  In fact, I wrote something to copy a pair of graphics from 2 PNG files to a single interlaced graphics buffer, and discovered my multithreaded approach took exponentially more time than a single-threaded approach I tried (possibly because I had over 100 of these copies to do, and I tried to do them all at the same time instead of perhaps 2 at a time... not sure).

Still, the emphasis here wasn't on performance as much as a form of legibility, I gathered.

I'm not sure there's a good multi-threaded way to implement a base64 encoder/decoder, though.  Maybe, if two threads sorta leap-frogged over each other, doing chunks of bytes at a time.  Hmm.



[#] Mon May 30 2011 09:22:54 EDT from dothebart @ Uncensored

Subject: Re:

[Reply] [ReplyQuoted] [Headers] [Print]

fleeb, I guess the header implementation doesn't create a realy static decoding table.

most probably if you create a simple stdin/out programm and compile it in linux, you can use callgrind/kcachegrind what the real difference of this is.

if you like to, please post screenshots of your kcachegrind displays

I guess also having more than one instances of your objects might make a difference, so if you delete/new your objects sometimes in another test programm...

i'd also use some email attachment of several MB to find out who's faster in the end.

you could compare it to some of the test programs in libcitadel, which i've profiled quiet a bit here too.



[#] Mon May 30 2011 10:44:10 EDT from saltine @ Uncensored

[Reply] [ReplyQuoted] [Headers] [Print]

Maybe your code is small enough that it remains inside the data and code buffers of the cpu, too?

[#] Mon May 30 2011 14:33:13 EDT from fleeb @ Uncensored

[Reply] [ReplyQuoted] [Headers] [Print]

Well, I tried something else that significantly improved the speed of the C code, although the C++ code is still faster.

The metrics I gave before was for a 'debug' compile.  I'm guessing the C code's debug compiler is not very optimized compared to the C++ debug compiler.

When compiling as release, the C code performed much, much better, to the point where it takes not quite half the time of the C++ code.  This test is more fair; most people would actually run in release rather than debug.

I did not do this test on a linux environment, so I don't have those tools available to me.  I am busy wrapping this up in a way that you guys can do what you will with it shortly (just need to put all the files together, add some semi-legal crap to the top of my header, etc).  I have no reason to believe this code couldn't compile in pretty much any environment, since I am not using anything very special... just plain-old C++ with the STL headers (although I do use #include <unordered_map>, which might not be available in all distributions of C++, so you might need to check on that).



[#] Mon May 30 2011 14:34:13 EDT from fleeb @ Uncensored

[Reply] [ReplyQuoted] [Headers] [Print]

Woops... I wrote 'not quite half the time of the C++ code' when I meant to say, 'not quite double the time of the C++ code'.  The C++ code is still faster, at least on my compiler.



[#] Mon May 30 2011 14:48:56 EDT from dothebart @ Uncensored

Subject: Re:

[Reply] [ReplyQuoted] [Headers] [Print]

well, there is andlinux around, vmware and such

(though its not as reliable for profiling as native code)

but if you don´t show us the code...



[#] Mon May 30 2011 15:34:22 EDT from fleeb @ Uncensored

[Reply] [ReplyQuoted] [Headers] [Print]

Okay, I have it ready now.

http://www.fleeb.com/base_64/base_xx.zip

If you're running VC++ 2010 (express or otherwise), you should be able to start the project file and run right away.

If you're running in a Linux environment, you'll need to create your own makefiles, etc.  Sorry.  But I'll be happy to include your makefiles and any code alterations to the base_xx.cpp file specific to POSIX (or whatever OS you're using) and update the file.

I want to implement base32 and base16 yet, but for looking at performance characteristics, if you're as curious as I am, give it a shot as-is.



[#] Mon May 30 2011 15:35:36 EDT from fleeb @ Uncensored

[Reply] [ReplyQuoted] [Headers] [Print]

Sorry, dothebart... I just needed to clean things up so it'd be easier to pull into another environment.  I had a lot of other crap that didn't need to be there (e.g. precompiled header, junk organized in a win32-specific way, etc).



[#] Mon May 30 2011 15:36:57 EDT from fleeb @ Uncensored

Subject: Re:

[Reply] [ReplyQuoted] [Headers] [Print]

 

Mon May 30 2011 10:44:10 EDT from saltine @ Uncensored
Maybe your code is small enough that it remains inside the data and code buffers of the cpu, too?

Maybe.  If so, that's still kind of cool.



[#] Mon May 30 2011 19:31:19 EDT from LoanShark @ Uncensored

[Reply] [ReplyQuoted] [Headers] [Print]

thought I'd check on the code's performance.  I had to resort to a
very high resolution timer, since I wasn't working with a huge amount
of data.  I was pleased with the results... it seemed very fast

when microbencharking... if the code is too small to measure in milliseconds, you may want to just run it through a loop about 10K times...

[#] Mon May 30 2011 19:40:19 EDT from LoanShark @ Uncensored

[Reply] [ReplyQuoted] [Headers] [Print]

http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html

That's great. The kind of appalling trick we all love and our bosses hate during code review. ;)

[#] Mon May 30 2011 19:52:24 EDT from LoanShark @ Uncensored

[Reply] [ReplyQuoted] [Headers] [Print]

buffer, and discovered my multithreaded approach took exponentially
more time than a single-threaded approach I tried (possibly because I

had over 100 of these copies to do, and I tried to do them all at the

same time instead of perhaps 2 at a time... not sure).

Indeed. The first thing you need to do, when performing CPU-intensive multithreaded work, is make sure you have 1 thread per CPU (or HyperThread virtual cpu. Context switching is a *big deal* for that sort of job, because not only do you have to save register state when switching, but you also have TLB flushes and increased inter-core cache coherence dependencies (communciation costs, generally) and a lot more stuff that won't fit into L1/L2 cache.

The next thing you need to do is find a good work-stealing framework, such as Java's ForkJoinPool (I'm sure there's an analogue in the C++ world, I just don't know what) and learn how to use it. This means that units of work have to be recursively subdivided into smaller chunks that can be executed in parallel, but not too small or else the overhead of scheduling the chunks will be greater than the benefits.


If you tried to do this for extremely trivial tasks such as "adding up a list of numbers" in a languagte like Java without a decent macro facility, you'd find that it'll actually slow your code down unless you write a fair bit of boilerplate, hand-inlined code. In order for to easily realize the benefits of parallelism, each individual unit of work needs to be fairly large, at least 100 primitive computational steps. "1+1" as in adding up a list of numbers is not enough work.

[#] Mon May 30 2011 20:15:17 EDT from LoanShark @ Uncensored

[Reply] [ReplyQuoted] [Headers] [Print]

such as Java's ForkJoinPool (I'm sure there's an analogue in the C++
world, I just don't know what) and learn how to use it. This means that


Perhaps unsurprisingly, apparently it's in boost: http://lists.boost.org/Archives/boost/2008/09/142801.php

[#] Mon May 30 2011 20:29:31 EDT from fleeb @ Uncensored

[Reply] [ReplyQuoted] [Headers] [Print]

Yeah... I did not look deeply into that particular problem, since I was in a hurry.  I just noticed when I went single-threaded for that task, performance increased significantly, so I went with it.  I doubt I really had over 100 threads at the same time (I'm crazy, but I'm not completely stupid), but I might have had something analogous to that happening accidentally.  Again, though, multi-threading is very hard.  I prefer to use only one thread and asynchronous I/O instead.

And, yeah, it doesn't surprise me that boost already has something like that.  Hell, I was using boost::thread in the first place.



[#] Tue May 31 2011 07:54:45 EDT from fleeb @ Uncensored

Subject: Re:

[Reply] [ReplyQuoted] [Headers] [Print]

 

Mon May 30 2011 19:31:19 EDT from LoanShark @ Uncensored
thought I'd check on the code's performance.  I had to resort to a
very high resolution timer, since I wasn't working with a huge amount
of data.  I was pleased with the results... it seemed very fast

when microbencharking... if the code is too small to measure in milliseconds, you may want to just run it through a loop about 10K times...

Duh, I should have thought to do that.  Still, this gives a relatively decent idea of how fast it works without so much churn.

Heh... 'microbench-arking'...



Go to page: First ... 98 99 100 101 [102] 103 104 105 106 ... Last