Surprise! I did some work with one of my awesome professors about half a year ago, and I’m only now getting around to documenting it. It isn’t a particularly sexy project (unless you really love optimization problems, and really, who doesn’t love a good optimization problem?), which has something to do with the late documentation. Ah well! Better late than never, the saying goes…
So what did I do? I set out to explore some properties of an optimization problem called the Bottleneck Assignment Problem (BAP): given a bunch of tasks to do, a bunch of agents to do them, and how long each agent takes to complete each task, what assignment of tasks to agents gets all the tasks done in the shortest amount of time? More formally, assign tasks to agents such that the maximum cost of any single assignment is minimized. We assumed that the number of tasks and agents is the same, so we can represent the task-agent costs as a square matrix, with each task as a row and each agent as a column: the exact formulation does not matter.
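To make that concrete, here's a toy brute-force solver (illustration only; the cost values are made up): try every possible assignment and keep the one whose worst single cost is smallest.

```python
# Toy brute-force BAP solver. Rows are tasks, columns are agents;
# cost[i][j] is how long agent j takes on task i.
from itertools import permutations

def bap_bruteforce(cost):
    n = len(cost)
    best_value, best_assignment = float("inf"), None
    for perm in permutations(range(n)):   # task i goes to agent perm[i]
        bottleneck = max(cost[i][perm[i]] for i in range(n))
        if bottleneck < best_value:       # keep the assignment that finishes soonest
            best_value, best_assignment = bottleneck, perm
    return best_value, best_assignment

cost = [[4, 2, 8],
        [4, 3, 7],
        [3, 1, 6]]
value, assignment = bap_bruteforce(cost)  # everything is done after 6 time units
```

Obviously trying all N! assignments doesn't scale; it's only here to pin down what "minimize the maximum cost" means.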
Being able to solve BAPs is applicable to things like scheduling jobs on a CPU, or really any sort of massively parallel job distribution. We mention that because we are about to step outside the bounds of practicality.
Methods for solving BAPs are well known and run in O(M √(N log N)) according to Punnen and Zhang (N = the matrix dimension, M = the number of entries, so N², I think), which CS kids should recognize as not too terrible (note that my implementation probably didn't get it that low). So instead of determining how to solve BAPs, I was looking at properties of BAPs when given matrices populated with random elements: namely, the average case behavior of BAP solutions given matrices populated with different probability distributions. Now, CS kids should recognize that rigorous analysis of average case behavior is really difficult: fortunately, I didn't try to do that. Instead of attempting a rigorous analysis, I actually created and solved a bunch of BAPs.
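The experiment was basically Monte Carlo: draw a random matrix from some distribution, solve it, repeat, average. A minimal sketch of that loop (toy version: a brute-force solve stands in for the real solver, and uniform [0, 1) is just one example distribution):

```python
# Sketch of the Monte Carlo experiment. The real runs used an efficient
# solver on matrices up to 120x120; here a brute force stands in.
import random
from itertools import permutations

def bap_value(cost):
    # optimal bottleneck: smallest possible worst-case cost over all assignments
    n = len(cost)
    return min(max(cost[i][p[i]] for i in range(n))
               for p in permutations(range(n)))

def average_bottleneck(n, trials, draw=random.random):
    # draw `trials` random n-by-n cost matrices and average their solutions
    total = 0.0
    for _ in range(trials):
        cost = [[draw() for _ in range(n)] for _ in range(n)]
        total += bap_value(cost)
    return total / trials

random.seed(0)  # reproducible toy run
avg = average_bottleneck(n=4, trials=200)
```

Swap `draw` for a different distribution (exponential, normal, whatever) and you get the "different probability distributions" part of the experiment.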
Remember how we know methods for solving BAPs with reasonable runtimes? What I did was implement one in Python, with the numeric-intensive parts handled by NumPy (fast matrix operations) and Cython (compiling to C).
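For flavor, here's a sketch of the standard thresholding approach to BAPs (not necessarily the exact algorithm I implemented): binary-search over the distinct cost values, and test each candidate bottleneck t by asking whether a perfect task-agent matching exists using only entries with cost ≤ t.

```python
# Threshold approach: the optimal bottleneck is the smallest cost t such that
# a perfect matching exists in the bipartite graph of edges with cost <= t.

def feasible(cost, t):
    # Kuhn's augmenting-path bipartite matching, restricted to edges <= t
    n = len(cost)
    match = [-1] * n  # match[j] = task currently assigned to agent j

    def assign(i, seen):
        for j in range(n):
            if cost[i][j] <= t and j not in seen:
                seen.add(j)
                if match[j] == -1 or assign(match[j], seen):
                    match[j] = i
                    return True
        return False

    return all(assign(i, set()) for i in range(n))

def bap_threshold(cost):
    values = sorted({c for row in cost for c in row})
    lo, hi = 0, len(values) - 1
    while lo < hi:  # binary search for the smallest feasible threshold
        mid = (lo + hi) // 2
        if feasible(cost, values[mid]):
            hi = mid
        else:
            lo = mid + 1
    return values[lo]

cost = [[4, 2, 8],
        [4, 3, 7],
        [3, 1, 6]]
best = bap_threshold(cost)
```

The nice property is that you never need the actual cost values, only their order, which is part of why BAPs are cheaper than the sum-minimizing assignment problem.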
To get a good average case curve, I had to compute thousands of solutions for matrices up to 120×120, and just computing 100 solutions for 80×80 matrices would take a night. I suspect my run times were growing faster than the theoretical bound, but I coded it quickly (I thought I did it in a week, though maybe I'm thinking of spring break; I was whining on the blog about it all throughout March and April, so I don't know what's up with that), so I'm not too miffed about that: it just meant I had to throw more computing power at it.
Note, this was not my main summer research project: this was something I finished coding before the summer started, and I couldn't afford to throw more time at tweaking it (programmer time vs. computer time: there's a good essay about this, though I don't remember where it is. You should go find it; it's not the source, I think, but it's a good essay). Otherwise, I would have optimized it and gotten it running much faster.
So throwing more computing power at it means distributed computing, and since I was doing this at a tiny university without access to a computing cluster, I had to pretty much throw together my own. Since the task was highly parallel, all each slave computer had to do was solve the problem lots of times, reduce the results to an average solution, and send that back to central command (the master), so there's almost no bandwidth needed between the slaves and the master. I threw together a custom Ubuntu disc and booted up unused desktop computers that fetched jobs from a central server and sent back finished job data (actually, the reason I got this domain is so I could have a stable location to point the slaves at). This was at the beginning of summer, so there were no students and lots of unused computing potential just sitting around: at one point, I had 50 dual-core desktops crunching data, and some 100,000 solutions were computed (the exact number is lost to time: that teaches me to delete my databases early).
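Structurally, the slave loop was about this simple (a sketch: the job fields and the idea of passing fetch/submit in as callables are mine for illustration; the real thing spoke HTTP to the central server):

```python
def worker_loop(fetch_job, submit_result, solve):
    """Pull jobs until the master runs dry, solve repeatedly, ship back averages."""
    while True:
        job = fetch_job()              # in reality: an HTTP GET against the master
        if job is None:                # no work left; shut down
            break
        total = 0.0
        for _ in range(job["trials"]):
            total += solve(job["n"])   # solve one random BAP instance of size n
        # only the averaged solution crosses the wire, so bandwidth stays tiny
        submit_result({"job_id": job["id"], "avg": total / job["trials"]})

# toy run: one job in the queue, and a fake solver that always returns 2.0
jobs = [{"id": 1, "n": 4, "trials": 10}]
results = []
worker_loop(lambda: jobs.pop() if jobs else None, results.append, lambda n: 2.0)
```

Because each job is independent and only the average comes back, adding another slave is just a matter of booting another machine off the disc and pointing it at the server.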
But all things must end, and by the end of summer the network was dismantled. The data had been gathered, though: enough for fitting curves. And there my involvement ends: I don't have enough training to do analysis of optimization procedures (even with the Monte-Carlo-esque simulation data), and I have more than enough on my plate now that school has started back up again.
EDIT: HOLD IT HOLD IT I HAVE THE CODE!!!!!! [0.8MiB]
HERE IS THE CODE, CLIENT AND SERVER INCLUDED. NO ISO, THOUGH: GO BUILD YOUR OWN.
If you want the data, just ask. Remember that the server code is insecure as heck, and you shouldn’t run it without at least adding some authentication.
Again, I think I cleaned all the code, but if something made it through, PLEASE PLEASE PLEASE let me know. Thanks.
So, where is the code?
I managed to lose it.
You see, this was before I signed up for GitHub, and I was throwing my git repos everywhere. Now that I look for it half a year later, I can't find any code that actually works: you know, the code that worked all throughout the summer. I tossed the Ubuntu discs at the end of summer, so I can't recover it from there. The only code I do have, the server-side code for disbursing the jobs and gathering the data, was written in a day with CakePHP and is insecure as heck (it didn't even authenticate), so I'm not going to distribute that (no, it's not running anywhere anymore).
So you’re all kind of out of luck in that regard: no code, just a story and some data. If you want to look at the data, I have it: just email.
But… why is the code gone?