Friday, July 20, 2007

Fridays (again)

Yup. Worst day ever. Just further reinforcement that coming in on Fridays is bad for me, bad for business. All over by 10AM.
Damn.
j

Thursday, July 19, 2007

redundancy

I've always said (and I picked it up from somewhere) that for every 9 you want to add to your uptime, multiply your initial cost by 9.
So if you have a 10mm USD data center and you want to go from 99% to 99.9% uptime, you need a 90mm data center infrastructure. Why? Well, let's take a look:

You need backup generators and backup network lines, which means changing the data center architecture to support all that. Redundant gas lines for the generators, failover for power sources, etc. (it would have been cheaper to build it like that in the first place)- 8mm
You need double the hardware, and multiple components in all the hardware you have (dual NICs etc.)- 7mm
You need a DR data center in another location, with the same setup- 15mm
You need a new high speed line for the center links, and a new line for the DR site (different CLEC)- 2mm
You need to upgrade the SAN with real-time LUN-level replication, and buy one for the DR site- 5mm
You need clustering on everything (Oracle, Sybase, Windows, Unix, Linux...), and custom-coded apps need to be rewritten for active failover- 25mm (includes services)
You need to hire staff to deal with active/active failover and 24x7 operation- 4mm
You need load balancing and/or failover network operations for inbound and outbound connections (data feeds etc.)- 10mm (at least!)
Since you now have 2 different physical locations (hopefully not in the same state), you need new services contracts- 1mm

So that's only 77mm more, or roughly 8x the original cost. But I'm sure I could spend the extra 13mm on something I've forgotten.

The best DR system I've ever seen was at the Department of the Navy. Everything was a virtual machine. Live snapshots of the VMs were taken every hour or so. The snapshots were saved to an EMC SAN, which had real-time replication to 5 other locations. All locals replicated to all remotes. Every remote site had a small cluster of "failover machines". The network was designed by Cisco and everything could be automagically routed wherever. So, the entire data center in VA gets blown up (or whatever). The alarm fires, the VMs are started at the primary failover site (NC), they come online, the routers do their thing (the DoD has the benefit of its own network) and voilà- magic data center moved. Worst case loss: 1 hour. Failover time for the entire data center: 5-10 minutes (AND no application restart- they are hot snapshot loads). Beautiful.

So, having built a couple of data centers in my day, coded many an application for active failover, and deployed clustering on every version of Windows since NT4 and Red Hat- when a data center is down for, say, 7 hours, I think people should not only be fired, but any contractors should be sued. I'm not saying whose data center went down, but let's just say it was bad.
Oh, and kdb+ failover is trivial. Everything I do is in a pub/sub model, so aside from an extra machine in my data center, I push everything to my desktop. When the lights went out, I still knew my positions.
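
The subscriber end is only a few lines. A minimal sketch, assuming a kdb+tick-style tickerplant (the host, port and positions table are made up for illustration):

h:hopen `:tickerplant:5010        / handle to the (hypothetical) tickerplant
upd:{[t;x]t insert x}             / keep a live local copy of every update
h(".u.sub";`positions;`)          / subscribe to all syms of the positions table

Run the same thing on the spare machine and on the desktop, and every box carries the same state; losing any one of them loses nothing.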
J

Wednesday, July 18, 2007

BSC- sorry about that

Sorry, we lost all the money. Well- 90% of it, but the other 10% is ours.
In other news, we have some free flights to Florida for you- and all you need to do is attend a 2-hour presentation on the wonderful opportunity for real estate in the greater Miami area.

Wednesday, July 11, 2007

maths

I've tossed this around to a bunch of people, so I thought I'd toss it out here and maybe someone would know an answer:

Given 3 series, A B and C
Solve for the next value of each series subject to the constraints:

correlation of A&B>=.9
correlation of B&C>=.9
correlation of A&C>=.85

deviation of A is <=.051
deviation of B is <=.051
deviation of C is <=.051

There are multiple correct answers; I'm looking for as many as possible, as quickly as possible.

Here's some sample data

A B C
2.47 3.453 4.263
2.476 3.405 4.211
2.484 3.429 4.228
2.46 3.377 4.178
2.395 3.309 4.119
2.387 3.298 4.115
2.46 3.394 4.215
2.582 3.494 4.296
2.591 3.508 4.293
2.55 3.456 4.24
2.469 3.363 4.161
2.485 3.411 4.197
2.485 3.404 4.188
2.469 3.36 4.137
2.436 3.343 4.126
2.478 3.377 4.166
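
I don't have a closed form, but a brute-force sketch in q is easy (cor and dev are built-ins), assuming the constraints apply to the full series including the candidate point. The grid- step .001, ±.01 around the last observed values- is an arbitrary choice of mine:

A:2.47 2.476 2.484 2.46 2.395 2.387 2.46 2.582 2.591 2.55 2.469 2.485 2.485 2.469 2.436 2.478
B:3.453 3.405 3.429 3.377 3.309 3.298 3.394 3.494 3.508 3.456 3.363 3.411 3.404 3.36 3.343 3.377
C:4.263 4.211 4.228 4.178 4.119 4.115 4.215 4.296 4.293 4.24 4.161 4.197 4.188 4.137 4.126 4.166
ok:{[a;b;c](((A,a)cor B,b)>=.9)&(((B,b)cor C,c)>=.9)&(((A,a)cor C,c)>=.85)&((dev A,a)<=.051)&((dev B,b)<=.051)&(dev C,c)<=.051}
g:{x+.001*-10+til 21}                  / 21 candidates, step .001, centered on x
cands:(g last A)cross(g last B)cross g last C
ans:cands where ok ./: cands           / every triple passing all six constraints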

Amending to a Matrix

Jamie did an excellent job of explaining this, and I thought I'd share it (cleaning out old emails).

A way of amending multiple points in a square matrix in one go.
Take the following matrix:

q)4 4#0
(0 0 0 0;0 0 0 0;0 0 0 0;0 0 0 0)

Let's say we want to add 1 to the diagonal (make it the identity matrix). We could get ourselves a list of coordinates (0 0;1 1;2 2;3 3) and do it one at a time, using over to pass the previous result forward each time:

q).[;;+;1]/[4 4#0;(0 0;1 1;2 2;3 3)]
(1 0 0 0;0 1 0 0;0 0 1 0;0 0 0 1)

This would get pretty slow if the list of coordinates is large, since these are essentially scalar operations. So we can do as Arthur has done: flatten the matrix and use a bit of code to map each 2-dimensional coordinate to its one-dimensional equivalent:

q)(4*4)#0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
q)4 sv flip (0 0;1 1;2 2;3 3)
0 5 10 15
q)@[(4*4)#0;4 sv flip (0 0;1 1;2 2;3 3);+;1]
1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1

The cut command gives us back the matrix from the flat vector:

q)4 cut @[(4*4)#0;4 sv flip (0 0;1 1;2 2;3 3);+;1]
(1 0 0 0;0 1 0 0;0 0 1 0;0 0 0 1)
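
The difference shows up as the matrix grows: the amend-with-over version does one scalar amend per coordinate, while the flattened version is a single vectorized amend. A quick way to try it yourself (the sizes here are arbitrary):

q)n:1000
q)c:flip (til n;til n)                / n points on the diagonal
q)\t .[;;+;1]/[(n;n)#0;c]
q)\t n cut @[(n*n)#0;n sv flip c;+;1]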

Tuesday, July 10, 2007

Data Mining in kdb+

I'm revisiting the "billion query code" because many people have sent questions in.
Mining in Practice
In general, if you can brute force all the solutions in a given search space (or phase space, as my physics friends and wife say), that's what you want to do. If you can't, it's time for a heuristic.
What do I mean by space? Well, let's consider an example:


K- Maximum Sum Subarray Problem


We have a table (arrayable) t


n:10
t:ungroup flip (`$/:.Q.a)!enlist each (26 0N#(26*n)?5)*(26 0N#(26*n)?-1 1)

Which looks like
q)show flip t
a| 0 0 4 -2 -3 3 0 -4 1 -1
b| 0 3 1 -4 -3 2 4 3 0 -3
c| 4 -3 0 -1 -2 0 -3 -4 1 0
...
z| -1 2 0 3 -2 2 1 -3 0 2


Now I ask you to find the maximum sum of column z, using ranges over any combination of the other variables. E.g., let's sort by column a.

We can see that if we use the range a>=0&a<=5 then we have


q)show flip `a xasc t
a| -4 -3 -2 -1 0 0 0 1 3 4
b| 3 -3 -4 -3 0 3 4 0 2 1
c| -4 -2 -1 0 4 -3 -3 1 0 0
...
z| -3 -2 3 2 -1 2 1 0 2 0


What if we add the condition b>=3? Then the intersection is...


q)show flip `a xasc t
a| -4 -3 -2 -1 0 0 0 1 3 4
b| 3 -3 -4 -3 0 3 4 0 2 1
c| -4 -2 -1 0 4 -3 -3 1 0 0
...
z| -3 -2 3 2 -1 2 1 0 2 0 //the -1 falls out
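
For what it's worth, you never do these intersections by eye; in q the whole thing is one where clause:

q)exec sum z from t where a within 0 5,b>=3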


OK, now what's the best we can do? How hard is this problem? If we consider all 3-dimensional solutions (e.g. using 1 a, 1 b and 1 d), the problem has */26 26 26 5 5 5 solutions (about 2.2 million). But in real life we have lots more values (more than 5) and generally more variables.

Breaking down the problem:
A good place to start is to reduce the dimensionality by bucketing the values. Consider placing the values in m uniform buckets; the code below does this, and quickly. Then you can do the search, which is ludicrously fast in q (Arthur code, of course).
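
The bucketing is done with xrank, which deals values into m equal-count buckets by rank:

q)3 xrank til 6
0 0 1 1 2 2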



n:500000
m:10
t:([]a:n?1.0;b:n?1.0;c:n?1.0;d:n?1.0)
f:`b`c`d                           / the dimensions to bucket; a is the target

/ running totals in 2 dims: prefix sums along each row, then down the rows
s2:+\+\'(m;0N)#
/ 3-d aggregates: accumulate the target column a into a flat m*m*m histogram
/ (sv flattens the 3-d bucket coordinates, just like the matrix post above),
/ then cumulate so any box-sum query costs a handful of lookups
f3:{s2'+\(m;0N)#@[(m*m*m)#0.0;m sv (x;(m-1)-y;z);+;t.a]}

\t u:m .q.xrank't f                / bucket each of b,c,d into m rank buckets
\t r3:{u f3[x]\:/:u}'u             / cumulative structures over the bucket vectors

Friday, July 6, 2007

Stay home Fridays

So-
yesterday and the day before the holiday were great days for me. I trade a pattern recognition system (which is just jargon for "matching patterns built from data mining against streaming data"). Most of the time the system trades as normal- never that big a position- and life is fine. I win, I lose, but overall I'm up, with a very small drawdown. But the market has been wicked recently- like a skittish, rabid, cocaine-addicted rabbit.

So today I walk in knowing it's going to be a bad day. Why? Well, first of all, it seems every time I trade on a Friday (my day off) I lose- as if to reinforce the fact that I should have stayed home. But also, after 2 big winning days, I was in for a loss.
So the day goes like this:

5 AM- wake up/news
5:30 Out the door
5:45 At the office
6:00 Turn on trading system (start q process), check last night's data, get coffee
wait
8:00 (or about) start trading
8:30 lose $7500 on one trade- ok
8:45 lose another 3k- still ok
9:00 down another 10k mark to market- that sucks
9:30 risk management and other stuff begins triggering (q is awesome)
10:00 system fights back from down 20k to down only 11k
11:00 flat (down only 1k)
12:00 market grinds down, no trades. At this point I contemplate going home.
12:00:00.001 get long 10 mm
12:10 get long another 10 mm (why not)
12:30 %*^*&*@!!!
13:30 more %*^*&*@!!!
14:00 give up, risk management rolls out of positions, end the day down 30k.
Fuck.