Tuesday, July 10, 2007

Data Mining in kdb+

I'm revisting the "billion query code" because many people have sent questions in.
Mining in Practice
In general, if you can brute force all the solutions to a given search space (or phase space as my physics friends and wife say) - thats what you want to do. If you can't, time for a heuristic.
What do I mean by space? Well lets consider an example:


K- Maximum Sum Subarray Problem


We have a table (arrayable) T


n:10
t:ungroup flip (`$/:.Q.a)!enlist each (26 0N#(26*n)?5)*(26 0N#(26*n)?-1 1)

Which looks like
q)show flip t
a| 0 0 4 -2 -3 3 0 -4 1 -1
b| 0 3 1 -4 -3 2 4 3 0 -3
c| 4 -3 0 -1 -2 0 -3 -4 1 0
...
z| -1 2 0 3 -2 2 1 -3 0 2


Now I ask you to find the maximum sum of the column z, using any combination of the other variables. E.g. let's sort by column a

We can see that if we use the range a>=0&a<=5 then we have


q)show flip `a xasc t
a| -4 -3 -2 -1 0 0 0 1 3 4
b| 3 -3 -4 -3 0 3 4 0 2 1
c| -4 -2 -1 0 4 -3 -3 1 0 0
...
z| -3 -2 3 2 -1 2 1 0 2 0


What if we add the condition b>=3? Then the intersection is...


q)show flip `a xasc t
a| -4 -3 -2 -1 0 0 0 1 3 4
b| 3 -3 -4 -3 0 3 4 0 2 1
c| -4 -2 -1 0 4 -3 -3 1 0 0
...
z| -3 -2 3 2 -1 2 1 0 2 0 //the -1 falls out


Ok now what's the best we can do? How hard is this problem? If we consider all 3 dimensional solutions (e.g. using 1 a, 1 b and 1 d) the problem has */26 26 26 5 5 5 solutions (about 2 million). But in real life we have lots more values (more than 5) and generally more variables.

Breaking down the problem:
A good place to start is to reduce the dimensionality by bucketing values. Consider placing value in m uniform buckets. The code below does this- and quickly.
Then you can do the search, this is ludicrously fast in q (arthur code of course).



n:500000
m:10
t:([]a:n?1.0;b:n?1.0;c:n?1.0;d:n?1.0)
f:`b`c`d

/running totals in 2 dims
s2:+\+\'(m;0N)#
/ 3d aggrs
f3:{s2'+\(m;0N)#@[(m*m*m)#0.0;m/:(x;(m-1)-y;z);+;t.z]}

\t u:m .q.xrank't f
\t r3:{u f3[x]\:/:u}'u


No comments: