j/k: Data Mining in kdb+

I'm revisting the "billion query code" because many people have sent questions in.
Mining in Practice
In general, if you can brute force all the solutions to a given search space (or phase space as my physics friends and wife say) - thats what you want to do. If you can't, time for a heuristic.
What do I mean by space? Well lets consider an example:

K- Maximum Sum Subarray Problem

We have a table (arrayable) T


n:10
t:ungroup flip (`$/:.Q.a)!enlist each (26 0N#(26*n)?5)*(26 0N#(26*n)?-1 1)

Which looks like
q)show flip t
a| 0  0  4  -2 -3 3  0  -4 1  -1
b| 0  3  1  -4 -3 2  4  3  0  -3
c| 4  -3 0  -1 -2 0  -3 -4 1  0
...
z| -1 2  0  3  -2 2  1  -3 0  2

Now I ask you to find the maximum sum of the column z, using any combination of the other variables. E.g. let's sort by column a

We can see that if we use the range a>=0&a<=5 then we have


q)show flip `a xasc t
a| -4 -3 -2 -1 0  0  0  1  3  4
b| 3  -3 -4 -3 0  3  4  0  2  1
c| -4 -2 -1 0  4  -3 -3 1  0  0
...
z| -3 -2 3  2  -1 2  1  0  2  0

What if we add the condition b>=3? Then the intersection is...


q)show flip `a xasc t
a| -4 -3 -2 -1 0  0  0  1  3  4
b| 3  -3 -4 -3 0  3  4  0  2  1
c| -4 -2 -1 0  4  -3 -3 1  0  0
...
z| -3 -2 3  2  -1 2  1  0  2  0 //the -1 falls out

Ok now what's the best we can do? How hard is this problem? If we consider all 3 dimensional solutions (e.g. using 1 a, 1 b and 1 d) the problem has */26 26 26 5 5 5 solutions (about 2 million). But in real life we have lots more values (more than 5) and generally more variables.

Breaking down the problem:
A good place to start is to reduce the dimensionality by bucketing values. Consider placing value in m uniform buckets. The code below does this- and quickly.
Then you can do the search, this is ludicrously fast in q (arthur code of course).



n:500000
m:10
t:([]a:n?1.0;b:n?1.0;c:n?1.0;d:n?1.0)
f:`b`c`d

/running totals in 2 dims
s2:+\+\'(m;0N)#
/ 3d aggrs
f3:{s2'+\(m;0N)#@[(m*m*m)#0.0;m/:(x;(m-1)-y;z);+;t.z]}

\t u:m .q.xrank't f
\t r3:{u f3[x]\:/:u}'u

j/k

Tuesday, July 10, 2007

Data Mining in kdb+

No comments:

Blog Archive

About Me

others