Mining in Practice
In general, if you can brute force all the solutions to a given search space (or phase space as my physics friends and wife say) - thats what you want to do. If you can't, time for a heuristic.
What do I mean by space? Well lets consider an example:
K- Maximum Sum Subarray Problem
We have a table (arrayable) T
n:10
t:ungroup flip (`$/:.Q.a)!enlist each (26 0N#(26*n)?5)*(26 0N#(26*n)?-1 1)
Which looks like
q)show flip t
a| 0 0 4 -2 -3 3 0 -4 1 -1
b| 0 3 1 -4 -3 2 4 3 0 -3
c| 4 -3 0 -1 -2 0 -3 -4 1 0
...
z| -1 2 0 3 -2 2 1 -3 0 2
Now I ask you to find the maximum sum of the column z, using any combination of the other variables. E.g. let's sort by column a
We can see that if we use the range a>=0&a<=5 then we have
q)show flip `a xasc t
a| -4 -3 -2 -1 0 0 0 1 3 4
b| 3 -3 -4 -3 0 3 4 0 2 1
c| -4 -2 -1 0 4 -3 -3 1 0 0
...
z| -3 -2 3 2 -1 2 1 0 2 0
What if we add the condition b>=3? Then the intersection is...
q)show flip `a xasc t
a| -4 -3 -2 -1 0 0 0 1 3 4
b| 3 -3 -4 -3 0 3 4 0 2 1
c| -4 -2 -1 0 4 -3 -3 1 0 0
...
z| -3 -2 3 2 -1 2 1 0 2 0 //the -1 falls out
Ok now what's the best we can do? How hard is this problem? If we consider all 3 dimensional solutions (e.g. using 1 a, 1 b and 1 d) the problem has */26 26 26 5 5 5 solutions (about 2 million). But in real life we have lots more values (more than 5) and generally more variables.
Breaking down the problem:
A good place to start is to reduce the dimensionality by bucketing values. Consider placing value in m uniform buckets. The code below does this- and quickly.
Then you can do the search, this is ludicrously fast in q (arthur code of course).
n:500000
m:10
t:([]a:n?1.0;b:n?1.0;c:n?1.0;d:n?1.0)
f:`b`c`d
/running totals in 2 dims
s2:+\+\'(m;0N)#
/ 3d aggrs
f3:{s2'+\(m;0N)#@[(m*m*m)#0.0;m/:(x;(m-1)-y;z);+;t.z]}
\t u:m .q.xrank't f
\t r3:{u f3[x]\:/:u}'u
No comments:
Post a Comment