669 days ago by spherical_cow

u_U = var('u') #number of unique species divided by max possible num of species
t = var('t') #total branch length of tree
k = var('k') #determines how much weight 't' is given

S(u_U,t,k) = (u_U) - exp(-k/t) #define score for subtrees, k is the parameter

#the amount of influence total branch length has over the subtree rankings is determined by dS/dt
#the higher the rate of change of the score with respect to 't', the more variation (with respect to total branch length) will be
#observed between subtree scores
dS_dt = abs(derivative(S,t))

#example
u_U_ = 1
t_ = .5
plot(dS_dt(u_U_,t_,k), (k,0,t_ + 5))
#can see that in the above example, dS_dt is maximized when k = t
#is this true in general?
dS_dt_dk = derivative(dS_dt,k)
#find minima/maxima
solve(dS_dt_dk == 0, k)
[k == t, k == 0]
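A quick numeric cross-check of the k == t root, as a minimal sketch in plain Python rather than Sage. It assumes k >= 0 and t > 0, in which case |dS/dt| = (k/t^2)*exp(-k/t) and its derivative with respect to k is (t - k)*exp(-k/t)/t^3. The helper names `influence`, `influence_slope` and `argmax_k` are made up for this illustration and are not part of the worksheet.

```python
import math

def influence(k, t):
    """|dS/dt| for S = u - exp(-k/t), assuming k >= 0 and t > 0."""
    return (k / t**2) * math.exp(-k / t)

def influence_slope(k, t):
    """d|dS/dt|/dk = (t - k) * exp(-k/t) / t^3, under the same assumptions."""
    return (t - k) * math.exp(-k / t) / t**3

def argmax_k(t, k_max, steps=10000):
    """Grid search over [0, k_max] for the k that maximizes influence(k, t)."""
    grid = [k_max * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda k: influence(k, t))

# same plotting range as the example above: k in [0, t + 5]
for t in (0.5, 1.0, 2.0):
    best = argmax_k(t, k_max=t + 5)
    print("t =", t, " best k on grid =", round(best, 3),
          " slope at k = t:", influence_slope(t, t))
```

The slope is positive for k < t, zero at k = t and negative for k > t, so on the grid the best k should land within one grid step of t.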
#so the total branch length will have the most effect on the subtree rankings when k = t (at k = 0, dS_dt = 0, its smallest possible value).
#when k < t, dS_dt increases as k increases towards t; when k > t, dS_dt decreases as k increases.
#so the behavior is not consistent: increasing k can have different effects depending on the data.
#would rather the effect of increasing k be consistent?
#could we restrict the range of k so that increasing k always has the same effect, i.e. it increases the influence of t on the subtree ranking?
#[0,max(t)] is not enough for that: any subtree with t < k is already past its peak. the effect is the same for every subtree only when k is in [0,min(t)].
#also t varies for each subtree. so if a user wants to maximize the influence of t, how should they set the constant, k?
#k = avg(tbl)? the mean minimizes sum((k-t)^2); it is the median of tbl that minimizes sum(|k-t|).
#either choice keeps k close to a typical t, which should increase the impact t has on the score ranking, but this is a heuristic, not a proven maximum.
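To make the choice of k more concrete, here is a rough sketch in plain Python that compares a few candidate values of k. The list `tbl` holds made-up total branch lengths (not real data), and summing |dS/dt| over subtrees is only one assumed way to measure the overall influence of t. The helper names are hypothetical.

```python
import math
from statistics import mean, median

def influence(k, t):
    """|dS/dt| for S = u - exp(-k/t), assuming k >= 0 and t > 0."""
    return (k / t**2) * math.exp(-k / t)

# hypothetical total branch lengths, one per subtree (made-up values)
tbl = [0.2, 0.5, 0.6, 0.9, 4.0]

def abs_dev(k):
    """sum(|k - t|) over all subtrees."""
    return sum(abs(k - t) for t in tbl)

def sq_dev(k):
    """sum((k - t)^2) over all subtrees."""
    return sum((k - t) ** 2 for t in tbl)

def total_influence(k):
    """Assumed aggregate measure: sum of |dS/dt| over all subtrees."""
    return sum(influence(k, t) for t in tbl)

def still_rising(k):
    """Number of subtrees with k < t, where raising k still raises |dS/dt|."""
    return sum(1 for t in tbl if k < t)

candidates = {
    "min": min(tbl),
    "median": median(tbl),
    "mean": mean(tbl),
    "max": max(tbl),
}

for name, k in candidates.items():
    print(name, "k =", round(k, 3),
          "| sum|k-t| =", round(abs_dev(k), 3),
          "| sum(k-t)^2 =", round(sq_dev(k), 3),
          "| total influence =", round(total_influence(k), 3),
          "| subtrees still rising =", still_rising(k))
```

With skewed branch lengths like these, the mean and median can differ quite a bit, so it may be worth printing the table for the real tbl values before settling on a default for k.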