Actually, if you’re already using LAPACK, I doubt there’s much potential
for further optimization on the same platform. If, however, you know
your code is only ever going to run on Intel Xeons or similar, then maybe
have a look at the LAPACK examples for the Intel Math Kernel Library (MKL).
If you want to accelerate things using GPUs, you’d have a look at an OpenCL
or CUDA implementation, or at Theano (however, I don’t know how readily
available things like SVD are in Theano).
One more thing: Since you’re doing the SVD using LAPACK yourself, I trust
you’ve already chosen the right routine (for a general, fully populated,
complex-valued matrix). I’m not completely convinced, though:
Have you had a look at ? It seems the SVD $A=U \Sigma V^H$ is a two-step
process: First, the input matrix is decomposed into left and right
unitary matrices $U_1$ and $V_1^H$ and a bidiagonal matrix $B$ using
CGEBRD, so that $A=U_1 B V_1^H$. After that, $B$ is SVD’ed, yielding
$B=U_2 \Sigma V_2^H$. Substituting gives $A=(U_1 U_2) \Sigma (V_1 V_2)^H$,
so the product $U_1 U_2$ is $U$ and the product $V_1 V_2$ is $V$. Maybe for
your application $V_1$ is sufficient, because you can rearrange your
problem mathematically?
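
To make the two steps concrete, here is a minimal NumPy sketch, writing the factorizations as $A = U_1 B V_1^H$ and $B = U_2 \Sigma V_2^H$. It is only an illustration under some assumptions: `householder` and `bidiagonalize` are hypothetical helper names, the reduction is a plain textbook Householder bidiagonalization rather than LAPACK's blocked CGEBRD (which additionally makes $B$ real), and it assumes at least as many rows as columns.

```python
import numpy as np

def householder(x):
    """Unit vector v such that (I - 2 v v^H) x is a multiple of e_1."""
    v = x.astype(complex)            # astype returns a copy
    norm_x = np.linalg.norm(x)
    if norm_x == 0.0:
        return v                     # v = 0 makes the reflector the identity
    phase = x[0] / abs(x[0]) if x[0] != 0 else 1.0
    v[0] += phase * norm_x           # v = x - alpha e_1, alpha = -phase * ||x||
    return v / np.linalg.norm(v)

def bidiagonalize(A):
    """Step 1: return U1, B, V1 with A = U1 @ B @ V1^H, B upper bidiagonal.

    Hypothetical helper for illustration; assumes A.shape[0] >= A.shape[1].
    Unlike CGEBRD, the entries of B are left complex here.
    """
    B = np.array(A, dtype=complex)
    m, n = B.shape
    U1 = np.eye(m, dtype=complex)
    V1 = np.eye(n, dtype=complex)
    for k in range(n):
        # Left reflector: zero out column k below the diagonal.
        v = householder(B[k:, k])
        B[k:, :] -= 2.0 * np.outer(v, v.conj() @ B[k:, :])
        U1[:, k:] -= 2.0 * np.outer(U1[:, k:] @ v, v.conj())
        if k < n - 2:
            # Right reflector: zero out row k right of the superdiagonal.
            w = householder(B[k, k + 1:].conj())
            B[:, k + 1:] -= 2.0 * np.outer(B[:, k + 1:] @ w, w.conj())
            V1[:, k + 1:] -= 2.0 * np.outer(V1[:, k + 1:] @ w, w.conj())
    return U1, B, V1

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4)) + 1j * rng.standard_normal((6, 4))

# Step 1: reduce A to bidiagonal form.
U1, B, V1 = bidiagonalize(A)

# Step 2: SVD of the bidiagonal factor, B = U2 diag(sigma) V2^H.
U2, sigma, V2h = np.linalg.svd(B)

# Combine the factors: U = U1 U2 and V = V1 V2.
U = U1 @ U2
V = V1 @ V2h.conj().T

# Compare with a direct SVD of A.
sigma_ref = np.linalg.svd(A, compute_uv=False)
print("max singular value difference:", np.max(np.abs(sigma - sigma_ref)))
print("reconstruction error:",
      np.linalg.norm((U[:, :4] * sigma) @ V.conj().T - A))
```

If $V_1$ alone really is enough for your problem, you would stop after step 1 and skip the SVD of $B$ altogether, which is the potential saving hinted at above.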