Use the batched QR decomposition from cuSOLVER (a wrapper needs to be written for the strided form; see CUBLAS's strided getrf for reference). Note that there are many variants of the decomposition, but the vanilla (two-step, separated) implementation is the most useful for RBBSi.
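As a rough sketch of what the strided wrapper might involve: if the underlying batched routine takes an array of per-matrix pointers (as cuBLAS's `cublas<t>geqrfBatched` does), the wrapper mostly has to expand a base pointer and a stride into that array. The helper name `strided_to_pointers` is hypothetical, and the code below is host-only; for a real GPU call the pointers would be device addresses and the resulting array would still have to be copied to device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of cuSOLVER or cuBLAS): expand a strided batch
// into the array-of-pointers layout that pointer-array batched routines expect.
//   base   - pointer to the first matrix in the batch
//   stride - distance, in elements, between consecutive matrices
//   batch  - number of matrices
template <typename T>
std::vector<T*> strided_to_pointers(T* base, std::ptrdiff_t stride, std::size_t batch) {
    // An empty batch may pass a null base; anything else needs real storage.
    assert(base != nullptr || batch == 0);

    std::vector<T*> ptrs(batch);
    for (std::size_t i = 0; i < batch; ++i) {
        // Matrix i starts i * stride elements after the first matrix.
        ptrs[i] = base + static_cast<std::ptrdiff_t>(i) * stride;
    }
    return ptrs;
}
```

For a batch of column-major `m x n` matrices stored back to back, the stride would typically be `lda * n`; the same expansion would be needed for the per-matrix `tau` vectors.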