Skip to content

Conversation

@YushaLiu
Copy link

A minor edit to allow flash_init to take a value for Y2 rather than have flashier compute it, which throws out the error CHOLMOD error 'problem too large' when the number of cells is too large (e.g., > 100k).

@pcarbo
Copy link
Collaborator

pcarbo commented Jan 21, 2025

@YushaLiu I'm just summarizing the changes that @willwerscheid has requested:

  • Rename "Y2val" as "Y2".

  • Add roxygen2 documentation for the new ("Y2") argument.

I believe if you update your fork, the changes will automatically be reflected in this Pull Request.

@pcarbo
Copy link
Collaborator

pcarbo commented Jan 22, 2025

@YushaLiu It isn't quite clear from your description what Y2 should be; maybe you can give a simple example in R code of how to set Y2 naively? Is Y2 supposed to be a matrix, vector, or scalar?

@YushaLiu
Copy link
Author

@pcarbo I updated the description. Does it look clear now?

@pcarbo
Copy link
Collaborator

pcarbo commented Jan 22, 2025

Looks okay to me now, thanks. @willwerscheid I'll let you take care of the merge when you are okay with Yusha's changes.

@willwerscheid
Copy link
Owner

willwerscheid commented Jan 22, 2025

thanks, looks much better. this part is not accurate though:

#' rather than having \code{flashier} compute \code{Y2}, which involves explicitly
#' forming the \eqn{n \times n} dense matrix \eqn{Y} and can be much slower and
#' even cause memory issues.

flashier does not do this - if a low-rank representation Y = XX' is used then Y2 is computed as sum(X * (X(X’X)) [edited] - see sq.nmode.prod.r1.lowrank() in data_lowrank.R. So I guess now I am wondering whether we actually need this change? This calculation should not be too different from what you suggest doing in the example (it is done this way in case Y = X_1 X_2' with X_1 \ne X_2). But I did notice that we might be using base crossprod instead of Matrix::crossprod... maybe that is causing your issues? Or maybe we need to handle the special case where X_1 = X_2? What do you think @pcarbo @YushaLiu ? I would like to avoid adding an argument to flash_init() if we can instead automatically calculate Y2 more efficiently.

@pcarbo
Copy link
Collaborator

pcarbo commented Jan 22, 2025

I agree with you @willwerscheid, especially if it is an easy fix!

@willwerscheid
Copy link
Owner

The issue is that X(X'X) is not in fact sparse so the current method for computing Y2 is indeed very inefficient. We should handle the case where X_1 = X_2 separately (since it can then be computed very efficiently); we should also see whether it is possible to do the general case and still preserve sparsity.

@pcarbo
Copy link
Collaborator

pcarbo commented Jan 23, 2025

Also, should you be using Matrix::crossprod in this code in data_lowrank.R?

if (n == 1)
  return(Matrix::rowSums(lr[[1]] * (x[[1]] %*% crossprod(x[[2]], u))))
if (n == 2)
  return(Matrix::rowSums(lr[[2]] * (x[[2]] %*% t(crossprod(u, x[[1]])))))

@willwerscheid
Copy link
Owner

willwerscheid commented Jan 23, 2025

Matrix::crossprod is in the NAMESPACE already; I think we can actually remove Matrix:: from the call to rowSums()

@pcarbo
Copy link
Collaborator

pcarbo commented Jan 23, 2025

Hm, not sure, but you may have to be explicit since base::crossprod is also (implicitly) in the NAMESPACE. (I think.)

@pcarbo
Copy link
Collaborator

pcarbo commented Jan 23, 2025

The issue is that X(X'X) is not in fact sparse.

Well, X'X is typically not sparse (i.e., has many nonzeros).

Just to be absolutely clear: crossprod(X) returns a sparse matrix (i.e., sparse matrix data structure) when X is a sparse matrix, but the sparse matrix output will typically have many nonzeros.

@pcarbo
Copy link
Collaborator

pcarbo commented Jan 23, 2025

The term "sparse matrix" is ambiguous here because it could refer to a sparse matrix object (e.g., "dgCMatrix" class), or a matrix that has a high proportion of nonzeros. And I think you are referring to both at different times. (So am I.)

@YushaLiu
Copy link
Author

I didn't follow all the conversations above, but I'm curious if there is a way to calculate Y2 more efficiently than what I did in the pull request, if Y=XX'? @willwerscheid

Also, I personally feel that calculating Y2 is most useful in the case that Y=XX', where Y is closely related to covariance matrix (as in GBCD) -- columns and rows of Y are symmetric here so it is reasonable to assume var_type = 0. If Y = X_1 X_2' with X_1 \ne X_2, I'm not sure if var_type = 0 is still reasonable; if we need to assumevar_type = 1 or 2, then calculating Y2 will not be enough for updating the residual variance parameters in flash?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants