Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LASTAL pair alignment error maf-convert #48

Open
SamCT opened this issue Feb 21, 2025 · 3 comments
Open

LASTAL pair alignment error maf-convert #48

SamCT opened this issue Feb 21, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@SamCT
Copy link

SamCT commented Feb 21, 2025

Description of the bug

Hello,

Thank you for the tool, we are excited to test. I am running with two genomes vs. one reference. I am wondering why I receive this error from maf-convert ?

It has successfully produced a .m20_aln.maf.gz , but seems to be failing on getting a .tsv from it? My parameters are set quite high, I have 600Gb of memory allocated.

Caused by:
  Process `NFCORE_PAIRGENOMEALIGN:PAIRGENOMEALIGN:PAIRALIGN_M2O:ALIGNMENT_LASTAL_M2O (target___MLV4H1)` terminated with an error exit status (1)


Command executed:

  INDEX_NAME=$(basename $(ls lastdb/*.des) .des)
  set -o pipefail

  function calculate_psl_metrics() {
      awk 'BEGIN {
          FS="  ";  # Set field separator as tab
          totalMatches = 0;
          totalAlignmentLength = 0;
          print "Sample TotalAlignmentLength    PercentSimilarity";  # Header for MultiQC
      }
      {
          totalMatches += $1 + $3;  # Sum matches and repMatches
          totalAlignmentLength += $1 + $2 + $3 + $6 + $8;  # Sum matches, misMatches, repMatches, qBaseInsert, and tBaseInsert
      }
      END {
          percentSimilarity = (totalAlignmentLength > 0) ? (totalMatches / totalAlignmentLength * 100) : 0;
          print "target___MLV4H1" "     " totalAlignmentLength "        " percentSimilarity;  # Data in TSV format
      }'
  }

  lastal \
      -P 12 \
      -p target___MLV4H1.train \
      --split-f=MAF+ -C2 -D1e9  \
      lastdb/$INDEX_NAME \
      MLV4H1_Final.fa |
      tee >(gzip --no-name  > target___MLV4H1.m2o_aln.maf.gz) |
      maf-convert psl |
      calculate_psl_metrics > target___MLV4H1.m2o_aln.tsv

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PAIRGENOMEALIGN:PAIRGENOMEALIGN:PAIRALIGN_M2O:ALIGNMENT_LASTAL_M2O":
      last: $(lastal --version 2>&1 | sed 's/lastal //')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  INFO:    Converting SIF file to temporary sandbox...
  WARNING: underlay of /etc/localtime required more than 50 (74) bind mounts
  maf-convert: error: for PSL, each alignment must have 2 sequences
  INFO:    Cleaning up image...

I am curious to hear your feedback on this,

Sam

Command used and terminal output

Relevant files

No response

System information

withLabel:process_high {
    cpus   = { 12    * task.attempt }
    memory = { 144.GB * task.attempt }
    time   = { 1000.h  * task.attempt }
}
withLabel:process_long {
    time   = { 1000.h  * task.attempt }
}
withLabel:process_high_memory {
    memory = { 400.GB * task.attempt }
}
@SamCT SamCT added the bug Something isn't working label Feb 21, 2025
@charles-plessy
Copy link
Collaborator

Hi Sam, I think that nothing was aligned. You can verify that by inspecting target___MLV4H1.m2o_aln.maf.gz in the work directory, which will be either empty or containing header but no alignment.

Your genomes target and MVL4H1 are probably very different. If they are not gigabase-scale, you can try to index target them with a more sensitive seed, like MAM8. I do that routinely for prokaryote alignments.

@SamCT
Copy link
Author

SamCT commented Feb 22, 2025

Hi Charles,

There do appear to be a significant amount of alignments. The .maf.gz alignment has a file size of ~130 Mb, so it isn't empty. If I use minimap2 instead, plot the results from the .paf, I typically see roughly a third of the genome aligned. This is a polyploid, so I don't expect complete alignment... The target reference is ~2Gbp but the query is ~400 Mbp. I will try to adjust the seed and report back.

@charles-plessy
Copy link
Collaborator

OK, if the alignment file is not empty, then it was probably truncated and this crashes maf-convert. Cause for truncation is usually termination of the job by exhaustion of the allocated memory or time. Can you give me access to the genomes ? Otherwise (or in addition), please show the .command.log, .command.err, and .exitcode files of the work directory for the failed process.

In any case, if the genomes you are comparing are very similar, then the RY128 seed is for you. It greatly accelerates the computation at a small expense of sensitivity. I recently aligned all the chromosome assemblies available for the Brassicales with each other with no problem with that seed, and there were clear cases of polyploidy.

By the say, many-to-one alignments are good when comparing diploid to polyploid, but for the converse you will be better served by one-to-many alignments, which you can produce with the --m2m option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants