Accounting for new lines in OCR feature #472

@mattakamatsu

The OCR feature is terrific, with one exception: when the source text wraps onto a new line, the OCR output drops the space between the last word of one line and the first word of the next. For example:

tilted at +10-20 degrees.Based on the degree of invagination, CCSs were classified into threecategories.

Could we insert a space between words that are separated by a line break? I asked GPT-4 how to do this, and here is what it suggested:

// Inside the tesseractImage.onload = async () => { ... }

const {
  data: { text },
} = await worker.recognize(canvas);
await worker.terminate();

const textBullets = text.split("\n");
const bullets = [];
let currentText = "";
for (let b = 0; b < textBullets.length; b++) {
  const s = textBullets[b].trim(); // Trim leading and trailing whitespace
  if (s) {
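    if (currentText && !currentText.endsWith("-")) {
      // Join wrapped lines with a space. Sentence punctuation such as "." still needs one
      // (otherwise "degrees." + "Based" becomes "degrees.Based"); only a trailing hyphen,
      // which usually marks a word split across lines, is joined without a space.
      currentText += " ";
    }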
    currentText += s;
  } else if (currentText) {
    // Push the currentText into bullets when encountering an empty string (newline), and reset currentText
    bullets.push(
      currentText.startsWith("* ") ||
      currentText.startsWith("- ") ||
      currentText.startsWith("— ")
        ? currentText.substring(2)
        : currentText
    );
    currentText = "";
  }
}
if (currentText) {
  // Ensure any remaining text is also pushed into bullets
  bullets.push(
    currentText.startsWith("* ") ||
    currentText.startsWith("- ") ||
    currentText.startsWith("— ")
      ? currentText.substring(2)
      : currentText
  );
}

// The rest of your logic to create blocks from bullets remains unchanged.
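
To make the line-joining behaviour easy to check without running Tesseract, the same idea can be pulled out into a small pure function. This is only a minimal sketch: `joinOcrLines` is a hypothetical name that does not exist in the codebase, and the line breaks in `sample` are my guess at where the OCR split the example text above. It always inserts a single space between wrapped lines, treats a blank line as a paragraph boundary, and strips a leading `* `, `- `, or `— ` marker.

```javascript
// Hypothetical helper (not part of the extension's code): turns raw OCR text
// into one string per paragraph, joining wrapped lines with a single space.
function joinOcrLines(text) {
  const paragraphs = [];
  let current = "";
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (line) {
      // Always separate wrapped lines with a space, even after "." or ",".
      current = current ? `${current} ${line}` : line;
    } else if (current) {
      // A blank line ends the current paragraph.
      paragraphs.push(current);
      current = "";
    }
  }
  // Flush whatever is left after the last line.
  if (current) paragraphs.push(current);
  // Drop a leading bullet marker ("* ", "- ", or "— ") from each paragraph.
  return paragraphs.map((p) => p.replace(/^[*\-—] /, ""));
}

// Assumed line breaks: after "degrees." and after "three".
const sample =
  "tilted at +10-20 degrees.\n" +
  "Based on the degree of invagination, CCSs were classified into three\n" +
  "categories.";
console.log(joinOcrLines(sample));
```

The returned array could then stand in for `bullets` in the snippet above, with the block-creation logic left as it is.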
