A Java SDK for converting Markdown documents to various office formats including Word, Excel, PDF, and more.
- Convert Markdown to multiple formats:
  - Word (DOCX)
  - Excel (XLSX)
  - PDF
  - Plain Text (TXT)
  - Markdown (MD)
- Preserve Markdown structure and formatting
- Support for tables, lists, code blocks, and more
- Easy-to-use API
- Command-line interface
Maven:

```xml
<dependency>
    <groupId>io.github.twwch</groupId>
    <artifactId>markdown2office</artifactId>
    <version>1.0.16</version>
</dependency>
```

Gradle:

```groovy
implementation 'io.github.twwch:markdown2office:1.0.16'
```
```java
import io.github.twwch.markdown2office.Markdown2Office;
import io.github.twwch.markdown2office.model.FileType;

import java.io.IOException;

public class Example {
    public static void main(String[] args) throws IOException {
        Markdown2Office converter = new Markdown2Office();
        String markdown = "# Hello World\n\nThis is **bold** text.";

        // Convert to Word
        converter.convert(markdown, FileType.WORD, "output.docx");

        // Convert to PDF
        converter.convert(markdown, FileType.PDF, "output.pdf");

        // Convert to Excel
        converter.convert(markdown, FileType.EXCEL, "output.xlsx");
    }
}
```
```java
// Convert a markdown file to Word
converter.convertFile("input.md", FileType.WORD, "output.docx");

// Auto-detect the output format from the file extension
converter.convertFile("input.md", "output.pdf");

// Write to an OutputStream
try (FileOutputStream fos = new FileOutputStream("output.docx")) {
    converter.convert(markdown, FileType.WORD, fos);
}

// Get the result as a byte array
byte[] pdfBytes = converter.convertToBytes(markdown, FileType.PDF);
```
```bash
java -jar markdown2office.jar input.md output.docx
```
- Headings (H1-H6)
- Text formatting (bold, italic, code)
- Lists (ordered, unordered, nested)
- Blockquotes
- Code blocks (with syntax highlighting)
- Tables (GitHub Flavored Markdown)
- Links
- Images (as references)
- Horizontal rules
- Task lists
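As an illustration, here is a small Markdown sample that exercises several of these elements; any document like it converts to the supported output formats in the same way:

```markdown
# Project Report

Some **bold**, *italic*, and `inline code` text.

> A blockquote with a [link](https://example.com).

1. First step
2. Second step
   - Nested bullet

| Name | Value |
| ---- | ----- |
| Rows | 2     |

- [x] Done task
- [ ] Open task
```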
```bash
# Clone the repository
git clone https://github.com/twwch/markdown2office.git
cd markdown2office

# Build with Maven
mvn clean install

# Run tests
mvn test
```
To release to Maven Central, you need to configure the following GitHub Secrets:
- `GPG_PRIVATE_KEY`: Your GPG private key for signing artifacts
- `GPG_PASSPHRASE`: Passphrase for your GPG key
- `MAVEN_USERNAME`: Your Sonatype JIRA username
- `MAVEN_PASSWORD`: Your Sonatype JIRA password
- Create and push a tag:

  ```bash
  git tag v1.0.16
  git push origin v1.0.16
  ```

- The GitHub Action will then automatically:
  - Build and test the project
  - Sign the artifacts with GPG
  - Deploy to Maven Central
  - Create a GitHub Release
- Java 8 or higher
- Maven 3.6 or higher (for building)
- Apache POI (Word and Excel support)
- iText (PDF generation)
- CommonMark (Markdown parsing)
- SLF4J (Logging)
Apache License 2.0
Contributions are welcome! Please feel free to submit a Pull Request.
The library also includes an enhanced file parsing system that extracts content, metadata, and structure from a wide range of document formats:
- PDF (.pdf) - With hidden layer filtering
- Word (.doc, .docx) - Enhanced structure extraction
- Excel (.xls, .xlsx) - Improved multi-sheet handling
- PowerPoint (.ppt, .pptx)
- Text (.txt) - Smart encoding detection
- Markdown (.md) - Full CommonMark support
- CSV (.csv) - Advanced encoding detection (UTF-8, GBK, GB2312, etc.)
- And 20+ other formats via Apache Tika
```java
import io.github.twwch.markdown2office.parser.UniversalFileParser;
import io.github.twwch.markdown2office.parser.ParsedDocument;

import java.io.File;

UniversalFileParser parser = new UniversalFileParser();

// Parse any supported file
ParsedDocument document = parser.parse(new File("document.pdf"));

// Get the content as Markdown
String markdown = document.toMarkdown();

// Access document metadata
DocumentMetadata metadata = document.getDocumentMetadata();
System.out.println("Total pages: " + metadata.getTotalPages());
System.out.println("Word count: " + metadata.getTotalWords());

// Access page-by-page content
for (PageContent page : document.getPages()) {
    System.out.println("Page " + page.getPageNumber() + ":");
    System.out.println("  Words: " + page.getWordCount());
    System.out.println("  Headings: " + page.getHeadings());
    System.out.println("  Tables: " + page.getTables().size());
}
```
| Field | Type | Description |
|---|---|---|
| `content` | `String` | Raw text content of the entire document |
| `markdownContent` | `String` | Content converted to Markdown format |
| `fileType` | `FileType` | Type of the parsed file (PDF, WORD, EXCEL, etc.) |
| `pages` | `List<PageContent>` | Page-by-page content and structure |
| `tables` | `List<ParsedTable>` | All tables found in the document |
| `documentMetadata` | `DocumentMetadata` | Comprehensive metadata about the document |
| `metadata` | `Map<String, String>` | Legacy metadata for backward compatibility |
| Field | Type | Description |
|---|---|---|
| `fileName` | `String` | Name of the source file |
| `fileSize` | `Long` | Size of the file in bytes |
| `fileType` | `FileType` | Document type |
| `title` | `String` | Document title (if available) |
| `author` | `String` | Document author |
| `subject` | `String` | Document subject |
| `keywords` | `String` | Document keywords |
| `creationDate` | `Date` | When the document was created |
| `modificationDate` | `Date` | Last modification date |
| `totalPages` | `Integer` | Total number of pages |
| `totalWords` | `Integer` | Total word count |
| `totalCharacters` | `Integer` | Total character count |
| `totalParagraphs` | `Integer` | Total paragraph count |
| `totalTables` | `Integer` | Total table count |
| `totalSheets` | `Integer` | For Excel: number of sheets |
| `totalSlides` | `Integer` | For PowerPoint: number of slides |
| Field | Type | Description |
|---|---|---|
| `pageNumber` | `int` | Page number (starting from 1) |
| `rawText` | `String` | Raw text content of the page |
| `markdownContent` | `String` | Page content in Markdown format |
| `headings` | `List<String>` | All headings found on the page |
| `paragraphs` | `List<String>` | All paragraphs on the page |
| `lists` | `List<String>` | All list items on the page |
| `tables` | `List<ParsedTable>` | Tables found on this page |
| `wordCount` | `Integer` | Word count for this page |
| `characterCount` | `Integer` | Character count for this page |
| Field | Type | Description |
|---|---|---|
| `title` | `String` | Table title or caption |
| `headers` | `List<String>` | Column headers |
| `data` | `List<List<String>>` | Table data rows |
| `rowCount` | `int` | Number of data rows |
| `columnCount` | `int` | Number of columns |
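To give a feel for the kind of GitHub-Flavored-Markdown output that a parsed table produces via `toMarkdown()`, here is a minimal, self-contained sketch of GFM table rendering. The helper name and details are ours, not the library's:

```java
import java.util.Arrays;
import java.util.List;

public class TableSketch {
    // Hypothetical helper (not part of the library) that renders column
    // headers and data rows as a GitHub-Flavored-Markdown table.
    static String toMarkdownTable(List<String> headers, List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        // Header row
        sb.append("| ").append(String.join(" | ", headers)).append(" |\n");
        // Separator row, one "---" cell per column
        sb.append("|");
        for (int i = 0; i < headers.size(); i++) {
            sb.append(" --- |");
        }
        sb.append("\n");
        // Data rows
        for (List<String> row : rows) {
            sb.append("| ").append(String.join(" | ", row)).append(" |\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toMarkdownTable(
                Arrays.asList("Name", "Age"),
                Arrays.asList(Arrays.asList("Alice", "30"),
                              Arrays.asList("Bob", "25"))));
    }
}
```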
```java
import io.github.twwch.markdown2office.parser.impl.*;

// Use a specific parser for finer control
PdfFileParser pdfParser = new PdfFileParser();
ParsedDocument pdfDoc = pdfParser.parse("document.pdf");

WordFileParser wordParser = new WordFileParser();
ParsedDocument wordDoc = wordParser.parse("document.docx");

ExcelFileParser excelParser = new ExcelFileParser();
ParsedDocument excelDoc = excelParser.parse("spreadsheet.xlsx");
```
**PDF Hidden Layer Filtering**

The PDF parser supports filtering out hidden layers, watermarks, and invisible text that appear in some PDFs (e.g., from recruitment platforms such as BOSS Zhipin).
```java
import io.github.twwch.markdown2office.parser.impl.PdfFileParser;
import io.github.twwch.markdown2office.parser.ParsedDocument;

// Default behavior: hidden layers are excluded
PdfFileParser parser = new PdfFileParser();
ParsedDocument doc = parser.parse("resume.pdf");
// Hidden watermarks and invisible text are automatically filtered out

// If you need to include hidden layers (not recommended for most cases)
PdfFileParser parserWithHidden = new PdfFileParser(true);
ParsedDocument docWithHidden = parserWithHidden.parse("document.pdf");

// Or configure dynamically
PdfFileParser configurable = new PdfFileParser();
configurable.setIncludeHiddenLayers(false); // Exclude hidden content (default)
ParsedDocument cleanDoc = configurable.parse("document.pdf");

configurable.setIncludeHiddenLayers(true); // Include everything
ParsedDocument fullDoc = configurable.parse("document.pdf");
```
When `includeHiddenLayers` is `false` (the default), the parser filters out:
- Invisible text layers (rendering mode NEITHER)
- Text with transparency below 30%
- White or nearly white text on white backgrounds
- Hidden annotations and watermarks
- XObjects with opacity below 50%
- Resources marked as watermarks or backgrounds
Common use cases:
- Resume parsing: Remove recruitment platform watermarks
- Document cleaning: Extract only visible content
- Content migration: Get clean text without metadata artifacts
- Text analysis: Focus on actual document content
```java
ParsedDocument document = parser.parse(new File("report.pdf"));

// Get all tables
for (ParsedTable table : document.getTables()) {
    System.out.println("Table: " + table.getTitle());
    System.out.println("Headers: " + table.getHeaders());

    // Convert the table to Markdown
    String tableMarkdown = table.toMarkdown();
    System.out.println(tableMarkdown);

    // Access the table data
    for (List<String> row : table.getData()) {
        System.out.println(String.join(" | ", row));
    }
}
```
The CSV parser now automatically detects and handles various character encodings, including Chinese encodings:
```java
import io.github.twwch.markdown2office.parser.impl.CsvFileParser;

// Automatically detects the encoding (UTF-8, GBK, GB2312, etc.)
CsvFileParser csvParser = new CsvFileParser();
ParsedDocument doc = csvParser.parse("chinese_data.csv");

// The parser will:
// 1. Detect BOM markers
// 2. Try common encodings in order
// 3. Validate the content to ensure the encoding is correct
// 4. Fall back gracefully if the encoding cannot be determined
String content = doc.getContent(); // Correctly decoded content
```
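The detection steps listed above can be sketched with plain JDK classes. This is a simplified illustration under our own assumptions (helper name, candidate order, and fallback are ours), not the library's actual implementation:

```java
import java.nio.ByteBuffer;
import java.nio.charset.*;

public class EncodingSketch {
    // Hypothetical detector: BOM check first, then strict trial decoding
    // of common encodings in order, with a graceful fallback.
    static Charset detect(byte[] bytes) {
        // Step 1: UTF-8 BOM marker (EF BB BF)
        if (bytes.length >= 3 && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB && (bytes[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // Steps 2-3: try candidates with strict decoders; an invalid byte
        // sequence makes decode() throw, rejecting that encoding
        String[] candidates = {"UTF-8", "GBK", "GB2312", "ISO-8859-1"};
        for (String name : candidates) {
            try {
                Charset cs = Charset.forName(name);
                cs.newDecoder()
                  .onMalformedInput(CodingErrorAction.REPORT)
                  .onUnmappableCharacter(CodingErrorAction.REPORT)
                  .decode(ByteBuffer.wrap(bytes));
                return cs;
            } catch (CharacterCodingException | UnsupportedCharsetException e) {
                // Not this encoding; try the next candidate
            }
        }
        // Step 4: graceful fallback
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        byte[] utf8 = "名字,年龄".getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(utf8).name()); // UTF-8
    }
}
```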
```java
ParsedDocument excel = parser.parse(new File("data.xlsx"));
DocumentMetadata metadata = excel.getDocumentMetadata();
System.out.println("Total sheets: " + metadata.getTotalSheets());

// Each sheet is treated as a page
for (PageContent sheet : excel.getPages()) {
    System.out.println("Sheet " + sheet.getPageNumber());

    // Excel sheets typically contain one table per sheet
    for (ParsedTable table : sheet.getTables()) {
        System.out.println("  Rows: " + table.getRowCount());
        System.out.println("  Columns: " + table.getColumnCount());
    }
}
```
```java
ParsedDocument document = parser.parse(new File("manual.pdf"));

// Search for specific content
for (PageContent page : document.getPages()) {
    if (page.getRawText().contains("installation")) {
        System.out.println("Found 'installation' on page " + page.getPageNumber());
    }

    // Find pages with tables
    if (!page.getTables().isEmpty()) {
        System.out.println("Page " + page.getPageNumber() + " has "
                + page.getTables().size() + " table(s)");
    }

    // Find pages with specific headings
    for (String heading : page.getHeadings()) {
        if (heading.toLowerCase().contains("introduction")) {
            System.out.println("Introduction section on page " + page.getPageNumber());
        }
    }
}
```
```java
// Parse a PDF and convert it to Word
ParsedDocument pdfDoc = parser.parse(new File("report.pdf"));
String markdown = pdfDoc.toMarkdown();

Markdown2Office converter = new Markdown2Office();
converter.convert(markdown, FileType.WORD, "report.docx");

// Parse Word and convert it to PDF with formatting preserved
ParsedDocument wordDoc = parser.parse(new File("document.docx"));
converter.convert(wordDoc.toMarkdown(), FileType.PDF, "document.pdf");
```
| Format | Text Extraction | Table Parsing | Metadata | Structure Detection | Special Features (v1.0.16) |
|---|---|---|---|---|---|
| PDF (.pdf) | ✅ Full text with formatting | ✅ Complex tables | ✅ Complete | ✅ Pages, headings, paragraphs | • Hidden layer filtering • Watermark removal • Invisible text detection |
| Word (.docx/.doc) | ✅ Rich text preservation | ✅ Nested tables | ✅ Complete | ✅ Sections, headings, lists | • Style preservation • Comment extraction • Track changes support |
| Excel (.xlsx/.xls) | ✅ Cell values & formulas | ✅ Native | ✅ Complete | ✅ Sheets as pages | • Formula evaluation • Merged cell handling • Multiple sheet support |
| PowerPoint (.pptx/.ppt) | ✅ Slide content | ✅ Slide tables | ✅ Complete | ✅ Slides as pages | • Speaker notes • Slide layout detection • Shape text extraction |
| CSV | ✅ Full content | ✅ Native | ❌ | | • Auto encoding detection (UTF-8, GBK, GB2312) • BOM handling • Delimiter detection |
| Text (.txt) | ✅ Plain text | ❌ | ❌ | | • Encoding auto-detection • Line break preservation |
| Markdown (.md) | ✅ Formatted text | ✅ GFM tables | ❌ | ✅ Heading hierarchy | • CommonMark + GFM • Code block preservation • Task list support |
| RTF | ✅ Rich text | ✅ | | | • Via Apache Tika |
| HTML | ✅ Text content | ✅ | | ✅ DOM structure | • Tag removal • Link preservation |
| XML | ✅ Text nodes | | | ✅ Element hierarchy | • Via Apache Tika |
| Other formats | ✅ Via Tika | | | | • 20+ formats via Apache Tika |
Legend:
- ✅ Full support with comprehensive features
- ⚠️ Partial support or limited functionality
- ❌ Not supported
- Bold items indicate recent enhancements in v1.0.16
- Large files are processed efficiently with streaming where possible
- Page-based extraction allows processing documents without loading entire content into memory
- Metadata is extracted without parsing full document content when possible
- Smart encoding detection minimizes re-reading of files
- Optimized table extraction for large Excel and CSV files
```java
try {
    ParsedDocument document = parser.parse(new File("document.pdf"));
    // Process the document
} catch (UnsupportedFileException e) {
    System.err.println("File format not supported: " + e.getMessage());
} catch (IOException e) {
    System.err.println("Failed to parse document: " + e.getMessage());
}
```
If you encounter any issues or have questions, please file an issue on the GitHub repository.