Text Extraction

DocuDesk extracts text content from PDF, Office, and plain-text documents so that it can be processed further. Text extraction is the foundation for many other DocuDesk capabilities, including document analysis, reporting, and anonymization.

Supported File Formats

The text extraction service supports the following document formats:

PDF Documents - Extract text from PDF files using the Smalot PDF Parser
Word Documents - Process .doc and .docx files using PHPWord
Excel Spreadsheets - Extract data from .xls, .xlsx, and .csv files using PhpSpreadsheet
PowerPoint Presentations - Process .ppt and .pptx files using PHPPresentation
Plain Text Files - Handle .txt, .md, .html, .xml, and other text-based formats
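
The list above amounts to a lookup from file extension to extraction library. A minimal sketch of that lookup follows; the function name `extractorFor` and the returned labels are hypothetical and are not part of the DocuDesk API.

```php
<?php
declare(strict_types=1);

/**
 * Return a label for the library that handles the given path,
 * or null when the extension is not in the list above.
 * Hypothetical helper for illustration only.
 */
function extractorFor(string $path): ?string
{
    // Extension-to-library mapping taken from the supported formats list.
    $extractors = [
        'pdf'  => 'Smalot PDF Parser',
        'doc'  => 'PHPWord',
        'docx' => 'PHPWord',
        'xls'  => 'PhpSpreadsheet',
        'xlsx' => 'PhpSpreadsheet',
        'csv'  => 'PhpSpreadsheet',
        'ppt'  => 'PHPPresentation',
        'pptx' => 'PHPPresentation',
        'txt'  => 'plain text',
        'md'   => 'plain text',
        'html' => 'plain text',
        'xml'  => 'plain text',
    ];

    // Compare case-insensitively so "Report.PDF" matches "pdf".
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    return $extractors[$extension] ?? null;
}
```

Calling `extractorFor('/path/to/Report.PDF')` returns `'Smalot PDF Parser'`, while an unlisted extension such as `.zip` returns `null`.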

How It Works

The text extraction process follows these steps:

File Detection: The system identifies the file type based on extension and MIME type
Content Extraction: Specialized extractors process the file to retrieve text content
Text Normalization: The extracted text is normalized to ensure consistent processing
Metadata Extraction: Additional metadata is extracted from the document
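
The detection and normalization steps above can be sketched with PHP's built-in `finfo` class and a few regular expressions. This is an illustration under assumptions: the function names are hypothetical, and the exact normalization rules DocuDesk applies are not documented here, so the rules below (unified line endings, collapsed whitespace) are only one plausible choice.

```php
<?php
declare(strict_types=1);

/**
 * File detection (assumed form): report both the extension and the
 * MIME type sniffed from the file contents.
 */
function detectFileType(string $path): array
{
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $mime = $finfo->file($path);

    return [
        'extension' => strtolower(pathinfo($path, PATHINFO_EXTENSION)),
        'mime'      => $mime === false ? null : $mime,
    ];
}

/**
 * Text normalization (assumed rules): unify line endings, collapse runs
 * of spaces and tabs, strip spaces around line breaks, allow at most one
 * blank line in a row, and trim the result.
 */
function normalizeText(string $text): string
{
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    $text = preg_replace('/[ \t]+/', ' ', $text) ?? $text;
    $text = preg_replace('/ ?\n ?/', "\n", $text) ?? $text;
    $text = preg_replace('/\n{3,}/', "\n\n", $text) ?? $text;

    return trim($text);
}
```

Checking both the extension and the sniffed MIME type guards against files whose extension does not match their content.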

Using Text Extraction

Text extraction is typically used as part of a larger workflow, but you can also use it directly:

// Example: extract text from a document.
// Resolve the extraction service from the server container.
$extractionService = \OC::$server->get(\OCA\DocuDesk\Service\ExtractionService::class);
$text = $extractionService->extractText('/path/to/document.pdf');

// Example: extract metadata from a document using the same service instance.
$metadata = $extractionService->extractMetadata('/path/to/document.docx');

Metadata Extraction

In addition to text content, the extraction service can retrieve the following metadata from documents:

Basic Metadata: Filename, file size, MIME type, last modified date
PDF Metadata: Title, author, subject, keywords, creation date, page count
Office Document Metadata: Creator, last modified by, creation date, title, description
Spreadsheet Metadata: Sheet count, cell counts, worksheet names
Presentation Metadata: Slide count, shape counts
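
The basic metadata fields can be read with standard PHP filesystem functions. The sketch below shows one way to collect them; the function name `basicMetadata` and the array keys are assumptions, and the structure returned by DocuDesk's own `extractMetadata` may differ.

```php
<?php
declare(strict_types=1);

/**
 * Collect filename, size, MIME type, and last modified date for a file.
 * Hypothetical helper; not the DocuDesk implementation.
 */
function basicMetadata(string $path): array
{
    if (!is_file($path) || !is_readable($path)) {
        throw new InvalidArgumentException("Not a readable file: {$path}");
    }

    // mime_content_type() needs the fileinfo extension and returns false on failure.
    $mime = mime_content_type($path);

    return [
        'filename'     => basename($path),
        'size'         => (int) filesize($path),
        'mimeType'     => $mime === false ? null : $mime,
        'lastModified' => date(DATE_ATOM, (int) filemtime($path)),
    ];
}
```

Format-specific fields such as page count or slide count come from the respective parsing library and are not covered by this sketch.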

Performance Considerations

Text extraction can be resource-intensive, especially for large documents. Consider these best practices:

Process documents asynchronously for large files
Implement caching for frequently accessed documents
Set appropriate memory limits for your server
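
As an illustration of the caching advice, the sketch below stores extracted text on disk and reuses it until the source file changes. The function `cachedExtract`, the cache directory, and the key scheme are assumptions for this example; DocuDesk itself is not documented as providing such a cache.

```php
<?php
declare(strict_types=1);

/**
 * Return cached text for $path and call $extract only on a cache miss.
 * The key includes file size and modification time, so an edited file
 * is extracted again. Hypothetical helper for illustration.
 */
function cachedExtract(string $path, callable $extract, string $cacheDir): string
{
    clearstatcache(true, $path);
    $key = sha1($path . '|' . filesize($path) . '|' . filemtime($path));
    $cacheFile = rtrim($cacheDir, '/') . '/' . $key . '.txt';

    if (is_file($cacheFile)) {
        return (string) file_get_contents($cacheFile);
    }

    $text = $extract($path);
    file_put_contents($cacheFile, $text, LOCK_EX);

    return $text;
}
```

With the service from the usage example, the extractor can be passed as a closure, for instance `fn (string $p): string => $extractionService->extractText($p)`, assuming `extractText` returns a string.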

Integration with Other Features

Text extraction integrates with several other DocuDesk features:

Document Analysis: Extracted text is analyzed for sensitive information
Reporting: Text content is used to generate document reports
Anonymization: Extracted text is processed to identify and anonymize sensitive data
Search Indexing: Extracted text improves document searchability

Configuration

No specific configuration is required for basic text extraction functionality. The necessary libraries are included with DocuDesk.

Limitations

The text extraction service has the following limitations:

Complex document formatting may be lost during extraction
Encrypted or password-protected documents may not be fully extractable
Image-based PDFs require OCR (not included) for text extraction
Very large documents may require additional memory allocation

Troubleshooting

If you encounter issues with text extraction:

Verify the document is not corrupted or password-protected
Check that the file format is supported
Ensure your server has sufficient memory allocated
Review the DocuDesk logs for specific error messages
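
Some of these checks can be run before extraction is attempted. The following sketch tests readability and compares the file size against PHP's `memory_limit`; both function names are hypothetical, and comparing raw file size to the memory limit is only a rough heuristic, since parsing a document usually needs more memory than the file occupies on disk.

```php
<?php
declare(strict_types=1);

/**
 * Convert a php.ini memory value such as "128M" to bytes.
 * Returns null for "-1", which PHP treats as unlimited.
 */
function memoryLimitBytes(string $limit): ?int
{
    $limit = trim($limit);
    if ($limit === '-1') {
        return null;
    }

    $value = (int) $limit; // "128M" casts to 128
    switch (strtolower(substr($limit, -1))) {
        case 'g':
            $value *= 1024;
            // fall through
        case 'm':
            $value *= 1024;
            // fall through
        case 'k':
            $value *= 1024;
    }

    return $value;
}

/**
 * Hypothetical pre-flight check: list the problems found before extraction.
 */
function preflightProblems(string $path, string $memoryLimit): array
{
    if (!is_file($path) || !is_readable($path)) {
        return ['File is missing or not readable'];
    }

    $problems = [];
    $size = (int) filesize($path);
    if ($size === 0) {
        $problems[] = 'File is empty';
    }

    $limit = memoryLimitBytes($memoryLimit);
    if ($limit !== null && $size > $limit) {
        $problems[] = 'File is larger than memory_limit';
    }

    return $problems;
}

// Usage: preflightProblems('/path/to/document.pdf', (string) ini_get('memory_limit'));
```

An empty result means none of these basic checks failed; corrupted or password-protected files still need to be diagnosed from the DocuDesk logs.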
