Document Processing
DocuDesk can transform, analyze, and manage documents in a range of formats. This page explains how document processing works in DocuDesk.
Overview
The Document Processing system in DocuDesk enables you to:
- Generate documents from templates
- Convert documents between formats
- Extract text and metadata from documents
- Anonymize personal data in documents
- Check documents for accessibility compliance
- Validate documents against templates or schemas
Processing Workflow
A typical document processing workflow in DocuDesk consists of the following steps:
- Initiation: A user or system initiates a processing operation
- Queuing: The operation is queued for processing
- Processing: The document is processed according to the requested operation
- Logging: The operation is logged in the parsing logs
- Privacy Tracking: If applicable, privacy-related metadata is updated
- Report Generation: A document report is generated with analysis results
- Notification: The user is notified of the operation's completion
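The steps above can be sketched as a small in-process pipeline. This is only an illustration, not DocuDesk's implementation: the `Operation` class, the `submit` and `run_workflow` functions, and the event names are all hypothetical, and a real deployment would likely use a persistent job queue rather than `queue.Queue`.

```python
from dataclasses import dataclass, field
from queue import Queue
from typing import Callable, Optional


@dataclass
class Operation:
    """Hypothetical record of one processing request (not a DocuDesk class)."""
    document_id: str
    kind: str                      # e.g. "extract_text" or "anonymize"
    payload: str
    result: Optional[str] = None
    events: list = field(default_factory=list)


def submit(queue: Queue, op: Operation) -> None:
    """Initiation and queuing."""
    op.events.append("initiated")
    queue.put(op)
    op.events.append("queued")


def run_workflow(queue: Queue, handlers: dict, notify: Callable) -> list:
    """Drain the queue and walk each operation through the remaining steps."""
    finished = []
    while not queue.empty():
        op = queue.get()
        op.result = handlers[op.kind](op.payload)        # processing
        op.events.append("processed")
        op.events.append("logged")                       # logging
        if op.kind == "anonymize":                       # privacy tracking, if applicable
            op.events.append("privacy_metadata_updated")
        op.events.append("report_generated")             # report generation
        notify(op)                                       # notification
        op.events.append("notified")
        finished.append(op)
    return finished


jobs = Queue()
submit(jobs, Operation("doc-1", "extract_text", "  Hello  "))
for op in run_workflow(jobs, {"extract_text": str.strip}, notify=lambda op: None):
    print(op.document_id, op.events)
```

Each handler here is a plain function from input text to output text; the privacy-tracking step runs only for the hypothetical `anonymize` kind, mirroring the "if applicable" condition in the list.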
Integration with Other Systems
Document processing in DocuDesk is tightly integrated with several other systems:
- Document Reports: Stores analysis results for documents
- Anonymization Logs: Tracks anonymization operations and replacements
- Presidio Integration: Provides entity recognition and anonymization capabilities
This integration gives you an audit trail of document operations, which you can use as supporting evidence when demonstrating GDPR compliance.
Processing Operations
Text Extraction
Text extraction pulls the textual content out of documents in formats such as PDF and Word. This is useful for:
- Indexing documents for search
- Analyzing document content
- Preparing documents for anonymization
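To make the idea concrete, here is a minimal sketch of text extraction for one format, Word `.docx`, using only the Python standard library. It does not show how DocuDesk extracts text, which this page does not describe; the function name `extract_docx_text` is hypothetical. It relies on the fact that a `.docx` file is a ZIP package whose main body is stored in `word/document.xml`.

```python
import zipfile
from xml.etree import ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_docx_text(source) -> str:
    """Return the paragraphs of a .docx file, one per line.

    `source` is a path or a binary file-like object. Hypothetical helper,
    not part of DocuDesk.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

This simplified version ignores headers, footers, footnotes, and tracked changes, and may repeat text for paragraphs nested inside text boxes. Formats such as PDF need a dedicated parser.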
Metadata Extraction
Metadata extraction reads the metadata stored in a document, such as:
- Author information
- Creation and modification dates
- Document properties
- Embedded metadata
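As an illustration of the fields listed above, the following sketch reads core properties from a Word `.docx` file with the Python standard library. It is an assumption-laden example rather than DocuDesk's extractor: the function name `extract_docx_metadata` and the keys of the returned dictionary are hypothetical. The part name `docProps/core.xml` and the namespaces are those conventionally used for core properties in Office Open XML packages.

```python
import zipfile
from xml.etree import ElementTree as ET

# Namespaces used by the core-properties part of an OOXML package
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}


def extract_docx_metadata(source) -> dict:
    """Return selected core properties; missing fields come back as None.

    `source` is a path or a binary file-like object. Hypothetical helper,
    not part of DocuDesk.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))

    def text_of(path):
        node = root.find(path, NS)
        return node.text if node is not None else None

    return {
        "author": text_of("dc:creator"),
        "created": text_of("dcterms:created"),
        "modified": text_of("dcterms:modified"),
        "last_modified_by": text_of("cp:lastModifiedBy"),
    }
```

Author names and editing history are themselves personal data, so metadata of this kind is worth reviewing alongside the document text before sharing a file.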
Anonymization
Anonymization removes or masks personal data in documents. DocuDesk uses Microsoft Presidio for entity recognition and anonymization, which provides:
- Named entity recognition to identify personal data (PERSON, LOCATION, etc.)
- Redaction of personal data with customizable replacement text
- Confidence scoring for detected entities
- Secure key generation for potential de-anonymization
- Comprehensive tracking of all anonymization operations
The results of anonymization operations are stored in the Anonymization Log object, which includes:
- Original and anonymized text
- Detailed information about detected entities
- A secure key for potential de-anonymization
- A list of all text replacements made
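The four items above suggest a record shaped roughly like the sketch below. The class and field names (`AnonymizationLogSketch`, `DetectedEntity`, `Replacement`, `key`) are hypothetical and chosen only to mirror the list; the actual Anonymization Log schema is documented separately. How DocuDesk derives or uses the key is not stated here, so the sketch simply generates a random token with Python's `secrets` module.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class DetectedEntity:
    entity_type: str   # e.g. "PERSON" or "LOCATION"
    text: str          # the matched text
    score: float       # recognizer confidence, 0.0 to 1.0


@dataclass
class Replacement:
    original: str
    replacement: str


@dataclass
class AnonymizationLogSketch:
    """Illustrative stand-in for the Anonymization Log object."""
    original_text: str
    anonymized_text: str
    entities: list = field(default_factory=list)       # list of DetectedEntity
    replacements: list = field(default_factory=list)   # list of Replacement
    # Assumption: an opaque random token; 32 random bytes, URL-safe encoded.
    key: str = field(default_factory=lambda: secrets.token_urlsafe(32))


log = AnonymizationLogSketch(
    original_text="I live in Amsterdam.",
    anonymized_text="I live in <LOCATION>.",
    entities=[DetectedEntity("LOCATION", "Amsterdam", 0.99)],
    replacements=[Replacement("Amsterdam", "<LOCATION>")],
)
```

Because such a record holds both the original text and the mapping back to it, it is as sensitive as the source document and should be access-controlled accordingly.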
For more information on anonymization, see:
- Anonymization Logs for details on the anonymization log object
- Presidio Integration for details on how DocuDesk processes Presidio's output
Example Presidio Response
{
  "text": "Mijn naam is Jan de Hooglander, mijn BSN is 123456789 en ik woon in Amsterdam.",
  "entities_found": [
    {
      "entity_type": "PERSON",
      "text": "Jan de Hoog",
      "score": 0.9999997019767761
    },
    {
      "entity_type": "LOCATION",
      "text": "Amsterdam",
      "score": 0.9999990463256836
    },
    {
      "entity_type": "PERSON",
      "text": "BSN",
      "score": 0.85
    }
  ]
}
DocuDesk transforms this response into an AnonymizationLog object that records the original and anonymized text, the detected entities, the replacements made, and the de-anonymization key.
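One way to picture that transformation is the sketch below, which filters the entities in a response by confidence score and substitutes a placeholder for each match. It is a simplified assumption, not DocuDesk's code: the function name `apply_presidio_response`, the `min_score` threshold of 0.9, the `<ENTITY_TYPE>` placeholder format, and the output keys are all hypothetical. The response shown above carries no character offsets, so the sketch falls back to plain string replacement.

```python
def apply_presidio_response(response: dict, min_score: float = 0.9) -> dict:
    """Replace sufficiently confident entities with <ENTITY_TYPE> placeholders."""
    anonymized = response["text"]
    replacements, skipped = [], []

    # Handle longer matches first so a short entity cannot split a longer one.
    entities = sorted(
        response["entities_found"], key=lambda e: len(e["text"]), reverse=True
    )
    for entity in entities:
        if entity["score"] < min_score or entity["text"] not in anonymized:
            skipped.append(entity)
            continue
        placeholder = "<" + entity["entity_type"] + ">"
        anonymized = anonymized.replace(entity["text"], placeholder)
        replacements.append(
            {
                "original": entity["text"],
                "replacement": placeholder,
                "score": entity["score"],
            }
        )

    return {
        "original_text": response["text"],
        "anonymized_text": anonymized,
        "replacements": replacements,
        "skipped": skipped,
    }


response = {
    "text": "Mijn naam is Jan de Hooglander, mijn BSN is 123456789 en ik woon in Amsterdam.",
    "entities_found": [
        {"entity_type": "PERSON", "text": "Jan de Hoog", "score": 0.9999997019767761},
        {"entity_type": "LOCATION", "text": "Amsterdam", "score": 0.9999990463256836},
        {"entity_type": "PERSON", "text": "BSN", "score": 0.85},
    ],
}
print(apply_presidio_response(response)["anonymized_text"])
# prints: Mijn naam is <PERSON>lander, mijn BSN is 123456789 en ik woon in <LOCATION>.
```

Run against the sample response, the sketch leaves "lander" and the number 123456789 in place, because the detected PERSON span covers only "Jan de Hoog" and the response contains no entity for the digits. Keeping the detected entities and their scores in the log makes gaps like these visible when the result is reviewed.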