Document Processing

DocuDesk provides powerful document processing capabilities that allow you to transform, analyze, and manage documents in various formats. This page explains how document processing works in DocuDesk.

Overview

The Document Processing system in DocuDesk enables you to:

Generate documents from templates
Convert documents between formats
Extract text and metadata from documents
Anonymize personal data in documents
Check documents for accessibility compliance
Validate documents against templates or schemas

Processing Workflow

A typical document processing workflow in DocuDesk consists of the following steps:

Initiation: A user or system initiates a processing operation
Queuing: The operation is queued for processing
Processing: The document is processed according to the requested operation
Logging: The operation is logged in the parsing logs
Privacy Tracking: If applicable, privacy-related metadata is updated
Report Generation: A document report is generated with analysis results
Notification: The user is notified of the operation's completion

Integration with Other Systems

Document processing in DocuDesk is tightly integrated with several other systems:

Document Reports: Stores analysis results for documents
Anonymization Logs: Tracks anonymization operations and replacements
Presidio Integration: Provides entity recognition and anonymization capabilities

This integration ensures that you have a complete audit trail of all document operations and can demonstrate GDPR compliance.

Processing Operations

Text Extraction

Text extraction allows you to extract the textual content from documents in various formats (PDF, Word, etc.). This is useful for:

Indexing documents for search
Analyzing document content
Preparing documents for anonymization

Metadata Extraction

Metadata extraction allows you to extract metadata from documents, such as:

Author information
Creation and modification dates
Document properties
Embedded metadata

Anonymization

Anonymization allows you to remove or mask personal data in documents. DocuDesk uses Microsoft Presidio for powerful entity recognition and anonymization:

Named entity recognition to identify personal data (PERSON, LOCATION, etc.)
Redaction of personal data with customizable replacement text
Confidence scoring for detected entities
Secure key generation for potential de-anonymization
Comprehensive tracking of all anonymization operations

The results of anonymization operations are stored in the Anonymization Log object, which includes:

Original and anonymized text
Detailed information about detected entities
A secure key for potential de-anonymization
A list of all text replacements made

For more information on anonymization, see:

Anonymization Logs for details on the anonymization log object
Presidio Integration for details on how DocuDesk processes Presidio's output

Example Presidio Response

{
    "text": "Mijn naam is Jan de Hooglander, mijn BSN is 123456789 en ik woon in Amsterdam.",
    "entities_found": [
        {
            "entity_type": "PERSON",
            "text": "Jan de Hoog",
            "score": 0.9999997019767761
        },
        {
            "entity_type": "LOCATION",
            "text": "Amsterdam",
            "score": 0.9999990463256836
        },
        {
            "entity_type": "PERSON",
            "text": "BSN",
            "score": 0.85
        }
    ]
}

DocuDesk transforms this response into a comprehensive AnonymizationLog object that tracks all aspects of the anonymization process.

Format Conversion

Format conversion allows you to convert documents between various formats:

PDF to Word
Word to PDF
HTML to PDF
Excel to CSV
And many more

Accessibility Check

Accessibility checking allows you to verify that documents comply with accessibility standards (WCAG):

Checking for proper document structure
Verifying alt text for images
Ensuring proper color contrast
Checking for other accessibility issues

The results of accessibility checks are stored in the Document Report object.

Language Level Analysis

Language level analysis allows you to assess the readability and complexity of document text:

Readability scores (Flesch-Kincaid, SMOG Index, etc.)
Text complexity metrics
Education level assessment
Language improvement suggestions

The results of language level analysis are stored in the Document Report object.

Validation

Validation allows you to verify that documents conform to templates or schemas:

Checking for required fields
Validating field formats
Ensuring document structure matches templates
Verifying document integrity

API Integration

Document processing operations can be initiated through the DocuDesk API. The typical flow is:

Create a parsing log entry with the desired operation type
Monitor the parsing log for status updates
Retrieve the document report or anonymization log when the operation is complete

For detailed API documentation, see the OpenAPI Specification.

Best Practices

Batch Processing: For large numbers of documents, use batch processing to improve efficiency
Error Handling: Implement proper error handling to deal with failed processing operations
Monitoring: Regularly monitor parsing logs to identify issues and bottlenecks
Privacy Compliance: Always update privacy tracking records when processing documents with personal data
Performance Optimization: Configure processing operations for optimal performance based on your needs
Report Analysis: Regularly review document reports to identify compliance issues
Secure Key Management: Store anonymization keys securely for potential de-anonymization
Confidence Thresholds: Adjust confidence thresholds for entity detection based on your needs

Examples

Anonymizing a Document

/**
 * Anonymizes a document and creates the necessary logs and reports
 *
 * @param string $fileId The ID of the file to anonymize
 * @param string $fileName The name of the file
 * @param float $confidenceThreshold The confidence threshold for entity detection (default: 0.7)
 * @return array The anonymization results
 * @throws \Exception If anonymization fails
 */
public function anonymizeDocument(string $fileId, string $fileName, float $confidenceThreshold = 0.7): array
{
    // Create a parsing log entry for anonymization
    $parsingLog = new ParsingLog();
    $parsingLog->setFileId($fileId);
    $parsingLog->setFileName($fileName);
    $parsingLog->setParsingType('anonymization');
    $parsingLog->setStatus('pending');
    $parsingLog->save();

    try {
        // Process the document (this would typically be handled by a background job)
        $processor = new DocumentProcessor();
        $result = $processor->anonymize($fileId, $confidenceThreshold);

        // Update the parsing log
        $parsingLog->setStatus('completed');
        $parsingLog->setCompletedAt(new \DateTime());
        $parsingLog->setOutputFileId($result->getOutputFileId());
        $parsingLog->save();

        // Update the privacy tracking record
        $privacyFile = PrivacyFile::findByFileId($fileId);
        $privacyFile->setAnonymizationStatus('completed');
        $privacyFile->setAnonymizationDate(new \DateTime());
        $privacyFile->save();

        // Create an anonymization log
        $anonymizationLog = new AnonymizationLog();
        $anonymizationLog->setNodeId($fileId);
        $anonymizationLog->setFileName($fileName);
        $anonymizationLog->setStatus('completed');
        $anonymizationLog->setOriginalText($result->getOriginalText());
        $anonymizationLog->setAnonymizedText($result->getAnonymizedText());
        $anonymizationLog->setEntitiesFound($result->getEntitiesFound());
        $anonymizationLog->setEntityReplacements($result->getEntityReplacements());
        $anonymizationLog->setAnonymizationKey($result->generateSecureKey());
        $anonymizationLog->setOutputNodeId($result->getOutputNodeId());
        $anonymizationLog->setTotalEntitiesFound(count($result->getEntitiesFound()));
        $anonymizationLog->setTotalEntitiesReplaced(count($result->getEntityReplacements()));
        $anonymizationLog->save();

        // Create a document report
        $reportData = [
            'node_id' => $fileId,
            'file_name' => $fileName,
            'file_hash' => hash_file('sha256', $result->getOutputFilePath()),
            'analysis_types' => ['anonymization']
        ];

        $reportService = new DocumentReportService();
        $report = $reportService->createReport($reportData);
        $report->setAnonymizationResults($result->getAnonymizationResults());
        $report->setStatus('completed');
        $report->save();

        return [
            'success' => true,
            'anonymization_log_id' => $anonymizationLog->getId(),
            'report_id' => $report->getId(),
            'entities_found' => $anonymizationLog->getTotalEntitiesFound(),
            'entities_replaced' => $anonymizationLog->getTotalEntitiesReplaced()
        ];
    } catch (\Exception $e) {
        // Update the parsing log with error information
        $parsingLog->setStatus('failed');
        $parsingLog->setErrorMessage($e->getMessage());
        $parsingLog->save();
        
        throw $e;
    }
}

Converting a Document

/**
 * Converts a document from one format to another
 *
 * @param string $fileId The ID of the file to convert
 * @param string $fileName The name of the file
 * @param string $outputFormat The desired output format
 * @return array The conversion results
 * @throws \Exception If conversion fails
 */
public function convertDocument(string $fileId, string $fileName, string $outputFormat): array
{
    // Create a parsing log entry for format conversion
    $parsingLog = new ParsingLog();
    $parsingLog->setFileId($fileId);
    $parsingLog->setFileName($fileName);
    $parsingLog->setParsingType('format_conversion');
    $parsingLog->setOutputFormat($outputFormat);
    $parsingLog->setStatus('pending');
    $parsingLog->save();

    try {
        // Process the document
        $processor = new DocumentProcessor();
        $result = $processor->convert($fileId, $outputFormat);

        // Update the parsing log
        $parsingLog->setStatus('completed');
        $parsingLog->setCompletedAt(new \DateTime());
        $parsingLog->setOutputFileId($result->getOutputFileId());
        $parsingLog->save();

        return [
            'success' => true,
            'output_file_id' => $result->getOutputFileId(),
            'output_format' => $outputFormat
        ];
    } catch (\Exception $e) {
        // Update the parsing log with error information
        $parsingLog->setStatus('failed');
        $parsingLog->setErrorMessage($e->getMessage());
        $parsingLog->save();
        
        throw $e;
    }
}

Analyzing Document Accessibility

/**
 * Analyzes a document for WCAG compliance
 *
 * @param string $fileId The ID of the file to analyze
 * @param string $fileName The name of the file
 * @return array The analysis results
 * @throws \Exception If analysis fails
 */
public function analyzeAccessibility(string $fileId, string $fileName): array
{
    // Create a document report for WCAG compliance analysis
    $reportData = [
        'node_id' => $fileId,
        'file_name' => $fileName,
        'file_hash' => hash_file('sha256', '/path/to/document.pdf'),
        'analysis_types' => ['wcag_compliance']
    ];

    $reportService = new DocumentReportService();
    $report = $reportService->createReport($reportData);

    try {
        // Process the document
        $processor = new DocumentProcessor();
        $result = $processor->analyzeAccessibility($fileId);

        // Update the report
        $report->setWcagComplianceResults($result->getWcagResults());
        $report->setStatus('completed');
        $report->save();

        return [
            'success' => true,
            'report_id' => $report->getId(),
            'compliance_level' => $result->getWcagResults()['compliance_level'],
            'total_issues' => $result->getWcagResults()['total_issues']
        ];
    } catch (\Exception $e) {
        // Update the report with error information
        $report->setStatus('failed');
        $report->setErrorMessage($e->getMessage());
        $report->save();
        
        throw $e;
    }
}

Conclusion

Document processing is a core feature of DocuDesk that enables powerful document transformation and analysis capabilities. By integrating with privacy tracking, parsing logs, document reports, and anonymization logs, DocuDesk ensures that all document operations are properly tracked and comply with privacy and accessibility regulations.

For more detailed information on specific aspects of document processing, refer to the following documentation:

Overview​

Processing Workflow​

Integration with Other Systems​

Processing Operations​

Text Extraction​

Metadata Extraction​

Anonymization​

Example Presidio Response​

Format Conversion​

Accessibility Check​

Language Level Analysis​

Validation​

API Integration​

Best Practices​

Examples​

Anonymizing a Document​

Converting a Document​

Analyzing Document Accessibility​

Conclusion​