🔒 GDPR Anonymization

DocuDesk provides powerful document anonymization capabilities to help you comply with GDPR and other privacy regulations. The anonymization feature automatically detects and redacts sensitive personal information in your documents, making it easier to share and process documents while protecting individual privacy.

Overview

The anonymization process follows these steps:

Text Extraction: Extract text content from various document formats
Entity Detection: Identify sensitive information using Microsoft Presidio
Text Anonymization: Replace or mask sensitive information
Document Generation: Create a new anonymized document
Logging: Record anonymization details for compliance purposes

Key Features

Automatic Detection: Identifies various types of personal data without manual tagging
Configurable Anonymization: Customize how different types of information are anonymized
De-anonymization: Option to restore original text when authorized
Detailed Logging: Maintains records of all anonymization operations
Multiple Document Formats: Works with PDF, Word, Excel, and text documents

Gdpr Anonymization

Supported Entity Types

The anonymization service can detect and anonymize various types of sensitive information:

Personal names
Email addresses
Phone numbers
Credit card numbers
Bank account numbers
Social security numbers
Addresses and locations
Dates of birth
IP addresses
Medical license numbers
Passport numbers
Driver's license numbers

Anonymization Methods

DocuDesk supports different anonymization methods for different types of data:

Replacement: Replace the entity with a generic placeholder (e.g., [PERSON])
Masking: Replace characters with a masking character (e.g., ******)
Hashing: Replace the entity with a hash value
Redaction: Completely remove the entity from the text

Automatic Anonymization with Reports

DocuDesk now supports automatic anonymization of documents when reports are processed. When the anonymization feature is enabled in the settings, the system will:

Process the document report as usual, detecting sensitive entities
Automatically create an anonymized version of the document with the same name plus "_anonymized" suffix
Replace all detected sensitive entities with placeholders in the format [ENTITY_TYPE: key]
Store the anonymization log with references to both the original and anonymized files
Update the report with anonymization results

This integration provides a seamless workflow for identifying and anonymizing sensitive information in documents.

How It Works

When a document report is processed:

sequenceDiagram
    participant User
    participant ReportingService
    participant AnonymizationService
    participant PresidioAPI
    participant FileSystem

    User->>ReportingService: Process document report
    ReportingService->>PresidioAPI: Analyze document for entities
    PresidioAPI-->>ReportingService: Return detected entities
    
    alt Anonymization Enabled
        ReportingService->>AnonymizationService: Process anonymization
        AnonymizationService->>FileSystem: Create anonymized file
        AnonymizationService->>AnonymizationService: Replace entities with [TYPE: key]
        AnonymizationService-->>ReportingService: Return anonymization results
        ReportingService->>ReportingService: Update report with anonymization info
    end
    
    ReportingService-->>User: Return completed report

Configuration

Anonymization can be enabled or disabled in the DocuDesk admin settings. When enabled, all documents processed for reporting will automatically be anonymized.

Accessing Anonymized Documents

Anonymized documents are stored in the same folder as the original document, with "_anonymized" added to the filename. For example, if the original document is "contract.pdf", the anonymized version will be "contract_anonymized.pdf".

The report details will include links to both the original and anonymized documents, making it easy to access either version.

Manual Anonymization

In addition to automatic anonymization, you can also manually anonymize documents using the API:

$anonymizationService = \OC::$server->get(OCA\DocuDesk\Service\AnonymizationService::class);
$result = $anonymizationService->anonymizeDocument(
    '/path/to/document.pdf',
    '/path/to/output/anonymized.pdf',
    'doc-123',
    'Important Contract'
);

Retrieving Anonymization Data

You can retrieve anonymization data for a specific file:

// Get anonymization data for a file node
$anonymizationService = \OC::$server->get(OCA\DocuDesk\Service\AnonymizationService::class);
$anonymization = $anonymizationService->getAnonymization($fileNode);

// Or retrieve anonymization by ID
$anonymization = $anonymizationService->getAnonymizationById('anonymization-id');

De-anonymization

If you need to recover the original content, you can de-anonymize a document using the anonymization log ID:

$result = $anonymizationService->deanonymizeDocument(
    'anonymization-log-id',
    '/path/to/output/deanonymized.pdf'
);

Anonymization Logs

You can retrieve anonymization logs to track anonymization activities:

$logs = $anonymizationService->getAnonymizationLogs('doc-123');

Integration with Presidio

The anonymization feature integrates with Microsoft Presidio, an open-source PII detection and anonymization service:

Uses Presidio Analyzer for entity detection
Uses Presidio Anonymizer for text anonymization
Configurable confidence threshold for entity detection
Customizable anonymization operators for different entity types

Configuration

Configure the anonymization feature in the DocuDesk admin settings:

Navigate to Admin Settings > DocuDesk
Set the Presidio Analyzer API URL (default: http://presidio-api:8080/analyze)
Set the Presidio Anonymizer API URL (default: http://presidio-api:8080/anonymize)
Adjust the Confidence Threshold (0.0-1.0) for entity detection sensitivity
Enable or disable the Store Original Text option for de-anonymization capability

OpenRegisters Integration

DocuDesk supports storing anonymization data in OpenRegisters, which provides additional capabilities for data management:

External Storage: Store anonymization logs in external databases through OpenRegisters
Advanced Querying: Use OpenRegisters' query capabilities for complex searches
Data Sharing: Share anonymization logs with other applications that use OpenRegisters

To configure OpenRegisters integration:

Install the OpenRegisters app from the Nextcloud App Store
In DocuDesk settings, select "OpenRegister" as the storage source for anonymization data
Choose the appropriate register and schema for storing the data

This integration is optional - if you prefer to use Nextcloud's internal storage, select "Internal" as the storage source.

Setting Up Presidio

To use the anonymization feature, you need to set up Microsoft Presidio:

Deploy Presidio using Docker or Kubernetes (see Presidio documentation)
Configure the analyzer service with appropriate recognition models
Configure the anonymizer service with desired anonymization methods
Update the DocuDesk settings with your Presidio API URLs

Security Considerations

The anonymization feature is designed with security in mind:

All communication with Presidio is secured
Anonymization logs are stored securely within your Nextcloud instance
Original text storage is optional and can be disabled for higher security
Access to anonymization features can be restricted based on user permissions

Compliance Benefits

Document anonymization supports various compliance scenarios:

GDPR Article 5: Helps implement data minimization principles
GDPR Article 9: Assists in handling special categories of personal data
GDPR Article 17: Supports the right to erasure by anonymizing personal data
GDPR Article 25: Implements privacy by design through data protection measures
GDPR Article 32: Enhances security of processing through pseudonymization

Limitations

Be aware of these limitations:

Detection accuracy depends on Presidio's recognition capabilities
Some context-specific PII may not be detected without custom recognizers
Very large documents may require additional processing time
Image-based documents require OCR before anonymization (not included)
Formatting may be lost in the anonymized document

Best Practices

For optimal results with the anonymization feature:

Test with Sample Data: Verify anonymization effectiveness with representative documents
Adjust Confidence Threshold: Fine-tune the threshold based on your needs
Review Anonymized Documents: Always review automatically anonymized documents
Customize Entity Types: Configure custom entity recognizers for domain-specific data
Implement Access Controls: Restrict access to de-anonymization capabilities

License

DocuDesk is licensed under the European Union Public License (EUPL), version 1.2 only. This is an open-source license that is compatible with many other open-source licenses while providing strong copyleft protections. The full text of the license can be found at https://joinup.ec.europa.eu/collection/eupl/eupl-text-eupl-12.

Key aspects of the EUPL-1.2 license:

You can use, modify, and distribute the software
If you distribute modified versions, you must make the source code available
Compatible with other licenses like GPL, AGPL, MPL, and more
Provides clear patent licensing provisions
Available in all official EU languages with equal validity

Technical Implementation Notes

Dependency Management

The AnonymizationService and ReportingService have a mutual dependency relationship. To avoid circular dependency issues, the AnonymizationService uses setter injection for the ReportingService dependency:

// In ReportingService constructor
public function __construct(
    // ... other dependencies
    AnonymizationService $anonymizationService
) {
    // ... other initializations
    $this->anonymizationService = $anonymizationService;
    
    // Set this service in the anonymization service to avoid circular dependency
    $this->anonymizationService->setReportingService($this);
}

// In AnonymizationService
public function setReportingService(ReportingService $reportingService): void
{
    $this->reportingService = $reportingService;
}

This approach ensures that both services can reference each other without causing infinite loops during dependency resolution.

Overview​

Key Features​

Supported Entity Types​

Anonymization Methods​

Automatic Anonymization with Reports​

How It Works​

Configuration​

Accessing Anonymized Documents​

Manual Anonymization​

Retrieving Anonymization Data​

De-anonymization​

Anonymization Logs​

Integration with Presidio​

Configuration​

OpenRegisters Integration​

Setting Up Presidio​

Security Considerations​

Compliance Benefits​

Limitations​

Best Practices​

License​

Technical Implementation Notes​

Dependency Management​

Overview

Key Features

Supported Entity Types

Anonymization Methods

Automatic Anonymization with Reports

How It Works

Configuration

Accessing Anonymized Documents

Manual Anonymization

Retrieving Anonymization Data

De-anonymization

Anonymization Logs

Integration with Presidio

Configuration

OpenRegisters Integration

Setting Up Presidio

Security Considerations

Compliance Benefits

Limitations

Best Practices

License

Technical Implementation Notes

Dependency Management