Document Reports
DocuDesk provides comprehensive document analysis through its reporting system. This page explains how document reports work and how they can help you ensure your documents meet privacy, accessibility, and readability standards.
Overview
The Document Reports system in DocuDesk enables you to:
- Identify files containing personal data
- Categorize the types of personal data present
- Track anonymization status
- Manage retention periods
- Document the legal basis for processing
- Maintain an audit trail of privacy-related actions
- Analyze documents for personal data that may require anonymization
- Check documents for WCAG accessibility compliance
- Assess the language level and readability of documents
- Track document changes through file hashing
- Generate detailed reports with actionable recommendations
Automatic Report Generation
DocuDesk can automatically generate reports for documents as they are uploaded or modified in Nextcloud. This process works as follows:
- When a file is created or modified in Nextcloud, DocuDesk detects the event
- A document log entry is created to maintain an audit trail
- If reporting is enabled, DocuDesk checks if a report already exists for the current version of the file
- If no report exists (or the file has changed), a new report is created with a 'pending' status
- Depending on the configuration, the report is either:
  - Processed immediately (synchronous processing)
  - Queued for processing by a background job (asynchronous processing)
- The report is updated with the analysis results once processing is complete
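The decision steps above can be sketched as a small function. This is a minimal illustration, not DocuDesk's actual code: the function name `decideReportAction`, the returned action strings, and the array shapes are assumptions made for this example. The audit log entry is written regardless of the outcome, so it is not modeled here.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: decide what to do with a file event.
 *
 * @param array{enable_reporting: bool, synchronous_processing: bool} $config
 * @param array{fileHash: string}|null $existingReport Latest report for the node, if any
 * @return string One of 'skip', 'process_now', 'queue' (names are assumptions)
 */
function decideReportAction(array $config, ?array $existingReport, string $currentHash): string
{
    // Reporting disabled: nothing to create.
    if (!$config['enable_reporting']) {
        return 'skip';
    }

    // A report already covers this exact version of the file.
    if ($existingReport !== null && $existingReport['fileHash'] === $currentHash) {
        return 'skip';
    }

    // A pending report is needed; the configuration picks the processing mode.
    return $config['synchronous_processing'] ? 'process_now' : 'queue';
}

$config = ['enable_reporting' => true, 'synchronous_processing' => false];
echo decideReportAction($config, ['fileHash' => 'abc'], 'def'), PHP_EOL; // prints "queue"
```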
Report Generation Workflow
The following sequence diagram illustrates the report generation process:
Configuration Options
The report generation process can be configured through the DocuDesk settings page:
- Enable Reporting: Turn automatic report generation on or off
- Enable Anonymization: Turn automatic anonymization of sensitive data on or off
- Synchronous Processing: Choose between immediate processing or background job processing
- Confidence Threshold: Set the minimum confidence level for entity detection (0-100%)
- Store Original Text: Choose whether to store the original document text in reports
Processing Modes
DocuDesk supports two processing modes for report generation:
Synchronous Processing
In synchronous mode, reports are generated immediately when a file is created or modified. This provides instant feedback but may impact performance for large files or high-traffic environments.
Asynchronous Processing (Recommended for Production)
In asynchronous mode, reports are queued for processing by a background job that runs periodically. This is more efficient for large environments as it:
- Reduces the impact on user experience
- Allows for better resource management
- Handles large volumes of documents more effectively
- Prevents timeouts when processing large files
The background job processes pending reports in batches, updating their status as they are completed.
Document Report Object
The DocumentReport object is the core component for document analysis. It contains the results of various analyses performed on a document, including anonymization, WCAG compliance, and language level assessments.
Key Properties
| Property | Type | Description |
|---|---|---|
| id | string | Unique identifier for the report |
| nodeId | string | Nextcloud node ID of the document |
| fileName | string | Name of the document |
| filePath | string | Full path to the document in Nextcloud |
| fileType | string | MIME type of the document (e.g., application/pdf) |
| fileExtension | string | File extension (e.g., pdf, docx) |
| fileSize | integer | Size of the file in bytes |
| fileHash | string | Hash of the file content to determine if a new report is needed |
| fileText | string | The extracted text content from the document, used for analysis |
| status | string | Status of the report generation (pending, processing, completed, failed) |
| errorMessage | string | Error message if report processing failed |
| riskScore | float | Numerical score indicating overall risk level (0-100) |
| riskLevel | string | Risk level classification (low, medium, high) based on risk score, or unknown if report is not completed |
| anonymizationResults | object | Results of anonymization analysis |
| entities | object | List of entities found during anonymization |
| wcagComplianceResults | object | Results of WCAG compliance analysis |
| languageLevelResults | object | Results of language level analysis |
| retentionPeriod | integer | Retention period in days (0 for indefinite) |
| retentionExpiry | date-time | Date when the retention period expires |
| legalBasis | string | Legal basis for processing the data under GDPR |
| dataController | string | Name of the data controller |
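The relationship between retentionPeriod and retentionExpiry can be sketched as follows. This assumes the period is counted from a start date chosen by the caller (for example the report's creation time); the page does not say which date DocuDesk uses, and `computeRetentionExpiry` is a hypothetical helper.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: derive retentionExpiry from retentionPeriod (in days).
 * Returns null when the period is 0, which the table defines as indefinite.
 */
function computeRetentionExpiry(DateTimeImmutable $start, int $retentionPeriodDays): ?DateTimeImmutable
{
    if ($retentionPeriodDays < 0) {
        throw new InvalidArgumentException('retentionPeriod must be zero or a positive number of days');
    }
    if ($retentionPeriodDays === 0) {
        return null; // indefinite retention: no expiry date
    }

    return $start->add(new DateInterval('P' . $retentionPeriodDays . 'D'));
}

$expiry = computeRetentionExpiry(new DateTimeImmutable('2024-01-01T00:00:00+00:00'), 30);
echo $expiry?->format(DATE_ATOM), PHP_EOL; // prints "2024-01-31T00:00:00+00:00"
```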
Report Status Values
Reports can have the following status values:
- pending: The report has been created but not yet processed
- processing: The report is currently being processed
- completed: The report has been successfully processed
- failed: The report processing failed (check errorMessage for details)
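One way to guard these status values in client code is a transition table. The allowed transitions below are an assumption inferred from the descriptions above (a completed or failed report returns to pending when it needs re-analysis); DocuDesk may enforce different rules.

```php
<?php
declare(strict_types=1);

// Assumed transition table; the status names come from the list above.
const REPORT_TRANSITIONS = [
    'pending'    => ['processing'],
    'processing' => ['completed', 'failed'],
    'completed'  => ['pending'], // e.g. the file changed and must be re-analyzed
    'failed'     => ['pending'], // e.g. a retry
];

/** Hypothetical helper: true if a report may move from $from to $to. */
function canTransition(string $from, string $to): bool
{
    return in_array($to, REPORT_TRANSITIONS[$from] ?? [], true);
}

var_dump(canTransition('pending', 'processing')); // bool(true)
var_dump(canTransition('pending', 'completed'));  // bool(false)
```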
Handling Non-Text Documents
DocuDesk's anonymization capabilities rely on text extraction from documents. However, certain file types cannot be processed for text content, which affects how DocuDesk handles these documents.
Unsupported Document Types
The following document types typically cannot be processed for text extraction:
- Images: JPEG, PNG, GIF, BMP, WebP, etc.
- Videos: MP4, AVI, MOV, WebM, etc.
- Audio: MP3, WAV, FLAC, etc.
- Binary files: EXE, DLL, etc.
- Encrypted documents: Documents with password protection or encryption
- Scanned documents without OCR: Image-based PDFs without text layers
How DocuDesk Handles These Files
When DocuDesk encounters a file that cannot be processed for text extraction:
- A report is still created for the document
- The report status is set to 'completed'
- The anonymizationResults object will include:
  - containsPersonalData: false (since no text could be analyzed)
  - anonymizationStatus: 'not_required'
  - entitiesFound: [] (empty array)
  - totalEntitiesFound: 0
- The report will include an informational note indicating that text extraction was not possible
- The riskLevel will typically be set to 'unknown' since risk assessment requires text analysis
Example Report for Non-Text Document
{
  "id": "abc123",
  "nodeId": "456",
  "fileName": "image.jpg",
  "filePath": "/path/to/image.jpg",
  "fileType": "image/jpeg",
  "fileExtension": "jpg",
  "fileSize": 1024000,
  "fileHash": "a1b2c3d4e5f6",
  "status": "completed",
  "riskLevel": "unknown",
  "anonymizationResults": {
    "containsPersonalData": false,
    "entitiesFound": [],
    "totalEntitiesFound": 0,
    "dataCategories": [],
    "anonymizationStatus": "not_required"
  },
  "errorMessage": "No text could be extracted from this document type"
}
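The handling described in this section can be sketched in PHP. Both function names are hypothetical, and the MIME-prefix check is a simplification that covers only images, video, and audio; encrypted files and scanned PDFs would need a real extraction attempt to detect.

```php
<?php
declare(strict_types=1);

/** Hypothetical check: MIME types that usually carry no extractable text. */
function isLikelyNonText(string $mimeType): bool
{
    foreach (['image/', 'video/', 'audio/'] as $prefix) {
        if (str_starts_with($mimeType, $prefix)) {
            return true;
        }
    }
    return false;
}

/** Report fields for a file without extractable text, mirroring the example report above. */
function nonTextReportFields(): array
{
    return [
        'status' => 'completed',
        'riskLevel' => 'unknown',
        'anonymizationResults' => [
            'containsPersonalData' => false,
            'entitiesFound' => [],
            'totalEntitiesFound' => 0,
            'dataCategories' => [],
            'anonymizationStatus' => 'not_required',
        ],
        'errorMessage' => 'No text could be extracted from this document type',
    ];
}

$report = ['fileName' => 'image.jpg', 'fileType' => 'image/jpeg'];
if (isLikelyNonText($report['fileType'])) {
    $report = array_merge($report, nonTextReportFields());
}
echo json_encode($report, JSON_PRETTY_PRINT), PHP_EOL;
```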
Best Practices for Non-Text Documents
When working with non-text documents that might contain sensitive information:
- Manual Review: Visually inspect images and videos for personal data
- Metadata Cleaning: Remove EXIF data from images which may contain location or device information
- OCR Processing: Consider using OCR tools on scanned documents before uploading
- Alternative Formats: When possible, provide text-based alternatives for important image-based content
- Custom Tagging: Use DocuDesk's manual tagging features to mark non-text documents that contain sensitive information
Report Creation Process
When a file event occurs (creation or modification), DocuDesk follows these steps to create or update reports:
Simplified Event Handling
DocuDesk has streamlined the report creation process by:
- Centralizing Decision Logic: All decisions about whether to create reports and how to process them are now made in the ReportingService
- Automatic Processing Mode: The system automatically determines whether to process reports synchronously based on configuration settings
- Single Responsibility: Event listeners simply pass events to the ReportingService without making any decisions
This approach ensures consistent behavior and makes the system easier to maintain and extend.
Report Update Logic
When a file is modified, DocuDesk updates the existing report rather than creating a new one:
This logic ensures that reports are always up-to-date with the latest version of a file, while avoiding unnecessary processing when the file content hasn't changed.
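A minimal sketch of this update logic, assuming a report is held as an associative array keyed by the DocumentReport property names. The function `refreshReportForFile` is hypothetical, and clearing the previous analysis results on change is an assumption rather than documented behavior.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: bring an existing report in line with the current file.
 * Returns the report untouched when the content hash has not changed.
 */
function refreshReportForFile(array $report, string $currentHash): array
{
    if (($report['fileHash'] ?? null) === $currentHash) {
        return $report; // same content: no re-analysis needed
    }

    // Content changed: reuse the same report and send it back through processing.
    $report['fileHash'] = $currentHash;
    $report['status'] = 'pending';
    $report['errorMessage'] = null;
    $report['riskLevel'] = 'unknown'; // not completed, so the level is unknown

    // Assumption: stale results are cleared until the new analysis finishes.
    foreach (['anonymizationResults', 'wcagComplianceResults', 'languageLevelResults'] as $key) {
        $report[$key] = null;
    }

    return $report;
}
```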
Analysis Types
Anonymization Analysis
The anonymization analysis identifies personal data in documents that may need to be anonymized for GDPR compliance. It provides:
- Detection of various types of personal data (names, addresses, emails, etc.)
- Count and categorization of personal data instances
- Suggestions for anonymizing personal data
- Confidence scores for detected entities
WCAG Compliance Analysis
The WCAG compliance analysis checks documents for accessibility issues according to the Web Content Accessibility Guidelines. It provides:
- Overall compliance level (A, AA, AAA, or non-compliant)
- Breakdown of issues by severity and WCAG principle
- Detailed list of accessibility issues with recommendations
- Overall compliance score
Language Level Analysis
The language level analysis assesses the readability and complexity of document text. It provides:
- Primary language detection
- Various readability scores (Flesch-Kincaid, SMOG Index, etc.)
- Text complexity metrics
- Estimated education level required to understand the text
- Suggestions for improving language clarity
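As an illustration of one readability score named above, the standard Flesch-Kincaid grade level formula can be computed from three counts. This sketch takes the counts as inputs; how DocuDesk tokenizes text and counts syllables is not described on this page, and `fleschKincaidGrade` is a name chosen for the example.

```php
<?php
declare(strict_types=1);

/**
 * Flesch-Kincaid grade level:
 * 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
 */
function fleschKincaidGrade(int $words, int $sentences, int $syllables): float
{
    if ($words <= 0 || $sentences <= 0) {
        throw new InvalidArgumentException('words and sentences must be greater than zero');
    }

    return 0.39 * ($words / $sentences) + 11.8 * ($syllables / $words) - 15.59;
}

// 100 words in 5 sentences with 150 syllables: 7.8 + 17.7 - 15.59
printf("%.2f\n", fleschKincaidGrade(100, 5, 150)); // prints "9.91"
```

A higher value corresponds to a higher estimated education level needed to read the text.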
Data Categories
DocuDesk recognizes the following categories of personal data:
- name: Names of individuals
- address: Physical addresses
- email: Email addresses
- phone: Phone numbers
- id_number: Identification numbers (passport, SSN, etc.)
- financial: Financial information (bank accounts, credit cards, etc.)
- health: Health-related information
- biometric: Biometric data
- location: Location data
- other: Other types of personal data
Anonymization Status
The anonymization status can be one of the following:
- not_required: The file does not require anonymization
- pending: Anonymization is pending
- in_progress: Anonymization is in progress
- completed: Anonymization is completed
- failed: Anonymization failed
Legal Basis
Under GDPR, personal data processing must have a legal basis. DocuDesk supports tracking the following legal bases:
- consent: The data subject has given consent
- contract: Processing is necessary for a contract
- legal_obligation: Processing is necessary for a legal obligation
- vital_interests: Processing is necessary to protect vital interests
- public_interest: Processing is necessary for a task in the public interest
- legitimate_interests: Processing is necessary for legitimate interests
API Endpoints
DocuDesk provides the following API endpoints for managing document reports:
API Flow
The following diagram illustrates the typical flow when using the report API:
List Document Reports
GET /apps/docudesk/api/v1/reports
Returns a list of document reports. You can filter the reports by:
- node_id: Filter reports by Nextcloud node ID
- status: Filter reports by status
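A small sketch of building the request URL for these filters. The helper `buildReportListUrl` is hypothetical, and validating the status on the client side is a design choice for this example, not something the API is documented to require.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: build the list URL with optional node_id and status filters. */
function buildReportListUrl(?string $nodeId = null, ?string $status = null): string
{
    $allowedStatuses = ['pending', 'processing', 'completed', 'failed'];
    if ($status !== null && !in_array($status, $allowedStatuses, true)) {
        throw new InvalidArgumentException('Unknown report status: ' . $status);
    }

    // Drop filters that were not supplied.
    $query = array_filter(
        ['node_id' => $nodeId, 'status' => $status],
        static fn (?string $value): bool => $value !== null
    );

    $base = '/apps/docudesk/api/v1/reports';
    return $query === [] ? $base : $base . '?' . http_build_query($query, '', '&');
}

echo buildReportListUrl('12345', 'pending'), PHP_EOL;
// prints "/apps/docudesk/api/v1/reports?node_id=12345&status=pending"
```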
Create Document Report
POST /apps/docudesk/api/v1/reports
Creates a new document report. You need to specify:
- node_id: Nextcloud node ID of the document
- file_name: Name of the document
- file_path: Full path to the document in Nextcloud
- file_type: MIME type of the document
- file_extension: File extension of the document
- file_size: Size of the file in bytes
- file_hash: Hash of the file content
- analysis_types: Types of analysis to perform (anonymization, wcag_compliance, language_level)
Get Document Report
GET /apps/docudesk/api/v1/reports/{reportId}
Returns a specific document report by ID.
Update Document Report
PUT /apps/docudesk/api/v1/reports/{reportId}
Updates a specific document report.
Get Latest Report for Node
GET /apps/docudesk/api/v1/reports/node/{nodeId}
Returns the latest document report for a specific Nextcloud node.
Get Report Configuration
GET /apps/docudesk/api/v1/settings/report
Returns the current report configuration settings.
Save Report Configuration
POST /apps/docudesk/api/v1/settings/report
Updates the report configuration settings. You can specify:
- enable_reporting: Whether to enable automatic report generation
- enable_anonymization: Whether to enable automatic anonymization of sensitive data
- synchronous_processing: Whether to process reports immediately
- confidence_threshold: Minimum confidence level for entity detection (0-1)
- store_original_text: Whether to store the original document text in reports
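The settings page expresses the confidence threshold as a percentage (0-100%), while this endpoint takes a value between 0 and 1. The sketch below assumes the two are related by a simple division by 100, which this page does not state explicitly; `buildReportConfigPayload` is a hypothetical helper.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: build the settings payload from UI-style values.
 * Assumption: a percentage threshold maps to the API's 0-1 range by dividing by 100.
 */
function buildReportConfigPayload(
    bool $enableReporting,
    bool $enableAnonymization,
    bool $synchronousProcessing,
    float $confidenceThresholdPercent,
    bool $storeOriginalText
): array {
    if ($confidenceThresholdPercent < 0.0 || $confidenceThresholdPercent > 100.0) {
        throw new InvalidArgumentException('Confidence threshold must be between 0 and 100 percent');
    }

    return [
        'enable_reporting' => $enableReporting,
        'enable_anonymization' => $enableAnonymization,
        'synchronous_processing' => $synchronousProcessing,
        'confidence_threshold' => $confidenceThresholdPercent / 100,
        'store_original_text' => $storeOriginalText,
    ];
}

echo json_encode(buildReportConfigPayload(true, true, false, 70.0, true)), PHP_EOL;
```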
Use Cases
GDPR Compliance
Document reports help ensure GDPR compliance by:
- Identifying documents containing personal data
- Suggesting anonymization methods for sensitive information
- Tracking anonymization status
- Providing an audit trail of privacy-related actions
Accessibility Compliance
Document reports help ensure accessibility compliance by:
- Checking documents against WCAG standards
- Identifying accessibility issues
- Providing recommendations for fixing issues
- Tracking compliance levels over time
Content Readability
Document reports help improve content readability by:
- Assessing the language level of documents
- Identifying complex language
- Suggesting simplifications
- Ensuring content is appropriate for the target audience
Integration with Document Processing
The document reports system integrates with DocuDesk's document processing capabilities:
- Reports can trigger automatic document processing (e.g., anonymization)
- Processing results are reflected in updated reports
- Reports provide a history of document transformations
Examples
Generating a Document Report
// Assumes $client is a configured Guzzle-style HTTP client whose base URI
// points at your Nextcloud instance and which sends valid credentials.

// Create a new report
$reportData = [
    'node_id' => '12345',
    'file_name' => 'important-document.pdf',
    'file_path' => '/Documents/important-document.pdf',
    'file_type' => 'application/pdf',
    'file_extension' => 'pdf',
    'file_size' => 1024567,
    'file_hash' => 'a1b2c3d4e5f6g7h8i9j0',
    'analysis_types' => ['anonymization', 'wcag_compliance', 'language_level'],
];
$response = $client->post('/apps/docudesk/api/v1/reports', [
    'json' => $reportData,
]);
// Cast the body stream to a string before decoding
$report = json_decode((string) $response->getBody(), true);
$reportId = $report['id'];

// Check report status
$response = $client->get('/apps/docudesk/api/v1/reports/' . $reportId);
$report = json_decode((string) $response->getBody(), true);

if ($report['status'] === 'completed') {
    // Process report results; top-level keys follow the DocumentReport properties
    $anonymizationResults = $report['anonymizationResults'];
    $wcagResults = $report['wcagComplianceResults'];
    $languageResults = $report['languageLevelResults'];

    // Take action based on results
    if ($anonymizationResults['containsPersonalData']) {
        // Handle personal data
    }

    // compliance_level and education_level are not listed in the property table
    // on this page; verify the key names against an actual response.
    if (!in_array($wcagResults['compliance_level'], ['AA', 'AAA'], true)) {
        // Address accessibility issues
    }
    if (in_array($languageResults['education_level'], ['graduate', 'professional'], true)) {
        // Simplify language
    }
}
Configuring Report Generation
// Update report configuration
$configData = [
    'enable_reporting' => true,
    'enable_anonymization' => true,
    'synchronous_processing' => false, // Use background jobs
    'confidence_threshold' => 0.7,
    'store_original_text' => true,
];
$response = $client->post('/apps/docudesk/api/v1/settings/report', [
    'json' => $configData,
]);

// Get current report configuration
$response = $client->get('/apps/docudesk/api/v1/settings/report');
// Cast the body stream to a string before decoding
$config = json_decode((string) $response->getBody(), true);
Best Practices
- Asynchronous Processing: For production environments, use asynchronous processing to reduce the impact on performance
- Regular Analysis: Regularly analyze important documents to ensure continued compliance
- Hash-Based Updates: Use file hashing to determine when documents have changed and need re-analysis
- Comprehensive Analysis: Use all three analysis types for critical documents
- Action on Results: Implement a workflow to address issues identified in reports
- Version Tracking: Keep reports for different versions of documents to track improvements
- Confidence Threshold: Adjust the confidence threshold based on your needs (higher for fewer false positives, lower for more comprehensive detection)
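The effect of the confidence threshold can be illustrated with a short filter. The entity shape used here (`entity_type` and `score` keys) is an assumption modeled on typical Presidio analyzer output; the structure DocuDesk stores in a report may differ.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: keep only entities whose confidence score meets the threshold.
 *
 * @param list<array{entity_type: string, score: float}> $entities
 */
function filterEntitiesByConfidence(array $entities, float $threshold): array
{
    return array_values(array_filter(
        $entities,
        static fn (array $entity): bool => ($entity['score'] ?? 0.0) >= $threshold
    ));
}

$entities = [
    ['entity_type' => 'EMAIL_ADDRESS', 'score' => 0.95],
    ['entity_type' => 'PERSON', 'score' => 0.60],
    ['entity_type' => 'PHONE_NUMBER', 'score' => 0.75],
];

// A higher threshold drops the low-confidence PERSON match.
echo count(filterEntitiesByConfidence($entities, 0.7)), PHP_EOL; // prints "2"
// A lower threshold keeps all three, at the cost of more false positives.
echo count(filterEntitiesByConfidence($entities, 0.5)), PHP_EOL; // prints "3"
```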
Conclusion
Document reports provide a powerful way to ensure your documents meet privacy, accessibility, and readability standards. By automatically analyzing documents as they are created or modified, you can maintain compliance with regulations and improve the quality of your content without manual intervention.
File Event Handling
DocuDesk uses Nextcloud's event system to detect file operations and trigger report generation. The following diagram illustrates how file events are handled:
The event listener handles different types of file events:
- NodeCreatedEvent: Triggered when a new file is created
- NodeWrittenEvent: Triggered when a file's content is modified
- NodeDeletedEvent: Triggered when a file is deleted
- NodeTouchedEvent: Triggered when a file's metadata is updated
For file creation and modification events, the listener creates reports if reporting is enabled.
Efficient File Change Detection
DocuDesk uses Nextcloud's ETag (Entity Tag) system when available to efficiently detect file changes:
Using ETag provides several advantages:
- Efficiency: Avoids reading file content for large files
- Accuracy: ETags change whenever file content changes
- Performance: Reduces CPU and I/O overhead
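The "ETag when available" approach can be sketched as a fallback chain. The function name, the lazy content reader, and the choice of SHA-256 for the fallback are assumptions made for this example; the page does not name the hash algorithm DocuDesk uses.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: return a change fingerprint for a file.
 * Uses the ETag when one is available; otherwise hashes the content,
 * which is only read when actually needed.
 *
 * @param callable(): string $readContent Lazily returns the file content
 */
function fileFingerprint(?string $etag, callable $readContent): string
{
    if ($etag !== null && $etag !== '') {
        return $etag; // cheap path: no file read
    }

    return hash('sha256', $readContent()); // fallback: hash the content
}

$reads = 0;
$reader = function () use (&$reads): string {
    $reads++;
    return 'hello';
};

echo fileFingerprint('5f2a9c', $reader), PHP_EOL; // prints "5f2a9c"; the content is never read
echo strlen(fileFingerprint(null, $reader)), PHP_EOL; // prints "64" (hex-encoded SHA-256)
```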
Report Processing Workflow
The report processing workflow involves several steps and state transitions. The following diagram illustrates the lifecycle of a report:
Report Processing Steps
The ReportingService handles report processing through the following steps:
This centralized processing approach ensures consistent handling of reports regardless of how they are triggered (file events, API requests, or background jobs).
Background Job Processing
DocuDesk uses a background job (ProcessPendingReports) to process pending reports asynchronously. This job runs periodically (every 15 minutes by default) and processes a batch of pending reports.
Background Job Sequence
The following sequence diagram illustrates how the background job processes pending reports:
This background processing approach allows DocuDesk to handle large volumes of documents efficiently without impacting user experience.
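A simplified sketch of batch processing as described above. It works on in-memory arrays rather than a database, and the function name, the batch size default, and the `$analyze` callback are all assumptions for illustration; the real ProcessPendingReports job is not shown on this page.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical batch processor: handle up to $batchSize pending reports.
 *
 * @param list<array<string, mixed>> $reports
 * @param callable(array): array $analyze Returns result fields to merge into the report
 * @return list<array<string, mixed>> The reports with updated statuses
 */
function processPendingBatch(array $reports, callable $analyze, int $batchSize = 10): array
{
    $processed = 0;
    foreach ($reports as $index => $report) {
        if ($processed >= $batchSize) {
            break; // the rest waits for the next run of the job
        }
        if ($report['status'] !== 'pending') {
            continue;
        }

        $report['status'] = 'processing';
        try {
            $report = array_merge($report, $analyze($report));
            $report['status'] = 'completed';
        } catch (Throwable $e) {
            // A failing report must not stop the rest of the batch.
            $report['status'] = 'failed';
            $report['errorMessage'] = $e->getMessage();
        }

        $reports[$index] = $report;
        $processed++;
    }

    return $reports;
}
```

Catching errors per report keeps one unreadable file from blocking the remaining reports in the batch.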
Entity Detection and Risk Scoring
DocuDesk uses the Presidio API to detect entities in documents and calculate risk scores based on the detected entities.
Entity Detection Process
The following sequence diagram illustrates how entities are detected and risk scores are calculated:
Risk Score Calculation
The risk score is calculated based on the following factors:
The final risk score determines the risk level:
This risk assessment helps organizations prioritize which documents need attention for privacy compliance.
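To make the idea of a weighted risk score concrete, here is a purely illustrative sketch. The weights, the cap at 100, and the level thresholds are all invented for this example and are not DocuDesk's actual values; only the data category names, the 0-100 score range, and the low, medium, high, and unknown levels come from this page.

```php
<?php
declare(strict_types=1);

// Hypothetical weights per data category (illustrative values only).
const EXAMPLE_ENTITY_WEIGHTS = [
    'health' => 25.0,
    'financial' => 20.0,
    'id_number' => 20.0,
    'biometric' => 25.0,
    'name' => 5.0,
    'email' => 5.0,
];

/**
 * Sum weight * count over all detected categories, capped at 100.
 *
 * @param array<string, int> $entityCounts Category name => number of detections
 */
function calculateRiskScore(array $entityCounts, float $defaultWeight = 5.0): float
{
    $score = 0.0;
    foreach ($entityCounts as $category => $count) {
        $score += (EXAMPLE_ENTITY_WEIGHTS[$category] ?? $defaultWeight) * $count;
    }
    return min(100.0, $score);
}

/** Map a score to a level; null means the report is not completed. Thresholds are assumptions. */
function riskLevelFor(?float $score): string
{
    if ($score === null) {
        return 'unknown';
    }
    if ($score >= 70.0) {
        return 'high';
    }
    return $score >= 30.0 ? 'medium' : 'low';
}

echo riskLevelFor(calculateRiskScore(['name' => 2, 'email' => 1])), PHP_EOL; // prints "low"
```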
Risk Assessment Visualization
DocuDesk provides a comprehensive risk assessment visualization in the document details view. This feature helps users understand:
- The overall risk score of a document (0-100)
- The risk level classification (Low, Medium, High, Critical)
- The specific entities that contribute to the risk assessment
- The weight of each entity type in the risk calculation
The risk visualization includes:
- A color-coded risk score indicator (green for low risk, yellow for medium, red for high/critical)
- A detailed breakdown of detected entity types and their counts
- An explanation of the risk level and recommended actions
- The weighting system used for different types of personal data
This visual representation helps users quickly identify high-risk documents and understand why they are classified as such, enabling more effective privacy management and compliance efforts.
The risk assessment visualization is particularly useful for:
- Privacy officers reviewing document collections
- Compliance teams conducting audits
- Content creators checking their documents before publication
- Administrators monitoring organizational risk levels