Document Reports

DocuDesk provides comprehensive document analysis through its reporting system. This page explains how document reports work and how they can help you ensure your documents meet privacy, accessibility, and readability standards.

Overview

The Document Reports system in DocuDesk enables you to:

  • Identify files containing personal data
  • Categorize the types of personal data present
  • Track anonymization status
  • Manage retention periods
  • Document the legal basis for processing
  • Maintain an audit trail of privacy-related actions
  • Analyze documents for personal data that may require anonymization
  • Check documents for WCAG accessibility compliance
  • Assess the language level and readability of documents
  • Track document changes through file hashing
  • Generate detailed reports with actionable recommendations

Automatic Report Generation

DocuDesk can automatically generate reports for documents as they are uploaded or modified in Nextcloud. This process works as follows:

  1. When a file is created or modified in Nextcloud, DocuDesk detects the event
  2. A document log entry is created to maintain an audit trail
  3. If reporting is enabled, DocuDesk checks if a report already exists for the current version of the file
  4. If no report exists (or the file has changed), a new report is created with a 'pending' status
  5. Depending on the configuration, the report is either:
    • Processed immediately (synchronous processing)
    • Queued for processing by a background job (asynchronous processing)
  6. The report is updated with the analysis results once processing is complete
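
The steps above can be sketched in plain PHP. This is a minimal illustration rather than DocuDesk's actual code: the function name `handleFileEvent`, the array-based report store and queue, and the `$analyze` callback are all assumptions made for the example, and the audit log entry from step 2 is left out.

```php
<?php
// Hypothetical sketch of the flow above; names and data shapes are assumptions,
// not DocuDesk's real classes. The document log entry (step 2) is omitted.

/**
 * @param array    $reports In-memory store: nodeId => report (passed by reference)
 * @param array    $queue   Node IDs waiting for the background job (by reference)
 * @param array    $config  ['enable_reporting' => bool, 'synchronous_processing' => bool]
 * @param callable $analyze Stand-in for the real analysis; receives and returns a report
 */
function handleFileEvent(array &$reports, array &$queue, array $config,
                         string $nodeId, string $fileHash, callable $analyze): ?array
{
    if (!$config['enable_reporting']) {
        return null; // step 3: reporting disabled, nothing to do
    }

    $existing = $reports[$nodeId] ?? null;
    if ($existing !== null && $existing['fileHash'] === $fileHash) {
        return $existing; // a report already covers this version of the file
    }

    // Step 4: a new or changed file gets a pending report.
    $report = ['nodeId' => $nodeId, 'fileHash' => $fileHash, 'status' => 'pending'];

    if ($config['synchronous_processing']) {
        $report = $analyze($report); // step 5: process immediately
    } else {
        $queue[] = $nodeId;          // step 5: leave it for the background job
    }

    $reports[$nodeId] = $report;
    return $report;
}

$reports = [];
$queue = [];
$analyze = fn(array $r): array => array_merge($r, ['status' => 'completed']);

$async = handleFileEvent($reports, $queue,
    ['enable_reporting' => true, 'synchronous_processing' => false], '42', 'abc', $analyze);
$sync = handleFileEvent($reports, $queue,
    ['enable_reporting' => true, 'synchronous_processing' => true], '43', 'def', $analyze);

echo $async['status'], ' ', $sync['status'], PHP_EOL; // pending completed
```

Keeping the decision in one function mirrors the idea that the listener only forwards the event while a single place decides whether and how to process it.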

Report Generation Workflow

The following sequence diagram illustrates the report generation process:

Configuration Options

The report generation process can be configured through the DocuDesk settings page:

  • Enable Reporting: Turn automatic report generation on or off
  • Enable Anonymization: Turn automatic anonymization of sensitive data on or off
  • Synchronous Processing: Choose between immediate processing or background job processing
  • Confidence Threshold: Set the minimum confidence level for entity detection (0-100%)
  • Store Original Text: Choose whether to store the original document text in reports

Processing Modes

DocuDesk supports two processing modes for report generation:

Synchronous Processing

In synchronous mode, reports are generated immediately when a file is created or modified. This provides instant feedback but may impact performance for large files or high-traffic environments.

Asynchronous Processing

In asynchronous mode, reports are queued for processing by a background job that runs periodically. This is more efficient for large environments as it:

  • Reduces the impact on user experience
  • Allows for better resource management
  • Handles large volumes of documents more effectively
  • Prevents timeouts when processing large files

The background job processes pending reports in batches, updating their status as they are completed.

Document Report Object

The DocumentReport object is the core component for document analysis. It contains the results of various analyses performed on a document, including anonymization, WCAG compliance, and language level assessments.

Key Properties

| Property | Type | Description |
|---|---|---|
| id | string | Unique identifier for the report |
| nodeId | string | Nextcloud node ID of the document |
| fileName | string | Name of the document |
| filePath | string | Full path to the document in Nextcloud |
| fileType | string | MIME type of the document (e.g., application/pdf) |
| fileExtension | string | File extension (e.g., pdf, docx) |
| fileSize | integer | Size of the file in bytes |
| fileHash | string | Hash of the file content, used to determine whether a new report is needed |
| fileText | string | The extracted text content from the document, used for analysis |
| status | string | Status of the report generation (pending, processing, completed, failed) |
| errorMessage | string | Error message if report processing failed |
| riskScore | float | Numerical score indicating overall risk level (0-100) |
| riskLevel | string | Risk level classification (low, medium, high) based on risk score, or unknown if the report is not completed |
| anonymizationResults | object | Results of anonymization analysis |
| entities | object | List of entities found during anonymization |
| wcagComplianceResults | object | Results of WCAG compliance analysis |
| languageLevelResults | object | Results of language level analysis |
| retentionPeriod | integer | Retention period in days (0 for indefinite) |
| retentionExpiry | date-time | Date when the retention period expires |
| legalBasis | string | Legal basis for processing the data under GDPR |
| dataController | string | Name of the data controller |
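
As an illustration of how retentionPeriod and retentionExpiry relate, the sketch below derives an expiry date from a period in days. The function name and the choice of start date are assumptions; the text does not say from which moment DocuDesk counts the retention period.

```php
<?php
// Hypothetical helper: derives retentionExpiry from retentionPeriod (days).
// Treating 0 as "no expiry" follows the property description; the start date
// passed in is an assumption, since the source does not define it.

function computeRetentionExpiry(DateTimeImmutable $start, int $retentionPeriodDays): ?DateTimeImmutable
{
    if ($retentionPeriodDays < 0) {
        throw new InvalidArgumentException('retentionPeriod must be zero or positive');
    }
    if ($retentionPeriodDays === 0) {
        return null; // indefinite retention: no expiry date
    }
    return $start->add(new DateInterval('P' . $retentionPeriodDays . 'D'));
}

$start = new DateTimeImmutable('2024-01-01T00:00:00+00:00');
$expiry = computeRetentionExpiry($start, 30);

echo $expiry->format('Y-m-d'), PHP_EOL; // 2024-01-31
```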

Report Status Values

Reports can have the following status values:

  • pending: The report has been created but not yet processed
  • processing: The report is currently being processed
  • completed: The report has been successfully processed
  • failed: The report processing failed (check errorMessage for details)
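
A client that stores or displays these values may want to guard against unexpected ones. The sketch below lists the four documented statuses and adds a transition map; that map is an assumption inferred from the descriptions above (including re-queuing a report when its file changes), not something taken from DocuDesk's code.

```php
<?php
// The four status values come from the list above. The transition map is an
// assumption inferred from their descriptions.

const REPORT_STATUSES = ['pending', 'processing', 'completed', 'failed'];

const ALLOWED_TRANSITIONS = [
    'pending'    => ['processing'],
    'processing' => ['completed', 'failed'],
    'completed'  => ['pending'], // assumed: re-queued when the file changes
    'failed'     => ['pending'], // assumed: re-queued when the file changes
];

function isValidStatus(string $status): bool
{
    return in_array($status, REPORT_STATUSES, true);
}

function canTransition(string $from, string $to): bool
{
    return isValidStatus($from) && in_array($to, ALLOWED_TRANSITIONS[$from], true);
}

var_dump(canTransition('pending', 'processing'));   // bool(true)
var_dump(canTransition('completed', 'processing')); // bool(false)
```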

Handling Non-Text Documents

DocuDesk's anonymization capabilities rely on text extraction from documents. However, certain file types cannot be processed for text content, which affects how DocuDesk handles these documents.

Unsupported Document Types

The following document types typically cannot be processed for text extraction:

  • Images: JPEG, PNG, GIF, BMP, WebP, etc.
  • Videos: MP4, AVI, MOV, WebM, etc.
  • Audio: MP3, WAV, FLAC, etc.
  • Binary files: EXE, DLL, etc.
  • Encrypted documents: Documents with password protection or encryption
  • Scanned documents without OCR: Image-based PDFs without text layers

How DocuDesk Handles These Files

When DocuDesk encounters a file that cannot be processed for text extraction:

  1. A report is still created for the document
  2. The report status is set to 'completed'
  3. The anonymizationResults object will include:
    • containsPersonalData: false (since no text could be analyzed)
    • anonymizationStatus: 'not_required'
    • entitiesFound: [] (empty array)
    • totalEntitiesFound: 0
  4. The report will include an informational note indicating that text extraction was not possible
  5. The riskLevel will typically be set to 'unknown' since risk assessment requires text analysis

Example Report for Non-Text Document

{
  "id": "abc123",
  "nodeId": "456",
  "fileName": "image.jpg",
  "filePath": "/path/to/image.jpg",
  "fileType": "image/jpeg",
  "fileExtension": "jpg",
  "fileSize": 1024000,
  "fileHash": "a1b2c3d4e5f6",
  "status": "completed",
  "riskLevel": "unknown",
  "anonymizationResults": {
    "containsPersonalData": false,
    "entitiesFound": [],
    "totalEntitiesFound": 0,
    "dataCategories": [],
    "anonymizationStatus": "not_required"
  },
  "errorMessage": "No text could be extracted from this document type"
}
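
Because such a report says containsPersonalData: false without any text having been analyzed, a consumer may want to flag it for manual review. The check below is a sketch under one assumption: that a completed report with an unknown risk level signals a document that could not be analyzed, as in the example report above. The function name is hypothetical.

```php
<?php
// Sketch under assumptions: treats "completed + unknown risk level" as the signal
// that no text was analyzed, based on the example report above.

function needsManualReview(array $report): bool
{
    return ($report['status'] ?? null) === 'completed'
        && ($report['riskLevel'] ?? null) === 'unknown';
}

$json = '{"status": "completed", "riskLevel": "unknown",
          "anonymizationResults": {"containsPersonalData": false}}';
$report = json_decode($json, true);

var_dump(needsManualReview($report)); // bool(true)
```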

Best Practices for Non-Text Documents

When working with non-text documents that might contain sensitive information:

  1. Manual Review: Visually inspect images and videos for personal data
  2. Metadata Cleaning: Remove EXIF data from images which may contain location or device information
  3. OCR Processing: Consider using OCR tools on scanned documents before uploading
  4. Alternative Formats: When possible, provide text-based alternatives for important image-based content
  5. Custom Tagging: Use DocuDesk's manual tagging features to mark non-text documents that contain sensitive information

Report Creation Process

When a file event occurs (creation or modification), DocuDesk follows these steps to create or update reports:

Simplified Event Handling

DocuDesk has streamlined the report creation process by:

  1. Centralizing Decision Logic: All decisions about whether to create reports and how to process them are now made in the ReportingService
  2. Automatic Processing Mode: The system automatically determines whether to process reports synchronously based on configuration settings
  3. Single Responsibility: Event listeners simply pass events to the ReportingService without making any decisions

This approach ensures consistent behavior and makes the system easier to maintain and extend.

Report Update Logic

When a file is modified, DocuDesk updates the existing report rather than creating a new one:

This logic ensures that reports are always up-to-date with the latest version of a file, while avoiding unnecessary processing when the file content hasn't changed.
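
One way to picture this update decision is a hash comparison that reuses the existing report. The sketch is illustrative only: the function name, the array shape, resetting the status to 'pending', and the use of SHA-256 are assumptions, since the source does not name the hash algorithm or the exact fields that are reset.

```php
<?php
// Hypothetical sketch of the update decision. SHA-256 is only an example hash;
// the source does not say which algorithm DocuDesk uses.

function refreshReport(array $report, string $fileContent): array
{
    $newHash = hash('sha256', $fileContent);

    if ($report['fileHash'] === $newHash) {
        return $report; // content unchanged: keep the existing results
    }

    // Content changed: reuse the same report (same id) and queue it for re-analysis.
    $report['fileHash'] = $newHash;
    $report['status'] = 'pending';
    return $report;
}

$report = ['id' => 'abc123', 'fileHash' => hash('sha256', 'v1'), 'status' => 'completed'];

$same = refreshReport($report, 'v1');
$changed = refreshReport($report, 'v2');

echo $same['status'], ' ', $changed['status'], ' ', $changed['id'], PHP_EOL;
// completed pending abc123
```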

Analysis Types

Anonymization Analysis

The anonymization analysis identifies personal data in documents that may need to be anonymized for GDPR compliance. It provides:

  • Detection of various types of personal data (names, addresses, emails, etc.)
  • Count and categorization of personal data instances
  • Suggestions for anonymizing personal data
  • Confidence scores for detected entities
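
To show how confidence scores and per-type counts can work together, the sketch below drops entities under a threshold and counts the rest by type. The `entity_type` and `score` keys are an assumed shape for a detected entity; the format DocuDesk actually stores may differ.

```php
<?php
// Sketch only: the entity arrays use 'entity_type' and 'score' keys as an assumed
// shape; DocuDesk's stored entity format may differ.

function summarizeEntities(array $entities, float $threshold): array
{
    $counts = [];
    foreach ($entities as $entity) {
        if ($entity['score'] < $threshold) {
            continue; // below the configured confidence threshold
        }
        $type = $entity['entity_type'];
        $counts[$type] = ($counts[$type] ?? 0) + 1;
    }
    ksort($counts); // stable, alphabetical output
    return $counts;
}

$entities = [
    ['entity_type' => 'PERSON', 'score' => 0.85],
    ['entity_type' => 'EMAIL_ADDRESS', 'score' => 0.95],
    ['entity_type' => 'PERSON', 'score' => 0.40],
];

print_r(summarizeEntities($entities, 0.7));
```

With a threshold of 0.7 the low-confidence PERSON match is dropped, leaving one EMAIL_ADDRESS and one PERSON.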

WCAG Compliance Analysis

The WCAG compliance analysis checks documents for accessibility issues according to the Web Content Accessibility Guidelines. It provides:

  • Overall compliance level (A, AA, AAA, or non-compliant)
  • Breakdown of issues by severity and WCAG principle
  • Detailed list of accessibility issues with recommendations
  • Overall compliance score
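
When acting on the overall compliance level, it can help to treat the levels as ordered. The helper below is a sketch; the function name and the exact 'non-compliant' label are assumptions about how the value is stored.

```php
<?php
// Sketch: ranks the compliance levels listed above so they can be compared.
// The 'non-compliant' label and the function name are assumptions.

function meetsWcagLevel(string $actual, string $required): bool
{
    $rank = ['non-compliant' => 0, 'A' => 1, 'AA' => 2, 'AAA' => 3];
    if (!isset($rank[$actual], $rank[$required])) {
        throw new InvalidArgumentException('Unknown WCAG level');
    }
    return $rank[$actual] >= $rank[$required];
}

var_dump(meetsWcagLevel('AAA', 'AA')); // bool(true)
var_dump(meetsWcagLevel('A', 'AA'));   // bool(false)
```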

Language Level Analysis

The language level analysis assesses the readability and complexity of document text. It provides:

  • Primary language detection
  • Various readability scores (Flesch-Kincaid, SMOG Index, etc.)
  • Text complexity metrics
  • Estimated education level required to understand the text
  • Suggestions for improving language clarity
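
For reference, the Flesch-Kincaid grade level mentioned above is computed from word, sentence, and syllable counts. The function below implements the standard published formula; whether DocuDesk computes it exactly this way is not stated in the source, and counting syllables is left to the caller.

```php
<?php
// Standard Flesch-Kincaid grade level formula:
//   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
// The function name is hypothetical; the counts must be supplied by the caller.

function fleschKincaidGrade(int $words, int $sentences, int $syllables): float
{
    if ($words <= 0 || $sentences <= 0) {
        throw new InvalidArgumentException('words and sentences must be positive');
    }
    return 0.39 * ($words / $sentences) + 11.8 * ($syllables / $words) - 15.59;
}

// 100 words in 5 sentences with 150 syllables:
// 0.39 * 20 + 11.8 * 1.5 - 15.59 = 9.91
printf("%.2f\n", fleschKincaidGrade(100, 5, 150)); // 9.91
```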

Data Categories

DocuDesk recognizes the following categories of personal data:

  • name: Names of individuals
  • address: Physical addresses
  • email: Email addresses
  • phone: Phone numbers
  • id_number: Identification numbers (passport, SSN, etc.)
  • financial: Financial information (bank accounts, credit cards, etc.)
  • health: Health-related information
  • biometric: Biometric data
  • location: Location data
  • other: Other types of personal data

Anonymization Status

The anonymization status can be one of the following:

  • not_required: The file does not require anonymization
  • pending: Anonymization is pending
  • in_progress: Anonymization is in progress
  • completed: Anonymization is completed
  • failed: Anonymization failed

Legal Basis

Under GDPR, personal data processing must have a legal basis. DocuDesk supports tracking the following legal bases:

  • consent: The data subject has given consent
  • contract: Processing is necessary for a contract
  • legal_obligation: Processing is necessary for a legal obligation
  • vital_interests: Processing is necessary to protect vital interests
  • public_interest: Processing is necessary for a task in the public interest
  • legitimate_interests: Processing is necessary for legitimate interests

API Endpoints

DocuDesk provides the following API endpoints for managing document reports:

API Flow

The following diagram illustrates the typical flow when using the report API:

List Document Reports

GET /apps/docudesk/api/v1/reports

Returns a list of document reports. You can filter the reports by:

  • node_id: Filter reports by Nextcloud node ID
  • status: Filter reports by status

Create Document Report

POST /apps/docudesk/api/v1/reports

Creates a new document report. You need to specify:

  • node_id: Nextcloud node ID of the document
  • file_name: Name of the document
  • file_path: Full path to the document in Nextcloud
  • file_type: MIME type of the document
  • file_extension: File extension of the document
  • file_size: Size of the file in bytes
  • file_hash: Hash of the file content
  • analysis_types: Types of analysis to perform (anonymization, wcag_compliance, language_level)

Get Document Report

GET /apps/docudesk/api/v1/reports/{reportId}

Returns a specific document report by ID.

Update Document Report

PUT /apps/docudesk/api/v1/reports/{reportId}

Updates a specific document report.

Get Latest Report for Node

GET /apps/docudesk/api/v1/reports/node/{nodeId}

Returns the latest document report for a specific Nextcloud node.

Get Report Configuration

GET /apps/docudesk/api/v1/settings/report

Returns the current report configuration settings.

Save Report Configuration

POST /apps/docudesk/api/v1/settings/report

Updates the report configuration settings. You can specify:

  • enable_reporting: Whether to enable automatic report generation
  • enable_anonymization: Whether to enable automatic anonymization of sensitive data
  • synchronous_processing: Whether to process reports immediately
  • confidence_threshold: Minimum confidence level for entity detection, given as a fraction from 0 to 1 (for example, 0.7); the settings page shows this setting as 0-100%
  • store_original_text: Whether to store the original document text in reports
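
Before sending a settings payload, a client can run a quick sanity check. The sketch below uses the key names from the list above; the validation rules themselves (booleans for the switches, a number from 0 to 1 for the threshold) are assumptions about what the endpoint accepts.

```php
<?php
// Sketch: checks a settings payload before sending it. Key names come from the
// list above; the validation rules are assumptions.

function validateReportConfig(array $config): array
{
    $errors = [];
    $booleans = ['enable_reporting', 'enable_anonymization',
                 'synchronous_processing', 'store_original_text'];

    foreach ($booleans as $key) {
        if (array_key_exists($key, $config) && !is_bool($config[$key])) {
            $errors[] = "$key must be a boolean";
        }
    }

    if (array_key_exists('confidence_threshold', $config)) {
        $t = $config['confidence_threshold'];
        $isNumber = is_int($t) || is_float($t);
        if (!$isNumber || $t < 0 || $t > 1) {
            $errors[] = 'confidence_threshold must be a number between 0 and 1';
        }
    }

    return $errors; // empty array means the payload looks valid
}

// A percentage (70) instead of a fraction (0.7) is reported as an error.
print_r(validateReportConfig(['enable_reporting' => true, 'confidence_threshold' => 70]));
```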

Use Cases

GDPR Compliance

Document reports help ensure GDPR compliance by:

  • Identifying documents containing personal data
  • Suggesting anonymization methods for sensitive information
  • Tracking anonymization status
  • Providing an audit trail of privacy-related actions

Accessibility Compliance

Document reports help ensure accessibility compliance by:

  • Checking documents against WCAG standards
  • Identifying accessibility issues
  • Providing recommendations for fixing issues
  • Tracking compliance levels over time

Content Readability

Document reports help improve content readability by:

  • Assessing the language level of documents
  • Identifying complex language
  • Suggesting simplifications
  • Ensuring content is appropriate for the target audience

Integration with Document Processing

The document reports system integrates with DocuDesk's document processing capabilities:

  • Reports can trigger automatic document processing (e.g., anonymization)
  • Processing results are reflected in updated reports
  • Reports provide a history of document transformations

Examples

Generating a Document Report

// $client is assumed to be a Guzzle-style HTTP client (for example GuzzleHttp\Client)
// configured with your Nextcloud base URI and credentials.

// Create a new report
$reportData = [
    'node_id' => '12345',
    'file_name' => 'important-document.pdf',
    'file_path' => '/Documents/important-document.pdf',
    'file_type' => 'application/pdf',
    'file_extension' => 'pdf',
    'file_size' => 1024567,
    'file_hash' => 'a1b2c3d4e5f6g7h8i9j0',
    'analysis_types' => ['anonymization', 'wcag_compliance', 'language_level'],
];

$response = $client->post('/apps/docudesk/api/v1/reports', [
    'json' => $reportData,
]);

// Cast the body stream to a string before decoding
$report = json_decode((string) $response->getBody(), true);
$reportId = $report['id'];

// Check report status
$response = $client->get('/apps/docudesk/api/v1/reports/' . $reportId);
$report = json_decode((string) $response->getBody(), true);

if ($report['status'] === 'completed') {
    // Read the analysis results (property names as listed under Key Properties)
    $anonymizationResults = $report['anonymizationResults'];
    $wcagResults = $report['wcagComplianceResults'];
    $languageResults = $report['languageLevelResults'];

    // Take action based on results
    if ($anonymizationResults['containsPersonalData']) {
        // Handle personal data
    }

    if (!in_array($wcagResults['compliance_level'], ['AA', 'AAA'], true)) {
        // Address accessibility issues
    }

    if (in_array($languageResults['education_level'], ['graduate', 'professional'], true)) {
        // Simplify language
    }
}

Configuring Report Generation

// $client is assumed to be a Guzzle-style HTTP client, as in the previous example.

// Update report configuration
$configData = [
    'enable_reporting' => true,
    'enable_anonymization' => true,
    'synchronous_processing' => false, // Use background jobs
    'confidence_threshold' => 0.7,
    'store_original_text' => true,
];

$response = $client->post('/apps/docudesk/api/v1/settings/report', [
    'json' => $configData,
]);

// Get current report configuration
$response = $client->get('/apps/docudesk/api/v1/settings/report');
$config = json_decode((string) $response->getBody(), true);

Best Practices

  1. Asynchronous Processing: For production environments, use asynchronous processing to reduce the impact on performance
  2. Regular Analysis: Regularly analyze important documents to ensure continued compliance
  3. Hash-Based Updates: Use file hashing to determine when documents have changed and need re-analysis
  4. Comprehensive Analysis: Use all three analysis types for critical documents
  5. Action on Results: Implement a workflow to address issues identified in reports
  6. Version Tracking: Keep reports for different versions of documents to track improvements
  7. Confidence Threshold: Adjust the confidence threshold based on your needs (higher for fewer false positives, lower for more comprehensive detection)

Conclusion

Document reports provide a powerful way to ensure your documents meet privacy, accessibility, and readability standards. By automatically analyzing documents as they are created or modified, you can maintain compliance with regulations and improve the quality of your content without manual intervention.

File Event Handling

DocuDesk uses Nextcloud's event system to detect file operations and trigger report generation. The following diagram illustrates how file events are handled:

The event listener handles different types of file events:

  • NodeCreatedEvent: Triggered when a new file is created
  • NodeWrittenEvent: Triggered when a file's content is modified
  • NodeDeletedEvent: Triggered when a file is deleted
  • NodeTouchedEvent: Triggered when a file's metadata is updated

For file creation and modification events, the listener creates reports if reporting is enabled.
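
The rule in the previous sentence can be sketched as a small predicate. Event names are matched as plain strings to keep the example self-contained, and the function name is hypothetical. How DocuDesk treats delete and touch events is not described here, so the sketch simply does not create reports for them.

```php
<?php
// Sketch: decides per event type whether a report should be created.
// Delete and touch events are not covered by the text, so they return false here.

function shouldCreateReport(string $eventName, bool $reportingEnabled): bool
{
    $reportingEvents = ['NodeCreatedEvent', 'NodeWrittenEvent'];
    return $reportingEnabled && in_array($eventName, $reportingEvents, true);
}

var_dump(shouldCreateReport('NodeWrittenEvent', true)); // bool(true)
var_dump(shouldCreateReport('NodeDeletedEvent', true)); // bool(false)
var_dump(shouldCreateReport('NodeCreatedEvent', false)); // bool(false)
```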

Efficient File Change Detection

DocuDesk uses Nextcloud's ETag (Entity Tag) system when available to efficiently detect file changes:

Using ETag provides several advantages:

  • Efficiency: Avoids reading file content for large files
  • Accuracy: ETags change whenever file content changes
  • Performance: Reduces CPU and I/O overhead
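
A minimal sketch of this idea: use the ETag as the change fingerprint when one is available, and only read and hash the file content as a fallback. The function name, the lazy `$readContent` callback, and the SHA-256 fallback are assumptions for illustration.

```php
<?php
// Sketch: prefer the ETag and only read the file content as a fallback.
// Function name, callback, and the SHA-256 fallback are assumptions.

function changeFingerprint(?string $etag, callable $readContent): string
{
    if ($etag !== null && $etag !== '') {
        return $etag; // no file read needed
    }
    return hash('sha256', $readContent());
}

$reads = 0;
$reader = function () use (&$reads): string {
    $reads++; // count how often the content is actually read
    return 'file contents';
};

echo changeFingerprint('5f3a9c', $reader), PHP_EOL; // 5f3a9c
echo $reads, PHP_EOL;                               // 0
```

Passing the reader as a callback means the potentially expensive read only happens when no ETag is present.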

Report Processing Workflow

The report processing workflow involves several steps and state transitions. The following diagram illustrates the lifecycle of a report:

Report Processing Steps

The ReportingService handles report processing through the following steps:

This centralized processing approach ensures consistent handling of reports regardless of how they are triggered (file events, API requests, or background jobs).

Background Job Processing

DocuDesk uses a background job (ProcessPendingReports) to process pending reports asynchronously. This job runs periodically (every 15 minutes by default) and processes a batch of pending reports.

Background Job Sequence

The following sequence diagram illustrates how the background job processes pending reports:

This background processing approach allows DocuDesk to handle large volumes of documents efficiently without impacting user experience.
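
A single run of such a job might look like the sketch below. It is not the actual ProcessPendingReports implementation: the batch size, the array shapes, and the `$analyze` callback are assumptions, and only the move from pending to completed or failed comes from the text.

```php
<?php
// Sketch of one background-job run. Batch size, array shapes, and the $analyze
// callback are assumptions.

function processPendingBatch(array $reports, callable $analyze, int $batchSize = 10): array
{
    $processed = 0;
    foreach ($reports as $id => $report) {
        if ($processed >= $batchSize) {
            break; // the rest waits for the next run
        }
        if ($report['status'] !== 'pending') {
            continue;
        }
        $report['status'] = 'processing';
        try {
            $report = array_merge($report, $analyze($report), ['status' => 'completed']);
        } catch (Throwable $e) {
            $report['status'] = 'failed';
            $report['errorMessage'] = $e->getMessage();
        }
        $reports[$id] = $report;
        $processed++;
    }
    return $reports;
}

$analyze = function (array $report): array {
    if ($report['fileName'] === 'broken.pdf') {
        throw new RuntimeException('Text extraction failed');
    }
    return ['riskScore' => 0.0];
};

$reports = [
    1 => ['fileName' => 'a.pdf', 'status' => 'pending'],
    2 => ['fileName' => 'broken.pdf', 'status' => 'pending'],
    3 => ['fileName' => 'c.pdf', 'status' => 'pending'],
];

foreach (processPendingBatch($reports, $analyze, 2) as $id => $report) {
    echo $id, ': ', $report['status'], PHP_EOL;
}
// 1: completed
// 2: failed
// 3: pending
```

Catching the failure per report keeps one unreadable file from blocking the rest of the batch.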

Entity Detection and Risk Scoring

DocuDesk uses the Presidio API to detect entities in documents and calculate risk scores based on the detected entities.

Entity Detection Process

The following sequence diagram illustrates how entities are detected and risk scores are calculated:

Risk Score Calculation

The risk score is calculated based on the following factors:

The final risk score determines the risk level:

This risk assessment helps organizations prioritize which documents need attention for privacy compliance.

Risk Assessment Visualization

DocuDesk provides a comprehensive risk assessment visualization in the document details view. This feature helps users understand:

  1. The overall risk score of a document (0-100)
  2. The risk level classification (Low, Medium, High, Critical)
  3. The specific entities that contribute to the risk assessment
  4. The weight of each entity type in the risk calculation

The risk visualization includes:

  • A color-coded risk score indicator (green for low risk, yellow for medium, red for high/critical)
  • A detailed breakdown of detected entity types and their counts
  • An explanation of the risk level and recommended actions
  • The weighting system used for different types of personal data

This visual representation helps users quickly identify high-risk documents and understand why they are classified as such, enabling more effective privacy management and compliance efforts.
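
The colour coding described above can be expressed as a simple mapping. This is a sketch: the function name, the lowercase level strings, and the grey fallback for unknown levels are assumptions.

```php
<?php
// Sketch: maps a risk level to the indicator colour described above.
// The grey fallback for unrecognized or 'unknown' levels is an assumption.

function riskColor(string $riskLevel): string
{
    switch (strtolower($riskLevel)) {
        case 'low':
            return 'green';
        case 'medium':
            return 'yellow';
        case 'high':
        case 'critical':
            return 'red';
        default:
            return 'grey';
    }
}

echo riskColor('Medium'), PHP_EOL;   // yellow
echo riskColor('critical'), PHP_EOL; // red
```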

The risk assessment visualization is particularly useful for:

  • Privacy officers reviewing document collections
  • Compliance teams conducting audits
  • Content creators checking their documents before publication
  • Administrators monitoring organizational risk levels