Document Reporting
DocuDesk's document reporting feature provides comprehensive analysis of documents to identify sensitive information and assess potential privacy risks. This feature integrates with Microsoft Presidio to detect and report on personally identifiable information (PII) and other sensitive data within your documents.
Overview
The document reporting system:
- Extracts text from various document formats
- Analyzes the text for sensitive information using Presidio
- Generates detailed reports with risk assessments
- Stores reports for future reference and compliance purposes
- Provides an intuitive interface for viewing and managing reports
User Interface
Reports View
The reports interface has been redesigned to provide better usability and overview:
View Modes
- Table View: Displays reports in a sortable, paginated table with key information at a glance
- Card View: Shows reports as individual cards with detailed statistics
Table Features
- Sortable Columns: Click column headers to sort by name, status, risk level, file size, etc.
- Pagination: Navigate through large numbers of reports with configurable page sizes (default: 20 items)
- Row Selection: Click on any row to view detailed report information in the sidebar
- Quick Actions: Access edit, download, and delete actions directly from the table
Header Actions
- View Mode Toggle: Switch between table and card views
- Add Report: Create new reports
- Refresh: Reload the reports list
- Statistics: Open the reports overview sidebar
Sidebar Interface
The new sidebar system provides detailed information without leaving the main view:
Reports Overview Sidebar
- Filter Options: Filter reports by status (completed, processing, failed) and risk level (high, medium, low)
- System Statistics: View total reports, file sizes, and risk distribution
- Recent Activity: See the most recently created or updated reports
- Settings: Access report configuration options
Individual Report Sidebar
When a report is selected, the detail sidebar shows:
- Overview Tab: Status, risk assessment, file information, and error details
- Entities Tab: Detailed list of detected sensitive entities with confidence scores
- Compliance Tab: WCAG compliance results and language level analysis
- Retention Tab: Data retention policies and legal basis information
Sidebar Features
- Tabbed Interface: Organized information into logical sections
- Action Buttons: Quick access to edit, download, and delete functions
- Risk Visualization: Visual risk score indicators and explanations
- Entity Summary: Count and breakdown of detected entity types
Key Features
- Entity Detection: Identifies various types of sensitive information (names, emails, credit cards, etc.)
- Risk Scoring: Calculates risk scores based on the type and quantity of sensitive data
- Detailed Reports: Provides comprehensive reports with entity counts and risk levels
- Metadata Analysis: Includes document metadata in the analysis
- Historical Tracking: Maintains a history of document analyses for compliance
- Intuitive Interface: Modern table view with detailed sidebars for efficient report management
- Pagination: Handle large numbers of reports with built-in pagination
- Filtering: Filter reports by various criteria for quick access
Supported Entity Types
The reporting system can detect various types of sensitive information, including:
- Personal names
- Email addresses
- Phone numbers
- Credit card numbers
- Bank account numbers
- Social security numbers
- Addresses and locations
- Dates of birth
- IP addresses
- Medical license numbers
- Passport numbers
- Driver's license numbers
Risk Assessment
Each report includes a risk assessment with:
- Risk Score: A numerical score (0-100) indicating the overall risk level
- Risk Level: A categorical assessment (Low, Medium, High, Critical)
- Entity Counts: Breakdown of detected entities by type
- Context Information: Document metadata and processing details
- Visual Indicators: Color-coded badges and circular progress indicators for quick risk identification
Using the Reports Interface
Viewing Reports
- Navigate to the Reports section in the main menu
- Choose between Table or Card view using the toggle buttons
- Use the pagination controls to navigate through multiple pages of reports
- Click on any report row to view detailed information in the sidebar
Managing Reports
- Create New Report: Click the 'Add Report' button in the header
- Edit Report: Use the edit action in the table or sidebar
- Download Report: Access download functionality from actions menu
- Delete Report: Remove reports using the delete action (with confirmation)
Using Filters
- Click the 'Statistics' button to open the overview sidebar
- Use the filter dropdowns to narrow down reports by:
- Status (completed, processing, failed)
- Risk Level (high, medium, low)
- View system-wide statistics and recent activity
Programming Interface
You can generate reports programmatically:
// Example: Generate a report for a document
$reportingService = \OC::$server->get(OCA\DocuDesk\Service\ReportingService::class);
$report = $reportingService->generateReport('/path/to/document.pdf', 'doc-123', 'Important Contract');
// Example: Retrieve a previously generated report
$report = $reportingService->getReport('report-id');
// Example: Get all reports for a document
$reports = $reportingService->getReports('doc-123');
Integration with Presidio
The reporting feature integrates with Microsoft Presidio, an open-source PII detection service:
- Sends extracted text to Presidio for analysis
- Configurable confidence threshold for entity detection
- Customizable entity types and detection rules
- Support for multiple languages (depending on Presidio configuration)
Configuration
Configure the reporting feature in the DocuDesk admin settings:
- Navigate to Admin Settings > DocuDesk
- Set the Presidio API URL (default: http://presidio-api:8080/analyze)
- Adjust the Confidence Threshold (0.0-1.0) for entity detection sensitivity
- Enable or disable the reporting feature
Setting Up Presidio
To use the reporting feature, you need to set up Microsoft Presidio:
- Deploy Presidio using Docker or Kubernetes (see Presidio documentation)
- Configure the analyzer service with appropriate recognition models
- Update the DocuDesk settings with your Presidio API URL
Performance Considerations
Document reporting can be resource-intensive:
- Process large documents asynchronously
- Consider batching multiple documents for analysis
- Implement caching for frequently accessed reports
- Monitor Presidio resource usage for large-scale deployments
- Use pagination to handle large numbers of reports efficiently
Security and Privacy
The reporting feature is designed with security in mind:
- All communication with Presidio is secured
- Reports are stored securely within your Nextcloud instance
- Access to reports can be restricted based on user permissions
- No sensitive data is sent to external services beyond Presidio
Compliance Use Cases
Document reporting supports various compliance scenarios:
- GDPR Compliance: Identify documents containing personal data
- PCI DSS: Detect credit card information in documents
- HIPAA: Identify documents with protected health information
- Data Minimization: Support data minimization efforts by identifying unnecessary PII
- Data Mapping: Help create data maps by identifying where sensitive data resides
Limitations
Be aware of these limitations:
- Detection accuracy depends on Presidio's recognition capabilities
- Some context-specific PII may not be detected without custom recognizers
- Very large documents may require additional processing time
- Image-based documents require OCR before analysis (not included)
- Pagination is limited to 20 items per page by default (configurable)