AuditSphere ML Service - Overview
Introduction
The AuditSphere ML Service is a machine learning-powered anomaly detection microservice that analyzes audit events from all connected cloud providers (Microsoft 365, Google Workspace, Box, Dropbox) in real time. It trains one Isolation Forest model per provider, learning what “normal” activity looks like for each platform independently, and flags anything that deviates from established patterns.
Architecture
How it works:
| Connection | Purpose |
|---|---|
| Cron → ML Service | Scheduled job triggers anomaly detection every 15 minutes |
| API → ML Service | Sends batch of audit events for analysis |
| ML Service → Database | Fetches historical events for model training |
| ML Service → API | Returns anomaly detection results |
What It Detects
The ML service identifies six categories of suspicious activity:
1. Unusual Timing
Activity occurring outside normal business hours for a user.
| Indicator | Example |
|---|---|
| Off-hours access | Employee downloading files at 2 AM |
| Weekend activity | File access when user has no weekend history |
| Holiday access | Operations during company holidays |
2. Bulk File Operations
Unusually large numbers of file operations in a short time.
| Indicator | Example |
|---|---|
| Mass downloads | 500 files downloaded in 10 minutes |
| Bulk deletions | Rapid deletion from shared folders |
| Data staging | Copying files to personal folders |
3. External & Guest Access
Suspicious activity from users outside the organization.
| Indicator | Example |
|---|---|
| Guest access | Guest user accessing HR documents |
| External contractor | Contractor downloading from restricted areas |
| External markers | Users with #EXT# accessing sensitive data |
4. Suspicious Location
Access from unexpected geographic locations or devices.
| Indicator | Example |
|---|---|
| Location change | NY user suddenly accessing from Eastern Europe |
| Impossible travel | Multiple countries within hours |
| New device | Unknown device accessing sensitive files |
5. Sensitive Operations
High-risk actions that could expose data.
| Indicator | Example |
|---|---|
| Public sharing | Creating anonymous links for confidential files |
| Admin changes | Adding new site collection administrators |
| Permission changes | Broadening access to restricted folders |
6. Unusual Volume
Activity levels far exceeding normal patterns.
| Indicator | Example |
|---|---|
| Spike in access | 5 files/day user suddenly accessing 200 |
| Inactive user surge | Rarely active user performing hundreds of operations |
Technology Stack
| Component | Technology | Purpose |
|---|---|---|
| Framework | FastAPI | High-performance async API server |
| ML Algorithm | Isolation Forest | Unsupervised anomaly detection |
| ML Library | scikit-learn | Model training and inference |
| Runtime | Python 3.12+ | Service execution |
| Deployment | Railway | Cloud hosting |
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| / | GET | Service status |
| /health | GET | Health check |
| /api/v1/anomaly/detect | POST | Batch anomaly detection (routes events to per-provider models) |
| /api/v1/anomaly/score | POST | Single event scoring (provider-aware) |
| /api/v1/anomaly/train | POST | Train models (groups events by provider, trains each separately) |
| /api/v1/anomaly/model/status | GET | Legacy Microsoft model status |
| /api/v1/anomaly/model/all | GET | All per-provider model statuses |
How Detection Works
How it works:
- Audit Event - Raw event arrives (user, operation, timestamp, file, IP, etc.)
- Feature Extraction - Transforms event into 16 numeric features (hour, day, guest flag, bulk operation, etc.)
- Isolation Forest - ML model scores how “isolated” (unusual) the event is compared to training data
- Detection Result - Returns anomaly flag, confidence score, and anomaly type (unusual_timing, bulk_operation, etc.)
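The scoring step can be illustrated with scikit-learn directly. This is a minimal sketch on synthetic data, not the service's code: the random training matrix, the variable names, and the choice to negate `score_samples` to get a 0-1 value where higher means more unusual are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for "normal" history: 500 events x 16 numeric features.
normal_events = rng.normal(loc=0.0, scale=1.0, size=(500, 16))

# Same hyperparameters as the documented defaults (N_ESTIMATORS, CONTAMINATION).
model = IsolationForest(n_estimators=200, contamination=0.10, random_state=0)
model.fit(normal_events)

typical_event = np.zeros((1, 16))      # sits in the middle of the training data
unusual_event = np.full((1, 16), 8.0)  # far outside anything seen in training

# score_samples() is lower for more isolated points; negating it gives a
# value between 0 and 1 where higher means more unusual.
typical_score = -model.score_samples(typical_event)[0]
unusual_score = -model.score_samples(unusual_event)[0]

# predict() returns -1 for anomalies and 1 for inliers.
is_anomaly = model.predict(unusual_event)[0] == -1
```

Here `unusual_score` comes out higher than `typical_score`, and `is_anomaly` is true for the far-away event.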
Feature Extraction
The model analyzes 16 features from each audit event:
Temporal Features
- Hour of day
- Day of week
- Weekend flag
- Business hours flag
User Behavior
- Event count (1h window)
- Event count (24h window)
- Unique sites accessed
- Unique operations performed
Access Patterns
- Guest user flag
- External access flag
- New IP address
- Unusual location
Operation Context
- Operation category
- Sensitive operation flag
- File type category
- Bulk operation flag
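As an illustration of how a few of these features could be derived, the sketch below computes the four temporal features and a windowed event count. The function names, the dictionary keys, and the decision to evaluate business hours in UTC (ignoring BUSINESS_TIMEZONES) are assumptions, not the service's actual implementation.

```python
from datetime import datetime, timedelta, timezone


def temporal_features(creation_time: str, start_hour: int = 9, end_hour: int = 18) -> dict:
    """Derive the four temporal features from an ISO 8601 timestamp (UTC assumed)."""
    ts = datetime.fromisoformat(creation_time.replace("Z", "+00:00"))
    is_weekend = ts.weekday() >= 5  # Monday is 0, so 5 and 6 are Saturday and Sunday
    in_hours = start_hour <= ts.hour < end_hour
    return {
        "hour_of_day": ts.hour,
        "day_of_week": ts.weekday(),
        "is_weekend": int(is_weekend),
        "is_business_hours": int(in_hours and not is_weekend),
    }


def count_recent(timestamps: list, now: datetime, window: timedelta) -> int:
    """Count events that fall inside the window ending at `now`."""
    return sum(1 for t in timestamps if now - window < t <= now)


features = temporal_features("2025-01-15T02:30:00Z")
```

For the 2:30 AM Wednesday timestamp used in the detection example, this yields hour 2, day 2, and both flags set to 0.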
Scoring
Each detected anomaly includes:
| Field | Description |
|---|---|
| anomaly_score | How unusual the activity is (0-1, higher = more suspicious) |
| confidence | Model certainty about the detection |
| anomaly_type | Category of suspicious behavior detected |
| contributing_factors | Specific reasons for flagging |
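For intuition about the 0-1 range, the sketch below implements the standard anomaly score from the original Isolation Forest algorithm, where an event that is isolated after fewer random splits scores closer to 1. Whether the service reports exactly this quantity as anomaly_score is not documented, so treat that link as an assumption; the function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649


def average_path_length(n_samples: int) -> float:
    """Expected path length of an unsuccessful binary search tree lookup over n points."""
    if n_samples <= 1:
        return 0.0
    if n_samples == 2:
        return 1.0
    harmonic = math.log(n_samples - 1) + EULER_GAMMA  # approximates H(n - 1)
    return 2.0 * harmonic - 2.0 * (n_samples - 1) / n_samples


def anomaly_score(mean_path_length: float, n_samples: int) -> float:
    """s = 2 ** (-E[h(x)] / c(n)); shorter paths give scores closer to 1."""
    return 2.0 ** (-mean_path_length / average_path_length(n_samples))


quick_to_isolate = anomaly_score(2.0, 256)   # isolated after about 2 splits
hard_to_isolate = anomaly_score(12.0, 256)   # needs about 12 splits
```

An event whose mean path length equals the expected length for the sample size scores exactly 0.5, which is why scores well above 0.5 are the interesting ones.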
Per-Provider Model Architecture
The ML service maintains one Isolation Forest model per cloud provider:
| Provider | Model File | Notes |
|---|---|---|
| Microsoft | models/isolation_forest.joblib | Legacy filename for backwards compatibility |
| Google | models/google_isolation_forest.joblib | Created when Google events are available |
| Box | models/box_isolation_forest.joblib | Created when Box events are available |
| Dropbox | models/dropbox_isolation_forest.joblib | Created when Dropbox events are available |
Why separate models? Each provider has different operation semantics; Google sharing patterns, for example, differ from SharePoint's. A single model would learn cross-provider noise and reduce accuracy.
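The naming scheme in the table could be captured by a small lookup helper. This is a sketch only: `MODEL_DIR` and `model_path` are hypothetical names, and the real service may resolve paths differently.

```python
from pathlib import Path

MODEL_DIR = Path("models")  # assumed location, taken from the table above


def model_path(provider: str) -> Path:
    """Return the model file for a provider, keeping the legacy Microsoft name."""
    provider = provider.lower()
    if provider == "microsoft":
        # Legacy filename without a provider prefix, kept for backwards compatibility.
        return MODEL_DIR / "isolation_forest.joblib"
    return MODEL_DIR / f"{provider}_isolation_forest.joblib"


google_file = model_path("google").as_posix()
```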
Provider-Specific Operation Mappings
| Category | Microsoft | Google | Box | Dropbox |
|---|---|---|---|---|
| Access | FileAccessed, FileDownloaded | view, download | PREVIEW, DOWNLOAD | file_preview, file_download |
| Modification | FileModified, FileUploaded | edit, upload | UPLOAD, EDIT | file_edit, file_add |
| Deletion | FileDeleted, FileRecycled | trash, delete | DELETE, TRASH | file_delete |
| Sharing | SharingSet, AnonymousLinkCreated | change_acl, share | COLLABORATION_INVITE | shared_link_create |
When the normalized operation_category field is present in the event input, it’s used directly instead of looking up the native operation name.
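A lookup of this kind might be sketched as follows. Only the Microsoft and Dropbox columns are included for brevity; the `"other"` fallback for unmapped operations and the names `OPERATION_CATEGORIES` and `resolve_category` are assumptions for illustration.

```python
# Native operation name -> normalized category, per provider (subset of the table).
OPERATION_CATEGORIES = {
    "microsoft": {
        "FileAccessed": "access", "FileDownloaded": "access",
        "FileModified": "modification", "FileUploaded": "modification",
        "FileDeleted": "deletion", "FileRecycled": "deletion",
        "SharingSet": "sharing", "AnonymousLinkCreated": "sharing",
    },
    "dropbox": {
        "file_preview": "access", "file_download": "access",
        "file_edit": "modification", "file_add": "modification",
        "file_delete": "deletion",
        "shared_link_create": "sharing",
    },
}


def resolve_category(event: dict) -> str:
    """Prefer the normalized operation_category; otherwise map the native name."""
    explicit = event.get("operation_category")
    if explicit:
        return explicit
    provider = event.get("provider", "microsoft")  # documented default provider
    mapping = OPERATION_CATEGORIES.get(provider, {})
    return mapping.get(event.get("operation", ""), "other")  # "other" is an assumed fallback


category = resolve_category({"provider": "dropbox", "operation": "file_delete"})
```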
Model Training
Auto-Training (Per Provider)
The service supports automatic training from the AuditSphere database, per provider:
- On Startup: For each provider with a saved model, loads it. If no model exists and AUTO_TRAIN_ON_STARTUP=true, fetches events from audit_activities filtered by provider
- On Detection: If a detection request arrives for a provider with no trained model, auto-trains from the database first, then detects
- Fallback: If the audit_activities table is empty or doesn’t exist, falls back to the legacy audit_events table (Microsoft only)
Training via API
The /train endpoint groups events by provider and trains each model separately:
POST /api/v1/anomaly/train
{
"events": [
{"event_id": "1", "provider": "google", "operation": "view", ...},
{"event_id": "2", "provider": "google", "operation": "edit", ...},
{"event_id": "3", "provider": "microsoft", "operation": "FileAccessed", ...}
]
}
Response: "google: trained with 2 events; microsoft: trained with 1 events"
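The grouping step can be pictured with a few lines of Python. This is a sketch under assumptions: `group_by_provider` is a hypothetical helper, and events without a provider are assigned to "microsoft" to match the documented default.

```python
from collections import defaultdict


def group_by_provider(events: list) -> dict:
    """Bucket events by provider so each bucket can train its own model."""
    groups = defaultdict(list)
    for event in events:
        groups[event.get("provider", "microsoft")].append(event)
    return dict(groups)


events = [
    {"event_id": "1", "provider": "google", "operation": "view"},
    {"event_id": "2", "provider": "google", "operation": "edit"},
    {"event_id": "3", "provider": "microsoft", "operation": "FileAccessed"},
]
counts = {provider: len(batch) for provider, batch in group_by_provider(events).items()}
```

With the three events from the request above, `counts` holds 2 for google and 1 for microsoft.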
Training Requirements
| Requirement | Value |
|---|---|
| Minimum events per provider | 100 |
| Recommended events | 1000+ per provider |
| Data freshness | Recent activity preferred |
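A caller could check these thresholds before requesting training. The sketch below is illustrative only: `training_readiness` and its three labels are hypothetical, with the two thresholds taken from the table above.

```python
def training_readiness(event_count: int, minimum: int = 100, recommended: int = 1000) -> str:
    """Classify a provider's event count against the documented thresholds."""
    if event_count < minimum:
        return "insufficient"  # below the minimum events per provider
    if event_count < recommended:
        return "minimum"       # trainable, but under the recommended volume
    return "recommended"


provider_counts = {"microsoft": 2000, "google": 150, "box": 40}
readiness = {name: training_readiness(n) for name, n in provider_counts.items()}
```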
Check Model Status
# All provider models
curl https://ml-service.example.com/api/v1/anomaly/model/all
# Legacy Microsoft model only
curl https://ml-service.example.com/api/v1/anomaly/model/status
Event Input Format
The service accepts both legacy (Microsoft-only) and new (provider-agnostic) formats:
New Format (recommended)
{
"event_id": "evt_123",
"provider": "google",
"creation_time": "2025-03-15T14:30:00Z",
"operation": "file.accessed",
"operation_category": "access",
"user_id": "user@company.com",
"user_type": "user",
"resource_path": "/My Drive/Reports",
"resource_name": "quarterly.xlsx"
}
Legacy Format (still supported)
{
"event_id": "evt_456",
"creation_time": "2025-03-15T14:30:00Z",
"operation": "FileDownloaded",
"user_id": "user@company.com",
"user_type": 0,
"site_url": "https://company.sharepoint.com/sites/hr",
"source_file_name": "salaries.xlsx"
}
When provider is absent, the service defaults to "microsoft".
Configuration
| Variable | Description | Default |
|---|---|---|
| PORT | Server port | 8000 |
| DATABASE_URL | PostgreSQL connection for auto-training | - |
| AUTO_TRAIN_ON_STARTUP | Enable auto-training | true |
| MIN_TRAINING_EVENTS | Minimum events for training | 100 |
| CONTAMINATION | Expected anomaly proportion | 0.10 |
| N_ESTIMATORS | Number of trees in forest | 200 |
| BUSINESS_HOURS_START | Business hours start | 9 |
| BUSINESS_HOURS_END | Business hours end | 18 |
| BUSINESS_TIMEZONES | Comma-separated business timezones | Asia/Kolkata,Australia/Melbourne |
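One way to read these variables with their documented defaults is sketched below. The `Settings` class, the `load_settings` helper, and the rule for parsing booleans are assumptions; the service's own configuration loader may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    port: int
    database_url: Optional[str]
    auto_train_on_startup: bool
    min_training_events: int
    contamination: float
    n_estimators: int
    business_hours_start: int
    business_hours_end: int
    business_timezones: list


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build settings from environment variables, using the table's defaults."""
    env = os.environ if env is None else env
    zones = env.get("BUSINESS_TIMEZONES", "Asia/Kolkata,Australia/Melbourne")
    return Settings(
        port=int(env.get("PORT", "8000")),
        database_url=env.get("DATABASE_URL"),  # no default
        # Assumed parsing rule: only "true" (any case) enables the flag.
        auto_train_on_startup=env.get("AUTO_TRAIN_ON_STARTUP", "true").strip().lower() == "true",
        min_training_events=int(env.get("MIN_TRAINING_EVENTS", "100")),
        contamination=float(env.get("CONTAMINATION", "0.10")),
        n_estimators=int(env.get("N_ESTIMATORS", "200")),
        business_hours_start=int(env.get("BUSINESS_HOURS_START", "9")),
        business_hours_end=int(env.get("BUSINESS_HOURS_END", "18")),
        business_timezones=[z.strip() for z in zones.split(",") if z.strip()],
    )


defaults = load_settings({})
```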
Integration with AuditSphere
Detection Flow
- Cron Job triggers every 15 minutes
- AuditSphere API fetches unprocessed audit events
- ML Service receives batch of events
- Feature extraction transforms events into 16 numeric features
- Isolation Forest scores each event
- Results returned with anomaly flags and scores
- Anomalies stored in database for user review
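From the caller's side, the flow above amounts to posting a batch and keeping the flagged results. The sketch below builds the request with the standard library and filters a response of the documented shape; `build_detect_request` and `flagged_results` are hypothetical helpers, and the request is constructed without being sent.

```python
import json
from urllib import request


def build_detect_request(base_url: str, events: list) -> request.Request:
    """Prepare a POST to the batch detection endpoint (nothing is sent here)."""
    body = json.dumps({"events": events}).encode("utf-8")
    return request.Request(
        f"{base_url.rstrip('/')}/api/v1/anomaly/detect",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def flagged_results(response: dict) -> list:
    """Keep only the results the service marked as anomalies."""
    return [r for r in response.get("results", []) if r.get("is_anomaly")]


req = build_detect_request(
    "https://ml-service.example.com",
    [{"event_id": "evt_123", "operation": "FileDownloaded", "user_id": "user@company.com"}],
)

# A response of the documented shape, used here in place of a live call.
sample_response = {
    "results": [
        {"event_id": "evt_123", "is_anomaly": True, "anomaly_score": 0.85,
         "confidence": 0.92, "anomaly_type": "unusual_timing"},
    ],
    "total_events": 1,
    "anomalies_detected": 1,
}
anomalies = flagged_results(sample_response)
```

Passing `req` to `urllib.request.urlopen` would perform the call; the anomalies list is what would then be stored for user review.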
API Usage Examples
Detect Anomalies
curl -X POST https://ml-service.example.com/api/v1/anomaly/detect \
-H "Content-Type: application/json" \
-d '{
"events": [
{
"event_id": "evt_123",
"operation": "FileDownloaded",
"user_id": "user@company.com",
"creation_time": "2025-01-15T02:30:00Z",
"site_url": "https://company.sharepoint.com/sites/hr",
"source_file_name": "salaries.xlsx",
"client_ip": "203.0.113.50",
"user_type": 0,
"event_count_1h": 5,
"event_count_24h": 50
}
]
}'
Response
{
"results": [
{
"event_id": "evt_123",
"is_anomaly": true,
"anomaly_score": 0.85,
"confidence": 0.92,
"anomaly_type": "unusual_timing"
}
],
"total_events": 1,
"anomalies_detected": 1,
"processing_time_ms": 45
}
Check Model Status
curl https://ml-service.example.com/api/v1/anomaly/model/status
{
"model_loaded": true,
"training_samples": 2000,
"contamination": 0.1,
"last_trained": "2025-12-15T10:30:00"
}
Deployment
The ML service is deployed separately from the main AuditSphere application:
| Platform | Configuration |
|---|---|
| Railway | Auto-detected Python deployment |
| Docker | Available via Dockerfile |
| Manual | uvicorn app.main:app --port 8000 |
Environment Setup
# Required
DATABASE_URL=postgresql://user:pass@host/db?sslmode=require
# Optional
PORT=8000
AUTO_TRAIN_ON_STARTUP=true
CONTAMINATION=0.10
N_ESTIMATORS=200
Document Information
| Property | Value |
|---|---|
| Version | 1.0 |
| Last Updated | December 2025 |
| Classification | Client Documentation |