
AuditSphere ML Service - Overview

Introduction

The AuditSphere ML Service is a machine learning-powered anomaly detection microservice that analyzes audit events from all connected cloud providers (Microsoft 365, Google Workspace, Box, Dropbox) in near real time: a scheduled job submits new events for analysis every 15 minutes. It trains one Isolation Forest model per provider, learning what “normal” activity looks like for each platform independently, and flags events that deviate from those learned patterns.


Architecture

How it works:

| Connection | Purpose |
| --- | --- |
| Cron → ML Service | Scheduled job triggers anomaly detection every 15 minutes |
| API → ML Service | Sends batch of audit events for analysis |
| ML Service → Database | Fetches historical events for model training |
| ML Service → API | Returns anomaly detection results |

What It Detects

The ML service identifies six categories of suspicious activity:

1. Unusual Timing

Activity occurring outside normal business hours for a user.

| Indicator | Example |
| --- | --- |
| Off-hours access | Employee downloading files at 2 AM |
| Weekend activity | File access when user has no weekend history |
| Holiday access | Operations during company holidays |

2. Bulk File Operations

Unusually large numbers of file operations in a short time.

| Indicator | Example |
| --- | --- |
| Mass downloads | 500 files downloaded in 10 minutes |
| Bulk deletions | Rapid deletion from shared folders |
| Data staging | Copying files to personal folders |

3. External & Guest Access

Suspicious activity from users outside the organization.

| Indicator | Example |
| --- | --- |
| Guest access | Guest user accessing HR documents |
| External contractor | Contractor downloading from restricted areas |
| External markers | Users with #EXT# accessing sensitive data |

4. Suspicious Location

Access from unexpected geographic locations or devices.

| Indicator | Example |
| --- | --- |
| Location change | NY user suddenly accessing from Eastern Europe |
| Impossible travel | Multiple countries within hours |
| New device | Unknown device accessing sensitive files |
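The "impossible travel" indicator can be illustrated with a simple speed check between two consecutive events from the same user. This is a minimal sketch, not the service's actual logic: it assumes each event has already been resolved to a latitude and longitude (for example via IP geolocation), and the names `haversine_km`, `is_impossible_travel`, and the 900 km/h threshold are hypothetical.

```python
import math
from datetime import datetime, timezone

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_SPEED_KMH = 900.0  # assumption: roughly airliner cruising speed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_impossible_travel(prev, curr):
    """prev and curr are (timestamp, latitude, longitude) tuples for one user."""
    hours = abs((curr[0] - prev[0]).total_seconds()) / 3600
    hours = max(hours, 1 / 3600)  # avoid dividing by zero for simultaneous events
    distance = haversine_km(prev[1], prev[2], curr[1], curr[2])
    return distance / hours > MAX_PLAUSIBLE_SPEED_KMH


# Hypothetical events: New York at 09:00 UTC, Warsaw two hours later.
new_york = (datetime(2025, 1, 15, 9, 0, tzinfo=timezone.utc), 40.71, -74.01)
warsaw = (datetime(2025, 1, 15, 11, 0, tzinfo=timezone.utc), 52.23, 21.01)
print(is_impossible_travel(new_york, warsaw))
```

A check like this would feed a flag such as "unusual location" into the model rather than raise an alert on its own.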

5. Sensitive Operations

High-risk actions that could expose data.

| Indicator | Example |
| --- | --- |
| Public sharing | Creating anonymous links for confidential files |
| Admin changes | Adding new site collection administrators |
| Permission changes | Broadening access to restricted folders |

6. Unusual Volume

Activity levels far exceeding normal patterns.

| Indicator | Example |
| --- | --- |
| Spike in access | 5 files/day user suddenly accessing 200 |
| Inactive user surge | Rarely active user performing hundreds of operations |

Technology Stack

| Component | Technology | Purpose |
| --- | --- | --- |
| Framework | FastAPI | High-performance async API server |
| ML Algorithm | Isolation Forest | Unsupervised anomaly detection |
| ML Library | scikit-learn | Model training and inference |
| Runtime | Python 3.12+ | Service execution |
| Deployment | Railway | Cloud hosting |

API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| / | GET | Service status |
| /health | GET | Health check |
| /api/v1/anomaly/detect | POST | Batch anomaly detection (routes events to per-provider models) |
| /api/v1/anomaly/score | POST | Single event scoring (provider-aware) |
| /api/v1/anomaly/train | POST | Train models (groups events by provider, trains each separately) |
| /api/v1/anomaly/model/status | GET | Legacy Microsoft model status |
| /api/v1/anomaly/model/all | GET | All per-provider model statuses |

How Detection Works

How it works:

  1. Audit Event - Raw event arrives (user, operation, timestamp, file, IP, etc.)
  2. Feature Extraction - Transforms event into 16 numeric features (hour, day, guest flag, bulk operation, etc.)
  3. Isolation Forest - ML model scores how “isolated” (unusual) the event is compared to training data
  4. Detection Result - Returns anomaly flag, confidence score, and anomaly type (unusual_timing, bulk_operation, etc.)
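Steps 2 and 3 can be tried out directly with scikit-learn's `IsolationForest`. The sketch below is illustrative only: it uses two toy features instead of the service's 16, the training data is synthetic, and how the service converts raw scores into its 0-1 `anomaly_score` is not shown because the source does not describe it. The `n_estimators` and `contamination` values mirror the documented defaults.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic "normal" activity: [hour_of_day, event_count_1h]
normal = np.column_stack([
    rng.integers(9, 18, size=500),   # daytime hours 9..17
    rng.poisson(5, size=500),        # a handful of events per hour
])

model = IsolationForest(n_estimators=200, contamination=0.10, random_state=0)
model.fit(normal)

# One typical event and one 2 AM burst of 400 events
candidates = np.array([[14, 5], [2, 400]])
labels = model.predict(candidates)      # 1 = inlier, -1 = anomaly
raw = model.score_samples(candidates)   # lower = easier to isolate

for row, label, score in zip(candidates, labels, raw):
    print(row.tolist(), "anomaly" if label == -1 else "normal", round(float(score), 3))
```

Isolation Forest needs no labeled anomalies: points that random splits separate from the rest in few steps receive lower scores, which is why it suits audit logs where confirmed incidents are rare.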

Feature Extraction

The model analyzes 16 features from each audit event:

Temporal Features

  • Hour of day
  • Day of week
  • Weekend flag
  • Business hours flag
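The four temporal features can be derived from an event's `creation_time`. This is a minimal sketch under stated assumptions: the function name and output keys are hypothetical, business hours are treated as weekdays only, and the hour bounds mirror the BUSINESS_HOURS_START and BUSINESS_HOURS_END defaults.

```python
from datetime import datetime, timedelta, timezone

BUSINESS_HOURS_START = 9   # mirrors the BUSINESS_HOURS_START default
BUSINESS_HOURS_END = 18    # mirrors the BUSINESS_HOURS_END default


def temporal_features(creation_time, tz=timezone.utc):
    """Turn an ISO 8601 timestamp into the four temporal features."""
    # Normalize a trailing "Z" so older Python versions can parse it.
    ts = datetime.fromisoformat(creation_time.replace("Z", "+00:00")).astimezone(tz)
    is_weekend = ts.weekday() >= 5  # Monday is 0, so 5 and 6 are Sat/Sun
    in_hours = BUSINESS_HOURS_START <= ts.hour < BUSINESS_HOURS_END
    return {
        "hour_of_day": ts.hour,
        "day_of_week": ts.weekday(),
        "is_weekend": int(is_weekend),
        # Assumption: weekend daytime does not count as business hours.
        "is_business_hours": int(in_hours and not is_weekend),
    }


print(temporal_features("2025-01-15T02:30:00Z"))

# The same instant viewed in a UTC+05:30 business timezone
ist = timezone(timedelta(hours=5, minutes=30))
print(temporal_features("2025-01-15T02:30:00Z", tz=ist))
```

Evaluating the timestamp in the user's business timezone matters: 02:30 UTC is off-hours in London but 08:00 in India.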

User Behavior

  • Event count (1h window)
  • Event count (24h window)
  • Unique sites accessed
  • Unique operations performed
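The two event-count features are rolling-window counts over a user's recent activity. The sketch below shows one way to compute them with a binary search over sorted timestamps; the helper names are hypothetical, and the unique-site and unique-operation features are not covered here.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def count_in_window(sorted_times, now, window):
    """Count events in the half-open interval (now - window, now]."""
    start = bisect_right(sorted_times, now - window)
    end = bisect_right(sorted_times, now)
    return end - start


def behavior_counts(user_times, now):
    """Return the 1h and 24h event counts for one user's timestamps."""
    times = sorted(user_times)
    return {
        "event_count_1h": count_in_window(times, now, timedelta(hours=1)),
        "event_count_24h": count_in_window(times, now, timedelta(hours=24)),
    }


now = datetime(2025, 1, 15, 14, 0)
history = [
    datetime(2025, 1, 14, 13, 0),   # older than 24 hours
    datetime(2025, 1, 15, 10, 0),
    datetime(2025, 1, 15, 13, 10),
    datetime(2025, 1, 15, 13, 50),
    datetime(2025, 1, 15, 14, 0),   # the current event
]
print(behavior_counts(history, now))
```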

Access Patterns

  • Guest user flag
  • External access flag
  • New IP address
  • Unusual location
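Two of these flags are simple lookups. The sketch below assumes the external flag is derived from the `#EXT#` marker in the user ID and that the caller supplies the set of IP addresses already seen for that user; the function and key names are hypothetical, and the guest and location flags are left out because the source does not say how they are encoded.

```python
def access_flags(event, known_ips):
    """Derive external-access and new-IP flags for a single event."""
    user_id = event.get("user_id", "")
    client_ip = event.get("client_ip")
    return {
        # "#EXT#" marks externally invited accounts in Microsoft user IDs.
        "is_external": int("#EXT#" in user_id.upper()),
        # An event with no IP is not treated as coming from a new address.
        "is_new_ip": int(client_ip is not None and client_ip not in known_ips),
    }


event = {
    "user_id": "partner_example.org#EXT#@company.onmicrosoft.com",
    "client_ip": "203.0.113.50",
}
print(access_flags(event, known_ips={"198.51.100.7"}))
```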

Operation Context

  • Operation category
  • Sensitive operation flag
  • File type category
  • Bulk operation flag

Scoring

Each detected anomaly includes:

| Field | Description |
| --- | --- |
| anomaly_score | How unusual the activity is (0-1, higher = more suspicious) |
| confidence | Model certainty about the detection |
| anomaly_type | Category of suspicious behavior detected |
| contributing_factors | Specific reasons for flagging |

Per-Provider Model Architecture

The ML service maintains one Isolation Forest model per cloud provider:

| Provider | Model File | Notes |
| --- | --- | --- |
| Microsoft | models/isolation_forest.joblib | Legacy filename for backwards compatibility |
| Google | models/google_isolation_forest.joblib | Created when Google events are available |
| Box | models/box_isolation_forest.joblib | Created when Box events are available |
| Dropbox | models/dropbox_isolation_forest.joblib | Created when Dropbox events are available |

Why separate models? Each provider has different operation semantics — Google sharing patterns differ from SharePoint. A single model would learn cross-provider noise and reduce accuracy.
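The file-naming scheme in the table can be expressed as a small lookup. This is a sketch of the convention only; `model_path` is a hypothetical helper, not a function the service is known to expose.

```python
from pathlib import Path


def model_path(provider=None, models_dir="models"):
    """Resolve the model file for a provider, following the table above."""
    name = (provider or "microsoft").lower()
    if name == "microsoft":
        # Microsoft keeps the unprefixed legacy filename.
        return Path(models_dir) / "isolation_forest.joblib"
    return Path(models_dir) / f"{name}_isolation_forest.joblib"


for provider in ("microsoft", "google", "box", "dropbox"):
    print(provider, "->", model_path(provider).as_posix())
```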

Provider-Specific Operation Mappings

| Category | Microsoft | Google | Box | Dropbox |
| --- | --- | --- | --- | --- |
| Access | FileAccessed, FileDownloaded | view, download | PREVIEW, DOWNLOAD | file_preview, file_download |
| Modification | FileModified, FileUploaded | edit, upload | UPLOAD, EDIT | file_edit, file_add |
| Deletion | FileDeleted, FileRecycled | trash, delete | DELETE, TRASH | file_delete |
| Sharing | SharingSet, AnonymousLinkCreated | change_acl, share | COLLABORATION_INVITE | shared_link_create |

When the normalized operation_category field is present in the event input, it’s used directly instead of looking up the native operation name.
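The lookup described above might be sketched as follows. The operation names come from the mapping table; the lowercase category labels (other than "access", which appears in the event example), the "other" fallback, and the `categorize` helper are assumptions made for illustration.

```python
OPERATION_CATEGORIES = {
    "microsoft": {
        "FileAccessed": "access", "FileDownloaded": "access",
        "FileModified": "modification", "FileUploaded": "modification",
        "FileDeleted": "deletion", "FileRecycled": "deletion",
        "SharingSet": "sharing", "AnonymousLinkCreated": "sharing",
    },
    "google": {
        "view": "access", "download": "access",
        "edit": "modification", "upload": "modification",
        "trash": "deletion", "delete": "deletion",
        "change_acl": "sharing", "share": "sharing",
    },
    "box": {
        "PREVIEW": "access", "DOWNLOAD": "access",
        "UPLOAD": "modification", "EDIT": "modification",
        "DELETE": "deletion", "TRASH": "deletion",
        "COLLABORATION_INVITE": "sharing",
    },
    "dropbox": {
        "file_preview": "access", "file_download": "access",
        "file_edit": "modification", "file_add": "modification",
        "file_delete": "deletion",
        "shared_link_create": "sharing",
    },
}


def categorize(event):
    """Prefer the normalized category; otherwise map the native operation."""
    if event.get("operation_category"):
        return event["operation_category"]
    provider = event.get("provider") or "microsoft"
    mapping = OPERATION_CATEGORIES.get(provider, {})
    return mapping.get(event.get("operation"), "other")  # assumed fallback


print(categorize({"operation": "FileDownloaded"}))
print(categorize({"provider": "box", "operation": "TRASH"}))
```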

Model Training

Auto-Training (Per Provider)

The service supports automatic training from the AuditSphere database, per provider:

  1. On Startup: For each provider, loads the saved model if one exists. If no model exists and AUTO_TRAIN_ON_STARTUP=true, fetches events from audit_activities filtered by provider and trains a new model from them
  2. On Detection: If a detection request arrives for a provider with no trained model, auto-trains from the database first, then detects
  3. Fallback: If the audit_activities table is empty or doesn’t exist, falls back to the legacy audit_events table (Microsoft only)

Training via API

The /train endpoint groups events by provider and trains each model separately:

    POST /api/v1/anomaly/train
    {
      "events": [
        {"event_id": "1", "provider": "google", "operation": "view", ...},
        {"event_id": "2", "provider": "google", "operation": "edit", ...},
        {"event_id": "3", "provider": "microsoft", "operation": "FileAccessed", ...}
      ]
    }

Response: "google: trained with 2 events; microsoft: trained with 1 events"
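The grouping step behind this response can be sketched in a few lines. The code below covers only the routing and the summary string, not the training itself; the helper names are hypothetical, and the alphabetical provider order in the summary is an assumption.

```python
from collections import defaultdict


def group_by_provider(events):
    """Bucket events per provider, defaulting to "microsoft" when absent."""
    groups = defaultdict(list)
    for event in events:
        groups[event.get("provider") or "microsoft"].append(event)
    return dict(groups)


def training_summary(groups):
    """Build a summary string shaped like the documented response."""
    return "; ".join(
        f"{provider}: trained with {len(batch)} events"
        for provider, batch in sorted(groups.items())
    )


events = [
    {"event_id": "1", "provider": "google", "operation": "view"},
    {"event_id": "2", "provider": "google", "operation": "edit"},
    {"event_id": "3", "provider": "microsoft", "operation": "FileAccessed"},
]
print(training_summary(group_by_provider(events)))
```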

Training Requirements

| Requirement | Value |
| --- | --- |
| Minimum events per provider | 100 |
| Recommended events | 1000+ per provider |
| Data freshness | Recent activity preferred |

Check Model Status

    # All provider models
    curl https://ml-service.example.com/api/v1/anomaly/model/all

    # Legacy Microsoft model only
    curl https://ml-service.example.com/api/v1/anomaly/model/status

Event Input Format

The service accepts both legacy (Microsoft-only) and new (provider-agnostic) formats:

New Format (provider-agnostic)

    {
      "event_id": "evt_123",
      "provider": "google",
      "creation_time": "2025-03-15T14:30:00Z",
      "operation": "file.accessed",
      "operation_category": "access",
      "user_id": "user@company.com",
      "user_type": "user",
      "resource_path": "/My Drive/Reports",
      "resource_name": "quarterly.xlsx"
    }

Legacy Format (still supported)

    {
      "event_id": "evt_456",
      "creation_time": "2025-03-15T14:30:00Z",
      "operation": "FileDownloaded",
      "user_id": "user@company.com",
      "user_type": 0,
      "site_url": "https://company.sharepoint.com/sites/hr",
      "source_file_name": "salaries.xlsx"
    }

When provider is absent, defaults to "microsoft".


Configuration

| Variable | Description | Default |
| --- | --- | --- |
| PORT | Server port | 8000 |
| DATABASE_URL | PostgreSQL connection for auto-training | - |
| AUTO_TRAIN_ON_STARTUP | Enable auto-training | true |
| MIN_TRAINING_EVENTS | Minimum events for training | 100 |
| CONTAMINATION | Expected anomaly proportion | 0.10 |
| N_ESTIMATORS | Number of trees in forest | 200 |
| BUSINESS_HOURS_START | Business hours start | 9 |
| BUSINESS_HOURS_END | Business hours end | 18 |
| BUSINESS_TIMEZONES | Comma-separated business timezones | Asia/Kolkata,Australia/Melbourne |
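These variables can be read with their documented defaults as shown below. This is a sketch of one reasonable loader, not the service's own configuration code; `load_settings`, the dictionary keys, and the set of strings accepted as "true" are assumptions.

```python
import os


def load_settings(env=None):
    """Read configuration from environment variables, applying the defaults above."""
    env = os.environ if env is None else env
    timezones = env.get("BUSINESS_TIMEZONES", "Asia/Kolkata,Australia/Melbourne")
    return {
        "port": int(env.get("PORT", "8000")),
        "database_url": env.get("DATABASE_URL"),  # no default; needed for auto-training
        "auto_train_on_startup":
            env.get("AUTO_TRAIN_ON_STARTUP", "true").strip().lower() in ("1", "true", "yes"),
        "min_training_events": int(env.get("MIN_TRAINING_EVENTS", "100")),
        "contamination": float(env.get("CONTAMINATION", "0.10")),
        "n_estimators": int(env.get("N_ESTIMATORS", "200")),
        "business_hours_start": int(env.get("BUSINESS_HOURS_START", "9")),
        "business_hours_end": int(env.get("BUSINESS_HOURS_END", "18")),
        "business_timezones": [tz.strip() for tz in timezones.split(",") if tz.strip()],
    }


# Passing a plain dict makes the loader easy to exercise without touching the real environment.
settings = load_settings({"CONTAMINATION": "0.05", "AUTO_TRAIN_ON_STARTUP": "false"})
print(settings["contamination"], settings["auto_train_on_startup"])
```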

Integration with AuditSphere

Detection Flow

  1. Cron Job triggers every 15 minutes
  2. AuditSphere API fetches unprocessed audit events
  3. ML Service receives batch of events
  4. Feature extraction transforms events into 16 numeric features
  5. Isolation Forest scores each event
  6. Results returned with anomaly flags and scores
  7. Anomalies stored in database for user review

API Usage Examples

Detect Anomalies

    curl -X POST https://ml-service.example.com/api/v1/anomaly/detect \
      -H "Content-Type: application/json" \
      -d '{
        "events": [
          {
            "event_id": "evt_123",
            "operation": "FileDownloaded",
            "user_id": "user@company.com",
            "creation_time": "2025-01-15T02:30:00Z",
            "site_url": "https://company.sharepoint.com/sites/hr",
            "source_file_name": "salaries.xlsx",
            "client_ip": "203.0.113.50",
            "user_type": 0,
            "event_count_1h": 5,
            "event_count_24h": 50
          }
        ]
      }'

Response

    {
      "results": [
        {
          "event_id": "evt_123",
          "is_anomaly": true,
          "anomaly_score": 0.85,
          "confidence": 0.92,
          "anomaly_type": "unusual_timing"
        }
      ],
      "total_events": 1,
      "anomalies_detected": 1,
      "processing_time_ms": 45
    }
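The same call can be made from Python using only the standard library. The sketch below builds the request and filters a response of the shape shown above; `build_detect_request` and `flagged` are hypothetical helper names, the base URL is the placeholder host used in this document, and the request is constructed but not sent.

```python
import json
from urllib import request

BASE_URL = "https://ml-service.example.com"  # placeholder host from the examples


def build_detect_request(events):
    """Build a POST request for the batch detection endpoint."""
    body = json.dumps({"events": events}).encode("utf-8")
    return request.Request(
        f"{BASE_URL}/api/v1/anomaly/detect",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def flagged(response_body):
    """Keep only the results the service marked as anomalous."""
    return [r for r in response_body["results"] if r["is_anomaly"]]


req = build_detect_request([{
    "event_id": "evt_123",
    "operation": "FileDownloaded",
    "user_id": "user@company.com",
    "creation_time": "2025-01-15T02:30:00Z",
}])

# To send it against a live deployment:
#   with request.urlopen(req, timeout=10) as resp:
#       payload = json.load(resp)
sample = {"results": [
    {"event_id": "evt_123", "is_anomaly": True, "anomaly_score": 0.85},
    {"event_id": "evt_124", "is_anomaly": False, "anomaly_score": 0.12},
]}
print([r["event_id"] for r in flagged(sample)])
```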

Check Model Status

    curl https://ml-service.example.com/api/v1/anomaly/model/status

    {
      "model_loaded": true,
      "training_samples": 2000,
      "contamination": 0.1,
      "last_trained": "2025-12-15T10:30:00"
    }

Deployment

The ML service is deployed separately from the main AuditSphere application:

| Platform | Configuration |
| --- | --- |
| Railway | Auto-detected Python deployment |
| Docker | Available via Dockerfile |
| Manual | uvicorn app.main:app --port 8000 |

Environment Setup

    # Required
    DATABASE_URL=postgresql://user:pass@host/db?sslmode=require

    # Optional
    PORT=8000
    AUTO_TRAIN_ON_STARTUP=true
    CONTAMINATION=0.10
    N_ESTIMATORS=200

Document Information

| Property | Value |
| --- | --- |
| Version | 1.0 |
| Last Updated | December 2025 |
| Classification | Client Documentation |