Skip to main content

Querying J1DS Data in Amazon Athena

This page walks through deploying an AWS Glue and Athena infrastructure stack on top of your J1DS S3 bucket and running three SQL query patterns against your JupiterOne entity and relationship change history.

For instructions on enabling J1DS and configuring your S3 bucket, see Data Streaming (J1DS).

Prerequisites

  • AWS account with permissions to create CloudFormation stacks, Glue databases and tables, Athena workgroups, S3 buckets, and IAM roles
  • AWS CLI v2 installed and configured with credentials for your target account
  • J1DS enabled and streaming to an S3 bucket (see Data Streaming (J1DS) for setup)
  • The S3 bucket name where J1DS is writing data

Deploy the Stack

1. Download the CloudFormation template:

Download the CloudFormation template

2. Deploy the stack:

aws cloudformation deploy \
--template-file j1ds-athena-setup.yaml \
--stack-name j1ds-usage-example \
--parameter-overrides \
SourceBucketName=YOUR_BUCKET_NAME \
--capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM

Replace YOUR_BUCKET_NAME with your J1DS S3 bucket name.

3. Note the stack outputs — the deploy command prints the Glue database name (j1ds-usage-example-cdc) and Athena WorkGroup name (j1ds-usage-example). These values appear in every query.

Verify the Deployment

Run a COUNT query in the Athena console to confirm data is accessible. Open the Amazon Athena console, select the j1ds-usage-example workgroup, and run:

SELECT COUNT(*) FROM "j1ds-usage-example-cdc".entities
WHERE accountid = 'YOUR_ACCOUNT_ID'
AND year = 2026 AND month = 3;

Replace YOUR_ACCOUNT_ID with your JupiterOne account ID and adjust year/month to the current period.

Expected: a non-zero row count. If the result is zero, verify that J1DS is streaming (check your S3 bucket for recent .jsonl.gz files) and that the year/month values match files present in the bucket.

info

The accountid partition filter is required for all queries. Omitting it causes an Athena error, not an empty result — the table uses an injected partition projection that requires a caller-supplied value.

This is especially important if you stream multiple JupiterOne accounts (e.g., dev and prod) into the same CDC bucket. Each account's data is stored under a separate accountId= partition, so the accountid filter ensures your query returns results for a single account.

CDC Schema Reference

Record Structure

Every row in the entities and relationships tables corresponds to one CDC event:

FieldColumnTypeValues
operationoperationstringc (create), u (update), d (delete)
eventTypeeventtypestringentity or relationship
propertiespropertiesstring (JSON)JSON blob — access with json_extract_scalar()
labelslabelsarray<string> (entities) / string (relationships)Entity: ["Entity","User","slack_user"] / Relationship: "HAS"

Key Properties Fields

The properties column is a raw JSON string. Use json_extract_scalar() to access individual fields:

FieldAccess PatternDescription
_idjson_extract_scalar(properties, '$._id')Stable UUID — the entity or relationship identifier
_typejson_extract_scalar(properties, '$._type')Provider-specific type, e.g., slack_user, aws_s3_bucket
_classjson_extract_scalar(properties, '$._class')Abstract class, e.g., User, DataStore
_beginOnCAST(json_extract_scalar(properties, '$._beginon') AS BIGINT)Epoch milliseconds when this version became current
displayNamejson_extract_scalar(properties, '$.displayName')Human-readable name

Query: Entity History

Use this query to audit all state changes to a specific entity over time. It returns every CDC event for a single entity ID, ordered chronologically. Replace YOUR_ENTITY_ID with the _id value of the entity you want to audit.

SELECT
year,
month,
day,
operation, -- c=create, u=update, d=delete
eventtype, -- always 'entity' in this table
json_extract_scalar(properties, '$._id') AS entity_id,
json_extract_scalar(properties, '$._type') AS entity_type,
json_extract_scalar(properties, '$._class') AS entity_class,
json_extract_scalar(properties, '$.displayName') AS display_name,
properties
FROM "j1ds-usage-example-cdc".entities
WHERE accountid = 'YOUR_ACCOUNT_ID' -- REQUIRED: injected partition
AND year = 2026 -- narrow partition range
AND month = 3 -- to reduce bytes scanned
AND json_extract_scalar(properties, '$._id') = 'YOUR_ENTITY_ID' -- replace with your entity _id
ORDER BY year, month, day;

Example Output

yearmonthdayoperationentity_typeentity_classdisplay_name
2026317cslack_userUserAlice Johnson
2026318uslack_userUserAlice Johnson
2026319uslack_userUserAlice J.

Query: Latest State Snapshot

Use this query to get the current state of all entities in your account — one row per entity, deduplicated to the most recent version. It uses MAX_BY to select the properties blob with the highest _beginon timestamp for each entity ID.

SELECT
json_extract_scalar(latest_properties, '$._id') AS entity_id,
json_extract_scalar(latest_properties, '$._type') AS entity_type,
json_extract_scalar(latest_properties, '$._class') AS entity_class,
json_extract_scalar(latest_properties, '$.displayName') AS display_name,
latest_properties AS properties
FROM (
SELECT
MAX_BY(properties, CAST(json_extract_scalar(properties, '$._beginon') AS BIGINT)) AS latest_properties
FROM "j1ds-usage-example-cdc".entities
WHERE accountid = 'YOUR_ACCOUNT_ID' -- REQUIRED: injected partition
AND year = 2026 -- filter to recent data
AND month = 3 -- to control scan cost
AND json_extract_scalar(properties, '$._beginon') IS NOT NULL -- guard against missing _beginon
GROUP BY json_extract_scalar(properties, '$._id')
)
ORDER BY entity_type, display_name;

-- NOTE: _beginon is epoch milliseconds stored as a JSON number.
-- CAST to BIGINT is required for correct MAX_BY ordering (not lexicographic).
-- operation: c=create, u=update, d=delete

Example Output

entity_identity_typeentity_classdisplay_name
a1b2c3d4-...aws_s3_bucketDataStoremy-data-bucket
d4e5f6a7-...slack_userUserAlice Johnson

Query: Change Detection

Use this query to see what changed in your account over a time window, grouped by day, operation type, and entity type. Adjust the day BETWEEN range to scope the window you want to examine.

SELECT
year,
month,
day,
operation, -- c=create, u=update, d=delete
json_extract_scalar(properties, '$._type') AS entity_type,
COUNT(*) AS change_count
FROM "j1ds-usage-example-cdc".entities
WHERE accountid = 'YOUR_ACCOUNT_ID' -- REQUIRED: injected partition
AND year = 2026
AND month = 3
AND day BETWEEN 1 AND 7 -- replace with your window
GROUP BY year, month, day, operation,
json_extract_scalar(properties, '$._type')
ORDER BY year, month, day, operation;

-- TIP: To compare two windows, wrap each as a subquery and LEFT JOIN on entity_type,
-- or use EXCEPT to find entity types that appeared only in one window.
-- eventtype is always 'entity' in this table (use relationships table for relationship changes).

Example Output

yearmonthdayoperationentity_typechange_count
202631caws_iam_role12
202631uslack_user45
202632caws_s3_bucket3

Cost and Performance Notes

note
  • Athena charges $5 per TB scanned
  • All queries include partition filters (accountid, year, month) to minimize scan scope
  • The WorkGroup enforces a 10 GB per-query cutoff (~$0.05 maximum per query)
  • Typical per-query cost for a single account's daily data: under $0.01
  • Adding day filters (e.g., AND day BETWEEN 1 AND 7) reduces scanned data further