Deploying CloudFront Access Log Analysis Infrastructure with CDK 2
The source code shown below is shortened and slightly simplified. A full example including a CloudFront static website deployment can be found on my GitLab instance.
When running web applications it's always a good idea to turn on access logging in order to monitor incoming requests. For AWS CloudFront distributions deployed with CDK this can be easily done by creating a bucket to receive the logs and passing the appropriate parameters when deploying the CloudFront distribution.
Enabling CloudFront Access Logging
Access logging can be enabled by setting the parameter enableLogging to true and specifying the target bucket with the logBucket parameter.
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as cf from 'aws-cdk-lib/aws-cloudfront';
import * as glue from 'aws-cdk-lib/aws-glue';
import * as athena from 'aws-cdk-lib/aws-athena';

export class WebsiteStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props: WebsiteStackProps) {
    super(scope, id, props);

    // bucket receiving the CloudFront access logs
    const logBucket = new s3.Bucket(this, 'DistributionLoggingBucket', {
      objectOwnership: s3.ObjectOwnership.OBJECT_WRITER,
    });

    const distribution = new cf.Distribution(this, 'Distribution', {
      // (...)
      enableLogging: true,
      logBucket,
    });
  }
}
By default, CloudFront continuously writes the access logs to the given log bucket in files with names following the format <distribution-id>.YYYY-MM-DD-HH.<unique-id>.gz. These files are gzip-compressed. The interesting question is: How can we query these logs to find, for example, all entries originating from a specific IP address?
Deploying AWS Glue and AWS Athena Base Infrastructure
One way to achieve this is AWS Glue, a 'serverless data integration service', in combination with AWS Athena, a 'serverless, interactive analytics service'. AWS Glue is basically a managed extract, transform, and load (ETL) service, and it comes with a so-called Data Catalog, a repository that stores metadata about data sources (and targets). This metadata includes the location and the schema of the data, i.e. fields, types and so on, and is organized in tables, each representing an individual data source. Tables, in turn, are logically grouped into databases, and each table belongs to exactly one database.
Athena, on the other hand, is an interactive query service which allows analyzing data directly in S3 using simple SQL queries. To be able to execute queries against a data source, it uses the metadata for that data source stored in the Glue Data Catalog.
So, to query the log bucket with Athena, we simply need to deploy a Glue database as a container along with a Glue table with the necessary metadata for accessing the logs stored in the log bucket.
const glueDatabase = new glue.CfnDatabase(this, 'CfGlueDb', {
  catalogId: this.account,
  databaseInput: {
    description: 'Glue DB for CloudFront access logs',
    name: 'cf_logs_db',
  },
});

const glueTable = new glue.CfnTable(this, 'CfGlueTable', {
  catalogId: this.account,
  databaseName: 'cf_logs_db',
  tableInput: {
    name: 'cf_logs_table',
    description: 'Glue table for CloudFront access logs',
    storageDescriptor: {
      columns: [
        {name: 'date', type: 'date'},
        {name: 'time', type: 'string'},
        // (...) all fields defined in the CloudFront access log files come here
      ],
      location: `s3://${logBucket.bucketName}/`,
      inputFormat: 'org.apache.hadoop.mapred.TextInputFormat',
      outputFormat: 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
      serdeInfo: {
        serializationLibrary: 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe',
        parameters: {
          'serialization.format': '\t',
          'field.delim': '\t',
        },
      },
    },
    parameters: {
      // table parameters are string key-value pairs
      'skip.header.line.count': '2',
    },
    tableType: 'EXTERNAL_TABLE',
  },
});
- columns is an array containing the names and types of all fields in the log files
- location contains the S3 URL of the log bucket
- inputFormat, outputFormat, and serdeInfo contain format and serialization/deserialization parameters specific to the logs
- skip.header.line.count skips the given number of lines in each file; the first two lines of the access log files contain metadata which is not relevant for queries
We also have to make sure that the Glue table is only created after the Glue database, hence we add a dependency between the two:
glueTable.addDependency(glueDatabase);
Finally, in order to be able to run Athena queries, we need another S3 bucket where the query results are stored. In this example, we create that bucket and a new workgroup and then set the bucket as the output location for the workgroup. A workgroup in the Athena context is a logical construct for grouping queries and their executions together, for example to apply certain constraints such as the output location in our case. We also deploy a sample named query which can be run directly.
const athenaBucket = new s3.Bucket(this, 'AthenaBucket');

const workgroup = new athena.CfnWorkGroup(this, 'Workgroup', {
  name: 'cf_workgroup',
  workGroupConfiguration: {
    resultConfiguration: {
      outputLocation: `s3://${athenaBucket.bucketName}/`,
    },
  },
});

const sampleQuery = new athena.CfnNamedQuery(this, 'SampleNamedQuery', {
  name: 'sample_query',
  queryString: 'SELECT * FROM "cf_logs_db"."cf_logs_table" limit 25;',
  database: 'cf_logs_db',
  workGroup: workgroup.name,
});

sampleQuery.addDependency(workgroup);
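To pick up the question from earlier, we could also deploy an additional named query that filters the logs by client IP. The following is only a minimal sketch: it assumes that the elided column list above includes the CloudFront client IP field as a column named c_ip, and the query name and example IP address are purely illustrative.

const ipQuery = new athena.CfnNamedQuery(this, 'IpFilterNamedQuery', {
  name: 'ip_filter_query',
  // assumes the Glue table defines the client IP field as column "c_ip"
  queryString: `SELECT * FROM "cf_logs_db"."cf_logs_table" WHERE c_ip = '203.0.113.42' limit 100;`,
  database: 'cf_logs_db',
  workGroup: workgroup.name,
});

ipQuery.addDependency(workgroup);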
Now the necessary infrastructure is deployed and we can start querying the logs.
Querying CloudFront Access Logs with AWS Athena
In Athena, we switch to the "Query editor" and make sure that the newly created workgroup is selected at the top right. Under "Saved queries" we can then pick the previously deployed sample query and run it. It will show an excerpt of the access logs present so far in the log bucket. Just keep in mind that CloudFront does not write the logs in real time; sometimes it takes a little while until they show up in the bucket.
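Instead of the console, the same kind of query can also be run programmatically, for example with the Athena client of the AWS SDK for JavaScript v3. The sketch below is only an illustration: the helper function name is made up, and it again assumes a client IP column named c_ip; it starts the query in the workgroup deployed above, polls until it finishes, and prints the first page of results.

import {
  AthenaClient,
  StartQueryExecutionCommand,
  GetQueryExecutionCommand,
  GetQueryResultsCommand,
} from '@aws-sdk/client-athena';

// hypothetical helper: query the CloudFront logs for a given client IP
async function queryCloudFrontLogs(ip: string): Promise<void> {
  const client = new AthenaClient({});

  // start the query in the "cf_workgroup" workgroup so its output location is used
  const start = await client.send(new StartQueryExecutionCommand({
    QueryString: `SELECT * FROM "cf_logs_db"."cf_logs_table" WHERE c_ip = '${ip}' limit 100;`,
    QueryExecutionContext: { Database: 'cf_logs_db' },
    WorkGroup: 'cf_workgroup',
  }));
  const queryExecutionId = start.QueryExecutionId!;

  // poll until the query execution has finished
  for (;;) {
    const execution = await client.send(
      new GetQueryExecutionCommand({ QueryExecutionId: queryExecutionId }),
    );
    const state = execution.QueryExecution?.Status?.State;
    if (state === 'SUCCEEDED') break;
    if (state === 'FAILED' || state === 'CANCELLED') {
      throw new Error(`Query ${queryExecutionId} ended in state ${state}`);
    }
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }

  // fetch the first page of results; the first row contains the column names
  const results = await client.send(
    new GetQueryResultsCommand({ QueryExecutionId: queryExecutionId }),
  );
  for (const row of results.ResultSet?.Rows ?? []) {
    console.log((row.Data ?? []).map((d) => d.VarCharValue ?? '').join('\t'));
  }
}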
Conclusion
With these steps, the base infrastructure for analyzing CloudFront access logs is in place. There are many more things which could be considered; for example, it is advisable to use partitions in order to speed up queries and reduce cost. Since we used an infrastructure-as-code approach with CDK 2, changes or redeployments can easily be done in a deterministic way. From here we could proceed with, for example, creating some nice dashboards to always see at a glance what is going on in the logs.