Adventures in Logging
Before my current position, my main responsibilities involved getting teams to the cloud, in particular AWS. My team often had to do this without budget for non-AWS tools, which meant a lot of creative problem solving.
One of the earliest problems we had to solve was around logging. The requirements were that the building team be able to search its logs and that we store the logs for one year in long-term storage. There were concerns about capacity in the existing on-prem logging solution, so we could not simply send the logs there.
Minimum Viable Product
This was our first attempt at the problem. At the time we developed this, CloudWatch did not have a lot of the cool features it does today, like Logs Insights. But it would work for this project.
The initial design was simply to send logs to CloudWatch Logs. For the small number of projects we had, the costs were still lower than other options. Additionally, it didn't make sense for my team to take on running a separate logging platform while we were trying to demonstrate the value of public cloud to the business.
With CloudWatch Logs, you set a retention period on each log group, so we could keep the logs for a year. A very basic search interface was also available. And we could keep each team's logs segmented from other teams', since our account model essentially gave each team its own set of accounts.
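For reference, setting that retention is a single API call today. Here's a minimal boto3 sketch; the log group name is made up, and the allowed-values set is the one the CloudWatch Logs API documents:

```python
# Values the CloudWatch Logs API accepts for retentionInDays.
VALID_RETENTION_DAYS = {1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180,
                        365, 400, 545, 731, 1827, 3653}

def set_retention(log_group: str, days: int = 365):
    """Set (or update) the retention period on a CloudWatch Logs log group."""
    if days not in VALID_RETENTION_DAYS:
        raise ValueError(f"{days} is not a retention period CloudWatch Logs accepts")
    import boto3  # imported lazily so the validation above is testable without AWS
    logs = boto3.client("logs")
    return logs.put_retention_policy(logGroupName=log_group, retentionInDays=days)

# set_retention("/my-team/app-logs")  # hypothetical log group name
```

365 days matched our one-year requirement exactly, which is part of why this was such a low-friction starting point.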
Centralized Collection to S3
One of the first things we wanted to address was a potential flaw in using CloudWatch in the same account that resources and users had access to. CloudWatch Logs is immutable in the sense that you can't change log data once it has reached the service, but it is possible to delete logs by shortening the retention period or deleting the log stream.
AWS had a centralized logging pattern that we modified for our use case. It's no longer recommended because AWS has since produced a better solution, but at the time, it looked something like this.
This solution used the following components: a CloudWatch Logs subscription in each producing account, a Kinesis stream in the central logging account, a Lambda function consuming that stream, and an S3 bucket for long-term storage.
A great thing about this pattern was that it was rather resilient and fairly inexpensive given our size at the time. CloudWatch Logs pushed logs over as they arrived, so we didn't have to keep polling for new logs. And by using S3 instead of a server, we had further flexibility and didn't have to figure out capacity needs in advance.
We could now ensure we kept a copy of the logs in a more protected environment, shielded from deletion by identities in the producing account.
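One way to get that deletion protection, sketched below with made-up bucket and role names, is a bucket policy on the central archive that denies object deletion to every principal except a designated admin role in the logging account:

```python
import json

# Hypothetical central log bucket and admin role; the point is that
# producing-account identities cannot delete archived objects.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyLogDeletion",
            "Effect": "Deny",
            "Principal": "*",
            "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
            "Resource": "arn:aws:s3:::central-log-archive/*",
            "Condition": {
                "StringNotEquals": {
                    "aws:PrincipalArn": "arn:aws:iam::111111111111:role/LogAdmin"
                }
            },
        }
    ],
}
print(json.dumps(policy, indent=2))
```

Pairing a policy like this with S3 versioning and lifecycle rules gives you both tamper resistance and automatic expiry after the one-year retention window.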
Centralized Log Viewing
Not long after figuring out the above problem, we were given another one. We had proven the value of AWS through the work we were doing, and more teams were now using it. We needed to provide access to logs not just to the team building in an account but also to other dev teams and to support. Asking teams to remember which account to sign into to see which logs was too much friction, not to mention having to keep that identity and access management matrix up to date on a continual basis.
So we stood up an AWS Elasticsearch cluster and started sending the logs there. The groundwork we laid with centralized logging made this fairly straightforward.
We added an additional consumer Lambda off of the Kinesis stream and pushed the logs to the Elasticsearch cluster we built.
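A consumer along those lines might look like the sketch below. It assumes the standard CloudWatch Logs subscription format (gzip-compressed JSON, base64-encoded inside the Kinesis record) and elides the actual HTTP call to the cluster; the function names are mine, not our production code:

```python
import base64
import gzip
import json

def decode_record(kinesis_data: str) -> dict:
    """CloudWatch Logs delivers gzip-compressed JSON through Kinesis;
    in the Lambda event the payload is additionally base64-encoded."""
    return json.loads(gzip.decompress(base64.b64decode(kinesis_data)))

def to_bulk_actions(payload: dict, index: str = "logs") -> list:
    """Turn one subscription payload into Elasticsearch _bulk request lines."""
    if payload.get("messageType") != "DATA_MESSAGE":  # skip CONTROL_MESSAGE pings
        return []
    lines = []
    for event in payload["logEvents"]:
        lines.append(json.dumps({"index": {"_index": index, "_id": event["id"]}}))
        lines.append(json.dumps({
            "@timestamp": event["timestamp"],
            "message": event["message"],
            "logGroup": payload["logGroup"],
            "logStream": payload["logStream"],
        }))
    return lines

def handler(event, context):
    """Lambda entry point: collect bulk actions from each Kinesis record.
    (POSTing the joined lines to the cluster's _bulk endpoint is elided.)"""
    actions = []
    for record in event["Records"]:
        actions.extend(to_bulk_actions(decode_record(record["kinesis"]["data"])))
    return len(actions) // 2  # number of log events prepared
```

Using the CloudWatch-assigned event id as the document `_id` makes retried batches idempotent, which matters because Lambda will replay a Kinesis batch on failure.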
Easy peasy lemon squeezy, right?
Although some of us had experience with Splunk and other log collection tools, none of us had really run Elasticsearch before, so there were definitely a few issues we hit right away.
Automated Field Type Detection
This was one of the more frustrating ones. Elasticsearch has data types for certain kinds of data, such as integer and date, and it attempts to automatically determine the type of a field, such as one in a JSON document, the first time it sees it. Super awesome, because it meant you didn't have to predefine your schema for everything sent over the wire, which reduced friction for sending.
A problem we ran into right away involved the CloudTrail logs. I forget which field it was, but for one of them the value was sometimes obviously a string and other times looked like a date, and I believe the date format was the more common one. So if the first log line Elasticsearch saw for that field happened to contain a date, it would set the field type to date, and we would then get errors and lose all the log lines that had a string instead.
The solution was to predefine the schema whenever we discovered a log type like this.
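For example, an index template along these lines (the field name and index pattern are illustrative, not the actual CloudTrail field) pins the field to keyword and turns off date auto-detection, so the first value seen can't lock the type:

```python
import json

# Sketch of an index template body for PUT _template/cloudtrail.
template = {
    "index_patterns": ["cloudtrail-*"],
    "mappings": {
        # Disable date auto-detection so a date-looking first value
        # can't lock a free-form string field to type "date".
        "date_detection": False,
        "properties": {
            # Illustrative field name, not the real CloudTrail field we hit.
            "requestParameters.value": {"type": "keyword"},
        },
    },
}
print(json.dumps(template, indent=2))
```

With the template in place, new daily indices matching the pattern inherit the mapping, so the fix survives index rollover.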
Missing Logs
Another issue we ran into was logs missing on the Elasticsearch side. There were a few different reasons for this. A common one was a parsing error like the one described above, but more often we were dealing with a scale issue.
While message buses like Kinesis streams are great for handling large influxes of data while protecting downstream resources from being overwhelmed, they have their own nuances. In this solution we had just one Kinesis stream, which meant we could only process with one Lambda at a time, since the Lambda needs to remember where it is in the stream.
That means there is a maximum amount of data that can be processed in a given amount of time. If it's exceeded, the data backs up in the bus and doesn't show up in log search. Around this same time we started using Grafana to show our dashboards to other teams too. Since each team had its own Kinesis stream, we gave them a Grafana widget they could check to see whether processing was slowing down.
If the backlog got big enough, we would get into a position where the logs in the stream had no way of being processed before the end of their retention period. That data would never show up in Elasticsearch.
Once a stream reached that situation, the options were limited. Often it was worth accepting the temporary data loss and instead determining whether the influx was temporary or permanent. If it was temporary, we might do nothing. If it was permanent, we could easily do two things: increase how long messages can live in the stream (to prevent future loss) and add additional streams so we could process more data concurrently.
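The arithmetic behind those decisions is simple enough to sketch. The per-shard write limits below are Kinesis's documented 1 MB/s and 1,000 records/s; whether you scale by adding shards to one stream or by adding streams, the math works the same way:

```python
import math

SHARD_MB_PER_SEC = 1.0        # Kinesis per-shard write limit
SHARD_RECORDS_PER_SEC = 1000  # Kinesis per-shard write limit

def shards_needed(mb_per_sec: float, records_per_sec: float) -> int:
    """Minimum shard count to absorb a given sustained write rate."""
    return max(
        math.ceil(mb_per_sec / SHARD_MB_PER_SEC),
        math.ceil(records_per_sec / SHARD_RECORDS_PER_SEC),
        1,
    )

def drain_time_hours(backlog_mb: float, process_mb_per_sec: float,
                     incoming_mb_per_sec: float) -> float:
    """Hours until a backlog clears at the consumer's net throughput.
    Returns math.inf if the consumer can't keep up at all."""
    net = process_mb_per_sec - incoming_mb_per_sec
    if net <= 0:
        return math.inf
    return backlog_mb / net / 3600
```

If the drain time exceeds the stream's message retention (24 hours by default), the oldest records will age out unprocessed, which is exactly the silent data loss we kept running into.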
Care and Feeding of ElasticSearch
After we worked through a lot of the above problems, we reached a new one. We had been very successful at showing value and now had even more teams, including frontend teams, using AWS. It was time to learn more about how Elasticsearch actually worked.
We needed to make sure we had enough shard replicas to hopefully survive a node going down. We had to learn how to balance the number of documents in, and the size of, a shard, as well as the number of shards per node. And we had to learn while doing. It's far too much to go into here, but, as with anything, Elasticsearch has its own arcane nature to understand.
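To give a flavor of that balancing act, here's a toy calculator built on commonly cited rules of thumb (shards in the tens of gigabytes, roughly 20 shards per GB of node heap). These are starting points for tuning, not our exact production numbers:

```python
import math

# Widely cited Elasticsearch sizing rules of thumb; tune for your workload.
TARGET_SHARD_GB = 40         # aim for shards in the tens of gigabytes
MAX_SHARDS_PER_GB_HEAP = 20  # keep per-node shard count proportional to heap

def primary_shards(index_size_gb: float) -> int:
    """Primaries needed to keep each shard near the target size."""
    return max(1, math.ceil(index_size_gb / TARGET_SHARD_GB))

def total_shards(primaries: int, replicas: int) -> int:
    """Each replica duplicates every primary shard."""
    return primaries * (1 + replicas)

def max_shards_per_node(heap_gb: float) -> int:
    """Rough ceiling on shards a node can host before overhead bites."""
    return int(heap_gb * MAX_SHARDS_PER_GB_HEAP)
```

The tension is visible even in a toy model: adding replicas improves survivability when a node goes down, but it multiplies the total shard count that your nodes' heap has to carry.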
I'm super grateful to have worked on that project because I learned so much, even if it felt overwhelming at times. And while it definitely was not perfect, it helped us prove the value of what we were doing to the business.