In school we are told that the three "R"s that we need to know are Reading, wRiting, and aRithmetic. I never did understand that, since only one of those words starts with "R"…
Once you grow up and get a job as a data engineer, you’ll learn that there really are three "R"s that you need to know, and in this case they actually start with that letter!
At the end of the day, the most important characteristic of your data is that it needs to be reliable. All of the fancy machine learning and sophisticated algorithms in the world won’t do you any good if you are feeding them messy and inconsistent data. Rather than the healthy, beautiful, and valuable systems that you know they could be, they will grow up to be hideous monsters that will consume your time and ambition.
Reliability can mean a lot of things, but in this context we are talking about the characteristics of your data that give you a high degree of confidence that the analyses you are performing are correct. Some of the elements of a reliable data platform are:
- Consistent and validated schemas. All of the records follow the same pattern and can be treated the same way by your analytical processes.
- High availability of your systems to ensure read and write operations succeed. This requires systems that provide redundancy (there’s another R word) in the face of the inevitable server and network failures.
- A well-defined and maintained repository of metadata that will provide the necessary context around what the data is, where it came from, and how it was produced. These questions will get asked at some point, so it’s best to have the answers ahead of time.
- Access control and auditability, to ensure that only the people and processes who are supposed to be able to modify the data are actually doing so.
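To make the first item on that list a bit more concrete, here is a minimal sketch of schema validation using nothing but the Python standard library. The field names and the `validate_record` helper are made up for illustration; a real platform would more likely lean on a dedicated schema tool, but the idea is the same: every record gets checked against one agreed-upon pattern before your analytics ever see it.

```python
# Hypothetical schema for illustration: field name -> expected Python type.
EXPECTED_SCHEMA = {
    "user_id": int,
    "email": str,
    "signup_ts": str,
}


def validate_record(record, schema=EXPECTED_SCHEMA):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    # Check every field the schema requires, in schema order.
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    # Flag anything the schema does not know about.
    for field in sorted(record.keys() - schema.keys()):
        problems.append(f"unexpected field: {field}")
    return problems
```

Records that come back with a non-empty list can be rejected or quarantined, so that everything downstream really can be "treated the same way."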
Reproducibility, the second "R", is a critical competency when working with business-critical systems. If there is no way for another team member or business unit to independently verify and recreate your data sets and analytics, then there is no way to be certain that the original results were valid. This also factors into the concept of reliability, because if you are able to consistently reproduce a given data set then you can be fairly sure that it is reliable and can be trusted to be available.
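One simple way to check that a data set really was reproduced is to compare fingerprints of the original and the recreated copy. The sketch below assumes the data fits in memory as a list of JSON-serializable dictionaries, and the `dataset_fingerprint` name is invented for this example; it is one possible approach, not a standard.

```python
import hashlib
import json


def dataset_fingerprint(records):
    """Hash a list of dict records, ignoring row order and key order."""
    # Serialize each record canonically (sorted keys, fixed separators),
    # then sort the rows so that the order they arrived in does not matter.
    canonical_rows = sorted(
        json.dumps(record, sort_keys=True, separators=(",", ":"))
        for record in records
    )
    digest = hashlib.sha256()
    for row in canonical_rows:
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")  # Row delimiter, so adjacent rows cannot blur together.
    return digest.hexdigest()
```

If a colleague reruns your pipeline and gets the same fingerprint, you both have decent evidence that the output was reproduced. If the fingerprints differ, something in the inputs, code, or environment changed.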
If all of your servers die, or the data center where your systems are running is destroyed or incapacitated by a natural disaster, you need a recovery plan. This is where the third "R" of repeatability comes into play. It’s all well and good to be able to build a Spark cluster, or install a PostgreSQL database, but can you do it quickly and repeatably? Having the configuration and procedures for repeating the installation or deployment of your data infrastructure feeds into the reproducibility of your data and the reliability of your analytics.
So, now that you have learned the three "R"s of data engineering, you can be confident that the teams and systems that depend on the output of your labor will be happy and successful. Just be sure to let them know how much work went into it, so that they can fully appreciate your dedication and perseverance.