Last year, at CVPR 2019, I had a brief encounter with Andrej Karpathy, the Senior Director of AI at Tesla. During our conversation I asked him a rather naïve question, “Andrej, how do you estimate the volume of labeled data required to train and validate your self-driving cars for a particular scenario?” He paused for a moment and then smiled, saying, “Ummm, 378 hours of data. Ask me tomorrow, my number may be more generous.” We laughed for the next two minutes.
Today, it’s common knowledge that we require large swathes of high-quality labeled data to train and validate ML models. But after having met hundreds of ML team leads and multiple CXOs of AI startups that are at different phases of the AI development journey, I’ve observed that the market has an uneven understanding of how labeled data is acquired.
Most ML teams at the higher end of the AI maturity curve treat labeled data acquisition as one of their main priorities and understand the different layers of complexity involved in the process. Most of the players at the lower end, however, find some workaround for the project at hand. I have also observed that almost all ML companies go through a self-realization cycle, from ignoring this aspect of the data pipeline to eventually building massive software and operations muscle to tackle their labeled data requirements.
Therefore, it is safe to say that the big question in the current AI ecosystem is, “How can one access large volumes of diverse and high-quality labeled data cost-effectively?”
The Inherent Over-Simplification of the Data Labeling Process
Over-simplifying the process to a mere ‘hire-annotate-pay’ model will only build further roadblocks to achieving your automation goals. The most important prerequisite for any functional ML model is high-quality labeled data. Compromising on cost will only lead you further down the rabbit hole.
The labeling budget is often taken to be the product of a fixed $7-per-hour annotator cost and the number of annotation hours required. This approach is badly wrong because it does not factor in the various hidden costs that are incurred while ML engineers set up the annotation process.
Factors Overlooked in the Data Labeling Process
Let’s go through a step-by-step process to understand how the real budget overshoots the simple math equation.
Step 1: The ML engineer first creates guidelines that clearly define the annotation classes and the expectations for each class, so that human annotators can be trained on them. This step repeats a few times because, realistically, nobody gets it right on the first try; the guidelines require multiple iterations.
Step 2: Next, the engineer has to find or build annotation tools that support the team's data formats, produce highly accurate pixel-level annotations, and scale to accommodate a large dataset. This evaluation/development process can take a few days or a few weeks.
Step 3: The engineer then finds the right number of skilled human annotators through a third party or hires part-time annotators. Finding the right workforce requires a well-designed pilot program and an evaluation of multiple workforces, followed by negotiations and contracting (in collaboration with finance and legal teams) with the right one(s).
Step 4: Next, the engineer sets up the tool infrastructure, trains the annotators on tool usage, and provides annotation requirements as tasks. The engineer also creates and manages deadlines for annotation tasks and tracks the annotators' progress regularly.
Step 5: The engineer then sets up workflows, logic, or scripts to test the accuracy of the output, gives feedback to annotators who fall short, and re-assigns incorrectly labeled items for rework. This step repeats for each batch of annotations.
Step 6: Intermittently, they repeat the previous step until the labeled data reaches their desired level of accuracy. The whole process is time-consuming and labor-intensive.
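To make the accuracy check in Step 5 concrete, here is a minimal sketch of what such a script might look like. It assumes a simplified setting that is not described in this article: each item carries a single class label, and a small audited ("gold") subset exists to compare against. The function name review_batch, the data layout, and the 95% threshold are all hypothetical; real pixel-level annotations would need an overlap metric rather than exact label matching.

```python
from collections import defaultdict


def review_batch(labels, gold, min_accuracy=0.95):
    """Spot-check one batch of annotations against an audited subset.

    labels: list of (item_id, annotator_id, label) tuples (hypothetical layout).
    gold:   dict mapping item_id -> correct label for the audited items only.
    Returns per-annotator accuracy, annotators needing feedback, items to rework.
    """
    checked = defaultdict(int)
    correct = defaultdict(int)
    rework = []

    for item_id, annotator_id, label in labels:
        if item_id not in gold:
            continue  # item was not part of the audit sample
        checked[annotator_id] += 1
        if label == gold[item_id]:
            correct[annotator_id] += 1
        else:
            rework.append(item_id)  # send back for relabeling

    accuracy = {a: correct[a] / checked[a] for a in checked}
    needs_feedback = sorted(a for a, acc in accuracy.items() if acc < min_accuracy)
    return accuracy, needs_feedback, rework


# Toy batch: four labeled images, three of which were audited.
batch = [
    ("img1", "ann_a", "car"),
    ("img2", "ann_a", "truck"),
    ("img3", "ann_b", "car"),
    ("img4", "ann_b", "pedestrian"),
]
gold = {"img1": "car", "img2": "car", "img3": "car"}

accuracy, needs_feedback, rework = review_batch(batch, gold)
print(accuracy, needs_feedback, rework)
```

Even in this toy form, someone has to build the audited subset, choose the threshold, and run the check on every batch, which is exactly the kind of engineer time that the simple budget ignores.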
Often, the data labeling budget does not include the added cost incurred in each of these steps. Real budgets include the time and effort of the ML engineer(s), which costs $100/hour or upwards, plus the cost of the annotators; together, these can tally 10X higher than the $7/hour figure.
As the specificity, diversity, and volume of the data to be annotated increase, the complexities of tooling infrastructure, workforce, and project management also increase. Efficiently managing these complex processes becomes extremely critical for achieving quality outputs.
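The gap between the simple equation and the real budget can be sketched in a few lines. The two hourly rates below are the figures quoted in this article; everything else (the class name, the hour counts, and the rework fraction) is a hypothetical input chosen only to illustrate the arithmetic, not measured data.

```python
from dataclasses import dataclass

ANNOTATOR_RATE = 7.0   # USD per annotator hour (figure quoted in the article)
ENGINEER_RATE = 100.0  # USD per ML-engineer hour (the article's lower bound)


@dataclass
class LabelingProject:
    annotation_hours: float       # person-hours of actual labeling
    engineer_hours: float         # hypothetical: engineer time across Steps 1-6
    rework_fraction: float = 0.0  # hypothetical: share of labeling redone after QA

    def naive_budget(self) -> float:
        # The "simple math equation": rate times annotation hours.
        return ANNOTATOR_RATE * self.annotation_hours

    def realistic_budget(self) -> float:
        # Labeling cost including rework, plus the engineer's time.
        labeling = ANNOTATOR_RATE * self.annotation_hours * (1 + self.rework_fraction)
        engineering = ENGINEER_RATE * self.engineer_hours
        return labeling + engineering

    def effective_hourly_rate(self) -> float:
        # Total spend per hour of labeling that was originally planned.
        return self.realistic_budget() / self.annotation_hours


# Hypothetical project: 1,000 labeling hours, 200 engineer hours, 25% rework.
project = LabelingProject(annotation_hours=1000, engineer_hours=200, rework_fraction=0.25)
print(project.naive_budget())           # 7000.0
print(project.realistic_budget())       # 28750.0
print(project.effective_hourly_rate())  # 28.75
```

With these made-up inputs the effective rate is already about four times the headline $7/hour; the multiple grows as engineer hours rise relative to labeling hours.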
And truth be told, ML engineers are better off doing what they do best instead of getting tangled in these operational processes.
The Realistic Decision Framework for Your Data Labeling Strategy
To make this framework widely applicable, I’ve used the three phases of the AI development cycle and three main data labeling functions as the basis for decision-making. Along with explaining the sophistication level of each function in each phase of the development cycle, I have provided ballpark budgets to simplify category assessment for decision-makers.
I’ve already established the complexity of the set-up process and the reality of the budgets involved in data labeling. Let us now dive into the functions. At Playment, we divide the previously mentioned six-step process into three important functions (see image above):
· Annotation Tooling
· Workforce
· Project Management
The sophistication of these functions depends on the volume of the data collected for annotations, and the number of human annotators required. For example, depending on the volume of data, the annotation tooling can be 100% manual or very sophisticated with various auto-validation checks or AI-assistance for humans to minimize errors and ensure faster data throughput.
The workforce can be 100% internally sourced or outsourced, depending on the volume of data for annotations. You could also opt for a multi-vendor model when the annotator requirement exceeds 1000 annotators.
Managing 10 human annotators and managing thousands are very different problems. The former needs almost nothing, but the latter requires sophisticated software and analytics to train annotators, allocate labeling tasks, and track performance on quality, speed, and so on. It is very similar to monitoring an assembly line.
Based on this realistic framework, Playment has built a sophisticated suite of technology and workforce solutions to help small, mid-sized, and large companies seamlessly fuel their data pipelines for faster activation of AI applications.
This decision framework should fit the requirements of most ML teams. However, data security and automation are other important parameters that largely influence the architecture decision. They add an extra layer of complexity to this otherwise simple framework. Feel free to DM Playment on Twitter @PlaymentHQ, or email us at email@example.com to discuss this.
I hope this framework will help the industry avoid the pitfalls that come with incorrect data labeling architecture decisions made on the basis of common misbeliefs, and help the ML community build highly accurate AI models faster than ever.
Expedite the AI age!