Dear friends,
Experience gained in building a model to solve one problem doesn’t always transfer to building models for other problems. How can you tell whether or not intuitions honed in one project are likely to generalize to another? I’ve found that two factors can make the difference: the size of the training set and whether the data is unstructured or structured.
For instance, I’ve heard blanket statements like, “You should always have at least 1,000 examples before tackling a problem.” This is good advice if you’re working on a pedestrian detector, where data is readily available and prior art shows that large datasets are important. But it’s bad advice if you’re building a model to diagnose rare medical conditions, where waiting for 1,000 examples might mean you’ll never get started.
Unstructured data includes text, images, and audio clips, which lend themselves to interpretation by humans. Structured data, on the other hand, includes things like transaction records or clickstream logs, which humans don’t process easily.
This difference leads to very different strategies for training and deploying models:
- Unstructured data: Because the examples are easy for humans to understand, you can recruit people to label them and benchmark trained models against human-level performance (HLP). If you need more examples, you might be able to collect them by capturing more text, images, or audio, or by using data augmentation to distort existing examples (see the sketch after this list). Error analysis can take advantage of human intuition.
- Structured data: This class of data is harder for humans to interpret, and thus harder for humans to label. Algorithms that learn from structured data often surpass HLP, making that measure a poor benchmark. It can also be hard to find additional examples. For instance, if the training dataset comprises records of your customers’ purchases, it’s hard to get data from additional customers beyond your current user base.
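To make the augmentation idea concrete, here’s a minimal sketch of distorting images using Pillow. The specific transformations, parameters, and file name are illustrative assumptions, not a prescription:

```python
# A minimal sketch of data augmentation for image data using Pillow.
# The specific distortions and parameters here are illustrative only.
import random
from PIL import Image, ImageEnhance, ImageOps

def augment(image: Image.Image) -> Image.Image:
    """Return a randomly distorted copy of an image."""
    if random.random() < 0.5:
        image = ImageOps.mirror(image)             # random horizontal flip
    image = image.rotate(random.uniform(-10, 10))  # small random rotation
    brightness = ImageEnhance.Brightness(image)
    image = brightness.enhance(random.uniform(0.8, 1.2))  # brightness jitter
    return image

# Usage (hypothetical file name): create several variants of one labeled example.
# original = Image.open("example.png")
# variants = [augment(original) for _ in range(5)]
```

Each variant keeps the original label, so a few hundred labeled images can stand in for a much larger training set.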
Dataset size has implications as well:
- Small dataset: If the dataset includes <1,000 examples, you can examine every example manually, check if the labels are correct, and even add labels yourself. You’re likely to have only a handful of labelers, so it’s easy to hash out any disagreements together on a call (see the audit sketch after this list). Every single example is a significant fraction of the dataset, so it’s worthwhile to fix every incorrect label.
- Large dataset: If the dataset is >100,000 examples, it’s impractical for a single engineer to examine every one manually. The number of labelers involved is likely to be large, so it’s critical to define standards clearly, and it may be worthwhile to automate labeling. If a significant number of examples are mislabeled, it may be hard to fix them, and you may have to feed the noisy data to your algorithm and hope it can learn a robust model despite the noise.
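To make the small-dataset case concrete, here’s a minimal sketch of surfacing label disagreements for a team to resolve. It assumes the labels sit in a CSV file; the file name and column names are assumptions for illustration:

```python
# A minimal sketch of a small-dataset label audit. It assumes labels from a few
# labelers are stored in a CSV with columns example_id, labeler, label
# (file and column names are illustrative).
import csv
from collections import defaultdict

votes_by_example = defaultdict(dict)  # example_id -> {labeler: label}
with open("labels.csv", newline="") as f:
    for row in csv.DictReader(f):
        votes_by_example[row["example_id"]][row["labeler"]] = row["label"]

# With fewer than 1,000 examples, it's feasible to list every disagreement
# and resolve each one with the labelers directly.
for example_id, votes in sorted(votes_by_example.items()):
    if len(set(votes.values())) > 1:
        print(f"{example_id}: labelers disagree -> {votes}")
```

Printing the disagreements rather than auto-resolving them keeps the final call with the people on the labeling call, which matters when every example is a significant fraction of the dataset.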
If you find yourself in need of advice while working on, say, a manufacturing visual inspection problem with 100 examples, the best person to ask would be someone who has worked on a manufacturing visual inspection problem with 100 examples. But if you can’t find such a person, consider looking for someone with expertise in the same dataset size/type quadrant as the problem you’re working on.
As you develop your career, you might also consider whether you want to stay in one quadrant and develop deep expertise there, or move across quadrants and develop more general skills.
Keep learning!
Andrew