Dear friends,
On Monday, the European Union fined Meta roughly $275 million for breaking its data privacy law. Even though Meta’s violation was not AI-specific, the EU’s response is a reminder that we need to build AI systems that preserve user privacy — not just to avoid fines but because we owe it to the people who are represented in the data.
Many companies that would benefit from machine learning can’t afford to hire enough skilled engineers. This creates a need for cloud-based AI software as a service (SaaS). How can customers of such services keep data private while counting on another party to process the data? Consider an AI system that reads electronic health records to make predictions about the patients. Can a hospital use a SaaS provider to monitor the system’s performance without exposing sensitive data?
Recently I learned about a monitoring technique that manages to keep data secure. While visiting Seattle, I met with Alessya Visnjic and Maria Karaivanova, two of the founders of WhyLabs, which provides a SaaS platform that monitors machine learning applications. (Disclosure: WhyLabs is a portfolio company of AI Fund, which I lead.) They explained how they help customers monitor deployed systems for problems like data drift — changes in the distribution of data because, say, a new disease emerged or the hospital started collecting data in a new way — while maintaining data privacy. In their approach, data never leaves the customer’s system. Instead, the SaaS provider (i) computes statistics on data at the source using efficient techniques based on Apache DataSketches and (ii) analyzes the statistics.
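To make this concrete, here is a minimal sketch of the pattern in Python, assuming the `datasketches` package (the Python bindings for Apache DataSketches). The class and method names may vary slightly across library versions, and this illustrates the general approach rather than WhyLabs’ actual code.

```python
# Minimal sketch: summarize a numeric feature at the data source and ship only
# the compact summary to the monitoring service. Raw values never leave the
# hospital's system. Assumes the `datasketches` Python bindings for Apache
# DataSketches; exact method names may differ across versions.
from datasketches import kll_floats_sketch

def summarize_feature(values, k=200):
    """Build a KLL quantile sketch from raw feature values (e.g., body temperature)."""
    sk = kll_floats_sketch(k)
    for v in values:
        if v is not None:  # skip missing readings; those are tracked separately
            sk.update(float(v))
    return sk

# On the hospital's server: summarize today's readings locally.
temperatures = [36.6, 37.1, 38.4, None, 36.9, 39.2]
sketch = summarize_feature(temperatures)

# Only this small serialized summary is transmitted to the SaaS provider.
payload = sketch.serialize()

# On the provider's side: reconstruct the sketch and analyze the distribution.
received = kll_floats_sketch.deserialize(payload)
print(f"median={received.get_quantile(0.5):.1f}, p95={received.get_quantile(0.95):.1f}")
```

Because such sketches are mergeable, summaries computed on many machines or over many time windows can be combined before analysis, which helps the approach scale.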
The system enables customers to set up dashboards that track the distribution of input features (in this case, body temperature, red blood cell count, and so on) and alerts them when the distribution shows anomalies. Software that runs on the hospital’s server collects data from multiple patients and transmits only the aggregate statistics to the cloud. In this way, the system can look for anomalies without receiving any individual’s data.
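As an illustration of the kind of check that can run on aggregate statistics alone, here is a generic sketch that compares a baseline histogram of a feature against the current window’s histogram and raises an alert when they diverge. The distance metric and threshold are arbitrary examples, not WhyLabs’ actual drift-detection method.

```python
# Illustrative drift check on aggregate statistics only: compare a baseline
# histogram against the current window's histogram and alert when they diverge.
def normalize(counts):
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

def drift_score(baseline_counts, current_counts):
    """Total variation distance between two binned distributions (0 = identical, 1 = disjoint)."""
    p, q = normalize(baseline_counts), normalize(current_counts)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Histogram bin counts for body temperature, aggregated on the hospital's server.
baseline = [5, 120, 340, 80, 10]   # e.g., last month's distribution
current = [2, 40, 150, 160, 90]    # today's distribution, skewed toward higher values

score = drift_score(baseline, current)
if score > 0.25:                   # alert threshold chosen arbitrarily for illustration
    print(f"Data drift alert: distance {score:.2f} exceeds threshold")
```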
This is useful for detecting not only data drift but also data-quality problems. Let's say the hospital shifts to a more precise body temperature notation and leaves the old temperature field empty. The system would monitor the fraction of missing temperature values across all patients and alert the hospital that this field is frequently empty. This enables monitoring of critical data-quality markers such as the following (a brief code sketch of these checks appears after the list):
- missing value ratio
- volume (the amount of data arriving from each department; a sudden drop in volume from one department may indicate a data pipeline issue there)
- cardinality (detecting new values added to a categorical data field)
- schema (which can catch changes in data types and formats, such as nine-digit postal codes entered into a field intended for five-digit postal codes)
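Here is a brief, self-contained sketch of how such markers might be computed on the hospital’s server so that only the aggregates, not the records themselves, are transmitted for monitoring. The field names, records, and schema rule are hypothetical.

```python
# Illustrative data-quality profile computed at the data source; only these
# aggregate markers would be sent to the monitoring service. Field names,
# records, and the postal-code rule are hypothetical examples.
import re

records = [
    {"dept": "ER",       "temperature": 37.2, "postal_code": "98101"},
    {"dept": "ER",       "temperature": None, "postal_code": "98101-1234"},  # nine-digit code
    {"dept": "oncology", "temperature": None, "postal_code": "98102"},
]

def profile(records, field):
    values = [r.get(field) for r in records]
    present = [v for v in values if v is not None]
    return {
        "volume": len(values),                            # rows seen for this field
        "missing_ratio": 1 - len(present) / len(values),  # fraction of empty values
        "cardinality": len({str(v) for v in present}),    # distinct values; new categories show up here
        "schema_violations": sum(                         # e.g., postal codes that aren't five digits
            1 for v in present
            if field == "postal_code" and not re.fullmatch(r"\d{5}", str(v))
        ),
    }

for field in ("dept", "temperature", "postal_code"):
    print(field, profile(records, field))
```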
In the data-centric approach to building a machine learning system, our job isn’t done when we deploy a model. We still need to watch out for and address post-deployment issues. Too many teams don’t continuously monitor their models after deploying them because they’re concerned about complexity or privacy. This leads to outdated models that may perform poorly for weeks or months before the problem is detected.
In some tasks, complete privacy may not be possible when working with a SaaS provider, but WhyLabs’ approach (which includes open source tools) preserves privacy while logging and monitoring. I hope we continue to invent techniques that enable AI systems to process data in the cloud while maximizing the degree of privacy we can offer to users and customers.
Keep learning!
Andrew