Berkeley, CA – Under recent privacy regulations such as the EU’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), organizations face the challenge of giving data subjects more transparency and control over their data. When organizations struggle to meet that challenge, data subjects have poor experiences and grow frustrated at not being able to exercise their rights easily, and the organization suffers distrust in and damage to its brand. Because data management practices were not designed with privacy in mind, a key hurdle is determining which of the systems and services an organization uses could contain personal data.

Privacy management company DataGrail partnered with a group of five UC Berkeley students to build an ensemble classification model. Using machine learning and natural language processing, the students automated the time-consuming and tedious process of detecting which systems and services contain private data. The model calculates the text similarity between the input text and the names of known systems, then returns the names of the systems it predicts are present.
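The idea of matching input text against known system names can be sketched in a few lines of Python. This is only an illustration, not the team's ensemble model: the similarity measure (the standard library's `difflib.SequenceMatcher`), the `KNOWN_SYSTEMS` list, the `predict_systems` function, and the 0.8 threshold are all assumptions made for this example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of known system names, for illustration only.
KNOWN_SYSTEMS = ["Okta", "Google", "Marketo", "Hubspot"]


def similarity(a: str, b: str) -> float:
    """Return a case-insensitive similarity score between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def predict_systems(text, known=KNOWN_SYSTEMS, threshold=0.8):
    """Return known system names that closely match some word in the text.

    The threshold is an assumed cutoff, not a value from the project.
    """
    # Split the snippet into lowercase alphanumeric tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    matches = []
    for name in known:
        # Score the system name against every token and keep the best score.
        best = max((similarity(tok, name) for tok in tokens), default=0.0)
        if best >= threshold:
            matches.append(name)
    return matches


print(predict_systems("login via okta sso for hubspot.com users"))
print(predict_systems("invoice from markto"))  # misspelling of Marketo
```

A fuzzy score rather than an exact string match lets a sketch like this tolerate small misspellings in short snippets, at the cost of possible false matches when the threshold is set too low.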

“Our first task upon receiving the datasets from our mentor was to manually identify systems within short and uninformative snippets of text, which is not far from the traditional methods,” says team member Armine Nersisyan. 

Having grappled with systematic human error firsthand, the DG Team came to understand how essential a data map is for an organization that is struggling to comply with privacy regulations. With precision and recall of nearly one for core systems such as Okta, Google, Marketo, and Hubspot, the model shows a lot of promise. Not only does it capture most systems, but it also returns fairly accurate predictions of the true system name. The model's performance is illustrated through a sleek user interface (UI) that the DG Team created.
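For readers unfamiliar with the two metrics, precision is the share of predicted systems that are truly present, and recall is the share of truly present systems that the model finds. The sketch below computes both for a single made-up example; the sets and the `precision_recall` helper are hypothetical and are not the project's data or evaluation code.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for a set of predicted system names."""
    true_positives = len(predicted & actual)
    # Precision: of everything predicted, how much was correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: of everything actually present, how much was found.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Made-up example: the model finds three of the four systems present
# and predicts nothing that is absent.
predicted = {"Okta", "Google", "Marketo"}
actual = {"Okta", "Google", "Marketo", "Hubspot"}

p, r = precision_recall(predicted, actual)
print(f"precision={p:.2f} recall={r:.2f}")
```

A score of one on both metrics would mean every predicted system is correct and no system present is missed.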

Given the complexity of the problem, the model is not yet ready to be commercialized. The DG Team says that more needs to be done to create a comprehensive list of known systems. This may mean web scraping popular domains to gather the names of software and services. The team also hopes to gain access to metadata from a larger set of core systems on which to train the model.

Once the classification model for identifying systems is complete, the DG Team can turn to identifying private data within these systems. Only after that second step is complete can privacy heads and system owners ditch archaic methods and start reaping the benefits of the final product. The DG Team imagines users logging into a UI that gives them private access to the DataGrail database so they can view the flow of data between the systems they use. With a sleek and easy-to-navigate experience, the front-facing interface will make it easier for users to stay updated on their data, check for potential threats, and stay in compliance with new technology laws as they emerge.

Collaborators: DataGrail

Project by: Armine Nersisyan, Jiayuan Xu, Chuzhen Wang, Nicholas Brathwaite, and Yash Vardhan Goenka.

GitHub repo: