Engineered Influence: Weak Data, Machine Learning & Behavioral Economics

This article will be published in the 2017 Sutardja Center for Entrepreneurship & Technology’s annual journal AIR (Applied Innovation Review) in June 2017. You can see the 2016 version here.

Shomit Ghose is a UC Berkeley alum and mentor, venture capitalist, and partner at ONSET Ventures.

Cognitive Irredentists Arise!

A dystopian world: sentient machines manipulating human consciousness to harvest our energy. All while we humans remain docile, unwitting and oblivious. This was the world depicted in the 1999 Hollywood blockbuster, The Matrix¹. How far in the future might such a world be? Er, perhaps not so distant. Precursors of The Matrix may already be upon us.

Human beings today carry with them a loosely attached (so far²) second brain, also known as the mobile phone. This second brain serves as a quiet collector of all manner of 60x60x24x7 information about us and passes that information to vast banks of computers in the cloud, which process it in opaque ways. Our subsequent actions – whether it’s to forgo that chocolate croissant, read a specific piece of content, buy a specific product, turn left at the intersection, or mete out a jail sentence³ – are governed by the results of that processing. In effect, computer algorithms that a small group of humans have programmed are now subtly directing and “programming” the actions of the broader human population.

Is this a concern? It should be. Today, for the first time, the mobile Internet makes it possible for individual actions to be tracked and assessed in real time, at population scale, and for data-driven algorithms to be deployed to influence our future actions. All of this is made possible by the track-your-every-move⁴ nature of the Internet, smartphone ubiquity, the ability to use machine learning to build statistical correlations on huge volumes of “weak” data, an “asymmetricity of information” advantage in favor of those who collect the data, and the science of behavioral economics.

Data Signals Everywhere

Until recently, due to the limitations of our computing infrastructure, data-driven applications were based principally on “strong” data, i.e., data that was fairly finite in volume, very specific, and fit neatly into a relational database: your electronic medical record indexed by your name and birthdate; your driving record indexed by your driver’s license number; your salary and tax records indexed by your social security number; your purchase history indexed by your credit card number; etc. By and large, strong data has been (relatively) well regulated from a privacy perspective and (somewhat) well protected.

Weak data, by comparison, is data that is vast in volume and, by itself, very fuzzy and ambiguous; historically, it’s been next to impossible to make any sense of weak data because it’s, well, so weak. Strong data, for example your birthdate and the make of your car, will be predictive of your future auto insurance claims. But what do we make of weak data that tells us that you like to eat meat and drink milk, other than that you’re probably not a vegan? While strong data could be used to understand you individually, weak data could not.

But enter limitless amounts of cheap storage and computing in the cloud, fold in machine learning algorithms of the unsupervised variety, and weak data can now be statistically correlated and stitched together to yield individualized results that may be just as predictive as those from strong data.
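The statistical intuition behind that claim can be sketched in a few lines. This is a deliberately simplified illustration, not the unsupervised methods described above: it assumes the signals are independent (the naive Bayes assumption) and that each one is barely informative. The function name `posterior`, the likelihood ratio of 1.2, and the count of 30 signals are all hypothetical values chosen for illustration.

```python
import math


def posterior(prior, likelihood_ratios):
    """Combine independent signals by summing their log-odds contributions.

    prior: probability of the trait before seeing any signal (0 < prior < 1).
    likelihood_ratios: for each signal, how much more likely it is to appear
    when the trait is present than when it is absent (hypothetical values).
    """
    log_odds = math.log(prior / (1 - prior))
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)
    # Convert the accumulated log-odds back into a probability.
    return 1 / (1 + math.exp(-log_odds))


# One weak signal barely moves a 50/50 prior.
one_signal = posterior(0.5, [1.2])

# Thirty equally weak signals, taken together, are close to decisive.
many_signals = posterior(0.5, [1.2] * 30)

print(f"one weak signal:   {one_signal:.3f}")
print(f"thirty weak signals: {many_signals:.3f}")
```

Real-world signals are rarely independent, so actual gains would be smaller than this toy arithmetic suggests, but the direction of the effect is the point: volume can compensate for weakness.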

Want to find a college-educated professional who’s likely a Republican? Talk to a surgeon or anesthesiologist⁵. Psychiatrist or infectious disease physician? Likely a Democrat. Do you eat lots of red meat and drink lots of milk? Without seeing your driving record we now have a good idea about your auto insurance risk⁶. Properly harnessed and correlated, weak data can yield all manner⁷ of intimate insights⁸ about individuals. Are you an extrovert⁹ or an introvert? Can your browsing behavior be used to gain insights on your specific personality¹⁰ traits or your age and gender¹¹? Ethnicity¹²? Are you gay¹³? What does your streaming music¹⁴ playlist say about your cognitive abilities? Suffering from depression¹⁵? Psychopathic¹⁶ tendencies? And is your Echo or Alexa¹⁷ device currently listening to the conversations¹⁸ in your home?

The Ghost in the Machine

As insightful – and more importantly, as intrusive – as these individualized conclusions may be, weak data has not been regulated, nor is it well protected for privacy. Weak data confers a powerful advantage from a business point of view because companies and organizations can utilize unsupervised machine learning techniques to comb through huge, heretofore intractable amounts of data and find correlations and insights that would otherwise be imperceptible and unknown to the human mind. From a competitive standpoint, data sources and machine learning algorithms have been weaponized by companies at the leading edge of business.

An obvious weakness of the data-driven model is that machines learn from the underlying data – e.g., girls become teachers, boys become engineers – and that data may already contain biases¹⁹, thereby perpetuating inequity. Machines can also be mis-trained²⁰ by social engineering attacks. How can we trust the accuracy of a machine’s decisions if we cannot vouch for the validity and balance of the data on which it was trained? Furthermore, the companies that control the data apply their own proprietary algorithms to their data asset for one purpose only: to maximize their profit. This is the rightful goal of any business, of course. But the consumer must be aware that if there is a large number of product pages, or news articles, or routes to a physical destination available, a data-driven business will not be curating/selecting the choices presented to you based on what optimizes your benefit but on what optimizes the business’ profit²¹.

In this way, those who control the data enjoy the advantage of an asymmetry of information: we consumers don’t know on what basis a decision is made on our behalf. We neither know the body of the underlying data – the full range of our democratically available choices – nor have any understanding of the opaque algorithms being used to process that data. And all of this data is increasingly concentrated in just a few hands: Google, Facebook, Amazon. Facebook famously experimented²², through selective content exposure, with manipulating the emotions of almost 700,000 of its users to show

“…that emotional states can be transferred to others via emotional contagion, leading people to experience the same emotions without their awareness. [Facebook provided] experimental evidence that emotional contagion occurs without direct interaction between people…”

Matrix-like, information asymmetricity enables our actions to be programmed and manipulated by a handful of organizations²³ that have access to, and an understanding of, data and correlations that we individuals do not.

Who Gets to Choose?

Behavioral economics is what finally ties weak (and strong) data, machine learning, and information asymmetry into a neat little cybernetic loop: data is collected, data is analyzed, and the “correct” (from the business’ point of view, at least) set of actions is fed back to us to inform our ongoing behavior, which will then also be collected and analyzed.

Behavioral economics is a potent tool for those holding an information asymmetry advantage because the Internet presents a seemingly infinite set of choices to individuals. But humans dislike choosing from large sets of choices. We much prefer choosing from finite sets²⁴. Whether it’s the forward-decisioning of the “Keep watching” default²⁵ that keeps you bingeing on a streaming video service, or presenting what brand of drug²⁶ a physician might prescribe, we humans like the range of our possible decisions to be defined or limited. Too much choice is seen as being a “tyranny of choice²⁷”.

So if you’re a large company, sitting on a massive trove of data, with highly predictive machine learning algorithms whirring away, and hundreds of possible news stories or products or promotions to present to an individual consumer (information asymmetry in action), which half-dozen options do you choose to show? The half-dozen that would be of maximum benefit to the individual, or the half-dozen that would most benefit your profits? It’s likely that the company’s “choice architecture²⁸” will be designed, in keeping with its duty to its shareholders, to optimize for profit, even at the explicit expense of the benefit to the individual consumer. Through choice architecture the actions of the individual become programmed²⁹ by the data and the algorithms.
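A toy sketch makes the question concrete. Everything in it is hypothetical: the `Option` record, the `top_k` helper, and the made-up benefit and profit scores are illustrative assumptions, not a description of how any real company ranks content. The only point is that the same catalog, filtered down to six choices, can look entirely different depending on whose interest the ranking serves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    user_benefit: float     # hypothetical score: value to the consumer
    business_profit: float  # hypothetical score: value to the company


def top_k(options, score, k=6):
    """Return the names of the k options with the highest score."""
    ranked = sorted(options, key=score, reverse=True)
    return [option.name for option in ranked[:k]]


# A made-up catalog of 100 options. The profit score is an arbitrary
# scrambling of the benefit score, so the two rankings disagree.
catalog = [
    Option(name=f"item{i}", user_benefit=i, business_profit=(37 * i) % 100)
    for i in range(100)
]

shown_for_user = top_k(catalog, score=lambda o: o.user_benefit)
shown_for_business = top_k(catalog, score=lambda o: o.business_profit)

print("best for the consumer:", shown_for_user)
print("best for the business:", shown_for_business)
```

The consumer sees only the six options that are displayed, and has no way to know which scoring function produced them.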

The Best of Times, the Worst of Times

Needless to say, all human inventions, Big Data included, can be used for beneficent or maleficent purposes: you can use a brick to build a house or you can throw it at someone’s head; you can cut your food with a knife or stab your fellow diner; etc. The same is true of data and algorithms; and Pandora’s Box will yawn ever wider with the volumes and sources³⁰ of data continuing to skyrocket. But just as we should not outlaw bricks, or butter knives, or cars for that matter, neither should we proscribe Big Data.

Data and algorithms can be – and are – used today to level the playing field and bring benefit, at scale, to underserved³¹ or under-resourced segments across the human population. Everything from financial services³², to education³³, to medical care³⁴ can be delivered at low cost, scalably, and across geographic boundaries by harnessing data, machine learning, and even behavioral economics. In this way, data and algorithms can provide a “status hack” to improve access to the resources people need. In no way should such uses of data and algorithms be fettered or circumscribed.

What we need is awareness of both the opportunities for light and the risks of darkness inherent in our (inescapable) data-driven future; we must not resign³⁵ ourselves to the latter fate. We must recognize the many dangers – to privacy at a minimum, and to our ability to choose freely at a maximum – posed by the mass collection and analysis of even the most trivial-seeming shreds of data. We must therefore always be acutely aware of the privacy implications of our intentional and unintentional data trails, and we must demand transparency from those who control and act on our data sources. We must always maintain awareness and skepticism of opaque algorithms deployed by the few for the “benefit” of the many. We must swallow Morpheus’ red pill.

1. “The Matrix”. 2017. Online. https://en.wikipedia.org/wiki/The_Matrix
2. Constine, Josh. “Facebook is building brain-computer interfaces for typing and skin-hearing”. TechCrunch. April 2017. Online. https://techcrunch.com/2017/04/19/facebook-brain-interface/
3. Angwin, Julia. “Machine Bias”. ProPublica. May 2016. Online. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
4. Englehardt S., Narayanan A. “Online Tracking: A 1-million-site Measurement and Analysis”. http://randomwalker.info/publications/OpenWPM_1_million_site_tracking_measurement.pdf
5. Sanger-Katz, Margot. “Your Surgeon Is Probably a Republican, Your Psychiatrist Probably a Democrat”. New York Times. October 2016. Online. https://www.nytimes.com/2016/10/07/upshot/your-surgeon-is-probably-a-republican-your-psychiatrist-probably-a-democrat.html?_r=1
6. Evans P., Forth P. “Navigating a World of Digital Disruption”. Boston Consulting Group. 2017. Online. http://digitaldisrupt.bcgperspectives.com/
7. Kotikalapudi R., Chellappan S., Montgomery F., Wunsch D., Lutzen K. “Associating Internet Usage with Depressive Behavior Among College Students”. IEEE Technology and Society Magazine. Winter 2012. http://web.mst.edu/~chellaps/papers/TSM.pdf
8. Quercia D., Kosinski M., Stillwell D., Crowcroft J. “Our Twitter Profiles, Our Selves: Predicting Personality with Twitter”. https://www.cl.cam.ac.uk/~dq209/publications/quercia11twitter.pdf
9. Gosling S., Augustine A., Vazire S., Holtzman N., Gaddis S. “Manifestations of Personality in Online Social Networks: Self-Reported Facebook-Related Behaviors and Observable Profile Information”. September 2011. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3180765/
10. Kosinski M., Stillwell D., Kohli P., Bachrach Y., Graepel T. “Personality and Website Choice”. June 2012. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/person_WebSci_final.pdf
11. Hu J., Zeng H.-J., Li H., Niu C., Chen Z. “Demographic Prediction Based on User’s Browsing Behavior”. 2007. https://www2007.org/papers/paper686.pdf
12. Kosinski M., Stillwell D., Graepel T. “Private traits and attributes are predictable from digital records of human behavior”. PNAS. October 2012. http://www.pnas.org/content/110/15/5802.full
13. Jernigan C., Mistree B. “Gaydar: Facebook friendships expose sexual orientation”. First Monday. October 2009. http://pear.accc.uic.edu/ojs/index.php/fm/article/view/2611/2302
14. Rentfrow P., Gosling S. “The Do Re Mi’s of Everyday Life: The Structure and Personality Correlates of Music Preferences”. Journal of Personality and Social Psychology. 2003. https://pdfs.semanticscholar.org/1364/53addebb04b046e06a524c19fa4e891ea7ae.pdf
15. “How an Algorithm Learned to Identify Depressed Individuals by Studying Their Instagram Photos”. MIT Technology Review. August 2016. Online. https://www.technologyreview.com/s/602208/how-an-algorithm-learned-to-identify-depressed-individuals-by-studying-their-instagram/
16. Hancock J., Woodworth M., Porter S. “Hungry like the wolf: A word-pattern analysis of the language of psychopaths”. The British Psychological Society. 2011. https://pdfs.semanticscholar.org/f7b9/cddeb56741f5bae0e9ffec7a901967cdd03d.pdf
17. Google. “Tomato, tomahto. Google Home now supports multiple users.” April 2017. Online. https://blog.google/products/assistant/tomato-tomahto-google-home-now-supports-multiple-users/
18. Schwartz H., Eichstaedt J., Kern M., Dziurzynski L., Ramones S., Agrawal M., Shah A., Kosinski M., Stillwell D., Seligman M., Ungar L. “Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach”. PLOS One. September 2013. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0073791
19. Caliskan A., Bryson J., Narayanan A. “Semantics derived automatically from language corpora contain human-like biases”. Science. April 2017. http://science.sciencemag.org/content/356/6334/183.full
20. Victor, Daniel. “Microsoft Created a Twitter Bot to Learn From Users. It Quickly Became a Racist Jerk.” New York Times. March 2016. Online. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html
21. Titcomb, James. “Facebook showed advertisers it could tell when teenagers were emotionally vulnerable”. The Telegraph. May 2017. Online. http://www.telegraph.co.uk/technology/2017/05/01/facebook-exploited-emotionally-vulnerable-teenagers-sell-adverts/
22. Kramer A., Guillory J., Hancock J. “Experimental evidence of massive-scale emotional contagion through social networks”. PNAS. October 2013. http://www.pnas.org/content/111/24/8788.full
23. Stanley, Jay. “China’s Nightmarish Citizen Scores Are a Warning For Americans”. American Civil Liberties Union. October 2015. Online. https://www.aclu.org/blog/free-future/chinas-nightmarish-citizen-scores-are-warning-americans
24. Iyengar S., Lepper M. “When Choice is Demotivating: Can One Desire Too Much of a Good Thing?” Journal of Personality and Social Psychology. 2000. https://faculty.washington.edu/jdb/345/345%20Articles/Iyengar%20%26%20Lepper%20(2000).pdf
25. Pittman M., Sheehan K. “Sprinting a media marathon: Uses and gratifications of binge-watching television through Netflix”. First Monday. October 2015. http://firstmonday.org/ojs/index.php/fm/article/view/6138/4999
26. “Changing default prescription settings in EMRs increased rates of generic drugs, study finds”. Science Daily. May 2016. Online. https://www.sciencedaily.com/releases/2016/05/160509191841.htm
27. Schwartz, Barry. “The Tyranny of Choice”. Scientific American. April 2004. https://www.swarthmore.edu/SocSci/bschwar1/Sci.Amer.pdf
28. Thaler R., Sunstein C., Balz J. “Choice Architecture”. https://www.sas.upenn.edu/~baron/475/choice.architecture.pdf
29. Rosenblat A., Stark L. “Algorithmic Labor and Information Asymmetries: A Case Study of Uber’s Drivers”. International Journal of Communication. 2015. https://starkcontrastdotco.files.wordpress.com/2016/08/4892-21331-1-pb.pdf
30. Ghose, Shomit. “Securing Your Largest USB-Connected Device: Your Car”. ODBMS.org. March 2016. Online. http://www.odbms.org/2016/03/securing-your-largest-usb-connected-device-your-car/
31. McClelland, Colin. “Phone Stats Unlock a Million Loans a Month for Africa Lender”. Bloomberg. September 2015. Online. https://www.bloomberg.com/news/articles/2015-09-23/phone-stats-unlock-a-million-loans-each-month-for-african-lender
32. Lohr, Steve. “ZestFinance Takes Its Big Data Credit Scoring to China”. New York Times. June 2015. Online. https://bits.blogs.nytimes.com/2015/06/26/zestfinance-takes-its-big-data-credit-scoring-to-china/
33. Bienkowski M., Feng M., Means B. “Enhancing Teaching and Learning Through Educational Data Mining and Learning Analytics”. US Department of Education. October 2012. https://tech.ed.gov/wp-content/uploads/2014/03/edm-la-brief.pdf
34. Ghose, Shomit. “Continuous Healthcare: Big Data and the Future of Medicine.” VentureBeat. June 2015. Online. https://venturebeat.com/2015/06/21/continuous-healthcare-big-data-and-the-future-of-medicine/
35. Turow J., Hennessy M., Draper N. “The Tradeoff Fallacy: How Marketers Are Misrepresenting American Consumers And Opening Them Up to Exploitation.” Annenberg School for Communication – University of Pennsylvania. June 2015. https://www.asc.upenn.edu/sites/default/files/TradeoffFallacy_1.pdf
