Anonymous ≠ Private
Mary’s recent series about privacy has been great, reflecting some fairly robust “discussions” we have had internally about where our responsibilities lie as a tool vendor. Keep in mind that we are an analytics platform and vertical application provider and not a destination website or portal. When we talk about data privacy, our primary focus is on ensuring that our customers’ data is protected.
Our primary contribution to privacy, and our core responsibility, is to provide our customers with secure, well-tested code, stored in world-class colocation facilities and administered by the most experienced and ethical people available, so that the data our customers designate as private remains that way. In some of the markets we address, the law defines the privacy playing field for us: HIPAA for health-care analytics, for example. In other industries we have more wiggle room, and it comes down to what is morally acceptable to us. For us, issues of “morality” usually arise in one of two areas:
- Who are we willing to sell to?
- What data sets will we use for demonstration purposes?
Who we are willing to sell to has not been an issue for us—yet. All of our customers and prospects are organizations that I am proud to be associated with; they have a net positive impact on the world and a deep respect for, and understanding of, the need to do no harm. But as we grow, I am sure we will face some hard decisions. If we do, I hope to follow the example set by a former Informix colleague early in my career. He was asked to help sell the Informix database to the Iraqi secret service after a salesperson who was of Jewish descent refused. He did some research (this was before the invasion of Kuwait) and decided that, financial considerations aside, he was not going to be involved in a sale so morally dubious. It is an example that stuck with me and one I hope to emulate.
Which publicly available data sets we will use to demonstrate and promote our framework is a topic we are dealing with right now. As Steve Michaels points out in this very good post, anonymity no longer equals privacy. That fact was amply demonstrated in 2006 by AOL’s thoughtless release of 20 million “anonymized” search queries, which were easily used not only to identify who issued them but also to reveal intimate details of their lives, from pet ownership to ethnicity. Given the amount of publicly available data, the growing number of public APIs to popular websites, and the rise of e-government and cloud-based analytic engines like ours, it is becoming much easier to pierce the privacy shield and learn more about individuals than they want to share. This is not all bad, since learning more about your customers allows you to be a better vendor, but as a society we have yet to find an appropriate balance.
While Eric Schmidt, the soon-to-be ex-Google CEO, has been pilloried in the press for some of his statements about privacy, I think he has done us all a public service by helping the public understand the dissection of our lives that is possible with big data and next-generation analytic frameworks. The horses may have left the stable, but we still have time to build a viable moral and regulatory corral around them.
There may be technical solutions that help as well. For example, a great deal of research money has been spent on how to generate and share anonymized medical data that can still be analyzed well enough to save lives. As the AOL example shows, simply removing a patient’s name is clearly not enough. One-way encryption, time shifting, location shifting, aggregation, and secure access must all be combined to get useful clinical data without compromising patient privacy. While progress has been made, this remains one of the most vexing barriers to true evidence-based medicine around the world. It is a problem we need to solve, since broadly based EBM would save countless lives and dollars.
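To make those techniques concrete, here is a minimal Python sketch of three of them: a keyed one-way hash to pseudonymize a patient identifier, a per-patient random date shift that preserves intervals between a patient’s own events, and ZIP-code generalization. The field names, key, and record are illustrative assumptions, and this is a teaching sketch, not a compliant de-identification recipe.

```python
import hashlib
import hmac
import random
from datetime import date, timedelta

# Illustrative secret; in practice this would live in a key store,
# not in source code.
SECRET_KEY = b"rotate-me-and-keep-out-of-source-control"

def pseudonymize_id(patient_id: str) -> str:
    """One-way keyed hash: a stable pseudonym that cannot be
    reversed without the key."""
    return hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256).hexdigest()[:16]

def shift_date(d: date, patient_id: str, max_days: int = 180) -> date:
    """Time shifting: one random offset per patient hides true dates
    but preserves intervals between that patient's events."""
    rng = random.Random(pseudonymize_id(patient_id))  # deterministic per patient
    return d + timedelta(days=rng.randint(-max_days, max_days))

def generalize_zip(zip_code: str) -> str:
    """Location generalization: keep only the 3-digit ZIP prefix."""
    return zip_code[:3] + "XX"

# Hypothetical clinical record
record = {"patient_id": "MRN-004217", "admitted": date(2010, 3, 14), "zip": "94103"}
deidentified = {
    "pseudo_id": pseudonymize_id(record["patient_id"]),
    "admitted": shift_date(record["admitted"], record["patient_id"]),
    "zip3": generalize_zip(record["zip"]),
}
print(deidentified)
```

Even a sketch like this shows why the techniques must be combined: the shifted dates and coarse ZIP still carry analytic value, while no single field identifies the patient on its own.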
Since I will be reviewing code over the next month to help finish our next release, expect my upcoming posts to focus on our architecture and some of the changes we made to make PAF more efficient for streaming/real-time analytics without hurting our batch performance.