Strata West, Law, Ethics, and Open Data: Smart People Solving Some Very Hard Problems
Last week the Bay Area was treated to another great Strata West hosted by the O’Reilly team. For those of you who weren’t able to make it, keep checking strataconf.com for updates on the videos and speaker slides—one of the great things about this conference is that many of the sessions are available to anyone, as are the videos and slides.
I had the pleasure of co-hosting the Law, Ethics, and Open Data track with my friend and fellow O’Reilly author (and Civilization devotee), Alex Howard. Alex is O’Reilly’s government reporter and his book, Data for the Public Good, is a must-read. Our track was two days long and featured thoughtful sessions and speakers, bringing together people who are solving difficult technology problems and then showing us how those problems and solutions are impacting lives and society. If you check out my tweets from last week you’ll see my 140-character attempts to highlight some of the sessions. Here is a “longer” version of the highlights of the sessions I hosted:
- Fred Trotter and DocGraph—Fred actually tweeted his presentation as he was giving it, so check out @fredtrotter for last Thursday starting around 10:40 am PST. A presentation of 140-character sound bites made for a very succinct message. He’s done some amazing work creating the DocGraph, probably the largest public social graph in the world, showing the referral relationships between doctors in the US. You can view a nice visualization his team has done here.
- MailChimp’s Email Genome Project—John Foreman gave a funny and engaging talk on how MailChimp uses the Email Genome Project (EGP) to prevent abuse and catch bad guys. His job is to weed out not just the classical spammer but incompetent users as well. One fascinating tidbit: John said that they were able to tell when a site has been hacked by the surge in spam directed at them via MailChimp. John’s session was also the perfect story of a company finding itself with a business model that created a huge amount of data that they could then turn around and use to improve their business. For my fellow data geeks, check out the MailChimp blog – it often has great insights about our email ecosystem derived from the EGP.
- Data Journalism with no Open Data—Sandra Crucianelli (International Center for Journalists) and Angélica Peralta Ramos (La Nacion Newspaper) described their efforts as data journalists to drive transparency in governance in Argentina. Their biggest challenge: No Freedom of Information Act and no tradition of open data. It was a story of PDF scraping and using whatever tools you can afford to get the “Truth out There.” They made a nice video summarizing their story. Their urgent request to the open source tool developers, whom they rely on because of limited budgets, is to think globally! They have a real need for open source data tools that have multi-lingual training and documentation.
- The Biggest Dataset in the World—This was probably my favorite of the day. It was a panel made up of three luminaries from the very small community that is doing web-scale search: Lisa Green from Common Crawl, Greg Lindahl from blekko, and Kevin Burton from Spinn3r. Their talk on how to view and use the Web as an infinitely sized, real-time dataset was fascinating. And their passionate defense of a free and open Internet was inspiring. The tragic case of Aaron Swartz is a sobering example of how tenuous these freedoms are without our vigilance.
- Finding the Needle in the Haystack (without poking yourself!)—Dean Malmgren from DataScope Analytics presented a great case study from an eDiscovery client engagement. His session and study reminded me that big data is just like any other IT project. It begins with understanding your customers and their capabilities—this is a critical input to any big data project and is often “forgotten” in the midst of all the big data marketing. Given the plethora of programming tools and research projects disguising themselves as enterprise-ready products, it was refreshing to hear from someone who is both quantitative and focused on customer experience. This is a topic that Mary and Marilyn from PBI will be touching on quite a bit in their new series.
Well, I’m heading off to SXSW Interactive to talk Privacy and Big Data. If you know of any panels I should check out, want to know how PatternBuilders can solve your big data problems, or just want to grab a beer, tweet me @terencecraig. This is my first time back in Texas in a decade and I’m looking forward to grabbing some decent BBQ and saying y’all a lot.