Over the weekend of May 30th and 31st, 2015, I was fortunate enough to attend the inaugural Open Data Science Conference in Boston. This conference is dedicated to promoting open source software and open data for analytics tasks, along with the open exchange of ideas around their use. I had a great time and learned a lot while I was there, so I wanted to summarize my experience and opinions of the weekend.
What Does Open Mean?
In the context of technology and analytics, the term open refers to transparency and easy access. Open source describes software (whether languages, libraries or applications) that is available at no cost and whose source code is accessible and auditable by anyone who wishes to take advantage of it. Open data is any collection of data, whether raw or processed, which can be obtained at low or no cost and for which there is no restriction placed on how it is used. Lastly, the open exchange of ideas encouraged by the organizers and participants of the conference means that practitioners in various roles around the data science industry are able to freely discuss ideas, techniques and pitfalls that they have experienced.
Why Is It Important?
Taken together, these three aspects of openness are critical and powerful for the advancement of data science for both company success and positive societal impact.
By taking advantage of open source, companies and other organizations are able to significantly reduce the costs associated with performing analyses of the data that is in their possession. In addition to the reduced expense associated with freely available languages and tools, by using open source languages you are benefiting from improvements that have been contributed back by a potentially large number of other intelligent individuals. By virtue of having access to the internals of your tools, it is possible to tweak and extend their functionality beyond what the original creators were able to conceive of.
Open data has a democratizing effect by giving everyone access to the same data sources on which to perform analyses and hone their skills. A number of the sources available for open data are governmental organizations, allowing us to investigate and validate important questions about various aspects of our daily lives. While many open data sets are supplied by government bodies, there are some private companies that are willing to make the data they have collected publicly available, as discussed by Lukas Biewald in his excellent talk. By making well-groomed data available to everyone, companies are driven to differentiate themselves by producing compelling experiences based on that data, rather than just by having information that no one else does. As described in Lukas’ presentation, making quality data sets available to the public and other companies can result in surprising and useful remixes and analyses of that data that may not have been possible if the information had remained confined to a single organization.
The positive aspects of enabling an open exchange of ideas should be obvious, but I think that they are still worth exploring to some extent. When practitioners of data science from varied industries, backgrounds and skill sets gather and engage with each other, they can all learn something new. This may be a particular algorithm or technique that they have not previously been exposed to, a set of resources that were previously unknown, or simply an appreciation for the effects (positive or negative) that the data they manipulate every day has on their users or fellow citizens.
The Power of Data
As we progress further into the new millennium, our capacity for generating and analyzing increasing amounts of data is accelerating. With this growth in information about us and the world that we live in comes the ability to discover new insights into everything from how to create more effective medicines to how to improve the efficiency of transportation networks. The bottom line is that, as data scientists and data engineers, we have the ability to effect real-world change through our investigation and interpretation of the data that is available to us.
To this end, there were a number of talks at the conference about how to use our skills for positive social impact. Ari Hamalian presented on using data to help developing nations innovate, Eric Schles talked about his work to combat human trafficking, and Peter Bull discussed how we can help non-profits in the social sector. Code for America has a series of videos about data-informed decision making for government, showcasing the various ways in which data can be used to improve the effectiveness of government at the local and national level.
As technologists, most of us are intensely aware of the need for and benefits of diversity. I was pleasantly surprised to see a large number of women and people of varied ethnicities at the conference. This is a good sign for the data science community and, by extension, those who are impacted by the work that we all do.
The conference itself was surprisingly well-run for a first attempt and I am excited to attend future events. There were some minor hiccups with the A/V equipment that delayed the start of a couple of the talks that I attended, but everyone did an excellent job of working around them. The selection of presenters and topics was top-notch and I thoroughly enjoyed all of the sessions that I attended. There were several points when I had difficulty deciding which of the talks to attend, but fortunately the majority of the sessions were recorded, so I can watch the ones that I missed once they are uploaded.
To anyone who missed this conference, I highly recommend trying to attend an upcoming event. Despite some claims to the contrary, data science is alive and well, and the Open Data Science Conference is an excellent way to meet and learn with other professionals in the field.
As an engineer with aspirations to become a data scientist, attending this conference was an incredible validation of that goal. While I don’t have a PhD in statistics, I do have a solid understanding of the technical requirements around storing, manipulating and presenting data, and there is a need for many different types of roles in the realm of data science and analytics. Not everyone is going to be the director of data science at a company like Cloudera, but there is a demonstrated need for data scientists and data engineers at all levels. There is also an increasing need for people who can automate large parts of the data manipulation process, and that is where I intend to fit in.
Thank you to everyone who helped make the conference possible! I can’t wait to come back next year.