Data scientists put in long hours to prepare information
08/24/2014 8:00 PM
08/25/2014 4:43 AM
Technology revolutions come in measured, sometimes foot-dragging steps. The lab science and marketing enthusiasm tend to underestimate the bottlenecks to progress that must be overcome with hard work and practical engineering.
The field known as Big Data offers a contemporary case study. The catchphrase stands for the modern abundance of digital data from many sources – the Web, sensors, smartphones and corporate databases – that can be mined with clever software for discoveries and insights. Its promise is smarter, data-driven decision-making in every field. That’s why data scientist is the economy’s hot new job.
Yet far too much handcrafted work – what data scientists call “data wrangling,” “data munging” and “data janitor work” – is still required. Data scientists, according to interviews and expert estimates, spend from 50 to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
“Data wrangling is a huge – and surprisingly so – part of the job,” said Monica Rogati, vice president for data science at Jawbone, whose sensor-filled wristband and software track activity, sleep and food consumption, and suggest dietary and health tips based on the numbers. “It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”
Taming the data
Several startups are trying to break through these big data bottlenecks by developing software to automate the gathering, cleaning and organizing of disparate data, which is plentiful but messy. The modern Wild West of data needs to be tamed somewhat so it can be recognized and exploited by a computer program.
“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up,” said Jeffrey Heer, a professor of computer science at the University of Washington and a co-founder of Trifacta, a San Francisco-based startup.
Timothy Weaver, the chief information officer of Del Monte Foods, calls the predicament of data wrangling big data’s “iceberg” issue, meaning attention is focused on the end result that is seen rather than all the unseen toil beneath. But it is a problem born of opportunity. Increasingly, there are many more sources of data to tap that can deliver clues about a company’s business, Weaver said.
In the food industry, he explained, the data available today could include production volumes, location data on shipments, weather reports, retailers’ daily sales and social network comments, parsed for signals of shifts in sentiment and demand.
The result, Weaver said, is being able to see each stage of a business in greater detail than in the past, to tailor product plans and trim inventory. “The more visibility you have, the more intelligent decisions you can make,” he said.
But if the value comes from combining different data sets, so does the headache. Data from sensors, documents, the Web and conventional databases all come in different formats. Before a software algorithm can go looking for answers, the data must be cleaned up and converted into a unified form that the algorithm can understand.
The big data challenge today fits a familiar pattern in computing. A new technology emerges, and initially it is mastered by an elite few. But with time, ingenuity and investment, the tools get better, the economics improve, business practices adapt, and the technology eventually gets diffused and democratized into the mainstream.
In software, for example, the early programmers were a priesthood who understood the inner workings of the machine. But the door to programming was steadily opened to more people over the years with higher-level languages from Fortran to Java, and even simpler tools like spreadsheets.
Spreadsheets made financial math and simple modeling accessible to millions of nonexperts in business. John Akred, chief technology officer at Silicon Valley Data Science, a consulting firm, sees something similar in the modern data world, as the software tools improve.
“We are witnessing the beginning of that revolution now, of making these data problems addressable by a far larger audience,” Akred said.