Energy News  
TECH SPACE
A data-cleaning tool for building better prediction models
by Staff Writers
New York NY (SPX) Sep 06, 2016


Tested on a dirty, real-world data set, ActiveClean (in red) needed to clean just 5,000 records to bring the researchers' prediction model to a 90 percent accuracy level. The next best technique, called active learning (in green), had to clean 50,000 records to achieve comparable results. The most common data-cleaning method, trial and error (in purple), provided minimal model improvement. Image courtesy Eugene Wu.

Big data sets are full of dirty data, and these outliers, typos and missing values can produce distorted models that lead to wrong conclusions and bad decisions, be it in healthcare or finance. With so much at stake, data cleaning should be easier.

That's the inspiration for software developed by computer scientists at Columbia University and University of California at Berkeley that hands much of the dirty work over to machines. Called ActiveClean, the system analyzes a user's prediction model to decide which mistakes to edit first, while updating the model as it works. With each pass, users see their model improve.

"Dirty data is pervasive and prevents people from doing useful things," said Eugene Wu, a computer science professor at Columbia Engineering and a member of the Data Science Institute. "This is our first step towards automating the data-cleaning process."

The team will present its research on Sept. 7 in New Delhi, at the 2016 conference on Very Large Data Bases. Wu helped develop ActiveClean as a postdoctoral researcher at Berkeley's AMPLab and has continued this work at Columbia.

Big data sets are still mostly combined and edited manually, aided by data-cleaning software like Google Refine and Trifacta, or custom scripts developed for specific data-cleaning tasks. The process consumes up to 80 percent of analysts' time as they hunt for dirty data, clean it, retrain their model, and repeat the process. Cleaning is largely done by guesswork.

"Will it help or hurt the model? You have no idea," said Wu. "Data scientists either clean everything, which is impossible for huge datasets, or clean random subsets and hope for the best."

In the process, statistical biases can be introduced that skew models into producing misleading results. Those mistakes may not be caught until weeks later, as the researchers learned in an earlier survey of industry data scientists.

"Most of these errors are subtle enough that the analysis will go through," said one consultant from a large database vendor. "Usually it's only caught weeks later after someone notices something like, 'Well, the Wilmington branch cannot have $1 million sales in a week.'"

ActiveClean tries to minimize mistakes like these by taking humans out of the most error-prone steps of data cleaning: finding dirty data and updating the model. Using machine learning, the tool analyzes a model's structure to understand what sorts of errors will throw the model off most. It goes after those data first, in decreasing priority, and cleans just enough data to give users assurance that their model will be reasonably accurate.
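To make the idea of cleaning "in decreasing priority" concrete, here is a minimal sketch of impact-based prioritization. It assumes a logistic regression model and scores each record by the size of the loss gradient it contributes, so records that would move the model most are cleaned first. The scoring rule, the function names and the toy data are all assumptions for illustration; the article does not describe ActiveClean's internals, and this is not its API.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_norm(weights, features, label):
    """Size of the log-loss gradient contributed by a single record."""
    score = sum(w * x for w, x in zip(weights, features))
    error = sigmoid(score) - label
    return abs(error) * math.sqrt(sum(x * x for x in features))

def cleaning_order(weights, records, budget):
    """Indices of the `budget` records with the largest gradient norm."""
    ranked = sorted(
        range(len(records)),
        key=lambda i: gradient_norm(weights, *records[i]),
        reverse=True,
    )
    return ranked[:budget]

# Hypothetical toy data: (features, label) pairs.
records = [
    ([1.0, 0.2], 1),
    ([1.0, 9.0], 0),   # suspicious outlier that the model gets badly wrong
    ([1.0, -0.1], 0),
    ([1.0, 0.3], 1),
]
weights = [0.0, 1.0]   # hypothetical current model parameters

order = cleaning_order(weights, records, budget=2)
print(order)  # indices of the records to inspect and clean first
```

In a full loop, the chosen records would be cleaned, the model retrained or updated, and the scores recomputed before the next batch is picked.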

The researchers tested ActiveClean on Dollars for Docs, a database of corporate donations to doctors that journalists at ProPublica compiled to analyze conflicts of interest and flag improper donations.

ActiveClean's results were compared against two baseline methods. One edited a subset of the data and retrained the model. The other used a popular prioritization algorithm called active learning that picks the most informative labels for ambiguous data. Unlike ActiveClean, that algorithm improves the model without checking whether the labels are accurate.
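For readers unfamiliar with active learning, the sketch below shows uncertainty sampling, one common strategy for picking "ambiguous" records: it selects the records whose predicted probability sits closest to 0.5. The article does not say which variant the researchers used as their baseline, so the function name and the probabilities here are hypothetical.

```python
def most_ambiguous(probabilities, budget):
    """Indices of the records whose predicted probability is closest to 0.5."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:budget]

# Hypothetical model outputs: probability that each record is "improper".
predicted = [0.97, 0.52, 0.10, 0.45, 0.80]

to_label = most_ambiguous(predicted, budget=2)
print(to_label)  # the two records the model is least sure about
```

Note that this ranking depends only on the model's uncertainty. Nothing in it checks whether the underlying record or its label is dirty.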

Nearly a quarter of ProPublica's 240,000 records had multiple names for a drug or company. Left uncorrected, these inconsistencies could lead journalists to undercount donations by large companies, which were more likely to have such inconsistencies.
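A small, invented example shows how name variants cause undercounting. The company names, amounts and the hand-built lookup table below are hypothetical and are not taken from the Dollars for Docs data.

```python
from collections import defaultdict

# Hypothetical records: (company name as entered, amount in dollars).
payments = [
    ("Acme Pharma", 1000),
    ("ACME Pharma Inc.", 2500),
    ("acme pharma", 500),
    ("Beta Labs", 1200),
]

# Hypothetical mapping from lowercased name variants to one canonical name.
CANONICAL = {
    "acme pharma": "Acme Pharma",
    "acme pharma inc.": "Acme Pharma",
    "beta labs": "Beta Labs",
}

def totals(rows, normalize=False):
    """Sum amounts per company, optionally merging known name variants."""
    sums = defaultdict(int)
    for name, amount in rows:
        key = CANONICAL.get(name.lower(), name) if normalize else name
        sums[key] += amount
    return dict(sums)

raw = totals(payments)                    # variants counted as separate companies
clean = totals(payments, normalize=True)  # variants merged before summing

print(raw["Acme Pharma"], clean["Acme Pharma"])
```

Before cleaning, the total under the name "Acme Pharma" covers only one of its three records. After cleaning, all three are counted together.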

With no data cleaning, a model trained on this dataset could predict an improper donation just 66 percent of the time. ActiveClean, they found, raised the detection rate to 90 percent by cleaning just 5,000 records. The active learning method, by contrast, required 10 times as much data, or 50,000 records, to reach a comparable detection rate.

"As datasets grow larger and more complex, it's becoming more and more difficult to properly clean the data," said study coauthor Sanjay Krishnan, a graduate student at UC Berkeley. "ActiveClean uses machine learning techniques to make data cleaning easier while guaranteeing you won't shoot yourself in the foot."

ActiveClean is a free, open-source tool released in August.

