Fast Data hits the Big Data fast lane
By Tony Baer
CERN Experiments Generating One Petabyte of Data Every Second
V3.co.uk (06/23/11) Dan Worth
CERN researchers generate a petabyte of data every second as they work to discover the origins of the universe by smashing particles together at close to the speed of light. However, the researchers, led by Francois Briard, store only about 25 petabytes every year, because they use filters to save just the results they are interested in. "To analyze this amount of data you need the equivalent of 100,000 of the world's fastest PC processors," says CERN's Jean-Michel Jouanigot. "CERN provides around 20 percent of this capability in our data centers, but it's not enough to handle this data." The researchers worked with the European Commission to develop the Grid, which provides access to computing resources from around the world. CERN draws on data centers from 11 different providers on the Grid, including providers in the United States, Canada, Italy, France, and Britain. The data comes from four detectors on the Large Hadron Collider that monitor the particle collisions and transmit data to the CERN computer center at 320 Mbps, 100 Mbps, 220 Mbps, and 500 Mbps, respectively.
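That filtering step is how a petabyte-per-second stream shrinks to roughly 25 petabytes of stored data a year. The sketch below is a minimal, hypothetical illustration of the idea (an invented event format and threshold, not CERN's actual trigger software): a streaming filter that keeps only the events judged interesting enough to store.

    import random

    # Hypothetical event filter: keep only "interesting" collision events.
    # Illustration of the filtering idea only, not CERN's trigger system.

    ENERGY_THRESHOLD = 95.0  # arbitrary cutoff chosen for this sketch

    def event_stream(n_events):
        """Simulate a stream of collision events with random energies."""
        for event_id in range(n_events):
            yield {"id": event_id, "energy": random.uniform(0.0, 100.0)}

    def filter_events(events, threshold=ENERGY_THRESHOLD):
        """Yield only the events worth sending to long-term storage."""
        for event in events:
            if event["energy"] >= threshold:
                yield event

    if __name__ == "__main__":
        total = 1_000_000
        kept = sum(1 for _ in filter_events(event_stream(total)))
        print(f"kept {kept} of {total} events ({kept / total:.1%})")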
New Ways to Exploit Raw Data May Bring Surge of Innovation, a Study Says
New York Times (05/13/11) Steve Lohr
Mining and analyzing large datasets will lead to a new wave of innovation, according to a new report from the McKinsey Global Institute. The report, "Big Data: The Next Frontier for Innovation, Competition and Productivity," estimates the potential benefits of using data-harvesting technologies and skills. For example, it says the technology could be worth $300 billion annually to the U.S. health care system, while retailers could use it to boost profits by 60 percent. However, the study also identifies challenges to managing big data, such as a talent and skills gap. The report estimates that the United States will need at least 140,000 more experts in statistical methods and data-analysis technologies, as well as 1.5 million more data-literate managers. "Every manager will really have to understand something about statistics and experimental design going forward," says McKinsey's Michael Chui. The use of personal location data could save consumers more than $600 billion worldwide by 2020, according to the report. Consumers will benefit most from the time and fuel savings of location-based services that monitor traffic and weather data to help drivers avoid congestion and suggest alternative routes, the report says. "It's clear that data is an important factor of production now," says McKinsey's James Manyika.
Identifying 'Anonymous' Email Authors
Concordia University (Canada) (03/07/11) Chris Atack
Concordia University researchers led by professor Benjamin Fung have developed a technique for determining the authors of anonymous emails with a high degree of accuracy. Malicious anonymous emails can "transmit threats or child pornography, facilitate communications between criminals, or carry viruses," Fung says. Although authorities can use the Internet protocol address to locate the building where an email originated, they have no way to differentiate between the several suspects who might have sent it from there. Fung's method uses speech recognition and data-mining techniques to identify an individual author. The researchers first identify the patterns found in emails written by a suspect, then filter out any patterns that also appear in the emails of other suspects, leaving only the patterns unique to that suspect. "Using this method, we can even determine with a high degree of accuracy who wrote a given email, and infer the gender, nationality, and education level of the author," Fung says. In testing, the system identified the individual author of an anonymous email with up to 90 percent accuracy.
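A toy version of that pattern-filtering step is sketched below. It illustrates only the general idea of keeping frequent patterns that are unique to one suspect, using crude word-pair features rather than the stylometric features Fung's team actually analyzes; the emails and helper names are invented.

    from collections import Counter
    from itertools import combinations

    def patterns(email):
        """Extract a crude set of 'patterns': unordered word pairs in the email."""
        words = set(email.lower().split())
        return {frozenset(pair) for pair in combinations(words, 2)}

    def unique_write_print(suspect_emails, other_suspects_emails, min_support=2):
        """Keep patterns frequent in the suspect's emails but absent from everyone else's."""
        counts = Counter(p for email in suspect_emails for p in patterns(email))
        frequent = {p for p, c in counts.items() if c >= min_support}
        seen_elsewhere = {p for email in other_suspects_emails for p in patterns(email)}
        return frequent - seen_elsewhere  # patterns unique to this suspect

    # Example usage with made-up emails.
    suspect = ["please send the report asap thanks", "send the files asap thanks"]
    others = ["please find the report attached", "thanks for the files"]
    print(unique_write_print(suspect, others))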
Powerful New Ways to Electronically Mine Published Research May Lead to New Scientific Breakthroughs
University of Chicago (02/10/11) William Harms
University of Chicago researchers are exploring how metaknowledge can be used to better understand science's social context and the biases that can affect research findings. "The computational production and consumption of metaknowledge will allow researchers and policymakers to leverage more scientific knowledge--explicit, implicit, contextual--in their efforts to advance science," say Chicago researchers James Evans and Jacob Foster. Metaknowledge researchers are using natural language processing technologies, such as machine reading, information extraction, and automatic summarization, to find previously hidden meaning in data. For example, Google researchers used computational content analysis to detect the emergence of influenza epidemics by tracking relevant Google searches, a process that was faster than the methods used by public health officials. Metaknowledge analysis also has revealed the possibility that implicit assumptions, known as ghost theories, underlie scientific conclusions even when scientists are unaware of them. Scientific ideas can become entrenched when studies continue to reproduce conclusions previously established by well-known scholars, a trend that metaknowledge can uncover, according to the researchers.
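The Google example comes down to treating query volume as an early proxy for disease activity. The sketch below is a minimal, hypothetical version of that idea: it fits a least-squares line from weekly search counts to reported case counts, using invented numbers rather than real surveillance data.

    import numpy as np

    # Hypothetical weekly data: flu-related search volume vs. reported cases.
    search_volume = np.array([120, 150, 210, 400, 650, 900, 700, 300], dtype=float)
    reported_cases = np.array([10, 14, 22, 45, 70, 98, 75, 33], dtype=float)

    # Fit reported_cases ~ a * search_volume + b with ordinary least squares.
    A = np.column_stack([search_volume, np.ones_like(search_volume)])
    (a, b), *_ = np.linalg.lstsq(A, reported_cases, rcond=None)

    # Search counts are available almost immediately, so the fitted model can
    # give an early estimate before official surveillance figures arrive.
    this_week_searches = 550.0
    print(f"estimated cases this week: {a * this_week_searches + b:.0f}")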
Sizing Samples
MIT News (08/24/10) Hardesty, Larry
Numerous scientific fields employ computers to deduce patterns in data, and Massachusetts Institute of Technology researchers led by graduate student Vincent Tan have taken a first step toward determining how much data is enough to support reliable pattern inference by modeling data sets as graphs. In the researchers' work, the nodes of the graph represent data and the edges represent correlations between them; from this point of view, a computer tasked with pattern recognition is given a set of nodes and asked to infer the weights of the edges between them. The researchers have demonstrated that graphs configured like chains and stars establish, respectively, the best- and worst-case scenarios for computers charged with pattern recognition. Tan says that for tree-structured graphs with shapes other than stars or chains, the "strength of the connectivity between the variables matters." Carnegie Mellon University professor John Lafferty notes that a tree-structured approximation of data is more computationally efficient.
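A small simulation makes the chain-versus-star comparison concrete. The sketch below is an independent illustration, not the MIT group's code: it draws samples from Gaussian models whose dependency trees are a chain and a star, then tries to recover each tree from the sample correlations with a maximum-weight spanning tree (a standard Chow-Liu-style heuristic), so recovery rates can be compared at different sample sizes.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_chain(n, d, rho=0.8):
        """n samples of d variables with chain dependencies X0 -> X1 -> ... -> X(d-1)."""
        x = np.empty((n, d))
        x[:, 0] = rng.standard_normal(n)
        for i in range(1, d):
            x[:, i] = rho * x[:, i - 1] + np.sqrt(1 - rho**2) * rng.standard_normal(n)
        return x

    def sample_star(n, d, rho=0.8):
        """n samples of d variables where X0 is a hub and the rest hang off it."""
        x = np.empty((n, d))
        x[:, 0] = rng.standard_normal(n)
        for i in range(1, d):
            x[:, i] = rho * x[:, 0] + np.sqrt(1 - rho**2) * rng.standard_normal(n)
        return x

    def max_spanning_tree(weights):
        """Prim's algorithm: build a tree greedily from a matrix of edge weights."""
        d = weights.shape[0]
        in_tree, edges = {0}, set()
        while len(in_tree) < d:
            best = max(((i, j) for i in in_tree for j in range(d) if j not in in_tree),
                       key=lambda e: weights[e])
            edges.add(frozenset(best))
            in_tree.add(best[1])
        return edges

    def recovery_rate(sampler, true_edges, n, trials=200):
        """Fraction of trials in which the true tree is recovered from n samples."""
        hits = 0
        for _ in range(trials):
            corr = np.abs(np.corrcoef(sampler(n, 5), rowvar=False))
            hits += max_spanning_tree(corr) == true_edges
        return hits / trials

    chain_edges = {frozenset((i, i + 1)) for i in range(4)}
    star_edges = {frozenset((0, i)) for i in range(1, 5)}
    for n in (20, 50, 200):
        print(n, recovery_rate(sample_chain, chain_edges, n),
              recovery_rate(sample_star, star_edges, n))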
Bringing Data Mining Into the Mainstream
New York Times (07/26/10) Lohr, Steve
A record number of corporate researchers and university scientists are attending an ACM conference on knowledge discovery and data mining, which offers papers and workshops that apply data mining to everything from behavioral targeting to cancer research. Although data mining has become a growth industry, profitably probing large data sets is still costly for companies and difficult for users. According to conference executive director Usama Fayyad, an institutional mindset that recognizes the value of data is needed to bring modern data mining into the business mainstream. The executive level must view data as a new strategic asset that can create revenue streams and businesses. Fayyad also says a translation layer of technology is needed to democratize modern data mining, and the underlying software for handling large data sets should be linked to software that ordinary people can use. Using Microsoft's Excel spreadsheet as a metaphor, Fayyad says the sophisticated data-handling layer should be "built in ways that Excel can consume the data and people can browse it."
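One hypothetical reading of that translation layer is a batch job that reduces a large data set to a small summary an ordinary spreadsheet user can open. The sketch below, with invented file and column names, aggregates a transaction log with pandas and writes a CSV that Excel can consume directly.

    import pandas as pd

    # Hypothetical translation layer: reduce a large transaction log to a small
    # summary that a spreadsheet user can browse. File and column names are invented.
    transactions = pd.read_csv("transactions.csv")  # columns: date, region, product, amount

    summary = (
        transactions
        .assign(month=pd.to_datetime(transactions["date"]).dt.to_period("M"))
        .groupby(["month", "region"], as_index=False)["amount"]
        .sum()
        .rename(columns={"amount": "total_sales"})
    )

    summary.to_csv("monthly_sales_summary.csv", index=False)  # opens directly in Excel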
Data Mining Algorithm Explains Complex Temporal Interactions Among Genes
Virginia Tech News (06/22/10) Trulove, Susan
Researchers at Virginia Tech (VT), New York University (NYU), and the University of Milan have developed Gene Ontology based Algorithmic Logic and Invariant Extractor (GOALIE), a data-mining algorithm that can automatically reveal how biological processes are coordinated in time. GOALIE reconstructs temporal models of cellular processes from gene expression data. The researchers developed and applied the algorithm to time-course gene expression datasets from budding yeast. "A key goal of GOALIE is to be able to computationally integrate data from distinct stress experiments even when the experiments had been conducted independently," says VT professor Naren Ramakrishnan. NYU professor Bud Mishra notes GOALIE also can extract entire formal models that can then be used for posing biological questions and reasoning about hypotheses. The researchers hope the tool can be used to study disease progression, aging, host-pathogen interactions, stress responses, and cell-to-cell communication.
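GOALIE's central move is relating what genes are doing in successive windows of a time course. The sketch below is only a simplified, hypothetical illustration of that windowing-and-matching idea, using random data, k-means clusters, and Jaccard overlap; it is not the published algorithm, which also draws on Gene Ontology annotations.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(1)

    # Hypothetical time-course expression matrix: 100 genes x 12 time points.
    expression = rng.standard_normal((100, 12))

    def cluster_window(data, start, end, k=4):
        """Cluster genes by their expression profile within one time window."""
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data[:, start:end])
        return [set(np.flatnonzero(labels == c)) for c in range(k)]

    # Two adjacent windows of the time course.
    early = cluster_window(expression, 0, 6)
    late = cluster_window(expression, 6, 12)

    # Relate clusters across windows by gene overlap (Jaccard similarity),
    # a crude stand-in for the cross-window relationships GOALIE extracts.
    for i, a in enumerate(early):
        j, b = max(enumerate(late), key=lambda jb: len(a & jb[1]) / len(a | jb[1]))
        print(f"early cluster {i} best matches late cluster {j} "
              f"(Jaccard {len(a & b) / len(a | b):.2f})")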
N.Y. Bomb Plot Highlights Limitations of Data Mining
Computerworld (05/05/10) Vijayan, Jaikumar
The recent failed bombing attempt in New York City shows the limitations of data-mining technology when used in security applications. Since the terror attacks of Sept. 11, the U.S. government has spent tens of millions of dollars on data-mining programs that are used by agencies to identify potential terrorists. The Department of Homeland Security's Automated Targeting System assigns terror scores to U.S. citizens, and the Transportation Security Administration's Secure Flight program analyzes airline passenger data. However, it is unclear how effective these programs have been in identifying and stopping potential terrorist threats. Using data mining to search for potential terrorists is similar to looking for a needle in a haystack, says BT chief security officer Bruce Schneier. "Data mining works best when there's a well-defined profile you're searching for, a reasonable number of attacks per year, and a low cost of false alarms," Schneier says. However, even the most accurate and finely tuned data-mining system will generate one billion false alarms for each real terrorist plot it uncovers, he says.
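Schneier's point is the base-rate problem, and the arithmetic is easy to spell out. The figures below are illustrative assumptions, not numbers from the article or from Schneier's own estimate: even a system that wrongly flags only one record in a thousand, run over a billion records a year containing at most a single real plot, buries investigators in false leads.

    # Back-of-the-envelope base-rate arithmetic (illustrative assumptions only).
    false_positive_rate = 0.001                 # wrongly flags 1 record in 1,000
    records_screened_per_year = 1_000_000_000   # assumed volume of screened records
    real_plots_per_year = 1                     # assumed

    false_alarms = false_positive_rate * records_screened_per_year
    print(f"false alarms per year: {false_alarms:,.0f}")
    print(f"false alarms per real plot: {false_alarms / real_plots_per_year:,.0f}")

Improving the false-positive rate helps only linearly, while the number of real plots stays near zero, which is why even a finely tuned system produces overwhelming numbers of false leads.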
'Big Data' Can Create Big Issues
Investor's Business Daily (04/08/10) P. A4; Bonasia, J.
Tech firms are approaching the challenge of mining "big data"--immense repositories of information generated by industry and government--with predictive analytics software that detects trends and anticipates coming events. Applications of predictive analytics include sifting credit card transactions to spot fraud, targeting marketing by combining data from past transactions with predictive models for pricing and special offers, and improving customer retention by studying profit patterns across a customer's lifetime. Statistical modeling and machine learning are the two central predictive analytics technologies. PricewaterhouseCoopers' Steve Cranford says the use of predictive analytics is forcing companies to devise customer data management protocols, with implications for privacy and security. The explosive expansion of big data has led to a boom in identity theft and a widespread erosion of consumer privacy through hacking or inattention. IBM Research's Chid Apte argues that personal data should be granted more anonymity in certain cases.
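As a concrete illustration of the statistical-modeling side, the sketch below fits a logistic regression that scores credit card transactions for fraud risk. The features, data, and thresholds are invented for illustration; production fraud systems are far more elaborate.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(7)

    # Invented transaction features: [amount in dollars, hour of day, miles from home].
    n = 5000
    X = np.column_stack([
        rng.exponential(60, n),   # amount
        rng.integers(0, 24, n),   # hour of day
        rng.exponential(10, n),   # distance from home
    ])

    # Invented labels: large, far-from-home, late-night purchases are more often fraud.
    risk = 0.01 * X[:, 0] + 0.05 * X[:, 2] + 0.5 * ((X[:, 1] < 5) | (X[:, 1] > 22))
    y = (risk + rng.normal(0, 0.5, n) > 2.0).astype(int)

    model = LogisticRegression(max_iter=1000).fit(X, y)

    # Score a new transaction: an $850 purchase at 3 a.m., 400 miles from home.
    print("fraud probability:", model.predict_proba([[850.0, 3, 400.0]])[0, 1])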
International Win for Clever Dataminer; Weka data-mining software
University of Waikato (08/17/09)
The first place finisher in the 2009 Student Data Mining Contest, run by the University of California, San Diego, used the Weka data-mining software to predict anomalies in e-commerce transaction data. Quan Sun, a University of Waikato computer science student, says it took about a month to find the answer. The contest drew more than 300 entries from students in North America, Europe, Asia, and Australasia. "I couldn't have done it without Weka," Sun says of the open source software that was developed at Waikato. "Weka is like the Microsoft Word of data-mining, and at least half of the competitors used it in their entries." ACM's Special Interest Group on Knowledge Discovery and Data Mining gave the Weka software its Data Mining and Knowledge Discovery Service Award in 2005. Weka has more than 1.5 million users worldwide.
World's Best Data Mining Knowledge and Expertise on Show in Paris at KDD-09
Business Wire (06/08/09)
Knowledge Discovery and Data Mining 2009 (KDD-09), organized by ACM's Special Interest Group on Knowledge Discovery and Data Mining, will offer more than 120 presentations by data-mining experts from around the world and is expected to draw more than 600 leading data-mining researchers, academics, and practitioners. "Some of the best minds from the scientific and business communities will be there, ready and willing to share the results of their cutting-edge research and data-mining projects with end users," says KDD-09 joint chair Francoise Soulie Fogelman. "No other industry event offers anything like the depth and breadth of expertise on offer here." Social network analysis will be a focus of KDD-09, and experts also will examine how real-time Web applications use data mining for customized advertising and personalized offers. KDD-09 will take place June 28 through July 1 in Paris.