Showing posts with label data-mining.

Monday, April 16, 2012

Blog: Fast Data hits the Big Data fast lane

Fast Data hits the Big Data fast lane

By Andrew Brust | April 16, 2012, 6:00am PDT
Summary: Fast Data, used in large enterprises for highly specialized needs, has become more affordable and available to the mainstream. Just when corporations absolutely need it.
This guest post comes courtesy of Tony Baer’s OnStrategies blog. Tony is a principal analyst at Ovum.

By Tony Baer

Of the 3 “V’s” of Big Data – volume, variety, velocity (we’d add “Value” as the 4th V) – velocity has been the unsung ‘V.’ With the spotlight on Hadoop, the popular image of Big Data is large petabyte data stores of unstructured data (which are the first two V’s). While Big Data has been thought of as large stores of data at rest, it can also be about data in motion.
“Fast Data” refers to processes that require lower latencies than would otherwise be possible with optimized disk-based storage. Fast Data is not a single technology, but a spectrum of approaches that process data that might or might not be stored. It could encompass event processing, in-memory databases, or hybrid data stores that optimize cache with disk.
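As a rough illustration of what "data in motion" looks like in practice, the sketch below keeps a sliding window of recent events entirely in memory and updates an aggregate as each event arrives. The event shape, the 60-second window, and the price values are assumptions made up for this example, not anything from Baer's post.

```python
# A minimal sketch of "data in motion": event-at-a-time processing over an
# in-memory sliding window, with no disk I/O in the hot path. The event shape
# and the 60-second window are illustrative assumptions, not from the article.
from collections import deque
import time

class SlidingWindowAverage:
    """Keep only the last `window_seconds` of events in memory."""
    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.events = deque()          # (timestamp, value) pairs
        self.total = 0.0

    def add(self, timestamp, value):
        self.events.append((timestamp, value))
        self.total += value
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0][0] > self.window_seconds:
            _, old = self.events.popleft()
            self.total -= old

    def average(self):
        return self.total / len(self.events) if self.events else 0.0

stream = SlidingWindowAverage(window_seconds=60)
now = time.time()
for i, price in enumerate([101.2, 101.5, 100.9, 102.1]):
    stream.add(now + i, price)
print(round(stream.average(), 2))   # average over events still in the window
```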

Saturday, April 7, 2012

Blog: Berkeley Group Digs In to Challenge of Making Sense of All That Data

Berkeley Group Digs In to Challenge of Making Sense of All That Data
New York Times (04/07/12) Jeanne Carstensen

The U.S. National Science Foundation recently awarded $10 million to the University of California, Berkeley's Algorithms Machines People (AMP) Expedition, a research team that takes an interdisciplinary approach to advancing big data analysis. Researchers at the AMP Expedition, in collaboration with researchers at the University of California, San Francisco, are developing a set of open source tools for big data analysis. "We’ll judge our success by whether we build a new paradigm of data," says AMP Expedition director Michael Franklin. “It’s easier to collect data, and harder to make sense of it.” The grant is part of the Obama administration's "Big Data Research and Development Initiative," which will eventually distribute a total of $200 million. AMP Expedition faculty member Ken Goldberg has developed Opinion Space, a tool for online discussion and brainstorming that uses algorithms and data visualization tools to help gather meaningful ideas from a large number of participants. Goldberg notes that part of their research focus is analyzing how people interact with big data. “We recognize that humans do play an important part in the system,” he says.

Thursday, March 15, 2012

Blog: 'Big Data' Emerges as Key Theme at South by Southwest Interactive

‘Big Data' Emerges as Key Theme at South by Southwest Interactive
Chronicle of Higher Education (03/15/12) Jeffrey R. Young

Several panels and speakers at this year's South By Southwest Interactive festival discussed the growing ability to use data-mining techniques to analyze big data to shape political campaigns, advertising, and education. For example, panelist and Microsoft researcher Jaron Lanier says companies that rely on selling information about their users' behavior to advertisers should find a way to compensate people for their posts. A panel on education discussed the potential ability of Twitter and Facebook to better connect with students and detect signs that students might be struggling with certain subjects. "We need to be looking at engagement in this new spectrum, and we haven't," says South Dakota State University social-media researcher Greg Heiberger. Some panels examined the role of big data in the latest presidential campaigns. Although recent presidential campaigns have focused on demographic subgroups, future campaigns may design their messages even more narrowly. "They’re actually going to try targeting groups of individuals so that political campaigns become about data mining" rather than any kind of broad policy message, says University of Texas at Dallas professor David Parry.

Wednesday, February 8, 2012

Blog: Weave Open Source Data Visualization Offers Power, Flexibility

Weave Open Source Data Visualization Offers Power, Flexibility
Computerworld (02/08/12) Sharon Machlis

The open source Weave project is a platform designed to make it easier for government agencies, nonprofits, and corporate users to offer the public a way to analyze data. The platform enables users to simultaneously highlight items on multiple visualizations, including map, map legend, bar chart, and scatter plot. The benefits of Weave's interactivity go beyond the visual appeal of selecting an area on a chart and seeing matches highlighted on a map, notes Connecticut Data Collaborative project coordinator James Farnam. Weave aims to help organizations democratize data visualization tools, creating a way for anyone interested in a topic to explore and analyze information about it, instead of leaving the task solely to computer and data specialists, says Georges G. Grinstein, director of the University of Massachusetts at Lowell's Institute for Visualization and Perception Research, which created Weave. "Now [you're] engaging the public in a dialog with the data," Grinstein says. "That's why Weave is open source and free." Weave is so powerful that one of the challenges of implementing it is how to narrow down its offerings so that end users would not be overwhelmed with too many options, says the Metropolitan Area Planning Council's Holly St. Clair.
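The core interaction Weave offers is linked selection: one selection drives every view of the same records. The short sketch below mimics that idea with an ordinary pandas DataFrame rather than Weave itself; the towns and figures are invented illustration data.

```python
# A rough sketch of the "linked selection" idea behind tools like Weave:
# one selection drives every view of the same records. The town and population
# numbers below are made-up illustration data, not from the article.
import pandas as pd

towns = pd.DataFrame({
    "town":       ["Lowell", "Boston", "Worcester", "Springfield"],
    "population": [115_000, 675_000, 206_000, 155_000],
    "median_age": [33.1, 32.0, 34.6, 32.2],
})

# Selecting a range on one "chart" (median_age)...
selected = towns["median_age"] > 33.0

# ...highlights the same rows in every other view: the "map", the bar chart, etc.
print("Highlighted on the map:", towns.loc[selected, "town"].tolist())
print("Highlighted in the bar chart:\n", towns.loc[selected, ["town", "population"]])
```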

Friday, December 2, 2011

Blog: U.S. Intelligence Group Seeks Machine Learning Breakthroughs

U.S. Intelligence Group Seeks Machine Learning Breakthroughs
Network World (12/02/11) Michael Cooney

The U.S. Intelligence Advanced Research Projects Activity (IARPA) announced that it is looking for new ideas that may become the basis of cutting-edge machine-learning projects. "In many application areas, the amount of data to be analyzed has been increasing exponentially [sensors, audio and video, social network data, Web information], stressing even the most efficient procedures and most powerful processors," according to IARPA. "Most of these data are unorganized and unlabeled and human effort is needed for annotation and to focus attention on those data that are significant." IARPA's request for information asks about proposed methods for the automation of architecture and algorithm selection and combination, feature engineering, and training data scheduling, as well as compelling reasons to use such approaches in a scalable multi-modal analytic system and whether supporting technologies are readily available. IARPA says that innovations in hierarchical architectures such as Deep Belief Nets and hierarchical clustering will be needed for useful automatic machine-learning systems. It wants to identify promising areas for investment and plans to hold a machine learning workshop in March 2012.
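A toy version of the "automation of algorithm selection" idea in the request for information might look like the sketch below, which scores a few candidate models by cross-validation and keeps the winner. It is only a sketch of the general concept, not anything IARPA specified, and the dataset and candidate list are arbitrary choices.

```python
# A toy version of automated algorithm selection: score several candidate
# models by cross-validation and keep the best. This illustrates the general
# idea only; the dataset and candidates are arbitrary assumptions.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=2000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "naive_bayes": GaussianNB(),
}

# Cross-validate each candidate and pick the highest mean accuracy.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores)
print("Selected automatically:", best)
```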

Monday, November 14, 2011

Blog: Walls Have Eyes: How Researchers Are Studying You on Facebook

Walls Have Eyes: How Researchers Are Studying You on Facebook
Time (11/14/11) Sonia van Gilder Cooke

Facebook's trove of personal information is so encyclopedic that researchers are using the site's advertising tool to pinpoint their desired demographic with scientific accuracy, according to a recent Demos report. The report focused on European right-wing extremist groups, and used Facebook's data to find 500,000 fans of right-wing groups across Europe. The researchers linked these Facebook users to a survey that asked questions about their education level, attitudes toward violence, and optimism about their own future. Demos' research is just one example of how Facebook is becoming a popular tool among scientists. There are currently more than 800 million active users adding an average of three pieces of content daily, which has driven the number of academic papers with Facebook's name in the title up almost 800 percent over the past five years. Researchers say Facebook's data also could be used to address social health problems. For example, a University of Wisconsin-Madison study found that undergraduates who discussed their drunken exploits on Facebook were significantly more likely to have a drinking problem than those students who did not discuss the topic online.

Monday, September 19, 2011

Blog: Mining Data for Better Medicine

Mining Data for Better Medicine
Technology Review (09/19/11) Neil Savage

Researchers are utilizing digital medical records to conduct wide-ranging studies on the effects of certain drugs and how they relate to different populations. Data-mining studies also are being used to uncover evidence of economic problems, such as overbilling and unnecessary procedures. In addition, some large hospital systems are employing full-time database research teams to study electronic records. Stanford University researcher Russ Altman is developing tools to analyze the U.S. Food and Drug Administration's Adverse Event Reporting System, a database containing several million reports of drugs that have harmed patients. The Stanford researchers have developed an algorithm that searched for patients taking widely prescribed drugs who subsequently suffered side effects similar to those seen in diabetics. "There's just an incredibly wide range of possibilities for research from using all this aggregated data," says Margaret Anderson, executive director of FasterCures, a think tank in Washington, D.C. "We're asking, 'Why aren't we paying a little bit more attention to that?'"
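One common way to mine a spontaneous-report database such as the Adverse Event Reporting System is disproportionality analysis: compare how often an effect is reported with a given drug versus with all other drugs. The sketch below computes a reporting odds ratio over invented counts; it is a generic illustration, not the Stanford group's algorithm.

```python
# A minimal disproportionality sketch over adverse-event reports: for a given
# (drug, side effect) pair, compare how often the effect is reported with the
# drug versus with all other drugs (a reporting odds ratio). The counts below
# are invented for illustration; this is not the Stanford group's algorithm.
def reporting_odds_ratio(a, b, c, d):
    """
    a: reports with drug and event       b: reports with drug, other events
    c: reports without drug, with event  d: reports without drug, other events
    """
    return (a / b) / (c / d)

# Hypothetical counts: is "drug X" over-represented in high-blood-sugar reports?
ror = reporting_odds_ratio(a=120, b=4_880, c=2_300, d=392_700)
print(f"Reporting odds ratio: {ror:.2f}")   # values well above 1 flag a possible signal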

Monday, September 12, 2011

Blog: In Plane View; using cluster analysis to discover what's normal

In Plane View
MIT News (09/12/11) Jennifer Chu

Massachusetts Institute of Technology professor John Hansman and colleagues have developed an airline health detection tool that identifies flight glitches without knowing ahead of time what to look for. The method uses cluster analysis, a type of data mining that filters data into subsets to find common patterns. Flight data outside the clusters is labeled as abnormal, enabling analysts to further inspect those reports to determine the nature of the anomaly. The researchers developed a data set from 365 flights that took place over one month. "The beauty of this is, you don't have to know ahead of time what 'normal' is, because the method finds what's normal by looking at the cluster," Hansman says. The researchers mapped each flight at takeoff and landing and found several flights that fell outside the normal range, mostly due to crew mistakes rather than mechanical flaws, according to Hansman. "To make sure that systems are safe in the future, and the airspace is safe, we have to uncover precursors of aviation safety accidents [and] these [cluster-based] analyses allow us to do that," says the U.S. National Aeronautics and Space Administration's Ashok Srivastava.
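The underlying idea, letting clusters of routine flights define "normal" and flagging whatever falls far outside them, can be sketched in a few lines. The example below clusters made-up flight feature vectors with k-means and flags the most distant points; the features, values, and threshold are assumptions for illustration, not the MIT group's actual pipeline.

```python
# A hedged sketch of cluster-based anomaly detection: cluster flight feature
# vectors, then flag flights that sit far from every cluster center. The
# features and threshold are invented for illustration only.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Pretend each row is one flight: [approach speed (kt), descent rate (ft/min)]
normal = rng.normal(loc=[140, 700], scale=[5, 60], size=(360, 2))
odd = np.array([[165, 1_200], [118, 450]])           # two unusual approaches
flights = np.vstack([normal, odd])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(flights)
# Distance from each flight to the center of its own cluster.
dist_to_center = np.linalg.norm(
    flights - kmeans.cluster_centers_[kmeans.labels_], axis=1)

threshold = np.percentile(dist_to_center, 99)         # "normal" defined by the data itself
print("Flights flagged for review:", np.where(dist_to_center > threshold)[0])
```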

Thursday, June 23, 2011

Blog: CERN Experiments Generating One Petabyte of Data Every Second

CERN Experiments Generating One Petabyte of Data Every Second
V3.co.uk (06/23/11) Dan Worth

CERN researchers generate a petabyte of data every second as they work to discover the origins of the universe by smashing particles together at close to the speed of light. However, the researchers, led by Francois Briard, only store about 25 petabytes every year because they use filters to save just the results they are interested in. "To analyze this amount of data you need the equivalent of 100,000 of the world's fastest PC processors," says CERN's Jean-Michel Jouanigot. "CERN provides around 20 percent of this capability in our data centers, but it's not enough to handle this data." The researchers worked with the European Commission to develop the Grid, which provides access to computing resources from around the world. CERN draws on data center capacity from 11 different providers on the Grid, including providers in the United States, Canada, Italy, France, and Britain. The data comes from four machines on the Large Hadron Collider that monitor the particle collisions, and they transmit data at 320 Mbps, 100 Mbps, 220 Mbps, and 500 Mbps, respectively, to the CERN computer center.
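The scale of that filtering is easier to appreciate with some back-of-the-envelope arithmetic, shown below under the simplifying assumption that the experiments produced a petabyte every second all year round (in reality they do not run continuously).

```python
# Back-of-the-envelope arithmetic for the filtering described above:
# roughly a petabyte per second produced versus about 25 PB kept per year.
# Assumes continuous running, so this is an upper bound on the ratio.
SECONDS_PER_YEAR = 365 * 24 * 3600

produced_pb_per_year = 1.0 * SECONDS_PER_YEAR   # ~1 PB generated every second
stored_pb_per_year = 25.0

print(f"Generated per year (if continuous): {produced_pb_per_year:,.0f} PB")
print(f"Kept per year:                      {stored_pb_per_year} PB")
print(f"Only about 1 in {produced_pb_per_year / stored_pb_per_year:,.0f} bytes survives the filters")
```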

View Full Article

Friday, May 13, 2011

Blog: New Ways to Exploit Raw Data May Bring Surge of Innovation, a Study Says

New Ways to Exploit Raw Data May Bring Surge of Innovation, a Study Says
New York Times (05/13/11) Steve Lohr

Mining and analyzing large datasets will lead to a new wave of innovation, according to a new report from the McKinsey Global Institute. The report, "Big Data: The Next Frontier for Innovation, Competition and Productivity," estimates the potential benefits of using data-harvesting technologies and skills. For example, it says the technology could be worth $300 billion annually to the U.S. health care system, while retailers could use it to boost profits by 60 percent. However, the study also identifies challenges to managing big data, such as a talent and skills gap. The report estimates that the United States will need at least 140,000 more experts in statistical methods and data-analysis technologies, as well as 1.5 million more data-literate managers. "Every manager will really have to understand something about statistics and experimental design going forward," says McKinsey's Michael Chui. The use of personal location data could save consumers more than $600 billion worldwide by 2020, according to the report. Consumers will benefit the most from time and fuel savings gained from location-based services that can monitor traffic and weather data to help drivers avoid congestion and suggest alternative routes, the report says. "It's clear that data is an important factor of production now," says McKinsey's James Manyika.

View Full Article

Monday, March 7, 2011

Blog: Identifying 'Anonymous' Email Authors

Identifying 'Anonymous' Email Authors
Concordia University (Canada) (03/07/11) Chris Atack

Concordia University researchers led by professor Benjamin Fung have developed a technique to determine the authors of anonymous emails with a high degree of accuracy. Malicious anonymous emails can "transmit threats or child pornography, facilitate communications between criminals, or carry viruses," Fung says. Although authorities can use the Internet protocol address to locate the building where the email originated, they have no way to differentiate between several suspects. Fung's method uses speech recognition and data-mining techniques to identify an individual author. First the researchers identify the patterns found in emails written by the suspect, and then they filter out any patterns that also are found in the emails of other suspects, leaving only the patterns that are unique to the original suspect. "Using this method, we can even determine with a high degree of accuracy who wrote a given email, and infer the gender, nationality, and education level of the author," Fung says. Testing showed that the system can identify an individual author of an anonymous email with up to 90 percent accuracy.
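A generic stylometry pipeline, not the Concordia team's actual technique, gives a feel for how writing patterns can point to an author: represent each message by character n-gram frequencies and train a classifier to attribute an unseen message. The toy emails below are invented.

```python
# A hedged stylometry sketch: represent each message by character n-gram
# frequencies and train a classifier to attribute the author of an unseen
# message. The toy emails are invented, and this is a generic approach, not
# the Concordia team's actual technique.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_emails = [
    "Hiya - pls see attached, cheers!!",
    "Hiya - running late, will ping u after lunch!!",
    "Dear colleagues, please find the quarterly report attached.",
    "Dear colleagues, kindly review the minutes before Friday.",
]
train_authors = ["suspect_a", "suspect_a", "suspect_b", "suspect_b"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),  # writing-style patterns
    LogisticRegression(max_iter=1000),
)
model.fit(train_emails, train_authors)

# Attribute an unseen message to the most stylistically similar suspect.
print(model.predict(["Hiya - can u resend that attachment, cheers!!"]))
```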

View Full Article

Thursday, February 10, 2011

Blog: Powerful New Ways to Electronically Mine Published Research May Lead to New Scientific Breakthroughs

Powerful New Ways to Electronically Mine Published Research May Lead to New Scientific Breakthroughs
University of Chicago (02/10/11) William Harms

University of Chicago researchers are exploring how metaknowledge can be used to better understand science's social context and the biases that can affect research findings. "The computational production and consumption of metaknowledge will allow researchers and policymakers to leverage more scientific knowledge--explicit, implicit, contextual--in their efforts to advance science," say Chicago researchers James Evans and Jacob Foster. Metaknowledge researchers are using natural language processing technologies, such as machine reading, information extraction, and automatic summarization, to find previously hidden meaning in data. For example, Google researchers used computational content analysis to uncover the emergence of influenza epidemics by tracking relevant Google searches, a process that was faster than methods used by public health officials. Metaknowledge research also points to the possibility that implicit assumptions, known as ghost theories, form the foundation of scientific conclusions even when scientists are unaware of them. Scientific ideas can become entrenched when studies continue to produce conclusions that have been previously established by well-known scholars, a trend that can be uncovered by using metaknowledge, according to the researchers.
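The Google flu example boils down to checking whether a signal mined from text (search query volume) tracks an outcome of interest (reported cases). The tiny sketch below does that with invented weekly numbers; the real models were far more elaborate.

```python
# A toy version of the search-query idea mentioned above: check whether weekly
# counts of a flu-related query track reported case counts. All numbers are
# invented for illustration; Google's actual models were far more elaborate.
import numpy as np

weekly_query_volume = np.array([120, 150, 300, 520, 610, 480, 260, 150])
weekly_reported_cases = np.array([40, 55, 110, 190, 230, 175, 95, 60])

r = np.corrcoef(weekly_query_volume, weekly_reported_cases)[0, 1]
print(f"Correlation between query volume and reported cases: {r:.2f}")
```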

View Full Article

Tuesday, August 24, 2010

Blog: Sizing Samples

Sizing Samples
MIT News (08/24/10) Hardesty, Larry

Numerous scientific fields employ computers to deduce patterns in data, and Massachusetts Institute of Technology researchers led by graduate student Vincent Tan have taken an initial step toward determining how much data is enough to support reliable pattern inference by envisioning data sets as graphs. In the researchers' work, the nodes of the graph represent data and the edges stand for correlations between them; from this point of view, a computer tasked with pattern recognition is given a set of nodes and asked to infer the weights of the edges between them. The researchers have demonstrated that graphs configured like chains and stars establish, respectively, the best- and worst-case scenarios for computers charged with pattern recognition. Tan says that for tree-structured graphs with shapes other than stars or chains, the "strength of the connectivity between the variables matters." Carnegie Mellon University professor John Lafferty notes that a tree-structured approximation of data is more computationally efficient.
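A hedged sketch of the kind of problem described here, learning a tree-structured model from data, is shown below in the spirit of the classic Chow-Liu approach: use the absolute correlation between each pair of variables as an edge weight and keep a maximum-weight spanning tree. The synthetic chain data is an assumption for illustration, not the MIT researchers' setup.

```python
# Learn a tree structure over variables from samples: weight each pair of
# variables by absolute correlation and keep a maximum-weight spanning tree
# (in the spirit of Chow-Liu). Synthetic chain data, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n = 2_000
# Generate a chain: x0 -> x1 -> x2 -> x3, each variable a noisy copy of the last.
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + rng.normal(scale=0.6, size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
x3 = 0.8 * x2 + rng.normal(scale=0.6, size=n)
data = np.column_stack([x0, x1, x2, x3])

weights = np.abs(np.corrcoef(data, rowvar=False))   # pairwise edge weights

# Prim's algorithm for a maximum-weight spanning tree.
d = weights.shape[0]
in_tree, edges = {0}, []
while len(in_tree) < d:
    best = max(((i, j, weights[i, j]) for i in in_tree
                for j in range(d) if j not in in_tree),
               key=lambda e: e[2])
    edges.append(best[:2])
    in_tree.add(best[1])

print("Recovered tree edges:", edges)   # should follow the 0-1-2-3 chain
```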

View Full Article

Monday, July 26, 2010

Blog: Bringing Data Mining Into the Mainstream

Bringing Data Mining Into the Mainstream
New York Times (07/26/10) Lohr, Steve

A record number of corporate researchers and university scientists are attending an ACM conference on knowledge discovery and data mining, which offers papers and workshops that apply data mining to everything from behavioral targeting to cancer research. Although data mining has become a growth industry, profitably probing large data sets is still costly for companies and difficult for users. According to conference executive director Usama Fayyad, an institutional mindset that recognizes the value of data is needed to bring modern data mining into the business mainstream. The executive level must view data as a new strategic asset that can create revenue streams and businesses. Fayyad also says a translation layer of technology is needed to democratize modern data mining, and the underlying software for handling large data sets should be linked to software that ordinary people can use. Using Microsoft's Excel spreadsheet as a metaphor, Fayyad says the sophisticated data-handling layer should be "built in ways that Excel can consume the data and people can browse it."

View Full Article

Tuesday, June 22, 2010

Blog: Data Mining Algorithm Explains Complex Temporal Interactions Among Genes

Data Mining Algorithm Explains Complex Temporal Interactions Among Genes
Virginia Tech News (06/22/10) Trulove, Susan

Researchers at Virginia Tech (VT), New York University (NYU), and the University of Milan have developed Gene Ontology based Algorithmic Logic and Invariant Extractor (GOALIE), a data-mining algorithm that can automatically reveal how biological processes are coordinated in time. GOALIE reconstructs temporal models of cellular processes from gene expression data. The researchers developed and applied the algorithm to time-course gene expression datasets from budding yeast. "A key goal of GOALIE is to be able to computationally integrate data from distinct stress experiments even when the experiments had been conducted independently," says VT professor Naren Ramakrishnan. NYU professor Bud Mishra notes GOALIE also can extract entire formal models that can then be used for posing biological questions and reasoning about hypotheses. The researchers hope the tool can be used to study disease progression, aging, host-pathogen interactions, stress responses, and cell-to-cell communication.
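To give a loose sense of time-course expression mining (this is not GOALIE itself), the sketch below clusters synthetic gene expression profiles in an early time window and a late one, then counts how genes move between clusters over time.

```python
# A loose sketch in the spirit of time-course expression mining: cluster genes
# by their expression profiles in an early window and a late window, then see
# how genes move between clusters over time. Synthetic data, for illustration;
# this is not the GOALIE algorithm itself.
import numpy as np
from sklearn.cluster import KMeans
from collections import Counter

rng = np.random.default_rng(1)
n_genes, n_timepoints = 200, 12
# Synthetic expression matrix: half the genes ramp up over time, half ramp down.
ramp = np.linspace(-1, 1, n_timepoints)
profiles = np.vstack([ramp * 1.0, ramp * -1.0])
expression = (profiles[rng.integers(0, 2, n_genes)]
              + rng.normal(scale=0.3, size=(n_genes, n_timepoints)))

early = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(expression[:, :6])
late = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(expression[:, 6:])

# How do early clusters map onto late clusters? Stable processes stay together.
print(Counter(zip(early, late)))
```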

View Full Article

Wednesday, May 5, 2010

Blog: N.Y. Bomb Plot Highlights Limitations of Data Mining

N.Y. Bomb Plot Highlights Limitations of Data Mining
Computerworld (05/05/10) Vijayan, Jaikumar

The recent failed bombing attempt in New York City shows the limitations of data-mining technology when used in security applications. Since the terror attacks of Sept. 11, the U.S. government has spent tens of millions of dollars on data-mining programs that are used by agencies to identify potential terrorists. The Department of Homeland Security's Automated Targeting System assigns terror scores to U.S. citizens, and the Transportation Security Administration's Secure Flight program analyzes airline passenger data. However, it is unclear how effective these programs have been in identifying and stopping potential terrorist threats. Using data mining to search for potential terrorists is similar to looking for a needle in a haystack, says BT chief security officer Bruce Schneier. "Data mining works best when there's a well-defined profile you're searching for, a reasonable number of attacks per year, and a low cost of false alarms," Schneier says. However, even the most accurate and finely tuned data-mining system will generate one billion false alarms for each real terrorist plot it uncovers, he says.
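Schneier's false-alarm figure is essentially base-rate arithmetic, sketched below with illustrative assumptions about screening volume and classifier accuracy.

```python
# Rough base-rate arithmetic behind Schneier's point. The figures here are
# illustrative assumptions: even a 99.9%-accurate classifier scanning a
# trillion innocuous events yields an enormous pile of false alarms for each
# real plot.
events_scanned_per_year = 1e12      # assumed volume of records screened
real_plots_per_year = 1             # attacks are extremely rare
false_positive_rate = 0.001         # 99.9% specificity, an optimistic assumption

false_alarms = events_scanned_per_year * false_positive_rate
print(f"False alarms per real plot: about {false_alarms:,.0f}")   # ~1 billion
```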

View Full Article

Thursday, April 8, 2010

Blog: 'Big Data' Can Create Big Issues

'Big Data' Can Create Big Issues
Investor's Business Daily (04/08/10) P. A4; Bonasia, J.

Tech firms are approaching the challenge of mining "big data"--immense repositories of information generated by industry and government--by using predictive analytics software to detect trends and anticipate coming events. Applications of predictive analytics include sifting credit card transactions to spot fraud, targeted marketing that combines data from past transactions with predictive models for pricing and special offers, and customer retention efforts that study profit patterns across a consumer's lifetime. Statistical modeling and machine learning form the two central predictive analytics technologies. PricewaterhouseCoopers' Steve Cranford says the use of predictive analytics is forcing companies to devise customer data management protocols, with implications for privacy and security. The explosive expansion of big data has led to a boom in identity theft and a widespread erosion of consumer privacy via hacking or inattention. IBM Research's Chid Apte argues that personal data should be granted more anonymity in certain cases.
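As a minimal illustration of the statistical-modeling half of predictive analytics, the sketch below fits a logistic regression to synthetic card transactions and scores a new one for fraud risk; every feature and number is an invented assumption, and real systems use far richer signals.

```python
# A minimal predictive-analytics sketch: a statistical model scored over card
# transactions to rank likely fraud. Features and data are synthetic
# assumptions; real systems use far richer signals.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5_000
amount = rng.exponential(scale=60, size=n)                 # transaction amount
foreign = rng.integers(0, 2, size=n)                       # foreign merchant flag
night = rng.integers(0, 2, size=n)                         # 1am-5am transaction
# Fraud is rare and skews toward large, foreign, late-night transactions.
risk = 0.02 + 0.10 * (amount > 200) + 0.05 * foreign + 0.05 * night
is_fraud = rng.random(n) < risk

X = np.column_stack([amount, foreign, night])
model = LogisticRegression(max_iter=1000).fit(X, is_fraud)

new_txn = np.array([[450.0, 1, 1]])                        # large, foreign, 3am
print(f"Estimated fraud probability: {model.predict_proba(new_txn)[0, 1]:.2f}")
```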

View Full Article

Monday, August 17, 2009

Blog: International Win for Clever Dataminer; Weka data-mining software

International Win for Clever Dataminer; Weka data-mining software
University of Waikato (08/17/09)

The first place finisher in the 2009 Student Data Mining Contest, run by the University of California, San Diego, used the Weka data-mining software to predict anomalies in e-commerce transaction data. Quan Sun, a University of Waikato computer science student, says it took about a month to find the answer. The contest drew more than 300 entries from students in North America, Europe, Asia, and Australasia. "I couldn't have done it without Weka," Sun says of the open source software that was developed at Waikato. "Weka is like the Microsoft Word of data-mining, and at least half of the competitors used it in their entries." ACM's Special Interest Group on Knowledge Discovery and Data Mining gave the Weka software its Data Mining and Knowledge Discovery Service Award in 2005. Weka has more than 1.5 million users worldwide.

View Full Article

Monday, June 8, 2009

Blog: World's Best Data Mining Knowledge and Expertise on Show in Paris at KDD-09

World's Best Data Mining Knowledge and Expertise on Show in Paris at KDD-09
Business Wire (06/08/09)

Knowledge Discovery and Data Mining 2009 (KDD-09), organized by ACM's Special Interest Group on Knowledge Discovery and Data Mining, will offer more than 120 presentations by data-mining experts from around the world and is expected to be attended by more than 600 leading data-mining researchers, academics, and practitioners. "Some of the best minds from the scientific and business communities will be there, ready and willing to share the results of their cutting-edge research and data-mining projects with end users," says KDD-09 joint chair Francoise Soulie Fogelman. "No other industry event offers anything like the depth and breadth of expertise on offer here." Social network analysis will be a focus of KDD-09. Data mining experts also will focus on using real-time Web applications for data mining for custom advertising and personalized offers. KDD-09 will take place June 28 through July 1 in Paris.

View Full Article
