Clemson researchers optimizing end-to-end movement of ‘Big Data’
CLEMSON, South Carolina — Today’s scientists are riding an unprecedented wave of discovery, but the immensity of the data needed to facilitate many of these breakthroughs is creating internet roadblocks that are becoming increasingly detrimental to research.
Finding ways to deal with “Big Data,” which is defined as data sets too large and complex for traditional computers and typical network connections to handle, has become a science in itself.
But with an eye to the future, Clemson University researchers are playing a leading role in developing state-of-the-art methods to transfer these enormous data sets from place to place using the 100-gigabit Ethernet Internet2 Network. Owned by the nation’s leading higher education institutions, the advanced Internet2 Network is the nation’s largest and fastest coast-to-coast research and education infrastructure, designed for next-generation scientific collaboration and Big Data transfer.
“We’ve leveraged advanced research networks from Internet2 and parallel file system technologies to choose the optimal ways to send and receive massive data sets around the country and world,” said Alex Feltus, associate professor of genetics and biochemistry in Clemson University’s College of Science. “What used to take days now takes hours – or even less. And these same methods apply to any project that uses large, contemporary data sets.”
Genomics research is rapidly becoming one of the leading generators of Big Data for science, with the potential to equal if not surpass the data output of the high-energy physics community. Like physicists, university-based life-science researchers must collaborate with counterparts and access data repositories across the nation and around the globe.
“Researchers who work with large data sets need a reliable and agile network that allows them to accelerate data transfers among collaborators,” said Rob Vietzke, Internet2 vice president for network services. “Alex’s work is a great example of how our community members come together to solve research problems that facilitate and enable scientific collaboration on a global scale.”
As a key component of their research, Feltus and his collaborators have developed open-source, freely available software called Big Data Smart Socket (BDSS), which takes a user’s request for data and attempts to rewrite it in a more efficient form, creating the potential to send and receive data at much higher speeds.
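The idea of rewriting a data request can be illustrated with a minimal sketch. It is not the BDSS code or its interface: the host names, the lookup table, the tool label and the names TransferPlan and rewrite_request are all hypothetical, chosen only to show a request being redirected to a faster source and transfer method when one is known.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class TransferPlan:
    url: str      # where the data will actually be fetched from
    tool: str     # label for the transfer mechanism to use
    streams: int  # number of parallel streams to open


# Hypothetical lookup table: original host -> (faster host, tool, streams).
KNOWN_ALTERNATIVES = {
    "data.example.org": ("fast-mirror.example.org", "parallel-tool", 8),
}


def rewrite_request(url: str) -> TransferPlan:
    """Return a faster plan for the URL if one is known, else the request unchanged."""
    parts = urlparse(url)
    if parts.hostname in KNOWN_ALTERNATIVES:
        host, tool, streams = KNOWN_ALTERNATIVES[parts.hostname]
        # Swap only the host; the path to the file stays the same.
        return TransferPlan(parts._replace(netloc=host).geturl(), tool, streams)
    return TransferPlan(url, "default", 1)


plan = rewrite_request("https://data.example.org/genomes/sample.fastq")
print(plan.url, plan.tool, plan.streams)
```

A request for a host that is not in the table comes back unchanged, so the user’s original request still works when no faster route is known.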
“We’ve found the right buffer size, number of parallel data streams and the optimal parallel file system to perform the transfers,” said Feltus, who is director of the Clemson Systems Genetics Lab. “It’s very important that end-to-end data movement – and not just network speed – is optimized. Otherwise, bottlenecks on the sending or receiving side can slow transfers to a crawl. Our BDSS software enables researchers to receive data – optimized for the architecture of their own computer systems – far more quickly than before. Previously, researchers were having to move rivers of information through small pipes at the sending and receiving ends. Now, we’ve enhanced those pipes, which vastly improves information flow.”
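Why buffer size matters on a fast, long-distance link can be shown with a common networking rule of thumb, the bandwidth-delay product. The article does not say how the Clemson team chose its settings; the sketch below only assumes a 50-millisecond round trip and eight parallel streams to show the scale involved.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s: int, rtt_ms: int) -> int:
    """Bytes that must be in flight to keep a link of this speed and delay full."""
    # bits/s * ms / 1000 gives bits in flight; divide by 8 for bytes.
    return bandwidth_bits_per_s * rtt_ms // 8000


LINK_BITS_PER_S = 100_000_000_000  # 100 gigabits per second, as in the article
RTT_MS = 50                        # assumed cross-country round-trip time
STREAMS = 8                        # assumed number of parallel streams

total = bandwidth_delay_product_bytes(LINK_BITS_PER_S, RTT_MS)
per_stream = total // STREAMS
print(f"total in flight: {total} bytes; per stream: {per_stream} bytes")
```

Under these assumptions roughly 625 megabytes must be in flight at once, far more than a typical default buffer holds, which is one reason a single untuned connection cannot fill such a link.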
The ever-expanding size of the data sets used in scientific computing has made it increasingly difficult for researchers to store all their data at their primary computing sites. It has therefore become crucial for researchers to be able to transfer a segment of data from a geographically distant repository, analyze it and then delete it to make room for the new data needed to keep the project moving.
BDSS enables researchers, many of whom are unaware of available technologies, to perform faster and more efficient transfers. The groundbreaking software takes advantage of specialized infrastructure such as parallel file systems, which distribute data across multiple servers, and advanced software-defined networks, which allow administrators to build, tune and curate groups of researchers into a virtual organization.
“Network engineers have made these gigantic pipes through Internet2 that can transfer data at a hundred gigabits per second. That’s a hundred billion bits of data, which is almost unfathomable,” Feltus said. “In order to maximize the fastest pipes ever seen in human history, the entire system must be optimized to be able to receive the data from these pipes. Our focus is figuring out ways to transfer data in parallel streams that match the number of hard drives that are receiving the data.”
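Matching the number of parallel streams to the number of receiving drives can be pictured as cutting a file into one contiguous byte range per drive. The following is an illustrative sketch only, not the researchers’ method; the function name split_ranges and the four-drive, one-gigabyte example are assumptions.

```python
def split_ranges(total_bytes: int, streams: int) -> list[tuple[int, int]]:
    """Split [0, total_bytes) into `streams` contiguous (start, end) ranges."""
    if streams < 1:
        raise ValueError("streams must be at least 1")
    base, extra = divmod(total_bytes, streams)
    ranges = []
    start = 0
    for i in range(streams):
        # The first `extra` ranges take one leftover byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


DRIVES = 4  # assumed number of receiving drives; one stream per drive
for drive, (start, end) in enumerate(split_ranges(1_000_000_000, DRIVES)):
    print(f"drive {drive}: bytes {start} to {end - 1}")
```

Each range could then be fetched by its own stream and written to its own drive, so that no single disk has to absorb the full rate of the network.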
Feltus collaborates with Clemson Computing and Information Technology, as well as faculty from multiple institutions around the nation, to maximize data transfer through a next-generation campus network linked to the Internet2 backbone. He and his partners have published two recent papers on maximizing Big Data transfer:
- “Big Data Smart Socket (BDSS): A System that Abstracts Data Transfer Habits from End Users” (Nick Watts and Alex Feltus; published in Bioinformatics). This work was supported by “Triple Gateway: Platform for Next Generation Data Analysis and Sharing” and was funded by a grant from the National Science Foundation with Washington State University’s Stephen Ficklin as principal investigator and Feltus as co-principal investigator. Other collaborators included scientists from the universities of Tennessee and Connecticut.
- “Maximizing the Performance of Scientific Data Transfer by Optimizing the Interface Between Parallel File Systems and Advanced Research Networks” (Nicholas Mills, Alex Feltus and Walter Ligon; a paper accepted for the INDIS workshop at Supercomputing 2016 in Salt Lake City). This work was supported by “PXFS: ParalleX Based Transformative I/O System for Big Data” and was funded by a grant from NSF with Ligon as principal investigator and Feltus as co-principal investigator. Other collaborators included scientists from Louisiana State and Indiana universities.
“Think of what we’re doing as sort of a shipping service at Christmas time,” Feltus concluded. “We’re all trying to move a lot of stuff around at the same time. But it might be more efficient to use Interstate 285 and go around Atlanta rather than drive through downtown. Or maybe use an airplane instead of a truck. We’re trying to get the data there as quickly as possible so that the customer is happy.”
Internet2 is a member-owned advanced-technology community founded by the nation’s leading higher education institutions in 1996. Internet2 provides a collaborative environment for U.S. research and education organizations to solve shared technology challenges, and to develop innovative solutions in support of their educational, research and community service missions.
Internet2 also operates the nation’s largest and fastest coast-to-coast research and education network, which, as of October 2016, has moved over one exabyte (one billion gigabytes) of data. Internet2 serves more than 90,000 community anchor institutions, 317 U.S. universities, 70 government agencies, 43 regional and state education networks, 78 leading corporations working with its community and more than 61 national research and education networking partners representing more than 100 countries.
Internet2 offices are located in Ann Arbor, Michigan; Denver; Emeryville, California; Washington, D.C.; and West Hartford, Connecticut. For more information, visit www.internet2.edu or follow @Internet2 on Twitter.
This material is based upon work supported by the National Science Foundation under Grant Nos. 1447771 and 1443040. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of the NSF.