
Maximizing Data Throughput

Multi-stream data transfer into Amazon EC2

By Andrew Kaczorek, with Jason Stowe

It is common for cloud computing articles to talk at length about the abundant hardware resources the cloud offers the modern researcher or analyst, but little is typically said about the back-end data store. Before any research in the cloud can take place, data must be staged where your cloud-based compute resources can reach it, and that staging step becomes non-trivial when your data sets are large.

The Amazon EC2 cloud provides large quantities of hardware suitable for high-speed, high-throughput scientific computing. Coupled with AWS storage services such as Amazon S3, it makes a formidable platform for anyone looking to do large-scale scientific computing on large quantities of file-based data.

In this article, we'll focus on data ingress to EC2, because download speeds out of EC2 are typically much higher than upload speeds into it, from both consumer-grade and enterprise-level Internet connections. We also see more upload than download in the practical use of AWS services: large data sets get transferred into EC2, working data stays on cloud-local storage, and compact, summarized results are brought back from the cloud.

Additionally, there are multiple ways of transferring data into the cloud, including open protocols and tools such as HTTP/HTTPS, rsync, and FDT, as well as proprietary tools like Aspera. The proprietary tools are efficient, but they cost money, so this article focuses specifically on free and open tools for this kind of work. These include:

Table #1: Open source data transfer protocols and tools

  HTTP/HTTPS  - standard web transfer protocols, usable with any HTTP client or server
  rsync       - incremental file synchronization, typically tunneled over SSH for encryption
  FDT         - Fast Data Transfer, a Java-based tool for sustained high-throughput TCP streams

Let's look at a few common cases for moving a large data set into AWS-hosted storage and explore the transfer rates, benefits and drawbacks of each approach.

Case #1: Consumer Grade Internet Connection
It is trivial to saturate the upstream pipe of a consumer-grade cable or DSL Internet connection using only a single stream into EC2. Even with strong encryption on the ingress stream, CPU demands are insignificant for modern personal computing hardware. Given a 512 kilobit-per-second upload speed, it is possible to upload roughly 5.1 gigabytes in one day with no compression.

Case #1: 512kb/sec consumer connection moves 5.1GB/day

Case #2: Mid-Range Business-Class Internet Connection
Among our clients, we see average business-class Internet connections with six to eight megabits of upload performance. Even with the more capable six-megabit upload connection, it is still possible to saturate the line with a single, encrypted rsync into EC2. With no compression, it is possible to transmit 64 gigabytes a day from a mid-range, business-class Internet connection.

Case #2: 6Mb/sec small enterprise connection moves 64GB/day

Case #3: Enterprise-Level Internet Connection
With our larger clients it is not uncommon to see Internet connections with sustained data-out capabilities of 100 megabits or more. In this class of Internet connection, WAN factors such as frame size, packet ordering and retransmissions start to come into play and require consideration to optimize the ingress rate. With little to no optimization, we have seen upper bounds on a single Internet-bound network stream of around 18 megabits. Even without compression or multiple streams, an Internet connection operating at this outbound rate can transfer nearly 200 gigabytes in a 24-hour period. There is, of course, potential for the single-stream transfer to be further optimized, but enterprise network teams often have a philosophical problem with a single application consuming more than 50 percent of an Internet connection shared by potentially tens of thousands of people.

Case #3: 18Mb/sec enterprise connection moves 200GB/day
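The back-of-the-envelope arithmetic behind all three cases is the same: sustained upload rate times seconds per day. A minimal sketch (using decimal gigabytes; the article's figures round the same numbers slightly differently, e.g. in binary gigabytes for the consumer case):

```python
def gb_per_day(upload_mbps: float) -> float:
    """Gigabytes transferable in 24 hours at a sustained upload rate,
    ignoring protocol overhead and compression (decimal GB)."""
    seconds_per_day = 24 * 60 * 60
    bits_per_day = upload_mbps * 1_000_000 * seconds_per_day
    return bits_per_day / 8 / 1e9

print(round(gb_per_day(0.512), 1))  # consumer 512 kb/s   -> 5.5
print(round(gb_per_day(6), 1))      # business 6 Mb/s     -> 64.8
print(round(gb_per_day(18), 1))     # enterprise 18 Mb/s  -> 194.4
```

These line up with the roughly 5 GB, 64 GB, and nearly 200 GB per day quoted above.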

Special Case #1: Parallel S3 Streaming
If the inbound files are relatively homogeneous and reasonably large, we have seen instances where a parallel upload into S3 can be advantageous. S3 is globally distributed with massively scalable ingress. When file sizes are similar, it is possible to saturate an outgoing network pipe without worrying about one stream of data lagging behind the others because it is larger. It should be noted that once data is in S3, it can be deployed via s3backer to a large number of EC2 instances at high speeds using the same principles of S3's massive scalability.
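When file sizes are not perfectly uniform, the streams can still be kept balanced by assigning each file to whichever stream currently carries the fewest bytes. A sketch of that greedy assignment (the file names are hypothetical; the actual upload of each group would be handled by your S3 client of choice):

```python
import heapq

def balance_streams(file_sizes: dict, n_streams: int) -> list:
    """Greedily assign files to n_streams parallel upload streams so
    each stream carries a similar number of bytes, keeping one stream
    from lagging far behind the others."""
    # Min-heap of (bytes_assigned, stream_index): pop the lightest stream.
    heap = [(0, i) for i in range(n_streams)]
    heapq.heapify(heap)
    streams = [[] for _ in range(n_streams)]
    # Place the largest files first for a tighter balance.
    for name, size in sorted(file_sizes.items(), key=lambda kv: -kv[1]):
        load, i = heapq.heappop(heap)
        streams[i].append(name)
        heapq.heappush(heap, (load + size, i))
    return streams

groups = balance_streams({"a.dat": 700, "b.dat": 400, "c.dat": 300}, 2)
# The largest file gets its own stream; the two smaller ones share the other.
```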

Special Case #2: Parallel Throttled rsync
There are cases where a single-stream approach out of an enterprise-class connection is insufficient for the data transfer needs. On a massive, shared network connection two undesirable conditions can be observed:

  1. A single upload will sometimes fall very short of the available network bandwidth of a particular site due to the network topology
  2. Multiple uploads will sometimes overwhelm a shared network connection causing complaints from other users and potential business disruption

In situations where sustained throughput needs to be consistent and relatively peak-free, yet must not exceed what the network administrators are comfortable with, we have seen very good results packaging rsync transfers into multiple parallel streams, each limited to a rigid amount of bandwidth. This approach lets us push the limits of a shared network connection without saturating it. We often come within 90 percent of the available network bandwidth, meaning we can move a significant amount of data into the cloud in a minimal amount of time, with no perceivable disruption to the other users and applications on the network.
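One way such throttled parallel streams might be assembled is to divide a hard overall cap evenly across per-stream rsync invocations using rsync's --bwlimit option (which it interprets as KBytes per second). A sketch, with hypothetical host and directory names:

```python
def throttled_rsync_cmds(sources: list, dest: str,
                         total_kbytes_per_sec: int) -> list:
    """Build one encrypted rsync command per parallel stream, with
    --bwlimit splitting a hard overall cap evenly so the combined
    transfer never exceeds the approved total."""
    per_stream = total_kbytes_per_sec // len(sources)
    return [
        ["rsync", "-az", "-e", "ssh", f"--bwlimit={per_stream}", src, dest]
        for src in sources
    ]

cmds = throttled_rsync_cmds(["chunk1/", "chunk2/", "chunk3/"],
                            "user@ec2-host:/data/", 3000)
# Three streams, each capped at 1000 KB/s.
```

Each command list could then be handed to subprocess.Popen to run the streams concurrently.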

File System Considerations
When replicating file data between different operating systems, it is also important to consider the case sensitivity of the file systems and tools you are using. Most Linux file systems and the HTTP/HTTPS protocols treat paths as case-sensitive ("HELLO.txt" is a different file than "hello.txt"), while most Windows file systems do not ("HELLO.txt" can be opened as "hello.txt"). If you are copying from a case-insensitive file system, such as Windows's NTFS, to a case-sensitive one, you must correct for case or run the risk of having multiple copies of files on the destination filer in the cloud. Thus, if possible, do a like-to-like file system transfer on the cloud.
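Such case collisions can be detected before the transfer rather than discovered afterward. A small sketch that flags file names which would collide on a case-insensitive file system:

```python
from collections import defaultdict

def case_collisions(paths: list) -> list:
    """Group paths that differ only by case; copying such a set between
    case-insensitive and case-sensitive file systems risks silent
    duplicates or overwrites."""
    seen = defaultdict(list)
    for p in paths:
        seen[p.lower()].append(p)
    return [group for group in seen.values() if len(group) > 1]

print(case_collisions(["HELLO.txt", "hello.txt", "data.csv"]))
# -> [['HELLO.txt', 'hello.txt']]
```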

Workflows & Data
In order to run calculations for a certain use case, we must guarantee that the data is in the right place at the right time. There are two possible ways of doing this: periodically synchronize the whole file system, or synchronize just the data you need for a calculation before it gets used in the cloud.

We recommend doing both, if you can tell what data a job for your high-performance application will need to run properly. The periodic updates might run nightly or every three hours, while just-in-time synchronizations occur at job submission time to ensure machines come up with the right data where possible. Depending on many variables, including how you intend to use the file system, the periodic updates help ensure that the just-in-time transfers run as quickly as possible.

Synchronization types and typical timing

Final Thoughts
The EC2 ingress rates are quite capable of handling large data transfers from connections with significant upstream bandwidth available to them. Parallel transfers with capped parallel streams can help take maximum advantage of enterprise-class outbound connections without imparting any serious latency issues on other users within your organization.

If the data is compressible, it is possible, using tools such as rsync, to greatly decrease the number of bits transmitted over the wire. In our experiments we were able to increase upload rates by a factor of three on heavily compressible ASCII text data used for genetic computations. We have pushed data set uploads of 250 gigabytes to EC2 in 22 hours by using a single encrypted and compressed rsync stream on an eight-megabit upload link.
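The compression claim can be sanity-checked with the same bandwidth arithmetic used earlier (decimal gigabytes assumed): moving 250 GB in 22 hours implies an effective rate well above what an 8 Mb/s link can carry uncompressed.

```python
def effective_mbps(gigabytes: float, hours: float) -> float:
    """Sustained megabits per second needed to move a data set
    in the given time (decimal GB)."""
    return gigabytes * 8 * 1000 / (hours * 3600)

rate = effective_mbps(250, 22)
print(round(rate, 1))       # -> 25.3  (effective Mb/s)
print(round(rate / 8, 1))   # -> 3.2   (vs. the 8 Mb/s link)
```

The roughly 3.2x ratio is consistent with the factor-of-three gain observed on compressible text data.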

It is unlikely that the ingress performance into Amazon for even a single EC2 instance will ever come into play unless you have an outbound Internet connection with sustained rates of a large fraction of a gigabit per second or more. Nevertheless, we recommend running at least an m1.large machine for data reception on the EC2 side of the transfer to handle the encryption demands. We have seen inbound transfer rates of encrypted data surpass the capabilities of the basic m1.small instance type.

When performing HTTPS-based transfers, traversing corporate HTTP proxy and DNS filtering infrastructure can have a severe impact on transfer rates. These HTTP security architectures tend to be designed for control over interactive web browsing rather than for sustained data-transfer performance. If at all possible, keep these pieces of infrastructure out of your data transfer path.

Finally, if the data you need to transfer exceeds what your available Internet connection can carry, fear not. There is a solution. Simply mail Amazon a USB disk drive or drive array with your data and they will promptly load it into an S3 bucket for you. As amazingly low-tech as this approach sounds, for multi-terabyte data transfer it can be surprisingly quick. As the saying goes, never underestimate the bandwidth of a truck full of drives...

More Stories By Andrew Kaczorek

Andrew Kaczorek is the former manager of Eli Lilly’s high performance computing cluster who now architects and implements whole clusters in Amazon EC2 at Cycle Computing.
