
Technically what is the difference between s3n, s3a and s3?

I'm aware of the existence of https://wiki.apache.org/hadoop/AmazonS3 and the following words:

S3 Native FileSystem (URI scheme: s3n) A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The disadvantage is the 5GB limit on file size imposed by S3.

S3A (URI scheme: s3a) A successor to the S3 Native, s3n fs, the S3a: system uses Amazon's libraries to interact with S3. This allows S3a to support larger files (no more 5GB limit), higher performance operations and more. The filesystem is intended to be a replacement for/successor to S3 Native: all objects accessible from s3n:// URLs should also be accessible from s3a simply by replacing the URL schema.

S3 Block FileSystem (URI scheme: s3) A block-based filesystem backed by S3. Files are stored as blocks, just like they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket for the filesystem - you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.

Why would a one-letter change in the URI make such a difference? For example, changing

val data = sc.textFile("s3n://bucket-name/key")

to

val data = sc.textFile("s3a://bucket-name/key")

What is the technical difference underlying this change? Are there any good articles that I can read on this?


jarmod

The letter change on the URI scheme makes a big difference because it causes different software to be used to interface to S3. Somewhat like the difference between http and https - it's only a one-letter change, but it triggers a big difference in behavior.
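As a rough illustration of that dispatch, the sketch below mimics how a URI scheme selects an implementation class. The `SchemeDispatch` object and its `Map` are hypothetical stand-ins for Hadoop's `fs.<scheme>.impl` configuration lookup, not Hadoop code; the class names are the ones Apache Hadoop 2.x ships for each scheme.

```scala
import java.net.URI

object SchemeDispatch {
  // Hypothetical stand-in for Hadoop's `fs.<scheme>.impl` lookup.
  // The values are the connector classes shipped with Apache Hadoop 2.x.
  val implementations: Map[String, String] = Map(
    "s3"  -> "org.apache.hadoop.fs.s3.S3FileSystem",             // block-based overlay
    "s3n" -> "org.apache.hadoop.fs.s3native.NativeS3FileSystem", // object-based, jets3t
    "s3a" -> "org.apache.hadoop.fs.s3a.S3AFileSystem"            // object-based, AWS SDK
  )

  // Parse the scheme from the path and look up the class that would handle it.
  def implementationFor(path: String): Option[String] =
    Option(new URI(path).getScheme).flatMap(implementations.get)

  def main(args: Array[String]): Unit = {
    Seq("s3n://bucket-name/key", "s3a://bucket-name/key").foreach { p =>
      println(s"$p -> ${implementationFor(p).getOrElse("no implementation registered")}")
    }
  }
}
```

The two paths name the same bucket and key, yet each resolves to a different class, which is why the behaviour changes.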

The difference between s3 and s3n/s3a is that s3 is a block-based overlay on top of Amazon S3, while s3n/s3a are not (they are object-based).

The difference between s3n and s3a is that s3n supports objects up to 5GB in size, while s3a supports objects up to 5TB and has higher performance (both because it uses multipart upload). s3a is the successor to s3n.
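To make the 5GB and 5TB figures concrete, here is a small arithmetic sketch. It assumes the limits Amazon documents for S3 (5 GB per single PUT, at most 10,000 parts per multipart upload, 5 TB per object); these numbers and the `MultipartLimits` helper are my additions for illustration, not part of any Hadoop connector.

```scala
object MultipartLimits {
  val GiB: Long = 1024L * 1024L * 1024L

  // Assumed S3 limits (from Amazon's documentation, not from Hadoop).
  val singlePutLimit: Long = 5L * GiB         // largest object in one PUT request
  val maxObjectSize: Long  = 5L * 1024L * GiB // largest object overall
  val maxParts: Long       = 10000L           // most parts in one multipart upload

  // An object above the single-PUT limit can only be written in parts.
  def needsMultipart(sizeBytes: Long): Boolean = sizeBytes > singlePutLimit

  // Smallest part size (rounded up) that fits the object into maxParts parts.
  def minPartSize(sizeBytes: Long): Long = (sizeBytes + maxParts - 1) / maxParts

  def main(args: Array[String]): Unit = {
    println(s"6 GiB needs multipart: ${needsMultipart(6L * GiB)}")
    println(s"Smallest part size for a maximum-size object: ${minPartSize(maxObjectSize)} bytes")
  }
}
```

A client that only issues single PUT requests is therefore capped at 5GB per object, while one that uploads in parts can reach the 5TB ceiling.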

Per Work with Storage and File Systems, when using EMRFS:

Previously, Amazon EMR used the s3n and s3a file systems. While both still work, we recommend that you use the s3 URI scheme for the best performance, security, and reliability.

Other historical references to s3n and s3a can be found at this article from Amazon (only available on wayback machine).


The support article from Amazon appears to be up-to-date still, but I can now write to S3 from EMR jobs using the s3a scheme. It's possible that the answer should be revised.
@mig While s3a might work, and it does seem to work in my experience, it's not technically supported by AWS. So, I think you would use it at your own risk.
@christang Looks like it's no longer available so have provided wayback machine link.
Basically, AWS support recommends s3:// in place of s3a:// for any support ticket
This AWS doc – docs.aws.amazon.com/emr/latest/ManagementGuide/… – also suggests "s3://" for all applications: "Previously, Amazon EMR used the s3n and s3a file systems. While both still work, we recommend that you use the s3 URI scheme for the best performance, security, and reliability." Might be worth adding this to your answer
danielpops

In Apache Hadoop, "s3://" refers to the original S3 client, which used a non-standard structure for scalability. That library is deprecated and soon to be deleted.

s3n is its successor, which used direct path names to objects, so you can read and write data with other applications. Like s3://, it uses jets3t.jar to talk to S3.

On Amazon's EMR service, s3:// refers to Amazon's own S3 client, which is different. A path in s3:// on EMR refers directly to an object in the object store.

In Apache Hadoop, S3N and S3A are both connectors to S3, with S3A the successor built using Amazon's own AWS SDK. Why the new name? So we could ship it side by side with the one that was stable. S3A is where all ongoing work on scalability, performance, security, etc. goes. S3N is left alone so we don't break it. S3A shipped in Hadoop 2.6, but was still stabilising until 2.7, primarily with some minor scale problems surfacing.

If you are using Hadoop 2.7 or later, use s3a. If you are using Hadoop 2.5 or earlier, use s3n. If you are using Hadoop 2.6, it's a tougher choice. (My original advice here, since struck out, was to try s3a and switch back to s3n if there were problems.)

For more of the history, see http://hortonworks.com/blog/history-apache-hadoops-support-amazon-s3/

2017-03-14 Update: Actually, partitioning is broken on S3A in Hadoop 2.6, because the block size returned in a listFiles() call is 0, so tools like Spark and Pig partition the work into one task per byte. You cannot use S3A for analytics work in Hadoop 2.6, even if core filesystem operations and data generation are happy. Hadoop 2.7 fixes that.

2018-01-10 Update: Hadoop 3.0 has cut its s3: and s3n implementations; s3a is all you get. It is now significantly better than its predecessor and performs at least as well as the Amazon implementation. Amazon's "s3:", their closed-source client, is still offered by EMR. Consult the EMR docs for more info.
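The version guidance in this answer, including both updates, can be condensed into a small helper. This is only a sketch: `ConnectorChoice` and `recommendedScheme` are hypothetical names, the rule applies to Apache Hadoop rather than EMR, and it treats Hadoop 2.6 as an s3n release because of the partitioning problem described in the 2017 update.

```scala
object ConnectorChoice {
  // Returns the URI scheme this answer recommends for an Apache Hadoop version.
  def recommendedScheme(major: Int, minor: Int): String =
    if (major >= 3) "s3a"                     // 3.0 removed s3 and s3n entirely
    else if (major == 2 && minor >= 7) "s3a"  // 2.7 fixed the zero block size issue
    else "s3n"                                // 2.6 breaks partitioning; 2.5 and earlier lack s3a

  def main(args: Array[String]): Unit = {
    Seq((2, 5), (2, 6), (2, 7), (3, 0)).foreach { case (major, minor) =>
      println(s"Hadoop $major.$minor -> ${recommendedScheme(major, minor)}://")
    }
  }
}
```

On EMR the choice is different: as noted above, Amazon's own s3:// client is the one to use there.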


Aleksandr Panzin

TL;DR

AWS EMR: just use s3://

Non-EMR cluster: limit use of S3. Don't use s3 or s3a to read/write large amounts of data from your code directly.

Fetch data to cluster HDFS using s3-dist-cp and then send it back to S3.

s3a is only useful to read some small to moderate amount of data.

s3a writing is unstable.

(Talking from experience while deploying multiple jobs on EMR and private hardware clusters)


Alexandr, where is your data to back up your claims about s3? In particular, which version were you talking about? If you experienced all this with the Hadoop 2.7 JARs, these criticisms were valid at the time, but the time was 2016.
