
Retrieving subfolders names in S3 bucket from boto3

Using boto3, I can access my AWS S3 bucket:

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket-name')

Now, the bucket contains folder first-level, which itself contains several sub-folders named with a timestamp, for instance 1456753904534. I need to know the name of these sub-folders for another job I'm doing and I wonder whether I could have boto3 retrieve those for me.

So I tried:

objs = bucket.meta.client.list_objects(Bucket='my-bucket-name')

which gives a dictionary whose key 'Contents' lists all the third-level files instead of the second-level timestamp directories. In fact, I get a list containing entries such as

{u'ETag': '"etag"', u'Key': first-level/1456753904534/part-00014', u'LastModified': datetime.datetime(2016, 2, 29, 13, 52, 24, tzinfo=tzutc()), u'Owner': {u'DisplayName': 'owner', u'ID': 'id'}, u'Size': size, u'StorageClass': 'storageclass'}

You can see that the specific files, in this case part-00014, are retrieved, while I'd like to get the name of the directory alone. In principle I could strip the directory name out of every path, but it's ugly and expensive to retrieve everything at the third level just to get the second level!

I also tried something reported here:

for o in bucket.objects.filter(Delimiter='/'):
    print(o.key)

but I do not get the folders at the desired level.

Is there a way to solve this?

So you're saying that this doesn't work? Could you post what happens when you run that?
@JordonPhillips I've tried the first lines of that link you sent, which I pasted here, and I get the text files at the very first level of the bucket and no folders.
@mar tin Did you ever resolve this issue? I am facing a similar dilemma where I need the first element in every bucket's subfolder.
@TedTaylorofLife Yeah, no other way than getting all the objects and splitting by / to get the subfolders.
@mar tin The only way I have done it is to take the output, dump it to a text format, delimit by "/", and then copy and paste the first element. What a pain.
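For reference, a minimal sketch of that brute-force approach from the comments (list everything under the prefix and split the keys on '/'; the bucket and prefix names are placeholders):

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# List every object under first-level/ and collect the distinct second-level names.
# This works, but it downloads the entire key listing, which is exactly what the
# Delimiter-based answers below avoid.
subfolders = set()
for page in paginator.paginate(Bucket='my-bucket-name', Prefix='first-level/'):
    for obj in page.get('Contents', []):
        parts = obj['Key'].split('/')
        if len(parts) > 2:
            subfolders.add(parts[1])

print(sorted(subfolders))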

Dipankar

The piece of code below returns ONLY the 'subfolders' in a 'folder' of an S3 bucket.

import boto3
bucket = 'my-bucket'
# Make sure the prefix ends with /
prefix = 'prefix-name-with-slash/'

client = boto3.client('s3')
result = client.list_objects(Bucket=bucket, Prefix=prefix, Delimiter='/')
for o in result.get('CommonPrefixes', []):
    print('sub folder : ', o.get('Prefix'))

For more details, you can refer to https://github.com/boto/boto3/issues/134


What if I want to list contents of a particular subfolder?
@azhar22k, I assume you could just run the function recursively for each 'sub folder'.
What if there are more than 1000 different prefixes?
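For the more-than-1000-prefixes case, here is a hedged sketch of the same listing done through a paginator instead of a single list_objects call (bucket and prefix names are placeholders):

import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')

# The paginator follows continuation tokens automatically, so more than
# 1000 common prefixes are returned across several pages.
for page in paginator.paginate(Bucket='my-bucket', Prefix='prefix-name-with-slash/', Delimiter='/'):
    for o in page.get('CommonPrefixes', []):
        print('sub folder : ', o.get('Prefix'))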
Bar

S3 is an object store; it doesn't have a real directory structure. The "/" is rather cosmetic. One reason people want a directory structure is that they can maintain/prune/add a tree to the application. For S3, you treat such a structure as a sort of index or search tag.

To manipulate objects in S3, you need boto3.client or boto3.resource, e.g. to list all objects:

import boto3 
s3 = boto3.client("s3")
all_objects = s3.list_objects(Bucket = 'bucket-name') 

http://boto3.readthedocs.org/en/latest/reference/services/s3.html#S3.Client.list_objects

In fact, S3 object names are often stored using the '/' separator. The more recent version of list_objects (list_objects_v2) allows you to limit the response to keys that begin with the specified prefix.

To limit the items to those under certain sub-folders:

import boto3
s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket=BUCKET,
    Prefix='DIR1/DIR2',
    MaxKeys=100)

Documentation

Another option is to use the Python os.path functions to extract the folder prefix. The problem is that this will require listing objects from undesired directories.

import os
s3_key = 'first-level/1456753904534/part-00014'
filename = os.path.basename(s3_key) 
foldername = os.path.dirname(s3_key)

# if you are using a non-conventional delimiter such as '#'
s3_key = 'first-level#1456753904534#part-00014'
filename = s3_key.split("#")[-1]

A reminder about boto3: boto3.resource is a nice high-level API. There are pros and cons to using boto3.client vs boto3.resource. If you develop an internal shared library, using boto3.resource will give you a black-box layer over the resources used.
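As a small illustration of that client/resource distinction (a sketch; the bucket name is a placeholder):

import boto3

# Low-level client: mirrors the S3 API operations directly.
client = boto3.client("s3")
resp = client.list_objects_v2(Bucket="bucket-name", Prefix="first-level/", Delimiter="/")

# High-level resource: object-oriented wrappers over the same API.
bucket = boto3.resource("s3").Bucket("bucket-name")
keys = [obj.key for obj in bucket.objects.filter(Prefix="first-level/")]

# A resource still exposes the underlying client when you need it.
client_again = bucket.meta.client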


This gives me the same result I get with my attempt in the question. I guess I'll have to solve the hard way by grabbing all keys from the returned objects and splitting the string to get the folder name.
@martina: a lazy Python split, then pick the last element of the list, e.g. filename = keyname.split("/")[-1]
@martin directory_name = os.path.dirname('directory/path/and/filename.txt') and file_name = os.path.basename('directory/path/and/filename.txt')
Pedantic, but it's not clear what "real directory structure" means. S3's directory abstraction is thinner than a typical filesystem's, but they are both just abstractions and, for the OP's purposes here, identical in power.
Does one really want to use os.path utilities to manipulate bucket keys? I found this while looking for a kosher way to edit S3 bucket paths, exactly in order to avoid homegrown path splicing and/or use of os.path, which happens to sort of work on Linux, but seems certain to fail on Windows, and is otherwise conceptually "just wrong".
marengaz

Short answer:

Use Delimiter='/'. This avoids doing a recursive listing of your bucket. Some answers here wrongly suggest doing a full listing and using some string manipulation to retrieve the directory names. This could be horribly inefficient. Remember that S3 has virtually no limit on the number of objects a bucket can contain. So, imagine that, between bar/ and foo/, you have a trillion objects: you would wait a very long time to get ['bar/', 'foo/'].

Use Paginators. For the same reason (S3 is an engineer's approximation of infinity), you must list through pages and avoid storing all the listing in memory. Instead, consider your "lister" as an iterator, and handle the stream it produces.

Use boto3.client, not boto3.resource. The resource version doesn't seem to handle the Delimiter option well. If you have a resource, say bucket = boto3.resource('s3').Bucket(name), you can get the corresponding client with bucket.meta.client. A minimal sketch combining these three points follows.
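Here is that sketch, with placeholder bucket and prefix names:

import boto3

# Start from a resource if that's what you already have...
bucket = boto3.resource('s3').Bucket('my-bucket-name')

# ...but drop down to the client for the Delimiter-aware, paginated listing.
paginator = bucket.meta.client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket.name, Prefix='first-level/', Delimiter='/'):
    for cp in page.get('CommonPrefixes', []):
        print(cp['Prefix'])  # e.g. first-level/1456753904534/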

Long answer:

The following is an iterator that I use for simple buckets (no version handling).

import os
import boto3
from collections import namedtuple
from operator import attrgetter


S3Obj = namedtuple('S3Obj', ['key', 'mtime', 'size', 'ETag'])


def s3list(bucket, path, start=None, end=None, recursive=True, list_dirs=True,
           list_objs=True, limit=None):
    """
    Iterator that lists a bucket's objects under path, (optionally) starting with
    start and ending before end.

    If recursive is False, then list only the "depth=0" items (dirs and objects).

    If recursive is True, then list recursively all objects (no dirs).

    Args:
        bucket:
            a boto3.resource('s3').Bucket().
        path:
            a directory in the bucket.
        start:
            optional: start key, inclusive (may be a relative path under path, or
            absolute in the bucket)
        end:
            optional: stop key, exclusive (may be a relative path under path, or
            absolute in the bucket)
        recursive:
            optional, default True. If True, lists only objects. If False, lists
            only depth 0 "directories" and objects.
        list_dirs:
            optional, default True. Has no effect in recursive listing. On
            non-recursive listing, if False, then directories are omitted.
        list_objs:
            optional, default True. If False, then directories are omitted.
        limit:
            optional. If specified, then lists at most this many items.

    Returns:
        an iterator of S3Obj.

    Examples:
        # set up
        >>> s3 = boto3.resource('s3')
        ... bucket = s3.Bucket('bucket-name')

        # iterate through all S3 objects under some dir
        >>> for p in s3list(bucket, 'some/dir'):
        ...     print(p)

        # iterate through up to 20 S3 objects under some dir, starting with foo_0010
        >>> for p in s3list(bucket, 'some/dir', limit=20, start='foo_0010'):
        ...     print(p)

        # non-recursive listing under some dir:
        >>> for p in s3list(bucket, 'some/dir', recursive=False):
        ...     print(p)

        # non-recursive listing under some dir, listing only dirs:
        >>> for p in s3list(bucket, 'some/dir', recursive=False, list_objs=False):
        ...     print(p)
"""
    kwargs = dict()
    if start is not None:
        if not start.startswith(path):
            start = os.path.join(path, start)
        # note: need to use a string just smaller than start, because
        # the list_object API specifies that start is excluded (the first
        # result is *after* start).
        kwargs.update(Marker=__prev_str(start))
    if end is not None:
        if not end.startswith(path):
            end = os.path.join(path, end)
    if not recursive:
        kwargs.update(Delimiter='/')
        if not path.endswith('/'):
            path += '/'
    kwargs.update(Prefix=path)
    if limit is not None:
        kwargs.update(PaginationConfig={'MaxItems': limit})

    paginator = bucket.meta.client.get_paginator('list_objects')
    for resp in paginator.paginate(Bucket=bucket.name, **kwargs):
        q = []
        if 'CommonPrefixes' in resp and list_dirs:
            q = [S3Obj(f['Prefix'], None, None, None) for f in resp['CommonPrefixes']]
        if 'Contents' in resp and list_objs:
            q += [S3Obj(f['Key'], f['LastModified'], f['Size'], f['ETag']) for f in resp['Contents']]
        # note: even with sorted lists, it is faster to sort(a+b)
        # than heapq.merge(a, b) at least up to 10K elements in each list
        q = sorted(q, key=attrgetter('key'))
        if limit is not None:
            q = q[:limit]
            limit -= len(q)
        for p in q:
            if end is not None and p.key >= end:
                return
            yield p


def __prev_str(s):
    if len(s) == 0:
        return s
    s, c = s[:-1], ord(s[-1])
    if c > 0:
        s += chr(c - 1)
    s += ''.join(['\u7FFF' for _ in range(10)])
    return s

Test:

The following is helpful to test the behavior of the paginator and list_objects. It creates a number of dirs and files. Since pages hold up to 1000 entries, we use a multiple of that for dirs and files. dirs contains only directories (each holding one object). mixed contains a mix of dirs and objects, with a ratio of 2 objects for each dir (plus the one object under each dir, of course; S3 stores only objects).

import concurrent.futures

def genkeys(top='tmp/test', n=2000):
    for k in range(n):
        if k % 100 == 0:
            print(k)
        for name in [
            os.path.join(top, 'dirs', f'{k:04d}_dir', 'foo'),
            os.path.join(top, 'mixed', f'{k:04d}_dir', 'foo'),
            os.path.join(top, 'mixed', f'{k:04d}_foo_a'),
            os.path.join(top, 'mixed', f'{k:04d}_foo_b'),
        ]:
            yield name


with concurrent.futures.ThreadPoolExecutor(max_workers=32) as executor:
    executor.map(lambda name: bucket.put_object(Key=name, Body='hi\n'.encode()), genkeys())

The resulting structure is:

./dirs/0000_dir/foo
./dirs/0001_dir/foo
./dirs/0002_dir/foo
...
./dirs/1999_dir/foo
./mixed/0000_dir/foo
./mixed/0000_foo_a
./mixed/0000_foo_b
./mixed/0001_dir/foo
./mixed/0001_foo_a
./mixed/0001_foo_b
./mixed/0002_dir/foo
./mixed/0002_foo_a
./mixed/0002_foo_b
...
./mixed/1999_dir/foo
./mixed/1999_foo_a
./mixed/1999_foo_b

With a little bit of doctoring of the code given above for s3list to inspect the responses from the paginator, you can observe some fun facts:

The Marker is really exclusive. Passing Marker=topdir + 'mixed/0500_foo_a' will make the listing start after that key (as per the Amazon S3 API), i.e., with .../mixed/0500_foo_b. That's the reason for __prev_str().

Using Delimiter, when listing mixed/, each response from the paginator contains 666 keys and 334 common prefixes. It's pretty good at not building enormous responses.

By contrast, when listing dirs/, each response from the paginator contains 1000 common prefixes (and no keys).

Passing a limit in the form of PaginationConfig={'MaxItems': limit} limits only the number of keys, not the common prefixes. We deal with that by further truncating the stream of our iterator.


@Mehdi : it's really not very complicated, for a system that offers such unbelievable scale and reliability. If you ever deal with more than a few hundred TBs, you'll get an appreciation for what they are offering. Remember, drives always have an MTBF > 0... Think about the implications for large scale data storage. Disclaimer: I'm an active and happy AWS user, no other connection, except I've worked on petabyte scale data since 2007 and it used to be much harder.
Adding a fix to your code: in case someone wants to list all directories in the bucket non-recursively, they would call s3list(bucket, '', recursive=False, list_objs=False), so I added "and len(path) > 0" to the "if not path.endswith('/')" check.
I like the kwargs usage. Nice trick to avoid duplicating the list_objects_v2 call with and without a ContinuationToken.
Just to clarify, did I understand it correctly that the uniqueness of listed dirs is not guaranteed? I guess within the same page, commonPrefixes may contain unique prefixes only, but between 2 different pages, some prefixes can be duplicated.
dirs are only listed if recursive=False, list_dirs=True. There are no duplicates in the listing.
azhar22k

It took me a lot of time to figure out, but finally here is a simple way to list the contents of a subfolder in an S3 bucket using boto3. Hope it helps.

prefix = "folderone/foldertwo/"
s3 = boto3.resource('s3')
bucket = s3.Bucket(name="bucket_name_here")
FilesNotFound = True
for obj in bucket.objects.filter(Prefix=prefix):
    print('{0}:{1}'.format(bucket.name, obj.key))
    FilesNotFound = False
if FilesNotFound:
    print("ALERT", "No file in {0}/{1}".format(bucket.name, prefix))

what if your folder contains an enormous number of objects?
my point is that this is a horribly inefficient solution. S3 is built to deal with arbitrary separators in the keys, for example '/'. That lets you skip over "folders" full of objects without having to paginate over them. And then, even if you insist on a full listing (i.e. the 'recursive' equivalent in the aws cli), you must use paginators or you will list just the first 1000 objects.
This is a great answer. For those who need it, I have applied a limit to it in my derived answer.
This is a great answer! Sometimes we are not interested in performance, but in the simplest code to maintain. This is quite simple and it works really well!
Glen Thompson

I had the same issue but managed to resolve it using boto3.client and list_objects_v2 with Bucket and StartAfter parameters.

s3client = boto3.client('s3')
bucket = 'my-bucket-name'
startAfter = 'firstlevelFolder/secondLevelFolder'

theobjects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter)
for obj in theobjects['Contents']:
    print(obj['Key'])

The output result for the code above would display the following:

firstlevelFolder/secondLevelFolder/item1
firstlevelFolder/secondLevelFolder/item2

Boto3 list_objects_v2 Documentation

In order to strip out only the directory name for secondLevelFolder I just used python method split():

s3client = boto3.client('s3')
bucket = 'my-bucket-name'
startAfter = 'firstlevelFolder/secondLevelFolder'

theobjects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter)
for obj in theobjects['Contents']:
    directoryName = obj['Key'].split('/')
    print(directoryName[1])

The output result for the code above would display the following:

secondLevelFolder
secondLevelFolder

Python split() Documentation

If you'd like to get the directory name AND the contents item name, then replace the print line with the following:

print("{}/{}".format(directoryName[1], directoryName[2]))

And the following will be output:

secondLevelFolder/item1
secondLevelFolder/item2

Hope this helps


Asclepius

The big realisation with S3 is that there are no folders/directories, just keys. The apparent folder structure is simply prepended to the filename to become the 'Key', so to list the contents of myBucket's some/path/to/the/file/ you can try:

s3 = boto3.client('s3')
for obj in s3.list_objects_v2(Bucket="myBucket", Prefix="some/path/to/the/file/")['Contents']:
    print(obj['Key'])

which would give you something like:

some/path/to/the/file/yo.jpg
some/path/to/the/file/meAndYou.gif
...

This is a good answer, but it will retrieve up to only 1000 objects and no more. I have produced a derived answer which can retrieve a larger number of objects.
Yeah, @Acumenus, I guess your answer is more complex.
cem

The following works for me... S3 objects:

s3://bucket/
    form1/
       section11/
          file111
          file112
       section12/
          file121
    form2/
       section21/
          file211
          file112
       section22/
          file221
          file222
          ...
      ...
   ...

Using:

from boto3.session import Session

session = Session()  # configure credentials/region here if needed
s3client = session.client('s3')
resp = s3client.list_objects(Bucket=bucket, Prefix='', Delimiter="/")
forms = [x['Prefix'] for x in resp['CommonPrefixes']] 

we get:

form1/
form2/
...

With:

resp = s3client.list_objects(Bucket=bucket, Prefix='form1/', Delimiter="/")
sections = [x['Prefix'] for x in resp['CommonPrefixes']] 

we get:

form1/section11/
form1/section12/

This is the only solution that worked for me because I needed the "folders" in the root of the bucket; there the prefix has to be '', whereas otherwise it has to end with "/".
Paul Zielinski

The AWS cli does this (presumably without fetching and iterating through all keys in the bucket) when you run aws s3 ls s3://my-bucket/, so I figured there must be a way using boto3.

https://github.com/aws/aws-cli/blob/0fedc4c1b6a7aee13e2ed10c3ada778c702c22c3/awscli/customizations/s3/subcommands.py#L499

It looks like they indeed use Prefix and Delimiter - I was able to write a function that would get me all directories at the root level of a bucket by modifying that code a bit:

def list_folders_in_bucket(bucket):
    paginator = boto3.client('s3').get_paginator('list_objects')
    folders = []
    iterator = paginator.paginate(Bucket=bucket, Prefix='', Delimiter='/', PaginationConfig={'PageSize': None})
    for response_data in iterator:
        prefixes = response_data.get('CommonPrefixes', [])
        for prefix in prefixes:
            prefix_name = prefix['Prefix']
            if prefix_name.endswith('/'):
                folders.append(prefix_name.rstrip('/'))
    return folders

Asclepius

Why not use the s3path package, which makes this as convenient as working with pathlib? A rough sketch of that approach follows; if you must use boto3, the two options after it apply.
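This sketch assumes the package's pathlib-style API (S3Path, iterdir, is_dir); the bucket path is a placeholder:

from s3path import S3Path  # pip install s3path

# List the immediate "subfolders" under a prefix, pathlib-style.
root = S3Path('/my-bucket-name/first-level/')
subfolders = [p for p in root.iterdir() if p.is_dir()]
print(subfolders)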

Using boto3.resource

This builds upon the answer by itz-azhar to apply an optional limit. It is obviously substantially simpler to use than the boto3.client version.

import logging
from typing import List, Optional

import boto3
from boto3_type_annotations.s3 import ObjectSummary  # pip install boto3_type_annotations

log = logging.getLogger(__name__)
_S3_RESOURCE = boto3.resource("s3")

def s3_list(bucket_name: str, prefix: str, *, limit: Optional[int] = None) -> List[ObjectSummary]:
    """Return a list of S3 object summaries."""
    # Ref: https://stackoverflow.com/a/57718002/
    return list(_S3_RESOURCE.Bucket(bucket_name).objects.limit(count=limit).filter(Prefix=prefix))


if __name__ == "__main__":
    s3_list("noaa-gefs-pds", "gefs.20190828/12/pgrb2a", limit=10_000)

Using boto3.client

This uses list_objects_v2 and builds upon the answer by CpILL to allow retrieving more than 1000 objects.

import logging
from typing import cast, List

import boto3

log = logging.getLogger(__name__)
_S3_CLIENT = boto3.client("s3")

def s3_list(bucket_name: str, prefix: str, *, limit: int = cast(int, float("inf"))) -> List[dict]:
    """Return a list of S3 object summaries."""
    # Ref: https://stackoverflow.com/a/57718002/
    contents: List[dict] = []
    continuation_token = None
    if limit <= 0:
        return contents
    while True:
        max_keys = min(1000, limit - len(contents))
        request_kwargs = {"Bucket": bucket_name, "Prefix": prefix, "MaxKeys": max_keys}
        if continuation_token:
            log.info(  # type: ignore
                "Listing %s objects in s3://%s/%s using continuation token ending with %s with %s objects listed thus far.",
                max_keys, bucket_name, prefix, continuation_token[-6:], len(contents))  # pylint: disable=unsubscriptable-object
            response = _S3_CLIENT.list_objects_v2(**request_kwargs, ContinuationToken=continuation_token)
        else:
            log.info("Listing %s objects in s3://%s/%s with %s objects listed thus far.", max_keys, bucket_name, prefix, len(contents))
            response = _S3_CLIENT.list_objects_v2(**request_kwargs)
        assert response["ResponseMetadata"]["HTTPStatusCode"] == 200
        contents.extend(response["Contents"])
        is_truncated = response["IsTruncated"]
        if (not is_truncated) or (len(contents) >= limit):
            break
        continuation_token = response["NextContinuationToken"]
    assert len(contents) <= limit
    log.info("Returning %s objects from s3://%s/%s.", len(contents), bucket_name, prefix)
    return contents


if __name__ == "__main__":
    s3_list("noaa-gefs-pds", "gefs.20190828/12/pgrb2a", limit=10_000)

That s3path library is a life saver! Thank you so much!
Vitalii Kotliarenko

As of boto3 1.13.3, it turns out to be as simple as this (if you skip all pagination considerations, which were covered in other answers):

def get_sub_paths(bucket, prefix):
    # prefix should end with '/'; Delimiter='/' makes S3 return CommonPrefixes
    s3 = boto3.client('s3')
    response = s3.list_objects_v2(
        Bucket=bucket,
        Prefix=prefix,
        Delimiter='/',
        MaxKeys=1000)
    return [item["Prefix"] for item in response['CommonPrefixes']]

Dan Revital

This worked well for me for retrieving only the first-level folders beneath the bucket:

client = boto3.client('s3')
bucket = 'my-bucket-name'
folders = set()

for prefix in client.list_objects(Bucket=bucket, Delimiter='/')['CommonPrefixes']:
    folders.add(prefix['Prefix'][:-1])
    
print(folders)

You can do the same with a list rather than a set, as the folder-names are unique


Salil Mathur

Some great answers to this question.

I had been using the boto3 resource objects.filter method to get all files. The objects.filter method returns an iterator and is extremely fast, although converting it to a list is time-consuming.

list_objects_v2 returns the actual content and not an iterator. However, you need to loop to fetch all of the content because each call has a size limit of 1000 keys.

To get only the folders, I apply a list comprehension like so, where index is the depth of the folder level you want:

[x.split('/')[index] for x in files]

Below are the times taken by the various methods. The number of files was 125,077 when running these tests.

%%timeit

s3 = boto3.resource('s3')
response = s3.Bucket('bucket').objects.filter(Prefix='foo/bar/')
3.95 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit

s3 = boto3.resource('s3')
response = s3.Bucket('foo').objects.filter(Prefix='foo/bar/')
files = list(response)
26.6 s ± 1.08 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit

s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket='bucket', Prefix='foo/bar/')
files = response['Contents']
while 'NextContinuationToken' in response:
    response = s3.list_objects_v2(Bucket='bucket', Prefix='foo/bar/', ContinuationToken=response['NextContinuationToken'])
    files.extend(response['Contents'])
22.8 s ± 1.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

ambigus9

Here is a possible solution:

def download_list_s3_folder(my_bucket,my_folder):
    import boto3
    s3 = boto3.client('s3')
    response = s3.list_objects_v2(
        Bucket=my_bucket,
        Prefix=my_folder,
        MaxKeys=1000)
    return [item["Key"] for item in response['Contents']]

Milad Beigi

Using a recursive approach to list all the distinct paths in an S3 bucket.

def common_prefix(bucket_name, paths, prefix=''):
    client = boto3.client('s3')
    paginator = client.get_paginator('list_objects')
    result = paginator.paginate(Bucket=bucket_name, Prefix=prefix, Delimiter='/')
    for prefix in result.search('CommonPrefixes'):
        if prefix is None:
            break
        paths.append(prefix.get('Prefix'))
        common_prefix(bucket_name, paths, prefix.get('Prefix'))


exsurge-domine

This is what I used in my latest project. It uses a paginator, so it works even if the response returns more than 1000 keys.

import boto3

def list_folders(s3, bucket_name, prefix="", delimiter="/"):
    prefixes = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter):
        for common_prefix in page.get("CommonPrefixes", []):
            prefixes.append(common_prefix)

    return [content.get('Prefix') for content in prefixes]

s3_client = boto3.session.Session(profile_name="my_profile_name", region_name="my_region_name").client('s3')
folders = list_folders(s3_client, "my_bucket_name", prefix="path/to/folder")

Community

First of all, there is no real folder concept in S3. You can definitely have a file at '/folder/subfolder/myfile.txt' with no folder or subfolder existing.

To "simulate" a folder in S3, you must create an empty file with a '/' at the end of its name (see Amazon S3 boto - how to create a folder?)

For your problem, you should probably use the method get_all_keys with the two parameters prefix and delimiter:

https://github.com/boto/boto/blob/develop/boto/s3/bucket.py#L427

for key in bucket.get_all_keys(prefix='first-level/', delimiter='/'):
    print(key.name)

I am afraid I don't have the method get_all_keys on the bucket object. I am using boto3 version 1.2.3.
Just checked boto 1.2a: there, bucket has a method list with prefix and delimiter. I suppose it should work.
The Bucket object retrieved as I post in the question does not have those methods. I am on boto3 1.2.6, what version does your link refer to?
nate

I know that boto3 is the topic being discussed here, but I find that it is usually quicker and more intuitive to simply use awscli for something like this; for what it's worth, awscli retains more capabilities than boto3.

For example, if I have objects saved in "subfolders" associated with a given bucket, I can list them all out with something such as this:

1) 'mydata' = bucket name
2) 'f1/f2/f3' = "path" leading to "files" or objects
3) 'foo2.csv, barfar.segy, gar.tar' = all objects "inside" f3

So, we can think of the "absolute path" leading to these objects as 'mydata/f1/f2/f3/foo2.csv'...

Using awscli commands, we can easily list all objects inside a given "subfolder" via:

aws s3 ls s3://mydata/f1/f2/f3/ --recursive


peterDriscoll

Following is a piece of code that can handle pagination, if you are trying to fetch a large number of S3 bucket objects:

import boto3

def get_matching_s3_objects(bucket, prefix="", suffix=""):
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    kwargs = {'Bucket': bucket}

    # We can pass the prefix directly to the S3 API.  If the user has passed
    # a tuple or list of prefixes, we go through them one by one.
    if isinstance(prefix, str):
        prefixes = (prefix, )
    else:
        prefixes = prefix

    for key_prefix in prefixes:
        kwargs["Prefix"] = key_prefix

        for page in paginator.paginate(**kwargs):
            try:
                contents = page["Contents"]
            except KeyError:
                return

            for obj in contents:
                key = obj["Key"]
                if key.endswith(suffix):
                    yield obj

What if the first page is full of "CommonPrefixes" and does not provide any "Contents" key? I think a proper implementation should skip a missing Contents key and continue with the next page, as in the sketch below.
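A hedged sketch of that adjustment, as a simplified variant of the function above (single string prefix; a page without "Contents" is skipped instead of ending the generator):

import boto3

def get_matching_s3_objects(bucket, prefix="", suffix=""):
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page holding only CommonPrefixes has no "Contents" key;
        # skip it rather than ending the generator early.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(suffix):
                yield obj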
LucyDrops

The "dir" to list aren't really objects instead they are sub-strings of object keys, so they won't show up in the objects.filter method. you can use client's list_objects here with Prefix specified.

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket-name')
res = bucket.meta.client.list_objects(Bucket=bucket.name, Delimiter='/', Prefix = 'sub-folder/')
for o in res.get('CommonPrefixes'):
    print(o.get('Prefix'))

Sma Ma

Yes, as already mentioned, the important thing is that there is no real folder concept in S3. But let's see what tricks are possible with the S3 API.

The following example is an improvement on the solution in the answer by @cem.

In addition to @cem's solution, this one uses the S3 paginator API. It collects all results even if the listing contains more than 1000 objects; the paginator automatically fetches the next results from 1001 to 2000 and so on.

In this example, all "subfolders" (keys) under a specific "folder" named "lala" are listed (without recursing into those subfolders).

The Prefix='lala/' and Delimiter="/" parameters do the magic.

# given "folder/key" structure
# .
# ├── lorem.txt
# ├─── lala
# │ ├── folder1
# │ │    ├── file1.txt
# │ │    └── file2.txt
# │ ├── folder2
# │ │    └── file1.txt
# │ └── folder3
# │      └── file1.txt
# └── lorem
#   └── folder4
#        ├── file1.txt
#        └── file2.txt

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# Execute paginated list_objects_v2
response = paginator.paginate(Bucket='your-bucket-name', Prefix='lala/', Delimiter="/")

# Get prefix for each page result
names = []
for page in response:
    names.extend([x['Prefix'] for x in page.get('CommonPrefixes', [])])

print(names)
# Result is:
# ['lala/folder1/','lala/folder2/','lala/folder3/']