
Download large file in python with requests

Requests is a really nice library. I'd like to use it for downloading big files (>1GB). The problem is that it's not possible to keep the whole file in memory; I need to read it in chunks. And this is a problem with the following code:

import requests

def DownloadFile(url)
    local_filename = url.split('/')[-1]
    r = requests.get(url)
    f = open(local_filename, 'wb')
    for chunk in r.iter_content(chunk_size=512 * 1024): 
        if chunk: # filter out keep-alive new chunks
            f.write(chunk)
    f.close()
    return 

For some reason it doesn't work this way; it still loads the response into memory before it is saved to a file.


Jenia

With the following streaming code, the Python memory usage is restricted regardless of the size of the downloaded file:

def download_file(url):
    local_filename = url.split('/')[-1]
    # NOTE the stream=True parameter below
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192): 
                # If you have a chunk-encoded response, uncomment the if below
                # and set the chunk_size parameter to None.
                #if chunk: 
                f.write(chunk)
    return local_filename

Note that the number of bytes returned by iter_content is not necessarily exactly chunk_size; it can vary from one iteration to the next and is often considerably larger than the requested size.

See body-content-workflow and Response.iter_content for further reference.
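If you also want progress reporting while streaming, one option (a sketch building on the code above, not part of the original answer; download_with_progress is just an illustrative name) is to read the Content-Length header, when the server sends one, and count the bytes written:

import requests

def download_with_progress(url, local_filename):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        # Content-Length may be missing (e.g. chunked responses), so fall back to 0
        total = int(r.headers.get('Content-Length', 0))
        written = 0
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
                written += len(chunk)
                if total:
                    print(f'\r{written / total:.1%}', end='')
    return local_filename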


@Shuman As I see, you resolved the issue when you switched from http:// to https:// (github.com/kennethreitz/requests/issues/2043). Can you please update or delete your comments, because people may think that there are issues with the code for files bigger than 1024 MB.
The chunk_size is crucial. By default it's 1 (1 byte), which means that for 1 MB it will make 1 million iterations. docs.python-requests.org/en/latest/api/…
@RomanPodlinov: f.flush() doesn't flush data to the physical disk. It transfers the data to the OS. Usually, that is enough unless there is a power failure. f.flush() makes the code slower here for no reason. The flush happens when the corresponding file buffer (inside the app) is full. If you need more frequent writes, pass a buffer size (the buffering parameter) to open(); see the sketch after these comments.
if chunk: # filter out keep-alive new chunks – it is redundant, isn't it? Since iter_content() always yields string and never yields None, it looks like premature optimization. I also doubt it can ever yield empty string (I cannot imagine any reason for this).
@RomanPodlinov And one more point, sorry :) After reading iter_content() sources I've concluded that it cannot ever yield an empty string: there are emptiness checks everywhere. The main logic here: requests/packages/urllib3/response.py.
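To make the buffering point above concrete: in binary mode, the built-in open() accepts a buffer size in bytes as its buffering argument, so you can tune how often writes leave the application without calling f.flush() yourself. A minimal sketch (the function name and the 1 MiB value are illustrative, not from the answer):

import requests

def download_file_buffered(url, local_filename):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        # Hold up to ~1 MiB in the application's file buffer before it is
        # handed to the OS; the OS then decides when it reaches the disk.
        with open(local_filename, 'wb', buffering=1024 * 1024) as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return local_filename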
Daniel F

It's much easier if you use Response.raw and shutil.copyfileobj():

import requests
import shutil

def download_file(url):
    local_filename = url.split('/')[-1]
    with requests.get(url, stream=True) as r:
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)

    return local_filename

This streams the file to disk without using excessive memory, and the code is simple.

Note: According to the documentation, Response.raw will not decode gzip and deflate transfer-encodings, so you will need to do this manually.
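If you do need the decoded body while still using Response.raw, one workaround discussed in the comments below (and in the requests issue they reference) is to force urllib3 to decode as shutil.copyfileobj() reads. A sketch, with download_file_decoded being an illustrative name, assuming this functools.partial trick applies to your requests/urllib3 versions:

import functools
import shutil
import requests

def download_file_decoded(url, local_filename):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        # Ask urllib3 to undo gzip/deflate Content-Encoding on each read
        r.raw.read = functools.partial(r.raw.read, decode_content=True)
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    return local_filename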


Note that you may need to adjust when streaming gzipped responses per issue 2155.
THIS should be the correct answer! The accepted answer gets you up to 2-3 MB/s. Using copyfileobj gets you to ~40 MB/s. curl downloads (same machine, same URL, etc.) at ~50-55 MB/s.
A small caveat for using .raw is that it does not handle decoding. Mentioned in the docs here: docs.python-requests.org/en/master/user/quickstart/…
Adding length param got me better download speeds shutil.copyfileobj(r.raw, f, length=16*1024*1024)
Gringo Suave

Not exactly what OP was asking, but... it's ridiculously easy to do that with urllib:

from urllib.request import urlretrieve

url = 'http://mirror.pnl.gov/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso'
dst = 'ubuntu-16.04.2-desktop-amd64.iso'
urlretrieve(url, dst)

Or this way, if you want to save it to a temporary file:

from urllib.request import urlopen
from shutil import copyfileobj
from tempfile import NamedTemporaryFile

url = 'http://mirror.pnl.gov/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso'
with urlopen(url) as fsrc, NamedTemporaryFile(delete=False) as fdst:
    copyfileobj(fsrc, fdst)

I watched the process:

watch 'ps -p 18647 -o pid,ppid,pmem,rsz,vsz,comm,args; ls -al *.iso'

And I saw the file growing, but memory usage stayed at 17 MB. Am I missing something?


For Python 2.x, use from urllib import urlretrieve
This function "might become deprecated at some point in the future." cf. docs.python.org/3/library/urllib.request.html#legacy-interface
Community

Your chunk size could be too large; have you tried dropping that, maybe to 1024 bytes at a time? (Also, you could use with to tidy up the syntax.)

def DownloadFile(url):
    local_filename = url.split('/')[-1]
    r = requests.get(url)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024): 
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
    return 

Incidentally, how are you deducing that the response has been loaded into memory?

It sounds as if Python isn't flushing the data to the file. Based on other SO questions, you could try f.flush() and os.fsync() to force the file write and free the memory:

    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024): 
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
                f.flush()
                os.fsync(f.fileno())

I use System Monitor in Kubuntu. It shows me that the Python process's memory increases (from 25 KB up to 1.5 GB).
That memory bloat sucks; maybe f.flush(); os.fsync() will force a write and free the memory.
it's os.fsync(f.fileno())
You need to use stream=True in the requests.get() call. That's what's causing the memory bloat.
minor typo: you miss a colon (':') after def DownloadFile(url)
佚名 (Anonymous)

Use the wget module of Python instead. Here is a snippet:

import wget
wget.download(url)
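If you need to choose the destination filename yourself, the wget package's download() also takes an output path; treat the exact keyword below as an assumption to verify against the package's documentation, and the URL as a placeholder:

import wget

# 'out' names the destination file explicitly instead of deriving it from the URL
url = 'http://example.com/big-file.iso'  # placeholder URL
wget.download(url, out='big-file.iso')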

Ben Moskovitch

Based on Roman's most upvoted comment above, here is my implementation, including a "download as" and "retries" mechanism:

import logging
import os
import time
from urllib.parse import urlparse

import requests

logger = logging.getLogger(__name__)


def download(url: str, file_path='', attempts=2):
    """Downloads a URL content into a file (with large file support by streaming)

    :param url: URL to download
    :param file_path: Local file name to contain the data downloaded
    :param attempts: Number of attempts
    :return: New file path. Empty string if the download failed
    """
    if not file_path:
        file_path = os.path.realpath(os.path.basename(url))
    logger.info(f'Downloading {url} content to {file_path}')
    url_sections = urlparse(url)
    if not url_sections.scheme:
        logger.debug('The given url is missing a scheme. Adding http scheme')
        url = f'http://{url}'
        logger.debug(f'New url: {url}')
    for attempt in range(1, attempts+1):
        try:
            if attempt > 1:
                time.sleep(10)  # 10 seconds wait time between downloads
            with requests.get(url, stream=True) as response:
                response.raise_for_status()
                with open(file_path, 'wb') as out_file:
                    for chunk in response.iter_content(chunk_size=1024*1024):  # 1MB chunks
                        out_file.write(chunk)
                logger.info('Download finished successfully')
                return file_path
        except Exception as ex:
            logger.error(f'Attempt #{attempt} failed with error: {ex}')
    return ''
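A minimal usage sketch (the URL is a placeholder, and logging is assumed to be configured as shown):

logging.basicConfig(level=logging.INFO)
saved = download('http://example.com/big-file.iso', attempts=3)  # placeholder URL
if not saved:
    print('download failed')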

r1v3n

requests is good, but how about a socket solution?

def stream_(host):
    import socket
    import ssl
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Default client-side context: verifies the server certificate and hostname
        context = ssl.create_default_context()
        with context.wrap_socket(sock, server_hostname=host) as wrapped_socket:
            wrapped_socket.connect((socket.gethostbyname(host), 443))
            # Build the request from the host argument
            wrapped_socket.send(
                f"GET / HTTP/1.1\r\nHost: {host}\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\r\n\r\n".encode())

            # Read the headers one byte at a time until the blank line that ends them
            resp = b""
            while not resp.endswith(b"\r\n\r\n"):
                resp += wrapped_socket.recv(1)
            resp = resp.decode()
            content_length = int("".join([tag.split(" ")[1] for tag in resp.split("\r\n") if "content-length" in tag.lower()]))
            # Read exactly Content-Length bytes of body
            image = b""
            while content_length > 0:
                data = wrapped_socket.recv(2048)
                if not data:
                    print("EOF")
                    break
                image += data
                content_length -= len(data)
            with open("image.jpeg", "wb") as file:
                file.write(image)
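Usage is then a single call with the host from the original snippet; the response body is written to image.jpeg:

stream_("thiscatdoesnotexist.com")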


I'm curious what's the advantage of using this instead of a higher-level (and well-tested) method from libs like requests?
Libs like requests are full of abstraction on top of native sockets. That's not the best algorithm, but it could be faster because there is no abstraction at all.