
How do I concatenate text files in Python?

I have a list of 20 file names, like ['file1.txt', 'file2.txt', ...]. I want to write a Python script to concatenate these files into a new file. I could open each file by f = open(...), read line by line by calling f.readline(), and write each line into that new file. It doesn't seem very "elegant" to me, especially the part where I have to read/write line by line.

Is there a more "elegant" way to do this in Python?

It's not Python, but in shell scripting you could do something like cat file1.txt file2.txt file3.txt ... > output.txt. In Python, if you don't like readline(), there is always readlines() or simply read().
@jedwards Simply run the cat file1.txt file2.txt file3.txt command using the subprocess module and you're done (a sketch follows these comments). But I am not sure whether cat works on Windows.
As a note, the way you describe is a terrible way to read a file. Use the with statement to ensure your files are closed properly, and iterate over the file to get lines, rather than using f.readline().
@jedwards cat doesn't work when the text file is unicode.
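
For completeness, a rough sketch of the cat-via-subprocess idea from the comment above; it assumes a Unix-like system where cat is available, and the file names are just the ones from the question:

import subprocess

# Sketch only: shell out to cat and redirect its stdout into the output file.
# This will not work on Windows, where cat is usually unavailable.
with open('output.txt', 'w') as outfile:
    subprocess.run(['cat', 'file1.txt', 'file2.txt', 'file3.txt'],
                   stdout=outfile, check=True)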

inspectorG4dget

This should do it

For large files:

filenames = ['file1.txt', 'file2.txt', ...]
with open('path/to/output/file', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)

For small files:

filenames = ['file1.txt', 'file2.txt', ...]
with open('path/to/output/file', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            outfile.write(infile.read())

… and another interesting one that I thought of:

import itertools

filenames = ['file1.txt', 'file2.txt', ...]
with open('path/to/output/file', 'w') as outfile:
    # itertools.imap in the original Python 2 code; map is the lazy Python 3 equivalent
    for line in itertools.chain.from_iterable(map(open, filenames)):
        outfile.write(line)

Sadly, this last method leaves a few open file descriptors, which the GC should take care of anyway. I just thought it was interesting.
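
If the leaked file descriptors bother you, here is a sketch that keeps the single-iterator style but closes every file deterministically via contextlib.ExitStack (note that all input files stay open until the with block exits):

import itertools
from contextlib import ExitStack

filenames = ['file1.txt', 'file2.txt']  # extend as needed
with ExitStack() as stack, open('path/to/output/file', 'w') as outfile:
    # enter_context registers each file so ExitStack closes it on exit
    infiles = (stack.enter_context(open(fname)) for fname in filenames)
    for line in itertools.chain.from_iterable(infiles):
        outfile.write(line)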


This will, for large files, be very memory inefficient.
what are we considering a large file to be?
@dee: a file so large that its contents don't fit into main memory
Why would you decode and re-encode the whole thing, and search for newlines and all the other unnecessary stuff, when all that's required is concatenating the files? The shutil.copyfileobj answer below will be much faster.
Just to reiterate: this is the wrong answer, shutil.copyfileobj is the right answer.
Jeyekomon

Use shutil.copyfileobj.

It automatically reads the input files chunk by chunk for you, which is more efficient than reading the input files into memory all at once, and it will work even if some of the input files are too large to fit into memory:

import shutil

with open('output_file.txt','wb') as wfd:
    for f in ['seg1.txt','seg2.txt','seg3.txt']:
        with open(f,'rb') as fd:
            shutil.copyfileobj(fd, wfd)

for i in glob.glob(r'c:/Users/Desktop/folder/putty/*.txt'): well, I replaced the for statement to include all the files in the directory, but my output_file started growing really huge, into hundreds of GB, very quickly.
Note that it will merge the last line of each file with the first line of the next file if there are no EOL characters. In my case I got a totally corrupted result after using this code. I added wfd.write(b"\n") after copyfileobj to get a normal result (see the sketch after these comments).
@Thelambofgoat I would say that is not a pure concatenation in that case, but hey, whatever suits your needs.
This is by far the best answer!
This is super fast and exactly what I required. Yes, it does not add a newline between the end of one file and the start of the next, which is exactly what I needed, so don't update it :D
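
For anyone who does want a separator, a minimal sketch of the variation described in the comments above (the wfd.write(b"\n") suggestion); whether you need it depends on whether your files already end with a newline:

import shutil

filenames = ['seg1.txt', 'seg2.txt', 'seg3.txt']
with open('output_file.txt', 'wb') as wfd:
    for f in filenames:
        with open(f, 'rb') as fd:
            shutil.copyfileobj(fd, wfd)
        # Optional: separate files whose last line has no trailing newline.
        wfd.write(b"\n")
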
Novice C

That's exactly what fileinput is for:

import fileinput
with open(outfilename, 'w') as fout, fileinput.input(filenames) as fin:
    for line in fin:
        fout.write(line)

For this use case, it's really not much simpler than just iterating over the files manually, but in other cases, having a single iterator that iterates over all of the files as if they were a single file is very handy. (Also, the fact that fileinput closes each file as soon as it's done means there's no need to with or close each one, but that's just a one-line savings, not that big of a deal.)

There are some other nifty features in fileinput, like the ability to do in-place modifications of files just by filtering each line.
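
As a hedged illustration of that in-place feature (the file name is just the one from the question, and the foo/bar substitution is arbitrary): with inplace=True, whatever you print while iterating replaces the corresponding line in the file itself.

import fileinput

# Sketch: rewrite file1.txt in place, filtering each line as it goes.
with fileinput.input('file1.txt', inplace=True) as f:
    for line in f:
        print(line.replace('foo', 'bar'), end='')  # printed output replaces the line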

As noted in the comments, and discussed in another post, fileinput for Python 2.7 will not work as indicated. Here is a slight modification to make the code Python 2.7 compliant:

import fileinput

with open('outfilename', 'w') as fout:
    fin = fileinput.input(filenames)
    for line in fin:
        fout.write(line)
    fin.close()

@Lattyware: I think most people who learn about fileinput are told that it's a way to turn a simple sys.argv (or what's left as args after optparse/etc.) into a big virtual file for trivial scripts, and don't think to use it for anything else (i.e., when the list isn't command-line args). Or they do learn, but then forget—I keep re-discovering it every year or two…
@abament I think for line in fileinput.input() isn't the best way to choose in this particular case: the OP wants to concatenate files, not read them line by line which is a theoretically longer process to execute
@eyquem: It's not a longer process to execute. As you yourself pointed out, line-based solutions don't read one character at a time; they read in chunks and pull lines out of a buffer. The I/O time will completely swamp the line-parsing time, so as long as the implementor didn't do something horribly stupid in the buffering, it will be just as fast (and possibly even faster than trying to guess at a good buffer size yourself, if you think 10000 is a good choice).
@abarnert NO, 10000 isn't a good choice. It is indeed a very bad choice, because it isn't a power of 2 and it is a ridiculously small size. Better sizes would be 2097152 (2**21), 16777216 (2**24) or even 134217728 (2**27); why not? 128 MB is nothing in 4 GB of RAM.
Example code not quite valid for Python 2.7.10 and later: stackoverflow.com/questions/30835090/…
Daniel

I don't know about elegance, but this works:

import glob
import os

for f in glob.glob("file*.txt"):
    os.system("cat " + f + " >> OutFile.txt")

you can even avoid the loop: import os; os.system("cat file*.txt >> OutFile.txt")
not crossplatform and will break for file names with spaces in them
This is insecure; also, cat can take a list of files, so no need to repeatedly call it. You can easily make it safe by calling subprocess.check_call instead of os.system
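
A sketch of that safer variant, reusing the glob pattern and output name from the answer above (still Unix-only, since it relies on cat):

import glob
import subprocess

files = sorted(glob.glob("file*.txt"))
with open("OutFile.txt", "w") as outfile:
    # One cat invocation for the whole list; check_call raises on failure.
    subprocess.check_call(["cat", *files], stdout=outfile)
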
lucasg

What's wrong with UNIX commands? (Given you're not working on Windows.)

ls | xargs cat | tee output.txt does the job (you can call it from Python with subprocess if you want).


because this is a question about python.
Nothing wrong in general, but this answer is broken (don't pass the output of ls to xargs, just pass the list of files to cat directly: cat * | tee output.txt).
If it can insert filename as well that would be great.
@Deqing To specify input file names, you can use cat file1.txt file2.txt | tee output.txt
... and you can disable sending to stdout (printing in Terminal) by adding 1> /dev/null to the end of the command
Clint Chelak
outfile.write(infile.read()) # time: 2.1085190773010254s
shutil.copyfileobj(fd, wfd, 1024*1024*10) # time: 0.60599684715271s

A simple benchmark shows that shutil performs better.
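
The answer does not show how the timings were obtained; here is a minimal sketch of how such a comparison might be run (the file names and the 10 MB buffer size are assumptions, not the author's setup):

import shutil
import time

filenames = ['seg1.txt', 'seg2.txt', 'seg3.txt']  # substitute your own large files

start = time.perf_counter()
with open('out_read.txt', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            outfile.write(infile.read())
print('read/write :', time.perf_counter() - start)

start = time.perf_counter()
with open('out_shutil.txt', 'wb') as wfd:
    for fname in filenames:
        with open(fname, 'rb') as fd:
            shutil.copyfileobj(fd, wfd, 1024 * 1024 * 10)
print('copyfileobj:', time.perf_counter() - start)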


João Palma

An alternative to @inspectorG4dget's answer (the best answer to date, 29-03-2016). I tested with 3 files of 436 MB.

@inspectorG4dget solution: 162 seconds

The following solution : 125 seconds

from subprocess import Popen

filenames = ['file1.txt', 'file2.txt', 'file3.txt']
with open('batch.bat', 'w') as fbatch:
    cmd = "type "
    for f in filenames:
        cmd += f + " "
    fbatch.write(cmd + " > file4results.txt")
p = Popen("batch.bat", cwd=r"Drive:\Path\to\folder")
stdout, stderr = p.communicate()

The idea is to create a batch file and execute it, taking advantage of "good old technology". It's semi-Python but works faster. Works on Windows.


Michael H.

If you have a lot of files in the directory then glob2 might be a better option to generate a list of filenames rather than writing them by hand.

import glob2

filenames = glob2.glob('*.txt')  # list of all .txt files in the directory

with open('outfile.txt', 'w') as f:
    for file in filenames:
        with open(file) as infile:
            f.write(infile.read()+'\n')

What does this have to do with the question? Why use glob2 instead of the glob module, or the globbing functionality in pathlib?
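
To the comment's point, here is a sketch of the same approach using only the standard library's pathlib (no third-party glob2 required):

from pathlib import Path

# All .txt files in the current directory; note that outfile.txt itself
# will match the pattern if it already exists, just as in the glob2 version.
filenames = sorted(Path('.').glob('*.txt'))

with open('outfile.txt', 'w') as f:
    for file in filenames:
        f.write(file.read_text() + '\n')
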
Alex Kawrykow

Check out the .read() method of the File object:

http://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects

You could do something like:

concat = ""
for file in files:
    concat += open(file).read()

or a more 'elegant' python-way:

concat = ''.join([open(f).read() for f in files])

which, according to this article: http://www.skymind.com/~ocrow/python_string/ would also be the fastest.


This will produce a giant string, which, depending on the size of the files, could be larger than the available memory. As Python provides easy lazy access to files, it's a bad idea.
eyquem

If the files are not gigantic:

with open('newfile.txt', 'wb') as newf:
    for filename in list_of_files:
        with open(filename, 'rb') as hf:
            newf.write(hf.read())
            # newf.write(b'\n\n\n')  if you want to introduce
            # some blank lines between the contents of the copied files

If the files are too big to be entirely read and held in RAM, the algorithm must be a little different: read each file to be copied in a loop, in chunks of fixed length, using read(10000) for example.
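
A minimal sketch of that chunked variant, reusing list_of_files from the block above (the 64 KiB chunk size is an arbitrary choice for illustration):

with open('newfile.txt', 'wb') as newf:
    for filename in list_of_files:
        with open(filename, 'rb') as hf:
            while True:
                chunk = hf.read(64 * 1024)  # read a fixed-size chunk
                if not chunk:               # empty bytes object means EOF
                    break
                newf.write(chunk)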


@Lattyware Because I'm quite sure the execution is faster. By the way, even when the code asks to read a file line by line, the file is read in chunks that are put in a cache, from which each line is then read one after the other. The better procedure would be to set the length of the read chunk equal to the size of that cache. But I don't know how to determine the cache's size.
That's the implementation in CPython, but none of that is guaranteed. Optimizing like that is a bad idea as while it may be effective on some systems, it may not on others.
Yes, of course line-by-line reading is buffered. That's exactly why it's not that much slower. (In fact, in some cases, it may even be slightly faster, because whoever ported Python to your platform chose a much better chunk size than 10000.) If the performance of this really matters, you'll have to profile different implementations. But 99.99…% of the time, either way is more than fast enough, or the actual disk I/O is the slow part and it doesn't matter what your code does.
Also, if you really do need to manually optimize the buffering, you'll want to use os.open and os.read, because plain open uses Python's wrappers around C's stdio, which means either 1 or 2 extra buffers getting in your way.
PS, as for why 10000 is bad: Your files are probably on a disk, with blocks that are some power of 2 bytes long. Let's say they're 4096 bytes. So, reading 10000 bytes means reading two blocks, then part of the next. Reading another 10000 means reading the rest of the next, then two blocks, then part of the next. Count up how many partial or complete block reads you have, and you're wasting a lot of time. Fortunately, the Python, stdio, filesystem, and kernel buffering and caching will hide most of these problems from you, but why try to create them in the first place?
user2825287
import os

def concatFiles():
    path = 'input/'
    files = os.listdir(path)
    for idx, infile in enumerate(files):
        print("File #" + str(idx) + "  " + infile)
    concat = ''.join([open(path + f).read() for f in files])
    with open("output_concatFile.txt", "w") as fo:
        fo.write(concat)

if __name__ == "__main__":
    concatFiles()

VasanthOPT
import os

files = os.listdir()
print(files)
print('#', tuple(files))
name = input('Enter the inclusive file name: ')
exten = input('Enter the type(extension): ')
filename = name + '.' + exten
with open(filename, 'w+') as output_file:
    for i in files:
        print(i)
        with open(i, 'r') as f_i:
            contents = f_i.read()
            print(contents)
            output_file.write(contents)