I have a list of 20 file names, like ['file1.txt', 'file2.txt', ...]. I want to write a Python script to concatenate these files into a new file. I could open each file with f = open(...), read it line by line by calling f.readline(), and write each line into the new file. That doesn't seem very "elegant" to me, especially the part where I have to read/write line by line.
Is there a more "elegant" way to do this in Python?
You can simply run cat file1.txt file2.txt file3.txt ... > output.txt from the shell. In Python, if you don't like readline(), there is always readlines() or simply read().
Or just call the cat file1.txt file2.txt file3.txt command using the subprocess module and you're done. But I am not sure if cat works on Windows.
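A minimal sketch of that subprocess route, assuming a Unix-like system where cat is available (the file names and output name are illustrative):

import subprocess

filenames = ['file1.txt', 'file2.txt', 'file3.txt']
with open('output.txt', 'w') as outfile:
    # cat writes the concatenation to stdout, which is redirected into the output file
    subprocess.run(['cat'] + filenames, stdout=outfile, check=True)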
Use the with statement to ensure your files are closed properly, and iterate over the file object to get lines, rather than using f.readline().
This should do it.
For large files:
filenames = ['file1.txt', 'file2.txt', ...]
with open('path/to/output/file', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)
For small files:
filenames = ['file1.txt', 'file2.txt', ...]
with open('path/to/output/file', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            outfile.write(infile.read())
… and another interesting one that I thought of:
import itertools

filenames = ['file1.txt', 'file2.txt', ...]
with open('path/to/output/file', 'w') as outfile:
    # itertools.imap is Python 2 only; on Python 3 the built-in map() is already lazy
    for line in itertools.chain.from_iterable(map(open, filenames)):
        outfile.write(line)
Sadly, this last method leaves a few open file descriptors, which the GC should take care of anyway. I just thought it was interesting.
Use shutil.copyfileobj.
It automatically reads the input files chunk by chunk for you, which is more efficient than reading each input file in whole, and it will work even if some of the input files are too large to fit into memory:
import shutil

with open('output_file.txt', 'wb') as wfd:
    for f in ['seg1.txt', 'seg2.txt', 'seg3.txt']:
        with open(f, 'rb') as fd:
            shutil.copyfileobj(fd, wfd)
I replaced the for statement with for f in glob.glob(r'c:/Users/Desktop/folder/putty/*.txt'): to include all the files in the directory, but my output_file started growing really huge, into hundreds of GB, very quickly. (Most likely the output file itself matched the *.txt glob and was being copied into itself.)
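A sketch of one way to avoid that, assuming the output file lives inside the globbed folder (paths are illustrative): skip the output file while iterating.

import glob
import os
import shutil

out_path = r'c:/Users/Desktop/folder/putty/output_file.txt'
with open(out_path, 'wb') as wfd:
    for f in glob.glob(r'c:/Users/Desktop/folder/putty/*.txt'):
        if os.path.abspath(f) == os.path.abspath(out_path):
            continue  # don't copy the output file into itself
        with open(f, 'rb') as fd:
            shutil.copyfileobj(fd, wfd)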
That's exactly what fileinput is for:
import fileinput

with open(outfilename, 'w') as fout, fileinput.input(filenames) as fin:
    for line in fin:
        fout.write(line)
For this use case, it's really not much simpler than just iterating over the files manually, but in other cases, having a single iterator that iterates over all of the files as if they were a single file is very handy. (Also, the fact that fileinput closes each file as soon as it's done means there's no need to with or close each one, but that's just a one-line savings, not that big of a deal.)
There are some other nifty features in fileinput, like the ability to do in-place modifications of files just by filtering each line.
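A minimal sketch of that in-place feature, reusing the filenames list from above with an illustrative substitution; with inplace=True, whatever you print replaces the corresponding line in the original file:

import fileinput

for line in fileinput.input(filenames, inplace=True):
    # stdout is redirected into the file being processed, so print() rewrites each line
    print(line.rstrip('\n').replace('old', 'new'))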
As noted in the comments, and discussed in another post, fileinput for Python 2.7 will not work as indicated. Here is a slight modification to make the code Python 2.7 compliant:

with open(outfilename, 'w') as fout:
    fin = fileinput.input(filenames)
    for line in fin:
        fout.write(line)
    fin.close()
Many people who learn about fileinput are told that it's a way to turn a simple sys.argv (or what's left as args after optparse/etc.) into a big virtual file for trivial scripts, and don't think to use it for anything else (i.e., when the list isn't command-line args). Or they do learn, but then forget; I keep re-discovering it every year or two…
for line in fileinput.input() isn't the best choice in this particular case: the OP wants to concatenate files, not read them line by line, which is theoretically a slower process.
I don't know about elegance, but this works:
import glob
import os
for f in glob.glob("file*.txt"):
    os.system("cat " + f + " >> OutFile.txt")
cat can take a list of files, so there's no need to call it repeatedly. You can easily make it safe by calling subprocess.check_call instead of os.system.
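A minimal sketch of that suggestion, keeping the illustrative OutFile.txt name from above:

import glob
import subprocess

files = glob.glob("file*.txt")
with open("OutFile.txt", "w") as out:
    # pass the whole list in one call; no shell is involved, so odd file names are safe
    subprocess.check_call(["cat"] + files, stdout=out)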
What's wrong with UNIX commands? (given you're not working on Windows):
ls | xargs cat | tee output.txt
does the job (you can call it from Python with subprocess if you want).
(Or simply cat * | tee output.txt.)
cat file1.txt file2.txt | tee output.txt works for an explicit list of files; since tee also echoes everything to the terminal, add 1> /dev/null to the end of the command to silence that.
A simple benchmark shows that shutil performs better:
outfile.write(infile.read())                # time: 2.1085190773010254s
shutil.copyfileobj(fd, wfd, 1024*1024*10)   # time: 0.60599684715271s
An alternative to @inspectorG4dget's answer (the best answer to date, 29-03-2016). I tested with 3 files of 436 MB.
@inspectorG4dget's solution: 162 seconds
The following solution: 125 seconds
from subprocess import Popen

filenames = ['file1.txt', 'file2.txt', 'file3.txt']

fbatch = open('batch.bat', 'w')
cmd = "type "            # build the batch command (avoid shadowing the built-in str)
for f in filenames:
    cmd += f + " "
fbatch.write(cmd + " > file4results.txt")
fbatch.close()

p = Popen("batch.bat", cwd=r"Drive:\Path\to\folder")
stdout, stderr = p.communicate()
The idea is to create a batch file and execute it, taking advantage of "good old technology". It's semi-Python but works faster. Works on Windows.
If you have a lot of files in the directory, then glob2 might be a better option to generate a list of filenames rather than writing them by hand.
import glob2

filenames = glob2.glob('*.txt')  # list of all .txt files in the directory
with open('outfile.txt', 'w') as f:
    for file in filenames:
        with open(file) as infile:
            f.write(infile.read() + '\n')
Why use glob2 instead of the glob module, or the globbing functionality in pathlib?
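For comparison, a sketch of the same file listing using only the standard library (the pattern is illustrative):

import glob
from pathlib import Path

filenames = glob.glob('*.txt')                          # standard glob module
filenames = [str(p) for p in Path('.').glob('*.txt')]   # or pathlib's globbing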
Check out the .read() method of the File object:
http://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects
You could do something like:
concat = ""
for file in files:
concat += open(file).read()
or a more "elegant", Pythonic way:
concat = ''.join([open(f).read() for f in files])
which, according to this article: http://www.skymind.com/~ocrow/python_string/ , would also be the fastest.
If the files are not gigantic:
with open('newfile.txt', 'wb') as newf:
    for filename in list_of_files:
        with open(filename, 'rb') as hf:
            newf.write(hf.read())
            # newf.write(b'\n\n\n') if you want to introduce
            # some blank lines between the contents of the copied files
If the files are too big to be entirely read and held in RAM, the algorithm must be a little different: read each file to be copied in a loop, by chunks of fixed length, using read(10000) for example.
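A minimal sketch of that chunked approach, reusing the list_of_files name from above (the chunk size is arbitrary):

with open('newfile.txt', 'wb') as newf:
    for filename in list_of_files:
        with open(filename, 'rb') as hf:
            while True:
                chunk = hf.read(10000)
                if not chunk:       # empty bytes object means end of file
                    break
                newf.write(chunk)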
You could even drop down to os.open and os.read, because plain open uses Python's wrappers around C's stdio, which means either 1 or 2 extra buffers getting in your way.
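A rough sketch of that low-level approach (POSIX flags shown; buffer size and file names are illustrative):

import os

out_fd = os.open('output.txt', os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
for fname in ['file1.txt', 'file2.txt']:
    in_fd = os.open(fname, os.O_RDONLY)
    while True:
        chunk = os.read(in_fd, 1 << 20)   # read up to 1 MiB per call
        if not chunk:
            break
        os.write(out_fd, chunk)
    os.close(in_fd)
os.close(out_fd)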
import os

def concatFiles():
    path = 'input/'
    files = os.listdir(path)
    for idx, infile in enumerate(files):
        print("File #" + str(idx) + " " + infile)
    concat = ''.join([open(path + f).read() for f in files])
    with open("output_concatFile.txt", "w") as fo:
        fo.write(concat)  # write only the concatenated contents, not the path

if __name__ == "__main__":
    concatFiles()
import os

files = os.listdir()
print(files)
print('#', tuple(files))
name = input('Enter the inclusive file name: ')
exten = input('Enter the type (extension): ')
filename = name + '.' + exten

output_file = open(filename, 'w+')
for i in files:
    print(i)
    f_j = open(i, 'r')
    contents = f_j.read()   # read once; iterating the file after read() would yield nothing
    print(contents)
    output_file.write(contents)
    f_j.close()
output_file.close()
The shutil.copyfileobj answer will be much faster.