Read my original post in my blog here: How to Open A File in Python Like A Pro
It's a fundamental question, and most people learn the answer when they first get started with Python.
The solution is rather simple.
- 1st Attempt: use open()
f = open('./i-am-a-file', 'rb')
for line in f.readlines():
    print(line)
f.close()
But is that a good solution? What if the file does not exist? Then open() raises an exception and the program ends.
So it's better to check that the file exists before reading or writing it.
- 2nd Attempt: File Existence Check
import os
file_path = './i-am-a-very-large-file'
if os.path.isfile(file_path):
    f = open(file_path, 'rb')
    for line in f.readlines():
        print(line)
    f.close()
Is this solution good enough? What if we run more complex logic on every line and it throws an exception? In that situation f.close() is never called, so the file may stay open until it is garbage-collected or the interpreter exits. That is bad practice and can cause unexpected issues. For example, if a long-running Python program reads a temp file without closing it explicitly, and the OS (such as Windows) locks the file while it is open, the temp file cannot be deleted until the program ends.
In this case, a better choice is to wrap the file operation in a with statement, so that the file is closed automatically whether the operation succeeds or fails.
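A small sketch to check that claim. The temp file and the deliberately raised ValueError are stand-ins made up for the demo; the point is only that the file object reports closed after the with block is left through an exception.

```python
import os
import tempfile

# Hypothetical throwaway file, created only for this demonstration.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, 'wb') as out:
    out.write(b'first line\nsecond line\n')

handle = None
try:
    with open(demo_path, 'rb') as f:
        handle = f  # keep a reference so we can inspect it afterwards
        for line in f:
            # Stand-in for "complex logic" that fails on a line.
            raise ValueError('something went wrong while processing')
except ValueError:
    pass

# The with statement closed the file even though the body raised.
closed_after_error = handle.closed
print(closed_after_error)

os.remove(demo_path)
```

Running it prints True, which is the behaviour the with statement is relied on for in the next attempt.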
- 3rd Attempt: use with
import os
file_path = './i-am-a-very-large-file'
if os.path.isfile(file_path):
    with open(file_path, 'rb') as f:
        for line in f.readlines():
            print(line)
Is this the perfect solution?
In most cases, yes, it is enough to handle the file operation.
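One caveat about the existence check, raised in the comments at the end of this post: the file can be removed between the os.path.isfile() call and the open() call, and silently skipping a missing file can hide a real error. A minimal sketch of the alternative, catching the exception from open() itself. The helper name read_lines, the printed message, and the choice to return an empty list are assumptions made for illustration, not part of the original solution.

```python
def read_lines(path):
    """Return the lines of a binary file, or an empty list if it is missing.

    Hypothetical helper for illustration only.
    """
    try:
        with open(path, 'rb') as f:
            return f.readlines()
    except FileNotFoundError:
        # Report the problem instead of ignoring it silently.
        print(f'File not found: {path}')
        return []

# Demo with a path that is assumed not to exist.
missing = read_lines('./no-such-file-for-this-demo')
print(missing)
```

Whether to return a default, log, or let the exception propagate depends on the caller; the sketch only shows where the handling would go.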
But what if you need to read a very large file, such as a 4GB file? Because readlines() loads every line into a list, the Python program has to read the whole file into memory before it starts to perform your operation. If this is an API on your server, and several requests come in to read multiple large files, how much memory do you need: 16GB, 32GB, 64GB, just for a simple file operation?
We can do a very simple experiment in a Windows environment. First, let's create a 4GB file with the following script.
import os
size = 1024*1024*1024*4 # 4GB
with open('i-am-a-very-large-file', "wb") as f:
    f.write(os.urandom(size))
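Note that os.urandom(size) builds all 4GB of random bytes in memory before anything is written. If that is a problem on your machine, the file can be written in smaller pieces. A sketch, assuming a hypothetical helper write_random_file and an arbitrarily chosen 1 MiB default chunk size; the demo at the bottom uses a tiny throwaway file so it runs quickly.

```python
import os
import tempfile

def write_random_file(path, size, chunk_size=1024 * 1024):
    """Write `size` random bytes to `path`, at most `chunk_size` at a time."""
    remaining = size
    with open(path, 'wb') as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(os.urandom(n))
            remaining -= n

# Small demonstration: 2,500 bytes written in chunks of 1,000.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
write_random_file(demo_path, 2500, chunk_size=1000)
written = os.path.getsize(demo_path)
print(written)
os.remove(demo_path)
```

For the experiment in this post the call would be write_random_file('i-am-a-very-large-file', 1024*1024*1024*4).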
Now you have a 4GB file. Let's run the readlines() version on it and record the memory statistics using Windows Task Manager.
From the screenshot, the python process
uses 5,154,136 KB of memory, roughly 5 GB, just for reading this one file! You can clearly see the steep rise in the memory graph. (FYI, I have a total of 24 GB of memory.)
Hence, to make our solution better, we have to find a way to optimise it. If only we could read each line at the moment we actually want to use it!
This is where generators come in, and we can write the following solution.
- 4th Attempt: use yield
import os
def read_file(f_path):
    BLOCK_SIZE = 1024
    if os.path.isfile(f_path):
        # use the f_path parameter here, not the global file_path
        with open(f_path, 'rb') as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if block:
                    yield block
                else:
                    return
file_path = './i-am-a-very-large-file'
for block in read_file(file_path):
    print(block)
And let's run it and monitor the memory change.
Yay! While the console frantically prints meaningless bytes, the memory usage is extremely low compared to the previous version, only 2,928 KB in total! And it's an absolutely flat line in the memory graph!
Why is it so light on memory? The secret is the yield keyword in our solution.
To understand how yield works, we need to know the concept of a generator. For a clear and concise explanation, check out What does the “yield” keyword do? on StackOverflow.
As a quick summary, yield turns read_file into a generator function. Calling read_file(file_path) does not run the function body; it returns a generator object. Each time the for loop asks that generator for the next value, the body runs until it reaches yield block, hands back one block of bytes, and pauses there until the next value is requested.
So the for loop over read_file(file_path) pulls one block at a time, and each step only needs enough memory to hold that one block.
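The pause-and-resume behaviour can be seen by driving a generator by hand with next(). This is a sketch on a made-up 10-byte file with a 4-byte block size; read_blocks is a hypothetical, simplified variant of read_file without the existence check.

```python
import os
import tempfile

def read_blocks(path, block_size=4):
    """Yield successive chunks of at most `block_size` bytes from `path`."""
    with open(path, 'rb') as f:
        while True:
            block = f.read(block_size)
            if not block:
                return
            yield block

# Hypothetical 10-byte file, created only for this demonstration.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, 'wb') as out:
    out.write(b'0123456789')

gen = read_blocks(demo_path)  # nothing has been read yet
first = next(gen)             # runs the body up to the first yield
second = next(gen)            # resumes after the yield and reads the next block
rest = list(gen)              # drains the remaining blocks; the file is closed

print(first, second, rest)
os.remove(demo_path)
```

Here first is b'0123', second is b'4567' and rest is [b'89']: at no point does the generator hold more than one block.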
So that's how to open a file in Python like a pro, taking into account the extreme cases (which are actually quite common if your service is performance critical). I hope you enjoyed this blog post; share your ideas here!
You can grab the demo source code on GitHub here: ZhiyueYi/how-to-open-a-file-in-python-like-a-pro-demo
Updates on 7 July
Thanks to @Vedran Čačić's comments, I learnt about some better solutions.
If we have a very large text file, we can simply use
- 5th Attempt
with open(path) as file:
    for line in file:
        print(line)
And it's totally OK.
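As suggested in the comments at the end of this post, it can help to state the text encoding explicitly rather than rely on the platform default. A sketch assuming the file is UTF-8; the file content is made up for the demo.

```python
import os
import tempfile

# Hypothetical UTF-8 text file, created only for this demonstration.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with open(demo_path, 'w', encoding='utf-8') as out:
    out.write('first line\nsecond line: naïve café\n')

lines = []
# Iterating over the file object reads one line at a time.
with open(demo_path, encoding='utf-8') as file:
    for line in file:
        lines.append(line.rstrip('\n'))

print(len(lines))
os.remove(demo_path)
```

Only one line is held at a time inside the loop; the list here exists just so the result can be inspected.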
Or if we really want to process a binary file in chunks (as I did in the 4th Attempt, 1024 bytes per block), we can also use BufferedReader from the built-in io library to achieve the same thing, like this:
- 6th Attempt
import io
file_path = './i-am-a-very-large-file'
BLOCK_SIZE = 1024
with open(file_path, 'rb') as f:
    # closefd=False: the with statement already owns and closes the descriptor
    fi = io.FileIO(f.fileno(), closefd=False)
    fb = io.BufferedReader(fi)
    while True:
        block = fb.read(BLOCK_SIZE)
        if block:
            print(block)
        else:
            break
And now I realise that the method in the 4th Attempt just dumbly re-invents the wheel (BufferedReader). LOL, knowledge is power.
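In fact, open() in 'rb' mode with default buffering already returns an io.BufferedReader, so the extra wrapping may not be needed at all. A sketch that checks this and reads in blocks using the two-argument form of iter(); the 10-byte file and the 4-byte block size are made up to keep the demo short.

```python
import io
import os
import tempfile
from functools import partial

BLOCK_SIZE = 4  # tiny block size so the demo output stays short

# Hypothetical 10-byte file, created only for this demonstration.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, 'wb') as out:
    out.write(b'0123456789')

with open(demo_path, 'rb') as f:
    is_buffered = isinstance(f, io.BufferedReader)
    # iter(callable, sentinel) keeps calling f.read(BLOCK_SIZE)
    # until it returns the sentinel b'' at end of file.
    blocks = list(iter(partial(f.read, BLOCK_SIZE), b''))

print(is_buffered, blocks)
os.remove(demo_path)
```

On a real large file you would loop over the iterator instead of collecting it into a list, so that only one block is in memory at a time.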
Summary Here
So what's the lesson learnt besides opening a file?
I think, firstly, do not be afraid to share your ideas even if they are not perfect (like what I did), and do not be afraid to admit your mistakes. Share more and interact more; we can then gain insights from others and improve ourselves. Cheers~
Top comments (10)
Or simply
Don't reinvent the wheel.
And why do you think swallowing the exception of a non-existent file is a good idea? Read that Zen again. Errors should never pass silently. (Not to mention a race condition with your code.)
It would be more useful if you mentioned local file encoding, and utf8 as a new sensible default. That would really be "like a pro".
Thanks for sharing these nice suggestions about encoding and exception handling! Definitely worth exploring more! I’m still quite new to Python and there is still a lot to learn.
As for examples of reading large files, probably this is the case: imagine you have a server with only one API to process a file, under a load of thousands of QPS. Though only files of a few MB are processed each time, with thousands of requests coming in it adds up to much greater memory consumption. Not to mention servers with more functionality.
I hope I could have a real-life example for you but currently I don’t :(
It's not a reason for ":(", it's a reason for ":)". Because it means you can write normal-looking easy code and it will work.
Your example, even though fictional, has nothing to do with my comment: if you have to process files as a whole, then you do, and no amount of black magic will help you. If you don't, then the question is whether it's a text or a binary file, as I said. And then you should use line or block buffering as needed.
If you're really strapped for memory, the first optimization I'd suggest is not using Python. Python is so dynamic that common data structures easily take up many times more memory than in "normal" languages with value semantics.
I think I got what you mean here.
is fast and low memory consumed (Just learnt it from you and tried by myself, thanks)
And I agree with you that we should write the code as simple as possible in most cases, because having black magic here makes code less readable.
But I still think this technique is worth mentioning and good to know, in case somebody needs it for some extreme cases.
Like what? Like "I have a binary file 4GiB in size and I'm just gonna spit it to stdout 1KiB at a time"? Not to mention that you don't do any decoding at all, so bytes objects are written to your screen raw, which isn't what you want, no matter the usecase. And not to mention "if it doesn't exist in the moment I check, I won't do anything, even though it might exist later when I'd actually try to read from it"?
Sorry, I know you're trying to salvage your post, but "like a pro" doesn't mean that. A pro should know the terrain of possibilities they might encounter, and this is something you won't encounter. Ever. If you do, I'll eat my hat. :-P
Now, if you actually need to process a binary file in chunks (not "pretend it's text and write it on the screen"), that's what block buffering is for. Learn to use it. docs.python.org/3/library/io.html#... You're in fact implementing another buffer on top of a builtin one, which really doesn't help your memory nor your speed.
You are right. Thanks a lot for these helpful comments. It’s definitely a good lesson learnt
Let me just tell you one more thing. I do this all the time around the Net (explainxkcd.com/wiki/index.php/SIW...). Usually people stick to their guns and refuse to admit they are wrong. DEV is the only community where people thank me for correcting them. Kudos for that! B-)
Nobody is perfect. We are all learning to be better :D Though the post itself is not good enough, at least we had a meaningful conversation here and I get a better solution.
Your code is not perfect either, what if a single line is a few GB?
You just made the assumption that each line will be small in size which might not always be the case.
No code is perfect, especially in Python. :-) But if we are more explicit about what exactly we are doing, we can produce code that is robust and good enough. For a start, is your file textual or binary? They are not the same, although on various UNIXes you can often pretend they are. [Text files are sequences of characters, binary files are sequences of bytes. Bytes aren't characters, characters aren't bytes.]
From the context, I realized you're probably talking about text files, although the "gimme bunch of random bits" is just wrong there. And for a good reason: in 33 years of working with computers in all forms, I never had to read a text file whose line didn't fit in memory. Have you? It's an honest question.