If you want to use Python to work with the directory tree or files on your system, there are many tools to help, including Python’s standard os module. The following is a simple recipe for finding files on your system by file extension.
If you have ever “lost” a file, where you no longer remember its location or even its name but you do remember its type, this recipe may be useful.
In a way this recipe is a combination of How to Traverse a Directory Tree and Recursive Directory Traversal in Python: Make a list of your movies!, but we’ll tweak it a bit and build upon it in part two.
To script this task, we can use the walk function in the os.path module (Python 2.x only; it was removed in Python 3) or the walk function in the os module, which is what we use for Python 3.x.
Recursion with os.path.walk in Python 2.x
The os.path.walk function takes 3 arguments, in this order:
top– the top of the directory tree to walk.
visit– a function to execute upon each iteration.
arg– an arbitrary (but mandatory) argument.
It then walks through the directory tree under top, calling the visit function at every step. Let’s examine the function (which we’ll define as “step”) that prints the path names of the files under top that have the file extension we provide through arg.
Here is the definition of step:
def step(ext, dirname, names):
    ext = ext.lower()

    for name in names:
        if name.lower().endswith(ext):
            print os.path.join(dirname, name)
Now let’s break it down line by line. First, though, it’s very important to point out that the arguments given to step are passed directly by the os.path.walk function, not by the user. The three arguments that walk passes on each iteration are:
ext– the arbitrary argument given to os.path.walk.
dirname– the directory name for that iteration.
names– the names of all entries (files and subdirectories) in dirname.
The first line of our step function is of course the declaration of the function, with the three arguments that will be passed directly by os.path.walk. The second line ensures our ext string is lowercase. The third line begins our loop over the names argument, which is a list. The fourth line is how we pick out the names with the extension we want, using the string method endswith to test for a suffix. The final line prints the path of any file that passes the suffix (extension) test, joining the dirname argument to the name (with the appropriate system-dependent separator).
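The suffix test and the path join described above can be tried on their own. The following is a minimal sketch; the helper name has_extension and the sample file names are hypothetical and only serve as illustration.

```python
import os.path


def has_extension(name, ext):
    """Return True if name ends with ext, ignoring case."""
    return name.lower().endswith(ext.lower())


# Hypothetical file names, used only for illustration
names = ['README.TXT', 'notes.txt', 'setup.py', 'mytxt']

# Including the dot in the extension keeps 'mytxt' from matching
matches = [n for n in names if has_extension(n, '.txt')]
print(matches)

# os.path.join inserts the platform's separator between the parts
print(os.path.join('docs', 'notes.txt'))
```

Lowercasing both sides is what makes README.TXT match a search for .txt.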
Now after combining our step function with the walk function, the script looks something like this:
# We only need to import this module
import os.path

# The top argument for walk. The
# Python27/Lib/site-packages folder in my case
topdir = '.'

# The arg argument for walk, and subsequently ext for step
exten = '.txt'

def step(ext, dirname, names):
    ext = ext.lower()

    for name in names:
        if name.lower().endswith(ext):
            print(os.path.join(dirname, name))

# Start the walk
os.path.walk(topdir, step, exten)
On my system, wx_py is installed in the site-packages folder for Python 2.7, so the output looks like this:
.\README.txt
.\wx-2.8-msw-unicode\docs\CHANGES.txt
.\wx-2.8-msw-unicode\docs\MigrationGuide.txt
.\wx-2.8-msw-unicode\docs\README.win32.txt
......
.\wx-2.8-msw-unicode\wx\tools\XRCed\TODO.txt
Recursion with os.walk in Python 3.x
Now let’s do the same using Python 3.x. The os.walk function works differently from os.path.walk and provides a few more options. It takes 4 arguments, and only the first is mandatory. The arguments (and their default values) in order are:
top– the root of the directory to walk.
topdown(=True)– boolean designating top-down or bottom-up walking.
onerror(=None)– name of a function to call if an error occurs.
followlinks(=False)– boolean designating whether or not to follow symbolic links.
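Although the scripts in this recipe only pass top, the optional arguments can be sketched as follows. The handler name report_error and the throwaway directory tree are assumptions made for illustration; the tree is created in a temporary folder so the example is self-contained.

```python
import os
import tempfile


def report_error(error):
    # os.walk calls this with the OSError raised while listing a directory
    print('Could not read: %s' % error.filename)


# Build a tiny throwaway tree: top/a/b
top = tempfile.mkdtemp()
os.makedirs(os.path.join(top, 'a', 'b'))

# Bottom-up walk: the deepest directories are yielded first
visited = [os.path.relpath(dirpath, top)
           for dirpath, dirnames, files
           in os.walk(top, topdown=False, onerror=report_error)]
print(visited)
```

With topdown=False the top directory itself comes last, which can be handy when directories need to be processed after their contents.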
The only one we are concerned with for now is the first. Aside from the arguments, perhaps the biggest difference between the two walk functions is that os.path.walk iterates over the directory tree itself, calling our function along the way, while os.walk returns a generator. This means that os.walk only moves on to the next directory when we ask it to, and the way we will do that is with a for loop.
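To make the generator behaviour concrete, here is a small sketch that advances os.walk by hand with next() before letting a loop consume the rest. The temporary directory and the file and folder names are made up for the example.

```python
import os
import tempfile

# Throwaway tree: one file and one subdirectory under top
top = tempfile.mkdtemp()
os.mkdir(os.path.join(top, 'sub'))
open(os.path.join(top, 'notes.txt'), 'w').close()

walker = os.walk(top)  # nothing has been walked yet

# Advance one step manually; each step yields a 3-tuple
first_path, first_dirs, first_files = next(walker)
print(first_dirs, first_files)

# A for loop (or comprehension) calls next() for us until the walk is done
remaining = [dirpath for dirpath, dirnames, files in walker]
print(len(remaining))
```

The first step covers top itself, so only the sub directory is left for the loop.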
Instead of defining a separate function to call, as with step, we will put the os.walk generator at the top of the loop that went into the step function. Like the Python 2.x version, os.walk gives us 3 values on every iteration (the directory path, the directory names, and the file names), but this time they come as a 3-tuple, so we have to adjust our method accordingly. Other than that we won’t change the extension suffix test at all, so the script ends up looking something like this:
import os

# The top argument for walk
topdir = '.'

# The extension to search for
exten = '.txt'

for dirpath, dirnames, files in os.walk(topdir):
    for name in files:
        if name.lower().endswith(exten):
            print(os.path.join(dirpath, name))
Because my system’s Python32/Lib/site-packages folder contains nothing special, the output for this one ends up being just:
This works the same way for whatever the “topdir” and “exten” strings are set to. However, the script simply prints the file names to the window (in our examples the Python IDLE window), and if there are many files to print, the interpreter (or shell) window ends up many rows high, which is a pain to scroll through. In that case it is much easier to write the results to a text file we can look at anytime. We can do so easily with a with statement (as in Reading and Writing Files in Python) like so:
with open(logpath, 'a') as logfile:
    logfile.write('%s\n' % os.path.join(dirname, name))
Let’s first see how to incorporate it into the Python 2.x version of the script:
# We only need to import this module
import os.path

# The top argument for walk. The
# Python27/Lib/site-packages folder in my case.
topdir = '.'

# The arg argument for walk, and subsequently ext for step
exten = '.txt'

logname = 'findfiletype.log'

def step((ext, logpath), dirname, names):
    ext = ext.lower()

    for name in names:
        if name.lower().endswith(ext):
            # Instead of printing, open up the log file for appending
            with open(logpath, 'a') as logfile:
                logfile.write('%s\n' % os.path.join(dirname, name))

# Change the arg to a tuple containing the file
# extension and the log file name. Start the walk.
os.path.walk(topdir, step, (exten, logname))
As we can see above, not much has changed except for the third variable, logname, the tuple that step now unpacks as its first argument, and the third argument to os.path.walk. The with statement has replaced the print call. This means step has to open the log file, write to it, and close it every time it finds a matching file name; this won’t cause any errors but is a bit awkward. We must also note that because the log file is opened in append mode, an existing log file is not overwritten; new results are added to the end of it. This means if we run the script 2 or more times in a row without changing logname, the results for each run will be added to the same file, which may not be desirable.
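The difference between append mode and write mode is easy to check in isolation. This is only a sketch: the log is written to a temporary folder, and the lines written to it are placeholders rather than real search results.

```python
import os
import tempfile

logpath = os.path.join(tempfile.mkdtemp(), 'findfiletype.log')

# Two runs in append mode: the second run adds to the first
for run in range(2):
    with open(logpath, 'a') as logfile:
        logfile.write('run %d\n' % run)

with open(logpath) as logfile:
    appended = logfile.read()

# One run in write mode: the earlier contents are discarded
with open(logpath, 'w') as logfile:
    logfile.write('fresh run\n')

with open(logpath) as logfile:
    overwritten = logfile.read()

print(repr(appended))
print(repr(overwritten))
```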
The modified Python 3.x version of the script is much less awkward:
import os

# The top argument for walk
topdir = '.'

# The extension to search for
exten = '.txt'

logname = 'findfiletype.log'

# What will be logged
results = str()

for dirpath, dirnames, files in os.walk(topdir):
    for name in files:
        if name.lower().endswith(exten):
            # Save to results string instead of printing
            results += '%s\n' % os.path.join(dirpath, name)

# Write results to logfile
with open(logname, 'w') as logfile:
    logfile.write(results)
In this version the path of each found file is appended to the results string, and when the search is over, the results are written to the log file. Unlike the Python 2.x version, the log file is opened in write mode, meaning any existing log file will be overwritten. In both cases, because we didn’t specify a full path name, the log file is written to the current working directory, which is the script’s directory only if the script is run from there.
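As a variation, the search and the logging can be separated so that the log file is opened only once and results are written as they are found. This is a sketch under assumptions, not part of the original scripts: the function names find_by_extension and log_matches, the temporary directories, and the sample file names are all hypothetical.

```python
import os
import tempfile


def find_by_extension(topdir, exten):
    """Yield the path of every file under topdir whose name ends with exten."""
    exten = exten.lower()
    for dirpath, dirnames, files in os.walk(topdir):
        for name in files:
            if name.lower().endswith(exten):
                yield os.path.join(dirpath, name)


def log_matches(topdir, exten, logpath):
    """Write one matching path per line to logpath and return the count."""
    count = 0
    with open(logpath, 'w') as logfile:  # opened once, overwritten per run
        for path in find_by_extension(topdir, exten):
            logfile.write('%s\n' % path)
            count += 1
    return count


# Throwaway tree for the demo: two .txt files and one .py file
top = tempfile.mkdtemp()
os.mkdir(os.path.join(top, 'docs'))
for name in ('README.TXT', os.path.join('docs', 'notes.txt'), 'setup.py'):
    open(os.path.join(top, name), 'w').close()

# A full log path avoids depending on the current working directory
logpath = os.path.join(tempfile.mkdtemp(), 'findfiletype.log')
found = log_matches(top, '.txt', logpath)
print(found)
```

Writing as we go avoids building one large string in memory, at the cost of leaving a partial log behind if the walk is interrupted.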
With that we have a simple script to find files with a certain extension under a directory tree and log the results. In the parts that follow we’ll build upon this, adding functionality to search for multiple file types, avoid certain paths, and more.