Friday, January 16, 2015

Recursive File and Directory Manipulation in Python (Part 1)




Reposted from Python Central.

If you want to use Python to manipulate the directory tree or files on your system, there are many tools to help, including Python's standard os module. The following is a basic recipe for finding files on your system by file extension.
If you have ever "lost" a file on your system, where you can't remember its location and aren't even sure of its name but do remember its type, this recipe may be useful.
In a way this recipe is a combination of How to Traverse a Directory Tree and Recursive Directory Traversal in Python: Make a list of your movies!, but we'll tweak it a bit and build upon it in part two.
To script this task, we can use the walk function in the os.path module (Python 2.x only; it was removed in Python 3) or the walk function in the os module (used here for Python 3.x, though it is also available in Python 2.x).

 

Recursion with os.path.walk in Python 2.x

The os.path.walk function takes 3 arguments, in this order:
  1. top - the top of the directory tree to walk.
  2. visit - a function to execute upon each iteration.
  3. arg - an arbitrary (but mandatory) argument.
It then walks through the directory tree under top, calling visit once for every directory it reaches. Let's examine the function (which we'll name "step") that we use to print the path names of the files under top that have the file extension we provide through arg.
Here is the definition of step:
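A minimal sketch of step, written to match the line-by-line walkthrough below. The import and the use of os.path.join for the final line are assumptions based on that description:

```python
import os

def step(ext, dirname, names):
    ext = ext.lower()                            # compare extensions case-insensitively
    for name in names:                           # names is a list of entries in dirname
        if name.lower().endswith(ext):           # suffix (extension) test
            print(os.path.join(dirname, name))   # join with the system's separator
```

Lower-casing both ext and name means that ".TXT" and ".txt" are treated as the same extension.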
Now let's break it down line-by-line, but first it's very important to point out that the arguments given to step are passed directly by the os.path.walk function, not by the user. The three arguments that walk passes on each iteration are:
  1. ext - the arbitrary argument given to os.path.walk.
  2. dirname - the directory name for that iteration.
  3. names - the names of all entries (files and subdirectories) directly inside dirname.
The first line of our step function is of course the declaration of the function, with the three parameters that os.path.walk will pass to it.
The second line ensures our ext string is lowercase. The third line begins a loop over names, which is a list. The fourth line is how we pick out the names of files with the extension we want, using the string method endswith to test for a suffix.
The final line prints the path of any file that passes the suffix (extension) test, joining the dirname argument to the name with the appropriate system-dependent separator.
Now after combining our step function with the walk function, the script looks something like this:
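A sketch of how the combined script might look. The values given to topdir and exten are placeholders, and the hasattr guard is an addition so the file also loads under Python 3, where os.path.walk no longer exists:

```python
# Intended for Python 2.x
import os

topdir = '.'      # the top argument for walk (placeholder value)
exten = '.txt'    # the arg argument for walk, received by step as ext (placeholder)

def step(ext, dirname, names):
    ext = ext.lower()
    for name in names:
        if name.lower().endswith(ext):
            print(os.path.join(dirname, name))

# Start the walk. The guard is an assumption added for this sketch:
# os.path.walk was removed in Python 3.
if hasattr(os.path, 'walk'):
    os.path.walk(topdir, step, exten)
else:
    print('os.path.walk is not available in this Python version')
```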
On my system, which has wx_py installed in the site-packages for Python 2.7, the output looks like this:

Recursion with os.walk in Python 3.x

Now let's do the same using Python 3.x.
The os.walk function in Python 3.x works differently, providing a few more options than os.path.walk. It takes 4 arguments, and only the first is mandatory. The arguments (and their default values) in order are:
    top - the root of the directory to walk.
    topdown(=True) - boolean designating top-down or bottom-up walking.
    onerror(=None) - a function to call if an error occurs.
    followlinks(=False) - boolean designating whether or not to follow symbolic links.
The only one we are concerned with for now is the first. Aside from the arguments, perhaps the biggest difference between the two walk functions is that os.path.walk drives the whole traversal itself, calling our function at every directory, while os.walk returns a generator. This means that os.walk only advances to the next directory when we ask it to, and the way we will do that is with a for loop.
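To make the generator behaviour concrete, here is a small sketch that advances os.walk by hand with next() instead of a loop. Walking the current directory is just a convenient assumption for the demonstration:

```python
import os
import types

walker = os.walk(os.curdir)    # creating the generator does not traverse anything yet

# os.walk hands back a generator object rather than a list of results.
print(isinstance(walker, types.GeneratorType))

# Each call to next() advances the walk by exactly one directory
# and yields a 3-tuple for it.
dirpath, dirnames, filenames = next(walker)
print(dirpath)      # with top-down walking, the first directory is top itself
print(dirnames)     # subdirectories directly inside dirpath
print(filenames)    # non-directory entries directly inside dirpath
```

A for loop does the same thing, calling next() behind the scenes until the generator is exhausted.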
Instead of defining a separate function to call as with step we will write the os.walk generator into the loop that went into the step function. Like the Python 2.x version, os.walk produces 3 values we can use for every iteration (the directory path, the directory names, and the filenames), but this time they are in the form of a 3-tuple, so we have to adjust our method accordingly. Other than that we won't change the extension suffix test at all, so the script ends up looking something like this:
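A minimal sketch of such a script. The values of topdir and exten are placeholders, and exten is assumed to be written in lowercase because only the file name is lower-cased here:

```python
# Python 3.x
import os

topdir = '.'      # the directory to start walking from (placeholder value)
exten = '.txt'    # the extension to search for, in lowercase (placeholder value)

# os.walk yields one (dirpath, dirnames, files) 3-tuple per directory.
for dirpath, dirnames, files in os.walk(topdir):
    for name in files:
        if name.lower().endswith(exten):
            print(os.path.join(dirpath, name))
```

Unlike the names list that os.path.walk passes to step, files contains only non-directory entries, so a directory whose name happens to end in the extension will not be printed.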
Because my system's Python32/Lib/site-packages folder contains nothing special, the output for this one ends up being just:
This will work the same way for whatever the "topdir" and "exten" strings are set to. However, this script simply prints the filenames to the window (in our examples the Python IDLE window), and if there are many files to print, the interpreter (or shell) window ends up many rows long, which is a pain to scroll through. If we know that this is the case, it would be much easier to write the results to a text file we can look at anytime. We can do so easily if we incorporate a with statement (as in Reading and Writing Files in Python) like so:
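As a reminder of the pattern, here is a minimal sketch of writing to a file inside a with statement. The temporary directory and file name are hypothetical, chosen only to keep the demonstration out of the way:

```python
import os
import tempfile

# Hypothetical log location, used only for this demonstration.
logname = os.path.join(tempfile.mkdtemp(), 'demo.log')

# The file is open only inside the with block and is closed
# automatically when the block ends, even if an error occurs.
with open(logname, 'w') as logfile:
    logfile.write('example line\n')

print(logfile.closed)
```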
Let's see first how to incorporate it into the Python 2.x version of the script:
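One way this might look, as a sketch rather than a definitive listing. The log file name is hypothetical, passing the extension and log name together as a tuple is an assumption, and the hasattr guard is again only there so the file loads under Python 3:

```python
# Intended for Python 2.x
import os

topdir = '.'                    # placeholder value
exten = '.txt'                  # placeholder value
logname = 'findfiletype.log'    # hypothetical log file name

def step(arg, dirname, names):
    ext, logpath = arg          # arg is assumed to be an (extension, log path) tuple
    ext = ext.lower()
    for name in names:
        if name.lower().endswith(ext):
            # Append mode: the log is reopened for every match
            # and existing contents are never truncated.
            with open(logpath, 'a') as logfile:
                logfile.write('%s\n' % os.path.join(dirname, name))

# The third argument now carries both the extension and the log name.
if hasattr(os.path, 'walk'):
    os.path.walk(topdir, step, (exten, logname))
```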
As we can see above, not much has changed except for the third variable, logname, and the third argument to os.path.walk. The with statement has replaced the print statement. Because of the way the os.path.walk function works, step has to open the log file, write to it, and close it every time it finds a matching file name; this won't cause any errors but is a bit awkward. We must also note that because the log file is opened in append mode, an existing log file is not overwritten; new results are only added to the end of it. This means that if we run the script 2 or more times in a row without changing logname, the results of each run will be added to the same file, which may not be desirable.
The modified Python 3.x version of the script is much less awkward:
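A sketch of the modified script, assuming the same placeholder values as before and a hypothetical log file name:

```python
# Python 3.x
import os

topdir = '.'                    # placeholder value
exten = '.txt'                  # placeholder value, in lowercase
logname = 'findfiletype.log'    # hypothetical log file name

results = ''    # accumulates one line per matching file

for dirpath, dirnames, files in os.walk(topdir):
    for name in files:
        if name.lower().endswith(exten):
            results += '%s\n' % os.path.join(dirpath, name)

# Write everything in one go once the walk has finished.
# Mode 'w' truncates any existing log file of the same name.
with open(logname, 'w') as logfile:
    logfile.write(results)
```

Because the walk is an ordinary loop in our own code, the log file only has to be opened once, after all results have been collected.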
In this version the name of each found file is appended to the results string, and then when the search is over, the results are written to the log file. Unlike the Python 2.x version, the log file is opened in write mode, meaning any existing log file will be overwritten. In both cases the log file will be written in the same directory as the script (because we didn't specify a full path name).
With that we have a simple script to find files of a certain extension under a file tree and log those results. In the parts that follow we'll build upon this, adding functionality to search for multiple file types, avoid certain paths, and more.
