In Part 1 we looked at how to use os.path.walk and os.walk to find and list files of a certain extension under a directory tree. The former function is only present in Python 2.x, while the latter is available in both Python 2.x and Python 3.x. As we saw in the previous article, os.path.walk can be awkward to use, so from now on we'll stick to os.walk; this way the script will be simpler and compatible with both branches.

In Part 1 our script traversed all the folders under the topdir variable, but only found files of one extension. Let's now expand that to find files of multiple extensions in select folders under the topdir path. We'll first search for files of three different file extensions: .txt, .pdf, and .doc. Our extens variable will be a list of strings instead of a single string. The "." character is not included in these strings as it was in the ext variable before, and we'll see why shortly. In order to save the results (the file names) we'll use a dictionary with the extensions as keys.

[...] the topdir variable will become that path.

Previously we tested for the extension with the str.endswith method. If we were to use it again we'd have to loop through the extension list and test with endswith for every file name, but instead we'll use a slightly different approach. For each file stepped on during the walk we'll extract the extension and then test for membership in extens. We'll extract it with a single line operating on name, a file name returned from os.walk.
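The setup and the extraction step described above can be sketched as follows. This is a minimal illustration rather than the original listing: building found with a dict comprehension and wrapping the extraction in a helper called get_ext are both assumptions made here for readability.

```python
# Extensions to search for: lowercase, without the leading "."
extens = ['txt', 'pdf', 'doc']

# One (initially empty) list of results per extension
found = {x: [] for x in extens}


def get_ext(name):
    """Return the lowercased text after the last "." in name.

    Hypothetical helper for illustration; the same expression can be
    written inline inside the walk loop.
    """
    return name.lower().rsplit('.', 1)[-1]


print(get_ext('Report.PDF'))      # pdf
print(get_ext('archive.tar.gz'))  # gz
print(get_ext('README'))          # readme (no ".", so the whole name comes back)
```

Because rsplit returns a one-element list when there is nothing to split on, indexing with [-1] is safe for every file name.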
With this line we combine three operations: changing the case of the file name, splitting it, and extracting an element. Calling str.lower on the file name changes it to lowercase, the same as all the strings in extens. Calling str.rsplit on name then splits the string into a list (from the right), with the first argument (".") as the delimiter and making at most as many splits as the second argument (1). The third part ([-1]) retrieves the last element of the list. We use this instead of an index of 1 so that if no splits are made (if there is no "." in name), no IndexError will be raised.

Now that we've extracted the extension of name (if any), we can test to see if it's in our list of extensions. This is why a "." doesn't precede any of the extension names in extens: ext won't ever have one. If the condition is true, we'll add the name found to our found dictionary, adding the full path (dirpath joined to name returned from os.walk) to the list at the ext key in found.
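Put together, the walk loop described above might look like the sketch below. Wrapping it in a function named find_files is an assumption made here so the example is self-contained; the variable names dirpath, name, ext, extens, and found follow the article.

```python
import os


def find_files(topdir, extens):
    """Walk topdir and group matching file paths by extension."""
    found = {x: [] for x in extens}
    for dirpath, dirnames, files in os.walk(topdir):
        for name in files:
            # Lowercase, split once from the right, keep the last piece
            ext = name.lower().rsplit('.', 1)[-1]
            if ext in extens:
                # Store the full path under the matching extension key
                found[ext].append(os.path.join(dirpath, name))
    return found


if __name__ == '__main__':
    results = find_files('.', ['txt', 'pdf', 'doc'])
    for ext in sorted(results):
        print('%s: %d file(s)' % (ext, len(results[ext])))
```

One side effect of this approach is worth knowing: a file literally named txt, with no "." at all, would also be counted as a match, because its whole lowercased name is treated as the extension.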
Now that we have changed the search extensions and the structure of the results, we also have to adjust how we save the results to our log file.

In the previous version (using os.walk) we simply opened a file at logname and wrote the results to the file. In this version we must loop through multiple categories in the results, one for each extension. We'll concatenate each result list in found to our results string, which we'll now identify as logbody. We'll also add a small header to the log file, loghead. Once logbody is complete, we can write our log file. It may be necessary to set the open mode to wb and decode loghead and logbody (or encode if in Python 3.x) in order to save the logfile successfully.

Now we are finally ready to test our script. Running it on my system yields a log file of everything found.
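The log-building step described above could be sketched as follows. The header wording follows the "Search log from filefind for files in ..." line quoted in this article; the per-extension section titles, the helper names build_log and write_log, and the log file name findfiles.log are assumptions made for this example.

```python
import os
import tempfile


def build_log(found, topdir):
    """Return (loghead, logbody) for a dict mapping extension -> paths."""
    loghead = 'Search log from filefind for files in %s\n\n' % topdir
    logbody = ''
    for ext in sorted(found):
        # Section title format is an assumption made for this sketch
        logbody += '<< Results with the extension "%s" >>\n' % ext
        # Concatenate this extension's result list, one path per line
        logbody += '\n'.join(found[ext]) + '\n\n'
    return loghead, logbody


def write_log(logname, loghead, logbody):
    """Write the header and then the body to the file at logname."""
    with open(logname, 'w') as logfile:
        logfile.write(loghead)
        logfile.write(logbody)


if __name__ == '__main__':
    sample = {'pdf': ['docs/manual.pdf'], 'txt': ['a.txt', 'b.txt'], 'doc': []}
    head, body = build_log(sample, '.')
    # Hypothetical log name, written to a temporary folder for the demo
    logname = os.path.join(tempfile.mkdtemp(), 'findfiles.log')
    write_log(logname, head, body)
    with open(logname) as logfile:
        print(logfile.read())
```

An extension with no matches still gets its section title followed by an empty block, which makes it obvious in the log that the search ran and simply found nothing.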
In the C:\Python27\Lib\site-packages directory there are a few PDF files, many text files, and no ".doc" or Word files. It seems to work fine, and the extension search list can be changed easily, but what if we don't want to search in the "docs" directory under the wx-2.8-msw-unicode tree? After all, we know there will probably be lots of text files in there. We can ignore this directory by modifying the dirnames list in-place in the main walk loop. Because we might want to ignore more than one directory, we'll keep a list of them, ignore, defined before the loop. Inside the loop we then remove those names from dirnames in-place, so that the next iteration of the walk loop will no longer include the folders named in ignore.

Running the full script with the new walk loop produces a new log, headed "Search log from filefind for files in C:\Python27\Lib\site-packages", that no longer lists the files in the "docs" directory under wx-...-unicode.
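The pruning step can be sketched as below. This is not the original script: the function name find_files and the use of slice assignment to filter dirnames are choices made for this example (removing each ignored name with dirnames.remove would work equally well). The ignore list uses the two directory names mentioned in the article.

```python
import os


def find_files(topdir, extens, ignore):
    """Walk topdir, skipping any directory whose name is in ignore."""
    found = {x: [] for x in extens}
    for dirpath, dirnames, files in os.walk(topdir):
        # Slice assignment changes the very list object os.walk holds,
        # so the pruned folders are never descended into
        dirnames[:] = [d for d in dirnames if d not in ignore]
        for name in files:
            ext = name.lower().rsplit('.', 1)[-1]
            if ext in extens:
                found[ext].append(os.path.join(dirpath, name))
    return found


if __name__ == '__main__':
    ignore = ['docs', 'doc']  # directory names to skip
    results = find_files('.', ['txt', 'pdf', 'doc'], ignore)
    for ext in sorted(results):
        print('%s: %d file(s)' % (ext, len(results[ext])))
```

Pruning this way only works when os.walk runs top-down, which is its default. Rebinding the name instead (dirnames = [...]) would have no effect, because os.walk would still be looking at the original list.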
We can also see that the other ignore directory ("doc") cut out the other two PDF files from our PDF results, and for both directories we didn't need to name the full path (because the names in dirnames are never full paths anyway). This can be convenient, but always remember that this method will prune any part of the tree whose directory name matches one in the ignore list. To avoid this, try using dirpath and dirnames together to specify full paths to ignore, if you don't mind going through the trouble of naming the full path!

Now that we've completed this version of our file/directory manipulation script, we can quickly search for multiple file extensions under any tree and have a record of all those found with just a double-click. This is great if we simply want to know where all the files are, but since they likely will not all be in the same folder, looking through each line of the log file is not ideal if we want to move or copy them all to the same folder, or do something else with all of them at once. This is why in the next part we'll look at how to upgrade our script to move, copy/backup, or alternatively erase all the files we are looking for.