my little note: Recursive File and Directory Manipulation in Python (Part 2)

Repost from Python Central.

In Part 1 we looked at how to use the os.path.walk and os.walk functions to find and list files of a certain extension under a directory tree. The former is only present in Python 2.x, while the latter is available in both Python 2.x and Python 3.x. As we saw in the previous article, os.path.walk can be awkward to use, so from now on we'll stick to os.walk; this way the script will be simpler and compatible with both branches.
In Part 1 our script traversed all the folders under the topdir variable, but only found files of one extension. Let's now expand that to find files of multiple extensions in select folders under the topdir path. We'll first search for files of three different file extensions: .txt, .pdf, and .doc. Our extens variable will be a list of strings instead of a single string:

extens = ['txt', 'pdf', 'doc']

The . character is not included in these strings as it was in the ext variable before, and we'll see why shortly. In order to save the results (the file names) we'll use a dictionary with the extensions as keys:

# Dict comprehension form of instantiation
found = {x: [] for x in extens}
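The comprehension matters here because each key needs its own list. A minimal sketch of the difference, assuming one might be tempted to use dict.fromkeys instead (the name shared and the sample file name are only for illustration):

```python
extens = ['txt', 'pdf', 'doc']

# Dict comprehension: a fresh list is created for every key.
found = {x: [] for x in extens}
found['txt'].append('notes.txt')

# dict.fromkeys reuses the *same* list object for every key,
# so an append through one key shows up under all of them.
shared = dict.fromkeys(extens, [])
shared['txt'].append('notes.txt')

print(found)
print(shared)
```

With the comprehension only the 'txt' list grows; with dict.fromkeys all three keys appear to contain the file, which is not what we want for a per-extension result list.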

The other variables will remain the same for now; however, the script file itself will be placed in (and will execute from) my system's “Documents” folder, so the topdir variable will become that path.
Previously we tested for the extension with the str.endswith method. If we were to use it again we'd have to loop through the extension list and test with endswith for every file name, but instead we'll use a slightly different approach. For each file stepped on during the walk we'll extract the extension and then test for membership in extens. Here's how we'll extract it:

for name in files:
    # Split the name by '.' & get the last element
    ext = name.lower().rsplit('.', 1)[-1]

As with the previous part, we put this line inside the for loop that iterates over the files list returned by os.walk. With this line we combined three operations: changing the case of the file name, splitting it, and extracting an element. Calling str.lower on the file name changes it to lowercase, the same as all the strings in extens. Calling str.rsplit on the result then splits the string into a list (from the right), with the first argument (.) as the delimiter and making at most as many splits as the second argument (1). The third part ([-1]) retrieves the last element of the list. We use -1 instead of an index of 1 because if no splits are made (if there is no . in name), the list has only one element: index 1 would raise an IndexError, while -1 simply returns the whole name.
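To make the behaviour concrete, here is a small sketch comparing the rsplit approach with the standard library's os.path.splitext. The helper names ext_by_rsplit and ext_by_splitext and the sample file names are made up for this illustration.

```python
import os.path

def ext_by_rsplit(name):
    """Extension as extracted in this article (lowercased, no dot)."""
    return name.lower().rsplit('.', 1)[-1]

def ext_by_splitext(name):
    """Alternative using os.path.splitext; '' when there is no extension."""
    return os.path.splitext(name)[1].lower().lstrip('.')

for name in ['Report.PDF', 'archive.tar.gz', 'README', '.bashrc']:
    print('%-15s %-10r %r' % (name, ext_by_rsplit(name), ext_by_splitext(name)))
```

Both agree on ordinary names, but for a name without a dot the rsplit version returns the whole lowercased name. A file literally called "txt" would therefore be counted as a text file, while splitext reports no extension for it.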
Now that we've extracted the extension of name (if any), we can test to see if it's in our list of extensions:

if ext in extens:

This is why . doesn't precede any of the extension names in extens, because ext won't ever have one. If the condition is true, we'll add the name found to our found dictionary:

if ext in extens:
    found[ext].append(os.path.join(dirpath, name))

The above line will append the result path (dirpath joined to name returned from os.walk) to the list at the ext key in found. Now that we have changed the search extensions and list of results we also have to adjust how to save the results to our log file.
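To see the grouping step in isolation, here is a minimal sketch that builds a throwaway directory with tempfile and collects the matching names. The function name group_by_extension, the variable demo_dir, and the sample file names are assumptions for illustration and are not part of the original script.

```python
import os
import shutil
import tempfile

def group_by_extension(topdir, extens):
    """Walk topdir and return {extension: [paths]} for the given extensions."""
    found = {x: [] for x in extens}
    for dirpath, dirnames, files in os.walk(topdir):
        for name in files:
            ext = name.lower().rsplit('.', 1)[-1]
            if ext in found:
                found[ext].append(os.path.join(dirpath, name))
    return found

# Build a small sample tree (hypothetical file names).
demo_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_dir, 'sub'))
for rel in ['a.txt', 'b.PDF', 'c.jpg', os.path.join('sub', 'd.txt')]:
    open(os.path.join(demo_dir, rel), 'w').close()

result = group_by_extension(demo_dir, ['txt', 'pdf', 'doc'])
for ext in sorted(result):
    print('%s: %s' % (ext, sorted(os.path.relpath(p, demo_dir) for p in result[ext])))

# Remove the sample tree again.
shutil.rmtree(demo_dir)
```

The .jpg file is skipped, b.PDF lands under 'pdf' thanks to the lowercasing, and 'doc' stays an empty list.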
In the previous version (using os.walk) we simply opened a file at logname and wrote the results to the file. In this version we must loop through multiple categories in the results, one for each extension. We'll concatenate each result list in found to our results string, which we'll now identify as logbody. We'll also add a small header to the logfile, loghead:

# The header in our logfile
loghead = 'Search log from filefind for files in {}\n\n'.format(os.path.realpath(topdir))

# The body of our log file
logbody = ''

# Loop through results
for search in found:
    # Concatenate the result from the found dict
    logbody += "<< Results with the extension '%s' >>" % search
    # Use str.join to turn the list at search into a str
    logbody += '\n\n%s\n\n' % '\n'.join(found[search])
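The same body can also be assembled from a list of sections, which is one way to keep the formatting in a single place. This is only a sketch: build_logbody is a hypothetical helper, the sample paths are invented, and sorting the keys is an added assumption to get a stable section order (the loop above uses whatever order the dictionary yields).

```python
def build_logbody(found):
    """Return the log body for a {extension: [paths]} dictionary."""
    sections = []
    # sorted() gives a stable section order regardless of dict ordering
    for search in sorted(found):
        header = "<< Results with the extension '%s' >>" % search
        sections.append('%s\n\n%s\n\n' % (header, '\n'.join(found[search])))
    return ''.join(sections)

# Hypothetical results, shaped like the found dictionary
sample = {'txt': ['./README.txt', './soup.txt'], 'pdf': ['./GPL_Full.pdf'], 'doc': []}
print(build_logbody(sample))
```

An extension with no results still gets its header, followed by blank lines, just as in the loop version.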

The format of the results can be whatever you like, but it is important that we loop through all of the results to get the full log. After the logbody is complete, we can write our log file:

# Write results to the logfile
with open(logname, 'w') as logfile:
    logfile.write('%s\n%s' % (loghead, logbody))

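Note: if any names/paths in the results contain non-ASCII characters, writing the log with the platform's default encoding may fail. In Python 3.x we can either pass an explicit encoding to open or change the open mode to wb and encode loghead and logbody before writing; in Python 2.x, unicode strings likewise have to be encoded before they are written.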
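One way to handle this on both branches is io.open, which accepts an encoding argument in Python 2.6+ as well as 3.x. The following is a sketch under that assumption, with invented sample paths and a temporary log location rather than the script's real logname:

```python
import io
import os
import tempfile

# Hypothetical results containing non-ASCII characters
loghead = u'Search log from filefind for files in ./caf\u00e9\n\n'
logbody = u"<< Results with the extension 'txt' >>\n\n./caf\u00e9/r\u00e9sum\u00e9.txt\n\n"

logname = os.path.join(tempfile.mkdtemp(), 'findfiletypes.log')

# Write the log as UTF-8 text instead of relying on the default encoding
with io.open(logname, 'w', encoding='utf-8') as logfile:
    logfile.write(u'%s\n%s' % (loghead, logbody))

# Read the file back with the same encoding to confirm the round trip
with io.open(logname, encoding='utf-8') as logfile:
    text = logfile.read()

print(text == u'%s\n%s' % (loghead, logbody))
```

UTF-8 is chosen here only as a common default; any encoding that can represent the file names would do, as long as the log is read back with the same one.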
Now we are finally ready to test our script. Running it on my system yields this log file (shortened):

Search log from filefind for files in C:\Python27\Lib\site-packages

<< Results with the extension 'pdf' >>

.\GPL_Full.pdf
.\beautifulsoup4-4.1.3\doc\rfc2425-v2.1.pdf
.\beautifulsoup4-4.1.3\doc\rfc2426-v3.0.pdf

<< Results with the extension 'txt' >>

.\README.txt
.\soup.txt
.\beautifulsoup4-4.1.3\AUTHORS.txt
.\beautifulsoup4-4.1.3\COPYING.txt
...
.\wx-2.8-msw-unicode\docs\CHANGES.txt
.\wx-2.8-msw-unicode\docs\MigrationGuide.txt
.\wx-2.8-msw-unicode\docs\README.win32.txt
...
.\wx-2.8-msw-unicode\wx\tools\XRCed\TODO.txt

<< Results with the extension 'doc' >>

This log tells us that in the C:\Python27\Lib\site-packages directory there are a few PDF files, many text files, and no ".doc" or Word files. It seems to work fine, and the extension search list can be changed easily, but what if we don't want to search in the "docs" directory under the wx-2.8-msw-unicode tree? After all, we know there will probably be lots of text files in there. We can ignore this directory by modifying the dirnames list in-place in the main walk loop. Because we might want to ignore more than one directory, we'll keep a list of them (this will come before the loop of course):

# Directories to ignore
ignore = ['docs', 'doc']

Now that we have the list, we'll add this small loop inside the main walk loop (and before the loop over the file names):

# Remove directories in ignore
# Directory names must match exactly!
for idir in ignore:
    if idir in dirnames:
        dirnames.remove(idir)

This will edit dirnames in-place, so that os.walk will not descend into the folders named in ignore on later iterations of the walk loop (this works because os.walk goes top-down by default). The full script with the new walk loop now looks like this:

import os

# The top argument for walk
topdir = '.'

extens = ['txt', 'pdf', 'doc']  # the extensions to search for
found = {x: [] for x in extens}  # lists of found files

# Directories to ignore
ignore = ['docs', 'doc']

logname = "findfiletypes.log"

print('Beginning search for files in %s' % os.path.realpath(topdir))

# Walk the tree
for dirpath, dirnames, files in os.walk(topdir):
    # Remove directories in ignore
    # Directory names must match exactly!
    for idir in ignore:
        if idir in dirnames:
            dirnames.remove(idir)

    # Loop through the file names for the current step
    for name in files:
        # Split the name by '.' & get the last element
        ext = name.lower().rsplit('.', 1)[-1]

        # Save the full name if ext matches
        if ext in extens:
            found[ext].append(os.path.join(dirpath, name))

# The header in our logfile
loghead = 'Search log from filefind for files in {}\n\n'.format(
    os.path.realpath(topdir)
)

# The body of our log file
logbody = ''

# Loop through results
for search in found:
    # Concatenate the result from the found dict
    logbody += "<< Results with the extension '%s' >>" % search
    logbody += '\n\n%s\n\n' % '\n'.join(found[search])

# Write results to the logfile
with open(logname, 'w') as logfile:
    logfile.write('%s\n%s' % (loghead, logbody))
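The pruning step in the walk loop can be tried in isolation. The sketch below uses a throwaway tree and slice assignment (dirnames[:] = ...), a common alternative to calling remove in a loop. Rebinding the name with dirnames = [...] would not prune anything, because os.walk only sees changes made to the list object it handed out. The directory and file names are hypothetical.

```python
import os
import shutil
import tempfile

ignore = ['docs', 'doc']

# Sample tree: keep/a.txt, docs/b.txt, keep/doc/c.txt
top = tempfile.mkdtemp()
for d in ['keep', 'docs', os.path.join('keep', 'doc')]:
    os.makedirs(os.path.join(top, d))
for rel in [os.path.join('keep', 'a.txt'),
            os.path.join('docs', 'b.txt'),
            os.path.join('keep', 'doc', 'c.txt')]:
    open(os.path.join(top, rel), 'w').close()

visited = []
for dirpath, dirnames, files in os.walk(top):  # topdown=True is the default
    # Replace the list contents in place so os.walk skips ignored names
    dirnames[:] = [d for d in dirnames if d not in ignore]
    visited.extend(os.path.relpath(os.path.join(dirpath, f), top) for f in files)

print(visited)

# Remove the sample tree again.
shutil.rmtree(top)
```

Only keep/a.txt is visited: the top-level docs directory and the nested keep/doc directory are both pruned, at whatever depth the names occur.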

With our new ignore list, the log file turns out looking like this (shortened):

Search log from filefind for files in C:\Python27\Lib\site-packages

<< Results with the extension 'pdf' >>

.\GPL_Full.pdf

<< Results with the extension 'txt' >>

.\README.txt
.\soup.txt
.\beautifulsoup4-4.1.3\AUTHORS.txt
.\beautifulsoup4-4.1.3\COPYING.txt
...
.\beautifulsoup4-4.1.3\scripts\demonstration_markup.txt
.\wx-2.8-msw-unicode\wx\lib\editor\README.txt
...
.\wx-2.8-msw-unicode\wx\tools\XRCed\TODO.txt

<< Results with the extension 'doc' >>

Our ignore list worked just as we wanted it to, cutting out the full tree under the "docs" directory in wx-...-unicode. We can also see that the other ignored directory ("doc") cut out the other two PDF files from our PDF results. For both directories we didn't need to name the full path, because dirnames holds bare directory names rather than full paths. This can be convenient, but always remember that this method will prune any part of the tree whose directory name matches one in the ignore list. To avoid this, try using dirpath and dirnames together to specify full paths to ignore, if you don't mind going through the trouble of naming the full path.
Now that we've completed this version of our file/directory manipulation script, we can quickly search for multiple file extensions under any tree and have a record of all the files found with just a double-click. This is great if we simply want to know where all the files are. But since they likely will not all be in the same folder, looking through each line of the log file would not be practical if we wanted to move or copy them all to the same folder, or do something else with all of them at once. This is why in the next part we'll look at how to upgrade our script to move, copy/back up, or alternatively erase all the files we are looking for.
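Ignoring by full path could look roughly like the sketch below. It is not part of the original script: ignore_paths, the pkg_a and pkg_b directories, and the file names are all invented for the example. Each candidate directory is joined to dirpath and normalized before being compared with the ignore set.

```python
import os
import shutil
import tempfile

# Sample tree with two directories that share the name "docs"
top = tempfile.mkdtemp()
for d in [os.path.join('pkg_a', 'docs'), os.path.join('pkg_b', 'docs')]:
    os.makedirs(os.path.join(top, d))
    open(os.path.join(top, d, 'notes.txt'), 'w').close()

# Only this one directory is ignored; pkg_b/docs is still searched.
ignore_paths = {os.path.normpath(os.path.join(top, 'pkg_a', 'docs'))}

kept = []
for dirpath, dirnames, files in os.walk(top):
    # Compare full paths, not bare names, when deciding what to prune
    dirnames[:] = [d for d in dirnames
                   if os.path.normpath(os.path.join(dirpath, d)) not in ignore_paths]
    kept.extend(os.path.relpath(os.path.join(dirpath, f), top) for f in files)

print(kept)

# Remove the sample tree again.
shutil.rmtree(top)
```

The paths in ignore_paths have to be written relative to the same starting point as topdir (or both made absolute), otherwise the comparison will never match.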

Friday, January 16, 2015
