I'm wondering if there's an easy way to look through a particular directory to find and delete duplicate files. We have a process where an external company emails us a file every 10 minutes, and VC monitors this inbox for the file and detaches it to a specified folder. From there, a different VC job runs every 30 minutes to process whatever files are in that folder at the time. The thing is, some of these files may be duplicates, depending on the data the external company has received since the last file, so there's no need to process the duplicates.
Say there are 4 files in the folder (File 1, File 2, File 3, and File 4), sorted in ascending order of create date/time. Now say that File 2 and File 3 are both duplicates of File 1, but File 4 is different. I'd like this process to identify File 2 and File 3 as duplicates and delete them, leaving just File 1 and File 4 in the folder to process.
The main reason I'm trying to do this is that the process that currently handles these files is an MS Access database, and it takes maybe 1 minute per file. I'm redoing this process using only VC, but in doing that, the overall process becomes significantly slower. I'm only about half done with the VC process, and it's already taking about 2.5 minutes for a single file.
One of the things on my to-do list is to get this company to decrease the frequency from 10 minutes to 30 minutes (or even 60 minutes), but in the meantime I thought I'd look into this option just for learning opportunities.
What I had thought of doing is this:
- Read in a list of files in the directory.
- Loop through the files. For each file:
- -- Read in the file contents.
- -- Loop through the files again (requires a second job, since VC doesn't otherwise support nested loops).
- -- If the file name is different from the current file name in the outer loop (and the file's create date time is greater than that of the outer loop's file), read in the file contents.
- -- Compare the contents of this file with the contents of the file in the outer loop. If they're the same, make note of this file name.
- Once the outer loop has ended, I should have a list of duplicate filenames. Loop through them and delete them.
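For comparison, here's a minimal sketch of the same idea as a standalone Python script (just for illustration; I know VC isn't running Python natively, but something like this could run as an external script step). One simplification over the steps above: instead of comparing file contents pairwise in nested loops, it hashes each file's contents once and deletes any later-created file whose hash matches an earlier one. The folder path is a placeholder.

```python
import hashlib
import os

def dedupe_folder(folder):
    """Delete files whose contents duplicate an earlier-created file.

    Keeps the oldest file (by create time) for each distinct content,
    mirroring the keep-File-1 / delete-File-2-and-3 example above.
    """
    # Sort by create time ascending so the earliest file "wins".
    paths = sorted(
        (os.path.join(folder, name) for name in os.listdir(folder)),
        key=os.path.getctime,
    )
    seen = {}      # content hash -> path of first file with that content
    deleted = []
    for path in paths:
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest in seen:
            os.remove(path)        # later duplicate of seen[digest]
            deleted.append(path)
        else:
            seen[digest] = path
    return deleted

# dedupe_folder(r"C:\path\to\inbox")  # placeholder path
```

Hashing turns the O(n²) pairwise comparison into a single pass with a lookup table, which is why it avoids the nested-loop (second job) problem entirely.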
This seems overly complicated to me, so I'm wondering if maybe I'm missing something.