Identify (and delete) duplicate files? - VisualCron

jrtwynam
Free support Topic Starter

2019-06-19T19:15:34Z

Hi,

I'm wondering if there's an easy way to look through a particular directory to find and delete duplicate files. We have a process where an external company emails us a file every 10 minutes, and VC monitors this inbox for this file and detaches it to a specified folder. From there, we have a different VC job that runs every 30 minutes to process whatever files are in that folder at the time. The thing is that it's possible that some of these might be duplicate files, depending on the data that this external company has received since the last file, so it's not necessary to process both files.

Say there are 4 files in the folder (File 1, File 2, File 3, and File 4), sorted by creation date/time in ascending order. Now, say that File 2 and File 3 are both duplicates of File 1, but File 4 is different. I'd like this process to identify File 2 and File 3 as duplicates and delete them, leaving only File 1 and File 4 in the folder to process.

The main reason I'm trying to do this is that the process that currently handles these files is an MS Access database, and it takes maybe 1 minute per file. I'm re-doing this process using only VC, but in doing that, the overall process becomes significantly slower. I'm only about half done with the VC process, and it's already taking about 2.5 minutes for a single file.

One of the things on my to-do list is to get this company to decrease the frequency from every 10 minutes to every 30 minutes (or even 60 minutes), but in the meantime I thought I'd look into this option as a learning opportunity.

What I had thought of doing is this:

Read in a list of files in the directory.
Loop through the files. For each file:
-- Read in the file contents.
-- Loop through the files again (requires a second job, since VC doesn't otherwise support nested loops).
-- If the file name is different from the current file name in the outer loop (and the file's create date time is greater than that of the outer loop's file), read in the file contents.
-- Compare the contents of this file with the contents of the file in the outer loop. If they're the same, make note of this file name.
Once the outer loop has ended, I should have a list of duplicate filenames. Loop through them and delete them.

This seems overly complicated to me, so I'm wondering if maybe I'm missing something.

Thanks.

Edited by moderator 2019-06-25T14:51:35Z | Reason: Not specified

Support
Official support

2019-06-21T17:07:39Z

#2

I do not think it is efficient to solve this in VisualCron. You should create a script for this and run that script, preferably in PowerShell or .NET.

Henrik
Support
http://www.visualcron.com
Please like VisualCron on facebook!

jrtwynam
Free support Topic Starter

2019-06-21T17:30:11Z

#3

I was starting to have that same thought. I've been trying to teach myself PowerShell lately anyway, so this might be a good mini script to do that with. Then I can just have the email trigger save the file like it already does, and add this script as the last step of the job to clean up duplicates.
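For what it's worth, a cleanup step along those lines might look roughly like the sketch below. It hashes each file once instead of comparing contents pairwise, and within each set of identical files it keeps the oldest one by creation time, as in the File 1 to File 4 example from the first post. This is only a sketch under a few assumptions: PowerShell 4.0 or later (for Get-FileHash), SHA256 as the hash, and a function name, Get-LaterDuplicate, that is made up for illustration.

```powershell
# Sketch only: Get-LaterDuplicate is a hypothetical name, not part of VisualCron or PowerShell.
function Get-LaterDuplicate {
    param([Parameter(Mandatory = $true)][string]$FolderPath)

    # Hash every file once and remember when it was created.
    $hashed = foreach ($file in Get-ChildItem -LiteralPath $FolderPath -File) {
        [pscustomobject]@{
            FullName     = $file.FullName
            CreationTime = $file.CreationTimeUtc
            Hash         = (Get-FileHash -LiteralPath $file.FullName -Algorithm SHA256).Hash
        }
    }

    # Files sharing a hash are treated as identical. Within each group,
    # skip the oldest file (the one to keep) and emit the later copies.
    $hashed | Group-Object -Property Hash | Where-Object { $_.Count -gt 1 } | ForEach-Object {
        $_.Group | Sort-Object -Property CreationTime, FullName | Select-Object -Skip 1
    }
}
```

The function only reports the later copies; deleting them is a separate step, for example by piping the result to ForEach-Object { Remove-Item -LiteralPath $_.FullName -WhatIf } and removing -WhatIf once the preview looks right.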

MRomer
Free support

2019-08-07T20:38:58Z

#4

Here's a PowerShell script that we use in some jobs to remove duplicate files. Call the script and pass the following parameters to it:
-FolderPath : The folder to work in (required; wildcards are not allowed).
-Recurse : Whether or not to recurse into subfolders. Accepts Y, Yes, N, or No; the default is No.
-Interactive : Whether or not to let the user review the changes and approve them. Accepts Y, Yes, N, or No; the default is No.
The script first builds a list of files where more than one file is the same size. It then hashes each file in that list (SHA1) so that it finds files with duplicate content, regardless of the file name or date.

# Removes duplicate files from a folder, keeping one copy of each.
Param(
	[Parameter(Position=0,Mandatory=$true,ValueFromPipeline=$true)][string]$FolderPath,
	[Parameter(Mandatory=$false)][ValidateSet('Y','Yes','N','No')][string]$Recurse = 'No',
	[Parameter(Mandatory=$false)][ValidateSet('Y','Yes','N','No')][string]$Interactive = 'No'
)
# Wildcards are rejected so that the script only ever works on one literal folder.
if ($FolderPath -match '[\*\?]') {
	Write-Output "Path '$FolderPath' invalid.  Wildcards not allowed."
	exit
}

if ( -not (Test-Path -LiteralPath $FolderPath -PathType Container) ){
	Write-Output "Path '$FolderPath' not found or is not a folder."
	exit
}

# Only files of equal size can be duplicates, so narrow the list by size first.
# -File leaves out subfolders, which have no Length and cannot be hashed.
if ($Recurse -match '^y$|^yes$') {
	$multipleSameSize = @(Get-ChildItem -LiteralPath $FolderPath -File -Recurse | Group Length | ? {$_.Count -gt 1} | Select -ExpandProperty Group | Select FullName)
}
else {
	$multipleSameSize = @(Get-ChildItem -LiteralPath $FolderPath -File | Group Length | ? {$_.Count -gt 1} | Select -ExpandProperty Group | Select FullName)
}

if ($multipleSameSize.Count -eq 0) {
	Write-Output "No duplicate files found."
	exit
}

# Hash the same-size candidates so they can be compared by content.
$fileHashes = foreach ($i in $multipleSameSize) {Get-FileHash -LiteralPath $i.FullName -Algorithm SHA1}

# Within each group of identical hashes, keep the first path in sort order and list the rest.
$dupFiles = $fileHashes | Group Hash | ? {$_.Count -gt 1} | % {$_.Group | Sort Path | Select Path -Skip 1} | % {ls $_.Path} | Select FullName,Length,LastWriteTime
# Wrap the result in an array so that .Count is dependable when only one duplicate is found.
$dupFiles = @($dupFiles)

if ($dupFiles.Count -eq 0) {
	Write-Output "No duplicate files found."
	exit
}

if ($Interactive -match '^y$|^yes$') {
	# Show the candidates in a grid; only the rows the user selects are deleted.
	$userResponse = @($dupFiles | Out-GridView -Title ($dupFiles.Count.ToString() + " files to be deleted. " `
													+ $($dupFiles | Measure-Object -Property Length -Sum).Sum.ToString("#,##0") `
													+ " bytes recoverable.") -PassThru)
	if ($userResponse.Count -gt 0) {
		Write-Output ("Deleting " + $userResponse.Count.ToString() + " files to recover " + ($userResponse |  Measure-Object -Property Length -Sum).Sum.ToString("#,##0") + " bytes.")
		$userResponse | % {Remove-Item -LiteralPath $_.FullName}
	}
}
else {
	Write-Output ("Deleting " + $dupFiles.Count.ToString() + " files to recover " + ($dupFiles |  Measure-Object -Property Length -Sum).Sum.ToString("#,##0") + " bytes.")
	$dupFiles | % {Remove-Item -LiteralPath $_.FullName}
}

Support
Official support

2019-08-08T08:40:53Z

#5

Thanks, this seems like a perfect candidate for something to upload in the Task repository community! Could you do that?

Henrik

jrtwynam
Free support Topic Starter

2019-08-08T14:29:07Z

#6

Thanks! That looks like a useful script for what I need to do. One thing though, does it delete ALL duplicate files? E.g. if it finds 2 copies of the same file, would it delete both of them, or would it only delete 1 of them and leave the other one?

jrtwynam
Free support Topic Starter

2019-08-08T14:33:17Z

#7

Henrik,

I just tried having a look at the Task Repository, but it gave me an error saying "please enter your forum credentials in the settings" or something. I just did that and clicked "apply settings", but it still gives me that error. I did log out of here and manually log back in and confirmed that the password is correct.

MRomer
Free support

2019-08-08T14:43:36Z

#8

Henrik, I've uploaded it to the online task repository. I hadn't realized that existed. Neat idea!

MRomer
Free support

2019-08-08T14:50:18Z

#9

Originally Posted by: jrtwynam

Thanks! That looks like a useful script for what I need to do. One thing though, does it delete ALL duplicate files? E.g. if it finds 2 copies of the same file, would it delete both of them, or would it only delete 1 of them and leave the other one?

It will delete all but one file from each set of duplicates. In this line:

$dupFiles = $fileHashes | Group Hash | ? {$_.Count -gt 1} | % {$_.Group | Sort Path | Select Path -Skip 1} | % {ls $_.Path} | Select FullName,Length,LastWriteTime

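The "Select Path -Skip 1" command outputs all the Path properties within each group except for the first one. Because each group is sorted by path beforehand, the copy whose path sorts first is the one that is kept.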
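To see the effect in isolation, here is a small sketch that runs the same grouping on made-up records instead of real files. The paths and hash values are invented for illustration, and the final file lookup from the script is left out.

```powershell
# Stand-ins for Get-FileHash output: three paths share hash 'AAA', one has 'BBB'.
$fileHashes = @(
    [pscustomobject]@{ Hash = 'AAA'; Path = 'C:\in\File3.txt' }
    [pscustomobject]@{ Hash = 'AAA'; Path = 'C:\in\File1.txt' }
    [pscustomobject]@{ Hash = 'BBB'; Path = 'C:\in\File4.txt' }
    [pscustomobject]@{ Hash = 'AAA'; Path = 'C:\in\File2.txt' }
)

# Group by hash, keep only groups with more than one member,
# sort each group by path, and skip the first path in each group.
$toDelete = @($fileHashes | Group-Object Hash | Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Group | Sort-Object Path | Select-Object -ExpandProperty Path -Skip 1 })

$toDelete   # C:\in\File2.txt and C:\in\File3.txt
```

File1.txt survives because it sorts first within the 'AAA' group, and File4.txt is never considered because its group has only one member.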

Richard Green
Free support

2023-03-17T15:59:14Z

#10

Duplicate files can be a real nuisance, and sometimes they cannot be deleted; even shredding them won’t solve the problem. Thankfully, DuplicateFilesDeleter was there to save the day. It is a handy piece of software that can delete files, even ones that are difficult to delete for some reason. I suggest you try DuplicateFilesDeleter and see for yourself.
