Community forum

Offline jrtwynam  
#1 Posted : Wednesday, June 19, 2019 9:15:34 PM(UTC)
jrtwynam

Rank: Free support

Joined: 7/20/2017(UTC)
Posts: 131
Canada
Location: Ontario, Toronto

Thanks: 4 times
Was thanked: 16 time(s) in 16 post(s)
Hi,

I'm wondering if there's an easy way to look through a particular directory to find and delete duplicate files. We have a process where an external company emails us a file every 10 minutes, and VC monitors this inbox for the file and detaches it to a specified folder. From there, a different VC job runs every 30 minutes to process whatever files are in that folder at the time. The thing is, some of these files might be duplicates, depending on the data the external company has received since the last file, so there's no need to process the same data twice.

Say there are 4 files in the folder (File 1, File 2, File 3, and File 4), sorted by create date/time ascending. Now, say that File 2 and File 3 are both duplicates of File 1, but File 4 is different. I'd like this process to identify File 2 and File 3 as duplicates and delete them, leaving only File 1 and File 4 in the folder to process.

The main reason I'm trying to do this is that the process currently handling these files is an MS Access database, and it takes maybe 1 minute per file. I'm redoing this process using only VC, but in doing that, the overall process becomes significantly slower. I'm only about half done with the VC version, and it's already taking about 2.5 minutes for a single file.

One of the things on my to-do list is to get this company to decrease the frequency from every 10 minutes to every 30 minutes (or even 60 minutes), but in the meantime I thought I'd look into this option as a learning opportunity.

What I had thought of doing is this:

  1. Read in a list of files in the directory.
  2. Loop through the files. For each file:
     - Read in the file contents.
     - Loop through the files again (this requires a second job, since VC doesn't otherwise support nested loops).
     - If the inner file's name differs from the outer loop's current file name (and its create date/time is later than the outer file's), read in its contents.
     - Compare its contents with the contents of the outer loop's file. If they're the same, make note of the file name.
  3. Once the outer loop has ended, I should have a list of duplicate file names. Loop through them and delete them.


This seems overly complicated to me, so I'm wondering if maybe I'm missing something.
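
For illustration, here's roughly how that nested-loop comparison would look if it were all one script rather than two chained VC jobs (just a rough PowerShell sketch; the folder path is made up):

Code:
# Rough sketch of the nested-loop idea; the folder path is only a placeholder
$folder = 'C:\VC\DetachedFiles'
$files  = Get-ChildItem -Path $folder -File | Sort-Object CreationTime
$duplicates = @()

foreach ($outer in $files) {
	# Skip files already flagged as duplicates of an earlier file
	if ($duplicates -contains $outer.FullName) { continue }
	$outerContent = Get-Content -Path $outer.FullName -Raw

	foreach ($inner in $files) {
		# Only compare against later files that haven't been flagged yet
		if ($inner.FullName -ne $outer.FullName -and
			$inner.CreationTime -gt $outer.CreationTime -and
			$duplicates -notcontains $inner.FullName) {

			if ((Get-Content -Path $inner.FullName -Raw) -eq $outerContent) {
				$duplicates += $inner.FullName
			}
		}
	}
}

# Delete everything that was flagged
$duplicates | ForEach-Object { Remove-Item -Path $_ }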

Thanks.

Edited by moderator Tuesday, June 25, 2019 4:51:35 PM(UTC)  | Reason: Not specified

Offline Support  
#2 Posted : Friday, June 21, 2019 7:07:39 PM(UTC)
Support

Rank: Official support

Joined: 2/23/2008(UTC)
Posts: 11,167

Thanks: 868 times
Was thanked: 443 time(s) in 421 post(s)
I do not think it is efficient to solve this in VisualCron. You should create a script for this and run that script instead, preferably in PowerShell or .NET.
Henrik
Support
http://www.visualcron.com
Please like VisualCron on Facebook!
Offline jrtwynam  
#3 Posted : Friday, June 21, 2019 7:30:11 PM(UTC)
jrtwynam

Rank: Free support

Joined: 7/20/2017(UTC)
Posts: 131
Canada
Location: Ontario, Toronto

Thanks: 4 times
Was thanked: 16 time(s) in 16 post(s)
I was starting to have that same thought. I've been trying to teach myself PowerShell lately anyway, so this might be a good mini script to learn with. Then I can just have the email trigger save the file like it already does, and add this script as the last step of the job to clean up duplicates.
1 user thanked jrtwynam for this useful post: Support on 6/25/2019(UTC)
Offline MRomer  
#4 Posted : Wednesday, August 7, 2019 10:38:58 PM(UTC)
MRomer

Rank: Paid support

Joined: 11/27/2012(UTC)
Posts: 51
United States
Location: Memphis, TN

Thanks: 8 times
Was thanked: 8 time(s) in 8 post(s)
Here's a PowerShell script that we use in some jobs to remove duplicate files. Call the script and pass the following parameters to it:
-FolderPath : the folder to work in.
-Recurse : Whether or not to recurse into subfolders.
-Interactive : Whether or not to let the user review the changes and approve them.
The script first builds a list of files where more than one file has the same size. It then hashes each file in that list, so it finds files with duplicate content regardless of file name or date.

Code:
# Removes duplicate files from a folder.
Param(
	[Parameter(Position=0,Mandatory=$true,ValueFromPipeline=$true)][string]$FolderPath,
	[Parameter(Mandatory=$false)][ValidateSet('Y','Yes','N','No')][string]$Recurse = 'No',
	[Parameter(Mandatory=$false)][ValidateSet('Y','Yes','N','No')][string]$Interactive = 'No'
)
if ($FolderPath -match '[\*\?]') {
	Write-Output "Path '$FolderPath' invalid.  Wildcards not allowed."
	exit
}

if ( -not (Test-Path -LiteralPath $FolderPath -PathType Container) ){
	Write-Output "Path '$FolderPath' not found or is not a folder."
	exit
}

# Build the list of files that share a size with at least one other file; only these need hashing
if ($Recurse -match '^y$|^yes$') {
	$multipleSameSize = Get-ChildItem -Path $FolderPath -Recurse | Group Length | ? {$_.Count -gt 1} | Select -ExpandProperty Group | Select FullName
}
else {
	$multipleSameSize = Get-ChildItem -Path $FolderPath | Group Length | ? {$_.Count -gt 1} | Select -ExpandProperty Group | Select FullName
}

if ($multipleSameSize.Count -eq 0) {
	Write-Output "No duplicate files found."
	exit
}

# Hash each candidate file (SHA1 is enough here to confirm identical content)
$fileHashes = foreach ($i in $multipleSameSize) {Get-FileHash -Path $i.FullName -Algorithm SHA1}

# Group by hash; within each group of identical files, keep the first path (sorted) and mark the rest for deletion
$dupFiles = $fileHashes | Group Hash | ? {$_.Count -gt 1} | % {$_.Group | Sort Path | Select Path -Skip 1} | % {ls $_.Path} | Select FullName,Length,LastWriteTime

if ($dupFiles.Count -eq 0) {
	Write-Output "No duplicate files found."
	exit
}

if ($Interactive -match '^y$|^yes$') {
	$userResponse = $dupFiles | Out-GridView -Title ($dupFiles.Count.ToString() + " files to be deleted. " `
													+ $($dupFiles | Measure-Object -Property Length -Sum).Sum.ToString("#,###") `
													+ " bytes recoverable.") -PassThru
	if ($userResponse.Count -gt 0) {
		Write-Output ("Deleting " + $userResponse.Count.ToString() + " files to recover " + ($userResponse |  Measure-Object -Property Length -Sum).Sum.ToString("#,###") + " bytes.")
		$userResponse | % {Remove-Item -Path $_.FullName}
	}
}
else {
	Write-Output ("Deleting " + $dupFiles.Count.ToString() + " files to recover " + ($dupFiles |  Measure-Object -Property Length -Sum).Sum.ToString("#,###") + " bytes.")
	$dupFiles | % {Remove-Item -Path $_.FullName}
}
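
For reference, assuming the script were saved as something like Remove-DuplicateFiles.ps1 (that file name, and the folder paths below, are just placeholders), it could be called like this:

Code:
# Clean a single folder, deleting duplicates without prompting
.\Remove-DuplicateFiles.ps1 -FolderPath 'D:\Inbox\Detached'

# Include subfolders and review the candidates in a grid view before anything is deleted
.\Remove-DuplicateFiles.ps1 -FolderPath 'D:\Inbox' -Recurse Yes -Interactive Yes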
1 user thanked MRomer for this useful post: Support on 8/8/2019(UTC)
Offline Support  
#5 Posted : Thursday, August 8, 2019 10:40:53 AM(UTC)
Support

Rank: Official support

Joined: 2/23/2008(UTC)
Posts: 11,167

Thanks: 868 times
Was thanked: 443 time(s) in 421 post(s)
Originally Posted by: MRomer
Here's a PowerShell script that we use in some jobs to remove duplicate files. [...]

Thanks, this seems like a perfect candidate to upload to the Task Repository community! Could you do that?
Henrik
Support
http://www.visualcron.com
Please like VisualCron on Facebook!
Offline jrtwynam  
#6 Posted : Thursday, August 8, 2019 4:29:07 PM(UTC)
jrtwynam

Rank: Free support

Joined: 7/20/2017(UTC)
Posts: 131
Canada
Location: Ontario, Toronto

Thanks: 4 times
Was thanked: 16 time(s) in 16 post(s)
Thanks! That looks like a useful script for what I need to do. One thing, though: does it delete ALL duplicate files? E.g. if it finds 2 copies of the same file, would it delete both of them, or would it delete only 1 and leave the other?

Offline jrtwynam  
#7 Posted : Thursday, August 8, 2019 4:33:17 PM(UTC)
jrtwynam

Rank: Free support

Joined: 7/20/2017(UTC)
Posts: 131
Canada
Location: Ontario, Toronto

Thanks: 4 times
Was thanked: 16 time(s) in 16 post(s)
Henrik,

I just tried having a look at the Task Repository, but it gave me an error along the lines of "please enter your forum credentials in the settings". I did that and clicked "apply settings", but it still gives the same error. I also logged out of the forum, logged back in manually, and confirmed that the password is correct.
Offline MRomer  
#8 Posted : Thursday, August 8, 2019 4:43:36 PM(UTC)
MRomer

Rank: Paid support

Joined: 11/27/2012(UTC)
Posts: 51
United States
Location: Memphis, TN

Thanks: 8 times
Was thanked: 8 time(s) in 8 post(s)
Henrik, I've uploaded it to the online Task Repository. I hadn't realized that existed. Neat idea!
Offline MRomer  
#9 Posted : Thursday, August 8, 2019 4:50:18 PM(UTC)
MRomer

Rank: Paid support

Joined: 11/27/2012(UTC)
Posts: 51
United States
Location: Memphis, TN

Thanks: 8 times
Was thanked: 8 time(s) in 8 post(s)
Originally Posted by: jrtwynam
Thanks! That looks like a useful script for what I need to do. One thing, though: does it delete ALL duplicate files? E.g. if it finds 2 copies of the same file, would it delete both of them, or would it delete only 1 and leave the other?



It will delete all but one of the duplicated files. In this line,

Code:
$dupFiles = $fileHashes | Group Hash | ? {$_.Count -gt 1} | % {$_.Group | Sort Path | Select Path -Skip 1} | % {ls $_.Path} | Select FullName,Length,LastWriteTime


the "Select Path -Skip 1" command outputs all the Path properties within each group except the first one.
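
As a quick illustration of that pattern (the hashes and paths below are made up):

Code:
# Toy example: three files share the same hash, one is unique
$fileHashes = @(
	[pscustomobject]@{ Hash = 'AAA'; Path = 'C:\data\file1.csv' },
	[pscustomobject]@{ Hash = 'AAA'; Path = 'C:\data\file2.csv' },
	[pscustomobject]@{ Hash = 'AAA'; Path = 'C:\data\file3.csv' },
	[pscustomobject]@{ Hash = 'BBB'; Path = 'C:\data\file4.csv' }
)

# Same grouping pipeline as in the script, stopping before the delete step
$fileHashes | Group Hash | ? {$_.Count -gt 1} | % {$_.Group | Sort Path | Select Path -Skip 1}

# Output lists only file2.csv and file3.csv, so file1.csv (and the unique file4.csv) are kept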