SEO Insights > How to remove duplicates from Google Sheets using Mac OS’s terminal

How to remove duplicates from Google Sheets using Mac OS’s terminal

Posted by Tom Pool on November 2, 2017

Content and KeywordsSEOTechnical SEO

How to remove duplicates from Google Sheets using Mac OS’s terminal.

We have recently come across a few issues while using Google Sheets & Excel to process large amounts of keyword data; namely to combine and deduplicate competitor data when doing a keyword gap analysis.

Doing this sort of process within Google Sheets can take a considerable amount of time, particularly if the data sets are very large. If so, a lot of time will be spent staring at this:

Or sometimes this:

If the data sets are too large, Google Sheets will simply refuse to import the data:

The browser sometimes even gives up:

 

Google Sheets also loves to load data slowly, especially when you have over 100,000 rows or more. This can be particularly frustrating when all you need to do is remove duplicates.

Waiting for additional plugins to load, to remove duplicate data, on top of the initial data upload time can also add many minutes, turning what should be a relatively simple process into a task that can take anything upwards of one hour.

We also found that with smaller data sets, having to import 15+ individual .csv files can take a substantial amount of time to complete, which can really eat up a significant part of your day.

This is an issue that we have come across consistently – as such we decided to try and find a simple solution to this problem.

FYI – the following documentation uses Mac OS’s inbuilt terminal application (Command Line), and utilises basic commands. You do not need to be a tech wizard to be able to follow these guidelines.

These guidelines consist of 2 sections –

1) How to combine .csv files together

2) How to remove duplicates from .csv files

For the use of this explanation, we used exports from SEMRush, that look like this:

All of the files that we combined/manipulated followed the same format, with the same number of columns.

This is a series of instructions created to speed up the process that is usually followed in Google Sheets, or Microsoft Excel (by importing multiple .csv files & then using PowerTools or other similar add-ons to remove duplicates).

This is a process that should be used if you deal with larger file sizes, or many different .csv files (5k+ rows), as it can process data many times quicker than Google Sheets.

Where any code is used, we will explain what each piece does, so that you can fully understand what process is being carried out, and what your Mac is doing.

1) To combine all files that you wish to analyse together into one file:

  • Download all relevant .csv files & place them all into one folder.
  • Then right click on this folder within finder, and go to Services > New Terminal at Folder.

  • If this command is not available for you, go to “System Preferences > Keyboard > Shortcuts > Services > Files and Folders” and check the relevant checkbox.

  • Alternatively, you can navigate to the specific folder using terminal, and the ‘cd’ (change directory) command.

  • (the above command will work if you open Terminal as normal, and have a folder on your Desktop called “Test For Gap”)

  • Once you have a terminal open at the relevant folder, input the following command: cat *.csv > aggregate.csv
  • cat short for ‘concatenate’ – this is a command that combines the following directed files.
    • *.csv this tells the cat command to combine all files that contain .csv
    • > aggregate.csv – this is the new .csv file that you are telling the Terminal to create, with the results of cat. (You can call it anything that you like, as long as you have the .csv file extension)
  • Open the aggregate.csv file – it should contain all of your chosen sheets’ data. Note – this file will contain as many header rows as you have files

2) To remove duplicates takes another command, with slightly more parts to it:

In the same terminal window (the terminal at the folder with all downloaded .csv files), enter the following command, and update as appropriate:

awk -F, ‘seen[$1,$4]++{print $0 > “dups.csv”; next}{print $0 > “new.csv”}’ aggregate.csv

awk -F,this is a powerful text processing command, that can be used in many different ways. The “-F”, tells the “awk” what field separator is used, in this case “,” as the file is a .csv (comma separated value)

seen[$1,$4] – this is a test that happens for each line, specifying the columns that you want to check for duplicates. In this case, we want to check the keyword & the search volume column. If you wanted to check an additional column, you can add it in here.

++{print $0 > “dups.csv”; next}{print $0 > “new.csv”} –  this section tells the terminal to print any duplicate lines to a file called “dups.csv” and any new lines to a file called “new.csv”. These values can be updated as appropriate.

aggregate.csv finally, this section of the code specifies which file the previous directives should be carried out on.

Then hit enter, to carry out the command. This process should take just a few seconds.

You will then be left with 2 files – one that contains all duplicate keywords, and one that contains all unique keywords.

Summary:

In this blog post, we have learnt how to combine .csv files and remove duplicates from .csv files using the Terminal Application that is on Mac OS.

Hopefully, this will help out with any lengthy combination & deduplication regarding .csv files (not just with Keyword Data) and that Terminal will become part of your SEO toolbox.

Over time I’ve discovered how powerful Mac OS’s Terminal can be when working with large data sets. There are many more advanced commands and processes that can help with your SEO analysis – I will cover these in more detail at a later date.

0 0 votes
Article Rating
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments