Nadav Harari, Head of SEO at VentureKite, created something remarkable. With a little help from a popular AI chatbot (and a healthy dose of Python programming), he built a tool that uses Hugging Face to semantically analyze topics for search engine optimization. It’s a real-world example of machine learning for SEO.
Best of all, he’s sharing it with the world. We interviewed Harari about the tool, and an abridged transcript of our conversation appears below, where he explains what it is and how it works.
If you prefer written step-by-step instructions on how to set up and use our SEO tool, we’ve included those below as well.
Our Free AI SEO Tool: Abridged Transcript
Jim Markus: I’m Jim Markus. I’m the website portfolio manager at VentureKite, and I have with me today…
Nadav Harari: I’m Nadav Harari, the head of SEO at VentureKite.
Jim Markus: So, essentially, our audience will be able to view a tool that you created from scratch with the help of ChatGPT and some of your own programming knowledge.
Nadav Harari: Thank you, Jim. Mainly, my focus was to come up with a process that would allow me to run a content gap analysis, not at the keyword level, as we are used to with well-known SEO tools like Ahrefs and SEMrush, but at the semantic level.
So, you can get, let’s say, a few titles from a competitor’s website. Then, using a machine learning model that was pre-trained for that purpose, you can easily assess whether you have already covered the topic you just tested. It also provides a similarity score between zero and one, so you can better judge how closely your existing content matches.
This machine learning model was trained, as you can see here, on more than 1 billion training pairs, so it can understand context, maybe intent, synonyms, plurals, singulars, other nouns, gender. Everything. This allows you to perform a content gap analysis at another level. And it should replace the manual process that many of us use now, which is to type into Google, let’s say, “site,” colon, your site, and then some topic. Then you go over the results to see if this topic actually exists on your website.
Jim Markus: Perfect. Yeah, and as you mentioned, the tools that are kind of industry standard right now use keyword analysis. It’s pretty common for you to be able to say, “My competitor uses this keyword. Do I use this keyword? Do we compete in this keyword?”
What makes this interesting is that search is getting much more complex and advanced. Google’s algorithms are getting a lot better at saying, well, this article isn’t just about this keyword, it’s about this topic.
That’s how we need to be thinking, and what you created is a tool to help evaluate that.
Nadav Harari: Soon, I’m going to show the process and how you can use it for your benefit. I just want to add that those SEO tools compare keywords, if you are ranked for a keyword or not, right?
So, let’s say you published an article yesterday, and Google didn’t even have the chance to crawl your site, let alone index or rank it, right? So the SEO tool will tell you that you are not ranked for that keyword.
But using the process I’m going to show you, you can just add a bunch of URLs, all your URLs from your site map and match them. Match their titles, and their topics, against a competitor’s topic.
Jim Markus: Perfect, so you’re not waiting for ranking, so you don’t need to wait for Google or any third party. You, you know, obviously, your site map updates when you publish, so you’ll have that information immediately available.
Nadav Harari: Yeah, exactly. I can start to show you how it works. So we have mainly two parts. We have this spreadsheet that you can make your own copy of. And also, we have a Python script that uses the machine learning model that I just demonstrated.
Jim Markus: Fantastic. And to, you know, repeat back for clarity in case anyone’s not familiar. Column F. And again, just after column G to the right as well. These scores are rated on a scale of zero to one, and what Nadav was saying is anything under 0.45 doesn’t show up, so it’ll be blank for you. It’s a good way to see, hey, there’s a blank spot.
A blank likely means we don’t cover the topic in your competitor’s title on our site, or at least within the URLs that you’ve uploaded. Anything closer to one, like 0.94 or 0.85, is much more likely to be very similar to a topic that is already covered.
I mean, this is unbelievable.
Nadav Harari: Yeah, really, really helpful and I want to show you another use case that may interest you. Up until that point, we discussed content gap analysis with competitors, but what if you can create a content gap analysis internally within your own URLs against the same topics, right?
Let’s go to row seven, right? We have the 10 best web development frameworks, the best PHP frameworks for web development, right?
It’s very similar, right? You can use that as a content cluster. Let’s say this one, best PHP frameworks, top PHP alternatives, right? It’s the same topic. Best certifications, right? It’s already a relatively high score.
PHP interview questions. So we have a content cluster related to PHP, go, let’s see “how to become a data scientist,” “how to learn data science”, right? It’s very similar to “become a data analyst.” Let’s see the score here. 0.7, right?
Jim Markus: Perfect. Like you mentioned, if you’re looking for content clusters, this is a good way to identify those, but you’re also looking for areas where you might be cannibalizing your own traffic. If you’re seeing scores in the 0.9 range, that’s probably a good thing to consider.
Awesome. Any final thoughts on this before we wrap up this call?
Nadav Harari: Yeah, I can share some more examples. But, I mean, everyone should try this. Just make a copy of the spreadsheet. Create an account in Google Colab, copy the script, and paste it there. Create your own credentials using Google Cloud the way I showed you. Just run it on your website or for a few competitors.
Just, you know, test the waters, see how it works for you.
Jim Markus: Great. And you also have the Apps Script for the sheet. Is that connected to the sheet when they make their copy? Great. Okay. So we’ll also put this into an article or a blog post that you can find, with detailed instructions, in case you weren’t taking avid notes during the call.
But I think this is a wonderful introduction. And again, you created this by asking ChatGPT for ideas and then building around that. Is that right?
Nadav Harari: I started with ChatGPT 4 to get some ideas, but the ideas I got weren’t really good. So I started digging in Google to find other solutions. And then I came across the Hugging Face machine learning transformer, which really does the job.
Jim Markus: Yeah, seems like the heart of what you created. Perfect. Thank you again for showing off everything here, and I will point readers to where they can find detailed instructions on how to use this for themselves.
So thank you so much for making time to chat.
Nadav Harari: Thank you, Jim.
How to Use Our Free SEO Tool: Step by Step
Here are step-by-step instructions for how to set up and use our SEO tool. For a more complete explanation, be sure to watch our video, which provides a visual demonstration of installations and setup.
What Is This Semantic Content Gap Analysis?
My semantic content gap analysis process is based on a pre-trained machine learning model from Hugging Face, trained on a large, diverse dataset of over 1 billion sentence pairs. Using this model, you can compare each competitor article title against all article titles on your website and retrieve, for any given competitor title, up to 50 of the most semantically similar titles on your site, in descending order of similarity score (0 to 1).
Screenshot showing the model in action on Hugging Face’s website.
The model understands that “table tennis table” is a synonym for “ping pong table” and “ping pong tables” and thus gives those terms the highest scores.
Why use it:
Currently, if you want to check whether you already cover a competitor’s topics on your website, you need to search Google using the [site:yourdomain.com “a topic”] operator and browse through the results manually. Alternatively, you may use the content gap feature in Ahrefs/Semrush.
Why the methods above are not good enough:
- Using the site: operator may not show URLs of articles that were published recently and have not yet been indexed or ranked.
- Using Google’s [site:] operator is manual work that is not suitable for checking hundreds of your competitor’s topics.
- The content gap feature in Ahrefs/SEMrush finds the keywords your competitors rank for, but you don’t. This disregards the topic as a whole. For example, the SEO tool may flag that the keyword “Best table tennis paddle” is missing even though you already published an article about “Best ping pong paddle”.
Prerequisites and Example of the Final Result
- Make a copy of this Google Sheet.
- Create an account in Google Colab, then copy-paste the Python script I generated with ChatGPT.
Column A contains competitor’s titles while the columns to the right contain the most similar titles on our site (in descending order).
Example: Cells B2 and C2 in the screenshot (above) reveal the most similar titles on golfspan.com against cell A2, a competitor title. Cells B3 and C3 show the most similar titles against cell A3, and so on.
Setting Up & Using the Google Sheet
Google Sheet1 Tab:
- Place URLs of a competitor in Column A and ALL your website’s URLs in Column F. You’ll copy and paste them from your XML sitemap.
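If you would rather not copy URLs out of the sitemap by hand, a short script can list them for you. The sketch below is not part of the original tool. It assumes a standard sitemaps.org urlset file, and the function name urls_from_sitemap and the sample URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps that follow the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL found in a sitemap XML string."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


# Hypothetical sitemap content, used only for illustration.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/best-ping-pong-paddle/</loc></url>
  <url><loc>https://example.com/how-to-learn-data-science/</loc></url>
</urlset>"""

if __name__ == "__main__":
    # Print one URL per line, ready to paste into Column A or Column F.
    for url in urls_from_sitemap(EXAMPLE_SITEMAP):
        print(url)
```

The output is one URL per line, so it can be pasted straight into a spreadsheet column.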
Run the ChatGPT-generated Apps Script to retrieve the status code and title of each URL into Columns B and C (competitor) and Columns G and H (your website). I’ve added features to retrieve the title even if the URL returns a 3XX redirect. If the URL returns a 404, the code should transform the URL slug into a title.
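To make the 404 fallback concrete, here is a minimal Python sketch of turning a URL slug into a title. The tool itself does this in Apps Script, and that code is not reproduced here, so the function name slug_to_title and the exact rules (hyphens and underscores become spaces, a file extension is dropped, only the first letter is capitalized) are assumptions for illustration.

```python
from urllib.parse import urlparse


def slug_to_title(url: str) -> str:
    """Turn the last path segment of a URL into a readable title."""
    path = urlparse(url).path.strip("/")
    if not path:
        return ""  # homepage or empty path: there is no slug to convert
    slug = path.split("/")[-1]
    # Drop a file extension such as ".html" if one is present.
    if "." in slug:
        slug = slug.rsplit(".", 1)[0]
    # Treat hyphens and underscores as word separators.
    words = slug.replace("-", " ").replace("_", " ").split()
    # Capitalize only the first letter; the rest is lowercased.
    return " ".join(words).capitalize()


if __name__ == "__main__":
    print(slug_to_title("https://example.com/blog/best-php-frameworks/"))
```

Note that this simple casing rule lowercases acronyms such as PHP, so the result is only a rough stand-in for a real page title.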
- We want to compare raw topics/titles without any influence from brand names or other entities. That’s why Column D and Column I have a Google Sheets formula that removes any brand name after the vertical bar (|) and cleans up HTML entities such as &quot; (") and &amp; (&). You can also remove a specific brand name by replacing the formula part “Add yourdomain.com here” with the term you want to remove.
- Now that you have a list of clean titles in Column D and Column I in Sheet1, they’ll automatically be populated, respectively, in Column A and Column B in Sheet2.
- Download the Google Sheet as an XLSX file.
- Open the XLSX file and click on “Enable editing.”
- Go to Sheet2 tab and copy Columns A and B, then paste them as values. This is important as the Python script will otherwise fail.
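The title-cleaning step in Columns D and I is done with a Google Sheets formula. For readers who prefer to clean titles in code, the sketch below shows one way to get a similar result in Python. It is an approximation, not a translation of the sheet formula, and the function name clean_title and the brand parameter are hypothetical.

```python
import html
import re


def clean_title(raw_title: str, brand: str = "") -> str:
    """Decode HTML entities, drop text after the first '|', remove a brand term."""
    title = html.unescape(raw_title)   # "&amp;" -> "&", "&quot;" -> '"'
    title = title.split("|", 1)[0]     # keep only the part before the vertical bar
    if brand:
        # Case-insensitive removal of the brand term wherever it appears.
        title = re.sub(re.escape(brand), "", title, flags=re.IGNORECASE)
    # Collapse any repeated whitespace left behind by the removals.
    return re.sub(r"\s+", " ", title).strip()


if __name__ == "__main__":
    print(clean_title("Best Ping Pong Paddles &amp; Tables | Example Brand"))
```

Passing brand="yourdomain.com" plays the same role as editing the “Add yourdomain.com here” part of the sheet formula.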
For example, Column A contains competitor’s titles and Column B contains all titles on our site. After running the Python script, the Sheet2 tab will be populated with:
- Competitor titles in Column D
- The most similar titles from your site in Column E and the columns to its right, ordered from the most similar (highest score) to the least similar.
Python Script that accepts your XLSX file and returns it filled after execution.
Python Script
Around 30 seconds after running the code, you’ll be asked to upload the XLSX file from the previous section. The script then continues and produces a downloadable version of your XLSX file, with the results written to the Sheet2 tab.
The script does the following:
- Enables you to upload your XLSX file.
- Matches each competitor title (Column A) against ALL the titles on your site (Column B). It writes the competitor’s titles to Column D and up to 50 matching titles with a score above 0.45 to Column E and the columns to its right in Sheet2. Note: Titles with lower scores indicate low similarity and are therefore not retrieved.
- Skips any title that is identical to the competitor title (i.e., similarity score = 1). If you’re performing content gap analysis on your own site, you don’t need identical titles to find content clusters and consolidation opportunities.
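The matching rules above can be sketched in a few lines of Python. This is a simplified illustration, not the actual script. It assumes cosine similarity as the scoring metric, and it takes the embedding function as a parameter because the specific model is not named here. The names top_matches, toy_embed, and TOY_VECTORS are hypothetical, and the hand-made vectors only stand in for real sentence embeddings.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(
    competitor_title: str,
    site_titles: Sequence[str],
    embed: Callable[[str], Vector],
    threshold: float = 0.45,
    limit: int = 50,
) -> list[tuple[str, float]]:
    """Return up to `limit` (title, score) pairs above `threshold`, best first."""
    query_vec = embed(competitor_title)
    scored = []
    for title in site_titles:
        if title == competitor_title:
            continue  # identical titles are skipped, as described above
        score = cosine_similarity(query_vec, embed(title))
        if score > threshold:
            scored.append((title, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Toy stand-in for a real sentence-embedding model (assumption for the demo).
TOY_VECTORS = {
    "Best table tennis paddle": (0.9, 0.1, 0.0),
    "Best ping pong paddle": (0.8, 0.2, 0.0),
    "How to learn data science": (0.0, 0.1, 0.9),
}


def toy_embed(title: str) -> Vector:
    return TOY_VECTORS[title]


if __name__ == "__main__":
    site = list(TOY_VECTORS)
    for title, score in top_matches("Best table tennis paddle", site, toy_embed):
        print(f"{score:.2f}  {title}")
```

In real use, the embed argument would wrap a sentence-embedding model, so that titles like “ping pong paddle” and “table tennis paddle” land close together without sharing keywords.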
We hope this free AI-powered SEO tool brings value to you and helps with your website’s content gap analysis.
If you have any questions regarding this tool and how to use it, you can reach out directly to Nadav Harari, the creator: email@example.com