Method for studying keyword density in top ranking pages
Assume you are doing some search engine optimization (SEO) and would like an idea of the keyword density on competing websites. This article provides a step-by-step method for determining keyword density in the top-ranking pages of a given website.
The method
The procedure described here measures how often certain keywords occur in the pages of a website that rank best for those keywords.
As an example, let's study the density of the keywords 'wireless' and 'performance' in the website 'http://www.codealias.info'.
The procedure applied to our example consists of three phases:
- 1. Obtain the list of the 10 top-ranking pages in 'codealias.info' for the keywords 'wireless' and 'performance'.
- 2. Download these 10 pages.
- 3. Calculate the occurrence/density of each keyword in these pages.
The procedure described below can be used for any keyword and any website. All that is needed is a Linux/Unix environment that provides Perl and the sed, egrep and wget utilities.
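Before starting, it can help to confirm that these tools are available on the PATH. The following is only a minimal sketch: check_tools is a hypothetical helper name introduced here for illustration, and it relies on nothing more than the standard 'command -v' shell builtin.

```shell
# Report which of the given tools are missing from PATH.
# check_tools is a hypothetical helper, not part of the original procedure.
check_tools() {
    missing=""
    for tool in "$@"; do
        # command -v succeeds only if the shell can find the tool
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "all tools found"
}

# Check the tools used in this article
check_tools perl sed egrep wget || echo "install the missing tools before continuing"
```

If a tool is reported as missing, install it with your distribution's package manager before continuing.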
Obtain the list of top ranking pages
For this purpose, we will use a Perl script that performs a search on the given keywords in the studied website.
The script uses the LWP::Simple module. If you don't have it, install it as follows:
    $ cpan
    cpan> install LWP::Simple
This is the Perl script that we will use:
    use strict;
    use warnings;
    use LWP::Simple;

    # Site to study and keywords to search for ('+' separates the keywords)
    my $site     = "www.codealias.info";
    my $keywords = "wireless+performance";

    # Search request: 10 results (n) for the keywords (p), limited to the site (vs)
    my $req = "http://search.yahoo.com/search?n=10&p=$keywords&vs=$site";

    print "$req\n";   # show the request URL
    getprint $req;    # fetch the result page and print it to standard output
Place the script above in a file (e.g. get.pl), then run the following command:
    perl ./get.pl | sed 's/href=\"/\n/g' | sed 's/\"/\n/g' | egrep "^http://www.codealias.info/\w+" > pagelist.txt
The pagelist.txt file will contain the following:
    http://www.codealias.info/topics/performance
    http://www.codealias.info/technotes/impact_of_wireless_handoff_delays_on_voip_qos
    http://www.codealias.info/technotes/performance_evaluation_of_wireless_security_systems_part_1
    http://www.codealias.info/technotes/802.11_handoff_performance_--_bibliography
    http://www.codealias.info/technotes/performance_evaluation_of_wireless_security_systems_part_2_-_the_802.11_handoff_process
    http://www.codealias.info/topics/security
    http://www.codealias.info/technotes/performance_of_eap_and_radius_authentication_in_roaming_scenarios
    http://www.codealias.info/technotes/performance_evaluation_of_wireless_security_systems_part_3_-_factors_affecting_handoff_performance
    http://www.codealias.info/technotes/network_communication_properties_and_qos_of_voip
These are the top-ranking web pages for the keywords 'wireless' and 'performance' in the website codealias.info. The search asked for 10 results; the list above holds nine URLs.
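To see how the sed and egrep steps turn a results page into a list of URLs, the same pipeline can be tried on a small hand-written HTML fragment. This is a sketch under assumptions: the sample markup is invented for illustration (it is not real search engine output), extract_links is a hypothetical helper name, and the '\n' replacement and '\w' class assume GNU sed and GNU grep. 'grep -E' is the equivalent spelling of egrep.

```shell
# Hand-written sample standing in for the search results page
# (an assumption for illustration, not real search engine output).
sample='<a href="http://www.codealias.info/topics/performance">A</a> <a href="http://example.com/other">B</a>'

# extract_links is a hypothetical helper wrapping the pipeline steps:
#   1. put a line break where each href value starts
#   2. put a line break at every remaining double quote, so each URL
#      ends up alone on its own line
#   3. keep only the lines that are URLs of the studied site
extract_links() {
    sed 's/href="/\n/g' | sed 's/"/\n/g' | grep -E '^http://www\.codealias\.info/\w+'
}

printf '%s\n' "$sample" | extract_links
# prints: http://www.codealias.info/topics/performance
```

The link to example.com is dropped because it does not start with the address of the studied site.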
Download the top ranking pages
Now that we have the list of pages, we can download them using wget as follows:
    wget -i pagelist.txt
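The counting step in the next section looks up each downloaded page through the basename of its URL, so it is worth knowing which local file name corresponds to which URL. The sketch below prints that mapping. It assumes wget saved each page in the current directory under the last component of its URL path (its default behaviour when no file of that name already exists); local_names is a hypothetical helper name.

```shell
# Print the local file name expected for each URL in a list file.
# local_names is a hypothetical helper; it assumes each page was saved
# under the last component of its URL path.
local_names() {
    while read -r url; do
        printf '%s -> %s\n' "$url" "$(basename "$url")"
    done < "$1"
}

# Demonstration on a temporary list holding two URLs from the example
list=$(mktemp)
printf '%s\n' \
    "http://www.codealias.info/topics/performance" \
    "http://www.codealias.info/topics/security" > "$list"
local_names "$list"
rm -f "$list"
```

Running local_names on pagelist.txt and comparing the result with the files in the directory is a quick way to spot a page that failed to download.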
Calculate the occurrence/density of each keyword in the top ranking pages
This is a pretty straightforward procedure. You just need to run the following script from the directory that holds the downloaded pages:
    for key in performance wireless
    do
        echo "Stats for $key :"
        total=0
        while read -r url
        do
            # grep -c counts the lines that contain the keyword,
            # not every individual occurrence
            count=$(grep -c "$key" "$(basename "$url")")
            printf '\t%s: %s\n' "$url" "$count"
            total=$((total + count))
        done < pagelist.txt
        printf '\t===> Total : %s\n\n' "$total"
    done
The result is as follows:
    Stats for performance :
        http://www.codealias.info/topics/performance: 37
        http://www.codealias.info/technotes/impact_of_wireless_handoff_delays_on_voip_qos: 6
        http://www.codealias.info/technotes/performance_evaluation_of_wireless_security_systems_part_1: 19
        http://www.codealias.info/technotes/802.11_handoff_performance_--_bibliography: 16
        http://www.codealias.info/technotes/performance_evaluation_of_wireless_security_systems_part_2_-_the_802.11_handoff_process: 15
        http://www.codealias.info/topics/security: 18
        http://www.codealias.info/technotes/performance_of_eap_and_radius_authentication_in_roaming_scenarios: 18
        http://www.codealias.info/technotes/performance_evaluation_of_wireless_security_systems_part_3_-_factors_affecting_handoff_performance: 15
        http://www.codealias.info/technotes/network_communication_properties_and_qos_of_voip: 3
        ===> Total : 147

    Stats for wireless :
        http://www.codealias.info/topics/performance: 25
        http://www.codealias.info/technotes/impact_of_wireless_handoff_delays_on_voip_qos: 16
        http://www.codealias.info/technotes/performance_evaluation_of_wireless_security_systems_part_1: 31
        http://www.codealias.info/technotes/802.11_handoff_performance_--_bibliography: 5
        http://www.codealias.info/technotes/performance_evaluation_of_wireless_security_systems_part_2_-_the_802.11_handoff_process: 21
        http://www.codealias.info/topics/security: 49
        http://www.codealias.info/technotes/performance_of_eap_and_radius_authentication_in_roaming_scenarios: 7
        http://www.codealias.info/technotes/performance_evaluation_of_wireless_security_systems_part_3_-_factors_affecting_handoff_performance: 24
        http://www.codealias.info/technotes/network_communication_properties_and_qos_of_voip: 7
        ===> Total : 185
The conclusion is that codealias.info uses the word 'performance' 147 times (counted as matching lines, since grep -c is used) in its top-ranking pages for 'wireless performance'. The page with the highest count, 37, is 'http://www.codealias.info/topics/performance'.
The word 'wireless' was counted 185 times in the same pages. The page with the highest count, 49, is 'http://www.codealias.info/topics/security'.
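The figures above are raw counts. If a density in the strict sense is wanted (keyword occurrences divided by the total number of words on the page), it could be computed along the following lines. This is a sketch, not the method used to produce the numbers above: keyword_density is a hypothetical helper name, the tag stripping is deliberately crude (it only removes tags that open and close on the same line), and matching is case-insensitive.

```shell
# keyword_density FILE KEYWORD
# Hypothetical helper: prints "occurrences total_words density%".
keyword_density() {
    file=$1
    key=$2
    # Crude tag stripping: replace <...> sequences found on a single line by a space
    text=$(sed 's/<[^>]*>/ /g' "$file")
    # grep -o prints every match on its own line, so wc -l counts occurrences
    # (grep -c would count matching lines instead); -i ignores case
    hits=$(printf '%s\n' "$text" | grep -o -i -- "$key" | wc -l)
    words=$(printf '%s\n' "$text" | wc -w)
    awk -v h="$hits" -v w="$words" 'BEGIN {
        d = (w > 0) ? 100 * h / w : 0
        printf "%d %d %.1f%%\n", h, w, d
    }'
}

# Demonstration on a small hand-written page (invented sample text)
page=$(mktemp)
printf '%s\n' '<p>Wireless performance and wireless security</p>' \
              '<p>Wireless notes here</p>' > "$page"
keyword_density "$page" wireless      # prints: 3 8 37.5%
keyword_density "$page" performance   # prints: 1 8 12.5%
rm -f "$page"
```

To apply it to the downloaded pages, call keyword_density with the basename of each URL in the list in place of the temporary sample file.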