
Open Source Intelligence Gathering 101
A Penetration Test almost always needs to begin with an extensive Information Gathering phase. This post talks about how Open Sources of information on the Internet can be used to build a profile of the target. The gathered data can be used to identify servers, domains, version numbers, vulnerabilities, misconfigurations, exploitable endpoints and sensitive information leakages. Read on!
There is a ton of data that can be discovered via open source intelligence gathering techniques, especially for companies that have a large online presence. There is always some tiny piece of code, a tech forum question with elaborate details, a sub-domain that was long forgotten or even a PDF containing marketing material with metadata that can be used against a target site. Even simple Google searches often lead to interesting results. Here are some of the things that we do once we have the client’s (domain) name (in no particular order):
1. Whois lookup to find the admin contact and other email addresses. These email addresses very often exist as valid users on the application as well. Email addresses can be searched for in database leaks or through a search service like HaveIBeenPwned, which tells you whether an email address was found as part of a breach.

Apart from email addresses, whois queries can return IP history information, domain expiry dates and even phone numbers that can be used in Social Engineering attacks.
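As a rough illustration of this first step, the sketch below pulls email addresses out of raw whois output that you have already saved as text. The function name `extract_emails` and the sample text are made up for this example, and the regular expression is a deliberately loose approximation of an email address, not a strict validator.

```python
import re

# Loose pattern: good enough for triaging whois output, not a full validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(whois_text):
    """Return the unique email addresses in whois_text, lowercased and sorted."""
    return sorted({match.lower() for match in EMAIL_RE.findall(whois_text)})


# Hypothetical whois output, trimmed to the relevant lines.
sample = """
Registrant Email: Admin@example.org
Admin Email: admin@example.org
Tech Email: hostmaster@example.org
"""
print(extract_emails(sample))
```

The resulting list can then be checked against breach data or tried as usernames on the application.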

2. Run a Google advanced search using the site operator, to restrict results to the target domain, to find php (or any server side script filetype), txt or log files:
site:*.example.org ext:php | ext:txt | ext:log
On several occasions we have identified interesting files (log files, for example) that contain sensitive information and the full system path of the application using search queries like these. You can couple this query with the minus operator to exclude specific search results.
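If you run these searches against many domains, it can help to generate the query strings instead of typing them. The following is a minimal sketch; `build_dork` is a hypothetical helper, and the excluded term in the second call is only an example of how the minus operator might be used.

```python
def build_dork(domain, extensions, exclude_terms=()):
    """Build a Google query limited to `domain` for the given file extensions."""
    parts = ["site:*." + domain]
    # Join the extensions with the OR operator (|), as in the query above.
    parts.append(" | ".join("ext:" + ext for ext in extensions))
    # A leading minus removes results that match the term.
    parts.extend("-" + term for term in exclude_terms)
    return " ".join(parts)


print(build_dork("example.org", ["php", "txt", "log"]))
print(build_dork("example.org", ["log"], exclude_terms=["site:www.example.org"]))
```

The first call reproduces the query shown above; the second one drops results from the main www host so that lesser known sub domains stand out.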

3. Perform a search on the domain (and sub domains) for good old-fashioned documents. File types include PDF, Excel, Word and PowerPoint to begin with. These documents may contain information that you can use for other attacks. Often, the document’s metadata (author name etc.) contained in file properties can be used as a valid username on the application itself.
site:*.example.org ext:pdf | ext:doc | ext:docx | ext:ppt | ext:pptx | ext:xls | ext:xlsx | ext:csv
You can download these files locally and run them through a document metadata extractor or view properties of each file to see what information is leaked.
To see all the options that can be used for searching data, refer to https://www.google.co.in/advanced_search. Also, the Google Hacking Database (now on exploit-db) provides pre-crafted queries to search for specific and interesting things on the Internet.
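For the newer Office formats (docx, xlsx, pptx), the metadata check on downloaded documents can be done with nothing but the Python standard library, because these files are zip archives that store author details in `docProps/core.xml`. The sketch below is one possible approach, not a replacement for a full metadata extractor; `office_authors` is a hypothetical name, and older binary formats (doc, xls, ppt) and PDFs need other tooling.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core properties part of OOXML documents.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
}


def office_authors(path_or_file):
    """Return creator / last-modified-by names from a docx, xlsx or pptx file."""
    with zipfile.ZipFile(path_or_file) as archive:
        # Raises KeyError if the document carries no core properties part.
        root = ET.fromstring(archive.read("docProps/core.xml"))
    found = {}
    for label, tag in (("creator", "dc:creator"),
                       ("last_modified_by", "cp:lastModifiedBy")):
        element = root.find(tag, NS)
        if element is not None and element.text:
            found[label] = element.text.strip()
    return found


# Usage (hypothetical path to a file downloaded from the target site):
# print(office_authors("downloads/annual-report.docx"))
```

Names recovered this way are good candidates for the username list discussed later in this post.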

4. Check the robots.txt file for hidden, interesting directories. Most shopping carts, frameworks and content management systems have well defined directory structures. So the admin directory is a /admin or a /administration request away. If not, the robots.txt will very likely contain the directory name you seek.
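Once you have fetched the robots.txt file, pulling out the directories it tries to hide is a few lines of code. This is a minimal sketch that works on the file content as a string; `disallowed_paths` and the sample content are made up for illustration.

```python
def disallowed_paths(robots_txt):
    """Return the unique Disallow paths from robots.txt content, in file order."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()      # drop trailing comments
        field, _, value = line.partition(":")
        path = value.strip()
        # An empty Disallow value means "nothing is disallowed", so skip it.
        if field.strip().lower() == "disallow" and path and path not in paths:
            paths.append(path)
    return paths


# Hypothetical robots.txt content.
sample = """User-agent: *
Disallow: /admin/
Disallow: /backup/   # old dumps
Disallow:
Allow: /public/
"""
print(disallowed_paths(sample))
```

Each path in the result is worth requesting manually to see what the server returns.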

5. Look through the HTML source to identify carts/CMS/frameworks etc. Identifying the application type helps in focusing the attack on areas of the application that have vulnerable components (plugins and themes for example). For example, if you look at the page source and see wp-content then you can be fairly confident that you are looking at a WordPress site.
A lot of publicly available browser addons can also be used to identify website frameworks. Wappalyzer on Firefox does a pretty good job at identifying several different server types, server and client side frameworks and third party plugins on the site.
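The same idea can be scripted for a quick first pass over many pages. The sketch below looks for marker strings in page source; only the wp-content marker comes from this post, the other markers are commonly seen strings that we are assuming here for illustration, and real fingerprinting tools use far richer signatures.

```python
# Marker strings are illustrative; extend the table for your own targets.
MARKERS = {
    "WordPress": ["wp-content", "wp-includes"],
    "Drupal": ["sites/default/files"],
    "Joomla": ["com_content"],
}


def guess_platforms(html):
    """Return the platforms whose marker strings appear in the page source."""
    lowered = html.lower()
    return sorted(
        name
        for name, needles in MARKERS.items()
        if any(needle.lower() in lowered for needle in needles)
    )


page = '<link rel="stylesheet" href="/wp-content/themes/acme/style.css">'
print(guess_platforms(page))
```

A match is only a hint and should be confirmed by other means, since marker strings can appear in unrelated content.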

6. If the site you are looking at has been created by a third party vendor, you will very likely see a variant of “Powered by Third-Party-Developer-Company” somewhere at the bottom of the home page.
Following this trail to the contractor’s site can also be incredibly rewarding. Browsing through it may reveal the frameworks and version numbers that they build upon. It is also very likely that the contractor has a test/admin account on your client’s site as part of their development plan.
In my experience, many site administrators/developers often use passwords that are a variation of the company name (client’s company or the contractor’s company) and some numbers with/without special characters at the end. For example, if the contractor company was called “Example Developers” then 001Example, Example001, 00example, example00 and so on are good password candidates to try on your client website’s login panel.
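The pattern described above (company name plus a few digits at either end) is easy to turn into a short candidate list. This is a minimal sketch; `password_candidates` is a hypothetical helper, and the default number strings are simply the ones from the example above.

```python
from itertools import product


def password_candidates(company, numbers=("00", "001")):
    """Combine case variants of the first word of `company` with numeric
    prefixes and suffixes, keeping the order stable and dropping duplicates."""
    word = company.split()[0]
    variants = [word.capitalize(), word.lower()]
    candidates = []
    for variant, number in product(variants, numbers):
        for candidate in (number + variant, variant + number):
            if candidate not in candidates:
                candidates.append(candidate)
    return candidates


for candidate in password_candidates("Example Developers"):
    print(candidate)
```

Keeping the list this short matters in practice, because login panels with lockout policies leave room for only a handful of guesses per account.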
(Watch out for our next post on how we used this technique to compromise and gain access to a client’s server and run shell commands on it.)
7. Look through the LinkedIn profile of the company to identify senior managers, directors and non-technical staff. Very often, the weakest passwords belong to the non-technical management folk. Searching through the “About Us” page on the company website can also lead to finding soft targets.
Based on the discovery of a couple of emails, a standard format for usernames can be derived. Once the username format is understood, a list of email addresses and equivalent usernames can be created and then used to perform other attacks, including brute forcing login pages or exploiting weak password reset functionality. (On more than one occasion, searching for email addresses and possible usernames has led to complete application and server compromises due to the use of weak passwords.)
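Deriving the format from one known name/email pair and applying it to a list of staff names can be sketched as follows. The format table, the function names and the sample people are all assumptions made for this example; real organisations use many more variations.

```python
# A few common username formats; the table is illustrative, not exhaustive.
FORMATS = {
    "first.last": lambda first, last: first + "." + last,
    "flast": lambda first, last: first[0] + last,
    "firstl": lambda first, last: first + last[0],
    "first": lambda first, last: first,
}


def split_name(full_name):
    """Return (first, last) in lowercase, ignoring middle names."""
    parts = full_name.lower().split()
    return parts[0], parts[-1]


def infer_format(full_name, email):
    """Return the format name that reproduces the local part of `email`."""
    local_part = email.split("@", 1)[0].lower()
    first, last = split_name(full_name)
    for name, build in FORMATS.items():
        if build(first, last) == local_part:
            return name
    return None


def build_addresses(names, format_name, domain):
    """Apply a known format to a list of full names."""
    return [FORMATS[format_name](*split_name(n)) + "@" + domain for n in names]


fmt = infer_format("Jane Doe", "jdoe@example.org")
print(fmt)
print(build_addresses(["John Smith", "Priya Nair"], fmt, "example.org"))
```

It is worth confirming the inferred format against a second known address before building the full list.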
8. Perform IP address related checks. Very often applications can be compromised due to a different and weaker application hosted on the same IP (shared hosting). Using reverse IP lookups, you can identify additional targets to poke around. Bing has an excellent search-by-IP feature (the ip: operator).

The folks over at You Get Signal and IP Address provide a reverse lookup facility as well.

As part of the checks with IP addresses, it is important to also note the A and PTR records of a domain. Sometimes, due to a misconfiguration, a different site may be accessible when using the PTR or the A record of the site. This information can be obtained with the dig or the nslookup command:
dig -x 8.8.8.8
nslookup 8.8.8.8
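If you want the same check inside a script, Python's standard library covers it. The sketch below shows the name that a PTR query actually asks about, plus a helper that resolves the PTR record and then resolves that hostname forward again so the two can be compared. `ptr_name` and `ptr_and_forward` are hypothetical names, and the second function needs working DNS, so it is only shown as a commented usage line.

```python
import ipaddress
import socket


def ptr_name(ip):
    """Return the DNS name that a PTR query for `ip` looks up."""
    return ipaddress.ip_address(ip).reverse_pointer


def ptr_and_forward(ip):
    """Return (hostname from the PTR record, addresses that hostname resolves to)."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
    except OSError:          # no PTR record or resolver failure
        return None, []
    try:
        _name, _aliases, forward = socket.gethostbyname_ex(hostname)
    except OSError:          # PTR hostname does not resolve forward
        forward = []
    return hostname, forward


print(ptr_name("8.8.8.8"))   # 8.8.8.8.in-addr.arpa
# print(ptr_and_forward("8.8.8.8"))   # requires network access
```

A PTR hostname that resolves forward to a different address, or that serves a different site when browsed, is the kind of mismatch worth a closer look.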
9. Enumerate sub domains to find low hanging fruit and weaker entry points to the client’s hosting infrastructure. Sub domain enumeration is easily one of the most important steps in assessing and discovering assets that a client has exposed online; either deliberately as part of their business or accidentally due to a misconfiguration.
Sub domain enumeration can be done using various tools like dnsrecon, subbrute, knock.py, using Google’s site operator or sites like dnsdumpster and even virustotal.com. Most of these tools use a large dictionary of common descriptive words like admin, pages, people, hr, downloads, blog, dev, staging etc. These words are prepended to the primary domain (example.org, say) to create a list of possible sub domain names like admin.example.org, pages.example.org, people.example.org etc. Each of these names can then be checked against a DNS server to verify if the entry exists.
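The dictionary approach described above can be sketched in a few lines. This is not how any of the named tools are implemented; it is a minimal illustration in which `resolve` and `enumerate_subdomains` are hypothetical names, and the resolver is passed in as a parameter so the logic can be tried without touching the network.

```python
import socket


def resolve(hostname):
    """Return the IPv4 address for `hostname`, or None if it does not resolve."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        return None


def enumerate_subdomains(domain, words, resolver=resolve):
    """Prepend each word to `domain` and keep the names that resolve."""
    found = {}
    for word in words:
        candidate = word + "." + domain
        address = resolver(candidate)
        if address is not None:
            found[candidate] = address
    return found


# Offline demonstration with a stand-in resolver (a dict lookup).
fake_dns = {"admin.example.org": "192.0.2.1", "dev.example.org": "192.0.2.7"}
print(enumerate_subdomains("example.org", ["admin", "hr", "dev"], resolver=fake_dns.get))
```

Against a real target, check first whether a random, nonsensical name also resolves; if it does, the domain uses wildcard DNS and every candidate will appear to exist.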

10. Look at the HTTP status codes and response headers for different kinds of resource requests: a valid page, a non-existent page, a page that redirects, a directory name and so on. Look out for subtle typos, extra spaces and redundant values in the response headers.

Also, look out for CSP headers. These contain domain names and sources from where script loading may be allowed. Sometimes a typo in a domain name listed in a CSP header or an insecure JavaScript hosting CDN may be your only way to execute an XSS payload :)
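To review a CSP header quickly, it helps to split it into directives and list the hosts that are allowed to serve scripts. The sketch below is a simplified parser written for this post; the function names and the sample policy (including the misspelt CDN host) are made up, and it only handles the fallback from script-src to default-src.

```python
def csp_sources(header_value):
    """Map each CSP directive name to its list of source expressions."""
    policy = {}
    for chunk in header_value.split(";"):
        tokens = chunk.split()
        if tokens:
            # Browsers ignore repeated directives, so keep the first one only.
            policy.setdefault(tokens[0].lower(), tokens[1:])
    return policy


def script_hosts(header_value):
    """Return the host-like sources that are allowed to serve scripts."""
    policy = csp_sources(header_value)
    sources = policy.get("script-src", policy.get("default-src", []))
    # Skip quoted keywords ('self', nonces, hashes) and bare schemes (https:).
    return [s for s in sources if not s.startswith("'") and not s.endswith(":")]


# Hypothetical header value; note the typo in the second CDN host.
csp = ("default-src 'self'; "
       "script-src 'self' https://cdn.example.org https://cdn.exmaple.org; "
       "img-src *")
print(script_hosts(csp))
```

Each host in the output can then be checked: does the domain exist, is it registered to the client, and can anyone upload content to it?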
11. Search the domain name of the client through Shodan and Censys to find files, IP addresses, exposed services and error messages. The good folks at Shodan and Censys have painstakingly port scanned the Internet, enumerated services and categorised their findings, making them searchable with simple keywords. Both these services can be used to find a ton of interesting things including open cameras, Cisco devices, hospital facilities management servers, weakly configured telnet and snmp services and SCADA systems. Censys has been used in the past to find interesting endpoints that have hosted source code and entire docker images of complete apps.

12. Look up the client on code hosting services like GitHub, GitLab, Bitbucket etc. All sorts of interesting things can be found in code hosted online through searchable repositories, including web vulnerabilities, 0days in web apps, configuration issues, AWS and other secret keys.
Developers often commit code with production passwords or API access keys only to later realise and remove the sensitive information and make additional commits. However, using commit logs and checking out specific commits one can retrieve these sensitive pieces of information that can then be used to launch a full attack on the client’s hosted infrastructure.
Tools like Gitrob can be used to query GitHub and search for sensitive files in a specific organisation’s repositories from the command line itself.
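To show why removing a secret in a later commit does not help, here is a minimal sketch that scans patch text (for example the saved output of `git log -p` on a cloned repository) for strings shaped like AWS access key IDs. It is not how Gitrob works; `scan_patch` is a hypothetical helper, the single pattern is an assumption about what to look for, and the sample uses the placeholder key from AWS documentation.

```python
import re

# AWS access key IDs are commonly "AKIA" followed by 16 uppercase letters/digits.
AWS_KEY_RE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def scan_patch(patch_text):
    """Return (line_number, change, text) for added or removed lines with a key."""
    hits = []
    for number, line in enumerate(patch_text.splitlines(), start=1):
        # Only look at changed lines; skip the "+++ b/file" / "--- a/file" headers.
        if not line.startswith(("+", "-")) or line.startswith(("+++ ", "--- ")):
            continue
        if AWS_KEY_RE.search(line):
            change = "added" if line[0] == "+" else "removed"
            hits.append((number, change, line[1:].strip()))
    return hits


# Hypothetical patch in which a developer replaces a hard-coded key.
patch = "\n".join([
    "--- a/config.py",
    "+++ b/config.py",
    "@@ -1,2 +1,2 @@",
    '-AWS_KEY = "AKIAIOSFODNN7EXAMPLE"',
    '+AWS_KEY = os.environ["AWS_KEY"]',
])
print(scan_patch(patch))
```

The "removed" hit is the interesting one: the key is gone from the current code but is still sitting in the repository history.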
13. Browse the site’s HTML source to identify if the client hosts any static content on the cloud. Content like images, js and css files may be hosted on S3 buckets owned by the client. It may also be possible while performing standard reconnaissance to identify if the client uses cloud infrastructure to host static/dynamic content. In such cases finding buckets that the client uses can be really rewarding if the client has misconfigured permissions on the buckets. A ton of interesting information can be found in public facing buckets.
Tools like DigiNinja’s Bucket Finder can be used to automate the search process by brute forcing names of buckets. This tool requires a well curated list of bucket names and potential full URLs to be effective.
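Bucket names that appear in the page source are the best seeds for such a list. The sketch below extracts them from HTML by looking at amazonaws.com URLs in both the virtual-hosted style (bucket name in the hostname) and the path style (bucket name as the first path segment). The function names and sample markup are made up for this example, and the URL pattern is intentionally simple.

```python
import re
from urllib.parse import urlparse

# Absolute or protocol-relative URLs, up to the next quote, space or bracket.
URL_RE = re.compile(r"(?:https?:)?//[^\s\"'<>)]+")


def bucket_from_url(url):
    """Return the S3 bucket name referenced by `url`, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if not host.endswith(".amazonaws.com"):
        return None
    labels = host.split(".")
    for index, label in enumerate(labels):
        if label == "s3" or label.startswith("s3-"):
            if index > 0:
                # Virtual-hosted style: everything before the s3 label.
                return ".".join(labels[:index])
            # Path style: the bucket is the first path segment.
            return parsed.path.lstrip("/").split("/", 1)[0] or None
    return None


def find_buckets(html):
    """Return the sorted, unique bucket names referenced in the page source."""
    names = {bucket_from_url(url) for url in URL_RE.findall(html)}
    names.discard(None)
    return sorted(names)


page = ('<img src="https://acme-assets.s3.amazonaws.com/logo.png">'
        '<script src="//s3.eu-west-1.amazonaws.com/acme-js/app.js"></script>')
print(find_buckets(page))
```

Names found this way also hint at the client's naming convention, which makes the brute force wordlist far more targeted.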


OSINT is an ever growing and continuously evolving field of study in itself. Using the techniques I’ve listed above and others, it is possible to build a profile of the target and reveal several weaknesses, sometimes without sending a single packet from your system their way.
This brings us to the end of this post. If there are techniques that you frequently use that have yielded you interesting results and if you would like to share those, please do leave a comment.
Until next time, happy hacking!!
At Appsecco we provide advice, testing, training and insight around software and website security, especially anything that’s online, and its associated hosting infrastructure — Websites, e-commerce sites, online platforms, mobile technology, web-based services etc.
If something is accessible from the internet or a person’s computer we can help make sure it is safe and secure.
