This article will explore manual and automated techniques for gathering data from websites, including:
✅ Web Scraping — Extracting data from websites using scripts and automated tools.
✅ OSINT Techniques — Collecting public data such as emails, subdomains, metadata, and API keys.
✅ Automated Tools — Using Python scripts and OSINT frameworks for efficient data collection.
⚠️ Disclaimer: This article is for educational and ethical purposes only. Unauthorized data extraction from websites without permission may violate laws and website terms of service. Always seek permission before scraping or collecting data.
1. Finding Public Information on a Website (Manual Methods)
Before automating the process, it’s important to start with basic reconnaissance techniques.
🔎 Using Google Dorking for Hidden Information
Google Dorking helps find exposed files, directories, emails, and credentials using advanced search operators:
✅ Find login pages:
site:example.com inurl:login
✅ Discover exposed documents:
site:example.com filetype:pdf OR filetype:doc OR filetype:xls
✅ Look for exposed API keys:
site:github.com "API_KEY=" "password=" "token="
✅ Find admin panels:
site:example.com inurl:admin
💡 Google Dorking can uncover sensitive data that is publicly indexed but not meant to be accessed directly.
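If you want to run several dorks against one target, a short helper can build the search URLs for you. This is only a sketch: the domain and the dork list are placeholders, and the generated URLs are meant to be opened manually rather than scraped.
from urllib.parse import quote_plus
# Placeholder target and a few of the dorks shown above
domain = "example.com"
dorks = [
    f"site:{domain} inurl:login",
    f"site:{domain} filetype:pdf OR filetype:doc OR filetype:xls",
    f"site:{domain} inurl:admin",
]
# Print a ready-to-open Google search URL for each dork
for dork in dorks:
    print(f"https://www.google.com/search?q={quote_plus(dork)}")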
🕵️‍♂️ Extracting Subdomains with OSINT Tools
Finding subdomains can reveal hidden assets like admin portals, test environments, and APIs.
Automated Approach: Using Subfinder & Amass
subfinder -d example.com -o subdomains.txt
amass enum -passive -d example.com
These tools enumerate subdomains by querying public sources like SSL certificates, WHOIS records, and search engines.
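If you prefer to script the passive lookup yourself, certificate transparency logs are one of the public sources these tools rely on. A minimal sketch that queries crt.sh (assuming its JSON endpoint is reachable; the domain is a placeholder):
import requests
domain = "example.com"  # placeholder target
# crt.sh returns certificate transparency matches as JSON
resp = requests.get(f"https://crt.sh/?q=%25.{domain}&output=json", timeout=30)
subdomains = set()
for entry in resp.json():
    # name_value can hold several names separated by newlines
    for name in entry.get("name_value", "").splitlines():
        if name.endswith(domain):
            subdomains.add(name.lstrip("*."))
print("\n".join(sorted(subdomains)))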
🌐 Checking Website Technologies with Wappalyzer
Identifying the tech stack of a website can help in vulnerability assessment.
Automated Approach: Wappalyzer CLI
wappalyzer https://example.com
💡 This will reveal the CMS (WordPress, Joomla), backend technologies (PHP, Python, Node.js), JavaScript frameworks, and the web server in use.
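If the Wappalyzer CLI is not available, a rough fingerprint can be pulled with a few lines of Python by inspecting response headers and the generator meta tag. This is only a sketch and catches just what the site openly advertises:
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://example.com", timeout=10)  # placeholder URL
# Server-side hints often appear in response headers
for header in ("Server", "X-Powered-By", "X-Generator"):
    if header in resp.headers:
        print(f"{header}: {resp.headers[header]}")
# Many CMSs announce themselves in a <meta name="generator"> tag
soup = BeautifulSoup(resp.text, "html.parser")
generator = soup.find("meta", attrs={"name": "generator"})
if generator:
    print("Generator:", generator.get("content"))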
2. Web Scraping for Data Extraction (Automated Approach)
Web scraping automates the extraction of emails, phone numbers, links, or product details from websites.
🛠️ Using Python for Web Scraping
Python’s BeautifulSoup and requests libraries can be used to extract information from a website.
Python Script to Extract All Links from a Web Page
import requests
from bs4 import BeautifulSoup
# Target URL
url = "https://example.com"
# Send request
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")
# Extract and print all links
for link in soup.find_all("a"):
    print(link.get("href"))
💡 Modify this script to extract emails, images, or specific text from web pages.
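For example, swapping the link loop for a regular expression turns the same script into a simple email harvester. A sketch with a deliberately simple pattern (it will miss obfuscated addresses):
import re
import requests
response = requests.get("https://example.com", timeout=10)  # placeholder URL
# Simple pattern for plain-text email addresses
emails = set(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", response.text))
for email in sorted(emails):
    print(email)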
📩 Extracting Emails from Websites with theHarvester
Emails can be collected from public search engines, SSL certificates, and metadata.
Automated Approach: Using theHarvester
theHarvester -d example.com -l 200 -b google
This will return email addresses, subdomains, and linked hosts.
📂 Extracting Hidden Files and Directories with Gobuster
Web applications often have unlinked directories or configuration files that are not intended for public access.
Automated Approach: Using Gobuster
gobuster dir -u https://example.com -w /usr/share/wordlists/dirbuster/directory-list-2.3-medium.txt
💡 This will brute-force directories like /admin, /backup, and /config.
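Gobuster is the right tool for large wordlists, but the underlying idea is easy to sketch in Python if you want to see what it does. The short path list below is a placeholder for a real wordlist:
import requests
base = "https://example.com"  # placeholder target
paths = ["admin", "backup", "config", "login", "uploads"]  # tiny illustrative wordlist
for path in paths:
    url = f"{base}/{path}"
    resp = requests.get(url, timeout=10, allow_redirects=False)
    # Anything other than a 404 deserves a closer look
    if resp.status_code != 404:
        print(resp.status_code, url)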
3. Extracting Metadata from Images and Documents
Documents and images often contain hidden metadata that can reveal author names, geolocation, and software details.
🔍 Using ExifTool to Extract Metadata
exiftool file.jpg
💡 Metadata can reveal GPS coordinates, timestamps, and even device information.
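If you need the same data inside a Python workflow, Pillow can read basic EXIF tags; ExifTool remains the more complete option, and the filename below is a placeholder:
from PIL import Image
from PIL.ExifTags import TAGS
img = Image.open("file.jpg")  # placeholder filename
exif = img.getexif()
# Map numeric EXIF tag IDs to readable names
for tag_id, value in exif.items():
    print(f"{TAGS.get(tag_id, tag_id)}: {value}")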
4. Monitoring API Endpoints for Sensitive Data
Public APIs may expose sensitive information if improperly configured.
📡 Using Wayback Machine to Find Old API Endpoints
waybackurls example.com | grep "/api/"
This will list historical API endpoints that may still be functional.
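If waybackurls is not installed, the Wayback Machine's CDX API exposes the same historical URLs and can be queried directly. A minimal sketch (the domain is a placeholder):
import requests
domain = "example.com"  # placeholder target
params = {
    "url": f"{domain}/*",
    "fl": "original",       # return only the archived URL
    "collapse": "urlkey",   # de-duplicate results
}
resp = requests.get("https://web.archive.org/cdx/search/cdx", params=params, timeout=60)
# Keep only archived URLs that look like API endpoints
for url in resp.text.splitlines():
    if "/api/" in url:
        print(url)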
🔎 Checking API Responses for Sensitive Data
Use Postman or Burp Suite to inspect API responses and search for the following (a scripted check is sketched after this list):
✅ Hardcoded credentials
✅ API keys
✅ User information
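A quick scripted pass can flag suspicious fields before you dig in manually. A minimal sketch that walks a JSON response and reports sensitive-sounding keys (the endpoint is a placeholder):
import requests
SENSITIVE = ("password", "secret", "token", "api_key", "apikey")
def find_sensitive(obj, path=""):
    # Recursively walk dicts and lists, reporting keys that look sensitive
    if isinstance(obj, dict):
        for key, value in obj.items():
            if any(word in key.lower() for word in SENSITIVE):
                print(f"Possible sensitive field: {path}/{key} = {value!r}")
            find_sensitive(value, f"{path}/{key}")
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            find_sensitive(item, f"{path}[{i}]")
resp = requests.get("https://example.com/api/v1/users", timeout=10)  # placeholder endpoint
find_sensitive(resp.json())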
5. Automating OSINT with Recon-ng
Recon-ng is an OSINT framework that automates data gathering from multiple sources.
🔧 Automated Approach: Using Recon-ng
recon-ng
> marketplace install all
> modules load recon/domains-hosts/hackertarget
> options set SOURCE example.com
> run
💡 This will extract hosts, subdomains, and IPs related to the target.
6. Extracting Social Media & WHOIS Information
📌 WHOIS Lookup for Domain Information
whois example.com
This reveals domain ownership details, registration date, and hosting provider.
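The lookup is easy to script as well. This sketch assumes the third-party python-whois package is installed (it is imported simply as whois); which fields are populated varies by registrar:
import whois  # provided by the python-whois package (assumed installed)
record = whois.whois("example.com")  # placeholder domain
print("Registrar:   ", record.registrar)
print("Created:     ", record.creation_date)
print("Name servers:", record.name_servers)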
📌 Extracting Social Media Information with Sherlock
Sherlock finds a target’s social media accounts across multiple platforms.
python3 sherlock.py username
7. Protecting Yourself from OSINT & Scraping Attacks
Organizations can prevent data scraping and OSINT threats by:
🔹 Blocking known scrapers & bots using rate limiting and CAPTCHAs.
🔹 Using robots.txt to restrict access to sensitive files.
🔹 Implementing security headers (Content-Security-Policy, X-Frame-Options); a minimal example follows this list.
🔹 Regularly scanning for exposed credentials in repositories and logs.
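As a concrete example of the security-header point above, a minimal Flask sketch that attaches the headers to every response; any web framework or reverse proxy has an equivalent hook:
from flask import Flask
app = Flask(__name__)
@app.after_request
def add_security_headers(response):
    # Restrict where content may load from and block framing/MIME sniffing
    response.headers["Content-Security-Policy"] = "default-src 'self'"
    response.headers["X-Frame-Options"] = "DENY"
    response.headers["X-Content-Type-Options"] = "nosniff"
    return response
@app.route("/")
def index():
    return "Hello"
if __name__ == "__main__":
    app.run()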
Conclusion: Automating Information Extraction from Websites
🔹 Web scraping and OSINT tools allow security researchers to extract emails, subdomains, metadata, and API endpoints from websites.
🔹 Automated tools like theHarvester, Gobuster, and Recon-ng make reconnaissance faster and more efficient.
🔹 Organizations must secure their assets by limiting public exposure of sensitive data.
🔹 Using proper security measures prevents malicious actors from exploiting public information.