JavaScript Web Scraping and Link Analysis
Discover powerful JavaScript methods for web scraping and link analysis. This guide provides examples for capturing links, emails, images, stylesheets, anchor links, and more. Learn efficient techniques to extract valuable information from web pages and analyze link structures. Unless noted otherwise, the snippets are meant to be run from the browser's DevTools console on the page you want to analyze.
Capture All Links (Alternative)
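A minimal sketch of this technique using `querySelectorAll` (the `captureAllLinks` helper name is illustrative):

```javascript
// Collect the absolute URL of every <a href> element.
// `root` can be a Document or any element supporting querySelectorAll.
function captureAllLinks(root) {
  return Array.from(root.querySelectorAll('a[href]'), (a) => a.href);
}
// Browser console: console.table(captureAllLinks(document));
```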
Capture Links Using getElementsByTagName
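A sketch of the `getElementsByTagName` variant (helper name illustrative):

```javascript
// getElementsByTagName returns a live HTMLCollection, not an array,
// so copy it with Array.from before mapping to hrefs.
function captureLinksByTagName(doc) {
  return Array.from(doc.getElementsByTagName('a')).map((a) => a.href);
}
// Browser console: captureLinksByTagName(document);
```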
Capture Links Using getElementsByTagName (Alternative)
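One plausible alternative form of the same idea, using spread syntax instead of `Array.from` (helper name illustrative):

```javascript
// Spread the HTMLCollection into an array, then map each anchor to its href.
function captureLinksByTagNameAlt(doc) {
  return [...doc.getElementsByTagName('a')].map((a) => a.href);
}
```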
Capture Links Using a for Loop
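A sketch with a classic index-based loop (helper name illustrative):

```javascript
// Iterate the HTMLCollection through its length/index API; useful where
// iterator support cannot be assumed.
function captureLinksWithForLoop(doc) {
  const anchors = doc.getElementsByTagName('a');
  const urls = [];
  for (let i = 0; i < anchors.length; i++) {
    urls.push(anchors[i].href);
  }
  return urls;
}
```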
Capture Emails Using Regular Expression
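A sketch that scans a block of text with a deliberately simplified email pattern (helper name illustrative; full RFC 5322 matching is far messier):

```javascript
// Match simple user@domain.tld addresses in a block of text.
// The pattern is intentionally loose and may miss exotic addresses.
function captureEmails(text) {
  return text.match(/[\w.+-]+@[\w-]+\.[\w.-]+/g) || [];
}
// Browser console: captureEmails(document.body.innerText);
```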
Capture Internal Links
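A sketch that keeps only links pointing at the page's own host (helper name illustrative):

```javascript
// Keep only anchors whose hostname matches the given host.
function captureInternalLinks(doc, host) {
  return Array.from(doc.querySelectorAll('a[href]'))
    .filter((a) => a.hostname === host)
    .map((a) => a.href);
}
// Browser console: captureInternalLinks(document, location.hostname);
```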
Capture External Links
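The mirror image of the internal-link filter (helper name illustrative):

```javascript
// Keep links pointing at a different hostname. Anchors such as mailto:
// links have an empty hostname and are skipped.
function captureExternalLinks(doc, host) {
  return Array.from(doc.querySelectorAll('a[href]'))
    .filter((a) => a.hostname && a.hostname !== host)
    .map((a) => a.href);
}
// Browser console: captureExternalLinks(document, location.hostname);
```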
Capture Unique URLs
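A sketch that deduplicates with a `Set` (helper name illustrative):

```javascript
// A Set drops duplicate hrefs in a single pass; spreading it back
// into an array preserves first-seen order.
function captureUniqueUrls(doc) {
  return [...new Set(Array.from(doc.getElementsByTagName('a'), (a) => a.href))];
}
```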
Capture PDF Links
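A sketch that filters hrefs by extension (helper name illustrative):

```javascript
// Match hrefs ending in .pdf, case-insensitively, allowing a trailing
// query string or fragment.
function capturePdfLinks(doc) {
  return Array.from(doc.querySelectorAll('a[href]'))
    .map((a) => a.href)
    .filter((url) => /\.pdf($|[?#])/i.test(url));
}
```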
Capture Download Links
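One reading of "download links" is anchors carrying the HTML `download` attribute; a sketch under that assumption (helper name illustrative):

```javascript
// Anchors with a `download` attribute ask the browser to save the
// target file instead of navigating to it.
function captureDownloadLinks(doc) {
  return Array.from(doc.querySelectorAll('a[download]'), (a) => a.href);
}
```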
Capture Mailto Links
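A sketch using an attribute-prefix selector (helper name illustrative):

```javascript
// Anchors whose href starts with the mailto: scheme.
function captureMailtoLinks(doc) {
  return Array.from(doc.querySelectorAll('a[href^="mailto:"]'), (a) => a.href);
}
```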
Capture Tel Links
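The same prefix-selector idea for click-to-call links (helper name illustrative):

```javascript
// Anchors whose href starts with the tel: scheme.
function captureTelLinks(doc) {
  return Array.from(doc.querySelectorAll('a[href^="tel:"]'), (a) => a.href);
}
```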
Capture Links with Specific Text
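A sketch that matches on the anchor's visible text (helper name illustrative):

```javascript
// Case-insensitive substring match on each anchor's text content.
function captureLinksWithText(doc, text) {
  const needle = text.toLowerCase();
  return Array.from(doc.querySelectorAll('a[href]'))
    .filter((a) => (a.textContent || '').toLowerCase().includes(needle))
    .map((a) => a.href);
}
// Browser console: captureLinksWithText(document, 'contact');
```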
Capture Anchor Links
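Interpreting "anchor links" as in-page fragment links, a sketch (helper name illustrative):

```javascript
// Fragment links start with "#". Read the raw attribute so the full page
// URL isn't prepended the way the .href property would.
function captureAnchorLinks(doc) {
  return Array.from(doc.querySelectorAll('a[href^="#"]'), (a) => a.getAttribute('href'));
}
```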
Capture Chrome Tabs URLs
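This one cannot run on an ordinary page: `chrome.tabs.query` exists only in a Chrome extension context with the `tabs` permission. A sketch that takes the tabs API as a parameter (helper name illustrative):

```javascript
// Resolve with the URL of every open tab. `tabsApi` is expected to be
// chrome.tabs inside an extension that holds the "tabs" permission.
function captureTabUrls(tabsApi) {
  return new Promise((resolve) => {
    tabsApi.query({}, (tabs) => resolve(tabs.map((tab) => tab.url)));
  });
}
// In an extension: captureTabUrls(chrome.tabs).then(console.log);
```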
Capture Iframe Sources
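A sketch collecting embedded frame URLs (helper name illustrative):

```javascript
// The src of every iframe that declares one.
function captureIframeSources(doc) {
  return Array.from(doc.querySelectorAll('iframe[src]'), (f) => f.src);
}
```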
Capture All Image URLs to a Tab
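A sketch that lists the image URLs as plain links in a new tab (helper names illustrative; note that `window.open` can return null under a popup blocker):

```javascript
// Gather every image URL on the page.
function collectImageUrls(doc) {
  return Array.from(doc.images, (img) => img.src);
}

// Write the URLs into a new tab as a plain list of clickable links.
function openUrlListInTab(urls, win = window) {
  const tab = win.open();
  if (!tab) return; // popup blocker may return null
  const items = urls.map((url) => `<li><a href="${url}" target="_blank">${url}</a></li>`).join('');
  tab.document.write(`<ul>${items}</ul>`);
}
// Browser console: openUrlListInTab(collectImageUrls(document));
```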
Capture All Images to a Tab and Show a Fixed-Size Preview
// Every image on the page, mapped to absolute URLs.
const images = Array.from(document.images);
const imageUrls = images.map((image) => image.src);
// Each URL becomes a link wrapping a 50x50 thumbnail.
const anchorTags = imageUrls.map((url) => `<a href="${url}" target="_blank"><img src="${url}" width="50" height="50"></a>`);
// window.open() returns null when a popup blocker intervenes.
const newTab = window.open();
if (newTab) {
  newTab.document.write('<ul style="list-style-type:none; padding: 0;">' + anchorTags.map((tag) => `<li>${tag}</li>`).join('') + '</ul>');
}
Capture All Images to a Tab and Show a Preview at Default Size
// Same as above, but the <img> tags carry no width/height,
// so each preview renders at its natural size.
const images = Array.from(document.images);
const imageUrls = images.map((image) => image.src);
const anchorTags = imageUrls.map((url) => `<a href="${url}" target="_blank"><img src="${url}"></a>`);
// window.open() returns null when a popup blocker intervenes.
const newTab = window.open();
if (newTab) {
  newTab.document.write('<ul style="list-style-type:none; padding: 0;">' + anchorTags.map((tag) => `<li>${tag}</li>`).join('') + '</ul>');
}
Capture Sources of Elements with src Attributes
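A sketch using the bare `[src]` attribute selector (helper name illustrative):

```javascript
// The [src] selector matches any element with a src attribute:
// img, script, iframe, audio, video, source, and so on.
function captureAllSources(doc) {
  return Array.from(doc.querySelectorAll('[src]'), (el) => el.src);
}
```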
Web Crawler with Elapsed Time
const crawledUrls = new Set();
const pendingUrls = [window.location.href];

async function crawl() {
  const startTime = Date.now();
  while (pendingUrls.length) {
    const url = pendingUrls.pop();
    if (crawledUrls.has(url)) continue;
    console.log(`Crawling ${url}`);
    try {
      const response = await fetch(url);
      const text = await response.text();
      // Parse the fetched HTML so its links can be queued. Note: relative
      // hrefs in the parsed document resolve against the current page's
      // URL, not against `url`.
      const doc = new DOMParser().parseFromString(text, 'text/html');
      for (const a of doc.getElementsByTagName('a')) {
        const href = a.href;
        // Queue only same-origin links: cross-origin fetches are blocked
        // by CORS anyway, and the restriction keeps the queue bounded.
        if (href.startsWith(window.location.origin) &&
            !crawledUrls.has(href) &&
            !pendingUrls.includes(href)) {
          pendingUrls.push(href);
        }
      }
    } catch (e) {
      console.error(`Failed to crawl "${url}": ${e}`);
    }
    crawledUrls.add(url);
  }
  const elapsedTime = Date.now() - startTime;
  console.log('Finished crawling', crawledUrls.size, 'URLs');
  console.log('Elapsed time:', elapsedTime, 'ms');
}

crawl();