Concurrent Web Crawler
The brief
Build a polite, bounded-concurrency web crawler. Given a seed URL and a depth limit, fetch HTML, extract links, queue them up, and repeat — across a pool of N worker threads — until either the queue drains or a time budget elapses.
The deliverable is a CLI: crawler <seed-url> --depth 3 --workers 8 --max-pages 1000. Output is a CSV of (url, status_code, bytes, fetch_ms, outlinks) rows.
This capstone exercises the entire concurrency phase end-to-end. By the end you’ll have written a thread-safe queue, a worker pool, a graceful-shutdown protocol, and a politeness layer (per-host rate limiting). It’s also the kind of project that’s bigger than it looks — budget conservatively, build the core first.
Scope (1 weekend)
The deliverable has six parts:
- HTTP client. GET a URL, return body + status + bytes + elapsed.
libcurlis the obvious choice;cpp-httplibfor a simpler header-only option. - Link extractor. A small HTML scanner that pulls
hrefURLs from<a>tags. Lenient — production crawlers use real HTML parsers, but a regex overhref="..."covers ~95% of links and is fine for this scope. - URL normalizer. Resolve relative URLs against the page’s base.
Lowercase the host. Strip fragments. Use
std::filesystem::pathor a small URL class. - Work queue. Thread-safe FIFO of pending URLs, with a “no more
work coming” signal so workers know to exit. A
std::queue+std::mutex+std::condition_variableis sufficient; a lock-free queue is a stretch goal. - Worker pool. N
std::jthreads, each looping: pop URL, fetch, parse links, enqueue new URLs, record result. Respect astop_tokenfor graceful shutdown. - Output. As fetches complete, append a CSV row. A single writer
thread receiving results via a
std::mpsc-shaped channel avoids contended file I/O.
The shape
The work queue, with the shutdown signal as a first-class member:
#include <queue>
#include <mutex>
#include <condition_variable>
#include <optional>
#include <string>
class WorkQueue {
std::queue<std::string> q_;
std::mutex m_;
std::condition_variable cv_;
bool done_ = false;
public:
void push(std::string url) {
{
std::lock_guard lock(m_);
q_.push(std::move(url));
}
cv_.notify_one();
}
// Returns nullopt when the queue is empty AND done_ is set.
std::optional<std::string> pop_or_done() {
std::unique_lock lock(m_);
cv_.wait(lock, [this] { return !q_.empty() || done_; });
if (q_.empty()) return std::nullopt; // done == true
auto url = std::move(q_.front());
q_.pop();
return url;
}
void close() {
{
std::lock_guard lock(m_);
done_ = true;
}
cv_.notify_all();
}
};
Note the careful pattern: lock guards scope to the minimum, predicate
wait handles spurious wakeups, close() notifies all so every worker
sees the shutdown.
The crawl loop
A worker’s main loop:
void worker(std::stop_token stop, WorkQueue& work, ResultChannel& results,
HttpClient& http) {
while (!stop.stop_requested()) {
auto url_opt = work.pop_or_done();
if (!url_opt) break; // queue drained
auto& url = *url_opt;
auto response = http.get(url); // returns std::expected<Response, Error>
if (!response) {
results.send({url, 0, 0, 0, {}});
continue;
}
auto links = extract_links(response->body, url);
for (const auto& link : links) {
if (should_follow(link)) work.push(link);
}
results.send({url, response->status, response->bytes,
response->elapsed_ms, std::move(links)});
}
}
The interaction between stop_token and the pop_or_done() is the
tricky part. If stop_token fires while a worker is waiting on the
condition variable, you also need to call work.close() from the
stopper so the wait wakes up. The std::stop_callback is the right
hook for that.
When the queue is “done”
A subtle question. The crawl is finished when:
- The queue is empty AND
- No worker is currently processing a URL that might add more URLs.
A simple atomic counter does it:
std::atomic<int> in_flight{0};
// Before pushing or starting work:
in_flight.fetch_add(1, std::memory_order_relaxed);
// After finishing the work that came from a pop:
if (in_flight.fetch_sub(1, std::memory_order_acq_rel) == 1) {
// Last in-flight just finished. If the queue is also empty, we're done.
work.maybe_close_if_empty();
}
The exact protocol is finicky — it’s a worthwhile design exercise to
get right. The acq_rel ordering is needed because the “is queue
empty” check has to see all pushes that happened before the
fetch_sub.
Politeness
A real crawler must not hammer servers. At minimum:
- Per-host rate limit. No more than N requests/sec to a given host.
An
std::unordered_map<std::string, std::chrono::time_point>of “last request time per host” + a sleep before the next request. robots.txt. Fetch/robots.txtfor each host on first contact; honorDisallow:entries.- A real User-Agent. Include contact info:
"my-crawler/0.1 (you@example.com)". Operators block anonymous-looking traffic on sight.
Skipping politeness for a learning project on a single laptop is fine; running it against the public web without these is not.
Implementation rubric
Your code should hit these targets:
- No data races. All cross-thread state is either
std::atomicor protected by a mutex. Run with ThreadSanitizer (-fsanitize=thread) and verify zero warnings. - Graceful shutdown. Pressing Ctrl-C drains in-flight work,
flushes the results channel, and exits cleanly. No
std::terminateon thread destructors. - Bounded memory. The work queue has a max size; pushers block if it’s full. Otherwise a wide BFS over the public web exhausts memory.
- Tests. Mock the HTTP client (return canned responses). Test the crawler logic end-to-end with a fixture site of 5-10 pages.
- Output is deterministic given the fixture. Same seed + max-pages → same set of crawled URLs (the order may differ because of threading; just check the set).
Stretch goals
In order of difficulty:
- Async I/O instead of threads. Use
boost::asioorcpp-httplib’s async mode. One thread, N concurrent in-flight requests. Compare throughput against the thread-pool version on a small fixture. - Lock-free queue. Replace
std::queue + mutexwithmoodycamel::ConcurrentQueue(or implement one). Compare contention under load. - Distributed. Split the URL space by hash and have N processes each crawl their share. Coordinate via Redis or a flat file.
- A real HTML parser. Use
gumbo-parserorlexbor. Get genuinely accurate link extraction. Discover edge cases.
What you’ll have learned
By the end:
- Built a thread pool with a graceful-shutdown protocol — the pattern under every job queue in production.
- Used
std::condition_variable+std::mutexfor a non-trivial wait-for-work pattern. - Designed a shared atomic counter with the right
memory_orderfor the “is anyone still working?” question. - Profiled a concurrent program — found out which step is the bottleneck (almost always network, sometimes parsing, rarely the worker bookkeeping).
- Felt where the ThreadSanitizer catches bugs you couldn’t see by reading the code.
- Faced the question every concurrent program has to answer: how do we know we’re done?
Reference reading: Anthony Williams’s C++ Concurrency in Action (the
single best book for this material); the crawler example in the
Boost.Asio docs; Herb Sutter’s “Atomic Weapons” talks for the memory
model in motion.