capstone · advanced · 1 week

Concurrent Web Crawler

concepts: thread pool · std::future · lock-free queue · structured concurrency · std::atomic
builds on: Threads, `std::async`, and Futures · Atomics + the Memory Model · RAII + Smart Pointers

The brief

Build a polite, bounded-concurrency web crawler. Given a seed URL and a depth limit, fetch HTML, extract links, queue them up, and repeat — across a pool of N worker threads — until either the queue drains or a time budget elapses.

The deliverable is a CLI: crawler <seed-url> --depth 3 --workers 8 --max-pages 1000. Output is a CSV of (url, status_code, bytes, fetch_ms, outlinks) rows.

This capstone exercises the entire concurrency phase end-to-end. By the end you'll have written a thread-safe queue, a worker pool, a graceful-shutdown protocol, and a politeness layer (per-host rate limiting). It's also the kind of project that's bigger than it looks — budget conservatively, build the core first.

Scope (1 weekend)

The deliverable has six parts:

HTTP client. GET a URL, return body + status + bytes + elapsed. libcurl is the obvious choice; cpp-httplib for a simpler header-only option.
Link extractor. A small HTML scanner that pulls href URLs from <a> tags. Lenient — production crawlers use real HTML parsers, but a regex over href="..." covers ~95% of links and is fine for this scope.
URL normalizer. Resolve relative URLs against the page's base. Lowercase the host. Strip fragments. Use std::filesystem::path or a small URL class.
Work queue. Thread-safe FIFO of pending URLs, with a "no more work coming" signal so workers know to exit. A std::queue + std::mutex + std::condition_variable is sufficient; a lock-free queue is a stretch goal.
Worker pool. N std::jthreads, each looping: pop URL, fetch, parse links, enqueue new URLs, record result. Respect a stop_token for graceful shutdown.
Output. As fetches complete, append a CSV row. A single writer thread receiving results via a std::mpsc-shaped channel avoids contended file I/O.

The shape

The work queue, with the shutdown signal as a first-class member:

#include <queue>
#include <mutex>
#include <condition_variable>
#include <optional>
#include <string>

class WorkQueue {
  std::queue<std::string> q_;
  std::mutex              m_;
  std::condition_variable cv_;
  bool                    done_ = false;

 public:
  void push(std::string url) {
    {
      std::lock_guard lock(m_);
      q_.push(std::move(url));
    }
    cv_.notify_one();
  }

  // Returns nullopt when the queue is empty AND done_ is set.
  std::optional<std::string> pop_or_done() {
    std::unique_lock lock(m_);
    cv_.wait(lock, [this] { return !q_.empty() || done_; });
    if (q_.empty()) return std::nullopt;     // done == true
    auto url = std::move(q_.front());
    q_.pop();
    return url;
  }

  void close() {
    {
      std::lock_guard lock(m_);
      done_ = true;
    }
    cv_.notify_all();
  }
};

Note the careful pattern: lock guards scope to the minimum, predicate wait handles spurious wakeups, close() notifies all so every worker sees the shutdown.

The crawl loop

A worker's main loop:

void worker(std::stop_token stop, WorkQueue& work, ResultChannel& results,
            HttpClient& http) {
  while (!stop.stop_requested()) {
    auto url_opt = work.pop_or_done();
    if (!url_opt) break;                 // queue drained
    auto& url = *url_opt;

    auto response = http.get(url);       // returns std::expected<Response, Error>
    if (!response) {
      results.send({url, 0, 0, 0, {}});
      continue;
    }

    auto links = extract_links(response->body, url);
    for (const auto& link : links) {
      if (should_follow(link)) work.push(link);
    }

    results.send({url, response->status, response->bytes,
                  response->elapsed_ms, std::move(links)});
  }
}

The interaction between stop_token and the pop_or_done() is the tricky part. If stop_token fires while a worker is waiting on the condition variable, you also need to call work.close() from the stopper so the wait wakes up. The std::stop_callback is the right hook for that.

When the queue is "done"

A subtle question. The crawl is finished when:

The queue is empty AND
No worker is currently processing a URL that might add more URLs.

A simple atomic counter does it:

std::atomic<int> in_flight{0};

// Before pushing or starting work:
in_flight.fetch_add(1, std::memory_order_relaxed);

// After finishing the work that came from a pop:
if (in_flight.fetch_sub(1, std::memory_order_acq_rel) == 1) {
  // Last in-flight just finished. If the queue is also empty, we're done.
  work.maybe_close_if_empty();
}

The exact protocol is finicky — it's a worthwhile design exercise to get right. The acq_rel ordering is needed because the "is queue empty" check has to see all pushes that happened before the fetch_sub.

Politeness

A real crawler must not hammer servers. At minimum:

Per-host rate limit. No more than N requests/sec to a given host. An std::unordered_map<std::string, std::chrono::time_point> of "last request time per host" + a sleep before the next request.
robots.txt. Fetch /robots.txt for each host on first contact; honor Disallow: entries.
A real User-Agent. Include contact info: "my-crawler/0.1 (you@example.com)". Operators block anonymous-looking traffic on sight.

Skipping politeness for a learning project on a single laptop is fine; running it against the public web without these is not.

Implementation rubric

Your code should hit these targets:

No data races. All cross-thread state is either std::atomic or protected by a mutex. Run with ThreadSanitizer (-fsanitize=thread) and verify zero warnings.
Graceful shutdown. Pressing Ctrl-C drains in-flight work, flushes the results channel, and exits cleanly. No std::terminate on thread destructors.
Bounded memory. The work queue has a max size; pushers block if it's full. Otherwise a wide BFS over the public web exhausts memory.
Tests. Mock the HTTP client (return canned responses). Test the crawler logic end-to-end with a fixture site of 5-10 pages.
Output is deterministic given the fixture. Same seed + max-pages → same set of crawled URLs (the order may differ because of threading; just check the set).

Stretch goals

In order of difficulty:

Async I/O instead of threads. Use boost::asio or cpp-httplib's async mode. One thread, N concurrent in-flight requests. Compare throughput against the thread-pool version on a small fixture.
Lock-free queue. Replace std::queue + mutex with moodycamel::ConcurrentQueue (or implement one). Compare contention under load.
Distributed. Split the URL space by hash and have N processes each crawl their share. Coordinate via Redis or a flat file.
A real HTML parser. Use gumbo-parser or lexbor. Get genuinely accurate link extraction. Discover edge cases.

What you'll have learned

By the end:

Built a thread pool with a graceful-shutdown protocol — the pattern under every job queue in production.
Used std::condition_variable + std::mutex for a non-trivial wait-for-work pattern.
Designed a shared atomic counter with the right memory_order for the "is anyone still working?" question.
Profiled a concurrent program — found out which step is the bottleneck (almost always network, sometimes parsing, rarely the worker bookkeeping).
Felt where the ThreadSanitizer catches bugs you couldn't see by reading the code.
Faced the question every concurrent program has to answer: how do we know we're done?

Reference reading: Anthony Williams's C++ Concurrency in Action (the single best book for this material); the crawler example in the Boost.Asio docs; Herb Sutter's "Atomic Weapons" talks for the memory model in motion.

🛠️ Concurrent Web Crawler