A Java headless browser lets you load and control real websites from Java without opening a visible browser window. It solves a problem many developers hit when simple HTTP requests stop working because pages rely on JavaScript, logins, or client-side rendering.
In this guide, you will learn when a headless browser makes sense and when it does not. We will walk through the main tools available in Java, show how to set up a project, and build a working script step by step. You will also see how to handle common real-world challenges like authentication, single page applications, and AJAX-heavy pages.
By the end, you should be able to choose the right approach for your project, write a reliable headless browser script, and understand how to grow it into something production-ready without overengineering.

Quick answer (TL;DR)
A Java headless browser is a real browser like Chrome or Firefox that runs without a visible UI and is controlled from Java code. It executes JavaScript, manages cookies, and behaves like a real user session, which makes it ideal for modern websites.
To put it simply:
- Use a Java headless browser when you deal with JavaScript-heavy pages, single page applications, login flows, or anything that depends on client-side rendering. Tools like Selenium with headless Chrome or Firefox, and Playwright for Java, are common choices.
- If the page is mostly static HTML or the data comes from an API, you usually do not need a headless browser. In those cases, a simple HTTP client or a web scraping API is faster, cheaper, and easier to maintain.
What is a Java headless browser?
A headless browser is a real web browser that runs without a visible user interface. It loads pages, executes JavaScript, stores cookies, and behaves like a normal browser, but everything is controlled through code instead of mouse clicks. In Java, this usually works by launching an actual browser engine in the background and sending it commands. Your code tells the browser what URL to open, when to wait, what elements to click, and what data to read from the page.
The big difference comes down to what gets executed:
- A normal browser renders pages for humans and waits for interaction.
- A headless browser renders pages for your program and follows instructions automatically.
- A simple HTTP client, on the other hand, only fetches raw HTML and never runs JavaScript at all.
This is why headless browsers matter for modern websites. Many pages load content after the initial request using JavaScript, require authentication flows, or depend on client-side rendering. A headless browser can log in, wait for scripts to finish, and give you the same page state a real user would see. That makes it useful for testing, automation, and working with JavaScript-heavy apps.
Key use cases for Java headless browsers
- Scraping dynamic websites. Some sites load data only after JavaScript runs or after user interaction. A headless browser can wait for the page to finish loading and then extract the final content. This is useful when scraping dashboards, search pages, or admin panels that do not expose clean APIs.
- Automated testing. Headless browsers are commonly used for end-to-end tests. You can simulate a real user by opening pages, filling out forms, clicking buttons, and checking results. This helps teams catch bugs in login flows, checkout pages, and complex UI logic before users do.
- PDF or screenshot generation. A headless browser can render a page exactly as it appears in a real browser and export it as a PDF or image. This is often used for invoices, reports, or previews where layout and styling actually matter.
- Scheduled bots and automation. Some tasks need to run on a schedule without human involvement. A headless browser can log into a site every night, download reports, submit forms, or check system status. This is common for monitoring tools and background jobs.
- Internal tools. Teams often use headless browsers to build internal utilities. Examples include QA tools, content validators, or scripts that verify links and page behavior across large sites. Since the browser runs headless, these tools can live quietly on servers.
When you do not need a Java headless browser
You do not always need a headless browser, and using one when it is not required can add unnecessary complexity.
- If a page is mostly static HTML and does not rely on JavaScript to load content, a simple HTTP client is usually enough.
- If the data you need is already available through an API, calling that API directly is faster, more reliable, and easier to maintain.
- The same applies to basic scraping tasks where the HTML does not change after load.
Web scraping APIs and lightweight HTTP libraries are often a better fit for simple jobs. Headless browsers shine when JavaScript, rendering, or real user behavior is involved, but they are overkill for straightforward requests. Knowing the difference helps you build simpler systems and avoid solving problems you do not actually have.
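To make the contrast concrete, here is a minimal sketch of the lightweight path using Java's built-in HttpClient (available since Java 11). The URL is a placeholder; the point is that this fetches raw HTML in a few lines, with no browser process and no JavaScript execution:

package org.example;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StaticPageFetch {
    public static void main(String[] args) throws Exception {
        // No browser here: just a plain HTTP request for the raw HTML.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com"))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Status: " + response.statusCode());
        System.out.println(response.body());
    }
}

If this returns the data you need, stop there: you do not need a headless browser.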
Choosing a Java headless browser tool
There is no single best Java headless browser tool, and the right choice depends on what you need to automate. Some tools focus on full browser accuracy, while others trade realism for speed and simplicity. The main things to think about are JavaScript support, performance, setup complexity, and how much control you need.
For modern web apps with heavy JavaScript, login flows, and dynamic UI, you usually want a real browser engine. For simpler pages or internal tools, a lighter solution can be faster and easier to maintain. Java has solid options across this spectrum, so you can pick based on your actual use case instead of forcing one tool everywhere.
The most common choices are Selenium with headless Chrome or Firefox, lighter libraries like HtmlUnit, and newer automation tools like Playwright for Java. Each has clear strengths and tradeoffs.
Using Selenium with headless Chrome or Firefox
Selenium is one of the most popular ways to run a Java headless browser. It works by controlling a real browser like Chrome or Firefox through a driver. When running in headless mode, the browser behaves the same way but does not open a visible window.
The typical setup involves adding Selenium to your project, downloading the correct browser driver, and configuring the browser to run headless. Once that is done, your Java code can open pages, click elements, wait for JavaScript, and read data just like a real user session.
A basic flow usually looks like this:
- Create a class with a main method
- Configure browser options for headless mode
- Initialize the WebDriver
- Open a URL and interact with the page
- Quit the driver when done
Selenium is a strong choice for complex sites because it has excellent JavaScript support and a huge ecosystem. There are many tutorials, plugins, and community examples, which makes debugging easier. The downside is that it is heavier than other options and can be slower, especially when running many browser instances at once.
Lightweight option: HtmlUnit and other headless libraries
If you do not need a full browser engine, lighter tools can be a better fit. HtmlUnit is a popular option in the Java ecosystem and acts like a simulated browser instead of controlling a real one. It is much faster to start and uses fewer resources, which makes it attractive for simple automation tasks.
HtmlUnit works well for basic pages, form submissions, and internal tools where JavaScript usage is limited. It can handle some scripting, but it does not fully match modern browsers when dealing with complex client-side frameworks.
For a practical walkthrough, check out this HtmlUnit tutorial which shows how to get started and where the limits are. This kind of tool is ideal when performance matters more than perfect rendering accuracy.
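For a feel of the API, here is a minimal sketch, assuming HtmlUnit 3.x (which uses the org.htmlunit package; older 2.x releases use com.gargoylesoftware.htmlunit) and a placeholder URL:

package org.example;

import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlPage;

public class HtmlUnitExample {
    public static void main(String[] args) throws Exception {
        // WebClient is HtmlUnit's simulated browser: no real Chrome process is launched.
        try (WebClient client = new WebClient()) {
            client.getOptions().setCssEnabled(false); // Skip CSS processing for speed.
            client.getOptions().setThrowExceptionOnScriptError(false); // Tolerate imperfect JS.
            HtmlPage page = client.getPage("https://example.com");
            System.out.println("Title: " + page.getTitleText());
            System.out.println(page.asNormalizedText()); // Visible text of the page.
        }
    }
}

Because everything runs inside the JVM, startup is nearly instant compared to launching Chrome.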
The main takeaway is to match the tool to the job. Use Selenium or similar tools when you need full browser behavior, and reach for lighter libraries when the page structure is simple and predictable.
Step by step: Build your first Java headless browser script
This mini tutorial shows a simple workflow for a Java headless browser script. You will create a project, add a browser automation dependency, run a headless browser, open a page, wait for dynamic content, extract data, and close everything cleanly. The example uses Selenium because it is the most common starting point, but the same flow applies to other tools too.
Step 1: Set up your Java project and dependencies
We will use Gradle because it is common, fast, and easy to reproduce on any machine or CI runner.
Install prerequisites
Make sure these are installed first:
- Java JDK 11 or newer (Java 21 is a solid default)
- Gradle
- A browser for automation, usually Google Chrome
Verify Java works:
java -version
Create a new Gradle project
From an empty directory, run:
gradle init
Choose:
- Type: application
- Implementation language: Java
- Java version: 21
- Project name: leave default
- Application structure: Single application project
- Build script DSL: Groovy
- Test framework: doesn't matter as we won't write tests
The generated app might contain sample unit tests, so let's remove that folder:
rm -rf app/src/test
Add the Selenium dependency
Open app/build.gradle and add Selenium:
dependencies {
    implementation 'org.seleniumhq.selenium:selenium-java:4.39.0'
}
You do not need anything else yet. Gradle will download Selenium automatically on the next build.
Create your main Java class
Open app/src/main/java/org/example/App.java. Basic skeleton:
package org.example;

public class App {
    public static void main(String[] args) {
        System.out.println("Java headless browser setup OK");
    }
}
Build and run the project
From the project root, run:
gradle build
Alternatively, you can use the gradlew wrapper that gradle init generated (./gradlew build).
Then run the app:
gradle run
If you see the message printed, your project setup is correct and Selenium is installed. At this point, your Gradle project is ready, dependencies are installed, and you can start writing a Java headless browser script!
Step 2: Load a page, wait for JavaScript, and extract data
This is the core loop of most Java headless browser scripts: start headless mode, open the page, wait for dynamic content, extract what you need, and then shut down cleanly.
Here are the building blocks you will use:
- Start the browser in headless mode
- Set timeouts so your script does not hang forever
- Navigate to the URL
- Wait for a specific element that proves the content loaded
- Select elements and extract text or attributes
- Handle cookies and headers when needed
- Quit the driver in a finally block so resources always close
A simple outline looks like this:
package org.example;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

import java.time.Duration;

public class App {
    public static void main(String[] args) {
        // Configure Chrome to run in headless mode.
        // "--headless=new" is the recommended flag for modern Chrome (109+).
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new");
        options.addArguments("--disable-gpu");
        options.addArguments("--window-size=1280,800");

        WebDriver driver = null;
        try {
            // Create the WebDriver instance.
            // Selenium Manager will automatically resolve the correct ChromeDriver.
            // You still need Chrome/Firefox installed (or a container image that
            // includes it); if CI can't find a browser, Selenium Manager can't save you.
            driver = new ChromeDriver(options);

            // Set page load timeout to avoid infinite waits on broken pages.
            driver.manage().timeouts().pageLoadTimeout(Duration.ofSeconds(30));

            // Disable implicit waits to avoid conflicts with explicit waits.
            driver.manage().timeouts().implicitlyWait(Duration.ofSeconds(0));

            // Open the target page.
            String url = "https://example.com";
            driver.get(url);

            // Explicit wait for JavaScript-rendered content.
            // Always wait for a specific element instead of guessing with sleep().
            WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(15));
            WebElement title = wait.until(
                ExpectedConditions.visibilityOfElementLocated(By.cssSelector("h1"))
            );

            // Extract visible text from the page.
            String text = title.getText();
            System.out.println("Page title: " + text);

            // Extract an attribute from another element.
            String href = driver.findElement(By.cssSelector("a")).getAttribute("href");
            System.out.println("First link: " + href);
        } catch (Exception e) {
            // Basic error handling.
            // In real projects, replace this with proper logging.
            System.err.println("Browser automation failed: " + e.getMessage());
            e.printStackTrace();
        } finally {
            // Always shut down the browser.
            // This prevents orphaned Chrome processes.
            if (driver != null) {
                driver.quit();
            }
        }
    }
}
- Waiting is the difference between getting real data and getting garbage. After driver.get() the page may still be loading JavaScript, so scraping immediately can return empty or incomplete HTML. Always wait for a clear signal that the page is ready, like an element becoming visible through WebDriverWait and ExpectedConditions.
- Use explicit waits instead of sleeps. Sleeping guesses how long a page might take, while waiting for a specific selector guarantees the content you need actually exists before you read it.
- Cookies matter for authenticated flows. After a login step, Selenium stores session cookies automatically, and those cookies keep you logged in as you navigate between pages. You can read and reuse cookies if you need to persist sessions or debug authentication issues (see the sketch after this list).
- Header control is limited in classic Selenium. Since Selenium drives a real browser, it does not give you full low-level control over request headers. If you need custom headers, you usually need a proxy, browser devtools integration, or a different automation approach. For many use cases, working with cookies is enough.
- Always shut down the browser in a finally block. Calling quit() ensures Chrome processes are cleaned up even when errors happen. Skipping this step leads to zombie browsers, wasted memory, and slow or unstable CI runners.
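To make the cookie point concrete, here is a small sketch of reading cookies after a login and replanting one into a later session. It assumes the same imports as the example above plus org.openqa.selenium.Cookie and java.util.Set; the cookie name and value are placeholders:

// Read all cookies from the current session (e.g., right after logging in).
Set<Cookie> cookies = driver.manage().getCookies();
for (Cookie c : cookies) {
    System.out.println(c.getName() + " -> " + c.getDomain());
}

// Replay a saved cookie later. You must be on the cookie's domain first,
// because browsers scope cookies to domains.
driver.get("https://example.com");
driver.manage().addCookie(new Cookie("session_id", "SAVED_VALUE"));
driver.navigate().refresh(); // Reload so the server sees the restored session.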
Handle logins, SPAs, and AJAX in Java headless browsers
Real sites are messy. They have login walls, single page apps, and data that only appears after background requests finish. This is where a Java headless browser earns its keep, because it runs JavaScript, keeps session state, and behaves like a real user session. A basic HTTP client can fetch HTML, but it cannot reliably reproduce a modern browser session with client-side rendering, CSRF tokens, and scripted flows.
The goal is not "scrape harder." The goal is "automate like a user." That usually means you wait for the right signals, keep cookies, and avoid brittle timing hacks.
Logging into websites with a Java headless browser
A common login flow looks like this:
- Open the login page
- Fill username and password fields
- Click the submit button
- Wait for a post-login signal like a dashboard element
- Reuse the authenticated session across pages using the same driver and cookies
Here is a simple Selenium example that works with modern headless Chrome, uses explicit waits, and cleans up properly:
package org.example;

import org.openqa.selenium.By;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

import java.time.Duration;
import java.util.Set;

public class App {
    public static void main(String[] args) {
        // Headless Chrome config that stays stable on dev machines and CI runners.
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new");
        options.addArguments("--disable-gpu");
        options.addArguments("--window-size=1280,800");
        options.addArguments("--no-sandbox"); // Useful on some CI containers.
        options.addArguments("--disable-dev-shm-usage"); // Helps avoid shared memory issues in Docker.

        WebDriver driver = null;
        try {
            driver = new ChromeDriver(options);
            driver.manage().timeouts().pageLoadTimeout(Duration.ofSeconds(45));
            driver.manage().timeouts().implicitlyWait(Duration.ofSeconds(0));
            WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(20));

            // 1) Go to the login page.
            driver.get("https://example.com/login");

            // 2) Wait for the login form to exist before interacting.
            WebElement emailInput = wait.until(
                ExpectedConditions.elementToBeClickable(By.cssSelector("input[type='email'], input[name='email']"))
            );
            WebElement passwordInput = wait.until(
                ExpectedConditions.elementToBeClickable(By.cssSelector("input[type='password'], input[name='password']"))
            );
            emailInput.clear();
            emailInput.sendKeys("YOUR_EMAIL");
            passwordInput.clear();
            passwordInput.sendKeys("YOUR_PASSWORD");

            // 3) Submit the form. Some sites want click, some accept submit().
            WebElement submit = driver.findElement(By.cssSelector("button[type='submit'], button[name='login']"));
            submit.click();

            // 4) Wait for a post-login signal. Pick a selector that only appears when logged in.
            wait.until(ExpectedConditions.visibilityOfElementLocated(By.cssSelector("[data-test='dashboard'], .dashboard, nav")));

            // 5) Now you are logged in. Cookies are stored in the browser session automatically.
            Set<Cookie> cookies = driver.manage().getCookies();
            System.out.println("Logged in. Cookie count: " + cookies.size());

            // 6) Navigate to an authenticated page using the same driver session.
            driver.get("https://example.com/account");

            // Wait for something that proves the authenticated page loaded.
            wait.until(ExpectedConditions.visibilityOfElementLocated(By.cssSelector("h1")));
            System.out.println("Account page loaded.");
        } catch (Exception e) {
            System.err.println("Login automation failed: " + e.getMessage());
            e.printStackTrace();
        } finally {
            if (driver != null) {
                driver.quit();
            }
        }
    }
}
Common issues you will hit:
- CSRF tokens are usually handled automatically in a headless browser because the page scripts and hidden inputs load normally, but you still need to wait for the form to be ready before submitting.
- 2FA can work if it is a simple one-time code you can provide, but fully automating it depends on your setup and the site's security design.
- CAPTCHAs are often a hard stop. A Java headless browser can reach the CAPTCHA, but bypassing it is not something you should plan for in a normal engineering workflow.
- Some sites detect automation. Your best defense is stable waiting logic, realistic flows, and not hammering the site.
Also, follow website terms and local legal rules. If a site says "no automation," do not pretend that headless makes it okay.
Working with dynamic SPAs and AJAX content
SPAs change content without full page reloads. Routes change, components re-render, and data arrives through background calls. If you treat it like a static page, you will scrape empty shells.
Patterns that usually work:
- Wait for a stable element that indicates the view is ready, not just the URL change.
- For client-side routing, wait for a route-specific selector after you click a link.
- For infinite scroll, scroll in steps and wait for more items to appear.
- For AJAX-loaded blocks, wait for the list length to increase or for a loader to disappear.
Example patterns in Selenium:
// Wait for a SPA route to finish rendering by waiting for a route-specific element.
wait.until(ExpectedConditions.visibilityOfElementLocated(By.cssSelector("[data-page='search-results']")));
// Infinite scroll pattern: scroll, then wait for more cards to load.
int previousCount = driver.findElements(By.cssSelector(".result-card")).size();
((org.openqa.selenium.JavascriptExecutor) driver).executeScript("window.scrollTo(0, document.body.scrollHeight);");
wait.until(d -> d.findElements(By.cssSelector(".result-card")).size() > previousCount);
// Loader pattern: wait for a spinner to disappear before scraping.
wait.until(ExpectedConditions.invisibilityOfElementLocated(By.cssSelector(".spinner, .loading-indicator")));
A quick reality check on scale: running full browsers is heavier than calling an API. If you need high-volume scraping of SPA pages, a web scraping API such as ScrapingBee can be easier to scale because it handles browsers, retries, and infrastructure for you. Headless browsers are perfect for tricky flows and smaller batches, but they can get expensive when you try to run them by the thousands.
Real project example: Java headless browser for downloading e-books
In this section, we'll build a realistic Java headless browser workflow you can actually reuse. The script will:
- log into your account
- navigate to a protected library page
- find downloadable books
- download a few files locally
The concrete example uses ebook downloads from pragprog.com, but the exact same structure applies to many real-world automation tasks where you need to:
- authenticate
- move through private pages
- download files reliably without UI hacks
If you are interested in applying this same pattern to invoices, receipts, or other billing documents, the structure is almost identical. You can see a dedicated walkthrough in the Java bill downloader guide.
We will keep it practical and copyable. You will see where login code should live, where navigation steps go, and how to keep download logic separate so you are not mixing everything into one huge file. Your main entry point stays in app/src/main/java/org/example/App.java, and we will add a few small helper classes next to it.
Project structure you can copy
Keep App.java as the orchestrator. It should read config, create the browser, call login, run navigation steps, then trigger downloads, and finally close the browser.
Add a small set of helper classes in the same package so imports stay simple:
- Config to load settings like username, password, base URL, and download directory from environment variables.
- BrowserFactory to create a headless Chrome driver with the right options for stable automation.
- AccountClient to hold page actions like resolveOrdersUrl() and goToOrdersAndLoginIfNeeded() so your flow stays readable.
- DownloadHelper to manage download location and wait until files are fully written before the script exits.
Suggested folder layout under your existing project:
- app/src/main/java/org/example/App.java
- app/src/main/java/org/example/Config.java
- app/src/main/java/org/example/BrowserFactory.java
- app/src/main/java/org/example/AccountClient.java
- app/src/main/java/org/example/DownloadHelper.java
Once the script works locally, you can run it on a schedule. A typical setup is a monthly cron job or a task scheduler entry that runs the same command every month to fetch new downloads automatically.
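As an example, a crontab entry like the sketch below would run the script monthly. The project path and log location are hypothetical; point them at wherever your project and Gradle wrapper actually live:

# Hypothetical crontab: run on the 1st of every month at 03:00.
# Environment variables can be declared at the top of the crontab.
PRAGPROG_EMAIL=you@example.com
PRAGPROG_PASSWORD=your-password
0 3 1 * * cd /opt/pragprog-downloader && ./gradlew run >> /var/log/pragprog-downloader.log 2>&1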
Browser setup: build a reusable WebDriver factory
Keep browser setup out of App.java. This makes your main flow easier to read, and it lets you reuse the same browser settings across projects. Create one helper file that builds a headless Chrome driver with stable defaults and a predictable download folder.
Create this file at app/src/main/java/org/example/BrowserFactory.java and keep it in the same package as App.java so imports stay simple.
package org.example;

import org.openqa.selenium.PageLoadStrategy;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

import java.nio.file.Path;
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

/**
 * Centralized Chrome setup for headless automation.
 * Keeps browser configuration out of App.java and makes it reusable.
 */
public final class BrowserFactory {

    private BrowserFactory() {}

    public static WebDriver createChrome(Path downloadDir) {
        // Configure Chrome to download files automatically to a known directory.
        // Without this, headless downloads are unreliable or blocked.
        Map<String, Object> prefs = new HashMap<>();
        prefs.put("download.default_directory", downloadDir.toAbsolutePath().toString());
        prefs.put("download.prompt_for_download", false);
        prefs.put("download.directory_upgrade", true);
        prefs.put("safebrowsing.enabled", true);

        ChromeOptions options = new ChromeOptions();
        options.setExperimentalOption("prefs", prefs);

        // Do not wait for every tracking/analytics script to finish.
        // We rely on explicit waits for elements instead.
        options.setPageLoadStrategy(PageLoadStrategy.EAGER);

        // Modern headless mode (Chrome 109+).
        options.addArguments("--headless=new");

        // Consistent viewport across machines and CI runners.
        options.addArguments("--window-size=1280,800");
        options.addArguments("--disable-gpu");

        // Slightly reduce obvious automation fingerprints.
        options.addArguments("--disable-blink-features=AutomationControlled");

        // Useful when running in Docker or locked-down environments.
        // options.addArguments("--no-sandbox");
        // options.addArguments("--disable-dev-shm-usage");

        WebDriver driver = new ChromeDriver(options);
        driver.manage().timeouts().implicitlyWait(Duration.ofSeconds(0));
        driver.manage().timeouts().pageLoadTimeout(Duration.ofSeconds(30));
        driver.manage().timeouts().scriptTimeout(Duration.ofSeconds(30));
        return driver;
    }
}
Key points:
- prefs.put("download.default_directory", ...) — forces Chrome to save files to a known folder without user interaction.
- options.setPageLoadStrategy(PageLoadStrategy.EAGER); — prevents Selenium from waiting forever on analytics and tracking scripts.
- options.addArguments("--headless=new"); — enables modern headless Chrome with full feature support.
- options.addArguments("--window-size=1280,800"); — ensures consistent rendering across machines and CI.
- options.addArguments("--disable-blink-features=AutomationControlled"); — slightly reduces obvious automation fingerprints.
- new ChromeDriver(options); — creates a browser instance using Selenium Manager (no manual driver setup).
- driver.manage().timeouts().pageLoadTimeout(...) — caps how long navigation can block before we take control again.
- driver.manage().timeouts().scriptTimeout(...) — limits how long injected or async scripts are allowed to run.
Configuration: credentials and download paths
This script needs three things: the PragProg base URL, login credentials, and a place to save downloads. Keep all of that in one small config class so you do not scatter magic strings and paths across the codebase.
We will hardcode the base URL, and we will read credentials from environment variables so you do not put secrets in code. For downloads, we will use the current user home directory and create a dedicated folder so it works on Windows, macOS, and Linux.
Create the file at app/src/main/java/org/example/Config.java:
package org.example;

import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Centralized configuration for the automation script.
 *
 * Keeps credentials, base URLs, and filesystem paths in one place
 * so they are not scattered across the codebase.
 */
public final class Config {

    // PragProg base URL is stable and safe to hardcode for this example.
    public static final String BASE_URL = "https://pragprog.com/";

    private final String email;
    private final String password;
    private final Path downloadDir;

    private Config(String email, String password, Path downloadDir) {
        this.email = email;
        this.password = password;
        this.downloadDir = downloadDir;
    }

    /**
     * Load configuration from the environment and prepare local directories.
     *
     * - Credentials come from environment variables (no secrets in code).
     * - Download directory is created once and reused.
     */
    public static Config load() {
        // Read credentials from environment variables.
        // This keeps secrets out of source control.
        String email = readRequiredEnv("PRAGPROG_EMAIL");
        String password = readRequiredEnv("PRAGPROG_PASSWORD");

        // Create a cross-platform download directory inside the user's home.
        Path homeDir = resolveUserHome();
        Path downloadDir = homeDir.resolve("pragprog-downloads");

        // Ensure the download directory exists before the browser starts.
        try {
            Files.createDirectories(downloadDir);
        } catch (Exception e) {
            throw new IllegalStateException(
                "Failed to create download directory: " + downloadDir.toAbsolutePath(), e
            );
        }

        return new Config(email, password, downloadDir);
    }

    public String email() {
        return email;
    }

    public String password() {
        return password;
    }

    public Path downloadDir() {
        return downloadDir;
    }

    /**
     * Read a required environment variable or fail fast with a clear error.
     */
    private static String readRequiredEnv(String name) {
        String value = System.getenv(name);
        if (value == null || value.isBlank()) {
            throw new IllegalStateException(
                "Missing required environment variable: " + name
            );
        }
        return value.trim();
    }

    /**
     * Resolve the current user's home directory in a platform-independent way.
     */
    private static Path resolveUserHome() {
        String home = System.getProperty("user.home");
        if (home == null || home.isBlank()) {
            throw new IllegalStateException("user.home system property is not set");
        }
        return Path.of(home);
    }
}
Key points:
- readRequiredEnv("...") — reads values from an environment variable so credentials never live in source code.
- resolveUserHome() — finds the current user's home directory in a cross-platform way.
- homeDir.resolve("pragprog-downloads") — creates a predictable download folder that works on Windows, macOS, and Linux.
- Files.createDirectories(downloadDir) — ensures the download directory exists before the browser starts.
- Config.load() — central entry point that loads all configuration and fails fast if anything is missing.
- Config (as a whole) — keeps secrets, paths, and constants out of the main automation logic.
Before you run the script, set these environment variables: PRAGPROG_EMAIL and PRAGPROG_PASSWORD.
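For example, with placeholder values:

# macOS / Linux (bash or zsh)
export PRAGPROG_EMAIL="you@example.com"
export PRAGPROG_PASSWORD="your-password"

# Windows (PowerShell)
$env:PRAGPROG_EMAIL = "you@example.com"
$env:PRAGPROG_PASSWORD = "your-password"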
Login and navigation: reach the PragProg orders page and sign in
PragProg uses a "My Orders" link in the top bar as the gateway to your account area. That link points to a SendOwl customer account page. If you are not logged in, that page shows a login form. If you are already logged in, it takes you straight to your orders.
The clean pattern is: open PragProg, extract the "My Orders" href, navigate to it, then log in only if the login form is present. This keeps the flow stable and avoids guessing URLs or clicking around blindly.
Create the file at app/src/main/java/org/example/AccountClient.java:
package org.example;

import org.openqa.selenium.By;
import org.openqa.selenium.TimeoutException;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

import java.time.Duration;
import java.util.List;

/**
 * Handles all account-related browser actions:
 * - finding the "My Orders" entry point
 * - logging in only when necessary
 *
 * Keeping this logic isolated keeps App.java readable and boring (in a good way).
 */
public final class AccountClient {

    // Main wait used for normal navigation and form interactions.
    private static final Duration MAIN_TIMEOUT = Duration.ofSeconds(20);

    // Short wait used only for presence checks (e.g. "is login form here?").
    private static final Duration SHORT_TIMEOUT = Duration.ofSeconds(5);

    private final WebDriver driver;
    private final WebDriverWait wait;

    public AccountClient(WebDriver driver) {
        this.driver = driver;
        this.wait = new WebDriverWait(driver, MAIN_TIMEOUT);
    }

    /**
     * Resolve the "My Orders" URL from PragProg's homepage.
     *
     * We do not hardcode this URL because PragProg redirects it to SendOwl,
     * and that redirect target can change over time.
     */
    public String resolveOrdersUrl() {
        driver.get(Config.BASE_URL);

        // The exact top-bar structure can change, so wait for a generic header/nav area.
        wait.until(ExpectedConditions.presenceOfElementLocated(By.cssSelector("header, nav")));

        // First attempt: look for a direct "My Orders" link via XPath.
        List<WebElement> links = driver.findElements(
            By.xpath("//a[normalize-space()='My Orders' or contains(normalize-space(), 'My Orders')]")
        );
        for (WebElement link : links) {
            String href = safeAttr(link, "href");
            if (!href.isBlank()) {
                return href.trim();
            }
        }

        // Fallback: scan all anchors and match by visible text.
        // This survives markup changes where text is wrapped in spans or icons.
        List<WebElement> allLinks = driver.findElements(By.cssSelector("a"));
        for (WebElement link : allLinks) {
            String text = safeText(link);
            if (text.equalsIgnoreCase("My Orders") || text.toLowerCase().contains("my orders")) {
                String href = safeAttr(link, "href");
                if (!href.isBlank()) {
                    return href.trim();
                }
            }
        }

        throw new IllegalStateException("Could not find the My Orders link on PragProg home.");
    }

    /**
     * Navigate to the orders page and log in if required.
     *
     * SendOwl shows the same URL for both logged-in and logged-out users,
     * so we detect login state by checking for the login form.
     */
    public void goToOrdersAndLoginIfNeeded(String ordersUrl, String email, String password) {
        driver.get(ordersUrl);

        // Only log in if the login form is present.
        if (isLoginFormPresent()) {
            login(email, password);
        }

        // Final safety check: if the login form is still there, something went wrong.
        if (isLoginFormPresent()) {
            throw new IllegalStateException(
                "Login did not succeed (login form still present). Check credentials or captcha."
            );
        }
    }

    /**
     * Perform the SendOwl login flow.
     *
     * Uses form submission instead of clicking a specific button
     * to avoid brittle selectors.
     */
    private void login(String email, String password) {
        WebElement emailInput = wait.until(
            ExpectedConditions.elementToBeClickable(
                By.cssSelector("input[name='customer_session[email]']")
            )
        );
        WebElement passwordInput = wait.until(
            ExpectedConditions.elementToBeClickable(
                By.cssSelector("input[name='customer_session[password]']")
            )
        );

        emailInput.clear();
        emailInput.sendKeys(email);
        passwordInput.clear();
        passwordInput.sendKeys(password);

        // Submit the surrounding form to mimic a real user action.
        WebElement form = emailInput.findElement(By.xpath("./ancestor::form[1]"));
        form.submit();

        // Wait until the login form disappears, signaling a successful login.
        wait.until(ExpectedConditions.invisibilityOfElementLocated(
            By.cssSelector("input[name='customer_session[password]']")
        ));
    }

    /**
     * Check whether the login form is currently visible.
     *
     * Uses a short timeout because this is a presence probe, not a full wait.
     */
    private boolean isLoginFormPresent() {
        WebDriverWait shortWait = new WebDriverWait(driver, SHORT_TIMEOUT);
        try {
            shortWait.until(ExpectedConditions.presenceOfElementLocated(
                By.cssSelector("input[name='customer_session[email]']")
            ));
            return true;
        } catch (TimeoutException ignored) {
            return false;
        }
    }

    // ---------- Small safety helpers ----------

    private static String safeText(WebElement el) {
        try {
            String t = el.getText();
            return t == null ? "" : t.trim();
        } catch (Exception ignored) {
            return "";
        }
    }

    private static String safeAttr(WebElement el, String name) {
        try {
            String v = el.getAttribute(name);
            return v == null ? "" : v.trim();
        } catch (Exception ignored) {
            return "";
        }
    }

    private static String safeUrl(WebDriver d) {
        try {
            String u = d.getCurrentUrl();
            return u == null ? "" : u.trim();
        } catch (Exception ignored) {
            return "";
        }
    }
}
Key points:
- resolveOrdersUrl() — opens PragProg and finds the real "My Orders" entry point without hardcoding SendOwl URLs.
- By.cssSelector("header, nav") — waits for a stable page container instead of brittle top-bar markup.
- goToOrdersAndLoginIfNeeded(...) — navigates to the orders page and logs in only if the login form is present.
- isLoginFormPresent() — detects authentication state by checking for the email field instead of guessing cookies or redirects.
- login(email, password) — fills the SendOwl login form and submits it like a real user.
- form.submit() — avoids fragile button selectors that often change.
- ExpectedConditions.invisibilityOfElementLocated(...) — waits until login is complete by ensuring the form disappears.
- safeText(...) / safeAttr(...) — defensive helpers that prevent random DOM quirks from crashing the script.
Download handling: pick a few books and download files reliably
After login, the orders page shows a dynamic table inside tbody#ordersdisplay. Each row contains a download landing page URL for an order. That's perfect for automation: we can extract those URLs directly instead of clicking around or guessing endpoints.
The flow we use is simple:
- Wait until the orders table is actually populated.
- Extract the order download URLs as plain strings.
- Visit each order's download page.
- Pick the PDF format (or fall back to the first available format).
- Trigger the real file download and wait until it is fully written to disk.
To make this reliable, there are two real-world problems we must handle:
- Downloads are asynchronous — clicking a download link returns immediately, but Chrome keeps writing the file in the background.
- Chrome uses temporary files — while a download is in progress, Chrome writes files with a .crdownload suffix. A file is only safe to use once those temporary files disappear and the final file size stops changing.
The helper below solves both problems by:
- tracking when a new file appears after a known baseline
- waiting until no temporary files exist
- verifying that the file size is stable before returning
Create the file at app/src/main/java/org/example/DownloadHelper.java:
package org.example;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.Stream;

/**
 * Wait helpers for Chrome downloads.
 *
 * Clicking a download link returns immediately, but the file is written asynchronously.
 * Chrome also writes temporary ".crdownload" files while a download is in progress.
 *
 * This helper waits until:
 * - there are no temp download files
 * - a new file appears (newer than the baseline)
 * - the file size is stable across multiple checks
 */
public final class DownloadHelper {

    private DownloadHelper() {}

    // Chrome temporary download suffix (most common).
    private static final String CHROME_TEMP_SUFFIX = ".crdownload";

    // Poll interval: fast enough to feel responsive, slow enough to avoid busy looping.
    private static final long POLL_MS = 300;

    /**
     * Wait until a new download finishes and return the downloaded file path.
     *
     * The logic is based on "newest modified file after baseline + stable size".
     * This avoids fragile assumptions like "download finishes instantly" or "file count increments by 1".
     */
    public static Path waitForNewDownload(Path downloadDir, Duration timeout) {
        Instant deadline = Instant.now().plus(timeout);

        // Baseline: newest file timestamp at the moment we start waiting.
        // We only accept files that appear or change after this point.
        long baselineNewestMtime = newestRegularFile(downloadDir)
                .map(DownloadHelper::lastModifiedMillisSafe)
                .orElse(0L);

        while (Instant.now().isBefore(deadline)) {
            // If any ".crdownload" files exist, Chrome is still writing.
            // Don't trust directory state yet.
            if (hasTempDownloads(downloadDir)) {
                sleep(POLL_MS);
                continue;
            }

            // Candidate = newest regular file in the directory.
            Optional<Path> newest = newestRegularFile(downloadDir);
            if (newest.isPresent()) {
                Path candidate = newest.get();
                long mtime = lastModifiedMillisSafe(candidate);

                // Only accept files that are newer than our baseline snapshot.
                if (mtime > baselineNewestMtime) {
                    // Extra safety: size must be stable across multiple reads.
                    // This reduces the chance of returning a partially written file.
                    if (isFileStable(candidate, Duration.ofMillis(900), 3)) {
                        return candidate;
                    }
                }
            }
            sleep(POLL_MS);
        }

        throw new IllegalStateException(
            "Timed out waiting for download to finish in: " + downloadDir.toAbsolutePath()
        );
    }

    private static boolean hasTempDownloads(Path dir) {
        try (Stream<Path> s = Files.list(dir)) {
            return s.anyMatch(p -> p.getFileName().toString().endsWith(CHROME_TEMP_SUFFIX));
        } catch (IOException e) {
            throw new IllegalStateException("Failed to scan download directory: " + dir.toAbsolutePath(), e);
        }
    }

    private static Optional<Path> newestRegularFile(Path dir) {
        try (Stream<Path> s = Files.list(dir)) {
            return s.filter(Files::isRegularFile)
                    .max(Comparator.comparingLong(DownloadHelper::lastModifiedMillisSafe));
        } catch (IOException e) {
            throw new IllegalStateException("Failed to scan download directory: " + dir.toAbsolutePath(), e);
        }
    }

    private static long lastModifiedMillisSafe(Path p) {
        try {
            return Files.getLastModifiedTime(p).toMillis();
        } catch (IOException e) {
            return 0L;
        }
    }

    /**
     * Stability check:
     * - file exists
     * - size is > 0
     * - size stays unchanged across N reads, spaced by "interval"
     */
    private static boolean isFileStable(Path file, Duration interval, int checks) {
        if (!Files.isRegularFile(file)) return false;
        long lastSize = sizeSafe(file);
        if (lastSize <= 0) return false;

        for (int i = 0; i < checks; i++) {
            sleep(interval.toMillis());
            long size = sizeSafe(file);
            if (size <= 0) return false;
            if (size != lastSize) return false;
            lastSize = size;
        }
        return true;
    }

    private static long sizeSafe(Path file) {
        try {
            return Files.size(file);
        } catch (IOException e) {
            return -1L;
        }
    }

    private static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for download.", e);
        }
    }
}
Orchestrate the full flow in App.java: log in, list orders, download the latest 3 books
Now we wire everything together in your main entry point. App.java should stay boring and readable. It loads config, creates the browser, logs in, finds the orders table, picks a few downloads, waits for each file to finish, and exits cleanly.
Update app/src/main/java/org/example/App.java:
package org.example;

import org.openqa.selenium.*;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

import java.nio.file.Path;
import java.time.Duration;
import java.util.*;

public class App {

    /**
     * For the demo we only grab the newest few items.
     * In a real project you might download everything, or filter by date/title.
     */
    private static final int BOOKS_TO_DOWNLOAD = 3;

    // Timeouts that match "real internet" behavior (not localhost fantasy).
    private static final Duration ORDERS_TIMEOUT = Duration.ofSeconds(45);
    private static final Duration LANDING_PAGE_TIMEOUT = Duration.ofSeconds(25);
    private static final Duration DOWNLOAD_TIMEOUT = Duration.ofSeconds(90);

    // Orders page (SendOwl) selectors
    private static final By ORDERS_TBODY = By.cssSelector("tbody#ordersdisplay");
    private static final By ORDER_DOWNLOAD_LINKS = By.cssSelector("tbody#ordersdisplay a[href*='download']");

    // Download landing page (format picker) selectors
    private static final By DOWNLOAD_LIST = By.cssSelector(".download-list");
    private static final By DOWNLOAD_ITEMS = By.cssSelector(".download-list .item");
    private static final By FILE_LINKS = By.cssSelector("a[href*='/file?product_id=']");

    public static void main(String[] args) {
        WebDriver driver = null;
        try {
            Config config = Config.load();
            driver = BrowserFactory.createChrome(config.downloadDir());

            // 1) Login (PragProg -> My Orders -> SendOwl account area)
            AccountClient account = new AccountClient(driver);
            String ordersUrl = account.resolveOrdersUrl();
            account.goToOrdersAndLoginIfNeeded(ordersUrl, config.email(), config.password());

            // 2) Orders table is dynamic: wait until the links exist, not just the HTML container.
            waitForOrdersTable(driver);

            // 3) Extract the "download landing pages" as plain strings.
            // This avoids stale element pain: we don't keep WebElements around longer than needed.
            List<String> orderDownloadPages = extractOrderDownloadPages(driver);
            if (orderDownloadPages.isEmpty()) {
                System.out.println("No download links found on orders page.");
                dumpDebugArtifacts(driver, config.downloadDir());
                return;
            }

            int limit = Math.min(BOOKS_TO_DOWNLOAD, orderDownloadPages.size());
            System.out.println("Found " + orderDownloadPages.size() + " orders. Downloading " + limit + ".");

            for (int i = 0; i < limit; i++) {
                String orderDownloadPageUrl = orderDownloadPages.get(i);
                System.out.println("Opening order download page: " + orderDownloadPageUrl);
                Path downloaded = downloadPdfFromOrderDownloadPage(driver, orderDownloadPageUrl, config.downloadDir());
                System.out.println("Saved file: " + downloaded.getFileName());
            }

            System.out.println("Done.");
        } catch (Exception e) {
            System.err.println("Automation failed: " + e.getMessage());
            e.printStackTrace();
        } finally {
            if (driver != null) driver.quit();
        }
    }

    /**
     * SendOwl orders page renders asynchronously.
     * The most reliable readiness signal is: "at least one download link exists".
     */
    private static void waitForOrdersTable(WebDriver driver) {
        WebDriverWait wait = new WebDriverWait(driver, ORDERS_TIMEOUT);
        System.out.println("Waiting for orders table...");
        wait.until(ExpectedConditions.presenceOfElementLocated(ORDERS_TBODY));

        // Wait until at least one download link is present (real content has loaded).
        wait.until(d -> d.findElements(ORDER_DOWNLOAD_LINKS).size() > 0);

        int links = driver.findElements(ORDER_DOWNLOAD_LINKS).size();
        System.out.println("Orders ready. downloadLinks=" + links);
    }

    /**
     * Extract the per-order "download landing page" URLs.
     * Each of those pages contains multiple formats (pdf/epub/mobi), which we pick from later.
     */
    private static List<String> extractOrderDownloadPages(WebDriver driver) {
        // Keep the original ordering and avoid duplicates.
        Set<String> urls = new LinkedHashSet<>();
        for (WebElement a : driver.findElements(ORDER_DOWNLOAD_LINKS)) {
            String href = safeAttr(a, "href");
            if (!href.isBlank()) urls.add(href);
        }
        return new ArrayList<>(urls);
    }

    /**
     * Each order link opens a landing page with multiple formats.
     * We always try to download the PDF, and fall back to "first available" if PDF is missing.
     */
    private static Path downloadPdfFromOrderDownloadPage(WebDriver driver, String orderDownloadPageUrl, Path downloadDir) {
        // 1) Open the landing page safely (some pages never fully "finish loading" due to scripts).
        safeNavigate(driver, orderDownloadPageUrl);

        // 2) Wait for the formats list to appear.
        WebDriverWait wait = new WebDriverWait(driver, LANDING_PAGE_TIMEOUT);
        wait.until(ExpectedConditions.presenceOfElementLocated(DOWNLOAD_LIST));

        // 3) Resolve the real file download URL.
        // On SendOwl the actual download happens at ".../file?product_id=..."
        String fileUrl = resolvePdfFileUrl(driver);
        if (fileUrl.isBlank()) {
            throw new IllegalStateException("Could not find a file download link on: " + driver.getCurrentUrl());
        }
        System.out.println("Resolved file URL: " + fileUrl);

        // 4) Trigger the actual download by navigating directly (more reliable than clicking in headless mode).
        safeNavigate(driver, fileUrl);

        // 5) Wait until Chrome finishes writing the file to disk.
        return DownloadHelper.waitForNewDownload(downloadDir, DOWNLOAD_TIMEOUT);
    }

    /**
     * Prefer PDF when multiple formats exist.
     * We inspect each ".download-list .item" block because that's where the "pdf:" label lives.
     */
    private static String resolvePdfFileUrl(WebDriver driver) {
        List<WebElement> items = driver.findElements(DOWNLOAD_ITEMS);

        // 1) Prefer the item that mentions PDF.
        for (WebElement item : items) {
            String label = safeText(item).toLowerCase();
            if (label.contains("pdf:")) {
                String href = firstHref(item, FILE_LINKS);
                if (!href.isBlank()) return href;
            }
        }

        // 2) Fallback: first available file link (whatever format it is).
        for (WebElement item : items) {
            String href = firstHref(item, FILE_LINKS);
            if (!href.isBlank()) return href;
        }
        return "";
    }

    private static String firstHref(WebElement root, By selector) {
        try {
            WebElement a = root.findElement(selector);
            String href = a.getAttribute("href");
            return href == null ? "" : href.trim();
        } catch (Exception ignored) {
            return "";
        }
    }

    /**
     * Navigate without getting stuck on pages that never "finish loading" due to analytics/scripts.
     *
     * In modern automation it's more reliable to:
     * - navigate
     * - then wait for a specific element you care about
     * than to trust "the page is fully loaded".
     */
    private static void safeNavigate(WebDriver driver, String url) {
        try {
            driver.navigate().to(url);
            // Light sanity check: at least the DOM should exist.
            // (We still wait for specific elements elsewhere.)
            if (driver instanceof JavascriptExecutor js) {
                try {
                    js.executeScript("return document.readyState");
                } catch (Exception ignored) {}
            }
        } catch (TimeoutException e) {
            // Stop loading and continue. We'll rely on explicit element waits.
            try {
                ((JavascriptExecutor) driver).executeScript("window.stop();");
            } catch (Exception ignored) {}
        }
    }

    private static String safeText(WebElement el) {
        try {
            String t = el.getText();
            return t == null ? "" : t.trim();
        } catch (Exception ignored) {
            return "";
        }
    }

    private static String safeAttr(WebElement el, String name) {
        try {
            if (el == null) return "";
            String v = el.getAttribute(name);
            return v == null ? "" : v.trim();
        } catch (Exception ignored) {
            return "";
        }
    }

    /**
     * Debug helper: writes a screenshot + HTML to the download directory.
     * This is insanely useful when a site changes markup and your selectors stop matching.
     */
    private static void dumpDebugArtifacts(WebDriver driver, Path dir) {
        try {
            byte[] png = ((TakesScreenshot) driver).getScreenshotAs(OutputType.BYTES);
            java.nio.file.Files.write(dir.resolve("debug.png"), png);

            String html = driver.getPageSource();
            java.nio.file.Files.writeString(dir.resolve("debug.html"), html);

            System.out.println("Wrote debug.png and debug.html to: " + dir.toAbsolutePath());
            System.out.println("Debug URL: " + driver.getCurrentUrl());
            System.out.println("Debug title: " + driver.getTitle());
        } catch (Exception e) {
            System.out.println("Failed to write debug artifacts: " + e.getMessage());
        }
    }
}
Key points to note:
- waitForOrdersTable(driver); — waits until the orders table is actually usable, not just present in the DOM.
- extractOrderDownloadPages(driver); — reads all order download links as plain strings to avoid stale Selenium elements later.
- safeNavigate(driver, orderDownloadPageUrl); — opens pages without getting stuck on scripts that never finish loading.
- wait.until(ExpectedConditions.presenceOfElementLocated(DOWNLOAD_LIST)); — waits until the format chooser (PDF / EPUB / MOBI) is visible.
- resolvePdfFileUrl(driver); — scans the format list and prefers the PDF download link if it exists.
- safeNavigate(driver, fileUrl); — navigates directly to the real file URL to trigger the download reliably.
- DownloadHelper.waitForNewDownload(...) — blocks until Chrome finishes writing the file to disk, avoiding half-downloaded files.
- dumpDebugArtifacts(driver, ...) — saves a screenshot and page HTML when something breaks, making selector fixes easy.
This gives you an end-to-end working skeleton: it logs in, finds the orders table, picks the newest items, triggers downloads, and waits until files are fully saved. Great!
Start using Java headless browsers in your projects
If you made it this far, the next step is simple: stop reading and start running code. Pick one tool, set up a small Java headless browser script, and make it do one useful thing. Load a page, wait for content, extract a value, and shut down cleanly. That alone already puts you ahead of most half-broken automation setups.
Once the basics work, grow it slowly. Add better waits. Handle errors. Log what happens. Make the script predictable before you make it powerful. This is how small test scripts turn into reliable automation or scraping systems instead of fragile hacks.
When you move toward production, start thinking about scale and maintenance early. Running real browsers costs CPU and memory, and failures will happen. Timeouts, layout changes, and blocked sessions are part of the game. The more traffic you need, the more painful pure browser automation becomes to operate.
For many production workloads, a hybrid approach works best. Use a Java headless browser for tricky flows like logins, complex SPAs, or edge cases, and combine it with a managed solution like a Web Scraping API for large-scale data extraction. This reduces infrastructure work, avoids many blocking issues, and lets you focus on the data instead of babysitting browsers.
💡 Get started with ScrapingBee today! No credit card needed, and you get 1,000 scraping credits as a gift.
The key idea is simple: start small, automate like a user, and only scale once you understand where the real pain points are. Java headless browsers are a powerful tool, but they shine the most when used intentionally, not everywhere by default.
Conclusion
Running Chrome in headless mode from Java is straightforward once you understand the core pieces. With Selenium driving a real browser, you get full JavaScript execution, session handling, and behavior that closely matches what real users see. Compared to older tools like PhantomJS, the modern setup is more stable, better supported, and easier to maintain long term.
The important part is knowing when to use it. A Java headless browser is perfect for dynamic pages, login flows, and SPAs, but it should not be your default for every scraping or automation task. Start small, build something reliable, and only add complexity when the project actually needs it.
Frequently asked questions (FAQs)
Can I log into websites with a Java headless browser?
Yes, you can log into most websites using a Java headless browser by filling out forms, clicking buttons, and letting the browser manage cookies for you. From the site's point of view, this looks very similar to a real user session, which is why headless browsers work well for authenticated areas.
Security features like CSRF tokens are usually handled automatically, because the browser loads the page scripts and hidden fields normally. Things get more complicated with advanced protections like multi-step 2FA, hardware keys, or CAPTCHAs. Some of those flows can be partially automated, but others require manual input or a different approach.
Always respect a website's terms of service and local laws. Just because a headless browser can log in does not mean it is allowed.
If you want a deeper walkthrough of login strategies and edge cases, this guide on how to log in to almost any website is a good next read.
How do I use a Java headless browser for single page applications?
For single page applications, the key is waiting for client-side rendering to finish. SPAs often load a blank shell first and then populate the page with JavaScript, so scraping too early will give you empty or incomplete data.
Instead of relying on page load events, wait for specific elements that indicate the current route is ready. This might be a container unique to that view, a heading that only appears after navigation, or a list that fills in once data arrives. SPAs are one of the main reasons to use a headless browser instead of a simple HTTP client.
If you want more background on the challenges involved, this article on scraping single page applications explains common patterns and pitfalls.
How do I handle AJAX-heavy websites in Java headless mode?
AJAX-heavy sites update the page after background requests complete, often without any visible navigation change. To handle this correctly, you must wait for those updates before extracting data.
Common strategies include waiting for a specific element to appear or disappear, waiting for the number of items in a list to increase, or using explicit waits tied to known loading indicators. A very common bug is scraping immediately after an action, before the DOM has updated, which leads to missing or stale data.
This guide on handling AJAX websites goes deeper into techniques for dealing with asynchronous content reliably.
Do I always need a Java headless browser for web scraping?
No, and you usually should not default to one. If a page is mostly static HTML or the data is available through an API, a normal HTTP client is faster, simpler, and easier to maintain. Many scraping tasks do not need a browser at all.
A Java headless browser is most useful for JavaScript-heavy pages, complex user flows, authenticated areas, and cases where you must closely simulate real user behavior. The trick is knowing when the extra power is actually required, and when it is just unnecessary overhead.

Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.
