Web Scraping with Kotlin

To solve the problem of extracting data from websites efficiently and robustly using Kotlin, here are the detailed steps and considerations:

  1. Understand the Basics: Web scraping involves programmatically downloading and parsing web content to extract specific data. Kotlin, with its conciseness, null safety, and JVM compatibility, is an excellent choice for this.
  2. Choose Your Weapons (Libraries):
    • Jsoup: For parsing HTML. It’s a powerful library for manipulating HTML, extracting data using CSS selectors, and cleaning up HTML. Think of it as your primary tool for navigating the DOM.
    • Ktor Client / OkHttp: For making HTTP requests. Ktor Client offers a modern, asynchronous API, while OkHttp is a tried-and-true workhorse for robust network operations.
    • Kotlinx.serialization / Gson: For parsing JSON if dealing with APIs or serializing data into structured formats.
  3. Step-by-Step Execution:
    • Step 1: Set up Your Project:
      • Create a new Kotlin JVM project using Gradle or Maven.
      • Add the necessary dependencies to your build.gradle.kts file:
        // build.gradle.kts
        dependencies {
            implementation("org.jsoup:jsoup:1.17.2") // Or the latest version
            implementation("io.ktor:ktor-client-core:2.3.9") // Or the latest version
            implementation("io.ktor:ktor-client-cio:2.3.9") // Or another engine like OkHttp
            // If you need JSON parsing
            implementation("org.jetbrains.kotlinx:kotlinx-serialization-json:1.6.3")
            implementation("io.ktor:ktor-client-content-negotiation:2.3.9")
            implementation("io.ktor:ktor-serialization-kotlinx-json:2.3.9")
        }
        
    • Step 2: Fetch the HTML:
      • Use Ktor Client to make an HTTP GET request to the target URL.

      • Handle potential network errors (e.g., IOException, TimeoutException).

      • Example with Ktor:
        import io.ktor.client.*
        import io.ktor.client.engine.cio.*
        import io.ktor.client.request.*
        import io.ktor.client.statement.*
        import kotlinx.coroutines.*

        suspend fun fetchHtml(url: String): String? {
            val client = HttpClient(CIO)
            return try {
                val response: HttpResponse = client.get(url)
                response.bodyAsText()
            } catch (e: Exception) {
                println("Error fetching $url: ${e.message}")
                null
            } finally {
                client.close()
            }
        }

    • Step 3: Parse the HTML:
      • Once you have the HTML string, use Jsoup to parse it into a Document object.

      • Example:
        import org.jsoup.Jsoup
        import org.jsoup.nodes.Document

        fun parseHtml(html: String): Document {
            return Jsoup.parse(html)
        }

    • Step 4: Extract Data using CSS Selectors:
      • This is where Jsoup shines. Inspect the target website’s HTML structure using your browser’s developer tools (F12). Identify unique CSS selectors (IDs, classes, tag names, attributes) to pinpoint the data you want.
        import org.jsoup.nodes.Document
        import org.jsoup.nodes.Element
        import org.jsoup.select.Elements

        fun extractData(document: Document) {
            // Extracting all h2 tags
            val titles: Elements = document.select("h2")
            for (title in titles) {
                println("Title: ${title.text()}")
            }

            // Extracting elements with a specific class
            val productNames: Elements = document.select(".product-title")
            for (name in productNames) {
                println("Product Name: ${name.text()}")
            }

            // Extracting attributes, e.g., href from an anchor tag
            val link: Element? = document.selectFirst("a.read-more-link")
            link?.let {
                println("Read More Link: ${it.attr("href")}")
            }
        }

    • Step 5: Store the Data:
      • Decide on a storage mechanism: CSV, JSON file, a database (SQLite, PostgreSQL), or simply print to the console.

      • For structured data, create Kotlin data classes.

      • Example (simple data class):

        data class Product(val name: String, val price: String, val url: String)

      • Example writing to CSV:
        import java.io.FileWriter
        import java.io.IOException

        fun writeToCsv(products: List<Product>, fileName: String) {
            try {
                FileWriter(fileName).use { writer ->
                    writer.append("Name,Price,URL\n") // CSV header
                    for (product in products) {
                        writer.append("${product.name},${product.price},${product.url}\n")
                    }
                    println("Data successfully written to $fileName")
                }
            } catch (e: IOException) {
                println("Error writing to CSV: ${e.message}")
            }
        }

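      • Example writing to a JSON file (a minimal sketch using kotlinx.serialization; it assumes the serialization plugin and dependency from Step 1 are applied, and the helper name writeToJson is illustrative):

        import kotlinx.serialization.Serializable
        import kotlinx.serialization.encodeToString
        import kotlinx.serialization.json.Json
        import java.io.File

        // Note: Product must be annotated with @Serializable for this to compile:
        // @Serializable
        // data class Product(val name: String, val price: String, val url: String)

        fun writeToJson(products: List<Product>, fileName: String) {
            val json = Json { prettyPrint = true }
            // encodeToString serializes the whole list in one call
            File(fileName).writeText(json.encodeToString(products))
        }
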
  4. Important Considerations (Be a Good Netizen):
    • robots.txt: Always check the website’s robots.txt file (e.g., https://example.com/robots.txt) before scraping. It outlines which parts of the site are disallowed for automated crawlers. Respect it.
    • Terms of Service: Review the website’s Terms of Service. Many sites explicitly forbid scraping.
    • Rate Limiting: Don’t hammer a server with requests. Implement delays between requests (e.g., delay(1000L) with Kotlin Coroutines) to avoid being blocked and to be polite; see the sketch after this list.
    • User-Agent: Set a realistic User-Agent header in your HTTP requests. Some sites block requests without one or with a generic one.
    • Error Handling: Robustly handle network errors, malformed HTML, and missing elements.
    • Dynamic Content (JavaScript): Jsoup and Ktor only fetch raw HTML. If a site relies heavily on JavaScript to load content, you might need a headless browser like Selenium or Playwright (though these add complexity and resource overhead).
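
To put the rate-limiting advice into practice, here is a minimal sketch (the function name and URL list are illustrative) that fetches pages one at a time with a one-second pause and a descriptive User-Agent:

    import io.ktor.client.*
    import io.ktor.client.engine.cio.*
    import io.ktor.client.request.*
    import io.ktor.client.statement.*
    import kotlinx.coroutines.*

    suspend fun politeScrape(urls: List<String>) {
        val client = HttpClient(CIO)
        try {
            for (url in urls) {
                val response = client.get(url) {
                    header("User-Agent", "Mozilla/5.0 (compatible; MyScraper/1.0)")
                }
                println("${response.status} <- $url")
                delay(1000L) // Be polite: wait a second between requests
            }
        } finally {
            client.close() // Release connection resources
        }
    }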

The Fundamentals of Web Scraping: What It Is and Why Kotlin Excels

Web scraping, at its core, is the automated process of extracting data from websites.

Imagine manually copying product names, prices, or article headlines from hundreds of web pages – tedious, error-prone, and painfully slow.

Web scraping automates this chore, allowing programs to browse the web, read the HTML structure, and pull out specific information based on predefined rules.

This collected data can then be stored, analyzed, and used for various purposes.

What is Web Scraping?

Web scraping fundamentally involves two main steps:

  • Fetching: An HTTP client sends a request to a web server, much like your browser does, and receives the raw HTML, CSS, and JavaScript content of a webpage.
  • Parsing: The fetched content is then processed to identify and extract the desired data. This often involves navigating the Document Object Model (DOM) to pinpoint elements based on their tags, classes, IDs, or other attributes.

Historically, web scraping has been used for everything from market research and price comparison to news aggregation and data analysis.

However, it’s crucial to approach scraping with a strong ethical compass, respecting website terms of service and robots.txt directives.

Why Choose Kotlin for Web Scraping?

Kotlin, a modern, statically typed language developed by JetBrains, offers several compelling advantages that make it a fantastic choice for web scraping projects:

  • Conciseness and Readability: Kotlin’s syntax is remarkably clean and expressive, leading to less boilerplate code compared to Java. This means you can write more powerful scraping logic with fewer lines, making your code easier to read and maintain. For instance, data classes in Kotlin are perfect for modeling scraped information with minimal effort.
  • Null Safety: One of Kotlin’s standout features is its built-in null safety. This drastically reduces the dreaded NullPointerException, a common headache in Java development. When dealing with potentially missing elements or attributes on a webpage, Kotlin forces you to explicitly handle nulls, leading to more robust and crash-resistant scrapers. You’re less likely to have your scraper unexpectedly halt due to a missing div or span.
  • JVM Compatibility: Since Kotlin compiles to JVM bytecode, it benefits from the vast and mature Java ecosystem. This means you have access to a plethora of battle-tested Java libraries for HTTP requests (OkHttp, Apache HttpClient), HTML parsing (Jsoup), JSON processing (Gson, Jackson), and database interactions. You don’t have to reinvent the wheel; you can leverage existing solutions.
  • Coroutines for Asynchronous Operations: Web scraping often involves making many network requests, which can be a bottleneck if done synchronously. Kotlin Coroutines provide a lightweight, efficient way to write asynchronous code. This allows you to fetch multiple pages concurrently without blocking the main thread, significantly speeding up your scraping process for large datasets; see the concurrent-fetch sketch after this list. A single Kotlin coroutine is far cheaper than a thread, so thousands can run concurrently where a traditional multithreaded Java application would need a large thread pool.
  • Interoperability with Java: If you have existing Java projects or codebases, Kotlin integrates seamlessly. You can call Java code from Kotlin and vice-versa, making it easy to migrate parts of a project or use Java libraries directly. This flexibility is a huge plus for teams already invested in the JVM ecosystem.
  • Growing Community and Modern Features: Kotlin’s popularity has soared, leading to a vibrant and growing community. This means more resources, tutorials, and support are available. The language itself is actively developed, incorporating modern programming paradigms and features that enhance developer productivity.
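
To make the coroutine advantage concrete, here is a minimal sketch (the function name and URLs are illustrative) of fetching several pages concurrently with Ktor and async/awaitAll:

    import io.ktor.client.*
    import io.ktor.client.engine.cio.*
    import io.ktor.client.request.*
    import io.ktor.client.statement.*
    import kotlinx.coroutines.*

    // Fetch all pages in parallel; a failed request yields null instead of crashing.
    suspend fun fetchAll(urls: List<String>): List<String?> = coroutineScope {
        val client = HttpClient(CIO)
        try {
            urls.map { url ->
                async { runCatching { client.get(url).bodyAsText() }.getOrNull() }
            }.awaitAll()
        } finally {
            client.close()
        }
    }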

In essence, Kotlin provides a modern, safe, and efficient environment for developing web scrapers, allowing you to focus more on data extraction logic and less on boilerplate and runtime errors.

Its excellent integration with established JVM libraries further cements its position as a top contender for web scraping tasks.

Setting Up Your Kotlin Web Scraping Environment

Before you can start pulling data from the web, you need to set up a robust development environment.

This typically involves choosing a build tool, selecting the right libraries for HTTP requests and HTML parsing, and understanding the core structure of a Kotlin project.

A well-configured environment ensures smooth development and fewer headaches down the line.

Choosing Your Build Tool: Gradle vs. Maven

For Kotlin JVM projects, Gradle is the de facto standard and highly recommended, although Maven is also a viable option.

  • Gradle (Recommended): Gradle offers a more flexible and powerful build system. Its build scripts are written in the Kotlin DSL (Domain Specific Language) or Groovy, providing programmatic control over the build process. This flexibility is particularly useful for complex projects, dependency management, and custom tasks.
    • Advantages: Highly customizable, faster incremental builds, excellent dependency management, strong support for the Kotlin DSL, which is type-safe and provides IDE auto-completion.
    • Setup: When creating a new Kotlin project in IntelliJ IDEA, Gradle is usually the default. Your project structure will include build.gradle.kts (for the Kotlin DSL) or build.gradle (for the Groovy DSL).
  • Maven: Maven is an older, more established build tool that uses XML for its pom.xml configuration files. It’s convention-over-configuration, which can simplify setup for standard projects but offers less flexibility than Gradle for custom build logic.
    • Advantages: Widely adopted, large community, extensive plugin ecosystem.
    • Setup: If you prefer Maven, you’ll manage dependencies in pom.xml.

For this guide, we’ll primarily use Gradle with Kotlin DSL, as it aligns perfectly with a modern Kotlin development workflow.

Essential Libraries for Web Scraping

The core of any web scraping project lies in its libraries.

You need tools to fetch web content and tools to parse it effectively.

1. HTTP Client Libraries (for fetching web content)

These libraries handle making HTTP requests (GET, POST, etc.) to fetch the raw HTML from a URL.

  • Ktor Client: Ktor is a modern, asynchronous framework for building connected applications, including HTTP clients. It’s developed by JetBrains, which means excellent integration with Kotlin and coroutines. It’s an excellent choice for concurrent and non-blocking I/O.
    • Dependency in build.gradle.kts:
      
      
      implementation"io.ktor:ktor-client-core:2.3.9" // Core client library
      
      
      implementation"io.ktor:ktor-client-cio:2.3.9"   // CIO engine for coroutine-based I/O
      
      
      // You might choose another engine like OkHttp or Apache
      
      
      // implementation"io.ktor:ktor-client-okhttp:2.3.9"
      
      
      // implementation"io.ktor:ktor-client-apache:2.3.9"
      
    • Key Features: Asynchronous (coroutine-based), extensible, support for various engines, built-in features like retries and content negotiation.
  • OkHttp: A highly efficient and popular HTTP client developed by Square. It’s a synchronous client by default but can be used asynchronously with callbacks or integrated with coroutines. It’s known for its reliability and performance.
    • Dependency:

      implementation("com.squareup.okhttp3:okhttp:4.12.0")

    • Key Features: Connection pooling, GZIP compression, response caching, transparent connection failures.

  • Jsoup (Built-in Fetcher): While Jsoup is primarily an HTML parser, it also has a basic built-in HTTP fetcher (Jsoup.connect(url).get()). For simple, single-page scrapes, this might suffice. However, for more complex scenarios, concurrent requests, or custom headers, a dedicated HTTP client like Ktor or OkHttp is superior.
    implementation("org.jsoup:jsoup:1.17.2")
    • Key Features: Simple to use for quick fetches, but limited control compared to dedicated clients.

Recommendation: For modern Kotlin development, especially if you plan to leverage coroutines for concurrent scraping, Ktor Client is an excellent choice. If you prefer a more traditional, robust client, OkHttp is a solid alternative.
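
For comparison, a minimal blocking fetch with OkHttp might look like the sketch below (the function name is illustrative, and it assumes the OkHttp dependency shown above):

    import okhttp3.OkHttpClient
    import okhttp3.Request

    fun fetchWithOkHttp(url: String): String? {
        val client = OkHttpClient()
        val request = Request.Builder()
            .url(url)
            .header("User-Agent", "Mozilla/5.0 (compatible; MyScraper/1.0)")
            .build()
        // execute() is synchronous; use {} closes the response when done
        return client.newCall(request).execute().use { response ->
            if (response.isSuccessful) response.body?.string() else null
        }
    }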

2. HTML Parsing Libraries (for processing web content)

Once you have the raw HTML, you need a library to parse it into a traversable structure like a DOM and extract data.

  • Jsoup (The Undisputed King): Jsoup is a Java library specifically designed for working with real-world HTML. It provides a very convenient API for parsing, manipulating, and extracting data using familiar DOM, CSS, and jQuery-like selectors. It’s robust enough to handle malformed HTML, which is very common on the web.

    implementation"org.jsoup:jsoup:1.17.2" // Always check for the latest stable version
    
    • Key Features: CSS selector support, DOM traversal, HTML cleaning, powerful element selection, handles various HTML encodings. This will be your primary tool for navigating the parsed webpage.

3. JSON Parsing Libraries (if scraping APIs or structured data)

Sometimes, websites load data via JavaScript from internal APIs that return JSON.

If you’re targeting such endpoints, you’ll need a JSON parser.

  • Kotlinx.serialization: JetBrains’ own serialization framework. It’s type-safe and works seamlessly with Kotlin’s data classes, making it the most “Kotlin-idiomatic” choice for JSON.

    • Dependencies:
      // For JSON serialization/deserialization
      implementation("org.jetbrains.kotlinx:kotlinx-serialization-json:1.6.3")

      // If using with Ktor Client for content negotiation
      implementation("io.ktor:ktor-client-content-negotiation:2.3.9")
      implementation("io.ktor:ktor-serialization-kotlinx-json:2.3.9")

  • Gson: Google’s JSON library. It’s very popular and widely used in the Java/Android ecosystem. It uses reflection, which can be slower than kotlinx.serialization but requires less setup.

    implementation"com.google.code.gson:gson:2.10.1"
    
  • Jackson: A powerful, high-performance JSON processor. It’s often chosen for enterprise-level applications due to its flexibility and performance.

    implementation"com.fasterxml.jackson.core:jackson-databind:2.16.1"
    

Recommendation: For pure Kotlin projects, kotlinx.serialization is the preferred choice due to its type safety and native Kotlin integration.
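
As a quick illustration of that workflow, here is a sketch of decoding a JSON payload into a data class (the ApiProduct class and sample payload are invented for the example, and the kotlinx.serialization Gradle plugin must be applied):

    import kotlinx.serialization.*
    import kotlinx.serialization.json.Json

    @Serializable
    data class ApiProduct(val name: String, val price: Double)

    fun main() {
        val payload = """[{"name":"Widget","price":9.99},{"name":"Gadget","price":19.99}]"""
        // Type-safe decoding straight into Kotlin data classes
        val products: List<ApiProduct> = Json.decodeFromString(payload)
        products.forEach { println("${it.name}: ${it.price}") }
    }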

Setting Up Your Project with Gradle & IntelliJ IDEA

  1. Create a New Project:
    • Open IntelliJ IDEA.
    • Select File > New > Project...
    • Choose New Project.
    • In the wizard:
      • Name: KotlinWebScraper (or your preferred name)
      • Location: Choose a directory.
      • Language: Kotlin
      • Build System: Gradle
      • JDK: Select a recent JDK (e.g., OpenJDK 17 or higher).
      • Kotlin DSL: Ensure this is checked for build.gradle.kts.
  2. Add Dependencies:
    • Once the project is created, open the build.gradle.kts file located in your project root.

    • Add the chosen dependencies within the dependencies { ... } block. A typical setup for basic HTML scraping would look like this:
      plugins {
          kotlin("jvm") version "1.9.22" // Use your current Kotlin version
          application
          // If using kotlinx.serialization
          id("org.jetbrains.kotlin.plugin.serialization") version "1.9.22"
      }

      group = "com.yourcompany"
      version = "1.0-SNAPSHOT"

      repositories {
          mavenCentral()
      }

      dependencies {
          // Ktor Client for HTTP requests
          implementation("io.ktor:ktor-client-core:2.3.9")
          implementation("io.ktor:ktor-client-cio:2.3.9") // CIO engine

          // Jsoup for HTML parsing
          implementation("org.jsoup:jsoup:1.17.2")

          // Kotlinx.serialization (optional, if you're dealing with JSON)
          implementation("org.jetbrains.kotlinx:kotlinx-serialization-json:1.6.3")
          implementation("io.ktor:ktor-client-content-negotiation:2.3.9")
          implementation("io.ktor:ktor-serialization-kotlinx-json:2.3.9")

          // Kotlin Coroutines for async operations (typically pulled in by Ktor).
          // If you need them explicitly for other async tasks:
          // implementation("org.jetbrains.kotlinx:kotlinx-coroutines-core:1.8.0")

          testImplementation(kotlin("test"))
      }

      kotlin {
          jvmToolchain(17) // Or your chosen JDK version
      }

      application {
          mainClass.set("MainKt") // Replace with your main class if different
      }
      
  3. Sync Gradle: After adding dependencies, IntelliJ IDEA will usually prompt you to “Load Gradle Changes” or “Sync Now”. Click this to download the libraries.
  4. Create Your Main File: In the src/main/kotlin directory, you’ll find a Main.kt file (or similar). This is where your application’s entry point (fun main) will reside.

With this setup, your Kotlin environment is ready to start fetching and parsing web data.

Fetching Web Content with Ktor Client

Fetching the raw HTML content from a web page is the first critical step in any web scraping operation.

A robust HTTP client is essential for this, capable of handling various network conditions, redirects, and custom headers.

Ktor Client, with its native Kotlin support and coroutine integration, is an excellent choice for modern web scraping in Kotlin.

Understanding HTTP Requests in Web Scraping

When you type a URL into your browser, it sends an HTTP GET request to the web server, which then responds with the webpage’s content. A web scraper mimics this behavior.

However, unlike a browser, a scraper needs to handle specific aspects programmatically:

  • URLs: Ensuring the URL is correctly formed and accessible.
  • Response Codes: Checking for successful responses (e.g., 200 OK) and handling errors (e.g., 404 Not Found, 500 Internal Server Error).
  • Headers: Setting appropriate headers, such as User-Agent, Referer, or Accept-Language, which can influence how a server responds. Some websites block requests that don’t look like they’re coming from a real browser.
  • Timeouts: Preventing the scraper from hanging indefinitely if a server is slow or unresponsive.
  • Retries: Implementing logic to reattempt requests that temporarily fail; see the plugin sketch after this list.
  • Proxies: In advanced scenarios, using proxies to mask your IP address or route requests through different locations.
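
Ktor ships plugins that address the timeout and retry points above; a minimal configuration sketch (Ktor 2.x plugin names) looks like this:

    import io.ktor.client.*
    import io.ktor.client.engine.cio.*
    import io.ktor.client.plugins.*

    val client = HttpClient(CIO) {
        install(HttpTimeout) {
            requestTimeoutMillis = 10_000 // Fail fast instead of hanging
        }
        install(HttpRequestRetry) {
            retryOnServerErrors(maxRetries = 3) // Retry transient 5xx failures
            exponentialDelay()                  // Back off between attempts
        }
    }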

Basic GET Request with Ktor Client

Let’s walk through the fundamental steps to fetch HTML using Ktor Client.

1. Initialize HttpClient

You need an instance of HttpClient. This client should ideally be reused across multiple requests to benefit from connection pooling and other optimizations.

import io.ktor.client.*
import io.ktor.client.engine.cio.* // Or OkHttp, Apache, etc.
import io.ktor.client.plugins.*    // For defaultRequest
import io.ktor.client.request.*
import io.ktor.client.statement.*
import io.ktor.http.* // For HTTP status codes
import kotlinx.coroutines.*

suspend fun fetchHtmlContent(url: String): String? {
    // Create a client instance. For simple cases, you can create it per function,
    // but for larger applications, ideally, HttpClient should be a singleton or managed.
    val client = HttpClient(CIO) {
        // Optional: Configure the client
        engine {
            // Configure specific engine properties, e.g., for CIO
            requestTimeout = 10_000 // 10 seconds timeout for the entire request
        }
        defaultRequest {
            // User-Agent: Crucial for many websites. Mimic a common browser.
            header(HttpHeaders.UserAgent, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
            header(HttpHeaders.Accept, "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
            header(HttpHeaders.AcceptLanguage, "en-US,en;q=0.5")
            // Other headers like Referer if needed
        }
    }

    try {
        val response: HttpResponse = client.get(url)
        if (response.status.isSuccess()) { // Check for 2xx status codes
            println("Successfully fetched $url. Status: ${response.status}")
            return response.bodyAsText()
        } else {
            println("Failed to fetch $url. Status: ${response.status.value} - ${response.status.description}")
            return null
        }
    } catch (e: Exception) {
        println("An error occurred while fetching $url: ${e.message}")
        // Log the full exception for debugging
        e.printStackTrace()
        return null
    } finally {
        // It's important to close the client to release resources, especially if created per request.
        // If HttpClient is a singleton, close it when your application shuts down.
        client.close()
    }
}

fun main() = runBlocking {
    val targetUrl = "https://example.com" // Replace with your target URL
    val html = fetchHtmlContent(targetUrl)
    if (html != null) {
        println("\n--- Fetched HTML (first 500 chars) ---")
        println(html.take(500) + "...")
        // Now you can pass 'html' to Jsoup for parsing
    } else {
        println("Could not fetch HTML from $targetUrl")
    }
}

Code Breakdown:

  • HttpClient(CIO): Creates an HTTP client using the CIO (Coroutine-based I/O) engine. Ktor supports other engines like OkHttp or Apache, which you can swap in based on your preference or existing dependencies.
  • engine { requestTimeout = 10_000 }: Sets a timeout for the entire request. This prevents your scraper from freezing if a server is unresponsive. A 10-second timeout (10,000 milliseconds) is often a good starting point.
  • defaultRequest { ... }: This block allows you to set default headers that will be applied to all requests made by this HttpClient instance.
    • HttpHeaders.UserAgent: This header is critically important. Many websites inspect the User-Agent to determine if a request comes from a legitimate browser or an automated bot. A generic User-Agent, or the absence of one, is a red flag. Using a string like "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36", which mimics a recent Chrome browser, makes your requests appear more legitimate.
    • HttpHeaders.Accept and HttpHeaders.AcceptLanguage: These also mimic browser behavior, indicating what content types and languages your client prefers.
  • client.get(url): Performs an HTTP GET request to the specified url. This is a suspend function, meaning it needs to be called within a coroutine scope (e.g., runBlocking or another suspend function).
  • response.status.isSuccess(): Checks if the HTTP status code indicates success (e.g., 200 OK, 201 Created). Ktor provides convenient HttpStatusCode enums.
  • response.bodyAsText(): Extracts the response body (the HTML content in this case) as a String.
  • try-catch-finally: Essential for robust error handling.
    • try: Contains the main logic that might throw exceptions.
    • catch (e: Exception): Catches any Exception that occurs during the network request (e.g., IOException for network issues, SocketTimeoutException for timeouts). It’s good practice to log the error.
    • finally: Ensures client.close() is called to release network resources, regardless of whether an error occurred. For a client created per function, this is crucial. For a singleton client, you’d close it on application shutdown.
  • runBlocking: A coroutine builder that blocks the current thread until all coroutines inside it complete. Used here for simplicity in main to call the suspend function fetchHtmlContent. For production applications, you’d integrate it into a non-blocking architecture.
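
The breakdown above mentions managing HttpClient as a singleton; a minimal sketch of that pattern (the object name is illustrative) is:

    import io.ktor.client.*
    import io.ktor.client.engine.cio.*

    // Created once, shared by every request, closed at shutdown.
    object Scraper {
        val client: HttpClient = HttpClient(CIO)

        fun shutdown() = client.close() // Call once, when the application exits
    }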

Handling Dynamic Content (JavaScript)

It’s important to understand a limitation of Ktor Client and Jsoup: they only fetch the initial HTML returned by the server. If a website heavily relies on JavaScript to load content after the initial page load (e.g., dynamically rendered content, AJAX calls, single-page applications), Ktor Client alone will not see that content.

  • Scenario: Many modern websites use JavaScript to fetch data from APIs and render it on the page. If the data you need appears only after JavaScript execution, a simple HTTP client won’t suffice.
  • Solution: For such cases, you need a headless browser (e.g., Selenium, Playwright). These tools launch a real web browser (like Chrome or Firefox) in the background, execute JavaScript, and then allow you to interact with the fully rendered DOM. While powerful, headless browsers are significantly more resource-intensive and slower than direct HTTP client calls. We will explore this in a later section.

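To give a flavor of the headless-browser approach ahead of that section, here is a rough sketch using Playwright for Java (an assumption on our part: it requires the com.microsoft.playwright:playwright dependency, which is not part of the setup above):

    import com.microsoft.playwright.Playwright

    // Launches headless Chromium, runs the page's JavaScript, returns rendered HTML.
    fun fetchRenderedHtml(url: String): String {
        Playwright.create().use { playwright ->
            val browser = playwright.chromium().launch()
            val page = browser.newPage()
            page.navigate(url)
            return page.content() // HTML after JavaScript execution
        }
    }
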
For now, focus on mastering direct HTML fetching, as a significant amount of web data is still accessible this way.

Parsing HTML with Jsoup: Your Data Extraction Workhorse

Once you’ve successfully fetched the HTML content of a webpage, the next crucial step is to parse it and extract the specific data you need. For Kotlin web scraping, Jsoup is the go-to library for this task. It provides an intuitive and powerful API for parsing HTML, navigating the DOM, and selecting elements using CSS selectors, much like you would with jQuery or in a browser’s developer console.

The Power of Jsoup

Jsoup is designed to work with real-world HTML, including malformed HTML.

It parses HTML into a Document object, which represents the page’s DOM tree.

From this Document, you can then use various methods to traverse the tree and select elements.

Core Concepts of Jsoup:

  • Document: Represents the entire HTML page, the root of the DOM tree.
  • Element: Represents a single HTML tag (e.g., <div>, <a>, <p>).
  • Elements: A collection of Element objects, typically returned when multiple elements match a selector.
  • CSS Selectors: The most powerful way to find elements. Jsoup supports a rich set of CSS selectors, making it easy to target specific parts of the page.

Basic HTML Parsing and Element Selection

Let’s illustrate how to use Jsoup with a simple HTML string.

import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element
import org.jsoup.select.Elements

fun parseAndExtractData(htmlContent: String) {
    // 1. Parse the HTML string into a Document object
    val document: Document = Jsoup.parse(htmlContent)
    println("--- Document Title ---")
    println(document.title()) // Get the title of the page

    // 2. Select elements using CSS selectors

    // Example 1: Select all <h2> tags
    println("\n--- All H2 Titles ---")
    val h2Elements: Elements = document.select("h2")
    if (h2Elements.isNotEmpty()) {
        for (h2 in h2Elements) {
            println("H2 Text: ${h2.text()}") // .text() gets the visible text
        }
    } else {
        println("No h2 tags found.")
    }

    // Example 2: Select an element by its ID
    println("\n--- Element by ID '#main-content' ---")
    val mainContentDiv: Element? = document.selectFirst("#main-content")
    mainContentDiv?.let {
        println("Main Content HTML:\n${it.html().take(100)}...") // .html() gets the inner HTML
        println("Main Content Text: ${it.text().take(100)}...")  // .text() gets the visible text
    } ?: println("Element with ID 'main-content' not found.")

    // Example 3: Select elements by class name
    println("\n--- Elements by Class Name '.product-item' ---")
    val productItems: Elements = document.select(".product-item")
    if (productItems.isNotEmpty()) {
        for (item in productItems) {
            val productName = item.selectFirst(".product-name")?.text() ?: "N/A"
            val productPrice = item.selectFirst(".product-price")?.text() ?: "N/A"
            println("Product: $productName, Price: $productPrice")
        }
    } else {
        println("No elements with class 'product-item' found.")
    }

    // Example 4: Select elements with specific attributes
    println("\n--- Links with 'data-category' attribute ---")
    val categoryLinks: Elements = document.select("a[data-category]")
    if (categoryLinks.isNotEmpty()) {
        for (link in categoryLinks) {
            val href = link.attr("href") // Get attribute value
            val category = link.attr("data-category")
            println("Link Href: $href, Category: $category")
        }
    } else {
        println("No links with 'data-category' attribute found.")
    }

    // Example 5: Chained selection (finding an element within another)
    println("\n--- Paragraph inside a div with class 'description' ---")
    val descriptionPara: Element? = document.selectFirst("div.description p")
    descriptionPara?.let {
        println("Description: ${it.text()}")
    } ?: println("No paragraph found inside 'div.description'.")
}

fun main() {
    // Abbreviated sample page exercising the selectors used above
    val sampleHtml = """
        <html>
        <head><title>Sample Product Page</title></head>
        <body>
          <h2>Welcome to Our Store</h2>
          <div id="main-content">
            <div class="product-item">
              <span class="product-name">Blue Widget</span>
              <span class="product-price">9.99</span>
            </div>
            <a href="/widgets" data-category="widgets">All Widgets</a>
            <div class="description"><p>Quality goods at fair prices.</p></div>
          </div>
        </body>
        </html>
    """.trimIndent()

    parseAndExtractData(sampleHtml)
}