
Scaling AI to Long Web Pages: Smarter Website Classification with Lightweight Strategies

August 18, 2025
Zohir Koufi

Website classification is the silent engine behind many digital services — from filtering harmful content to powering search relevance and compliance enforcement. But as web content grows longer and more complex, even state-of-the-art AI models struggle to keep up.

At the heart of this challenge is a simple fact: transformer-based models like BERT and RoBERTa can’t handle long web pages without cutting corners. But what if there were a smarter way to help them handle long pages, without adding computational overhead or reinventing the architecture?

That’s exactly what we set out to solve.

The Problem: AI Models Aren’t Built for Long Sequences

Models like BERT and RoBERTa have revolutionized NLP, but they have a hard cap: they can only process inputs of up to 512 tokens. For web pages, which often run to thousands of tokens, this means truncating content and losing context. Some recent models (like Longformer or BigBird) were designed to handle longer texts, but they come with high computational cost and longer training times.
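
To make that cap concrete, here is a quick sketch using Hugging Face’s transformers library; the tokenizer choice is illustrative:

```python
# A quick look at the 512-token hard cap, using Hugging Face's
# transformers library (the tokenizer choice is illustrative).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_page = "word " * 5000  # far longer than the model can see

encoded = tokenizer(long_page, truncation=True, max_length=512)
print(len(encoded["input_ids"]))  # 512 -- everything past that is dropped
```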

For companies that rely on real-time classification or need to scale across thousands of domains, these solutions aren’t always practical.

Our Solution: Weighted Stratified Split Approach (WSSA)

Rather than modifying the model architecture, we took a data-centric approach. We created a lightweight preprocessing technique that works with existing transformer models and boosts their performance on long web pages.

Here’s how it works:

  1. Chunking: Each long web page is split into multiple chunks of ~500 tokens, small enough to fit into standard transformers (see the sketch after this list).
  2. Smart Sampling: We apply a weighted stratified split that balances the training data by page length and category distribution (sketched a little further below). This avoids biasing the model toward short or long pages.
  3. Chunk Voting: At inference, each chunk contributes a category prediction (‘votes’), and the final label is based on majority voting, giving the full page a voice without overwhelming the model.
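
To make the chunking and voting steps concrete, here is a minimal sketch in Python, assuming Hugging Face’s transformers library. The checkpoint name is a placeholder for a BERT or RoBERTa model already fine-tuned on your categories, and chunk_page and classify_page are illustrative helper names, not published code.

```python
# A minimal sketch of the chunking (step 1) and voting (step 3) ideas.
# "your-finetuned-checkpoint" stands in for a BERT/RoBERTa classifier
# already fine-tuned on your categories; helper names are illustrative.
from collections import Counter
from transformers import AutoTokenizer, pipeline

CHECKPOINT = "your-finetuned-checkpoint"
CHUNK_SIZE = 500  # ~500 tokens, leaving room for special tokens

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
classifier = pipeline("text-classification", model=CHECKPOINT)

def chunk_page(text: str) -> list[str]:
    """Split a long page into ~500-token chunks that fit a 512-token model."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return [tokenizer.decode(ids[i:i + CHUNK_SIZE])
            for i in range(0, len(ids), CHUNK_SIZE)]

def classify_page(text: str) -> str:
    """Let every chunk vote; the page label is the most common prediction."""
    votes = [classifier(chunk, truncation=True)[0]["label"]
             for chunk in chunk_page(text)]
    return Counter(votes).most_common(1)[0][0]
```

A confidence-weighted vote (summing each chunk’s score per label) is a natural variant if ties are a concern.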

This method allows BERT and RoBERTa to handle longer content more effectively, while keeping training and inference times reasonable.
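
On the training side, the weighted stratified split (step 2) can be approximated in scikit-learn by stratifying on a combined category-and-length key. The bucket boundaries below are our own illustrative choice, not values from the study:

```python
# A minimal sketch of a weighted stratified split, assuming scikit-learn.
# Stratifying on (category, length bucket) keeps both the category mix and
# the short/long page mix balanced across train and test sets.
from sklearn.model_selection import train_test_split

def length_bucket(n_tokens: int) -> str:
    """Coarse page-length buckets used as part of the stratification key
    (the 512/2048 boundaries are illustrative)."""
    if n_tokens <= 512:
        return "short"
    if n_tokens <= 2048:
        return "medium"
    return "long"

def weighted_stratified_split(pages, labels, token_counts, test_size=0.2):
    """Split so each (category, length-bucket) stratum keeps its share."""
    strata = [f"{label}|{length_bucket(n)}"
              for label, n in zip(labels, token_counts)]
    return train_test_split(pages, labels, test_size=test_size,
                            stratify=strata, random_state=42)
```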

Why It Works — And What We Found

We tested this approach on a real-world dataset of over 3,000 websites across 10 categories. Compared to traditional data splitting methods, WSSA led to:

  • Up to 4% higher accuracy for standard BERT and RoBERTa models.
  • Faster fine-tuning compared to Longformer and BigBird.
  • Lower inference latency, enabling near real-time classification.

Even more interesting: our lightweight setup outperformed specialized long-document transformers like Longformer and BigBird in many cases — especially when evaluating both the index page and surrounding web pages.

Business Benefits: Efficiency Without Compromise

For enterprise teams working on:

  • Web security and filtering
  • Content compliance and categorization
  • Information retrieval and user personalization

…this approach offers a high-accuracy, resource-efficient solution that scales.

By leveraging the WSSA strategy, businesses can continue using reliable transformer models (like BERT) without needing heavy infrastructure upgrades or long retraining cycles. It’s a plug-and-play boost for classification performance, ideal for teams looking to do more with less.

What’s Next

We’re actively exploring how this approach extends beyond websites to domains like document classification, legal text analysis, and even long-form chat transcripts.

As models get smarter and data gets longer, it’s clear that smart preprocessing can be just as powerful as smart architectures. And with WSSA, we’re bringing that power to the forefront — helping businesses unlock the full potential of AI, chunk by chunk.

About the author

R&D Project Manager | France
Zohir is an AI researcher and Project Manager at SogetiLabs, specializing in synthetic data generation and bridging theoretical research with industrial applications. With a background in Computer Vision, Data Science, and an Industrial PhD in AI, his work has led to innovative solutions and international publications.
