Clean, Complete Syntactic and Structured data = Accurate Chatbot Responses 🤖 = Improved CX 😊

Happy Customer

Introduction

When developing a RAG (Retrieval-Augmented Generation) based chatbot for an industry vertical or enterprise, the knowledge base is critical to the quality of its responses. Depending on the use case, much of that content comes from the company website, which lists the complete product range along with its specifications and features.

Customized Strategy

Automated web scraping is pivotal to keeping the data consistent and the process efficient. But a one-size-fits-all solution rarely works. Websites and portals are built on different CMSs (Content Management Systems) such as WordPress, Drupal, Joomla and Magento, with different plugins, elements and tools, and by people with diverse skill sets. Rudimentary chunking of data misses valuable information and its context, leading to inaccurate responses, which in turn confuse the customer.
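
To make the point about lost context concrete, here is a minimal sketch of heading-aware chunking using only Python's standard library. It is an illustration under stated assumptions, not our production scraper: the names `HeadingAwareChunker` and `chunk_with_headings` are hypothetical, the markup is assumed to be well formed, and only h1 to h6 tags are treated as headings. Each paragraph or list item is emitted in page order together with the nearest heading above it, and superscripts are kept as bracketed markers.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCK_TAGS = {"p", "li"}


class HeadingAwareChunker(HTMLParser):
    """Collects text blocks in page order, each paired with its nearest heading."""

    def __init__(self):
        super().__init__()
        self.chunks = []            # (heading, text) pairs in page order
        self.current_heading = ""
        self._tag = None            # block or heading tag currently being captured
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._tag is None:
            if tag in HEADING_TAGS or tag in BLOCK_TAGS:
                self._tag = tag
                self._buffer = []
        elif tag == "sup":
            # Keep superscripts visible instead of gluing them to the previous word.
            self._buffer.append(" [")

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "sup" and self._tag is not None:
            self._buffer.append("]")
            return
        if tag != self._tag:
            return                  # inline tag such as b or strong inside a block
        text = " ".join("".join(self._buffer).split())
        self._tag = None
        if not text:
            return
        if tag in HEADING_TAGS:
            self.current_heading = text
        else:
            self.chunks.append((self.current_heading, text))


def chunk_with_headings(html):
    """Returns one string per block, prefixed with its heading when there is one."""
    parser = HeadingAwareChunker()
    parser.feed(html)
    parser.close()
    return [f"{h} > {t}" if h else t for h, t in parser.chunks]
```

Fed a page with two product sections, each "Battery life" paragraph comes back labelled with its own product name, so the LLM never has to guess which product a figure belongs to. Real sites need more rules than this, which is exactly why the strategy has to be customized per website.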

Learnings

My team recently built a customized web scraper that delivered excellent results, so I am sharing what we learned:

  1. Include all contextually relevant headers. They usually appear in heading (h1 to h6), b and strong HTML tags, but may also appear inside a paragraph (p) tag set off by an empty line above and below.
  2. The sequence in which the data appears is important for capturing meaning and association.
  3. Extract table data along with its headers to convey relevance to the LLM. Account for the various kinds of tables, including those with multiple row spans and column spans.
  4. Keep bulleted and numbered list items syntactically separate.
  5. Handle data spread across multiple tables and associate related tables to get the best outcomes.
  6. Test lexical understanding of industry-specific terms.
  7. Don’t miss superscript data. It often carries disclaimers or sources, which are valuable from a compliance and legal perspective.
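
The table point is the one that most often goes wrong, so here is a minimal sketch of how spanned cells can be expanded before the data reaches the LLM. It uses only Python's standard library; `TableGridParser` and `table_to_records` are hypothetical names for illustration, and the sketch assumes a single, well-formed table whose first row is the header row. Cells with rowspan or colspan are repeated into every position they cover, so each data row can be paired with its headers.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Expands one HTML table into a grid, repeating cells that span rows or columns."""

    def __init__(self):
        super().__init__()
        self.grid = []        # finished rows, each a list of cell strings
        self._row = None      # column index -> text for the row being built
        self._pending = {}    # column index -> (text, rows still to fill) from rowspans
        self._cell = None     # [text parts, rowspan, colspan] for the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {}
            # Carry down cells that span into this row from the rows above.
            for col, (text, remaining) in list(self._pending.items()):
                self._row[col] = text
                if remaining > 1:
                    self._pending[col] = (text, remaining - 1)
                else:
                    del self._pending[col]
        elif tag in ("td", "th") and self._row is not None:
            a = dict(attrs)
            self._cell = [[], int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            text = " ".join("".join(parts).split())
            col = 0
            for _ in range(colspan):
                while col in self._row:      # skip columns that are already filled
                    col += 1
                self._row[col] = text
                if rowspan > 1:
                    self._pending[col] = (text, rowspan - 1)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.grid.append([self._row[c] for c in sorted(self._row)])
            self._row = None


def table_to_records(html):
    """Pairs every data row with the header row, giving one dict per row."""
    parser = TableGridParser()
    parser.feed(html)
    parser.close()
    if not parser.grid:
        return []
    header, *rows = parser.grid
    return [dict(zip(header, row)) for row in rows]
```

With this, a model name that spans two variant rows is repeated on both, so "Pro" and its price are never separated from the product they describe. Multi-row headers, nested tables and repeated header names would each need extra handling beyond this sketch.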

Providing the chatbot with comprehensive content, with the right context and meaning, equips it to address user queries appropriately, thereby improving CX.

The author of this article is Elizabeth Lewis, CEO and Founder of Beaconcross Technologies Private Limited. You can connect with her on LinkedIn – https://www.linkedin.com/in/lewis-elizabeth/