The key to implementing generative AI/RAG systems in business - Unpacking the grand prize-winning work from the AI Hackathon “2nd AI Challenge DAY” (co-hosted by Microsoft Japan and Kadokawa ASCII Research Laboratories)

Insight
Oct 17, 2024
  • Technology Transformation
  • AI

Since OpenAI released the large language models (LLMs) GPT-3.5 and ChatGPT in 2022, the way companies use AI has changed dramatically. Such LLMs acquire general knowledge through pre-training, but they have not learned company-specific information or business terminology, which limits their usefulness in business contexts as-is. One mechanism for supplying company-specific terminology and information to an LLM is retrieval-augmented generation (RAG), and tuning and implementing a RAG system to suit business needs is key to putting it to practical use.

ABeam Consulting participated in the AI hackathon "2nd AI Challenge DAY," co-hosted by Microsoft Japan and Kadokawa ASCII Research Laboratories on June 11-12, 2024, with the aim of sharpening our practical skills in RAG system implementation. Our team was awarded the grand prize, placing first among the 10 participating companies. Our app, built by drawing on our extensive experience supporting the practical use of LLM/RAG systems, received high praise from the judges.

In this Insight, we explain the key points for implementing a generative AI/RAG system in business, which we put into practice through our participation in this event.

Photo: Team members from ABeam Consulting, the winners of the grand prize

About the Author

  • Kei Sugimoto

    Director

Event overview & RAG system description

First, we will provide an outline of this event and describe the RAG system.

Event overview

This event was a hackathon*1 co-hosted by Microsoft Japan and Kadokawa ASCII Research Laboratories, focusing on the implementation of RAG systems. Over the course of two days, June 11-12, 10 participating teams built RAG systems based on themes and data provided by the organizers, and these systems were then evaluated by a panel of judges. The theme for this event was "World Heritage Travel Assistant," with the task of building a travel assistant that introduces World Heritage sites in Japan. The system was required to answer user questions about World Heritage sites posed not only as text but also with attached photo data, such as "Where is the temple or shrine in this picture?"
The training data and evaluation data for measuring the accuracy of the RAG system were provided by the organizers. The evaluation criteria included not only the accuracy of the RAG system but also customer stories (what kind of experience the app provides to which users) and implementation considerations at an enterprise-grade level.

Figure 1 RAG system construction theme (provided by event organizers)

*1 A hackathon is generally a competition in which programmers and creators get together and try to develop software and services while exchanging ideas within a set theme and time limit.

Overview of generic RAG system

A RAG system is a method that supplements an LLM's response generation with external knowledge, such as business terminology and documents the LLM has not learned, retrieved from a search database called a vector database.
Compared to fine-tuning, which retrains the LLM itself on external knowledge, a RAG system can achieve the response accuracy users need while keeping construction costs low, can explicitly show which sources were referenced to generate a response, and can suppress hallucination (the generation of information not based on facts). For these reasons, RAG has become the primary choice for LLM business applications.

Figure 2 shows the question-answer processing flow of a typical RAG system. Here, we assume that all external knowledge is in the form of text data, which is easy for an LLM to handle.
The user's question (in Figure 2, "What was our business profit in the fiscal year ending March 2023?") clearly concerns company-specific information the LLM has not learned. To answer it, the company-specific information, which is external data, must be converted into sentence vectors*2 in advance and stored in a vector database. With these vectors stored, the vast amount of company-specific information can be searched for entries highly similar to the user's question, and a response can be generated from them.
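As an illustration, the similarity search at the heart of this flow can be sketched as follows. This is a minimal in-memory example, not the actual system: the vectors are toy values, and a production system would use an embedding model and a dedicated vector store.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sentence vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search(question_vec, vector_db, top_k=3):
    """Return the top-k stored passages most similar to the question vector."""
    scored = sorted(vector_db, key=lambda item: cosine(question_vec, item["vec"]),
                    reverse=True)
    return [item["text"] for item in scored[:top_k]]

# Toy vector database: each entry pairs a passage with its precomputed vector.
vector_db = [
    {"text": "FY2023/3 business profit was 12.0 billion yen.", "vec": [0.9, 0.1, 0.0]},
    {"text": "The company was founded in 1981.",               "vec": [0.1, 0.9, 0.2]},
]
hits = search([0.8, 0.2, 0.1], vector_db, top_k=1)
```

The retrieved passages are then inserted into the prompt sent to the LLM, which generates the final answer grounded in them.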

The process of converting external knowledge, including company information, into text format that an LLM can easily handle and registering it in a vector database before the RAG system goes into actual use is referred to in this Insight as RAG system "data preprocessing."
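A minimal sketch of this preprocessing step is shown below; `embed` is a hypothetical stand-in for a sentence-embedding model, and the fixed-size character chunking is a simple illustrative strategy (real systems often split on headings or sentences instead).

```python
def chunk(text, size=200):
    """Split a document into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def preprocess(documents, embed):
    """Convert documents into (vector, text) entries for a vector database.

    `embed` is a placeholder for a sentence-embedding model: it maps a
    text chunk to a numerical vector.
    """
    vector_db = []
    for doc in documents:
        for piece in chunk(doc):
            vector_db.append({"vec": embed(piece), "text": piece})
    return vector_db
```

Each stored entry keeps the original text alongside its vector so that, at query time, the matched passages can be handed to the LLM verbatim.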

The explanation above assumes that all external knowledge is in the form of text data. If image data is included in the search target, as in the theme of this event, additional preprocessing and architecture are required. The RAG architecture constructed at this event is explained under Key Point 3 below.

Figure 2 Typical processing flow of RAG system

In the following, we will explain the key points of RAG system construction, focusing on the details of our team's efforts and deliverables at this event.

*2 Sentence vectorization (embedding) is a technique of arranging sentences with similar meanings close to each other by converting the contents of sentences and documents into numerical vectors.

Key Point 1: RAG system construction and verification process

The RAG system implementation process has three stages: “1. Discussions about potential usages and use case studies,” “2. Technology verification through PoC and agile improvements,” and “3. Full deployment,” and the key to implementation is the second stage (Figure 3).

The “1. Discussions about potential usages and use case studies” stage involves identifying target users, issues to be solved, objectives for introduction, and use cases through discussions, and examining the functions and UI overview of the RAG system based on prior case examples, etc. At this event, this was implemented in the form of a “customer story study.” The details will be explained under Key Point 2.

After the functions and UI overview of the app are settled, "2. Technology verification through PoC and agile improvements" is carried out. A RAG system uses different preprocessing and search methods depending on the use case and the data being handled, so establishing a uniform construction method is difficult. An effective approach is to complete an initial build quickly and then improve functionality by repeating accuracy evaluation and tuning.

Figure 3 Development process of RAG system at event

When initially constructing the RAG system at this event, we utilized our construction know-how accumulated through various previous projects to implement the core technology and MVP app*3 of the RAG system in a short period of time (Figure 4). This allowed us to start the agile accuracy improvement process of the RAG system early, so we could spend more time improving the app.
Even for construction of a generic RAG system, it is crucial to implement the MVP of the RAG system early based on such an agile approach, and optimize the app from the perspectives of both improving accuracy and solving business challenges.

Figure 4 RAG architecture of MVP app

When we evaluated the response accuracy of the MVP app that we built at the beginning of this event, three challenges were identified that required improvement, and we worked to address them.

Challenge 1: Responses to questions presumably involving an image attached by the user are inadequate
Challenge 2: Preprocessing of PDF, Word, PowerPoint, and CSV data is not as expected
Challenge 3: Accuracy evaluation takes time, resulting in a slow accuracy improvement cycle

The impact of Challenges 1 and 2 was particularly significant as these are factors that reduce response accuracy, and a lot of time was required for tuning. Handling questions with attached images and tuning the preprocessing of data formats such as PDF, Word, PowerPoint, and CSV data are often time-consuming tasks, even in generic RAG system implementations. The details and handling methods will be described under Key Point 3 below.

The inefficiency of the accuracy-improvement cycle indicated by Challenge 3 can also be a bottleneck in generic RAG system implementations. This is because an LLM-based RAG system outputs results as natural-language sentences, so judging correctness requires considering the intent of the question and the content of the response, which takes time. The details and handling methods are described under Key Point 4 below.

*3 MVP is an abbreviation for minimum viable product, and in this Insight refers to an app with the minimum functionality required to meet requirements.

Key Point 2: Considering practical use cases for RAG system implementation in business operations

To establish the RAG system in a business, it is important to consider the target users, the challenges to be overcome, the objectives of system introduction, and use cases, and to formulate a plan tailored to the business. This is because the functions of the app and the user interface (UI) to be provided differ depending on the aspects of the business for which the user needs the RAG system.
Normally, we would clarify these points through interviews and workshops with business-department staff who know the operations well, but for this event we decided to plan and build two apps based on assumptions drawn from the evaluation data.

The evaluation data for the RAG system distributed by the organizers at this event consisted of questions that users were expected to ask and the corresponding correct response data.
Many of the questions related to the theme of the event and covered the history and overviews of the World Heritage sites. Considering who would ask such learning-oriented questions, we decided that the target users would be "junior high and high school students on a school excursion."

The next thing to consider was the challenges faced by the target users, the objectives of app use, and the UI. School excursions are generally an opportunity for students to plan and study tour routes in advance and deepen their knowledge of World Heritage sites and other places in the field. On the other hand, we concluded that there are certain challenges in providing high-quality learning opportunities due to hurdles such as information gathering during pre-planning, preparation for students to gain deep learning on their own, and a lack of onsite guides (Figure 5).

Figure 5 Target user challenges and app solution strategies

To overcome these challenges, we needed to find ways to contribute to “planning and information gathering support,” “improving the quality of learning,” and “promoting the understanding of the heritage sites during the visit.” Our team therefore planned two apps: (1) a pre-visit route planning app and (2) a question-and-answer app for onsite use.

(1) The pre-visit route planning app (Figure 6) uses the RAG system to plan learning points and tour routes for visiting World Heritage sites based on a topic set by the student. The app creates a tour route when a student enters a topic they want to study, outputs the route in list form, displays it on a map, and outputs bookmarks. Given that the objective is to plan a route efficiently, we adopted a UI that presents information visually, for example in table form, rather than a chat UI.

(2) The onsite question-and-answer app (Figure 7) focuses, through the RAG system, on resolving questions users have about the World Heritage sites they are visiting. The users, junior high and high school students, need quick, real-time responses to enhance the onsite learning effect. A chat-style UI is well suited to such interactive communication and is effective in onsite usage scenarios.

As you can see, when developing an app, it is essential to design the optimal functions and UI according to the target users, the challenges to be overcome, the introduction objectives, and the use cases of the RAG system.

Figure 6 (1) Image of operation of route planning app (with entry of a learning topic, the app presents related heritage and learning points; example in which chat-type UI is not necessarily suitable)
Figure 7 (2) Image of operation of question-answer app (example in which chat-style UI is suitable; background of app was created using generative AI)

Key Point 3: Handling diverse input data

The data handled in the course of operations is not limited to text; it also includes a wide variety of source files, such as images and PDFs, which must also be searchable by the RAG system.
At this event as well, with practical use in mind, diverse file formats, including images, PDF, Word, PowerPoint, and CSV, were designated as search targets. Because LLMs can only read text data, we needed to preprocess these files and extract the text.

At this event, our team initially tried extracting the text stored in files such as PDFs using OSS libraries. This technique works when a file contains a text layer. However, much of the data at this event, such as the PDF files, consisted of embedded images of web pages and did not store the text itself, making extraction with this method difficult. To solve this problem, we decided to convert each file into an image and extract the text using an AI-OCR API.*4
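The fallback logic can be sketched as follows. Here `page_text` stands for whatever a PDF library (pypdf, for example) extracted from a page, and `ocr_page` is a hypothetical callable wrapping an AI-OCR API; the 20-character threshold is an illustrative assumption.

```python
def extract_page_text(page_text: str, ocr_page, min_chars: int = 20) -> str:
    """Use the library-extracted text layer if present; otherwise fall
    back to AI-OCR for image-only pages.

    `page_text` is the text a PDF library returned for the page;
    `ocr_page` is a placeholder callable that renders the page to an
    image and sends it to an AI-OCR service.
    """
    if len(page_text.strip()) >= min_chars:
        return page_text   # text layer present: use it directly
    return ocr_page()      # image-only page: route to AI-OCR
```

Routing only text-poor pages to OCR keeps API costs down while still covering files whose content exists only as images.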

CSV files often store data such as numbers in comma-separated format. As they are, such files tend to lack keywords and are unlikely to appear in search results. In the preprocessing stage, we therefore used ChatGPT to create explanatory text for the tables, which was then used as the search target. This makes the table data more likely to rank high in search results when needed.
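One possible shape of this step is sketched below; the prompt wording is our illustrative assumption, not the prompt actually used at the event.

```python
import csv
import io

def table_to_prompt(csv_text: str) -> str:
    """Build a prompt asking an LLM to describe a CSV table in prose,
    so the generated description can be indexed as a search document."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    table = "\n".join(" | ".join(cells) for cells in rows)
    return ("Describe in plain sentences what the following table contains, "
            "including the meaning of each column and notable values:\n" + table)

prompt = table_to_prompt("site,registered\nHimeji Castle,1993\nYakushima,1993")
```

The prompt would then be sent to the LLM, and the returned description registered in the vector database in place of (or alongside) the raw CSV.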

As mentioned in the "Event overview" section above, the user questions assumed at this event included questions with attached photo data (for example, "Where is the temple or shrine in this photo?"). With such questions, the subject of the photo needs to be identified before the question can be answered. We therefore adopted a technique of vectorizing the image input by the user and comparing it with the vectors of images registered in advance. This made it possible to identify the subject of the attached photo and answer the question.
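A minimal sketch of this matching step follows, assuming the image vectors have already been produced by an image-embedding model; the gallery names and vectors here are toy values for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two image vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def identify(photo_vec, gallery):
    """Return the name of the pre-registered image whose vector is most
    similar to the user's photo vector."""
    return max(gallery, key=lambda name: cosine(photo_vec, gallery[name]))

# Toy gallery of pre-registered image vectors (a real system would
# produce these with an image-embedding model during preprocessing).
gallery = {
    "Kinkaku-ji": [0.9, 0.1],
    "Itsukushima Shrine": [0.2, 0.9],
}
subject = identify([0.85, 0.2], gallery)
```

Once the subject is identified, its name can be injected into the text query so the rest of the RAG pipeline proceeds as with a purely textual question.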

Thanks to this tuning, more than 600 files in total, including text, image, PDF, Word, PowerPoint, and CSV files amounting to over 200 MB, could be searched in an integrated manner on the RAG system. Figure 8 shows the preprocessing flow of the RAG architecture after tuning, and Figure 9 shows the question-response flow.
In implementing the RAG system in this way, it is necessary to determine the format and type of data to be searched, and facilitate a search for each using the appropriate method.

Figure 8 Preprocessing flow after tuning
Figure 9 Question-response flow after tuning

*4 API is an abbreviation for application programming interface, and refers to a mechanism that allows web applications to communicate with other services and applications and utilize their functions.

Key Point 4: Automating the accuracy verification process

In order to apply the LLM-based RAG system to operations, a process of verifying its accuracy is essential. On the other hand, unlike conventional AI systems, LLM/RAG systems that output results in natural language require reading the context of the question intent and response content to judge correctness, making mechanical evaluation difficult. As a result, evaluation often involves human intervention. However, verifying the correctness of a large volume of evaluation texts one by one is time-consuming, and challenges remain, such as variations in assessment depending on the evaluator.

One way to help overcome these challenges is "LLM output evaluation by LLM," a technique that has been maturing in recent years. Specifically, the ideal correct answer, the output from the RAG system, and an evaluation prompt are fed into an evaluation LLM, which outputs an evaluation score, yielding a quantitative result mechanically.
For example, Figure 10 shows LLM-based evaluation metrics used in "prompt flow," an OSS library developed by Microsoft. The library provides methods for evaluating overall response quality, response coherence, and whether responses are grounded in evidence.
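A simplified sketch of such LLM-based evaluation is shown below. The prompt wording and the `ask_llm` callable are illustrative assumptions, not the prompts distributed at the event or prompt flow's actual implementation.

```python
def build_eval_prompt(question, ideal_answer, actual_answer):
    """Assemble the three evaluation inputs into one prompt."""
    return (
        "You are an evaluator. On a scale of 1 (poor) to 5 (perfect), score "
        "how well the actual answer matches the ideal answer.\n"
        f"Question: {question}\n"
        f"Ideal answer: {ideal_answer}\n"
        f"Actual answer: {actual_answer}\n"
        "Reply with the score only."
    )

def evaluate(question, ideal_answer, actual_answer, ask_llm):
    """Score one RAG response with an evaluation LLM.

    `ask_llm` is a hypothetical callable wrapping the evaluation LLM:
    it takes a prompt string and returns the model's reply.
    """
    reply = ask_llm(build_eval_prompt(question, ideal_answer, actual_answer))
    return int(reply.strip())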

Figure 10 LLM-based automatic evaluation metrics used in “prompt flow”

At this event, the event organizers distributed evaluation prompts so that participating teams could conduct evaluations uniformly. However, it should be noted that, since the evaluation results are also LLM outputs, the evaluations have variability and are not absolute (in fact, during this event, there were cases in which the same input resulted in different scores across attempts). In actual operation, combining these evaluations with visual evaluations by staff can conceivably lead to efficient accuracy verification.

Additionally, by automating LLM output processing and combining it with the automatic LLM evaluation described above, the RAG system's accuracy-improvement cycle can be repeated efficiently. When LLM models are updated, data is added, or search processes are changed, executing a complete cycle from LLM/RAG output to evaluation and improvement, a practice known as "LLMOps," allows for continuous system evaluation and improvement.
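The automated cycle can be sketched as a simple loop; `rag_answer` and `judge` are hypothetical callables standing in for the RAG system and the evaluation LLM.

```python
def run_eval_cycle(eval_set, rag_answer, judge):
    """Run every evaluation question through the RAG system and score
    the results, returning the mean score.

    `eval_set` is a list of (question, ideal_answer) pairs;
    `rag_answer` maps a question to the RAG system's response;
    `judge` scores one (question, ideal, actual) triple.
    """
    scores = []
    for question, ideal in eval_set:
        actual = rag_answer(question)
        scores.append(judge(question, ideal, actual))
    return sum(scores) / len(scores)
```

With this loop automated, every change to the model, data, or search process can be followed by a full re-evaluation in a single run, which is what makes rapid trial-and-improvement cycles feasible.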

At this event, we automated and accelerated LLM output processing, and were thus able to conduct eight trial-and-improvement cycles even within a short construction period of 1.5 days. The efficiency of this series of processes aimed at improving accuracy is what paved the way to our app’s high accuracy at this event (second highest response accuracy out of the 10 participating companies). In practice, a technique called human-in-the-loop (HITL) is often employed, which utilizes user feedback with respect to responses to improve the accuracy of the RAG system.

In this way, in the RAG system, which is difficult to evaluate quantitatively, introducing a semi-automated, high-speed evaluation method and utilizing user evaluations is vital for efficient accuracy improvement and accuracy monitoring.

Key Point 5: Building a secure cloud infrastructure

There is no doubt that AI services like ChatGPT are useful for creating business value and improving productivity. However, many companies do not provide employees with the AI services they need, leading to the gradual spread of a trend called "Bring Your Own AI" (BYOAI, or private AI usage), in which employees bring in and use AI services themselves. When companies do not manage the rules or usage status of unauthorized BYOAI, there are risks of leaking sensitive business information and personal data, which also poses corporate-governance problems. According to the "2024 Work Trend Index" survey released on June 6, 2024, by Microsoft and LinkedIn, approximately 80% of AI users in Japan are said to bring personal AI tools to work. Developing an AI environment that can securely handle corporate data is an urgent task, and is also important from the perspective of defensive DX (digital transformation).

Against this background, there is a need to create a secure AI environment, including for RAG systems. Implementing a RAG system often involves cloud infrastructure, mainly because LLMs require enormous computing resources, making on-premises construction unsuitable. When using cloud infrastructure, it is important to make appropriate use of the robust security measures available in the cloud environment.

For example, for the vector database that stores search-target data, authentication management is required so that only a limited set of web apps and development members have access, in order to prevent unauthorized external access and data leakage. It is also recommended to construct a closed network that blocks connections from the Internet and to access the vector database from a web app within that closed network. In addition, all content stored in the database should be encrypted.

Similar measures have to be taken for the web apps themselves. To prevent data leakage via external API access, access is restricted to a limited set of development members and groups, and default encryption keys are used for related storage.

The same access control and other measures need to be applied to API services such as LLMs that are accessed via web apps. In addition, if transmitted prompts and output data are stored for a certain period of time for auditing or other purposes, care must be taken to ensure that sensitive information is not left in the usage logs. Typically, API access authentication keys are managed by retrieving them from a secure storage location, rather than embedding them directly in the source code.
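As a minimal sketch of the last point, the key can be read from the environment (populated from a secret manager at deploy time) rather than hard-coded in source; the variable name `LLM_API_KEY` is an assumption for illustration.

```python
import os

def get_llm_api_key() -> str:
    """Read the LLM API key from the environment instead of embedding
    it in source code. The variable name is a hypothetical example;
    in production the value would be injected from a secret store."""
    key = os.environ.get("LLM_API_KEY")
    if not key:
        raise RuntimeError("LLM_API_KEY is not set")
    return key
```

Failing fast when the key is absent avoids silently falling back to an unauthenticated or misconfigured client.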

Furthermore, as a security measure, a dedicated security tool is required to detect unauthorized access and check the resource configuration status according to security standards. It is also important to build a system that provides automatic notifications when an incident occurs and to have a structure in place to enable a quick response.

Implementing these security measures can ensure enterprise-grade security in RAG system operation on a cloud infrastructure and handle corporate data safely.
At this event, we compiled and demonstrated such security measures and received favorable praise from the panel of judges.

Securely and quickly building a highly effective generative AI/RAG system

At this event, we were able to quickly and accurately build travel assistant apps for World Heritage sites based on ABeam Consulting's extensive experience and knowledge of RAG system construction.
We hope that the insights we have provided here will help you build your own generative AI/RAG system.

ABeam Consulting offers a service for building RAG systems that utilize internal data (ABeam LLM Partner), which was embodied at this event. Be sure to check out the case study of RAG system construction with Macnica, Inc. as well.

In addition, as RAG system construction-related services, we provide the LLM system "Failure Studies Consultant," which features input data from failure studies and was developed in collaboration with Professor Yotaro Hatamura of the University of Tokyo; support for creating business value and customer experience through workshops; and support services for advancing knowledge management using generative AI.
We hope you find these useful.

Finally, we would like to express our gratitude to Microsoft Japan and Kadokawa ASCII Research Laboratories who planned and hosted this event and kindly supported our participation.


Contact

Click here for inquiries and consultations