In this post, you will take a closer look at embedding documents to be used for a semantic search. By means of examples, you will learn how embedding influences the search result and how you can improve the results. Enjoy!

1. Introduction

In a previous post, chat with documents using LangChain4j and LocalAI was discussed. One of the conclusions was that the document format has a large influence on the results. In this post, you will take a closer look at the influence of source data and the way it is embedded in order to get a better search result.

The source documents are two Wikipedia documents: the discography of Bruce Springsteen and the list of songs recorded by Bruce Springsteen. The interesting part of these documents is that they contain facts, mainly in table format. The same documents were used in the previous post, so it will be interesting to see how the findings from that post compare to the approach used in this post.

This blog can be read without reading the previous blogs if you are familiar with the concepts used. If not, it is recommended to read the previous blogs as mentioned in the prerequisites paragraph.

The sources used in this blog can be found at GitHub.

2. Prerequisites

The prerequisites for this blog are:

3. Embed Whole Document

The easiest way to embed a document is to read the document, split it into chunks and embed the chunks. Embedding means transforming the text into vectors (numbers). The question you will ask also needs to be embedded.

The vectors are stored in a vector store, which is able to find the stored vectors that are closest to your question and responds with these results. The source code consists of the following parts:

  • The text needs to be embedded. An embedding model is needed for that; for simplicity, you use the AllMiniLmL6V2EmbeddingModel. This model is based on the BERT architecture, which is popular for embedding models.
  • The embeddings need to be stored in an embedding store. Often a vector database is used for this purpose, but in this case you can use an in-memory embedding store.
  • Read the two documents and add them to a DocumentSplitter. Here you will define to split the documents into chunks of 500 characters with no overlap.
  • By means of the DocumentSplitter, the documents are split into TextSegments.
  • The embedding model is used to embed the TextSegments. The TextSegments and their embedded counterpart are stored in the embedding store.
  • The question is also embedded with the same model.
  • Ask the embedding store to find relevant embedded segments to the embedded question. You can define how many results the store should retrieve. In this case, only one result is asked for.
  • If a match is found, the following information is printed to the console:
    • The score: a number indicating how well the result corresponds to the question;
    • The original text: the text of the segment;
    • The metadata: shows which document the segment comes from.

private static void askQuestion(String question) {
    EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();

    EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();

    // Read the documents and split them into segments of at most 500 characters
    Document springsteenDiscography = loadDocument(toPath("example-files/Bruce_Springsteen_discography.pdf"));
    Document springsteenSongList = loadDocument(toPath("example-files/List_of_songs_recorded_by_Bruce_Springsteen.pdf"));
    ArrayList<Document> documents = new ArrayList<>();
    documents.add(springsteenDiscography);
    documents.add(springsteenSongList);

    DocumentSplitter documentSplitter = DocumentSplitters.recursive(500, 0);
    List<TextSegment> documentSegments = documentSplitter.splitAll(documents);

    // Embed the segments
    Response<List<Embedding>> embeddings = embeddingModel.embedAll(documentSegments);
    embeddingStore.addAll(embeddings.content(), documentSegments);

    // Embed the question and find relevant segments
    Embedding queryEmbedding = embeddingModel.embed(question).content();
    List<EmbeddingMatch<TextSegment>> embeddingMatch = embeddingStore.findRelevant(queryEmbedding,1);
    System.out.println(embeddingMatch.get(0).score());
    System.out.println(embeddingMatch.get(0).embedded().text());
    System.out.println(embeddingMatch.get(0).embedded().metadata());
}

The questions are the following; they ask for facts which can be found in the documents:

public static void main(String[] args) {
    askQuestion("on which album was \"adam raised a cain\" originally released?");
    askQuestion("what is the highest chart position of \"Greetings from Asbury Park, N.J.\" in the US?");
    askQuestion("what is the highest chart position of the album \"tracks\" in canada?");
    askQuestion("in which year was \"Highway Patrolman\" released?");
    askQuestion("who produced \"all or nothin' at all?\"");
}

3.1 Question 1

The result for question 1 (on which album was “adam raised a cain” originally released?) is the following:

0.6794537224516205
Jim Cretecos 1973 [14]
"57 Channels (And Nothin'
On)" Bruce Springsteen Human Touch
Jon Landau
Chuck Plotkin
Bruce
Springsteen
Roy Bittan
1992 [15]
"7 Rooms of Gloom"
(Four Tops cover)
Holland–Dozier–
Holland †
Only the Strong
Survive
Ron Aniello
Bruce
Springsteen
2022 [16]
"Across the Border" Bruce Springsteen The Ghost of Tom
Joad
Chuck Plotkin
Bruce
Springsteen
1995 [17]
"Adam Raised a Cain" Bruce Springsteen Darkness on the Edge
of Town
Jon Landau
Bruce
Springsteen
Steven Van
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/example-files, index=4, file_name=List_of_songs_recorded_by_Bruce_Springsteen.pdf, document_type=PDF} }

What do you see here?

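  • The score is 0.679… It indicates how closely the segment matches the question: the higher the score, the closer the match. Treat it as a ranking measure rather than as a literal 67.9% match.
  • The segment itself contains the requested information at line 27 of the output. The correct segment is chosen, which is great.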
  • The metadata shows the document where the segment comes from.
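To make the score a bit more tangible, here is a minimal sketch of how such a score can be computed. It assumes that the store ranks by cosine similarity and rescales it from the range [-1, 1] to [0, 1]; to my knowledge this is what the in-memory store of LangChain4j does, but treat the mapping as an assumption. The class and method names are made up for this illustration. Under that assumption, a score of 0.679 would correspond to a cosine similarity of roughly 0.36.

```java
// Sketch: cosine similarity and a 0..1 relevance score (assumed mapping).
class RelevanceScoreSketch {

    // Cosine similarity of two vectors of equal length: dot(a, b) / (|a| * |b|).
    static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Assumption: the score is the cosine similarity rescaled from [-1, 1] to [0, 1].
    static double relevanceScore(double cosineSimilarity) {
        return (cosineSimilarity + 1.0) / 2.0;
    }

    public static void main(String[] args) {
        double[] question = {1.0, 0.0};
        double[] sameDirection = {2.0, 0.0};
        double[] orthogonal = {0.0, 3.0};

        // Vectors pointing in the same direction get the maximum score
        System.out.println(relevanceScore(cosineSimilarity(question, sameDirection))); // prints 1.0
        // Unrelated (orthogonal) vectors end up in the middle of the scale
        System.out.println(relevanceScore(cosineSimilarity(question, orthogonal)));    // prints 0.5
    }
}
```

Note that with such a mapping, two completely unrelated vectors already score 0.5, which is one more reason not to read the score as a percentage.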

You also see how the table is transformed to a text segment: it isn’t a table anymore. In the source document, the information is formatted as follows:

Another thing to notice is where the text segment is split. If you had asked who produced this song, the answer would have been incomplete, because this row is split in the fourth column, in the middle of the list of producers.
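The effect of splitting on size alone can be reproduced with a deliberately simplified character splitter. This is not how DocumentSplitters.recursive works internally (it tries to respect paragraphs, lines and sentences before falling back to smaller units); the class and method below are hypothetical and only illustrate why a row can be cut in two, and how an overlap repeats the end of one chunk at the start of the next.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: naive fixed-size splitting, hypothetical helper (not a LangChain4j API).
class NaiveChunkerSketch {

    // Cuts text into chunks of at most maxChars characters; consecutive chunks
    // share `overlap` characters.
    static List<String> split(String text, int maxChars, int overlap) {
        if (maxChars <= 0 || overlap < 0 || overlap >= maxChars) {
            throw new IllegalArgumentException("Need maxChars > 0 and 0 <= overlap < maxChars");
        }
        List<String> chunks = new ArrayList<>();
        int step = maxChars - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxChars, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last chunk has been reached
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // One table row, flattened to text as in the PDF extraction
        String row = "\"Adam Raised a Cain\" Bruce Springsteen Darkness on the Edge of Town "
                + "Jon Landau Bruce Springsteen Steven Van Zandt (assistant) 1978";

        // Without overlap, the chunk boundary falls in the middle of the row
        for (String chunk : split(row, 60, 0)) {
            System.out.println("[" + chunk + "]");
        }
    }
}
```

An overlap reduces the chance that a fact is cut in half, but it does not remove it; that is the motivation for segmenting at row level later in this post.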

3.2 Question 2

The result for question 2 (what is the highest chart position of “Greetings from Asbury Park, N.J.” in the US?) is the following:

0.6892728817378977
29. Greetings from Asbury Park, N.J. (LP liner notes). Bruce Springsteen. US: Columbia
Records. 1973. KC 31903.
30. Nebraska (LP liner notes). Bruce Springsteen. US: Columbia Records. 1982. TC 38358.
31. Chapter and Verse (CD booklet). Bruce Springsteen. US: Columbia Records. 2016. 88985
35820 2.
32. Born to Run (LP liner notes). Bruce Springsteen. US: Columbia Records. 1975. PC 33795.
33. Tracks (CD box set liner notes). Bruce Springsteen. Europe: Columbia Records. 1998. COL
492605 2 2.
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/example-files, index=100, file_name=List_of_songs_recorded_by_Bruce_Springsteen.pdf, document_type=PDF} }

The wrong text segment is found, and according to the metadata it also comes from the wrong document. This segment comes from the References section of the list of songs, while the chart positions you need are in the Studio albums table of the discography.

3.3 Question 3

The result for question 3 (what is the highest chart position of the album “tracks” in canada?) is the following:

0.807258199400863
56. @billboardcharts (November 29, 2021). "Debuts on this week's #Billboard200 (1/2)..." (https://twitter.com/bil
lboardcharts/status/1465346016702566400) (Tweet). Retrieved November 30, 2021 – via Twitter.
57. "ARIA Top 50 Albums Chart" (https://www.aria.com.au/charts/albums-chart/2021-11-29). Australian
Recording Industry Association. November 29, 2021. Retrieved November 26, 2021.
58. "Billboard Canadian Albums" (https://www.fyimusicnews.ca/fyi-charts/billboard-canadian-albums).
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/example-files, index=142, file_name=Bruce_Springsteen_discography.pdf, document_type=PDF} }

The information is found in the correct document, but also here, the segment comes from the References section, while the answer to the question can be found in the Compilation albums table. This can explain some of the wrong answers that were given in the previous post.

3.4 Question 4

The result for question 4 (in which year was “Highway Patrolman” released?) is the following:

0.6867325432140559
"Highway 29" Bruce Springsteen The Ghost of Tom
Joad
Chuck Plotkin
Bruce
Springsteen
1995 [17]
"Highway Patrolman" Bruce Springsteen Nebraska Bruce
Springsteen 1982 [30]
"Hitch Hikin' " Bruce Springsteen Western Stars
Ron Aniello
Bruce
Springsteen
2019 [53]
"The Hitter" Bruce Springsteen Devils & Dust
Brendan O'Brien
Chuck Plotkin
Bruce
Springsteen
2005 [24]
"The Honeymooners" Bruce Springsteen Tracks
Jon Landau
Chuck Plotkin
Bruce
Springsteen
Steven Van
Zandt
1998
[33]
[76]
"House of a Thousand
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/example-files, index=31, file_name=List_of_songs_recorded_by_Bruce_Springsteen.pdf, document_type=PDF} }

The information is found in the correct document and the correct segment is found. But it is difficult to retrieve the correct answer because of the formatting of the text segment and you do not have any context about what the information represents. The column headers are gone, so how should you know that 1982 is the answer to the question?

3.5 Question 5

The result for question 5 (who produced “all or nothin’ at all”?) is the following:

0.7036564758755796
Zandt (assistant)
1978 [18]
"Addicted to Romance" Bruce Springsteen She Came to Me
(soundtrack)
Bryce Dessner 2023
[19]
[20]
"Ain't Good Enough for
You" Bruce Springsteen The Promise
Jon Landau
Bruce
Springsteen
2010
[21]
[22]
"Ain't Got You" Bruce Springsteen Tunnel of Love
Jon Landau
Chuck Plotkin
Bruce
Springsteen
1987 [23]
"All I'm Thinkin' About" Bruce Springsteen Devils & Dust
Brendan O'Brien
Chuck Plotkin
Bruce
Springsteen
2005 [24]
"All or Nothin' at All" Bruce Springsteen Human Touch
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/example-files, index=5, file_name=List_of_songs_recorded_by_Bruce_Springsteen.pdf, document_type=PDF} }

The information is found in the correct document, but again, the segment is split in the row where the answer can be found. This can explain the incomplete answers that were given in the previous post.

3.6 Conclusion

Two answers are correct, one is partially correct and two are wrong.

4. Embed Markdown Document

What would change when you convert the PDF documents into Markdown files? Tables are probably easier to recognize in Markdown files than in PDF documents, and Markdown allows you to segment the document at the row level instead of at some arbitrary chunk size. Only the parts of the documents which contain the answers to the questions are converted: the Studio albums and Compilation albums tables from the discography, and the list of songs recorded.

The segmenting is done as follows:

  • Split the document line by line;
  • Retrieve the data of the table in variable dataOnly;
  • Save the header of the table in variable header;
  • Create a TextSegment for every row in dataOnly and add the header to the segment.

The source code is as follows:

List<Document> documents = loadDocuments(toPath("markdown-files"));

List<TextSegment> segments = new ArrayList<>();
for (Document document : documents) {
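    // Split the Markdown file into lines
    String[] splittedDocument = document.text().split("\n");
    // Lines 0 and 1 are the table header and its separator; the data rows start at line 2
    String[] dataOnly = Arrays.copyOfRange(splittedDocument, 2, splittedDocument.length);
    String header = splittedDocument[0] + "\n" + splittedDocument[1] + "\n";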

    for (String splittedLine : dataOnly) {
        segments.add(TextSegment.from(header + splittedLine, document.metadata()));
    }
}
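The same idea can be tried out without any library. The sketch below is a variation on the code above, not the code from the repository: it works on a plain String instead of a Document, and, as assumptions of mine, it skips blank lines and accepts Windows line endings so that no segment consists of a header only. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: one segment per table row, each row prefixed with the table header.
class MarkdownRowSegmenterSketch {

    // Returns one segment per data row, each prefixed with the two header lines
    // (column names and the |---| separator). Blank lines are skipped.
    static List<String> segmentRows(String markdownTable) {
        String[] lines = markdownTable.split("\\R"); // \R matches any line terminator
        List<String> segments = new ArrayList<>();
        if (lines.length < 2) {
            return segments; // no complete header, so nothing to segment
        }
        String header = lines[0] + "\n" + lines[1] + "\n";
        for (int i = 2; i < lines.length; i++) {
            if (!lines[i].trim().isEmpty()) {
                segments.add(header + lines[i]);
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        String table = "| song | year |\n"
                + "|------|------|\n"
                + "| \"Highway Patrolman\" | 1982 |\n"
                + "\n"
                + "| \"Adam Raised a Cain\" | 1978 |\n";

        // Prints two segments; each one repeats the header above its row
        for (String segment : segmentRows(table)) {
            System.out.println(segment);
            System.out.println("--");
        }
    }
}
```

Because every segment carries the column names, a single retrieved row is enough to tell which value belongs to which column.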

4.1 Question 1

The result for question 1 (on which album was “adam raised a cain” originally released?) is the following:

0.6196628642947255
| Title                                         |Album details| US | AUS | GER | IRE | NLD |NZ |NOR|SWE|UK
|-----------------------------------------------|-------------|---|---|---|---|---|---|---|---|---|
|The Essential Bruce Springsteen|14|41|—|—|5|22|—|4|2|15|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_discography_compilation_albums.md, document_type=UNKNOWN} }

The answer is incorrect.

4.2 Question 2

The result for question 2 (what is the highest chart position of “Greetings from Asbury Park, N.J.” in the US?) is the following:

0.8229951885990189
| Title                                         |Album details| US | AUS | GER | IRE | NLD |NZ |NOR|SWE|UK
|-----------------------------------------------|-------------|---|---|---|---|---|---|---|---|---|
| Greetings from Asbury Park,N.J.               |60|71|—|—|—|—|—|—|35|41|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_discography_studio_albums.md, document_type=UNKNOWN} }

The answer is correct and the answer can easily be retrieved as you have the header information for every column.

4.3 Question 3

The result for question 3 (what is the highest chart position of the album “tracks” in canada?) is the following:

0.7646818618182345
| Title                                         |Album details| US | AUS | GER | IRE | NLD |NZ |NOR|SWE|UK
|-----------------------------------------------|-------------|---|---|---|---|---|---|---|---|---|
|Tracks|27|97|—|63|—|36|—|4|11|50|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_discography_compilation_albums.md, document_type=UNKNOWN} }

The answer is correct.

4.4 Question 4

The result for question 4 (in which year was “Highway Patrolman” released?) is the following:

0.6108392657222184
| song                                                                        | writer(s)                                                                             | original release                                                | Producer(s)                                                           |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Working on the Highway"	|Bruce Springsteen|	Born in the U.S.A.	| Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt	          |1984|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }

The answer is incorrect. The correct document is found, but the wrong segment is chosen.

4.5 Question 5

The result for question 5 (who produced “all or nothin’ at all”?) is the following:

0.6724577751120745
| song                                                                        | writer(s)                                                                             | original release                                                | Producer(s)                                                           |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
| "All or Nothin' at All"                                                     | 	Bruce Springsteen	                                                                   | Human Touch                                                     | 	Jon Landau Chuck Plotkin Bruce Springsteen Roy Bittan	               |1992	|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }

The answer is correct and complete this time.

4.6 Conclusion

Three answers are correct and complete. Two answers are incorrect. Note that the incorrect answers are for different questions than before. However, the result is slightly better than with the PDF files.

5. Alternative Questions

Let’s build upon this a bit further. You are not using a Large Language Model (LLM) here, which would help you with textual differences between the question you ask and the interpretation of the results. Maybe it helps to change the question so that it uses terminology which is closer to the data in the documents. The source code can be found here.

5.1 Question 1

Let’s change question 1 from on which album was “adam raised a cain” originally released? into what is the original release of “adam raised a cain”? The column in the table is named original release, so that might make a difference.

The result is the following:

0.6370094541277747
| song                                                                        | writer(s)                                                                             | original release                                                | Producer(s)                                                           |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
| "Adam Raised a Cain"                                                        | 	Bruce Springsteen	                                                                   | Darkness on the Edge of Town	                                   | Jon Landau Bruce Springsteen Steven Van Zandt (assistant)             |	1978|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }

The answer is correct this time and the score is slightly higher.

5.2 Question 4 – Attempt #1

Question 4 is in which year was “Highway Patrolman” released? Remember that you only asked for the first relevant result, but more relevant results can be displayed. Set the maximum number of results to 5.

List<EmbeddingMatch<TextSegment>> relevantMatches = embeddingStore.findRelevant(queryEmbedding,5);

The result is:

0.6108392657222184
| song                                                                        | writer(s)                                                                             | original release                                                | Producer(s)                                                           |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Working on the Highway"	|Bruce Springsteen|	Born in the U.S.A.	| Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt	          |1984|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.6076896858171996
| song                                                                        | writer(s)                                                                             | original release                                                | Producer(s)                                                           |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Turn! Turn! Turn!" (with Roger McGuinn)	| Pete Seeger †                                                                         | 	Magic Tour Highlights (EP)                                     | 	John Cooper                                                          |	2008|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.6029946650419344
| song                                                                        | writer(s)                                                                             | original release                                                | Producer(s)                                                           |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Darlington County"	                                                        | Bruce Springsteen	                                                                    | Born in the U.S.A.	                                             | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt           |	1984|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.6001672430441461
| song                                                                        | writer(s)                                                                             | original release                                                | Producer(s)                                                           |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Downbound Train"                                                           | 	Bruce Springsteen                                                                    | 	Born in the U.S.A.	                                            | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt	          |1984|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.5982557901838741
| song                                                                        | writer(s)                                                                             | original release                                                | Producer(s)                                                           |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Highway Patrolman"	                                                        | Bruce Springsteen	                                                                    | Nebraska	                                                       | Bruce Springsteen                                                     |	1982|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }

As you can see, Highway Patrolman is a result, but only the fifth result. That is a bit strange, though.
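When you compare several phrasings of the same question, it helps to reduce each result list to a single number: the position of the row you expected. The helper below is a small sketch for that purpose; the class and method names are hypothetical. In the real code, the texts would come from embedded().text() of each match; here the song titles of the five results above are hard-coded so that the example runs on its own.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Sketch: find the position of the expected row in a list of result texts.
class MatchRankSketch {

    // Returns the 1-based position of the first result containing the expected
    // text (case-insensitive), or -1 when none of the results contains it.
    static int rankOf(List<String> resultTexts, String expected) {
        String needle = expected.toLowerCase(Locale.ROOT);
        for (int i = 0; i < resultTexts.size(); i++) {
            if (resultTexts.get(i).toLowerCase(Locale.ROOT).contains(needle)) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Song titles of the five results above, in the order returned
        List<String> titles = Arrays.asList(
                "\"Working on the Highway\"",
                "\"Turn! Turn! Turn!\" (with Roger McGuinn)",
                "\"Darlington County\"",
                "\"Downbound Train\"",
                "\"Highway Patrolman\"");

        System.out.println(rankOf(titles, "highway patrolman")); // prints 5
    }
}
```

A lower rank for the same expected row means that the rephrased question is closer to the data.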

5.3 Question 4 – Attempt #2

Let’s change question 4 from in which year was “Highway Patrolman” released? into in which year was the song “Highway Patrolman” released? So, you add the words the song to the question.

The result is:

0.6506125707025556
| song                                                                        | writer(s)                                                                             | original release                                                | Producer(s)                                                           |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Working on the Highway"	|Bruce Springsteen|	Born in the U.S.A.	| Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt	          |1984|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.641000538311824
| song                                                                        | writer(s)                                                                             | original release                                                | Producer(s)                                                           |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Raise Your Hand" (live) (Eddie Floyd cover)                                | 	Steve Cropper Eddie Floyd Alvertis Isbell †                                          | 	Live 1975–85	                                                  | Jon Landau Chuck Plotkin Bruce Springsteen	                           |1986	|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.6402738046796352
| song                                                                        | writer(s)                                                                             | original release                                                | Producer(s)                                                           |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Darlington County"	                                                        | Bruce Springsteen	                                                                    | Born in the U.S.A.	                                             | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt           |	1984|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.6362427185719677
| song                                                                        | writer(s)                                                                             | original release                                                | Producer(s)                                                           |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Highway Patrolman"	                                                        | Bruce Springsteen	                                                                    | Nebraska	                                                       | Bruce Springsteen                                                     |	1982|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.635837703599965
| song                                                                        | writer(s)                                                                             | original release                                                | Producer(s)                                                           |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Wreck on the Highway"|	Bruce Springsteen	|The River	| Jon Landau Bruce Springsteen Steven Van Zandt	                        |1980	|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }

Now Highway Patrolman is the fourth result. It is getting better.

5.4 Question 4 – Attempt #3

Let’s add the words of the album “Nebraska” to question 4. The question becomes in which year was the song “Highway Patrolman” of the album “Nebraska” released?

The result is:

0.6468954949440158
| song                                                                        | writer(s)                                                                             | original release                                                | Producer(s)                                                           |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Working on the Highway"	|Bruce Springsteen|	Born in the U.S.A.	| Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt	          |1984|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.6444919056791143
| song                                                                        | writer(s)                                                                             | original release                                                | Producer(s)                                                           |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Darlington County"	                                                        | Bruce Springsteen	                                                                    | Born in the U.S.A.	                                             | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt           |	1984|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.6376680100362238
| song                                                                        | writer(s)                                                                             | original release                                                | Producer(s)                                                           |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Highway Patrolman"	                                                        | Bruce Springsteen	                                                                    | Nebraska	                                                       | Bruce Springsteen                                                     |	1982|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.6367565537138745
| Title                                         |Album details| US | AUS | GER | IRE | NLD |NZ |NOR|SWE|UK
|-----------------------------------------------|-------------|---|---|---|---|---|---|---|---|---|
|The Essential Bruce Springsteen|14|41|—|—|5|22|—|4|2|15|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_discography_compilation_albums.md, document_type=UNKNOWN} }
0.6364950606665447
| song                                                                        | writer(s)                                                                             | original release                                                | Producer(s)                                                           |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Raise Your Hand" (live) (Eddie Floyd cover)                                | 	Steve Cropper Eddie Floyd Alvertis Isbell †                                          | 	Live 1975–85	                                                  | Jon Landau Chuck Plotkin Bruce Springsteen	                           |1986	|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }

Again an improvement: Highway Patrolman is now listed as the third result. It is still strange that it is not listed as the first result. However, adding more information to the question makes it rank higher in the result list, which is as expected.

5.5 Conclusion

Rephrasing the question with terminology that is closer to the source data helps to get a better result. Adding more context to the question also helps. Displaying more results gives you more insight and lets you determine the correct answer from the result list.
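To illustrate why displaying more results helps, here is a minimal sketch of what a vector store conceptually does: it scores every segment against the question and returns the best few. The class TopKSketch, the Match record and the two-dimensional toy vectors are assumptions made for this illustration. They are not LangChain4j API and the numbers are not real embeddings.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TopKSketch {

    // A scored text segment; hypothetical type, not part of LangChain4j.
    record Match(double score, String text) {}

    // Cosine similarity between two vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Score every segment against the question and keep the best maxResults.
    static List<Match> topK(double[] question, List<double[]> vectors,
                            List<String> texts, int maxResults) {
        List<Match> matches = new ArrayList<>();
        for (int i = 0; i < vectors.size(); i++) {
            matches.add(new Match(cosine(question, vectors.get(i)), texts.get(i)));
        }
        // Highest score first.
        matches.sort(Comparator.comparingDouble(Match::score).reversed());
        return matches.subList(0, Math.min(maxResults, matches.size()));
    }

    public static void main(String[] args) {
        // Toy vectors standing in for real embeddings (assumed values).
        double[] question = {1.0, 0.2};
        List<double[]> vectors = List.of(
                new double[]{0.9, 0.1},
                new double[]{1.0, 0.3},
                new double[]{0.0, 1.0});
        List<String> texts = List.of(
                "Working on the Highway", "Highway Patrolman", "Tracks");

        for (Match match : topK(question, vectors, texts, 2)) {
            System.out.println(match.score() + " " + match.text());
        }
    }
}
```

With these toy values, the segment you are looking for ends up second, not first. Asking for only one result would hide it, while asking for two makes it visible, which is the same effect as seen with the questions above.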

6. Markdown Embedding Combined With LLM

The conclusions up till now are:

  • The format of the documents and the way the documents are segmented and embedded have a significant influence on the results;
  • If the question uses terminology which is close to the data in the documents, better results are achieved.

In the previous blog, PDF files were used as source data. The first bullet above is now addressed by using Markdown and by better segmenting and embedding. The second bullet above can probably be addressed by using an LLM.

Let’s see what happens when you combine the embedding in this post with an LLM from the previous post. The full source code can be found at GitHub.

The questions and results are:

  1. on which album was “adam raised a cain” originally released?
    The album “Adam Raised a Cain” was originally released on The Essential Bruce Springsteen album in 1995.
    This answer is wrong but corresponds to the first embedded result earlier.
  2. what is the highest chart position of “Greetings from Asbury Park, N.J.” in the US?
    The highest chart position of “Greetings from Asbury Park, N.J.” in the US is 60.
    This answer is correct.
  3. what is the highest chart position of the album “tracks” in canada?
    The album “Tracks” by Metallica has not reached the highest chart position in Canada. According to the provided information, the highest chart position it reached in Canada was #27.
    This answer is wrong. The first embedded result earlier was correct, so it is strange that the wrong answer is given here.
  4. in which year was “Highway Patrolman” released?
    The song “Highway Patrolman” was released in 1984.
    The answer is wrong, but it can be explained. The first embedded result earlier showed the result for Working on the Highway, and this song was released in 1984.
  5. in which year was “Higway Patrolman” released?
    The song “Highway Patrolman” was released in 1984.
    The question here includes a typo. The result is the same as the previous one, so this is consistent.
  6. who produced “all or nothin’ at all”?
    The song “All or Nothin’ at All” was produced by Bruce Springsteen, Jon Landau, Chuck Plotkin, and Roy Bittan. It was originally released on Springsteen’s 1992 album “Human Touch.”
    This answer is correct and complete. With the PDF as source data, the answer was always incomplete.

The results are worse than when using the PDF files as source documents. However, when a correct answer is given, it is complete. The answers are also consistent: a minor typo does not change the answer like it did before.
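One way to picture why the LLM answers follow the embedded results so closely: in a setup like this, the model typically only sees the question together with the retrieved segments. The sketch below shows this idea with a hypothetical PromptSketch class. The prompt wording is an assumption made for illustration; it is not the template used by LangChain4j or by the code of the previous post.

```java
import java.util.List;

class PromptSketch {

    // Builds one prompt from the question and the retrieved segments.
    // The wording of the prompt is an assumption for this illustration.
    static String buildPrompt(String question, List<String> segments) {
        StringBuilder prompt = new StringBuilder();
        prompt.append("Answer the question using only the information below.\n\n");
        for (int i = 0; i < segments.size(); i++) {
            prompt.append("Segment ").append(i + 1).append(":\n")
                  .append(segments.get(i)).append("\n\n");
        }
        prompt.append("Question: ").append(question);
        return prompt.toString();
    }

    public static void main(String[] args) {
        // Shortened rows standing in for retrieved segments, best match first.
        List<String> segments = List.of(
                "|\"Working on the Highway\"|Bruce Springsteen|Born in the U.S.A.|1984|",
                "|\"Highway Patrolman\"|Bruce Springsteen|Nebraska|1982|");
        System.out.println(buildPrompt(
                "in which year was \"Highway Patrolman\" released?", segments));
    }
}
```

If the segment with the correct answer is missing from the list, or is outranked by a similar looking row, the model has little chance of answering correctly. That would be consistent with the 1984 answer for question 4.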

7. Markdown Embedding Combined With System Messages

In the previous blog, the final solution also included adding System Messages to the chat memory. Let’s see what happens when you combine this with Markdown source documents. The full source code can be found at GitHub. Below is a small extract of the System Messages.

// Two system messages: restrict the model to the provided documents and define a fallback answer.
ArrayList<ChatMessage> chatMessages = new ArrayList<>();
chatMessages.add(new SystemMessage("Use the provided documents to answer the questions."));
chatMessages.add(new SystemMessage("If the answer cannot be found in the documents, write \"I could not find an answer\"."));

The questions and results are:

  1. on which album was “adam raised a cain” originally released?
    The album “Adam Raised a Cain” was originally released on The Essential Bruce Springsteen in 1995.
    This answer is wrong and similar to the one in the previous paragraph.
  2. what is the highest chart position of “Greetings from Asbury Park, N.J.” in the US?
    Based on the provided information, the highest chart position of “Greetings from Asbury Park, N.J.” in the US is 60.
    This answer is correct and similar to the one in the previous paragraph.
  3. what is the highest chart position of the album “tracks” in canada?
    Based on the information provided, the highest chart position of the album “Tracks” in Canada is 64.
    This answer is still wrong. However, Metallica is not mentioned anymore, so it is a bit less wrong than in the previous paragraph.
  4. in which year was “Highway Patrolman” released?
    The song “Highway Patrolman” was released in 1984.
    The answer is wrong and identical to the previous paragraph.
  5. in which year was “Higway Patrolman” released?
    The song “Highway Patrolman” was released in 1984.
    The answer is wrong and identical to the previous paragraph.
  6. who produced “all or nothin’ at all”?
    The song “All or Nothin’ at All” was produced by Bruce Springsteen, Jon Landau, Chuck Plotkin, and Roy Bittan. It was originally released on the album “Human Touch” in 1992.
    This answer is correct, complete and similar to the previous paragraph.

Adding System Messages did not have any influence on the results.

8. Overall Conclusion

What did you learn from all of this?

  1. The way documents are read and embedded seems to have the largest influence on the result.
  2. An advantage of this approach is that you are able to display a number of results. This allows you to determine which result is the correct one.
  3. Changing your question so that it uses the terminology used in the text segments helps to get a better result.
  4. Querying a vector store is very fast. Embedding costs some time, but you only need to do this once. Using an LLM takes a lot more time to retrieve a result when you do not use a GPU.

An interesting resource to read is Deconstructing RAG, a blog post from LangChain. Improvements in this area should lead to better results.

Besides that, metadata filtering can also be helpful. This way, you are able to filter segments based on metadata keywords before querying. At the time of writing, this is not yet possible with LangChain4j, but it is a high priority as can be seen in this issue.
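To make the idea of metadata filtering concrete, the sketch below keeps only the segments that originate from one file before any similarity search would run. It uses the file_name metadata key that appears in the output earlier. The Segment record and the filter method are hypothetical names for this illustration and are not LangChain4j API.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class MetadataFilterSketch {

    // Hypothetical segment type: text plus metadata key/value pairs.
    record Segment(String text, Map<String, String> metadata) {}

    // Keep only the segments whose metadata contains the given key/value pair.
    static List<Segment> filter(List<Segment> segments, String key, String value) {
        return segments.stream()
                .filter(segment -> value.equals(segment.metadata().get(key)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Segment> segments = List.of(
                new Segment("\"Highway Patrolman\" | Nebraska | 1982",
                        Map.of("file_name", "bruce_springsteen_list_of_songs_recorded.md")),
                new Segment("The Essential Bruce Springsteen",
                        Map.of("file_name", "bruce_springsteen_discography_compilation_albums.md")));

        // A question about a song only needs the list of songs.
        List<Segment> songsOnly = filter(segments, "file_name",
                "bruce_springsteen_list_of_songs_recorded.md");
        songsOnly.forEach(segment -> System.out.println(segment.text()));
    }
}
```

With such a filter, a row from the compilation albums file could no longer outrank the song you are asking about, because it would never be scored in the first place.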