Kong's AI Semantic Cache plugin allows us to ask differently formulated questions, with different wording but the same meaning, and use a cache system to keep the answers to those questions persisted. It lets us use a vector database, like for example Redis. In this video we have a quick look at how to configure the AI Semantic Cache plugin and we see how it works. We also focus on how it can go wrong and what we can do to improve it. The point of this plugin is to avoid making "unnecessary" requests to our LLM model. The way this works is that our questions get converted into embeddings, and those embeddings are vectorized into the Redis database together with the model's answers. Using those vectors, the plugin can leverage Redis to find similarities and determine whether a certain question has already been asked, based on how similar it is to other questions with the same response. The theory behind this is quite fascinating, but we only look at this plugin from a very technical perspective in this video. As an example, we use questions about the color of items, like, for example, the color of the Sun. I hope you enjoy the video and have fun with it, and until the next one, be sure to stay tech, keep programming, be kind and have a good one everyone! Cheers!

---

Chapters:

00:00:00 Start
00:00:42 Intro
00:03:42 Explaining Kong AI Semantic Cache Plugin
00:05:52 Referencing "What is a vector database?" video
- https://www.youtube.com/watch?v=Yhv19le0sBw
00:07:51 Referencing the "Kong KONNECT, the first steps" video
- https://youtu.be/z3Y4NQgjGLE
00:08:12 Configuring Kong Konnect
00:09:48 Configuring the AI Semantic Cache Plugin in the example
00:13:12 Referencing the "How to and why configure the Kong AI-Proxy plugin in 10 minutes" video
- https://youtu.be/6Z8wWX-liBs
00:13:33 How to download the REDIS container
00:14:20 Explaining how to use environment variables with the prepared commands
00:18:52 Performing tests against the AI Semantic Cache plugin
00:24:56 End Notes
00:25:46 See you in the next video!
00:27:22 End credits
00:28:01 Disclaimer

---

Related videos:

- https://youtu.be/Kw5GZnMnVhw
- https://youtu.be/rJKbAzjb5lQ
- https://youtu.be/z3Y4NQgjGLE
- https://youtu.be/KE3VTYtLvnI
- https://youtu.be/6Z8wWX-liBs
- https://youtu.be/vRH4qLZ7tz8
- https://youtu.be/Yhv19le0sBw


---

Source code:

- https://github.com/jesperancinha/kong-test-drives
- https://github.com/jesperancinha/jeorg-cloud-test-drives

---

Soundtrack:

- https://soundcloud.com/joaoesperancinha/slow-guitar-6-jesprotech

---

Sources:

- https://docs.konghq.com/hub/kong-inc/ai-semantic-cache/

---

As a short disclaimer, I'd like to mention that I'm not associated or affiliated with any of the brands that may be shown, displayed, or mentioned in this video.

---
Transcript
00:00Music
00:30This is my configuration for Kong so that I can get to use the AI Semantic Cache plugin.
00:52How did I achieve this and what does this do?
00:55This is what we are going to see in this video.
00:59So if I go here to the project I developed and have used in the other videos, for the
01:04other plugins that I've shown before that we can use with the Kong gateway,
01:11we can see here the README AI Semantic Cache.md file where I have collected all the different
01:19examples that I want to use for this video.
01:23If we go here further down this file we will see that now that I have everything configured
01:28we can try these different queries to our servers.
01:35And the first question is a very simple question and that is we are going to ask via the Kong
01:40gateway the color of the sun.
01:43The exact question is tell me the color of the sun.
01:48So let's see what happens when we now run this through the API gateway and get the response
01:53back.
01:57When we run this we can see that it takes a while to get the response back but then we
02:04get this response.
02:06And what our response to our request says is that the color of the sun is white.
02:13However, it often appears yellow, orange or red due to Earth's atmosphere scattering the
02:21light.
02:22Now it is important that we memorize this answer, at least this part of the answer.
02:26Because now I will ask the same question in a different way in this other request.
02:32What color best describes the color of the sun?
02:34It is a question for the same information.
02:38We are requesting the same information but with different text, with different semantics.
02:46So if we run this now, we will see the same response.
02:51We will see the color of the sun is white, however it often appears yellow, orange or red
02:56due to Earth's atmosphere scattering the light.
02:59We get exactly the same response.
03:01If you are used to running questions against LLM models, like for example with ChatGPT or
03:08any other kind of AI, you know that we usually don't get the same response back; we get different
03:14responses back, with different text.
03:17Maybe the context is a bit different, maybe it will give us a more in-depth response, maybe
03:23a simpler response, maybe with different words, with a different style.
03:27But we never really get the same response back.
03:29But here I am getting the same response back.
03:32This is because the AI Semantic Cache plugin from Kong allows us to store responses in a
03:41cache system.
03:42Now the way this works, and the benefit of it, is that when we are making different requests
03:50to our AI, we may want to ask questions that are about the same topic.
03:58And, furthermore, questions that need the same content.
04:01And maybe it's not that handy every time to get the request to go to our LLM model to ask
04:08the same question again over and over and over.
04:12Perhaps it's better, if we can detect from the way we ask the question that the question
04:17has been asked before, to simply grab our response from the cache.
04:23And this is what we have here.
04:25And so the way this works is, when we ask a question to our LLM model, normally in a simple
04:32situation, we ask it, we get the response from our LLM model, and then we use that response
04:41and we show that response in our prompt.
04:43But with the AI semantic cache plugin, we can then make the request in a different way.
04:50And so the way it goes is, we ask a first question to our LLM model via the API gateway.
04:58That request first goes to our vector database.
05:03The data there is stored in vectors.
05:06Now, vectors are a way of storing data.
05:10We can say a lot about it, but the most important thing is that we understand
05:15that we store different entities there, and those entities follow different patterns.
05:22And one thing is that we measure distances between the different texts and questions that we put
05:28there.
05:29There's a whole theory behind it.
05:30But the idea is that we understand that these elements of data are stored in a different way.
05:39And the idea is that we search for similarities in them.
05:42There is a very good video about this from Redis.
05:44I will put that in the description if you want to understand more about vector databases.
05:49But the important thing for us here in this video is that we understand that the responses to
05:54our questions and the questions themselves are stored in that database.
05:59If we don't find a response to our question, then what the AI gateway is going to do is send
06:06a request directly to the LLM.
06:10In this case, for this video, we are going to use Mistral, and we are going to ask Mistral
06:15questions.
06:16Then Mistral is going to give a response.
06:19Our question is converted into embeddings, and those embeddings, together with the response, will be vectorized
06:24into the Redis database.
06:28After doing that, we get the response back.
06:30That is our first response from Mistral.
06:33If we ask the same question with different text or in a different way, the AI Semantic
06:42Cache plugin will be able to detect that it is the same question.
06:47And if it is the same question asked in a different way, the AI semantic cache plugin will then realize
06:56that, hey, this is the same question.
06:58So I already have the response in cache.
07:00So I will not ask the question to our LLM model.
07:04I am going to give the answer back from what I already have stored in my Redis vectorized
07:11database.
07:13And this is how simple it gets.
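As a rough, simplified sketch of that flow (the exact behaviour depends on the plugin version and configuration):

    question -> generate embeddings of the question (mistral-embed) -> search Redis for similar vectors
      similar vector found -> return the cached answer, no call to Mistral
      no similar vector    -> forward the question to Mistral, store the question's embeddings
                              plus the answer in Redis, and return the answer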
07:15But how we configure this plugin is what this video is about, and we are going to see that
07:21now.
07:22But I'm going to start from the point where the API gateway is already configured.
07:27We are going to use Kong Konnect in this case.
07:30We could also use the Enterprise Edition of the Kong gateway for local installations.
07:36But in this case, it's easier if we go directly through the Kong Konnect website.
07:41To do that, we need to create data plane nodes, and we need to configure our gateway.
07:47But to do that, I already had a video before in this channel.
07:50You can look at it right over here.
07:52There I explain how to first create our account in Kong Konnect, how to create the gateway manager,
07:58how to create services and routes.
08:00We already have that now.
08:02So we are just going to focus our attention on how the plugin is configured and what sort
08:06of parameters it requires.
08:09So let's get into it right now over here.
08:12Via my Kong Konnect account, I can see that I already have these two plugins configured.
08:17I am going to delete them now, because we want to know how to create them and we want to see
08:23them being made.
08:26So we can delete them better this way.
08:30So delete and then let's delete this one as well.
08:38So now we don't have any plugins configured in our gateway.
08:42Let's have a look first at how I have this configured.
08:45So in overview, we will find that I've got a serverless default gateway and a hybrid gateway.
08:56This hybrid gateway I created before and here I already have a container running.
09:02And that container is an API gateway from Kong.
09:06We can go here to docker ps and we can see here that I've got a Redis Stack server running
09:15and I've got a Kong gateway version 3.9 running locally.
09:19These are two containers that I had to start so that we can get this plugin configured.
09:23Now the second important step is to get the service and this is the service that I have,
09:32my service and then here I also have my route configured.
09:36We are going to configure this plugin per route, not per service, per route simply.
09:44And if you go here to, and if you go now to the project and see the readme file and go
09:51all the way up, we will find that the first step is to configure the AI proxy plugin.
09:57The AI proxy plugin is important for us to be able to establish routes to Mistral.
10:03It doesn't matter much what we have here configured.
10:06It is important that we are able to have an AI proxy that recognizes that we are making requests
10:11to an LLM model.
10:13In this case, our large language model is Mistral and that's why we have this here configured.
10:18Mistral medium as the model name, Mistral as the provider, the OpenAI format, and then the upstream URL,
10:24https://api.mistral.ai/v1/chat/completions.
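For reference, the AI Proxy configuration described here looks roughly like this; this is only a sketch based on what is shown in the video, and the exact field names and values are in the repository's README and in the Kong docs:

    {
      "name": "ai-proxy",
      "config": {
        "route_type": "llm/v1/chat",
        "auth": {
          "header_name": "Authorization",
          "header_value": "Bearer <MISTRAL_API_KEY>"
        },
        "model": {
          "provider": "mistral",
          "name": "mistral-medium",
          "options": {
            "mistral_format": "openai",
            "upstream_url": "https://api.mistral.ai/v1/chat/completions"
          }
        }
      }
    }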
10:31But this is just for the AI proxy.
10:33What's important is that the semantic cache plugin is able to communicate with the
10:38embeddings model from Mistral, because we want our questions in the form of
10:43embeddings, because those are the ones we are going to vectorize into the database
10:48in Redis.
10:55And that configuration is located over here and we can see that we have a payload here
11:02for the AI semantic cache.
11:05For that, we need to hand in the Mistral API key via the header name and the header value.
11:13Then we need to say that in our model, the provider is Mistral, that the name of the model that
11:21we are using is mistral-embed, and that we provide an upstream URL for the embeddings endpoint.
11:29So instead of completions, we've got https://api.mistral.ai/v1/embeddings.
11:34Another important aspect is, of course, the vector database, because we are taking information
11:43from our large language model and vectorizing it and putting that into our database.
11:48And it is important for us to define what kind of vectors we want to use.
11:53Now, this part is where we need someone who is an expert in vector databases, and
12:00we need to be able to choose what kind of strategy we want to use, which in this case is Redis,
12:05because we are using a Redis database.
12:07We can define the distance metric.
12:10We can define the dimensions and the threshold.
12:14And then we need to define where we want to save the information that has been vectorized.
12:20In this case, the Euclidean metric is different from the cosine metric.
12:28And this difference has more to do with the way we measure distances between different entities,
12:34which follow different parameters.
12:37The theory behind it can revolve around physical distances and can also revolve around angles.
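For a rough intuition only: for two embedding vectors u and v, the Euclidean metric measures a "physical" distance, d(u, v) = sqrt(sum_i (u_i - v_i)^2), while the cosine metric only looks at the angle between the vectors, d(u, v) = 1 - (u · v) / (||u|| * ||v||), and ignores their lengths.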
12:44But in this case, it's not important that we know that for the configuration of the plugin,
12:49it's just important that we get an idea, just a small flavor of how this works,
12:54so that we understand what we are configuring in this vector DB node of the JSON payload
13:01that we are sending to our AI Semantic Cache plugin.
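Putting the pieces together, the JSON payload for the AI Semantic Cache plugin looks roughly like the sketch below; the exact field names for your Kong version are in the plugin documentation linked in the sources, and the exact command is in the repository's README. Here 1024 is the vector size produced by mistral-embed, and the distance metric and threshold are just example values:

    {
      "name": "ai-semantic-cache",
      "config": {
        "embeddings": {
          "auth": {
            "header_name": "Authorization",
            "header_value": "Bearer <MISTRAL_API_KEY>"
          },
          "model": {
            "provider": "mistral",
            "name": "mistral-embed",
            "options": { "upstream_url": "https://api.mistral.ai/v1/embeddings" }
          }
        },
        "vectordb": {
          "strategy": "redis",
          "distance_metric": "cosine",
          "dimensions": 1024,
          "threshold": 0.1,
          "redis": { "host": "<REDIS_HOST>", "port": 6379 }
        }
      }
    }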
13:06Now, about the AI proxy there is very little more to add. I've spoken about this AI Proxy plugin in this
13:14video over here.
13:15Again, both videos, one that talks about how to get Kong Konnect up and running, and the other
13:21one that talks about the AI proxy, I will put the links to them in the description of this
13:26video.
13:27So make sure to read that.
13:29And we can go further.
13:32And here at this point, we've got the command line that will run the Redis service.
13:39We need this, we need a Redis service container locally, or anywhere we want, because we need
13:45to reach a Redis server, and it needs to be an open Redis service, where we can simply
13:51run commands against it without authentication.
13:54So this container should be well protected, if we are going to use a Redis stack server.
14:01This is one example of a Redis service that we can use, and here we get our Redis container
14:06running.
14:07As I've shown you before, we've got our container running, our Redis container, and our Kong API
14:12gateway also running.
14:14To run these commands, it is important that we get our variables correct.
14:20Our variables can be created like this, we can export them for our command line purpose,
14:28and we can then get the control plane ID.
14:31Let me show you how you do that.
14:33For the control plane ID, we simply go here to the Kong Konnect website, we go to our
14:37gateway, and then here we will get our gateway ID, so then we can export
14:46the control plane ID and put our gateway ID in there.
14:54Here for the service ID, the same, we can just go here to the service, and then go here to
15:00gateway services, and then here, if we go to service, we can find here the ID.
15:05And the same thing goes for the route, which we can go here to the route, and then here
15:11we find my route, and here we can see the ID of the route, where we can then just simply
15:15do export route ID to that number.
15:19Now for the Kong API key, it's also very easy, you can just go here to the my hybrid gateway,
15:27so whatever name you've given to your hybrid gateway, and then here we just need to press
15:31here to connect, and then admin API, and then manage access token, there will be an option
15:39here to simply generate a new token, and then you use the token that you generated here in
15:45this export.
15:49For the Mistral API key, we can just go to Mistral website, I've shown you before, you can just
15:54open here the website, and go here try the API, and we can go here to API keys, and then
16:00here, we just create a new key, and use that key for our tests.
16:05There are different options for Mistral AI, and we can use the API key over here.
16:12Finally the Redis host, the way to get to the Redis host is simply by putting the IP, or if
16:19you have a domain name, if you're using a domain name, the domain name of the Redis host.
16:23In this case, for my local experiment, what I did is docker inspect, and then pass the first
16:30few characters of the Redis container ID, so for example 30b, and I've got here the IP address
16:40of my Redis container, 172.17.0.3, and then I use that in the Redis host environment variable.
16:51So we've got all of these, one, two, three, six environment variables, the control plane
16:57ID, the service ID, the route ID, the Kong API key, the Mistral API key, and the Redis host.
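As a sketch, the exports look something like the following; the variable names are illustrative and may not match the repository's scripts exactly:

    export CONTROL_PLANE_ID="<gateway / control plane ID from Konnect>"
    export SERVICE_ID="<gateway service ID>"
    export ROUTE_ID="<route ID>"
    export KONG_API_KEY="<Konnect personal access token>"
    export MISTRAL_API_KEY="<key created on the Mistral website>"
    # For a local container, the Redis host IP can be taken from docker inspect,
    # passing the first characters of the container ID (30b in the video):
    export REDIS_HOST="$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' 30b)"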
17:05With these variables set in place in our command line, we can then run the commands
17:10that will create the plugins, or configure our plugins, in our API gateway via Kong Konnect,
17:19simply by copying these scripts.
17:22So if I go over here, and use these scripts, by the way, these scripts will connect directly
17:31to the European region of Kong Konnect.
17:31Make sure to correct this for your region.
17:34So for example, if you are in the US, use US.
17:37If you are in the other regions, use the appropriate region for that.
17:42So here, if I just run this whole curl command, what I will now achieve is the creation of the
17:52AI Semantic Cache plugin in our gateway, specifically on our route.
17:58I copy this, go here to the command line, and paste it here.
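The shape of that command is roughly the following; this is a sketch only, and the region, API version and exact path should be taken from the README and the Konnect API docs:

    curl -X POST \
      "https://eu.api.konghq.com/v2/control-planes/${CONTROL_PLANE_ID}/core-entities/routes/${ROUTE_ID}/plugins" \
      -H "Authorization: Bearer ${KONG_API_KEY}" \
      -H "Content-Type: application/json" \
      -d '{ "name": "ai-semantic-cache", "config": { ... the payload shown earlier ... } }'
    # For other regions, replace eu.api.konghq.com with, for example, us.api.konghq.com.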
18:04It has now created the plugin, but the plugin on its own will not work.
18:08We need the AI proxy.
18:10And for that, you can just go further up the readme file, and then here, we can just copy
18:17this, and then this will configure our AI proxy plugin.
18:24If we do that, then we've got our proxy plugin configured, and if we now go to Kong Konnect,
18:31we should be able to see the plugin in the listing for our API gateway.
18:39So if we go to the gateway manager, my hybrid gateway, and if I go here to plugins, then
18:46I will see the AI Semantic Cache and the AI proxy back in the way we configured it before.
18:53So now we can make some tests, and the first test I'm going to make is, tell me the
18:59color of the sun, just the way we started.
19:01Except that now I'm asking this question for the first time via the plugins, only that
19:06I didn't change the vector database.
19:08So this means that we are probably still going to get the same response from the Redis service.
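The tests themselves are plain chat requests sent through the gateway; as a sketch, assuming the data plane listens locally on port 8000 and the route is mounted on a path like /mistral (adjust both to your setup):

    curl -s http://localhost:8000/mistral \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "tell me the color of the sun"}]}'

A cache hit is usually visible in the response headers (for example an X-Cache-Status header, depending on the plugin version) and, above all, in the much faster response on repeated or similar questions.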
19:14So let's see what happens if I run this now.
19:18It takes a while, and now it says the color of the sun is white, however the sun may appear
19:24yellow, orange or red from Earth due to Earth's atmosphere scattering light.
19:29So we've got exactly the same response simply because I didn't remove any data from my Redis
19:34database running locally.
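If we wanted to start from a clean cache instead, one way, assuming the local unauthenticated Redis container used here (named redis-stack in the earlier sketch), is simply to flush it:

    docker exec -it redis-stack redis-cli FLUSHALL   # removes all cached questions and answers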
19:38But it did take a while to get there, and this probably has to do with the initialization
19:42of these plugins.
19:44If I now ask the question, what color best describes the color of the sun, we should now
19:48get a quicker response and exactly the same response.
19:55Further down, I have another question: is the sun white or yellow?
20:01So it seems to be a different question, but essentially it's just a question that is trying
20:05to confirm what color of the sun is.
20:07It looks slightly different, but semantically the response should be the same.
20:12If we run this, we get the color of the sun is white, however the sun appears yellow.
20:19It's exactly the same response with also that comment about atmosphere scattering light.
20:27Now we come to another question, what is the color of Mars?
20:33This is a different question, but, perhaps because the plugin isn't correctly configured in my
20:40case, or maybe because we need to fine-tune the way it recognizes semantics,
20:48I get an interesting response.
20:53It still says, the color of the sun is white, however the sun appears yellow, orange or red
20:58from Earth due to Earth's atmosphere scattering light.
21:00It's the same response, but why are we getting the same response?
21:03Well, perhaps the way we configured the plugin doesn't allow the AI Semantic Cache plugin to detect
21:15that this question is not about the sun, that this question is about Mars.
21:22So this is where we start getting funny responses from the LLM model.
21:26If I go here further down, what is the color of a car?
21:29Now Mars and sun have in common that they are space bodies, they are different objects that
21:37are orbiting in a way.
21:38They have their own orbit, but they are part of the space realm.
21:44A car is different.
21:46Let's see if the large language model detects that the semantics here refer to a different
21:51kind of question.
21:53And if we do this, what is the color of a car?
21:57Then our LLM model will say, the color of a car can vary greatly as there are many different
22:02colors available for cars.
22:04Car colors include white, black, silver, and so on.
22:07And so we now get a different response that is accurate for the question that we made.
22:13And if we go and ask the question, is a red rose colored red, we are basically asking quite
22:21a funny question, is a red rose colored red?
22:25And we should get a response from our LLM model that says, yes, a red rose is indeed colored
22:33red.
22:34So the red of a rose comes from pigments called anthocyanins, and it goes on to explain what that is.
22:41But now I want to ask another question and see if this large language model in combination
22:46with the AI Semantic Cache plugin can answer something different.
22:52Let's say, what is the color of earth?
23:07Let's see what it responds.
23:22Now it does, let me try again.
23:28Now it responds something different.
23:29The color of a car can vary greatly as there are many different colors available for cars.
23:35Our question was clearly asking what is the color of earth.
23:42What this means is that perhaps our plugin, our AI Semantic Cache plugin, is just not fine-
23:52tuned to the kind of questions we want to ask our large language model.
23:57So if we go to the configuration of our plugin, we will find this stuff again, VectorDB with
24:04the strategy, the distance metric, the dimensions, the threshold, and also the choice of what
24:14kind of database we want to use.
24:14What is important here is that we know that the responses, and the way we relate those responses
24:20to the questions that we ask our large language model, will influence
24:25the way the AI works for us in this case.
24:29So we want to make sure that we get this as fine-tuned as possible, so that when we ask questions
24:35that require the same kind of response, it knows exactly which response to fetch, and
24:40not something like this, where we ask what is the color of the earth and then it starts
24:44talking about cars.
24:46Just because we fed the car color question into the cache before, the plugin now thinks
24:51that that question has the same semantics as the question about the color of the earth.
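One knob worth trying in a case like this is the threshold in the vectordb section of the configuration shown earlier; assuming it is interpreted as the maximum allowed distance for a cache hit (check the plugin docs for your version), a smaller value makes matching stricter, so "what is the color of Mars" is less likely to be treated as "what is the color of the sun":

    "vectordb": {
      "strategy": "redis",
      "distance_metric": "cosine",
      "dimensions": 1024,
      "threshold": 0.05,
      "redis": { "host": "<REDIS_HOST>", "port": 6379 }
    }

The plugin can then be updated with the same kind of Konnect API call used to create it.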
24:57Alright so this is the way we configure the AI semantic cache plugin for Kong and this is
25:05how we can use it with Kong Konnect.
25:10This plugin is very useful, as you have seen before, if it is fine-tuned, if it is correctly
25:14configured, and mostly if we are interested in making several different queries to our
25:21large language model, if we are sure that those questions can be cached, and if we are
25:27interested in caching those questions.
25:31It's only a question of looking at our current case and making sure that we make the right decision
25:37in using this plugin, because it isn't always guaranteed to be a good plugin
25:42to use.
25:43So this is my presentation about the AI semantic cache plugin for the Kong API Gateway.
25:50We saw how to use it with Kong Konnect, via the Kong Konnect website and our account.
25:57We were able to configure an API gateway, and in it we configured the services and routes,
26:02and then in the end we installed the AI Proxy to work together with the AI Semantic Cache
26:07plugin, allowing us to save responses and to make some kind of connection between the responses
26:13and the questions although in our case some questions didn't really match the responses
26:17that we were getting.
26:18But we now know that that has something to do with the way we configure our vector database
26:23which is the most important bit of the AI semantic cache plugin.
26:29And I want to emphasize this.
26:31That is the most important bit of this configuration.
26:33If we are configuring a vector database, we need to make sure that the vector database is able
26:39to store results in such a way that we minimize the errors in our analysis of the data.
26:46If you like this video make sure to leave a like to it.
26:50Make sure to subscribe to the channel to not miss out on coming videos.
26:54Make sure to pop in a comment about the video, about your experience, and if you have any questions
26:58there, I'm happy to help.
27:00And make sure to read the description for more information about not only the AI semantic
27:06cache plugin but also other plugins that the API gateway has to offer or API gateways in general.
27:12And until the next video be sure to stay tech, keep programming, be kind, and have a good one.
27:21Bye!
28:03I'm not associated or affiliated with any of the brands, eventually shown, displayed,
28:07or mentioned in this video.
