Author name: Munshi Alam

AI/ML

Large Language Model 101: SLM vs LLM – Understanding the Key Differences in AI Language Models

“The whole is greater than the sum of its parts” – one giant generalist LLM (Large Language Model) vs. swarms of SLMs (Small Language Models).

Because of their smaller size and lower pre-training and fine-tuning costs, SLMs are naturally more flexible than LLMs in agentic systems. This makes it far more affordable and practical to train, customize, and deploy multiple specialized models tailored to different agentic tasks. This approach also democratizes AI development, allowing broader participation and faster innovation from niche players in the agentic AI space. The thing to watch for: will the efficiency and flexibility of a collection of SLMs outweigh the overhead of managing them?

I’m genuinely excited about the potential of SLMs in agentic AI. Unlike large models, which are often slow and resource-heavy to adapt, SLMs let us distill LLM capabilities into smaller, domain-specific versions quickly and efficiently. This means organizations can build specialized agents faster, fine-tune them at lower cost, and deploy them in real-world workflows without the heavy infrastructure demands of LLMs. The result is a more agile and scalable approach to agentic AI, where specialized SLMs collaborate like expert teams, each optimized for a precise role.
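The “expert teams” idea can be sketched as a tiny task router that sends each agentic task to a specialized small model and falls back to a generalist LLM for anything unrecognized. This is a hypothetical illustration, not a real framework API: the model names and the `route_task` helper are invented for the example.

```python
# Hypothetical registry mapping agentic task types to specialized SLMs.
# Model names here are placeholders, not real checkpoints.
SPECIALIST_SLMS = {
    "summarize": "slm-summarizer-1b",
    "extract": "slm-extractor-3b",
    "classify": "slm-classifier-1b",
}

# Generalist fallback for tasks no specialist covers.
FALLBACK_LLM = "general-llm-70b"

def route_task(task_type: str) -> str:
    """Pick a specialized SLM for a known task type, else fall back to a generalist LLM."""
    return SPECIALIST_SLMS.get(task_type, FALLBACK_LLM)

print(route_task("summarize"))  # slm-summarizer-1b
print(route_task("plan"))       # general-llm-70b
```

In a real system the returned name would select an inference backend; the point is that adding a new specialist is one dictionary entry, which is the flexibility argument made above.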

Data Science

AI Models – Go Local or Go Cloud – Pros and Cons

Businesses today are constantly facing new and bigger challenges and are being asked to do more for less. Technology is diversifying and evolving faster than ever to address these issues, giving industries across the board a growing number of solutions to sift through.

Local AI and cloud-based AI each have their own advantages and disadvantages, and the choice between them depends on specific use cases and requirements. Here are some reasons why someone in the field of software development might prefer local AI over cloud-based AI in certain situations:

1. Data privacy and security: If you are working with sensitive data, keeping AI algorithms and models on a local server or device provides better control over data privacy and security. This is especially important in industries like healthcare, finance, and defense.

2. Low latency: Local AI can provide lower latency since data processing and inference happen on local hardware. This is crucial in applications where real-time responses are required, such as autonomous vehicles or robotics.

3. Offline capabilities: In situations where an internet connection is unreliable or unavailable, local AI can still function, while cloud-based AI may depend on a continuous connection to the cloud.

4. Cost management: Depending on usage, cloud-based AI services can become costly over time, especially with large volumes of data to process. Local AI can provide cost savings in the long run since you don’t pay for cloud resources.

5. Customization and control: With local AI, you have more control over the algorithms and models you use, allowing customization to meet specific needs. In contrast, cloud-based AI services may offer limited customization options.

6. Compliance: Certain industries and organizations have strict regulatory compliance requirements. Keeping AI local can simplify compliance with data handling regulations.

However, it’s important to note that local AI also has limitations. It may require substantial hardware resources, regular updates, and maintenance. Cloud-based AI has its own advantages, such as scalability, ease of deployment, and access to a wide range of pre-trained models and services. In some cases, a hybrid approach combining local and cloud-based AI can be the most effective solution, letting you leverage the benefits of both paradigms. Ultimately, the choice between local and cloud AI depends on your specific project goals and constraints.
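The hybrid approach can be sketched as a simple dispatcher that keeps sensitive inputs on a local model and routes everything else to the cloud. This is a minimal sketch under strong assumptions: a naive keyword check stands in for a real data-classification policy, and the local/cloud inference backends are left as placeholders.

```python
# Illustrative sensitivity check; a real deployment would use a proper
# data-classification policy, not a keyword list.
SENSITIVE_KEYWORDS = {"patient", "ssn", "account_number"}

def is_sensitive(text: str) -> bool:
    """Crude stand-in for a data-classification policy."""
    return any(keyword in text.lower() for keyword in SENSITIVE_KEYWORDS)

def dispatch(text: str) -> str:
    """Route sensitive requests to a local model, the rest to a cloud API."""
    if is_sensitive(text):
        return "local"   # e.g. call a locally hosted model here
    return "cloud"       # e.g. call a cloud inference endpoint here

print(dispatch("Summarize this patient record"))  # local
print(dispatch("Draft a product announcement"))   # cloud
```

The design point is that the privacy/compliance decision (items 1 and 6 above) is made once, at the routing layer, rather than scattered through application code.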

Data Science

Converse with Data using NLP / LLM – The Holy Grail!

Industry analysts have been predicting that “the future of BI is conversational” for quite some time now. Giant leaps made recently by various LLMs have reinvigorated that quest. Yet production-grade conversational BI solutions are very hard to engineer. Business users are still looking for insights in BI dashboards, and data analysts are still hand-writing SQL queries against their databases to answer ad-hoc business questions. Why is conversational BI still not here?

The vast majority of enterprise data is still in structured data stores, accessible mainly through SQL queries. Any conversational BI solution needs an engine that translates the user’s natural language question into a valid SQL query or a pandas DataFrame expression. Engineers have tried to build “Natural Language to SQL” (NL2SQL) engines since the 70s using rule-based techniques, which would very quickly get too complex to be useful. With the advancement of transformers, which have enabled tools like GitHub Copilot, OpenAI Code Interpreter, and Langchain’s pandas_dataframe_agent and SQL Agent, it would seem this should be a trivial problem to solve. It is not.

There are (at least) two ways a company can build an LLM-based NL2SQL engine to enable conversational BI:

1. Fine-tuning your own LLM: This approach requires taking an existing LLM and training it further on NL<>SQL pairs relating to the company’s structured data. Two challenges with this approach are that a) coming up with the training dataset is hard and expensive, and b) the most powerful LLM model around (GPT-4) cannot be fine-tuned (as of this writing).

2. Leveraging in-context learning: The latest LLM models (like GPT-4-32K) can write SQL quite well out of the box and have enough context window for quite a bit of few-shot training, and for an agent to try to recover from errors by performing follow-ups using chain-of-thought techniques.
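Approach #2 can be sketched as a prompt builder that packs the table schema and a few NL<>SQL exemplars into the context before asking the model to write SQL. The schema, the exemplar pair, and the `build_prompt` helper below are illustrative assumptions for the sketch, not part of any specific product:

```python
# Illustrative schema; a real engine would introspect the warehouse.
SCHEMA = """CREATE TABLE rent_prices (
  city TEXT, state TEXT, month TEXT, year TEXT, price NUMERIC
);"""

# A few NL<->SQL exemplars for few-shot (in-context) learning.
FEW_SHOTS = [
    ("average rent in Seattle, WA in 2022",
     "SELECT AVG(price) FROM rent_prices "
     "WHERE city='Seattle' AND state='WA' AND year='2022';"),
]

def build_prompt(question: str) -> str:
    """Assemble schema + exemplars + the user question into one LLM prompt."""
    shots = "\n".join(f"Q: {q}\nSQL: {sql}" for q, sql in FEW_SHOTS)
    return (
        "You translate natural language questions into SQL for this schema.\n"
        f"{SCHEMA}\n\n{shots}\n\nQ: {question}\nSQL:"
    )

prompt = build_prompt("average rent in Los Angeles, CA in May 2023")
# `prompt` would then be sent to the LLM's completion endpoint.
```

Note that the exemplar deliberately includes the state filter, which is exactly the kind of few-shot hint that nudges the model toward disambiguating cities (challenge #3 below).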
The idea here is to build an LLM agent on top of GPT-4 that can implement NL2SQL with few-shot learning. So what are the challenges of deploying solution #2? Here are seven we have encountered:

1. Table and column descriptions: Even the best data teams often do not have clear documentation about tables, columns, and metadata. With the rise of ELT, where data is simply dumped in the warehouse from various sources and transformed on query, the situation becomes even worse. Therefore the table and column names might be the only info available to the engine at configuration time.

2. Missing context and metadata: There are often business definitions that live in data analysts’ heads and are not in the underlying data. We encountered a real-world home rental marketplace for which what constitutes an “active listing” is a combination of WHERE clauses that differ based on the value of another column specifying the building type. In rare cases these are stored as views on the table, but more often than not they are just stored in a query in the BI tool/dashboard.

3. Incomplete info in the question, lack of “common sense”: “What was the average rent in Los Angeles in May 2023?” A reasonable human receiving this question would simply assume it is about Los Angeles, CA, or would confirm with the asker in a follow-up. However, an LLM usually translates this to select price from rent_prices where city="Los Angeles" AND month="05" AND year="2023", which pulls up data for both Los Angeles, CA and Los Angeles, TX without even selecting columns to differentiate between the two.

4. Speed: For the engine to be “conversational,” response times must be fast (sub-30s). This is often very hard to achieve, especially if the agent tries to recover from errors or evaluate generated responses with subsequent LLM calls.

5. Complex queries: While GPT-4 writes simple SQL queries very well, it can often stumble on complex queries that require aggregations and joins. This is exacerbated in cases where a column name contains an action that can be done in SQL (for example, Average or SUM), and in join operations on data warehouses where FOREIGN KEYs are not clearly enforced like they are in production DBs.

6. Privacy and data leaking: Many organizations do not want their database data or schema sent to companies like OpenAI, since it can leak into the training corpus.

7. Validation: There is no known way to identify cases where the system returns a syntactically valid but incorrect SQL query. For example, the user asks for an “average” value, and the system runs an AVG instead of picking a column called “average_price”.

So is enterprise conversational BI impossible in 2023? Will there be a few more years of academic papers and company AI hackathon projects before a solution can be deployed in production? We don’t think so. While the challenges are definitely real, we believe that with the right tool an enterprise data team can deploy solutions enabling business users to self-serve ad-hoc data questions from the company data warehouse. In the coming weeks we will be releasing a number of open source and hosted tools to address this.
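There is no complete fix for the validation challenge, but one partial guard can be sketched, under the assumption that the schema's identifiers can be enumerated up front: reject generated SQL that references identifiers outside the known schema. The identifier list and `references_unknown` helper below are invented for the sketch; this catches some hallucinated columns, but it cannot catch SQL that is valid yet semantically wrong.

```python
import re

# Known schema identifiers plus SQL keywords we tolerate. In a real
# system this set would be built by introspecting the warehouse.
KNOWN_IDENTIFIERS = {
    "rent_prices", "city", "state", "month", "year", "price",
    "average_price", "select", "from", "where", "avg", "and",
}

def references_unknown(sql: str) -> set:
    """Return identifiers in the generated SQL that are not in the schema."""
    no_strings = re.sub(r"'[^']*'", "", sql)  # drop string literals first
    tokens = set(re.findall(r"[A-Za-z_]+", no_strings.lower()))
    return tokens - KNOWN_IDENTIFIERS

# A hallucinated column is flagged; a clean query passes.
print(references_unknown(
    "SELECT AVG(price) FROM rent_prices WHERE city_code='LA'"))  # {'city_code'}
```

A query that passes this check can still compute the wrong thing (e.g. AVG(price) vs. average_price), so the guard reduces, rather than eliminates, the validation problem.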

Data Science

Chat with Documents with Improved Response Accuracy

Chatting with documents using natural language is one of the most sought-after LLM use cases among enterprises. A cute short name for this problem is RAG: Retrieval Augmented Generation. However, RAG lacks transparency about what it retrieves, making it hard to predict how the system will handle a given question. Valuable information can get lost amid a mass of irrelevant text, which is not ideal for a production-grade application.

Techniques for improving RAG performance:

After almost a year of building with LLMs, I have learned many techniques to improve RAG performance and summarized some of my lessons here. In this section, I will go over a few tested techniques:

- Adding additional info in the header or footer of each chunk
- Adding metadata to each chunk
- Adding summarized info to each chunk
- Using Langchain’s “Parent Document Retriever” with two sets of chunk sizes

Pre-retrieval step:

Despite the recent tremendous interest in applying NLP to a wider range of real-world applications, most NLP papers, tasks, and pipelines assume raw, clean text. However, many texts we encounter in the wild are not so clean; many are visually structured documents (VSDs) such as PDFs. Conventional preprocessing tools for VSDs mainly focus on word segmentation and coarse layout analysis. PDFs are versatile, preserving the visual integrity of documents, but they often pose a significant challenge when it comes to extracting and manipulating their contents.

We have all heard “garbage in, garbage out”. It applies to RAG too, yet many people ignore this step and focus on optimizing everything that comes after this crucial initial stage. You cannot simply extract text from your documents, throw it into a vector database, and expect reliable, accurate answers. Extraction of the text and tables from the documents has to be semantically accurate and coherent.
Here is an example from my own experience. I had 10 resumes of different candidates. The candidate’s name appears at the beginning of each resume; the rest of the pages (assume each resume is 2 pages long) have no mention of the name. In this case, chunks may lose that information when each resume is split up using some chunk size. One easy way to solve this is to add the missing info (e.g., the candidate’s name) to each chunk as a header or footer.

The second technique is chunk optimization. Based on your downstream task, you need to determine the optimal chunk length and how much overlap to use between chunks. If a chunk is too small, it may not include all the information the LLM needs to answer the user’s query; if it is too big, it may contain too much irrelevant information, which reduces vector search accuracy, confuses the LLM, and may sometimes be too big to fit into the context window.

From my own experience, you don’t have to stick to one chunk optimization method for all the steps in your pipeline. For example, if your pipeline involves both high-level tasks like summarization and low-level tasks like coding based on a function definition, you could use a bigger chunk size for summarization and smaller chunks for coding reference.

When your query requires the LLM to search lots of documents and then return a list of documents as the answer, it is better to use the similarity_search_with_score search type. If your query requires the LLM to perform a multi-step search to come to an answer, you can add “Think step by step” to the prompt. This helps the engine break down the query into multiple sub-queries.

After you retrieve the relevant chunks from your database, there are still more techniques to improve generation quality. You can use one or more of the following techniques based on the nature of your task and the format of your text chunks.
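The header trick from the resume example can be sketched in a few lines. The fixed-size splitter and the `chunk_with_header` helper are illustrative; a real pipeline would use a proper text splitter with overlap:

```python
def split_into_chunks(text: str, size: int = 200) -> list:
    """Naive fixed-size splitter, standing in for a real text splitter."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def chunk_with_header(doc_text: str, header: str, size: int = 200) -> list:
    """Prepend a document-level header (e.g. the candidate's name) to
    every chunk so that context survives splitting."""
    return [f"[Candidate: {header}]\n{chunk}"
            for chunk in split_into_chunks(doc_text, size)]

# Every chunk now carries the name, even chunks from page 2 of the resume.
chunks = chunk_with_header("...two pages of resume text...", header="Jane Doe")
```

The same pattern works for any document-level fact (title, date, source file): attach it to every chunk before embedding, so a vector hit on a late chunk still tells the LLM which document it came from.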
If your task is more relevant to one specific chunk, one commonly used technique is reranking or scoring. As I mentioned earlier, a high score in vector similarity search does not always mean the highest relevance. You should do a second round of reranking or scoring to pick out the text chunks that are actually useful for generating the answer. For reranking or scoring, you can ask the LLM to rank the relevance of the documents, or you can use other methods like keyword frequency or metadata matching to refine the selection before passing those documents to the LLM to generate the final answer.

Balancing quality and latency:

There are also some other tips that I have found useful for improving and balancing generation quality and latency. In production, your users may not have time to wait for a multi-step RAG process to finish, especially when there is a chain of LLM calls. The following choices may help if you want to improve the latency of your RAG pipeline. The first is to use a smaller, faster model for some steps. You don’t necessarily need the most powerful model (which is often the slowest) for every step in the RAG process. For example, for simple query rewriting, generating hypothetical documents, or summarizing a text chunk, you can probably use a faster model (like a 7B or 13B local model). Some of these models may even be capable of generating a high-quality final output for the user.
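The keyword-frequency variant of reranking can be sketched like this. The `keyword_score` and `rerank` helpers are illustrative, and a real system might instead ask an LLM or a cross-encoder to rank the candidates:

```python
def keyword_score(query: str, chunk: str) -> int:
    """Count how many distinct query words appear in the chunk."""
    words = set(query.lower().split())
    return sum(1 for w in words if w in chunk.lower())

def rerank(query: str, chunks: list, top_k: int = 2) -> list:
    """Second-pass rerank of vector-retrieved chunks by keyword overlap,
    keeping only the top_k most relevant."""
    return sorted(chunks, key=lambda c: keyword_score(query, c),
                  reverse=True)[:top_k]

retrieved = ["rent prices in LA", "weather in Paris", "average rent trends"]
print(rerank("average rent", retrieved, top_k=1))  # ['average rent trends']
```

Because this pass is cheap (no LLM call), it also helps with the latency concerns discussed next: you trim the candidate set before the expensive generation step.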
