Fine-Tuning LLMs Using PEFT: Lessons from Fine-Tuning for Text-to-SQL Conversion

With Generative AI rapidly evolving, fine-tuning large language models (LLMs) is becoming a crucial tool for building domain-specific AI solutions. While general-purpose models like GPT-4 and Gemini perform remarkably well across a broad spectrum of tasks, they often struggle with specialized use cases that require deep contextual understanding—such as complex SQL generation from natural language queries.

At our company, we explored fine-tuning an open-source LLM using PEFT (Parameter-Efficient Fine-Tuning) for NL-to-SQL conversion. The goal? Build a model that not only understands SQL syntax but also handles complex queries, multiple joins, nested subqueries, and custom macros—tasks where even the best general-purpose models often fall short.

In this article, I’ll walk through:

When to fine-tune instead of relying on pre-trained LLMs

How PEFT (Parameter-Efficient Fine-Tuning) improves efficiency

Key hyperparameters that impact fine-tuning performance

Our real-world fine-tuning results and what worked best

👉 Note: This article has been reviewed and approved for publication by my company, ensuring it aligns with our best practices and insights gained from real-world projects.


Why Fine-Tune Instead of Using GPT-4 or Gemini?

Before diving into the fine-tuning details, let’s answer a fundamental question: why fine-tune at all?

Cost-Efficiency: Running a 7B parameter LLM in-house is significantly cheaper than making repeated API calls to proprietary models. Fine-tuning allows better control over inference costs.

Faster Inference: A fine-tuned model delivers responses 1.5x to 2x faster than GPT-4 for complex SQL generation tasks.

Better Performance for Domain-Specific Tasks: Large models generalize well, but when the task requires deep domain-specific expertise (e.g., enterprise SQL patterns, custom macros, proprietary functions), fine-tuning yields far superior results.

When Fine-Tuning is NOT Ideal: If the model has no prior exposure to a task (e.g., generating code in an entirely new programming language), fine-tuning might lead to hallucinations and conflicts with pre-trained knowledge.

In our case, since LLMs already understand SQL syntax, fine-tuning was the perfect approach to refine and enhance the model’s ability to generate complex queries.


PEFT Fine-Tuning Explained Simply (LoRA Approach)

Fine-tuning an entire LLM is computationally expensive. Instead, PEFT (Parameter-Efficient Fine-Tuning) allows us to update only a fraction of the model’s weights, making fine-tuning efficient.

🔹 Instead of updating a full pre-trained weight matrix W (millions of parameters per layer), LoRA-style PEFT adds two small low-rank matrices A and B whose product represents the weight update, with far fewer trainable parameters.

🔹 During training, only A and B are updated, while the original model weights remain frozen.

🔹 After fine-tuning, the learned update (the product of B and A) is merged back into the original weights, so the deployed model is a single standard checkpoint.

💡 Result? Faster, more memory-efficient training while retaining the power of the base model. This method is particularly effective for LLM adaptation in enterprise settings.
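
To make this concrete, here is a tiny standalone sketch of the low-rank idea (dimensions are illustrative for a single 4096-wide layer; in practice the PEFT library manages all of this internally):

import torch

d, r, alpha = 4096, 128, 128     # layer width, LoRA rank and alpha (matching the config we use below)

W = torch.randn(d, d)            # frozen pre-trained weight: ~16.8M parameters
A = torch.randn(r, d) * 0.01     # trainable (r x d): ~0.5M parameters
B = torch.zeros(d, r)            # trainable (d x r): ~0.5M parameters, zero-initialised so training starts from W

# During training only A and B receive gradients; the layer effectively computes
W_effective = W + (alpha / r) * (B @ A)    # a rank-r update on top of the frozen weight

# "Merging" after fine-tuning simply folds that update into W once,
# so inference runs on a single dense matrix with no extra latency.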


Fine-Tuning Our LLM for Text-to-SQL

1️⃣ Loading the Model & Tokenizer

We used BitsAndBytesConfig for 4-bit (NF4) quantization, which cuts memory usage substantially with only a minor impact on output quality.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# model_name is the Hugging Face ID or local path of the base model being fine-tuned
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,            # 4-bit NF4 config defined above
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",   # Flash Attention 2 for faster training
    device_map="auto",                         # spread the model across available GPUs
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'right'
        

📌 Key takeaways:

  • device_map="auto" ensures the model distributes across available GPUs.
  • Flash Attention speeds up training.
  • 4-bit quantization allows efficient fine-tuning on consumer-grade GPUs.
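
For intuition on why 4-bit quantization matters, here is a rough back-of-envelope for the base weights of a 7B-parameter model (weights only; activations, the KV cache, and LoRA optimizer state come on top):

# Approximate memory needed just to hold the base weights
params = 7e9
print(f"bf16: {params * 2 / 1e9:.1f} GB   4-bit NF4: ~{params * 0.5 / 1e9:.1f} GB")
# -> bf16: 14.0 GB   4-bit NF4: ~3.5 GB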


2️⃣ Preprocessing & Preparing Training Data

We structured the dataset into three columns:

  • Question (Natural Language Query)
  • Query (Ground Truth SQL)
  • Schema (Database Metadata)

Each training example was formatted into a structured prompt, such as:

### Task
Generate a SQL query to answer the following question:
`How many hotels are available in each city?`

### Database Schema
CREATE TABLE Hotels(
  HotelID INT PRIMARY KEY,
  HotelName VARCHAR(30),
  City VARCHAR(30)
);

### Answer
```sql
SELECT Hotels.City, COUNT(*)
FROM Hotels
GROUP BY Hotels.City;
```
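
For reference, here is a rough sketch of how each dataset row can be assembled into this prompt layout (the helper name and exact template wording are illustrative, not our exact preprocessing code):

def build_prompt(question, schema, query):
    # Combine the three dataset columns into the ### Task / ### Database Schema / ### Answer layout
    return (
        "### Task\n"
        "Generate a SQL query to answer the following question:\n"
        f"`{question}`\n\n"
        "### Database Schema\n"
        f"{schema}\n\n"
        "### Answer\n"
        f"```sql\n{query}\n```"
    )

# Build a single "text" column that the trainer can consume
dataset = dataset.map(lambda row: {"text": build_prompt(row["Question"], row["Schema"], row["Query"])})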


3️⃣ Defining LoRA Adapters & Hyperparameters

Choosing the right hyperparameters was crucial.

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(
    target_modules=["q_proj","v_proj","k_proj","o_proj","lm_head"],
    lora_alpha=128,
    lora_dropout=0.05,
    r=128,
    bias="none",
    task_type="CAUSAL_LM"
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
        

📌 Key Learnings:

  • r=128 was optimal for learning complex SQL structures (nested subqueries, multi-joins).
  • lora_alpha=128 (equal to r, i.e., a scaling factor of 1) kept a good balance between the new SQL behaviour and the pre-trained knowledge.
  • Fine-tuning only the attention layers and lm_head yielded the best results.
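
A quick sanity check after wrapping the model (standard PEFT functionality, not specific to our setup) is to print how small the trainable share actually is:

# Reports trainable vs. total parameters after get_peft_model();
# with r=128 on the attention projections and lm_head this is only a few percent of the model.
model.print_trainable_parameters()
# Prints something like: trainable params: ... || all params: ... || trainable%: ...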


4️⃣ Training the Model

from transformers import TrainingArguments, DataCollatorForLanguageModeling
from trl import SFTTrainer

args = TrainingArguments(
    output_dir="./output_model_dir",
    num_train_epochs=4,
    per_device_train_batch_size=2,
    learning_rate=2e-5,
    logging_steps=500,
    save_steps=500,
    evaluation_strategy="steps"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    peft_config=peft_config,
    tokenizer=tokenizer,
    args=args
)

trainer.train()        
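
After training finishes, the LoRA adapter can be saved and, as described earlier, merged back into the base weights so the model ships as a single standard checkpoint. A minimal sketch using PEFT's standard API (the merged-model path is a placeholder, and the base model is reloaded without quantization for the merge):

from peft import PeftModel

# Save only the small LoRA adapter produced by training
trainer.save_model("./output_model_dir")

# Reload the un-quantized base model and fold the adapter into its weights
base_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base_model, "./output_model_dir").merge_and_unload()
merged.save_pretrained("./merged_model_dir")
tokenizer.save_pretrained("./merged_model_dir")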

🛠 Lessons from training:

  • 4 epochs provided the best balance—avoiding underfitting while preventing overfitting.
  • Batch size of 2 was necessary due to limited GPU memory.
  • Gradient accumulation (gradient_accumulation_steps in TrainingArguments) let us keep long sequences while still reaching a reasonable effective batch size.


🚀 Results: How Our Fine-Tuned Model Performed

More Accurate SQL Generation: The fine-tuned model outperformed GPT-4 on complex queries, including:

  • Nested subqueries
  • Multiple joins (5+ tables)
  • Date-time based aggregations

2x Faster Inference: Queries returned in 1 to 1.5 seconds, much faster than GPT-4.

Better Handling of Enterprise-Specific Macros: Custom SQL functions were understood and applied correctly.

Input (Natural Language Query):

"Get the top 5 product categories that generated the highest revenue in the last quarter, including total sales and their percentage contribution to overall revenue."

Output (SQL Query):

WITH category_sales AS (
    SELECT 
        c.category_name, 
        SUM(s.sales_amount) AS total_sales
    FROM sales s
    JOIN products p ON s.product_id = p.product_id
    JOIN categories c ON p.category_id = c.category_id
    WHERE s.sale_date BETWEEN '2024-10-01' AND '2024-12-31'
    GROUP BY c.category_name
),
total_revenue AS (
    SELECT SUM(total_sales) AS overall_sales FROM category_sales
)
SELECT 
    cs.category_name, 
    cs.total_sales, 
    ROUND((cs.total_sales / tr.overall_sales) * 100, 2) AS percentage_contribution
FROM category_sales cs
JOIN total_revenue tr
ORDER BY cs.total_sales DESC
LIMIT 5;        

⚠️ Limitations & Next Steps

  • Occasionally hallucinates column names (needs schema validation).
  • Struggles with large schemas (context length limitations).
  • Future improvements: Adding a post-processing layer to validate SQL.
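
As a rough illustration of that post-processing idea (a sketch of one possible approach, not something we have shipped), the generated query's column references can be checked against the known schema before execution; sqlglot is one library that makes this easy:

import sqlglot
from sqlglot import exp

def find_unknown_columns(sql: str, schema_columns: set) -> set:
    # Parse the generated SQL, collect every referenced column name,
    # and return any that do not exist in the known schema.
    # (Real queries with CTEs or column aliases need extra handling.)
    parsed = sqlglot.parse_one(sql)
    referenced = {col.name.lower() for col in parsed.find_all(exp.Column)}
    return referenced - {c.lower() for c in schema_columns}

# Example: validate a generated query against the Hotels table from earlier
bad = find_unknown_columns("SELECT Hotels.Town FROM Hotels", {"HotelID", "HotelName", "City"})
if bad:
    print(f"Hallucinated columns: {bad}")   # -> Hallucinated columns: {'town'}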


Final Thoughts

💡 Fine-tuning LLMs is a game-changer for domain-specific applications. If your use case involves structured queries, complex reasoning, or domain expertise, fine-tuning a PEFT-based model can outperform generic LLMs like GPT-4 and Gemini while being cheaper and faster.

If you’re exploring fine-tuning, start small, iterate fast, and tune your hyperparameters carefully! 🚀

Would love to hear from others working on fine-tuning LLMs—what has your experience been like? 👇✨

#LLM #AI #FineTuning #PEFT #SQL #MachineLearning #NLP



