Author: Vitoria Lima
Goal: Play with text classification with LLMs and do a small write-up about its pros and cons.
Datasets:
| Dataset | Source | Categories | GitHub Code |
|---|---|---|---|
| Email Spam Dataset | Kaggle | Safe Email; Phishing Email | Jupyter Notebook |
| Fake News Classification Dataset | Kaggle | 1: Truthful News; 0: Fake News | Jupyter Notebook |
| Language Classification Dataset | Kaggle | English, Malayalam, Hindi, Tamil, Kannada, French, Spanish, Portuguese, Italian, Russian, Swedish, Dutch, Arabic, Turkish, German, Danish, and Greek | Jupyter Notebook |
| Youtube Spam Dataset | Kaggle | 1: Spam Comment; 0: Not Spam Comment | Jupyter Notebook |
Disclaimer:
This blogpost reflects the state of the field as of the date it was drafted and written, i.e. November 12th, 2024.
This is an ongoing research topic, and this blogpost is an incomplete, but hopefully useful, snapshot of its open research questions.
This blogpost focuses on zero-shot and few-shot prompting, which are essentially fancy words for calling an API and sprinkling a few instruction sentences around your input, like these:
- https://hussainpoonawala.medium.com/text-classification-with-large-language-models-llms-a23c731a687e
- https://arxiv.org/html/2405.10523v1
- https://arxiv.org/html/2409.01466v1
- https://arxiv.org/pdf/2305.08377
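The zero-shot / few-shot distinction above boils down to whether you prepend labeled examples to the prompt before the input you want classified. A minimal sketch, with made-up labels and example comments:

```python
# Sketch of zero-shot vs. few-shot prompt construction.
# The label set and example comments below are illustrative, not from a real dataset.

LABELS = ["Spam Comment", "Not Spam Comment"]

def zero_shot_prompt(comment: str) -> str:
    # Zero-shot: only the task description and the input, no examples.
    return (
        f"Classify the following YouTube comment as one of {LABELS}.\n"
        f"Comment: {comment}\n"
        "Label:"
    )

def few_shot_prompt(comment: str, examples: list[tuple[str, str]]) -> str:
    # Few-shot: prepend a handful of labeled demonstrations before the input.
    demos = "\n".join(f"Comment: {c}\nLabel: {l}" for c, l in examples)
    return (
        f"Classify the following YouTube comment as one of {LABELS}.\n"
        f"{demos}\n"
        f"Comment: {comment}\n"
        "Label:"
    )

examples = [
    ("Check out my channel!!! http://spam.example", "Spam Comment"),
    ("Great video, thanks for the clear explanation.", "Not Spam Comment"),
]
print(few_shot_prompt("Subscribe 4 free giftcards!!", examples))
```

Either prompt is then sent to the model, and the text after "Label:" is taken as the prediction.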
In a future blogpost, I will focus on how to leverage fine-tuning, embeddings, and LLM ensembling to enhance classification with LLMs, i.e. actually interesting work involving model weights and embeddings.
Don’t forget to spend time reading what the community thinks as well; sometimes it’s quite hilarious, as here (https://www.reddit.com/r/MachineLearning/comments/1c4a7sa/d_are_traditional_nlp_tasks_such_as_text/): “Can you kill a mosquito with a bazooka? Yes. Is it the most efficient tool to do so? No.”
Enough fluff, enjoy this little blogpost now!
It’s literally as easy as setting up this code structure:
```python
import instructor
from enum import Enum
from pydantic import BaseModel, Field
from openai import OpenAI

# Initialize the OpenAI client, patched by instructor to accept response models
client = instructor.patch(OpenAI())

# Define the data model for classification
class YTCategory(str, Enum):
    SPAM = "Spam Comment"
    NOT_SPAM = "Not Spam Comment"

class YTClassification(BaseModel):
    category: YTCategory
    confidence: float = Field(ge=0, le=1, description="Confidence score for the classification")
```
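The response model above can be exercised without an API call: Pydantic validates whatever the LLM returns against the schema. A self-contained sketch (the `gpt-4o-mini` model name in the commented-out call is an assumption, not from the original post):

```python
from enum import Enum
from pydantic import BaseModel, Field

class YTCategory(str, Enum):
    SPAM = "Spam Comment"
    NOT_SPAM = "Not Spam Comment"

class YTClassification(BaseModel):
    category: YTCategory
    confidence: float = Field(ge=0, le=1, description="Confidence score for the classification")

# With instructor, the actual API call would look roughly like this
# (requires an API key; model name is an assumption):
#
# result = client.chat.completions.create(
#     model="gpt-4o-mini",
#     response_model=YTClassification,
#     messages=[{"role": "user", "content": f"Classify this comment: {comment}"}],
# )

# Locally, Pydantic parses and validates a raw response dict:
parsed = YTClassification.model_validate(
    {"category": "Spam Comment", "confidence": 0.97}
)
print(parsed.category.name, parsed.confidence)  # SPAM 0.97
```

Out-of-range values (e.g. `confidence=1.5`) raise a validation error, which is exactly the guardrail that makes structured outputs nicer than parsing free text.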
To understand what I did above in more depth, look at this blogpost explaining in detail how to leverage the potential of LLMs with the Instructor and Pydantic libraries.
Blogpost Suggestion: Bridging Language Models with Python using Instructor, Pydantic, and OpenAI's Function Calling.