Author: Vitoria Lima

Goal: Play with text classification with LLMs and do a small write up about its pro and cons.

Datasets:

Dataset Source Categories GitHub Code
Email Spam Dataset Kaggle ·Safe Email·Phishing Email Jupyter Notebook
Fake News Classification Dataset Kaggle ·1: Truthful News·0: Fake News Jupyter Notebook
Language Classification Dataset Kaggle English, Malayalam, Hindi, Tamil, Kannada, French, Spanish, Portuguese, Italian, Russian, Swedish, Dutch, Arabic, Turkish, German, Danish, and Greek Jupyter Notebook
Youtube Spam Dataset Kaggle ·Spam Comment: 1·Not Spam Comment: 0 Jupyter Notebook

Table of Contents:

Disclaimer:

This blogpost has information up to date to the date it has been drafted and written, i.e. November 12th 2024.

This is an ongoing research topic, and this blogpost is an incomplete, but hopefully useful, current snapshot of its ongoing research questions.

This blogpost focuses on utilizing a mix of Zero-shot / Few-shot / Prompting, essentially fancy words to say to use and call an API and add some sprinkle sentences around it, like these:

In the future, I will focus in another blogpost on how to leverage fine-tuning / embeddings / llm ensambling to enhance classification with LLMs, like actual interesting work leveraging model weights and embeddings stuff.

Don’t forget to spend time reading what the community thinks as well, sometimes it’s quite hilarious as here ( https://www.reddit.com/r/MachineLearning/comments/1c4a7sa/d_are_traditional_nlp_tasks_such_as_text/): “Can you kill a mosquito with a bazooka? Yes. Is it the most efficient tool to do so? No.”

Enough fluff, enjoy this little blogpost now!

🫖 The tea about LLMs classification:

Pro number #1:

It’s literally as easy as setting up this code structure:

import instructor
from pydantic import BaseModel, Field
from openai import OpenAI

# Initialize the OpenAI client
client = instructor.patch(OpenAI())

# Define the data model for classification
class YTCategory(str, Enum):
    SPAM = "Spam Comment" 
    NOT_SPAM = "Not Spam Comment" 

class YTClassification(BaseModel):
    category: YTCategory
    confidence: float = Field(ge=0, le=1, description="Confidence score for the classification")

To understand what I did above more in depth, look at this blogpost explaining in detail how to leverage LLMs potential with the Instructor library and the Pydantic library.

Blogpost Suggestion: Bridging Language Model with Python using Instructor, Pydantic, and OpenAI's Function Calling.