pipelines.classifier

Wrappers for different approaches to text classification: scikit-learn models, Hugging Face Transformers, and few-shot classification via SetFit.

source

ClassifierBase


def ClassifierBase(
    *args, **kwargs
):

Abstract base class providing shared functionality for the text classifiers in this module.


source

ClassifierBase.arrays2dataset


def arrays2dataset(
    X:List, y:Union, text_key:str='text', label_key:str='label'
):

Convert train or test examples to a Hugging Face dataset
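For example (a minimal sketch, where clf is any instantiated ClassifierBase subclass; the result is assumed to be a datasets.Dataset keyed by text_key and label_key):

X = ["good film", "awful film"]
y = [1, 0]
ds = clf.arrays2dataset(X, y)  # dataset with 'text' and 'label' columns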


source

ClassifierBase.dataset2arrays


def dataset2arrays(
    dataset, text_key:str='text', label_key:str='label'
):

Convert a Hugging Face dataset to X, y arrays
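For example (mirroring the few-shot example further below):

from datasets import load_dataset
dataset = load_dataset("SetFit/sst2")
X_train, y_train = clf.dataset2arrays(dataset["train"], text_key="text", label_key="label")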


source

ClassifierBase.evaluate


def evaluate(
    X_eval:list, y_eval:list, print_report:bool=True, labels:list=[], **kwargs
):

Evaluates labeled data using the trained model. If print_report is True, prints a classification report and returns nothing. Otherwise, returns a dictionary of the results. Extra kwargs are fed to self.predict.
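For example:

clf.evaluate(X_eval, y_eval)                                # prints a classification report
results = clf.evaluate(X_eval, y_eval, print_report=False)  # returns a results dictionary
print(results['accuracy'])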


source

ClassifierBase.explain


def explain(
    X:list, labels:list=[]
):

Explain the model's predictions on the given examples in X. (Requires shap and matplotlib to be installed.)
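A minimal sketch (the optional dependencies install with pip install shap matplotlib):

clf.explain(["The graphics on my monitor are terrible."])  # renders a SHAP text plot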


source

ClassifierBase.sample_examples


def sample_examples(
    X:list, y:list, num_samples:int=8, text_key:str='text', label_key:str='label'
):

Sample a dataset with num_samples examples per class
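For example (as used in the few-shot example further below):

X_sample, y_sample = clf.sample_examples(X_train, y_train, num_samples=8)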


source

SKClassifier


def SKClassifier(
    model_path=None, labels:list=[], **kwargs
):

Text classifier built on scikit-learn models.


source

SKClassifier.train


def train(
    X:List, y:Union, **kwargs
):

Trains the classifier on a list of texts (X) and a list of labels (y). Additional keyword arguments are passed directly to self.model.fit.

Args:

  • X: List of texts
  • y: List representing labels

Returns:

  • None
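A minimal sketch (the two-example dataset and label names here are purely illustrative):

clf = SKClassifier(labels=['negative', 'positive'])
clf.train(["awful film", "great film"], [0, 1])  # integer labels map to the supplied names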

source

SKClassifier.predict


def predict(
    X, **kwargs
):

Predict labels.


source

SKClassifier.predict_proba


def predict_proba(
    X, **kwargs
):

Predict label probabilities.
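For example (a sketch; the output is assumed to follow scikit-learn's predict_proba convention of one row per document and one column per class):

probs = clf.predict_proba(["some text", "another text"])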


source

SKClassifier.save


def save(
    filename:str
):

Save model to the specified filename (e.g., /tmp/mymodel.gz). The model is saved as a pickle file. To reload the model, supply model_path when instantiating SKClassifier.

Example: Training Scikit-Learn Text Classification Models

from sklearn.datasets import fetch_20newsgroups
from onprem.pipelines.classifier import SKClassifier
categories = [
    "alt.atheism",
    "soc.religion.christian",
    "comp.graphics",
    "sci.med",
]

train_b = fetch_20newsgroups(
    subset="train", categories=categories, shuffle=True, random_state=42
)
test_b = fetch_20newsgroups(
    subset="test", categories=categories, shuffle=True, random_state=42
)
x_train = train_b.data
y_train = train_b.target
x_test = test_b.data
y_test = test_b.target
classes = train_b.target_names

# y_test = [classes[y] for y in y_test]
# y_train = [classes[y] for y in y_train]

clf = SKClassifier(labels=classes)
clf.train(x_train, y_train)
test_doc1 = "Jesus Christ was a first century Jewish teacher and religious leader."
test_doc2 = "The graphics on my monitor are terrible."
print(clf.predict(test_doc1))
print(clf.predict([test_doc2]))
print(clf.predict([test_doc1, test_doc2]))
clf.evaluate(x_test, y_test)
soc.religion.christian
comp.graphics
['soc.religion.christian', 'comp.graphics']
                        precision    recall  f1-score   support

           alt.atheism       0.93      0.87      0.90       319
         comp.graphics       0.88      0.96      0.92       389
               sci.med       0.94      0.84      0.89       396
soc.religion.christian       0.91      0.96      0.94       398

              accuracy                           0.91      1502
             macro avg       0.91      0.91      0.91      1502
          weighted avg       0.91      0.91      0.91      1502
clf.evaluate(x_test, y_test, print_report=False)['accuracy']
0.9114513981358189
clf.save('/tmp/mymodel.gz') # save
clf = SKClassifier(model_path='/tmp/mymodel.gz', labels=classes) # reload
clf.evaluate(x_test, y_test)
                        precision    recall  f1-score   support

           alt.atheism       0.93      0.87      0.90       319
         comp.graphics       0.88      0.96      0.92       389
               sci.med       0.94      0.84      0.89       396
soc.religion.christian       0.91      0.96      0.94       398

              accuracy                           0.91      1502
             macro avg       0.91      0.91      0.91      1502
          weighted avg       0.91      0.91      0.91      1502

source

HFClassifier


def HFClassifier(
    model_id_or_path:str='google/bert_uncased_L-2_H-128_A-2', device=None, labels:list=[],
    **kwargs
):

Text classifier built on Hugging Face Transformers models.


source

HFClassifier.train


def train(
    X:List, y:Union, **kwargs
):

Trains the classifier on a list of texts (X) and a list of labels (y). Extra kwargs are treated as arguments to transformers.TrainingArguments.

Args:

  • X: List of texts
  • y: List of integers representing labels

Returns:

  • None
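For example, transformers.TrainingArguments fields such as num_train_epochs and per_device_train_batch_size can be supplied directly (as in the example below):

clf.train(x_train, y_train, num_train_epochs=1, per_device_train_batch_size=8)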

source

HFClassifier.predict


def predict(
    X:list, max_length:int=512, **kwargs
):

Predict labels. Extra kwargs are fed to the Hugging Face transformers text-classification pipeline.
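For example (a sketch; max_length is assumed to cap the tokenized input length):

preds = clf.predict(["a long document", "another document"], max_length=256)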


source

HFClassifier.predict_proba


def predict_proba(
    X:list, max_length:int=512, **kwargs
):

Predict label probabilities. Extra kwargs are fed to the Hugging Face transformers text-classification pipeline.

The default model is a tiny BERT model (i.e., `google/bert_uncased_L-2_H-128_A-2`), but we will use a larger model here to improve accuracy (e.g., distilbert).

Example: Training Hugging Face Transformer Models

categories = [
    "alt.atheism",
    "soc.religion.christian",
    "comp.graphics",
    "sci.med",
]

train_b = fetch_20newsgroups(
    subset="train", categories=categories, shuffle=True, random_state=42
)
test_b = fetch_20newsgroups(
    subset="test", categories=categories, shuffle=True, random_state=42
)
x_train = train_b.data
y_train = train_b.target
x_test = test_b.data
y_test = test_b.target
classes = train_b.target_names

clf = HFClassifier(model_id_or_path='distilbert/distilbert-base-uncased', 
                   device='cuda', labels=classes)
clf.train(x_train, y_train, num_train_epochs=1, per_device_train_batch_size=8)
test_doc1 = "Jesus Christ was a first century Jewish teacher and religious leader."
test_doc2 = "The graphics on my monitor are terrible."
print(clf.predict(test_doc1))
print(clf.predict([test_doc2]))
print(clf.predict([test_doc1, test_doc2]))
clf.evaluate(x_test, y_test)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[283/283 02:25, Epoch 1/1]

soc.religion.christian
comp.graphics
['soc.religion.christian', 'comp.graphics']
                        precision    recall  f1-score   support

           alt.atheism       0.89      0.88      0.89       319
         comp.graphics       0.97      0.98      0.97       389
               sci.med       0.97      0.95      0.96       396
soc.religion.christian       0.94      0.96      0.95       398

              accuracy                           0.95      1502
             macro avg       0.94      0.94      0.94      1502
          weighted avg       0.95      0.95      0.95      1502
clf.explain(test_doc1)
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


[SHAP text plot: per-token contributions of test_doc1 toward each of the four output classes (alt.atheism, comp.graphics, sci.med, soc.religion.christian)]
clf.save('/tmp/my_hf_model')
clf = HFClassifier('/tmp/my_hf_model', device='cuda', labels=classes)
clf.evaluate(x_test,  y_test, print_report=False)['accuracy']
0.9454061251664447

source

FewShotClassifier


def FewShotClassifier(
    model_id_or_path:str='sentence-transformers/paraphrase-mpnet-base-v2', use_smaller:bool=False,
    **kwargs
):

Few-shot text classifier built on SetFit.


source

FewShotClassifier.train


def train(
    X:List, y:Union, num_epochs:int=10, batch_size:int=32, metric:str='accuracy', callbacks=None,
    **kwargs
):

Trains the classifier on a list of texts (X) and a list of labels (y). Additional keyword arguments are passed directly to setfit.TrainingArguments.

Args:

  • X: List of texts
  • y: List of integers representing labels
  • num_epochs: Number of epochs to train
  • batch_size: Batch size
  • metric: metric to use
  • callbacks: A list of callbacks to customize the training loop.

Returns:

  • None
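For example, SetFit training-argument fields such as max_steps can be supplied directly (as in the example below):

clf.train(X_sample, y_sample, max_steps=50)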

source

FewShotClassifier.predict


def predict(
    X, **kwargs
):

Predict labels.


source

FewShotClassifier.predict_proba


def predict_proba(
    X, **kwargs
):

Predict label probabilities.


source

FewShotClassifier.save


def save(
    save_path:str
):

Save model to the specified folder path, save_path. To reload the model, supply the path as the model_id_or_path argument when instantiating FewShotClassifier.

Example: Training Few-Shot Text Classifiers

clf = FewShotClassifier(labels=['negative', 'positive'])
model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
from datasets import load_dataset

Sample a tiny dataset with only 8 examples per class (or 16 total examples):

dataset = load_dataset("SetFit/sst2")
X_train, y_train = clf.dataset2arrays(dataset["train"], text_key="text", label_key="label")
X_test, y_test = clf.dataset2arrays(dataset["test"], text_key="text", label_key="label")
X_sample, y_sample = clf.sample_examples(X_train,  y_train, label_key="label", num_samples=8)
Repo card metadata block was not found. Setting CardData to empty.
clf.train(X_sample,  y_sample, max_steps=50)
Applying column mapping to the training dataset
***** Running training *****
  Num unique pairs = 144
  Batch size = 32
  Num epochs = 10
[50/50 00:28, Epoch 10/10]
Step Training Loss
1 0.242700
50 0.047300

clf.evaluate(X_test, y_test)
              precision    recall  f1-score   support

    negative       0.88      0.94      0.91       912
    positive       0.93      0.87      0.90       909

    accuracy                           0.91      1821
   macro avg       0.91      0.91      0.91      1821
weighted avg       0.91      0.91      0.91      1821
new_data = ["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"]
preds = clf.predict(new_data)
preds
['positive', 'negative']
preds = clf.predict_proba(new_data)
preds
tensor([[0.1657, 0.8343],
        [0.8551, 0.1449]], dtype=torch.float64)
clf.save('/tmp/my_fewshot_model')
clf = FewShotClassifier('/tmp/my_fewshot_model', labels=['negative', 'positive'])
preds = clf.predict(new_data)
preds
['positive', 'negative']
clf.explain(new_data)


[SHAP text plots. Example [0]: tokens such as "loved" (-0.292) and "!" (-0.249) push "i loved the spiderman movie!" away from the negative class. Example [1]: "worst" (+0.53) is the dominant contributor pushing "pineapple on pizza is the worst 🤮" toward the negative class.]