Hugging Face NLP Hands-On Course

Hugging Face NLP Course

Hugging Face · Hugging Face team · about 10 hours

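The official Hugging Face course. It teaches you to use the Transformers library for NLP tasks such as text classification, question answering, translation, and fine-tuning models. A solid foundation for developers.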

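📑 Contents

Getting started with Transformers: NLP in one line with Pipeline
Tokenizers and models: understanding what happens under the hood
Fine-tuning: adapting a pretrained model to your task
Uploading a model to the Hugging Face Hub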

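📖 What you'll learn

Core usage of the Transformers library
Text classification, named entity recognition, question answering
Fine-tuning pretrained models
Uploading a model to the Hugging Face Hub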

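01 Getting started with Transformers: NLP in one line with Pipeline

Hugging Face Transformers is one of the most widely used NLP libraries; a few lines of code are enough to run state-of-the-art models.

Installation
pip install transformers datasets torch

Pipeline: the simplest way to get started
A pipeline wraps model loading, preprocessing, and inference into a single call.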

from transformers import pipeline

# Sentiment analysis
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")
print(result)  # [{'label': 'POSITIVE', 'score': 0.99}]

# Text generation
generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time", max_length=50)
print(result[0]['generated_text'])

# Question answering
qa = pipeline("question-answering")
result = qa(
    question="What is the capital of France?",
    context="Paris is the capital of France and a major European city."
)
print(result['answer'])  # Paris

# Translation (Chinese to English)
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-zh-en")
result = translator("人工智能正在改变世界")
print(result[0]['translation_text'])
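
To make the "loading, preprocessing, inference" description more concrete, here is a minimal pure-Python sketch of how such stages can be chained behind one call. ToyPipeline, the stage functions, and the word lists are hypothetical teaching aids, not part of the transformers library, and the word-counting "model" is only a stand-in for a real neural network.

```python
# Toy stand-in for the stages a pipeline bundles. ToyPipeline and the word
# lists are hypothetical; they are not part of the transformers library.
POSITIVE_WORDS = {"love", "great", "good"}
NEGATIVE_WORDS = {"hate", "bad", "terrible"}


def preprocess(text):
    """Stage 1: turn raw text into model inputs (here, lowercase words)."""
    return [word.strip("!?.,").lower() for word in text.split()]


def forward(tokens):
    """Stage 2: run the "model" (here, positive minus negative word count)."""
    return sum((t in POSITIVE_WORDS) - (t in NEGATIVE_WORDS) for t in tokens)


def postprocess(raw_score):
    """Stage 3: turn the raw output into a readable result."""
    label = "POSITIVE" if raw_score >= 0 else "NEGATIVE"
    return {"label": label, "raw_score": raw_score}


class ToyPipeline:
    """Chains the three stages behind a single call."""

    def __call__(self, text):
        return [postprocess(forward(preprocess(text)))]


toy_classifier = ToyPipeline()
print(toy_classifier("I love this product!"))
```

The real pipeline follows the same shape, with a tokenizer as the first stage and a pretrained model as the second.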

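Common Pipeline tasks
• sentiment-analysis: sentiment analysis
• text-classification: text classification
• ner: named entity recognition
• question-answering: question answering
• summarization: text summarization
• translation: translation
• text-generation: text generation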

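02 Tokenizers and models: understanding what happens under the hood

For more complex tasks, you need to understand how tokenizers and models work.

Tokenizer: turning text into numbers
Models can't process text directly. The text first has to be split into tokens, and each token is mapped to an integer ID.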

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

# Encode
text = "人工智能很有趣"
tokens = tokenizer(text, return_tensors="pt")
print(tokens)
# {'input_ids': tensor([[101, 782, 2339, ...]]), 'attention_mask': ...}

# Decode (IDs back to text)
decoded = tokenizer.decode(tokens['input_ids'][0])
print(decoded)  # [CLS] 人 工 智 能 很 有 趣 [SEP]  (tokens are joined with spaces)
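
The encode/decode round trip can be illustrated without downloading anything. The sketch below is a hypothetical character-level tokenizer (ToyCharTokenizer is an invented name); real BERT tokenizers use a WordPiece vocabulary learned from a large corpus, so the IDs here will not match those of bert-base-chinese.

```python
class ToyCharTokenizer:
    """Character-level tokenizer; a teaching aid, not the BERT algorithm."""

    SPECIAL = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]

    def __init__(self, corpus):
        # Build the vocabulary: special tokens first, then every character seen.
        chars = sorted({ch for text in corpus for ch in text})
        self.id_to_token = self.SPECIAL + chars
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}

    def encode(self, text):
        # Wrap the character IDs in [CLS] ... [SEP]; unseen characters map to [UNK].
        unk = self.token_to_id["[UNK]"]
        ids = [self.token_to_id["[CLS]"]]
        ids += [self.token_to_id.get(ch, unk) for ch in text]
        ids.append(self.token_to_id["[SEP]"])
        # attention_mask marks real tokens with 1 (there is no padding here).
        return {"input_ids": ids, "attention_mask": [1] * len(ids)}

    def decode(self, ids):
        # Map IDs back to tokens and join them with spaces.
        return " ".join(self.id_to_token[i] for i in ids)


toy_tokenizer = ToyCharTokenizer(["人工智能很有趣"])
encoded = toy_tokenizer.encode("人工智能")
print(encoded["input_ids"])
print(toy_tokenizer.decode(encoded["input_ids"]))
```

The output dictionary mirrors the input_ids and attention_mask keys shown above, which is the part that carries over to the real library.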

Loading a model for inference

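from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the tokenizer from the same checkpoint as the model so the vocabularies match
checkpoint = "uer/roberta-base-finetuned-jd-binary-chinese"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

inputs = tokenizer("这个产品质量很好", return_tensors="pt")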
with torch.no_grad():
    outputs = model(**inputs)

# Convert the logits into probabilities
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)  # one probability per class; model.config.id2label gives the class order
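
The softmax step turns raw logits into probabilities that sum to 1. A minimal pure-Python sketch of the same computation follows; the logit values are made up for illustration and do not come from the model above.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [-1.2, 2.3]  # made-up values for a two-class model
probs = softmax(example_logits)
predicted_index = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_index)
```

The predicted class is the index with the highest probability; which index means "positive" depends on the model's label mapping.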

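Why use the Auto classes
AutoTokenizer and AutoModel read the checkpoint's configuration and load the matching class automatically, so you don't have to specify the model architecture by hand.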
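
The dispatch idea behind the Auto classes can be sketched as a lookup from a model type to a class. Everything below (ToyAutoModel, ToyBertModel, ToyGPT2Model, MODEL_REGISTRY) is hypothetical and greatly simplified; the real mapping lives inside transformers and covers many more architectures.

```python
class ToyBertModel:
    def __init__(self, config):
        self.config = config


class ToyGPT2Model:
    def __init__(self, config):
        self.config = config


# Hypothetical mini-registry: model type string -> class to instantiate.
MODEL_REGISTRY = {"bert": ToyBertModel, "gpt2": ToyGPT2Model}


class ToyAutoModel:
    @staticmethod
    def from_config(config):
        """Pick the class named by config["model_type"] and build it."""
        model_type = config["model_type"]
        try:
            model_cls = MODEL_REGISTRY[model_type]
        except KeyError:
            raise ValueError(f"Unrecognized model_type: {model_type!r}") from None
        return model_cls(config)


toy_model = ToyAutoModel.from_config({"model_type": "bert"})
print(type(toy_model).__name__)
```

Because the caller only supplies a configuration, swapping one checkpoint for another does not require changing the calling code.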

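03 Fine-tuning: adapting a pretrained model to your task

Fine-tuning takes a general-purpose pretrained model and continues training it on your own data so that it fits your task better.

When to fine-tune
• A general-purpose model performs poorly in your domain (for example, medical or legal terminology)
• You have labeled private data
• You need the model to produce output in a specific format

Preparing the dataset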

from datasets import Dataset

# Prepare the training data
data = {
    "text": ["这个很好", "质量太差了", "还不错", "完全不值这个价"],
    "label": [1, 0, 1, 0]  # 1 = positive, 0 = negative
}
dataset = Dataset.from_dict(data)

# Split into training and test sets
dataset = dataset.train_test_split(test_size=0.2)
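
What a train/test split does can be sketched in plain Python: shuffle the row indices, then carve off a fraction for testing. toy_train_test_split is a hypothetical helper, and rounding the test size up is an assumption of this sketch, not a statement about the exact internals of the datasets library.

```python
import math
import random


def toy_train_test_split(rows, test_size=0.2, seed=42):
    """Shuffle rows reproducibly and split them into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed gives a repeatable split
    n_test = math.ceil(len(rows) * test_size)  # assumption: round up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return {
        "train": [rows[i] for i in train_idx],
        "test": [rows[i] for i in test_idx],
    }


rows = [
    {"text": "这个很好", "label": 1},
    {"text": "质量太差了", "label": 0},
    {"text": "还不错", "label": 1},
    {"text": "完全不值这个价", "label": 0},
]
split = toy_train_test_split(rows)
print(len(split["train"]), len(split["test"]))
```

With only four rows, a test set this small says very little about model quality; the toy data is useful for checking that the code runs, not for measuring accuracy.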

Fine-tuning workflow

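from transformers import TrainingArguments, Trainer

# Tokenize the text column first; Trainer needs input_ids, not raw strings
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

tokenized = dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    eval_strategy="epoch",  # named evaluation_strategy in transformers releases before 4.41
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)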

trainer.train()
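
By default the evaluation at the end of each epoch reports only the loss. Trainer accepts a compute_metrics callable for extra metrics; the sketch below computes accuracy in pure Python, assuming the callable receives a pair that unpacks into per-example logits and integer labels.

```python
def compute_metrics(eval_pred):
    """Return accuracy given (logits, labels) for an evaluation set."""
    logits, labels = eval_pred
    # For each example, the prediction is the index of the largest logit.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for three examples, with their true labels.
sample = ([[0.1, 0.9], [2.0, -1.0], [0.3, 0.7]], [1, 0, 0])
print(compute_metrics(sample))
```

To use it, pass compute_metrics=compute_metrics when constructing the Trainer.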

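Free GPU resources
Fine-tuning is impractically slow without a GPU. Free options:
• Google Colab (free T4 GPU, subject to usage limits)
• Kaggle Notebooks (about 30 GPU hours per week)
• Hugging Face Spaces (some free GPU options)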

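04 Uploading a model to the Hugging Face Hub

A trained model can be uploaded to the Hugging Face Hub to share it publicly and to make it easy to load from anywhere.

Sign up and log in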

pip install huggingface_hub
huggingface-cli login  # paste your HF token when prompted

Create the token at huggingface.co → Settings → Access Tokens. Uploading requires a token with write access.

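Uploading the model
Trainer.push_to_hub() takes a commit message as its first argument, not a repository name. Set hub_model_id="your-username/model-name" in TrainingArguments to choose the target repository, then call:
trainer.push_to_hub()

Or upload manually:

model.push_to_hub("your-username/model-name")
tokenizer.push_to_hub("your-username/model-name")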

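Writing a good Model Card
The Model Card is the model's documentation. A clear Model Card makes it easier for others to use your model.

Create a README.md in the repository that covers:
• Intended use
• Training data
• Example usage code
• Performance metrics
• Limitations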
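
The checklist above can be turned into a README.md skeleton with a few lines of Python. build_model_card and the placeholder hints are hypothetical; the section titles simply mirror the list, and the Hub does not require this exact layout.

```python
# Section titles mirror the checklist; the hints are placeholders to fill in.
MODEL_CARD_SECTIONS = [
    ("Intended use", "Describe what the model is for and who should use it."),
    ("Training data", "Describe the data source, size, and labeling process."),
    ("Usage example", "Show a short code snippet that loads and calls the model."),
    ("Performance metrics", "Report metrics and the evaluation set they come from."),
    ("Limitations", "List known failure cases and out-of-scope uses."),
]


def build_model_card(model_name, sections=MODEL_CARD_SECTIONS):
    """Return Markdown text with one heading per section."""
    lines = [f"# {model_name}", ""]
    for title, hint in sections:
        lines += [f"## {title}", "", hint, ""]
    return "\n".join(lines)


card_text = build_model_card("sentiment-demo")
print(card_text)
```

Saving card_text as README.md in the model repository makes it appear as the model's landing page on the Hub.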

Using your uploaded model

from transformers import pipeline

# Anyone can load your model this way, as long as the repository is public
classifier = pipeline("text-classification", model="your-username/model-name")
result = classifier("测试文本")  # any input text
print(result)
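
If no label names were configured during fine-tuning, a text-classification pipeline reports generic labels such as LABEL_0 and LABEL_1. The sketch below maps them back to readable names, assuming the 0 = negative, 1 = positive convention from the dataset section; to_readable, LABEL_NAMES, and the sample scores are made up for illustration.

```python
# Assumed mapping, following the 1 = positive / 0 = negative labels used earlier.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}


def to_readable(results, label_names=LABEL_NAMES, min_score=0.5):
    """Rename generic labels and flag low-confidence predictions."""
    readable = []
    for item in results:
        readable.append({
            # Unknown labels are passed through unchanged.
            "label": label_names.get(item["label"], item["label"]),
            "score": round(item["score"], 4),
            "confident": item["score"] >= min_score,
        })
    return readable


# Made-up output in the list-of-dicts shape a classification pipeline returns.
sample_results = [{"label": "LABEL_1", "score": 0.93127}]
print(to_readable(sample_results))
```

Setting id2label and label2id in the model configuration before training avoids the need for this mapping.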

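Practical value
A model with many downloads on the HF Hub can help build your technical reputation in NLP, which may help when looking for jobs or freelance work.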
