Data Preprocessing Pipeline for Brazilian Portuguese NLP Model Training
A robust pipeline for collecting, cleaning, and preparing Brazilian Portuguese text data for NLP model training.
At a glance
Prompt objective
Build a reproducible data pipeline that transforms raw data into high-quality datasets ready for Brazilian Portuguese NLP model training.
Real use case
A Brazilian social media analytics company wants to train a sentiment analysis model for social media comments. They have 500,000 Instagram and Twitter comments, but 40% are spam, 15% contain emojis that confuse the tokenizer, and 10% mix Portuguese with English and regional slang.
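Before building the full pipeline, a quick audit pass can help confirm how widespread issues like these are in a sample. The sketch below is illustrative only: the emoji code point ranges, the spam thresholds, and the function names (`has_emoji`, `looks_like_spam`, `audit`) are assumptions, not part of the use case.

```python
import re
from collections import Counter

# Rough emoji coverage. These ranges are an assumption and are not exhaustive.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\U0001F1E6-\U0001F1FF]")
URL_RE = re.compile(r"https?://\S+")


def has_emoji(text: str) -> bool:
    """True if the text contains at least one character in the emoji ranges."""
    return EMOJI_RE.search(text) is not None


def looks_like_spam(text: str, max_urls: int = 1, max_repeat_ratio: float = 0.5) -> bool:
    """Heuristic only: too many links, or one token dominating a longer comment."""
    if len(URL_RE.findall(text)) > max_urls:
        return True
    tokens = text.lower().split()
    if len(tokens) >= 6:
        top_count = Counter(tokens).most_common(1)[0][1]
        return top_count / len(tokens) > max_repeat_ratio
    return False


def audit(comments):
    """Count how many comments trigger each heuristic."""
    counts = Counter()
    for comment in comments:
        counts["total"] += 1
        counts["emoji"] += has_emoji(comment)
        counts["spam"] += looks_like_spam(comment)
    return counts
```

Running `audit` on a random sample and comparing the counts with the expected proportions is a cheap sanity check before any filter thresholds are committed to the pipeline config.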
Customize these fields first
Replace the placeholders with your own context before you run the prompt. That usually improves the first output more than adding more instructions later.
Prompt
Create a data preprocessing pipeline for [PROJECT NAME], preparing [TEXT DATA TYPE] in Brazilian Portuguese for training a [TASK: sentiment classification/NER/summarization/QA] model.

**Context:**
- Raw data volume: [NUMBER] records
- Sources: [LIST: social media, reviews, documents, chat logs]
- Known issues: [LIST: spam, emojis, slang, typos, etc.]
- Language: Brazilian Portuguese (with possible excerpts in [OTHER LANGUAGES])
- ML Framework: [HUGGING FACE/SPACY/PYTORCH]

**1) Collection and Ingestion:**
- Collection scripts per source (API, web scraping, database export)
- Rate limiting and retry logic for APIs
- Deduplication: hash-based (exact) + MinHash LSH (near-duplicate)
- Storage format: Parquet (efficient) or JSONL
- Metadata per record: source, timestamp, author_id, language

**2) Text Cleaning:**
- Encoding normalization (UTF-8)
- HTML/markdown tag removal
- URL handling: remove, replace with [URL], or keep domain
- Mention (@user) and hashtag handling
- Emojis: [REMOVE/CONVERT TO TEXT/KEEP] (justify the choice for the task)
- Whitespace and punctuation normalization
- Corrupted encoding fixes (mojibake)
- PII removal: emails, phone numbers, national IDs (regex + NER)
- Complete reusable Python script with functions

**3) Quality Filtering:**
- Language detection (langdetect/fasttext-lid): filter non-Portuguese content
- Length filter: minimum [X] tokens, maximum [Y] tokens
- Spam/bot filter: repetitive patterns, excessive links
- Toxic content filter (if relevant)
- Perplexity-based filtering (remove incoherent text)
- Before/after statistics for each filter

**4) Linguistic Processing (Brazilian Portuguese):**
- Tokenization with target model tokenizer
- Brazilian regional slang normalization (mapping)
- Common abbreviation handling (vc, tb, pq, q, cmg, blz → você, também, porque, que, comigo, beleza)
- Stemming vs. Lemmatization: when to use each
- Named Entity Recognition for PII masking
- Light spell checking (for obvious errors, not slang)

**5) Annotation:**
- Annotation strategy: [MANUAL/SEMI-AUTOMATIC/WEAK SUPERVISION]
- Annotation guidelines (document for annotators)
- Inter-annotator agreement (Cohen's Kappa target: > 0.8)
- Tools: [LABEL STUDIO/PRODIGY/CUSTOM]
- Active learning: prioritize examples where the model has most uncertainty
- Augmentation: back-translation, paraphrasing, synonym replacement

**6) Dataset Split and Versioning:**
- Stratified split: train/val/test maintaining class distribution
- Time-based split (for temporal data): avoid data leakage
- DVC (Data Version Control) for dataset versioning
- Dataset card: statistics, class distribution, limitations, biases
- Export in Hugging Face Datasets format

**7) Reproducible Pipeline:**
- Orchestration: [MAKE/PREFECT/AIRFLOW/DVC PIPELINE]
- YAML config for parameterization
- Logging per step with statistics
- Reproducible: fixed seed, pinned dependencies
- Processing time and cost estimates

Provide complete Python code with input/output examples for each step.
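To give a sense of the kind of output the prompt asks for, here is a minimal sketch covering exact deduplication (step 1), basic text cleaning (step 2), and abbreviation expansion (step 4). It uses only the Python standard library. The function names, the `[EMAIL]` and `[USER]` placeholder tokens, and the regular expressions are assumptions chosen for illustration. A real pipeline would also need near-duplicate detection, fuller PII handling, and the other steps listed in the prompt.

```python
import hashlib
import re
import unicodedata

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
WS_RE = re.compile(r"\s+")

# Mapping taken from the abbreviation list in the prompt.
ABBREVIATIONS = {
    "vc": "você",
    "tb": "também",
    "pq": "porque",
    "q": "que",
    "cmg": "comigo",
    "blz": "beleza",
}
ABBR_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b", re.IGNORECASE
)


def clean_text(text: str) -> str:
    """Normalize Unicode, mask emails, URLs, and mentions, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = EMAIL_RE.sub("[EMAIL]", text)  # before mentions, so "@domain" is not split off
    text = URL_RE.sub("[URL]", text)
    text = MENTION_RE.sub("[USER]", text)
    return WS_RE.sub(" ", text).strip()


def expand_abbreviations(text: str) -> str:
    """Replace whole-word abbreviations; the replacement is always lowercase."""
    return ABBR_RE.sub(lambda m: ABBREVIATIONS[m.group(1).lower()], text)


def dedupe_exact(texts):
    """Drop exact duplicates (case-insensitive) by hashing, keeping first occurrences."""
    seen = set()
    unique = []
    for text in texts:
        key = hashlib.sha256(text.casefold().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

For example, `expand_abbreviations(clean_text("vc viu   isso? blz http://t.co/abc @joao"))` returns `"você viu isso? beleza [URL] [USER]"`. Whether to expand abbreviations at all depends on the task: for sentiment analysis on social media, the informal register can itself be a signal worth keeping.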
How to use this prompt
1. Replace the key placeholders first: PROJECT NAME, TEXT DATA TYPE, TASK: sentiment classification/NER/summarization/QA, NUMBER.
2. Replace any bracketed placeholders like [this] with your own context.
3. Add extra background information when you want more tailored results.
4. Combine multiple prompts in one conversation when you need a richer output.
5. Save your best-performing prompts so they are easy to reuse later.
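If you reuse the prompt often, the placeholder step can be scripted. The following is a small sketch, assuming the prompt is stored as a plain string. The helper names `fill_placeholders` and `unfilled` are hypothetical.

```python
import re

# Matches any single-line bracketed span, e.g. [PROJECT NAME] or [TASK: ...].
PLACEHOLDER_RE = re.compile(r"\[([^\[\]\n]+)\]")


def fill_placeholders(template: str, values: dict) -> str:
    """Replace [NAME] with values["NAME"]; leave unknown placeholders untouched."""
    return PLACEHOLDER_RE.sub(lambda m: values.get(m.group(1), m.group(0)), template)


def unfilled(template: str) -> list:
    """List the placeholder names still present in the text."""
    return PLACEHOLDER_RE.findall(template)
```

Note that `unfilled` will also report literal bracketed tokens in the prompt, such as `[URL]`, so review its output rather than treating every hit as a missing value.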
Related prompts
Fine-tuning LLMs with Custom Data Using LoRA and QLoRA
Complete guide to fine-tuning language models with efficient parameter adaptation techniques.
Best for
Fine-tune a large language model for a specific domain, minimizing computational costs with PEFT (Parameter-Efficient Fine-Tuning) techniques.
RAG Pipeline (Retrieval-Augmented Generation) with Embeddings and Vector Database
Complete RAG implementation covering chunking, embeddings, semantic search, and augmented generation.
Best for
Build a document-based Q&A system that reduces hallucinations and delivers answers grounded in your actual data.
Advanced Prompt Engineering with Chain-of-Thought and Function Calling
Advanced prompt engineering techniques to maximize LLM performance in production environments.
Best for
Master prompt engineering techniques that improve the quality, consistency, and reliability of LLM responses in production applications.
Complete MLOps Pipeline with Model Training, Versioning, and Deployment
MLOps infrastructure to manage the full lifecycle of ML models in production.
Best for
Implement an MLOps pipeline that automates model training, evaluation, versioning, and deployment with continuous monitoring.