NLTK Tokenization
Tokenization is usually the first step in an NLP pipeline: it breaks raw text into smaller units such as words or sentences.
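To see why dedicated tokenizers matter, compare a naive whitespace split with a simple rule-based tokenizer. This is a minimal sketch using only the standard-library `re` module (the sample sentence is made up for illustration; no NLTK download is needed):

```python
import re

text = "Tokenizers aren't just split(): punctuation matters!"

# Naive whitespace split leaves punctuation glued to the words
print(text.split())

# A simple regex tokenizer separates word characters from punctuation
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
```

NLTK's `word_tokenize` goes further still, handling contractions and abbreviations with trained models rather than a single regex.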
Mastering this concept will significantly boost your Python data science skills!
💻 Code Example:

import nltk

# Download required corpora (first-time only)
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk import pos_tag, FreqDist

text = """
Pynfinity is a powerful Python learning platform built by santoshtvk.
It offers interactive courses, coding tools, and bite-sized pebbles.
Students use Pynfinity to master Python programming efficiently.
"""

# 1. Sentence & word tokenization
sentences = sent_tokenize(text)
words = word_tokenize(text)
print(f"Sentences: {len(sentences)} | Words: {len(words)}")

# 2. Remove stopwords
stop_words = set(stopwords.words("english"))
clean_words = [w.lower() for w in words if w.isalpha() and w.lower() not in stop_words]
print("\nMeaningful words:", clean_words)

# 3. POS tagging
tagged = pos_tag(clean_words)
print("\nPOS Tags:")
for word, tag in tagged[:8]:
    print(f"  {word:<15} → {tag}")

# 4. Lemmatization (prefers linguistic correctness)
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w) for w in clean_words]
print("\nLemmatized:", lemmatized[:10])

# 5. Stemming (faster but rougher)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in clean_words]
print("Stemmed   :", stemmed[:10])

# 6. Frequency distribution
fdist = FreqDist(clean_words)
print("\nTop 5 words:", fdist.most_common(5))
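A closer look at step 5: stemming chops suffixes by rule, so it is fast but can emit tokens that are not real words, which is why step 4's lemmatizer is preferred when linguistic correctness matters. A small standalone illustration (the word list is made up; `PorterStemmer` is pure Python, so no corpus download is required):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Rule-based suffix stripping: output may not be a dictionary word
for w in ["studies", "running", "happily"]:
    print(w, "->", stemmer.stem(w))
```

"running" stems cleanly to "run", but "studies" becomes the non-word "studi"; a WordNet lemmatizer would return "study" instead.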
Keep exploring and happy coding! 💻