How to remove stop words using nltk or python

110

I have a dataset from which I would like to remove stop words using

stopwords.words('english')

I'm struggling with how to use this in my code to simply remove these words. I already have a list of the words from this dataset; the part I'm stuck on is comparing against this list and removing the stop words. Any help is appreciated.

python nltk stop-words

— Alex

4

Where are you getting the stop words from? Is it from NLTK?

— tumultous_rooster

37

@MattO'Brien from nltk.corpus import stopwords, for future googlers

— danodonovan

13

You also need to run nltk.download("stopwords") to make the stop word dictionary available.

— sffc

See also stackoverflow.com/questions/19130512/stopword-removal-with-nltk

— alvas

1

Note that a word like "not" is also treated as a stop word in nltk. If you are doing something like sentiment analysis or spam filtering, a negation can change the whole meaning of a sentence, and if you remove it during preprocessing you may not get accurate results.

— Darkov
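
One way to act on this caution is to take the negation words out of the stop word set before filtering. The sketch below uses a small hand-written stand-in list (base_stop_words and negations are assumptions for illustration) so that it runs without any downloads; with NLTK you would presumably start from set(stopwords.words('english')) instead.

```python
# Small stand-in for stopwords.words('english'); illustrative only.
base_stop_words = {"i", "do", "not", "no", "the", "this", "is", "a"}

# Negations to keep, because they can flip the meaning of a sentence.
# Which words belong here is a judgment call for your task.
negations = {"not", "no", "nor"}

# Stop words that are still safe to remove.
stop_words = base_stop_words - negations


def remove_stopwords_keep_negations(words):
    """Drop stop words but keep negations, preserving the original order."""
    return [w for w in words if w.lower() not in stop_words]


tokens = "I do not like this movie".split()
filtered = remove_stopwords_keep_negations(tokens)
print(filtered)  # ['not', 'like', 'movie']
```

Whether to keep negations depends on the downstream task; for plain keyword extraction, removing them may be fine.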

206

from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]

— Daren Thomas

Thanks for both answers, they both work, although it seems I have a flaw in my code that prevents the stop list from working correctly. Should this be a new question post? I'm not sure yet how everything works around here!

— Alex

51

To improve performance, consider stops = set(stopwords.words("english")) instead.

— isakkarlsson
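
The set suggestion can be sketched as follows. The list comprehension in the answer calls stopwords.words('english') and scans the resulting list for every word; building a set once makes each membership test a hash lookup. The helper name filter_stopwords and the sample list are assumptions for illustration, so the snippet runs without NLTK installed.

```python
def filter_stopwords(word_list, stop_words):
    """Return the words not in stop_words, keeping order and duplicates."""
    stops = set(stop_words)  # built once; lookups are O(1) on average
    return [word for word in word_list if word not in stops]


# Illustrative stand-in; with NLTK, pass stopwords.words('english') instead.
sample_stop_words = ["the", "a", "is", "on"]
word_list = ["the", "cat", "is", "on", "the", "mat"]

print(filter_stopwords(word_list, sample_stop_words))  # ['cat', 'mat']
```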

1

>>> import nltk
>>> nltk.download()

2

The words in stopwords.words('english') are lowercase. So make sure you only use lowercase words in your list, e.g. [w.lower() for w in word_list]

— AlexG
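
If you want to keep the original casing of the words that survive, a variation is to lowercase only for the comparison. This is a minimal sketch with a hand-written stand-in list (sample_stop_words and the function name are assumptions, not part of NLTK).

```python
# Illustrative stand-in for set(stopwords.words('english')).
sample_stop_words = {"the", "is", "a"}


def remove_stopwords_case_insensitive(word_list, stop_words):
    """Compare in lowercase, but return the words with their original casing."""
    return [w for w in word_list if w.lower() not in stop_words]


words = ["The", "Sky", "IS", "blue"]
print(remove_stopwords_case_insensitive(words, sample_stop_words))
# ['Sky', 'blue']
```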

19

You can also do a set diff, for example:

list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))

— David Lemphers

16

Note: this converts the sentence to a set, which removes all duplicate words, so you will not be able to use frequency counting on the result.

— David Dehghan

1

Converting to a set can remove meaningful information from the sentence by scrubbing multiple occurrences of an important word.

— Ujjwal
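
To make the difference concrete, here is a small comparison of the set difference with an order-preserving filter. The word list and stop words are made up for illustration; only the standard library is used.

```python
from collections import Counter

# Illustrative stand-in for set(stopwords.words('english')).
sample_stop_words = {"the", "on", "and"}
words = ["the", "cat", "sat", "on", "the", "mat", "and", "the", "cat", "slept"]

# Set difference: duplicates and word order are lost.
unique_only = set(words) - sample_stop_words

# List comprehension: duplicates and order survive, so counting still works.
kept = [w for w in words if w not in sample_stop_words]
counts = Counter(kept)

print(sorted(unique_only))  # ['cat', 'mat', 'sat', 'slept']
print(kept)                 # ['cat', 'sat', 'mat', 'cat', 'slept']
print(counts["cat"])        # 2
```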

14

I suppose you have a list of words (word_list) from which you want to remove stop words. You could do something like this:

filtered_word_list = word_list[:] #make a copy of the word_list
for word in word_list: # iterate over word_list
  if word in stopwords.words('english'): 
    filtered_word_list.remove(word) # remove word from filtered_word_list if it is a stopword

— das_weezul

5

This will be much slower than a list comprehension.

— Darena

12

To exclude all kinds of stop words, including the nltk stop words, you could do something like this:

from stop_words import get_stop_words
from nltk.corpus import stopwords

stop_words = list(get_stop_words('en'))         #About 900 stopwords
nltk_words = list(stopwords.words('english')) #About 150 stopwords
stop_words.extend(nltk_words)

output = [w for w in word_list if not w in stop_words]

— sumitjainjr

I get len(get_stop_words('en')) == 174 vs len(stopwords.words('english')) == 179

— rubencart
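
When two stop word lists overlap heavily, extending one list with the other stores the shared words twice. A set union keeps each word once and also gives fast lookups. The two short lists below are made-up stand-ins for get_stop_words('en') and stopwords.words('english').

```python
# Hypothetical stand-ins for the two library lists.
list_a = ["the", "a", "of", "and"]
list_b = ["the", "and", "or", "but"]

# list.extend (or +) keeps the overlapping words twice.
extended = list_a + list_b

# A set union stores each stop word once.
combined = set(list_a) | set(list_b)

print(len(extended))  # 8
print(len(combined))  # 6

word_list = ["cats", "and", "dogs", "or", "birds"]
output = [w for w in word_list if w not in combined]
print(output)  # ['cats', 'dogs', 'birds']
```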

6

For this there is a very simple, lightweight Python package called stop-words.

First install the package using: pip install stop-words

Then you can remove your words in one line using a list comprehension:

from stop_words import get_stop_words

filtered_words = [word for word in dataset if word not in get_stop_words('english')]

This package is very lightweight to download (unlike nltk), works for both Python 2 and Python 3, and has stop words for many other languages, such as:

    Arabic
    Bulgarian
    Catalan
    Czech
    Danish
    Dutch
    English
    Finnish
    French
    German
    Hungarian
    Indonesian
    Italian
    Norwegian
    Polish
    Portuguese
    Romanian
    Russian
    Spanish
    Swedish
    Turkish
    Ukrainian

— user_3pij

3

Use the textcleaner library to remove stop words from your data.

Follow this link: https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds

Follow the steps below to do this with this library.

pip install textcleaner

After installing:

import textcleaner as tc
data = tc.document(<file_name>) 
#you can also pass list of sentences to the document class constructor.
data.remove_stpwrds() #inplace is set to False by default

Use the code above to remove the stop words.

— Yugant Hadiyal

1

You can use this function. Note that you need to lowercase all the words.

from nltk.corpus import stopwords

def remove_stopwords(word_list):
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lowercase
        if word not in stopwords.words("english"):
            processed_word_list.append(word)
    return processed_word_list

— Mohammed_Ashour

1

Using filter:

from nltk.corpus import stopwords
# ...  
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))

— Saeid BK

3

If word_list is large, this code is very slow. It is better to convert the stop word list to a set before using it: .. in set(stopwords.words('english')).

— Robert

1

Here is my approach, in case you want the answer as a string right away (instead of a list of filtered words):

STOPWORDS = set(stopwords.words('english'))
text =  ' '.join([word for word in text.split() if word not in STOPWORDS]) # delete stopwords from text

— justadev

Don't use this approach for French, or some stop words will not be caught.

— David Beauchemin
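
One way whitespace splitting can miss stop words in French is elision: in "l'homme" the article stays attached to the noun, so it never matches a stop word entry. A rough sketch of the difference, using a regular expression as a stand-in tokenizer and a tiny hand-written stop list (both are assumptions for illustration, not the NLTK French list or tokenizer):

```python
import re

# Tiny illustrative stop list; a real French list would be much longer.
sample_french_stop_words = {"l", "la", "le", "est", "dans"}

text = "l'homme est dans la maison"

# Whitespace split: "l'homme" stays one token, so "l" is never matched.
split_filtered = [w for w in text.split()
                  if w not in sample_french_stop_words]

# Regex tokenization: \w+ breaks at the apostrophe, so "l" becomes its own token.
regex_tokens = re.findall(r"\w+", text.lower())
regex_filtered = [w for w in regex_tokens
                  if w not in sample_french_stop_words]

print(split_filtered)  # ["l'homme", 'maison']
print(regex_filtered)  # ['homme', 'maison']
```

A proper tokenizer for the language is likely a better choice than either of these in real use; the regex is only there to show why the split version misses the elided article.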

0

If your data is stored as a Pandas DataFrame, you can use remove_stopwords from texthero, which uses the NLTK stop word list by default.

import pandas as pd
import texthero as hero
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])

— Jonathan Besomi

0

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

# Option 1: list comprehension
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# Option 2: the same result with an explicit loop
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

— HM

-3

# 1) Build a new list that skips the stop words
print("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]   # renamed so it does not shadow the built-in list
another_list = []
for x in userstring:
    if x not in stop_list:             # keep only the words that are not stop words
        another_list.append(x)
for x in another_list:
    print(x, end=' ')

# 2) If you prefer to use .remove
print("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]
for x in userstring[:]:                # iterate over a copy; removing while iterating skips words
    if x in stop_list:
        userstring.remove(x)
for x in userstring:
    print(x, end=' ')

— Muhammad Yusuf

It is better to use stopwords.words("english") than to spell out all the words you want to remove.

— Led