Tokenise
import pandas as pd
import re

Sms_content = ['hi,how are you', 'Iam fine', 'what is it?']
# Pass the column names as a list; a set is unordered and newer pandas versions reject it
df = pd.DataFrame(Sms_content, columns=['sms'])
df

| | sms |
|---|---|
| 0 | hi,how are you |
| 1 | Iam fine |
| 2 | what is it? |
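def tokenize(text):
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning)
    tokens = re.split(r'\W+', text)
    return tokens

# Lower-case each message, then tokenize it
df['tokenized_text'] = df['sms'].apply(lambda row: tokenize(row.lower()))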
df.head()

| | sms | tokenized_text |
|---|---|---|
| 0 | hi,how are you | [hi, how, are, you] |
| 1 | Iam fine | [iam, fine] |
| 2 | what is it? | [what, is, it, ] |
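
Row 2 ends with an empty token because `re.split` returns an empty string whenever the text starts or ends with a separator, here the trailing `?`. One possible way to avoid this, shown as a minimal sketch, is to collect the word runs with `re.findall` instead of splitting on the gaps. The name `tokenize_no_empty` is hypothetical and not part of the code above, and the sketch uses a plain list rather than the DataFrame so that it runs on its own.

```python
import re


def tokenize_no_empty(text):
    # Hypothetical variant of tokenize(): collect runs of word characters
    # instead of splitting on the gaps, so no empty strings are produced.
    return re.findall(r'\w+', text.lower())


sms_content = ['hi,how are you', 'Iam fine', 'what is it?']
tokenized = [tokenize_no_empty(sms) for sms in sms_content]
print(tokenized)
```

The same function could be passed straight to `df['sms'].apply(...)` in place of the lambda, since it already lower-cases its input.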