Tokenise
import pandas as pd
import re

# Three short SMS messages used as the working example
sms_content = ['hi,how are you', 'I am fine', 'what is it?']
df = pd.DataFrame(sms_content, columns=['sms'])
df
|   | sms            |
|---|----------------|
| 0 | hi,how are you |
| 1 | I am fine      |
| 2 | what is it?    |
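As a side note, the same one-column frame can also be built by passing a dict, so the column name travels with the data instead of being supplied separately. This is just an equivalent sketch, not a step the rest of the walkthrough depends on.

# Equivalent construction: the dict key becomes the column name
df = pd.DataFrame({'sms': sms_content})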
def tokenize(text):
    # Split on runs of non-word characters (punctuation and whitespace)
    tokens = re.split(r'\W+', text)
    return tokens

df['tokenized_text'] = df['sms'].apply(lambda row: tokenize(row.lower()))
df.head()
|   | sms            | tokenized_text      |
|---|----------------|---------------------|
| 0 | hi,how are you | [hi, how, are, you] |
| 1 | I am fine      | [i, am, fine]       |
| 2 | what is it?    | [what, is, it, ]    |
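The empty string at the end of the last row comes from re.split: when the text ends with a non-word character such as ?, the split leaves an empty token after it. If that is unwanted, a findall-based variant can be used instead; this is a small sketch rather than part of the original steps, and tokenize_words is a name introduced here for illustration.

def tokenize_words(text):
    # Collect runs of word characters; punctuation only separates tokens
    return re.findall(r'\w+', text)

df['tokenized_text'] = df['sms'].apply(lambda row: tokenize_words(row.lower()))
# The last row now tokenizes to [what, is, it] with no trailing empty string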