
How to split a string into a list?

How do I split a sentence and store each word in a list?

"these are words"  ⟶  ["these", "are", "words"]
As it is, you will be printing the full list of words for each word in the list. I think you meant to use print(word) as your last line.

Mateen Ulhaq

Given a string sentence, this stores each word in a list called words:

words = sentence.split()
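For example, with the sentence from the question:

>>> sentence = "these are words"
>>> sentence.split()
['these', 'are', 'words']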

Mateen Ulhaq

To split the string text on any consecutive runs of whitespace:

words = text.split()      

To split the string text on a custom delimiter such as ",":

words = text.split(",")   

The words variable will be a list and contain the words from text split on the delimiter.
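One behavioural difference worth noting (illustrated with throwaway strings): with no argument, runs of whitespace collapse into a single separator, whereas an explicit delimiter produces a field for every occurrence, so adjacent delimiters yield empty strings:

>>> "one  two   three".split()
['one', 'two', 'three']
>>> "one,,two,".split(",")
['one', '', 'two', '']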


Mateen Ulhaq

Use str.split():

Return a list of the words in the string, using sep as the delimiter ... If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.

>>> line = "a sentence with a few words"
>>> line.split()
['a', 'sentence', 'with', 'a', 'few', 'words']
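A quick demonstration of that excerpt, using a throwaway string with extra whitespace, comparing the default split with an explicit single-space separator:

>>> "  a  sentence  ".split()
['a', 'sentence']
>>> "  a  sentence  ".split(' ')
['', '', 'a', '', 'sentence', '', '']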

@warvariuc - should have linked to docs.python.org/2/library/stdtypes.html#str.split
How about splitting the word "sentence" into "s", "e", "n", "t", ...?
Community

Depending on what you plan to do with your sentence-as-a-list, you may want to look at the Natural Language Toolkit (NLTK). It deals heavily with text processing and evaluation. You can also use it to solve your problem:

import nltk
words = nltk.word_tokenize(raw_sentence)

This has the added benefit of splitting out punctuation.
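Note that word_tokenize relies on NLTK's tokenizer data; if it raises a LookupError the first time you call it, a one-time download of the punkt model usually fixes it:

import nltk
nltk.download('punkt')  # fetch the tokenizer model (one-time, needs network access)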

Example:

>>> import nltk
>>> s = "The fox's foot grazed the sleeping dog, waking it."
>>> words = nltk.word_tokenize(s)
>>> words
['The', 'fox', "'s", 'foot', 'grazed', 'the', 'sleeping', 'dog', ',', 
'waking', 'it', '.']

This allows you to filter out any punctuation you don't want and use only words.

Please note that the other solutions using str.split() are better if you don't plan on doing any complex manipulation of the sentence.



split() relies on whitespace as the separator, so it will fail to separate hyphenated words, and long-dash-separated phrases will fail to split too. And if the sentence contains any punctuation without surrounding spaces, that punctuation will stick to the adjacent words. For any real-world text parsing (like for this comment), your nltk suggestion is much better than split().
Potentially useful, although I wouldn't characterise this as splitting into "words". By any plain English definition, ',' and "'s" are not words. Normally, if you wanted to split the sentence above into "words" in a punctuation-aware way, you'd want to strip out the comma and get "fox's" as a single word.
Python 2.7+ as of April 2016.
Colonel Panic

How about this algorithm? Split text on whitespace, then trim punctuation. This carefully removes punctuation from the edge of words, without harming apostrophes inside words such as we're.

>>> text
"'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'"

>>> text.split()
["'Oh,", 'you', "can't", 'help', "that,'", 'said', 'the', 'Cat:', "'we're", 'all', 'mad', 'here.', "I'm", 'mad.', "You're", "mad.'"]

>>> import string
>>> [word.strip(string.punctuation) for word in text.split()]
['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here', "I'm", 'mad', "You're", 'mad']
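Wrapped up as a small helper (a sketch of the same idea; strip_punctuation is just an illustrative name, and the final filter is there because a token that is nothing but punctuation, e.g. a stray "--", strips down to an empty string):

import string

def strip_punctuation(text):
    # trim punctuation from the edges of each whitespace-separated token
    stripped = (word.strip(string.punctuation) for word in text.split())
    return [word for word in stripped if word]  # drop punctuation-only tokens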

Nice, but some English words truly contain trailing punctuation. For example, the trailing dots in e.g. and Mrs., and the trailing apostrophe in the possessive frogs' (as in frogs' legs) are part of the word, but will be stripped by this algorithm. Handling abbreviations correctly can be roughly achieved by detecting dot-separated initialisms plus using a dictionary of special cases (like Mr., Mrs.). Distinguishing possessive apostrophes from single quotes is dramatically harder, since it requires parsing the grammar of the sentence in which the word is contained.
@MarkAmery You're right. It's also since occurred to me that some punctuation marks—such as the em dash—can separate words without spaces.
dbr

I want my python function to split a sentence (input) and store each word in a list

The str.split() method does this: it takes a string and splits it into a list:

>>> the_string = "this is a sentence"
>>> words = the_string.split(" ")
>>> print(words)
['this', 'is', 'a', 'sentence']
>>> type(words)
<type 'list'> # or <class 'list'> in Python 3

The problem you're having is due to a typo: you wrote print(words) instead of print(word).

Renaming the word variable to current_word, this is what you had:

def split_line(text):
    words = text.split()
    for current_word in words:
        print(words)

...when you should have done:

def split_line(text):
    words = text.split()
    for current_word in words:
        print(current_word)

If for some reason you want to manually construct a list inside the for loop, use the list append() method, for example if you want to lower-case all the words:

my_list = [] # make empty list
for current_word in words:
    my_list.append(current_word.lower())

Or, a bit more neatly, using a list comprehension:

my_list = [current_word.lower() for current_word in words]
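For instance (just to show the result of the comprehension):

>>> words = "THIS Is A Sentence".split()
>>> [current_word.lower() for current_word in words]
['this', 'is', 'a', 'sentence']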

BlackBeard

If you want all the chars of a word/sentence in a list, do this:

print(list("word"))
#  ['w', 'o', 'r', 'd']


print(list("some sentence"))
#  ['s', 'o', 'm', 'e', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e']

Vladimir Obrizan

shlex has a .split() function. It differs from str.split() in that it does not preserve quotes and treats a quoted phrase as a single word:

>>> import shlex
>>> shlex.split("sudo echo 'foo && bar'")
['sudo', 'echo', 'foo && bar']

NB: it works well for Unix-like command line strings. It doesn't work for natural-language processing.


Use with caution, especially for NLP. It will crash on strings containing an unmatched single quote, like "It's good.", with ValueError: No closing quotation.
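If you do feed it arbitrary text, one defensive option (a sketch, not part of the answer above; tolerant_split is an illustrative name) is to catch that error and fall back to a plain whitespace split:

import shlex

def tolerant_split(s):
    try:
        return shlex.split(s)
    except ValueError:  # e.g. "No closing quotation" on an unbalanced quote
        return s.split()

# tolerant_split("It's good.")  ->  ["It's", 'good.']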
TankorSmash

I think you are confused because of a typo.

Replace print(words) with print(word) inside your loop to have every word printed on its own line.


thrinadhn

Split into words without harming apostrophes inside words (such as Moore's in "Moore's law"). See input_1 and input_2 below:

import re

def split_into_words(line):
    # \w[\w']*\w matches a word that may contain internal apostrophes
    # (e.g. "Moore's"); the |\w alternative catches single-letter words
    word_regex_improved = r"(\w[\w']*\w|\w)"
    word_matcher = re.compile(word_regex_improved)
    return word_matcher.findall(line)

#Example 1

input_1 = "computational power (see Moore's law) and "
split_into_words(input_1)

# output 
['computational', 'power', 'see', "Moore's", 'law', 'and']

#Example 2

input_2 = """Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad."""

split_into_words(input_2)
#output
['Oh',
 'you',
 "can't",
 'help',
 'that',
 'said',
 'the',
 'Cat',
 "we're",
 'all',
 'mad',
 'here',
 "I'm",
 'mad',
 "You're",
 'mad']