Miklas Njor

Mapping key/value in semi-structured data in Python


I had to merge several wordclouds into one larger wordcloud built from all of their keywords and counts. I only had access to the text files containing each keyword and its count, not a list with every occurrence of every keyword.

To do a re-count and build the new list, I needed a list in which each keyword appears exactly as many times as its count says.

Luckily the structure was the same for all pairs, and since each pair held only one token (the keyword), it could be done in one loop.

An alternative way to solve this problem would be to build a dict of tokens and add up the count each time a matching token is found.
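As a rough sketch of that dict-based alternative (not the approach I ended up using): it assumes the string alternates strictly between keyword and count, and the function name wordcloud_string_to_counts is made up for this illustration.

```python
def wordcloud_string_to_counts(wordcloud_string):
    """Sum the counts per keyword in a "keyword count keyword count ..." string.

    Assumes strict keyword/count alternation; pairs whose second token
    is not a plain decimal number are skipped.
    """
    tokens = wordcloud_string.split()  # split on any run of whitespace
    totals = {}
    # Step through the tokens two at a time: (keyword, count), (keyword, count), ...
    for keyword, count in zip(tokens[0::2], tokens[1::2]):
        if count.isdecimal():
            # Start at 0 for a new keyword, otherwise add to the running total.
            totals[keyword] = totals.get(keyword, 0) + int(count)
    return totals


sample = "apples 12 oranges 10 bananas 10 bananas 8 bananas 50 chairs 4 boats 3 orange 1 apple 1 banana 1"
totals = wordcloud_string_to_counts(sample)
print(totals)
```

This skips building the expanded list and goes straight to the totals, so the three "bananas" entries end up as a single key. It only helps if whatever draws the new wordcloud accepts keyword/count pairs rather than a flat list of words.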

You can see the code below.

__author__ = 'Miklas Njor - iAmGoldenboy - https://miklasnjor.com'

""" Converting a word-cloud laid out as an unstructured list of keywords with associated counts,
into a list including keywords x counts, also taking into account duplicate entries """

from collections import Counter

# a string in the form of "keyword count keyword count..."
# note the duplicate entries.
wordcloud_keywords_n_count = "apples 12 oranges 10 bananas 10 bananas 8 bananas 50 chairs 4 boats 3 orange 1 apple 1 banana 1"

def wordcloudStringToList(wordcloudString):
    """ Converting a word-cloud laid out as an unstructured list of keywords with associated counts,
    into a list including keywords x counts, also taking into account duplicate entries
    :param wordcloudString: A string in the form of "keyword count keyword count"...
    :return: A list in which each keyword is repeated as many times as its count.
    """

    full_list = []  # a list to collect all the keywords.

    split_wordcloud = wordcloudString.split()  # split on any run of whitespace.

    # pair each token with the token that follows it; zip stops at the
    # last token, so no IndexError handling is needed.
    for keyword, keyword_count in zip(split_wordcloud, split_wordcloud[1:]):

        # isdecimal() only accepts strings that int() can convert,
        # unlike isnumeric(), which also accepts characters such as "½".
        if keyword_count.isdecimal():  # when the next token is a number...

            # ... add the keyword keyword_count times to the list.
            full_list.extend([keyword] * int(keyword_count))

    return full_list


print(wordcloudStringToList(wordcloud_keywords_n_count))

# sanity check
print(Counter(wordcloudStringToList(wordcloud_keywords_n_count)))
