Some friends and colleagues of mine formed a Big Data Bootcamp to share knowledge and learn together about current and emerging technologies in big data and analytics. They asked me to present two sessions on analytics for text and unstructured data. This episode of the podcast is based on the first of those Big Data Bootcamp sessions, in which I presented a mental model for text analytics and unstructured data, followed by some examples to illustrate the concepts.
If you haven’t already done so, be sure to subscribe to email updates or to the RSS or iTunes podcast feed so you don’t miss the next installment, when I get into a hands-on example of how to do analytics for text and unstructured data.
This podcast is sponsored by Northwood Advisors, experts in improving performance with data-driven decision-making.
These show notes provide an outline to follow the audio content, as well as some links and enhanced content.
A Mental Model for Text Analytics and Unstructured Data
Factors to Consider for Unstructured Data
1. Degree of Inherent Structure
A. Indeterminate Structure
Indeterminate structure is the least structured of all. Think of the SETI (Search for Extraterrestrial Intelligence) project searching for any non-terrestrial, non-random radio waves. To analyze data with indeterminate structure, you have to look for patterns and then identify and/or count them.
Example: Trending topics on Twitter.
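As a rough illustration of "look for patterns, then identify and count them," here is a minimal Python sketch that treats hashtags as the pattern and counts them across a handful of posts. The sample posts and the `count_hashtags` helper are made up for this example; this is not how Twitter computes trending topics.

```python
import re
from collections import Counter

# Hypothetical sample posts; real input would come from a feed or an export.
posts = [
    "Loving the new #bigdata bootcamp! #analytics",
    "Text mining is hard #BigData",
    "Coffee first, then #analytics",
    "No tags here",
]

# The pattern we look for: a '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def count_hashtags(texts):
    """Count hashtags across all texts, ignoring letter case."""
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts


# The two most frequent tags, as (tag, count) pairs.
trending = count_hashtags(posts).most_common(2)
print(trending)
```

The same shape works for any pattern you can describe: swap the regular expression and keep the counting step.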
B. Unstructured Content with Metadata
Much of what we handle as unstructured data is really indeterminate content wrapped in a metadata envelope: an email, a tweet, etc.
Example: A Twitter post has:
- Indeterminate content: the text of the tweet
- Metadata: User account, Date/Time, Client app/device, IP Address, Location (maybe), etc.
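One way to picture the envelope is as a record with typed metadata fields and a single free-text field. The sketch below is an assumption-laden illustration: the `Post` class, its field names, and the sample records are invented here and do not reflect Twitter's actual data format.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Post:
    # Metadata envelope: structured fields we can filter and group on.
    user: str
    created_at: datetime
    client: str
    location: Optional[str]  # "maybe": often missing
    # Indeterminate content: free text with no known structure.
    text: str


# Made-up sample records for illustration only.
posts = [
    Post("alice", datetime(2013, 5, 1, 9, 30, tzinfo=timezone.utc),
         "web", "Chicago", "Big data bootcamp tonight!"),
    Post("bob", datetime(2013, 5, 1, 10, 0, tzinfo=timezone.utc),
         "mobile", None, "Anyone tried text mining?"),
    Post("alice", datetime(2013, 5, 2, 8, 15, tzinfo=timezone.utc),
         "mobile", None, "Great session yesterday."),
]


def posts_per_user(items):
    """Group on a metadata field without reading the text at all."""
    return Counter(p.user for p in items)


def with_location(items):
    """Keep only the posts whose optional location is present."""
    return [p for p in items if p.location is not None]


print(posts_per_user(posts))
print(len(with_location(posts)))
```

Notice that both questions are answered entirely from the envelope; the text itself stays untouched until you need it.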
C. Targeted Content within Indeterminate Structure
Typically when we deal with indeterminate structure, we are looking for targeted content. In these cases, the known structure is defined outside the content.
Example: The Acme Widget Corp is monitoring Twitter looking for tweets that contain:
- Any reference to Acme Widget Corp (in various forms)
- Tags (@ or #) that are relevant
- Positive or negative sentiment
- Actionable feedback
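The first three targets can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the name variants in the regular expression, the tag list, and the tiny word lists. Counting positive and negative words is a crude stand-in for real sentiment analysis, and spotting actionable feedback is left out because it usually needs more than keyword matching.

```python
import re

# The known structure lives outside the content: these targets are ours.
# All of them are hypothetical values chosen for this example.
COMPANY = re.compile(r"\bacme(\s+widgets?)?(\s+corp(oration)?)?\b", re.IGNORECASE)
TAGS = re.compile(r"[@#]\w+")
RELEVANT_TAGS = {"@acmewidgets", "#acme"}
POSITIVE = {"love", "great", "fast"}
NEGATIVE = {"broken", "slow", "refund"}


def scan(text):
    """Check one piece of free text against the externally defined targets."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    tags = {t.lower() for t in TAGS.findall(text)}
    # Naive score: positive words minus negative words.
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        sentiment = "positive"
    elif score < 0:
        sentiment = "negative"
    else:
        sentiment = "neutral"
    return {
        "mentions_company": bool(COMPANY.search(text)),
        "relevant_tags": sorted(tags & RELEVANT_TAGS),
        "sentiment": sentiment,
    }


print(scan("My Acme Widget arrived broken, I want a refund #acme"))
print(scan("Lunch was great today"))
```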
D. Semi-structured Content
- Structure may vary
- Mix of structured and unstructured content
- Survey data
- XML files or other markup languages that provide identifiable structure
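Survey data in XML shows the mix well: a rating is structured, a comment is free text, and some responses omit the comment entirely. The following sketch uses Python's standard `xml.etree.ElementTree`; the element names and the sample document are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical survey export: the structure varies from response to response.
SURVEY_XML = """
<responses>
  <response id="1"><rating>4</rating><comment>Quick delivery, fair price.</comment></response>
  <response id="2"><rating>2</rating></response>
</responses>
"""


def parse_responses(xml_text):
    """Pull the structured rating and the optional free-text comment."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall("response"):
        rows.append({
            "id": node.get("id"),
            "rating": int(node.findtext("rating")),
            # The comment may be absent, so fall back to an empty string.
            "comment": node.findtext("comment", default=""),
        })
    return rows


rows = parse_responses(SURVEY_XML)
print(rows)
```

The markup gives you the identifiable structure for free; the comment field is still indeterminate content that needs its own analysis.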
2. Processing Considerations
- Data volume
- Data sources – number, type, variety, consistency, etc.
- Static vs. Dynamic
- Latency requirements
3. Signal vs. Noise
We have to think about signal vs. noise at both ends of the pipe: on the way in, when we process the data, and on the way out, when we present the results.
How we handle signal/noise is tied to the degree of inherent structure and the processing method.
What questions are we trying to answer? What’s the minimum amount of ink/pixels that will answer those questions?
The Visual Display of Quantitative Information by Edward R. Tufte
The methods of visualizing unstructured data fall into a few basic categories:
- Word clouds
- Points on a map
- Graphing based on dimensionality
- Sentiment analysis
- Dynamic visualizations