20: Unstructured Data & Text Analytics

Some friends and colleagues of mine formed a Big Data Bootcamp to share knowledge and learn together about some of the current and emerging technologies in big data and analytics. They asked me to present two sessions on analytics for text and unstructured data. This episode of the podcast is based on the first of those Big Data Bootcamp sessions, in which I talked about a mental model for text analytics and unstructured data and then presented some examples to illustrate the concepts.

If you haven’t already done so, be sure to subscribe for email updates or to either the RSS or iTunes podcast feeds so you don’t miss the next installment when I get into a hands-on example of how to do analytics for text and unstructured data.
This podcast is sponsored by Northwood Advisors, experts in improving performance with data-driven decision-making.

These show notes provide an outline to follow the audio content, as well as some links and enhanced content.

A Mental Model for Text Analytics and Unstructured Data

Factors to Consider for Unstructured Data

1. Degree of Inherent Structure

A. Indeterminate Structure

Indeterminate structure is the least structured of all – think of the SETI (Search for Extraterrestrial Intelligence) project searching for any non-terrestrial, non-random radio waves. To analyze indeterminate structure, you have to look for patterns and then identify and/or count them.

Example: Trending topics on Twitter.
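As a rough illustration of "look for patterns, then count them," here is a minimal Python sketch that counts hashtags across a handful of posts. The sample posts, the hashtag regex, and the `trending` function are illustrative assumptions, not how Twitter computes trending topics.

```python
import re
from collections import Counter

# Hypothetical sample posts; real input would come from a feed or an export.
posts = [
    "Loving the new #bigdata bootcamp!",
    "Is #BigData just hype? #analytics",
    "Text #analytics session tonight",
]

# The pattern we look for: a '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def trending(texts, top_n=2):
    """Count hashtags case-insensitively and return the most frequent ones."""
    counts = Counter(
        tag.lower() for text in texts for tag in HASHTAG.findall(text)
    )
    return counts.most_common(top_n)


print(trending(posts))
```

The same identify-and-count idea applies to any pattern you can describe, whether words, phrases, or links.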

B. Unstructured Content with Metadata

Much of what we handle as unstructured data really has a metadata envelope around indeterminate structure: an email, a Tweet, etc.

Example: A Twitter post has:

  • Indeterminate content: the text of the tweet
  • Metadata: User account, Date/Time, Client app/device, IP Address, Location (maybe), etc.
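One way to picture the metadata envelope is as a record whose structured fields wrap a free-text body. The sketch below is only an illustration: the `Post` class and its field names are hypothetical and do not mirror any real Twitter API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Post:
    # Metadata envelope: structured fields we can filter and group on directly.
    user: str
    posted_at: datetime
    client: str
    location: Optional[str]  # location is only sometimes available
    # Indeterminate content: free text that needs its own analysis.
    text: str


# Hypothetical sample data for illustration only.
sample_posts = [
    Post("alice", datetime(2013, 5, 1, 9, 0, tzinfo=timezone.utc), "web", None, "Morning all"),
    Post("bob", datetime(2013, 5, 1, 9, 5, tzinfo=timezone.utc), "mobile", "Irvine, CA", "Stuck in traffic"),
    Post("carol", datetime(2013, 5, 1, 9, 7, tzinfo=timezone.utc), "mobile", None, "Coffee time"),
]


def posts_by_client(posts):
    """Group on a metadata field without ever parsing the text."""
    counts = {}
    for post in posts:
        counts[post.client] = counts.get(post.client, 0) + 1
    return counts


print(posts_by_client(sample_posts))
```

The metadata can be queried like any ordinary table, while the `text` field still needs the pattern-finding techniques described earlier.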

C. Targeted Content within Indeterminate Structure

Typically when we deal with indeterminate structure, we are looking for targeted content. In these cases, the known structure is defined outside the content.

Example: The Acme Widget Corp is monitoring Twitter looking for tweets that contain:

  • Any reference to Acme Widget Corp (in various forms)
  • Tags (@ or #) that are relevant
  • Positive or negative sentiment
  • Actionable feedback
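A minimal sketch of this kind of monitoring follows, with the "known structure defined outside the content" expressed as a regex and a few word lists. Everything here is a simplifying assumption: the brand pattern, the tag names, and the tiny sentiment vocabularies are made up, and real sentiment analysis is far more involved. Detecting actionable feedback is left out entirely.

```python
import re

# The targets live outside the content. All of these are hypothetical examples.
BRAND = re.compile(r"\bacme(\s+widgets?)?(\s+corp)?\b", re.IGNORECASE)
RELEVANT_TAGS = {"@acmewidget", "#acme"}
POSITIVE_WORDS = {"love", "great"}
NEGATIVE_WORDS = {"broken", "hate"}


def scan(text):
    """Check one post for brand references, relevant tags, and naive sentiment."""
    # Tokenize into lowercase words, keeping a leading @ or # on tags.
    words = set(re.findall(r"[@#]?\w+", text.lower()))
    return {
        "mentions_brand": bool(BRAND.search(text)),
        "tags": sorted(words & RELEVANT_TAGS),
        # Positive minus negative keyword hits: a crude score, not real sentiment.
        "sentiment": len(words & POSITIVE_WORDS) - len(words & NEGATIVE_WORDS),
    }


print(scan("I love my Acme Widget! #acme"))
print(scan("My widget arrived broken"))
```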

D. Semi-structured Content

  • Structure may vary
  • Mix of structured and unstructured content

Examples:

  • Survey data
  • XML files or other markup languages that provide identifiable structure
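To make the "mix of structured and unstructured content" concrete, here is a small sketch that parses a made-up survey export in XML using Python's standard `xml.etree.ElementTree` module. The element names (`response`, `rating`, `comment`) are assumptions for illustration. Note that the second response omits the comment, which is the "structure may vary" part.

```python
import xml.etree.ElementTree as ET

# Hypothetical survey export: a structured rating plus an optional free-text comment.
SURVEY_XML = """
<responses>
  <response id="1">
    <rating>4</rating>
    <comment>Setup was easy, but the manual is confusing.</comment>
  </response>
  <response id="2">
    <rating>2</rating>
  </response>
</responses>
"""


def parse_responses(xml_text):
    """Pull the structured fields out and keep the free text for later analysis."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall("response"):
        rows.append({
            "id": node.get("id"),
            "rating": int(node.findtext("rating")),
            # The comment may be missing, so fall back to an empty string.
            "comment": node.findtext("comment", default=""),
        })
    return rows


for row in parse_responses(SURVEY_XML):
    print(row)
```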

2. Processing Considerations

  • Data volume
  • Data sources – number, type, variety, consistency, etc.
  • Static vs. Dynamic
  • Latency requirements

3. Signal vs. Noise

We have to think about signal vs noise at both ends of the pipe:


How we handle signal/noise is tied to the degree of inherent structure and the processing method.


What questions are we trying to answer? What’s the minimum amount of ink/pixels that will answer those questions?

Recommended Reading

The Visual Display of Quantitative Information
by Edward R. Tufte

Basic Approaches

The methods of visualizing unstructured data fall into a few basic categories:

1. Density

  • Word clouds
  • Points on a map

2. Classification

  • Graphing based on dimensionality
  • Sentiment analysis

3. Association

  • Networks/Connections
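Before a network can be drawn, the connections have to be extracted from the text. The sketch below builds a weighted edge list from @-mentions. The sample posts and the `mention_edges` helper are hypothetical, and a real pipeline would hand the resulting edges to a graph or visualization tool.

```python
import re
from collections import Counter

# Hypothetical (author, text) pairs for illustration.
sample_mentions = [
    ("alice", "Great talk by @bob and @carol"),
    ("bob", "Thanks @alice!"),
    ("alice", "@bob see you next week"),
]

MENTION = re.compile(r"@(\w+)")


def mention_edges(posts):
    """Count directed author -> mentioned-user connections."""
    edges = Counter()
    for author, text in posts:
        for target in MENTION.findall(text):
            edges[(author, target.lower())] += 1
    return edges


for (source, target), weight in mention_edges(sample_mentions).items():
    print(f"{source} -> {target}: {weight}")
```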

4. Change/flow

  • Charts
  • Dynamic visualizations


Google Fusion Tables

UW Interactive Data Lab

Stanford imMens

Stanford Dissertation Browser

Other Resources

Site and Podcast Reboot

Alrighty then. The previous web hosting provider for the Real Time Decisions Webcast discontinued the platform we were running on. They had a tool for migrating to WordPress. “Cool!” I thought. I wanted to be on WordPress and honestly had regretted my choice of their proprietary platform.

Except the migration didn’t work. The site got totally messed up. I still had all the content, but I had to rebuild the site from scratch.

So I’ve taken a long hiatus, but now I’m back. At the time of this writing, I’m basically rebuilding the site while it is live. Some of the links might not work, and I haven’t yet listed the Podcast feed on iTunes and Feedburner. But the site is live. At least the spammers have found it – that didn’t take long. But I’ve got Akismet spam filtering going, and I’m looking forward to getting things back on track.

If you are a former follower who has rediscovered me here, thanks and welcome back. I hope the content continues to meet and exceed your expectations going forward.

And if you are a new follower, welcome!


Analytics for Marketing

I had the opportunity to present on Marketing Analytics at the Alteryx SoCal User Group. Download my presentation slides here and see a summary of some of the content below.

I use the metaphor of a table to help us communicate the four essential elements of analytics and data-driven business outcomes – the four legs of the table. The ideas are borrowed from the Northwood Advisors site.

1. Aligning analytics with strategy


You know what you want from your business, and you need analytics to help you drive results from your strategy. If your decision-making processes are not aligned with strategy, then your team could be pushing really hard in the wrong direction. Focus on aligning decision-making processes with your strategy to ensure that everyone is pushing in the direction you want them to.

2. Data

The data to support marketing analytics should embrace several categories of data sources:

  • Internal performance data about products/services, employees/teams, customers, etc.
  • Digital media metrics, including social media, web, etc.
  • External market and consumer data

One key task of analytics is to blend these disparate data sources and types of data in order to provide a coherent picture.
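As a toy illustration of blending, the sketch below merges three small tables keyed by region into one combined view. The source names, regions, and numbers are all invented, and real blending usually also has to reconcile mismatched keys, time periods, and definitions.

```python
# Hypothetical data from three kinds of sources, all keyed by sales region.
sources = {
    "sales": {"north": 120000, "south": 95000},                       # internal performance
    "web_sessions": {"north": 40000, "south": 52000, "west": 8000},   # digital media metrics
    "market_size": {"north": 1000000, "south": 500000},               # external market data
}


def blend(named_tables):
    """Turn {source: {key: value}} into {key: {source: value}}."""
    blended = {}
    for source_name, table in named_tables.items():
        for key, value in table.items():
            blended.setdefault(key, {})[source_name] = value
    return blended


for region, row in blend(sources).items():
    print(region, row)
```

Note that "west" shows up with web sessions only. Gaps like that are often the first thing a blended view reveals.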

3. Expertise

There is a myth held by some executives that analytics just involves pumping a bunch of data through really smart software and getting brilliant answers. They couldn’t be more wrong. Analytics requires expertise across some key disciplines, including but not limited to these:

  • Domain expertise about the operations of your business,
  • Knowledge of the schema and definitions of the data,
  • Statistical modeling expertise to know which software and techniques to employ, and
  • Organizational insight to identify the right people and situations to drive value from analytics, as opposed to “shelfware” analytics.

4. Tools

In Episode 18: BI Trends for 2013 and Beyond I describe the landscape of broad BI platforms and niche tools of analytics. Those observations still hold true, and I recommend that podcast. Here are some of the categories to consider.

  • Advanced statistical analysis and predictive modeling for targeted solutions
  • Geospatial and market analytics
  • Data-driven process optimization for operations and support functions
  • Self-service reporting and analysis to empower decision makers
  • Dashboards and scorecards for actionable, metrics-driven management
  • KPIs (Key Performance Indicators) to focus resources on strategic objectives
  • Driver-based modeling for forecasting and planning
  • Enterprise business intelligence platforms

19: Big Data beyond the hype

The Big Data hype cycle is in full swing. But what is Big Data? How do you know if your data is BIG?

Big Data is not a concretely definable category. You can’t always say exactly what it is, but you know it when you see it. In this episode I define the key characteristics of Big Data that enable us to make more intelligent assessments and decisions regarding Big Data solutions.

Key characteristics of Big Data.

  • Physical Attributes

    • Bigness: physical size of data sets

    • Multi-source: data from multiple sources, especially both internal and external to the organization

    • Multi-structure: tabular data, markup data, audio and video data, geospatial, activity, transactions, snapshots, statuses

    • Fast arriving: streaming, frequently updated, time volatile

  • How we process it

    • Real time analysis

    • Real time outputs

      • Delivery to decision makers in real time

      • Delivery to external users (consumers, social/mobile users)

      • Interaction with software APIs

    • Aggregate and details

  • What we do with it

    • Predictive value

    • Pattern recognition, especially unlikely relationships

      • fuzzy matching

      • flexible matching

  • Challenges

    • Storage

    • Processing

    • Integration

    • Analysis
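To give a flavor of the fuzzy matching item in the outline above, here is a minimal sketch using Python's standard `difflib.SequenceMatcher`. The company names and the 0.8 threshold are arbitrary assumptions. This is one simple similarity measure among many, and it is not tuned for large data sets.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(name, candidates, threshold=0.8):
    """Return the closest candidate, or None if nothing is similar enough."""
    if not candidates:
        return None
    score, match = max((similarity(name, c), c) for c in candidates)
    return match if score >= threshold else None


# Hypothetical reference list of known company names.
known_companies = ["Acme Widget Corp", "Apex Gadget Co"]

print(best_match("Acme Widgit Corp", known_companies))  # misspelled input
print(best_match("Zenith", known_companies))            # no close match
```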

Thanks for listening to the Real Time Decisions Webcast – the leading ongoing Business Intelligence podcast focused on practical solutions.

Check out our sponsor, Northwood Advisors: http://www.northwoodadvisors.com

18: BI Trends

In this episode I talk about “BI Trends for 2013 and beyond.”
Check out the video on the Northwood Advisors web site.
Here is the link to my article: Rising to the Challenges in BI Healthcare
The 4 trends – listen to the audio to get all the details:
1. The maturing of broad BI platforms
For most companies, these platforms can meet most of the needs of most of the people most of the time.
2. Niche tools fill the gaps
Analysts benefit from powerful, best-of-breed niche tools across a variety of functions.
3. Mobile BI
Putting usable content in the hands of users in the places it’s most needed is a reality that is here today and won’t be going away.
4. Big data pushing the envelope
Despite the confusing hype, the need is real and there is a growing array of tools to solve the problem.
Thanks for listening to the BI Podcast that focuses on creating a better world through better decisions.

17: Why BI Matters

Previous episodes of the Real Time Decisions Webcast have covered a lot of material about What is BI and How to do BI, but haven’t dug too deeply into Why BI Matters. In this episode, Myron Weber explores this important topic, providing information and inspiration for your company’s BI efforts.
Check out my company web site at www.NorthwoodAdvisors.com
Also, as recommended in the podcast, check out my business coach, Dave Luke at www.DaveLukeAdvisory.com
Thanks for listening to the BI Podcast that strives to change the world with better decisions.

16: Real World BI Requirements

In this episode of the webcast, I discuss Real World BI Requirements.
BI Requirements in the real world must avoid two common unconstructive extremes and provide a constructive alternative.
  • The first extreme is a data-driven approach that fails to account for the objectives and outputs required.
  • The other extreme is a blank-page approach.
The audio podcast explores these concepts and provides a constructive alternative – check it out and share your thoughts in the comments.
Don’t forget to check out my company at www.northwoodadvisors.com to learn about our advisory services for BI Strategy, Roadmap, Governance, and Best Practices.

Analysts: Stop Being Evil!

Ever notice how some business functions get more respect inside their own companies than others? It’s a natural part of business life: all roles are not perceived equally.

Sales is always a good candidate for front-of-the-line, top-shelf, Grade-A respect – partly because sales is truly an important function, and also in many cases because the sales team is so prone to stepping up and claiming that spot. Other perennial front runners in the respect category, depending on the industry, can include manufacturing (because, after all, they make the stuff), operations (they keep things running), and professional services (where would the organization be without the fees of its doctors, lawyers, engineers, or consultants?).

Other functions’ respect can be more hit-and-miss. In some companies, marketing rules the roost – owning the brand and strategic direction – while in others, the marketing folks are relegated to maintaining the official PowerPoint template. The customer service call center can be empowered to be great (Zappos anyone?) or… well, don’t get me started on Bank of America. HR can be perceived as the key to attracting and retaining the talent that gives you an edge, or they can be viewed as the annoying pencil-pushers that you have to keep around. And there are companies that are built on the back of their technology capabilities and give the nerds a great deal of respect, while others would fire the whole IT team and go back to pencils and postage stamps if they could.

Some business functions and roles are viewed as contributing to success.

And some are considered a necessary evil.

It is in this context that I urge Analysts: Stop Being Evil! Okay, what I really mean is, stop allowing your role to be perceived as a necessary evil.

Over the past several years the role of the Analyst in business has risen in prominence, with an increasing flow of books, articles, software, conferences, and buzzwords directed at the field of Analytics. This is a positive trend in my view, and a substantive one that reflects the convergence of new competitive strategies based on management science with enterprise software that has the capability to make Analytics mainstream. But let’s not kid ourselves into thinking this is entirely new – there are industries (especially in the finance & insurance sector) that have been built on analytics for decades. In those companies, analysts get respect. And the shining stars of the case studies and books (Harrah’s, the Boston Red Sox, etc.) achieved success by elevating analytics to be a competitive differentiator.

But then there is the rest of the world. As I advise companies on strategy and execution for BI & Decision Systems, I see the sad, mainstream reality of Analysts who are treated as a necessary evil. This is reflected in the chronic late nights and lost weekends that happen not because the Analyst is on the cusp of a breakthrough insight, but because of the grueling spreadsheet march it takes to produce marginally useful “required metrics” without adequate systems and training. The Analyst is necessary because the business has grown to believe that it needs those spreadsheets to monitor its performance. But the Analyst is a necessary evil in the sense that the role is seen not as strategic, not as a competitive advantage, nor as a driver of business change, but rather as a cost of doing business – one they would eliminate if they could.

So what to do about this situation?

First, I think many Analysts need to aspire to greater things. Something beyond data crunching. Beyond delivering metrics and KPIs. Even beyond stats and operations research. A worthy goal for the Analytics function, I believe, is to own and drive better decision-making processes across the organization.

Second, I think some Analysts themselves need to consider their part in this situation. If you fancy yourself an Analyst, and yet the only analysis software you know how to use is Microsoft Excel, you are part of the problem. Train up or change your title.

Third, I encourage Analysts to be selective – when you take a job (or keep the one you’ve got), if it’s not clear that Analysts are viewed as providing true value, be wary. Get specific commitments about the Analyst role, about the investment to support Analytics, and the importance of the role within the organization.

The Analyst role can be a force for good, for transformation, and for the betterment of mankind – the motto of this webcast is “Better Decisions Lead to a Better Life.” Don’t let Analytics be seen as a necessary evil. Please.

Myron Weber is Managing Partner at Northwood Advisors.

This article was inspired when some great folks at the Smart Data Collective invited me to participate in an Analytics Blogarama on the topic of “The Emerging Role of the Analyst.” Check out the rest of the entries at http://smartdatacollective.com/40832/analytics-blogarama-october-6-2011.