Re: Gathering Intelligence from Twitter data - recap?

From: Yuriy Michael Goldman
Sent on: Thursday, January 21, 2010 11:21 PM
Correction: LDAP should say OLAP :)

On Thu, Jan 21, 2010 at 11:16 PM, Yuriy Michael Goldman <[address removed]> wrote:
 Folks -

I realize that tonight's session may have focused a bit more on the Semantic Web aspects of working with Twitter data than on the BI aspects.  Morton did a fantastic job presenting.  Still, some of you may be wishing for some elaboration or clarification.

Feel free to mail the group or me personally with questions, comments.

My digestion of the presentation in BI terms (or as best as I can articulate them) is as follows:
  • Morton has an ETL process which takes semi-organized data and gives it structure
  • The structure could have been a star schema or another similar relational schema common in BI practice.  Instead he chose to structure the data as RDF - semantic markup which is relational, but in a slightly different way and not in the traditional RDBMS way.  Some benefits of Morton's design choices are realized today and some will be realized in the near future:
    • RDF is a standard way of representing data and its metadata.  If other data sources are able to publish data as RDF, Morton can forgo the laborious data transformation and cleanup that most BI practitioners undergo today when working with disparate data sources, e.g. integrating MySpace and Facebook data.  Remember, gathering intelligence, whether through querying or mining, requires some determinate structure - relational or not.
    • Once in RDF, if Morton can select a well-fitting vocabulary by which to organize his data, this would enable him to better associate it with other, similarly organized data.  This is another way of acquiring metadata.  He can then infer new information from the information that he has collected and organized.  In the future, for example, Morton may be able to infer a risk profile of a user publishing Twitter messages of questionable content.
    • What else? 
  • However, rather than focusing on inference, Morton's current task focused on a rather common classification challenge.  That is, given a large number of tweets with URLs, can he 1) identify the URLs, 2) classify these URLs as malicious or non-malicious, and 3) make this process performant enough to be used in a "real-time" fashion?  Real-time BI in this case would be processing a live feed of incoming Twitter messages that pose a threat to consumers of this information.  Actionable BI in this case would be a filter/interceptor of sorts that sits between the producer of Twitter messages and the Twitter servers, and which can scan an incoming tweet, classify its risk, and neutralize the message or the user account if the risk is high.  This is analogous to how credit card companies are able to detect and prevent credit card fraud.  Although the details of implementation are slightly different, the high-level methodology is not.
  • Morton's output (or the intelligence that he delivers) is a classification or risk ranking report of the URLs that he detects in Twitter data.  This information is used to keep a database of known malicious entities current.  It may also be used by other entities towards threat prevention.  There is also ample opportunity to turn this type of implementation into a content screening/sanitization interceptor of sorts.
  • A note on data quality, data masking and sanitization.  These are all hot BI topics.  Cleaning up, masking or excluding "bad" and "malicious" data during the ETL phase is implemented in much the same way as Morton's extraction of URL content followed by classification attempts against some other knowledge bank.  In practice, Morton can use his approach to mask Twitter data.  For example, if all credit card numbers posted through tweets needed to be masked before you could pull that data from the Twitter API, Morton's methodology could again act as an outgoing interceptor of sorts, masking data on the fly.
  • With respect to the URLs themselves - one of Morton's challenges is understanding how something like http://loot.com and http://l001.com are different.  What conservative techniques can he apply to variations of loot.com to be able to detect social engineering spoofs, such as malicious URLs that "look" like legitimate URLs?  Right now, Morton does some sort of resolution against a data store of previously discovered malicious URLs.  There is also a rudimentary "learning" process by which he grows this store.  I believe this is done through manual verification of the suspect URLs, but perhaps there is an opportunity to automate this further.  One idea is to look at which sites link to the URL in question.  If any of those sites are known malicious sites, this would have a bearing on the risk rating of the URL in question.  This is all classification.
  • With respect to Morton's choice of a data store - this is also a very common challenge for BI practitioners.  What is the best way for me to store my data?  Is a relational db enough?  Do I need a data mart?  Do I need a warehouse?  Will LDAP solve my problems, etc.?  In Morton's case, and at his early stage of development, a non-traditional database served his purpose well.  He used CouchDB as more or less a powerful hashmap.  As his requirements and experience in this problem space mature, he may find that CouchDB will no longer suffice.
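
For anyone who wants something concrete, the URL steps in the list above (pull the URLs out of a tweet, check them against a store of known malicious hosts, flag look-alikes of legitimate hosts) can be sketched in a few lines of Python.  This is only an illustration under stated assumptions, not Morton's implementation: the host sets, the character substitution table, the 0.8 similarity threshold and all function names are made up for the example, and the in-memory sets stand in for whatever data store is actually used.

```python
import re
from difflib import SequenceMatcher

# Hypothetical stand-ins for the data store of known hosts (assumptions).
KNOWN_MALICIOUS = {"bad.example"}
KNOWN_LEGITIMATE = {"loot.com"}

# Illustrative, non-exhaustive table of digits/symbols used to imitate letters.
LOOKALIKES = str.maketrans({"0": "o", "1": "l", "3": "e", "5": "s", "$": "s"})

# Captures only the host part of http/https URLs.
URL_PATTERN = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)

# Assumed cut-off; a real system would tune this against labeled data.
SIMILARITY_THRESHOLD = 0.8


def extract_hosts(tweet):
    """Step 1: identify URLs in a tweet and return their lower-cased hosts."""
    return [host.lower().rstrip(".") for host in URL_PATTERN.findall(tweet)]


def skeleton(host):
    """Replace look-alike characters so 'l001' and 'lool' compare equal."""
    return host.translate(LOOKALIKES)


def classify(host):
    """Step 2: label a host as malicious, legitimate, suspect or unknown."""
    if host in KNOWN_MALICIOUS:
        return "malicious"
    if host in KNOWN_LEGITIMATE:
        return "legitimate"
    # Not an exact match: check whether it merely "looks" like a known host.
    for legit in KNOWN_LEGITIMATE:
        ratio = SequenceMatcher(None, skeleton(host), skeleton(legit)).ratio()
        if ratio >= SIMILARITY_THRESHOLD:
            return "suspect"
    return "unknown"


def scan_tweet(tweet):
    """Return (host, label) pairs for every URL found in the tweet."""
    return [(host, classify(host)) for host in extract_hosts(tweet)]
```

With these made-up sets, scan_tweet("win big at http://l001.com") labels l001.com as a suspect look-alike of loot.com.  A fixed similarity threshold is a blunt instrument: it will miss some spoofs and flag some innocent hosts, so suspects would still need the kind of verification described above before being added to the malicious store.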
This is not a complete list - I'm sure I missed some points and someone else may re-articulate some of the finer points better.  I certainly encourage you to participate and to of course forgive me for any inaccuracies.

Hope this helps,
Yuriy
