Friday, 11 May 2012

Big Data - Data Concerns..


I hope the last post gave you an understanding of what big data is and what it can do. Let us now look at the data problems we may face as we move ahead with this new technology.

The major problems in a big data implementation are:

  1. Processing the unstructured and semi-structured data
  2. Deciphering the information from unstructured or semi-structured data

What is unstructured or semi-structured data?

In simple terms, any data elements that can be stored in rows and columns in a database are called structured data. If the data can't be stored in rows and columns and has to be stored as BLOBs (Binary Large Objects), it is called unstructured or semi-structured data.
(Note: There is not yet a clear, agreed definition of unstructured or semi-structured data, but this is the baseline that people in the field are working from.)

From our personal banking credit card division example:

Structured data: credit card details like card type, interest rate, benefits, maximum limit, etc.
Unstructured / semi-structured data: search parameters on the bank website, emails to a bank representative, blogs on other websites, etc.

Suppose we would like answers to the following:
  1. What are the credit card types having an interest rate of 19% pa?
  2. What are the credit card types having a minimum card limit of $5000? etc.

These questions can be answered by querying the structured data with specific inputs. From a technical standpoint, we are able to retrieve the information directly by writing simple queries.
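
To make "simple queries" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample rows are all made up for illustration; they are not taken from any real bank system.

```python
import sqlite3

# Hypothetical table, columns, and rows, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE credit_cards ("
    "card_type TEXT, interest_rate_pa REAL, min_limit INTEGER)"
)
conn.executemany(
    "INSERT INTO credit_cards VALUES (?, ?, ?)",
    [
        ("Classic", 19.0, 1000),
        ("Gold", 19.0, 5000),
        ("Platinum", 15.5, 5000),
    ],
)


def cards_with_rate(rate):
    """Card types whose yearly interest rate equals the given value."""
    rows = conn.execute(
        "SELECT card_type FROM credit_cards "
        "WHERE interest_rate_pa = ? ORDER BY card_type",
        (rate,),
    )
    return [row[0] for row in rows]


def cards_with_min_limit(limit):
    """Card types whose minimum card limit equals the given value."""
    rows = conn.execute(
        "SELECT card_type FROM credit_cards "
        "WHERE min_limit = ? ORDER BY card_type",
        (limit,),
    )
    return [row[0] for row in rows]
```

Because the data already sits in rows and columns, each business question maps to one WHERE clause, and no interpretation of the input is needed.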

Now let us consider a scenario with unstructured data, where we want to analyze "how many people searched / looked for credit cards with a maximum limit of $5000?", and let us assume these are the searches that have been made on our website:

  1. card limit 5k
  2. card limit 5000
  3. card limit 5000 cad
  4. card limit five thousand
  5. card limit five thousand canadian dollars
  6. limit five thousand dollars
  7. credit cards + 5000
  8. 5000

The first problem is understanding the unstructured data.
How can we conclude that the searches were made for credit cards with a limit of 5000 dollars?
       Example:
  •  Search parameter no. 6 ("limit five thousand dollars"): the user may be searching for a savings account where the minimum balance is 5000, or may be looking for investments with a limit of 5000.
  •  Search parameter no. 8 ("5000"): this parameter is too vague to correlate with credit cards.

Suppose we ignore this data understanding problem and assume all the searches were looking for credit cards with a limit of 5000 dollars.
What will my search parameters be? How does the typical query have to be structured? etc.
Currently we have to make a lot of assumptions to derive information from unstructured / semi-structured data.
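
To show how many assumptions even a toy version needs, here is a rough sketch that tries to pull an amount out of each of the eight searches listed earlier. The word-to-number table, the "k means thousand" rule, and the "mentions card" check are all my own simplifying assumptions, not a real search-analysis method.

```python
import re

# The eight search strings from the example above.
SEARCHES = [
    "card limit 5k",
    "card limit 5000",
    "card limit 5000 cad",
    "card limit five thousand",
    "card limit five thousand canadian dollars",
    "limit five thousand dollars",
    "credit cards + 5000",
    "5000",
]

# Assumption: a tiny hand-written vocabulary. Real searches would need far more.
WORD_AMOUNTS = {"five thousand": 5000}


def extract_amount(query):
    """Return the amount mentioned in a search string, or None if not found."""
    text = query.lower()
    for words, value in WORD_AMOUNTS.items():
        if words in text:
            return value
    # Assumption: a number followed by "k" means thousands (5k -> 5000).
    match = re.search(r"(\d+(?:\.\d+)?)\s*(k)?\b", text)
    if match is None:
        return None
    number = float(match.group(1))
    if match.group(2):
        number *= 1000
    return int(number)


def mentions_card(query):
    """Assumption: the word 'card' in the text means a credit card search."""
    return "card" in query.lower()


# Searches where the amount is known but the product is still a guess.
ambiguous = [q for q in SEARCHES if not mentions_card(q)]
```

With these rules every search resolves to 5000, but searches 6 and 8 never mention a card, so counting them as credit card searches is still an assumption rather than a fact.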

One approach that I can think of to tackle this problem is to capture the metadata of the search. By correlating the search parameters with the metadata of the search, we can come to certain conclusions.
     Ex:  On which page was the search made?
            If the search was made on the credit card page, then we can conclude that the user is looking for credit cards.
Even with this approach, how to correlate the metadata with the actual data element captured from the user is another problem.
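
As a rough illustration of the metadata idea, the sketch below pairs each search with the page it was made on and uses the page as a fallback hint. The page paths, the log entries, and the page-to-product mapping are hypothetical; a real site would have its own.

```python
from collections import Counter

# Hypothetical search log: (search text, page the search was made on).
SEARCH_LOG = [
    ("limit five thousand dollars", "/credit-cards"),
    ("limit five thousand dollars", "/savings-accounts"),
    ("5000", "/credit-cards"),
    ("5000", "/investments"),
    ("card limit 5k", "/home"),
]

# Hypothetical mapping from page to the product that page is about.
PAGE_TO_PRODUCT = {
    "/credit-cards": "credit card",
    "/savings-accounts": "savings account",
    "/investments": "investment",
}


def infer_product(query, page):
    """Guess the product: trust the text first, then fall back to the page."""
    if "card" in query.lower():
        return "credit card"
    return PAGE_TO_PRODUCT.get(page, "unknown")


def count_by_product(log):
    """Count searches per inferred product."""
    return Counter(infer_product(query, page) for query, page in log)


counts = count_by_product(SEARCH_LOG)
```

Here the same vague text ("5000") lands in different buckets depending on the page, which is the whole point of capturing the metadata. Deciding which signal wins when the text and the page disagree is still an open choice.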

Big data is like a gold mine. We will have to process a huge set of data to get useful information that helps the business with decision making. Yet this technology is in its infant stage. With the growth of cloud computing and parallel processing technologies, big data will be a reality in the near future. The usefulness of big data is most visible in the fields of personal business intelligence, health care, and marketing.
"To get one ounce of gold we have to process 33 tons of rock; the same goes for big data."