Data challenges in the Digital Age

Artificial Intelligence has put plenty of "sexy" buzzwords into circulation around the globe. The result is a fusion of traditional data projects with new AI-driven advantages.

Nowadays a lot of unproductive time is spent searching for information inside the organization (3.6 hours on average). Most of these searches or data-related inquiries require cross-functional collaboration, which adds the risk of miscommunication, misalignment, and every other creative human way of delaying an answer (I've suffered all of them). Most of the time the answer comes from somebody with enough expertise in the domain area to guide the request, publish the answer, or help polish the question.

  • AI proves to be great for analyzing static information.
  • Every company has a lot of documents spread across different repositories: Legal, Sales, Marketing, Training...
  • For a given domain area, "pre-trained" models are already available on the market.

Process at a Glance:

1) Document Processing: In this phase, documents are gathered, embedded, and stored in the vector database. This happens upfront, before any client tries to search, and it keeps running in the background as documents are updated, deleted, and inserted. The process may run in batches from a data warehouse, with multiple iterations required to complete it. It is also common to use streaming data structures to orchestrate the pipeline in real time.

2) Serving: After a client enters a query along with some optional filters (e.g. year, category), the query text is converted into an embedding in the same vector space as the pre-processed documents. This enables the identification of the most pertinent documents from the whole collection. With the right vector database solution, these searches can be performed over hundreds of millions of documents in milliseconds.
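
To make the two phases concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `embed` is a toy word-count function over a tiny fixed vocabulary standing in for a real embedding model, and `InMemoryVectorStore` is a hypothetical class that scans every row, whereas a real vector database would use an approximate index to reach millisecond latency at scale.

```python
import math
from dataclasses import dataclass, field

# Toy vocabulary: each word is one dimension of the vector space (assumption).
VOCAB = ["contract", "clause", "sales", "forecast", "training", "onboarding"]


def embed(text):
    """Stand-in for a real embedding model: normalized counts of vocabulary words."""
    tokens = text.lower().split()
    counts = [float(tokens.count(word)) for word in VOCAB]
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else counts


@dataclass
class InMemoryVectorStore:
    """Hypothetical minimal store, not a real vector database API."""
    rows: dict = field(default_factory=dict)  # doc_id -> (vector, metadata)

    def upsert(self, doc_id, text, metadata):
        # Handles insertions and updates: an existing vector is overwritten.
        self.rows[doc_id] = (embed(text), metadata)

    def delete(self, doc_id):
        self.rows.pop(doc_id, None)

    def search(self, query, top_k=3, filters=None):
        filters = filters or {}
        q = embed(query)  # same function as the documents, so same vector space
        scored = []
        for doc_id, (vec, meta) in self.rows.items():
            if any(meta.get(key) != value for key, value in filters.items()):
                continue  # skip documents that fail the optional filters
            # Vectors are unit length, so the dot product is the cosine similarity.
            scored.append((sum(a * b for a, b in zip(q, vec)), doc_id))
        scored.sort(reverse=True)
        return [doc_id for _score, doc_id in scored[:top_k]]


# 1) Document processing: embed and store a batch, then apply a later deletion.
store = InMemoryVectorStore()
batch = [
    ("legal-1", "supplier contract termination clause", {"category": "legal", "year": 2023}),
    ("sales-1", "quarterly sales forecast by region", {"category": "sales", "year": 2023}),
    ("train-1", "onboarding training for new staff", {"category": "training", "year": 2022}),
]
for doc_id, text, metadata in batch:
    store.upsert(doc_id, text, metadata)
store.delete("train-1")

# 2) Serving: embed the query, apply the filter, rank by similarity.
results = store.search("sales forecast", top_k=2, filters={"year": 2023})
print(results)  # "sales-1" ranks first
```

The key design point is that documents and queries go through the same `embed` function; if they were embedded by different models, their vectors would not be comparable.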

AI search relies on vectors and on algorithms that compare them. Embeddings are designed to be information-dense representations of the dataset being studied. The most common format is a vector of floating-point numbers. For more background, see the article "What are embeddings in Machine Learning".
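
As a small illustration of how vectors of floating-point numbers can be compared, the sketch below uses cosine similarity, one common choice among several. The three 4-dimensional vectors are made up by hand for this example; real embeddings come from a model and usually have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made embeddings (assumption): the first two are meant to be related texts.
invoice = [0.9, 0.1, 0.0, 0.2]
receipt = [0.8, 0.2, 0.1, 0.3]
holiday = [0.0, 0.9, 0.8, 0.1]

sim_close = cosine_similarity(invoice, receipt)
sim_far = cosine_similarity(invoice, holiday)
print(sim_close > sim_far)  # True: the related pair scores higher
```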

Last but not least, data analytics is not just for people with technical skills; it also requires "domain experts".
