Creating A Tech Product with Data Science
By @pragmaticcto
Data Collection
Established companies can draw on data collected over years of operation when starting AI/ML projects. A startup, beginning with no customers and no data, faces a different first challenge: getting the data in the first place. I have seen attempts to build businesses on the promise of data value when there was no data to harness at all.
The key is to get back to basics and build products and processes that allow us to collect the data. This usually starts with customer-facing products, be it an app or a website, engineered to capture all data that may be useful in the future. Much of the data will be collected as part of the core business process: sales, introductions, engagements.
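As a sketch of what this can look like in practice, here is a minimal event-collection endpoint in Python. The endpoint path, event fields and file-based store are illustrative assumptions, standing in for whatever durable event store the product actually uses.

```python
# A minimal sketch of a product event logger, assuming a Flask app and a
# file-based append-only store; names and fields are illustrative.
import json
import time
import uuid

from flask import Flask, request, jsonify

app = Flask(__name__)
EVENT_LOG = "events.jsonl"  # stand-in for a durable event store

@app.route("/events", methods=["POST"])
def collect_event():
    """Accept a product event (e.g. sale, introduction, engagement)."""
    payload = request.get_json(force=True)
    event = {
        "event_id": str(uuid.uuid4()),
        "received_at": time.time(),
        "type": payload.get("type", "unknown"),
        "properties": payload.get("properties", {}),
    }
    # Append raw events as-is; cleaning and transformation happen downstream.
    with open(EVENT_LOG, "a") as f:
        f.write(json.dumps(event) + "\n")
    return jsonify({"status": "ok", "event_id": event["event_id"]})
```

The point is to record everything in a raw, durable form from day one; deciding what is useful can happen later, but data that was never captured cannot be recovered.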
Data Accessibility
Data models and algorithms are the secret sauce that turns a data feed into insights and decisions.
Tech companies are used to giving engineering teams independence and the best tools available, so they can build the best, fastest, most scalable solutions for their product. The same approach should apply to data science. If a company wants the best insights, the most accurate predictions, and the ability to make valuable decisions automatically, it needs to enable data scientists, the people who do the research and develop the AI/ML models and algorithms, to innovate as efficiently as possible.
In the data science world, this means not just tools but easy and timely access to data. If a data scientist has to fill in a form, or ask the development or operations teams and wait for the data to be delivered, valuable time has already been lost. Building self-service data capabilities into your tech platform should not be an afterthought: treat it as an engineering problem.
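As an illustration, self-service access can be as simple as a helper that data scientists call directly from a notebook, with no ticket or hand-off. The sketch below uses SQLite and pandas as stand-ins for a read-only warehouse replica; the connection target, table and columns are assumptions, not a prescribed stack.

```python
# A sketch of a self-service data access helper, assuming a read-only
# analytics replica reachable over SQL; names here are hypothetical.
import sqlite3

import pandas as pd

ANALYTICS_DB = "analytics_replica.db"  # stand-in for a warehouse replica

def load_events(event_type: str, since: str) -> pd.DataFrame:
    """Pull events of a given type directly into a DataFrame."""
    query = """
        SELECT event_id, received_at, type, properties
        FROM events
        WHERE type = ? AND received_at >= ?
    """
    with sqlite3.connect(ANALYTICS_DB) as conn:
        return pd.read_sql_query(query, conn, params=(event_type, since))

# Example: explore recent engagement events in a notebook.
# df = load_events("engagement", "2024-01-01")
```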
Automating the Data Pipeline
Another aspect of software engineering we now treat as common sense is automating the development and release process. From central code repositories, through automated tests and continuous integration, to continuous deployment across multiple environments, we put a lot of effort into getting where we are today. It was a big investment, and we had to convince investors and stakeholders that, although it delivers no user-facing functionality, automation would bear fruit in the long run. The technology community won that argument, and automation is now the de facto standard for any serious software development effort.
We can’t afford to treat data pipelines differently: automation must be at the forefront of any AI/ML project from the start. Treat data as you would code, with durable storage, easy access and shareability, continuous testing and validation of models and algorithms, all the way to automated deployment to production. With data we can go even further: a feedback loop that analyses data in production and feeds the results back into model development and validation creates automated, self-evolving models that can be managed with minimal human intervention.
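A minimal sketch of such a pipeline might look like the following, with explicit validation gates for both the incoming data and the candidate model before anything reaches production. The thresholds and function names are illustrative assumptions.

```python
# A sketch of a pipeline with validation gates, treating data and models
# like code; thresholds and names are illustrative, not prescriptive.
from dataclasses import dataclass

@dataclass
class PipelineRun:
    rows_in: int
    rows_valid: int
    model_score: float

def validate_data(rows_in: int, rows_valid: int, min_valid_ratio: float = 0.95) -> bool:
    """Gate 1: refuse to train on a feed that is mostly invalid."""
    return rows_in > 0 and rows_valid / rows_in >= min_valid_ratio

def validate_model(model_score: float, baseline: float = 0.80) -> bool:
    """Gate 2: only promote models that beat the current baseline."""
    return model_score >= baseline

def run_pipeline(run: PipelineRun) -> str:
    if not validate_data(run.rows_in, run.rows_valid):
        return "rejected: data quality below threshold"
    if not validate_model(run.model_score):
        return "rejected: model below baseline, kept previous version"
    # In production, metrics from the deployed model would feed back into
    # the next run, closing the loop with minimal human intervention.
    return "deployed"

print(run_pipeline(PipelineRun(rows_in=10_000, rows_valid=9_800, model_score=0.86)))
```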
Data Privacy and Security
With all the excitement about the value data can bring to a business, and the outcomes we can deliver using AI/ML, we should never forget our duty of care to keep our customers’ and users’ data safe and secure. Data privacy has rightly become a focus of modern society, and any breach of trust around data ownership can be detrimental to a business’s credibility and have a lasting impact on its reputation.
With this in mind, any data-driven business needs to put data security and privacy at the heart of its process. From product inception, security and privacy should be recognized as key parts of design, architecture and development. The same should apply across the data lifecycle: encryption, anonymisation and access control at every stage of the pipeline, from collection and cleaning through transformation, model development and validation.
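As one concrete example, direct identifiers can be pseudonymised with a keyed hash before data ever reaches the analytics pipeline. The field list and key handling below are illustrative assumptions; in practice the key would come from a secrets manager, not the code.

```python
# A sketch of field-level anonymisation before data enters the pipeline,
# using a keyed hash; the PII field list is illustrative.
import hashlib
import hmac

PII_FIELDS = {"email", "phone", "full_name"}
SECRET_KEY = b"load-me-from-a-secrets-manager"  # never hard-code in practice

def anonymise(record: dict) -> dict:
    """Replace direct identifiers with stable pseudonyms."""
    out = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            digest = hmac.new(SECRET_KEY, str(value).encode(), hashlib.sha256)
            out[key] = digest.hexdigest()
        else:
            out[key] = value
    return out

print(anonymise({"email": "jane@example.com", "plan": "pro"}))
```

A keyed hash keeps the pseudonyms stable, so the same customer can still be joined across datasets, while the raw identifier never leaves the collection layer.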
Key Takeaways: Agile Data-Oriented Development
Make sure your product, tools and processes are designed to collect quality data efficiently, with automated cleanup and transformation capabilities.
Easy and secure access to the data is key to enabling data scientists to develop ML and AI models.
Once models are developed, they need to be validated and deployed to production. As with the software release process, automated data pipeline validation and deployment is key here: automation supports the long-term maintainability and extensibility of the data processing capability, just as it enables frequent, defect-free software releases.
Dealing with data, especially customer and personal data, comes with a duty of care: privacy and security best practices, with anonymisation and strict access control to protect against unauthorized access and data leaks, must be at the forefront of any data startup’s considerations.