I was asked the following question: if you have multiple data sets and some background on the data, how do you go about working towards building a model? More specifically, what are the kinds of things to look out for? I started there and wound up doing a brain dump of the thoughts that came to mind while trying to put together a reasonable response. This has become that response.
To sum this up as succinctly as I can before indulging myself with the fuller explanation below: I think developing use-cases for data science is driven largely by three things:

1. Identify the objective
2. Know the data
3. Find the value

If you don't have all three, I hate to tell you, but the project is meaningless. I can build a prediction model for the MNIST data set. The objective is there, and the data is there too, but on its own that has no real value (unless you've never done it before, in which case it has learning value, but that doesn't pay the bills very well).

I'm looking at your question in two different ways: first, with a known problem and available data sets, determining how they fit together or can work together to provide insight into the question; and second, given several data sets and the story around them, coming up with "something meaningful" from them.

The first case should make sense pretty quickly. Let's say you have three sets of information for a company: customer information, sales information, and product information. Immediately we see two relationships we can probably work with, sales by customer and sales by product, but we can also look at the more complicated relationships that naturally follow from those, which we'll come back to in a bit. (The first sketch below makes these joins concrete.)

Now, depending on what our original question is, we may already be in good territory. If our goal is forecasting aggregated sales at the company's different store locations, we probably have what we need. We probably also have enough information to measure the impact the approximate distance between certain products has on their sales, e.g., does putting a display of peanuts or pretzels in the beer aisle drive increased sales? This carries an awful lot of assumptions, like the method of determining the distance between products being straightforward (an aisle number, say). In these scenarios our objective is to determine whether we have enough meaningful data to achieve the goal.

Part of answering that question happens in the course of translating the business question into model-related terms. Are we looking to predict a dollar amount? Then we'll want to consider regression models, since the target is continuous, unless of course we bucket store revenue into categories, at which point it becomes classification.

Assuming we have enough of the right data to answer the question, we'll probably want to engineer features to add value to our inputs. The raw date is great, but if we're forecasting revenue for seasonal products we might want to extract the month of the year and use that instead. Other things may vary more by day of the week. If I had to guess, our peak increase in peanut sales from being placed in the beer aisle would happen on Friday and Saturday afternoons, so that becomes a meaningful variable (see the date-feature sketch below).

Much of this comes down to knowledge of the data space. The better you understand the data you're using, the easier it is to work out the different components, like how the information in the different tables relates. It also makes it easier to spot where you might be missing things, or to reason about the impact of filling missing values with nulls or mean values. This is where you shouldn't discount what the business people in the group know.
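To make the customer/sales/product example concrete, here's a minimal pandas sketch of those joins and aggregations. The file names and columns (customer_id, product_id, store_id, amount, and so on) are assumptions for illustration, not a real schema.

```python
import pandas as pd

# Hypothetical tables; file and column names are assumed for illustration.
customers = pd.read_csv("customers.csv")  # customer_id, name, region, ...
products = pd.read_csv("products.csv")    # product_id, category, aisle, ...
sales = pd.read_csv("sales.csv")          # customer_id, product_id, store_id, sale_date, amount

# Sales information by customer: join on the linking variable customer_id.
sales_by_customer = (
    sales.merge(customers, on="customer_id", how="left")
         .groupby("customer_id")["amount"]
         .sum()
)

# Sales information by product: join on product_id, keep the product category.
sales_by_product = (
    sales.merge(products, on="product_id", how="left")
         .groupby(["product_id", "category"])["amount"]
         .sum()
)

# Aggregated sales per store location, the forecasting target mentioned above.
sales_by_store = sales.groupby("store_id")["amount"].sum()
```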
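On translating the objective into model terms, here's a quick sketch of the bucketing idea, turning the continuous revenue target into categories. The bin edges are invented for illustration.

```python
import pandas as pd

sales = pd.read_csv("sales.csv")  # same hypothetical table as above

# Continuous target -> regression. Bucketing the target turns the same
# question into classification. Bin edges here are made up for illustration.
revenue = sales.groupby("store_id")["amount"].sum()
revenue_class = pd.cut(
    revenue,
    bins=[0, 10_000, 50_000, float("inf")],
    labels=["low", "medium", "high"],
)
```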
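And a sketch of the date-based feature engineering, again assuming a sale_date column exists:

```python
import pandas as pd

sales = pd.read_csv("sales.csv")  # same hypothetical table as above

# Coarser time features often carry more signal than the raw timestamp.
sales["sale_date"] = pd.to_datetime(sales["sale_date"])
sales["month"] = sales["sale_date"].dt.month            # seasonality
sales["day_of_week"] = sales["sale_date"].dt.dayofweek  # 0=Monday ... 6=Sunday

# Flag the hypothesized Friday/Saturday peak window as its own feature.
sales["is_fri_sat"] = sales["day_of_week"].isin([4, 5])
```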
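Finally, on filling missing values: a mean fill preserves the column mean but shrinks its variance, which is exactly the kind of impact worth checking before and after. A minimal sketch:

```python
import pandas as pd

sales = pd.read_csv("sales.csv")  # same hypothetical table as above

amount = sales["amount"]
filled = amount.fillna(amount.mean())

# The mean is unchanged, but the variance shrinks if any values were missing,
# which can quietly flatter a model's apparent fit.
print(amount.mean(), filled.mean())
print(amount.std(), filled.std())
```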
Returning to domain knowledge: maybe there's a behavioral psychologist in the group who has supremely relevant information about the buying behavior you're working on. If you're in a pinch to find this kind of knowledge, try asking the DBA who maintains the legacy data or built the production system. In particular, you're looking for the linking variables between data sets. These can be customer IDs, dates, any number of things. A lot of the time, hopefully anyway, they will actually be part of the tables' primary keys. That happens pretty regularly and is pretty nice when true; otherwise you have work to do.

A lot of cases will have you building out processes someone already performs manually; you just want to automate or augment them with machine learning. This is close to a necessity, because the general rule of thumb for feasibility says that, given enough time, a human should be able to reach the same conclusions as your model. That isn't always true, but it is in a lot of cases. The more closely the solution mimics a human thought process, the easier it is to sell, so there's that too. People are still often distrustful of machine learning, so being able to show that connection has value (insert vague explainable-AI hype here).

Rather than just aggregating sales by customer and product, maybe we can use the sales information as a link to create an altogether new data set. If we build a table of the products customers purchase over time, we open up new avenues to explore: forecasting whether a customer will buy new products based on their past history, building a recommendation system based on buyer or product similarity, or tying ad information to products and measuring its impact on sales revenue. These second-degree relationships often hold a great amount of value, but this is starting to get into the second question. (The first sketch below shows that customer-by-product table and a crude similarity-based recommender.)

If you're lucky, someone has already mapped out most of these relationships. This is the world of database designers, data architects, and in some cases metadata architects. They are great people to brainstorm with because they've already had to consider many of the relationships among the data sets, and in discussion they might think of things you wouldn't have come up with on your own, or they might even be the inspiration you needed.

If you don't have these resources available, think about how you might apply tried-and-true methods. If you have text documents, look at use-cases built on bag-of-words or word2vec models. Were the people behind them trying to accomplish something similar? (A bag-of-words sketch follows below.)

The other approach is to enter the area of data mining. Being able to identify relationships that aren't apparent can be quite useful. On the whole there are probably much better sources than me on data mining, but think of unsupervised methods like k-means clustering, topic models like Latent Dirichlet Allocation (LDA), and dimensionality reduction like Principal Component Analysis (PCA). Constructing use-cases from this point usually means using their outputs to augment your data set and transform the problem into a more traditional supervised learning one (the last sketch below shows this).

Ultimately, each of these things ties directly back to the three elements we identified at the start.
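Here's the sketch of that new data set: pivoting the sales table into a customer-by-product purchase matrix, then using product similarity for a crude recommender. Cosine similarity is just one reasonable choice, and the column names carry over from the earlier hypothetical schema.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

sales = pd.read_csv("sales.csv")  # same hypothetical table as above

# A brand-new data set built from sales alone: customers x products.
purchases = pd.crosstab(sales["customer_id"], sales["product_id"])

# Product-product similarity based on which customers buy them.
sim = cosine_similarity(purchases.T)
product_sim = pd.DataFrame(sim, index=purchases.columns, columns=purchases.columns)

def similar_products(product_id, n=5):
    """Crude recommender: the n products most similar to the given one."""
    return product_sim[product_id].drop(product_id).nlargest(n)
```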
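For the text case, a bag-of-words sketch using scikit-learn's CountVectorizer; the documents here are toys standing in for whatever text you actually have.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents standing in for real text (product descriptions, reviews, ...).
docs = [
    "salted peanuts perfect with beer",
    "pretzels are a classic beer snack",
    "craft beer variety six pack",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                          # word counts per document
```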
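And last, a sketch of the data-mining-to-supervised-learning handoff: build simple per-customer features, compress them with PCA, label them with k-means, and feed the augmented table to a downstream supervised model. The feature choices and k=4 are arbitrary, purely for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

sales = pd.read_csv("sales.csv")  # same hypothetical table as above

# Simple per-customer numeric features derived from the sales data.
customer_features = sales.groupby("customer_id")["amount"].agg(["sum", "mean", "count"])

# Standardize so no single feature dominates the distance calculations.
X = StandardScaler().fit_transform(customer_features)

# PCA compresses correlated features into a few components.
components = PCA(n_components=2).fit_transform(X)

# k-means assigns each customer an unsupervised cluster label; k=4 is arbitrary.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Augment the feature table so a downstream supervised model can use it all.
augmented = customer_features.copy()
augmented["pc1"] = components[:, 0]
augmented["pc2"] = components[:, 1]
augmented["cluster"] = labels
```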