{"id":67279,"date":"2022-10-04T18:34:48","date_gmt":"2022-10-04T18:34:48","guid":{"rendered":"https:\/\/www.techrepublic.com\/?p=3999649"},"modified":"2022-10-04T18:34:48","modified_gmt":"2022-10-04T18:34:48","slug":"this-linkedin-tool-for-building-machine-learning-systems-is-now-part-of-the-lf-ai-data-foundation","status":"publish","type":"post","link":"https:\/\/cloudnewshub.com\/?p=67279","title":{"rendered":"This LinkedIn tool for building machine learning systems is now part of the LF AI &amp; Data Foundation"},"content":{"rendered":"<figure id=\"attachment_3999655\" aria-describedby=\"caption-attachment-3999655\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"size-article wp-image-3999655\" src=\"http:\/\/cloudnewshub.com\/wp-content\/uploads\/2022\/10\/this-linkedin-tool-for-building-machine-learning-systems-is-now-part-of-the-lf-ai-data-foundation.jpg\" alt=\"June 30, 2019 San Francisco \/ CA \/ USA - Large LinkedIn offices in downtown San Francisco; LinkedIn is an American business and employment-oriented service and it owned by Microsoft\" width=\"770\" height=\"513\"><figcaption id=\"caption-attachment-3999655\" class=\"wp-caption-text\">Image: Sundry Photography\/Adobe Stock<\/figcaption><\/figure>\n<p>As organizations start to make more extensive use of machine learning, they need to manage not just their data and the machine learning models that use it but the features that organize the raw data into concepts the models can work with.<\/p>\n<p><strong>SEE: <a href=\"https:\/\/www.techrepublic.com\/resource-library\/downloads\/artificial-intelligence-ethics-policy\/?r=768791548\" target=\"_blank\" rel=\"nofollow noopener sponsored noreferrer\">Artificial Intelligence Ethics Policy<\/a> (TechRepublic Premium)<\/strong><\/p>\n<p>Earlier this year, LinkedIn open sourced <a href=\"https:\/\/github.com\/feathr-ai\/feathr\" target=\"_blank\" rel=\"nofollow noopener sponsored noreferrer\">Feathr<\/a>, the feature store it uses 
internally for hundreds of different machine learning-powered services using petabytes of data, like showing interesting jobs or blog posts that you might want to read. It\u2019s the technology behind the Azure Feature Store service, and it\u2019s now become part of the <a href=\"https:\/\/lfaidata.foundation\/blog\/2022\/09\/12\/feathr-joins-lf-ai-data-as-new-sandbox-project\/\" target=\"_blank\" rel=\"nofollow noopener sponsored noreferrer\">Linux Foundation AI &amp; Data Foundation<\/a> to make it more useful to a wider range of development teams.<\/p>\n<p>\u201cFeature stores and Feathr are an important piece of how to do MLOps and how to deploy machine learning models efficiently, effectively and compliantly by covering all of the things that the enterprise needs to think about,\u201d David Stein, a senior staff engineer working on Feathr at LinkedIn, told TechRepublic.<\/p>\n<h2>How machine learning finds features<\/h2>\n<p>In machine learning terms, a feature is a specific data input to a machine learning model \u2014 think of it like a column in a database or a variable in code.<\/p>\n<p>\u201cIf you\u2019re trying to predict whether a person is going to buy a car, and you have a person and a car as the input to the model, and the prediction is a likelihood of buying or not wanting to buy, features that the model might be designed to use may include things like the person\u2019s income level or their favorite color: Things you know about them and the things about the car,\u201d Stein said.
\u201cIf you have a giant data set with a billion rows, you would want to pick a set of columns as the starting point and you\u2019re then going to design your model around how to use those features in order to make the prediction.\u201d<\/p>\n<p>Some of the features are right there in the data, like product IDs and dates, but others need to be processed, so it\u2019s more complicated than just pointing at the columns you want in a database.<\/p>\n<p>\u201cAll the other useful features that you\u2019re going to need may need to be computed and joined and aggregated from various other data assets,\u201d Stein explained.<\/p>\n<p>If your machine learning model works with transactions, the average value of transactions in restaurants over the last three months would be that kind of feature. If you\u2019re building a recommendation system, the data is tables of users, items and purchases, and the feature would be something you can use to make recommendations, like what products have been bought over the last week or month, whether someone bought the product on a weekday or weekend, and what the weather was like when they bought it.<\/p>\n<p>Complex machine learning systems have hundreds or thousands of features, and building the pipeline that turns the data into those features is a lot of work.
Teams have to connect to multiple data sources, combine the features with the labeled data while preserving things like <a href=\"https:\/\/github.com\/feathr-ai\/feathr\/blob\/main\/docs\/concepts\/point-in-time-join.md\" target=\"_blank\" rel=\"nofollow noopener sponsored noreferrer\">\u201cpoint in time\u201d correctness<\/a>, save those features into low-latency storage and make sure the features are handled the same way when you\u2019re using those models to make predictions.<\/p>\n<p>\u201cAt LinkedIn, there are many, many data assets like databases and ETL data stores and different kinds of information about job postings, advertising, feed items, LinkedIn users, companies, skills and jobs, and all these things as well as the LinkedIn economic graph,\u201d Stein said. \u201cThere\u2019s a huge number of different entities that may be related to a particular prediction problem.\u201d<\/p>\n<p>Just finding and connecting to all those datasets is a lot of work, before you start choosing and calculating the various features they contain.<\/p>\n<p>\u201cEngineers that would build the machine learning models would have to go to great lengths to find the details of the various data assets that those many signals might need to come from,\u201d Stein said. They also have to spend time normalizing how to access the data: Different data sources may label the same information as user ID, profile ID or UID.<\/p>\n<p>Two people using the same data to train different models can end up creating the same feature for their different projects. That\u2019s wasted effort, and if the feature definitions are slightly different, they might give confusingly different answers. Plus, each team has to build a complex feature engineering pipeline for every project.<\/p>\n<h2>Feathr: A platform for features<\/h2>\n<p>A feature store is a registry for features that lets you do all that work once.
Every project can use the same pipeline, and if you need a feature that another developer has already created, you can just reuse it. This is the function of Feathr (<strong>Figure A<\/strong>).<\/p>\n<p><strong>Figure A<\/strong><\/p>\n<figure id=\"attachment_3999650\" aria-describedby=\"caption-attachment-3999650\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"size-article wp-image-3999650\" src=\"http:\/\/cloudnewshub.com\/wp-content\/uploads\/2022\/10\/this-linkedin-tool-for-building-machine-learning-systems-is-now-part-of-the-lf-ai-data-foundation.png\" alt=\"\" width=\"770\" height=\"432\"><figcaption id=\"caption-attachment-3999650\" class=\"wp-caption-text\">Image: LinkedIn. All the complex work a feature pipeline needs to do is better built once.<\/figcaption><\/figure>\n<p>Stein suggests thinking of feature stores rather like package managers.<\/p>\n<p>\u201cFeature stores are about making it simpler and easier to be able to import the data that you need into your machine learning application and machine learning model,\u201d he said. \u201cThat can often be a very complex setup, especially for large projects that are run over a range of time, and especially in companies where there are many projects using similar datasets.
We want to make it easy for the machine learning engineer to just import their features as their inputs and then write their model code.\u201d<\/p>\n<p>Instead of finding the right dataset and writing the code to aggregate data into features, Stein further explained that \u201cthe machine learning engineer would like to be able to say \u2018okay, I want the user\u2019s number of years of experience, I want something about their company\u2019 and just have it appear as columns in an input table.\u201d That way, they can spend their time working on the model rather than feature infrastructure.<\/p>\n<p>This means a lot less work for developers on every machine learning project; in one case, thousands of lines of code turned into just ten lines because of Feathr. In another, what would have been weeks of work was finished in a couple of hours because the feature store has built-in operators.<\/p>\n<p>The fewer manual processes there are in any development pipeline, the less fragile it will be, because you\u2019re not asking somebody to do a complicated thing by hand perfectly every time. Having those inbuilt features means more people can use these sophisticated techniques.<\/p>\n<p>\u201cFeathr provides the ability to define sliding window activity signals on raw event data,\u201d Stein said. \u201cThat used to be hard to do at all without a platform that knows how to do that properly. Getting it right using more basic tooling was hard enough that many teams wouldn\u2019t even experiment with using signals like that.\u201d<\/p>\n<p>Feathr also does the work of storing features in a low-latency cache so they\u2019re ready to use in production.<\/p>\n<p>\u201cWhen the application is trying to do an inference, it asks for the values of some features so that it can run that through its model to make some prediction,\u201d Stein added.
\u201cYou want the feature store machinery to answer quickly so that that query can be answered very quickly.\u201d<\/p>\n<p>You don\u2019t need that low latency when you\u2019re training your machine learning model, so that can pull data from other locations like Spark, but with Feathr you don\u2019t have to write different code to do that.<\/p>\n<p>\u201cFrom the point of view of the machine learning engineer writing the model code, we want those things to look the same,\u201d Stein said.<\/p>\n<p>Accuracy and repeatability matter for machine learning, and so does knowing how models produce their results and what data they\u2019re using. A feature store makes it easier to audit that (the Azure Feature Store has a friendly user interface that shows where data comes from and where it\u2019s used), and can make it easier to understand as well because you see simplified naming rather than all the different data identifiers (<strong>Figure B<\/strong>).<\/p>\n<p><strong>Figure B<\/strong><\/p>\n<figure id=\"attachment_3999651\" aria-describedby=\"caption-attachment-3999651\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"size-article wp-image-3999651\" src=\"http:\/\/cloudnewshub.com\/wp-content\/uploads\/2022\/10\/this-linkedin-tool-for-building-machine-learning-systems-is-now-part-of-the-lf-ai-data-foundation-1.png\" alt=\"\" width=\"770\" height=\"459\"><figcaption id=\"caption-attachment-3999651\" class=\"wp-caption-text\">Feathr makes it clearer what data is in a feature and where it came from.<\/figcaption><\/figure>\n<p>Even though data access is centralized through a feature store, Feathr uses Role-Based Access Control to make sure only people who should have access to a dataset can use it for their model.
The open source Feathr uses Azure Purview, which means you can set access controls once and have them applied consistently and securely everywhere.<\/p>\n<h2>Effective enterprise machine learning<\/h2>\n<p>Because it was built for the technology and configurations that LinkedIn uses internally, open sourcing Feathr also meant making it more generalized, so it would be useful for businesses that use different technologies from those at LinkedIn.<\/p>\n<p>\u201cThere\u2019s a growing number of people in the industry that have this kind of problem,\u201d Stein noted. \u201cEach individual organization building feature pipelines needs to figure out how to solve those engineering challenges, how to make sure things are being used in the right way \u2014 these are things that you can build once and build well in a platform solution.\u201d<\/p>\n<p>The first step was working with Microsoft to make Feathr work well on Azure. That includes support for data sources that are more common in the industry at large than at LinkedIn (<strong>Figure C<\/strong>).<\/p>\n<p><strong>Figure C<\/strong><\/p>\n<figure id=\"attachment_3999653\" aria-describedby=\"caption-attachment-3999653\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"size-article wp-image-3999653\" src=\"http:\/\/cloudnewshub.com\/wp-content\/uploads\/2022\/10\/this-linkedin-tool-for-building-machine-learning-systems-is-now-part-of-the-lf-ai-data-foundation-2.png\" alt=\"\" width=\"770\" height=\"407\"><figcaption id=\"caption-attachment-3999653\" class=\"wp-caption-text\">Image: Microsoft. Using Feathr with familiar Azure services.<\/figcaption><\/figure>\n<p>If you\u2019re using Feathr on Azure, you can pull data from Azure Blob Storage, Azure Data Lake Storage, Azure SQL databases and data warehouses.
Once the features have been defined, they can be generated using Spark running in Azure Databricks or Azure Synapse Analytics.<\/p>\n<p>Features are stored in an Azure Redis cache for low-latency serving and registered in Azure Purview for sharing between teams. When you want to use features in a machine learning model, they can be called from inside Azure Machine Learning: Deploy the model to an Azure Kubernetes Service cluster and it can retrieve features from the Redis cache.<\/p>\n<p>Bringing the project to the LF AI &amp; Data Foundation is the next step and will take Feathr beyond the Azure ecosystem.<\/p>\n<p>\u201cThe collaboration and affiliation improves the network of people working on Feathr,\u201d Stein said. \u201cWe have access to resources and opportunities for collaboration with related projects.\u201d<\/p>\n<p>Collaboration and contribution are important because feature stores are a fairly new idea.<\/p>\n<p>\u201cThe industry is growing towards a more solid understanding of the details of what these tools need to be and what they need to do, and we\u2019re trying to contribute to that based on what we\u2019ve learned,\u201d he added.<\/p>\n<p>As often happens when open sourcing a project, that work also made Feathr better for LinkedIn itself.<\/p>\n<p>\u201cLinkedIn engineering has a culture of open sourcing our things that we believe are generally useful and would be of interest to the industry,\u201d Stein said.<\/p>\n<p>New users are an opportunity for the people who built the tool to learn more about what makes it useful by seeing how it can be used to solve increasingly diverse problems. It\u2019s also a forcing function for making documentation good enough that a new user can pick up the project and understand how to use it to solve a problem and how it compares to alternatives, he pointed out.<\/p>\n<p>\u201cThere are many things that belong in a well-rounded product,\u201d Stein said.
\u201cOpen sourcing and putting a solution out into the public view is a great opportunity to do those things to make the product great. Bringing Feathr to the open source community and now to the Linux Foundation is part of the process of continuing to evolve this into a better tool that works for a broader variety of machine learning and use cases. It is the path to make it better: Selfishly, for LinkedIn, but also for the community.\u201d<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Image: Sundry Photography\/Adobe Stock As organizations start to make more extensive use of machine learning, they need to manage not just their data and the machine learning models that use it but the features that organize the raw data into concepts the models can work with. SEE: Artificial Intelligence Ethics Policy (TechRepublic Premium) Earlier this [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":67280,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[77,40,783,84,152,27],"tags":[],"class_list":["post-67279","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-artificial-intelligence","category-cloud","category-cloudsync","category-machine-learning","category-microsoft","category-software"],"_links":{"self":[{"href":"https:\/\/cloudnewshub.com\/index.php?rest_route=\/wp\/v2\/posts\/67279","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cloudnewshub.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cloudnewshub.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cloudnewshub.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/cloudnewshub.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=67279"}],"version-history":[{"count":0,"href":"https:\/\/cloudnewshub.com\/index.php?rest_route=\/wp\/v2\/posts\/6
7279\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cloudnewshub.com\/index.php?rest_route=\/wp\/v2\/media\/67280"}],"wp:attachment":[{"href":"https:\/\/cloudnewshub.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=67279"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cloudnewshub.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=67279"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cloudnewshub.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=67279"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}