Feeding the machine: We give an AI some headlines and see what it does

In part two of our series, we get to grips with the methods of machine learning.

The moment you dive into a new area of technology, you realize you may have taken on more than you bargained for. You survey the many available options, research them, read the docs, and get started, only to find that simply defining the problem may be harder work than actually finding the solution.

Reader, this is where I found myself two weeks into this machine-learning adventure. I familiarized myself with the data, the tools, and the known approaches to problems with this kind of data, and I tried several ways of solving what appeared to be a simple machine-learning problem: Based on past performance, can we predict whether a given Ars headline will win an A/B test?

Things have not been going particularly well. In fact, as I finished this piece, my most recent attempt showed that the algorithm was about as accurate as a coin flip.

But at least that was a start. And in the process of getting there, I learned a great deal about the data cleansing and preprocessing that go into any machine-learning project.

Setting the battlefield

Our data source is the results of 5,500 headline A/B tests run over the past five years — that's roughly how long Ars has been putting each published story through this kind of shoot-out. Since we have labels for all of this data (that is, we know whether each headline won or lost its A/B test), this would appear to be a supervised learning problem. All I needed to do to prepare the data was make sure it was formatted correctly for the model I chose to build my algorithm with.

I am not a data scientist, so building my own model was not going to happen this decade. Fortunately, AWS provides a number of prebuilt models suitable for processing text, designed to work within the Amazon cloud. There are also third-party models, such as those from Hugging Face, that can be used within the SageMaker universe. Each model, it seems, requires data to be fed to it in its own particular way.

The choice of model in this case comes down largely to the approach we take to the problem. Initially, I saw two possible ways of training an algorithm to produce a probability of success for any given headline:

- Binary classification: We simply determine the probability of a headline falling into the "win" or "lose" column based on previous winners and losers. We can then compare the probabilities of two candidate headlines and pick the stronger one.
- Multiple-category classification: We attempt to slot headlines into several categories based on the click rates they earn — for example, rating them from 1 to 5 stars. We could then compare the scores of the headline candidates.

The second approach is harder, and there is one overarching concern that made it less attractive: 5,500 tests, with 11,000 headlines, does not add up to a lot of data to work with in the grand scheme of AI/ML applications.

So I chose binary classification for my first attempt, because it seemed the most likely to work. It also meant the only data point I needed for each headline (besides the headline itself) was whether it won or lost its A/B test. I took my source data and reformatted it into a two-column comma-separated values file: the headlines in one column and "yes" or "no" in the other. I also used a script to strip all the HTML markup from the headlines (mostly a handful of inline tags). With the data cut down almost all the way to the essentials, I uploaded it into SageMaker Studio so I could use Python tools for the rest of the preparation.
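The stripping-and-reformatting script itself isn't shown in this piece; a minimal sketch of the idea, using made-up sample rows and file names, might look like this:

```python
import csv
import re

# Hypothetical input: (headline, won_ab_test) pairs. The real source data
# and file names from the project are not shown in the article.
rows = [
    ("Ransomware crooks <em>release</em> decryption keys", True),
    ("Starlink satellites may interfere with each other", False),
]

def strip_tags(text):
    """Remove inline HTML markup from a headline."""
    return re.sub(r"<[^>]+>", "", text)

# Write the two-column CSV: the headline in one column, "yes" or "no"
# (did it win its A/B test?) in the other.
with open("headlines.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["headline", "label"])
    for headline, won in rows:
        writer.writerow([strip_tags(headline), "yes" if won else "no"])
```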

Next, I needed to pick a model type and prepare the data accordingly. Again, much of data preparation depends on the model type; different kinds of NLP models (and problems) call for different levels of data preparation.

Then comes tokenization. "Processing the data first needs to replace words with tokens, individual tokens," explained Julien Simon, technical evangelist at AWS. A token is a machine-readable number that stands in for a string of characters. "So 'ransomware' would be word one, 'crooks' would be word two, 'release' would be word three... a sentence then becomes a sequence of tokens, and you can feed that to a deep-learning model and let it learn which ones are the good ones and which ones are the bad ones."
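To make the idea concrete, here is a toy sketch of tokenization — each distinct word gets a numeric ID, turning a sentence into a sequence of tokens. (Real NLP models build their vocabularies internally, often with more sophisticated subword schemes; this is just an illustration.)

```python
def tokenize(sentence, vocab):
    """Replace each word with a numeric token, growing the vocabulary as we go."""
    tokens = []
    for word in sentence.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1  # assign the next unused token ID
        tokens.append(vocab[word])
    return tokens

vocab = {}
print(tokenize("ransomware crooks release keys", vocab))
# every word is new here, so IDs are assigned in order: [1, 2, 3, 4]
print(tokenize("crooks release ransomware", vocab))
# previously seen words reuse their IDs: [2, 3, 1]
```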

Depending on the specific problem, you may also want to trim down some of the data. For example, if we wanted to do something like sentiment analysis (i.e., determining whether a given Ars headline is positive or negative in tone) or group headlines by topic, I would probably want to trim the data down to the most meaningful content by removing "stop words" — common words that are important to grammatical structure but don't tell you what the text is actually saying (like most articles).
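As a quick sketch of what stop-word removal does (the small word set here is illustrative; the nltk stopwords corpus supplies a much fuller English list):

```python
# A small illustrative stop-word set; nltk's stopwords corpus
# (nltk.corpus.stopwords.words("english")) provides a far longer one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "on", "for", "has"}

def remove_stop_words(headline):
    """Keep only the content words that carry the headline's meaning."""
    return [w for w in headline.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The shadowy world of private spyware"))
# → ['shadowy', 'world', 'private', 'spyware']
```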

[Image: Headlines with stop words removed by the Python Natural Language Toolkit (<code>nltk</code>). Note that punctuation sometimes gets bundled in with words as a token; that would need cleaning up for some uses.]

[Image: Data prepared for the BlazingText model, with the headlines forced to lowercase.]

Another part of preprocessing training data for supervised ML is splitting the data into two sets: one for training the algorithm and one for validating its results. The training set is usually the larger of the two; validation data is generally somewhere around 10 to 20 percent of the total.

There has been a great deal of research into what the right amount of validation data actually is — some of it suggests that the sweet spot depends on the number of parameters in the model being used and on the total volume of data. In this case, given that there was relatively little data for the model to process, I figured my validation data would be 10 percent.

In some cases, you might want to hold back another small pool of data for testing the algorithm after it has been validated. But our plan here is to use live Ars headlines for testing, so I skipped that step.

To do the final data preparation, I used a Jupyter notebook — an interactive web interface to a Python instance — to turn my two-column CSV into a data structure and process it. Python has some excellent data-manipulation and data-science toolkits that make these tasks fairly straightforward, and I leaned on a few in particular:

- sklearn (scikit-learn), a data-science module that takes much of the heavy lifting out of machine-learning data preprocessing.
- nltk, the Natural Language Toolkit — and specifically the Punkt tokenizer, for processing the text of our headlines.
- The csv module, for reading and writing CSV files.

[Image: A chunk of the notebook code I used to create the CSV data for the training and validation sets.]

I used pandas to import the data structure from the CSV file created from the cleaned, formatted data, calling the resulting object "dataset". Running dataset.head() gave me a look at the headers for each column pulled in from the CSV, along with a peek at some of the data.

The pandas module let me bulk-prepend the string "__label__" to every value in the label column, as BlazingText requires, and use a lambda function to process the headlines and force all the words to lowercase. Finally, I used the sklearn module to split the data into the two files I would feed to BlazingText.
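The notebook code itself appeared as a screenshot; a minimal sketch of the steps it describes, with illustrative column names, sample rows, and file names standing in for the real ones, might look like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative stand-in for the cleaned two-column CSV; in the notebook,
# this came from pd.read_csv() on the prepared file.
dataset = pd.DataFrame({
    "headline": ["Ransomware Crooks Release Keys", "Starlink Signals Overlap"] * 10,
    "label": ["yes", "no"] * 10,
})
print(dataset.head())  # column headers plus a peek at the first rows

# BlazingText wants labels of the form "__label__yes" / "__label__no".
dataset["label"] = "__label__" + dataset["label"]

# Force the headline text to lowercase with a lambda.
dataset["headline"] = dataset["headline"].apply(lambda h: h.lower())

# Hold out 10 percent of the rows for validation, then write both files.
train, validation = train_test_split(dataset, test_size=0.1, random_state=17)
train.to_csv("headlines.train.csv", index=False, header=False)
validation.to_csv("headlines.validation.csv", index=False, header=False)
```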
