Most of the time, researchers can use the data given to them directly,
but there are many instances where the data is not clean. That means
it cannot be used directly to train the neural network because it
contains data that is not representative of what the algorithm is
meant to classify. Perhaps it contains bad data, like when you want to
create a neural network to figure out cats among colored images, and
the dataset contains black-and-white images. Another problem is
when the data is not appropriate. For example, when you want to
classify images of people as male or female, there might be pictures
without the tag, or pictures whose information is corrupted with
misspelled words like ‘ale’ instead of ‘male.’ Even though these might
seem like crazy scenarios, they happen all the time. Handling these
problems and cleaning up the data is known as data wrangling.
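A minimal sketch of what such wrangling can look like in practice, using the article's ‘ale’/‘male’ example. The record layout, the correction table, and the set of valid labels are illustrative assumptions, not anything prescribed by the article:

```python
# Hedged sketch of basic data wrangling: drop untagged or unrecognized
# examples and repair known misspellings before training.
records = [
    {"image": "img1.jpg", "label": "male"},
    {"image": "img2.jpg", "label": "ale"},   # corrupted tag
    {"image": "img3.jpg", "label": None},    # missing tag
    {"image": "img4.jpg", "label": "female"},
]

KNOWN_FIXES = {"ale": "male"}       # assumed correction table
VALID_LABELS = {"male", "female"}   # assumed label set

def wrangle(rows):
    cleaned = []
    for row in rows:
        label = row["label"]
        if label is None:
            continue                          # drop untagged examples
        label = KNOWN_FIXES.get(label, label)  # repair known typos
        if label not in VALID_LABELS:
            continue                          # drop unrecognized labels
        cleaned.append({"image": row["image"], "label": label})
    return cleaned

print(wrangle(records))
```

Here three of the four records survive: the missing tag is dropped and ‘ale’ is repaired to ‘male.’ Real pipelines add many more such rules, but the shape is the same: filter, repair, keep.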
https://hackernoon.com/data-is-the-new-oil-1227197762b2 1/10
2/13/2020 Data is the New Oil - By Giuliano Giacaglia
Fei-Fei Li, who was the director of the Stanford Artificial Intelligence
Laboratory and the Chief Scientist of AI/ML at Google Cloud,
realized early on that data was a crucial piece of the development of
Machine Learning algorithms, whereas most of her colleagues did not
believe the same.
Professor Fei-Fei Li
Amazon Mechanical Turk was the solution, but a problem still existed.
Not all the workers spoke English as their first language, so there
were issues with specific images and the words associated with
them. Some words were harder for these remote workers to identify.
Not only that, but there were words, like ‘babuin,’ for which workers
did not know exactly which images represented them. So, her
team created a simple algorithm to figure out how many people had
to look at each image for a given word. More complex words like
‘babuin’ required more people to check, and simpler words like ‘cat’
needed only a few people to check these images.
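One way to sketch that idea is a majority-vote rule: keep asking workers until enough of them agree. This is a hedged illustration of the general technique, not Li's actual algorithm; the threshold, minimum vote count, and sample votes are all assumptions:

```python
# Hedged sketch of consensus-based label verification: hard words
# naturally accumulate more votes before a strong majority emerges.
from collections import Counter

def needs_more_votes(votes, threshold=0.8, min_votes=3):
    """Return True if another worker should check this image."""
    if len(votes) < min_votes:
        return True
    winner, count = Counter(votes).most_common(1)[0]
    return count / len(votes) < threshold  # no strong consensus yet

# An easy word: three workers agree, so no further checks are needed.
print(needs_more_votes(["cat", "cat", "cat"]))
# A hard word: agreement is only 3/5, so keep collecting votes.
print(needs_more_votes(["baboon", "monkey", "baboon", "ape", "baboon"]))
```

The design choice is that the stopping rule is per-word: easy words converge after a handful of votes, while ambiguous ones automatically route to more workers, which is exactly the behavior the article describes.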
With Mechanical Turk, creating ImageNet took less than three years,
much less than the initial estimate with only undergraduates. The
resulting dataset had around 3 million images separated into about
5,000 “words.” People were not impressed with her paper or dataset,
however, because they did not believe that more and more refined
data led to better algorithms. But most of these researchers’ opinions
were about to change.
Li had to show that her dataset led to better algorithms to prove her
point. To achieve that, she had the idea of creating a challenge based
on the dataset to show that the algorithms using it would perform
better overall. That is, she had to make others train their algorithms
with her dataset to show that they could indeed perform better than
models that did not use her dataset.
The same year she published the paper in CVPR, she reached out to a
researcher named Alex Berg. She suggested that they work together
to publish papers to show that algorithms using the dataset could
figure out if images contained particular objects or animals and where
they were located. In 2010 and 2011, they worked together and
published five papers using ImageNet. The first paper became the
benchmark of how algorithms would perform on these images. To
make it the benchmark for other algorithms, Li reached out to the
team behind one of the most well-known image recognition datasets
and benchmark standards, PASCAL VOC. They agreed to work
together and added ImageNet as a benchmark for their competition.
Their competition used a dataset, called PASCAL, that had only 20
classes of images; ImageNet, in comparison, had around 5,000 classes.
ImageNet is considered solved, with models reaching an error rate
lower than the average human's: superhuman performance at figuring
out whether an image contains an object and what kind of object it is.
After nearly a decade, the competition ended with models being
trained and tested on the dataset. Li tried to remove the dataset from
the internet, but big companies like Facebook pushed back since they
used it as their benchmark.
Giuliano Giacaglia
Feb 01
#Machine Learning