#100daysofpractice dataset: shortcodes arrived!

Hello! How have you been?

Here at Bits4Waves we have been working on a new dataset! This is very exciting, as this is the very first dataset we’ll produce for the community! It will be all about Music, especially Resonance… But that will have to wait a bit more before it’s revealed to the world…

We’ve been busy collecting some links to videos of… practice! Specifically, links to Instagram posts with the hashtag #100daysofpractice! More specifically, what are called shortcodes: short strings that each uniquely identify a post.

For instance, the very first post with the hashtag #100daysofpractice has the shortcode BTrwiUuh8vV. This means that you can access it with the link https://www.instagram.com/p/BTrwiUuh8vV. You can see that it was posted by the creator of the hashtag, @violincase, an account that belongs to the violin virtuosa Hilary Hahn.
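To make the shortcode-to-link relation concrete, here is a minimal sketch. The function name shortcode_url is made up for this example; the URL pattern is simply the one from the link above.

```shell
# Hypothetical helper (not part of the dataset tooling): build the post
# link for a shortcode, following the pattern of the link shown above.
shortcode_url() {
    printf 'https://www.instagram.com/p/%s\n' "$1"
}

# Shortcode of the very first #100daysofpractice post.
shortcode_url BTrwiUuh8vV
# prints https://www.instagram.com/p/BTrwiUuh8vV
```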

So, after collecting lots of shortcodes, today’s task is to grab them all and uniquify them. Several duplicates are expected to have gotten into the pool! This happened because the process used to obtain them was interrupted by the server several times. At each new try, it had to restart from the beginning, but the previous results were kept in the file because the results could change at each new attempt.

The idea is to use the command uniq:

uniq --version
uniq (GNU coreutils) 8.30
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Richard M. Stallman and David MacKenzie.

First, let’s create a new branch uniq for this.

alias git="git -C ~/sci/100daysofpractice-dataset/"
git checkout -b uniq master
git branch
  master
* uniq

Now let’s use wc

wc --version
wc (GNU coreutils) 8.30
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Paul Rubin and David MacKenzie.

to count the lines in the file:

wc -l ~/sci/100daysofpractice-dataset/shortcodes.txt
103198 /home/rafa/sci/100daysofpractice-dataset/shortcodes.txt

OK, so we got 103,198 shortcodes in total! That’s a lot, but I wonder how many of these are duplicates…

We’ll use the command uniq to deal with duplicate lines, but according to its manual page, it:

Filter[s]  adjacent  matching lines from INPUT (or standard
input), writing to OUTPUT (or standard output).

“adjacent” being the important detail here: we cannot guarantee that the duplicates will be adjacent to one another! We can’t just go using uniq directly like that!

But this is simple to solve: we just have to sort the file first. For this we can use the command sort. According to its manual, we have:
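A tiny experiment shows why “adjacent” matters. This is only a sketch on made-up input (the three lines below are not real shortcodes):

```shell
# Made-up input: "AAA" appears twice, but not on adjacent lines.
sample=$(printf 'AAA\nBBB\nAAA\n')

# uniq alone only collapses adjacent repeats, so all 3 lines survive.
unsorted_count=$(printf '%s\n' "$sample" | uniq | wc -l)

# Sorting first brings the two "AAA" lines together, so 2 lines remain.
sorted_count=$(printf '%s\n' "$sample" | sort | uniq | wc -l)

echo "uniq alone: $unsorted_count"
echo "sort, then uniq: $sorted_count"
```

Without sorting, the second AAA slips through, because a BBB sits between the two copies.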

‘sort’ sorts, merges, or compares all the lines from the given files, or
standard input if none are given or for a FILE of ‘-’.  By default,
‘sort’ writes the results to standard output.

Let’s take a peek at the first 10 lines of the file, using the command head

head --version
head (GNU coreutils) 8.30
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by David MacKenzie and Jim Meyering.

for that:

head -n 10 ~/sci/100daysofpractice-dataset/shortcodes.txt

CNOKrJ0Amoq
CNOKj65nDUa
CNOKgRBhhZa
CNOKA_0pws8
CNOJtmAjBnc
CNOIdxfAvLd
CNOIsphA9-P
CNOJIqPA13s
CNOIr6sAs96
CNOIUoyHGgC

Okidoki… Now let’s sort the file, and put the results into a separate file:

FOLDER=~/sci/100daysofpractice-dataset
sort $FOLDER/shortcodes.txt > $FOLDER/shortcodes-sort.txt

Just as a sanity check, let’s see how many lines each file has:

FOLDER=~/sci/100daysofpractice-dataset
wc -l $FOLDER/shortcodes.txt
wc -l $FOLDER/shortcodes-sort.txt
103198 /home/rafa/sci/100daysofpractice-dataset/shortcodes.txt
103198 /home/rafa/sci/100daysofpractice-dataset/shortcodes-sort.txt

OK… Now let’s pass the sorted file through uniq:

FOLDER=~/sci/100daysofpractice-dataset
uniq $FOLDER/shortcodes-sort.txt > $FOLDER/shortcodes-uniq.txt
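As a side note, sort has a -u option that sorts and drops repeated lines in a single step. The sketch below compares both routes on a small made-up file in a temporary directory (the file names are invented for the example). For plain lines like these the two results should be identical, though this was not run on the real shortcodes file.

```shell
# Work in a throwaway directory so no real dataset files are touched.
tmpdir=$(mktemp -d)
printf 'CCC\nAAA\nCCC\nBBB\nAAA\n' > "$tmpdir/sample.txt"

# Two-step route, as in this post: sort, then filter adjacent repeats.
sort "$tmpdir/sample.txt" | uniq > "$tmpdir/two-step.txt"

# One-step route: -u makes sort keep a single copy of each line.
sort -u "$tmpdir/sample.txt" > "$tmpdir/one-step.txt"

# cmp -s prints nothing and succeeds only if both files are identical.
if cmp -s "$tmpdir/two-step.txt" "$tmpdir/one-step.txt"; then
    same=yes
else
    same=no
fi
echo "identical: $same"

rm -r "$tmpdir"
```

Keeping the two separate steps, as done here, has the advantage of leaving the intermediate sorted file around for inspection.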

Now let’s count the lines:

FOLDER=~/sci/100daysofpractice-dataset
wc -l $FOLDER/shortcodes-uniq.txt
57287 /home/rafa/sci/100daysofpractice-dataset/shortcodes-uniq.txt

Wow… That’s a lot fewer lines! The original file has 103,198 lines. If we subtract from this number the resulting 57,287 unique lines, we see that 45,911 lines in the original file were redundant copies. It did take a lot of attempts to get there, hence the repetition!
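The subtraction can also be left to the shell. This is just a sketch with the two wc -l counts typed in by hand; the variable names are made up for the example.

```shell
total=103198    # lines in shortcodes.txt, from wc -l above
unique=57287    # lines in shortcodes-uniq.txt, from wc -l above

# Lines that were extra copies of a shortcode already in the file.
duplicates=$((total - unique))
echo "$duplicates"    # prints 45911

# Share of the original file taken up by those copies (integer division).
percent=$((duplicates * 100 / total))
echo "$percent%"      # prints 44%
```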

Now back to the file… Let’s try some random entries to see if they work as they should… We will use the command shuf

shuf --version
shuf (GNU coreutils) 8.30
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Paul Eggert.

to get some random lines from the file:

FOLDER=~/sci/100daysofpractice-dataset
shuf -n 10 $FOLDER/shortcodes-uniq.txt

CK1-7QnHMqe
CNHzyblAHtf
CKon9NCgXTH
CLIpKthg31e
CLD2nLhnOw9
CL2diTkg_oZ
CLw-jZKqglz
CNFxk5pg_Uq
CLu66DWLB5q
CLwjsFApTn5
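To actually try these entries in a browser, each shortcode has to become a link. Here is a minimal sketch using sed on three of the shortcodes sampled above. It assumes the same URL pattern as the example link at the beginning of this post; the variable names are made up.

```shell
# Three of the shortcodes sampled above, one per line.
sampled=$(printf 'CK1-7QnHMqe\nCNHzyblAHtf\nCKon9NCgXTH\n')

# Prefix every line with the post URL pattern to get clickable links.
links=$(printf '%s\n' "$sampled" | sed 's|^|https://www.instagram.com/p/|')
printf '%s\n' "$links"
```

The same sed expression could be placed right after the shuf command in a pipe, so that the random picks come out as links directly.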

Well, that’s it for today! See you soon!

Published by bits4waves

Software in harmony with your melody
