Introduction

This post covers my attempt at chapter 2 of NLP 100 Knocks 2020 (自然言語処理100本ノック2020).

For details, see the chap1 post below.

hirune-is-supremacy.hatenablog.com

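Chapter 2: UNIX Commands

Code

Data used

popular-names.txt is a tab-separated file that stores the "name", "sex", "count", and "year" of babies born in the United States.

Write programs that perform the following tasks, and run them with popular-names.txt as the input file.

Then perform the same tasks with UNIX commands and check the results of your programs.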

!wget https://nlp100.github.io/data/popular-names.txt

# For Python
txt_file = "popular-names.txt"

10. Counting lines

Count the number of lines. Use the wc command to check.

python

with open(txt_file) as f:
    txt = f.readlines()
    print(len(txt))

unix

!wc popular-names.txt

 2780 11120 55026 popular-names.txt
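Since readlines() loads the whole file into memory, here is a minimal sketch of a lazier way to count lines. The helper name count_lines and the in-memory sample are my own assumptions for illustration, not part of the original exercise.

```python
import io


def count_lines(lines):
    """Count lines one at a time without building a list (hypothetical helper)."""
    return sum(1 for _ in lines)


# Small in-memory sample standing in for popular-names.txt
sample = io.StringIO("Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n")
print(count_lines(sample))  # 3
```

With the real file this would be called as count_lines(f) inside a with open(txt_file) as f: block, and the result should match the first number printed by wc.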

11. Replace tabs with spaces

Replace each tab character with one space character. Use the sed, tr, or expand command to check.

python

with open(txt_file) as f:
    txt = f.readlines()
    new_txt = []
    for i in txt:
        new_txt.append(i.replace("\t", " ").replace("\n", ""))

    print("\n".join(new_txt))

Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
... (rest omitted)

unix

!sed "s/\t/ /g" popular-names.txt
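The replacement can also be done on the whole text at once instead of line by line. This is just a sketch under my own assumptions: tabs_to_spaces is a made-up helper name and the sample string stands in for the file contents.

```python
def tabs_to_spaces(text):
    """Replace every tab character with exactly one space (hypothetical helper)."""
    return text.replace("\t", " ")


# Two tab-separated rows standing in for the contents of popular-names.txt
sample = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\n"
print(tabs_to_spaces(sample), end="")
```

For the real file, passing f.read() to the function would give the text to compare against the sed output.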

12. Save the first column to col1.txt and the second column to col2.txt

Extract only the first column of each line and save it as col1.txt, and only the second column as col2.txt. Use the cut command to check.

python

files = ["col1.txt", "col2.txt"]

with open(txt_file) as f:
    txt = f.readlines()

for num, file in enumerate(files):
    with open(file, "w") as w_f:
        for i in txt:
            # Split the line on tabs and write the num-th column
            w_f.write(i.rstrip("\n").split("\t")[num] + "\n")

unix

!cut -f 1 popular-names.txt > col1.txt

!cut -f 2 popular-names.txt > col2.txt
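Column extraction from a tab-separated file can also be sketched with the standard csv module. The helper name extract_column and the sample rows are assumptions for illustration only.

```python
import csv


def extract_column(lines, index):
    """Return the index-th tab-separated field of every line (hypothetical helper)."""
    reader = csv.reader(lines, delimiter="\t")
    return [row[index] for row in reader]


# Two rows standing in for the lines of popular-names.txt
sample = ["Mary\tF\t7065\t1880\n", "Anna\tF\t2604\t1880\n"]
print(extract_column(sample, 0))  # ['Mary', 'Anna']
print(extract_column(sample, 1))  # ['F', 'F']
```

csv.reader accepts any iterable of strings, so an open file object can be passed in place of the list.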

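13. Merge col1.txt and col2.txt

Join col1.txt and col2.txt created in 12, and create a text file in which the first and second columns of the original file are arranged tab-separated. Use the paste command to check.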

python

with open("col1-col2.txt", "w") as w_f, open("col1.txt") as col1_f, open("col2.txt") as col2_f:
    col1 = col1_f.readlines()
    col2 = col2_f.readlines()
    for i, j in zip(col1, col2):
        # Join the two columns with a tab; j still carries its trailing newline
        w_f.write(i.replace("\n", "") + "\t" + j)

unix

!paste -d "\t" col1.txt col2.txt > col1-col2.txt
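zip stops at the shorter input, whereas paste keeps going and leaves the missing cells empty. A rough sketch of that behavior, with paste_columns as a name I made up for this example:

```python
from itertools import zip_longest


def paste_columns(*columns, delimiter="\t"):
    """Join corresponding lines of several columns, padding short ones with ''."""
    rows = zip_longest(*columns, fillvalue="")
    return [delimiter.join(cell.rstrip("\n") for cell in row) for row in rows]


# Lines as they would come out of col1.txt and col2.txt
print(paste_columns(["Mary\n", "Anna\n"], ["F\n", "F\n"]))  # ['Mary\tF', 'Anna\tF']
```

For col1.txt and col2.txt the two inputs have the same length, so this gives the same rows as the zip version.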

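14. Output the first N lines

Receive a natural number N via a command-line argument or similar, and display only the first N lines of the input. Use the head command to check.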

python

N = input()

with open(txt_file) as f:
    txt = f.readlines()
    for i in range(int(N)):
        print(txt[i].replace("\n", ""))

 5


Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880

unix

!head -n 5 popular-names.txt

Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
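The first N lines can be taken without reading the whole file by using itertools.islice. In this sketch the function name head and the sample rows are my own assumptions.

```python
from itertools import islice


def head(lines, n):
    """Return the first n lines of any iterable of lines (hypothetical helper)."""
    return list(islice(lines, n))


# Rows standing in for the lines of popular-names.txt
sample = ["Mary\tF\t7065\t1880\n", "Anna\tF\t2604\t1880\n", "Emma\tF\t2003\t1880\n"]
for line in head(sample, 2):
    print(line, end="")
```

Unlike indexing with txt[i], this does not raise an error when N is larger than the number of lines. N could also be read with int(sys.argv[1]) instead of input().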

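15. Output the last N lines

Receive a natural number N via a command-line argument or similar, and display only the last N lines of the input. Use the tail command to check.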

python

N = input()

with open(txt_file) as f:
    txt = f.readlines()
    for i in range(int(N)):
        print(txt[len(txt) - int(N) + i].replace("\n", ""))

 5


Benjamin M 13381 2018
Elijah M 12886 2018
Lucas M 12585 2018
Mason M 12435 2018
Logan M 12352 2018

unix

!tail -n 5 popular-names.txt

Benjamin M 13381 2018
Elijah M 12886 2018
Lucas M 12585 2018
Mason M 12435 2018
Logan M 12352 2018
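For the last N lines, collections.deque with maxlen keeps only the most recent N items while reading. This is a sketch under my own assumptions; tail is a made-up helper name and the sample rows stand in for the file.

```python
from collections import deque


def tail(lines, n):
    """Return the last n lines, holding at most n lines in memory at a time."""
    return list(deque(lines, maxlen=n))


# Rows standing in for the lines of popular-names.txt
sample = ["Lucas\tM\t12585\t2018\n", "Mason\tM\t12435\t2018\n", "Logan\tM\t12352\t2018\n"]
for line in tail(sample, 2):
    print(line, end="")
```

An open file object can be passed directly, so the whole file never has to be stored as a list.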

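16. Split the file into N parts

Receive a natural number N via a command-line argument or similar, and split the input file into N parts by line. Achieve the same with the split command.

python

#N = input()
N = 1000  # here N is the number of lines per output file, matching split -l 1000 below

with open(txt_file) as f:
    txt = f.readlines()

# Write each run of N lines to its own file (the split_file_XX.txt names are my own choice)
for file_idx, start in enumerate(range(0, len(txt), N)):
    with open(f"split_file_{file_idx:02d}.txt", "w") as w_f:
        w_f.writelines(txt[start:start + N])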

unix

!split -l 1000 popular-names.txt split_file
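split -l 1000 fixes the number of lines per file, while the problem statement asks for N parts. A minimal sketch of splitting a list of lines into N nearly equal parts, where split_into_parts is a hypothetical helper name and the sample lines are dummy data:

```python
def split_into_parts(lines, n):
    """Split a list of lines into n consecutive parts whose sizes differ by at most 1."""
    base, extra = divmod(len(lines), n)
    parts = []
    start = 0
    for i in range(n):
        # The first `extra` parts get one additional line each
        size = base + (1 if i < extra else 0)
        parts.append(lines[start:start + size])
        start += size
    return parts


# Seven dummy lines split three ways
sample = [f"line{i}\n" for i in range(7)]
print([len(p) for p in split_into_parts(sample, 3)])  # [3, 2, 2]
```

Each part could then be written out with writelines. As far as I know, GNU split also has -n l/N for splitting into N files without breaking lines, though it balances by bytes rather than by line count.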

17. Distinct strings in the first column

Find the kinds of strings in the first column (the set of distinct strings). Use the cut, sort, and uniq commands to check.

python

with open(txt_file) as f:
    txt = f.readlines()
    name_list = []
    for i in txt:
        # The first tab-separated field is the name
        name_list.append(i.split("\t")[0])
    print(sorted(set(name_list)))

... (output omitted)

unix

Note: this didn't work at first, so I referred to an outside reference. Apparently, "to also remove duplicate lines that are not adjacent, you need to sort them beforehand with the sort command. In exchange, the original order is not preserved."

!cut -f 1 popular-names.txt | sort | uniq

... (output omitted)
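Sorting before uniq loses the original order. In Python, one way to drop duplicates while keeping the order of first appearance is dict.fromkeys. This is a sketch with a helper name (unique_in_order) and sample rows that I made up for illustration.

```python
def unique_in_order(lines):
    """Return the distinct first-column values in order of first appearance."""
    return list(dict.fromkeys(line.split("\t")[0] for line in lines))


# Illustrative rows in the same format as popular-names.txt
sample = ["Mary\tF\t7065\t1880\n", "Anna\tF\t2604\t1880\n", "Mary\tF\t100\t1881\n"]
print(unique_in_order(sample))  # ['Mary', 'Anna']
```

Wrapping the result in sorted() gives a list that can be compared with the sort | uniq output.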

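18. Sort lines in descending order of the numeric value in the third column

Sort the lines in reverse order of the numeric value in the third column (note: rearrange the lines without changing their contents). Use the sort command to check (for this problem, the result does not have to match the result of the command).

python

# I realized it's fine to use pandas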
import pandas as pd

txt = pd.read_table("popular-names.txt", header=None, names=["name", "sex", "num", "era"])

txt.sort_values("num", ascending=False).head()

	name	sex	num	era
1340	Linda	F	99689	1947
1360	Linda	F	96211	1948
1350	James	M	94757	1947
1550	Michael	M	92704	1957
1351	Robert	M	91640	1947

unix

!sort -k 3nr,3 popular-names.txt | head -n 5

Linda F 99689 1947
Linda F 96211 1948
James M 94757 1947
Michael M 92704 1957
Robert M 91640 1947
sort: write failed: 'standard output': Broken pipe
sort: write error
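The problem asks to rearrange the lines without changing their contents, so here is a sketch without pandas that keeps each line string as it is and only uses the third column as the sort key. The helper name sort_by_count_desc is my own, and the sample rows are taken from the outputs shown in this post.

```python
def sort_by_count_desc(lines):
    """Sort lines by the integer in the third tab-separated column, largest first."""
    return sorted(lines, key=lambda line: int(line.split("\t")[2]), reverse=True)


# Rows standing in for the lines of popular-names.txt
sample = ["Mary\tF\t7065\t1880\n", "Anna\tF\t2604\t1880\n", "Linda\tF\t99689\t1947\n"]
for line in sort_by_count_desc(sample):
    print(line, end="")
```

Converting the key with int() matters here: comparing the column as strings would put "7065" ahead of "99689" only by luck of the first character.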

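19. Find the frequency of the string in the first column of each line, and sort in descending order of frequency

Find the frequency of occurrence of the first-column string of each line and display them in descending order of frequency. Use the cut, uniq, and sort commands to check.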

python

import pandas as pd

df = pd.read_table("popular-names.txt", header=None, names=["name", "sex", "num", "era"])

# Count how often each name appears in the first column, most frequent first
df["name"].value_counts(ascending=False).head()

... (output omitted)

unix

!cut -f 1 popular-names.txt | sort | uniq -c | sort -rn | head -n 5
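The same frequency count can be sketched with only the standard library using collections.Counter. The helper name name_frequencies and the sample rows are assumptions for illustration.

```python
from collections import Counter


def name_frequencies(lines):
    """Count first-column values and return (name, count) pairs, most frequent first."""
    return Counter(line.split("\t")[0] for line in lines).most_common()


# Illustrative rows in the same format as popular-names.txt
sample = ["Mary\tF\t7065\t1880\n", "Anna\tF\t2604\t1880\n", "Mary\tF\t100\t1881\n"]
print(name_frequencies(sample))  # [('Mary', 2), ('Anna', 1)]
```

Slicing the result with [:5] corresponds to the head -n 5 at the end of the shell pipeline.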

