[PATCH] gnu: Add mecab.

Status: Done
Participants: Julien Lepiller, Ludovic Courtès, Bruno Victal
Owner: unassigned
Submitted by: Julien Lepiller
Severity: normal
Julien Lepiller wrote on 4 Jul 2022 21:09
(address . guix-patches@gnu.org)
20220704210911.699b4697@sybil.lepiller.eu
Hi Guix!

This small series adds mecab and two dictionaries. MeCab is a
morphological analysis engine. I'm not sure what that previous sentence
means (:p) but I use it as a segmenter for Japanese in one of my
projects. In fact, the two patches that follow add two dictionary
sources. You need one of them in the same profile as mecab for it to be
useful (with no dictionaries, it segfaults).
Julien Lepiller wrote on 4 Jul 2022 21:42
[PATCH 2/3] gnu: Add mecab-ipadic.
(address . 56386@debbugs.gnu.org)
20220704194202.30958-2-julien@lepiller.eu
* gnu/packages/language.scm (mecab-ipadic): New variable.
---
gnu/packages/language.scm | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)

diff --git a/gnu/packages/language.scm b/gnu/packages/language.scm
index 3ffe115b51..63654c544b 100644
--- a/gnu/packages/language.scm
+++ b/gnu/packages/language.scm
@@ -970,3 +970,30 @@ (define-public mecab
collaboration between the Kyoto university and Nippon Telegraph and Telephone
Corporation. The engine is independent of any language, dictionary or corpus.")
(license (list license:gpl2+ license:lgpl2.1+ license:bsd-3))))
+
+(define-public mecab-ipadic
+ (package
+ (name "mecab-ipadic")
+ (version "2.7.0")
+ (source (package-source mecab))
+ (build-system gnu-build-system)
+ (arguments
+ `(#:configure-flags
+ (list (string-append "--with-dicdir=" (assoc-ref %outputs "out")
+ "/lib/mecab/dic")
+ "--with-charset=utf8")
+ #:phases
+ (modify-phases %standard-phases
+ (add-after 'unpack 'chdir
+ (lambda _
+ (chdir "mecab-ipadic")))
+ (add-before 'configure 'set-mecab-dir
+ (lambda* (#:key outputs #:allow-other-keys)
+ (setenv "MECAB_DICDIR" (string-append (assoc-ref outputs "out")
+ "/lib/mecab/dic")))))))
+ (native-inputs (list mecab)); for mecab-config
+ (home-page "https://taku910.github.io/mecab")
+ (synopsis "Dictionary data for MeCab")
+ (description "This package contains dictionnary data derived from
+ipadic for use with MeCab.")
+ (license (license:non-copyleft "mecab-ipadic/COPYING"))))
--
2.36.1
Julien Lepiller wrote on 4 Jul 2022 21:42
[PATCH 1/3] gnu: Add mecab.
(address . 56386@debbugs.gnu.org)
20220704194202.30958-1-julien@lepiller.eu
* gnu/packages/language.scm (mecab): New variable.
* gnu/packages/patches/mecab-variable-param.patch: New file.
* gnu/local.mk (dist_patch_DATA): Add it.
---
gnu/local.mk | 1 +
gnu/packages/language.scm | 51 ++++++++++++++++++-
.../patches/mecab-variable-param.patch | 30 +++++++++++
3 files changed, 81 insertions(+), 1 deletion(-)
create mode 100644 gnu/packages/patches/mecab-variable-param.patch

diff --git a/gnu/local.mk b/gnu/local.mk
index faad6cc6b2..87fe75082c 100644
--- a/gnu/local.mk
+++ b/gnu/local.mk
@@ -1490,6 +1490,7 @@ dist_patch_DATA = \
%D%/packages/patches/libmemcached-build-with-gcc7.patch \
%D%/packages/patches/libmhash-hmac-fix-uaf.patch \
%D%/packages/patches/libsigrokdecode-python3.9-fix.patch \
+ %D%/packages/patches/mecab-variable-param.patch \
%D%/packages/patches/mercurial-hg-extension-path.patch \
%D%/packages/patches/mesa-opencl-all-targets.patch \
%D%/packages/patches/mesa-skip-tests.patch \
diff --git a/gnu/packages/language.scm b/gnu/packages/language.scm
index 61c9e682ed..3ffe115b51 100644
--- a/gnu/packages/language.scm
+++ b/gnu/packages/language.scm
@@ -4,7 +4,7 @@
;;; Copyright © 2018 Nikita <nikita@n0.is>
;;; Copyright © 2019 Alex Vong <alexvong1995@gmail.com>
;;; Copyright © 2020 Ricardo Wurmus <rekado@elephly.net>
-;;; Copyright © 2020 Julien Lepiller <julien@lepiller.eu>
+;;; Copyright © 2020, 2022 Julien Lepiller <julien@lepiller.eu>
;;;
;;; This file is part of GNU Guix.
;;;
@@ -921,3 +921,52 @@ (define-public praat
analysis (pitch, formant, intensity, ...), speech synthesis, labelling, segmenting
and manipulation.")
(license license:gpl2+)))
+
+(define-public mecab
+ (package
+ (name "mecab")
+ (version "0.996")
+ (source (origin
+ (method git-fetch)
+ (uri (git-reference
+ (url "https://github.com/taku910/mecab")
+ ;; latest commit
+ (commit "046fa78b2ed56fbd4fac312040f6d62fc1bc31e3")))
+ (file-name (git-file-name name version))
+ (sha256
+ (base32
+ "1hdv7rgn8j0ym9gsbigydwrbxa8cx2fb0qngg1ya15vvbw0lk4aa"))
+ (patches
+ (search-patches
+ "mecab-variable-param.patch"))))
+ (build-system gnu-build-system)
+ (native-search-paths
+ (list (search-path-specification
+ (variable "MECAB_DICDIR")
+ (separator #f)
+ (files '("lib/mecab/dic")))))
+ (arguments
+ `(#:phases
+ (modify-phases %standard-phases
+ (add-after 'unpack 'chdir
+ (lambda _
+ (chdir "mecab")))
+ (add-before 'build 'add-mecab-dicdir-variable
+ (lambda _
+ (substitute* "mecabrc.in"
+ (("dicdir = .*")
+ "dicdir = $MECAB_DICDIR"))
+ (substitute* "mecab-config.in"
+ (("echo @libdir@/mecab/dic")
+ "if [ -z \"$MECAB_DICDIR\" ]; then
+ echo @libdir@/mecab/dic
+else
+ echo \"$MECAB_DICDIR\"
+fi")))))))
+ (inputs (list libiconv))
+ (home-page "https://taku910.github.io/mecab")
+ (synopsis "Morphological analysis engine for texts")
+ (description "Mecab is a morphological analysis engine developped as a
+collaboration between the Kyoto university and Nippon Telegraph and Telephone
+Corporation. The engine is independent of any language, dictionary or corpus.")
+ (license (list license:gpl2+ license:lgpl2.1+ license:bsd-3))))
diff --git a/gnu/packages/patches/mecab-variable-param.patch b/gnu/packages/patches/mecab-variable-param.patch
new file mode 100644
index 0000000000..4457cf3f44
--- /dev/null
+++ b/gnu/packages/patches/mecab-variable-param.patch
@@ -0,0 +1,30 @@
+From 2396e90056706ef897acab3aaa081289c7336483 Mon Sep 17 00:00:00 2001
+From: LEPILLER Julien <julien.lepiller@irisa.fr>
+Date: Fri, 19 Apr 2019 11:48:39 +0200
+Subject: [PATCH] Allow variable parameters
+
+---
+ mecab/src/param.cpp | 6 +++++-
+ 1 file changed, 5 insertions(+), 1 deletion(-)
+
+diff --git a/mecab/src/param.cpp b/mecab/src/param.cpp
+index 65328a2..006b1b5 100644
+--- a/mecab/src/param.cpp
++++ b/mecab/src/param.cpp
+@@ -79,8 +79,12 @@ bool Param::load(const char *filename) {
+ size_t s1, s2;
+ for (s1 = pos+1; s1 < line.size() && isspace(line[s1]); s1++);
+ for (s2 = pos-1; static_cast<long>(s2) >= 0 && isspace(line[s2]); s2--);
+- const std::string value = line.substr(s1, line.size() - s1);
++ std::string value = line.substr(s1, line.size() - s1);
+ const std::string key = line.substr(0, s2 + 1);
++
++ if(value.find('$') == 0) {
++ value = std::getenv(value.substr(1).c_str());
++ }
+ set<std::string>(key.c_str(), value, false);
+ }
+
+--
+2.20.1
+
--
2.36.1
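
Note on the patch above: mecab-variable-param.patch assigns the result of std::getenv directly to a std::string. std::getenv returns a null pointer when the variable is unset, and assigning a null pointer to a std::string is undefined behaviour. The thread does not say whether this is related to the segfault mentioned in the cover letter. Below is a minimal null-safe sketch of the same "$NAME" expansion. The names expand_env_value and fallback are hypothetical, chosen for illustration; they are not part of MeCab or of the pushed patch.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper, not part of MeCab: expands a value of the form
// "$NAME" to the content of the environment variable NAME.
// Unlike a direct assignment from std::getenv, it never builds a
// std::string from a null pointer.
std::string expand_env_value(const std::string &value,
                             const std::string &fallback = "") {
  // Values that do not start with '$' are returned unchanged.
  if (value.empty() || value[0] != '$') {
    return value;
  }
  const std::string name = value.substr(1);
  const char *env = std::getenv(name.c_str());
  // std::getenv returns a null pointer when the variable is unset.
  return env != nullptr ? std::string(env) : fallback;
}
```

For instance, expand_env_value("$MECAB_DICDIR", "/usr/lib/mecab/dic") returns the content of MECAB_DICDIR when it is set and the fallback otherwise. The fallback path in this call is only a placeholder.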
Julien Lepiller wrote on 4 Jul 2022 21:42
[PATCH 3/3] gnu: Add mecab-unidic.
(address . 56386@debbugs.gnu.org)
20220704194202.30958-3-julien@lepiller.eu
* gnu/packages/language.scm (mecab-unidic): New variable.
---
gnu/packages/language.scm | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/gnu/packages/language.scm b/gnu/packages/language.scm
index 63654c544b..f97b982cb9 100644
--- a/gnu/packages/language.scm
+++ b/gnu/packages/language.scm
@@ -27,6 +27,7 @@ (define-module (gnu packages language)
#:use-module (gnu packages autotools)
#:use-module (gnu packages audio)
#:use-module (gnu packages base)
+ #:use-module (gnu packages compression)
#:use-module (gnu packages docbook)
#:use-module (gnu packages emacs)
#:use-module (gnu packages freedesktop)
@@ -57,6 +58,7 @@ (define-module (gnu packages language)
#:use-module (gnu packages xorg)
#:use-module (guix packages)
#:use-module (guix build-system cmake)
+ #:use-module (guix build-system copy)
#:use-module (guix build-system glib-or-gtk)
#:use-module (guix build-system gnu)
#:use-module (guix build-system perl)
@@ -997,3 +999,27 @@ (define-public mecab-ipadic
(description "This package contains dictionnary data derived from
ipadic for use with MeCab.")
(license (license:non-copyleft "mecab-ipadic/COPYING"))))
+
+(define-public mecab-unidic
+ (package
+ (name "mecab-unidic")
+ (version "3.1.0")
+ (source (origin
+ (method url-fetch)
+ (uri (string-append "https://clrd.ninjal.ac.jp/unidic_archive/cwj/"
+ version "/unidic-cwj-" version ".zip"))
+ (sha256
+ (base32
+ "1z132p2q3bgchiw529j2d7dari21kn0fhkgrj3vcl0ncg2m521il"))))
+ (build-system copy-build-system)
+ (arguments
+ `(#:install-plan
+ '(("." "lib/mecab/dic"
+ #:include-regexp ("\\.bin$" "\\.def$" "\\.dic$" "dicrc")))))
+ (native-inputs (list unzip))
+ (home-page "https://clrd.ninjal.ac.jp/unidic/en/")
+ (synopsis "Dictionary data for MeCab")
+ (description "UniDic for morphological analysis is a dictionary for
+analysis with the morphological analyser MeCab, where the short units exported
+from the database are used as entries (heading terms).")
+ (license (list license:gpl2+ license:lgpl2.1 license:bsd-3))))
--
2.36.1
Ludovic Courtès wrote on 17 Jul 2022 21:33
Re: bug#56386: [PATCH] gnu: Add mecab.
(name . Julien Lepiller)(address . julien@lepiller.eu)(address . 56386@debbugs.gnu.org)
87a6974jr2.fsf_-_@gnu.org
Hi,

Julien Lepiller <julien@lepiller.eu> writes:

> + (synopsis "Dictionary data for MeCab")
> + (description "UniDic for morphological analysis is a dictionary for
> +analysis with the morphological analyser MeCab, where the short units exported
> +from the database are used as entries (heading terms).")
> + (license (list license:gpl2+ license:lgpl2.1 license:bsd-3))))

Maybe add a comment stating whether this is triple-licensed (at the
user’s choice) or if that means that there are files under each of
these.

Otherwise the whole series LGTM!

Ludo’.
Bruno Victal wrote on 31 Mar 2023 00:43
Re: [bug#56386] [PATCH] gnu: Add mecab.
(name . Julien Lepiller)(address . julien@lepiller.eu)(address . 56386@debbugs.gnu.org)
69c9ca84-f59c-72ad-4dc5-3af11678c5ec@makinata.eu
On 2022-07-04 20:09, Julien Lepiller wrote:
> Hi Guix!
>
> This small series adds mecab and two dictionaries. MeCab is a
> morphological analysis engine. I'm not sure what that previous sentence
> means (:p) but I use it as a segmenter for Japanese in one of my
> projects. In fact, the two patches that follow add two dictionary
> sources. You need one of them in the same profile as mecab for it to be
> useful (with no dictionaries, it segfaults).
>
>
>

Any updates regarding this?


Cheers,
Bruno
Julien Lepiller wrote on 1 Apr 2023 16:43
(name . Bruno Victal)(address . mirai@makinata.eu)(address . 56386-done@debbugs.gnu.org)
20230401164320.119a738e@sybil.lepiller.eu
On Thu, 30 Mar 2023 23:43:22 +0100,
Bruno Victal <mirai@makinata.eu> wrote:

> On 2022-07-04 20:09, Julien Lepiller wrote:
> > Hi Guix!
> >
> > This small series adds mecab and two dictionaries. MeCab is a
> > morphological analysis engine. I'm not sure what that previous
> > sentence means (:p) but I use it as a segmenter for Japanese in one
> > of my projects. In fact, the two patches that follow add two
> > dictionary sources. You need one of them in the same profile as
> > mecab for it to be useful (with no dictionaries, it segfaults).
> >
> >
> >
>
> Any updates regarding this?
>
>
> Cheers,
> Bruno

I had forgotten about this. It's a triple license (at the user's
choice), so I added a comment. Pushed to master as
3ab24ba216ce91210b93ec61554b3343fbc3aaab to
4483296da3e2e1424d12d92d0f56fb428765ca43.