importer Bioconductor: no tarball, only Git

  • Open
Details
3 participants
  • Maxime Devos
  • Ricardo Wurmus
  • zimoun
Owner
unassigned
Submitted by
zimoun
Severity
normal
zimoun wrote on 8 Apr 2022 13:48
(name . Bug Guix)(address . bug-guix@gnu.org)
868rsf23th.fsf@gmail.com
Hi,

Consider the package CHETAH, included in Bioconductor release 3.14;


but then,

$ guix import cran -a bioconductor CHETAH
guix import: warning: failed to retrieve package information from https://cran.r-project.org/web/packages/CHETAH/DESCRIPTION: 404 (Not Found)
guix import: error: failed to download description for package 'CHETAH'

The reason is that there is no source tarball for this package, only the
Git source repository.


Cheers,
simon
Ricardo Wurmus wrote on 11 Apr 2022 18:15
(name . zimoun)(address . zimon.toutoune@gmail.com)
874k2zzj9p.fsf@elephly.net
zimoun <zimon.toutoune@gmail.com> writes:

> $ guix import cran -a bioconductor CHETAH
> guix import: warning: failed to retrieve package information from https://cran.r-project.org/web/packages/CHETAH/DESCRIPTION: 404 (Not Found)
> guix import: error: failed to download description for package 'CHETAH'
>
> The reason is because there is no source package. Only the Git source
> repo.

We should finally switch to fetching the sources from Git. I wonder why
we haven’t done this earlier.

I guess we should do this gradually to avoid mass updates, so perhaps we
should introduce bioconductor-git-reference and switch over packages one
by one.

What do you think?

--
Ricardo
zimoun wrote on 12 Apr 2022 18:25
(name . Ricardo Wurmus)(address . rekado@elephly.net)(address . 54787@debbugs.gnu.org)
87ilre5kvk.fsf@gmail.com
Hi Ricardo,

On Mon, 11 Apr 2022 at 18:15, Ricardo Wurmus <rekado@elephly.net> wrote:
> zimoun <zimon.toutoune@gmail.com> writes:
>
>> $ guix import cran -a bioconductor CHETAH
>> guix import: warning: failed to retrieve package information from
>> https://cran.r-project.org/web/packages/CHETAH/DESCRIPTION: 404 (Not Found)
>> guix import: error: failed to download description for package 'CHETAH'
>>
>> The reason is because there is no source package. Only the Git source
>> repo.
>
> We should finally switch to fetching the sources from Git. I wonder why
> we haven’t done this earlier.

Because, maybe, we have just finished the janitor work cleaning the
files cran.scm, bioconductor.scm and bioinformatics.scm. :-)

> I guess we should do this gradually to avoid mass updates, so perhaps we
> should introduce bioconductor-git-reference and switch over packages one
> by one.

First, note that annotation packages do not have a Git repo; at least not
always, e.g.,

<https://bioconductor.org/packages/release/data/annotation/html/GenomeInfoDbData.html>

Second, if we go for something like:

(define* (bioconductor-git-reference name #:optional
                                     (release %bioconductor-version))
  "Return a <git-reference> for the R package archive on Bioconductor for the
RELEASE corresponding to NAME."
  (git-reference
   (url (string-append %bioconductor-git-url name))
   ;; Use the RELEASE argument, e.g. "3.14" -> "RELEASE_3_14".
   (commit (string-append "RELEASE_" (string-replace-substring
                                      release "." "_")))))

then it raises the question of where it should live: import/cran.scm or
build-system/r.scm? That is, do we add a module dependency on
(guix git-download) to the r-build-system or not?

TeXLive already has a dependency on svn-download, so why not.
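
For illustration only, the version-to-branch mapping used in the proposed helper can be tried outside Guix. This is a minimal shell sketch; the function name release_branch is hypothetical and only mirrors the "RELEASE_" plus dots-to-underscores rule from the Scheme snippet above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Guix): map a Bioconductor release
# number such as "3.14" to the matching branch name "RELEASE_3_14".
release_branch() {
    # Replace every dot with an underscore, then add the prefix.
    printf 'RELEASE_%s\n' "$(printf '%s' "$1" | tr '.' '_')"
}

release_branch 3.14
# prints RELEASE_3_14
```

The result is a branch name, so it moves whenever new commits land on that release branch.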

Well, I am also in favor of breaking the API and moving %bioconductor-version
and %bioconductor-url to (guix build-system r). WDYT? It would
simplify some things (#36805 and #39885), I guess.


Third, the adjustments of the importer require a large cup of coffee.


Back to CHETAH, note that

guix import cran -a git https://git.bioconductor.org/CHETAH

works but it points to master instead of RELEASE_3_14. Well, I am not
very familiar with the Bioconductor workflow for their release.


Last, using this in gnu/packages/bioconductor.scm,

(define-public r-chetah
  (package
    (name "r-chetah")
    (version "1.11.2")
    (source
     (origin
       (method git-fetch)
       (uri (bioconductor-git-reference "CHETAH"))
       (file-name (git-file-name name version))
       (sha256
        (base32 "021v5831zqdy4pirfsb35kbnz8kmz4lxqc4cwi55qgd6r081xlgh"))))
    (properties `((upstream-name . "CHETAH")))
    (build-system r-build-system)
    (propagated-inputs
     (list r-biodist
           r-corrplot
           r-cowplot
           r-dendextend
           r-ggplot2
           r-gplots
           r-pheatmap
           r-plotly
           r-reshape2
           r-s4vectors
           r-shiny
           r-singlecellexperiment
           r-summarizedexperiment))
    (native-inputs (list r-knitr))
    (home-page "https://git.bioconductor.org/packages/CHETAH")
    (synopsis "Fast and accurate scRNA-seq cell type identification")
    (description
     "CHETAH (CHaracterization of cEll Types Aided by Hierarchical classification) is
an accurate, selective and fast scRNA-seq classifier. Classification is guided
by a reference dataset, preferentially also a scRNA-seq dataset. By
hierarchical clustering of the reference data, CHETAH creates a classification
tree that enables a step-wise, top-to-bottom classification. Using a novel
stopping rule, CHETAH classifies the input cells to the cell types of the
references and to \"intermediate types\": more general classifications that ended
in an intermediate node of the tree.")
    (license #f)))

it just builds with,

./pre-inst-env guix build r-chetah



WDYT?


Cheers,
simon
Ricardo Wurmus wrote on 14 Apr 2022 13:43
(name . zimoun)(address . zimon.toutoune@gmail.com)(address . 54787@debbugs.gnu.org)
87wnfrvqcw.fsf@elephly.net
zimoun <zimon.toutoune@gmail.com> writes:

> First, note that annotations do not have Git repo; at least not always,
> e.g.,
>
> <https://bioconductor.org/packages/release/data/annotation/html/GenomeInfoDbData.html>

That’s fine. We just ignore annotation and experiment packages, and use
git only for regular packages.

> Second, if we go for something like:
>
> (define* (bioconductor-git-reference name #:optional
> (release %bioconductor-version))
> "Return a <git-reference> for the R package archive on Bioconductor for the
> RELEASE corresponding to NAME."
> (git-reference
> (url (string-append %bioconductor-git-url name))
> (commit (string-append "RELEASE_" (string-replace-substring
> %bioconductor-version "." "_")))))
>
>
> then, it raises the question: import/cran.scm or build-system/r.scm ?
> i.e., do we put a module dependency against (guix git-download) for the
> r-build-system or not?
>
> TeXLive already has a dependency to svn-download, so why not.

Yes, I don’t think that’s a problem.

We probably should *not* use RELEASE_3_14 (or whatever) as the commit,
though, because that is a moving target. We need to resolve to the
actual commit and use its hash.

I wonder how the updater would need to be changed. It would need to
know about the release branch and look for new commits in that branch
only.
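
As a rough sketch of what "resolve to the actual commit" could look like with plain Git tooling (this is not how the Guix updater is implemented): `git ls-remote` prints one "<commit><TAB><ref>" pair per line, and the hypothetical helper commit_for_branch below picks the commit that a given branch currently points to.

```shell
#!/bin/sh
# Hypothetical helper (not part of Guix): read `git ls-remote` output
# on stdin and print the commit that refs/heads/<branch> points to.
commit_for_branch() {
    # Match the ref name exactly so that RELEASE_3_1 does not also
    # match RELEASE_3_14.
    awk -v ref="refs/heads/$1" '$2 == ref { print $1 }'
}

# Usage (needs network access, so it is left commented out here):
# git ls-remote https://git.bioconductor.org/packages/CHETAH \
#     | commit_for_branch RELEASE_3_14
```

The printed commit is what would go into the package definition; the branch name would only be used to look it up.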

> Well, I am also in favor to break the API and move %bioconductor-version
> and %bioconductor-url to (guix build-system r). WDYT? It would
> simplify some things (#36805 and #39885), I guess.

We tried this before and we couldn’t do this because of a circular
reference.

> Back to CHETAH, note that
>
> guix import cran -a git https://git.bioconductor.org/CHETAH
>
> works but it points to master instead of RELEASE_3_14. Well, I am not
> very familiar with the Bioconductor workflow for their release.

That’s because the importer doesn’t let us specify a different branch.
We should add that, but it’s strictly separate from the migration we’re
about to embark on.

> Last, using this in gnu/packages/bioconductor.scm,
>
> (define-public r-chetah
> (package
> (name "r-chetah")
> (version "1.11.2")
> (source
> (origin
> (method git-fetch)
> (uri (bioconductor-git-reference "CHETAH"))
> (file-name (git-file-name name version))
> (sha256
> (base32 "021v5831zqdy4pirfsb35kbnz8kmz4lxqc4cwi55qgd6r081xlgh"))))
> (properties `((upstream-name . "CHETAH")))
> (build-system r-build-system)
> (propagated-inputs
> (list r-biodist
> r-corrplot
> r-cowplot
> r-dendextend
> r-ggplot2
> r-gplots
> r-pheatmap
> r-plotly
> r-reshape2
> r-s4vectors
> r-shiny
> r-singlecellexperiment
> r-summarizedexperiment))
> (native-inputs (list r-knitr))
> (home-page "https://git.bioconductor.org/packages/CHETAH")
> (synopsis "Fast and accurate scRNA-seq cell type identification")
> (description
> "CHETAH (CHaracterization of cEll Types Aided by Hierarchical classification) is
> an accurate, selective and fast scRNA-seq classifier. Classification is guided
> by a reference dataset, preferentially also a scRNA-seq dataset. By
> hierarchical clustering of the reference data, CHETAH creates a classification
> tree that enables a step-wise, top-to-bottom classification. Using a novel
> stopping rule, CHETAH classifies the input cells to the cell types of the
> references and to \"intermediate types\": more general classifications that ended
> in an intermediate node of the tree.")
> (license #f)))
>
> it just builds with,
>
> ./pre-inst-env guix build r-chetah
>
>
>
> WDYT?

Neat :)

--
Ricardo
zimoun wrote on 14 Apr 2022 14:59
(name . Ricardo Wurmus)(address . rekado@elephly.net)(address . 54787@debbugs.gnu.org)
86k0brzuph.fsf@gmail.com
Hi Ricardo,

On Thu, 14 Apr 2022 at 13:43, Ricardo Wurmus <rekado@elephly.net> wrote:

> We probably should *not* use RELEASE_3_14 (or whatever) as the commit,
> though, because that is a moving target. We need to resolve to the
> actual commit and use its hash.
>
> I wonder how the updater would need to be changed. It would need to
> know about the release branch and look for new commits in that branch
> only.

To be honest, I have not checked the Bioconductor documentation about
their Git repo structure. What I see is:

$ git clone https://git.bioconductor.org/packages/CHETAH
$ cd CHETAH
$ git branch -av
* master 5d5f5df [origin/master] Pass serialized S4 instances thru updateObject()
remotes/origin/HEAD -> origin/master
remotes/origin/RELEASE_3_10 063de2d bump x.y.z version to even y prior to creation of RELEASE_3_10 branch
remotes/origin/RELEASE_3_11 701ca7f bump x.y.z version to even y prior to creation of RELEASE_3_11 branch
remotes/origin/RELEASE_3_12 cd3dd78 bump x.y.z version to even y prior to creation of RELEASE_3_12 branch
remotes/origin/RELEASE_3_13 1eacdb8 bump x.y.z version to even y prior to creation of RELEASE_3_13 branch
remotes/origin/RELEASE_3_14 03295c9 bump x.y.z version to even y prior to creation of RELEASE_3_14 branch
remotes/origin/RELEASE_3_9 22b53f2 version bump
remotes/origin/master 5d5f5df Pass serialized S4 instances thru updateObject()


Do we follow ’master’? Is it a mirror of what Bioconductor names their
3.14 release?

My guess was that RELEASE_3_14 mirrors their 3.14 release.


>> Well, I am also in favor to break the API and move %bioconductor-version
>> and %bioconductor-url to (guix build-system r). WDYT? It would
>> simplify some things (#36805 and #39885), I guess.
>
> We tried this before and we couldn’t do this because of a circular
> reference.

Well, I have something that works. So I do not know if this circular
reference is still there.



> That’s because the importer doesn’t let us specify a different branch.
> We should add that, but it’s strictly separate from the migration we’re
> about to embark on.

I am not familiar with the updater (guix refresh -u). My plan is:

1. Add bioconductor-git-reference
2. Adapt the bioconductor importer.
3. Updater?

The question is: do we have to include the migration in the updater? Or
do we do the migration by custom scripts?


Note that, because we do not support shallow clones, the fetched
sources will be a bit bigger, since each clone contains the complete
Bioconductor history of the package.


Cheers,
simon
Ricardo Wurmus wrote on 14 Apr 2022 15:57
(name . zimoun)(address . zimon.toutoune@gmail.com)(address . 54787@debbugs.gnu.org)
87k0brvk73.fsf@elephly.net
zimoun <zimon.toutoune@gmail.com> writes:

> On Thu, 14 Apr 2022 at 13:43, Ricardo Wurmus <rekado@elephly.net> wrote:
>
>> We probably should *not* use RELEASE_3_14 (or whatever) as the commit,
>> though, because that is a moving target. We need to resolve to the
>> actual commit and use its hash.
>>
>> I wonder how the updater would need to be changed. It would need to
>> know about the release branch and look for new commits in that branch
>> only.
>
> To be honest, I have not checked the Bioconductor documentation about
> their Git repo structure. What I see is:
>
> $ git clone https://git.bioconductor.org/packages/CHETAH
> $ cd CHETAH
> $ git branch -av
> * master 5d5f5df [origin/master] Pass serialized S4 instances thru updateObject()
> remotes/origin/HEAD -> origin/master
> remotes/origin/RELEASE_3_10 063de2d bump x.y.z version to even y prior to creation of RELEASE_3_10 branch
> remotes/origin/RELEASE_3_11 701ca7f bump x.y.z version to even y prior to creation of RELEASE_3_11 branch
> remotes/origin/RELEASE_3_12 cd3dd78 bump x.y.z version to even y prior to creation of RELEASE_3_12 branch
> remotes/origin/RELEASE_3_13 1eacdb8 bump x.y.z version to even y prior to creation of RELEASE_3_13 branch
> remotes/origin/RELEASE_3_14 03295c9 bump x.y.z version to even y prior to creation of RELEASE_3_14 branch
> remotes/origin/RELEASE_3_9 22b53f2 version bump
> remotes/origin/master 5d5f5df Pass serialized S4 instances thru updateObject()
>
>
> Do we follow ’master’? Is it a mirror of what Bioconductor names their
> 3.14 release?

We should not follow “master”. That’s the development branch. We
should follow the current release branch.
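
To make "the current release branch" concrete, here is a small shell sketch that picks the newest RELEASE_<major>_<minor> name from a list of branch names. The function name latest_release_branch is hypothetical, and it assumes bare branch names on stdin (without the remotes/origin/ prefix). The numbers are compared numerically, because a plain lexical sort would put RELEASE_3_9 after RELEASE_3_14.

```shell
#!/bin/sh
# Hypothetical helper (not part of Guix): read branch names on stdin
# and print the newest RELEASE_<major>_<minor> branch.
latest_release_branch() {
    # Keep only release branches, sort by major then minor number
    # (numerically), and take the last one.
    grep -E '^RELEASE_[0-9]+_[0-9]+$' \
        | sort -t _ -k 2,2n -k 3,3n \
        | tail -n 1
}

printf '%s\n' master RELEASE_3_10 RELEASE_3_14 RELEASE_3_9 \
    | latest_release_branch
# prints RELEASE_3_14
```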

> My guess was that RELEASE_3_14 mirrors their 3.14 release.

Correct.

>>> Well, I am also in favor to break the API and move %bioconductor-version
>>> and %bioconductor-url to (guix build-system r). WDYT? It would
>>> simplify some things (#36805 and #39885), I guess.
>>
>> We tried this before and we couldn’t do this because of a circular
>> reference.
>
> Well, I have something that works. So I do not know if this circular
> reference is still there.

If “make as-derivation” does not fail it is probably okay.

>> That’s because the importer doesn’t let us specify a different branch.
>> We should add that, but it’s strictly separate from the migration we’re
>> about to embark on.
>
> I am not familiar with the updater (guix refresh -u). My plan is:
>
> 1. Add bioconductor-git-reference
> 2. Adapt the bioconductor importer.
> 3. Updater?

The updater is closely connected to the importer. It just needs to be
told how it can find new releases.
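
One possible way to "find new releases" on a release branch, sketched here purely as an assumption and not as the actual updater logic, is to read the Version field of the package's DESCRIPTION file at the head of the branch and compare it with the packaged version. The helper name description_version and the sample input are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of Guix): print the value of the
# "Version:" field of an R package DESCRIPTION file read on stdin.
description_version() {
    # Strip the field name and any following whitespace; keep only
    # the first match.
    sed -n 's/^Version:[[:space:]]*//p' | head -n 1
}

# Made-up DESCRIPTION content, for illustration only.
printf 'Package: Example\nVersion: 1.2.3\nTitle: An example\n' \
    | description_version
# prints 1.2.3
```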

> The question is: do we have to include the migration in the updater? Or
> do we do the migration by custom scripts?

We can do the migration manually. But if we end up with a broken
updater I won’t be able to update Bioconductor packages in bulk; that
would be a serious problem for future maintenance.

> Note that, because we do not support shallow clones, the complete
> sources will be a bit bigger; since they contain all the Bioconductor
> history of all the packages.

Doesn’t Guile-Git support shallow clones? In any case, this should not
be an obstacle for us. Ensuring long-term reproducibility is more
important than space savings.

--
Ricardo
Maxime Devos wrote on 14 Apr 2022 16:04
(address . 54787@debbugs.gnu.org)
2944d00df0662c6e041e9bf075616dd8e5e37169.camel@telenet.be
Ricardo Wurmus wrote on Thu 14-04-2022 at 13:43 [+0200]:
> I wonder how the updater would need to be changed.  It would need to
> know about the release branch and look for new commits in that branch
> only.

Perhaps https://issues.guix.gnu.org/53144 would be useful? It adds a
'latest-git-updater' refresher that looks in a branch (or more
generally, any reference, so in principle a tag that is repeatedly
replaced would work as well) for the latest commit. There are some
unaddressed comments though ...

Greetings,
Maxime.


zimoun wrote on 14 Apr 2022 17:03
(name . Ricardo Wurmus)(address . rekado@elephly.net)(address . 54787@debbugs.gnu.org)
868rs7zoza.fsf@gmail.com
On Thu, 14 Apr 2022 at 15:57, Ricardo Wurmus <rekado@elephly.net> wrote:
> zimoun <zimon.toutoune@gmail.com> writes:
>
>> On Thu, 14 Apr 2022 at 13:43, Ricardo Wurmus <rekado@elephly.net> wrote:
>>
>>> We probably should *not* use RELEASE_3_14 (or whatever) as the commit,
>>> though, because that is a moving target. We need to resolve to the
>>> actual commit and use its hash.

[...]

>> Do we follow ’master’? Is it a mirror of what Bioconductor names their
>> 3.14 release?
>
> We should not follow “master”. That’s the development branch. We
> should follow the current release branch.

To be sure I understand you correctly, your point is to have something like:

(define* (bioconductor-git-reference name #:key commit)
  (git-reference
   (url (string-append %bioconductor-git-url name))
   (commit commit)))

with an explicit commit for each package definition, right?


> Doesn’t Guile-Git support shallow clones? In any case, this should not
> be an obstacle for us. Ensuring long-term reproducibility is more
> important than space savings.

No, since libgit2 does not support it, IIUC.



Cheers,
simon